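This tutorial uses a Python script to run NVIDIA Canary-1B-v2 for automatic speech recognition, translation, and subtitle generation. The workflow converts audio to 16 kHz mono, runs inference on a GPU when one is available (falling back to CPU otherwise), and exports SRT subtitles from segment-level timestamps, with word-level timestamps available for checking alignment.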
Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies
The setup script checks for a sentinel file (/content/.canary_setup_done) to avoid redundant work. If the file is missing, it runs system commands to install libsndfile and ffmpeg, then installs the NeMo ASR toolkit, librosa, soundfile, and pydub with pip. It force-reinstalls NumPy (>=2.2,<2.4) and SciPy (>=1.15) for version compatibility, writes the sentinel, and kills the process so the runtime restarts cleanly. The cell must then be run a second time to start the tutorial.
import os, subprocess, sys

# Marker file that records a completed one-time setup
SENTINEL = "/content/.canary_setup_done"
if not os.path.exists(SENTINEL):
    def sh(c):
        # Echo the command, then run it; check=False means a failing command does not raise
        print("$", c); subprocess.run(c, shell=True, check=False)
    print(">>> PHASE 1: installing dependencies (one-time)...\n")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
    sh('pip install -q "nemo_toolkit[asr]"')
    sh("pip install -q librosa soundfile pydub")
    sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
    open(SENTINEL, "w").write("done")
    print("\nSetup complete. Restarting the runtime now.")
    print(" When it reconnects, RUN THIS CELL AGAIN to start the tutorial.")
    # Kill the current process so the runtime restarts with the new packages
    os.kill(os.getpid(), 9)
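The sentinel-file pattern described above can be isolated into a small helper. The following is a minimal sketch, not the tutorial's own code: the function name `run_once` and the temporary marker path are hypothetical, and unlike the original `sh` helper it uses `check=True`, so a failed command raises and the marker is never written after a partial install.

```python
import os
import subprocess
import tempfile


def run_once(sentinel, commands):
    """Run shell commands only when the sentinel file is absent.

    Returns True if the commands ran, False if they were skipped.
    (Hypothetical helper, named here for illustration only.)
    """
    if os.path.exists(sentinel):
        return False
    for cmd in commands:
        # check=True raises CalledProcessError on a nonzero exit code,
        # so the sentinel below is only written when every step succeeds.
        subprocess.run(cmd, shell=True, check=True)
    with open(sentinel, "w") as fh:
        fh.write("done")
    return True


# Demo with a harmless command and a throwaway marker path.
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, ".setup_done")
    first = run_once(marker, ["echo installing"])
    second = run_once(marker, ["echo installing"])
    print("first call ran:", first, "| second call ran:", second)
```

The second call is skipped because the marker already exists, which is the same reason the tutorial cell goes straight to the second phase after the runtime restarts.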
Loading NVIDIA Canary-1B-v2 and Checking GPU Availability
The script imports PyTorch and NumPy, prints their versions, and checks for CUDA availability. If a GPU is present, it displays the device name and VRAM capacity; if not, it warns that CPU mode will be very slow and tells the user to switch the runtime type to GPU. The code then defines a dictionary of the 25 supported languages, from Bulgarian to Ukrainian, before loading the model from Hugging Face, moving it to the selected device, and putting it in evaluation mode.
import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa
print(">>> PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
    print("No GPU: will run on CPU (very slow). "
          "Set Runtime > Change runtime type > GPU.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LANGS = {
    "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
    "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
    "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
    "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
    "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))
from nemo.collections.asr.models import ASRModel
print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")
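Because the language table uses two-letter codes as keys, it can double as a validator for the source and target codes before any inference runs. The sketch below is an illustration under stated assumptions: `describe_task` is a hypothetical helper, and `LANGS_SUBSET` is a shortened copy of the tutorial's table so the snippet stays self-contained.

```python
# Illustrative subset of the tutorial's language table (assumption: the
# full 25-entry dictionary would be passed in real use).
LANGS_SUBSET = {
    "en": "English", "fr": "French", "de": "German",
    "es": "Spanish", "it": "Italian",
}


def describe_task(source_lang, target_lang, langs=LANGS_SUBSET):
    """Validate a language pair and label it as ASR or translation.

    Hypothetical helper: raises ValueError for codes missing from `langs`.
    """
    for code in (source_lang, target_lang):
        if code not in langs:
            raise ValueError(f"Unsupported language code: {code!r}")
    if source_lang == target_lang:
        # Same source and target means plain transcription.
        return f"ASR ({langs[source_lang]})"
    return f"Translation ({langs[source_lang]} -> {langs[target_lang]})"


print(describe_task("en", "en"))
print(describe_task("en", "fr"))
```

Failing early on an unknown code is cheaper than discovering the mistake after the model has already been loaded and invoked.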
Preparing 16 kHz Audio and Running English ASR with Translation
The code includes a function to download remote audio files and convert them to 16 kHz mono WAV format. It then loads a sample file and defines a helper for transcription. The example runs basic English-to-English recognition, then translates the same audio into French, German, Spanish, and Italian.
TARGET_SR = 16000
def prepare_audio(path_or_url, out_path=None):
    # Download remote files to a local path first
    if str(path_or_url).startswith(("http://", "https://")):
        local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
        urllib.request.urlretrieve(path_or_url, local)
        path_or_url = local
    # librosa resamples to TARGET_SR and downmixes to mono on load
    audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
    if out_path is None:
        base = os.path.splitext(os.path.basename(path_or_url))[0]
        out_path = f"/content/{base}_16k_mono.wav"
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
    dur = len(audio) / TARGET_SR
    print(f"Prepared: {out_path} ({dur:.1f}s, 16kHz, mono)")
    return out_path, dur
SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)
def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
    # Accept a single path as well as a list of paths
    if isinstance(files, str):
        files = [files]
    return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                                timestamps=timestamps, batch_size=batch_size)
print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)
print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
    out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
    print(f" EN -> {LANGS[tgt]:<10} ({tgt}): {out[0].text}")
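Since the preparation step is expected to produce 16 kHz, mono, 16-bit PCM audio, a quick header check can confirm that a file matches before it is passed to the model. This is a minimal sketch using only the standard-library `wave` module; the helper names `wav_info` and `is_16k_mono_pcm16` are hypothetical, and the demo tone is synthetic data generated purely for illustration.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Read basic format fields from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / rate,
        }


def is_16k_mono_pcm16(path, target_sr=16000):
    """True when the file is mono, 16-bit, and sampled at target_sr."""
    info = wav_info(path)
    return (info["sample_rate"] == target_sr
            and info["channels"] == 1
            and info["sample_width_bytes"] == 2)


# Demo: write a 0.5 s, 440 Hz synthetic tone, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "tone_16k_mono.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    samples = (int(12000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(8000))
    wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

info = wav_info(demo_path)
print(info, "| ready:", is_16k_mono_pcm16(demo_path))
```

The `wave` module only reads uncompressed PCM files, so this check applies to the converted output rather than to arbitrary input formats.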
Generating Word and Segment Timestamps and Exporting SRT Subtitles
The script requests timestamps during transcription to retrieve both segment-level and word-level timing data. It prints the first ten words with their start and end times to verify alignment. A helper function converts these timestamps into the standard SRT format (hours, minutes, seconds, milliseconds) and writes the translated French segments to a file.
print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
word_ts = ts_out[0].timestamp
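The SRT conversion described in this section can be sketched as follows. This is not the tutorial's own helper: the function names are hypothetical, and the segment layout (dictionaries with `start`, `end`, and `segment` keys, times in seconds) is an assumption about the timestamp output. The sample segments are made-up values for illustration, not model output.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from a list of segment dicts.

    Assumed keys (hypothetical): "start" and "end" in seconds,
    "segment" holding the subtitle text.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['segment'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Made-up example segments for illustration only.
demo_segments = [
    {"start": 0.0, "end": 1.25, "segment": "Bonjour"},
    {"start": 1.25, "end": 3661.5, "segment": "le monde"},
]
srt_text = segments_to_srt(demo_segments)
print(srt_text)
```

Working in integer milliseconds before splitting into hours, minutes, and seconds avoids a rounding case where the millisecond field would otherwise reach 1000.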



