How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python

NVIDIA Canary-1B-v2 handles automatic speech recognition and speech translation, and a short Python script can turn its timestamped output into subtitles. The workflow converts audio to 16 kHz mono, runs inference (on a GPU where one is available), and exports SRT files built from the model's word-level and segment-level timestamps.

Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies

The setup cell checks for a sentinel file so that it does not repeat the installation on later runs. If the file is missing, it installs libsndfile and ffmpeg with apt-get, then installs the NeMo ASR toolkit, librosa, soundfile, and pydub with pip. It then force-reinstalls NumPy (>=2.2,<2.4) and SciPy (>=1.15) so the two versions are compatible, writes the sentinel file, and kills the process so the runtime restarts with the new packages. Running the cell again after the restart skips the setup phase.

import os, subprocess, sys
SENTINEL = "/content/.canary_setup_done"
if not os.path.exists(SENTINEL):
   def sh(c):
       print("$", c); subprocess.run(c, shell=True, check=False)
   print(">>> PHASE 1: installing dependencies (one-time)...\n")
   sh("apt-get -qq update")
   sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
   sh('pip install -q "nemo_toolkit[asr]"')
   sh("pip install -q librosa soundfile pydub")
   sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
   # Close the file explicitly so the marker is on disk before the process is killed.
   with open(SENTINEL, "w") as f:
       f.write("done")
   print("\n Setup complete. Restarting the runtime now.")
   print("   When it reconnects, RUN THIS CELL AGAIN to start the tutorial.")
   os.kill(os.getpid(), 9)
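
After the runtime restarts, it can help to confirm that the installed packages are importable before loading the model. The sketch below is not part of the original script: `missing_modules` is a hypothetical helper that uses only the standard library, and the module names in the usage lines are assumed from the pip commands in the setup cell.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed import names for the packages installed by the setup cell.
required = ["nemo", "librosa", "soundfile", "pydub", "numpy", "scipy"]
print("Missing:", missing_modules(required) or "none")
```

If the list is not empty, re-running the setup phase (after deleting the sentinel file) is one way to recover.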

Loading NVIDIA Canary-1B-v2 and Checking GPU Availability

The script imports PyTorch and NumPy, prints their versions, and checks for CUDA availability. If a GPU is present, it displays the device name and VRAM capacity. If not, it warns that CPU mode will be very slow and tells the user to change the runtime type. The code then defines a dictionary of the 25 supported languages, from Bulgarian to Ukrainian, before loading the model from Hugging Face, moving it to the selected device, and switching it to evaluation mode.

import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa
print(">>> PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
   print("GPU:", torch.cuda.get_device_name(0),
         f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
   print("  No GPU — will run on CPU (very slow). "
         "Set Runtime > Change runtime type > GPU.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LANGS = {
   "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
   "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
   "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
   "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
   "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))
from nemo.collections.asr.models import ASRModel
print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")
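
Because language codes are passed as plain strings, a small check before inference can catch typos early. The sketch below is an illustration, not part of the original script: `check_pair` is a hypothetical helper, and the rule that translation needs English on one side is an assumption about the model's supported directions that should be verified against the model card.

```python
def check_pair(source_lang, target_lang, langs):
    """Classify a language pair as 'asr' or 'translation', or raise ValueError."""
    for code in (source_lang, target_lang):
        if code not in langs:
            raise ValueError(f"Unsupported language code: {code!r}")
    if source_lang == target_lang:
        return "asr"
    # Assumption: translation requires English as the source or the target.
    if "en" not in (source_lang, target_lang):
        raise ValueError("Translation is assumed to need English on one side")
    return "translation"

# Small stand-in for the full language dictionary, for illustration only.
demo_langs = {"en": "English", "fr": "French", "de": "German"}
print(check_pair("en", "en", demo_langs))
print(check_pair("en", "fr", demo_langs))
```

In the tutorial itself, the full language dictionary would be passed in place of `demo_langs`.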

Preparing 16 kHz Audio and Running English ASR with Translation

The code includes a function to download remote audio files and convert them to 16 kHz mono WAV format. It then loads a sample file and defines a helper for transcription. The example runs basic English-to-English recognition, then translates the same audio into French, German, Spanish, and Italian.

TARGET_SR = 16000
def prepare_audio(path_or_url, out_path=None):
   if str(path_or_url).startswith(("http://", "https://")):
       local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
       urllib.request.urlretrieve(path_or_url, local)
       path_or_url = local
   audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
   if out_path is None:
       base = os.path.splitext(os.path.basename(path_or_url))[0]
       out_path = f"/content/{base}_16k_mono.wav"
   sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
   dur = len(audio) / TARGET_SR
   print(f"Prepared: {out_path}  ({dur:.1f}s, 16kHz, mono)")
   return out_path, dur
SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)
def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
   if isinstance(files, str):
       files = [files]
   return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                               timestamps=timestamps, batch_size=batch_size)
print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)
print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
   out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
   print(f"  EN -> {LANGS[tgt]:<10} ({tgt}): {out[0].text}")
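
To confirm that a prepared file really is 16 kHz, mono, 16-bit PCM before it reaches the model, the WAV header can be read with Python's standard `wave` module. This is a minimal sketch rather than part of the original script; `wav_info` and `is_16k_mono_pcm16` are hypothetical helpers, and the demo file is one second of generated silence.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Read sample rate, channel count, bit depth, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as w:
        sr = w.getframerate()
        return {
            "sample_rate": sr,
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,
            "duration_s": w.getnframes() / sr,
        }

def is_16k_mono_pcm16(path):
    """True if the file is 16 kHz, single-channel, 16-bit PCM."""
    info = wav_info(path)
    return (info["sample_rate"], info["channels"], info["bits"]) == (16000, 1, 16)

# Demo: write one second of silence in the target format, then inspect it.
demo = os.path.join(tempfile.mkdtemp(), "silence_16k_mono.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
print(wav_info(demo), is_16k_mono_pcm16(demo))
```

The same check could be applied to the path returned by the audio preparation function.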

Generating Word and Segment Timestamps and Exporting SRT Subtitles

The script requests timestamps during transcription to retrieve both segment-level and word-level timing data. It prints the first ten words with their start and end times to verify alignment. A helper function converts these timestamps into the standard SRT format (hours, minutes, seconds, milliseconds) and writes the translated French segments to a file.

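print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
# .timestamp is a dict keyed by granularity; pick out the word and segment lists.
word_ts = ts_out[0].timestamp["word"]
seg_ts = ts_out[0].timestamp["segment"]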
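
The SRT conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the original helper: `srt_time` and `segments_to_srt` are hypothetical names, each segment is assumed to be a dict with `start` and `end` in seconds and the text under `segment`, and the demo segments are made up rather than model output.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'segment' (text)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{i}\n{timing}\n{seg['segment'].strip()}\n")
    # Each cue ends with a newline; joining with another gives the blank separator line.
    return "\n".join(blocks)

# Made-up segments for illustration only.
demo_segments = [
    {"start": 0.0, "end": 2.5, "segment": "Bonjour tout le monde."},
    {"start": 2.5, "end": 5.0, "segment": "Ceci est un exemple."},
]
srt_text = segments_to_srt(demo_segments)
print(srt_text)
```

Writing `srt_text` to a file opened with `encoding="utf-8"` would produce the subtitle file; SRT uses a comma, not a period, before the milliseconds.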

AI Maestro is an independent British AI publication. We test what we recommend, and we write it the way we would say it.
