How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

For makers and artists building voice interfaces, the current landscape of speech recognition is a nightmare of fragmented tools. Supporting multiple languages…

By AI Maestro June 4, 2026 6 min read
How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

For makers and artists building voice interfaces, the current landscape of speech recognition is a nightmare of fragmented tools. Supporting multiple languages often means stitching together dozens of different models or APIs, creating a maintenance-heavy infrastructure. Real-time captioning usually forces a trade-off between speed and accuracy, where systems fake streaming by re-processing audio chunks, burning compute and adding delay. Furthermore, raw output often arrives as unpunctuated text, requiring a second model just to add commas and capitalisation. Many systems also demand you specify the language upfront, failing when a user switches between languages mid-sentence.

Nvidia’s Nemotron 3.5 ASR aims to collapse these four distinct problems into a single, unified model.

What it does

A single 600M-parameter checkpoint handles 40 language locales. It transcribes English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai. There is no need for per-language deployment or model swapping.

Real-time streaming is executed correctly. Built on a Cache-Aware FastConformer encoder, the architecture avoids the inefficiency of traditional buffered systems that re-process overlapping audio chunks. Instead, it caches the encoder’s internal state and reuses it. Every audio frame is processed exactly once with no overlap, resulting in dramatically lower compute and end-to-end latency without sacrificing accuracy.

Punctuation and capitalisation are native features. The output is production-ready text with proper casing and punctuation straight from the model, eliminating the need for a separate restoration step.

Language conditioning is optional. You can either tell the model the input language (

target_lang=en-US

) for maximum accuracy, or let the model detect it automatically (

target_lang=auto

) when the language is unknown.

How it works (the 2-minute version)

The model comprises two main components:

  • A Cache-Aware FastConformer encoder (24 layers). This is an efficient evolution of the Conformer architecture with linearly scalable attention. The “cache-aware” mechanism enables streaming magic: the encoder retains a cache of self-attention and convolution activations from previous frames. As new audio arrives, it computes only what is genuinely new. Nothing is recomputed.

  • An RNNT (Recurrent Neural Network Transducer) decoder. This is the workhorse for streaming ASR, emitting text as audio streams in, frame by frame, which is essential for live transcription.

Additionally, the model incorporates prompt-based language-ID conditioning. A language signal is fed alongside the audio, allowing one set of weights to specialise its output for the target language—or, in

auto

mode, infer the language itself. It was trained on a massive speech dataset spanning all supported languages, using a blend of public and proprietary data normalised to punctuated, properly-cased text.

A knob worth knowing:

att_context_size

Streaming ASR involves a trade-off between how soon text is emitted and how much future audio the model can “peek at” before committing. Nemotron ASR exposes this directly through the attention context size:

Attention ContextChunk Size (Latency)Use Case
[56, 0]
80ms (Ultra-Low)Ultra low latency Voice Agents
[56, 1]
160ms (Low)Interactive Voice Agents, Conversational AI
[56, 3]
320ms (Balanced)Conversational AI, Live caption
[56, 6]
560ms (Medium)High accuracy with reasonable latency
[56, 13]
1.12s (High)Highest accuracy with high latency

The same checkpoint covers the whole spectrum—you choose the operating point at inference time, with no retraining required.

Try it in minutes

The model ships as a NeMo checkpoint. Clone the NeMo branch and point the streaming inference script at your audio:

git clone https://github.com/NVIDIA-NeMo/NeMo.git

Transcribe with a known language:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=${MODEL_PATH} \
    dataset_manifest=${MANIFEST_PATH} \
    output_path=${OUTPUT_FOLDER} \
    target_lang=es-ES \
    att_context_size="[56,3]" \
    strip_lang_tags=true

Or let the model detect the language:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=${MODEL_PATH} \
    dataset_manifest=${MANIFEST_PATH} \
    output_path=${OUTPUT_FOLDER} \
    target_lang=auto \
    att_context_size="[56,3]" \
    strip_lang_tags=true

Audio should be mono-channel

.wav

. The manifest is a standard NeMo JSON-lines file:

{"audio_filepath": "/path/to/clip.wav", "duration": 4.27, "text": "reference transcript"}

The model automatically predicts a language tag at the end of each completed sentence, e.g. “This is a test sample. “. Setting

strip_lang_tags=True

removes the tag for better readability.

Deep Dive: Fine-Tuning Nemotron ASR for Your Language

Nemotron 3.5 ASR is strong out of the box, but it was trained on a mix where some languages have far more data than others. The long-tail locales have headroom, and a few hours of in-domain audio plus the right recipe can close a surprising amount of the gap.

To illustrate, we ran a worked example: taking the base model and sharpening it on two mid-resource European languages—Greek and Bulgarian—then measuring honestly on held-out data. The results below are from that run. This section is a high-level overview, with the coding example living in the companion GitHub repo. We will update this blog accordingly when we publish an agentic SKILL.md covering the whole process.

Why fine-tune?

There are a few situations where it pays off:

  • Sharpening a long-tail locale. Languages with less pretraining data have the most to gain.
  • Domain expertise or specialised vocabulary. Medical, legal, financial, or technical vocabulary the base model rarely saw.
  • Accent, dialect, and acoustics. Telephony, far-field, in-car, or a specific speaker population.
  • New languages. Bootstrapping a locale that isn’t yet covered.

A Preview of the Power of Fine-Tuning

🎥 Video Walkthrough: Watch on YouTube

This walkthrough demonstrates multilingual streaming inference, latency/accuracy tradeoffs, deployment options, and the fine-tuning workflow described below.

The recipe at a glance

The whole workflow is five moves:

  • Point the trainer at tarred speech data for the target languages—no per-file unpacking, streamed efficiently by NeMo/Lhotse.
  • Fine-tune from the base checkpoint (
    init_from_nemo_model

    ) using the same Cache-Aware FastConformer-RNNT recipe, conditioned on each clip’s language tag.

  • Evaluate on a held-out set the model never saw—at the same low-latency streaming setting you’ll deploy (e.g.
    att_context_size=[56,0]

    , 80ms chunk; 0ms lookahead).

  • Add more data where the language is weak and retrain.
  • Export and deploy the fine-tuned checkpoint.

Step 1 — Data

We assembled a balanced, ~2000-hour mix across the two languages (Greek and Bulgarian) from public multilingual corpora (Granary, Common Voice, FLEURS), kept as tarred NeMo/Lhotse shards. The two details that matter most are:

  • Every clip carries a
    target_lang

    tag. This drives the model’s prompt-based language conditioning, so getting the tag right (and using a value the model recognises) is essential.

  • Match the base model’s text style—punctuated, properly-cased transcripts, since that is what the model produces.

Held-out FLEURS test splits (which were not in training) gave us an honest, in-the-wild benchmark per language.

Step 2 — Train

This is a straightforward full fine-tune of the streaming RNNT model, driven by a fixed step budget—the right way to schedule with streaming/iterable data. It runs on a single GPU for a quick pass and scales cleanly to multi-GPU for a fuller run. On a small dataset like this, an epoch is minutes, not hours.

Step 3 — Evaluate

We measured Word Error Rate on the held-out FLEURS test set, in streaming mode with 80ms chunk—the most demanding condition, with no future-audio “peeking.” The improvement over the base model is large, especially for the languages that started out weakest:

LanguageBase modelFine-tunedRelative Improvement in WER
🇬🇷 Greek352432%
🇧🇬 Bulgarian2215

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top