For makers and artists building voice interfaces, the current landscape of speech recognition is a nightmare of fragmented tools. Supporting multiple languages often means stitching together dozens of different models or APIs, creating a maintenance-heavy infrastructure. Real-time captioning usually forces a trade-off between speed and accuracy, where systems fake streaming by re-processing audio chunks, burning compute and adding delay. Furthermore, raw output often arrives as unpunctuated text, requiring a second model just to add commas and capitalisation. Many systems also demand you specify the language upfront, failing when a user switches between languages mid-sentence.

NVIDIA’s Nemotron 3.5 ASR aims to collapse these four distinct problems into a single, unified model.

What it does

A single 600M-parameter checkpoint handles 40 language locales. It transcribes English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai. There is no need for per-language deployment or model swapping.

Real-time streaming is executed correctly. Built on a Cache-Aware FastConformer encoder, the architecture avoids the inefficiency of traditional buffered systems that re-process overlapping audio chunks. Instead, it caches the encoder’s internal state and reuses it. Every audio frame is processed exactly once with no overlap, resulting in dramatically lower compute and end-to-end latency without sacrificing accuracy.
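To make the compute saving concrete, here is a toy frame-counting sketch. It is not NeMo code: the function names and the example numbers (100 frames, chunks of 10, 30 frames of re-encoded left context) are illustrative assumptions. It only counts how many frame computations each scheme performs.

```python
def buffered_frame_computations(num_frames: int, chunk: int, left_context: int) -> int:
    """Buffered streaming: every step re-encodes the new chunk plus
    `left_context` frames of audio that were already encoded before."""
    total = 0
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        window_start = max(0, start - left_context)
        total += end - window_start
    return total


def cache_aware_frame_computations(num_frames: int, chunk: int) -> int:
    """Cache-aware streaming: past context comes from cached activations,
    so only the new frames of each chunk are computed."""
    total = 0
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        total += end - start
    return total
```

With these illustrative numbers the buffered scheme performs 340 frame computations for 100 frames of audio, while the cache-aware scheme performs exactly 100.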

Punctuation and capitalisation are native features. The output is production-ready text with proper casing and punctuation straight from the model, eliminating the need for a separate restoration step.

Language conditioning is optional. You can either tell the model the input language (`target_lang=en-US`) for maximum accuracy, or let the model detect it automatically (`target_lang=auto`) when the language is unknown.

How it works (the 2-minute version)

The model comprises two main components:

A Cache-Aware FastConformer encoder (24 layers). This is an efficient evolution of the Conformer architecture with linearly scalable attention. The “cache-aware” mechanism enables streaming magic: the encoder retains a cache of self-attention and convolution activations from previous frames. As new audio arrives, it computes only what is genuinely new. Nothing is recomputed.
An RNNT (Recurrent Neural Network Transducer) decoder. This is the workhorse for streaming ASR, emitting text as audio streams in, frame by frame, which is essential for live transcription.
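The frame-by-frame emission of a transducer can be illustrated with a toy greedy decoding loop. This is a simplified sketch, not NeMo's decoder: `toy_joint` is a hypothetical stand-in for the joint network, and it looks only at the last emitted token, whereas a real prediction network conditions on the full token history.

```python
from typing import Callable, List, Optional, Sequence

BLANK = "<blank>"


def greedy_transducer_decode(
    frames: Sequence[str],
    joint: Callable[[str, Optional[str]], str],
    max_symbols_per_frame: int = 5,
) -> List[str]:
    """For each encoder frame, keep emitting tokens until the joint
    predicts blank, then advance to the next frame."""
    tokens: List[str] = []
    for frame in frames:
        for _ in range(max_symbols_per_frame):
            previous = tokens[-1] if tokens else None
            symbol = joint(frame, previous)
            if symbol == BLANK:
                break  # nothing more to emit for this frame
            tokens.append(symbol)
    return tokens


def toy_joint(frame: str, previous: Optional[str]) -> str:
    """Hypothetical joint: emits the frame's label once, blank otherwise."""
    if frame == "_" or frame == previous:
        return BLANK
    return frame
```

Because tokens are appended as soon as a frame produces them, partial text is available while audio is still arriving, which is the property live transcription relies on.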

Additionally, the model incorporates prompt-based language-ID conditioning. A language signal is fed alongside the audio, allowing one set of weights to specialise its output for the target language or, in `auto` mode, infer the language itself. It was trained on a massive speech dataset spanning all supported languages, using a blend of public and proprietary data normalised to punctuated, properly-cased text.

A knob worth knowing: att_context_size

Streaming ASR involves a trade-off between how soon text is emitted and how much future audio the model can “peek at” before committing. Nemotron ASR exposes this directly through the attention context size:

Attention Context	Chunk Size (Latency)	Use Case
[56, 0]	80ms (Ultra-Low)	Ultra low latency Voice Agents
[56, 1]	160ms (Low)	Interactive Voice Agents, Conversational AI
[56, 3]	320ms (Balanced)	Conversational AI, Live caption
[56, 6]	560ms (Medium)	High accuracy with reasonable latency
[56, 13]	1.12s (High)	Highest accuracy with high latency

The same checkpoint covers the whole spectrum—you choose the operating point at inference time, with no retraining required.
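The chunk sizes in the table follow a simple pattern: each one equals (lookahead + 1) × 80ms, where the lookahead is the second number of the attention context. The sketch below encodes that relationship. The 80ms frame duration is inferred from the `[56, 0]` row rather than stated as a formula in the source, so treat it as an assumption.

```python
from typing import Sequence

FRAME_MS = 80  # assumption: inferred from the [56, 0] row of the table


def chunk_latency_ms(att_context_size: Sequence[int], frame_ms: int = FRAME_MS) -> int:
    """Chunk size in milliseconds for an attention context [left, right]."""
    _left, right = att_context_size
    return (right + 1) * frame_ms


for context in ([56, 0], [56, 1], [56, 3], [56, 6], [56, 13]):
    print(context, chunk_latency_ms(context), "ms")
```

The values reproduce every row of the table, including 1120ms (1.12s) for `[56, 13]`.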

Try it in minutes

The model ships as a NeMo checkpoint. Clone the NeMo branch and point the streaming inference script at your audio:

git clone https://github.com/NVIDIA-NeMo/NeMo.git

Transcribe with a known language:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=${MODEL_PATH} \
    dataset_manifest=${MANIFEST_PATH} \
    output_path=${OUTPUT_FOLDER} \
    target_lang=es-ES \
    att_context_size="[56,3]" \
    strip_lang_tags=true

Or let the model detect the language:

python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=${MODEL_PATH} \
    dataset_manifest=${MANIFEST_PATH} \
    output_path=${OUTPUT_FOLDER} \
    target_lang=auto \
    att_context_size="[56,3]" \
    strip_lang_tags=true
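If you drive the script from Python, a small helper can assemble the same command. This is a convenience sketch: `build_streaming_command` is a hypothetical helper, not part of NeMo, and it uses only the script path and overrides shown in the commands above.

```python
from typing import List, Sequence

SCRIPT = "examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py"


def build_streaming_command(
    nemo_root: str,
    model_path: str,
    manifest_path: str,
    output_folder: str,
    target_lang: str = "auto",
    att_context_size: Sequence[int] = (56, 3),
    strip_lang_tags: bool = True,
) -> List[str]:
    """Return the argument list for the streaming inference script."""
    left, right = att_context_size
    return [
        "python",
        f"{nemo_root}/{SCRIPT}",
        f"model_path={model_path}",
        f"dataset_manifest={manifest_path}",
        f"output_path={output_folder}",
        f"target_lang={target_lang}",
        # No shell quoting is needed when the list is passed to subprocess.
        f"att_context_size=[{left},{right}]",
        f"strip_lang_tags={'true' if strip_lang_tags else 'false'}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Leaving `target_lang` at its default selects automatic language detection.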

Audio should be mono-channel `.wav`. The manifest is a standard NeMo JSON-lines file:

{"audio_filepath": "/path/to/clip.wav", "duration": 4.27, "text": "reference transcript"}
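A manifest like this can be generated with the Python standard library. The sketch below reads the duration from each `.wav` header and rejects non-mono files. The helper names are illustrative, and the transcripts are whatever reference text you supply.

```python
import json
import wave
from typing import Dict, Iterable


def manifest_entry(wav_path: str, text: str) -> Dict[str, object]:
    """Build one manifest line for a mono .wav file."""
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"{wav_path} is not mono-channel")
        duration = wav.getnframes() / wav.getframerate()
    return {
        "audio_filepath": str(wav_path),
        "duration": round(duration, 2),
        "text": text,
    }


def write_manifest(entries: Iterable[Dict[str, object]], manifest_path: str) -> None:
    """Write one JSON object per line (JSON-lines)."""
    with open(manifest_path, "w", encoding="utf-8") as handle:
        for entry in entries:
            # ensure_ascii=False keeps non-Latin transcripts readable
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

For example, `write_manifest([manifest_entry("/path/to/clip.wav", "reference transcript")], "manifest.json")` produces a file in the format shown above.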

The model automatically predicts a language tag at the end of each completed sentence, for example “This is a test sample.” followed by the tag. Setting `strip_lang_tags=true` removes the tag for better readability.

Deep Dive: Fine-Tuning Nemotron ASR for Your Language

Nemotron 3.5 ASR is strong out of the box, but it was trained on a mix where some languages have far more data than others. The long-tail locales have headroom, and a few hours of in-domain audio plus the right recipe can close a surprising amount of the gap.

To illustrate, we ran a worked example: taking the base model and sharpening it on two mid-resource European languages—Greek and Bulgarian—then measuring honestly on held-out data. The results below are from that run. This section is a high-level overview, with the coding example living in the companion GitHub repo. We will update this blog accordingly when we publish an agentic SKILL.md covering the whole process.

Why fine-tune?

There are a few situations where it pays off:

Sharpening a long-tail locale. Languages with less pretraining data have the most to gain.
Domain expertise or specialised vocabulary. Medical, legal, financial, or technical vocabulary the base model rarely saw.
Accent, dialect, and acoustics. Telephony, far-field, in-car, or a specific speaker population.
New languages. Bootstrapping a locale that isn’t yet covered.

A Preview of the Power of Fine-Tuning

🎥 Video Walkthrough: Watch on YouTube

This walkthrough demonstrates multilingual streaming inference, latency/accuracy tradeoffs, deployment options, and the fine-tuning workflow described below.

The recipe at a glance

The whole workflow is five moves:

Point the trainer at tarred speech data for the target languages: no per-file unpacking, streamed efficiently by NeMo/Lhotse.
Fine-tune from the base checkpoint (`init_from_nemo_model`) using the same Cache-Aware FastConformer-RNNT recipe, conditioned on each clip’s language tag.
Evaluate on a held-out set the model never saw, at the same low-latency streaming setting you’ll deploy (e.g. `att_context_size=[56,0]`, 80ms chunk; 0ms lookahead).
Add more data where the language is weak and retrain.
Export and deploy the fine-tuned checkpoint.

Step 1 — Data

We assembled a balanced, ~2000-hour mix across the two languages (Greek and Bulgarian) from public multilingual corpora (Granary, Common Voice, FLEURS), kept as tarred NeMo/Lhotse shards. The two details that matter most are:

Every clip carries a `target_lang` tag. This drives the model’s prompt-based language conditioning, so getting the tag right (and using a value the model recognises) is essential.
Match the base model’s text style—punctuated, properly-cased transcripts, since that is what the model produces.
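Both details can be checked mechanically before training. The sketch below is a minimal validator under two assumptions: the language tag is stored under a `target_lang` key in each manifest entry, and `el-GR` and `bg-BG` are the locale codes the model recognises for Greek and Bulgarian. Both are placeholders to replace with the values from the model card.

```python
import unicodedata
from typing import Dict, List, Set

# Assumption: placeholder locale codes; use the ones the model actually recognises.
RECOGNISED_LANGS: Set[str] = {"el-GR", "bg-BG"}


def check_manifest_entry(
    entry: Dict[str, object],
    recognised_langs: Set[str] = RECOGNISED_LANGS,
) -> List[str]:
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems: List[str] = []
    lang = entry.get("target_lang")
    if lang is None:
        problems.append("missing target_lang")
    elif lang not in recognised_langs:
        problems.append(f"unrecognised target_lang: {lang}")

    text = str(entry.get("text", ""))
    # Crude text-style checks: cased, punctuated transcripts should pass both.
    if not any(ch.isupper() for ch in text):
        problems.append("no capital letters in text")
    if not any(unicodedata.category(ch).startswith("P") for ch in text):
        problems.append("no punctuation in text")
    return problems
```

The text-style checks are deliberately loose: they catch fully lower-cased or unpunctuated transcripts, not subtler normalisation mismatches.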

Held-out FLEURS test splits (which were not in training) gave us an honest, in-the-wild benchmark per language.

Step 2 — Train

This is a straightforward full fine-tune of the streaming RNNT model, driven by a fixed step budget—the right way to schedule with streaming/iterable data. It runs on a single GPU for a quick pass and scales cleanly to multi-GPU for a fuller run. On a small dataset like this, an epoch is minutes, not hours.
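With iterable data there is no fixed number of batches per epoch, so the step budget has to be derived from the amount of audio. The arithmetic below is a rough sketch: the batch duration and GPU count are hypothetical values chosen for illustration, not settings from the run described here.

```python
import math


def step_budget(
    total_hours: float,
    batch_seconds_per_gpu: float,
    num_gpus: int,
    epochs_equivalent: float = 1.0,
) -> int:
    """Number of optimiser steps needed to see the data `epochs_equivalent` times."""
    total_seconds = total_hours * 3600
    seconds_per_step = batch_seconds_per_gpu * num_gpus
    return math.ceil(epochs_equivalent * total_seconds / seconds_per_step)


# Hypothetical example: 2000 hours, 600 s of audio per GPU per step, 8 GPUs.
print(step_budget(2000, 600, 8))
```

Under these illustrative settings, one pass over 2000 hours corresponds to 1500 steps.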

Step 3 — Evaluate

We measured Word Error Rate (WER) on the held-out FLEURS test set, in streaming mode with an 80ms chunk: the most demanding condition, with no future-audio “peeking.” The improvement over the base model is large, especially for the languages that started out weakest:

Language	Base model	Fine-tuned	Relative Improvement in WER
🇬🇷 Greek	35	24	32%
🇧🇬 Bulgarian	22	15
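For reference, here is how the two metrics in the table are conventionally computed. This is a generic sketch (word-level edit distance divided by reference length), not the evaluation script used for the run. Applied to the rounded WERs in the table, the relative improvement comes out at about 31.4% for Greek and 31.8% for Bulgarian. The small gap to the reported 32% for Greek is presumably because the reported figure was computed from unrounded WERs, which is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_improvement(base_wer: float, tuned_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100 * (base_wer - tuned_wer) / base_wer


print(round(relative_improvement(35, 24), 1))  # Greek, from the rounded table values
print(round(relative_improvement(22, 15), 1))  # Bulgarian, from the rounded table values
```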