NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time

For audio engineers and developers building real-time transcription pipelines, NVIDIA has released a model aimed squarely at the “one model, many languages” bottleneck. Nemotron 3.5 ASR is a 600M-parameter streaming model that handles 40 language-locales without swapping weights or running separate instances. It produces punctuation and capitalization natively and ships as open weights under the OpenMDW-1.1 license. Under the hood, it uses a Cache-Aware FastConformer-RNNT architecture.

What makes Nemotron 3.5 ASR different

Previously, teams had to maintain separate models for different languages or swap models in and out at inference time. This release extends the base nvidia/nemotron-speech-streaming-en-0.6b model by adding prompt-based language-ID conditioning. A single 600M-parameter checkpoint now covers 40 locales, eliminating the need for per-language model management.

The system targets two specific workloads: low-latency streaming for live audio feeds and high-throughput batch transcription. The output is production-ready text with correct casing and punctuation, removing the need for a separate restoration step.

The mechanics of Cache-Aware FastConformer-RNNT

The architecture splits into two main components. The first is a 24-layer Cache-Aware FastConformer encoder, an efficient evolution of the Conformer design that uses linearly scalable attention. The second is an RNNT (Recurrent Neural Network Transducer) decoder that emits text frame by frame as audio streams in.

The “cache-aware” design is the critical efficiency lever. Standard buffered streaming re-processes overlapping audio windows at every step, repeating work and adding delay. Instead, this model caches encoder self-attention and convolution activations. It reuses these cached states as new audio arrives, ensuring each audio frame is processed exactly once with no overlap. The result is a drop in compute and end-to-end latency with no accuracy penalty.
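The difference in encoder work can be illustrated with a toy frame count. This is a minimal sketch, not NVIDIA's implementation: it assumes a buffered approach that re-encodes a fixed amount of left context with every chunk, and the function names and numbers are hypothetical and unrelated to any reported benchmark.

```python
def buffered_frames_processed(total_frames: int, chunk: int, left_context: int) -> int:
    """Count frames encoded when every step re-encodes its left context."""
    processed = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        window_start = max(0, start - left_context)  # overlap with earlier audio
        processed += end - window_start
    return processed


def cache_aware_frames_processed(total_frames: int, chunk: int) -> int:
    """Count frames encoded when past context is read from a cache instead."""
    processed = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        processed += end - start  # only the new frames are encoded
    return processed


if __name__ == "__main__":
    total, chunk, left = 100, 10, 20  # illustrative values only
    buffered = buffered_frames_processed(total, chunk, left)
    cached = cache_aware_frames_processed(total, chunk)
    print(f"buffered: {buffered} frames, cache-aware: {cached} frames")
```

In this toy setup the buffered loop encodes most frames several times, while the cache-aware loop encodes each frame once, which is the property the paragraph describes.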

The latency knob: att_context_size

One inference setting controls the tradeoff between latency and accuracy: the attention context size, att_context_size. A smaller context emits text sooner but sees less future audio, while a larger context raises accuracy at the cost of higher latency.

The same checkpoint covers the full range. The supported settings correspond to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For instance, [56,0] selects the 80ms ultra-low-latency mode, while [56,13] selects the 1.12s chunk for maximum accuracy. Teams can pick the operating point at inference time without any retraining.
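The two examples are consistent with a simple rule: chunk duration equals (right context + 1) encoder frames of 80ms each. The sketch below encodes that rule. The 80ms frame duration and the formula are inferred from the [56,0] and [56,13] examples, not stated by the source, and the helper names are hypothetical. Any intermediate setting it derives should be checked against the model card before use.

```python
FRAME_MS = 80  # assumed encoder frame duration, inferred from the 80ms minimum chunk


def chunk_ms(att_context_size):
    """Chunk duration in ms for an [left, right] context pair, under the assumed rule."""
    _left, right = att_context_size
    return (right + 1) * FRAME_MS


def right_context_for(target_ms):
    """Right-context value that would give the requested chunk duration."""
    if target_ms <= 0 or target_ms % FRAME_MS != 0:
        raise ValueError(f"chunk duration must be a positive multiple of {FRAME_MS}ms")
    return target_ms // FRAME_MS - 1


if __name__ == "__main__":
    for target in (80, 160, 320, 560, 1120):
        # Left context of 56 is copied from the article's two examples.
        setting = [56, right_context_for(target)]
        print(f"{target}ms -> att_context_size={setting}")
```

Under this assumption, the only knob that changes latency is the second element of the pair; the first element (past context) stays fixed.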

Language detection and coverage

The 40 language-locales include major variants of English, Spanish, German, and French, alongside Arabic, Japanese, Korean, Mandarin, Hindi, and Thai, plus several other European and Nordic languages.

Language conditioning works in two ways. Setting target_lang to a known locale usually yields the best accuracy. Alternatively, setting target_lang=auto allows the model to detect the language itself. In auto mode, it emits a language tag after terminal punctuation, enabling a single deployment to transcribe mixed-language traffic without a separate language-ID component.
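Downstream code that consumes auto-mode output has to separate the language tags from the text. The sketch below shows one way to do that. The tag format used here (a locale code in angle brackets, such as <en-US>) is purely an assumption for illustration, since the source does not specify it; adjust the pattern to whatever the model actually emits.

```python
import re

# Hypothetical tag format: a locale code in angle brackets, e.g. "<en-US>".
TAG_PATTERN = re.compile(r"<([a-z]{2,3}(?:-[A-Z]{2})?)>")


def split_by_language(transcript: str):
    """Return (language, text) pairs; each tag labels the text that precedes it."""
    segments = []
    cursor = 0
    for match in TAG_PATTERN.finditer(transcript):
        text = transcript[cursor:match.start()].strip()
        if text:
            segments.append((match.group(1), text))
        cursor = match.end()
    trailing = transcript[cursor:].strip()
    if trailing:
        # Text still waiting for its tag, e.g. an unfinished sentence.
        segments.append((None, trailing))
    return segments


if __name__ == "__main__":
    sample = "Hello there. <en-US> Hola a todos. <es-ES>"
    for language, text in split_by_language(sample):
        print(language, "|", text)
```

Because the tag arrives after terminal punctuation, a streaming consumer only learns a sentence's language once the sentence is complete, which is why the trailing segment is returned with no language.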

How it stacks up against the competition

Product	Company	Access	Native streaming	Language coverage	Reported latency	Pricing model
Nemotron 3.5 ASR	NVIDIA	Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra	Yes, cache-aware FastConformer-RNNT	40 language-locales	80ms–1.12s, configurable at inference	Free to self-host; usage-based via host
Whisper large-v3	OpenAI	Open weights (MIT), self-host; API	No, offline/batch	~99 languages	Not streaming-native	Self-host free; API ~$0.006/min (batch)
Nova-3	Deepgram	Closed API; on-premise/self-host (enterprise)	Yes, streaming + batch	Multilingual; +10 monolingual languages added Jan 2026	Low-latency streaming (reported sub-300ms)	~$0.0077/min (Nova-3 Monolingual, PAYG)
Universal-3 Pro Streaming	AssemblyAI	Closed API (EU endpoint available)	Yes	6 languages: English, Spanish, French, German, Italian, Portuguese	Sub-300ms (official); first partial ~750ms	Usage-based (PAYG)
Scribe v2 Realtime	ElevenLabs	Closed API	Yes	90+ languages (99 per ElevenLabs)	~150ms (p50)	~$0.28/hour
Ursa / streaming	Speechmatics	API + on-premise + edge	Yes, streaming + batch	50+ languages with automatic identification	Ultra-low latency (positioned)	Enterprise/usage

Fine-tuning results

Because the weights are open, teams can fine-tune the model for specific languages, domains, or accents. NVIDIA published a worked example for Greek and Bulgarian, fine-tuning the base checkpoint with the same Cache-Aware FastConformer-RNNT recipe. Each clip carried a target_lang tag for language conditioning, using training data from public corpora including Granary, Common Voice, and FLEURS.

Results were measured as Word Error Rate (WER) on held-out FLEURS data at the 80ms setting. Greek WER fell from 35% to 24%, a 32% relative improvement. Bulgarian fell from 22% to 15%, a 31% relative improvement. These are absolute WER values in the lowest-latency streaming mode. NVIDIA notes that evaluating on held-out data at the latency used in deployment gives a realistic picture of accuracy.
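Relative improvement is (baseline − fine-tuned) / baseline. The sketch below applies that formula to the rounded WER figures quoted in the text. Because those inputs are rounded, the results land close to, but not exactly on, the reported 32% and 31%, which were presumably computed from unrounded WER values.

```python
def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - finetuned) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer


# Rounded WER percentages as quoted in the article.
greek = relative_wer_reduction(35, 24)
bulgarian = relative_wer_reduction(22, 15)

if __name__ == "__main__":
    print(f"Greek: {greek:.1f}% relative reduction")
    print(f"Bulgarian: {bulgarian:.1f}% relative reduction")
```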

Strengths and considerations

Strengths

One 600M-parameter checkpoint covers 40 language-locales, cutting deployment sprawl.
Cache-aware streaming processes each frame once, with a reported 17x the concurrent streams of buffered streaming on an H100.
att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining.
Punctuation, capitalization, and auto language tagging are built in.
Open weights enabled a 31–32% relative WER drop on Greek and Bulgarian after fine-tuning.

Considerations

The model handles English, but NVIDIA recommends its dedicated English model for English-only use.
The 80ms mode trades some accuracy for the lowest latency.
Japanese and Korean are scored with character error rate (CER) rather than WER, so cross-language error comparisons need care.
Throughput figures are measured on H100, so results on other GPUs will differ.
The production NIM with gRPC streaming is announced, but not yet released.

Key takeaways

NVIDIA’s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming model transcribing 40 language-locales from a single checkpoint.
Its Cache-Aware FastConformer-RNNT design processes each audio frame once, reported at 17x the concurrent streams of buffered approaches on an H100.
Latency is configurable from 80ms to 1.12s at inference via att_context_size, with no retraining.
A short fine-tune cut FLEURS WER 32% on Greek (35→24) and 31% on Bulgarian (22→15), at the 80ms setting.
