Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder

Interfaze has released diffusion-gemma-asr-small, an open-source speech recognition model that decodes transcripts with a diffusion decoder rather than the standard autoregressive method. Interfaze describes it as the first open-source multilingual audio diffusion ASR model. It handles six languages through a single trained adapter of approximately 42 million parameters. The adapter sits atop a frozen 26 billion parameter backbone and amounts to roughly 0.16% of the total model weights.

The two approaches differ in how they produce text. Autoregressive models generate text token by token. Diffusion models refine all tokens simultaneously over a small number of denoising steps. This new model applies the diffusion approach to speech-to-text conversion.

TL;DR

The Interfaze team claims this is the first open-source multilingual diffusion ASR model, transcribing six languages from a single ~42M-parameter adapter.
Transcription runs via DiffusionGemma’s diffusion decoder using uniform, random-token diffusion. It does not use the absorbing <mask> scheme.
Transcription cost depends on denoising steps, not the length of the transcript.
It outperforms other diffusion peers on LibriSpeech (6.6% WER versus Whisfusion’s 8.3%) but falls behind autoregressive Whisper.
The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.

What is diffusion-gemma-asr-small?

diffusion-gemma-asr-small is an audio-native ASR model. It converts speech to text using a discrete diffusion decoder. That decoder belongs to DiffusionGemma, Google’s 26 billion parameter mixture-of-experts model. DiffusionGemma activates 4 billion parameters, using 128 experts with top-8 routing. It generates text by discrete diffusion instead of autoregression.

The diffusion detail is specific. Most diffusion LLMs use an absorbing <mask> scheme. DiffusionGemma uses uniform, random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps confident predictions and re-randomizes the rest. After a few steps the noise anneals into text.
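The keep-or-re-randomize loop described above can be sketched as a toy. This is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the denoiser (it simply proposes the target token with a random confidence and ignores the canvas), and the linearly annealed confidence threshold is an assumed schedule, not one taken from DiffusionGemma.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def toy_predict(canvas, target, rng):
    # Hypothetical stand-in for the denoiser: proposes the target token at
    # every position with a random confidence in [0, 1). A real model would
    # condition on the audio and on the current canvas instead.
    return [(tok, rng.random()) for tok in target]

def denoise(target, steps=8, seed=0):
    rng = random.Random(seed)
    # Start from a fixed-length canvas of uniformly random vocabulary tokens.
    canvas = [rng.choice(VOCAB) for _ in target]
    for step in range(steps):
        # Assumed schedule: the confidence bar falls linearly to 0.0.
        threshold = 1.0 - (step + 1) / steps
        preds = toy_predict(canvas, target, rng)
        # Keep confident predictions, re-randomize everything else.
        canvas = [tok if conf >= threshold else rng.choice(VOCAB)
                  for tok, conf in preds]
    return "".join(canvas)

print(denoise("hello world"))
```

Because the threshold reaches zero on the last step, every position is kept by the end, which mirrors the description of noise annealing into text. Note that no `<mask>` token appears anywhere: unresolved positions always hold ordinary vocabulary tokens.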

Interfaze added audio to a model that does not accept it. Out of the box, DiffusionGemma takes text, images, and video, but not audio. The repo ships only the trained adapter, about 42 million parameters. The frozen backbones download separately from their own repos.
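The parameter figures can be checked with a few lines of arithmetic. The sketch below uses the rounded numbers quoted in this article, plus the ~19M projector size shown in the model card's pipeline diagram. Treating the remainder of the adapter as LoRA weights is an assumption, not a figure from the source.

```python
ADAPTER_PARAMS = 42_000_000        # trained adapter (approximate, from the article)
BACKBONE_PARAMS = 26_000_000_000   # frozen DiffusionGemma backbone (approximate)
PROJECTOR_PARAMS = 19_000_000      # projector size in the model card diagram (approximate)

def trained_fraction(adapter=ADAPTER_PARAMS, backbone=BACKBONE_PARAMS):
    # Share of the backbone's size that is actually trained.
    return adapter / backbone

def non_projector_params(adapter=ADAPTER_PARAMS, projector=PROJECTOR_PARAMS):
    # Assumption: whatever is not the projector is LoRA weights.
    return adapter - projector

print(f"{trained_fraction():.2%}")   # share of weights that are trained
print(non_projector_params())        # rough size of the LoRA part, if the assumption holds
```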

How it works

The model does not feed raw waveforms to the LLM. An early attempt tried exactly that and failed. A frozen LLM has never seen a spectrogram. The embedding space has no notion of formants or phonemes. The model learned to ignore audio and hallucinate fluent nonsense.

The working design uses a frozen whisper-small encoder. It acts only as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames. Each frame holds 768-dimensional acoustic features. A small trainable projector then compresses these frames. It uses convolutional layers that subsample by a factor of 8, followed by a linear map. The output is 188 audio tokens at 2816 dimensions. These tokens scatter into the prompt’s reserved <|audio|> slots. LoRA adapters let the backbone attend to this new modality. The decoder then denoises a 192-token transcript canvas. It runs bidirectionally over roughly 16 steps.
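The token count follows from the frame count and the subsampling factor. A minimal sketch of that arithmetic is below. Rounding up (as padded strided convolutions typically do) is an assumption made here to explain why 1500 / 8 yields 188 rather than 187.

```python
import math

WHISPER_FRAMES = 1500    # encoder frames per 30-second window (from the article)
WINDOW_SECONDS = 30
SUBSAMPLE_FACTOR = 8     # total temporal subsampling in the projector

def audio_token_count(frames=WHISPER_FRAMES, factor=SUBSAMPLE_FACTOR):
    # Assumption: padding makes the final partial window still emit a token.
    return math.ceil(frames / factor)

def tokens_per_second(frames=WHISPER_FRAMES, factor=SUBSAMPLE_FACTOR,
                      seconds=WINDOW_SECONDS):
    return audio_token_count(frames, factor) / seconds

print(audio_token_count())             # 188
print(round(tokens_per_second(), 2))   # audio tokens per second of speech
```

At roughly six audio tokens per second, the 188 audio tokens and the 192-token transcript canvas are of similar length, which also matters for any alignment loss applied to the audio tokens.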

The pipeline, from the model card, is compact:

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
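The "scatter into slots" step in the diagram can be illustrated with plain lists. This is a sketch under assumptions: `AUDIO_SLOT_ID`, `scatter_audio`, and the `embed_text` callback are hypothetical names for illustration and are not taken from the repo's model.py.

```python
AUDIO_SLOT_ID = -1  # hypothetical id marking a reserved audio slot in the prompt

def scatter_audio(prompt_ids, embed_text, audio_vectors):
    """Build the input embedding sequence, overwriting audio slots in order."""
    n_slots = sum(1 for t in prompt_ids if t == AUDIO_SLOT_ID)
    if n_slots != len(audio_vectors):
        raise ValueError(f"{n_slots} slots but {len(audio_vectors)} audio vectors")
    audio_iter = iter(audio_vectors)
    # Text positions use the normal embedding lookup; slot positions take the
    # next projected audio vector instead.
    return [next(audio_iter) if t == AUDIO_SLOT_ID else embed_text(t)
            for t in prompt_ids]

# Toy usage with 4-dimensional vectors instead of 2816-dimensional ones.
embed = lambda t: [float(t)] * 4
rows = scatter_audio([5, AUDIO_SLOT_ID, AUDIO_SLOT_ID, 7], embed,
                     [[0.1] * 4, [0.2] * 4])
print(rows)
```

In the real model there would be 188 such slots, each receiving one 2816-dimensional projector output.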

The training unlock

The first training runs stalled. Loss flatlined near 8. The failure was circular. The projector started random, so its output was noise. Attention then learned to ignore it. Almost no gradient reached the projector. The model never learned.

The fix supervised the projector directly. The research team ran the 188 audio tokens through DiffusionGemma’s frozen lm_head. They applied a CTC loss against the transcript. CTC means Connectionist Temporal Classification. It aligns audio features to text without needing attention.
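To make the CTC objective concrete, here is a minimal pure-Python sketch of the standard CTC forward algorithm. It is a textbook formulation, not the team's training code. In the setup described above, `log_probs` would come from applying the frozen lm_head to each of the 188 audio tokens and taking a log-softmax. The choice of blank id is an assumption.

```python
import math

def logsumexp(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T rows of per-class log-probabilities (T >= 1).
    target:    label ids without blanks.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank between labels
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

# Two frames, two classes (blank=0, label=1), uniform probabilities.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
print(ctc_nll(uniform, [1]))
```

The loss sums over every monotonic alignment between frames and labels, so gradients reach the projector without depending on what the backbone's attention has learned.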

This sidestepped the standoff. The audio embeddings became linearly predictive of the right words. CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell from 90% to 52% to 14.6% and finally to 6.6% over ten epochs.

Performance and benchmarks

WER means Word Error Rate, where lower is better. CER means Character Error Rate. The model trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer at 16 diffusion steps.

benchmark	metric	score
LibriSpeech test-clean (en)	WER	6.6%
FLEURS English	WER	15.7%
VoxPopuli English	WER	18.5%
FLEURS Hindi	CER	15.8%
FLEURS Mandarin	CER	29.6%
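For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by the reference length. The sketch below is a minimal implementation that assumes a non-empty reference and whitespace tokenization. It skips the Whisper text normalizer used for the reported scores, so it would not reproduce them exactly.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over sequences, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
print(cer("abc", "abd"))                                    # 1 error / 3 chars
```

CER is the usual choice for Mandarin and often for Hindi, where word segmentation is ambiguous or absent, which is consistent with the metric column above.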

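Against other diffusion or non-autoregressive ASR systems, it leads: it beats Whisfusion outright and sits within the approximate range reported for the TransFusion proof of concept.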

model	approach	LibriSpeech test-clean
TransFusion (2022)	multinomial diffusion	~6–7% (proof-of-concept)
Whisfusion (Aug 2025)	Whisper-large-v3 + masked diffusion	8.3%
diffusion-gemma-asr-small (2026)	Whisper-small + DiffusionGemma	6.6%

Against autoregressive Whisper, it trails. The team attributes this gap to training data rather than architecture.

benchmark	ours	Whisper-small	Whisper-large-v3
LibriSpeech clean	6.6%	~3.4%	~2.0%
FLEURS-en	15.7%	~9–10%	~4–5%
VoxPopuli-en	18.5%	~9–11%	~7–10%

The denoising-step sweep shows a nearly flat curve.

steps	FLEURS-en WER	speed
8	15.7%	14.9× real-time
16	15.6%	10.3×
32	15.2%	6.5×
48	15.6%	4.7×

Going from 8 to 48 steps buys about 0.1 WER point. It costs roughly 3 times the latency. The model converges in about 8 parallel passes. That is around 0.7–1.5 seconds of model time for a 10-second clip.
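The latency figures follow from the speed column of the sweep. The sketch below assumes "N× real-time" means audio duration divided by processing time, and that the measured speed carries over to a 10-second clip. Both are assumptions about how the numbers were measured.

```python
# Denoising steps -> reported speed as a multiple of real time (from the table).
SWEEP = {8: 14.9, 16: 10.3, 32: 6.5, 48: 4.7}

def model_seconds(clip_seconds, realtime_factor):
    # Assumption: realtime_factor = audio duration / processing time.
    return clip_seconds / realtime_factor

def latency_ratio(steps_a, steps_b, sweep=SWEEP):
    # How many times slower steps_b is than steps_a.
    return sweep[steps_a] / sweep[steps_b]

for steps, rtf in SWEEP.items():
    print(steps, round(model_seconds(10, rtf), 2))
print(round(latency_ratio(8, 48), 1))
```

Under these assumptions, the quoted 0.7–1.5 second range corresponds to the 8-step and 32-step rows.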

Use cases with examples

Batch transcription pipelines benefit from parallel decoding. Cost is set by the number of denoising steps, not clip length. A 10-second clip needs roughly the same number of passes as a shorter one.
Multilingual transcription runs from a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin. Teams avoid loading a separate model per language.
Non-autoregressive ASR research gains a reproducible baseline. The recipe grounds a frozen LLM with a small adapter. Researchers can extend it with more audio or a larger encoder.
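The cost claim in the first use case can be made concrete by counting decoder forward passes. This sketch is a simplification: the function names are hypothetical, it ignores prompt and audio encoding, and it does not model per-pass cost. Each diffusion pass covers the full 192-token canvas, while a cached autoregressive pass handles one new token, so equal pass counts do not imply equal compute.

```python
CANVAS_TOKENS = 192   # fixed transcript canvas (from the article)

def autoregressive_passes(n_tokens):
    # One decoder pass per generated token.
    return n_tokens

def diffusion_passes(n_tokens, steps=16, canvas=CANVAS_TOKENS):
    # One pass per denoising step, regardless of transcript length,
    # as long as the transcript fits in a single canvas.
    if n_tokens > canvas:
        raise ValueError("transcript does not fit in one canvas")
    return steps

for n in (20, 80, 150):
    print(n, autoregressive_passes(n), diffusion_passes(n))
```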

How to get started

The model lives on the Hugging Face Hub. It ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support requires installing transformers from its main branch rather than a tagged release.
