Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation

For developers and artists building local AI tools, Google's latest release is aimed squarely at generation speed. DiffusionGemma moves away from the sequential decoding used by standard language models. By swapping autoregressive decoding for text diffusion, this open model generates tokens in parallel, delivering up to four times faster output on dedicated hardware. It is licensed under Apache 2.0, which makes it immediately usable in local workflows where latency is the enemy of creativity.

While most models you use today write text one token at a time, DiffusionGemma drafts a whole block of text at once and refines it in place. This architectural change means the model no longer has to finish one sentence before starting the next. On a high-end GPU, this parallelism translates directly to throughput, enabling rapid iteration for in-line editing and complex, non-linear text structures.

What is DiffusionGemma

At its core, DiffusionGemma is a 26-billion-parameter Mixture of Experts (MoE) model. During inference, it activates only 3.8 billion of those parameters, keeping the computational load manageable. It is built on the Gemma 4 backbone, specifically the 26B A4B architecture, with a custom diffusion head added on top.

The model is multimodal, capable of processing interleaved text, images, and video inputs to generate text responses. It boasts a massive context window of 256,000 tokens and supports over 140 languages.

Performance figures are impressive for local deployment. When quantized, the model fits within 18GB of VRAM, placing it squarely in the realm of high-end consumer GPUs. On a single NVIDIA H100, it generates over 1,000 tokens per second. On an NVIDIA GeForce RTX 5090, it still manages 700+ tokens per second.
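To put the 18GB figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The bit widths are assumptions for illustration; the article does not say which quantization scheme is used. The estimate also ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

# Assumed bit widths; the article does not name the quantization format.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Note that the total parameter count is the relevant number here, not the 3.8 billion active parameters: in an MoE model all experts typically have to be resident in memory even though only a few are used for any given token.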

Google is transparent about the trade-off: speed comes at the cost of perfection. DiffusionGemma prioritizes rapid generation and parallel layout over the nuanced quality of standard Gemma 4. For production work where maximum fidelity is required, the traditional autoregressive Gemma 4 remains the recommended choice.

How Text Diffusion Works

The concept borrows heavily from AI image generators. Just as image models start with static noise and refine it into a picture, DiffusionGemma starts with a canvas of random placeholder tokens and refines it into a coherent response.

The process occurs in three conceptual stages. First, the model initializes a canvas of random placeholder tokens. Second, it runs multiple passes, locking in high-confidence tokens to use as context for the rest. Third, the text converges into the final output.
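The three stages can be mimicked with a toy loop. This is only an illustration of the control flow, not the real sampler: the function name `toy_denoise`, the random confidence scores, and the fact that the "model" already knows the target text are all simplifications invented for the example.

```python
import random


def toy_denoise(target, vocab, per_pass=3, seed=0):
    """Toy stand-in for parallel denoising; returns one canvas snapshot per pass."""
    rng = random.Random(seed)
    # Stage 1: initialize the canvas with random placeholder tokens.
    canvas = [rng.choice(vocab) for _ in target]
    locked = set()
    snapshots = []
    # Stage 2: repeated passes, each locking in the most confident open positions.
    while len(locked) < len(target):
        open_positions = [i for i in range(len(target)) if i not in locked]
        # Fake confidence scores; a real model would derive these from its predictions.
        confidence = {i: rng.random() for i in open_positions}
        winners = sorted(open_positions, key=confidence.get, reverse=True)[:per_pass]
        for i in winners:
            canvas[i] = target[i]  # the toy simply "predicts" the known answer
            locked.add(i)
        snapshots.append(list(canvas))
    # Stage 3: the last snapshot is the converged text.
    return snapshots


target = "the quick brown fox jumps over the lazy dog".split()
for n, snap in enumerate(toy_denoise(target, vocab=["foo", "bar", "baz"]), start=1):
    print(f"pass {n}: {' '.join(snap)}")
```

With nine target tokens and three tokens locked per pass, the loop finishes in three passes, and positions resolve in no particular left-to-right order.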

Google terms this core mechanism Uniform State Diffusion. Highly confident tokens act as anchors, helping resolve adjacent positions during the denoising phase. The entire sequence comes into focus over several passes instead of being written one token at a time.

In practice, the model denoises a 256-token canvas in parallel. It finalizes roughly 15 to 20 tokens per forward pass. This parallelism is the engine driving the throughput gains.
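A quick calculation shows what those numbers imply for the number of forward passes per canvas. This is plain arithmetic on the figures quoted above, assuming every pass finalizes tokens at a steady rate and ignoring any re-noising.

```python
import math

CANVAS_TOKENS = 256  # canvas size quoted in the article


def passes_needed(tokens_per_pass: int, canvas_tokens: int = CANVAS_TOKENS) -> int:
    """Forward passes to finalize a full canvas at a steady rate."""
    return math.ceil(canvas_tokens / tokens_per_pass)


for rate in (15, 20):
    print(f"{rate} tokens/pass -> {passes_needed(rate)} passes per canvas")

# An autoregressive model needs one pass per token.
print(f"autoregressive -> {CANVAS_TOKENS} passes")
```

Fewer passes do not translate one-to-one into wall-clock speedup, since each diffusion pass processes the whole canvas rather than a single token. That is consistent with the headline figure being "up to 4x" rather than matching the reduction in pass count.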

Crucially, the model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models, which can only look backward at prior tokens.
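The difference between the two attention patterns is easy to see as a mask, where row i marks which positions token i is allowed to attend to. The sketch below uses plain Python lists and hypothetical helper names; real implementations build these masks as tensors.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Token i may attend only to positions 0..i (autoregressive style)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Every token may attend to every position (denoising style)."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("causal:")
print(render(causal_mask(4)))
print("bidirectional:")
print(render(bidirectional_mask(4)))
```

The causal mask is lower-triangular, so the first token sees only itself, while in the bidirectional mask even the first token can use information from the last.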

That bidirectional context enables real-time self-correction. If a token’s confidence drops, the sampler can re-noise it. The model then replaces that token on a later pass. Autoregressive models cannot do this, as they commit each token permanently once generated.
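A minimal sketch of the re-noise step might look like the following. The threshold value, the `<noise>` placeholder, and the function name are hypothetical; the article does not describe how the actual sampler decides which tokens to reset.

```python
def renoise_low_confidence(canvas, confidence, threshold=0.5, noise_token="<noise>"):
    """Return a copy of the canvas with low-confidence tokens reset, plus their positions."""
    new_canvas = list(canvas)
    reset_positions = []
    for i, score in enumerate(confidence):
        if score < threshold:  # hypothetical cutoff, chosen for illustration
            new_canvas[i] = noise_token
            reset_positions.append(i)
    return new_canvas, reset_positions


canvas = ["The", "cat", "sat", "in", "the", "mat"]
confidence = [0.98, 0.95, 0.91, 0.32, 0.97, 0.94]

new_canvas, reset_positions = renoise_low_confidence(canvas, confidence)
print(new_canvas)       # position 3 is reset and can be re-predicted on a later pass
print(reset_positions)
```

Because the other five tokens stay in place, the next pass can use them as context on both sides of the reset position when choosing a replacement.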

The Architecture

The technical advancement here is hardware utilization. For local GPU inference, the main bottleneck is usually memory bandwidth. Autoregressive models reload the model weights from memory for every generated token, so the GPU spends most of its time waiting on memory rather than computing.

DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines a 256-token canvas in parallel, giving idle tensor cores a large, simultaneous workload.
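A simple roofline-style estimate illustrates the shift. Everything except the 3.8 billion active parameters and the 256-token canvas is an assumption: the GPU numbers are hypothetical round figures, weights are assumed to be 16-bit, and compute is approximated as two floating-point operations per active weight per token. The sketch also ignores attention cost and the fact that a full canvas may route to more experts than a single token does.

```python
ACTIVE_PARAMS = 3.8e9   # active parameters, from the article
BYTES_PER_WEIGHT = 2    # assumption: 16-bit weights
FLOPS_PER_PARAM = 2     # rule of thumb: one multiply and one add per weight per token
BANDWIDTH = 1e12        # hypothetical GPU memory bandwidth: 1 TB/s
PEAK_FLOPS = 1e14       # hypothetical GPU peak compute: 100 TFLOP/s


def estimate_pass(tokens_in_pass: int) -> tuple[float, str]:
    """Return (seconds per forward pass, limiting resource) under the assumptions above."""
    memory_time = ACTIVE_PARAMS * BYTES_PER_WEIGHT / BANDWIDTH
    compute_time = ACTIVE_PARAMS * FLOPS_PER_PARAM * tokens_in_pass / PEAK_FLOPS
    bound = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), bound


for n in (1, 256):
    seconds, bound = estimate_pass(n)
    print(f"{n:>3} tokens per pass: {seconds * 1e3:.2f} ms, {bound}-bound")
```

Under these assumptions a single-token pass is limited by the time it takes to read the weights, while a 256-token pass does enough arithmetic per weight read that compute becomes the limit.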

The model alternates between two attention modes during inference. The prefill stage uses causal attention to ingest the prompt and write the KV cache. The denoising stage uses bidirectional attention to refine the canvas.

For longer outputs, DiffusionGemma employs Block Autoregressive Diffusion. Once a 256-token block is fully denoised, it commits to the KV cache. The model then starts a fresh canvas conditioned on prior history. This pairs parallel block speed with the sequential stability of autoregressive models.
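The outer loop of this scheme can be sketched as follows. The `denoise_block` callable and the `fake_denoiser` stand-in are hypothetical; a plain Python list plays the role of the KV cache of committed tokens.

```python
BLOCK_SIZE = 256  # canvas size quoted in the article


def generate_blocks(total_tokens, denoise_block, block_size=BLOCK_SIZE):
    """Generate text block by block: parallel inside a block, sequential across blocks."""
    history = []  # stands in for the KV cache of committed tokens
    while len(history) < total_tokens:
        size = min(block_size, total_tokens - len(history))
        # The whole block is refined in parallel, conditioned on everything committed so far.
        block = denoise_block(history, size)
        history.extend(block)  # commit the finished block before starting a fresh canvas
    return history


def fake_denoiser(history, size):
    """Placeholder for the real denoiser: returns numbered dummy tokens."""
    start = len(history)
    return [f"tok{start + i}" for i in range(size)]


output = generate_blocks(600, fake_denoiser)
print(len(output), output[0], output[255], output[256], output[-1])
```

For a 600-token output the loop runs three times, with blocks of 256, 256, and 88 tokens, and each block only ever sees blocks that are already committed.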

The architecture shares the same backbone as Gemma 4 26B A4B. Developers mainly need to implement a denoising step, making integration into existing serving frameworks simpler.

A clear example of its utility is the Sudoku showcase from Google’s developer guide. Autoregressive models often struggle with strict, multivariable constrained puzzles. The base DiffusionGemma model solves roughly 0% of Sudoku puzzles. However, after a simple JAX supervised fine-tuning recipe, correctness rises to 80%. The fine-tuned model also stops earlier, cutting inference steps significantly.
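Sudoku is a convenient benchmark because correctness is mechanical to check. The article does not describe how Google scored the puzzles, so the checker below is only one plausible way to do it: it verifies that a completed 9x9 grid has every digit exactly once in each row, column, and 3x3 box. It does not check that the solution respects a puzzle's given clues.

```python
def is_valid_solution(grid: list[list[int]]) -> bool:
    """True if a completed 9x9 grid satisfies the row, column, and box constraints."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


# A known-valid grid built from a shifting pattern, used here only as test data.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_solution(solved))

broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]  # swap breaks two columns
print(is_valid_solution(broken))
```

A single misplaced digit invalidates the whole grid, which is part of why token-by-token generation without the ability to revise earlier choices has a hard time with this kind of puzzle.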

How DiffusionGemma Decodes in Parallel

The contrast with a standard autoregressive model is easiest to see pass by pass. In autoregressive decoding, tokens fill in one at a time, strictly left to right, with one forward pass per token, which is how most LLMs generate today. In diffusion decoding, the model starts from a canvas of masked placeholder tokens and resolves many of them in parallel on each pass, in no fixed order, converging in far fewer passes. A low-confidence token can also be re-noised, meaning it is reset and refined again on a later pass; autoregressive decoding has no equivalent once a token is committed. The real DiffusionGemma resolves a 256-token canvas and finalizes roughly 15 to 20 tokens per forward pass.

Marktechpost

Key takeaways

DiffusionGemma achieves up to 4x faster generation by using text diffusion to process tokens in parallel rather than sequentially.
While it offers massive speed gains and self-correction capabilities, the model trades off some output quality compared to standard autoregressive Gemma 4.
The model is optimized for local workflows, fitting within 18GB of VRAM and activating only 3.8B parameters during inference.
Simple fine-tuning can significantly boost performance on constrained tasks, such as raising Sudoku puzzle accuracy from 0% to 80%.
