Speeding Up Text Generation with Nemotron-Labs Diffusion Language Models

Large language models (LLMs) have become a default interface for many tasks in developer workflows. The autoregressive (AR) approach, in which each token depends on the tokens before it, is stable to train but imposes a significant limitation: every new token requires a full forward pass, and all model weights must be loaded before computation can begin. For applications that need high throughput or efficient use of modern GPUs, this is a bottleneck.

Nemotron-Labs Diffusion takes an alternative approach: it generates multiple tokens in parallel and iteratively refines them over several steps. This uses the available compute more efficiently and also provides a mechanism to revise previously generated tokens, which makes it better suited to tasks such as text editing and fill-in-the-middle generation.

Three Generation Modes in One Model

Nemotron-Labs Diffusion is designed to integrate autoregressive and diffusion capabilities into a single model. The three generation modes are:

Autoregressive mode: Runs like a standard left-to-right LLM, maintaining compatibility with existing workflows.
Diffusion mode: Generates tokens in blocks over multiple steps, allowing for parallel processing and refinement.
Self-speculation mode: Drafts multiple candidate tokens using diffusion and verifies them using autoregressive decoding. This combines the speed of diffusion with the reliability of AR verification.

This flexible design enables seamless switching between modes without requiring changes to the application, making it suitable for a wide range of use cases.
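
The draft-and-verify idea behind self-speculation can be sketched in a few lines. This is a toy illustration and not the model's actual implementation: `verify_draft` and `toy_ar_next_token` are hypothetical names, the "model" is a stand-in that counts upward, and a real system would verify all draft positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    ar_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens accepted from one draft-and-verify round.

    Draft tokens are accepted while they match what greedy AR decoding
    would emit. At the first mismatch, the AR token is emitted instead
    and the round ends.
    """
    accepted: List[int] = []
    for token in draft:
        expected = ar_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # correction supplied by the verifier
            return accepted
        accepted.append(token)
    return accepted


def toy_ar_next_token(tokens: List[int]) -> int:
    # Hypothetical stand-in for a greedy AR model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


# The first two draft tokens match the verifier, the third does not.
print(verify_draft([1, 2], [3, 4, 9, 9], toy_ar_next_token))  # [3, 4, 5]
```

Because every accepted token is exactly what greedy AR decoding would have produced, the final sequence is identical to plain greedy AR output; the gain comes from accepting several tokens per round when the draft is good.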

Performance Highlights

Nemotron-Labs Diffusion 8B achieves higher average accuracy than Qwen3 8B. For inference speed, measured in tokens per forward pass (TPF), diffusion mode reaches up to 6.4× higher TPF than autoregressive models, with comparable accuracy across various tasks.
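
To make the TPF metric concrete, here is a minimal sketch of how it can be computed. The function name and the token and pass counts are hypothetical, chosen only so the ratio lands on the scale of the figure quoted above; they are not measurements from the report.

```python
def tokens_per_forward_pass(tokens_generated: int, forward_passes: int) -> float:
    """Average number of output tokens produced per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


# Autoregressive decoding emits exactly one token per forward pass.
ar_tpf = tokens_per_forward_pass(tokens_generated=256, forward_passes=256)

# Hypothetical diffusion run: the same 256 tokens in 40 forward passes.
dlm_tpf = tokens_per_forward_pass(tokens_generated=256, forward_passes=40)

print(ar_tpf, dlm_tpf, dlm_tpf / ar_tpf)  # 1.0 6.4 6.4
```

TPF counts forward passes, not seconds. A forward pass that processes a whole block may cost more than a single-token pass, so a TPF ratio is an upper bound on wall-clock speedup rather than a direct measure of it.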

How We Trained Nemotron-Labs Diffusion

The recent work on Efficient-DLM demonstrated that pretrained AR models can be converted into diffusion language models through continued pretraining, with the attention mechanism modified to operate block-wise. This design helps preserve the AR model's capabilities while enabling parallel decoding, and it keeps the model compatible with KV caching.
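
One common form of block-wise attention is a block-causal mask: positions attend in both directions within their own block and causally to all earlier blocks. The sketch below builds such a mask in plain Python. It is an assumption for illustration; the article does not specify the exact mask used by Efficient-DLM or Nemotron-Labs Diffusion, and `block_causal_mask` is a hypothetical helper.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees every position in its own block (bidirectional) and every
    position in earlier blocks (causal), but nothing in later blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


for row in block_causal_mask(seq_len=6, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Under this mask, a finished block never attends to later blocks, so its keys and values do not change afterwards and can be cached, much as in ordinary causal decoding.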

Nemotron-Labs Diffusion builds on this insight by adding diffusion capabilities to an existing AR model. The model was trained with both an autoregressive and a diffusion objective, allowing it to retain the capabilities of its initial AR training while gaining the benefits of diffusion for parallel generation and token refinement.

Deployment and Inference through SGLang

Nemotron-Labs Diffusion models will be supported in the main branch of SGLang. For now, inference support is available via an issue tracker request on GitHub. The three modes are used as follows:

Autoregressive mode: Set ar_mode=true. The model then behaves like any other causal LLM, which makes this mode useful as a correctness reference or sanity check.
Diffusion mode (FastDiffuser): This is our primary recommendation for maximum throughput. The model iteratively refines blocks of tokens by denoising them, with a confidence threshold deciding which tokens are committed at each step.
Self-speculation mode: Drafts tokens bidirectionally and verifies them causally. At temperature 0, the output is lossless relative to autoregressive generation while offering significantly higher throughput, around 4× faster on B200 hardware for speedbench tasks.
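
The confidence-threshold behaviour in diffusion mode can be illustrated with a toy refinement step. This sketch is not SGLang's or FastDiffuser's actual code: `commit_step`, the `MASK` placeholder, and the prediction values are hypothetical, and the fallback that always commits the single most confident position is an assumption added so the loop is guaranteed to finish.

```python
from typing import List, Optional, Tuple

MASK = None  # placeholder for a position that has not been committed yet


def commit_step(
    block: List[Optional[int]],
    predictions: List[Tuple[int, float]],
    threshold: float,
) -> List[Optional[int]]:
    """Run one refinement step over a block.

    predictions[i] is a (token, confidence) pair for position i. Masked
    positions whose confidence reaches the threshold are committed. If none
    qualify, the most confident masked position is committed anyway
    (an assumption of this sketch) so that decoding always makes progress.
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)
    chosen = [i for i in masked if predictions[i][1] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: predictions[i][1])]
    new_block = list(block)
    for i in chosen:
        new_block[i] = predictions[i][0]
    return new_block


block = [MASK, MASK, MASK, MASK]
step1 = commit_step(block, [(11, 0.95), (12, 0.40), (13, 0.91), (14, 0.55)], 0.9)
print(step1)  # [11, None, 13, None]

step2 = commit_step(step1, [(11, 0.99), (12, 0.60), (13, 0.99), (14, 0.70)], 0.9)
print(step2)  # [11, None, 13, 14]
```

A lower threshold commits more tokens per step, which raises throughput at the risk of locking in weaker predictions; a higher threshold does the opposite.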

Get Started Today

Nemotron-Labs Diffusion brings diffusion-style generation into a practical form: open models, familiar autoregressive compatibility, and diverse inference modes in one package. Developers can now draft, refine, verify, and accelerate text generation without altering their applications.

To get started, explore the Nemotron-Labs Diffusion model family, read the technical report, and try out the available training recipe.

Key Takeaways

Nemotron-Labs Diffusion integrates autoregressive and diffusion capabilities into a single model for better performance and functionality.
The model supports three generation modes: autoregressive, diffusion, and self-speculation, providing flexibility in deployment.
Diffusion mode offers significant speed improvements over autoregressive models while maintaining comparable accuracy across various tasks.
Nemotron-Labs Diffusion is being integrated into SGLang, which simplifies adding it to existing workflows.

Source: Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
