NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has unveiled Nemotron 3 Ultra, the heavyweight flagship of its latest model family. This release addresses a critical bottleneck for autonomous…

By AI Maestro June 4, 2026 5 min read
NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has unveiled Nemotron 3 Ultra, the heavyweight flagship of its latest model family. This release addresses a critical bottleneck for autonomous systems: the escalating cost and latency of running long-duration agents. As these systems plan, execute tools, and reason over extended interactions, token consumption—and therefore expense—skyrockets. Nemotron 3 Ultra aims to slash these inference costs while maintaining high fidelity.

Understanding the Architecture

Nemotron 3 Ultra is a sparse Mixture-of-Experts (MoE) model with 550 billion total parameters, though only 55 billion are active for any given token. This architecture maximises efficiency without sacrificing precision.

Crucially, it abandons the pure Transformer design for a hybrid Mamba-Attention stack. The Mamba component handles long sequences with sub-quadratic scaling, while a few Attention layers ensure precise recall across massive contexts.

The model was pre-trained on 20 trillion text tokens, with context extended to one million tokens. Post-training involved Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and a novel Multi-teacher On-Policy Distillation (MOPD) process.

According to the NVIDIA team, this approach delivers up to six times higher inference throughput than comparable open models, without compromising accuracy.

Technical Specifications and Design Choices

The model features 108 layers with a dimension of 8,192. It employs 64 query heads but restricts key-value heads to just two, keeping the KV cache compact. Each MoE layer contains 512 experts, activating the top 22 per token.

Three architectural innovations define this release:

  • LatentMoE: This routing mechanism trades hidden-dimension width for a greater number of routed experts at a fixed inference cost. The team reports superior accuracy per parameter compared to standard granular MoEs.
  • Multi-Token Prediction (MTP): This feature predicts multiple future tokens in a single forward pass, enabling native speculative decoding for faster generation. Two MTP heads share parameters during training.
  • NVFP4 pre-training: This uses the E2M1 4-bit datatype with two-dimensional block quantization. The team cites this as the largest demonstration of stable, accurate training using NVFP4 to date.

The hybrid Mamba-Attention stack is particularly vital for agents. Because Mamba’s per-step decode cost remains constant regardless of sequence length, throughput gains widen significantly on long, decode-heavy workloads.

Data and Training Strategy

Pre-training utilised a Warmup-Stable-Decay learning rate schedule across 20 trillion tokens, split into two phases: the first 15 trillion focused on diversity, while the final 5 trillion prioritised high-quality data.

NVIDIA also released new domain-specific datasets, including 173 billion refreshed GitHub code tokens. In an ablation study using Nemotron 3 Nano, a synthetic legal dataset boosted a LegalBench proxy average from 64.6 to 74.7. Similarly, a Wiki-based fact-seeking set lifted a SimpleQA proxy from 40.2 to 50.2.

The post-training expansion is equally substantial. NVIDIA added 10 million new SFT samples and 1 million new RL tasks across 15 new environments. Cumulatively, the open Nemotron series now stands at 50M SFT samples, 2M RL tasks, and 55 RL environments.

Training was not without its hurdles. The team documented two loss divergences as valuable engineering records. The first, occurring near 8 trillion tokens, stemmed from moving output-layer gradient reduction from FP32 to BF16, which lost precision in the mantissa bits. Reverting to FP32 stabilised training. The second divergence, near 16 trillion tokens, had an unconfirmed root cause but was mitigated by early learning rate annealing and capping the total token horizon at 20 trillion.

Post-Training: RLVR and MOPD

The post-training pipeline executes SFT, followed by unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. This loop can repeat for multiple cycles.

RLVR (Reinforcement Learning with Verifiable Reward) operates across diverse environments including terminal use, software engineering, search, math, code, and safety. Rewards here are often sparse and environment-dependent.

MOPD is the standout new method. As the number of environments grows in RLVR, the learning signal dilutes. To counter this, NVIDIA trained over ten domain-specialised teacher models, each with its own pipeline.

During MOPD, the student model generates rollouts across domains, which are scored by the matching teacher using dense, token-level guidance. This offers a richer signal than the sparse rewards of RLVR. The process runs asynchronously, pipelining rollout generation, scoring, and updates.

MOPD is iterative. After one checkpoint, new teachers are initialised from the improved student, with their gains merging into the next round. The team ran two MOPD iterations for Nemotron 3 Ultra.

A practical caveat exists: MOPD works best when student rollouts stay within the teacher’s support. A brief SFT warmup aligns the distributions first. The team found gains were smaller on self-contained reasoning tasks the student rarely sampled.

Reasoning and Efficiency Controls

Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The latter two accept inference-time budget controls.

Medium-effort serves as the efficiency lever. The team reports it uses roughly 2.5 times fewer tokens than regular mode, at the cost of a 7% accuracy drop. For high-volume agent steps, this trade-off can significantly reduce spend.

Benchmark Performance

Comparisons against models like GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B) show a competitive rather than dominant picture.

On agentic tasks, Nemotron 3 Ultra scores 90.0 on PinchBench and 56.0 on ProfBench (Search), which were held-out generalisation gates. It achieves 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1, though Kimi-K2.6 leads Terminal Bench at 67.2.

In reasoning, it scores 570.0 on IOI 2025, framing this as top-3 human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7, indicating a lower tendency to answer when uncertain.

Long context performance holds up well; it scores 94.7 on RULER at one million tokens, whereas several larger comparison models cap out at 256K context.

On an 8K input / 64K output setting using NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9 times the throughput of GLM-5.1, 4.8 times faster than Kimi-K2.6, and 1.6 times faster than Qwen-3.5. Note that Nemotron’s figures use TRT-LLM, while competitors use vLLM.

Trade-offs appear on prefill-heavy workloads. On a 50K input / 2K output setting, it trails Qwen-3.5, as prefill cost tracks active parameters. However, the team reports up to 30% lower cost to task completion due to fewer tokens per turn on SWE-Bench and Terminal Bench.

Robustness is another key metric. The model is trained under multiple agent harnesses per task type. SWE-Bench Verified scores remain consistent between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent, ensuring consistent behaviour regardless of the deployment framework.

Quantization and Deployment

The team ships a single NVFP4 checkpoint. On Blackwell hardware, it runs with native FP4 math. On Hopper, which lacks native FP4 tensor cores, it runs as W4A16.

The final solution operates at 5.03 bits-per-element, mixing NVFP4 routed experts with FP8 layers for shared experts and Mamba linears, while Attention layers stay in BF16. The team found accuracy saturated below this budget, so higher precision offered no measurable gain.

The reduced weight footprint aids deployment. The W4A16 path allows MTP weights to fit on a single 8-GPU H100 node, whereas an FP8 checkpoint would require spanning two nodes.

Key takeaways

  • Nemotron 3 Ultra is a 550B open MoE (55B active) leveraging a hybrid Mamba-Attention design specifically for long-running agents.
  • NVIDIA claims up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy, achieving 5.9x vs GLM-5.1 on 8K/64K workloads.
  • It combines a 1M-token context with the highest non-hallucination score in its comparison set (78.7 on AA-Omniscience).
  • Post-training relies heavily on Multi-teacher On-Policy Distillation (MOPD), distilling over ten specialised teachers into a single student model.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top