Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
- NVIDIA Nemotron 3 Nano Omni is a new multimodal understanding model designed for real-world document analysis, image reasoning, automatic speech recognition (ASR), long audio-video understanding, and general reasoning.
- It extends the Nemotron line to cover text, image, video, and audio, and reports top accuracy on benchmarks including MMLongBench-Doc and OCRBenchV2 for documents, WorldSense and DailyOmni for video understanding, and VoiceBench for ASR.
- The model combines a hybrid Mamba-Transformer-Mixture-of-Experts (MoE) backbone with a C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder to process images, videos, and audio data efficiently.
- Its architecture is designed to preserve fine visual details, support native audio understanding, and handle very long multimodal contexts.
- The training process uses staged alignment and context extension followed by preference optimization and reinforcement learning (RL).
- Nemotron 3 Nano Omni delivers up to 9 times higher throughput on multimodal use cases than existing alternatives, which makes it better suited to real-world deployments.
- Download the pre-trained checkpoints from Hugging Face.
- For more information about the model architecture and training details, refer to the full report.
Nemotron 3 Nano Omni’s Target Workloads
At a high level, Nemotron 3 Nano Omni is aimed at five classes of workloads:
1. Real-world document analysis
This includes not just OCR but understanding long, complex documents with layout, tables, figures, and references across multiple pages.
2. Automatic Speech Recognition (ASR)
The model incorporates robust speech understanding capabilities for diverse audio conditions, enabling high-quality transcription of spoken content in various environments.
3. Long audio-video understanding
Nemotron 3 Nano Omni is designed to reason over mixed-modality inputs such as screen recordings with narration, video tutorials, and customer support captures.
4. Agentic computer use
The model is trained to assist with tasks in graphical user interfaces (GUIs): it interprets screenshots, monitors UI state, and helps complete or automate tasks.
5. General multimodal reasoning
It excels at multi-step reasoning across multiple modalities, drawing connections between text, images, tables, and other inputs to derive coherent answers based on the available evidence.
Model Architecture and Key Innovations
Nemotron 3 Nano Omni employs a unified encoder-projector-decoder design. The language backbone is built on top of the Nemotron 3 Nano 30B-A3B model, paired with vision and audio encoders.
A hybrid Mamba-Transformer-MoE backbone for long multimodal context
- The model’s architecture interleaves three key components: state-space layers for efficient processing of long contexts; MoE layers with 128 experts, top-6 routing, and a shared expert mechanism to manage conditional capacity; and grouped-query attention layers for maintaining strong global interactions.
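The routing step in the MoE layers can be illustrated with a toy example. The sketch below is not the model's implementation: it assumes softmax router scores, renormalized top-k weights, and a shared expert whose output is simply added, none of which the article specifies. Only the 128 experts and top-6 routing come from the text; `route_token` and `moe_layer` are hypothetical names, and scalars stand in for hidden vectors.

```python
import math

NUM_EXPERTS = 128  # from the article
TOP_K = 6          # from the article


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts.

    Assumption: weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(x, router_logits, experts, shared_expert):
    """Add the shared expert's output to the weighted outputs of the routed experts.

    Assumption: the shared expert sees every token and is combined by plain addition.
    """
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy usage: each "expert" is a scalar function instead of a feed-forward block.
toy_experts = [lambda x, scale=i: x * scale for i in range(NUM_EXPERTS)]
toy_logits = [float(i) for i in range(NUM_EXPERTS)]
selected = route_token(toy_logits)
```

Only 6 of the 128 routed experts run for a given token, which is why the number of active parameters per token stays far below the total parameter count.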
Dynamic resolution for dense documents, charts, and screens
On the vision side, the model uses dynamic-resolution processing at the native aspect ratio. Each image is represented by a variable number of 16×16 patches, which gives flexibility in handling high-resolution inputs such as OCR-heavy documents or financial tables.
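To get a feel for how the patch count grows with resolution, here is a small arithmetic sketch. It assumes partial patches at the image edge are rounded up (padded); the article does not say how the model handles edges, whether it resizes images first, or what minimum and maximum patch budgets apply. `patch_grid` and `patch_count` are hypothetical helper names.

```python
import math

PATCH = 16  # patch side in pixels, from the article


def patch_grid(width, height, patch=PATCH):
    """Patch columns and rows needed to cover an image, rounding up partial patches."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows


def patch_count(width, height, patch=PATCH):
    """Total number of patches for an image kept at its native aspect ratio."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


print(patch_count(1024, 768))    # 64 x 48 grid   -> 3072
print(patch_count(1700, 2200))   # 107 x 138 grid -> 14766
```

The second call corresponds to a letter-size page scanned at roughly 200 dpi, which shows why dense documents need far more patches than a fixed low-resolution grid would provide.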
Conv3D temporal compression for video
- Videos are processed by fusing every pair of consecutive frames into a single “tubelet,” which halves the number of vision tokens the language model needs to attend to. At a fixed token budget this allows more frames to be sampled, and at a fixed frame count it lowers cost, while maintaining context and accuracy.
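The token savings from fusing frame pairs come down to simple arithmetic, sketched below. The value of 256 tokens per frame is a made-up illustration, and rounding an odd trailing frame up into its own tubelet is an assumption; the article gives neither detail.

```python
import math


def video_tokens(num_frames, tokens_per_frame, frames_per_tubelet=2):
    """Vision tokens produced after fusing consecutive frames into tubelets.

    Assumption: a leftover frame at the end still forms one tubelet.
    """
    tubelets = math.ceil(num_frames / frames_per_tubelet)
    return tubelets * tokens_per_frame


TOKENS_PER_FRAME = 256  # hypothetical value, for illustration only

print(video_tokens(64, TOKENS_PER_FRAME, frames_per_tubelet=1))  # no fusion: 16384
print(video_tokens(64, TOKENS_PER_FRAME))                        # pairs fused: 8192
print(video_tokens(128, TOKENS_PER_FRAME))                       # twice the frames: 16384
```

The last line shows the trade-off: with pairs fused, 128 frames cost the same number of tokens as 64 unfused frames.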
EVS — Efficient Video Sampling
EVS (Efficient Video Sampling) is applied at inference time: video tokens that are redundant, meaning they do not reflect changes between frames, are pruned. This reduces latency and improves throughput without sacrificing accuracy.
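The idea of dropping tokens that do not change between frames can be sketched with a simple frame-difference rule. This is a stand-in, not the actual EVS criterion, which the article does not describe: the L2 distance, the fixed threshold, and the function name `prune_static_tokens` are all assumptions.

```python
def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def prune_static_tokens(frames, threshold):
    """Keep tokens that differ from the same position in the previous frame.

    frames: list of frames, each a list of token vectors at fixed spatial positions.
    Returns [(frame_index, position, vector)] for the kept tokens.
    Assumption: every token of the first frame is kept as the reference.
    """
    kept = []
    for t, frame in enumerate(frames):
        for pos, vec in enumerate(frame):
            if t == 0 or l2_distance(vec, frames[t - 1][pos]) > threshold:
                kept.append((t, pos, vec))
    return kept


# Toy clip: position 0 never changes, position 1 changes once.
clip = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
kept = prune_static_tokens(clip, threshold=0.5)
print(len(kept), "of", sum(len(f) for f in clip), "tokens kept")  # 3 of 6 tokens kept
```

Keeping the frame index and position alongside each vector lets the remaining tokens retain their place in the sequence after pruning.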
Native audio input, not just text transcripts
- The model incorporates a native audio encoder (Parakeet-TDT-0.6B-v2) connected to the LLM backbone via a lightweight 2-layer MLP projector. This allows for joint processing of text and audio tokens within the same sequence.
- Audio is sampled at 16 kHz, and training covers clips of up to 1,200 seconds (20 minutes), so the model can handle long audio clips at inference.
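As a quick sanity check on those numbers, the sketch below converts durations to sample counts and tests whether a clip fits within the trained duration. The helper names are hypothetical, and the article does not say how the encoder maps samples to audio tokens, so that step is left out.

```python
SAMPLE_RATE_HZ = 16_000       # from the article
MAX_TRAINED_SECONDS = 1_200   # from the article (20 minutes)


def num_samples(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in a clip of the given duration."""
    return round(seconds * sample_rate_hz)


def within_trained_duration(n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """True if a clip is no longer than the maximum duration seen in training."""
    return n_samples / sample_rate_hz <= MAX_TRAINED_SECONDS


print(num_samples(MAX_TRAINED_SECONDS))               # 19200000 samples
print(within_trained_duration(num_samples(15 * 60)))  # True: 15 minutes fits
print(within_trained_duration(num_samples(30 * 60)))  # False: 30 minutes exceeds it
```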
Lightweight modality projectors and unified token interleaving
Each encoder’s feature space is projected into the shared embedding space using a lightweight 2-layer MLP projector. Once projected, vision, audio, and text tokens are interleaved and processed jointly.
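A minimal pure-Python sketch of the projection and interleaving step follows. The GELU activation, the random initialization, the tiny dimensions, and the names `MLPProjector` and `interleave` are assumptions made for illustration; the article only states that each projector is a 2-layer MLP and that the projected tokens are interleaved with text.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU; the activation choice is an assumption."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class MLPProjector:
    """Two linear layers with a nonlinearity between them, mapping encoder
    features to the language model's embedding width."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, in_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(out_dim, hidden_dim), [0.0] * out_dim

    def __call__(self, x):
        hidden = [gelu(v) for v in linear(x, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def interleave(segments, projectors, embed_text):
    """Flatten ordered (modality, items) segments into one embedding sequence."""
    sequence = []
    for modality, items in segments:
        fn = embed_text if modality == "text" else projectors[modality]
        sequence.extend(fn(item) for item in items)
    return sequence


EMBED_DIM = 3  # toy width; real models use thousands of dimensions
projectors = {
    "image": MLPProjector(in_dim=4, hidden_dim=8, out_dim=EMBED_DIM, seed=1),
    "audio": MLPProjector(in_dim=2, hidden_dim=8, out_dim=EMBED_DIM, seed=2),
}
segments = [
    ("text", [101, 102]),                # token ids
    ("image", [[0.1, 0.2, 0.3, 0.4]] * 3),  # vision encoder features
    ("audio", [[0.5, -0.5]] * 2),           # audio encoder features
]
sequence = interleave(segments, projectors, embed_text=lambda tok: [0.0] * EMBED_DIM)
```

Because every modality ends up with the same embedding width, the language model can treat the result as one ordinary token sequence.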
Training Data, Infrastructure & Systems Story
The supervised fine-tuning (SFT) stages are trained on NVIDIA H100 hardware, scaling from 32 to 128 nodes depending on the stage. The stack combines Megatron-LM, Transformer Engine, and Megatron Energon for distributed training and multimodal data loading.
Using RL to shape reliable multimodal behavior
- We introduce multi-environment text and omnimodal RL training in Nemotron 3 Nano Omni. The text RL stage uses NeMo Gym environments to evaluate the model’s ability to perform sequences of actions such as calling tools, writing code, and planning.
- The omni RL stage trains the model across a unified framework for tasks ranging from single-modality to fully multimodal scenarios. A diverse verifier suite evaluates outputs in various formats like multiple-choice questions, math problems, GUI grounding, and ASR, while intentionally including unanswerable cases to teach abstaining when evidence is insufficient.
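A verifier suite of this kind can be pictured as a set of per-task checkers plus a rule for unanswerable items. The sketch below is a simplified illustration, not the actual reward code: the `"unanswerable"` sentinel, the binary 0/1 rewards, and the two toy verifiers are all assumptions.

```python
ABSTAIN = "unanswerable"  # hypothetical sentinel the model emits when abstaining


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def verify_multiple_choice(prediction, gold):
    """Exact match on the normalized option label."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def verify_numeric(prediction, gold, tol=1e-6):
    """Numeric match within a tolerance; unparseable answers score zero."""
    try:
        return 1.0 if abs(float(prediction) - float(gold)) <= tol else 0.0
    except ValueError:
        return 0.0


VERIFIERS = {"multiple_choice": verify_multiple_choice, "math": verify_numeric}


def reward(task_type, prediction, gold):
    """Score one model output.

    gold is None for deliberately unanswerable items, in which case
    abstaining is the only correct output.
    """
    abstained = normalize(prediction) == ABSTAIN
    if gold is None:
        return 1.0 if abstained else 0.0
    if abstained:
        return 0.0
    return VERIFIERS[task_type](prediction, gold)


print(reward("multiple_choice", " B ", "b"))         # 1.0
print(reward("math", "3.50", "3.5"))                 # 1.0
print(reward("math", "42", None))                    # 0.0: answered an unanswerable item
print(reward("multiple_choice", "Unanswerable", None))  # 1.0: abstained correctly
```

Scoring an abstention as wrong on answerable items keeps the model from abstaining everywhere to collect reward.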
Data and Data Pipelines
Nemotron 3 Nano Omni is trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities. Synthetic data is introduced for complex scenarios where public datasets are limited, which expands task coverage.
To support scalable synthetic data generation, we build task-specific multi-stage pipelines that generate diverse and extensive training examples tailored to the model’s needs.
Key Takeaways
- NVIDIA Nemotron 3 Nano Omni is a robust multimodal model designed for real-world tasks like document analysis, ASR, video understanding, GUI assistance, and general reasoning.
- The model’s hybrid Mamba-Transformer-MoE backbone enables efficient long-context processing while preserving fine visual details and supporting rich audio understanding.
- Dynamic resolution and Conv3D temporal compression features allow Nemotron 3 Nano Omni to handle high-resolution images and videos effectively, making it suitable for diverse real-world applications.
- The model’s training pipeline leverages reinforcement learning (RL) techniques, including multi-environment text and omnimodal RL training, ensuring robust performance across various modalities and use cases.