Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Why this matters for makers and artists For creators and developers building AI tools, the latest update from NVIDIA offers a significant…

By AI Maestro June 4, 2026 7 min read
Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Why this matters for makers and artists

For creators and developers building AI tools, the latest update from NVIDIA offers a significant upgrade in how safety is handled within multimodal systems. The new Nemotron 3.5 model moves beyond simple text analysis to evaluate the complex interplay between user prompts, assistant responses, and images within a single context window. This approach catches policy violations that only emerge when text and visuals interact, rather than scoring them in isolation. Furthermore, the system now supports custom policy enforcement, allowing teams to define specific risk profiles for their applications—whether for healthcare, finance, or education—without relying on a rigid, universal taxonomy. For those concerned with transparency, an optional “think mode” provides auditable reasoning traces, detailing the step-by-step logic before a final safe or unsafe verdict is issued.

Key architectural shifts in Nemotron 3.5

The update deepens the integration of image understanding introduced in the previous version. Instead of treating inputs separately, the model processes a user prompt, an optional image, and an optional assistant response together to produce a coherent safety assessment. This closes a critical gap where violations arising from the interaction between text and image were previously missed.

Global language coverage remains a strong feature. The model maintains explicit training in 12 languages—English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, and Italian—while leveraging the Gemma 3 base model to provide strong zero-shot generalization across approximately 140 languages. This ensures that deployments in markets with sparse training data, such as Southeast Asia or Scandinavia, benefit from multilingual transfer without needing separate fine-tuning.

Custom Policy Enforcement

This is the most significant architectural addition. Production deployments rarely operate under a single universal safety taxonomy; a healthcare platform faces different risks than a financial services chatbot or a children’s education app. Nemotron 3.5 accepts a custom policy specification alongside the input, reasoning over these rules to produce its verdict rather than deferring entirely to a built-in taxonomy. This extends the work first introduced in the Nemotron Content Safety Reasoning 4B model to the full multimodal, multilingual setting.

Reasoning Traces (THINK Mode)

Every safety verdict can be accompanied by an auditable reasoning trace via an optional think mode. When enabled, the model outputs its step-by-step reasoning before delivering a final

safe

or

unsafe

label, and optionally the violated categories.

<think>
The user prompt asks for guidance on acquiring a controlled substance without a prescription.
The assistant response provides specific sourcing steps and references an online marketplace.
This interaction violates the Criminal Planning/Confessions and Controlled Substances categories.
The image (a pharmacy exterior) provides locational context but does not alter the verdict.
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances

When latency is the primary constraint, THINK mode can be disabled to return to the same low-latency binary verdict available in Nemotron 3.

Model Architecture and Inference Modes

Nemotron 3.5 Content Safety is built on the Google Gemma 3 4B IT base (4B parameters), providing a 128K context window, strong vision-language reasoning, and broad multilingual coverage. NVIDIA fine-tunes this base with a LoRA adapter that installs targeted safety classification behavior while keeping the model compact enough for real-time deployment on 8GB+ VRAM GPUs.

The inference interface supports three output modes:

  • Mode 1 — Low-latency binary verdict: Returns a simple safe or unsafe status.
  • Mode 2 — Binary verdict with categories: Adds specific violation categories to the verdict.
  • Mode 3 — THINK mode (reasoning + verdict): Includes the step-by-step reasoning trace alongside the final decision.

The safety taxonomy follows the Aegis 2.0 framework: 13 core categories aligned with the MLCommons safety taxonomy, plus 10 fine-grained subcategories. This alignment allows direct comparison with other open and closed guard systems benchmarked on Aegis-taxonomy datasets.

The Power of Reasoning in Safety

Reasoning acts as a supercharger for content safety classification by providing the necessary context, customization, and accountability required for production AI systems, especially in enterprise and regulated environments.

Enables Custom and Contextual Policy Enforcement

Reasoning allows a content safety model to dynamically interpret and enforce custom, domain-specific policies defined in natural language at the time of inference. This is necessary because production deployments rarely operate under a single, universal safety taxonomy. A financial services chatbot has a different risk profile than a children’s education app which may have a lower tolerance for profanity. This capability supports:

  • Category Suppression: Disabling irrelevant categories, such as preventing a “violence” category trigger when a DevOps tool handles the phrase “terminate a process”.
  • Custom Category Injection: Defining proprietary risk categories specific to an organization’s regulatory or product policies.

Provides Auditable and Documented Justification

The reasoning traces show the model’s step-by-step logic before it delivers a final safe or unsafe verdict. This documented justification serves several purposes:

  • Compliance and Audit Logging: Regulated industries often require documented justifications for content moderation decisions.
  • Human Review: Reviewers can audit why a verdict was reached to identify systematic model errors.
  • Policy Iteration: The traces reveal how the model interprets edge cases, allowing teams to iteratively refine and improve custom policy language.

Latency Management

While reasoning can introduce latency, the Nemotron model addresses this by condensing reasoning chains into concise summaries to limit output tokens and increase efficiency. This is done in a 2-step process similar to what was done in the predecessor model Nemotron-Content-Safety-Reasoning-4B. In the first step, larger models such as Qwen 397B generate chain-of-thought reasoning traces based upon provided prompts, images, and responses. Ground-truth labels are provided to avoid misclassification. In step 2, another large model such as Qwen 80B rephrases the original traces so that they fit in no more than 3 sentences. Based on experiments, most reasoning traces generated are under 3 sentences.

The efficient reasoning traces optimization allows for low-latency custom policy enforcement. Furthermore, the reasoning traces provide a valuable training signal that can be used for training specialized moderator models. Developers can choose a dual-mode operation, disabling reasoning for minimal latency in generic tasks or enabling it for complex policies.

Training Data Sources

The dataset driving Nemotron 3.5 is an evolution of the multimodal, multilingual blends used for Nemotron 3, with additions targeting the reasoning and custom-policy capabilities. The following sources were used:

  • Multilingual text safety data from Nemotron Safety Guard Dataset v3, sampled from culturally nuanced subsets with proportional representation across safety categories and safe/unsafe splits.
  • Human-annotated multimodal data collected in English by NVIDIA, translated into 12 languages. Critically, 99% of training images are real photographs—not synthetic generations. This directly addresses a known weakness in the multimodal safety benchmark landscape, where existing datasets like VLGuard and MM-SafetyBench rely heavily on SDXL-generated images that lack the cultural texture and adversarial complexity of production content. While not all of these real images could be released due to licensing constraints, a subset from Wikimedia and synthetic generation is available.
  • Safe multimodal data from Nemotron VLM Dataset v2, covering scanned documents, charts, papers, and diagrams with associated queries—ensuring the model does not over-flag benign professional content.
  • Reasoning traces derived from chain-of-thought outputs produced by larger teacher models—Qwen 397B and then shortened using Qwen 80B—are used to teach the model how to reason.
  • Topic following data from the CantTalkAboutThis dataset consisting of policy-specification/verdict pairs across a range of enterprise deployment scenarios (healthcare, finance, banking, education, etc.).
  • Synthetic data accounting for roughly 10% of total training volume, used primarily to diversify jailbreak patterns, generate rare policy violation examples, and produce multimodal adversarial cases.

Benchmarking Performance

Nemotron 3.5 Content Safety was evaluated across multilingual, multimodal, and custom-policy safety benchmarks, including VLGuard, MM-SafetyBench, PolyGuard, RTP-LX, Aya Redteaming, XSafety, MultiJail, Aegis, Dynaguardrail, and CoSA. These evaluations reflect the core production challenge for enterprise safety: applying consistent guardrails across global languages, text and image inputs, and domain-specific policies without adding significant latency.

Nemotron 3 set a strong baseline with 84% average accuracy on multimodal harmful-content tests and roughly half the latency of LlamaGuard-4-12B. Nemotron 3.5 maintains that compact 4B efficiency while adding custom policy support and reasoning traces.

Across multilingual and multimodal safety benchmarks, Nemotron 3.5 delivers strong harmful-content classification accuracy while maintaining a compact footprint. This matters because many safety models remain English-first, text-only, or too costly to run repeatedly in production pipelines. Nemotron 3.5 is designed to combine multilingual coverage, multimodal classification, custom-policy support, and low-latency deployment in one model.

On Multilingual Aegis, Nemotron 3.5 averages 96.5% harmful-content classification accuracy across 12 languages. On RTP-LX, it averages 88.8%, for a combined Aegis and RTP-LX average of 92.7%. This consistency helps teams apply the same safety posture across customer, employee, and partner-facing workflows instead of relying on English-only moderation or separate regional safety models.

Key takeaways

  • Unified Multimodal Evaluation: The model assesses user prompts, images, and assistant responses in a single pass, catching policy violations that emerge from the interaction between text and visuals.
  • Custom Policy Enforcement: Organizations can define specific risk profiles and inject proprietary categories, moving beyond rigid universal taxonomies to fit healthcare, finance, or education needs.
  • Auditable Reasoning: An optional THINK mode provides concise, step-by-step reasoning traces, offering transparency and auditability for regulated environments without sacrificing latency.
  • Global Efficiency: Built on a compact

    Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

    Name
Scroll to Top