Blackwell LLM Toolkit - NVFP4 Config + Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tok/s Nemotron 3 Omni


Overview

I was trying to assemble a good set of NVFP4 models to take advantage of the RTX Pro 6000, and along the way I set up configs, built wheels, and ran benchmarks. Hopefully this helps some folks out.

This should work on all NVIDIA Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit in VRAM, for example by stacking two 5070 Ti cards.
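As a rough way to reason about "as long as the models fit", the sketch below estimates NVFP4 weight size and compares it with card VRAM. It assumes NVFP4's layout of 4-bit values plus one 8-bit scale per 16-element block (about 4.5 bits per weight). The VRAM figures are spec-sheet values. The 30B parameter count, the `headroom` fraction reserved for KV cache and activations, and the function names are illustrative assumptions, not taken from the toolkit. Multi-GPU splits also carry overhead that this ignores.

```python
# Spec-sheet VRAM per card, in GB.
VRAM_GB = {"RTX Pro 6000": 96, "RTX 5090": 32, "RTX 5080": 16, "RTX 5070 Ti": 16}

# NVFP4: 4-bit values + one 8-bit scale per 16-element block
# (per-tensor scales are negligible and ignored here).
NVFP4_BITS_PER_WEIGHT = 4 + 8 / 16  # 4.5


def nvfp4_weight_gb(params_billion: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * NVFP4_BITS_PER_WEIGHT / 8


def fits(params_billion: float, card: str, num_cards: int = 1,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the total VRAM.

    `headroom` is an assumed fraction; the rest is left for KV cache,
    activations, and runtime overhead.
    """
    budget_gb = VRAM_GB[card] * num_cards * headroom
    return nvfp4_weight_gb(params_billion) <= budget_gb


# Hypothetical 30B-parameter model.
for card, n in [("RTX 5070 Ti", 1), ("RTX 5070 Ti", 2), ("RTX 5090", 1)]:
    print(f"{n} x {card}: fits={fits(30, card, n)}")
```

This is only a first-pass filter. Actual headroom depends on context length and the runtime's own allocations.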

Gotchas & Solutions

TensorRT-LLM launch flags: Some obscure settings had to be enabled to make TensorRT-LLM run newer Mamba-hybrid models. The YAML file with those settings is in the repo at configs/trtllm/nemotron-omni-v3-sm120.yaml.
LMCache: Offloading the context (KV cache) to SSD frees VRAM for the model. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source. Both the prebuilt wheel and the build script are in the repo.
Research docs: Helpful AI-generated deep dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). The Qwen 3.5/3.6 one in particular saved me from a nasty trap: these models look like a renamed Qwen3-VL but use a completely different architecture under the hood.
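The "missing sm_120 cubins" failure is a general class of problem: a binary built without code for the GPU's compute capability. The sketch below classifies a list of compiled architectures against a target device using CUDA's compatibility rules (a cubin runs on the same major version with an equal or higher minor version; PTX can be JIT-compiled for newer devices). The helper name `arch_support` is hypothetical. For a PyTorch build, such a list can be obtained from `torch.cuda.get_arch_list()`; checking the LMCache wheel itself would need CUDA binary tools, which this does not cover.

```python
import re

# Matches entries such as "sm_90", "sm_90a", "compute_120".
_ARCH = re.compile(r"^(sm|compute)_(\d+)([a-z]?)$")


def arch_support(arch_list, major, minor):
    """Classify support for a GPU of compute capability major.minor.

    Returns "native" if a compatible cubin is listed, "ptx" if only
    JIT-compilable PTX is available, otherwise "none".
    """
    best = "none"
    for entry in arch_list:
        m = _ARCH.match(entry)
        if not m:
            continue
        kind, num, suffix = m.group(1), int(m.group(2)), m.group(3)
        e_major, e_minor = divmod(num, 10)
        if kind == "sm":
            exact = (e_major, e_minor) == (major, minor)
            same_family = e_major == major and e_minor <= minor
            # Arch-specific builds (suffix, e.g. "sm_90a") only count on exact match.
            if exact or (same_family and not suffix):
                return "native"
        elif not suffix and (e_major, e_minor) <= (major, minor):
            best = "ptx"  # PTX from an older target can be JIT-compiled
    return best


# Blackwell RTX cards report compute capability 12.0 (sm_120).
print(arch_support(["sm_80", "sm_86", "sm_90"], 12, 0))
print(arch_support(["sm_80", "sm_90", "compute_90"], 12, 0))
print(arch_support(["sm_90", "sm_120"], 12, 0))
```

A result of "none" for 12.0 corresponds to the kind of crash described above, and rebuilding from source with sm_120 in the target list is the usual fix.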

Benchmark Highlights

Nemotron-3-Nano-Omni V3 (multimodal, image/video/audio + text): Tested at 8k context → 270 tok/s. Fastest and handles all modalities. Needs TRT-LLM v1.3.0rc13.
Nemotron-3-Nano (text only): Tested at 8k context → 249 tok/s. Best for tool-calling agents (10/10 on tools).
DeepSeek-V4-Flash: IQ2_XXS-XL GGUF, tested at 65k context → 31 tok/s. Best for complex reasoning (9/10 intelligence + 10/10 tools + 13/13 calibration).
MiniMax-M2.7-REAP-172B: Q3_K_S GGUF, tested at 196k context → 117 tok/s. Suited to long conversations.
MiniMax-M2.7 W4A16 (with LMCache, Optane SSD): Tested at 154k context → 20-22 tok/s. Long-context runs with W4A16-quality answers, KV cache offloaded to SSD.
MiniMax-M2.7 W4A16 (short context, no LMCache): Tested at 64k context → 22-25 tok/s. Highest-quality short answers (10/10 intelligence).
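To put these rates in wall-clock terms, the sketch below converts each reported decode speed into the time needed to generate a 1,000-token response. The figures are the ones listed above, using the lower bound where a range was reported. The 1,000-token response length is an arbitrary illustration, and prefill time (which grows with context length) is not included.

```python
# (model, tested context, reported decode tok/s) from the list above.
RESULTS = [
    ("Nemotron-3-Nano-Omni V3", "8k", 270.0),
    ("Nemotron-3-Nano", "8k", 249.0),
    ("DeepSeek-V4-Flash", "65k", 31.0),
    ("MiniMax-M2.7-REAP-172B", "196k", 117.0),
    ("MiniMax-M2.7 W4A16 + LMCache", "154k", 20.0),  # lower bound of 20-22
    ("MiniMax-M2.7 W4A16", "64k", 22.0),             # lower bound of 22-25
]


def decode_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Seconds to decode num_tokens at a sustained rate (prefill excluded)."""
    return num_tokens / tok_per_s


# Print fastest first.
for name, ctx, rate in sorted(RESULTS, key=lambda r: r[2], reverse=True):
    secs = decode_seconds(1000, rate)
    print(f"{name:30s} {ctx:>5s} ctx {rate:5.0f} tok/s {secs:6.1f} s / 1,000 tokens")
```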

Bench Tools Used to Validate

`rapid_bench.py`: 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
`bench_harness.py`: sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode for the long-ctx mjpansa runs
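For readers unfamiliar with the metrics, here is a minimal sketch of how TTFT (time to first token) and sustained decode rate can be derived from a streamed response. It is not the code in `bench_harness.py`, whose internals are not shown here. `measure_stream`, `StreamStats`, and `fake_stream` are hypothetical names, and the stream is simulated with `time.sleep` rather than a real server.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamStats(NamedTuple):
    ttft_s: float            # request start -> first token
    decode_tok_per_s: float  # tokens after the first, per second
    tokens: int


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamStats:
    """Consume a token stream and time first-token latency and decode rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    # The first token belongs to prefill, so it is excluded from the rate.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return StreamStats(first - start, rate, count)


def fake_stream(n: int, prefill_s: float = 0.05, per_token_s: float = 0.005):
    """Simulated server: one prefill delay, then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(20))
print(f"TTFT {stats.ttft_s * 1000:.0f} ms, "
      f"decode {stats.decode_tok_per_s:.0f} tok/s over {stats.tokens} tokens")
```

Separating the first token from the rest keeps prefill cost (which dominates at long context) from distorting the decode figure. The injectable `clock` is there only to make the function easy to test.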

Key Takeaways

The Blackwell LLM toolkit provides TensorRT-LLM configs and prebuilt wheels for running NVFP4 models on Blackwell GPUs.
Measured decode speeds range from 20-22 tok/s (MiniMax-M2.7 W4A16 at 154k context with LMCache) to 270 tok/s (Nemotron-3-Nano-Omni V3 at 8k context).
LMCache makes long-context runs feasible by offloading the KV cache to SSD, which frees VRAM for the model. In these tests the 154k-context run with LMCache reached 20-22 tok/s, compared with 22-25 tok/s at 64k context without it.

