Overview
I was trying to get a good set of NVFP4 models running on the RTX Pro 6000 and managed to set up configs, build wheels, and run benchmarks. Hopefully this helps some folks out.
This should work on all Nvidia Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit (for example, maybe by stacking 2x 5070 Ti).
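To make "as long as the models fit" a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes NVFP4 costs about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-element block) and counts weights only. The `headroom` factor and the 30B example model are hypothetical, not taken from the repo.

```python
# Rough check of whether NVFP4 weights fit in VRAM (weights only).
# Assumption: 4-bit values plus one 8-bit scale per 16-element block,
# i.e. about 4.5 bits per weight. KV cache, activations, and any layers
# kept in higher precision are NOT counted here.
NVFP4_BITS_PER_WEIGHT = 4 + 8 / 16  # 4.5

# Advertised VRAM sizes in GB.
VRAM_GB = {"RTX Pro 6000": 96, "5090": 32, "5080": 16, "5070 Ti": 16}


def nvfp4_weight_gb(params_billion: float) -> float:
    """Approximate NVFP4 weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * NVFP4_BITS_PER_WEIGHT / 8 / 1e9


def fits(params_billion: float, cards: list[str], headroom: float = 0.8) -> bool:
    """True if the weights fit within `headroom` of the combined VRAM.

    `headroom` is a hypothetical safety factor that leaves room for the
    KV cache and runtime buffers; tune it for your workload.
    """
    total_gb = sum(VRAM_GB[card] for card in cards)
    return nvfp4_weight_gb(params_billion) <= total_gb * headroom


if __name__ == "__main__":
    # Hypothetical 30B-parameter model on one vs. two 5070 Ti cards.
    print(fits(30, ["5070 Ti"]))
    print(fits(30, ["5070 Ti", "5070 Ti"]))
```

Treat the result as a first filter only. Long contexts can need far more memory than the weights themselves.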
Gotchas & Solutions
- TensorRT-LLM launch flags: Some obscure settings had to be enabled to make TensorRT-LLM run the newer Mamba-hybrid models. The YAML file is in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`.
- LMCache: Offloading context to SSD frees up VRAM for the model. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source. Both the prebuilt wheel and the build script are in the repo.
- Research docs: Helpful AI-generated deep dives on what’s actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). The Qwen 3.5/3.6 one in particular saved me from a nasty trap: they look like a renamed Qwen3-VL but are a completely different architecture under the hood.
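If you want to check a compiled extension for the missing-cubin problem before it crashes at runtime, one option is to inspect the output of `cuobjdump --list-elf`. The sketch below is not the build script from the repo. It assumes the listing contains cubin names with an `sm_XX` tag in them, and the sample listing and file path are made up for illustration.

```python
import re
import subprocess


def cubin_archs(listing: str) -> set[str]:
    """Collect the sm_XX architecture tags that appear in a cubin listing."""
    return set(re.findall(r"sm_\d+[a-z]?", listing))


def has_arch(listing: str, arch: str = "sm_120") -> bool:
    """True if the listing contains a cubin built for `arch`."""
    return arch in cubin_archs(listing)


def list_elf(shared_object: str) -> str:
    """Run `cuobjdump --list-elf` on a compiled extension (needs the CUDA toolkit)."""
    result = subprocess.run(
        ["cuobjdump", "--list-elf", shared_object],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Illustrative listing, not real output from any particular wheel.
    sample = (
        "ELF file    1: ext.1.sm_80.cubin\n"
        "ELF file    2: ext.2.sm_90.cubin\n"
    )
    print(sorted(cubin_archs(sample)))
    print(has_arch(sample))  # no sm_120 cubin in this sample
```

A missing cubin is not always fatal if the binary also embeds PTX that the driver can JIT-compile, so treat a negative result as a hint to rebuild rather than proof of a crash.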
Benchmark Highlights
- Nemotron-3-Nano-Omni V3 (multimodal: image/video/audio + text): Tested at 8k context → 270 tok/s. Fastest and handles all modalities. Needs TRT-LLM v1.3.0rc13.
- Nemotron-3-Nano (text only): Tested at 8k context → 249 tok/s. Best for tool-calling agents (10/10 on tools).
- DeepSeek-V4-Flash: IQ2_XXS-XL GGUF, tested at 65k context → 31 tok/s. Best for complex reasoning (9/10 intel + 10/10 tools + 13/13 calibration).
- MiniMax-M2.7-REAP-172B: Q3_K_S GGUF, tested at 196k context → 117 tok/s. Long conversations.
- MiniMax-M2.7 W4A16 (with LMCache, Optane SSD): Tested at 154k context → 20-22 tok/s. Long-ctx with W4A16-quality answers, KV cache offloaded to SSD.
- MiniMax-M2.7 W4A16 (short ctx, no LMCache): Tested at 64k context → 22-25 tok/s. Highest-quality short answers (10/10 intel).
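To see why offloading the KV cache matters at these context lengths, here is a small sizing sketch. It uses the textbook formula for standard (grouped-query) attention layers and does not apply to Mamba or other non-attention layers. The layer count, head count, and head dimension below are hypothetical placeholders, not the real dimensions of any model listed above.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # 2 bytes for FP16/BF16
) -> int:
    """KV-cache size for standard attention layers.

    The factor 2 covers keys and values, each stored per layer,
    per KV head, per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    # Hypothetical dimensions chosen only to illustrate the scale.
    size = kv_cache_bytes(154_000, n_layers=60, n_kv_heads=8, head_dim=128)
    print(f"{size / 1e9:.1f} GB of KV cache at 154k tokens")
```

With these placeholder dimensions a single 154k-token sequence already needs roughly 38 GB, which is the kind of footprint that competes with the model weights for VRAM.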
Bench Tools Used to Validate
- `rapid_bench.py`: 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
- `bench_harness.py`: sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode for the long-ctx mjpansa runs
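For anyone unfamiliar with the metrics, TTFT and sustained decode speed can both be derived from the arrival times of streamed tokens. The following is a minimal sketch of that arithmetic, not the actual `bench_harness.py`; the `StreamStats` type and `stream_stats` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float        # time to first token, in seconds
    decode_tok_s: float  # tokens per second after the first token


def stream_stats(request_start: float, token_times: list[float]) -> StreamStats:
    """Compute TTFT and decode throughput from token arrival timestamps.

    `token_times` holds one timestamp per generated token, in arrival order,
    on the same clock as `request_start` (e.g. time.perf_counter()).
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode rate")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # The first token belongs to prefill, so only the later ones count here.
    decode_tok_s = (len(token_times) - 1) / decode_window
    return StreamStats(ttft_s=ttft, decode_tok_s=decode_tok_s)


if __name__ == "__main__":
    # Made-up timestamps: first token after 0.5 s, then one every 0.1 s.
    stats = stream_stats(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
    print(stats)
```

Separating the first token from the rest keeps a long prefill from dragging down the reported decode speed, which matters most at long context lengths.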
Key Takeaways
- The Blackwell LLM toolkit provides configs and wheels for running NVFP4 models on Nvidia Blackwell GPUs.
- Measured decode speeds range from 20-31 tok/s (MiniMax-M2.7 W4A16 and DeepSeek-V4-Flash) up to 270 tok/s (Nemotron-3-Nano-Omni V3 at 8k context).
- LMCache offloads the KV cache to SSD, which frees VRAM for long-context runs: MiniMax-M2.7 W4A16 was tested at 154k context with LMCache (20-22 tok/s) versus 64k context without it (22-25 tok/s).
Originally published at reddit.com. Curated by AI Maestro.