Out of random curiosity, I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE quant from unsloth for a while because it’s really fast on my system, but I was curious about precision. I knew quantization hurts the model, but I don’t think I really understood how much until I tested it myself.
- Hardware: 3× R9700 PRO (96 GB VRAM)
- Backend: llama.cpp Vulkan
- Evaluation: wikitext-2 (583 chunks, ctx 512)
- Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M
TL;DR
UD-Q5_K_M is cooking! Better quality than the smaller 4-bit formats, with barely any speed penalty. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.
The Numbers
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
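For readers unfamiliar with the metrics in the table, here is a minimal sketch of how same-top-1 rate, mean KL divergence, and max KL can be computed from per-token logits. It assumes the quantized model is compared against a higher-precision reference on the same text (the post does not say which reference was used), and the function names and toy logits are hypothetical, not taken from llama.cpp.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, where p is the reference and q the quantized model."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (same_top1_rate, mean_kl, max_kl) over all token positions."""
    agree = 0
    kls = []
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        # "Same top-1": both models rank the same token first at this position.
        agree += int(p.index(max(p)) == q.index(max(q)))
        kls.append(kl_divergence(p, q))
    n = len(kls)
    return agree / n, sum(kls) / n, max(kls)

# Toy example: 3 token positions over a 3-token vocabulary (made-up numbers).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0], [1.0, 1.0, 3.0]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.8, 0.0], [1.0, 1.1, 2.9]]

same_top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {same_top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

Mean KL captures average drift from the reference distribution, while max KL shows how badly the single worst token diverges, which is why the two can rank formats differently.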
UD-Q5_K_M wins on literally every quality metric
UD-Q5_K_M is the clear quality winner across all metrics tested, and it is only about 10.5 GB larger than MXFP4. Per-token agreement compounds over long outputs, so the differences are stark:
- MXFP4 (89.4% same top-1) → 100-token output: 0.0014% chance of perfect agreement
- UD-Q5_K_M (94.0% same top-1) → 100-token output: 0.21% chance of perfect agreement
This is not a small difference when we’re talking about long refactoring tasks or multi-step reasoning, where errors can compound exponentially. MXFP4 “goes off the rails” more frequently.
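The perfect-agreement percentages above can be reproduced with a quick calculation. This sketch assumes each token matches the reference independently with probability equal to the same-top-1 rate, which is a simplification (real token errors are correlated); the function name is just for illustration.

```python
def perfect_agreement_probability(top1_rate, n_tokens):
    """Chance that all n_tokens match the reference, assuming independent tokens."""
    return top1_rate ** n_tokens

# Same-top-1 rates taken from the table above.
rates = {"MXFP4": 0.894, "UD-Q5_K_M": 0.940}

results = {}
for name, rate in rates.items():
    results[name] = perfect_agreement_probability(rate, 100)
    print(f"{name}: {results[name] * 100:.4f}% chance of a perfectly matching 100-token output")

# How many times more likely UD-Q5_K_M is to match over 100 tokens.
ratio = results["UD-Q5_K_M"] / results["MXFP4"]
print(f"ratio: {ratio:.0f}x")
```

A 4.6-point gap in per-token agreement turns into a gap of roughly two orders of magnitude over 100 tokens, which is the compounding effect described above.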
Speed Trade-Offs
- Prefill (batch 512): MXFP4 still fastest (hardware kernels)
- Prefill (batch 4096): MXFP4 wins again
- Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger
For interactive coding tasks (which are decode-bound anyway), the speed hit from moving to a higher-precision quant is negligible.
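The "22% larger" figure follows from the file sizes in the table. A small sketch of that arithmetic, using only the sizes reported above (the helper name is hypothetical):

```python
# File sizes in GB, as reported in the results table.
sizes_gb = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def relative_increase(base, other):
    """Fractional size increase of `other` over `base`."""
    return (other - base) / base

ud_vs_q4 = relative_increase(sizes_gb["Q4_K_M"], sizes_gb["UD-Q5_K_M"])
ud_vs_mxfp4 = relative_increase(sizes_gb["MXFP4"], sizes_gb["UD-Q5_K_M"])

print(f"UD-Q5_K_M vs Q4_K_M: {ud_vs_q4:.1%} larger")
print(f"UD-Q5_K_M vs MXFP4:  {ud_vs_mxfp4:.1%} larger")
```

So a decode slowdown of under 9% buys a model with roughly a fifth more bytes to read per token, which is a better trade than the size difference alone would suggest.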
Conclusion
I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads, but for daily code generation, where quality matters more than speed, UD-Q5 is the clear winner.
What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you’re on Nvidia hardware, are you seeing different tradeoffs than RDNA?
Submitted by /u/alphatrad
Originally published at reddit.com. Curated by AI Maestro.