I ran a quantization shootout on Qwen3-Coder and the results are… interesting.
Key Takeaways
- UD-Q5_K_M is the clear quality winner on this code model: a 94.0% same-top-1 rate and a mean KL divergence of 0.0217, versus 89.4% and 0.0746 for MXFP4.
- The speed/quality trade-off is nuanced: UD-Q5_K_M decodes slightly slower than MXFP4_MOE, but its higher token accuracy makes it the better choice when quality matters more than speed.
- For interactive coding tasks, which are typically decode-bound, the speed difference between MXFP4 and UD-Q5_K_M is negligible. For prefill-heavy workloads that need fast responses, MXFP4 remains the faster option.
The Numbers
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.
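For anyone unsure what the three quality rows measure, here is a minimal sketch of how they can be computed from per-position logits. It is not the tooling used for the table above; the function names (`softmax`, `kl_divergence`, `same_top1`, `compare`) and the toy logits are made up for illustration, and it assumes the quantized model is compared against a higher-precision reference run on the same text.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference distribution, q the quantized one.

    Assumes q has no exact zeros, which holds for softmax outputs of
    moderate logits like the toy data below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def same_top1(p, q):
    """True when both distributions rank the same token first."""
    return p.index(max(p)) == q.index(max(q))

def compare(ref_logits, quant_logits):
    """Aggregate per-position metrics over parallel lists of logit vectors."""
    kls, hits = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        hits += same_top1(p, q)
    n = len(kls)
    return {"same_top1": hits / n, "mean_kl": sum(kls) / n, "max_kl": max(kls)}

# Toy data: 3 positions over a 4-token vocabulary (hypothetical numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.0], [1.0, 1.1, 3.0, 0.2]]
quant = [[1.9, 1.1, 0.0, -0.9], [2.4, 0.6, 0.1, 0.0], [1.0, 1.2, 2.8, 0.3]]
print(compare(ref, quant))
```

The max KL row is worth watching separately from the mean: a single position where the quantized model is confidently wrong can be enough to derail a generation, even when the average looks fine.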
A ~5-point difference in per-token agreement compounds into a roughly 150× difference by token 100. All LLMs are autoregressive, so every token is conditioned on the ones before it. Yann LeCun is always talking about this: LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen. If you treat each token's agreement as independent:
- MXFP4 (89.4%), 100-token output: 0.0014% chance of perfect agreement
- UD-Q5_K_M (94%), 100-token output: 0.21% chance of perfect agreement
Neither is a big number, but on long refactoring tasks or multi-step reasoning, you feel the gap. MXFP4 “goes off the rails” way more often.
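The compounding arithmetic is easy to check yourself. This sketch raises each per-token agreement rate to the 100th power and prints the ratio between the two results. It assumes every token agrees or disagrees independently, which is a simplification (real errors are correlated), and the helper name `perfect_agreement` is just illustrative.

```python
def perfect_agreement(p_token, n_tokens):
    """Chance that all n_tokens match, assuming independent per-token agreement."""
    return p_token ** n_tokens

# Same-top-1 rates taken from the table above.
rates = {"MXFP4": 0.894, "UD-Q5_K_M": 0.940}

for name, p in rates.items():
    print(f"{name}: {perfect_agreement(p, 100):.4%} chance of 100 matching tokens")

ratio = perfect_agreement(rates["UD-Q5_K_M"], 100) / perfect_agreement(rates["MXFP4"], 100)
print(f"UD-Q5_K_M is about {ratio:.0f}x more likely to match all 100 tokens")
```

Change the `100` to your typical output length to see how quickly the gap widens for longer generations.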
The Hardware and Backend Details
- Hardware: 3× R9700 PRO (96 GB VRAM)
- Backend: llama.cpp Vulkan
- Evaluation Dataset: wikitext-2 (583 chunks, ctx 512)
Image Credit: /u/alphatrad
What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you’re on Nvidia hardware, are you seeing different tradeoffs than RDNA?