
Key Takeaways
- UD-Q5_K_M performs better than expected, at a file size about 23% larger than MXFP4 (55.2 GB vs. 44.7 GB).
- The quality improvements in UD-Q5_K_M outweigh the speed trade-offs, especially for tasks requiring high accuracy over speed.
- Nvidia hardware may offer different performance characteristics compared to RDNA architecture when using aggressive quantization techniques.
Out of random curiosity, I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE quant from unsloth for a while because it’s really fast on my system, but I was curious about precision, so I tested several quantizations to see how they compare.
Hardware
- Hardware: 3× R9700 PRO (96 GB VRAM)
- Backend: llama.cpp Vulkan
- Evaluation dataset: wikitext-2 (583 chunks, ctx 512)
- Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M
The results show that UD-Q5_K_M has better quality than its smaller counterparts without any significant speed penalty. The dynamic precision approach used by unsloth is very effective.
The Numbers
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 accuracy | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
UD-Q5_K_M outperforms all other formats on every quality metric, at 55.2 GB versus 44.7 GB for MXFP4 (about 23% larger).
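For readers unfamiliar with the three quality rows in the table, here is a minimal sketch of how such metrics can be computed from per-token logits. It assumes the quantized model is compared against a higher-precision reference run on the same text; the post does not say which reference was used. The function names and the toy logits are hypothetical, and this is not llama.cpp's implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, where p is the reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare(ref_logits, quant_logits):
    """Summarize agreement between a reference and a quantized model.

    Both arguments are lists of per-position logit vectors over the same vocabulary.
    """
    kls = []
    same_top = 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        same_top += argmax(p) == argmax(q)
    n = len(kls)
    return {
        "same_top1": same_top / n,   # fraction of positions with the same top token
        "mean_kl": sum(kls) / n,     # average divergence across positions
        "max_kl": max(kls),          # the single worst position
    }

# Toy data (hypothetical): 3 positions, vocabulary of 3 tokens.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.1, 3.0]]
quant = [[1.9, 1.1, 0.1], [2.4, 0.6, 0.3], [1.0, 1.2, 2.9]]
stats = compare(ref, quant)
print(stats)
```

In the toy data only the middle position flips its top token, so it dominates both the mean and the max KL. That is the same pattern the "Max KL (worst token)" row is meant to surface.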
Token Accuracy and Performance
- MXFP4 (89.4%): The roughly 5-point gap in per-token agreement compounds quickly. If each token matched independently, the chance of a fully matching 100-token output would be 0.894^100 ≈ 1.4 × 10^-5, versus 0.94^100 ≈ 2.1 × 10^-3 for UD-Q5_K_M, a gap of roughly 150×.
- UD-Q5_K_M (94%): The probability of a fully matching 100-token output is far higher, which suggests it holds up better in long generations and reasoning tasks.
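The compounding argument can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes each token matches independently with the top-1 agreement rate from the table. Real decoding is not independent, since one divergent token changes everything after it, so treat the result as an illustration and not a measurement. The helper name `p_all_match` is made up for this example.

```python
def p_all_match(per_token_agreement, n_tokens):
    """Probability that every one of n_tokens matches, assuming independence."""
    return per_token_agreement ** n_tokens

# Top-1 agreement rates taken from the table above.
agreement = {
    "MXFP4": 0.894,
    "Q4_K_M": 0.896,
    "Q5_K_M": 0.930,
    "UD-Q5_K_M": 0.940,
}

N = 100  # output length in tokens
probs = {name: p_all_match(a, N) for name, a in agreement.items()}

for name, p in probs.items():
    print(f"{name:>10}: {p:.2e}")

ratio = probs["UD-Q5_K_M"] / probs["MXFP4"]
print(f"UD-Q5_K_M vs MXFP4 at {N} tokens: {ratio:.0f}x")
```

Changing `N` shows how fast the gap widens with output length, which is the reason small per-token differences matter more for long generations.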
For interactive coding tasks, which are mostly decode-bound, the speed hit from moving to UD-Q5_K_M from a more aggressive quantization such as Q4_K_M or MXFP4 is negligible. For daily code generation, where quality matters more than speed, UD-Q5_K_M provides a clear advantage.
I’ve switched my default format from MXFP4 to UD-Q5_K_M. MXFP4 remains great for heavy prefill workloads, but for tasks requiring high-quality outputs like code generation, UD-Q5_K_M is now my preferred choice.
Discussion
- What quants are you guys running for code models? Are you finding similar quality cliffs with aggressive compression?
- On Nvidia hardware, are you seeing different trade-offs than on RDNA?
Submitted by /u/alphatrad
Originally published at reddit.com. Curated by AI Maestro.




