Quantization Shootout Results
I ran a quantization shootout on Qwen3-Coder-Next, comparing four quantization formats to see how each one affects output quality and inference speed.
Hardware and Setup
- Hardware: 3× R9700 PRO (96 GB VRAM)
- Backend: llama.cpp Vulkan
- Evaluation: wikitext-2 (583 chunks, ctx 512 tokens)
- Formats Tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M
The Results
The UD-Q5_K_M format stood out as having the best balance between quality and file size. It picked the same top-1 token as the reference 94.0% of the time, which is 1.0 point ahead of Q5_K_M (93.0%) and more than 4 points ahead of Q4_K_M (89.6%) and MXFP4 (89.4%).
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
UD-Q5_K_M wins on every quality metric tested.
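For readers unfamiliar with the three quality rows, here is a minimal sketch of how they can be computed from two sets of next-token logits. It assumes the metrics compare each quantized model against a higher-precision reference (the post does not say which one), and the function names and toy logits are made up for illustration; this is not the evaluation code used for the table.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference. Assumes q has no exact zeros."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (same_top1_fraction, mean_kl, max_kl) over all token positions."""
    same = 0
    kls = []
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        # Count positions where both models rank the same token first.
        same += p.index(max(p)) == q.index(max(q))
        kls.append(kl_divergence(p, q))
    n = len(kls)
    return same / n, sum(kls) / n, max(kls)

# Toy data: 3 token positions, vocabulary of 4 (made-up numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.0], [1.0, 1.1, 3.0, 0.2]]
quant = [[1.9, 1.1, 0.0, -1.0], [2.6, 2.5, 0.1, 0.0], [1.0, 1.0, 2.9, 0.3]]
same_top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {same_top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

Mean KL captures average drift from the reference distribution, while max KL flags the single worst token, which is why a format can look fine on average and still have a bad outlier.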
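A roughly 5-point difference in per-token agreement compounds over long sequences: treating each token as independent, the chance that all 100 consecutive tokens match is about 0.94^100 ≈ 0.2% at UD-Q5_K_M's rate versus 0.894^100 ≈ 0.001% at MXFP4's, a gap of roughly 150×. MXFP4 reached 89.4% top-1 agreement compared to UD-Q5_K_M's 94.0%, so it trails on quality despite being the smallest file.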
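The compounding argument can be checked with a few lines of arithmetic. The sketch below assumes each token agrees independently with the per-token rate from the table, which is a simplification (real errors are correlated, and one divergent token changes everything after it); the helper names are illustrative.

```python
# Per-token top-1 agreement rates taken from the results table.
agreement = {"MXFP4": 0.894, "Q4_K_M": 0.896, "Q5_K_M": 0.930, "UD-Q5_K_M": 0.940}

def run_probability(p, n_tokens):
    """Chance that n_tokens consecutive tokens all match, assuming independence."""
    return p ** n_tokens

def compounding_ratio(p_high, p_low, n_tokens):
    """How many times more likely the higher-agreement format is to match throughout."""
    return (p_high / p_low) ** n_tokens

n = 100
for name, p in agreement.items():
    print(f"{name:10s} P(all {n} tokens match) = {run_probability(p, n):.2e}")

ratio = compounding_ratio(agreement["UD-Q5_K_M"], agreement["MXFP4"], n)
print(f"UD-Q5_K_M vs MXFP4 at {n} tokens: {ratio:.0f}x")
```

Under this simplification the table's values give a gap on the order of 10^2 at 100 tokens, and the gap keeps growing exponentially with sequence length.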
The file size difference is notable: UD-Q5_K_M (55.2 GB) is 10.5 GB larger than MXFP4 (44.7 GB), or about 23% more.
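One way to weigh that extra size against the quality gain is to express both as changes relative to a baseline. This is a small sketch using the file sizes and mean KL values from the table; the choice of MXFP4 as baseline and the `relative_to` helper are assumptions made for illustration.

```python
# File size (GB) and mean KL divergence copied from the results table.
results = {
    "MXFP4": (44.7, 0.0746),
    "Q4_K_M": (45.2, 0.0685),
    "Q5_K_M": (52.9, 0.0308),
    "UD-Q5_K_M": (55.2, 0.0217),
}

def relative_to(baseline, results):
    """Return {format: (size change, KL change)} as fractions of the baseline."""
    base_size, base_kl = results[baseline]
    return {
        name: (size / base_size - 1.0, kl / base_kl - 1.0)
        for name, (size, kl) in results.items()
    }

deltas = relative_to("MXFP4", results)
for name, (d_size, d_kl) in deltas.items():
    print(f"{name:10s} size {d_size:+.1%}  mean KL {d_kl:+.1%}")
```

Read this way, the larger formats buy a much bigger proportional drop in mean KL than they cost in disk and VRAM footprint.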
Performance
- Prefill (batch 512): MXFP4 was the fastest due to hardware kernels.
- Prefill (batch 4096): MXFP4 maintained its lead.
- Decode: Q4_K_M edges out UD-Q5_K_M slightly, but UD-Q5_K_M is within 9% despite being 22% larger.
For interactive coding tasks (which are decode-bound), the speed hit from using UD-Q5_K_M was negligible. I have since switched my default format to UD-Q5_K_M for daily code generation, as it offers a better quality-to-size trade-off compared to MXFP4.
Quants for Code Models
What quantization techniques are you using for your code models? Have you noticed similar quality cliffs with aggressive compression?
Key Takeaways
- The UD-Q5_K_M format provides the best balance between quality and file size.
- A roughly 5-point difference in per-token agreement compounds quickly, so it matters most in long sequences.
- For interactive tasks like code generation, the speed hit from using a larger, higher-precision quant is often minimal.
Originally published at reddit.com. Curated by AI Maestro.