I ran a quantization shootout on Qwen3-Coder and the results are... interesting


Key Takeaways

UD-Q5_K_M is the clear quality winner in this test: a 94.0% same-top-1 rate and a mean KL divergence of 0.0217, the best of the four quants on every quality metric measured.
The speed/quality trade-off is more nuanced than it first looks. UD-Q5_K_M decodes slightly slower than MXFP4_MOE, but its higher token accuracy makes it the better choice when quality matters more than speed.
For interactive coding, which is typically decode-bound, the speed difference between MXFP4 and UD-Q5_K_M is negligible. For prefill-heavy workloads that need fast responses, MXFP4 remains faster.

The Numbers

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB

UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.
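
For anyone unfamiliar with the metrics in the table, here is a minimal sketch of how a same-top-1 rate, mean KL and max KL can be computed per token position. It assumes you have raw logits from a reference model and from the quantized model at each position; the post does not spell out the reference, so that part is an assumption. The function names and the toy logits are made up for illustration, and this is not the llama.cpp implementation.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how much the quantized distribution q drifts from the reference p.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def same_top1(p, q):
    # True when both distributions pick the same most-likely token.
    return p.index(max(p)) == q.index(max(q))

def compare(ref_logits, quant_logits):
    # Each argument is a list of per-position logit vectors (hypothetical layout).
    kls, matches = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        matches += same_top1(p, q)
    n = len(kls)
    return {"same_top1": matches / n, "mean_kl": sum(kls) / n, "max_kl": max(kls)}

# Toy data: position 0 is identical, position 1 flips the top token.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[2.0, 1.0, 0.1], [2.4, 0.5, 0.3]]
result = compare(ref, quant)
print(result)
```

Max KL is driven by the single worst position, which is why it can look very different from the mean in the table above.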

A 4.6-point difference in per-token agreement becomes a roughly 150× difference by token 100. All LLMs are autoregressive, so if you treat each token as an independent trial, the chance that n tokens in a row all agree is p^n. Yann LeCun is always talking about this: LLMs suffer from exponentially diverging error probabilities. This is where a lot of your hallucinations and stuff happen.

MXFP4 (89.4%), 100-token output: 0.0014% chance of perfect agreement

UD-Q5_K_M (94%), 100-token output: 0.21% chance of perfect agreement

That’s not a big number, but on long refactoring tasks or multi-step reasoning, you feel it. MXFP4 “goes off the rails” way more often.
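
The compounding figures above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that treats every token as an independent trial with the same agreement rate, which is a simplification of how real generation behaves; the helper name `chance_all_match` is made up for illustration.

```python
def chance_all_match(p, n):
    # Probability that n tokens in a row all agree, assuming each token
    # agrees independently with probability p.
    return p ** n

rates = {"MXFP4": 0.894, "UD-Q5_K_M": 0.940}  # same-top-1 rates from the table
for name, p in rates.items():
    print(f"{name}: {chance_all_match(p, 100):.4%} chance of 100 matching tokens")

# How many times more likely UD-Q5_K_M is to stay in agreement for 100 tokens.
ratio = chance_all_match(0.940, 100) / chance_all_match(0.894, 100)
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the ratio comes out near 150×, and it keeps growing with output length because the gap compounds at every token.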

The Hardware and Backend Details:

Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Evaluation Dataset: wikitext-2 (583 chunks, ctx 512)


Image Credit: /u/alphatrad

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you’re on Nvidia hardware, are you seeing different tradeoffs than RDNA?
