I ran a quantization shootout on Qwen3-Coder-Next and the results are… interesting

Out of pure curiosity, I compared several quantizations of Qwen3-Coder-Next against each other. My hardware is three R9700 PROs (96 GB of VRAM in total), and I’m running llama.cpp with the Vulkan backend.

Details

Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Evaluation dataset: wikitext-2, processed in 512-token chunks (batch size 512, context window 512)
Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

The TL;DR is that UD-Q5_K_M performs exceptionally well: it beats the smaller formats on every quality metric, at about 10 GB more on disk than MXFP4 and with only a slight speed penalty.

The Numbers

Metric	MXFP4_MOE	Q4_K_M	Q5_K_M	UD-Q5_K_M
Top-1 token agreement	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size (GB)	44.7	45.2	52.9	55.2

UD-Q5_K_M wins on every quality metric while being only about 10.5 GB larger than the MXFP4 model (55.2 GB vs. 44.7 GB).
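
For anyone unsure what the three quality rows measure, here is a minimal sketch of how they can be computed from two sets of logits. It assumes a higher-precision reference model (the post does not say which baseline was used) and uses random toy data instead of real model output. The names `compare_to_reference`, `ref` and `quant` are made up for illustration and are not part of llama.cpp.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def compare_to_reference(ref_logits, quant_logits):
    """Compare two (num_tokens, vocab_size) logit arrays position by position."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    # KL(reference || quantized) for each token position
    kl = (np.exp(ref_logp) * (ref_logp - quant_logp)).sum(axis=-1)
    # Does the quantized model pick the same most likely token?
    same_top1 = ref_logits.argmax(axis=-1) == quant_logits.argmax(axis=-1)
    return {
        "same_top1": float(same_top1.mean()),
        "mean_kl": float(kl.mean()),
        "max_kl": float(kl.max()),
    }

# Toy data: the "quantized" logits are the reference plus small noise.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 50))
quant = ref + rng.normal(scale=0.3, size=ref.shape)
print(compare_to_reference(ref, quant))
```

Mean KL summarizes the average drift from the reference distribution, while max KL flags the single worst position, which is why MXFP4 can look close to Q4_K_M on the mean and still have a much worse outlier.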

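A key insight is that per-token agreement compounds exponentially over a generation. As a rough model (real tokens are not independent), if each token matches the reference with probability p, the chance that a 100-token output matches token for token is p^100. For the numbers above, that is about 0.2% at 94.0% agreement versus about 0.001% at 89.4%, so a gap of under 5 percentage points per token becomes a gap of roughly 150x by the 100th token. Small per-token differences therefore matter a lot for long outputs.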

The MXFP4 model (89.4% agreement) still performs well, but it “goes off the rails” more often than UD-Q5_K_M over a 100-token generation.
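
The compounding argument is easy to check with a few lines of Python. This is a back-of-the-envelope sketch that assumes every token position agrees with the reference independently, which real generations do not, so treat the output as an illustration of the trend and not as a measurement. The helper name `full_match_probability` is made up for this example.

```python
def full_match_probability(per_token_agreement, num_tokens):
    """Chance that every one of num_tokens tokens matches the reference,
    assuming each position agrees independently."""
    return per_token_agreement ** num_tokens

# Top-1 agreement figures from the table above.
agreement = {
    "MXFP4": 0.894,
    "Q4_K_M": 0.896,
    "Q5_K_M": 0.930,
    "UD-Q5_K_M": 0.940,
}

for name, p in agreement.items():
    print(f"{name:10s} {full_match_probability(p, 100):.3e}")

ratio = full_match_probability(0.940, 100) / full_match_probability(0.894, 100)
print(f"UD-Q5_K_M vs MXFP4 after 100 tokens: {ratio:.0f}x")
```

With the table's values, the ratio between UD-Q5_K_M and MXFP4 comes out on the order of 150x under this independence assumption.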

For tasks like interactive coding where decoding is the bottleneck, the speed hit from using UD-Q5_K_M is negligible compared to MXFP4. I’ve switched my default model to UD-Q5_K_M for daily code generation tasks.

I’m curious about what quantization strategies other folks are using and if they’re seeing similar quality cliffs with aggressive compression. For those on Nvidia hardware, have you observed different trade-offs than those seen with RDNA?

Key Takeaways

The UD-Q5_K_M format offers superior quality at a modest file size increase over MXFP4.
Roughly 10 GB of extra model size buys a significant improvement in quality: top-1 agreement rises from 89.4% to 94.0% and mean KL divergence falls from 0.0746 to 0.0217.
For tasks requiring high-quality outputs, especially those involving long sequences or complex reasoning, a higher-precision quantization like UD-Q5_K_M is likely the best choice among the formats tested.
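
To put the size-versus-quality trade-off in numbers, the sketch below takes the file sizes and mean KL values from the table and expresses each format relative to MXFP4. The helper `relative_to_baseline` is a made-up name for illustration, and the result is only arithmetic on the reported figures, not a new measurement.

```python
# File size (GB) and mean KL divergence, copied from the table above.
results = {
    "MXFP4": {"size_gb": 44.7, "mean_kl": 0.0746},
    "Q4_K_M": {"size_gb": 45.2, "mean_kl": 0.0685},
    "Q5_K_M": {"size_gb": 52.9, "mean_kl": 0.0308},
    "UD-Q5_K_M": {"size_gb": 55.2, "mean_kl": 0.0217},
}

def relative_to_baseline(results, baseline_name):
    """Return (extra GB, fractional mean-KL reduction) for each format."""
    base = results[baseline_name]
    return {
        name: (
            r["size_gb"] - base["size_gb"],
            1.0 - r["mean_kl"] / base["mean_kl"],
        )
        for name, r in results.items()
    }

for name, (extra_gb, kl_cut) in relative_to_baseline(results, "MXFP4").items():
    print(f"{name:10s} +{extra_gb:4.1f} GB  mean KL reduced by {kl_cut:6.1%}")
```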
