I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Quantization Shootout Results

I ran a quantization shootout on Qwen3-Coder-Next, comparing four quantization formats. The goal was to see how each one affects both quality and performance.

Hardware and Setup

Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Evaluation: wikitext-2 (583 chunks, ctx 512 tokens)
Formats Tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

The Results

The UD-Q5_K_M format stood out as having the best balance between quality and file size. It reached 94.0% same-top-1 agreement, well ahead of Q4_K_M (89.6%) and one point ahead of Q5_K_M (93.0%).

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB

UD-Q5_K_M wins on every quality metric tested.
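For readers unfamiliar with the three quality rows in the table, here is a minimal sketch of how same-top-1 agreement, mean KL divergence, and max KL can be computed from two sets of per-token logits. It assumes the metrics follow their usual definitions (KL of the reference distribution against the quantized one, in nats); the function names and the toy logits are hypothetical and are not taken from the benchmark itself.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) in nats: P is the reference model, Q the quantized model.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare(ref_logits, quant_logits):
    # Each argument is a list of per-position logit vectors.
    kls, same_top1 = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        same_top1 += argmax(p) == argmax(q)
    n = len(kls)
    return {
        "same_top1": same_top1 / n,   # fraction of positions with the same top token
        "mean_kl": sum(kls) / n,      # average distribution drift
        "max_kl": max(kls),           # worst single token
    }

# Toy example (made-up logits): position 0 agrees, position 1 flips the top token.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.8, 0.3]]
metrics = compare(ref, quant)
print(metrics)
```

Mean KL captures average drift across all positions, while max KL highlights the single worst token, which is why the two rows can rank formats differently.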

A roughly 5-point difference in per-token agreement compounds into a 500× difference by token 100, which is why per-token agreement matters so much in long sequences. MXFP4 reached 89.4% same-top-1 agreement compared to UD-Q5_K_M’s 94.0%.
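The compounding effect can be checked with a quick back-of-the-envelope calculation. The sketch below assumes each token agrees with the reference independently, which is a simplification, and plugs in the same-top-1 rates from the table. Under that assumption the ratio at 100 tokens comes out at roughly 150×, the same kind of blow-up as the 500× quoted above but a smaller multiple; the exact figure depends on the agreement rates and sequence length assumed.

```python
def all_match_probability(per_token_agreement, n_tokens):
    # Probability that every one of n_tokens matches the reference,
    # assuming independent per-token agreement (a simplification).
    return per_token_agreement ** n_tokens

N_TOKENS = 100
ud_q5 = all_match_probability(0.940, N_TOKENS)   # UD-Q5_K_M, from the table
mxfp4 = all_match_probability(0.894, N_TOKENS)   # MXFP4, from the table
ratio = ud_q5 / mxfp4

print(f"UD-Q5_K_M: {ud_q5:.2e}")
print(f"MXFP4:     {mxfp4:.2e}")
print(f"ratio:     {ratio:.0f}x")
```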

The file size difference is notable: UD-Q5_K_M (55.2 GB) is 10.5 GB larger than MXFP4 (44.7 GB), about 23% more on disk.
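To put the size cost next to the quality gain, the following sketch computes relative changes directly from the table values. Only the numbers come from the results; the dictionary layout and helper name are illustrative.

```python
# File size (GB) and mean KL divergence, copied from the results table.
formats = {
    "MXFP4":     {"size_gb": 44.7, "mean_kl": 0.0746},
    "Q4_K_M":    {"size_gb": 45.2, "mean_kl": 0.0685},
    "Q5_K_M":    {"size_gb": 52.9, "mean_kl": 0.0308},
    "UD-Q5_K_M": {"size_gb": 55.2, "mean_kl": 0.0217},
}

def relative_change(new, base):
    # Fractional change of `new` relative to `base` (0.10 means +10%).
    return (new - base) / base

ud = formats["UD-Q5_K_M"]
for name in ("MXFP4", "Q4_K_M", "Q5_K_M"):
    other = formats[name]
    size = relative_change(ud["size_gb"], other["size_gb"])
    kl = relative_change(ud["mean_kl"], other["mean_kl"])
    print(f"UD-Q5_K_M vs {name}: size {size:+.1%}, mean KL {kl:+.1%}")

size_vs_mxfp4 = relative_change(ud["size_gb"], formats["MXFP4"]["size_gb"])
kl_vs_mxfp4 = relative_change(ud["mean_kl"], formats["MXFP4"]["mean_kl"])
```

Against MXFP4, this works out to about 23% more disk space for about 71% lower mean KL divergence.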

Performance

Prefill (batch 512): MXFP4 was the fastest, thanks to hardware kernels.
Prefill (batch 4096): MXFP4 maintained its lead.
Decode: Q4_K_M edges out UD-Q5_K_M slightly, but UD-Q5_K_M is within 9% despite being 22% larger.

For interactive coding tasks (which are decode-bound), the speed hit from using UD-Q5_K_M was negligible. I have since switched my default format to UD-Q5_K_M for daily code generation, as it offers a better quality-to-size trade-off compared to MXFP4.

Quants for Code Models

What quantization techniques are you using for your code models? Have you noticed similar quality cliffs with aggressive compression?

Key Takeaways

The UD-Q5_K_M format provides the best balance between quality and file size.
A gap of about 5 points in per-token agreement compounds over long sequences, so small per-token differences matter.
For interactive, decode-bound tasks like code generation, the speed hit from the larger UD-Q5_K_M quant is small (within 9% of Q4_K_M on decode).
