I ran a quantization shootout on Qwen3-Coder-Next and the results are... interesting

I’ve been using the MXFP4_MOE quant of Qwen3-Coder-Next from unsloth for my work because it’s very fast. However, I wanted to see how much quality I was giving up at that precision, so I ran it against a few other formats.

Hardware

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend: llama.cpp Vulkan

Evaluation dataset: wikitext-2 (583 chunks, context length of 512 tokens)
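For anyone who wants to set up a similar run, here is a rough sketch. It assumes the comparison is done with llama.cpp's `llama-perplexity` tool and its `--kl-divergence-base` / `--kl-divergence` options, and that a higher-precision GGUF serves as the reference. The file names are hypothetical placeholders, and the script only prints the commands rather than running them.

```python
import shlex

# Hypothetical file names: substitute your own GGUF files and dataset path.
BASELINE = "qwen3-coder-next-bf16.gguf"    # assumed high-precision reference
QUANT = "qwen3-coder-next-UD-Q5_K_M.gguf"  # quant under test
DATASET = "wiki.test.raw"                  # wikitext-2 test split
LOGITS = "baseline-logits.kld"             # where the reference logits are saved

# Step 1: run the reference model over the dataset and save its logits.
save_ref = [
    "llama-perplexity", "-m", BASELINE,
    "-f", DATASET, "-c", "512",
    "--kl-divergence-base", LOGITS,
]

# Step 2: run the quantized model against the saved logits to get
# KL divergence and top-token agreement statistics.
compare_cmd = [
    "llama-perplexity", "-m", QUANT,
    "--kl-divergence-base", LOGITS,
    "--kl-divergence",
]

for cmd in (save_ref, compare_cmd):
    print(shlex.join(cmd))
```

Repeating step 2 for each quant against the same saved logits keeps the comparison apples-to-apples.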

Formats tested

MXFP4_MOE
Q4_K_M
Q5_K_M
UD-Q5_K_M

TL;DR: UD-Q5_K_M is the clear winner on quality! It beats the smaller formats on every quality metric I measured, and there’s barely any speed penalty. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1 token (agreement)	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB

UD-Q5_K_M wins on literally every quality metric, while only being ~10 GB larger than MXFP4.
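To make the metrics concrete, here is a minimal sketch of how top-1 agreement and KL divergence can be computed from two sets of logits. The toy logits are made up for illustration, and this is not necessarily how the evaluation tool computes them internally.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats, where P is the reference distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (top-1 agreement rate, mean KL, max KL) over token positions."""
    agree = 0
    kls = []
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        # Same top-1: both models rank the same token first.
        if p.index(max(p)) == q.index(max(q)):
            agree += 1
        kls.append(kl_divergence(p, q))
    n = len(kls)
    return agree / n, sum(kls) / n, max(kls)

# Toy data: 3 positions, vocabulary of 4 tokens (made-up numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.0], [1.0, 1.2, 3.0, 0.0]]
quant = [[1.9, 1.1, 0.0, -1.0], [2.6, 0.4, 0.0, 0.1], [1.0, 1.0, 3.1, 0.1]]

top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

In the toy data the second position flips its top token, so it dominates the max KL. That is the kind of single bad token the "Max KL (worst token)" row picks up.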

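A 4.6-point difference in per-token agreement compounds fast. If you treat every token as an independent trial (a simplification), the chance that a 100-token continuation matches the reference at every position is roughly 150× higher for UD-Q5_K_M than for MXFP4. All LLMs are auto-regressive, so an early divergence feeds into everything generated after it, and Yann LeCun is always talking about this. It’s a plausible source of a lot of the hallucinations and drift you see on long generations.

MXFP4 (89.4%): 0.894^100 ≈ 1.4e-5

UD-Q5_K_M (94%): 0.94^100 ≈ 2.1e-3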
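The compounding argument is easy to check with a few lines of Python. This is a back-of-the-envelope sketch that assumes each token position agrees or disagrees independently, which real generations don't strictly do.

```python
def chain_agreement(per_token, n_tokens):
    """Probability that all n_tokens match, assuming independent positions."""
    return per_token ** n_tokens

N = 100  # continuation length in tokens

mxfp4 = chain_agreement(0.894, N)  # per-token agreement from the table
ud_q5 = chain_agreement(0.940, N)
ratio = ud_q5 / mxfp4

print(f"MXFP4:     {mxfp4:.2e}")
print(f"UD-Q5_K_M: {ud_q5:.2e}")
print(f"ratio:     {ratio:.0f}x")
```

Changing `N` shows how quickly the gap widens as the continuation gets longer.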

There is a speed trade-off to all of this though.

Prefill (batch 512):

MXFP4 still fastest (hardware kernels)

Prefill (batch 4096):

MXFP4 wins again

Decode:

Q4_K_M edges out UD-Q5_K_M slightly, but UD-Q5_K_M is within 9% despite the file being 22% larger (55.2 GB vs 45.2 GB)

For interactive coding (which is decode-bound anyway), the speed hit is negligible.
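The size side of that trade-off comes straight from the file sizes in the table. A quick sketch of the arithmetic:

```python
# File sizes in GB, copied from the table above.
sizes_gb = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def pct_larger(a, b):
    """How much larger a is than b, as a percentage."""
    return (a / b - 1) * 100

target = sizes_gb["UD-Q5_K_M"]
for name, size in sizes_gb.items():
    if name != "UD-Q5_K_M":
        print(f"UD-Q5_K_M vs {name}: {pct_larger(target, size):+.1f}%")
```

So the "22% larger" figure is relative to Q4_K_M, and the gap to MXFP4 is slightly bigger.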

I’ve swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads, but for daily code generation where you care about quality over speed, UD-Q5_K_M is the better choice.

What quants are you guys running for code models?

I’m curious if others are finding similar quality cliffs with aggressive compression. And if you’re on Nvidia hardware, are you seeing different trade-offs than RDNA?

Key Takeaways

UD-Q5_K_M is the clear winner on quality, at the cost of the largest file size (55.2 GB).
MXFP4 is still very fast for heavy prefill workloads.
The speed hit for interactive coding is negligible, since that workload is decode-bound and UD-Q5_K_M decodes within 9% of Q4_K_M.
For code generation tasks where you care about quality over speed, UD-Q5_K_M is recommended.

If you’re on Nvidia hardware and have different experiences with other quantization formats, please share them!
