I ran a quantization shootout on Qwen3-Coder and the results are… interesting

By AI Maestro May 22, 2026 2 min read
I was curious how quantization affects Qwen3-Coder-Next, so I ran a shootout. I’ve been using the MXFP4_MOE quant from unsloth for my work because it’s very fast, but I wanted to see how much precision actually matters.

Hardware

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend: llama.cpp Vulkan

Evaluation dataset: wikitext-2 (583 chunks, context length of 512 tokens)
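For anyone who wants to try a similar comparison, here is a minimal sketch of how the two passes could be set up. It assumes the numbers come from llama.cpp’s `llama-perplexity` tool and its `--kl-divergence-base` / `--kl-divergence` options; the post doesn’t say exactly which commands were used. The file names and the `kld_commands` helper are hypothetical, and the script only builds the command lines; it does not run them.

```python
import shlex

def kld_commands(reference_gguf, quant_gguf, text_file,
                 logits_file="ref-logits.bin", ctx=512):
    """Build (but do not run) two llama-perplexity invocations.

    Pass 1 saves the reference model's logits to `logits_file`.
    Pass 2 scores a quantized model against those saved logits.
    All paths are placeholders; adjust them to your own setup.
    """
    common = ["llama-perplexity", "-f", str(text_file), "-c", str(ctx)]
    save_base = common + ["-m", str(reference_gguf),
                          "--kl-divergence-base", str(logits_file)]
    compare = common + ["-m", str(quant_gguf),
                        "--kl-divergence-base", str(logits_file),
                        "--kl-divergence"]
    return save_base, compare

# Hypothetical file names, for illustration only.
base_cmd, cmp_cmd = kld_commands("reference.gguf", "quantized.gguf",
                                 "wiki.test.raw")
print(shlex.join(base_cmd))
print(shlex.join(cmp_cmd))
```

The context length of 512 matches the evaluation setting above; GPU offload flags are left out because they depend on the build and the hardware.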

Formats tested

  • MXFP4_MOE
  • Q4_K_M
  • Q5_K_M
  • UD-Q5_K_M

TL;DR: UD-Q5_K_M is the clear winner! It has better quality than the smaller formats, and there’s barely any speed penalty. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers

| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 accuracy | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size (GB) | 44.7 | 45.2 | 52.9 | 55.2 |

UD-Q5_K_M wins on literally every quality metric, while only being ~10 GB larger than MXFP4.
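To make the three quality metrics concrete, here is a small sketch of how they can be computed from per-token logits of a reference model and a quantized model. It is not the code used for the table; the function names and the toy logits are made up for illustration, and the post does not say which reference model the quants were compared against.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(reference || quantized) for a single token position."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare_logits(ref_rows, quant_rows):
    """Summarize agreement between two models over many token positions."""
    kls = [kl_divergence(r, q) for r, q in zip(ref_rows, quant_rows)]
    same = [argmax(r) == argmax(q) for r, q in zip(ref_rows, quant_rows)]
    return {
        "same_top1": sum(same) / len(same),  # share of positions with the same top token
        "mean_kl": sum(kls) / len(kls),      # average distribution drift
        "max_kl": max(kls),                  # worst single token
    }

# Toy example: two positions, vocabulary of three tokens.
reference = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
quantized = [[1.9, 1.1, 0.0], [3.0, 0.0, 1.0]]
print(compare_logits(reference, quantized))
```

Mean KL captures average drift across the whole distribution, while max KL flags the single worst token, which is why a format can look fine on one and poor on the other.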

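A gap of about 5 points in per-token agreement compounds fast. If you treat each token as an independent draw, the chance of matching the reference for 100 tokens in a row is 0.94^100 ≈ 0.2% for UD-Q5_K_M versus 0.894^100 ≈ 0.001% for MXFP4, which is a gap of roughly 150× under this simple model. All LLMs are auto-regressive, so one divergent token shifts everything that comes after it, and Yann LeCun is always talking about this. This is where your hallucinations and similar errors creep in.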
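The compounding arithmetic can be sketched in a few lines. This assumes every token agrees with the reference independently with the same probability, which is a simplification (real errors are correlated, and a mismatched top-1 token is not always a worse answer). Under that assumption, the table’s 89.4% and 94.0% give a ratio on the order of 150× at 100 tokens; the exact multiplier depends on the sequence length and on the model of errors you pick.

```python
def full_match_probability(per_token_agreement, n_tokens):
    """Chance that n_tokens in a row all match the reference,
    assuming independent per-token agreement (a simplification)."""
    return per_token_agreement ** n_tokens

def compounding_ratio(better, worse, n_tokens):
    """How many times more likely the better format is to stay on track."""
    return (full_match_probability(better, n_tokens)
            / full_match_probability(worse, n_tokens))

# Per-token agreement rates taken from the table above.
for n in (10, 50, 100):
    ratio = compounding_ratio(0.940, 0.894, n)
    print(f"{n:>3} tokens: UD-Q5_K_M is {ratio:.1f}x more likely to match throughout")
```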

[Figure: MXFP4 (89.4%)]

[Figure: UD-Q5_K_M (94%)]

There is a speed trade-off to all of this though.

Prefill (batch 512):

MXFP4 still fastest (hardware kernels)

Prefill (batch 4096):

MXFP4 wins again

Decode:

Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger

For interactive coding (which is decode-bound anyway), the speed hit is negligible.
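The size figures quoted here follow directly from the table. A quick sketch of the arithmetic, using the file sizes and the 96 GB of VRAM listed above (the helper names are made up for illustration):

```python
# File sizes in GB, copied from the results table.
SIZES_GB = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}
VRAM_GB = 96.0  # 3x R9700 PRO, as listed under Hardware

def relative_size(fmt, baseline):
    """Fractional size increase of `fmt` over `baseline`."""
    return SIZES_GB[fmt] / SIZES_GB[baseline] - 1.0

def headroom_gb(fmt):
    """VRAM left after loading the weights, before KV cache and buffers."""
    return VRAM_GB - SIZES_GB[fmt]

print(f"UD-Q5_K_M vs Q4_K_M: {relative_size('UD-Q5_K_M', 'Q4_K_M'):.0%} larger")
print(f"UD-Q5_K_M vs MXFP4:  {SIZES_GB['UD-Q5_K_M'] - SIZES_GB['MXFP4']:.1f} GB larger")
for fmt in SIZES_GB:
    print(f"{fmt:<10} headroom: {headroom_gb(fmt):.1f} GB")
```

The headroom number is only a rough guide, since the KV cache and compute buffers also have to fit in what is left.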

For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.

What quants are you guys running for code models?

I’m curious if others are finding similar quality cliffs with aggressive compression. And if you’re on Nvidia hardware, are you seeing different trade-offs than RDNA?

Key Takeaways

  • UD-Q5_K_M is the clear winner on every quality metric, at the cost of the largest file (55.2 GB, about 10 GB more than MXFP4).
  • MXFP4 is still very fast for heavy prefill workloads.
  • The speed hit for interactive coding is negligible, since that workload is decode-bound and UD-Q5_K_M decodes within 9% of Q4_K_M.
  • For code generation tasks where you care about quality over speed, UD-Q5_K_M is recommended.

If you’re on Nvidia hardware and have different experiences with other quantization formats, please share them!
