It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD

There has been some discussion here about KV-cache quantization, especially following recent improvements in llama.cpp. The debate is whether to prioritize cache quant over weight quant when VRAM is limited.

I’ve noticed a lack of data-backed comparisons. One reason is that computing KL-Divergence (KLD) is expensive. KLD measures how far one probability distribution diverges from another; here, that means comparing the next-token distribution of a quantized setup against a reference, and the ideal reference is the logits of the unquantized model.
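To make the metric concrete, here is a minimal sketch of per-token KLD computed from two sets of logits and averaged over positions. It is not the `llama-perplexity` implementation; the function names and the toy logits are made up for illustration.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_rows, test_rows):
    """Average the per-token KLD over all evaluated positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)

# Toy 4-token vocabulary, two positions; the numbers are invented.
reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quantized = [[1.9, 1.1, 0.0, -0.8], [0.4, 0.7, 2.8, 0.1]]

print(f"mean KLD: {mean_kld(reference, quantized):.6f}")
```

A value of 0 means the two setups predict identical distributions; larger values mean the quantized setup has drifted further from the reference.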

However, we can approximate KLD by using logits from a high-quality quantized model as the reference. I used unsloth’s quants for Qwen3.6 27B. The largest quant I could fit on my 7900 XTX (with only 24GB of VRAM) was Q5_K_M, and even that meant no MTP and a small context window.

I ran the KLD calculations with `llama-perplexity`, which is part of llama.cpp. The dataset is wikitext-2, downloaded with the script provided by llama.cpp. I set the context size to 16,000 tokens because the concern with cache quantization is its effect on long-context performance.
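The workflow described above is two steps: one run saves the reference logits, and each later run compares a configuration against that file. The sketch below only builds the argument lists so they can be inspected before running anything. The file names are hypothetical, and the flags are the ones I understand `llama-perplexity` to accept; please confirm them with `llama-perplexity --help` on your build.

```python
def base_run(model, dataset, ctx, logits_file):
    """Step 1: evaluate the reference model and save its logits to a file."""
    return ["llama-perplexity", "-m", model, "-f", dataset, "-c", str(ctx),
            "--kl-divergence-base", logits_file]

def compare_run(model, logits_file, cache_k=None, cache_v=None):
    """Step 2: evaluate one configuration against the saved logits."""
    cmd = ["llama-perplexity", "-m", model,
           "--kl-divergence-base", logits_file, "--kl-divergence"]
    if cache_k:
        cmd += ["--cache-type-k", cache_k]
    if cache_v:
        cmd += ["--cache-type-v", cache_v]
    return cmd

# Hypothetical paths, for illustration only.
REF_MODEL = "Qwen3.6-27B-Q5_K_M.gguf"
DATASET = "wikitext-2-raw/wiki.test.raw"
LOGITS = "q5_k_m-logits.bin"

print(" ".join(base_run(REF_MODEL, DATASET, 16000, LOGITS)))
for k, v in [("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q5_1", "q5_1"), ("q4_0", "q4_0")]:
    print(" ".join(compare_run(REF_MODEL, LOGITS, cache_k=k, cache_v=v)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.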

I tested multiple combinations of K and V cache quant types, focusing on the threshold between Q5 and Q4 model quants, as well as the impact of using a smaller quant for V than for K. I didn’t observe any significant slowdowns from mixed KV quants because my llama.cpp build was compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The goal was to answer: "Should you quantize the cache or use a smaller model quant to achieve longer context?"
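The memory side of that question can be estimated with simple arithmetic. The sketch below uses the effective bits per element of the ggml cache types (for example, q8_0 stores 32 int8 values plus one fp16 scale). The architecture numbers are hypothetical placeholders, not Qwen3.6 27B's actual configuration, so substitute the values from your model's metadata.

```python
# Effective bits per cached element, derived from the ggml block layouts
# (e.g. q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values).
BITS_PER_ELEMENT = {
    "f16": 16.0, "q8_0": 8.5, "q5_1": 6.0,
    "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5,
}

def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, type_k, type_v):
    """Approximate KV-cache size: one K and one V vector per layer per token."""
    elements_per_token = n_attn_layers * n_kv_heads * head_dim
    bits = BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    return elements_per_token * n_ctx * bits / 8

# Hypothetical architecture, chosen only to make the example run.
ARCH = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q4_0", "q4_0")]:
    gib = kv_cache_bytes(100_000, type_k=k, type_v=v, **ARCH) / 2**30
    print(f"K={k:5s} V={v:5s} -> {gib:.2f} GiB at 100k context")
```

Whatever the architecture, the ratios hold: a q8_0 cache takes about 53% of the f16 size, and q4_0 about 28%.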

Here are some key findings:

Model quant is more important than KV-cache quant: Regardless of cache quant, Q5 performed better than Q4. Even the worst-performing Q5 config (Q5_K_S weights with q4_0 KV cache) outperformed the best Q4 config (Q4_K_XL with f16 cache).
q4_0 is not ideal for the cache: I recommend using at least a q5_* cache type.

The table below summarizes my findings:

Weights	cache-type-k	cache-type-v	KLD¹
Q5_K_M	q8_0	q8_0	0.003838 ± 0.000322
Q5_K_M	q8_0	q5_1	0.006456 ± 0.000542
Q5_K_M	q5_1	q5_1	0.007214 ± 0.000560
Q5_K_M	q4_0	q4_0	0.014045 ± 0.000828
Q5_K_S	q4_0	q4_0	0.016304 ± 0.000865
Q4_K_XL	f16	f16	0.026067 ± 0.001021
Q4_K_M	f16	f16	0.031078 ± 0.001176

¹ Calculated w.r.t. Q5_K_M on wikitext-2 using 16k context.
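As a quick sanity check on the table, the sketch below measures the gap between the worst Q5 row and the best Q4 row relative to their reported uncertainties. Treating the ± values as independent standard errors is my assumption; the runs share the same tokens, so this is only a rough guide.

```python
import math

# (weights, cache-type-k, cache-type-v, KLD, reported uncertainty), from the table.
ROWS = [
    ("Q5_K_M",  "q8_0", "q8_0", 0.003838, 0.000322),
    ("Q5_K_M",  "q8_0", "q5_1", 0.006456, 0.000542),
    ("Q5_K_M",  "q5_1", "q5_1", 0.007214, 0.000560),
    ("Q5_K_M",  "q4_0", "q4_0", 0.014045, 0.000828),
    ("Q5_K_S",  "q4_0", "q4_0", 0.016304, 0.000865),
    ("Q4_K_XL", "f16",  "f16",  0.026067, 0.001021),
    ("Q4_K_M",  "f16",  "f16",  0.031078, 0.001176),
]

def gap_in_sigmas(a, b):
    """Difference in KLD divided by the combined uncertainty of both rows."""
    return abs(a[3] - b[3]) / math.hypot(a[4], b[4])

worst_q5 = max((r for r in ROWS if r[0].startswith("Q5")), key=lambda r: r[3])
best_q4 = min((r for r in ROWS if r[0].startswith("Q4")), key=lambda r: r[3])

print("worst Q5:", worst_q5[:3], worst_q5[3])
print("best Q4: ", best_q4[:3], best_q4[3])
print(f"gap: {gap_in_sigmas(worst_q5, best_q4):.1f} combined sigmas")
```

Under that assumption the gap is several times the combined uncertainty, so the Q5-over-Q4 ordering is unlikely to be noise.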

There are some limitations to this analysis:

The KLD approximation is based on the largest model size I could fit, which might not be representative of larger models.
I didn’t test more quant levels due to computational constraints. Q5_K_M was the largest model size I tested because with q8_0 KV-cache I can achieve 100k context on my 24GB 7900 XTX.
The dataset (wikitext-2) isn’t ideal for coding or agent workflows; a better dataset might be available, but it needs to integrate well with the `llama-perplexity` tool.
16k context is insufficient for testing long-context scenarios. I’m waiting for Llama.cpp to fix an overflow bug that currently prevents running such tests.

The findings suggest that, when VRAM is tight, keeping the higher-precision weight quant and quantizing the KV cache instead gives outputs closer to the reference model, while still freeing memory for longer context. This matters for applications like AI music generation or any creative task where the model needs to handle long sequences of data.

Key Takeaways

Weight quantization affected output quality more than KV-cache quantization in these tests, which were run at 16k context on wikitext-2.
Every Q5 configuration tested, including Q5_K_S with a q4_0 cache, had lower KLD than the best Q4 configuration.
Makers and artists who need more context should consider quantizing the KV cache (q5_* or higher) before dropping to a smaller weight quant.
