Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

“`html KV Cache Quantization Benchmarks Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more…

By AI Maestro May 19, 2026 2 min read
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

“`html




KV Cache Quantization Benchmarks

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Welcome to the world of BeeLlama v0.1.2 and my single RTX 3090 as we explore PPL and KLD benchmarks with a focus on our 24 GB VRAM setup.

I have thoroughly re-examined these tests to ensure they are grounded in real-world usage, particularly for models like Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context lengths.

PPL Hides the Tail, KLD Exposes It

  • PPL keeps performance close to bf16. Even with TurboQuant’s quantization methods like turbo3_tcq, PPL only adds a small margin. However, q4_0 has a significant tail issue compared to q5_0. This is crucial as it affects the quality of tool calls and JSON structures.

Rotation Closed the Gap at 4 Bits

  • TurboQuant’s value lies in its ability to apply rotation before quantization, which is similar to what turbo4 does. At 4 bits, TurboQuant doesn’t offer a quality improvement over plain q4_0, saves no memory, and runs slower by 17%. For cases requiring aggressive compression, TCQ (turbo3_tcq) is where TurboQuant shines.

TCQ Saves the Low End

  • TurboQuant’s turbo3_tcq and turbo2_tcq are significantly better than their non-TCQ counterparts. These methods are essential for scenarios where memory savings are critical, such as in environments with limited VRAM.

Asymmetric KV Beats Symmetric at the Same Size

  • TurboQuant’s asymmetric quantization (q5_0/q4_1) outperforms symmetric quantization (q4_0/q4_1) across all test configurations. This is especially true once key memory usage reaches q5_0, where the next bit goes to value rather than further reducing key space.

Higher Model Precision Means More Cache Damage

  • The impact of model precision on cache damage is significant. For instance, Q5_K_S uses 3-5% more memory than IQ4_XS to achieve the same level of 99.9% precision.

q8 Is Mostly a Luxury Tier

  • For models like Qwen, q8 quantization (53.1%) at 43.8% of bf16 is often more about validation than practical use. It’s best reserved for cases where you have ample VRAM to spare and need the highest precision without worrying about cache size constraints.

Key Takeaways

  • TurboQuant’s value lies in its ability to apply rotation before quantization, which can be a game-changer for memory savings.
  • TCQ is particularly valuable when aiming for aggressive compression in environments with limited VRAM.
  • Asymmetric quantization outperforms symmetric methods once key memory usage reaches q5_0.
  • Higher model precision can lead to more cache damage, highlighting the importance of balancing both model and key space quantizations.



“`


Originally published at reddit.com. Curated by AI Maestro.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top