OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Why INT2 KV Cache Quantization is Hard
KV activations contain channel-wise outliers: a small subset of channels holds extremely large values, while most are well-behaved. When such activations are quantized to 2 bits (INT2), which leaves only four levels, the quantizer spends most of its range on the rare spikes and compresses the normal values into just one or two effective levels. This significantly degrades attention quality.
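The range-wasting effect is easy to reproduce. The following is a minimal NumPy sketch, not OSCAR's quantizer: the helper `quantize_int2_asym`, the group of 64 channels, and the outlier value of 100 are all illustrative assumptions.

```python
import numpy as np

def quantize_int2_asym(x):
    """Asymmetric 2-bit quantization of a 1-D array: 4 levels spanning [min, max]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0   # 4 levels -> 3 intervals
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.int64)
    return codes, scale, lo

rng = np.random.default_rng(0)
group = rng.normal(0.0, 1.0, size=64)    # 64 well-behaved channels (assumed data)
group_outlier = group.copy()
group_outlier[0] = 100.0                 # one hypothetical outlier channel

codes_clean, _, _ = quantize_int2_asym(group)
codes_out, _, _ = quantize_int2_asym(group_outlier)

# Number of distinct codes used by the 63 normal channels in each case.
levels_clean = len(np.unique(codes_clean[1:]))
levels_with_outlier = len(np.unique(codes_out[1:]))
print(levels_clean, levels_with_outlier)
```

With the outlier present, the step size is set almost entirely by the spike, so every normal channel rounds to the same code and the group carries essentially no information about them.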
What OSCAR Does Differently
The key observation is that the rotation applied before quantization should be derived from attention statistics rather than from raw KV activations. OSCAR estimates empirical covariance matrices for keys and for values and uses them to align channels with the directions that matter most to attention.
The final composed rotations are:
R_K = U_Q · H_Had · P_br
R_V = U_S · H_Had · P_br
Each of these factors addresses a distinct failure mode of per-group low-bit quantization:
- U_Q / U_S: Aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so that important directions are identifiable.
- H_Had: Equalizes channel importance exactly by flattening a peaky eigenspectrum to a uniform value across all channels.
- P_br: Reorders channels so that each group receives one representative from each level of the importance hierarchy, which balances quantization error across groups.
The research team proves in Theorem 1 that U_Q and U_S are optimal under a frozen-error surrogate objective with a diagonal-residual assumption.
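The "equalizes exactly" property of the Hadamard factor can be checked numerically. The sketch below is an illustration under assumptions, not the paper's calibration procedure: the error-weighting matrix is stood in for by an empirical second-moment matrix of synthetic queries, the `hadamard` helper is hypothetical, and the permutation factor is omitted because the article does not specify how it is built.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
d = 8
# Hypothetical calibration queries with strongly unequal per-channel energy.
channel_std = np.array([10, 5, 2, 1, 1, 0.5, 0.2, 0.1])
queries = rng.normal(size=(1024, d)) * channel_std
m = queries.T @ queries / len(queries)   # stand-in error-weighting matrix

eigvals, u = np.linalg.eigh(m)           # columns of u are eigenvectors of m
r = u @ hadamard(d)                      # alignment followed by Hadamard mixing

diag_after_u = np.diag(u.T @ m @ u)      # equals the eigenvalues: a peaky spectrum
diag_after_r = np.diag(r.T @ m @ r)      # every entry equals trace(m) / d
print(diag_after_u.round(3))
print(diag_after_r.round(3))
```

Because every entry of the normalized Hadamard matrix has squared magnitude 1/d, each diagonal entry after mixing is the average of all eigenvalues, which is why the equalization is exact rather than approximate once the matrix has been diagonalized.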
The Serving System: Mixed-Precision Cache Layout
OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility for paged attention. The KV cache layout uses three regions per request:
- Sink tokens (first 64 tokens): Stored in BF16 and used as attention sinks.
- Recent tokens (last 256 tokens before current position): Stored in BF16 for recent context.
- History tokens: Quantized to INT2 after OSCAR rotation and clipping. For values, the error that matters is the error in the attention output SV, so the directions that remain large after aggregation by S are the ones that propagate quantization error. The value rotation R_V is absorbed into the model weights offline.
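The three regions can be expressed as index ranges over a request's tokens. This is a minimal sketch using the sizes quoted above (64 sink tokens, 256 recent tokens); the function name `cache_regions` and the handling of short sequences are assumptions, not SGLang's actual bookkeeping.

```python
SINK_TOKENS = 64
RECENT_TOKENS = 256

def cache_regions(seq_len, sink=SINK_TOKENS, recent=RECENT_TOKENS):
    """Return half-open (start, end) token ranges for the three cache regions."""
    sink_end = min(sink, seq_len)
    # The recent window never overlaps the sink region (assumed tie-break).
    recent_start = max(sink_end, seq_len - recent)
    return {
        "sink": (0, sink_end),                # kept in BF16
        "history": (sink_end, recent_start),  # quantized to INT2
        "recent": (recent_start, seq_len),    # kept in BF16
    }

print(cache_regions(1000))
print(cache_regions(100))
```

Under this layout, sequences shorter than 320 tokens have an empty history region, so nothing is quantized until the context grows past the two BF16 windows.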
The write path rotates tokens, clips them to a calibration-derived percentile threshold (e.g., c_K = 0.96 for keys and c_V = 0.92 for values), and then quantizes them with per-token asymmetric INT2 at a default group size of G_K = 64 channels per group. The read path dequantizes, inverse-rotates, and passes the result to the attention kernel in one fused pass.
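The write and read paths can be sketched in NumPy as follows. This is an unfused illustration under several assumptions: the function names are hypothetical, a random orthogonal matrix stands in for the OSCAR rotation, and the clipping threshold is interpreted as the 0.96 quantile of absolute rotated calibration values, which the article does not spell out.

```python
import numpy as np

def quantize_int2_groups(x, group_size=64):
    """Per-token asymmetric INT2: each group of channels gets its own scale and offset."""
    tokens, dim = x.shape
    g = x.reshape(tokens, dim // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0           # 4 levels -> 3 steps
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2_groups(codes, scale, lo):
    g = codes.astype(np.float32) * scale + lo
    return g.reshape(g.shape[0], -1)

def write_path(k, rotation, clip_threshold, group_size=64):
    """Rotate, clip symmetrically, then quantize per token and per group."""
    rotated = k @ rotation
    clipped = np.clip(rotated, -clip_threshold, clip_threshold)
    return quantize_int2_groups(clipped, group_size)

def read_path(codes, scale, lo, rotation):
    """Dequantize and undo the rotation (rotation is orthogonal, so inverse = transpose)."""
    return dequantize_int2_groups(codes, scale, lo) @ rotation.T

rng = np.random.default_rng(0)
dim = 128
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # stand-in for the key rotation
calib = rng.normal(size=(512, dim)) @ rotation           # hypothetical calibration keys
clip_threshold = np.quantile(np.abs(calib), 0.96)        # assumed meaning of the 0.96 setting

keys = rng.normal(size=(4, dim))
codes, scale, lo = write_path(keys, rotation, clip_threshold)
restored = read_path(codes, scale, lo, rotation)
print(codes.shape, restored.shape)
```

In a production kernel the dequantization and inverse rotation would be fused into the attention read, as the article describes; they are kept as separate functions here only for readability.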
Outcome
The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.
| Model | BF16 Mean | OSCAR Mean | Gap to BF16 |
|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 75.64 | 71.86 | −3.78 |
| Qwen3-8B | 70.84 | 69.42 | −1.42 |
| Qwen3-32B | 74.19 | 74.17 | −0.02 |
| GLM-4.7-FP8 (358B) | 77.89 | 78.16 | +0.27 |
The research team also compared OSCAR against channel-wise methods on AIME25, showing that OSCAR, at 2.38 bits per KV element, achieves higher accuracy than KIVI-KV2* and Kitty.
| Model | Method | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Qwen3-4B-Thinking | BF16 | 99.7 | 99.3 | 85.3 | 81.0 |
| Qwen3-4B-Thinking | QuaRot-INT2 | 0.0 | 0.0 | 15.6 | 0.0 |
| Qwen3-4B-Thinking | OSCAR | 97.8 | 87.6 | 61.9 | 39.5 |
| Qwen3-8B | BF16 | 98.9 | 97.3 | 79.2 | 78.2 |
| Qwen3-8B | QuaRot-INT2 | 19.0 | 9.8 | 0.0 | 0.0 |
| Qwen3-8B | OSCAR | 93.9 | 86.3 | 61.9 | 45.0 |
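The table above reports long-context robustness on RULER-NIAH at context lengths from 16K to 128K: OSCAR stays far closer to BF16 than QuaRot-INT2, which collapses, although its gap to BF16 widens at 64K and 128K. The research team also measured throughput on an H100 at batch size 1 for contexts of up to 100K tokens; the speedups are listed below.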
| Model | 30K | 60K | 100K |
|---|---|---|---|
| Qwen3-4B-Thinking | 1.98× | 2.52× | 3.08× |
| Qwen3-8B | 1.84× | 2.29× | 2.88× |
| GLM-4.7-FP8 | 1.98× | 2.49× | 2.83× |
At batch size 32, the throughput speedup reaches 6.17× on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8, and moving the history tokens from BF16 to INT2 cuts their KV memory by a nominal 8×.
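The 8× figure is the nominal ratio of 16-bit to 2-bit storage. A rough back-of-the-envelope sketch shows how per-group metadata and the BF16 sink and recent windows reduce the effective ratio. The 16-bit scale and 16-bit offset per group are assumptions made here for illustration; the 2.38 bits per element, the group size of 64, and the 64 + 256 BF16 tokens are the figures quoted in the article.

```python
def effective_bits(code_bits=2, group_size=64, scale_bits=16, zero_bits=16):
    """Average bits per element once per-group metadata is included (assumed sizes)."""
    return code_bits + (scale_bits + zero_bits) / group_size

def compression_ratio(seq_len, bits_quant, sink=64, recent=256, bits_full=16):
    """KV memory of an all-BF16 cache divided by that of the mixed-precision cache."""
    full_tokens = min(seq_len, sink + recent)   # tokens kept in BF16
    quant_tokens = seq_len - full_tokens        # tokens stored at bits_quant
    mixed = bits_full * full_tokens + bits_quant * quant_tokens
    return (bits_full * seq_len) / mixed

bits_g64 = effective_bits(group_size=64)        # 2 + 32/64 under the assumed metadata
ratio_100k = compression_ratio(100_000, bits_quant=2.38)
print(bits_g64, round(ratio_100k, 2))
```

Under these assumptions a 100K-token request compresses by about 6.6× rather than 8×, and the ratio approaches 16 / 2.38 as the context grows and the fixed BF16 windows become negligible.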




