OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Why INT2 KV Cache Quantization is Hard

KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most are well-behaved. When quantized to 2 bits (INT2), the quantizer wastes most of its range on rare spikes, compressing normal values into just one or two effective levels. This degrades attention quality significantly.
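To make the range problem concrete, here is a minimal sketch in Python. It assumes a plain min/max asymmetric quantizer and uses made-up values; neither is taken from the OSCAR paper, and the function names are hypothetical.

```python
def quantize_int2(values):
    """Asymmetric min/max INT2 quantization of one group (illustrative only)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 2 bits -> 4 levels -> 3 steps; avoid zero scale
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    """Map 2-bit codes back to approximate real values."""
    return [lo + c * scale for c in codes]


# Seven well-behaved values, then the same values plus one outlier channel.
normal = [-0.9, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8]
with_outlier = normal + [40.0]

codes_clean, _, _ = quantize_int2(normal)
codes_spiky, _, _ = quantize_int2(with_outlier)

print(codes_clean)  # [0, 1, 1, 2, 2, 2, 3]: all four levels are used
print(codes_spiky)  # [0, 0, 0, 0, 0, 0, 0, 3]: normal values share one level
```

With the outlier in the group, every well-behaved value lands on the same code, so the information they carried is lost after dequantization.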

What OSCAR Does Differently

The key observation is that the rotation applied before quantization should be derived from attention statistics, rather than raw KV activations. For keys and values, OSCAR estimates empirical covariance matrices to align channels with important attention directions.

[Figure: OSCAR diagram, from the research paper]

The final composed rotations are:

R_K = U_Q · H_Had · P_br

R_V = U_S · H_Had · P_br

Each of these factors addresses a distinct failure mode of per-group low-bit quantization:

U_Q / U_S: Aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so that important directions are identifiable.
H_Had: Equalizes channel importance exactly by compressing peaky eigenspectra to a uniform value across all channels.
P_br: Reorders channels so each group receives one representative from each level of the importance hierarchy, ensuring balanced quantization error distribution.

The research team proves (Theorem 1) that U_Q and U_S are optimal under a frozen-error surrogate objective with diagonal residual assumptions.
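The equalization claim for the Hadamard factor can be checked numerically. The sketch below is a toy illustration under several assumptions: the importance matrix is taken to be already diagonal (that is, the U_Q or U_S step has been applied), the eigenvalues are invented, and the "br" subscript is assumed to mean a bit-reversal permutation, which the article does not state. All function names are hypothetical.

```python
import math


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[x * s for x in row] for row in h]


def rotated_diagonal(importance, h):
    """Diagonal of H^T diag(importance) H: per-channel importance after rotation."""
    n = len(importance)
    return [sum(h[j][i] ** 2 * importance[j] for j in range(n)) for i in range(n)]


def bit_reversal_permutation(n):
    """Index order obtained by reversing the bits of 0..n-1 (assumed meaning of P_br)."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


# A peaky, made-up eigenspectrum: one direction dominates.
importance = [100.0, 10.0, 1.0, 1.0, 0.5, 0.5, 0.25, 0.25]
equalized = rotated_diagonal(importance, hadamard(8))
print(equalized)  # every entry equals the mean of the spectrum (about 14.19)

# Bit-reversal order interleaves indices, so contiguous groups mix ranks.
order = bit_reversal_permutation(8)
groups = [order[i:i + 4] for i in range(0, 8, 4)]
print(groups)  # [[0, 4, 2, 6], [1, 5, 3, 7]]
```

Because every entry of the normalized Hadamard matrix has squared magnitude 1/n, each rotated channel receives exactly the average of the original importances, which is what "equalizes exactly" means here.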

The Serving System: Mixed-Precision Cache Layout

OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility for paged attention. The KV cache layout uses three regions per request:

Sink tokens (first 64 tokens): Stored in BF16 and used as attention sinks.
Recent tokens (last 256 tokens before current position): Stored in BF16 for recent context.
History tokens: Quantized to INT2 after OSCAR rotation and clipping. For values, the relevant error is the error in the attention output SV, where S is the matrix of attention weights: directions that remain large after aggregation by S are the ones that propagate quantization error. The value rotation R_V is absorbed into the model weights offline.
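As a rough illustration of the three regions, the following sketch splits a request's token positions into sink, history, and recent ranges. The function name and the handling of sequences shorter than 320 tokens are assumptions made for this example; the article does not describe how SGLang handles those cases.

```python
SINK_TOKENS = 64     # first tokens kept in BF16 (from the article)
RECENT_TOKENS = 256  # most recent tokens kept in BF16 (from the article)


def partition_cache(seq_len, sink=SINK_TOKENS, recent=RECENT_TOKENS):
    """Return (start, end) token ranges per region; end is exclusive.

    Assumption: for short sequences the sink region is filled first and the
    history region is simply empty.
    """
    sink_end = min(sink, seq_len)
    recent_start = max(sink_end, seq_len - recent)
    return {
        "sink_bf16": (0, sink_end),
        "history_int2": (sink_end, recent_start),
        "recent_bf16": (recent_start, seq_len),
    }


print(partition_cache(100_000))
# {'sink_bf16': (0, 64), 'history_int2': (64, 99744), 'recent_bf16': (99744, 100000)}
print(partition_cache(200))
# {'sink_bf16': (0, 64), 'history_int2': (64, 64), 'recent_bf16': (64, 200)}
```

At a 100K-token context, all but 320 tokens fall into the INT2 history region, which is where the memory savings come from.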

The write path rotates each token, clips it to a calibration-derived percentile threshold (e.g., c_K = 0.96 for keys and c_V = 0.92 for values), and then quantizes it with per-token asymmetric INT2 at a default group size of G_K = 64 channels per group. The read path dequantizes, inverse-rotates, and passes the result to the attention kernel in one fused pass.
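The clip-then-quantize step can be sketched in a few lines of Python. This is not the fused kernel described above: the rotation is omitted, the threshold is computed with a simple nearest-rank percentile over magnitudes (an assumption about how the calibration threshold is derived), and all function names are hypothetical.

```python
import math


def percentile_threshold(calibration_values, c):
    """Magnitude below which a fraction c of calibration values fall (nearest rank)."""
    mags = sorted(abs(v) for v in calibration_values)
    rank = max(1, math.ceil(c * len(mags)))
    return mags[rank - 1]


def quantize_token(rotated, tau, group_size=64):
    """Clip one token's channels to [-tau, tau], then asymmetric INT2 per group."""
    groups = []
    for start in range(0, len(rotated), group_size):
        g = [max(-tau, min(tau, v)) for v in rotated[start:start + group_size]]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 3 or 1.0  # 4 levels -> 3 steps; avoid zero scale
        codes = [min(3, max(0, round((v - lo) / scale))) for v in g]
        groups.append((codes, scale, lo))
    return groups


def dequantize_token(groups):
    """Reconstruct approximate channel values from (codes, scale, offset) groups."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


# Synthetic 128-channel token with one spike; values are illustrative only.
token = [math.sin(i) for i in range(128)]
token[5] = 50.0
tau = percentile_threshold(token, 0.96)  # 0.96 mirrors the article's c_K example
restored = dequantize_token(quantize_token(token, tau))
```

Clipping before quantization keeps the spike from stretching the group's range, at the cost of a large but bounded error on the clipped channel itself.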

Outcome

The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.

Model	BF16 Mean	OSCAR Mean	Gap to BF16
Qwen3-4B-Thinking-2507	75.64	71.86	−3.78
Qwen3-8B	70.84	69.42	−1.42
Qwen3-32B	74.19	74.17	−0.02
GLM-4.7-FP8 (358B)	77.89	78.16	+0.27

The research team also compared against channel-wise methods on AIME25, showing that OSCAR, at 2.38 bits per KV element, achieves higher accuracy than KIVI-KV2* and Kitty.

Model	Method	16K	32K	64K	128K
Qwen3-4B-Thinking	BF16	99.7	99.3	85.3	81.0
Qwen3-4B-Thinking	QuaRot-INT2	0.0	0.0	15.6	0.0
Qwen3-4B-Thinking	OSCAR	97.8	87.6	61.9	39.5
Qwen3-8B	BF16	98.9	97.3	79.2	78.2
Qwen3-8B	QuaRot-INT2	19.0	9.8	0.0	0.0
Qwen3-8B	OSCAR	93.9	86.3	61.9	45.0

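The research team also evaluated long-context robustness (RULER-NIAH) and throughput (H100, batch size 1, context lengths up to 100K). In the context-length table above (16K to 128K), QuaRot-INT2 collapses to 19.0 or below at every length, while OSCAR stays within a few points of BF16 at 16K and degrades more gradually, although it trails BF16 by a widening margin as the context grows to 128K.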

Model	30K	60K	100K
Qwen3-4B-Thinking	1.98×	2.52×	3.08×
Qwen3-8B	1.84×	2.29×	2.88×
GLM-4.7-FP8	1.98×	2.49×	2.83×

At batch size 32, the throughput speedup reaches 6.17× on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8, and KV memory is reduced by 8×.
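The 8× figure matches the nominal ratio of 16-bit to 2-bit storage. As a back-of-the-envelope check, the sketch below estimates the effective bits per KV element for the mixed-precision layout, assuming that the 2.38 bits per element quoted earlier applies to history tokens and that sink and recent tokens cost 16 bits each. These assumptions, and the function name, are ours rather than the paper's.

```python
def effective_bits(seq_len, sink=64, recent=256, hist_bits=2.38, full_bits=16.0):
    """Average bits per KV element across BF16 and INT2 regions (rough estimate)."""
    bf16_tokens = min(seq_len, sink + recent)
    hist_tokens = seq_len - bf16_tokens
    return (bf16_tokens * full_bits + hist_tokens * hist_bits) / seq_len


bits_100k = effective_bits(100_000)
ratio_100k = 16.0 / bits_100k
print(round(bits_100k, 3), round(ratio_100k, 2))  # 2.424 6.6
```

Under these assumptions the BF16 sink and recent windows add very little overhead at long contexts, and the per-element metadata dominates the gap between the nominal 8× and the estimated ratio.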

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Why INT2 KV Cache Quantization is Hard

What OSCAR Does Differently

The Serving System: Mixed-Precision Cache Layout

Outcome

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

“I just want to…

Norse Atlantic Airways Offers…

OpenAI starts with infrastructure…