Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving


By AI Maestro May 25, 2026 3 min read






Why INT2 KV Cache Quantization is Hard

KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most are well-behaved. At 2 bits (INT2) a quantizer has only four levels, so it spends most of its range on the rare spikes and squeezes the normal values into just one or two effective levels. This significantly degrades attention quality.
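The collapse described above can be reproduced with a minimal sketch. It assumes a plain min/max asymmetric quantizer over one group; the function name `quantize_int2` and the sample values are made up for illustration and are not taken from OSCAR.

```python
def quantize_int2(values):
    """Asymmetric min/max quantization to 4 levels (2 bits).

    Returns the integer codes and the dequantized values.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 3 steps between 4 levels; guard against a zero range
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


# Hypothetical group of well-behaved channel values.
well_behaved = [-0.9, -0.4, -0.1, 0.2, 0.5, 0.8, 1.0]
# Same group with a single outlier channel appended.
with_outlier = well_behaved + [60.0]

codes_clean, _ = quantize_int2(well_behaved)
codes_spiky, _ = quantize_int2(with_outlier)

print("levels used without outlier:", sorted(set(codes_clean)))
print("levels used by normal values with outlier:", sorted(set(codes_spiky[:-1])))
```

Without the outlier the normal values spread over all four levels. With it, the step size grows so much that every normal value lands on the same code.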

What OSCAR Does Differently

The key observation is that the rotation applied before quantization should be derived from attention statistics, rather than raw KV activations. For keys and values, OSCAR estimates empirical covariance matrices to align channels with important attention directions.

OSCAR diagram
Figure from the research paper

The final composed rotations are:

R_K = U_Q · H_Had · P_br
R_V = U_S · H_Had · P_br

Each of these factors addresses a distinct failure mode of per-group low-bit quantization:

  • U_Q / U_S: Aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so that the important directions can be identified.
  • H_Had: A Hadamard rotation that equalizes channel importance exactly, flattening a peaky eigenspectrum to a uniform value across all channels.
  • P_br: Reorders channels so each group receives one representative from each level of the importance hierarchy, which balances the quantization error across groups.

Theorem 1 in the paper proves that U_Q and U_S are optimal under a frozen-error surrogate objective with a diagonal-residual assumption.
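The exact equalization attributed to the Hadamard factor can be checked numerically. The sketch below assumes the importance matrix has already been diagonalized (the job of the first factor) and covers only the Hadamard step. The helper names `hadamard` and `rotated_diagonal` and the example spectrum are hypothetical and do not come from the OSCAR code.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix with entries +1/-1.

    n must be a power of two.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rotated_diagonal(eigenvalues):
    """Diagonal of H diag(eigenvalues) H^T, with H scaled by 1/sqrt(n) to be orthonormal."""
    n = len(eigenvalues)
    h = hadamard(n)
    # Entry i equals sum_j H[i][j]^2 * lambda_j / n, and every H[i][j]^2 is 1.
    return [sum(h[i][j] ** 2 * eigenvalues[j] for j in range(n)) / n for i in range(n)]


# Hypothetical peaky spectrum: one dominant direction, many small ones.
peaky = [40.0, 6.0, 1.0, 0.5, 0.2, 0.15, 0.1, 0.05]
print(rotated_diagonal(peaky))
```

Because every entry of the Hadamard matrix has the same magnitude, each channel of the rotated matrix receives the mean of the spectrum, which is why the equalization is exact rather than approximate.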

The Serving System: Mixed-Precision Cache Layout

OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility for paged attention. The KV cache layout uses three regions per request:

  • Sink tokens (first 64 tokens): Stored in BF16 and used as attention sinks.
  • Recent tokens (last 256 tokens before current position): Stored in BF16 for recent context.
  • History tokens: Quantized to INT2 after OSCAR rotation and clipping. For values, the error that matters is the error in the attention output SV, so directions that remain large after aggregation by S are the ones that propagate quantization error. The value rotation R_V is absorbed into the model weights offline.
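The three regions can be expressed as index ranges over a request's tokens. This is a minimal sketch using the sink and recent window sizes quoted above; the function name `split_cache_regions` and the handling of short sequences are assumptions and do not reflect SGLang's actual implementation.

```python
def split_cache_regions(seq_len, num_sink=64, num_recent=256):
    """Return (sink, history, recent) as half-open (start, end) token ranges.

    Sink and recent tokens would stay in BF16; history tokens would be INT2.
    """
    sink_end = min(num_sink, seq_len)
    # The recent window never overlaps the sink region.
    recent_start = max(sink_end, seq_len - num_recent)
    return (0, sink_end), (sink_end, recent_start), (recent_start, seq_len)


for n in (50, 300, 100_000):
    sink, history, recent = split_cache_regions(n)
    print(f"seq_len={n}: sink={sink} history={history} recent={recent}")
```

Under these assumptions, the history region stays empty until a request exceeds 320 tokens, so short requests are served entirely from BF16.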

The write path rotates each token, clips it to a calibration-derived percentile threshold (e.g., c_K = 0.96 for keys and c_V = 0.92 for values), then quantizes it with per-token asymmetric INT2 at a default group size of G_K = 64 channels. The read path dequantizes, inverse-rotates, and passes the results to the attention kernel in one fused pass.
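The clip, quantize, and dequantize steps can be sketched in plain Python. The rotation and the fused kernel are left out. The sketch assumes the percentile is taken over absolute values and that each group stores its own scale and zero point; every function name and the test vector are hypothetical.

```python
import math


def percentile_threshold(calibration_values, q):
    """Magnitude below which a fraction q of calibration values fall (assumed reading of c_K / c_V)."""
    mags = sorted(abs(v) for v in calibration_values)
    return mags[min(len(mags) - 1, int(q * len(mags)))]


def clip(values, threshold):
    """Clamp every value into [-threshold, threshold]."""
    return [max(-threshold, min(threshold, v)) for v in values]


def quantize_group(group):
    """Asymmetric INT2 for one group: returns (codes, scale, zero_point)."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 3 or 1.0  # 4 levels -> 3 steps; guard against a flat group
    codes = [min(3, max(0, round((v - lo) / scale))) for v in group]
    return codes, scale, lo


def quantize_token(vector, threshold, group_size=64):
    """Clip one token's channel vector, then quantize it group by group."""
    clipped = clip(vector, threshold)
    groups = [clipped[i:i + group_size] for i in range(0, len(clipped), group_size)]
    return [quantize_group(g) for g in groups]


def dequantize_token(packed):
    """Invert quantize_token up to rounding error."""
    return [zero + c * scale for codes, scale, zero in packed for c in codes]


# Hypothetical 128-channel token with one outlier channel.
vector = [math.sin(i) for i in range(128)]
vector[5] = 25.0
threshold = percentile_threshold(vector, 0.96)  # calibrated on the same data for brevity
packed = quantize_token(vector, threshold)
restored = dequantize_token(packed)
print("threshold:", round(threshold, 4), "groups:", len(packed))
```

Clipping before quantization caps the range each group has to cover, so the outlier channel no longer dictates the step size for its neighbors.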

Outcome

The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.

| Model | BF16 Mean | OSCAR Mean | Gap to BF16 |
| --- | --- | --- | --- |
| Qwen3-4B-Thinking-2507 | 75.64 | 71.86 | −3.78 |
| Qwen3-8B | 70.84 | 69.42 | −1.42 |
| Qwen3-32B | 74.19 | 74.17 | −0.02 |
| GLM-4.7-FP8 (358B) | 77.89 | 78.16 | +0.27 |

The research team also compared OSCAR against channel-wise methods on AIME25, showing that OSCAR, at 2.38 bits per KV element, achieves higher accuracy than KIVI-KV2* and Kitty.

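| Model | Method | 16K | 32K | 64K | 128K |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B-Thinking | BF16 | 99.7 | 99.3 | 85.3 | 81.0 |
| Qwen3-4B-Thinking | QuaRot-INT2 | 0.0 | 0.0 | 15.6 | 0.0 |
| Qwen3-4B-Thinking | OSCAR | 97.8 | 87.6 | 61.9 | 39.5 |
| Qwen3-8B | BF16 | 98.9 | 97.3 | 79.2 | 78.2 |
| Qwen3-8B | QuaRot-INT2 | 19.0 | 9.8 | 0.0 | 0.0 |
| Qwen3-8B | OSCAR | 93.9 | 86.3 | 61.9 | 45.0 |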

The research team compared long-context robustness (RULER-NIAH) and throughput (H100, 100K context, batch size 1), showing OSCAR matches the BF16 curve through 128K context.

| Model | 30K | 60K | 100K |
| --- | --- | --- | --- |
| Qwen3-4B-Thinking | 1.98× | 2.52× | 3.08× |
| Qwen3-8B | 1.84× | 2.29× | 2.88× |
| GLM-4.7-FP8 | 1.98× | 2.49× | 2.83× |

At batch size 32, the throughput speedup reaches 6.17× on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8, while the INT2 cache reduces KV memory by 8×.
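The 8× figure matches the nominal ratio of 16-bit BF16 to 2-bit storage. As a rough back-of-the-envelope sketch, the average cost per KV element under the mixed-precision layout can be estimated as follows. It assumes that the 2.38 bits per element quoted earlier is the full cost of a history token and that sink and recent tokens cost 16 bits each; the article does not state how the 2.38 figure is accounted, and the function name is hypothetical.

```python
def effective_bits_per_element(seq_len, history_bits=2.38, full_bits=16,
                               num_sink=64, num_recent=256):
    """Average bits per KV element when sink/recent tokens stay at full precision."""
    full_tokens = min(seq_len, num_sink + num_recent)
    history_tokens = seq_len - full_tokens
    return (full_tokens * full_bits + history_tokens * history_bits) / seq_len


for n in (30_000, 60_000, 100_000):
    bits = effective_bits_per_element(n)
    print(f"{n:>7} tokens: {bits:.3f} bits/element, {16 / bits:.2f}x smaller than BF16")
```

Under these assumptions the BF16 windows matter less as the context grows, so the average cost approaches the per-history-token cost at long context lengths.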

Marktechpost’s Visual Explainer


The visual explainer provides a step-by-step guide to understanding how OSCAR works, highlighting its key components and their roles in improving KV cache quantization for long-context LLM serving.





