Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

By AI Maestro, June 25, 2026

Baidu has released Unlimited OCR, a 3-billion-parameter model designed to keep memory usage constant while parsing long documents. The system replaces standard decoder attention with a mechanism that keeps the KV cache flat, allowing it to process dozens of pages in a single pass without performance degradation.

Core specifications

Unlimited OCR is built on a Mixture-of-Experts architecture. The total parameter count is 3B, though only 500M parameters are active during inference. The model parses documents up to a maximum length of 32K in one forward pass.

The model scores 93.23 on OmniDocBench v1.5, beating the DeepSeek OCR baseline by 6.22 points. The developers reached this result by continuing training from DeepSeek OCR rather than training a model from scratch.

Architecture and compression

The system retains the DeepEncoder and the Mixture-of-Experts decoder from its predecessor. The DeepEncoder functions as a compression engine: it combines a SAM-ViT that uses window attention with a CLIP-ViT that uses global attention, and applies 16× token compression at the bridge.

A 1024×1024 PDF image reduces to just 256 visual tokens. Fewer input tokens mean less prefill work. The DeepEncoder natively supports five resolution modes, but Unlimited OCR keeps only two. The ‘Base’ mode runs at 1024×1024 for multi-page work. The ‘Gundam’ mode uses dynamic resolution for single pages.
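The 256-token figure can be checked with a few lines of arithmetic. The sketch below assumes a 16-pixel ViT patch size, which the article does not state; the function name and parameters are illustrative only. Under that assumption, a 1024×1024 image yields 4,096 patch tokens, and the 16× compression brings that down to 256.

```python
def visual_tokens(width: int, height: int, patch: int = 16, compression: int = 16) -> int:
    """Estimate visual tokens after compression.

    patch=16 is an assumed ViT patch size (not given in the article);
    compression=16 is the 16x token compression described above.
    """
    patches = (width // patch) * (height // patch)  # tokens before compression
    return patches // compression                   # tokens after compression


print(visual_tokens(1024, 1024))  # 256
```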

How R-SWA keeps the cache constant

The core contribution is Reference Sliding Window Attention. Standard Multi-Head Attention stores a key and value for every token. As the output length grows, the cache grows with it. Memory and latency climb without bound.

R-SWA breaks that link. Each generated token attends to all m reference tokens, which cover the visual tokens and the prompt. It also attends to the preceding n output tokens, where n defaults to 128. Everything older is evicted. The cache becomes a fixed-size queue of m + n entries.
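The eviction rule described above can be sketched as a small data structure. This is a toy illustration of the bookkeeping only, not Baidu's implementation: the class name `RSWACache` and its methods are hypothetical, and the entries are plain integers standing in for key/value pairs. The reference count of 266 (256 visual tokens plus a 10-token prompt) is also an assumed example.

```python
from collections import deque


class RSWACache:
    """Toy cache: all reference entries are kept, plus the last `window` outputs."""

    def __init__(self, reference_tokens, window: int = 128):
        self.reference = list(reference_tokens)  # m entries, never evicted
        self.recent = deque(maxlen=window)       # at most n output entries

    def append(self, token) -> None:
        # A deque with maxlen drops its oldest item once it is full.
        self.recent.append(token)

    def visible(self) -> list:
        # Entries the next generated token may attend to.
        return self.reference + list(self.recent)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


# Assumed example: 256 visual tokens + 10 prompt tokens as the reference.
cache = RSWACache(range(266), window=128)
for step in range(1000):
    cache.append(step)

print(len(cache))  # 394, i.e. m + n = 266 + 128
```

After 1,000 generated tokens the cache still holds 394 entries, and only the most recent 128 outputs remain in the window.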

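The cache size is therefore bounded by a constant. A full cache after T output tokens holds m + T entries, so as T grows far beyond n, the ratio of the R-SWA cache to the full cache, (m + n) / (m + T), trends toward zero. Memory stays flat and per-step latency stays flat.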
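To make the trend concrete, the sketch below compares entry counts for a full cache and an R-SWA cache as the output grows. It takes the cache ratio to mean the R-SWA size divided by the full-cache size, which is an interpretation rather than a definition given in the article. The values m = 256 and n = 128 are example inputs, and `cache_sizes` is a hypothetical helper.

```python
def cache_sizes(m: int, n: int, t: int) -> tuple[int, int, float]:
    """Return (full cache entries, R-SWA cache entries, ratio) after t output tokens."""
    full = m + t           # standard attention keeps every token
    rswa = m + min(t, n)   # R-SWA keeps the reference plus at most n outputs
    return full, rswa, rswa / full


m, n = 256, 128  # example values: one Base-mode page, default window
for t in (128, 1024, 8192, 32768):
    full, rswa, ratio = cache_sizes(m, n, t)
    print(f"outputs={t:>6}  full={full:>6}  r-swa={rswa:>4}  ratio={ratio:.4f}")
```

The R-SWA column stops growing once the output exceeds n, while the full cache keeps climbing, so the ratio shrinks with every additional token.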

The research team compares this to soft forgetting. A person copying a book glances at the source and the last few words. They do not re-read everything transcribed so far. Visual tokens never undergo state updates. That avoids the progressive blurring seen in linear attention.