Let’s be honest for a second. The way most of us do OCR on long documents is a hack.
You’ve got a 40-page contract, a dense academic thesis, or a stack of legal filings. You don’t just feed it to the model. You write a script that slices it page-by-page, processes each one in a for-loop, resets the memory, stitches the outputs back together, and prays that the table spanning pages 3 and 4 doesn’t turn into digital confetti.
This isn’t a solution. It’s an engineering workaround that we’ve normalized for years. And it has a dirty secret: every page boundary you cut is a potential error cascade waiting to happen. Cross-page references break. Layout context vanishes. Footnote anchors lose their targets.
Baidu just released Unlimited-OCR, and it makes that entire workflow look like a relic from 2023.

The Core Claim: Parse 40 Pages in One Go Without Slowing Down
Unlimited-OCR is a 3.3B parameter end-to-end model (with only 0.5B activated via MoE) released under the MIT license. It can ingest dozens of document pages in a single forward pass and produce over 32K output tokens with an output latency that does not increase as the generation gets longer.
This isn’t just impressive on paper. The benchmark results are unambiguous. On the OmniDocBench v1.5 benchmark, Unlimited-OCR scores 93.23% overall, blowing past the DeepSeek-OCR baseline by a full 6.22 percentage points. On the newer v1.6 benchmark, it achieves 93.92%, matching or beating every other model including Qwen3-VL (235B), DeepSeek-OCR-2, and Qianfan-OCR.
The Architectural Trick: Stop Treating Memory Like It’s Infinite
The fundamental insight here is almost embarassingly simple once you see it. When a human copies a book by hand, they don’t re-read everything they’ve already written. They glance at the source material, check the last few characters they’ve just written, and move on. They maintain a continuous cognitive state with intentional forgetting.
Standard decoder attention does the opposite. Every new token attends to every previous token in the sequence. As the output grows from 1K to 10K to 50K tokens, the KV cache grows linearly, memory balloons, and generation speed plummets. It’s a design that actively fights against the very task it’s supposed to perform.
The team behind Unlimited-OCR, led by Youyang Yin at Baidu, designed Reference Sliding Window Attention (R-SWA) to mimic that human working memory pattern.
Here’s how it works:
- Reference tokens (the visual embeddings from your pages, plus the prompt) are globally visible to every generated token. They never decay or get evicted.
- Output tokens are constrained to a causal sliding window of just 128 tokens. The model can only look back at the most recent 128 tokens it generated.
The magic is in the aftermath. Under standard Multi-Head Attention, after generating T tokens, the KV cache size is L_m + T. Under R-SWA, it’s L_m + min(n, T) where n is the window size (128). In plain English: the KV cache hits a maximum and then stops growing.
The cache ratio tells the story:
ρ(T) = (L_m + n) / (L_m + T) → 0 as T increases
At 6,000 generated tokens, R-SWA’s memory footprint is less than 10% of what standard attention requires. That’s the difference between being able to process a 40-page document on a single GPU and hitting OOM errors at page 12.
Speed Doesn’t Degrade. It Doesn’t Even Flinch.
The efficiency numbers from the paper are worth a close look. Under ideal concurrency, the team benchmkarked theoretical TPS (tokens per second) across different output lengths:
| Output Length | DeepSeek-OCR TPS | Unlimited-OCR TPS | Advantage |
|---|---|---|---|
| 256 | 7,229 | 7,229 | ~0% |
| 512 | 7,468 | 7,714 | +3.3% |
| 1,024 | 7,422 | 7,840 | +5.6% |
| 2,048 | 7,166 | 7,881 | +10.0% |
| 4,096 | 6,430 | 7,905 | +22.9% |
| 6,144 | 5,822 | 7,847 | +34.8% |
DeepSeek-OCR’s TPS steadily degrades as output length increases. Unlimited-OCR’s stays essentially flat. At 6K tokens, the speed advantage is nearly 35%.
The kernel latency comparison is even more revealing. Standard Flash Attention v3 shows steadily increasing per-call duration as the KV cache grows, punctuated by spikes when cache alignment boundaries are crossed. R-SWA shows a flat horizontal line.
This is what production stability looks like.
Benchmarks: It’s Not Just Efficiency, It’s Accuracy
You might expect that limiting attention to a sliding window would hurt performance. The subcategory analysis on OmniDocBench v1.5 proves otherwise. Across nine document types, PPT, academic papers, books, textbooks, exam papers, magazines, newspapers, notes, and research reports, Unlimited-OCR beats DeepSeek-OCR on every single metric.
Some highlights from the subcategory breakdown:
- Text Edit Distance on Newspaper: DeepSeek-OCR 0.131 → Unlimited-OCR 0.081 (38% improvement)
- Text Edit Distance on Note: DeepSeek-OCR 0.145 → Unlimited-OCR 0.066 (54% improvement)
- Text Edit Distance on Magazine: DeepSeek-OCR 0.049 → Unlimited-OCR 0.020 (59% improvement)
The “free lunch” observation from the paper is real. R-SWA actually improves accuracy while reducing computational cost. The working hypothesis is that full attention can diverge on long sequences, with the model diluting its focus across irrelevant distant tokens. R-SWA’s locality forces the model to route information more efficiently through the sliding window.
Long-Horizon Parsing: The Real Test
The multi-page benchmark is where theory hits reality. The team tested on books and documents of varying lengths:
| Pages | Distinct-35 | Edit Distance |
|---|---|---|
| 2 | 99.87% | 0.0362 |
| 5 | 99.98% | 0.0452 |
| 10 | 99.83% | 0.0526 |
| 20 | 99.89% | 0.0572 |
| 40+ | 96.90% | 0.1069 |
Even at 40+ pages, the Distinct-35 score stays above 96% and the edit distance stays below 0.11. The researchers note that most errors at 40+ pages come from small text being difficult to discern at the 1024×1024 resolution used in multi-page mode, not from the model losing its place in the parsing process.
This is the architectural breakthrough of constant-memory OCR that makes full-document parsing possible.
The Technical Details That Matter
DeepEncoder’s High Compression
Unlimited-OCR retains the DeepEncoder from DeepSeek-OCR, which cascades SAM-ViT and CLIP-ViT with a 16x token compression. A 1024×1024 PDF page compresses to just 256 visual tokens. For a 20-page document, that’s about 5,120 tokens of prefix, leaving 27K tokens of context for generation.
The 16x compression is critical because visual tokens are static in R-SWA. They’re encoded once at the start and never updated. This high compression means you can fit more pages into the prefix without starving the output.
The Queue-Based KV Cache
The KV cache is implemented as a FIFO queue with capacity m + n. Each time a new token is generated, the KV corresponding to the (m+1)-th token in the queue gets evicted. This is almost trivial to implement, which means it’s easy to reason about in production.
Inference Setup
You can run it with standard Transformers:
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained(
'baidu/Unlimited-OCR',
trust_remote_code=True,
use_safetensors=True,
torch_dtype=torch.bfloat16,
).eval().cuda()
# PDF-native multi-page
model.infer_multi(
tokenizer,
prompt='<image>Multi page parsing.',
image_files=pdf_to_images('your_doc.pdf', dpi=300),
output_path='./output',
image_size=1024,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=1024,
save_results=True,
)
For high-throughput production workloads, the bundled SGLang server provides an OpenAI-compatible API with streaming support and concurrent batch processing.
Two Image Modes
- gundam: 640px image size with aggressive cropping for single-page speed
- base: 1024px image size for multi-page fidelity
The naming choice is a reference to Fate/stay night ("I am the page of my document" has become an instant meme in the developer community), which honestly just makes the whole project more endearing.
The MIT License Changes the Game
This is not a research-only release. The MIT license means you can use it commercially, modify it, fork it, embed it in your product without legal gymnastics. Combined with the constant-memory architecture, this makes Unlimited-OCR a legitimate competitor to managed APIs for teams that need to process sensitive documents on their own infrastructure.
For teams evaluating the cost-benefit math of running open-source models locally versus paid API alternatives, the math just got significantly more favorable on the self-hosted side.
The Limitations Are Honest
The paper is refreshingly candid about what doesn’t work yet. The prefill length is still bounded by the 32K context limit. Multi-page mode tops out at 1024×1024 resolution, which can be problematic for documents with very small text. The proposed future work includes training 128K context versions and building a “prefill pool” architecture where the model learns to fetch KV chunks dynamically, simulating page-flipping.
Why This Matters Beyond OCR
The broader significance is architectural. R-SWA is a general-purpose parsing attention mechanism. It’s applicable to any task with static reference input and dynamic long output. Speech recognition. Machine translation. Video captioning. Long-document summarization. Anywhere you have a fixed context that drives a potentially unbounded output.
The paper’s conclusion makes this explicit: “We believe that R-SWA will be applied to more tasks in the future, making attention computation and memory footprint no longer the bottleneck for long-horizon parsing field.”
This is the direction long-context AI should be going. Not brute-force scaling context windows to 1M tokens and hoping the quadratic costs don’t destroy your budget. But designing attention patterns that mimic human cognitive constraints, focused, local, and persistent only where it needs to be.
This also speaks to the real-world challenges of document quality that full-document OCR aims to solve. If your RAG pipeline is failing because your PDFs are fragmented across page boundaries, the fix might not be better chunking, it’s a better parser.
Getting the Code
The repository is at github.com/baidu/Unlimited-OCR. The model is on Hugging Face at baidu/Unlimited-OCR. The paper is on arXiv at 2606.23050. The inference code is ready to run with both Transformers and SGLang.
The community response has been immediate. The repo collected 6.4K stars in its first three days. The Reddit thread hit 846 upvotes with top comments pointing out that the name references Fate/stay night and that Baidu is now setting the pace for open-source OCR.
The page-by-page for-loop has been the standard for so long that most developers stopped questioning it. Unlimited-OCR asks a better question: why reset memory at all? The answer is a model that can parse an entire book in one shot, maintain cross-page context, and do it on a standard GPU with predictable latency.
The era of long-horizon parsing has arrived, and it’s open-source.




