DeepSeek-OCR represents a paradigm shift in how Large Language Models (LLMs) handle long-context processing by leveraging vision as a compression medium. This article provides a detailed walkthrough of the DeepSeek-OCR paper, covering its architecture, innovations, benchmarks, and practical implications.
Architecture Overview
DeepSeek-OCR introduces a unified end-to-end Vision-Language Model (VLM) designed for optical context compression, where text is rendered into images and encoded into a compact sequence of vision tokens. This approach targets the quadratic cost of self-attention that makes long text sequences expensive for LLMs.
Core Components
The model consists of two main components:
- DeepEncoder: A novel vision encoder that maps high-resolution document images into a minimal number of vision tokens.
- DeepSeek3B-MoE-A570M: A 3-billion parameter Mixture-of-Experts (MoE) language model with ~570 million active parameters per token, responsible for decoding the compressed visual representation into structured text.
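To make the division of labor concrete, here is a toy sketch of how the two parts fit together. The module names, widths, and cross-attention wiring are illustrative stand-ins rather than the released DeepSeek-OCR code (the real model feeds the vision tokens to the LLM as a prefix): the encoder collapses a rendered page into a short token sequence, and the decoder produces text conditioned on it.

import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    """Toy stand-in for the DeepEncoder -> MoE-decoder pipeline (not the real code)."""
    def __init__(self, d=512, vocab=32000):
        super().__init__()
        # DeepEncoder stand-in: image -> a short sequence of vision tokens
        self.encoder = nn.Conv2d(3, d, kernel_size=64, stride=64)       # 1024x1024 -> 16x16 = 256 tokens
        # Decoder stand-in: text positions attend to the compressed vision context
        self.decoder = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.embed = nn.Embedding(vocab, d)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, page, text_ids):
        vis = self.encoder(page).flatten(2).transpose(1, 2)             # (B, 256, d): the compressed context
        hid = self.decoder(tgt=self.embed(text_ids), memory=vis)        # decode text conditioned on vision tokens
        return self.lm_head(hid)                                        # next-token logits

logits = PipelineSketch()(torch.zeros(2, 3, 1024, 1024), torch.zeros(2, 8, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 8, 32000])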

DeepEncoder: The Vision Engine
DeepEncoder is designed to satisfy five key requirements: high-resolution input support, low activation memory, minimal vision tokens, multi-resolution adaptability, and moderate parameter count — goals unmet by existing VLM encoders.
Hybrid Architecture
DeepEncoder combines:
- SAM-Base (80M): A window-attention module for local perception.
- CLIP-Large (300M): A global-attention module for semantic understanding.
- 16× Convolutional Compressor: A 2-layer conv module that reduces tokens from 4096 to 256 before global attention.
This design allows the model to process 1024×1024 images while maintaining low memory footprint and enabling efficient training.
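The token arithmetic is the important part: a 1024×1024 image patched at 16×16 yields 64×64 = 4096 tokens, which the two-layer compressor reduces 16× to 256 before the expensive global attention runs. The shape-level sketch below illustrates that flow; the attention blocks are placeholders for SAM-Base and CLIP-Large, and the widths are only indicative.

import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim_local=768, dim_global=1024):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim_local, kernel_size=16, stride=16)  # 1024/16 = 64 -> 64*64 = 4096 tokens
        self.local_attn = nn.Identity()   # stand-in for SAM-Base window attention (cheap at high token count)
        # 2-layer conv compressor: two stride-2 convs give a 4x reduction per axis = 16x fewer tokens
        self.compressor = nn.Sequential(
            nn.Conv2d(dim_local, dim_global, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_global, dim_global, kernel_size=3, stride=2, padding=1),
        )
        self.global_attn = nn.Identity()  # stand-in for CLIP-Large global attention (runs on few tokens)

    def forward(self, x):                         # x: (B, 3, 1024, 1024)
        feat = self.patch_embed(x)                # (B, C, 64, 64)  -> 4096 spatial tokens
        feat = self.local_attn(feat)
        feat = self.compressor(feat)              # (B, C, 16, 16)  -> 256 spatial tokens
        tokens = feat.flatten(2).transpose(1, 2)  # (B, 256, C)
        return self.global_attn(tokens)

print(DeepEncoderSketch()(torch.zeros(1, 3, 1024, 1024)).shape)  # torch.Size([1, 256, 1024])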

Multi-Resolution Support
DeepSeek-OCR supports five input modes to handle diverse document types:
- Tiny: 512×512 input, 64 vision tokens
- Small: 640×640 input, 100 vision tokens
- Base: 1024×1024 input, 256 vision tokens
- Large: 1280×1280 input, 400 vision tokens
- Gundam: dynamic tiling of 640×640 local views plus a 1024×1024 global view

For ultra-high-resolution inputs, the Gundam mode uses a tiling strategy with local and global views, enabling parsing of documents with 4,000+ text tokens.
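For reference, these modes correspond to the base_size / image_size / crop_mode arguments accepted by the inference call shown later in this article. The small helper below is our own convenience wrapper; the values mirror the mode table above and should be double-checked against the official model card.

MODE_CONFIGS = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),  # ~64 vision tokens
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),  # ~100 vision tokens
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),  # ~256 vision tokens
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),  # ~400 vision tokens
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),   # tiled local views + one global view
}

def mode_kwargs(mode: str) -> dict:
    # e.g. model.infer(tokenizer, prompt=..., image_file=..., **mode_kwargs("gundam"))
    return dict(MODE_CONFIGS[mode.lower()])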

Optical Context Compression: The Core Innovation
The key innovation of DeepSeek-OCR is vision-text compression: converting text into images and encoding them into far fewer vision tokens.
Compression-Ratio Analysis
Tests on the Fox benchmark reveal that:
- At 10× compression (e.g., 1000 text tokens → 100 vision tokens), OCR precision reaches 96.8%.
- Even at 20× compression, accuracy remains at ~60%, showing resilience to high compression.
This suggests that compact LLMs can effectively decode visually compressed text, opening new possibilities for efficient long-context processing.
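The compression ratio in these experiments is simply the number of ground-truth text tokens on a page divided by the number of vision tokens the encoder emits for it; a trivial helper makes the arithmetic explicit.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    # How many text tokens each vision token has to "carry".
    return text_tokens / vision_tokens

print(compression_ratio(1000, 100))  # 10.0 -> 96.8% OCR precision reported above
print(compression_ratio(2000, 100))  # 20.0 -> ~60% accuracy reported above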

Performance Benchmarks
OmniDocBench: State-of-the-Art OCR
DeepSeek-OCR outperforms existing models on the OmniDocBench benchmark while using significantly fewer vision tokens.

Lower edit distance (ED) indicates better performance.
Despite using only 100 vision tokens, DeepSeek-OCR (Small) surpasses GOT-OCR2.0 (256 tokens), and the Gundam mode outperforms MinerU2.0 (7,000 tokens).
Practical Efficiency
In production, DeepSeek-OCR achieves:
- 200,000+ pages per day on a single A100-40G GPU.
- 33 million pages per day on a 20-node cluster (8×A100 each).
This makes it highly scalable for enterprise document processing and LLM pretraining data generation.
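The cluster figure follows from straightforward scaling of the single-GPU rate, assuming near-linear scaling across GPUs (a simplification):

pages_per_gpu_per_day = 200_000
gpus = 20 * 8                          # 20 nodes x 8 A100s each
print(pages_per_gpu_per_day * gpus)    # 32,000,000 pages/day, consistent with the ~33M figure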
Implementation and Data Pipeline
Training Strategy
The model is trained in two stages:
- DeepEncoder pretraining: the encoder is trained with a compact language model via next-token prediction on OCR and general vision data.
- End-to-end fine-tuning on a data mix of:
  - 70% OCR data (100 languages)
  - 20% general vision data
  - 10% text-only data
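The fine-tuning ratios can be read as sampling weights over data sources. The sketch below (the source labels are just placeholders for the corresponding datasets) draws sources in that proportion:

import random

MIXTURE = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    # Draw one data source with probability equal to its mixture weight.
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(10)])  # sources appear in roughly a 7:2:1 ratio over many draws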
OCR 1.0 and OCR 2.0
The training data includes:
- OCR 1.0: Traditional text recognition (30M pages, 100 languages).
- OCR 2.0: Complex parsing tasks:
  - Charts → HTML tables
  - Chemical formulas → SMILES
  - Geometry → structured dictionaries

Deep Parsing and Multimodal Capabilities
DeepSeek-OCR enables recursive parsing through “deep parsing” mode — where the model can parse embedded content like charts, chemical formulas, and natural images within documents using a single prompt.
Multilingual Support
The model supports OCR in nearly 100 languages, including Arabic, Sinhala, and many lower-resource languages, and it can produce both layout-aware and plain-text outputs.
General Vision Understanding
Despite being OCR-focused, DeepSeek-OCR retains general vision capabilities:
- Image captioning
- Object detection
- Visual grounding
- Scene understanding
Theoretical Implications: Simulating Forgetting
DeepSeek-OCR’s compression approach mirrors human memory decay. By progressively resizing older document images, the model can simulate “forgetting” — where distant context becomes blurred and token-efficient, while recent context remains high-fidelity.
This suggests a path toward biologically inspired memory systems in LLMs, where resource allocation dynamically scales with recency and importance.
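As a purely illustrative sketch of that idea (the decay schedule and helper below are our own, not from the paper), older pages could be re-rendered at progressively lower resolution so they occupy fewer vision tokens, while the most recent pages stay sharp:

from PIL import Image

def resize_for_age(page: Image.Image, age: int, base: int = 1024, floor: int = 256) -> Image.Image:
    # Halve the rendering resolution for each step back in time, down to a floor.
    side = max(base // (2 ** age), floor)
    return page.resize((side, side))   # assumes square page renders for simplicity

# A page from three turns ago would be stored at 256x256 instead of 1024x1024,
# shrinking its vision-token budget accordingly.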
Python Implementation
Dependency Resolution
!pip install -q transformers==4.46.3
!pip install -q tokenizers==0.20.3
!pip install -q einops
!pip install -q addict
!pip install -q easydict
!pip install -q flash-attn==2.7.3 --no-build-isolation
Basic Implementation
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
).eval().cuda().to(torch.bfloat16)

# Grounded document-to-markdown prompt; note the newline after the image tag
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'input.jpg'
output_path = 'output/'

# base_size=1024, image_size=640, crop_mode=True runs the tiled (Gundam) configuration
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True
)
Conclusion
DeepSeek-OCR is not just an OCR model — it’s a proof-of-concept for vision-as-compression in LLMs. By converting text into visual tokens, it achieves:
- 10–20× compression with high fidelity
- Superior OCR performance using fewer tokens
- Scalable production throughput
The model’s architecture and findings open new research directions in long-context modeling, memory efficiency, and multimodal pretraining. As the paper concludes, this approach may enable theoretically unlimited context by balancing retention and compression across time and space.