DeepSeek-OCR represents a paradigm shift in how Large Language Models (LLMs) handle long-context processing by leveraging vision as a compression medium. This article provides a detailed walkthrough of the DeepSeek-OCR paper, covering its architecture, innovations, benchmarks, and practical implications.
Architecture Overview
DeepSeek-OCR introduces a unified end-to-end Vision-Language Model (VLM) designed for optical context compression, where text is rendered into images and encoded into a compact sequence of vision tokens. This approach targets the quadratic cost of self-attention that makes long text sequences expensive for LLMs.
Core Components
The model consists of two main components:
- DeepEncoder: A novel vision encoder that maps high-resolution document images into a minimal number of vision tokens.
- DeepSeek3B-MoE-A570M: A 3-billion parameter Mixture-of-Experts (MoE) language model with ~570 million active parameters per token, responsible for decoding the compressed visual representation into structured text.
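To make the division of labor concrete, here is a toy sketch of how the two parts fit together. The module names, widths, and cross-attention wiring are illustrative stand-ins rather than the released DeepSeek-OCR code (the real model feeds the vision tokens to the LLM as a prefix): the encoder collapses a rendered page into a short token sequence, and the decoder produces text conditioned on it.

import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    """Toy stand-in for the DeepEncoder -> MoE-decoder pipeline (not the real code)."""
    def __init__(self, d=512, vocab=32000):
        super().__init__()
        # DeepEncoder stand-in: image -> a short sequence of vision tokens
        self.encoder = nn.Conv2d(3, d, kernel_size=64, stride=64)       # 1024x1024 -> 16x16 = 256 tokens
        # Decoder stand-in: text positions attend to the compressed vision context
        self.decoder = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.embed = nn.Embedding(vocab, d)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, page, text_ids):
        vis = self.encoder(page).flatten(2).transpose(1, 2)             # (B, 256, d): the compressed context
        hid = self.decoder(tgt=self.embed(text_ids), memory=vis)        # decode text conditioned on vision tokens
        return self.lm_head(hid)                                        # next-token logits

logits = PipelineSketch()(torch.zeros(2, 3, 1024, 1024), torch.zeros(2, 8, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 8, 32000])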

DeepEncoder: The Vision Engine
DeepEncoder is designed to satisfy five key requirements: high-resolution input support, low activation memory, minimal vision tokens, multi-resolution adaptability, and moderate parameter count — goals unmet by existing VLM encoders.
Hybrid Architecture
DeepEncoder combines:
- SAM-Base (80M): A window-attention module for local perception.
- CLIP-Large (300M): A global-attention module for semantic understanding.
- 16× Convolutional Compressor: A 2-layer conv module that reduces tokens from 4096 to 256 before global attention.
This design allows the model to process 1024×1024 images while maintaining low memory footprint and enabling efficient training.
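The token arithmetic is the important part: a 1024×1024 image patched at 16×16 yields 64×64 = 4096 tokens, which the two-layer compressor reduces 16× to 256 before the expensive global attention runs. The shape-level sketch below illustrates that flow; the attention blocks are placeholders for SAM-Base and CLIP-Large, and the widths are only indicative.

import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim_local=768, dim_global=1024):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim_local, kernel_size=16, stride=16)  # 1024/16 = 64 -> 64*64 = 4096 tokens
        self.local_attn = nn.Identity()   # stand-in for SAM-Base window attention (cheap at high token count)
        # 2-layer conv compressor: two stride-2 convs give a 4x reduction per axis = 16x fewer tokens
        self.compressor = nn.Sequential(
            nn.Conv2d(dim_local, dim_global, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_global, dim_global, kernel_size=3, stride=2, padding=1),
        )
        self.global_attn = nn.Identity()  # stand-in for CLIP-Large global attention (runs on few tokens)

    def forward(self, x):                         # x: (B, 3, 1024, 1024)
        feat = self.patch_embed(x)                # (B, C, 64, 64)  -> 4096 spatial tokens
        feat = self.local_attn(feat)
        feat = self.compressor(feat)              # (B, C, 16, 16)  -> 256 spatial tokens
        tokens = feat.flatten(2).transpose(1, 2)  # (B, 256, C)
        return self.global_attn(tokens)

print(DeepEncoderSketch()(torch.zeros(1, 3, 1024, 1024)).shape)  # torch.Size([1, 256, 1024])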

Multi-Resolution Support
DeepSeek-OCR supports five input modes to handle diverse document types:
- Tiny: 512×512 input, 64 vision tokens
- Small: 640×640 input, 100 vision tokens
- Base: 1024×1024 input, 256 vision tokens
- Large: 1280×1280 input, 400 vision tokens
- Gundam: dynamic tiling of 640×640 local views plus a 1024×1024 global view

For ultra-high-resolution inputs, the Gundam mode uses a tiling strategy with local and global views, enabling parsing of documents with 4,000+ text tokens.
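For reference, these modes correspond to the base_size / image_size / crop_mode arguments accepted by the inference call shown later in this article. The small helper below is our own convenience wrapper; the values mirror the mode table above and should be double-checked against the official model card.

MODE_CONFIGS = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),  # ~64 vision tokens
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),  # ~100 vision tokens
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),  # ~256 vision tokens
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),  # ~400 vision tokens
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),   # tiled local views + one global view
}

def mode_kwargs(mode: str) -> dict:
    # e.g. model.infer(tokenizer, prompt=..., image_file=..., **mode_kwargs("gundam"))
    return dict(MODE_CONFIGS[mode.lower()])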

Optical Context Compression: The Core Innovation
The key innovation of DeepSeek-OCR is vision-text compression: converting text into images and encoding them into far fewer vision tokens.
Compression-Ratio Analysis
Tests on the Fox benchmark reveal that:
- At 10× compression (e.g., 1000 text tokens → 100 vision tokens), OCR precision reaches 96.8%.
- Even at 20× compression, accuracy remains at ~60%, showing resilience to high compression.
This suggests that compact LLMs can effectively decode visually compressed text, opening new possibilities for efficient long-context processing.
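The compression ratio in these experiments is simply the number of ground-truth text tokens on a page divided by the number of vision tokens the encoder emits for it; a trivial helper makes the arithmetic explicit.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    # How many text tokens each vision token has to "carry".
    return text_tokens / vision_tokens

print(compression_ratio(1000, 100))  # 10.0 -> 96.8% OCR precision reported above
print(compression_ratio(2000, 100))  # 20.0 -> ~60% accuracy reported above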

Performance Benchmarks
OmniDocBench: State-of-the-Art OCR
DeepSeek-OCR outperforms existing models on the OmniDocBench benchmark while using significantly fewer vision tokens.

Lower edit distance (ED) indicates better performance.
Despite using only 100 vision tokens, DeepSeek-OCR (Small) surpasses GOT-OCR2.0 (256 tokens), and the Gundam mode outperforms MinerU2.0 (7,000 tokens).
Practical Efficiency
In production, DeepSeek-OCR achieves:
- 200,000+ pages per day on a single A100-40G GPU.
- 33 million pages per day on a 20-node cluster (8×A100 each).
This makes it highly scalable for enterprise document processing and LLM pretraining data generation.
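The cluster figure follows from straightforward scaling of the single-GPU rate, assuming near-linear scaling across GPUs (a simplification):

pages_per_gpu_per_day = 200_000
gpus = 20 * 8                          # 20 nodes x 8 A100s each
print(pages_per_gpu_per_day * gpus)    # 32,000,000 pages/day, consistent with the ~33M figure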
Implementation and Data Pipeline
Training Strategy
The model is trained in two stages:
- DeepEncoder pretraining: the encoder is trained with a compact language model via next-token prediction on OCR and general vision data.
- End-to-end fine-tuning on a data mix of:
  - 70% OCR data (100 languages)
  - 20% general vision data
  - 10% text-only data
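The fine-tuning ratios can be read as sampling weights over data sources. The sketch below (the source labels are just placeholders for the corresponding datasets) draws sources in that proportion:

import random

MIXTURE = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    # Draw one data source with probability equal to its mixture weight.
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(10)])  # sources appear in roughly a 7:2:1 ratio over many draws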
OCR 1.0 and OCR 2.0
The training data includes:
- OCR 1.0: Traditional text recognition (30M pages, 100 languages).
- OCR 2.0: Complex parsing tasks:
  - Charts → HTML tables
  - Chemical formulas → SMILES
  - Geometry → structured dictionaries

Deep Parsing and Multimodal Capabilities
DeepSeek-OCR enables recursive parsing through “deep parsing” mode — where the model can parse embedded content like charts, chemical formulas, and natural images within documents using a single prompt.
Multilingual Support
The model supports OCR in nearly 100 languages, including Arabic, Sinhala, and many lower-resource languages, and it can produce both layout-aware and plain-text outputs.
General Vision Understanding
Despite being OCR-focused, DeepSeek-OCR retains general vision capabilities:
- Image captioning
- Object detection
- Visual grounding
- Scene understanding
Theoretical Implications: Simulating Forgetting
DeepSeek-OCR’s compression approach mirrors human memory decay. By progressively resizing older document images, the model can simulate “forgetting” — where distant context becomes blurred and token-efficient, while recent context remains high-fidelity.
This suggests a path toward biologically inspired memory systems in LLMs, where resource allocation dynamically scales with recency and importance.
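As a purely illustrative sketch of that idea (the decay schedule and helper below are our own, not from the paper), older pages could be re-rendered at progressively lower resolution so they occupy fewer vision tokens, while the most recent pages stay sharp:

from PIL import Image

def resize_for_age(page: Image.Image, age: int, base: int = 1024, floor: int = 256) -> Image.Image:
    # Halve the rendering resolution for each step back in time, down to a floor.
    side = max(base // (2 ** age), floor)
    return page.resize((side, side))   # assumes square page renders for simplicity

# A page from three turns ago would be stored at 256x256 instead of 1024x1024,
# shrinking its vision-token budget accordingly.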
Python Implementation
Dependency Resolution
!pip install -q transformers==4.46.3
!pip install -q tokenizers==0.20.3
!pip install -q einops
!pip install -q addict
!pip install -q easydict
!pip install -q flash-attn==2.7.3 --no-build-isolation
Basic Implementation
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
).eval().cuda().to(torch.bfloat16)

# Grounded document-to-markdown prompt; note the newline after the image tag
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'input.jpg'
output_path = 'output/'

# base_size=1024, image_size=640, crop_mode=True runs the tiled (Gundam) configuration
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True
)
Conclusion
DeepSeek-OCR is not just an OCR model — it’s a proof-of-concept for vision-as-compression in LLMs. By converting text into visual tokens, it achieves:
- 10–20× compression with high fidelity
- Superior OCR performance using fewer tokens
- Scalable production throughput
The model’s architecture and findings open new research directions in long-context modeling, memory efficiency, and multimodal pretraining. As the paper concludes, this approach may enable theoretically unlimited context by balancing retention and compression across time and space.