LightOnOCR: Fastest OCR AI, beats DeepSeek OCR, PaddleOCR
How to use LightOnOCR for free?
OCR has always felt a bit like duct-taping a dozen tools together: detect text boxes, crop images, feed them to a model, stitch outputs back into layout. It works, barely, but it’s brittle, slow, and painful to adapt to new document types.
LightOnOCR-1B breaks that cycle. Instead of relying on a pipeline, it’s a single, trainable vision-language model that eats entire pages and spits out clean, structured Markdown. No multi-step mess.
A True End-to-End OCR Model
LightOnOCR isn’t just another model with OCR in its name. It’s actually end-to-end: there’s no segmentation or text detection stage; it learns everything jointly. That makes it fully differentiable, meaning you can fine-tune it as a whole for whatever weird dataset you have (receipts, legal PDFs, academic papers). That simplicity is the point: fewer moving parts, fewer chances to break.
Built on a 1B Vision-Language Backbone
Under the hood, it’s a compact 1B-parameter model, but it borrows serious components:
- A Vision Transformer (ViT) backbone inspired by Mistral’s Pixtral for high-res image understanding.
- A Qwen3-based language model that handles text reasoning.
- A fresh multimodal projection layer connecting vision and text spaces, trained from scratch.
Together, it acts like a small general-purpose VLM, but fine-tuned for the world of PDFs, scanned documents, and screenshots.
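To make that composition concrete, here is a minimal toy sketch (not LightOnOCR’s actual code; the dimensions, layer counts, and names are invented for illustration) of how a ViT-style encoder, a projection layer, and a causal decoder join into a single trainable module:

```python
# Toy VLM: image patches -> vision encoder -> projection -> joined with text -> LM head.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_dim=256, text_dim=512, vocab_size=32_000):
        super().__init__()
        # Stand-in for the ViT backbone: turns image patch embeddings into visual features.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Multimodal projection: maps vision features into the text embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        # Stand-in for the Qwen3-style decoder that emits Markdown tokens.
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.decoder = nn.TransformerEncoder(  # causal masking omitted to keep the toy short
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_patches, token_ids):
        vis = self.projector(self.vision_encoder(image_patches))  # (B, patches, text_dim)
        txt = self.text_embed(token_ids)                          # (B, tokens, text_dim)
        hidden = self.decoder(torch.cat([vis, txt], dim=1))       # one joint sequence
        return self.lm_head(hidden[:, vis.size(1):])              # logits over the text positions

model = ToyVLM()
logits = model(torch.randn(1, 64, 256), torch.randint(0, 32_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Because all three parts live in one module, a single backward pass updates the encoder, the projection, and the decoder together, which is exactly what “fully differentiable end-to-end” buys you.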
Fast Enough to Process a Library Before Lunch
This part’s wild: LightOnOCR does 5.71 pages per second on a single H100 GPU. That’s nearly half a million pages per day. Compared to existing models, it’s:
- 6.5× faster than dots.ocr
- 2.7× faster than PaddleOCR-VL
- 1.7× faster than DeepSeekOCR
And at that speed, it costs under $0.01 per thousand pages on cloud GPUs. It achieves that because it runs in one shot: a single forward pass per page, with no retries or patch cropping.
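The daily figure is just arithmetic on the per-second rate:

```python
# Pure arithmetic, no model required: 5.71 pages/s over a 24-hour day.
pages_per_second = 5.71
pages_per_day = pages_per_second * 60 * 60 * 24
print(f"{pages_per_day:,.0f} pages/day")  # 493,344: roughly half a million
```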
Outputs Markdown, Not HTML
Instead of verbose HTML trees, it outputs Markdown. That’s clever. Markdown keeps structure, headings, tables, even equations via LaTeX, but remains compact and human-readable. It also tokenizes better for language models and converts neatly into JSON or HTML when needed. It’s lightweight structure without the clutter.
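As a quick illustration of that downstream convenience, here is a made-up page transcription (not real model output) converted to HTML with the `markdown` package (`pip install markdown`):

```python
# A hypothetical Markdown transcription: heading, table, and an inline LaTeX equation.
import markdown

page_md = """# Quarterly Report

| Region | Revenue |
|--------|---------|
| EMEA   | $1.2M   |
| APAC   | $0.9M   |

Growth is computed as $r = (R_t - R_{t-1}) / R_{t-1}$.
"""

# The same compact text converts straight to HTML when a richer format is needed.
html = markdown.markdown(page_md, extensions=["tables"])
print(html)
```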
17.6 Million Synthetic Pages
The team trained it on a massive synthetic corpus:
- 17.6M pages generated via Qwen2-VL-72B-Instruct as the teacher.
- Roughly 45.5B tokens, with pages rendered at native PDF resolution (up to 1540 px).
- Cleaned to remove repetition loops, duplicates, and hallucinations.
And this dataset isn’t staying locked up; they’re releasing it. That’s significant: OCR datasets are notoriously fragmented, so this could become a standard benchmark for future work.
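For scale, that corpus works out to a few thousand tokens per page on average:

```python
# Average transcription length implied by the released corpus statistics.
tokens, pages = 45.5e9, 17.6e6
print(f"{tokens / pages:,.0f} tokens per page")  # ~2,585
```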
Simpler Training, Still Better Accuracy
LightOnOCR dropped the usual two-stage “freeze then unfreeze” training routine. Instead, everything (vision, language, and projection layers) was trained together. The result: slightly better performance (+1.4 points) and a cleaner workflow. It’s a rare case where the simpler path also wins.
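In code, the difference between the two recipes is small. A hedged sketch using the toy module from earlier (the `projector` attribute name is an assumption; any VLM that exposes its projection layer works the same way):

```python
import torch

def single_stage(model, lr=1e-4):
    # Joint training: vision encoder, projector, and language model all update together.
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=lr)

def staged_warmup(model, lr=1e-4):
    # The usual two-stage recipe: freeze the pretrained backbones, warm up only the projector.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.projector.parameters(), lr=lr)
```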
Big Teacher Models
An interesting finding: when the dataset was labeled by Qwen2-VL-72B, results jumped +11.8 points compared to data labeled by the smaller Qwen2-VL-7B. In other words, the size of your teacher matters a lot more than people admit. Bigger models make better synthetic data, even if your student model is small.
Pruned Vocabulary for Efficiency
OCR doesn’t need 150k tokens. LightOnOCR trims the original Qwen3 tokenizer down to 32k or 16k tokens. That cut reduces inference time significantly with almost no drop in accuracy. The 32k version hits the sweet spot for English and French, though it loses some multilingual flexibility.
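A conceptual sketch of what pruning involves (this is not LightOnOCR’s released procedure; the stand-in tokenizer and the two-line corpus are placeholders): count which token ids OCR-style text actually uses, keep the most frequent 32k, and remap them to a dense range so the embedding matrix and LM head can be sliced down.

```python
from collections import Counter
from transformers import AutoTokenizer

# Placeholder tokenizer standing in for the Qwen3 tokenizer used by the model.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
corpus = ["# Invoice 0042\n\nTotal: $1,280.00", "| Item | Qty |\n|---|---|\n| Pen | 3 |"]

counts = Counter()
for doc in corpus:
    counts.update(tok(doc)["input_ids"])

keep = [tok_id for tok_id, _ in counts.most_common(32_000)]      # top ids by usage
old_to_new = {old: new for new, old in enumerate(sorted(keep))}  # dense remapping

# In a real pruning pass you would also slice the input embedding matrix and the
# LM head to the kept rows, so inference only scores the reduced vocabulary.
print(len(old_to_new), "ids kept out of", tok.vocab_size)
```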
Handles Native Resolutions
LightOnOCR encodes images at their native resolution, inspired by NaViT’s approach. That means no resizing or cropping. High-res inference (up to 1540px) gives around +18% accuracy on dense-text pages, like contracts or research papers. Slight drop for table-heavy docs, but overall, this improves clarity and context retention.
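A minimal sketch of that preprocessing, assuming the only constraint is the 1540 px cap on the longest side (the exact resize policy and resampling filter are assumptions):

```python
from PIL import Image

MAX_SIDE = 1540  # upper bound on the longest side, per the reported setting

def load_native(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    scale = min(1.0, MAX_SIDE / max(img.size))  # downscale only if the page exceeds the cap
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    return img  # aspect ratio preserved, no square crop or padding
```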
Barely Any Augmentation Needed
Because of its large and diverse dataset, LightOnOCR didn’t need aggressive data augmentation. Small tweaks like rotation and noise barely moved the needle (–0.2 on average). That’s a sign of a solid, well-balanced training corpus rather than model luck.
Easy to Fine-Tune
Fine-tuning on a small OCR mix dataset (OlmOCR-mix-0225) for just one epoch improved overall accuracy by +9 points, a huge jump. Headers and footers jumped from 40% to 91% accuracy. That kind of quick adaptability is rare. You can imagine using it for specialized documents, like medical forms or invoices, without retraining from scratch.
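A hedged sketch of what that boils down to, reusing the toy module from the architecture section with random tensors standing in for (page patches, target Markdown tokens); a real run on OlmOCR-mix-0225 would swap in an actual dataloader and the released weights:

```python
import torch
import torch.nn.functional as F

model = ToyVLM()  # defined in the earlier toy sketch
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

for step in range(10):  # stands in for "one epoch" over a small OCR mix
    patches = torch.randn(2, 64, 256)            # fake page features
    targets = torch.randint(0, 32_000, (2, 16))  # fake target Markdown tokens
    logits = model(patches, targets[:, :-1])     # teacher forcing: predict the next token
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```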
Benchmarks and Real Results
It’s now state-of-the-art for its size on Olmo-Bench, beating models two to three times its size, like DeepSeekOCR and dots.ocr. Even on OmniDocBench, where it wasn’t optimized for HTML, it holds up remarkably well. No benchmark-specific tuning, just the base pretrained model.
Cheap, Fast, and Open
LightOnOCR hits the sweet trifecta: small enough to run on modest GPUs, fast enough for production, accurate enough for enterprise workloads.
It sits on the Pareto frontier for OCR: better speed and cost with near state-of-the-art accuracy. And the best part: they’re open-sourcing both the weights (1B, 0.9B-32k, 0.9B-16k) and the dataset under a permissive license. That’s rare, and it might finally push open-source OCR to catch up with proprietary systems.
In Short
LightOnOCR-1B feels like what OCR should’ve been all along: one model that reads documents end-to-end, outputs clean structured text, and doesn’t cost a fortune to run. It’s ViT + Qwen3 under the hood, distilled from a massive 72B teacher, pruned for speed, and tuned for Markdown. Simple, fast, and open.
OCR, finally, without the glue code.
The model is open-sourced and available on Hugging Face: lightonai/LightOnOCR-1B-1025.
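If you want to run it yourself, here is a minimal sketch assuming the checkpoint can be served through vLLM’s OpenAI-compatible server (the serve command, the prompt wording, and whether a text prompt is needed at all are assumptions; check the model card for the recommended setup):

```python
# First serve the model locally (assumption), e.g.:
#   vllm serve lightonai/LightOnOCR-1B-1025 --port 8000
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-1B-1025",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this page to Markdown."},  # prompt wording is a guess
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # the page transcribed as Markdown
```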
Try the demo: LightOnOCR 1B Demo, a Hugging Face Space by lightonai.