LightOnOCR: Fastest OCR AI, beats DeepSeek OCR, PaddleOCR

How to use LightOnOCR for free?

OCR has always felt a bit like duct-taping a dozen tools together: detect text boxes, crop images, feed them to a model, stitch outputs back into layout. It works, barely, but it’s brittle, slow, and painful to adapt to new document types.

LightOnOCR-1B breaks that cycle. Instead of relying on a pipeline, it’s a single, trainable vision-language model that eats entire pages and spits out clean, structured Markdown. No multi-step mess.

A True End-to-End OCR Model

LightOnOCR isn’t just another model with OCR in its name. It’s actually end-to-end: there’s no separate segmentation or text-detection stage; it learns everything jointly. That makes it fully differentiable, meaning you can fine-tune it as a whole for whatever weird dataset you have (receipts, legal PDFs, academic papers). That simplicity is the point: fewer moving parts, fewer chances to break.
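To make the contrast concrete, here’s a toy sketch: a classic pipeline as a chain of stages versus a single end-to-end call. All the function names and the page structure are invented for illustration; neither side is real OCR code.

```python
# Toy contrast: multi-stage OCR pipeline vs. one end-to-end call.
# Everything here is a stand-in; the "page" is just a dict of fake data.

def detect_text_boxes(page):
    # Pipeline stage 1: find text regions, ordered top-to-bottom.
    return sorted(page["boxes"], key=lambda b: b["y"])

def recognize(box):
    # Pipeline stage 2: run recognition on each cropped region.
    return box["text"]

def stitch(lines):
    # Pipeline stage 3: reassemble fragments into a layout.
    return "\n".join(lines)

def pipeline_ocr(page):
    return stitch(recognize(b) for b in detect_text_boxes(page))

def end_to_end_ocr(page):
    # Stand-in for a single LightOnOCR-style forward pass:
    # whole page in, Markdown out, no intermediate stages to maintain.
    return page["markdown"]

page = {
    "boxes": [{"y": 1, "text": "# Invoice"}, {"y": 2, "text": "Total: $42"}],
    "markdown": "# Invoice\nTotal: $42",
}
assert pipeline_ocr(page) == end_to_end_ocr(page)
```

The payoff of the second shape is exactly what the paragraph above says: one differentiable function you can fine-tune, instead of three stages that each fail in their own way.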

Built on a 1B Vision-Language Backbone

Under the hood, it’s a compact 1B-parameter model, but it borrows serious components:

  • A Vision Transformer (ViT) backbone inspired by Mistral’s Pixtral for high-res image understanding.
  • A Qwen3-based language model that handles text reasoning.
  • A fresh multimodal projection layer connecting vision and text spaces, trained from scratch.

Together, it acts like a small general-purpose VLM, but fine-tuned for the world of PDFs, scanned documents, and screenshots.
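As a rough mental model of how those pieces fit together, here’s a shape-only walk-through. The dimensions and token counts are invented, not the model’s real sizes.

```python
# Shape-only sketch of the architecture: ViT-style patch encoder,
# a projection into the LM's embedding space, and a language model
# that reads image tokens followed by text tokens as one sequence.
# All dimensions below are made up for illustration.

def vit_encode(num_patches, vision_dim=1024):
    # Pretend ViT output: one vector per image patch.
    return [[0.0] * vision_dim for _ in range(num_patches)]

def project(patch_embeddings, lm_dim=2048):
    # The multimodal projection layer (trained from scratch) maps
    # vision vectors into the language model's embedding space.
    return [[0.0] * lm_dim for _ in patch_embeddings]

def lm_input(image_tokens, text_token_embeddings):
    # The LM sees image tokens and text tokens as a single sequence.
    return image_tokens + text_token_embeddings

patches = vit_encode(num_patches=6)
image_tokens = project(patches)
sequence = lm_input(image_tokens, [[0.0] * 2048 for _ in range(3)])
assert len(sequence) == 9        # 6 image tokens + 3 text tokens
assert len(sequence[0]) == 2048  # everything lives in the LM's space
```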

Fast Enough to Process a Library Before Lunch

This part’s wild: LightOnOCR does 5.71 pages per second on a single H100 GPU. That’s nearly half a million pages per day. Compared to existing models, it’s:

  • 6.5× faster than dots.ocr
  • 2.7× faster than PaddleOCR-VL
  • 1.7× faster than DeepSeekOCR

And at that speed, it costs under $0.01 per thousand pages on cloud GPUs. It achieves that because it runs in one shot: a single forward pass per page, with no retries or patch cropping.
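The “nearly half a million pages per day” figure follows directly from the quoted throughput:

```python
# Back-of-the-envelope check: 5.71 pages/s on one H100, extrapolated
# to a full day of continuous inference.
pages_per_second = 5.71
pages_per_day = pages_per_second * 60 * 60 * 24
print(f"{pages_per_day:,.0f} pages/day")  # 493,344 pages/day
```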

Outputs Markdown, Not HTML

Instead of verbose HTML trees, it outputs Markdown. That’s clever. Markdown keeps structure, headings, tables, even equations via LaTeX, but remains compact and human-readable. It also tokenizes better for language models and converts neatly into JSON or HTML when needed. It’s lightweight structure without the clutter.
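To see why Markdown output is convenient downstream, here’s a small sketch that pulls structure out of a Markdown string with nothing but the standard library. The sample text is invented; real OCR output will be messier.

```python
import json
import re

# Hypothetical Markdown as an OCR model might emit it: headings,
# a table, and inline LaTeX all survive in plain text.
sample = """# Quarterly Report
## Revenue
Inline math like $x^2 + y^2$ is preserved as LaTeX.

| Region | Sales |
|--------|-------|
| EU     | 120   |
"""

# A few regexes recover the document structure as JSON; doing the
# same from a verbose HTML tree takes a full parser.
structure = {
    "headings": re.findall(r"^(#+)\s+(.*)$", sample, flags=re.M),
    "has_table": bool(re.search(r"^\|.*\|$", sample, flags=re.M)),
    "latex_spans": re.findall(r"\$([^$]+)\$", sample),
}
print(json.dumps(structure, indent=2))
```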

17.6 Million Synthetic Pages

The team trained it on a massive synthetic corpus:

  • 17.6M pages generated via Qwen2-VL-72B-Instruct as the teacher.
  • Roughly 45.5B tokens, rendered at native PDF resolution (up to 1540px).
  • Cleaned for loops, duplicates, and hallucinations.

And this dataset isn’t staying locked up: they’re releasing it. That’s significant, because OCR datasets are notoriously fragmented, so this could become a standard benchmark for future work.
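A quick sanity check on those corpus numbers:

```python
# 45.5B tokens spread over 17.6M pages implies an average page density.
tokens = 45.5e9
pages = 17.6e6
tokens_per_page = tokens / pages
print(round(tokens_per_page))  # ~2585 tokens per page
```

Roughly 2,600 tokens per page is consistent with dense, text-heavy PDFs rather than sparse scans, which fits the document-centric training goal.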

Simpler Training, Still Better Accuracy

LightOnOCR dropped the usual two-stage “freeze then unfreeze” training routine. Instead, the vision, language, and projection layers were all trained together. The result: slightly better performance (+1.4 points) and a cleaner workflow. It’s a rare case where the simpler path also wins.

Big Teacher Models

An interesting finding: when the dataset was labeled by Qwen2-VL-72B, results jumped +11.8 points compared to data labeled by the smaller Qwen2-VL-7B. In other words, the size of your teacher matters a lot more than people admit. Bigger models make better synthetic data, even if your student model is small.

Pruned Vocabulary for Efficiency

OCR doesn’t need 150k tokens. LightOnOCR trims the original Qwen3 tokenizer down to 32k or 16k tokens. That cut reduces inference time significantly with almost no drop in accuracy. The 32k version hits the sweet spot for English and French, though it loses some multilingual flexibility.
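The idea behind the pruning can be sketched in a few lines: keep the tokens that actually occur in OCR-style text and remap them to a dense range, so the embedding and output matrices shrink. This toy version works on integer token ids; the real procedure on the Qwen3 tokenizer is more involved.

```python
from collections import Counter

def prune_vocab(token_stream, keep=3):
    # Count token usage in a representative corpus, keep the most
    # frequent `keep` ids, and remap them to a dense 0..keep-1 range.
    freq = Counter(token_stream)
    kept = [tok for tok, _ in freq.most_common(keep)]
    return {old: new for new, old in enumerate(kept)}

# Hypothetical token ids from OCR-style text; 31000 is a rare token.
stream = [101, 101, 205, 7, 7, 7, 99, 101, 205, 31000]
mapping = prune_vocab(stream, keep=3)

# The embedding matrix shrinks from the full vocab to len(mapping)
# rows, which is where the inference-time savings come from.
assert len(mapping) == 3
assert 31000 not in mapping  # rare token dropped
```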

Handles Native Resolutions

LightOnOCR encodes images at their native resolution, inspired by NaViT’s approach. That means no resizing or cropping. High-res inference (up to 1540px) gives around +18% accuracy on dense-text pages like contracts or research papers. There’s a slight drop for table-heavy docs, but overall this improves clarity and context retention.
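Patch-count arithmetic shows what native-resolution encoding means in practice. The patch size of 16 here is an assumption for illustration, not a confirmed detail of the model:

```python
import math

def num_patches(width, height, patch=16):
    # NaViT-style encoding: the page is tiled into fixed-size patches
    # at whatever resolution it arrives in -- no resize, no crop, so
    # patch count (and cost) grows with the input resolution.
    return math.ceil(width / patch) * math.ceil(height / patch)

# A portrait page scanned at the model's 1540px cap:
print(num_patches(1190, 1540))  # 75 * 97 = 7275 patches
```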

Barely Any Augmentation Needed

Because of its large and diverse dataset, LightOnOCR didn’t need aggressive data augmentation. Small tweaks like rotation and noise barely moved the needle (–0.2 on average). That’s a sign of a solid, well-balanced training corpus rather than model luck.

Easy to Fine-Tune

Fine-tuning on a small OCR mix dataset (OlmOCR-mix-0225) for just one epoch improved overall accuracy by +9 points, a huge jump. Headers and footers jumped from 40% to 91% accuracy. That kind of quick adaptability is rare. You can imagine using it for specialized documents, like medical forms or invoices, without retraining from scratch.

Benchmarks and Real Results

It’s now state-of-the-art for its size on Olmo-Bench, beating models two to three times its scale, like DeepSeekOCR and dots.ocr. Even on OmniDocBench, where it wasn’t optimized for HTML, it holds up remarkably well. No benchmark-specific tuning, just the base pretrained model.

Cheap, Fast, and Open

LightOnOCR hits the sweet trifecta: small enough to run on modest GPUs, fast enough for production, accurate enough for enterprise workloads.

It sits on the Pareto frontier for OCR: better speed and cost with near state-of-the-art accuracy. And the best part: they’re open-sourcing both the weights (1B, 0.9B-32k, 0.9B-16k) and the dataset under a permissive license. That’s rare, and it might finally push open-source OCR to catch up with proprietary systems.

In Short

LightOnOCR-1B feels like what OCR should’ve been all along: one model that reads documents end-to-end, outputs clean structured text, and doesn’t cost a fortune to run. It’s ViT + Qwen3 under the hood, distilled from a massive 72B teacher, pruned for speed, and tuned for Markdown. Simple, fast, and open.

OCR, finally, without the glue code.

The model is open-sourced and available on Hugging Face:

lightonai/LightOnOCR-1B-1025 · Hugging Face

Try the demo here:

LightOnOCR 1B Demo – a Hugging Face Space by lightonai


LightonOCR : Fastest OCR AI, beats DeepSeek OCR, PaddleOCR was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.
