PaddleOCR-VL: Best OCR AI model

How to use PaddleOCR-VL for free?

Photo by Joshua Earle on Unsplash

Baidu dropped a new model, PaddleOCR-VL. It’s a document parsing system that can read text, tables, formulas, and even charts across 109 languages.


The strange part: it does all this with only 0.9 billion parameters, while some competitors need 70–200B to reach the same numbers.

My new book on Audio AI Agents is out!

Audio AI for Beginners: Generative AI for Voice Recognition, TTS, Voice Cloning and more (Generative AI books)

The project sits inside PaddleOCR, Baidu’s open-source OCR ecosystem. But this one’s different: it’s not just another text detector. It’s a vision-language model (VLM) tuned for the gritty stuff: PDFs, scanned papers, handwritten notes, old exams, the messy documents most models choke on.

Why This Model Exists

Modern documents aren’t just text. They’re multi-column layouts, mathematical formulas, half-scanned tables, multilingual text, and charts in odd resolutions. End-to-end models like GPT-4o or Qwen-VL can parse them, but they’re slow, hallucinate layouts, and burn through GPU memory.

So Baidu split the job in two:

  1. Detect and order layout elements first.
  2. Recognize each element precisely using a compact vision-language model.

That small design decision, decoupling layout and recognition, is what makes PaddleOCR-VL faster and more stable than the usual all-in-one systems.
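
To make the split concrete, here is a minimal sketch of the two-stage flow in Python. Everything in it is illustrative: the class and method names are assumptions for the sake of explanation, not the actual PaddleOCR-VL API.

```python
# Illustrative sketch of the decoupled pipeline; names are hypothetical,
# not the real PaddleOCR-VL API.
from dataclasses import dataclass

@dataclass
class LayoutElement:
    bbox: tuple   # (x1, y1, x2, y2) in page coordinates
    label: str    # "text", "table", "formula", or "chart"
    order: int    # reading-order index for this element

def parse_page(page_image, layout_model, recognizer):
    # Stage 1: detect layout elements and predict their reading order.
    elements = layout_model.detect(page_image)
    elements.sort(key=lambda e: e.order)

    # Stage 2: recognize each cropped element with the compact VLM.
    parts = []
    for el in elements:
        crop = page_image.crop(el.bbox)
        parts.append(recognizer.recognize(crop, element_type=el.label))

    # Stitch per-element results back together in reading order.
    return "\n\n".join(parts)
```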

How It’s Built

The system runs in two clear stages:

Stage 1: Layout Analysis (PP-DocLayoutV2)

This part identifies text blocks, tables, formulas, and charts. It uses:

  • RT-DETR for object detection (basically bounding boxes + class labels).
  • A pointer network (6 transformer layers) that figures out the reading order of elements, top to bottom, left to right, etc.

Together they output a map of where everything is and how it should be read.
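
As a rough illustration, the Stage 1 output for a page might look like the structure below. The field names are made up for readability, not the model’s actual schema.

```python
# Hypothetical Stage 1 layout map: one entry per detected element, with its
# class, bounding box, and position in the predicted reading order.
layout_map = [
    {"order": 0, "label": "text",    "bbox": [72, 64, 540, 180]},
    {"order": 1, "label": "formula", "bbox": [150, 200, 460, 250]},
    {"order": 2, "label": "table",   "bbox": [72, 280, 540, 520]},
    {"order": 3, "label": "chart",   "bbox": [72, 540, 540, 760]},
]
```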

Stage 2: Element Recognition (PaddleOCR-VL-0.9B)

This is where the vision-language model kicks in. It uses:

  • NaViT-style encoder (from Keye-VL) that takes dynamic image resolutions. No tiling, no stretching.
  • A simple 2-layer MLP to align vision features to the language space.
  • ERNIE-4.5-0.3B as the language model, a small but fast one with 3D-RoPE for position encoding.

The model then outputs structured Markdown or JSON, whatever format you need.
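
For instance, a parsed page might come back as something like the JSON below. The keys and sample values are assumptions for illustration; the real schema depends on how you configure the pipeline.

```python
# Hypothetical example of structured output; real field names may differ.
result = {
    "elements": [
        {"type": "text",    "content": "PaddleOCR-VL is a 0.9B document parsing model."},
        {"type": "formula", "content": r"E = mc^2"},  # formulas come back as LaTeX
        {"type": "table",   "content": "| Model | Params |\n|---|---|\n| PaddleOCR-VL | 0.9B |"},
        {"type": "chart",   "content": "| Year | Value |\n|---|---|\n| 2023 | 120 |"},
    ]
}
```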

The Training Setup

It’s trained in two rounds: a pre-training round that aligns the vision encoder with the language model, followed by supervised fine-tuning on document-parsing tasks.

Tasks used during fine-tuning:

  • OCR (text extraction)
  • Table recognition (OTSL output)
  • Formula recognition (LaTeX output)
  • Chart parsing (converted to Markdown tables)

That’s the whole recipe. No fancy mixture-of-experts or magic scaling laws. Just brute, clean supervision.

The Data Engine

They didn’t rely on just public datasets. The data pipeline mixes four sources:

  1. Public datasets: CASIA-HWDB, UniMER-1M, ChartQA, PlotQA, etc.
  2. Synthetic data: generated through XeLaTeX and browser rendering.
  3. Web-scraped PDFs: academic papers, exams, handwritten notes, slides.
  4. In-house datasets: Baidu’s private OCR data.

Annotations are semi-automatic: pseudo-labels from PP-StructureV3, refined by ERNIE-VL and Qwen-VL, and then filtered for hallucinations.

Hard cases are found using metrics like EditDist (text), TEDS (table), RMS-F1 (chart), BLEU (formula). Those tough samples are regenerated synthetically until the model stops failing on them. Smart feedback loop.
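
A stripped-down version of that loop could look like the sketch below. The helper names and threshold are hypothetical, and the stdlib similarity ratio stands in for the proper edit-distance metric the paper uses.

```python
# Sketch of hard-case mining; helper names and threshold are assumptions.
from difflib import SequenceMatcher

def text_similarity(pred: str, target: str) -> float:
    # Stand-in for the edit-distance-style text metric mentioned above.
    return SequenceMatcher(None, pred, target).ratio()

def mine_hard_cases(model, dataset, threshold=0.9):
    hard = []
    for image, target in dataset:
        pred = model.recognize(image)
        if text_similarity(pred, target) < threshold:
            hard.append((image, target))
    return hard

# The hard samples then drive new synthetic data generation (re-rendering
# similar layouts via XeLaTeX or a headless browser), and the model is
# fine-tuned again until it stops failing on them.
```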

Performance

Let’s talk numbers, starting with speed. On an A100 GPU:

  • 1.22 pages/s throughput
  • 15.8% faster than MinerU2.5
  • ~40% less VRAM than dots.ocr

Not a small deal when you’re parsing thousands of PDFs.
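
For a rough sense of scale: 1.22 pages/s works out to about 4,400 pages per hour on a single A100, so a backlog of a few hundred thousand pages is days, not weeks, of single-GPU time.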

What Makes It Unique

  • Compact size (0.9B) but multilingual reach (109 languages).
  • Dynamic resolution: no tile-based preprocessing, fewer artifacts.
  • Stable layout ordering: avoids hallucinated document structure.
  • Balanced design: a small LM and strong visual encoder instead of a bloated decoder.
  • Efficient inference: multithreaded pipeline with vLLM or SGLang backend.

Final Thoughts

Baidu didn’t over-market this one, but PaddleOCR-VL is arguably the most practical open-source document parsing model right now. It hits that rare mix: accuracy, multilingual support, and low compute cost.

It’s not flashy, no “agent” or “general intelligence” nonsense, but if your job involves turning PDFs into structured data, this model actually works.

You can try it yourself:

PaddlePaddle/PaddleOCR-VL · Hugging Face
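
For a quick start, the sketch below follows the general shape of the PaddleOCR pipeline API. Treat the class name and method calls as assumptions modeled on other PaddleOCR 3.x pipelines and check the model card for the exact usage.

```python
# Hedged sketch: class and method names are assumptions based on the
# PaddleOCR 3.x pipeline style; verify against the official docs.
# pip install paddleocr paddlepaddle-gpu   (or paddlepaddle for CPU)
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()                        # downloads the 0.9B model on first run
output = pipeline.predict("scanned_page.png")   # a PDF page exported as an image

for res in output:
    res.print()                                 # inspect the parsed elements
    res.save_to_json(save_path="output")        # structured JSON
    res.save_to_markdown(save_path="output")    # Markdown with tables and formulas
```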


