Nanonets OCR2: Turning Documents into Structured, LLM-Ready Data

How to use Nanonets OCR2 for free?


Most document AI models are good at extraction, but bad at understanding. They’ll find text, maybe a table, but ask them where the watermark ends and the signature begins, and they’ll hallucinate faster than a sleep-deprived intern.

Nanonets OCR 2 fixes that.

This isn’t another “OCR but with AI” release. It’s a proper multimodal vision-language model trained to read documents the way humans do: context, structure, meaning, and noise included.

Overview

Nanonets-OCR2 is built for two main things: converting document images into structured Markdown and answering document-related questions accurately (VQA). It extends the older Nanonets-OCR-s model with a stronger backbone, better content segmentation, and less confusion between document artifacts like headers, footers, watermarks, and body text.

It can handle checkboxes, signatures, tables, flowcharts, and multilingual documents. It even outputs Mermaid code for visual diagrams, something you rarely see in OCR systems.

What sets it apart is its honesty: if the answer isn’t present in the document, the model says Not mentioned. No made-up answers, no filler text.

You can test the family of models directly on Docstrange or Hugging Face, or run them locally (see the sketch after the list):

  • Nanonets-OCR2+: the flagship model
  • Nanonets-OCR2-3B: balanced between speed and accuracy
  • Nanonets-OCR2-1.5B-exp: lightweight version for faster inference
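
If you want to run one of these yourself, here is a minimal sketch using the Hugging Face transformers library. It assumes the checkpoint follows the standard Qwen2.5-VL-style chat interface (the post notes Qwen2.5-VL-3B as the base model); the prompt wording and input file are illustrative placeholders, so check the model card for the recommended prompt.

```python
# Minimal sketch: running Nanonets-OCR2-3B locally via transformers.
# Assumes the checkpoint follows the standard Qwen2.5-VL chat interface;
# the prompt text and input file are illustrative placeholders.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR2-3B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice_page_1.png")  # hypothetical scanned page
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this document page to structured Markdown."},
        ],
    }
]

# Build the chat prompt, bundle it with the page image, and generate Markdown.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
markdown = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(markdown)
```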

Key Capabilities

Let’s skip the buzzwords and get into what it actually does.

1. LaTeX Equation Recognition

Converts printed or handwritten math into LaTeX, differentiating between inline and display modes. Page numbers are wrapped in <page_number> tags, useful for structured post-processing.
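
As a quick illustration of that post-processing, a tag like <page_number> can be pulled out with a one-line regex. This is a hypothetical snippet; the exact tag syntax the model emits may differ, so verify it against real output.

```python
import re

# Hypothetical post-processing: collect page numbers from the <page_number>
# tags the model embeds in its Markdown output (tag syntax assumed from the post).
sample = "## Results\n\nSee Table 2.\n\n<page_number>12</page_number>"
page_numbers = re.findall(r"<page_number>(.*?)</page_number>", sample, flags=re.DOTALL)
print(page_numbers)  # ['12']
```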

2. Intelligent Image Description

If an image has a caption, it uses that. If not, it generates one. The descriptions go inside <img> tags, with enough context for LLMs to understand what’s visually there: graphs, logos, QR codes, and so on.

3. Signature Detection & Isolation

Detects and separates signatures from surrounding text. If unreadable, it still marks <signature>signature</signature> so downstream systems know it’s signed, not skipped.

4. Watermark Extraction

Finds watermark text, even on noisy scans, and wraps it in <watermark> tags. Surprisingly robust on low-quality inputs.

5. Checkbox & Radio Handling

Translates checkboxes into Unicode symbols inside <checkbox> tags. That’s a big deal for form automation and compliance-heavy workflows.
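
For form automation, those tags are easy to turn into booleans. A small sketch, assuming ☐/☑/☒ as the emitted symbols (the exact characters the model produces may vary):

```python
import re

# Illustrative: convert <checkbox> symbols in the OCR output into booleans.
# The specific Unicode characters (☐, ☑, ☒) are an assumption, not confirmed output.
CHECKED = {"☑", "☒"}

def extract_checkbox_states(markdown: str) -> list[bool]:
    symbols = re.findall(r"<checkbox>(.*?)</checkbox>", markdown)
    return [s.strip() in CHECKED for s in symbols]

form = "Consent given <checkbox>☑</checkbox>  Marketing emails <checkbox>☐</checkbox>"
print(extract_checkbox_states(form))  # [True, False]
```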

6. Complex Table Extraction

Pulls out nested or multi-row tables into Markdown and HTML. Better than most open models, which tend to flatten cell hierarchies.

7. Flowchart and Org Chart Parsing

Outputs Mermaid code, which means you can regenerate the original chart visually. A rare but useful feature.
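
If the Mermaid code comes back inside a fenced block in the Markdown, you can lift it out and hand it to any Mermaid renderer. A rough sketch (the fence format is an assumption; adjust the pattern to whatever the model actually emits):

```python
import re

# Sketch: pull Mermaid blocks out of the OCR Markdown so the chart can be
# re-rendered (e.g. with the Mermaid CLI). The fence format is assumed.
def extract_mermaid_blocks(markdown: str) -> list[str]:
    return re.findall(r"`{3}mermaid\n(.*?)`{3}", markdown, flags=re.DOTALL)

with open("ocr_output.md", encoding="utf-8") as f:  # hypothetical saved output
    doc = f.read()

for i, block in enumerate(extract_mermaid_blocks(doc)):
    with open(f"chart_{i}.mmd", "w", encoding="utf-8") as out:
        out.write(block)
```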

8. Multilingual Support

Trained on over a dozen languages: English, Chinese, French, Spanish, Japanese, Arabic, and more.

9. Visual Question Answering (VQA)

When queried, it either gives a direct answer or says Not mentioned. No guessing, no hallucination padding.
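
Reusing the model and processor from the earlier loading sketch, a VQA query looks like any other chat turn; the question and expected answers below are illustrative.

```python
# Sketch of a VQA query, reusing `model`, `processor`, and `image` from the
# loading example above. Question text and expected answers are illustrative.
question = "What is the invoice due date?"
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
answer_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    answer_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # a direct value, or "Not mentioned" when the document doesn't contain it
```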

Comparison with dots.ocr

Against dots.ocr, Nanonets OCR 2 shows visible improvements across the board:

  • Checkboxes: cleaner tag recognition, fewer false positives
  • Flowcharts: generates syntactically valid Mermaid code
  • Signatures: correctly separates pen marks from text noise
  • Tables: maintains cell structure better in Markdown export
  • Watermarks: consistent detection even in grayscale or faded documents

The difference is clear in outputs: dots.ocr treats visuals as background noise; Nanonets-OCR2 treats them as first-class information.

Image-to-Markdown Evaluations

The team used Gemini-2.5-Pro as the evaluator. Existing OCR benchmarks like olmOCRBench or OmniDocBench don’t measure markdown quality well, so Nanonets built their own evaluation set.

Results:

OCR2+ beats all previous internal models, and even GPT-5 (low-thinking mode), in markdown correctness.

OCR2+ had a win rate of ~34% vs Gemini Flash and 29% vs OCR2-3B, showing that model size doesn’t directly correlate with markdown accuracy.

The team plans to open-source evaluation scripts and predictions on GitHub, which is good: OCR benchmarks need more transparency.

VQA Evaluations

For document-level question answering, they used the IDP Leaderboard datasets (ChartQA, DocVQA).

  • ChartQA: OCR2+ scores 79.2 vs Qwen-72B’s 76.2.
  • DocVQA: OCR2+ hits 85.15, slightly below Gemini-Flash’s 85.5 but higher than most mid-scale open models.

Given its smaller parameter count, that’s impressive.

Training Details

The model was trained on 3 million+ pages, covering everything from invoices and research papers to healthcare forms and tax receipts. Both synthetic and manually annotated data were used.
The pipeline:

  1. Pretrain on synthetic docs for broad generalization.
  2. Fine-tune on annotated data for realism.

Base model: Qwen2.5-VL-3B, fine-tuned for document-specific vision-language tasks.

Limitations:

  • Struggles with very complex flowcharts or diagrams.
  • Occasional hallucinations in dense tabular layouts.

Still, it’s one of the more balanced trade-offs between performance and interpretability in the current open OCR landscape.

Use Cases

Nanonets-OCR2 basically acts as a bridge between scanned mess and structured reasoning.

  • Academia & Research: Converts PDFs with formulas and tables into Markdown + LaTeX.
  • Legal & Finance: Detects signatures, extracts checkboxes, standardizes formatting.
  • Healthcare: Reads medical forms and checkboxes accurately.
  • Enterprise Knowledge Bases: Turns multi-page reports into LLM-digestible text blocks.

If your workflow involves feeding documents into an LLM, this model saves you hours of cleanup.

Open Source & Access

You can try Nanonets-OCR2 directly on Docstrange, or grab it from Hugging Face. The team encourages open discussion on both platforms, which is rare for enterprise OCR tools.

nanonets/Nanonets-OCR2-3B · Hugging Face

In a world where unstructured data still blocks automation, models like Nanonets-OCR2 make the connection simple: raw pixels in, structured reasoning out.

Docstrange – AI Document Data Extraction by Nanonets

