Command A Vision : Best MultiModal LLM is here by Cohere

Cohere just dropped Command A Vision. Think of it as Command A’s smarter, more perceptive sibling, except this one sees everything: charts, scanned documents, risky real-world scenes, you name it. It’s a dense 112B-parameter model, and the weights are openly released, which is rare air for a model of this size and quality. It’s enterprise-grade and deployment-ready.


Most vision-language models claim a lot and deliver something closer to mediocre. Command A Vision? It’s built to do work: real-world, enterprise work. You can throw financial tables, healthcare diagrams, or construction PDFs at it, and it doesn’t blink. OCR? Real-world image analysis? That’s its default mode.

Benchmarks

It outperforms GPT-4.1, Mistral Medium, Pixtral Large, and Llama 4 Maverick on almost every major benchmark that actually matters for businesses. On DocVQA it pulls nearly 96%, and on MathVista, which tests reasoning over visuals, it clocks in at 73.5%, ahead of most “non-thinking” vision models that just regurgitate.

What’s Under the Hood

It’s stitched together from a SigLIP2 vision encoder, an image-tiling scheme that slices each image into up to 12 tiles plus a global thumbnail, and a 111B dense LLM backbone. Then they ran it through the classic trifecta of:

— vision-language alignment
 — supervised fine-tuning
 — and reinforcement learning (RLHF)

All with regularization and contrastive policy-gradient tricks. The adapter is LLaVA-style MLP glue between pixels and text.

Token count? Each tile is 256 tokens, so one image (12 tiles plus the thumbnail) maxes out at 3,328 tokens. It’s heavy, yeah, but it’s thorough. And you get structured outputs in JSON, not some loose-text mess you have to clean up with regex.
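As a rough sketch of that arithmetic (how many tiles a given image actually gets depends on its resolution and aspect ratio, so treat the tile count as an input, not a given):

```python
# Back-of-the-envelope token budget for one image: up to 12 tiles plus
# one global thumbnail, 256 tokens each. The tile count passed in is an
# assumption here; the real tiling depends on resolution and aspect ratio.
TOKENS_PER_TILE = 256
MAX_TILES = 12

def image_token_budget(num_tiles: int) -> int:
    """Tokens one image consumes: its tiles plus the global thumbnail."""
    tiles = min(num_tiles, MAX_TILES)
    return (tiles + 1) * TOKENS_PER_TILE

print(image_token_budget(12))  # 3328 -- the per-image maximum quoted above
```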

Deploys Without a Data Center

Here’s the wild part. You don’t need a mega-cluster to run it. Two A100s, or a single H100 if you go 4-bit, is enough. That’s it. Private, on-prem, offline: your data never leaves the building.
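Here’s a minimal loading sketch, assuming the Hugging Face checkpoint works with transformers’ AutoModelForImageTextToText and AutoProcessor classes and that bitsandbytes is installed; check the model card for the exact classes and hardware notes:

```python
# Hedged sketch: load Command A Vision in 4-bit on a single GPU.
# The model/processor classes below are assumptions -- verify against
# the Hugging Face model card before relying on them.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

model_id = "CohereLabs/command-a-vision-07-2025"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across whatever GPUs are visible
)
```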

What It Can Actually Do

— Extract terms from blurry scanned invoices
 — Spot risks from industrial site images
 — Parse construction drawings buried in PDFs
 — Read diagrams from multilingual product manuals
 — Understand charts and tables like a junior analyst who actually slept

Not just OCR. Not just object detection. It understands layout, context, industry lingo. It gets the scene.

How to use Command A Vision for free?

The weights are openly released and available below:

CohereLabs/command-a-vision-07-2025 · Hugging Face

The model can also be tried in Cohere’s playground:

Login | Cohere
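
If you’d rather hit the hosted model than run the weights yourself, a hedged sketch of a call through Cohere’s Python SDK looks roughly like this (the image_url content block and the exact model name are assumptions; check Cohere’s chat API docs before copying it):

```python
# Hedged sketch of calling the hosted model via Cohere's Python SDK.
# Model name and image content-block shape are assumptions; verify
# against Cohere's chat API documentation.
import base64
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # placeholder key

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = co.chat(
    model="command-a-vision-07-2025",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the vendor, invoice number, and total as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.message.content[0].text)
```

The same pattern covers the extraction use cases above: send the image, ask for JSON, parse the reply.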

Real People Already Using It

“We’re moving beyond text into what we can see.”
 — Jeffrey English, Director, Fujitsu Intelligence

“It automates data capture from lien waivers and construction drawings — cutting risk, time, and cost.”
 — Mark Webster, SVP, Oracle Infra Industries

Final Take

Command A Vision isn’t trying to be flashy. It’s not here for hype. It just quietly replaces a hundred brittle pipelines, manual review processes, and custom OCR hacks. If your business runs on documents, diagrams, or any image that actually means something, this model shows up, does the job, and doesn’t ask questions.

And yeah: it’s open weight. No API key. No gatekeeping.

