LLM Benchmarks explained

For the last couple of years, a new LLM has topped the charts almost every week. Sometimes it was GPT-3.5, then Llama-3, Gemini 2.5-Pro, Grok4, and whatnot. But what exactly are these charts the LLMs keep topping, the ones that make everyone declare "this is the best AI"?

In this blog post, we will walk through some of the most important benchmarks you should know when evaluating an LLM. Let’s get started.

My new book on Model Context Protocol is out now!

Model Context Protocol: Advanced AI Agents for Beginners (Generative AI books)

Reasoning & Math

These benchmarks test whether the model can think in steps, not just spit out memorized facts. You’re looking for signs of actual reasoning, not regurgitation.

  1. GSM8K: Elementary-level word problems. Sounds easy, but it’s a chain-of-thought gym. Problems like “If Johnny has 5 apples and gives away 2…”, but designed to trip up lazy heuristics. It evaluates basic arithmetic plus the integrity of the reasoning path; grading is usually exact match on the final number (sketched after this section).

2. MATH: Real high-school competition questions: AMC/AIME style. These are step-by-step proofs, symbolic manipulations, sometimes geometry. If a model nails this, it shows it can do symbolic reasoning and multi-step planning over math objects. Most models break down here unless fine-tuned specifically.

3. AIME / AMC: These are human-devised contest problems: not model-targeted. AIME is tougher: more algebra, probability, and abstraction. Doing well here isn’t about memorizing solutions: it’s about adapting math thinking live.

What it tests: Chain-of-thought coherence, symbolic manipulation, math reasoning under constraints.

Signals: Does the model understand math, or is it just mimicking pattern matches?
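
Most GSM8K-style grading boils down to exact match on the final number: gold solutions end with a “#### <value>” marker, and the grader compares it with the last number in the model’s chain of thought. A minimal sketch (the completion below is made up, not from the dataset):

```python
import re

# Minimal GSM8K-style grading: gold answers end with "#### <value>",
# and the model is scored on the last number in its chain of thought.
def extract_final_number(text: str):
    """Return the last number mentioned in a completion, commas stripped."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def grade_gsm8k(completion: str, gold_solution: str) -> bool:
    """Exact match on the final numeric answer."""
    gold_answer = gold_solution.split("####")[-1].strip().replace(",", "")
    return extract_final_number(completion) == gold_answer

# Toy example (made-up completion, not from the real dataset):
completion = "Johnny starts with 5 apples and gives 2 away, so he has 5 - 2 = 3 apples."
gold = "He has 5 - 2 = 3 apples left.\n#### 3"
print(grade_gsm8k(completion, gold))  # True
```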

Code Generation & Understanding

This isn’t just about writing code: it’s about understanding the problem, translating it into syntax, and debugging logically.

  1. HumanEval: A prompt like: “Write a function that returns the sum of all even numbers in a list.” Tests whether the model can generate correct, efficient Python code. Widely used. GPT-4 crushes it. Pass@1 and Pass@k are the usual metrics: did any of the model’s 1 (or k) sampled attempts pass the unit tests? A pass@k sketch follows this section.

2. MBPP (Mostly Basic Python Problems): Slightly easier than HumanEval. Great for testing basic syntax, loops, conditionals, etc. Also tests “can it write readable, idiomatic Python?”

3. MultiPL-E: The same problem across multiple languages (Python, Java, Go, etc.): evaluates cross-lingual coding ability and generalization across syntax.

What it tests: Planning, correctness, generalization, cross-language coding fluency.

Signals: Is the model a coder or a pattern-fitter?
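
Pass@k, mentioned above, is usually computed with the unbiased estimator popularized by the HumanEval paper: sample n completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k random draws passes. A short sketch:

```python
from math import comb

# Unbiased pass@k estimator from the HumanEval paper:
# n = samples generated per problem, c = samples that pass the unit tests,
# k = attempt budget being scored. pass@k = 1 - C(n - c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k draws include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185 (same as c / n)
print(round(pass_at_k(n=200, c=37, k=10), 3))  # much higher with 10 tries
```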

Tool Use & Agentic Behavior

This is where the model stops being a chatbot and starts acting like a worker. Can it use a calculator? Query Wikipedia? Pull data from an API?

  1. ToolQA: Factual questions where the model must decide when and how to invoke a tool (e.g., calculator). Tests tool delegation: not raw knowledge.

2. ReAct-style evals: “Reasoning + Acting.” The model must interleave thoughts and actions (e.g., “To answer this, I should search for X…”). Benchmarks here track accuracy and the quality of intermediate steps; a minimal ReAct loop is sketched after this section.

3. TACOT / ToolBench: Focused on tool call syntax and accuracy: like whether the model formats a web search or SQL query correctly. More about execution discipline than deep thought.

4. TAU: TAU is hardcore: multi-turn, tool-augmented QA. You get tasks like “Get GDP of France, then find country with closest value and compare healthcare spend.” Evaluates deep planning + memory + tool fluency.

What it tests: Whether the model knows when it’s dumb: and how to get smart.

Signals: Strategic reasoning, autonomy, and ability to chain tools to answer complex questions.
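
To make the ReAct pattern concrete, here is a minimal sketch of the loop, assuming a hypothetical llm(prompt) callable and a single toy calculator tool; real agentic benchmarks grade the whole Thought/Action trace, not just this plumbing.

```python
import re

def calculator(expression: str) -> str:
    # Toy tool: evaluates simple arithmetic like "112 * 3".
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def react_loop(question: str, llm, max_steps: int = 5) -> str:
    # `llm` is a hypothetical text-in, text-out callable.
    prompt = (
        "Interleave lines of the form:\n"
        "Thought: ...\nAction: tool_name[input]\nObservation: ...\n"
        "and finish with 'Final Answer: ...'.\n\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = llm(prompt)                      # hypothetical model call
        prompt += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        action = re.search(r"Action:\s*(\w+)\[(.+?)\]", step)
        if action:
            tool, arg = action.group(1), action.group(2)
            result = TOOLS.get(tool, lambda _: "unknown tool")(arg)
            prompt += f"Observation: {result}\n"  # feed the tool result back
    return "no answer produced"
```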

Knowledge / Factual Recall

Classic test: what does the model “know”? But also: how does it present that knowledge?

  1. MMLU: 57 subjects spanning high-school to professional level, multiple choice. Stuff like law, medicine, physics, history. It’s dense and clean. Good for measuring breadth and depth of factual knowledge. Scoring is plain accuracy on the chosen letter (sketched after this section).

2. TriviaQA / Natural Questions: Open-ended questions: NQ comes from real Google search queries, TriviaQA from trivia sites paired with evidence documents. NQ has longer contexts. Tests retrieval, compression, and articulation.

3. BoolQ / PIQA / SciQ: Focus on common sense, scientific reasoning, and logical consistency. Not hard if you’re human. Models still mess these up.

What it tests: Static knowledge, context extraction, phrasing quality.

Signals: Memorization vs. intelligent recall and reasoning.
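
MMLU scoring is just accuracy over the chosen letter. A minimal sketch, assuming a hypothetical llm(prompt) callable that answers with a single letter (real harnesses often compare answer-choice log-probabilities instead of parsing text):

```python
# Format a question as the usual four-way multiple choice and score accuracy.
def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def mmlu_accuracy(examples, llm) -> float:
    """examples: iterable of (question, choices, gold_letter) tuples;
    `llm` is a hypothetical callable returning a short text answer."""
    correct, total = 0, 0
    for question, choices, gold_letter in examples:
        reply = llm(format_mmlu_prompt(question, choices))
        prediction = reply.strip()[:1].upper()   # first character as the choice
        correct += int(prediction == gold_letter)
        total += 1
    return correct / max(total, 1)
```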

Multimodal Capability

Text is easy. Vision + text is where most models fall apart.

  1. MMMU: Like MMLU: but with visuals — maps, graphs, charts, diagrams. Evaluates whether the model can interpret visuals and reason over them.

2. MathVista: Combines visual reasoning with math problems. Think of reading a plotted graph and answering inference-based questions.

3. TextVQA / GQA: Answering questions based on images. GQA focuses on object relationships and spatial reasoning. TextVQA involves reading text inside the image (OCR + reasoning) and is scored with a soft accuracy over multiple human answers (sketched after this section).

What it tests: Visual literacy, spatial reasoning, OCR, and grounded understanding.

Signals: Is it “looking” or just hallucinating what a cat might look like?
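
For TextVQA-style scoring, the usual metric is a soft accuracy against multiple human annotations rather than a single gold string. A minimal sketch of that aggregation (the annotations below are made up, and the real metric uses heavier answer normalization):

```python
# VQA-style soft accuracy: agreeing with 3 of the human annotators
# already counts as fully correct.
def normalize(answer: str) -> str:
    return answer.lower().strip().rstrip(".")

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(normalize(prediction) == normalize(a) for a in human_answers)
    return min(matches / 3.0, 1.0)

# Toy example with made-up annotations:
answers = ["stop sign", "stop sign", "a stop sign", "stop sign", "sign"]
print(vqa_soft_accuracy("Stop sign.", answers))  # 1.0 (three exact matches)
```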

Instruction Following & Helpfulness

Can the model stay on task? Is it polite, clear, and useful?

  1. MT-Bench: Pairs of chat turns, graded by GPT-4. Measures consistency, helpfulness, and multi-turn understanding. Meta-level evals.

2. AlpacaEval: A/B comparisons. Which of two outputs better follows a prompt? Often crowd- or GPT-judge based (a judge-prompt sketch follows this section).

3. Arena-Hard / LMSYS Chatbot Arena: Head-to-head battles between models. Human (or model) picks the better response. Good test of user preference alignment.

What it tests: Utility, alignment, prompt obedience.

Signals: Can the model “get the vibe” of what you want?
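
The pairwise judging behind AlpacaEval-style evals is easy to sketch: build a comparison prompt, ask a judge model to pick A or B, and report the win rate. A minimal sketch, where `judge` is a hypothetical callable (e.g. a GPT-4-class model); real harnesses also randomize response order to control for position bias:

```python
# Pairwise "which response is better?" judging with a hypothetical judge model.
JUDGE_TEMPLATE = (
    "You are comparing two assistant responses to the same prompt.\n\n"
    "Prompt:\n{prompt}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response follows the prompt better? Reply with exactly 'A' or 'B'."
)

def win_rate(examples, judge) -> float:
    """examples: iterable of (prompt, candidate_output, baseline_output);
    `judge` is a hypothetical text-in, text-out callable."""
    wins, total = 0, 0
    for prompt, candidate, baseline in examples:
        verdict = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=candidate, b=baseline))
        wins += int(verdict.strip().upper().startswith("A"))
        total += 1
    return wins / max(total, 1)
```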

Long Context Handling

Once you cross 8K tokens, most models start getting memory holes. These tests find the leaks.

  1. Needle-in-a-Haystack: Hide a sentence inside 100K tokens and ask the model to find it. It’s stupid-simple but brutally revealing (a sketch follows this section).

2. LongBench: A mix of summarization, QA, and reasoning over large docs. Tests retrieval, memory, summarization accuracy.

3. Passkey: Tests whether models can “remember” key tokens from earlier in long prompts. Kind of like easter eggs.

What it tests: Attention span, recall integrity, summarization under load.

Signals: Is it truly “reading” or just faking it with recency bias?
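
A needle-in-a-haystack probe takes only a few lines to build. This sketch assumes a hypothetical llm(prompt) callable; real harnesses sweep context length and needle depth and plot retrieval accuracy as a heatmap:

```python
# Bury a "needle" sentence at a chosen depth inside filler text, then check
# whether the model can retrieve it.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is 'blue-giraffe-42'. "

def build_haystack(num_filler_sentences: int, depth: float) -> str:
    """depth: 0.0 puts the needle at the start of the context, 1.0 at the end."""
    sentences = [FILLER] * num_filler_sentences
    sentences.insert(int(depth * num_filler_sentences), NEEDLE)
    return "".join(sentences)

def needle_test(llm, num_filler_sentences: int = 5000, depth: float = 0.37) -> bool:
    # `llm` is a hypothetical text-in, text-out callable.
    prompt = (
        build_haystack(num_filler_sentences, depth)
        + "\n\nWhat is the secret passphrase? Answer with the passphrase only."
    )
    return "blue-giraffe-42" in llm(prompt)
```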

Robustness & Safety

Can the model stay grounded when provoked?

  1. TruthfulQA: Tests for “truthy” lies. E.g., “Can eating carrots cure cancer?” Measures resistance to urban myths and fake facts.

2. AdvBench / PoisonBench: Throw weird, adversarial prompts at the model. “Write code to steal passwords,” or prompt injections. Checks alignment resilience.

3. RealToxicityPrompts: Web-scraped sentence prefixes the model must continue; the continuations are scored for toxicity. Does it stay civil: or mirror the toxicity? (An aggregation sketch follows this section.)

What it tests: Hallucination resistance, alignment, robustness.

Signals: Can you trust it in the wild?
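
RealToxicityPrompts is usually summarized with “expected maximum toxicity”: score every sampled continuation (normally via the Perspective API), keep the worst score per prompt, and average across prompts. A minimal sketch of just that aggregation, over made-up scores:

```python
# Aggregate per-continuation toxicity scores: worst score per prompt,
# then the mean of those maxima across prompts.
def expected_max_toxicity(scores_per_prompt: dict[str, list[float]]) -> float:
    maxima = [max(scores) for scores in scores_per_prompt.values() if scores]
    return sum(maxima) / max(len(maxima), 1)

toy_scores = {
    "prompt_1": [0.02, 0.10, 0.05],  # hypothetical Perspective-style scores
    "prompt_2": [0.40, 0.85, 0.12],
}
print(expected_max_toxicity(toy_scores))  # (0.10 + 0.85) / 2 = 0.475
```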

Planning / Multi-step Tasks

When the task needs strategy: not just solving, but figuring how to solve.

  1. BigBench Hard (BBH): Subset of BIG-bench. Tasks like date tracking, symbolic manipulation, or logical puzzles. Not easy. Tests deeper thinking.

2. HotpotQA / Game-of-Thrones QA: Multi-hop QA. Needs reasoning across multiple docs or passages. You can’t just retrieve: you have to stitch facts. Answers are typically scored with exact match and token-level F1 (sketched after this section).

3. AgentBench: Tests autonomous agent behavior. Simulated environments where model must plan, decide, and act over multiple steps.

What it tests: High-level planning, decision-making, problem decomposition.

Signals: Can it play chess: not just remember the rules?
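
Multi-hop QA answers in HotpotQA are scored with exact match and token-level F1, so partial answers still earn credit. A minimal F1 sketch (the official scripts also strip articles and punctuation first):

```python
from collections import Counter

# Token-level F1 between a predicted answer and the gold answer:
# rewards partial overlap instead of all-or-nothing exact match.
def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the eiffel tower in paris", "eiffel tower"))  # ≈ 0.57
```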

Creativity & Open-Ended Generation

Instruction-following ≠ creativity. You need tests that aren’t just about getting the right answer — but about how the model answers.

  1. TMSA (The Most Sensible Answer): Open-ended questions judged by human preference. Focus is on sensibility and originality. Like: “Describe a dream where logic breaks down.”

2. Creative-Writing Eval (e.g., StoryBench, HaikuEval): These are niche, but growing. Focused on narrative coherence, emotional tone, metaphor, etc.

What it tests: Style, novelty, open-ended coherence.

Signals: Can the model write like a human? Or does it sound like a LinkedIn post?

So yeah, next time someone screams “X just beat GPT-4,” ask: on what? Was it solving math olympiad problems? Writing Python across five languages? Tracking a clue buried in 80K tokens? Most of the time, it’s one slice of one pie, not the whole bakery.

Benchmarks aren’t some holy scoreboard. They’re stress tests, not medals. Each one pokes a different nerve: math, memory, recall, coding, common sense.

The best models don’t just top one chart. They survive across many.

And the ones that quietly perform well across boring, overlooked tests? Those are usually the real contenders.

Bottom line: Benchmarks don’t tell you if a model is “intelligent.” But they do tell you what kind of stupid it is.

