Qwen3-Thinking-2507 : The best reasoning LLM is here

Qwen3-Thinking beats DeepSeek, Gemini 2.5 Pro, and more


Qwen models have been all over the internet since last week. First, they dropped Qwen3-235B-A22B, an upgraded version that beat out Kimi-K2. Then they dropped Qwen3-Coder. Yesterday, they dropped Qwen3-MT, and now they have dropped Qwen3-Thinking, which looks to be the best reasoning model out there.

This isn’t just another fine-tune or instruction-following upgrade. The focus this time is sharp and specific: thinking. Real, structured, step-by-step reasoning.

My new book “Model Context Protocol” is live now.

Model Context Protocol: Advanced AI Agents for Beginners (Generative AI books)

So what’s new?

Qwen3-Thinking-2507 takes reasoning to another level. It’s better at logic, math, science questions, and code: anything where you actually have to go step by step and not just guess what sounds right. It also performs well on academic benchmarks, not just toy examples.

It’s also improved on general abilities: following instructions properly, using tools effectively, generating text that makes sense, and aligning better with what humans actually prefer. On top of that, it now handles long contexts up to 256K tokens natively. That means you can throw entire books at it, and it won’t choke.

And because it’s designed purely for “thinking” tasks, this version runs in thinking mode only.

The model is structured to include internal planning markers (like <think>) in the backend, even if you don’t explicitly add them. So if you ever see an output that contains only the closing </think> tag, that’s by design. It’s thinking behind the scenes, not chatting casually.
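Because the opening <think> tag is injected behind the scenes, raw completions often carry only the closing tag. Here is a minimal sketch of how you might split the reasoning from the final answer yourself (a hypothetical helper, not part of any official Qwen tooling):

```python
def split_thinking(completion: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, answer).

    The chat template injects the opening <think> tag, so the raw
    completion may contain only the closing </think>.
    """
    marker = "</think>"
    if marker in completion:
        reasoning, answer = completion.split(marker, 1)
        # Drop the opening tag if the model did emit it.
        return reasoning.removeprefix("<think>").strip(), answer.strip()
    return "", completion.strip()  # no thinking block found


reasoning, answer = split_thinking(
    "First, 2 + 2 = 4.\n</think>\nThe answer is 4."
)
```

This way your application can log or hide the chain of thought and show users only the part after </think>.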

The internals

Under the hood, it’s still the same 235-billion-parameter model, but only 22B parameters are active during inference. The full parameter set is spread across 128 experts, and at any time only 8 of them fire per token. That’s a Mixture of Experts setup: heavy in capacity but light in actual compute per forward pass.

It’s got 94 layers. For attention, it uses a GQA setup: 64 heads for queries and 4 for key/value. And again, the native context length is a whopping 262,144 tokens.
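The routing idea behind that 8-of-128 setup can be sketched in a few lines. This is a toy gate for illustration only (not the actual Qwen3 routing code): it scores every expert for a token and keeps the top 8, which is why only ~22B of the 235B parameters are touched per forward pass.

```python
import math
import random

NUM_EXPERTS = 128  # total experts in the MoE layer
TOP_K = 8          # experts actually used per token


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top-k experts by gate score and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Pretend gate scores for one token, one per expert.
token_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(token_logits)
# `selected` holds 8 (expert_index, weight) pairs; the token's output is
# the weighted sum of just those experts' outputs.
```

The sparsity is the whole trick: capacity scales with the number of experts, while per-token compute scales only with the 8 that fire.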

How does it perform?

  • Math? On the AIME25 benchmark, it hit 92.3, just behind OpenAI’s o4-mini. On HMMT25, it landed 83.9, better than Claude and most others.
  • General knowledge and logic? It scored 84.4 on MMLU-Pro, 93.8 on MMLU-Redux, and made a huge leap on GPQA, from 71.1 in the earlier Qwen3 to 81.1 now. That’s the kind of jump you rarely see between versions.
  • SuperGPQA, which is even harder, got a bump too, now sitting at 64.9.
  • Coding? This is where things get interesting. On LiveCodeBench, the previous Qwen3 was sitting around 55.7. Now? It’s 74.1. That’s the best among all tested models, even beating OpenAI’s o3 and o4-mini.
  • CFEval also went up to 2134, again the highest in the set.
  • OJBench dropped a bit though, down to 32.5 from 33.6. Minor dip, not a dealbreaker.

Agent-style tasks?

It’s more capable at multi-step tool use and planning. Tasks like BFCL, TAU2-Retail, and TAU2-Telecom saw massive improvements.

TAU2-Retail jumped from 40.4 to 71.9.

TAU2-Telecom went from 21.9 to 45.6.

These aren’t just scores. They show the model’s getting better at decision-making across different domains.

Writing and alignment?

It scored 88.3 on WritingBench, higher than even o3 and Claude.
On Creative Writing, it held strong at 86.1, almost matching top-tier outputs.

Multilingual reasoning?

MultiIF jumped to 80.6, ahead of most others. PolyMATH also hit 60.1, which is surprisingly high considering it’s one of the toughest multilingual benchmarks around.

For the really hard reasoning and coding tasks, like PolyMATH, AIME, or long-form code generation, they used output lengths of up to 81,920 tokens during evaluation. For normal tasks, it sticks to 32,768. If you’re benchmarking or running complex pipelines, it’s good to know how much room it has.
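That rule of thumb is easy to encode in your own pipeline. A small sketch (hypothetical helper and task names, just mirroring the evaluation budgets reported above):

```python
# Output-length budgets taken from the reported evaluation setup.
DEFAULT_MAX_NEW_TOKENS = 32_768    # normal tasks
HARD_TASK_MAX_NEW_TOKENS = 81_920  # PolyMATH, AIME, long-form code

HARD_TASKS = {"polymath", "aime", "long_form_code"}


def max_new_tokens(task: str) -> int:
    """Return the generation budget to use for a given benchmark task."""
    if task.lower() in HARD_TASKS:
        return HARD_TASK_MAX_NEW_TOKENS
    return DEFAULT_MAX_NEW_TOKENS


budget = max_new_tokens("AIME")  # pass this as max_new_tokens to generation
```

The point is simply to give hard reasoning tasks enough room to finish their chain of thought rather than truncating it mid-derivation.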

How to use Qwen3-Thinking-2507 for free?

The model can be tried for free at Qwen Chat:

Qwen Chat

Also, the weights are open-source, so they can be downloaded from Hugging Face:

Qwen/Qwen3-235B-A22B-Thinking-2507 · Hugging Face

Final thoughts

Qwen3-Thinking-2507 isn’t trying to be a do-it-all chatbot. It’s built to think, not just talk. That means better step-by-step breakdowns, better use of tools, fewer hallucinated answers, and actual planning across long tasks.

If you’re building agents, testing long-context reasoning, or doing anything more complex than summarizing an email, this is probably the model you want to try next. No fireworks, no fluff, just more brain.


Qwen3-Thinking-2507 : The best reasoning LLM is here was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.
