DeepSeek V3.1 Terminus: The ChatGPT killer is back
How to use DeepSeek’s new model?
DeepSeek has been shipping models at a fast clip. The latest update is called V3.1 Terminus, which isn’t a new base model but a refinement on top of V3.1.
Think of it as a service pack release: not rewriting the engine, but patching some quirks and tightening the bolts.
What actually changed
The main fix is around language consistency.
Earlier, V3.1 sometimes slipped into mixing Chinese and English or emitted stray characters mid-response. Terminus irons that out. If you’re building multilingual systems, this alone saves hours of cleaning up weird artifacts.
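Before Terminus, pipelines often needed a cleanup pass for exactly this failure mode. A minimal sketch of such a check (the character ranges are an illustrative heuristic, not an official filter):

```python
import re

# Matches CJK ideographs, CJK punctuation, and fullwidth forms -- the kind
# of stray characters earlier V3.1 builds occasionally mixed into English output.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]")

def has_stray_cjk(text: str) -> bool:
    """Flag English-language model output that contains CJK artifacts."""
    return bool(CJK_RE.search(text))

print(has_stray_cjk("The answer is 42."))   # False
print(has_stray_cjk("The answer is 42。"))  # True (fullwidth period)
```

In practice you would only run a check like this on responses that are supposed to be English-only; for genuinely multilingual output it would flag legitimate text.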
The other upgrade is in its agents.
DeepSeek’s models aren’t just text predictors; they’re trained to act as “agents” that use tools. Terminus pushes improvements into the Code Agent (better at structured programming tasks) and the Search Agent (retrieval and browsing). If you measure these against agentic benchmarks like BrowseComp, SimpleQA, or Terminal-bench, you can see clear jumps.
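To make “agents that use tools” concrete, here is the shape of a tool-calling request in the OpenAI-compatible format that DeepSeek’s hosted API accepts. The `search_web` tool, its parameters, and the user question are hypothetical; tools are always defined by the caller, not baked into the model.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# schema. The model decides when to emit a call to it; your code executes
# the call and feeds the result back.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
            },
            "required": ["query"],
        },
    },
}

# The request body a Search-Agent-style call would carry; "deepseek-chat"
# selects the regular (non-thinking) mode on the hosted API.
request_body = {
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "What changed in V3.1 Terminus?"}],
    "tools": [search_tool],
}

print(json.dumps(request_body, indent=2))
```

Benchmarks like BrowseComp essentially score how well the model plans and chains these calls, which is where Terminus’s gains show up.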
Benchmarks
Benchmarks tell the story better than release notes
Reasoning without tool use
- MMLU-Pro: 84.8 → 85.0. Basically a rounding error, but it shows consistency. The model isn’t slipping, though it isn’t breaking new ground either.
- GPQA-Diamond: 80.1 → 80.7. Slight bump, which means small but steady improvements in high-level question answering.
- Humanity’s Last Exam: 15.9 → 21.7. This is the only reasoning benchmark where Terminus made a noticeable jump. The task is designed to push edge-case reasoning, so the gain suggests the tweaks helped with rare or difficult prompts.
- LiveCodeBench: 74.8 → 74.9. No real difference. Coding ability under pure reasoning remains flat.
- Codeforces: 2091 → 2046. A step back. The contest-style coding benchmark is unforgiving, and here Terminus underperforms compared to V3.1.
- Aider-Polyglot: 76.3 → 76.1. A negligible dip, but basically unchanged.
Agentic tool use
- BrowseComp: 30.0 → 38.5. Huge improvement. Terminus handles browsing + reasoning tasks much more reliably.
- BrowseComp-zh: 49.2 → 45.0. Regression in the Chinese browsing benchmark. The consistency fixes seem to have skewed toward English performance.
- SimpleQA: 93.4 → 96.8. Already strong, now even stronger. It nails simple question answering when tools are involved.
- SWE Verified: 66.0 → 68.4. A solid bump on software engineering tasks, where external tool usage is critical.
- SWE-bench Multilingual: 54.5 → 57.8. Gains here mean Terminus is better at fixing or reasoning about codebases across languages, not just English.
- Terminal-bench: 31.3 → 36.7. Big step up in command-line style tasks, showing the Code Agent is better tuned.
So Terminus is noticeably better at tool-use tasks.
But if you look at pure reasoning benchmarks (MMLU-Pro, GPQA-Diamond, Codeforces ratings), the gains are modest. In contest-style coding, it even slipped a little: Codeforces went from 2091 down to 2046. It’s a reminder that optimizations for tool use don’t always translate to raw reasoning strength.
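The split is easy to see if you average the score changes per group (percentage-scale benchmarks only; Codeforces is excluded because it is an Elo-style rating, not a percentage):

```python
# (V3.1, Terminus) score pairs from the tables above.
reasoning = {
    "MMLU-Pro": (84.8, 85.0),
    "GPQA-Diamond": (80.1, 80.7),
    "Humanity's Last Exam": (15.9, 21.7),
    "LiveCodeBench": (74.8, 74.9),
    "Aider-Polyglot": (76.3, 76.1),
}
agentic = {
    "BrowseComp": (30.0, 38.5),
    "BrowseComp-zh": (49.2, 45.0),
    "SimpleQA": (93.4, 96.8),
    "SWE Verified": (66.0, 68.4),
    "SWE-bench Multilingual": (54.5, 57.8),
    "Terminal-bench": (31.3, 36.7),
}

def mean_delta(scores):
    """Average (Terminus - V3.1) change across a benchmark group."""
    return sum(after - before for before, after in scores.values()) / len(scores)

print(f"reasoning: {mean_delta(reasoning):+.2f}")  # roughly +1.3 points
print(f"agentic:   {mean_delta(agentic):+.2f}")    # roughly +3.1 points
```

And the reasoning average is itself inflated by the Humanity’s Last Exam outlier; drop it and the remaining reasoning benchmarks are essentially flat.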
Specifications
- Parameters: 671B total, with ~37B active at once through mixture-of-experts
- Context length: up to 128K tokens (that’s book-length conversations)
- Inference: runs with FP8 microscaling to save compute
- Modes: a regular chat mode and a “thinking” mode for heavier reasoning or agentic workflows
- Open source: MIT-licensed weights are up on Hugging Face, so you can self-host
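A rough back-of-envelope from those specs, to set self-hosting expectations (this counts weights only and ignores KV cache, activations, and serving overhead):

```python
# FP8 stores one byte per parameter, so weight memory and the per-token
# active fraction follow directly from the 671B / 37B mixture-of-experts split.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = 1  # FP8

weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_frac = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights at FP8: ~{weight_gb:.0f} GB")  # ~671 GB just for weights
print(f"active per token: {active_frac:.1%}")  # ~5.5% of parameters per token
```

In other words, the MoE design keeps per-token compute at roughly a 37B-model level, but you still need multi-GPU-node memory to hold the full weights.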
Why this update matters
If you’re mainly using DeepSeek for reasoning benchmarks, Terminus isn’t a dramatic leap. But if your workflows rely on agents (code generation, search, tool chaining), it’s a solid bump in reliability and accuracy.
One caveat: Terminus was optimized for multilingual consistency, and that came with a small trade-off. In some Chinese browsing tasks, the numbers dropped slightly (BrowseComp-zh fell from 49.2 to 45.0). Nothing dramatic, but worth noting if your use case is Chinese-first.
Bottom line
Terminus isn’t a flashy release. It’s more of a targeted fix: better agent performance, cleaner multilingual outputs, same long context, and still fully open-sourced. For anyone building retrieval-augmented systems or coding assistants, this is the version you’d actually want to deploy.
DeepSeek V3.1 Terminus: The ChatGPT killer is back was originally published in Data Science in Your Pocket on Medium.