The best LLMs so far?
OpenAI strikes back. The company has dropped two major models: the full version of o3 and o4-mini, and the results look strong on the benchmarks.
Though o3 is reserved for paid ChatGPT users, o4-mini is available on the free tier as well.
Key Features of OpenAI o3:
Advanced Reasoning Capabilities
- o3 is OpenAI’s most powerful reasoning model, excelling in coding, math, science, and visual perception.
- Sets new state-of-the-art (SOTA) results on Codeforces, SWE-bench (without scaffolding), and MMMU.
- Makes 20% fewer major errors than o1 on real-world tasks, particularly in programming, business consulting, and creative ideation.
Full Tool Integration
- Can agentically use and combine all tools within ChatGPT, including:
  - Web search
  - Python code execution
  - Visual reasoning (image analysis, charts, graphics)
  - Image generation
- Trained to reason about when and how to use tools effectively.
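The same tool-use pattern is exposed through the API. Below is a minimal sketch of what a tool-enabled request body could look like via the Chat Completions API; the `get_weather` function schema is a hypothetical example (not an OpenAI built-in), and o3 API access depends on your account tier:

```python
# Sketch: declaring a tool for an o-series model via the Chat Completions API.
# Only the request body is built here; sending it requires an API key.
request = {
    "model": "o3",  # assumes your account has o3 API access
    "messages": [
        {"role": "user", "content": "What's the weather in Paris right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
# The model decides when to call the tool; your code runs the function and
# returns its result in a follow-up message with role "tool".
```

The point of the release is that the model itself reasons about *whether* a tool call helps, rather than calling tools reflexively.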
Multimodal & Visual Reasoning
- Can integrate images into its reasoning process, enabling problem-solving that blends visual and textual analysis.
- Excels in interpreting whiteboards, textbook diagrams, and hand-drawn sketches, even if blurry or low-quality.
Improved Efficiency & Performance Scaling
- Benefits from reinforcement-learning scaling, showing consistent performance gains with increased compute.
- More efficient than o1 at equal latency and cost.
Safety & Refusal Improvements
- Completely rebuilt safety training data with enhanced refusal mechanisms for biorisk, malware, and jailbreaks.
- Passes OpenAI’s Preparedness Framework evaluations (below “High” risk in biorisk, cybersecurity, and AI self-improvement).
Key Features of OpenAI o4-mini:
Optimised for Speed & Cost-Efficiency
- A smaller, faster model designed for high-throughput reasoning.
- Achieves remarkable performance for its size, especially in math, coding, and visual tasks.
- Outperforms o3-mini in both STEM and non-STEM tasks (e.g., data science).
Strong Benchmark Performance
- Best-performing model on AIME 2024 & 2025 (competition math).
- Excels in real-world tasks, with higher usage limits than o3.
Improved Instruction Following & Natural Responses
- More natural and conversational compared to previous models.
- Better at referencing memory and past conversations for personalised responses.
Tool Use & Agentic Capabilities
- Like o3, it can strategically use tools (web search, Python, image generation).
- Optimized for fast, multi-step workflows (typically under a minute).
Safety & Compliance
- Shares o3’s enhanced safety mitigations, including refusal training and reasoning-based monitoring.
Common Features (o3 & o4-mini):
- Available in ChatGPT (Plus, Pro, Team, Enterprise) and via the API; o4-mini is also available on the free tier.
- Unified reasoning and conversational abilities, blending o-series problem-solving with GPT-series natural dialogue.
- Codex CLI support: both work with OpenAI’s new terminal-based coding agent for local code execution and reasoning.
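For API users, here is a minimal sketch of a request body targeting o4-mini. This is an illustrative example, not official sample code: it only builds the JSON body, which you would POST to the Chat Completions endpoint (`https://api.openai.com/v1/chat/completions`) with an `Authorization: Bearer <API key>` header, assuming your account has access to the model:

```python
import json

# Minimal Chat Completions request body for o4-mini (illustrative sketch).
payload = {
    "model": "o4-mini",
    "messages": [
        {"role": "user", "content": "Explain what makes o4-mini cost-efficient."}
    ],
}

# Serialize for sending with any HTTP client of your choice.
body = json.dumps(payload)
```

Because o4-mini is the high-throughput model of the pair, it is the natural default for this kind of programmatic, many-requests workload.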

Benchmarks and metrics
- o3 (no tools) nailed AIME math with up to 91.6% accuracy: top-tier reasoning in competitive math.
- o4-mini (no tools) nearly matched o3 on AIME and Codeforces tasks, showing it’s lean yet lethal.
- In Codeforces-style programming, both o3 and o4-mini scored above 2700 Elo, which is elite-coder territory.
- o3 (no tools) crushed GPQA (PhD-level science) with 83.3% accuracy, outperforming OpenAI’s earlier models.
- o4-mini (no tools) followed closely with 81.4%, showing great reasoning without external help.
- On “Humanity’s Last Exam”, o3 jumped from 20.3% to 24.9% with Python and browsing enabled: great at tool use.
- o4-mini (with tools) hit 17%, very respectable for a compact model with generalist capabilities.
- Overall, o3 is the best all-rounder, especially with tools: smart and resourceful.
- o4-mini is the surprise underdog, consistently delivering near-o3 performance at a lighter footprint.
Bottom line?
o3 is your go-to academic overachiever, while o4-mini is the budget genius that can almost keep up.
Conclusion
OpenAI’s o3 and o4-mini redefine AI intelligence: o3 is the powerhouse for complex reasoning, and o4-mini is the fast, cost-efficient alternative. Both excel in coding, math, and multimodal tasks while integrating tools seamlessly. With top-tier benchmarks and enhanced safety, they set a new standard for AI performance. The future of smart, efficient AI is here.
OpenAI o3 and o4-mini released was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.