Scaling Agents via Continual Pre-training Paper explained

Tongyi DeepResearch, Best LLM for DeepResearch

Recently, Alibaba launched an open-source LLM called Tongyi DeepResearch, which is taking the internet by storm as one of the best-performing LLMs for deep research. Alongside it, they released the research paper “Scaling Agents via Continual Pre-training”, which describes how the deep research model was built, and that is the paper we will explain today.

My new book on AI Agents is out!

Model Context Protocol: Advanced AI Agents for Beginners (Generative AI books)

The Core Problem

Most agent systems today are built on top of general-purpose LLMs (like Qwen, GLM, etc.) using post-training methods such as SFT (Supervised Fine-Tuning) and RLHF (Reinforcement Learning with Human Feedback).

The problem is that these models weren’t designed as agents from the ground up. They try to learn tool use, reasoning, and alignment all at once during post-training. That creates an optimization conflict: the model struggles because it is forced to develop new skills and be aligned at the same time.

As a result, open-source agent models (WebSailor, GLM-4.5, DeepSeek-V3.1, etc.) lag far behind closed-source systems like OpenAI’s Deep Research.

Agentic Continual Pre-training (Agentic CPT)

Instead of only relying on post-training, the authors insert a new stage between pre-training and post-training.

  • Normal pipeline: Pre-training → Post-training
  • Their pipeline: Pre-training → Agentic CPT → Post-training

This new middle stage creates agentic foundation models that already have some built-in reasoning and tool-use ability. So post-training no longer carries the double burden of both learning agent skills and aligning them.

How They Did It: AgentFounder Model

They built a model called AgentFounder (30B params) using this pipeline.

Two main ideas drive their data and training:

First-order Action Synthesis (FAS)

  • Generate question–plan–action style data.
  • Use knowledge-to-question transformation (turning static text into dynamic questions).
  • Planning synthesis: generate multiple reasoning paths and first-step tool calls (without actually calling APIs, which avoids cost).
  • Reasoning synthesis: build synthetic chains of thought that answer questions step by step.
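The FAS steps above can be sketched as a small data-generation function. The function name, record fields, and tool names here are illustrative assumptions, not the paper’s actual implementation:

```python
# Minimal sketch of First-order Action Synthesis (FAS). The schema and names
# are assumptions for illustration; the paper does not publish its format.
def synthesize_fas_record(entity: str, question: str) -> dict:
    """Turn a static knowledge snippet into a question-plan-action record.

    The first-step tool call is written down but never executed, which is
    what keeps FAS data cheap to generate at scale (no API cost).
    """
    return {
        "question": question,
        "plan": [
            f"Search for '{entity}'",
            "Identify the relevant fact in the results",
            "Summarize and answer",
        ],
        # Intended first action, recorded but not executed
        "first_action": {"tool": "search", "query": entity},
    }

record = synthesize_fas_record(
    entity="Wimbledon 2024 men's singles winner",
    question="Who won the men's singles at Wimbledon in 2024?",
)
print(record["first_action"])
```

The point of the sketch is the shape of the data: question, plan, and an intended first tool call that is never actually run.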

Higher-order Action Synthesis (HAS)

  • Instead of reusing only “successful” agent trajectories, they expand each step into multiple alternative decisions.
  • This turns wasted or discarded trajectories into rich training data, forcing the model to learn decision-making at every step instead of just mimicking a full trajectory.
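The HAS idea can be sketched as a trajectory expander: every step of a rollout becomes a multiple-choice decision record. The function and record schema are illustrative assumptions:

```python
# Illustrative sketch of Higher-order Action Synthesis (HAS): each step of a
# trajectory is expanded into a decision record with candidate actions.
# Names and schema are assumptions, not the paper's implementation.
def expand_trajectory(trajectory: list, alternatives: dict) -> list:
    """Expand each step into a decision record.

    trajectory:   list of actions actually taken in a rollout
    alternatives: dict mapping step index -> plausible alternative actions
    """
    records = []
    for i, action in enumerate(trajectory):
        records.append({
            "context": trajectory[:i],                      # steps so far
            "options": [action] + alternatives.get(i, []),  # candidate actions
            "chosen": action,                               # action taken
        })
    return records

records = expand_trajectory(
    ["search CEO", "search biography", "summarize"],
    {0: ["search leadership", "ask who leads the company"]},
)
print(len(records))  # one decision record per step
```

Note that even a discarded or failed rollout yields one decision record per step, which is how “wasted” trajectories become training data.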

Progressive Two-stage Training

  • Stage 1: Train with ~200B tokens of agentic data at 32K context length.
  • Stage 2: Train with ~100B tokens of high-quality agent data at 128K context length, enabling long-horizon planning.
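The two stages can be written as a simple schedule. The token counts and context lengths come from the paper; the dict layout itself is an illustrative sketch:

```python
# Two-stage training schedule; numbers are from the paper, layout is illustrative.
STAGES = [
    {"stage": 1, "tokens_b": 200, "context": "32K", "data": "agentic data"},
    {"stage": 2, "tokens_b": 100, "context": "128K", "data": "high-quality agentic data"},
]

total_b = sum(s["tokens_b"] for s in STAGES)
print(total_b)  # 300 (billion tokens of agentic CPT in total)
```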

Example

1. First-order Action Synthesis (FAS)

This is about generating basic planning and reasoning data without making actual tool calls. Think of it as simulating “what an agent would plan to do” when faced with a question.

Example (Planning Action Synthesis):

  • Question: “Who won the men’s singles at Wimbledon in 2024, and how many Grand Slam titles does he have now?”

Generated plan (no real API calls):

  • Step 1: Search “Wimbledon 2024 men’s singles winner”
  • Step 2: Identify the player (e.g., Carlos Alcaraz).
  • Step 3: Search “Carlos Alcaraz total Grand Slam titles”
  • Step 4: Summarize and answer.

Here, the model generates the reasoning and the intended tool actions (“search this… check that…”) but doesn’t actually query Google. That way, they can cheaply create lots of training samples.
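Serialized, the Wimbledon plan above might look like a single training sample. The JSON schema here is an assumption for illustration, not the paper’s actual data format:

```python
import json

# One FAS planning sample for the question above; schema is illustrative.
sample = {
    "question": ("Who won the men's singles at Wimbledon in 2024, "
                 "and how many Grand Slam titles does he have now?"),
    "plan": [
        {"step": 1, "action": "search", "query": "Wimbledon 2024 men's singles winner"},
        {"step": 2, "action": "identify", "note": "name the winning player"},
        {"step": 3, "action": "search", "query": "<player> total Grand Slam titles"},
        {"step": 4, "action": "summarize"},
    ],
    "tools_executed": False,  # intended actions only; no real queries issued
}

print(json.dumps(sample, indent=2))
```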

Example (Reasoning Action Synthesis):

  • Question: “Which city hosted the Olympics two years before Paris 2024?”

Generated reasoning:

  • The 2024 Olympics were in Paris.
  • Two years before 2024 is 2022.
  • The 2022 Olympics were Winter Games.
  • Winter Olympics 2022 were hosted in Beijing.
  • Answer: Beijing.

Notice: no external tool call, but the reasoning chain itself is turned into training data.

2. Higher-order Action Synthesis (HAS)

This step is about expanding agent trajectories into richer decision spaces. Instead of keeping only the “golden path” (correct sequence of actions), they branch out at each step and explore alternative reasoning paths.

Example: Let’s say the model is tasked with: “Find the CEO of OpenAI in 2025 and summarize their background.”

Original trajectory (from a rollout):

  • Step 1: Search “OpenAI CEO 2025” → Response: “Sam Altman is CEO.”
  • Step 2: Search “Sam Altman biography” → Response: text about his background.
  • Step 3: Summarize → Answer.

HAS Expansion: For Step 1, instead of only keeping “Search OpenAI CEO 2025,” the system generates alternative candidate actions:

  • Option A: “Search OpenAI leadership 2025”
  • Option B: “Who leads OpenAI 2025?”
  • Option C: “Search Sam Altman CEO OpenAI 2025”

All these are plausible. The training data then records the choice-making process.

“I will choose option A” → tool response → continue reasoning.

This way, the model isn’t just imitating one trajectory; it’s learning decision-making at each step, with exposure to multiple valid paths.
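The Step 1 branching above can be recorded as one decision sample. The field names are illustrative assumptions:

```python
# The Step-1 branch point as a single HAS decision record; schema is
# an illustrative assumption, not the paper's format.
decision = {
    "task": "Find the CEO of OpenAI in 2025 and summarize their background.",
    "history": [],  # nothing has been done yet at Step 1
    "options": [
        "Search 'OpenAI CEO 2025'",             # original rollout action
        "Search 'OpenAI leadership 2025'",      # option A
        "Search 'Who leads OpenAI 2025?'",      # option B
        "Search 'Sam Altman CEO OpenAI 2025'",  # option C
    ],
    "chosen": 1,  # "I will choose option A" -> tool response -> continue
}
print(decision["options"][decision["chosen"]])
```

Each such record teaches the model to weigh plausible alternatives rather than copy one fixed path.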

In short:

  • FAS = teaching the model how to plan and reason cheaply (questions → plans & reasoning chains).
  • HAS = teaching the model to explore and decide among alternatives (branching decision space from each step of a trajectory).

Results (AgentFounder-30B)

On 10 benchmarks, AgentFounder beat all open-source models and even outperformed some commercial closed ones.

  • BrowseComp-en: 39.9% (better than DeepSeek-V3.1’s 30.0%)
  • GAIA: 72.8% (new SOTA)
  • Humanity’s Last Exam (HLE): 31.5% (first open-source to cross 30%)
  • Academic Browse: 75.3% (excellent academic research ability)
  • Strong tool-use still maintained (tested on ACEBench).

They also found clear scaling laws: more data and bigger models consistently improve agentic capabilities, and their CPT method makes that scaling more efficient: AgentFounder-30B beat larger baselines such as DeepSeek-V3.1.

Why It Matters

The key contribution here isn’t just another agent. It’s a new training philosophy:

  • Build agentic foundation models first (via CPT),
  • Then apply post-training for alignment.

This approach makes training more efficient, produces models that are better at reasoning, planning, and tool use, and closes the gap with closed-source research agents like OpenAI’s.

My Take

This paper is basically arguing that the next big leap in agent models won’t come from better RLHF tricks or more synthetic trajectories alone. It will come from rethinking pre-training so that models already have “agentic biases” baked in. AgentFounder is their proof-of-concept.

AgentFounder shows that agents can’t just be patched onto general LLMs after pre-training; they need to be grown as agentic foundation models. By baking in planning and decision-making through continual pre-training, Tongyi moves closer to closing the gap with closed systems like OpenAI Deep Research.

The message is blunt: the future of strong agents won’t come from better post-training tricks, but from rethinking pre-training itself.


Scaling Agents via Continual Pre-training Paper explained was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.
