Self-Searching Minds: How Tsinghua’s Reinforcement Learning Algo Teaches LLMs to Be Their Own Search Engines
SSRL (Self-Search Reinforcement Learning) shows that large language models can “search their own brains” to solve information-seeking tasks, cutting expensive calls to web search during RL training and still transferring well to real search at inference. The team first measures an LLM’s intrinsic “Self-Search” ceiling via structured prompting and repeated sampling, then trains policies with format- and rule-based rewards to better extract internal knowledge without external tools.
What SSRL is
SSRL is a training framework in which the model plays both roles, policy and "search engine", generating queries and the retrieved snippets itself inside a structured, single-pass trajectory. By rewarding correct, well-formatted outputs, SSRL improves internal knowledge utilization and reduces hallucinations, while remaining compatible with plugging in real web search later.
Why it matters
- Cost and stability: RL with live search engines is pricey and brittle; SSRL trains fully offline on internal knowledge yet yields robust behavior, enabling cheaper, more stable training loops.
- Sim‑to‑real: Skills learned via self-search transfer to online settings with Google or local corpora, often needing fewer tool calls than prior search-RL baselines.
- Capability insight: Repeated sampling reveals a high pass@k ceiling — LLMs often already “know” the answers; SSRL teaches them to retrieve that knowledge more reliably.
How Self-Search is measured
The authors quantify intrinsic search by prompting models to think, propose queries, and “retrieve” information they generate within tags like <think>, <search>, <information>, and <answer> — all in one forward pass. Scaling the number of samples K significantly boosts pass@k across QA, multi‑hop QA, and web‑browsing benchmarks, with Llama models showing strong gains and smaller models narrowing the gap to much larger ones as K grows.
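Because the whole trajectory is emitted in one forward pass, scoring only needs tag extraction. A minimal parsing sketch (the tag names come from the paper; the parser itself is my own illustration, not the authors' code):

```python
import re

def parse_trajectory(text: str) -> dict:
    """Extract tagged spans from a single-pass self-search trajectory."""
    fields = {}
    for tag in ("think", "search", "information", "answer"):
        # Collect every occurrence; multi-hop trajectories repeat tags.
        fields[tag] = re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return fields

trajectory = (
    "<think>Who directed Inception?</think>"
    "<search>Inception director</search>"
    "<information>Inception (2010) was directed by Christopher Nolan.</information>"
    "<answer>Christopher Nolan</answer>"
)
parsed = parse_trajectory(trajectory)
print(parsed["answer"][0])  # -> Christopher Nolan
```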
Key findings at a glance
- Predictive scaling: Performance improves consistently as the number of samples increases, even on BrowseComp, a tough browse-and-search task.
- Llama vs. Qwen: Contrary to math-reasoning trends, Llama models outperform Qwen models on self-search, likely reflecting stronger world-knowledge priors.
- Token economy: More “thinking tokens,” multi‑turn self‑search, or enforced reflection can hurt efficiency; short‑CoT and naive repeated sampling work best under fixed token budgets.
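The pass@k curves behind these findings are typically computed with the standard unbiased estimator from the code-generation evaluation literature (not code from the SSRL repo), given n samples per question of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them correct,
    is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 4 correct answers out of 16 samples, a single draw succeeds 25%
# of the time, but 8 draws almost always contain a correct answer.
print(pass_at_k(16, 4, 1))  # -> 0.25
```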
The SSRL training recipe

- Objective simplification: Because the policy both reasons and retrieves internally, the RL objective drops explicit external-retrieval terms, optimizing expected reward with a KL regularizer to a reference policy.
- Rewards: A composite signal combines outcome accuracy with a format reward that enforces structured, multi‑step trajectories; tokens inside <information>…</information> are masked out of the loss, which stabilizes training.
- Algorithms and setup: GRPO is the default, with PPO, Reinforce++, DAPO, and KL‑Cov also tested for robustness; experiments focus on Llama‑3.2‑3B and Llama‑3.1‑8B (base and instruct).
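The reward logic in the recipe above can be sketched as follows. The tag rubric and the 0.2/0.8 weighting are illustrative assumptions, not the paper's exact values:

```python
import re

def format_reward(traj: str) -> float:
    """1.0 if the trajectory contains well-formed think/search/information/
    answer tags in order, else 0.0. (Exact rubric is an assumption.)"""
    pattern = (r"<think>.*?</think>\s*<search>.*?</search>\s*"
               r"<information>.*?</information>\s*<answer>.*?</answer>")
    return 1.0 if re.search(pattern, traj, re.DOTALL) else 0.0

def outcome_reward(pred: str, gold: str) -> float:
    """Rule-based outcome check: exact string match after light cleanup."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def composite_reward(traj: str, pred: str, gold: str,
                     w_format: float = 0.2) -> float:
    """Weighted sum of format and outcome rewards (weights illustrative)."""
    return (w_format * format_reward(traj)
            + (1 - w_format) * outcome_reward(pred, gold))
```

A trajectory that is both well-formed and correct scores 1.0; a correct answer in a malformed trajectory is penalized, which is what pushes the policy toward the structured self-search format.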
Benchmarks and results

The study spans Natural Questions, TriviaQA, HotpotQA, MuSiQue, 2WikiMultiHopQA, and Bamboogle (EM metric). SSRL‑trained instruction models consistently outperform vanilla prompts, R1‑like methods, and even several external‑search RL baselines, highlighting the strength of auto‑regressive internal retrieval.
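EM on these QA benchmarks is conventionally computed with SQuAD-style answer normalization; a minimal sketch (the normalization rules are the common convention, assumed rather than taken from the SSRL code):

```python
import re
import string

def normalize(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, golds: list[str]) -> int:
    """1 if the prediction matches any gold answer after normalization."""
    return int(any(normalize(pred) == normalize(g) for g in golds))

print(exact_match("The Eiffel Tower.", ["eiffel tower"]))  # -> 1
```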

Sim‑to‑real generalization
When replacing simulated information with real Google or local corpus results at inference, SSRL models generally beat Search‑R1 and ZeroSearch while requiring fewer turns. The format‑alignment from training makes plugging in real search trivial, suggesting a practical path to low‑cost training with high‑fidelity deployment.
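Mechanically, the plug-in amounts to intercepting generation at each closing </search> tag and overwriting the model's imagined snippets with real ones. A hypothetical sketch, where `model_generate` and `retrieve` are stand-ins for a real LLM API and a search backend:

```python
def self_search_step(model_generate, retrieve, question: str) -> str:
    """One sim-to-real turn: stop at </search>, run the query against a
    real retriever, inject the snippets as <information>, then resume."""
    prefix = model_generate(question, stop="</search>") + "</search>"
    query = prefix.split("<search>")[-1].replace("</search>", "").strip()
    snippets = retrieve(query)  # real Google / local-corpus results
    prefix += "<information>" + " ".join(snippets) + "</information>"
    return prefix + model_generate(prefix, stop="</answer>") + "</answer>"

# Stub backends to show the control flow (not a real model or index):
def fake_generate(prompt, stop):
    if stop == "</search>":
        return "<think>Need the year.</think><search>moon landing year"
    return "<answer>1969"

def fake_retrieve(query):
    return ["Apollo 11 landed on the Moon in 1969."]

out = self_search_step(fake_generate, fake_retrieve, "When did humans land on the Moon?")
```

Because training already enforces this tag layout, no format adaptation is needed at deployment; only the content of the <information> block changes.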
Practical implications
- Cheaper agent training: Train agents for search‑heavy QA without paying per‑call search costs; then attach web search at inference only where needed.
- Reduced hallucination: By enforcing structure and outcome‑based reward, SSRL improves factual reliability from internal knowledge before consulting tools.
- Small‑model viability: With repeated sampling and SSRL, compact models approach larger models’ accuracy ceilings, useful for edge or budget constraints.
Open questions and limits
The work shows high “upper bound” via pass@k, but selecting the single correct answer without ground truth remains hard; naive majority voting scales weakly, indicating the need for better selection strategies. While SSRL thrives offline, knowledge freshness still depends on pretraining and may require hybrid sim+real approaches for up-to-date facts.
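The naive majority-voting baseline the authors find scales weakly can be sketched in a few lines; the normalization here is my own simplification:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent normalized answer across K samples.
    Beyond the plurality answer's frequency, extra samples add little,
    which is why pass@k can far exceed voted accuracy."""
    normed = [a.strip().lower() for a in answers]
    return Counter(normed).most_common(1)[0][0]

print(majority_vote(["Paris", "paris", "Lyon"]))  # -> paris
```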
Resources and reproducibility
Code, datasets, and scripts are available from the TsinghuaC3I SSRL repository, with instructions for running baseline comparisons and sim‑to‑real evaluations on an 8×A800 node. The arXiv paper provides prompts, ablations, and full reward-format code for replication.
git clone https://github.com/TsinghuaC3I/SSRL
cd verl
pip install -r requirements.txt
huggingface-cli download --repo-type dataset --resume-download TsinghuaC3I/SSRL --local-dir SSRL_dataset # download the dataset
bash examples/ssrl/example.sh
Bottom line
SSRL reframes search‑agent RL by turning the model into its own search engine during training, demonstrating strong gains, robust sim‑to‑real transfer, and lower cost. It suggests a future where agentic systems learn to harvest internal knowledge first — then call the web only when it truly matters.
Source: Originally published in Data Science in Your Pocket on Medium.