
Large language models, like the AI chatbots we use every day, are powerful but not perfect. They sometimes forget details, make the same mistakes, or struggle with specific tasks like planning a trip or analyzing money reports. A new research paper introduces a smart way to fix this without tweaking the model’s core programming — instead, it builds and improves a “playbook” of tips and strategies that the AI can refer to. This approach, called Agentic Context Engineering or ACE, helps AI systems learn and get better over time, much like how we humans jot down notes from experience to avoid repeating errors.
The Problem with Today’s AI Systems
Imagine trying to cook a new recipe without a full set of instructions. You might guess some steps, but you’d likely mess up or skip important details. That’s similar to how current AI systems handle complex jobs, like acting as a virtual assistant or crunching financial numbers. Traditional methods to improve AI involve fine-tuning its “brain” by retraining on more data, but that’s expensive and time-consuming. Instead, many experts focus on “context adaptation”, where you feed the AI better instructions, examples, or facts right before it works on a task.
This sounds great, but it has issues. First, there’s “brevity bias”, where optimization tools keep making instructions shorter and more general, losing useful specifics, like tips for using a phone app or spotting errors in bank statements. Second, “context collapse” happens when the AI rewrites its own notes over time, shortening them so much that key details vanish and performance drops sharply. For example, in one test, a detailed set of notes shrank from over 18,000 tokens to just 122, and the AI’s accuracy fell below its starting point.

The paper argues that AI doesn’t need tiny summaries like humans do; it thrives on long, rich “playbooks” full of strategies, where it can pick what’s relevant on its own. This is especially true for AI agents that chat back and forth or handle specialized fields like finance.
Introducing ACE: Building Evolving Playbooks
ACE is a framework that treats the AI’s context as a living playbook — a collection of bullet points with tips, strategies, and lessons that grows and refines itself. It works for both “offline” setup (like optimizing a starting prompt) and “online” use (like updating notes during real tasks). Unlike past methods that rewrite everything at once, ACE uses a team of three AI “roles” to build this playbook step by step, preventing loss of information.
The core idea is “grow-and-refine”: the playbook expands with new insights but gets cleaned up to avoid repeats or junk. This makes it scalable for big AI models that handle long inputs, and it’s cheap because it doesn’t need full rewrites each time.

How ACE Works: The Three Key Players
ACE splits the work into three parts, like a team of experts collaborating.
- Generator tackles new problems. It takes a query, like “split a bill with roommates using a payment app”, and tries to solve it using the current playbook. It generates step-by-step reasoning, code, or answers, and notes which playbook tips helped or hurt.
- Reflector reviews what happened. It looks at the Generator’s attempt, any errors (like wrong calculations), and feedback from the task (success or failure). Without needing perfect “ground truth” answers, it spots patterns — like “always check contacts in the phone app before assuming relationships from emails.” It tags playbook items as helpful, harmful, or neutral and suggests fixes. This step can loop a few times to polish insights.
- Curator updates the playbook. It adds new bullet points only if they’re fresh and useful, avoiding duplicates by comparing new ideas against existing ones with simple similarity checks. Updates are small “deltas”, changes to specific bullets only, which keeps the whole process fast. A rough sketch of this three-role loop follows the list.
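To make that division of labor concrete, here is a minimal Python sketch of one adaptation step. It assumes a generic call_llm helper and a task-feedback function; the function names and prompts are illustrative, not the paper’s actual code.

```python
# Minimal sketch of one ACE adaptation step (hypothetical names, not the paper's code).
# Assumes a `call_llm(prompt) -> str` helper and a `feedback_fn(attempt)` task checker.

def ace_step(query, playbook, call_llm, feedback_fn):
    # Generator: attempt the task using the current playbook as context.
    attempt = call_llm(
        f"Playbook:\n{playbook}\n\nTask: {query}\n"
        "Solve step by step and note which playbook bullets you used."
    )

    # Task feedback: a success/failure signal from the environment or a checker.
    feedback = feedback_fn(attempt)

    # Reflector: diagnose what helped or hurt, without needing ground-truth answers.
    insights = call_llm(
        f"Attempt:\n{attempt}\n\nFeedback: {feedback}\n"
        "Tag the playbook bullets used as helpful, harmful, or neutral, "
        "and propose new lessons as short bullet points."
    )

    # Curator: turn insights into small delta updates (add/edit bullets)
    # rather than rewriting the whole playbook.
    delta = call_llm(
        f"Current playbook:\n{playbook}\n\nInsights:\n{insights}\n"
        "Return only new or revised bullets; skip anything already covered."
    )
    return delta
```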
Each bullet in the playbook has a unique ID, counters for how often it’s useful or not, and content like “For bill splitting: Use phone app to find roommates first, then filter payments by their emails.” Over time, the playbook becomes a detailed guide, organized into sections like strategies, common mistakes, or API tips.
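Based on that description, here is one way a single playbook bullet might be represented in code. The field names are my own guess at a reasonable schema, not the paper’s exact format.

```python
# Illustrative representation of one playbook bullet (field names are assumptions).
from dataclasses import dataclass

@dataclass
class Bullet:
    bullet_id: str          # unique ID, e.g. "billing-003"
    section: str            # e.g. "strategies", "common mistakes", "API tips"
    content: str            # the tip itself
    helpful_count: int = 0  # incremented when the Reflector tags it as helpful
    harmful_count: int = 0  # incremented when it is tagged as harmful

example = Bullet(
    bullet_id="billing-003",
    section="strategies",
    content="For bill splitting: use the phone app to find roommates first, "
            "then filter payments by their emails.",
)
```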

Smart Updates: Avoiding Old Pitfalls
To fight brevity bias and collapse, ACE uses “incremental delta updates.” Instead of rewriting the entire playbook, it only tweaks relevant bullets. This keeps details intact and lets you process many updates at once.
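Here is a rough sketch of what applying such a delta could look like, treating the playbook as a dictionary of bullets. The delta format (add/edit/remove lists) is an assumption for illustration, not taken from the paper.

```python
# Sketch of an incremental delta update: only the listed bullets change,
# every other bullet in the playbook is left untouched.

def apply_delta(playbook: dict, delta: dict) -> dict:
    """playbook maps bullet_id -> bullet dict; delta lists additions, edits, removals."""
    updated = dict(playbook)                            # untouched bullets carry over as-is
    for bullet in delta.get("add", []):
        updated[bullet["bullet_id"]] = bullet           # append brand-new tips
    for edit in delta.get("edit", []):
        bid = edit["bullet_id"]
        updated[bid] = {**updated.get(bid, {}), **edit} # tweak only the relevant bullet
    for bullet_id in delta.get("remove", []):
        updated.pop(bullet_id, None)                    # drop bullets flagged as harmful
    return updated
```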
The “grow-and-refine” rule adds new stuff freely but prunes extras later — either right away or when the playbook gets too long. It uses quick checks (no heavy AI needed) to merge similar ideas, keeping things tidy without losing value. This setup works well with modern AI that can handle huge inputs, and it cuts wait times dramatically.
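And a toy version of the refine step: a cheap pruning pass that drops near-duplicate bullets using a plain string-similarity check from Python’s standard library. The paper describes lightweight semantic comparisons; this stand-in and its threshold are just for illustration.

```python
# Toy "refine" pass: merge near-duplicate bullets with a cheap string-similarity
# check (no extra model calls). The 0.85 threshold is arbitrary.
from difflib import SequenceMatcher

def prune_duplicates(bullets: list[str], threshold: float = 0.85) -> list[str]:
    kept: list[str] = []
    for candidate in bullets:
        if any(SequenceMatcher(None, candidate, existing).ratio() >= threshold
               for existing in kept):
            continue            # near-duplicate of an existing tip: skip it
        kept.append(candidate)  # genuinely new content: keep it
    return kept
```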
What the Tests Show: Real-World Wins
The researchers tested ACE on two tough areas: AI agents and financial analysis.
For agents, they used AppWorld, a benchmark where an AI controls apps like email or music players to complete goals, like organizing files or setting alarms. Starting with a base model (DeepSeek-V3.1), plain use got 42% success. Adding examples (in-context learning) bumped it to 46%, and a rival method (GEPA) to 46.4%. But ACE hit 59.4% offline and 59.5% online, roughly 17 percentage points above the plain baseline. Even without ground-truth feedback, it still improved by 14.8%. On the public leaderboard, ACE matched the top production system (which uses a bigger model) overall and beat it on the hardest tasks.

In finance, tests on FiNER (labeling items in financial reports) and Formula (calculating figures from filings) showed similar gains. The base model averaged 69%; ACE reached 82% offline (a 12.8% gain) and 76.6% online (7.5%), beating rivals by 8.6% on average. It shines where precise, domain-specific rules matter, like XBRL financial tags.


Ablation tests, which remove one component at a time, showed each part matters: skipping the Reflector dropped results by 4.3%; removing multi-round reflection cost 2.6%; starting fresh online, without an offline warm-up, lost 2.8%.

Best of all, ACE is efficient — 82% faster than rivals offline, 91% online, with 75–83% less compute or cost.
Why This Matters and What’s Next
ACE shows we can make AI self-improve cheaply by evolving its “notes” rather than its core. Longer playbooks don’t mean higher costs anymore, thanks to tricks like reusing cached info. It’s great for ongoing learning, like adapting to new rules without retraining, and even “unlearning” bad info for privacy.
But it’s not magic — it needs decent feedback to work well, and simple tasks might not need such detail. Future work could blend it with other AI tools for even smarter systems.
In short, ACE turns AI contexts into dynamic guides that grow wiser with use, paving the way for reliable helpers in agents, finance, and beyond.