What is Google Nested Learning?
Resolving catastrophic forgetting in machine learning
If you’ve ever fine-tuned a model and watched it forget what it knew before, you’ve seen the core problem of modern AI: catastrophic forgetting. You teach it something new, and the old knowledge vanishes.

For example:
- You train a model on English.
- Then you fine-tune it on French.
- It becomes better at French… but suddenly forgets English.
Nested Learning is a way to stop that from happening.
We patch this with tricks: replay buffers, architectural tweaks, clever optimizers. But none of it really fixes the root issue: our models don’t learn like brains do. They overwrite instead of integrating.
Google Research just published a paper at NeurIPS 2025 introducing Nested Learning. It’s not another architecture tweak. It’s a structural rethink of what “learning” even means inside a neural network.
The setup

Typical deep learning is built on a simple hierarchy:
You have a model (the architecture).
You have an optimizer (the training rule).
You have data.
We’ve always treated these as separate. The model learns representations. The optimizer just updates weights.
Nested Learning says that division is artificial. The model and the optimizer are both learning systems, just operating at different levels of abstraction.
Each has its own internal flow of information, or what the authors call “context flow.” When you zoom out, training isn’t one process; it’s a stack of interconnected optimization problems nested inside one another, each updating at a different rate.
Think of it like Russian dolls

Inside every neural network sits another smaller learner: the optimizer.
Inside that, another layer that decides how fast or slow each part updates.
Each layer learns from its own context, one from short-term examples, another from longer-term patterns.
Humans do something similar. You might pick up a new fact in seconds (short-term), but updating a belief or habit can take years (long-term).
The brain has multiple learning rates running in parallel. Nested Learning gives models that same structure: multi-time-scale updates.
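To make the multi-time-scale idea concrete, here’s a minimal toy sketch (my own illustration, not the paper’s algorithm): three parameter groups share one training loop, but each group refreshes at a different frequency, giving each its own effective time scale.

```python
import numpy as np

# Illustrative only: three parameter groups, each updated at its own frequency.
rng = np.random.default_rng(0)

params = {
    "fast":   rng.normal(size=4),   # updated every step
    "medium": rng.normal(size=4),   # updated every 10 steps
    "slow":   rng.normal(size=4),   # updated every 100 steps
}
update_every = {"fast": 1, "medium": 10, "slow": 100}
lr = 0.1

def grad(p):
    # Dummy gradient: pulls each group toward zero (stands in for a real loss).
    return p

update_counts = {name: 0 for name in params}
for step in range(1, 1001):
    for name, p in params.items():
        if step % update_every[name] == 0:
            params[name] = p - lr * grad(p)
            update_counts[name] += 1

print(update_counts)  # {'fast': 1000, 'medium': 100, 'slow': 10}
```

After training, the fast group has been pulled almost all the way to zero while the slow group still retains most of its initial state, which is exactly the “fast layers adapt, slow layers stay steady” behavior described above.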
A small example
Say you’re training a chatbot:
- The inner layer learns quick patterns, how to respond in a conversation.
- The middle layer slowly learns your tone and style.
- The outer layer holds stable knowledge like grammar or facts.
When you fine-tune it for a new topic (say, finance), the fast inner layer adapts, but the slow layers remain steady. So it learns new things without forgetting old ones.
Why this matters
Right now, language models have two kinds of memory:
- The context window, which vanishes after you hit enter.
- The weights, which are frozen after pretraining.
Everything else, all that “learning in between”, doesn’t exist. That’s why they don’t truly remember across interactions or adapt without retraining.

Nested Learning introduces what they call a Continuum Memory System (CMS). Instead of just short-term and long-term memory, CMS has a spectrum: modules that update at different frequencies. Some update every few steps, others every few thousand. Together, they behave like a layered memory that doesn’t erase itself every time new data comes in.
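Here’s a rough sketch of what a CMS-style spectrum could look like. The class names and mechanics are my own assumptions for illustration, not the paper’s implementation: each module is an exponential moving average of incoming features, refreshed at its own frequency, so fast modules track recent data while slow modules retain older data.

```python
import numpy as np

# Hypothetical CMS-style memory spectrum (illustrative names and mechanics).
class MemoryModule:
    def __init__(self, dim, update_every, decay):
        self.state = np.zeros(dim)
        self.update_every = update_every  # how often this module refreshes
        self.decay = decay                # how much old content it keeps

    def maybe_update(self, step, x):
        if step % self.update_every == 0:
            self.state = self.decay * self.state + (1 - self.decay) * x

dim = 8
spectrum = [
    MemoryModule(dim, update_every=1,    decay=0.5),   # near-term memory
    MemoryModule(dim, update_every=50,   decay=0.9),   # mid-term memory
    MemoryModule(dim, update_every=1000, decay=0.99),  # long-term memory
]

rng = np.random.default_rng(1)
for step in range(1, 5001):
    x = rng.normal(size=dim)            # stand-in for a new observation
    for m in spectrum:
        m.maybe_update(step, x)

# A read can blend all time scales instead of relying on a single memory.
readout = np.mean([m.state for m in spectrum], axis=0)
```

Because the slow module only absorbs a tiny fraction of each new observation, new data can’t erase what it holds, which is the “doesn’t erase itself” property in miniature.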
The optimizer gets an upgrade

Another piece of this is how they reinterpret optimizers.
Take Adam or SGD: they update weights using gradients, treating each sample as independent. Nested Learning reframes optimizers as associative memory modules. Each remembers how previous examples behaved and uses that memory to update more intelligently.
Instead of just using dot-product similarity (how close two vectors are), they switch to actual loss objectives such as L2 regression, making the update more aware of relationships between samples.
It’s subtle but powerful: optimizers start behaving like mini-learners inside the model.
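The dot-product vs L2-regression distinction can be shown with a toy associative memory (my reading of the framing; all names here are mine, not the paper’s). Both variants store key-value pairs, but one recalls by similarity-weighted averaging, while the other fits the key-to-value map by minimizing an L2 loss:

```python
import numpy as np

# Toy associative memory: recall by dot-product similarity vs L2 regression.
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 8))            # keys (e.g., past inputs)
W_true = rng.normal(size=(8, 4))
V = K @ W_true                          # values the memory should recall

# 1) Dot-product recall: weight stored values by similarity to the query.
def dot_product_recall(q):
    sims = K @ q                        # similarity of query to each key
    weights = np.exp(sims - sims.max()) # softmax over similarities
    weights /= weights.sum()
    return weights @ V                  # similarity-weighted average

# 2) L2-regression recall: fit the key->value map by least squares.
W_fit, *_ = np.linalg.lstsq(K, V, rcond=None)

q = rng.normal(size=8)
target = q @ W_true
err_dot = np.linalg.norm(dot_product_recall(q) - target)
err_l2 = np.linalg.norm(q @ W_fit - target)
print(err_dot, err_l2)
```

On this synthetic linear task, the regression-based memory recovers the underlying map almost exactly, while similarity-weighted averaging only approximates it: a small demonstration of why framing the update as a loss-minimizing memory can be more expressive.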
A new model: Hope

To test this, they built a model called Hope, a variant of their older Titans architecture.
Titans already had long-term memory that prioritized “surprising” events. Hope extends this by adding more nested levels and connecting them through the continuum memory system.
Hope is a self-modifying model. It can literally update the way it updates itself, a kind of recursive learning loop. That lets it scale to very long context windows and maintain stable memory over time.
The experiments

They tested Hope on:
- Language modeling
- Long-context reasoning
- Continual learning
- Knowledge incorporation tasks
Results were consistent: Hope outperformed Titans, Samba, and standard Transformers.
Lower perplexity (meaning better language prediction), higher reasoning accuracy, and better long-context performance, especially on “Needle-in-a-Haystack” tasks, where the model has to recall a single relevant piece of information from a massive context sequence.
What’s really going on

If you strip away the jargon, Nested Learning is trying to close the gap between how neural networks train and how brains learn. It merges architecture and optimization into a single, self-updating system.
That sounds abstract, but here’s what it implies in practice:
- You can build models that keep learning without retraining from scratch.
- You can separate fast and slow memory updates, preventing catastrophic forgetting.
- You can design optimizers that behave like memory systems rather than mechanical gradient followers.
It’s a step toward continual learning, the kind humans do without thinking about it.
Another small example
Imagine a chatbot trained on general English. Now you fine-tune it to understand legal contracts. A standard model will start writing like a lawyer but forget how to chat casually. A Nested Learning model wouldn’t.
Its fast-updating layers adapt to legal phrasing.
Its slow-updating layers keep the general language sense intact.
Over time, both layers integrate: you get a model that understands law but still “talks” like a person.
That’s the practical win here: adaptability without forgetting.
Final thought

Nested Learning isn’t just an algorithm tweak; it’s a reframing. It treats every part of a neural network (architecture, optimizer, memory) as part of the same recursive learning ecosystem.
Hope is just the first prototype, but it hints at something bigger: models that can learn continuously and modify their own learning rules as they go.
If that scales, it could shift AI from being a product you retrain to a system that evolves, slowly, intelligently, and without forgetting what came before.
What is Google Nested Learning? was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.