Kimi-Linear: Bye Bye Transformers
New LLM architecture better than Transformers
Transformers changed everything, from how we generate language to how we reason over data, but they were never built for endurance. They were designed for precision, not memory efficiency. Every token talks to every other token through a full attention matrix.
That’s beautiful in theory but disastrous at scale. When context stretches to hundreds of thousands or even millions of tokens, Transformers turn sluggish. GPU memory bloats, inference slows, and the entire model begins to drag under its own weight.
Kimi Linear, built by Moonshot AI, simply fixes the part of the Transformer that ages badly: its attention. It’s the same family, but with a different metabolism. Instead of remembering everything, it remembers what matters, forgets what doesn’t, and does it all without losing accuracy.
Why Traditional Transformers Choke on Long Contexts

Let’s start with how a normal Transformer works.
Each token in your input sequence (say, every word in a paragraph) tries to understand its relation to all other tokens through a computation called self-attention.
The cost of that computation grows with the square of your sequence length: if your sequence doubles, the work quadruples.
That’s fine for short paragraphs, but absurd when you’re feeding it long papers, books, or gigabyte-scale codebases.
Even during inference, Transformers keep every intermediate “key” and “value” vector in memory so that the model doesn’t have to recompute them. This KV cache expands linearly with the number of tokens, eating VRAM like it’s free. Once you cross 128k or 256k tokens, even high-end GPUs start sweating.
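To make that concrete, here is a rough back-of-the-envelope estimate of KV cache size. The layer, head, and precision numbers are illustrative assumptions, not any particular model's configuration, but the linear growth is the point:

```python
# Rough KV-cache size estimate for a standard Transformer.
# All configuration numbers are illustrative assumptions,
# not the settings of any specific model.
num_layers = 60          # decoder layers
num_kv_heads = 8         # key/value heads (grouped-query attention)
head_dim = 128           # dimension per head
bytes_per_value = 2      # fp16 / bf16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, cached at every layer for every token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / 1e9:6.1f} GB of KV cache")
# ~2 GB at 8k tokens, ~31 GB at 128k, ~246 GB at 1M -- and that is just the cache.
```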
This is why models like Claude 3.5 or GPT-4 have to use clever compression and caching schemes to simulate long memory. They’re still bound by the Transformer’s math.
Kimi Linear attacks that directly.
The Core Idea: Attention That Behaves Like Memory

At the center of Kimi Linear lies Kimi Delta Attention (KDA), a more structured and efficient form of linear attention.
To understand it, think of the Transformer’s attention as a giant lookup system: every token looks up what other tokens said, weighted by relevance. KDA changes that to a running memory: instead of looking at the entire past, it keeps an internal state that continuously updates with each new token.
- You can think of that internal state as a notebook that the model keeps updating. Each new token either strengthens, rewrites, or fades part of what’s written there.
- Old information doesn’t just sit forever: it decays unless it’s reinforced. This introduces a kind of learnable forgetting. The mechanism that controls this decay is called a gate.
- In earlier linear models like Gated DeltaNet or Mamba, that gate was blunt: it decided memory decay per attention head (like a single dimmer switch for an entire group of neurons).
Kimi Delta Attention goes deeper: it adds a channel-wise forget gate, letting each dimension decide how fast it forgets. That’s like giving every neuron its own clock for memory. Some features can hold information for long-term patterns (like syntax or topic flow), while others can forget instantly (like stop words or punctuation).
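To ground the idea, here is a minimal sketch of linear attention with a channel-wise forget gate. It is deliberately simplified; this is not the exact Kimi Delta Attention (delta-rule) update, just the pattern described above: a fixed-size state, decayed channel by channel, updated token by token.

```python
import torch

def channelwise_gated_linear_attention(q, k, v, decay):
    """Minimal sketch of linear attention with a channel-wise forget gate.

    q, k:   (seq_len, d_k)  queries and keys
    v:      (seq_len, d_v)  values
    decay:  (seq_len, d_k)  per-token, per-channel forget gates in (0, 1)

    NOT the exact Kimi Delta Attention rule -- only the recurrent,
    fixed-size-state pattern with per-channel decay.
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    state = torch.zeros(d_k, d_v)                 # the "notebook": size independent of seq_len
    outputs = []
    for t in range(q.shape[0]):
        state = decay[t].unsqueeze(-1) * state    # each key channel forgets at its own rate
        state = state + torch.outer(k[t], v[t])   # write the new token into memory
        outputs.append(q[t] @ state)              # read: no lookup over past tokens
    return torch.stack(outputs)

# usage: a 16-token toy sequence with 8 key channels and 4 value channels
q, k, v = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 4)
decay = torch.sigmoid(torch.randn(16, 8))         # learned by the real model
print(channelwise_gated_linear_attention(q, k, v, decay).shape)  # torch.Size([16, 4])
```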
Fine-Grained Forgetting: A Closer Look

Imagine feeding Kimi Linear a long story:
“A boy found a key near an old oak tree. Twenty paragraphs later, he opens a locked door.”
- A Transformer would carry all the intermediate text in its KV cache just to remember what “the key” referred to. Kimi Linear doesn’t. It continuously compresses the past into a single evolving state matrix.
- The decay mechanism ensures that irrelevant details like descriptions of weather or filler dialogue slowly fade, while meaningful entities like key or door remain stored longer.
This is closer to how human recall works. You don’t remember every word of a story; you remember the threads that matter. Kimi Linear formalizes that logic in its attention dynamics.
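A quick way to see what per-channel gates buy you: a token’s unreinforced contribution shrinks roughly like decay**d after d steps, so each channel’s decay rate sets its own memory horizon. The rates below are made up for illustration, not learned values:

```python
# How far back does a channel "remember"? An unreinforced write from
# d tokens ago survives with weight roughly decay**d. Illustrative rates only.
for decay in (0.90, 0.99, 0.999, 0.9999):
    horizons = {d: f"{decay**d:.1e}" for d in (10, 100, 1_000, 10_000)}
    print(decay, horizons)
# A channel with decay 0.90 forgets within a paragraph;
# one with decay 0.9999 still holds ~37% of a signal from 10,000 tokens ago.
```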
The Hybrid Architecture: Mixing Linear and Global Layers

A big problem with pure linear attention models is that they lose a sense of “global” context.
Because they only maintain a running summary, they struggle with tasks that need precise long-range relationships (like exact copy tasks or mathematical proofs).
- Kimi Linear sidesteps this by mixing three KDA layers with one full Transformer-style attention layer, a ratio of 3:1.
- The linear layers handle the heavy lifting: fast updates, local reasoning, sequential recall. The periodic full-attention layers reintroduce the global perspective, ensuring the model doesn’t drift too far from the full context.
- This simple ratio balances quality and efficiency perfectly. Too many global layers, and you lose speed. Too few, and you lose precision. The 3:1 balance keeps both.
This hybridization gives Kimi Linear the unusual ability to scale to a 1-million-token context while maintaining accuracy equal to or better than full attention. In practical numbers, that means up to six times faster decoding and roughly 75% less memory used for the KV cache.
That means a story of a million words, or a repository with tens of thousands of files, can fit into a single model pass.
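Structurally, the 3:1 layout is easy to picture. The sketch below is just the layer schedule; the names are placeholders for illustration, not Moonshot AI’s actual module classes:

```python
# Layer schedule for the 3:1 hybrid described above.
# The layer names are placeholders, not real module classes.
def build_hybrid_stack(num_layers: int, ratio: int = 3):
    """Every (ratio + 1)-th layer is full attention; the rest are KDA."""
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "kda"
        for i in range(num_layers)
    ]

print(build_hybrid_stack(8))
# ['kda', 'kda', 'kda', 'full_attention', 'kda', 'kda', 'kda', 'full_attention']
```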
Forgetting Is the New Positional Encoding

One of the Transformer’s biggest quirks is that it doesn’t know order unless you explicitly tell it. It treats “cat sat on mat” and “mat sat on cat” as the same unless you add positional encodings like sinusoidal embeddings or RoPE. These inject mathematical signals that encode position in the sequence.
Kimi Linear doesn’t need that.
Because it decays over time, the strength of memory itself encodes position implicitly. The further back something is, the weaker its representation becomes unless it’s reinforced by context. In other words, position is baked into the model’s forgetting curve.
This approach not only makes the model simpler (no more tuning positional frequency bases) but also makes it generalize better to longer contexts. Traditional RoPE-based Transformers often start to lose coherence beyond the lengths they were trained on, because the rotation frequencies no longer align. Kimi Linear, driven by adaptive decay, doesn’t have that problem. It simply keeps decaying smoothly as it moves forward.
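You can see the positional signal fall out of the math with one number: with a constant per-step decay alpha, content written at position p reaches a reader at position t with weight roughly alpha ** (t - p). The alpha here is purely illustrative:

```python
# With a constant per-step decay, distance in the sequence shows up
# directly as strength of memory -- no explicit positional encoding needed.
alpha, t = 0.99, 100                      # illustrative decay and read position
for p in (5, 50, 95):
    print(f"written at position {p:>2}: relative weight {alpha ** (t - p):.3f}")
# position  5 -> 0.385, position 50 -> 0.605, position 95 -> 0.951
```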
Hardware and Implementation Details

Under the hood, Kimi Linear uses an optimized Diagonal-Plus-Low-Rank (DPLR) structure for its state transitions. The math here usually involves expensive matrix multiplications that scale poorly. Kimi’s team found a way to tie certain parameters together so the computations could be parallelized efficiently across GPU tensor cores.
- The result: the operator runs almost twice as fast as the previous DPLR versions used in models like RWKV and Mamba. Even better, it avoids the instability those models faced when using half-precision training. It’s not just theoretically faster; it’s hardware-aware.
- This attention mechanism is implemented as an open-source kernel integrated with Flash Linear Attention (fla) and vLLM, which means developers can plug it into existing inference pipelines without rewriting the caching logic.
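To unpack “Diagonal-Plus-Low-Rank” a little: the state transition matrix is never materialized, because a diagonal scale plus a rank-1 correction can be applied with two cheap operations. The sketch below shows a generic DPLR step, not Kimi’s exact parameter-tied form:

```python
import torch

# Generic Diagonal-Plus-Low-Rank (DPLR) state transition:
#     S_new = (diag(a) + u @ w.T) @ S
# Applying it never builds the d_k x d_k transition matrix.
# A generic illustration, not Kimi's exact (parameter-tied) parameterization.
def dplr_step(S, a, u, w):
    """S: (d_k, d_v) state; a, u, w: (d_k,) vectors."""
    return a.unsqueeze(-1) * S + torch.outer(u, w @ S)

d_k, d_v = 8, 4
S = torch.randn(d_k, d_v)
a, u, w = torch.rand(d_k), torch.randn(d_k), torch.randn(d_k)

# sanity check against the dense form
dense = torch.diag(a) + torch.outer(u, w)
assert torch.allclose(dplr_step(S, a, u, w), dense @ S, atol=1e-5)
```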
How It Performs in the Wild

The Kimi Linear 48B model (with 3B active parameters per forward pass via Mixture-of-Experts) outperforms standard full-attention Transformers on a wide range of benchmarks.
On general reasoning tasks like MMLU, BBH, and TriviaQA, it consistently scores higher. In math and coding benchmarks such as GSM8K, AIME 2025, and LiveCodeBench, it holds or beats the full Transformer baselines. And on long-context evaluations (RULER, RepoQA, Frames), it leads by a significant margin.
For reinforcement learning–style fine-tuning, where models learn by optimizing reasoning under reward signals, Kimi Linear trains faster and achieves better convergence. It maintains stability across long sequences, something Transformers often struggle with as their memory precision fades over time.
A Small Example to Ground It

Let’s take a simple reasoning chain like:
“If the key is under the mat and the mat is under the cat, where is the key?”
A Transformer processes all the tokens together, storing every intermediate vector. Kimi Linear, instead, builds a compressed memory as it reads.
- When it sees “key under mat,” it writes an internal mapping of key → mat.
- Then “mat under cat” updates that mapping, relating mat → cat and possibly inferring a new chain key → cat through the memory state.
When you query it later, the memory already contains the composed relation; it doesn’t need to re-scan the entire text.
This kind of compositional recall is exactly what makes Kimi Linear better suited for reasoning-heavy or long-context tasks.
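As a loose analogy only (the real model composes relations in a continuous state matrix, not a symbol table), here is a toy version of that streaming behavior:

```python
# Toy analogy for compositional recall from a streaming, compressed memory.
# The real model does this with continuous vectors; this symbolic version
# only illustrates the behavior described above.
relations = {}                       # the "memory state"

def read(fact: str) -> None:         # update memory as tokens stream in
    subject, _, obj = fact.partition(" under ")
    relations[subject] = obj         # overwrite / strengthen the mapping

def where_is(thing: str) -> list:    # query the composed chain, no re-scan of the text
    path = []
    while thing in relations:
        thing = relations[thing]
        path.append(thing)
    return path

read("key under mat")
read("mat under cat")
print(where_is("key"))               # ['mat', 'cat'] -- the chain was composed as we read
```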
What It Means for AI Developers

For anyone building long-context systems (code interpreters, data analysts, document agents), Kimi Linear changes the constraints.
You no longer need to chunk your data into smaller pieces or rely on external retrieval mechanisms. The model itself can hold an entire corpus in memory and still generate in real time. At the infrastructure level, this reduces GPU costs dramatically. The model’s fixed-size memory footprint means predictable latency and no cache blow-ups.
And because it’s open-sourced with Hugging Face checkpoints and vLLM integration, you can actually deploy it without reinventing tooling. It’s not just a research prototype; it’s ready for production-scale inference.
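If you want to try it, the released checkpoint (linked at the end of this post) loads through the standard Hugging Face transformers interface. A minimal sketch, assuming the repo ships custom modeling code and that you check the model card for the recommended dtype, chat template, and generation settings:

```python
# Minimal loading sketch with Hugging Face transformers.
# Assumes the repo ships custom modeling code (hence trust_remote_code=True);
# dtype, prompt format, and generation settings here are placeholder choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",       # let the checkpoint pick its precision
    device_map="auto",        # spread the 48B (3B-active MoE) weights across GPUs
)

prompt = "Summarize the plot threads of this very long document: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```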
In the simplest terms:

If the Transformer is a librarian who remembers every book word-for-word, Kimi Linear is the one who remembers the ideas. It’s not perfect memory; it’s better memory. And that’s exactly what long-context intelligence needs.
moonshotai/Kimi-Linear-48B-A3B-Instruct · Hugging Face