PagedAttention, SparseAttention, Flash Attention and more
If you’ve ever wondered how transformers manage to “understand” language instead of just reading it word by word, the secret lies in attention. That’s the mechanism that lets the model focus, not evenly, but selectively, on parts of a sentence that matter to a given word.
But before we get buried under fancy names like FlashAttention or PagedAttention, let’s take a walk back to the start. What’s so special about this thing we now call full attention?
The Basics: Full Attention
Imagine you’re reading: “The cat sat on the mat because it was warm.”
When you hit “it,” your brain automatically searches back, “what does ‘it’ refer to?” You don’t think about it, but you’re connecting “it” with “cat” or “mat.” That act of scanning back and weighing possibilities is what attention does in transformers.
Formally, every word (or token, to use the machine’s term) looks at every other word and decides how much to “care” about it. So for “it,” the model might decide:
- “cat” matters 0.7
- “mat” matters 0.3
- “because” and “was” barely matter
This “weighted caring” creates a web of relevance among words, one that helps the model make sense of relationships that simple position or grammar rules can’t express.
That’s the beauty of full attention: every token can talk to every other token. It’s like a group chat where everyone listens to everyone else before replying.
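In code, that web of weighted caring is just a matrix of scores pushed through a softmax. Here’s a minimal NumPy sketch of full (scaled dot-product) attention; the token count and embedding size are toy values chosen purely for illustration.

```python
import numpy as np

def full_attention(Q, K, V):
    """Full (scaled dot-product) attention: every token scores every other token."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) matrix: token i vs. token j
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # each row sums to 1, like the 0.7 / 0.3 split above
    return weights @ V                                 # each output is a weighted mix of all value vectors

# Toy example: 9 tokens, 16-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((9, 16))
print(full_attention(Q, K, V).shape)                   # (9, 16): one context-aware vector per token
```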
But group chats don’t scale well, and neither does full attention.
The Limits of Full Attention
Here’s where things go south. Full attention looks elegant until you actually run it on hardware.
- If you’ve got 1,000 tokens, each token has to look at 999 others, roughly a million interactions. Bump that to 10,000 tokens, and suddenly you’ve got 100 million comparisons. You can hear your GPU fans accelerating like a jet engine.
- The second issue is memory. All those “who-looked-at-whom” scores are stored in giant attention matrices, the kind that eat VRAM for breakfast.
This is why most transformers, even the big ones, choke on long documents. They aren’t dumb; they’re just drowning in their own self-awareness. So researchers started pruning, simplifying, optimizing, anything to make attention scale beyond a few thousand tokens.
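To put rough numbers on this, here’s a quick back-of-the-envelope sketch of how the score matrix alone grows, assuming fp16 scores, a single head, and a single layer (real models multiply this by heads and layers):

```python
# Rough size of the n x n attention-score matrix in fp16 (2 bytes per score),
# for one head of one layer. Real models multiply this by heads x layers.
for n in (1_000, 10_000, 100_000):
    comparisons = n * n
    gigabytes = comparisons * 2 / 1e9
    print(f"{n:>7,} tokens -> {comparisons:>15,} comparisons, ~{gigabytes:,.3f} GB of scores")
```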
Sparse Attention: Look Less, Think Faster
Sparse attention was the first real breakthrough in that optimization war.
The idea is so human it almost sounds obvious: you don’t need to look at everything to understand something.
Take a paragraph. When you read the 5th word, do you really need to reference the 500th? Probably not. You care about your immediate context, what came before and after.
Sparse attention takes that intuition literally. It tells each token: “Look only at your neighbors, maybe ±10 tokens, or a few special ones like [CLS] or [SEP].” This shortcut saves massive compute because you’re no longer comparing everything to everything.
Different models twist this rule differently:
- Longformer uses a sliding window: each token sees a moving slice of context.
- BigBird adds a few “global” tokens so some words act like central hubs.
- Sparse Transformer factorizes attention into fixed strided and local patterns, skipping most token pairs entirely.
Sparse attention doesn’t give you the full picture, but it gives enough of it, fast.
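Here’s a minimal NumPy sketch of that local-window-plus-global-hubs idea. The ±10 window and the single hub token are illustrative choices, not any particular model’s recipe, and real sparse kernels never materialize the blocked entries at all; the dense mask below only exists to make the pattern visible.

```python
import numpy as np

def sparse_mask(n, window=10, global_tokens=(0,)):
    """True where attention is allowed: a ±window neighborhood plus a few hub tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window): i + window + 1] = True   # local neighbors
    for g in global_tokens:                                   # e.g. a [CLS]-style hub token
        mask[g, :] = True
        mask[:, g] = True
    return mask

def masked_attention(Q, K, V, mask):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)                     # blocked pairs vanish after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n = 64
mask = sparse_mask(n)
print(f"{mask.mean():.0%} of token pairs are actually compared")  # roughly a third, instead of 100%
rng = np.random.default_rng(1)
Q = K = V = rng.standard_normal((n, 16))
print(masked_attention(Q, K, V, mask).shape)                      # (64, 16)
```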
FlashAttention: Same Attention, New Engineering
FlashAttention isn’t about what attention does, it’s about how it’s done.
Regular attention wastes time moving data in and out of GPU memory. It’s like trying to bake cookies but running to the pantry for every single ingredient one at a time. GPUs are fast at math but painfully slow at memory transfers.
FlashAttention fixes that by reorganizing the computation. It processes attention in small, perfectly sized blocks that fit into the GPU’s fast local memory. No wasteful back-and-forth, no massive intermediate buffers.
In practice, FlashAttention can be 2–3× faster and use far less memory. That’s why modern open models like Llama 3 and Mistral rely on it (or kernels like it) by default. It’s become the silent optimization that keeps large models practical.
You could say FlashAttention didn’t reinvent attention; it just made it worth running.
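You rarely write FlashAttention by hand; you call a fused kernel that implements it. A minimal PyTorch sketch, assuming a recent PyTorch build and a supported GPU, where scaled_dot_product_attention is free to dispatch to a FlashAttention-style fused backend:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim). Half precision on a GPU is where fused kernels shine.
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch picks the fastest available backend for this call: a FlashAttention-style fused
# kernel when shapes, dtype, and hardware allow it, otherwise a memory-efficient or math fallback.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64]); with a fused backend, the 4096 x 4096
                  # score matrix never has to sit in GPU main memory all at once
```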
PagedAttention: Memory with a System
If FlashAttention made computation faster, PagedAttention made memory smarter.
This one came out of the vLLM project, a system designed for high-throughput inference (basically, serving models efficiently). The problem they faced was simple but brutal: when a model generates long responses, it needs to keep all of its past key/value states (the KV cache) in memory.
Traditional systems reserve those states in one massive contiguous block per request. That works fine… until you run multiple requests at once and memory gets fragmented, over-reserved, or simply runs out.
PagedAttention borrows a trick from operating systems. It splits memory into pages, chunks that can be easily swapped, reused, or moved around.
That means when you’re generating text, you don’t have to reload or recompute everything. You can just flip to the right “page.” This is why vLLM can handle far more concurrent users than other inference engines: not because it changes attention’s math, but because it manages the chaos better.
Think of it as attention with an efficient filing system.
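Here’s a toy sketch of the bookkeeping idea (illustrative only, not vLLM’s actual API): the key/value cache lives in a pool of fixed-size blocks, and each request keeps a small page table of block ids instead of one giant reserved slab.

```python
BLOCK_SIZE = 16  # tokens per block

class KVBlockPool:
    """Toy 'page table': each request maps to a list of block ids inside one shared, pre-allocated KV pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.page_table = {}    # request_id -> list of block ids
        self.num_tokens = {}    # request_id -> tokens cached so far

    def append_token(self, request_id):
        """Reserve cache space for one new token; grab a fresh block only when the last one is full."""
        blocks = self.page_table.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:                       # first token, or the last block just filled up
            blocks.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        return blocks[-1], n % BLOCK_SIZE             # (block id, slot) where this token's K/V would be written

    def finish(self, request_id):
        """Hand every block straight back to the pool for other requests to reuse."""
        self.free_blocks.extend(self.page_table.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

pool = KVBlockPool(num_blocks=1024)
for _ in range(40):                                   # a request that generates 40 tokens...
    pool.append_token("req-A")
print(len(pool.page_table["req-A"]))                  # ...occupies just 3 blocks, not one giant slab
pool.finish("req-A")
```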
Cross-Attention: When Two Worlds Meet
Self-attention deals with one input talking to itself. Cross-attention happens when two different inputs talk to each other.
Example: in an image-captioning model, the text decoder uses cross-attention to look at image embeddings produced by a vision encoder. When it’s generating the word “dog,” cross-attention focuses on the image regions that actually contain the dog.
In multimodal assistants, cross-attention is one way to fold in other inputs: image features from a vision encoder, retrieved documents, or tool outputs, though many chat models simply place that material in the prompt and let self-attention do the blending.
If self-attention is introspection, cross-attention is conversation.
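In PyTorch terms, cross-attention is just attention where the queries come from one stream and the keys and values from another. A minimal sketch, with made-up shapes for a caption decoder attending to image patch embeddings:

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens   = torch.randn(1, 12, d_model)    # decoder states for the caption so far (queries)
image_patches = torch.randn(1, 196, d_model)   # patch embeddings from a vision encoder (keys & values)

# Queries come from the text, keys/values from the image: the text "looks at" the picture.
out, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)
print(out.shape)           # torch.Size([1, 12, 512])
print(attn_weights.shape)  # torch.Size([1, 12, 196]): which patches each word attended to
```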
Other Flavors and Experiments
The creativity around attention never stops. Once full, sparse, and flash attention became standard, researchers began inventing specialized variants to stretch performance or adapt to niche scenarios:
- Sliding Window Attention: tokens only attend within a moving local window. Efficient for streaming text.
- Hierarchical Attention: process small chunks first, then combine their summaries; great for long documents.
- Linear Attention: rewrites the math so computation grows linearly with sequence length. Sounds magical but often loses precision.
- Multi-Query Attention: reduces the number of “key/value” projections, speeding up inference without a huge quality hit (see the sketch after this list).
Each of these is a compromise between accuracy, speed, and memory. Pick your poison depending on what you’re building.
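To make the Multi-Query Attention trade-off concrete, here’s a small sketch: eight query heads share one key head and one value head, so the cache of keys and values shrinks roughly eightfold while the output shape stays exactly the same. The shapes below are illustrative, not taken from any particular model.

```python
import torch

batch, n_q_heads, seq, head_dim = 1, 8, 1024, 64

q = torch.randn(batch, n_q_heads, seq, head_dim)   # 8 query heads, as usual
k = torch.randn(batch, 1, seq, head_dim)           # ...but a single shared key head
v = torch.randn(batch, 1, seq, head_dim)           # ...and a single shared value head (~8x smaller KV cache)

# The single K/V head broadcasts across all query heads; the math is otherwise unchanged.
scores  = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # (1, 8, 1024, 1024)
weights = scores.softmax(dim=-1)
out     = weights @ v                                   # (1, 8, 1024, 64): same output shape as full multi-head
print(out.shape)
```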
Attention isn’t just a computation trick. It’s the bridge between raw data and understanding. And every new version, from Sparse to Flash to Paged, is just another attempt to make that bridge a little faster, a little longer, and maybe someday, as seamless as thought itself.