DeepSeek AI’s new research paper on LLMs
DeepSeek AI has been on fire since the release of DeepSeek-V3 and DeepSeek-R1. Their research on Reinforcement Learning for LLM training even rattled the US stock market.
The DeepSeek team is back with an interesting paper on an improved attention mechanism that helps LLMs handle longer context windows.
What is Native Sparse Attention?
The authors introduce NSA, a natively trainable sparse attention mechanism designed to address the challenges of efficient long-context modelling. NSA employs a dynamic hierarchical sparse strategy that combines coarse-grained token compression with fine-grained token selection, preserving both global context awareness and local precision. The key innovations of NSA are:
Combining Smart Compression and Selection
The model uses shortcuts by grouping similar information (like summarizing chunks of text) and selectively focusing on important parts (like picking key sentences). This reduces the amount of data the model needs to process without losing critical context.
Hardware-Friendly Design
The attention mechanism is optimized to work well with modern computer hardware (like GPUs). It organizes data into efficient chunks, allowing the model to run faster and use less memory.
End-to-End Trainable Sparsity
Unlike previous methods that only sped up inference, this model learns to be sparse during training. This means it can adaptively decide what’s important to focus on while still improving its understanding of language through training.
Balance Between Global and Local Attention
The model uses three parallel branches: one for broad context (like understanding the big picture), one for detailed local information (like focusing on recent details), and one for selected important parts. This lets it handle both long-range and short-range dependencies effectively.
Significant Speedups without Losing Accuracy
Through these optimizations, the model can process very long texts (like entire books or code repositories) much faster than traditional methods, while maintaining or even improving performance on various language tasks.
Overall Framework
The Native Sparse Attention (NSA) model uses a smart way to handle information so that it can process very long texts or conversations without slowing down. Here’s how it works in simple terms:
1. Three Attention Paths
Imagine the model looks at a long story and uses three lenses to understand it:
Compressed Lens (Global View): It quickly summarizes parts of the story, capturing the main ideas. This gives a high-level understanding.
Selected Lens (Important Details): It picks out key sentences or moments in the story that are crucial for the plot or context.
Sliding Lens (Recent Context): It focuses on the recent parts of the story, like the last few sentences, to stay updated on what’s happening now.
2. Combining the Views
The model doesn’t just pick one lens; it looks through all three at the same time. This helps it understand both the big picture and the small details without missing important parts.
3. Learning What’s Important
Like a detective, the model learns what parts of the story are most important as it processes more text. It uses this knowledge to decide where to focus its attention, making it more efficient.
4. Hardware Optimization
The model is designed to work well with computer hardware, much like a well-organized filing system that makes it easy to find what you need. This means it can process information faster and use less memory, even when handling very long texts.
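To make this three-lens framework concrete, here is a minimal PyTorch sketch of how the branch outputs could be blended per token. It assumes each branch has already produced its own attention output and that the gate values come from a small learned network over the query (not shown); function and variable names are illustrative, not the paper's code.

```python
import torch

def nsa_combine(out_cmp: torch.Tensor, out_slc: torch.Tensor,
                out_win: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
    """Blend the compressed, selected, and sliding-window branch outputs.

    out_cmp / out_slc / out_win: (seq_len, d_model) outputs of the three branches.
    gates: (seq_len, 3) per-token gate values in [0, 1], e.g. from a sigmoid MLP.
    """
    g_cmp, g_slc, g_win = gates.unbind(dim=-1)          # one gate per branch, per token
    return (g_cmp.unsqueeze(-1) * out_cmp               # global summary view
            + g_slc.unsqueeze(-1) * out_slc             # selected important blocks
            + g_win.unsqueeze(-1) * out_win)            # recent local context
```

The key point is that the gates are learned end-to-end, so the model itself decides, token by token, how much to trust the big-picture summary versus the fine-grained details.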
Algorithm Design
Token Compression:
Imagine you have a long line of words or pictures (tokens). Instead of looking at each one individually, you group them into small blocks. Then, you create a “summary” for each block. This summary helps the model understand the main idea of that block without having to process every single word or picture. It’s like creating a chapter summary in a book. By doing this, the model can work with less information but still grasp the big picture.
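As a rough illustration only (the paper learns the compression with an MLP over strided, possibly overlapping blocks), block-level summaries could be built like this, using non-overlapping blocks and simple mean pooling as a stand-in:

```python
import torch

def compress_tokens(keys: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Turn a long sequence of key vectors into one summary vector per block.

    Simplified stand-in for NSA's learned compression: non-overlapping blocks,
    mean pooling instead of a learnable MLP.
    keys: (seq_len, d_model)  ->  returns (num_blocks, d_model)
    """
    seq_len, d_model = keys.shape
    num_blocks = seq_len // block_size                   # drop any ragged tail for simplicity
    blocks = keys[:num_blocks * block_size].reshape(num_blocks, block_size, d_model)
    return blocks.mean(dim=1)                            # one "chapter summary" per block
```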
Token Selection:
Once the blocks are summarized, the model picks only the most important blocks to focus on. It’s like skimming through a book and only reading the highlighted parts. The model uses a smart way to decide which blocks are crucial based on how much attention they are getting. This helps the model work faster by not wasting time on less important details.
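Continuing the sketch above (a hypothetical helper, assuming non-overlapping blocks so block i covers tokens i*block_size to (i+1)*block_size), block selection could look like this:

```python
import torch

def select_block_tokens(query: torch.Tensor, block_summaries: torch.Tensor,
                        keys: torch.Tensor, block_size: int = 32,
                        top_k: int = 4) -> torch.Tensor:
    """Return the raw tokens of the top-k most relevant blocks for one query.

    query: (d_model,), block_summaries: (num_blocks, d_model), keys: (seq_len, d_model)
    """
    scores = block_summaries @ query                     # how much attention each block "gets"
    k = min(top_k, scores.shape[0])
    top_blocks = scores.topk(k).indices                  # keep only the highlighted chapters
    selected = [keys[b * block_size:(b + 1) * block_size] for b in top_blocks.tolist()]
    return torch.cat(selected, dim=0)                    # fine-grained tokens to attend to
```

Because the block scores fall out of the compressed branch's attention, this selection step adds very little extra computation.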
Sliding Window:
Even when the model is looking at large blocks of information, it also keeps an eye on the most recent parts. It’s like reading a long paragraph but paying special attention to the last few sentences. This ensures the model doesn’t miss anything that’s happening right now and can better understand the current context.
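The sliding-window branch boils down to a causal local-attention mask. A minimal sketch (the window size here is illustrative; the paper fixes a local window for this branch):

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 512) -> torch.Tensor:
    """Boolean mask: position i may attend to position j only if j is in the past
    and within the last `window` tokens."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                # no peeking at future tokens
    recent = (idx[:, None] - idx[None, :]) < window      # only the most recent `window` tokens
    return causal & recent                               # (seq_len, seq_len), True = allowed
```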
Kernel Design
The kernel design for NSA is interesting, but first, let's cover the basics.
What is a Kernel?
Imagine the GPU (graphics processing unit) as a super-fast worker in a factory. A kernel is like the set of instructions given to this worker to perform specific tasks, such as assembling parts or solving calculations. In computing, a kernel is a small program that runs on the GPU to efficiently process large amounts of data in parallel.
Kernel Design in NSA (Native Sparse Attention)
The NSA kernel design is like organizing this super-fast worker to handle tasks in the most efficient way possible:
Group-Centric Data Loading:
Think of the worker needing to get tools from different toolboxes. Instead of fetching tools one by one, the worker is told to grab entire sets of tools (data) that belong to the same group. This way, the worker spends less time moving between toolboxes and more time assembling parts (processing data).
Shared KV Fetching:
KV stands for “key-value,” which are parts of the data the worker needs. Instead of each worker getting their own set of keys and values, they share a common set. This is like multiple workers at a construction site sharing a common supply of materials, reducing the need to constantly retrieve new supplies (saving time and effort).
Outer Loop on Grid:
The tasks are divided into a grid, like a game board. The worker can move efficiently across different parts of the grid, completing similar tasks (like painting tiles) all at once. This helps the worker avoid confusion and work systematically through the entire grid (processing all necessary data).
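The real NSA kernel is written in Triton for the GPU; the plain-PyTorch sketch below (with hypothetical shapes) only illustrates the group-centric loading idea: every query head in a GQA group attends to the same selected K/V tiles, so those tiles need to be fetched once per group instead of once per head.

```python
import torch

def grouped_attention_step(q_group: torch.Tensor, k_sel: torch.Tensor,
                           v_sel: torch.Tensor) -> torch.Tensor:
    """Attend all query heads of one GQA group to a shared set of selected K/V.

    q_group: (heads_per_group, d_head) - all heads in the group, loaded together
    k_sel / v_sel: (n_selected_tokens, d_head) - fetched once, reused by every head
    """
    scale = k_sel.shape[-1] ** -0.5
    scores = (q_group @ k_sel.T) * scale                 # one matmul for the whole group
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_sel                               # (heads_per_group, d_head)
```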
Experiments
The authors evaluate NSA on a 27B-parameter transformer backbone pretrained with 260B tokens. The performance is assessed across general language evaluations, long-context evaluations, and chain-of-thought reasoning evaluations.
General Evaluation
NSA achieves superior or comparable performance to full attention models across various benchmarks, including MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. Notably, NSA demonstrates significant gains in reasoning-related benchmarks, such as DROP (+0.042) and GSM8K (+0.034).
Long-Context Evaluation
NSA achieves perfect retrieval accuracy in the 64k-context needle-in-a-haystack test, showcasing its ability to maintain global awareness and local precision. On LongBench, NSA outperforms all baselines, including Full Attention, with an average score of 0.469.
Chain-of-Thought Reasoning Evaluation
NSA demonstrates better performance than Full Attention in acquiring chain-of-thought mathematical reasoning abilities via post-training. On the AIME benchmark, NSA-R achieves significantly higher accuracy than Full Attention-R under both 8k and 16k context settings.
Efficiency Analysis
NSA achieves substantial speedups over Full Attention, with up to 9.0× forward and 6.0× backward speedup at 64k context length during training. In decoding, NSA achieves up to 11.6× speedup at 64k context length, primarily due to reduced memory access volume.
NSA vs. Full-Attention Transformers
NSA (Native Sparse Attention) outperforms full-attention Transformer baselines through:
- Efficient Long-Context Handling: NSA uses hierarchical sparse strategies (compression, blockwise selection, and sliding windows) to process long sequences efficiently, unlike the baseline’s resource-intensive full attention.
- Hardware Alignment: Optimized kernels leverage modern GPU architecture (Tensor Cores, memory access patterns) for faster computation, reducing latency in both training and inference.
- End-to-End Trainability: NSA’s design allows the model to learn optimal sparse patterns during training, unlike baselines which often train with full attention and apply sparsity post-hoc.
- Balanced Attention Mechanisms: Combines global context (compression), local details (sliding windows), and dynamic block selection, enabling the model to capture diverse patterns effectively.
- Reduced Memory Footprint: By focusing on key tokens and blocks, NSA minimizes memory usage, making it feasible to handle 64k+ sequences that strain baseline models.
- Superior Performance: Maintains or exceeds full attention’s performance in benchmarks (MMLU, BBH, GSM8K) while achieving massive speedups, especially in lengthy contexts.
But I'd argue standard full-attention Transformers still have an edge over NSA in a few areas:
- Ease of Implementation: Transformers use straightforward full attention mechanisms, simplifying setup and deployment compared to NSA’s complex hierarchical structure.
- Legacy System Compatibility: Traditional Transformers integrate better with existing software and hardware optimized for their architecture, avoiding the need for specialized sparse kernels.
- Non-Sparse Data Patterns: For sequences with uniformly distributed attention (e.g., some specialized data processing tasks), full attention may capture all relationships more effectively than sparse mechanisms.
- Interpretability: Full attention matrices provide transparent insights into token interactions, which may be obscured by NSA’s compressed/sparse mechanisms in some research contexts.
- Precision in Short Contexts: For simple tasks with short sequences, the Transformer’s full attention can process all tokens directly without the overhead of NSA’s hierarchical steps.
Parting words,
DeepSeek AI’s Native Sparse Attention (NSA) brings a major leap in handling long-context sequences efficiently. By combining smart token compression, dynamic selection, and a hardware-optimized design, NSA significantly speeds up processing without sacrificing accuracy. It outperforms traditional transformers in long-context tasks, reasoning, and efficiency while maintaining strong performance across benchmarks. However, standard transformers still have advantages in ease of implementation, legacy compatibility, and short-context precision. NSA is a promising step forward, especially for applications requiring extended context windows and faster computation.