What is DeepSeek DeepEP?
DeepSeek Open Source Week, Day 2
After releasing FlashMLA yesterday, DeepSeek has released another advanced library, this one designed for Mixture-of-Experts (MoE) LLMs, called DeepEP.
DeepEP may not be useful to everyone, but it is worth a read.
In case you missed it, FlashMLA was covered in yesterday's post.
Before we jump onto DeepEP, we need to understand two key topics
What is Mixture of Experts?
In generative AI, Mixture of Experts (MoE) is a model architecture that uses multiple specialized “expert” sub-models to handle different tasks. Instead of using a single large model for everything, the MoE model selectively activates a subset of experts based on the input, making it more efficient by using only the most relevant experts for each task. This approach helps in scaling up models while maintaining computational efficiency.
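The routing idea can be sketched in a few lines. Below is a minimal, framework-free illustration (NumPy, with made-up sizes) of top-2 gating: each token gets a score for every expert, and only the two highest-scoring experts are activated for that token. This is a generic MoE sketch, not DeepSeek's actual gating code.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, hidden, num_experts, top_k = 4, 8, 6, 2

x = rng.standard_normal((num_tokens, hidden))        # token representations
gate_w = rng.standard_normal((hidden, num_experts))  # gating network weights

logits = x @ gate_w                                  # one score per (token, expert)
# softmax over experts gives routing probabilities
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# pick the top-2 experts per token; only those experts run for that token
top_experts = np.argsort(-probs, axis=-1)[:, :top_k]
top_weights = np.take_along_axis(probs, top_experts, axis=-1)

for t in range(num_tokens):
    print(f"token {t} -> experts {top_experts[t].tolist()}, "
          f"weights {np.round(top_weights[t], 3).tolist()}")
```

Because only `top_k` of the `num_experts` experts run per token, total compute stays roughly constant even as you add more experts, which is the scaling trick MoE relies on.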
What is Expert Parallelism?
Expert Parallelism (EP) is a technique for scaling Mixture of Experts (MoE) models in which the experts are distributed across multiple devices (typically GPUs). Each token is routed to the device that hosts its assigned experts (the dispatch step), processed there, and the outputs are then sent back and merged in the original order (the combine step). By spreading the experts across devices, EP lets large-scale models use many experts without any single device having to hold, or compute, all of them.
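In code, expert parallelism boils down to those two collective steps, dispatch and combine. The toy sketch below simulates them with plain Python dictionaries standing in for GPUs; a real implementation (and DeepEP itself) does this with GPU-to-GPU all-to-all communication.

```python
# Toy expert-parallel routing: "ranks" are simulated with dicts, not real GPUs.
NUM_RANKS = 4                      # one expert per "GPU" for simplicity

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
assignments = [2, 0, 2, 1, 3, 0]   # expert chosen by the gate for each token

# --- dispatch: send each token to the rank that hosts its expert ---
inbox = {rank: [] for rank in range(NUM_RANKS)}
for idx, (tok, expert) in enumerate(zip(tokens, assignments)):
    inbox[expert].append((idx, tok))

# --- each "expert" processes its local batch ---
def expert_fn(rank, tok):
    return f"{tok}@expert{rank}"   # placeholder for the expert's real computation

outbox = {rank: [(idx, expert_fn(rank, tok)) for idx, tok in batch]
          for rank, batch in inbox.items()}

# --- combine: send results back and restore the original token order ---
combined = [None] * len(tokens)
for rank, results in outbox.items():
    for idx, out in results:
        combined[idx] = out

print(combined)
# -> ['t0@expert2', 't1@expert0', 't2@expert2', 't3@expert1', 't4@expert3', 't5@expert0']
```

The interesting (and hard) part in practice is exactly what this sketch glosses over: moving those dispatch and combine buffers between GPUs quickly, which is the problem DeepEP exists to solve.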
What is DeepSeek DeepEP?
DeepEP is a library designed to help speed up and improve communication between computers (or GPUs) when working on complex machine learning tasks, especially those involving Mixture-of-Experts (MoE) models. These models use several “experts” (specialized sub-models) to handle different parts of a problem, and DeepEP makes sure the data moves between these experts quickly and efficiently.
What Does DeepEP Do?
Imagine you’re working on a big project with multiple teammates (the experts) in different rooms, and you need to pass information back and forth. To make the work faster and smoother, DeepEP acts like a super-fast delivery service that:
Quickly sends data between different experts working on different parts of the problem.
Makes sure everything is well-organized so no expert is overloaded, and no data gets lost.
Helps in handling large-scale tasks by splitting the workload and ensuring everything runs smoothly even when the work is huge.
DeepEP is like a smart traffic manager for data in ML systems, ensuring that all the experts get their data on time and can work together without delays, making the system more efficient and faster.
Key Examples of What DeepEP Does:
Sending Data Between Experts:
Imagine you have a large dataset, and you want to use different models (or experts) to process different parts of it. DeepEP will send the data to the right expert at the right time so they can do their job without waiting around or causing delays.
For example, if you’re working on a machine learning model that understands text, DeepEP ensures that different experts handle tasks like text translation, sentiment analysis, and keyword extraction at the same time without crashing into each other.
Handling Large-scale Operations:
If you’re training a machine learning model on many GPUs (powerful processors), you need to transfer data between these GPUs. DeepEP optimizes how data moves between them, making sure the data flows quickly and smoothly. This is especially useful when dealing with RDMA (a system that helps move data directly between computers) and NVLink (a fast way for GPUs to talk to each other).
For example, if you’re training a model that takes in millions of pieces of data, DeepEP ensures that the data is passed around efficiently, so everything runs faster.
Minimizing Latency:
Latency is the delay between sending a request and receiving a response. DeepEP includes tools to reduce this latency, which is especially useful when you’re making real-time predictions. For example, in a video streaming application where you want to predict what the next frame might look like, DeepEP ensures that data flows quickly to make predictions almost instantly.
Let’s consider a simple example
Imagine you’re running a large-scale online store. You have many experts (models) working on different tasks:
One expert handles product recommendations.
Another expert analyzes user reviews.
Another expert predicts sales trends.
DeepEP helps the data move smoothly between these experts, ensuring they get the right information at the right time to give accurate results fast. It also helps them work together efficiently, ensuring that they don’t delay each other or overload their resources.
Now that you have a sense of what DeepEP is, let's talk about some technical details (you can skip this section).
DeepEP Technical details
High-Throughput and Low-Latency Kernels:
Supports MoE dispatch (sending data to different experts) and combine (merging outputs) with low latency.
Optimized for both NVLink and RDMA communications to improve data transfer speed.
Optimized for Group-Limited Gating Algorithm:
Uses specialized kernels for asymmetric-domain bandwidth forwarding, meaning it efficiently handles data transfer between different hardware domains (like from NVLink to RDMA, which are both interconnect technologies).
Latency-Sensitive Inference:
Includes low-latency kernels that use RDMA for inference tasks to minimize delays during data processing.
Uses a hook-based method that lets communication and computation overlap without occupying GPU compute resources such as SMs (Streaming Multiprocessors).
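The hook-based overlap idea can be illustrated with ordinary Python threads: start the "communication" in the background, do independent "computation" in the meantime, and only block at the point where the transferred data is actually needed. This is only an analogy for the concept; DeepEP does this on the GPU, with real RDMA transfers and without consuming SMs.

```python
import threading
import time

def fake_transfer(buffer):
    """Stand-in for an RDMA transfer running in the background."""
    time.sleep(0.05)
    buffer["data"] = [1, 2, 3]

buffer = {}
comm = threading.Thread(target=fake_transfer, args=(buffer,))
comm.start()                       # communication starts asynchronously

partial = sum(i * i for i in range(1000))   # independent computation overlaps

def hook():
    """Block only at the point where the transferred data is needed."""
    comm.join()
    return buffer["data"]

received = hook()
print(partial, received)           # both the compute and the transfer are done
```

The payoff is that the time spent waiting on the network is hidden behind useful work, which is exactly what a latency-sensitive inference loop wants.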
Performance Testing:
Tested on H800 GPUs with CX7 InfiniBand 400 Gb/s RDMA network cards, showing high throughput for both dispatch and combine across a range of expert-parallelism group sizes and network bandwidths.
RDMA and NVLink Integration:
Supports RDMA (Remote Direct Memory Access) for fast data transfer across different nodes and NVLink for intra-node communication, making it highly efficient for distributed machine learning tasks.
Traffic Isolation and Adaptive Routing:
Uses Virtual Lanes (VL) in InfiniBand to separate different types of traffic, ensuring workloads don’t interfere with each other.
Supports adaptive routing to avoid network congestion, though it’s currently limited to low-latency kernels.
Congestion Control:
Congestion control is disabled as there’s no significant congestion observed in the production environment, simplifying deployment.
Compatibility:
Works with InfiniBand networks and is theoretically compatible with RDMA over Converged Ethernet (RoCE).
Jargon explained
FP8: A low-precision floating-point format with 8 bits, which is used to speed up computations and reduce memory usage at the cost of some precision.
RDMA (Remote Direct Memory Access): A technology that allows data to be transferred directly between the memory of two computers without involving the CPU, improving speed and reducing latency.
NVLink: A high-bandwidth, energy-efficient interconnect technology developed by NVIDIA to connect GPUs and accelerate data transfer.
SM (Streaming Multiprocessors): These are the basic processing units in a GPU that handle the majority of computational tasks.
Virtual Lanes (VL): Part of InfiniBand’s networking technology, where traffic is segregated into different logical channels to prevent interference between different types of traffic.
Adaptive Routing: A network routing feature that dynamically adjusts the path of data to avoid congestion, improving overall performance.
The code can be explored below:
GitHub – deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library
It may not be of use to everyone, but do give it a read. Thanks!
What is DeepSeek DeepEP? was originally published in Data Science in your pocket on Medium.