What is GSPO? The RL algorithm used to train Qwen3

How GSPO differs from the GRPO reinforcement learning algorithm

You remember DeepSeek? The model that took the internet by storm in early 2025. Do you know why it went viral? Because of a new reinforcement learning technique called GRPO, which reduced the computational cost of RL fine-tuning by a huge margin.

Reading this blog first is a must for understanding this post:

What is GRPO? The RL algorithm used to train DeepSeek

Training LLMs using reinforcement learning has become a common recipe for getting models that don’t just autocomplete text but can reason, follow complex instructions, and stay aligned with human preferences. This is how models like ChatGPT, Claude, and Qwen reach their final fine-tuned forms.

Training LLMs with RL is difficult

But here’s the thing: actually applying RL at scale is a mess, especially when you’re dealing with very large models (70B+ parameters) or sparsely activated ones like MoEs (Mixture-of-Experts), where different parts of the model are active in different forward passes. The problem isn’t RL itself; it’s how we’ve been bolting it onto LLMs.

A big chunk of the mess comes from how current methods try to squeeze a reward, which is usually given for the whole output, into a token-by-token update process. The math doesn’t really line up. You’re asking the model to learn at a level of granularity that the reward doesn’t actually care about.

GSPO (Group Sequence Policy Optimization) is a method from the Qwen team that directly addresses that mismatch. Instead of tweaking every token like it’s a mini-decision, GSPO steps back and says: wait, what if we just train on full sequences, like how the reward actually works?

Let’s dig into what this really means and why it matters. Before we jump into GSPO, let’s have a quick recap of the PPO algorithm alongside GRPO.

What is PPO?

Pretty much all RL fine-tuning of LLMs today is built around Proximal Policy Optimization (PPO). PPO is like a cautious optimizer. Instead of letting the new model go wild with updates, it keeps it “proximal”, not too far from the old model. This stops catastrophic forgetting and model collapse.

PPO does this by:

  • Using a value function to estimate how good each token is in a generated response.
  • Comparing the old and new policy (model) and clipping the update if it changes too much.

In other words: it nudges the model gently in the direction of better outputs, but not too fast.
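The clipped update described above can be sketched in a few lines. This is a minimal single-token illustration, not a full PPO implementation; the epsilon value of 0.2 is just a common default:

```python
import math

def ppo_token_update(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate for a single token.

    logp_new / logp_old: log-probabilities of the token under the new
    and old policies; advantage: what the value function estimates.
    """
    ratio = math.exp(logp_new - logp_old)        # importance ratio
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # keep the update "proximal"
    # PPO takes the pessimistic (minimum) of the two surrogate terms
    return min(ratio * advantage, clipped * advantage)

# A token whose probability doubled, with a positive advantage:
# the raw ratio of 2.0 gets clipped down to 1.2
signal = ppo_token_update(math.log(2e-4), math.log(1e-4), 1.0)
```

Note how the clip caps how far a single update can push the policy, which is exactly the "cautious optimizer" behavior described above.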

Sounds reasonable. But under the hood, PPO brings its own baggage.

That value function, a second neural net, often doesn’t work well for long texts. It tries to assign value to individual tokens, but language isn’t a sum of parts in a clean way. A single word can flip the sentiment or meaning of a sentence. Assigning partial credit this way is like trying to grade a movie scene by how good each frame looks in isolation.

This leads to noisy or misleading gradients. The model gets confused about which tokens were “responsible” for the reward, and you end up with weird learning dynamics. Especially in longer outputs.

What GRPO Tried to Fix

DeepSeek proposed GRPO (Group Relative Policy Optimization) to get rid of the value model entirely. Instead of trying to estimate token values, it directly compares different responses to the same prompt.

Imagine you give a model a prompt like: “Explain black holes to a 10-year-old.”

It generates three different completions. Human raters (or some other feedback source) score them based on how good or helpful they are. Then GRPO does something like:

Figure out which responses were better than average.

Use that difference as a learning signal.

Update the model accordingly.
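The three steps above boil down to a group-relative advantage. Here is a minimal sketch of that computation (an illustration, not DeepSeek’s actual implementation):

```python
def group_relative_advantages(rewards):
    """GRPO's core idea: score each response relative to its group,
    using the group's mean and (sample) standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / (len(rewards) - 1)
    std = var ** 0.5
    return [(r - mean) / std for r in rewards]

# Three completions for the same prompt, scored by some reward source:
advs = group_relative_advantages([0.8, 0.4, 0.6])
# approximately [+1.0, -1.0, 0.0]: better than average, worse, and average
```

Responses above the group mean get a positive signal, those below get a negative one, with no value network in sight.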

No value head. Just relative comparison. But even here, GRPO still keeps the update granularity at the token level, which means:

It still calculates an importance weight for every token.

It still clips updates token-by-token.

It still distributes the learning signal across the sequence, rather than treating it as a whole.

This might sound okay until you try it at scale.

Because when you start using long outputs, or large, sparsely activated MoE models, this token-level noise gets worse. Tokens compete against each other for credit. Different experts in an MoE might be activated inconsistently across token updates. You end up with blurry, unstable learning.

Sometimes the model just collapses. Like… permanently. Even rolling back to a checkpoint doesn’t help because the gradients poisoned the expert routes or embeddings.

What GSPO Does Differently

GSPO takes a sharp turn. It says: let’s stop pretending the reward is about tokens. It’s not. It’s about sequences.

Here’s the big shift:

  • Optimization happens at the sequence level, not the token level.
  • Importance ratios (the key quantity in PPO-type updates) are calculated over the whole output.
  • Clipping is also done at the level of full sequences.
  • All tokens in a sequence get treated equally during backpropagation.

Why is this cleaner? Because the reward, the signal that tells the model whether it did a good job, is already being applied to the full sequence. You don’t get partial feedback like “the middle of this answer was a 7/10 but the ending was a 3/10.” You get one reward. So you should update the model in a way that reflects that.

It also solves another subtle problem: sequence length bias. Without normalization, longer outputs might get penalized more just because they accumulate more clipped gradients or instability. GSPO corrects for that too, it normalizes by sequence length to keep things fair.
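The length-normalized, sequence-level importance ratio can be sketched like this (computed in log space for numerical stability; the token log-probs in the example are made up for illustration):

```python
import math

def gspo_sequence_ratio(logp_new_tokens, logp_old_tokens):
    """Sequence-level importance ratio with length normalization:
    (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|).
    Dividing the summed log-ratio by sequence length keeps long
    outputs from being penalized just for having more tokens."""
    n = len(logp_new_tokens)
    log_ratio = sum(logp_new_tokens) - sum(logp_old_tokens)
    return math.exp(log_ratio / n)

# A 4-token sequence whose tokens each became slightly more likely:
ratio = gspo_sequence_ratio([-1.0, -1.0, -1.0, -1.0],
                            [-1.2, -1.2, -1.2, -1.2])
```

One ratio per sequence, instead of one per token, is the whole trick: clipping and the advantage are then applied to this single number.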

A Simple Walkthrough: How GSPO Works

Let’s make it concrete.

You give the model this prompt:
“Write a poem about the ocean.”

The model generates 3 completions:

  1. The waves crash gently on the shore… Reward = 0.8
  2. Oceans deep and full of dread… Reward = 0.4
  3. Blue vastness stretches far and wide… Reward = 0.6

GSPO starts by computing how each sequence scores relative to the group.

  • Mean = (0.8 + 0.4 + 0.6) / 3 = 0.6
  • Std Dev (sample) = 0.2

Then it computes a normalized advantage for each:

  • First: (0.8 - 0.6) / 0.2 = +1.0
  • Second: (0.4 - 0.6) / 0.2 = -1.0
  • Third: (0.6 - 0.6) / 0.2 = 0.0

Now, we get the importance ratio, the probability of that full sequence under the new model, divided by the probability under the old one.

Let’s say:

Old model gave sequence 1 a likelihood of 1e-4

New model gives it 2e-4 → Importance ratio = 2.0

GSPO clips that ratio (just like PPO would), then multiplies it by the normalized advantage (+1.0 in this case), and uses that as the learning signal for the whole sequence.
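Putting the whole walkthrough together in code, using the numbers from the example above (the clipping range of 0.2 here is a hypothetical choice, not a value from the paper):

```python
# Rewards for the three poem completions
rewards = [0.8, 0.4, 0.6]
mean = sum(rewards) / len(rewards)                         # 0.6
std = (sum((r - mean) ** 2 for r in rewards) / 2) ** 0.5   # 0.2 (sample std)
advantages = [(r - mean) / std for r in rewards]           # ~[+1, -1, 0]

# Sequence 1: old likelihood 1e-4, new likelihood 2e-4
ratio = 2e-4 / 1e-4                           # 2.0
eps = 0.2                                     # hypothetical clipping range
clipped = max(min(ratio, 1 + eps), 1 - eps)   # 2.0 clipped down to 1.2
signal = clipped * advantages[0]              # one signal for the whole sequence
```

Every token in sequence 1 gets updated with that single signal; there is no per-token slicing anywhere in the computation.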

No token-level slicing. No reward redistribution. No routing replay needed for MoEs. It just works.

GRPO vs. GSPO

Here’s what GSPO really fixes, in practice:

  • Optimization Unit
    GRPO: token-level updates
    GSPO: sequence-level updates that match reward granularity
  • Clipping Strategy
    GRPO: clips each token independently
    GSPO: clips at the sequence level, giving cleaner gradients
  • Gradient Assignment
    GRPO: assigns separate gradient weights to every token
    GSPO: treats the whole sequence equally
  • Handling MoEs (Mixture-of-Experts)
    GRPO: needs tricks like Routing Replay to avoid expert collapse
    GSPO: needs no such tricks, it avoids token-level gradient noise entirely
  • Risk of Collapse
    GRPO: high, model often becomes unstable, especially on long outputs
    GSPO: low, clean, stable updates reduce that risk
  • Training Efficiency
    GRPO: noisy signals mean wasted compute
    GSPO: trains faster on what matters, even with more aggressive clipping

Why This Matters

GSPO isn’t just a theoretical cleanup. It’s a pragmatic fix to a very real and growing problem: training large language models with reinforcement learning is brittle. A single unstable update can blow up weeks of compute. And debugging collapsed experts in MoEs is like chasing ghosts.

By aligning the optimization level (sequence) with the reward level (also sequence), GSPO removes a lot of noise. It simplifies the pipeline. It makes RL training feel less like walking on a tightrope.

And this isn’t just a paper idea: the Qwen team used GSPO to train their latest LLMs, including Qwen3. The method has already proven itself at scale.

Should You Use GSPO?

If you’re training with human feedback or preference models, or if you’re seeing training collapse in MoE settings, then yes, GSPO is very likely what you’re missing. You’ll still need all the other RLHF infrastructure (prompt generation, reward modeling, etc.), but GSPO gives you a better optimizer core.

It’s rare in machine learning that a new method is not only cleaner but also more stable and more efficient. GSPO might just be one of those rare things.


What is GSPO? The RL algorithm used to train Qwen3 was originally published in Data Science in Your Pocket on Medium.
