Tencent SRPO: A Smarter Way to Train Text-to-Image AI Models

What is Tencent SRPO?


When you train text-to-image models and try to make them “follow human preferences,” something strange happens: they start cheating.

You ask for more realistic photos, and the model finds shortcuts, like making images overexposed or oversharpened, that trick the scoring system but look worse to humans. This is called reward hacking.

Tencent’s SRPO (Semantic-Relative Preference Optimization) is a method to fight exactly that. It doesn’t change the model architecture. It changes how the model is fine-tuned with rewards, so it learns genuine improvements instead of gaming the system.

What’s the problem with older methods?

Earlier approaches like ReFL or DRaFT usually do one of two things:

  1. Only optimize the last step of image generation. That’s faster but easy to hack because the model just polishes the final image in a way the reward likes.
  2. Backpropagate through the whole chain of steps. That’s slow and unstable: imagine pushing gradients through 50 denoising steps; the math just blows up.

Both are flawed: one produces hacked images; the other eats too much compute.

The Direct-Align trick

Diffusion models work by gradually adding noise to an image and then learning how to remove it. SRPO uses a neat math shortcut: if you know the noise you added, you can directly recover the clean image from a noisy one.

Why does this matter? Because it means you can check rewards at any point in the denoising process, even when the image is still very blurry. And you don’t have to run the whole chain or risk unstable gradients.

This lets the model learn from early steps, where structure and realism are formed, not just the final polish.
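The inversion itself is one line of algebra. A minimal sketch, assuming the linear flow-matching mix used by FLUX-style models (x_t is a blend of the clean image and the noise you injected yourself):

```python
import numpy as np

def recover_clean_image(x_t, noise, sigma_t):
    """One-step recovery of the clean image from a noisy latent.

    Assumes the flow-matching interpolation x_t = (1 - sigma_t) * x_0
    + sigma_t * noise. Because we injected `noise` ourselves, we can
    invert exactly at any stage, with no iterative denoising and no
    gradients pushed through the whole sampling chain.
    """
    return (x_t - sigma_t * noise) / (1.0 - sigma_t)

# toy check: noise an "image", then recover it exactly
rng = np.random.default_rng(0)
x_0 = rng.standard_normal((3, 8, 8))
noise = rng.standard_normal((3, 8, 8))
sigma_t = 0.7  # a mid-to-early (quite noisy) stage
x_t = (1 - sigma_t) * x_0 + sigma_t * noise
x_0_hat = recover_clean_image(x_t, noise, sigma_t)
```

Because the recovery is exact even at high noise levels, a reward model can score the recovered image at early stages, which is what makes early-step optimization practical.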

Semantic-Relative Preference: rewards by comparison

The second big idea is how they calculate rewards. Instead of asking a reward model “is this image good?” and taking the score, they ask two questions:

  • How good is the image with a positive cue (like “realistic”)?
  • How good is the same image with a negative cue (like “cartoon”)?

Then they take the difference between the two.

Why is this better?

  • It pushes the model toward the good direction and away from the bad one at the same time.
  • It cancels out a lot of weird biases in reward models.
  • It’s more stable, so the model doesn’t just exploit quirks of a single reward signal.

Think of it like training with both “do more of this” and “do less of that,” instead of only one side.
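The comparison itself is just a difference of two scores. A minimal sketch, where `reward_model` stands in for any text-conditioned scorer and the cue formatting is an assumption, not the paper's exact API:

```python
def semantic_relative_reward(image, prompt, reward_model,
                             pos_word="realistic", neg_word="cartoon"):
    """Score an image by the *difference* between a positive-cue and a
    negative-cue reward, instead of one absolute score.

    `reward_model(image, text)` is a hypothetical text-conditioned
    reward scorer; the control words are just prompt tweaks, so no new
    dataset or reward model is needed.
    """
    r_pos = reward_model(image, f"{pos_word}, {prompt}")
    r_neg = reward_model(image, f"{neg_word}, {prompt}")
    return r_pos - r_neg

# toy stand-in scorer: counts how many of the image's "tags" appear in the text
def toy_reward_model(image_tags, text):
    return sum(tag in text for tag in image_tags)

photo_tags = {"realistic", "portrait"}
score = semantic_relative_reward(photo_tags, "portrait", toy_reward_model)
```

Any bias the scorer applies to both cues (its quirks, its scale) cancels in the subtraction, which is why the signal is harder to hack than a single absolute score.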

Training details

  • They sample images at multiple noisy stages, not just one, and give more weight to early stages.
  • They add a reconstruction regularizer to keep the model from drifting too far.
  • They train with 25 steps and evaluate with 50 steps.
  • On 32 H20 GPUs, training converges in about 10 minutes, much faster than earlier methods (they claim ~75× faster than DanceGRPO).
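Putting the pieces together, a per-stage objective might look like the toy sketch below. The stage-weighting decay, the regularizer weight `lam`, and the overall form are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def srpo_step_loss(pred_x0, target_x0, reward_diff, step_frac, lam=0.1):
    """Toy per-stage training objective.

    - `reward_diff`: the semantic-relative reward (positive minus
      negative cue) for the image recovered at this stage; maximized,
      hence the minus sign.
    - the L2 term is a reconstruction regularizer that keeps the model
      from drifting too far; `lam` is an assumed weight.
    - `step_frac` in [0, 1]: 0 = earliest (noisiest) stage. The
      (1 - 0.5 * step_frac) factor gives early stages more weight;
      the exact decay is an assumption.
    """
    stage_weight = 1.0 - 0.5 * step_frac
    recon = np.mean((pred_x0 - target_x0) ** 2)
    return stage_weight * (-reward_diff) + lam * recon

# toy call: perfect reconstruction, reward_diff of 1.0 at the earliest stage
loss = srpo_step_loss(np.zeros(4), np.zeros(4), reward_diff=1.0, step_frac=0.0)
```

During training, a stage is sampled per example, the clean image is recovered with the Direct-Align inversion, scored with the semantic-relative reward, and this weighted objective is minimized.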

What’s unique here

  • It’s the first method to make early-step optimization practical.
  • Rewards are based on relative comparisons, not absolute scores.
  • Works without needing a brand-new dataset or reward model. Just prompt tweaks with positive and negative control words.

Results

On FLUX.1.dev with a human preference dataset:

  • Human realism scores went up by ~3.7×.
  • Aesthetic quality improved ~3.1×.
  • Unlike older methods, SRPO didn’t fall into obvious hacks like overexposed lighting or strange sharpness.

Limitations

  • Depends a lot on the vocabulary of the reward model. If your chosen “negative” word is rare, it won’t guide well.
  • Sometimes the meaning of words in the embedding space isn’t intuitive.
  • Works best when rewards are tied to text. For pure image-based scores, they need extra tricks.

Takeaway

SRPO isn’t a new model; it’s a better training recipe. By letting rewards shape early timesteps and by making those rewards comparative instead of absolute, Tencent reduced reward hacking while speeding up training.

If you’re fine-tuning a diffusion model for realism or style, SRPO gives you a way to do it fast, with fewer hacks, and without needing a fresh reward dataset.


Tencent SRPO: A Smarter Way to Train Text-to-Image AI Models was originally published in Data Science in Your Pocket on Medium.
