ByteDance DreamActor-M1: Video generation model for movies

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

A few days back, Meta released MoCha, an AI model for generating short videos of movie characters. Now ByteDance, following Goku and OmniHuman-1, has come up with DreamActor-M1, again targeting video generation for movie characters.

What is ByteDance DreamActor-M1?

ByteDance’s DreamActor-M1 is an AI model that brings still images of people to life in super-realistic videos. Think of it like a digital puppeteer — it can animate faces, body movements, and even subtle expressions (like eye blinks or lip tremors) just from a single photo or a short clip.

Key Features:

Hybrid Guidance System — Controls facial expressions, head movements, and body poses precisely by combining different methods.

Example: To make a character smile while turning their head, the system uses face tracking for the smile and a 3D head model to adjust the angle.

Multi-Scale Adaptability — Works for different video sizes, from close-up faces to full-body shots, by training on varied resolutions.

Example: It can animate a talking head (small frame) or a person dancing (full-body) without losing quality.

Long-Term Temporal Coherence — Keeps animations smooth over long videos, fixing issues like flickering clothes or inconsistent textures.

Example: If a character’s shirt pattern changes oddly between frames, the system corrects it for consistency.

Complementary Appearance Guidance — Uses AI-generated reference frames to fill in missing details during animation.

Example: If a hand moves out of view, the system predicts how it should look when it reappears.

Progressive Training — The AI learns in three stages, adding more control signals step by step for better results.

Example: First it learns body and head control, then adds facial expressions, and finally fine-tunes everything together (see the sketch below).
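Since ByteDance has not released the training code, here is only a rough sketch of how such a three-stage schedule could be wired up in PyTorch. The module names (`pose_encoder`, `face_encoder`, `face_cross_attn`, `dit`) are assumptions, not the real interface.

```python
# Hypothetical sketch of a three-stage progressive training schedule.
# Module names are assumptions; DreamActor-M1's training code is not public.
import torch


def configure_stage(model, stage: int):
    """Freeze/unfreeze sub-modules to mimic the staged training described in the paper."""
    # Start with everything frozen (stage 3 unfreezes it all for joint fine-tuning).
    for p in model.parameters():
        p.requires_grad = (stage == 3)

    if stage == 1:
        # Stage 1: learn body-skeleton + head-sphere control together with the backbone.
        for module in (model.pose_encoder, model.dit):
            for p in module.parameters():
                p.requires_grad = True
    elif stage == 2:
        # Stage 2: add the implicit facial-motion branch while the rest stays frozen.
        for module in (model.face_encoder, model.face_cross_attn):
            for p in module.parameters():
                p.requires_grad = True

    # Optimize only what the current stage unfroze.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5
    )
```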

Unique Aspects:

Holistic Control — Adjusts face, head, and body movements separately for more natural animations.

Example: You can change a character’s smile without affecting their body posture.

Hybrid Motion Signals — Uses both implicit controls (facial details) and explicit ones (3D head and body models) for realistic motion.

Example: Small eyebrow raises use face tracking, while big jumps use a skeleton model.

Multi-Reference Protocol — Improves smoothness by checking multiple past frames while animating (see the sketch after this list).

Example: Instead of just looking at the last frame, it checks the last few to keep movements fluid.

Scalability — Works for different types of animations, like talking or dancing, because it trains on varied data.

Example: The same system can animate a podcast host speaking or a dancer performing complex moves.
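To make the multi-reference idea concrete, here is a minimal sketch of conditioning each new frame on a small rolling buffer of past outputs instead of only the last one. `model.generate_frame` and the buffer size are illustrative assumptions, not the paper's actual interface.

```python
from collections import deque


def animate_with_history(model, reference_image, control_signals, history_len=4):
    """Generate frames while conditioning on a rolling buffer of recent outputs.

    `model.generate_frame` is a hypothetical stand-in for one generation pass
    of the real system.
    """
    history = deque(maxlen=history_len)  # the last few generated frames
    video = []
    for controls in control_signals:     # per-frame face/head/body signals
        frame = model.generate_frame(
            reference_image,
            controls,
            past_frames=list(history),   # several references, not just the last frame
        )
        history.append(frame)
        video.append(frame)
    return video
```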

Architecture

  • Backbone: Diffusion Transformer (DiT) based on MMDiT, pre-trained for image/video tasks.
  • Latent Space: A 3D Variational Autoencoder (VAE) compresses videos into the latent space the model is trained in.

Key Components:

  • Face Motion Encoder: Extracts identity-independent facial expressions.
  • Pose Encoder: Processes 3D head spheres and body skeletons.
  • Cross-Attention Layers: Integrate facial tokens and reference features into the DiT blocks.
  • Training: Three stages — body/head control only, facial representation addition, and full joint optimization.
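Putting those components together, a rough PyTorch-style skeleton might look like the following. Dimensions, class names, and the exact way the tokens enter the DiT blocks are assumptions based on the description above, not released code.

```python
import torch
import torch.nn as nn


class DreamActorSketch(nn.Module):
    """Illustrative wiring of the components above; not the official model."""

    def __init__(self, dim=1024, n_heads=16, n_blocks=2):
        super().__init__()
        # Stand-ins for the face motion encoder and the pose (head sphere + skeleton) encoder.
        self.face_encoder = nn.Linear(512, dim)
        self.pose_encoder = nn.Linear(512, dim)
        # Cross-attention that injects facial tokens into the video tokens.
        self.face_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Generic transformer layers standing in for the MMDiT blocks.
        self.dit_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )

    def forward(self, video_tokens, ref_tokens, face_feats, pose_feats):
        # Pose conditioning (rendered head sphere + skeleton) is added to the noisy video tokens.
        x = video_tokens + self.pose_encoder(pose_feats)
        face_tokens = self.face_encoder(face_feats)
        for block in self.dit_blocks:
            # Self-attention over video + reference tokens keeps appearance consistent.
            x = block(torch.cat([x, ref_tokens], dim=1))[:, : video_tokens.size(1)]
            # Cross-attention pulls expression information from the facial tokens.
            x = x + self.face_cross_attn(x, face_tokens, face_tokens)[0]
        return x


# Dummy tensors just to show the call pattern (shapes are made up).
model = DreamActorSketch()
out = model(
    torch.randn(1, 64, 1024),  # noisy video latent tokens
    torch.randn(1, 32, 1024),  # reference image tokens
    torch.randn(1, 8, 512),    # implicit facial motion features
    torch.randn(1, 64, 512),   # rendered head-sphere / skeleton features
)
```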

Benchmark Performance:

  • Metrics: Outperforms state-of-the-art methods (Animate Anyone, Champ, etc.) in FID, SSIM, PSNR, LPIPS, and FVD.
  • Body Animation: Achieves FID 27.27 (vs. 33.01–40.21 for competitors).
  • Portrait Animation: Superior in lip-sync and expression accuracy (FID 25.70 vs. 29.84–31.72).
  • Ablation Studies: Confirm the importance of hybrid signals and the multi-reference protocol.

Limitations:

Struggles with dynamic camera movements and environmental interactions.

Bone length adjustment may require manual tuning for edge cases.

How does DreamActor M1 work?

1. Input Prep: What You Need

  • Reference Image: A photo of the person you want to animate (e.g., a portrait or full-body shot).
  • Driving Video (Optional): A video of someone else moving (to copy motions from) OR audio (for lip-syncing).

2. Extract Control Signals (The “Puppet Strings”)

The model analyzes the driving video/audio to figure out how to animate the reference image. It uses three types of control:

  • Facial Expressions:

Detects faces in the driving video.

Uses a pre-trained encoder to extract implicit facial motions (like smiles, blinks) without copying the driver’s identity.

  • Head Pose:

Tracks head movements (tilts, turns) and represents them as a 3D colored sphere.

The sphere’s color/position tells the model how to rotate the head.

  • Body Movements:

Converts the driver’s poses into a 3D skeleton (stick-figure style).

Adjusts bone lengths to match the reference person’s proportions.
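The bone-length adjustment is the easiest part to picture: keep the driver's joint directions but rescale each bone to the reference person's length. A toy version in NumPy, where the joint layout and parent ordering are assumptions for illustration:

```python
import numpy as np


def retarget_skeleton(driver_joints, ref_bone_lengths, parents):
    """Rescale driver bones to the reference person's proportions.

    driver_joints:    (J, 3) 3D joint positions from the driving video
    ref_bone_lengths: (J,)   length of each joint's bone to its parent, measured on the reference
    parents:          (J,)   index of each joint's parent (-1 for the root);
                             parents are assumed to appear before their children
    """
    out = driver_joints.astype(float).copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue  # keep the root joint where the driver put it
        direction = driver_joints[j] - driver_joints[p]
        direction = direction / (np.linalg.norm(direction) + 1e-8)  # driver's pose direction
        out[j] = out[p] + direction * ref_bone_lengths[j]           # reference's bone length
    return out
```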

3. Fill in the Gaps (For Unseen Areas)

If the reference image is a front-facing photo, but the driving video shows a back view:

  • The model generates fake “hints” (pseudo-reference frames) of what the back might look like.
  • These hints keep clothing/hair consistent when the person turns around.
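Conceptually, this step just widens the conditioning set when the driving motion reaches viewpoints the single photo cannot cover. A hand-wavy sketch, where the yaw threshold, `estimate_yaw`, and `synthesize_view` are all hypothetical:

```python
def build_reference_set(model, ref_image, driving_poses, yaw_threshold=60.0):
    """Start from the real photo; add generated pseudo-references only when needed.

    `estimate_yaw` and `model.synthesize_view` are hypothetical helpers.
    """
    references = [ref_image]
    max_yaw = max(abs(estimate_yaw(pose)) for pose in driving_poses)
    if max_yaw > yaw_threshold:
        # The driver turns far enough that unseen regions (sides, back) will appear,
        # so ask the generator itself for extra views to keep clothing/hair consistent.
        references.append(model.synthesize_view(ref_image, yaw=180.0))
    return references
```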

4. Diffusion Transformer (DiT) Magic

  • The reference image, control signals (face/sphere/skeleton), and pseudo-references are all fed into a Diffusion Transformer (a fancy AI that generates video frames step-by-step).
  • The DiT blends everything together:

Face Attention: Focuses on matching expressions.

Pose Attention: Aligns head/body movements.

Appearance Injection: Uses pseudo-references to keep textures realistic.
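The "step-by-step" part is ordinary diffusion sampling: start from noise in the 3D-VAE latent space and repeatedly denoise, feeding every condition into every step. A stripped-down sketch, where `model.denoise`, `vae.decode`, and the latent shape are stand-ins rather than the real interface:

```python
import torch


@torch.no_grad()
def sample_clip(model, vae, conditions, n_frames=73, steps=30):
    """Generate one clip by iterative denoising in the 3D-VAE latent space."""
    latents = torch.randn(1, n_frames, 64, 64, 16)  # pure noise to start (made-up shape)
    for t in reversed(range(steps)):
        # Each denoising step sees the reference tokens, face tokens, and pose maps.
        latents = model.denoise(latents, t, **conditions)
    return vae.decode(latents)[0]  # latent video -> RGB frames (drop the batch dim)
```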

5. Output: Smooth, Realistic Video

  • The model generates the video in short clips (~73 frames).
  • For long videos, it uses the last frame of each clip to start the next one, avoiding jumps.
  • Final result: A seamless animation where the reference image moves exactly like the driving video (or speaks to the audio).
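For videos longer than one clip, the handoff is simple to sketch: carry the final frame of each clip forward as an extra condition for the next one. Reusing the hypothetical `sample_clip` above, with the chunking details assumed:

```python
def generate_long_video(model, vae, per_frame_controls, clip_len=73):
    """Stitch a long animation out of ~73-frame clips, seeding each clip
    with the last frame of the previous one to avoid visible jumps."""
    frames, seed_frame = [], None
    for start in range(0, len(per_frame_controls), clip_len):
        conditions = {
            "controls": per_frame_controls[start : start + clip_len],
            "prev_frame": seed_frame,  # None for the very first clip
        }
        clip = sample_clip(model, vae, conditions, n_frames=clip_len)
        frames.extend(clip)            # decoded frames of this clip
        seed_frame = clip[-1]          # hand the final frame to the next clip
    return frames
```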

Key Tricks for Quality

Hybrid Control: Separating face/body/pose avoids weird glitches.

Multi-Scale Training: Works for close-ups (portraits) and full-body shots.

Progressively Trained: Learns simple motions first (body/head), then adds facial details.

Example: Give it a selfie + a dancing video, and it’ll make you dance just like the original performer — without looking like a zombie!

Conclusion

ByteDance’s DreamActor-M1 is a big step forward in AI animation. By mixing smart controls (face tracking, body poses, and 3D models), it turns still images into super-realistic videos with no zombie-like glitches!

While dynamic cameras are still a work in progress, DreamActor-M1 beats other models like Animate Anyone in quality and ease of use. For filmmakers, gamers, or anyone creating avatars, it’s a game-changer — bringing characters to life has never been easier.

Hope it becomes open-source soon.

