DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
A few days back, Meta released MoCha, a model for generating short videos of AI movie actors. Now ByteDance, following Goku and OmniHuman-1, has released DreamActor-M1, again targeting video generation for movie characters.
What is ByteDance DreamActor-M1?
ByteDance’s DreamActor-M1 is an AI model that brings still images of people to life in highly realistic videos. Think of it like a digital puppeteer: it can animate faces, body movements, and even subtle expressions (like eye blinks or lip tremors) from just a single photo, using a driving video or audio clip to steer the motion.


Key Features:
Hybrid Guidance System — Controls facial expressions, head movements, and body poses precisely by combining different methods.
Example: To make a character smile while turning their head, the system uses face tracking for the smile and a 3D head model to adjust the angle.
Multi-Scale Adaptability — Works for different video sizes, from close-up faces to full-body shots, by training on varied resolutions.
Example: It can animate a talking head (small frame) or a person dancing (full-body) without losing quality.
Long-Term Temporal Coherence — Keeps animations smooth over long videos, fixing issues like flickering clothes or inconsistent textures.
Example: If a character’s shirt pattern changes oddly between frames, the system corrects it for consistency.
Complementary Appearance Guidance — Uses AI-generated reference frames to fill in missing details during animation.
Example: If a hand moves out of view, the system predicts how it should look when it reappears.
Progressive Training — The AI learns in three stages, adding more control signals step by step for better results.
Example: First it learns body and head control, then adds facial expressions, and finally fine-tunes everything together.
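To make the progressive training idea concrete, here is a minimal Python sketch of a three-stage schedule that switches control signals on step by step. The StageConfig fields and stage names are illustrative assumptions, not the paper's actual training setup.

```python
# Minimal sketch of a three-stage progressive training schedule.
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    use_body_skeleton: bool   # 3D body-skeleton control signal
    use_head_sphere: bool     # 3D head-sphere pose signal
    use_face_motion: bool     # implicit facial-expression tokens
    train_all_params: bool    # whether the whole model is unfrozen

STAGES = [
    # Stage 1: learn coarse control from body skeletons and head spheres only.
    StageConfig("stage1_body_head", True, True, False, False),
    # Stage 2: add implicit facial motion; train only the new face branches.
    StageConfig("stage2_add_face", True, True, True, False),
    # Stage 3: jointly fine-tune all parameters with every control signal on.
    StageConfig("stage3_joint", True, True, True, True),
]

for stage in STAGES:
    print(f"{stage.name}: face={stage.use_face_motion}, full_finetune={stage.train_all_params}")
```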
Unique Aspects:
Holistic Control — Adjusts face, head, and body movements separately for more natural animations.
Example: You can change a character’s smile without affecting their body posture.
Hybrid Motion Signals — Uses both subtle (face details) and strong (3D head/body models) controls for realistic motion.
Example: Small eyebrow raises use face tracking, while big jumps use a skeleton model.
Multi-Reference Protocol — Improves smoothness by checking multiple past frames while animating.
Example: Instead of just looking at the last frame, it checks the last few to keep movements fluid (a toy version of this appears after this list).
Scalability — Works for different types of animations, like talking or dancing, because it trains on varied data.
Example: The same system can animate a podcast host speaking or a dancer performing complex moves.
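As a rough illustration of the multi-reference idea as described here, the snippet below conditions each new segment on the original reference image plus the last few generated frames. The function name, window size, and frame bookkeeping are hypothetical.

```python
# Hypothetical multi-reference selection: condition the next segment on the
# original reference image plus several recent frames, not just the last one.
def select_reference_frames(generated_frames, reference_image, num_refs=3):
    """Return the original reference plus up to `num_refs` most recent frames."""
    recent = generated_frames[-num_refs:] if generated_frames else []
    return [reference_image] + recent

# After generating frames 0..72 of the first clip, the next clip would be
# conditioned on the reference image and frames 70, 71 and 72.
refs = select_reference_frames(list(range(73)), "reference.png", num_refs=3)
print(refs)  # ['reference.png', 70, 71, 72]
```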
Architecture

- Backbone: Diffusion Transformer (DiT) based on MMDiT, pre-trained for image/video tasks.
- Latent Space: Utilizes a 3D Variational Autoencoder (VAE) for latent space training.
Key Components:
- Face Motion Encoder: Extracts identity-independent facial expressions.
- Pose Encoder: Processes 3D head spheres and body skeletons.
- Cross-Attention Layers: Integrates facial tokens and reference features into the DiT blocks.
- Training: Three stages — body/head control only, facial representation addition, and full joint optimization.
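To picture how these components might fit together, here is a rough PyTorch sketch: a face motion encoder producing identity-free expression tokens, a pose encoder for the rendered head sphere and body skeleton, and a DiT block that cross-attends to the face tokens. All dimensions, layer choices, and the way pose tokens are injected (simple concatenation here) are assumptions for illustration, not the model's actual design.

```python
import torch
import torch.nn as nn

class FaceMotionEncoder(nn.Module):
    """Maps a face feature vector to a short sequence of expression tokens."""
    def __init__(self, in_dim=512, token_dim=1024, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, token_dim * num_tokens)
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, face_feat):                      # (B, in_dim)
        tokens = self.proj(face_feat)
        return tokens.view(-1, self.num_tokens, self.token_dim)

class PoseEncoder(nn.Module):
    """Encodes rendered head-sphere + skeleton maps into pose tokens."""
    def __init__(self, channels=6, token_dim=1024):
        super().__init__()
        self.conv = nn.Conv2d(channels, token_dim, kernel_size=8, stride=8)

    def forward(self, pose_maps):                      # (B, C, H, W)
        feat = self.conv(pose_maps)                    # (B, D, H/8, W/8)
        return feat.flatten(2).transpose(1, 2)         # (B, N, D)

class DiTBlockWithFaceAttention(nn.Module):
    """One block: self-attention over video tokens, cross-attention to face tokens, MLP."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_tokens, face_tokens):
        x = video_tokens
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.face_attn(self.n2(x), face_tokens, face_tokens)[0]
        return x + self.mlp(self.n3(x))

# Smoke test with toy shapes: pose tokens are simply concatenated with the
# noisy video latents here, which is one plausible way to inject them.
video_latents = torch.randn(1, 256, 1024)
face_tokens = FaceMotionEncoder()(torch.randn(1, 512))
pose_tokens = PoseEncoder()(torch.randn(1, 6, 64, 64))
tokens = torch.cat([video_latents, pose_tokens], dim=1)
out = DiTBlockWithFaceAttention()(tokens, face_tokens)
print(out.shape)  # torch.Size([1, 320, 1024])
```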
Benchmark Performance:
- Metrics: Outperforms state-of-the-art methods (Animate Anyone, Champ, etc.) in FID, SSIM, PSNR, LPIPS, and FVD.
- Body Animation: Achieves FID 27.27 (vs. 33.01–40.21 for competitors).
- Portrait Animation: Superior in lip-sync and expression accuracy (FID 25.70 vs. 29.84–31.72).
- Ablation Studies: Confirms the importance of hybrid signals and multi-reference protocols.
Limitations:
Struggles with dynamic camera movements and environmental interactions.
Bone length adjustment may require manual tuning for edge cases.
How does DreamActor-M1 work?

1. Input Prep: What You Need
- Reference Image: A photo of the person you want to animate (e.g., a portrait or full-body shot).
- Driving Signal: A video of someone else moving (to copy motions from) or an audio clip (for lip-syncing).
2. Extract Control Signals (The “Puppet Strings”)
The model analyzes the driving video/audio to figure out how to animate the reference image. It uses three types of control:
- Facial Expressions:
Detects faces in the driving video.
Uses a pre-trained encoder to extract implicit facial motions (like smiles, blinks) without copying the driver’s identity.
- Head Pose:
Tracks head movements (tilts, turns) and represents them as a 3D colored sphere.
The sphere’s color/position tells the model how to rotate the head.
- Body Movements:
Converts the driver’s poses into a 3D skeleton (stick-figure style).
Adjusts bone lengths to match the reference person’s proportions.
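The bone-length adjustment in the last bullet can be pictured with a tiny retargeting sketch: keep the driver's bone directions, but rescale each bone to the reference person's limb lengths so proportions stay faithful. The skeleton layout and math below are simplified assumptions, not the actual implementation.

```python
import numpy as np

# A toy 2D skeleton: joint positions and parent links (root has parent -1),
# e.g. hip -> knee -> ankle -> foot.
PARENTS = [-1, 0, 1, 2]

def bone_lengths(joints, parents=PARENTS):
    """Length of each bone (joint to its parent); the root gets length 0."""
    return np.array([0.0 if p < 0 else np.linalg.norm(joints[i] - joints[p])
                     for i, p in enumerate(parents)])

def retarget(driver_joints, ref_lengths, parents=PARENTS):
    """Keep the driver's bone directions, but use the reference bone lengths."""
    out = driver_joints.copy()
    for i, p in enumerate(parents):
        if p < 0:
            continue
        direction = driver_joints[i] - driver_joints[p]
        direction /= (np.linalg.norm(direction) + 1e-8)
        out[i] = out[p] + direction * ref_lengths[i]
    return out

driver = np.array([[0, 0], [0, 1.2], [0, 2.4], [0.4, 2.4]], dtype=float)
reference = np.array([[0, 0], [0, 1.0], [0, 2.0], [0.3, 2.0]], dtype=float)
print(retarget(driver, bone_lengths(reference)))
```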
3. Fill in the Gaps (For Unseen Areas)
If the reference image is a front-facing photo, but the driving video shows a back view:
- The model generates fake “hints” (pseudo-reference frames) of what the back might look like.
- These hints keep clothing/hair consistent when the person turns around.
4. Diffusion Transformer (DiT) Magic
- The reference image, control signals (face/sphere/skeleton), and pseudo-references are all fed into a Diffusion Transformer (a transformer-based diffusion model that denoises video frames step by step).
- The DiT blends everything together:
Face Attention: Focuses on matching expressions.
Pose Attention: Aligns head/body movements.
Appearance Injection: Uses pseudo-references to keep textures realistic.
5. Output: Smooth, Realistic Video
- The model generates the video in short clips (~73 frames).
- For long videos, it uses the last frame of each clip to start the next one, avoiding jumps.
- Final result: A seamless animation where the reference image moves exactly like the driving video (or speaks to the audio).
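A toy version of that segment-wise generation loop, with a hypothetical generate_clip placeholder standing in for the full diffusion model:

```python
CLIP_LEN = 73        # frames per generated segment (from the description above)
OVERLAP = 1          # reuse the last frame of the previous clip to start the next

def generate_clip(reference_image, control_signals, init_frames=None, length=CLIP_LEN):
    """Placeholder for the diffusion model: returns `length` dummy frame indices."""
    start = init_frames[-1] + 1 if init_frames else 0
    return list(range(start, start + length))

def generate_long_video(reference_image, control_signals, total_frames=200):
    frames, prev_tail = [], None
    while len(frames) < total_frames:
        clip = generate_clip(reference_image, control_signals, init_frames=prev_tail)
        frames.extend(clip)
        prev_tail = clip[-OVERLAP:]          # carry the last frame(s) forward
    return frames[:total_frames]

video = generate_long_video("selfie.png", "driving_signals", total_frames=200)
print(len(video), video[:3], video[-3:])     # 200 [0, 1, 2] [197, 198, 199]
```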
Key Tricks for Quality
Hybrid Control: Separating face/body/pose avoids weird glitches.
Multi-Scale Training: Works for close-ups (portraits) and full-body shots.
Progressively Trained: Learns simple motions first (body/head), then adds facial details.
Example: Give it a selfie + a dancing video, and it’ll make you dance just like the original performer — without looking like a zombie!
Conclusion
ByteDance’s DreamActor-M1 is a big step forward in AI animation. Using smart controls (like mixing face tracking, body poses, and 3D models), it turns still images into super realistic videos — no zombie-like glitches!
While dynamic cameras are still a work in progress, DreamActor-M1 beats other models like Animate Anyone in quality and ease of use. For filmmakers, gamers, or anyone creating avatars, it’s a game-changer — bringing characters to life has never been easier.
Hope it becomes open-source soon.