
ByteDance’s Waver is a high-performance foundation model for unified image and video generation that produces 5–10s clips at native 720p and upscales to 1080p, supporting text-to-video, image-to-video, and text-to-image within one architecture. It ranks among the Top 3 for both T2V and I2V on the Artificial Analysis Arena as of late July 2025, with notable strength in complex motion.
What Waver is
Waver is a two-stage generative system built on rectified-flow Diffusion Transformers that first synthesizes 720p video and then refines it to 1080p with a specialized super-resolution module. It unifies T2V, I2V, and T2I via a single Task‑Unified DiT core and a Cascade Refiner for high-fidelity upscaling.
Key capabilities
- Unified modalities: supports T2V, I2V, and T2I in one model, avoiding separate systems for each task.
- High resolution: generates 720p videos and refines to sharp 1080p with fewer inference steps than single-stage 1080p generators.
- Strong motion: excels at large, smooth motion and temporal consistency, particularly in sports and dance scenes.
Architecture overview

- Task‑Unified DiT: a rectified-flow Transformer with a “Dual Stream + Single Stream” Hybrid Stream design to balance text–video alignment and parameter efficiency, plus hybrid positional embeddings combining 3D RoPE with learnable absolute embeddings.

- Cascade Refiner: a flow-matching refiner with window attention to upscale 480p/720p inputs to 1080p while correcting generative artifacts via pixel and latent degradation strategies during training.
Unified conditioning design
Waver concatenates three inputs: noisy video latent V, conditional frames I (VAE latents for provided frames), and a binary Mask channel to indicate which frames are conditions versus to-be-generated. This flexible tensor formulation allows mixing tasks and straightforward extensions like interpolation.
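A minimal NumPy sketch of this conditioning tensor, assuming a (T, C, H, W) latent layout and channel-wise concatenation (the report does not pin down the exact layout):

```python
import numpy as np

def build_unified_input(noisy_latent, cond_latent, cond_frame_idx):
    """Concatenate the noisy video latent V, conditional frame latents I,
    and a binary mask M along the channel axis, per Waver's unified
    conditioning. cond_latent holds VAE latents at condition frames and
    zeros elsewhere; the mask marks condition (1) vs. generated (0) frames."""
    T, C, H, W = noisy_latent.shape
    mask = np.zeros((T, 1, H, W), dtype=noisy_latent.dtype)
    mask[list(cond_frame_idx)] = 1.0
    return np.concatenate([noisy_latent, cond_latent, mask], axis=1)

# I2V: only the first frame is a condition; T2V would pass no indices.
V = np.random.randn(16, 8, 45, 80).astype(np.float32)
I = np.zeros_like(V)
I[0] = np.random.randn(8, 45, 80)
x = build_unified_input(V, I, cond_frame_idx=[0])
print(x.shape)  # (16, 17, 45, 80): 8 noisy + 8 condition + 1 mask channels
```

Interpolation falls out of the same formulation: pass the first and last frame indices as conditions and let the model fill the rest.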
Hybrid Stream DiT
- Dual Stream layers early for text–video co-adaptation and precise alignment.
- Single Stream layers later for efficiency and faster training throughput.
- Reported configuration example: the 12B model uses 16 Dual Stream layers followed by 40 Single Stream layers.
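The attention layout of the two block types can be sketched as below; this is a bare NumPy illustration of the parameter-sharing difference only (real blocks add norms, MLPs, and timestep modulation, which are omitted here):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(q, k, v):
    # Full attention over the concatenated text+video token sequence.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def dual_stream_block(txt, vid, W_txt, W_vid):
    """Dual Stream: text and video tokens keep separate QKV weights but
    attend jointly, supporting text-video co-adaptation in early layers."""
    qt, kt, vt = (txt @ W for W in W_txt)
    qv, kv, vv = (vid @ W for W in W_vid)
    q, k, v = (np.concatenate(p) for p in ((qt, qv), (kt, kv), (vt, vv)))
    out = joint_attention(q, k, v)
    return out[:len(txt)], out[len(txt):]

def single_stream_block(txt, vid, W):
    """Single Stream: one shared QKV weight set over the concatenated
    sequence -- fewer parameters and higher throughput in later layers."""
    x = np.concatenate([txt, vid])
    out = joint_attention(*(x @ Wi for Wi in W))
    return out[:len(txt)], out[len(txt):]

rng = np.random.default_rng(0)
d = 16
txt, vid = rng.normal(size=(4, d)), rng.normal(size=(20, d))
W_txt = [rng.normal(size=(d, d)) for _ in range(3)]
W_vid = [rng.normal(size=(d, d)) for _ in range(3)]
t1, v1 = dual_stream_block(txt, vid, W_txt, W_vid)   # first 16 layers
t2, v2 = single_stream_block(t1, v1, W_txt)          # next 40 layers
print(t2.shape, v2.shape)  # (4, 16) (20, 16)
```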
Positional encoding
A hybrid positional scheme combines 3D RoPE for relative temporal/height/width relations with factorized learnable absolute embeddings to accelerate convergence and reduce distortions across variable durations and resolutions.
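A sketch of the hybrid scheme, assuming the head dimension is split into equal thirds for the (t, h, w) rotary chunks (the actual split ratios are not specified in the report):

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotary embedding along one axis; x: (N, d) with d even, pos: (N,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs                  # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def hybrid_pos(x, t, h, w, abs_t, abs_h, abs_w):
    """3D RoPE over per-axis chunks of the feature dim (relative positions),
    plus factorized learnable absolute tables summed per axis."""
    d = x.shape[-1] // 3
    x = np.concatenate(
        [rope_1d(x[:, :d], t), rope_1d(x[:, d:2*d], h), rope_1d(x[:, 2*d:], w)],
        axis=-1)
    return x + abs_t[t] + abs_h[h] + abs_w[w]

T, H, W, D = 2, 3, 3, 24
t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
t, h, w = t.ravel(), h.ravel(), w.ravel()
x = np.random.randn(T * H * W, D)
tables = [np.zeros((n, D)) for n in (T, H, W)]  # learnable in practice
y = hybrid_pos(x, t, h, w, *tables)
```

Because RoPE encodes only relative offsets, it extrapolates across durations and resolutions; the absolute tables give the model a fixed anchor that speeds convergence.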
Cascade Refiner details
The refiner uses window attention to constrain computation within local spatio-temporal windows, alternating spatial and temporal windows and retaining full attention only in first/last layers to balance fidelity and cost. It is trained with pixel down/up-sampling and latent noise injection to match first-stage artifacts and improve robustness.
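The two window partitions can be sketched as pure reshapes over the token grid; window sizes here are illustrative, not the report's values:

```python
import numpy as np

def spatial_windows(x, wh, ww):
    """(T, H, W, C) -> (n_windows, wh*ww, C): each attention window is a
    local tile within a single frame (spatial layers)."""
    T, H, W, C = x.shape
    x = x.reshape(T, H // wh, wh, W // ww, ww, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, wh * ww, C)

def temporal_windows(x, wt):
    """(T, H, W, C) -> (n_windows, wt, C): each window runs along time at a
    fixed spatial location (temporal layers)."""
    T, H, W, C = x.shape
    x = x.reshape(T // wt, wt, H * W, C).transpose(0, 2, 1, 3)
    return x.reshape(-1, wt, C)

# The refiner alternates these partitions layer by layer; only the first
# and last layers attend over all T*H*W tokens at once.
x = np.random.randn(4, 8, 8, 2)
print(spatial_windows(x, 4, 4).shape)  # (16, 16, 2)
print(temporal_windows(x, 2).shape)    # (128, 2, 2)
```

Since attention cost is quadratic in window length, shrinking windows from the full 1080p token count to local tiles is what makes the second stage cheap.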
Training data pipeline

Waver is trained on over 200 million video clips curated through multi-source acquisition, shot segmentation, multi-dimensional scoring, and hierarchical filtering, augmented with high-resolution and synthetic content at later stages. A custom MLLM quality model and a detailed captioning system enhance data quality and temporal action understanding.
Quality model and captions
- Quality model: an MLLM (fine-tuned from VideoLLaMA-like systems) classifies clips as high quality or into one of 13 low-quality dimensions, achieving 78% accuracy on high-quality predictions in validation.
- Caption model: built on Qwen2.5‑VL with DPO stabilization to produce rich, temporally grounded action descriptions and sub-action timestamps for improved motion fidelity.
Semantic balancing
Action taxonomies (12 top-level, 100 second-level, 6,000+ third-level labels) are used to identify underrepresented categories like sports; Waver balances via oversampling and synthetic generation to strengthen scarce yet challenging motion segments.
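One common way to realize such oversampling is inverse-frequency sampling weights over the leaf labels; the formula and smoothing exponent below are illustrative, not the report's exact scheme:

```python
from collections import Counter

def balance_weights(labels, alpha=0.5):
    """Per-clip sampling weights proportional to (category frequency)^-alpha,
    so scarce action categories (e.g. gymnastics) are drawn more often.
    alpha=1 fully flattens the distribution; alpha=0 leaves it unchanged."""
    counts = Counter(labels)
    w = [counts[l] ** -alpha for l in labels]
    total = sum(w)
    return [wi / total for wi in w]

labels = ["daily_life"] * 8 + ["sports"] * 2
w = balance_weights(labels)
# each clip in the rare "sports" class gets a higher per-clip weight
```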
Multi-stage training recipe
- T2I pretraining from 256p up to 1024p across diverse aspect ratios to instill strong text–image alignment.
- T2V/I2V staged from 192p at 12→16 fps to learn motion cheaply before 480p and 720p phases, with continued T2I joint training to preserve semantics; 1080p training is delegated to the refiner.
- A representative schedule lists data volumes, epochs, learning rates, and sigma shifts, ensuring at least one epoch per stage and increasing the flow-matching sigma shift with resolution.
Motion optimization
- Low-resolution motion pretraining: extensive 192p training decouples motion learning from visual fidelity and accelerates later convergence.
- Timestep scheduling: adopts SD3-inspired flow-matching distributions; mode sampling improves motion amplitude over logit-normal in ablations.
- T2V+I2V joint training: conditions I2V on the first frame only 20% of the time to prevent motion collapse and match T2V motion magnitude.
- Filtering: removes overly static or overly shaky clips using foreground optical-flow-based scoring to shape motion distribution.
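The timestep-scheduling pieces can be sketched as follows; the shift value of 3.0 below is illustrative, and s=1.29 is a scale explored in the SD3 paper rather than a value the Waver report confirms:

```python
import numpy as np

def shift_timesteps(t, shift):
    """SD3-style timestep shift: larger shift values (used at higher
    resolutions) push sampled timesteps toward the noisier end."""
    return shift * t / (1.0 + (shift - 1.0) * t)

def mode_sample(n, s=1.29, rng=None):
    """SD3 'mode' timestep sampling, which Waver's ablations report gives
    larger motion amplitude than logit-normal sampling. s controls tail
    heaviness and stays monotone for s up to 2/(pi-2) ~= 1.75."""
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.uniform(size=n)
    return 1.0 - u - s * (np.cos(np.pi * u / 2.0) ** 2 - 1.0 + u)

t = mode_sample(10_000)
print(shift_timesteps(0.5, 3.0))  # 0.75: the shift moves t=0.5 toward the noisy end
```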
Representation alignment
Waver aligns intermediate DiT features to high-level Qwen2.5‑VL semantics via cosine-similarity loss (applied at 480p), improving semantic structure and prompt following without the storage overhead of 720p features.
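A REPA-style sketch of this alignment objective, where the projection head `W_proj` is a hypothetical stand-in for whatever learned projection the model uses:

```python
import numpy as np

def alignment_loss(dit_feats, mllm_feats, W_proj):
    """Cosine-similarity alignment: project intermediate DiT features and
    maximize cosine similarity with frozen Qwen2.5-VL features.
    Returns 1 - mean cosine similarity, so 0 means perfect alignment."""
    z = dit_feats @ W_proj
    z /= np.linalg.norm(z, axis=-1, keepdims=True)
    y = mllm_feats / np.linalg.norm(mllm_feats, axis=-1, keepdims=True)
    return 1.0 - float((z * y).sum(-1).mean())

f = np.random.randn(32, 64)
loss = alignment_loss(f, f, np.eye(64))  # ~0.0 when features already match
```

Applying the loss at 480p rather than 720p keeps the cached MLLM feature storage manageable while still shaping semantic structure.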
Aesthetics and realism
Recipes detail how to optimize aesthetics, color, and realism across stages, including prompt rewriting, KL loss adjustments to mitigate graininess/distortion, and LPIPS tuning to avoid grid artifacts; 1080p images in T2I improve visual quality.
Benchmarks and standing


- Public leaderboards: Top 3 for both T2V and I2V on the Artificial Analysis Arena as of 2025‑07‑30 10:00 GMT+8.
- Comparative notes: internal tests claim stronger motion and visual quality than Kling 2.0 and Wan 2.1, and better motion/visuals than Veo 3 with slightly weaker prompt following; external coverage highlights similar strengths and multi‑shot storytelling.
Practical implications
The single-model approach reduces training and maintenance overhead across T2I/T2V/I2V while preserving alignment and efficiency, and the two‑stage 1080p path yields roughly 40% faster high‑res synthesis than direct 1080p generation. The explicit training and inference “recipes” make the system reproducible and informative for practitioners building large-scale video generators.
Release status and ecosystem notes
Public materials include the arXiv technical report and community commentary/podcasts; third-party posts emphasize benchmark strength and motion quality, though the distribution of runnable weights and productized endpoints may vary over time across ByteDance's creative tooling ecosystem and the demos highlighted in daily AI news digests.
Waver Unleashed: ByteDance’s All‑in‑One Engine for Lifelike Video Creation was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.