Ovi: Free Veo3 is here!!


How to use Ovi for free?


Most so-called “audio-video” generators still cheat. They either make a silent video and glue sound later, or start with an audio track and stretch the visuals to fit.

That approach works fine for talking heads or background noise, but it collapses when you want a real cinematic moment: footsteps echoing exactly when a foot lands, thunder syncing with lightning, lips actually matching syllables.

OVI changes that. It’s built on a simple but gutsy idea: generate both sound and visuals at once, as one process. Not two pipelines fighting to stay aligned.


The architecture

OVI uses two identical Diffusion Transformers (DiTs).
One for video, one for audio.

Same depth, same number of heads, same feed-forwards. Every layer in one model talks to the same layer in the other using bidirectional cross-attention. The video pays attention to sound, and the sound pays attention back. No separate projection layers or adapters. Just two symmetric towers exchanging cues block by block.

That symmetry matters. It means both towers learn at the same rate and speak the same representational language. The result is a model that doesn’t need face-mask hacks or separate sync modules; it just learns timing on its own.
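To make the wiring concrete, here is a minimal PyTorch sketch of one pair of symmetric blocks. The names, dimensions, and the omission of text/timestep conditioning are my simplifications for illustration, not OVI’s actual code:

```python
import torch
import torch.nn as nn

class TwinDiTBlock(nn.Module):
    """One tower layer: self-attention, cross-attention to the other tower, feed-forward."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, other):
        # Attend within this modality first.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Then attend to the other tower's hidden states at the same depth.
        h = self.norm2(x)
        x = x + self.cross_attn(h, other, other)[0]
        return x + self.ffn(self.norm3(x))

# Two symmetric towers exchange cues block by block, in both directions.
video_block, audio_block = TwinDiTBlock(), TwinDiTBlock()
video_tokens = torch.randn(1, 30, 1024)   # coarse video tokens
audio_tokens = torch.randn(1, 150, 1024)  # finer-grained audio tokens
video_next = video_block(video_tokens, audio_tokens)  # video attends to sound
audio_next = audio_block(audio_tokens, video_tokens)  # sound attends back
```

Because the two towers are structurally identical, no projection layers or adapters are needed between them; the hidden states can be consumed directly.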

The time problem

Audio moves faster than video. For a 5-second clip, the video branch works with maybe 30 latent frames; the audio branch has hundreds of tokens.

So they scaled the Rotary Positional Embeddings (RoPE) of the audio branch by about 0.2. That scaling trick lines up the temporal “frequency” of audio tokens with the coarser video tokens. Without it, the two streams drift. With it, attention maps line up perfectly, mouths move with words, impacts hit with sound.
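The trick itself is tiny. Here is a sketch of what scaling the rotary positions looks like; the 0.2 factor is the one reported for the audio branch, while the dimensions and token counts are illustrative:

```python
import torch

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Rotary embedding angles for a set of token positions; `scale` compresses the position index."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return torch.outer(positions.float() * scale, freqs)

video_pos = torch.arange(30)    # ~30 video token positions for a 5-second clip
audio_pos = torch.arange(150)   # far more audio tokens covering the same 5 seconds

video_angles = rope_angles(video_pos)             # scale = 1.0
audio_angles = rope_angles(audio_pos, scale=0.2)  # 150 audio positions now span ~0..30,
                                                  # the same rotary range as the video tokens
```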

One text prompt

Instead of separate prompts for sound and visuals, OVI uses one combined description, passed through a frozen T5 encoder.

Something like:

“A man walks across a wet street. <S>He speaks softly</E> <AUDCAP>Footsteps splash in puddles</ENDAUDCAP>.”

That single embedding conditions both the audio and video towers. So when T5 says “speaks softly,” both the mouth movement and the voice reflect that cue. No extra semantic alignment losses, no second encoder.
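A minimal sketch of that conditioning path, using Hugging Face’s T5EncoderModel. The checkpoint choice and the way the embedding is handed to the towers are assumptions for illustration; the tag format is the one from the example prompt above:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Frozen text encoder (this checkpoint is illustrative, not OVI's exact one).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()

prompt = ("A man walks across a wet street. <S>He speaks softly</E> "
          "<AUDCAP>Footsteps splash in puddles</ENDAUDCAP>.")

with torch.no_grad():
    tokens = tokenizer(prompt, return_tensors="pt")
    text_emb = text_encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)

# One embedding, two consumers: both towers cross-attend to the same text tokens,
# e.g. (hypothetical call signatures):
#   video_hidden = video_tower(video_latents, text=text_emb, other=audio_hidden)
#   audio_hidden = audio_tower(audio_latents, text=text_emb, other=video_hidden)
```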

Training: staged but unified

Two main stages:

  1. Audio pretraining: they built the audio tower from scratch using hundreds of thousands of hours of raw audio. Mostly speech, but also background sound and effects. The goal was to make it a strong standalone generator before pairing it with video.
  2. Joint fusion training: they froze most feed-forward layers, connected both towers with fresh cross-attention blocks, and fine-tuned on millions of tightly synchronized 5-second clips. They used a Flow Matching (FM) loss, which boils down to predicting how to move from noise to real data (see the sketch after this list). Both towers share the same timestep schedule, so timing is learned directly instead of imposed later.
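Here is that sketch: a common rectified-flow form of the Flow Matching loss with one shared timestep for both modalities. The linear noise-to-data path and the `joint_model` signature are assumptions, not necessarily OVI’s exact formulation:

```python
import torch

def flow_matching_loss(joint_model, x_video, x_audio, text_emb):
    """FM loss: predict the velocity that carries noise toward real data, same t for both towers."""
    b = x_video.shape[0]
    t = torch.rand(b, device=x_video.device)  # one shared timestep schedule
    tv = t.view(b, 1, 1)

    noise_v = torch.randn_like(x_video)
    noise_a = torch.randn_like(x_audio)

    # Linear path from pure noise (t=0) to real latents (t=1), same t for video and audio.
    z_video = (1 - tv) * noise_v + tv * x_video
    z_audio = (1 - tv) * noise_a + tv * x_audio

    # Both towers predict the velocity field; targets are data minus noise.
    v_pred_video, v_pred_audio = joint_model(z_video, z_audio, t, text_emb)
    target_v = x_video - noise_v
    target_a = x_audio - noise_a

    return ((v_pred_video - target_v) ** 2).mean() + ((v_pred_audio - target_a) ** 2).mean()
```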

The dataset side

They didn’t just grab YouTube junk. The pipeline filters heavily:

  • Uses SyncNet to reject clips where sound doesn’t match mouth motion.
  • Keeps only clips above 720p with motion (no static junk).
  • Captions every clip with an MLLM that describes what’s seen and heard.

Every sample is 5 seconds, 720×720 resolution, 24 fps. Audio sampled at 16 kHz. Enough for fidelity but small enough to train at scale.
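Put together, the filtering pass amounts to something like the loop below. The helpers (`syncnet_confidence`, `motion_score`, `mllm_caption`) and the threshold values are hypothetical stand-ins for the SyncNet, motion, and MLLM components described above:

```python
def filter_and_caption(clips, sync_threshold=0.5, min_height=720, min_motion=0.1):
    """Keep only well-synced, high-resolution, non-static clips, and caption what survives."""
    kept = []
    for clip in clips:
        if syncnet_confidence(clip) < sync_threshold:  # reject clips where sound and mouth drift apart
            continue
        if clip.height < min_height or motion_score(clip) < min_motion:  # drop low-res or static footage
            continue
        clip.caption = mllm_caption(clip)  # MLLM describes what is seen and heard
        kept.append(clip)                  # each kept sample: 5 s, 720x720, 24 fps, 16 kHz audio
    return kept
```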

The results speak (and look) better

  • In human preference tests, OVI beats other open models like UniVerse-1 and JavisDiT on all three counts: audio quality, video quality, and how well they sync.
  • The model makes 5-second clips that feel cohesive: speech fits lips, instruments hit visually, sound effects match the scene’s rhythm. It’s not perfect, but the jump in realism is visible.
  • The audio tower itself (OVI-AUD) performs close to top-tier text-to-speech models like Fish Speech or CosyVoice, and also works as a text-to-audio generator for effects. One model, two roles.

Where it still falls short

  • OVI is stuck at 5-second clips, 720p, and 16 kHz audio.
  • It’s heavy, 11 billion parameters total, so inference is slow.
  • There’s no long-form narrative ability yet. To scale beyond short shots, they’ll need something chunk-wise or causal.

Why this matters

The idea of treating audio and video as one object instead of two stitched systems feels like the first step toward real cinematic AI. Everything before this has been a patchwork. OVI feels like a clean start. A bit brute-force, yes, but technically honest: if you want sound and sight to move together, you train them together.

The model is open-sourced and can be used for free.

If Veo is Google’s black box for film-grade generation, OVI is the first open model that actually shows how it might work inside: twin DiTs talking across time until pixels and waves finally agree.

