LongCat-Video: Generate 1 minute long AI videos with this model

How to generate long AI videos for free?

Photo by Joey Huang on Unsplash

There’s a quiet shift happening in video generation. Most models, whether open or closed, still live inside 10-second bubbles: impressive clips that crumble when you ask for longer stories. Meituan’s LongCat-Video changes that.


It’s the first open-source video model that can generate minutes-long coherent videos, without the usual color drift, motion blur, or looping hallucinations.

This isn't just a marketing claim. LongCat is a 13.6B-parameter model trained to understand continuity: how things flow over time, not just how they look.

The core idea: one model for all video tasks

Most video models today specialize. One for text-to-video. Another for image-to-video. Maybe a fine-tuned variant for video continuation. LongCat does it all in a single transformer.

It's built on top of the Diffusion Transformer (DiT) architecture. That means it uses the same general design that powers high-end image generators, but extends it into the 3D space of video. The model treats video as 3D tokens spanning width, height, and time, and predicts what comes next.

  • Text-to-Video: no conditioning frames. The model starts from noise and text.
  • Image-to-Video: one conditioning frame.
  • Video-Continuation: multiple conditioning frames.

Everything is treated as "video continuation" internally. The difference is only how many frames you feed it. That's a neat trick: instead of teaching the model three separate skills, LongCat learns a single one, predicting the next frames given whatever is already there.
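In pseudocode, the unified framing might look something like this. The function and model interface here are hypothetical sketches, not the released API:

```python
import torch

def generate_video(model, prompt_embeds, cond_frames=None, num_new_frames=120):
    """Hypothetical sketch: all three tasks are treated as 'continuation',
    differing only in how many conditioning frames are supplied.
      - Text-to-Video:      cond_frames is None (start from pure noise)
      - Image-to-Video:     cond_frames holds 1 frame
      - Video-Continuation: cond_frames holds N frames
    """
    # latent video tokens: (frames, height, width, channels); shapes are illustrative
    noise = torch.randn(num_new_frames, 60, 104, 16)

    if cond_frames is None:
        latents = noise                                   # text-to-video
        num_cond = 0
    else:
        latents = torch.cat([cond_frames, noise], dim=0)  # image-to-video / continuation
        num_cond = len(cond_frames)

    # the model denoises only the new frames, conditioned on everything before them
    return model(latents, prompt_embeds, num_cond=num_cond)
```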

It also uses KV caching during inference, meaning it doesn’t waste computation reprocessing the fixed conditioning frames. Once they’re encoded, the model reuses them, frame after frame. That’s part of how it keeps inference fast for long videos.
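Conceptually, this works like KV caching in language models: encode the fixed frames once, then reuse their keys and values at every later step. A toy single-head sketch of the idea, not the actual implementation:

```python
import torch

class CachedAttention(torch.nn.Module):
    """Toy illustration of reusing keys/values for fixed conditioning frames."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_qkv = torch.nn.Linear(dim, dim * 3)
        self.cache_k = None
        self.cache_v = None

    def forward(self, new_tokens, cond_tokens=None):
        # encode the conditioning frames once; their K/V are reused on every later call
        if cond_tokens is not None and self.cache_k is None:
            _, k, v = self.to_qkv(cond_tokens).chunk(3, dim=-1)
            self.cache_k, self.cache_v = k, v

        q, k_new, v_new = self.to_qkv(new_tokens).chunk(3, dim=-1)
        k = k_new if self.cache_k is None else torch.cat([self.cache_k, k_new], dim=0)
        v = v_new if self.cache_v is None else torch.cat([self.cache_v, v_new], dim=0)

        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```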

How it handles long videos

The big problem with long videos isn’t quality, it’s drift.

Even small prediction errors pile up over time, leading to flickering colors, warped objects, or random resets.

LongCat solves this in training itself. It’s pretrained directly on Video-Continuation tasks, not just short video diffusion. That means it’s seen examples where it has to extend an ongoing scene rather than just start from scratch. The result is smoother transitions and consistency over minutes-long clips.

The internal architecture uses 3D attention with RoPE positional encoding, which helps the transformer understand how tokens relate across both space and time. This lets it keep track of how a person moves across multiple seconds, not just per-frame texture details.
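A simplified sketch of what 3D rotary embeddings look like, splitting the rotation across the time, height, and width axes. The dimensions and the per-axis split here are illustrative assumptions, not the model's actual configuration:

```python
import torch

def rope_1d(positions, dim, base=10000.0):
    """Standard 1D rotary frequencies for a single axis."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float()[:, None] * freqs[None, :]
    return torch.cos(angles), torch.sin(angles)

def rope_3d(t, h, w, dim_per_axis=32):
    """Concatenate per-axis rotations so each token knows where it sits
    in time, height, and width."""
    tt, hh, ww = torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
    )
    cos_t, sin_t = rope_1d(tt.flatten(), dim_per_axis)
    cos_h, sin_h = rope_1d(hh.flatten(), dim_per_axis)
    cos_w, sin_w = rope_1d(ww.flatten(), dim_per_axis)
    cos = torch.cat([cos_t, cos_h, cos_w], dim=-1)  # (t*h*w, 3 * dim_per_axis / 2)
    sin = torch.cat([sin_t, sin_h, sin_w], dim=-1)
    return cos, sin  # applied to queries and keys inside attention
```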

The coarse-to-fine trick

Generating long, high-resolution videos is computationally painful. Every pixel adds tokens. Every frame multiplies them. Attention cost grows quadratically, and soon even GPUs start sweating.

LongCat deals with this using a coarse-to-fine pipeline. Here’s how:

  1. First, it generates a rough version of the video at 480p, 15fps.
  2. Then it refines it into a sharper 720p, 30fps version using a lightweight LoRA-based “refinement expert.”

This refinement expert is trained to add missing texture and detail, almost like an artist redrawing outlines after a blurry sketch.

Because it doesn’t have to regenerate everything, this two-step setup gives a 10×–12× speedup, producing a full 720p video in minutes on a single H800 GPU. And the cool part: the refinement model uses the same flow matching logic as the main one, so the transition between coarse and fine outputs is smooth.
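Stripped down to pseudocode, the two-stage pipeline reduces to something like this. The object and method names are hypothetical, not the released interface:

```python
def generate_long_video(base_model, refine_expert, prompt):
    # Stage 1: cheap draft at 480p / 15 fps, where long-range coherence is established
    draft = base_model.sample(prompt, resolution=(854, 480), fps=15)

    # Stage 2: a LoRA "refinement expert" lifts it to 720p / 30 fps,
    # adding texture and temporal detail instead of regenerating from scratch
    return refine_expert.refine(draft, resolution=(1280, 720), fps=30)
```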

Sparse attention

Dense attention is wasteful. Every frame compares every token to every other token, even ones far apart in space or time that don’t matter. LongCat uses Block Sparse Attention (BSA) instead.

Imagine dividing the video tokens into small 3D cubes, and only letting each cube “look” at the few that are most relevant. This slashes computation to less than 10% of the normal load. But because the blocks are chosen based on similarity, not just distance, quality loss is almost zero.

This BSA module is open-sourced separately, so other video and multimodal models can reuse it.
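Very roughly, the idea is to pool each block into a single vector, rank block-to-block similarity, and only attend within the top-scoring pairs. A toy single-head sketch, not the released BSA kernel:

```python
import torch

def block_sparse_attention(q, k, v, block_size=64, top_k=8):
    """Toy block-sparse attention: each query block attends only to the
    top-k key blocks ranked by mean-pooled similarity."""
    n, d = q.shape
    nb = n // block_size
    qb = q[: nb * block_size].view(nb, block_size, d)
    kb = k[: nb * block_size].view(nb, block_size, d)
    vb = v[: nb * block_size].view(nb, block_size, d)

    # rank key blocks per query block using pooled representations
    q_pool, k_pool = qb.mean(1), kb.mean(1)              # (nb, d)
    scores = q_pool @ k_pool.T                           # (nb, nb)
    topk = scores.topk(min(top_k, nb), dim=-1).indices   # (nb, top_k)

    out = torch.zeros_like(qb)
    for i in range(nb):
        keys = kb[topk[i]].reshape(-1, d)                # selected key blocks only
        vals = vb[topk[i]].reshape(-1, d)
        attn = torch.softmax(qb[i] @ keys.T / d ** 0.5, dim=-1)
        out[i] = attn @ vals
    return out.reshape(-1, d)
```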

The RLHF stage: tuning for human preferences

After pretraining, the model isn’t done. It’s further tuned using Group Relative Policy Optimization (GRPO), basically a reinforcement learning method for generative models.

But unlike text models that just optimize a single “helpfulness” score, LongCat uses three different rewards:

  1. Visual Quality (VQ): judged by a frame-level image quality model (HPSv3).
  2. Motion Quality (MQ): checks for smoothness and natural motion using a VideoAlign-based model.
  3. Text Alignment (TA): measures how well the output matches the input prompt.

Instead of chasing one metric, the model balances all three. This prevents "reward hacking": for instance, a model that produces still but photorealistic frames could trick a visual-quality score, but it would fail the motion reward.
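Schematically, the training signal might combine the three rewards like this. The equal weighting and function names are illustrative assumptions, not values from the paper:

```python
def combined_reward(video, prompt, vq_model, mq_model, ta_model):
    """Illustrative multi-reward: balancing three scores makes it harder to
    game any single one (e.g. frozen-but-pretty frames fail motion quality)."""
    vq = vq_model(video)           # frame-level visual quality (HPSv3-style)
    mq = mq_model(video)           # motion smoothness / naturalness
    ta = ta_model(video, prompt)   # prompt-video alignment
    return (vq + mq + ta) / 3.0    # equal weights, purely for illustration
```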

The end result is a model that looks good, moves right, and stays faithful to the prompt.

Training from scratch: a slow climb

The team trained LongCat progressively: first on static images, then on short clips, and finally on full multi-task video datasets. The data pipeline included heavy preprocessing: scene detection, black-border cropping, and even optical flow estimation to filter out low-motion videos.
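That last filter is easy to picture. With OpenCV, a low-motion check might look like this; a simplified sketch of the idea with an arbitrary threshold, not the team's actual pipeline:

```python
import cv2
import numpy as np

def mean_motion(video_path, sample_every=5):
    """Estimate average optical-flow magnitude; clips below a threshold
    (nearly static footage) would be dropped from the training set."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    mags, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % sample_every:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(mags)) if mags else 0.0

# keep_clip = mean_motion("clip.mp4") > 0.5   # threshold is arbitrary here
```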

The training stack used DeepSpeed-Zero2, Ring Attention, and Context Parallelism, achieving around 33–38% hardware utilization. That’s impressive efficiency for a 13B parameter model running multimodal workloads.

How it performs

Benchmarks show LongCat standing toe-to-toe with top proprietary models. On internal evaluations, it outperforms PixVerse-V5 and Wan 2.2, and nearly matches Google Veo3.

On VBench 2.0, it ranks third overall, behind Veo3 and Shengshu's Vidu Q1, but ranks first among open models in commonsense reasoning, meaning it produces physically believable motion and cause-effect consistency.

It’s not just another video diffusion model. It’s built for duration, for the ability to stay coherent over time, which is what every “world model” eventually needs.

Why this matters

If you think about it, short video generation is a solved toy problem. The hard part is getting a model to hold memory, to remember what it drew five seconds ago and not contradict itself in the next frame.

LongCat proves that long-term coherence can be learned, not just patched in. It does it with clean architectural decisions: unified tasks, flow matching, sparse attention, and multi-reward training — not by adding massive parameter counts or fake frame interpolation.

It’s not perfect yet. Motion realism still falls behind top proprietary systems. But it’s an open, working foundation for anyone who wants to build long video AIs that think in motion, not just static scenes.

In short

LongCat-Video isn’t flashy. It’s methodical. It’s the first open model that scales time as a first-class dimension. In a field obsessed with visual fidelity, that’s a quiet but important shift.

The model is open-sourced and can be tested at meituan-longcat/LongCat-Video on Hugging Face.
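To pull the weights locally, the standard huggingface_hub download works; running inference itself follows the instructions in the model repository:

```python
from huggingface_hub import snapshot_download

# downloads the full LongCat-Video repository (weights + configs) to the local HF cache
local_dir = snapshot_download(repo_id="meituan-longcat/LongCat-Video")
print(local_dir)
```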

