DiffRhythm: Full-Length AI Song Generation (4 min) with Vocals

Open-source model for song generation

While the war between Grok-3 and GPT-4.5 for the best model rages on, a new open-source model for song generation has dropped, and it looks wild. It can create full-length songs of up to four minutes, complete with vocals.

Bye-bye, music industry.

What is DiffRhythm?

DiffRhythm is a groundbreaking song generation model capable of synthesizing full-length songs, including both vocals and accompaniment. It is the first latent diffusion-based model that can generate complete songs of up to 4 minutes and 45 seconds in just 10 seconds.

Designed for simplicity, scalability, and efficiency, DiffRhythm addresses key challenges in music generation, such as:

  • Combining vocals and accompaniment seamlessly
  • Maintaining long-term musical coherence
  • Achieving fast inference speeds

How DiffRhythm Works

DiffRhythm operates through a two-stage process, leveraging a Variational Autoencoder (VAE) and a Diffusion Transformer (DiT) to generate high-quality music efficiently.

1. Variational Autoencoder (VAE)

The VAE compresses raw audio into a compact latent space while preserving perceptual quality, reducing computational complexity when modeling long audio sequences.

Key aspects of the VAE:

  • Optimized for spectral reconstruction and adversarial training to enhance audio fidelity
  • Trained to handle MP3 compression artifacts, allowing high-quality reconstruction from lossy inputs
  • Shares the same latent space as the Stable Audio VAE, ensuring compatibility with existing latent diffusion frameworks
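To make the encode/decode idea concrete, here is a minimal, self-contained PyTorch sketch of a waveform VAE. Everything here is illustrative: the channel sizes, strides, latent dimension, and 16 kHz input are invented for the example and do not reflect DiffRhythm's actual architecture.

```python
import torch
import torch.nn as nn

class ToyAudioVAE(nn.Module):
    """Toy waveform VAE: strided 1-D convolutions compress audio into a
    short latent sequence; transposed convolutions reconstruct it.
    All channel sizes and strides are invented for this example."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4, padding=2), nn.GELU(),
        )
        self.to_mu = nn.Conv1d(64, latent_dim, kernel_size=1)
        self.to_logvar = nn.Conv1d(64, latent_dim, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, wav):
        h = self.encoder(wav)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

vae = ToyAudioVAE()
wav = torch.randn(1, 1, 16000)        # one second of fake audio at 16 kHz
recon, mu, logvar = vae(wav)
print(wav.shape, "->", mu.shape)      # the latent sequence is 16x shorter than the waveform
```

The point of the compression is visible in the printed shapes: the diffusion model only has to reason over a sequence 16x shorter than the raw waveform, which is what makes modeling a 4-minute song tractable.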

2. Diffusion Transformer (DiT)

The DiT generates songs by iteratively denoising latent representations conditioned on lyrics and style prompts.

What is a Diffusion Transformer?

A Diffusion Transformer (DiT) is a type of model that generates data (like images, audio, or text) by gradually refining noise into meaningful outputs. It works by repeatedly denoising random noise over several steps, guided by conditions (like text or style prompts). DiTs are powerful because they combine the iterative refinement of diffusion models with the flexibility and scalability of transformers, making them efficient for complex tasks like song or image generation.

Key aspects of the DiT:

  • Conditioned on three inputs:
      ◦ Style prompt (controls song genre/style)
      ◦ Timestep (indicates the current diffusion step)
      ◦ Lyrics (guides vocal generation)
  • Uses LLaMA decoder layers, originally developed for natural language processing
  • Incorporates FlashAttention-2 and gradient checkpointing to improve efficiency
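The following is a hypothetical, heavily simplified denoising loop showing how those three conditioning signals could feed a transformer denoiser. The dimensions, the 50-step schedule, and the crude update rule are all made up for illustration; this is not DiffRhythm's actual sampler.

```python
import torch
import torch.nn as nn

# All sizes below are hypothetical, chosen small for readability.
LATENT_DIM, SEQ_LEN, STEPS = 64, 128, 50

class ToyDiT(nn.Module):
    """Toy denoiser: a transformer predicting the clean latent from a
    noisy latent plus conditioning (style prompt, timestep, lyrics)."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, 256)
        self.t_embed = nn.Embedding(STEPS, 256)      # timestep embedding
        self.cond_proj = nn.Linear(3 * 256, 256)     # fuse the three conditions
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(256, LATENT_DIM)

    def forward(self, z_noisy, style, t, lyrics):
        cond = self.cond_proj(torch.cat([style, self.t_embed(t), lyrics], dim=-1))
        h = self.in_proj(z_noisy) + cond.unsqueeze(1)   # broadcast over the time axis
        return self.out_proj(self.blocks(h))

model = ToyDiT()
z = torch.randn(1, SEQ_LEN, LATENT_DIM)   # start from pure noise
style = torch.randn(1, 256)               # stand-in style-prompt embedding
lyrics = torch.randn(1, 256)              # stand-in pooled lyrics embedding
with torch.no_grad():
    for step in reversed(range(STEPS)):   # iterative refinement, noisy -> clean
        pred = model(z, style, torch.tensor([step]), lyrics)
        z = z + 0.1 * (pred - z)          # crude step toward the prediction
# z now plays the role of a clean song latent, ready for the VAE decoder
```

Because every denoising step refines the entire latent sequence at once rather than emitting it token by token, the loop runs in a fixed number of steps regardless of song length, which is where the non-autoregressive speedup comes from.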

3. Lyrics-to-Latent Alignment

To ensure vocals align accurately with lyrics, DiffRhythm introduces a sentence-level alignment mechanism, reducing the need for extensive supervision and improving coherence between lyrics and sparse vocal segments.
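As a rough illustration of what sentence-level (rather than phoneme-level) alignment means, the hypothetical helper below assigns each lyric sentence a starting latent frame and leaves the fine-grained timing within each span to the model. This sketches the general idea only; it is not DiffRhythm's actual alignment code.

```python
# Hypothetical sentence-level alignment: each lyric sentence is anchored
# to the start of an evenly spaced span of latent frames. Fine-grained
# word/phoneme timing inside a span is left for the model to learn.
def sentence_anchors(lyrics: str, num_frames: int) -> list[tuple[str, int]]:
    """Map each sentence to the latent frame where it should begin."""
    sentences = [s.strip() for s in lyrics.split(".") if s.strip()]
    span = num_frames // max(len(sentences), 1)
    return [(sent, i * span) for i, sent in enumerate(sentences)]

print(sentence_anchors("Hello world. Goodbye moon. See you soon.", 300))
# [('Hello world', 0), ('Goodbye moon', 100), ('See you soon', 200)]
```

Supervising only one anchor per sentence is far cheaper than word- or phoneme-level timestamps, which is why this mechanism reduces the need for extensive supervision.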

Key Features of DiffRhythm

End-to-End Song Generation

Generates full-length songs (up to 4m45s) in just 10 seconds, maintaining musicality and intelligibility.

Simple & Scalable Architecture

Eliminates the need for complex multi-stage cascading pipelines, making it easier to scale and deploy.

Lightning-Fast Inference

Thanks to its non-autoregressive design, DiffRhythm generates songs far faster than traditional autoregressive models, which produce long-form content one token at a time.

Robustness to MP3 Compression

Since the VAE is trained on MP3 artifacts, it can reconstruct high-quality audio even from lossy inputs — ideal for real-world applications.

Lyrics-to-Vocal Alignment

Uses a sentence-level alignment mechanism to ensure vocals match the lyrics accurately, even when vocals are sparse.

Open & Accessible

DiffRhythm’s training code, pre-trained models, and data processing pipeline are publicly available, fostering reproducibility and research in AI-driven music generation.

How to use DiffRhythm?

A demo app is freely available on Hugging Face:

DiffRhythm – a Hugging Face Space by ASLP-lab

Model weights are available here:
https://huggingface.co/ASLP-lab/DiffRhythm-base
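If you would rather run the model locally, a minimal sketch for fetching the checkpoint with the huggingface_hub library is shown below. The actual inference scripts live in the ASLP-lab repository; this only downloads the weights.

```python
# Minimal sketch: download the published DiffRhythm-base weights locally.
# Requires `pip install huggingface_hub`; inference itself is driven by the
# scripts shipped in the ASLP-lab repository, which this does not cover.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ASLP-lab/DiffRhythm-base")
print("Checkpoint downloaded to:", local_dir)
```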

Conclusion

DiffRhythm is a game-changer in AI-driven music generation, proving that full-length, high-quality songs with vocals can be created in mere seconds. With its latent diffusion-based approach, fast inference speeds, and open-source accessibility, this model sets a new benchmark for AI music generation.

Whether you’re an artist, producer, or just someone curious about AI’s impact on music, DiffRhythm offers a glimpse into the future — where creating music is as easy as generating text. As AI-generated content continues to push creative boundaries, one thing is certain: the music industry will never be the same.

Want to try it yourself? Check it out on Hugging Face and experience the future of AI song generation firsthand.

