Open-source model for song generation
While the war for the best model between Grok-3 and GPT-4.5 goes on, a new open-source model for song generation has dropped, and it looks wild. It can create full-length songs of up to 4 minutes, complete with vocals.
Bye bye music industry
What is DiffRhythm?
DiffRhythm is a groundbreaking song generation model capable of synthesizing full-length songs, including vocals and accompaniment tracks. It is the first latent diffusion-based model that can generate complete songs up to 4 minutes and 45 seconds in just 10 seconds.
Designed for simplicity, scalability, and efficiency, DiffRhythm addresses key challenges in music generation, such as:
Combining vocals and accompaniment seamlessly
Maintaining long-term musical coherence
Achieving fast inference speeds
How DiffRhythm Works

DiffRhythm operates through a two-stage process, leveraging a Variational Autoencoder (VAE) and a Diffusion Transformer (DiT) to generate high-quality music efficiently.
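To make that flow concrete, here is a minimal sketch of what the two-stage generation loop could look like. The function, tensor shapes, and update rule are hypothetical placeholders, not DiffRhythm's actual API; the point is the order of operations: start from noise in the latent space, let the DiT denoise it step by step under lyric and style conditioning, then decode the result with the VAE.

```python
import torch

def generate_song(lyrics, style_prompt, vae, dit, num_steps=32):
    """Hypothetical two-stage flow: the DiT denoises latents, the VAE decodes them."""
    # Start from pure Gaussian noise in the VAE's latent space.
    # (batch, channels, frames) -- placeholder shape, not DiffRhythm's real one.
    latents = torch.randn(1, 64, 2048)

    # Stage 2 (DiT): iteratively denoise, conditioned on lyrics, style, and timestep.
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step / num_steps)       # normalized diffusion timestep
        noise_pred = dit(latents, t, lyrics=lyrics, style=style_prompt)
        latents = latents - noise_pred / num_steps   # simplified Euler-style update

    # Stage 1 (VAE decoder): map the clean latents back to an audio waveform.
    return vae.decode(latents)
```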
1. Variational Autoencoder (VAE)
The VAE compresses raw audio into a compact latent space while preserving perceptual quality, reducing computational complexity when modeling long audio sequences.
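As a rough illustration of the compression idea (generic layer sizes, not DiffRhythm's actual architecture), a convolutional audio VAE might look like this:

```python
import torch
import torch.nn as nn

class AudioVAE(nn.Module):
    def __init__(self, latent_channels: int = 64):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform into a short latent sequence.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv1d(128, 2 * latent_channels, kernel_size=7, stride=4, padding=3),
        )
        # Transposed convolutions upsample the latents back to a waveform.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_channels, 128, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(128, 1, kernel_size=8, stride=4, padding=2),
        )

    def encode(self, wav):
        mean, logvar = self.encoder(wav).chunk(2, dim=1)
        # Reparameterization trick: sample a latent while keeping gradients.
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

    def decode(self, z):
        return self.decoder(z)

vae = AudioVAE()
wav = torch.randn(1, 1, 16000)   # one second of fake audio at 16 kHz
z = vae.encode(wav)              # compact latent sequence (16x shorter here)
recon = vae.decode(z)            # reconstructed waveform
```

The real model goes further, adding the spectral-reconstruction and adversarial objectives listed below and training on MP3-degraded inputs.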
Key aspects of the VAE:
Optimized for spectral reconstruction and adversarial training to enhance audio fidelity
Trained to handle MP3 compression artifacts, allowing high-quality reconstruction from lossy inputs
Shares the same latent space as Stable Audio VAE, ensuring compatibility with existing latent diffusion frameworks
2. Diffusion Transformer (DiT)
The DiT generates songs by iteratively denoising latent representations conditioned on lyrics and style prompts.
What is Diffusion Transformer?
A Diffusion Transformer (DiT) is a type of model that generates data (like images, audio, or text) by gradually refining noise into meaningful outputs. It works by repeatedly denoising random noise over several steps, guided by conditions (like text or style prompts). DiTs are powerful because they combine the iterative refinement of diffusion models with the flexibility and scalability of transformers, making them efficient for complex tasks like song or image generation.
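Below is a toy, single-block sketch of how those conditioning signals might be injected into a transformer block. The modules and shapes are hypothetical; DiffRhythm's actual stack is built from LLaMA-style decoder layers, as noted below.

```python
import torch
import torch.nn as nn

class ToyDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latents, timestep, style_emb, lyric_emb):
        # Timestep and style prompt are injected as additive conditioning.
        cond = self.time_mlp(timestep)[:, None, :] + style_emb[:, None, :]
        x = self.norm(latents + cond)
        # Lyric token embeddings join the attention context so the model
        # can align vocal content with the text.
        ctx = torch.cat([lyric_emb, x], dim=1)
        attn_out, _ = self.attn(x, ctx, ctx)
        x = x + attn_out
        return x + self.ff(self.norm(x))

block = ToyDiTBlock()
latents = torch.randn(1, 256, 512)   # noisy latent frames
t = torch.rand(1, 1)                 # normalized diffusion timestep
style = torch.randn(1, 512)          # style prompt embedding
lyrics = torch.randn(1, 64, 512)     # lyric token embeddings
out = block(latents, t, style, lyrics)
```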
Key aspects of the DiT:
- Conditioned on three inputs:
Style prompt (controls song genre/style)
Timestep (indicates current diffusion step)
Lyrics (guides vocal generation)
- Uses LLaMA decoder layers optimized for natural language processing
- Incorporates FlashAttention2 and gradient checkpointing to improve efficiency
3. Lyrics-to-Latent Alignment
To ensure vocals align accurately with lyrics, DiffRhythm introduces a sentence-level alignment mechanism, reducing the need for extensive supervision and improving coherence between lyrics and sparse vocal segments.
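As a simplified illustration of the sentence-level idea (the start frames here are assumed inputs; this is not the paper's exact mechanism), each lyric sentence can be broadcast over the latent frames it is meant to cover:

```python
import torch

def align_sentences(sentence_embs, start_frames, total_frames):
    """Broadcast each sentence embedding over its span of latent frames.

    sentence_embs: (num_sentences, dim) embeddings of the lyric sentences
    start_frames:  frame indices where each sentence begins (assumed known)
    total_frames:  length of the latent sequence
    """
    dim = sentence_embs.shape[1]
    aligned = torch.zeros(total_frames, dim)    # frames with no vocals stay zero
    bounds = list(start_frames) + [total_frames]
    for i, emb in enumerate(sentence_embs):
        aligned[bounds[i]:bounds[i + 1]] = emb  # every frame in the span sees its sentence
    return aligned

# Example: 3 lyric sentences spread across 2048 latent frames.
sentences = torch.randn(3, 512)
cond = align_sentences(sentences, start_frames=[100, 700, 1400], total_frames=2048)
```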
Key Features of DiffRhythm
End-to-End Song Generation
Generates full-length songs (up to 4m45s) in just 10 seconds, maintaining musicality and intelligibility.
Simple & Scalable Architecture
Eliminates the need for complex multi-stage cascading pipelines, making it easier to scale and deploy.
Lightning-Fast Inference
Thanks to its non-autoregressive design, DiffRhythm outperforms traditional autoregressive models, which are often slow in generating long-form content.
Robustness to MP3 Compression
Since the VAE is trained on MP3 artifacts, it can reconstruct high-quality audio even from lossy inputs — ideal for real-world applications.
Lyrics-to-Vocal Alignment
Uses a sentence-level alignment mechanism to ensure vocals match the lyrics accurately, even when vocals are sparse.
Open & Accessible
DiffRhythm’s training code, pre-trained models, and data processing pipeline are publicly available, fostering reproducibility and research in AI-driven music generation.
How to use DiffRhythm?
A free demo is deployed on Hugging Face Spaces:
DiffRhythm – a Hugging Face Space by ASLP-lab
Model weights are available here:
https://huggingface.co/ASLP-lab/DiffRhythm-base
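If you want the checkpoint locally, one option is a snapshot download via the huggingface_hub library (the inference code itself lives in the DiffRhythm repository and is not shown here):

```python
from huggingface_hub import snapshot_download

# Downloads the released DiffRhythm-base weights to the local HF cache.
local_dir = snapshot_download(repo_id="ASLP-lab/DiffRhythm-base")
print(f"Model weights downloaded to: {local_dir}")
```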
Conclusion
DiffRhythm is a game-changer in AI-driven music generation, proving that full-length, high-quality songs with vocals can be created in mere seconds. With its latent diffusion-based approach, fast inference speeds, and open-source accessibility, this model sets a new benchmark for AI music generation.
Whether you’re an artist, producer, or just someone curious about AI’s impact on music, DiffRhythm offers a glimpse into the future — where creating music is as easy as generating text. As AI-generated content continues to push creative boundaries, one thing is certain: the music industry will never be the same.
Want to try it yourself? Check it out on Hugging Face and experience the future of AI song generation firsthand.