Conversational Speech AI Model that can talk
AI influencers and experts predicted 2025 to be the year of AI Agents. But with roughly a quarter of 2025 already behind us, the space has so far been dominated by two things:
- Reasoning LLMs
- Audio AI models/Agents
Yet again, we’ve got a unique release in the Audio AI domain: Sesame has released CSM-1B, which is more than just a TTS model.
Sesame CSM-1B is a conversational speech model
How CSM-1B Differs from Traditional TTS Models
1. Conversational Focus
- Traditional TTS models generate speech from standalone text inputs.
- CSM-1B, however, is optimized for conversational contexts, leveraging previous dialogue turns to produce more natural and coherent speech.
In other words, it generates conversational speech rather than performing a one-off text-to-speech conversion.
2. Multi-Speaker Support
- Unlike typical TTS models, CSM-1B supports multiple speakers, allowing it to generate distinct voices in a dialogue.
- This makes it ideal for virtual assistants, interactive voice systems, and chatbots.
3. RVQ Audio Codes
- Instead of generating raw audio waveforms, CSM-1B produces Residual Vector Quantization (RVQ) audio codes, which are then decoded into speech.
- This method ensures high-quality, natural-sounding speech while maintaining computational efficiency.
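To make the RVQ idea concrete, here is a toy NumPy sketch (not Sesame’s actual codec, which relies on the Mimi tokenizer): each codebook stage quantizes the residual left over by the previous stage, so a short list of small integer codes can approximate a continuous vector.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes what the
    previous stages could not capture (the residual)."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                        # cb shape: (num_entries, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)                       # keep the nearest entry's index
        residual = residual - cb[idx]           # pass the leftover to the next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is simply the sum of the selected codebook entries."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]   # 4 stages of 256 entries each
x = rng.normal(size=8)
codes = rvq_encode(x, codebooks)                            # four small integers instead of 8 floats
print("codes:", codes)
print("reconstruction error:", np.linalg.norm(x - rvq_decode(codes, codebooks)))
```

Each extra codebook stage shrinks the reconstruction error, which is why a stack of small codebooks can stand in for raw waveform samples so compactly.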
4. No Predefined Voices
- Many TTS models come with pre-trained voices, but CSM-1B is a base-generation model.
- It does not have fine-tuned voices, making it highly flexible and adaptable for custom voice generation.
5. Integration with LLMs
- CSM-1B is not a general-purpose multimodal Large Language Model (LLM) and cannot generate text.
- It is designed to work alongside an LLM for text generation, focusing exclusively on speech synthesis.
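In practice, that means the two models split the work: the LLM writes the reply, CSM-1B only voices it. Here is a minimal sketch of that split, where `get_llm_reply` is a hypothetical placeholder for whatever text model you pair it with (the `load_csm_1b` helper comes from the SesameAILabs/csm repo, and its exact signature may change):

```python
# Hypothetical glue code: the LLM produces the reply text, CSM-1B only speaks it.
from generator import load_csm_1b   # helper from the SesameAILabs/csm repo
import torchaudio

def get_llm_reply(user_message: str) -> str:
    # Placeholder: call any text LLM here (Llama, GPT, etc.)
    return "Sure, I can help you with that."

generator = load_csm_1b(device="cuda")

reply_text = get_llm_reply("Can you help me reset my password?")
audio = generator.generate(
    text=reply_text,             # the text comes from the LLM, not from CSM-1B
    speaker=0,                   # voice identity is just an integer speaker ID
    context=[],                  # previous dialogue turns could be passed here
    max_audio_length_ms=10_000,
)
torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```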
6. Architecture
- Transformer-Based: Uses a Llama backbone, a transformer architecture known for efficiency and scalability.
- Text & Audio Encoding: Processes both text inputs and audio prompts to generate intermediate representations.
- Contextual Understanding: Captures long-range dependencies and contextual information, making it ideal for conversational speech generation.
How the Architecture Works
- Input Processing: Accepts text inputs (e.g., “Hello from Sesame”) and, optionally, audio prompts from previous turns. Speaker identity is supplied as a simple speaker ID.
- Encoding: The Llama backbone processes text and audio inputs, generating semantic and contextual representations.
- RVQ Code Generation: The audio decoder converts these representations into RVQ audio codes, a compact speech format.
- Decoding: The RVQ codes are decoded back into a high-quality speech waveform by the audio codec’s decoder (Mimi).
- Output: The final result is a natural-sounding speech waveform, which can be saved as an audio file (e.g., WAV format).
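As a rough illustration of that pipeline in code, here is a minimal sketch following the usage shown in the SesameAILabs/csm README at the time of writing (function names may change, so check the repo): the single `generate` call covers the encoding, RVQ code generation, and decoding steps internally, so you only see text in and a waveform out.

```python
import torch
import torchaudio
from generator import load_csm_1b   # helper from the SesameAILabs/csm repo

# Pick a device (the model is most comfortable on a GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

generator = load_csm_1b(device=device)

# Input processing: text plus a speaker ID; audio context is optional
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],                  # no previous dialogue turns in this example
    max_audio_length_ms=10_000,
)
# Encoding, RVQ code generation, and decoding all happen inside generate();
# what comes back is the final waveform as a 1-D tensor.

# Output: save the waveform as a WAV file at the model's sample rate
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```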
Advantages of the Architecture
- Efficiency: Combines a powerful transformer backbone with a lightweight decoder for fast and efficient speech generation.
- High-Quality Speech: Uses RVQ and Mimi codes to ensure speech remains natural-sounding and expressive.
- Contextual Awareness: Generates coherent and contextually appropriate speech, making it ideal for conversational applications.
- Flexibility: Supports multiple speakers and contextual prompts, making it adaptable for various use cases.
- Scalability: The architecture is designed to scale, allowing for future improvements and adaptations.
How to use Sesame CSM-1B?
The model is open-sourced and can be accessed here
GitHub – SesameAILabs/csm: A Conversational Speech Generation Model
All the code is available in the official GitHub repo. If you don’t want to deploy it yourself, you can try the demo here
Crossing the uncanny valley of conversational voice
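Since conversational context and multi-speaker support are the model’s main selling points, here is a hedged sketch of how previous dialogue turns might be passed in, again following the repo’s README-style API (the `Segment` helper and the audio file names are illustrative, so adapt them to your own clips):

```python
import torchaudio
from generator import load_csm_1b, Segment   # helpers from the SesameAILabs/csm repo

generator = load_csm_1b(device="cuda")

def load_clip(path):
    # Resample each context clip to the model's expected sample rate
    audio, sr = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
    )

# Previous dialogue turns: text, integer speaker ID, and the matching audio
# (the file names are placeholders for your own recordings)
context = [
    Segment(text="Hey, did you try the new speech model?", speaker=0,
            audio=load_clip("turn_0.wav")),
    Segment(text="Yeah, it sounds surprisingly natural.", speaker=1,
            audio=load_clip("turn_1.wav")),
]

# Generate the next turn in the same conversation, as speaker 0
audio = generator.generate(
    text="Right? The pauses and intonation feel almost human.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("turn_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```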
This is a genuinely unique release: given the conversational capabilities of Sesame CSM-1B, building applications like Character.ai-style companions or customer support AI agents is easier than ever.
Hope you try out this revolutionary model!