KaniTTS: The fastest TTS model for Conversational AI is here
How to use KaniTTS for free?
The race in text-to-speech isn’t about realism anymore, it’s about responsiveness. We’ve hit a point where voices can already fool the ear, but they can’t yet keep up with real-time dialogue. That’s the gap KaniTTS tries to close. It doesn’t dream about mimicking human imperfection like some over-engineered lab model. It just wants to talk back, now.
This model, built by NineNineSix, is small enough to run on consumer hardware yet fast enough to make real-time audio possible even for edge applications. It’s like someone finally remembered that speech synthesis is part of a conversation, not an art exhibition.
The Two-Step Brain
KaniTTS isn’t one of those end-to-end waveform monsters. It runs on a two-stage pipeline, a simple but sharp architecture that trades brute-force computation for precision and speed.
- Stage one: a language backbone based on LiquidAI’s LFM2 (350M) model. This part doesn’t make sound; it compresses the text into what you could call “audio intent”, token representations that capture rhythm, pauses, and phrasing.
- Stage two: NVIDIA NanoCodec, a neural codec that takes those compact tokens and expands them back into actual waveforms. It’s not generating raw sound from scratch every time; it’s decoding compressed acoustic signals.
That pairing of LLM backbone and codec decoder is where the magic happens. It’s clean, modular, and absurdly fast.
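To make the two-stage idea concrete, here’s a minimal sketch in Python. The class and method names are hypothetical placeholders standing in for the LFM2 backbone and the NanoCodec decoder, not the actual KaniTTS API:

```python
# Conceptual sketch of the two-stage pipeline.
# NOTE: class and method names are hypothetical placeholders, not the real API.

import numpy as np

class TokenBackbone:
    """Stage 1: LLM-style backbone (LFM2-based) that turns text into
    compact 'audio intent' tokens capturing rhythm, pauses, and phrasing."""
    def generate_audio_tokens(self, text: str) -> list[int]:
        # In the real model this is autoregressive token generation.
        return [hash(word) % 4096 for word in text.split()]  # dummy tokens

class CodecDecoder:
    """Stage 2: neural codec (NanoCodec-style) that expands those tokens
    back into a waveform instead of synthesizing raw audio from scratch."""
    def decode_to_waveform(self, tokens: list[int], sample_rate: int = 22050) -> np.ndarray:
        # Placeholder: the real decoder maps codec tokens to audio frames.
        return np.zeros(len(tokens) * sample_rate // 10, dtype=np.float32)

def synthesize(text: str) -> np.ndarray:
    backbone, codec = TokenBackbone(), CodecDecoder()
    tokens = backbone.generate_audio_tokens(text)   # text -> audio intent
    return codec.decode_to_waveform(tokens)          # tokens -> waveform

audio = synthesize("Hey, how can I help you today?")
```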
The Speed Story
Speed here isn’t a side benefit; it’s the design goal.
On an RTX 5090, the Real-Time Factor (RTF) clocks in at 0.19. That means for every second of speech, it needs only 0.19 seconds of compute. In other words, it generates audio faster than you can listen to it.
And the beauty is: that speed doesn’t collapse on smaller GPUs.
- On an RTX 4080, it hits 0.20 RTF, basically identical.
- Even on a 3060, it stays at 0.6 RTF, still faster than real-time. The GPU memory footprint sits around 16 GB, which makes it surprisingly lightweight for what it does.
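If the RTF figures feel abstract, the arithmetic is simple: RTF is processing time divided by audio duration, so anything below 1.0 means the model outpaces playback.

```python
# Real-Time Factor (RTF) = processing_time / audio_duration.
# Values below 1.0 mean the model generates audio faster than it plays back.

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# Figures quoted above: ~0.19 on an RTX 5090, ~0.20 on a 4080, ~0.6 on a 3060.
for gpu, proc_time in [("RTX 5090", 1.9), ("RTX 4080", 2.0), ("RTX 3060", 6.0)]:
    print(f"{gpu}: RTF = {rtf(proc_time, 10.0):.2f} for 10 s of speech")
```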
You could run it on Vast AI for pocket change: around $0.09/hr on a 3060-tier card. That’s enough to host your own conversational TTS backend if you don’t want to touch the big APIs.
The Sound
The audio output sits at 22 kHz, not studio-master quality, but perfectly clear for dialogue. The speech has a conversational tone: not flat, not over-acted. There’s some variability in pitch and pacing, but it doesn’t wander into fake enthusiasm. It sounds like someone speaking on a call rather than narrating a commercial, which honestly fits most real-world use cases better.
Out of the box, you get two voices, Andrew and Katie. They sound balanced and neutral, not overly filtered. The expressivity is moderate, though you can fine-tune emotional depth or accent by retraining the NanoCodec.
Trained on Real Conversations
Most open-source TTS models still rely on synthetic or audiobook-style data. KaniTTS pulls from more realistic corpora:
- Emolia (LAION): a large multilingual dataset with emotional labeling
- Expresso-Conversational: natural dialogue and conversational-tone recordings
- Andrew-v3: a smaller, speaker-focused dataset with cleaner phonetics
This mix gives the model a relaxed rhythm: it doesn’t sound like someone reading a paragraph; it sounds like someone talking. That’s subtle but crucial for assistant-style speech.
Engineering for Real-Time AI
The model size is 400 million parameters, tiny compared to other speech models that often exceed a billion. That’s part of the point: small enough for real-time edge deployment but trained smartly enough to keep quality high.
It integrates neatly with vLLM, meaning you can host it the same way you’d serve an LLM like GPT or Mistral. The OpenAI-compatible API setup means existing chat pipelines can simply switch the backend and get voice out without rewriting everything.
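Here’s a hedged example of what that can look like from the client side, using the standard OpenAI Python SDK pointed at a local server. The base URL, model id, and voice id below are assumptions; check the kanitts-vllm docs for the values your deployment actually exposes:

```python
# Hedged example: calling a locally hosted, OpenAI-compatible KaniTTS server.
# The base URL, model name, and voice id are assumptions for illustration.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.audio.speech.create(
    model="kani-tts-400m-en",   # assumed model id
    voice="andrew",             # assumed voice id (Andrew or Katie)
    input="Your order has shipped and should arrive on Thursday.",
)

# Write the returned audio bytes to disk (format depends on the server).
with open("reply.audio", "wb") as f:
    f.write(response.content)
```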
They even built wrappers for ComfyUI and a Next.js demo app, so developers can plug it into visual workflows or web-based chat tools without touching low-level code.
The Fine-Tuning Pipeline
NineNineSix didn’t just drop the model; they dropped the whole ecosystem.
- Fine-tuning pipeline: KaniTTS-Finetune-pipeline
- Dataset builder: NanoCodec Dataset Pipeline
- Demo apps and API examples: kanitts-vllm
You can collect your own dataset, compress it with NanoCodec, and train a custom voice. That’s a big shift from typical “play with our demo” releases. It’s built for builders.
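The dataset builder handles the codec-token side itself; as a rough idea of the prep work that usually comes before it, here’s a hedged sketch that resamples recordings to 22 kHz and writes a transcript manifest. The folder layout and manifest fields are assumptions for illustration, not the pipeline’s documented format:

```python
# Hedged sketch: pairing audio with transcripts before running the
# NanoCodec dataset pipeline. Directory layout and manifest schema are
# assumptions for illustration, not the pipeline's actual format.

import json
from pathlib import Path

import librosa
import soundfile as sf

SRC = Path("recordings")       # assumed: .wav files with matching .txt transcripts
DST = Path("dataset_22khz")
DST.mkdir(exist_ok=True)
TARGET_SR = 22050              # matches the model's 22 kHz output

with open(DST / "manifest.jsonl", "w") as manifest:
    for wav_path in sorted(SRC.glob("*.wav")):
        audio, sr = librosa.load(wav_path, sr=TARGET_SR)  # load and resample
        out_path = DST / wav_path.name
        sf.write(out_path, audio, TARGET_SR)
        transcript = wav_path.with_suffix(".txt").read_text().strip()
        manifest.write(json.dumps({
            "audio_filepath": str(out_path),
            "text": transcript,
            "duration": round(len(audio) / TARGET_SR, 2),
        }) + "\n")
```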
Practical Limitations
No model’s perfect. KaniTTS starts to lose structure when you feed it long text: anything over 15 seconds of speech needs to be chunked into smaller pieces. It also isn’t emotionally expressive out of the box. If you want someone to sound heartbroken or ecstatic, you’ll have to fine-tune on emotional data.
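A simple workaround is to chunk text at sentence boundaries before synthesis. Here’s a minimal sketch; the 2.5 words-per-second speaking rate is a rough assumption you’d tune for your own voice:

```python
# Minimal sketch: split long text into sentence-based chunks so each request
# stays under the ~15-second limit noted above. The speaking rate is a rough
# assumption; tune it for your voice and content.

import re

WORDS_PER_SECOND = 2.5          # assumed average speaking rate
MAX_SECONDS = 15
MAX_WORDS = int(WORDS_PER_SECOND * MAX_SECONDS)

def chunk_text(text: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > MAX_WORDS:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be sent to the TTS backend as a separate request
# and the resulting audio segments concatenated in order.
for piece in chunk_text("A long assistant reply goes here. " * 20):
    print(len(piece.split()), "words")
```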
And yes, it’s still English-only for now. The multilingual expansion will need additional NanoCodec retraining, though the devs hint at ongoing work in that direction.
It also inherits some subtle bias, certain intonations or pacing quirks, from its training data. You might notice this if you synthesize long monologues.
Where It Fits
KaniTTS isn’t meant for cinematic dubbing or synthetic YouTube narrations. It’s designed for live systems, places where latency kills experience.
- Chatbots and customer support agents
- Accessibility tools that need instant feedback
- Interactive learning apps or AI companions
- Edge deployments where GPU time costs money
It’s practical, not performative.
Responsible Use
The devs are explicit about restrictions: no impersonation, no harassment, no fake content generation. Given how fast and natural it sounds, those boundaries matter.
They also emphasize ethical deployment, which, honestly, is something we need more of in open TTS right now.
Comparing It with Others
If you’ve used XTTS, Bark, or CosyVoice, KaniTTS feels leaner. XTTS is strong on expressiveness but slower and heavier. Bark sounds great but stumbles on speed and consistency. KaniTTS trades a bit of emotional nuance for real-time reliability. It’s the model you pick when latency matters more than dramatic delivery.
nineninesix/kani-tts-400m-en · Hugging Face
Final Take
KaniTTS feels like what Whisper was for speech recognition, a tool that shifts the baseline. Not flashy, but usable.
It proves that you can have low latency, open-source speech synthesis without selling your GPU farm.
If you’re building an assistant that talks, or an app that needs to feel alive, this one deserves a spot in your toolkit. It’s fast, free, and engineered with the right kind of pragmatism, built for the world where speed is the intelligence.