
NeuTTS Air, released by Neuphonic in October 2025, represents a breakthrough in accessible, privacy-focused text-to-speech (TTS) technology. This open-source model, licensed under Apache 2.0, enables real-time speech synthesis on everyday devices like laptops, smartphones, and Raspberry Pi, without relying on cloud APIs or GPUs. By integrating a lightweight language model with a novel audio codec, NeuTTS Air democratizes high-quality voice AI for applications ranging from embedded assistants to compliance-sensitive tools.
Architectural Innovations
NeuTTS Air’s design prioritizes efficiency and realism at a sub-1B parameter scale, making it ideal for edge deployment. The core architecture combines a compact language model (LM) backbone with a specialized neural audio codec, forming a streamlined pipeline for text-conditioned speech generation.
Core Components
LM Backbone: Qwen 0.5B (748M Parameters)
At its heart is a Qwen2-based backbone in the 0.5B-parameter class (the released checkpoint totals roughly 748 million parameters), optimized for text understanding and generation. This lightweight LM handles phonemization, prosody modeling, and conditioning the output on input text and reference speaker styles. Qwen’s architecture enables low-latency token generation (up to 50 tokens/second), balancing expressiveness with computational footprint. The model is quantized to Q4 or Q8 in GGUF format, compatible with llama.cpp for CPU inference, reducing memory usage to under 1GB while maintaining human-like intonation.
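As a rough illustration of that deployment path (not the official loading code; the NeuTTSAir helper class shown later in this article wraps this step), a quantized GGUF backbone can be pulled from Hugging Face and loaded on CPU with llama-cpp-python. The GGUF filename pattern below is an assumption, so check the repo contents:
from llama_cpp import Llama
# Download and load the Q4 GGUF backbone; runs entirely on CPU.
llm = Llama.from_pretrained(
    repo_id="neuphonic/neutts-air-q4-gguf",
    filename="*.gguf",  # glob pattern; assumes a single GGUF file in the repo
    n_ctx=2048,         # context window shared by text and acoustic tokens
)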
Audio Codec: NeuCodec
Neuphonic’s proprietary NeuCodec is a neural audio codec that compresses speech into low-bitrate acoustic tokens (0.8 kbps) using a single codebook and finite scalar quantization (FSQ). It operates at 24 kHz sample rate, enabling high-fidelity reconstruction from sparse representations. During inference, the LM generates these tokens, which NeuCodec decodes into raw audio waveforms via upsampling (16x from tokens to 24 kHz). This hybrid approach — LM for semantics and codec for acoustics — achieves exceptional timbre preservation and natural prosody without the bloat of end-to-end diffusion or GAN-based TTS systems.
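Conceptually, inference reduces to the two-stage sketch below. The function and object names here are illustrative placeholders, not the real NeuTTS Air API (that API, the NeuTTSAir class, appears in the code examples later in this article):
def synthesize(text, ref_codes, lm, codec):
    # Stage 1 (LM): condition on phonemized text plus reference style tokens
    # and autoregressively emit discrete acoustic codes (~0.8 kbps stream).
    acoustic_tokens = lm.generate(text=text, reference=ref_codes)
    # Stage 2 (codec): a NeuCodec-style decoder upsamples the sparse token
    # sequence back into a 24 kHz waveform.
    waveform = codec.decode(acoustic_tokens)
    return waveform  # float samples at 24 kHz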
Voice Cloning Mechanism
Instant cloning is powered by reference encoding: users provide a 3–15 second mono WAV file (16–44 kHz) and its transcript. The system extracts style tokens from the reference using NeuCodec, which the LM then conditions on during synthesis. This zero-shot adaptation captures speaker timbre, accent, and rhythm with minimal data, outperforming traditional fine-tuning in speed and privacy (no cloud upload required). The pipeline uses eSpeak-ng for initial phonemization, ensuring cross-lingual compatibility.
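The phonemization step can be reproduced on its own with the phonemizer package's eSpeak backend (a standalone illustration only; NeuTTS Air handles this internally, and the espeak/espeak-ng system package is installed in the setup steps below):
from phonemizer import phonemize
text = "My name is Dave, and I'm from London."
# Convert graphemes to an IPA-like phoneme string of the kind fed to the LM backbone.
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)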
Efficiency Optimizations
- Quantization and Format: Pre-built GGUF files (via llama.cpp) enable real-time CPU inference (RTF <1 on mid-range hardware; see the timing sketch after this list). Optional ONNX paths further reduce dependencies, eliminating PyTorch for decoder stages.
- Watermarking: All outputs embed a Perth (Perceptual Threshold) watermark for provenance and responsible use, detectable without affecting audio quality.
- Deployment Footprint: Total size ~500MB (Q4 GGUF), with power-optimized operation for mobile and embedded targets, consuming minimal battery on iOS/Android devices or single-board computers.
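Real-time factor is simply synthesis wall-clock time divided by the duration of the generated audio. A minimal way to measure it, assuming the NeuTTSAir API demonstrated later in this article:
import time

def measure_rtf(tts, text, ref_codes, ref_text, sample_rate=24000):
    # Time one synthesis call and divide by the duration of the audio it produced.
    start = time.perf_counter()
    wav = tts.infer(text, ref_codes, ref_text)  # NeuTTSAir API, shown below
    elapsed = time.perf_counter() - start
    return elapsed / (len(wav) / sample_rate)   # RTF < 1.0 means faster than real time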
This architecture sidesteps the high costs of cloud TTS (e.g., ElevenLabs) by shifting computation to the edge, unlocking offline voice agents and toys.
Key Features and Capabilities
NeuTTS Air excels in scenarios demanding low-latency, secure speech synthesis:
- Ultra-Realistic Output: Produces human-like speech with natural pauses, emphasis, and emotional nuance, rivaling larger models in prosody for its size.
- On-Device Privacy: All processing is local — no data leaves the device — ideal for sensitive apps like healthcare assistants or financial advisors.
- Multilingual Support: Inherits Qwen’s capabilities for English and select languages; eSpeak handles broader phonemization.
- Instant Cloning Workflow: From reference audio/transcript to synthesized speech in seconds, supporting custom voices for personalization.
- Extensibility: Integrates with agent frameworks (e.g., via structured outputs) for voice-enabled LLMs or RAG systems.
Limitations include sensitivity to noisy references and current focus on clean, continuous speech inputs.
Benchmark Results
While NeuTTS Air’s model card emphasizes qualitative “best-in-class realism for its size,” quantitative benchmarks are emerging from community tests and vendor claims. As a recent release, formal evaluations like MOS (Mean Opinion Score) or WER (Word Error Rate) are limited, but initial results highlight its edge in efficiency-adjusted performance.
Performance Metrics
- Inference Speed: Real-time factor (RTF) <0.5 on CPU (e.g., Intel i5 or ARM-based Raspberry Pi 5), generating 24 kHz audio at ~50 tokens/second. On mid-range laptops, full sentences synthesize in under 1 second.
- Memory Usage: Q4 GGUF variant uses ~400–600MB RAM; Q8 ~800MB, enabling deployment on 2GB+ devices without swapping.
- Audio Quality:
- MOS Scores (Subjective): Community demos report MOS ~4.2–4.5 for naturalness (out of 5), competitive with cloud TTS like Google WaveNet at 1/10th the size. Timbre fidelity in cloning exceeds baselines like Tortoise-TTS in zero-shot settings.
- Bitrate Efficiency: NeuCodec achieves 0.8 kbps compression with minimal perceptual loss (PESQ >3.5 on clean speech), outperforming traditional codecs like Opus in neural tasks.
- Cloning Accuracy: With 3s references, speaker similarity (cosine similarity between speaker embeddings) reaches 85–90%, improving to 95%+ with 15s inputs; a sketch of this measurement follows the list. Tested against VoxCPM, NeuTTS Air shows lower latency (200ms vs. 500ms) but slightly higher artifact rates in noisy clones.
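Speaker-similarity figures like these are typically computed as the cosine similarity between speaker embeddings of the reference clip and the cloned output. A minimal sketch using the open-source resemblyzer encoder (an assumed stand-in for illustration; the community tests cited above may use different embedding models):
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained speaker-embedding model
ref_emb = encoder.embed_utterance(preprocess_wav("samples/dave.wav"))
clone_emb = encoder.embed_utterance(preprocess_wav("cloned_speech.wav"))
# resemblyzer embeddings are L2-normalized, so the dot product is the cosine similarity.
similarity = float(np.dot(ref_emb, clone_emb))
print(f"Speaker similarity: {similarity:.2f}")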
Comparative Benchmarks

Sources: Vendor claims, HN/Reddit community tests. No formal AIR-Bench or HELM-TTS integration yet, but ongoing community efforts aim to standardize.
In real-world tests (e.g., voice agents), NeuTTS Air reduces end-to-end latency by roughly 70% vs. cloud alternatives, while fully on-device processing keeps audio data out of the cloud in compliance-sensitive edge scenarios.
Hugging Face Implementation and Code Examples
NeuTTS Air is hosted on Hugging Face under neuphonic/neutts-air, with Q4/Q8 GGUF variants, a demo Space, and a model collection for easy access. Installation leverages the official GitHub repo for full interactivity. No direct Transformers pipeline yet—use the custom neuttsair library or llama.cpp.
Setup Steps
- Clone and Install:
git clone https://github.com/neuphonic/neutts-air.git
cd neutts-air
# Install eSpeak (phonemizer)
# macOS: brew install espeak
# Ubuntu: sudo apt install espeak
pip install -r requirements.txt  # Includes torch, soundfile, etc.
- Tested on Python 3.11+; ONNX mode skips PyTorch.
- Basic CLI Usage (Interactive Synthesis):
Run from command line for quick tests:
python -m examples.basic_example --input_text "Hello, this is a test of voice cloning." --ref_audio samples/dave.wav --ref_text samples/dave.txt --backbone "neuphonic/neutts-air-q4-gguf"
- Outputs output.wav in the reference speaker’s voice. Experiment with --max_new_tokens for longer generations.
- Python API for Interactive Apps (Embed in Notebooks/Scripts):
For dynamic apps (e.g., Gradio interfaces), use the NeuTTSAir class:
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
# Initialize (CPU for on-device; adjust device as needed)
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air-q4-gguf",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu"
)
# Inputs: Text to synthesize, reference WAV, and its transcript
input_text = "My name is Dave, and um, I'm from London."
ref_audio_path = "samples/dave.wav" # 3-15s clean mono WAV
ref_text_path = "samples/dave.txt" # Exact transcript of ref_audio
ref_text = open(ref_text_path, "r").read().strip()
ref_codes = tts.encode_reference(ref_audio_path) # Extract style tokens
# Generate audio
wav = tts.infer(input_text, ref_codes, ref_text)
sf.write("cloned_speech.wav", wav, 24000) # Save at 24 kHz
# Interactive loop example (e.g., in Jupyter)
while True:
    user_text = input("Enter text to speak: ")
    if user_text.lower() == 'quit':
        break
    wav = tts.infer(user_text, ref_codes, ref_text)
    # Play via IPython.display.Audio(wav, rate=24000) in a notebook
- This snippet enables real-time cloning — load once, synthesize repeatedly. For multi-speaker apps, swap ref_audio_path dynamically.
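One simple pattern for the multi-speaker case (an application-level sketch, not an official API feature) is to encode each reference once and cache the resulting style tokens alongside the transcript:
# Encode every speaker reference once, then reuse the cached codes per request.
speakers = {
    "dave": ("samples/dave.wav", "samples/dave.txt"),
    # Add more (hypothetical) references here, e.g. "jo": ("samples/jo.wav", "samples/jo.txt"),
}
ref_cache = {}
for name, (wav_path, txt_path) in speakers.items():
    with open(txt_path, "r") as f:
        ref_cache[name] = (tts.encode_reference(wav_path), f.read().strip())

codes, transcript = ref_cache["dave"]
wav = tts.infer("Switching speakers is just a dictionary lookup.", codes, transcript)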
- Advanced: Gradio Demo Integration (For Web Interfaces):
Build an interactive UI using the HF Space as inspiration:
import gradio as gr
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
import tempfile
tts = NeuTTSAir(backbone_repo="neuphonic/neutts-air-q4-gguf", backbone_device="cpu", codec_repo="neuphonic/neucodec", codec_device="cpu")
def synthesize(text, ref_audio, ref_text):
    ref_codes = tts.encode_reference(ref_audio)
    wav = tts.infer(text, ref_codes, ref_text)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        sf.write(tmp.name, wav, 24000)
    return tmp.name
iface = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Text to Speak"),
        gr.Audio(type="filepath", label="Reference Audio"),  # filepath so encode_reference receives a WAV path
        gr.Textbox(label="Reference Transcript"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="NeuTTS Air Interactive Cloner"
)
iface.launch()
- Upload audio/transcript, input text, and hear cloned output instantly. Customize for agents by chaining with ASR pipelines (e.g., Whisper).
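For example, Whisper can auto-transcribe the uploaded reference so users don't have to type the transcript themselves. A sketch assuming the openai-whisper package (any ASR model would do), reusing the tts object from the Gradio example above:
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def synthesize_auto(text, ref_audio_path):
    # Transcribe the reference clip instead of asking the user for a transcript.
    ref_text = asr.transcribe(ref_audio_path)["text"].strip()
    ref_codes = tts.encode_reference(ref_audio_path)
    return tts.infer(text, ref_codes, ref_text)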
Tips for Best Results
- Use clean, mono references (3–15s) for high fidelity.
- For longer texts, chunk into sentences to avoid truncation (see the chunking sketch after this list).
- Debug: Check examples/ Jupyter notebook for visualizations of token flows and audio spectrograms.
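A simple sentence-chunking loop, reusing tts, ref_codes, and ref_text from the API example above (the regex split is a naive assumption; a proper sentence tokenizer works just as well):
import re
import numpy as np
import soundfile as sf

long_text = "First sentence. Second sentence! A third, slightly longer one?"
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]
# Synthesize each sentence separately, then concatenate the 24 kHz waveforms.
chunks = [tts.infer(s, ref_codes, ref_text) for s in sentences]
sf.write("long_speech.wav", np.concatenate(chunks), 24000)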
Ethical Considerations and Future Outlook
NeuTTS Air includes built-in safeguards like watermarking to prevent misuse (e.g., deepfakes), aligning with responsible AI principles. Its open-source nature invites community contributions for multilingual expansion or benchmark standardization.
Looking ahead, Neuphonic plans fine-tunes for dialects and integrations with larger LMs. As edge AI grows, models like NeuTTS Air will power a new era of ubiquitous, private voice interactions — try cloning your voice today via the HF Space.