
NVIDIA’s Canary-Qwen-2.5B is a groundbreaking hybrid model that combines Automatic Speech Recognition (ASR) with a Large Language Model (LLM), setting a new standard in speech-to-text transcription and language understanding. Released on July 17, 2025, this model has achieved the top position on the Hugging Face OpenASR Leaderboard with an impressive Word Error Rate (WER) of 5.63%, outperforming all prior open-source models. Its innovative architecture, built on the NVIDIA NeMo framework, integrates speech recognition with advanced language processing capabilities, enabling not only transcription but also downstream tasks like summarization and question-answering directly from audio inputs. This article explores the technical details, performance metrics, and practical applications of Canary-Qwen-2.5B, along with a code example for inferencing using the Hugging Face platform.
Technical Overview
Canary-Qwen-2.5B is a Speech-Augmented Language Model (SALM) that combines the strengths of two base models: NVIDIA’s Canary-1B-Flash and Qwen’s Qwen3-1.7B. The model employs a FastConformer encoder for audio feature extraction and the Qwen3 Transformer decoder for text generation, joined by a linear projection that maps audio features into the LLM’s embedding space, with Low-Rank Adaptation (LoRA) applied to the LLM component. This architecture allows the model to process audio inputs (in .wav or .flac formats) together with text prompts, producing accurate transcriptions with punctuation and capitalization, as well as enabling advanced tasks like summarization and contextual question-answering.
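To make the data flow concrete, the toy sketch below illustrates the basic SALM idea described above: audio frames are encoded, projected into the LLM’s hidden dimension, and spliced into the prompt embeddings before the LoRA-adapted LLM generates text. This is an illustration only, not NVIDIA’s implementation; the module names and dimensions are assumptions for the sake of the example.
import torch
import torch.nn as nn

class ToySALM(nn.Module):
    """Illustrative sketch of a speech-augmented LM, not the NeMo SALM class."""
    def __init__(self, audio_encoder, llm, enc_dim=1024, llm_dim=2048):
        super().__init__()
        self.audio_encoder = audio_encoder        # stand-in for the FastConformer encoder
        self.llm = llm                            # stand-in for a causal LM accepting inputs_embeds
        self.proj = nn.Linear(enc_dim, llm_dim)   # maps audio frames into the LLM embedding space

    def forward(self, audio, embeds_before, embeds_after):
        audio_feats = self.audio_encoder(audio)   # (batch, frames, enc_dim)
        audio_embeds = self.proj(audio_feats)     # (batch, frames, llm_dim)
        # Splice the projected audio embeddings where the audio placeholder sits in the prompt
        inputs_embeds = torch.cat([embeds_before, audio_embeds, embeds_after], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)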
The model was trained on an extensive dataset of 234,000 hours of publicly available English speech, including conversations, web videos, and audiobook recordings. Notably, the AMI dataset was oversampled to constitute about 15% of the training data, which slightly biases the model toward verbatim transcripts that include conversational disfluencies like repetitions. Evaluation used greedy decoding and the Word Error Rate (WER) metric, with text normalization performed via whisper-normalizer version 0.1.12.
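As a concrete illustration of how WER is scored with this kind of normalization, here is a minimal sketch assuming the jiwer and whisper-normalizer packages are installed (pip install jiwer whisper-normalizer); the example strings are made up:
from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "Mr. Brown couldn't attend the meeting on Friday."
hypothesis = "mister brown could not attend the meeting on friday"

# Normalize both sides so punctuation, casing, and spelling variants are not penalized
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

print(f"WER: {wer(ref_norm, hyp_norm):.3f}")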
Key features of Canary-Qwen-2.5B include:
- State-of-the-Art Performance: Achieves a WER of 5.63% on the Hugging Face OpenASR Leaderboard, the lowest recorded for an open-source model.
- High Inference Speed: Boasts a Real-Time Factor (RTFx) of 418 on an A100 GPU, meaning it processes audio 418 times faster than real time, making it suitable for low-latency applications like live captioning (a simple way to estimate RTFx on your own hardware is sketched after this list).
- Commercial-Friendly License: Released under the CC-BY-4.0 license, permitting commercial use with attribution.
- Multitask Capabilities: Supports transcription, summarization, and question-answering, leveraging its LLM backbone for contextual understanding.
- Hardware Compatibility: Optimized for NVIDIA GPUs across the Ampere (A100), Hopper (H100), and Blackwell generations, enabling scalable deployment in both cloud and on-premises environments.
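For reference, the Real-Time Factor is simply the ratio of audio duration to wall-clock processing time. The hypothetical helper below (not part of NeMo) estimates it for any single-file transcription function, assuming the soundfile package is installed:
import time
import soundfile as sf

def measure_rtfx(transcribe_fn, audio_file):
    """Estimate RTFx: seconds of audio processed per second of compute."""
    duration = sf.info(audio_file).duration      # audio length in seconds
    start = time.perf_counter()
    transcribe_fn(audio_file)                    # any function that transcribes one file
    elapsed = time.perf_counter() - start
    return duration / elapsed                    # e.g. 418 means 418x faster than real time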
Model Architecture
SALM(
  (llm): PeftModelForCausalLM(
    (base_model): LoraModel(
      (model): Qwen3ForCausalLM(
        (model): Qwen3Model(
          (layers): ModuleList(
            (0-27): 28 x Qwen3DecoderLayer(
              (self_attn): Qwen3Attention(
                (q_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=2048, out_features=128, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=128, out_features=2048, bias=False)
                  )
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=2048, out_features=1024, bias=False)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=2048, out_features=128, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=128, out_features=1024, bias=False)
                  )
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                  (lora_magnitude_vector): ModuleDict()
                )
                (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
                (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
                (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
              )
              (mlp): Qwen3MLP(
                (gate_proj): Linear(in_features=2048, out_features=6144, bias=False)
                (up_proj): Linear(in_features=2048, out_features=6144, bias=False)
                (down_proj): Linear(in_features=6144, out_features=2048, bias=False)
                (act_fn): SiLU()
              )
              (input_layernorm): Qwen3RMSNorm((2048,), eps=1e-06)
              (post_attention_layernorm): Qwen3RMSNorm((2048,), eps=1e-06)
            )
          )
          (norm): Qwen3RMSNorm((2048,), eps=1e-06)
          (rotary_emb): Qwen3RotaryEmbedding()
        )
        (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
      )
    )
  )
  (embed_tokens): Embedding(151936, 2048)
  (perception): AudioPerceptionModule(
    (preprocessor): AudioToMelSpectrogramPreprocessor(
      (featurizer): FilterbankFeatures()
    )
    (encoder): ConformerEncoder(
      (pre_encode): ConvSubsampling(
        (out): Linear(in_features=4096, out_features=1024, bias=True)
        (conv): MaskedConvSequential(
          (0): Conv2d(1, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=256)
          (3): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
          (4): ReLU(inplace=True)
          (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=256)
          (6): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
          (7): ReLU(inplace=True)
        )
      )
      (pos_enc): RelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (layers): ModuleList(
        (0-31): 32 x ConformerLayer(
          (norm_feed_forward1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (feed_forward1): ConformerFeedForward(
            (linear1): Linear(in_features=1024, out_features=4096, bias=True)
            (activation): Swish()
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (norm_conv): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (conv): ConformerConvolution(
            (pointwise_conv1): Conv1d(1024, 2048, kernel_size=(1,), stride=(1,))
            (depthwise_conv): CausalConv1D(1024, 1024, kernel_size=(9,), stride=(1,), groups=1024)
            (batch_norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (activation): Swish()
            (pointwise_conv2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (norm_self_att): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attn): RelPositionMultiHeadAttention(
            (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
            (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
            (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
            (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear_pos): Linear(in_features=1024, out_features=1024, bias=False)
          )
          (norm_feed_forward2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (feed_forward2): ConformerFeedForward(
            (linear1): Linear(in_features=1024, out_features=4096, bias=True)
            (activation): Swish()
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
          (norm_out): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (modality_adapter): IdentityConnector()
    (proj): Linear(in_features=1024, out_features=2048, bias=True)
  )
)
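The listing above is simply the PyTorch module summary; once the model is loaded (as shown in the inference section below), it can be reproduced with a plain print:
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
print(model)  # prints the SALM module tree shown above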
Performance Metrics
Canary-Qwen-2.5B’s performance is remarkable for its relatively modest size of 2.5 billion parameters. It outperforms widely used models such as Whisper-large-v3 and SeamlessM4T-Medium-v1, despite being trained on significantly less data. The headline WER of 5.63% is the average across the Hugging Face OpenASR Leaderboard test sets; NVIDIA also reports noise-robustness results on LibriSpeech Test Clean with noise added at various Signal-to-Noise Ratio (SNR) levels, showing the model holds up in noisy conditions. Combined with its inference speed (RTFx = 418), this makes it highly efficient for real-world applications such as transcription at scale or live captioning systems.
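For context, evaluating at a given SNR simply means scaling additive noise relative to the speech signal before transcription. The NumPy sketch below shows the arithmetic; it is not NVIDIA’s evaluation code:
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio in dB."""
    noise = noise[: len(speech)]                  # trim noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise_scaled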
The model’s fairness was evaluated using the CasualConversations-v1 dataset, following the methodology outlined in the paper “Towards Measuring Fairness in AI: the Casual Conversations Dataset.” Error rates were determined by normalizing both reference and predicted text, ensuring consistent evaluation aligned with the Hugging Face OpenASR Leaderboard standards.
Practical Applications
Canary-Qwen-2.5B is designed for a wide range of applications, particularly in industries requiring high-accuracy speech-to-text capabilities and contextual understanding. Some key use cases include:
- Transcription Services: Provides highly accurate transcriptions with proper punctuation and capitalization, ideal for media, legal, and healthcare sectors where precision is critical.
- Summarization and Question-Answering: The LLM component enables summarization of audio content and answering user queries about the transcript, enhancing usability in customer service and content analysis.
- Live Captioning: Its high inference speed supports real-time applications like live event captioning or virtual meeting transcription.
- Customizable Workflows: Developers can fine-tune the model using the NVIDIA NeMo toolkit for domain-specific applications, such as medical dictation or legal documentation.
The model’s open-source nature and commercial-friendly license make it a versatile tool for enterprises, researchers, and developers aiming to build voice-first AI applications.
Inferencing with Canary-Qwen-2.5B
To use Canary-Qwen-2.5B for inference, you need the NVIDIA NeMo toolkit installed from its latest main branch, along with PyTorch 2.6+ for FSDP2 support. The model accepts audio inputs in .wav or .flac formats (16 kHz, mono) and text prompts for tasks like transcription or summarization. Below is a code example for performing inference using the model weights hosted on Hugging Face and the NVIDIA NeMo toolkit.
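If your recordings are not already 16 kHz mono, resample them first. A minimal sketch, assuming the librosa and soundfile packages are installed and an arbitrary input file path:
import librosa
import soundfile as sf

# Downmix to mono and resample to the 16 kHz rate the model expects
audio, sr = librosa.load('path/to/original_recording.mp3', sr=16000, mono=True)
sf.write('path/to/your/audio.wav', audio, 16000)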
Prerequisites
- Python 3.9
- Install the NeMo toolkit: pip install "nemo_toolkit[asr,tts]@git+https://github.com/NVIDIA/NeMo.git"
- Ensure PyTorch 2.6+ is installed.
- Install supporting packages: pip install numpy==2.0.0 fsspec==2025.3.2 sacrebleu
- Prepare an audio file (.wav or .flac) for transcription.
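Before loading the model, a quick sanity check of the environment can save time. This assumes PyTorch is already installed:
import torch

print("PyTorch version:", torch.__version__)          # should report 2.6 or newer
print("CUDA available:", torch.cuda.is_available())   # an NVIDIA GPU is strongly recommended
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))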
Code Example
from nemo.collections.speechlm2.models import SALM

# Load the pre-trained Canary-Qwen-2.5B model
model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')

# Define the audio file path (16 kHz, mono, .wav or .flac)
audio_file = 'path/to/your/audio.wav'  # Replace with your audio file path

# Build a chat-style prompt; model.audio_locator_tag marks where the audio is inserted
prompts = [
    [
        {
            "role": "user",
            "content": f"Transcribe the following: {model.audio_locator_tag}",
            "audio": [audio_file],
        }
    ]
]

# Run inference; the output transcript includes punctuation and capitalization
answer_ids = model.generate(
    prompts=prompts,
    max_new_tokens=128,
)

# Decode the generated token IDs and print the transcribed text
predicted_text = model.tokenizer.ids_to_text(answer_ids[0].cpu())
print("Transcribed Text:", predicted_text)
Challenges and Considerations
While Canary-Qwen-2.5B excels in many areas, there are some considerations:
- English-Only Support: The model is currently optimized for English speech, limiting its applicability for multilingual use cases.
- Long Audio Transcription: For audio files longer than 10 seconds, NVIDIA recommends using the chunked inference script (speech_to_text_aed_chunked_infer.py) with a chunk length of 10 seconds to avoid degeneration issues, as noted in discussions on GitHub; a simple manual chunking approach is sketched after this list.
- Hardware Requirements: While optimized for NVIDIA GPUs, running the model on less powerful hardware may require quantization or reduced batch sizes, potentially impacting performance.
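When the chunked inference script is not convenient, longer recordings can also be split manually and transcribed piece by piece. This is an illustrative sketch only, not the official script; it assumes the soundfile package, a mono 16 kHz input, a model loaded as in the example above, and naive concatenation of per-chunk transcripts:
import soundfile as sf

def transcribe_long_audio(model, audio_file, chunk_secs=10, tmp_dir='/tmp'):
    """Split a long recording into fixed-length chunks, transcribe each, and join the results."""
    audio, sr = sf.read(audio_file)
    samples_per_chunk = chunk_secs * sr
    texts = []
    for i, start in enumerate(range(0, len(audio), samples_per_chunk)):
        chunk_path = f"{tmp_dir}/chunk_{i:04d}.wav"
        sf.write(chunk_path, audio[start:start + samples_per_chunk], sr)
        answer_ids = model.generate(
            prompts=[[{
                "role": "user",
                "content": f"Transcribe the following: {model.audio_locator_tag}",
                "audio": [chunk_path],
            }]],
            max_new_tokens=256,
        )
        texts.append(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
    return " ".join(texts)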