
Cohere Releases Open-Source Speech Recognition Model Outperforming Whisper on Key Benchmarks

Cohere, a prominent player in the generative AI space, has launched an open-source automatic speech recognition (ASR) model that achieves top scores on several industry-standard benchmarks, surpassing OpenAI’s Whisper in multiple categories. This release marks a significant advancement in accessible, high-performance speech-to-text technology, particularly for multilingual applications.

The new model, available under the Apache 2.0 license, comes in two variants: a 670 million parameter version optimized for speed and efficiency, and a larger 1.6 billion parameter model designed for superior accuracy. Both are hosted on Hugging Face, enabling developers worldwide to download, fine-tune, and deploy them freely. Cohere’s announcement emphasizes the models’ robustness across diverse accents, noisy environments, and over 100 languages, addressing longstanding challenges in ASR systems.

At the heart of this release is a transformer-based architecture trained on vast datasets of multilingual audio, incorporating advanced techniques like Connectionist Temporal Classification (CTC) loss for direct sequence-to-text alignment. This approach allows the models to transcribe speech without requiring explicit phoneme segmentation, improving both speed and accuracy. The training process leveraged Cohere’s proprietary infrastructure, but the resulting weights are fully open, democratizing access to state-of-the-art performance previously dominated by closed-source alternatives.

Benchmark results highlight the models’ strength. On the FLEURS dataset, which evaluates ASR across 102 languages using the Word Error Rate (WER) metric, the larger 1.6B model achieved an average WER of 7.0%, a substantial improvement over Whisper-large-v3’s 8.4%. It also led in low-resource languages such as Amharic (WER 15.2% vs. Whisper’s 19.1%) and Yoruba (12.8% vs. 16.5%), demonstrating strong generalization.
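For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal, illustrative implementation (production evaluations typically use a library such as jiwer, plus text normalization):

```python
# Word Error Rate: word-level Levenshtein distance divided by the
# number of words in the reference transcript. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER ≈ 16.7%
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A benchmark's reported WER is this ratio aggregated over the whole test set, so a drop from 8.4% to 7.0% means roughly one sixth fewer word errors.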

Similarly, on Common Voice 15—a crowd-sourced dataset spanning 109 languages—the 1.6B model recorded a WER of 6.2%, edging out Whisper-large-v3’s 6.8%. On Multilingual LibriSpeech (MLS), a benchmark of read audiobook speech across eight languages, it scored 4.1% WER compared to Whisper’s 4.7%. The smaller 670M model, while slightly behind on accuracy (e.g., 7.5% WER on FLEURS), excels in inference speed, processing audio up to 2.5 times faster than Whisper on consumer GPUs, making it ideal for real-time applications like live captioning or voice assistants.
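Speed claims like "2.5× faster" are commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean faster than real time. A hypothetical helper for measuring it (the `transcribe` callable and file path are placeholders, not part of Cohere's release):

```python
# Real-time factor (RTF) = wall-clock processing time / audio duration.
# RTF < 1.0 means the model transcribes faster than the audio plays,
# which is the requirement for live captioning.
import time

def real_time_factor(transcribe, audio_path: str, audio_seconds: float) -> float:
    """Time one transcription call and normalize by clip length.

    `transcribe` is any callable taking a file path, e.g. a
    Transformers ASR pipeline. Placeholder interface for illustration.
    """
    start = time.perf_counter()
    transcribe(audio_path)
    return (time.perf_counter() - start) / audio_seconds
```

Averaging RTF over many clips, and comparing the two models on identical hardware and batch settings, is what makes a "2.5× faster" figure meaningful.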

Cohere attributes these gains to meticulous data curation, including 100,000 hours of carefully filtered multilingual audio paired with accurate transcripts. The models were evaluated using standardized metrics like Character Error Rate (CER) and WER, with results independently verified on public leaderboards. Notably, the 1.6B model ranks first overall on the Hugging Face Open ASR Leaderboard, a testament to its broad capabilities.

This open-source push aligns with Cohere’s dual strategy of offering enterprise-grade APIs alongside community-driven models. While the company maintains proprietary versions for production-scale deployments—such as their Command family of models—this release invites collaboration. Developers can integrate the models via the Transformers library with minimal code:

from transformers import pipeline

# Load the ASR pipeline with the model ID from the announcement
pipe = pipeline("automatic-speech-recognition", model="cohere/whisper-base-multilingual")
result = pipe("audio.wav")  # path to a local audio file
print(result["text"])

Fine-tuning scripts and example notebooks are provided on Hugging Face, supporting tasks like domain adaptation for medical transcription or accented speech.

The implications extend to accessibility and innovation. By topping benchmarks in resource-constrained languages, the models bridge gaps in global AI equity, enabling applications in education, healthcare, and customer service across non-English speaking regions. Reduced latency and on-device compatibility further enhance privacy, as transcriptions can occur locally without cloud dependency.

Cohere’s engineering team highlighted the challenges overcome, including handling code-switching (mixing languages mid-sentence) and environmental noise, where the models show 20-30% relative error reductions over baselines. Future iterations may incorporate end-to-end multimodal capabilities, but this baseline sets a new standard for open ASR.

In summary, Cohere’s open-source ASR models represent a leap forward, combining benchmark-leading performance with practical deployability. They challenge the status quo, proving that open collaboration can yield results competitive with or exceeding proprietary giants.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.