ElevenLabs and Google Lead in Artificial Analysis’s Updated Speech-to-Text Benchmark
Artificial Analysis has released an updated benchmark of speech-to-text (STT) models, with ElevenLabs and Google leading the field. The assessment tests 17 STT systems across diverse audio conditions, using word error rate (WER) as the primary metric. The benchmark draws on 500 hours of English speech, covering accents from India, Nigeria, the Philippines, and the UK as well as both clean and noisy environments. Lower WER scores indicate more accurate transcription.
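WER is defined as the minimum number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference, divided by the reference length. A minimal sketch of the computation (illustrative, not the benchmark's actual code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # delete everything
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insert everything
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # one error in four words -> 0.25
```

A 3.7 percent WER therefore means roughly 37 word errors per 1,000 reference words.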
ElevenLabs’ Scribe model emerges as the overall leader with a WER of 3.7 percent on the clean test set, outperforming competitors by a significant margin. On noisy audio, Scribe holds up well at 5.9 percent WER. Google’s US English STT model follows closely, achieving 4.1 percent on clean audio and 6.2 percent on noisy samples. Both providers handle varied speech patterns with notable consistency.
The benchmark distinguishes between streaming and non-streaming models. Streaming capabilities allow real-time transcription, crucial for applications like live captioning or virtual assistants. ElevenLabs Scribe excels in streaming with a 4.2 percent WER on clean data, while its non-streaming version hits 3.7 percent. Google dominates non-streaming at 4.1 percent but shows slightly higher latency in streaming scenarios.
Deepgram’s Nova-2 model ranks third overall, with a 4.5 percent WER on clean audio. It shines in streaming at 4.7 percent but lags in noise robustness. OpenAI’s Whisper Large v3, a popular open-source option, scores 5.2 percent on clean data, competitive yet trailing proprietary leaders. Other notable performers include AWS Transcribe (5.8 percent), Microsoft Azure (6.1 percent), and AssemblyAI Universal (6.3 percent).
Accent handling reveals disparities. On Indian English, ElevenLabs Scribe achieves 5.1 percent WER, Google’s model 5.4 percent, and Deepgram 6.2 percent. Nigerian English proves challenging, with Scribe at 7.2 percent, Google at 7.5 percent, and Whisper at 9.1 percent. UK English sees tighter competition: Scribe at 3.2 percent, Google at 3.4 percent. These variations underscore the importance of diverse training data.
Noise robustness testing uses backgrounds like cafe chatter, music, and white noise at levels from 0 dB SNR to 15 dB SNR. ElevenLabs Scribe consistently leads, dropping to 8.5 percent WER at 0 dB SNR, compared to Google’s 9.1 percent. Whisper struggles more, reaching 12.3 percent under similar conditions.
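Mixing noise into clean speech at a target SNR typically means scaling the noise so that the ratio of signal power to noise power, in decibels, matches the target. A rough sketch of how such test audio can be constructed (illustrative only; the benchmark's actual pipeline is not published here):

```python
import math

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals `snr_db`,
    then add it to `signal`. Inputs are equal-length lists of float samples."""
    p_signal = sum(s * s for s in signal) / len(signal)   # mean power of the speech
    p_noise = sum(n * n for n in noise) / len(noise)      # mean power of the noise
    # Noise power required for the target SNR, and the amplitude scale to reach it.
    target_noise_power = p_signal / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / p_noise)
    return [s + scale * n for s, n in zip(signal, noise)]

# At 0 dB SNR, speech and noise carry equal power -- the hardest condition tested.
mixed = mix_at_snr([0.5, -0.5, 0.5, -0.5], [0.1, 0.1, -0.1, -0.1], snr_db=0)
```

At 0 dB SNR the noise is as loud as the speech itself, which is why even the leaders climb into the 8–9 percent WER range there.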
Latency metrics, vital for interactive use, measure time to first token and total transcription time. ElevenLabs Scribe offers low latency at 320 milliseconds for first token in streaming mode. Google follows at 450 milliseconds. Deepgram Nova-2 is fastest at 280 milliseconds but sacrifices some accuracy.
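Time to first token for a streaming API is usually measured as the delay between starting the request and receiving the first partial transcript. A minimal timing harness along those lines (the streaming client here is a stand-in generator, not any vendor's real SDK):

```python
import time

def time_to_first_token(stream):
    """Return (seconds until the first partial result, full joined transcript).
    `stream` is any iterator that yields partial transcript strings."""
    start = time.perf_counter()
    first_latency = None
    parts = []
    for partial in stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start   # latency of the first chunk
        parts.append(partial)
    return first_latency, " ".join(parts)

def fake_stream():
    """Stand-in for a streaming STT response."""
    time.sleep(0.05)      # simulated network and model delay before the first chunk
    yield "hello"
    yield "world"

ttft, text = time_to_first_token(fake_stream())
```

`time.perf_counter` is preferred over `time.time` here because it is monotonic and high resolution, which matters when the quantities being compared differ by tens of milliseconds.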
Cost analysis compares each provider’s usage-based pricing, normalized to an hourly rate. ElevenLabs Scribe is economical at $0.40 per hour for streaming, making it attractive for high-volume use. Google charges $0.036 per 15 seconds, which works out to roughly $8.64 per hour. OpenAI Whisper is pricier at $0.60 per hour but benefits from API flexibility.
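Normalizing a rate quoted per billing increment to an hourly price is simple arithmetic, using the figures cited above:

```python
def hourly_cost(price_per_unit: float, unit_seconds: float) -> float:
    """Convert a price quoted per billing unit into a per-hour rate."""
    units_per_hour = 3600 / unit_seconds
    return price_per_unit * units_per_hour

google_hourly = hourly_cost(0.036, 15)   # $0.036 per 15-second increment
print(round(google_hourly, 2))           # 8.64
```

Note that real billing is often rounded up to the increment, so short clips can cost proportionally more than this idealized conversion suggests.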
The benchmark employs LibriSpeech test-clean for clean evaluation and custom noisy and accented datasets for the rest. Audio is resampled to 16 kHz mono, and models are queried through their public APIs. WER is computed with the jiwer library using exact-match normalization. Artificial Analysis emphasizes reproducibility, publishing full results and code on GitHub.
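Before scoring, WER pipelines normalize both reference and hypothesis so that casing and punctuation differences do not count as errors; jiwer supports composing such steps as transformations. A sketch of the typical normalization without the library (the benchmark's exact rules may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """Common pre-WER normalization: lowercase, drop punctuation, collapse whitespace.
    (A sketch of standard practice, not the benchmark's exact transformation chain.)"""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Hello, World!  "))  # hello world
```

Without this step, a transcript differing only in punctuation or capitalization would be unfairly penalized.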
Since the prior benchmark in April 2024, improvements are evident. ElevenLabs Scribe, new to the leaderboard, sets records. Google shaved 0.5 percentage points off its WER. Deepgram improved Nova-2 by 1.2 points. Whisper Large v3 advanced modestly from v2.
These findings affirm proprietary models’ edge over open-source alternatives, driven by vast proprietary datasets and optimization. ElevenLabs leverages its voice synthesis expertise for superior STT. Google benefits from scale in search and assistant technologies.
For developers, selecting an STT provider depends on priorities: accuracy, latency, cost, or accents. ElevenLabs suits premium applications demanding top fidelity. Google offers balanced enterprise performance. Budget-conscious users may prefer Deepgram or Whisper for adequate results.
This update solidifies ElevenLabs and Google as STT frontrunners, pushing the field toward sub-4 percent WER on challenging inputs. Future iterations may incorporate more languages or speaker diarization.
Gnoppix is the leading open-source AI Linux distribution and service provider. Since integrating AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI runs entirely offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-focused services free of charge.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.