AI benchmarks are broken and the industry keeps using them anyway, study finds

Artificial intelligence benchmarks, long considered the gold standard for evaluating large language model (LLM) performance, are fundamentally compromised by data contamination. A recent study conducted by researchers from the University of California, Berkeley, Carnegie Mellon University, and Vectara has exposed pervasive issues in these evaluation metrics, revealing that popular benchmarks contain questions and answers directly sourced from the training data of leading AI models. Despite these findings, the AI industry continues to rely on them, perpetuating misleading claims of progress.

The study, titled “AI Benchmarks Are Prone to Data Leakage,” meticulously analyzed 38 widely used benchmarks across four categories: math, coding, expert knowledge, and instruction following. These include high-profile tests such as GSM8K for grade-school math problems, HumanEval for code generation, MMLU for multitask language understanding, and AlpacaEval for instruction adherence. The researchers employed a sophisticated pipeline to detect data leakage by scraping publicly available pre-training datasets, including Common Crawl snapshots used by models like GPT-4, Llama 2, and Mistral.

Their methodology was rigorous. They processed over 7.3 trillion tokens from training corpora, filtering for benchmark-like content using regular expressions tailored to each evaluation set. For instance, GSM8K problems follow a predictable format: a short story problem followed by “Question: [query]” and “Answer: [[answer]]”. Similar patterns were identified for other benchmarks. To quantify contamination, the team measured exact matches and semantic similarities, flagging leaked data against fixed thresholds: normalized Hamming distances below 0.1 for strings and cosine similarities above 0.95 for embeddings.
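The matching logic described above can be sketched in a few lines of Python. The regex, the helper names, and the bag-of-words stand-in for a real embedding model are all illustrative assumptions, not the paper's actual pipeline:

```python
import math
import re
from collections import Counter

# Hypothetical pattern for GSM8K-style items: a story problem followed by
# "Question: ..." and "Answer: ..." markers, per the format the study describes.
GSM8K_LIKE = re.compile(r"Question:\s*(?P<q>.+?)\s*Answer:\s*(?P<a>.+)", re.DOTALL)

def normalized_hamming(a: str, b: str) -> float:
    """Fraction of differing characters; the length difference between
    unequal strings is counted as extra mismatches."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    mismatches = sum(c1 != c2 for c1, c2 in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / n

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a cheap stand-in for the embedding
    model a real contamination check would use."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def is_leaked(benchmark_item: str, training_chunk: str) -> bool:
    """Apply the thresholds reported in the article: normalized Hamming
    distance below 0.1 OR cosine similarity above 0.95 flags a leak."""
    return (normalized_hamming(benchmark_item, training_chunk) < 0.1
            or cosine_similarity(benchmark_item, training_chunk) > 0.95)
```

In a real pipeline the comparison would run against trillions of tokens, so it would be preceded by cheap filters (the regex above) before any pairwise similarity is computed.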

The results were staggering. In the math category, GSM8K exhibited the highest contamination rates. For OpenAI’s GPT-4, an estimated 82% of the benchmark’s questions appeared verbatim in its training data, primarily from Common Crawl’s 2021-09 snapshot. GPT-3.5 showed 75% leakage, while models like Vicuna-13B and Llama2-70B had rates of 62% and 55%, respectively. Even proprietary models trained post-benchmark release, such as GPT-4 Turbo, retained significant contamination, suggesting incomplete data cleaning.

Coding benchmarks fared no better. HumanEval saw 48% contamination in GPT-4’s training data, with the MultiPL-E multilingual extension reaching 71%. Expert knowledge tests like MMLU revealed 45% leakage for GPT-4, encompassing subjects from abstract algebra to clinical knowledge. Instruction-following benchmarks, including AlpacaEval 2.0, showed 37% contamination. Across the board, contamination exceeded 10% for most models on most benchmarks, far surpassing acceptable thresholds for fair evaluation.

This leakage explains the perplexing phenomenon of “benchmark saturation.” Models now routinely score over 90% on tests like GSM8K—GPT-4 achieves 95%—yet struggle with novel problems. The study demonstrates that high scores correlate strongly with contamination levels rather than true reasoning capabilities. For example, when researchers introduced slight perturbations to GSM8K questions, model performance plummeted, confirming memorization over generalization.
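A perturbation check of the kind described is easy to sketch. The function name, the seed, and the example problem below are hypothetical; the idea is simply that a model that memorized an item should fail once the surface numbers change, while a model that learned the procedure should not:

```python
import random
import re

def perturb_numbers(problem: str, seed: int = 0) -> str:
    """Replace each integer in a word problem with a different random
    value, preserving the problem's wording and structure."""
    rng = random.Random(seed)

    def swap(m: re.Match) -> str:
        old = int(m.group())
        new = old
        while new == old:  # guarantee the value actually changes
            new = rng.randint(2, 99)
        return str(new)

    return re.sub(r"\d+", swap, problem)

original = "Lena has 5 apples and buys 3 more. How many apples does she have?"
variant = perturb_numbers(original)
```

Scoring a model on many such variants, and comparing against its score on the originals, separates memorization from generalization.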

Why does this persist? The researchers attribute it to a toxic mix of competitive pressures and methodological shortcomings. AI companies race to top leaderboards on platforms like Hugging Face’s Open LLM Leaderboard, where benchmarks are the primary metric. Benchmark creators often release datasets publicly years before model training, allowing inadvertent inclusion via web crawls. Efforts like data deduplication in training pipelines—employing techniques such as MinHash or exact matching—are insufficient against the scale of trillion-token datasets.
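To see why deduplication struggles, here is a minimal MinHash sketch in Python (the hash count and shingle size are arbitrary illustrative choices, not values from any real training pipeline). A benchmark item that was lightly reformatted before landing in a crawl produces a signature similar to, but not identical with, the original, so a dedup pass tuned for near-exact duplicates lets it through:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """The set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(doc: str, num_hashes: int = 64) -> list:
    """MinHash signature: for each seeded hash function, keep the minimum
    hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(doc)
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots approximates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Dedup pipelines drop pairs whose estimated similarity exceeds some threshold; a rephrased or reformatted benchmark item can sit just below that threshold and survive into the training set.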

Moreover, proprietary training details obscure verification. OpenAI, Anthropic, and others rarely disclose exact contamination checks, fostering an environment of unchecked hype. The study critiques leaderboard practices, noting that they fail to filter contaminated data or require post-training leakage reports, leading to inflated capabilities claims.

The implications are profound. Overreliance on broken benchmarks misdirects research priorities, resource allocation, and public perception. Progress appears explosive on leaderboards, with headline scores leaping from one model release to the next, yet real-world utility lags behind. The paper urges a paradigm shift: dynamic benchmarks with hidden test sets, in the spirit of Big-Bench Hard or HELM; rigorous pre-release contamination audits; and hybrid evaluations incorporating human judgment and real-world tasks.

It also proposes practical mitigations. Benchmark authors should watermark data or use private evaluations. Model developers must enhance deduplication with benchmark-specific filters and publish contamination matrices. Leaderboards could adopt contamination-adjusted scores or ban heavily leaked tests.
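A contamination-adjusted score could be as simple as restricting accuracy to items not flagged as leaked. This sketch is one hypothetical realization, not the study's proposal; it assumes per-item contamination flags are available, which is exactly what published contamination matrices would provide:

```python
def adjusted_accuracy(results: list) -> float:
    """Accuracy restricted to uncontaminated benchmark items.

    results: a list of (is_correct, is_contaminated) pairs, one per item.
    Returns accuracy over the clean subset, or 0.0 if every item leaked
    (in which case the benchmark tells us nothing about the model).
    """
    clean = [correct for correct, leaked in results if not leaked]
    return sum(clean) / len(clean) if clean else 0.0
```

A leaderboard reporting this number alongside the raw score would make the gap between memorization and capability visible at a glance.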

Leading voices echo these concerns. Sander Schulhoff called the findings “a huge wake-up call,” highlighting how contamination can skew even preference-based evaluations such as LMSYS’s Chatbot Arena. Percy Liang of Stanford, whose group develops the HELM evaluation framework, emphasized the need for benchmarks to evolve faster than models.

Ultimately, the study paints a sobering picture: without reform, AI evaluation remains a house of cards. As models grow larger and training data more exhaustive, contamination will only worsen unless the industry acts decisively. True advancement demands benchmarks that measure intelligence, not regurgitation.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.