OpenAI wants to retire the AI coding benchmark that everyone has been competing on

OpenAI Proposes Retirement of HumanEval Coding Benchmark Amid Saturation Concerns

OpenAI researchers have called for the retirement of HumanEval, a widely adopted benchmark for evaluating the coding capabilities of large language models. Introduced in 2021 alongside Codex, a GPT language model fine-tuned on code, HumanEval consists of 164 hand-written programming problems that test an AI’s ability to generate functional code from natural-language descriptions. Each problem supplies a function signature and a docstring specifying the expected inputs and outputs; success is measured by whether the model’s generated code passes the accompanying unit tests.
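For illustration, a HumanEval task looks roughly like the sketch below (paraphrased from the benchmark's first problem, `has_close_elements`; the solution body shown is one plausible completion, not necessarily the official canonical answer):

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # The model sees only the signature and docstring above and
    # must generate a body such as the one below.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Grading mirrors the benchmark's hidden unit tests: all must pass.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```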

The benchmark quickly became an industry standard. Developers and AI companies alike used HumanEval scores to compare model performance, often highlighting pass@1, the percentage of problems solved correctly on the first attempt. Early results were modest: GPT-3.5 scored around 48.1 percent pass@1, and subsequent models improved steadily. By mid-2023, models from various providers began surpassing 80 percent, and top performers now exceed 90 percent.
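Pass@1 figures are typically computed with the unbiased pass@k estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples would succeed. A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper: the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every possible draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples per problem, of which 100 pass, pass@1 is 0.5:
print(pass_at_k(200, 100, 1))  # 0.5
```

Benchmark scores average this quantity over all 164 problems.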

This rapid saturation has prompted OpenAI to deem HumanEval no longer useful for differentiating capabilities. In a recent blog post titled “Frontier Math,” researchers Daniel Keysers, Nathan C. Redwood, and others argue that benchmarks must evolve as models advance. They note that HumanEval’s short, function-level problems have become too easy for state-of-the-art systems. Leading models such as OpenAI’s o1-preview achieve 92.4 percent pass@1, Claude 3.5 Sonnet from Anthropic scores 93.7 percent, and Google’s Gemini 1.5 Pro hits 96.1 percent. Even open-source alternatives like DeepSeek-Coder-V2 and Qwen2.5-Coder now rival or exceed proprietary scores.

The researchers emphasize that clinging to saturated benchmarks distorts progress measurement. They advocate shifting focus to more challenging evaluations that better reflect real-world software engineering demands. One recommended alternative is SWE-bench, introduced by Princeton University researchers in late 2023. SWE-bench draws from over 2,000 real GitHub issues across 12 popular Python repositories, requiring models to resolve issues by editing entire files rather than writing isolated functions. Current top scores on SWE-bench remain low, with Claude 3.5 Sonnet at 33.4 percent and OpenAI’s o1 at 38.5 percent resolved issues, highlighting substantial room for improvement.
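SWE-bench's grading criterion can be summarized in a short sketch (a simplification, not the official Princeton harness): each issue ships with fail-to-pass tests that the model's patch must fix and pass-to-pass tests that must not regress.

```python
def issue_resolved(fail_to_pass: set[str], pass_to_pass: set[str],
                   passing_after_patch: set[str]) -> bool:
    """Simplified SWE-bench-style check: an issue counts as resolved
    only if every previously failing test now passes and no
    previously passing test regresses."""
    required = fail_to_pass | pass_to_pass
    return required <= passing_after_patch

# A patch that fixes the bug but breaks a regression test fails:
print(issue_resolved({"test_bug"}, {"test_api"}, {"test_bug"}))  # False
```

This all-or-nothing criterion is part of why resolved rates stay low: a patch that fixes the reported issue but breaks anything else earns no credit.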

OpenAI’s position underscores broader tensions in AI benchmarking. While HumanEval offered a simple, reproducible metric, its limitations are evident: problems are synthetic and lack the complexity of production codebases, including dependencies, context from multiple files, and iterative debugging. Critics have long pointed out potential overfitting, as models trained on vast code corpora may memorize patterns from leaked or similar problems. Despite these issues, HumanEval persists in marketing materials. Recent announcements from companies like Alibaba’s Qwen team and Mistral AI prominently feature near-perfect scores, perpetuating its use.

The push to retire HumanEval aligns with OpenAI’s experience developing advanced models. The company reports that internal evaluations now prioritize agentic tasks, where AI systems autonomously plan, execute, and verify code across repositories. Benchmarks like SWE-bench better capture this paradigm, simulating end-to-end software engineering workflows. Other emerging evaluations include LiveCodeBench for competitive programming and BigCodeBench for instruction-following in code generation.

Industry observers see this as a pivotal moment. As AI coding assistants mature from snippet generators to full-fledged developers, benchmarks must scale in difficulty and realism. OpenAI’s call challenges competitors to adopt harder metrics, potentially slowing the hype cycle around incremental HumanEval gains. However, transition hurdles remain: SWE-bench demands significant computational resources for evaluation and raises questions about contamination from training data scraped from public repositories.

OpenAI’s researchers conclude that benchmarks serve as temporary scaffolds, valuable until saturation renders them obsolete. Retiring HumanEval would free resources for next-generation evaluations, ensuring progress tracking keeps pace with model capabilities. Until broader consensus emerges, HumanEval scores will likely continue appearing in leaderboards, but their diminishing relevance signals a maturing field where true engineering prowess takes center stage.


What are your thoughts on this? I’d love to hear about your own experiences in the comments below.