New Benchmark Reveals Limitations of Large Language Models in Genuine Scientific Research
Large language models (LLMs) have made remarkable strides in generating text, answering questions, and even simulating reasoning. However, their ability to conduct authentic scientific research—encompassing hypothesis formulation, experimental design, data analysis, and peer-review-level validation—remains a critical frontier. A newly introduced benchmark, SciArena, developed by researchers at Sakana AI, rigorously tests this capability and delivers sobering results: even the most advanced LLMs fall dramatically short of human performance in real-world scientific tasks.
The SciArena Benchmark: A Rigorous Test for Scientific Discovery
SciArena stands out from conventional AI benchmarks by focusing on the end-to-end process of scientific inquiry rather than isolated subtasks. Traditional evaluations, such as those measuring factual recall or code generation, often inflate perceptions of LLM competence. In contrast, SciArena draws tasks directly from cutting-edge scientific literature, challenging models to replicate the creative and iterative nature of discovery.
The benchmark comprises 24 tasks across diverse fields, including biology, chemistry, physics, and materials science. Each task is derived from landmark papers published between 2021 and 2024 in top journals such as Nature, Science, and Cell. Models, like the human baselines they are compared against, must navigate a full research pipeline:
- Literature Review: Synthesize prior work to identify gaps.
- Hypothesis Generation: Propose novel, testable ideas.
- Experiment Design: Outline protocols, controls, and expected outcomes.
- Data Analysis: Interpret simulated or provided results.
- Conclusion and Iteration: Draw inferences and suggest follow-ups.
Critically, success is not self-assessed by the model but judged by expert human evaluators—PhD holders in the relevant domains—who score submissions on a 1-10 scale across criteria such as novelty, feasibility, rigor, and alignment with the ground-truth discoveries from the source papers. A score of 8 or higher counts as a “success,” mirroring peer-review standards.
To ensure fairness, models receive access to tools like web search, code execution (via Python interpreters), and scientific databases. Multiple runs per model account for variability, with outputs judged blindly against human baselines.
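The scoring protocol above reduces to a simple rule: average the expert scores for a run, and count the run as a success if that average clears the threshold of 8. A minimal sketch of that calculation follows — the function name and the score data are illustrative, not SciArena's actual evaluation harness:

```python
from statistics import mean

SUCCESS_THRESHOLD = 8  # an average expert score of 8/10 or higher counts as a success


def success_rate(run_scores):
    """Percentage of runs whose mean expert score clears the threshold.

    run_scores: one inner list of 1-10 expert scores per run.
    """
    successes = [mean(scores) >= SUCCESS_THRESHOLD for scores in run_scores]
    return 100.0 * sum(successes) / len(successes)


# Invented scores for four runs of one model (three experts per run)
runs = [[8, 9, 8], [5, 6, 4], [7, 8, 6], [9, 8, 8]]
print(f"{success_rate(runs):.1f}%")  # 2 of 4 runs succeed -> 50.0%
```

Averaging across multiple runs per model, as the benchmark does, keeps a single lucky or unlucky run from dominating the reported rate.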
Performance Results: LLMs Lag Far Behind Humans
The evaluation pitted 17 leading LLMs against 5 domain experts. Human researchers achieved a 40.0% success rate, demonstrating their prowess in tackling unfamiliar challenges. In stark contrast, the top-performing LLM, OpenAI’s o1-preview, managed only a 4.2% success rate. Close contenders included Anthropic’s Claude 3.5 Sonnet (3.7%), OpenAI’s o1-mini (3.1%), and Google’s Gemini 1.5 Pro (2.1%). Even ensembles of models or those augmented with external tools topped out below 6%.
Breaking down the scores reveals consistent weaknesses:
- Novelty and Insight: LLMs excelled at regurgitating known facts (average literature review score: 6.5/10) but faltered in generating truly original hypotheses (average: 3.2/10).
- Experimental Rigor: Designs often lacked proper controls or overlooked practical constraints, scoring 4.1/10 on average.
- Data Interpretation: Models misinterpreted noisy data or overfit to patterns, yielding 3.8/10.
- Iteration: Few models effectively refined ideas based on feedback, averaging 2.9/10.
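If the overall score is an aggregate of these per-criterion scores, the arithmetic is straightforward. The sketch below assumes an unweighted mean over four criteria; the criterion names and the weighting scheme are assumptions for illustration, not details published with the benchmark:

```python
CRITERIA = ("novelty", "rigor", "interpretation", "iteration")


def overall_score(subscores):
    """Unweighted mean of per-criterion scores, each on a 1-10 scale."""
    missing = [c for c in CRITERIA if c not in subscores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(subscores[c] for c in CRITERIA) / len(CRITERIA)


# The average LLM subscores reported above
llm_avg = {"novelty": 3.2, "rigor": 4.1, "interpretation": 3.8, "iteration": 2.9}
print(round(overall_score(llm_avg), 2))  # -> 3.5
```

Note that an unweighted mean of the reported subscores (3.5) sits close to the bottom of the overall-score column in the table below, consistent with the weakest models.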
Notably, reasoning-focused models like o1-preview showed marginal gains over standard chatbots (e.g., GPT-4o at 1.0%), but the gap to humans persisted. Tool use provided minimal uplift; for instance, code execution helped in simulations but not in conceptual leaps.
| Model | Success Rate (%) | Avg. Overall Score (/10) |
|---|---|---|
| Human Experts | 40.0 | 7.2 |
| o1-preview | 4.2 | 4.6 |
| Claude 3.5 Sonnet | 3.7 | 4.4 |
| o1-mini | 3.1 | 4.2 |
| Gemini 1.5 Pro | 2.1 | 4.0 |
| GPT-4o | 1.0 | 3.5 |
Why Do LLMs Struggle with Scientific Research?
SciArena’s findings underscore fundamental limitations in current LLM architectures. Trained predominantly on internet-scale text, models prioritize pattern matching over ab initio reasoning. Scientific research demands:
- Causal Understanding: Discerning mechanisms from correlations, which LLMs approximate via statistical associations.
- Uncertainty Handling: Science thrives on ambiguity; LLMs often confidently confabulate erroneous explanations (hallucinations).
- Creativity: Breakthroughs require analogical leaps across domains, rare in training data.
- Iterative Refinement: Humans pivot based on subtle cues; LLMs remain anchored to initial prompts.
The benchmark also highlights evaluation pitfalls. Self-reported “reasoning” traces from models like o1 correlate poorly with expert judgments (r=0.45), suggesting inflated internal metrics.
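The reported r=0.45 is presumably a Pearson correlation between the models' self-assessed reasoning quality and the experts' scores. The calculation itself is standard; the sketch below uses invented data purely to show it, not figures from the benchmark:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented example: a model's self-ratings vs. expert scores on five tasks
self_ratings = [9, 8, 9, 7, 8]
expert_scores = [5, 6, 4, 3, 7]
print(round(pearson_r(self_ratings, expert_scores), 2))  # -> 0.19
```

A coefficient near 0.45 means self-reported confidence explains only about 20% of the variance in expert judgment (r squared), which is why the authors treat internal metrics as unreliable.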
Implications for AI-Augmented Science
SciArena does not dismiss LLMs’ utility—they accelerate literature searches, draft protocols, and automate routine analyses. However, automating discovery remains elusive. The authors recommend hybrid approaches: LLMs as “idea generators” vetted by humans. Future benchmarks should expand to more domains and incorporate real wet-lab validation.
As Sakana AI’s David Ha notes, “True scientific progress requires more than intelligence; it demands reliable, creative agency.” This benchmark serves as a wake-up call, urging the field to bridge the chasm between simulation and genuine innovation.
Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.