ARC-AGI-3 offers $2M to any AI that matches untrained humans, yet every frontier model scores below 2%

ARC-AGI-3 Benchmark Challenges AI with $2 Million Prize for Human-Level Performance on Novel Tasks

The ARC Prize initiative has unveiled ARC-AGI-3, the latest iteration of its benchmark designed to evaluate artificial general intelligence (AGI) capabilities. This new version introduces a public evaluation set comprising 100 intellectually demanding tasks, each crafted to test an AI system’s ability to perform novel reasoning and abstraction akin to that of untrained humans. The prize pool totals $2 million, with the grand prize of $1 million awarded to any entrant achieving at least 85 percent accuracy on a corresponding private evaluation set. Additional prizes recognize incremental progress, including $500,000 for 65 percent, $300,000 for 50 percent, $150,000 for 40 percent, $75,000 for 30 percent, and $50,000 for 25 percent.
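For readers who prefer to see the tier structure at a glance, here is a minimal sketch of the payout ladder described above; the function name and data layout are illustrative only and not part of any official ARC Prize tooling.

```python
# Illustrative only: the prize tiers listed above, highest threshold first.
PRIZE_TIERS = [
    (0.85, 1_000_000),  # grand prize
    (0.65, 500_000),
    (0.50, 300_000),
    (0.40, 150_000),
    (0.30, 75_000),
    (0.25, 50_000),
]

def prize_for_accuracy(accuracy: float) -> int:
    """Return the highest prize tier reached for a private-eval accuracy in [0, 1]."""
    for threshold, amount in PRIZE_TIERS:
        if accuracy >= threshold:
            return amount
    return 0

print(prize_for_accuracy(0.0169))  # 0 -- roughly where today's frontier models sit
print(prize_for_accuracy(0.85))    # 1000000
```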

At the core of ARC-AGI-3 lies the Abstraction and Reasoning Corpus (ARC) framework, originally developed by François Chollet, the creator of Keras and formerly a researcher at Google. Unlike traditional benchmarks such as ImageNet or GLUE, which reward pattern matching over vast training datasets, ARC-AGI emphasizes “core knowledge priors” fundamental to human intelligence: objectness, goal-directedness, numbers, basic geometry, and short-term memory. Each task presents a few demonstration input-output grid pairs, typically two to four, from which the AI must infer the underlying transformation rule and apply it to a novel test input. Grids range from 1x1 to 30x30 cells, each cell taking one of up to 10 colors, keeping tasks compact while still demanding creative problem-solving.
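To make the task format concrete, the sketch below shows a toy ARC-style task in the JSON-like structure the public ARC datasets use (demonstration “train” pairs plus held-out “test” inputs, with grids as nested lists of color indices 0-9). The trivial mirror rule and the solve function are purely illustrative.

```python
# A toy ARC-style task: every demonstration maps a grid to its horizontal mirror.
# Grids are nested lists of integers 0-9, one integer per cell (up to 30x30).
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[5, 0, 0]]}],
}

def solve(grid):
    """Hypothetical solver for this one task: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# A candidate rule is only accepted if it reproduces every demonstration pair...
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])

# ...and is then applied to the unseen test input.
print(solve(task["test"][0]["input"]))  # [[0, 0, 5]]
```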

Human performance provides a clear target: untrained individuals, including children as young as four, solve around 60 percent of ARC-AGI tasks within two hours and reach 85 percent with familiarity. In contrast, state-of-the-art large language models (LLMs) and frontier systems lag dramatically. The public ARC-AGI-3 leaderboard, hosted on Kaggle, reveals the stark gap. OpenAI’s o3 model, leveraging high-compute test-time scaling, leads with 1.69 percent accuracy using 5,000 parallel runs per task. Other entries trail: Claude 3.5 Sonnet at 0.95 percent (1,000 runs), Gemini 2.0 Flash Thinking at 0.52 percent, and o1-pro at 0.31 percent. Even specialized approaches, such as the “ARC-AGI Gold Standard” test-time training baseline at 2.81 percent and “Humanity’s Last Exam” at 3.11 percent, fall short of 5 percent.

This underperformance persists despite the immense scale of modern models. Training costs for systems like GPT-4 run into the hundreds of millions of dollars, yet these models still fail on tasks requiring zero-shot generalization. ARC-AGI-3 addresses prior limitations by increasing difficulty: the public training set retains 1,000 tasks from ARC-AGI-2, while the evaluation set introduces 100 entirely new challenges scaled for greater complexity. The private evaluation set mirrors this structure but remains unseen to prevent overfitting. Scores are normalized by aggregating success across multiple stochastic runs, accounting for the probabilistic nature of current AI inference.
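The article does not spell out the exact aggregation formula, but a common convention for scoring a stochastic solver is to sample several independent attempts per task and count the task as solved if any attempt matches the hidden output exactly. The sketch below assumes that convention; it is not the official scoring code.

```python
import random

def solved(task, attempt_fn, rng, runs=5):
    """A task counts as solved if any of `runs` independent attempts matches the target."""
    return any(attempt_fn(task, rng) == task["target"] for _ in range(runs))

def benchmark_score(tasks, attempt_fn, runs=5, seed=0):
    """Aggregate score: fraction of tasks solved under the any-of-k-runs convention."""
    rng = random.Random(seed)
    return sum(solved(t, attempt_fn, rng, runs) for t in tasks) / len(tasks)

# Usage with a dummy stochastic solver that guesses correctly 30 percent of the time:
tasks = [{"target": i} for i in range(100)]
guesser = lambda task, rng: task["target"] if rng.random() < 0.3 else None
print(benchmark_score(tasks, guesser, runs=5))  # close to 1 - 0.7**5, about 0.83
```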

The benchmark’s design rigorously excludes data contamination. Analysis confirms no overlap between ARC tasks and training corpora of major models, including Common Crawl snapshots up to mid-2025. Techniques like test-time training, program synthesis, and neurosymbolic methods have yielded modest gains, but scaling compute alone yields diminishing returns. For instance, o3’s 1.69 percent required vast parallel inference, highlighting efficiency issues. Traditional deep learning struggles because ARC tasks defy memorization; each demands hypothesis generation, rule extraction, and application under uncertainty, mirroring human-like fluid intelligence.
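To illustrate the program-synthesis style of attack mentioned above, a solver can enumerate short programs over a small domain-specific language of grid operations and keep only those that reproduce every demonstration pair. The DSL below (mirrors, transpose, identity, and their compositions) is a deliberately tiny, hypothetical example, nowhere near what competitive entries actually use.

```python
from itertools import product

# A tiny, hypothetical DSL of grid transformations.
PRIMITIVES = {
    "mirror_h": lambda g: [row[::-1] for row in g],
    "mirror_v": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
    "identity": lambda g: g,
}

def synthesize(train_pairs, max_depth=2):
    """Brute-force search over compositions of primitives that explain every demo pair."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(grid, names=names):
                for name in names:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(program(p["input"]) == p["output"] for p in train_pairs):
                return names, program
    return None, None

# A 90-degree clockwise rotation, expressible as a composition of the primitives above.
pairs = [{"input": [[1, 2], [3, 4]], "output": [[3, 1], [4, 2]]}]
names, program = synthesize(pairs)
print(names)                      # ('mirror_v', 'transpose')
print(program([[5, 6], [7, 8]]))  # [[7, 5], [8, 6]]
```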

ARC Prize organizers, including Chollet and Mike Knoop (co-founder of Zapier), frame this as a pivotal test for AGI timelines. Previous editions underscored stagnation: ARC-AGI-1 saw a high of 20 percent in 2019, dropping to under 10 percent by 2022 as models prioritized narrow capabilities. ARC-AGI-2, launched in March 2024 with a $600,000 prize, peaked at 53.5 percent on public eval using heavy test-time adaptation, but private scores hovered below 10 percent. ARC-AGI-3 resets the bar higher, compelling innovation beyond brute-force scaling.

Community efforts reflect diverse strategies. Open-source leaderboards showcase programmatic solvers like ICECUBE at 3.7 percent and DEN3 at 3.0 percent. Kaggle competitions encourage collaboration, with over 1,000 teams registered. Winners from prior rounds, such as Ryan Greenblatt’s o3 agent, demonstrate that chain-of-thought prompting and self-verification can nudge scores upward, yet systemic breakthroughs elude the field.
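The self-verification approach mentioned above is typically implemented as a generate-then-check loop: a model proposes many candidate programs, and only candidates that reproduce every demonstration pair are allowed to answer the test input. The sketch below assumes a generic propose_programs callable standing in for whatever model API an entrant uses; it illustrates the pattern rather than any specific winning pipeline.

```python
def verify(program_src, train_pairs):
    """Accept a candidate only if it reproduces every demonstration pair."""
    namespace = {}
    try:
        exec(program_src, namespace)  # candidate is untrusted code defining transform(grid)
        transform = namespace["transform"]
        return all(transform(p["input"]) == p["output"] for p in train_pairs)
    except Exception:
        return False

def solve_with_verification(task, propose_programs, n_candidates=128):
    """Generate many candidates, keep the first that verifies, apply it to the test inputs."""
    for program_src in propose_programs(task, n_candidates):  # hypothetical model call
        if verify(program_src, task["train"]):
            namespace = {}
            exec(program_src, namespace)
            return [namespace["transform"](t["input"]) for t in task["test"]]
    return None  # no verified candidate found
```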

The implications extend to AGI discourse. Proponents of scaling laws argue continued hardware and data growth will close the gap, but ARC-AGI-3 exposes their limits on core reasoning. Critics like Chollet advocate hybrid architectures blending neural nets with symbolic reasoning. As of the leaderboard’s launch, no frontier model exceeds 2 percent, reinforcing that today’s AI excels at interpolation but falters on extrapolation.

With the competition open until November 3, 2025, ARC-AGI-3 stands as a rigorous litmus test. Surpassing human baselines would signal genuine progress toward AGI, potentially reshaping industries reliant on adaptive intelligence.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.