GPT-5.2 Pro solves another Erdős problem while a new database reveals most attempts still fail


In a notable advancement for artificial intelligence in mathematics, GPT-5.2 Pro has successfully solved yet another problem from the renowned collection posed by mathematician Paul Erdős. This achievement underscores the growing prowess of large language models in tackling complex, open-ended mathematical challenges, even as a newly released database reveals the stark reality that most AI attempts on such problems still fall short.

Paul Erdős, one of the 20th century’s most prolific mathematicians, left behind a legacy of unsolved problems, many accompanied by cash prizes to incentivize solutions. These Erdős problems span graph theory, combinatorics, number theory, and other domains, testing the limits of human ingenuity. Over the years, a handful have been resolved, but many persist. The arrival of AI models like GPT-5.2 Pro marks a new era, in which computational reasoning can probe these enigmas at unprecedented scale.

The specific problem cracked by GPT-5.2 Pro involves a conjecture in extremal graph theory, building on its prior success with another Erdős challenge earlier this year. In the previous instance, the model navigated intricate probabilistic arguments to establish bounds on graph densities avoiding certain substructures. For this latest solution, GPT-5.2 Pro employed a multi-step reasoning process: generating hypotheses, verifying them through simulated proofs, and iteratively refining them against counterexamples. The model’s output included a complete formal proof, which independent mathematicians verified as correct. The proof leverages novel combinatorial identities and asymptotic analysis, techniques that align with classical approaches but are executed with machine-assisted precision.
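The hypothesis-verify-refine loop described above can be caricatured as a search that proposes a candidate, checks it, and uses any failure to tighten the next guess. A minimal, purely illustrative sketch follows; the toy "verifier" here is just an inequality check standing in for a proof checker, and the problem (smallest n with n*n >= 2024) is invented for the example:

```python
def verify(n, target=2024):
    """Stand-in 'proof checker': does the hypothesized bound n*n >= target hold?"""
    return n * n >= target

def find_counterexample(n, target=2024):
    """If the bound fails, report the witness (here just n itself)."""
    return None if verify(n, target) else n

def refine_loop(lo=1, hi=2024, target=2024):
    """Generate-verify-refine: bisect toward the smallest verified bound."""
    best = None
    while lo <= hi:
        guess = (lo + hi) // 2                       # generate a hypothesis
        if find_counterexample(guess, target) is None:
            best = guess                             # verified: record it, try to tighten
            hi = guess - 1
        else:
            lo = guess + 1                           # counterexample: raise the bound
    return best

# smallest n with n*n >= 2024 is 45 (44*44 = 1936, 45*45 = 2025)
print(refine_loop())  # prints 45
```

A real proof search replaces the inequality check with formal verification and the bisection with exploration of proof branches, but the control flow is the same: propose, check, learn from the failure.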

What sets GPT-5.2 Pro apart is its enhanced reasoning architecture, optimized for long-chain inference. Unlike earlier models that relied heavily on pattern matching from training data, this iteration simulates human-like deliberation, pausing to evaluate intermediate steps and backtrack on errors. The solve required over 10,000 tokens of internal reasoning, far exceeding typical query limits, highlighting the model’s capacity for sustained cognitive effort. OpenAI researchers noted that the solution emerged after prompting the model with the problem statement and encouraging exploratory variants, a technique refined through recent updates.

While this success is cause for optimism, it exists against a backdrop of broader struggles illuminated by the ErdosBench database, a comprehensive new repository launched this week. Curated by a collaboration of AI labs and academic institutions, ErdosBench aggregates over 5,000 attempts by leading models, including predecessors to GPT-5.2 Pro, on 127 verified Erdős problems. The database logs prompts, reasoning traces, final outputs, and human-verified outcomes, providing granular insights into failure modes.

Analysis of ErdosBench paints a sobering picture: only 2.3 percent of attempts yield correct solutions. Success rates plummet for problems requiring novel insights, with models succeeding in under 1 percent of cases involving infinite graphs or Diophantine approximations. Common pitfalls include hallucinated proofs, where models fabricate plausible but invalid lemmas; premature convergence to local optima in search spaces; and brittleness to slight problem reformulations. Even GPT-5.2 Pro, the top performer at 4.8 percent success, falters on more than 95 percent of tasks, often due to incomplete exploration of proof branches.

The database categorizes failures by type: syntactic errors in 28 percent of cases, logical gaps in 41 percent, and computational overflows in 19 percent. Visualizations in ErdosBench reveal progress over time, with success rates doubling since 2023, yet plateaus persist for high-difficulty problems. Metrics like proof length, reasoning depth, and verification time offer benchmarks for future models. Researchers emphasize that while scaled compute and better chain-of-thought prompting drive gains, fundamental limitations in symbolic reasoning remain.
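The headline numbers above are straightforward aggregates over attempt logs. A small sketch of how they could be computed, using made-up records rather than real ErdosBench data:

```python
from collections import Counter

# Made-up attempt records: (verified_correct, failure_type or None).
attempts = [
    (True, None),
    (False, "logical_gap"),
    (False, "syntactic"),
    (False, "logical_gap"),
    (False, "overflow"),
    (False, "other"),
]

def success_rate(records):
    """Fraction of attempts whose output was human-verified as correct."""
    return sum(ok for ok, _ in records) / len(records)

def failure_breakdown(records):
    """Share of each failure type among the unsuccessful attempts."""
    fails = [ftype for ok, ftype in records if not ok]
    counts = Counter(fails)
    return {ftype: n / len(fails) for ftype, n in counts.items()}

print(f"success rate: {success_rate(attempts):.1%}")
print(failure_breakdown(attempts))
```

With real data, the same tallies sliced by model and by year would reproduce the trend lines ErdosBench visualizes.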

This duality of success and struggle has profound implications for AI’s role in mathematics. On one hand, GPT-5.2 Pro’s solves accelerate discovery, potentially claiming Erdős prizes worth thousands of dollars and inspiring hybrid human-AI workflows. Mathematicians report using the model to generate proof sketches, accelerating their own work by factors of 5 to 10. On the other, ErdosBench serves as a cautionary benchmark, urging the field toward innovations in neurosymbolic integration, formal verification plugins, and training on curated proof corpora.

As AI continues to encroach on pure mathematics, events like this latest solve prompt reflection on creativity’s essence. Is machine reasoning a complement to human intuition or a harbinger of transformation? ErdosBench equips the community with data to guide that evolution, ensuring progress is measured rigorously.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.