New Benchmark Reveals Tight Race Among Top AI Providers: OpenAI, Anthropic, and Google Tied for Leadership

In the rapidly evolving landscape of artificial intelligence, a new benchmark has emerged to provide a clearer picture of frontier model performance. Artificial Analysis, an independent evaluator of AI models, has launched its Intelligence Index—a comprehensive quality metric designed to rank the capabilities of leading large language models (LLMs). The inaugural results show OpenAI, Anthropic, and Google locked in a three-way tie at the top, each achieving a score of 70 out of 100.

The Intelligence Index focuses primarily on model quality, aggregating performance across key domains such as reasoning, coding, mathematics, and vision. It draws from established evaluations including GPQA Diamond for expert-level science questions, AIME 2024 for advanced math competition problems, MATH-500 for challenging mathematical reasoning, LiveCodeBench for coding proficiency, and MMMU for multimodal understanding. Scores on these benchmarks are normalized to a 0-100 scale and combined using a geometric mean, ensuring balanced representation without overemphasizing any single area. This methodology aims to reflect real-world utility for complex, knowledge-intensive tasks.
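
As a rough illustration of the aggregation step, here is a minimal sketch in Python. The equal weighting and the per-benchmark scores are illustrative assumptions, not Artificial Analysis's exact methodology.

```python
from math import prod

def intelligence_index(scores: dict[str, float]) -> float:
    """Combine normalized 0-100 benchmark scores with an unweighted
    geometric mean (an assumption; the official weighting may differ)."""
    values = list(scores.values())
    return prod(values) ** (1 / len(values))

# Hypothetical per-benchmark scores for a single model
example = {
    "GPQA Diamond": 67.0,
    "AIME 2024": 72.0,
    "MATH-500": 78.0,
    "LiveCodeBench": 64.0,
    "MMMU": 70.0,
}
print(round(intelligence_index(example), 1))  # ≈ 70.0
```

A geometric mean penalizes lopsided profiles: a model that aces math but stumbles on coding scores lower than one that is merely solid everywhere, which matches the index's stated goal of balanced representation.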

Leading the pack are OpenAI’s o1-preview, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash Thinking Experimental, all scoring exactly 70. Each shows distinct strengths across the index’s hardest benchmarks: o1-preview excels at reasoning-heavy tasks, Claude 3.5 Sonnet is consistently strong in coding and math, and Gemini 2.0 Flash Thinking, an experimental variant, relies on extended chain-of-thought processing to match its rivals.

Close behind is xAI’s Grok 3 Reasoning at 68, followed by DeepSeek’s R1 at 67 and Grok 3 mini Reasoning at 66. Other notable performers include Meta’s Llama 4 Maverick (65), Mistral Small 3.1 (64), and DeepSeek V3 (63). The index highlights a narrowing gap among top-tier models, with incremental improvements driving intense competition.

Beyond quality, Artificial Analysis provides multifaceted comparisons, including price-performance ratios, output speed (tokens per second), and latency (time to first token). These metrics are crucial for practical deployment, as raw intelligence alone does not guarantee usability.
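
To see how these metrics interact, consider a minimal sketch: end-to-end response time is roughly time-to-first-token plus output length divided by throughput, and cost scales linearly with token volume. The profiles below are illustrative, loosely echoing figures quoted later in this article.

```python
def response_time(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """End-to-end latency: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s

def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of a given token volume at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million

# Illustrative profiles: a fast, cheap model vs. a slow, premium one
print(f"fast:    {response_time(0.2, 500, 200):.1f}s")  # 2.7s for 500 tokens
print(f"slow:    {response_time(1.0, 500, 20):.1f}s")   # 26.0s for 500 tokens
print(f"cheap:   ${cost_usd(1_000_000, 0.28):.2f}")     # $0.28 per million tokens
print(f"premium: ${cost_usd(1_000_000, 30.0):.2f}")     # $30.00 per million tokens
```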

In terms of cost efficiency, DeepSeek V3 stands out as the cheapest high-quality option at just $0.28 per million tokens—far below competitors like OpenAI’s o1-preview ($30 per million input tokens). This positions DeepSeek as an attractive choice for budget-conscious applications requiring solid reasoning capabilities.

On speed, xAI’s Grok 3 mini Reasoning leads output throughput at over 200 tokens per second, ideal for interactive scenarios, while Google’s Gemini 2.0 Flash Thinking has the lowest latency, delivering its first token in under 0.2 seconds and making it well suited to real-time chat interfaces.

The benchmark also underscores trade-offs. Premium models like o1-preview offer superior quality but at higher costs and slower speeds (around 20 tokens per second, with latencies exceeding one second). In contrast, lighter models such as Mistral Small 3.1 balance quality (64) with affordability ($0.50 per million tokens) and respectable speed (100+ tokens per second).
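
One back-of-the-envelope way to express this trade-off is index points per dollar. This is not a metric Artificial Analysis publishes; the sketch below simply divides the scores by the prices quoted in this article.

```python
models = {
    # (Intelligence Index score, USD per million tokens), per the article
    "o1-preview":        (70, 30.00),
    "Mistral Small 3.1": (64, 0.50),
    "DeepSeek V3":       (63, 0.28),
}

for name, (score, price) in models.items():
    print(f"{name:18s} {score / price:7.1f} index points per dollar")
# o1-preview ~2.3, Mistral Small 3.1 ~128.0, DeepSeek V3 ~225.0
```

By that crude yardstick the budget models win by two orders of magnitude, which is why the premium tier only makes sense when peak quality is non-negotiable.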

Artificial Analysis emphasizes transparency in its evaluations. All tests use the latest publicly available model versions at the time of evaluation, prompted through each provider’s API to mirror production conditions. Vision performance, measured by MMMU, is also broken out on its own, with top models scoring in the mid-70s. The index updates dynamically as new models ship, providing a living leaderboard at artificialanalysis.ai.

This tie at the summit signals maturity in the AI frontier. No single provider dominates across all dimensions, forcing developers to select models based on specific needs—whether prioritizing peak intelligence, cost savings, or responsiveness. For example, enterprises tackling advanced R&D might favor the tied trio, while startups could opt for DeepSeek’s value proposition.

The results also spotlight emerging challengers. xAI’s Grok models punch above their weight in speed and reasoning, potentially disrupting the market with future iterations. Open-source options like DeepSeek and Mistral demonstrate that proprietary barriers are not insurmountable, fostering broader accessibility.

As AI benchmarks proliferate, the Intelligence Index distinguishes itself through its rigorous, multi-domain focus and practical ancillary metrics. It serves as a vital tool for users navigating the crowded LLM marketplace, where claims of superiority often outpace verifiable evidence.

In summary, OpenAI’s o1-preview, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash Thinking Experimental define the current pinnacle of AI capability, each earning a 70 on the Intelligence Index. This equilibrium underscores the fierce innovation cycle propelling the field forward.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.