Anthropic leak reveals new model "Claude Mythos" with "dramatically higher scores on tests" than any previous model


A significant data leak from Anthropic has surfaced, revealing internal benchmark results for an unreleased AI model named Claude Mythos. Posted on 4chan by an anonymous user claiming to be an Anthropic employee, the leaked document details performance metrics that position Mythos as a substantial advancement over existing models, including Anthropic's own Claude 3 family and competitors like OpenAI's GPT-4o.

The leak originates from an internal evaluation framework labeled evalplus-prod-claude3.5, dated April 18, 2024. It compares Mythos, identified as claude-3.5-mythos-latest, against prior Claude iterations such as claude-3-opus-20240229, claude-3.5-sonnet-latest, and claude-3-haiku-20240307, as well as OpenAI's gpt-4o-2024-05-13. These results show Mythos leading across multiple standardized tests designed to assess large language model capabilities.

On the Massive Multitask Language Understanding benchmark, or MMLU, which evaluates knowledge across 57 subjects from STEM to the humanities, Mythos scores 89.1 percent. This edges out Claude 3 Opus at 88.7 percent, Claude 3.5 Sonnet at 87.2 percent, Claude 3 Haiku at 80.5 percent, and GPT-4o at 88.7 percent. MMLU remains a cornerstone metric for gauging broad factual recall and reasoning, and Mythos's narrow lead suggests refined training on diverse datasets.

Mythos excels most dramatically on GPQA Diamond, a rigorous graduate-level dataset in biology, physics, and chemistry curated to challenge expert-level reasoning. Here, it achieves 59.4 percent, surpassing Claude 3 Opus at 50.4 percent, Claude 3.5 Sonnet at 57.4 percent, Claude 3 Haiku at 44.9 percent, and GPT-4o at 53.6 percent. This nine-percentage-point improvement over Opus underscores enhanced logical inference under uncertainty, a persistent hurdle for AI systems.

In mathematics, specifically MATH Level 5 from the competition-level MATH benchmark, Mythos records 80.5 percent. This outperforms Claude 3 Opus at 60.3 percent, Claude 3.5 Sonnet at 71.1 percent, Claude 3 Haiku at 59.0 percent, and GPT-4o at 76.6 percent. Such gains suggest optimizations in symbolic manipulation and proof-based problem-solving, critical for advanced computational tasks.

The Multi-discipline Multimodal Understanding benchmark, MMMU, tests integrated text and image comprehension across college-level subjects. Mythos attains 74.4 percent, ahead of Claude 3 Opus at 68.4 percent, Claude 3.5 Sonnet at 70.1 percent, Claude 3 Haiku at 56.0 percent, and GPT-4o at 69.1 percent. This multimodal performance indicates progress in vision-language alignment, vital for real-world applications like visual question answering.

Additional evaluations reinforce Mythos's standing. On HumanEval, a code generation test, it scores 92.0 percent, matching Claude 3.5 Sonnet's 92.0 percent and beating Opus at 84.9 percent, Haiku at 75.9 percent, and GPT-4o at 90.2 percent. DROP, for discrete reasoning over paragraphs, yields 95.0 percent for Mythos against 94.4 percent for Opus and 94.5 percent for GPT-4o. On the multilingual MGSM benchmark, Mythos hits 92.3 percent, topping Opus at 91.1 percent and GPT-4o at 91.6 percent. Other metrics, such as MBPP at 92.6 percent and LiveCodeBench at 60.5 percent, round out a consistent picture.
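For readers tallying the leaked figures, the margins above can be computed directly. The sketch below uses only the scores as reported in the leak; the dictionary keys are shorthand labels chosen here for readability, not official model identifiers.

```python
# Scores (percent) as reported in the leaked eval document.
# Keys are informal shorthand, not Anthropic's model IDs.
scores = {
    "MMLU":         {"mythos": 89.1, "opus": 88.7, "sonnet-3.5": 87.2, "haiku": 80.5, "gpt-4o": 88.7},
    "GPQA Diamond": {"mythos": 59.4, "opus": 50.4, "sonnet-3.5": 57.4, "haiku": 44.9, "gpt-4o": 53.6},
    "MATH L5":      {"mythos": 80.5, "opus": 60.3, "sonnet-3.5": 71.1, "haiku": 59.0, "gpt-4o": 76.6},
    "MMMU":         {"mythos": 74.4, "opus": 68.4, "sonnet-3.5": 70.1, "haiku": 56.0, "gpt-4o": 69.1},
}

def lead_over_best_rival(bench):
    """Percentage-point lead of Mythos over the strongest non-Mythos model."""
    rivals = {name: score for name, score in bench.items() if name != "mythos"}
    return round(bench["mythos"] - max(rivals.values()), 1)

for name, bench in scores.items():
    print(f"{name}: +{lead_over_best_rival(bench)} pp")
```

Run against the reported numbers, this gives leads of 0.4 points on MMLU, 2.0 on GPQA Diamond, 3.9 on MATH Level 5, and 4.3 on MMMU, which matches the narrow-on-MMLU, wide-on-reasoning pattern described above.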

The leak also references claude-3.5-haiku-prod, scoring lower across the board, such as 80.5 percent on MMLU and 44.9 percent on GPQA Diamond, aligning with expectations for a lighter model variant.

This disclosure arrives amid Anthropic's preparations for Claude 3.5 Sonnet's public release, anticipated shortly after May 2024 teasers. Internal documents hint at Mythos as a successor or parallel development, potentially claude-3.5-mythos, with training compute exceeding prior efforts. The 4chan post includes JSON-formatted eval results and screenshots, lending credibility despite the forum's reputation for hoaxes.

Authenticity remains unverified. Anthropic has not commented, but the data format matches known internal tools, and scores show plausible incremental improvements without anomalies. If genuine, Mythos represents a leap, narrowing gaps with hypothetical next-generation rivals and raising the bar for safety-aligned AI scaling.

Benchmark caveats apply: these are internal evals, possibly optimistic, and real-world deployment involves latency, cost, and robustness factors absent here. Nonetheless, the leak spotlights Anthropic's rapid iteration, building on Claude 3's constitutional AI principles for helpful, honest, and harmless outputs.

For the AI community, this fuels speculation on release timelines and capabilities. Mythos could redefine frontiers in reasoning-intensive domains, from scientific research to software engineering, while intensifying competition among frontier model developers.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.