Google’s Gemini 3 Pro Leads AI Reliability Rankings Amid Persistent Hallucination Challenges
In the rapidly evolving landscape of artificial intelligence, reliability has emerged as a critical metric for evaluating large language models (LLMs). A new benchmark from Scale AI, the Multilingual Accuracy Evaluation (MAE), measures how accurately these models respond across diverse languages and tasks. According to recent evaluations, Google’s Gemini 3 Pro has claimed the top spot on this benchmark, demonstrating superior factual accuracy. However, the results also underscore a persistent issue in the AI domain: high rates of hallucinations, where models produce plausible but incorrect information. This duality highlights both the progress and the limitations of current AI technology.
The MAE benchmark represents a significant advancement in assessing AI reliability. Unlike previous evaluations that often focused on English-centric tasks, MAE incorporates 10 languages, including widely spoken ones like Spanish, French, and Hindi, as well as languages often underrepresented in AI training data, such as Arabic and Indonesian. It tests models on a variety of question-answering scenarios, ranging from simple factual queries to more complex reasoning tasks. The benchmark’s design emphasizes factual correctness, penalizing outputs that deviate from verified sources. Scale AI developed MAE to address the growing need for models that can operate reliably in global contexts, where linguistic diversity is a key factor.
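To make the scoring idea concrete, here is a minimal sketch of how a reliability benchmark might grade model outputs against verified answers. The three-way grading (correct / hallucinated / abstained) and the `score` function are illustrative assumptions, not Scale AI's published methodology:

```python
# Hedged sketch of benchmark scoring. The three-way grading scheme
# (correct / hallucinated / abstained) is an assumption for
# illustration, not MAE's actual published methodology.
from collections import Counter

def score(responses: list[str], answers: list[str]) -> dict[str, float]:
    """Grade each response against a verified answer; an empty
    response counts as an abstention rather than a hallucination."""
    grades = Counter()
    for response, answer in zip(responses, answers):
        if not response.strip():
            grades["abstained"] += 1
        elif response.strip().lower() == answer.strip().lower():
            grades["correct"] += 1
        else:
            grades["hallucinated"] += 1
    total = len(answers)
    return {k: grades[k] / total for k in ("correct", "hallucinated", "abstained")}

# One correct answer, one abstention, one confident wrong answer.
metrics = score(["Paris", "", "1815"], ["Paris", "Madrid", "1821"])
```

The key design choice such a metric captures is that a model which says "I don't know" is penalized less than one that asserts a fabrication, which is exactly the distinction hallucination benchmarks try to measure.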
Gemini 3 Pro, the latest iteration in Google’s Gemini series, outperformed competitors like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet in the MAE rankings. With reported accuracy above 90% in several categories, Gemini 3 Pro excelled particularly in multilingual tasks. This performance is attributed to its training on diverse datasets and fine-tuning techniques that prioritize factual alignment. In English-language evaluations, for instance, the model demonstrated robust handling of historical facts, scientific concepts, and current events, minimizing errors that plagued earlier versions.
Despite these strengths, the benchmark reveals that hallucinations remain a formidable challenge across the board. Even top-performing models like Gemini 3 Pro exhibited hallucination rates hovering around 10-15% in certain scenarios. Hallucinations occur when AI generates information that sounds authoritative but is entirely fabricated, often due to gaps in training data or overgeneralization from patterns. In the MAE tests, this manifested in responses to niche queries, such as obscure cultural references or specialized technical details, where models confidently asserted false claims. For example, when queried about lesser-known historical events in non-English languages, Gemini 3 Pro occasionally conflated dates or figures, leading to inaccuracies that could mislead users.
Comparatively, other leading models showed similar vulnerabilities. GPT-4o, while strong in creative and conversational tasks, lagged behind in factual precision, with higher hallucination instances in multilingual settings. Claude 3.5 Sonnet, praised for its ethical guardrails, performed admirably in safety-related evaluations but struggled with consistency in long-form responses. Open-source alternatives, such as Meta’s Llama 3.1, trailed the proprietary leaders, underscoring the resource-intensive nature of achieving high reliability. These findings suggest that while proprietary models benefit from vast computational resources, the core issue of hallucinations stems from inherent architectural limitations in transformer-based LLMs.
The implications of these results extend beyond academic benchmarks to real-world applications. In sectors like education, healthcare, and journalism, where accuracy is paramount, high hallucination rates pose risks. Users relying on AI for information retrieval might propagate errors unknowingly, amplifying misinformation. Scale AI’s report emphasizes the need for hybrid approaches, such as integrating retrieval-augmented generation (RAG) systems, which ground responses in external databases to reduce fabrications. However, even with such enhancements, the benchmark indicates that no model has fully eliminated hallucinations, with average rates across the tested suite exceeding 5%.
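The RAG approach mentioned above can be sketched in a few lines. Everything here is a toy stand-in: the keyword-overlap `retrieve` function substitutes for embedding similarity against a real vector database, and the prompt would be sent to an actual LLM rather than returned:

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# All names are illustrative; a production system would use a vector
# database for retrieval and an LLM API to answer the prompt.
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by keyword overlap with the query (a stand-in
    for embedding similarity) and return the top k."""
    ranked = sorted(documents,
                    key=lambda d: len(tokenize(d) & tokenize(query)),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved passages so the model answers from sources
    rather than from its parametric memory alone."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return ("Answer using ONLY the context below. If the context does not "
            f"contain the answer, say so.\n\nContext:\n{context}\n\n"
            f"Question: {query}")

docs = [
    "The Multilingual Accuracy Evaluation covers 10 languages.",
    "Transformer models predict the next token from context.",
]
prompt = build_grounded_prompt("How many languages does the evaluation cover?", docs)
```

The instruction to answer only from the supplied context, and to say so when the context is insufficient, is what lets RAG reduce (though, as the benchmark shows, not eliminate) fabricated answers.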
Google’s response to the MAE results has been cautiously optimistic. Representatives highlighted Gemini 3 Pro’s leadership as evidence of ongoing improvements in safety and reliability features. The model incorporates built-in mechanisms, like confidence scoring and citation generation, to flag potential uncertainties. Yet, experts caution that these tools are not foolproof; users must still verify outputs critically. The benchmark also calls attention to the importance of diverse evaluation frameworks. By including non-English languages, MAE exposes biases in training data that favor Western-centric knowledge, potentially skewing global AI deployment.
Looking ahead, the AI community is likely to see intensified efforts to combat hallucinations. Researchers are exploring techniques like self-verification loops, where models cross-check their own outputs, and advanced fine-tuning on adversarial datasets designed to provoke errors. Scale AI plans to expand MAE with more dynamic tasks, such as real-time fact-checking, to better simulate everyday use. For developers and enterprises adopting these models, the takeaway is clear: reliability benchmarks like MAE provide valuable insights, but they must be complemented by rigorous testing in specific use cases.
In summary, Gemini 3 Pro’s top ranking in the MAE benchmark marks a milestone in AI reliability, particularly for multilingual applications. However, the elevated hallucination rates serve as a reminder that the journey toward fully trustworthy AI is ongoing. As models grow more sophisticated, balancing innovation with accuracy will be essential to unlocking their full potential without unintended consequences.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.