Even Top AI Models Struggle with Factual Accuracy in the New “Facts” Benchmark
Large language models (LLMs) have made remarkable strides in generating human-like text, but their reliability on factual matters remains a critical concern. A new benchmark called “Facts,” developed by Vectara, exposes significant limitations in even the most advanced models. Released in late 2024, this evaluation framework rigorously tests LLMs’ ability to produce truthful statements when summarizing knowledge-intensive content. The results reveal that leading models, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, achieve only modest factual accuracy scores, often hallucinating at rates exceeding 40%.
Understanding the Facts Benchmark
The Facts benchmark addresses a core challenge in LLM evaluation: distinguishing between fluent but fabricated outputs and verifiable truths. Traditional benchmarks like TruthfulQA or HellaSwag focus on narrow tasks or rely on human judgments, which can be subjective. In contrast, Facts automates fact-checking at the sentence level using a novel retrieval-augmented pipeline grounded in Wikipedia.
The methodology is straightforward yet stringent. Evaluators select 4,000 passages from English Wikipedia articles, each passage containing 30 to 50 sentences. These passages cover diverse topics, ensuring broad knowledge coverage. Models are prompted to generate four-sentence summaries, and each generated sentence is then decomposed into atomic claims (individual, verifiable statements) using another LLM.
These claims undergo automated verification via Wikipedia search. A claim is deemed factual if it aligns precisely with retrieved evidence, partially factual if it matches loosely, and hallucinatory otherwise. The final score is the percentage of fully factual sentences, providing a clear metric for truthfulness. This approach minimizes human bias and scales effectively, making it suitable for continuous model monitoring.
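The pipeline described above reduces to a simple loop: decompose each summary sentence into atomic claims, verify each claim, and count the sentences whose claims are all factual. Here is a minimal Python sketch; the LLM-based decomposition and Wikipedia verification are replaced by toy stand-ins, and all function names and the fact set are illustrative, not the benchmark's actual code:

```python
# Minimal sketch of the Facts scoring loop. In the real pipeline, claim
# decomposition is done by an LLM and verification by Wikipedia retrieval;
# both are stubbed out here with toy stand-ins.
from typing import Callable, List

FACTUAL, PARTIAL, HALLUCINATED = "factual", "partial", "hallucinated"

def score_summary(sentences: List[str],
                  decompose: Callable[[str], List[str]],
                  verify: Callable[[str], str]) -> float:
    """Percentage of sentences whose atomic claims are all fully factual."""
    fully_factual = 0
    for sentence in sentences:
        labels = [verify(claim) for claim in decompose(sentence)]
        if labels and all(label == FACTUAL for label in labels):
            fully_factual += 1
    return 100.0 * fully_factual / len(sentences)

# Toy stand-ins for the LLM-based components:
def toy_decompose(sentence: str) -> List[str]:
    # Pretend atomic claims are separated by semicolons.
    return [claim.strip() for claim in sentence.split(";")]

KNOWN_FACTS = {"Paris is the capital of France",
               "Water boils at 100 C at sea level"}

def toy_verify(claim: str) -> str:
    return FACTUAL if claim in KNOWN_FACTS else HALLUCINATED

summary = [
    "Paris is the capital of France",
    "Paris is the capital of France; Water boils at 100 C at sea level",
    "The moon is made of cheese",
    "Water boils at 100 C at sea level; The moon is made of cheese",
]
print(score_summary(summary, toy_decompose, toy_verify))  # 50.0
```

Note the strictness this encodes: one hallucinated claim disqualifies the whole sentence, which is part of why scores land so much lower than on lenient metrics.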
Vectara’s researchers emphasize that Facts is designed for “closed-book” generation, where models rely solely on their parametric knowledge without external tools like retrieval-augmented generation (RAG). This setup mirrors real-world scenarios where users expect standalone answers from chatbots or assistants.
Performance of Leading Models
Benchmark results paint a sobering picture. Among 30 tested models, the highest factual accuracy score is 60.3%, achieved by Meta’s Llama 3.1 405B. OpenAI’s flagship GPT-4o scores 58.0%, while Claude 3.5 Sonnet manages 56.6%. Smaller or older models fare worse: GPT-4o mini hits 52.9%, and Mistral Large 2 reaches 55.1%.
Hallucination rates tell an even starker story. The top performer hallucinates in nearly 40% of its sentences, meaning two out of every five outputs contain fabrications. GPT-4o hallucinates 42% of the time, and Claude 3.5 Sonnet 43.4%. These figures fall well short of results on simpler tasks; for instance, on short-answer trivia like TruthfulQA, top models exceed 70% accuracy.
Notably, larger models do not consistently outperform smaller ones. Llama 3.1 70B scores 54.8%, trailing its bigger sibling but surpassing GPT-4o mini. Open-source models like Qwen2.5 72B (55.7%) show competitive results, suggesting that proprietary advantages are diminishing.
The benchmark also evaluates instruction-tuned variants. Fine-tuning for summarization improves fluency but not truthfulness—GPT-4o-long-lora, optimized for long contexts, drops to 54.9%. This indicates that current training paradigms prioritize coherence over veracity.
Comparisons to Existing Benchmarks
Facts proves more demanding than predecessors. On the RAGAS faithfulness metric, which evaluates summaries against references, top models score over 90%. Vectara’s analysis attributes this discrepancy to RAGAS’s leniency: it flags hallucinations only if they contradict references directly, overlooking unsubstantiated claims.
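The distinction can be made concrete with a toy checker. The sketch below is purely illustrative (neither RAGAS nor Facts operates on key-value pairs): it contrasts a lenient verifier that fails only on direct contradictions with a strict one that also fails unsupported claims:

```python
# Toy contrast between contradiction-only checking (RAGAS-style, as
# characterized in the text) and substantiation checking (Facts-style).
# The key-value fact representation is a simplification for illustration.

REFERENCE = {"capital_of_france": "Paris", "boiling_point_c": "100"}

def lenient_ok(key: str, value: str) -> bool:
    # Passes unless the claim directly contradicts a reference entry.
    return key not in REFERENCE or REFERENCE[key] == value

def strict_ok(key: str, value: str) -> bool:
    # Passes only if the claim is substantiated by the reference.
    return REFERENCE.get(key) == value

# An unsubstantiated claim: absent from the reference, but not contradicted.
claim = ("population_of_paris", "9 million")
print(lenient_ok(*claim), strict_ok(*claim))  # True False
```

The lenient checker waves the invented claim through because nothing in the reference contradicts it; the strict checker rejects it because nothing supports it.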
Similarly, GPT-4-as-a-judge, a popular automated evaluator, inflates scores by 20-30 points compared to Facts. Human evaluations align more closely with Facts but remain costly and inconsistent. By automating atomic claim verification, Facts offers a reliable, reproducible standard.
Cross-benchmark correlations are moderate (Pearson’s r around 0.7), underscoring the need for multifaceted testing. Facts excels at detecting subtle inventions, such as plausible but incorrect details about historical events or scientific facts.
Implications for AI Reliability
These findings challenge overly optimistic claims about LLM progress. As Amr Awadallah, Vectara’s CEO, notes, “Hallucinations are not a solved problem. Even state-of-the-art models routinely invent facts.” This is particularly alarming for high-stakes applications like legal research, medical advice, or journalism, where errors can have real-world consequences.
The benchmark highlights parametric knowledge gaps. Models trained on vast internet data still falter on Wikipedia-scale facts, suggesting incomplete memorization or overgeneralization. Instruction following helps—models prompted to “be factual” improve by 5-10%—but does not eliminate issues.
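A “be factual” nudge is just a prompt-level change. A hypothetical wrapper might look like the following; the exact wording used in the benchmark's experiments is not specified in the source, so this phrasing is an assumption:

```python
# Hypothetical prompt wrapper illustrating a factuality instruction.
# The instruction text is illustrative, not the benchmark's actual prompt.
def build_prompt(passage: str, factual_instruction: bool = True) -> str:
    instruction = "Summarize the passage below in four sentences. "
    if factual_instruction:
        instruction += ("Only state facts supported by the passage; "
                        "do not invent details. ")
    return instruction + "\n\nPassage:\n" + passage

print(build_prompt("The Eiffel Tower was completed in 1889."))
```

That a one-sentence instruction shifts scores by 5-10% suggests models hold more factual knowledge than they reliably surface by default.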
Vectara plans to expand Facts with multilingual support, longer contexts, and RAG integration. Leaderboards will update dynamically as new models emerge, fostering competition on truthfulness.
Toward More Trustworthy AI
The Facts benchmark underscores that fluency is no proxy for truth. Developers must prioritize factuality in training, perhaps through synthetic data emphasizing verifiable claims or contrastive learning against hallucinations. Users, meanwhile, should verify outputs, especially on knowledge-heavy queries.
While LLMs excel at creative tasks, their factual shortcomings demand caution. Benchmarks like Facts provide essential guardrails, reminding us that true intelligence requires grounding in reality.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.