AI models confidently describe images they cannot actually see, and standard benchmarks fail to catch it

AI Vision Models Hallucinate Confident Descriptions of Invisible Images, Evading Standard Benchmarks

Vision-language models (VLMs) have achieved remarkable proficiency in analyzing and describing images, powering applications from medical diagnostics to autonomous driving. However, a new study reveals a critical flaw: these models often generate highly detailed, confident descriptions of images that are entirely invisible to them, such as pitch-black squares or severely corrupted visuals. Standard evaluation benchmarks fail to detect this behavior, raising serious concerns about the reliability of VLMs in real-world deployment.

Researchers from the University of California, Berkeley, and Shanghai AI Laboratory document the problem in a paper titled “HallusionBench: An Advanced Diagnostic Benchmark for Entangled Language Image Comprehension.” Their investigation demonstrates that leading VLMs, including OpenAI’s GPT-4V and GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and others, routinely “hallucinate” vivid scene descriptions when confronted with non-informative inputs. For instance, when presented with a solid black image and prompted to “describe this image in detail,” GPT-4o might respond: “The image shows a vibrant beach scene at sunset, with golden sands meeting turquoise waves under a sky painted in hues of orange and pink. Palm trees sway gently in the background, and a few seashells dot the foreground.”
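This failure is easy to probe yourself. The following is a minimal sketch of such a test, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the prompt and image size are illustrative choices, not the study’s actual harness.

```python
# Minimal sketch: send an all-black PNG to a vision-language model and ask for a
# detailed description. Prompt wording and image size are assumptions for illustration.
import base64
import io

from openai import OpenAI
from PIL import Image


def black_image_data_url(size: int = 512) -> str:
    """Create a solid black PNG and encode it as a base64 data URL."""
    buf = io.BytesIO()
    Image.new("RGB", (size, size), (0, 0, 0)).save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": black_image_data_url()}},
        ],
    }],
)
print(response.choices[0].message.content)
```

A grounded model would note that the image is blank; a hallucinating one returns a scene description like the beach example above.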

This phenomenon persists across diverse prompts and image perturbations. The team tested models using three types of challenging inputs: pure black images, Gaussian noise, and cropped images retaining only 1 percent of the original pixels. Despite the absence of discernible content, models produced coherent narratives, often exceeding 100 words with specifics like object counts, colors, actions, and spatial relationships. Claude 3.5 Sonnet, for example, described a black square as “a cozy living room interior with wooden furniture, a rug, and warm lighting,” while Gemini 1.5 Pro imagined “a bustling city street at night with neon signs and pedestrians.”
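For readers who want to generate comparable inputs, here is a short sketch of the three perturbations, assuming 8-bit RGB images; the study’s exact noise parameters and crop placement are not specified, so the values below are assumptions.

```python
# Sketch of the three non-informative inputs described above. Noise sigma and the
# choice of a top-left corner crop are illustrative assumptions.
import numpy as np
from PIL import Image


def black_image(w: int = 512, h: int = 512) -> Image.Image:
    """Pure black image: no visual content at all."""
    return Image.new("RGB", (w, h), (0, 0, 0))


def gaussian_noise_image(w: int = 512, h: int = 512, sigma: float = 50.0) -> Image.Image:
    """Pixels drawn from Gaussian noise around mid-gray: structure-free input."""
    noise = np.random.normal(128, sigma, (h, w, 3)).clip(0, 255).astype(np.uint8)
    return Image.fromarray(noise)


def one_percent_crop(img: Image.Image) -> Image.Image:
    """Keep roughly 1 percent of the pixels (a 10% x 10% patch of the original)."""
    w, h = img.size
    return img.crop((0, 0, max(1, w // 10), max(1, h // 10)))
```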

Quantitative analysis underscores the severity. On black-image tests, GPT-4o achieved a 92 percent “success” rate in generating descriptions longer than 30 words, expressing 100 percent confidence in its assertions. Even when the prompt explicitly warned that the image “may be corrupted or black,” hallucinations continued unabated. The study attributes this behavior to training on vast internet-scale image-caption pairs, which biases models toward assuming every input contains describable content. Over-optimization on recognition tasks further reinforces the tendency, as models learn to prioritize fluent, detailed outputs over admitting uncertainty.

Traditional benchmarks mask the problem by design. Evaluations like MMMU (Massive Multi-discipline Multimodal Understanding), MathVista, ScienceQA, and MMBench use clean, high-quality images tightly paired with ground-truth text. These setups reward accurate recall but overlook failure modes in low-information scenarios. As a result, leaderboards crown models as state-of-the-art without probing their propensity for confident fabrication. HallusionBench addresses this gap with a curated dataset of 800 images spanning 20 categories, including indoor/outdoor scenes, objects, animals, and text. Each image comes with corrupted variants for which human annotators unanimously agree that no description is possible.
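To make the dataset layout concrete, an individual benchmark item might be modeled as below; the field names and structure are this article’s assumptions, not the released schema.

```python
# Illustrative record structure for one benchmark item: a clean image, its category,
# its corrupted variants, and the human verdict on whether anything is describable.
# All names here are hypothetical, not taken from the HallusionBench release.
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    image_path: str                  # clean source image, one of 20 categories
    category: str                    # e.g. "indoor scene", "animal", "text"
    corrupted_variants: list[str] = field(default_factory=list)  # black / noise / 1% crop
    human_describable: bool = False  # annotators unanimously agree nothing is describable
```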

The benchmark evaluates two facets: a Hallucination Score (HS), the rate of erroneous descriptions, and a Confidence Disagreement Score (CDS), which quantifies overconfidence relative to human judgments on a 1-5 scale. Leaderboard results paint a sobering picture: GPT-4o records an HS of 66.4 percent with a CDS of 2.92, Claude 3.5 Sonnet 63.5 percent and 2.88, and Gemini 1.5 Pro 61.2 percent and 2.76. Smaller models like Qwen-VL-Chat fare worse still, with HS exceeding 80 percent, and no model consistently refuses to describe invisible content.
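As a rough illustration of how such scores could be computed, the sketch below assumes HS is the fraction of non-informative inputs the model described anyway and CDS is the mean gap between model and human confidence on the 1-5 scale; the paper’s exact formulas may differ.

```python
# Illustrative metric computation under the stated assumptions; not the paper's code.
from statistics import mean


def hallucination_score(described: list[bool]) -> float:
    """Fraction of non-informative inputs that the model described anyway."""
    return sum(described) / len(described)


def confidence_disagreement_score(model_conf: list[int], human_conf: list[int]) -> float:
    """Mean absolute gap between model and human confidence, both on a 1-5 scale."""
    return mean(abs(m - h) for m, h in zip(model_conf, human_conf))


# Toy usage with made-up values:
print(hallucination_score([True, True, False, True]))       # 0.75
print(confidence_disagreement_score([5, 5, 4], [1, 2, 1]))  # ~3.33
```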

Further experiments reveal deeper entanglement in multimodal reasoning. When tasks require both vision and language, such as visual question answering (VQA) or optical character recognition (OCR), hallucinations propagate. Asked “What is the license plate number?” about a black image, GPT-4o invents “ABC-123.” In grounded captioning, where descriptions must align with verifiable elements, models insert fictional details 70 to 90 percent of the time.

Mitigation attempts yield mixed results. Chain-of-thought prompting slightly reduces hallucinations but boosts confidence in the outputs that remain false. Uncertainty estimation techniques, such as verbalized confidence scores, are often poorly calibrated, with models claiming 90 percent certainty on pure noise. The researchers advocate diagnostic benchmarks like HallusionBench to guide development, urging model creators to prioritize “visual grounding” mechanisms that detect input insufficiency before any text is generated.
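A crude prompt-level approximation of that grounding idea is shown below; it is only a sketch of one possible instruction, not the mechanisms the researchers propose, and the wording is an assumption.

```python
# Hypothetical system prompt asking the model to check for "input insufficiency"
# before describing anything. Illustrative wording; not from the paper.
GROUNDING_PROMPT = (
    "Before describing the image, first decide whether it contains any discernible "
    "visual content. If it is blank, solid-colored, pure noise, or too corrupted to "
    "interpret, answer exactly: 'I cannot describe this image: it contains no "
    "discernible content.' Only if clear content is present, describe it, and state "
    "your confidence from 1 (guessing) to 5 (certain)."
)
```

As the study’s miscalibration results suggest, prompt-level checks like this tend to reduce but not eliminate confident fabrication.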

This discovery challenges the narrative of VLMs as robust perceptual agents. In safety-critical domains, fabricated descriptions could mislead users or systems, from misdiagnosing X-rays to hallucinating road signs. As models scale, addressing these blind spots becomes imperative. The HallusionBench dataset and leaderboard are publicly available, inviting community scrutiny and improvement.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.