AI Vision Models Hallucinate When Sight Fails: Confabulation Over Honesty
Large language models with vision capabilities, such as OpenAI’s GPT-4V, Anthropic’s Claude 3 Opus, and Google’s Gemini Pro Vision, promise to interpret and describe images with human-like accuracy. However, recent tests reveal a troubling flaw: when these models encounter visual uncertainty, they do not admit ignorance. Instead, they confidently invent details, a behavior known as hallucination or confabulation. This raises serious questions about their reliability in real-world applications where precision matters.
The issue stems from the autoregressive nature of these models. Trained on vast datasets of image-text pairs, they predict the next token in a sequence based on patterns learned during training. When faced with ambiguous input, such as a blurry image or obscured objects, the model fills in gaps with plausible but fabricated content rather than signaling uncertainty. Unlike humans, who might say “I can’t see that clearly,” these AIs proceed with authoritative descriptions.
To quantify this, researchers at the University of Zurich ran systematic experiments, detailed in a paper titled “On Hallucinations in Multimodal Large Language Models.” They tested three prominent models: GPT-4V, Claude 3 Opus, and Gemini Pro Vision. The setup used POPE (Polling-based Object Probing Evaluation), a benchmark that measures object hallucination in vision-language models. Images from the VQA dataset featured everyday scenes with common objects like birds, cars, or people.
In the primary experiment, testers progressively obscured objects using black squares of increasing size: 20 percent, 40 percent, 60 percent, 80 percent, and 100 percent coverage. The models were prompted simply: “What objects do you see in this image?” Human performance served as a baseline; people accurately identified objects until coverage exceeded 80 percent, at which point they reported “black square” or “nothing visible.”
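The occlusion step can be sketched in a few lines of plain Python. This is only an illustration, not the researchers' code: the article does not say where the squares were placed or how coverage was measured, so the sketch assumes a centered black region whose area is the stated fraction of the image. The names `occlude_center` and `covered_fraction` are hypothetical, and the image is modeled as a nested list of grayscale values.

```python
import math


def occlude_center(image, coverage):
    """Return a copy of `image` with a centered black box covering
    roughly `coverage` (0.0 to 1.0) of its area.

    `image` is a list of rows; each row is a list of grayscale ints (0-255).
    Assumption: the box is centered and keeps the image's aspect ratio,
    so it is a true square only when the image itself is square.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    height, width = len(image), len(image[0])
    # Scale each side by sqrt(coverage) so the area scales by coverage.
    scale = math.sqrt(coverage)
    box_h = round(height * scale)
    box_w = round(width * scale)
    top = (height - box_h) // 2
    left = (width - box_w) // 2
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, top + box_h):
        for x in range(left, left + box_w):
            out[y][x] = 0
    return out


def covered_fraction(image):
    """Fraction of pixels that are pure black (value 0)."""
    total = sum(len(row) for row in image)
    black = sum(1 for row in image for px in row if px == 0)
    return black / total


if __name__ == "__main__":
    white = [[255] * 100 for _ in range(100)]
    for level in (0.2, 0.4, 0.6, 0.8, 1.0):
        masked = occlude_center(white, level)
        print(level, round(covered_fraction(masked), 3))
```

Because the box sides are rounded to whole pixels, the achieved coverage lands close to, but not always exactly on, the requested level.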
GPT-4V excelled initially, with low hallucination rates under partial occlusion. At 20 percent coverage, it hallucinated only 1.3 percent of the time. However, as occlusion grew, its rate spiked: 12.5 percent at 60 percent coverage, 40 percent at 80 percent, and a full 100 percent at total coverage, where it described invisible objects in detail. Claude 3 Opus performed worse overall, hallucinating 15.2 percent of the time even at 20 percent coverage and reaching 56.3 percent at 80 percent. Gemini Pro Vision started highest, confabulating 29.4 percent of the time at 20 percent coverage, and reached nearly 70 percent at full occlusion.
A second experiment used Gaussian blur instead of black squares, simulating out-of-focus photography. Blur levels ranged from mild (sigma=1) to severe (sigma=8). Again, the models fabricated details. GPT-4V hallucinated 4.4 percent of the time at sigma=1 but jumped to 80 percent at sigma=8. Claude 3 Opus reached 25 percent at mild blur and 90 percent at the most severe level. Gemini Pro Vision struggled most, exceeding 50 percent even at the mildest blur.
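For readers unfamiliar with the sigma parameter, here is a minimal pure-Python sketch of a separable Gaussian blur. It is not the tooling used in the paper, which the article does not name. The function names, the truncation of the kernel at about three sigma, and the clamp-to-edge border handling are all assumptions chosen for brevity.

```python
import math


def gaussian_kernel(sigma):
    """1-D Gaussian weights, truncated at about 3 sigma and normalized to sum to 1."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    radius = max(1, int(3 * sigma + 0.5))
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve one row or column with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat the edge pixel
            acc += w * values[j]
        out.append(acc)
    return out


def gaussian_blur(image, sigma):
    """Blur a grayscale image (list of rows) by filtering rows, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to rows


if __name__ == "__main__":
    spike = [[0] * 31 for _ in range(31)]
    spike[15][15] = 255  # a single bright pixel
    for sigma in (1, 2, 4, 8):
        print(sigma, round(gaussian_blur(spike, sigma)[15][15], 2))
```

The larger the sigma, the more a single bright pixel is spread over its neighbors, which is why fine detail is effectively gone at sigma=8.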
Qualitative examples highlight the absurdity. In one test, a fully blacked-out bicycle prompted GPT-4V to respond: “I see a bicycle in the foreground, with a person riding it.” Claude 3 described a blurred zebra image as “a black and white striped animal, likely a zebra, standing in a grassy field.” Even when the entire image was black, Gemini claimed: “The image shows a close-up of a person’s face wearing sunglasses.”
The paper also explored mitigation strategies. Adding instructions such as “Be accurate and avoid hallucination” or “Say ‘I don’t know’ if unsure” to the prompt yielded mixed results. GPT-4V improved slightly, with the uncertainty prompt reducing its full-occlusion hallucination rate from 100 percent to 78 percent. Claude 3 Opus and Gemini Pro Vision showed minimal gains, suggesting that the limitation lies in the models themselves and cannot be fixed by prompt engineering alone.
Further tests examined binary yes/no questions about specific objects, some present and some not. Under occlusion, models affirmed non-existent objects with high confidence. For instance, asking about a “teapot” in an image without one elicited “yes” responses 20-40 percent of the time.
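The yes/no measurement reduces to a simple ratio: among questions about objects that are not in the image, how often does the model answer “yes”? The sketch below assumes each record is a pair of (object actually present, model's raw answer) and that an answer counts as “yes” if it starts with that word. Both the record format and the function name `false_yes_rate` are hypothetical, and the sample answers are invented purely to show the arithmetic.

```python
def false_yes_rate(records):
    """Share of 'yes' answers among questions about absent objects.

    `records` is an iterable of (object_present, answer) pairs, where
    `object_present` is a bool and `answer` is the model's raw text reply.
    """
    absent_answers = [answer for present, answer in records if not present]
    if not absent_answers:
        raise ValueError("no questions about absent objects to score")
    yes_count = sum(
        1 for answer in absent_answers
        if answer.strip().lower().startswith("yes")
    )
    return yes_count / len(absent_answers)


if __name__ == "__main__":
    # Invented sample replies, not data from the paper.
    sample = [
        (False, "Yes, there is a teapot on the table."),
        (False, "No."),
        (False, "No, I do not see a teapot."),
        (False, "Yes"),
        (True, "Yes, a car is visible."),
    ]
    print(false_yes_rate(sample))
```

Questions about objects that really are present are ignored here, since a “yes” to those is a correct answer rather than a hallucination.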
Why do models confabulate? Training data rarely pairs heavily occluded or blurred images with captions that express uncertainty, so models never learn to voice doubt. Their design prioritizes fluent, coherent output over truthfulness. This contrasts with specialized object detectors like YOLO, which attach a confidence score to each detection, so low-confidence guesses can be filtered out rather than reported as fact.
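The detector-style alternative is easy to picture: every candidate detection carries a score, and anything below a cutoff is simply not reported. The sketch below is a generic illustration and does not call any real detector library. The (label, score) tuple format, the 0.5 default threshold, and the function names are assumptions.

```python
def confident_labels(detections, threshold=0.5):
    """Keep labels whose score meets the threshold, highest score first.

    `detections` is a list of (label, score) pairs with score in [0, 1].
    """
    kept = [d for d in detections if d[1] >= threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return [label for label, _ in kept]


def report(detections, threshold=0.5):
    """Describe the scene, or abstain when nothing clears the threshold."""
    labels = confident_labels(detections, threshold)
    if not labels:
        return "no confident detections"
    return ", ".join(labels)


if __name__ == "__main__":
    clear_scene = [("car", 0.92), ("bird", 0.31), ("person", 0.77)]
    blurry_scene = [("bird", 0.20), ("car", 0.12)]
    print(report(clear_scene))
    print(report(blurry_scene))
```

The abstention branch is the behavior the vision-language models in these tests lacked: a way to return nothing when the evidence is weak.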
Implications extend beyond benchmarks. In safety-critical domains like medical imaging or autonomous driving, confident fabrications could lead to errors. Autonomous systems relying on vision-language models might misinterpret road signs or pedestrians under poor visibility. The paper urges caution: users should verify outputs, especially with degraded inputs, and developers need better training regimes, perhaps incorporating adversarial examples of ambiguity.
Hallucination rates varied by model and occlusion type, but all exhibited the pattern: partial sight triggers invention, total blindness prompts elaborate fiction. GPT-4V leads in accuracy but falters under stress; Claude and Gemini lag consistently.
This phenomenon underscores a core challenge in multimodal AI: bridging perception and reasoning without human-like introspection. Until models learn to say “I can’t see,” their visual prowess remains brittle.