Even the Best AI Models Falter on Visual Tasks Toddlers Master with Ease
Recent research reveals a stark gap between the capabilities of cutting-edge artificial intelligence models and the intuitive visual understanding possessed by human infants. A study conducted by researchers from the University of Hamburg, led by Moahad Alhabbash and Nyasha Magadzire, alongside Luca Franzen from the University of Tübingen, exposes how top multimodal AI systems collapse when confronted with perceptual challenges that even six-month-old babies navigate effortlessly. These challenges stem from what cognitive scientists term “core knowledge priors,” fundamental cognitive abilities hardwired into human brains from birth.
Core knowledge priors encompass a suite of innate intuitions that enable infants to make sense of the physical world without explicit instruction. These include object permanence, the understanding that objects continue to exist even when out of sight; intuitive physics, such as anticipating how objects move or collide; causality, recognizing cause-and-effect relationships in actions; and agentivity, distinguishing intentional agents from inanimate objects. Toddlers demonstrate these through simple behaviors, like reaching for a toy hidden behind a screen or predicting a ball’s path after it rolls behind an obstacle. Adults retain these priors, achieving near-perfect performance on corresponding tests.
The researchers devised a rigorous evaluation framework to probe whether state-of-the-art AI vision-language models (VLMs) possess analogous abilities. They curated 20 short video clips, each one to two seconds long, depicting scenarios that elicit these core priors. Examples include a ball rolling behind a barrier and reappearing on the other side (testing continuity and object permanence), a block being pushed to launch another block (causality), or an arm intentionally grasping an object versus an accidental bump (agentivity). For each video, the team posed precise yes/no questions aligned with the prior being tested, such as “Does the blue block move because the green block hit it?” or “Is the ball the same one that comes out on the other side?”
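The paper does not publish its data schema, but the setup described above maps naturally onto a small record per clip: the video, the prior it probes, the yes/no question, and the ground-truth answer. The sketch below is illustrative only; field names and file paths are hypothetical, not the authors'.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PriorItem:
    """One benchmark item, as described in the study (names are illustrative)."""
    video_path: str  # one-to-two-second clip
    prior: str       # e.g. "object_permanence", "causality", "agentivity"
    question: str    # precise yes/no question aligned with the prior
    answer: bool     # ground truth: True means "yes"

# Two of the examples mentioned in the article, in this hypothetical format:
items = [
    PriorItem("clips/ball_barrier.mp4", "object_permanence",
              "Is the ball the same one that comes out on the other side?", True),
    PriorItem("clips/block_launch.mp4", "causality",
              "Does the blue block move because the green block hit it?", True),
]
```

Keeping the ground truth as a boolean makes scoring a straight equality check against each model's parsed yes/no reply.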
Twelve leading VLMs underwent testing, including heavyweights like OpenAI’s GPT-4V, GPT-4o, and o1-preview; Anthropic’s Claude 3.5 Sonnet and Claude 3 Opus; Google’s Gemini 1.5 Pro; Meta’s Llama 3.2 Vision; and others such as Qwen2-VL, LLaVA-OneVision, and Pixtral 12B. The models received the videos as input alongside the questions, with prompts designed for clarity and neutrality to minimize biases.
The results were sobering. Human baselines, drawn from adults and calibrated against infant studies, hovered around 80 to 95 percent accuracy across tasks. In stark contrast, the best-performing AI model, Claude 3.5 Sonnet, scraped by with just 34.4 percent overall accuracy. GPT-4o managed 32.5 percent, while Gemini 1.5 Pro lagged at 27.5 percent. On individual priors, failures were even more pronounced. For intuitive physics tasks, such as predicting whether a ball would emerge from behind a U-shaped obstacle, models averaged below 20 percent accuracy. Object permanence stumped them too: in videos where a ball passes behind a rectangular barrier, models frequently denied that the ball exiting was the same one, effectively endorsing a continuity violation.
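Per-prior accuracies like those above reduce to a tally of yes/no matches per category. The helper below is a minimal sketch of that bookkeeping under assumed boolean answers, not the authors' actual evaluation code.

```python
from collections import defaultdict

def accuracy_by_prior(records):
    """Aggregate yes/no correctness into per-prior accuracy.

    `records` is an iterable of (prior, predicted, truth) tuples,
    where predicted and truth are booleans (True means "yes").
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prior, predicted, truth in records:
        totals[prior] += 1
        hits[prior] += int(predicted == truth)
    return {p: hits[p] / totals[p] for p in totals}

# Toy example with made-up answers (not the paper's data):
records = [
    ("object_permanence", False, True),  # model denies it is the same ball
    ("object_permanence", True, True),
    ("causality", True, True),
]
print(accuracy_by_prior(records))  # {'object_permanence': 0.5, 'causality': 1.0}
```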
Causality detection fared marginally better but remained dismal: models like GPT-4V achieved only 40 percent on launch events, the classic Michotte launching paradigm. Agentivity proved particularly elusive; distinguishing deliberate grasps from inadvertent contacts produced error rates above 70 percent for most systems. Even advanced reasoning models like o1-preview, touted for chain-of-thought processing, floundered at 28.1 percent overall.
The study extended to video generation models, testing OpenAI’s Sora on similar physics scenarios. Sora generated clips of balls rolling behind barriers, but 80 percent violated basic continuity: balls vanished without reappearing or spawned anew, betraying a lack of grounded physical simulation.
Why do these models fail so spectacularly? The researchers attribute it to their reliance on statistical pattern matching from vast internet-scale datasets rather than genuine world models. VLMs excel at describing familiar scenes but crumble on edge cases requiring extrapolation of physical laws. As Alhabbash notes, “These models do not yet possess the rich intuitive physics infants have from the start.” Training data biases exacerbate issues; videos in datasets often lack the precise occlusions or timings needed for robust prior learning.
Prompt engineering offered scant reprieve. Even with detailed instructions, like "Think step by step about the physical laws," performance barely budged. Zero-shot prompting yielded the highest scores, suggesting the models overthink when over-prompted. The paper, published on arXiv and presented at the NeurIPS 2024 TinyModels Workshop, underscores that scaling alone will not bridge this chasm. Current architectures, predominantly transformer-based, prioritize linguistic fluency over causal reasoning.
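The zero-shot versus step-by-step contrast described above amounts to swapping a prefix on the same question. A minimal sketch follows; the prompt wording is an illustrative stand-in, not the paper's exact phrasing.

```python
def build_prompt(question: str, style: str = "zero_shot") -> str:
    """Wrap a yes/no question in one of the two prompting styles compared above.

    The style labels and wording are hypothetical approximations of the
    paper's conditions.
    """
    if style == "zero_shot":
        return f"{question} Answer yes or no."
    if style == "chain_of_thought":
        return ("Think step by step about the physical laws involved. "
                f"{question} Answer yes or no.")
    raise ValueError(f"unknown style: {style}")

q = "Is the ball the same one that comes out on the other side?"
print(build_prompt(q))                      # zero-shot wording
print(build_prompt(q, "chain_of_thought"))  # adds the reasoning instruction
```

Holding the question fixed while varying only the prefix is what lets the comparison isolate the effect of the prompting style itself.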
This benchmark, dubbed PriorsBench, provides a standardized tool for future evaluations. It highlights the need for AI systems grounded in simulation environments or developmental robotics to instill core knowledge. Until then, claims of human-level vision intelligence ring hollow. As Franzen observes, “Toddlers laugh at violations of these priors; AI models endorse them.”
The implications ripple across AI applications. Autonomous vehicles, robotic manipulation, and augmented reality demand reliable intuitive physics, yet these lapses signal ongoing risks. While VLMs dazzle in benchmarks like MMMU or MathVista, PriorsBench unmasks their perceptual brittleness.
In summary, despite enormous parameter counts and multimodal prowess, today's best AI vision models operate worlds apart from infant cognition. True visual intelligence demands more than correlation; it requires the innate scaffolding humans inherit.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.