Naver's "Seoul World Model" uses actual Street View data to stop AI from hallucinating entire cities

Artificial intelligence models, particularly those designed for robotics and autonomous navigation, often struggle with hallucinations. These fabrications occur when AI generates plausible but entirely fictional elements, such as nonexistent buildings or roads, leading to unreliable predictions in dynamic environments like cities. Traditional world models, which learn from video sequences to forecast future scenes, exacerbate this issue in complex urban settings where occlusions, varying lighting, and intricate layouts abound. To address this, Naver, South Korea’s leading internet company, has introduced the Seoul World Model (SWM), a pioneering approach that integrates actual street-view imagery from Naver Maps. By anchoring predictions in geospatial reality, SWM significantly reduces hallucinations, enabling more accurate robot navigation and path planning.

World models represent a critical advancement in AI for embodied agents. They compress observations into latent representations and autoregressively predict future states, facilitating planning without direct environment interaction. However, when trained solely on egocentric video, these models falter in long-horizon urban scenarios. They might invent entire city blocks or misalign trajectories, compromising safety and efficiency. Naver’s SWM tackles this by conditioning predictions on real-world map data, specifically panoramic street-view images geolocated with high precision.

The core innovation lies in SWM’s training dataset and architecture. Naver Maps boasts extensive coverage of Seoul, capturing 360-degree street-level panoramas at regular intervals. These images, paired with precise GPS coordinates, form a vast repository of grounded visual data. SWM leverages this to create a “map-conditioned video prediction model.” During training, the model receives an egocentric camera view from a robot’s perspective, along with the robot’s GPS position and intended trajectory. It then retrieves relevant street-view panoramas from Naver Maps corresponding to those coordinates and predicts future egocentric frames conditioned on both the current view and the map data.
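The retrieval step described above can be sketched in a few lines. The following is an illustrative, simplified example, not Naver's implementation: it finds the geolocated panorama nearest to a robot's GPS fix using great-circle distance. The panorama IDs and coordinates are made up for the sketch; the real Naver Maps index is far denser.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS coordinates."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_panorama(robot_gps, panoramas):
    """Return (panorama_id, distance_m) of the closest geolocated panorama."""
    pid, (plat, plon) = min(
        panoramas.items(),
        key=lambda kv: haversine_m(robot_gps[0], robot_gps[1], kv[1][0], kv[1][1]),
    )
    return pid, haversine_m(robot_gps[0], robot_gps[1], plat, plon)

# Toy panoramas with approximate coordinates near Gangnam (illustrative only).
panos = {
    "pano_a": (37.4979, 127.0276),
    "pano_b": (37.5000, 127.0300),
    "pano_c": (37.4950, 127.0250),
}
pid, dist = nearest_panorama((37.4980, 127.0278), panos)
```

A production pipeline would batch this lookup and pre-filter by map tiles rather than scanning every panorama, but the conditioning input is the same: the imagery captured closest to where the robot actually stands.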

This conditioning mechanism is implemented through a multimodal architecture. The model employs a video diffusion transformer backbone, similar to those in recent video generation systems, but augmented with geospatial awareness. Street-view inputs are projected into a shared latent space via a vision encoder and fused with egocentric observations through cross-attention layers. A GPS embedding further contextualizes the scene, ensuring predictions align with actual urban geometry. Training occurs on sequences in which robots navigate Seoul’s streets, with losses computed on reconstructed future frames. Ablation studies show that map conditioning sharply cuts hallucination rates: without it, models generate phantom structures 40 percent more frequently, as measured by metrics such as a Fréchet Video Distance variant adapted for urban scenes.
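To make the fusion step concrete, here is a minimal single-head cross-attention sketch in plain Python: egocentric tokens (queries) attend over street-view tokens (keys and values), after an additive GPS embedding conditions the queries on position. The dimensions, values, and the additive GPS scheme are assumptions for illustration, not Naver's actual architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(ego_tokens, map_tokens, scale):
    """Each egocentric token becomes an attention-weighted mix of map tokens."""
    fused = []
    for q in ego_tokens:
        weights = softmax([dot(q, k) / scale for k in map_tokens])
        fused.append([
            sum(w * v[i] for w, v in zip(weights, map_tokens))
            for i in range(len(q))
        ])
    return fused

d = 4                                           # toy token dimension
gps_embedding = [0.1, 0.0, -0.1, 0.0]           # assumed learned position code
ego = [[0.5, 0.2, 0.1, 0.0], [0.0, 0.3, 0.4, 0.1]]
street = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Condition the egocentric queries on position before attending.
ego = [[q_i + g_i for q_i, g_i in zip(q, gps_embedding)] for q in ego]
fused = cross_attention(ego, street, math.sqrt(d))
```

In the full model this happens per layer inside the diffusion transformer, with learned projections for queries, keys, and values; the sketch only shows why a fused token is pulled toward the map evidence rather than free-running imagination.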

SWM’s efficacy shines in robot navigation tasks. In simulations and real-world tests on instrumented vehicles in Seoul, the model outperforms baselines like VideoRT, a state-of-the-art unconditioned world model. For path planning, SWM unrolls predictions over 20-second horizons, scoring trajectories via a reward model that penalizes collisions and deviations from map-aligned paths. This enables hierarchical planning: high-level goals from GPS waypoints guide low-level actions, with the world model simulating outcomes to select optimal routes. Real deployments show SWM reducing navigation errors by 25 percent in dense areas like Gangnam, where visual ambiguities from crowds and signage prevail.
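The trajectory-scoring idea can be sketched as follows. This is a hedged toy version of the reward structure the article describes, with made-up weights and geometry: each candidate path accumulates a deviation cost from the map-aligned route plus a large penalty for points near a predicted obstacle, and the planner keeps the highest-scoring candidate.

```python
def score_trajectory(traj, route, obstacles, collision_radius=1.0):
    """Higher is better: penalize route deviation and near-obstacle points."""
    cost = 0.0
    for (x, y), (rx, ry) in zip(traj, route):
        cost += ((x - rx) ** 2 + (y - ry) ** 2) ** 0.5       # deviation term
        for ox, oy in obstacles:
            if ((x - ox) ** 2 + (y - oy) ** 2) ** 0.5 < collision_radius:
                cost += 100.0                                # collision penalty
    return -cost

route = [(0, 0), (1, 0), (2, 0), (3, 0)]          # map-aligned reference path
obstacles = [(2, 0.2)]                            # e.g. a predicted pedestrian
candidates = {
    "straight": [(0, 0), (1, 0), (2, 0), (3, 0)],      # hugs route, collides
    "swerve":   [(0, 0), (1, 0.5), (2, 1.2), (3, 0)],  # detours around obstacle
}
best = max(candidates, key=lambda k: score_trajectory(candidates[k], route, obstacles))
```

In SWM the "obstacles" would come from unrolled world-model predictions over the 20-second horizon rather than a hand-written list, but the selection logic, simulate, score, pick, is the same shape.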

Key technical contributions include scalable data curation and efficient inference. Naver processed petabytes of street-view data, aligning panoramas to egocentric views via pose estimation networks trained on synthetic renders. At inference, retrieval is optimized with vector databases indexing image embeddings, achieving sub-100ms latency on consumer GPUs. The model generalizes beyond training zones, adapting to unseen Seoul districts with minimal fine-tuning, thanks to the density of Naver Maps coverage.
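The embedding-indexed retrieval mentioned above reduces, at its core, to a top-k similarity search. Here is a minimal in-memory sketch using cosine similarity, with toy three-dimensional vectors standing in for panorama image embeddings; a production system would use an approximate-nearest-neighbor vector database to hit the sub-100ms budget.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den

def top_k(query, index, k=2):
    """Return the k panorama ids whose embeddings best match the query."""
    ranked = sorted(index, key=lambda pid: cosine(query, index[pid]), reverse=True)
    return ranked[:k]

# Illustrative index: ids and vectors are invented for the sketch.
index = {
    "pano_gangnam_001": [0.9, 0.1, 0.0],
    "pano_gangnam_002": [0.8, 0.2, 0.1],
    "pano_mapo_001":    [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], index)
```

Exhaustive scan is O(n) per query; ANN indexes trade a little recall for logarithmic-ish lookup, which is what makes the claimed latency plausible at city scale.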

Challenges remain, particularly in handling dynamic elements like pedestrians and vehicles absent from static street views. SWM mitigates this by blending predictions with online detection modules, though future iterations may incorporate temporal map updates. Ethical considerations, such as privacy in street-view data, align with Naver’s anonymization protocols, blurring faces and plates pre-training.
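The blend between static map-grounded prediction and online detection can be illustrated with a toy per-pixel composite, purely a sketch of the idea, not Naver's mechanism: wherever the detector flags a dynamic object, its pixels override the world model's static prediction.

```python
def blend(predicted, detections, mask):
    """Overwrite predicted pixels with detected ones where mask == 1."""
    return [
        [det if m else pred
         for pred, det, m in zip(prow, drow, mrow)]
        for prow, drow, mrow in zip(predicted, detections, mask)
    ]

# Toy single-channel frames: the world model predicts a static scene,
# the detector reports a pedestrian occupying the middle column.
predicted  = [[10, 10, 10], [10, 10, 10]]
detections = [[0, 99, 0], [0, 99, 0]]
mask       = [[0, 1, 0], [0, 1, 0]]   # 1 = dynamic object present
frame = blend(predicted, detections, mask)
```

Real systems would blend in feature or latent space with soft confidence weights rather than hard pixel masks, but the division of labor is the same: the map grounds the static world, the detector supplies what the map cannot know.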

Presented at the International Conference on Intelligent Robots and Systems (IROS) 2024, SWM marks a shift toward “geospatially grounded” world models. It paves the way for deployment in autonomous delivery robots, urban mobility services, and beyond, proving that real-world data integration is key to hallucination-free AI in cities. As robotics scales to megacities, innovations like SWM ensure AI perceives and plans within the bounds of reality.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.