The Case Against Predicting Tokens to Build AGI
In the race toward artificial general intelligence (AGI), the dominant paradigm revolves around large language models (LLMs) trained primarily through next-token prediction. This approach, epitomized by models like GPT-4 and its successors, involves predicting the most likely sequence of tokens in a stream of text based on vast datasets scraped from the internet. Proponents, including leading labs like OpenAI, argue that scaling up compute, data, and model size will inevitably yield AGI. However, a growing chorus of skeptics contends that this method is fundamentally flawed, unlikely to produce true general intelligence no matter the scale. This article examines the core limitations of token prediction as a path to AGI, drawing on empirical evidence and theoretical insights.
At its heart, next-token prediction is a form of statistical pattern matching. LLMs excel at this task because human-generated text is highly predictable: given a context, the next word often follows probabilistic regularities derived from billions of examples. This yields impressive capabilities in language generation, translation, summarization, and even coding. Yet, these feats stem from mimicry rather than comprehension. Consider hallucinations, where models confidently output fabricated facts. This occurs because the model optimizes for fluency over truthfulness; a plausible continuation scores high regardless of accuracy. Studies, such as those analyzing GPT-3’s performance on factual recall, reveal error rates exceeding 20% on simple trivia, underscoring the absence of a robust world model.
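The objective gap described above can be made concrete with a toy sketch. The snippet below is purely illustrative (a tiny bigram model over an invented corpus, nothing like a real LLM): it shows how maximizing next-token likelihood rewards the statistically common continuation whether or not the resulting claim is true.

```python
from collections import Counter, defaultdict

# Toy bigram "language model": pick the most frequent next token seen in
# training. It optimizes for plausibility, not truth -- the same objective
# gap behind hallucinations. Corpus and example are invented.
corpus = (
    "the capital of france is paris . "
    "the capital of spain is madrid . "
    "the capital of france is paris . "
    "the capital of atlantis is paris . "  # one noisy/false training example
).split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(prev):
    # Return the statistically most likely continuation of the last token.
    return counts[prev].most_common(1)[0][0]

# The model fluently "answers" about a fictional place, because "is -> paris"
# is the dominant pattern in the data, regardless of factual accuracy.
prompt = ["the", "capital", "of", "atlantis", "is"]
print(predict(prompt[-1]))  # prints "paris"
```

A real transformer conditions on the whole context rather than one token, but the training signal is the same: score fluent continuations highly, with no term in the loss for truthfulness.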
True AGI demands more than fluent imitation—it requires understanding causality, engaging in long-term planning, and adapting to novel environments. Token prediction falls short here. LLMs lack an explicit representation of the physical or logical world. They cannot reliably simulate physics beyond memorized patterns, nor do they perform symbolic reasoning natively. For instance, in tasks requiring multi-step arithmetic or compositional generalization (e.g., applying a rule from one domain to another untrained scenario), performance plummets. Benchmarks like ARC (Abstraction and Reasoning Corpus) expose this brittleness: even top models score below 50%, far from human levels around 85%. The model’s “knowledge” is distributed across billions of parameters as latent correlations, not structured beliefs amenable to verification or update.
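The compositional-generalization failure above can be illustrated with a SCAN-style toy (the commands and "training set" here are invented for illustration): a pure pattern-memorizer fails on an unseen combination of familiar parts, while a system with an explicit symbolic rule transfers immediately.

```python
# Toy contrast: pattern memorization vs. compositional rule application.
# "jump twice" was seen in training; "walk twice" was not.
train = {
    "jump": ["JUMP"],
    "jump twice": ["JUMP", "JUMP"],
    "walk": ["WALK"],
}

def memorizer(cmd):
    # Lookup over seen examples only -- no rule is ever extracted.
    return train.get(cmd)  # None on any unseen combination

def compositional(cmd):
    # Apply the "twice" modifier symbolically, so it transfers to new verbs.
    if cmd.endswith(" twice"):
        return compositional(cmd[: -len(" twice")]) * 2
    return [cmd.upper()]

print(memorizer("walk twice"))      # None -- fails on the unseen combination
print(compositional("walk twice"))  # ['WALK', 'WALK']
```

LLMs sit somewhere between these extremes, but benchmarks like ARC suggest they lean far closer to the memorizer than to the rule-applier when the test distribution shifts.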
Scaling advocates invoke emergent abilities—sudden jumps in capability as models grow larger. While real, these are often overstated. Phenomena like in-context learning (adapting to tasks from a few examples) emerge around 10^23 FLOPs but plateau thereafter. Recent analyses of scaling laws indicate diminishing returns: performance on reasoning tasks improves sublinearly with compute, suggesting fundamental bottlenecks. If token prediction were sufficient, we might expect consistent progress toward superhuman levels across domains. Instead, LLMs struggle with agency: they cannot pursue goals autonomously without human-specified prompts, and reinforcement learning from human feedback (RLHF) merely aligns outputs to preferences without instilling intrinsic motivation.
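The diminishing-returns argument follows directly from the power-law shape of scaling curves. As a worked example (the constants below are illustrative, in the range reported by published scaling-law studies, not measurements of any particular model), each 10x increase in compute buys a smaller absolute improvement in loss:

```python
# Hypothetical power-law scaling curve: loss(C) = a * C**(-alpha).
# a and alpha are illustrative constants, not fitted to any real model.
a, alpha = 10.0, 0.05

def loss(compute_flops):
    return a * compute_flops ** (-alpha)

# Each order of magnitude of compute yields a shrinking absolute gain.
for exp in range(21, 26):
    c = 10.0 ** exp
    print(f"1e{exp} FLOPs -> loss {loss(c):.3f}")
```

Under any curve of this shape, the gain from 10^24 to 10^25 FLOPs is strictly smaller than the gain from 10^21 to 10^22, even though the second step costs a thousand times more.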
Contrast this with historical paths to intelligence. Biological cognition integrates sensory-motor loops, hierarchical planning, and symbolic abstraction—none of which token prediction replicates. Early AI successes, like chess engines, relied on search and evaluation functions, not sequence modeling. Modern alternatives echo this: world models (e.g., as in Dreamer or MuZero) learn predictive simulations of environments, enabling planning via Monte Carlo tree search. Reinforcement learning agents, such as those mastering Atari or robotics, optimize policies through trial-and-error interaction, building causal understanding absent in passive text prediction.
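The planning loop those systems share can be sketched in a few lines. The toy below (a one-dimensional world with invented dynamics, standing in for the learned simulators and tree search of systems like MuZero) shows the key structural difference from token prediction: the agent queries a transition model to evaluate action sequences *before* acting.

```python
from itertools import product

# Minimal model-based planning sketch. Everything here is illustrative:
# a real system would use a learned world model and a search such as MCTS
# rather than exhaustive enumeration.
GOAL = 5

def model(state, action):
    # Transition model: +1 / -1 moves on a number line.
    return state + (1 if action == "right" else -1)

def plan(state, horizon=6):
    # Simulate candidate action sequences against the model; return the
    # first rollout that reaches the goal.
    for seq in product(["right", "left"], repeat=horizon):
        s = state
        for i, action in enumerate(seq):
            s = model(s, action)
            if s == GOAL:
                return seq[: i + 1]
    return None

print(plan(2))  # ('right', 'right', 'right')
```

A next-token predictor has no such inner loop: it emits its answer in one forward pass, with no mechanism to simulate consequences and revise before committing.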
Hybrid architectures offer a promising escape. Neurosymbolic systems combine neural pattern recognition with discrete logic engines, enabling verifiable reasoning. For example, pairing LLMs with theorem provers or code interpreters (a direction explored in systems like o1-preview) yields checkable intermediate steps, mitigating hallucinations. Similarly, retrieval-augmented generation (RAG) grounds outputs in external knowledge bases, but this remains a patch, not a cure. Test-time compute—allocating more inference resources to chain-of-thought prompting—boosts performance, but costs grow steeply without addressing the core deficits.
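The code-interpreter pattern is worth spelling out, because it shows what "checkable" means in practice. The sketch below is a hedged illustration, not any product's real API: a model's arithmetic claim (hard-coded here as a stand-in for an LLM output) is re-derived by a small, safe expression evaluator instead of being trusted.

```python
import ast
import operator

# Illustrative "LLM + interpreter" verification loop: re-execute a model's
# claimed arithmetic with a restricted evaluator (literals and + - * / only),
# rather than trusting the stated answer.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

# Stand-in for a model's output: a fluent but wrong claim.
claimed = {"expression": "17 * 24 + 3", "answer": 408}
verified = safe_eval(claimed["expression"])
print(verified, verified == claimed["answer"])  # prints "411 False"
```

The model supplies the candidate derivation; the symbolic component supplies the guarantee. Neither alone gives you both flexibility and correctness, which is the neurosymbolic argument in miniature.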
Critics like François Chollet argue that intelligence is about efficient skill acquisition on unseen tasks, measured by sample complexity. LLMs require enormous data to approximate narrow skills, failing Chollet’s ARC benchmark despite trillions of tokens. Yann LeCun advocates objective-driven architectures with energy-based models for planning and discrete computation. Even Sam Altman has hinted at post-LLM paradigms, suggesting multimodal integration and real-world embodiment as next frontiers.
Empirical evidence mounts against pure scaling. Post-GPT-4 models show incremental gains amid ballooning costs: training runs now exceed $100 million, with energy demands rivaling small cities. Diminishing returns imply that exaFLOP-scale efforts might yield only marginal improvements in generality. Moreover, data exhaustion looms—high-quality text is finite, forcing reliance on synthetic data that risks model collapse.
In conclusion, while next-token prediction revolutionized narrow AI, it is a cul-de-sac for AGI. It produces sophisticated parrots, not thinkers. Progress demands paradigm shifts: toward active learning, structured world models, and verifiable computation. Labs that ignore this risk a sunk-cost fallacy, pouring resources into a method that is arguably insufficient for causal reasoning, planning, or generalization. The path to AGI lies not in bigger predictors, but in architectures that reason explicitly about the world.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.