Humans as the New Bottleneck in AI Research: Insights from Andrej Karpathy
Andrej Karpathy, a prominent figure in artificial intelligence with a distinguished career at OpenAI, Tesla, and now his own Eureka Labs, has argued that humans have become the primary bottleneck in AI research. This marks a significant shift from the field's early days, when computational resources and data availability constrained progress. Karpathy's observation, shared during a recent discussion, underscores how predictable scaling has transformed AI development into a field where human ingenuity now sets the pace of breakthroughs.
Historically, AI advancement hinged on scaling three key factors: compute power, model size, and training-data volume. Researchers relied on scaling laws, which provided reliable predictions of performance gains. For instance, doubling compute resources typically yielded measurable improvements in model capabilities, as seen in the steady, predictable progress on benchmarks like ImageNet for vision tasks or GLUE for language understanding. This era allowed teams to forecast outcomes with high confidence; if a model underperformed, the cause was usually insufficient resources rather than flawed methodology.
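The kind of forecast described above can be sketched with the standard power-law functional form from the scaling-law literature, loss(C) = a · C^(-b) + L_inf. The coefficients below are made up purely for illustration; real values come from fitting empirical training runs.

```python
# Illustrative scaling-law sketch. A, B, and L_INF are hypothetical
# coefficients chosen for demonstration, not values from any real model.
A = 2.0      # scale coefficient (hypothetical)
B = 0.05     # scaling exponent (hypothetical)
L_INF = 1.7  # irreducible loss floor (hypothetical)

def predicted_loss(compute: float) -> float:
    """Predicted training loss for a given compute budget (arbitrary units)."""
    return A * compute ** (-B) + L_INF

# Doubling compute yields a predictable but diminishing improvement:
base = predicted_loss(1e21)
doubled = predicted_loss(2e21)
print(f"loss at C:  {base:.4f}")
print(f"loss at 2C: {doubled:.4f}")
print(f"gain from doubling: {base - doubled:.4f}")
```

The small, smooth gain from doubling compute is exactly what made this era so forecastable: teams could budget hardware against an expected loss before training ever started.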
Karpathy illustrates this with concrete, quantifiable results. Consider large language models (LLMs). When evaluated on standardized tests such as MMLU (Massive Multitask Language Understanding), early models scored around 60 percent. With increased scaling, scores climbed predictably to near-human levels, hovering between 88 and 92 percent across leading models like GPT-4 and Claude 3.5 Sonnet. Similar saturation appears on other metrics: GPQA (Graduate-Level Google-Proof Q&A) benchmarks show top models at 50 to 60 percent, while MATH and HumanEval coding challenges reach 80 to 90 percent proficiency. These plateaus indicate that further resource escalation yields diminishing returns, often just 1 to 2 percentage point gains.
The predictability of these improvements stems from established scaling relationships. Karpathy notes that performance on most academic benchmarks follows a power-law curve when plotted against compute expenditure. This makes progress “easy to measure,” as teams can run standardized evaluations post-training to verify adherence to expectations. Deviations below the curve signal issues like data contamination or inefficient training, which are now routine to diagnose and mitigate.
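Because the power law is linear in log-log space, fitting the expected curve and flagging a run that falls below it is straightforward. The sketch below uses synthetic numbers throughout (the loss floor, the reference runs, and the tolerance are all assumptions for illustration):

```python
import math

# Diagnose a training run against an expected power-law trend.
# Fit log(loss - floor) = log(a) - b * log(C) through two prior
# healthy runs, then flag a new run that lands above the curve.
FLOOR = 1.7  # assumed irreducible loss (hypothetical)

# (compute, observed loss) from two prior healthy runs (synthetic)
runs = [(1e20, 1.900), (1e21, 1.878)]

xs = [math.log(c) for c, _ in runs]
ys = [math.log(l - FLOOR) for _, l in runs]
b = -(ys[1] - ys[0]) / (xs[1] - xs[0])  # scaling exponent from the fit
log_a = ys[0] + b * xs[0]

def expected_loss(compute: float) -> float:
    """Loss the fitted trend predicts at a given compute budget."""
    return math.exp(log_a - b * math.log(compute)) + FLOOR

# A new run whose loss sits noticeably above the fitted curve
new_compute, new_loss = 5e21, 1.95
tolerance = 0.02
if new_loss > expected_loss(new_compute) + tolerance:
    print("run underperforms the scaling trend; check data and training setup")
```

This is the sense in which deviations are "routine to diagnose": the expected number exists before the run finishes, so an anomaly is a simple comparison rather than a judgment call.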
Yet, this reliability exposes the human bottleneck. To surpass current frontiers, innovation must transcend brute-force scaling. Karpathy emphasizes the need for novel algorithms, architectures, and training paradigms that unlock non-linear leaps. Examples include breakthroughs like transformers, which revolutionized sequence modeling, or diffusion models that redefined image generation. Such advances require creative problem-solving from researchers, not merely more hardware.
Karpathy’s perspective draws on frontline experience. At Tesla, he led the vision team behind Autopilot’s neural networks, scaling models to process vast video datasets for real-time driving decisions. At OpenAI, where he was a founding member, he witnessed firsthand how scaling propelled deep learning and conversational AI forward. Now, with Eureka Labs focused on AI for education, he observes that while student-facing tools like an AI tutor deliver consistent results via scaling, pushing educational outcomes further demands human-designed curricula and interaction loops.
This bottleneck manifests organizationally too. Elite labs like OpenAI, Anthropic, and Google DeepMind hoard top talent, amplifying their edge. Karpathy likens it to physics post-World War II, where a few brilliant minds drove particle accelerator discoveries. Today, AI progress correlates with researcher quality over raw compute budgets. Smaller teams with exceptional ideas can outpace giants, as seen in open-source releases like Llama from Meta.
Metrics reinforce this human-centric view. Leaderboards track not just model scores but the ingenuity behind them. For example, o1-preview’s chain-of-thought reasoning boosted performance on challenging tasks, a human-inspired technique scaled effectively. However, inventing the next such method requires deep domain expertise and experimentation.
Karpathy warns against complacency. While scaling remains viable for incremental gains, true paradigm shifts like multimodal integration or agentic systems demand interdisciplinary thinking. He advocates training more specialized researchers, akin to PhD programs in theoretical physics, to sustain momentum.
In essence, AI research has matured into a human-limited endeavor. Compute and data, once scarce, are now abundant and predictable. The challenge lies in leveraging them through superior ideas. As Karpathy puts it, the field awaits its next Einstein-equivalent innovator to shatter existing ceilings.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.