Current Language Model Training Overlooks Vast Swaths of the Internet

Large language models (LLMs) power many of today’s most advanced AI applications, from chatbots to code generators. These models rely on massive datasets scraped from the web, primarily through archives like Common Crawl. However, a recent analysis reveals a critical limitation: current training pipelines capture only a tiny fraction of the internet’s content. This gap means LLMs miss out on enormous amounts of valuable data, potentially hindering their development and generalization.

The study, conducted by researchers including Eric Wang from Brown University, examined the coverage of Common Crawl, the dominant source for LLM pretraining datasets. Common Crawl periodically crawls the web, archiving billions of pages monthly. Datasets like The Pile, C4, and FineWeb derive from these crawls after filtering for quality, deduplication, and language. Yet, the researchers found that even after processing, these datasets represent just a sliver of the total web.

Key findings highlight the scale of the untapped internet. Across 808 English-language domains sampled from the Majestic Million top sites, Common Crawl captured only 27 percent of the domains and a mere 5 percent of all pages. For the full sample of over 7,000 domains, coverage dropped further, with 60 percent of domains entirely absent. This disparity arises because crawlers like those used by Common Crawl prioritize popular, publicly accessible pages but struggle with modern web complexities.

Several factors contribute to this incomplete harvest. Paywalls block access to premium content on sites like The New York Times or academic journals, which require subscriptions. Single-page applications (SPAs) built with JavaScript frameworks such as React or Angular load content dynamically after the initial HTML fetch. Traditional crawlers retrieve only that static HTML, missing the JavaScript-rendered content that now dominates many high-value sites. For instance, 40 percent of sampled pages relied heavily on JavaScript, making them effectively invisible to standard crawlers.
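
To see why static fetching falls short, here is a minimal sketch of what a crawler that does not execute JavaScript actually retrieves. It assumes the requests and beautifulsoup4 packages, and the URL is a hypothetical single-page application rather than one from the study:

```python
# Minimal sketch: what a static crawler "sees" on a JavaScript-heavy page.
# Assumes requests and beautifulsoup4 are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example-spa.com/article/123"  # hypothetical SPA page
html = requests.get(url, timeout=10).text    # only the initial HTML shell is returned

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)

# On many SPAs the shell holds little more than a loading placeholder and script tags;
# the article text is injected later by client-side JavaScript, so it never appears
# in this extraction.
print(f"Visible text: {len(text)} characters")
print(text[:200])
```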

User-generated content exacerbates the issue. Forums, comment sections, and social media platforms like Reddit host dynamic discussions that change rapidly or require authentication. E-commerce sites personalize pages based on user sessions, evading static crawls. The researchers noted that even when pages are crawled, postprocessing filters often discard them. Deduplication removes near-identical content, while heuristics for quality (like n-gram repetition or PII detection) eliminate noisy forum posts or boilerplate-heavy pages.
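
To make the quality-filtering step concrete, here is a simplified version of the kind of n-gram repetition heuristic the study mentions. The choice of n and the threshold are illustrative placeholders, not the exact rules of any particular pipeline:

```python
# Simplified n-gram repetition filter of the kind used in pretraining pipelines.
# The value of n and the cutoff are placeholders for illustration only.
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def passes_quality_filter(text: str, max_repetition: float = 0.2) -> bool:
    # Pages dominated by repeated phrases (navigation menus, boilerplate) are dropped;
    # noisy but legitimate forum posts can get caught in the same net.
    return ngram_repetition_ratio(text) <= max_repetition

print(passes_quality_filter("Log in to continue. Log in to continue. Log in to continue."))  # False
```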

Temporal dynamics further widen the gap. The web evolves quickly; a page crawled today may be outdated tomorrow. Common Crawl’s monthly snapshots lag behind, and datasets aggregate crawls over years, diluting recency. High-value domains update frequently, yet only 15 percent of pages persist across multiple crawls. Infrequent crawls of long-tail sites mean niche blogs, forums, and regional content remain perpetually overlooked.

The implications for LLM training are profound. Missing content skews models toward mainstream, Western-centric perspectives. News from major outlets overshadows independent journalism; academic papers behind paywalls exclude specialized knowledge. User forums, rich in practical advice and edge cases, are underrepresented, potentially weakening models on real-world queries. The study estimates that expanding coverage could increase training data by 20 times or more, without needing novel sources.

Efforts to address these shortcomings exist but face hurdles. Enhanced crawlers like Crawlee or Firecrawl incorporate JavaScript rendering via headless browsers, boosting yield on SPAs. However, scaling this globally demands immense compute resources; rendering billions of pages is infeasible for nonprofit efforts like Common Crawl. Commercial services offer APIs for dynamic scraping, but licensing and ethical concerns arise, especially with robots.txt compliance and terms of service.
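
As a rough sketch of how JavaScript rendering gets added to a crawler, the snippet below uses Playwright’s headless Chromium; Crawlee and Firecrawl wrap similar machinery, and the URL is again a placeholder:

```python
# Sketch of JavaScript-aware fetching with a headless browser (Playwright sync API).
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str, timeout_ms: int = 15000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so client-side content has loaded.
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        html = page.content()  # the DOM after JavaScript execution
        browser.close()
    return html

# Compared with a raw HTTP fetch this returns the fully rendered page, but at far
# greater compute cost per page, which is why rendering billions of pages is out of
# reach for volunteer-scale crawls.
print(len(fetch_rendered_html("https://example-spa.com/article/123")))  # placeholder URL
```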

Postprocessing innovations also help. FineWeb’s classifier, trained on human judgments, retains more diverse content than prior filters. Deduplication at the document level, rather than at the n-gram level, preserves more unique pages. Yet these tweaks recover only a portion of the lost data. Privacy regulations like GDPR limit scraping of personal data, while site owners increasingly deploy anti-bot measures.
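
For illustration, document-level deduplication can be as simple as hashing a lightly normalized copy of each page and keeping the first occurrence; pipelines such as FineWeb typically layer MinHash or other near-duplicate detection on top of an exact-match pass like this one:

```python
# Minimal sketch of exact document-level deduplication via content hashing.
# Near-duplicate detection (e.g., MinHash) is omitted for brevity.
import hashlib

def normalize(text: str) -> str:
    # Light normalization so trivial whitespace or case differences do not defeat the hash.
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The quick brown fox.", "the  QUICK brown fox.", "A different page entirely."]
print(len(deduplicate(docs)))  # 2: the first two documents collapse to one entry
```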

Future directions point to hybrid approaches. Partnerships with content providers could grant access to paywalled archives. Synthetic data generation can fill gaps but risks reinforcing hallucinations. Decentralized crawling could incentivize users to contribute local scrapes, though coordination remains challenging. Ultimately, the study underscores that the web’s “long tail” holds untapped potential for richer, more robust LLMs.

As AI scales, bridging this coverage chasm becomes essential. Without it, models risk plateauing, blind to the internet’s full breadth. Researchers call for transparent dataset audits and open tools to map web accessibility, enabling the community to reclaim overlooked knowledge.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.