Encyclopaedia Britannica sues OpenAI for training on nearly 100,000 articles without permission

In a significant escalation of legal challenges facing generative AI developers, Encyclopaedia Britannica, the renowned publisher of one of the world’s oldest encyclopedias, has initiated a copyright infringement lawsuit against OpenAI and its close partner Microsoft. The complaint, filed on October 30, 2024, in the United States District Court for the Southern District of New York, alleges that OpenAI systematically scraped and incorporated content from 96,803 Britannica articles to train its flagship large language models, including GPT-3, GPT-3.5, GPT-4, and subsequent iterations powering ChatGPT.

The suit marks a bold move by Britannica to protect its intellectual property amid a growing wave of litigation targeting AI companies for their data training practices. According to the 51-page complaint, OpenAI’s actions constitute direct infringement by reproducing, storing, and using Britannica’s copyrighted works without permission, license, or compensation. The publisher contends that these models have been “instructed to memorize and regurgitate” vast swaths of Britannica’s content, enabling the AI to generate responses that closely mirror or derive from the original articles.

Britannica’s legal team draws on evidence from OpenAI’s own disclosures and third-party analyses to substantiate the claims. For instance, when prompted with queries about historical facts or biographical details covered extensively in Britannica’s encyclopedia, ChatGPT frequently produces outputs that echo the structure, phrasing, and details of Britannica articles. The complaint highlights specific examples, such as responses to questions about figures like Abraham Lincoln or events like the Battle of Hastings, where the AI’s answers align verbatim or near-verbatim with Britannica’s text.
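To make "verbatim or near-verbatim" concrete, one common way to quantify textual overlap is to count how many short word sequences (n-grams) in a model's output also appear in a source text. The sketch below is illustrative only: it is not the method described in the complaint, the function names are hypothetical, and the sample strings are placeholders rather than Britannica or ChatGPT text.

```python
from __future__ import annotations

import re


def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(source: str, output: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the source.

    1.0 means every n-word sequence in the output appears in the source;
    0.0 means none do (or the output is shorter than n words).
    """
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


# Placeholder texts (assumptions for illustration, not real article content).
source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "The quick brown fox jumps over the lazy dog."
paraphrase = "a fast fox leapt across a sleepy dog by the water"

print(overlap_ratio(source, copied))      # high overlap: text is copied
print(overlap_ratio(source, paraphrase))  # low overlap: text is reworded
```

A measure like this captures copied phrasing only. It says nothing about whether an output borrows an article's structure or selection of facts, and those questions are also at issue in the complaint.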

Central to the dispute is the scale of the alleged infringement. Court documents detail how OpenAI’s training datasets, such as Common Crawl and Books1, ingested massive portions of Britannica’s online content. Britannica asserts that 97 percent of its articles published between 2010 and 2022 were harvested, totaling over 30 million words. This data was then used to train models that now compete directly with Britannica’s own digital offerings, including its premium ProCon.org debate platform and educational resources.

The publisher argues that OpenAI’s use does not qualify as fair use under U.S. copyright law. Fair use analysis weighs four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the work. Britannica addresses each factor in turn. First, it describes OpenAI’s commercial exploitation through subscription services like ChatGPT Plus as neither transformative nor non-commercial. Second, it characterizes encyclopedia entries, which combine factual content with creative expression, as highly protected works. Third, it argues that the wholesale copying of entire articles undermines any claim of limited use. Finally, Britannica warns that AI-generated summaries erode demand for its authoritative content, diverting users and revenue.

OpenAI has consistently defended its practices by invoking fair use, asserting that training AI on public web data fosters innovation without supplanting original works. In prior statements and court filings in similar cases, the company compares its process to how search engines index the internet or how humans learn from books. However, Britannica dismisses this analogy, noting that AI models retain ingested data in probabilistic weights, allowing regurgitation rather than mere indexing.

This lawsuit arrives against a backdrop of intensifying scrutiny of AI training data sources. Similar actions have been brought by The New York Times, authors such as John Grisham, and news outlets worldwide. Courts have issued mixed rulings: some, like a recent denial of a preliminary injunction against Anthropic, have leaned toward fair use for training, while others signal caution. Britannica seeks not only damages but also injunctive relief to prevent further use of its content and to mandate disclosure of training datasets.

Britannica’s CEO, Robert M. M. Stewart, underscored the stakes in a public statement: “OpenAI’s unauthorized use of our content undermines the value of human-created knowledge and threatens the sustainability of quality journalism and reference materials.” The publisher, which traces its origins to 1768, positions itself as a steward of verified information, contrasting its editorial rigor with AI’s potential for hallucination and bias.

Microsoft, named as a co-defendant due to its substantial investment in OpenAI and integration of GPT models into products like Bing Chat and Copilot, faces parallel accusations. The complaint alleges Microsoft’s Azure infrastructure facilitated the infringing activities.

As discovery proceeds, the case could compel OpenAI to reveal granular details about its data pipelines, a closely guarded secret in the AI industry. Legal experts anticipate this suit could set precedents on whether scraping public websites for AI training constitutes infringement, potentially reshaping how models are built and licensed.

Britannica’s action underscores a pivotal tension in the AI era: balancing technological advancement with creators’ rights. While OpenAI maintains its datasets are curated to respect copyrights, incidents of model outputs citing or mimicking protected works have fueled demands for transparency and opt-out mechanisms. Publishers are increasingly exploring licensing deals with AI firms, but Britannica’s uncompromising stance signals that it is unwilling to license its content without fair compensation.

The outcome may influence global policy, as the European Union and other regions tighten regulations on AI data usage. For now, the suit amplifies calls for ethical data practices, urging the industry to prioritize consent and attribution in pursuit of artificial general intelligence.
