Authors sue six AI giants for book piracy

A group of 12 authors has filed a high-profile class-action lawsuit against six prominent AI companies, accusing them of systematically using pirated copies of copyrighted books to train their large language models (LLMs). The complaint, lodged on August 23, 2024, in the US District Court for the Northern District of California, targets xAI, Anthropic, Databricks, Meta, Bloomberg, and Nu Holdings. The plaintiffs argue that these firms knowingly incorporated massive troves of illegally obtained books into datasets such as Books3—a subset of the controversial training corpus known as The Pile—without permission or compensation.

The lead plaintiffs include well-known writers such as Juno Dawson (author of Her Majesty’s Royal Coven), Paul Tremblay (A Head Full of Ghosts), and Mona Awad (Bunny), alongside others like Richard Kadrey, Andrea Bartz, and Christopher Farnsworth. They represent a proposed class of thousands of authors whose works were allegedly scraped from shadow libraries including Library Genesis (LibGen), Z-Library, and Anna’s Archive. These sites, notorious for hosting millions of pirated ebooks, allegedly served as primary sources for the content ingested into AI training pipelines.

Central to the lawsuit is Books3, a dataset of roughly 196,000 books. Forensic analysis cited in the complaint reveals that over 80% of these books bear digital fingerprints consistent with piracy from LibGen and Z-Library. The Pile, curated by EleutherAI and released in 2020, explicitly included Books3, making it a cornerstone for open-weight models such as Meta’s Llama series. The plaintiffs’ legal team, led by attorney Matthew Sag and the firm Keller Lenkner, points to public GitHub repositories where AI developers openly discussed and shared these datasets, complete with instructions for downloading from pirate repositories.

For instance, Anthropic’s training stack reportedly relied on The Pile, including Books3, according to the company’s own technical disclosures. Similarly, xAI’s Grok models allegedly drew from comparable corpora, with leaked configurations showing ingestion of pirated materials. Databricks, known for its MosaicML Composer training library, allegedly facilitated the processing of such datasets, while Bloomberg and Nu Holdings allegedly incorporated them into custom pipelines. Meta’s Llama licensing agreements acknowledge the use of “publicly available” data, which the suit contends encompasses pirated works.

The complaint meticulously documents the technical pathways of infringement. Pirates on sites like LibGen employ scripts to strip DRM from legitimate ebooks purchased from platforms such as Amazon Kindle or Apple Books, then upload them with metadata intact—such as ISBNs and publisher details. These “fingerprints” persist through dataset curation, allowing researchers to trace origins. Tools like Hugging Face’s dataset viewer and GitHub repos from EleutherAI explicitly list Books3 mirrors hosted on pirate sites. Even after controversies erupted in 2023, some models continued training on unfiltered versions, the suit alleges.
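The metadata tracing described above can be sketched in code. The following is a hypothetical illustration, not the plaintiffs’ actual forensic tooling: it scans raw text records for ISBN-13 numbers, validates their check digits, and matches them against a rights holder’s catalog. The sample records and the catalog are invented for demonstration.

```python
import re

# Matches ISBN-13 candidates (978/979 prefix, optional hyphen/space separators).
ISBN13_RE = re.compile(r"\b97[89][-\s]?(?:\d[-\s]?){9}\d\b")

def isbn13_checksum_ok(isbn: str) -> bool:
    """Validate an ISBN-13 check digit (weights alternate 1 and 3)."""
    digits = [int(c) for c in re.sub(r"[-\s]", "", isbn)]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]

def find_isbns(record: str) -> list[str]:
    """Return checksum-valid ISBN-13s embedded in a text record."""
    return [m.group(0) for m in ISBN13_RE.finditer(record)
            if isbn13_checksum_ok(m.group(0))]

# Toy records standing in for dataset entries with surviving metadata.
records = [
    "Title: Example Novel | ISBN 978-0-306-40615-7 | Publisher: Example House",
    "Plain text with no embedded metadata at all.",
]
catalog = {"9780306406157"}  # hypothetical rights holder's catalog

for rec in records:
    for isbn in find_isbns(rec):
        if re.sub(r"[-\s]", "", isbn) in catalog:
            print("possible provenance match:", isbn)
```

In practice, forensic analysts would combine such metadata hits with other signals (publisher strings, DRM-stripping artifacts, file hashes) before attributing a record to a pirate source.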

This case builds on prior litigation shaking the AI industry. In 2023, comedian Sarah Silverman and novelist Richard Kadrey sued OpenAI and Meta, claiming similar misuse of Books3 in GPT and Llama training. Judge Vince Chhabria dismissed parts of that suit but allowed fair use questions to proceed to discovery. The New York Times has also sued OpenAI and Microsoft over article scraping. Unlike those cases, this lawsuit zeroes in on six defendants and emphasizes provable piracy chains, potentially strengthening claims of direct infringement against transformative fair use defenses.

The AI companies face steep challenges. Fair use defenses hinge on whether training on copyrighted books is transformative, but the plaintiffs counter that verbatim regurgitation, demonstrated via prompts eliciting full book passages, undermines that claim. Technical exhibits include model extractions: for example, querying Anthropic’s Claude with specific passages allegedly yields near-exact reproductions of plaintiffs’ works. Economic harm is quantified as lost licensing revenue; the authors estimate billions of dollars in foregone deals as AI firms bypass licensed data acquisition.
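One simple way to quantify the kind of “near-exact reproduction” the exhibits describe is to measure the longest contiguous span a model output shares with a source passage. This is a hedged sketch: the passages, the threshold, and the metric are assumptions for illustration, not the method used in the actual exhibits.

```python
from difflib import SequenceMatcher

def longest_match_ratio(source: str, output: str) -> float:
    """Fraction of the source passage covered by the single longest
    contiguous match found in the model output."""
    match = SequenceMatcher(None, source, output, autojunk=False) \
        .find_longest_match(0, len(source), 0, len(output))
    return match.size / max(len(source), 1)

# Invented example: a famous public-domain-style opening line standing in
# for a plaintiff's text, and a fabricated model transcript.
book_passage = "It was a bright cold day in April, and the clocks were striking thirteen."
model_output = "Sure. It was a bright cold day in April, and the clocks were striking thirteen."

ratio = longest_match_ratio(book_passage, model_output)
print(f"longest verbatim span covers {ratio:.0%} of the passage")
if ratio > 0.8:  # hypothetical threshold for flagging near-verbatim output
    print("flag: near-verbatim reproduction")
```

Real memorization audits typically go further, sampling many prompts per work and reporting aggregate overlap statistics rather than a single comparison.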

Defendants have yet to respond formally, but patterns from past cases suggest reliance on fair use or attempts to shift liability to dataset creators like EleutherAI; Section 230, raised in earlier suits, expressly excludes intellectual-property claims and has been dismissed as a defense before. However, the complaint highlights the companies’ active roles: Anthropic funded Pile improvements, Meta hosted Books3 previews, and xAI’s Colossus cluster processed vast unlicensed corpora.

Broader implications ripple through AI development. The suit demands injunctions halting use of tainted datasets, plus statutory damages of up to $150,000 per willfully infringed work; at that maximum, even 100,000 infringed titles would translate to $15 billion. The case pressures the industry toward licensed data markets, like those emerging from deals between OpenAI and publishers. Ethically, it spotlights the tension between innovation and IP rights, as shadow libraries evade takedowns via decentralized mirrors.

As discovery unfolds, expect forensic deep dives into training logs and model weights. Researchers have already reverse-engineered watermarks in Books3, confirming piracy prevalence. This litigation could redefine AI data hygiene, compelling firms to audit corpora or pivot to synthetic data generation.

For authors, the stakes are existential: AI threatens creative livelihoods by commoditizing literature without reciprocity. The case underscores a pivotal moment—will courts deem mass scraping fair use, or mandate equitable training practices?

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.