OpenAI ordered to turn over 20 million ChatGPT chats to the New York Times


In a significant development in the ongoing copyright infringement lawsuit filed by The New York Times against OpenAI, a federal judge has ordered OpenAI to hand over data from approximately 20 million ChatGPT user conversations. The ruling, issued by Magistrate Judge Ona T. Wang in the Southern District of New York, marks a pivotal moment in the legal battle over whether OpenAI’s generative AI models were unlawfully trained on copyrighted material, including Times articles.

The order stems from The New York Times’ December 2023 complaint, which accuses OpenAI and its close partner Microsoft of systematically scraping millions of Times articles to train ChatGPT and other large language models (LLMs). The Times alleges that these models have not only memorized vast swaths of its content but also reproduce it verbatim or near-verbatim in responses to user prompts, infringing its copyrights. To bolster its case, the newspaper sought extensive discovery, including user interaction data that could demonstrate instances of such regurgitation.

Judge Wang’s decision requires OpenAI to produce chat logs from user sessions where prompts referenced The New York Times or its content. Specifically, the court has directed OpenAI to disclose data covering roughly 20 million chats conducted between March 2023 and April 2024. This dataset includes anonymized user queries and the corresponding AI-generated responses, focusing on interactions that explicitly mention the Times. OpenAI must deliver this information in a structured format, such as JSON files, to facilitate analysis by the Times’ legal team and experts.
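The court has not published a schema for that production, but a structured, JSON-style export would presumably look something like the Python sketch below, in which each chat is written as one newline-delimited JSON record. The field names and values here are illustrative assumptions only.

```python
import json

# Hypothetical example of a single anonymized chat record in the production;
# the real schema and field names have not been made public.
record = {
    "conversation_id": "a3f9c2e1",  # pseudonymous ID, not a real user identifier
    "model": "gpt-4",
    "messages": [
        {"role": "user", "content": "Summarize today's New York Times front page."},
        {"role": "assistant", "content": "Here is a brief, paraphrased summary..."},
    ],
    "metadata": {"turns": 2, "times_related": True},
}

# Newline-delimited JSON keeps a multi-million-record export easy to stream.
with open("production_sample.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```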

OpenAI had resisted the broad scope of the request, citing user privacy concerns and the sheer volume of data involved. The company argued that handing over millions of chat histories could expose sensitive user information even if anonymized, and it proposed a narrower production limited to a sample of chats. Lawyers for OpenAI also contended that the Times’ demands were overly burdensome, potentially encompassing petabytes of data. However, Judge Wang rejected these objections, holding that anonymization (stripping identifiers such as IP addresses, timestamps, and user IDs) adequately protects privacy while serving the interests of justice.
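As a rough illustration of the kind of anonymization the court relied on, the sketch below drops direct identifiers and replaces the user ID with a salted one-way pseudonym. The field names and hashing choice are assumptions, not OpenAI’s actual pipeline.

```python
import hashlib

# Illustrative anonymization pass: direct identifiers are dropped and the
# user ID is replaced by a salted one-way pseudonym. Field names and the
# salting scheme are assumptions, not OpenAI's actual process.
DROP_FIELDS = {"ip_address", "email", "timestamp", "device_id", "user_id"}

def anonymize(raw_chat: dict, salt: str = "per-production-secret") -> dict:
    clean = {k: v for k, v in raw_chat.items() if k not in DROP_FIELDS}
    if "user_id" in raw_chat:
        # The same user maps to the same pseudonym without revealing who they are.
        digest = hashlib.sha256((salt + str(raw_chat["user_id"])).encode()).hexdigest()
        clean["user_pseudonym"] = digest[:16]
    return clean

print(anonymize({
    "user_id": 42,
    "ip_address": "203.0.113.7",
    "messages": [{"role": "user", "content": "Hello"}],
}))
```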

The ruling delineates precise parameters for the disclosure. OpenAI is instructed to identify chats via keyword searches for terms like “New York Times,” “NYT,” or specific article titles. The dataset must include full conversation threads, not just isolated responses, to capture context. Additionally, the judge ordered the production of metadata, such as session lengths and frequency of Times-related queries, alongside the raw chat content. This comprehensive handover is due within 20 days of the order, underscoring the urgency of discovery in this high-stakes litigation.
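A minimal sketch of the keyword screen and per-session metadata the order describes might look like the following; the actual search terms, article-title lists, and tooling used by the parties are not public.

```python
import re

# Sketch of the keyword screen described in the order; the real terms and
# tooling are assumptions for illustration.
PATTERNS = [re.compile(p, re.IGNORECASE)
            for p in (r"new york times", r"\bnyt\b", r"nytimes\.com")]

def mentions_times(text: str) -> bool:
    return any(p.search(text) for p in PATTERNS)

def session_metadata(conversation: list[dict]) -> dict:
    # Per-session metadata of the kind the order contemplates.
    return {
        "turns": len(conversation),
        "times_related_messages": sum(
            mentions_times(m["content"]) for m in conversation),
    }

chat = [
    {"role": "user", "content": "Can you give me the text of a NYT article?"},
    {"role": "assistant", "content": "I can offer a short summary instead."},
]
if any(mentions_times(m["content"]) for m in chat):
    print(session_metadata(chat))
```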

This discovery phase highlights broader tensions in AI litigation. The Times’ strategy leverages user-generated evidence to illustrate “prompt engineering” techniques that allegedly coax ChatGPT into outputting protected material. For instance, plaintiffs have publicly demonstrated how targeted prompts can elicit full articles or detailed summaries, suggesting the model retains memorized training data rather than merely generating novel text. Such regurgitation, the Times argues, falls outside fair use and directly competes with its subscription-based journalism.
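One simple way an expert could quantify such regurgitation is to measure the longest span an AI response shares verbatim with the original article, as in the illustrative sketch below; this is a toy metric, not the methodology actually used in the litigation.

```python
from difflib import SequenceMatcher

# Toy measure of "verbatim or near-verbatim" reproduction: the longest
# character span a model answer shares with the source article.
def longest_shared_span(article: str, output: str) -> str:
    m = SequenceMatcher(None, article, output).find_longest_match(
        0, len(article), 0, len(output))
    return article[m.a:m.a + m.size]

article = "the quick brown fox jumps over the lazy dog near the riverbank."
output = "as the story put it, the quick brown fox jumps over the lazy dog."
span = longest_shared_span(article, output)
print(f"{len(span)} characters reproduced verbatim: {span!r}")
```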

OpenAI maintains that its models are designed to avoid verbatim copying, using techniques such as reinforcement learning from human feedback (RLHF) and output filtering to steer responses toward paraphrase rather than reproduction. The company also asserts that training on publicly available web data, including news sites, constitutes transformative fair use, akin to how search engines index the internet. Yet the mandated chat disclosures could provide empirical evidence either supporting or undermining these defenses. If the data reveals widespread reproduction of Times content, it may strengthen claims of market harm; conversely, sparse instances could favor OpenAI.

The implications extend beyond this case. Similar lawsuits from authors, news outlets, and media conglomerates against OpenAI and Anthropic underscore the need for transparency in AI training pipelines. Courts are grappling with how to balance proprietary model safeguards against evidentiary demands. Judge Wang’s order sets a precedent for compelled disclosure of user interaction logs, potentially influencing future cases involving Meta’s Llama models or Google’s Gemini.

For OpenAI, compliance entails substantial engineering effort. Processing 20 million chats requires scalable infrastructure to query vast databases, apply anonymization algorithms, and export data securely. The company has already faced scrutiny over data practices, including a prior Italian regulatory probe that temporarily banned ChatGPT over privacy lapses. This ruling amplifies calls for robust data governance in AI deployments.
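In practice, an export of this size would almost certainly be processed as a stream rather than loaded into memory. The sketch below shows one way to chain filtering and anonymization over a newline-delimited file; the file names and the helper functions passed in are assumptions for illustration.

```python
import json
from typing import Callable, Iterator

# Rough sketch of a streaming export: read one chat at a time, keep only
# matching records, scrub identifiers, and write them back out, so the
# full 20 million chats never need to sit in memory.
def stream_records(path: str) -> Iterator[dict]:
    with open(path, encoding="utf-8") as src:
        for line in src:
            yield json.loads(line)

def export(src_path: str, dst_path: str,
           keep: Callable[[dict], bool],
           scrub: Callable[[dict], dict]) -> int:
    written = 0
    with open(dst_path, "w", encoding="utf-8") as out:
        for record in stream_records(src_path):
            if keep(record):
                out.write(json.dumps(scrub(record)) + "\n")
                written += 1
    return written

# e.g. export("raw_chats.jsonl", "production.jsonl", keep=..., scrub=...)
# using keyword-screen and anonymization helpers like those sketched above.
```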

As the lawsuit progresses toward summary judgment or trial, both parties continue motions practice. The Times seeks further discovery, including details on OpenAI’s web crawlers like GPTBot, while OpenAI presses its own motions challenging the complaint’s breadth. Legal experts anticipate appeals, but for now, the 20 million chats represent a treasure trove that could redefine accountability in generative AI.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.