Some of the largest AI players are now paying Wikipedia for the data they already use

Major AI Companies Strike Deals to Pay Wikipedia for Its Publicly Available Content

In a significant shift for the intersection of artificial intelligence and open knowledge, the Wikimedia Foundation, the nonprofit organization behind Wikipedia, has secured multimillion-dollar licensing agreements with three of the biggest players in AI development: OpenAI, Google (through its parent company Alphabet), and xAI, founded by Elon Musk. These deals mark the first time Wikipedia has entered into formal paid partnerships specifically for the use of its content in training large language models (LLMs) and other generative AI systems. Notably, Wikipedia’s vast repository of articles has long been freely accessible under the Creative Commons Attribution-ShareAlike (CC-BY-SA) license, allowing broad reuse—including by AI companies—without prior payment.

The agreements, announced by Wikimedia Foundation CEO Maryana Iskander, provide these tech giants with explicit permission to incorporate Wikipedia data into their AI training pipelines. Financial details remain confidential, but sources familiar with the negotiations describe them as substantial, potentially totaling tens of millions of dollars annually across the partners. Iskander emphasized that the revenue will directly support Wikipedia’s operations, which rely almost entirely on voluntary donations from users worldwide. “This is about ensuring the sustainability of free knowledge in an AI-driven world,” she stated in an official blog post.

Background on Wikipedia’s Data and AI Usage

Wikipedia, launched in 2001, hosts over 60 million articles across 300+ language editions, making it one of the internet’s most comprehensive and reliable sources of factual information. Its content is crowdsourced, edited by volunteers, and governed by strict neutrality and verifiability policies. The CC BY-SA 4.0 license permits commercial use, modification, and distribution, as long as attribution is given and derivative works are shared under the same terms. This openness has made Wikipedia a staple of AI training datasets such as Common Crawl, a scrape of billions of web pages that includes Wikipedia’s articles.

Historically, AI companies have scraped Wikipedia data without direct compensation, citing the license’s permissions. However, as generative AI models like ChatGPT, Gemini, and Grok gained prominence, Wikimedia observed spikes in traffic from AI crawlers. In 2023, the Foundation introduced opt-out mechanisms via robots.txt directives to limit automated access for training purposes. Despite these measures, widespread scraping continued, prompting calls for ethical data sourcing.
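
To make the robots.txt opt-out concrete, here is a minimal sketch of how a well-behaved crawler could check such directives before fetching a page, using Python’s standard urllib.robotparser module. The rules, the ExampleTrainingBot and ExampleSearchBot user agents, and the example.org URLs are hypothetical illustrations, not Wikipedia’s actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Wikipedia's real robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /wiki/
Disallow: /w/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.org/wiki/Some_Article"
    for agent in ("ExampleTrainingBot", "ExampleSearchBot"):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, agent, article) else "blocked"
        print(f"{agent}: {verdict}")
```

Note that robots.txt is advisory: the check only has an effect if the crawler chooses to run it, which is consistent with scraping continuing despite the opt-out.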

The new deals address this tension. OpenAI, maker of ChatGPT, was the first to sign on earlier this year, followed by Google and xAI. Under the terms, these companies gain “preferred access” to Wikipedia’s content, including real-time updates, while agreeing to proper attribution in AI outputs. For instance, when users query an AI about historical events or scientific concepts, responses citing Wikipedia will link back to the original articles, driving traffic to the site.

Key Provisions of the Agreements

Each partnership includes tailored commitments:

  • OpenAI: Commits to integrating Wikipedia citations prominently in ChatGPT responses. The deal also funds Wikimedia’s diversity initiatives, aiming to expand content in underrepresented languages.

  • Google: Leverages Wikipedia data for Gemini and future models, with a focus on multilingual support. Google has pledged technical collaboration, such as improving search integrations that surface Wikipedia snippets.

  • xAI: Elon Musk’s venture, behind the Grok chatbot, emphasizes transparency in training data. xAI’s agreement includes provisions for auditing its use of Wikipedia content and supporting open-source AI tools aligned with Wikimedia’s mission.

These arrangements are non-exclusive: other AI developers can still access Wikipedia under CC-BY-SA, just without the formal assurances or the financial contribution. Wikimedia says this preserves the platform’s openness.

Implications for AI Ethics and Wikimedia’s Future

The partnerships come amid broader debates over AI data practices. Lawsuits from news outlets like The New York Times against OpenAI and Microsoft highlight the risk of copyright infringement claims over training data. Wikipedia’s approach sidesteps litigation by embracing voluntary payments, positioning it as a model for collaborative sustainability. Iskander noted a 30% increase in AI-related traffic to Wikipedia in recent months, underscoring the symbiotic relationship: AI relies on Wikipedia for accuracy, while Wikipedia benefits from exposure and funding.

Critics within the open-source community question the necessity of payments for freely licensed data, arguing that it sets a precedent that could fragment access. Proponents, however, see it as pragmatic: it rewards volunteer labor in an era when the compute costs of AI training run into the billions of dollars. The revenue is earmarked for server maintenance, anti-vandalism tools, and editor grants, ensuring Wikipedia remains ad-free and independent.

Looking ahead, Wikimedia is exploring similar deals with additional AI firms and investing in tools to track content usage. These steps reinforce Wikipedia’s role not just as a knowledge base, but as a cornerstone of trustworthy AI development.



One caveat up front: the analysis below only matters if you reside in a jurisdiction that recognizes Creative Commons licenses such as CC BY-SA 4.0 as enforceable. As with the GPL for software, legal validity varies by country.

If an AI company uses content licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0) to train its models, the “technical” and “legal” possibilities for users depend on whether the model or its output is considered an adaptation (derivative) of that content.

The CC BY-SA 4.0 license is a “copyleft” license, often compared to the GPL in software. Its core philosophy is: If you build upon this work, you must share your version under the same license.

1. What is Technically Possible for the User?

If the AI company strictly follows the license (treating the model as an “adaptation”), the following becomes possible for you as a user:

  • Access to the Model: You would theoretically have the right to access and use the AI model itself under the same CC BY-SA 4.0 terms.
  • Freedom to Remix/Modify: You could technically “fine-tune” or modify the model for your own purposes, provided you also share your modified version under CC BY-SA 4.0.
  • Commercial Use: You can use the model or its outputs for commercial purposes, as CC BY-SA 4.0 explicitly allows commercial use (unlike “NC” non-commercial versions).
  • Redistribution: You could host your own version of the model or distribute copies of it to others.

2. The “ShareAlike” (SA) Trigger

The biggest debate in the AI industry right now is whether training a model actually triggers the ShareAlike clause.

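Element    | If triggered…                                                                       | If NOT triggered…
-----------|-------------------------------------------------------------------------------------|---------------------------------------------------------------
The Model  | The weights/parameters must be released under CC BY-SA 4.0.                         | The company can keep the model “Closed Source” or proprietary.
The Output | If an output is a “copy” or “adaptation” of the data, that output must be CC BY-SA. | The user or company may claim copyright over the output.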

3. Key Constraints & Realities

While the license says one thing, the “technical possibility” is often limited by two factors:

  • Exceptions & Limitations: Many AI companies argue that training is “Fair Use” (US) or covered by “Text and Data Mining” (TDM) exceptions (EU). If a court agrees that training is an exception to copyright, the CC license doesn’t apply at all, and the company has no obligation to share their model.
  • Attribution (BY): Even if the company doesn’t “ShareAlike,” they must still provide Attribution. Technically, this means they should acknowledge the creators of the training data. For users, this provides transparency: you would know exactly whose work “built” the AI you are using.
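
As an illustration of what attribution could look like in practice, the sketch below builds a credit line in the common title, author, source, license pattern. The SourceWork type, the attribution_line helper, and the sample article are assumptions made for this example. The license does not prescribe this exact format, and how attribution should work for model training is unsettled.

```python
from dataclasses import dataclass

# Canonical URL of the CC BY-SA 4.0 license deed.
CC_BY_SA_4_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


@dataclass(frozen=True)
class SourceWork:
    """Hypothetical record of one licensed work used as training data."""
    title: str
    author: str
    source_url: str
    modified: bool = False  # CC BY-SA 4.0 asks reusers to indicate changes


def attribution_line(work: SourceWork) -> str:
    """Build a single human-readable credit line for the given work."""
    note = " (modified)" if work.modified else ""
    return (
        f'"{work.title}" by {work.author}, {work.source_url}, '
        f"licensed under CC BY-SA 4.0 ({CC_BY_SA_4_URL}){note}"
    )


if __name__ == "__main__":
    example = SourceWork(
        title="Ada Lovelace",
        author="Wikipedia contributors",
        source_url="https://en.wikipedia.org/wiki/Ada_Lovelace",
    )
    print(attribution_line(example))
```

A training pipeline that kept such a record per source document could publish the resulting lines as a data manifest, which is one way to deliver the transparency described above.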

4. Summary Table for Users

Feature      | Your Rights under CC BY-SA 4.0
-------------|--------------------------------------------------------------------------------
Usage        | Use the model for any purpose (personal, research, commercial).
Sharing      | Share the model or its outputs with others freely.
Modification | Modify or fine-tune the model, but you must use the same license for the result.
Attribution  | You must credit the original creators if you redistribute the work.
