In brief
- The Wikimedia Foundation has announced a number of partnerships with AI companies that allow them to use Wikipedia content for LLM training.
- AI companies have signed up for enterprise products to reuse Wikipedia content at scale.
- Last October, the foundation announced that site visits were declining as people were using AI summaries instead of visiting the site.
The Wikimedia Foundation, a nonprofit organization seeking to strengthen its long-term sustainability amid changing online behavior, has announced a series of new partnerships with artificial intelligence companies that will enable them to use Wikipedia content to train and power AI models.
The agreements were signed through Wikimedia Enterprise, the foundation’s commercial product designed for large-scale reusers and distributors of content from Wikimedia projects. New sign-ups include Ecosia, Microsoft, Mistral AI, Perplexity, Pleias, and ProRata. They join existing partners such as Amazon, Google, and Meta.
“In the age of AI, Wikipedia and its human-created and curated knowledge are more valuable than ever,” the foundation said in a statement.
“That knowledge power[s] generative AI chatbots, search engines, voice assistants, and more. Wikipedia is one of the highest quality datasets used to train large language models.”
This announcement was made as part of updates related to Wikipedia’s 25th anniversary.
The online encyclopedia is among the 10 most visited websites in the world and is the only one of them run by a nonprofit organization. The foundation says its more than 65 million articles, published in more than 300 languages, are viewed nearly 15 billion times each month.
But the foundation warns that traffic patterns are changing. In October, it reported that human visits to Wikipedia were down 8% year over year, and attributed the decline to users relying on AI-generated summaries instead of visiting the site directly. Currently, nearly 60% of Google searches end without a click, and the answers shown on the results page are often drawn from Wikipedia content.
AI vs. publishers
The deals come amid a broader debate about how AI companies acquire training data. Large language models are typically trained on vast amounts of online material, which has drawn criticism from authors, publishers, and other rights holders who argue that unauthorized use of copyrighted works is infringement.
Among them, Reddit has filed several lawsuits against AI companies over the use of its content to train models, even as it maintains licensing agreements with Google and others.
On Thursday, major book publishers Hachette Book Group and Cengage Group filed a request to join an existing class action lawsuit against Google, accusing the company of “historic copyright infringement” in building its Gemini AI platform. The lawsuit alleges that Google copied books in the course of its AI training without obtaining the proper licenses. It was originally filed in 2023 by a group of authors.
OpenAI faces similar lawsuits from plaintiffs including “Game of Thrones” author George R.R. Martin.
Entertainment companies are also grappling with the issue. In mid-December, Disney sent a cease-and-desist letter to Google accusing it of copyright infringement, even as it struck a separate licensing agreement with OpenAI covering the use of hundreds of its characters in AI-generated videos. Disney has issued similar notices to other AI companies and, along with other major studios, is involved in a lawsuit against image generation company Midjourney.
That same month, a coalition of screenwriters, actors and engineers launched a new industry group aimed at promoting legally enforceable standards governing how AI is trained and used in the entertainment sector. More than 500 celebrities have endorsed the initiative, including Natalie Portman, Cate Blanchett, Ben Affleck, Guillermo del Toro and Taika Waititi.
The European Commission has also launched a formal antitrust investigation into whether Google breached EU competition rules by using content from publishers and YouTube to power its AI services without fair remuneration or consent.
It is unclear whether copyright owners will ultimately be able to obtain a remedy. Federal judges in the United States recently ruled that Meta’s and Anthropic’s use of copyrighted books to train AI models constitutes fair use, while criticizing the practice of maintaining permanent libraries of pirated works.
