The Internet Archive has often been a valuable resource for journalists, whether for finding records of deleted tweets or for providing academic texts for background research. However, the advent of AI has created new tension between the parties. Several major publications have begun blocking the nonprofit digital library’s access to their content, based on concerns that AI companies’ bots are using the Internet Archive’s collections to harvest their articles indirectly.
“Many of these AI businesses are looking for structured content databases that are readily available,” Robert Hahn, head of business affairs and licensing at The Guardian, told Nieman Lab. “The Internet Archive’s API would have been an obvious place to connect your own machines and siphon the IP.”
The New York Times has taken a similar step. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content, including by AI companies, without authorization,” a representative for the newspaper confirmed to Nieman Lab. The Financial Times and the social forum Reddit have also moved to selectively block how the Internet Archive catalogs their content.
Many publishers have sued AI businesses over how they access the content used to train large language models. To name a few from the field of journalism alone:
- The New York Times sued OpenAI and Microsoft
- The Center for Investigative Reporting sued OpenAI and Microsoft
- The Wall Street Journal and the New York Post sued Perplexity
- A group of publishers including The Atlantic, The Guardian and Politico sued Cohere
- The New York Times and the Chicago Tribune sued Perplexity
Other media outlets have sought financial deals before offering their libraries as training material, although these arrangements appear to compensate publishers rather than writers. And that is before delving into the copyright and piracy battles that other creative fields, from novelists to visual artists to musicians, are also fighting against AI tools. The full Nieman Lab story is worth reading for anyone following the creative industries’ response to artificial intelligence.
