Exclusive: Licensing company says AI companies are circumventing web standards to scrape publisher sites

Content licensing startup TollBit has told publishers that multiple artificial intelligence companies are circumventing common web standards that publishers use to block their content from being scraped for use in generative AI systems.

The letter to publishers, seen by Reuters on Friday, did not name the affected AI companies or publishers, but it comes amid a public dispute over the same web standards between AI search startup Perplexity and media giant Forbes, as well as a broader debate between tech and media companies over the value of content in the age of generative AI.

Forbes has publicly accused Perplexity of plagiarizing its research articles in AI-generated summaries without citing the publisher or asking for permission.

A Wired investigation published this week found that Perplexity is likely circumventing efforts to block web crawlers via the Robots Exclusion Protocol (robots.txt), a widely accepted standard for determining which parts of a site can be crawled.

Perplexity declined a Reuters request for comment on the dispute.

The News Media Alliance, a trade group representing more than 2,200 U.S.-based publishers, expressed concern about the impact ignoring “no crawl” signals could have on its members.

“Without the ability to opt out of large-scale scraping, we would be unable to monetize our valuable content or pay our journalists – this could severely harm our industry,” said Danielle Coffey, the group's president.

TollBit, an early-stage startup, is positioning itself as a matchmaker between AI companies looking for content and publishers willing to enter into licensing deals with them.

The company tracks AI traffic to publishers' websites and uses the analytics to help the two sides agree on fees to be paid for the use of different types of content.

For example, publishers may choose to charge higher fees for “premium content, such as breaking news and exclusive information,” the company said on its website.

The company had 50 websites live as of May, but did not disclose their names.

According to TollBit's letter, Perplexity is not the only AI company that appears to be ignoring robots.txt.

TollBit said its analysis showed that “a large number” of AI agents were circumventing the protocol, the standard tool publishers use to indicate which parts of their sites may be crawled.

“What this means in practice is that AI agents from multiple sources (not just one company) are choosing to bypass the robots.txt protocol to retrieve content from sites,” TollBit wrote. “The more publisher logs we ingest, the more this pattern becomes apparent.”

The robots.txt protocol was created in the mid-1990s as a way to prevent web crawlers from overloading websites. While there is no clear legal enforcement mechanism, it has historically been widely followed on the web, and some groups, such as the News Media Alliance, have said there may still be room for publishers to take legal action.
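To make the convention concrete, here is a minimal sketch of how a well-behaved crawler checks robots.txt rules, using Python's standard `urllib.robotparser` module. The user agent names and rules below are hypothetical, not taken from any real publisher's robots.txt file:

```python
# Sketch: honoring robots.txt "no crawl" rules the way a compliant
# crawler is expected to. "ExampleAIBot" is an invented user agent.
from urllib.robotparser import RobotFileParser

# A hypothetical publisher blocking one AI crawler site-wide while
# leaving the site open to all other crawlers.
rules = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler calls can_fetch() before requesting any page;
# nothing technically prevents a crawler from skipping this check,
# which is the circumvention TollBit describes.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The key point is that the check is entirely voluntary: robots.txt is a published request, and compliance happens only if the crawler's own code consults it before fetching.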

More recently, robots.txt has become a key tool publishers use to stop tech companies from taking their content for free for use in generative AI systems that can mimic human creativity and instantly summarize articles.

AI companies use the content both to train their algorithms and to generate summaries of real-time information.

Some publishers, including The New York Times, have sued AI companies for copyright infringement over such use. Other publishers have licensing agreements with AI companies that are willing to pay for content, though the two sides often disagree over the value of the material. Many AI developers argue that accessing the content for free violates no laws.

Thomson Reuters, which owns Reuters News, is one of the companies that has signed deals to license the use of news content by AI models.

Publishers have been sounding the alarm about news summaries in particular since Google launched a product last year that uses AI to create summaries for some search queries.

If publishers want to prevent Google's AI from using their content to generate summaries, they must use the same tool that also blocks their content from appearing in Google search results, effectively making it invisible on the web.
