AI companies are reportedly still scraping websites, despite protocols aimed at blocking it.

Perplexity, a company that describes its product as a “free AI search engine,” has come under fire in recent days. Forbes accused it of plagiarizing one of its articles and republishing it on multiple platforms. Wired reported that Perplexity was ignoring the Robots Exclusion Protocol (robots.txt) and scraping its website and those of other Condé Nast publications. The Shortcut also accused the company of scraping its articles. Now Reuters reports that Perplexity is not the only AI company circumventing robots.txt files and scraping websites for content that is then used to train AI technology.

Reuters said it had seen a letter to publishers from TollBit, a startup that connects AI companies with publishers to arrange licensing agreements. The letter warned that “AI agents from multiple sources (not just one) are choosing to circumvent the robots.txt protocol to retrieve content from sites.” A robots.txt file contains instructions telling web crawlers which pages they can and cannot access. Web developers have used the protocol since 1994, but compliance is entirely voluntary.
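To illustrate what voluntary compliance means in practice, here is a minimal sketch of the check a well-behaved crawler might run before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name `ExampleAIBot`, and the `example.com` URLs are hypothetical and do not come from any of the reports cited here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The bot name and paths are made up
# for illustration and do not describe any real site or crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The site blocks ExampleAIBot entirely but lets other bots read news pages.
    print(is_allowed("ExampleAIBot", "https://example.com/news/story.html"))
    print(is_allowed("SomeOtherBot", "https://example.com/news/story.html"))
    print(is_allowed("SomeOtherBot", "https://example.com/private/notes.html"))
```

Nothing in this check stops a crawler from skipping it and requesting the page anyway, which is why the protocol depends on crawlers choosing to honor it.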

The TollBit letter did not name any companies, but Business Insider said it had learned that OpenAI and Anthropic, the developers of the ChatGPT and Claude chatbots, were also circumventing robots.txt signals. Both companies had previously declared that they respect “do not crawl” instructions that websites put in their robots.txt files.

In its investigation, Wired discovered that a machine on an Amazon server “operated by Perplexity” was bypassing the robots.txt instructions on its website. To see whether Perplexity was scraping its content, Wired fed the tool headlines from its articles and short prompts describing them. The tool reportedly returned results that closely paraphrased the articles “with minimal citations,” and it sometimes produced inaccurate summaries. In one case, Wired said, the chatbot falsely claimed that the publication had reported that a specific California police officer committed a crime.

In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company is not ignoring the Robots Exclusion Protocol and then lying about it. That does not mean, however, that the company does not benefit from crawlers that ignore the protocol. Srinivas said the company uses third-party web crawlers in addition to its own, and that the crawler Wired identified was one of them. When Fast Company asked whether Perplexity had told the crawler provider to stop scraping Wired's website, he would say only, “It's complicated.”

Srinivas defended his company's practices, telling the publication that the Robots Exclusion Protocol is “not a legal framework” and suggesting that publishers and companies like his may need to forge a new kind of relationship. He also reportedly suggested that Wired deliberately used prompts to make Perplexity's chatbot behave the way it did, so the average user would not get the same results. “We never said we'd never hallucinated,” Srinivas said of the inaccurate summaries the tool generated.
