Cloudflare on Wednesday offered customers whose websites sit behind its network a way to block AI bots from scraping site content and using that data to train machine learning models without permission.
In a statement, the company said it did so based on customer dislike of AI bots and to “contribute to keeping the internet safe for content creators.”
“We've heard loud and clear from customers that they don't want AI bots accessing their websites, especially unauthorized access. To help, we've added an all-new feature that allows you to block all AI bots with just one click.”
Somewhat effective ways to block bots already exist: any website owner can publish a robots.txt file. When the file is placed in the root directory of a site, well-behaved automated crawlers are expected to read it and obey its directives about which paths they may and may not crawl.
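As a minimal sketch of how that convention is supposed to work, the snippet below uses Python's standard-library urllib.robotparser to show how a rule-abiding crawler is expected to interpret such a file. The directives mirror the kind OpenAI documents for blocking its GPTBot crawler; the site URL is hypothetical and the example is illustrative rather than any vendor's actual implementation.

```python
# Sketch: how a compliant crawler is expected to consult robots.txt
# before fetching a page. Directives and URL are example values only.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A rule-abiding GPTBot must skip the page; an ordinary browser user-agent
# is unaffected because no directive applies to it.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

The catch, as the rest of this piece explains, is that nothing in the protocol forces a crawler to run a check like this at all.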
Given the widespread belief that generative AI is built on theft, and the many lawsuits attempting to hold AI companies to account, the companies that trade in laundered content have graciously allowed web publishers to opt out of the harvesting.
Last August, OpenAI published guidance on how to block its GPTBot crawler using robots.txt directives, likely acknowledging concerns about content being scraped without consent and used for AI training. Google followed suit the following month. And last September, Cloudflare began offering a way to block rule-abiding AI bots, a block that 85 percent of its customers are said to have enabled.
The network services outfit is now looking to provide a stronger barrier to bot intrusion: the internet is “now flooded with these AI bots,” according to the company, which says they access about 39 percent of the top one million web properties served by Cloudflare.
The problem is that, like the Do Not Track header that browsers implemented some 15 years ago to signal a preference for privacy, robots.txt can simply be ignored, and doing so typically carries no repercussions.
And recent reports suggest AI bots are doing just that: Amazon said last week it was investigating evidence that bots working on behalf of AI search company Perplexity, an AWS customer, were crawling websites, including news sites, and reproducing their content without proper credit or permission.
Perplexity was accused of violating robots.txt directives, which Amazon cloud customers are supposed to honor. The AI startup's CEO, Aravind Srinivas, denied that the company had wrongfully ignored the file, but acknowledged that third-party bots used by Perplexity had scraped pages against the wishes of webmasters.
Disguise
“Unfortunately, we have observed bot operators using spoofed user-agents to make their crawlers appear to be legitimate browsers,” Cloudflare said. “We have been monitoring this activity for many years and are proud to say that our global machine learning models have consistently flagged it as bot traffic, even when the operators lied about the user-agent.”
Cloudflare said its machine learning scoring system consistently rated the disguised Perplexity bots below 30 between June 14 and June 27, indicating they were “likely automated.”
This bot-detection approach relies on digital fingerprinting, a technique commonly used to track people online and deny them privacy. Crawlers, like individual internet users, can often be distinguished by technical details observable in their network interactions.
These bots tend to use the same tools and frameworks to automate visits to websites, and with a network that handles an average of 57 million requests per second, Cloudflare has enough data to determine which of these fingerprints can be trusted.
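To make the idea concrete, here is a toy sketch of request fingerprinting, not Cloudflare's model: it hashes network-level attributes that persist even when the User-Agent string is swapped, so two requests claiming different browsers but driven by the same automation framework collapse to the same identifier. Every field name and value below is invented for the example.

```python
# Illustrative only: a toy fingerprint built from attributes a proxy can
# observe regardless of the claimed User-Agent (TLS handshake details,
# HTTP/2 settings, header ordering, source ASN). Values are fabricated.
import hashlib

def fingerprint(request: dict) -> str:
    signal = "|".join([
        request.get("tls_cipher_order", ""),
        request.get("http2_settings", ""),
        request.get("header_order", ""),
        request.get("asn", ""),
    ])
    return hashlib.sha256(signal.encode()).hexdigest()[:16]

# Two requests claiming different browsers but sharing the same automation
# stack produce identical fingerprints, exposing the spoofed User-Agent.
req_a = {"tls_cipher_order": "4865-4866-4867",
         "http2_settings": "h2;65536;1000",
         "header_order": "host,user-agent,accept",
         "asn": "AS64500",
         "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req_b = dict(req_a, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)")

print(fingerprint(req_a) == fingerprint(req_b))    # True: same underlying client
print(req_a["user_agent"] == req_b["user_agent"])  # False: different claims
```

A production system would of course weigh far more signals and feed them into a learned scoring model rather than a single hash, but the principle is the same: the lie in the user-agent is contradicted by everything else the connection reveals.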
Now that capability has been put to work: the machine learning model will guard against bots foraging for AI training data, and the defense is available even to customers on the free tier. All a customer needs to do is open the Security -> Bots section of the dashboard and click the toggle labeled "Block AI Scrapers and Crawlers."
“We are concerned that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection,” Cloudflare said in a statement. “We will continue to keep watch, add more bot blocks to our AI scraper and crawler rules, and evolve our machine learning models to help keep the internet a place where content creators can thrive and retain full control over which models their content is used to train or run inference on.”®