Cloudflare has launched a new free tool that prevents AI companies' bots from scraping content from customers' websites to train large language models. The cloud service provider is offering the tool to all customers, including those on its free plan. “The feature will be automatically updated whenever we find new fingerprints of problematic bots that we've identified as scraping the web extensively to train models,” the company said.
In announcing the update, the Cloudflare team also shared data on how customers are responding to the proliferation of bots that scrape content to train generative AI models. According to the company's internal data, 85.2% of customers choose to block AI bots from accessing their sites, including bots that properly identify themselves.
Cloudflare also identified the most active bots over the past year: Bytedance-owned Bytespider attempted to access 40% of websites under Cloudflare's control, and a second crawler attempted to access 35%. These two bots, together with Amazonbot and ClaudeBot, made up the top four AI bot crawlers by number of requests on the Cloudflare network.
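For bots that do identify themselves, the simplest check a site operator can make is on the User-Agent header. The following is a minimal sketch in Python, assuming the crawler names mentioned above appear as substrings of the User-Agent string. The token list and function names are hypothetical, chosen for illustration. This is not Cloudflare's fingerprint-based detection, and it does nothing against bots that disguise themselves.

```python
# Hypothetical blocklist: tokens assumed to appear in the User-Agent header
# of crawlers that identify themselves. Not an official or complete list.
AI_BOT_TOKENS = ("bytespider", "amazonbot", "claudebot")


def is_self_identified_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the listed tokens."""
    ua = (user_agent or "").lower()  # treat a missing header as empty
    return any(token in ua for token in AI_BOT_TOKENS)


def status_for_request(user_agent: str) -> int:
    """Return HTTP 403 for self-identified AI crawlers, 200 otherwise."""
    return 403 if is_self_identified_ai_bot(user_agent) else 200
```

A check like this only works for crawlers that announce themselves honestly, which is why detection based on behavior rather than declared identity matters for the bots that do not.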
Blocking AI bots from accessing content completely and consistently has proven very difficult, and the arms race to build models faster has led some companies to circumvent or outright break existing rules meant to block scrapers, collecting content from websites without the necessary permission. But a backend company the size of Cloudflare could make some headway if it makes a serious effort to stop the practice.
“We are concerned that some AI companies are trying to circumvent our rules to access content, and are persistently adapting to evade bot detection,” the company said. “We will continue to monitor and add bot blocking to our AI scraper and crawler rules, and evolve our machine learning models to keep the internet a place where content creators can thrive, and have full control over which models their content is used to train or run inference on.”