Cloudflare strengthens ability to block AI bots from scraping websites

Cloudflare, a global internet security company that claims to protect around 20% of the world's web traffic, has launched what it calls an “easy button” for website owners who want to block AI services from accessing their content. The move comes in the wake of a surge in demand for content used to train AI models.

Cloudflare's core service, which acts as an internet proxy, scans and filters web traffic before it reaches websites, and the company says that on average, its network handles more than 57 million requests per second.

“To help keep the internet safe for content creators, we've introduced a brand new 'easy button' to block all AI bots,” Cloudflare said in a statement on Wednesday. “It's clear that our customers don't want AI bots, especially bad bots, visiting their websites.”

While some AI companies clearly identify their web scraping bots and respect website instructions to stay away, not all are transparent about their activities.
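The "website instructions" in question are typically expressed in a robots.txt file, which well-behaved crawlers consult before fetching pages. As a minimal sketch, assuming a site wants to turn away the GPTBot and Bytespider crawlers named later in this article, Python's standard `urllib.robotparser` module can show how a compliant crawler would interpret such a file. The rules and the example.com URL below are illustrative, not taken from Cloudflare.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: turn away two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Bytespider", "SomeBrowser"):
        print(agent, allowed(agent, "https://example.com/article"))
```

Note that robots.txt is advisory: it only works for crawlers that choose to check and obey it, which is why a network-level block is attractive to site owners.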

This new simplified setup is available to all Cloudflare customers, including those on free plans.

Analyzing AI Bot Activity

Along with the announcement, Cloudflare also shared a wealth of information about the AI crawler activity it's observing across its systems.

According to Cloudflare data, AI bots accessed about 39% of the top 1 million “Internet properties” that use Cloudflare in June. However, only 2.98% of these properties took action to block or challenge those requests. Cloudflare also noted that “the higher the ranking (more popular) an Internet property is, the more likely it is to be targeted by AI bots.”

The company said the most active web crawlers are operated by TikTok owner ByteDance, Amazon, Anthropic and OpenAI. The top crawler was ByteDance's Bytespider, which led in number of requests, extent of activity, and how often it was blocked. GPTBot, which is managed by OpenAI and used to collect training data for products such as ChatGPT, ranked second in both crawling activity and blocks.

Perplexity's web crawlers, which recently sparked controversy for their content-crawling practices, were detected accessing only a small percentage of the sites protected by Cloudflare.

Website owners can implement their own rules to block known web crawlers, but Cloudflare said that most of its customers who do so block only mainstream AI developers like OpenAI, Google, and Meta, and not the top crawlers from ByteDance or other companies.

AI vs. AI

Cloudflare's report highlights that some AI bot operators are resorting to deceptive tactics to circumvent blocking measures, trying to disguise their crawler activity as legitimate web traffic.

“Unfortunately, we have observed bot operators using spoofed user-agents in an attempt to appear as legitimate browsers,” Cloudflare wrote.
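To see why spoofing defeats simple rules, consider a rough sketch of a blocklist that matches on the User-Agent header alone. The token list, function name, and both header strings below are hypothetical examples written for illustration; they are not Cloudflare's rules or real crawler headers.

```python
# Hypothetical blocklist of crawler name tokens (lowercase).
BLOCKED_TOKENS = ("gptbot", "bytespider")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Naive rule: block if the User-Agent contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A crawler that identifies itself honestly (simplified, made-up string).
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# The same kind of crawler presenting a browser-like string instead (made up).
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(blocked_by_user_agent(honest))
    print(blocked_by_user_agent(spoofed))
```

The honest crawler is caught and the spoofed one passes, because the header is entirely under the sender's control. That gap is what pushes detection toward signals a bot cannot easily fake.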

Ultimately, AI is a key tool in the company's arsenal for thwarting automated activity, whether from AI developers, search engines or malicious actors. Cloudflare says it uses machine learning models to assign a “bot score” to each request to websites protected by its service, with a lower score indicating the activity is less likely to be legitimate.

The model uses Cloudflare's vast dataset of global internet traffic to determine a bot score, taking into account a variety of signals, including the request's IP address, user agent, and behavioral patterns.

To illustrate this, Cloudflare says it looked at traffic from specific bots known for evasive behavior. The results were telling: all detections scored below 30 out of 100, with the vast majority falling into the bottom two bands, scoring 9 or less. In other words, even when a bot tried to hide its source, its activity patterns gave it away and Cloudflare was able to block it.
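How a site might act on such a score can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the "block"/"allow" actions are hypothetical, and the cutoff of 30 is borrowed from the figure reported above rather than from any documented Cloudflare default.

```python
def action_for_score(score: int, block_below: int = 30) -> str:
    """Map a bot score to an action; lower scores mean more bot-like traffic.

    The 0-100 range follows the article's "out of 100" wording, and the
    default cutoff of 30 is an assumption made for this example.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return "block" if score < block_below else "allow"


if __name__ == "__main__":
    for score in (3, 9, 29, 30, 85):
        print(score, action_for_score(score))
```

Under this rule, every evasive bot in Cloudflare's sample would be blocked, since all of them scored below 30.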

Protecting Web Content

Generative AI models rely on vast amounts of existing content, much of it collected from across the web, and developers need to keep collecting it at scale so their models can keep providing up-to-date information.

Website owners and content creators are fighting back, and major publishers such as news organizations are taking legal action against AI companies. Forbes and Wired have claimed that their content is being taken and republished without permission. Music publisher Sony preemptively warned more than 700 tech companies to stay away in May, and Warner Music Group did the same this week.

If AI increasingly provides information to users without referencing the source, it could become an existential threat to publishers: SparkToro CEO Rand Fishkin recently published a study showing that 60% of people searching for information on Google stopped visiting the website providing that information because Google's AI provided an immediate, summarized answer.

Editor: Ryan Ozawa.
