Artists face challenges shielding their work from AI crawlers

Visual artists want to protect their work from nonconsensual use by generative AI tools such as ChatGPT. However, most of them have neither the technical know-how nor control over the tools needed to do so.

One of the best ways to protect an artist's creative work is to keep it from being seen by “AI crawlers” (programs that collect data on the Internet to train generative models). However, most artists have no access to tools that would let them block these crawlers. And when they do have access, they don't know how to use them.

These are some of the conclusions of a study by a group of researchers at the University of California, San Diego and the University of Chicago, presented at the 2025 Internet Measurement Conference held in Madison, Wisconsin in October.

“At the heart of the conflict in this paper is the notion that content creators want to control how their content is used, not merely whether it is accessible. Such rights are explicit in copyright law but are not readily expressible on the Internet today; instead, a set of ad hoc controls has emerged based on existing web capabilities,” the researchers write.

The research team surveyed more than 200 visual artists about their demand for tools to block AI crawlers and about their technical expertise. Researchers also reviewed over 1,100 professional artist websites to see how much control artists have over AI-blocking tools. Finally, the team evaluated which processes were most effective at blocking AI crawlers.

Today, artists can fairly easily use tools that mask their original artwork from AI crawlers by turning the art into something else. The study's University of Chicago co-authors developed one of these tools, known as Glaze.

But ideally, artists would be able to keep AI crawlers from harvesting their data altogether. To do this, visual artists need to protect themselves from three categories of AI crawlers. One type collects data to train the large language models that power chatbots, another enhances the knowledge of AI-backed assistants, and a third supports AI-backed search engines.

The researchers will present their work at the ACM Internet Measurement Conference, held in Madison, Wisconsin, this October.

Artist survey

There has been extensive media coverage of how significantly AI has disrupted the livelihoods of many artists. As a result, nearly 80% of the 203 visual artists surveyed by researchers said they had tried to take proactive steps to prevent their artwork from being included in training data for generative AI tools. Two-thirds reported using Glaze. Additionally, 60% of artists have reduced the amount of work they share online, and 51% share only low-resolution images of their works.

Additionally, 96% of artists want access to tools that can prevent AI crawlers from harvesting their data. But over 60% of them were not familiar with robots.txt, one of the simplest tools that could do this.

Tools to suppress AI crawlers

robots.txt is a simple text file located in the root directory of a website that describes which pages crawlers may access on that website. The file can also specify which crawlers are not allowed to access the website at all. However, crawlers are not obligated to comply with these restrictions.
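As a rough illustration of how such a file is read, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module and asks whether particular crawlers may fetch an image. The file contents, the example.com URLs and the name SomeOtherBot are hypothetical and chosen only for illustration; GPTBot and Bytespider are used here as example crawler names. This shows how a compliant crawler would interpret the rules, not how any specific crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawlers are blocked from the whole site,
# and every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical artwork URL on an artist's site.
ARTWORK_URL = "https://example.com/gallery/piece.jpg"

# Map each crawler name to whether the rules permit it to fetch the artwork.
ALLOWED = {
    agent: parser.can_fetch(agent, ARTWORK_URL)
    for agent in ("GPTBot", "Bytespider", "SomeOtherBot")
}

for agent, allowed in ALLOWED.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The file only states the site owner's wishes. Nothing in this check stops a crawler that chooses to ignore it.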

Researchers looked at the 100,000 most popular websites on the Internet and found that over 10% explicitly disallow AI crawlers in their robots.txt files. However, some sites, including Vox Media and The Atlantic, have removed the ban after signing license agreements with AI companies. In fact, the number of sites that allow AI crawlers is increasing. The researchers hypothesize that some of these sites aim to spread misinformation to LLMs.

One of the problems for artists is that they have no access to or control over the relevant robots.txt file. In a survey of 1,100 artist websites, researchers found that more than three-quarters are hosted on third-party service platforms, most of which do not allow changes to robots.txt. Many of these content management systems give artists little or no information about which types of crawling are blocked. Squarespace is the only company that offers a simple interface for blocking AI tools. However, researchers found that only 17% of artists using Squarespace enable this option. This may be because, in many cases, artists don't know that the option is available.

But do crawlers respect the bans listed in robots.txt, even though compliance is not mandatory?

The answers are mixed. Large corporate crawlers generally respect robots.txt, both in their claims and in practice. The only crawler that researchers could clearly determine does not respect it is Bytespider, deployed by ByteDance, the owner of TikTok. Furthermore, many crawlers claim that they respect robots.txt restrictions, but researchers were unable to confirm that this is actually the case.

Overall, “The majority of AI crawlers run by large companies respect robots.txt, but the majority of AI assistant crawlers do not,” the researchers wrote.
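One simple way a site owner might check compliance in practice is to compare observed requests against the site's own robots.txt rules. The sketch below is a minimal illustration of that idea, not the researchers' measurement method. The robots.txt text and the list of observed requests are hypothetical, and the check trusts the user-agent string a crawler reports, which a crawler can fake.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: two AI crawlers are disallowed from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

# Hypothetical (user agent, requested path) pairs, e.g. taken from access logs.
OBSERVED_REQUESTS = [
    ("GPTBot", "/robots.txt"),
    ("Bytespider", "/robots.txt"),
    ("Bytespider", "/art/1.jpg"),
    ("Googlebot", "/art/1.jpg"),
]


def find_violations(robots_txt, requests):
    """Return the requests that the given robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for agent, path in requests:
        # Fetching robots.txt itself is expected behavior, so skip it.
        if path == "/robots.txt":
            continue
        if not parser.can_fetch(agent, path):
            violations.append((agent, path))
    return violations


VIOLATIONS = find_violations(ROBOTS_TXT, OBSERVED_REQUESTS)
for agent, path in VIOLATIONS:
    print(f"{agent} requested disallowed path {path}")
```

A crawler that never appears in the logs under its own name cannot be caught this way, which is one reason claims of compliance are hard to confirm.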

Recently, network provider Cloudflare launched a “Block AI Bot” feature. At this point, only 5.7% of sites using Cloudflare have enabled this option. However, researchers hope it will become more popular over time.

“It's an ‘encouraging new option,’ but we hope providers will become more transparent about the operation and coverage of their tools (for example, by providing a list of the AI bots that are blocked),” said a student in Savage's research group.

Legislative and Legal Uncertainty

The global landscape around AI crawlers is constantly changing due to legal challenges and a wide range of legislative proposals.

In the US, AI companies face legal challenges over the extent to which copyright applies to models trained on data scraped from the Internet, and over what their obligations are to the creators of that content. In the European Union, the recently passed AI Act requires providers of AI models to obtain permission from copyright holders to use their data.

“There is reason to believe that confusion over the availability of legal remedies will only focus more attention on technical access controls,” the researchers wrote. “To the extent that US courts find an affirmative ‘fair use’ defense for AI model builders, this weakening of remedies on use will inevitably create even stronger demand to enforce controls on access.”

This work was funded in part by NSF grant SaTC-2241303 and Office of Naval Research project #N00014-24-1-2669.

Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers

Enze “Alex” Liu, Elisa Luo, Geoffrey M. Voelker and Stefan Savage, Department of Computer Science and Engineering, University of California San Diego

Shawn Shan and Ben Y. Zhao, University of Chicago
