Crackdown on bots: How new rules could save the web from AI scrapers

Australians are among the people most concerned about artificial intelligence (AI) in the world.

This concern is driven by fears that AI is being used to spread misinformation and deceive people, by worries about job losses, and by the fact that AI companies are training their models on the expertise and creative work of others for free.

AI companies use pirated books and articles and regularly send bots across the web to systematically collect content for their models to learn from. That content may come from social media platforms such as Reddit, university academic research repositories, and authoritative publications such as news organizations.

In the past, web scraping operated under a kind of détente. Although scraping may sometimes have been technically illegal, it was needed to make the internet work. For example, without scraping there would be no Google. Website owners accepted scraping because it made their content more available, in line with the “open web” vision.

Under these circumstances, scraping was governed by principles such as respect, recognition, and reciprocity. In the context of AI, those principles are breaking down.

A new online environment

Many news organizations now block web scrapers. Creators are choosing not to use certain platforms or posting less.

Barriers are being erected throughout the open web. Democracy, scientific innovation, and creative communities all suffer when only some people can afford to pay to access news and information.

Exceptions to copyright infringement, such as fair dealing for research or study, were enacted long before generative AI became widely available. These exceptions are no longer fit for purpose in the AI era.

The Australian Government has ruled out a new copyright exception for text and data mining. While this demonstrates a commitment to supporting Australia’s creative industries, there remains great uncertainty about how creative content can be managed legally and at scale as AI companies roam the web.

In response, the international nonprofit organization Creative Commons proposed a new voluntary framework: CC Signals.

Creative Commons licenses allow creators to share their content and control how it is used. All licenses require attribution, but various additional restrictions may apply. Creators can ask others not to modify their work or use it for commercial purposes. For example, articles from The Conversation can be reused under the CC BY-ND license. This means you must credit the source and may not remix, transform, or build upon it.

How do CC signals work?

The proposed CC Signals framework allows creators to decide whether and how their material will be used by machines. It aims to strike a balance between responsible use of AI and not stifling innovation, and is based on the principles of consent, compensation and credit.

Put simply, CC signals work by allowing “declarers” such as news websites to attach machine-readable instructions to a body of content. These instructions specify which machine uses are permitted, in what combinations, and under what conditions.

CC signals are standardized and can be understood by both humans and machines.
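To make the idea of machine-readable instructions concrete, here is a minimal sketch in Python. It is not the Creative Commons syntax: the class names, the use categories ("search-indexing", "ai-training", "ai-generation") and the condition labels ("credit", "contribution") are all hypothetical, chosen only to show how a declarer might pair permitted machine uses with conditions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    """One instruction: a machine use, whether it is allowed, and on what conditions."""
    use: str                              # hypothetical category, e.g. "ai-training"
    allowed: bool
    conditions: frozenset = frozenset()   # hypothetical labels, e.g. {"credit"}


@dataclass
class Declaration:
    """The set of signals a declarer attaches to a body of content."""
    declarer: str
    scope: str                            # descriptive only: the content the signals cover
    signals: dict = field(default_factory=dict)

    def add(self, signal: Signal) -> None:
        self.signals[signal.use] = signal

    def permits(self, use: str, commitments: set) -> bool:
        """Return True only if the use is allowed and every condition is met.

        Assumption: a use with no signal attached is treated as not permitted.
        """
        signal = self.signals.get(use)
        if signal is None or not signal.allowed:
            return False
        return signal.conditions <= set(commitments)


# A hypothetical news site declares three signals for its articles.
news = Declaration(declarer="news.example", scope="/articles/")
news.add(Signal("search-indexing", allowed=True))
news.add(Signal("ai-training", allowed=True,
                conditions=frozenset({"credit", "contribution"})))
news.add(Signal("ai-generation", allowed=False))

print(news.permits("search-indexing", set()))                   # True
print(news.permits("ai-training", {"credit"}))                  # False
print(news.permits("ai-training", {"credit", "contribution"}))  # True
print(news.permits("ai-generation", {"credit"}))                # False
```

Treating an unsignalled use as not permitted is a choice made for this sketch; a real framework could just as reasonably leave such uses undefined.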

This proposal comes at a moment that closely reflects the early days of the web, when norms around automated access (crawling and scraping) were still being developed in practice rather than in law.

A useful historical analogy is robots.txt. This is a simple file that a website uses to tell bots which parts of the site they can access as they scour the web looking for content. Although it was not enforceable, it became widely adopted because it provided a clear and standardized way to communicate expectations between content hosts and developers.
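As a rough illustration of how a well-behaved bot consults robots.txt, the sketch below uses the `urllib.robotparser` module from Python's standard library. The robots.txt content, the bot names and the URLs are made up for the example; nothing here forces a bot to comply, which is exactly the sense in which the file is not enforceable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example news site.
# The bot names and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Ask the parsed rules whether this bot may fetch this URL."""
    return parser.can_fetch(user_agent, url)


# The AI bot is asked to stay away from the whole site.
print(may_fetch("ExampleAIBot", "https://news.example/articles/1"))      # False
# Other bots may read articles but not drafts.
print(may_fetch("ExampleSearchBot", "https://news.example/articles/1"))  # True
print(may_fetch("ExampleSearchBot", "https://news.example/drafts/1"))    # False
```

The check is entirely voluntary: a scraper that never calls something like `may_fetch` simply ignores the file.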

CC signals may operate in much the same spirit. However, like any system, the framework has drawbacks as well as potential benefits.

Pros

This framework provides more nuance and flexibility than the current all-or-nothing choice between allowing scraping and blocking it. This gives creators more control over the use of their content.

It can also affect the amount of high-quality content available for scraping. Without access to high-quality data, biases in AI will worsen and the technology will be less useful.

The framework could also benefit small businesses that don’t have the bargaining power to negotiate with big tech companies, but still want compensation, credit and recognition for their work.

Cons

The biggest problem with CC signals may be the practical challenge of calculating and enforcing the monetary or in-kind support that some signals require.

This is also a major problem with the content industry’s proposals for bulk licensing schemes for AI. Calculating and distributing license fees for the thousands, if not millions, of Internet works accessed by generative AI systems around the world is a logistical nightmare.

Creative Commons said it plans to create a best practices guide on how to contribute and give credit based on CC signals. However, this work is still in progress.

Where do we go from here?

Creative Commons argues that the CC Signals framework is less a legal tool than an attempt to define “machine etiquette.” Manners are a good way to think about this.

There are significant legal and practical hurdles to implementing effective copyright management for AI systems. But we should embrace new ideas and frameworks that bring respect and recognition for creators to the fore without halting important technological developments.

CC Signals is an incomplete framework, but it is a start. We hope there will be more to come.

TJ Thomson, Associate Professor of Visual Communication and Digital Media, RMIT University; Daniel Angus, Professor of Digital Communication and Director of the QUT Digital Media Research Center, Queensland University of Technology; Jake Goldenfein, Associate Professor, Melbourne Law School, University of Melbourne; and Kylie Pappalardo, Associate Professor, Faculty of Law, Queensland University of Technology.

This article is republished from The Conversation under a Creative Commons license.