AI companies need to be regulated: An open letter to the US Congress and the European Parliament


Federico: Historically, technology has advanced in lockstep with giving people new creative opportunities. From word processors that enabled authors to write their next novel, to digital cameras that allowed photographers to express themselves in new ways and capture more moments, technological advances over the past few decades have empowered creators and, perhaps more importantly, spawned industries that didn't exist before.

Technology has enabled millions of people like me to realize their life dreams and make a living from “content creation” in the digital age.

All this is changing with the emergence of artificial intelligence products based on large language models – changes that, we believe, could get worse if left unregulated.

Over the past two years, we have witnessed the rise of AI tools and services that often use human input without consent in an effort to deliver faster and cheaper results. This is not surprising in a capitalist industry obsessed with maximizing profits above all else, but it is still deeply concerning, especially because this time, most of these AI tools are built on a foundation of non-consensual appropriation, or, put simply, digital theft.

As we have documented at MacStories, and as other (larger) publications have also investigated, it has become clear that the underlying models of various LLMs are trained on content taken from the open web without seeking permission from the publishers beforehand. These models can then power AI interfaces that can spit out similar content or provide answers with obscured citations that rarely send traffic to the publishers. As far as MacStories is concerned, this is limited to text scraped from websites, but we are seeing it happen in other industries as well, from design assets to photos, music, and more. Moreover, publishers and creators whose content has been appropriated for training and/or crawled for generated responses cannot even ask the AI companies for transparency about which parts of their content were used. It's a black box where original content goes in and derivative slop comes out.

We think this is all wrong.

The practices followed by the majority of AI companies are ethically unfair to publishers and walk a dangerous line of piracy that should be regulated. Most concerning, ignoring these tools could lead to the gradual erosion of the open web as we know it, diminishing individual creativity, and concentrating “knowledge” in the hands of a few tech companies that build AI services without the explicit consent of web publishers and creators.

In other words, this time we fear that technology won't create new opportunities for creative people on the web. We fear that technology will destroy creative people.

We want to do something about this problem, starting with writing an open letter on behalf of MacStories, Inc. to the US Senators who sponsored the AI bill and to the Italian members of the EU Special Committee on Artificial Intelligence in the Digital Age (embedded below).

In this letter, which we invite other publishers to copy, we outline our position on AI companies' use of the open web for training purposes, on the failure to compensate publishers for the content they appropriate and use, and on the failure to be transparent about the composition of their models' datasets. The letter was sent today in English and will be translated into Italian in the near future.

I know MacStories is just a tiny speck in the open web, and I can't afford to sue anyone. But I'd rather be strong-minded and protect my intellectual property than sit back and accept what I believe is fundamentally unfair to creators and dangerous to the open web, and I'm grateful to have business partners who share these ideals and principles.

That being said, here is a copy of the letter we will be sending to representatives of the US and EU.


We, on behalf of MacStories, Inc., are writing in support of legislation regulating the following:

  • Artificial intelligence companies commercially using third-party intellectual property to train large language models without consent.
  • AI-based content generation designed to replace or devalue original material.

MacStories is a small American media company founded in Italy in 2009 by Federico Viticci. Today, MacStories operates websites and produces several podcasts about the world of apps, technology, video games, and media, reaching a global audience, primarily in the EU and the US.

As business owners with a long history of operating on the web, we wanted to share our perspective on the training of artificial intelligence (AI) large language models (LLMs) and some of the products created with them. What we have clearly seen in the past few weeks is that, as an industry, the companies that train AI models are not respecting the intellectual property rights of web-based content creators. Moreover, these companies' cavalier attitude toward decades-old norms on the internet clearly demonstrates how the training of AI models and some of the products created with them threaten the very foundation of the web as an outlet for human creativity and communication.

The dangers to the internet as a cultural institution are real, and they are evolving just as rapidly as AI technology itself. But while the threats to the web are new and novel, what these AI companies are doing is not. Simply put, it's theft, and it's something as old as AI is new. The thieves may be well-funded, and their misdeeds may be veiled in technological sophistication, but it's still theft, and it must be stopped.

The Internet's strength comes from hyperlinks. They connect people and ideas together, creating value that is greater than the sum of its parts. But as the web grew, discovery became a problem. Google and other companies built search engines that use web crawlers to index the web. Search engines like Google are imperfect, but they generally offer a fair deal: in exchange for crawling and indexing publishers' websites, links to their content appear in search results, sending traffic to the publishers. And if publishers don't want their sites crawled, they can opt out thanks to the Robots Exclusion Protocol by uploading a simple robots.txt file to their website. This is a social contract between participants in the web that had been in operation for decades before the advent of AI.
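For illustration, a publisher's robots.txt can welcome traditional search crawlers while opting out of AI training crawlers. The sketch below uses the publicly documented user-agent tokens for Google's search crawler, OpenAI's training crawler, and Common Crawl; any real deployment should be checked against each crawler's current documentation:

```
# Served at https://example.com/robots.txt
# Allow a traditional search engine crawler to index the site
User-agent: Googlebot
Allow: /

# Opt out of OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Opt out of the Common Crawl bot, whose archives feed many training datasets
User-agent: CCBot
Disallow: /
```

Note that compliance is voluntary: robots.txt expresses the publisher's wishes, but nothing technically prevents a crawler from ignoring it, which is precisely the social contract at issue in this letter.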

But it turns out that putting more raw material into LLMs creates models that perform better. As a result, the companies that create these models, with their insatiable appetite for text, images, and video, have turned straight to the web and mined it for fuel to feed their voracious models.

The problem with the companies developing LLMs is that rather than offering publishers and other creators a fair deal and respecting their wishes about whether their content may be crawled, they've simply taken it, and in some cases brazenly lied to everyone in the process. The breadth of violators is staggering. This isn't just a startup issue. In fact, a broad swath of the tech industry, including giants like Apple, Google, Microsoft, and Meta, have joined OpenAI, Anthropic, and Perplexity in co-opting publishers' intellectual property without their consent and using that property to build their own commercial products. None of these companies honored the Robots Exclusion Protocol; instead, they offered publishers a way to opt out of crawling only after they had already stolen the entire contents of the Internet, like a thief handing the keys back to a store owner after emptying the shop.

Some companies have gone further, coming up with products that aim to replace the web as we know it by substituting AI-generated web pages for source material, which often amounts to plagiarism. Perplexity Pages, The Browser Company's Arc Search app, and the inclusion of AI answers in Google search results are all designed to get between people and web content creators. All claim to drive traffic to source material with obfuscated citations, but as Wired recently reported (and we've seen), these products drive very little traffic.

As tech writers and podcasters, we've built our careers on enthusiasm and excitement about new technology and the ways it can help people. AI is no exception. AI can play a huge role in fighting disease, climate change, and other challenges big and small that humanity faces. But in the race to advance AI and satisfy investors, the tech industry is losing sight of the value of the internet itself. If left unchecked, this disregard for internet culture will undermine the ability of today's creators to earn a fair wage for their work, and the ability of the next generation of creators to aspire to do the same.

Therefore, on behalf of our fellow internet publishers and creators, we urge you to support legislation to regulate the artificial intelligence industry to prevent further harm and to compensate creators whose work has already been misused without their consent. Existing tools to protect what is published on the web are too limited and incomplete. What is needed is a comprehensive regulatory regime that treats content creators and the companies that want to build models on what they publish as equals. That starts with giving publishers control over their content by requiring their consent before it can be used to train LLMs, and by mandating transparency about the source material used to train those models.

Federico Viticci, Editor-in-Chief and Co-Owner of MacStories
John Voorhees, Managing Editor and Co-Owner of MacStories
