Generative-AI companies have had an extraordinary impact on how people seek and access information. Chatbots confidently answer almost any question and generate images and videos in seconds, displacing traditional search engines and human experts as sources of knowledge. But their input (the data that determines how chatbots respond to users) is a closely guarded secret, held by powerful companies fighting hard for AI dominance.
The question of how AI models are trained matters because AI companies have trained their machines on copyrighted works without the consent of the writers, musicians, podcasters, filmmakers, and others who created them. (Many tech companies have been sued for doing this, and the legality of the practice remains an open question.) The works behind an AI's behavior may also contain misinformation, conspiracy theories, and material that some people find objectionable.
In creating AI Watchdog, The Atlantic's goal is to open the black box of machine learning. For generations, The Atlantic has been immersed in the future of technology, and in the wild imagination, hubris, and dramatic change that accompany any technological revolution. Vannevar Bush anticipated the hyperlink in our pages. And we were trying to train machines to write like humans long before ChatGPT existed. More recently, we published a groundbreaking investigation of Books3, a dataset of nearly 200,000 copyrighted books used to train large language models. Since then, we have covered a much larger collection of pirated books and shown that writing from films and television shows is also being used by AI companies without the writers' consent.
AI Watchdog expands these efforts with a search tool that lets you see which materials are contained in various datasets, and which tech companies have used those materials to train AI products. At launch, it includes more than 7.5 million books, 81 million research articles, 15 million YouTube videos, and writing from tens of thousands of movies and TV shows. We will add more datasets as we review them. Most of the datasets in our collection were created by AI companies or research institutions and published in AI-development forums.
If my work appears in the search tool, was it definitely used to train AI?
Probably, but a work's appearance in a dataset is not conclusive evidence that a particular company actually used that work. Any given company may have chosen to exclude it when training its models.
How do AI companies get content without paying?
AI companies sometimes pay to license content for training, but they also use many techniques to avoid paying:
- Books are usually obtained from pirate libraries on the web or via BitTorrent.
- Other media can be collected by scraping the web at scale or by downloading existing web scrapes such as Common Crawl.
- Search indexes such as Bing, Brave, and Google can be used to find full-text articles that AI companies can then use.
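To make the Common Crawl route above concrete, here is a minimal sketch of querying Common Crawl's public CDX index to see whether captures of a given page exist. The crawl identifier (`CC-MAIN-2024-33`) is an assumption for illustration; crawl IDs rotate and should be looked up on the Common Crawl site.

```python
# Sketch: query Common Crawl's CDX index for captures of a URL.
# The crawl ID below is an assumption; real crawl IDs rotate over time.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

def build_query(url: str, limit: int = 5) -> str:
    """Build a CDX index query URL for a given page URL."""
    params = urllib.parse.urlencode(
        {"url": url, "output": "json", "limit": str(limit)}
    )
    return f"{CDX_ENDPOINT}?{params}"

def lookup(url: str) -> list[dict]:
    """Fetch matching capture records (the API returns one JSON object per line)."""
    with urllib.request.urlopen(build_query(url)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]
```

Each returned record points to an offset inside a WARC archive file, which is how bulk consumers (including AI companies) retrieve the actual page content without crawling the site themselves.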
How can I prevent AI companies from using my work?
The frenzy to gather training data has been under way for several years, and many companies may already be using your work. But tech companies are still constantly scraping the web in search of new material, and there are things you can do that may help protect your work.
If your work is visual, placing watermarks and logos on images and videos can make it less attractive for AI training: companies generally don't want to risk their products reproducing marks that identify individual creators. For example, Stability AI was sued after its Stable Diffusion image generator produced composite images containing the Getty Images watermark.
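The watermarking advice above can be sketched in a few lines. This is a minimal illustration using the Pillow imaging library, not a recommendation of any specific tool; the text, position, and opacity are arbitrary choices for the example.

```python
# Sketch: overlay a semi-transparent text watermark on an image with Pillow.
# A watermark is a deterrent, not a technical protection measure.
from PIL import Image, ImageDraw

def add_watermark(img: Image.Image, text: str = "(c) example") -> Image.Image:
    """Return a copy of img with a translucent watermark in the lower-left corner."""
    base = img.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    # Semi-transparent white text (alpha 160 of 255), near the bottom edge.
    draw.text((10, base.height - 20), text, fill=(255, 255, 255, 160))
    return Image.alpha_composite(base, overlay)
```

A repeated or tiled watermark across the whole frame is harder to crop out than a single corner mark, at the cost of being more intrusive for viewers.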
There are also AI-poisoning tools such as Nightshade and Glaze. These modify images in ways that humans cannot see but that interfere with an AI model's ability to learn from those images; poisoned AI models can generate incoherent output. At least one poisoning tool has also been developed for music.
I think my work was used by a particular company. What can I do about it?
Individuals and institutions have filed dozens of lawsuits against AI companies for training their products on copyrighted books, articles, songs, videos, and art. Some of these cases are class actions, meaning that if the plaintiffs win, many rights holders could be entitled to damages. (The Atlantic is a plaintiff in litigation against an AI startup.) If a work is registered with the U.S. Copyright Office, the potential for damages is greater.
