Generative AI systems rooted in mass privacy violations

Amnesty International said in a new briefing today that companies are extracting vast amounts of online data through illegal web scraping to build generative artificial intelligence (AI) products in ways that enable large-scale invasions of privacy, and that these systems are illegal by design.

“Illegal by Design: Exposing the Human Rights Costs of Generative AI” documents significant risks in scraping and processing the large-scale data used to build and train these systems, including privacy violations that are built in by design and negative impacts on the environment and on historically marginalized communities.

“Companies around the world offer generative AI products under the guise of efficiency and sophistication, but in reality these systems perpetuate massive privacy violations through illegal web scraping, an automated process that extracts data, including personal data such as images and social media activity, from websites to train AI models,” said Rikita Banerji, director of Amnesty International’s Algorithmic Accountability Lab.

“Extractive data pipelines, inherent design choices made by tech companies and exploitative supply chains to build generative AI systems have enabled a paradigm of technology development that poses the risk of large-scale human rights abuses.”

Amnesty International investigated the models that power some of the most popular publicly available standalone generative AI tools, including OpenAI’s GPT-3, Google’s Gemini, Meta’s Llama and DeepSeek, as well as Midjourney and Stability AI’s Stable Diffusion.

Such systems rely on extracting information from billions of public online posts and images, often without the explicit consent of the individuals posting or creating them. Not only is this a deliberate invasion of privacy, but as the datasets powering AI models expand, it also amplifies the presence of hateful and discriminatory content in their output, along with negative stereotypes and biases, particularly along racial and gender lines.

Racial, gender, and cultural biases are consistent features of generative AI systems, largely because their training data is obtained from the web and therefore carries real-world biases that harm historically marginalized communities. Generative AI systems also pose a risk to the right to freedom of thought, as they can influence users’ thinking through predictive suggestions and shape how personal beliefs are formed. This is especially true for large models that rely on vast amounts of training data.

“These choices are not inevitable, and we must challenge the design choices made by companies that build generative AI systems that rely on training data, including personal data, that is extracted at scale without consent,” said Rikita Banerji.

“This is one of the most egregious examples of AI companies operating with disregard for human rights, and must be addressed urgently. If authorities act quickly to correct course, a different trajectory for technology development could be possible.”

Large environmental costs

As companies develop generative AI at ever greater scale and speed, infrastructure requirements and the associated environmental costs grow with them.

The advanced processing needs of larger models require more energy-hungry chips and larger data centers, which in turn require more energy and water to operate. Generative AI production often harms historically marginalized communities, because land and resources belonging to these communities are used to build data centers and meet processing requirements.

Google’s own 2024 Sustainability Report notes that the company’s greenhouse gas emissions have increased by a staggering 48% since 2019, driven by data center and supply chain emissions. Similarly, Microsoft’s emissions increased by 29% between 2020 and 2024, due to data centers running processes that support AI.

The resource-intensive nature of generative AI production has led communities from Cerrillos, Chile, and Queretaro, Mexico, to Arizona, USA, to resist data centers in regions already heavily affected by drought and power shortages.

As part of the research process, Amnesty International wrote to Google, OpenAI, Meta, Stability AI, Midjourney, and DeepSeek, giving them an opportunity to respond to the finding that their models rely on illegal web scraping, as well as to a number of related human rights concerns.

Amnesty International also wrote to Intel and VMware about the risks of discrimination, among other things, and to Google, Microsoft and Amazon about the environmental hazards associated with their generative AI systems and related infrastructure. At the time of publication, only Microsoft, Amazon, Intel, OpenAI, and Meta had responded to Amnesty International. A summary of their responses is included in the briefing.

Amnesty International is calling on states to ban standalone generative AI systems built using illegal web scraping. In this context, web scraping refers to the collection of large amounts of training data from the web. Companies must immediately stop illegal and non-consensual web scraping of personal data for AI training purposes, and states must hold companies accountable for human rights abuses linked to their design and business choices.

Background

This briefing provides a human rights analysis of the “data pipelines” that power generative AI products: the stages of data capture, analysis, and processing that are critical to the overall functioning of these systems. Specifically, it examines data collection, data processing, model scaling, the methods and sources of data output, and the expanding parameters and impact of design choices made with respect to training data for generative AI models.

Amnesty International defines standalone generative AI tools as products developed, deployed, and sold solely for their generative AI functionality, such as AI chatbots and image, video, audio, and text generators. The definition excludes products in which generative AI is an additional feature, as well as generative AI features within larger product suites, such as word processing software with optional generative AI capabilities.


