In this episode of GZERO AI, Taylor Owen, host of the podcast “Machines Like Us,” examines the scale and impact of the historic data war taking place in the field of AI. According to researcher Kate Crawford, AI is the largest superstructure humanity has ever built, requiring massive amounts of human labor, natural resources, and data. But how are tech giants like Meta and Google amassing this data?
AI researcher Kate Crawford recently told me that she thinks of AI as the largest superstructure humanity has ever built. This is because of the enormous human labor that goes into building it, the physical infrastructure required to run these AI systems, the natural resources, energy, and water that this infrastructure consumes, and of course the sheer volume of data required to build state-of-the-art models. It is becoming increasingly clear that we are in the middle of a historic land grab for all of this data — all the data humanity has ever created. So where is this data coming from, and how are these companies accessing it? First, they are obviously scraping the public internet. It is fair to say that if anything you have done is publicly available on the internet, it is in the training data of at least one of these models.
But these scrapes likely also contain copyrighted data and data that is not actually public, potentially including paywalled content, as we may soon learn as The New York Times’ lawsuit against OpenAI works its way through the courts. These companies are also scraping each other’s data. The New York Times reported that Google knew OpenAI was scraping YouTube but did not make this public or push back, because Google itself was scraping the entirety of YouTube. Second, all of these companies buy or license data. This includes licensing news content, entering into deals with publishers, buying data from data brokers, and acquiring companies that sit on rich datasets. For example, Meta considered acquiring the publisher Simon & Schuster just to get access to its copyrighted books for training its large language models.
Companies that already hold rich datasets clearly have an advantage here, Meta and Google in particular. Meta uses all public data entered into its systems, and it says that even people who do not use its products may have data in those systems, whether purchased from third parties or captured, for example, simply by appearing in an Instagram photo. Google says it uses all public data on its platforms. This means, for example, that Google Docs set to public access can end up in the training dataset. And these companies get their data in creative ways, to say the least. Meta trained a large language model on a dataset called Books3, which contains more than 170,000 pirated, copyrighted books. So what does this mean for us as citizens and internet users?
One thing is clear: you cannot meaningfully opt out of this data collection and use. The opt-out tools Meta offers are hidden and complicated to use, and your data will not be removed from the dataset unless you can prove it was used to train Meta’s AI systems. That is not the kind of user control you would expect in a democratic society. So it is clear that we need to do three things. First, we need journalism. This is exactly what investigative journalism is for: holding society’s powerful actors, both governments and corporations, accountable. Journalism needs to dig deep into who is collecting what data, how these models are trained, and how they are built on data collected from our lives and online experiences. Second, the litigation needs to work its way through the system, and the discovery that comes with it should be revealing. The New York Times case is just one of many lawsuits against OpenAI, but it should make clear whether paywalled journalism is lurking in the training data of these AI systems. And finally, there is no doubt that we need regulation to ensure transparency and accountability in the data collection that drives AI.
For example, Meta recently announced that it planned to use data collected from EU citizens to train its large language models. Shortly after the Irish Data Protection Commission objected, the company announced it would suspend the effort. That's why we need regulation: people who live in countries and regions with strong data protection rules and AI transparency regimes will ultimately be better protected. I'm Taylor Owen. Thanks for watching.
