Legal protections for user data are emerging, but the sheer volume of YouTube content makes ensuring compliance difficult, and even seemingly inconsequential uploads could be used to train AI, raising privacy concerns.
The promised artificial intelligence revolution needs data, and lots of it. OpenAI and Google have begun using YouTube videos to train their text-based AI models. But what does the YouTube archive actually contain? Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed a random sample of YouTube videos to learn more about the archive.
We published an 85-page paper on that dataset and launched a website, TubeStats, for researchers and journalists who want basic information about YouTube. Now we're digging deeper into some surprising findings to better understand how these anonymous videos could become part of powerful AI systems. We found that many of the YouTube videos were intended for personal use or small groups, and a significant proportion were created by children who appear to be under the age of 13.
YouTube: The tip of the iceberg
For most people, the YouTube experience is curated by an algorithm: up to 70% of the videos users watch are recommended to them by the site's algorithm. Recommended videos are typically popular content like influencer stunts, news clips, explainer videos, travel blogs, and video game reviews, while non-recommended content is left in obscurity.
While some YouTube content imitates popular creators or fits into established genres, much of it is personal: family celebrations, selfies set to music, homework assignments, out-of-context video game clips, kids dancing.
The lesser-known side of YouTube – most of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood. Uncovering this side of YouTube, and of social media in general, is difficult because big tech companies have become increasingly hostile to researchers. It turns out that many of YouTube's videos were never intended to be shared widely.
We documented thousands of short, personal videos with few views but high engagement (likes and comments), suggesting a small but engaged audience. These videos were clearly aimed at friends and family. This social use of YouTube contrasts with videos that attempt to maximize viewership, and it suggests another way to use YouTube: as a video-centric social network for small groups.
Other videos seem to be aimed at different kinds of small, fixed audiences: recordings of pandemic-era online classes, school board meetings and work meetings. These are not what most people would think of as social uses, but they similarly suggest that their creators expect a different kind of audience than do the creators of content people watch through recommendations.
Fueling the AI machine
With this broader understanding, we read The New York Times exposé and learned how OpenAI and Google have turned to YouTube in the race to find new troves of data to train large language models. YouTube's archive of transcripts makes a great dataset for text-based models. And, fueled in part by a vague response from OpenAI's Chief Technology Officer Mira Murati, there has been speculation that the videos themselves could be used to train text-to-video AI models, such as OpenAI's Sora.
The New York Times article raised concerns about YouTube's terms of service and, of course, the copyright issues that permeate much of the discussion about AI. But there's another question: How could anyone know what's actually in an archive of more than 14 billion videos uploaded by people all over the world? It's not entirely clear that Google knows, or could find out if it wanted to.
Kids as content creators
We were surprised by the number of videos that featured or appeared to have been made by children. YouTube requires uploaders to be at least 13, but we frequently saw kids who appeared to be much younger, dancing, singing, or playing video games. In preliminary studies, coders determined that nearly one-fifth of random videos with at least one face in them likely contained someone under 13. Videos that were clearly filmed with a parent or guardian's consent were not taken into account.
While our current sample size of 250 is relatively small (we are currently working on coding a larger sample), the results so far are consistent with what we've seen in the past. We don't mean to scold Google: age verification on the internet is notoriously difficult and problematic, and there's no way to determine whether these videos were uploaded with parental or guardian consent. But we do want to highlight what goes into the AI models of these big companies.
Small reach, big impact
It's tempting to think that OpenAI is using highly produced influencer videos or TV news shows posted to the platform to train its models. But previous studies of large language model training data have found that the most popular content is not necessarily the most influential for training AI models. A little-watched conversation between three friends may be far more linguistically valuable in training a chatbot's language model than a music video with millions of views.
Unfortunately, OpenAI and other AI companies are very opaque about their training materials – they don't specify what they take in and what they don't. In most cases, researchers can infer that there's something wrong with the training data from bias in the output of an AI system. But actually looking at the training data often reveals something to be concerned about.
For example, Human Rights Watch released a report on June 10, 2024, revealing that popular training datasets contained numerous photos of identifiable children. The history of self-regulation among big tech companies is rife with moving goalposts. OpenAI in particular has been notorious for asking forgiveness rather than permission, and has come under increasing criticism for putting profits above safety.
Concerns about using user-generated content to train AI models have typically centered on intellectual property, but there are also privacy issues. YouTube is a vast and unwieldy archive, impossible to review in its entirety. A subset of professionally produced videos could conceivably serve as an AI company's initial training corpus. But without strong policies in place, companies that ingest more than the popular tip of the iceberg are likely to include content that violates the Federal Trade Commission's Children's Online Privacy Protection Rule, which prohibits companies from collecting data from children under the age of 13 without notice.
Last year's executive order on AI and at least one promising proposal for comprehensive privacy legislation are signs that user data may become more legally protected in the US.
Have you unwittingly helped train ChatGPT?
The intent of someone uploading to YouTube is not as consistent or predictable as the intent of someone publishing a book, writing an article for a magazine, or exhibiting a painting in a gallery. But even if YouTube's algorithm ignores your upload and it only gets a few views, it could still be used to train models like ChatGPT or Gemini. As far as AI is concerned, your family reunion video may be just as important as one uploaded by influencer giant Mr. Beast or by CNN.
(The author is with the University of Massachusetts, Amherst.)
