The promised artificial intelligence revolution needs data. Lots of data. OpenAI and Google have started using YouTube videos to train their text-based AI models. But what actually is contained in the YouTube archive?
A team of digital media researchers at the University of Massachusetts Amherst dug deeper into the archive by collecting and analyzing a random sample of YouTube videos, publishing an 85-page paper about the dataset and launching a website called TubeStats for researchers and journalists who want basic information about YouTube.
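Random sampling is harder than it sounds when the population is every video on YouTube. A minimal sketch of the general idea, guessing random video IDs and seeing which ones exist (the `hit_rate` arithmetic below is an illustration based on the 14.8 billion figure, not the researchers' actual pipeline):

```python
import random
import string

# YouTube video IDs are 11 characters drawn from a 64-character
# alphabet: letters, digits, "-" and "_".
ID_ALPHABET = string.ascii_letters + string.digits + "-_"

def random_video_id(rng=random):
    """Generate one candidate video ID uniformly at random."""
    return "".join(rng.choice(ID_ALPHABET) for _ in range(11))

# With ~14.8 billion videos spread across an ID space of roughly
# 64**11 (~7.4e19) possibilities, only about 1 in 5 billion random
# candidates is a real video, which is why this style of sampling
# requires checking enormous numbers of guesses.
hit_rate = 14.8e9 / 64**11
print(f"~1 in {1 / hit_rate:,.0f} candidate IDs is a real video")
```

Because hits are so rare, an unbiased sample of even a few thousand videos represents an enormous amount of automated guessing, which is part of why so little is known about the platform's long tail.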
Now, we're taking a closer look at some surprising findings to better understand how these largely unseen videos could become part of powerful AI systems. We found that many YouTube videos are intended for personal use or small groups, and a significant proportion were created by children who appear to be under the age of 13.
YouTube: The tip of the iceberg
For most people, the YouTube experience is algorithmically curated: up to 70% of the videos users watch are recommended to them by the site. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs, and video game reviews, while non-recommended content languishes in obscurity.
While some YouTube content imitates popular creators or fits into existing genres, much of it is personal: family celebrations, selfies set to music, homework assignments, out-of-context video game clips, kids dancing. This hidden side of YouTube, the vast majority of the estimated 14.8 billion videos ever created and uploaded to the platform, is little understood.
This aspect of YouTube, and social media in general, is difficult to uncover as big tech companies have become increasingly hostile towards researchers.
We found that many videos on YouTube were not intended to be shared widely. We recorded thousands of short, personal videos with low views but high engagement (likes and comments), indicating a small but engaged audience. These were clearly intended for a small audience of friends and family. This social use of YouTube contrasts with videos that attempt to maximize viewership, and suggests a different use case for YouTube: as a video-centric social network for small groups.
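One simple way to operationalize "low views but high engagement" is the ratio of likes and comments to views. The sketch below is illustrative; the thresholds are hypothetical, not the study's actual coding criteria:

```python
def looks_like_small_audience(views, likes, comments,
                              max_views=200, min_engagement_rate=0.05):
    """Flag videos with few views but many likes and comments per
    view, a hint that most viewers personally know the uploader.
    Thresholds are illustrative, not the study's actual criteria."""
    if views == 0:
        return False
    engagement_rate = (likes + comments) / views
    return views <= max_views and engagement_rate >= min_engagement_rate

# A family video: 45 views, 12 likes, 6 comments
print(looks_like_small_audience(45, 12, 6))                  # True
# A viral clip: 2 million views, 30,000 likes
print(looks_like_small_audience(2_000_000, 30_000, 1_500))   # False
```

A video with 45 views but 18 interactions has an engagement rate of 40%, orders of magnitude above what viral content typically sees, which is the signature of a friends-and-family audience.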
Other videos seem to be aimed at different kinds of fixed, small audiences: recordings of pandemic-era online classes, school board meetings, work meetings. These are not what most people would think of as social uses, but they similarly suggest that their creators expect a different audience than do the creators of the content people watch through recommendations.
Fueling the AI Machine
With this broad understanding, we read a New York Times article reporting how OpenAI and Google have turned to YouTube in their race to find new troves of data to train large language models: YouTube's archive of transcripts provides an excellent dataset for text-based models.
There's also speculation, due in part to vague answers from OpenAI's Chief Technology Officer Mira Murati, that the videos themselves could be used to train text-to-video models such as OpenAI's Sora.
The New York Times article raised concerns about YouTube's terms of service and, of course, the copyright issues that permeate much of the discussion about AI. But there's another question: Who actually knows what's in YouTube's archive of more than 14 billion videos uploaded by people all over the world? It's not clear that Google knows, or could find out if it wanted to.
Kids as content creators
I was struck by the sheer number of videos that featured, or appeared to have been made by, children. YouTube requires uploaders to be over 13, but I frequently saw kids who appeared to be much younger than that, dancing, singing, and playing video games.
In preliminary studies, coders determined that roughly one in five random videos showing at least one person's face likely contained someone under the age of 13, not counting videos that were clearly filmed with a parent or guardian's consent.
While our current sample size of 250 is relatively small (we are currently working on coding a larger sample), the results so far are consistent with what we've seen in the past. I don't mean to scold Google — age verification on the Internet is notoriously difficult and problematic, and there's no way to determine whether these videos were uploaded with parental or guardian consent — but I do want to highlight what goes into the AI models of these big companies.
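To put the "one in five" figure from a 250-video sample in perspective, here is a rough 95% confidence interval using the normal approximation for a proportion. This is an illustrative back-of-the-envelope calculation that treats all 250 coded videos as an independent random sample, which simplifies the study's actual design:

```python
import math

# Observed: roughly 1 in 5 of 250 coded videos likely featured
# someone under 13. Illustrative assumption: n = 250 is the base.
n = 250
p_hat = 0.20

# Normal-approximation 95% confidence interval for a proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: {low:.1%} to {high:.1%}")  # roughly 15% to 25%
```

Even at the low end of that interval, children would appear in a substantial share of the videos an indiscriminate scrape would sweep up.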
Small reach, big impact
While it's tempting to assume that OpenAI is training on the highly produced influencer videos or TV news broadcasts posted to the platform, previous research on large language model training data has found that the most popular content is not necessarily the most influential in training. A little-watched conversation between three friends may be far more linguistically valuable for training a chatbot than a music video with millions of views.
Unfortunately, OpenAI and other AI companies are very opaque about their training materials. They don't specify what they take in and what they don't. In most cases, researchers can infer problems with the training data from bias in the AI system's output. But a peek at the training data often gives cause for concern. For example, Human Rights Watch published a report on June 10, 2024, showing that popular training datasets contain many identifiable photos of children.
The history of self-regulation among big tech companies has been one of constantly shifting goals, with OpenAI in particular notorious for asking for forgiveness rather than permission, drawing criticism for putting profits above safety.
Concerns about using user-generated content to train AI models have typically centered on intellectual property, but there are also privacy issues: YouTube is a vast and unwieldy archive that is impossible to fully review.
Models trained on a subset of professionally produced videos could become an AI company's initial training corpus. But without strong policies, companies that pull in more than the tip of the iceberg of popular videos will likely end up including content that violates the Federal Trade Commission's Children's Online Privacy Protection Rule, which prohibits companies from collecting data from children under 13 without parental notice and consent.
Last year's executive order on AI and at least one promising proposal for comprehensive privacy legislation are signs that user data may become more legally protected in the US.
Have you unwittingly helped train ChatGPT?
The intentions of someone uploading to YouTube are not as consistent or predictable as someone publishing a book, writing an article for a magazine, or exhibiting a painting in a gallery, but even if YouTube's algorithm ignores your upload and it doesn't get more than a few views, your video may still be used to train models like ChatGPT or Gemini.
As far as AI is concerned, your family reunion video may be just as important as a video uploaded by the influencer giant MrBeast or by CNN.
Ryan McGrady is a senior research fellow at the Digital Public Infrastructure Initiative at the University of Massachusetts Amherst. Ethan Zuckerman is an associate professor of public policy, communication and information at the University of Massachusetts Amherst.
This article is republished from The Conversation under a Creative Commons license. Read the original article.
