Meta's AI chatbot says it was trained on millions of YouTube videos

The Meta AI chatbot is more willing than Meta itself to share details about the data it was trained on.

Meta, formerly known as Facebook, first launched Meta AI last year as part of a larger push into the field of generative AI in response to growing public interest sparked by the release of OpenAI's ChatGPT in late 2022. The company expanded Meta AI in April to offer chat and image generator features across all of its apps, including Instagram and WhatsApp. However, the company has not revealed much about how its chatbot was trained.

When Business Insider asked Meta AI a series of questions about what data it was trained on and how Meta gets its data, the chatbot gave some interesting answers.

Meta AI told Business Insider that it was trained on a large dataset of YouTube video transcripts, and that Meta has its own web scraper bot, called “MSAE” (short for Meta Scraping and Extraction), that scrapes large amounts of data from the web to train its AI models.

Meta had not previously disclosed the existence of the scraper. YouTube's terms of service prohibit using bots or scrapers to collect its data, as well as using such data without permission, and OpenAI has recently come under scrutiny for such practices.

A Meta spokesperson didn't deny any of Meta AI's responses about the scrapers or the training data — BI provided Meta with the prompts it used — but instead suggested that Meta AI might be wrong.

“As with all generative AI systems, the models may return inaccurate or inappropriate outputs,” the spokesperson said. “We will continue to improve these features as they evolve and more people share their feedback.”

The spokesperson also said, “Like others in the industry, we use web crawlers,” but did not specifically mention the MSAE bot cited by Meta AI.

“Generative AI models require large amounts of data to be effectively trained, so we use a combination of different sources for training, including publicly available information online and annotated data,” the spokesperson added.

Meta AI initially said that its training data included a third-party dataset of transcripts of 3.7 million YouTube videos, and clarified that the chatbot “does not use web scraper bots to scrape YouTube videos directly.”

In response to further questions about the YouTube training data, Meta AI said that the training data also includes an even larger dataset of transcriptions of 6 million YouTube videos, also compiled by a third party. It added that the training data also includes two additional sets of YouTube transcriptions or subtitles, one with 1.5 million videos and the other with 2.5 million videos, as well as a dataset of transcriptions of 2,500 TED Talks posted to YouTube. All of these datasets were compiled by third parties, according to Meta AI.

Meta's chatbot said it “takes steps to avoid collecting copyrighted data.” Using Meta AI makes it clear that the chatbot is scraping the web to some extent. Results for some queries cited sources such as NBC News, CNN, and The Financial Times. Meta AI often did not include sources in its answers unless specifically asked.

As reported by BI, Meta is currently considering new paid deals with media publishers to access more AI training data, which could improve Meta AI's results.

Meta AI also said it respects robots.txt, a file of directives that website owners can use to ostensibly stop their content from being scraped by bots that use it for AI training.
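As a rough illustration of what respecting robots.txt involves, the sketch below shows how a well-behaved crawler could check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `is_allowed` helper, and the example.com URL are hypothetical and chosen only for illustration; nothing here reflects how Meta's own crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption for illustration):
# it blocks the crawler named GPTBot and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/article"
    for bot in ("GPTBot", "CCBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

The check is voluntary: robots.txt only states the site owner's wishes, and it works only if the crawler chooses to run a check like this and obey the result.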

Meta developed the chatbot using its large language model, Llama. Llama 3 was released in April alongside the expansion of Meta AI, but Meta has not yet published a research paper on the new model or released the training data used. Meta said in a blog post that the massive set of 15 trillion tokens (units of text) used to train Llama 3 came from “publicly available sources.”

Web scrapers such as OpenAI's GPTBot, Google's Googlebot, and Common Crawl's CCBot can effectively extract any content accessible on the web. The content is stored in large datasets that are fed into LLMs and is often reproduced by generative AI tools such as ChatGPT.

Several ongoing lawsuits concern the free use of copyrighted content by the world's largest tech companies, and the U.S. Copyright Office is expected to release new guidance on acceptable use for AI companies later this year.

Are you a Meta employee or someone with a tip or insight to share? Contact Kali Hays by email at tech@meta.com or on the secure messaging app Signal at 949-280-0267. Reach out using a non-work device.
