Apple, Nvidia, Anthropic, and Salesforce have all been caught using YouTube data to build their AI models.
An investigation by Proof News, co-published with Wired, found that YouTube subtitle data was ripped from the video-sharing platform without permission and used to train AI models. The data does not include video imagery.
The data was used to train large language models (LLMs), like the ones behind ChatGPT, but it raises the wider issue of tech companies pilfering YouTube data to train models.
YouTube has expressly stated that such usage of videos to train AI is an infraction of the platform’s terms of service (ToS). But it is widely acknowledged that YouTube is a data goldmine for generative AI at a time when the race for text-to-video models is hotting up.
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids “fault” here because they’re not the ones scraping
But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024
Subtitles from roughly 180,000 YouTube videos were found in the dataset used by Apple et al. The dataset, called The Pile, was compiled by a nonprofit. It contains not only YouTube data but also Wikipedia articles, books, and Enron emails.
“The Pile includes a very small subset of YouTube subtitles,” Jennifer Martinez, a spokesperson for Anthropic, tells Proof News.
“YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.”
Apple, Nvidia, and others haven’t commented. Neither has YouTube.
Nobody Wants to Talk About Training Data
After getting burned early on, tech firms do not want to talk about where they get the training data to build generative AI models.
With OpenAI’s video generator Sora on the horizon, CTO Mira Murati has repeatedly refused to reveal the training data for the much-hyped app.
“I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she told The Wall Street Journal in March.
Google CEO Sundar Pichai told The Verge that using video content from the platform, including subtitles, is a violation of ToS.
“We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product, so that’s how I felt about it,” Pichai said.
Image credits: Header photo licensed via Depositphotos.