Since OpenAI, the artificial intelligence company behind DALL-E and ChatGPT, unveiled its new AI video generator, Sora, there have been questions and controversy surrounding how the Microsoft-backed company trained its new video model.
In an interview with the Wall Street Journal in March, OpenAI’s CTO (and, briefly, its interim CEO), Mira Murati, refused to discuss Sora’s training data set in any detail or shed light on where OpenAI might have pilfered it from. Murati stuck to her guns, insisting Sora was trained using “publicly available” data. Of course, what “publicly available” means to an AI company can differ drastically from what it means to everyone else.
The opacity was a bad look for Murati and OpenAI, and it created fresh controversy for a company that is no stranger to public relations mishaps.
OpenAI got yet another opportunity in the limelight to live up to its name and be open about how it trains its AI. And yet, that isn’t what happened. Bloomberg’s Shirin Ghaffary chatted with Brad Lightcap, OpenAI’s COO, for nearly 20 minutes on stage at Bloomberg Tech in San Francisco, covering various topics, including how Sora was trained. That awkward exchange starts just after the 15:30 mark of the video below.
While Lightcap had ample opportunity to tout the business promise of OpenAI, no doubt appealing to the company’s influential investors, for many people the concern is less about the commercial viability of OpenAI’s tools and platforms than about how the company built them.
As a very brief aside: AI generative models must be trained, whether they create still images or moving ones. The training process, which largely determines how capable a model ends up being, is extensive and requires enormous amounts of input data. That input data must be actual images or, in the case of Sora, actual videos.
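To make that point concrete, below is a deliberately toy sketch of a training loop in PyTorch. This is emphatically not OpenAI’s pipeline; the model architecture, tensor sizes, and loss function are all placeholder assumptions. It illustrates only one thing: the optimization step has nothing to learn from unless example data (here, random tensors standing in for real video clips) is fed in.

```python
# A minimal sketch of a training loop (hypothetical, not OpenAI's method).
# The "videos" are random tensors standing in for real training footage.
import torch
from torch import nn

# Stand-in data: 8 clips, 4 frames each, 3-channel 16x16 pixels (made-up sizes).
fake_videos = torch.rand(8, 4, 3, 16, 16)

# A toy autoencoder-style model; real video generators are vastly larger.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(4 * 3 * 16 * 16, 64),
    nn.ReLU(),
    nn.Linear(64, 4 * 3 * 16 * 16),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    # The model only improves by comparing its output against input data;
    # without training examples, there is no gradient signal at all.
    reconstruction = model(fake_videos).view_as(fake_videos)
    loss = loss_fn(reconstruction, fake_videos)
    loss.backward()
    optimizer.step()
```

The question hanging over Sora is simply what filled the role of `fake_videos` at OpenAI-scale, and whether its owners ever agreed to that.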
After Murati struggled in March and refused to explain where OpenAI got its data, it came out that OpenAI pulled at least some data from Shutterstock, following an agreement the two companies entered into last year. AI companies sure do love stock image platforms. However, OpenAI has yet to explain where all its training data came from. Just because some of it came from Shutterstock through an agreement doesn’t mean the entire dataset was sourced there.
Bringing it back to OpenAI COO Brad Lightcap in San Francisco last night, OpenAI yet again failed to answer the remarkably simple question of how Sora was trained.
“Speaking of Sora, there was a lot of conversation, as I’m sure you’ve seen, about what training data was used to train [Sora],” Ghaffary starts. “Can you say, and clear up once and for all, whether Sora was trained on YouTube data?”
“Yeah, I mean, look, the conversation around data is really important. We obviously need to know, kind of where data comes from,” Lightcap says. “We just put out a post this week about this exact topic, which is basically that there needs to be a content ID system for AI that lets creators understand, as they create stuff, where it’s going, who’s training on it, being able to opt in and out of training, opt in and out of use.”
The post in question is a lengthy exercise in vagueness about public conversations concerning how AI models should behave, and it has little, if anything, to do with potentially stolen data.
“Conversely, also on the other side of that, being able to actively allow your content to be put in a model or to be accessed by a model — because there may be like this other economic opportunity on the other end of this, and that’s something we’re exploring too. Which is, how do you actually go create an entirely different social contract with the web, with creators, with publishers, where, as these models go off in the world and do things that are useful, create value, to the extent they’re able to reference and incorporate content from the web, there should be some sort of way that people can kind of get benefit from that,” Lightcap explains.
“So, yeah, we’re looking at this problem. It’s really hard. We don’t have all the answers yet. Maybe by 2026. If you have any ideas, we’ll take ’em. But it is a hard one.”
The sprawling response is something, but the question was whether OpenAI used YouTube data to train Sora. Yet again, to the chagrin of the artists who have uploaded their content to YouTube, OpenAI has opted for silence shrouded in a big heap of words.
“So no answer on the YouTube [topic] for now,” says an exasperated Ghaffary.
PetaPixel’s Take
Lightcap and OpenAI’s silence is deafening. From a legal perspective, it makes sense for a company to refuse to answer a straightforward question about whether it illegally used data to train its AI model. This is the tech company equivalent of pleading the Fifth. OpenAI’s lawyers aren’t stupid, and Lightcap, if nothing else, managed to stick generally to the script and not open the company up to additional liability.
However, you can only refuse to say something so many times before what you aren’t saying becomes crystal clear anyway. Lightcap’s obvious discomfort when trying to navigate questions about how OpenAI trained Sora speaks volumes, too.
His meandering response, full of pregnant pauses and bursting at the seams with filler words, reeks of carefully orchestrated misdirection.
Image credits: Photos licensed via Depositphotos.