The meteoric rise of artificial intelligence may seem unstoppable, but it faces a shortage of training data.
“We already have a data shortage,” said Neema Raphael, chief data officer and head of data engineering at Goldman Sachs, on the bank's “Exchanges” podcast released Tuesday.
Raphael said the shortage could already be shaping how new AI systems are built.
He pointed to China's DeepSeek as an example, saying one hypothesis for its low development costs is that it was trained on the output of existing models rather than entirely new data.
“I think what's really interesting is how the previous model shapes what the next iteration of the world looks like,” Raphael said.
When the web is tapped out, developers turn to synthetic data such as machine-generated text, images, and code. That approach offers a virtually unlimited supply, but it risks flooding models with low-quality output, or AI “slop.”
However, Raphael said he doesn't expect the lack of fresh data to be a major constraint, partly because companies are sitting on untapped reserves of information.
“I think it's definitely interesting from the consumer world, the explosion of synthetic data. But from a company's perspective, I think there's still a lot of juice to be squeezed out of it,” he said.
This suggests the real frontier may not be the open internet but the proprietary datasets companies hold. From transaction flows to client interactions, firms like Goldman sit on information that could make AI tools far more valuable when used correctly.
Raphael's comments come as the industry grapples with the idea of “peak data,” a debate that has been building since ChatGPT's breakout three years ago.
In January, OpenAI co-founder Ilya Sutskever said at a conference that all the useful data online has already been used to train models, warning that the era of rapid AI development will “unquestionably end.”
The next frontier: proprietary data
For the businesses Raphael highlighted, the obstacle is not just finding more data but making that data usable.
“The challenge is to understand the data, understand the business context of the data and normalize it in a way that makes sense for the business to consume it,” he said.
Still, Raphael suggested that relying heavily on synthetic data raises deeper questions about AI's trajectory. “The interesting thing is, I think, people wonder whether there might be a creativity plateau,” he said.
He wondered what would happen if models kept training only on machine-generated content.
“If all the data is generated synthetically, how much human data can you incorporate?” he said.
“I think it's interesting to look at from a philosophical perspective,” he added.