The meteoric rise of artificial intelligence may seem unstoppable, but it faces a shortage of training data.
“We already have a data shortage,” said Neema Raphael, chief data officer and head of data engineering at Goldman Sachs, on the bank's “Exchanges” podcast released Tuesday.
Raphael said the shortage could already be shaping how new AI systems are built.
He pointed to China's DeepSeek as an example, saying one hypothesis for its low development costs is that it was trained on the output of existing models rather than entirely new data.
“I think what's really interesting is how the previous model shapes what the next iteration of the world looks like,” Raphael said.
When the web is tapped out, developers turn to synthetic data such as machine-generated text, images, and code. That approach offers a virtually unlimited supply, but it risks flooding models with low-quality output, or AI “slop.”
However, Raphael said he doesn't expect the lack of fresh data to be a major constraint, partly because companies are sitting on untapped reserves of information.
“I think it's definitely interesting from the consumer world, the explosion of synthetic data. But from a company's perspective, I think there's still a lot of juice to be squeezed out of it,” he said.
This suggests the real frontier may not be the open internet but the proprietary datasets companies hold. From transaction flows to client interactions, firms like Goldman sit on information that could make AI tools far more valuable when used correctly.
Raphael's comments come as the industry grapples with the idea of “peak data,” a debate that has been building since ChatGPT's breakout three years ago.
In January, OpenAI co-founder Ilya Sutskever said at a conference that all the useful data online has already been used to train models, warning that the era of rapid AI development will “unquestionably end.”
The next frontier: proprietary data
For the businesses Raphael highlighted, the obstacle is not just finding more data but making that data usable.
“The challenge is to understand the data, understand the business context of the data and normalize it in a way that makes sense for the business to consume it,” he said.
Still, Raphael suggested that relying heavily on synthetic data raises deeper questions about AI's trajectory. “The interesting thing is, I think, people wonder whether there might be a creativity plateau,” he said.
He wondered what would happen if models kept training only on machine-generated content.
“If all the data is generated synthetically, how much human data can you incorporate?” he said.
“I think it's interesting to look at from a philosophical perspective,” he added.