AI models are choked with junk data

How we get from ChatGPT to humanoid robots depends on one of the most important yet least discussed bottlenecks in artificial intelligence: the quality of the data these systems learn from.

So far, the AI industrial complex has operated on the idea that feeding models more data produces smarter models. This worked brilliantly when researchers could simply scrape the internet to train language models at scale. But we are on the cusp of the next frontier in AI: physical AI and world models, systems that learn about and ultimately operate within the physical world. Think about the cognition required to navigate roads and traffic, fold laundry, or assist with complex medical surgeries. All of this requires something that simply cannot be downloaded: rich, multifaceted data from which these world models can learn.

A crisis is brewing that could have a major impact on the AI movement. If we fail to stem the glut of junk data, data that cannot advance model development, physical AI and world models may never reach their full potential.

A big part of the problem is the hunger for data to feed new and better models. That appetite has fueled a wave of multibillion-dollar AI data startups such as Scale AI, Surge AI, and Mercor. Yet satisfying it has also produced vast amounts of junk data that does not actually advance AI models at all.

Generating junk data is easy; the data needed for physical AI and world models takes far more time and effort. The physical world is so complex that training these models to understand it requires vastly more, and richer, data, and that data is hard to come by. Machine learning engineers often rely on simulation, spending hours virtually recreating real-world scenarios to produce the data that will ultimately train robots and self-driving cars. When AI models train on junk data instead, the result is poor performance, longer time to market, and unpredictable behavior.

For example, for a fully self-driving car to be considered safe, it needs a system that can handle all the unexpected variables a driver might encounter, such as a car traveling on the wrong side of the road, or glare that makes it hard to see a child about to run into the street. Junk data only makes it harder for such autonomous systems to separate what is typical from what is merely possible.

We are already seeing the junk data problem rear its ugly head. OpenAI has discontinued its AI video app Sora and reassigned the team to other departments. This was essentially a junk data problem: the underlying world models lacked a sufficient understanding of physics to make realistic predictions.

To realize AI's true potential, machine learning teams need tools and processes that keep junk data out of their workflows. That means investing in technology that analyzes, cleans, normalizes, and corrects training data. Extracting the valuable signal and separating it from the junk is how you train AI models on the right information.
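To make the idea concrete, here is a minimal Python sketch of the kind of filtering such a pipeline might perform on raw sensor data before training. The record fields, bounds, and thresholds are hypothetical illustrations, not any particular vendor's product or API:

```python
# A minimal junk-data filter sketch; fields and thresholds are hypothetical.
import hashlib
from dataclasses import dataclass

@dataclass
class SensorRecord:
    frame_id: str
    speed_mps: float   # reported vehicle speed, meters per second
    lidar_points: int  # points in the accompanying LiDAR sweep
    payload: bytes     # raw sensor payload

def clean(records: list[SensorRecord]) -> list[SensorRecord]:
    seen: set[str] = set()
    kept: list[SensorRecord] = []
    for r in records:
        # Drop exact duplicates by hashing the raw payload.
        digest = hashlib.sha256(r.payload).hexdigest()
        if digest in seen:
            continue
        # Drop physically implausible readings (illustrative bounds).
        if not 0.0 <= r.speed_mps <= 90.0:
            continue
        # Drop near-empty sweeps that carry no learnable signal.
        if r.lidar_points < 1000:
            continue
        seen.add(digest)
        kept.append(r)
    return kept
```

Production pipelines layer statistical checks, label auditing, and simulation-based validation on top of basic filters like these, but even this level of hygiene removes an entire class of junk before it ever reaches a model.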

The scaling hypothesis, that feeding AI systems ever more data will produce ever smarter systems, is breaking down. Data quality, not quantity, is now the constraint. The first companies and research institutes to recognize this will be the ones to build AI systems that actually work in the physical world.

The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.
