The hidden high cost of training AI on AI



Today’s AI models are falling victim to a dangerous vulnerability known as data poisoning. But this data poisoning crisis is not driven mainly, or even mostly, by hackers and adversaries; much of it is self-inflicted. As companies race to implement AI across their workflows, AI-generated summaries, emails, code, and reports are quietly and rapidly flooding internal databases. Data poisoning occurs when that synthetic content is ingested into the training pipelines used to build and fine-tune an organization’s next-generation AI models.

For many organizations, the AI transformations they have invested in are now actively cannibalizing the AI future they hope for.

“What happens is the signal-to-noise ratio collapses,” said Daniel Kimber, CEO of Brainfish AI, an Australian-founded tech startup focused on building AI agents. “Native human reasoning, edge-case knowledge, and nuanced institutional context are diluted by synthetic content that abstracts from what is already real. When you train and fine-tune on that data, you’re not learning from experience; you’re learning from copies of copies.”


The end result of data poisoning is a risk many CIOs may already be aware of: model degradation. Reducing the problem to simple “model deterioration,” however, obscures what is really at stake: business outcomes. Model degradation becomes a business problem when decisions, whether made by machines or humans, rely on the distorted analysis or output a poisoned model produces.

“Loss of accuracy is not just degradation; it’s distortion,” says Zbyněk Sopuch, CTO at Safetica, a data loss prevention and insider risk management provider. “Problems don’t usually manifest in a linear fashion, but as quiet, intertwined failures. The loss of precision, amplified by feedback loops, leads to flawed decisions at scale. That means we’ve moved from a model problem to a business problem.”

Data poisoning can also raise a surprisingly wide variety of legal, compliance, and organizational knowledge issues. According to a 2024 study of AI models published on Nature.com, the data degradation it causes is irreversible. Not only that, the process also flattens out “the nuanced and rare institutional knowledge at the edges of the data distribution,” according to Dan Ivtsan, senior director of AI products at Steno, a technology-enabled court reporting and litigation support service.

“The tricky part is that fluency is maintained even as factual accuracy erodes, so standard benchmarks miss it completely,” he added.


Beyond loss of precision, organizations can also face amplified bias, driven by the loss of data representing minority groups, and homogenization of output, a convergence toward a bland average.

“In the legal AI products I’m developing, that drift could mean hallucinated citations or inaccurate medical timelines, which is real medical malpractice exposure,” Ivtsan said. “A proven precaution: always accumulate real data in parallel with synthetic data. Never replace it.”
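In practice, “accumulate, never replace” can be enforced mechanically whenever a training set is assembled. The Python sketch below is illustrative only, using a hypothetical record format and field names rather than any vendor’s actual pipeline: every human-generated record is retained, and synthetic records are admitted only up to a capped share of the real corpus.

```python
# Illustrative sketch: keep all real records, cap the synthetic share.
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    source: str  # "human" or "synthetic"; provenance must be tracked upstream

def build_training_set(real: list[Record], synthetic: list[Record],
                       max_synthetic_ratio: float = 0.5) -> list[Record]:
    """Retain every real record; admit synthetic ones only up to a capped share."""
    if not real:
        raise ValueError("refusing to build a training set with no real data")
    budget = int(len(real) * max_synthetic_ratio)  # synthetic supplements, never substitutes
    return real + synthetic[:budget]
```

Because real records keep accumulating across training rounds and are never dropped in favor of synthetic ones, the mix stays anchored to human-generated data.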


The dangers of recursive feedback loops

Data poisoning reduces the value of the original data, explains Ryoji Morii, founder of Insynergy.io, a Tokyo-based company specializing in AI governance and AI decision architecture. “Data is treated as a disposable resource and derived values are used instead. This pollutes the training data and reduces the relevance of the raw data,” Morii said.

This problem can be blamed on companies’ need for speed, human instinct for what’s easiest, or simply a misunderstanding of how training and fine-tuning AI actually works. Regardless of the reason or intention, the damage is undeniable.

“What is being described is ‘data poisoning in the name of convenience.’ While this is not malicious, it will cause long-term damage,” Sopuch said.


Assigning responsibility is not as important as being able to recognize the danger now.

At first, nothing appears to be wrong, according to Chetan Saundankar, CEO of Coditation, an India-based company that builds and deploys AI systems for enterprise customers. But this is the calm before the storm.

“After a few weeks or months, the model starts to get the problem wrong in ways that are difficult to detect, because the answer still looks perfectly reasonable,” he said. “Code tools begin to suggest patterns that work but have security holes. Summarization models begin to sound authoritative but remove qualifications and nuances that made the original document useful.”

This problem pervades everything important to running a successful and profitable organization. Dirk Alshuth, chief marketing officer at Luxembourg-based cloud management platform Emma, explained that small mistakes, such as misjudging resource allocation or mislabeling usage patterns, can quickly snowball. Ultimately, these errors can result in increased costs and decreased performance over time. “Feedback loops make the situation even worse, as the same flawed output is logged and reused, potentially reinforcing mistakes,” he added.

In cloud and infrastructure environments in particular, such small inaccuracies can silently increase costs or degrade performance over time, and the cumulative impact on the business can be enormous, Alshuth said.

Another problem he has noticed is a loss of adaptability. “AI trained on AI tends to struggle when something new or unexpected happens, because it never sees the actual variability,” he said.

“The best precaution is to keep training data relevant to actual system behavior, using live telemetry, logs, and human-reviewed decisions as sources of truth, and treating AI-generated outputs as temporary rather than fundamental,” Alshuth added.
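One way to read that advice, sketched below under assumed field names (“origin”, “human_reviewed”, “created_at”) rather than any particular product’s schema, is to tag every record with its provenance when it is written and filter at training time, admitting AI-generated content only if it has been human-reviewed and is still fresh.

```python
# Hedged sketch: treat AI-generated output as temporary, not foundational.
from datetime import datetime, timedelta, timezone

def is_source_of_truth(item: dict, max_ai_age_days: int = 90) -> bool:
    """Admit records grounded in real system behavior or human review."""
    if item["origin"] in ("telemetry", "log", "human_decision"):
        return True
    # AI-generated content qualifies only after human review, and only while fresh.
    if item["origin"] == "ai_generated" and item.get("human_reviewed"):
        age = datetime.now(timezone.utc) - item["created_at"]
        return age < timedelta(days=max_ai_age_days)
    return False

knowledge_base = [
    {"origin": "telemetry", "created_at": datetime.now(timezone.utc)},
    {"origin": "ai_generated", "human_reviewed": False,
     "created_at": datetime.now(timezone.utc)},
]
training_corpus = [r for r in knowledge_base if is_source_of_truth(r)]
# Only the telemetry record survives; the unreviewed AI summary is filtered out.
```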

Impending model collapse

CIOs need to realize that data poisoning issues don’t end with model degradation. Training on AI-generated content can lead to “model collapse,” in which the AI system eventually malfunctions completely. Effectively, the AI investment becomes a write-off: a loss that occurs when model, data, or output degradation renders a project useless beyond the point of repair.

“Model collapse refers to the degradation that occurs when a model is trained repeatedly on output from other models. Over time, the system becomes more repetitive, loses nuance, and becomes less representative of the real world,” said Ori Ostertag, president of growth platforms and AI at PAR Technology, a unified commerce platform provider for restaurants, convenience stores, and fuel retailers.

Even for organizations that have adopted vendor AI solutions, the contamination can still start close to home. “Conversations about AI data pollution tend to focus on foundation model training [meaning what OpenAI and Google train],” Kimber said. “But the more pressing issue for most organizations is what’s happening one layer below, in their own knowledge infrastructure. Every company is now functionally a model trainer.”

Salvaging models and building in protections

The first step in solving data poisoning problems is to stop them from getting worse. Fortunately, there are ways to restore performance during or after a model collapse, but they require considerable effort. Prevention is always preferable, but when collapse does occur, the remedy is to retrain on clean data to recover performance, Ivtsan said.

Collapse can be avoided if real data is accumulated alongside synthetic data rather than replaced by it, according to a paper by Gerstglasser et al. And even incomplete external verification can stabilize the trajectory, according to another paper by Yi et al.

In this context, “incomplete” external verification does not mean relying on verification sources or information that may be flawed or inaccurate. It means using methods such as spot checks, subject matter expert reviews, or informed human judgment. Although these methods fall short of exhaustive fact-checking, they remain highly accurate, and targeted verification, applied at scale, overcomes both zero oversight and the impracticality of fact-checking everything.
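A spot-check workflow of this kind can be approximated with a small sampling routine. The sketch below is an assumption-laden illustration, not a feature of any product mentioned here: it draws a small random slice of records for subject matter expert review and extrapolates an error rate from the reviewers’ verdicts.

```python
# Illustrative sketch of "incomplete" external verification via spot checks.
import random

def sample_for_review(records: list[dict], rate: float = 0.05, seed: int = 0) -> list[dict]:
    """Pick roughly `rate` of the records for human spot-checking."""
    if not records:
        return []
    k = max(1, int(len(records) * rate))
    return random.Random(seed).sample(records, k)

def estimated_error_rate(reviewed: list[dict]) -> float:
    """Reviewers mark each sampled record 'ok' or 'flawed'; extrapolate the rate."""
    flawed = sum(1 for r in reviewed if r.get("verdict") == "flawed")
    return flawed / len(reviewed) if reviewed else 0.0
```

Even at a 5% sampling rate, a rising estimated error rate is an early warning that the corpus is drifting, long before exhaustive fact-checking would be feasible.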

It’s best to prevent that from happening if possible.

“The way to prevent that is to design human-machine feedback loops. The most powerful systems are iterative systems, from human to AI and back to human, where the output is continually shaped, challenged, and refined,” explains Kaare Wesnes, head of innovation at Ogilvy North America, which helps Fortune Global 500 companies around the world build brands.

In other words, “the strongest system is not just AI; it’s the human-machine loop,” Wesnes says.

The key idea is to remember that AI is only as good as the data, and act accordingly.

“Companies need to protect data integrity, and that means prioritizing high-quality, human-generated input, clearly separating synthetic data from real data, and continuously reintroducing fresh real-world signals into the system,” Wesnes said.




