Innovative detection method makes AI smarter by cleaning up bad data before it learns

Machine Learning



Credit: Unsplash/CC0 Public Domain

In the world of machine learning and artificial intelligence, clean data is everything. Even a few falsely labeled examples, known as label noise, can derail a model's performance, especially for the Support Vector Machine (SVM), which relies on a small set of critical data points to make decisions.

SVMs are widely used machine learning algorithms, applied to everything from image and speech recognition to medical diagnosis and text classification. These models work by finding the boundary that best separates data from different categories. They rely on a small but important subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundary is flawed and performance on real data degrades.
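To see why a single bad label matters, here is a toy sketch (the dataset and the large `C` value are illustrative, not from the paper): a linear SVM on six 1-D points places its boundary midway between the closest opposite-class points, its support vectors, so flipping the label of just one of them moves the boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Six linearly separable points; the closest opposite-class points
# (x=2 and x=4) become the support vectors, so the boundary sits near x=3.
X = np.array([[0.0], [1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clean = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
print(clean.predict([[2.5]]))                  # -> [-1], x=2.5 is on the -1 side

# Mislabel the support vector at x=2: the boundary shifts to about x=1.5,
# and the same query point now lands on the +1 side.
y_noisy = y.copy()
y_noisy[2] = 1
noisy = SVC(kernel="linear", C=1e6).fit(X, y_noisy)
print(noisy.predict([[2.5]]))                  # -> [1]
```

Flipping a non-support-vector label (say, x=0) would barely move the boundary; it is precisely the support vectors whose labels the model cannot afford to get wrong.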

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within Florida Atlantic University's College of Engineering and Computer Science, together with collaborators, has developed an innovative method to automatically detect and remove mislabeled examples before models are trained.

Before the AI begins learning, the researchers use mathematical techniques to clean the data, looking for strange or unusual examples that do not fit their class. These "outliers" are removed or flagged so that the AI gets high-quality information from the start. The paper is published in IEEE Transactions on Neural Networks and Learning Systems.

"SVM is one of the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering," said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor in the Department of Electrical Engineering and Computer Science within FAU's College of Engineering and Computer Science, and a member of CA-AI and FAU's Institute for Sensing and Embedded Network Systems Engineering (I-SENSE).

"What makes them particularly effective is also what makes them vulnerable: because they rely on a small number of key data points, called support vectors, to draw the line between classes, even one mislabeled point can distort the model's entire understanding. For example, a malignant tumor mismarked as benign can skew the decision boundary.

"The consequences can be serious, whether it's a missed cancer diagnosis or a security system that fails to flag a threat. Our work protects AI models, including SVMs, from these hidden dangers by identifying and removing mislabeled cases before they cause harm."

The data-driven method "cleans" a training dataset using a mathematical approach called L1-norm principal component analysis. Unlike traditional methods, which often require manual parameter tuning or assumptions about the type of noise present, this method identifies and removes suspicious data points within each class based purely on how well they conform to the rest of the group.

"Data points that deviate significantly from the rest of their class, often because of label errors, are flagged and removed," Pados said. "Unlike many existing techniques, this process requires no manual tuning or user intervention, and it can be applied ahead of any AI model, making it scalable and practical."

The process is robust, efficient and completely hands-off, and it even handles the notoriously tricky task of rank selection, determining how many dimensions to keep during the analysis, without user input.
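The paper's exact algorithm is not reproduced here, but the idea can be sketched. For each class, fit a robust L1-norm principal component (below via Kwak's fixed-point iteration, a standard approximation) and flag points whose residual from that subspace is unusually large. The single-component subspace and the fixed residual quantile are illustrative simplifications; the paper's automatic rank selection is not shown.

```python
import numpy as np

def l1_principal_component(X, n_iter=100):
    """Approximate first L1-norm principal component of centered X via
    the fixed-point iteration w <- normalize(X.T @ sign(X @ w))."""
    # Deterministic start: direction of the largest-norm sample.
    w = X[np.argmax(np.linalg.norm(X, axis=1))].astype(float)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = np.sign(X @ w)
        s[s == 0] = 1.0
        w_new = X.T @ s
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w

def flag_suspect_points(X, quantile=0.9):
    """Flag points in one class whose residual from the class's
    L1-PCA line is unusually large (possible label errors)."""
    Xc = X - np.median(X, axis=0)      # robust centering
    w = l1_principal_component(Xc)
    proj = np.outer(Xc @ w, w)         # projection onto the fitted line
    resid = np.linalg.norm(Xc - proj, axis=1)
    return resid > np.quantile(resid, quantile)

# Ten points on the line y = x, plus one off-line point at (0, 8)
# that might carry the wrong label.
X = np.array([[t, t] for t in range(10)] + [[0.0, 8.0]])
mask = flag_suspect_points(X)
print(np.where(mask)[0])   # -> [10]: only the off-line point is flagged
```

Because the L1 norm downweights large deviations, the fitted direction tracks the bulk of the class rather than being dragged toward the outlier, which is what makes the residual a useful mislabeling signal.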

The researchers tested the method extensively on real and synthetic datasets with varying levels of label contamination. Across the board, it delivered consistent and significant improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.

"What makes our approach particularly compelling is its flexibility," Pados said. "It can be used as a plug-and-play preprocessing step for any AI system, regardless of task or dataset. And the gains are not merely theoretical: on well-known benchmarks such as the Wisconsin breast cancer dataset, it showed consistent improvements in classification accuracy on both noisy and clean data.

"Even when the original training data look perfect, our new method still improves performance, suggesting that subtle, hidden label noise is more common than previously thought."

Looking ahead, this research opens the door to a wider range of applications. The team is interested in exploring how the mathematical framework can be extended to tackle deeper data science issues, such as reducing data bias and improving dataset integrity.

"As machine learning becomes deeply integrated into high-stakes domains such as healthcare, finance and justice systems, the integrity of the data driving these models has become more important than ever," said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science.

"We are asking algorithms to make decisions that affect real lives: diagnosing illness, assessing loan applications, even informing legal judgments. If the training data are flawed, the outcomes can be devastating. That's why innovations like this are so important.

"By improving data quality at the source, before the model is ever trained, we're not just making AI more accurate; we're making it more responsible. This work represents a meaningful step toward building trustworthy AI systems that are fair, reliable and perform ethically in the real world."

More information: Shruti Shukla et al, Training Dataset Curation with L1-Norm Principal Component Analysis in Support Vector Machines, IEEE Transactions on Neural Networks and Learning Systems (2025). DOI: 10.1109/TNNLS.2025.3568694

Provided by Florida Atlantic University

Citation: Innovative detection method makes AI smarter by cleaning up bad data before it learns, retrieved June 12, 2025 from https://techxplore.com/news/2025-06-06-06-06-06-i-smarter-bad.html.

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
