FAU's CA-AI makes AI smarter by cleaning up bad data before learning

In the world of machine learning and artificial intelligence, clean data is everything. Even a few mislabeled examples, known as label noise, can derail a model's performance. This is especially true of support vector machines (SVMs), which rely on a few key data points to make decisions.

SVMs are a widely used type of machine learning algorithm, applied to everything from image and speech recognition to medical diagnostics and text classification. These models work by finding the boundary that best separates different categories of data. They rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundary can be flawed, leading to poor performance on real-world data.
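To make the role of support vectors concrete, here is a minimal sketch in Python. It is a toy illustration rather than anything from the researchers' work: it assumes one-dimensional data in which every point of class -1 lies to the left of every point of class +1, in which case the maximum-margin boundary is simply the midpoint between the two closest opposing points. The function name `max_margin_threshold` and the sample values are hypothetical.

```python
def max_margin_threshold(points, labels):
    """Hard-margin SVM on 1-D data, assuming every -1 point lies left of every +1 point.

    Returns the decision threshold and the two support vectors that define it.
    """
    left = max(x for x, y in zip(points, labels) if y == -1)   # closest -1 point
    right = min(x for x, y in zip(points, labels) if y == +1)  # closest +1 point
    if left >= right:
        raise ValueError("classes overlap; this toy sketch needs separable data")
    return (left + right) / 2, (left, right)


points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [-1, -1, -1, +1, +1, +1]

clean_threshold, clean_sv = max_margin_threshold(points, labels)

# Moving a point that is not a support vector leaves the boundary unchanged.
far_threshold, _ = max_margin_threshold([1.0, 2.0, 3.0, 7.0, 8.0, 90.0], labels)

# Mislabeling the support vector at 3.0 shifts the boundary.
noisy_labels = [-1, -1, +1, +1, +1, +1]
noisy_threshold, noisy_sv = max_margin_threshold(points, noisy_labels)

print(clean_threshold, far_threshold, noisy_threshold)
```

In this toy setting, only the two points nearest the boundary matter, so a single wrong label on one of them is enough to move the boundary, while changes far from the boundary have no effect.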

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University, together with collaborators, has developed an innovative method to automatically detect and remove faulty labels before a model is ever trained. Before the AI starts learning, the researchers clean the data using a mathematical technique that looks for odd or unusual examples that do not quite fit. These “outliers” are removed or flagged so that the AI gets high-quality information from the start.

“SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering,” said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow. “What makes them especially effective, but also uniquely vulnerable, is that they rely on a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled, for example a malignant tumor, it can distort the model's understanding of the problem. Our work is about protecting models from these hidden dangers by identifying and removing mislabeled cases before they can do harm.”

The data-driven method that “cleans” the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class based purely on how well they fit with the rest of the group.
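The general idea of checking each point against its own class can be sketched as follows. This is an illustration under stated assumptions, not the researchers' published algorithm: it estimates a single L1-norm principal direction with a well-known sign-flipping fixed-point iteration (which locally maximizes the sum of absolute projections), measures how far each point lies from that direction, and flags points whose distance is unusually large. The function names, the median centering, the fixed rank of one and the cutoff `k` are all hypothetical choices made for this sketch; the article states that the actual method needs no such manual tuning.

```python
import math
import statistics


def l1_principal_direction(rows, max_iter=100):
    """Approximate the first L1-norm principal component of centered rows.

    Sign-flipping fixed-point iteration that locally maximizes
    sum(|x . q|) over unit vectors q.
    """
    start = max(rows, key=lambda r: sum(v * v for v in r))
    norm = math.sqrt(sum(v * v for v in start))
    if norm == 0.0:
        raise ValueError("all points coincide; no direction to estimate")
    q = [v / norm for v in start]
    prev_signs = None
    for _ in range(max_iter):
        signs = [1.0 if sum(a * b for a, b in zip(r, q)) >= 0 else -1.0 for r in rows]
        if signs == prev_signs:
            break  # converged: the sign pattern stopped changing
        summed = [sum(s * r[j] for s, r in zip(signs, rows)) for j in range(len(q))]
        norm = math.sqrt(sum(v * v for v in summed))
        if norm == 0.0:
            break
        q = [v / norm for v in summed]
        prev_signs = signs
    return q


def flag_outliers(rows, k=3.0):
    """Return indices of points in ONE class that fit the class subspace poorly.

    k is a hypothetical cutoff (in median absolute deviations) for this sketch.
    """
    center = [statistics.median(col) for col in zip(*rows)]
    centered = [[v - c for v, c in zip(r, center)] for r in rows]
    q = l1_principal_direction(centered)
    residuals = []
    for r in centered:
        proj = sum(a * b for a, b in zip(r, q))
        # Distance from the point to the line spanned by q.
        residuals.append(math.sqrt(max(0.0, sum(v * v for v in r) - proj * proj)))
    med = statistics.median(residuals)
    mad = statistics.median(abs(x - med) for x in residuals)
    return [i for i, x in enumerate(residuals) if x > med + k * mad]


# Six points of one class lying near a line, plus one point that does not fit.
one_class = [
    [1.0, 1.1], [2.0, 1.9], [3.0, 3.1], [4.0, 3.9],
    [5.0, 5.1], [6.0, 5.9], [3.5, -4.0],
]
flagged = flag_outliers(one_class)
print(flagged)
```

In a full pipeline, `flag_outliers` would be run once per class, and the flagged points would be dropped or reviewed before the classifier is trained.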

“Data points that appear to deviate significantly from the rest, often because of label errors, are flagged and removed,” Pados said. “Unlike many existing techniques, this process requires no manual tuning or user intervention. It can be applied to any AI model, which makes it both scalable and practical.”

The process is robust, efficient and entirely touch-free. It even handles the notoriously tricky task of rank selection (determining how many dimensions to keep during the analysis) without user input.

The researchers extensively tested the method on both real and synthetic datasets with varying levels of label contamination. Across the board, it delivered consistent and significant improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.
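The article does not say how the contaminated datasets were built. One common way to simulate label contamination at a chosen level, shown here only as an assumption for illustration, is to flip a fixed fraction of labels at random. The helper name `flip_labels`, the seed and the rates below are hypothetical.

```python
import random


def flip_labels(labels, rate, seed=0):
    """Return a copy of binary labels (+1/-1) with round(rate * n) entries flipped."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    n_flip = round(rate * len(labels))
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = -flipped[i]
    return flipped


clean = [1] * 50 + [-1] * 50
for rate in (0.0, 0.1, 0.2):
    noisy = flip_labels(clean, rate)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"rate={rate:.1f} flipped={changed}")
```

A classifier trained on each noisy copy, with and without a cleaning step, can then be scored on the same untouched test set to compare accuracy at each contamination level.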

“What makes our approach particularly compelling is its flexibility,” Pados said. “It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it's not just theoretical: it was tested on both noisy and clean datasets, including well-known benchmarks such as the Wisconsin Breast Cancer dataset. Hidden label noise may be more common than previously thought.”

Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science, such as reducing data bias and improving the integrity of datasets.

“As machine learning becomes deeply integrated into high-stakes domains such as health care, finance and the justice system, the integrity of the data driving these models has never been more important. We are asking algorithms to make decisions that affect disease diagnosis, loan applications and even legal judgments. If the training data is flawed, the consequences can be devastating. That is why innovations like this are so important for AI that performs reliably and ethically in the real world.”

This work is featured in the Institute of Electrical and Electronics Engineers (IEEE) Transactions on Neural Networks and Learning Systems. The co-authors, all IEEE members, are Shruti Shukla, a Ph.D. student in CA-AI and the FAU Department of Electrical Engineering and Computer Science; George Sklivanitis, Ph.D., Charles E. Schmidt Research Associate Professor in CA-AI and the Department of Electrical Engineering and Computer Science, and an I-SENSE faculty fellow; and Elizabeth Serena Bentley, Ph.D., and Michael J. Medley, Ph.D., both of the United States Air Force Research Laboratory.

Dimitris Pados

Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow.

-FAU-


