Cleanlab Launches with $5M to Automate Data Curation for LLMs and Modern AI Stacks

Cleanlab, an automation solution that improves the accuracy of enterprise artificial intelligence (AI), LLM, and analytics solutions, today announced a $5 million seed investment round led by Bain Capital Ventures. Its flagship product, Cleanlab Studio, assesses and corrects errors in both large-scale structured data (such as tabular data and spreadsheets) and large-scale unstructured data (such as visual data, LLM-generated data, and conversational data). Cleanlab describes it as the only enterprise solution that can do so.

Members of the Cleanlab team at a team retreat in Portland. Co-founder and Chief Scientist Jonas Mueller is second from left. Co-founder and CEO Curtis Northcutt is second from right. Co-founder and CTO Anish Athalye is on the right. (Photo: Business Wire)

Most companies today have AI models and business intelligence (BI) solutions, but those models are not trained on all of the company's data. Data and label quality issues such as outliers, label errors, and data drift often make the data unusable as input for reliable business intelligence, ML model training, or LLM fine-tuning.

Inaccurate data costs U.S. businesses $3.1 trillion annually, and the losses are rising, according to an IBM study. With Cleanlab, organizations like Amazon, Google, Walmart, Deloitte, and Wells Fargo have dramatically reduced the cost and time spent on data quality by automating the correction of errors in their datasets. Cleanlab is designed to work with most types of datasets, including text, images, and tabular/CSV/JSON data.

Cleanlab solves this problem for enterprises by analyzing unreliable real-world datasets to find and fix errors and by generating improved datasets with new, AI-generated labels. This frees up valuable engineering resources to focus on solving problems instead of on data curation and model training.

Cleanlab already maintains the most popular open-source library for data-centric AI. The library is used by thousands of data scientists to automatically diagnose problems in real-world data through algorithms that run on top of existing ML models. But diagnostics alone are not enough for companies that lack a model or an interface to fix the problems identified. To serve this broader market, the company introduced Cleanlab Studio, an enterprise application that handles both fixing data issues and deploying reliable models.
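To illustrate what it means to diagnose label problems from an existing model's outputs, here is a minimal Python sketch in the spirit of confident learning. It is a simplified illustration and not Cleanlab's implementation: the function names `class_thresholds` and `flag_label_issues` and the toy data are hypothetical, and the sketch assumes you already have predicted class probabilities from some trained classifier.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class threshold: the mean predicted probability of class k
    over the examples whose given label is k."""
    n_classes = len(pred_probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    # A class with no labeled examples gets threshold 1.0 (assumption for this sketch).
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)]


def flag_label_issues(labels: Sequence[int],
                      pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices whose given label disagrees with the class the model
    predicts with at least that class's threshold confidence."""
    thresholds = class_thresholds(labels, pred_probs)
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears the per-class threshold.
        confident = [k for k, p in enumerate(probs) if p >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        best = max(confident, key=lambda k: probs[k])
        if best != label:
            flagged.append(i)
    return flagged


# Toy data (hypothetical): example 2 is labeled 0, but the model favors class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(flag_label_issues(labels, pred_probs))  # prints [2]
```

Flagged examples are candidates for human review or relabeling, not confirmed errors. In practice the probabilities would come from out-of-sample predictions (for example, via cross-validation) so that the model has not simply memorized the given labels.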

Curtis Northcutt, Jonas Mueller, and Anish Athalye, all three MIT PhDs, invented confident learning while working with quantum computing pioneer Isaac Chuang during their doctoral studies at MIT. They went on to develop this work into a new area of AI known as "cleanlab."

Cleanlab Studio enables both individual data scientists and corporate teams to automate the process of finding and fixing outliers, label issues, and other data issues in image, text, and tabular datasets. Teams can then train models more confidently, extract more value from their data, and build models that yield more accurate analysis and insights. Unlike other solutions in the space, Cleanlab Studio uses state-of-the-art automated ML to handle model training, hyperparameter tuning, and model selection, delivering improved datasets with no code or machine learning expertise required. This dramatically accelerates the delivery of ML models and business insights.

“We often forget that, like humans, artificial intelligence solutions also embody imperfections. Cleanlab resonates with everyone because AI works just like you do: if you are taught the wrong thing, you will do worse on the exam. Cleanlab automates the curation and correction of data to produce more accurate models in less time,” said Curtis Northcutt, co-founder and CEO of Cleanlab. “We do not guarantee perfection, we guarantee improvement. Cleanlab breaks the AI glass ceiling by providing accessibility and reliability for AI solutions.”

“The main risk with LLMs is ‘garbage in, garbage out.’ If a model is trained on messy data containing bias, inaccuracy, or meaningless information, its output can contain similar problems, and it often does,” said Aaref Hilaly, partner at Bain Capital Ventures. “As DeepMind’s Chinchilla paper (and others) show, LLM performance is still largely data dependent, so there is also a huge opportunity to improve data curation. Cleanlab is the easiest way to curate data for training and fine-tuning, and it is an integral part of the emerging infrastructure stack that supports modern AI.”

“Cleanlab has increased accuracy by 28% and reduced the number of labeled transactions required to train the model by more than 98%,” said David Muelas Recuenco, expert data scientist at BBVA (Banco Bilbao Vizcaya Argentaria), one of the largest financial institutions in the world, discussing how Cleanlab reduced the cost of curating datasets and training models by more than 98%.

“Using Cleanlab AI improved model accuracy by 15% and reduced the number of training iterations by a factor of 3,” said Steven Gawthorpe, senior managing consultant and data scientist at Berkeley Research Group. “Our team is extremely impressed with the accuracy, speed and ease of use Cleanlab provides.”

Prior to Cleanlab, co-founder and Chief Scientist Jonas Mueller built Amazon’s automated ML solution, which is now used by all AWS automated ML jobs. Co-founder and CTO Anish Athalye has earned more than 5,000 citations for groundbreaking work showing where AI solutions fall short and how to improve them. By combining Curtis’ work on automatically correcting most dataset issues, Jonas’ work on automatically training ML models on arbitrary datasets, and Anish’s work on secure systems, the team created Cleanlab Studio to accomplish its mission of making AI more accessible and effective for people.

Cleanlab Studio integrates with the most popular data and ML workflows, uploads large datasets as fast as internet bandwidth allows, and scales for the enterprise.

On June 1, 2023, Databricks announced a partnership with Cleanlab to enable automated data correction for both structured and unstructured datasets on the Databricks platform via a Cleanlab Studio integration.

In 2021, Cleanlab was nominated for a NeurIPS Best Paper Award. In 2022, Cleanlab published five peer-reviewed papers at NeurIPS and ICML conferences and workshops, and in 2023, Cleanlab's leadership taught a Data-Centric AI course at MIT.

Cleanlab actively collaborates with organizations that train large-scale models and develop business intelligence and analytics solutions on images, text, tabular, and other types of data. Visit Cleanlab Studio to learn more about data remediation with Cleanlab’s no-code, automated enterprise AI platform.

About Cleanlab

Pioneered at MIT and trusted by hundreds of top organizations, Cleanlab automatically detects and corrects errors in both structured and unstructured datasets, including visual, text, and tabular data, transforming unreliable data into reliable models and insights. San Francisco-based Cleanlab was founded in 2021 by three MIT computer science PhDs.
