Prototyping AI-driven systems has never been easier. A chatbot for taking notes, an editor for creating images from text, and a tool for summarizing customer comments can all be created with a basic understanding of programming and a few hours of work. After using such a prototype for a while, however, you may find that it falls short of what you need.
In the real world, machine learning (ML) systems can exhibit harmful behaviors, such as social bias and safety failures. From racial bias in pedestrian detection models to systematic misclassification of certain medical images, experts and researchers continually discover significant limitations and failures in state-of-the-art models. Behavioral evaluation, or testing, is commonly used to discover and validate these limitations. It goes beyond aggregate metrics such as accuracy and F1 score to examine patterns in model output for subgroups, or slices, of input data. Stakeholders such as ML engineers, designers, and domain experts should work together to identify expected and potential failure modes of the model.
Although the importance of behavioral evaluation has been widely emphasized, it remains difficult to carry out. Many popular behavioral evaluation tools, such as fairness toolkits, do not support the models, data, or behaviors that practitioners typically work with in the real world. In practice, practitioners manually test handpicked cases from users and stakeholders to evaluate models and select the best version for deployment. Moreover, models are often created before practitioners are familiar with the products and services in which they will be used.
A core difficulty in model evaluation is understanding how well a machine learning model can actually complete a particular task. Just as IQ tests are only crude and imperfect measures of human intelligence, aggregate metrics can only roughly estimate a model's performance. An aggregate score may fail to account for basic requirements, such as correct grammar in an NLP system, or may mask systemic flaws such as social biases. A standard remedy is to compute performance metrics on subsets of the data.
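To see why aggregate metrics can mislead, consider a minimal, hypothetical illustration (the dataset and numbers below are invented) in which a model looks strong overall while failing badly on a small subgroup:

```python
# Hypothetical example: a strong aggregate metric hides a failing subgroup.
labels = [1] * 90 + [1] * 10                  # ground truth for 100 instances
preds = [1] * 90 + [0] * 6 + [1] * 4          # model predictions
groups = ["majority"] * 90 + ["minority"] * 10

overall = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"overall accuracy: {overall:.2f}")     # 0.94 -- looks fine

for g in ("majority", "minority"):
    idx = [i for i, grp in enumerate(groups) if grp == g]
    acc = sum(preds[i] == labels[i] for i in idx) / len(idx)
    print(f"{g} accuracy: {acc:.2f}")         # minority: 0.40 -- a real failure
```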
Determining which behaviors a model should exhibit is a central challenge for behavioral evaluation. In a complex domain, testing against a complete list of requirements is impossible, as the number of requirements can be effectively infinite. Instead, ML engineers work with domain experts and designers to describe the model’s expected functionality before iterating on and deploying it. Users then provide feedback on the model’s limitations and expected behavior through their interactions with the product or service, and this feedback is reflected in subsequent iterations of the model.
Many tools exist for identifying, validating, and monitoring model behaviors in ML evaluation systems. These tools employ data transformations and visualizations to uncover patterns such as fairness concerns and edge cases. Zeno works in tandem with such systems and combines methods from them. Subgroup or slice-based analysis, which computes metrics on subsets of a dataset, is the behavioral evaluation method closest to Zeno’s approach; Zeno additionally enables slice-based metamorphic testing for any domain or task, as sketched below.
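To illustrate the metamorphic-testing idea in general terms (this is a generic sketch, not Zeno’s actual API; the function names and toy model are invented), one applies a label-preserving transformation to each instance in a slice and measures how often the model’s prediction survives it:

```python
# Generic sketch of slice-based metamorphic testing (not Zeno's API).

def add_typo(text: str) -> str:
    """A label-preserving transformation: drop the second character."""
    return text[0] + text[2:] if len(text) > 2 else text

def invariance_rate(model, texts) -> float:
    """Fraction of instances whose prediction survives the transformation."""
    unchanged = sum(model(t) == model(add_typo(t)) for t in texts)
    return unchanged / len(texts)

# Usage with a toy sentiment "model" (predicts positive if "good" appears):
toy_model = lambda t: int("good" in t)
slice_texts = ["a good movie", "not good at all", "terrible plot"]
print(invariance_rate(toy_model, slice_texts))  # 1.0 means fully invariant
```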
Zeno consists of a Python application programming interface (API) and a graphical user interface (UI). Basic components of behavioral evaluation, such as model outputs, metrics, metadata, and transformed instances, are implemented as Python API functions. These functions in turn power the main interface for exploring data and testing behaviors. The Zeno frontend has two main views: an exploration UI for discovering and slicing data, and an analysis UI for creating tests, building reports, and monitoring performance.
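As a rough sketch of what such decorator-based API functions could look like (the decorator names, registry, and toy model below are illustrative assumptions, not necessarily Zeno’s exact interface):

```python
# Illustrative sketch of a decorator-based evaluation API; names are assumed.
from typing import Callable, List

registry = {"model": None, "metrics": []}

def model(fn: Callable) -> Callable:
    """Register a function that loads a model and returns a predict function."""
    registry["model"] = fn
    return fn

def metric(fn: Callable) -> Callable:
    """Register a function that scores predictions on a slice of data."""
    registry["metrics"].append(fn)
    return fn

@model
def load_model(checkpoint_path: str):
    # A real implementation would load weights; here, a toy sentiment model.
    return lambda texts: [int("good" in t) for t in texts]

@metric
def accuracy(preds: List[int], labels: List[int]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# The framework can then call the registered pieces on any slice of data:
predict = registry["model"]("checkpoint.pt")
print(registry["metrics"][0](predict(["a good movie", "bad film"]), [1, 0]))
```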
Zeno is distributed as a Python package. The frontend, built in Svelte, employs Vega-Lite for visualizations and Arquero for data processing, and is bundled with the package. After specifying the necessary settings, such as test files, data paths, and column names, in a TOML configuration file, the user launches Zeno’s processing and interface from the command line. Zeno hosts its UI at a URL endpoint, so it can be deployed locally or on a server with more compute while users access it from their own devices. The framework has been tried and proven on datasets containing millions of instances, so it should scale well to most deployment scenarios.
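A configuration of this kind might look roughly as follows (the key names below are illustrative assumptions inferred from the description above, not a verbatim Zeno schema):

```toml
# Hypothetical Zeno-style configuration; key names are assumptions.
tests = "my_tests/"             # folder of Python files with API functions
metadata = "data/metadata.csv"  # table of instances and their metadata
data_path = "data/images/"      # where the raw instances live
data_column = "filename"        # column identifying each instance
label_column = "label"          # ground-truth column
```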
The ML ecosystem contains many frameworks and libraries, each tailored to specific data types or models, and most of them are Python-based. To cope with this fragmentation, the researchers designed Zeno’s backend as a customizable, Python-based API for model inference and data processing: a set of decorator-wrapped functions capable of supporting most modern ML models.
Case studies conducted by the research team demonstrated how Zeno’s API and UI work together to let practitioners discover critical flaws in models across datasets and tasks. More broadly, the results suggest that behavioral evaluation frameworks can be useful across different types of data and models.
Depending on the user’s needs and the difficulty of the task at hand, Zeno’s different affordances make behavioral evaluation easier, faster, and more accurate. Participants in Case 2, for example, used the API’s extensibility to create new metadata for model analysis. Across the case studies, participants reported little difficulty embedding Zeno into their existing workflows or writing code to communicate with the Zeno API.
Limitations and future work
- Knowing which behaviors are essential to end users and encoded by the model is a major challenge for behavioral evaluation. The researchers are actively developing ZenoHub, a collaborative repository where users can share Zeno functions, find relevant analysis components more easily, and reuse the model functions that underpin discovery.
- Zeno’s primary function is to define and test metrics on data slices, but the tool currently provides only limited grid and table views for inspecting data and slices. Zeno’s usefulness could be enhanced by supporting a wider range of powerful visualization techniques. Instance views that encode semantic similarity, such as DendroMap, Facets, and AnchorViz, may help users discover patterns and new behaviors in their data. ML Cube, Neo, and ConfusionFlow are among the ML performance visualizations that Zeno could adapt to better display model behavior.
- Zeno’s parallel processing and caching allow it to scale to huge datasets, but the size of machine learning datasets is growing rapidly, so further optimizations could yield significant speedups. Processing on distributed compute clusters using libraries like Ray may be a future update; a rough sketch of that idea follows this list.
- Cross-filtering multiple histograms over very large tables is another bottleneck. Zeno may adopt Falcon-like optimization techniques to enable real-time cross-filtering on large datasets.
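To make the Ray-based direction concrete, here is a minimal sketch of distributing batched model inference with Ray’s remote tasks (the toy model and batch size are invented for illustration; this is not Zeno code):

```python
# Minimal sketch of distributed batch inference with Ray (not Zeno code).
import ray

ray.init()  # connects to a local or remote Ray cluster

@ray.remote
def infer_batch(batch):
    """Run a toy model on one batch; real code would load actual weights."""
    return [int("good" in text) for text in batch]

texts = ["a good movie", "terrible plot"] * 1000
batches = [texts[i:i + 256] for i in range(0, len(texts), 256)]

# Each batch runs as a parallel Ray task, potentially on different machines.
futures = [infer_batch.remote(b) for b in batches]
preds = [p for batch_preds in ray.get(futures) for p in batch_preds]
print(len(preds))  # 2000
```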
Conclusion
Even if a machine learning model achieves high accuracy on training data, it can still suffer from systemic problems in the real world, such as harmful biases and safety hazards. To identify and correct such shortcomings, practitioners conduct behavioral evaluations, inspecting a model’s output on specific inputs. Behavioral evaluation is important but difficult: it requires uncovering real-world patterns and examining systemic failures. In this work, the authors examine the difficulty of ML evaluation and develop a general-purpose method for evaluating models across different situations. They demonstrate how Zeno applies to multiple domains through four case studies in which practitioners evaluated real-world models.
Many people have great expectations for the development of AI. Nevertheless, the complexity of these systems’ behavior is growing at the same rate as their capabilities. Robust tools are essential to enable behavior-driven development and to ensure that intelligent systems are built in harmony with human values. Zeno is a flexible platform that lets users perform this kind of in-depth inspection across a wide range of AI tasks.
Check out the paper and the CMU blog post for more details. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with extensive experience in FinTech companies covering the fields of finance, cards and payments, and banking, with a strong interest in AI applications. She is passionate about exploring new technologies and advancements in today’s evolving world to make life easier for everyone.
