Tools to help you choose the right method for evaluating AI models

When machine learning models are deployed in real-world situations, perhaps to flag potential disease on X-rays for a radiologist to review, human users need to know when to trust the model’s predictions.

However, machine learning models are so large and complex that even the scientists who design them do not understand exactly how they make predictions. So researchers have created techniques known as saliency methods that attempt to explain a model’s behavior.

With new methods being released all the time, researchers at MIT and IBM Research have created a tool that helps users choose the best saliency method for a given task. They developed saliency cards, which provide standardized documentation of how a method works, including its strengths and weaknesses, along with explanations that help users interpret it correctly.

The researchers hope that, armed with this information, users can deliberately choose a saliency method appropriate for both the type of machine learning model they are using and the task that model is performing, explains co-lead author Angie Boggust, a graduate student in electrical engineering and computer science at MIT and a member of the Visualization Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

Interviews with AI researchers and experts in other fields revealed that the cards help people quickly compare different methods side by side and choose the right one for the task. Choosing the right method will give users a better idea of how the model is behaving, and thus help them interpret the predictions correctly.

“The saliency cards are designed to give a quick, at-a-glance overview of a saliency method and break it down into its most important human-centric attributes. They are designed for everyone, down to the lay user trying to understand the methods and choose one,” says Boggust.

Joining Boggust on the paper are co-lead author Harini Suresh, a postdoctoral fellow at MIT; Hendrik Strobelt, a senior researcher at IBM Research; John Guttag, the Dugald C. Jackson Professor of Computer Science and Electrical Engineering at MIT; and senior author Arvind Satyanarayan, an associate professor of computer science at MIT who leads the Visualization Group at CSAIL. The research will be presented at the ACM Conference on Fairness, Accountability, and Transparency.

Choosing the right method

Researchers have previously evaluated saliency methods using the concept of fidelity. In this context, fidelity captures how accurately a method reflects the model’s decision-making process.

But fidelity isn’t black and white, Boggust explains. A method may perform well on one fidelity test but fail another. With so many saliency methods and so many possible evaluations, users often settle on a method because it is popular or because a colleague has used it.

However, choosing the “wrong” method can have serious consequences. For example, one saliency method, known as integrated gradients, compares the importance of features in an image to a meaningless baseline. The features with the largest importance relative to the baseline are the most meaningful for the model’s prediction. This method typically uses all 0s as the baseline, but when applied to an image, all 0s equate to the color black.

“It will tell you that any black pixels in your image are unimportant, even if they matter, because they are identical to that meaningless baseline. If you are looking at X-rays, this can be a big problem,” says Boggust.
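The baseline problem can be illustrated with a minimal sketch. This is not the researchers' code: it assumes a hypothetical three-pixel "image" and a toy sigmoid model with made-up weights, and it approximates integrated gradients with a simple midpoint sum. All names below (`model`, `integrated_gradients`, `weights`, and so on) are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, weights, bias):
    """Toy differentiable model: sigmoid of a weighted sum of pixels."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def model_gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to each pixel."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    Attribution for pixel i is (x_i - baseline_i) times the average
    gradient along the straight path from the baseline to the input.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_gradient(point, weights, bias)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

# Hypothetical values: pixel 0 is black (0.0) but carries the largest weight.
weights = [-4.0, 1.0, 0.5]
bias = 0.0
image = [0.0, 0.8, 0.6]

black_baseline = [0.0, 0.0, 0.0]  # the common all-zero default
white_baseline = [1.0, 1.0, 1.0]  # an alternative, chosen only for contrast

attr_black = integrated_gradients(image, black_baseline, weights, bias)
attr_white = integrated_gradients(image, white_baseline, weights, bias)

print("all-zero baseline:", [round(a, 3) for a in attr_black])
print("all-one baseline: ", [round(a, 3) for a in attr_white])
```

With the all-zero baseline, the black pixel's attribution is exactly zero because its difference from the baseline is zero, regardless of its weight in the model. Switching to a different baseline gives that same pixel a nonzero attribution, which is why the choice of baseline matters for images where black carries meaning.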

Saliency cards help users avoid this kind of problem by summarizing how a saliency method works in terms of 10 user-focused attributes. The attributes capture how saliency is calculated, the relationship between the saliency method and the model, and how a user perceives the method's output.

For example, one of the attributes is hyperparameter dependence, which measures how sensitive a saliency method is to user-specified parameters. The saliency card for integrated gradients describes its parameters and how they affect its performance. With this card, a user can immediately see that the default parameters (an all-0 baseline) may produce misleading results when evaluating X-rays.

The cards can also help scientists by revealing gaps in the research space. For example, the MIT researchers were unable to identify a saliency method that is both computationally efficient and applicable to any machine learning model.

“Can we bridge that gap? Is there a saliency method that can do both? Or maybe these two ideas are theoretically at odds with each other,” Boggust says.

Showing the cards

After creating several cards, the team conducted a user study with eight domain experts, ranging from computer scientists to radiologists who were new to machine learning. In the interviews, all participants said that the concise descriptions helped them prioritize attributes and compare methods. And even though the radiologists were new to machine learning, they were able to understand the cards and use them to take part in the process of selecting a saliency method, Boggust says.

The interviews also revealed some surprises. Researchers often expect clinicians to want a method that is sharp, meaning one that focuses on a particular object in a medical image. However, clinicians in this study actually preferred some noise in medical images to help them reduce uncertainty.

“When we broke the problem down into different attributes and asked people, no one had the same priorities as anyone else in the study, even when they had the same role,” she says.

In the future, the researchers hope to explore some of the more under-evaluated attributes and perhaps design task-specific saliency methods. They also want to develop a better understanding of how people perceive the output of saliency methods, which could lead to better visualizations. In addition, they are hosting their work in a public repository so that others can provide feedback to guide future work, Boggust says.

“We sincerely hope that these will become living documents that grow as new saliency methods and evaluations are developed ... and how they affect different tasks,” she says.

This research was supported in part by the MIT-IBM Watson AI Lab, the US Air Force Research Laboratory, and the US Air Force Artificial Intelligence Accelerator.
