What Explainability Means for AI | Stephanie Karmer | June 2024



Do we still care about how machine learning works?

Towards Data Science

I want to get a little philosophical today and talk about how explainability and risk intersect in machine learning.


Simply put, explainability in machine learning is the idea of being able to explain how a model makes a decision to a human user (not necessarily a technically savvy one). Decision trees are an example of models that can be easily explained (sometimes called “white box”): something like “the model splits the data between houses with area > 1 and houses with area <= 1”. Other, more complex kinds of models can be “grey box” or “black box”, becoming increasingly difficult, or outright impossible, for a human user to quickly understand.
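To make the “white box” idea concrete, here is a minimal sketch, assuming scikit-learn; the data, feature name, and resulting threshold are invented for illustration, not taken from any real housing dataset:

```python
# Illustrative sketch: a shallow decision tree is "white box" because its
# learned splits can be printed as human-readable rules. Data is made up.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[500], [800], [1200], [2000]]  # hypothetical house areas
y = [0, 0, 1, 1]                    # 0 = small, 1 = large price class

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(tree, feature_names=["area"]))
# The printout reads like a rule: area <= threshold -> class 0, else class 1
```

The entire model fits in a few printed lines, which is exactly what a grey- or black-box model cannot offer.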

A fundamental lesson from my machine learning education is that your relationship with models (usually boosted-tree-style models) should be, at best, “trust but verify”. When training a model, don't take the initial predictions at face value; take the time to seriously validate them. Test the model's behavior with very strange outliers that are unlikely to occur in reality. If the tree is shallow, plot it. Use techniques like feature importance, Shapley values, and LIME to check that the model is making inferences using features that correspond to your knowledge of the subject and its logic. Did the feature splits in a particular tree match what you know about the subject? If you're modeling a physical phenomenon, you can also compare the model's behavior to scientific knowledge of how things work. Don't just trust; verify that the model is approaching the problem in the right way.
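One of these validation techniques can be sketched in a few lines. This is a hedged example, assuming scikit-learn; the data, feature names, and the deliberately irrelevant “noise” column are all invented to show what a sanity check looks like:

```python
# Sketch of "trust but verify": after training, check which features the
# model actually leans on via permutation importance. Data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # invented columns: [area, age, noise]
y = 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for name, imp in zip(["area", "age", "noise"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
# Sanity check: "noise" should matter far less than "area". If it didn't,
# that would be a red flag about what the model has actually learned.
```

If the importances contradict your domain knowledge, that is exactly the moment to stop trusting the initial predictions.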

Don't just trust that the model is approaching the problem the right way; make sure.

As neural networks explode in relevance, the biggest tradeoff we have to consider is that the way their architecture works makes this kind of explainability far more difficult, and often impractical.

Neural network models apply functions to the data at each intermediate layer, transforming it in various ways before producing a target value in the final layer. As a result, unlike the partitions in tree-based models, the intermediate layers between the input and output are usually not meaningfully interpretable by humans. While you might be able to find a particular node in an intermediate layer and see how its value affects the output, linking this to a concrete input that a human can understand usually fails because of the abstraction that even the layers of a simple neural network introduce.
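You can see this abstraction even in a toy network. The following is a minimal numpy sketch with invented weights and input; the point is simply that the hidden-layer values, unlike a tree's splits, don't correspond to any human-readable field:

```python
# Minimal sketch: even in a tiny two-layer network, the hidden-layer
# activations are abstract mixtures of the inputs, not meaningful fields.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # input(4) -> hidden(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden(3) -> output(1)

x = np.array([1.0, 0.5, -0.2, 2.0])            # some input record
hidden = np.maximum(0, x @ W1 + b1)            # ReLU activations
output = hidden @ W2 + b2

print("hidden layer:", hidden)  # numbers with no direct real-world meaning
print("output:", output)
```

Compare this to the printed decision-tree rules above: there is no sentence a non-technical user could read off these hidden values.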

This can be easily illustrated by the “huskies vs wolves” problem: a convolutional neural network was trained to distinguish between pictures of huskies and wolves, but upon investigation it was discovered that the model was making its choice based on the color of the background. Training photos of huskies were less likely to have snowy scenes than photos of wolves, so when the model received an image with a snowy background, it always predicted that there was a wolf. The model had developed an internal logic based on incorrect characteristics, using information that the humans involved had not considered.

This means that the traditional test of “does this model ‘think’ about the problem in a way that is consistent with physical or intuitive reality?” becomes much harder to apply: we cannot directly inspect how the model makes its choices, and instead must resort to a trial-and-error approach. This involves a systematic experimental strategy, essentially testing the model against many counterfactuals to determine what kind and degree of variation in the inputs produces changes in the outputs, which is necessarily difficult and computationally intensive.
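The counterfactual strategy can be sketched very simply. Here the black box is a stand-in function I've invented; in practice it would be a trained network you cannot inspect:

```python
# Sketch of counterfactual probing: treat the model as a black box and
# vary one input at a time to see what moves the output.
import numpy as np

def model(x):
    # Hypothetical black box; a stand-in for a trained neural network.
    return 2.0 * x[0] + 0.0 * x[1]

base = np.array([1.0, 1.0])
for i in range(len(base)):
    counterfactual = base.copy()
    counterfactual[i] += 1.0                 # perturb one feature
    delta = model(counterfactual) - model(base)
    print(f"feature {i}: output change = {delta:+.2f}")
# Here feature 0 moves the output and feature 1 does not -- evidence about
# which inputs the black box is actually sensitive to.
```

A husky/wolf-style audit is this same idea at scale: swap the background, keep the animal, and watch whether the prediction flips.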

We cannot directly inspect how the model makes its choices, and instead rely on a trial-and-error approach.

I do not mean to claim that the effort to understand how neural networks work is hopeless. Many scholars are very interested in explainable AI, known in the literature as XAI. The variety of models available today means there are many approaches that can and should be pursued. Attention mechanisms are one technological advance that can help us understand which parts of the input the model pays the most attention to, or is most driven by. Anthropic published a very interesting report delving into Claude's interpretability, in which they use a sparse autoencoder to try to understand which words, phrases, or images trigger the strongest activations in the LLM in response to prompts. Tools such as Shapley values and LIME, discussed above, can also be applied to some types of neural networks, such as CNNs, though the results can be difficult to interpret. But by definition, the more complexity there is, the harder it is for a human viewer or user to understand and interpret the model's behavior.
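To illustrate why attention gives a partial window, here is a hedged numpy sketch of scaled dot-product attention with invented toy shapes; the weights it prints are the quantities XAI work inspects to ask “which tokens did the model attend to?”:

```python
# Sketch of scaled dot-product attention: the softmax weights form a
# distribution over input tokens -- one (partial) window into what the
# model attends to. All values here are random toy data.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
Q = rng.normal(size=(1, d))   # one query
K = rng.normal(size=(5, d))   # five input tokens
V = rng.normal(size=(5, d))

weights = softmax(Q @ K.T / np.sqrt(d))  # attention over the 5 tokens
context = weights @ V

print("attention weights:", weights.round(3))  # sums to 1; larger = more attended
```

The weights are inspectable, but they are still only one layer's view, which is why attention alone does not make a model “white box”.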

Another important element to recognize is that many neural networks incorporate randomness, so a model will not always return the same output for the same input. In particular, generative AI models intentionally produce varied outputs from the same input, which can make them appear more “human” or creative. The extent of this variation can be increased or decreased by adjusting the “temperature”: at higher temperatures the model may return a “surprising” output rather than the most probable one, increasing the creativity of the results.
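The temperature mechanism itself is simple to sketch. This is an illustrative implementation with made-up logits, not any particular model's API:

```python
# Sketch of temperature in sampling: dividing logits by the temperature
# before softmax makes the distribution sharper (low T: almost always the
# top token) or flatter (high T: more "surprising" picks). Logits invented.
import numpy as np

def sample_probs(logits, temperature):
    z = np.asarray(logits) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print("T=0.1:", sample_probs(logits, 0.1).round(3))  # nearly all mass on the top token
print("T=2.0:", sample_probs(logits, 2.0).round(3))  # mass spread across options
```

At low temperature the model is nearly deterministic; at high temperature, identical inputs routinely yield different outputs, which is exactly what complicates trial-and-error probing.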

In these situations, you can still do some trial and error to try to understand what the model is doing and why, but the complexity grows exponentially. Where previously the only change to the equation was a different input, you now add unknown variation from randomness on top of the input change. Did the response change because of the input change, or was it the result of randomness? In many cases, it is impossible to truly know.

Did changing the input change the response, or was it the result of randomness?

So where does this leave us? Why would you want to know how a model made an inference in the first place? Why does it matter, whether as a machine learning developer or as a user of a model?

If we build machine learning that helps inform choices and shape people's behavior, we must be held accountable for the results. Model predictions sometimes go through human intermediaries before being applied to the real world, but increasingly models are deployed and their inferences operationalized without further review. More than ever before, the public has unmediated access to highly complex machine learning models.

So for me, understanding how and why a model works the way it does is due diligence, similar to testing to make sure a manufactured toy doesn't have lead paint on it, or that a machine won't fail under normal use and break someone's hand. This is a lot harder to test for, but ensuring that we don't put products out there that make life worse is a moral position I'm committed to. If you build a machine learning model, you're responsible for what that model does and how it impacts people and the world. So to be truly confident that a model is safe to use, you need some understanding of how and why it returns the output it does.

When you build a machine learning model, you are responsible for what it does and how it impacts people and the world.

As an aside, readers may remember from my article on the EU AI Act that model predictions are required to be subject to human oversight and must not produce decisions with discriminatory effects based on protected characteristics. So even if you don't feel driven by the moral argument, for many of us there is also a legal motivation.

Even with neural networks, there are tools available to help you better understand how the model is making its choices; applying them just takes time and effort.

Philosophically, one could argue (and some do argue) that for machine learning to progress beyond a basic level of sophistication, we need to give up the desire to understand everything. This may be true. But we shouldn't ignore the trade-offs this creates and the risks we accept. At best, a generative AI model mostly does what you expect (perhaps with the temperature kept low and a not-very-creative model) and doesn't do much that's unexpected. At worst, it can cause disaster, because the model reacts in ways you never anticipated. This could mean making you look foolish, it could mean the end of your business, or it could mean actual physical harm to people. These are the risks you take on your own shoulders when you accept that model explainability is unattainable. You can't build this and then shrug, “oh, that's just what the model did,” when you made a conscious decision to release it or use its predictions.

A range of technology companies, large and small, have acknowledged that generative AI can sometimes produce inaccurate, dangerous, discriminatory, or otherwise harmful outcomes, and have decided that the perceived benefits are worth it. We know this because generative AI models that routinely exhibit undesirable behaviors have been publicly released. Personally, I find it disturbing that the tech industry has chosen to expose the public to such risks without any explicit consideration or discussion, but it has happened nonetheless.

To me, pursuing XAI and trying to keep up with the advances in generative AI seems like a noble goal, but I don't think we'll ever reach a stage where most people can easily understand how these models work; their architectures are simply too complex and opaque. As a result, risk mitigation measures also need to be implemented, and those responsible for the increasingly sophisticated models that affect our daily lives need to be held accountable for these products and their safety. Because outcomes are often unpredictable, we need frameworks to protect our communities from worst-case scenarios.

While not all risks should be considered unacceptable, we must recognize that they exist, and that state-of-the-art AI explainability challenges make machine learning risks harder than ever to measure and predict. The only responsible choice is to weigh these risks against the actual benefits these models produce (not taking for granted the predicted or promised benefits of future versions) and make thoughtful decisions accordingly.


