With a clear need for XAI, data scientists have rushed to deploy XAI algorithms without considering their drawbacks and limitations or, indeed, the intended business outcome. In recent blogs we discussed some problems with two commonly available explainer algorithms: LIME and SHAP. We shared how, in general, feature importances are not good enough, and how LIME and SHAP do not produce explanations that are easy to understand or actionable. In this final blog of the series, we discuss some more technical problems with LIME and SHAP, which further limit their scalability and robustness in a business context.
Problem 1: LIME and SHAP can be fooled – even in simple cases!
Some recent work in XAI has revealed another weakness in LIME, SHAP, and other perturbation-based post-hoc explanation methods: they can be fooled. Two recent papers published by researchers at Harvard University and the University of Pennsylvania give examples where LIME and SHAP are vulnerable to adversarial attacks and can be used to generate deliberately misleading explanations. Many proponents of XAI claim that this technology is a quick and easy way to remove bias from models, meet regulatory demands, and provide auditable models. What these papers demonstrate is that, because of weaknesses in how LIME and SHAP use the underlying dataset, a clever, malicious, or even unaware data scientist can create a model which uses a particular feature to make a prediction without that feature appearing in the explanation of the prediction. LIME and SHAP can be deceived, and on their own they cannot be used to test for bias in a model.
A third paper, from NYU, focuses on criticism of SHAP. It shows that a feature can be extremely important to a very simple model while the Shapley value calculated for it is almost zero, and, similarly, that a feature can have a high Shapley value but no importance to the model. Again, this shows that it is possible to design cases, intentionally or otherwise, which fool these feature importance explanations. While the cases here were designed to fool LIME and SHAP, the models produced are actually very simple, and it would not be surprising for models like these to show up in real-world cases.
One of Elula’s AI products, Sticky, predicts which home loan customers are likely to churn. Let’s imagine a very simple churn prediction model using only three binary features:
F1 = 1 if this account is on a variable rate, 0 otherwise.
F2 = 1 if this account has a balance above $200k, 0 otherwise.
F3 = 1 if this is an investment loan, 0 otherwise.
Suppose our model is C = F1 + F2 + 10×F1×F3 + 10×F2×F3, where we predict churn if C > 0. In this case, SHAP will give F3 a larger weight than F1 or F2, but in reality F3 does not affect the decision at all! Notice that, whatever the value of F3, we predict churn exactly when F1 > 0 or F2 > 0. In this very simple example, we see how easily SHAP can misjudge which features are actually important to a model. The end result is that the user incorrectly believes this customer is going to churn because theirs is an investment loan, when in fact the prediction is driven by their variable rate or their balance. An explanation should be a minimal set of facts which guarantees the conclusion; here, that minimal set of facts is either “This loan is on a variable rate” or “This loan has a balance above $200k”. The fact that the loan is an investment loan does not guarantee C > 0, despite F3 having the highest Shapley value.
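The Shapley values for this toy model can be checked by brute force. The sketch below is not the SHAP library: it enumerates every coalition of the three features directly, and it assumes an all-zeros baseline for features that are "absent" from a coalition (SHAP instead averages over a background dataset, so its numbers depend on that data). The names `score` and `shapley_values` are illustrative.

```python
from itertools import combinations, product
from math import factorial


def score(f1, f2, f3):
    """The toy churn score C from the text."""
    return f1 + f2 + 10 * f1 * f3 + 10 * f2 * f3


def shapley_values(instance, baseline=(0, 0, 0)):
    """Exact Shapley values of `score` at `instance`.

    Features outside a coalition take their `baseline` value
    (an assumption made for this sketch).
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return score(*x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


# Account on a variable rate, balance above $200k, investment loan.
phi = shapley_values((1, 1, 1))
print(phi)  # approximately [6.0, 6.0, 10.0] for F1, F2, F3

# The decision C > 0 is the same whether F3 is 0 or 1.
decision_ignores_f3 = all(
    (score(f1, f2, 0) > 0) == (score(f1, f2, 1) > 0)
    for f1, f2 in product((0, 1), repeat=2)
)
print(decision_ignores_f3)  # True
```

For the account with all three features set to 1, F3 receives the largest share of the score (10 of 22, against 6 each for F1 and F2), even though flipping F3 never changes the churn decision.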
Problem 2: Unrealistic scenarios
The flaws in LIME and SHAP which lead to unrealistic scenarios are not easily addressed. LIME generates thousands of random scenarios using a perturbation sampling method, and many of these are unrealistic counterfactuals: LIME might generate a home loan in the Sticky dataset which has an interest rate of 12%, monthly repayments of $50, and a balance of $1M. This loan clearly cannot exist, yet LIME will generate this scenario and many other impossible combinations in order to find its “local area”. SHAP operates similarly, producing a K-Means clustering and then generating every possible combination of feature values from the cluster means. Nothing forces these generated points to be realistic. Elula’s proprietary explainer algorithms do not rely on these kinds of unrealistic counterfactuals.
Because LIME and SHAP rely on these unrealistic scenarios, when they ask “Why was account X predicted to churn compared to other accounts like X?”, they will get incorrect answers, because they do not know what “other accounts like X” means! They will treat some of these impossible scenarios as “other accounts like X” and compare them to X. This makes the explanations confusing. Suppose X has an interest rate of 8% and monthly repayments of $1000. If X is compared to the loan with a rate of 12% and repayments of $50, the explanation could claim that X is predicted to churn because its interest rate is too low (8% compared to 12%) and its repayments are too high ($1000 compared to $50). But this is clearly contradictory, and is unlikely to be the actual reason X was predicted to churn.
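The effect of perturbing features independently can be illustrated with a small simulation. This is a simplified sketch, not the sampler used inside LIME or SHAP: the loan data is synthetic, the repayment formula is the standard amortisation formula for a 30-year principal-and-interest loan, and `is_realistic` with its 5% tolerance is a hypothetical check introduced only for this example.

```python
import random


def monthly_repayment(balance, annual_rate, years=30):
    """Amortised monthly repayment for a principal-and-interest loan."""
    r = annual_rate / 12
    n = years * 12
    return balance * r / (1 - (1 + r) ** -n)


def is_realistic(balance, rate, repayment, tolerance=0.05):
    """Hypothetical check: does the repayment match the balance and rate?"""
    expected = monthly_repayment(balance, rate)
    return abs(repayment - expected) <= tolerance * expected


rng = random.Random(0)

# Synthetic "real" loans: the repayment follows from balance and rate.
loans = []
for _ in range(1000):
    balance = rng.uniform(100_000, 1_000_000)
    rate = rng.uniform(0.02, 0.08)
    loans.append((balance, rate, monthly_repayment(balance, rate)))


def perturb(loans, rng):
    """Draw each feature from its own column, ignoring the other columns."""
    return (
        rng.choice(loans)[0],  # balance from one loan
        rng.choice(loans)[1],  # rate from another
        rng.choice(loans)[2],  # repayment from a third
    )


perturbed = [perturb(loans, rng) for _ in range(1000)]

real_share = sum(is_realistic(*loan) for loan in loans) / len(loans)
perturbed_share = sum(is_realistic(*p) for p in perturbed) / len(perturbed)
print(f"realistic among real loans:      {real_share:.0%}")
print(f"realistic among perturbed loans: {perturbed_share:.0%}")
```

Every synthetic loan passes the check by construction, while most of the independently perturbed loans fail it, because sampling each column separately discards the relationship between balance, rate, and repayment.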
Problem 3: Complex parameters
SHAP and LIME also suffer from complex parameterisation. When you build a LIME explainer, you must supply a kernel width parameter to tell the explainer how “wide” the local area is. The problem is that it is not at all obvious what this parameter should be. Similarly, the SHAP explainer must be told how many clusters to create in its internal K-Means algorithm. You can’t tell beforehand which value will produce better explanations; you simply have to test. Unfortunately, as you try multiple values, you will see different explanations without any good way to evaluate which explanation better fits the model. There is no obvious way to find the “best” parameter; it comes down to the subjective judgement of the data scientist, and subjectivity is fraught with problems and bias. At Elula, we have developed a framework for evaluating explanations, allowing us to effectively tune the parameters of explainer models.
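The sensitivity to kernel width can be seen in a stripped-down local surrogate. The sketch below is an assumption-laden stand-in for LIME, not its implementation: it uses a single standardised feature, a hypothetical `black_box` model, standard normal perturbations, and an exponential distance kernel, then fits a weighted straight line around the instance being explained.

```python
import math
import random


def black_box(x):
    """Hypothetical model output for a single standardised feature."""
    return x ** 2


def local_slope(x0, kernel_width, n_samples=5000, seed=0):
    """Slope of a distance-weighted linear surrogate fitted around x0."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    ys = [black_box(x) for x in xs]
    # Exponential kernel: perturbations far from x0 get less weight.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    # Closed-form weighted least squares for one feature.
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return cov / var


# Explain the same instance (x0 = 1) with three different kernel widths.
slopes = {width: local_slope(1.0, width) for width in (0.1, 1.0, 100.0)}
for width, slope in slopes.items():
    print(f"kernel width {width}: local weight {slope:.2f}")
```

With a narrow kernel the weight is close to 2, the true local slope of this model at x0 = 1; with a very wide kernel it is close to 0, because the surrogate is then fitting the whole symmetric curve. Same model, same instance, very different "importance", and nothing in the output says which width to believe.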
Problem 4: Outputs are not unique
To be able to understand and act on an explanation, the results produced need to be consistent. However, LIME is not consistent. When you generate a LIME explanation for a particular prediction, you are not guaranteed a unique solution. This means that if you run it multiple times on the same predicted instance, you can end up with a different explanation each time: sometimes significantly different! This indicates that the algorithm is neither robust nor reliable. At Elula, repeatability is a core business requirement for our product. Imagine if you asked someone to explain why they made a certain decision, and each time you asked they gave you a different answer. You would not trust the explanations this person gives, yet part of the core purpose of an explainer is to build trust in a prediction model. If LIME is not repeatable, then it should not be trusted, and it is not fit for purpose in a business context.
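The source of this instability is the random sampling itself. As a rough illustration (again a simplified stand-in, not LIME's code), the sketch below fits a distance-weighted linear surrogate over two standardised features to a hypothetical `black_box` model, using a fresh random sample on each run. The feature names, sample size, and kernel width are assumptions chosen for the example.

```python
import math
import random


def black_box(rate, balance):
    """Hypothetical model score over two standardised features."""
    return rate * balance


def explain(instance, seed, n_samples=200, kernel_width=1.0):
    """Feature weights of a weighted linear surrogate fitted around `instance`."""
    rng = random.Random(seed)
    rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_samples)]
    ys = [black_box(a, b) for a, b in rows]
    ws = [
        math.exp(
            -((a - instance[0]) ** 2 + (b - instance[1]) ** 2) / kernel_width ** 2
        )
        for a, b in rows
    ]
    total = sum(ws)
    mean_a = sum(w * a for w, (a, _) in zip(ws, rows)) / total
    mean_b = sum(w * b for w, (_, b) in zip(ws, rows)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    # Weighted sums for the 2x2 normal equations.
    s_aa = sum(w * (a - mean_a) ** 2 for w, (a, _) in zip(ws, rows))
    s_bb = sum(w * (b - mean_b) ** 2 for w, (_, b) in zip(ws, rows))
    s_ab = sum(w * (a - mean_a) * (b - mean_b) for w, (a, b) in zip(ws, rows))
    s_ay = sum(
        w * (a - mean_a) * (y - mean_y) for w, (a, _), y in zip(ws, rows, ys)
    )
    s_by = sum(
        w * (b - mean_b) * (y - mean_y) for w, (_, b), y in zip(ws, rows, ys)
    )
    det = s_aa * s_bb - s_ab ** 2
    return {
        "rate": (s_ay * s_bb - s_ab * s_by) / det,
        "balance": (s_aa * s_by - s_ab * s_ay) / det,
    }


# Same model, same instance, three independent runs.
runs = [explain((1.0, 1.0), seed) for seed in range(3)]
for seed, weights in enumerate(runs):
    print(seed, weights)

# Re-running with an identical seed reproduces the first explanation.
repeat = explain((1.0, 1.0), seed=0)
```

At the instance (1, 1) both features matter equally to this model, so any difference between the two weights within a run, or between runs, comes from sampling noise. Fixing the seed makes a run reproducible, but it only hides the dependence on the particular random sample; it does not remove it.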
In this series we have explored some of the limitations of common XAI algorithms. We have sought to demonstrate and highlight the issues and risks associated with the popular publicly available algorithms which are not fit to be deployed in a business context. They have significant weaknesses which are often overlooked, and the algorithms are used without a deep understanding of what they are actually doing “under the hood”.
At Elula, we have developed XAI solutions to these problems, deploying proprietary explainer algorithms which resolve these technical limitations, are easily human-readable and understandable, and go beyond feature importances to deliver actionable insights to our customers. In our next blog we will go on to share how we combat these problems.