A Human-Grounded Evaluation of SHAP for Alert Processing


Authors: Hilde J.P. Weerts, Werner van Ipenburg, Mykola Pechenizkiy

Hilde J.P. Weerts (h.j.p.weerts@student.tue.nl), Eindhoven University of Technology, Eindhoven, The Netherlands
Werner van Ipenburg (werner.van.ipenburg@rabobank.nl), Rabobank, Zeist, The Netherlands
Mykola Pechenizkiy (m.pechenizkiy@tue.nl), Eindhoven University of Technology, Eindhoven, The Netherlands

ABSTRACT

In the past years, many new explanation methods have been proposed to achieve interpretability of machine learning predictions. However, the utility of these methods in practical applications has not been researched extensively. In this paper we present the results of a human-grounded evaluation of SHAP, an explanation method that has been well-received in the XAI and related communities. In particular, we study whether this local model-agnostic explanation method can be useful for real human domain experts to assess the correctness of positive predictions, i.e. alerts generated by a classifier. We performed experimentation with three different groups of participants (159 in total), who had basic knowledge of explainable machine learning. We performed a qualitative analysis of recorded reflections of experiment participants performing alert processing with and without SHAP information. The results suggest that the SHAP explanations do impact the decision-making process, although the model's confidence score remains a leading source of evidence. We statistically test whether there is a significant difference in task utility metrics between tasks for which an explanation was available and tasks in which it was not provided. As opposed to common intuitions, we did not find a significant difference in alert processing performance when a SHAP explanation is available compared to when it is not.

KEYWORDS

explainable predictive analytics, human-grounded evaluation, human-computer interaction

Reference Format: Hilde J.P. Weerts, Werner van Ipenburg, and Mykola Pechenizkiy. 2019. A Human-Grounded Evaluation of SHAP for Alert Processing. In Proceedings of KDD Workshop on Explainable AI (KDD-XAI '19).

1 INTRODUCTION

Complex, black-box machine learning algorithms are increasingly applied in fields ranging from medicine to finance. In many contexts, the predictions are input for human decision-makers. Consequently, it is thought to be useful, if not crucial, to understand why a machine learning model made a prediction. However, as the complexity of the models increases, it becomes more difficult for humans to understand their behavior. To tackle this issue, recent efforts in explainable artificial intelligence (XAI) have resulted in many new explanation methods. However, the utility of these approaches in practical scenarios has not been researched extensively. Often, claims about interpretability and utility are based on proxies that have not been evaluated with real humans [1, 8]. As explanations need to be interpreted by real humans, a strong but solely theoretical foundation is no guarantee for utility.

In this work, we present a human-grounded evaluation to determine the utility of Shapley Additive Explanations (SHAP) for domain experts who assess the correctness of predictions, such as in medical diagnosis and fraud detection. In particular, we consider the utility for assessment of positive predictions, which we refer to as alert processing. SHAP is a state-of-the-art feature contribution method for explaining individual predictions [13, 18].
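For reference, SHAP belongs to the class of additive feature attribution methods introduced in [13]: for an instance $x$ with $M$ features, each feature value receives a contribution $\phi_i$ (its SHAP value) such that the contributions, together with the base value $\phi_0$, sum to the model output for that instance (the "local accuracy" property of [13]):

$$ f(x) \;=\; \phi_0 + \sum_{i=1}^{M} \phi_i, \qquad \phi_0 = \mathbb{E}\big[f(X)\big]. $$

In the alert processing setting studied here, $f(x)$ is the model's confidence score for the alert, $\phi_0$ is the average confidence score, and a large positive $\phi_i$ marks a feature value that pushes the score towards raising an alert.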
To determine the utility of SHAP, we perform two experiments in which real humans perform simplified alert processing tasks while alternately being provided with SHAP explanations.

Methods. Real humans performed simplified alert processing tasks, with and without an explanation of the model's prediction. Our approach is two-fold: (1) we statistically test whether there is a significant difference in task utility metrics between tasks for which an explanation was available and tasks in which it was not provided, and (2) we analyze the participants' written reasoning to determine the impact of different sources of evidence on the decision-making process, including the explanation.

Main findings. In contrast to common assumptions, we did not find a significant difference in alert processing performance between tasks for which a SHAP explanation was shown and tasks for which it was not shown. Our results suggest that SHAP explanations alone are possibly not that useful for alert processing. On the other hand, our qualitative analysis of the participants' reasoning during alert processing suggests that SHAP does affect the decision-making process. We speculate that combining SHAP-based explanations with other techniques may provide higher utility for such tasks.

Outline. The present paper is structured as follows. In Section 2, we discuss related work on evaluating explanations of predictive modeling. In Section 3, we introduce several ways in which SHAP values may improve task utility as well as the corresponding hypotheses we formulated for the user study. Sections 4 and 5 cover the experiment setup and results of the first and second experiment, respectively. In Section 6 we present our concluding remarks.

2 RELATED WORK

Calling for more rigorous evaluations of XAI, Doshi-Velez and Kim [1] introduce a three-level taxonomy for evaluating explanations: application-grounded, human-grounded, and functionally grounded. Application-grounded evaluations consider the evaluation of real applications with expert users. Human-grounded evaluations consider real humans performing simplified tasks that either require or can benefit from interpretability. Functionally grounded evaluations use formal proxies of interpretability and do not require research with real human users.

Our study is an example of a human-grounded evaluation. Although only a few works address this type of evaluation, we can identify several evaluation procedures that are commonly applied. Output verification can be used to compare the interpretability of models [e.g. 6, 8, 9]. In this evaluation procedure, participants are asked to verify whether an output is consistent with the model. Another common approach for evaluating the interpretability of an explanation is forward simulation [e.g. 4, 6, 16, 18]. In forward simulations, it is determined how well humans can predict the behavior of the model after being exposed to an explanation [10]. Additionally, Doshi-Velez and Kim [1] suggest that different explanations can be evaluated by means of binary forced choice, in which humans review two alternative explanations of the same model and choose the best alternative. An evaluation approach closely related to the present paper is identification of incorrect behavior. For example, Ribeiro et al. [17] study whether different explanation methods allow users to identify which classifier is likely to generalize to a real-world context.
The capability of users to identify particular cases in which the model makes a wrong prediction under conditions with varying global interpretability was evaluated in [16]. In contrast to common assumptions, the authors found that exposing a model's global internals decreased people's ability to detect mistakes for unusual instances. This result shows the importance of user testing for validating intuitions on the utility of XAI.

SHAP explanations have been previously evaluated with human users in two ways. A forward simulation experiment, in which SHAP explanations significantly increased predictive performance, was reported in [18]. Other recent studies [e.g. 12, 13] show that among several explanation techniques, SHAP corresponds best with human intuitions of a simple decision tree model. Although these results provide evidence on the interpretability of SHAP, it does not directly follow that SHAP is useful for alert processing.

3 HYPOTHESES

In this section, we formulate research hypotheses on the utility of SHAP. To this end, we identify cases in which SHAP values might affect task performance for alert processing. We measure task performance in terms of task effectiveness, task efficiency, and mental efficiency, each of which could be impacted by providing a user with an explanation.

Task effectiveness refers to the extent to which the system helps the user to perform the task more effectively. For alert processing tasks, task effectiveness can be expressed as accuracy: the proportion of tasks in which the participant correctly distinguished between true positives and false positives.

Compared to just providing a model's confidence score, SHAP explanations could increase task effectiveness by increasing the ability of the user to assess the model's credibility. For example, if a certain feature contributes substantially to the model's belief that an instance is positive, but the user assesses this reasoning as counter-intuitive, the user will be more likely to question the model's prediction. Conversely, but equally useful, the SHAP explanation may point the domain expert towards important feature values they would not have considered without being exposed to the explanation.

Hypothesis 1. SHAP explanations increase task effectiveness of alert processing compared to the model's confidence scores alone, depending on the reasonableness of the explanation.

Task efficiency refers to the extent to which the system helps the user to perform a task more efficiently, which we express as time spent on the task. Because SHAP explanations reveal which feature values are relevant for the model's decision, they can be used to determine whether the model's explanation is reasonable given domain knowledge. If it is, the domain expert might be able to process the instance more quickly.

Hypothesis 2. If the number of features of an instance is sufficiently large, SHAP explanations increase task efficiency in alert processing compared to the model's confidence scores alone, depending on the reasonableness of the explanation.

Lastly, mental efficiency refers to the mental resources required to perform a task. Mental efficiency can be measured as self-reported mental effort [14]. According to cognitive load theory, working memory has a limited capacity. For low-dimensional instances, SHAP explanations are unlikely to improve mental efficiency. Given the increase in information, they may even increase mental effort.
On the other hand, a complex instance with many (tabular) features is unlikely to fit into working memory. In such cases, SHAP explanations may help the domain expert to focus on a subset of features that does fit into working memory, unless the explanation is unreasonable.

Hypothesis 3. If the number of features of an instance is sufficiently large, SHAP explanations increase mental efficiency in alert processing compared to the model's confidence scores alone, depending on the reasonableness of the explanation.

We test our hypotheses by means of two user experiments in which participants perform alert processing tasks while alternately being provided with SHAP explanations. In the first experiment, we measure the added value of SHAP explanations for each participant compared to the prediction probability alone. In the second experiment, we first explicitly quantify to what extent SHAP explanations agree with human intuitions. Subsequently, we measure the difference in task performance between participants who are provided with an explanation and those who are not. Moreover, we measure whether the difference depends on the extent to which the explanation aligns with human intuition.

4 EXPERIMENT 1: WITHIN-SUBJECT DESIGN

The first experiment is designed to measure the added value of SHAP explanations compared to the model's confidence score alone. In order to mitigate user-specific effects, we adhere to a within-subject experiment design.

4.1 Experiment Procedure

The first experiment consists of alert processing tasks followed by a written reflection. The alert processing tasks are performed in two rounds. The first round is designed for measuring mental efficiency, the second round for measuring task effectiveness.

Alert Processing Tasks. In each alert processing task, the participant is provided with an instance the classifier classified as positive. The participant is asked to predict the true class label and how much mental effort they invested to get to their answer. Each task belongs either to the SHAP or NoSHAP condition. In each task, participants are provided with the model's average confidence score (i.e. base value), the confidence score for the current instance, and the feature values of the instance. In the SHAP condition, the participants are also provided with a SHAP explanation (see Figure 1).

[Figure 1: Example of an alert processing task in the SHAP condition. In the NoSHAP condition, only the left part of the figure is shown.]

Round 1: Mental Efficiency. Participants are provided with two sets of five instances, A and B. Each instance in set A is in the NoSHAP condition, whereas each instance in set B is in the SHAP condition. The two sets are shown in order (see Figure 2).

[Figure 2: Setup of the mental efficiency round in Experiment 1. The letter (A or B) indicates the instance set, the number (1-5) the instance in the set. The color of the box indicates whether SHAP values are provided (white) or not (black).]

Round 2: Task Effectiveness. Participants are provided with one set of ten instances the model predicted to be positive. Each instance is shown twice: the first time in the NoSHAP condition, the second time in the SHAP condition (see Figure 3). In this setup, any improvement or decrease in task performance will be due to the additional information, which means that we can account for the difficulty of the instances.
Note that this setup is not suitable for measuring the difference in mental effort, because the same instance is shown twice.

Participants' Written Reflections. After performing the alert processing tasks, participants discuss their results and experiences in small groups. In their written reflections, the groups discuss, among other things, whether and how it would have been possible to distinguish between false positives and true positives for each of the ten instances of round 2 (task effectiveness).

[Figure 3: Setup of the task effectiveness round in Experiment 1. The number indicates the instance. The color of the box indicates whether SHAP values are provided (white) or not (black).]

4.2 Experiment Details

Choosing an appropriate classification task for this user experiment is not trivial. On the one hand, the classification task should be non-trivial for humans. On the other hand, participants should have some domain knowledge about the data set to be able to argue about the reasonableness of an explanation. In the first experiment, we use the well-known Adult data set from the UCI repository [2], which we retrieved from OpenML [19]. The related classification task is to predict whether the income of a person exceeds $50,000 per year based on census data. We select five features and train a random forest classifier using the implementation of scikit-learn [15]. SHAP values were computed using the exact TreeSHAP algorithm proposed and implemented by Lundberg et al. [12]. A total of 102 students enrolled in an undergraduate introductory machine learning course participated in the first experiment.
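As a rough illustration of this setup (not the authors' code), the snippet below loads the Adult data from OpenML, trains a scikit-learn random forest, and computes TreeSHAP explanations for the positive predictions. The particular five features, the split, and the forest's hyperparameters are assumptions, and the shap library may return the per-class contributions in a slightly different shape depending on its version.

```python
import shap
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Adult data set from OpenML; this particular feature subset is an assumption.
features = ["age", "education-num", "capital-gain", "hours-per-week", "marital-status"]
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data[features].apply(
    lambda col: col.cat.codes if col.dtype.name == "category" else col
)
y = (adult.target == ">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Exact TreeSHAP explanations for the alerts (instances predicted positive).
explainer = shap.TreeExplainer(model)
alerts = X_test[model.predict(X_test) == 1]
shap_values = explainer.shap_values(alerts)[1]   # contributions towards the positive class
base_value = explainer.expected_value[1]         # the model's average confidence score
print("base value:", base_value)
print("contributions for the first alert:", dict(zip(features, shap_values[0])))
```

A force-plot style visualisation similar in spirit to Figure 1 could then be produced with shap.force_plot(base_value, shap_values[0], alerts.iloc[0]).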
4.3 Results of Experiment 1

To account for multiple testing and retain a family-wise error rate of α = 0.05, we apply the Bonferroni correction. Accordingly, for each of the tests in the present work, we use a significance level of α = 0.005.

Analysis of Task Effectiveness (Hypothesis 1).

Method. We measure task effectiveness through the proportion of correctly identified false positives and true positives. We test for a difference in the participants' accuracy before and after SHAP values are shown using McNemar's test. A post-hoc two one-sided equivalence test (TOST) procedure is used to assert whether the observed accuracy is equivalent [11]. We use an equivalence interval of [-0.05, 0.05], i.e. the accuracy of participants in the two conditions is considered equivalent if the difference in accuracy is smaller than 0.05.

Results. We fail to reject the null hypothesis at a significance level of α = 0.005 (χ²(1, N=978) = 0.890, p = 0.346). Hence, it can be concluded that the difference in proportion of correct answers between SHAP (M=0.61, SD=0.49) and NoSHAP (M=0.59, SD=0.49) was not statistically significant. Moreover, the null hypothesis of the post-hoc equivalence test is rejected at α = 0.005 (z = -3.20, p_l < .001; z = 5.27, p_u < .001). Hence, we can conclude that there did not exist a meaningful difference in accuracy between the two conditions.

Analysis of Mental Efficiency (Hypothesis 3).

Method. Mental efficiency is measured through self-reported mental effort, using the 9-point Likert scale introduced by Paas [14]. As some instances may require much more mental effort than others, regardless of the SHAP condition, we first perform a one-way ANOVA to determine whether mental effort invested in particular tasks was significantly different from other tasks within either the SHAP or NoSHAP condition. These tasks are excluded from the remainder of the analysis. Subsequently, we test for a difference in average mental effort spent in SHAP and NoSHAP by means of a two-sided paired t-test. A post-hoc TOST procedure using one-sided paired t-tests is used to assert whether the average mental effort in both conditions is equivalent, considering equivalence interval [-0.5, 0.5].

Results. For samples related to the NoSHAP condition, none of the tasks had a significantly higher mental effort (F(4, 404) = 1.89, p = 0.11). For the SHAP condition, we do find a significant effect (F(4, 404) = 19.02, p = 0.00), and a multiple-comparison post-hoc analysis revealed that tasks 2 and 4 required significantly less mental effort than the other questions in the SHAP condition. Hence, data points related to these tasks are not considered in the paired t-test. After asserting that the assumptions of normally distributed differences and the absence of outliers are met, we fail to reject the null hypothesis at α = 0.005 (t(100) = -0.66, p = 0.51), which means that the difference in average mental effort in SHAP (M=4.81, SD=1.37, SE=0.14) and NoSHAP (M=4.74, SD=1.25, SE=0.12) was not statistically significant. Moreover, the null hypothesis of the equivalence test is rejected (t_l(100) = -4.37, p_l < .001; t_u(100) = 5.68, p_u < .001). Hence, we can conclude that there does not exist a meaningful difference in invested mental effort.
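For concreteness, the tests above map onto standard Python tooling roughly as sketched below. This is an illustration under assumed data structures (paired per-task correctness flags and per-participant mean effort ratings), not the authors' analysis code; the equivalence test for accuracy in the paper follows the McNemar-based procedure of [11], for which only the contingency table construction is shown here.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.weightstats import ttost_paired

def task_effectiveness_test(correct_noshap, correct_shap):
    """McNemar's test on paired correct/incorrect answers (Hypothesis 1).
    correct_noshap / correct_shap: boolean arrays, one entry per answered task."""
    table = np.array([
        [np.sum(correct_noshap & correct_shap), np.sum(correct_noshap & ~correct_shap)],
        [np.sum(~correct_noshap & correct_shap), np.sum(~correct_noshap & ~correct_shap)],
    ])
    return mcnemar(table, exact=False, correction=False)

def mental_efficiency_tests(effort_noshap, effort_shap):
    """Two-sided paired t-test plus TOST equivalence test on mean mental effort
    (Hypothesis 3). The equivalence interval [-0.5, 0.5] follows the paper."""
    t_test = ttest_rel(effort_shap, effort_noshap)
    tost_p, lower, upper = ttost_paired(effort_shap, effort_noshap, low=-0.5, upp=0.5)
    return t_test, tost_p
```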
Analysis of Recorded Participants' Reflections.

Method. A content analysis of the written reflections is performed by means of the grounded theory approach. Each of the reflection reports is coded with regard to the pieces of evidence that were used to make a decision about the true class of each of the instances.

Results. In total, 22 reports are analyzed. Ten types of evidence are identified, which can be further categorized into four main categories: SHAP values, feature values, the model's confidence score, and similar instances. The primary source of evidence described in the written reflections is the instance itself, i.e. its feature values. After feature values, the model's confidence score is mentioned most often. SHAP explanations are used in three different ways. If an explanation is intuitive, participants see this as evidence of the correctness of the prediction. If an explanation is counter-intuitive, participants typically "adjust" the confidence score accordingly. For example, one of the groups argues that: "the probability might be on the lower side but the SHAP values show that it is mainly brought down by having a positive capital gain which is quite counter-intuitive." In rare cases, participants change their initial beliefs based on the SHAP explanation. Finally, the true class of a similar instance is sometimes mentioned as evidence.

We would like to stress that the written reflections were performed in hindsight. Hence, the results do not necessarily reflect the participants' behavior during the alert processing tasks.

4.4 Conclusion from Experiment 1

There was no significant nor practically relevant difference (larger than 0.05) in task accuracy before SHAP values were shown and after they were shown. Additionally, there was no significant nor meaningful difference (larger than 0.5) in average self-reported mental effort between instances for which SHAP values were provided and instances for which SHAP values were not provided. From the written reflections, we can conclude that apart from the instance's feature values, the leading source of evidence of the true class of an instance was the model's confidence score. This can be alarming, because raw confidence scores are often poorly calibrated; i.e., the predicted probabilities often do not correspond to the true frequencies [7]. Consequently, confidence scores may be misleading for domain experts.

5 EXPERIMENT 2: CROSSOVER DESIGN

In the second experiment, we measure the difference in task utility metrics when SHAP values are shown compared to when they are not shown. Recall that Hypotheses 1, 2, and 3 include a notion of reasonableness of the SHAP explanation. In order to quantify the extent to which a SHAP explanation aligns with human intuition, the second experiment is preceded by a pretest experiment in which we measure human-assigned feature value contributions. The experiment setup is adapted from a within-subject to a crossover design.

5.1 Experiment Procedure

The experiment setup of the second experiment includes both a pretest experiment and a main experiment.

Pretest Experiment. In order to quantify to what extent SHAP explanations align with human intuitions, we ask participants to assign contributions to feature values of in total 20 instances. For each instance, participants are asked to explicitly indicate to what extent they believe a particular feature value would make it more unlikely, more likely, or would have no impact on the probability of belonging to the positive class. The participants are randomly assigned to two groups, group 1 and group 2. Ten of the instances are evaluated by group 1, the other ten by group 2. After the data collection, the human-assigned contributions are compared to the corresponding SHAP explanations.

Main Experiment. In the main experiment, we adhere to a crossover design (see Figure 4). Each participant is randomly assigned to either group 1 or group 2. Two sets of instances are considered in the alert processing tasks, set A and set B. Group 1 views instance set A in the SHAP condition and set B in the NoSHAP condition. Conversely, group 2 sees set A in the NoSHAP condition and set B in the SHAP condition. In addition to the questions asked in the previous experiment, the participants are asked to provide their reasoning directly after each task, i.e. why they believe a particular instance is a false positive or a true positive.

The crossover design has several advantages over the within-subject design of the previous experiment. First of all, in the previous experiment setup, participants had the option to change their answer after being exposed to a SHAP explanation. In the current setup, all information is always shown at once, which better resembles alert processing in a decision support scenario. Second, the new setup allows us to measure all hypotheses in the same experiment.

[Figure 4: Setup of Experiment 2. The letter (A or B) indicates the instance set, the number (1-5) the instance in the set. The color of the box indicates whether SHAP values are provided (white) or not (black).]
5.2 Experiment Details

In the second experiment, the UCI Students Academic Performance data set is used [5]. This data set contains student performance for mathematics in secondary education at two Portuguese schools. The associated prediction task is to predict a student's grade in mathematics based on a number of features, including e.g. age and number of previous failures. We convert the regression task to a classification task based on the minimum grade required to pass a course and retain the 13 most predictive features. Compared to the first experiment, the number of features is thus increased from five to thirteen. Similar to the previous experiment, we split the data set into a training and test set and trained a random forest classifier.

A total of 20 undergraduate and graduate computer science or industrial engineering students participated in the pretest experiment. The main experiment was executed twice. In total, 57 people participated in the main experiment, consisting mainly of graduate and PhD students majoring in computer science or data science and having basic knowledge of XAI and SHAP.

5.3 Results of Experiment 2

Analysis of Agreement Between SHAP and Human Intuition (Pretest Experiment).

Method. For each instance, we quantify the agreement between human-assigned contributions and SHAP values as the average correlation. As SHAP explanations typically contain only a few large values and many smaller ones, it is desirable that higher weight is given to the top and bottom ranks. Hence, we use a weighted signed rank correlation.

Results. The average SHAP agreement differs across instances but is typically not much lower than 0. For most instances, the model made a correct prediction and the corresponding SHAP explanations agree strongly with human intuitions (average agreement R̄_i > 0.20).
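The paper does not spell out the exact weighting scheme, so the sketch below is only one plausible reading: a weighted correlation of the signed ranks of the two attribution vectors, with weights that grow towards the top and bottom ranks. The function, the coding of human judgements as -1/0/+1, and the specific weights are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def weighted_signed_rank_corr(shap_vals, human_vals):
    """Weighted correlation of signed ranks, emphasising extreme ranks.
    shap_vals: SHAP values of one instance; human_vals: human-assigned
    contributions, e.g. coded as -1 (more unlikely), 0 (no impact), +1 (more likely)."""
    r1, r2 = rankdata(shap_vals), rankdata(human_vals)
    mid = (len(shap_vals) + 1) / 2.0
    w = np.abs(r1 - mid) + np.abs(r2 - mid) + 1.0   # heavier weight for top/bottom ranks
    w = w / w.sum()
    m1, m2 = np.sum(w * r1), np.sum(w * r2)
    cov = np.sum(w * (r1 - m1) * (r2 - m2))
    return cov / np.sqrt(np.sum(w * (r1 - m1) ** 2) * np.sum(w * (r2 - m2) ** 2))
```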
Analysis of Task Effectiveness (Hypothesis 1).

Method. Given the crossover design, we require a more sophisticated statistical test than in the previous experiment. For Hypothesis 1, we use a generalized linear mixed model (GLMM) with a logit link function. We include fixed effects for the SHAP condition and for agreement with human intuition. Following our hypothesis, we add an interaction effect between condition and agreement, and random effects for the participant and the alert processing task.

Results. Neither the coefficient corresponding to SHAP (M=0.12, 95% CI = [-0.71, 0.96], p = 0.76), agreement with human intuition (M=-7.17, 95% CI = [-13.47, -0.87], p = 0.026), nor their interaction effect (M=0.70, 95% CI = [-1.98, 3.38], p = 0.61) was statistically different from zero.

Analysis of Task Efficiency (Hypothesis 2).

Method. A linear mixed effects model is used to test for differences in speed. Again, we include the effect of the SHAP condition, agreement with human intuition, and the interaction effect between SHAP and agreement.

Results. Even after data transformations, normality of the residuals could not be assumed. Hence, no conclusions can be made with regard to task efficiency.

Analysis of Mental Efficiency (Hypothesis 3).

Method. Mental efficiency is measured as self-reported mental effort. A linear mixed effects model is used, including the same effects as for the previous task performance metrics.

Results. The assumption of normally distributed residuals is reasonable. The SHAP main effect did not significantly differ from zero at α = 0.005 (M=0.27, 95% CI = [-0.11, 0.65], p = 0.167), and neither did the coefficient of rank agreement (M=-0.35, 95% CI = [-1.85, 1.15]) nor the interaction effect between SHAP and agreement (M=-0.94, 95% CI = [-1.16, 0.272], p = 0.129).
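A rough Python analogue of the mental efficiency model (Hypothesis 3) is sketched below with statsmodels. The column names and data layout are assumptions; moreover, MixedLM here combines a random intercept per participant with a variance component for the task, which only approximates the crossed random effects described above, and the logit-link GLMM for Hypothesis 1 is more commonly fitted with, e.g., R's lme4.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per answered task with columns
# participant, task, shap (0/1), agreement (pretest agreement score), effort (Paas rating).
df = pd.read_csv("experiment2_responses.csv")  # hypothetical file name

model = smf.mixedlm(
    "effort ~ shap * agreement",          # fixed effects: condition, agreement, interaction
    data=df,
    groups=df["participant"],             # random intercept per participant
    vc_formula={"task": "0 + C(task)"},   # variance component for the alert processing task
)
print(model.fit().summary())
```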
Analysis of Recorded Participants' Reasoning.

Recall that after each alert processing task, the participants are asked to articulate why they believed a certain instance was a false positive or a true positive.

Method. After several pre-processing steps, including stop word removal and lemmatization, the replies are converted to a vector representation that indicates the presence of each term in each of the replies. For each combination of an instance and condition, the proportion of replies that contains a particular term is computed. Subsequently, for each of the instances, the proportions in the SHAP and NoSHAP conditions are compared. If the absolute percentage point difference between the two proportions is larger than 0.2, the corresponding replies are further inspected manually.

Results. In total, 20 terms were further inspected. It became clear that some of the differences were due to different wordings (e.g. study time versus studytime) and in some cases terms were mentioned because the participants did not agree with the SHAP explanation (e.g. "I do not take into account that much the failures = 0"). However, in five of the fourteen instances, the participants' reasoning did seem to be affected by the SHAP explanation (see Table 1). In most of these cases, feature values of the instance are taken into account more heavily when presented with a relatively large SHAP value for that feature value (the largest absolute SHAP values in this data set typically ranged between 0.07 and 0.11). We have not identified any cases in which feature values that were taken into account by people in the NoSHAP condition were not taken into account by participants in the SHAP condition.

Table 1: Proportion of replies in which a feature was discussed in the NoSHAP and SHAP conditions.

Task ID   Feature               SHAP value   NoSHAP   SHAP
A2        number of absences     0.08        2/20     11/29
A4        previous failure       0.05        5/20     11/25
A7        higher education      -0.08        1/17     8/27
B3        number of absences     0.04        3/30     8/22
B6        paid math classes     -0.01        1/28     4/16
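The term-proportion comparison can be sketched as follows. This is an illustrative pipeline with assumed inputs (lists of free-text replies per condition for one instance), not the authors' preprocessing code; the lemmatization step is omitted here for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def terms_to_inspect(replies_noshap, replies_shap, threshold=0.2):
    """Terms whose presence proportion differs between the two conditions by more
    than `threshold` (absolute difference), for a single instance."""
    vec = CountVectorizer(stop_words="english", binary=True)
    X = vec.fit_transform(list(replies_noshap) + list(replies_shap)).toarray()
    n = len(replies_noshap)
    prop_noshap = X[:n].mean(axis=0)    # share of NoSHAP replies containing each term
    prop_shap = X[n:].mean(axis=0)      # share of SHAP replies containing each term
    diff = np.abs(prop_shap - prop_noshap)
    terms = np.array(vec.get_feature_names_out())
    flagged = diff > threshold
    return sorted(zip(terms[flagged], diff[flagged]), key=lambda t: -t[1])
```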
5.4 Conclusion from Experiment 2

No significant differences in task effectiveness, task efficiency, and mental efficiency were measured when SHAP values were shown compared to when they were not available to the participants. From the analysis of the textual replies, it can be concluded that large SHAP values did affect the reasoning applied by our participants. These results suggest that large SHAP values can bring to attention feature values of the instance that are otherwise ignored.

6 CONCLUSIONS

XAI and related research communities have become productive in developing new interpretable machine learning methods. However, the evaluation of these methods often remains limited. The results of the present paper suggest that it is important to perform evaluations with real users, rather than to rely on intuitions about utility. In neither of the two experiments could it be concluded that SHAP explanations significantly impact task utility, measured as task effectiveness and mental efficiency. Therefore, we cannot conclude that SHAP explanations are useful for human experts performing alert processing tasks. The post-hoc equivalence tests of our first experiment show that the failure to reject the null hypothesis was likely due to only a small difference in utility rather than a lack of data. This suggests that SHAP explanations alone are not that useful for alert processing.

Our analysis of the written reflections of participants of Experiment 1 has shown that, apart from the feature values themselves, the leading source of evidence was the model's confidence score. This is concerning, since confidence scores can be misleading. Our textual analysis of the participants' reasoning in Experiment 2 has shown that large SHAP values can bring to attention feature values that are otherwise ignored. This shows that even though we could not identify a significant difference in task utility, the SHAP explanations did have an impact on the participants' decision-making process.

Future Work. We intend to pursue this direction further by performing a deeper analysis of the gathered data. It would be interesting to identify subgroups in the data for which the difference between SHAP and NoSHAP is exceptionally large. These subgroups could be described e.g. in terms of instance attributes and participant attributes. Such an approach could result in new hypotheses regarding the effect of local explanations on different aspects of alert processing performance. We intend to adopt the exceptional model mining approach introduced in [3] to automate this search. Additionally, we would like to replicate the experiments with a classification task that contains a larger number of features and study how this affects task efficiency and mental efficiency.

REFERENCES

[1] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608
[2] Dheeru Dua and E. Karra Taniskidou. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[3] Wouter Duivesteijn, Tara Farzami, Thijs Putman, Evertjan Peer, Hilde J.P. Weerts, Jasper N. Adegeest, Gerson Foks, and Mykola Pechenizkiy. 2017. Have It Both Ways - From A/B Testing to A&B Testing with Exceptional Model Mining. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017. 114-126.
[4] Hiva Allahyari and Niklas Lavesson. 2011. User-oriented Assessment of Classification Model Understandability. Frontiers in Artificial Intelligence and Applications 227 (2011), 11-19. https://doi.org/10.3233/978-1-60750-754-3-11
[5] Sadiq Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi, and Najoua Ribata. 2018. Educational Data Mining and Analysis of Students' Academic Performance Using WEKA. Journal of Electrical Engineering and Computer Science 9, 2 (Feb. 2018), 447. https://doi.org/10.11591/ijeecs.v9.i2.pp447-459
[6] Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. 2011. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems 51, 1 (Apr. 2011), 141-154. https://doi.org/10.1016/j.dss.2010.12.003
[7] Volodymyr Kuleshov and Percy S. Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474-3482.
[8] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2019. An Evaluation of the Human-Interpretability of Explanation.
[9] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. 2016. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 1675-1684. https://doi.org/10.1145/2939672.2939874
[10] Zachary C. Lipton. 2016. The Mythos of Model Interpretability.
[11] Ying Lu and Judy A. Bean. 1995. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Statistics in Medicine 14, 16 (Aug. 1995), 1831-1839. https://doi.org/10.1002/sim.4780141611
[12] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individualized Feature Attribution for Tree Ensembles.
[13] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765-4774.
[14] Fred G. Paas. 1992. Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology 84, 4 (1992), 429-434.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[16] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2018. Manipulating and Measuring Model Interpretability.
[17] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY, USA, 1135-1144. https://doi.org/10.1145/2939672.2939778
[18] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647-665.
[19] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49-60. https://doi.org/10.1145/2641190.2641198
