Case-Based Reasoning for Assisting Domain Experts in Processing Fraud Alerts of Black-Box Machine Learning Models


Authors: Hilde J.P. Weerts, Werner van Ipenburg, Mykola Pechenizkiy

Hilde J.P. Weerts (h.j.p.weerts@student.tue.nl), Eindhoven University of Technology, Eindhoven, The Netherlands
Werner van Ipenburg (werner.van.ipenburg@rabobank.nl), Rabobank Nederland, Zeist, The Netherlands
Mykola Pechenizkiy (m.pechenizkiy@tue.nl), Eindhoven University of Technology, Eindhoven, The Netherlands

ABSTRACT

In many contexts, it can be useful for domain experts to understand to what extent predictions made by a machine learning model can be trusted. In particular, estimates of trustworthiness can be useful for fraud analysts who process machine learning-generated alerts of fraudulent transactions. In this work, we present a case-based reasoning (CBR) approach that provides evidence on the trustworthiness of a prediction in the form of a visualization of similar previous instances. Different from previous works, we consider similarity of local post-hoc explanations of predictions and show empirically that our visualization can be useful for processing alerts. Furthermore, our approach is perceived useful and easy to use by fraud analysts at a major Dutch bank.

KEYWORDS

explainable artificial intelligence, case-based reasoning, fraud detection, SHAP explanation similarity

Reference Format: Hilde J.P. Weerts, Werner van Ipenburg, and Mykola Pechenizkiy. 2019. Case-Based Reasoning for Assisting Domain Experts in Processing Fraud Alerts of Black-Box Machine Learning Models. In Proceedings of KDD Workshop on Anomaly Detection in Finance (KDD-ADF '19).

1 INTRODUCTION

Machine learning models are increasingly applied in real-world contexts, including detection of fraudulent transactions. Often, the best performing models are complex models such as deep neural networks and ensembles. Despite their excellent performance, these models are not infallible, and predictions may require post-processing by human domain experts.
However, as the complexity of the model increases, it can become more difficult for human domain experts to assess the correctness of a prediction. In such cases, domain experts who post-process predictions can benefit from evidence on the trustworthiness of the model's prediction. Moreover, this type of evidence could be useful to identify when the model is no longer trustworthy due to concept drift [3, 23], which is particularly relevant in the context of fraud detection.

A straightforward indicator of the trustworthiness of a prediction is the model's own reported confidence score. However, raw confidence scores are often poorly calibrated [8], which means they can be misleading for human domain experts. Furthermore, since the models are not perfect, and in the case of heavily imbalanced problems such as fraud detection far from perfect, even calibrated confidence scores can be inaccurate and hence misleading for the processing of fraud alerts. Recent explanation methods, such as SHAP [10], LIME [15], and Anchor [16], can potentially help domain experts determine to what extent the model's predictions can be trusted. For example, domain experts can look at the local feature importance of the alert that is being processed, or at a local surrogate model that mimics the behavior of the global model in the neighborhood of the alert. However, there is a lack of empirical evidence that would illustrate the utility of such approaches for alert processing tasks. Our recent user study on the utility of SHAP for processing alerts suggests that SHAP explanations alone do not contribute to better decision-making by domain experts [21].

Approach. In the present paper, we introduce a case-based reasoning (CBR) approach to provide domain experts with evidence on the trustworthiness of a prediction.
The proposed approach consists of two steps: (1) retrieve the k most similar instances to the query instance and (2) visualize the similarity as well as the true class of the retrieved neighbors. If the true class of similar instances corresponds to the prediction of the model, this provides evidence for the trustworthiness of the model's prediction, and vice versa. An important consideration of any nearest-neighbor type approach is the distance function. A straightforward notion of similarity in our scenario is similarity in feature values. However, instances with very similar feature values may be treated very differently by the model. Thus, it may be more useful for alert processing to consider similarity in local feature contributions. That is, we can consider whether the model's predictions of an instance can be explained in a similar way as the prediction corresponding to the alert. Different from previous works, we consider distance functions that take into account similarity in feature values, local explanations, and combinations thereof.

Empirical Evaluation. In simulated user experiments, we empirically show that our approach can be useful for alert processing. In particular, the usage of a distance function that considers similarity in local feature contributions often results in the best performance. Furthermore, a usability test with fraud analysts at a major Dutch bank indicates that our approach is perceived useful and easy to use.

Outline. The present paper is structured as follows. Section 2 covers related work. In Section 3, we introduce our CBR approach. In Section 4, we present the results of an empirical evaluation of our approach. We discuss concluding remarks in Section 5.

2 RELATED WORK

The basis of our approach is CBR: similar problems have similar solutions, and previous solutions can be used to solve new problems [6].
CBR decision support systems became popular during the nineties, and many different case-based explanation (CBE) methods have been suggested in that context [17]. Arguably the most straightforward CBE method is to retrieve the most similar cases. For example, Nugent and Cunningham [12] propose to retrieve the instance most similar to the current case, weighted by the local feature contributions of the query instance. The proposed distance function is intuitive, but its utility is not empirically tested. In the related field of k-Nearest Neighbor (k-NN) classification, the importance of the distance function has long been recognized. Different weight-setting algorithms have been proposed to improve the performance of k-NN algorithms. In particular, different algorithms can be distinguished based on whether weights are applied globally (i.e. feature importance across all instances) or locally (i.e. feature importance per instance or a subgroup of instances) [22]. In our experiments, we consider several distance functions, including unweighted, locally weighted, and globally weighted functions.

Similar to our goal, Jiang et al. [4] propose to capture trustworthiness in a trust score: the ratio between the distance from the query instance to the nearest class different from the predicted class and the distance to the predicted class. However, the utility of the trust score for human users is not evaluated. In particular, summarizing trustworthiness in a single score makes it impossible for users to determine whether the derivation of the score aligns with their domain knowledge. Moreover, the score can take any value, which can make it difficult to interpret for novice users.

3 CASE-BASED REASONING APPROACH

For a given instance, which we will refer to as the query instance, our approach consists of the following two steps (see Figure 1):

(1) Case retrieval.
Retrieve the k instances from the case base that are most similar to the query instance. The appropriate distance function may differ between problems. The case base consists of instances for which the true class is known.

(2) Neighborhood visualization. Visualize the k retrieved instances as well as the query instance as points in a scatter plot, such that: (a) the distance between any two instances corresponds to their similarity according to a distance function; (b) the colors of the neighbors correspond to their true classes.

The number of retrieved cases, k, is a user-defined parameter; i.e. the user can choose how many instances they would like to retrieve. Note that a different distance function can be used for steps (1) and (2). In Section 4.1, we empirically test which combinations of distance functions are most useful in each step, for several benchmark data sets as well as a real-life fraud detection data set.

3.1 Case Retrieval

In the case retrieval stage, we retrieve instances from the case base that are similar to the instance we are trying to explain. The case base consists of instances for which the ground truth is known at the time the machine learning model is trained. When the amount of historical data is relatively small, all instances can be added to the case base. Otherwise, sampling or prototyping approaches may be used to decrease the size of the case base.

Figure 1: The two stages of the CBR approach for estimating the trustworthiness of a local prediction. (a) Case retrieval: retrieve the k instances most similar to the query instance. (b) Neighborhood visualization: visualize how similar the retrieved k neighbors are, as well as their true class.

Our approach assumes that the true class of instances in the case base is known. In some contexts, such as money laundering, the true class is typically not known for all instances.
When instances whose true class has not been verified are added to the case base, this should be made explicit to the user in the neighborhood visualization, e.g. by means of a different color.

3.2 Neighborhood Visualization

Rather than just returning the retrieved neighbors to the user, we visualize the neighborhood in a two-dimensional scatter plot (see Figure 1b). The distance between any two instances i and j in the visualization roughly corresponds to the dissimilarity of the two instances, as given by the used distance function. Similar to the approach taken by McArdle and Wilson [11], we compute the coordinates of the scatter plot using multidimensional scaling (MDS) [7]. In addition, colors or shapes can be used to visualize the model's performance for each of the retrieved instances. The user can use this information to determine whether the prediction of the query instance is trustworthy or not. For example, if many of the most similar neighbors are false positives, this decreases the trustworthiness of an alert.

The visualization can be further extended based on the application context. For example, we know that fraud schemes change over time. Hence, transactions that occurred a long time ago may be less relevant than newer transactions. In this scenario, a valuable extension could be to visualize the maturity of retrieved instances using e.g. a time-based color scale or a time filter. Another interesting extension would be to add counterfactual instances to the visualization. Counterfactual instances are perturbed versions of the query instance that received a different prediction from the model [20]. Adding such perturbed instances to the visualization may help the domain expert to identify and inspect the decision boundary of the classification model.
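As a minimal sketch, the two stages can be combined as follows. This is our illustration, not the paper's implementation: scikit-learn's NearestNeighbors and MDS stand in for whatever retrieval and projection code is actually used, and plain Euclidean distance is assumed, although any of the distance functions from Section 3.3 could be substituted.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.neighbors import NearestNeighbors

def retrieve_and_project(query, case_base, k=20, seed=0):
    """Stage 1: retrieve the k cases most similar to the query instance.
    Stage 2: compute 2-D scatter-plot coordinates with metric MDS, so that
    plotted distances roughly preserve the pairwise dissimilarities."""
    nn = NearestNeighbors(n_neighbors=k).fit(case_base)
    _, idx = nn.kneighbors(query.reshape(1, -1))
    # stack the query instance on top of its retrieved neighbors
    points = np.vstack([query, case_base[idx[0]]])
    dissim = squareform(pdist(points))  # pairwise Euclidean distances
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=seed).fit_transform(dissim)
    return idx[0], coords  # coords[0] belongs to the query instance

# toy case base: 200 instances with 5 features
rng = np.random.default_rng(0)
case_base = rng.normal(size=(200, 5))
idx, coords = retrieve_and_project(case_base[3] + 0.01, case_base, k=10)
```

In an actual dashboard, `coords` would be plotted with the neighbors colored by their (verified or unverified) true class.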
We leave this to future work.

3.3 Distance Functions

The idea behind our approach is to retrieve and visualize instances that are similar to the query instance. However, it is unclear which notion of dissimilarity will be most useful for alert processing. Moreover, some combinations of distance functions in the case retrieval and visualization steps may be more useful than others.

3.3.1 Feature Values. The most straightforward way to define similarity is to consider the feature values of transactions. In this case, instances with similar feature values are considered similar. Depending on the feature type (e.g. categorical or continuous), some distance functions may be more appropriate than others. In this work, we assume that all feature values are properly normalized and that Euclidean distance is meaningful. However, we encourage readers to use a different distance function if appropriate.

3.3.2 Feature Contributions. A potential disadvantage of a plain feature value-based distance function is that the machine learning model is not taken into account at all. Instances that seem similar with regard to feature values may have been treated very differently by the model. For example, consider a fraud detection decision tree with the decision rule amount > $10,000 at its root node. Two transactions that are exactly the same with regard to all feature values except for amount can end up in different branches of the decision tree. Hence, the transactions may seem very similar in the data, but the decision-making process of the model could be completely different for each of the transactions. Judging the trustworthiness of a new prediction based on instances that were treated differently by the model does not seem intuitive. Instead, it might be more informative to take into account the model's arguments. A state-of-the-art approach for explaining single predictions is Shapley Additive Explanations (SHAP) [10, 18].
SHAP is based on concepts from cooperative game theory and explains how each feature value of an instance contributed to the model's confidence score. In order to take into account the model's arguments for its predictions, we can consider similarity in SHAP explanations. In a SHAP value-based distance function, instances whose feature values contributed similarly to the model's confidence score are considered similar.

Interestingly, we find that distances in SHAP value space behave very well. First of all, SHAP explanations can be clustered well using relatively few clusters (see Figure 2b). Moreover, it is possible to identify subsets of transactions for which the model performs worse than for others (see Figure 2c). This indicates that SHAP value similarity can be meaningful for alert processing.

3.3.3 Globally Weighted Feature Values. A potential disadvantage of a distance function based solely on SHAP values is that instances with a similar SHAP explanation can still have different feature values. We can combine SHAP explanations and feature values in a single distance function by weighting feature values by the model's feature importance. Feature importance can be defined either globally (i.e. across all instances) or locally (i.e. per instance). When considering a globally weighted distance function, instances with similar feature values on features that are considered globally important by the model (i.e. across all instances in the training data) are considered similar. Global SHAP feature importances can be computed as follows [9]:

\overline{|\Phi|} = \frac{1}{N} \sum_{i=1}^{N} |\Phi_i|    (1)

where N is the total number of instances in the data set and \Phi_i is the SHAP value vector of instance i.

3.3.4 Locally Weighted Feature Values. Local SHAP importances can be very different from global SHAP importances. For example, a particular feature may be relatively important on average, but not contribute at all to the prediction of a particular instance.
Therefore, the utility of feature importance may depend on whether importance is defined globally or locally. When a locally weighted feature value distance function is used, instances with similar feature values on features that are considered locally important by the model (i.e. for the query instance) are considered similar. Note that this distance function is similar to the one suggested by Nugent and Cunningham [12], except that we use SHAP importances rather than local feature contributions similar in spirit to LIME.

3.3.5 Formalization. As a basic distance function, we consider the weighted Euclidean distance. Given two input vectors z_a = (z_{a1}, ..., z_{am}) and z_b = (z_{b1}, ..., z_{bm}), and a weight vector w = (w_1, ..., w_m), the weighted Euclidean distance is defined as follows:

d_{ab} = d(w, z_a, z_b) = \sqrt{\sum_{j=1}^{m} w_j (z_{aj} - z_{bj})^2}    (2)

Note that Equation 2 is equivalent to the unweighted Euclidean distance when w is an all-ones vector (1). We can describe the four considered distance functions with regard to the input vectors of Equation 2 (see Table 1). In the next section, we discuss the performance of our approach when d_F, d_S, d_G, and d_L are used. We will omit d and use the corresponding index letter in the figures for brevity.

Figure 2: t-SNE visualization that groups transactions with similar SHAP explanations. The SHAP explanations explain predictions made by a random forest fraud detection model. (a) Model's confidence, (b) k-means clustering with k = 10, (c) model's performance.

Table 1: In both the case retrieval and neighborhood visualization stage, we consider four distance functions that differ with regard to the input values of Equation 2. x_a refers to the feature value vector of instance a, \Phi_a to the SHAP value vector of instance a, and q denotes the query instance.
Notation  Description                                         Definition
d_F       feature values                                      d(1, x_a, x_b)
d_S       SHAP values                                         d(1, \Phi_a, \Phi_b)
d_G       feature values weighted by global SHAP importance   d(\overline{|\Phi|}, x_a, x_b)
d_L       feature values weighted by local SHAP importance    d(|\Phi_q|, x_a, x_b)

4 EVALUATION

The goal of our CBR approach is to provide evidence on the trustworthiness of a prediction. Similar to [4], we define local trustworthiness as the difference between the Bayes-optimal classifier's confidence and the model's prediction (e.g. fraud or no fraud) for that instance. That is, if the model agrees with the Bayes-optimal classifier, trustworthiness is high, and vice versa. Notably, even the Bayes-optimal classifier can be wrong in some regions due to noise. Hence, trustworthiness should be interpreted as an estimate of the reasonableness of the prediction, given the underlying data distribution. In practice, the Bayes-optimal classifier is not realizable and empirical measurements of trustworthiness are not possible. However, our goal is to provide decision support for domain experts who perform alert processing tasks. Hence, we can bypass the difficulty of measuring trustworthiness by measuring utility for domain experts instead.

4.1 Simulated User Experiment

We evaluate the expected utility of our visualization for alert processing in a simulated user experiment. In a simulated user experiment, assumptions are made about how users utilize an approach to perform particular tasks. Subsequently, the expected task performance is computed as the task performance that is achieved when applying the assumed strategy. Simulated user experiments have been used to evaluate the coverage, precision, and trustworthiness of different explanations by Ribeiro et al. [15, 16]. We are not aware of simulated user experiments aimed at the evaluation of the alert processing performance of domain experts. Consequently, we present a novel experiment setup.
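Before describing the experiment, the four distance functions of Table 1 that it compares can be sketched directly from Equations (1) and (2). The Python function names below are our own shorthand, not the paper's code:

```python
import numpy as np

def weighted_euclidean(w, z_a, z_b):
    """Equation (2): d(w, z_a, z_b) = sqrt(sum_j w_j * (z_aj - z_bj)^2)."""
    z_a, z_b = np.asarray(z_a, float), np.asarray(z_b, float)
    return float(np.sqrt(np.sum(w * (z_a - z_b) ** 2)))

def global_importance(Phi):
    """Equation (1): mean absolute SHAP value per feature over all N
    instances; Phi has shape (N, m)."""
    return np.mean(np.abs(Phi), axis=0)

# the four distance functions of Table 1
def d_F(x_a, x_b):            # plain feature values
    return weighted_euclidean(np.ones(len(x_a)), x_a, x_b)

def d_S(phi_a, phi_b):        # SHAP explanation vectors
    return weighted_euclidean(np.ones(len(phi_a)), phi_a, phi_b)

def d_G(x_a, x_b, Phi):       # globally weighted by mean |SHAP|
    return weighted_euclidean(global_importance(Phi), x_a, x_b)

def d_L(x_a, x_b, phi_q):     # locally weighted by |SHAP| of the query
    return weighted_euclidean(np.abs(phi_q), x_a, x_b)

# toy SHAP matrix for two instances and two features
Phi = np.array([[0.2, -0.4],
                [-0.2, 0.8]])
```

With this toy matrix, `global_importance(Phi)` gives per-feature weights of 0.2 and 0.6, so d_G down-weights differences on the first feature relative to the second.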
In our simulated user experiments, we simulate a situation in which a machine learning model has been put in production. In this scenario, a model has been trained on historical data and new data is arriving. For each new instance, the model predicts whether it is a positive or a negative. Positives will trigger an alert and are inspected by human analysts, while negatives are not further inspected.

The goal of the experiments is to estimate how well an analyst would be able to process alerts when provided with the neighborhood visualization. To this end, we make a few assumptions about how users interpret the visualization. Based on these assumptions, we estimate how confident a user would be about the instance belonging to the positive class. In order to determine the utility of the visualization compared to the model, we evaluate how well the estimated user confidence score as well as the model's confidence score correspond to the ground truth.

4.1.1 Method. To simulate the scenario where a model has been put into production, we split our data set into three different parts (see Figure 3). We can describe the simulated user experiment further using the following six steps:

(1) Split the dataset. We first split the dataset into three different sets: the training data, the test data, and the production data. The production data represents data that arrives while the model is in production.

(2) Train classifier. We train a classifier on the training data.

(3) Initialize Case Base. As the training and test set contain instances for which we know the ground truth at the time the model goes into production, we add these instances to the case base.

(4) Initialize Alert Set.
We determine which instances from the production data would result in a positive prediction from our machine learning model. These instances are put in the alert set.

(5) Estimate user's and model's confidence scores. For each of the instances in the alert set, we estimate the user's confidence that the instance belongs to the positive class as a number between 0 and 1. Additionally, we determine the model's confidence for each instance in the alert set.

(6) Evaluate confidence. Given the ground truth of the instances in the alert set, we compare the mean average precision (MAP) that is achieved using the user's confidence to the MAP achieved by the model's confidence.

Figure 3: In the simulated user experiment, the dataset is split into three sets. The training and test set correspond to data that is available at the time of model inference. The production set corresponds to new instances that arrive once the model is in production.

In order to estimate the user's confidence, we make several assumptions on how our visualization is interpreted. Recall that we are interested in alert processing. We assume that a positive neighbor increases the user's confidence that the instance is a true positive and a negative neighbor decreases the user's confidence. Then, we can estimate the user's confidence c_i based on the retrieved neighbors using the following equation:

c_i = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}\{y_j = 1\}    (3)

where y_j is the true class of instance j. However, some neighbors may be much more similar to the query instance than others. This is shown to the user in our neighborhood visualization, so the user will likely take this into account.
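The unweighted estimate of Equation (3), together with the inverse-distance weighting applied to it next, can be sketched as follows. The fixed `eps` clipping below is our simplification; the paper instead sets a zero distance to a value smaller than the distance to the nearest non-identical neighbor:

```python
import numpy as np

def user_confidence(y_neighbors):
    """Equation (3): fraction of retrieved neighbors with a positive
    true class."""
    return float(np.mean(np.asarray(y_neighbors) == 1))

def weighted_user_confidence(y_neighbors, distances, eps=1e-9):
    """Inverse-distance-weighted vote over the neighbors. Zero distances
    (identical neighbors) are clipped to eps so the weight stays finite
    but large."""
    y = np.asarray(y_neighbors)
    d = np.maximum(np.asarray(distances, dtype=float), eps)
    w = 1.0 / d
    return float(np.sum(w * (y == 1)) / np.sum(w))

# three neighbors: two similar positives, one distant negative
c = user_confidence([1, 1, 0])        # 2/3
c_w = weighted_user_confidence([1, 1, 0], [0.1, 0.2, 5.0])
```

Because the negative neighbor is far away, the weighted estimate `c_w` lies well above the unweighted 2/3, reflecting a user who discounts dissimilar cases.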
Therefore, we weight each neighbor's class by the inverse of the distance between the neighbor and the query instance i:

c_i^w = \frac{\sum_{j=1}^{k} \frac{1}{d_{ij}} \mathbb{1}\{y_j = 1\}}{\sum_{j=1}^{k} \frac{1}{d_{ij}}}    (4)

Note that 1/d_{ij} is undefined if d_{ij} is equal to zero, i.e. if the neighbor is identical to the instance we are trying to explain. In our experiments, we deal with this by setting d_{ij} to a small number that is smaller than the distance to the most similar non-identical neighbor. In this way, a large weight is assigned to the identical neighbor, but the other neighbors are still taken into account.

4.1.2 Results. We evaluate our CBR approach on three benchmark classification data sets: Adult [5], Phoneme [1], and Churn [13]. All data sets were retrieved from OpenML [19]. Additionally, we evaluate our approach on a real-life Fraud Detection data set provided by a major Dutch bank. On each data set, we train a random forest classifier using the implementation in scikit-learn [14].

We evaluate the estimated user confidence scores for each possible combination of the distance functions in Table 1. As a baseline, we also add the user confidence score that would be achieved when no distance function is considered in the neighborhood visualization (i.e. Equation 3). These results represent the case in which similarities are not provided to the user at all.

Recall that the number of neighbors k is a user-set parameter. Consequently, the approach is evaluated for different values of k, ranging from 1 to 500 neighbors. For each combination of distance functions, we compare the MAP of the model's confidence to the MAP of the estimated user confidence averaged over the different values of k. In Figure 4, we summarize the difference in performance as the average over all possible values of k.

Estimated User Confidence Mostly Performs Better Than Model's Confidence.
For the Churn, Phoneme, and Fraud Detection classification tasks, the estimated user confidence mostly results in a higher average MAP than the model's confidence, but the achieved performance gain typically differs between combinations of distance functions (Figure 4). Only for the Adult data set does the estimated user confidence result in worse average MAP scores than the model's confidence.

Number of Retrieved Neighbors (k) Impacts Performance. In some data sets, the user-set parameter k has a high impact on the performance of the estimated user confidence. In particular, our approach outperforms the classifier on the Adult data set only for a very particular range of neighbors (see Figure 5a). For the Phoneme and Fraud Detection data sets, a minimum of approximately 20 neighbors is typically required to outperform the model's confidence score (see Figure 5b). This result suggests that returning only the most similar case to the user, as suggested by Nugent and Cunningham [12], may not provide enough evidence to be useful for alert processing. When applied to new problems, simulated user experiments could be performed to decide upon the appropriate range of k that can be selected by a real human user.

Unweighted User Confidence Performs Consistently Worse than User Confidence Weighted By Any Distance Function. For each of the data sets, estimating the user's confidence as the simple average of the true classes of the retrieved neighbors consistently results in the worst performance. Recall that unweighted user confidence corresponds to a user who ignores the similarity of the retrieved cases. This result shows the importance of communicating the similarity of the retrieved neighbors to the user, as is done in the neighborhood visualization step.

d_S Mostly Performs Best. For all data sets apart from the Churn data set, using d_S performs best for both case retrieval and neighborhood visualization (see Figure 4).
In particular, performing case retrieval using d_S for the Phoneme and Fraud Detection data sets consistently results in top performance, regardless of the distance function that is used in neighborhood visualization. This indicates that the relevance of the retrieved neighbors is very high. In the Churn data set, d_F and d_L perform best for case retrieval and d_L for neighborhood visualization.

Figure 4: Improvement or decrease in average MAP of the estimated user confidence score compared to the MAP of the model's confidence score, for (a) Adult, (b) Churn, (c) Phoneme, and (d) Fraud Detection. MAP of the estimated user confidence is averaged over the number of retrieved cases k ∈ {1, 2, ..., 500}. The difference is shown for all possible combinations of distance functions in the two steps of the approach. F, G, L, and S refer to the distance functions defined in Table 1. U refers to an unweighted estimated user confidence according to Equation 3 (i.e. if the user ignores the distances in the neighborhood visualization).

4.2 Usability Test

To determine the perceived utility of our approach for fraud analysts at the Rabobank, we conduct a usability test.

4.2.1 Method. The CBR approach is implemented in a Python-based dashboard, which displays the model's confidence, the SHAP explanation, and the neighborhood visualization of a selected alert (see Figure 6). The evaluation is performed in individual sessions with fraud analysts, using a think-aloud protocol. After the usability test, the dashboard is evaluated on perceived usefulness and perceived ease of use by means of a short survey introduced by Davis [2].

4.2.2 Results. Four fraud analysts participated in the evaluation. The average perceived usefulness was 5.64 on a 7-point Likert scale, with a standard deviation of 1.2. The average perceived ease of use was 5.96 out of 7, with a standard deviation of 0.9.
From the verbal protocols, it became clear that the neighborhood visualization materializes the fraud analysts' intuitions on the trustworthiness of fraud detection rules. As such, we expect the system to be particularly relevant for performing deeper analyses of cases for which a fraud analyst has not yet developed a strong intuition. As fraud detection models are constantly retrained, explanations for machine-generated alerts are expected to differ over time, which makes our approach particularly relevant in that scenario.

5 CONCLUSIONS

Recent explanation methods have been proposed as a means to assess trust in a model's prediction. However, there is a lack of empirical evidence that illustrates the utility of explanations for alert processing tasks. In particular, understanding why the model made a certain prediction may not be enough to assess the correctness of the prediction. Hence, rather than explaining a prediction, our goal is to provide evidence on the reasonableness of the prediction given the underlying data.

Figure 5: The performance of different neighborhood visualization functions against the number of retrieved neighbors for the (a) Adult and (b) Phoneme data sets when using d_S for case retrieval.

In this paper, we have introduced a novel CBR approach that can be used to assess the trustworthiness of a prediction. In our simulated user experiments, we have shown that the two-stage CBR approach can be useful for processing alerts. In line with our intuitions, our results suggest that a distance function based on similarity in SHAP values is more useful than distances based on feature value similarity. Moreover, the results of a usability test with fraud analysts at a major Dutch bank indicate that our approach is perceived useful as well as easy to use.
5.1 Future Work

In the present paper, we have evaluated our approach on four different classification tasks, with somewhat varying results. Not all of these results are well understood yet. In particular, future work could consider a more extensive analysis of why particular distance functions work well for some data sets but not as well for others.

Additionally, future work could consider extensions of the neighborhood visualization. In particular, adding counterfactual instances is expected to provide more insight into the decision boundary of the model.

As SHAP values can be expensive to compute, future work could focus on optimizing the case base by means of e.g. prototype selection or sampling approaches. An important aspect of these approaches is how they may be misleading for users. In particular, future work could study how over- and undersampling approaches affect the decision-making process of users in scenarios with highly imbalanced data.

One of the findings presented in this work is that SHAP explanations are remarkably clusterable. An interesting direction of future work that leverages this observation is prototypical explanations, which could be used to provide a global explanation of a black-box model in a model-agnostic fashion.

REFERENCES

[1] José Miguel Benedí Ruiz, Francisco Casacuberta Nolla, Enrique Vidal Ruiz, Inmaculada Benlloch, Antonio Castellanos López, María José Castro Bleda, Jon Ander Gómez Adrián, Alfons Juan Císcar, and Juan Antonio Puchol García. 1991. Proyecto ROARS: Robust Analytical Speech Recognition System. (1991).

[2] Fred D. Davis. 1989. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly 13, 3 (1989), 319–340.

[3] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46, 4 (2014), 44:1–44:37.
https://doi.org/10.1145/2523813 [4] Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. 2018. T o Trust Or Not T o Trust A Classier . In Advances in Neural Information Processing Systems 31 , S. Bengio, H. W allach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 5541–5552. [5] Ron Kohavi. 1997. Scaling Up the Accuracy of Naive-Bayes Classiers: a Decision- Tree Hybrid. KDD (09 1997). [6] Janet L. Kolodner . 1992. An introduction to case-based reasoning. A rticial Intelligence Review 6, 1 (1992), 3–34. https://doi.org/10.1007/bf00155578 [7] J. B. Kruskal. 1964. Multidimensional scaling by optimizing goodness of t to a nonmetric hypothesis. Psychometrika 29, 1 (01 Mar 1964), 1–27. https: //doi.org/10.1007/BF02289565 [8] V olo dymyr Kuleshov and Percy S Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28 , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474–3482. http://papers.nips.cc/paper/5658- calibrated- structured- prediction.pdf [9] Scott M. Lundberg, Gabriel G. Erion, and Su-In Le e. 2018. Consistent Individual- ized Feature Attribution for Tree Ensembles. (2018). [10] Scott M Lundberg and Su-In Lee . 2017. A Unied Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 , I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (Eds.). Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062- a- unied- approach- to- interpreting- model- predictions.p df [11] G.P. McArdle and D .C. Wilson. 2003. Visualising Case-Base Usage. In W orkshop Proceedings ICCBR , L. McGinty (Ed.). Trondhuim, 105–114. [12] Conor Nugent and Pádraig Cunningham. 2005. A Case-Based Explanation System for Black-Box Systems. A rticial Intelligence Review 24, 2 (o ct 2005), 163–178. https://doi.org/10.1007/s10462- 005- 4609- 5 [13] Randal S. 
Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 1 (11 Dec 2017), 36. https: //doi.org/10.1186/s13040- 017- 0154- 4 [14] F. Pe dregosa, G. V aroquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer , R. W eiss, V . Dubourg, J. V anderplas, A. Passos, D. Cour- napeau, M. Brucher , M. Perrot, and E. Duchesnay . 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. [15] Marco T ulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust Y ou?": Explaining the Predictions of Any Classier. In Proceedings of the 22nd ACM SIG International Conference on Knowledge Discovery and Data Mining (KDD) . ACM Press, New Y ork, New Y ork, USA, 1135–1144. https://doi.org/10. 1145/2939672.2939778 [16] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High- Precision Model-A gnostic Explanations. In AAAI . [17] Frode Sørmo, Jörg Cassens, and Agnar A amodt. 2005. Explanation in Case-Based Reasoning–Perspectives and Goals. A rticial Intelligence Review 24, 2 (oct 2005), 109–143. https://doi.org/10.1007/s10462- 005- 4607- 7 [18] Erik Štrumb elj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665. https://doi.org/10.1007/s10115- 013- 0679- x KDD- ADF ’19, August 2019, Anchorage, Alaska, USA W e erts, et al. Figure 6: CBR dashb oard when applied to predictions of a random forest model trained on the Churn dataset. (A) The mo del’s condence for the selecte d alert, (B) A bar chart showing the SHAP values of the selected alert, (C) The CBR neighb orhood visualization. The number of neighbors k can be chosen in the slider . [19] Joaquin V anschoren, Jan N. van Rijn, Bernd Bischl, and Luis T orgo. 2013. 
OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198 [20] Sandra W achter , Brent Mittelstadt, and Chris Russell. 2018. Counterfactual Explanations Without Op ening the Black Box: Automated Decisions and the GDPR. Harvard journal of law & technology 31 (04 2018), 841–887. [21] Hilde J.P. W eerts, W erner van Ipenburg, and Mykola Pechenizkiy . 2019. A Human- Grounded Evaluation of SHAP for Alert Processing. In Pr oce edings of KDD W ork- shop on Explainable AI (KDD-XAI ’19) . [22] Dietrich W ettschereck, David W . Aha, and T akao Mohri. 1997. A Review and Empirical Evaluation of Feature W eighting Methods for a Class of Lazy Learning Algorithms. A rticial Intelligence Review 11, 1/5 (1997), 273–314. https://doi.org/ 10.1023/a:1006593614256 [23] Indre Zliobaite, Mykola Pe chenizkiy , and Joao Gama. 2016. An overview of concept drift applications. In Big Data Analysis: New Algorithms for a New Society . Springer , 91–114.
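As a concrete illustration of the prototypical-explanations direction, the sketch below clusters SHAP vectors with a plain k-means and takes each cluster's medoid (the actual explanation closest to the cluster center) as a prototype. The algorithm choice, function names, and toy data are our own assumptions; the paper only suggests this direction and does not prescribe a method.

```python
import numpy as np

def prototype_explanations(shap_values, k=2, iters=10, seed=0):
    """Cluster SHAP explanations with k-means; return cluster labels and
    the index of each cluster's medoid explanation (the prototype)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct explanations.
    centers = shap_values[rng.choice(len(shap_values), k, replace=False)]
    for _ in range(iters):
        # Assign each explanation to its nearest center.
        labels = np.argmin(
            np.linalg.norm(shap_values[:, None] - centers[None], axis=2), axis=1)
        # Recompute centers as cluster means.
        centers = np.array([shap_values[labels == j].mean(axis=0) for j in range(k)])
    # Prototype = the real explanation closest to each final center.
    protos = [int(np.argmin(np.linalg.norm(shap_values - c, axis=1))) for c in centers]
    return labels, protos

# Toy SHAP explanations forming two clearly separated groups.
shap_values = np.array([
    [0.9, -0.1], [0.8, 0.0], [-0.7, 0.5], [-0.8, 0.6],
])
labels, protos = prototype_explanations(shap_values, k=2)
```

Each prototype is itself a SHAP explanation of a real instance, so the set of prototypes can serve as a model-agnostic global summary of how the black-box model behaves across the data.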