Learning Interpretability for Visualizations using Adapted Cox Models through a User Experiment
Authors: Adrien Bibal, Benoît Frénay
Adrien Bibal
PReCISE Research Center, Faculty of Computer Science
University of Namur, Namur, 5000 - Belgium
adrien.bibal@unamur.be

Benoît Frénay
PReCISE Research Center, Faculty of Computer Science
University of Namur, Namur, 5000 - Belgium
benoit.frenay@unamur.be

Abstract

In order to be useful, visualizations need to be interpretable. This paper uses a user-based approach to combine and assess quality measures in order to better model user preferences. Results show that cluster separability measures are outperformed by a neighborhood conservation measure, even though the former are usually considered as intuitively representative of user motives. Moreover, combining measures, as opposed to using a single measure, further improves prediction performance.

1 Introduction

Measuring interpretability is a major concern in machine learning. Along with other classical performance measures such as accuracy, interpretability defines the limit between black-box and white-box models (Rüping, 2006; Bibal and Frénay, 2016). Interpretable models allow one to understand how inputs are linked to the output. This paper focuses on visualizations that map high-dimensional data to a 2D projection. In this context, interpretability refers to the ability of a user to understand how a particular visualization model projects data. When a user chooses a particular visualization, he or she implicitly states that he or she understands how the points are presented, i.e. how the model works. Interpretability is then defined through user preferences and no a priori definition is assumed.

Following Freitas (2014) and others, Bibal and Frénay (2016) highlight two ways to measure interpretability: through heuristics and user-based surveys. Tailored quality measures for visualizations are examples of the heuristics approach.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Surveys can be used to qualitatively define the understandability of a visualization by asking for user feedback. Both approaches are complementary, but only a few works (e.g. Sedlmair and Aupetit (2015)) attempt to mix them to assess the relevance of several quality metrics for visualization. This paper bridges this gap through a user-based experiment that uses meta-learning to combine several measures of visualization interpretability. Section 2 presents some visualization quality measures that are used during meta-learning. Section 3 introduces a family of white-box meta-models to find a score of interpretability. Then, Section 4 describes the user experiment that is used to model interpretability from user preferences. Finally, Section 5 discusses the experimental results and Section 6 concludes the paper.

2 Quality Measures of Visualizations

One can consider two types of quality measures for visualizations: one type uses only the data after projection and the other compares the points before and after projection. Typical measures of the first type focus on the separability of clusters in the visualization. Sedlmair and Aupetit (2015) reviewed, evaluated and sorted such measures in terms of algorithmic similarity and agreement with human judgments. They confirmed the top position of distance consistency (DSC) as one of the best measures (Sedlmair and Aupetit, 2015). Let P be the set of points of the projection, C the set of classes, c_x the class of point x and centroid(c) the centroid of class c; then (Sips et al., 2009):

DSC = |{x ∈ P : ∃c ∈ C, c ≠ c_x ∧ dist(centroid(c), x) < dist(centroid(c_x), x)}| / |P|.

Two other top measures in Sedlmair and Aupetit (2015) are the hypothesis margin (HM) and the average between-within (ABW).
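Read concretely, the DSC formula counts the fraction of points that lie closer to another class's centroid than to their own. A minimal sketch (our own illustration, not the authors' code), assuming `X` holds the 2D coordinates and `y` the class labels as NumPy arrays:

```python
import numpy as np

def dsc(X, y):
    """Distance consistency (Sips et al., 2009): fraction of points closer to
    the centroid of another class than to their own class centroid."""
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    violations = 0
    for x, c_x in zip(X, y):
        d_own = np.linalg.norm(x - centroids[c_x])
        if any(np.linalg.norm(x - centroids[c]) < d_own
               for c in classes if c != c_x):
            violations += 1
    return violations / len(X)

# Two well-separated clusters: no point violates centroid consistency.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(dsc(X, y))  # -> 0.0
```

Under this reading of the formula (violating points over all points), lower values indicate better cluster separability.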
HM computes the average difference between the distance of each point x from its closest neighbor of another class and its closest neighbor of the same class (Gilad-Bachrach et al., 2004). ABW (Sedlmair and Aupetit, 2015; Lewis et al., 2012) computes the ratio of the average distance between points of different clusters to the average distance within clusters.

In order to compare visualization algorithms, Lee et al. (2015) propose a measure of the second type modeling neighborhood preservation. Their measure, NH_AUC, can be defined as follows. Let N be the number of points in the dataset, K the number of neighbors, v_i^K the K nearest neighbors of the i-th point in the original dataset and n_i^K the K nearest neighbors of the i-th point in the projection; then

Q_NX(K) = ( Σ_{i=1}^{N} |v_i^K ∩ n_i^K| ) / (K N)

measures the average preservation of neighborhoods of size K. Lee et al. (2015) then use the area under the Q_NX(K) curve for different neighborhood sizes in order to compute NH_AUC.

3 Meta-Learning with Adapted Cox Models

The main goal of this paper is to evaluate whether combining state-of-the-art measures of different types improves the modeling of human judgment. To assess this, we set up an experiment asking users to express preferences between visualizations shown in pairs (see Section 4 for more details) and then used these preferences to determine an interpretability score. Since our dataset is composed of preferences between visualizations, our learning problem is rooted in preference learning. For this kind of problem, an order must be learned based on preferences (Fürnkranz and Hüllermeier, 2011). Our dataset consists of a set of visualizations V and a set of user-given preferences v_i ≻ v_j expressing that v_i is preferred over v_j for some pairs of visualizations v_i, v_j ∈ V.
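The Q_NX(K) measure from Section 2 admits a direct implementation. The sketch below is illustrative only: the plain unweighted average over K stands in for the area under the curve, whereas Lee et al. (2015) weight neighborhood sizes on a logarithmic scale.

```python
import numpy as np

def q_nx(X_hd, X_ld, K):
    """Average fraction of each point's K nearest neighbors preserved between
    the original space (X_hd) and the 2D projection (X_ld)."""
    def knn(X):
        # Pairwise Euclidean distances; a point is not its own neighbor.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)
        return np.argsort(D, axis=1)[:, :K]
    nn_hd, nn_ld = knn(np.asarray(X_hd, float)), knn(np.asarray(X_ld, float))
    overlap = sum(len(set(a) & set(b)) for a, b in zip(nn_hd, nn_ld))
    return overlap / (K * len(X_hd))

def nh_auc_approx(X_hd, X_ld):
    """Plain (unweighted) area under the Q_NX(K) curve, K = 1..N-1."""
    N = len(X_hd)
    return float(np.mean([q_nx(X_hd, X_ld, K) for K in range(1, N)]))
```

A perfect projection preserves every neighborhood, giving Q_NX(K) = 1 for all K; random placement drives it toward K/(N-1).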
The preference learning algorithm considered for modeling user preferences must be interpretable, such as a logistic regression (Arias-Nicolás et al., 2008), so that knowledge can be gained about the measures used as meta-features. To solve this problem, we consider a well-known interpretable model used in survival analysis, the Cox model (Cox, 1972; Branders, 2015). We adapted the Cox model to fit our preference learning problem. Indeed, in the case of pairwise comparisons of objects, the partial likelihood of a Cox model can be adapted as follows:

Cox_pref(β) = Π_{v_i ≻ v_j} [ exp(β^T v_i) / ( exp(β^T v_i) + exp(β^T v_j) ) ] = Π_{v_i ≻ v_j} [ 1 / ( 1 + exp(−β^T (v_i − v_j)) ) ].

This adapted Cox model learns a preference score using the measures presented in Section 2 as features of visualizations v_i and v_j. This regression differs from a true logistic regression in that there is no intercept term. The term β^T v_i can be interpreted as an understandability score for visualization v_i.

4 User-Based Experiment

As mentioned in Section 3, an experiment was set up to collect preferences from users. Visualizations presented to users were generated from the MNIST dataset with various numbers of classes (from 2 to 6) using t-SNE (van der Maaten and Hinton, 2008) with various perplexities between 5 and the dataset size on a logarithmic scale. Each user was interviewed after the experiment to discuss his or her strategies for choosing between visualizations. We then used this information to better understand cases where Cox_pref models were not in agreement with user preferences.

The population of our experiment consisted of 40 first-year university students. They were instructed to select, from two displayed visualizations, the one for which they best understood “how the computer had positioned the numbers”. In addition to these two options, they could also select “no preference”, in which case the comparison was not used for learning.
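Since the Cox_pref likelihood of Section 3 equals a logistic likelihood on the feature differences v_i − v_j with no intercept, it can be maximized with plain gradient ascent. A minimal sketch (our own illustration, not the authors' implementation; each feature vector holds the quality-measure values of one visualization):

```python
import numpy as np

def fit_cox_pref(V_pref, V_other, lr=0.5, n_iter=2000):
    """Maximize prod 1 / (1 + exp(-beta^T (v_i - v_j))) over preference
    pairs, where row k of V_pref is the preferred visualization of pair k
    and row k of V_other is the one it was preferred over."""
    D = np.asarray(V_pref, float) - np.asarray(V_other, float)
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-D @ beta))    # P(v_i preferred over v_j)
        beta += lr * D.T @ (1.0 - p) / len(D)  # log-likelihood gradient step
    return beta

def score(beta, v):
    """beta^T v: the learned interpretability score of a visualization."""
    return float(np.dot(beta, np.asarray(v, float)))
```

With perfectly separable preferences the weights grow without bound, as in unregularized logistic regression; the L1 penalty discussed in Section 5 also acts as a regularizer in that case.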
Successive comparisons were assumed to be independent, meaning that no psychological learning bias was assumed to be involved.

Table 1: Average percentage of agreement with user preferences and 95% confidence interval thereof.

  number of classes | ABW         | HM        | DSC         | NH_AUC      | Cox_pref
  63.6% ± 0.1       | 65.6% ± 0.1 | 67% ± 0.2 | 68.5% ± 0.2 | 71.5% ± 0.1 | 76.4% ± 0.2

Table 2: Percentage of wins for every pairwise comparison between the five quality measures.

           | number of classes | ABW   | HM    | DSC   | NH_AUC
  ABW      | 84.5%             |       |       |       |
  HM       | 88.3%             | 67%   |       |       |
  DSC      | 97.5%             | 89.6% | 70%   |       |
  NH_AUC   | 100%              | 99.3% | 98.2% | 87.1% |
  Cox_pref | 100%              | 100%  | 100%  | 100%  | 99.3%

A total of 3294 preferences was collected. Because each user may have a different strategy while choosing visualizations, preferences were grouped into batches per user. For a given user, a random subset of his or her preferences was selected, with the total number of preferences being the same for all users. Thanks to this subsampling, all users had the same weight when modeling the overall strategy. The number of preferences per user was set at 30, which set aside 10 users who provided fewer than 30 preferences; our dataset was thus composed of 900 preferences. 1000 user permutations were performed. For each permutation, 2/3 of the users were used for training the Cox_pref model and 1/3 for testing. The performance measure was the percentage of agreement between users and the model. We used the same performance measure to individually compare the visualization measures used as meta-features.

5 Discussion

In addition to the two types of measures presented in Section 2, the number of classes was also considered for meta-learning (Garcia et al., 2016). In the case of a tie (i.e. same number of classes), one of the visualizations was chosen randomly. Table 1 shows the means and standard deviations computed on the 1000 permutations and Table 2 presents the percentage of wins against other measures.
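The equal-weight subsampling and the agreement metric of the protocol above can be sketched as follows. This is a hypothetical re-implementation, with `prefs_by_user` assumed to map each user to a list of (preferred, other) feature-vector pairs:

```python
import numpy as np

def subsample_users(prefs_by_user, n_per_user=30, seed=0):
    """Keep a random subset of n_per_user preferences per user so that every
    user has the same weight; users with fewer preferences are set aside
    (10 of the 40 users here)."""
    rng = np.random.default_rng(seed)
    kept = {}
    for user, prefs in prefs_by_user.items():
        if len(prefs) >= n_per_user:
            idx = rng.choice(len(prefs), size=n_per_user, replace=False)
            kept[user] = [prefs[i] for i in idx]
    return kept

def agreement(score, prefs):
    """Percentage of pairs (v_i, v_j), v_i preferred, where the model's
    score ranks v_i above v_j."""
    return 100.0 * float(np.mean([score(vi) > score(vj) for vi, vj in prefs]))
```

For each of the 1000 permutations, `agreement` would then be evaluated on the held-out third of the users, either with the fitted Cox_pref score or with a single quality measure used directly as the score.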
Measure m_i wins against measure m_j for permutation p if m_i has better performance than m_j on that permutation. Among the measures of the first type discussed in Section 2, DSC performs well in its group but is beaten by NH_AUC, the measure of the second type. Interestingly, NH_AUC obtains very good results despite the fact that it does not directly apply the well-known user strategy of cluster separability (Sedlmair and Aupetit, 2015), a strategy that was confirmed during the interviews. Indeed, measures of the second type use the original high-dimensional data in their computation, which is not possible for a human. In both Tables 1 and 2, the Cox_pref model outperforms individual measures. Similar results were observed using all 3129 preferences from the same 30 users.

In order to understand why the Cox_pref models fail in 23.6% of the cases on average, we checked judgment errors from Cox_pref by referring to what users said during the interviews. We could observe that involving users opens the opportunity for mistakes or unusual behaviors, as can be seen in Figure 1. Furthermore, in a few cases, when the user has no preference but distinguishes a semantic pattern that makes sense for him or her in one visualization, he or she tends to choose it (see Figure 1).

In order to assess the importance of each visualization measure in the score of Cox_pref, we varied the L1 penalization to enforce sparsity. NH_AUC is selected first. Then ABW is added, with an improvement of roughly 3.5%. The number of classes is added as a third measure, which improves the model by roughly 1.5%. Other additional measures only offer a minor improvement.

6 Conclusion

Using an adapted Cox model to handle the task of preference learning, we observed the modeling power of a measure taking into account elements that a human being cannot handle, such as NH_AUC. Furthermore, we confirmed the position of DSC as leader of its category.
Finally, we showed that using a white-box model to aggregate state-of-the-art measures can improve the prediction of human judgment using information from measures of different families. Further work needs to confirm the results obtained with t-SNE for MNIST on a wide range of datasets and visualization schemes.

Figure 1: Examples of disagreement between users and Cox_pref. Among visualizations (a) and (b), Cox_pref prefers (b), where 0s and 1s are clearly separated, whereas the user preferred (a). Visualization (c) shows an example of semantic bias: two users reported that they preferred (c) when there is a tie because it looks like a clock (1s on the left, 2s at the top, 3s on the right and 4s at the bottom).

Acknowledgments

We are grateful to Prof. Bruno Dumas for his help in designing the experiment involving users. We also thank Dr. Samuel Branders for fruitful discussions and for sharing resources on Cox models.

References

Arias-Nicolás, J., Pérez, C., and Martín, J. (2008). A logistic regression-based pairwise comparison method to aggregate preferences. Group Decision and Negotiation, 17(3):237–247.

Bibal, A. and Frénay, B. (2016). Interpretability of machine learning models and representations: an introduction. In Proc. ESANN, pages 77–82, Bruges, Belgium.

Branders, S. (2015). Regression, classification and feature selection from survival data: modeling of hypoxia conditions for cancer prognosis. PhD thesis, Université catholique de Louvain.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 34(2):187–220.

Freitas, A. A. (2014). Comprehensible classification models: a position paper. ACM SIGKDD Explorations Newsletter, 15(1):1–10.

Fürnkranz, J. and Hüllermeier, E. (2011). Preference Learning. Springer.

Garcia, L. P., Lorena, A. C., Matwin, S., and de Carvalho, A. (2016). Ensembles of label noise filters: a ranking approach. Data Mining and Knowledge Discovery, 30(5):1192–1216.

Gilad-Bachrach, R., Navot, A., and Tishby, N. (2004). Margin based feature selection - theory and algorithms. In Proc. ICML, page 43, Banff, Canada.

Lee, J. A., Peluffo-Ordóñez, D. H., and Verleysen, M. (2015). Multi-scale similarities in stochastic neighbour embedding. Neurocomputing, 169:246–261.

Lewis, J. M., Ackerman, M., and de Sa, V. (2012). Human cluster evaluation and formal quality measures: A comparative study. In Proc. CogSci, pages 1870–1875, Sapporo, Japan.

Rüping, S. (2006). Learning Interpretable Models. PhD thesis, Universität Dortmund.

Sedlmair, M. and Aupetit, M. (2015). Data-driven evaluation of visual quality measures. Computer Graphics Forum, 34(3):201–210.

Sips, M., Neubert, B., Lewis, J. P., and Hanrahan, P. (2009). Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum, 28(3):831–838.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.