Revealing Neural Network Bias to Non-Experts Through Interactive Counterfactual Examples
56 See Kilbertus et al., supra note 54, at 1.
57 See Matt J. Kusner et al., Counterfactual Fairness, arXiv:1703.06856, at 16 (2017), https://arxiv.org/pdf/1703.06856.pdf [https://perma.cc/4SVN-7J9D].
58 Id.
59 Nina Grgic-Hlaca et al., The Case for Process Fairness in Learning: Feature Selection for Fair Decision Making, in NIPS SYMPOSIUM ON MACHINE LEARNING AND THE LAW 8 (2016).
60 See Andrea Romei & Salvatore Ruggieri, A Multidisciplinary Survey on Discrimination Analysis, 29 KNOWLEDGE ENGINEERING REV. 582, 617 (2014).
61 Establishing the influence of a protected variable on a decision does not, by itself, prove that illegal discrimination has occurred. Mitigating factors may exist which justify the usage of a protected attribute. See, e.g., Solon Barocas & Andrew D. Selbst, Big Data’s Disparate Impact, 104 CAL. L. REV. 671, 676 (2016) (discussing disparate treatment in American anti-discrimination law).
Several works have approached the problem of guaranteeing that algorithms are fair, i.e., that they do not exhibit a bias towards particular ethnic, gender, or other protected groups, using causal reasoning56 and counterfactuals.57 Kusner et al.58 consider counterfactuals in which the subject belongs to a different race or sex, and require that the decision remain the same under such a counterfactual for it to be considered fair. In contrast, we consider counterfactuals in which the decision differs from its current state.

Many works have suggested that transparency might be a useful tool for enforcing fairness. While it is unclear how counterfactuals could be used for this purpose, it is also unclear whether any form of explanation of individual decisions can in fact help. Grgic-Hlaca et al.59 showed how understandable models can easily mislead our intuitions: predominantly using features people believed to be fair slightly increased the racism exhibited by algorithms, while decreasing accuracy. In general, the best tools for uncovering systematic biases are likely to be based upon large-scale statistical analysis rather than upon explanations of individual decisions.60

With that said, counterfactuals can provide evidence that an algorithmic decision is affected by a protected variable (e.g., race), and that it may therefore be discriminatory.61 For the types of distance function we consider in the next section, if the counterfactuals found change someone’s race, then the treatment of that individual is dependent on race. However, the converse statement is not true: counterfactuals which do not modify a protected attribute cannot be used as evidence that the attribute was irrelevant to the decision. This is because counterfactuals describe only some of the dependencies between a particular decision and
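The reasoning above — find the decision-flipping point nearest to the individual under some distance function, then inspect whether a protected attribute changed — can be sketched in code. The following is a minimal, purely illustrative toy: the linear scoring model, the feature names (`income`, `debt`, `race_flag`), the weights, and the brute-force candidate grid are all hypothetical assumptions, not drawn from the article or any real system.

```python
# Hypothetical toy classifier: a fixed linear scorer over three features.
# All feature names and weights are illustrative assumptions.
WEIGHTS = {"income": 0.8, "debt": -0.5, "race_flag": -0.6}
THRESHOLD = 0.0

def decision(x):
    """Return True (e.g. 'approve') if the weighted score clears the threshold."""
    return sum(WEIGHTS[k] * v for k, v in x.items()) > THRESHOLD

def nearest_counterfactual(x, candidates):
    """Among candidates whose decision differs from x's, return the one
    minimising L1 distance to x (one simple choice of distance function)."""
    flipped = [c for c in candidates if decision(c) != decision(x)]
    if not flipped:
        return None
    return min(flipped, key=lambda c: sum(abs(c[k] - x[k]) for k in x))

# A rejected applicant and a small brute-force grid of alternative inputs.
x = {"income": 0.2, "debt": 0.9, "race_flag": 1.0}
grid = [
    {"income": i, "debt": d, "race_flag": r}
    for i in (0.2, 0.6, 1.0)
    for d in (0.1, 0.5, 0.9)
    for r in (0.0, 1.0)
]
cf = nearest_counterfactual(x, grid)

# If the nearest counterfactual changes the protected attribute, the
# decision for this individual depends on that attribute. (Per the text,
# the converse inference is NOT valid: a counterfactual that leaves the
# attribute unchanged does not show the attribute was irrelevant.)
if cf is not None and cf["race_flag"] != x["race_flag"]:
    print("decision depends on the protected attribute")
```

Real systems would search continuous feature spaces with optimization rather than a grid, but the evidentiary logic is the same: the counterfactual only exhibits one nearby dependency of the decision, which is why it can support, but never rule out, reliance on a protected variable.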