Fair prediction with disparate impact: A study of bias in recidivism prediction instruments


Authors: Alexandra Chouldechova

Alexandra Chouldechova
Heinz College, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA, USA
achould@cmu.edu

Abstract

Recidivism prediction instruments (RPIs) provide decision makers with an assessment of the likelihood that a criminal defendant will reoffend at a future point in time. While such instruments are gaining increasing popularity across the country, their use is attracting tremendous controversy. Much of the controversy concerns potential discriminatory bias in the risk assessments that are produced. This paper discusses a fairness criterion originating in the field of educational and psychological testing that has recently been applied to assess the fairness of recidivism prediction instruments. We demonstrate how adherence to the criterion may lead to considerable disparate impact when recidivism prevalence differs across groups.

1 Introduction

Risk assessment instruments are gaining increasing popularity within the criminal justice system, with versions of such instruments being used or considered for use in pre-trial decision-making, parole decisions, and in some states even sentencing [1, 2]. In each of these cases, a high-risk classification, and particularly a high-risk misclassification, may have a direct adverse impact on a criminal defendant's outcome. If RPIs are to continue to be used, it is important to ensure that they do not result in unethical practices that disparately affect different groups. Within the psychometrics literature, there exist widely accepted and adopted standards for assessing whether an instrument is fair in the sense of being free of predictive bias.
These standards have recently been applied to the COMPAS [3] and PCRA [4] instruments, with initial findings suggesting that there is evidence of predictive bias when it comes to gender, but not when it comes to race [5, 6, 7]. In a recent widely popularized investigation of the COMPAS RPI conducted by a team at ProPublica, a different approach to assessing instrument bias told what appears to be a contradictory story [8]. The authors found that the likelihood of a non-recidivating Black defendant being assessed as high-risk is nearly twice that of a non-recidivating White defendant. While this analysis has met with much criticism, it has also made headlines. There is no doubt that it is now embedded in the national conversation on the use of RPIs.

In this paper we show that the differences in false positive and false negative rates cited as evidence of racial bias in the ProPublica article are a direct consequence of applying an instrument that is free from predictive bias (in the psychometric sense) to a population in which recidivism prevalence differs across groups. Our main contribution is twofold. (1) First, we make precise the connection between the psychometric notion of test fairness and error rates in classification. (2) Next, we demonstrate how using an RPI that has different false positive and false negative rates between groups can lead to disparate impact when individuals assessed as high risk receive stricter penalties. Throughout our discussion we use the term disparate impact to refer to settings where a penalty policy has unintended disproportionate adverse impact on a particular group.

It is important to bear in mind that fairness itself, along with the notion of disparate impact, is a social and ethical concept, not a statistical one. An instrument that is free from predictive bias may nevertheless result in disparate impact depending on how and where it is used.
In this paper we consider hypothetical use cases in which we are able to directly connect statistically quantifiable features of RPIs to a measure of disparate impact.

1.1 Data description and setup

The empirical results in this paper are based on the Broward County data made publicly available by ProPublica [9]. This data set contains COMPAS recidivism risk decile scores, 2-year recidivism outcomes, and a number of demographic and crime-related variables. We restrict our attention to the subset of defendants whose race is recorded as African-American (b) or Caucasian (w).

2 Assessing fairness

We begin with some notation. Let S = S(x) denote the risk score based on covariates X = x, with higher values of S corresponding to higher levels of assessed risk. Let R ∈ {b, w} denote the group that the individual belongs to, which may be one of the components of X. Lastly, let Y ∈ {0, 1} be the outcome indicator, with 1 denoting that the given individual recidivates. In this notation, we can think of the psychometric test fairness condition roughly as follows.

Definition 2.1 (Test fairness). A score S = S(x) is test-fair (well-calibrated) if it reflects the same likelihood of recidivism irrespective of the individual's group membership, R. That is, if for all values of s,

    P(Y = 1 | S = s, R = b) = P(Y = 1 | S = s, R = w).    (2.1)

Depending on the context, we may further desire that this criterion be satisfied when we condition on some of the covariates; our analysis extends to this case as well.

Figure 1 shows a plot of the observed recidivism rates across all possible values of the COMPAS score. We can see that the COMPAS RPI appears to adhere well to the test fairness condition. In their response to the ProPublica investigation, Flores et al. [10] further verify this adherence using logistic regression.
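The calibration check behind Definition 2.1 amounts to comparing empirical recidivism rates per score value across groups. The sketch below is a minimal illustration under stated assumptions (it is not the authors' code; the (score, group, outcome) triples mirror the structure of the ProPublica release, but the toy values are made up):

```python
from collections import defaultdict

def calibration_table(records):
    """Empirical P(Y = 1 | S = s, R = r) from (score, group, outcome) triples."""
    counts = defaultdict(lambda: [0, 0])  # (score, group) -> [num recidivated, num total]
    for s, r, y in records:
        counts[(s, r)][0] += y
        counts[(s, r)][1] += 1
    return {key: n1 / n for key, (n1, n) in counts.items()}

# Toy data: a test-fair score gives matching rates per decile across groups.
toy = [(1, "b", 0), (1, "b", 0), (1, "w", 0), (1, "w", 0),
       (10, "b", 1), (10, "b", 1), (10, "w", 1), (10, "w", 0)]
table = calibration_table(toy)
print(table[(1, "b")], table[(1, "w")])    # 0.0 0.0  (calibrated at s = 1)
print(table[(10, "b")], table[(10, "w")])  # 1.0 0.5  (miscalibrated at s = 10)
```

On real data one would additionally attach confidence intervals to each estimate, as in Figure 1, before judging adherence to (2.1).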
2.1 Implied constraints on the false positive and false negative rates

To facilitate a simpler discussion of error rates, we introduce the coarsened score S_c, which is obtained by thresholding S at some cutoff s_HR:

    S_c(x) = HR if S(x) > s_HR;  LR if S(x) ≤ s_HR.    (2.2)

Figure 1: Plot shows P(Y = 1 | S = s, R) for the COMPAS decile score, with R ∈ {Black, White}. Error bars represent 95% confidence intervals.

The coarsened score simply assesses each defendant as being at high risk (HR) or low risk (LR) of recidivism. For the purpose of our discussion, we will think of S_c as a classifier used to predict the binary outcome Y. This allows us to summarize S_c in terms of a confusion matrix, as shown below.

            S_c = Low-Risk    S_c = High-Risk
    Y = 0   TN                FP
    Y = 1   FN                TP

It is easily verified that test fairness of S implies that the positive predictive value of the coarsened score S_c does not depend on R. More precisely, it implies that the quantity

    PPV(S_c | R = r) ≡ P(Y = 1 | S_c = HR, R = r)    (2.3)

does not depend on r. Equation (2.3) thus forms a necessary condition for the test fairness of S. We can think of this as a constraint on the values of the confusion matrix. A second constraint, one that we have no direct control over, is the recidivism prevalence within groups, which we denote here by p_r ≡ P(Y = 1 | R = r). Given values of the PPV ∈ (0, 1) and prevalence p ∈ (0, 1), it is straightforward to show that the false negative rate FNR = P(S_c = LR | Y = 1) and false positive rate FPR = P(S_c = HR | Y = 0) are related via the equation

    FPR = (p / (1 − p)) · ((1 − PPV) / PPV) · (1 − FNR).    (2.4)
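Relation (2.4) can be checked mechanically from any confusion matrix, since p, PPV, FNR, and FPR are all simple ratios of the four cell counts. A minimal sketch, using illustrative counts rather than the Broward data:

```python
# Compute prevalence, PPV, FNR, and FPR from confusion-matrix counts,
# then verify that equation (2.4) holds exactly.
def rates(tn, fp, fn, tp):
    n = tn + fp + fn + tp
    p = (tp + fn) / n            # prevalence P(Y = 1)
    ppv = tp / (tp + fp)         # P(Y = 1 | S_c = HR)
    fnr = fn / (fn + tp)         # P(S_c = LR | Y = 1)
    fpr = fp / (fp + tn)         # P(S_c = HR | Y = 0)
    return p, ppv, fnr, fpr

p, ppv, fnr, fpr = rates(tn=400, fp=100, fn=150, tp=350)  # illustrative counts
rhs = (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)
assert abs(fpr - rhs) < 1e-12  # (2.4) holds for any confusion matrix
```

Fixing PPV (as test fairness requires) while letting p vary across groups therefore forces the (FPR, FNR) pair to differ between groups, which is the crux of the argument that follows.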
A direct implication of equation (2.4) is that when the recidivism prevalence differs between two groups, a test-fair score S_c cannot have equal false positive and false negative rates across those groups. (This observation is also made in independent concurrent work by Kleinberg et al. [11].) It enables us to better understand why the ProPublica authors observed large discrepancies in FPR and FNR between Black and White defendants (Black: FPR = 45%, FNR = 28%; White: FPR = 23%, FNR = 48%). The recidivism rate among Black defendants in the data is 51%, compared to 39% for White defendants. Since the COMPAS RPI approximately satisfies test fairness, we know that some level of imbalance in the error rates must exist.

3 Assessing impact

In this section we show how differences in false positive and false negative rates can result in disparate impact under policies where a high-risk assessment results in a stricter penalty for the defendant. Such situations may arise when risk assessments are used to inform bail, parole, or sentencing decisions. In the state of Pennsylvania, for instance, statutes permit the use of RPIs in sentencing, provided that the sentence ultimately falls within accepted guidelines. We use the term "penalty" somewhat loosely in this discussion to refer to outcomes both in the pre-trial and post-conviction phases of legal proceedings. Even though pre-trial outcomes such as the amount at which bail is set are not punitive in a legal sense, we nevertheless refer to the bail amount as a "penalty" for the purpose of our discussion.

There are notable cases where RPIs are used for the express purpose of informing risk reduction efforts. In such settings, individuals assessed as high risk receive what may be viewed as a benefit rather than a penalty. The PCRA score, for instance, is intended to support precisely this type of decision-making at the federal courts level. Our analysis in this section specifically addresses use cases where high-risk individuals receive stricter penalties.

To begin, consider a setting in which guidelines indicate that a defendant is to receive a penalty t_L ≤ T ≤ t_H. A very simple risk-based approach, which we will refer to as the MinMax policy, would be to assign penalties as follows:

    T_MinMax = t_L if S_c = Low-Risk;  t_H if S_c = High-Risk.    (3.1)

In this simple setting, we can precisely characterize the extent of disparate impact in terms of recognizable quantities. Define T_{r,y} to be the penalty given to a defendant in group R = r with observed outcome Y = y ∈ {0, 1}, and let Δ = Δ(y_1, y_2) = E(T_{b,y_1} − T_{w,y_2}) be the expected difference in sentence between defendants in different groups. Δ is a measure of disparate impact.

Proposition 3.1. The expected difference in penalty under the MinMax policy is given by

    Δ ≡ E_MinMax(T_{b,y_1} − T_{w,y_2}) = (t_H − t_L) [ P(S_c = HR | R = b, Y = y_1) − P(S_c = HR | R = w, Y = y_2) ].

We will discuss two immediate corollaries of this result.

Corollary 3.1 (Non-recidivators). Among individuals who do not recidivate, the difference in average penalty under the MinMax policy is

    Δ = (t_H − t_L)(FPR_b − FPR_w).    (3.2)

Corollary 3.2 (Recidivators). Among individuals who recidivate, the difference in average penalty under the MinMax policy is

    Δ = (t_H − t_L)(FNR_w − FNR_b).    (3.3)

When using a test-fair RPI in populations where recidivism prevalence differs across groups, it will generally be the case that the higher-prevalence group will have a higher FPR and lower FNR. From equations (3.2) and (3.3), we can see that this would result in greater penalties for defendants in the higher-prevalence group, both among recidivating and non-recidivating offenders. An interesting special case to consider is one where t_L = 0.
This could arise in sentencing decisions for offenders convicted of low-severity crimes who have good prior records. In such cases, so-called restorative sanctions may be imposed as an alternative to a period of incarceration. If we further take t_H = 1, then E T = P(T ≠ 0), which can be interpreted as the probability that a defendant receives a sentence imposing some period of incarceration. It is easy to see that in such settings a non-recidivating defendant in group b is FPR_b / FPR_w times more likely to be incarcerated compared to a non-recidivating defendant in group w. (We are overloading notation in this expression: here FPR_r = P(HR | R = r, t_L = 0), and similarly for FNR_r.) This naturally raises the question of whether overall differences in error rates are observed to persist across more granular subgroups.

One might expect that differences in false positive rates are largely attributable to the subset of defendants who are charged with more serious offenses and who have a larger number of prior arrests/convictions. While it is true that the false positive rates within both racial groups are higher for defendants with worse criminal histories, considerable between-group differences in these error rates persist across low prior count subgroups. Figure 2 shows a plot of false positive rates across different ranges of prior count for defendants charged with a misdemeanor offense, which is the lowest-severity criminal offense category. As one can see, differences in false positive rates between Black defendants and White defendants persist across prior record subgroups.

3.1 Connections to measures of effect size

A natural question to ask is whether the level of disparate impact, Δ, is related to some measure of effect size commonly used in scientific reporting. With a small generalization of the % non-overlap measure, we can answer this question in the affirmative.

The % non-overlap of two distributions is generally calculated assuming both distributions are normal, and thus has a one-to-one correspondence to Cohen's d [12] (d = (S̄_b − S̄_w) / SD, where SD is a pooled estimate of the standard deviation). Figure 3 shows that the COMPAS decile score is far from being normally distributed. A more reasonable way to calculate the % non-overlap is to note that in the Gaussian case the % non-overlap is equivalent to the total variation distance.

Figure 2: False positive rates across prior record count for defendants charged with a misdemeanor offense. Plot is based on assessing a defendant as "high-risk" if their COMPAS decile score is > 4. Error bars represent 95% confidence intervals.

Figure 3: COMPAS decile score histograms for Black and White defendants. Cohen's d = 0.60, non-overlap d_TV(f_b, f_w) = 24.5%.

Letting f_{r,y}(s) denote the score distribution for race r and recidivism outcome y, one can establish the following sharp bound on Δ.

Proposition 3.2 (Percent overlap bound). Under the MinMax policy,

    Δ ≤ (t_H − t_L) d_TV(f_{b,y}, f_{w,y}).

4 Discussion

The primary contribution of this paper was to show how disparate impact can result from the use of a recidivism prediction instrument that is known to be free from predictive bias. Our analysis focused on the simple setting where a binary risk assessment was used to inform a binary penalty policy. While all of the formulas have natural analogs in the non-binary score and penalty setting, we find that all of the salient features are already present in the analysis of the simpler binary-binary problem.
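As a numerical illustration of Propositions 3.1 and 3.2, the sketch below evaluates Δ under the MinMax policy and the total variation bound on hypothetical conditional score distributions over deciles 1 to 10. The cutoff of 4 mirrors the Figure 2 convention; all frequencies are invented for illustration, not taken from the Broward data:

```python
# Disparate impact Delta under the MinMax policy (Prop. 3.1) and the
# total-variation bound (Prop. 3.2), on hypothetical score distributions.
def delta_minmax(t_lo, t_hi, p_hr_b, p_hr_w):
    """Delta = (t_H - t_L) * (P(HR | b, y) - P(HR | w, y))."""
    return (t_hi - t_lo) * (p_hr_b - p_hr_w)

def tv_distance(f, g):
    """d_TV between two discrete distributions given as aligned lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))

# Hypothetical decile-score distributions f_{b,y} and f_{w,y} for a fixed y:
f_b = [0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.15, 0.15]
f_w = [0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
cut = 4  # "high-risk" means decile score > 4

p_hr_b, p_hr_w = sum(f_b[cut:]), sum(f_w[cut:])
delta = delta_minmax(0.0, 1.0, p_hr_b, p_hr_w)  # t_L = 0, t_H = 1 special case
bound = tv_distance(f_b, f_w)
assert delta <= bound + 1e-12  # Prop. 3.2: Delta <= (t_H - t_L) * d_TV
```

With these particular distributions the bound is attained (Δ = d_TV = 0.30), which is what "sharp" means in Proposition 3.2: some pair of distributions makes the inequality an equality.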
Our analysis indicates that there are risk assessment use cases in which it is desirable to balance error rates across different groups, even though this will generally result in risk assessments that are not free from predictive bias. However, balancing error rates overall may not be sufficient, as this does not guarantee balance at finer levels of granularity. That is, even if FPR_b = FPR_w, we may still see differences in error rates within prior record categories (see, e.g., Figure 2). One needs to decide the level of granularity at which error rate balance is desirable to achieve.

In closing, we would like to note that there is a large body of literature showing that data-driven risk assessment instruments tend to be more accurate than professional human judgements [13, 14], and investigating whether human-driven decisions are themselves prone to exhibiting racial bias [15, 16]. We should not abandon the data-driven approach on the basis of negative headlines. Rather, we need to work to ensure that the instruments we use are demonstrably free from the kinds of quantifiable biases that could lead to disparate impact in the specific contexts in which they are to be applied.

References

[1] Thomas Blomberg, William Bales, Karen Mann, Ryan Meldrum, and Joe Nedelec. Validation of the COMPAS risk assessment classification instrument. 2010.

[2] Ben Casselman, Anna Maria Barry-Jester, and Dana Goldstein. Should prison sentences be based on crimes that haven't been committed yet?

[3] Northpointe. COMPAS risk & need assessment system: Selected questions posed by inquiring agencies.

[4] Administrative Office of the United States Courts. An overview of the federal post conviction risk assessment, September 2011.

[5] Jay P. Singh. Predictive validity performance indicators in violence risk assessment: A methodological primer. Behavioral Sciences & the Law, 31(1):8–22, 2013.
[6] Jennifer L. Skeem and Christopher T. Lowenkamp. Risk, race, & recidivism: Predictive bias and disparate impact. Available at SSRN, 2015.

[7] Jennifer L. Skeem, John Monahan, and Christopher T. Lowenkamp. Gender, risk assessment, and sanctioning: The cost of treating women like men. Available at SSRN 2718460, 2016.

[8] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. 2016. URL https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

[9] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. How we analyzed the COMPAS recidivism algorithm. 2016. URL https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

[10] Anthony W. Flores, Kristin Bechtel, and Christopher T. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks." Unpublished manuscript, 2016.

[11] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

[12] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences (2nd Edition). Lawrence Erlbaum Associates, 1988.

[13] Paul E. Meehl. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. University of Minnesota Press, 1954.

[14] William M. Grove, David H. Zald, Boyd S. Lebow, Beth E. Snitz, and Chad Nelson. Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1):19, 2000.

[15] Shamena Anwar and Hanming Fang. Testing for racial prejudice in the parole board release process: Theory and evidence. Technical report, National Bureau of Economic Research, 2012.
[16] Laura T. Sweeney and Craig Haney. The influence of race on sentencing: A meta-analytic review of experimental studies. Behavioral Sciences & the Law, 10(2):179–195, 1992.
