Optimized Data Pre-Processing for Discrimination Prevention



Flavio P. Calmon, Dennis Wei, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney
Data Science Department, IBM Thomas J. Watson Research Center*

Abstract

Non-discrimination is a recognized objective in algorithmic decision making. In this paper, we introduce a novel probabilistic formulation of data pre-processing for reducing discrimination. We propose a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. We characterize the impact of limited sample size in accomplishing this objective, and apply two instances of the proposed optimization to datasets, including one on real-world criminal recidivism. The results demonstrate that all three criteria can be simultaneously achieved and also reveal interesting patterns of bias in American society.

1 Introduction

Discrimination is the prejudicial treatment of an individual based on membership in a legally protected group such as a race or gender. Direct discrimination occurs when protected attributes are used explicitly in making decisions, which is referred to as disparate treatment in law. More pervasive nowadays is indirect discrimination, in which protected attributes are not used but reliance on variables correlated with them leads to significantly different outcomes for different groups. The latter phenomenon is termed disparate impact. Indirect discrimination may be intentional, as in the historical practice of "redlining" in the U.S., in which home mortgages were denied in zip codes populated primarily by minorities. However, the doctrine of disparate impact applies in many situations regardless of actual intent.
Supervised learning algorithms, increasingly used for decision making in applications of consequence, may at first be presumed to be fair and devoid of inherent bias, but in fact inherit any bias or discrimination present in the data on which they are trained (Calders & Žliobaitė, 2013). Furthermore, simply removing protected variables from the data is not enough, since it does nothing to address indirect discrimination and may in fact conceal it. The need for more sophisticated tools has made discrimination discovery and prevention an important research area (Pedreschi et al., 2008).

Algorithmic discrimination prevention involves modifying one or more of the following to ensure that decisions made by supervised learning methods are less biased: (a) the training data, (b) the learning algorithm, and (c) the ensuing decisions themselves. These are respectively classified as pre-processing (Hajian, 2013), in-processing (Fish et al., 2016; Zafar et al., 2016; Kamishima et al., 2011), and post-processing approaches (Hardt et al., 2016). In this paper, we focus on pre-processing since it is the most flexible in terms of the data science pipeline: it is independent of the modeling algorithm and can be integrated with data release and publishing mechanisms.

Researchers have also studied several notions of discrimination and fairness. Disparate impact is addressed by the principles of statistical parity and group fairness (Feldman et al., 2015), which seek similar outcomes for all groups. In contrast, individual fairness (Dwork et al., 2012) mandates that similar individuals be treated similarly irrespective of group membership. For classifiers and other predictive models, equal error rates for different groups are a desirable property (Hardt et al., 2016), as is calibration or lack of predictive bias in the predictions (Zhang & Neill, 2016).
The tension between the last two notions is described by Kleinberg et al. (2017) and Chouldechova (2016); the work of Friedler et al. (2016) is in a similar vein. Corbett-Davies et al. (2017) discuss the cost of satisfying prevailing notions of algorithmic fairness from a public safety standpoint and discuss the trade-offs. Since the present work pertains to pre-processing and not modeling, balanced error rates and predictive bias are less relevant criteria. Instead we focus primarily on achieving group fairness while also accounting for individual fairness through a distortion constraint.

*Contact: {fdcalmon,dwei,knatesa,krvarshn}@us.ibm.com

[Figure 1: The proposed pipeline for predictive learning with discrimination prevention. Learn mode applies with training data and apply mode with novel test data. Note that test data also requires transformation before predictions can be obtained.]

Existing pre-processing approaches include sampling or re-weighting the data to neutralize discriminatory effects (Kamiran & Calders, 2012), changing the individual data records (Hajian & Domingo-Ferrer, 2013), and using t-closeness (Li et al., 2007) for discrimination control (Ruggieri, 2014). A common theme is the importance of balancing discrimination control against utility of the processed data. However, this prior work neither presents general and principled optimization frameworks for trading off these two criteria, nor allows connections to be made to the broader statistical learning and information theory literature via probabilistic descriptions.
Another shortcoming is that individual distortion or fairness is not made explicit.

In this work, addressing gaps in the pre-processing literature, we introduce a probabilistic framework for discrimination-preventing pre-processing in supervised learning. Our aim in part is to work toward a more unified view of previously proposed concepts and methods, which may help to suggest refinements. We formulate the determination of a pre-processing transformation as an optimization problem that trades off discrimination control, data utility, and individual distortion. (Trade-offs among various fairness notions may be inherent, as shown by Kleinberg et al. (2017).) While discrimination and utility are defined at the level of probability distributions, distortion is controlled on a per-sample basis, thereby limiting the effect of the transformation on individuals and ensuring a degree of individual fairness. Figure 1 illustrates the supervised learning pipeline that includes our proposed discrimination-preventing pre-processing.

The work of Zemel et al. (2013) is closest to ours in also presenting a framework with three criteria related to discrimination control (group fairness), individual fairness, and utility. However, the criteria are manifested less directly than in our proposal. In particular, discrimination control is posed in terms of intermediate features rather than outcomes, individual distortion does not take outcomes into account (simply being an $\ell_2$-norm between original and transformed features), and utility is specific to a particular classifier. Our formulation more naturally and generally encodes these fairness and utility desiderata.

Given the novelty of our formulation, we devote more effort than usual to discussing its motivations and potential variations. We state natural conditions under which the proposed optimization problem is convex. The resulting transformation is in general a randomized one.
The proposed optimization problem assumes as input an estimate of the distribution of the data which, in practice, can be imprecise due to limited sample size. Accordingly, we characterize the possible degradation in discrimination and utility guarantees at test time in terms of the training sample size. As a demonstration of our framework, we apply specific instances of it to a prison recidivism risk score dataset (ProPublica, 2017) and the UCI Adult dataset (Lichman, 2013). By solving the optimization problem, we show that discrimination, distortion, and utility loss can be controlled simultaneously with real data. In addition, the resulting transformations reveal intriguing demographic patterns in the data.

2 General Formulation

We are given a dataset consisting of $n$ i.i.d. samples $\{(D_i, X_i, Y_i)\}_{i=1}^n$ from a joint distribution $p_{D,X,Y}$ with domain $\mathcal{D} \times \mathcal{X} \times \mathcal{Y}$. Here $D$ denotes one or more discriminatory variables such as gender and race, $X$ denotes other non-protected variables used for decision making, and $Y$ is an outcome random variable. For instance, $Y_i$ could represent a loan approval decision for individual $i$ based on demographic information $D_i$ and credit score $X_i$. We focus in this paper on discrete (or discretized) and finite domains $\mathcal{D}$ and $\mathcal{X}$ and binary outcomes, i.e. $\mathcal{Y} = \{0, 1\}$. There is no restriction on the dimensions of $D$ and $X$.

Our goal is to determine a randomized mapping $p_{\hat{X},\hat{Y}|X,Y,D}$ that (i) transforms the given dataset into a new dataset $\{(D_i, \hat{X}_i, \hat{Y}_i)\}_{i=1}^n$, which may be used to train a model, and (ii) similarly transforms data to which the model is applied, i.e. test data. Each $(\hat{X}_i, \hat{Y}_i)$ is drawn independently from the same domain $\mathcal{X} \times \mathcal{Y}$ as $X, Y$ by applying $p_{\hat{X},\hat{Y}|X,Y,D}$ to the corresponding triplet $(D_i, X_i, Y_i)$. Since $D_i$ is retained as-is, we do not include it in the mapping to be determined.
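As a concrete illustration of this setup, the joint distribution $p_{D,X,Y}$ can be estimated from integer-coded samples as normalized counts. The following is a minimal NumPy sketch; the function name and the toy data are our own, not from the paper:

```python
import numpy as np

def empirical_joint(D, X, Y):
    """Normalized counts over discrete triples: an estimate of p_{D,X,Y}.

    D, X, Y are equal-length arrays of nonnegative integer codes; the
    result has shape (|D|, |X|, |Y|) and sums to 1.
    """
    p = np.zeros((D.max() + 1, X.max() + 1, Y.max() + 1))
    for d, x, y in zip(D, X, Y):
        p[d, x, y] += 1.0
    return p / len(D)

# Invented toy sample: binary D and Y, ternary X.
D = np.array([0, 0, 0, 1, 1, 1, 0, 1])
X = np.array([0, 1, 2, 0, 1, 2, 1, 1])
Y = np.array([1, 0, 1, 0, 0, 1, 1, 0])
p_dxy = empirical_joint(D, X, Y)
```

All marginals and conditionals used below (e.g. $p_{Y|X,D}$, $p_{\hat{Y}|D}$) are then sums or ratios over the axes of this array.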
Motivation for retaining $D$ is discussed later in Section 3.2. For test samples, $Y_i$ is not available at the input while $\hat{Y}_i$ may not be needed at the output. In this case, a reduced mapping $p_{\hat{X}|X,D}$ may be used, which can be obtained from $p_{\hat{X},\hat{Y}|X,Y,D}$ by marginalizing over $\hat{Y}$ and $Y$ after weighting by $p_{Y|X,D}$.

It is assumed that $p_{D,X,Y}$ is known along with its marginals and conditionals. This assumption is often satisfied using the empirical distribution of $\{(D_i, X_i, Y_i)\}_{i=1}^n$. In Section 3.2, we state a result ensuring that discrimination and utility loss continue to be controlled if the distribution used to determine $p_{\hat{X},\hat{Y}|X,Y,D}$ differs from the distribution of test samples.

We propose that the mapping $p_{\hat{X},\hat{Y}|X,Y,D}$ satisfy the properties discussed in the following three subsections.

2.1 Discrimination Control

The first objective is to limit the dependence of the transformed outcome $\hat{Y}$ on the discriminatory variables $D$, as represented by the conditional distribution $p_{\hat{Y}|D}$. We propose two alternative formulations. The first requires $p_{\hat{Y}|D}$ to be close to a target distribution $p_{Y_T}$ for all values of $D$:
$$J\left(p_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right) \leq \epsilon_{y,d} \quad \forall\, d \in \mathcal{D},\ y \in \{0,1\},\tag{1}$$
where $J(\cdot,\cdot)$ denotes some distance function. The second formulation constrains $p_{\hat{Y}|D}$ to be similar for any two values of $D$:
$$J\left(p_{\hat{Y}|D}(y|d_1), p_{\hat{Y}|D}(y|d_2)\right) \leq \epsilon_{y,d_1,d_2}\tag{2}$$
for all $d_1, d_2 \in \mathcal{D}$, $y \in \{0,1\}$. The latter (2) does not require a target distribution as reference but does increase the number of constraints from $O(|\mathcal{D}|)$ to $O(|\mathcal{D}|^2)$. The choice of target $p_{Y_T}$ in (1), and of distance $J$ and thresholds $\epsilon$ in (1) and (2), should be informed by societal considerations.
If the application domain has a clear legal definition of disparate impact, for example the "80% rule" (EEOC, 1979), then it can be translated into a mathematical constraint. Otherwise and more generally, the instantiation of (1) should involve consultation with domain experts and stakeholders before being put into practice. For this work, we choose $J$ to be the following probability ratio measure:
$$J(p, q) = \left|\frac{p}{q} - 1\right|.\tag{3}$$
The combination of (3) and (1) generalizes the extended lift criterion proposed in the literature (Pedreschi et al., 2012), while the combination of (3) and (2) generalizes selective and contrastive lift. In the numerical results in Section 4, we use both (1) and (2). For (1), we make the straightforward choice of setting $p_{Y_T} = p_Y$, the original marginal distribution of the outcome variable. We recognize however that this choice of target may run the risk of perpetuating bias in the original dataset. On the other hand, how to choose a target distribution that is "fairer" than $p_Y$ is largely an open question; we refer the reader to Žliobaitė et al. (2011) for one such proposal, which is reminiscent of the concept of "balanced error rate" in classification (Zhao et al., 2013).

In (1) and (2), discrimination control is imposed jointly with respect to all discriminatory variables, e.g. all combinations of gender and race if $D$ consists of those two variables. An alternative is to take the discriminatory variables one at a time, e.g. gender without regard to race and vice versa. The latter, which we refer to as univariate discrimination control, can be formulated similarly to (1), (2). In this work, we opt for joint discrimination control as it is more stringent than univariate. We note however that legal formulations tend to be of the univariate type.
Formulations (1) and (2) control discrimination at the level of the overall population in the dataset. It is also possible to control discrimination within segments of the population by conditioning on additional variables $B$, where $B$ is a subset of the feature collection $X$. Constraint (1) would then generalize to
$$J\left(p_{\hat{Y}|D,B}(y|d,b), p_{Y_T|B}(y|b)\right) \leq \epsilon_{y,d,b}\tag{4}$$
for all $d \in \mathcal{D}$, $y \in \{0,1\}$, and $b \in \mathcal{B}$. Similar conditioning or "context" for discrimination has been explored before in Hajian & Domingo-Ferrer (2013) in the setting of association rule mining. As one example, $B$ may consist of non-discriminatory variables that are strongly correlated with the outcome $Y$, e.g. education level as it relates to income. One may wish to control for such variables in determining whether discrimination is present and needs to be corrected. At the same time, care must be taken so that the population segments created by conditioning on $B$ are large enough for statistically valid inferences to be made. For present purposes, we simply note that conditional discrimination constraints (4) can be accommodated in our framework and defer further investigation to future work.

2.2 Distortion Control

The mapping $p_{\hat{X},\hat{Y}|X,Y,D}$ should satisfy distortion constraints with respect to the domain $\mathcal{X} \times \mathcal{Y}$. These constraints restrict the mapping to reduce or avoid altogether certain large changes (e.g. a very low credit score being mapped to a very high credit score). Given a distortion metric $\delta : (\mathcal{X} \times \mathcal{Y})^2 \to \mathbb{R}_+$, we constrain the conditional expectation of the distortion as follows:
$$\mathbb{E}\left[\delta\left((x,y), (\hat{X},\hat{Y})\right) \,\middle|\, D = d, X = x, Y = y\right] \leq c_{d,x,y} \quad \forall\, (d,x,y) \in \mathcal{D} \times \mathcal{X} \times \mathcal{Y}.\tag{5}$$
We assume that $\delta((x,y),(x,y)) = 0$ for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$. Constraint (5) is formulated with pointwise conditioning on $(D, X, Y) = (d, x, y)$ in order to promote individual fairness.
It ensures that distortion is controlled for every combination of $(d,x,y)$, i.e. every individual in the original dataset, and more importantly, every individual to which a model is later applied. By way of contrast, an average-case measure in which an expectation is also taken over $D, X, Y$ may result in high distortion for certain $(d,x,y)$, likely those with low probability. Equation (5) also allows the level of control $c_{d,x,y}$ to depend on $(d,x,y)$ if desired. We also note that (5) is a property of the mapping $p_{\hat{X},\hat{Y}|D,X,Y}$ and does not depend on the assumed distribution $p_{D,X,Y}$.

The expectation over $\hat{X}, \hat{Y}$ in (5) encompasses several cases depending on the choices of the metric $\delta$ and thresholds $c_{d,x,y}$. If $c_{d,x,y} = 0$, then no mappings with nonzero distortion are allowed for individuals with original values $(d,x,y)$. If $c_{d,x,y} > 0$, then certain mappings may still be disallowed by assigning them infinite distortion. Mappings with finite distortion are permissible subject to the budget $c_{d,x,y}$. Lastly, if $\delta$ is binary-valued (perhaps achieved by thresholding a multi-valued distortion function), it can be seen as classifying mappings into desirable ($\delta = 0$) and undesirable ones ($\delta = 1$). Here, (5) reduces to a bound on the conditional probability of an undesirable mapping, i.e.
$$\Pr\left(\delta\left((x,y), (\hat{X},\hat{Y})\right) = 1 \,\middle|\, D = d, X = x, Y = y\right) \leq c_{d,x,y}.\tag{6}$$

2.3 Utility Preservation

In addition to constraints on individual distortions, we also require that the distribution of $(\hat{X}, \hat{Y})$ be statistically close to the distribution of $(X, Y)$. This is to ensure that a model learned from the transformed dataset (when averaged over the discriminatory variables $D$) is not too different from one learned from the original dataset, e.g. a bank's existing policy for approving loans. For a given dissimilarity measure $\Delta$ between probability distributions (e.g.
KL-divergence), we require that $\Delta\left(p_{\hat{X},\hat{Y}}, p_{X,Y}\right)$ be small.

2.4 Optimization Formulation

Putting together the considerations from the three previous subsections, we arrive at the optimization problem below for determining a randomized transformation $p_{\hat{X},\hat{Y}|X,Y,D}$ mapping each sample $(D_i, X_i, Y_i)$ to $(\hat{X}_i, \hat{Y}_i)$:
$$\begin{aligned}
\min_{p_{\hat{X},\hat{Y}|X,Y,D}} \quad & \Delta\left(p_{\hat{X},\hat{Y}}, p_{X,Y}\right) \\
\text{s.t.} \quad & J\left(p_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right) \leq \epsilon_{y,d} \quad \text{and} \\
& \mathbb{E}\left[\delta\left((x,y),(\hat{X},\hat{Y})\right) \,\middle|\, D = d, X = x, Y = y\right] \leq c_{d,x,y} \quad \forall\, (d,x,y) \in \mathcal{D} \times \mathcal{X} \times \mathcal{Y}, \\
& p_{\hat{X},\hat{Y}|X,Y,D} \text{ is a valid distribution.}
\end{aligned}\tag{7}$$
We choose to minimize the utility loss $\Delta$ subject to constraints on individual distortion (5) and discrimination, where we have used (1) for concreteness, since it is more natural to place bounds on the latter two.

The distortion constraints (5) are an essential component of the problem formulation (7). Without (5), and assuming that $p_{Y_T} = p_Y$, it is possible to achieve perfect utility and non-discrimination simply by sampling $(\hat{X}_i, \hat{Y}_i)$ from the original distribution $p_{X,Y}$ independently of any inputs, i.e. $p_{\hat{X},\hat{Y}|X,Y,D}(\hat{x},\hat{y}|x,y,d) = p_{\hat{X},\hat{Y}}(\hat{x},\hat{y}) = p_{X,Y}(\hat{x},\hat{y})$. Then $\Delta\left(p_{\hat{X},\hat{Y}}, p_{X,Y}\right) = 0$, and $p_{\hat{Y}|D}(y|d) = p_{\hat{Y}}(y) = p_Y(y) = p_{Y_T}(y)$ for all $d \in \mathcal{D}$. This solution however is clearly objectionable from the viewpoint of individual fairness, especially for individuals to whom a subsequent model is applied, since it amounts to discarding an individual's data and replacing it with a random sample from the population $p_{X,Y}$. Constraint (5) seeks to prevent such gross deviations from occurring.

3 Theoretical Properties

3.1 Convexity

We first discuss conditions under which (7) is a convex or quasiconvex optimization problem.
Considering first the objective function, the distribution $p_{X,Y}$ is a given quantity while
$$p_{\hat{X},\hat{Y}}(\hat{x},\hat{y}) = \sum_{d,x,y} p_{D,X,Y}(d,x,y)\, p_{\hat{X},\hat{Y}|D,X,Y}(\hat{x},\hat{y}|d,x,y)$$
is seen to be a linear function of the mapping $p_{\hat{X},\hat{Y}|D,X,Y}$, i.e. the optimization variable. Hence if the statistical dissimilarity $\Delta(\cdot,\cdot)$ is convex in its first argument with the second fixed, then $\Delta(p_{\hat{X},\hat{Y}}, p_{X,Y})$ is a convex function of $p_{\hat{X},\hat{Y}|D,X,Y}$ by the affine composition property (Boyd & Vandenberghe, 2004). This condition is satisfied for example by all $f$-divergences (Csiszár & Shields, 2004), which are jointly convex in both arguments, and by all Bregman divergences (Banerjee et al., 2005). If instead $\Delta(\cdot,\cdot)$ is only quasiconvex in its first argument, a similar composition property implies that $\Delta(p_{\hat{X},\hat{Y}}, p_{X,Y})$ is a quasiconvex function of $p_{\hat{X},\hat{Y}|D,X,Y}$ (Boyd & Vandenberghe, 2004).

For discrimination constraint (1), the target distribution $p_{Y_T}$ is assumed to be given. The conditional distribution $p_{\hat{Y}|D}$ can be related to $p_{\hat{X},\hat{Y}|D,X,Y}$ as follows:
$$p_{\hat{Y}|D}(\hat{y}|d) = \sum_{\hat{x}} \sum_{x,y} p_{X,Y|D}(x,y|d)\, p_{\hat{X},\hat{Y}|D,X,Y}(\hat{x},\hat{y}|d,x,y).$$
Since $p_{X,Y|D}$ is given, $p_{\hat{Y}|D}$ is a linear function of $p_{\hat{X},\hat{Y}|D,X,Y}$. Hence by the same composition property as above, (1) is a convex constraint, i.e. specifies a convex set, if the distance function $J(\cdot,\cdot)$ is quasiconvex in its first argument. If constraint (2) is used instead of (1), then both arguments of $J$ are linear functions of $p_{\hat{X},\hat{Y}|D,X,Y}$. Hence (2) is convex if $J$ is jointly quasiconvex in both arguments.

Lastly, the distortion constraint (5) can be expanded explicitly in terms of $p_{\hat{X},\hat{Y}|D,X,Y}$ to yield
$$\sum_{\hat{x},\hat{y}} p_{\hat{X},\hat{Y}|D,X,Y}(\hat{x},\hat{y}|d,x,y)\, \delta\left((x,y),(\hat{x},\hat{y})\right) \leq c_{d,x,y}.$$
Thus (5) is a linear constraint in $p_{\hat{X},\hat{Y}|D,X,Y}$ regardless of the choice of distortion metric $\delta$. We summarize this subsection with the following proposition.

Proposition 1. Problem (7) is a (quasi)convex optimization if $\Delta(\cdot,\cdot)$ is (quasi)convex and $J(\cdot,\cdot)$ is quasiconvex in their respective first arguments (with the second arguments fixed). If discrimination constraint (2) is used in place of (1), then the condition on $J$ is that it be jointly quasiconvex in both arguments.

3.2 Generalizability of Discrimination Control

We now discuss the generalizability of discrimination guarantees (1) and (2) to unseen individuals, i.e. those to whom a model is applied. Recall from Section 2 that the proposed transformation retains the discriminatory variables $D$. We first consider the case where models trained on the transformed data to predict $\hat{Y}$ are allowed to depend on $D$. While such models may qualify as disparate treatment, the intent and effect is to better mitigate disparate impact resulting from the model. In this respect our proposal shares the same spirit as "fair" affirmative action in Dwork et al. (2012) (fairer on account of distortion constraint (5)). Later in this subsection we consider the case where $D$ is suppressed at classification time.

3.2.1 Maintaining the Discriminatory Variable

Assuming that predictive models for $\hat{Y}$ can depend on $D$, let $\tilde{Y}$ be the output of such a model based on $D$ and $\hat{X}$. To remove the separate issue of model accuracy, suppose for simplicity that the model provides a good approximation to the conditional distribution of $\hat{Y}$: $p_{\tilde{Y}|\hat{X},D}(\tilde{y}|\hat{x},d) \approx p_{\hat{Y}|\hat{X},D}(\tilde{y}|\hat{x},d)$.
Then for individuals in a protected group $D = d$, the conditional distribution of $\tilde{Y}$ is given by
$$p_{\tilde{Y}|D}(\tilde{y}|d) = \sum_{\hat{x}} p_{\tilde{Y}|\hat{X},D}(\tilde{y}|\hat{x},d)\, p_{\hat{X}|D}(\hat{x}|d) \approx \sum_{\hat{x}} p_{\hat{Y}|\hat{X},D}(\tilde{y}|\hat{x},d)\, p_{\hat{X}|D}(\hat{x}|d) = p_{\hat{Y}|D}(\tilde{y}|d).\tag{8}$$
Hence the model output $p_{\tilde{Y}|D}$ can also be controlled by (1) or (2).

On the other hand, if $D$ must be suppressed from the transformed data, perhaps to comply with legal requirements regarding its non-use, then a predictive model can depend only on $\hat{X}$ and approximate $p_{\hat{Y}|\hat{X}}$, i.e. $p_{\tilde{Y}|\hat{X},D}(\tilde{y}|\hat{x},d) = p_{\tilde{Y}|\hat{X}}(\tilde{y}|\hat{x}) \approx p_{\hat{Y}|\hat{X}}(\tilde{y}|\hat{x})$. In this case we have
$$p_{\tilde{Y}|D}(\tilde{y}|d) \approx \sum_{\hat{x}} p_{\hat{Y}|\hat{X}}(\tilde{y}|\hat{x})\, p_{\hat{X}|D}(\hat{x}|d),\tag{9}$$
which in general is not equal to $p_{\hat{Y}|D}(\tilde{y}|d)$ in (8). The quantity on the right-hand side of (9) is less straightforward to control. We address this issue in the next subsection.

3.2.2 Suppressing the Discriminatory Variable

In many applications the discriminatory variable cannot be revealed to the classification algorithm. In this case, the train-time discrimination guarantees are preserved at apply time if the Markov relationship $D \to \hat{X} \to \hat{Y}$ (i.e. $p_{\hat{Y}|\hat{X},D} = p_{\hat{Y}|\hat{X}}$) holds since, in this case,
$$p_{\tilde{Y}|D}(\tilde{y}|d) \approx \sum_{\hat{x}} p_{\hat{Y}|\hat{X}}(\tilde{y}|\hat{x})\, p_{\hat{X}|D}(\hat{x}|d) = p_{\hat{Y}|D}(\tilde{y}|d).\tag{10}$$
Thus, given that the distribution $p_{D,X,Y}$ is known, the guarantees provided during training still hold when applied to fresh samples if the additional constraint $p_{\hat{X},\hat{Y}|D,X,Y} = p_{\hat{Y}|\hat{X}}\, p_{\hat{X}|D,X,Y}$ is satisfied. We refer to (7) with this additional constraint as the suppressed optimization formulation (SOF). Alas, since the added constraint is non-convex, the SOF is not a convex program, despite being convex in $p_{\hat{X}|D,X,Y}$ for a fixed $p_{\hat{Y}|\hat{X}}$ and vice versa (i.e. it is biconvex).
We propose next two strategies for addressing this problem.

1. The first approach is to restrict $p_{\hat{Y}|\hat{X}} = p_{Y|X}$ and solve (7) for $p_{\hat{X}|D,X,Y}$. If $\Delta(\cdot,\cdot)$ is an $f$-divergence, then
$$\Delta\left(p_{X,Y}, p_{\hat{X},\hat{Y}}\right) = D_f\left(p_{X,Y} \,\middle\|\, p_{\hat{X},\hat{Y}}\right) = \sum_{x,y} p_{\hat{X},\hat{Y}}(x,y)\, f\!\left(\frac{p_{X,Y}(x,y)}{p_{\hat{X},\hat{Y}}(x,y)}\right) \geq \sum_x p_{\hat{X}}(x)\, f\!\left(\sum_y p_{\hat{Y}|\hat{X}}(y|x)\, \frac{p_{X,Y}(x,y)}{p_{\hat{X},\hat{Y}}(x,y)}\right) = D_f\left(p_X \,\middle\|\, p_{\hat{X}}\right),$$
where the inequality follows from convexity of $f$. Since the last quantity is achieved by setting $p_{\hat{Y}|\hat{X}} = p_{Y|X}$, this choice is optimal in terms of the objective function. It may, however, render the constraints in (7) infeasible. Assuming feasibility is maintained, this approach has the added benefit that a classifier $f_\theta(x) \approx p_{Y|X}(\cdot|x)$ can be trained using the original (non-perturbed) data and maintained for classification during apply time.

2. Alternatively, a solution can be found through alternating minimization: fix $p_{\hat{Y}|\hat{X}}$ and solve the SOF for $p_{\hat{X}|D,X,Y}$, and then fix $p_{\hat{X}|D,X,Y}$ as the optimal solution and solve the SOF for $p_{\hat{Y}|\hat{X}}$. The resulting sequence of values of the objective function is non-increasing, but may converge to a local minimum.

3.3 A Note on Estimation and Discrimination

There is a close relationship between estimation and discrimination. If the discriminatory variable $D$ can be reliably estimated from the outcome variable $Y$, then it is reasonable to expect that the discrimination control constraint (1) does not hold for small values of $\epsilon_{y,d}$. We make this intuition precise in the next proposition when $J$ is given by (3). More specifically, we prove that if the advantage of estimating $D$ from $Y$ over a random guess is large, then there must exist values of $d$ and $y$ such that $J(p_{Y|D}(y|d), p_{Y_T}(y))$ is also large.
Thus, standard estimation methods can be used to detect the presence of discrimination: if an estimation algorithm can estimate $D$ from $Y$, then discrimination may be present. Conversely, if discrimination control is successful, then no estimator can significantly improve upon a random guess when estimating $D$ from $Y$.

We denote the highest probability of correctly guessing $D$ from an observation of $Y$ by $P_c(D|Y)$, where
$$P_c(D|Y) \triangleq \max_{D \to Y \to \hat{D}} \Pr\left(D = \hat{D}\right),\tag{11}$$
and the maximum is taken across all estimators $p_{\hat{D}|Y}$ that satisfy the Markov condition $D \to Y \to \hat{D}$. For $D$ and $Y$ defined over finite supports, this is achieved by the maximum a posteriori (MAP) estimator and, consequently,
$$P_c(D|Y) = \sum_{y \in \mathcal{Y}} p_Y(y) \max_{d \in \mathcal{D}} p_{D|Y}(d|y).\tag{12}$$
Let $p_D^*$ be the probability of the most likely outcome of $D$, i.e. $p_D^* \triangleq \max_{d \in \mathcal{D}} p_D(d)$. The (multiplicative) advantage over a random guess is given by
$$\mathrm{Adv}(D|Y) \triangleq \frac{P_c(D|Y)}{p_D^*}.\tag{13}$$

Proposition 2. For $D$ and $Y$ defined over finite support sets, if
$$\mathrm{Adv}(D|Y) > 1 + \epsilon\tag{14}$$
then for any $p_{Y_T}$, there exist $y \in \mathcal{Y}$ and $d \in \mathcal{D}$ such that
$$\left|\frac{p_{Y|D}(y|d)}{p_{Y_T}(y)} - 1\right| > \epsilon.\tag{15}$$

Proof. We prove the contrapositive of the statement of the proposition. Assume that
$$\left|\frac{p_{Y|D}(y|d)}{p_{Y_T}(y)} - 1\right| \leq \epsilon \quad \forall\, y \in \mathcal{Y},\ d \in \mathcal{D}.\tag{16}$$
Then
$$P_c(D|Y) = \sum_{y \in \mathcal{Y}} \max_{d \in \mathcal{D}} p_{D|Y}(d|y)\, p_Y(y) = \sum_{y \in \mathcal{Y}} \max_{d \in \mathcal{D}} p_{Y|D}(y|d)\, p_D(d) \leq \sum_{y \in \mathcal{Y}} \max_{d \in \mathcal{D}} (1+\epsilon)\, p_{Y_T}(y)\, p_D(d) = (1+\epsilon) \max_{d \in \mathcal{D}} p_D(d),$$
where the inequality follows by noting that (16) implies $p_{Y|D}(y|d) \leq (1+\epsilon)\, p_{Y_T}(y)$ for all $y \in \mathcal{Y}$, $d \in \mathcal{D}$. Rearranging the terms of the last equality, we arrive at
$$\frac{P_c(D|Y)}{\max_{d \in \mathcal{D}} p_D(d)} \leq 1 + \epsilon,$$
and the result follows by observing that the left-hand side is the definition of $\mathrm{Adv}(D|Y)$.
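The quantities in Proposition 2 are directly computable from a joint pmf of $(D, Y)$. A minimal NumPy sketch (the function name and example pmfs are our own):

```python
import numpy as np

def advantage(p_dy):
    """Adv(D|Y) = P_c(D|Y) / p_D^* of Eqs. (12)-(13), where P_c(D|Y) is
    the success probability of the MAP estimate of D from Y.

    p_dy is the joint pmf of (D, Y) with shape (|D|, |Y|).
    """
    p_c = p_dy.max(axis=0).sum()      # sum_y p_Y(y) max_d p_{D|Y}(d|y)
    p_star = p_dy.sum(axis=1).max()   # probability of the a priori best guess
    return p_c / p_star

# Y nearly reveals D, so the advantage is large and, by Proposition 2,
# constraint (1) must fail for any epsilon below Adv(D|Y) - 1.
p_strong = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
# Y independent of D: the MAP estimate is no better than a blind guess.
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
```

Here $\mathrm{Adv}(D|Y) = 1.8$ for the first pmf and exactly $1$ for the second, matching the intuition that successful discrimination control leaves an estimator of $D$ from $Y$ with no advantage.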
3.4 Training and Application Considerations

The proposed optimization framework has two modes of operation (Fig. 1): train and apply. In train mode, the optimization problem (7) is solved in order to determine a mapping $p_{\hat{X},\hat{Y}|X,Y,D}$ for randomizing the training set. The randomized training set, in turn, is used to fit a classification model $f_\theta(\hat{X}, D)$ that approximates $p_{\hat{Y}|\hat{X},D}$, where $\theta$ are the parameters of the model. At apply time, a new data point $(X, D)$ is received and transformed into $(\hat{X}, D)$ through a randomized mapping $p_{\hat{X}|X,D}$. The mapping $p_{\hat{X}|D,X}$ is given by marginalizing over $Y, \hat{Y}$:
$$p_{\hat{X}|D,X}(\hat{x}|d,x) = \sum_{y,\hat{y}} p_{\hat{X},\hat{Y}|X,Y,D}(\hat{x},\hat{y}|x,y,d)\, p_{Y|X,D}(y|x,d).\tag{17}$$
Assuming that the variable $D$ is not suppressed, and that the marginals are known, the utility and discrimination guarantees set during train time still hold during apply time, as discussed in Section 3.2. However, the distortion control will inevitably change, since the mapping has been marginalized over $Y$. More specifically, the bound on the expected distortion for each sample becomes
$$\mathbb{E}\left[\mathbb{E}\left[\delta\left((x,Y),(\hat{X},\hat{Y})\right) \,\middle|\, D = d, X = x, Y\right] \,\middle|\, D = d, X = x\right] \leq \sum_{y \in \mathcal{Y}} p_{Y|X,D}(y|x,d)\, c_{x,y,d} \triangleq c_{x,d}.\tag{18}$$
If the distortion control values $c_{x,y,d}$ are independent of $y$, then the upper bound on distortion set during training time still holds during apply time. Otherwise, (18) provides a bound on individual distortion at apply time. The same guarantee holds for the case when $D$ is suppressed.

3.5 Robustness

We may also consider the case where the distribution $p_{D,X,Y}$ used to determine the transformation differs from the distribution $q_{D,X,Y}$ of test samples. This occurs, for example, when $p_{D,X,Y}$ is the empirical distribution computed from $n$ i.i.d. samples from an unknown distribution $q_{D,X,Y}$.
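The reduction in Eq. (17) of the learned mapping to its apply-time form can be sketched as follows (NumPy; the array shapes, names, and the identity-mapping sanity check are our own conventions, not from the paper):

```python
import numpy as np

def apply_time_mapping(M, p_dxy):
    """Reduce the learned mapping to its apply-time form, Eq. (17):
    marginalize over yhat and average over y with weights p_{Y|X,D}.

    M[d, x, y, xh, yh] = p(xh, yh | d, x, y), shape (nD, nX, nY, nX, nY).
    Returns out[d, x, xh] = p(xh | d, x), shape (nD, nX, nX).
    """
    p_y_given_xd = p_dxy / p_dxy.sum(axis=2, keepdims=True)   # p_{Y|X,D}
    # einsum sums over y (weighted by p_{Y|X,D}) and over yhat (plainly)
    return np.einsum('dxyhv,dxy->dxh', M, p_y_given_xd)

# Sanity check with the identity mapping (xhat, yhat) = (x, y): the reduced
# mapping must keep X unchanged with probability one.
nD, nX, nY = 2, 2, 2
M_id = np.zeros((nD, nX, nY, nX, nY))
for d in range(nD):
    for x in range(nX):
        for y in range(nY):
            M_id[d, x, y, x, y] = 1.0
out = apply_time_mapping(M_id, np.full((nD, nX, nY), 1.0 / 8))
```

Each slice `out[d, x]` is a valid pmf over $\hat{x}$, ready to randomize incoming test points for which $Y$ is unavailable.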
In this situation, discrimination control and utility are still guaranteed for samples drawn from $q_{D,X,Y}$ that are transformed using $p_{\hat{Y},\hat{X}|X,Y,D}$, where the latter is obtained by solving (7) with $p_{D,X,Y}$. In particular, denoting by $q_{\hat{Y}|D}$ and $q_{\hat{X},\hat{Y}}$ the corresponding distributions for $\hat{Y}, \hat{X}$ and $D$ when $q_{D,X,Y}$ is transformed using $p_{\hat{Y},\hat{X}|X,Y,D}$, we have $J\left(p_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right) \to J\left(q_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right)$ and $\Delta\left(p_{X,Y}, p_{\hat{X},\hat{Y}}\right) \to \Delta\left(q_{X,Y}, q_{\hat{X},\hat{Y}}\right)$ for $n$ sufficiently large (the distortion control constraints (5) depend only on $p_{\hat{Y},\hat{X}|X,Y,D}$). The next proposition provides an estimate of the rate of this convergence in terms of $n$, assuming $p_{Y,D}(y,d)$ is fixed and bounded away from zero. Its proof can be found in the Appendix.

Proposition 3. Let $p_{D,X,Y}$ be the empirical distribution obtained from $n$ i.i.d. samples that is used to determine the mapping $p_{\hat{Y},\hat{X}|X,Y,D}$, and $q_{D,X,Y}$ be the true distribution of the data. In addition, denote by $q_{D,\hat{X},\hat{Y}}$ the joint distribution after applying $p_{\hat{Y},\hat{X}|X,Y,D}$ to samples from $q_{D,X,Y}$. If for all $y \in \mathcal{Y}$, $d \in \mathcal{D}$ we have $p_{Y,D}(y,d) > 0$,
$$J\left(p_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right) \leq \epsilon,$$
where $J$ is given in (3), and
$$\Delta\left(p_{X,Y}, p_{\hat{X},\hat{Y}}\right) = \sum_{x,y} \left|p_{X,Y}(x,y) - p_{\hat{X},\hat{Y}}(x,y)\right| \leq \mu,$$
then with probability $1 - \beta$,
$$J\left(q_{\hat{Y}|D}(y|d), p_{Y_T}(y)\right) = \epsilon + O\left(\sqrt{\tfrac{1}{n}\log\tfrac{n}{\beta}}\right),\tag{19}$$
$$\Delta\left(q_{X,Y}, q_{\hat{X},\hat{Y}}\right) = \mu + O\left(\sqrt{\tfrac{1}{n}\log\tfrac{n}{\beta}}\right).\tag{20}$$

Proposition 3 guarantees that, as long as $n$ is sufficiently large, the utility and discrimination control guarantees will approximately hold when $p_{\hat{X},\hat{Y}|Y,X,D}$ is applied to fresh samples drawn from $q_{D,X,Y}$. In particular, the utility and discrimination guarantees will converge to the ones used as parameters in the optimization at a rate that is at least $\Theta\left(\sqrt{\tfrac{1}{n}\log n}\right)$.
The distortion control guarantees (5) are a property of the mapping $p_{\hat{X},\hat{Y}|Y,X,D}$, and do not depend on the distribution of the data. Observe that hidden within the big-O terms in Proposition 3 are constants that depend on the probability of the least likely symbol and the alphabet size. The exact characterization of these constants can be found in the proof of the proposition in the Appendix. Moreover, the upper bounds become loose if $p_{Y,D}(y,d)$ can be made arbitrarily small; thus, it is necessary to assume that $p_{Y,D}(y,d)$ is fixed and bounded away from zero. In addition, if the dimensionality of the support sets of $D$, $X$, and $Y$ is large and the number of samples $n$ is limited, then a dimensionality reduction step (e.g., clustering) may be necessary in order to ensure that discrimination control and utility are adequately preserved at test time. Proposition 3 and its proof can be used to provide an explicit estimate of the required reduction. Finally, we also note that if there are insufficient samples to reliably estimate $q_{D,X,Y}(d,x,y)$ for certain values $(d,x,y) \in \mathcal{D} \times \mathcal{X} \times \mathcal{Y}$, then, for those groups $(d,x)$, it is statistically challenging to verify discrimination, and thus control may not be meaningful.

4 Applications to Datasets

We apply our proposed data transformation approach to two different datasets to demonstrate its capabilities. We approximate $p_{D,X,Y}$ using the empirical distribution of $(D, X, Y)$ in the datasets, specialize the optimization (7) according to the needs of the application, and solve (7) using a standard convex solver (Diamond & Boyd, 2016).

4.1 ProPublica's COMPAS Recidivism Data

Recidivism refers to a person's relapse into criminal behavior. It has been found that about two-thirds of prisoners in the US are re-arrested after release (Durose et al., 2014).
It is therefore important to understand the recidivistic tendencies of incarcerated individuals who are considered for release at several points in the criminal justice system (bail hearings, parole, etc.). Automated risk scoring mechanisms have been developed for this purpose and are currently used in courtrooms in the US, in particular the proprietary COMPAS tool by Northpointe (Northpointe Inc.). Recently, ProPublica published an article that investigates racial bias in the COMPAS algorithm (ProPublica, 2016), releasing an accompanying dataset that includes COMPAS risk scores, recidivism records, and other relevant attributes (ProPublica, 2017). A basic finding is that the COMPAS algorithm tends to assign higher scores to African-American individuals, a reflection of the a priori higher prevalence of recidivism in this group. The article goes on to demonstrate unequal false positive and false negative rates between African-Americans and Caucasian-Americans, which has since been shown by Chouldechova (2016) to be a necessary consequence of the calibration of the model and the difference in a priori prevalence.

Table 1: ProPublica dataset features.

Feature               Values                          Comments
Recidivism (binary)   {0, 1}                          1 if re-offended, 0 otherwise
Gender                {Male, Female}
Race                  {Caucasian, African-American}   Races with small samples removed
Age category          {<25, 25-45, >45}               Years of age
Charge degree         {Felony, Misdemeanor}           For the current arrest
Prior counts          {0, 1-3, >3}                    Number of prior crimes

Figure 2: Objective vs. discrimination parameter $\epsilon$ for distortion constraint $c = 0.25$.

In this work, our interest is not in the debate surrounding the COMPAS algorithm but rather in the underlying recidivism data (ProPublica, 2017).
Using the proposed data transformation approach, we demonstrate the technical feasibility of mitigating the disparate impact of recidivism records on different demographic groups while also preserving utility and individual fairness. (We make no comment on the associated societal considerations.) From ProPublica's dataset, we select severity of charge, number of prior crimes, and age category to be the decision variables ($X$). The outcome variable ($Y$) is a binary indicator of whether the individual recidivated (re-offended), and race and gender are set to be the discriminatory variables ($D$). The encoding of the decision and discrimination variables is described in Table 1. The dataset was processed to contain around 5k records.

Specific Form of Optimization. We specialize our general formulation in (7) by setting the utility measure $\Delta(p_{X,Y}, p_{\hat{X},\hat{Y}})$ to be the KL divergence $D_{\mathrm{KL}}(p_{X,Y} \| p_{\hat{X},\hat{Y}})$. For discrimination control, we use (2), with $J$ given in (3), while fixing $\epsilon_{y,d_1,d_2} = \epsilon$. For the sake of simplicity, we use the expected distortion constraint in (5) with $c_{d,x,y} = c$ uniformly. The distortion function $\delta$ in (5) has the following form. Jumps of more than one category in age and prior counts are heavily discouraged by setting a high distortion penalty ($10^4$) for such transformations. We impose the same penalty on increases in recidivism (change of $Y$ from 0 to 1). Both these choices are made to promote individual fairness. Furthermore, for every jump to the next category for age and prior counts, a penalty of 1 is assessed, and a similar jump incurs a penalty of 2 for charge degree. Reduction in recidivism (1 to 0) has a penalty of 2. The total distortion for each individual is the sum of squares of distortions for each attribute of $X$.
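The distortion rule described above can be sketched as follows. The ordinal category encodings and the decision to include the outcome penalty inside the sum of squares are our assumptions for illustration; the paper only states the per-attribute penalties:

```python
# Sketch of the COMPAS distortion function delta described above. Age and
# prior-count categories are encoded as ordinal indices (assumption); the
# outcome penalty is included in the sum of squares (assumption).
BIG = 10**4  # heavy penalty that effectively forbids the transformation

def distortion(x, x_hat, y, y_hat):
    """x = (age_idx, priors_idx, charge_degree)."""
    age, priors, charge = x
    age_h, priors_h, charge_h = x_hat
    pen = []
    for a, b in [(age, age_h), (priors, priors_h)]:
        jump = abs(a - b)
        pen.append(BIG if jump > 1 else jump)   # 1 per single-step jump
    pen.append(2 if charge != charge_h else 0)  # charge-degree change
    if y == 0 and y_hat == 1:
        pen.append(BIG)                         # increasing recidivism
    elif y == 1 and y_hat == 0:
        pen.append(2)                           # reducing recidivism
    return sum(v**2 for v in pen)               # sum of squared penalties

print(distortion((0, 1, 'F'), (1, 1, 'F'), 1, 0))  # age +1, Y: 1 -> 0 => 5
```

For example, moving age up one category (penalty 1) while reducing recidivism (penalty 2) costs $1^2 + 2^2 = 5$, while any two-category jump incurs $10^8$ and is effectively ruled out by the expected-distortion constraint.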
These distortion values were chosen for demonstration purposes to be reasonable in our judgment, and can easily be tuned according to the needs of a practitioner.

Results. We computed the optimal objective value (i.e., KL divergence) resulting from solving (7) for different values of the discrimination control parameter $\epsilon$, with the expected distortion constraint $c = 0.25$. Around $\epsilon = 0.2$, no feasible solution can be found that also satisfies the distortion constraint. Above $\epsilon = 0.59$, the discrimination control is loose enough to be satisfied by the original dataset with just an identity mapping ($D_{\mathrm{KL}}(p_{X,Y} \| p_{\hat{X},\hat{Y}}) = 0$). In between, the optimal value varies as a smooth function (Fig. 2).

Figure 3: Conditional mappings $p_{\hat{X},\hat{Y}|X,Y,D}$ with $\epsilon = 0.1$ and $c = 0.5$ for: (left) $D$ = (African-American, Male), less than 25 years ($X$), $Y = 1$; (middle) $D$ = (African-American, Male), less than 25 years ($X$), $Y = 0$; and (right) $D$ = (Caucasian, Male), less than 25 years ($X$), $Y = 1$. Original charge degree and prior counts ($X$) are shown on the vertical axis, while the transformed age category, charge degree, prior counts, and recidivism ($\hat{X}, \hat{Y}$) are represented along the horizontal axis. The charge degree F indicates felony and M indicates misdemeanor. Colors indicate mapping probability values. Columns are included only if the sum of their values exceeds 0.05.

Figure 4: Top row: percentage recidivism rates in the original dataset as a function of charge degree, age, and prior counts for the overall population (i.e., $p_{Y|X}(1|x)$) and for different groups ($p_{Y|X,D}(1|x,d)$). Bottom row: change in percentages due to the transformation, i.e., $p_{\hat{Y}|\hat{X},D}(1|x,d) - p_{Y|X,D}(1|x,d)$, etc. Values for cohorts of charge degree, age, and prior counts with fewer than 20 samples are not shown. The discrimination and distortion constraints are set to $\epsilon = 0.1$ and $c = 0.5$, respectively.
We set $c = 0.5$ and $\epsilon = 0.1$ for the rest of the experiments. The optimal value of the utility measure (KL divergence) was 0.021. In order to evaluate whether discrimination control was achieved as expected, we examine the dependence of the outcome variable on the discrimination variable before and after the transformation. Note that to have zero disparate impact, we would like $\hat{Y}$ to be independent of $D$; in practice, this independence is controlled by the discrimination control parameter $\epsilon$. The corresponding marginals $p_{Y|D}$ and $p_{\hat{Y}|D}$ are illustrated in Table 2, where clearly $\hat{Y}$ is less dependent on $D$ than $Y$ is. In particular, since an increase in recidivism is heavily penalized, the net effect of the randomized transformation is to decrease the recidivism risk of males, and particularly African-American males.

The mapping $p_{\hat{X},\hat{Y}|X,Y,D}$ produced by the optimization (7) can reveal important insights on the nature of disparate impact and how to mitigate it. We illustrate this by exploring $p_{\hat{X},\hat{Y}|X,Y,D}$ for the COMPAS dataset next. Fig. 3 displays the conditional mapping restricted to certain socio-demographic groups. First consider young males who are African-American (left-most plot). This group has a high recidivism rate, and hence the most prominent action of the mapping (besides the identity transformation) is to change the recidivism value from 1 (recidivism) to 0 (no recidivism). The next prominent action is to change the age category from young to middle-aged (25 to 45 years). This effectively reduces the average value of $\hat{Y}$ for young African-Americans, since the mapping for young males who are African-American and do not recidivate (middle plot) is essentially the identity mapping, with the exception of changing the age category to middle-aged. This is expected, since increasing recidivism is heavily penalized. For young Caucasian males who recidivate, the action of the proposed transformation appears similar to that for young African-American males who recidivate, i.e., the outcome variable is either changed to 0 or the age category is changed to middle age. However, the probabilities of the transformations are lower, since Caucasian males have, according to the dataset, a lower recidivism rate.

Table 2: Dependence of the outcome variable on the discrimination variable before and after the proposed transformation. F and M indicate Female and Male; A-A and C indicate African-American and Caucasian.

D                   Before transformation              After transformation
(gender, race)      p_{Y|D}(0|d)    p_{Y|D}(1|d)       p_{Y^|D}(0|d)    p_{Y^|D}(1|d)
F, A-A              0.607           0.393              0.607            0.393
F, C                0.633           0.367              0.633            0.367
M, A-A              0.407           0.593              0.596            0.404
M, C                0.570           0.430              0.596            0.404

Figure 5: Top row: high-income percentages in the original dataset as a function of age and education for the overall population (i.e., $p_{Y|X}(1|x)$) and for different groups ($p_{Y|X,D}(1|x,d)$). Bottom row: change in percentages due to the transformation, i.e., $p_{\hat{Y}|\hat{X},D}(1|x,d) - p_{Y|X,D}(1|x,d)$, etc. Age-education pairs with fewer than 20 samples are not shown.

We apply this conditional mapping on the dataset (one trial) and present the results in Fig. 4. The original percentage recidivism rates are also shown in the top panel of the plot for comparison. Because of our constraint that disallows changing the outcome to 1, a demographic group's recidivism rate can (indirectly) increase only through changes to the decision variables ($X$).
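A quick sanity check on the Table 2 numbers: the spread of $p(Y{=}1 \mid d)$ across the four (gender, race) groups shrinks substantially after the transformation. (The spread is only a coarse summary; the actual constraint uses the ratio measure $J$ in (3).)

```python
# Spread of p(Y=1|d) across the four (gender, race) groups, taken directly
# from Table 2, before and after the transformation.
before = {"F,A-A": 0.393, "F,C": 0.367, "M,A-A": 0.593, "M,C": 0.430}
after  = {"F,A-A": 0.393, "F,C": 0.367, "M,A-A": 0.404, "M,C": 0.404}

spread_before = max(before.values()) - min(before.values())
spread_after = max(after.values()) - min(after.values())
print(round(spread_before, 3), round(spread_after, 3))  # 0.226 vs 0.037
assert spread_after < spread_before
```

The spread falls from 0.226 to 0.037, consistent with the claim that $\hat{Y}$ is far less dependent on $D$ than $Y$ is.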
We note that the average percentage change in recidivism rates across all demographics is negative when the discrimination variables are marginalized out (left-most column). The maximum decreases in recidivism rates are observed for African-American males, since they have the highest value of $p_{Y|D}(1|d)$ (cf. Table 2). Contrast this with Caucasian females (middle column), who have virtually no change in their recidivism rates, since their rates are a priori close to the final ones (see Table 2). Another interesting observation is that middle-aged Caucasian males with 1 to 3 prior counts see an increase in percentage recidivism. This is consistent with the mapping seen in Fig. 3 (middle), and is an example of the indirect introduction of positive outcome variables in a cohort as discussed above.

4.2 UCI Adult Data

We apply our optimization approach to the well-known UCI Adult Dataset (Lichman, 2013) as a second illustration of its capabilities. The features were categorized as discriminatory variables ($D$): Race (White, Minority) and Gender (Male, Female); decision variables ($X$): Age (quantized to decades) and Education (quantized to years); and response variable ($Y$): Income (binary). While the response variable considered here is income, the dataset could be regarded as a simplified proxy for analyzing other financial outcomes such as credit approvals.

Specific Form of Optimization. We use the $\ell_1$-distance (twice the total variation) (Pollard, 2002) to measure utility, $\Delta\big(p_{X,Y}, p_{\hat{X},\hat{Y}}\big) = \sum_{x,y} \left| p_{X,Y}(x,y) - p_{\hat{X},\hat{Y}}(x,y) \right|$. For discrimination control, we use (1), with $J$ given in (3), and we set $\epsilon_{y,d} = \epsilon$ in (1). We use the distortion function in (5), and write $x = (a, e)$ for an age-education pair and $\hat{x} = (\hat{a}, \hat{e})$ for a corresponding transformed pair.
The distortion function returns (i) $v_1$ if income is decreased, age is not changed, and education is increased by at most 1 year; (ii) $v_2$ if age is changed by a decade and education is increased by at most 1 year, regardless of the change in income; (iii) $v_3$ if age is changed by more than a decade, or education is lowered by any amount or increased by more than 1 year; and (iv) 0 in all other cases. We set $(v_1, v_2, v_3) = (1, 2, 3)$ with corresponding distance thresholds for $\delta = 0$ as $(0.9, 1.9, 2.9)$ and corresponding probabilities ($c_{d,x,y}$) as $(0.1, 0.05, 0)$ in (5). As a consequence, decreases in income, small changes in age, and small increases in education (events (i), (ii)) are permitted with small probabilities, while larger changes in age and education (event (iii)) are not allowed at all. We note that the parameter settings are selected with the purpose of demonstrating our approach, and would change depending on the practitioner's requirements or guidelines.

Results. For the remainder of the results presented here, we set $\epsilon = 0.15$; the optimal value of the utility measure ($\ell_1$ distance) was 0.014. We apply the conditional mapping, generated as the optimal solution to (7), to transform the age, education, and income values of each sample in the dataset. The result of a single realization of this randomization is given in Fig. 5, where we show percentages of high-income individuals as a function of age and education before and after the transformation. The original age and education ($X$) are plotted throughout Fig. 5 for ease of comparison. Note that changes in individual percentages may be larger than a factor of $1 \pm \epsilon$, because discrimination is not controlled by (1) at the level of age-education cohorts. The top left panel indicates that income is higher for more educated and middle-aged people, as expected.
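The case analysis (i)-(iv) above can be sketched as a small function. Age in years and education in years are our assumed encodings, and the thresholds/probabilities that plug $\delta$ into constraint (5) are omitted here:

```python
# Sketch of the Adult distortion rule (i)-(iv): a = age in years (decades
# matter), e = education in years, inc = binary income. The values
# (v1, v2, v3) = (1, 2, 3) follow the text; encodings are assumptions.
def adult_distortion(a, e, inc, a_hat, e_hat, inc_hat):
    da, de = abs(a_hat - a), e_hat - e
    if da > 10 or de < 0 or de > 1:
        return 3    # (iii) large changes: effectively disallowed (c = 0)
    if da == 10:
        return 2    # (ii) age moved by one decade, regardless of income
    if inc_hat < inc:
        return 1    # (i) income decreased, age unchanged
    return 0        # (iv) everything else

print(adult_distortion(30, 12, 1, 30, 12, 0))  # income drop only -> 1
```

Checking the cases in order matters: event (iii) must be tested first so that, e.g., an education decrease is always assigned $v_3$ even when income also drops.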
The second column shows that high-income percentages are significantly lower for females and are accordingly increased by the transformation, most strongly for educated older women and younger women with only 8 years of education, and less so for other younger women. Conversely, the percentages are decreased for males, but by much smaller magnitudes. Minorities receive small percentage increases, but less than women, in part because they are a more heterogeneous group consisting of both genders.

5 Conclusions

We proposed a flexible, data-driven optimization framework for probabilistically transforming data in order to reduce algorithmic discrimination, and applied it to two datasets. The differences between the original and transformed datasets revealed interesting discrimination patterns, as well as corrective adjustments for controlling discrimination while preserving utility of the data. Despite being programmatically generated, the optimized transformation satisfied properties that are sensible from a socio-demographic standpoint, reducing, for example, recidivism risk for African-American males in the recidivism dataset, and increasing income for well-educated females in the UCI Adult dataset. The flexibility of the approach allows numerous extensions using different measures and constraints for utility preservation, discrimination, and individual distortion control. Investigating such extensions, developing theoretical characterizations based on the proposed framework, and quantifying the impact of the transformations on specific supervised learning tasks will be pursued in future work.

Appendix

A Proof of Proposition 3

The proposition is a consequence of the following elementary lemma.

Lemma 1.
Let $p(x)$, $q(x)$, and $r(x)$ be three fixed probability mass functions with the same discrete and finite support set $\mathcal{X}$, $c_1 \triangleq \min_{x \in \mathcal{X}} \frac{p(x)(1 - p(x))}{3(1 + p(x))^2} > 0$, and $p_m \triangleq \min_x p(x) > 0$. If

$$D_{\mathrm{KL}}(p \| q) \le \tau \le c_1 \tag{21}$$

and, for all $x \in \mathcal{X}$,

$$\gamma_1 \le \frac{p(x)}{r(x)} \le \gamma_2, \tag{22}$$

then, defining $g(\tau, p_m) \triangleq \sqrt{3\tau/p_m}$, for all $x \in \mathcal{X}$,

$$\gamma_1 \exp\big(-g(\tau, p_m)\big) \le \frac{q(x)}{r(x)} \le \gamma_2 \exp\big(g(\tau, p_m)\big). \tag{23}$$

Proof. We assume $\tau > 0$; otherwise $p(x) = q(x)$ for all $x \in \mathcal{X}$ and we are done. From (21) and the Data Processing Inequality for KL divergence, for any $x \in \mathcal{X}$,

$$p(x) \log \frac{p(x)}{q(x)} + (1 - p(x)) \log \frac{1 - p(x)}{1 - q(x)} \le \tau. \tag{24}$$

Let $x$ be fixed and, in order to simplify notation, denote $c \triangleq p(x)$. Assuming, without loss of generality, $q(x) = c \exp\!\left(-\frac{\alpha\tau}{c}\right)$, then (24) implies

$$f(\alpha) \triangleq \alpha - \frac{1 - c}{\tau} \log\!\left(\frac{1 - c \exp\!\left(-\frac{\alpha\tau}{c}\right)}{1 - c}\right) \le 1. \tag{25}$$

The Taylor series of $f(\alpha)$ around 0 has the form

$$f(\alpha) = \sum_{n=2}^{\infty} \frac{(-1)^n}{n!} \left(\frac{\tau}{(1-c)c}\right)^{n-1} A_{n-1}(c)\, \alpha^n, \tag{26}$$

where $A_n(c)$ is the Eulerian polynomial, which is positive for $c > 0$ and satisfies $A_1(c) = 1$ and $A_2(c) = 1 + c$.

First, assume $\alpha \le 0$. Then $f(\alpha)$ can be lower-bounded by the first term in its Taylor series expansion, since all the terms in the series are non-negative. From (25),

$$\frac{\tau \alpha^2}{2(1-c)c} \le f(\alpha) \le 1. \tag{27}$$

Consequently,

$$\alpha \ge -\sqrt{\frac{2(1-c)c}{\tau}}. \tag{28}$$

Now assume $\alpha \ge 0$. Then the Taylor series (26) becomes an alternating series, and $f(\alpha)$ can be lower-bounded by its first two terms:

$$\frac{\tau \alpha^2}{2(1-c)c} - \frac{(1+c)\tau^2 \alpha^3}{6(1-c)^2 c^2} \le f(\alpha) \le 1. \tag{29}$$

The term on the left-hand side of the first inequality satisfies

$$\frac{\tau \alpha^2}{3(1-c)c} \le \frac{\tau \alpha^2}{2(1-c)c} - \frac{(1+c)\tau^2 \alpha^3}{6(1-c)^2 c^2} \tag{30}$$

as long as $\alpha \le \frac{c(1-c)}{(1+c)\tau}$.
Since the left-hand side of (30) is larger than 1 when $\alpha > \sqrt{3(1-c)c/\tau}$, it is a valid lower bound for $f(\alpha)$ on the entire interval where $f(\alpha) \le 1$ and $\alpha \ge 0$ as long as

$$\sqrt{\frac{3(1-c)c}{\tau}} \le \frac{c(1-c)}{(1+c)\tau} \;\Leftrightarrow\; \tau \le \frac{c(1-c)}{3(1+c)^2}, \tag{31}$$

which holds by the assumption in the Lemma. Thus,

$$\alpha \le \sqrt{\frac{3(1-c)c}{\tau}}, \tag{32}$$

and combining the previous equation with (28),

$$-\sqrt{\frac{2(1-c)c}{\tau}} \le \alpha \le \sqrt{\frac{3(1-c)c}{\tau}}. \tag{33}$$

Finally, since $\frac{q(x)}{p(x)} = \exp\!\left(-\frac{\alpha\tau}{p(x)}\right)$, the previous inequalities give

$$\exp\!\left(-\sqrt{\frac{3(1-p(x))\tau}{p(x)}}\right) \le \frac{q(x)}{p(x)} \le \exp\!\left(\sqrt{\frac{2(1-p(x))\tau}{p(x)}}\right), \tag{34}$$

and the result follows by further lower-bounding the left-hand side using $\gamma_1 r(x) \le p(x)$ and upper-bounding the right-hand side using $p(x) \le \gamma_2 r(x)$.

The previous Lemma allows us to derive the result presented in Proposition 3.

Proof of Proposition 3. Let $m \triangleq |\mathcal{X}||\mathcal{Y}||\mathcal{D}|$. The distribution $p_{D,X,Y}$ is the type (Cover & Thomas, 2006, Chap. 11) of $n$ observations of $q_{D,X,Y}$. Then,¹ from (Csiszár & Shields, 2004, Corollary 2.1), for $\tau > 0$,

$$\Pr\big(D_{\mathrm{KL}}(p_{D,X,Y} \| q_{D,X,Y}) \ge \tau\big) \le \binom{n+m-1}{m-1} e^{-n\tau} \le \left(\frac{e(n+m)}{m}\right)^{m} e^{-n\tau}.$$

From the Data Processing Inequality for KL divergence, $D_{\mathrm{KL}}(p_{D,\hat{Y}} \| q_{D,\hat{Y}}) \le D_{\mathrm{KL}}(p_{D,X,Y} \| q_{D,X,Y})$, and, consequently,

$$\Pr\big(D_{\mathrm{KL}}(p_{D,\hat{Y}} \| q_{D,\hat{Y}}) \le \tau\big) \ge \Pr\big(D_{\mathrm{KL}}(p_{D,X,Y} \| q_{D,X,Y}) \le \tau\big) \ge 1 - \left(\frac{e(n+m)}{m}\right)^{m} e^{-n\tau}.$$

If $D_{\mathrm{KL}}(p_{D,\hat{Y}} \| q_{D,\hat{Y}}) \le \tau$, then, since $0 \le D_{\mathrm{KL}}(p_D \| q_D)$, we have $D_{\mathrm{KL}}\big(p_{\hat{Y}|D}(\cdot|d) \,\|\, q_{\hat{Y}|D}(\cdot|d)\big) \le \frac{\tau}{p_D(d)}$ for all $d \in \mathcal{D}$. Choosing

$$\tau = \frac{1}{n} \log\!\left(\frac{1}{\beta}\left(\frac{e(n+m)}{m}\right)^{m}\right), \tag{35}$$

then, with probability $1 - \beta$, for all $d \in \mathcal{D}$,

$$D_{\mathrm{KL}}\big(p_{\hat{Y}|D}(\cdot|d) \,\|\, q_{\hat{Y}|D}(\cdot|d)\big) \le \frac{1}{n\, p_D(d)} \log\!\left(\frac{1}{\beta}\left(\frac{e(n+m)}{m}\right)^{m}\right).$$

¹Other bounds on the KL divergence between an observed type and its distribution could be used, such as (Cover & Thomas, 2006, Thm. 11.2.2), without changing the asymptotic result.
Assuming that $m$ and $c_m \triangleq \min_{y \in \mathcal{Y}, d \in \mathcal{D}} p_{D,\hat{Y}}(d,y) > 0$ are constant, from the proof of Lemma 1 and, more specifically, inequalities (34), as long as $\tau \le \min_{d,y} \frac{p_{\hat{Y},D}(y,d)\big(1 - p_{\hat{Y}|D}(y|d)\big)}{3\big(1 + p_{\hat{Y}|D}(y|d)\big)^2}$, we have

$$(1 - \epsilon)\exp\big(-h(n,\beta)\big) \le \frac{q_{\hat{Y}|D}(y|d)}{p_{Y_T}(y)} \le (1 + \epsilon)\exp\big(h(n,\beta)\big), \tag{36, 37}$$

where

$$h(n,\beta) \triangleq \sqrt{\frac{3}{n c_m} \log\!\left(\frac{1}{\beta}\left(\frac{e(n+m)}{m}\right)^{m}\right)}. \tag{38}$$

Observe that $h(n,\beta) = \Theta\!\left(\sqrt{\tfrac{1}{n}\log\tfrac{n}{\beta}}\right)$. Since $e^x \approx 1 + x$ for $x$ sufficiently small, we have

$$\frac{\left| q_{\hat{Y}|D}(y|d) - p_{Y_T}(y) \right|}{p_{Y_T}(y)} \le \epsilon + \Theta\!\left(\sqrt{\frac{1}{n}\log\frac{n}{\beta}}\right), \tag{39}$$

proving the first claim.

For the second claim, we start by applying the triangle inequality:

$$\Delta\big(q_{X,Y}, q_{\hat{X},\hat{Y}}\big) \le \Delta\big(p_{X,Y}, p_{\hat{X},\hat{Y}}\big) + \Delta\big(q_{X,Y}, p_{X,Y}\big) + \Delta\big(q_{\hat{X},\hat{Y}}, p_{\hat{X},\hat{Y}}\big) \le \mu + \Delta\big(q_{X,Y}, p_{X,Y}\big) + \Delta\big(q_{\hat{X},\hat{Y}}, p_{\hat{X},\hat{Y}}\big). \tag{40}$$

Now assume $D_{\mathrm{KL}}(p_{D,X,Y} \| q_{D,X,Y}) \le \tau$. Then the Data Processing Inequality for KL divergence yields $D_{\mathrm{KL}}(p_{X,Y} \| q_{X,Y}) \le \tau$ and $D_{\mathrm{KL}}(p_{\hat{X},\hat{Y}} \| q_{\hat{X},\hat{Y}}) \le \tau$. In addition, from Pinsker's inequality,

$$\Delta\big(q_{X,Y}, p_{X,Y}\big) \le 2\sqrt{2 D_{\mathrm{KL}}(p_{X,Y} \| q_{X,Y})} \le 2\sqrt{2\tau},$$

and, analogously, $\Delta\big(q_{\hat{X},\hat{Y}}, p_{\hat{X},\hat{Y}}\big) \le 2\sqrt{2\tau}$. Thus (40) becomes

$$\Delta\big(q_{X,Y}, q_{\hat{X},\hat{Y}}\big) \le \mu + 4\sqrt{2\tau}. \tag{41}$$

Selecting $\tau$ as in (35), then, with probability $1 - \beta$,

$$\Delta\big(q_{X,Y}, q_{\hat{X},\hat{Y}}\big) \le \mu + 4\sqrt{\frac{2}{n} \log\!\left(\frac{1}{\beta}\left(\frac{e(n+m)}{m}\right)^{m}\right)}, \tag{42}$$

and the result follows.

References

Banerjee, Arindam, Merugu, Srujana, Dhillon, Inderjit S., and Ghosh, Joydeep. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705-1749, 2005.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

Calders, Toon and Žliobaitė, Indrė. Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and Privacy in the Information Society, pp. 43-57. Springer, 2013.

Chouldechova, Alexandra.
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1610.07524, 2016.

Corbett-Davies, Sam, Pierson, Emma, Feller, Avi, Goel, Sharad, and Huq, Aziz. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. Wiley-Interscience, 2nd edition, July 2006.

Csiszár, Imre and Shields, Paul C. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417-528, 2004.

Diamond, Steven and Boyd, Stephen. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Durose, Matthew R., Cooper, Alexia D., and Snyder, Howard N. Recidivism of prisoners released in 30 states in 2005: Patterns from 2005 to 2010. Washington, DC: Bureau of Justice Statistics, 28, 2014.

Dwork, Cynthia, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Zemel, Richard. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226. ACM, 2012.

EEOC, The U.S. Uniform guidelines on employee selection procedures. https://www.eeoc.gov/policy/docs/qanda_clarify_procedures.html, March 1979.

Feldman, Michael, Friedler, Sorelle A., Moeller, John, Scheidegger, Carlos, and Venkatasubramanian, Suresh. Certifying and removing disparate impact. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 259-268, 2015.

Fish, Benjamin, Kun, Jeremy, and Lelkes, Ádám D. A confidence-based approach for balancing fairness and accuracy. In Proceedings of the SIAM International Conference on Data Mining, pp. 144-152. SIAM, 2016.

Friedler, Sorelle A., Scheidegger, Carlos, and Venkatasubramanian, Suresh. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.

Hajian, Sara.
Simultaneous Discrimination Prevention and Privacy Protection in Data Publishing and Mining. PhD thesis, Universitat Rovira i Virgili, 2013. Available online: 6805.

Hajian, Sara and Domingo-Ferrer, Josep. A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans. Knowl. Data Eng., 25(7):1445-1459, 2013.

Hardt, Moritz, Price, Eric, and Srebro, Nathan. Equality of opportunity in supervised learning. In Adv. Neur. Inf. Process. Syst. 29, pp. 3315-3323, 2016.

Kamiran, Faisal and Calders, Toon. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.

Kamishima, Toshihiro, Akaho, Shotaro, and Sakuma, Jun. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), IEEE 11th International Conference on, pp. 643-650. IEEE, 2011.

Kleinberg, Jon, Mullainathan, Sendhil, and Raghavan, Manish. Inherent trade-offs in the fair determination of risk scores. In Proc. Innov. Theoret. Comp. Sci., 2017.

Li, Ninghui, Li, Tiancheng, and Venkatasubramanian, Suresh. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering, pp. 106-115. IEEE, 2007.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Northpointe Inc. COMPAS - the most scientifically advanced risk and needs assessments. http://www.northpointeinc.com/risk-needs-assessment.

Pedreschi, Dino, Ruggieri, Salvatore, and Turini, Franco. Discrimination-aware data mining. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 560-568. ACM, 2008.

Pedreschi, Dino, Ruggieri, Salvatore, and Turini, Franco. A study of top-k measures for discrimination discovery. In Proc. ACM Symp. Applied Comput., pp. 126-131, 2012.

Pollard, David. A User's Guide to Measure Theoretic Probability.
Cambridge University Press, Cambridge, UK, 2002.

ProPublica. Machine Bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.

ProPublica. COMPAS Recidivism Risk Score Data and Analysis. https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis, 2017.

Ruggieri, Salvatore. Using t-closeness anonymity to control for non-discrimination. Trans. Data Privacy, 7(2):99-129, 2014.

Zafar, Muhammad Bilal, Valera, Isabel, Rodriguez, Manuel Gomez, and Gummadi, Krishna P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. arXiv preprint arXiv:1610.08452, 2016.

Zemel, Richard, Wu, Yu (Ledell), Swersky, Kevin, Pitassi, Toniann, and Dwork, Cynthia. Learning fair representations. In Proc. Int. Conf. Mach. Learn., pp. 325-333, 2013.

Zhang, Zhe and Neill, Daniel B. Identifying significant predictive bias in classifiers. In Proceedings of the NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016. Available online: https://arxiv.org/abs/1611.08292.

Zhao, Ming-Jie, Edakunni, Narayanan, Pocock, Adam, and Brown, Gavin. Beyond Fano's inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. J. Mach. Learn. Res., 14:1033-1090, 2013.

Žliobaitė, Indrė, Kamiran, Faisal, and Calders, Toon. Handling conditional discrimination. In Proc. IEEE Int. Conf. Data Mining, pp. 992-1001, 2011.
