Fair regression under localized demographic parity constraints


Authors: Arthur Charpentier, Christophe Denis, Romuald Elie, Mohamed Hebiri, François Hu

Abstract

Demographic parity (DP) is a widely used group fairness criterion requiring predictive distributions to be invariant across sensitive groups. While natural in classification, full distributional DP is often overly restrictive in regression and can lead to substantial accuracy loss. We propose a relaxation of DP tailored to regression, enforcing parity only at a finite set of quantile levels and/or score thresholds. Concretely, we introduce a novel (ℓ, Z)-fair predictor, which imposes groupwise CDF constraints of the form F_{f|S=s}(z_m) = ℓ_m for prescribed pairs (ℓ_m, z_m). For this setting, we derive closed-form characterizations of the optimal fair discretized predictor via a Lagrangian dual formulation and quantify the discretization cost, showing that the risk gap to the continuous optimum vanishes as the grid is refined. We further develop a model-agnostic post-processing algorithm based on two samples (labeled, for learning a base regressor, and unlabeled, for calibration), and establish finite-sample guarantees on constraint violation and excess penalized risk. In addition, we introduce two alternative frameworks in which we match group and marginal CDF values at selected score thresholds; in both settings, we provide closed-form solutions for the optimal fair discretized predictor. Experiments on synthetic and real datasets illustrate an interpretable fairness–accuracy trade-off, enabling targeted corrections at decision-relevant quantiles or thresholds while preserving predictive performance.

1. Introduction

Machine learning systems increasingly support or automate decisions in socially sensitive settings such as credit, hiring, insurance pricing, or public policy.
(1 Université du Québec à Montréal, UQAM, Canada. 2 Kyoto University, Japan. 3 Université Paris 1 Panthéon-Sorbonne, Paris, France. 4 Université Gustave Eiffel, Paris, France. 5 Milliman Paris, France. 6 Université Claude Bernard, Lyon, France. Correspondence to: Christophe Denis <Christophe.denis1@univ-paris1.fr>. Preprint. March 27, 2026.)

In these applications, predictions may depend (directly or indirectly) on sensitive attributes S (e.g., gender, ethnicity, age), raising major concerns about discrimination. A widely used statistical requirement is demographic parity (DP), which requires the predictive distribution to be invariant across groups, i.e., f(X, S) ⊥⊥ S (or, equivalently, the conditional law of f(X, S) given S is the same for all groups). DP is natural and operational in classification, where decisions often boil down to thresholding a score. In regression, however, DP becomes significantly more delicate and typically too restrictive (Agarwal et al., 2019; Chzhen et al., 2021; 2020b). Indeed, enforcing full distributional parity in regression often induces a severe loss of accuracy: it constrains the predictor on parts of the outcome distribution that may be irrelevant to the fairness concern, and it may require heavy distortions even when group disparities are localized (e.g., only in upper tails or around key decision thresholds). This is not merely a modeling artifact: in many real deployments, fairness is articulated at a few interpretable summary points (medians, quartiles, or policy thresholds) rather than on the entire distribution. For instance, pay-transparency regulations often emphasize median and quartile gaps [1]; in lending, audits commonly focus on approval/denial rates at operational cutoffs [2]. These examples suggest that quantile-level or threshold-level parity may be a more faithful and actionable target than full DP.
Motivated by this, we study localized relaxations of demographic parity for regression. Rather than enforcing equality of the entire conditional distribution of f(X, S) across groups, we impose fairness only at a finite set of probability levels (quantiles) and/or score thresholds. This viewpoint connects to recent work arguing that "quantile fairness" captures important distributional disparities that are invisible to mean-based criteria and can be enforced through post-processing or calibration (Yang et al., 2019; Liu et al., 2022; Wang et al., 2023; Plecko & Meinshausen, 2020). Concretely, given a vector of quantile levels ℓ = (ℓ_1, ..., ℓ_M) and/or a vector of thresholds Z = (z_1, ..., z_M), we consider constraints of the form

    Q_{f|S=s}(ℓ_m) is equal across s (= Q_f(ℓ_m)),
    F_{f|S=s}(z_m) is equal across s (= F_f(z_m)).

These constraints are low-dimensional, interpretable, and naturally aligned with how stakeholders specify fairness requirements (e.g., "equal predicted median", "equal top-decile access", or "equal approval rate at cutoff z").

[1] E.g., the EU Pay Transparency Directive (EU) 2023/970 and the UK Gender Pay Gap reporting regulations (2017) explicitly require reporting median and quartile statistics.
[2] E.g., in the US, ECOA/Regulation B compliance is typically monitored through comparative acceptance/denial rates across protected classes.

Related work. Recent work argues that enforcing fairness uniformly over the whole score range can be unnecessarily stringent. (He et al., 2025) proposes to enforce a so-called "partial fairness" only on score ranges of interest (e.g., contested regions), using an in-processing formulation with difference-of-convex constraints. In a different direction, (Chen et al., 2025) develops a hypothesis-testing framework that audits approximate (strong) demographic parity under explicit utility trade-offs using Wasserstein projections, motivated by causal policy evaluation. (Wang et al., 2023) introduces Equal Opportunity of Coverage and uses binned fair quantile regression as a post-processing step.

Our framework. We introduce a new notion of fairness that formalizes this idea. We define the (ℓ, Z)-fair predictor, which directly imposes a finite family of quantile/threshold constraints coupling the probability levels ℓ and the corresponding score values Z. This notion yields a continuous fairness–accuracy continuum: increasing M strengthens fairness (and approaches full distributional parity), while small M targets only the distributional regions of interest. Additionally, we introduce specific settings that focus on matching group and marginal CDF values at selected thresholds. These can be viewed as variants of the (ℓ, Z)-fair constraint, sharing the same objective: localizing the fairness constraint to mitigate accuracy loss.

Why quantiles rather than optimal transport? A prominent alternative for distributional fairness in regression aligns group-conditional predictive distributions via Wasserstein barycenters or optimal transport (OT) mappings; we refer to (Gordaliza et al., 2019; Chzhen et al., 2020b) for Wasserstein/OT-based approaches to distributional fairness. These approaches are elegant and can provide strong guarantees, but they may be brittle in practice: they depend on a choice of ground cost, can be sensitive to outliers and heavy tails, and may behave poorly under multimodality or group imbalance. In contrast, quantile-based constraints reduce fairness to a finite set of univariate restrictions. They are robust, invariant under monotone transformations of the score, computationally simple, and directly interpretable in terms of policy-relevant thresholds.
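The monotone-invariance claim is easy to verify numerically: a strictly increasing map g preserves the events {f ≤ z}, so P_s(g∘f ≤ g(z)) = P_s(f ≤ z) for every group. A minimal NumPy sketch (our own illustration, with arbitrary synthetic scores and g = tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical group-conditional scores (illustration only).
scores = {"a": rng.normal(0.0, 1.0, 10_000),
          "b": rng.normal(0.5, 1.2, 10_000)}

z = 0.3          # a threshold of interest
g = np.tanh      # any strictly increasing transformation of the score

# The events {f <= z} and {g(f) <= g(z)} coincide, so the CDF value at
# the (transformed) threshold is unchanged for every group.
for s, f in scores.items():
    assert abs(np.mean(f <= z) - np.mean(g(f) <= g(z))) < 1e-12
print("threshold-level constraints preserved under monotone g")
```

An OT-based correction, by contrast, is generally not invariant under such rescalings of the score, since the transport cost changes with the geometry of the outputs.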
This complements recent OT-based fairness lines developed by some of the authors (e.g., multi-task/barycentric formulations or sequentially fair mechanisms) (Denis et al., 2024; Hu et al., 2024; 2023; Charpentier, 2024; Charpentier et al., 2023).

Contributions. Our contributions are threefold: (i) we introduce localized relaxations of DP for regression, namely the (ℓ, Z)-fair predictor and variants of this setting (the partially DP-fair discretized predictors); (ii) we characterize the corresponding optimal fair predictors and clarify their relation to full DP; and (iii) we propose a practical, data-driven procedure and demonstrate, theoretically and empirically (on synthetic and real data), that localized fairness can mitigate distributional bias while preserving predictive performance.

2. Statistical setting

Data and risk. Let (X, S, Y) be a random triplet with features X ∈ R^d, sensitive attribute S ∈ S (multi-group setting, |S| < ∞), and response Y ∈ R with E[Y²] < ∞. Denote π_s := P(S = s), and use the shorthand P_s(·) := P(· | S = s) and E_s[·] := E[· | S = s]. We work under squared loss and define the (population) risk

    R(f) := E[(Y − f(X, S))²] = Σ_{s∈S} π_s E_s[(Y − f(X, s))²].

The Bayes predictor for squared loss is the conditional mean f⋆(x, s) := E[Y | X = x, S = s] (hence f⋆ ∈ argmin_f R(f)); see, e.g., (Hastie et al., 2009). Our post-processing is group-conditional and therefore assumes S is available at deployment (as is common in fairness auditing/calibration pipelines). Throughout, we assume bounded outcomes: there exists A > 0 such that |Y| ≤ A a.s., which implies |f⋆(X, S)| ≤ A a.s. (by Jensen's inequality). Equivalently, we may write the regression model Y = f⋆(X, S) + ε with E[ε | X, S] = 0.

Non-atomicity.
To avoid ties at the fairness thresholds and ensure well-defined quantile-level constraints, we assume that the group-conditional distribution of f⋆(X, S) is continuous.

Assumption 2.1 (Continuity / non-atomicity). For every s ∈ S, the CDF t ↦ P_s(f⋆(X, s) ≤ t) is continuous.

Predictor classes and discretization. Let F be the set of all measurable predictors f : R^d × S → [−A, A]. For K ≥ 2, introduce a regular grid of [−A, A] with K points,

    Y_K := { y_k : y_k = −A + (2A/(K−1))(k−1), k ∈ [K] }.

We then define the discretized class F_K := { f : R^d × S → Y_K }. This discretization is standard when dealing with real-valued predictors: it yields finite-dimensional constraints and closed-form characterizations, while the induced approximation error vanishes as the grid is refined (see Proposition 2.4 and classical quantization results, e.g., (Gray & Neuhoff, 1998; Agarwal et al., 2018)).

2.1. (ℓ, Z)-fair predictor

Throughout, we fix M ≥ 1 and assume ℓ = (ℓ_1, ..., ℓ_M) ∈ (0, 1)^M with 0 < ℓ_1 < ··· < ℓ_M < 1. We also fix thresholds Z = (z_1, ..., z_M) ∈ [−A, A]^M with z_1 < ··· < z_M (and typically z_m ∈ Y_K in the discretized setting).

Quantile/threshold constraints. A discretized predictor f ∈ F_K is said to be (ℓ, Z)-fair if

    ∀ s ∈ S, ∀ m ∈ [M],  P_s(f(X, s) ≤ z_m) = ℓ_m.    (1)

Equivalently, F_{f|S=s}(z_m) = ℓ_m for all m, i.e., each group has the same CDF values at the specified thresholds. This can be viewed as a finite relaxation of distributional demographic parity, which is known to be demanding in regression (Agarwal et al., 2019; Chzhen et al., 2020b).

Let s ∈ S and set T := f(X, s), with conditional CDF F_s(t) := P(T ≤ t | S = s) and quantile function Q_s(τ) := inf{ t ∈ R : F_s(t) ≥ τ }.
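The conditional CDF and quantile function just defined have direct empirical counterparts, which is how constraint (1) can be audited from held-out predictions. A minimal NumPy sketch (our own illustration; the synthetic data and the use of `np.quantile` as a stand-in for Q_s are assumptions of the example):

```python
import numpy as np

def empirical_cdf_quantile(preds, groups, z_m, ell_m):
    """Per group s: empirical F_s(z_m) = P(T <= z_m | S = s) and an
    empirical version of Q_s(ell_m) = inf{t : F_s(t) >= ell_m}."""
    out = {}
    for s in np.unique(groups):
        t = preds[groups == s]
        out[s] = (np.mean(t <= z_m),       # F_s at the threshold z_m
                  np.quantile(t, ell_m))   # Q_s at the level ell_m
    return out

# Illustration: if both groups share the same score law, the prescribed
# pair (ell_m, z_m) = (0.5, 0.0) is (approximately) satisfied by both.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 100_000)
preds = rng.normal(size=groups.size)
stats = empirical_cdf_quantile(preds, groups, z_m=0.0, ell_m=0.5)
```

For an (ℓ, Z)-fair predictor, the first coordinate should be close to ℓ_m in every group; the second coordinate gives the matching quantile reading of the same constraint.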
Assume that T | (S = s) is absolutely continuous with density p_s such that p_s(t) > 0 for Lebesgue-a.e. t in an open interval containing z_m, so that F_s is continuous and strictly increasing in a neighborhood of z_m. Then the quantile function coincides with the usual inverse, Q_s(τ) = F_s^{−1}(τ) for all τ ∈ (0, 1), and therefore, for any m ∈ [M] and any ℓ_m ∈ (0, 1),

    F_s(z_m) = ℓ_m  ⟺  Q_s(ℓ_m) = z_m.    (2)

Consequently, the constraints F_{f|S=s}(z_m) = ℓ_m can be interpreted either as prescribing groupwise CDF values at the thresholds Z, or equivalently as fixing group-conditional quantiles at the levels ℓ. In other words, our framework can be read either as enforcing parity at prescribed thresholds Z or as aligning specified quantiles at levels ℓ.

We study the risk-minimizing discretized predictor under these constraints:

    f⋆_{(ℓ,Z)-fair} ∈ argmin_{f ∈ F_K} { R(f) : f satisfies (1) }.    (3)

Lagrangian form. For y ∈ [−A, A], define the indicator vector a(y) := (1{y ≤ z_1}, ..., 1{y ≤ z_M}) ∈ {0, 1}^M. Let λ = (λ_{s,m})_{s∈S, m∈[M]} and write λ_s ∈ R^M for the block (λ_{s,1}, ..., λ_{s,M}). The (groupwise) Lagrangian is

    R_λ(f) := R(f) + Σ_{s∈S} Σ_{m∈[M]} λ_{s,m} ( P_s(f(X, s) ≤ z_m) − ℓ_m ),

a standard constrained-risk formulation (see, e.g., (Boyd & Vandenberghe, 2004)).

Theorem 2.2 (Optimal (ℓ, Z)-fair discretized predictor). Under Assumption 2.1, there exists λ⋆ such that the predictor f⋆_{(ℓ,Z)-fair} admits the pointwise form

    f⋆_{(ℓ,Z)-fair}(x, s) ∈ argmin_{y ∈ Y_K} { π_s (y − f⋆(x, s))² + ⟨λ⋆_s, a(y)⟩ }.    (4)

Moreover, λ⋆ can be chosen as a minimizer of the dual objective

    λ⋆ ∈ argmin_{λ ∈ R^{|S|×M}} Σ_{s∈S} E_s[ V_s(λ; X) ],    (5)

where V_s(λ; x) := max_{y ∈ Y_K} Φ_{s,λ}(x, y) and Φ_{s,λ}(x, y) := −π_s (y − f⋆(x, s))² − ⟨λ_s, a(y) − ℓ⟩.

Penalized-risk interpretation.
Theorem 2.2 implies that the optimal fair predictor is also a minimizer of a Lagrangian-penalized risk.

Corollary 2.3. Under Assumption 2.1, f⋆_{(ℓ,Z)-fair} ∈ argmin_{f ∈ F_K} R_{λ⋆}(f).

Discretization cost. Define the (continuous) constrained optimum

    f̃ ∈ argmin_{f ∈ F} { R(f) : f satisfies (1) }.

We compare the optimal discretized risk to its continuous counterpart.

Proposition 2.4 (Cost of discretization). The following holds:

    R(f⋆_{(ℓ,Z)-fair}) − R(f̃) ≤ C A² / K,

for some absolute constant C > 0. Consequently, R(f⋆_{(ℓ,Z)-fair}) → R(f̃) as K → +∞.

3. Data-driven algorithm

This section describes a practical procedure to estimate the optimal (ℓ, Z)-fair discretized predictor characterized in Theorem 2.2. Our approach is a post-processing method: we first learn an unconstrained regressor and then calibrate its outputs to satisfy the fairness constraints. Post-processing is model-agnostic and can be applied to any black-box regressor (Hardt et al., 2016; Chzhen et al., 2020b).

Two-sample setup. We use two independent samples:

• a labeled sample D_n = {(X_i, S_i, Y_i)}_{i=1}^n, used to learn a base regressor f̂ for f⋆;
• an unlabeled sample D_N = {(X′_i, S′_i)}_{i=1}^N, used to estimate the dual parameters (Lagrange multipliers) enforcing the fairness constraints.

Using unlabeled data for calibration is natural here because the constraints P(f(X, S) ≤ z_m | S = s) = ℓ_m depend only on the distribution of (X, S) and on the predictor outputs, not directly on Y.

Dithering to ensure continuity. Assumption 2.1 avoids ties at thresholds and ensures a well-behaved dual. In practice, f̂ may have atoms (e.g., tree-based models).
We therefore introduce a randomized ("dithered") version

    f̄(x, s) := Π_{[−A,A]}( f̂(x, s) + ξ ),  ξ ∼ Unif([0, u]),

where Π_{[−A,A]} denotes projection onto [−A, A] and ξ is independent of all data. Conditionally on D_n, the mapping t ↦ P(f̄(X, S) ≤ t | S = s) is continuous for each s, which simplifies both theory and implementation. The dithering variable ξ is introduced only to break ties and guarantee continuity. When f̂ is continuous (or when ties are negligible), we set u = 0 and the procedure becomes deterministic. Otherwise, u can be chosen arbitrarily small, so that the impact of randomization on predictions is negligible.

Empirical group weights. From D_N, define π̂_s := N_s / N with

    N_s := Σ_{i=1}^N 1{S′_i = s},  I_s := { i ∈ [N] : S′_i = s },

and let π_min := min_{s∈S} π_s > 0. Recall Z = (z_1, ..., z_M) and ℓ = (ℓ_1, ..., ℓ_M). Define, for y ∈ [−A, A],

    a(y) := (1{y ≤ z_1}, ..., 1{y ≤ z_M}) ∈ {0, 1}^M,  b(y) := a(y) − ℓ.

For λ = (λ_{s,m})_{s∈S, m∈[M]}, write λ_s := (λ_{s,1}, ..., λ_{s,M}), and define the empirical per-sample dual score, for y ∈ Y_K,

    Φ̂_{s,λ}(x, y) := −π̂_s (y − f̄(x, s))² − ⟨λ_s, b(y)⟩.

The empirical dual objective (the counterpart of the population dual in Theorem 2.2) is

    Ĥ(λ) = Σ_{s∈S} (1/N_s) Σ_{i∈I_s} max_{y ∈ Y_K} Φ̂_{s,λ}(X′_i, y).    (6)

Since Ĥ is a sum of pointwise maxima of affine functions of λ, it is convex and can be minimized with standard first-order methods (e.g., projected subgradient) (Boyd & Vandenberghe, 2004; Shalev-Shwartz & Ben-David, 2014). We define the estimated multipliers as any minimizer λ̂ ∈ argmin_{λ ∈ R^{|S|×M}} Ĥ(λ).

Calibrated fair predictor. Finally, the empirical (ℓ, Z)-fair post-processed predictor is

    f̂_{(ℓ,Z)-fair}(x, s) ∈ argmin_{y ∈ Y_K} { π̂_s (y − f̄(x, s))² + ⟨λ̂_s, a(y)⟩ }.
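The calibration step can be sketched end-to-end. The code below is a minimal NumPy illustration of the procedure above, not a reference implementation: it dithers the base predictions, runs a decaying-step subgradient method on the empirical dual Ĥ of (6), and predicts through the pointwise argmin. The step-size schedule, iteration budget, and demo data are our own choices.

```python
import numpy as np

def fit_fair_postprocessor(pred, group, z, ell, A=1.0, K=101, u=0.0,
                           n_iter=200, lr=0.1, seed=0):
    """Estimate the multipliers lambda_hat by subgradient descent on the
    empirical dual H_hat of Eq. (6). `pred`/`group` play the role of the
    unlabeled calibration sample (base predictions, sensitive attribute)."""
    rng = np.random.default_rng(seed)
    z, ell = np.asarray(z, float), np.asarray(ell, float)
    grid = np.linspace(-A, A, K)                     # regular grid Y_K
    a = (grid[:, None] <= z[None, :]).astype(float)  # a(y), shape (K, M)
    b = a - ell                                      # b(y) = a(y) - ell
    noise = rng.uniform(0.0, u, len(pred)) if u > 0 else 0.0
    fbar = np.clip(pred + noise, -A, A)              # dithered predictions
    pi_hat = {s: float(np.mean(group == s)) for s in np.unique(group)}
    lam = {s: np.zeros(len(z)) for s in pi_hat}
    for t in range(n_iter):
        step = lr / np.sqrt(t + 1.0)                 # decaying step size
        for s in pi_hat:
            fs = fbar[group == s]
            # Phi_hat(x, y) for every sample x of group s, grid point y.
            phi = -pi_hat[s] * (grid[None, :] - fs[:, None]) ** 2 - b @ lam[s]
            ystar = phi.argmax(axis=1)
            lam[s] -= step * (-b[ystar].mean(axis=0))  # subgradient in lam_s
    return grid, a, pi_hat, lam

def predict_fair(pred, group, grid, a, pi_hat, lam, A=1.0):
    """Pointwise rule: argmin_y pi_hat_s (y - f(x,s))^2 + <lam_hat_s, a(y)>."""
    pred = np.clip(np.asarray(pred, float), -A, A)
    out = np.empty(len(pred))
    for s in pi_hat:
        m = group == s
        cost = pi_hat[s] * (grid[None, :] - pred[m][:, None]) ** 2 + a @ lam[s]
        out[m] = grid[cost.argmin(axis=1)]
    return out

# Demo on hypothetical data: enforce a common median at z = 0.
rng = np.random.default_rng(7)
group = rng.integers(0, 2, 10_000)
pred = np.where(group == 0, rng.normal(-0.2, 0.1, 10_000),
                rng.normal(0.2, 0.1, 10_000))
grid, a, pi_hat, lam = fit_fair_postprocessor(pred, group, z=[0.0], ell=[0.5])
fair = predict_fair(pred, group, grid, a, pi_hat, lam)
```

In the demo, the two groups start with medians on opposite sides of z = 0; after calibration, both groups place roughly half of their predictions at or below 0, as prescribed by ℓ = (0.5). Each dual iteration costs O(NK), matching the complexity discussion in Section 3.2.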
This mirrors the population characterization in Theorem 2.2, with f⋆ replaced by f̄ and λ⋆ replaced by λ̂.

3.1. Theoretical study

We summarize the main statistical guarantees satisfied by the post-processed predictor f̂_{(ℓ,Z)-fair}.

Constraint violation. For any predictor f, define the maximal constraint violation

    U_{(ℓ,Z)}(f) := max_{s∈S} max_{m∈[M]} | P_s(f(X, s) ≤ z_m) − ℓ_m |.

Theorem 3.1 (Rate for fairness violation). There exists a constant C_S (depending only on S and π_min) such that

    E[ U_{(ℓ,Z)}(f̂_{(ℓ,Z)-fair}) ] ≤ C_S ( √(1/N) + K²/N ).

Several comments can be made on this result. First, the bound depends only on the unlabeled sample and is independent of the quality of the initial estimator f̂, so the result holds for any base regression algorithm. Second, it guarantees that the empirical fair predictor asymptotically satisfies the target fairness constraints provided that K²/N → 0. Third, the bound decomposes into two terms: the first arises from controlling the deviation between the true and empirical CDFs, while the second accounts for tie effects due to minimizing the empirical counterpart of the function H. Finally, the obtained rates highlight a trade-off between the grid resolution and the size of the unlabeled sample. From this result, we also derive a high-probability guarantee.

Theorem 3.2 (High-probability fairness violation). Assume |Y| ≤ A a.s. and π_min = min_s π_s > 0. Conditionally on D_n, for any δ ∈ (0, 1), with probability at least 1 − δ (over D_N and the dithering),

    U_{(ℓ,Z)}(f̂_{(ℓ,Z)-fair}) ≤ C_S ( √(1/N) + K²/N + √(log(1/δ)/N) ).

Excess penalized risk. Recall the Lagrangian-penalized risk R_{λ⋆} introduced in Section 2. The next bound controls the excess penalized risk of the empirical post-processing solution relative to the population optimum.
Theorem 3.3 (Excess penalized risk). There exists a constant C_S such that

    E[ R_{λ⋆}(f̂_{(ℓ,Z)-fair}) − R_{λ⋆}(f⋆_{(ℓ,Z)-fair}) ]
        ≤ C_{S,A} ( E[ |f̂(X, S) − f⋆(X, S)| ] + √(1/N) + u ) + C_S M ( √(1/N) + K²/N ).

The theorem shows that the excess risk decomposes into two main components. The first consists of three error terms: (i) the statistical error of the base regressor f̂, (ii) the statistical error of the estimators (π̂_s)_{s∈S}, and (iii) the dithering level u introduced to ensure continuity. The second component is related to the unfairness of the predictor and corresponds to the calibration error incurred by estimating the dual with N unlabeled points and a grid of size K.

Theorem 3.4 (High-probability excess penalized risk). Under the assumptions of Theorem 3.2, for any δ ∈ (0, 1), with probability at least 1 − δ (conditionally on D_n),

    R_{λ⋆}(f̂_{(ℓ,Z)-fair}) − R_{λ⋆}(f⋆_{(ℓ,Z)-fair})
        ≤ C_{S,A} ( E[ |f̂(X, S) − f⋆(X, S)| ] + √(1/N) + u ) + C_S M ( √(1/N) + K²/N + √(log(1/δ)/N) ).

Remark 3.5 (High-probability variants). Conditionally on the labeled sample D_n, the calibration step depends only on D_N. Since the constraints only involve CDF values at thresholds, one may control max_{m∈[M]} |F̂_s(z_m) − F_s(z_m)| using the Dvoretzky–Kiefer–Wolfowitz inequality (in its sharp form due to Massart) (Dvoretzky et al., 1956; Massart, 1990). This yields high-probability bounds with √(log(1/δ)/N_s) dependence. A detailed statement is given in Appendix B.2.

3.2. Implementation details and practical choices

Choosing (ℓ, Z). We assume 0 < ℓ_1 < ··· < ℓ_M < 1 and z_1 < ··· < z_M. In practice, (ℓ, Z) can be selected to match either (i) policy targets (e.g., fixed acceptance/flagging cutoffs), or (ii) distributional summaries (e.g., medians or upper quantiles).

Choosing K and u.
The grid size K trades off computational cost, discretization error, and the unfairness rate: Proposition 2.4 yields a risk gap of order A²/K, while Theorem 3.1 gives a bound of order √(1/N) + K²/N. The choice K = N^{1/3} therefore balances the discretization error against the unfairness rate. The dithering level u is only used to avoid ties and ensure continuity; when the base regressor f̂ is (approximately) continuous, we set u = 0, and otherwise we take u small (e.g., u ≪ 1).

Optimizing the dual. The objective Ĥ in (6) is convex (as a sum of maxima of affine functions), so it can be minimized with standard first-order methods (projected subgradient or mirror descent). Each evaluation of Ĥ(λ) requires O(NK) operations, since the inner maximization is over the K grid points.

4. Particular setting: Partially DP-fair discretized predictor

This section studies a special case of our framework, where fairness is imposed by matching group-conditional and marginal probabilities at a finite set of thresholds.

4.1. Z-DP (partial demographic parity at thresholds)

Fix an integer M ≥ 1 and a strictly increasing vector of thresholds Z = (z_1, ..., z_M) ∈ [−A, A]^M (typically with z_m ∈ Y_K). For a discretized predictor f ∈ F_K, we say that f is Z-DP-fair if

    ∀ s ∈ S, ∀ m ∈ [M],  P_s(f(X, s) ≤ z_m) = P(f(X, S) ≤ z_m).    (7)

That is, at each threshold z_m, every group has the same fraction of predictions below z_m as the overall population. We consider the risk-minimizing predictor under (7):

    f⋆_{Z-fair} ∈ argmin_{f ∈ F_K} { R(f) : f satisfies (7) }.    (8)

Corollary 4.1 (Recovery of discretized strong DP). If Z = Y_K (equivalently, constraints at all grid points), then Z-DP fairness is equivalent to equality of the entire discretized predictive distributions across groups.

Dual constraints and notation.
The constraints (7) compare each group to the marginal distribution; equivalently, they can be written as Σ_{s′∈S} π_{s′} P_{s′}(f ≤ z_m) − P_s(f ≤ z_m) = 0. This yields a dual in which the Lagrange multipliers at each threshold must sum to zero across groups. We therefore define

    Δ_M := { λ ∈ R^{|S|×M} : Σ_{s∈S} λ_{s,m} = 0, ∀ m ∈ [M] }.

As before, let a(y) = (1{y ≤ z_1}, ..., 1{y ≤ z_M}) ∈ {0, 1}^M and λ_s = (λ_{s,1}, ..., λ_{s,M}).

Theorem 4.2 (Optimal Z-DP-fair discretized predictor). Under Assumption 2.1, there exists λ⋆ ∈ Δ_M such that

    f⋆_{Z-fair}(x, s) ∈ argmin_{y ∈ Y_K} { π_s (y − f⋆(x, s))² + ⟨λ⋆_s, a(y)⟩ }.    (9)

Moreover, λ⋆ can be chosen as a minimizer of the dual objective

    λ⋆ ∈ argmin_{λ ∈ Δ_M} Σ_{s∈S} E_s[ max_{y ∈ Y_K} Φ^Z_{s,λ}(X, y) ],    (10)

where Φ^Z_{s,λ}(x, y) := −π_s (y − f⋆(x, s))² − ⟨λ_s, a(y)⟩.

4.2. ∂Z-DP (partial demographic parity with border constraints)

We start from two pairs of interest in the quantile/threshold space. Formally, we introduce ℓ = (ℓ_1, ℓ_2) ∈ (0, 1)² and Z = (z_1, z_2) ∈ [−A, A]². For a discretized predictor f ∈ F_K, the goal is to achieve ∂Z-DP:

    ∀ s ∈ S, ∀ t ∈ (z_1, z_2),  P_s(f(X, s) ≤ t) = P(f(X, S) ≤ t),

with, additionally, P_s(f(X, s) = z_m) = ℓ_m for m = 1, 2 and all s ∈ S. This definition enforces equality of the CDFs across groups on the interval [z_1, z_2] while imposing the quantile levels ℓ_1 and ℓ_2 at the borders of this interval. That is, we explicitly control the mass of the CDFs in the interval [z_1, z_2] of interest. The solution of the problem

    f⋆_{∂Z-fair} ∈ argmin_{f ∈ F_K} { R(f) : f satisfies ∂Z-DP },    (11)

can be approached by combining ideas from the (ℓ, Z)-DP framework and the Z-DP one above. To this end, we discretize [z_1, z_2] and define Z̃^M_{z_1,z_2} = (z̃_1, ..., z̃_M) ∈ [−A, A]^M with z_1 = z̃_0 < z̃_1 < ··· < z̃_M < z̃_{M+1} = z_2. We consider a proxy of the above ∂Z-DP fairness constraint and require P_s(f(X, s) ≤ z) = P(f(X, S) ≤ z) only for thresholds z ∈ Z̃^M_{z_1,z_2}. Hence our goal becomes

    f̃⋆_{∂Z_M-fair} ∈ argmin_{f ∈ F_K} { R(f) : f satisfies ∂Z̃^M_{z_1,z_2}-DP },    (12)

where a discretized predictor f ∈ F_K is said to be ∂Z̃^M_{z_1,z_2}-DP-fair if, for all s ∈ S and all z ∈ Z̃^M_{z_1,z_2}, P_s(f(X, s) ≤ z) = P(f(X, S) ≤ z), and P_s(f(X, s) = z_m) = ℓ_m for m = 1, 2.

Theorem 4.3 (Proxy ∂Z-DP discretized predictor). Under Assumption 2.1, there exist λ⋆_1 ∈ R^{|S|×2} and λ⋆_2 ∈ Δ_M such that the predictor f̃⋆_{∂Z_M-fair} admits the pointwise form

    f̃⋆_{∂Z_M-fair}(x, s) ∈ argmin_{y ∈ Y_K} { π_s (y − f⋆(x, s))² + ⟨(λ⋆_1)_s, a_1(y)⟩ + ⟨(λ⋆_2)_s, a_2(y)⟩ },

with (λ⋆_1, λ⋆_2) a minimizer of the dual objective

    (λ⋆_1, λ⋆_2) ∈ argmin_{(λ_1, λ_2) ∈ R^{|S|×2} × Δ_M} Σ_{s∈S} E_s[ max_{y ∈ Y_K} Φ^{Z̃_{z_1,z_2}}_{s,λ}(X, y) ],

where Φ^{Z̃_{z_1,z_2}}_{s,λ}(x, y) := −π_s (y − f⋆(x, s))² − ⟨(λ_1)_s, a_1(y) − ℓ⟩ − ⟨(λ_2)_s, a_2(y)⟩, with a_1(y) = (1{y ≤ z_1}, 1{y ≤ z_2}) ∈ {0, 1}² and a_2(y) = (1{y ≤ z̃_1}, ..., 1{y ≤ z̃_M}) ∈ {0, 1}^M.

Theorem 4.3 exhibits a solution f̃⋆_{∂Z_M-fair} that is a good proxy for f⋆_{∂Z-fair} from Equation (11) when the grid Z̃^M_{z_1,z_2} is fine enough, e.g., a regular grid with large M. From the estimation perspective, a data-driven method based on f̃⋆_{∂Z_M-fair} is built as in the previous section: a labeled dataset is used to estimate the regression function f⋆ and an unlabeled dataset to calibrate the partial unfairness. The framework considered here resembles that of (He et al., 2025), where the authors fit the CDFs across groups for a range of quantiles ([ℓ_1, ℓ_2] in our notation).
The only difference is that we also specify the range of prediction values [z_1, z_2]. We also differ in the estimation strategy: we rely on post-processing, whereas they consider in-processing approaches (which likewise exploit discretization).

5. Numerical experiments

In this section, we validate our framework on both real and synthetic data designed to highlight the trade-off between predictive risk and distributional fairness constraints. We illustrate how our approach supports a continuum of interventions, from surgical corrections at a few policy-relevant thresholds to localized regional constraints, and we contrast these with a fully distribution-matching ("strong DP") baseline. We consider a regression setting where the sensitive group S ∈ {A, B} influences the target Y through both a location shift and a group-specific non-linearity.

Synthetic data and base learner. We generate n = 4000 samples (X, S, Y) for each of N_sim = 30 simulations, with X ∼ U(0, 10)² and P(S = B) = 0.5. The outcome follows Y = f⋆(X, S) + ε, where ε ∼ N(0, 5) and

    f⋆(X, S) = 5 X_1 + 3 X_2 + 20 + 1{S = B} ( 15 + 2 (X_1 − 5)² ).

This model induces a linear location shift (+15) and a non-linear structural polarization (2 (X_1 − 5)²) for group B. The unconstrained predictor f̂ is estimated using a decision tree regressor (minimum 20 samples per leaf). All outcomes/predictions are clipped to [−100, 100] and post-processing is evaluated on a regular grid of size K = 201.

Compared methods.
All methods below are applied as post-processing on top of the same base regressor f̂:

(i) Unconstrained: the base regressor f̂ (no fairness post-processing);
(ii) (ℓ, Z)-fair: enforce F_{f̂|S=s}(z_m) = ℓ_m at a small number of prescribed pairs (ℓ_m, z_m) (Figure 2);
(iii) Z-fair: enforce partial distributional parity at a finite set of thresholds Z (Figure 3);
(iv) ∂Z-DP, referred to as "Z-fair, range" (Figure 3);
(v) Strong DP (full distribution matching): enforce parity on the whole grid, e.g. by taking Z = Y_K (Figure 3).

We emphasize that the last baseline represents the "global" end of the fairness spectrum, whereas (ℓ, Z)-fair and Z-fair provide localized alternatives.

Metrics. To evaluate the trade-off between predictive performance and group equity, we report three main metrics. First, we measure the price of fairness via the root mean squared error (rmse) between the fair predictor f̂ and the unconstrained baseline f̂⋆:

    rmse := √( (1/n_test) Σ_{i=1}^{n_test} ( f̂⋆(X_i, S_i) − f̂(X_i, S_i) )² ).

This metric represents the distortion risk R_D(f) minimized in our theoretical results; by construction, the unconstrained model f̂⋆ yields an rmse of 0.00. Second, we quantify the partial demographic parity violation U_{(ℓ,Z)}(f̂) at the specified thresholds Z, as defined in Section 3. Finally, we assess the entire outcome range using the Kolmogorov–Smirnov statistic:

    ks := max_{s,s′∈S} sup_{t∈[−A,A]} | F_{f̂|S=s}(t) − F_{f̂|S=s′}(t) |.

While our optimization targets the specific points in Z, the ks metric allows us to evaluate the impact of these local constraints on the global alignment of the group-conditional predictive distributions.

Implementation details.
Unless specified otherwise, we set K = 201, choose (ℓ, Z) based on quartiles (25th, 50th, and 75th percentiles) of the unconstrained predictor f̂ on a calibration set, and use projected subgradient descent to minimize Ĥ in (6).

5.1. Focus on (ℓ, Z)-fair prediction

In this first setting, we illustrate the prescriptive capacity of our framework: a practitioner specifies both the thresholds Z and the target probabilities ℓ a priori, modeling scenarios where policy dictates acceptance rates or quotas at decision-relevant cutoffs. Throughout, we take M = 3 and ℓ = (0.25, 0.50, 0.75) and consider three choices of Z (see Figure 2):

• Global: Z is set to the marginal quartiles of the unconstrained scores f̂(X, S) on the calibration set. This enforces agreement at common thresholds shared across groups.
• Target-A: Z is set to the quartiles of f̂(X, A). Since F_{f̂|S=A}(z_m) = ℓ_m holds by construction, the constraints effectively force group B to match group A at these thresholds.
• Target-B: the symmetric choice, with Z set to the quartiles of f̂(X, B).

Figure 2 shows that these localized constraints can substantially reduce disparities at the prescribed cutoffs while preserving much of the predictive structure away from them. As expected, more prescriptive choices (e.g., targeting another group at fixed cutoffs) may increase risk when the specified targets are far from the group's natural score distribution.

5.2. Extension to Z-fair and ∂Z-fair prediction

We now consider Z-fair constraints, which enforce partial distributional parity only at a finite set of thresholds (or over a selected region). Figure 3 contrasts enforcing parity at a small number of thresholds ("Z-fair, M = 3") with a localized range constraint ("Z-fair, range") and with the global strong-DP baseline (full grid matching).
Enforcing parity at a few thresholds reduces group differences where constrained while allowing more flexibility elsewhere; the range constraint (third column) further concentrates the correction within a chosen interval, leaving the tails comparatively less affected. In contrast, full distribution matching (last column) yields near-complete overlap of the predictive distributions but can substantially distort predictions. We refer the reader to Appendix A for additional numerical results on the evolution of risk/unfairness with respect to $M$.

5.3. Real-data illustration

Finally, Figure 1 reproduces the same qualitative behavior on the CRIME dataset (we refer to Appendix A for a description of the dataset) using a LightGBM base regressor (default scikit-learn parameters): localized constraints reduce distributional gaps around selected thresholds while typically incurring a smaller performance penalty than full distribution matching.

5.4. Overall conclusion

Our numerical study, on both synthetic and real data, highlights that by enforcing constraints only at a finite number of thresholds, or within a selected region of the score distribution, our approach enables localized interventions that can be tuned to policy-relevant cutoffs while limiting unnecessary distortion elsewhere. The different localized interventions that we considered yield a favorable accuracy–fairness trade-off compared to global matching baselines (OT matching) and confirm our theory.

Figure 1. Comparison of localized constraints and full distribution matching on CRIME data using a LightGBM model with default scikit-learn parameters.

Figure 2. Analysis of $(\ell, Z)$-fair methods on synthetic data. We compare three prescriptions for $Z$ (Global, Target-A, Target-B) with $M = 3$ and $\ell = (0.25, 0.50, 0.75)$.

Figure 3. Analysis of $Z$-fair methods on synthetic data.
We compare enforcing parity at $M = 3$ thresholds, enforcing parity only over a selected range, and full distribution matching (strong DP, full grid).

Impact Statement

This work contributes to the growing literature on algorithmic fairness in regression by proposing quantile- and threshold-based relaxations of demographic parity. By allowing stakeholders to enforce parity only at selected parts of the predictive distribution (e.g., medians, upper quantiles, or operational cutoffs), the proposed framework can enable more transparent and policy-aligned fairness requirements than full distributional parity, while reducing unnecessary accuracy loss. Potential benefits include improved accountability in high-stakes scoring applications (credit, hiring, risk assessment) and clearer communication of fairness constraints to non-technical decision makers.

At the same time, this approach may have negative societal impacts if misused. First, selecting the levels/thresholds $(\ell, Z)$ is a normative choice: poorly chosen targets may hide disparities outside the monitored region of the distribution, or may be used as a superficial "fairness compliance" layer without addressing structural harms. Second, the method relies on access to a sensitive attribute $S$ (or reliable proxies) during calibration; collecting, storing, or using such attributes can raise privacy and governance concerns, and may be restricted by regulation or institutional policy. Third, because the procedure is a post-processing step, it can alter score calibration or ranking near cutoffs; if downstream decisions are highly sensitive to small score changes, this may create unexpected incentives or discontinuities.

We emphasize that quantile-based constraints should be deployed only with careful stakeholder consultation and domain expertise.
In practice, we recommend: (i) reporting the chosen $(\ell, Z)$ and conducting sensitivity analyses with respect to alternative choices; (ii) complementing partial distributional parity with additional diagnostics (e.g., error disparities, tail-risk metrics, subgroup analyses) to reduce the risk of "fairness gerrymandering"; (iii) documenting data collection and privacy safeguards for sensitive attributes; and (iv) monitoring post-deployment performance to detect distribution shift or new disparities.

Overall, the proposed methodology is intended to provide a tractable and interpretable tool for reducing group-level distributional disparities in regression, but it does not eliminate the need for broader organizational, legal, and societal oversight when automated predictions influence real-world outcomes.

References

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Agarwal, A., Dudik, M., and Wu, Z. S. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, 2019.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. ISBN 9780521833783.

Charpentier, A. Quantifying fairness and discrimination in predictive models. In Machine Learning for Econometrics and Related Topics, pp. 37–77. Springer, 2024.

Charpentier, A., Hu, F., and Ratz, P. Parametric fairness with statistical guarantees. arXiv, 2310.20508, 2023.

Chen, Y., Tan, Z., Blanchet, J., and Qin, H. Testing fairness with utility tradeoffs: A Wasserstein projection approach, 2025.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair regression via plug-in estimator and recalibration with statistical guarantees. In Advances in Neural Information Processing Systems, 2020a.
Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair regression with Wasserstein barycenters. In Advances in Neural Information Processing Systems, volume 33, pp. 7321–7331, 2020b.

Chzhen, E., Denis, C., and Hebiri, M. Minimax semi-supervised set-valued approach to multi-class classification. Bernoulli, 27(4), 2021.

Denis, C., Elie, R., Hebiri, M., and Hu, F. Fairness guarantees in multi-class classification with demographic parity. Journal of Machine Learning Research, 25(130):1–46, 2024.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642–669, 1956.

Gordaliza, P., Del Barrio, E., Gamboa, F., and Loubes, J.-M. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, 2019.

Gray, R. M. and Neuhoff, D. L. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.

Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems, 2016.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

He, Y., Huang, Y., Yao, Y., and Lin, Q. Enforcing fairness where it matters: An approach based on difference-of-convex constraints. arXiv, 2505.12530, 2025.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Hu, F., Ratz, P., and Charpentier, A. Fairness in multi-task learning via Wasserstein barycenters.
In Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023), pp. 295–312, 2023.

Hu, F., Ratz, P., and Charpentier, A. A sequentially fair mechanism for multiple sensitive attributes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 12502–12510, 2024.

Liu, M., Ding, L., Yu, D., Liu, W., Kong, L., and Jiang, B. Conformalized fairness via quantile regression. In Advances in Neural Information Processing Systems, volume 35, pp. 11561–11572, 2022.

Massart, P. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

Plecko, D. and Meinshausen, N. Fair data adaptation with quantile preservation. Journal of Machine Learning Research, 21(225):1–37, 2020.

Redmond, M. and Baveja, A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Wang, F., Cheng, L., Guo, R., Liu, K., and Yu, P. S. Equal opportunity of coverage in fair regression. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.

Yang, D., Lafferty, J., and Pollard, D. Fair quantile regression. arXiv, 1907.08646, 2019.

Supplementary Materials

Appendix overview. The first section (Section A) presents additional numerical results complementing those in the main paper. The subsequent section (Section B) provides proofs of the theoretical results.

A. Numerical considerations

Data description. The main dataset we consider is CRIME, which contains socio-economic, law enforcement, and crime data about communities in the US, with 1,994 examples (Redmond & Baveja, 2002).
The task is to predict the violent crime rate per population. We consider race-related attributes, in particular the proportion of African-American residents, as the sensitive attribute, which yields 1,032 instances for $s = -1$ and 962 instances for $s = 1$. We split the data into three sets (60% training, 20% hold-out, and 20% unlabeled).

Additional numerical study. Figure 4 summarizes the resulting fairness–accuracy trade-off: as constraints become more global (more thresholds and/or full-grid matching), distributional discrepancies decrease (lower KS and lower constraint violation) at the cost of increased predictive error, whereas localized constraints provide intermediate operating points.

Figure 4. Fairness–accuracy trade-off on synthetic data for increasingly global constraints.

B. Proofs of main results

This appendix is dedicated to the proofs of the theoretical results. Note that we omit the proofs of Theorems 4.2 and 4.3 since they rely on arguments similar to those in Theorem 2.2.

Notation. Conditionally on $\mathcal{D}_n$, define for each $s \in \mathcal{S}$ the conditional law $P_s(\cdot) = P(\cdot \mid S = s)$ and its empirical version based on $\mathcal{D}_N$,
$$\hat P_s(A) := \frac{1}{N_s} \sum_{i \in I_s} \mathbf{1}\{X'_i \in A\}, \quad \text{where } I_s := \{ i \in [N] : S'_i = s \}.$$

B.1. Proofs of Section 2

Proof of Theorem 2.2. First, we observe that our minimization problem can be reformulated as follows:
$$f^*_{(\ell, Z)\text{-fair}} \in \operatorname*{argmin}_{f \in \mathcal{F}_K} \big\{ R(f) - R(f^*) : f \text{ is } (\ell, Z)\text{-fair} \big\}.$$
We consider the Lagrangian $\mathcal{L}$ associated with our optimization problem. Let $f \in \mathcal{F}_K$ and $\boldsymbol{\lambda} = (\lambda_{s,m})_{s \in \mathcal{S}, m \in [M]}$. Since $R(f) - R(f^*) = \mathbb{E}\big[ (f(X,S) - f^*(X,S))^2 \big]$, we have
$$\mathcal{L}(f, \boldsymbol{\lambda}) = \mathbb{E}\big[ \big( f^*(X,S) - f(X,S) \big)^2 \big] + \sum_{s \in \mathcal{S}} \sum_{m \in [M]} \lambda_{s,m} \big( P_s\big( f(X,S) \le z_m \big) - \ell_m \big).$$
We observe that
$$\mathcal{L}(f, \boldsymbol{\lambda}) = \sum_{s \in \mathcal{S}} \mathbb{E}_s\Big[ \pi_s \big( f^*(X,S) - f(X,S) \big)^2 + \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ f(X,S) \le z_m \} \Big] - \sum_{s \in \mathcal{S}} \sum_{m \in [M]} \lambda_{s,m} \ell_m.$$
Now, since $f \in \mathcal{F}_K$, we have
$$\mathcal{L}(f, \boldsymbol{\lambda}) = \sum_{s \in \mathcal{S}, k \in [K]} \mathbb{E}_s\Big[ \Big( \pi_s \big( f^*(X,S) - y_k \big)^2 + \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ y_k \le z_m \} \Big) \mathbf{1}\{ f(X,S) = y_k \} \Big] - \sum_{s \in \mathcal{S}, m \in [M]} \lambda_{s,m} \ell_m. \quad (13)$$
Hence, we observe that $f^*_{\boldsymbol{\lambda}} \in \operatorname{argmin}_{f \in \mathcal{F}_K} \mathcal{L}(f, \boldsymbol{\lambda})$ is characterized pointwise as
$$f^*_{\boldsymbol{\lambda}}(x, s) = \operatorname*{argmin}_{k \in [K]} \; \pi_s \big( f^*(x,s) - y_k \big)^2 + \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ y_k \le z_m \}.$$
Furthermore, we also have
$$\mathcal{L}(f^*_{\boldsymbol{\lambda}}, \boldsymbol{\lambda}) = \sum_{s \in \mathcal{S}} \mathbb{E}_s\Big[ \min_{k \in [K]} \Big( \pi_s \big( f^*(X,S) - y_k \big)^2 + \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ y_k \le z_m \} \Big) \Big] - \sum_{s \in \mathcal{S}, m \in [M]} \lambda_{s,m} \ell_m$$
$$= - \Bigg( \sum_{s \in \mathcal{S}} \mathbb{E}_s\Big[ \max_{k \in [K]} \Big( - \pi_s \big( f^*(X,S) - y_k \big)^2 - \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ y_k \le z_m \} \Big) \Big] + \sum_{s \in \mathcal{S}, m \in [M]} \lambda_{s,m} \ell_m \Bigg). \quad (14)$$
Therefore $H : \boldsymbol{\lambda} \mapsto -\mathcal{L}(f^*_{\boldsymbol{\lambda}}, \boldsymbol{\lambda})$ is convex with respect to $\boldsymbol{\lambda}$. Besides, $H$ is coercive. Indeed,
$$H(\boldsymbol{\lambda}) \ge \max_{k \in [K]} \sum_{s \in \mathcal{S}} \mathbb{E}_s\Big[ - \pi_s \big( f^*(X,S) - y_k \big)^2 - \sum_{m \in [M]} \lambda_{s,m} \mathbf{1}\{ y_k \le z_m \} \Big] + \sum_{s \in \mathcal{S}} \sum_{m \in [M]} \lambda_{s,m} \ell_m.$$
Since $|f^*(X,S)| \le A$ a.s., we deduce
$$H(\boldsymbol{\lambda}) \ge - 2A^2 + \max_{k \in [K]} \sum_{s \in \mathcal{S}} \sum_{m \in [M]} \lambda_{s,m} \big( \ell_m - \mathbf{1}\{ y_k \le z_m \} \big).$$
From the above inequality, and since for each $m \in [M]$ we have $y_0 < z_m < y_K$, we deduce that $H(\boldsymbol{\lambda}) \to +\infty$ as $\|\boldsymbol{\lambda}\| \to +\infty$. Therefore, $H$ admits a global minimizer. Then, we consider the predictor $f^*_{\boldsymbol{\lambda}^*}$ with $\boldsymbol{\lambda}^* \in \operatorname{argmin}_{\boldsymbol{\lambda}} -\mathcal{L}(f^*_{\boldsymbol{\lambda}}, \boldsymbol{\lambda})$. Under Assumption 2.1, the function $H$ is differentiable with respect to $\boldsymbol{\lambda}$, and
$$\partial_{s,m} H = - P_s\big( f^*_{\boldsymbol{\lambda}}(X,S) \le z_m \big) + \ell_m.$$
Therefore the first-order condition for the minimization over $\boldsymbol{\lambda}$ shows that for all $s \in \mathcal{S}$ and $m \in [M]$,
$$P_s\big( f^*_{\boldsymbol{\lambda}^*}(X,S) \le z_m \big) = \ell_m,$$
which implies that $f^*_{\boldsymbol{\lambda}^*}$ is $(\ell, Z)$-fair.
Finally, we observe that if $f \in \mathcal{F}_K$ is a predictor that is $(\ell, Z)$-fair, we have
$$R(f) - R(f^*) = \mathcal{L}(f, \boldsymbol{\lambda}^*) \ge \mathcal{L}(f^*_{\boldsymbol{\lambda}^*}, \boldsymbol{\lambda}^*) = R(f^*_{\boldsymbol{\lambda}^*}) - R(f^*).$$
From the above inequality, we deduce that
$$f^*_{\boldsymbol{\lambda}^*} \in \operatorname*{argmin}_{f \in \mathcal{F}_K} \big\{ R(f) - R(f^*) : f \text{ is } (\ell, Z)\text{-fair} \big\}.$$

Proof of Proposition 2.4. First of all, since for each $m \in [M]$, $z_m \in [-A, A]$, we can assume that $\tilde f(X,S) \in [-A, A]$. We define the predictor $T(\tilde f)$, the approximation of $\tilde f$ over the set $\mathcal{F}_K$:
$$T(\tilde f) = y_k, \quad \text{if } \tilde f(X,S) \in ( y_{k-1}, y_k ].$$
For each $m \in [M]$, since $z_m \in \mathcal{F}_K$, we have that for each $s \in \mathcal{S}$, $m \in [M]$,
$$P_s\big( \tilde f(X,S) \le z_m \big) = P_s\big( T(\tilde f)(X,S) \le z_m \big) = \ell_m.$$
Therefore, the predictor $T(\tilde f)$ is $(\ell, Z)$-fair. Hence
$$R\big( f^*_{(\ell, Z)\text{-fair}} \big) \le R\big( T(\tilde f) \big) = R\big( T(\tilde f) \big) - R( \tilde f ) + R( \tilde f ).$$
Now, we study the term $R(T(\tilde f)) - R(\tilde f)$ in the right-hand side of the above inequality. We have
$$0 \le R\big( T(\tilde f) \big) - R( \tilde f ) = \mathbb{E}\Big[ 2Y \big( \tilde f(X,S) - T(\tilde f)(X,S) \big) + T(\tilde f)^2(X,S) - \tilde f^2(X,S) \Big]$$
$$= \mathbb{E}\Big[ \big( 2Y - \tilde f(X,S) - T(\tilde f)(X,S) \big) \big( \tilde f(X,S) - T(\tilde f)(X,S) \big) \Big].$$
Now, since $Y \in [-A, A]$, $T(\tilde f)(X,S) \in [-A, A]$, and $\tilde f(X,S) \in [-A, A]$, we deduce that
$$R\big( T(\tilde f) \big) - R( \tilde f ) \le 3A\, \mathbb{E}\Big[ \big| \tilde f(X,S) - T(\tilde f)(X,S) \big| \Big] \le 3A \max_k | y_k - y_{k-1} | \le 6 A^2 K^{-1},$$
which yields the desired result.

B.2. Proofs of Section 3

Proof of Theorem 3.1. For each $k \in [K]$ and $(x, s) \in \mathbb{R}^d \times \mathcal{S}$, we define
$$\hat h_k(\boldsymbol{\lambda}, (x,s)) = - \sum_{m \in [M]} \lambda_{s,m} \big( \mathbf{1}\{ y_k \le z_m \} - \ell_m \big) - \hat \pi_s \big( y_k - \bar f(x,s) \big)^2.$$
For each $m \in [M]$, we also define $k(m) := \max\{ k : y_k \le z_m \}$. Let $s \in \mathcal{S}$, $m \in [M]$.
For $i \in [N]$, we introduce the events
$$A_k = \big\{ \forall j \neq k, \; \hat h_j(\boldsymbol{\lambda}, (X,S)) < \hat h_k(\boldsymbol{\lambda}, (X,S)) \big\},$$
and
$$B_k = \big\{ \forall j \neq k, \; \hat h_j(\boldsymbol{\lambda}, (X,S)) \le \hat h_k(\boldsymbol{\lambda}, (X,S)), \; \exists j \neq k, \; \hat h_j(\boldsymbol{\lambda}, (X,S)) = \hat h_k(\boldsymbol{\lambda}, (X,S)) \big\}.$$
We have that
$$\hat P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X,S) \le z_m \big) = \sum_{k \le k(m)} \hat P_s(A_k) + \hat P_s(B_k). \quad (15)$$
Let $g_i \in \partial_{s,m} \max_{k \in [K]} \hat h_k(\boldsymbol{\lambda}, (X,S))$. For $k \in [K]$, on the event $\{ \hat f_{(\ell, Z)\text{-fair}}(X,S) = y_k \}$, we have
$$g_i = \partial_{s,m} \hat h_k(\boldsymbol{\lambda}, (X,S)) \mathbf{1}\{A_k\} + \mathbf{1}\{B_k\} \sum_{j \in [K]} \alpha_{s,m,j}(X,S)\, \partial_{s,m} \hat h_j(\boldsymbol{\lambda}, (X,S))\, \mathbf{1}\{ \hat h_k(\boldsymbol{\lambda}, (X,S)) = \hat h_j(\boldsymbol{\lambda}, (X,S)) \},$$
with $(\alpha_{s,m,j}(X,S))_{j \in [K]} \in [0,1]^K$ satisfying
$$\sum_{j \in [K]} \alpha_{s,m,j}(X,S) \mathbf{1}\{ \hat h_k(\boldsymbol{\lambda}, (X,S)) = \hat h_j(\boldsymbol{\lambda}, (X,S)) \} = 1 \quad \text{a.s.}$$
Following arguments similar to those in the proof of Theorem 2.2, the function $\hat H$ is coercive and therefore admits a minimizer. Recall that $\hat{\boldsymbol{\lambda}}$ is defined as
$$\hat{\boldsymbol{\lambda}} \in \operatorname*{argmin}_{\boldsymbol{\lambda}} \hat H(\boldsymbol{\lambda}) = \sum_{s \in \mathcal{S}} \hat{\mathbb{E}}_s\Big[ \max_{k \in [K]} \hat h_k(\boldsymbol{\lambda}, (X,S)) \Big].$$
Since $\sum_{k \in [K]} \mathbf{1}\{ \hat f_{(\ell, Z)\text{-fair}}(X,S) = y_k \} = 1$, we deduce from the first-order condition that
$$\ell_m - \sum_{k \le k(m)} \hat P_s(A_k) - \sum_{k \le k(m)} \hat{\mathbb{E}}_s\Big[ \mathbf{1}_{B_k} \sum_{j \le k(m)} \alpha_{s,m,j}(X,S) \mathbf{1}\{ \hat h_k(\boldsymbol{\lambda}, (X,S)) = \hat h_j(\boldsymbol{\lambda}, (X,S)) \} \Big] = 0.$$
Hence, from the above equation and (15), we deduce that
$$\Big| P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X,S) \le z_m \big) - \ell_m \Big| \le \Big| \big( P_s - \hat P_s \big)\big( \hat f_{(\ell, Z)\text{-fair}}(X,S) \le z_m \big) \Big| + \sum_{k \le k(m)} \hat P_s\big( \exists j \neq k, \; \hat h_j(\boldsymbol{\lambda}, (X,S)) = \hat h_k(\boldsymbol{\lambda}, (X,S)) \big).$$
Therefore, it yields
$$\mathcal{U}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \le \sum_{s \in \mathcal{S}} \Bigg( \sup_{t \in \mathbb{R}} \Big| \big( P_s - \hat P_s \big)\big( \hat f_{(\ell, Z)\text{-fair}}(X,S) \le t \big) \Big| + \sum_{k \in [K]} \hat P_s\big( \exists j \neq k, \; \hat h_j(\boldsymbol{\lambda}, (X,S)) = \hat h_k(\boldsymbol{\lambda}, (X,S)) \big) \Bigg).$$
Now, conditional on $\mathcal{D}_n$, applying the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al.
, 1956) with Massart's sharp constant (Massart, 1990), and Lemma B.8 in (Chzhen et al., 2020a), we obtain that
$$\mathbb{E}\Big[ \mathcal{U}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \Big] \le C \sum_{s \in \mathcal{S}} \mathbb{E}\Big[ \sqrt{\frac{1}{N_s}} + \frac{K^2}{N_s} \Big].$$
Finally, using Lemma 4.1 in (Györfi et al., 2002), we get the desired result.

Proof of Theorem 3.2. Recall that $\hat f_{(\ell, Z)\text{-fair}}$ takes values in the finite grid $\mathcal{Y}_K = \{ y_1, \ldots, y_K \}$.

Step 1: empirical constraints are (essentially) satisfied. We claim that, conditionally on $\mathcal{D}_n$, for all $s \in \mathcal{S}$ and $m \in [M]$,
$$\hat P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le z_m \big) = \ell_m + (\text{tie terms}), \quad \text{a.s.,} \quad (16)$$
with tie terms of order $K^2 / N$. Indeed, as in the previous proof, $\hat{\boldsymbol{\lambda}}$ minimizes the convex objective $\hat H(\boldsymbol{\lambda})$ (Eq. (6) in the main text), hence $0 \in \partial \hat H(\hat{\boldsymbol{\lambda}})$. As in the proof of Theorem 3.1, one can compute a subgradient component-wise and obtain
$$0 \in \partial_{s,m} \hat H(\hat{\boldsymbol{\lambda}}) = \ell_m - \hat P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le z_m \big) + (\text{tie terms}).$$
By the dithering construction, conditionally on $\mathcal{D}_n$ the random variable $\bar f(X, s)$ has a continuous distribution, which implies that ties in the argmin over $y \in \mathcal{Y}_K$ occur with probability zero. Hence the tie terms vanish a.s., yielding (16).

Step 2: reduce the population violation to a generalization gap. Fix $s \in \mathcal{S}$ and $m \in [M]$. Using (16),
$$P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le z_m \big) - \ell_m = \big( P_s - \hat P_s \big)\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le z_m \big) + (\text{tie terms}).$$
Taking the maximum over $m$ and $s$ yields
$$\mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \le \max_{s \in \mathcal{S}} \sup_{t \in \mathbb{R}} \big( P_s - \hat P_s \big)\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le t \big) + (\text{tie terms}).$$

Step 3: concentration of the empirical CDF. Conditionally on $\mathcal{D}_n$, and for a fixed $s$, we can bound the deviation $\sup_{t \in \mathbb{R}} | \hat F_s(t) - F_s(t) |$ using the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al.
, 1956). Using Massart's sharp constant (Massart, 1990), with probability at least $1 - \delta$ we have
$$\sup_t | \hat F_s(t) - F_s(t) | \le \sqrt{ \frac{ \log(2/\delta) }{ 2 N_s } }.$$
Using the group-mass assumption $\pi_{\min} > 0$ and a standard concentration bound on $N_s$ (e.g., Hoeffding's inequality for binomials, (Hoeffding, 1963)), we have, on an event of probability at least $1 - \delta/2$, that $N_s \ge \frac{1}{2} N \pi_{\min}$ for all $s \in \mathcal{S}$. Combining and absorbing $\log |\mathcal{S}|$ and $\log 2$ into constants yields: with probability at least $1 - \delta$,
$$\max_{s \in \mathcal{S}} \sup_{t \in \mathbb{R}} \Big| \big( P_s - \hat P_s \big)\big( \hat f_{(\ell, Z)\text{-fair}}(X, s) \le t \big) \Big| \le C_{\mathcal{S}} \Bigg( \sqrt{\frac{1}{N}} + \sqrt{\frac{\log(1/\delta)}{N}} \Bigg).$$
The above implies the claimed
$$\mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \le C_{\mathcal{S}} M \Bigg( \sqrt{\frac{1}{N}} + \frac{K^2}{N} + \sqrt{\frac{\log(1/\delta)}{N}} \Bigg),$$
which concludes the proof of Theorem 3.2.

Proof of Theorem 3.3. First, for each $\boldsymbol{\lambda} = (\lambda_{s,m})_{s \in \mathcal{S}, m \in [M]} \in \mathbb{R}^{|\mathcal{S}| M}$, we introduce the predictor $f^*_{\boldsymbol{\lambda}}$ defined as
$$f^*_{\boldsymbol{\lambda}} \in \operatorname*{argmin}_{f \in \mathcal{F}_K} R_{\boldsymbol{\lambda}}(f).$$
It is important to note that the Lagrange multiplier $\boldsymbol{\lambda}^*$ is characterized as
$$\boldsymbol{\lambda}^* \in \operatorname*{argmax}_{\boldsymbol{\lambda} \in \mathbb{R}^{|\mathcal{S}| M}} R_{\boldsymbol{\lambda}}(f^*_{\boldsymbol{\lambda}}).$$
We start with the following decomposition:
$$R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\lambda^\star}\big( f^*_{(\ell, Z)\text{-fair}} \big) = R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) + R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) + R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) - R_{\lambda^*}\big( f^*_{(\ell, Z)\text{-fair}} \big). \quad (17)$$
By definition of the parameter $\boldsymbol{\lambda}^*$, conditional on the data, the last term on the right-hand side of the above equation satisfies
$$R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) - R_{\lambda^*}\big( f^*_{(\ell, Z)\text{-fair}} \big) \le 0.$$
Furthermore, since each coordinate of the parameters $\hat{\boldsymbol{\lambda}}$ and $\boldsymbol{\lambda}^*$ is bounded by a constant that depends on $A$, we observe that the first term on the right-hand side of Equation (17) satisfies
$$\mathbb{E}\Big[ R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \Big] \le C M\, \mathbb{E}\Big[ \mathcal{U}\big( \hat f_{(\ell, Z)\text{-fair}} \big) \Big].$$
Therefore, we deduce with Equation (17) and Theorem 3.1 that
$$\mathbb{E}\Big[ R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\lambda^\star}\big( f^*_{(\ell, Z)\text{-fair}} \big) \Big] \le C M \Bigg( \sqrt{\frac{1}{N}} + \frac{K^2}{N} \Bigg) + \mathbb{E}\Big[ R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) \Big]. \quad (18)$$
Now, we study the second term on the right-hand side of the above equation. From Equations (13) and (14), we have that, conditional on the data,
$$R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) = \sum_{s \in \mathcal{S}} \Bigg( \mathbb{E}_s\Big[ \max_{k \in [K]} \Big( - \pi_s \big( f^*(X,S) - y_k \big)^2 - \langle \hat\lambda_s, a(y_k) \rangle \Big) \Big] - \mathbb{E}_s\Big[ \sum_{k \in [K]} \Big( - \pi_s \big( f^*(X,S) - y_k \big)^2 - \langle \hat\lambda_s, a(y_k) \rangle \Big) \mathbf{1}\{ \hat f_{(\ell, Z)\text{-fair}} = y_k \} \Big] \Bigg). \quad (19)$$
Now, for each $k \in [K]$ and $s \in \mathcal{S}$, we introduce
$$\tilde h_k(X,S) = - \pi_s \big( f^*(X,S) - y_k \big)^2 - \langle \hat\lambda_s, a(y_k) \rangle, \quad \text{and} \quad \hat h_k(X,S) = - \hat\pi_s \big( \bar f(X,S) - y_k \big)^2 - \langle \hat\lambda_s, a(y_k) \rangle.$$
Note that we have $f^*_{\hat\lambda}(x,s) \in \operatorname{argmax}_{y_k \in \mathcal{F}_K} \tilde h_k(x,s)$ and $\hat f_{(\ell, Z)\text{-fair}}(x,s) \in \operatorname{argmax}_{y_k \in \mathcal{F}_K} \hat h_k(x,s)$. Therefore, from Equation (19), we deduce that
$$R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) \le 2 \sum_{s \in \mathcal{S}} \mathbb{E}_s\Big[ \max_{k \in [K]} \big| \tilde h_k(X,S) - \hat h_k(X,S) \big| \Big].$$
Finally, since for each $k \in [K]$, $y_k$, $f^*(X,S)$, and $\bar f(X,S)$ are bounded by $A$, we deduce that
$$\big| \tilde h_k(X,S) - \hat h_k(X,S) \big| \le C_A \Big( \pi_s \big| \bar f(X,S) - f^*(X,S) \big| + | \hat\pi_s - \pi_s | \Big) \le C_A \Big( \pi_s \big| \hat f(X,S) - f^*(X,S) \big| + | \hat\pi_s - \pi_s | + u \Big).$$
Therefore, the last inequality yields
$$\mathbb{E}\Big[ R_{\hat\lambda}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\hat\lambda}\big( f^*_{\hat\lambda} \big) \Big] \le C_A \Big( \mathbb{E}\big[ \big| \hat f(X,S) - f^*(X,S) \big| \big] + \mathbb{E}\big[ | \hat\pi_s - \pi_s | \big] + u \Big).$$
Combining the above equation with Equation (18) gives the desired result.

Proof of Theorem 3.4. We prove a high-probability result, an analogue of Theorem 3.3.

Step 1: a deterministic decomposition. Recall
$$R_{\lambda^\star}(f) = R(f) + \sum_{s \in \mathcal{S}} \sum_{m \in [M]} \lambda^\star_{s,m} \big( P_s( f(X,s) \le z_m ) - \ell_m \big).$$
Since $f^\star_{(\ell, Z)\text{-fair}}$ satisfies the constraints, the penalty term vanishes for $f^\star_{(\ell, Z)\text{-fair}}$, hence
$$R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\lambda^\star}\big( f^\star_{(\ell, Z)\text{-fair}} \big) = \underbrace{ R\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R\big( f^\star_{(\ell, Z)\text{-fair}} \big) }_{(\mathrm{I})} + \underbrace{ \sum_{s,m} \lambda^\star_{s,m} \Delta_{s,m} }_{(\mathrm{II})}, \quad (20)$$
with $\Delta_{s,m} := P_s\big( \hat f_{(\ell, Z)\text{-fair}}(X,s) \le z_m \big) - \ell_m$.

Step 2: control of the penalty term by $\mathcal{U}_{(\ell, Z)}$. By definition of $\mathcal{U}_{(\ell, Z)}$, $| \Delta_{s,m} | \le \mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big)$. Therefore,
$$| (\mathrm{II}) | \le \Bigg( \sum_{s \in \mathcal{S}} \sum_{m \in [M]} | \lambda^\star_{s,m} | \Bigg) \mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big).$$
The quantity $| \lambda^\star_{s,m} |$ depends only on $(\mathcal{S}, \pi_{\min}, A)$ through the dual problem (Theorem 2.2) and is absorbed into the constant $C_{\mathcal{S}}$. Therefore, it yields
$$| (\mathrm{II}) | \le C_{\mathcal{S}} M\, \mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big).$$

Step 3: control of the risk term by the base regressor error. We compare the post-processing based on $f^\star$ and on $\bar f$. Using the boundedness $|Y| \le A$ and $|f| \le A$, the squared loss is $4A$-Lipschitz: for any two predictors $f, g$,
$$| R(f) - R(g) | = \Big| \mathbb{E}\big[ (Y - f)^2 - (Y - g)^2 \big] \Big| \le 4A\, \mathbb{E}\big[ | f(X,S) - g(X,S) | \big]. \quad (21)$$
Since $\bar f = \Pi_{[-A,A]}( \hat f + \xi )$ with $\xi \sim \mathrm{Unif}([0, u])$ independent,
$$\mathbb{E}\big[ | \bar f(X,S) - f^\star(X,S) | \big] \le \mathbb{E}\big[ | \hat f(X,S) - f^\star(X,S) | \big] + \mathbb{E}[ |\xi| ] \le \mathbb{E}\big[ | \hat f(X,S) - f^\star(X,S) | \big] + u.$$
The post-processed predictor is obtained by minimizing a pointwise objective of the form $\hat\pi_s ( y - \bar f(x,s) )^2 + \langle \hat\lambda_s, a(y) \rangle$ over $y \in \mathcal{Y}_K$. A standard comparison argument (the same as in the proof of Theorem 3.3, in expectation form) combined with (21) yields
$$(\mathrm{I}) \le C_{\mathcal{S}, A} \Big( \mathbb{E}\big[ | \hat f(X,S) - f^\star(X,S) | \big] + | \hat\pi_s - \pi_s | + u \Big) + C_{\mathcal{S}}\, \mathcal{U}_{(\ell, Z)}\big( \hat f_{(\ell, Z)\text{-fair}} \big),$$
where the additional $\mathcal{U}$ term accounts for the calibration/generalization gap on the unlabeled sample.

Step 4: plug in the high-probability bound on $\mathcal{U}$.
Combining Steps 1–3 and applying Theorem 3.2 gives that, conditionally on $\mathcal{D}_n$, with probability at least $1 - \delta$,
$$R_{\lambda^\star}\big( \hat f_{(\ell, Z)\text{-fair}} \big) - R_{\lambda^\star}\big( f^\star_{(\ell, Z)\text{-fair}} \big) \le C_{\mathcal{S}} \Big( \mathbb{E}\big[ | \hat f - f^\star | \big] + | \hat\pi_s - \pi_s | + u \Big) + C_{\mathcal{S}} M \Bigg( \sqrt{\frac{1}{N}} + \frac{K^2}{N} + \sqrt{\frac{\log(1/\delta)}{N}} \Bigg).$$
Absorbing constants yields the statement of Theorem 3.4.
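As a sanity check on the concentration step used above (the Dvoretzky–Kiefer–Wolfowitz inequality with Massart's sharp constant), the deviation bound can be verified by simulation. A minimal sketch, assuming Uniform(0,1) data so that the true CDF is $F(t) = t$; the sample sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 500, 0.05, 200
# DKW with Massart's constant: sup_t |F_n(t) - F(t)| <= eps with prob. >= 1 - delta
eps = np.sqrt(np.log(2 / delta) / (2 * n))

def sup_dev(x):
    """sup_t |F_n(t) - F(t)| for Uniform(0,1) samples (true CDF F(t) = t)."""
    x = np.sort(x)
    i = np.arange(1, n + 1)
    return max((i / n - x).max(), (x - (i - 1) / n).max())

viol = sum(sup_dev(rng.uniform(size=n)) > eps for _ in range(trials))
print(viol / trials)   # empirical violation rate; should not exceed delta by much
```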
