Optimal Fair Aggregation of Crowdsourced Noisy Labels using Demographic Parity Constraints
As acquiring reliable ground-truth labels is usually costly or infeasible, crowdsourcing and aggregating noisy human annotations is the typical resort. Aggregating subjective labels, though, may amplify individual biases, particularly regarding sensitive features, raising fairness concerns. Nonetheless, fairness in crowdsourced aggregation remains largely unexplored, with no existing convergence guarantees and only limited post-processing approaches for enforcing $\varepsilon$-fairness under demographic parity. We address this gap by analyzing the fairness of crowdsourced aggregation methods within the $\varepsilon$-fairness framework, for Majority Vote and Optimal Bayesian aggregation. In the small-crowd regime, we derive an upper bound on the fairness gap of Majority Vote in terms of the fairness gaps of the individual annotators. We further show that the fairness gap of the aggregated consensus converges exponentially fast to that of the ground-truth under interpretable conditions. Since the ground-truth itself may still be unfair, we generalize a state-of-the-art multiclass fairness post-processing algorithm from the continuous to the discrete setting, enforcing strict demographic parity constraints on any aggregation rule. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and corroborate the theoretical insights.
💡 Research Summary
The paper tackles the largely overlooked problem of fairness in crowdsourced label aggregation, focusing on demographic parity (DP) as the fairness criterion. In many real‑world applications, obtaining ground‑truth labels is expensive or impossible, so multiple annotators provide noisy labels that are later aggregated. However, annotators often exhibit systematic biases correlated with sensitive attributes such as gender or race, and naïve aggregation can amplify these biases. The authors therefore study how two canonical aggregation rules—Majority Vote (MV) and the Bayes‑optimal aggregator—affect the DP gap, and they propose a post‑processing method that can enforce strict ε‑DP on any aggregation outcome.
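To make the post-processing idea concrete, here is a minimal sketch of one simple way to enforce an ε-DP constraint on binary predictions: randomly flip positive predictions in the over-accepted group until the acceptance-rate gap drops to ε. This is a simplified stand-in for illustration, not the paper's actual algorithm (which generalizes a multiclass continuous-to-discrete post-processing method); all rates and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def postprocess_dp(z, a, eps):
    """Enforce |P(Z=1|A=1) - P(Z=1|A=0)| <= eps by randomly flipping
    positive predictions in the group with the higher acceptance rate.
    Simplified illustration only, not the paper's method."""
    z = z.copy()
    gap = z[a == 1].mean() - z[a == 0].mean()
    if abs(gap) <= eps:
        return z
    hi = 1 if gap > 0 else 0            # over-accepted group
    rate = z[a == hi].mean()            # its current acceptance rate
    excess = abs(gap) - eps             # acceptance mass to remove
    flip_prob = excess / rate           # fraction of its positives to flip
    mask = (a == hi) & (z == 1) & (rng.random(len(z)) < flip_prob)
    z[mask] = 0
    return z

# Demo on synthetic biased predictions (hypothetical acceptance rates).
n = 200_000
a = rng.integers(0, 2, size=n)
z = (rng.random(n) < np.where(a == 1, 0.7, 0.4)).astype(int)
z_fair = postprocess_dp(z, a, eps=0.05)
print(f"gap before: {abs(z[a == 1].mean() - z[a == 0].mean()):.3f}")
print(f"gap after:  {abs(z_fair[a == 1].mean() - z_fair[a == 0].mean()):.3f}")
```

Note that this randomized-flipping baseline sacrifices accuracy uniformly within the over-accepted group; the appeal of optimal post-processing methods is that they choose which predictions to change so as to minimize that accuracy loss.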
Problem setting and notation
The authors consider binary classification (Y∈{0,1}) with a binary sensitive feature A∈{0,1} and additional non‑sensitive features X. Each annotator r follows a one‑coin model: its skill p_r(a,x)=P(˜Y_r=Y|A=a, X=x) may depend on both A and X. The set of noisy labels ˜Y_{1:R} together with (X,A) is fed to an aggregation function ϕ: {0,1}^R×𝒳×{0,1}→{0,1}. The DP gap of any binary predictor Z is defined as ΔDP(Z)=|P(Z=1|A=1)−P(Z=1|A=0)|. A predictor is ε‑fair if ΔDP(Z)≤ε.
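The DP gap and the one-coin annotator model above are straightforward to simulate. The sketch below computes ΔDP empirically for a ground truth Y mildly correlated with A and for a single one-coin annotator; all distributional parameters (base rates, skill levels) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gap(z, a):
    """Demographic parity gap: |P(Z=1|A=1) - P(Z=1|A=0)|."""
    return abs(z[a == 1].mean() - z[a == 0].mean())

# Illustrative population: sensitive attribute A and labels Y,
# with P(Y=1|A=1)=0.6 and P(Y=1|A=0)=0.5, so ΔDP(Y)=0.1.
n = 100_000
a = rng.integers(0, 2, size=n)
y = (rng.random(n) < np.where(a == 1, 0.6, 0.5)).astype(int)

# One-coin annotator: reports Y with group-dependent accuracy p_r(a),
# otherwise flips it (skills 0.9 / 0.7 are assumed for illustration).
p = np.where(a == 1, 0.9, 0.7)
y_tilde = np.where(rng.random(n) < p, y, 1 - y)

print(f"DP gap of ground truth:    {dp_gap(y, a):.3f}")
print(f"DP gap of noisy annotator: {dp_gap(y_tilde, a):.3f}")
```

With these particular skills the annotator's noise shrinks the gap toward 0; group-dependent skills can just as easily enlarge it, which is the bias-amplification concern the paper studies.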
Theoretical contributions
- Error exponent bounds – Building on Gao et al. (2016), the authors prove that for any (a,x) the conditional error probability of both MV and the Bayes‑optimal aggregator satisfies
  P(ϕ(˜Y_{1:R},X,A)≠Y | A=a, X=x) ≤ exp(−R·K_ϕ(a,x)),
  where K_ϕ is an explicit function of the annotators' skills. For the Bayes‑optimal rule K_ϕ★ is a simple average of log‑likelihood ratios; for MV it is a minimax expression over a tilting parameter t∈(0,1].
- Fairness convergence – Proposition 3.2 shows that the difference between the DP gap of the aggregated label and that of the true label is bounded by a term of order E_{X|A=a}[exp(−R·K_ϕ(a,X))], i.e., it vanishes exponentially in the crowd size R.
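The exponential behavior above is easy to observe numerically. The sketch below aggregates R one-coin annotators by Majority Vote and tracks both the error rate and the distance between the DP gap of the consensus and that of the ground truth as R grows; the group-dependent skills (0.8 / 0.7) and base rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_gap(z, a):
    """Demographic parity gap: |P(Z=1|A=1) - P(Z=1|A=0)|."""
    return abs(z[a == 1].mean() - z[a == 0].mean())

def majority_vote(labels):
    """Aggregate an (n, R) array of binary votes by majority; ties -> 1."""
    return (labels.mean(axis=1) >= 0.5).astype(int)

# Illustrative population with ΔDP(Y) = |0.65 - 0.45| = 0.2.
n = 50_000
a = rng.integers(0, 2, size=n)
y = (rng.random(n) < np.where(a == 1, 0.65, 0.45)).astype(int)

# Assumed group-dependent annotator skills, identical across the crowd.
p = np.where(a == 1, 0.8, 0.7)[:, None]

for R in (1, 5, 15, 45):
    votes = np.where(rng.random((n, R)) < p, y[:, None], 1 - y[:, None])
    y_hat = majority_vote(votes)
    err = (y_hat != y).mean()
    print(f"R={R:2d}  error={err:.4f}  "
          f"|DP(MV) - DP(Y)| = {abs(dp_gap(y_hat, a) - dp_gap(y, a)):.4f}")
```

As R grows, the MV error rate and the DP-gap difference both shrink rapidly, consistent with the exp(−R·K_ϕ) rate; note that the consensus converges to the DP gap of the ground truth, which is exactly why the paper's post-processing step is still needed when the ground truth itself is unfair.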