Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?

Allocational harms occur when resources or opportunities are unfairly withheld from specific groups. Many proposed bias measures ignore the discrepancy between predictions, which are what these measures evaluate, and the decisions ultimately made from those predictions. Our work examines the reliability of current bias metrics in assessing allocational harms arising from predictions of large language models (LLMs). We evaluate their predictive validity and utility for model selection across ten LLMs and two allocation tasks. Our results reveal that commonly used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes. Our work highlights the need to account for how model predictions are used in decisions, particularly in contexts where limited resources are allocated.


💡 Research Summary

This paper investigates whether the bias metrics that are currently used to evaluate large language models (LLMs) are capable of capturing allocational harms—situations where certain demographic groups are systematically denied resources or opportunities. The authors argue that most existing metrics focus solely on differences in model predictions, ignoring the crucial step where those predictions are translated into concrete decisions (e.g., who gets a job interview or a loan). To test this claim, they conduct a systematic empirical study across ten publicly available LLMs and two distinct allocation tasks: (1) resume screening for hiring and (2) essay grading for language‑learning assessment.

The allocation tasks are framed as top‑k ranking problems. For each task a pool of candidates is constructed, and a fixed quota k (either 1 or 2) of candidates is selected based on the model’s predicted scores. The authors define two standard fairness gaps that reflect allocational outcomes: demographic parity gap (ΔDP), the difference in selection rates between a protected group and a reference group, and equal‑opportunity gap (ΔEO), the difference in selection rates among qualified candidates. These gaps serve as the ground‑truth measures of allocational harm.
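The top-k framing and the two ground-truth gaps can be sketched as follows. This is an illustrative implementation, not the authors' code; the function name and array-based interface are assumptions.

```python
import numpy as np

def allocation_gaps(scores, groups, qualified, k, group_a, group_b):
    """Compute the demographic parity gap (ΔDP) and equal opportunity
    gap (ΔEO) for a top-k selection based on predicted scores.

    scores: model-predicted score per candidate
    groups: group label per candidate
    qualified: ground-truth qualification flag per candidate
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    qualified = np.asarray(qualified, dtype=bool)

    # Select the k highest-scoring candidates.
    selected = np.zeros(len(scores), dtype=bool)
    selected[np.argsort(scores)[::-1][:k]] = True

    def selection_rate(mask):
        return selected[mask].mean() if mask.any() else 0.0

    a, b = groups == group_a, groups == group_b
    dp_gap = selection_rate(a) - selection_rate(b)                  # ΔDP
    eo_gap = selection_rate(a & qualified) - selection_rate(b & qualified)  # ΔEO
    return dp_gap, eo_gap
```

For example, if the two top-scoring candidates in a four-person pool both belong to group A, `allocation_gaps` with k=2 reports a ΔDP of 1.0 minus group B's selection rate of 0.0.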

Four conventional bias metrics are evaluated: (i) average performance gap (δ), which computes the mean score difference between groups; (ii) Jensen–Shannon divergence (JSD) and (iii) Earth Mover’s Distance (EMD), both of which quantify distributional distance between group score distributions; and (iv) a newly proposed rank‑biserial correlation (RB). RB measures the correlation between group membership and ranking by counting the proportion of favorable versus unfavorable pairwise comparisons (i.e., how often the model prefers a candidate from group A over one from group B).
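The rank-biserial correlation described above reduces to counting favorable versus unfavorable pairwise comparisons between the two groups' scores. A minimal sketch (the function name is illustrative; ties contribute to neither count, which matches the standard rank-biserial definition):

```python
import numpy as np

def rank_biserial(scores_a, scores_b):
    """Rank-biserial correlation between group membership and ranking.

    Computed as the fraction of pairs where a group-A candidate outranks
    a group-B candidate, minus the fraction where the reverse holds.
    Returns a value in [-1, 1]; 0 means neither group is preferred.
    """
    a = np.asarray(scores_a, dtype=float)[:, None]
    b = np.asarray(scores_b, dtype=float)[None, :]
    favorable = (a > b).mean()     # pairs where group A is ranked higher
    unfavorable = (a < b).mean()   # pairs where group B is ranked higher
    return favorable - unfavorable
```

Because it depends only on pairwise orderings, this statistic is invariant to any monotone transformation of the scores, which is what makes it insensitive to the score distribution's shape.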

The experimental results reveal a striking pattern. In the resume‑screening task, δ, JSD, and EMD show virtually no correlation with the actual ΔDP or ΔEO values (Pearson r near zero). In the essay‑grading task, these metrics achieve modest positive correlations, but still far below what would be required for reliable auditing. By contrast, RB consistently yields high correlations (r ≥ 0.86) with both ΔDP and ΔEO across both tasks.

Beyond correlation, the authors assess the utility of each metric for model selection. They simulate an audit scenario where models are ranked by their bias scores (aggregated across groups) and compare this ranking to an “ideal” ranking based on the true allocation gaps. Using normalized discounted cumulative gain (NDCG) as the evaluation metric, RB again outperforms all other measures, achieving average NDCG@10 ≥ 0.95, whereas the traditional metrics often rank more biased models as more fair.
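The NDCG comparison can be sketched as follows, where `relevances[i]` is the relevance (e.g., derived from the true allocation gap) of the model a metric placed at rank i. This is a generic NDCG@k implementation under standard logarithmic discounting, not necessarily the paper's exact formulation:

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k: discounted cumulative gain of the given ranking,
    normalized by the gain of the ideal (relevance-sorted) ranking."""
    rel = np.asarray(relevances, dtype=float)
    k = min(k, len(rel))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(rank+1)
    dcg = (rel[:k] * discounts).sum()
    idcg = (np.sort(rel)[::-1][:k] * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that already orders models by true relevance scores an NDCG of 1.0; rankings that place biased models early are penalized most at the top positions.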

A deeper analysis shows that the conventional metrics are highly sensitive to the shape of the prediction score distribution. The resume‑screening scores are left‑skewed with heavy tails, violating the implicit normality assumptions underlying δ, JSD, and EMD. The essay‑grading scores are closer to a normal distribution, which partly explains the modest improvement in correlation for those tasks. RB, however, does not rely on distributional assumptions; it directly captures the ordering that drives allocation decisions, making it robust across disparate score shapes.
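A toy simulation illustrates why shape matters. The numbers below are invented for illustration only: two groups share the same mean score (so the average performance gap δ is zero), yet one group's heavy right tail captures every top-k slot, producing a large ΔDP.

```python
import numpy as np

# Group A is skewed with a heavy right tail; group B is flat.
# Both have mean 0.44, so the average performance gap δ is zero.
a = np.array([0.95, 0.95, 0.10, 0.10, 0.10])
b = np.full(5, 0.44)

scores = np.concatenate([a, b])
groups = np.array(["A"] * 5 + ["B"] * 5)

# Top-k selection: both slots go to group A's tail.
k = 2
selected = np.zeros(len(scores), dtype=bool)
selected[np.argsort(scores)[::-1][:k]] = True

delta = a.mean() - b.mean()                                        # δ ≈ 0
dp_gap = selected[groups == "A"].mean() - selected[groups == "B"].mean()
print(delta, dp_gap)  # δ ≈ 0 while ΔDP = 0.4
```

A mean-based metric reports no bias here, while the allocation outcome is clearly disparate; a rank-based statistic sees the disparity directly in the ordering.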

Group‑level examinations further expose the shortcomings of the traditional metrics: they can both under‑estimate harms for some groups (e.g., white females) and over‑estimate for others (e.g., Hispanic males). RB provides consistent estimates across all demographic slices, suggesting that it is less likely to introduce secondary biases when used in audits.

The discussion emphasizes that audits which ignore the decision‑making context risk providing a false sense of safety. In high‑stakes domains such as lending, hiring, or medical triage, practitioners often rely heavily on model scores; therefore, bias measures must reflect how those scores will be operationalized. The authors advocate for the adoption of metrics like RB that are tightly coupled to the allocation mechanism, arguing that such measures are essential for trustworthy AI governance.

In summary, the paper makes three key contributions: (1) it empirically demonstrates that widely‑used average‑gap and distribution‑distance bias metrics fail to predict allocational harms for LLMs; (2) it introduces the rank‑biserial correlation as a simple yet powerful alternative that aligns closely with actual allocation disparities; and (3) it shows that RB improves model selection for fairness audits, thereby offering a practical tool for regulators and developers aiming to mitigate allocational bias in LLM‑driven decision systems.
