Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness, i.e., the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments reveal that label-clustered CP variants consistently deliver superior substantive fairness. Finally, we empirically show that equalized set sizes, rather than equalized coverage, correlate strongly with improved substantive fairness, enabling practitioners to design fairer CP systems. Our code is available at https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness.
Conformal prediction (CP) (Vovk et al., 2005; Shafer & Vovk, 2008) provides finite-sample, distribution-free statistical guarantees through a well-defined procedure; yet, whether these procedural guarantees translate into equitable outcomes in downstream decision-making remains unclear. In high-stakes domains, reliable uncertainty quantification is essential for building trustworthy models. Unlike other methods that rely on strong assumptions about the data distribution (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017), CP makes no such assumptions about the underlying data distribution.

Fairness in machine learning (Barocas et al., 2023), particularly in regulated fields such as healthcare and finance, is commonly understood through two complementary perspectives: procedural fairness, which concerns the integrity of the decision process (e.g., fairness through unawareness (Zemel et al., 2013; Kusner et al., 2017)); and substantive fairness, which focuses on equitable outcomes across groups (e.g., Equalized Odds (Hardt et al., 2016)). Prior research in CP has mainly focused on procedural fairness, treating CP as a standalone process (Romano et al., 2020a). In practice, CP constitutes one step in a larger pipeline that includes downstream decisions. The interactions of CP with procedural and substantive notions of fairness in this broader context remain less well understood (Cresswell, 2025).
In this work, we move beyond viewing CP as a standalone operation to analyze the holistic decision-making pipeline. While ultimate fairness is defined by substantive outcomes, procedural choices within CP play a critical role in shaping these results. We aim to uncover the specific connections between procedural properties and substantive fairness, enabling the design of procedures that positively influence downstream equity. By evaluating fairness as an emergent property of the entire pipeline, we can distinguish between procedural metrics that are merely performative and those that genuinely drive fair outcomes.
Our main contributions are threefold:

Scalable LLM-in-the-loop fairness evaluation. To overcome the resource constraints of human-subject experiments, we leverage large language models (LLMs) in an evaluation protocol that approximates human decision behavior. We validate that this evaluator produces results comparable to human-in-the-loop benchmarks, enabling us to scale our analysis of substantive fairness across a broader range of datasets and algorithms than prior work.

Connecting procedural properties to substantive fairness. We explicitly map the relationships between procedural CP metrics and substantive outcomes. Crucially, we find that Equalized Set Size correlates strongly with improved substantive fairness, whereas the standard goal of Equalized Coverage often has negative effects. This insight shifts the design objective from coverage parity to set size parity.

Theoretical and empirical validation of Label-Clustered CP. Guided by the connection between set sizes and substantive fairness, we analyze Label-Clustered CP. We derive a theoretical upper bound decomposing the set size disparity into interpretable components. Experimentally, we confirm that Label-Clustered CP reduces set size disparity more effectively than marginal or group-conditional approaches, and consistently achieves the best substantive fairness results in our evaluations.
Consider inputs $x \in \mathcal{X} \subset \mathbb{R}^d$ with ground-truth labels $y \in \mathcal{Y} = [m] := \{1, \ldots, m\}$, drawn from a joint distribution $(x, y) \sim \mathcal{P}$. Let $f : \mathcal{X} \to \Delta^{m-1} \subset \mathbb{R}^m$ be a classifier outputting predicted probabilities, where $\Delta^{m-1}$ is the $(m-1)$-dimensional probability simplex. CP constructs a set-valued function $C : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, where $\mathcal{P}(\mathcal{Y})$ denotes the power set of $\mathcal{Y}$, such that the following marginal coverage guarantee holds,
$$\mathbb{P}\big(y_{\text{test}} \in C(x_{\text{test}})\big) \geq 1 - \alpha, \tag{1}$$
where $\alpha \in [0, 1]$ is user-specified (Vovk et al., 1999; 2005).
CP achieves coverage by varying the set size $|C(x)|$ based on a calibrated notion of model confidence. Calibration relies on a held-out dataset $\mathcal{D}_{\text{cal}} = \{(x_i, y_i)\}_{i=1}^{n_{\text{cal}}}$ consisting of $n_{\text{cal}}$ datapoints drawn from $\mathcal{P}$. A conformal score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ measures non-conformity between a candidate label and an input datapoint $x$, with higher scores indicating poorer agreement. The score function is often defined to make use of information from the classifier $f$ in judging the level of agreement.
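To make the score function concrete, the sketch below computes non-conformity scores on a calibration set using the simple choice $s(x, y) = 1 - f(x)_y$, i.e., one minus the predicted probability of the true label. This is only one common option (APS and RAPS, discussed below, are alternatives); the function name and array layout are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def conformal_scores(probs, labels):
    """Non-conformity scores on a calibration set.

    Uses the common choice s(x, y) = 1 - f(x)_y; higher scores indicate
    poorer agreement between the candidate label and the input.
    probs:  array of predicted probabilities, shape (n_cal, m).
    labels: array of ground-truth labels, shape (n_cal,).
    """
    return 1.0 - probs[np.arange(len(labels)), labels]
```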
Let $S_i := s(x_i, y_i)$ for $i \in [n_{\text{cal}}]$, and define
$$\tau_\alpha := \frac{\lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil}{n_{\text{cal}}}. \tag{2}$$
The empirical conformal threshold is then given by
$$\hat{q}_\alpha := \mathrm{Quantile}_{\tau_\alpha}\big(S_1, \ldots, S_{n_{\text{cal}}}\big) \in \mathbb{R}. \tag{3}$$
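A minimal sketch of Equations (2) and (3), assuming the score array from the previous snippet; the `'higher'` quantile method gives a conservative empirical quantile.

```python
import numpy as np

def conformal_threshold(scores, alpha):
    """Empirical conformal threshold q_hat_alpha (Eq. (3)).

    scores: array of calibration scores S_1, ..., S_{n_cal}.
    alpha:  user-specified miscoverage level in [0, 1].
    """
    n_cal = len(scores)
    # Quantile level tau_alpha from Eq. (2).
    tau = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
    # Guard against (n_cal + 1)(1 - alpha) exceeding n_cal for small n_cal.
    tau = min(tau, 1.0)
    # 'higher' rounds up to an observed score, keeping the threshold conservative.
    return np.quantile(scores, tau, method="higher")
```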
For a test point $x_{\text{test}}$ drawn from the $x$-marginal of the distribution $\mathcal{P}$, a conformal prediction set is constructed as
$$C(x_{\text{test}}) = \{\, y \in \mathcal{Y} : s(x_{\text{test}}, y) \leq \hat{q}_\alpha \,\}. \tag{4}$$
Sets constructed this way satisfy $1 - \alpha$ coverage (Equation (1)) for any score function $s$, but smaller sets are more useful for downstream uncertainty quantification applications (Cresswell et al., 2024). The average set size $\mathbb{E}[|C|]$ is dictated by the quality of $s$, and in turn by the accuracy and calibration of the classifier $f$. Efficient score functions like APS (Romano et al., 2020b) and RAPS (Angelopoulos et al., 2021) are designed to produce smaller prediction sets while maintaining coverage.
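Putting the pieces together, the self-contained sketch below walks through the full split-conformal procedure (Equations (2)-(4)) and reports empirical coverage and average set size. The synthetic probabilities standing in for a trained classifier, the score choice $s(x, y) = 1 - f(x)_y$, and all variable names are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_cal, n_test, alpha = 5, 1000, 500, 0.1

# Random probabilities stand in for a trained classifier f; calibration and
# test data are drawn identically, so exchangeability holds.
cal_probs = rng.dirichlet(np.ones(m), size=n_cal)
cal_labels = rng.integers(0, m, size=n_cal)
test_probs = rng.dirichlet(np.ones(m), size=n_test)
test_labels = rng.integers(0, m, size=n_test)

# Calibration scores S_i = 1 - f(x_i)_{y_i} and threshold q_hat (Eqs. (2)-(3)).
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
tau = min(np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, 1.0)
q_hat = np.quantile(scores, tau, method="higher")

# Prediction sets via Eq. (4): include every label whose score is at most q_hat.
sets = (1.0 - test_probs) <= q_hat                       # boolean, shape (n_test, m)

coverage = sets[np.arange(n_test), test_labels].mean()   # should be close to 1 - alpha
avg_size = sets.sum(axis=1).mean()                       # average |C(x_test)|
print(f"empirical coverage: {coverage:.3f}, average set size: {avg_size:.2f}")
```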