Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
Authors: Kai Ye, Qingtao Pan, Shuo Li
Kai Ye (Kai.Ye@pitt.edu), Qingtao Pan (qingtaopan33@gmail.com), Shuo Li* (shuo.li11@case.edu)
Case Western Reserve University

Abstract

Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only marginal guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose Conditional Factuality Control (CFC), a post-hoc conformal framework that returns set-valued outputs with conditional coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent "success" score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its efficiency, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\sqrt{\log(1/\delta)/N})$. Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.

1. Introduction

Large language models (LLMs) have delivered striking progress across reasoning and generation tasks [2, 21], yet their outputs can be unreliable due to hallucinations [9].
Inference-time strategies that invest more compute on sampling often improve accuracy [1, 5], but they do not provide formal reliability guarantees. For safety-critical or high-stakes applications, such heuristics are insufficient: we need procedures that make uncertainty explicit and can control error rates.

*Corresponding author.

Figure 1. Limitation of marginal CP and advantage of the proposed CFC. Left: a single global threshold learned from the marginal score mixture yields only marginal coverage and can under-cover hard prompts while over-covering easy ones. Right: our CFC learns a data-dependent threshold via conformal quantile regression, adapting the acceptance level to input features and achieving conditional coverage across subgroups.

Among uncertainty-estimation methods, conformal prediction (CP) is a natural choice for adapting uncertainty control to LLMs, complementing broader efforts on trustworthy uncertainty modeling and multimodal prediction [16, 20, 22-24]. It is model-agnostic and distribution-free; under exchangeability between calibration and test examples, it constructs set-valued predictions that contain the true candidate with probability at least $1-\alpha$ for a user-specified risk level $\alpha$ [20]. Recent work adapts CP to LLMs by constructing sets of sampled responses that aim to contain at least one correct answer with high probability [11, 14, 18]. However, these methods typically rely on a single global threshold on scores, and so they only provide marginal guarantees: coverage holds on average over prompts, not for prompts with particular characteristics. This marginal coverage can hide severe heterogeneity.
Hard prompts (e.g., long math questions or rare entities) may be systematically under-covered, while easy prompts are over-covered, potentially inflating prediction sets unnecessarily (Fig. 1, left). A global threshold is forced to compromise between easy and hard regions of the feature space, leading to miscalibration in subgroups and inefficient use of samples.

Motivation. To address this failure mode, we seek conditional coverage: guarantees that coverage holds not only on average, but also when conditioning on relevant features (or groups) of the prompt. Conditional coverage is strictly stronger than marginal coverage and directly targets reliability in under-represented or systematically hard subpopulations (Fig. 1, right). At the same time, we would like prediction sets to remain compact on average so that sampling-based inference stays computationally practical.

We tackle these challenges with Conditional Factuality Control (CFC), a post-hoc conformal layer for LLM sampling. Rather than using a single scalar threshold, CFC defines a feature-conditional acceptance rule $\hat{\lambda}_\alpha(X)$ by conformalizing a quantile-regression model for a latent success score $S(X)$, namely the best score among correct candidates for prompt $X$. At test time, given a prompt $X$, we sample candidates from the base generator and accept those whose scores satisfy $V(X, y) \le \hat{\lambda}_\alpha(X)$. This requires no finetuning of the base model, and lets the acceptance threshold adapt to prompt difficulty.

Beyond the basic procedure, we develop a PAC-style variant, CFC-PAC, that adds a stability-based finite-sample certificate: with high probability over the draw of the calibration sample, the deployed rule achieves coverage at least $1 - \alpha - \varepsilon_N(\delta)$, where $\varepsilon_N(\delta) = O\big(\sqrt{\log(1/\delta)/N}\big)$. Finally, we study the efficiency of CFC.
Under natural assumptions relating prompt difficulty to the distribution of scores, we show that an oracle conditional rule can attain smaller expected prediction-set size than the marginal CP rule at the same target coverage. Our learned CFC asymptotically inherits this oracle efficiency as the augmented quantile regression becomes consistent.

Our main contributions are:
• We introduce CFC, a conformal procedure for sampled LLM outputs that defines a continuous, feature-conditional acceptance rule via augmented quantile regression on the latent success score and yields prediction sets satisfying the conditional guarantee of [7] under exchangeability.
• We develop CFC-PAC, a certified variant of CFC equipped with a PAC-style generalization bound: with probability at least $1-\delta$ over the calibration sample, the deployed rule achieves coverage at least $1 - \alpha - \varepsilon_N(\delta)$ for an $\varepsilon_N(\delta) = O\big(\sqrt{\log(1/\delta)/N}\big)$.
• We analyze the efficiency of CFC, proving that, under mild monotonicity and concavity assumptions on the score distribution, conditional rules can be strictly more sample-efficient than marginal CP rules at the same coverage level.
• We validate CFC and CFC-PAC on synthetic data, real-world reasoning/QA benchmarks, and a Flickr8k VLM setting, showing that the same post-hoc procedure extends beyond text-only generators without finetuning the base model.

Code availability. Code for reproducing our experiments is available here.

2. Related Work

Inference-time sampling and reranking for LLMs. A common way to improve LLM outputs with extra test-time compute is to sample multiple candidates and rerank or filter them, as in Best-of-$N$ decoding and pass@$N$ evaluation [1, 3, 5]. In practice, candidate quality is often estimated with an external verifier or reward model [5, 8].
Our setting follows this line of work: we treat the base generator as a black box, sample multiple candidates, and use a verifier score to decide which candidates enter the final set.

Conformal prediction and conditional guarantees. Conformal prediction provides distribution-free finite-sample coverage under exchangeability [17, 20]. Split/inductive conformal prediction (ICP) is the standard practical variant, but its guarantee is marginal: it controls coverage only on average over test inputs [13]. Exact conditional coverage is impossible without additional assumptions or relaxations [6, 19]. Recent work therefore studies weaker but useful notions of coverage, including conformalized quantile regression [15] and the function-class conditional framework of Gibbs et al. [7], which learns feature-dependent thresholds through augmented quantile regression. Our method builds on this latter perspective.

Conformal prediction for LLMs. Several recent papers adapt CP to language models by constructing sets of sampled responses that contain at least one correct answer with high probability [4, 11, 12, 14, 18]. Most existing LLM-specific methods use a global acceptance rule and therefore inherit only marginal coverage, which can under-cover hard prompts and over-cover easy ones. Relative to these marginal baselines, CFC replaces the single threshold with a feature-conditional one and is therefore designed to improve subgroup reliability. Compared with prior conditional CP works, our contributions differ in emphasis: we develop conformal factuality control for sampled LLM candidates with verifier scores, provide an efficiency analysis showing when conditional rules are more sample-efficient than marginal ones, introduce the PAC-certified variant CFC-PAC, and demonstrate transfer to a VLM setting in the main experiments. Because CFC is purely post hoc, it requires no generator finetuning and transfers across base models.
3. Preliminaries

3.1. Conformal Factuality

Let $X \in \mathcal{X}$ be a prompt and let $\pi : \mathcal{X} \to \Delta(\mathcal{Y})$ denote a fixed generator over candidate completions. For each prompt, we sample a candidate set $C(X) = \{Y_j\}_{j=1}^{M}$ with $Y_j \sim \pi(\cdot \mid X)$, and evaluate each candidate with a verifier score $V : \mathcal{X} \times \mathcal{Y} \to [0,1]$, normalized so that smaller scores are better. Given a correctness indicator $A(X, y) \in \{0, 1\}$, the conformal factuality goal is to output a set $\hat{C}_\alpha(X)$ satisfying
$$\mathbb{P}\big(\exists\, y \in \hat{C}_\alpha(X_{n+1}) : A(X_{n+1}, y) = 1\big) \ge 1 - \alpha. \quad (3.1)$$
A convenient latent variable for this event is the success score
$$S(X) := \inf\{V(X, y) : y \in C(X),\ A(X, y) = 1\}, \quad (3.2)$$
so that the event that the prediction set contains at least one correct answer is equivalent to $S(X) \le \lambda(X)$ for the deployed threshold rule $\lambda(\cdot)$.

3.2. From Marginal to Conditional Coverage

Split conformal prediction. Standard split conformal prediction calibrates a single global threshold from held-out calibration scores and therefore provides only marginal coverage. Given exchangeable data $\{(X_i, Y_i)\}_{i=1}^{n+1}$ and a nonconformity score $s(x, y)$, it constructs a prediction set $\hat{C}_\alpha(X_{n+1})$ satisfying
$$\mathbb{P}\big(Y_{n+1} \in \hat{C}_\alpha(X_{n+1})\big) \ge 1 - \alpha. \quad (3.3)$$
Using a calibration set $\{(X_i, Y_i)\}_{i=1}^{n}$, one computes calibration scores $R_i = s(X_i, Y_i)$, forms the empirical quantile
$$\hat{q}_{1-\alpha} := R_{(\lceil (n+1)(1-\alpha) \rceil)}, \quad (3.4)$$
and predicts
$$\hat{C}_\alpha(x) := \{y : s(x, y) \le \hat{q}_{1-\alpha}\}. \quad (3.5)$$
In our setting, this corresponds to accepting all sampled candidates with verifier score below one constant cutoff. Such a rule can over-cover easy prompts and under-cover hard prompts, because the same acceptance level must serve the entire prompt distribution.

Conditional conformal prediction. The limitation of marginal CP is that it averages over the prompt distribution.
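Before moving to the conditional objective, the marginal recipe in Eqs. (3.3)-(3.5) can be made concrete in a few lines. This is an illustrative sketch, not the paper's code: the calibration scores and candidate scores below are hypothetical placeholders drawn at random.

```python
import math
import random

random.seed(0)
alpha = 0.10
n = 1000

# Hypothetical calibration scores R_i = s(X_i, Y_i), here drawn uniformly.
R = sorted(random.random() for _ in range(n))

# Eq. (3.4): the ceil((n+1)(1-alpha))-th smallest calibration score.
k = math.ceil((n + 1) * (1 - alpha))   # 1-indexed rank
q_hat = R[min(k, n) - 1]

def prediction_set(candidate_scores, q):
    """Eq. (3.5): accept every candidate whose score is at most the cutoff."""
    return [j for j, s in enumerate(candidate_scores) if s <= q]

# One test prompt with five sampled candidates and hypothetical verifier scores.
accepted = prediction_set([0.05, 0.40, 0.92, 0.13, 0.88], q_hat)
```

The point of the sketch is the one that motivates this paper: `q_hat` is a single constant applied to every prompt, which is exactly what the conditional construction below replaces.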
A stronger objective is pointwise conditional coverage,
$$\mathbb{P}\big(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\big) \ge 1 - \alpha, \quad (3.6)$$
but exact finite-sample conditional coverage is impossible without additional assumptions or relaxations [6, 19]. Following Gibbs et al. [7], we instead target function-class conditional coverage:
$$\mathbb{E}\Big[f(X_{n+1})\big(\mathbf{1}\{\exists\, y \in \hat{C}(X_{n+1}) : A(X_{n+1}, y) = 1\} - (1 - \alpha)\big)\Big] = 0 \quad (3.7)$$
for all $f$ in a chosen class $\mathcal{F}$. When $\mathcal{F} = \{1\}$, this reduces to marginal conformal prediction; when $\mathcal{F} = \{\Phi(\cdot)^\top \beta : \beta \in \mathbb{R}^d\}$, it yields a feature-conditional guarantee over the basis $\Phi$.

CFC instantiates this framework for conformal factuality by learning a feature-dependent quantile of the latent success score $S(X)$ rather than a single global cutoff. The main paper focuses on the procedure and its guarantees. Appendix A gives detailed background on ICP and on the work of Gibbs et al. [7].

4. Method

We now present CFC, our conditional conformal rule for sampled LLM outputs, and its PAC-style variant CFC-PAC. Appendices A and B contain the extended background, derivations, and proofs.

4.1. CFC: Conditional Factuality Control

Let $\{(X_i, S_i)\}_{i=1}^{N}$ be calibration prompt/score pairs, where $S_i = S(X_i)$ is the latent success score from Sec. 3. For a test prompt $X_{N+1}$, we sample $C_{N+1} = \{Y_{N+1,j}\}_{j=1}^{M}$ from $\pi(\cdot \mid X_{N+1})$ and deploy a prompt-dependent threshold on verifier scores:
$$\hat{C}_\alpha(X_{N+1}; \lambda) := \{y \in C_{N+1} : V(X_{N+1}, y) \le \lambda\}. \quad (4.1)$$
To learn this threshold, we follow the augmented quantile-regression construction of [7]. For a candidate test-time score $s \in [0, 1]$, define
$$\beta_s = \arg\min_{\beta \in \mathbb{R}^d} \Bigg[\frac{1}{N+1}\sum_{i=1}^{N} \rho_{1-\alpha}\big(S_i - \Phi(X_i)^\top \beta\big) + \frac{1}{N+1}\,\rho_{1-\alpha}\big(s - \Phi(X_{N+1})^\top \beta\big)\Bigg], \quad (4.2)$$
where $\Phi(X)$ is the chosen feature map and $\rho_{1-\alpha}(u) = u\,(1 - \alpha - \mathbf{1}\{u < 0\})$ is the pinball loss. The induced test-time map is $g_{X_{N+1}}(s) := \Phi(X_{N+1})^\top \beta_s$.
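To make the augmented objective in Eq. (4.2) and the map $g_x(s)$ tangible, here is a self-contained numerical sketch on synthetic data. It fits the pinball objective by plain subgradient descent and evaluates the fixed-point condition $s \le g_x(s)$ on a grid; as noted below, actual deployment computes the fixed point directly rather than re-solving the program on a grid, so this is a didactic illustration, and the data-generating process, linear basis, and solver settings are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.10

# Synthetic calibration pairs: difficulty t in [0, 1] and a success
# score S whose conditional (1 - alpha)-quantile grows with t.
N = 500
t_cal = rng.uniform(0.0, 1.0, N)
S_cal = np.clip(0.2 + 0.5 * t_cal + 0.1 * rng.standard_normal(N), 0.0, 1.0)

def features(t):
    """Phi(X) = [1, T(X)]: a linear basis in the difficulty proxy."""
    return np.stack([np.ones_like(t), t], axis=-1)

def fit_beta(Phi, y, tau, steps=800):
    """Subgradient descent on the mean pinball objective (Eq. 4.2)."""
    beta = np.zeros(Phi.shape[1])
    for k in range(steps):
        r = y - Phi @ beta
        grad = -Phi.T @ (tau - (r < 0)) / len(y)
        beta -= 0.5 / np.sqrt(k + 1.0) * grad   # decaying step size
    return beta

def cfc_threshold(t_test, grid=np.linspace(0.0, 1.0, 51)):
    """Largest grid point s with s <= g_x(s), approximating Eq. (4.3)."""
    phi_x = np.array([1.0, t_test])
    Phi_aug = np.vstack([features(t_cal), phi_x])   # N + 1 rows
    best = 0.0
    for s in grid:
        y_aug = np.concatenate([S_cal, [s]])        # augmented response
        beta_s = fit_beta(Phi_aug, y_aug, 1.0 - alpha)
        if s <= phi_x @ beta_s:                     # fixed-point condition
            best = s
    return best

lam_easy = cfc_threshold(0.1)   # easy prompt: stricter (smaller) threshold
lam_hard = cfc_threshold(0.9)   # hard prompt: looser (larger) threshold
```

In this toy setting the learned threshold tracks the conditional $(1-\alpha)$-quantile of $S \mid T$, so harder prompts receive a looser acceptance cutoff, which is the qualitative behavior Figure 3 reports for the real method.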
Algorithm 1: CFC Inference
Require: Calibration pairs $\{(X_i, S_i)\}_{i=1}^{N}$, test prompt $x$, generator $\pi$, verifier $V$, feature map $\Phi$, nominal significance $\alpha$, sample budget $M$.
1: Draw $C(x) = \{Y_j\}_{j=1}^{M}$ with $Y_j \sim \pi(\cdot \mid x)$.
2: Let $\rho_{1-\alpha}(u) = u\,(1 - \alpha - \mathbf{1}\{u < 0\})$.
3: For $s \in [0, 1]$, define
4: $\beta_s = \arg\min_{\beta \in \mathbb{R}^d} \Big[\frac{1}{N+1}\sum_{i=1}^{N} \rho_{1-\alpha}\big(S_i - \Phi(X_i)^\top \beta\big) + \frac{1}{N+1}\,\rho_{1-\alpha}\big(s - \Phi(x)^\top \beta\big)\Big]$.
5: Define $g_x(s) = \Phi(x)^\top \beta_s$.
6: Compute $\hat{\lambda}_\alpha(x) = \sup\{s \in [0, 1] : s \le g_x(s)\}$.
7: Set $\hat{C}_\alpha(x) = \{y \in C(x) : V(x, y) \le \hat{\lambda}_\alpha(x)\}$.
8: return $\hat{\lambda}_\alpha(x)$, $\hat{C}_\alpha(x)$.

We then take the largest fixed point below this map as the deployed threshold:
$$\hat{\lambda}_\alpha(X_{N+1}) := \sup\{s \in [0, 1] : s \le g_{X_{N+1}}(s)\}, \quad (4.3)$$
and return
$$\hat{C}_\alpha(X_{N+1}) := \{y \in C_{N+1} : V(X_{N+1}, y) \le \hat{\lambda}_\alpha(X_{N+1})\}. \quad (4.4)$$
Although Eqs. (4.2)-(4.3) define CFC conceptually through an augmented quantile-regression family, deployment only requires computing the fixed-point threshold in Eq. (4.3); it does not require repeatedly solving Eq. (4.2) on a grid of candidate scores (see Gibbs et al. [7]).

Theorem 4.1 (Conditional coverage of CFC). Let $\mathcal{F} = \{\Phi(X)^\top \beta : \beta \in \mathbb{R}^d\}$ be any finite-dimensional linear class, and assume exchangeability. Then for any non-negative $f \in \mathcal{F}$ with $\mathbb{E}[f(X)] > 0$, the prediction set in Eq. (4.4) satisfies
$$\mathbb{P}_f\big(\exists\, y \in \hat{C}_\alpha(X_{N+1}) : A(X_{N+1}, y) = 1\big) \ge 1 - \alpha. \quad (4.5)$$

4.2. CFC-PAC: A High-Probability Certificate

Theorem 4.1 is an expectation-level guarantee over the calibration draw. To certify the deployed rule itself, we add ridge regularization to Eq. (4.2) and shrink the nominal target by a stability slack.

Assumptions. We assume: (1) bounded features $\|\Phi(X)\|_2 \le R$ almost surely; (2) ridge regularization $\frac{\lambda}{2}\|\beta\|_2^2$ with $\lambda > 0$ in the augmented quantile-regression objective; and (3) the conditional CDF $F_{S \mid X = x}(t)$ is $L$-Lipschitz in $t$ on $[0, 1]$.
Exchangeability is as in Theorem 4.1.

Algorithm 2: CFC-PAC Inference
Require: Calibration pairs $\{(X_i, S_i)\}_{i=1}^{N}$, test prompt $x$, generator $\pi$, verifier $V$, feature map $\Phi$, nominal significance $\alpha$, confidence $\delta$, ridge parameter $\lambda$, sample budget $M$.
1: Compute the PAC slack $\varepsilon_N(\delta)$ from the PAC certificate.
2: Set the effective target $\alpha_{\mathrm{eff}} = \max\{0, \alpha - \varepsilon_N(\delta)\}$.
3: As in Algorithm 1, run CFC with target $\alpha_{\mathrm{eff}}$, replacing the augmented objective by the ridge-regularized version below:
4: $\beta_s = \arg\min_{\beta \in \mathbb{R}^d} \Big[\frac{1}{N+1}\sum_{i=1}^{N} \rho_{1-\alpha_{\mathrm{eff}}}\big(S_i - \Phi(X_i)^\top \beta\big) + \frac{1}{N+1}\,\rho_{1-\alpha_{\mathrm{eff}}}\big(s - \Phi(x)^\top \beta\big) + \frac{\lambda}{2}\|\beta\|_2^2\Big]$.
5: This produces $\hat{\lambda}_{\alpha_{\mathrm{eff}}}(x)$ and $\hat{C}_{\alpha_{\mathrm{eff}}}(x)$.
6: return $\hat{\lambda}_{\alpha_{\mathrm{eff}}}(x)$, $\hat{C}_{\alpha_{\mathrm{eff}}}(x)$.

Theorem 4.2 (PAC conditional coverage for CFC). Assume $\alpha \ge \varepsilon_N(\delta)$ and define $\alpha_{\mathrm{eff}} := \alpha - \varepsilon_N(\delta)$. Let $\hat{\lambda}_{\alpha_{\mathrm{eff}}}(\cdot)$ be the threshold learned from Algorithm 2. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the calibration sample,
$$\mathbb{P}\big(S \le \hat{\lambda}_{\alpha_{\mathrm{eff}}}(X) \mid \mathcal{D}_{\mathrm{cal}}\big) \ge 1 - \alpha_{\mathrm{eff}} - \varepsilon_N(\delta) = 1 - \alpha, \quad \text{where } \varepsilon_N(\delta) = O\Big(\sqrt{\tfrac{\log(1/\delta)}{N}}\Big).$$
Equivalently, with the same probability,
$$\mathbb{P}\big(\exists\, y \in \hat{C}_{\alpha_{\mathrm{eff}}}(X) : A(X, y) = 1 \mid \mathcal{D}_{\mathrm{cal}}\big) \ge 1 - \alpha.$$

4.3. Efficiency Analysis

Beyond coverage, we ask whether conditioning can reduce average prediction-set size. Let
$$G_X(\lambda) := \mathbb{P}\big(V(X, Y) \le \lambda \mid X\big), \quad Y \sim \pi(\cdot \mid X),$$
be the score CDF under sampling at prompt $X$. If we draw $M$ candidates and retain those with score at most $\lambda$, then $\mathbb{E}\big[|\hat{C}(X)| \mid X\big] = M\, G_X(\lambda)$. Thus efficiency reduces to comparing $\mathbb{E}[G_X(\lambda(X))]$ across threshold rules. Marginal CP uses a constant threshold $\bar{\lambda}_\alpha$ satisfying $\mathbb{P}(S \le \bar{\lambda}_\alpha) \approx 1 - \alpha$. In contrast, an oracle conditional rule uses the conditional quantile $q_\alpha(X)$ of $S \mid X$ and sets $\lambda^\star(X) = q_\alpha(X)$. Let $T = \psi(X)$ be a scalar difficulty variable.
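The identity $\mathbb{E}[|\hat{C}(X)|\mid X] = M\, G_X(\lambda)$ makes the efficiency comparison easy to simulate. The sketch below uses a toy model of our own choosing (uniform success and verifier scores with a difficulty-dependent scale, not the paper's synthetic benchmark) in which both the marginal cutoff and the oracle conditional rule achieve roughly $1-\alpha$ coverage, yet the conditional rule yields a smaller expected set size.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, M, n = 0.10, 50, 200_000

T = rng.uniform(0.0, 1.0, n)        # scalar prompt difficulty
a = 0.2 + 0.8 * T                   # score scale grows with difficulty
S = rng.uniform(0.0, 1.0, n) * a    # success score: S | T ~ Unif(0, a(T))

# Marginal CP: one constant cutoff with P(S <= lam_bar) ~= 1 - alpha.
lam_bar = float(np.quantile(S, 1.0 - alpha))

# Oracle conditional rule: the (1 - alpha)-quantile of S | T, known in
# closed form for this toy model.
lam_star = (1.0 - alpha) * a

# With V | T ~ Unif(0, 1), G_T(lam) = min(lam, 1), so the expected
# accepted-set size under a rule lam(.) is M * E[G_T(lam(T))].
size_marginal = M * min(lam_bar, 1.0)
size_conditional = M * float(np.minimum(lam_star, 1.0).mean())

cov_marginal = float((S <= lam_bar).mean())
cov_conditional = float((S <= lam_star).mean())
```

Both rules cover about 90% of prompts, but the constant cutoff must be large enough for the hard prompts and therefore over-accepts on easy ones, inflating `size_marginal` relative to `size_conditional`. This is the mechanism the formal analysis below makes precise.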
For each $t$, define
$$F_t(\lambda) := \mathbb{P}(S \le \lambda \mid T = t), \qquad G_t(\lambda) := \mathbb{P}(V \le \lambda \mid T = t),$$
and
$$C_t(u) := G_t\big(F_t^{-1}(u)\big), \quad u \in [0, 1], \qquad \text{where } F_t^{-1}(u) := \inf\{\lambda \in [0, 1] : F_t(\lambda) \ge u\}.$$

Assumptions. (1) For each $t$, $F_t$ is continuous and strictly increasing on $[0, 1]$. (2) For each fixed $\lambda \in [0, 1]$, the map $t \mapsto F_t(\lambda)$ is nonincreasing. (3) For each $t$, the map $u \mapsto C_t(u)$ is convex and differentiable on $(0, 1)$. (4) $C$ has decreasing differences in $(u, t)$, i.e., for every $u \in (0, 1)$, $t \mapsto \partial_u C_t(u)$ is nonincreasing. Under Assumptions 1-2, the conditional $(1-\alpha)$-quantile $q_\alpha(t) := F_t^{-1}(1-\alpha)$ is automatically nondecreasing in $t$.

Proposition 4.3 (Oracle CFC efficiency). Let $\lambda^\star(X) := q_\alpha(X) = F_X^{-1}(1-\alpha)$. Under the assumptions above,
$$\mathbb{E}\big[G_X(\lambda^\star(X))\big] \le \mathbb{E}\big[G_X(\bar{\lambda})\big]$$
for any constant $\bar{\lambda}$ satisfying $\mathbb{P}(S \le \bar{\lambda}) = 1 - \alpha$. In particular, $\mathbb{E}[G_X(\lambda^\star(X))] \le \mathbb{E}[G_X(\bar{\lambda}_\alpha)]$. If, in addition, $u \mapsto C_t(u)$ is strictly convex for almost every $t$ and $\mathbb{P}\big(q_\alpha(X) \neq \bar{\lambda}_\alpha\big) > 0$, then the inequality is strict.

Theorem 4.4 (CFC inherits oracle efficiency). Assume the conditions of Proposition 4.3, and let $\lambda^\star(X) := q_\alpha(\psi(X))$. Let $\hat{\lambda}_{\alpha,N}(\cdot)$ denote the CFC threshold learned from a calibration sample of size $N$, and suppose that
$$\sup_{x \in \mathcal{X}} \big|\hat{\lambda}_{\alpha,N}(x) - \lambda^\star(x)\big| \xrightarrow{\ p\ } 0 \quad \text{as } N \to \infty.$$
Then
$$\lim_{N \to \infty} \mathbb{E}\big[G_X(\hat{\lambda}_{\alpha,N}(X))\big] = \mathbb{E}\big[G_X(\lambda^\star(X))\big] \le \mathbb{E}\big[G_X(\bar{\lambda}_\alpha)\big].$$
Consequently, for fixed $M$,
$$\lim_{N \to \infty} \mathbb{E}\big[|\hat{C}_{\alpha,N}(X)|\big] = M\,\mathbb{E}\big[G_X(\lambda^\star(X))\big] \le M\,\mathbb{E}\big[G_X(\bar{\lambda}_\alpha)\big].$$
If, in addition, the strictness conditions in Proposition 4.3 hold, then $\lim_{N \to \infty} \mathbb{E}[G_X(\hat{\lambda}_{\alpha,N}(X))] < \mathbb{E}[G_X(\bar{\lambda}_\alpha)]$, and hence $\lim_{N \to \infty} \mathbb{E}[|\hat{C}_{\alpha,N}(X)|] < M\,\mathbb{E}[G_X(\bar{\lambda}_\alpha)]$.

Figure 2. Groupwise miscoverage on synthetic data across 10 difficulty bins for TopK, ICP, Learnt CP, CFC, and CFC-PAC. The dashed line marks the target miscoverage $\alpha = 0.10$.
Learnt CP improves over marginal baselines, but CFC and CFC-P remain closest to the target across all bins, especially on hard prompts.

5. Experiments

5.1. Synthetic Data

Setup. We first study a controlled synthetic setting with a scalar prompt-difficulty variable $T \in [0, 1]$. For each prompt we draw $M$ candidates; the probability of correctness decreases with difficulty, and verifier scores are sampled from a known difficulty-dependent distribution. Unless otherwise stated, we use target error $\alpha = 0.10$, calibration size $N_{\mathrm{cal}} = 10{,}000$, test size $N_{\mathrm{test}} = 10{,}000$, and $M = 50$. We report mean ± standard deviation over random seeds.

Baselines and metrics. We compare against three baselines. TopK keeps the smallest top-$K$ prefix whose calibration coverage reaches the target level. ICP [20] calibrates a single global verifier threshold. Learnt CP fits a feature-conditional threshold from calibration data but omits the exact conformal correction. We report empirical coverage rate (ECR), average prediction set size (APSS, lower is better), and group-stratified coverage (GSC), the minimum empirical coverage over difficulty groups. On synthetic data, ECR and GSC are computed using the ground-truth correctness labels, and APSS is the average accepted set size.

Results. Figure 2 shows the main phenomenon motivating CFC: a single global threshold under-covers hard prompts and over-covers easy ones, while conditional thresholds track the target miscoverage across the entire difficulty range. Learnt CP narrows this gap, but it still fails to match the subgroup reliability of CFC/CFC-P on the hardest bins. CFC achieves the smallest average prediction sets at the target level, and CFC-P raises the worst-group coverage floor further with only a modest set-size increase. This shows that the gain does not come merely from learning a better global score; it comes from feature-conditional conformalization.

Figure 3. Learned threshold $\hat{\lambda}_\alpha(X)$ versus prompt difficulty. Easy prompts receive stricter thresholds, while harder prompts receive looser thresholds, which is the mechanism behind the improved group-wise reliability of CFC.

Threshold adaptation. Figure 3 visualizes the learned threshold itself. Relative to the single ICP cutoff, CFC assigns tighter thresholds to easy prompts and looser thresholds to hard prompts, which is exactly the behavior needed to correct the systematic under-coverage of global-threshold baselines on difficult inputs.

Table 1. Synthetic-data results at $\alpha = 0.10$. Learnt CP uses a learned difficulty-conditional threshold but no exact conformal correction, isolating learning from feature-conditional conformalization.

Method | ECR | APSS ↓ | GSC ↑
TopK | 90.6 ± 0.1 | 16.00 ± 0.00 | 58.2 ± 1.3
ICP | 90.2 ± 0.2 | 16.71 ± 0.23 | 57.4 ± 1.4
Learnt CP | 90.2 ± 0.3 | 15.72 ± 0.15 | 84.3 ± 0.5
CFC (ours) | 90.3 ± 0.5 | 15.53 ± 0.12 | 88.7 ± 0.7
CFC-P (ours) | 90.8 ± 0.6 | 15.87 ± 0.16 | 89.1 ± 0.5

Table 1 shows that learning a stronger difficulty-aware threshold already helps substantially, but it still fails to match the subgroup reliability of CFC/CFC-P. This isolates the main empirical point of the method: the gains are not explained by better score fitting alone, but by the exact conditional conformal correction on top of that learned threshold.

Additional Sweeps and Ablations. Appendix C reports the complete synthetic sweep across target error rates $\alpha$ together with additional sensitivity analyses over calibration size, sampling budget, and group granularity.

5.2. Real-World Data

Datasets. We evaluate on three settings. GSM8K [5] contains grade-school math word problems. TriviaQA [10] is an open-domain question-answering benchmark. To test transfer beyond text-only generators, we also include a Flickr8k vision-language experiment with Qwen2-VL-7B-Instruct as the base model.
We use Llama-3-8B-Instruct for GSM8K and TriviaQA.

Baselines and metrics. On real data we compare three baselines against the CFC variants. TopK keeps the smallest top-$K$ prefix whose calibration coverage reaches the target level. ICP calibrates a single global verifier threshold. Learnt CP learns a feature-conditional threshold from calibration data but omits the exact conformal correction. We again report ECR, APSS, and GSC. For ECR, the preferred value is the one closest to the target coverage $1 - \alpha$. APSS can fall below 1 because the method may abstain and return the empty set on some prompts when no sampled candidate passes the calibrated threshold.

Implementation details. Unless otherwise stated, we use Llama-3-8B-Instruct as the base LLM with nucleus sampling (top-$p = 0.8$, temperature $= 0.7$) and draw up to $M = 20$ samples per prompt. On reasoning tasks we use a separate verifier to compute $V(X, y)$, e.g., Qwen2.5-Math-RM-72B (and Qwen2-VL-7B-Instruct for VLM tasks). On GSM8K, the strongest setting uses the first 5 sampled candidates per prompt, defines $T(X)$ as the mean verifier loss across those samples, and uses the quadratic basis $\Phi(X) = [1, T(X), T(X)^2]$. On Flickr8k, the strongest setting keeps up to 2 cached candidates per image, defines $T(X)$ as the mean verifier loss across those candidates, and uses the same quadratic basis $\Phi(X) = [1, T(X), T(X)^2]$. On TriviaQA, the strongest setting uses a calibration-defined feature map built from answer-distribution entropy and verifier loss; Appendix C gives the exact construction. In the main paper, GSM8K, TriviaQA, and Flickr8k report CFC and CFC-P-F. CFC truncates the accepted set after the best accepted candidate in sample order, while CFC-P-F applies the stability-based PAC adjustment to the full thresholded set.
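The quadratic difficulty basis used on GSM8K and Flickr8k is simple enough to state in code. The sketch below is a minimal illustration of that construction; the per-candidate loss values are hypothetical placeholders, whereas the real pipeline obtains them from the reward-model verifier.

```python
def difficulty(verifier_losses, k=5):
    """T(X): mean verifier loss over the first k sampled candidates."""
    head = verifier_losses[:k]
    return sum(head) / len(head)

def phi(verifier_losses, k=5):
    """Quadratic basis Phi(X) = [1, T(X), T(X)^2]."""
    t = difficulty(verifier_losses, k)
    return [1.0, t, t * t]

# Hypothetical verifier losses for one prompt's candidates; only the
# first five enter the difficulty proxy (k = 5, as on GSM8K).
features = phi([0.1, 0.3, 0.2, 0.4, 0.5, 0.9])
```

Flickr8k uses the same basis with $k = 2$ cached candidates, so switching settings only changes the `k` argument; the TriviaQA feature map is more involved and is deferred to Appendix C.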
Appendix C includes the full TriviaQA, GSM8K, and Flickr8k sweeps with all four CFC variants. For GSM8K and Flickr8k, GSC is computed over five equal-frequency bins of the scalar difficulty proxy; for TriviaQA, GSC uses the two-group feature map, whose exact definition is given in Appendix C.

Figure 4. Groupwise miscoverage on real-world datasets at representative target errors: TriviaQA $\alpha = 0.25$, GSM8K $\alpha = 0.10$, and Flickr8k $\alpha = 0.03$, comparing TopK, ICP, Learnt CP, CFC, and CFC-P-F. Bars show mean miscoverage over split seeds and the upper error bars show one standard deviation. The TriviaQA panel uses the chosen two-group feature map; Appendix C gives its exact construction. The GSM8K and Flickr8k panels use five equal-frequency difficulty groups ordered from easy to hard. The dashed line marks the target miscoverage $\alpha$. Across all three datasets, the conditional methods flatten the miscoverage profile relative to marginal baselines, especially on the hardest groups.

Table 2. TriviaQA results for $\alpha \in \{0.20, 0.25, 0.30, 0.35\}$ under the chosen calibration-defined feature map. APSS below 1 indicates abstention on some prompts. ECR and GSC are reported in percent; for ECR, values closest to the target coverage $1 - \alpha$ are preferred. Appendix C reports the full experiments with additional details.

Methods | α = 0.20: ECR | GSC ↑ | APSS ↓ | α = 0.25: ECR | GSC ↑ | APSS ↓
TopK | 81.6 ± 0.3 | 69.7 ± 1.5 | 1.75 ± 0.00 | 73.4 ± 0.3 | 55.9 ± 1.6 | 1.00 ± 0.00
ICP | 80.0 ± 0.5 | 65.1 ± 1.5 | 1.43 ± 0.02 | 74.9 ± 0.3 | 56.7 ± 1.9 | 1.08 ± 0.01
Learnt CP | 79.6 ± 0.5 | 76.3 ± 1.5 | 1.69 ± 0.04 | 74.7 ± 0.4 | 74.0 ± 1.1 | 1.22 ± 0.03
CFC (ours) | 76.2 ± 0.5 | 65.4 ± 1.7 | 1.21 ± 0.01 | 72.7 ± 0.4 | 65.2 ± 1.8 | 1.03 ± 0.03
CFC-P-F (ours) | 80.1 ± 0.5 | 76.3 ± 1.6 | 1.72 ± 0.04 | 75.3 ± 0.4 | 74.6 ± 1.0 | 1.32 ± 0.10

Methods | α = 0.30: ECR | GSC ↑ | APSS ↓ | α = 0.35: ECR | GSC ↑ | APSS ↓
TopK | 73.4 ± 0.3 | 55.9 ± 1.6 | 1.00 ± 0.00 | 73.4 ± 0.3 | 55.9 ± 1.6 | 1.00 ± 0.00
ICP | 69.5 ± 0.8 | 49.3 ± 2.7 | 0.90 ± 0.02 | 64.5 ± 0.6 | 43.0 ± 2.1 | 0.78 ± 0.01
Learnt CP | 69.5 ± 0.7 | 68.7 ± 1.3 | 0.97 ± 0.02 | 64.6 ± 0.7 | 63.0 ± 2.1 | 0.82 ± 0.02
CFC (ours) | 68.4 ± 0.7 | 62.8 ± 2.0 | 0.88 ± 0.01 | 63.9 ± 0.7 | 59.2 ± 2.3 | 0.78 ± 0.01
CFC-P-F (ours) | 70.0 ± 0.8 | 69.2 ± 1.3 | 0.99 ± 0.03 | 65.1 ± 0.8 | 63.7 ± 2.2 | 0.83 ± 0.02

Table 3. TriviaQA feature-map ablation at $\alpha = 0.30$ (target coverage 70%). Each cell reports ECR/GSC/APSS. TopK and ICP are unchanged across feature maps and remain fixed at 73.4/55.9/1.00 and 69.5/49.3/0.90, respectively.

Setting | CFC | CFC-P-F
Entropy-linear Φ | 66.9 / 45.1 / 1.02 | 70.6 / 53.2 / 1.32
Max-loss-linear Φ | 67.6 / 56.5 / 0.99 | 70.7 / 57.4 / 1.26
Chosen Φ | 68.4 / 62.8 / 0.88 | 70.0 / 69.2 / 0.99

All variants are purely post-hoc: they do not finetune the base generator or verifier.

Groupwise behavior. Figure 4 complements Tables 2-6 by visualizing the same subgroup effect behind the real-data tables: a single global cutoff tends to under-cover the hardest inputs, while conditional thresholds flatten the miscoverage profile. This pattern is strongest on GSM8K, where the gain is concentrated on the hardest bins, and it remains visible on TriviaQA and Flickr8k as well.

Table 4. GSM8K results at $\alpha = 0.05$ using the first five sampled candidates per prompt and the quadratic basis $\Phi(X) = [1, T(X), T(X)^2]$ with $T(X)$ equal to the mean verifier loss. ECR and GSC are reported in percent; for ECR, values closest to the target coverage $1 - \alpha$ are preferred. Appendix C reports the full sweep.

Method | ECR | GSC ↑ | APSS ↓
TopK | 96.42 ± 0.39 | 86.52 ± 1.11 | 1.00 ± 0.00
ICP | 95.09 ± 1.42 | 79.85 ± 6.53 | 4.73 ± 0.09
Learnt CP | 94.91 ± 1.03 | 88.48 ± 3.05 | 4.01 ± 0.98
CFC (ours) | 94.82 ± 0.97 | 88.48 ± 2.32 | 2.35 ± 0.43
CFC-P-F (ours) | 95.24 ± 1.40 | 88.79 ± 3.01 | 4.59 ± 0.62

Table 5. GSM8K sample-budget ablation at $\alpha = 0.10$ under the same mean-loss quadratic rule. The larger budget gives only a small subgroup-coverage gain, but it substantially inflates APSS.

Method | N = 5: ECR | GSC ↑ | APSS ↓ | N = 20: ECR | GSC ↑ | APSS ↓
CFC (ours) | 90.18 | 86.36 | 1.49 | 90.30 | 87.73 | 3.79
CFC-P-F (ours) | 91.91 | 88.48 | 2.34 | 92.06 | 88.79 | 7.97

Table 6. VLM experiment on Flickr8k with Qwen2-VL-7B-Instruct at $\alpha = 0.03$ using up to two cached candidates per image and the quadratic basis $\Phi(X) = [1, T(X), T(X)^2]$ with $T(X)$ equal to the mean verifier loss. ECR and GSC are reported in percent; for ECR, values closest to the target coverage $1 - \alpha = 97\%$ are preferred.

Method | ECR | APSS ↓ | GSC ↑
TopK | 96.37 ± 0.17 | 1.00 ± 0.00 | 93.23 ± 0.47
ICP | 95.58 ± 0.54 | 1.84 ± 0.01 | 85.21 ± 3.14
Learnt CP | 96.16 ± 0.17 | 1.22 ± 0.07 | 94.48 ± 0.26
CFC (ours) | 95.81 ± 0.38 | 0.99 ± 0.00 | 93.23 ± 0.47
CFC-P-F (ours) | 97.27 ± 0.21 | 1.42 ± 0.07 | 95.21 ± 0.77

Results for TriviaQA. Table 2 shows that the chosen TriviaQA feature map gives the strongest overall tradeoff we found, and the left panel of Figure 4 shows why: the conditional rule mainly corrects the hard subset rather than trying to smooth every prompt equally. Learnt CP already improves subgroup reliability substantially over ICP, but it does so with larger prediction sets than CFC. At $\alpha = 0.30$ (target coverage 70%), CFC-P-F is closest to target at 70.0%, while CFC attains the smallest prediction sets at 0.88 and still raises GSC from 49.3% (ICP) to 62.8%.
The same division of labor is visible at $\alpha = 0.20$, $\alpha = 0.25$, and $\alpha = 0.35$: the PAC variant is the most target-calibrated among our methods, while CFC is the most size-efficient.

Feature-map ablation. Table 3 compares simple TriviaQA feature maps at $\alpha = 0.30$. The chosen $\Phi$ yields the best subgroup reliability and the smallest APSS for CFC, and it also gives the strongest GSC for CFC-P-F while keeping target fit competitive. This is the main point of the ablation: the gains are not tied to a single scalar proxy, but the chosen feature map gives the best overall balance among target fit, subgroup reliability, and set size.

Results for GSM8K. Table 4, Table 5, and the middle panel of Figure 4 show that on GSM8K a small candidate budget together with a smooth loss-based feature map gives the best tradeoff we found. At $\alpha = 0.05$ (target coverage 95%), all methods are reasonably close to target, but the conditional methods materially improve subgroup behavior over ICP: CFC reduces APSS from 4.73 to 2.35 while lifting GSC from 79.85% to 88.48%, and CFC-P-F raises the subgroup floor further to 88.79%. Table 5 explains why the main paper keeps the smaller $N = 5$ budget: moving to $N = 20$ barely changes ECR or GSC, but it more than doubles APSS for CFC and more than triples it for CFC-P-F. Appendix C reports the complete GSM8K sweep.

Results for Flickr8k. Table 6 and the right panel of Figure 4 show that the same post-hoc conformal layer transfers to a vision-language model and a different verifier. At the target coverage level $1 - \alpha = 97\%$, CFC-P-F is closest to target at 97.27% while achieving the strongest subgroup reliability at 95.21% GSC. CFC is the most size-efficient variant at 0.99 APSS, but on this easy benchmark it often collapses to a single caption and therefore gives up more target fit than the PAC rule.
This still supports the transfer claim: the same conditional calibration layer remains effective when the base model is QWEN2-VL-7B-INSTRUCT, and Appendix C shows the full small-α sweep together with a compact setting ablation.

Additional Sweeps and Ablations. Appendix C reports the full TriviaQA, GSM8K, and Flickr8k target-risk sweeps, other CFC variants, and compact ablations for the GSM8K budget choice and the Flickr8k setting choice.

6. Conclusion

Conditional Factuality Control replaces a single global factuality threshold with a feature-conditional one, and our results show that this post-hoc conformal layer improves subgroup reliability and often yields a better coverage–set-size tradeoff than marginal baselines across synthetic and real LLM/VLM settings.

Acknowledgements

This work was supported in part by the National Institutes of Health (NIH) under Grants R01HL173186 and R01HL177813, and by the National Science Foundation (NSF) under Grant No. 2306545.

References

[1] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
[2] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[4] John Cherian, Isaac Gibbs, and Emmanuel Candès. Large language model validity via enhanced conformal prediction methods. Advances in Neural Information Processing Systems, 37:114812–114842, 2024.
[5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[6] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.
[7] Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès. Conformal prediction with conditional guarantees, 2024.
[8] Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-STaR: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024.
[9] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
[10] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.
[11] Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
[12] Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, pages 36029–36047, 2024.
[13] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002, 13th European Conference on Machine Learning, Helsinki, Finland, August 19–23, 2002, Proceedings, pages 345–356. Springer, 2002.
[14] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In The Twelfth International Conference on Learning Representations, 2024.
[15] Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.
[16] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems, 31, 2018.
[17] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421, 2008.
[18] Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. API is enough: Conformal prediction for large language models without logit-access. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 979–995, Miami, Florida, USA, 2024. Association for Computational Linguistics.
[19] Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.
[20] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
[21] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. Survey Certification.
[22] Kai Ye, Haoteng Tang, Siyuan Dai, Lei Guo, Johnny Yuehan Liu, Yalin Wang, Alex D. Leow, Paul M. Thompson, Heng Huang, and Liang Zhan. Bidirectional mapping with contrastive learning on multimodal neuroimaging data. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, pages 138–148. Springer, 2023.
[23] Kai Ye, Tiejin Chen, Hua Wei, and Liang Zhan. Uncertainty regularized evidential regression. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16460–16468, 2024.
[24] Kai Ye, Haoteng Tang, Siyuan Dai, Igor Fortel, Paul M. Thompson, R. Scott Mackin, Alex D. Leow, Heng Huang, and Liang Zhan. BPEN: Brain posterior evidential network for trustworthy brain imaging analysis. Neural Networks, 183:106943, 2025.

A. Extended Background

A.1. Conformal Factuality

Let $X \in \mathcal{X}$ be a prompt and let $\pi : \mathcal{X} \to \Delta(\mathcal{Y})$ denote a fixed generator over completions. At inference time, we repeatedly draw candidates $Y \sim \pi(\cdot \mid X)$ and seek a prediction set that contains at least one correct answer with high probability:

$$\mathbb{P}\big(\exists\, y \in \hat{C}_\alpha(X_{N+1}) : A(X_{N+1}, y) = 1\big) \ge 1 - \alpha, \quad (A.1)$$

where $\alpha$ is the target error level and $A(X, y) \in \{0, 1\}$ indicates whether candidate $y$ is correct for prompt $X$. To achieve this guarantee, each sampled candidate is evaluated with a verifier score $V : \mathcal{X} \times \mathcal{Y} \to [0, 1]$. In this paper, smaller verifier scores are better, so a calibrated acceptance rule amounts to choosing a threshold and retaining all candidates with score below it.

A.2. Inductive Conformal Prediction

Split conformal prediction transforms the outputs of a black-box model into valid prediction sets using a held-out calibration set [17, 20]. Given calibration pairs $\{(X_i, Y_i)\}_{i=1}^N$ and a nonconformity score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, one computes calibration scores $V_i = s(X_i, Y_i)$, sorts them, and forms the empirical quantile

$$\hat{Q}_{1-\alpha} := V_{(\lceil (N+1)(1-\alpha) \rceil)}. \quad (A.2)$$

The corresponding split-conformal prediction set is

$$\hat{C}_{1-\alpha}(X_{N+1}) := \{ y \in \mathcal{Y} : s(X_{N+1}, y) \le \hat{Q}_{1-\alpha} \}, \quad (A.3)$$

which satisfies marginal coverage

$$\mathbb{P}\big(Y_{N+1} \in \hat{C}_{1-\alpha}(X_{N+1})\big) \ge 1 - \alpha \quad (A.4)$$

under exchangeability. In the conformal-factuality setting, this corresponds to using one global acceptance threshold for all prompts.

A.3. Conditional Conformal Prediction

Marginal coverage holds only on average over the prompt distribution. For LLMs, this can hide severe heterogeneity: easy prompts may be over-covered while hard prompts are under-covered. The ideal pointwise conditional guarantee

$$\mathbb{P}\big(Y_{N+1} \in \hat{C}(X_{N+1}) \mid X_{N+1} = x\big) \ge 1 - \alpha \quad (A.5)$$

is impossible to achieve exactly in finite samples without strong assumptions [6, 19]. Following Gibbs et al. [7], one can rewrite exact conditional coverage as an infinite family of weighted marginal constraints:

$$\mathbb{P}\big(Y_{N+1} \in \hat{C}(X_{N+1}) \mid X_{N+1}\big) = 1 - \alpha \iff \mathbb{E}\Big[ f(X_{N+1}) \big( \mathbf{1}\{Y_{N+1} \in \hat{C}(X_{N+1})\} - (1 - \alpha) \big) \Big] = 0 \quad (A.6)$$

for all measurable $f$. Their relaxation replaces the class of all measurable functions with a chosen function class $\mathcal{F}$:

$$\mathbb{E}\Big[ f(X_{N+1}) \big( \mathbf{1}\{Y_{N+1} \in \hat{C}(X_{N+1})\} - (1 - \alpha) \big) \Big] = 0 \quad \text{for all } f \in \mathcal{F}. \quad (A.7)$$

Taking $\mathcal{F} = \{1\}$ recovers marginal conformal prediction, while $\mathcal{F} = \{\Phi(\cdot)^\top \beta : \beta \in \mathbb{R}^d\}$ yields a finite-dimensional feature-conditional target. For this linear class, Gibbs et al. define an augmented quantile-regression estimator using the pinball loss

$$\rho_{1-\alpha}(u) = u\big( (1 - \alpha) - \mathbf{1}\{u < 0\} \big). \quad (A.8)$$

Given calibration scores $\{(X_i, S_i)\}_{i=1}^N$ and a fresh candidate score $S$, the augmented fit is

$$\hat{g}_S := \arg\min_{g \in \mathcal{F}} \frac{1}{N+1} \sum_{i=1}^N \rho_{1-\alpha}\big(S_i - g(X_i)\big) + \frac{1}{N+1} \rho_{1-\alpha}\big(S - g(X_{N+1})\big). \quad (A.9)$$

The resulting prediction rule keeps labels whose score does not exceed the fitted value at the same score:

$$\hat{C}(X_{N+1}) := \big\{ y : S(X_{N+1}, y) \le \hat{g}_{S(X_{N+1}, y)}(X_{N+1}) \big\}. \quad (A.10)$$

Theorem A.1 (Gibbs et al. [7], Theorem 2). Let $\mathcal{F} = \{\Phi(\cdot)^\top \beta : \beta \in \mathbb{R}^d\}$ be a linear class over the basis $\Phi : \mathcal{X} \to \mathbb{R}^d$. Then for any non-negative $f \in \mathcal{F}$ with $\mathbb{E}[f(X)] > 0$, the prediction rule above satisfies

$$\mathbb{P}_f\big(Y_{N+1} \in \hat{C}(X_{N+1})\big) \ge 1 - \alpha. \quad (A.11)$$

Our method is a conformal-factuality instantiation of this framework, with the latent success score $S(X)$ replacing the standard label-wise nonconformity score and a fixed-point construction used to obtain the deployed threshold.

B. Proof of Results from the Main Paper

B.1. Proof of Theorem 4.1

Proof. Recall the success score

$$S(X) := \inf\{\lambda \in [0,1] : \ell_\lambda(X) = 0\} = \inf\{V(X, y) : y \in C(X),\ A(X, y) = 1\},$$

so that for any threshold $\lambda(X) \in [0,1]$,

$$\exists\, y \in \hat{C}_\alpha(X) : A(X, y) = 1 \iff S(X) \le \lambda(X). \quad (B.1)$$

Thus, if we show that

$$\mathbb{P}_f\big(S(X_{N+1}) \le \hat{\lambda}_\alpha(X_{N+1})\big) \ge 1 - \alpha \quad (B.2)$$

for every non-negative $f \in \mathcal{F}$ with $\mathbb{E}[f(X)] > 0$, the result follows immediately from (B.1). Throughout, for any such $f$ we define the $f$-reweighted probability of an event $E$ by

$$\mathbb{P}_f(E) := \frac{\mathbb{E}\big[ f(X)\, \mathbf{1}\{(X, S) \in E\} \big]}{\mathbb{E}[f(X)]}.$$

Let $\tau := 1 - \alpha$ and recall the pinball loss $\rho_\tau(u) = u(\tau - \mathbf{1}\{u < 0\})$. Let $\beta_S$ be the augmented quantile-regression minimizer from Eq. (B.5) for the realized calibration set and the fresh pair $(X, S)$, and define $g(S) := \Phi(X)^\top \beta_S$. Then for any nonnegative $f \in \mathcal{F}$ with $\mathbb{E}[f(X)] > 0$,

$$\mathbb{E}\big[ f(X)\, \mathbf{1}\{S \le g(S)\} \big] \ge (1 - \alpha)\, \mathbb{E}[f(X)].$$

Fix $\gamma \in \mathbb{R}^d$ with $f(X) = \Phi(X)^\top \gamma \ge 0$ for all $X$ and consider

$$\psi_N(\varepsilon) = \frac{1}{N+1} \sum_{i=1}^N \rho_{1-\alpha}\big(S_i - \Phi(X_i)^\top(\beta_S + \varepsilon\gamma)\big) + \frac{1}{N+1} \rho_{1-\alpha}\big(S - \Phi(X)^\top(\beta_S + \varepsilon\gamma)\big).$$

By convexity and optimality of $\beta_S$, $\psi_N'(0^+) \ge 0$. This gives

$$\psi_N'(0^+) = -\frac{1}{N+1} \left[ \sum_{i=1}^N \Big( (1-\alpha) - \mathbf{1}\{S_i \le \Phi(X_i)^\top \beta_S\} \Big) f(X_i) + \Big( (1-\alpha) - \mathbf{1}\{S \le \Phi(X)^\top \beta_S\} \Big) f(X) \right] \ge 0.$$

Rearranging,

$$\frac{1}{N+1} \sum_{i=1}^N \mathbf{1}\{S_i \le \Phi(X_i)^\top \beta_S\}\, f(X_i) + \frac{1}{N+1} \mathbf{1}\{S \le \Phi(X)^\top \beta_S\}\, f(X) \ge (1-\alpha) \frac{1}{N+1} \sum_{i=1}^N f(X_i) + (1-\alpha) \frac{1}{N+1} f(X).$$
Now take expectation over all $N+1$ exchangeable draws. By exchangeability, each of the $N+1$ summands on each side has the same distribution, so

$$\mathbb{E}\big[ f(X)\, \mathbf{1}\{S \le \Phi(X)^\top \beta_S\} \big] \ge (1-\alpha)\, \mathbb{E}[f(X)].$$

Since $g(S) = \Phi(X)^\top \beta_S$, the claim follows. Recall the CFC threshold

$$\hat{\lambda}_\alpha(X) := \sup\{ t \in [0,1] : t \le g(t) \}, \qquad g(t) = \Phi(X)^\top \beta_t.$$

By definition of the supremum, for any realized $S \in [0,1]$ we have

$$\mathbf{1}\{S \le g(S)\} \le \mathbf{1}\{S \le \hat{\lambda}_\alpha(X)\}. \quad (B.3)$$

Multiplying (B.3) by $f(X) \ge 0$ and taking expectations,

$$\mathbb{E}\big[ f(X)\, \mathbf{1}\{S \le \hat{\lambda}_\alpha(X)\} \big] \ge \mathbb{E}\big[ f(X)\, \mathbf{1}\{S \le g(S)\} \big] \ge (1-\alpha)\, \mathbb{E}[f(X)].$$

Dividing by $\mathbb{E}[f(X)] > 0$ gives

$$\mathbb{P}_f\big(S \le \hat{\lambda}_\alpha(X)\big) \ge 1 - \alpha.$$

By definition of the success score and the prediction set,

$$\{S(X) \le \hat{\lambda}_\alpha(X)\} \iff \{\exists\, y \in \hat{C}_\alpha(X) : A(X, y) = 1\}.$$

Therefore,

$$\mathbb{P}_f\big(\exists\, y \in \hat{C}_\alpha(X) : A(X, y) = 1\big) = \mathbb{P}_f\big(S(X) \le \hat{\lambda}_\alpha(X)\big) \ge 1 - \alpha,$$

as claimed.

B.2. Proof of Theorem 4.2

Proof. Write the calibration set as $D_{\mathrm{cal}} = \{(X_i, S_i)\}_{i=1}^N$, with $(X_{N+1}, S_{N+1})$ drawn i.i.d. as $(X_i, S_i)$, and recall the success indicator

$$Z(x, s) := \mathbf{1}\{ s \le \hat{\lambda}_\alpha(x) \},$$

so that $Z(X, S) = 1$ iff there exists a correct candidate in $\hat{C}_\alpha(X)$. Define

$$Q(D_{\mathrm{cal}}) := \mathbb{P}\big(Z(X, S) = 1 \mid D_{\mathrm{cal}}\big) = \mathbb{E}\big[Z(X, S) \mid D_{\mathrm{cal}}\big].$$

By Theorem 4.1 applied with $f \equiv 1$ and the law of total expectation, $\mathbb{P}(Z(X, S) = 1) = \mathbb{E}_{D_{\mathrm{cal}}}[Q(D_{\mathrm{cal}})] \ge 1 - \alpha$. Thus

$$\mathbb{E}_{D_{\mathrm{cal}}}\big[Q(D_{\mathrm{cal}})\big] \ge 1 - \alpha. \quad (B.4)$$

We now show that $Q(D_{\mathrm{cal}})$ is Lipschitz in the calibration set, with sensitivity of order $1/N$ to replacing one calibration pair. Let $D = \{(X_i, S_i)\}_{i=1}^N$ and $D' = \{(X_i', S_i')\}_{i=1}^N$ differ only in the $k$-th pair. For a fixed test prompt $x$ and a candidate success score $s \in [0,1]$, consider the ridge-regularized augmented quantile-regression objective

$$\beta \mapsto \frac{1}{N+1} \sum_{i=1}^N \rho_{1-\alpha}\big(S_i - \Phi(X_i)^\top \beta\big) + \frac{1}{N+1} \rho_{1-\alpha}\big(s - \Phi(x)^\top \beta\big) + \frac{\lambda}{2} \|\beta\|_2^2. \quad (B.5)$$

Let $\beta_s(D, x)$ and $\beta_s(D', x)$ denote the unique minimizers of (B.5) with $D$ and $D'$ respectively, and define $g_D(x, s) := \Phi(x)^\top \beta_s(D, x)$ and $g_{D'}(x, s) := \Phi(x)^\top \beta_s(D', x)$. By Assumption (1), $\|\Phi(X_i)\|_2 \le R$ almost surely, and the pinball loss is 1-Lipschitz; hence each sample loss is $R$-Lipschitz in $\beta$. By Assumption (2), adding the ridge term makes the objective (B.5) strongly convex. Standard stability results for regularized ERM imply that replacing one data point perturbs the minimizer by at most

$$\big\| \beta_s(D, x) - \beta_s(D', x) \big\|_2 \le \frac{C_1 R}{\lambda N}$$

for some universal constant $C_1 > 0$, uniformly over $s \in [0,1]$ and $x$. Using Assumption (1) again,

$$\big| g_D(x, s) - g_{D'}(x, s) \big| \le \|\Phi(x)\|_2\, \big\| \beta_s(D, x) - \beta_s(D', x) \big\|_2 \le \frac{C_1 R^2}{\lambda N}.$$

Thus there is a constant $C_2 > 0$ such that

$$\sup_{x \in \mathcal{X}} \sup_{s \in [0,1]} \big| g_D(x, s) - g_{D'}(x, s) \big| \le \frac{C_2}{N}. \quad (B.6)$$

The CFC threshold at $x$ is $\hat{\lambda}^{(D)}_\alpha(x) := \sup\{ s \in [0,1] : s \le g_D(x, s) \}$, and similarly for $\hat{\lambda}^{(D')}_\alpha(x)$. Under the fixed-point construction in Eq. (4.3), this threshold is a Lipschitz functional of $s \mapsto g_D(x, s)$ in the uniform norm: there exists a constant $C_3 > 0$ such that

$$\sup_{x \in \mathcal{X}} \big| \hat{\lambda}^{(D)}_\alpha(x) - \hat{\lambda}^{(D')}_\alpha(x) \big| \le \frac{C_3}{N}. \quad (B.7)$$

For a fresh test pair $(X, S)$,

$$Q(D) = \mathbb{P}\big(S \le \hat{\lambda}^{(D)}_\alpha(X) \mid D\big) = \mathbb{E}\Big[ F_{S \mid X}\big(\hat{\lambda}^{(D)}_\alpha(X)\big) \,\Big|\, D \Big],$$

where $F_{S \mid X = x}(\cdot)$ is the conditional CDF of $S$ given $X = x$. By Assumption (3), for each $x$ the map $t \mapsto F_{S \mid X = x}(t)$ is $L$-Lipschitz on $[0,1]$. Combining this with (B.7) yields

$$\big| Q(D) - Q(D') \big| \le L \sup_{x \in \mathcal{X}} \big| \hat{\lambda}^{(D)}_\alpha(x) - \hat{\lambda}^{(D')}_\alpha(x) \big| \le \frac{C_4}{N},$$

where $C_4 := L C_3$. Thus $Q(D_{\mathrm{cal}})$ satisfies bounded differences with constants $c_i = C_4 / N$ for $i = 1, \dots, N$. By McDiarmid's inequality, for any $\varepsilon > 0$,

$$\mathbb{P}\Big( Q(D_{\mathrm{cal}}) \le \mathbb{E}_{D_{\mathrm{cal}}}[Q(D_{\mathrm{cal}})] - \varepsilon \Big) \le \exp\!\left( -\frac{2 N \varepsilon^2}{C_4^2} \right).$$

Given $\delta \in (0, 1)$, set

$$\varepsilon_N(\delta) := \frac{C_4}{\sqrt{2N}} \sqrt{\log \frac{1}{\delta}} = O\!\left( \sqrt{\frac{\log(1/\delta)}{N}} \right).$$
Then with probability at least $1 - \delta$ over $D_{\mathrm{cal}}$,

$$Q(D_{\mathrm{cal}}) \ge \mathbb{E}_{D_{\mathrm{cal}}}\big[Q(D_{\mathrm{cal}})\big] - \varepsilon_N(\delta). \quad (B.8)$$

Combining (B.8) with (B.4) yields $Q(D_{\mathrm{cal}}) \ge 1 - \alpha - \varepsilon_N(\delta)$ with probability at least $1 - \delta$, i.e.

$$\mathbb{P}\big( Z(X, S) = 1 \mid D_{\mathrm{cal}} \big) \ge 1 - \alpha - \varepsilon_N(\delta).$$

Finally, Algorithm 2 sets $\alpha_{\mathrm{eff}} = \max\{0, \alpha - \varepsilon_N(\delta)\} = \alpha - \varepsilon_N(\delta)$ (the slack is a small term), so

$$\mathbb{P}\big( S \le \hat{\lambda}_{\alpha_{\mathrm{eff}}}(X) \mid D_{\mathrm{cal}} \big) \ge 1 - \alpha_{\mathrm{eff}} - \varepsilon_N(\delta) = 1 - \alpha.$$

B.3. Proof of Proposition 4.3

Proof. Let

$$u_0 := 1 - \alpha, \qquad u_{\bar{\lambda}}(t) := F_t(\bar{\lambda}).$$

Since $\mathbb{P}(S \le \bar{\lambda}) = 1 - \alpha$, we have $\mathbb{E}[u_{\bar{\lambda}}(T)] = \mathbb{E}[F_T(\bar{\lambda})] = 1 - \alpha = u_0$. By Assumption 2, the map $t \mapsto u_{\bar{\lambda}}(t)$ is nonincreasing. Now,

$$\mathbb{E}\big[ G_X(\lambda^\star(X)) \big] = \mathbb{E}\big[ G_T(q_\alpha(T)) \big] = \mathbb{E}\big[ C_T(u_0) \big],$$

because $q_\alpha(t) = F_t^{-1}(u_0)$, and also

$$\mathbb{E}\big[ G_X(\bar{\lambda}) \big] = \mathbb{E}\big[ G_T(\bar{\lambda}) \big] = \mathbb{E}\big[ C_T(u_{\bar{\lambda}}(T)) \big].$$

Set $a(t) := u_{\bar{\lambda}}(t) - u_0$. Then $a(t)$ is nonincreasing in $t$ and $\mathbb{E}[a(T)] = 0$. By convexity of $C_t(\cdot)$, for each $t$,

$$C_t(u_{\bar{\lambda}}(t)) \ge C_t(u_0) + \partial_u C_t(u_0)\,\big( u_{\bar{\lambda}}(t) - u_0 \big).$$

Hence

$$C_t(u_{\bar{\lambda}}(t)) - C_t(u_0) \ge m(t)\, a(t), \qquad m(t) := \partial_u C_t(u_0).$$

By Assumption 4, $m(t)$ is nonincreasing in $t$. Since both $m(t)$ and $a(t)$ are nonincreasing, Chebyshev's rearrangement inequality gives

$$\mathbb{E}[m(T)\, a(T)] \ge \mathbb{E}[m(T)]\, \mathbb{E}[a(T)] = 0.$$

Taking expectations in the previous convexity bound yields

$$\mathbb{E}\big[ C_T(u_{\bar{\lambda}}(T)) \big] - \mathbb{E}\big[ C_T(u_0) \big] \ge \mathbb{E}[m(T)\, a(T)] \ge 0.$$

Therefore,

$$\mathbb{E}\big[ G_X(\bar{\lambda}) \big] = \mathbb{E}\big[ C_T(u_{\bar{\lambda}}(T)) \big] \ge \mathbb{E}\big[ C_T(u_0) \big] = \mathbb{E}\big[ G_X(\lambda^\star(X)) \big].$$

For strictness, if $\mathbb{P}(q_\alpha(X) \neq \bar{\lambda}_\alpha) > 0$, then by strict monotonicity of each $F_t$ we also have $\mathbb{P}\big(F_T(\bar{\lambda}_\alpha) \neq 1 - \alpha\big) > 0$. If $C_t(\cdot)$ is strictly convex for almost every $t$, then the supporting-line inequality is strict on a set of positive probability, which implies $\mathbb{E}[G_X(\lambda^\star(X))] < \mathbb{E}[G_X(\bar{\lambda}_\alpha)]$.

B.4. Proof of Theorem 4.4

Proof. Let $T = \psi(X)$ and set

$$u_0 := 1 - \alpha, \qquad \lambda^\star(X) = q_\alpha(T) = F_T^{-1}(u_0).$$
By Proposition 4.3,

$$\mathbb{E}\big[ G_X(\lambda^\star(X)) \big] \le \mathbb{E}\big[ G_X(\bar{\lambda}_\alpha) \big], \quad (B.9)$$

with strict inequality under the additional strictness assumptions stated there. It remains to show that $\mathbb{E}[G_X(\hat{\lambda}_{\alpha,N}(X))] \to \mathbb{E}[G_X(\lambda^\star(X))]$.

Fix $t$. Since $F_t$ is continuous and strictly increasing on $[0,1]$, we have $F_t^{-1}(F_t(\lambda)) = \lambda$ for all $\lambda \in [0,1]$. By definition, $C_t(u) = G_t(F_t^{-1}(u))$, so for every $\lambda \in [0,1]$, $G_t(\lambda) = C_t(F_t(\lambda))$. Now $u \mapsto C_t(u)$ is convex and differentiable on $(0,1)$ by Proposition 4.3, hence continuous on $(0,1)$. Since

$$F_t(\lambda^\star(t)) = F_t\big( F_t^{-1}(u_0) \big) = u_0 = 1 - \alpha \in (0, 1),$$

it follows that $\lambda \mapsto G_t(\lambda)$ is continuous at $\lambda^\star(t)$.

Now fix $x \in \mathcal{X}$. By the assumed uniform consistency,

$$\big| \hat{\lambda}_{\alpha,N}(x) - \lambda^\star(x) \big| \le \sup_{z \in \mathcal{X}} \big| \hat{\lambda}_{\alpha,N}(z) - \lambda^\star(z) \big| \overset{p}{\to} 0,$$

so $\hat{\lambda}_{\alpha,N}(x) \overset{p}{\to} \lambda^\star(x)$. Since $G_x(\cdot)$ is continuous at $\lambda^\star(x)$, the continuous mapping theorem yields

$$G_x\big( \hat{\lambda}_{\alpha,N}(x) \big) \overset{p}{\to} G_x\big( \lambda^\star(x) \big).$$

For $\varepsilon > 0$, define

$$p_{N,\varepsilon}(x) := \mathbb{P}_{D_{\mathrm{cal}}}\Big( \big| G_x\big( \hat{\lambda}_{\alpha,N}(x) \big) - G_x\big( \lambda^\star(x) \big) \big| > \varepsilon \Big).$$

Then $p_{N,\varepsilon}(x) \to 0$ for every fixed $x$, and $0 \le p_{N,\varepsilon}(x) \le 1$. Since the fresh test point $X$ is independent of the calibration sample,

$$\mathbb{P}\Big( \big| G_X\big( \hat{\lambda}_{\alpha,N}(X) \big) - G_X\big( \lambda^\star(X) \big) \big| > \varepsilon \Big) = \mathbb{E}_X\big[ p_{N,\varepsilon}(X) \big] \to 0$$

by dominated convergence. Therefore, $G_X(\hat{\lambda}_{\alpha,N}(X)) \overset{p}{\to} G_X(\lambda^\star(X))$. Moreover, $0 \le G_X(\hat{\lambda}_{\alpha,N}(X)) \le 1$ and $0 \le G_X(\lambda^\star(X)) \le 1$, so the sequence is uniformly integrable. Hence convergence in probability upgrades to convergence in $L^1$, and therefore

$$\lim_{N \to \infty} \mathbb{E}\big[ G_X(\hat{\lambda}_{\alpha,N}(X)) \big] = \mathbb{E}\big[ G_X(\lambda^\star(X)) \big].$$

Combining this with (B.9) proves

$$\lim_{N \to \infty} \mathbb{E}\big[ G_X(\hat{\lambda}_{\alpha,N}(X)) \big] \le \mathbb{E}\big[ G_X(\bar{\lambda}_\alpha) \big],$$

with strict inequality under the additional strictness assumptions from Proposition 4.3. Finally, conditional on $X$ and a threshold $\lambda$, each of the $M$ sampled candidates is accepted with probability $G_X(\lambda)$, so

$$\mathbb{E}\Big[ |\hat{C}_{\alpha,N}(X)| \,\Big|\, X, \hat{\lambda}_{\alpha,N}(X) \Big] = M\, G_X\big( \hat{\lambda}_{\alpha,N}(X) \big).$$
Taking expectations gives $\mathbb{E}[|\hat{C}_{\alpha,N}(X)|] = M\, \mathbb{E}[G_X(\hat{\lambda}_{\alpha,N}(X))]$. Passing to the limit yields

$$\lim_{N \to \infty} \mathbb{E}\big[ |\hat{C}_{\alpha,N}(X)| \big] = M\, \mathbb{E}\big[ G_X(\lambda^\star(X)) \big] \le M\, \mathbb{E}\big[ G_X(\bar{\lambda}_\alpha) \big],$$

with strict inequality under the same additional assumptions.

Figure 5. Learned threshold $\hat{\lambda}_\alpha(X)$ versus prompt difficulty in the updated synthetic run (α = 0.10, 5 bins). Easy prompts receive stricter thresholds, while harder prompts receive looser thresholds, explaining the improved group-wise coverage of CFC relative to global-threshold baselines.

C. Additional Experiments

C.1. Synthetic Data

Setup details. All synthetic appendix results use the same clean synthetic generator as the main text. Each prompt has a scalar difficulty variable T ∈ [0, 1]; we draw M candidates and define the prompt-level latent success score as S(X) = min{V_j : A_j = 1}, with S(X) = 1 if no sampled candidate is correct. Unless varied explicitly in the ablations, we use N_cal = N_test = 10,000 and M = 50. Throughout the synthetic appendix, ECR and GSC use the ground-truth correctness labels rather than the surrogate event S(X) ≤ λ̂(X). The main synthetic comparisons use 10 equal-frequency bins for GSC, the threshold-adaptation plot below uses 5 bins for readability, and CFC-PAC uses the same stability-mode adjustment with δ = 0.90 as in the main synthetic experiments.

Threshold adaptation versus prompt difficulty. Figure 5 visualizes the learned threshold as a function of prompt difficulty in the updated synthetic run at α = 0.10 using 5 difficulty bins. As expected, CFC assigns stricter thresholds to easy prompts and looser thresholds to hard prompts. This is exactly the adaptive behavior that a single global-threshold baseline cannot express.

Full target-risk sweep. Table 7 reports the full synthetic sweep across target error rates. As above, ECR and GSC use ground-truth correctness labels, and CFC-PAC uses the stability-mode adjustment with δ = 0.90 throughout. We include LEARNT CP in the full sweep to separate gains from learning a better threshold from gains due to exact conditional conformalization.

Sensitivity to calibration size and sampling budget. Table 8 reports a representative synthetic ablation of CFC and CFC-PAC as we vary the number of calibration points N_cal and the sampling budget M, using the same 10-bin setting as the main synthetic comparison. For consistency with the main synthetic experiments, CFC-PAC uses the same stability-mode adjustment with δ = 0.90, and all reported coverages are true label-based coverages. Figure 6 shows the corresponding group-level miscoverage profile.

Table 7. Results at different target error rates α.

                 α = 0.10                                 |  α = 0.15
Methods          ECR          GSC ↑        APSS ↓         |  ECR          GSC ↑        APSS ↓
TopK             90.6 ± 0.1   58.2 ± 1.3   16.00 ± 0.00   |  85.0 ± 1.6   44.9 ± 2.3   10.80 ± 0.98
ICP              90.2 ± 0.2   57.4 ± 1.4   16.71 ± 0.23   |  85.3 ± 0.6   46.5 ± 1.4   12.11 ± 0.28
Learnt CP        90.2 ± 0.3   84.3 ± 0.5   15.72 ± 0.15   |  85.2 ± 0.6   77.4 ± 0.6   12.44 ± 0.17
CFC (ours)       90.3 ± 0.5   88.7 ± 0.7   15.53 ± 0.12   |  85.2 ± 0.6   82.7 ± 0.8   12.42 ± 0.09
CFC-PAC (ours)   90.8 ± 0.6   89.1 ± 0.5   15.87 ± 0.16   |  85.6 ± 0.6   83.4 ± 1.1   12.66 ± 0.07

                 α = 0.20                                 |  α = 0.25
TopK             79.7 ± 0.2   36.5 ± 0.9   8.00 ± 0.00    |  73.6 ± 0.2   29.4 ± 1.2   6.00 ± 0.00
ICP              80.3 ± 0.7   37.9 ± 1.6   9.32 ± 0.26    |  75.4 ± 0.8   31.6 ± 1.4   7.43 ± 0.21
Learnt CP        80.2 ± 0.6   71.2 ± 0.9   10.31 ± 0.09   |  75.4 ± 0.8   65.1 ± 1.6   8.72 ± 0.13
CFC (ours)       80.2 ± 0.6   77.4 ± 0.7   10.39 ± 0.07   |  75.3 ± 0.8   72.1 ± 1.2   8.80 ± 0.10
CFC-PAC (ours)   80.6 ± 0.6   78.1 ± 0.6   10.53 ± 0.07   |  75.7 ± 0.8   72.4 ± 1.2   8.89 ± 0.10

                 α = 0.30                                 |  α = 0.35
TopK             73.6 ± 0.2   29.4 ± 1.2   6.00 ± 0.00    |  64.2 ± 0.3   21.7 ± 1.4   4.00 ± 0.00
ICP              70.5 ± 0.8   27.1 ± 1.5   6.00 ± 0.19    |  65.6 ± 0.7   23.1 ± 1.0   4.91 ± 0.16
Learnt CP        70.4 ± 0.9   59.6 ± 1.7   7.37 ± 0.12    |  65.5 ± 1.0   54.7 ± 1.6   6.32 ± 0.14
CFC (ours)       70.4 ± 0.9   66.5 ± 1.4   7.53 ± 0.11    |  65.5 ± 0.8   61.3 ± 1.8   6.48 ± 0.10
CFC-PAC (ours)   70.7 ± 0.9   66.8 ± 1.3   7.60 ± 0.11    |  65.8 ± 0.9   61.5 ± 1.9   6.55 ± 0.11

                 α = 0.40                                 |  α = 0.45
TopK             64.2 ± 0.3   21.7 ± 1.4   4.00 ± 0.00    |  64.2 ± 0.3   21.7 ± 1.4   4.00 ± 0.00
ICP              60.8 ± 0.8   19.4 ± 0.7   4.06 ± 0.13    |  55.6 ± 1.0   16.5 ± 0.9   3.32 ± 0.13
Learnt CP        60.4 ± 0.9   49.6 ± 1.8   5.40 ± 0.13    |  55.7 ± 1.0   45.5 ± 1.6   4.66 ± 0.10
CFC (ours)       60.5 ± 0.7   56.4 ± 1.7   5.58 ± 0.08    |  55.6 ± 0.8   51.9 ± 1.5   4.83 ± 0.10
CFC-PAC (ours)   60.9 ± 0.7   56.6 ± 1.6   5.62 ± 0.09    |  56.0 ± 0.9   52.3 ± 1.5   4.88 ± 0.11

Table 8. Ablation of CFC vs. CFC-PAC on synthetic data, with 10 bins.

                     CFC                              CFC-PAC
N_cal    M     True cov.        Mean set        True cov.        Mean set
2000     50    0.799 ± 0.006    10.60 ± 0.35    0.808 ± 0.005    10.92 ± 0.30
2000     100   0.809 ± 0.012    9.77 ± 0.31     0.819 ± 0.012    10.15 ± 0.29
2000     150   0.796 ± 0.006    9.03 ± 0.43     0.807 ± 0.006    9.37 ± 0.43
5000     50    0.801 ± 0.004    10.45 ± 0.12    0.808 ± 0.004    10.69 ± 0.16
5000     100   0.803 ± 0.008    9.52 ± 0.19     0.809 ± 0.007    9.74 ± 0.15
5000     150   0.802 ± 0.006    9.13 ± 0.26     0.808 ± 0.007    9.35 ± 0.27
10000    50    0.802 ± 0.006    10.39 ± 0.07    0.806 ± 0.006    10.53 ± 0.07
10000    100   0.803 ± 0.005    9.54 ± 0.06     0.808 ± 0.006    9.69 ± 0.09
10000    150   0.799 ± 0.008    9.15 ± 0.15     0.803 ± 0.008    9.29 ± 0.15

C.2. Real-World Data

C.2.1. TriviaQA

Full target-risk sweep. For the chosen TriviaQA feature map, we compute the rank-normalized answer-distribution entropy T_ent(X) and the rank-normalized maximum verifier loss T_loss(X) on the calibration split, then assign a prompt to the hard group when max{T_ent(X), T_loss(X)} ≥ q_0.925, where q_0.925 is the calibration 92.5th percentile of that combined score.
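This calibration-defined hard-group split is easy to make concrete. The sketch below is an illustrative reconstruction, not the paper's code: the rank normalization, the array names (`entropy`, `max_loss`), and the toy data are our own assumptions; only the max-then-92.5th-percentile rule comes from the text.

```python
import numpy as np

def rank_normalize(v):
    """Map values to [0, 1] by their rank within the calibration split."""
    ranks = np.argsort(np.argsort(v))
    return ranks / max(len(v) - 1, 1)

def hard_group_mask(entropy, max_loss, q=0.925):
    """Flag a prompt as 'hard' when the max of its rank-normalized
    answer-distribution entropy and rank-normalized max verifier loss
    reaches the calibration q-quantile of that combined score."""
    combined = np.maximum(rank_normalize(entropy), rank_normalize(max_loss))
    threshold = np.quantile(combined, q)
    return combined >= threshold

# Toy calibration statistics (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
entropy = rng.random(1000)
max_loss = rng.random(1000)
mask = hard_group_mask(entropy, max_loss)
```

By construction, roughly 7.5% of calibration prompts land in the hard group; GSC is then the coverage on this subset (or, more generally, the worst bin).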
Table 9 reports the full TriviaQA sweep under this chosen feature map. The full table makes the main-paper tradeoff more explicit: CFC is the most size-efficient of our methods, while CFC-PAC-FULL is the strongest at hitting the target coverage with a higher subgroup floor.

Figure 6. Group miscoverage for CFC and CFC-PAC, with 10 bins.

Table 9. Full TriviaQA sweep across target error rates α under the calibration-defined split max{T_ent(X), T_loss(X)} ≥ q_0.925. APSS below 1 indicates that the method abstains and returns the empty set on some prompts. ECR and GSC are reported in percent; for ECR, values closest to the target coverage 1 − α are preferred.

                       α = 0.20                          |  α = 0.25                          |  α = 0.30
Methods                ECR         GSC ↑       APSS ↓    |  ECR         GSC ↑       APSS ↓    |  ECR         GSC ↑       APSS ↓
TopK                   81.6 ± 0.3  69.7 ± 1.5  1.75 ± 0.00  |  73.4 ± 0.3  55.9 ± 1.6  1.00 ± 0.00  |  73.4 ± 0.3  55.9 ± 1.6  1.00 ± 0.00
ICP                    80.0 ± 0.5  65.1 ± 1.5  1.43 ± 0.02  |  74.9 ± 0.3  56.7 ± 1.9  1.08 ± 0.01  |  69.5 ± 0.8  49.3 ± 2.7  0.90 ± 0.02
Learnt CP              79.6 ± 0.5  76.3 ± 1.6  1.69 ± 0.04  |  74.7 ± 0.4  74.0 ± 1.1  1.22 ± 0.03  |  69.5 ± 0.7  68.7 ± 1.3  0.97 ± 0.02
CFC (ours)             76.2 ± 0.5  65.4 ± 1.7  1.21 ± 0.01  |  72.7 ± 0.4  65.2 ± 1.8  1.03 ± 0.03  |  68.4 ± 0.7  62.8 ± 2.0  0.88 ± 0.01
CFC-PAC (ours)         76.4 ± 0.5  65.4 ± 1.7  1.22 ± 0.01  |  73.1 ± 0.5  65.3 ± 1.8  1.05 ± 0.04  |  68.8 ± 0.7  62.9 ± 2.1  0.89 ± 0.02
CFC-FULL (ours)        79.6 ± 0.5  76.3 ± 1.6  1.69 ± 0.04  |  74.8 ± 0.3  74.1 ± 1.1  1.26 ± 0.09  |  69.6 ± 0.7  68.9 ± 1.1  0.97 ± 0.02
CFC-PAC-FULL (ours)    80.1 ± 0.5  76.3 ± 1.6  1.72 ± 0.04  |  75.3 ± 0.4  74.6 ± 1.0  1.32 ± 0.10  |  70.0 ± 0.8  69.2 ± 1.3  0.99 ± 0.03

                       α = 0.35                          |  α = 0.40                          |  α = 0.45
TopK                   73.4 ± 0.3  55.9 ± 1.6  1.00 ± 0.00  |  73.4 ± 0.3  55.9 ± 1.6  1.00 ± 0.00  |  73.4 ± 0.3  55.9 ± 1.6  1.00 ± 0.00
ICP                    64.5 ± 0.6  43.0 ± 2.1  0.78 ± 0.01  |  59.9 ± 0.9  36.1 ± 3.1  0.70 ± 0.01  |  54.8 ± 0.9  29.6 ± 2.3  0.62 ± 0.01
Learnt CP              64.6 ± 0.7  63.0 ± 2.1  0.82 ± 0.02  |  59.9 ± 0.8  58.2 ± 2.4  0.72 ± 0.01  |  55.2 ± 1.0  53.8 ± 1.7  0.65 ± 0.01
CFC (ours)             63.9 ± 0.7  59.2 ± 2.3  0.78 ± 0.01  |  59.5 ± 0.9  55.8 ± 2.7  0.70 ± 0.01  |  54.9 ± 1.0  52.3 ± 2.4  0.63 ± 0.01
CFC-PAC (ours)         64.3 ± 0.7  59.5 ± 2.4  0.78 ± 0.01  |  59.9 ± 0.8  56.4 ± 2.9  0.70 ± 0.01  |  55.4 ± 1.0  52.8 ± 2.6  0.64 ± 0.01
CFC-FULL (ours)        64.7 ± 0.7  63.2 ± 2.0  0.82 ± 0.02  |  59.9 ± 0.9  58.4 ± 2.2  0.72 ± 0.01  |  55.2 ± 1.0  53.8 ± 1.7  0.65 ± 0.01
CFC-PAC-FULL (ours)    65.1 ± 0.8  63.7 ± 2.2  0.83 ± 0.02  |  60.4 ± 0.8  59.0 ± 2.2  0.73 ± 0.01  |  55.7 ± 1.0  54.3 ± 1.9  0.65 ± 0.01

C.2.2. GSM8K

Full target-risk sweep. Table 10 reports the full GSM8K target-risk sweep for the chosen setting from the main paper: we keep the first 5 sampled candidates per prompt, define T(X) as the mean verifier loss across those candidates, and use the quadratic basis Φ(X) = [1, T(X), T(X)²]. The same qualitative pattern holds across the sweep: the conditional methods remain much more efficient than ICP while sharply improving worst-group coverage.

Candidate-budget ablation. Table 11 compares the chosen N = 5 budget against N = 20 at the representative target α = 0.10, keeping the same mean-loss proxy and quadratic basis.
The larger candidate budget barely changes target calibration, but it inflates APSS substantially for every threshold-based method. This is why the main paper uses the smaller budget on GSM8K: the extra samples add little new diversity but materially hurt efficiency.

C.2.3. Flickr8k

Full target-risk sweep. Table 12 reports the full Flickr8k sweep for the chosen clean setting from the main paper: we keep up to two cached candidates per image, define T(X) as the mean verifier loss across those candidates, and use the quadratic basis Φ(X) = [1, T(X), T(X)²]. This benchmark is visibly easier than GSM8K or TriviaQA, so the most informative comparison is closeness to the target coverage together with subgroup reliability. At α = 0.03 (target coverage 97%), CFC-PAC-FULL is the closest full-set variant to target while improving GSC over every baseline, whereas CFC collapses to almost one caption per image and is best viewed as the smallest-size extreme.

Table 10. Full GSM8K sweep across target error rates α using the first five sampled candidates per prompt, T(X) equal to the mean verifier loss, and the quadratic basis Φ(X) = [1, T(X), T(X)²]. ECR and GSC are reported in percent; for ECR, values closest to the target coverage 1 − α are preferred.

                       α = 0.05                                   |  α = 0.10                                   |  α = 0.15
Methods                ECR           GSC ↑         APSS ↓         |  ECR           GSC ↑         APSS ↓         |  ECR           GSC ↑         APSS ↓
TopK                   96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00    |  96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00    |  96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00
ICP                    95.09 ± 1.42  79.85 ± 6.53  4.73 ± 0.09    |  90.39 ± 1.44  56.36 ± 6.20  4.36 ± 0.08    |  83.91 ± 2.15  27.73 ± 8.14  3.97 ± 0.11
Learnt CP              94.91 ± 1.03  88.48 ± 3.05  4.01 ± 0.98    |  90.09 ± 1.38  86.06 ± 0.77  2.22 ± 0.07    |  84.70 ± 1.11  79.09 ± 1.56  1.92 ± 0.06
CFC (ours)             94.82 ± 0.97  88.48 ± 2.32  2.35 ± 0.43    |  90.18 ± 1.41  86.36 ± 1.07  1.49 ± 0.04    |  84.97 ± 1.12  79.55 ± 1.52  1.34 ± 0.04
CFC-FULL (ours)        95.03 ± 0.97  88.94 ± 2.74  4.08 ± 0.93    |  90.30 ± 1.40  86.36 ± 1.07  2.23 ± 0.08    |  85.09 ± 1.03  79.55 ± 1.52  1.96 ± 0.06
CFC-PAC (ours)         95.03 ± 1.33  88.18 ± 2.47  2.59 ± 0.29    |  91.79 ± 1.66  88.18 ± 1.41  1.55 ± 0.06    |  86.64 ± 0.95  81.67 ± 1.11  1.38 ± 0.03
CFC-PAC-FULL (ours)    95.24 ± 1.40  88.79 ± 3.01  4.59 ± 0.62    |  91.91 ± 1.68  88.48 ± 1.62  2.34 ± 0.11    |  86.76 ± 0.88  81.67 ± 1.11  2.05 ± 0.05

                       α = 0.20                                   |  α = 0.25                                   |  α = 0.30
TopK                   96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00    |  96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00    |  96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00
ICP                    79.55 ± 2.91  18.48 ± 4.92  3.69 ± 0.16    |  74.64 ± 2.53  13.48 ± 1.30  3.38 ± 0.15    |  70.18 ± 2.51  10.15 ± 0.91  3.11 ± 0.13
Learnt CP              79.42 ± 1.87  71.36 ± 3.63  1.74 ± 0.10    |  74.82 ± 2.51  66.97 ± 2.90  1.57 ± 0.07    |  69.36 ± 2.48  60.91 ± 2.23  1.42 ± 0.08
CFC (ours)             79.94 ± 1.80  72.12 ± 3.85  1.22 ± 0.04    |  75.03 ± 2.52  67.12 ± 2.86  1.12 ± 0.05    |  69.55 ± 2.53  61.21 ± 2.11  1.01 ± 0.05
CFC-FULL (ours)        80.06 ± 1.78  72.12 ± 3.85  1.76 ± 0.08    |  75.15 ± 2.47  67.12 ± 2.86  1.60 ± 0.07    |  69.64 ± 2.48  61.21 ± 2.11  1.44 ± 0.09
CFC-PAC (ours)         81.36 ± 1.68  73.18 ± 4.27  1.25 ± 0.04    |  75.94 ± 2.47  68.03 ± 2.81  1.13 ± 0.05    |  70.82 ± 2.58  62.58 ± 2.01  1.04 ± 0.05
CFC-PAC-FULL (ours)    81.48 ± 1.65  73.18 ± 4.27  1.82 ± 0.08    |  76.06 ± 2.42  68.03 ± 2.81  1.63 ± 0.07    |  70.94 ± 2.52  62.58 ± 2.01  1.48 ± 0.08

Table 11. GSM8K sample-budget ablation at α = 0.10 under the mean-loss quadratic rule.
Each budget uses the same basis Φ(X) = [1, T(X), T(X)²], differing only in the number of retained sampled candidates per prompt.

                       N = 5                                      |  N = 20
Methods                ECR           GSC ↑         APSS ↓         |  ECR           GSC ↑         APSS ↓
TopK                   96.42 ± 0.39  86.52 ± 1.11  1.00 ± 0.00    |  96.70 ± 0.36  88.48 ± 1.11  1.00 ± 0.00
ICP                    90.39 ± 1.44  56.36 ± 6.20  4.36 ± 0.08    |  90.15 ± 1.37  55.76 ± 5.52  16.75 ± 0.35
Learnt CP              90.09 ± 1.38  86.06 ± 0.77  2.22 ± 0.07    |  90.15 ± 0.81  87.42 ± 1.41  7.24 ± 0.33
CFC (ours)             90.18 ± 1.41  86.36 ± 1.07  1.49 ± 0.04    |  90.30 ± 0.63  87.73 ± 1.47  3.79 ± 0.12
CFC-FULL (ours)        90.30 ± 1.40  86.36 ± 1.07  2.23 ± 0.08    |  90.45 ± 0.70  87.73 ± 1.47  7.50 ± 0.18
CFC-PAC (ours)         91.79 ± 1.66  88.18 ± 1.41  1.55 ± 0.06    |  91.91 ± 0.46  88.64 ± 2.30  3.99 ± 0.07
CFC-PAC-FULL (ours)    91.91 ± 1.68  88.48 ± 1.62  2.34 ± 0.11    |  92.06 ± 0.57  88.79 ± 2.46  7.97 ± 0.03

Table 12. Full Flickr8k sweep with QWEN2-VL-7B-INSTRUCT using up to two cached candidates per image, T(X) equal to the mean verifier loss, and the quadratic basis Φ(X) = [1, T(X), T(X)²]. ECR and GSC are reported in percent; for ECR, values closest to the target coverage 1 − α are preferred.

                       α = 0.01                                   |  α = 0.02                                   |  α = 0.03
Methods                ECR           GSC ↑         APSS ↓         |  ECR           GSC ↑         APSS ↓         |  ECR           GSC ↑         APSS ↓
TopK                   97.75 ± 0.19  95.62 ± 0.71  2.00 ± 0.00    |  97.75 ± 0.19  95.62 ± 0.71  2.00 ± 0.00    |  96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00
ICP                    97.29 ± 0.29  93.33 ± 1.45  1.93 ± 0.01    |  96.25 ± 0.66  88.23 ± 3.66  1.87 ± 0.02    |  95.58 ± 0.54  85.21 ± 3.14  1.84 ± 0.01
Learnt CP              97.39 ± 0.33  95.52 ± 0.85  1.74 ± 0.06    |  97.06 ± 0.31  95.10 ± 0.63  1.34 ± 0.07    |  96.16 ± 0.17  94.48 ± 0.26  1.22 ± 0.07
CFC (ours)             96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00    |  96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00    |  95.81 ± 0.38  93.23 ± 0.47  0.99 ± 0.00
CFC-FULL (ours)        97.66 ± 0.16  95.62 ± 0.71  1.86 ± 0.04    |  97.27 ± 0.21  95.21 ± 0.77  1.40 ± 0.06    |  96.27 ± 0.24  94.58 ± 0.26  1.25 ± 0.08
CFC-PAC (ours)         96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00    |  96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00    |  96.37 ± 0.17  93.23 ± 0.47  1.00 ± 0.00
CFC-PAC-FULL (ours)    97.75 ± 0.19  95.62 ± 0.71  2.00 ± 0.00    |  97.64 ± 0.13  95.62 ± 0.71  1.86 ± 0.06    |  97.27 ± 0.21  95.21 ± 0.77  1.42 ± 0.07

Table 13. Flickr8k setting ablation at α = 0.03 (target coverage 97%). Each cell reports ECR / APSS / GSC for the corresponding method.

Setting                              CFC                     CFC-PAC-FULL
N = 2, max-loss, poly2               96.02 / 1.00 / 93.02    97.62 / 1.89 / 95.93
Chosen: N = 2, mean-loss, poly2      95.81 / 0.99 / 93.23    97.27 / 1.42 / 95.21
N = 3, mean-loss, poly2              96.00 / 1.00 / 93.02    97.81 / 2.16 / 96.35

Setting ablation. Table 13 compares the chosen Flickr8k setting against two nearby alternatives from the clean search at the representative target α = 0.03. The chosen N = 2 mean-loss quadratic rule is the best-balanced option we found: it keeps the CFC variant essentially at single-caption size, while CFC-PAC-FULL remains close to target without the larger APSS jump of the N = 3 alternative.
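To make the deployed rule concrete, the sketch below illustrates a feature-conditional threshold with the quadratic basis Φ(X) = [1, T(X), T(X)²]. It is a simplified stand-in, not the paper's implementation: a plain pinball-loss quantile regression fitted by subgradient descent replaces the augmented per-candidate refit of Eq. (A.9), and the fixed-point rule λ̂(x) = sup{t ∈ [0, 1] : t ≤ g(x)} is evaluated on a grid. All data and function names here are hypothetical.

```python
import numpy as np

def pinball_grad_fit(Phi, S, tau, steps=3000, lr=0.5):
    """Fit a linear quantile regression g(x) = Phi(x) @ beta at level tau
    by subgradient descent on the pinball loss. (Plain split fit only; the
    paper's augmented estimator refits once per candidate score.)"""
    beta = np.zeros(Phi.shape[1])
    for t in range(steps):
        r = S - Phi @ beta
        w = np.where(r >= 0, tau, tau - 1.0)        # pinball subgradient weights
        grad = -(Phi * w[:, None]).mean(axis=0)
        beta -= (lr / np.sqrt(t + 1.0)) * grad       # diminishing step size
    return beta

def fixed_point_threshold(phi_x, beta, n_grid=1001):
    """lambda_hat(x) = sup{ t in [0, 1] : t <= g(x) } on a grid. With the
    plain (non-augmented) fit, g does not depend on t, so this reduces to
    clipping g(x) into [0, 1]."""
    g = float(phi_x @ beta)
    ts = np.linspace(0.0, 1.0, n_grid)
    feasible = ts[ts <= g]
    return float(feasible.max()) if feasible.size else 0.0

# Toy calibration data: difficulty T in [0, 1], success score S increasing in T.
rng = np.random.default_rng(1)
T = rng.random(2000)
S = np.clip(0.2 + 0.6 * T + 0.1 * rng.standard_normal(2000), 0.0, 1.0)
Phi = np.stack([np.ones_like(T), T, T ** 2], axis=1)   # quadratic basis
beta = pinball_grad_fit(Phi, S, tau=0.90)              # tau = 1 - alpha

lam_easy = fixed_point_threshold(np.array([1.0, 0.1, 0.01]), beta)
lam_hard = fixed_point_threshold(np.array([1.0, 0.9, 0.81]), beta)
```

Because the latent success score grows with difficulty in this toy generator, the fitted conditional quantile rises with T, so the hard prompt receives a looser (larger) threshold than the easy one, which is the adaptive behavior Figure 5 visualizes.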