Coverage Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift


Authors: Farbod Siahkali, Member, IEEE; Ashwin Verma, Member, IEEE; Vijay Gupta, Fellow, IEEE

Abstract—Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.

(This work was supported in part by the U.S. Army Research Office under Grant 13001664. The authors are with the Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA; e-mail: {siahkali,verma240,gupta869}@purdue.edu.)

I. INTRODUCTION

Conformal prediction (CP) provides a rigorous framework for constructing prediction sets with guaranteed marginal coverage under an exchangeability assumption between calibration and test data [1], [2]. CP has been applied in a variety of settings [3], [4], [5]. However, the finite-sample, distribution-free guarantees from CP rely on exchangeability [6], which is often violated due to distribution shift between the source (calibration) and target (test) domains [7], [8], [9], [10].

If target labels are available, one approach to correct such miscoverage is weighted CP, which importance-weights the calibration scores with estimated density ratios between the source and target marginals on the input space [2], [9]. Alternatively, robust distributional approaches construct ambiguity sets (such as Lévy–Prokhorov balls) around the score distribution and propagate worst-case perturbations through the conformal quantile in score space [11].

When target-domain labels are unavailable, one possibility is to use pseudo-labels from source classifiers. However, pseudo-labels introduce additional uncertainty that can degrade coverage. Some recent works have attempted to mitigate this by heuristically rescaling scores using predictive entropy or reconstruction loss [12], [13]; however, these methods do not yield analytical coverage guarantees. Alternatively, [14] offers bounds for pseudo-labeled targets using score-distribution distances. However, since these bounds do not depend on the underlying classifier or shift characteristics, they do not provide insight into designing the classifier or pseudo-calibration schemes for trading off coverage and set size. These limitations point to a broader gap: existing CP methods under distribution shift do not account for how source-domain errors translate to the target domain.
Domain adaptation (DA) theory provides a natural lens for this question by bounding target errors in terms of source losses and distributional shift measures [15], [16]. Yet these ideas have not been incorporated into CP under distribution shift, leaving the analytical understanding of label-free multiclass CP under distribution shift underdeveloped.

Our work bridges this gap by drawing on DA theory to derive coverage guarantees that explicitly depend on classifier properties and shift measures. We extend these tools to multiclass classification and obtain bounds for pseudo-calibration on the target domain in terms of the source classifier's loss and the Wasserstein distance between the source and target distributions. Inspired by [17], we further introduce a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels based on classifier uncertainty, reducing the conservatism of standard pseudo-calibration while preserving source-domain coverage. To our knowledge, this is the first integration of DA theory with conformal coverage analysis under distribution shift.

Our contributions can be summarized as follows. First, we derive coverage lower bounds for pseudo-calibrated prediction sets on the target domain in terms of the classifier's source-domain loss, the Lipschitz properties of the classifier, and a Wasserstein measure of the distribution shift. Second, we introduce relaxed pseudo-calibrated sets that inflate the conformal threshold by a slack parameter and provide a simple design rule for choosing this slack to guarantee a desired target coverage level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels based on classifier uncertainty. Experiments show that our theoretical bounds track empirical behavior and that the proposed algorithm mitigates the coverage drop on the target while maintaining nontrivial expected set sizes.

II. BACKGROUND AND PROBLEM FORMULATION

a) Conformal Prediction: Conformal prediction constructs prediction sets with finite-sample marginal coverage guarantees [18]. Given calibration data $\{(X_i, Y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{XY}$ and a test point $(X_{n+1}, Y_{n+1})$, the goal is to ensure

$$\Pr\{Y_{n+1} \in C^{1-\alpha}_{P}(X_{n+1})\} \ge 1 - \alpha, \qquad (1)$$

for any $\alpha \in (0,1)$ under an exchangeability assumption between the calibration and test points, i.e., their joint distribution is invariant under permutations [18, Section 3]. Let $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a nonconformity score. For a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, denote the pushforward score distribution by $s_\# P$, with CDF $F_{s_\# P}$. For $\alpha \in (0,1)$, define the $(1-\alpha)$-quantile as $\inf\{t \in \mathbb{R} : F_{s_\# P}(t) \ge 1 - \alpha\}$. With the empirical distribution $\hat{P}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i, Y_i)}$, the split-conformal threshold at level $\alpha$ is

$$q_{P,\alpha} := \mathrm{Quantile}\!\left( \frac{\lceil (1-\alpha)(n+1) \rceil}{n};\; s_\# \hat{P}_n \right), \qquad (2)$$

and the conformal prediction set is given by $C^{1-\alpha}_{P}(x) = \{ y \in \mathcal{Y} : s(x, y) \le q_{P,\alpha} \}$. Under exchangeability, this construction guarantees (1) [19, Theorem 1.1]. However, when the test data are drawn from a joint distribution $Q_{XY}$ different from the calibration distribution $P_{XY}$, coverage can degrade, since the threshold is computed from $s_\# P_{XY}$ while the test scores follow $s_\# Q_{XY}$.
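To make the split-conformal construction concrete, here is a minimal sketch (our illustration, not code from the paper); it assumes precomputed nonconformity scores and uses NumPy's "higher" quantile interpolation to realize the ceiling in (2):

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold (2): the ceil((1-alpha)(n+1))/n empirical
    quantile of the calibration scores."""
    n = len(cal_scores)
    level = np.ceil((1 - alpha) * (n + 1)) / n
    return np.quantile(cal_scores, min(level, 1.0), method="higher")

def prediction_set(test_scores_by_class, q):
    """C^{1-alpha}(x) = {y : s(x, y) <= q}; expects one score per class."""
    return np.flatnonzero(test_scores_by_class <= q)

# Usage: q = split_conformal_threshold(cal_scores, alpha=0.2)
```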
b) Distribution shift measure: To quantify the shift of the input distribution, we use the Wasserstein distance. We assume throughout this paper that all probability measures considered are supported on the metric space $(\mathbb{R}^d, \|\cdot\|_2)$.

Definition 1: For $p \ge 1$ and probability measures $P$ and $Q$ on $\mathcal{X}$, the $p$-Wasserstein distance is

$$W_p(P, Q) := \left( \inf_{\pi \in \Pi(P,Q)} \mathbb{E}_{(X,X') \sim \pi}\!\left[ \|X - X'\|_2^p \right] \right)^{1/p},$$

where $\Pi(P,Q)$ is the set of all couplings of $P$ and $Q$. For $p = \infty$, we define $W_\infty(P, Q) := \inf_{\pi \in \Pi(P,Q)} \operatorname{ess\,sup}_{(X,X') \sim \pi} \|X - X'\|_2$.

c) Problem Considered: We consider a multiclass classification setting with input space $\mathcal{X} = \mathbb{R}^d$ and label space $\mathcal{Y} = [K] := \{1, \dots, K\}$. The source and target domains are represented by joint distributions $P_{XY}$ and $Q_{XY}$ over $\mathcal{X} \times \mathcal{Y}$. A classifier $f : \mathcal{X} \to \mathcal{Y}$ is induced by a logit map $M_f : \mathcal{X} \to \mathbb{R}^K$, which returns the vector of class logits. The predicted label is given by

$$f(x) := \arg\max_{k \in [K]} M_f(x)_k. \qquad (3)$$

For a labeled example $(x, y)$, define the multiclass margin, which measures how much the logit of the true class exceeds the largest competing logit, as $\gamma_f(x, y) := M_f(x)_y - \max_{k \ne y} M_f(x)_k$. To bound errors under distribution shift, we employ the ramp loss $\ell_r((x,y); f) := r(\gamma_f(x,y))$, where $r(t) := \min\{\max(1 - t, 0), 1\}$ is the ramp function, which clips the surrogate loss to the interval $[0,1]$. The population ramp loss under $P_{XY}$ is $L_r(f, P) := \mathbb{E}_{P_{XY}}[\ell_r((X, Y); f)]$. We will also use the hinge loss $\ell_h((x,y); f) := \max\{1 - \gamma_f(x,y), 0\}$, with population hinge loss under $P_{XY}$ given by $L_h(f, P) := \mathbb{E}_{(X,Y) \sim P_{XY}}[\ell_h((X, Y); f)]$.

Assumption 1: For all $x, x' \in \mathcal{X}$ and $y \in [K]$, the margin satisfies $|\gamma_f(x, y) - \gamma_f(x', y)| \le L_\gamma \|x - x'\|_2$.

In our setting, $f$ is a pre-trained classifier with known or estimated $L_r(f, P)$ on held-out source validation data. Once $f$ is fixed, $L_\gamma$ can be upper-bounded using spectral-norm bounds, or estimated via a data-dependent local gradient-norm bound around observed source samples. Only unlabeled target inputs $X \sim Q_X$ are available. We therefore form deterministic pseudo-labels $\tilde{Y} := f(X)$, inducing a pseudo-labeled joint distribution $\tilde{Q}_{XY}$ and an associated score distribution $s_\# \tilde{Q}_{XY}$.

Throughout this paper, we use the nonconformity score $s(x, y) := -\gamma_f(x, y)$. By the definition of $f$ in (3), for any $x \in \mathcal{X}$ and $y \ne f(x)$, we have $\gamma_f(x, f(x)) \ge 0$ and $\gamma_f(x, y) \le 0$. Hence $s(x, f(x)) \le s(x, y)$: under pseudo-labeling, the score of the predicted label is always less than or equal to the score of any other label at the same $x$. This, in turn, implies that for $(X, Y) \sim Q_{XY}$, we have $s(X, f(X)) \le s(X, Y)$ almost surely. Consequently,

$$F_{s_\# \tilde{Q}}(t) \ge F_{s_\# Q}(t) \qquad (4)$$

holds for all $t \in \mathbb{R}$. Equivalently, $q_{\tilde{Q},\alpha} \le q_{Q,\alpha}$ for every $\alpha \in (0,1)$. Pseudo-calibration therefore tends to use smaller thresholds, and hence produces smaller prediction sets on the target domain, than oracle calibration with true labels.
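The margin score and hard pseudo-labeling fit in a few lines; the following sketch (hypothetical helper names, with random logits standing in for $M_f$) makes explicit why $s(x, f(x)) \le s(x, y)$, since the predicted class attains the largest margin:

```python
import numpy as np

def margin_scores(logits, labels):
    """Nonconformity score s(x, y) = -(logit_y - max_{k != y} logit_k)."""
    n = len(labels)
    true_logit = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf      # remove the chosen class
    runner_up = masked.max(axis=1)              # largest competing logit
    return -(true_logit - runner_up)            # s = -gamma_f

# Hard pseudo-labels on unlabeled target inputs: Y~ = f(X) = argmax logits.
# By construction, s(x, f(x)) <= s(x, y) for every other label y, so the
# pseudo-score CDF dominates the true-score CDF, as in Eq. (4).
rng = np.random.default_rng(0)
logits_tgt = rng.normal(size=(1000, 10))        # stand-in for M_f(X) on Q_X
pseudo_labels = logits_tgt.argmax(axis=1)
pseudo_scores = margin_scores(logits_tgt, pseudo_labels)  # all <= 0 here
```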
We will also make the following assumption.

Assumption 2: The distributions $P_{XY}$ and $Q_{XY}$ satisfy:
(i) Identical marginal label distributions: $P_Y = Q_Y$.
(ii) Bounded conditional shift: for some $\rho > 0$, we have $\sup_{y \in \mathcal{Y}} W_\infty(P_{X|y}, Q_{X|y}) < \rho$.

Assumption 2(i) is standard in domain adaptation analyses to isolate label-conditional covariate shift (e.g., [15]). Assumption 2(ii) is natural in sensing/control pipelines where perturbations are physically constrained; in such cases, $\rho$ is treated as an a priori specification parameter (e.g., from known environment/sensor dynamics such as bounded drift or noise). Under Assumption 2 we also have $W_1(P_{X|y}, Q_{X|y}) \le \rho$ for all $y$, and hence $W_1(P_X, Q_X) \le \rho_{\mathrm{mix}} := \sum_{y=1}^{K} P_Y(y)\, W_1(P_{X|y}, Q_{X|y}) \le \rho$.

III. ANALYTICAL RESULTS

We now present upper bounds on the coverage gap under distribution shift; all proofs are deferred to the supplementary material. Note that $\rho$ and $L_\gamma$ appear only in the following bounds and are not required by the proposed procedures. By Assumption 1, the score function $s(x, y)$ is $L_\gamma$-Lipschitz in $x$ for every $y \in [K]$. For a given level $\alpha \in (0,1)$, let $q_{P,\alpha}$ denote the empirical split-conformal threshold computed from $s_\# \hat{P}_n$ in (2). Using this threshold, the achieved coverage under the target distribution $Q$ is $F_{s_\# Q}(q_{P,\alpha})$, while the coverage under $P$ is $F_{s_\# P}(q_{P,\alpha}) \approx 1 - \alpha$. Define the pointwise coverage gap

$$\Delta_{P,Q}(\alpha) := \left| F_{s_\# P}(q_{P,\alpha}) - F_{s_\# Q}(q_{P,\alpha}) \right|.$$

Following [10], we aggregate these discrepancies across all nominal levels via $\Delta_{P,Q} := \int_0^1 \Delta_{P,Q}(\alpha)\, d\alpha$ to measure the average coverage mismatch incurred when calibrating on $P$ but deploying on $Q$.

A. Coverage Gap Upper Bounds under Distribution Shift

Our first result bounds the Wasserstein distance between the original and shifted score distributions.

Lemma 1: Under Assumptions 1 and 2, we have $W_1(s_\# P, s_\# Q) \le L_\gamma \rho$.

For source calibration using labeled data from $P_{XY}$, we invoke the general coverage-gap bound of [10, Theorem 3.2]:

$$\Delta_{P,Q} \le \left( \sup_{t \in \mathbb{R}} p_{s_\# P}(t) \right) W_1(s_\# P, s_\# Q), \qquad (5)$$

where $p_{s_\# P}$ denotes the PDF of $s_\# P$ (when it exists). Combining (5) with Lemma 1 yields $\Delta_{P,Q} \le (\sup_{t \in \mathbb{R}} p_{s_\# P}(t))\, L_\gamma \rho$, which quantifies the worst-case degradation in coverage when the conformal threshold is computed from source data, in terms of the score density, the margin's Lipschitz constant, and the shift magnitude.

For pseudo-calibration on $Q$, calibration scores follow $s_\# \tilde{Q}_{XY}$ while test scores follow $s_\# Q_{XY}$. We have the following result.

Theorem 1: Let Assumptions 1 and 2 hold. Let $X_1, \dots, X_{n+1} \overset{\text{i.i.d.}}{\sim} Q_X$, and define pseudo-labels $\tilde{Y}_i := f(X_i)$, so that $(X_i, \tilde{Y}_i) \overset{\text{i.i.d.}}{\sim} \tilde{Q}_{XY}$. Let $q_{\tilde{Q},\alpha}$ be the $(1-\alpha)$ split-conformal threshold computed from the pseudo-scores $\{s(X_i, \tilde{Y}_i)\}_{i=1}^{n}$, and define the pseudo-calibrated prediction set $C^{1-\alpha}_{\tilde{Q}}(X_{n+1}) := \{ y \in [K] : s(X_{n+1}, y) \le q_{\tilde{Q},\alpha} \}$. Then the marginal coverage on the target domain satisfies

$$Q_{XY}\!\left\{ Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}}(X_{n+1}) \right\} \ge 1 - \alpha - L_r(f, P) - L_\gamma \rho_{\mathrm{mix}}, \qquad (6)$$

with the right-hand side clipped at 0 when it becomes negative. In other words, $\Delta_{\tilde{Q},Q} \le L_r(f, Q) \le L_r(f, P) + L_\gamma \rho_{\mathrm{mix}}$.

Note that when $P_Y \ne Q_Y$, the result can be relaxed with an additional total-variation distance term between the label marginals. Theorem 1 shows that hard pseudo-calibrated conformal sets retain meaningful coverage guarantees, controlled by the source ramp loss and the magnitude of the shift.
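To make the bound in (6) concrete, here is a small illustrative computation; all numbers are hypothetical stand-ins, not values from the paper's experiments:

```python
# Illustrative evaluation of the Theorem 1 lower bound (6).
alpha   = 0.2    # nominal miscoverage
L_r_P   = 0.05   # source ramp loss of f, estimated on held-out source data
L_gamma = 0.8    # Lipschitz constant of the margin
rho_mix = 0.1    # label-weighted W1 shift (<= rho by Assumption 2)

lower_bound = max(0.0, 1 - alpha - L_r_P - L_gamma * rho_mix)
print(lower_bound)   # 0.67: guaranteed target coverage of the pseudo-calibrated set
```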
The following corollary provides an explicit $\tau$-dependent lower bound for the relaxed conformal set $C^{1-\alpha}_{\tilde{Q},\tau}$, obtained via the hinge loss.

Corollary 1: Let Assumptions 1 and 2 hold. For any $\tau \ge 0$, define the relaxed prediction set $C^{1-\alpha}_{\tilde{Q},\tau}(X_{n+1}) := \{ y \in [K] : s(X_{n+1}, y) \le q_{\tilde{Q},\alpha} + \tau \}$. Then the marginal coverage on the target domain satisfies

$$Q_{XY}\!\left\{ Y_{n+1} \in C^{1-\alpha}_{\tilde{Q},\tau}(X_{n+1}) \right\} \ge 1 - \alpha - \min\left\{ L_r(f, Q),\; \frac{L_h(f, Q)}{1 + \tau/2} \right\}. \qquad (7)$$

B. Source-Tuned Pseudo-Calibration

From (4), we see that pseudo-labeling is pessimistic in score space. Specifically, it tends to produce smaller quantile thresholds and therefore smaller prediction sets, often resulting in undercoverage relative to calibration with true labels. To mitigate this pessimism, we interpolate between pseudo-labels and random labels in a data-dependent manner. Given a function $H : \mathcal{X} \to \mathbb{R}_{\ge 0}$ that measures some notion of uncertainty (e.g., predictive entropy), we rely on pseudo-labels when $H(x)$ is small and randomize otherwise. Given a threshold $u \in \mathcal{U}$, define, for $x \in \mathcal{X}$, the quantity

$$\tilde{Y}_u(x) = f(x)\, \mathbb{1}\{H(x) \le u\} + U\, \mathbb{1}\{H(x) > u\}, \qquad (8)$$

where $U \sim \mathrm{Unif}([K])$ and $\mathbb{1}\{\cdot\}$ is the indicator function. Let $\tilde{P}^u_{XY}$ and $\tilde{Q}^u_{XY}$ denote the randomized pseudo-labeled source and target joint distributions induced by $\tilde{Y}_u$, and let $s_\# \tilde{P}^u$ and $s_\# \tilde{Q}^u$ denote their associated score distributions. Since $f(x)$ minimizes $s(x, y)$ over $y \in [K]$, randomization can only increase the scores. Thus, for every $x$ and every realization of $\tilde{Y}_u(x)$, we must have $s(x, \tilde{Y}_u(x)) \ge s(x, f(x))$. Consequently, for nominal level $1 - \alpha$, the conformal threshold computed from $\tilde{Q}^u_{XY}$ is never smaller than under $\tilde{Q}_{XY}$, and the prediction sets are never less conservative.

We tune $u$ on labeled source data by sweeping over a grid $\mathcal{U}$. For each $u \in \mathcal{U}$, we compute the split-conformal threshold using the mixed pseudo-labeled scores $\{s(X^P_i, \tilde{Y}_u(X^P_i))\}_{i=1}^{m}$. We then select the largest $u^\star$ such that the empirical source coverage $\hat{c}(u)$ stays above $1 - \alpha$. With this $u^\star$ fixed, we pseudo-label the target calibration sample $\{X^Q_j\}_{j=1}^{n}$ and compute the final threshold from the scores $\{s(X^Q_j, \tilde{Y}_{u^\star}(X^Q_j))\}_{j=1}^{n}$. The overall procedure is summarized in Algorithm 1.

Algorithm 1: Source-Tuned Pseudo-Calibration
Require: Source data $\{(X^P_i, Y^P_i)\}_{i=1}^{m} \sim P_{XY}$; unlabeled target data $\{X^Q_j\}_{j=1}^{n} \sim Q_X$; classifier $f$; uncertainty $H : \mathcal{X} \to \mathbb{R}_{\ge 0}$; score $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$; level $\alpha$.
1: For each $X^P_i$, compute $\tilde{Y}_u(X^P_i)$ as in (8) and the corresponding pseudo-scores $S^u_i := s(X^P_i, \tilde{Y}_u(X^P_i))$.
2: Let $q_{\tilde{P}^u,\alpha}$ be the empirical $(1-\alpha)$-quantile of $\{S^u_i\}_{i=1}^{m}$, and define the empirical source coverage and threshold
$$\hat{c}(u) := \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{ s(X^P_i, Y^P_i) \le q_{\tilde{P}^u,\alpha} \}, \qquad u^\star = \max\{ u \in \mathcal{U} : \hat{c}(u) \ge 1 - \alpha \}.$$
3: Pseudo-label the target points $\{\tilde{Y}_{u^\star}(X^Q_j)\}_{j=1}^{n}$ and compute the target scores $\{S^{Q,u^\star}_j := s(X^Q_j, \tilde{Y}_{u^\star}(X^Q_j))\}_{j=1}^{n}$.
4: Let $q_{\tilde{Q}^{u^\star},\alpha}$ be the $(1-\alpha)$-quantile of $\{S^{Q,u^\star}_j\}_{j=1}^{n}$.
5: Return threshold $q_{\tilde{Q}^{u^\star},\alpha}$.

The following result states that, for any fixed $u$, this scheme never decreases target coverage relative to hard pseudo-calibration.

Theorem 2: For any joint distribution $Q_{XY}$ with marginal $Q_X$, the split-conformal predictor based on the scores $s(X, \tilde{Y}_u(X))$ achieves target coverage at least as large as that of hard pseudo-calibration at the same level $1 - \alpha$.
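A compact sketch of Algorithm 1 follows (our paraphrase in Python, reusing the hypothetical `margin_scores` helper from the Section II sketch); the grid contents, random-number handling, and fallback when no u is feasible are implementation choices the paper leaves open:

```python
import numpy as np

def source_tuned_threshold(logits_src, y_src, logits_tgt, alpha,
                           entropy_src, entropy_tgt, grid, rng):
    """Sketch of Algorithm 1: mix hard pseudo-labels with uniform random
    labels above an uncertainty cutoff u, tune u on labeled source data,
    then calibrate on the unlabeled target."""
    K = logits_src.shape[1]

    def mixed_labels(logits, entropy, u):
        hard = logits.argmax(axis=1)
        rand = rng.integers(0, K, size=len(hard))
        return np.where(entropy <= u, hard, rand)          # Eq. (8)

    def quantile(scores):
        level = np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores)
        return np.quantile(scores, min(level, 1.0), method="higher")

    true_src_scores = margin_scores(logits_src, y_src)
    grid = sorted(grid)
    u_star = grid[0]                                       # fallback if none feasible
    for u in grid:                                         # sweep the grid U
        s_u = margin_scores(logits_src,
                            mixed_labels(logits_src, entropy_src, u))
        cov = np.mean(true_src_scores <= quantile(s_u))    # c_hat(u)
        if cov >= 1 - alpha:
            u_star = u                                     # keep largest feasible u
    s_tgt = margin_scores(logits_tgt,
                          mixed_labels(logits_tgt, entropy_tgt, u_star))
    return quantile(s_tgt)
```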
IV. NUMERICAL EXPERIMENTS

We evaluate our bounds and calibration schemes on MNIST, CIFAR-10, and CIFAR-100. The original dataset distributions serve as $P$. The target distributions $Q_\sigma$ are obtained by applying a stochastic image transform $T_\sigma$ to the inputs, consisting of an appearance change plus clipped (truncated) Gaussian noise of strength $\sigma$, which aligns with the bounded-shift assumption in Assumption 2(ii). For each dataset, we train a fixed classifier $f$ on a source training split and form calibration and test splits $D^{\mathrm{cal}}_P$ and $D^{\mathrm{tst}}_P$. The corresponding target splits $D^{\mathrm{cal}}_{Q_\sigma}$ and $D^{\mathrm{tst}}_{Q_\sigma}$ are obtained by applying $T_\sigma$ to the inputs in $D^{\mathrm{cal}}_P$ and $D^{\mathrm{tst}}_P$.
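The paper specifies $T_\sigma$ only as an appearance change plus clipped Gaussian noise; the following sketch is one hypothetical instantiation (the contrast rescaling stands in for the unspecified appearance change), whose truncation keeps the perturbation norm bounded as in Assumption 2(ii):

```python
import numpy as np

def shift_transform(x, sigma, rng, contrast=0.9):
    """Hypothetical stand-in for T_sigma: a simple contrast rescaling
    (appearance change) plus truncated Gaussian noise of strength sigma.
    Truncation keeps ||T(x) - x||_2 bounded, matching Assumption 2(ii)."""
    noise = rng.normal(0.0, sigma, size=x.shape)
    noise = np.clip(noise, -2 * sigma, 2 * sigma)   # truncated noise
    out = contrast * x + noise                      # appearance change + noise
    return np.clip(out, 0.0, 1.0)                   # keep a valid pixel range
```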
We use split conformal prediction with nominal miscoverage $\alpha = 0.2$ and take predictive entropy as the uncertainty function $H$. We compare three calibration strategies on $Q_\sigma$: source calibration (calibration on samples from $P$ using true labels), hard pseudo-calibration (pseudo-calibration of samples from $Q_\sigma$ using $\tilde{Y}$), and source-tuned pseudo-calibration (Algorithm 1). Target calibration (oracle), which uses target labels, is reported as a performance upper bound.

TABLE I: Coverage and expected set size (ESS) versus shift level σ (nominal 1 − α = 0.8). In the typeset table, boldface denotes the best non-oracle method in each row: smallest ESS with coverage ≥ 1 − α or, if none reaches it, the highest coverage.

                        Source Cal.    Hard Pseudo-Cal.  Source-Tuned    Target Cal. (Oracle)
Dataset      σ        Cov     ESS      Cov     ESS       Cov     ESS      Cov     ESS
MNIST        0.70     0.837   0.84     0.871   0.88      0.903   0.93     0.892   0.91
MNIST        1.60     0.440   0.55     0.539   0.83      0.718   1.87     0.834   3.32
MNIST        2.00     0.276   0.50     0.347   0.82      0.437   1.22     0.813   5.45
CIFAR-10     0.45     0.727   0.97     0.647   0.80      0.935   2.12     0.801   1.18
CIFAR-10     1.05     0.481   0.95     0.437   0.80      0.879   4.32     0.794   3.05
CIFAR-10     1.50     0.329   0.94     0.299   0.80      0.777   4.83     0.808   5.03
CIFAR-100    0.30     0.755   2.46     0.533   0.79      0.954   18.80    0.813   3.54
CIFAR-100    0.50     0.680   2.79     0.458   0.81      0.937   31.95    0.809   6.16
CIFAR-100    0.70     0.596   2.92     0.381   0.81      0.901   39.22    0.797   9.83

Table I reports empirical coverage and expected set size (ESS) across all datasets and shift levels. On all datasets, source calibration and hard pseudo-calibration exhibit increasing undercoverage as σ increases. However, the source-tuned scheme improves coverage relative to source and hard pseudo-calibration and often approaches the oracle, especially at moderate shifts (particularly on CIFAR-10). On CIFAR-100, both source and hard pseudo-calibration suffer severe undercoverage at large shifts, whereas the source-tuned method restores coverage at the cost of enlarged ESS.

To connect these observations with our theoretical results, we evaluate the coverage lower bound implied by Theorem 1. For MNIST and CIFAR-10, we estimate the ramp loss $L_r(f, Q_\sigma)$ using the oracle labels of samples from $Q_\sigma$ and compute the corresponding lower bound (with $\tau = 0$). Fig. 1 plots the empirical coverage of source calibration, hard pseudo-calibration, and source-tuned pseudo-calibration on $Q_\sigma$ as a function of $\sigma$, together with the theoretical bounds and the nominal level. Although conservative, the bounds track the qualitative degradation of coverage under increasing shift and remain uniformly below the empirical coverage of hard pseudo-calibration across the full range of $\sigma$.

[Fig. 1: Coverage on $Q_\sigma$ vs. shift level $\sigma$ for source calibration, hard pseudo-calibration, and source-tuned pseudo-calibration; dashed curves show the Theorem 1 lower bounds $1 - \alpha - L_r(f, Q_\sigma)$ and $1 - \alpha - (L_r(f, P) + L_\gamma \rho_{\mathrm{mix}}(\sigma))$, with the nominal level $1 - \alpha$ marked. Panels: (a) MNIST, (b) CIFAR-10.]

We select $\tau(\sigma)$ using a source-calibrated undercoverage correction. First, we quantify the hard pseudo-calibration undercoverage gap on $P$ by calibrating with pseudo-labels $\tilde{Y}$ instead of $Y$: we obtain the hard pseudo-calibrated set $C^{1-\alpha}_{P,0}(x)$ and define $\Delta_P := (1 - \alpha) - P\{ Y \in C^{1-\alpha}_{P,0}(X) \}$. Next, for each shift level $\sigma$, given $L_h(f, P)$ and $L_h(f, Q_\sigma)$ (without revealing target labels, e.g., via a DA bound or a proxy estimator), we use Corollary 1 and solve for $\tau(\sigma)$ by matching (7) to the source lower bound plus the gap:

$$\tau(\sigma) = 2\left( \frac{L_h(f, Q_\sigma)}{L_h(f, P) - \Delta_P} - 1 \right).$$

We then construct prediction sets on $Q_\sigma$ using the relaxed threshold $q_{\tilde{Q}_\sigma,\alpha} + \tau(\sigma)$ and report coverage and ESS. As shown in Fig. 2, coverage under unadjusted pseudo-calibration drops as the shift increases, whereas the adjusted scheme maintains coverage at the cost of larger ESS.

[Fig. 2: Coverage (solid, left axis) and ESS (dashed, right axis) on $Q_\sigma$ vs. shift. Colors denote the method: hard pseudo-calibration $q_{\tilde{Q}_\sigma,\alpha}$, the $\tau$-adjusted threshold $q_{\tilde{Q}_\sigma,\alpha} + \tau(\sigma)$, and the oracle $q_{Q_\sigma,\alpha}$; the black curve is the hinge-loss lower bound. Increasing $\tau$ maintains the coverage lower bound at the cost of larger ESS. Panels: (a) MNIST, (b) CIFAR-10.]
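Under our reading of the displayed rule, the slack computation is one line; the loss values below are hypothetical, and the clipping at zero is our safeguard so the relaxed set is never smaller than the hard one:

```python
# Sketch of the tau(sigma) design rule from Section IV (hypothetical numbers).
def tau_slack(L_h_Q, L_h_P, delta_P):
    """Slack tau(sigma) = 2 * (L_h(f, Q_sigma) / (L_h(f, P) - Delta_P) - 1),
    clipped at 0 so the relaxed threshold never shrinks the prediction set."""
    return max(0.0, 2.0 * (L_h_Q / (L_h_P - delta_P) - 1.0))

print(tau_slack(L_h_Q=0.9, L_h_P=0.4, delta_P=0.05))  # ~3.14
```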
V. CONCLUSION

We studied conformal prediction under distribution shift when target-domain labels are unavailable. Drawing on ideas from domain adaptation, we derived coverage lower bounds that explicitly connect target-domain coverage to classifier properties and distribution-shift measures. Building on these results, we proposed a source-tuned pseudo-calibration scheme that interpolates between hard pseudo-labels and randomized labels using an uncertainty measure, mitigating the pessimism inherent in standard pseudo-calibration. Experiments show that the proposed source-tuned scheme reduces the coverage drop while maintaining nontrivial set sizes.

REFERENCES

[1] A. N. Angelopoulos, R. F. Barber, and S. Bates, "Theoretical foundations of conformal prediction," arXiv preprint, 2024.
[2] R. J. Tibshirani, R. Foygel Barber, E. Candès, and A. Ramdas, "Conformal prediction under covariate shift," Advances in Neural Information Processing Systems, vol. 32, 2019.
[3] W. Deng, S. Park, M. Li, and O. Simeone, "Optimizing in-context learning for efficient full conformal prediction," IEEE Signal Processing Letters, pp. 1–5, 2025.
[4] B. Wang, M. Zecchin, and O. Simeone, "Mirror online conformal prediction with intermittent feedback," IEEE Signal Processing Letters, vol. 32, pp. 2888–2892, 2025.
[5] I. Alon, D. Arnon, and A. Wiesel, "Learning minimal volume uncertainty ellipsoids," IEEE Signal Processing Letters, vol. 31, pp. 1655–1659, 2024.
[6] R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani, "Conformal prediction beyond exchangeability," The Annals of Statistics, vol. 51, no. 2, pp. 816–845, 2023.
[7] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, "A unifying view on dataset shift in classification," Pattern Recognition, vol. 45, no. 1, pp. 521–530, 2012.
[8] A. Podkopaev and A. Ramdas, "Distribution-free uncertainty quantification for classification under label shift," in Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, ser. Proceedings of Machine Learning Research, C. de Campos and M. H. Maathuis, Eds., vol. 161. PMLR, Jul. 2021, pp. 844–853. [Online]. Available: https://proceedings.mlr.press/v161/podkopaev21a.html
[9] R. Xu, C. Chen, Y. Sun, P. Venkitasubramaniam, and S. Xie, "Wasserstein-regularized conformal prediction under general distribution shift," in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=aJ3tiX1Tu4
[10] A. H. C. Correia and C. Louizos, "Non-exchangeable conformal prediction with optimal transport: Tackling distribution shifts with unlabeled data," 2025. [Online]. Available: https://arxiv.org/abs/2507.10425
[11] L. Aolaritei, Z. O. Wang, J. Zhu, M. I. Jordan, and Y. Marzouk, "Conformal prediction under Lévy–Prokhorov distribution shifts: Robustness to local and global perturbations," arXiv preprint arXiv:2502.14105, 2025.
[12] K. Kasa, Z. Zhang, H. Yang, and G. W. Taylor, "Adapting prediction sets to distribution shifts without labels," 2025. [Online]. Available: https://arxiv.org/abs/2406.01416
[13] S. Alijani and H. Najjaran, "WQLCP: Weighted adaptive conformal prediction for robust uncertainty quantification under distribution shifts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2025, pp. 1732–1741.
[14] S. Angelman, R. Nizhar, and J. Goldberger, "Calibrating without labels: Source-free conformal prediction using pseudo-labels," in Proceedings of the Fourteenth Symposium on Conformal and Probabilistic Prediction with Applications, ser. Proceedings of Machine Learning Research, K. A. Nguyen, Z. Luo, H. Papadopoulos, T. Löfström, L. Carlsson, and H. Boström, Eds., vol. 266. PMLR, Sep. 2025, pp. 63–81. [Online]. Available: https://proceedings.mlr.press/v266/angelman25a.html
[15] A. Kumar, T. Ma, and P. Liang, "Understanding self-training for gradual domain adaptation," in International Conference on Machine Learning. PMLR, 2020, pp. 5468–5479.
[16] Y. He, H. Wang, B. Li, and H. Zhao, "Gradual domain adaptation: Theory and algorithms," Journal of Machine Learning Research, vol. 25, no. 361, pp. 1–40, 2024.
[17] E. J. Candès, A. Ilyas, and T. Zrnic, "Probably approximately correct labels," arXiv preprint arXiv:2506.10908, 2025.
[18] G. Shafer and V. Vovk, "A tutorial on conformal prediction," Journal of Machine Learning Research, vol. 9, no. 3, 2008.
[19] A. N. Angelopoulos and S. Bates, "A gentle introduction to conformal prediction and distribution-free uncertainty quantification," 2022. [Online]. Available: https://arxiv.org/abs/2107.07511

SUPPLEMENTARY MATERIAL (APPENDIX)

A. Proof of Lemma 1

Fix $\epsilon > 0$. For each $y$, pick $\pi_y \in \Pi(P_{X|y}, Q_{X|y})$ with

$$\operatorname{ess\,sup}_{(X,X') \sim \pi_y} \|X - X'\|_2 \le \rho + \epsilon. \qquad (9)$$

Let $Y \sim P_Y$, and conditional on $Y = y$ sample $(X, X') \sim \pi_y$. Then $(X, Y) \sim P_{XY}$ and $(X', Y) \sim Q_{XY}$, and $|s(X, Y) - s(X', Y)| \le L_\gamma \|X - X'\|_2$ almost surely.
Hence $W_1(s_\# P, s_\# Q) \le L_\gamma\, \mathbb{E}\|X - X'\|_2 \le L_\gamma (\rho + \epsilon)$. Letting $\epsilon \to 0$ completes the proof.

B. Kantorovich–Rubinstein Inequality

Lemma 2: Let $(\mathcal{X}, d)$ be a metric space, let $P, Q$ be probability measures on $\mathcal{X}$, and let $f : \mathcal{X} \to \mathbb{R}$ be $L$-Lipschitz. Then $\left| \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(X')] \right| \le L\, W_1(P, Q)$.

Proof: For any coupling $\pi \in \Pi(P, Q)$ with $(X, X') \sim \pi$, $|\mathbb{E}[f(X) - f(X')]| \le \mathbb{E}|f(X) - f(X')| \le L\, \mathbb{E}[d(X, X')]$. Taking the infimum over $\pi \in \Pi(P, Q)$ yields the claim.

C. Proof of Theorem 1

Define $S := -\gamma_f(X, Y)$ and $\tilde{S} := -\gamma_f(X, \tilde{Y})$, where $(X, Y) \sim Q_{XY}$. For any fixed classifier $f$, the pseudo-scores $\tilde{S}_i = s(X_i, f(X_i))$ are exchangeable under $\tilde{Q}$. Thus, by split-conformal validity, $\tilde{Q}\{\tilde{S} \le q_{\tilde{Q},\alpha}\} \ge 1 - \alpha$. Since $\tilde{S} = s(X, f(X))$ depends only on $X$, and $X \sim Q_X$ under both $Q$ and $\tilde{Q}$, we also have $Q(\tilde{S} \le q_{\tilde{Q},\alpha}) \ge 1 - \alpha$. Hence, under $Q_{XY}$,

$$Q\{ S > q_{\tilde{Q},\alpha} + \tau \} \le \alpha + Q\{ S - \tilde{S} > \tau \}. \qquad (10)$$

If $Y = f(X)$, then $S = \tilde{S}$ and hence $S - \tilde{S} = 0$. Thus $\{S - \tilde{S} > \tau\} \subseteq \{Y \ne f(X)\}$. On the event $\{Y \ne f(X)\}$, $S - \tilde{S} = -\gamma_f(X, Y) + \gamma_f(X, f(X))$. We have $0 \le \gamma_f(X, f(X)) \le -\gamma_f(X, Y)$, and therefore $S - \tilde{S} \le 2(-\gamma_f(X, Y))$. Hence $\{S - \tilde{S} > \tau\} \subseteq \{2(-\gamma_f(X, Y)) > \tau\} = \{\gamma_f(X, Y) < -\tau/2\}$, and consequently,

$$Q\{ S - \tilde{S} > \tau \} \le Q\{ \gamma_f(X, Y) \le -\tau/2 \}. \qquad (11)$$

Next we control $Q(\gamma_f(X, Y) \le -\tau/2)$ via the ramp loss. Recall the ramp loss $\ell_r((x,y); f) = r(\gamma_f(x,y))$. For any $\gamma \le 0$ we have $1 - \gamma \ge 1$, hence $r(\gamma) = \min\{(1-\gamma)_+, 1\} = 1$. Thus, on the event $\{\gamma_f(X, Y) \le -\tau/2\}$ we have $\mathbb{1}\{\gamma_f(X, Y) \le -\tau/2\} \le r(\gamma_f(X, Y))$. Taking expectations under $Q_{XY}$ yields $Q\{\gamma_f(X, Y) \le -\tau/2\} \le \mathbb{E}_Q[r(\gamma_f(X, Y))] = L_r(f, Q)$. Since the result holds for any $\tau \ge 0$, we have $Q\{\gamma_f(X, Y) \le -\tau/2\} \le Q(\gamma_f(X, Y) \le 0) \le L_r(f, Q)$.

For each $y \in [K]$, define $\ell_y(x) := r(\gamma_f(x, y))$. By Assumption 1, and since $r(\cdot)$ is 1-Lipschitz, $\ell_y$ is $L_\gamma$-Lipschitz in $x$. We can write $L_r(f, P) = \sum_{y=1}^{K} P_Y(y)\, \mathbb{E}_{P_{X|y}}[\ell_y(X)]$, and similarly $L_r(f, Q) = \sum_{y=1}^{K} Q_Y(y)\, \mathbb{E}_{Q_{X|y}}[\ell_y(X)]$. By Assumption 2(i),

$$L_r(f, Q) - L_r(f, P) = \sum_{y=1}^{K} P_Y(y)\left( \mathbb{E}_{Q_{X|y}}[\ell_y(X)] - \mathbb{E}_{P_{X|y}}[\ell_y(X)] \right).$$

By Lemma 2, for each $y$, $\left| \mathbb{E}_{Q_{X|y}}[\ell_y(X)] - \mathbb{E}_{P_{X|y}}[\ell_y(X)] \right| \le L_\gamma W_1(P_{X|y}, Q_{X|y})$. Therefore,

$$\left| L_r(f, Q) - L_r(f, P) \right| \le L_\gamma \sum_{y=1}^{K} P_Y(y)\, W_1(P_{X|y}, Q_{X|y}) = L_\gamma \rho_{\mathrm{mix}},$$

and in particular $L_r(f, Q) \le L_r(f, P) + L_\gamma \rho_{\mathrm{mix}}$. Combining the previous displays (with $\tau = 0$) yields $Q\{S - \tilde{S} > 0\} \le L_r(f, P) + L_\gamma \rho_{\mathrm{mix}}$, and thus $Q\{S > q_{\tilde{Q},\alpha}\} \le \alpha + L_r(f, P) + L_\gamma \rho_{\mathrm{mix}}$. Finally, $Q\{Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}}(X_{n+1})\} = Q\{S \le q_{\tilde{Q},\alpha}\}$, which gives (6).

D. Proof of Corollary 1

Fix any $\tau \ge 0$. Let $(X, Y) := (X_{n+1}, Y_{n+1}) \sim Q_{XY}$ denote the test point, and define $\tilde{Y} := f(X)$, $S := s(X, Y)$, and $\tilde{S} := s(X, \tilde{Y})$. For the relaxed set $C^{1-\alpha}_{\tilde{Q},\tau}(X) = \{y : s(X, y) \le q_{\tilde{Q},\alpha} + \tau\}$, we have $\{Y \notin C^{1-\alpha}_{\tilde{Q},\tau}(X)\} = \{S > q_{\tilde{Q},\alpha} + \tau\}$. Using (10) and (11), $Q(S > q_{\tilde{Q},\alpha} + \tau) \le \alpha + Q(S - \tilde{S} > \tau) \le \alpha + Q\{\gamma_f(X, Y) \le -\tau/2\}$. On the event $\{\gamma_f(X, Y) \le -\tau/2\}$ we have $1 - \gamma_f(X, Y) \ge 1 + \tau/2$, hence $\ell_h((X, Y); f) = \max\{1 - \gamma_f(X, Y), 0\} \ge 1 + \tau/2$. Therefore, $\mathbb{1}\{\gamma_f(X, Y) \le -\tau/2\} \le \frac{\ell_h((X,Y); f)}{1 + \tau/2}$.
Taking the expectation under $Q_{XY}$ yields $Q\{\gamma_f(X, Y) \le -\tau/2\} \le \frac{L_h(f, Q)}{1 + \tau/2}$. Combining the displays gives $Q(S > q_{\tilde{Q},\alpha} + \tau) \le \alpha + \frac{L_h(f, Q)}{1 + \tau/2}$. Rearranging gives (7).

E. Proof of Theorem 2

Let $X_1, \dots, X_{n+1} \overset{\text{i.i.d.}}{\sim} Q_X$, and $\tilde{Y}_i = f(X_i)$ with corresponding pseudo-scores $\tilde{S}_i := s(X_i, \tilde{Y}_i)$. Let $q_{\tilde{Q},\alpha}$ be the split-conformal $(1-\alpha)$ threshold of $\{\tilde{S}_i\}_{i=1}^{n}$, and define $C^{1-\alpha}_{\tilde{Q}}(x) := \{y : s(x, y) \le q_{\tilde{Q},\alpha}\}$. Let $\tilde{Y}_u(X_i)$ be defined as in (8), and $S^u_i := s(X_i, \tilde{Y}_u(X_i))$. Let $q_{\tilde{Q}^u,\alpha}$ be the split-conformal $(1-\alpha)$ threshold from $\{S^u_i\}_{i=1}^{n}$, and let $C^{1-\alpha}_{\tilde{Q}^u}(x) := \{y \in [K] : s(x, y) \le q_{\tilde{Q}^u,\alpha}\}$.

If $H(X_i) \le u$, then $S^u_i = \tilde{S}_i$, while if $H(X_i) > u$, the label $\tilde{Y}_u(X_i)$ is drawn uniformly from $[K]$, and since $f(X_i)$ minimizes $s(X_i, y)$ over $y \in [K]$ we have $s(X_i, \tilde{Y}_u(X_i)) \ge s(X_i, f(X_i)) = \tilde{S}_i$ for every realization. Thus $S^u_i \ge \tilde{S}_i$. Consequently, for any $t \in \mathbb{R}$, we have $\{S^u_i \le t\} \subseteq \{\tilde{S}_i \le t\}$, implying $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{S^u_i \le t\} \le \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{\tilde{S}_i \le t\}$. By the definition of the empirical quantile, it follows that $q_{\tilde{Q}^u,\alpha} \ge q_{\tilde{Q},\alpha}$.

Let $U_i \sim \mathrm{Unif}([K])$ on the event $\{H(X_i) > u\}$. For the test point $(X_{n+1}, Y_{n+1}) \sim Q_{XY}$, the indicator $\mathbb{1}\{s(X_{n+1}, Y_{n+1}) \le t\}$ is non-decreasing in $t$; hence for every realization of $\mathcal{C} = (X_1, \dots, X_n, U_1, \dots, U_n)$ and $(X_{n+1}, Y_{n+1})$ we have $\mathbb{1}\{s(X_{n+1}, Y_{n+1}) \le q_{\tilde{Q}^u,\alpha}\} \ge \mathbb{1}\{s(X_{n+1}, Y_{n+1}) \le q_{\tilde{Q},\alpha}\}$. Taking the conditional expectation given $\mathcal{C}$ yields $\mathbb{E}\big[\mathbb{1}\{Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}^u}(X_{n+1})\} \mid \mathcal{C}\big] \ge \mathbb{E}\big[\mathbb{1}\{Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}}(X_{n+1})\} \mid \mathcal{C}\big]$. Finally, taking the expectation over $\mathcal{C}$ gives $Q\{Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}^u}(X_{n+1})\} \ge Q\{Y_{n+1} \in C^{1-\alpha}_{\tilde{Q}}(X_{n+1})\}$.
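As a closing sanity check of the threshold ordering $q_{\tilde{Q}^u,\alpha} \ge q_{\tilde{Q},\alpha}$ used in the proof, here is a small simulation on synthetic logits (our illustration, reusing the hypothetical `margin_scores` helper from the Section II sketch; the cutoff u and the distributions are arbitrary choices):

```python
import numpy as np

# Randomized labels can only raise the margin scores, so the split-conformal
# threshold never decreases -- the key monotonicity step in Theorem 2.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
entropy = -(probs * np.log(probs)).sum(axis=1)        # uncertainty H(x)

hard = logits.argmax(axis=1)                          # hard pseudo-labels
u = np.median(entropy)                                # an arbitrary cutoff
mixed = np.where(entropy <= u, hard,
                 rng.integers(0, 10, size=len(hard)))  # Eq. (8) labels

alpha = 0.2
level = np.ceil((1 - alpha) * (len(hard) + 1)) / len(hard)
q_hard = np.quantile(margin_scores(logits, hard), level, method="higher")
q_mixed = np.quantile(margin_scores(logits, mixed), level, method="higher")
assert q_mixed >= q_hard                              # ordering from the proof
```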
