Shape-Adaptive Conditional Calibration for Conformal Prediction via Minimax Optimization

Yajie Bao$^{1*}$, Chuchen Zhang$^{1*}$, Zhaojun Wang$^{1}$, Haojie Ren$^{2}$, and Changliang Zou$^{1}$

$^1$School of Statistics and Data Science, Nankai University
$^2$School of Mathematical Sciences, Shanghai Jiao Tong University

March 25, 2026

Abstract

Achieving valid conditional coverage in conformal prediction is challenging due to the theoretical difficulty of satisfying pointwise constraints in finite samples. Building upon the characterization of conditional coverage through marginal moment restrictions, we introduce Minimax Optimization Predictive Inference (MOPI), a framework that generalizes prior work by optimizing over a flexible class of set-valued mappings during the calibration phase, rather than simply calibrating a fixed sublevel set. This minimax formulation effectively circumvents the structural constraints of predefined score functions, achieving superior shape adaptivity while maintaining a principled connection to the minimization of mean squared coverage error. Theoretically, we provide non-asymptotic oracle inequalities and show that the convergence rate of the coverage error attains the optimal order under regular conditions. MOPI also enables valid inference conditional on sensitive attributes that are available during calibration but unobserved at test time. Empirical results on complex, non-standard conditional distributions demonstrate that MOPI produces more efficient prediction sets than existing baselines.

Keywords: Conditional validity; Geometric shape; Mean squared coverage error; Prediction set; Set-valued mapping.

*The first two authors contributed equally to this work and are listed in alphabetical order.

1 Introduction

Predictive inference aims to construct a prediction set for an unknown label based on a machine learning model, ensuring that the prediction set covers the true label with a pre-specified confidence level. This statistical guarantee is essential in high-stakes applications, such as medical diagnosis (Vazquez and Facelli, 2022), safe planning in autonomous systems (Lindemann et al., 2023), and robust decision-making under uncertainty (Johnstone and Cox, 2021), where reliable and uncertainty-aware predictions can enhance decision safety.

Conformal prediction (Vovk et al., 2005) is a distribution-free and model-agnostic tool in predictive inference. Given the test covariate $X \in \mathcal{X}$, it can issue a $(1-\alpha)$-level prediction set $C(X)$ for the unobserved label $Y \in \mathcal{Y}$ by using the collected labeled data and a machine learning model. Suppose the labeled data and the test data $(X, Y)$ are independent and identically distributed (i.i.d.) or exchangeable; then the conformal prediction set satisfies the marginal coverage property $\mathbb{P}\{Y \in C(X)\} \geq 1-\alpha$; see Lei et al. (2018).

Although marginal coverage offers a simple and attractive validity guarantee, it may fail to provide uniform reliability across individuals or under heterogeneous data distributions. For example, certain subgroups or regions of the covariate space may experience systematic under-coverage. To overcome this limitation, a more stringent and relevant objective is the test-conditional coverage: $\mathbb{P}\{Y \in C(X) \mid X\} \geq 1-\alpha$ almost surely.
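To make the gap between marginal and test-conditional coverage concrete, the following minimal Python sketch (our illustration, not code from the paper; the data-generating process, the pretend pretrained model `mu`, and all sample sizes are our own assumptions) runs split conformal prediction with an absolute-residual score on heteroscedastic toy data. Marginal coverage lands near the nominal level, while coverage on covariate subregions drifts above or below it.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_conformal_threshold(scores, alpha):
    # Conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

def sample(n):
    # Heteroscedastic toy data: noise scale grows with x.
    x = rng.uniform(0, 5, n)
    y = np.sin(x) + rng.normal(0, 0.1 + 0.5 * x)
    return x, y

x_cal, y_cal = sample(2000)
x_test, y_test = sample(5000)
mu = np.sin  # stand-in for a pretrained regression model

alpha = 0.1
q = split_conformal_threshold(np.abs(y_cal - mu(x_cal)), alpha)
covered = np.abs(y_test - mu(x_test)) <= q

print("marginal coverage:", covered.mean())              # near 0.90
print("coverage on x < 1:", covered[x_test < 1].mean())  # over-covers
print("coverage on x > 4:", covered[x_test > 4].mean())  # under-covers
```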
Previous work has shown that achieving exact test-conditional coverage is impossible in a distribution-free setting without resorting to trivial prediction sets of infinite expected size (Vovk, 2012; Lei et al., 2013). Consequently, various conformal prediction methods have been proposed to achieve approximate or asymptotic conditional coverage by either modifying the calibration step (Guan, 2023; Hore and Barber, 2025; Gibbs et al., 2025) or using different score functions (Romano et al., 2019; Chernozhukov et al., 2021).

Consider a generalized notion for conditional predictive inference:
$$\mathbb{P}\{Y \in C(X) \mid Z = z\} = 1-\alpha, \quad \forall z \in \mathcal{Z}, \qquad (1)$$
where $Z \in \mathcal{Z}$ is an arbitrary conditioning variable. The formulation in (1) encapsulates several important cases: test-conditional coverage when $Z = X$; group-conditional coverage when $Z = \mathbb{1}\{X \in G\}$, where $G$ is a specific group in the covariate domain $\mathcal{X}$; and the equalized coverage of Romano et al. (2020) when $Z$ represents a sensitive attribute, e.g., gender or race.

To achieve this conditional coverage, a natural strategy is the "conditional-to-marginal" relaxation, which transforms the conditional constraint into a set of marginal moment restrictions (Andrews and Shi, 2013):
$$\mathbb{E}\big[f(Z)\big(\mathbb{1}\{Y \notin C(X)\} - \alpha\big)\big] = 0, \quad \forall f \in \mathcal{F}, \qquad (2)$$
where $\mathcal{F}$ is a "weight" function class defined on the conditioning domain $\mathcal{Z}$. When $\mathcal{F}$ includes all measurable functions, the conditional coverage target (1) is equivalent to the marginal relation (2).

A seminal approach leveraging this relaxation is the conditional calibration framework proposed by Gibbs et al. (2025), which considers prediction sets of the sublevel form $C(X, Z) = \{y \in \mathcal{Y} : s(X, y) \leq f(Z)\}$, where $s$ is a fixed score pretrained on a separate dataset and $f \in \mathcal{F}$ is a threshold function to be determined. In this context, finding a prediction set that satisfies (2) can be recast as solving the functional quantile regression problem
$$\min_{f \in \mathcal{F}} \mathbb{E}\big[\ell_\alpha\big(s(X, Y), f(Z)\big)\big],$$
where $\ell_\alpha(u, v) = (u - v)(\mathbb{1}\{u - v > 0\} - \alpha)$ is the pinball loss.
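For reference, here is a brief sketch of this baseline calibration step (our own illustration under simplifying assumptions, not the CC authors' code): the pinball loss $\ell_\alpha$ evaluated exactly as defined above, and a linear threshold $f(z) = \beta^\top z$ fitted by subgradient descent on the empirical pinball risk; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def pinball_loss(u, v, alpha):
    # l_alpha(u, v) = (u - v) * (1{u - v > 0} - alpha)
    d = u - v
    return d * ((d > 0).astype(float) - alpha)

def fit_quantile_threshold(scores, z, alpha, lr=0.05, steps=2000):
    # Fit f(z) = beta^T z by (sub)gradient descent on the empirical pinball risk;
    # z is an (n, p) design matrix (include an intercept column if desired).
    beta = np.zeros(z.shape[1])
    for _ in range(steps):
        d = scores - z @ beta
        grad = -z.T @ ((d > 0).astype(float) - alpha) / len(scores)
        beta -= lr * grad
    return beta
```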
While computationally tractable, this quantile-regression formulation inherently restricts the predictive region to the sublevel sets of a fixed score function. It can only adjust the volume of the set in the calibration phase but fails to rotate the shape or alter its aspect ratio to capture local correlations among output variables. This geometric rigidity is especially pronounced in multivariate regression with vector-valued labels. Although recent efforts have focused on designing efficient scores during the training phase (Johnstone and Cox, 2021; Thurin et al., 2025), the resulting predictive sets still lack the flexibility to adapt to local distributional heteroscedasticity during calibration. Furthermore, this paradigm typically requires the threshold $f(\cdot)$ to be a direct mapping from observed features. Consequently, it struggles to handle scenarios where $Z$ represents sensitive attributes that can be masked at test time (Zafar et al., 2017). In such cases, a threshold function learned solely on $X$ cannot effectively incorporate the necessary conditioning information from $Z$ during calibration.

1.1 A snapshot of our ideas and contributions

To break these constraints, we introduce Minimax Optimization Predictive Inference (MOPI), a novel framework that reformulates conditional predictive inference as a minimax problem. We consider a general and structured set-valued function class:
$$\mathcal{C} = \big\{ C(x; h) = \{y \in \mathcal{Y} : T(h(x), y) \leq 0\} : h \in \mathcal{H} \big\}, \qquad (3)$$
where $T(\cdot, \cdot)$ is a fixed function, $h(x)$ can be a vector- or matrix-valued function encoding the geometric structure (e.g., mean $\mu(x)$ or covariance $\Sigma(x)$), and $\mathcal{H}$ is a function class on the covariate domain $\mathcal{X}$. Assuming $\mathcal{F}$ is symmetric (i.e., $-f \in \mathcal{F}$ whenever $f \in \mathcal{F}$), the deviation of the left-hand side of (2) from zero can be quantified by the maximum quantity $\max_{f \in \mathcal{F}} \mathbb{E}[f(Z)(\mathbb{1}\{Y \in C(X)\} - (1-\alpha))]$. Hence, we can construct the desired prediction set by minimizing this maximum quantity over all set-valued functions $C \in \mathcal{C}$, leading naturally to a minimax optimization problem (Dikkala et al., 2020):
$$\min_{C \in \mathcal{C}} \max_{f \in \mathcal{F}} \mathbb{E}\big[f(Z)\big(\mathbb{1}\{Y \notin C(X)\} - \alpha\big) - f^2(Z)\big],$$
where the term $f^2(Z)$ is used for normalization (see Section 2.2 for details).
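The empirical counterpart of this objective is a plain sample average; a minimal sketch (our notation, with the miscoverage indicators precomputed, not the authors' implementation):

```python
import numpy as np

def psi_hat(f_z, miscover, alpha):
    # Empirical analogue of E[f(Z)(1{Y not in C(X)} - alpha) - f(Z)^2]:
    # `f_z` holds the values f(Z_i); `miscover` holds 1{Y_i not in C(X_i)}.
    return np.mean(f_z * (miscover - alpha) - f_z ** 2)
```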
This framework offers two fundamental advantages. First, by optimizing $C \in \mathcal{C}$, MOPI can dynamically "reshape" the prediction set through finding the optimal $h \in \mathcal{H}$ in (3), such as adjusting the aspect ratio of a box or the orientation of an ellipsoid, to match local heteroscedasticity. Second, MOPI is inherently suited for masked conditioning variables. By decoupling the coverage enforcement (inner maximization depending on $Z$) from the predictive mapping (outer minimization depending on $X$), our framework allows information from $Z$ to guide the construction of the prediction set $C(X)$ during the calibration phase without requiring $Z$ at prediction time.

The main contributions of this paper are summarized as follows:

(1) MOPI generalizes the conditional calibration principle of Gibbs et al. (2025) by allowing the prediction set's geometry to co-evolve with the local data density. Moreover, unlike existing conditional conformal methods, which typically require the conditioning variable $Z$ to be a subset of the test covariates (i.e., $Z \subseteq X$), MOPI allows $Z$ to be distinct from $X$. This decoupling enables our framework to enforce valid conditional coverage on sensitive attributes or latent groups that are available during calibration but masked at test time.

(2) We demonstrate that the MOPI objective possesses a rigorous statistical interpretation: under mild expressiveness conditions on the weight function class $\mathcal{F}$, it is equivalent to minimizing the Mean Squared Coverage Error (MSCE), defined as $\mathrm{MSCE}(C) = \mathbb{E}[(\mathbb{P}\{Y \notin C(X) \mid Z\} - \alpha)^2]$. We further establish non-asymptotic oracle inequalities for the MSCE of the empirical MOPI predictor. Our analysis provides specific convergence rates for both finite-dimensional and RKHS function classes, matching the optimal order up to logarithmic factors in the group-conditional coverage case.

(3) We provide a principled guide for choosing the weight function class $\mathcal{F}$ in (2) according to the cardinality of the conditioning domain $\mathcal{Z}$. Under practical choices of $\mathcal{F}$, the inner maximization of the minimax problem admits a closed form, which greatly facilitates the overall optimization process. To solve the empirical problem, we replace the miscoverage indicator with a smooth surrogate, and the corresponding MSCE bound is also established.

(4) We evaluate MOPI on extensive synthetic and real-world datasets, including those with multi-dimensional labels and masked sensitive attributes. The results demonstrate that MOPI produces more compact prediction sets while maintaining robust conditional coverage performance.

1.2 Related works

Several studies have examined how to achieve test-conditional coverage in the asymptotic regime, typically by designing improved nonconformity scores or by reweighting labeled data. For example, Romano et al. (2019) proposed a conformalized quantile regression score that leverages quantile regression to capture local distributional characteristics of the label. Chernozhukov et al. (2021) introduced a nonconformity score based on joint distribution modeling. More recently, Guan (2023) developed a localized conformal prediction approach that assigns similarity-based weights to calibration scores, ensuring finite-sample marginal coverage and asymptotic conditional coverage under mild distributional assumptions. Building on this framework, Hore and Barber (2025) proposed a randomly localized conformal prediction method that offers relaxed local coverage and marginal validity under certain covariate shifts. Further examples and developments can also be found in Györfi and Walk (2019), Sesia and Romano (2021), and Kiyani et al. (2024a).

Another line of work has focused on approximate notions of test-conditional coverage. Lei and Wasserman (2014) proposed partitioning the covariate space into disjoint bins and constructing prediction sets using only calibration points within the same bin, thereby achieving group-conditional coverage within each bin of these partitions. Barber et al. (2021) studied the problem of obtaining approximate conditional coverage over every regular subset of the covariate space. Drawing on the idea of multiaccuracy from algorithmic fairness (Hébert-Johnson et al., 2018), Jung et al. (2023) developed a multivalid conformal prediction method that asymptotically guarantees conditional coverage over any user-specified finite collection of (possibly overlapping) groups. Notably, Gibbs et al. (2025) introduced a conditional calibration (abbreviated as CC) framework via functional quantile regression. For the finite-dimensional case, their method provides a finite-sample group-conditional coverage guarantee and maintains marginal validity under a class of covariate shifts. In the infinite-dimensional setting, they characterized relaxed conditional coverage by bounding the absolute deviation in (2). In contrast, our work establishes an oracle inequality for the MSCE, which offers a more direct quantification of conditional coverage.

For vector-valued responses, recent studies have increasingly moved towards parameterized scoring functions that aim to better approximate non-standard prediction set geometries. Johnstone and Cox (2021) and Messoudi et al. (2022) proposed norm-based scores that restrict prediction regions to box or ellipsoidal shapes. Feldman et al. (2023) employed a latent-space representation learning approach to enable non-convex sets. Thurin et al. (2025) leveraged optimal transport to define a ranking of multivariate scores, thereby allowing prediction regions beyond predefined geometric families. Braun et al. (2025) studied the volume minimization problem over a class of arbitrary norm-based sets.
However, these approaches focus primarily on the design of nonconformity scores within the split conformal prediction framework, rather than adapting the shape of prediction sets during calibration, and they do not explicitly address conditional coverage for shape-adaptive prediction sets.

Minimax formulations for moment-restriction problems have been explored in several related contexts, including solving nonparametric regression problems (Dikkala et al., 2020), the estimation of generalized average causal effects in causal inference (Kallus et al., 2021), off-policy evaluation in reinforcement learning (Shi et al., 2022), and parameter estimation under conditional moment constraints (Bennett and Kallus, 2023). In contrast, these prior works focused on learning point-prediction models. In the context of conformal prediction, Kiyani et al. (2024b) studied length minimization for sublevel sets of a pretrained score while ensuring group-conditional validity, and transformed the constrained optimization problem into a minimax formulation. However, both the motivation and objective in Kiyani et al. (2024b) differ from ours, and the theoretical analysis is fundamentally distinct.

Notations. The following notations will be used throughout the paper. For domains $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$, we use $d_{\mathcal{X}}$, $d_{\mathcal{Y}}$, and $d_{\mathcal{Z}}$, respectively, to denote the corresponding dimensions when they are Euclidean spaces. When $\mathcal{Z}$ is finite, we write $|\mathcal{Z}|$ for its cardinality. For a Euclidean vector $u \in \mathbb{R}^d$, we write $\|u\| = \sqrt{\sum_{j=1}^d u_j^2}$ and $\|u\|_\infty = \max_{1 \leq j \leq d} |u_j|$. For a function $g : \mathcal{X} \times \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$, we denote the $L_2$-norm as $\|g\|_{L_2} = \sqrt{\mathbb{E}[g^2(X, Y, Z)]}$. For two positive sequences $a_n$ and $b_n$, $n \geq 1$, we write $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n \leq C b_n$ holds for sufficiently large $n$.

2 Minimax Optimization for Conditional Coverage

2.1 Problem setup

Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a data pair drawn from an unknown distribution, and let $C(\cdot) : \mathcal{X} \to 2^{\mathcal{Y}}$ denote a measurable set-valued function that assigns to each covariate $X$ a prediction set $C(X) \subseteq \mathcal{Y}$. In particular, given a conditioning variable $Z \in \mathcal{Z}$, our goal is to find a prediction set $C(X)$ satisfying the conditional coverage property in (1). The conditioning variable $Z$ typically depends on $(X, Y)$ and determines the scope of coverage enforcement, representing various forms of contextual, structural, or latent information about the data. Below, we provide several representative examples with different types of $Z$.

Example 2.1 (Test-conditional coverage). The most popular form of conditional coverage validity arises when $Z = X$, which is also referred to as test-conditional coverage in the conformal prediction literature (Angelopoulos et al., 2024). In this case, the target (1) reduces to $\mathbb{P}\{Y \in C(X) \mid X = x\} = 1-\alpha$ for all $x \in \mathcal{X}$. Another practical case is that $Z$ is a part of the covariate $X$, representing special features of interest.

Example 2.2 (Group-conditional coverage). Let $\{G_1, \ldots, G_K\}$ be a finite collection of groups, where each $G_k \subseteq \mathcal{X}$ defines a subset of the covariate space. The groups can be disjoint or overlapping. The goal of group-conditional coverage (Jung et al., 2023; Gibbs et al., 2025) is to ensure coverage within each group, $\mathbb{P}\{Y \in C(X) \mid X \in G_k\} = 1-\alpha$ for all $1 \leq k \leq K$, which is a special case of the target (1) with $Z = (\mathbb{1}\{X \in G_1\}, \ldots, \mathbb{1}\{X \in G_K\})^\top$.
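As an illustration of how such a conditioning variable is formed (a hypothetical helper, not from the paper), the group indicator vector for interval groups can be built as:

```python
import numpy as np

def group_indicator(x, groups):
    # Z = (1{X in G_1}, ..., 1{X in G_K}) for interval groups G_k = [lo_k, hi_k).
    return np.stack([(x >= lo) & (x < hi) for lo, hi in groups], axis=1).astype(float)

# Example: five disjoint bins of [0, 5), as in the group-conditional simulation.
groups = [(k, k + 1) for k in range(5)]
Z = group_indicator(np.random.default_rng(0).uniform(0, 5, 8), groups)
```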
Example 2.3 (Equalized coverage on sensitive attributes). In the context of fairness, the conditioning variable $Z$ can represent sensitive attributes, and we require coverage parity of the prediction set $C(X)$ across different values of the sensitive attribute $Z$ (Romano et al., 2020). In particular, the target (1) allows the case where the variable $Z$ is masked in the test data (the prediction set depends only on the covariate $X$), which aligns with the requirements of the algorithmic fairness literature (Zafar et al., 2017; Zhang et al., 2018).

2.2 A minimax optimization framework

Given a set-valued function $C$, recall from (2) that the conditional coverage (1) is equivalent to requiring $\Phi(C, f) := \mathbb{E}[f(Z)(\mathbb{1}\{Y \notin C(X)\} - \alpha)] = 0$ for all measurable $f$. Then, when exact conditional coverage holds and the weight function class $\mathcal{F}$ is symmetric, we have $\max_{f \in \mathcal{F}} \Phi(C, f) = 0$. This motivates us to construct the desired prediction set by solving the following minimax problem:
$$\min_{C \in \mathcal{C}} \max_{f \in \mathcal{F}} \mathbb{E}\big[f(Z)\big(\mathbb{1}\{Y \notin C(X)\} - \alpha\big)\big],$$
where $\mathcal{C}$ is the general class of set-valued functions defined in (3). To make the optimization problem well-defined, the inner maximization domain (i.e., the function class $\mathcal{F}$) must typically be bounded (Nemirovski, 2004). To remove the dependence on the scale of $f$, we may add an $L_2$-norm constraint $\|f\|_{L_2} \leq 1$ when taking the maximum of $\Phi(C, f)$ over $f \in \mathcal{F}$. To further facilitate optimization, we consider the following $L_2$-penalized objective:
$$\Psi(C, f) := \mathbb{E}\big[f(Z)\big(\mathbb{1}\{Y \notin C(X)\} - \alpha\big) - f^2(Z)\big], \qquad (4)$$
and define the minimax optimization predictive inference (MOPI) set as the solution
$$C^* = \arg\min_{C \in \mathcal{C}} \max_{f \in \mathcal{F}} \Psi(C, f). \qquad (5)$$
The $L_2$ penalty makes the objective $\Psi(C, f)$ strongly concave in $f$, and the corresponding minimax problem is easier to tackle; see Lin et al. (2020) and Rafique et al. (2022).

Remark 2.1. In previous works such as Dikkala et al. (2020), the objective function includes a tuning parameter $\lambda > 0$ for the $L_2$ penalty, taking the form $\Psi_\lambda(C, f) := \mathbb{E}[f(Z)(\mathbb{1}\{Y \notin C(X)\} - \alpha) - \lambda f^2(Z)]$ in our case. In fact, under the mild condition that $\lambda f \in \mathcal{F}$ for all $f \in \mathcal{F}$, the choice of $\lambda$ does not affect the resulting optimizer. This can be seen from the relation
$$\lambda \max_{f \in \mathcal{F}} \Psi_\lambda(C, f) = \max_{f \in \mathcal{F}} \mathbb{E}\big[\lambda f(Z)\big(\mathbb{1}\{Y \notin C(X)\} - \alpha\big) - \lambda^2 f^2(Z)\big] = \max_{f \in \mathcal{F}} \Psi(C, f).$$
This allows us to simply set $\lambda = 1$ without additional tuning. See Appendix A.1 for a more detailed discussion. □

2.2.1 Mean squared coverage error of the minimax solution

Given any set-valued function $C$, we denote by $\alpha(Z; C) := \mathbb{P}\{Y \notin C(X) \mid Z\}$ the conditional miscoverage probability and recall the definition of the MSCE:
$$\mathrm{MSCE}(C) = \mathbb{E}\big[\big(\alpha(Z; C) - \alpha\big)^2\big].$$
This metric quantifies the average conditional coverage error.
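When $Z$ takes finitely many values, the MSCE has a direct plug-in estimate; the sketch below (our illustration, assuming integer-coded $Z$ and a boolean coverage array) averages the squared gap between each bin's miscoverage rate and $\alpha$, weighted by the bin frequencies.

```python
import numpy as np

def empirical_msce(covered, z_bins, alpha):
    # Plug-in estimate of MSCE(C) = E[(P{Y not in C(X) | Z} - alpha)^2]
    # for a discrete Z: sum_z P_hat(Z=z) * (miscoverage_rate(z) - alpha)^2.
    msce = 0.0
    for z in np.unique(z_bins):
        idx = z_bins == z
        gap = (1.0 - covered[idx].mean()) - alpha
        msce += idx.mean() * gap ** 2
    return msce
```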
Next, we establish the theoretical connection between the minimax objective (4) and $\mathrm{MSCE}(C)$, providing a rigorous justification for the statistical optimality of the MOPI framework. In fact, note that the conditional coverage (1) can also be pursued by minimizing the oracle least-squares problem
$$C_{\mathrm{ora}} = \arg\min_{C \in \mathcal{C}} \mathrm{MSCE}(C). \qquad (6)$$
The term "oracle" indicates that the conditional miscoverage probability $\alpha(Z; C)$ cannot be computed if the conditional distribution is unknown. The oracle set-valued function $C_{\mathrm{ora}}$ is the optimal element within the class $\mathcal{C}$, achieving the smallest MSCE, and $\mathrm{MSCE}(C_{\mathrm{ora}})$ can be interpreted as the approximation error of the class $\mathcal{C}$. The next proposition connects the MSCEs of the minimax solution $C^*$ in (5) and the oracle solution $C_{\mathrm{ora}}$.

Proposition 2.1. If $\alpha(\cdot; C) - \alpha \in \mathcal{F}$ for any $C \in \mathcal{C}$, we have $\max_{f \in \mathcal{F}} \Psi(C, f) = \mathrm{MSCE}(C)/4$ for any $C \in \mathcal{C}$, and $\mathrm{MSCE}(C^*) = \mathrm{MSCE}(C_{\mathrm{ora}})$.

This result implies that if $\mathcal{F}$ is expressive enough to include the miscoverage error function $\alpha(\cdot; C) - \alpha$, then solving the minimax problem (5) is equivalent to solving (6). Proposition 2.1 also sheds light on a principled way to choose the weight function class $\mathcal{F}$ based on the domain of the conditioning variable $Z$.

• Finite-dimensional $\mathcal{F}$. When the domain $\mathcal{Z}$ contains only finitely many elements, i.e., $|\mathcal{Z}| < \infty$, the function $\alpha(\cdot; C)$ is a $|\mathcal{Z}|$-dimensional function. To meet the condition of Proposition 2.1, we can take the parametric function class $\mathcal{F} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}, z \in \mathcal{Z}\}$. For the group-conditional coverage in Example 2.2, it can be seen that $|\mathcal{Z}| = K$ for disjoint groups and $|\mathcal{Z}| \leq 2^K$ for overlapping groups. In Example 2.3, the parametric class $\mathcal{F}$ remains applicable when the sensitive attribute $Z$ is a categorical variable.

• Infinite-dimensional $\mathcal{F}$. When the domain $\mathcal{Z}$ contains infinitely many points, e.g., $\mathcal{Z}$ is a subset of Euclidean space, we can choose $\mathcal{F}$ as a reproducing kernel Hilbert space (RKHS). This choice is suitable for the test-conditional coverage in Example 2.1 and the continuous sensitive variable in Example 2.3.

By adjusting the weight function class $\mathcal{F}$ to the conditioning variable $Z$, the MOPI framework can be tailored to enforce local or groupwise coverage, as well as fairness constraints.

2.2.2 Structured set-valued function class

The structured set-valued function class $\mathcal{C}$ in (3) provides a flexible and expressive formulation of prediction sets, and can meet most downstream requirements. We now present two common examples of $\mathcal{C}$.

Sublevel set of a pretrained score function. This class is widely used in the conformal prediction literature (Vovk et al., 2005; Lei et al., 2018). Given a pretrained score function $s(x, y)$, the sublevel prediction set takes the form
$$\mathcal{C}_{\mathrm{sublevel}} = \big\{C(x; h) = \{y \in \mathcal{Y} : s(x, y) \leq h(x)\} : h \in \mathcal{H}\big\}, \qquad (7)$$
where $h : \mathcal{X} \to \mathbb{R}$ is a one-dimensional threshold function. Letting $T(h(x), y) = s(x, y) - h(x)$ in (3) shows that $\mathcal{C}_{\mathrm{sublevel}}$ is a special case. In this formulation, the pretrained function $s$ fixes the geometric structure of $C(x; h)$, leaving only the threshold $h(x)$ to be optimized.
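In code, membership in $C(x; h)$ is just a sign check on $T$ (a hypothetical one-line helper for illustration):

```python
def sublevel_member(s_xy, h_x):
    # y is in C(x; h) = {y : s(x, y) <= h(x)} iff T(h(x), y) = s(x, y) - h(x) <= 0.
    return s_xy - h_x <= 0.0
```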
Remark 2.2. For test-conditional coverage with $Z = X$, the population quantile regression in the CC method sets $\mathcal{H} = \mathcal{F}$ in (7), yielding the prediction set $C_{\mathrm{qr}}(X) := \{y \in \mathcal{Y} : s(X, y) \leq f_{\mathrm{qr}}(X)\}$. Here, $f_{\mathrm{qr}} = \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell_\alpha(s(X, Y), f(X))]$ is the minimizer of the expected pinball loss. In addition, let $q(x) = \inf\{t \in \mathbb{R} : \mathbb{P}(s(X, Y) \leq t \mid X = x) \geq 1-\alpha\}$ be the ground-truth conditional quantile function. If $q \in \mathcal{F}$, the prediction set $C_{\mathrm{qr}}(X)$ achieves zero MSCE, as does the minimax solution $C^*$ under the condition of Proposition 2.1. If $q \notin \mathcal{F}$, the first conclusion of Proposition 2.1 guarantees that $C^*(X)$ achieves lower MSCE than $C_{\mathrm{qr}}(X)$. Hence, MOPI exhibits better performance in terms of MSCE. We defer more theoretical discussions to Appendix B. □

Geometrically structured prediction sets. In some downstream tasks, e.g., robust optimization problems (Ben-Tal et al., 2009; Johnstone and Cox, 2021), the geometric structure of prediction sets in the multi-dimensional label space $\mathcal{Y} \subseteq \mathbb{R}^{d_{\mathcal{Y}}}$ is critical to decision making. Let us consider the following two commonly used set classes:

(i) Let $h = (\mu, \sigma)$ with mean function $\mu : \mathcal{X} \to \mathbb{R}^{d_{\mathcal{Y}}}$ and variance function $\sigma : \mathcal{X} \to \mathbb{R}^{d_{\mathcal{Y}}}$; the box set-valued function class takes the form
$$\mathcal{C}_{\mathrm{box}} = \big\{ C(x; h) = \{y \in \mathbb{R}^{d_{\mathcal{Y}}} : \|(y - \mu(x))/\sigma(x)\|_\infty \leq 1\} : h \in \mathcal{H} \big\},$$
where "/" denotes elementwise division of vectors. By setting $T(h(x), y) = \|(y - \mu(x))/\sigma(x)\|_\infty - 1$ in (3), we recover the box class $\mathcal{C}_{\mathrm{box}}$.

(ii) Let $h = (\mu, \Sigma)$ with mean function $\mu : \mathcal{X} \to \mathbb{R}^{d_{\mathcal{Y}}}$ and covariance function $\Sigma : \mathcal{X} \to \mathbb{R}^{d_{\mathcal{Y}} \times d_{\mathcal{Y}}}$; the ellipsoidal set-valued function class takes the form
$$\mathcal{C}_{\mathrm{ell}} = \big\{ C(x; h) = \{y \in \mathbb{R}^{d_{\mathcal{Y}}} : (y - \mu(x))^\top \Sigma^{-1}(x)(y - \mu(x)) \leq 1\} : h \in \mathcal{H} \big\}.$$
This corresponds to the choice $T(h(x), y) = (y - \mu(x))^\top \Sigma^{-1}(x)(y - \mu(x)) - 1$ in (3).

Remark 2.3. It is also possible to construct a sublevel prediction set of the form (7) with the geometric structures described above using a pretrained score. For example, Johnstone and Cox (2021) and Sun et al. (2023) built an ellipsoidal sublevel set via the score $s(x, y) = (y - \mu_0(x))^\top \Sigma_0^{-1} (y - \mu_0(x))$, where $\mu_0(x)$ is a regression model and $\Sigma_0$ is the covariance matrix obtained on a separate dataset. However, given such a score function, the CC method cannot further adjust the covariance matrix with respect to the conditioning variable. To embed the pretrained information into a more flexible structure, we may consider setting $T(h(x), y) = \tilde{y}^\top \Sigma^{-1}(x) \tilde{y} - 1$ in (3), where $\tilde{y} = \Sigma_0^{-1/2}(y - \mu_0(x))$ and $h(x) = \Sigma(x)$. This enables MOPI to adaptively refine the shape model $\Sigma(x)$ during calibration, leading to more accurate conditional coverage. □
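The corresponding membership checks are equally direct; a minimal sketch (our illustration) for single points $y \in \mathbb{R}^{d_{\mathcal{Y}}}$:

```python
import numpy as np

def in_box(y, mu, sigma):
    # C_box: ||(y - mu(x)) / sigma(x)||_inf <= 1 (elementwise division).
    return np.max(np.abs((y - mu) / sigma)) <= 1.0

def in_ellipsoid(y, mu, Sigma):
    # C_ell: (y - mu(x))^T Sigma(x)^{-1} (y - mu(x)) <= 1.
    d = y - mu
    return float(d @ np.linalg.solve(Sigma, d)) <= 1.0
```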
3 Empirical Minimax Optimization Problem

Suppose we have i.i.d. labeled data $\mathcal{D}_n := \{(X_i, Y_i, Z_i)\}_{i=1}^n$. The goal of this section is to construct MOPI sets by solving the empirical version of the minimax problem (5).

3.1 Empirical minimax problem

Formally, we consider the following sample-average approximation of the population objective function $\Psi(C, f)$ defined in (4):
$$\widehat{\Psi}(C, f) := \frac{1}{n} \sum_{i=1}^n \big[f(Z_i)\big(\mathbb{1}\{Y_i \notin C(X_i)\} - \alpha\big) - f^2(Z_i)\big].$$
We assume the set-valued function class $\mathcal{C}$ is finite-dimensional, and defer the infinite-dimensional case to Appendix D. Suppose the weight function class $\mathcal{F}$ is equipped with a norm $\|\cdot\|_{\mathcal{F}}$, which measures the complexity of $\mathcal{F}$. Accordingly, we output the empirical MOPI prediction set by solving the following minimax optimization problem:
$$\widehat{C} = \arg\min_{C \in \mathcal{C}} \max_{f \in \mathcal{F}} \big\{\widehat{\Psi}(C, f) - \gamma \|f\|_{\mathcal{F}}^2\big\}, \qquad (8)$$
where $\gamma \geq 0$ acts as a regularization parameter when $\mathcal{F}$ is an infinite-dimensional function class. We can simply set $\gamma = 0$ when $\mathcal{F}$ is finite-dimensional.

In Section 2.2.1, we discussed how to choose the weight function class $\mathcal{F}$ according to the cardinality of the conditioning domain $\mathcal{Z}$. Now, we show that the inner maximization of the empirical problem (8) admits a closed form under two practical choices of $\mathcal{F}$.

Lemma 3.1. Let $\mathcal{F} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}, z \in \mathcal{Z}\}$ when $|\mathcal{Z}| < \infty$. The inner maximization in problem (8) with $\gamma = 0$ is
$$\max_{f \in \mathcal{F}} \widehat{\Psi}(C, f) = \sum_{z \in \mathcal{Z}} \frac{\big(\sum_{i=1}^n \mathbb{1}\{Z_i = z\}(\mathbb{1}\{Y_i \notin C(X_i)\} - \alpha)\big)^2}{4n \sum_{i=1}^n \mathbb{1}\{Z_i = z\}}.$$

Lemma 3.2. Let $\mathcal{F}$ be an RKHS equipped with a kernel function $K(\cdot, \cdot) : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$. Denote $\varphi_n(C) = (\varphi_1(C), \ldots, \varphi_n(C))^\top$, where $\varphi_i(C) = \frac{1}{n}(\mathbb{1}\{Y_i \notin C(X_i)\} - \alpha)$ for $i = 1, \ldots, n$, and let $K_n = (K(Z_i, Z_j))_{i,j=1}^n$ be the empirical kernel matrix and $I_n$ the $n \times n$ identity matrix. The inner maximization in problem (8) is
$$\max_{f \in \mathcal{F}} \big\{\widehat{\Psi}(C, f) - \gamma \|f\|_{\mathcal{F}}^2\big\} = \frac{1}{4} \varphi_n(C)^\top K_n \Big(\frac{1}{n} K_n + \gamma I_n\Big)^{-1} \varphi_n(C).$$

Using the two lemmas above, we transform the corresponding empirical minimax problems (8) into empirical risk minimization problems.
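Both closed forms translate directly into a few lines of numpy (a sketch under our own naming, not the authors' implementation, with `miscover` holding the residuals $\mathbb{1}\{Y_i \notin C(X_i)\} - \alpha$):

```python
import numpy as np

def inner_max_finite(miscover, z_labels):
    # Lemma 3.1 closed form (gamma = 0, F spanned by the indicators 1{Z = z}):
    # sum_z (sum_i 1{Z_i=z} m_i)^2 / (4 n sum_i 1{Z_i=z}).
    n = len(miscover)
    val = 0.0
    for z in np.unique(z_labels):
        idx = z_labels == z
        val += miscover[idx].sum() ** 2 / (4.0 * n * idx.sum())
    return val

def inner_max_rkhs(miscover, K, gamma):
    # Lemma 3.2 closed form for an RKHS with kernel matrix K = (K(Z_i, Z_j)):
    # (1/4) phi^T K ((1/n) K + gamma I)^{-1} phi, with phi_i = m_i / n.
    # gamma > 0 is assumed so the linear system is well-posed.
    n = len(miscover)
    phi = miscover / n
    return 0.25 * phi @ K @ np.linalg.solve(K / n + gamma * np.eye(n), phi)
```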
3.2 Implementation of MOPI by smoothing

Since the objective function $\widehat{\Psi}(C, f)$ contains the non-differentiable miscoverage indicator $\mathbb{1}\{Y_i \notin C(X_i)\} = \mathbb{1}\{T(h(X_i), Y_i) > 0\}$ for $C(X_i) = \{y \in \mathcal{Y} : T(h(X_i), y) \leq 0\}$ as defined in (3), direct optimization is infeasible. To overcome this non-differentiability, we can replace the indicator $\mathbb{1}\{u > 0\}$ with a smooth surrogate, such as the sigmoid function $\tilde{\mathbb{1}}\{u > 0\} = (1 + \exp(-u/r))^{-1}$ or the Gaussian error function $\tilde{\mathbb{1}}\{u > 0\} = \frac{1}{2}\big(1 + \mathrm{erf}\big(u/\sqrt{2}r\big)\big)$, where $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt$ and $r > 0$ is the smoothing parameter. This smoothing strategy has been widely used in recent optimization works on conformal prediction; see, e.g., Bai et al. (2022), Kiyani et al. (2024b), and Wu et al. (2025).

Denote the smoothed miscoverage indicator as $\tilde{\mathbb{1}}\{Y \notin C(X)\} := \tilde{\mathbb{1}}\{T(h(X), Y) > 0\}$. Then we have the following smoothed objective function:
$$\widetilde{\Psi}(C, f) := \frac{1}{n} \sum_{i=1}^n \big[f(Z_i)\big(\tilde{\mathbb{1}}\{Y_i \notin C(X_i)\} - \alpha\big) - f^2(Z_i)\big].$$
Accordingly, the smoothed empirical MOPI prediction set can be obtained by solving the minimax problem
$$\widetilde{C} := \arg\min_{C \in \mathcal{C}} \max_{f \in \mathcal{F}} \big\{\widetilde{\Psi}(C, f) - \gamma \|f\|_{\mathcal{F}}^2\big\}. \qquad (9)$$
Similar to Lemmas 3.1 and 3.2, we can obtain the closed form of the inner maximization $\max_{f \in \mathcal{F}} \{\widetilde{\Psi}(C, f) - \gamma \|f\|_{\mathcal{F}}^2\}$ by replacing the miscoverage indicators. We can then solve the resulting empirical risk minimization problem through gradient-based methods, such as stochastic gradient descent. In Appendix B.2, we discuss and compare the computational efficiency of MOPI via the smoothed minimax optimization (9) and the quantile regression in the CC method (Gibbs et al., 2025).
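To illustrate how the pieces combine, the following PyTorch sketch minimizes the smoothed closed-form objective of Lemma 3.1 for sublevel sets with a group-wise threshold $h(x) = \beta^\top z$ and the sigmoid surrogate. This is a minimal illustration under our own assumptions (optimizer, learning rate, and step count are arbitrary; inputs are float tensors and every group is assumed non-empty), not the authors' implementation.

```python
import torch

def mopi_finite_groups(scores, z_onehot, alpha, r=0.1, steps=500, lr=0.05):
    # Smoothed empirical MOPI for C(x; h) = {y : s(x, y) <= h(x)} with
    # h(x) = beta^T z, combining the Lemma 3.1 closed form of the inner max
    # with a sigmoid surrogate for the miscoverage indicator.
    n, K = z_onehot.shape
    beta = torch.zeros(K, requires_grad=True)
    opt = torch.optim.Adam([beta], lr=lr)
    counts = z_onehot.sum(0)                      # group sizes (assumed > 0)
    for _ in range(steps):
        h = z_onehot @ beta
        m = torch.sigmoid((scores - h) / r) - alpha   # smoothed 1{Y not in C} - alpha
        per_group = (z_onehot * m[:, None]).sum(0)
        loss = (per_group ** 2 / (4 * n * counts)).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return beta.detach()
```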
4 Theoretical Results

In this section, we first establish an upper bound for the MSCE of the empirical MOPI set (8), and then derive specific convergence rates for the examples discussed in Section 2.1. After that, we analyze the MSCE of the MOPI set obtained by the smoothed problem (9).

4.1 Oracle inequality for MOPI

We begin with the following two assumptions on the weight function class $\mathcal{F}$.

Assumption 1. There exists a constant $U > 0$ such that $\sup_{f \in \mathcal{F}_U} \sup_{z \in \mathcal{Z}} |f(z)| \leq 1$, where $\mathcal{F}_U := \{f \in \mathcal{F} : \|f\|_{\mathcal{F}}^2 \leq U\}$. In addition, $\mathcal{F}_U$ is star-shaped around zero, i.e., $tf \in \mathcal{F}_U$ whenever $f \in \mathcal{F}_U$, for any $t \in [0, 1]$.

Assumption 2. For any $C \in \mathcal{C}$, the projection error of $f_C$ satisfies $\mathbb{E}[(f_C(Z) - \{\alpha(Z; C) - \alpha(Z; C_{\mathrm{ora}})\})^2] \leq \eta^2$, where $f_C := \arg\min_{f \in \mathcal{F}_U} \mathbb{E}[(f(Z) - \{\alpha(Z; C) - \alpha(Z; C_{\mathrm{ora}})\})^2]$ and $C_{\mathrm{ora}}$ is defined in (6).

In Assumption 1, $\mathcal{F}_U$ is a ball of radius $U$ in the function class $\mathcal{F}$, and the conditions are quite mild and can be satisfied by both a parametric function class and an RKHS. In Assumption 2, the function $f_C(\cdot)$ is a projection of the conditional miscoverage probability difference $\alpha(\cdot; C) - \alpha(\cdot; C_{\mathrm{ora}})$ onto the function class $\mathcal{F}$, and $\eta$ is an upper bound for the projection error. If the assumption of Proposition 2.1 holds, that is, $\alpha(\cdot; C) - \alpha \in \mathcal{F}$ for any $C \in \mathcal{C}$, we have $\eta = 0$; see the corollaries in Section 4.3.

To characterize the excess risk of the empirical objective relative to the population objective, we introduce the following function class:
$$\mathcal{G} := \big\{(x, y, z) \mapsto f_C(z)\big[\mathbb{1}\{y \notin C(x)\} - \mathbb{1}\{y \notin C_{\mathrm{ora}}(x)\}\big] : C \in \mathcal{C}\big\},$$
which depends on the set-valued class $\mathcal{C}$. Given $\delta > 0$, define the localized Rademacher complexity (Bartlett et al., 2005) as $\mathcal{R}_n(\delta; \mathcal{G}) := \mathbb{E}\big[\sup_{g \in \mathcal{G}, \|g\|_{L_2} \leq \delta} \big|n^{-1} \sum_{i=1}^n \varepsilon_i g(X_i, Y_i, Z_i)\big|\big]$, where $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. Rademacher random variables taking values in $\{-1, +1\}$ with equal probability. The critical radius of $\mathcal{G}$ is any positive solution to $\mathcal{R}_n(\delta; \mathcal{G}) \leq \delta^2$; the critical radius of $\mathcal{F}_U$ is defined analogously via $\mathcal{R}_n(\delta; \mathcal{F}_U) \leq \delta^2$. The next theorem gives a non-asymptotic upper bound for the MSCE of the empirical MOPI set in (8).

Theorem 4.1 (Oracle inequality). Suppose Assumptions 1-2 hold. Let $\delta_{n, \mathcal{F}_U}$ and $\delta_{n, \mathcal{G}}$ be the critical radii of $\mathcal{F}_U$ and $\mathcal{G}$, respectively. (i) For finite-dimensional $\mathcal{F}$, we choose the hyperparameter $\gamma = 0$ in (8); (ii) for infinite-dimensional $\mathcal{F}$, we choose the hyperparameter $\gamma$ in (8) such that $\gamma \|f\|_{\mathcal{F}}^2 \leq \|f\|_{L_2}^2$. For any $\zeta \in (0, 1)$, if
$$\Big(\delta_{n, \mathcal{F}_U} + \sqrt{\tfrac{\log(1/\zeta)}{n}}\Big) \|f\|_{\mathcal{F}} \lesssim \sqrt{U} \|f\|_{L_2}$$
holds for all $f \in \mathcal{F}$, then with probability at least $1 - 4\zeta$, we have
$$\mathbb{E}\big[\big(\alpha(Z; \widehat{C}) - \alpha\big)^2 \mid \mathcal{D}_n\big] \lesssim \mathrm{MSCE}(C_{\mathrm{ora}}) + \eta^2 + \delta_{n, \mathcal{F}_U}^2 + \delta_{n, \mathcal{G}}^2 + \frac{\log(\log_2(1/\delta_{n, \mathcal{G}})/\zeta)}{n}.$$

In general, the approximation error $\mathrm{MSCE}(C_{\mathrm{ora}})$ is small if the capacity of $\mathcal{C}$ is large enough. Moreover, the projection error $\eta$ can be negligible when $\mathcal{F}$ is expressive enough. The critical radii $\delta_{n, \mathcal{F}_U}$ and $\delta_{n, \mathcal{G}}$ decrease as the sample size $n$ grows, and their scales also depend on the complexities of $\mathcal{F}$ and $\mathcal{C}$.

4.2 Upper bound for the approximation error

Next, we investigate how the relationship between the covariate $X$ and the conditioning variable $Z$ influences the approximation error.

Assumption 3. Suppose $\mathcal{H}$ is a class of functions that map from $\mathcal{X}$ to $\mathbb{R}^m$ for some $m \geq 1$.
(i) There exist measurable functions $w(\cdot)$ and $l(\cdot, \cdot)$ such that for any $u_1, u_2 \in \mathbb{R}^m$,
$$\big|\mathbb{P}\{T(u_1, Y) \leq 0 \mid w(X), Z\} - \mathbb{P}\{T(u_2, Y) \leq 0 \mid w(X), Z\}\big| \leq l(w(X), Z)\,\|u_1 - u_2\|,$$
where $\mathbb{E}[l^2(w(X), Z)] \leq \kappa^2$ for some constant $\kappa > 0$.
(ii) There exists a measurable function $H_0(\cdot, \cdot)$ such that $\mathbb{P}\{T(H_0(w(X), Z), Y) \leq 0 \mid w(X), Z\} = 1-\alpha$.
(iii) The conditional expectation $h_0(w(X)) := \mathbb{E}[H_0(w(X), Z) \mid w(X)]$ belongs to $\mathcal{H}$.

Assumption 3(i) requires Lipschitz continuity of the conditional miscoverage probability with respect to the geometric parameter, which is a standard regularity condition. Assumption 3(ii) and (iii) impose the realizability condition for the conditional coverage.

Let us consider the group-conditional coverage in Example 2.2 and assume $\mathcal{C}$ is the sublevel prediction set (7) of the pretrained score $s$. Recalling that $Z = (\mathbb{1}\{X \in G_1\}, \ldots, \mathbb{1}\{X \in G_K\})^\top$, we take $w(X) = Z$ in this case. Let $H_0(w(X), Z) = \sum_{z \in \mathcal{Z}} q_z \mathbb{1}\{Z = z\}$, where $q_z = \inf\{t : \mathbb{P}(s(X, Y) \leq t \mid Z = z) \geq 1-\alpha\}$ is the $(1-\alpha)$ conditional quantile given $Z = z$. Then we can guarantee $h_0 \in \mathcal{H}$ by choosing $\mathcal{H} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}, z \in \mathcal{Z}\}$, since $h_0(w(X)) = H_0(w(X), Z)$.

Theorem 4.2 (Approximation error). Let $\rho^2 = \frac{\mathrm{Tr}(\mathrm{Cov}\{h_0(w(X))\})}{\mathrm{Tr}(\mathrm{Cov}\{H_0(w(X), Z)\})}$. Under Assumption 3, we have
$$\mathrm{MSCE}(C_{\mathrm{ora}}) \leq \kappa^2 \big(1 - \rho^2\big)\, \mathrm{Tr}\big(\mathrm{Cov}\{H_0(w(X), Z)\}\big).$$
In particular, $\mathrm{MSCE}(C_{\mathrm{ora}}) = 0$ if $Z = w(X)$ for some measurable function $w(\cdot)$.

As shown in Theorem 4.2, the approximation error is largely determined by the proportion of the variance of $H_0$ captured by $h_0$, which is quantified by $\rho$. When $h_0$ captures a large proportion of the variance, it can approximately achieve conditional coverage. Specifically, if $Z$ is a function of $X$, then $h_0$ has the same variance as $H_0$, leading to perfect conditional coverage, i.e., $\mathrm{MSCE}(C_{\mathrm{ora}}) = 0$. This case corresponds to the test-conditional coverage with $Z = X$ in Example 2.1 and the group-conditional coverage in Example 2.2. For Example 2.3, the approximation error depends on the degree of correlation between the covariate $X$ and the sensitive attribute $Z$.

4.3 Convergence rates of MSCE

In what follows, we derive specific convergence rates of the MSCE under the following Vapnik-Chervonenkis (VC) complexity assumption on the set-valued class $\mathcal{C}$, which is satisfied by basic linear models (Wainwright, 2019) and deep neural networks (Bartlett et al., 2019). For an infinite-dimensional class $\mathcal{C}$, since VC-based arguments are no longer applicable, we establish the corresponding convergence rate using a different technique; the details are deferred to Appendix D.

Assumption 4. The function class $\{(x, y) \mapsto T(h(x), y) : h \in \mathcal{H}\}$ has VC dimension $d_{\mathcal{C}}$.

4.3.1 Convergence rate when $\mathcal{F}$ is finite-dimensional

In the following analysis, we upper bound the MSCE when the domain $\mathcal{Z}$ is finite. In this case, one can consider the parametric function class $\mathcal{F}$, which naturally satisfies Assumption 1. In addition, given any $C \in \mathcal{C}$, the function $\alpha(z; C)$ takes at most $|\mathcal{Z}|$ distinct values. Consequently, $\alpha(z; C) \in \mathcal{F}$ and we have $\eta = 0$ in Assumption 2.

Theorem 4.3. Suppose $|\mathcal{Z}| < \infty$ and Assumption 4 holds for the set-valued class $\mathcal{C}$ in (3). Choosing $\mathcal{F} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}, \forall z \in \mathcal{Z}\}$ and setting $\gamma = 0$ in the optimization problem (8), we have
$$\mathbb{E}\big[\big(\alpha(Z; \widehat{C}) - \alpha\big)^2\big] \lesssim \mathrm{MSCE}(C_{\mathrm{ora}}) + \frac{d_{\mathcal{C}} + |\mathcal{Z}|}{n} \log\Big(\frac{n}{d_{\mathcal{C}} + |\mathcal{Z}|}\Big).$$

In particular, we can obtain the group-conditional coverage guarantee of Example 2.2 based on Theorem 4.3. Let $\{G_1, \ldots, G_K\}$ be a collection of groups in $\mathcal{X}$, and recall the conditioning variable $Z = (\mathbb{1}\{X \in G_1\}, \ldots, \mathbb{1}\{X \in G_K\})^\top$. This choice of $Z$ implies that $\rho = 1$ in Theorem 4.2, and hence $\mathrm{MSCE}(C_{\mathrm{ora}}) = 0$. By choosing the function class $\mathcal{H} = \mathcal{F}$ in the set-valued class $\mathcal{C}$, we have the following corollary.
Corollary 4.1. Considering Example 2.2, let $Z = (\mathbb{1}\{X \in G_1\}, \ldots, \mathbb{1}\{X \in G_K\})^\top$ and choose $\mathcal{H} = \mathcal{F} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}, \forall z \in \mathcal{Z}\}$. Suppose conditions (i) and (ii) of Assumption 3 hold with $w(X) = Z$; then for any $k \in [K]$,
$$\Big|\mathbb{P}\big\{Y \in \widehat{C}(X) \mid X \in G_k\big\} - (1-\alpha)\Big| \lesssim \sqrt{\frac{d_{\mathcal{C}} + |\mathcal{Z}|}{n \cdot \mathbb{P}(X \in G_k)} \log\Big(\frac{n}{d_{\mathcal{C}} + |\mathcal{Z}|}\Big)}.$$

For the sublevel set class (7) under the choice of $\mathcal{H}$ in Corollary 4.1, we have $d_{\mathcal{C}} = O(|\mathcal{Z}|)$. The upper bound thus significantly improves the dependence on the sample size from $n^{-1/4}$ in Jung et al. (2023) to $n^{-1/2}$, and matches the optimal convergence rate in Areces et al. (2024) up to a logarithmic factor.

Next, we provide an equalized coverage guarantee for Example 2.3.

Corollary 4.2. Considering Example 2.3 with $|\mathcal{Z}| < \infty$, suppose conditions (i)-(iii) of Assumption 3 hold for some $w(X)$; then for any $z \in \mathcal{Z}$,
$$\Big|\mathbb{P}\big\{Y \in \widehat{C}(X) \mid Z = z\big\} - (1-\alpha)\Big| \lesssim \sqrt{\frac{1 - \rho^2}{\mathbb{P}(Z = z)}} + \sqrt{\frac{d_{\mathcal{C}} + |\mathcal{Z}|}{n \cdot \mathbb{P}(Z = z)} \log\Big(\frac{n}{d_{\mathcal{C}} + |\mathcal{Z}|}\Big)},$$
where $\rho^2$ is defined in Theorem 4.2.

4.3.2 Convergence rate when $\mathcal{F}$ is an RKHS

We now establish upper bounds for the MSCE when the domain $\mathcal{Z}$ is infinite. For simplicity, let us consider a bounded Euclidean domain $\mathcal{Z} = [0, 1]^{d_{\mathcal{Z}}}$, where $d_{\mathcal{Z}}$ is the dimension. In this case, we choose $\mathcal{F}$ as the RKHS equipped with the Gaussian kernel $K(z, z') = \exp\big(-\frac{1}{2\vartheta^2}\|z - z'\|^2\big)$ with bandwidth $\vartheta$. Then Assumption 1 holds with $U = 1$, since $\sup_{z \in \mathcal{Z}} |f(z)| \leq \sup_{z \in \mathcal{Z}} \sqrt{K(z, z)}\, \|f\|_{\mathcal{F}} \leq 1$ for any $f \in \mathcal{F}_U$.

Theorem 4.4. Suppose Assumptions 2 and 4 hold for the set-valued class $\mathcal{C}$ in (3). If the hyperparameter $\gamma$ in the optimization problem (8) is chosen such that $\gamma \|f\|_{\mathcal{F}}^2 \leq \|f\|_{L_2}^2$ for all $f \in \mathcal{F}$, then we have
$$\mathbb{E}\big[\big(\alpha(Z; \widehat{C}) - \alpha\big)^2\big] \lesssim \mathrm{MSCE}(C_{\mathrm{ora}}) + \eta^2 + \frac{d_{\mathcal{C}} \log(n/d_{\mathcal{C}})}{n} + \frac{d_{\mathcal{Z}} \log^{d_{\mathcal{Z}}+1}(n/d_{\mathcal{Z}})}{n}.$$

Combining Theorems 4.2 and 4.4, we have the following corollary for the test-conditional coverage in Example 2.1.

Corollary 4.3. Consider Example 2.1 with $Z = X$. Suppose conditions (i)-(iii) of Assumption 3 hold with $w(X) = X$, and Assumptions 2 and 4 hold for $\mathcal{C}$. If $\alpha(Z; C) - \alpha \in \mathcal{F}_U$ with $U = 1$ for any $C \in \mathcal{C}$, then
$$\mathbb{E}\big[\big(\alpha(X; \widehat{C}) - \alpha\big)^2\big] \lesssim \frac{d_{\mathcal{C}} \log(n/d_{\mathcal{C}})}{n} + \frac{d_{\mathcal{Z}} \log^{d_{\mathcal{Z}}+1}(n/d_{\mathcal{Z}})}{n}.$$

4.4 Oracle inequality for MOPI via smoothed optimization

Since the miscoverage indicator is replaced by the smooth surrogate in the objective (9), we define a new function class,
$$\widetilde{\mathcal{G}} := \big\{(x, y, z) \mapsto f_C(z)\big[\tilde{\mathbb{1}}\{y \notin C(x)\} - \tilde{\mathbb{1}}\{y \notin C_{\mathrm{ora}}(x)\}\big] : C \in \mathcal{C}\big\},$$
to bound the corresponding excess risk. In the following, we focus on the Gaussian error function as the smooth surrogate: $\tilde{\mathbb{1}}\{y \notin C(x)\} = \frac{1}{2}\big(1 + \mathrm{erf}\big\{T(h(x), y)/\sqrt{2}r\big\}\big)$ for $C(x) = \{y \in \mathcal{Y} : T(h(x), y) \leq 0\}$, where $r > 0$ is the smoothing parameter. The next theorem provides the oracle inequality for the MOPI set (9). Compared with Theorem 4.1, there is an additional error term reflecting the smoothing bias, which depends on the scale of the smoothing parameter $r$.

Theorem 4.5. Suppose Assumptions 1-2 hold. Assume the density of $T(h(X), Y) \mid Z$ is upper bounded for all $h \in \mathcal{H}$, and $\sup_{f \in \mathcal{F}} \sup_{z \in \mathcal{Z}} |f(z)|$ is also upper bounded.
Under the same choice of $\gamma$ as in Theorem 4.1, if $\big(\delta_{n, \mathcal{F}_U} + \sqrt{\log(1/\zeta)/n}\big) \|f\|_{\mathcal{F}} \lesssim \sqrt{U} \|f\|_{L_2}$ holds for any $f \in \mathcal{F}$, then with probability at least $1 - 4\zeta$, we have
$$\mathbb{E}\big[\big(\alpha(Z; \widetilde{C}) - \alpha\big)^2 \mid \mathcal{D}_n\big] \lesssim \mathrm{MSCE}(C_{\mathrm{ora}}) + \eta^2 + \delta_{n, \mathcal{F}_U}^2 + \delta_{n, \widetilde{\mathcal{G}}}^2 + r^2\big(\delta_{n, \mathcal{F}_U} + \sqrt{\log(1/\zeta)/n}\big)^2 + \frac{\log(\log_2(1/\delta_{n, \widetilde{\mathcal{G}}})/\zeta)}{n},$$
where $\delta_{n, \mathcal{F}_U}$ and $\delta_{n, \widetilde{\mathcal{G}}}$ are the critical radii of $\mathcal{F}_U$ and $\widetilde{\mathcal{G}}$, respectively.

According to Theorem 4.5, the MSCE is dominated by the complexities of $\mathcal{F}$ and $\mathcal{C}$ provided that the smoothing parameter is chosen such that $r \lesssim \min\{\delta_{n, \mathcal{F}_U}^2, \log n/n\}$. Similar to Section 4.3, we can derive specific convergence rates of the MSCE for different choices of the classes $\mathcal{F}$ and $\mathcal{C}$. In Appendix A.2, we conduct an empirical sensitivity analysis on the choice of $r$, indicating that the performance of MOPI is stable as long as $r$ is not too large.

5 Simulation Results

We conduct simulations on synthetic data to compare the performance of our proposed MOPI with split conformal prediction (SCP) in Vovk et al. (2005) and Lei et al. (2018), conditional calibration (CC) in Gibbs et al. (2025), and randomly localized conformal prediction (RLCP) in Hore and Barber (2025). For MOPI, we replace the indicator function $\mathbb{1}\{u > 0\}$ with the sigmoid smoothing surrogate $\tilde{\mathbb{1}}\{u > 0\} = (1 + \exp(-u/r))^{-1}$, with the smoothing parameter set to $r = 0.1$. The nominal coverage level is set to $1-\alpha = 90\%$, and all simulation results are averaged over 100 replications. In each replication, we independently generate three datasets from the same distribution: a pretraining set $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{D}_{\mathrm{pre}}}$, a calibration set $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{D}_{\mathrm{cal}}}$, and a test set $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{D}_{\mathrm{test}}}$. We provide additional simulation results in Appendix F.

5.1 Geometrically structured sets for multi-dimensional labels

For a multivariate label $Y$, we design a simulation setting involving cross-dimensional dependence, in which the strength of heterogeneity varies across dimensions. Let $X \sim \mathrm{U}(0, 5)$; the multi-dimensional label is generated as $Y \mid X \sim \mathrm{N}(\mu^*(X), \Sigma^*(X))$, where $\mu_j^*(X) = \sin(X + j) + 0.3j$ and $\Sigma_{jj}^*(X) = \big(0.05 + 1.5\sqrt{j} \cdot |\sin(X + 2j)|\big)\sqrt{j}$ for $j = 1, \ldots, d_{\mathcal{Y}}$, and $\Sigma_{12}^*(X) = \Sigma_{21}^*(X) = 0.6\sqrt{\Sigma_{11}^*(X)\Sigma_{22}^*(X)}$, while all other entries of $\Sigma^*(X)$ are zero. In this simulation, we set the sample size of the test data to $|\mathcal{D}_{\mathrm{test}}| = 10{,}000$.
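For reproducibility, here is a sketch of this data-generating process (our reconstruction of the formulas above from the extracted text; the exact form of the variance expression is an assumption and should be checked against the authors' code):

```python
import numpy as np

def simulate(n, d_y, rng):
    # X ~ U(0,5), Y | X ~ N(mu*(X), Sigma*(X)), d_y >= 2 assumed, with
    #   mu*_j(x)     = sin(x + j) + 0.3 j
    #   Sigma*_jj(x) = (0.05 + 1.5*sqrt(j)*|sin(x + 2j)|) * sqrt(j)   # reconstructed
    #   Sigma*_12    = Sigma*_21 = 0.6*sqrt(Sigma*_11 * Sigma*_22), zeros elsewhere.
    x = rng.uniform(0, 5, n)
    y = np.empty((n, d_y))
    j = np.arange(1, d_y + 1)
    for i in range(n):
        mu = np.sin(x[i] + j) + 0.3 * j
        var = (0.05 + 1.5 * np.sqrt(j) * np.abs(np.sin(x[i] + 2 * j))) * np.sqrt(j)
        Sigma = np.diag(var)
        Sigma[0, 1] = Sigma[1, 0] = 0.6 * np.sqrt(var[0] * var[1])
        y[i] = rng.multivariate_normal(mu, Sigma)
    return x, y
```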
Test-conditional coverage. In this experiment, we focus on test-conditional coverage ($Z = X$) for ellipsoid-shaped sets. The baseline methods (SCP, CC, and RLCP) rely on a score function $s(x, y)$ fitted on the pretraining set to construct the sublevel set (7). The score takes the form $s(x, y) = (y - \mu_0(x))^\top \Sigma_0^{-1} (y - \mu_0(x))$ for ellipsoidal prediction sets. Here, $\mu_0(x)$ is a pretrained random forest model and $\Sigma_0$ is the constant empirical covariance matrix computed from the pretraining set of size $|\mathcal{D}_{\mathrm{pre}}| = 1{,}500$. Using this score function $s(x, y)$, the baseline methods use the calibration data to obtain thresholds for the sublevel sets. To make a fair comparison and incorporate the pretrained models $\mu_0(x)$ and $\Sigma_0$, MOPI first normalizes $y$ as $\tilde{y} = \Sigma_0^{-1/2}(y - \mu_0(x))$ for ellipsoidal sets. It then constructs structured prediction sets by instantiating the general form in (3) as $\mathcal{C}_{\mathrm{ell}} = \{C(x; \Sigma) = \{y \in \mathbb{R}^{d_{\mathcal{Y}}} : \tilde{y}^\top \Sigma^{-1}(x) \tilde{y} \leq 1\} : \Sigma \in \mathcal{H}\}$, where $\mathcal{H}$ is a class of neural networks that map the covariate to covariance matrices. For MOPI and CC, we choose the function class $\mathcal{F}$ to be the RKHS on $\mathcal{X}$ equipped with a Gaussian kernel. RLCP also employs the same Gaussian kernel for its localization.

To evaluate the performance of the different methods, we compute the following metrics (a sketch of metric (iii) is given after the list):

(i) Marginal coverage (Marginal): $|\mathcal{D}_{\mathrm{test}}|^{-1} \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{Y_i \in \widehat{C}(X_i)\}$.

(ii) Root of MSCE ($\sqrt{\mathrm{MSCE}}$): $\big(|\mathcal{J}|^{-1} \sum_{J \in \mathcal{J}} \big(n_J^{-1} \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in J, Y_i \notin \widehat{C}(X_i)\} - \alpha\big)^2\big)^{1/2}$, where $\mathcal{J}$ is a set of equidistant partition sets of $\mathcal{X}$ and $n_J = \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in J\}$.

(iii) Worst-case conditional coverage (Worst-case): $\min_{B \in \mathcal{B}} n_B^{-1} \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in B, Y_i \in \widehat{C}(X_i)\}$, where $n_B = \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in B\}$ and $\mathcal{B}$ denotes a collection of balls in $\mathcal{X}$. Specifically, each ball is centered at a randomly selected test point, and its radius is drawn independently from a uniform distribution on the interval $(0.1, 0.25)$.

(iv) Set volume: $\mathrm{Median}\{|\widehat{C}(X_i)| : i \in \mathcal{D}_{\mathrm{test}}\}$, which serves as a robust empirical proxy for the expected set volume $\mathbb{E}[|\widehat{C}(X)|]$, since RLCP may produce sets with infinite volume.
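As referenced in metric (iii) above, a sketch of the worst-case coverage computation (our illustration; the number of random balls is our assumption, since the text does not specify it, and a one-dimensional covariate is assumed as in this simulation):

```python
import numpy as np

def worst_case_coverage(x_test, covered, rng, n_balls=200, r_lo=0.1, r_hi=0.25):
    # Metric (iii): minimum empirical coverage over random balls in X,
    # each centered at a random test point with radius ~ U(r_lo, r_hi).
    worst = 1.0
    for _ in range(n_balls):
        center = x_test[rng.integers(len(x_test))]
        radius = rng.uniform(r_lo, r_hi)
        idx = np.abs(x_test - center) <= radius
        if idx.sum() > 0:
            worst = min(worst, covered[idx].mean())
    return worst
```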
We first evaluate the competing methods by varying the calibration sample size. The results in Figure 1 show that, compared to the baseline methods, MOPI achieves worst-case coverage rates closest to the nominal 90% level across all calibration set sizes, while consistently attaining a lower MSCE. Moreover, during the calibration stage, our method can adapt the shape of the prediction sets via the learned $\Sigma(x)$, whereas the baseline methods are limited to simply rescaling the pretrained sets by adjusting a scalar quantile. Consequently, MOPI produces substantially smaller prediction sets. As shown in Table 1, we further compare the four methods across different label-space dimensions $d_{\mathcal{Y}} \in \{2, 4, 6\}$ and different nominal coverage levels $1-\alpha \in \{95\%, 90\%, 85\%\}$. The results indicate that MOPI consistently outperforms the baselines in both conditional coverage metrics and set volumes.

Figure 1: Conditional coverage metrics and log of set volumes (denoted by $\log|\widehat{C}|$) versus sample sizes of the calibration set under ellipsoidal sets and $d_{\mathcal{Y}} = 2$. The error bars represent the standard deviation of the metrics over 100 replications.

Table 1: Coverage metrics and log of set volumes (denoted by $\log|\widehat{C}|$) for ellipsoidal prediction sets under different label-space dimensions and nominal coverage levels. Within each block, the three column groups give (Marginal, Worst-case, $\sqrt{\mathrm{MSCE}}$, $\log|\widehat{C}|$) for $d_{\mathcal{Y}} = 2$, $d_{\mathcal{Y}} = 4$, and $d_{\mathcal{Y}} = 6$, in order.

1 - alpha = 95%
  MOPI   0.952 0.918 0.026  3.301 | 0.955 0.920 0.027  7.752 | 0.953 0.914 0.027 13.311
  CC     0.948 0.905 0.034  4.393 | 0.947 0.907 0.032 10.228 | 0.948 0.905 0.033 18.845
  RLCP   0.952 0.908 0.034  4.431 | 0.951 0.892 0.040 10.248 | 0.952 0.900 0.036 18.844
  SCP    0.950 0.892 0.040  4.414 | 0.950 0.858 0.054 10.177 | 0.951 0.872 0.047 18.718

1 - alpha = 90%
  MOPI   0.903 0.856 0.037  3.072 | 0.902 0.856 0.035  7.301 | 0.902 0.853 0.037 12.714
  CC     0.897 0.840 0.046  4.040 | 0.898 0.847 0.044  9.636 | 0.898 0.844 0.045 17.960
  RLCP   0.901 0.837 0.050  4.052 | 0.901 0.817 0.059  9.586 | 0.902 0.828 0.054 17.865
  SCP    0.900 0.813 0.061  4.041 | 0.900 0.767 0.083  9.500 | 0.900 0.787 0.073 17.722

1 - alpha = 85%
  MOPI   0.853 0.799 0.044  2.901 | 0.853 0.801 0.042  7.064 | 0.852 0.795 0.043 12.313
  CC     0.846 0.780 0.056  3.780 | 0.848 0.788 0.053  9.204 | 0.848 0.785 0.053 17.320
  RLCP   0.850 0.774 0.061  3.777 | 0.850 0.751 0.073  9.117 | 0.851 0.764 0.066 17.188
  SCP    0.849 0.746 0.074  3.759 | 0.850 0.692 0.103  9.020 | 0.850 0.713 0.092 16.992

Group-conditional coverage. We now investigate group-conditional coverage for box-shaped prediction sets, where the groups are defined as $G_k = [k-1, k)$ for $k = 1, \ldots, 5$. Let $Z = (\mathbb{1}\{X \in G_1\}, \ldots, \mathbb{1}\{X \in G_5\})$. For this experiment, we choose $\mathcal{F}$ and $\mathcal{H}$ in MOPI, and $\mathcal{F}$ in CC, to be the same parametric function class $\{\beta^\top z : \beta \in \mathbb{R}^5\}$. Let $\sigma_{0,k}$ be the empirical standard deviation of the pretraining labels $\{Y_i : X_i \in G_k\}_{i \in \mathcal{D}_{\mathrm{pre}}}$ for $k = 1, \ldots, 5$, where $|\mathcal{D}_{\mathrm{pre}}| = 1{,}500$. For the baseline methods, the score function is defined as $s(X, y) = \|(y - \mu_0(X))/\sigma_0(X)\|_\infty$, where $\mu_0(X)$ is a pretrained random forest model and $\sigma_0(X) = \sum_{k=1}^5 \sigma_{0,k} \mathbb{1}\{X \in G_k\}$ is a piecewise constant function derived from the pretraining set. Following the same normalization as in the previous experiment, MOPI first computes $\tilde{y} = (y - \mu_0(X))/\sigma_0(X)$. It then constructs structured prediction sets by instantiating the general form in (3) as $\mathcal{C}_{\mathrm{box}} = \{C(x; \sigma) = \{y \in \mathbb{R}^{d_{\mathcal{Y}}} : \|\tilde{y}/\sigma(x)\|_\infty \leq 1\} : \sigma \in \mathcal{H}\}$.

We evaluate each method by computing the group-conditional coverage rates $\big(\sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in G_k\}\big)^{-1} \sum_{i \in \mathcal{D}_{\mathrm{test}}} \mathbb{1}\{X_i \in G_k, Y_i \in \widehat{C}(X_i)\}$ for $k = 1, \ldots, 5$. The simulation results are reported in Figure 2, where the sample size of the calibration set is $|\mathcal{D}_{\mathrm{cal}}| = 1{,}500$. Overall, MOPI achieves group-conditional coverage rates that are closer to the nominal level than those of the baseline methods. Moreover, MOPI yields box prediction regions with consistently smaller sizes across groups than the baselines.

Figure 2: Group-conditional coverage rates and set sizes under box sets and $d_{\mathcal{Y}} = 2$. The error bars represent the standard deviation of the metrics over 100 replications.

5.2 Sublevel sets for equalized coverage on sensitive attributes

In this subsection, we consider the equalized coverage problem described in Example 2.3, where the sensitive attribute $Z$ is unobserved in the test data. We construct prediction sets using the sublevel set formulation in (7) based on a pretrained score function $s(x, y)$.
Here, we use a pretraining set of size $|\mathcal{D}_{\mathrm{pre}}| = 3{,}000$ to fit a linear model $\mu(x)$ and adopt the absolute residual score function $s(x, y) = |y - \mu(x)|$. Then, using the calibration set of size $|\mathcal{D}_{\mathrm{cal}}| = 1{,}500$, we construct the prediction sets for all methods based on the same pretrained score. The size of the test data is $|\mathcal{D}_{\mathrm{test}}| = 500$.

The data are generated from a heteroscedastic regression model $Y = \mu^*(X, Z) + \varepsilon(X, Z)$ with $\varepsilon(X, Z) \sim \mathrm{N}(0, \sigma^*(X, Z)^2)$ and $X = (X_1, \ldots, X_{11})^\top$, where $X_1$ is independently drawn from $\mathrm{Unif}(0, 5)$ and $X_2, \ldots, X_{11}$ are independently drawn from $\mathrm{N}(0, 1)$. We consider the following two settings:

• Setting 1: $\mu^*(X, Z) = \sum_{j=1}^{11} \frac{1}{\sqrt{11}} X_j + 0.5Z$ and $\sigma^*(X, Z)^2 = 1 + Z^2 + 0.5 \times \sin\big(\sum_{j=2}^{11} X_j\big)$, with $Z = \mathbb{1}\{X_1 \leq 2.5\}$ and $\mathcal{Z} = \{0, 1\}$.

• Setting 2: $\mu^*(X, Z) = \mu_Z^*(X)$ and $\sigma^*(X, Z) = \sigma_Z^*(X)$ for $Z \in \mathcal{Z} = \{1, 2, 3, 4\}$, where $Z \mid X \sim \mathrm{Cat}(\pi_1^*(X), \ldots, \pi_4^*(X))$ and $\mathrm{Cat}(\cdot, \ldots, \cdot)$ denotes the categorical distribution. The functions $\{\mu_z^*(X), \sigma_z^*(X), \pi_z^*(X)\}_{z=1}^4$ are given in Appendix F.2.

For MOPI, we take the weight function class $\mathcal{F} = \{\sum_{z \in \mathcal{Z}} \beta_z \mathbb{1}\{Z = z\} : \beta_z \in \mathbb{R}\}$. For comparison, we let $\mathcal{H}$ in MOPI and $\mathcal{F}$ in CC be the same RKHS on $\mathcal{X}$ equipped with a Gaussian kernel. We report the conditional coverage rate for each sensitive attribute value $z \in \mathcal{Z}$, which should reach the nominal level $1-\alpha = 90\%$.

The experiment results are reported in Figure 3. We can see that MOPI attains the best equalized coverage in both settings, again without sacrificing marginal coverage. In Setting 1, where $Z$ is a deterministic function of $X$, both CC and RLCP show moderate improvements in equalized coverage over SCP. However, in Setting 2, where $Z$ is correlated with $X$ but not a deterministic function of it, CC and RLCP largely fail to leverage the information in $Z$ to improve equalized coverage. This limitation arises because these methods cannot incorporate information about $Z$ during the calibration stage. In contrast, MOPI effectively utilizes the available calibration-stage information about $Z$, thereby achieving substantially better equalized coverage.

Figure 3: Marginal coverage rate and conditional coverage rates on sensitive attributes.

Now, we empirically investigate the impact of the dependence between $X$ and $Z$ on conditional coverage performance under the following setting.

• Setting 1′: $\mu^*(X, Z)$ and $\sigma^*(X, Z)^2$ are the same as in Setting 1, with $Z = \mathbb{1}\{\rho^* X_1 + (1 - \rho^*)V \leq 2.5\rho^*\}$, where $V \sim \mathrm{N}(0, 1)$ and $\rho^* \in \{0, 0.25, 0.5, 0.75, 1\}$.

The dependence between the conditioning variable $Z$ and the covariate $X$ increases monotonically with $\rho^*$, which plays a similar role to $\rho^2$ in Theorem 4.2. When $\rho^* = 1$, $Z$ is a deterministic function of $X$; when $\rho^* = 0$, $Z$ is independent of $X$. As shown in Figure 4, for all values of $\rho^*$, the proposed MOPI method consistently achieves better conditional coverage than the three baseline approaches.
The simulation results are consistent with the theoretical insight from Theorem 4.2 and Corollary 4.2 that a larger correlation between $Z$ and $X$ leads to a smaller MSCE and conditional coverage gap.

Figure 4: Root of MSCE and conditional coverage rates under Setting 1′ as $\rho^*$ varies. The error bars represent the standard deviation of the metrics over 100 replications.

6 Real Data Analysis

In this section, we examine the performance of MOPI and the three baseline methods on two real-world datasets. Additional experiment results are deferred to Appendix G.

6.1 Households dataset

In this subsection, we consider the Households dataset, a multivariate-response dataset previously analyzed in Dheur et al. (2025). The response dimension is $d_{\mathcal{Y}} = 2$, corresponding to expenditures on housing and health. The covariates include five household characteristics: expenditures on transportation, entertainment, food, and utilities, and household income. The full dataset consists of 7,207 observations. Due to the wide numerical ranges, we apply a logarithmic transformation to both the labels and the covariates.

In this experiment, we focus on constructing ellipsoidal prediction sets. In each repetition, we split the dataset into three parts: a pretraining set of size $|\mathcal{D}_{\mathrm{pre}}| = 3{,}460$, a calibration set of size $|\mathcal{D}_{\mathrm{cal}}| = 2{,}307$, and a test set of size $|\mathcal{D}_{\mathrm{test}}| = 1{,}440$. The experiment is repeated 100 times with random splits, and the target coverage level is $1-\alpha = 90\%$. The baseline methods use the score function $s(x, y) = (y - \mu_0(x))^\top \Sigma_0^{-1}(x)(y - \mu_0(x))$, where $\mu_0(x)$ is a random forest predictor and $\Sigma_0(x)$ is a neural network fitted on the pretraining data. The detailed formulation of the pretraining loss is provided in Appendix G.1. For CC, the function space is chosen to be an RKHS with a Gaussian kernel, and RLCP is implemented using the same Gaussian kernel. For MOPI, we take $\mathcal{F}$ to be an RKHS equipped with a Gaussian kernel, and $\mathcal{H}$ to be the class of neural networks with the same architecture used to estimate $\Sigma_0(x)$.

Figure 5: Coverage comparison for the Households dataset. The error bars (left) show the standard deviation, and the confidence bands (right) represent approximate 95% normal confidence intervals, both computed over 100 replications.

We evaluate two metrics to assess test-conditional coverage. (i) Group-conditional coverage: we consider subpopulations corresponding to low expenditure or income levels across the five covariates described above. Specifically, for each covariate, we define the "low" group as those observations whose value falls below the 10th percentile of that covariate over the full sample, denoted by $\{G_T, G_E, G_F, G_U, G_I\}$, where the subscripts denote the initials of the five covariates described above, respectively.
(ii) Local coverage of log income: we partition the range of household income into 30 evenly spaced grid points and compute the coverage rate at each point using the nearest 5% of test samples.

From Figure 5, we observe that MOPI achieves coverage that is generally close to the nominal level across most groups while maintaining marginal validity. Moreover, the local coverage of log income shows that MOPI provides more uniform coverage across income levels.

6.2 Communities and Crime dataset

In this subsection, we focus on the Communities and Crime dataset (Redmond and Baveja, 2002), in which the objective is to predict each community's per-capita violent crime rate from its demographic and socioeconomic characteristics. Following Gibbs et al. (2025), we consider a concise set of explanatory variables, including population size, unemployment rate, median household income, percentage of residents with limited English proficiency, racial composition (the proportions of the Black, White, Asian, and Hispanic categories, denoted as $X_{\%\mathrm{Black}}$, $X_{\%\mathrm{White}}$, $X_{\%\mathrm{Asian}}$, and $X_{\%\mathrm{Hispanic}}$, respectively), and age structure (the shares of individuals aged 12–21 and those over 65). Let $X_{\mathrm{all}}$ denote the collection of all covariates listed above, and define the racial composition variables as $X_{\mathrm{racial}} = (X_{\%\mathrm{Black}}, X_{\%\mathrm{White}}, X_{\%\mathrm{Asian}}, X_{\%\mathrm{Hispanic}})$. This vector $X_{\mathrm{racial}}$ serves as the conditioning variable $Z$.

In this experiment, we consider the sublevel sets defined in (7) with the pretrained score $s(x, y) = |y - \mu_0(x)|$, where $\mu_0$ is fitted on the pretraining set. The experiment is repeated 100 times with random splits. In each repetition, we split the original dataset into three parts: a pretraining set of size $|\mathcal{D}_{\mathrm{pre}}| = 650$, a calibration set of size $|\mathcal{D}_{\mathrm{cal}}| = 650$, and a test set of size $|\mathcal{D}_{\mathrm{test}}| = 694$. We set both $\mathcal{H}$ and $\mathcal{F}$ in MOPI to be the same RKHS with Gaussian kernels, and RLCP is implemented using the same Gaussian kernel. We compute the linear re-weighting coverage of each racial group (Gibbs et al., 2025), defined as
\[
\frac{\sum_{i \in \mathcal{D}_{\mathrm{test}}} X^{\mathrm{racial}}_{i,j} \mathbf{1}\{Y_i \in \widehat{C}(X_i)\}}{\sum_{i \in \mathcal{D}_{\mathrm{test}}} X^{\mathrm{racial}}_{i,j}}, \quad j = 1, \ldots, 4,
\]
where $X^{\mathrm{racial}}_{i,j}$ denotes the proportion of racial group $j$ in community $i$.

We consider two experimental settings. In the unmasked case, both the calibration and test datasets contain the full set of covariates, including racial and demographic attributes. This corresponds to the case $X = X_{\mathrm{all}}$ and $Z = X_{\mathrm{racial}}$. Here we implement CC under the same configuration as in Gibbs et al. (2025); specifically, we choose the function class $\mathcal{F} = \{\beta_0 + Z^\top\beta + f(X) : f \in \mathcal{H}, \beta_0 \in \mathbb{R}, \beta \in \mathbb{R}^4\}$, where $\mathcal{H}$ is an RKHS equipped with a Gaussian kernel. In the masked case, the sensitive attribute is unobserved at test time. This corresponds to the scenario where $X = X_{\mathrm{all}} \setminus X_{\mathrm{racial}}$ and $Z = X_{\mathrm{racial}}$. Since $Z$ is unobserved in the test data, we adopt the function class $\mathcal{F} = \{\beta_0 + f(X) : f \in \mathcal{H}, \beta_0 \in \mathbb{R}\}$ for CC. The hyperparameters of each method are selected via cross-validation on the calibration data by minimizing the linearly re-weighted coverage error, so the baseline methods can also leverage information from the sensitive attributes during calibration. The experimental results are reported in Figure 6.
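The linear re-weighting coverage above is simply a weighted empirical coverage. A minimal Python sketch computing it for one group follows (variable names are ours, not from the paper's code):

```python
import numpy as np

def reweighted_coverage(covered, weights):
    """Linear re-weighting coverage for one group: `covered` holds the
    indicators 1{Y_i in C_hat(X_i)} over the test set, `weights` the group
    proportions X^racial_{i,j} of each community (illustrative names)."""
    covered = np.asarray(covered, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * covered) / np.sum(weights)

# Usage for racial group j, given test-set arrays `covered` (n,) and
# `X_racial` (n, 4):  cov_j = reweighted_coverage(covered, X_racial[:, j])
```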
Figure 6: Marginal and linear re-weighting coverage for the Communities and Crime dataset (panels: unmasked case and masked case). The box plots are obtained from 100 replications with random data splits.

In the unmasked case, both MOPI and CC successfully achieve the target coverage level across all four racial groups. In the masked case, the baseline methods fail to maintain the desired coverage across the racial subgroups. By contrast, since MOPI can leverage $Z$ during the minimax optimization phase, it achieves better and more balanced coverage across racial groups than the baselines.

7 Concluding Remarks

Two promising directions remain for future work. First, the miscoverage indicator in objective (8) results in a nonconvex optimization problem; it would thus be meaningful to explore convex surrogate losses and analyze the corresponding theoretical properties, as has been done in classification contexts (Bartlett et al., 2006). Second, while the prediction sets in this paper are constructed to satisfy $Z$-conditional coverage, a challenging open problem is how to simultaneously optimize the efficiency of a downstream task under the same conditional coverage constraint, such as volume efficiency (Braun et al., 2025) and decision efficiency (Bao et al., 2025).

SUPPLEMENTARY MATERIAL

The supplementary material contains the implementation details of MOPI, further comparisons with the conditional calibration method, proofs of the theoretical results, and additional experimental results.

References

Andrews, D. W. and Shi, X. (2013), "Inference based on conditional moment inequalities," Econometrica, 81, 609–666.

Angelopoulos, A. N., Barber, R. F., and Bates, S. (2024), "Theoretical foundations of conformal prediction," arXiv preprint arXiv:2411.11824.

Areces, F., Cheng, C., Duchi, J., and Rohith, K. (2024), "Two fundamental limits for uncertainty quantification in predictive inference," in The Thirty Seventh Annual Conference on Learning Theory, PMLR, pp. 186–218.

Bai, Y., Mei, S., Wang, H., Zhou, Y., and Xiong, C. (2022), "Efficient and differentiable conformal prediction with general function classes," in International Conference on Learning Representations.

Bao, Y., Hu, Y., Ren, H., Zhao, P., and Zou, C. (2025), "Optimal model selection for conformalized robust optimization," arXiv preprint arXiv:2507.04716.

Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2021), "The limits of distribution-free conditional predictive inference," Information and Inference: A Journal of the IMA, 10, 455–482.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005), "Local Rademacher complexities," The Annals of Statistics, 33, 1497–1537.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019), "Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks," Journal of Machine Learning Research, 20, 1–17.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006), "Convexity, classification, and risk bounds," Journal of the American Statistical Association, 101, 138–156.

Ben-Tal, A., Ghaoui, L., and Nemirovski, A. (2009), Robust Optimization, Princeton Series in Applied Mathematics, Princeton University Press.

Bennett, A.
and Kallus, N. (2023), "The variational method of moments," Journal of the Royal Statistical Society Series B: Statistical Methodology, 85, 810–841.

Braun, S., Aolaritei, L., Jordan, M. I., and Bach, F. (2025), "Minimum volume conformal sets for multivariate regression," arXiv preprint arXiv:2503.19068.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. (2021), "Distributional conformal prediction," Proceedings of the National Academy of Sciences, 118, e2107794118.

Dheur, V., Fontana, M., Estievenart, Y., Desobry, N., and Ben Taieb, S. (2025), "A unified comparative study with generalized conformity scores for multi-output conformal regression," in International Conference on Machine Learning, PMLR, vol. 267, pp. 13444–13485.

Dikkala, N., Lewis, G., Mackey, L., and Syrgkanis, V. (2020), "Minimax estimation of conditional moment models," Advances in Neural Information Processing Systems, 33, 12248–12262.

Feldman, S., Bates, S., and Romano, Y. (2023), "Calibrated multiple-output quantile regression with representation learning," Journal of Machine Learning Research, 24, 1–48.

Gibbs, I., Cherian, J. J., and Candès, E. J. (2025), "Conformal prediction with conditional guarantees," Journal of the Royal Statistical Society Series B: Statistical Methodology, 87, 1100–1126.

Guan, L. (2023), "Localized conformal prediction: A generalized inference framework for conformal prediction," Biometrika, 110, 33–50.

Gyôrfi, L. and Walk, H. (2019), "Nearest neighbor based conformal prediction," Annales de l'ISUP, 63, 173–190.

Hébert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. (2018), "Multicalibration: Calibration for the (computationally-identifiable) masses," in International Conference on Machine Learning, PMLR, pp. 1939–1948.

Hore, R. and Barber, R. F. (2025), "Conformal prediction with local weights: randomization enables robust guarantees," Journal of the Royal Statistical Society Series B: Statistical Methodology, 87, 549–578.

Johnstone, C. and Cox, B. (2021), "Conformal uncertainty sets for robust optimization," in Conformal and Probabilistic Prediction and Applications, PMLR, pp. 72–90.

Jung, C., Noarov, G., Ramalingam, R., and Roth, A. (2023), "Batch multivalid conformal prediction," in International Conference on Learning Representations.

Kallus, N., Mao, X., and Uehara, M. (2021), "Causal inference under unmeasured confounding with negative controls: A minimax learning approach," arXiv preprint arXiv:2103.14029.

Kiyani, S., Pappas, G. J., and Hassani, H. (2024a), "Conformal prediction with learned features," in International Conference on Machine Learning, PMLR, pp. 24749–24769.

— (2024b), "Length optimization in conformal prediction," Advances in Neural Information Processing Systems, 37, 99519–99563.

Koltchinskii, V. and Panchenko, D. (2002), "Empirical margin distributions and bounding the generalization error of combined classifiers," The Annals of Statistics, 30, 1–50.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), "Distribution-free predictive inference for regression," Journal of the American Statistical Association, 113, 1094–1111.

Lei, J., Robins, J., and Wasserman, L. (2013), "Distribution-free prediction sets," Journal of the American Statistical Association, 108, 278–287.

Lei, J. and Wasserman, L.
(2014), "Distribution-free prediction bands for non-parametric regression," Journal of the Royal Statistical Society Series B: Statistical Methodology, 76, 71–96.

Lin, T., Jin, C., and Jordan, M. (2020), "On gradient descent ascent for nonconvex-concave minimax problems," in International Conference on Machine Learning, PMLR, pp. 6083–6093.

Lindemann, L., Cleaveland, M., Shim, G., and Pappas, G. J. (2023), "Safe planning in dynamic environments using conformal prediction," IEEE Robotics and Automation Letters, 8, 5116–5123.

Mason, L., Bartlett, P. L., and Baxter, J. (2000), "Improved generalization through explicit optimization of margins," Machine Learning, 38, 243–255.

Messoudi, S., Destercke, S., and Rousseau, S. (2022), "Ellipsoidal conformal inference for multi-target regression," in Conformal and Probabilistic Prediction with Applications, PMLR, pp. 294–306.

Nemirovski, A. (2004), "Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems," SIAM Journal on Optimization, 15, 229–251.

Rafique, H., Liu, M., Lin, Q., and Yang, T. (2022), "Weakly-convex–concave min–max optimization: provable algorithms and applications in machine learning," Optimization Methods and Software, 37, 1087–1121.

Romano, Y., Barber, R. F., Sabatti, C., and Candès, E. (2020), "With malice toward none: Assessing uncertainty via equalized coverage," Harvard Data Science Review, 2, https://hdsr.mitpress.mit.edu/pub/qedrwcz3.

Romano, Y., Patterson, E., and Candes, E. (2019), "Conformalized quantile regression," Advances in Neural Information Processing Systems, 32, 1–11.

Sesia, M. and Romano, Y. (2021), "Conformal prediction using conditional histograms," Advances in Neural Information Processing Systems, 34, 6304–6315.

Shi, C., Uehara, M., Huang, J., and Jiang, N. (2022), "A minimax learning approach to off-policy evaluation in confounded partially observable Markov decision processes," in International Conference on Machine Learning, PMLR, pp. 20057–20094.

Sun, C., Liu, L., and Li, X. (2023), "Predict-then-calibrate: A new perspective of robust contextual LP," Advances in Neural Information Processing Systems, 36, 17713–17741.

Thurin, G., Nadjahi, K., and Boyer, C. (2025), "Optimal transport-based conformal prediction," arXiv preprint arXiv:2501.18991.

Vazquez, J. and Facelli, J. C. (2022), "Conformal prediction in clinical medical sciences," Journal of Healthcare Informatics Research, 6, 241–252.

Vovk, V. (2012), "Conditional validity of inductive conformal predictors," in Asian Conference on Machine Learning, PMLR, pp. 475–490.

Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World, vol. 29, Springer.

Wainwright, M. J. (2019), High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48, Cambridge University Press.

Wu, J., Hu, D., Bao, Y., Xia, S.-T., and Zou, C. (2025), "Error-quantified conformal inference for time series," in International Conference on Learning Representations.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. (2017), "Fairness constraints: Mechanisms for fair classification," in International Conference on Artificial Intelligence and Statistics, PMLR, pp. 962–970.

Zhang, B. H., Lemoine, B., and Mitchell, M.
(2018), "Mitigating unwanted biases with adversarial learning," in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340.

Supplementary Material for "Shape-Adaptive Conditional Calibration for Conformal Prediction via Minimax Optimization"

A Implementation of MOPI

A.1 The effect of the L2 penalty coefficient

In this section, we consider a more general form of (4):
\[
\Psi_\lambda(C,f):=\mathbb E\big[f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)-\lambda f^2(Z)\big], \tag{A.1}
\]
so that (4) is the special case of (A.1) with $\lambda=1$. Depending on the cardinality of $\mathcal Z$, we choose different $\mathcal F$ and obtain closed forms of $\max_{f\in\mathcal F}\Psi_\lambda(C,f)$.

Lemma A.1. If the cardinality of $\mathcal Z$ is finite and $\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$, then for any $C\in\mathcal C$ we have
\[
\max_{f\in\mathcal F}\Psi_\lambda(C,f)=\sum_{z\in\mathcal Z}\frac{\big(\mathbb E[\mathbf 1\{Z=z\}(\mathbf 1\{Y\notin C(X)\}-\alpha)]\big)^2}{4\lambda\,\mathbb P(Z=z)}.
\]

Lemma A.2. If the cardinality of $\mathcal Z$ is infinite, let $\mathcal F=\mathcal K$ be the RKHS with kernel function $K(\cdot,\cdot):\mathcal Z\times\mathcal Z\to\mathbb R$. For $u,v,h\in\mathcal K$, let $\langle u(\cdot),v(\cdot)\rangle_{\mathcal K}$ be the inner product in the RKHS and define $(u\otimes v)(h)=\langle h,v\rangle_{\mathcal K}u$. Suppose $\alpha(\cdot;C)-\alpha\in\mathcal F$ for all $C\in\mathcal C$; then we have
\[
\max_{f\in\mathcal F}\Psi_\lambda(C,f)=\frac{1}{4\lambda}\big\langle \mathbb E[K(\cdot,Z)\{\alpha(Z;C)-\alpha\}],\,T^{-1}\mathbb E[K(\cdot,Z)\{\alpha(Z;C)-\alpha\}]\big\rangle_{\mathcal K},
\]
where $T(\cdot)=\mathbb E[K(\cdot,Z)\otimes K(\cdot,Z)]$.

From the two lemmas above, we see that when the function class $\mathcal F$ is restricted to a finite-dimensional space or an RKHS according to the cardinality of $\mathcal Z$, the inner maximization problem admits a closed-form solution. Moreover, the parameter $\lambda$ only rescales the objective function and does not affect the optimizer.

In line with (A.1), consider the following empirical objective function:
\[
\widehat\Psi_\lambda(C,f)=\frac1n\sum_{i=1}^n f(Z_i)(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)-\frac{\lambda}{n}\sum_{i=1}^n f^2(Z_i),
\]
where $\lambda>0$ is a hyperparameter. The empirical minimax optimization problem can be written as
\[
\arg\min_{C\in\mathcal C}\max_{f\in\mathcal F}\big\{\widehat\Psi_\lambda(C,f)-\gamma\|f\|_{\mathcal F}^2\big\}. \tag{A.2}
\]
Hence, problem (8) is the special case of (A.2) with $\lambda=1$. For finite-dimensional $\mathcal F$, we let $\gamma=0$ and $\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$; by Lemma C.2, problem (A.2) can be rewritten as
\[
\arg\min_{C\in\mathcal C}\ \sum_{z\in\mathcal Z}\frac{\big(\sum_{i=1}^n\mathbf 1\{Z_i=z\}(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)\big)^2}{4\lambda n\sum_{i=1}^n\mathbf 1\{Z_i=z\}}.
\]
Since $\lambda$ only acts as a multiplicative scaling factor in the objective, its value does not affect the optimizer. Similarly, for $\mathcal F=\mathcal K$ an RKHS, by Lemma C.3, problem (A.2) can be rewritten as
\[
\arg\min_{C\in\mathcal C}\ \frac{1}{4\lambda}\,\phi_n(C)^\top K_n\Big(\frac1n K_n+\underbrace{\gamma/\lambda}_{:=\tilde\gamma}\,I_n\Big)^{-1}\phi_n(C).
\]
Thus, by reparameterizing $\gamma$ as $\tilde\gamma=\gamma/\lambda$, the parameter $\lambda$ still only rescales the objective function and does not affect the resulting optimal solution. Consequently, throughout this paper we set $\lambda=1$ both theoretically and empirically.

A.2 Sensitivity analysis of the smoothing parameter

To study the effect of the smoothing parameter $r$ on the resulting prediction sets, we consider Setting 1(x) and apply the sigmoid surrogate with different values of $r$. The experimental setup is identical to that in Section F.3. As shown in Figure A.1, the root of the MSCE decreases rapidly as $r$ decreases and attains its minimum around $r=0.1$. At the same time, both the marginal coverage and the worst-case conditional coverage decrease from over-coverage to values close to the nominal level.
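For intuition, the sigmoid surrogate replaces the nonsmooth miscoverage indicator $\mathbf 1\{s(X,Y)>h(X)\}$ with a sigmoid of temperature $r$. A minimal sketch follows (our parametrization for illustration; the paper's exact surrogate is specified in its implementation details):

```python
import numpy as np

def smoothed_miscoverage(score, threshold, r):
    """Sigmoid surrogate for the miscoverage indicator 1{score > threshold}
    of a sublevel prediction set C(x; h) = {y : s(x, y) <= h(x)}.
    As r -> 0 the surrogate approaches the indicator, but its gradient
    concentrates on a band of width O(r) around the threshold."""
    return 1.0 / (1.0 + np.exp(-(score - threshold) / r))

# Sharper surrogates for smaller r (score 1.2 vs. threshold 1.0):
for r in (1.0, 0.1, 0.01):
    print(r, smoothed_miscoverage(1.2, 1.0, r))
```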
As $r$ is further reduced, the root of the MSCE increases slowly with a mild slope, and the changes in marginal and worst-case conditional coverage remain limited. Moreover, the error bars indicate that smaller values of $r$ lead to larger variability in coverage. This is expected, since a smaller $r$ shrinks the region where the smooth surrogate has non-negligible gradients, which in turn makes the optimization more unstable.

Figure A.1: Root of MSCE, marginal coverage, and worst-case conditional coverage as the smoothing parameter $r$ varies. Dots and error bars represent the means and confidence intervals over 300 trials.

Overall, these results suggest that in practice one should choose a relatively small, but not excessively small, smoothing parameter, depending on the available labeled sample size. Moreover, MOPI appears to be reasonably robust to the choice of $r$ over a moderate range, without requiring precise tuning of an optimal value.

B More Comparisons with Conditional Calibration

B.1 Theoretical comparison under sublevel sets

In this section, we compare the MSCE performance of the quantile regression approach and MOPI. For test-conditional coverage with $Z=X$, consider the sublevel set (7), and set $\mathcal H=\mathcal F$ in MOPI for a fair comparison. For a sufficiently large $\mathcal F$ such that $\alpha(X,C)-\alpha\in\mathcal F$ for all $C\in\mathcal C=\{C(X;f)=\{y:s(X,y)\le f(X)\}:f\in\mathcal F\}$, Proposition 2.1 gives $\max_{f\in\mathcal F}\Psi(C,f)=\mathrm{MSCE}(C)/4$. Therefore,
\[
C^*:=\arg\min_{C\in\mathcal C}\max_{f\in\mathcal F}\Psi(C,f)=\arg\min_{C\in\mathcal C}\frac14\mathrm{MSCE}(C)=:C_{\mathrm{ora}}.
\]
On the other hand, the quantile regression predictor is given by $C_{\mathrm{qr}}(X):=\{y\in\mathcal Y:s(X,y)\le f_{\mathrm{qr}}(X)\}$, where $f_{\mathrm{qr}}=\arg\min_{f\in\mathcal F}\mathbb E[\ell_\alpha(s(X,Y),f(X))]$. By the optimality of $C_{\mathrm{ora}}$, it follows that $\mathrm{MSCE}(C^*)=\mathrm{MSCE}(C_{\mathrm{ora}})\le\mathrm{MSCE}(C_{\mathrm{qr}})$. Therefore, the minimax solution $C^*$ achieves a lower MSCE than $C_{\mathrm{qr}}$.

Let $q(x)=\inf\{t\in\mathbb R:\mathbb P(s(X,Y)\le t\mid X=x)\ge 1-\alpha\}$ be the ground-truth quantile function. Next, we investigate when the MSCE of $C^*$ and $C_{\mathrm{qr}}$ equals zero. If $q\in\mathcal F$, then $\mathrm{MSCE}(C(X;q))=0$ by the definition of $q$. Hence, under the condition of Proposition 2.1,
\[
0\le\mathrm{MSCE}(C^*)=\mathrm{MSCE}(C_{\mathrm{ora}})\le\mathrm{MSCE}(C(X;q))=0.
\]
Let $f_x=f(x)$ for any fixed $x\in\mathcal X$. Note that $\mathbb E[\ell_\alpha(s(X,Y),f(X))\mid X=x]=\mathbb E[\ell_\alpha(s(X,Y),f_x)\mid X=x]$, and consider the pointwise optimization $\min_{f_x\in\mathbb R}\mathbb E[\ell_\alpha(s(X,Y),f_x)\mid X=x]$. The first-order condition of this problem is
\[
\frac{\partial}{\partial f_x}\mathbb E[\ell_\alpha(s(X,Y),f_x)\mid X=x]=\mathbb P\{s(X,Y)<f_x\mid X=x\}-(1-\alpha)=0,
\]
so the optimal solution is $q(x)$ for each fixed $x\in\mathcal X$, by the definition of $q(x)$. If $q\in\mathcal F$, note that
\[
\min_{f\in\mathcal F}\mathbb E[\ell_\alpha(s(X,Y),f(X))]\le\mathbb E[\ell_\alpha(s(X,Y),q(X))]=\mathbb E_X\big[\mathbb E[\ell_\alpha(s(X,Y),q(x))\mid X=x]\big]=\mathbb E_X\Big[\min_{f_x\in\mathbb R}\mathbb E[\ell_\alpha(s(X,Y),f_x)\mid X=x]\Big].
\]
On the other hand, $\mathbb E[\ell_\alpha(s(X,Y),f_x)\mid X=x]\ge\mathbb E[\ell_\alpha(s(X,Y),q(x))\mid X=x]$ by the optimality of $q(x)$. Taking expectations on both sides and minimizing over $f\in\mathcal F$, we have
\[
\min_{f\in\mathcal F}\mathbb E[\ell_\alpha(s(X,Y),f(X))]\ge\mathbb E[\ell_\alpha(s(X,Y),q(X))].
\]
Therefore $f_{\mathrm{qr}}(x)=q(x)$ a.s. for all $x\in\mathcal X$. Thus, when $q\in\mathcal F$, $\mathrm{MSCE}(C^*)=\mathrm{MSCE}(C_{\mathrm{qr}})=0$.
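Here $\ell_\alpha$ denotes the pinball (quantile) loss at level $1-\alpha$. As a quick numerical illustration of the first-order condition above, the following sketch (ours, not from the paper) verifies that minimizing the empirical pinball loss over a grid recovers the empirical $(1-\alpha)$-quantile:

```python
import numpy as np

def pinball(s, f, alpha):
    """Empirical pinball loss l_alpha(s, f) at quantile level 1 - alpha;
    its population minimizer over f is the (1 - alpha)-quantile of s."""
    u = s - f
    return np.mean(np.maximum((1.0 - alpha) * u, -alpha * u))

rng = np.random.default_rng(0)
s = rng.normal(size=100_000)          # stand-in for scores s(X, Y)
alpha = 0.1
grid = np.linspace(0.5, 2.0, 301)
f_hat = grid[np.argmin([pinball(s, f, alpha) for f in grid])]
print(f_hat, np.quantile(s, 1 - alpha))  # both close to 1.2816 for N(0, 1)
```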
B.2 Computational efficiency of MOPI

In this section, we study the computational efficiency of MOPI. We focus on Setting 1, with the experimental configuration following Section 5.2.

We consider two variants of CC: CC(full) and CC(non-full). The CC(full) variant follows Gibbs et al. (2025) exactly, solving the quantile regression jointly over $\mathcal D_{\mathrm{cal}}$ and each test point. In contrast, CC(non-full) estimates the score quantile via quantile regression using only $\mathcal D_{\mathrm{cal}}$. Both CC(non-full) and MOPI are optimized using the Adam algorithm with the same learning rate and number of iterations. All experiments are conducted on a MacBook Pro equipped with an M4 Pro CPU and 24GB of RAM.

Figure B.1: Comparison of the computational efficiency of MOPI and CC with a single test point (time in seconds, log scale, as $|\mathcal D_{\mathrm{cal}}|$ varies). Dots and error bars show means and confidence intervals from 100 trials.

As shown in Figure B.1, the computational time of MOPI is comparable to that of CC(non-full), since both methods solve an optimization problem solely on the calibration dataset. In contrast, CC(full), as a full conformal method, requires solving the optimization problem jointly over the calibration data and each test point, resulting in a computational cost that is several times higher, typically 10x to 100x that of MOPI and CC(non-full). This observation is broadly consistent with the simulation findings reported in Duchi (2025).

C Proofs of Main Results

C.1 Proof of Proposition 2.1

Proposition 2.1 is the special case of Lemma C.1 with $\lambda=1$.

Lemma C.1. Let $\Psi_\lambda(C,f)=\mathbb E[f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)-\lambda f^2(Z)]$. Then we have
\[
\Psi_\lambda(C,f)\le\frac{1}{4\lambda}\mathbb E\big[(\alpha(Z;C)-\alpha)^2\big].
\]
If for any $C\in\mathcal C$ there exists some $f^*\in\mathcal F$ such that $\alpha(Z;C)-\alpha=2\lambda f^*(Z)$, then it holds that
\[
\max_{f\in\mathcal F}\Psi_\lambda(C,f)=\frac{1}{4\lambda}\mathbb E\big[(\alpha(Z;C)-\alpha)^2\big]\quad\text{for any }C\in\mathcal C.
\]
In addition, we also have
\[
\mathrm{MSCE}(C^*)=\mathrm{MSCE}(C_{\mathrm{ora}}). \tag{C.1}
\]

Proof. By the definition of $\Psi_\lambda$, we have
\[
\begin{aligned}
\max_{f\in\mathcal F}\Psi_\lambda(C,f)&=\max_{f\in\mathcal F}\mathbb E\big[f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)-\lambda f^2(Z)\big]=\max_{f\in\mathcal F}\mathbb E\big[f(Z)(\alpha(Z;C)-\alpha)-\lambda f^2(Z)\big]\\
&=\frac{1}{4\lambda}\mathbb E\big[(\alpha(Z;C)-\alpha)^2\big]-\frac{1}{\lambda}\min_{f\in\mathcal F}\mathbb E\Big[\Big(\tfrac12(\alpha(Z;C)-\alpha)-\lambda f(Z)\Big)^2\Big].
\end{aligned} \tag{C.2}
\]
According to the assumption, for any $C\in\mathcal C$ there exists $f^*\in\mathcal F$ such that $\alpha(Z;C)-\alpha=2\lambda f^*(Z)$. Therefore,
\[
0\le\min_{f\in\mathcal F}\mathbb E\Big[\Big(\tfrac12(\alpha(Z;C)-\alpha)-\lambda f(Z)\Big)^2\Big]\le\mathbb E\Big[\Big(\tfrac12(\alpha(Z;C)-\alpha)-\lambda f^*(Z)\Big)^2\Big]=0.
\]
Hence, by (C.2) we have $\max_{f\in\mathcal F}\Psi_\lambda(C,f)=\frac{1}{4\lambda}\mathbb E[(\alpha(Z;C)-\alpha)^2]$ for all $C\in\mathcal C$. Taking the arg min on both sides with respect to $C$,
\[
C^*:=\arg\min_{C\in\mathcal C}\max_{f\in\mathcal F}\Psi_\lambda(C,f)=\arg\min_{C\in\mathcal C}\frac{1}{4\lambda}\mathbb E\big[(\alpha(Z;C)-\alpha)^2\big]=:C_{\mathrm{ora}},
\]
which indicates $\mathrm{MSCE}(C^*)=\mathrm{MSCE}(C_{\mathrm{ora}})$.

C.2 Proofs of Lemmas 3.1 and 3.2

We consider the empirical minimax problem (A.2) and state the following two lemmas; Lemmas 3.1 and 3.2 are their special cases with $\lambda=1$.

Lemma C.2. Let $\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$. The inner maximization in problem (A.2) with $\gamma=0$ is
\[
\max_{f\in\mathcal F}\widehat\Psi_\lambda(C,f)=\sum_{z\in\mathcal Z}\frac{\big(\sum_{i=1}^n\mathbf 1\{Z_i=z\}(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)\big)^2}{4\lambda n\sum_{i=1}^n\mathbf 1\{Z_i=z\}}.
\]
Lemma C.3. Let $\mathcal F=\mathcal K$, where $\mathcal K$ is an RKHS equipped with a kernel function $K(\cdot,\cdot):\mathcal Z\times\mathcal Z\to\mathbb R$. Denote $\phi_n(C)=(\phi_1(C),\ldots,\phi_n(C))^\top$ with $\phi_i(C)=\frac1n(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)$ for $i=1,\ldots,n$, and let $K_n=(K(Z_i,Z_j))_{i,j=1}^n$ be the empirical kernel matrix. Then the inner maximization in problem (A.2) reduces to
\[
\max_{f\in\mathcal F}\big\{\widehat\Psi_\lambda(C,f)-\gamma\|f\|_{\mathcal F}^2\big\}=\frac{1}{4\lambda}\phi_n(C)^\top K_n\Big(\frac1nK_n+(\gamma/\lambda)I_n\Big)^{-1}\phi_n(C),
\]
where $I_n$ is the $n\times n$ identity matrix.

Proof of Lemma C.2. Since $\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$, let $\beta=(\beta_1,\ldots,\beta_{|\mathcal Z|})^\top$, $\mathbf 1(Z)=(\mathbf 1\{Z=z_1\},\ldots,\mathbf 1\{Z=z_{|\mathcal Z|}\})^\top$, and $\phi_i(C)=\frac1n(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)$. Then, choosing $\gamma=0$, we can rewrite the inner problem of (8) as
\[
\sup_{f\in\mathcal F}\widehat\Psi_\lambda(C,f)=\sup_{\beta\in\mathbb R^{|\mathcal Z|}}\frac1n\sum_{i=1}^n\big[\beta^\top\mathbf 1(Z_i)\,n\phi_i(C)-\lambda\beta^\top\mathbf 1(Z_i)\mathbf 1(Z_i)^\top\beta\big]=\sup_{\beta\in\mathbb R^{|\mathcal Z|}}\beta^\top\Big(\sum_{i=1}^n\mathbf 1(Z_i)\phi_i(C)\Big)-\lambda\beta^\top\Big(\frac1n\sum_{i=1}^n\mathbf 1(Z_i)\mathbf 1(Z_i)^\top\Big)\beta,
\]
which is a quadratic form in $\beta$. Moreover, since $\mathbf 1\{Z=z\}\cdot\mathbf 1\{Z=z'\}=0$ for $z,z'\in\mathcal Z$ with $z\neq z'$, we have
\[
\frac1n\sum_{i=1}^n\mathbf 1(Z_i)\mathbf 1(Z_i)^\top=\mathrm{Diag}\Big(\frac1n\sum_{i=1}^n\mathbf 1\{Z_i=z_1\},\ldots,\frac1n\sum_{i=1}^n\mathbf 1\{Z_i=z_{|\mathcal Z|}\}\Big).
\]
Assume $n$ is sufficiently large so that $\sum_{i=1}^n\mathbf 1\{Z_i=z\}>0$ for all $z\in\mathcal Z$. Taking the first-order condition, we obtain the optimizer
\[
\beta^*=\frac{1}{2\lambda}\Big(\frac1n\sum_{i=1}^n\mathbf 1(Z_i)\mathbf 1(Z_i)^\top\Big)^{-1}\Big(\sum_{i=1}^n\mathbf 1(Z_i)\phi_i(C)\Big).
\]
Thus,
\[
\sup_{f\in\mathcal F}\widehat\Psi_\lambda(C,f)=\sum_{z\in\mathcal Z}\frac{\big(\sum_{i=1}^n\mathbf 1\{Z_i=z\}\phi_i(C)\big)^2}{4\lambda\big(\frac1n\sum_{i=1}^n\mathbf 1\{Z_i=z\}\big)}=\sum_{z\in\mathcal Z}\frac{\big(\sum_{i=1}^n\mathbf 1\{Z_i=z\}(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)\big)^2}{4\lambda n\sum_{i=1}^n\mathbf 1\{Z_i=z\}}.
\]

Proof of Lemma C.3. Since $\mathcal F=\mathcal K$, the inner maximization of (A.2) takes the form $\sup_{f\in\mathcal F}\widehat\Psi_\lambda(C,f)-\gamma\|f\|_{\mathcal F}^2$. By the generalized representer theorem (e.g., Schölkopf et al. (2001), Theorem 1), the optimal solution of the inner maximization of (A.2) takes the form $f^*(z)=\sum_{i=1}^n\alpha_i^*K(Z_i,z)$ for some weight vector $\alpha^*\in\mathbb R^n$. Let $K_n=(K(Z_i,Z_j))_{i,j=1}^n$ be the empirical kernel matrix. For any $\alpha=(\alpha_1,\ldots,\alpha_n)^\top\in\mathbb R^n$, consider a function $f(z)=\sum_{i=1}^n\alpha_iK(Z_i,z)$; then we have $\|f\|_{\mathcal K}^2=\alpha^\top K_n\alpha$ and $f(Z_i)=e_i^\top K_n\alpha$, where $e_i\in\{0,1\}^n$ is the vector with a 1 in the $i$-th position and 0 elsewhere. Then
\[
\frac1n\sum_{i=1}^nf^2(Z_i)=\frac1n\sum_{i=1}^n\alpha^\top K_ne_ie_i^\top K_n\alpha=\frac1n\alpha^\top K_n^2\alpha.
\]
Let $\phi_n(C)=(\phi_1(C),\ldots,\phi_n(C))^\top$ with $\phi_i(C)=\frac1n(\mathbf 1\{Y_i\notin C(X_i)\}-\alpha)$ for $i=1,\ldots,n$. Then we can rewrite the inner problem of (A.2) as
\[
\sup_{f\in\mathcal F}\widehat\Psi_\lambda(C,f)-\gamma\|f\|_{\mathcal F}^2=\sup_{\alpha\in\mathbb R^n}\phi_n(C)^\top K_n\alpha-\alpha^\top K_n\Big(\frac{\lambda}{n}K_n+\gamma I_n\Big)\alpha.
\]
Taking the first-order condition, we obtain the optimizer
\[
\alpha^*=\frac12\Big(\frac{\lambda}{n}K_n+\gamma I_n\Big)^{-1}\phi_n(C),
\]
and the optimal value
\[
\frac14\phi_n(C)^\top K_n\Big(\frac{\lambda}{n}K_n+\gamma I_n\Big)^{-1}\phi_n(C)=\frac{1}{4\lambda}\phi_n(C)^\top K_n\Big(\frac1nK_n+(\gamma/\lambda)I_n\Big)^{-1}\phi_n(C).
\]

C.3 Proof of Theorem 4.1

In the following, we denote $\epsilon^2:=\mathrm{MSCE}(C_{\mathrm{ora}})$ and $\widehat\Psi_\gamma(C,f)=\widehat\Psi(C,f)-\gamma\|f\|_{\mathcal F}^2$. Consider $\mathcal C=\{C(x;h):h\in\mathcal H\}$, the structured prediction-set class (3), and the more general minimax problem with the penalty $\|C\|_{\mathcal H}=\|h\|_{\mathcal H}$, Eq. (D.1), which we rewrite here for convenience:
\[
\widehat C:=\arg\min_{C\in\mathcal C}\max_{f\in\mathcal F}\big\{\widehat\Psi_\gamma(C,f)+\nu\|C\|_{\mathcal H}^2\big\}. \tag{C.3}
\]
Thus, Eq. (8) is the special case of (C.3) with $\nu=0$, and Theorem 4.1 is the special case of the following Theorem C.1.

Theorem C.1. Suppose Assumptions 1 and 2 hold.
Furthermore, assume the constant $U$ satisfies $\sup_{f\in\mathcal F_U}\|f\|_\infty\le1$. Let $\delta_{n,\mathcal F_U}$ and $\delta_{n,\mathcal G}$ be the critical radii of $\mathcal F_U$ and $\mathcal G$, respectively. For any $\zeta\in(0,1)$, let $\tilde\delta_{n,\mathcal F}:=\delta_{n,\mathcal F_U}+\sqrt{\log(c_1/\zeta)/(c_2n)}$ for some positive constants $c_1$ and $c_2$.

• For finite-dimensional $\mathcal F$, let the hyperparameter $\gamma=0$ in (C.3). Then, for sufficiently large $n$ such that $\tilde\delta_{n,\mathcal F}^2\|f\|_{\mathcal F}^2\le\frac U2\|f\|_{L_2}^2$ for any $f\in\mathcal F$, conditioning on the labeled data $\mathcal D_n$, with probability at least $1-4\zeta$ we have
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n\big]\lesssim\epsilon^2+\eta^2+\delta_{n,\mathcal F_U}^2+\delta_{n,\mathcal G}^2+\frac{\log(\log_2(1/\delta_{n,\mathcal G})/\zeta)}{n}+\frac{\nu^2\|C_{\mathrm{ora}}\|_{\mathcal H}^4}{\tilde\delta_{n,\mathcal F}^2}.
\]

• For infinite-dimensional $\mathcal F$, if we choose the hyperparameter $\gamma$ in (C.3) such that $\gamma\|f\|_{\mathcal F}^2\le\|f\|_{L_2}^2$, then for sufficiently large $n$ such that $\tilde\delta_{n,\mathcal F}^2\|f\|_{\mathcal F}^2\le\frac U2\|f\|_{L_2}^2$ for any $f\in\mathcal F$, conditioning on the labeled data $\mathcal D_n$, the same bound holds with probability at least $1-4\zeta$.

Proof of Theorem C.1. Let $\widehat C$ be the solution of (C.3). Notice that
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\le2\epsilon^2+2\mathbb E\big[(\alpha(Z;\widehat C)-\alpha(Z;C_{\mathrm{ora}}))^2\big], \tag{C.4}
\]
where $\mathbb E[(\alpha(Z;C_{\mathrm{ora}})-\alpha)^2]=\epsilon^2$. Let $\mathcal F_U=\{f\in\mathcal F:\|f\|_{\mathcal F}^2\le U\}$ and let $\delta_{n,\mathcal F_U}$ be the critical radius of $\mathcal F_U$. Write $Pf^2:=\|f\|_{L_2}^2=\mathbb E_P[f^2(Z)]$ and $P_nf^2:=\|f\|_{L_2,n}^2=\frac1n\sum_{i=1}^nf^2(Z_i)$. By Lemma E.6 and our assumption $\sup_{f\in\mathcal F_U}\|f\|_\infty\le1$, for any $t\ge\delta_{n,\mathcal F_U}$ we have
\[
\mathbb P\Big(\forall f\in\mathcal F_U,\ |P_nf^2-Pf^2|\le\frac{Pf^2}{2}+\frac{t^2}{2}\Big)\ge1-c_1e^{-c_2nt^2},
\]
where $c_1,c_2>0$ are universal constants. For $f\notin\mathcal F_U$, applying the same exponential inequality to $\sqrt Uf/\|f\|_{\mathcal F}$ yields
\[
\mathbb P\Big(\forall f\notin\mathcal F_U,\ |P_nf^2-Pf^2|\le\frac{Pf^2}{2}+\frac{t^2\|f\|_{\mathcal F}^2}{2U}\Big)\ge1-c_1e^{-c_2nt^2}.
\]
Combining the results above and taking $t=\delta_{n,\mathcal F_U}+\sqrt{\log(c_1/\zeta)/(c_2n)}$ for any $\zeta>0$, with probability $1-\zeta$,
\[
\forall f\in\mathcal F:\ |P_nf^2-Pf^2|\le\frac{Pf^2}{2}+\frac{1\vee(\|f\|_{\mathcal F}^2/U)}{2}\Big(\delta_{n,\mathcal F_U}+\sqrt{\frac{\log(c_1/\zeta)}{c_2n}}\Big)^2.
\]
Since we choose $\tilde\delta_{n,\mathcal F}=\delta_{n,\mathcal F_U}+\sqrt{\log(c_1/\zeta)/(c_2n)}$ and assume $\tilde\delta_{n,\mathcal F}^2\|f\|_{\mathcal F}^2\le\frac U2\|f\|_{L_2}^2$, with probability $1-\zeta$ we have
\[
\forall f\in\mathcal F:\ \gamma\|f\|_{\mathcal F}^2+\|f\|_{L_2,n}^2\ge\gamma\|f\|_{\mathcal F}^2+\frac{\|f\|_{L_2}^2}{2}-\Big(1\vee\frac{\|f\|_{\mathcal F}^2}{U}\Big)\frac{\tilde\delta_{n,\mathcal F}^2}{2}\ge\gamma\|f\|_{\mathcal F}^2+\frac14\|f\|_{L_2}^2-\frac{\tilde\delta_{n,\mathcal F}^2}{2}, \tag{C.5}
\]
and
\[
\forall f\in\mathcal F:\ \gamma\|f\|_{\mathcal F}^2+\|f\|_{L_2,n}^2\le\gamma\|f\|_{\mathcal F}^2+\frac{3\|f\|_{L_2}^2}{2}+\Big(1\vee\frac{\|f\|_{\mathcal F}^2}{U}\Big)\frac{\tilde\delta_{n,\mathcal F}^2}{2}\le\gamma\|f\|_{\mathcal F}^2+\frac74\|f\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}. \tag{C.6}
\]
For $f\in\mathcal F$ and $C\in\mathcal C$, denote $\psi(C,f)=f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)$. Since $\psi(C,f)$ is 1-Lipschitz with respect to $f$, using Lemma 14 in Foster and Syrgkanis (2023), with probability $1-\zeta$,
\[
\forall f\in\mathcal F:\ |(P_n-P)\psi(C_{\mathrm{ora}},f)|\le18\Big(\tilde\delta_{n,\mathcal F}\|f\|_{L_2}+\Big(1\vee\frac{\|f\|_{\mathcal F}}{\sqrt U}\Big)\tilde\delta_{n,\mathcal F}^2\Big). \tag{C.7}
\]
Note that $\Psi_\lambda(C,f)=P\psi(C,f)-\lambda Pf^2$ and $\widehat\Psi_\gamma(C,f)=P_n\psi(C,f)-P_nf^2-\gamma\|f\|_{\mathcal F}^2$ for any $C\in\mathcal C$, $f\in\mathcal F$. By the optimality of $\widehat C$,
\[
\sup_{f\in\mathcal F}\widehat\Psi_\gamma(\widehat C,f)\le\sup_{f\in\mathcal F}\widehat\Psi_\gamma(C_{\mathrm{ora}},f)+\nu\big(\|C_{\mathrm{ora}}\|_{\mathcal H}^2-\|\widehat C\|_{\mathcal H}^2\big). \tag{C.8}
\]
Combining (C.5) and (C.7), for $\gamma\ge0$, with probability $1-2\zeta$ we get
\[
\begin{aligned}
\sup_{f\in\mathcal F}\widehat\Psi_\gamma(C_{\mathrm{ora}},f)&=\sup_{f\in\mathcal F}\big\{P_n\psi(C_{\mathrm{ora}},f)-P_nf^2-\gamma\|f\|_{\mathcal F}^2\big\}\le\sup_{f\in\mathcal F}\Big\{P_n\psi(C_{\mathrm{ora}},f)-\Big(\gamma\|f\|_{\mathcal F}^2+\frac14\|f\|_{L_2}^2-\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)\Big\}\\
&\le\sup_{f\in\mathcal F}\Big\{P\psi(C_{\mathrm{ora}},f)+18\Big(\tilde\delta_{n,\mathcal F}\|f\|_{L_2}+\Big(1\vee\frac{\|f\|_{\mathcal F}}{\sqrt U}\Big)\tilde\delta_{n,\mathcal F}^2\Big)-\Big(\gamma\|f\|_{\mathcal F}^2+\frac14\|f\|_{L_2}^2-\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)\Big\}\\
&\overset{(i)}{\le}\sup_{f\in\mathcal F}\Big\{P\psi(C_{\mathrm{ora}},f)+18\big(2\tilde\delta_{n,\mathcal F}\|f\|_{L_2}+\tilde\delta_{n,\mathcal F}^2\big)-\Big(\frac14\|f\|_{L_2}^2-\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)\Big\}\\
&\le\sup_{f\in\mathcal F}\Big\{P\psi(C_{\mathrm{ora}},f)-\frac18Pf^2\Big\}+\sup_{f\in\mathcal F}\Big\{36\tilde\delta_{n,\mathcal F}\|f\|_{L_2}-\frac18\|f\|_{L_2}^2\Big\}+18\tilde\delta_{n,\mathcal F}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\\
&\overset{(ii)}{\le}\sup_{f\in\mathcal F}\Psi_{\frac18}(C_{\mathrm{ora}},f)+2\cdot36^2\tilde\delta_{n,\mathcal F}^2+18\tilde\delta_{n,\mathcal F}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\le\sup_{f\in\mathcal F}\Psi_{\frac18}(C_{\mathrm{ora}},f)+2611\tilde\delta_{n,\mathcal F}^2,
\end{aligned} \tag{C.9}
\]
where in (i) we used the assumption $\tilde\delta_{n,\mathcal F}\|f\|_{\mathcal F}\le\sqrt{U/2}\,\|f\|_{L_2}$ and $\gamma\ge0$, and in (ii) we used the fact that $\sup_f(a\|f\|-b\|f\|^2)\le a^2/(4b)$ for any norm $\|\cdot\|$ and $a,b>0$. In addition, we also have the lower bound
\[
\begin{aligned}
\sup_{f\in\mathcal F}\widehat\Psi_\gamma(\widehat C,f)&=\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat C,f)-P_n\psi(C_{\mathrm{ora}},f)+P_n\psi(C_{\mathrm{ora}},f)-P_nf^2-\gamma\|f\|_{\mathcal F}^2\big\}\\
&\ge\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat C,f)-P_n\psi(C_{\mathrm{ora}},f)-2(P_nf^2+\gamma\|f\|_{\mathcal F}^2)\big\}+\inf_{f\in\mathcal F}\big\{P_n\psi(C_{\mathrm{ora}},f)+P_nf^2+\gamma\|f\|_{\mathcal F}^2\big\}\\
&=\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat C,f)-P_n\psi(C_{\mathrm{ora}},f)-2(P_nf^2+\gamma\|f\|_{\mathcal F}^2)\big\}-\sup_{f\in\mathcal F}\widehat\Psi_\gamma(C_{\mathrm{ora}},f),
\end{aligned}
\]
where we used the fact that $\mathcal F$ is symmetric. It follows that with probability $1-2\zeta$,
\[
\begin{aligned}
\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat C,f)-P_n\psi(C_{\mathrm{ora}},f)-2(P_nf^2+\gamma\|f\|_{\mathcal F}^2)\big\}&\le\sup_{f\in\mathcal F}\widehat\Psi_\gamma(\widehat C,f)+\sup_{f\in\mathcal F}\widehat\Psi_\gamma(C_{\mathrm{ora}},f)\overset{(i)}{\le}2\sup_{f\in\mathcal F}\widehat\Psi_\gamma(C_{\mathrm{ora}},f)+\nu\big(\|C_{\mathrm{ora}}\|_{\mathcal H}^2-\|\widehat C\|_{\mathcal H}^2\big)\\
&\overset{(ii)}{\le}2\sup_{f\in\mathcal F}\Psi_{\frac18}(C_{\mathrm{ora}},f)+2\cdot2611\tilde\delta_{n,\mathcal F}^2+\nu\big(\|C_{\mathrm{ora}}\|_{\mathcal H}^2-\|\widehat C\|_{\mathcal H}^2\big)\overset{(iii)}{\le}4\epsilon^2+2\cdot2611\tilde\delta_{n,\mathcal F}^2+\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2,
\end{aligned} \tag{C.10}
\]
where (i) holds by (C.8), (ii) holds by (C.9), and (iii) holds by Lemma C.1 and $\mathbb E[(\alpha(Z;C_{\mathrm{ora}})-\alpha)^2]=\epsilon^2$. For any $C\in\mathcal C$, we define
\[
f_C=\arg\min_{f\in\mathcal F_U}\mathbb E\big[(f(Z)-(\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})))^2\big]. \tag{C.11}
\]
By the assumption $\mathbb E[(f_C(Z)-(\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})))^2]\le\eta^2$ for any $C\in\mathcal C$, we have
\[
\begin{aligned}
P\psi(C,f_C)-P\psi(C_{\mathrm{ora}},f_C)&=\mathbb E[f_C(Z)(\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}}))]=\mathbb E[f_C^2(Z)]+\mathbb E[f_C(Z)(\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})-f_C(Z))]\\
&\ge\|f_C\|_{L_2}^2-\|f_C\|_{L_2}\|\alpha(C)-\alpha(C_{\mathrm{ora}})-f_C\|_{L_2}\ge\|f_C\|_{L_2}^2-\|f_C\|_{L_2}\eta,
\end{aligned}
\]
where the first inequality follows from the Cauchy–Schwarz inequality. It follows that
\[
\frac{P\psi(C,f_C)-P\psi(C_{\mathrm{ora}},f_C)}{\|f_C\|_{L_2}}\ge\|f_C\|_{L_2}-\eta\ge\|\alpha(C)-\alpha(C_{\mathrm{ora}})\|_{L_2}-2\eta. \tag{C.12}
\]
Let $r\in[0,1]$; then $rf_{\widehat C}\in\mathcal F_U$ since $\mathcal F_U$ is star-shaped. Let $v^2(C,C_{\mathrm{ora}}):=\mathbb E[|\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})|^2]$ and define the function class
\[
\mathcal G=\big\{(x,z,y)\mapsto f_C(z)(\mathbf 1\{y\notin C(x)\}-\mathbf 1\{y\notin C_{\mathrm{ora}}(x)\}):C\in\mathcal C\big\}. \tag{C.13}
\]
Then with probability $1-2\zeta$,
\[
\begin{aligned}
&\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat C,f)-P_n\psi(C_{\mathrm{ora}},f)-2(P_nf^2+\gamma\|f\|_{\mathcal F}^2)\big\}\ge rP_n\psi(\widehat C,f_{\widehat C})-rP_n\psi(C_{\mathrm{ora}},f_{\widehat C})-2r^2\big(P_nf_{\widehat C}^2+\gamma\|f_{\widehat C}\|_{\mathcal F}^2\big)\\
&\quad\overset{(i)}{\ge}rP_n\psi(\widehat C,f_{\widehat C})-rP_n\psi(C_{\mathrm{ora}},f_{\widehat C})-2r^2\Big(\gamma\|f_{\widehat C}\|_{\mathcal F}^2+\frac74\|f_{\widehat C}\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)\\
&\quad\overset{(ii)}{\ge}r\Big\{P\psi(\widehat C,f_{\widehat C})-P\psi(C_{\mathrm{ora}},f_{\widehat C})-v(\widehat C,C_{\mathrm{ora}})\Big(4\delta_{n,\mathcal G}+2\sqrt{\frac{2\log(R_n/\zeta)}{n}}\Big)-\frac{\log(R_n/\zeta)}{3n}\Big\}-2r^2\Big(\gamma\|f_{\widehat C}\|_{\mathcal F}^2+\frac74\|f_{\widehat C}\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)\\
&\quad\overset{(iii)}{\ge}r\Big\{(v(\widehat C,C_{\mathrm{ora}})-2\eta)\|f_{\widehat C}\|_{L_2}-v(\widehat C,C_{\mathrm{ora}})\Big(4\delta_{n,\mathcal G}+2\sqrt{\frac{2\log(R_n/\zeta)}{n}}\Big)-\frac{\log(R_n/\zeta)}{3n}\Big\}-2r^2\Big(\gamma\|f_{\widehat C}\|_{\mathcal F}^2+\frac74\|f_{\widehat C}\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big),
\end{aligned} \tag{C.14}
\]
where (i) uses (C.6); (ii) holds due to Lemma E.8; and (iii) holds due to (C.12). Here $R_n=\lceil\log_2(1/\delta_{n,\mathcal G})\rceil+1$.
If $4\delta_{n,\mathcal G}+2\sqrt{2\log(R_n/\zeta)/n}\ge\|f_{\widehat C}\|_{L_2}/2$, we have
\[
v(\widehat C,C_{\mathrm{ora}})\le\|f_{\widehat C}\|_{L_2}+\eta\lesssim\delta_{n,\mathcal G}+\sqrt{\frac{\log(R_n/\zeta)}{n}}+\eta.
\]
Otherwise, i.e., when $4\delta_{n,\mathcal G}+2\sqrt{2\log(R_n/\zeta)/n}\le\|f_{\widehat C}\|_{L_2}/2$, combining (C.14) and (C.10), with probability $1-4\zeta$ we get
\[
v(\widehat C,C_{\mathrm{ora}})\le\frac{2}{r\|f_{\widehat C}\|_{L_2}}\big(4\epsilon^2+2\cdot2611\tilde\delta_{n,\mathcal F}^2+\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2\big)+2\eta+\frac{2\log(R_n/\zeta)}{3n\|f_{\widehat C}\|_{L_2}}+\frac{4r}{\|f_{\widehat C}\|_{L_2}}\Big(\gamma\|f_{\widehat C}\|_{\mathcal F}^2+\frac74\|f_{\widehat C}\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big).
\]
If $\|f_{\widehat C}\|_{L_2}\ge\tilde\delta_{n,\mathcal F}\ge\sqrt{\log(c_1/\zeta)/(c_2n)}$, choosing $r=\max\{\epsilon,\tilde\delta_{n,\mathcal F}\}/\|f_{\widehat C}\|_{L_2}$ and using $\gamma\|f\|_{\mathcal F}^2\le\|f\|_{L_2}^2$, we have
\[
v(\widehat C,C_{\mathrm{ora}})\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\eta+\sqrt{\frac{\log(R_n/\zeta)}{n}}+\tilde\delta_{n,\mathcal F}+\tilde\delta_{n,\mathcal F}+\tilde\delta_{n,\mathcal F}\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\eta+\sqrt{\frac{\log(R_n/\zeta)}{n}}.
\]
Therefore,
\[
v(\widehat C,C_{\mathrm{ora}})\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\eta+\sqrt{\frac{\log(R_n/\zeta)}{n}}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}. \tag{C.15}
\]
Otherwise, i.e., when $\|f_{\widehat C}\|_{L_2}\le\tilde\delta_{n,\mathcal F}$, we have $v(\widehat C,C_{\mathrm{ora}})\le\|f_{\widehat C}\|_{L_2}+\eta\lesssim\tilde\delta_{n,\mathcal F}+\eta$. Combining the relations above, we conclude that
\[
v(\widehat C,C_{\mathrm{ora}})\lesssim\epsilon+\delta_{n,\mathcal F_U}+\delta_{n,\mathcal G}+\eta+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\sqrt{\frac{\log(R_n/\zeta)}{n}}, \tag{C.16}
\]
where we used the definition of $\tilde\delta_{n,\mathcal F}$.

C.4 Proof of Theorem 4.2

Proof. By Assumption 3, there exists $H_0(w(X),Z)$ such that $\mathbb P\{T(H_0(w(X),Z),Y)\le0\mid w(X),Z\}=1-\alpha$, and for any $u_1,u_2\in\mathbb R^m$ there exists $l(x,z)$ with $\mathbb E[l^2(w(X),Z)]=\kappa^2<\infty$ such that
\[
\big|\mathbb P\{T(u_1,Y)\le0\mid w(X),Z\}-\mathbb P\{T(u_2,Y)\le0\mid w(X),Z\}\big|\le l(w(X),Z)\|u_1-u_2\|,
\]
where $\|u\|=\sqrt{u^\top u}$. Consider the structured set-valued function class $\mathcal C=\{C(x;h):h\in\mathcal H\}$ defined in (3). Let
\[
\alpha(Z;h):=\mathbb P\{Y\notin C(X;h)\mid Z\}=1-\mathbb E\big[\mathbb P\{T(h(X),Y)\le0\mid w(X),Z\}\mid Z\big],
\]
and $h_0(w(X))=\mathbb E[H_0(w(X),Z)\mid w(X)]$. Then we have
\[
\begin{aligned}
\mathbb E\big[(\alpha(Z;h_0)-\alpha)^2\big]&=\mathbb E\Big[\big\{\mathbb E\big[\mathbb P\{T(H_0(w(X),Z),Y)\le0\mid w(X),Z\}-\mathbb P\{T(h_0(w(X)),Y)\le0\mid w(X),Z\}\mid Z\big]\big\}^2\Big]\\
&\le\kappa^2\,\mathbb E\Big[\big\{\mathbb E\big[\|h_0(w(X))-H_0(w(X),Z)\|\mid Z\big]\big\}^2\Big]\le\kappa^2\,\mathbb E\Big[\big(h_0(w(X))-H_0(w(X),Z)\big)^\top\big(h_0(w(X))-H_0(w(X),Z)\big)\Big]\\
&=\kappa^2\,\mathrm{Tr}\Big(\mathbb E\Big[\mathbb E\big[(h_0(w(X))-H_0(w(X),Z))(h_0(w(X))-H_0(w(X),Z))^\top\mid w(X)\big]\Big]\Big)\\
&=\kappa^2\,\mathrm{Tr}\big(\mathbb E[\mathrm{Cov}(H_0(w(X),Z)\mid w(X))]\big)=\kappa^2\,\mathrm{Tr}\big(\mathrm{Cov}(H_0(w(X),Z))\big)(1-\rho^2),
\end{aligned}
\]
where the second inequality uses Jensen's inequality, the last equality uses the law of total covariance, and $\rho^2=\mathrm{Tr}(\mathrm{Cov}(h_0(w(X))))/\mathrm{Tr}(\mathrm{Cov}(H_0(w(X),Z)))$. By the definition of $h_{\mathrm{ora}}$,
\[
\mathbb E\big[(\alpha(Z;h_{\mathrm{ora}})-\alpha)^2\big]\le\mathbb E\big[(\alpha(Z;h_0)-\alpha)^2\big]\le\kappa^2\,\mathrm{Tr}\big(\mathrm{Cov}(H_0(w(X),Z))\big)(1-\rho^2).
\]

C.4.1 Discussion of Assumption 3

For the test-conditional coverage setting in Example 2.1, Assumption 3 is required to hold with $w(X)=X$. We take $Z=X$ and consider the pretrained sublevel prediction sets in (7), so that $T(h(X),Y)=s(X,Y)-h(X)$. In this case, if the conditional distribution function of $s(X,Y)$ given $X$ is Lipschitz, then for any $h_1,h_2\in\mathcal H$,
\[
\big|\mathbb P\{T(h_1(X),Y)\le0\mid w(X),Z\}-\mathbb P\{T(h_2(X),Y)\le0\mid w(X),Z\}\big|=\big|\mathbb P\{s(X,Y)\le h_1(X)\mid X\}-\mathbb P\{s(X,Y)\le h_2(X)\mid X\}\big|\lesssim|h_1(X)-h_2(X)|.
\]
Hence, Assumption 3(1) holds. Let $h_0(X)=\mathbb E[H_0(X)\mid X]=H_0(X)=\inf\{t:\mathbb P\{s(X,Y)\le t\mid X\}\ge1-\alpha\}$.
Therefore, $\mathbb P\{s(X,Y)\le H_0(X)\mid X\}=1-\alpha$ since the conditional distribution function of $s(X,Y)$ given $X$ is Lipschitz continuous, and Assumption 3(2) holds. Then $h_0\in\mathcal H$ provided that $\mathcal H$ is sufficiently rich; for example, $\mathcal H$ may be chosen as a function class capable of approximating arbitrary continuous functions, such as an RKHS with a universal kernel.

For the group-conditional coverage setting in Example 2.2, Assumption 3 is required to hold with $w(X)=(\mathbf 1\{X\in G_1\},\ldots,\mathbf 1\{X\in G_K\})^\top$. We take $Z=w(X)$ and $\mathcal H=\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$, and consider the pretrained sublevel prediction sets in (7), so that $T(h(X),Y)=s(X,Y)-h(X)$. If the conditional distribution function of $s(X,Y)$ given $Z$ is Lipschitz, then for any $h_1,h_2\in\mathcal H$,
\[
\big|\mathbb P\{T(h_1(X),Y)\le0\mid w(X),Z\}-\mathbb P\{T(h_2(X),Y)\le0\mid w(X),Z\}\big|=\big|\mathbb P\{s(X,Y)\le h_1(X)\mid Z\}-\mathbb P\{s(X,Y)\le h_2(X)\mid Z\}\big|\lesssim|h_1(X)-h_2(X)|.
\]
Hence, Assumption 3(1) holds. Moreover, letting $q_z=\inf\{t:\mathbb P\{s(X,Y)\le t\mid Z=z\}\ge1-\alpha\}$ for $z\in\mathcal Z$, we have
\[
h_0(w(X))=\mathbb E[H_0(Z)\mid w(X)]=H_0(Z)=\sum_{z\in\mathcal Z}q_z\mathbf 1\{Z=z\}.
\]
Therefore, $\mathbb P\{s(X,Y)\le H_0(Z)\mid Z\}=1-\alpha$ since the conditional distribution function of $s(X,Y)$ given $Z$ is Lipschitz continuous, and Assumption 3(2) holds. Moreover, $h_0\in\mathcal H$ by the definition of $\mathcal H$, so Assumption 3(3) holds.

For the equalized coverage setting in Example 2.3, we assume there exists a $w(x)$ that satisfies Assumption 3; then $h_0(w(X))=\mathbb E[H_0(w(X),Z)\mid w(X)]$, and $h_0(w(X))\in\mathcal H$ if $\mathcal H$ is sufficiently rich.

C.5 Proofs of Theorems 4.3 and 4.4

By Theorem 4.1, the convergence rate of the MSCE is primarily governed by the critical radii $\delta_{n,\mathcal F_U}$ and $\delta_{n,\mathcal G}$ associated with the function classes $\mathcal F$ and $\mathcal G$. Accordingly, the following theorem characterizes the orders of $\delta_{n,\mathcal F_U}$ and $\delta_{n,\mathcal G}$ in the setting where $\mathcal C$ is a VC class with VC dimension $d_{\mathcal C}$, $\mathcal C$ is chosen to be the shape-constrained prediction set (3), and $C_{\mathrm{ora}}(X)=\{y\in\mathcal Y:T(h_{\mathrm{ora}}(X),y)\le0\}$.

Theorem C.2. Let $\mathcal C$ be a VC class with VC dimension $d_{\mathcal C}$ and consider the shape-constrained prediction set (3). Then we obtain the following results:

(1) If $\mathcal F$ is a VC class with VC dimension $d_{\mathcal F}$, then
\[
\delta_{n,\mathcal F_U}\asymp\sqrt{\frac{d_{\mathcal F}}{n}\log\frac{n}{d_{\mathcal F}}},\qquad\delta_{n,\mathcal G}\asymp\sqrt{\frac{d_{\mathcal F}\vee d_{\mathcal C}}{n}\log\frac{n}{d_{\mathcal F}\vee d_{\mathcal C}}}.
\]

(2) If $\mathcal F$ is an RKHS equipped with a Gaussian kernel and $\mathcal Z=[0,1]^{d_Z}$, then
\[
\delta_{n,\mathcal F_U}\asymp\sqrt{\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}},\qquad\delta_{n,\mathcal G}\asymp\sqrt{\frac{d_{\mathcal C}\log(n/d_{\mathcal C})\vee d_Z\log^{d_Z+1}(n/d_Z)}{n}}.
\]

Proof of Theorem C.2. We bound $\delta_{n,\mathcal F_U}$ and $\delta_{n,\mathcal G}$ respectively.

Step 1: upper bound for $\delta_{n,\mathcal F_U}$.

Case (a): $\mathcal F$ is a VC class. Since $\mathcal F_U=\{f\in\mathcal F:\|f\|_{\mathcal F}^2\le U\}$ is a subset of $\mathcal F$, it is straightforward to verify that for any pair of function classes $\mathcal U\subset\mathcal V$ the corresponding critical radii satisfy $\delta_{n,\mathcal U}\le\delta_{n,\mathcal V}$. Hence, it suffices to bound $\delta_{\mathcal F}$. Let $d_{\mathcal F}$ be the VC dimension of $\mathcal F$. Using Theorem 2.6.4 in Van der Vaart and Wellner (2023), we can bound the covering number of $\mathcal F$ by its VC dimension $d_{\mathcal F}$:
\[
N(\rho,\mathcal F,\|\cdot\|_{L_2})\le C\,d_{\mathcal F}(16e)^{d_{\mathcal F}}(1/\rho)^{2d_{\mathcal F}}, \tag{C.17}
\]
where $C$ is a universal constant.
Now, using Dudley's entropy integral (e.g., Theorem 5.22 in Wainwright (2019)), we can bound the localized Rademacher complexity $\mathcal R_n(\delta;\mathcal F)$ by
\[
\mathcal R_n(\delta;\mathcal F)=\mathbb E\Bigg[\sup_{\substack{f\in\mathcal F\\\|f\|_{L_2}\le\delta}}\Big|\frac1n\sum_{i=1}^n\varepsilon_if(Z_i)\Big|\Bigg]\le\frac{C_0}{\sqrt n}\int_0^\delta\sqrt{\log N(\rho,\mathcal F,\|\cdot\|_{L_2})}\,d\rho, \tag{C.18}
\]
where $C_0$ is a universal constant. Hence, plugging (C.17) into (C.18), we have
\[
\begin{aligned}
\mathcal R_n(\delta;\mathcal F)&\le\frac{C_0}{\sqrt n}\int_0^\delta\sqrt{\log(Cd_{\mathcal F})+d_{\mathcal F}\log(16e)+2d_{\mathcal F}\log(1/\rho)}\,d\rho\\
&\le\frac{C_0}{\sqrt n}\Big(\int_0^\delta\sqrt{2d_{\mathcal F}\log(1/\rho)}\,d\rho+\delta\sqrt{\log(Cd_{\mathcal F})+d_{\mathcal F}\log(16e)}\Big)\\
&\le\delta\sqrt{\frac{C_1d_{\mathcal F}\log(1/\delta^2)}{n}}+\delta\sqrt{\frac{C_2(\log d_{\mathcal F}+d_{\mathcal F})}{n}}\le\delta\sqrt{\frac{C_3d_{\mathcal F}\log(1/\delta^2)}{n}},
\end{aligned} \tag{C.19}
\]
where the last inequality holds for sufficiently small $\delta$. Since $\delta_{\mathcal F}$ is a solution of $\mathcal R_n(\delta;\mathcal F)\lesssim\delta^2$, we only require
\[
\delta\sqrt{\frac{d_{\mathcal F}\log(1/\delta^2)}{n}}\lesssim\delta^2\iff\frac{d_{\mathcal F}\log(1/\delta^2)}{n}\lesssim\delta^2.
\]
Considering the fixed-point condition, let $t=1/\delta^2$ and examine the root of the equation $t\log t=n/d_{\mathcal F}$. Taking the logarithm on both sides, we have $\log t+\log\log t=\log(n/d_{\mathcal F})$. By Lemma E.1 with $L_n=\log(n/d_{\mathcal F})$, we have
\[
\frac{1}{\delta_{\mathcal F}^2}\asymp\exp(L_n-\log L_n)=\frac{n/d_{\mathcal F}}{\log(n/d_{\mathcal F})}.
\]
Hence, it follows that $\delta_{n,\mathcal F_U}\lesssim\delta_{\mathcal F}\asymp\sqrt{\frac{d_{\mathcal F}}{n}\log\frac{n}{d_{\mathcal F}}}$.

Case (b): $\mathcal F$ is an RKHS equipped with a Gaussian kernel. Here $\mathcal F_U=\{f\in\mathcal F:\|f\|_{\mathcal F}^2\le U\}$ is a Gaussian-kernel RKHS ball with $\|f\|_{\mathcal F}\le\sqrt U$ and $\mathcal Z=[0,1]^{d_Z}$; by Proposition 1 of Zhou (2002),
\[
\log N(\rho,\mathcal F_U,\|\cdot\|_{L_2})\le4d_Z(6d_Z+2)\big(\log(U/\rho)\big)^{d_Z+1}. \tag{C.20}
\]
Substituting (C.20) into (C.18),
\[
\mathcal R_n(\delta;\mathcal F_U)\le\frac{C_0}{\sqrt n}\int_0^\delta\sqrt{4d_Z(6d_Z+2)(\log(U/\rho))^{d_Z+1}}\,d\rho\le C_0\delta\sqrt{\frac{(2d_Z+1)\big(4\log(U^2/\delta^2)\big)^{d_Z+1}}{n}}. \tag{C.21}
\]
Since $\delta_{\mathcal F_U}$ is a solution of $\mathcal R_n(\delta;\mathcal F_U)\lesssim\delta^2$, we only require
\[
\delta\sqrt{\frac{d_Z}{n}\log^{d_Z+1}(U^2/\delta^2)}\lesssim\delta^2\iff\frac{d_Z}{n}\log^{d_Z+1}(U^2/\delta^2)\lesssim\delta^2.
\]
Considering the fixed-point condition, let $t=U^2/\delta^2$ and examine the root of the equation $t(\log t)^{d_Z+1}=nU^2/d_Z$. Taking the logarithm on both sides, we have $\log t+(d_Z+1)\log\log t=\log(nU^2/d_Z)$. By Lemma E.1 with $L_n=\log(nU^2/d_Z)$, we have
\[
\frac{U^2}{\delta_{\mathcal F_U}^2}\asymp\exp\big\{\log(nU^2/d_Z)-(d_Z+1)\log\log(nU^2/d_Z)\big\}=\frac{nU^2/d_Z}{\log^{d_Z+1}(nU^2/d_Z)}. \tag{C.22}
\]
Hence, $\delta_{n,\mathcal F_U}\asymp\sqrt{\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}}$.

Step 2: upper bound for $\delta_{n,\mathcal G}$. Since we consider the sublevel set (7), by the definition of $\mathcal G$ in (C.13), each $g_h\in\mathcal G$ can be written as
\[
g_h(X,Z,Y)=f_h(Z)\big(\mathbf 1\{T(h(X),Y)\le0\}-\mathbf 1\{T(h_{\mathrm{ora}}(X),Y)\le0\}\big),
\]
where $f_h=\arg\min_{f\in\mathcal F_U}\mathbb E[(f-(\alpha(Z;h)-\alpha(Z;h_{\mathrm{ora}})))^2]$. We define the function class
\[
\mathcal G^*:=\big\{f(Z)\big(\mathbf 1\{T(h(X),Y)\le0\}-\mathbf 1\{T(h_{\mathrm{ora}}(X),Y)\le0\}\big):f\in\mathcal F_U,h\in\mathcal H\big\}.
\]
Note that $\mathcal G\subseteq\mathcal G^*$, so $\mathcal R_n(\delta;\mathcal G)\le\mathcal R_n(\delta;\mathcal G^*)$. Let $\gamma_h(x,y)=\mathbf 1\{T(h(x),y)\le0\}-\mathbf 1\{T(h_{\mathrm{ora}}(x),y)\le0\}$ and $\Gamma_{\mathcal H}=\{\gamma_h:h\in\mathcal H\}$. By Assumption 4, the VC dimension of $\Gamma_{\mathcal H}$ is also $O(d_{\mathcal C})$. By Lemma E.5 and $\|f\|_\infty\le1$, $\|\gamma_h\|_\infty\le1$, we have
\[
\log N(\rho,\mathcal G^*,\|\cdot\|_{L_2})\le\log N(\rho/2,\mathcal F_U,\|\cdot\|_{L_2})+\log N(\rho/2,\Gamma_{\mathcal H},\|\cdot\|_{L_2}).
\]
Together with (C.17) and (C.18), we have
\[
\begin{aligned}
\mathcal R_n(\delta;\mathcal G)\le\mathcal R_n(\delta;\mathcal G^*)&\le\frac{C_0}{\sqrt n}\int_0^\delta\sqrt{\log N(\rho,\mathcal G^*,\|\cdot\|_{L_2})}\,d\rho\\
&\le\frac{2}{\sqrt n}\int_0^{\delta/2}\sqrt{\log N(\rho,\mathcal F_U,\|\cdot\|_{L_2})}\,d\rho+\frac{2}{\sqrt n}\int_0^{\delta/2}\sqrt{\log N(\rho,\Gamma_{\mathcal H},\|\cdot\|_{L_2})}\,d\rho\\
&\le\frac{2}{\sqrt n}\int_0^{\delta/2}\sqrt{\log N(\rho,\mathcal F_U,\|\cdot\|_{L_2})}\,d\rho+\delta\sqrt{\frac{C_5d_{\mathcal C}\log(4/\delta^2)}{n}}.
\end{aligned} \tag{C.23}
\]
Case (a): $\mathcal F$ is a VC class.
Combining (C.19) and (C.23), we have
\[
\mathcal R_n(\delta;\mathcal G)\le\delta\sqrt{\frac{C_6(d_{\mathcal C}\vee d_{\mathcal F})\log(4/\delta^2)}{n}}.
\]
Considering the fixed-point condition, let $t=4/\delta^2$ and $L=\log\frac{n}{d_{\mathcal F}\vee d_{\mathcal C}}$. By Lemma E.1,
\[
\frac{1}{\delta_{\mathcal G}^2}\asymp\exp(L-\log L)=\frac{n/(d_{\mathcal F}\vee d_{\mathcal C})}{\log(n/(d_{\mathcal F}\vee d_{\mathcal C}))}\ \Longrightarrow\ \delta_{n,\mathcal G}\lesssim\sqrt{\frac{d_{\mathcal F}\vee d_{\mathcal C}}{n}\log\Big(\frac{n}{d_{\mathcal F}\vee d_{\mathcal C}}\Big)}.
\]
Case (b): $\mathcal F$ is an RKHS equipped with a Gaussian kernel. Combining (C.21) and (C.23), we have
\[
\mathcal R_n(\delta;\mathcal G)\le\delta\sqrt{\frac{C_4d_Z\log^{d_Z+1}(4U^2/\delta^2)}{n}}+\delta\sqrt{\frac{C_5d_{\mathcal C}\log(4/\delta^2)}{n}}.
\]
Since we require $\mathcal R_n(\delta;\mathcal G)\lesssim\delta^2$, using Lemma E.1 it suffices that
\[
\frac{d_Z}{n}\log^{d_Z+1}(U^2/\delta^2)\lesssim\frac{\delta^2}{2}\quad\text{and}\quad\frac{d_{\mathcal C}}{n}\log\frac{1}{\delta^2}\lesssim\frac{\delta^2}{2}\ \Longrightarrow\ \delta_{n,\mathcal G}\asymp\sqrt{\frac{d_{\mathcal C}\log(n/d_{\mathcal C})\vee d_Z\log^{d_Z+1}(n/d_Z)}{n}}.
\]

Proof of Theorem 4.3. Since we choose $\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R\}$, for any $C\in\mathcal C$ the function $\alpha(z;C)$ takes at most $|\mathcal Z|$ distinct values, which indicates that $\alpha(\cdot;C)\in\mathcal F$. Therefore, by the definitions of $\mathcal F$ and $f_C=\arg\min_{f\in\mathcal F_U}\mathbb E[(f(Z)-(\alpha(Z;C_{\mathrm{ora}})-\alpha(Z;C)))^2]$, we have $\mathbb E[(f_C(Z)-(\alpha(Z;C_{\mathrm{ora}})-\alpha(Z;C)))^2]=0$, and hence $\eta=0$. By Theorems 4.1 and C.2, choosing $\gamma=0$, there exists $C>0$ such that with probability $1-4\zeta$:
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n\big]\le C\Big(\epsilon^2+\frac{d_{\mathcal F}\vee d_{\mathcal C}}{n}\log\Big(\frac{n}{d_{\mathcal F}\vee d_{\mathcal C}}\Big)+\frac{\log((\log n)/\zeta)}{n}\Big):=A_n.
\]
Letting $\mathcal E=\{\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n]\le A_n\}$ and using that the conditional expectation is bounded by one,
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]=\mathbb E\big[\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n]\mathbf 1_{\mathcal E}\big]+\mathbb E\big[\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n]\mathbf 1_{\mathcal E^c}\big]\le A_n+4\zeta.
\]
By choosing $\zeta=1/n$, we have
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\lesssim\epsilon^2+\frac{d_{\mathcal F}\vee d_{\mathcal C}}{n}\log\Big(\frac{n}{d_{\mathcal F}\vee d_{\mathcal C}}\Big).
\]

Proof of Theorem 4.4. By Theorems 4.1 and C.2, choosing $\gamma\ge0$ such that $\gamma\|f\|_{\mathcal F}^2\le\|f\|_{L_2}^2$ and combining all the results above, there exists $C>0$ such that with probability $1-4\zeta$:
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n\big]\le C\Big(\epsilon^2+\eta^2+\frac{d_{\mathcal C}\log(n/d_{\mathcal C})\vee d_Z\log^{d_Z+1}(n/d_Z)}{n}+\frac{\log((\log n)/\zeta)}{n}\Big):=A_n.
\]
By the same conditioning argument as in the proof of Theorem 4.3, $\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2]\le A_n+4\zeta$, and choosing $\zeta=1/n$ gives
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\lesssim\epsilon^2+\eta^2+\frac{d_{\mathcal C}\log(n/d_{\mathcal C})\vee d_Z\log^{d_Z+1}(n/d_Z)}{n}.
\]

C.6 Proof of Corollary 4.1

Proof. Since $Z=(\mathbf 1\{X\in G_1\},\ldots,\mathbf 1\{X\in G_K\})^\top$ and $w(X)=Z$, we have $h_0(w(X))=\mathbb E[H_0(w(X),Z)\mid w(X)]=H_0(w(X),Z)$, which is a function of $Z$ and takes at most $|\mathcal Z|$ distinct values. Hence, by choosing $\mathcal H=\mathcal F=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$, we have $h_0\in\mathcal H$, so Assumption 3 holds and, by Theorem 4.2, $\epsilon=0$. By the definition of $Z$, for any $k\in[K]$ there exists some $z_k\in\mathcal Z$ such that $\mathbf 1\{X\in G_k\}=\mathbf 1\{Z=z_k\}$, and thus $\mathbb P\{Y\notin\widehat C(X;h)\mid X\in G_k\}=\mathbb P\{Y\notin\widehat C(X;h)\mid Z=z_k\}$. Note that
\[
\begin{aligned}
\big(\mathbb P\{Y\notin\widehat C\mid X\in G_k\}-\alpha\big)^2\,\mathbb P\{X\in G_k\}&\le\sum_{k=1}^K\big(\mathbb P\{Y\notin\widehat C\mid X\in G_k\}-\alpha\big)^2\,\mathbb P\{X\in G_k\}=\sum_{k=1}^K\big(\mathbb P\{Y\notin\widehat C\mid Z=z_k\}-\alpha\big)^2\,\mathbb P\{Z=z_k\}\\
&\overset{(i)}{\le}\sum_{z\in\mathcal Z}\big(\alpha(z;\widehat C)-\alpha\big)^2\,\mathbb P\{Z=z\}=\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big],
\end{aligned}
\]
where (i) holds because the groups $G_k$ may overlap; when the groups are disjoint, the inequality becomes an equality.
Hence,
\[
\big|\mathbb P\{Y\notin\widehat C\mid X\in G_k\}-\alpha\big|\le\sqrt{\frac{\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2]}{\mathbb P\{X\in G_k\}}},
\]
and by Theorem 4.3 we have
\[
\big|\mathbb P\{Y\in\widehat C(X)\mid X\in G_k\}-(1-\alpha)\big|\lesssim\sqrt{\frac{d_{\mathcal C}+|\mathcal Z|}{n}\log\frac{n}{d_{\mathcal C}+|\mathcal Z|}}.
\]
Moreover, for the pretrained sublevel set (7), $T(h(X),Y)=s(X,Y)-h(X)$, and then $\{Y\notin C(X;h)\}=\{s(X,Y)>h(X)\}$ is a subgraph event for all $h\in\mathcal H$. Since $\mathcal H=\{\sum_{z\in\mathcal Z}\beta_z\mathbf 1\{Z=z\}:\beta_z\in\mathbb R, z\in\mathcal Z\}$ has VC dimension $d_{\mathcal H}\le|\mathcal Z|$, the class $\{(x,y):s(x,y)>h(x),h\in\mathcal H\}$ is a VC-subgraph class with VC dimension $O(|\mathcal Z|)$.

C.7 Proof of Corollary 4.2

Proof. Since Assumption 3 holds for some $w(X)$, by Theorem 4.2 we have
\[
\mathrm{MSCE}(C_{\mathrm{ora}})\le\kappa^2(1-\rho^2)\,\mathrm{Tr}\big(\mathrm{Cov}\{H_0(w(X),Z)\}\big)\lesssim1-\rho^2.
\]
For each $z\in\mathcal Z$, note that $\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2]=\sum_{z\in\mathcal Z}(\alpha(z;\widehat C)-\alpha)^2\,\mathbb P\{Z=z\}$. Hence,
\[
\big|\mathbb P\{Y\notin\widehat C\mid Z=z\}-\alpha\big|\le\sqrt{\frac{\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2]}{\mathbb P\{Z=z\}}},
\]
and by Theorem 4.3 we have
\[
\big|\mathbb P\{Y\in\widehat C(X)\mid Z=z\}-(1-\alpha)\big|\lesssim\sqrt{\frac{1-\rho^2}{\mathbb P\{Z=z\}}}+\sqrt{\frac{d_{\mathcal C}+|\mathcal Z|}{n\,\mathbb P(Z=z)}\log\Big(\frac{n}{d_{\mathcal C}+|\mathcal Z|}\Big)}.
\]

C.8 Proof of Corollary 4.3

Proof. Since Assumption 3 holds with $w(X)=X$ and $Z=X$, by Theorem 4.2 we have $\epsilon=0$. Moreover, since we assume $\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})\in\mathcal F_U$, by the definition of $f_C=\arg\min_{f\in\mathcal F}\mathbb E[(f(Z)-(\alpha(Z;C)-\alpha(Z;C_{\mathrm{ora}})))^2]$ we have $\eta=0$. Hence, by Theorem 4.4,
\[
\mathbb E\big[(\alpha(X_{n+1};\widehat C)-\alpha)^2\big]\lesssim\frac{d_{\mathcal C}\log(n/d_{\mathcal C})}{n}+\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}.
\]

C.9 Multi-group coverage guarantee

In addition to the MSCE results, we also have the following multi-group coverage guarantee for MOPI.

Theorem C.3 (Multi-group coverage). Under the same settings as Theorem 4.1, and assuming $f$ is bounded by some constant $B>0$ for all $f\in\mathcal F$, with probability at least $1-4\zeta$:
\[
\mathbb E\big[f(Z)\big(\mathbf 1\{Y\notin\widehat C(X)\}-\alpha\big)\mid\mathcal D_n\big]\lesssim\big(\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n]\big)^{1/2}.
\]

Proof. Since $\|f\|_\infty\le B$, we have $\sup_{f\in\mathcal F}\|f\|_{L_2}\le B$. Note that $\mathbb E[f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)]=\mathbb E[f(Z)(\alpha(Z;C)-\alpha)]$; by the Cauchy–Schwarz inequality,
\[
\sup_{f\in\mathcal F}\mathbb E\big[f(Z)(\mathbf 1\{Y\notin C(X)\}-\alpha)\big]\le\sup_{f\in\mathcal F}\|f\|_{L_2}\big(\mathbb E[(\alpha(Z;C)-\alpha)^2]\big)^{1/2}
\]
for all $C\in\mathcal C$.

D Infinite-dimensional set-valued function class

Now we consider the case where $\mathcal C$ is an infinite-dimensional set-valued function space, e.g., $\mathcal C=\{C(x;h):h\in\mathcal H\}$ with $\mathcal H$ an RKHS. In this case, let $\|\cdot\|_{\mathcal H}$ be the norm of the space $\mathcal H$. We then construct the empirical MOPI prediction set via
\[
\widehat C=\arg\min_{C\in\mathcal C}\max_{f\in\mathcal F}\big\{\widehat\Psi(C,f)-\gamma\|f\|_{\mathcal F}^2+\nu\|C\|_{\mathcal H}^2\big\}, \tag{D.1}
\]
where $\nu\ge0$ is a regularization parameter for $C\in\mathcal C$, and $\|C\|_{\mathcal H}=\|h\|_{\mathcal H}$ for $C(x)=C(x;h)$ with $h\in\mathcal H$. Similar to Lemmas 3.1 and 3.2, we can obtain closed forms of the inner maximization in (D.1) under the corresponding choices of $\mathcal F$.

Next, we present the upper bound on the MSCE for the set-valued function class defined in (3) when $\mathcal F$ and $\mathcal H$ are both RKHSs with Gaussian kernels. For simplicity, we assume $\mathcal Z=[0,1]^{d_Z}$ and $\mathcal X=[0,1]^{d_X}$, where $d_Z$ and $d_X$ are the dimensions.

Theorem D.1. Suppose Assumption 2 holds for the structured set-valued class $\mathcal C$ in (3). Assume the marginal density of $T(h(X),Y)$ is bounded above for any $h\in\mathcal H$, and that $T(\cdot,\cdot)$ is Lipschitz continuous with respect to its first variable.
If the hyperparameters $\gamma$ and $\nu$ in the optimization problem (D.1) are chosen such that $\gamma\|f\|_{\mathcal F}^2\le\|f\|_{L_2}^2$ for all $f\in\mathcal F$ and $\nu\lesssim\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}$, then for sufficiently large $n$ we have
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\lesssim\mathrm{MSCE}(C_{\mathrm{ora}})+\eta^2+\frac{md_X\log^{d_X+1}(mn/d_X)}{n}+\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}.
\]
The establishment of this MSCE bound differs from that in Section 4.1, where $\mathcal C$ is finite-dimensional. When $\mathcal C$ is infinite-dimensional, its VC dimension is also infinite, so the complexity of the induced class $\mathcal G$ can no longer be controlled via VC theory as we did in Section 4.3. In addition, we cannot directly apply standard contraction arguments (e.g., Proposition 5.28 of Wainwright (2019)) to $\mathcal G$, because the functions in $\mathcal G$ are not Lipschitz due to the presence of indicator functions. To address this challenge, we use the margin cost functions from the generalization theory of classification (Mason et al., 2000; Koltchinskii and Panchenko, 2002), which are Lipschitz continuous and can approximate indicator functions. The final bounds are obtained by optimizing the margin parameter to achieve the fastest convergence rate.

D.1 Proof of Theorem D.1

In this section, we consider $\mathcal F$ and $\mathcal H$ both RKHSs with Gaussian kernels and $C(x;h)=\{y\in\mathcal Y:T(h(x),y)\le0\}$. Formally, let $K_Z(\cdot,\cdot):\mathcal Z\times\mathcal Z\to\mathbb R$ and $K_X(\cdot,\cdot):\mathcal X\times\mathcal X\to\mathbb R$ be Gaussian kernels, and let $\mathcal K_{K_Z}$, $\mathcal K_{K_X}$ denote the RKHSs equipped with kernels $K_Z$ and $K_X$, respectively. Let $\mathcal F=\mathcal K_{K_Z}$ and
\[
\mathcal H=\big\{h=(h_1,\ldots,h_m)^\top:h_j\in\mathcal K_{K_X},\ \|h_j\|_{\mathcal K_{K_X}}^2\le U_X,\ j\in[m]\big\}. \tag{D.2}
\]
Consider the shape-constrained prediction set (3) and let $C_{\mathrm{ora}}(X)=\{y\in\mathcal Y:T(h_{\mathrm{ora}}(X),y)\le0\}$. In line with the proof of Theorem C.1, we define the following notation. Let $\psi(h,f)=f(z)(\mathbf 1\{y\notin C(x;h)\}-\alpha)=f(z)((1-\mathbf 1\{T(h(x),y)\le0\})-\alpha)$, $v(h,h_{\mathrm{ora}})=v(C,C_{\mathrm{ora}})=\|\alpha(Z;C(X;h))-\alpha(Z;C(X;h_{\mathrm{ora}}))\|_{L_2}$, and
\[
f_h=\arg\min_{f\in\mathcal F_U}\mathbb E\big[(f(Z)-\{\alpha(Z;C(X;h))-\alpha(Z;C(X;h_{\mathrm{ora}}))\})^2\big].
\]
For any $C\in\mathcal C$, define $\|C\|_{\mathcal H}^2:=\sum_{j=1}^m\|h_j\|_{\mathcal K_{K_X}}^2$.

Proof. Note that
\[
\mathbf 1\{y\notin C(x;h)\}-\mathbf 1\{y\notin C(x;h_{\mathrm{ora}})\}=\mathbf 1\{T(h_{\mathrm{ora}}(x),y)\le0\}-\mathbf 1\{T(h(x),y)\le0\}=:D(h(x),y).
\]
Denote $\delta_n(\sigma_\ell)=\max\{\delta_{n,\mathcal G^+_{\varphi_{\sigma_\ell}}},\delta_{n,\mathcal G^+_{\bar\varphi_{\sigma_\ell}}},\delta_{n,\mathcal G^-_{\varphi_{\sigma_\ell}}},\delta_{n,\mathcal G^-_{\bar\varphi_{\sigma_\ell}}}\}$ and $R_n(\sigma_\ell)=\lceil\log_2(1/\delta_n(\sigma_\ell))\rceil+1$, and define the sequence $\sigma_\ell=2^{-\ell+1}$ for $\ell\ge1$. Taking $x=\log(4R_n(\sigma_\ell)/\zeta)$ in Lemma D.1, we can guarantee that with probability $1-\zeta$, for all $h\in\mathcal H$,
\[
|(P_n-P)\{D(h)f_h\}|\le2\inf_{\ell\ge1}\Big\{\Delta(h,\sigma_\ell)+\bar\Delta(h,\sigma_\ell)+\frac{2\log(4R_n(\sigma_\ell)/\zeta)}{3n}+4v(h,h_{\mathrm{ora}})\Big(2\delta_n(\sigma_\ell)+\sqrt{\frac{2\log(4R_n(\sigma_\ell)/\zeta)}{n}}\Big)\Big\},
\]
where we also used the fact that $\|f_h\|_{L_2}\le2v(h,h_{\mathrm{ora}})$. By Lemma D.2, we have
\[
\Delta(h,\sigma_\ell),\ \bar\Delta(h,\sigma_\ell)\lesssim\sigma_\ell\quad\text{and}\quad\delta_n(\sigma_\ell)\lesssim\sqrt{\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}}+\sqrt{\frac{md_X\log^{d_X+1}\big(\frac{mn}{\sigma_\ell^2d_X}\big)}{n}}.
\]
Hence, taking $\ell=\lceil\log_2n\rceil$, we have $\Delta(h,\sigma_\ell),\bar\Delta(h,\sigma_\ell)\lesssim1/n$,
\[
\delta_n(\sigma_\ell)\lesssim\sqrt{\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}}+\sqrt{\frac{md_X\log^{d_X+1}(mn/d_X)}{n}}:=\bar\delta_{n,\mathcal G},
\]
and $R_n(\sigma_\ell)\lesssim\log_2n$. Thus, for all $h\in\mathcal H$, with probability $1-\zeta$,
\[
|(P_n-P)\{D(h)f_h\}|\le O(1)\Big\{\frac1n+\frac{\log(\log_2n/\zeta)}{n}+v(h,h_{\mathrm{ora}})\Big(\bar\delta_{n,\mathcal G}+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}\Big)\Big\}. \tag{D.3}
\]
Similar to (C.14), using (D.3) we have
\[
\begin{aligned}
&\sup_{f\in\mathcal F}\big\{P_n\psi(\widehat h,f)-P_n\psi(h_{\mathrm{ora}},f)-2(P_nf^2+\gamma\|f\|_{\mathcal F}^2)\big\}\ge rP_n\psi(\widehat h,f_{\widehat h})-rP_n\psi(h_{\mathrm{ora}},f_{\widehat h})-2r^2\big(P_nf_{\widehat h}^2+\gamma\|f_{\widehat h}\|_{\mathcal F}^2\big)\\
&\quad\ge r(v(\widehat h,h_{\mathrm{ora}})-2\eta)\|f_{\widehat h}\|_{L_2}-2r^2\Big(\gamma\|f_{\widehat h}\|_{\mathcal F}^2+\frac74\|f_{\widehat h}\|_{L_2}^2+\frac{\tilde\delta_{n,\mathcal F}^2}{2}\Big)-rO(1)\Big\{\frac1n+\frac{\log(\log_2n/\zeta)}{n}+v(\widehat h,h_{\mathrm{ora}})\Big(\bar\delta_{n,\mathcal G}+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}\Big)\Big\}.
\end{aligned}
\]
If $2\bar\delta_{n,\mathcal G}+\sqrt{2\log(\log_2n/\zeta)/n}\ge\|f_{\widehat h}\|_{L_2}/2$, we have
\[
v(\widehat h,h_{\mathrm{ora}})\le\|f_{\widehat h}\|_{L_2}+\eta\lesssim\bar\delta_{n,\mathcal G}+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}+\eta.
\]
Otherwise, i.e., when $2\bar\delta_{n,\mathcal G}+\sqrt{2\log(\log_2n/\zeta)/n}\le\|f_{\widehat h}\|_{L_2}/2$, using (C.10), with probability $1-4\zeta$ we get
\[
v(\widehat h,h_{\mathrm{ora}})\lesssim\frac{1}{r\|f_{\widehat h}\|_{L_2}}\big(\epsilon^2+\tilde\delta_{n,\mathcal F}^2+\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2\big)+\eta+\frac{1}{n\|f_{\widehat h}\|_{L_2}}+\frac{\log(\log_2n/\zeta)}{n\|f_{\widehat h}\|_{L_2}}+\frac{r}{\|f_{\widehat h}\|_{L_2}}\big(\gamma\|f_{\widehat h}\|_{\mathcal F}^2+\|f_{\widehat h}\|_{L_2}^2+\tilde\delta_{n,\mathcal F}^2\big).
\]
If $\|f_{\widehat h}\|_{L_2}\ge\tilde\delta_{n,\mathcal F}\ge\sqrt{\log(c_1/\zeta)/(c_2n)}$, choosing $r=\max\{\epsilon,\tilde\delta_{n,\mathcal F}\}/\|f_{\widehat h}\|_{L_2}$ and using $\gamma\|f\|_{\mathcal F}^2\le\|f\|_{L_2}^2$, we have
\[
v(\widehat h,h_{\mathrm{ora}})\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\eta+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}+\tilde\delta_{n,\mathcal F}+\tilde\delta_{n,\mathcal F}+\tilde\delta_{n,\mathcal F}\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\eta+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}.
\]
Therefore,
\[
v(\widehat h,h_{\mathrm{ora}})\lesssim\epsilon+\tilde\delta_{n,\mathcal F}+\eta+\sqrt{\frac{\log(\log_2n/\zeta)}{n}}+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}. \tag{D.4}
\]
Otherwise, i.e., when $\|f_{\widehat h}\|_{L_2}\le\tilde\delta_{n,\mathcal F}$, we have $v(\widehat h,h_{\mathrm{ora}})\le\|f_{\widehat h}\|_{L_2}+\eta\lesssim\tilde\delta_{n,\mathcal F}+\eta$. Combining the relations above, we conclude that
\[
v(\widehat h,h_{\mathrm{ora}})\lesssim\epsilon+\delta_{n,\mathcal F_U}+\bar\delta_{n,\mathcal G}+\eta+\frac{\nu\|C_{\mathrm{ora}}\|_{\mathcal H}^2}{\tilde\delta_{n,\mathcal F}}+\sqrt{\frac{\log(\log_2n/\zeta)}{n}},
\]
where we used the definition of $\tilde\delta_{n,\mathcal F}$. Hence there exists $C>0$ such that with probability $1-4\zeta$:
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n\big]\le C\Big(\epsilon^2+\eta^2+\delta_{n,\mathcal F_U}^2+\bar\delta_{n,\mathcal G}^2+\frac{\nu^2\|C_{\mathrm{ora}}\|_{\mathcal H}^4}{\delta_{n,\mathcal F_U}^2}+\frac{\log((\log_2n)/\zeta)}{n}\Big):=A_n.
\]
Letting $\mathcal E=\{\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2\mid\mathcal D_n]\le A_n\}$ and arguing as in the proof of Theorem 4.3, we have $\mathbb E[(\alpha(Z;\widehat C)-\alpha)^2]\le A_n+4\zeta$. By choosing $\zeta=1/n$, we have
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\lesssim\epsilon^2+\eta^2+\delta_{n,\mathcal F_U}^2+\bar\delta_{n,\mathcal G}^2+\frac{\nu^2\|C_{\mathrm{ora}}\|_{\mathcal H}^4}{\delta_{n,\mathcal F_U}^2}+\frac{\log(n\log_2n)}{n}.
\]
Since $\mathcal F=\mathcal K_{K_Z}$, by Theorem C.2 we have $\delta_{n,\mathcal F_U}\lesssim\sqrt{d_Z\log^{d_Z+1}(n/d_Z)/n}$. By the definition of $\mathcal H$ in (D.2), $\|C_{\mathrm{ora}}\|_{\mathcal H}^2\le mU_X$. Then, choosing $\nu\lesssim\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}$, we have
\[
\mathbb E\big[(\alpha(Z;\widehat C)-\alpha)^2\big]\lesssim\epsilon^2+\eta^2+\frac{md_X\log^{d_X+1}(mn/d_X)}{n}+\frac{d_Z\log^{d_Z+1}(n/d_Z)}{n}.
\]

D.1.1 Margin concentration lemma

Lemma D.1. Define the sequence $\sigma_\ell=2^{-\ell+1}$ for $\ell\ge1$. For any $x>0$, with probability at least
\[
1-\big(\lceil\log_2(1/\delta_{n,\widetilde{\mathcal G}^-_\varphi})\rceil+\lceil\log_2(1/\delta_{n,\mathcal G^-_{\bar\varphi}})\rceil+2\big)e^{-x}-\big(\lceil\log_2(1/\delta_{n,\widetilde{\mathcal G}^+_\varphi})\rceil+\lceil\log_2(1/\delta_{n,\mathcal G^+_{\bar\varphi}})\rceil+2\big)e^{-x},
\]
we have, for all $h\in\mathcal H$,
\[
\begin{aligned}
|(P_n-P)\{D(h)f_h\}|&\le\inf_{\ell\ge1}\Big\{\Delta(h,\sigma_\ell)+\|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\varphi_{\sigma_\ell}}}+\sqrt{\frac{2x}{n}}\Big)+\frac{x}{3n}\Big\}+\inf_{\ell\ge1}\Big\{\bar\Delta(h,\sigma_\ell)+\|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\bar\varphi_{\sigma_\ell}}}+\sqrt{\frac{2x}{n}}\Big)+\frac{x}{3n}\Big\}\\
&\quad+\inf_{\ell\ge1}\Big\{\Delta(h,\sigma_\ell)+\|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^-_{\varphi_{\sigma_\ell}}}+\sqrt{\frac{2x}{n}}\Big)+\frac{x}{3n}\Big\}+\inf_{\ell\ge1}\Big\{\bar\Delta(h,\sigma_\ell)+\|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^-_{\bar\varphi_{\sigma_\ell}}}+\sqrt{\frac{2x}{n}}\Big)+\frac{x}{3n}\Big\},
\end{aligned}
\]
where $g^+_{\varphi_{\sigma_\ell}}(h)\in\mathcal G^+_{\varphi_{\sigma_\ell}}$ and $g^+_{\bar\varphi_{\sigma_\ell}}(h)\in\mathcal G^+_{\bar\varphi_{\sigma_\ell}}$ are defined in (D.5) and (D.6); $g^-_{\varphi_{\sigma_\ell}}(h)\in\mathcal G^-_{\varphi_{\sigma_\ell}}$ and $g^-_{\bar\varphi_{\sigma_\ell}}(h)\in\mathcal G^-_{\bar\varphi_{\sigma_\ell}}$ are defined in (D.10) and (D.11); and $\Delta(h,\sigma_\ell)$ and $\bar\Delta(h,\sigma_\ell)$ are defined in (D.7) and (D.8).

Proof.
Proof. For any function $f$, we write $[f]_+ = f\mathbb 1\{f \ge 0\}$ and $[f]_- = -f\mathbb 1\{f < 0\}$. We first notice that

$$(P_n - P)\{D(h)f_h\} = \underbrace{(P_n-P)\{D(h)[f_h]_+\}}_{\text{(I)}} - \underbrace{(P_n-P)\{D(h)[f_h]_-\}}_{\text{(II)}}.$$

Assumption D.1. The Lipschitz functions $\underline\varphi$ and $\bar\varphi$ satisfy $0 \le \underline\varphi(u) \le \mathbb 1\{u \le 0\} \le \bar\varphi(u) \le 1$.

Bounding term (I). Given $\underline\varphi$ and $\bar\varphi$, we define two function classes

$$\mathcal G^+_{\underline\varphi} = \big\{g^+_{\underline\varphi}(h) = [f_h]_+\big((1-\underline\varphi\circ T(h)) - (1 - \mathbb 1\{T(h^{\mathrm{ora}}) \le 0\})\big) : h \in \mathcal H\big\}, \tag{D.5}$$
$$\mathcal G^+_{\bar\varphi} = \big\{g^+_{\bar\varphi}(h) = [f_h]_+\big((1-\bar\varphi\circ T(h)) - (1 - \mathbb 1\{T(h^{\mathrm{ora}}) \le 0\})\big) : h \in \mathcal H\big\}. \tag{D.6}$$

Using Lemma E.7, with probability $1 - (\lceil\log_2(1/\delta_{n,\tilde{\mathcal G}^+_{\underline\varphi}})\rceil + 1)e^{-x}$, we have for all $h \in \mathcal H$,

$$\big|(P_n - P)g^+_{\underline\varphi}(h)\big| \le \|g^+_{\underline\varphi}(h)\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\underline\varphi}} + \sqrt{2x/n}\Big) + \frac{x}{3n} \le \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\underline\varphi}} + \sqrt{2x/n}\Big) + \frac{x}{3n}.$$

And, with probability $1 - (\lceil\log_2(1/\delta_{n,\mathcal G^+_{\bar\varphi}})\rceil + 1)e^{-x}$, we have for all $h \in \mathcal H$,

$$\big|(P_n-P)g^+_{\bar\varphi}(h)\big| \le \|g^+_{\bar\varphi}(h)\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\bar\varphi}} + \sqrt{2x/n}\Big) + \frac{x}{3n} \le \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\bar\varphi}} + \sqrt{2x/n}\Big) + \frac{x}{3n}.$$

For any $\sigma \in [0,1]$, we take the Lipschitz continuous functions

$$\underline\varphi_\sigma(u) = \begin{cases} 1, & u < -\sigma, \\ -u/\sigma, & -\sigma \le u \le 0, \\ 0, & u > 0, \end{cases} \qquad \bar\varphi_\sigma(u) = \begin{cases} 1, & u < 0, \\ 1 - u/\sigma, & 0 \le u < \sigma, \\ 0, & u \ge \sigma, \end{cases}$$

which satisfy Assumption D.1. In addition, we know $L_{\underline\varphi_{\sigma_\ell}}, L_{\bar\varphi_{\sigma_\ell}} \le 1/\sigma_\ell$ with $\sigma_\ell = 2^{-\ell+1}$ for $\ell \ge 1$. Denote the quantities

$$\underline\Delta(h,\sigma_\ell) := P\{\underline\varphi_{\sigma_\ell}\circ T(h) \ne \mathbb 1\{T(h) \le 0\}\}, \tag{D.7}$$
$$\bar\Delta(h,\sigma_\ell) := P\{\bar\varphi_{\sigma_\ell}\circ T(h) \ne \mathbb 1\{T(h) \le 0\}\}. \tag{D.8}$$

Then, with probability $1 - R_n\sum_{\ell=1}^\infty e^{-(x+2\log\ell)} \ge 1 - 2R_n e^{-x}$, we have for all $h \in \mathcal H$,

$$P_n\{D(h)[f_h]_+\} \le \inf_{\ell\ge1} P_n g^+_{\underline\varphi_{\sigma_\ell}}(h) \le \inf_{\ell\ge1}\Big\{P g^+_{\underline\varphi_{\sigma_\ell}}(h) + \big|(P_n-P)g^+_{\underline\varphi_{\sigma_\ell}}(h)\big|\Big\} \le P\{D(h)[f_h]_+\} + \inf_{\ell\ge1}\Bigg\{\underline\Delta(h,\sigma_\ell) + \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\underline\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) + \frac{x}{3n}\Bigg\},$$

and, for all $h \in \mathcal H$,

$$P_n\{D(h)[f_h]_+\} \ge \sup_{\ell\ge1} P_n g^+_{\bar\varphi_{\sigma_\ell}}(h) \ge \sup_{\ell\ge1}\Big\{P g^+_{\bar\varphi_{\sigma_\ell}}(h) - \big|(P_n-P)g^+_{\bar\varphi_{\sigma_\ell}}(h)\big|\Big\} \ge P\{D(h)[f_h]_+\} + \sup_{\ell\ge1}\Bigg\{-\bar\Delta(h,\sigma_\ell) - \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\bar\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) - \frac{x}{3n}\Bigg\}.$$

It follows that, with probability $1 - (\lceil\log_2(1/\delta_{n,\tilde{\mathcal G}^+_{\underline\varphi}})\rceil + \lceil\log_2(1/\delta_{n,\mathcal G^+_{\bar\varphi}})\rceil + 2)e^{-x}$, for all $h \in \mathcal H$,

$$\big|(P_n-P)\{D(h)[f_h]_+\}\big| \le \inf_{\ell\ge1}\Bigg\{\underline\Delta(h,\sigma_\ell) + \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\underline\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) + \frac{x}{3n}\Bigg\} + \inf_{\ell\ge1}\Bigg\{\bar\Delta(h,\sigma_\ell) + \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^+_{\bar\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) + \frac{x}{3n}\Bigg\}. \tag{D.9}$$

Bounding term (II). Similarly, given $\underline\varphi$ and $\bar\varphi$, we define another two function classes

$$\mathcal G^-_{\underline\varphi} = \big\{[f_h]_-\big((1-\underline\varphi\circ T(h)) - (1-\mathbb 1\{T(h^{\mathrm{ora}})\le0\})\big) : h \in \mathcal H\big\}, \tag{D.10}$$
$$\mathcal G^-_{\bar\varphi} = \big\{[f_h]_-\big((1-\bar\varphi\circ T(h)) - (1-\mathbb 1\{T(h^{\mathrm{ora}})\le0\})\big) : h \in \mathcal H\big\}. \tag{D.11}$$

Recalling the definition of $D(h)$, we have

$$P\{g^-_{\underline\varphi_{\sigma_\ell}} \ne D(h)[f_h]_-\} = P\{\underline\varphi_{\sigma_\ell}\circ T(h) \ne \mathbb 1\{T(h)\le0\}\} = \underline\Delta(h,\sigma_\ell),$$
$$P\{g^-_{\bar\varphi_{\sigma_\ell}} \ne D(h)[f_h]_-\} = P\{\bar\varphi_{\sigma_\ell}\circ T(h) \ne \mathbb 1\{T(h)\le0\}\} = \bar\Delta(h,\sigma_\ell).$$

We can show that, with probability $1 - (\lceil\log_2(1/\delta_{n,\tilde{\mathcal G}^-_{\underline\varphi}})\rceil + \lceil\log_2(1/\delta_{n,\mathcal G^-_{\bar\varphi}})\rceil + 2)e^{-x}$, for all $h \in \mathcal H$,

$$\big|(P_n-P)\{D(h)[f_h]_-\}\big| \le \inf_{\ell\ge1}\Bigg\{\underline\Delta(h,\sigma_\ell) + \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^-_{\underline\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) + \frac{x}{3n}\Bigg\} + \inf_{\ell\ge1}\Bigg\{\bar\Delta(h,\sigma_\ell) + \|f_h\|_{L_2}\Big(2\delta_{n,\mathcal G^-_{\bar\varphi_{\sigma_\ell}}} + \sqrt{2x/n}\Big) + \frac{x}{3n}\Bigg\}. \tag{D.12}$$
Combining (D.9) and (D.12) proves the conclusion.

Lemma D.2. Suppose the marginal density of $T := T(h(X), Y)$ is bounded, i.e., $p_T(t) \le C_T$ for some constant $C_T > 0$, and assume that $|T(u_1, y) - T(u_2, y)| \le L_T\|u_1 - u_2\|$ for any $u_1, u_2 \in \mathbb R^m$, with some constant $L_T > 0$. If we choose $\mathcal F = \mathcal K_{K_Z}$ and $\mathcal H$ as defined in (D.2), then we have $\underline\Delta(h,\sigma_\ell), \bar\Delta(h,\sigma_\ell) \le C_T\sigma_\ell$ and

$$\delta_{n,\mathcal G^+_{\underline\varphi}},\ \delta_{n,\mathcal G^-_{\underline\varphi}},\ \delta_{n,\mathcal G^+_{\bar\varphi}},\ \delta_{n,\mathcal G^-_{\bar\varphi}} \lesssim \frac{1}{\sqrt n}\cdot\sqrt{d_Z\log^{d_Z+1}(n/d_Z) \vee m d_X\log^{d_X+1}\Big(\frac{mn}{\sigma_\ell^2 d_X}\Big)}.$$

Proof. By the definitions of $\underline\Delta$ and $\bar\Delta$ in (D.7) and (D.8) and the boundedness assumption,

$$\underline\Delta(h,\sigma_\ell) = P\{-\sigma_\ell \le T(h) \le 0\} = \int_{-\sigma_\ell}^0 p_T(t)\,dt \le C_T\sigma_\ell, \qquad \bar\Delta(h,\sigma_\ell) = P\{0 \le T(h) \le \sigma_\ell\} = \int_0^{\sigma_\ell} p_T(t)\,dt \le C_T\sigma_\ell.$$

The critical part is to calculate the critical radii of $\mathcal G^+_{\underline\varphi}, \mathcal G^-_{\underline\varphi}, \mathcal G^+_{\bar\varphi}, \mathcal G^-_{\bar\varphi}$. We first consider $\mathcal G^+_{\underline\varphi}$ in (D.5). Since $|T(u_1,y) - T(u_2,y)| \le L_T\|u_1-u_2\|$ for any $u_1, u_2 \in \mathbb R^m$, the map $u \mapsto \underline\varphi\circ T(u, y)$ is Lipschitz with Lipschitz constant $L_T/\sigma_\ell$. For fixed $h^{\mathrm{ora}} \in \mathcal H$, define

$$\tilde{\mathcal G}^+_{\underline\varphi} = \big\{g^+_{\underline\varphi}(h) = [f]_+\big((1-\underline\varphi\circ T(h)) - (1-\mathbb 1\{T(h^{\mathrm{ora}})\le0\})\big) : h \in \mathcal H,\ f \in \mathcal F_{U_Z}\big\}.$$

Here we write $U_Z := U$ to distinguish it from $U_X$. Note that $\mathcal G^+_{\underline\varphi} \subseteq \tilde{\mathcal G}^+_{\underline\varphi}$, so we only need to bound the covering number of $\tilde{\mathcal G}^+_{\underline\varphi}$. By Lemmas E.2, E.3, and E.5, we have

$$\log N(\rho, \tilde{\mathcal G}^+_{\underline\varphi}, \|\cdot\|_{L_2}) \le \log N(\rho/2, \mathcal F_{U_Z}, \|\cdot\|_{L_2}) + \log N\Big(\frac{\sigma_\ell\rho}{2L_T}, \mathcal H, \|\cdot\|_{L_2}\Big).$$

Recall that $\mathcal F = \mathcal K_{K_Z}$ and $\mathcal H = \{h = (h_1,\dots,h_m)^\top : h_j \in \mathcal K_{K_X}, \|h_j\|^2_{\mathcal K_{K_X}} \le U_X, j \in [m]\}$, where $K_Z$ and $K_X$ are Gaussian kernels. By Lemma E.4 and (C.20), we have

$$\log N(\rho, \mathcal F_{U_Z}, \|\cdot\|_{L_2}) \le 4^{d_Z}(6d_Z+2)\big(\log(U_Z/\rho)\big)^{d_Z+1}, \qquad \log N(\rho, \mathcal H, \|\cdot\|_{L_2}) \le m\,4^{d_X}(6d_X+2)\big(\log(mU_X/\rho)\big)^{d_X+1}.$$

Hence,

$$\begin{aligned} \mathcal R_n(\delta; \mathcal G^+_{\underline\varphi}) \le \mathcal R_n(\delta; \tilde{\mathcal G}^+_{\underline\varphi}) &\le \frac{C_0}{\sqrt n}\int_0^\delta \sqrt{\log N(\rho, \tilde{\mathcal G}^+_{\underline\varphi}, \|\cdot\|_{L_2})}\,d\rho \\ &\le \frac{2}{\sqrt n}\int_0^{\delta/2}\sqrt{\log N(\rho, \mathcal F_{U_Z}, \|\cdot\|_{L_2})}\,d\rho + \frac{2L_T}{\sigma_\ell\sqrt n}\int_0^{\sigma_\ell\delta/(2L_T)}\sqrt{\log N(\rho, \mathcal H, \|\cdot\|_{L_2})}\,d\rho \\ &\lesssim \delta\sqrt{\frac{(2d_Z+1)\big(4\log(U_Z^2/\delta^2)\big)^{d_Z+1}}{n}} + \delta\sqrt{\frac{m(2d_X+1)\big(4\log\{(mU_XL_T)^2/(\sigma_\ell\delta)^2\}\big)^{d_X+1}}{n}}. \end{aligned} \tag{D.13}$$

Similar to the proof of Theorem C.2, since we require $\mathcal R_n(\delta; \tilde{\mathcal G}^+_{\underline\varphi}) \lesssim \delta^2$, it suffices that

$$\frac{d_Z}{n}\log^{d_Z+1}(U_Z^2/\delta^2) \lesssim \frac{\delta^2}{2}, \qquad \frac{md_X}{n}\log^{d_X+1}\big((mU_XL_T)^2/(\sigma_\ell\delta)^2\big) \lesssim \frac{\delta^2}{2}.$$

For the first inequality, using (C.22) we get $\delta \asymp \sqrt{d_Z\log^{d_Z+1}(n/d_Z)/n}$. For the second inequality, using Lemma E.1, let $t = (mU_XL_T)^2/(\sigma_\ell^2\delta^2)$ and consider the root of the equation

$$t(\log t)^{d_X+1} = \frac{nm(U_XL_T)^2}{\sigma_\ell^2 d_X}.$$

Taking the logarithm on both sides, we have

$$\log t + (d_X+1)\log\log t = \log\Big(\frac{nm(U_XL_T)^2}{\sigma_\ell^2 d_X}\Big).$$

By Lemma E.1 with $L_n = \log\big(nm(U_XL_T)^2/(\sigma_\ell^2 d_X)\big)$, we have

$$\frac{(mU_XL_T)^2}{\sigma_\ell^2\delta^2} \asymp \exp\Bigg\{\log\Big(\frac{nm(U_XL_T)^2}{\sigma_\ell^2 d_X}\Big) - (d_X+1)\log\log\Big(\frac{nm(U_XL_T)^2}{\sigma_\ell^2 d_X}\Big)\Bigg\} = \frac{nm(U_XL_T)^2/(\sigma_\ell^2 d_X)}{\log^{d_X+1}\big(nm(U_XL_T)^2/(\sigma_\ell^2 d_X)\big)}.$$

Hence,

$$\delta_{n,\mathcal G^+_{\underline\varphi}} \lesssim \delta_{n,\tilde{\mathcal G}^+_{\underline\varphi}} \asymp \frac{1}{\sqrt n}\cdot\sqrt{d_Z\log^{d_Z+1}(n/d_Z) \vee md_X\log^{d_X+1}\Big(\frac{mn}{\sigma_\ell^2 d_X}\Big)}.$$

The above derivation applies equally to $\mathcal G^-_{\underline\varphi}$, $\mathcal G^+_{\bar\varphi}$, and $\mathcal G^-_{\bar\varphi}$, since $\underline\varphi$ and $\bar\varphi$ share the same Lipschitz constant and $[f]_+$ and $[f]_-$ play symmetric roles, so $\delta_{n,\mathcal G^-_{\underline\varphi}}, \delta_{n,\mathcal G^+_{\bar\varphi}}, \delta_{n,\mathcal G^-_{\bar\varphi}}$ can be bounded in the same manner.
Therefore,

$$\delta_{n,\mathcal G^+_{\underline\varphi}},\ \delta_{n,\mathcal G^-_{\underline\varphi}},\ \delta_{n,\mathcal G^+_{\bar\varphi}},\ \delta_{n,\mathcal G^-_{\bar\varphi}} \lesssim \frac{1}{\sqrt n}\cdot\sqrt{d_Z\log^{d_Z+1}(n/d_Z) \vee md_X\log^{d_X+1}\Big(\frac{mn}{\sigma_\ell^2 d_X}\Big)}.$$

D.2 Proof of Theorem 4.5

Lemma D.3. Let $\tilde{\mathbb 1}\{a < b\}$ be the smoothed indicator function defined by

$$\tilde{\mathbb 1}\{a < b\} = \frac 12\bigg(1 + \operatorname{erf}\Big(\frac{b-a}{\sqrt 2\,r}\Big)\bigg),$$

where $\operatorname{erf}(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$ is the error function and $r > 0$ is the smoothing parameter. The error between the smoothed indicator $\tilde{\mathbb 1}\{a<b\}$ and the actual indicator $\mathbb 1\{a<b\}$, integrated over all $a$, is given by

$$E = \int_{-\infty}^\infty \big|\tilde{\mathbb 1}\{a<b\} - \mathbb 1\{a<b\}\big|\,da = \sqrt{\frac{2}{\pi}}\,r.$$

Proof. Let $x = a - b$. Note that

$$\tilde{\mathbb 1}\{a<b\} = \frac12\Big(1+\operatorname{erf}\Big(\frac{b-a}{\sqrt2\,r}\Big)\Big) = \Phi\Big(\frac{b-a}{r}\Big) = \Phi(-x/r),$$

where $\Phi$ is the standard normal CDF and $\phi$ its density. Then

$$E = \int_{-\infty}^\infty \big|\Phi(-x/r) - \mathbb 1\{x<0\}\big|\,dx = \int_{-\infty}^0\big(1-\Phi(-x/r)\big)dx + \int_0^\infty \Phi(-x/r)\,dx.$$

By symmetry,

$$E = 2\int_0^\infty \Phi(-x/r)\,dx = 2r\int_0^\infty\Phi(-u)\,du = 2r\int_0^\infty\big(1-\Phi(u)\big)du.$$

Since $1-\Phi(u) = \int_u^\infty\phi(t)\,dt$, Fubini's theorem gives

$$\int_0^\infty\big(1-\Phi(u)\big)du = \int_0^\infty\int_u^\infty \phi(t)\,dt\,du = \int_0^\infty t\phi(t)\,dt = \frac{1}{\sqrt{2\pi}}.$$

Therefore $E = 2r\cdot\frac{1}{\sqrt{2\pi}} = \sqrt{2/\pi}\,r$.

Lemma D.4. Let

$$f(x,b) = \tilde{\mathbb 1}\{x<b\} = \frac12\Big(1+\operatorname{erf}\Big(\frac{b-x}{\sqrt2\,r}\Big)\Big),$$

where $\operatorname{erf}(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$ and $r>0$ is the smoothing parameter. Then $f(x,b)$ is Lipschitz continuous with respect to $x$ with Lipschitz constant $\frac{1}{\sqrt{2\pi}\,r}$.

Proof. Since

$$\Big|\frac{\partial}{\partial x}f(x,b)\Big| = \Bigg|\frac12\frac{\partial}{\partial x}\bigg(1+\frac{2}{\sqrt\pi}\int_0^{(b-x)/(\sqrt2\,r)}e^{-t^2}dt\bigg)\Bigg| = \frac{1}{\sqrt{2\pi}\,r}\,e^{-\frac{(b-x)^2}{2r^2}} \le \frac{1}{\sqrt{2\pi}\,r},$$

$f(x,b)$ is Lipschitz continuous with respect to $x$ with Lipschitz constant $\frac{1}{\sqrt{2\pi}\,r}$.
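As a quick numerical sanity check of Lemmas D.3 and D.4 (our own illustration, not part of the proofs), the following sketch verifies the integrated error $\sqrt{2/\pi}\,r$ and the Lipschitz bound $1/(\sqrt{2\pi}\,r)$ using scipy.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def smooth_ind(a, b, r):
    # Smoothed indicator of {a < b} from Lemma D.3: Phi((b - a)/r), written via erf.
    return 0.5 * (1.0 + erf((b - a) / (np.sqrt(2.0) * r)))

b, r = 0.7, 0.05

# Lemma D.3: the integral of |smoothed - hard indicator| over a equals sqrt(2/pi)*r.
integrand = lambda a: abs(smooth_ind(a, b, r) - float(a < b))
total, _ = quad(integrand, b - 40 * r, b + 40 * r)  # integrand vanishes far from b
assert abs(total - np.sqrt(2.0 / np.pi) * r) < 1e-6

# Lemma D.4: the slope in a never exceeds 1/(sqrt(2*pi)*r).
a = np.linspace(b - 5 * r, b + 5 * r, 20001)
slopes = np.abs(np.diff(smooth_ind(a, b, r))) / np.diff(a)
assert slopes.max() <= 1.0 / (np.sqrt(2.0 * np.pi) * r) + 1e-6
print("Lemma D.3 integral and Lemma D.4 Lipschitz checks passed")
```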
Proof of Theorem 4.5. For $f \in \mathcal F$ and $h \in \mathcal H$, denote

$$\tilde\psi(h,f)(X,Z,Y) = f(Z)\big(\tilde{\mathbb 1}\{T(h(X),Y)>0\} - \alpha\big), \qquad \psi(h,f)(X,Z,Y) = f(Z)\big(\mathbb 1\{T(h(X),Y)>0\} - \alpha\big),$$

and assume $\sup_{f\in\mathcal F}\|f\|_\infty \le C_{\mathcal F}$ and that the conditional density of $T(h(X),Y)$ given $Z$, $p_{T|Z}$, is upper bounded by $C_T$. Consider

$$\sup_{f\in\mathcal F}\big\{P\tilde\psi(h,f) - P\psi(h,f)\big\} = \sup_{f\in\mathcal F}\mathbb E\big[f(Z)\big(\tilde{\mathbb 1}\{T(h(X),Y)>0\} - \mathbb 1\{T(h(X),Y)>0\}\big)\big] \le C_{\mathcal F}\,\mathbb E_Z\Big[\mathbb E\big[\big|\mathbb 1\{T(h(X),Y)>0\} - \tilde{\mathbb 1}\{T(h(X),Y)>0\}\big| \mid Z\big]\Big] \le \sqrt{\frac{2}{\pi}}\,C_{\mathcal F}C_T\,r, \tag{D.14}$$

where we used Lemma D.3. Recall that $\tilde\delta_{n,\mathcal F} = \delta_{n,\mathcal F_U} + \sqrt{\log(c_1/\zeta)/(c_2 n)}$ for some positive constants $c_1$ and $c_2$. Since $\tilde\psi(h,f)$ is 1-Lipschitz with respect to $f$, using Lemma 14 in Foster and Syrgkanis (2023), we also have

$$P\Bigg(\forall f\in\mathcal F:\ \big|(P_n-P)\tilde\psi(h^{\mathrm{ora}},f)\big| \le 18\bigg(\tilde\delta_{n,\mathcal F}\|f\|_{L_2} + \Big(1\vee\frac{\|f\|_{\mathcal F}}{\sqrt U}\Big)\tilde\delta^2_{n,\mathcal F}\bigg)\Bigg) \ge 1-\zeta. \tag{D.15}$$

Define $\tilde\Psi_\gamma(h,f) = P_n\tilde\psi(h,f) - P_nf^2 - \gamma\|f\|^2_{\mathcal F}$ for any $h\in\mathcal H, f\in\mathcal F$, and recall that $\Psi_\lambda(h,f) = P\psi(h,f) - \lambda Pf^2$. Since $\tilde h = \arg\min_{h\in\mathcal H}\max_{f\in\mathcal F}\tilde\Psi_\gamma(h,f)$, by the optimality of $\tilde h$,

$$\sup_{f\in\mathcal F}\tilde\Psi_\gamma(\tilde h, f) \le \sup_{f\in\mathcal F}\tilde\Psi_\gamma(h^{\mathrm{ora}}, f). \tag{D.16}$$

Combining (C.5) and (D.15), we get with probability $1-2\zeta$:

$$\begin{aligned} \sup_{f\in\mathcal F}\tilde\Psi_\gamma(h^{\mathrm{ora}},f) &= \sup_{f\in\mathcal F}\big\{P_n\tilde\psi(h^{\mathrm{ora}},f) - P_nf^2 - \gamma\|f\|^2_{\mathcal F}\big\} \le \sup_{f\in\mathcal F}\bigg\{P_n\tilde\psi(h^{\mathrm{ora}},f) - \gamma\|f\|^2_{\mathcal F} - \Big(\frac14\|f\|^2_{L_2} - \frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big)\bigg\} \\ &\le \sup_{f\in\mathcal F}\bigg\{P\tilde\psi(h^{\mathrm{ora}},f) + 18\Big(\tilde\delta_{n,\mathcal F}\|f\|_{L_2} + \Big(1\vee\frac{\|f\|_{\mathcal F}}{\sqrt U}\Big)\tilde\delta^2_{n,\mathcal F}\Big) - \gamma\|f\|^2_{\mathcal F} - \Big(\frac14\|f\|^2_{L_2} - \frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big)\bigg\} \\ &\overset{(i)}{\le} \sup_{f\in\mathcal F}\Big\{P\psi(h^{\mathrm{ora}},f) - \frac18 Pf^2\Big\} + \sup_{f\in\mathcal F}\Big\{36\tilde\delta_{n,\mathcal F}\|f\|_{L_2} - \frac18\|f\|^2_{L_2}\Big\} + 18\tilde\delta^2_{n,\mathcal F} + \frac{\tilde\delta^2_{n,\mathcal F}}{2} + \sup_{f\in\mathcal F}\big\{P\tilde\psi(h^{\mathrm{ora}},f) - P\psi(h^{\mathrm{ora}},f)\big\} \\ &\le \sup_{f\in\mathcal F}\Psi_{\frac18}(h^{\mathrm{ora}},f) + 2\cdot36^2\,\tilde\delta^2_{n,\mathcal F} + 18\tilde\delta^2_{n,\mathcal F} + \frac{\tilde\delta^2_{n,\mathcal F}}{2} + \sqrt{\frac{2}{\pi}}\,C_{\mathcal F}C_T\,r, \end{aligned} \tag{D.17}$$

where in (i) we used the assumption $\tilde\delta_{n,\mathcal F}\|f\|_{\mathcal F} \le \sqrt{U/2}\,\|f\|_{L_2}$ and $\gamma \ge 0$. In addition, we have the lower bound

$$\begin{aligned} \sup_{f\in\mathcal F}\tilde\Psi_\gamma(\tilde h, f) &= \sup_{f\in\mathcal F}\big\{P_n\tilde\psi(\tilde h,f) - P_n\tilde\psi(h^{\mathrm{ora}},f) + P_n\tilde\psi(h^{\mathrm{ora}},f) - P_nf^2 - \gamma\|f\|^2_{\mathcal F}\big\} \\ &\ge \sup_{f\in\mathcal F}\big\{P_n\tilde\psi(\tilde h,f) - P_n\tilde\psi(h^{\mathrm{ora}},f) - 2(P_nf^2 + \gamma\|f\|^2_{\mathcal F})\big\} + \inf_{f\in\mathcal F}\big\{P_n\tilde\psi(h^{\mathrm{ora}},f) + P_nf^2 + \gamma\|f\|^2_{\mathcal F}\big\} \\ &= \sup_{f\in\mathcal F}\big\{P_n\tilde\psi(\tilde h,f) - P_n\tilde\psi(h^{\mathrm{ora}},f) - 2(P_nf^2 + \gamma\|f\|^2_{\mathcal F})\big\} - \sup_{f\in\mathcal F}\tilde\Psi_\gamma(h^{\mathrm{ora}},f), \end{aligned}$$

where we used the fact that $\mathcal F$ is symmetric. It follows that with probability $1-2\zeta$:

$$\begin{aligned} \sup_{f\in\mathcal F}\big\{P_n\tilde\psi(\tilde h,f) - P_n\tilde\psi(h^{\mathrm{ora}},f) - 2(P_nf^2+\gamma\|f\|^2_{\mathcal F})\big\} &\le \sup_{f\in\mathcal F}\tilde\Psi_\gamma(\tilde h,f) + \sup_{f\in\mathcal F}\tilde\Psi_\gamma(h^{\mathrm{ora}},f) \overset{(i)}{\le} 2\sup_{f\in\mathcal F}\tilde\Psi_\gamma(h^{\mathrm{ora}},f) \\ &\overset{(ii)}{\le} 2\sup_{f\in\mathcal F}\Psi_{\frac18}(h^{\mathrm{ora}},f) + 4\cdot36^2\,\tilde\delta^2_{n,\mathcal F} + 36\tilde\delta^2_{n,\mathcal F} + \tilde\delta^2_{n,\mathcal F} + 2\sqrt{\frac{2}{\pi}}\,C_{\mathcal F}C_T\,r \\ &\overset{(iii)}{\le} 4\epsilon^2 + 4\cdot36^2\,\tilde\delta^2_{n,\mathcal F} + 36\tilde\delta^2_{n,\mathcal F} + \tilde\delta^2_{n,\mathcal F} + 2\sqrt{\frac{2}{\pi}}\,C_{\mathcal F}C_T\,r, \end{aligned} \tag{D.18}$$

where (i) holds due to (D.16), (ii) holds due to (D.17), and (iii) holds due to Lemma C.1 and the assumption $\mathbb E[(\alpha(Z;h^{\mathrm{ora}})-\alpha)^2] \le \epsilon^2$.

Define $\tilde\alpha(Z;h) = \mathbb E[\tilde{\mathbb 1}\{T(h(X),Y)>0\}\mid Z]$ and note that

$$\begin{aligned} \big\|(\alpha(Z;h)-\tilde\alpha(Z;h)) - (\alpha(Z;h^{\mathrm{ora}})-\tilde\alpha(Z;h^{\mathrm{ora}}))\big\|_{L_2} &\le \|\alpha(Z;h)-\tilde\alpha(Z;h)\|_{L_2} + \|\alpha(Z;h^{\mathrm{ora}})-\tilde\alpha(Z;h^{\mathrm{ora}})\|_{L_2} \\ &\le \Big(\mathbb E_Z\big[\mathbb E[\tilde{\mathbb 1}\{T(h(X),Y)>0\} - \mathbb 1\{T(h(X),Y)>0\}\mid Z]^2\big]\Big)^{1/2} \\ &\quad + \Big(\mathbb E_Z\big[\mathbb E[\tilde{\mathbb 1}\{T(h^{\mathrm{ora}}(X),Y)>0\} - \mathbb 1\{T(h^{\mathrm{ora}}(X),Y)>0\}\mid Z]^2\big]\Big)^{1/2} \le 2\sqrt{\frac{2}{\pi}}\,C_T\,r. \end{aligned}$$

Recall that $f_h = \arg\min_{f\in\mathcal F_U}\mathbb E[(f(Z) - (\alpha(Z;h) - \alpha(Z;h^{\mathrm{ora}})))^2]$, and we assume that $\mathbb E[(f_h(Z) - (\alpha(Z;h) - \alpha(Z;h^{\mathrm{ora}})))^2] \le \eta^2$ holds for any $h\in\mathcal H$. Notice that

$$\begin{aligned} P\tilde\psi(h,f_h) - P\tilde\psi(h^{\mathrm{ora}},f_h) &= \mathbb E\big[f_h(Z)(\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}}))\big] \\ &= \mathbb E[f_h^2(Z)] + \mathbb E\big[f_h(Z)(\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})-f_h(Z))\big] + \mathbb E\big[f_h(Z)\big(\{\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}})\} - \{\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})\}\big)\big] \\ &\ge \|f_h\|^2_{L_2} - \|f_h\|_{L_2}\|\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})-f_h(Z)\|_{L_2} - \|f_h\|_{L_2}\big\|\{\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}})\}-\{\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})\}\big\|_{L_2} \\ &\ge \|f_h\|^2_{L_2} - \|f_h\|_{L_2}\Big(\eta + 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r\Big). \end{aligned}$$

Note that

$$\|\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})\|_{L_2} \le \|\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}})\|_{L_2} + \big\|(\alpha(Z;h)-\tilde\alpha(Z;h)) - (\alpha(Z;h^{\mathrm{ora}})-\tilde\alpha(Z;h^{\mathrm{ora}}))\big\|_{L_2} \le \|\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}})\|_{L_2} + 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r.$$

Therefore, by symmetry we have

$$\Big|\|\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})\|_{L_2} - \|\tilde\alpha(Z;h)-\tilde\alpha(Z;h^{\mathrm{ora}})\|_{L_2}\Big| \le 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r. \tag{D.19}$$

Then we have

$$\frac{P\tilde\psi(h,f_h)-P\tilde\psi(h^{\mathrm{ora}},f_h)}{\|f_h\|_{L_2}} \ge \|f_h\|_{L_2} - \eta - 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r \ge \|\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})\|_{L_2} - 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r - 2\eta, \tag{D.20}$$

where the last inequality uses (D.19). Let $\rho\in[0,1]$; then $\rho f_{\tilde h}\in\mathcal F_U$ since $\mathcal F_U$ is star-shaped. Recall that $v^2(h,h^{\mathrm{ora}}) = \mathbb E[|\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}})|^2]$ and define $\tilde{\mathcal G} = \{\tilde\psi(h,f_h) - \tilde\psi(h^{\mathrm{ora}},f_h) : h\in\mathcal H\}$.
By the optimality of $f_h$,

$$\|f_h - (\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}}))\|_{L_2} \le \|0 - (\alpha(Z;h)-\alpha(Z;h^{\mathrm{ora}}))\|_{L_2} = v(h,h^{\mathrm{ora}}), \quad \text{for all } h\in\mathcal H.$$

Thus

$$\|\tilde\psi(h,f_h)-\tilde\psi(h^{\mathrm{ora}},f_h)\|_{L_2} \le \|f_h\|_{L_2} \le 2v(h,h^{\mathrm{ora}}). \tag{D.21}$$

Then we have with probability $1-2\zeta$:

$$\begin{aligned} &\sup_{f\in\mathcal F}\big\{P_n\tilde\psi(\tilde h,f) - P_n\tilde\psi(h^{\mathrm{ora}},f) - 2(P_nf^2+\gamma\|f\|^2_{\mathcal F})\big\} \\ &\ge \rho P_n\tilde\psi(\tilde h,f_{\tilde h}) - \rho P_n\tilde\psi(h^{\mathrm{ora}},f_{\tilde h}) - 2\rho^2(P_nf^2_{\tilde h}+\gamma\|f_{\tilde h}\|^2_{\mathcal F}) \\ &\overset{(i)}{\ge} \rho P_n\tilde\psi(\tilde h,f_{\tilde h}) - \rho P_n\tilde\psi(h^{\mathrm{ora}},f_{\tilde h}) - 2\rho^2\Big(\gamma\|f_{\tilde h}\|^2_{\mathcal F} + \frac74\|f_{\tilde h}\|^2_{L_2} + \frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big) \\ &\overset{(ii)}{\ge} \rho\bigg(P\tilde\psi(\tilde h,f_{\tilde h}) - P\tilde\psi(h^{\mathrm{ora}},f_{\tilde h}) - v(\tilde h,h^{\mathrm{ora}})\Big(4\delta_{n,\tilde{\mathcal G}} + 2\sqrt{\frac{2\log(R_n/\zeta)}{n}}\Big)\bigg) - \frac{\rho\log(R_n/\zeta)}{3n} - 2\rho^2\Big(\gamma\|f_{\tilde h}\|^2_{\mathcal F}+\frac74\|f_{\tilde h}\|^2_{L_2}+\frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big) \\ &\overset{(iii)}{\ge} \rho\bigg(\Big(v(\tilde h,h^{\mathrm{ora}}) - 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r - 2\eta\Big)\|f_{\tilde h}\|_{L_2} - v(\tilde h,h^{\mathrm{ora}})\Big(4\delta_{n,\tilde{\mathcal G}} + 2\sqrt{\frac{2\log(R_n/\zeta)}{n}}\Big)\bigg) - \frac{\rho\log(R_n/\zeta)}{3n} - 2\rho^2\Big(\gamma\|f_{\tilde h}\|^2_{\mathcal F}+\frac74\|f_{\tilde h}\|^2_{L_2}+\frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big), \end{aligned} \tag{D.22}$$

where (i) uses (C.6); (ii) holds due to Lemma E.7 and (D.21); and (iii) holds due to (D.20). Here $R_n = \lceil\log_2(1/\delta_{n,\tilde{\mathcal G}})\rceil$.

If $4\delta_{n,\tilde{\mathcal G}} + 2\sqrt{2\log(R_n/\zeta)/n} \ge \|f_{\tilde h}\|_{L_2}/2$, we have

$$v(\tilde h, h^{\mathrm{ora}}) \le \|f_{\tilde h}\|_{L_2} + \eta + 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r \lesssim \delta_{n,\tilde{\mathcal G}} + \sqrt{\frac{\log(R_n/\zeta)}{n}} + \eta + r.$$

Otherwise, i.e., when $4\delta_{n,\tilde{\mathcal G}} + 2\sqrt{2\log(R_n/\zeta)/n} \le \|f_{\tilde h}\|_{L_2}/2$, using the upper bound in (D.18) together with (D.22), with probability $1-4\zeta$ we get

$$v(\tilde h, h^{\mathrm{ora}}) \le \frac{2}{\rho\|f_{\tilde h}\|_{L_2}}\bigg(4\epsilon^2 + 4\cdot36^2\,\tilde\delta^2_{n,\mathcal F} + 36\tilde\delta^2_{n,\mathcal F} + \tilde\delta^2_{n,\mathcal F} + 2\sqrt{\tfrac{2}{\pi}}\,C_{\mathcal F}C_T\,r\bigg) + \frac{2\log(R_n/\zeta)}{3n\|f_{\tilde h}\|_{L_2}} + \frac{4\rho}{\|f_{\tilde h}\|_{L_2}}\Big(\gamma\|f_{\tilde h}\|^2_{\mathcal F}+\frac74\|f_{\tilde h}\|^2_{L_2}+\frac{\tilde\delta^2_{n,\mathcal F}}{2}\Big) + 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r + 2\eta.$$

If $\|f_{\tilde h}\|_{L_2} \ge \tilde\delta_{n,\mathcal F} \ge \sqrt{\log(c_1/\zeta)/(c_2n)}$, choosing $\rho = \max\{\epsilon,\tilde\delta_{n,\mathcal F}\}/\|f_{\tilde h}\|_{L_2}$ and using $\gamma\|f\|^2_{\mathcal F}\le\|f\|^2_{L_2}$ for all $f\in\mathcal F$, we have

$$v(\tilde h,h^{\mathrm{ora}}) \lesssim \epsilon + \tilde\delta_{n,\mathcal F} + \eta + \sqrt{\frac{\log(R_n/\zeta)}{n}} + \frac{\nu\|h^{\mathrm{ora}}\|^2_{\mathcal H}}{\tilde\delta_{n,\mathcal F}} + (1+\tilde\delta_{n,\mathcal F})\frac{r}{\tilde\delta_{n,\mathcal F}}.$$

Otherwise, i.e., when $\|f_{\tilde h}\|_{L_2} \le \tilde\delta_{n,\mathcal F}$, we have

$$v(\tilde h,h^{\mathrm{ora}}) \le \|f_{\tilde h}\|_{L_2} + \eta + 2\sqrt{\tfrac{2}{\pi}}\,C_T\,r \lesssim \tilde\delta_{n,\mathcal F} + \eta + r.$$

Combining the relations above, we conclude that

$$v(\tilde h,h^{\mathrm{ora}}) \lesssim \epsilon + \delta_{n,\mathcal F_U} + \delta_{n,\tilde{\mathcal G}} + \eta + \frac{\nu\|h^{\mathrm{ora}}\|^2_{\mathcal H}}{\tilde\delta_{n,\mathcal F}} + \frac{r}{\tilde\delta_{n,\mathcal F}} + \sqrt{\frac{\log(R_n/\zeta)}{n}},$$

where we used the definition of $\tilde\delta_{n,\mathcal F}$.

E Auxiliary Lemmas

Lemma E.1. Let $\log t + d\log\log t = L_n$ be an equation with respect to $t$, with $d>0$ and $L_n\to\infty$ as $n\to\infty$. Then the solution $t^*$ of the equation satisfies $t^* \asymp \exp\{L_n - d\log L_n\}$.

Proof. Consider $\log t = L - d\log L + \Delta$, and we prove $\Delta = o(1)$. Using $\log(1+u) = u + O(u^2)$, we have

$$\begin{aligned} &(L - d\log L + \Delta) + d\log(L - d\log L + \Delta) = L \\ \iff{}& (L - d\log L + \Delta) + d\log L + d\log\Big(1 - \frac{d\log L}{L} + \frac{\Delta}{L}\Big) = L \\ \iff{}& \Delta + d\Big(-\frac{d\log L}{L} + O\Big(\frac{(\log L)^2}{L^2}\Big) + \frac{\Delta}{L}\Big) = 0 \\ \iff{}& \Delta\Big(1 + \frac dL\Big) = \frac{d^2\log L}{L} + O\Big(\frac{(\log L)^2}{L^2}\Big). \end{aligned} \tag{E.1}$$

As $n\to\infty$ we have $L\to\infty$, so the right-hand side of (E.1) converges to 0 and hence $\Delta = o(1)$. Therefore the solution satisfies $t^*\asymp\exp\{L_n - d\log L_n\}$.
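A small numerical check of Lemma E.1 (our own illustration, with arbitrarily chosen constants) can be run as follows: solve the fixed-point equation directly and compare with the asymptotic expression.

```python
import numpy as np
from scipy.optimize import brentq

# Lemma E.1: the root t* of  log t + d*log(log t) = L  behaves like
# exp(L - d*log L) as L grows.
d = 3.0
for L in [20.0, 50.0, 200.0]:
    g = lambda t: np.log(t) + d * np.log(np.log(t)) - L
    t_star = brentq(g, 2.0, np.exp(L))   # the root is bracketed by 2 and e^L
    approx = np.exp(L - d * np.log(L))
    print(f"L = {L:6.0f}: t*/approx = {t_star / approx:.4f}")
# The printed ratio approaches 1 as L grows, matching t* ≍ exp{L_n - d log L_n}.
```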
Lemma E.2. Let $\mathcal F$ be a function class with $\rho$-covering number $N(\rho,\mathcal F,\|\cdot\|_{L_2})$, and let $g$ be any measurable function. Let $\mathcal F - g := \{f-g : f\in\mathcal F\}$. Then for any $\rho>0$,

$$N(\rho,\mathcal F,\|\cdot\|_{L_2}) = N(\rho,\mathcal F-g,\|\cdot\|_{L_2}).$$

Proof. Let $\{f_\ell\}_{\ell=1}^{N(\rho,\mathcal F,\|\cdot\|_{L_2})}$ be a $\rho$-covering of $\mathcal F$ under the $L_2$ norm. For any $h\in\mathcal F-g$, there exists $f\in\mathcal F$ such that $h = f-g$. By the definition of the covering number, there exists some $f_{\ell_1}$ satisfying

$$\|(f-g)-(f_{\ell_1}-g)\|_{L_2} = \|f-f_{\ell_1}\|_{L_2} \le \rho.$$

Hence the collection $\{f_\ell - g\}_{\ell=1}^{N(\rho,\mathcal F,\|\cdot\|_{L_2})}$ forms a $\rho$-covering of $\mathcal F-g$, which implies $N(\rho,\mathcal F-g,\|\cdot\|_{L_2}) \le N(\rho,\mathcal F,\|\cdot\|_{L_2})$. Conversely, observe that $\mathcal F = (\mathcal F-g)+g$. Applying the above argument with $-g$ in place of $g$ yields $N(\rho,\mathcal F,\|\cdot\|_{L_2}) \le N(\rho,\mathcal F-g,\|\cdot\|_{L_2})$. Combining the two inequalities completes the proof.

Lemma E.3. Let $\mathcal F$ be a function class with $\rho$-covering number $N(\rho,\mathcal F,\|\cdot\|_{L_2})$, and let $\varphi$ be an $L$-Lipschitz function. Let $\varphi\circ\mathcal F := \{\varphi\circ f : f\in\mathcal F\}$. Then for any $\rho>0$,

$$N(\rho,\varphi\circ\mathcal F,\|\cdot\|_{L_2}) \le N(\rho/L,\mathcal F,\|\cdot\|_{L_2}).$$

Proof. Let $\{f_\ell\}_{\ell=1}^{N(\rho/L,\mathcal F,\|\cdot\|_{L_2})}$ be a $\rho/L$-covering of $\mathcal F$ under the $L_2$ norm. For any $\varphi\circ f\in\varphi\circ\mathcal F$, by the definition of the covering number there exists some $f_{\ell_1}$ satisfying $\|f-f_{\ell_1}\|_{L_2}\le\rho/L$, and then

$$\|\varphi(f)-\varphi(f_{\ell_1})\|^2_{L_2} = \int\big(\varphi(f)-\varphi(f_{\ell_1})\big)^2\,dP \le L^2\int(f-f_{\ell_1})^2\,dP = L^2\|f-f_{\ell_1}\|^2_{L_2} \le \rho^2.$$

Hence the collection $\{\varphi\circ f_\ell\}_{\ell=1}^{N(\rho/L,\mathcal F,\|\cdot\|_{L_2})}$ forms a $\rho$-covering of $\varphi\circ\mathcal F$, which implies $N(\rho,\varphi\circ\mathcal F,\|\cdot\|_{L_2}) \le N(\rho/L,\mathcal F,\|\cdot\|_{L_2})$.

Lemma E.4. Let $F$ be a function class with $\rho$-covering number $N(\rho,F,\|\cdot\|_{L_2})$, and let $\mathcal F = \{f = (f_1,\dots,f_m)^\top : f_1,\dots,f_m\in F\}$. Then for any $\rho>0$,

$$N\big(\rho,\mathcal F,\|\cdot\|_{L_2(P,\mathbb R^m)}\big) \le \big(N(\rho/\sqrt m, F, \|\cdot\|_{L_2(P,\mathbb R)})\big)^m.$$

Proof. Let $\{f_\ell\}_{\ell=1}^{N(\rho/\sqrt m,F,\|\cdot\|_{L_2})}$ be a $\rho/\sqrt m$-covering of $F$ under the $L_2$ norm, so that for any $f\in F$ there exists some $f_{\ell_i}$ satisfying $\|f-f_{\ell_i}\|_{L_2}\le\rho/\sqrt m$. Let

$$\bar{\mathcal F} = \big\{(f_{\ell_1},\dots,f_{\ell_m})^\top : \ell_1,\dots,\ell_m \in [N(\rho/\sqrt m,F,\|\cdot\|_{L_2})]\big\}.$$

For any $f\in\mathcal F$, there exists some $\bar f\in\bar{\mathcal F}$ such that

$$\|f-\bar f\|^2_{L_2} = \int\sum_{i=1}^m(f_i-\bar f_{\ell_i})^2\,dP \le m\cdot\frac{\rho^2}{m} = \rho^2.$$

Hence the collection $\bar{\mathcal F}$ forms a $\rho$-covering of $\mathcal F$, which implies $N(\rho,\mathcal F,\|\cdot\|_{L_2}) \le |\bar{\mathcal F}| \le (N(\rho/\sqrt m,F,\|\cdot\|_{L_2}))^m$.

Lemma E.5. Let $\mathcal F, \mathcal G$ be two function classes with $\rho$-covering numbers $N(\rho,\mathcal F,\|\cdot\|_{L_2})$ and $N(\rho,\mathcal G,\|\cdot\|_{L_2})$, respectively. In addition, suppose $\sup_{f\in\mathcal F}\|f\|_\infty \le B_{\mathcal F}$ and $\sup_{g\in\mathcal G}\|g\|_\infty \le B_{\mathcal G}$. Then the product function class $\mathcal M = \{fg : f\in\mathcal F, g\in\mathcal G\}$ satisfies

$$N(\rho,\mathcal M,\|\cdot\|_{L_2}) \le N\big(\rho/(2B_{\mathcal G}),\mathcal F,\|\cdot\|_{L_2}\big)\cdot N\big(\rho/(2B_{\mathcal F}),\mathcal G,\|\cdot\|_{L_2}\big).$$

Proof. We first notice that for any $f_1,f_2\in\mathcal F$ and $g_1,g_2\in\mathcal G$, it holds that

$$\|f_1g_1 - f_2g_2\|_{L_2} \le \|(f_1-f_2)g_1\|_{L_2} + \|(g_1-g_2)f_2\|_{L_2} \le B_{\mathcal G}\|f_1-f_2\|_{L_2} + B_{\mathcal F}\|g_1-g_2\|_{L_2}.$$

Let $\{f_\ell\}_{\ell=1}^{N(\rho/(2B_{\mathcal G}),\mathcal F,\|\cdot\|_{L_2})}$ and $\{g_\ell\}_{\ell=1}^{N(\rho/(2B_{\mathcal F}),\mathcal G,\|\cdot\|_{L_2})}$ be a $\rho/(2B_{\mathcal G})$-covering of $\mathcal F$ and a $\rho/(2B_{\mathcal F})$-covering of $\mathcal G$, respectively. Then for any $f\in\mathcal F$ and $g\in\mathcal G$, there exist some $f_{\ell_1}$ and $g_{\ell_2}$ such that $\|f-f_{\ell_1}\|_{L_2}\le\rho/(2B_{\mathcal G})$ and $\|g-g_{\ell_2}\|_{L_2}\le\rho/(2B_{\mathcal F})$; hence $\|fg-f_{\ell_1}g_{\ell_2}\|_{L_2}\le\rho$. Therefore $\{f_{\ell_1}g_{\ell_2} : \ell_1\in[N(\rho/(2B_{\mathcal G}),\mathcal F,\|\cdot\|_{L_2})],\ \ell_2\in[N(\rho/(2B_{\mathcal F}),\mathcal G,\|\cdot\|_{L_2})]\}$ is a $\rho$-covering of $\mathcal M$, and the conclusion follows immediately.
Lemma E.6 (Theorem 14.1, Wainwright (2019)). Given a star-shaped function class $\mathcal F$, suppose $\|f\|_\infty \le b$ holds for any $f\in\mathcal F$. Let $\delta_n$ be any positive solution of the inequality

$$\mathbb E\Bigg[\sup_{f\in\mathcal F,\ \|f\|_{L_2}\le\delta}\bigg|\frac 1n\sum_{i=1}^n\varepsilon_i f(X_i)\bigg|\Bigg] \le \frac{\delta^2}{b},$$

where $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. Rademacher variables. For any $t\ge\delta_n$, we have

$$\big|(P_n-P)f^2\big| \le \frac12 Pf^2 + \frac{t^2}{2}, \quad \forall f\in\mathcal F,$$

with probability at least $1-c_1e^{-c_2nt^2/b^2}$. If in addition $n\delta_n^2 \ge 2c_2\log(4\log(1/\delta_n))$, then

$$\big|P_nf^2 - Pf^2\big| \le c_0\,\delta_n, \quad \forall f\in\mathcal F,$$

with probability at least $1-c_1'e^{-c_2'n\delta_n^2/b^2}$.

Lemma E.7. Let $\mathcal G$ be a star-shaped function class and denote its critical radius by $\delta_{n,\mathcal G}$. If $\sup_{g\in\mathcal G}\|g\|_\infty\le1$, then with probability at least $1-\zeta$ we have

$$|(P_n-P)g| \le \|g\|_{L_2}\bigg(2\delta_{n,\mathcal G} + \sqrt{\frac{2\log(R_n/\zeta)}{n}}\bigg) + \frac{\log(R_n/\zeta)}{3n}$$

simultaneously for all $g\in\mathcal G$, where $R_n = \lceil\log_2(1/\delta_{n,\mathcal G})\rceil + 1$.

Proof. Define the function class $\mathcal G(r) = \{g\in\mathcal G : \|g\|_{L_2}\le r\}$ for $r>0$, and denote the random variable $Z_n(r) = \sup_{g\in\mathcal G(r)}|(P_n-P)g|$. Applying the functional version of Bennett's inequality (e.g., Theorem 7.3 in Bousquet (2003)), we have

$$P\Bigg(Z_n(r) \ge \mathbb E[Z_n(r)] + r\sqrt{\frac{2x}{n}} + \frac{x}{3n}\Bigg) \le e^{-x}. \tag{E.2}$$

Let $\varepsilon_i\sim\mathrm{Uniform}\{\pm1\}$ be i.i.d. Rademacher variables. Using the standard symmetrization technique (e.g., Lemma A.5 in Bartlett et al. (2005)), for any $r\ge\delta_{n,\mathcal G}$ we have

$$\mathbb E[Z_n(r)] \le 2\,\mathbb E\Bigg[\sup_{g\in\mathcal G(r)}\bigg|\frac1n\sum_{i=1}^n\varepsilon_i g(X_i)\bigg|\Bigg] = 2\mathcal R(r;\mathcal G) \le 2r\delta_{n,\mathcal G}, \tag{E.3}$$

where we used the fact that $\mathcal R(r;\mathcal G)/r$ is non-increasing together with the definition of the critical radius, so that for any $r\ge\delta_{n,\mathcal G}$,

$$\frac{\mathcal R(r;\mathcal G)}{r} \le \frac{\mathcal R(\delta_{n,\mathcal G};\mathcal G)}{\delta_{n,\mathcal G}} \le \delta_{n,\mathcal G}.$$

Substituting (E.3) into (E.2), we have

$$P\Bigg(Z_n(r) \ge 2r\delta_{n,\mathcal G} + r\sqrt{\frac{2x}{n}} + \frac{x}{3n}\Bigg) \le e^{-x}. \tag{E.4}$$

Next, we employ the peeling argument by defining $\mathcal G(0) = \{g\in\mathcal G : \|g\|_{L_2}\le\delta_{n,\mathcal G}\}$ and

$$\mathcal G(k) = \big\{g\in\mathcal G : 2^{k-1}\delta_{n,\mathcal G} \le \|g\|_{L_2} \le 2^k\delta_{n,\mathcal G}\big\}, \quad k\ge1.$$

Let $R_n = \lceil\log_2(1/\delta_{n,\mathcal G})\rceil + 1$; then it holds that $\cup_{k=0}^{R_n}\mathcal G(k) = \mathcal G$. Letting $r_k = 2^k\delta_{n,\mathcal G}$ and using (E.4), we can guarantee that

$$P\Bigg(\exists\,k\in[R_n] :\ Z_n(r_k) \ge 2r_k\delta_{n,\mathcal G} + r_k\sqrt{\frac{2x}{n}} + \frac{x}{3n}\Bigg) \le R_ne^{-x},$$

and we also have

$$P\Bigg(Z_n(r_0) \ge 2\delta^2_{n,\mathcal G} + \delta_{n,\mathcal G}\sqrt{\frac{2x}{n}} + \frac{x}{3n}\Bigg) \le e^{-x}.$$

Notice that $g\in\mathcal G(k)$ implies $\|g\|_{L_2}\le r_k$ for $k\ge1$, and $g\in\mathcal G(0)$ implies $\|g\|_{L_2}\le\delta_{n,\mathcal G}$. Hence we can conclude that, for any $g\in\mathcal G$ with $\|g\|_{L_2}\ge\delta_{n,\mathcal G}$,

$$P\Bigg(|(P_n-P)g| \ge 2\|g\|_{L_2}\delta_{n,\mathcal G} + \|g\|_{L_2}\sqrt{\frac{2x}{n}} + \frac{x}{3n}\ \text{for some } g\in\mathcal G\Bigg) \le R_ne^{-x}. \tag{E.5}$$

Taking $x = \log(R_n/\zeta)$ in (E.5), we can prove that with probability at least $1-\zeta$,

$$|(P_n-P)g| \le \|g\|_{L_2}\bigg(2\delta_{n,\mathcal G} + \sqrt{\frac{2\log(R_n/\zeta)}{n}}\bigg) + \frac{\log(R_n/\zeta)}{3n}$$

holds simultaneously for all $g\in\mathcal G$.

Lemma E.8. Let $\mathcal G$ be the function class defined in (C.13), $R_n = \lceil\log_2(1/\delta_{n,\mathcal G})\rceil + 1$, and $v(C,C^{\mathrm{ora}}) = \|\alpha(C)-\alpha(C^{\mathrm{ora}})\|_{L_2}$. Then with probability at least $1-\zeta$, we have

$$\big|(P_n-P)\big(\psi(C,f_C)-\psi(C^{\mathrm{ora}},f_C)\big)\big| \le v(C,C^{\mathrm{ora}})\bigg(4\delta_{n,\mathcal G} + 2\sqrt{\frac{2\log(R_n/\zeta)}{n}}\bigg) + \frac{\log(R_n/\zeta)}{3n}$$

simultaneously for all $C\in\mathcal C$.

Proof. By the definition of $f_C$ in (C.11) and

$$\mathcal G = \big\{(x,z,y)\mapsto f_C(z)\big(\mathbb 1\{y\notin C(x)\} - \mathbb 1\{y\notin C^{\mathrm{ora}}(x)\}\big) : C\in\mathcal C\big\},$$

we know $\sup_{g\in\mathcal G}\|g\|_\infty \le \sup_{C\in\mathcal C}\|f_C\|_\infty \le 1$.
Applying Lemma E.7 to $\mathcal G$, we can prove the conclusion by using the fact that $\|g\|_{L_2} \le \|f_C\|_{L_2} \le 2v(C,C^{\mathrm{ora}})$, which follows from the definition $f_C = \arg\min_{f\in\mathcal F}\|f-(\alpha(C)-\alpha(C^{\mathrm{ora}}))\|^2_{L_2}$ and

$$\|f_C\|_{L_2} \le \|f_C - (\alpha(C)-\alpha(C^{\mathrm{ora}}))\|_{L_2} + \|\alpha(C)-\alpha(C^{\mathrm{ora}})\|_{L_2} \le 2v(C,C^{\mathrm{ora}}).$$

F Additional Synthetic Data Results

F.1 Experiment details of Section 5.1

To obtain a reliable estimate of the MSCE, we set $|\mathcal D_{\text{test}}| = 10{,}000$. The CC method (Gibbs et al., 2025) is a full conformal procedure that requires recomputation for each test point and therefore cannot be efficiently implemented with matrix operations, resulting in substantial computational cost. In our experiments, its runtime is approximately 10–100 times larger than that of CC(non-full), which only performs quantile regression using $\mathcal D_{\text{cal}}$ (see Appendix B.2). To mitigate the excessive computational burden caused by the large number of test samples, we therefore adopt CC(non-full) as the CC baseline in this section. In this experiment, we set the number of equidistant partition sets to $|\mathcal J| = 100$ when computing $\sqrt{\mathrm{MSCE}}$ and the number of balls to $n_B = 50$ when evaluating worst-case coverage.

For the metric Set Size, we compute the set size $|\widehat C(x)|$ in closed form for both ellipsoidal and box prediction sets.

Ellipsoidal sets. For the baseline methods, the prediction set takes the form

$$\widehat C(x) = \big\{y\in\mathbb R^{d_Y} : (y-\mu_0(x))^\top\Sigma_0^{-1}(y-\mu_0(x)) \le q(x)\big\},$$

whose volume is given by

$$|\widehat C(x)| = \frac{\pi^{d_Y/2}}{\Gamma\big(\tfrac{d_Y}{2}+1\big)}\,q(x)^{d_Y/2}\det(\Sigma_0)^{1/2},$$

where $\Gamma(\cdot)$ is the standard Gamma function. For our MOPI, the volume of the prediction set is

$$|\widehat C(x)| = \frac{\pi^{d_Y/2}}{\Gamma\big(\tfrac{d_Y}{2}+1\big)}\det\big(\Sigma_0\Sigma(x)\big)^{1/2}.$$

Box sets. For the baseline methods, the prediction set is defined as

$$\widehat C(x) = \big\{y\in\mathbb R^{d_Y} : \|(y-\mu_0(x))/\sigma_0(x)\|_\infty \le q(x)\big\},$$

which corresponds to an axis-aligned box with side length $2q(x)\sigma_{0,j}(x)$ along each coordinate $j$. Its volume is

$$|\widehat C(x)| = \prod_{j=1}^{d_Y}2q(x)\sigma_{0,j}(x).$$

For our MOPI, the volume of the prediction set is

$$|\widehat C(x)| = \prod_{j=1}^{d_Y}2\sigma_{0,j}(x)\sigma_j(x).$$
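For reference, the closed-form volumes above can be evaluated as in the following sketch (our own illustration; `Sigma0`, `q_x`, and `sigma0_x` are placeholder inputs, not quantities from the paper's code).

```python
import numpy as np
from scipy.special import gammaln

def ellipsoid_volume(Sigma0, q_x):
    # Volume of {y : (y - mu0)^T Sigma0^{-1} (y - mu0) <= q(x)}:
    # pi^{d/2} / Gamma(d/2 + 1) * q(x)^{d/2} * det(Sigma0)^{1/2}.
    d_y = Sigma0.shape[0]
    log_unit_ball = (d_y / 2) * np.log(np.pi) - gammaln(d_y / 2 + 1)
    _, logdet = np.linalg.slogdet(Sigma0)
    return np.exp(log_unit_ball + (d_y / 2) * np.log(q_x) + 0.5 * logdet)

def box_volume(q_x, sigma0_x):
    # Volume of the axis-aligned box with half-widths q(x) * sigma_{0,j}(x).
    return np.prod(2.0 * q_x * np.asarray(sigma0_x))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Sigma0 = A @ A.T + np.eye(3)          # a positive-definite covariance
print(ellipsoid_volume(Sigma0, q_x=2.5))
print(box_volume(q_x=1.3, sigma0_x=[0.5, 1.0, 2.0]))
```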
For each method, all tuning parameters are selected by minimizing the average $\sqrt{\mathrm{MSCE}}$ over 10 replications of the entire experiment with different random seeds. These seeds are distinct from those used in the reported experiments.

F.2 Experiment details of Section 5.2

We first describe Setting 2 in detail:

$$Y_i \mid \{X_i, Z_i = z\} \sim N\big(\mu_z(X_i), \sigma_z^2(X_i)\big), \qquad Z_i \mid X_i \sim \mathrm{Cat}\big(\pi_1(X_i),\dots,\pi_4(X_i)\big),$$

where $\mathrm{Cat}$ denotes the categorical distribution. The mixing weights $\{\pi_z(X_i)\}_{z=1}^4$ are defined through a softmax model

$$\pi_z(X_i) = \frac{\exp(X_i^\top\gamma_z)}{\sum_{\ell=1}^4\exp(X_i^\top\gamma_\ell)}, \quad z = 1,\dots,4,$$

with coefficient vectors $\gamma_z\in\mathbb R^{d_X}$. The component means and standard deviations are

$$\mu_z(X_i) = X_i^\top a_z + 0.5\sin^2(X_{i,1}), \qquad \sigma_z(X_i) = |X_i^\top\delta_z| + c_z.$$

Here, $a_z, \delta_z\in\mathbb R^{d_X}$ are fixed coefficient vectors and $c_z>0$ are component-specific constants.

In this experiment, since the support of $Z$ is relatively small, we set the test sample size to $|\mathcal D_{\text{test}}| = 500$. Due to the relatively small $|\mathcal D_{\text{test}}|$, we can adopt the full conformal version of the CC method in this section, following Gibbs et al. (2025).

All tuning parameters are selected by minimizing the average $\sqrt{\mathrm{MSCE}}$ over 10 replications of the entire experiment with different random seeds, which are distinct from those used in the reported experiments. Here,

$$\sqrt{\mathrm{MSCE}} = \Bigg\{|\mathcal Z|^{-1}\sum_{z\in\mathcal Z}\bigg(n_z^{-1}\sum_{i\in\mathcal D_{\text{test}}}\mathbb 1\{Z_i = z,\ Y_i\notin\widehat C(X_i)\} - \alpha\bigg)^2\Bigg\}^{1/2},$$

where $n_z = \sum_{i\in\mathcal D_{\text{test}}}\mathbb 1\{Z_i = z\}$.
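The following minimal sketch (our own illustration) computes the empirical $\sqrt{\mathrm{MSCE}}$ from the display above; `covered` is a hypothetical array of coverage indicators $\mathbb 1\{Y_i\in\widehat C(X_i)\}$.

```python
import numpy as np

def sqrt_msce(Z_test, covered, alpha):
    """Empirical sqrt(MSCE) over the levels of a discrete Z, as in the display
    above. covered[i] indicates Y_i in C_hat(X_i)."""
    Z_test = np.asarray(Z_test)
    covered = np.asarray(covered, dtype=bool)
    gaps = []
    for z in np.unique(Z_test):
        mask = Z_test == z
        miscover = np.mean(~covered[mask])  # n_z^{-1} sum of 1{Z_i = z, Y_i not in C_hat}
        gaps.append((miscover - alpha) ** 2)
    return np.sqrt(np.mean(gaps))

# Toy usage with simulated coverage indicators.
rng = np.random.default_rng(1)
Z_test = rng.integers(0, 4, size=500)
covered = rng.random(500) < 0.9
print(sqrt_msce(Z_test, covered, alpha=0.1))
```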
Since $Z$ is a sensitive variable, our simulation setting restricts its usage as follows: $Z$ is not available during pretraining or when constructing prediction intervals; it is only used in the calibration stage, where it helps learn a more accurate threshold function $h(x)$. In our framework, we can leverage the sensitive variable $Z$ by solving (8). However, since $Z$ is unobservable for the test samples, algorithms such as CC, SCP, and RLCP cannot leverage $Z$ in their calibration procedures. Hence, for CC, SCP, and RLCP, prediction sets can only be constructed using $X$.

F.3 Simulation results for one-dimensional label

In this section, we consider the sublevel set (7) for all baseline methods and MOPI. In each replication, we independently generate pretraining, calibration, and test datasets from the same distribution. We use a pretraining set of size $|\mathcal D_{\text{pre}}| = 3{,}000$ to train $\hat\mu$ via linear regression with score $s(x,y) = |y-\hat\mu(x)|$, a calibration set of size $|\mathcal D_{\text{cal}}| = 1{,}500$ to construct prediction sets, and a test set of size $|\mathcal D_{\text{test}}| = 5{,}000$ for evaluation.

F.3.1 Test-conditional coverage

We first focus on test-conditional coverage, that is, $Z = X = (X_1,\dots,X_6)^\top$. Consider a heteroscedastic regression model $Y = \sum_{j=1}^6\frac{1}{\sqrt6}X_j + e(X)$, with $X_1$ independently drawn from $\mathrm{Unif}(0,5)$ and $X_2,\dots,X_6$ independently drawn from $N(0,1)$. The noise term is generated as $e(X)\sim N(0,\sigma^*(X)^2)$, and we consider the following three settings:

- Setting 1(x): $\sigma^*(X)^2 = 1 + X_1^2 + 0.5\sin\big(\sum_{j=2}^6X_j\big)$;
- Setting 2(x): $\sigma^*(X)^2 = 1 + 0.2\sum_{j=1}^6\sin^2(X_j) + \sum_{j=1}^6X_j^2\cdot\mathbb 1\{\sum_{j=1}^6X_j > 3\}$;
- Setting 3(x): $\sigma^*(X)^2 = 2 - \mathbb 1\{X_1 < 1.2\} + 5\cdot\mathbb 1\{X_1 > 3\} + 2\cdot\mathbb 1\{X_3 > 0\} + 3\cdot\mathbb 1\{X_5 > 0.5\}$.

In Setting 1(x), the variance depends primarily on the first coordinate $X_1$, whereas in Setting 2(x) it depends on all coordinates. In Setting 3(x), the variance follows a discrete structure. We choose $\mathcal F$ and $\mathcal H$ to be the same RKHS. All tuning parameters are selected by minimizing the average $\sqrt{\mathrm{MSCE}}$ over 10 replications of the entire experiment with different random seeds; these seeds are distinct from those used in the reported experiments.

Computation of evaluation metrics. In our simulation studies, when computing the worst-case coverage, we set the number of random balls to $|\mathcal B| = 50$ and the radius to $r = \sqrt{2d_X}$, where $d_X$ denotes the dimension of the covariates $X$ (a sketch of this computation follows below). For the multi-dimensional settings with $d_X = 6$ or $11$, the $\sqrt{\mathrm{MSCE}}$ is approximated using equidistant partition sets constructed only along the first coordinate $X_1$. This choice is motivated by two considerations. First, $X_1\sim U(0,5)$, and from the variance expressions in Settings 1(x), 2(x), and 3(x) it is clear that the variance contribution from $X_1$ dominates. Second, constructing equidistant partition sets over the full feature space would result in an extremely large number of partitions, rendering the computation prohibitively expensive.
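The sketch below shows one way to compute the ball-based worst-case coverage under our reading of the description above (random centers drawn from the test covariates, minimum empirical coverage over the balls); all names are our own, and the paper's exact implementation may differ.

```python
import numpy as np

def worst_case_coverage(X_test, covered, n_balls=50, radius=None, seed=0):
    """Minimum empirical coverage over random covariate balls; covered[i]
    indicates Y_i in C_hat(X_i)."""
    X_test = np.asarray(X_test)
    covered = np.asarray(covered, dtype=float)
    if radius is None:
        radius = np.sqrt(2 * X_test.shape[1])   # r = sqrt(2 d_X), as in the text
    rng = np.random.default_rng(seed)
    centers = X_test[rng.choice(len(X_test), size=n_balls, replace=False)]
    worst = 1.0
    for c in centers:
        in_ball = np.linalg.norm(X_test - c, axis=1) <= radius
        if in_ball.any():
            worst = min(worst, covered[in_ball].mean())
    return worst

rng = np.random.default_rng(2)
X_test = np.column_stack([rng.uniform(0, 5, 1000), rng.normal(size=(1000, 5))])
covered = rng.random(1000) < 0.9
print(worst_case_coverage(X_test, covered))
```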
Table F.1 shows that MOPI essentially maintains marginal coverage validity while achieving the highest worst-case coverage and the lowest MSCE. Figure F.1 presents prediction intervals for the one-dimensional $X$. Compared with CC, MOPI yields intervals that align more closely with the oracle and exhibit fewer uncovered cases, as highlighted in the zoomed-in region of Figure F.1. Compared with RLCP, MOPI and CC produce smoother intervals as $x$ varies, enhancing practical usability.

Table F.1: Coverage metrics and set size under different variance settings with $d_X = 6$.

Setting 1(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.888      0.865        0.050   4.492
  CC       0.883      0.858        0.056   4.480
  RLCP     0.900      0.815        0.094   4.955
  SCP      0.900      0.796        0.113   5.090

Setting 2(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.891      0.865        0.049   3.046
  CC       0.883      0.857        0.055   3.005
  RLCP     0.901      0.811        0.076   3.320
  SCP      0.900      0.788        0.095   3.351

Setting 3(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.888      0.865        0.054   4.075
  CC       0.882      0.859        0.062   4.069
  RLCP     0.900      0.846        0.066   4.249
  SCP      0.899      0.831        0.075   4.251

[Figure F.1: Prediction intervals produced by each method under the univariate settings. We generate $X\sim U(0,5)$ and $Y\mid X\sim N(X,\sigma^2(X))$. Top: $\sigma^2(X) = 0.5\cdot\mathbb 1\{X<1.5\} + 10\cdot\mathbb 1\{1.5\le X<3.5\} + 2\cdot\mathbb 1\{3.5\le X\}$; bottom: $\sigma^2(X) = 1 + X^3/2$. The orange solid line represents the linear regression model $\hat\mu$ fitted on the training data; the colored dashed lines denote the prediction intervals constructed by the different methods; and the black dashed line corresponds to the oracle prediction interval.]

Additionally, we consider Settings 1(x), 2(x), and 3(x) with $d_X = 11$. From Table F.2, we observe that MOPI achieves the highest worst-case coverage and the smallest MSCE while maintaining essentially valid marginal coverage. These results indicate that, for the settings studied in our experiments, MOPI provides improved test-conditional coverage under heterogeneous variance structures and varying covariate dimensions.

Table F.2: Coverage metrics and set size for $d_X = 11$ under Settings 1(x), 2(x), and 3(x).

Setting 1(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.896      0.884        0.056   4.546
  CC       0.885      0.868        0.056   4.466
  RLCP     0.901      0.823        0.100   5.027
  SCP      0.900      0.813        0.112   5.099

Setting 2(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.896      0.885        0.053   3.159
  CC       0.882      0.870        0.056   3.033
  RLCP     0.900      0.836        0.069   3.313
  SCP      0.900      0.826        0.078   3.332

Setting 3(x):
  Method   Marginal   Worst-case   √MSCE   Set size
  MOPI     0.895      0.885        0.054   4.155
  CC       0.888      0.875        0.057   4.067
  RLCP     0.901      0.852        0.068   4.263
  SCP      0.900      0.844        0.075   4.258

F.3.2 Group-conditional coverage

In line with the simulations for test-conditional coverage, we investigate Settings 1(x), 2(x), and 3(x). To evaluate group-conditional coverage, we consider a collection of overlapping complex groups $\mathcal G = \{G_1,\dots,G_4\}$, where

$$G_1 = \{x : x_1 + x_6 \ge 3\}, \qquad G_2 = \{x : \sin(x_5) + x_1^2 \ge 3.5\},$$
$$G_3 = \big\{x : |x_2| + 3|x_4| + |x_6| + 2|x_3| \le 3\sqrt{x_1}\big\}, \qquad G_4 = \{x : 4x_3^2 + x_5^2 + 2x_2^2 - x_1^2 \le x_1\}.$$

Let $Z = (\mathbb 1\{X\in G_1\},\dots,\mathbb 1\{X\in G_4\})^\top$. For a fair comparison, we choose the solving space of CC and the classes $\mathcal F$ and $\mathcal H$ of MOPI to be $\{\sum_{z\in\mathcal Z}\beta_z\mathbb 1\{Z=z\} : \beta_z\in\mathbb R,\ z\in\{0,1\}^{|\mathcal G|}\}$. For RLCP, we also choose the Gaussian kernel. From Figures F.2, F.3, and F.4, we observe that both MOPI and CC achieve accurate group-conditional coverage under both the disjoint and overlapping group structures, whereas RLCP and SCP exhibit comparatively inferior performance.

[Figure F.2: Group-conditional coverage and set size under Setting 1(x) with $d_X = 6$.]

[Figure F.3: Group-conditional coverage and set size under Setting 2(x) with $d_X = 6$.]

[Figure F.4: Group-conditional coverage and set size under Setting 3(x) with $d_X = 6$.]

G Additional Real Data Results

G.1 Experiment details of Section 6.1

Pretrained models. For ellipsoidal sets, SCP, CC, and RLCP use the score $s(x,y) = (y-\mu_0(x))^\top\Sigma_0^{-1}(x)(y-\mu_0(x))$ and prediction sets of the form $\widehat C(x) = \{y : s(x,y)\le\hat h(x)\}$. Here the regression function $\mu_0(x)$ is pretrained using a random forest, while $\Sigma_0^{-1}(x) = L_0(x)L_0(x)^\top$ is obtained from a neural network that predicts the Cholesky factor $L^*(x)$ of $(\Sigma^*)^{-1}(x)$. The network takes $x$ as input and outputs the lower-triangular entries of $L_0(x)$, with positivity of the diagonal enforced via a Softplus transformation. The network $L_0$ is pretrained by minimizing the Gaussian negative log-likelihood

$$\min_L\ \frac{1}{n_{\text{pre}}}\sum_{i=1}^{n_{\text{pre}}}\bigg\{\frac12(Y_i-\mu_0(X_i))^\top\big(L(X_i)L(X_i)^\top\big)(Y_i-\mu_0(X_i)) - \frac12\log\det\big(L(X_i)L(X_i)^\top\big)\bigg\},$$

where $\log\det\big(L(x)L(x)^\top\big) = 2\sum_j\log L_{jj}(x)$ since $L(x)$ is lower triangular.
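A minimal PyTorch sketch of this pretraining step is given below. The layer sizes and all names are our own choices under the description above, not the paper's exact architecture; only the lower-triangular parameterization, the Softplus diagonal, and the negative log-likelihood follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CholeskyNet(nn.Module):
    """Predicts the lower-triangular Cholesky factor L(x) of the inverse covariance."""
    def __init__(self, d_x, d_y, hidden=64):
        super().__init__()
        self.d_y = d_y
        rows, cols = torch.tril_indices(d_y, d_y)
        self.register_buffer("rows", rows)
        self.register_buffer("cols", cols)
        self.diag_mask = rows == cols
        self.net = nn.Sequential(nn.Linear(d_x, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_y * (d_y + 1) // 2))

    def forward(self, x):
        raw = self.net(x)                                          # (batch, #tril entries)
        vals = torch.where(self.diag_mask, F.softplus(raw), raw)   # positive diagonal
        L = raw.new_zeros(x.shape[0], self.d_y, self.d_y)
        L[:, self.rows, self.cols] = vals                          # fill lower triangle
        return L

def gaussian_nll(L, resid):
    # 0.5 * r^T (L L^T) r - 0.5 * log det(L L^T), with log det = 2 * sum_j log L_jj.
    quad = 0.5 * torch.einsum("bi,bij,bkj,bk->b", resid, L, L, resid)
    logdet = 2.0 * torch.log(torch.diagonal(L, dim1=1, dim2=2)).sum(dim=1)
    return (quad - 0.5 * logdet).mean()

model = CholeskyNet(d_x=5, d_y=3)
x, resid = torch.randn(32, 5), torch.randn(32, 3)  # resid = Y - mu_0(X)
loss = gaussian_nll(model(x), resid)
loss.backward()
```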
G.2 Additional results for the Communities and Crime dataset

We report additional group-conditional coverage results, focusing on subpopulations with high racial representation: a community is defined as "high-representation" for a given racial group if that group's share in the community exceeds the 70th percentile. As shown in Figure G.1, in both the masked and unmasked cases, MOPI attains more accurate and more equitable coverage across these high-representation subpopulations compared with the baseline methods.

[Figure G.1: Group-conditional coverage for the Communities and Crime dataset. The letters B, W, A, and H refer to the racial categories "Black", "White", "Asian", and "Hispanic", respectively.]

G.3 Medical Insurance Dataset

This dataset¹ contains 3,630 records of medical charges and demographic attributes (age, BMI, number of children, sex, and smoking status). We use $Y = \log(\text{charges})$ as the label variable and take $X_{\text{all}} = (X_{\text{age}}, X_{\text{BMI}}, X_{\text{children}}, X_{\text{sex}}, X_{\text{smoking}})$ as the feature vector. We consider the sublevel set (7) for all baseline methods and MOPI, and split the data into three parts: the first 2,000 samples are used to pretrain a random forest predictor $\mu$, with score function $s(x,y) = |y-\mu(x)|$; the next 1,080 samples are used to construct the prediction set; and the remaining 550 samples are used as test data. Here, we consider two experimental settings, the unmasked case and the masked case. The experiment is repeated 100 times with random train–calibration–test splits, and the target coverage level is $1-\alpha = 0.9$.

¹ https://www.kaggle.com/datasets/rajgupta2019/medical-insurance-dataset

In the unmasked case, all features are available at prediction time, corresponding to $X = Z = X_{\text{all}}$. In this case, we evaluate whether each demographic subgroup achieves the target coverage level. For a fair comparison, we set the function classes $\mathcal H, \mathcal F$ in MOPI and the search space in CC to be RKHSs with a Gaussian kernel. To avoid uninformative intervals in RLCP, we tune the kernel bandwidth such that their proportion is below 10%, and exclude these intervals when reporting lengths. We consider marginal coverage and group-conditional coverage over two simple groups and two complex groups: (1) Smoker: $X_{\text{smoking}} = 1$; (2) Male: $X_{\text{sex}} = 1$; (3) BFR > 40%: individuals with a body fat ratio (BFR) exceeding 40%, corresponding to the extremely obese group, where $\mathrm{BFR} = (1.2\cdot X_{\text{BMI}} + 0.23\cdot X_{\text{age}} - 10.8\cdot X_{\text{sex}} - 5.4)\%$; see Deurenberg et al. (1991); (4) Male, BFR > 24%: obese males, defined as those with $X_{\text{sex}} = 1$ and BFR > 24%. Figure G.2 shows that MOPI achieves near-perfect coverage guarantees with reasonable set sizes.

In the masked case, the gender attribute is treated as a sensitive variable that is unavailable for the test points. Here, we let $Z = X_{\text{sex}}$ and $X = X_{\text{all}}\setminus X_{\text{sex}}$, and assess whether the methods achieve equalized coverage across the two gender groups despite the sensitive attribute being unobserved at prediction time. Figure G.3 reports the coverage for the Male and Female groups under both experimental settings. Our MOPI method achieves nearly perfect coverage in both the unmasked and the masked settings, whereas the other baselines exhibit noticeable coverage gaps when the sensitive attributes are masked.

[Figure G.2: Performance comparison on the medical insurance charges dataset. Left: marginal and group-conditional coverage; right: corresponding set size.]

[Figure G.3: Coverage on the Male and Female groups for the Medical Insurance Dataset.]

G.4 FashionMNIST Dataset

The dataset (Xiao et al., 2017) contains 70,000 grayscale images of clothing items ($28\times28$ pixels), evenly distributed across ten categories such as T-shirts, shoes, and bags. We consider the sublevel set (7) for all baseline methods and MOPI.
The data are partitioned into training, calibration, and test sets in a ratio of 4:1:1. We train a convolutional neural network (CNN) $\pi(x): \mathcal X\mapsto[0,1]^{|\mathcal Y|}$ consisting of two convolutional layers (each followed by max-pooling, ReLU, and dropout) and two fully connected layers of sizes $320\to20$ and $20\to10$, mapping the flattened features to 10 logits for classification. The 20-dimensional output of the penultimate layer is used as the feature representation. We use the adaptive prediction sets (APS) score (Romano et al., 2019),

$$S(x,y) := \sum_{i:\ \pi_i(x)>\pi_y(x)}\pi_i(x).$$
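As a small illustration of this score (our own, with hypothetical inputs), it can be computed from a probability vector as follows.

```python
import numpy as np

def aps_score(probs, y):
    """APS-style score from the display above: the total probability mass of the
    classes ranked strictly above the true label y (probs sums to one)."""
    probs = np.asarray(probs)
    return probs[probs > probs[y]].sum()

probs = np.array([0.05, 0.50, 0.20, 0.15, 0.10])
print(aps_score(probs, y=2))  # mass of classes with probability > 0.20 -> 0.50
```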
(2024), “Conformal prediction with learned features,” in International Confer enc e on Machine L e arning , PMLR, pp. 24749–24769. Redmond, M. and Ba v eja, A. (2002), “A data-driven softw are to ol for enabling co op era- tiv e information sharing among p olice departments,” Eur op e an Journal of Op er ational R ese ar ch , 141, 660–678. Romano, Y., P atterson, E., and Candes, E. (2019), “Conformalized quantile regression,” A dvanc es in Neur al Information Pr o c essing Systems , 32, 1–11. 99 Sc hölk opf, B., Herbric h, R., and Smola, A. J. (2001), “A generalized represen ter theorem,” in International Confer enc e on Computational L e arning The ory , Springer, pp. 416–426. V an Der V aart, A. W. and W ellner, J. A. (2023), W e ak Conver genc e and Empiric al Pr o c esses: With A pplic ations to Statistics , Springer Nature. W ain wrigh t, M. J. (2019), High-Dimensional Statistics: A Non-A symptotic V iewp oint , v ol. 48, Cam bridge Univ ersit y Press. Xiao, H., Rasul, K., and V ollgraf, R. (2017), “F ashion-mnist: a no vel image dataset for b enc hmarking mac hine learning algorithms,” arXiv pr eprint arXiv:1708.07747 . Zhou, D.-X. (2002), “The co v ering n um b er in learning theory ,” Journal of Complexity , 18, 739–767. 100
