Conformal Prediction for Nonparametric Instrumental Regression

Masahiro Kato∗
The University of Tokyo

March 27, 2026

Abstract

We propose a method for constructing distribution-free prediction intervals in nonparametric instrumental variable regression (NPIV), with finite-sample coverage guarantees. Building on the conditional guarantee framework in conformal inference, we reformulate conditional coverage as marginal coverage over a class of IV shifts F. Our method can be combined with any NPIV estimator, including sieve 2SLS and other machine-learning-based NPIV methods such as neural network and minimax approaches. Our theoretical analysis establishes distribution-free, finite-sample coverage over a practitioner-chosen class of IV shifts.

1 Introduction

Instrumental variables (IVs) are widely used for causal and policy evaluation when regressors are endogenous. We consider structural prediction problems characterized by moment conditions conditional on IVs, for which a representative example is nonparametric IV regression (NPIV; Newey & Powell, 2003; Ai & Chen, 2003; Darolles et al., 2011). A large literature studies how to estimate h_0 under ill-posedness, using series, minimum-distance, reproducing kernel Hilbert space (RKHS), neural network, and minimax procedures (Newey & Powell, 2003; Hartford et al., 2017; Singh et al., 2019; Dikkala et al., 2020). Here, we briefly introduce the NPIV setup; for details, see Section 2. In NPIV, we typically consider the following data-generating process:

    Y = h_0(X) + ε,   E[ε | Z] = 0,   (1)

where X is an endogenous variable, ε is the error term, Z is an IV, and h_0(·) is called the structural function. While the structural function h_0(·) relates the endogenous variable X to the outcome Y, it cannot be estimated by regressing Y on X because of endogeneity, that is, the correlation between X and ε.
∗Email: mkato-csecon@g.ecc.u-tokyo.ac.jp

In this paper, we study prediction intervals for future outcomes in NPIV. Existing inference methods in NPIV typically target the structural function h_0 or its functionals through asymptotic approximations to regularized estimators. Our goal is different. We aim to construct prediction intervals for Y itself, with validity formulated relative to the IV. For example, consider predicting future demand for a good. In conventional approaches, one first estimates the demand curve h_0 and then predicts demand Y; in that case, one constructs confidence or prediction intervals for h_0. In this study, by contrast, we target Y itself, that is, we construct prediction intervals for Y, not h_0. Our prediction intervals for Y are justified by the IV Z.

Conformal prediction provides a natural starting point for this purpose, since it yields distribution-free finite-sample coverage under exchangeability (Vovk et al., 2005). While conformal prediction is a convenient approach in many tasks, it is not immediately applicable to NPIV, because exact conditional coverage is impossible (Lei & Wasserman, 2013; Foygel Barber et al., 2020). That is, exact conditional coverage given the IV is impossible in a fully distribution-free, finite-sample sense. To address this issue, we build on the results of Tibshirani et al. (2019) and Gibbs et al. (2025), where the former considers conformal prediction under covariate shift and the latter considers conditional guarantees in conformal prediction.

A basic modeling question then arises: in IV problems, what should the final interval depend on? The structural regression itself is a function of X alone, which suggests an X-indexed prediction interval C(X) as the most natural predictive object. At the same time, one may also allow the radius to vary jointly with (X, Z) or only with Z.
These three possibilities have different interpretations and different algorithmic properties. We study conformal procedures for all three cases. Within that common target, the three choices of radius lead to different methods. When the radius depends jointly on (X, Z), the finite-dimensional conditional conformal machinery of Gibbs et al. (2025) applies directly with the covariate (X, Z), yielding exact finite-sample coverage over joint shifts in (X, Z). When the radius depends only on Z, the same machinery specializes to an exact IV-specific procedure, which we call IV-conditional conformal prediction, with coverage over IV shifts. When the radius depends only on X, exact finite-sample family-wise calibration is not currently available. To handle that case, we use the importance-weighting logic of Kato et al. (2022) to convert the conditional coverage moment into weighted unconditional moments, and then combine that idea with weighted conformal recalibration under a fixed target shift in the spirit of Tibshirani et al. (2019).

1.1 Contribution

We propose methods for prediction intervals in NPIV with distribution-free finite-sample coverage guarantees. In particular, we refer to the IV-specific conformal construction as IV-CCP. IV-CCP is based on conformal prediction, in particular on conditional guarantees through conditional calibration. The methodological contribution is to transform the NPIV problem with conditional moment restrictions into one under marginal moment restrictions so that we can apply conformal prediction. We recast exact conditional coverage as a moment condition against all measurable reweightings of the IV and relax that impossible target to robust coverage over a user-specified class of IV shifts F. This approach is closely related to covariate shift (Shimodaira, 2000). In implementation, we use an estimate of h_0(x) to tighten the prediction intervals.
As a basic construction, we first estimate the structural function h_0(x) and then construct prediction intervals by adding a radius to the estimator ĥ. We study three types of radii: the first depends on both X and Z, the second depends only on Z, and the third depends only on X. For each approach, we present a concrete implementation and theoretical guarantees.

1.2 Related Work

Our work lies at the intersection of NPIV estimation and conformal prediction. Estimation of causal parameters under conditional moment restrictions is a core problem in causal inference, and NPIV is a representative example. In this problem, the technical difficulties come from identification and stable estimation, including ill-posedness and regularization for h_0 (Newey & Powell, 2003; Ai & Chen, 2003; Darolles et al., 2011). In the machine learning literature, several studies use flexible models for the structural function based on neural networks and RKHSs, or employ a minimax approach to estimation (Hartford et al., 2017; Singh et al., 2019; Dikkala et al., 2020). Our method is agnostic to the particular NPIV models and estimation methods and therefore acts as a predictive wrapper around these estimators rather than a replacement for them.

On the conformal side, the starting point is distribution-free finite-sample validity under exchangeability (Vovk et al., 2005). Exact distribution-free conditional coverage is impossible in general (Lei & Wasserman, 2013; Foygel Barber et al., 2020), so recent work has studied structured relaxations, including weighted conformal prediction under a fixed target shift (Tibshirani et al., 2019), localized conformal prediction (Guan, 2022), distributional conformal prediction (Chernozhukov et al., 2021), and conditional guarantees over a family of shifts (Gibbs et al., 2025).
The present paper specializes these ideas to IV settings and combines the conformal perspective with the importance-weighted conditional-moment formulation of Kato et al. (2022). Appendix A provides a fuller discussion.

2 Setup

In this section, we formulate our problem.

2.1 Variables and Observations

Variables. Let Y ∈ 𝒴 ⊆ R be a scalar outcome, X ∈ 𝒳 ⊆ R^{k_X} be a k_X-dimensional endogenous variable, and Z ∈ 𝒵 ⊆ R^{k_Z} be k_Z-dimensional IVs, where 𝒴, 𝒳, and 𝒵 denote the outcome, covariate, and IV spaces, respectively. Assume that (Y, X, Z) jointly follows a distribution P.

Observations. Let n ∈ N be the sample size. Assume that we observe a dataset {(Y_i, X_i, Z_i)}_{i=1}^n, where each (Y_i, X_i, Z_i) is an i.i.d. copy of (Y, X, Z) following the distribution P.

2.2 Prediction under Endogeneity

This study considers prediction of Y under endogeneity of X. We clarify the meaning of endogeneity below.

Recap of identification through conditional moment restrictions. The conventional approach defines the structural function through identification under conditional moment restrictions (Ai & Chen, 2003; Darolles et al., 2011). In this approach, we define the causal relationship between Y and X as Y = h_0(X) + ε, where h_0 : 𝒳 → 𝒴 is the structural function and ε is a sub-Gaussian error term with mean zero. To estimate h_0, suppose that the IV Z satisfies the following conditional moment restriction:

    E[ε_i | Z_i] = 0,   ∀ i ∈ {1, 2, ..., n}.   (2)

Then we assume that h_0 is uniquely identified under (2). The goal of the conventional setup is to estimate h_0 from the conditional moment restriction in (2). If Z = X, this problem reduces to estimation of the regression function E[Y | X]. However, when E[ε | X] ≠ 0, E[Y | X] is not equal to h_0(X), so standard regression methods such as least squares may fail to recover the structural function.
In such cases, we use NPIV methods to estimate h_0 using the IVs Z.

Reformulation by conditional prediction given IVs. The conventional approach relies on identification of the structural function itself. In this study, we do not focus on identification. Instead, we consider prediction of Y using X, with validity indexed by the IV Z. In this approach, we define the target as a prediction interval Ĉ : 𝒳 × 𝒵 → {intervals in R} which satisfies

    P( Y_{n+1} ∈ Ĉ(X_{n+1}, Z_{n+1}) | Z_{n+1} = z ) = 1 − α,   ∀ z ∈ 𝒵.   (3)

We first define a prediction interval Ĉ that depends on both X and Z. We then introduce prediction intervals that depend only on Z or only on X. Our main focus is the prediction interval that depends only on X. Note that in our proposed method, we also use an estimate of h_0 to tighten the prediction intervals, but this is not necessary.

2.3 Ideal (but Infeasible) Goal: Prediction Interval with Conditional Coverage Guarantee

In summary, our ideal goal is to predict Y given X while accounting for endogeneity and to obtain a prediction interval Ĉ : 𝒳 × 𝒵 → {intervals in R} satisfying the conditional coverage guarantee (3). For the construction of Ĉ(X_{n+1}, Z_{n+1}), we allow the use of an estimator ĥ : 𝒳 → 𝒴 of the structural function h_0. This reflects the common application setting in which X is the argument of the prediction rule and Z is used to calibrate validity, not necessarily to serve as a direct input to the point predictor. In this study, we aim to guarantee distribution-free, finite-sample coverage for the prediction interval. However, distribution-free procedures cannot satisfy (3) with nontrivial finite-length intervals in general (Lei & Wasserman, 2013; Foygel Barber et al., 2020). We therefore pursue a relaxation in the next section.

Notation. Let P denote the probability law induced by the distribution P, and let E denote expectation under P.
We refer to the environment in which we predict Y_{n+1} as the test environment, and to the corresponding probability law as the test law. Let len(C) be the length of a prediction interval C ⊆ R.

3 Target of Interest and IV Shift Relaxation

This section defines the target prediction intervals. We begin with prediction intervals with exact conditional coverage, rewrite this target in moment form, and then relax it to coverage over a class of IV shifts. The resulting target applies to all radius classes considered in the paper.

3.1 Exact Conditional Coverage

For a generic interval Ĉ_τ, exact conditional coverage in the IV setting takes the form

    P( Y_{n+1} ∈ Ĉ_τ(X_{n+1}, Z_{n+1}) | Z_{n+1} = z ) = 1 − α,   ∀ z ∈ 𝒵.   (4)

The next proposition rewrites (4) in moment form.

Proposition 3.1 (Moment characterization of exact conditional coverage). Equation (4) holds if and only if

    E[ f(Z_{n+1}) ( 1[ Y_{n+1} ∈ Ĉ_τ(X_{n+1}, Z_{n+1}) ] − (1 − α) ) ] = 0,   ∀ f ∈ M,   (5)

where M denotes the set of all bounded measurable functions f : 𝒵 → R.

Proposition 3.1 shows that exact conditional coverage is equivalent to marginal coverage weighted by every bounded measurable function of the IV. This equivalence also clarifies the impossibility result. Exact conditional coverage asks for validity after every possible reweighting of Z, which is too demanding in a distribution-free finite-sample setting. Lei & Wasserman (2013) show that conditional guarantees of this type cannot generally be achieved by nontrivial finite-length prediction sets, and Foygel Barber et al. (2020) sharpen this message by showing that even approximate conditional coverage requires intervals that are essentially as conservative as those built for much larger subsets. The conditioning variable is the IV rather than a generic predictor, but the impossibility itself is unchanged.
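The moment characterization in Proposition 3.1 can be checked numerically. In the following sketch, the variance function sigma(z) = 0.5 + z, the test function f(z) = z, and the sample sizes are illustrative choices, not part of the paper: a radius with exact conditional coverage drives the weighted moment to (approximately) zero, while a constant radius with only marginal coverage leaves a detectable violation.

```python
import random
from statistics import NormalDist

# Hypothetical toy model: Z ~ Uniform(0, 1) and, given Z, the score
# S = Y - h0(X) is N(0, sigma(Z)^2) with sigma(z) = 0.5 + z.
# The oracle radius sigma(z) * q has exact 90% conditional coverage, so by
# Proposition 3.1 the weighted moment E[f(Z)(1[cover] - (1 - alpha))] should
# vanish for any bounded f; a constant radius should violate it for some f.
random.seed(0)
alpha = 0.1
q = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided 90% normal quantile
n = 200_000
zs = [random.random() for _ in range(n)]
scores = [(0.5 + z) * random.gauss(0.0, 1.0) for z in zs]

# Constant radius tuned to 90% *marginal* coverage (no conditional validity).
t_const = sorted(abs(s) for s in scores)[int(0.9 * n)]

def weighted_moment(f, radius):
    """Monte Carlo estimate of E[f(Z)(1[|S| <= radius(Z)] - (1 - alpha))]."""
    return sum(f(z) * ((abs(s) <= radius(z)) - (1 - alpha))
               for z, s in zip(zs, scores)) / n

f = lambda z: z                            # one bounded test function of the IV
m_oracle = weighted_moment(f, lambda z: (0.5 + z) * q)
m_const = weighted_moment(f, lambda z: t_const)
print(m_oracle, m_const)                   # m_oracle near 0, m_const clearly negative
```

The constant radius over-covers where sigma(z) is small and under-covers where it is large, so weighting by f(z) = z exposes the violation even though marginal coverage is exactly 90%.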
3.2 Relaxation via IV Shifts

We therefore replace the full measurable class M in (5) with a user-specified class F ⊂ M. This yields the relaxed requirement

    E[ f(Z_{n+1}) ( 1[ Y_{n+1} ∈ Ĉ_τ(X_{n+1}, Z_{n+1}) ] − (1 − α) ) ] ≥ 0,   ∀ f ∈ F.   (6)

Following Gibbs et al. (2025), (6) can be interpreted as robust marginal coverage over a family of IV shifts. For any nonnegative f ∈ F with E[f(Z)] > 0, define a tilted distribution by

    dP_f(y, x, z) / dP(y, x, z) = f(z) / E[f(Z)].   (7)

Then (6) implies

    P_f( Y_{n+1} ∈ Ĉ_τ(X_{n+1}, Z_{n+1}) ) ≥ 1 − α

for every such f. In applications with IVs, this interpretation is natural: policies often change the marginal distribution of the IV while leaving the structural relationship (1) intact. The one-sided inequality in (6) is enough for lower coverage guarantees, which is the finite-sample target throughout the paper.

3.3 Prediction Intervals and Radius Classes

In constructing prediction intervals Ĉ_τ(x, z), we allow the use of an estimator ĥ of the structural function h_0. Let ĥ : 𝒳 → 𝒴 denote a predictor of h_0 obtained from some NPIV method. Then we define the general family of intervals as

    Ĉ_τ(x, z) = [ ĥ(x) − τ̂(x, z), ĥ(x) + τ̂(x, z) ],   (8)

where τ̂ : 𝒳 × 𝒵 → R_+ is a nonnegative radius function. Based on their dependence on the variables, we distinguish the following three radius classes:

    T_XZ := { τ : 𝒳 × 𝒵 → R_+ measurable },
    T_Z := { τ : ∃ τ_Z : 𝒵 → R_+ such that τ(x, z) = τ_Z(z) },
    T_X := { τ : ∃ τ_X : 𝒳 → R_+ such that τ(x, z) = τ_X(x) }.

The broad class T_XZ lets the radius vary jointly with (X, Z) and serves as the common parent class for the later specializations. The IV-indexed class T_Z keeps the center equal to ĥ(X) while allowing the radius to adapt to the IV.
The X-indexed class T_X yields a prediction interval of the form

    Ĉ_X(x) = [ ĥ(x) − τ̂_X(x), ĥ(x) + τ̂_X(x) ].

The common special case T_X ∩ T_Z is the constant-radius split conformal rule. The distinction among the three classes is substantive. The class T_XZ is the most expressive benchmark. The class T_Z is IV-specific, because the IV indexes predictive uncertainty while the center remains a function of X alone. The class T_X yields prediction intervals C(X) that depend only on X, which is the most natural class from the NPIV motivation. This paper studies all three classes, with the main methodological focus on T_X.

3.4 A Radius-Class Oracle Problem

The relaxation (6) is common across the three radius classes, but their feasible sets differ. Let T ⊂ T_XZ be any radius class, and define the oracle problem

    inf_{τ ∈ T} E[τ(X, Z)]   (9)
    s.t. E[ f(Z) ( 1[ |Y − h_0(X)| ≤ τ(X, Z) ] − (1 − α) ) ] ≥ 0,   ∀ f ∈ F.   (10)

Let L(T, F) denote the infimum in (9) under the constraint (10), that is,

    L(T, F) = inf { E[τ(X, Z)] : τ ∈ T satisfies (10) }.

Here L(T, F) is the oracle expected half-length within the radius class T under the shift-robust target generated by F.

Proposition 3.2 (Radius-class nesting). Suppose T_1 ⊂ T_2 ⊂ T_XZ and that the feasible set defined by (9) and (10) is nonempty. Then we have

    L(T_2, F) ≤ L(T_1, F).

In particular, we have

    L(T_XZ, F) ≤ L(T_Z, F),   L(T_XZ, F) ≤ L(T_X, F).

Proposition 3.2 clarifies the trade-off. The class T_XZ is the broadest feasible class and therefore the easiest one in which to reduce oracle length. The classes T_Z and T_X are more restrictive, but they have sharper substantive interpretations. The class T_Z treats the IV as an index of residual uncertainty. The class T_X yields prediction intervals C(X) depending only on X, which is the most natural class from the NPIV motivation.
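The tilted-law interpretation in (7) can also be verified by simulation. In the following sketch, the tilt f(z) = z and the score model are illustrative assumptions: coverage under P_f computed by importance weighting of samples from P agrees with coverage computed by sampling directly from the tilted IV law.

```python
import random

# Numerical check of (7) in a hypothetical toy model: with Z ~ Uniform(0, 1)
# and tilt f(z) = z, the tilted law P_f has IV density f(z)/E[f(Z)] = 2z.
# Coverage under P_f via importance weighting of samples from P should match
# coverage under direct sampling from the tilted law (inverse CDF: sqrt(U)).
random.seed(1)
n = 200_000

def covered(z):
    # 1 if the score |Y - h(X)| lands inside a fixed radius of 1.5;
    # score is N(0, (0.5 + z)^2), so coverage varies with z.
    return abs((0.5 + z) * random.gauss(0.0, 1.0)) <= 1.5

# Importance weighting under P: E[f(Z) 1(cover)] / E[f(Z)]
zs = [random.random() for _ in range(n)]
cov_weighted = sum(z * covered(z) for z in zs) / sum(zs)

# Direct sampling from the tilted IV law with density 2z on (0, 1)
zs_f = [random.random() ** 0.5 for _ in range(n)]
cov_tilted = sum(covered(z) for z in zs_f) / n
print(cov_weighted, cov_tilted)            # the two estimates agree
```

Both estimates target P_f(cover); the tilt puts more mass on large z, where the score is noisier, so the tilted coverage is lower than the unweighted marginal coverage would be.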
3.5 Connection to Minimax and Importance Weighting

The constrained problem defined by (9) and (10) has the same minimax flavor as conditional-moment formulations in NPIV. In the continuum-of-moments and linear-inverse viewpoints of Carrasco & Florens (2000) and Carrasco et al. (2007), conditional moments are converted into a family of unconditional moments indexed by a function class. More recent adversarial procedures, such as DeepGMM and the minimax-optimal estimators of Bennett et al. (2019) and Dikkala et al. (2020), estimate h_0 by searching for moment violations over a critic class. Our use of the class F is different. We are not trying to identify h_0 by finding moments that are most informative about the structural equation. Instead, h_0 or its estimator ĥ is treated as fixed, and the radius τ is chosen so that the coverage residual

    1[ Y ∈ Ĉ_τ(X, Z) ] − (1 − α)

has nonnegative moments against every f ∈ F. Thus, F is not a critic class for structural estimation.

This viewpoint also connects naturally to importance weighting. Equation (7) rewrites the relaxed coverage target as a family of ordinary marginal coverage statements under weighted laws. This is the same algebraic mechanism that underlies covariate-shift correction in Shimodaira (2000) and importance-weighting formulations for NPIV such as Kato et al. (2022). In our setting, however, the weights define the probability laws over which coverage is required.

4 IV-Conditional Conformal Prediction

In this section, we propose IV-conditional conformal prediction (IV-CCP). The details differ across the three target radius classes, T_XZ, T_Z, and T_X.

4.1 Split Conformal Prediction for NPIV

In all cases, we construct prediction intervals using a split-type implementation of conformal prediction. We first split the sample into a training set I_tr and a calibration set I_cal with |I_cal| = m.
We then fit any NPIV estimator ĥ on I_tr. On the calibration set, we compute the following nonconformity scores:

    S_i = Y_i − ĥ(X_i),   i ∈ I_cal.   (11)

Conditional on the training split, ĥ is fixed, so the calibration scores and the test score are exchangeable.

4.2 IV-CCP with (X, Z)-Indexed Radius

The first method studies the broad benchmark class T_XZ. This class is not the main target of the paper, but it is mathematically convenient because the same variable (X, Z) indexes both the radius and the shift family.

Class of the joint shift. Let W = (X, Z) and let φ_W : 𝒳 × 𝒵 → R^d be a feature map. Define

    G_W = { w ↦ β^⊤ φ_W(w) : β ∈ R^d },   F_W = G_W ∩ { f ≥ 0 }.

Throughout this section, we take the first component of φ_W to be the constant 1, so G_W contains the constant functions. The class G_W determines how the radius may vary with (X, Z), and F_W determines the family of joint distribution shifts.

Calibration. Fix a test pair w = (x, z) and a candidate radius s ≥ 0. Denote the pinball loss by ρ_α(u) = α max{u, 0} + (1 − α) max{−u, 0}. Using the calibration pairs (W_i, S_i) and the augmented point (w, s), consider

    ĝ_s ∈ arg min_{g ∈ G_W} (1/(m+1)) Σ_{i ∈ I_cal} ρ_α(S_i − g(W_i)) + (1/(m+1)) ρ_α(s − g(w)).

This is the finite-dimensional conditional conformal machinery of Gibbs et al. (2025) with covariate W = (X, Z). Form the LP dual corresponding to this augmented quantile regression problem, and let η_{m+1}(s) denote the dual coordinate associated with the augmented point. As in Gibbs et al. (2025), η_{m+1}(s) is nondecreasing in s, and s belongs to the conformal upper set if and only if η_{m+1}(s) < 1 − α. We therefore define τ̂_XZ(x, z) as the largest accepted value of s and output

    Ĉ_XZ(x, z) = [ ĥ(x) − τ̂_XZ(x, z), ĥ(x) + τ̂_XZ(x, z) ].   (12)

For any nonnegative f ∈ F_W with E[f(W)] > 0, define the joint tilted law by

    dP^W_f(y, x, z) / dP(y, x, z) = f(x, z) / E[f(W)].   (13)

The finite-sample guarantee for this class is stated in Theorem 6.1.

4.3 IV-CCP with Z-Indexed Radius

The second method studies the IV-indexed class T_Z. This is the exact IV-specific conformal class, and it is the first main focus of the paper. The interval center remains ĥ(X), while the radius depends on Z alone.

Class of the IV shift. Let φ : 𝒵 → R^d be a feature map, and define

    G = { z′ ↦ β^⊤ φ(z′) : β ∈ R^d },   F = G ∩ { f ≥ 0 }.

Throughout the sequel, we take the first component of φ to be the constant 1, so G contains the constant functions.

Calibration. Fix a test IV z and a candidate radius s ≥ 0. Using the same scores (11), consider

    ĝ_s ∈ arg min_{g ∈ G} (1/(m+1)) Σ_{i ∈ I_cal} ρ_α(S_i − g(Z_i)) + (1/(m+1)) ρ_α(s − g(z)).

This is the augmented quantile-regression problem of Gibbs et al. (2025) with covariate Z. Let η_{m+1}(s) denote the dual variable associated with the augmented point. As before, η_{m+1}(s) is nondecreasing in s, and the largest accepted value of s defines the radius τ̂_Z(z). The resulting IV-indexed interval is

    Ĉ_Z(x, z) = [ ĥ(x) − τ̂_Z(z), ĥ(x) + τ̂_Z(z) ].   (14)

Interpretation. The class T_Z is less flexible than T_XZ, but it restores the IV-specific interpretation that the relevant change occurs in the IV distribution. The interval center remains a function of X alone. What changes with Z is the residual uncertainty around that center. This is compatible with the exclusion restriction in (1), because the IV does not enter the structural center directly.
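As a minimal runnable sketch of the Z-indexed idea, the following replaces the LP dual machinery with the indicator/partition feature map discussed in Section 5.2: with disjoint IV bins, the radius in each bin reduces to a per-bin split-conformal quantile of the absolute scores, and per-bin validity implies coverage under every nonnegative combination of the bin indicators. The data-generating process, the bins, and the fixed ĥ are all illustrative assumptions; this is a simplification, not the exact augmented procedure.

```python
import math
import random

# Sketch of a Z-indexed radius via an indicator/partition feature map:
# per-bin split-conformal quantiles of |S_i| = |Y_i - h_hat(X_i)|.
random.seed(2)
alpha, m, n_bins = 0.1, 4000, 5

def draw(n):
    """Hypothetical endogenous DGP: Z instruments X, U confounds X and Y."""
    data = []
    for _ in range(n):
        z = random.random()
        u = random.gauss(0.0, 1.0)                   # unobserved confounder
        x = z + 0.5 * u
        y = 2.0 * x + u + (0.5 + z) * random.gauss(0.0, 1.0)
        data.append((y, x, z))
    return data

h_hat = lambda x: 2.0 * x            # stand-in for an NPIV estimate (assumption)
bin_of = lambda z: min(int(z * n_bins), n_bins - 1)

# Calibration: conformal quantile of |S| within each IV bin.
cal = draw(m)
bins = {b: [] for b in range(n_bins)}
for y, x, z in cal:
    bins[bin_of(z)].append(abs(y - h_hat(x)))
tau_Z = {}
for b, scores in bins.items():
    scores.sort()
    k = math.ceil((len(scores) + 1) * (1 - alpha))   # conformal rank
    tau_Z[b] = scores[min(k, len(scores)) - 1]

# Evaluation: coverage of [h_hat(x) - tau_Z(z), h_hat(x) + tau_Z(z)].
test = draw(20_000)
cov = sum(abs(y - h_hat(x)) <= tau_Z[bin_of(z)] for y, x, z in test) / len(test)
print(round(cov, 3))                                 # close to 1 - alpha
```

Because the residual scale grows with z in this toy model, the fitted radius is larger in higher bins, which is exactly the adaptation to the IV that the class T_Z permits.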
Algorithm 1: IV-CCP interval at a test point (x, z)
1: Input: training data I_tr, calibration data I_cal, level α, features φ
2: Fit NPIV on I_tr to get ĥ
3: Compute calibration scores S_i = Y_i − ĥ(X_i) for i ∈ I_cal
4: For a test point z, compute τ̂_Z(z) via binary search over s ≥ 0 using the LP dual membership test
5: Return: Ĉ_Z(x, z) = [ ĥ(x) − τ̂_Z(z), ĥ(x) + τ̂_Z(z) ]

Relationship to weighted conformal prediction. Weighted conformal prediction under covariate shift (Tibshirani et al., 2019) protects against one fixed target distribution, provided the corresponding density ratio is known or can be estimated. IV-CCP is different: it protects simultaneously against every IV tilt in the class F. In other words, it is a guarantee over a family of shifts rather than a guarantee for a single shift. This distinction is central in IV applications, where the relevant uncertainty is often not one fully specified probability law in a setting of interest but a set of plausible changes in the IV distribution.

4.4 IV-CCP with X-Indexed Radius

The third method studies the class T_X, which produces prediction intervals C(X) depending only on X. This is the most natural target in NPIV, because the final prediction rule depends on the structural regressor alone. It is also the second main focus of the paper. The difficulty is that the interval is indexed by X, while the robustness target is indexed by shifts in Z.

IV-CCP for the X-indexed interval. In this case, we consider a prediction interval of the following form:

    Ĉ_X(x) = [ ĥ(x) − τ̂_X(x), ĥ(x) + τ̂_X(x) ].   (15)

Let us define the nonconformity score as S = Y − ĥ(X). For a candidate radius function τ_X : 𝒳 → R_+, define the coverage residual

    m_{τ_X}(S, X) = 1[ S ≤ τ_X(X) ] − (1 − α).   (16)

Then exact conditional coverage for Ĉ_X is equivalent to

    E[ m_{τ_X}(S, X) | Z = z ] = 0,   ∀ z.   (17)

The relaxed family-of-shifts target becomes

    E[ f(Z) m_{τ_X}(S, X) ] ≥ 0,   ∀ f ∈ F.   (18)

The challenge is to construct an X-indexed radius while the protected shift family remains a class of functions of Z.

Importance-weighted representation. The importance-weighting approach of Kato et al. (2022) provides a natural route. Suppose the conditional density ratio

    r_0(s, x | z) = p_{S,X|Z}(s, x | z) / p_{S,X}(s, x)   (19)

is well defined. Then conditional moments of (S, X) given Z can be rewritten as weighted unconditional moments.

Proposition 4.1 (Importance-weighted identity for the X-indexed class). Suppose the conditional density ratio (19) exists. Then for any measurable τ_X : 𝒳 → R_+,

    E[ m_{τ_X}(S, X) | Z = z ] = E[ m_{τ_X}(S, X) r_0(S, X | z) ],   ∀ z.   (20)

Consequently, (17) is equivalent to

    E[ m_{τ_X}(S, X) r_0(S, X | z) ] = 0,   ∀ z.   (21)

Proposition 4.1 is the key bridge from the conditional coverage target to unconditional weighted moments. It is exactly the same algebraic step that Kato et al. (2022) use for general conditional moment learning, now applied to the interval residual m_{τ_X}(S, X).

To turn this identity into a learning criterion, let R_X ⊂ { τ_X : 𝒳 → R_+ } denote a chosen model class for the radius function, for example a sieve span, a neural network class, or a shape-restricted class. Let r̂(s, x | z) be an estimated version of (19), obtained on the training split or by cross-fitting. Let Z̃_1, ..., Z̃_M be evaluation points, for example the calibration IVs themselves or a deterministic grid. Because the indicator in (16) is discontinuous, it is convenient in optimization to replace it with a smooth surrogate. Let ψ_κ : R → R be a smooth approximation to u ↦ 1[u ≥ 0] − (1 − α), indexed by a smoothing parameter κ > 0.
We then consider the criterion

    Q̂_n(τ_X) = (1/M) Σ_{j=1}^M ( (1/m) Σ_{i ∈ I_cal} ψ_κ(τ_X(X_i) − S_i) r̂(S_i, X_i | Z̃_j) )²   (22)

and the estimator

    τ̂_X ∈ arg min_{τ_X ∈ R_X} { (1/m) Σ_{i ∈ I_cal} τ_X(X_i) + λ Q̂_n(τ_X) },   (23)

where λ > 0 balances average interval length and weighted moment fit. The resulting prediction interval is (15). This procedure is not distribution-free in finite samples, because it relies on density-ratio estimation and surrogate optimization. Its role is different: it provides a principled way to construct an X-indexed radius under the same conditional-moment logic that underlies NPIV.

Single-shift weighted conformal recalibration. The importance-weighted procedure above learns the shape of the X-indexed radius. To obtain a finite-sample guarantee for a fixed target distribution shift while keeping the final interval X-indexed, one can add a final scalar conformal recalibration step in the spirit of Tibshirani et al. (2019). Let q : 𝒳 → R_+ be a positive function learned on a sample independent of the final recalibration split, for example by (23) or by any other method. Let I_rcal denote the final recalibration split. Consider intervals of the form

    Ĉ_{X,f_0}(x) = [ ĥ(x) − t̂_{f_0} q(x), ĥ(x) + t̂_{f_0} q(x) ],   (24)

where f_0 ∈ F is a fixed target shift. Form the normalized scores as

    R_i = (Y_i − ĥ(X_i)) / q(X_i),   i ∈ I_rcal.   (25)

The weights are the density ratios induced by the target shift, which are defined as

    w_{f_0}(z) = f_0(z) / E[f_0(Z)].   (26)

Assume furthermore that there is a known finite constant B_{f_0} such that w_{f_0}(z) ≤ B_{f_0} for all z in the support of Z.
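A minimal sketch of this recalibration step follows (the formal cutoff is defined next). The tilt f_0(z) = z, the shape function q, the predictor ĥ, and the data-generating process are all illustrative assumptions; absolute normalized scores are used here because the interval (24) is symmetric.

```python
import random

# Sketch of conservative single-shift weighted recalibration: a scalar cutoff
# t_f0 from normalized scores R_i, calibration weights w_f0(Z_i), and a
# worst-case test weight B_f0.
random.seed(3)
alpha, m = 0.1, 5000
E_f0 = 0.5                                 # E[f0(Z)] for f0(z) = z, Z ~ U(0, 1)
w_f0 = lambda z: z / E_f0                  # density ratio of the target tilt
B_f0 = 1.0 / E_f0                          # known bound: sup_z w_f0(z) = 2

h_hat = lambda x: 2.0 * x                  # fixed point predictor (assumption)
q = lambda x: 1.0 + abs(x)                 # learned shape function (assumption)

def draw_from(z_sampler, n):
    out = []
    for _ in range(n):
        z = z_sampler()
        u = random.gauss(0.0, 1.0)         # unobserved confounder
        x = z + 0.5 * u
        y = 2.0 * x + u + (0.5 + z) * random.gauss(0.0, 1.0)
        out.append((y, x, z))
    return out

# Recalibration split drawn from P; sort absolute normalized scores with weights.
rcal = draw_from(random.random, m)
pairs = sorted((abs(y - h_hat(x)) / q(x), w_f0(z)) for y, x, z in rcal)
total_w = sum(w for _, w in pairs)
# Smallest t with sum_i w_i 1[R_i <= t] / (sum_j w_j + B_f0) >= 1 - alpha.
acc, t_f0 = 0.0, float("inf")
for r, w in pairs:
    acc += w
    if acc / (total_w + B_f0) >= 1 - alpha:
        t_f0 = r
        break

# Coverage under the tilted law P_f0 (IV density 2z, sampled via sqrt(U)).
test = draw_from(lambda: random.random() ** 0.5, 20_000)
cov = sum(abs(y - h_hat(x)) <= t_f0 * q(x) for y, x, z in test) / len(test)
print(round(cov, 3))                       # at least about 1 - alpha
```

Giving the unseen test point the worst-case weight B_f0 is what makes the cutoff conservative: it can only increase t_f0 relative to the test-point-specific weighted cutoff.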
Define t̂_{f_0} as the conservative weighted split conformal cutoff obtained from the calibration weights w_{f_0}(Z_i) and the worst-case test weight B_{f_0}, that is,

    t̂_{f_0} = inf { t ∈ R_+ : Σ_{i ∈ I_rcal} [ w_{f_0}(Z_i) / ( Σ_{j ∈ I_rcal} w_{f_0}(Z_j) + B_{f_0} ) ] 1[R_i ≤ t] ≥ 1 − α },

with the convention inf ∅ = ∞. By construction, the cutoff is scalar and the final interval remains a function of X only. Because it upper bounds the test-point-specific weighted split conformal cutoff of Tibshirani et al. (2019), it yields the following finite-sample guarantee under the single target law P_{f_0}:

    P_{f_0}( Y_{n+1} ∈ Ĉ_{X,f_0}(X_{n+1}) ) ≥ 1 − α,

where the detailed statement is given in Theorem 6.6. This coverage guarantee is intentionally narrower than the guarantees for T_XZ and T_Z: it holds for one specified distribution shift rather than uniformly over a whole class. This reflects the current methodological gap. Family-wise exact finite-sample coverage for the mixed-index target C(X) is not presently available, whereas a conservative single-shift weighted conformal recalibration is available.

4.5 Discussion

The three classes now fall into place. The benchmark class T_XZ is the broad exact conformal class. The class T_Z is the exact IV-specific class that delivers simultaneous finite-sample coverage over a family of IV tilts. The class T_X is the primary target of interest, but it currently requires a different strategy, namely importance-weighted conditional-moment learning and, when needed, a final single-shift weighted conformal recalibration.

5 Designing the IV Shift Class F

The choice of F is the main modeling decision in the paper. Through (7), it determines the ambiguity set of shifted distributions under which coverage is required. Accordingly, F should encode how the marginal distribution of the IV may change, not how Y depends on Z, and not how ĥ is estimated.
5.1 From Feature Maps to Ambiguity Sets

For the exact classes T_XZ and T_Z, the shift class is induced by a finite-dimensional feature map. For T_Z, with feature map φ : 𝒵 → R^d, define

    G_φ = { z ↦ β^⊤ φ(z) : β ∈ R^d },   F_φ = G_φ ∩ { f ≥ 0 }.

The corresponding ambiguity set is

    Q(F_φ) = { P_f : dP_f(y, x, z)/dP(y, x, z) = f(z)/E[f(Z)], f ∈ F_φ, E[f(Z)] > 0 }.

If F_1 ⊂ F_2, then Q(F_1) ⊂ Q(F_2). Enlarging F therefore protects against a broader family of shifted distributions. If the constant function belongs to F, then the training distribution itself belongs to the ambiguity set; in practice, this is enforced by taking the first component of φ to be 1.

The gain in robustness is not free. In the exact IV-indexed class, Theorem 6.4 shows that the calibration contribution to interval length scales with d/(m + 1). Thus, the design problem is to choose the smallest class F whose ambiguity set still contains the distribution shifts that matter scientifically or empirically.

5.2 Finite-Dimensional Templates for Exact Conformal Classes

Indicator and partition classes. If Z is discrete, or if the user is willing to discretize a continuous IV, the simplest construction is an indicator basis. For a partition Π = {𝒵_1, ..., 𝒵_G} of the IV space, take

    φ(z) = (1, 1[z ∈ 𝒵_1], ..., 1[z ∈ 𝒵_G]).

Then the induced tilts are constant within cells, and the fitted radius is piecewise constant. This is the natural construction for binary encouragement designs, judge identifiers, hospital identifiers, or other categorical IVs. The same idea extends to overlapping groups. If G_1, ..., G_L ⊂ 𝒵 are policy-relevant subsets, one may take

    φ(z) = (1, 1[z ∈ G_1], ..., 1[z ∈ G_L]).

The resulting guarantee holds for every group indicator and every nonnegative linear combination of them.
When the groups overlap, the fitted radius is constant on the atoms generated by the overlap pattern.

Multiscale discretization. For a continuous scalar IV, one may include indicators for a coarse partition together with indicators for a finer nested partition. This allows the ambiguity set to contain both global reweightings and more local perturbations of the IV law. When $\mathcal{Z}$ is irregular or mixed discrete-continuous, tree-based or forest-based partitions play the same role. A tree learned on $\mathcal{I}_{\mathrm{tr}}$ induces leaves $L_1, \ldots, L_G$, and one can use leaf indicators $\mathbf{1}[z \in L_g]$ as features. This yields an adaptive discretization of the IV space.

Series and sieve classes. When $Z$ is continuous and low-dimensional, a natural alternative is a smooth series basis defined as
\[
\phi(z) = \big( 1, \psi_1(z), \ldots, \psi_d(z) \big),
\]
where $\psi_1, \ldots, \psi_d$ may be spline basis functions, orthogonal polynomials, wavelets, or trigonometric terms. The resulting class protects against smooth density-ratio perturbations that can be represented, or well approximated, by the sieve span. This is the same approximation logic that underlies sieve NPIV estimation (Chen & Pouzo, 2012, 2015). For a vector IV $Z = (Z_1, \ldots, Z_{k_Z})$, one may use additive or tensor-product sieves. An additive construction takes the form
\[
\phi(z) = \big( 1, \psi_{1,1}(z_1), \ldots, \psi_{1,d_1}(z_1), \ldots, \psi_{k_Z,1}(z_{k_Z}), \ldots, \psi_{k_Z,d_{k_Z}}(z_{k_Z}) \big),
\]
while a tensor-product construction adds selected interaction terms of the form
\[
\phi_{j_1,\ldots,j_q}(z) = \prod_{\ell=1}^{q} \psi_{\ell, j_\ell}(z_\ell).
\]
The additive version is often preferable when $k_Z$ is moderate, because it keeps $d$ small. Tensor products are appropriate only when domain knowledge points to specific interactions that may change in the setting of interest.

Remark. Not every useful class requires a large basis. In many IV applications, a low-dimensional semiparametric summary is enough.
Examples include
\[
\phi(z) = (1, z), \qquad \phi(z) = (1, z, z^2), \qquad \phi(z) = (1, \text{judge indicators}, \text{judge-by-time interactions}).
\]
These classes are especially attractive when policy discussions already single out the relevant directions of change, for example a shift in encouragement probability, a drift in judge shares, or a smooth increase in the mass of distant households.

5.3 Richer Classes through Finite-Dimensional Approximation

RKHS and kernel dictionaries. A convenient idealization for smooth, local, and nonlinear shifts is an RKHS ball. Let $K : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a positive-definite kernel with RKHS $\mathcal{H}_K$. A natural infinite-dimensional candidate is
\[
\mathcal{F}^{\infty}_{K,B} = \big\{ f \in \mathcal{H}_K : \| f \|_{\mathcal{H}_K} \le B,\ f \ge 0 \big\}.
\]
The exact finite-sample result in this paper is developed for finite-dimensional classes, so the practical route is to approximate $\mathcal{H}_K$ by a finite dictionary,
\[
\phi_r(z) = \big( 1, K(z, u_1), \ldots, K(z, u_r) \big),
\]
where $u_1, \ldots, u_r \in \mathcal{Z}$ are landmark points. The induced class $\mathcal{F}_{\phi_r}$ is then a finite-dimensional proxy for the ideal RKHS family, and the exact theorem applies to this proxy. The kernel determines the geometry of the protected shifts. Gaussian kernels generate local tilts centered around the landmarks. Polynomial kernels generate global low-order smooth reweightings. If one expects the shift to be smooth after an appropriate transformation of the IV, one may replace $z$ by a representation $g(z)$ and use features of the form $K(g(z), g(u_j))$. Low-rank approximations such as Nyström dictionaries and random features provide practical ways to keep $d$ moderate (Rahimi & Recht, 2007; Musco & Musco, 2017).

Sparse linear and representation-based classes. When the IV is high-dimensional, a direct sieve or kernel expansion in the raw coordinates can make $d$ too large relative to the calibration size.
A natural compromise is to construct a low-dimensional summary map $\psi : \mathcal{Z} \to \mathbb{R}^q$ on the training split and then set
\[
\phi(z) = \big( 1, \psi_1(z), \ldots, \psi_q(z) \big).
\]
The summary may consist of principal components, selected coordinates, geographic summaries, judge-share summaries, or a learned representation of a text, image, or network-valued IV. The point is not that the representation must be rich enough to estimate $h_0$. It only needs to span the directions along which the IV distribution is likely to move in the setting of interest.

Shape-restricted classes. For scalar IVs, or for low-dimensional summaries of vector IVs, it can be natural to focus on monotone, convex, or bounded-variation shifts. Examples include interventions that only increase encouragement among larger values of $Z$, or geographic changes that monotonically reweight distance. In the exact conformal setting, the cleanest implementation is to approximate a shape-restricted family by a finite grid and a basis of piecewise-constant or piecewise-linear functions. One may then use the induced low-dimensional span as a proxy for the desired cone. A stronger approach would impose linear inequality constraints on the coefficients. Such constrained calibration remains computationally natural, because it leads to linear or convex programs, but it goes beyond the unconstrained span class analyzed here.
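The landmark dictionary $\phi_r$ of Section 5.3 can be sketched as follows for a Gaussian kernel. The function name and defaults are ours; the default $\gamma = 0.2$ matches the value used in the simulations of Section 7.2, and the landmarks would in practice be chosen from the training split.

```python
import numpy as np

def kernel_dictionary(z, landmarks, gamma=0.2):
    """Finite-dimensional RKHS proxy phi_r(z) = (1, K(z, u_1), ..., K(z, u_r))
    with Gaussian kernel K(z, u) = exp(-gamma * ||z - u||^2).
    """
    z = np.asarray(z, dtype=float)
    u = np.asarray(landmarks, dtype=float)
    if z.ndim == 1:                  # scalar IVs: treat entries as 1-d points
        z = z[:, None]
    if u.ndim == 1:
        u = u[:, None]
    sq = ((z[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)  # pairwise ||z - u||^2
    return np.hstack([np.ones((z.shape[0], 1)), np.exp(-gamma * sq)])
```

The leading constant column again keeps the training law inside the ambiguity set, mirroring the indicator and series constructions of Section 5.2.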
In practice, one may choose the radius class on $X$ from the same broad families used in nonparametric regression, for example bins in $X$, additive sieves, tensor-product bases, shape-restricted classes, RKHS models, or neural-network classes. The role of that class is completely different from the role of $\mathcal{F}$. The radius class controls how much heterogeneity the final prediction interval may have across values of $X$. The shift class controls which settings of interest must be protected. In the exact classes $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$, these two roles are partially tied together by the conditional conformal algorithm. In the $X$-indexed method, they are separate.

6 Theory

This section collects the main theoretical statements. Proofs appear in Appendix C.

6.1 Exact Finite-Sample Coverage for $(X,Z)$-Indexed Radius

We first show that the $(X,Z)$-indexed radius delivers the intended coverage.

Theorem 6.1 (Shift-robust coverage over joint tilts). Assume that the calibration sample and the test point are exchangeable conditional on the training split. Let $\hat{C}_{XZ}(x,z)$ be produced by the augmented quantile-regression calibration of Section 4 using feature map $\phi_W$ and the associated nonnegative linear shift class $\mathcal{F}_W$. Then for any measurable $f \in \mathcal{F}_W$ with $f \ge 0$ and $\mathbb{E}[f(W)] > 0$, the interval satisfies
\[
P^W_f\big( Y_{n+1} \in \hat{C}_{XZ}(X_{n+1}, Z_{n+1}) \big) \ \ge\ 1-\alpha,
\]
where $P^W_f$ is the tilted distribution in (13).

6.2 Exact Finite-Sample Coverage for $Z$-Indexed Radius

We next show that the $Z$-indexed radius delivers the intended coverage.

Theorem 6.2 (Shift-robust coverage over IV tilts). Assume that the calibration sample and the test point are exchangeable conditional on the training split. Let $\hat{C}_Z(x,z)$ be produced by Algorithm 1 using feature map $\phi$ and the associated nonnegative linear shift class $\mathcal{F}$.
Then for any measurable $f \in \mathcal{F}$ with $f \ge 0$ and $\mathbb{E}[f(Z)] > 0$, the interval satisfies
\[
P_f\big( Y_{n+1} \in \hat{C}_Z(X_{n+1}, Z_{n+1}) \big) \ \ge\ 1-\alpha,
\]
where $P_f$ is the tilted distribution in (7).

Remark. Theorem 6.2 is distribution-free, because it does not require (1) to hold. The class $\mathcal{T}_X$ remains the main target of interest, but the exact finite-sample guarantee currently available under IV shifts is for the IV-indexed class $\mathcal{T}_Z$.

Oracle comparison for the IV-indexed class. We now relate interval length in the exact IV-indexed class to NPIV estimation error and calibration complexity. Let the oracle structural function be $h_0$ and define oracle scores $S^\star_i = | Y_i - h_0(X_i) |$. Let $\hat{\tau}_Z$ be computed from $\{S_i\}$ and let $\hat{\tau}^\star_Z$ be computed from $\{S^\star_i\}$ using the same IV-indexed calibration algorithm and features.

Proposition 6.3 (Stability to NPIV estimation error). For any realization of the data and any $z \in \mathcal{Z}$, we have
\[
| \hat{\tau}_Z(z) - \hat{\tau}^\star_Z(z) | \ \le\ \big\| \hat{h} - h_0 \big\|_\infty .
\]
Consequently, we have
\[
\big| \mathrm{len}(\hat{C}_Z(x,z)) - \mathrm{len}(\hat{C}^\star_Z(x,z)) \big| \ \le\ 2 \big\| \hat{h} - h_0 \big\|_\infty ,
\]
where $\hat{C}^\star_Z(x,z) = \big[ \hat{h}(x) - \hat{\tau}^\star_Z(z),\ \hat{h}(x) + \hat{\tau}^\star_Z(z) \big]$.

Next, compare this with the oracle IV-indexed radius
\[
\tau_0(z) = \inf\{ t : P(|\varepsilon| \le t \mid Z = z) \ge 1-\alpha \}.
\]
We require two regularity conditions.

Assumption 6.1 (Local density lower bound). For all $z \in \mathcal{Z}$, the conditional distribution of $|\varepsilon|$ given $Z = z$ has a Lebesgue density $p_{|\varepsilon| \mid z}(\cdot)$ that is bounded below by $p_{\min} > 0$ on a neighborhood of $\tau_0(z)$.

Assumption 6.2 (Oracle quantile-transfer regularity). Assume that the oracle radius function $z' \mapsto \tau_0(z')$ belongs to $\mathcal{G} = \{ z' \mapsto \beta^\top \phi(z') : \beta \in \mathbb{R}^d \}$. For each fixed $z \in \mathcal{Z}$, let $\hat{g}_z \in \mathcal{G}$ be an optimal basic solution of the augmented oracle quantile-regression problem defined in the proof of Theorem 6.4, chosen so that $\hat{g}_z(z) = \hat{\tau}^\star_Z(z)$.
Then
\[
F_z\big(\hat{\tau}^\star_Z(z)\big) - \frac{1}{m+1} \sum_{i=1}^{m+1} \mathbf{1}\big[ S^\star_i \le \hat{g}_z(Z_i) \big] = o_p(1),
\]
where $F_z(t) = P(|\varepsilon| \le t \mid Z = z)$.

Theorem 6.4 (Length inflation bound). Assume (1) and Assumptions 6.1 and 6.2. Then, for each fixed $z \in \mathcal{Z}$,
\[
| \hat{\tau}^\star_Z(z) - \tau_0(z) | \ \le\ \frac{d}{(m+1)\, p_{\min}} + o_p(1).
\]
Combining this with Proposition 6.3, we obtain
\[
| \hat{\tau}_Z(z) - \tau_0(z) | \ \le\ \big\| \hat{h} - h_0 \big\|_\infty + \frac{d}{(m+1)\, p_{\min}} + o_p(1).
\]
In particular,
\[
\mathrm{len}(\hat{C}_Z(x,z)) - \mathrm{len}(C_0(x,z)) \ \le\ 2 \big\| \hat{h} - h_0 \big\|_\infty + \frac{2d}{(m+1)\, p_{\min}} + o_p(1),
\]
where $C_0(x,z) = [ h_0(x) - \tau_0(z),\ h_0(x) + \tau_0(z) ]$.

Remark. If $\tau_0$ does not lie in $\mathcal{G}$, the same argument yields an additional approximation term $\inf_{g \in \mathcal{G}} | g(z) - \tau_0(z) |$.

6.3 Exact Finite-Sample Coverage for $X$-Indexed Radius

The next proposition records the population criterion underlying the $X$-indexed importance-weighted method.

Proposition 6.5 (Integrated weighted-moment characterization). Suppose that the conditional density ratio (19) exists. Let $\nu$ be a measure on $\mathcal{Z}$ with full support and define
\[
Q(\tau_X) = \int \Big( \mathbb{E}\big[ m_{\tau_X}(S, X)\, r_0(S, X \mid z) \big] \Big)^2 \nu(\mathrm{d}z).
\]
Then $Q(\tau_X) = 0$ if and only if (17) holds for $\nu$-almost every $z$. In particular, $Q(\tau_X) = 0$ implies exact conditional coverage for $\hat{C}_X$ whenever $\nu$ has full support and the conditional moment is continuous in $z$.

Remark. Proposition 6.5 explains why (22) is a natural empirical analogue. The criterion is not distribution-free in finite samples, because it depends on the estimated ratio $\hat{r}$ and the surrogate $\psi_\kappa$, but it targets the correct population condition.

The following theorem then provides a complementary finite-sample statement for the $X$-indexed class. It applies to one fixed target shift rather than to a whole class, and it requires an independent final recalibration split together with a known upper bound on the target weight.
This is the exact finite-sample guarantee currently available for prediction intervals $C(X)$ under a fixed target shift.

Theorem 6.6 (Single-shift weighted conformal recalibration for $C(X)$). Fix a nonnegative $f_0 \in \mathcal{F}$ with $\mathbb{E}[f_0(Z)] > 0$. Suppose $\hat{h}$ and $q$ are learned on samples independent of the final recalibration split $\mathcal{I}_{\mathrm{rcal}}$, the weights (26) are known, and there exists a known finite constant $B_{f_0}$ satisfying $\sup_z w_{f_0}(z) \le B_{f_0}$. Let $\hat{t}_{f_0}$ be defined as above from the normalized scores (25). Then the interval (24) satisfies
\[
P_{f_0}\big( Y_{n+1} \in \hat{C}_{X,f_0}(X_{n+1}) \big) \ \ge\ 1-\alpha .
\]

7 Simulation

7.1 Data-Generating Processes

We consider three synthetic NPIV designs of increasing difficulty. In all designs, the instruments are sampled from a uniform distribution on $[-1,1]^{d_Z}$, the endogenous regressors are nonlinear functions of the IVs and latent variables, and the outcome is generated as
\[
Y = h_0(X) + \text{confounding} + \sigma(X, Z)\, V,
\]
where the latent variables shared by $X$ and $Y$ induce endogeneity and $\sigma(X, Z)$ yields heteroskedastic predictive uncertainty.

Table 1: Results of the experiment using Dataset 1.

Radius       Model    Test Dist. of IVs   Cov. Mean   Cov. SD   Length Mean   Length SD
XZ-indexed   Bins     Observed            0.902       0.020     4.456         9.446
                      Linear Tilt         0.901       0.024     4.288         7.181
                      Local Tilt          0.900       0.027     4.335         6.623
                      Step Tilt           0.901       0.022     4.329         7.742
             Linear   Observed            0.908       0.021     4.080         7.179
                      Linear Tilt         0.907       0.028     4.082         6.576
                      Local Tilt          0.906       0.032     4.076         6.175
                      Step Tilt           0.907       0.026     4.065         6.564
             RKHS     Observed            0.914       0.020     4.720         9.806
                      Linear Tilt         0.913       0.023     4.588         8.097
                      Local Tilt          0.922       0.026     4.685         7.389
                      Step Tilt           0.913       0.022     4.584         8.059
Z-indexed    Bins     Observed            0.904       0.021     4.227         8.520
                      Linear Tilt         0.904       0.025     4.236         8.186
                      Local Tilt          0.902       0.029     4.201         7.677
                      Step Tilt           0.904       0.024     4.202         7.971
             Linear   Observed            0.906       0.020     4.046         7.553
                      Linear Tilt         0.907       0.026     4.114         7.563
                      Local Tilt          0.905       0.029     4.147         7.571
                      Step Tilt           0.906       0.024     4.094         7.557
             RKHS     Observed            0.913       0.019     4.446         9.054
                      Linear Tilt         0.913       0.024     4.422         8.218
                      Local Tilt          0.912       0.029     4.455         7.764
                      Step Tilt           0.913       0.022     4.402         8.189
X-indexed    Linear   Observed            0.906       0.028     4.904         15.400
                      Linear Tilt         0.917       0.032     5.415         17.546
                      Local Tilt          0.919       0.036     5.808         18.995
                      Step Tilt           0.913       0.031     5.313         17.559
             Bins     Observed            0.907       0.027     20.055        164.856
                      Linear Tilt         0.918       0.031     25.255        213.098
                      Local Tilt          0.921       0.036     27.552        233.685
                      Step Tilt           0.915       0.030     25.166        213.356
             RKHS     Observed            0.907       0.029     10.505        71.069
                      Linear Tilt         0.917       0.031     10.207        64.795
                      Local Tilt          0.919       0.036     7.377         33.785
                      Step Tilt           0.913       0.031     10.043        64.432
             MLP      Observed            0.906       0.029     inf           -
                      Linear Tilt         0.918       0.030     inf           -
                      Local Tilt          0.919       0.036     inf           -
                      Step Tilt           0.914       0.029     inf           -

Dataset 1 ($d_X = d_Z = 1$). Let $Z \sim \mathrm{Unif}[-1,1]$ and let $U, E, V$ be independent standard normal variables. We generate
\[
X = 1.05 \sin(\pi Z) + 0.55 U + 0.12 E, \qquad h_0(X) = \frac{\sin(\pi X)}{1 + 0.45 X^2},
\]
and $Y = h_0(X) + 0.55 U + \sigma(X, Z) V$, with
\[
\sigma(X, Z) = 0.35 + 0.10 |X| + 0.12 \cdot \frac{Z+1}{2} + 0.08 \big( 1 + \exp(-XZ) \big)^{-1}.
\]
This one-dimensional design is the basis for the visualization in Figure 1.

Table 2: Results of the experiment using Dataset 2.
Radius       Model    Test Dist. of IVs   Cov. Mean   Cov. SD   Length Mean   Length SD
XZ-indexed   Bins     Observed            0.901       0.020     2.751         0.210
                      Linear Tilt         0.898       0.024     2.800         0.246
                      Local Tilt          0.889       0.028     2.822         0.274
                      Step Tilt           0.900       0.021     2.786         0.235
             Linear   Observed            0.914       0.023     2.858         0.239
                      Linear Tilt         0.915       0.025     2.951         0.282
                      Local Tilt          0.916       0.027     3.070         0.302
                      Step Tilt           0.917       0.023     2.944         0.269
             RKHS     Observed            0.912       0.020     inf           -
                      Linear Tilt         0.910       0.022     inf           -
                      Local Tilt          0.906       0.025     inf           -
                      Step Tilt           0.910       0.021     inf           -
Z-indexed    Bins     Observed            0.903       0.019     2.761         0.212
                      Linear Tilt         0.905       0.023     2.856         0.251
                      Local Tilt          0.904       0.026     2.956         0.281
                      Step Tilt           0.905       0.021     2.834         0.243
             Linear   Observed            0.904       0.021     2.746         0.218
                      Linear Tilt         0.906       0.025     2.842         0.257
                      Local Tilt          0.898       0.028     2.880         0.279
                      Step Tilt           0.905       0.022     2.813         0.241
             RKHS     Observed            0.914       0.020     2.906         0.230
                      Linear Tilt         0.916       0.023     3.014         0.300
                      Local Tilt          0.917       0.026     3.139         0.331
                      Step Tilt           0.917       0.022     2.988         0.276
X-indexed    Linear   Observed            0.901       0.028     2.827         0.357
                      Linear Tilt         0.920       0.029     3.156         0.488
                      Local Tilt          0.920       0.036     3.298         0.541
                      Step Tilt           0.913       0.028     3.027         0.435
             Bins     Observed            0.902       0.027     2.920         0.348
                      Linear Tilt         0.916       0.033     3.229         0.477
                      Local Tilt          0.918       0.037     3.427         0.550
                      Step Tilt           0.913       0.031     3.143         0.462
             RKHS     Observed            0.904       0.027     2.855         0.330
                      Linear Tilt         0.917       0.031     3.133         0.444
                      Local Tilt          0.921       0.035     3.339         0.555
                      Step Tilt           0.914       0.029     3.056         0.425
             MLP      Observed            0.903       0.032     3.149         0.523
                      Linear Tilt         0.917       0.036     3.527         0.739
                      Local Tilt          0.920       0.040     3.775         0.899
                      Step Tilt           0.914       0.034     3.388         0.600

Dataset 2 ($d_X = 3$, $d_Z = 1$). Let $Z \sim \mathrm{Unif}[-1,1]$ and let $U_1, U_2, U_3, E_1, E_2, E_3, V$ be mutually independent standard normal variables. We set
\[
X_1 = \sin(\pi Z) + 0.50 U_1 + 0.10 E_1, \qquad X_2 = Z^2 - \tfrac{1}{3} + 0.42 U_2 + 0.10 E_2,
\]
\[
X_3 = \cos(\pi Z) + 0.24 (U_1 + U_3) + 0.10 E_3,
\]
\[
h_0(X) = \sin(X_1) + 0.45 X_2^2 - 0.35 X_1 X_3 + 0.40 \cos(X_3),
\]
and $Y = h_0(X) + (0.40 U_1 - 0.30 U_2 + 0.22 U_3) + \sigma(X, Z) V$, where
\[
\sigma(X, Z) = 0.32 + 0.08 |X_1| + 0.07 X_2^2 + 0.10 \cdot \frac{Z+1}{2} + 0.05 \max\{ X_3 Z, 0 \}.
\]

Table 3: Results of the experiment using Dataset 3.

Radius       Model    Test Dist. of IVs   Cov. Mean   Cov. SD   Length Mean   Length SD
XZ-indexed   Bins     Observed            0.899       0.022     34.507        141.721
                      Linear Tilt         0.898       0.023     34.532        141.043
                      Local Tilt          0.899       0.025     34.710        142.243
                      Step Tilt           0.898       0.024     34.487        140.992
             Linear   Observed            0.916       0.021     37.134        149.733
                      Linear Tilt         0.917       0.024     38.053        156.486
                      Local Tilt          0.917       0.028     38.470        159.830
                      Step Tilt           0.917       0.024     37.813        154.825
             RKHS     Observed            0.912       0.022     inf           -
                      Linear Tilt         0.913       0.026     inf           -
                      Local Tilt          0.916       0.030     inf           -
                      Step Tilt           0.913       0.026     inf           -
Z-indexed    Bins     Observed            0.902       0.023     34.581        143.175
                      Linear Tilt         0.903       0.024     35.391        149.331
                      Local Tilt          0.904       0.026     35.540        149.977
                      Step Tilt           0.903       0.024     35.012        146.496
             Linear   Observed            0.911       0.021     37.132        156.724
                      Linear Tilt         0.910       0.024     38.008        163.341
                      Local Tilt          0.910       0.028     38.744        169.183
                      Step Tilt           0.911       0.023     37.830        162.051
             RKHS     Observed            0.911       0.021     35.587        139.016
                      Linear Tilt         0.910       0.023     36.831        147.884
                      Local Tilt          0.910       0.026     37.425        151.207
                      Step Tilt           0.911       0.023     36.658        146.601
X-indexed    Linear   Observed            0.895       0.035     1323.543      11129.639
                      Linear Tilt         0.909       0.033     1402.822      11599.136
                      Local Tilt          0.915       0.035     836.614       6013.192
                      Step Tilt           0.906       0.035     1302.644      10630.369
             Bins     Observed            0.898       0.032     243.714       817.299
                      Linear Tilt         0.910       0.032     274.721       922.389
                      Local Tilt          0.918       0.035     293.849       1032.362
                      Step Tilt           0.910       0.033     271.293       907.443
             RKHS     Observed            0.898       0.035     145.999       972.635
                      Linear Tilt         0.912       0.036     167.911       1126.497
                      Local Tilt          0.916       0.040     163.141       1088.062
                      Step Tilt           0.909       0.038     310.516       2531.610
             MLP      Observed            0.902       0.033     inf           -
                      Linear Tilt         0.911       0.037     inf           -
                      Local Tilt          0.915       0.042     inf           -
                      Step Tilt           0.908       0.036     inf           -

Dataset 3 ($d_X = d_Z = 3$).
Let $Z = (Z_1, Z_2, Z_3)$ have independent coordinates distributed as $\mathrm{Unif}[-1,1]$, and let $U_1, U_2, U_3, E_1, E_2, E_3, V$ be mutually independent standard normal variables. We generate
\[
X_1 = \sin\big( \pi (Z_1 + 0.30 Z_2) \big) + 0.42 U_1 + 0.10 E_1, \qquad X_2 = Z_2 Z_3 + 0.25 Z_1^2 + 0.40 U_2 + 0.10 E_2,
\]
\[
X_3 = \cos(\pi Z_3) + 0.22 Z_1 - 0.18 Z_2 + 0.28 (U_1 + U_3) + 0.10 E_3,
\]
\[
h_0(X) = 0.60 \sin(X_1) + 0.38 X_2^2 - 0.25 X_1 X_3 + 0.48 \cos(X_3) + 0.18 X_1 X_2,
\]
and $Y = h_0(X) + (0.36 U_1 - 0.24 U_2 + 0.18 U_3) + \sigma(X, Z) V$, with
\[
\sigma(X, Z) = 0.30 + 0.03 (X_1^2 + X_2^2) + 0.07 \max\{X_3, 0\} + 0.07 \cdot \frac{Z_1+1}{2} + 0.05 \cdot \frac{Z_2 Z_3 + 1}{2}.
\]
This is the most difficult design, because both the structural function and the heteroskedastic scale depend on higher-dimensional nonlinear interactions.

7.2 Experimental Setup

We set the nominal coverage level to $1-\alpha = 0.9$ and run 100 Monte Carlo replications for each synthetic design. Each replication uses $n_{\mathrm{train}} = 1000$ observations to fit the NPIV base learner, $n_{\mathrm{cal}} = 200$ observations for conformal calibration, and $n_{\mathrm{test}} = 1000$ observations for evaluation. Across all methods, the center $\hat{h}$ is a common series-2SLS NPIV estimator with cubic polynomial bases in both $X$ and $Z$. For the exact classes $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$, we consider three finite-dimensional shift families. The Bins family uses a four-bin basis defined on a one-dimensional projection; for scalar IVs, this projection is $Z$ itself, and for vector IVs, it is the standardized first principal component of the pooled training and calibration IVs. The Linear family uses an affine basis with an intercept, and the RKHS family uses a Gaussian-kernel dictionary with four landmarks and kernel parameter $\gamma = 0.2$. For the $X$-indexed class $\mathcal{T}_X$, we consider four radius models: a linear basis, a six-bin basis, an RKHS basis with four landmarks and $\gamma = 0.2$, and an MLP radius model.
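For concreteness, the Dataset 1 design of Section 7.1 can be simulated in a few lines. This is a sketch under the stated formulas; the function name and seed handling are ours.

```python
import numpy as np

def generate_dataset1(n, rng=None):
    """Simulate the Dataset 1 design (d_X = d_Z = 1) of Section 7.1."""
    rng = np.random.default_rng(rng)
    Z = rng.uniform(-1.0, 1.0, size=n)
    U, E, V = rng.standard_normal((3, n))
    X = 1.05 * np.sin(np.pi * Z) + 0.55 * U + 0.12 * E
    h0 = np.sin(np.pi * X) / (1.0 + 0.45 * X**2)
    sigma = (0.35 + 0.10 * np.abs(X) + 0.12 * (Z + 1.0) / 2.0
             + 0.08 / (1.0 + np.exp(-X * Z)))
    Y = h0 + 0.55 * U + sigma * V  # the shared U term induces endogeneity
    return Y, X, Z
```

Regressing $Y$ on $X$ directly is biased here precisely because $U$ enters both $X$ and $Y$, which is the endogeneity the IV $Z$ is meant to resolve.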
The conditional density ratio in the $X$-indexed learner is estimated by a neural network with hidden layers $(32, 32)$ using two-fold cross-fitted training scores. We then split the calibration sample evenly into a shape-learning half and an independent final recalibration half. The $X$-indexed learner uses $\lambda = 50$, $\kappa = 0.05$, and, for the neural radius model, hidden layers $(32, 32)$.

To evaluate robustness under IV shifts, we report four test laws. Let $u(z)$ be the standardized one-dimensional projection of $z$ computed from the pooled training and calibration IVs; when $Z$ is multivariate, $u(z)$ is the first principal component after standardization. The raw weights are
\[
w_{\mathrm{obs}}(u) = 1, \qquad w_{\mathrm{lin}}(u) = \max\{ 1 + 0.95 u,\ 0.05 \}, \qquad w_{\mathrm{step}}(u) = \begin{cases} e, & u > 0, \\ 1, & u \le 0, \end{cases}
\]
\[
w_{\mathrm{loc}}(u) = 0.20 + 1.60 \exp\left[ -\frac{1}{2} \left( \frac{u - 0.75}{0.35} \right)^2 \right].
\]
For each method and scenario, the reported coverage ratio and interval length are weighted test-sample averages computed with the normalized versions of these weights. An entry of inf in the length columns means that at least one replication produced an unbounded interval, which makes the across-replication mean infinite.

7.3 Results

Tables 1-3 summarize the synthetic experiments. Overall, the exact classes $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$ behave as intended for the Bins and Linear shift families: their coverage ratios stay close to the nominal level across the observed law and the three shifted IV laws. In Dataset 1, exact interval lengths are around 4 for both classes, with the Linear specification slightly shorter than Bins and RKHS. In Dataset 2, the same pattern remains, and the exact finite-length intervals are even shorter, around 2.7 to 3.1. In Dataset 3, the exact procedures remain finite for Bins and Linear, but the intervals become much longer, reflecting the more difficult three-dimensional nonlinear design. The RKHS proxy is more fragile in the exact classes.
In Dataset 1, both $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$ with RKHS remain finite and are mildly conservative. In Dataset 2, the $Z$-indexed RKHS procedure is still well behaved, but the $(X,Z)$-indexed RKHS procedure yields unbounded intervals in at least one replication. In Dataset 3, the same contrast becomes sharper: the $Z$-indexed RKHS method remains finite, whereas the $(X,Z)$-indexed RKHS method again produces inf. This pattern is consistent with the fact that richer exact shift classes are harder to calibrate stably as the dimension of the conditioning variable grows.

The $X$-indexed results illustrate the trade-off between interpretability and difficulty. In Dataset 1, the Linear $X$-indexed model is the most stable $C(X)$ rule, with coverage near the nominal level and only a moderate increase in length relative to the exact methods. The RKHS $X$-indexed model is workable but more variable, while the Bins and especially the MLP radius models are unstable. In Dataset 2, the Linear, Bins, and RKHS $X$-indexed models all remain finite and near nominal, showing that fixed-shift recalibration can work well in a moderate-dimensional design. In Dataset 3, however, all $X$-indexed procedures become much longer and more variable, especially the Linear and MLP models. This is exactly the difficult regime for the target class $\mathcal{T}_X$: the final interval is required to depend only on $X$, even though the predictive uncertainty in the data-generating process varies with both $X$ and $Z$.

7.4 Visualization of Prediction Intervals

Figure 1 visualizes the fitted intervals for Dataset 1 under the observed IV distribution using a single replication with seed 2026. For the exact procedures, we use the Linear shift family, and for the $X$-indexed procedure we use the Linear radius model. The top row plots the lower and upper interval surfaces over a $41 \times 41$ grid in $(x, z)$.
The middle row fixes $x = 0$ and plots the $Y$-$Z$ slices, while the bottom row fixes $z = 0$ and plots the $Y$-$X$ slices. Note that large or infinite-length prediction intervals do not necessarily imply poor performance. If either the upper or the lower bound is infinite, the interval length is infinite; the opposite bound can still be finite, and when it is, the interval remains usable in practice as a one-sided bound. A more important metric is the coverage ratio.

The figure makes the distinction among the three radius classes visually transparent. In the middle row, the $(X,Z)$-indexed and $Z$-indexed intervals vary with $z$, whereas the $X$-indexed interval is flat in $z$ by construction. In the bottom row, all three methods vary with $x$, but the width profiles differ: the exact methods adapt to IV-indexed uncertainty, while the $X$-indexed interval absorbs that uncertainty into a single $X$-only radius. In Dataset 1, the $(X,Z)$-indexed and $Z$-indexed surfaces are quite similar, which suggests that much of the relevant heteroskedasticity can already be captured by allowing the radius to depend on the IV alone.

7.5 Discussion

The simulation evidence supports the main conceptual message of the paper. The exact $Z$-indexed class $\mathcal{T}_Z$ is the most reliable practical exact procedure: it consistently delivers near-nominal coverage under the shifted IV laws while keeping the interval lengths comparable to, and often slightly shorter than, those of the broader exact class $\mathcal{T}_{XZ}$.

Figure 1: Visualization of prediction intervals for Dataset 1 under the observed IV distribution. The $(X,Z)$-indexed and $Z$-indexed procedures use the Linear shift family, and the $X$-indexed procedure uses the Linear radius model. The top row shows the interval surfaces over $(x, z)$, the middle row shows $Y$-$Z$ slices at $x = 0$, and the bottom row shows $Y$-$X$ slices at $z = 0$.
The $X$-indexed target $\mathcal{T}_X$ remains substantively appealing, because it yields intervals of the form $C(X)$, but its finite-sample performance depends much more strongly on the radius model and on the complexity of the design. In low- and moderate-dimensional settings, simple $X$-indexed models such as the Linear basis can work well. In harder settings, the price of insisting on $Z$-free final intervals can be substantial.

These results also clarify how to interpret richer model classes in practice. RKHS-based exact classes can become unstable in more difficult designs, and overly flexible $X$-indexed radius models can produce very long intervals. Accordingly, the finite-dimensional Bins and Linear exact classes are the most robust baselines in the current experiments, while richer classes are better viewed as sensitivity analyses rather than default choices.

8 Empirical Analysis

We next study two empirical IV datasets using the same nominal level, shift scenarios, and evaluation metrics as in Section 7. In the CigarettesSW data (Stock & Watson, 2007), we take
\[
Y = \log(\mathrm{packs}), \qquad X = \log(\mathrm{price}/\mathrm{cpi}), \qquad Z = \log(\mathrm{taxs}/\mathrm{cpi}),
\]
and include $\log(\mathrm{income}/\mathrm{population}/\mathrm{cpi})$ and a 1995 year indicator as exogenous controls. Each repetition uses a year-stratified split with 24 training observations, 40 calibration observations, and 32 test observations. In the CollegeDistance data (Stock & Watson, 2007; Rouse, 1995), we take
\[
Y = \log\big( \max\{ \mathrm{wage},\ 10^{-2} \} \big), \qquad X = \mathrm{education}, \qquad Z = \mathrm{distance},
\]
and use score, tuition, unemp, and demographic dummy variables as controls. To keep the exact conditional procedures computationally manageable, each repetition first subsamples 1800 observations and then uses a 600/600/600 training, calibration, and test split. All empirical results are averaged over 100 random replications.
Tables 4 and 5 show a sharper contrast between stable and unstable specifications than in the synthetic designs. In CigarettesSW, the exact Bins procedures and the exact $Z$-indexed Linear procedure remain finite and close to the nominal coverage level, whereas the richer exact classes frequently produce unbounded intervals. For the $X$-indexed class, the Observed and Step-Tilt scenarios are still workable for the Linear and RKHS radius models, but the Linear-Tilt and Local-Tilt scenarios often lead to inf. This is the small-sample regime in which aggressive reweightings of the IV distribution are hardest to support, so coarse exact shift classes are the most reliable practical choices.

In CollegeDistance, the exact Bins and Linear procedures are much more stable. Their coverage ratios stay close to the nominal level under all four scenarios, and their interval lengths are around 2.5 to 2.7. By contrast, the exact RKHS procedures again produce inf, which indicates that the richer exact shift class is still too unstable in this empirical design. The most striking pattern appears in the $X$-indexed class: the Linear, Bins, and RKHS radius models produce very long intervals, whereas the MLP radius model remains stable, with coverage around 0.895 to 0.905 and interval length around 2.54 to 2.56. Thus, in the larger empirical dataset, a flexible nonlinear $q(x)$ can be essential for obtaining a practical $C(X)$ interval.

Taken together, the empirical results complement the simulations. Exact IV-specific procedures based on Bins or Linear shift classes are the most robust default options across datasets.

Table 4: Results of the experiment using the CigarettesSW dataset.

Radius       Model    Test Dist. of IVs   Cov. Mean   Cov. SD   Length Mean   Length SD
XZ-indexed   Bins     Observed            0.911       0.055     1.489         2.392
                      Linear Tilt         0.918       0.060     1.282         1.511
                      Local Tilt          0.918       0.070     1.335         1.621
                      Step Tilt           0.919       0.057     1.360         1.816
             Linear   Observed            0.928       0.051     inf           -
                      Linear Tilt         0.935       0.056     inf           -
                      Local Tilt          0.936       0.064     inf           -
                      Step Tilt           0.934       0.051     inf           -
             RKHS     Observed            0.957       0.040     inf           -
                      Linear Tilt         0.960       0.041     inf           -
                      Local Tilt          0.969       0.038     inf           -
                      Step Tilt           0.962       0.040     inf           -
Z-indexed    Bins     Observed            0.915       0.051     1.524         2.545
                      Linear Tilt         0.919       0.058     1.299         1.668
                      Local Tilt          0.916       0.069     1.337         1.756
                      Step Tilt           0.916       0.059     1.374         1.960
             Linear   Observed            0.920       0.054     1.476         2.385
                      Linear Tilt         0.928       0.060     1.340         1.774
                      Local Tilt          0.926       0.072     1.329         1.633
                      Step Tilt           0.926       0.055     1.382         1.956
             RKHS     Observed            0.954       0.048     inf           -
                      Linear Tilt         0.954       0.054     inf           -
                      Local Tilt          0.960       0.062     inf           -
                      Step Tilt           0.956       0.053     inf           -
X-indexed    Linear   Observed            0.908       0.080     1.587         2.982
                      Linear Tilt         0.965       0.045     inf           -
                      Local Tilt          0.997       0.017     inf           -
                      Step Tilt           0.942       0.064     2.889         12.223
             Bins     Observed            0.908       0.078     30.228        218.774
                      Linear Tilt         0.963       0.050     inf           -
                      Local Tilt          0.998       0.014     inf           -
                      Step Tilt           0.940       0.065     75.474        520.323
             RKHS     Observed            0.915       0.080     1.656         3.127
                      Linear Tilt         0.971       0.042     inf           -
                      Local Tilt          0.999       0.007     inf           -
                      Step Tilt           0.946       0.067     2.327         5.911
             MLP      Observed            0.909       0.076     104.428       1022.857
                      Linear Tilt         0.961       0.050     inf           -
                      Local Tilt          0.999       0.009     inf           -
                      Step Tilt           0.942       0.064     110.968       1075.274
The $X$-indexed target can be competitive, but only when the radius model matches the complexity of the application and the final fixed-shift recalibration remains numerically stable.

9 Conclusion

We proposed IV-CCP for constructing prediction intervals in NPIV with a distribution-free finite-sample coverage guarantee. Our proposed prediction interval construction allows three types of radii, $(X,Z)$-indexed, $Z$-indexed, and $X$-indexed, whose classes are denoted by $\mathcal{T}_{XZ}$, $\mathcal{T}_Z$, and $\mathcal{T}_X$, respectively. The broad benchmark class is $\mathcal{T}_{XZ}$, which produces intervals whose radius may vary jointly with $(X, Z)$ and admits exact finite-sample coverage over joint shifts. The exact IV-specific class is $\mathcal{T}_Z$, which keeps the center equal to $\hat{h}(X)$ and yields IV-CCP, an exact finite-sample procedure over policy-relevant IV shifts. The class $\mathcal{T}_X$ produces intervals $C(X)$, which may be the most natural operational target when the final prediction rule should depend on the structural regressor alone. For these approaches, we provide theoretical guarantees and confirm their validity through simulation studies and empirical analyses.

Table 5: Results of the experiment using the CollegeDistance dataset.

Radius       Model    Test Dist. of IVs   Cov. Mean   Cov. SD   Length Mean   Length SD
XZ-indexed   Bins     Observed            0.901       0.017     2.638         4.845
                      Linear Tilt         0.893       0.020     2.521         4.446
                      Local Tilt          0.882       0.028     2.386         4.043
                      Step Tilt           0.891       0.021     2.502         4.402
             Linear   Observed            0.903       0.017     2.668         4.744
                      Linear Tilt         0.903       0.019     2.626         4.599
                      Local Tilt          0.904       0.025     2.570         4.429
                      Step Tilt           0.903       0.021     2.622         4.605
             RKHS     Observed            0.904       0.015     inf           -
                      Linear Tilt         0.908       0.017     inf           -
                      Local Tilt          0.915       0.024     inf           -
                      Step Tilt           0.907       0.019     inf           -
Z-indexed    Bins     Observed            0.899       0.018     2.671         5.017
                      Linear Tilt         0.900       0.019     2.619         4.750
                      Local Tilt          0.900       0.023     2.558         4.492
                      Step Tilt           0.899       0.020     2.608         4.712
             Linear   Observed            0.898       0.018     2.573         4.519
                      Linear Tilt         0.899       0.020     2.555         4.447
                      Local Tilt          0.900       0.025     2.534         4.378
                      Step Tilt           0.898       0.020     2.553         4.448
             RKHS     Observed            0.901       0.017     inf           -
                      Linear Tilt         0.904       0.018     inf           -
                      Local Tilt          0.908       0.024     inf           -
                      Step Tilt           0.903       0.019     inf           -
X-indexed    Linear   Observed            0.896       0.019     151.561       1492.597
                      Linear Tilt         0.901       0.020     154.493       1521.789
                      Local Tilt          0.906       0.026     151.402       1490.779
                      Step Tilt           0.901       0.021     140.837       1385.177
             Bins     Observed            0.895       0.021     93.181        506.650
                      Linear Tilt         0.900       0.022     97.514        537.866
                      Local Tilt          0.905       0.028     100.555       545.249
                      Step Tilt           0.900       0.023     95.276        521.839
             RKHS     Observed            0.895       0.019     79.233        770.463
                      Linear Tilt         0.901       0.021     80.615        784.204
                      Local Tilt          0.906       0.027     78.894        767.130
                      Step Tilt           0.901       0.021     73.547        713.580
             MLP      Observed            0.895       0.020     2.544         4.955
                      Linear Tilt         0.901       0.021     2.549         4.962
                      Local Tilt          0.905       0.028     2.564         5.063
                      Step Tilt           0.901       0.022     2.543         4.938

References

Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795-1843, 2003.

Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. In International Conference on Neural Information Processing Systems (NeurIPS), 2022.

Andrew Bennett and Nathan Kallus. The variational method of moments, 2020.

Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Aabesh Bhattacharyya and Rina Foygel Barber. Group-weighted conformal prediction, 2025. arXiv:2401.17452.

Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613-1669, 2007.

Marine Carrasco and Jean-Pierre Florens. Generalization of GMM to a continuum of moment conditions.
Econometric Theory, 16(6):797–834, December 2000.

Marine Carrasco, Jean-Pierre Florens, and Eric Renault. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. In J.J. Heckman and E.E. Leamer (eds.), Handbook of Econometrics, volume 6B, chapter 77. Elsevier, 1 edition, 2007.

Gary Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.

Xiaohong Chen and Timothy M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.

Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.

Xiaohong Chen and Demian Pouzo. Sieve Wald and QLR inferences on semi/nonparametric conditional moment models. Econometrica, 83(3):1013–1079, 2015.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. Proceedings of the National Academy of Sciences (PNAS), 118(48):e2107794118, 2021.

Serge Darolles, Yanqin Fan, Jean-Pierre Florens, and Eric Renault. Nonparametric instrumental regression. Econometrica, 79(5):1541–1565, 2011.

Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. In Advances in Neural Information Processing Systems, volume 33, pp. 12248–12262. Curran Associates, Inc., 2020.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2020.

Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès. Conformal prediction with conditional guarantees. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 2025.

Leying Guan. Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2022.

Peter Hall and Joel L. Horowitz. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics, 33(6):2904–2929, 2005.

Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982.

Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning (ICML), 2017.

Joel L. Horowitz. Asymptotic normality of a nonparametric instrumental variables estimator. International Economic Review, 48(4):1329–1349, 2007.

Joel L. Horowitz and Sokbae Lee. Uniform confidence bands for functions estimated nonparametrically with instrumental variables. Journal of Econometrics, 168(2):175–188, 2012.

Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Batch multivalid conformal prediction. In International Conference on Learning Representations (ICLR), 2023.

Masahiro Kato, Masaaki Imaizumi, Kenichiro McAlinn, Shota Yasui, and Haruo Kakehi. Learning causal models from conditional moment restrictions by importance weighting. In International Conference on Learning Representations (ICLR), 2022.

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 2013.

Greg Lewis and Vasilis Syrgkanis. Adversarial generalized method of moments, 2018.
Luofeng Liao, You-Lin Chen, Zhuoran Yang, Bo Dai, Mladen Kolar, and Zhaoran Wang. Provably efficient neural estimation of structural equation models: An adversarial approach. In Advances in Neural Information Processing Systems, volume 33, pp. 8947–8958. Curran Associates, Inc., 2020.

Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual instrumental variable regression. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

Taisuke Otsu. Empirical likelihood estimation of conditional moment restriction models with unknown functions. Econometric Theory, 27(1):8–46, 2011.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NeurIPS), 2007.

Yaniv Romano, Evan Patterson, and Emmanuel J. Candès. Conformalized quantile regression. In International Conference on Neural Information Processing Systems (NeurIPS), 2019.

Cecilia E. Rouse. Democratization or diversion? The effect of community colleges on educational attainment. Journal of Business & Economic Statistics, 13(2):217–224, 1995.

J. D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica, 26(3):393–415, 1958.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

James H. Stock and Mark W. Watson. Introduction to Econometrics. Addison Wesley, 2 edition, 2007.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, 2019.

Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning (ACML), 2012.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, 2005.

Liyuan Xu, Yutian Chen, Siddarth Srinivasan, Nando de Freitas, Arnaud Doucet, and Arthur Gretton. Learning deep features in instrumental variable regression. In International Conference on Learning Representations (ICLR), 2021.

A Detailed Related Work

The contribution of this paper is easiest to locate once the method is decomposed into the objects separated in the main text. The first object is structural estimation under endogeneity. This is the step that produces the base learner $\hat{h}$ used in the score construction (11). The second object is the radius class $\mathcal{T}$, which determines whether the interval is allowed to vary with $(X, Z)$, only with $Z$, or only with $X$. The third object is the shift class $\mathcal{F}$, which determines the family of probability laws over which the coverage constraint (6) is required to hold. The related work below follows this decomposition, because it makes the connection to the rest of the paper transparent.

A.1 Structural Models and the Moment Restrictions

The econometric language behind our setup is the language of moment restrictions.
In classical linear IV models, exogeneity is expressed through orthogonality conditions between the IV and the structural error, and overidentified models are naturally handled by GMM (Sargan, 1958; Hansen, 1982). More broadly, Chamberlain (1987) studies estimation under conditional moment restrictions and provides the semiparametric framework that underlies many later developments in IV and NPIV. This background is directly relevant for our paper, because the structural equation (1) and the conditional moment restriction (2) are special cases of that general framework.

This literature also clarifies what our paper is not trying to do. In linear IV and GMM, the inferential target is typically a structural parameter or a finite-dimensional functional identified by the moments. In our setting, the structural learner still comes from that tradition, but the final target is a prediction interval for a future outcome. This distinction matters throughout the paper. The structural literature explains how to estimate $h_0$, whereas our prediction layer uses the resulting residual scores to enforce predictive validity.

A.2 NPIV and Estimation under the Conditional Moment Restriction

NPIV extends IV from finite-dimensional linear models to an unknown structural function. Foundational papers show how identification can be written as an inverse problem under a conditional moment restriction, and how estimation therefore requires regularization (Newey & Powell, 2003; Ai & Chen, 2003; Darolles et al., 2011; Carrasco et al., 2007). The ill-posedness of NPIV, and its implications for convergence rates and smoothing choices, are central in Hall & Horowitz (2005). Related semiparametric developments, such as Blundell et al. (2007), and broader conditional-moment formulations, such as Chen & Pouzo (2012) and Otsu (2011), place NPIV inside a larger class of problems where the unknown object is a function and the identifying information is a conditional moment.

This literature is the natural source of the base learner in our algorithm. Neither the joint class $\mathcal{T}_{XZ}$ nor the IV-indexed class $\mathcal{T}_Z$ nor the $X$-indexed class $\mathcal{T}_X$ prescribes a specific NPIV estimator. Any estimator that outputs a structural prediction rule $\hat{h}(X)$ can be used in the training step, including sieve, minimum-distance, Tikhonov, and related regularized procedures.

A.3 Asymptotic Inference in NPIV

A substantial econometric literature studies inference for the structural function or its functionals in NPIV and related conditional moment models. Representative results include asymptotic normality for NPIV estimators (Horowitz, 2007), uniform confidence bands (Horowitz & Lee, 2012), sieve Wald and QLR inference for semi/nonparametric conditional moment models (Chen & Pouzo, 2015), and sup-norm rates with uniform inference for nonlinear functionals (Chen & Christensen, 2018). These papers are closely connected to our theory section, because they clarify how ill-posedness, approximation error, and regularization affect the quality of structural estimation.

At the same time, the inferential target is fundamentally different. Those papers develop confidence procedures for $h_0$ or for functionals of $h_0$, typically under asymptotic approximations and structural regularity conditions. By contrast, the object of this paper is the future outcome $Y_{n+1}$, and the guarantee in Theorem 6.2 is finite-sample and distribution-free conditional on the training split. The connection to the NPIV inference literature appears through efficiency rather than through asymptotic Gaussianity. Proposition 6.3 and Theorem 6.4 show that a smaller structural estimation error produces a shorter interval in the exact IV-indexed class, so advances in NPIV estimation feed directly into the efficiency of the conformal layer.

A.4 Machine Learning Estimators for IV and Conditional Moments

A recent machine learning literature develops flexible learners for IV and conditional moment problems. One group of methods extends the two-stage logic of 2SLS. Hartford et al. (2017) estimates the first-stage treatment distribution with neural networks and integrates the second-stage predictor over that estimated distribution. Singh et al. (2019) develops a kernel version of nonlinear IV regression in RKHS, and Xu et al. (2021) learns deep representations for IVs and treatments before the two regression stages. These methods are directly compatible with our framework, because they produce exactly the object needed in (11); that is, a base learner $\hat{h}$.

Another group of methods formulates conditional moment estimation as a minimax or adversarial problem. This includes adversarial GMM (Lewis & Syrgkanis, 2018), DeepGMM (Bennett et al., 2019), the variational method of moments (Bennett & Kallus, 2020), DualIV (Muandet et al., 2020), neural min-max estimators for structural equations (Liao et al., 2020), and minimax-optimal procedures for IV regression (Dikkala et al., 2020). The conceptual connection to our paper is important but subtle. In that literature, the adversarial or critic class is used to estimate $h_0$ by searching for violated moments. In our paper, the class $\mathcal{F}$ is not part of the structural estimator. It encodes the test environments over which the predictive coverage requirement (6) must hold.

A.5 Conformal Prediction

Classical conformal prediction provides exact finite-sample marginal coverage under exchangeability (Vovk et al., 2005).
A large literature improves efficiency by adapting interval length to heteroskedasticity or local structure, for example through conformalized quantile regression (Romano et al., 2019), distributional conformal prediction (Chernozhukov et al., 2021), and localized conformal prediction (Guan, 2022). These papers matter for our setup because they show how conformal methods can wrap around arbitrary black-box predictors while preserving finite-sample marginal validity.

For our paper, however, the central issue is not only efficiency under marginal validity. It is how to move toward a validity notion indexed by the IV. Early work such as Vovk (2012) studies different notions of conditional validity for conformal predictors, while Lei & Wasserman (2013) and Foygel Barber et al. (2020) show why exact distribution-free finite-sample conditional coverage of the form (4) cannot generally be achieved with nontrivial intervals. This impossibility result is the reason the main text replaces exact conditional coverage by the relaxed moment condition (6).

Several recent strands of the literature study structured relaxations that are closer to what we need. When the test environment is a single specified test distribution, weighted conformal prediction provides valid coverage under covariate shift if the likelihood ratio is known or can be estimated (Tibshirani et al., 2019). When the shift is driven by a finite set of groups, Bhattacharyya & Barber (2025) studies group-weighted conformal prediction. A related line studies subgroup or multivalid guarantees over finite collections of sets (Jung et al., 2023; Bastani et al., 2022). These papers are directly relevant for our design of the shift class $\mathcal{F}$. They also clarify the main distinction from our setting. We do not target one known test distribution.
Instead, in the exact IV-indexed class $\mathcal{T}_Z$ we require coverage to hold simultaneously over the family of IV tilts encoded by $\mathcal{F}$.

The closest methodological antecedent is Gibbs et al. (2025). That paper reformulates conditional validity as coverage over a class of covariate shifts and develops the augmented quantile-regression calibration algorithm that underlies our exact constructions for $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$. Our paper specializes that framework to IV settings, treats the IV $Z$ as the variable indexing the protected shifts, and places this exact conformal layer alongside the importance-weighted $X$-indexed formulation of Kato et al. (2022).

A.6 Position of this Paper

Putting these strands of the literature together makes the contribution of the paper precise. Relative to the IV and NPIV literatures, we do not propose a new identification strategy or a new estimator of $h_0$. Relative to the conformal literature, we do not target marginal coverage only, and we do not assume a single known target distribution except in the $X$-indexed fixed-shift recalibration step. Instead, the main text organizes predictive inference in NPIV around three radius classes presented in the order $\mathcal{T}_{XZ}$, $\mathcal{T}_Z$, and $\mathcal{T}_X$. The class $\mathcal{T}_{XZ}$ is the broad exact benchmark, because the variable indexing the radius and the variable indexing the shift family coincide. The class $\mathcal{T}_Z$ is the exact IV-specific conformal class and the first main focus of the paper. The class $\mathcal{T}_X$ is the target $C(X)$ and the second main focus. In that class, the paper combines importance-weighted conditional-moment learning with a fixed-shift weighted conformal recalibration layer. This is the sense in which the paper sits between structural estimation and predictive inference while being explicit about the difference between the prediction interval one ultimately wants and the exact subclasses that current finite-sample conformal machinery can certify.
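As a point of reference for the conformal layer discussed above, the classical marginal-coverage baseline of Vovk et al. (2005) can be sketched in a few lines. This is a generic illustration, not the paper's IV-CCP procedure: all names are ours, and the structural fit $\hat{h}$ is a stand-in for any NPIV estimator.

```python
import numpy as np

def split_conformal_interval(h_hat, X_cal, Y_cal, X_test, alpha=0.1):
    """Classical split conformal prediction with absolute-residual scores.

    Returns symmetric intervals h_hat(X_test) +/- q, where q is the
    ceil((1 - alpha)(m + 1))-th smallest calibration score. IV-CCP instead
    replaces the single cutoff q by a Z-indexed radius tau_Z(z).
    """
    scores = np.abs(Y_cal - h_hat(X_cal))        # conformity scores S_i
    m = len(scores)
    k = int(np.ceil((1 - alpha) * (m + 1)))      # finite-sample quantile index
    q = np.sort(scores)[min(k, m) - 1]           # conformal cutoff
    center = h_hat(X_test)
    return center - q, center + q

# Toy usage with a hypothetical pre-trained structural estimator.
rng = np.random.default_rng(0)
h_hat = lambda x: 2.0 * x                        # stand-in for an NPIV fit
X_cal = rng.normal(size=500)
Y_cal = 2.0 * X_cal + rng.normal(size=500)
lo, hi = split_conformal_interval(h_hat, X_cal, Y_cal, np.array([0.0, 1.0]))
```

The interval is centered at the structural prediction, exactly as in the paper's constructions; what the exact classes $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$ change is how the single cutoff $q$ is replaced by a covariate- or IV-indexed radius.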
B Technical Discussion of Candidate Shift Classes

This appendix records several concrete constructions that convert rich candidate shift families into the finite-dimensional feature map $\phi$ required by the exact conformal procedures for $\mathcal{T}_{XZ}$ and $\mathcal{T}_Z$. The common pattern is to start from an ideal class of nonnegative density ratios and then choose a finite dictionary whose span approximates that ideal class. This is directly parallel to the way sieve and low-rank approximations are used in the NPIV literature, including the appendices of Dikkala et al. (2020). The difference is that the present paper approximates a class of distribution shifts in the test environment, not a class of structural functions.

B.1 Nested Approximations

Let $\phi^{(r)} : \mathcal{Z} \to \mathbb{R}^{d_r}$ denote a sequence of feature maps such that
$$\mathcal{G}_r = \{ z \mapsto \beta^\top \phi^{(r)}(z) : \beta \in \mathbb{R}^{d_r} \}$$
is nested in $r$, that is, $\mathcal{G}_r \subset \mathcal{G}_{r+1}$. Define the corresponding nonnegative shift classes by $\mathcal{F}_r = \mathcal{G}_r \cap \{ f \ge 0 \}$. Then the ambiguity sets are nested as well, $\mathcal{Q}(\mathcal{F}_r) \subset \mathcal{Q}(\mathcal{F}_{r+1})$. Thus increasing $r$ strengthens the protected family of test environments. For every fixed $r$, Theorems 6.1 and 6.2 remain exact and finite-sample. The price of increasing $r$ appears through the dimension term in Theorem 6.4.

B.2 Discretization and Multiscale Partitions

Suppose for simplicity that $Z$ is scalar and supported on an interval $[a, b]$. Let $a = u_0 < u_1 < \cdots < u_G = b$ be a grid, and define cells $I_g = [u_{g-1}, u_g]$ for $g = 1, \dots, G$. The most basic discretization takes
$$\phi(z) = (1, \mathbf{1}[z \in I_1], \dots, \mathbf{1}[z \in I_G]).$$
Then any tilt in $\mathcal{F}_\phi$ is constant within cells, and the fitted radius is piecewise constant. A multiscale variant uses several nested grids, for example dyadic partitions. The same idea applies to vector-valued or mixed-type IVs. If a tree or forest trained on $\mathcal{I}_{\mathrm{tr}}$ partitions $\mathcal{Z}$ into leaves $L_1, \dots, L_G$, then the leaf indicators define a finite-dimensional map.

B.3 Sieve and Tensor-Product Constructions

Let $\psi_1, \psi_2, \dots$ be a sequence of basis functions on $\mathcal{Z}$. A growing sieve class takes the form
$$\phi^{(r)}(z) = (1, \psi_1(z), \dots, \psi_r(z)).$$
Common choices include B-splines, wavelets, Fourier series, and orthogonal polynomials. If one believes that the relevant density ratios are smooth, then the approximation error of the sieve is the natural price of replacing an infinite-dimensional smoothness class by the finite-dimensional span. For $Z = (Z_1, \dots, Z_{k_Z})$, one may use additive or tensor-product sieves. The additive construction keeps the dimension manageable,
$$\phi^{(r)}(z) = \big(1, \psi_{1,1}(z_1), \dots, \psi_{1,r_1}(z_1), \dots, \psi_{k_Z,1}(z_{k_Z}), \dots, \psi_{k_Z,r_{k_Z}}(z_{k_Z})\big).$$
Tensor products capture interactions,
$$\phi_{j_1,\dots,j_q}(z) = \prod_{\ell=1}^{q} \psi_{\ell, j_\ell}(z_\ell),$$
but should be used sparingly, because the resulting dimension grows quickly.

B.4 RKHS Proxies, Nyström Dictionaries, and Random Features

Let $K$ be a positive-definite kernel on $\mathcal{Z}$ and $\mathcal{H}_K$ its RKHS. The ideal smooth shift class is
$$\mathcal{F}^\infty_{K,B} = \{ f \in \mathcal{H}_K : \| f \|_{\mathcal{H}_K} \le B, \ f \ge 0 \}.$$
Because the main paper develops exact finite-sample theory for finite-dimensional feature maps, the direct implementation of $\mathcal{F}^\infty_{K,B}$ is outside the scope of the exact theorem. A finite-dimensional proxy is obtained by choosing landmarks $u_1, \dots, u_r$ and setting
$$\phi_r(z) = (1, K(z, u_1), \dots, K(z, u_r)).$$
A normalized Nyström version replaces the raw kernel sections by a stabilized low-rank basis. For shift-invariant kernels, random features provide another finite-dimensional proxy,
$$\phi_r(z) = (1, \varphi_1(z), \dots, \varphi_r(z)), \qquad \varphi_j(z) = \sqrt{\tfrac{2}{r}} \, \cos\big(\omega_j^\top z + b_j\big).$$
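The bin dictionary of Section B.2 and the random-feature proxy of Section B.4 can both be sketched directly. The function names are ours, and the Gaussian-kernel frequency sampling is one standard choice for shift-invariant kernels (Rahimi & Recht, 2007), assumed here for concreteness.

```python
import numpy as np

def bin_features(z, edges):
    """Piecewise-constant dictionary of Section B.2: an intercept plus one
    indicator per cell I_g = [u_{g-1}, u_g]. Any tilt in the span is
    constant within cells."""
    z = np.asarray(z, dtype=float)
    cells = [(z >= lo) & (z <= hi) for lo, hi in zip(edges[:-1], edges[1:])]
    return np.column_stack([np.ones_like(z)] + [c.astype(float) for c in cells])

def random_fourier_features(z, r, lengthscale=1.0, seed=0):
    """Random-feature proxy of Section B.4 for a Gaussian kernel:
    phi_j(z) = sqrt(2/r) * cos(omega_j * z + b_j), plus an intercept."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / lengthscale, size=r)   # kernel spectral draws
    b = rng.uniform(0.0, 2.0 * np.pi, size=r)             # random phases
    z = np.asarray(z, dtype=float)[:, None]
    phi = np.sqrt(2.0 / r) * np.cos(z * omega[None, :] + b[None, :])
    return np.column_stack([np.ones(len(z)), phi])

# Toy usage on a scalar IV supported on [0, 1].
z = np.linspace(0.0, 1.0, 200)
Phi_bins = bin_features(z, edges=np.linspace(0.0, 1.0, 6))  # intercept + 5 cells
Phi_rff = random_fourier_features(z, r=20)                  # intercept + 20 features
```

Either matrix plays the role of $\phi(Z_i)$ in the augmented quantile-regression calibration; the choice trades interpretability of the protected tilts (bins) against smoothness (random features).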
Finally, when $Z$ is high-dimensional or structured, one may compose the kernel with a representation map $g$ learned on the training split and use features of the form $K(g(z), c_j)$.

B.5 Shape-Restricted Approximations

Suppose again that $Z$ is scalar and let $u_1 < \cdots < u_G$ be a grid. A piecewise-linear approximation writes
$$f(z) = \sum_{g=1}^{G} \theta_g h_g(z),$$
where $h_1, \dots, h_G$ are hat functions on the grid. Monotone or convex shift families are obtained by adding linear inequality constraints to the coefficients. For example, monotonicity may be encoded by $\theta_1 \le \theta_2 \le \cdots \le \theta_G$, and convexity by second-difference inequalities. The baseline theorem does not analyze this constrained-cone version, but it remains a natural structured extension.

B.6 The Separate Radius-Class Design Problem for $\mathcal{T}_X$

When the final interval is $X$-indexed, one must also choose a class for $\tau_X$. The same approximation ideas apply on the $X$ side, but they should not be conflated with the shift class on the $Z$ side. One may use bins in $X$, additive or tensor-product sieves on $X$, RKHS models on $X$, learned representations of $X$, or shape-restricted classes for $\tau_X$. The shift class $\mathcal{F}$ determines which probability laws must be protected. The radius class for $\tau_X$ determines how much heterogeneity the prediction interval may have across $X$.

C Proofs

Proof of Proposition 3.1. Let
$$U_\tau = \mathbf{1}\big[ Y_{n+1} \in \hat{C}_\tau(X_{n+1}, Z_{n+1}) \big] - (1 - \alpha).$$
If (4) holds, then $\mathbb{E}[U_\tau \mid Z_{n+1}] = 0$ almost surely. Therefore, for every bounded measurable $f$,
$$\mathbb{E}[f(Z_{n+1}) U_\tau] = \mathbb{E}\big[ f(Z_{n+1}) \, \mathbb{E}[U_\tau \mid Z_{n+1}] \big] = 0,$$
which proves (5). Conversely, suppose (5) holds for every bounded measurable $f$. Then
$$\mathbb{E}\big[ f(Z_{n+1}) \, \mathbb{E}[U_\tau \mid Z_{n+1}] \big] = 0, \qquad \forall f \in \mathcal{M}.$$
Choosing $f(z) = \mathbf{1}[z \in A]$ for an arbitrary measurable set $A \subseteq \mathcal{Z}$ gives
$$\mathbb{E}\big[ \mathbf{1}[Z_{n+1} \in A] \, \mathbb{E}[U_\tau \mid Z_{n+1}] \big] = 0, \qquad \forall A.$$
Hence $\mathbb{E}[U_\tau \mid Z_{n+1}] = 0$ almost surely.
Rewriting this identity yields
$$\mathbb{P}\big( Y_{n+1} \in \hat{C}_\tau(X_{n+1}, Z_{n+1}) \mid Z_{n+1} \big) = 1 - \alpha$$
almost surely, which is exactly (4).

Proof of Proposition 3.2. Let $\Gamma(\mathcal{T}, \mathcal{F})$ denote the feasible set of (9) and (10) within the class $\mathcal{T}$. If $\mathcal{T}_1 \subset \mathcal{T}_2$, then every feasible radius in $\Gamma(\mathcal{T}_1, \mathcal{F})$ also belongs to $\Gamma(\mathcal{T}_2, \mathcal{F})$. Therefore the infimum of $\mathbb{E}[\tau(X, Z)]$ over $\Gamma(\mathcal{T}_2, \mathcal{F})$ cannot exceed the corresponding infimum over $\Gamma(\mathcal{T}_1, \mathcal{F})$. This proves $L(\mathcal{T}_2, \mathcal{F}) \le L(\mathcal{T}_1, \mathcal{F})$. The displayed consequences follow by taking $(\mathcal{T}_1, \mathcal{T}_2) = (\mathcal{T}_Z, \mathcal{T}_{XZ})$ and $(\mathcal{T}_1, \mathcal{T}_2) = (\mathcal{T}_X, \mathcal{T}_{XZ})$.

Proof of Theorem 6.1. Condition on the training split. Then $\hat{h}$ is fixed. Relabel the calibration sample as $\{(S_i, W_i)\}_{i=1}^{m}$, where $W_i = (X_i, Z_i)$, and define the test score
$$S_{m+1} = \big| Y_{n+1} - \hat{h}(X_{n+1}) \big|, \qquad W_{m+1} = (X_{n+1}, Z_{n+1}).$$
By assumption, conditional on the training split, the augmented sample $\{(S_i, W_i)\}_{i=1}^{m+1}$ is exchangeable. The calibration procedure of Section 4 is exactly the finite-dimensional conditional conformal construction of Gibbs et al. (2025) applied to the conformity score $S = |Y - \hat{h}(X)|$ and the covariate $W = (X, Z)$. Their finite-sample result therefore yields the conditional moment inequality
$$\mathbb{E}\Big[ f(W_{m+1}) \big( \mathbf{1}[S_{m+1} \le \hat{\tau}_{XZ}(W_{m+1})] - (1 - \alpha) \big) \ \Big|\ \mathcal{I}_{\mathrm{tr}} \Big] \ge 0 \tag{27}$$
for every nonnegative $f \in \mathcal{F}_W$. Fix any measurable $f \in \mathcal{F}_W$ with $f \ge 0$ and $\mathbb{E}[f(W)] > 0$. Under the tilted law $P^W_f$ defined in (13), conditional on the training split, we have
$$P^W_f\big( S_{m+1} \le \hat{\tau}_{XZ}(W_{m+1}) \mid \mathcal{I}_{\mathrm{tr}} \big) = \frac{\mathbb{E}\big[ f(W_{m+1}) \, \mathbf{1}[S_{m+1} \le \hat{\tau}_{XZ}(W_{m+1})] \mid \mathcal{I}_{\mathrm{tr}} \big]}{\mathbb{E}[f(W_{m+1}) \mid \mathcal{I}_{\mathrm{tr}}]} \ge \frac{(1 - \alpha) \, \mathbb{E}[f(W_{m+1}) \mid \mathcal{I}_{\mathrm{tr}}]}{\mathbb{E}[f(W_{m+1}) \mid \mathcal{I}_{\mathrm{tr}}]} = 1 - \alpha,$$
where the inequality uses (27). Finally, by the definition of the interval in (12),
$$S_{m+1} \le \hat{\tau}_{XZ}(W_{m+1}) \iff Y_{n+1} \in \hat{C}_{XZ}(X_{n+1}, Z_{n+1}).$$
Hence $P^W_f\big( Y_{n+1} \in \hat{C}_{XZ}(X_{n+1}, Z_{n+1}) \mid \mathcal{I}_{\mathrm{tr}} \big) \ge 1 - \alpha$.
Taking expectations over the training split gives the claimed unconditional statement.

Proof of Theorem 6.2. We verify explicitly that the IV-indexed procedure is a specialization of the joint construction. In Section 4, set $W = Z$ and choose the feature map on $W$ to be the IV feature map $\phi$. Then
$$\mathcal{G}_W = \{ w \mapsto \beta^\top \phi(w) : \beta \in \mathbb{R}^d \} = \mathcal{G}, \qquad \mathcal{F}_W = \mathcal{G}_W \cap \{ f \ge 0 \} = \mathcal{F}.$$
The augmented quantile-regression problem of Section 4 becomes exactly the IV-indexed problem in Section 5, because the covariate $W_i$ is now simply $Z_i$. Consequently the fitted radius from the joint construction satisfies $\hat{\tau}_{XZ}(W_{n+1}) = \hat{\tau}_Z(Z_{n+1})$, and the interval (12) reduces to the IV-indexed interval (14). Under the same specialization, the tilted law (13) becomes
$$\frac{\mathrm{d}P^W_f(y, x, z)}{\mathrm{d}P(y, x, z)} = \frac{f(z)}{\mathbb{E}[f(Z)]},$$
which is exactly the IV-tilt family (7). Every assumption of Theorem 6.1 is therefore satisfied, and applying that theorem yields
$$P_f\big( Y_{n+1} \in \hat{C}_Z(X_{n+1}, Z_{n+1}) \big) \ge 1 - \alpha$$
for every measurable $f \in \mathcal{F}$ with $f \ge 0$ and $\mathbb{E}[f(Z)] > 0$.

Proof of Proposition 6.3. Fix $z \in \mathcal{Z}$ and let $T_z(s_1, \dots, s_m)$ denote the IV-indexed radius produced by Algorithm 1 at IV $z$ when the calibration scores are $s_1, \dots, s_m$ and the calibration IVs are kept fixed. Set $\delta = \| \hat{h} - h_0 \|_\infty$. For each calibration observation,
$$| S_i - S^\star_i | = \Big| \big| Y_i - \hat{h}(X_i) \big| - \big| Y_i - h_0(X_i) \big| \Big| \le \big| \hat{h}(X_i) - h_0(X_i) \big| \le \delta.$$
Therefore,
$$S^\star_i - \delta \le S_i \le S^\star_i + \delta, \qquad i = 1, \dots, m.$$
We next use two properties of $T_z$. First, $T_z$ is coordinatewise monotone in the calibration scores. If one increases some calibration score while keeping everything else fixed, the largest accepted radius cannot decrease. Second, because the constant functions belong to $\mathcal{G}$, the map is translation-equivariant,
$$T_z(s_1 + c, \dots, s_m + c) = T_z(s_1, \dots, s_m) + c, \qquad c \in \mathbb{R}.$$
Indeed, shifting every calibration score and the candidate test score by the same constant leaves all residuals in the pinball loss unchanged after replacing $g$ by $g + c$, and $g + c \in \mathcal{G}$ because $\mathcal{G}$ contains constants. Applying the coordinatewise inequalities to the monotone map $T_z$ gives
$$T_z(S^\star_1 - \delta, \dots, S^\star_m - \delta) \le T_z(S_1, \dots, S_m) \le T_z(S^\star_1 + \delta, \dots, S^\star_m + \delta).$$
By translation equivariance,
$$T_z(S^\star_1 - \delta, \dots, S^\star_m - \delta) = T_z(S^\star_1, \dots, S^\star_m) - \delta = \hat{\tau}^\star_Z(z) - \delta,$$
$$T_z(S^\star_1 + \delta, \dots, S^\star_m + \delta) = T_z(S^\star_1, \dots, S^\star_m) + \delta = \hat{\tau}^\star_Z(z) + \delta.$$
Hence $\hat{\tau}^\star_Z(z) - \delta \le \hat{\tau}_Z(z) \le \hat{\tau}^\star_Z(z) + \delta$, which is equivalent to
$$\big| \hat{\tau}_Z(z) - \hat{\tau}^\star_Z(z) \big| \le \delta = \| \hat{h} - h_0 \|_\infty.$$
The interval-length bound follows immediately because $\mathrm{len}(\hat{C}_Z(x, z)) = 2 \hat{\tau}_Z(z)$ and $\mathrm{len}(\hat{C}^\star_Z(x, z)) = 2 \hat{\tau}^\star_Z(z)$.

Proof of Theorem 6.4. Under (1), the oracle scores are
$$S^\star_i = | Y_i - h_0(X_i) | = | \varepsilon_i |, \qquad i = 1, \dots, m.$$
Fix $z \in \mathcal{Z}$. For the augmented sample consisting of the $m$ oracle calibration pairs and the additional pair $(z, \hat{\tau}^\star_Z(z))$, let $\hat{g}_z \in \mathcal{G}$ be an optimal basic solution of the augmented quantile-regression problem, chosen so that $\hat{g}_z(z) = \hat{\tau}^\star_Z(z)$. Write the residuals as
$$r_i = S^\star_i - \hat{g}_z(Z_i), \quad i = 1, \dots, m, \qquad r_{m+1} = \hat{\tau}^\star_Z(z) - \hat{g}_z(z) = 0.$$
Let $\mathcal{A}_z = \{ i \in \{1, \dots, m+1\} : r_i = 0 \}$ be the active set. Because $\hat{g}_z$ is chosen from a $d$-dimensional linear class and we take an optimal basic feasible solution of the associated linear program, at most $d$ residual constraints can be tight. Therefore, $|\mathcal{A}_z| \le d$. The KKT conditions for quantile regression imply the usual empirical quantile-balance relation. Since the only ambiguity in that relation comes from observations with zero residual, the deviation of the empirical quantile level from its target is bounded by the fraction of active residuals.
Therefore,
$$\Big| \frac{1}{m+1} \sum_{i=1}^{m+1} \mathbf{1}[S^\star_i \le \hat{g}_z(Z_i)] - (1 - \alpha) \Big| \le \frac{|\mathcal{A}_z|}{m+1} \le \frac{d}{m+1}. \tag{28}$$
Because $\hat{g}_z(z) = \hat{\tau}^\star_Z(z)$, Assumption 6.2 combines with (28) to yield
$$\big| F_z(\hat{\tau}^\star_Z(z)) - (1 - \alpha) \big| \le \frac{d}{m+1} + o_p(1).$$
Since $F_z(\tau_0(z)) = 1 - \alpha$, we obtain
$$\big| F_z(\hat{\tau}^\star_Z(z)) - F_z(\tau_0(z)) \big| \le \frac{d}{m+1} + o_p(1).$$
Assumption 6.1 then implies
$$\big| \hat{\tau}^\star_Z(z) - \tau_0(z) \big| \le \frac{d}{(m+1) p_{\min}} + o_p(1).$$
The bound for $\hat{\tau}_Z(z)$ follows by adding and subtracting $\hat{\tau}^\star_Z(z)$ and using Proposition 6.3:
$$\big| \hat{\tau}_Z(z) - \tau_0(z) \big| \le \big| \hat{\tau}_Z(z) - \hat{\tau}^\star_Z(z) \big| + \big| \hat{\tau}^\star_Z(z) - \tau_0(z) \big| \le \| \hat{h} - h_0 \|_\infty + \frac{d}{(m+1) p_{\min}} + o_p(1).$$
Finally,
$$\mathrm{len}(\hat{C}_Z(x, z)) - \mathrm{len}(C_0(x, z)) = 2 \hat{\tau}_Z(z) - 2 \tau_0(z) \le 2 \big| \hat{\tau}_Z(z) - \tau_0(z) \big| \le 2 \| \hat{h} - h_0 \|_\infty + \frac{2d}{(m+1) p_{\min}} + o_p(1),$$
which completes the proof.

Proof of Proposition 4.1. Fix any measurable $\tau_X : \mathcal{X} \to \mathbb{R}_+$. By definition of the conditional density ratio,
$$\mathbb{E}[ m_{\tau_X}(S, X) \, r_0(S, X \mid z) ] = \int m_{\tau_X}(s, x) \, \frac{p_{S,X \mid Z}(s, x \mid z)}{p_{S,X}(s, x)} \, p_{S,X}(s, x) \, \mathrm{d}s \, \mathrm{d}x = \int m_{\tau_X}(s, x) \, p_{S,X \mid Z}(s, x \mid z) \, \mathrm{d}s \, \mathrm{d}x = \mathbb{E}[ m_{\tau_X}(S, X) \mid Z = z ],$$
which is (20). The equivalence with (17) is immediate.

Proof of Proposition 6.5. If $Q(\tau_X) = 0$, then the integrand
$$z \mapsto \big( \mathbb{E}[ m_{\tau_X}(S, X) \, r_0(S, X \mid z) ] \big)^2$$
vanishes for $\nu$-almost every $z$. By Proposition 4.1, this is equivalent to $\mathbb{E}[ m_{\tau_X}(S, X) \mid Z = z ] = 0$ for $\nu$-almost every $z$. The converse implication is immediate. If $\nu$ has full support and the conditional moment is continuous in $z$, then vanishing $\nu$-almost everywhere implies vanishing everywhere on the support of $Z$, which yields exact conditional coverage for $\hat{C}_X$.

Proof of Theorem 6.6. Condition on the samples used to construct $\hat{h}$ and $q$.
Then $\hat{h}$ and $q$ are fixed, so the normalized recalibration scores
$$R_i = \frac{\big| Y_i - \hat{h}(X_i) \big|}{q(X_i)}, \quad i \in \mathcal{I}_{\mathrm{rcal}}, \qquad \text{and the test score} \qquad R_{n+1} = \frac{\big| Y_{n+1} - \hat{h}(X_{n+1}) \big|}{q(X_{n+1})}$$
are exchangeable under the training law. The target probability law is the single shifted law $P_{f_0}$ induced by the known weights $w_{f_0}(Z) = f_0(Z) / \mathbb{E}[f_0(Z)]$. For a test point with IV value $z$, weighted split conformal prediction under covariate shift, as developed by Tibshirani et al. (2019), applies directly to the normalized score $R$ with covariate $Z$ and target weight $w_{f_0}$. Let
$$\hat{t}_{f_0}(z) = \inf \Bigg\{ t \in \mathbb{R}_+ : \sum_{i \in \mathcal{I}_{\mathrm{rcal}}} \frac{w_{f_0}(Z_i)}{\sum_{j \in \mathcal{I}_{\mathrm{rcal}}} w_{f_0}(Z_j) + w_{f_0}(z)} \, \mathbf{1}\{ R_i \le t \} \ge 1 - \alpha \Bigg\},$$
with the convention $\inf \emptyset = \infty$. Their finite-sample theorem yields
$$P_{f_0}\big( R_{n+1} \le \hat{t}_{f_0}(Z_{n+1}) \big) \ge 1 - \alpha.$$
Because $w_{f_0}(Z_{n+1}) \le B_{f_0}$ almost surely and the cutoff is nondecreasing in the test weight, we have $\hat{t}_{f_0}(Z_{n+1}) \le \hat{t}_{f_0}$ almost surely. Therefore,
$$P_{f_0}\big( R_{n+1} \le \hat{t}_{f_0} \big) \ge 1 - \alpha.$$
By definition of the interval (24),
$$R_{n+1} \le \hat{t}_{f_0} \iff Y_{n+1} \in \hat{C}_{X, f_0}(X_{n+1}).$$
Hence $P_{f_0}\big( Y_{n+1} \in \hat{C}_{X, f_0}(X_{n+1}) \big) \ge 1 - \alpha$, which proves the claim.
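The cutoff $\hat{t}_{f_0}(z)$ in the last proof is a weighted empirical quantile and is straightforward to compute. The sketch below is ours (hypothetical function name), assuming the weights $w_{f_0}(Z_i)$ and the test weight are supplied as plain arrays.

```python
import numpy as np

def weighted_conformal_cutoff(R_cal, w_cal, w_test, alpha=0.1):
    """Weighted split-conformal cutoff in the style of Tibshirani et al.
    (2019): the smallest t such that the normalized calibration weights on
    {R_i <= t}, with mass w_test reserved at +infinity, reach 1 - alpha.
    Returns np.inf when no finite t suffices (the inf over the empty set).
    """
    order = np.argsort(R_cal)
    R = np.asarray(R_cal, dtype=float)[order]
    w = np.asarray(w_cal, dtype=float)[order]
    total = w.sum() + w_test
    cum = np.cumsum(w) / total                     # normalized weight of {R_i <= R_(k)}
    idx = np.searchsorted(cum, 1 - alpha)          # first index reaching the level
    return R[idx] if idx < len(R) else np.inf

# Toy usage: unit weights recover the usual (unweighted) conformal quantile.
R_cal = np.arange(1.0, 101.0)                      # scores 1, ..., 100
t = weighted_conformal_cutoff(R_cal, np.ones(100), w_test=1.0, alpha=0.1)
```

With unit weights and $m = 100$ this returns the $\lceil 0.9 \cdot 101 \rceil$-th smallest score, matching the classical split-conformal cutoff; a large test weight can push the cutoff to infinity, which mirrors the infinite-length RKHS rows in Table 5.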