Mitigating Covariate Shift in Misspecified Regression with Applications to Reinforcement Learning


Authors: Philip Amortila, Tongyi Cao, Akshay Krishnamurthy

Philip Amortila (University of Illinois, Urbana-Champaign, philipa4@illinois.edu), Tongyi Cao (University of Massachusetts, Amherst, tcao@cs.umass.edu), Akshay Krishnamurthy (Microsoft Research, NYC, akshaykr@microsoft.com)

Abstract

A pervasive phenomenon in machine learning applications is distribution shift, where training and deployment conditions for a machine learning model differ. As distribution shift typically results in a degradation in performance, much attention has been devoted to algorithmic interventions that mitigate these detrimental effects. In this paper, we study the effect of distribution shift in the presence of model misspecification, specifically focusing on L_∞-misspecified regression and adversarial covariate shift, where the regression target remains fixed while the covariate distribution changes arbitrarily. We show that empirical risk minimization, or standard least squares regression, can result in undesirable misspecification amplification, where the error due to misspecification is amplified by the density ratio between the training and testing distributions. As our main result, we develop a new algorithm, inspired by robust optimization techniques, that avoids this undesirable behavior, resulting in no misspecification amplification while still obtaining optimal statistical rates. As applications, we use this regression procedure to obtain new guarantees in offline and online reinforcement learning with misspecification and establish new separations between previously studied structural conditions and notions of coverage.

1 Introduction

A majority of machine learning methods are developed and analyzed under the idealized setting where the training conditions accurately reflect those at deployment.
Yet, almost all practical applications exhibit distribution shift, where these conditions differ significantly. Distribution shift can occur for a plethora of reasons, ranging from quirks in data collection (Recht et al., 2019), to temporal drift (Gama et al., 2014; Besbes et al., 2015), to users adapting to an ML model (Perdomo et al., 2020), and it typically results in a degradation in model performance. Due to the prevalence of this phenomenon and the diversity of applications where it manifests, there is a vast and ever-growing body of literature studying algorithmic interventions to mitigate distribution shift (Quinonero-Candela et al., 2008; Sugiyama and Kawanabe, 2012).

Covariate shift is perhaps the most basic form of distribution shift. Covariate shift is pertinent to supervised learning, where the goal is to predict a label Y from covariates X, and posits a change in the distribution over covariates while keeping the target predictor fixed. This setup, in particular that the target does not change, is natural in applications including neural algorithmic reasoning (Anil et al., 2022; Zhang et al., 2022; Liu et al., 2023), reinforcement learning (Ross et al., 2011; Levine et al., 2020), and computer vision (Koh et al., 2021; Recht et al., 2019; Miller et al., 2021). It is well known that one can adapt guarantees from statistical learning to the covariate shift setting; specifically, for well-specified regression, a classical density-ratio argument shows that empirical risk minimization (ERM) is consistent under suitably well-behaved covariate shifts. One stipulation of this consistency guarantee is that the model/hypothesis class be well-specified (also referred to as realizable).

(Authors listed in alphabetical order.)
Although statistical learning theory offers a rather complete understanding of misspecification in the absence of covariate shift (via agnostic learning and excess risk bounds), our understanding of how covariate shift can adversely interact with model misspecification remains fairly immature. This interaction is the focus of the present paper.

1.1 Contributions

We study regression under adversarial covariate shift, where we receive regression samples from a distribution D_train but are evaluated on an arbitrary distribution D_test for which no prior knowledge is available; we only assume that the distributions share the same target regression function f⋆ and that the worst-case density ratio of the covariate marginals is bounded by C_∞ ∈ [1, ∞) (formally defined in Section 2). As inductive bias, we have a function class F of predictors and assume L_∞-misspecification: there exists a predictor f̄ ∈ F that is pointwise close to f⋆, i.e., ∥f̄ − f⋆∥_∞ ≤ ε_∞. This notion is natural for the covariate shift setting because it ensures that f̄ has low and comparable prediction error on both D_train and any D_test. In this setup we obtain the following results:

1. We show that standard empirical risk minimization (ERM) is not robust to covariate shift in the presence of misspecification. Precisely, even in the limit of infinite data, ERM over F can incur squared prediction error under D_test scaling as Ω(C_∞ ε_∞²). Meanwhile, the error of the L_∞-misspecified predictor f̄ is at most ε_∞². We call this phenomenon, where the misspecification error is scaled by the density ratio coefficient (despite there being a predictor avoiding this scaling), misspecification amplification.

2. As our main result, we give a new algorithm, called disagreement-based regression (DBR), that avoids misspecification amplification and is therefore robust to adversarial covariate shift under misspecification.
DBR has asymptotic prediction error under D_test scaling as O(ε_∞²), with no dependence on the density ratio coefficient C_∞. At the same time, it has order-optimal finite-sample behavior, recovering standard "fast rate" guarantees for the well-specified setting, and can be extended to adapt to an unknown misspecification level (as shown in Appendix A.4). To our knowledge, this is the first result avoiding misspecification amplification in the adversarial covariate shift setting. Our assumptions (particularly that no information about D_test is available and that F is unstructured) rule out prior approaches based on density ratios (Shimodaira, 2000; Duchi and Namkoong, 2021) or sup-norm convergence (Schmidt-Hieber and Zamolodtchikov, 2022); see Section 5 for further discussion.

To demonstrate the utility of disagreement-based regression, we deploy the procedure in value function approximation settings in reinforcement learning (RL), where regression is a standard primitive and mitigating the adverse effects of distribution shift is a central challenge. Here, using DBR as a drop-in replacement for ERM when fitting Bellman backups, we obtain the following results:

1. In the offline RL setting, we instantiate the minimax algorithm of Chen and Jiang (2019) with DBR and show that, under L_∞-misspecification and with coverage measured via the concentrability coefficient, misspecification amplification can be avoided when learning a near-optimal policy. In contrast, prior lower bounds imply that misspecification amplification is unavoidable when coverage is measured via Bellman transfer coefficients (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020). Our result therefore establishes a new separation between concentrability and Bellman transfer coefficients.

2. In the online RL setting, we instantiate the GOLF algorithm of Jin et al.
(2021) with DBR and obtain analogous results under the structural condition of coverability (building on the analysis of Xie et al. (2023)). Taken with the above lower bounds (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020), this separates structural conditions involving Bellman errors (e.g., Bellman rank (Jiang et al., 2017), Bellman-eluder dimension (Jin et al., 2021), or the sequential extrapolation coefficient (Xie et al., 2023)) from coverability, which does not involve them.

To keep the presentation concise and focused on the interaction between covariate shift and misspecification, we focus on the simplest settings that manifest misspecification amplification. In Section 6, we discuss a number of directions for future work, which include extensions to the core technical and algorithmic results.

2 Misspecified regression under distribution shift

We begin by introducing the formal problem setting and our assumptions. Most proofs for results in this section are deferred to Appendix A.

There are two joint distributions, called D_train and D_test, over X × R, where X is a covariate space. We use P_train, P_test and E_train, E_test to denote the probability law and expectation under these distributions. We hypothesize that D_train and D_test share the same Bayes regression function, an assumption referred to as covariate shift in the literature (Shimodaira, 2000).

Assumption 2.1 (Covariate shift). For all x ∈ X we have E_train[y | x] = E_test[y | x].

Let f⋆ : x ↦ E_train[y | x] denote the shared Bayes regression function. We posit that the marginal distributions over X are absolutely continuous with respect to a reference measure and use d_train and d_test to denote the corresponding marginal densities. We assume these are related via the following density ratio assumption.

Assumption 2.2 (Bounded density ratios).
The density ratio C_∞ := sup_{x ∈ X} d_test(x)/d_train(x) is bounded, i.e., C_∞ < ∞.

Note that C_∞ ≥ 1 always. Boundedness of density ratios is standard in the covariate shift literature; indeed, the coefficient C_∞ appears in the classical covariate shift analyses as well as in many algorithmic interventions (Shimodaira, 2000; Sugiyama et al., 2007). Beyond satisfying these assumptions, D_test can be adaptively and adversarially chosen. In particular, no information about D_test, such as labeled/unlabeled samples or other inductive bias, is available.

We have a dataset {(x_i, y_i)}_{i=1}^n of n i.i.d. labeled examples sampled from D_train and a function class F ⊂ (X → R) of predictors. We define the (squared) prediction errors

R_train(f) := E_train[(f(x) − f⋆(x))²],  and  R_test(f) := E_test[(f(x) − f⋆(x))²].  (1)

We seek to use the dataset to find a predictor f̂ for which R_test(f̂) is small. Regarding F, we make two assumptions: we assume that |F| < ∞ and that F is L_∞-misspecified.

Assumption 2.3 (L_∞-misspecification). For some ε_∞ ≥ 0, there exists f̄ ∈ F with ∥f̄ − f⋆∥_∞ ≤ ε_∞, where ∥f∥_∞ := sup_{x ∈ X} |f(x)|.

Most prior analyses for regression under covariate shift assume that the model class F is well-specified, i.e., that ε_∞ = 0 so that f⋆ ∈ F. L_∞-misspecification provides a relaxation that is natural for at least two reasons. First, it enables end-to-end learning guarantees via composition with approximation-theoretic results for specific function classes (e.g., neural networks), where it is standard to measure approximation via the L_∞ norm (Telgarsky, 2021). More importantly, L_∞-misspecification is particularly apt in the covariate shift setting because it ensures that f̄ has low prediction error on both D_test and D_train.
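To make these objects concrete, here is a minimal finite-domain sketch (the domain, marginals, and predictor values are our own illustration, not from the paper) computing the density ratio coefficient of Assumption 2.2 and the prediction errors of Eq. (1):

```python
# Minimal instance of the Section 2 setup (all numbers are our own toy choices).
X = [0, 1]
d_train = {0: 0.9, 1: 0.1}    # training marginal over the covariate space
d_test  = {0: 0.0, 1: 1.0}    # test marginal shifts onto the rare point

# Assumption 2.2: C_inf = sup_x d_test(x) / d_train(x)
C_inf = max(d_test[x] / d_train[x] for x in X if d_train[x] > 0)

f_star = {0: 0.0, 1: 0.0}     # shared Bayes regression function (Assumption 2.1)

def R(f, marginal):
    """Squared prediction error E[(f(x) - f_star(x))^2] under a marginal, Eq. (1)."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

# An L_inf-misspecified predictor (Assumption 2.3 with eps_inf = 0.1): its
# pointwise closeness makes its error comparable under *any* covariate shift.
f_bar = {0: 0.1, 1: 0.1}
print(C_inf)                                # 10.0
print(R(f_bar, d_train), R(f_bar, d_test))  # both ~0.01 = eps_inf ** 2
```

Note how the L_∞-close predictor has the same risk under both marginals even though the density ratio is large; this stability is exactly the property the text highlights.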
Thus, there is at least one high-quality predictor whose performance is stable across distributions. In contrast, we have no such guarantee if we, for example, measure misspecification with respect to other norms (which depend on the distribution) or consider the agnostic setting (with no quantified misspecification assumption). Indeed, we will see below that misspecification amplification is unavoidable in such cases.

We also make the following technical assumption.

Assumption 2.4 (Boundedness). sup_{f ∈ F} ∥f∥_∞ ≤ 1 and |y| ≤ 1 almost surely under D_train and D_test.

We impose Assumption 2.4 and that |F| < ∞ solely to highlight the novel algorithmic and technical aspects; we expect that relaxing these assumptions is possible.

2.1 Misspecification amplification for empirical risk minimization

When there is no prior knowledge about or data from D_test, perhaps the most natural algorithm for optimizing R_test(·) is empirical risk minimization (ERM) on the data from the training distribution:

f̂_ERM^(n) := argmin_{f ∈ F} (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

A standard uniform convergence argument yields the classical covariate shift guarantee for ERM:

Proposition 2.1 (ERM upper bound). For any δ ∈ (0, 1), with probability at least 1 − δ, ERM satisfies

R_test(f̂_ERM^(n)) ≤ O( C_∞ ε_∞² + C_∞ log(|F|/δ)/n ).

The second term, which scales as 1/n (the statistical term), is optimal in the generality of our setup (Ma et al., 2023; Ge et al., 2023), the interpretation being that the effective sample size is reduced by a factor of C_∞ due to the mismatch between D_train and D_test. The first term (the misspecification term) represents the asymptotic[1] test error of ERM and demonstrates a phenomenon that we call misspecification amplification, whereby the error due to misspecification is amplified by the density ratio coefficient.
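To see how the C_∞ ε_∞² term can actually be realized, the following population-level computation (a toy instance of our own, in the spirit of the amplification phenomenon just described) exhibits a predictor that barely wins the training-risk comparison while suffering roughly C_∞ ε_∞² test error:

```python
# Population-level illustration of misspecification amplification (our own toy
# numbers): infinite-data ERM is the L2(d_train) projection onto {f_bar, f_bad}.
X = [0, 1]
d_train = {0: 0.9, 1: 0.1}
d_test  = {0: 0.0, 1: 1.0}            # density ratio coefficient C_inf = 10
f_star  = {0: 0.0, 1: 0.0}
eps = 0.1                             # misspecification level eps_inf

f_bar = {0: eps, 1: eps}              # errors "spread out": eps everywhere
b = (10 ** 0.5) * eps * 0.999         # just under sqrt(C_inf) * eps_inf
f_bad = {0: 0.0, 1: b}                # zero error on the common point, large
                                      # error where d_test concentrates

def R(f, marginal):
    """Squared prediction error under the given marginal (Eq. (1))."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

erm = min([f_bar, f_bad], key=lambda f: R(f, d_train))
print(erm is f_bad)                           # True: f_bad barely wins on D_train
print(R(f_bad, d_test) / R(f_bar, d_test))    # ~9.98, i.e. nearly C_inf
```

The choice b just under √C_∞ · ε_∞ makes f_bad's training risk slightly smaller than ε_∞², so ERM selects it, yet its test risk is amplified by nearly the full factor C_∞.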
This phenomenon is simultaneously more concerning and less intuitive than the degradation of the statistical term, because it describes an error which does not decay with larger sample sizes, and because f̄ ∈ F has R_test(f̄) = ε_∞². Since F contains a predictor that does not incur misspecification amplification, one might hope that misspecification amplification can be avoided. Our first main result is that misspecification amplification cannot be avoided by ERM in the worst case. The result is proved in the asymptotic regime, where ERM is equivalent to the L₂(D_train)-projection of f⋆ onto the function class F, defined as

f̂_ERM^(∞) ∈ argmin_{f ∈ F} ∥f − f⋆∥²_{L₂(D_train)},  with  ∥g∥²_{L₂(D_train)} := E_train[g(x)²].

The next proposition shows that f̂_ERM^(∞) can incur misspecification amplification.

Proposition 2.2 (ERM lower bound). For all ε_∞ ∈ (0, 1) and C_∞ ∈ [1, ∞) such that √C_∞ · ε_∞ ≤ 1/2, and for all ζ > 0 sufficiently small, there exist distributions D_train, D_test and a function class F with |F| = 2 satisfying Assumption 2.1–Assumption 2.4 (with parameters ε_∞, C_∞) such that R_test(f̂_ERM^(∞)) = C_∞ ε_∞² − ζ.

[Figure 1: The construction used to prove Proposition 2.2. f_bad and f̄ have equal risk under D_train, but f_bad concentrates errors onto D_test.]

Combined with the optimality of the statistical term (Ma et al., 2023; Ge et al., 2023), this establishes that Proposition 2.1 characterizes the behavior of ERM under L_∞-misspecification and covariate shift. The construction is based on the following insight, visualized in Figure 1. The fact that f̄ is L_∞-close to f⋆ guarantees that its prediction errors are "spread out" across the domain X. Since f̄ ∈ F, we know that f̂_ERM^(∞) must satisfy ∥f̂_ERM^(∞) − f⋆∥²_{L₂(D_train)} ≤ ε_∞².
Unfortunately, this property does not guarantee that the errors of f̂_ERM^(∞) are "spread out" in a manner similar to f̄'s. Indeed, we construct a predictor f_bad that concentrates its errors on a region of X that is amplified by D_test and makes up for this by having zero error elsewhere. By setting the parameters carefully, we can ensure that this bad predictor is chosen by ERM. We note that essentially the same construction shows that, under the weaker notion of L₂(D_train)-misspecification, amplification is unavoidable for any proper learner (which outputs a function in F). Indeed, in Figure 1, the function class {f_bad} is L₂(D_train)-misspecified, but f_bad has much higher error on D_test.

[1] We consider the asymptotic regime where n → ∞ with all other quantities, like log |F| and ε_∞, fixed.

Other existing algorithms. Proposition 2.2 only pertains to ERM, and thus one might ask whether other algorithms can avoid misspecification amplification. Before turning to our positive results in the next section, we briefly note that other standard algorithms (that do not require knowledge of D_test) either incur misspecification amplification to some degree or have some other failure mode. This pertains to the star algorithm (Audibert, 2007; Liang et al., 2015), other aggregation schemes (cf. Lecué and Rigollet, 2014), and L_∞-regression (Knight, 2017; Yi and Neykov, 2024), as we discuss in Appendix A.2. Several methods for mitigating covariate shift can avoid misspecification amplification, but they either require knowledge of D_test or structural assumptions on F; see Section 5.

2.2 Main result: Disagreement-based regression

In this section, we provide a new algorithm that avoids misspecification amplification while requiring no knowledge of D_test and recovering optimal statistical rates.
To develop some intuition, observe that in the construction in Figure 1, the only way for the bad predictor (f_bad, in red) to be chosen by ERM and have large errors on D_test is for it to have much lower error than f̄ on the rest of the domain. Indeed, if we could filter out the points where f_bad's error is less than f̄'s, then f_bad could not overcome the large errors on D_test. Stated another way, we can avoid misspecification amplification in this example if we restrict the regression problem to the region where |f_bad(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|. Generalizing this insight to a larger function class suggests that, when considering a candidate f ∈ F, we should only measure the square loss for f on the region where |f(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|. Unfortunately, this region depends on f⋆ and f̄, both of which are unknown. Nevertheless, our approach is based on this intuition, and we avoid the dependence on these unknown functions with two algorithmic ideas.

To eliminate the dependence on f⋆, we use the fact that |f̄(x) − f⋆(x)| ≤ ε_∞ and approximate the above region with I_f := {x : |f(x) − f̄(x)| ≥ cε_∞}. Indeed, for c ≥ 2,

{x : |f(x) − f̄(x)| ≥ cε_∞} ⊆ {x : |f(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|}.

On the other hand, we know that |f(x) − f⋆(x)| ≤ (c + 1)ε_∞ in the complementary region, I_f^C. This is, up to the constant factor, the best pointwise guarantee we can attain, making it safe to ignore the complementary region. This resolves the first issue of dependence on f⋆.
To address the dependence on f̄, we use that f̄ ∈ F and formulate a robust optimization objective that implicitly considers all possible pairwise "disagreement regions." Formally, the algorithm is:

W^τ_{f,g}(x) := 1{|f(x) − g(x)| ≥ τ},
f̂_DBR^(n) ← argmin_{f ∈ F} max_{g ∈ F} (1/n) Σ_{i=1}^n W^τ_{f,g}(x_i) [ (f(x_i) − y_i)² − (g(x_i) − y_i)² ].  (2)

We call this algorithm disagreement-based regression (DBR) and keep the dependence on τ implicit in the notation for the solution f̂_DBR^(n).[2] There are essentially three key ingredients. First, we introduce the "filter" W^τ_{f,g} to restrict the regression problem to the set of points where the predictions of f and g differ considerably, which we call the disagreement region. This formalizes the intuition that we should only measure the square loss for f on points where |f(x) − f̄(x)| ≥ cε_∞. Second is the robust optimization approach, where for each f ∈ F we consider all possible choices g ∈ F for filtering, which allows us to take g to be L_∞-close to f⋆ in the analysis. Finally, we measure the square-loss regret in the disagreement region, by subtracting off the square loss of the comparator function g. Similar to Agarwal and Zhang (2022), this accounts for the fact that each g ∈ F yields a different regression problem, with potentially different Bayes error rates.[3] As our main theorem, we show that disagreement-based regression enjoys the following guarantee.

[2] The name stems from the literature on disagreement-based active learning (Hanneke, 2014), where a similar "range" computation has appeared (Krishnamurthy et al., 2019; Foster et al., 2018, 2021). However, our usage is conceptually unrelated: we use disagreement for robustness to covariate shift, while, in active learning, disagreement is used to reduce sample complexity.
[3] More directly, the probability mass of filtered points P_train[W^τ_{f,g}(x)] could vary considerably for different f, g ∈ F.

Theorem 2.1 (Main result for DBR). Fix δ ∈ (0, 1). Let F be a function class with |F| < ∞ satisfying Assumption 2.3 and Assumption 2.4. Then with probability at least 1 − δ, f̂_DBR^(n) with τ ≥ 3ε_∞ satisfies

E_train[ 1{|f̂_DBR^(n)(x) − f⋆(x)| ≥ τ + ε_∞} · ( (f̂_DBR^(n)(x) − f⋆(x))² − ε_∞² ) ] ≤ 160 log(2|F|/δ) / (3n),  (3)

which directly implies

P_train[ |f̂_DBR^(n)(x) − f⋆(x)| ≥ τ + ε_∞ ] ≤ 160 log(2|F|/δ) / ( 3n(τ² + 2τε_∞) ).  (4)

Before turning to a discussion of Theorem 2.1, we state two immediate corollaries. The first addresses the adversarial covariate shift setting, bounding the risk of f̂_DBR^(n) under D_test.

Corollary 2.1 (Covariate shift for DBR). Fix δ ∈ (0, 1). Under Assumption 2.1–Assumption 2.4, with probability at least 1 − δ, f̂_DBR^(n) with τ = 3ε_∞ satisfies

R_test(f̂_DBR^(n)) ≤ 17ε_∞² + O( C_∞ log(|F|/δ)/n ).  (5)

The next result shows that f̂_DBR^(n) recovers the optimal guarantee in the well-specified case, i.e., when ε_∞ = 0.

Corollary 2.2 (Well-specified case). Fix δ ∈ (0, 1). Under Assumption 2.1–Assumption 2.4 (with ε_∞ = 0), with probability at least 1 − δ, f̂_DBR^(n) with τ ≤ O( √(log(|F|/δ)/n) ) satisfies

R_train(f̂_DBR^(n)) ≤ O( log(|F|/δ)/n )  and  R_test(f̂_DBR^(n)) ≤ O( C_∞ log(|F|/δ)/n ).  (6)

We now turn to some remarks regarding Theorem 2.1 and the corollaries.

DBR avoids misspecification amplification. Comparing Corollary 2.1 in the n → ∞ limit with Proposition 2.2 highlights the main qualitative difference between DBR and ERM: DBR attains O(ε_∞²) asymptotic test error, while the test error for ERM is lower bounded by Ω(C_∞ ε_∞²). In other words, DBR avoids misspecification amplification while ERM does not.
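To make this contrast concrete, here is a brute-force, population-level rendition of the objective in Eq. (2) on a two-point toy instance in the style of Figure 1 (noiseless labels y = f⋆(x); all parameter values are our own illustration, not the paper's code). It shows DBR selecting the stable predictor that ERM rejects:

```python
# Population-level (infinite-data, noiseless) rendition of the DBR objective
# in Eq. (2) on a two-point instance. All numbers are our own toy choices.
X = [0, 1]
d_train = {0: 0.95, 1: 0.05}          # rare point x = 1
d_test  = {0: 0.0, 1: 1.0}            # test mass concentrated on the rare point
f_star  = {0: 0.0, 1: 0.0}            # shared Bayes regression function
eps, tau = 0.1, 0.3                   # tau = 3 * eps_inf, as in Theorem 2.1

f_bar = {0: eps, 1: eps}              # L_inf-close: error eps everywhere
f_bad = {0: 0.0, 1: 0.41}             # concentrates its error on the rare point
F = [f_bar, f_bad]

def R(f, marginal):
    """Squared prediction error under the given marginal (Eq. (1))."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

def L(f, g):
    """Population DBR objective: filtered square-loss regret of f against g."""
    return sum(d_train[x] * ((f[x] - f_star[x]) ** 2 - (g[x] - f_star[x]) ** 2)
               for x in X if abs(f[x] - g[x]) >= tau)   # the filter W^tau_{f,g}

erm = min(F, key=lambda f: R(f, d_train))               # infinite-data ERM
dbr = min(F, key=lambda f: max(L(f, g) for g in F))     # infinite-data DBR
print(dbr is f_bar, erm is f_bad)      # True True
print(R(dbr, d_test), R(erm, d_test))  # ~0.01 vs ~0.168: no amplification for DBR
```

Against the comparator g = f̄, the filter fires exactly on the rare point where f_bad's error is large, so f_bad incurs a strictly positive filtered regret and DBR rejects it, even though its unfiltered training risk is smaller than f̄'s.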
At the same time, the statistical term is identical (up to constants) to that of ERM, enabling us to recover the optimal rate in the well-specified case.

Quantile guarantee. Taking τ = O(ε_∞) in Eq. (3), we have that P_train[ |f̂_DBR^(n)(x) − f⋆(x)| ≥ cε_∞ ] ≲ 1/(nε_∞²), which controls the large quantiles of the prediction error. This is reminiscent of what can be achieved by applying Markov's inequality to the guarantee for ERM in the well-specified case. In contrast, ERM only ensures that R_train(f̂_ERM^(n)) = Ω(ε_∞²) under misspecification, which does not imply any meaningful quantile guarantee. One interpretation of our results is that, although such quantile guarantees are not possible for ERM under misspecification, there is no information-theoretic obstruction. We also note that these quantile guarantees are rather different from sup-norm convergence; see Section 5 for further discussion.

Computational efficiency. DBR, as described in Eq. (2), does not appear to be computationally tractable, primarily due to the non-smoothness and non-convexity introduced by the filter W_{f,g}. A natural direction for future work is to understand the computational challenges involved in avoiding misspecification amplification.

2.2.1 Extensions

Before closing this section, we mention two extensions that we defer to Appendix A.4.

• Approximation factor. The approximation factor of 17 in Corollary 2.1 can be improved to 10 (cf. Proposition A.1); however, our approach for doing so degrades the convergence rate of the statistical term. We do not know the optimal approximation factor for this setting or whether there is an inherent trade-off between the statistical term and the approximation/misspecification term.

• Adapting to unknown misspecification. Theorem 2.1 requires setting τ ≥ 3ε_∞, which can always be achieved by setting τ sufficiently large.
However, setting τ = O(ε_∞) yields the best guarantee, and so we would like to choose τ in a data-dependent fashion to adapt to the misspecification level. Proposition A.2 shows that this can be done while recovering essentially the same guarantee as in Theorem 2.1.

3 Proof of Theorem 2.1

This section contains the proof of Theorem 2.1 (which, we emphasize, requires only elementary arguments) and is not essential for understanding the main results of the paper. A reader interested in applications of Theorem 2.1 to reinforcement learning can proceed to Section 4.

The proof of Theorem 2.1 is organized into three steps, each of which is fairly simple. It is helpful to define empirical and population versions of the pairwise objective used by DBR:

(Empirical): L̂(f; g) := (1/n) Σ_{i=1}^n W^τ_{f,g}(x_i) [ (f(x_i) − y_i)² − (g(x_i) − y_i)² ],
(Population): L(f; g) := E_train[ W^τ_{f,g}(x) ( (f(x) − y)² − (g(x) − y)² ) ].

First, we establish a certain non-negativity property of the population objective, which is the main structural result. The second step is a uniform convergence argument to show that L̂(·;·), which appears in the algorithm, concentrates to the population counterpart L(·;·). Finally, we study the minimizer f̂_DBR^(n) and an L_∞-approximation f̄ and relate their objective values to establish the theorem. Details and proofs for the corollaries are deferred to Appendix A.3.

Step 1: Non-negativity. The key lemma for the analysis is the following structural property.

Lemma 3.1 (Non-negativity). With τ ≥ 2ε_∞ and for any f̄ ∈ F such that ∥f̄ − f⋆∥_∞ ≤ ε_∞, we have

L(f; f̄) ≥ (τ² − 2τε_∞) · Pr[ W^τ_{f,f̄}(x) = 1 ] ≥ 0.
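The pointwise mechanism behind the lemma admits a direct numerical sanity check (ours, not the paper's proof): whenever the filter fires, i.e., |f(x) − f̄(x)| ≥ τ ≥ 2ε_∞, the triangle inequality gives |f(x) − f⋆(x)| ≥ τ − ε_∞ ≥ ε_∞ ≥ |f̄(x) − f⋆(x)|, so the filtered square-loss difference is non-negative at every point:

```python
# Numerical sanity check (ours) of the pointwise fact behind Lemma 3.1.
import random

def filtered_gap(f_x, fbar_x, fstar_x, tau):
    """W^tau_{f,fbar}(x) * ((f(x) - fstar(x))^2 - (fbar(x) - fstar(x))^2)."""
    if abs(f_x - fbar_x) < tau:
        return 0.0                       # the filter does not fire
    return (f_x - fstar_x) ** 2 - (fbar_x - fstar_x) ** 2

random.seed(0)
eps, tau = 0.1, 0.25                     # any tau >= 2 * eps works
for _ in range(50000):
    fstar_x = random.uniform(-1, 1)
    fbar_x = fstar_x + random.uniform(-eps, eps)   # L_inf-close at this point
    f_x = random.uniform(-1, 1)                    # arbitrary candidate value
    assert filtered_gap(f_x, fbar_x, fstar_x, tau) >= 0
print("pointwise non-negativity holds on all sampled points")
```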
The proof requires only algebraic manipulations and actually reveals a stronger, pointwise property: with τ ≥ 2ε_∞, the random variable W^τ_{f,f̄}(x) [ (f(x) − f⋆(x))² − (f̄(x) − f⋆(x))² ] is non-negative almost surely. By the antisymmetry L(f; g) = −L(g; f), the lemma also shows that any L_∞-misspecified f̄ has non-positive population objective.

Step 2: Uniform convergence. Next, we establish the following concentration guarantee.

Lemma 3.2 (Concentration). Fix δ ∈ (0, 1) and τ ≥ 3ε_∞, and define ε_stat := 80 log(|F|/δ) / (3n). Under Assumption 2.3, for any f̄ ∈ F such that ∥f̄ − f⋆∥_∞ ≤ ε_∞, with probability at least 1 − δ we have

∀ f ∈ F : L(f; f̄) ≤ 2 L̂(f; f̄) + ε_stat,  and equivalently,  L̂(f̄; f) ≤ (1/2)( L(f̄; f) + ε_stat ).

The proof is based on Bernstein's inequality and importantly exploits a "self-bounding" property of L̂(f; g), in particular that Var[L̂(f; f̄)] ≤ (12/n) L(f; f̄), analogously to the analysis for ERM in the well-specified case.

Step 3: Analysis of f̂_DBR^(n). Let f̄ ∈ F be any function that is L_∞-close to f⋆, and condition on the high-probability event in Lemma 3.2 holding with this choice of f̄. The DBR minimizer satisfies

L(f̂_DBR^(n); f̄) ≤(i) 2 L̂(f̂_DBR^(n); f̄) + ε_stat ≤(ii) 2 max_{g ∈ F} L̂(f̂_DBR^(n); g) + ε_stat ≤(iii) 2 max_{g ∈ F} L̂(f̄; g) + ε_stat ≤(iv) max_{g ∈ F} L(f̄; g) + 2ε_stat ≤(v) 2ε_stat.

Here, inequalities (i) and (iv) are applications of Lemma 3.2, (ii) and (iii) follow from the definition of f̂_DBR^(n) since f̄ ∈ F, and (v) is an application of Lemma 3.1 along with the antisymmetry L(f; g) = −L(g; f). Eq. (3) now follows from the fact that W^τ_{f,f̄}(x) ≥ 1{|f(x) − f⋆(x)| ≥ τ + ε_∞}. Eq.
(4) follows since, under the event |f(x) − f⋆(x)| ≥ τ + ε_∞, we can lower bound (f(x) − f⋆(x))² − (f̄(x) − f⋆(x))² ≥ (τ + ε_∞)² − ε_∞² = τ² + 2τε_∞.

4 Applications to online and offline reinforcement learning

In this section, we deploy disagreement-based regression to obtain new results in offline and online RL with function approximation. Algorithmically, this is achieved by using DBR as a drop-in replacement for square-loss regression in existing algorithms. We illustrate this by examining and improving the Bellman residual minimization (a.k.a. minimax) algorithm for offline RL (Antos et al., 2008; Chen and Jiang, 2019) (Section 4.1) and the GOLF algorithm (Jin et al., 2021) for online RL (Section 4.2). The analyses also require minimal modifications to those of Xie and Jiang (2021) and Xie et al. (2023), respectively. To emphasize the ease with which DBR can be applied, we adopt the formulations and much of the notation from these works. All proofs for results in this section are deferred to Appendix B.

4.1 Offline reinforcement learning

Setup and notation. We consider a discounted Markov decision process (MDP) M = (P, R, d_0, γ) over states S and actions A, where P : S × A → Δ(S) is the transition operator, R : S × A → [0, 1] is the reward function, d_0 ∈ Δ(S) is the initial state distribution, and γ ∈ [0, 1) is the discount factor. A policy π : S → Δ(A) induces a trajectory s_0, a_0, r_0, s_1, a_1, r_1, …, where s_0 ∼ d_0 and, for each h ∈ N, a_h ∼ π(s_h), r_h = R(s_h, a_h), and s_{h+1} ∼ P(s_h, a_h). We use P_π[·] and E_π[·] to denote probability and expectation under this process. Let d^π_h ∈ Δ(S × A) denote the occupancy measure of π at time-step h, defined as d^π_h(s, a) := P_π[s_h = s, a_h = a], and let d^π := (1 − γ) Σ_{h=0}^∞ γ^h d^π_h. The value of π is denoted J(π) := E_π[ Σ_{h=0}^∞ γ^h r_h ].
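As a concrete check of these definitions, the following tabular sketch (the MDP, policy, and all numbers are our own toy choices) verifies the identity J(π) = E_{(s,a)∼d^π}[R(s,a)] / (1 − γ), which follows directly from d^π = (1 − γ) Σ_h γ^h d^π_h:

```python
# Tiny tabular MDP (our own toy numbers) checking the occupancy identity
# J(pi) = E_{(s,a) ~ d^pi}[R(s,a)] / (1 - gamma).
gamma = 0.9
S, A = [0, 1], [0, 1]
P = {  # P[(s, a)] = distribution over next states
    (0, 0): [0.8, 0.2], (0, 1): [0.3, 0.7],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.5}
d0 = [1.0, 0.0]                       # initial state distribution
pi = {0: 1, 1: 0}                     # deterministic policy: action per state

# d^pi = (1 - gamma) * sum_h gamma^h d^pi_h, truncated at a large horizon.
d = {(s, a): 0.0 for s in S for a in A}
state_dist, weight = d0[:], 1.0
for h in range(500):
    for s in S:
        d[(s, pi[s])] += (1 - gamma) * weight * state_dist[s]
    state_dist = [sum(state_dist[s] * P[(s, pi[s])][s2] for s in S) for s2 in S]
    weight *= gamma

J_from_occupancy = sum(d[sa] * R[sa] for sa in d) / (1 - gamma)

# J(pi) directly, by iterating V(s) = R(s, pi(s)) + gamma * E[V(s')].
V = [0.0, 0.0]
for _ in range(500):
    V = [R[(s, pi[s])] + gamma * sum(P[(s, pi[s])][s2] * V[s2] for s2 in S)
         for s in S]
J_direct = sum(d0[s] * V[s] for s in S)
print(abs(J_from_occupancy - J_direct) < 1e-6)   # True
```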
Each policy π has value functions V^π : s ↦ E_π[ Σ_{h=0}^∞ γ^h r_h | s_0 = s ] and Q^π : (s, a) ↦ E_π[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, a_0 = a ], and it is known that there exists a policy π⋆ that maximizes V^π(s) simultaneously for all s ∈ S. This policy also optimizes J(·) and hence is called the optimal policy. It is also known that the value function Q⋆ := Q^{π⋆} induces the optimal policy via π⋆ : s ↦ argmax_a Q⋆(s, a) and additionally satisfies Bellman's optimality equation: Q⋆(s, a) = [T Q⋆](s, a), where T is the Bellman operator, defined via T f : (s, a) ↦ E[ r_0 + γ max_{a′} f(s_1, a′) | s_0 = s, a_0 = a ].

In the offline value function approximation setting, we are given a dataset of n tuples D_n := {(s_i, a_i, r_i, s′_i)}_{i=1}^n generated i.i.d. from the following process: (s_i, a_i) ∼ µ, where µ ∈ Δ(S × A) is the data collection distribution, r_i = R(s_i, a_i), and s′_i ∼ P(s_i, a_i). We are also given a function class F ⊂ (S × A → R), where each f ∈ F induces the policy π_f : s ↦ argmax_a f(s, a). Given the dataset D_n and function class F, we seek a policy π̂ that has a small suboptimality gap J(π⋆) − J(π̂). We impose the following assumptions on the function class and on the data collection distribution:

• L_∞-misspecified realizability/completeness: There exists f̄ ∈ F such that ∥f̄ − T f̄∥_∞ ≤ ε_∞. Additionally, for any f ∈ F there exists g ∈ F such that ∥g − T f∥_∞ ≤ ε_∞.

• Concentrability: There exists a constant C_conc ∈ [1, ∞) such that max_{π ∈ Π} ∥d^π/µ∥_∞ ≤ C_conc. Here Π := {π_f : f ∈ F} is the policy class induced by F.

There is a large body of recent work studying various function approximation and coverage assumptions in offline RL (cf. Xie and Jiang, 2021).
Arguably the most standard are concentrability, as we use, and exact realizability/completeness, which is stronger than our version with misspecification. Regarding the function approximation assumption, it is not hard to show that misspecification amplification—which in this setting is defined as the suboptimality $J(\pi^\star) - J(\hat\pi)$ scaling as $\Omega(\varepsilon_\infty \sqrt{C_{\mathrm{conc}}})$—is necessary under weaker notions, such as $L_2(\mu)$-misspecification. Regarding coverage, as we will discuss below, the strength of the coverage assumption determines whether misspecification amplification can be avoided or not.

Algorithm and guarantee. The algorithm we study is a minor modification to the minimax algorithm (Antos et al., 2008; Chen and Jiang, 2019). For each function $\tilde f \in \mathcal{F}$ and each tuple $(s_i, a_i, r_i, s_i')$ we can form a regression sample $(s_i, a_i, y_{\tilde f, i} := r_i + \gamma \max_{a'} \tilde f(s_i', a'))$ and define the predictor $\hat f$ via the objective:
$$\hat f := \arg\min_{f \in \mathcal{F}} \max_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n W^\tau_{f,g}(s_i, a_i) \left[ (f(s_i, a_i) - y_{f,i})^2 - (g(s_i, a_i) - y_{f,i})^2 \right]. \quad (7)$$
Here $W^\tau_{f,g}(\cdot)$ is the filter in Eq. (2) with $x = (s,a)$. Given $\hat f$, we output $\hat\pi := \pi_{\hat f}$. Note that the only difference between this algorithm and the original minimax algorithm is the use of the filter $W^\tau_{f,g}(\cdot)$, which is essential for obtaining the following guarantee.

Theorem 4.1 (DBR for offline RL). Fix $\delta \in (0,1)$, assume that $\mathcal{F}$ is $L_\infty$-misspecified and $\mu$ satisfies concentrability (as defined above). Consider the algorithm defined in Eq. (7) with $\tau = 3\varepsilon_\infty$. Then, with probability at least $1 - \delta$, we have
$$J(\pi^\star) - J(\hat\pi) \leq O\left( \frac{\varepsilon_\infty}{1 - \gamma} + \frac{1}{1 - \gamma} \sqrt{\frac{C_{\mathrm{conc}} \log(|\mathcal{F}|/\delta)}{n}} \right).$$

The theorem is best understood via comparison to the guarantee for the standard minimax algorithm, e.g., Theorem 5 of Xie and Jiang (2020).
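When $\mathcal{F}$ is finite and the state-action space is discrete, the objective in Eq. (7) can be enumerated directly. A toy sketch; since Eq. (2) is not restated here, we take the filter to be the disagreement indicator $\mathbb{1}\{|f(x) - g(x)| > \tau\}$ (an assumption), and the regression targets are synthetic stand-ins for $r_i + \gamma \max_{a'} \tilde f(s_i', a')$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, tau = 200, 0.3
nX, nF = 10, 6

# F[j, x] = value of the j-th candidate function at the discrete point x = (s, a).
F = rng.uniform(size=(nF, nX))
x = rng.integers(nX, size=n)                         # sampled state-action indices
# Hypothetical targets y_{f,i}: noisy draws around an arbitrary per-function "T f".
Tf = rng.uniform(size=(nF, nX))
y = Tf[:, x] + 0.1 * rng.standard_normal((nF, n))    # y[j, i] = target for f_j at sample i

def dbr_objective(j, k):
    """Filtered excess square loss of f_j against comparator g_k on f_j's targets."""
    W = (np.abs(F[j, x] - F[k, x]) > tau).astype(float)   # assumed disagreement filter
    return np.mean(W * ((F[j, x] - y[j]) ** 2 - (F[k, x] - y[j]) ** 2))

# f_hat = argmin over f of the max over g of the filtered objective.
scores = np.array([[dbr_objective(j, k) for k in range(nF)] for j in range(nF)])
j_hat = scores.max(axis=1).argmin()
f_hat = F[j_hat]
```

Note that the comparator $g = f$ always contributes zero (the filter kills every sample), so the inner max is nonnegative.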
Under our assumptions ($L_\infty$-misspecification and concentrability), these two bounds differ only in the misspecification term: our theorem scales as $\varepsilon_\infty / (1-\gamma)$ while the guarantee for the minimax algorithm scales as $\varepsilon_\infty \sqrt{C_{\mathrm{conc}}} / (1-\gamma)$.[4] Thus, our algorithm inherits the favorable properties of DBR to avoid misspecification amplification in offline RL.

This feature is notable in light of existing lower bounds for misspecified RL (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020). Formally, these results consider linear function approximation in various online RL models, but the constructions can be extended to offline RL with general function approximation where coverage is measured via the Bellman transfer coefficient. This coefficient is the smallest $C_{\mathrm{transfer}}$ such that
$$\max_{\pi, f \in \mathcal{F}} \frac{\| f - \mathrm{apx}[f] \|^2_{L_2(d^\pi)}}{\| f - \mathrm{apx}[f] \|^2_{L_2(\mu)}} \leq C_{\mathrm{transfer}},$$
where $\mathrm{apx}[f] \in \mathcal{F}$ is the $L_\infty$-approximation of $\mathcal{T} f$.[5] The lower bound states that an asymptotic error of $\Omega(\varepsilon_\infty \sqrt{C_{\mathrm{transfer}}})$ is unavoidable. To contextualize our result with this lower bound, we identify two regimes: the "Bellman transfer regime" where $C_{\mathrm{transfer}} < \infty$ and the "concentrability regime" where $C_{\mathrm{conc}} < \infty$, and note that, since $C_{\mathrm{transfer}} \leq C_{\mathrm{conc}}$, the former is more general. In the Bellman transfer regime, misspecification amplification is unavoidable. In the concentrability regime, Theorem 4.1 avoids misspecification amplification and is sample efficient (i.e., has statistical term scaling as $\mathrm{poly}(C_{\mathrm{conc}}, \log(|\mathcal{F}|/\delta), \frac{1}{n}, \frac{1}{1-\gamma})$).
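As noted in footnote 5, under linear function approximation the transfer coefficient takes the form $\max_\theta \theta^\top \Sigma_\pi \theta / \theta^\top \Sigma_\mu \theta$, i.e., the largest generalized eigenvalue of the pair $(\Sigma_\pi, \Sigma_\mu)$. A sketch with synthetic feature covariances:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# Synthetic second-moment matrices Sigma_pi = E_{d^pi}[phi phi^T], Sigma_mu = E_mu[phi phi^T].
A = rng.standard_normal((d, d)); Sigma_pi = A @ A.T / d
B = rng.standard_normal((d, d)); Sigma_mu = B @ B.T / d + 0.1 * np.eye(d)

# max_theta theta^T Sigma_pi theta / theta^T Sigma_mu theta: whiten by the Cholesky
# factor of Sigma_mu, then take the largest eigenvalue of L^{-1} Sigma_pi L^{-T}.
L = np.linalg.cholesky(Sigma_mu)
M = np.linalg.solve(L, np.linalg.solve(L, Sigma_pi).T).T
C_transfer = np.linalg.eigvalsh(M).max()

# Sanity check against random directions: no theta should exceed C_transfer.
thetas = rng.standard_normal((1000, d))
ratios = np.einsum('ij,jk,ik->i', thetas, Sigma_pi, thetas) / \
         np.einsum('ij,jk,ik->i', thetas, Sigma_mu, thetas)
assert ratios.max() <= C_transfer + 1e-8
```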
This is the first result showing that both of these properties are simultaneously achievable: prior results achieve sample efficiency with misspecification amplification (e.g., Xie and Jiang, 2020), or avoid misspecification amplification with undesirable sample complexity scaling as $\mathrm{poly}(|\mathcal{S}|)$ (the latter is easily achieved under concentrability via a tabular model-based approach). Thus, the regime determines whether misspecification amplification is avoidable or not, and, in the regime where it is avoidable, our algorithm does so in a sample-efficient manner.

[4] Xie and Jiang (2020) consider slightly weaker assumptions: they measure both misspecification and concentrability via the $L_2(\mu)$ norm. Our analysis easily accommodates $L_2(\mu)$-concentrability, as can be seen from the proof. On the other hand, as described in Section 2.1, misspecification amplification is necessary under $L_2(\mu)$-misspecification.

[5] Many Bellman transfer coefficients exist, but a standard one is the smallest $C_{\mathrm{transfer}}$ such that $\max_{\pi, f \in \mathcal{F}} \| f - \mathcal{T} f \|^2_{L_2(d^\pi)} / \| f - \mathcal{T} f \|^2_{L_2(\mu)} \leq C_{\mathrm{transfer}}$. This coincides with ours under exact realizability/completeness, but we believe our definition is more appropriate for the misspecified case because it is equivalent to feature coverage under linear function approximation. Indeed, if $\mathcal{F}$ consists of linear functions in some feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ (but $\mathcal{T} f$ may not be linear due to misspecification), then our definition can be expressed via the features (as $\max_{\pi, \theta \in \mathbb{R}^d} \theta^\top \Sigma_\pi \theta / \theta^\top \Sigma_\mu \theta$, where $\Sigma_d = \mathbb{E}_d[\phi(s,a) \phi(s,a)^\top]$), but the standard definition cannot.

4.2 Online reinforcement learning

Setup and notation. We consider a finite-horizon episodic MDP $(P, R, H, s_1)$ over state space $\mathcal{S}$ and action space $\mathcal{A}$, where $H \in \mathbb{N}$ is the horizon, $P := \{P_h\}_{h=1}^H$ with $P_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the non-stationary
transition operator, $R := \{R_h\}_{h=1}^H$ with $R_h : \mathcal{S} \times \mathcal{A} \to [0,1]$ is the non-stationary reward function, and $s_1$ is a fixed starting state. A (non-stationary) policy $\pi := \{\pi_h\}_{h=1}^H$ is a sequence of mappings $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$, which induces a trajectory $(s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$ where $a_h \sim \pi_h(s_h)$, $r_h = R_h(s_h, a_h)$, and $s_{h+1} \sim P_h(s_h, a_h)$ for each time step. We use $\mathbb{P}_\pi[\cdot]$ and $\mathbb{E}_\pi[\cdot]$ to denote probability and expectation under this process, respectively. Let $d_h^\pi \in \Delta(\mathcal{S} \times \mathcal{A})$ denote the occupancy measure of $\pi$ at time step $h$, defined as $d_h^\pi(s,a) := \mathbb{P}_\pi[s_h = s, a_h = a]$. The value of policy $\pi$ is denoted $J(\pi) := \mathbb{E}_\pi\left[\sum_{h=1}^H r_h\right]$. Each policy has value functions $V_h^\pi : s \mapsto \mathbb{E}_\pi\left[\sum_{h'=h}^H r_{h'} \mid s_h = s\right]$ and $Q_h^\pi : (s,a) \mapsto \mathbb{E}_\pi\left[\sum_{h'=h}^H r_{h'} \mid s_h = s, a_h = a\right]$, and there exists an optimal policy $\pi^\star = \{\pi_h^\star\}_{h=1}^H$ that maximizes $V_h^\pi(s)$ simultaneously for each state $s \in \mathcal{S}$ and hence maximizes $J(\cdot)$. The optimal value function $Q_h^\star := Q_h^{\pi^\star}$ induces $\pi^\star$ via $\pi_h^\star : s \mapsto \arg\max_a Q_h^\star(s,a)$ and satisfies Bellman's equation: $Q_h^\star(s,a) = [\mathcal{T}_h Q_{h+1}^\star](s,a)$, where the Bellman operator $\mathcal{T}_h$ is defined via $[\mathcal{T}_h f_{h+1}](s,a) = R_h(s,a) + \mathbb{E}[\max_{a'} f_{h+1}(s_{h+1}, a') \mid s_h = s, a_h = a]$. We assume per-episode rewards satisfy $\sum_{h=1}^H r_h \in [0,1]$.

In online RL, we interact with the MDP for $T$ episodes, where in each episode $t$ we select a policy $\pi^{(t)}$ and collect the trajectory $(s_1^{(t)}, a_1^{(t)}, r_1^{(t)}, \ldots, s_H^{(t)}, a_H^{(t)}, r_H^{(t)})$ by taking actions $a_h^{(t)} = \pi_h^{(t)}(s_h^{(t)})$. We measure performance via the cumulative regret, defined as $\mathrm{Reg} := \sum_{t=1}^T J(\pi^\star) - J(\pi^{(t)})$. We equip the learner with a value function class $\mathcal{F} := \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where each $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \to [0,1]$.
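The Bellman operators $\mathcal{T}_h$ determine $Q_h^\star$ by backward induction from $Q_{H+1}^\star := 0$. A tabular sketch (synthetic MDP, with rewards scaled so the per-episode sum stays in $[0,1]$):

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, H = 4, 2, 3

P = rng.dirichlet(np.ones(nS), size=(H, nS, nA))   # P[h, s, a] over next states
R = rng.uniform(size=(H, nS, nA)) / H              # per-step rewards <= 1/H

# Backward induction: Q*_h = T_h Q*_{h+1}, starting from Q*_{H+1} = 0.
Q = np.zeros((H + 1, nS, nA))
for h in reversed(range(H)):
    Q[h] = R[h] + P[h] @ Q[h + 1].max(axis=1)

pi_star = Q[:H].argmax(axis=2)                     # greedy policies pi*_h
assert np.all(Q[:H] >= 0) and np.all(Q[:H] <= 1)   # values respect the reward bound
```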
Each $f \in \mathcal{F}$ induces a policy $\pi_f$ which, at time step $h$, takes actions via $\pi_{f,h}(s_h) = \arg\max_a f_h(s_h, a)$. We make the following assumptions:

• $L_\infty$-approximate realizability/completeness: For each $h \in [H]$ there exists $\bar f_h \in \mathcal{F}_h$ such that $\| \bar f_h - \mathcal{T}_h \bar f_{h+1} \|_\infty \leq \varepsilon_\infty$. Additionally, for each $f_{h+1} \in \mathcal{F}_{h+1}$ there exists $f_h \in \mathcal{F}_h$ such that $\| f_h - \mathcal{T}_h f_{h+1} \|_\infty \leq \varepsilon_\infty$.

• Coverability: There exists a constant $C_{\mathrm{cov}} \in [1, \infty)$ such that $\inf_{\mu_1, \ldots, \mu_H \in \Delta(\mathcal{S} \times \mathcal{A})} \sup_{\pi \in \Pi, h} \left\| \frac{d_h^\pi}{\mu_h} \right\|_\infty \leq C_{\mathrm{cov}}$, where $\Pi := \{\pi_f : \pi_{f,h}(s) = \arg\max_a f_h(s,a), f \in \mathcal{F}\}$ is the policy class induced by $\mathcal{F}$.

As in offline RL, there is a large body of recent work studying function approximation and structural conditions for sample-efficient online RL (c.f., Agarwal et al., 2019; Foster and Rakhlin, 2023). It is fairly standard to assume exact realizability and completeness, which is stronger than our version with misspecification. Coverability is a recently proposed structural condition (Xie et al., 2023): $C_{\mathrm{cov}}$ is known to be small in many MDP models of interest, but weaker conditions that enable sample efficiency are known. As we will see, the strength of the structural condition determines whether misspecification amplification can be avoided or not.

Algorithm and guarantee. The algorithm is a very minor modification to GOLF (Jin et al., 2021; Xie et al., 2023). To condense the notation, given a sample $(s_h^{(i)}, a_h^{(i)}, r_h^{(i)}, s_{h+1}^{(i)})$ and a function $f' \in \mathcal{F}_{h+1}$, define $x_h^{(i)} := (s_h^{(i)}, a_h^{(i)})$ and $y_{f',h}^{(i)} := r_h^{(i)} + \max_{a'} f'(s_{h+1}^{(i)}, a')$.
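Before turning to the algorithm, note that at each step $h$ the inner optimization in the coverability definition has a simple closed form: the best $\mu_h$ is proportional to the pointwise maximum occupancy, achieving the value $\sum_{x} \max_{\pi} d_h^\pi(x)$ (an elementary optimization fact, which the sketch below verifies numerically rather than cites):

```python
import numpy as np

rng = np.random.default_rng(6)
nX, n_policies = 6, 5   # nX = |S x A| at a fixed step h; a small induced policy class

# Synthetic occupancy measures standing in for the true d_h^pi of an MDP.
D = rng.dirichlet(np.ones(nX), size=n_policies)   # D[p] = occupancy of the p-th policy

# inf over mu_h of sup_pi ||d_h^pi / mu_h||_inf is attained by mu* proportional to
# the pointwise maximum occupancy, with value sum_x max_pi d_h^pi(x).
pointwise_max = D.max(axis=0)
C_cov_h = pointwise_max.sum()
mu_star = pointwise_max / C_cov_h
assert np.isclose(np.max(D / mu_star), C_cov_h)

# No other mu_h does better: every candidate yields a sup-ratio at least C_cov_h.
for _ in range(100):
    mu = rng.dirichlet(np.ones(nX))
    assert np.max(D / mu) >= C_cov_h - 1e-9
```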
At the beginning of episode $t$, define a version space
$$\mathcal{F}^{(t-1)} := \left\{ f \in \mathcal{F} : \forall h \in [H] : \max_{g_h \in \mathcal{F}_h} \sum_{i=1}^{t-1} W^\tau_{f_h, g_h}(x_h^{(i)}) \left\{ (f_h(x_h^{(i)}) - y_{f_{h+1},h}^{(i)})^2 - (g_h(x_h^{(i)}) - y_{f_{h+1},h}^{(i)})^2 \right\} \leq \beta \right\},$$
where $\beta > 0$ is a hyperparameter we will set below. Then, we define the optimistic value function $f^{(t)} := \arg\max_{f \in \mathcal{F}^{(t-1)}} f_1(s_1, \pi_{f,1}(s_1))$ and the induced policy $\pi^{(t)} := \pi_{f^{(t)}}$, collect a trajectory via $\pi^{(t)}$, and proceed to the next episode. Note that the only difference between this algorithm, which we call GOLF.DBR, and the version of GOLF studied by Xie et al. (2023) is that we use the filter $W^\tau_{f_h, g_h}(\cdot)$ in the construction of the version space. GOLF.DBR enjoys the following guarantee.

Theorem 4.2 (DBR for online RL). Fix $\delta \in (0,1)$, and assume that $\mathcal{F}$ is $L_\infty$-misspecified and $\mu$ satisfies coverability (as defined above). Consider GOLF.DBR with $\tau = 3\varepsilon_\infty$ and $\beta = c \log(TH|\mathcal{F}|/\delta)$. Then, with probability at least $1 - \delta$, we have
$$\mathrm{Reg} \leq O\left( \varepsilon_\infty H T + H \sqrt{C_{\mathrm{cov}} T \log(TH|\mathcal{F}|/\delta)} \log(T) \right).$$

Paralleling the discussion following Theorem 4.1, we emphasize two aspects of the result. The first is that it extends Theorem 1 of Xie et al. (2023) to the misspecified setting, with no degradation of the statistical term and without incurring a dependence on $\varepsilon_\infty \sqrt{C_{\mathrm{cov}}}$. In other words, it avoids misspecification amplification. The second remark is that, when taken with existing lower bounds (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020), Theorem 4.2 establishes a separation between coverability and structural parameters defined in terms of Bellman errors, which include the Bellman-Eluder dimension (Jin et al., 2021), bilinear rank (Du et al., 2021), and Bellman rank (Jiang et al., 2017).[6]
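When $\mathcal{F}$ is finite, the version space $\mathcal{F}^{(t-1)}$ can be enumerated directly. A toy sketch; as before, the filter is taken to be the disagreement indicator $\mathbb{1}\{|f - g| > \tau\}$ (our reading of Eq. (2)), and the regression targets are synthetic stand-ins for $y_{f_{h+1},h}^{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(7)
H, nX, nF, t, tau = 2, 8, 5, 50, 0.3

# F[h, j] = the j-th candidate function at step h on a discrete x = (s, a) space;
# candidate j denotes the tuple (f_{1,j}, ..., f_{H,j}).
F = rng.uniform(size=(H, nF, nX))
x = rng.integers(nX, size=(H, t))      # observed x_h^{(i)} indices
y = rng.uniform(size=(H, nF, t))       # hypothetical targets r + max_a' f_{h+1,j}(s', a')

def in_version_space(j, beta):
    """Membership of candidate j in F^{(t-1)}: the filtered excess loss at every step h,
    maximized over comparators g_h, must stay below the threshold beta."""
    for h in range(H):
        fvals = F[h, j, x[h]]
        excess = []
        for k in range(nF):            # max over g_h in F_h
            gvals = F[h, k, x[h]]
            W = np.abs(fvals - gvals) > tau   # assumed disagreement filter
            excess.append(np.sum(W * ((fvals - y[h, j]) ** 2 - (gvals - y[h, j]) ** 2)))
        if max(excess) > beta:
            return False
    return True

version_space = [j for j in range(nF) if in_version_space(j, beta=2.0)]
# GOLF.DBR then plays the greedy policy of the optimistic survivor, i.e. the f in the
# version space maximizing f_1(s_1, pi_{f,1}(s_1)), and repeats.
```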
This separation is more subtle than in offline RL because here, as long as the state-action space is finite, one can always use a "tabular" method and eliminate misspecification altogether, at the cost of $\mathrm{poly}(|\mathcal{S}|, |\mathcal{A}|) \cdot \sqrt{T}$ regret. To rule out this algorithm, we restrict to sample-efficient methods: in a setting where a particular structural parameter (e.g., coverability or Bellman rank) is bounded by $d$, we say that an algorithm is sample efficient if its statistical term scales as $\mathrm{poly}(d, \log(|\mathcal{F}|/\delta), H) \cdot o(T)$. The lower bounds show that, when the structural parameter involves Bellman errors (like the Bellman rank), $\varepsilon_\infty T \sqrt{d}$ misspecification error is necessary for sample-efficient algorithms.[7] On the other hand, under coverability, we can achieve misspecification error with no dependence on the structural parameter, in a sample-efficient manner.[8] This establishes that whether misspecification amplification can be avoided sample-efficiently depends on the structural properties of the MDP. To our knowledge, this is a novel insight into the interaction between the structural and function approximation assumptions in online RL.

5 Related work

There is a vast body of work studying distribution shift broadly and covariate shift in particular. We focus on the most closely related techniques for the covariate shift setting and refer the reader to Quinonero-Candela et al. (2008); Sugiyama and Kawanabe (2012); Shen et al. (2021) for a more comprehensive treatment.

Reweighting and robust optimization. Perhaps the most common way to correct for covariate shift is by reweighting each example $(x, y)$ in the objective function by the density ratio $w(x) := d_{\mathrm{test}}(x) / d_{\mathrm{train}}(x)$. This method has been studied in a long series of works (Shimodaira, 2000; Cortes et al., 2010; Cortes and Mohri, 2014).
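In a squared-loss regression, this correction is a one-line change when the ratio $w(x)$ is known. A sketch with Gaussian train/test covariate distributions (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 2000, 3

# Covariate shift: train draws x ~ N(0, I); f*(x) = x @ w_true with small noise.
X = rng.normal(0.0, 1.0, size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Known density ratio w(x) = d_test(x) / d_train(x) for a mean-shifted test
# distribution N(shift, I): w(x) = exp(shift . x - ||shift||^2 / 2).
mean_shift = np.array([1.0, 0.0, 0.0])
log_ratio = X @ mean_shift - mean_shift @ mean_shift / 2.0
weights = np.exp(log_ratio)

# Importance-weighted least squares: solve (X^T diag(w) X) beta = X^T diag(w) y.
WX = X * weights[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ y)
assert np.allclose(beta, w_true, atol=0.05)   # well-specified, so both are consistent
```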
In its simplest form it requires knowledge of $D_{\mathrm{test}}$ via the density ratios, so it is not directly applicable to our adversarial covariate shift setting. Extensions include approaches that estimate density ratios using unlabeled samples from $D_{\mathrm{test}}$ (Huang et al., 2006; Sugiyama et al., 2007; Gretton et al., 2009; Yu and Szepesvári, 2012) and robust optimization approaches that employ an auxiliary hypothesis class of distributions $\mathcal{P}$ containing $D_{\mathrm{test}}$ (Hashimoto et al., 2018; Sagawa et al., 2020; Duchi and Namkoong, 2021; Agarwal and Zhang, 2022). However, these still require prior knowledge about $D_{\mathrm{test}}$: in particular, it is known that the sample complexity of robust optimization scales with the statistical complexity of the auxiliary class $\mathcal{P}$ (Duchi and Namkoong, 2021), leading to vacuous bounds in the absence of inductive bias.

Ge et al. (2023) study statistical inference under covariate shift in well- and misspecified settings. They show that maximum likelihood estimation on $D_{\mathrm{train}}$ is inconsistent under misspecification, a result which is conceptually similar to our lower bound for ERM. However, their construction is not $L_\infty$-misspecified, so it is not directly comparable. Algorithmically, they use reweighting for the misspecified case, which, as mentioned, cannot be implemented in our setting.

Sup-norm convergence and function class-specific results. Another line of work provides specialized analyses for specific function classes of interest, such as linear (Lei et al., 2021), nonparametric (Kpotufe and Martinet, 2018; Pathak et al., 2022; Ma et al., 2023), and some neural network (Dong and Ma, 2023a) classes. The overarching technical approach in these works is to measure distance between distributions in a manner that captures the structure of the function class, analogously to learning-theoretic results for domain adaptation (Ben-David et al., 2006; Mansour et al., 2009).
A complementary approach is based on sup-norm convergence, which seeks to control $\| \hat f - f^\star \|_\infty$ for a predictor $\hat f$ and is naturally robust to covariate shift. Sup-norm convergence has been studied for various function classes (c.f., Schmidt-Hieber and Zamolodtchikov, 2022; Dong and Ma, 2023b), but unfortunately is not possible in the general statistical learning setup (Dong and Ma, 2023b). We mention sup-norm convergence primarily to contrast with our quantile guarantee in Eq. (4), which controls the probability over $x$ of large errors rather than the magnitude of the errors themselves, and which is attainable for any function class, even with misspecification. All of these works differ from ours in that (a) they consider specific function classes and (b) they operate closer to the well-specified regime than we do (e.g., in the nonparametric setting, one can drive the misspecification error to zero).

[6] As with Bellman transfer coefficients, we believe these definitions should be adjusted to accommodate misspecification. See Definition 10 in Jiang et al. (2017) for an example.

[7] Formally, for any $\zeta > 0$ one requires at least $\exp(d^{2\zeta})$ samples to find a $d^{1/2 - \zeta} \varepsilon_\infty$-suboptimal policy (Lattimore et al., 2020).

[8] We believe that misspecification error $\varepsilon_\infty H T$ is optimal under coverability and that $\varepsilon_\infty H T \sqrt{d}$ is optimal under structural parameters like Bellman rank. However, it remains open to establish the necessity of the horizon factors.

Related work in reinforcement learning. Our results for offline and online RL build directly on the analyses in Xie and Jiang (2020) and Xie et al. (2023), respectively. The former contributes to a long line of work on offline RL (Munos, 2003, 2007; Antos et al.
, 2008; Chen and Jiang, 2019), while the latter is part of a series of works establishing structural conditions under which online reinforcement learning is statistically tractable (c.f., Agarwal et al., 2019; Foster and Rakhlin, 2023). Many of these works do account for misspecification, but the question of whether misspecification amplification can be avoided is not considered.

Results that do focus on misspecification primarily consider linear function approximation. In the simpler offline policy evaluation setting, several works study least squares temporal difference learning (LSTD) (Bradtke and Barto, 1996) with misspecification (Tsitsiklis and Van Roy, 1996; Yu and Bertsekas, 2010; Mou et al., 2022). Recently, Amortila et al. (2023) precisely characterized the optimal misspecification amplification (i.e., approximation factors) achievable across a range of settings, showing that LSTD is essentially optimal in most regimes. The exception is when the offline data distribution is supported on the entire state space: there, one can employ a "tabular" model-based algorithm to incur no approximation error whatsoever, but the sample complexity scales polynomially with $|\mathcal{S}|$. Our offline RL results are conceptually similar because, under concentrability (which essentially implies full support), the standard minimax algorithm does not achieve the optimal approximation factor. A crucial difference is that our disagreement-based variant achieves an improved approximation factor without incurring any sample complexity overhead. For the more challenging offline policy optimization and online RL settings, Du et al. (2020); Van Roy and Dong (2019); Lattimore et al. (2020) establish conditions under which misspecification amplification is necessary.
As discussed above, combining our results with these lower bounds and their variations reveals new tradeoffs between coverage/structural and function approximation conditions, distinct from tradeoffs established by prior work (Xie and Jiang, 2021; Foster et al., 2022).

6 Discussion

This paper highlights an intriguing interplay between misspecification and distribution shift, exposing the undesirable misspecification amplification property of ERM and proposing disagreement-based regression as a remedy. We have shown that using disagreement-based regression in online and offline reinforcement learning yields new technical results and reveals new tradeoffs between coverage/structural assumptions and function approximation assumptions. We close by mentioning several interesting avenues for future work. There are a number of directions that pertain to the core setting of misspecified regression under covariate shift; for example: (a) extending the analysis of DBR to infinite function classes, other loss functions, and other notions of misspecification; (b) deriving a more computationally efficient procedure—perhaps in an oracle model of computation—that avoids misspecification amplification; and (c) determining the optimal achievable approximation factor. Pertaining to reinforcement learning theory, we believe the most pressing direction is to deepen our understanding of the relationship between coverage/structural assumptions (for offline/online RL, respectively) and function approximation assumptions, and we believe misspecification provides a novel lens to study this relationship. It is also worthwhile to consider other applications involving distribution shift where DBR or related procedures may reveal new conceptual insights.
Finally, it would also be interesting to study empirical issues: to understand how pervasive and problematic misspecification amplification is, develop practical interventions, and consider applying them to distribution shift and deep reinforcement learning scenarios. In short, there is much more to understand about the interplay between misspecification and distribution shift, and we look forward to progress in the years to come.

Acknowledgements

We thank Adam Block for helpful feedback on an early version of the manuscript.

References

Alekh Agarwal and Tong Zhang. Minimax regret optimization for robust machine learning under distribution shift. In Conference on Learning Theory, 2022.

Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/, 2019. Version: January 31, 2022.

Philip Amortila, Nan Jiang, and Csaba Szepesvári. The optimal approximation factors in misspecified off-policy value function estimation. In International Conference on Machine Learning, 2023.

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 2022.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.

Jean-Yves Audibert. Progressive mixture rules are deviation suboptimal. Advances in Neural Information Processing Systems, 2007.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization.
Operations Research, 2015.

Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 1996.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.

Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 2014.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 2010.

Kefan Dong and Tengyu Ma. First steps toward understanding the extrapolation of nonlinear models to unseen domains. In International Conference on Learning Representations, 2023a.

Kefan Dong and Tengyu Ma. Toward $L_\infty$-recovery of nonlinear functions: A polynomial sample complexity bound for Gaussian random fields. In Conference on Learning Theory, 2023b.

Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, 2021.

Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.

John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 2021.

Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.

Dylan J Foster and Alexander Rakhlin. Foundations of reinforcement learning and interactive decision making. arXiv:2312.16730, 2023.
Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. In Conference on Learning Theory, 2021.

Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory, 2022.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 2014.

Jiawei Ge, Shange Tang, Jianqing Fan, Cong Ma, and Chi Jin. Maximum likelihood estimation is all you need for well-specified covariate shift, 2023.

Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 2009.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 2014.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 2018.

Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 2006.

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 2021.

Keith Knight.
On the asymptotic distribution of the $L_\infty$ estimator in linear regression. Technical report, University of Toronto, 2017.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, 2021.

Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, 2018.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. Journal of Machine Learning Research, 2019.

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, 2020.

Guillaume Lecué and Philippe Rigollet. Optimal learning with Q-aggregation. The Annals of Statistics, 2014.

Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, 2021.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020.

Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, 2015.

Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Exposing attention glitches with flip-flop language modeling. Advances in Neural Information Processing Systems, 2023.
Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 2023.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms, 2009.

John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, 2021.

Wenlong Mou, Ashwin Pananjady, and Martin J Wainwright. Optimal oracle inequalities for solving projected fixed-point equations. Mathematics of Operations Research, 2022.

Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.

Rémi Munos. Performance bounds in $L_p$-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.

Reese Pathak, Cong Ma, and Martin Wainwright. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, 2022.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, 2020.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, 2019.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.
Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations, 2020.

Johannes Schmidt-Hieber and Petr Zamolodtchikov. Local convergence rates of the least squares estimator with applications to transfer learning, 2022.

Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey, 2021.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.

Masashi Sugiyama and Motoaki Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems, 2007.

Matus Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/, 2021. Version: 2021-10-27 v0.0-e7150f2d (alpha).

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems, 1996.

Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds, 2019.

Tengyang Xie and Nan Jiang. $Q^\star$ approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, 2020.

Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International Conference on Machine Learning, 2021.
Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement learning. In International Conference on Learning Representations, 2023.
Yufei Yi and Matey Neykov. Non-asymptotic bounds for the $L_\infty$ estimator in linear regression with uniform noise. Bernoulli, 2024.
Huizhen Yu and Dimitri P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 2010.
Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. In International Conference on Machine Learning, 2012.
Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with LEGO: a synthetic reasoning task. 2022.

A Proofs for Section 2

A.1 Analysis for ERM

Proposition 2.1 (ERM upper bound). For any $\delta \in (0,1)$, with probability at least $1-\delta$, ERM satisfies
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{ERM}}) \le O\Big( C_\infty \varepsilon_\infty^2 + \frac{C_\infty \log(|\mathcal F|/\delta)}{n} \Big). \]

Proof of Proposition 2.1. The proof of Proposition 2.1 is fairly standard, particularly in the well-specified case where $\varepsilon_\infty = 0$. Our analysis handling misspecification is adapted from the proof of Lemma 16 in Chen and Jiang (2019). For the majority of the proof we only consider $\mathcal D_{\mathrm{train}}$, and we consequently omit the subscript when indexing expectations, variances, and the risk functional. Define
\[ R(f) := \mathbb{E}[(f(x) - f^\star(x))^2] \quad \text{and} \quad \hat R(f) := \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2, \]
so that $\hat f^{(n)}_{\mathrm{ERM}} := \arg\min_{f \in \mathcal F} \hat R(f)$. We establish concentration for the "excess risk" functional $\hat R(f) - \hat R(\bar f)$. For any $f \in \mathcal F$, we establish the following facts:
\[ \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2] = \mathbb{E}[(f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2], \tag{8} \]
\[ \mathrm{Var}[(f(x)-y)^2 - (\bar f(x)-y)^2] \le 8\,\mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2] + 16\varepsilon_\infty^2. \tag{9} \]
Eq.
(8) implies that $\mathbb{E}[\hat R(f) - \hat R(\bar f)] = R(f) - R(\bar f)$, as desired. Eq. (9) will enable us to achieve a fast convergence rate. The former is derived as follows. Observe that, conditional on any $x$, we have
\begin{align*}
\mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2 \mid x] &= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x)+f^\star(x)-y)^2 \mid x] \\
&= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x))^2 - 2(\bar f(x)-f^\star(x))(f^\star(x)-y) - (f^\star(x)-y)^2 \mid x] \\
&= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x))^2 - (f^\star(x)-y)^2 \mid x] \\
&= f(x)^2 - f^\star(x)^2 - 2\,\mathbb{E}_{\mathrm{train}}[y \mid x]\,(f(x)-f^\star(x)) - (\bar f(x)-f^\star(x))^2 \\
&= (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2.
\end{align*}
Eq. (9) is derived as follows:
\begin{align*}
\mathrm{Var}[(f(x)-y)^2 - (\bar f(x)-y)^2] &\le \mathbb{E}\big[ \big((f(x)-y)^2 - (\bar f(x)-y)^2\big)^2 \big] \\
&= \mathbb{E}\big[ (f(x)-\bar f(x))^2 (f(x)+\bar f(x)-2y)^2 \big] \\
&\le 4\,\mathbb{E}\big[ (f(x)-\bar f(x))^2 \big] \\
&\le 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 + (\bar f(x)-f^\star(x))^2 \big] \\
&= 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 + 2(\bar f(x)-f^\star(x))^2 \big] \\
&\le 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 \big] + 16\varepsilon_\infty^2.
\end{align*}
Finally, we apply Eq. (8). Now, Bernstein's inequality and a union bound over $f \in \mathcal F$ give that, with probability at least $1-\delta$,
\[ \forall f \in \mathcal F: \quad R(f) - R(\bar f) - \big(\hat R(f) - \hat R(\bar f)\big) \le \sqrt{\frac{\big(16(R(f)-R(\bar f)) + 32\varepsilon_\infty^2\big)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n}. \]
Since $\hat f^{(n)}_{\mathrm{ERM}}$ minimizes $\hat R(f)$, we have $\hat R(\hat f^{(n)}_{\mathrm{ERM}}) - \hat R(\bar f) \le 0$, so we can deduce that
\[ R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \le \sqrt{\frac{\big(16(R(\hat f^{(n)}_{\mathrm{ERM}})-R(\bar f)) + 32\varepsilon_\infty^2\big)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n}. \]
Using the AM-GM inequality ($\sqrt{ab} \le a/2 + b/2$), the right-hand side can be simplified to yield
\[ R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \le \frac{1}{2}\big( R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \big) + \varepsilon_\infty^2 + \frac{28\log(|\mathcal F|/\delta)}{3n}. \]
Re-arranging and using that $R_{\mathrm{train}}(\bar f) \le \varepsilon_\infty^2$, we obtain
\[ R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{ERM}}) \le 3\varepsilon_\infty^2 + \frac{56\log(|\mathcal F|/\delta)}{3n}. \]
Finally, we bound the risk under $\mathcal D_{\mathrm{test}}$ via a standard importance-weighting argument:
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{ERM}}) = \mathbb{E}_{\mathrm{train}}\Big[ \frac{d_{\mathrm{test}}(x)}{d_{\mathrm{train}}(x)} \big(\hat f^{(n)}_{\mathrm{ERM}}(x) - f^\star(x)\big)^2 \Big] \le \sup_{x \in \mathcal X}\Big| \frac{d_{\mathrm{test}}(x)}{d_{\mathrm{train}}(x)} \Big| \cdot \Big( 3\varepsilon_\infty^2 + \frac{56\log(|\mathcal F|/\delta)}{3n} \Big). \]
Note that we crucially use that $(\hat f^{(n)}_{\mathrm{ERM}}(x) - f^\star(x))^2$ is non-negative here. This proves the proposition.

Proposition 2.2 (ERM lower bound). For all $\varepsilon_\infty \in (0,1)$ and $C_\infty \in [1,\infty)$ such that $\sqrt{C_\infty}\,\varepsilon_\infty \le 1/2$, and for all $\zeta > 0$ sufficiently small, there exist distributions $\mathcal D_{\mathrm{train}}, \mathcal D_{\mathrm{test}}$ and a function class $\mathcal F$ with $|\mathcal F| = 2$ satisfying Assumption 2.1–Assumption 2.4 (with parameters $\varepsilon_\infty, C_\infty$) such that $R_{\mathrm{test}}(\hat f^{(\infty)}_{\mathrm{ERM}}) = C_\infty \varepsilon_\infty^2 - \zeta$.

Proof of Proposition 2.2. Fix $\varepsilon_\infty \in (0,1)$ and $C_\infty \ge 1$ such that $\sqrt{C_\infty}\,\varepsilon_\infty \le 1/2$, and let $0 < \zeta < \sqrt{C_\infty}\,\varepsilon_\infty$. Let $\mathcal X = [0,1]$ and let $\mathcal D_{\mathrm{train}}$ be the distribution over $(x,y)$ where $x \sim \mathrm{Uniform}(\mathcal X)$ and $y \sim \mathrm{Ber}(1/2)$. Let $\tilde{\mathcal X} := [0, 1/C_\infty] \subset \mathcal X$ and let $\mathcal D_{\mathrm{test}}$ be the distribution over $(x,y)$ where $x \sim \mathrm{Uniform}(\tilde{\mathcal X})$ and $y \sim \mathrm{Ber}(1/2)$. These choices yield $f^\star(x) = 1/2$ for all $x \in \mathcal X$, satisfy Assumption 2.1, and ensure that $\sup_{x \in \mathcal X}|d_{\mathrm{test}}(x)/d_{\mathrm{train}}(x)| = C_\infty$. Let $\mathcal F = \{\bar f, f_{\mathrm{bad}}\}$, where $\bar f(x) = 1/2 + \varepsilon_\infty$ for all $x \in \mathcal X$ (satisfying Assumption 2.3) and $f_{\mathrm{bad}}$ is defined as
\[ f_{\mathrm{bad}}(x) = \begin{cases} 1/2 & \text{if } x \notin \tilde{\mathcal X} \\ 1/2 + \zeta & \text{if } x \in \tilde{\mathcal X}. \end{cases} \]
By definition, observe that $\hat f^{(\infty)}_{\mathrm{ERM}} = f_{\mathrm{bad}}$ as long as $\|f_{\mathrm{bad}} - f^\star\|^2_{L_2(\mathcal D_{\mathrm{train}})} < \|\bar f - f^\star\|^2_{L_2(\mathcal D_{\mathrm{train}})}$. A direct calculation shows that this inequality is satisfied for any $\zeta < \sqrt{C_\infty}\,\varepsilon_\infty$. However, $f_{\mathrm{bad}}$ has large population risk under $\mathcal D_{\mathrm{test}}$; in particular, $R_{\mathrm{test}}(f_{\mathrm{bad}}) = \mathbb{E}_{\mathrm{test}}[(f_{\mathrm{bad}}(x) - f^\star(x))^2] = \zeta^2$, which we can make arbitrarily close to $C_\infty \varepsilon_\infty^2$.
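The construction above is simple enough to check numerically. The following sketch (the parameter values are illustrative choices, not taken from the analysis) verifies that the population ERM solution is $f_{\mathrm{bad}}$ whenever $\zeta < \sqrt{C_\infty}\,\varepsilon_\infty$, and that its test risk $\zeta^2$ approaches $C_\infty\varepsilon_\infty^2$:

```python
import math

# Illustrative parameters: density-ratio bound and misspecification level.
C_inf, eps_inf = 16.0, 0.1
zeta = math.sqrt(C_inf) * eps_inf - 1e-6  # any zeta < sqrt(C_inf) * eps_inf works

# Population training risks under D_train = Uniform([0, 1]), with f_star = 1/2:
# f_bad deviates by zeta only on the test support [0, 1/C_inf] (mass 1/C_inf),
# while f_bar = 1/2 + eps_inf deviates by eps_inf everywhere.
risk_train_bad = zeta ** 2 / C_inf
risk_train_bar = eps_inf ** 2

# Infinite-data ERM prefers f_bad, since zeta^2 / C_inf < eps_inf^2:
assert risk_train_bad < risk_train_bar

# But under D_test = Uniform([0, 1/C_inf]), f_bad is off by zeta everywhere:
risk_test_bad = zeta ** 2
print(round(risk_test_bad, 6), C_inf * eps_inf ** 2)  # ~0.16 vs ~0.16
```

The amplification factor here is exactly the density ratio: the region where ERM tolerates a large error has vanishing mass under training but full mass under testing.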
A.2 Discussion of other algorithms

Star algorithm. Audibert's star algorithm (Audibert, 2007; Liang et al., 2015) is a two-stage regression procedure that achieves the fast convergence rate for non-convex classes in misspecified or agnostic regression. Given that the construction used to prove Proposition 2.2 has a finite (and hence non-convex) function class, one might ask whether the star algorithm can avoid misspecification amplification. We briefly sketch here why this is not the case. In the context of the construction, where $\mathcal F = \{f_{\mathrm{bad}}, \bar f\}$, the asymptotic version of the star algorithm is to compute
\[ \hat f_{\mathrm{star}} := \arg\min_{f_\alpha : \alpha \in [0,1]} \mathbb{E}_{\mathrm{train}}\big[ (f_\alpha(x) - f^\star(x))^2 \big], \quad \text{where } f_\alpha(x) = (1-\alpha)f_{\mathrm{bad}}(x) + \alpha\bar f(x). \]
We claim that when $\zeta = \sqrt{C_\infty}\,\varepsilon_\infty$, the optimal choice for $\alpha$ is exactly $1/2$. The prediction error under $\mathcal D_{\mathrm{test}}$ for this choice is, unfortunately, exactly $\frac{1}{4}(\sqrt{C_\infty}+1)^2\varepsilon_\infty^2$, which still manifests misspecification amplification. Note that, due to the simplicity of our construction, the same argument applies to other improper learning schemes based on convexification (c.f., Lecué and Rigollet, 2014). To see that the minimum is achieved at $\alpha = 1/2$, we write the optimization problem over $\alpha$ as
\[ \arg\min_{\alpha \in [0,1]} \frac{1}{C_\infty}\Big( (1-\alpha)\sqrt{C_\infty}\,\varepsilon_\infty + \alpha\varepsilon_\infty \Big)^2 + \Big( 1 - \frac{1}{C_\infty} \Big)(\alpha\varepsilon_\infty)^2 = \arg\min_{\alpha \in [0,1]} \alpha^2 + (1-\alpha)^2 + \frac{2\alpha(1-\alpha)}{\sqrt{C_\infty}}. \]
The derivative, w.r.t. $\alpha$, of the latter objective is
\[ \frac{d}{d\alpha}\Big( \alpha^2 + (1-\alpha)^2 + \frac{2\alpha(1-\alpha)}{\sqrt{C_\infty}} \Big) = 2\alpha - 2(1-\alpha) + \frac{2 - 4\alpha}{\sqrt{C_\infty}} = \Big( 2 - \frac{2}{\sqrt{C_\infty}} \Big)(2\alpha - 1). \]
Since $C_\infty > 1$, the second derivative is non-negative, so the optimization problem is convex. Moreover, the derivative is zero at $\alpha = 1/2$, showing that this is a minimizer of the optimization problem.

$L_\infty$ regression.
Given that we assume $L_\infty$-misspecification, and in light of the construction for Proposition 2.2, it is tempting to optimize the maximal absolute deviation instead of the square loss:
\[ \hat f^{(n)}_\infty \leftarrow \arg\min_{f \in \mathcal F} \max_i |f(x_i) - y_i|. \]
This procedure is known as $L_\infty$ regression or the Chebyshev estimator and has been studied in the statistics community (Knight, 2017; Yi and Neykov, 2024). These analyses primarily consider the well-specified setting with noise that is uniformly distributed, i.e., $y_i = f^\star(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathrm{Unif}([-a,a])$ for some $a \ge 0$. We believe such analyses can extend to the $L_\infty$-misspecified setting to show that the procedure avoids misspecification amplification. However, strong assumptions on the noise are crucial, as $L_\infty$ regression can be inconsistent under more general conditions. We illustrate with a simple example. Let $\mathcal X = \{x\}$ be a singleton, $y \sim \mathrm{Ber}(1/4)$, and let $\mathcal F = \{f^\star : x \mapsto 1/4,\ f : x \mapsto 1/2\}$ be a class with two functions. For all $n$ sufficiently large, the dataset will contain the sample $(x,1)$, at which point $f^\star$ will have $L_\infty$ error $3/4$ while $f$ will have error $1/2$. Thus the method will be inconsistent.

A.3 Analysis for DBR

We begin with the proofs of Lemma 3.1 and Lemma 3.2, thus completing steps one and two of the proof. Then we turn to proving the corollaries.

Lemma 3.1 (Non-negativity). With $\tau \ge 2\varepsilon_\infty$ and for any $\bar f \in \mathcal F$ such that $\|\bar f - f^\star\|_\infty \le \varepsilon_\infty$, we have
\[ \mathcal L(f; \bar f) \ge (\tau^2 - 2\tau\varepsilon_\infty)\Pr[W^\tau_{f,\bar f}(x)] \ge 0. \]

Proof of Lemma 3.1. Following the calculation used to derive Eq. (8), we have that, conditional on any $x$,
\[ \mathbb{E}_{\mathrm{train}}[(f(x)-y)^2 - (\bar f(x)-y)^2 \mid x] = (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2. \]
Under the event $x \in W^\tau_{f,\bar f}$ with $\tau \ge 2\varepsilon_\infty$, we claim that this must be non-negative.
In particular,
\[ |f(x) - f^\star(x)| \ge |f(x) - \bar f(x)| - |\bar f(x) - f^\star(x)| \ge \tau - \varepsilon_\infty \ge \varepsilon_\infty \ge 0. \]
Therefore,
\[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 \ge (\tau - \varepsilon_\infty)^2 - \varepsilon_\infty^2 \ge \tau^2 - 2\tau\varepsilon_\infty. \]
The right-hand side is non-negative whenever $\tau \ge 2\varepsilon_\infty$.

Lemma 3.2 (Concentration). Fix $\delta \in (0,1)$ and $\tau \ge 3\varepsilon_\infty$, and define $\varepsilon_{\mathrm{stat}} := \frac{80\log(|\mathcal F|/\delta)}{3n}$. Under Assumption 2.3, for any $\bar f \in \mathcal F$ such that $\|\bar f - f^\star\|_\infty \le \varepsilon_\infty$, with probability at least $1-\delta$ we have
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) \le 2\hat{\mathcal L}(f;\bar f) + \varepsilon_{\mathrm{stat}}, \quad \text{and equivalently,} \quad \hat{\mathcal L}(\bar f; f) \le \frac{1}{2}\big( \mathcal L(\bar f; f) + \varepsilon_{\mathrm{stat}} \big). \]

Proof of Lemma 3.2. The concentration inequality is similar to the one used in the proof of Proposition 2.1. We apply Bernstein's inequality and a union bound to the empirical disagreement-based loss $\hat{\mathcal L}(f;\bar f)$ for each $f \in \mathcal F$. To do so, we must calculate the mean, variance, and range of $\hat{\mathcal L}(f;\bar f)$. Note that, by the same calculation as in the proof of Proposition 2.1, we have $\mathbb{E}[\hat{\mathcal L}(f;\bar f)] = \mathcal L(f;\bar f)$, and the range of each random variable in the empirical average is 1. The variance calculation, however, is slightly different:
\begin{align*}
\mathrm{Var}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\} \big] &\le \mathbb{E}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\}^2 \big] \\
&\le \mathbb{E}\big[ W^\tau_{f,\bar f}(x)(f(x)-\bar f(x))^2(f(x)+\bar f(x)-2y)^2 \big] \\
&\le 4\,\mathbb{E}\big[ W^\tau_{f,\bar f}(x)(f(x)-\bar f(x))^2 \big].
\end{align*}
Next, we consider a fixed $x$ and define $a := f(x) - f^\star(x)$ and $b := f^\star(x) - \bar f(x)$, so that we can write $(f(x)-\bar f(x))^2 = (f(x)-f^\star(x)+f^\star(x)-\bar f(x))^2 = (a+b)^2$. Now, when $\tau \ge 3\varepsilon_\infty$ we have
\[ W^\tau_{f,\bar f}(x) = 1 \;\Rightarrow\; |a| = |f(x)-f^\star(x)| \ge |f(x)-\bar f(x)| - \varepsilon_\infty \ge 2\varepsilon_\infty. \]
Along with the fact that $|b| = |\bar f(x) - f^\star(x)| \le \varepsilon_\infty$, this implies that $|b| \le |a|/2$, or equivalently that $b^2 \le a^2/4$.
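This regime admits a quick numerical sanity check (an illustrative sketch, not part of the proof): whenever $b^2 \le a^2/4$, the cross term $(a+b)^2$ is dominated by $3(a^2 - b^2)$, which is the elementary inequality driving the variance bound.

```python
import random

random.seed(0)
# Whenever |b| <= |a| / 2 -- the regime enforced by the disagreement
# indicator with tau >= 3 * eps_inf -- we have (a + b)^2 <= 3 * (a^2 - b^2):
# (a + b)^2 <= (3|a|/2)^2 = 9a^2/4, and 3(a^2 - b^2) >= 3a^2 - 3a^2/4 = 9a^2/4.
for _ in range(100_000):
    a = random.uniform(-1.0, 1.0)
    b = random.uniform(-abs(a) / 2, abs(a) / 2)
    assert (a + b) ** 2 <= 3 * (a ** 2 - b ** 2) + 1e-12
print("inequality verified on random samples")
```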
Using this, we can deduce that
\[ (a+b)^2 \le \frac{9a^2}{4} = 3a^2 - \frac{3a^2}{4} \le 3a^2 - 3b^2 = 3(a^2 - b^2). \]
Re-introducing the definitions of $a$ and $b$, we have
\[ \mathrm{Var}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\} \big] \le 12\,\mathcal L(f;\bar f). \]
Now, applying Bernstein's inequality and a union bound over all $f \in \mathcal F$ yields that, with probability $1-\delta$,
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) - \hat{\mathcal L}(f;\bar f) \le \sqrt{\frac{24\,\mathcal L(f;\bar f)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n} \le \frac{1}{2}\mathcal L(f;\bar f) + \frac{40\log(|\mathcal F|/\delta)}{3n}. \]
Re-arranging proves the first statement, and the second statement follows from the symmetries $\hat{\mathcal L}(f;g) = -\hat{\mathcal L}(g;f)$ and $\mathcal L(f;g) = -\mathcal L(g;f)$.

Corollary 2.1 (Covariate shift for DBR). Fix $\delta \in (0,1)$. Under Assumption 2.1–Assumption 2.4, with probability at least $1-\delta$, $\hat f^{(n)}_{\mathrm{DBR}}$ with $\tau = 3\varepsilon_\infty$ satisfies
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 17\varepsilon_\infty^2 + O\Big( \frac{C_\infty\log(|\mathcal F|/\delta)}{n} \Big). \tag{5} \]

Proof of Corollary 2.1. Beginning with the risk under $\mathcal D_{\mathrm{test}}$ and taking $\tau = 3\varepsilon_\infty$, we can write
\begin{align*}
R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) &= \mathbb{E}_{\mathrm{test}}\big[ (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&= \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| < 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le 16\varepsilon_\infty^2 + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le 17\varepsilon_\infty^2 + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot \{(\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 - \varepsilon_\infty^2\} \big].
\end{align*}
Note that, due to the indicator, the quantity inside the expectation is non-negative. Therefore, via exactly the same importance-weighting argument as used in the proof of Proposition 2.1, the latter term is at most $C_\infty$ times the quantity bounded in Eq. (3).

Corollary 2.2 (Well-specified case). Fix $\delta \in (0,1)$.
Under Assumption 2.1–Assumption 2.4 (with $\varepsilon_\infty = 0$), with probability at least $1-\delta$, $\hat f^{(n)}_{\mathrm{DBR}}$ with $\tau \le O\big( \sqrt{\log(|\mathcal F|/\delta)/n} \big)$ satisfies
\[ R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{DBR}}) \le O\Big( \frac{\log(|\mathcal F|/\delta)}{n} \Big) \quad \text{and} \quad R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le O\Big( \frac{C_\infty\log(|\mathcal F|/\delta)}{n} \Big). \tag{6} \]

Proof of Corollary 2.2. Let $\Delta$ denote the right-hand side of Eq. (3). Note that in the well-specified case where $\varepsilon_\infty = 0$, Theorem 2.1 ensures that
\[ \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \le \Delta. \]
Then, if we take $\tau \le \sqrt{\Delta}$, we have
\begin{align*}
R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{DBR}}) &= \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| < \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] + \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le \tau^2 + \Delta \le 2\Delta.
\end{align*}
This proves the corollary.

A.4 Extensions

In this section, we provide two results mentioned in Section 2. First, we improve the approximation factor in Corollary 2.1 from 17 to 10, albeit at the cost of a worse statistical term. Second, we show how to choose $\tau$ in a data-driven fashion to adapt to an unknown misspecification level $\varepsilon_\infty$.

Proposition A.1 (Improved approximation factor). Under Assumption 2.1–Assumption 2.4, with $\tau = 2\varepsilon_\infty$ and for $\delta \in (0,1)$, we have that, with probability at least $1-\delta$:
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 10\varepsilon_\infty^2 + C_\infty \cdot O\Big( \sqrt{\frac{\log(|\mathcal F|/\delta)}{n}} \Big). \tag{10} \]

Proof sketch. The proof is essentially identical to that of Theorem 2.1, except that we replace the concentration statement of Lemma 3.2 with a simpler one that relies on Hoeffding's inequality. The new concentration statement is that for any $\tau \ge 0$ and $\delta \in (0,1)$, with probability $1-\delta$ we have
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) \le \hat{\mathcal L}(f;\bar f) + \varepsilon_{\mathrm{slow}}, \quad \text{where } \varepsilon_{\mathrm{slow}} := c\sqrt{\frac{\log(|\mathcal F|/\delta)}{n}} \]
for some universal constant $c > 0$. This follows by a standard application of Hoeffding's inequality and a union bound but, importantly, does not impose the restriction that $\tau \ge 3\varepsilon_\infty$.
Now the analysis used to prove Theorem 2.1 yields that for any $\tau \ge 2\varepsilon_\infty$:
\[ \mathbb{E}_{\mathrm{train}}\Big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau + \varepsilon_\infty\} \cdot \big\{ (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 - (\bar f(x) - f^\star(x))^2 \big\} \Big] \le c\,\varepsilon_{\mathrm{slow}}. \]
Taking $\tau = 2\varepsilon_\infty$ and following the derivation used to prove Corollary 2.1, we get
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 10\varepsilon_\infty^2 + c\,\varepsilon_{\mathrm{slow}}. \]
(Note that this requires the non-negativity property provided by Lemma 3.1, which we still have.)

The next result considers adapting to an unknown misspecification level.

Proposition A.2 (Adapting to $\varepsilon_\infty$). Let $\delta \in (0,1)$ and define $S := \{2^i : \tau_{\min} \le 2^i \le \tau_{\max}\}$, where $\tau_{\min} := \sqrt{\frac{160\log(|\mathcal F||S|/\delta)}{3n}}$ and $\tau_{\max} := 1$. Let $\tau^\star := \min\{\tau \in S : \tau \ge 3\varepsilon_\infty\}$. Then there is an algorithm that, without knowledge of $\varepsilon_\infty$ and with probability at least $1-\delta$, computes $\hat f$ satisfying
\[ \mathbb{E}_{\mathrm{train}}\Big[ \mathbb{1}\{|\hat f(x) - f^\star(x)| \ge \tau^\star + \varepsilon_\infty\} \cdot \big\{ (\hat f(x) - f^\star(x))^2 - \varepsilon_\infty^2 \big\} \Big] \le \frac{160\log(2|\mathcal F||S|/\delta)}{3n}. \]
Note that when $\varepsilon_\infty \ll \tau_{\min}$, we are essentially in the realizable regime, so via the proof of Corollary 2.2 the above guarantee with $\tau^\star := \tau_{\min}$ suffices. On the other hand, if $\varepsilon_\infty \ge 1/3$ then $\tau^\star$ is undefined, but due to Assumption 2.4 the guarantee in Theorem 2.1 is vacuous. Thus, the above theorem recovers essentially the same result as Theorem 2.1, but without knowledge of $\varepsilon_\infty$.

Proof sketch. The algorithm is as follows. We run a slight variation of disagreement-based regression for each $\tau \in S$: instead of computing the minimizer of the objective in Eq. (2), we form the version space of near-minimizers. Specifically, define
\[ \forall \tau \in S: \quad \mathcal F_\tau := \Big\{ f \in \mathcal F : \max_{g \in \mathcal F} \frac{1}{n}\sum_{i=1}^n W^\tau_{f,g}(x_i)\big( (f(x_i)-y_i)^2 - (g(x_i)-y_i)^2 \big) \le \varepsilon_{\mathrm{stat}}/2 \Big\}, \]
where we define $\varepsilon_{\mathrm{stat}} = \frac{80\log(|\mathcal F||S|/\delta)}{3n}$. Note this is slightly inflated from the definition in the statement of Lemma 3.2, which accounts for a union bound over all $|S|$ runs of the algorithm.
Next, we define
\[ \hat\tau := \arg\min\Big\{ \tau \in S : \bigcap_{\tau' \in S : \tau' \ge \tau} \mathcal F_{\tau'} \neq \emptyset \Big\}, \]
and return any function in this intersection, i.e., let $\hat f$ be any function in $\bigcap_{\tau' \in S : \tau' \ge \hat\tau} \mathcal F_{\tau'}$. For the analysis, via the analysis of Theorem 2.1 and a union bound over the $|S|$ choices of $\tau$, we have
\[ \forall \tau \ge \tau^\star: \quad \bar f \in \mathcal F_\tau, \qquad \text{and} \qquad f \in \mathcal F_\tau \Rightarrow \mathcal L_\tau(f;\bar f) \le \varepsilon_{\mathrm{stat}}, \]
where $\mathcal L_\tau(f;g)$ is the population objective with parameter $\tau$. The first statement directly implies that $\hat\tau \le \tau^\star$. This in turn implies that $\hat f \in \mathcal F_{\tau^\star}$, and so $\hat f$ achieves the same statistical guarantee as if we ran DBR with parameter $\tau^\star$ (up to the additional union bound).

B Proofs for Section 4

B.1 Offline RL

Theorem 4.1 (DBR for offline RL). Fix $\delta \in (0,1)$, and assume that $\mathcal F$ is $L_\infty$-misspecified and $\mu$ satisfies concentrability (as defined above). Consider the algorithm defined in Eq. (7) with $\tau = 3\varepsilon_\infty$. Then, with probability at least $1-\delta$, we have
\[ J(\pi^\star) - J(\hat\pi) \le O\Big( \frac{\varepsilon_\infty}{1-\gamma} + \frac{1}{1-\gamma}\sqrt{\frac{C_{\mathrm{conc}}\log(|\mathcal F|/\delta)}{n}} \Big). \]

Proof of Theorem 4.1. For each "target" function $f_{\mathrm{trg}} \in \mathcal F$ such that $f_{\mathrm{trg}} \neq \bar f$, let us define $\mathrm{apx}[f_{\mathrm{trg}}] \in \mathcal F$ to be any approximation to the Bellman backup $\mathcal T f_{\mathrm{trg}}$ such that $\|\mathrm{apx}[f_{\mathrm{trg}}] - \mathcal T f_{\mathrm{trg}}\|_\infty \le \varepsilon_\infty$. Define $\mathrm{apx}[\bar f] = \bar f$, which also satisfies $\|\mathrm{apx}[\bar f] - \mathcal T\bar f\|_\infty \le \varepsilon_\infty$ by assumption. Let us define the empirical and population losses for the disagreement-based regression problem with regression targets derived from $f_{\mathrm{trg}}$:
\begin{align*}
\text{(Empirical)} \quad & \hat{\mathcal L}_{f_{\mathrm{trg}}}(f;g) := \frac{1}{n}\sum_{i=1}^n W^\tau_{f,g}(s_i,a_i)\big( (f(s_i,a_i) - y_{f_{\mathrm{trg}},i})^2 - (g(s_i,a_i) - y_{f_{\mathrm{trg}},i})^2 \big), \\
\text{(Population)} \quad & \mathcal L_{f_{\mathrm{trg}}}(f;g) := \mathbb{E}_\mu\big[ W^\tau_{f,g}(s,a)\big( (f(s,a) - y_{f_{\mathrm{trg}}})^2 - (g(s,a) - y_{f_{\mathrm{trg}}})^2 \big) \big].
\end{align*}
Here, recall that $y_{f_{\mathrm{trg}}} := r + \max_{a'} f_{\mathrm{trg}}(s',a')$ is derived from the sample $(s,a,r,s')$. Also note that we use $\mathbb{E}_\mu[\cdot]$ to denote expectation with respect to the data-collection policy.
First, we apply Lemma 3.1 and Lemma 3.2 to each of the $|\mathcal F|$ regression problems. By approximate completeness and the definition of $\mathrm{apx}[f_{\mathrm{trg}}]$, this yields
\[ \forall f_{\mathrm{trg}}, f \in \mathcal F: \quad 0 \le \mathcal L_{f_{\mathrm{trg}}}(f; \mathrm{apx}[f_{\mathrm{trg}}]) \le 2\hat{\mathcal L}_{f_{\mathrm{trg}}}(f; \mathrm{apx}[f_{\mathrm{trg}}]) + \varepsilon_{\mathrm{stat}}, \tag{11} \]
where $\varepsilon_{\mathrm{stat}} := \frac{160\log(|\mathcal F|/\delta)}{3n}$. The above uniform bound holds with probability $1-\delta$. Note that this $\varepsilon_{\mathrm{stat}}$ is twice as large as the one in the proof of Theorem 2.1, which accounts for the additional union bound over all $|\mathcal F|$ regression problems. The main statistical guarantee for $\hat f$ is derived as follows:
\[ \mathcal L_{\hat f}(\hat f; \mathrm{apx}[\hat f]) \overset{(i)}{\le} 2\hat{\mathcal L}_{\hat f}(\hat f; \mathrm{apx}[\hat f]) + \varepsilon_{\mathrm{stat}} \overset{(ii)}{\le} 2\max_{g \in \mathcal F}\hat{\mathcal L}_{\hat f}(\hat f; g) + \varepsilon_{\mathrm{stat}} \overset{(iii)}{\le} 2\max_{g \in \mathcal F}\hat{\mathcal L}_{\bar f}(\bar f; g) + \varepsilon_{\mathrm{stat}} \overset{(iv)}{\le} 2\varepsilon_{\mathrm{stat}}. \]
Here, $(i)$ is the second inequality in Eq. (11), $(ii)$ follows since $\mathrm{apx}[\hat f] \in \mathcal F$, $(iii)$ uses the optimality property of $\hat f$, and $(iv)$ uses Eq. (11) again, noting the symmetry of $\hat{\mathcal L}_{\bar f}(\cdot;\cdot)$ and using $\mathrm{apx}[\bar f] = \bar f$. Since the Bayes regression function defined by the targets $y_{\hat f}$ is $\mathcal T\hat f$, this yields
\[ \mathbb{E}_\mu\Big[ \mathbb{1}\big\{ |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty \big\} \cdot \big\{ (\hat f(s,a) - [\mathcal T\hat f](s,a))^2 - (\mathrm{apx}[\hat f](s,a) - [\mathcal T\hat f](s,a))^2 \big\} \Big] \le 2\varepsilon_{\mathrm{stat}}. \tag{12} \]
We translate this to the squared Bellman error under any other distribution $\nu \in \Delta(\mathcal X \times \mathcal A)$ via a slightly stronger argument than the one used to prove Corollary 2.1.
\begin{align*}
\mathbb{E}_\nu\big[ |\hat f(s,a) - [\mathcal T\hat f](s,a)| \big] &\le \varepsilon_\infty + \mathbb{E}_\nu\big[ |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \big] \\
&\le 4\varepsilon_\infty + \mathbb{E}_\nu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \big] \\
&\le 4\varepsilon_\infty + \sqrt{\mathbb{E}_\mu\Big[ \Big(\frac{\nu(s,a)}{\mu(s,a)}\Big)^2 \Big]} \cdot \sqrt{\mathbb{E}_\mu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot (\hat f(s,a) - \mathrm{apx}[\hat f](s,a))^2 \big]} \\
&= 4\varepsilon_\infty + \|\nu/\mu\|_{L_2(\mu)} \cdot \sqrt{\mathbb{E}_\mu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot (\hat f(s,a) - \mathrm{apx}[\hat f](s,a))^2 \big]} \\
&\le 4\varepsilon_\infty + \|\nu/\mu\|_{L_2(\mu)} \cdot \sqrt{6\varepsilon_{\mathrm{stat}}}.
\end{align*}
The last inequality is based on the "self-bounding" argument used to control the variance in the proof of Lemma 3.2, which showed that under the event $|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty$:
\[ \big( \hat f(s,a) - \mathrm{apx}[\hat f](s,a) \big)^2 \le 3\Big( \big( \hat f(s,a) - [\mathcal T\hat f](s,a) \big)^2 - \big( \mathrm{apx}[\hat f](s,a) - [\mathcal T\hat f](s,a) \big)^2 \Big). \]
Note that $\|\nu/\mu\|^2_{L_2(\mu)} \le \|\nu/\mu\|_\infty$ since $\mathbb{E}_\mu[\nu(s,a)/\mu(s,a)] = \mathbb{E}_\nu[1] = 1$. Finally, we appeal to the telescoping performance difference lemma (c.f., Xie and Jiang, 2020, Theorem 2), which states that for an action-value function $f$,
\[ J(\pi^\star) - J(\pi_f) \le \frac{\mathbb{E}_{d^{\pi^\star}}\big[ [\mathcal T f](s,a) - f(s,a) \big]}{1-\gamma} + \frac{\mathbb{E}_{d^{\pi_f}}\big[ f(s,a) - [\mathcal T f](s,a) \big]}{1-\gamma}, \]
where $d^\pi := (1-\gamma)\sum_{h=0}^\infty \gamma^h d^\pi_h$. Both terms are controlled by the distribution-shift argument above and the concentrability coefficient, yielding the theorem.

B.2 Online RL

Theorem 4.2 (DBR for online RL). Fix $\delta \in (0,1)$, and assume that $\mathcal F$ is $L_\infty$-misspecified and $\mu$ satisfies coverability (as defined above). Consider GOLF.DBR with $\tau = 3\varepsilon_\infty$ and $\beta = c\log(TH|\mathcal F|/\delta)$. Then, with probability at least $1-\delta$, we have
\[ \mathrm{Reg} \le O\Big( \varepsilon_\infty H T + H\sqrt{C_{\mathrm{cov}}\,T\log(TH|\mathcal F|/\delta)}\,\log(T) \Big). \]

Proof of Theorem 4.2. The proof makes essentially two modifications to the proof of Theorem 1 of Xie et al. (2023).
The first is a concentration argument, which is essentially a martingale version of Theorem 2.1. The second is the distribution-shift argument, which is very similar to the one used to prove Theorem 4.1. To keep the presentation concise, we focus on these arguments and explain how they fit into the analysis of Xie et al. (2023), but we do not provide a self-contained proof.

Notation. We adopt the following notation. Recall that $\mathcal F^{(t-1)}$ is the version space used in episode $t$ and that $f^{(t)} \in \mathcal F^{(t-1)}$ induces the policy $\pi^{(t)}$ deployed in the episode. As before, let $\mathrm{apx}[f_{h+1}] \in \mathcal F_h$ denote the $L_\infty$-approximation to $\mathcal T_h f_{h+1}$. For each episode $t$, let
\[ \delta^{(t)}_h(\cdot) := f^{(t)}_h(\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot) \]
and
\[ \mathrm{err}^{(t)}_h(\cdot) := \mathbb{1}\big\{ |f^{(t)}_h(\cdot) - \mathrm{apx}[f^{(t)}_{h+1}](\cdot)| \ge 3\varepsilon_\infty \big\} \cdot \big\{ (f^{(t)}_h(\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot))^2 - (\mathrm{apx}[f^{(t)}_{h+1}](\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot))^2 \big\}. \]
Let $d^{(t)}_h = d^{\pi^{(t)}}_h$, define $\tilde d^{(t)}_h(x,a) = \sum_{i=1}^{t-1} d^{(i)}_h(x,a)$, and let $\mu^\star_h$ be the distribution that achieves the value $C_{\mathrm{cov}}$ for layer $h$.

Concentration. By a martingale version of Theorem 2.1, we can show that with probability at least $1-\delta$, for all $t \in [T]$:
\[ (i)\ \bar f \in \mathcal F^{(t)}, \quad \text{and} \quad (ii)\ \forall h \in [H]: \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathrm{err}^{(t)}_h(s,a) \le O(\beta), \tag{13} \]
where $\beta = c\log(TH|\mathcal F|/\delta)$. We do not provide a complete proof of this statement, noting that it is essentially the same guarantee as in Eq. (12), except that (a) it is a non-stationary version with a union bound over each time step $h$ and episode $t$, and (b) it uses martingale concentration (i.e., Freedman's inequality instead of Bernstein's inequality). It is also worth comparing with the concentration guarantee of Xie et al. (2023) under exact realizability/completeness, which is that $Q^\star \in \mathcal F^{(t)}$ and that $\sum_{s,a} \tilde d^{(t)}_h(s,a)\big( \delta^{(t)}_h(s,a) \big)^2 \le O(\beta)$.
Distribution shift. To bound the regret, note that
\[ \mathrm{Reg} \le \sum_{t=1}^T \sum_{h=1}^H \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a) \big]. \]
For distribution shift, we must translate the above on-policy Bellman errors to the "DBR" errors on the historical data $\tilde d^{(t)}_h$, which are controlled by Eq. (13). Following Xie et al. (2023), we consider burn-in and stable phases. Let
\[ \gamma_h(s,a) := \min\big\{ t : \tilde d^{(t)}_h(s,a) \ge C_{\mathrm{cov}} \cdot \mu^\star_h(s,a) \big\}, \]
and decompose
\[ \sum_{t=1}^T \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a) \big] = \sum_{t=1}^T \Big( \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t < \gamma_h(s,a)\} \big] + \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] \Big). \]
The first term is the regret incurred during the burn-in phase, which is bounded by $2C_{\mathrm{cov}}$ following exactly the argument of Xie et al. (2023); this contributes a total regret of $2HC_{\mathrm{cov}}$. The second term is the regret incurred during the stable phase, for which we must perform a distribution-shift argument. To condense the notation, define
\[ \bar\delta^{(t)}_h(\cdot) := \mathrm{apx}[f^{(t)}_{h+1}](\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot), \quad \text{and} \quad \tilde\delta^{(t)}_h(\cdot) := f^{(t)}_h(\cdot) - \mathrm{apx}[f^{(t)}_{h+1}](\cdot). \]
Note that, by assumption, $|\bar\delta^{(t)}_h(s,a)| \le \varepsilon_\infty$.
Then,
\begin{align*}
\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] &= \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \big( \tilde\delta^{(t)}_h(s,a) + \bar\delta^{(t)}_h(s,a) \big)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] \\
&\le \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \tilde\delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] + T\varepsilon_\infty \\
&\le \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \mathbb{1}\{|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty\}\,\tilde\delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] + 4T\varepsilon_\infty \\
&\le \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot \sqrt{\sum_{t=1}^T \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathbb{1}\{|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty\}\big( \tilde\delta^{(t)}_h(s,a) \big)^2} + 4T\varepsilon_\infty \\
&\le \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot \sqrt{3\sum_{t=1}^T \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathrm{err}^{(t)}_h(s,a)} + 4T\varepsilon_\infty.
\end{align*}
The penultimate inequality is Cauchy–Schwarz, and the final inequality follows from the self-bounding property that we used in the proofs of Lemma 3.2 and Theorem 4.1. In particular, under the event $|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty$, we can bound
\[ \big( \tilde\delta^{(t)}_h(s,a) \big)^2 \le 3\Big( \big( \delta^{(t)}_h(s,a) \big)^2 - \big( \bar\delta^{(t)}_h(s,a) \big)^2 \Big). \]
Thus we have converted from the on-policy Bellman error to the historical "DBR" errors, i.e., we can further bound by
\[ \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot O\big( \sqrt{\beta T} \big) + 4T\varepsilon_\infty. \]
Meanwhile, the density-ratio term is bounded by $O\big( \sqrt{C_{\mathrm{cov}}\log(T)} \big)$ via the analysis of Xie et al. (2023). Repeating this analysis for each time step $h$ proves the theorem.
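To close, the Proposition 2.2 construction also makes the contrast between ERM and disagreement-based regression easy to see at the population level. The following sketch (parameter values are illustrative choices; the `dbr_loss` function is a population-level rendering of the empirical objective in Eq. (2)) shows ERM selecting $f_{\mathrm{bad}}$ while the DBR objective with $\tau = 3\varepsilon_\infty$ selects $\bar f$:

```python
import math

# Population-level comparison of ERM and DBR on the Proposition 2.2
# construction; the parameter values here are illustrative choices.
C_inf, eps_inf = 25.0, 0.1
zeta = math.sqrt(C_inf) * eps_inf - 1e-3  # ERM prefers f_bad at this zeta
tau = 3 * eps_inf

f_star = lambda x: 0.5
f_bar = lambda x: 0.5 + eps_inf
f_bad = lambda x: 0.5 + zeta if x <= 1 / C_inf else 0.5

# D_train = Uniform([0, 1]); both functions are piecewise constant on two
# regions, so population expectations reduce to weighted point evaluations.
regions = [(1 / C_inf, 0.0), (1 - 1 / C_inf, 1.0)]  # (mass, representative x)

def sq_risk(f):
    return sum(m * (f(x) - f_star(x)) ** 2 for m, x in regions)

def dbr_loss(f, g):
    # Population DBR objective L(f; g): the squared-loss gap relative to g,
    # restricted to the region where f and g disagree by at least tau.
    return sum(
        m * ((f(x) - f_star(x)) ** 2 - (g(x) - f_star(x)) ** 2)
        for m, x in regions
        if abs(f(x) - g(x)) >= tau
    )

F = [f_bar, f_bad]
erm = min(F, key=sq_risk)
dbr = min(F, key=lambda f: max(dbr_loss(f, g) for g in F))
print(erm is f_bad, dbr is f_bar)  # True True
```

Because $f_{\mathrm{bad}}$ and $\bar f$ disagree by more than $\tau$ exactly on the test support, the disagreement indicator exposes $f_{\mathrm{bad}}$'s large local error there, while the near-minimizer $\bar f$ incurs no positive disagreement-based loss against any competitor.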
