Mitigating Covariate Shift in Misspecified Regression with Applications to Reinforcement Learning


Authors: Philip Amortila, Tongyi Cao, Akshay Krishnamurthy

Philip Amortila (University of Illinois, Urbana-Champaign, philipa4@illinois.edu), Tongyi Cao (University of Massachusetts, Amherst, tcao@cs.umass.edu), Akshay Krishnamurthy (Microsoft Research, NYC, akshaykr@microsoft.com)

Abstract

A pervasive phenomenon in machine learning applications is distribution shift, where training and deployment conditions for a machine learning model differ. As distribution shift typically results in a degradation in performance, much attention has been devoted to algorithmic interventions that mitigate these detrimental effects. In this paper, we study the effect of distribution shift in the presence of model misspecification, specifically focusing on L_∞-misspecified regression and adversarial covariate shift, where the regression target remains fixed while the covariate distribution changes arbitrarily. We show that empirical risk minimization, or standard least squares regression, can result in undesirable misspecification amplification, where the error due to misspecification is amplified by the density ratio between the training and testing distributions. As our main result, we develop a new algorithm, inspired by robust optimization techniques, that avoids this undesirable behavior, resulting in no misspecification amplification while still obtaining optimal statistical rates. As applications, we use this regression procedure to obtain new guarantees in offline and online reinforcement learning with misspecification and establish new separations between previously studied structural conditions and notions of coverage.

1 Introduction

A majority of machine learning methods are developed and analyzed under the idealized setting where the training conditions accurately reflect those at deployment.
Yet, almost all practical applications exhibit distribution shift, where these conditions differ significantly. Distribution shift can occur for a plethora of reasons, ranging from quirks in data collection (Recht et al., 2019), to temporal drift (Gama et al., 2014; Besbes et al., 2015), to users adapting to an ML model (Perdomo et al., 2020), and it typically results in a degradation in model performance. Due to the prevalence of this phenomenon and the diversity of applications where it manifests, there is a vast and ever-growing body of literature studying algorithmic interventions to mitigate distribution shift (Quinonero-Candela et al., 2008; Sugiyama and Kawanabe, 2012).

Covariate shift is perhaps the most basic form of distribution shift. Covariate shift is pertinent to supervised learning, where the goal is to predict a label Y from covariates X, and posits a change in the distribution over covariates while keeping the target predictor fixed. This setup, in particular that the target does not change, is natural in applications including neural algorithmic reasoning (Anil et al., 2022; Zhang et al., 2022; Liu et al., 2023), reinforcement learning (Ross et al., 2011; Levine et al., 2020), and computer vision (Koh et al., 2021; Recht et al., 2019; Miller et al., 2021). It is well known that one can adapt guarantees from statistical learning to the covariate shift setting; specifically, for well-specified regression, a classical density-ratio argument shows that empirical risk minimization (ERM) is consistent under suitably well-behaved covariate shifts. One stipulation of this consistency guarantee is that the model/hypothesis class be well-specified (also referred to as realizable).

(Authors listed in alphabetical order.)
Although statistical learning theory offers a rather complete understanding of misspecification in the absence of covariate shift (via agnostic learning and excess risk bounds), our understanding of how covariate shift can adversely interact with model misspecification remains fairly immature. This interaction is the focus of the present paper.

1.1 Contributions

We study regression under adversarial covariate shift, where we receive regression samples from a distribution D_train but are evaluated on an arbitrary distribution D_test for which no prior knowledge is available; we only assume that the distributions share the same target regression function f⋆ and that the worst-case density ratio of the covariate marginals is bounded by C_∞ ∈ [1, ∞) (formally defined in Section 2). As inductive bias, we have a function class F of predictors and assume L_∞-misspecification: there exists a predictor f̄ ∈ F that is pointwise close to f⋆, i.e., ∥f̄ − f⋆∥_∞ ≤ ε_∞. This notion is natural for the covariate shift setting because it ensures that f̄ has low and comparable prediction error on both D_train and any D_test. In this setup we obtain the following results:

1. We show that standard empirical risk minimization (ERM) is not robust to covariate shift in the presence of misspecification. Precisely, even in the limit of infinite data, ERM over F can incur squared prediction error under D_test scaling as Ω(C_∞ ε_∞²). Meanwhile, the error of the L_∞-misspecified predictor f̄ is at most ε_∞². We call this phenomenon, where the misspecification error is scaled by the density ratio coefficient (despite there being a predictor avoiding this scaling), misspecification amplification.

2. As our main result, we give a new algorithm, called disagreement-based regression (DBR), that avoids misspecification amplification and is therefore robust to adversarial covariate shift under misspecification.
DBR has asymptotic prediction error under D_test scaling as O(ε_∞²), with no dependence on the density ratio coefficient C_∞. At the same time, it has order-optimal finite-sample behavior, recovering standard "fast rate" guarantees for the well-specified setting, and can be extended to adapt to an unknown misspecification level (as shown in Appendix A.4). To our knowledge, this is the first result avoiding misspecification amplification in the adversarial covariate shift setting. Our assumptions (particularly that no information about D_test is available and that F is unstructured) rule out prior approaches based on density ratios (Shimodaira, 2000; Duchi and Namkoong, 2021) or sup-norm convergence (Schmidt-Hieber and Zamolodtchikov, 2022); see Section 5 for further discussion.

To demonstrate the utility of disagreement-based regression, we deploy the procedure in value function approximation settings in reinforcement learning (RL), where regression is a standard primitive and mitigating the adverse effects of distribution shift is a central challenge. Here, using DBR as a drop-in replacement for ERM when fitting Bellman backups, we obtain the following results:

1. In the offline RL setting, we instantiate the minimax algorithm of Chen and Jiang (2019) with DBR and show that, under L_∞-misspecification and with coverage measured via the concentrability coefficient, misspecification amplification can be avoided when learning a near-optimal policy. In contrast, prior lower bounds imply that misspecification amplification is unavoidable when coverage is measured via Bellman transfer coefficients (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020). Our result therefore establishes a new separation between concentrability and Bellman transfer coefficients.

2. In the online RL setting, we instantiate the GOLF algorithm of Jin et al.
(2021) with DBR and obtain analogous results under the structural condition of coverability (building on the analysis of Xie et al. (2023)). Taken with the above lower bounds (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020), this separates structural conditions involving Bellman errors (e.g., Bellman rank (Jiang et al., 2017), Bellman-eluder dimension (Jin et al., 2021), or the sequential extrapolation coefficient (Xie et al., 2023)) from coverability, which does not involve them.

To keep the presentation concise and focused on the interaction between covariate shift and misspecification, we focus on the simplest settings that manifest misspecification amplification. In Section 6, we discuss a number of directions for future work, which include extensions to the core technical and algorithmic results.

2 Misspecified regression under distribution shift

We begin by introducing the formal problem setting and our assumptions. Most proofs for results in this section are deferred to Appendix A.

There are two joint distributions, called D_train and D_test, over X × R, where X is a covariate space. We use P_train, P_test and E_train, E_test to denote the probability law and expectation under these distributions. We hypothesize that D_train and D_test share the same Bayes regression function, an assumption referred to as covariate shift in the literature (Shimodaira, 2000).

Assumption 2.1 (Covariate shift). For all x ∈ X we have E_train[y | x] = E_test[y | x].

Let f⋆ : x ↦ E_train[y | x] denote the shared Bayes regression function. We posit that the marginal distributions over X are absolutely continuous with respect to a reference measure and use d_train and d_test to denote the corresponding marginal densities. We assume these are related via the following density ratio assumption.

Assumption 2.2 (Bounded density ratios).
The density ratio C_∞ := sup_{x ∈ X} d_test(x)/d_train(x) is bounded, i.e., C_∞ < ∞.

Note that C_∞ ≥ 1 always. Boundedness of density ratios is standard in the covariate shift literature; indeed, the coefficient C_∞ appears in the classical covariate shift analyses as well as in many algorithmic interventions (Shimodaira, 2000; Sugiyama et al., 2007). Beyond satisfying these assumptions, D_test can be adaptively and adversarially chosen. In particular, no information about D_test, such as labeled/unlabeled samples or other inductive bias, is available.

We have a dataset {(x_i, y_i)}_{i=1}^n of n i.i.d. labeled examples sampled from D_train and a function class F ⊂ (X → R) of predictors. We define the (squared) prediction errors

R_train(f) := E_train[(f(x) − f⋆(x))²],  and  R_test(f) := E_test[(f(x) − f⋆(x))²].  (1)

We seek to use the dataset to find a predictor f̂ for which R_test(f̂) is small. Regarding F, we make two assumptions: we assume that |F| < ∞ and that F is L_∞-misspecified.

Assumption 2.3 (L_∞-misspecification). For some ε_∞ ≥ 0, there exists f̄ ∈ F with ∥f̄ − f⋆∥_∞ ≤ ε_∞, where ∥f∥_∞ := sup_{x ∈ X} |f(x)|.

Most prior analyses for regression under covariate shift assume that the model class F is well-specified, i.e., that ε_∞ = 0 so that f⋆ ∈ F. L_∞-misspecification provides a relaxation that is natural for at least two reasons. First, it enables end-to-end learning guarantees via composition with approximation-theoretic results for specific function classes (e.g., neural networks), where it is standard to measure approximation via the L_∞ norm (Telgarsky, 2021). More importantly, L_∞-misspecification is particularly apt in the covariate shift setting because it ensures that f̄ has low prediction error on both D_test and D_train.
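To make these objects concrete, here is a minimal finite-domain sketch (the domain, marginals, and predictor values are our own illustration, not from the paper) computing the density ratio coefficient of Assumption 2.2 and the prediction errors of Eq. (1):

```python
# Minimal instance of the Section 2 setup (all numbers are our own toy choices).
X = [0, 1]
d_train = {0: 0.9, 1: 0.1}    # training marginal over the covariate space
d_test  = {0: 0.0, 1: 1.0}    # test marginal shifts onto the rare point

# Assumption 2.2: C_inf = sup_x d_test(x) / d_train(x)
C_inf = max(d_test[x] / d_train[x] for x in X if d_train[x] > 0)

f_star = {0: 0.0, 1: 0.0}     # shared Bayes regression function (Assumption 2.1)

def R(f, marginal):
    """Squared prediction error E[(f(x) - f_star(x))^2] under a marginal, Eq. (1)."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

# An L_inf-misspecified predictor (Assumption 2.3 with eps_inf = 0.1): its
# pointwise closeness makes its error comparable under *any* covariate shift.
f_bar = {0: 0.1, 1: 0.1}
print(C_inf)                                # 10.0
print(R(f_bar, d_train), R(f_bar, d_test))  # both ~0.01 = eps_inf ** 2
```

Note how the L_∞-close predictor has the same risk under both marginals even though the density ratio is large; this stability is exactly the property the text highlights.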
Thus, there is at least one high-quality predictor whose performance is stable across distributions. In contrast, we have no such guarantee if we, for example, measure misspecification with respect to other norms (which depend on the distribution) or consider the agnostic setting (with no quantified misspecification assumption). Indeed, we will see below that misspecification amplification is unavoidable in such cases.

We also make the following technical assumption.

Assumption 2.4 (Boundedness). sup_{f ∈ F} ∥f∥_∞ ≤ 1 and |y| ≤ 1 almost surely under D_train and D_test.

We impose Assumption 2.4 and that |F| < ∞ solely to highlight the novel algorithmic and technical aspects; we expect that relaxing these assumptions is possible.

2.1 Misspecification amplification for empirical risk minimization

When there is no prior knowledge about or data from D_test, perhaps the most natural algorithm for optimizing R_test(·) is empirical risk minimization (ERM) on the data from the training distribution:

f̂_ERM^(n) := argmin_{f ∈ F} (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

A standard uniform convergence argument yields the classical covariate shift guarantee for ERM:

Proposition 2.1 (ERM upper bound). For any δ ∈ (0, 1), with probability at least 1 − δ, ERM satisfies

R_test(f̂_ERM^(n)) ≤ O( C_∞ ε_∞² + C_∞ log(|F|/δ)/n ).

The second term, which scales as 1/n (the statistical term), is optimal in the generality of our setup (Ma et al., 2023; Ge et al., 2023), the interpretation being that the effective sample size is reduced by a factor of C_∞ due to the mismatch between D_train and D_test. The first term (the misspecification term) represents the asymptotic[1] test error of ERM and demonstrates a phenomenon that we call misspecification amplification, whereby the error due to misspecification is amplified by the density ratio coefficient.
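To see how the C_∞ ε_∞² term can actually be realized, the following population-level computation (a toy instance of our own, in the spirit of the amplification phenomenon just described) exhibits a predictor that barely wins the training-risk comparison while suffering roughly C_∞ ε_∞² test error:

```python
# Population-level illustration of misspecification amplification (our own toy
# numbers): infinite-data ERM is the L2(d_train) projection onto {f_bar, f_bad}.
X = [0, 1]
d_train = {0: 0.9, 1: 0.1}
d_test  = {0: 0.0, 1: 1.0}            # density ratio coefficient C_inf = 10
f_star  = {0: 0.0, 1: 0.0}
eps = 0.1                             # misspecification level eps_inf

f_bar = {0: eps, 1: eps}              # errors "spread out": eps everywhere
b = (10 ** 0.5) * eps * 0.999         # just under sqrt(C_inf) * eps_inf
f_bad = {0: 0.0, 1: b}                # zero error on the common point, large
                                      # error where d_test concentrates

def R(f, marginal):
    """Squared prediction error under the given marginal (Eq. (1))."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

erm = min([f_bar, f_bad], key=lambda f: R(f, d_train))
print(erm is f_bad)                           # True: f_bad barely wins on D_train
print(R(f_bad, d_test) / R(f_bar, d_test))    # ~9.98, i.e. nearly C_inf
```

The choice b just under √C_∞ · ε_∞ makes f_bad's training risk slightly smaller than ε_∞², so ERM selects it, yet its test risk is amplified by nearly the full factor C_∞.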
This phenomenon is simultaneously more concerning and less intuitive than the degradation of the statistical term, because it describes an error which does not decay with larger sample sizes, and because f̄ ∈ F has R_test(f̄) = ε_∞². Since F contains a predictor that does not incur misspecification amplification, one might hope that misspecification amplification can be avoided. Our first main result is that misspecification amplification cannot be avoided by ERM in the worst case. The result is proved in the asymptotic regime, where ERM is equivalent to the L₂(D_train)-projection of f⋆ onto the function class F, defined as

f̂_ERM^(∞) ∈ argmin_{f ∈ F} ∥f − f⋆∥²_{L₂(D_train)},  with  ∥g∥²_{L₂(D_train)} := E_train[g(x)²].

The next proposition shows that f̂_ERM^(∞) can incur misspecification amplification.

Proposition 2.2 (ERM lower bound). For all ε_∞ ∈ (0, 1) and C_∞ ∈ [1, ∞) such that √C_∞ · ε_∞ ≤ 1/2, and for all ζ > 0 sufficiently small, there exist distributions D_train, D_test and a function class F with |F| = 2 satisfying Assumption 2.1–Assumption 2.4 (with parameters ε_∞, C_∞) such that R_test(f̂_ERM^(∞)) = C_∞ ε_∞² − ζ.

[Figure 1: The construction used to prove Proposition 2.2. f_bad and f̄ have equal risk under D_train, but f_bad concentrates errors onto D_test.]

Combined with the optimality of the statistical term (Ma et al., 2023; Ge et al., 2023), this establishes that Proposition 2.1 characterizes the behavior of ERM under L_∞-misspecification and covariate shift. The construction is based on the following insight, visualized in Figure 1. The fact that f̄ is L_∞-close to f⋆ guarantees that its prediction errors are "spread out" across the domain X. Since f̄ ∈ F, we know that f̂_ERM^(∞) must satisfy ∥f̂_ERM^(∞) − f⋆∥²_{L₂(D_train)} ≤ ε_∞².
Unfortunately, this property does not guarantee that the errors of f̂_ERM^(∞) are "spread out" in a manner similar to f̄'s. Indeed, we construct a predictor f_bad that concentrates its errors on a region of X that is amplified by D_test and makes up for this by having zero error elsewhere. By setting the parameters carefully, we can ensure that this bad predictor is chosen by ERM. We note that essentially the same construction shows that, under the weaker notion of L₂(D_train)-misspecification, amplification is unavoidable for any proper learner (which outputs a function in F). Indeed, in Figure 1, the function class {f_bad} is L₂(D_train)-misspecified, but f_bad has much higher error on D_test.

[1] We consider the asymptotic regime where n → ∞ with all other quantities, like log |F| and ε_∞, fixed.

Other existing algorithms. Proposition 2.2 only pertains to ERM, and thus one might ask whether other algorithms can avoid misspecification amplification. Before turning to our positive results in the next section, we briefly note that other standard algorithms (that do not require knowledge of D_test) either incur misspecification amplification to some degree or have some other failure mode. This pertains to the star algorithm (Audibert, 2007; Liang et al., 2015), other aggregation schemes (cf. Lecué and Rigollet, 2014), and L_∞-regression (Knight, 2017; Yi and Neykov, 2024), as we discuss in Appendix A.2. Several methods for mitigating covariate shift can avoid misspecification amplification, but they either require knowledge of D_test or structural assumptions on F; see Section 5.

2.2 Main result: Disagreement-based regression

In this section, we provide a new algorithm that avoids misspecification amplification while requiring no knowledge of D_test and recovering optimal statistical rates.
To develop some intuition, observe that in the construction in Figure 1, the only way for the bad predictor (f_bad, in red) to be chosen by ERM and have large errors on D_test is for it to have much lower error than f̄ on the rest of the domain. Indeed, if we could filter out the points where f_bad's error is less than f̄'s, then f_bad could not overcome the large errors on D_test. Stated another way, we can avoid misspecification amplification in this example if we restrict the regression problem to the region where |f_bad(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|. Generalizing this insight to a larger function class suggests that, when considering a candidate f ∈ F, we should only measure the square loss for f on the region where |f(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|. Unfortunately, this region depends on f⋆ and f̄, both of which are unknown. Nevertheless, our approach is based on this intuition, and we avoid the dependence on these unknown functions with two algorithmic ideas.

To eliminate the dependence on f⋆, we use the fact that |f̄(x) − f⋆(x)| ≤ ε_∞ and approximate the above region with I_f := {x : |f(x) − f̄(x)| ≥ cε_∞}. Indeed, for c ≥ 2,

{x : |f(x) − f̄(x)| ≥ cε_∞} ⊆ {x : |f(x) − f⋆(x)| ≥ |f̄(x) − f⋆(x)|}.

On the other hand, we know that |f(x) − f⋆(x)| ≤ (c + 1)ε_∞ in the complementary region, I_f^C. This is, up to the constant factor, the best pointwise guarantee we can attain, making it safe to ignore the complementary region. This resolves the first issue of dependence on f⋆.
To address the dependence on f̄, we use that f̄ ∈ F and formulate a robust optimization objective that implicitly considers all possible pairwise "disagreement regions." Formally, the algorithm is:

W^τ_{f,g}(x) := 1{|f(x) − g(x)| ≥ τ},
f̂_DBR^(n) ← argmin_{f ∈ F} max_{g ∈ F} (1/n) Σ_{i=1}^n W^τ_{f,g}(x_i) [ (f(x_i) − y_i)² − (g(x_i) − y_i)² ].  (2)

We call this algorithm disagreement-based regression (DBR) and keep the dependence on τ implicit in the notation for the solution f̂_DBR^(n).[2] There are essentially three key ingredients. First, we introduce the "filter" W^τ_{f,g} to restrict the regression problem to the set of points where the predictions of f and g differ considerably, which we call the disagreement region. This formalizes the intuition that we should only measure the square loss for f on points where |f(x) − f̄(x)| ≥ cε_∞. Second is the robust optimization approach, where for each f ∈ F we consider all possible choices g ∈ F for filtering, which allows us to take g to be L_∞-close to f⋆ in the analysis. Finally, we measure the square-loss regret in the disagreement region, by subtracting off the square loss of the comparator function g. Similar to Agarwal and Zhang (2022), this accounts for the fact that each g ∈ F yields a different regression problem, with potentially different Bayes error rates.[3] As our main theorem, we show that disagreement-based regression enjoys the following guarantee.

[2] The name stems from the literature on disagreement-based active learning (Hanneke, 2014), where a similar "range" computation has appeared (Krishnamurthy et al., 2019; Foster et al., 2018, 2021). However, our usage is conceptually unrelated: we use disagreement for robustness to covariate shift, while, in active learning, disagreement is used to reduce sample complexity.
[3] More directly, the probability mass of filtered points P_train[W^τ_{f,g}(x)] could vary considerably for different f, g ∈ F.

Theorem 2.1 (Main result for DBR). Fix δ ∈ (0, 1). Let F be a function class with |F| < ∞ satisfying Assumption 2.3 and Assumption 2.4. Then with probability at least 1 − δ, f̂_DBR^(n) with τ ≥ 3ε_∞ satisfies

E_train[ 1{|f̂_DBR^(n)(x) − f⋆(x)| ≥ τ + ε_∞} · ( (f̂_DBR^(n)(x) − f⋆(x))² − ε_∞² ) ] ≤ 160 log(2|F|/δ) / (3n),  (3)

which directly implies

P_train[ |f̂_DBR^(n)(x) − f⋆(x)| ≥ τ + ε_∞ ] ≤ 160 log(2|F|/δ) / ( 3n(τ² + 2τε_∞) ).  (4)

Before turning to a discussion of Theorem 2.1, we state two immediate corollaries. The first addresses the adversarial covariate shift setting, bounding the risk of f̂_DBR^(n) under D_test.

Corollary 2.1 (Covariate shift for DBR). Fix δ ∈ (0, 1). Under Assumption 2.1–Assumption 2.4, with probability at least 1 − δ, f̂_DBR^(n) with τ = 3ε_∞ satisfies

R_test(f̂_DBR^(n)) ≤ 17ε_∞² + O( C_∞ log(|F|/δ)/n ).  (5)

The next result shows that f̂_DBR^(n) recovers the optimal guarantee in the well-specified case, i.e., when ε_∞ = 0.

Corollary 2.2 (Well-specified case). Fix δ ∈ (0, 1). Under Assumption 2.1–Assumption 2.4 (with ε_∞ = 0), with probability at least 1 − δ, f̂_DBR^(n) with τ ≤ O( √(log(|F|/δ)/n) ) satisfies

R_train(f̂_DBR^(n)) ≤ O( log(|F|/δ)/n )  and  R_test(f̂_DBR^(n)) ≤ O( C_∞ log(|F|/δ)/n ).  (6)

We now turn to some remarks regarding Theorem 2.1 and the corollaries.

DBR avoids misspecification amplification. Comparing Corollary 2.1 in the n → ∞ limit with Proposition 2.2 highlights the main qualitative difference between DBR and ERM: DBR attains O(ε_∞²) asymptotic test error, while the test error for ERM is lower bounded by Ω(C_∞ ε_∞²). In other words, DBR avoids misspecification amplification while ERM does not.
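To make this contrast concrete, here is a brute-force, population-level rendition of the objective in Eq. (2) on a two-point toy instance in the style of Figure 1 (noiseless labels y = f⋆(x); all parameter values are our own illustration, not the paper's code). It shows DBR selecting the stable predictor that ERM rejects:

```python
# Population-level (infinite-data, noiseless) rendition of the DBR objective
# in Eq. (2) on a two-point instance. All numbers are our own toy choices.
X = [0, 1]
d_train = {0: 0.95, 1: 0.05}          # rare point x = 1
d_test  = {0: 0.0, 1: 1.0}            # test mass concentrated on the rare point
f_star  = {0: 0.0, 1: 0.0}            # shared Bayes regression function
eps, tau = 0.1, 0.3                   # tau = 3 * eps_inf, as in Theorem 2.1

f_bar = {0: eps, 1: eps}              # L_inf-close: error eps everywhere
f_bad = {0: 0.0, 1: 0.41}             # concentrates its error on the rare point
F = [f_bar, f_bad]

def R(f, marginal):
    """Squared prediction error under the given marginal (Eq. (1))."""
    return sum(marginal[x] * (f[x] - f_star[x]) ** 2 for x in X)

def L(f, g):
    """Population DBR objective: filtered square-loss regret of f against g."""
    return sum(d_train[x] * ((f[x] - f_star[x]) ** 2 - (g[x] - f_star[x]) ** 2)
               for x in X if abs(f[x] - g[x]) >= tau)   # the filter W^tau_{f,g}

erm = min(F, key=lambda f: R(f, d_train))               # infinite-data ERM
dbr = min(F, key=lambda f: max(L(f, g) for g in F))     # infinite-data DBR
print(dbr is f_bar, erm is f_bad)      # True True
print(R(dbr, d_test), R(erm, d_test))  # ~0.01 vs ~0.168: no amplification for DBR
```

Against the comparator g = f̄, the filter fires exactly on the rare point where f_bad's error is large, so f_bad incurs a strictly positive filtered regret and DBR rejects it, even though its unfiltered training risk is smaller than f̄'s.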
At the same time, the statistical term is identical (up to constants) to that of ERM, enabling us to recover the optimal rate in the well-specified case.

Quantile guarantee. Taking τ = O(ε_∞) in Eq. (3), we have that P_train[ |f̂_DBR^(n)(x) − f⋆(x)| ≥ cε_∞ ] ≲ 1/(nε_∞²), which controls the large quantiles of the prediction error. This is reminiscent of what can be achieved by applying Markov's inequality to the guarantee for ERM in the well-specified case. In contrast, ERM only ensures that R_train(f̂_ERM^(n)) = Ω(ε_∞²) under misspecification, which does not imply any meaningful quantile guarantee. One interpretation of our results is that, although such quantile guarantees are not possible for ERM under misspecification, there is no information-theoretic obstruction. We also note that these quantile guarantees are rather different from sup-norm convergence; see Section 5 for further discussion.

Computational efficiency. DBR, as described in Eq. (2), does not appear to be computationally tractable, primarily due to the non-smoothness and non-convexity introduced by the filter W_{f,g}. A natural direction for future work is to understand the computational challenges involved in avoiding misspecification amplification.

2.2.1 Extensions

Before closing this section, we mention two extensions that we defer to Appendix A.4.

• Approximation factor. The approximation factor of 17 in Corollary 2.1 can be improved to 10 (cf. Proposition A.1); however, our approach for doing so degrades the convergence rate of the statistical term. We do not know the optimal approximation factor for this setting or whether there is an inherent trade-off between the statistical term and the approximation/misspecification term.

• Adapting to unknown misspecification. Theorem 2.1 requires setting τ ≥ 3ε_∞, which can always be achieved by setting τ sufficiently large.
However, setting τ = O(ε_∞) yields the best guarantee, and so we would like to choose τ in a data-dependent fashion to adapt to the misspecification level. Proposition A.2 shows that this can be done while recovering essentially the same guarantee as in Theorem 2.1.

3 Proof of Theorem 2.1

This section contains the proof of Theorem 2.1 (which, we emphasize, requires only elementary arguments) and is not essential for understanding the main results of the paper. A reader interested in applications of Theorem 2.1 to reinforcement learning can proceed to Section 4.

The proof of Theorem 2.1 is organized into three steps, each of which is fairly simple. It is helpful to define empirical and population versions of the pairwise objective used by DBR:

(Empirical): L̂(f; g) := (1/n) Σ_{i=1}^n W^τ_{f,g}(x_i) [ (f(x_i) − y_i)² − (g(x_i) − y_i)² ],
(Population): L(f; g) := E_train[ W^τ_{f,g}(x) ( (f(x) − y)² − (g(x) − y)² ) ].

First, we establish a certain non-negativity property of the population objective, which is the main structural result. The second step is a uniform convergence argument to show that L̂(·;·), which appears in the algorithm, concentrates to the population counterpart L(·;·). Finally, we study the minimizer f̂_DBR^(n) and an L_∞-approximation f̄ and relate their objective values to establish the theorem. Details and proofs for the corollaries are deferred to Appendix A.3.

Step 1: Non-negativity. The key lemma for the analysis is the following structural property.

Lemma 3.1 (Non-negativity). With τ ≥ 2ε_∞ and for any f̄ ∈ F such that ∥f̄ − f⋆∥_∞ ≤ ε_∞, we have

L(f; f̄) ≥ (τ² − 2τε_∞) · Pr[ W^τ_{f,f̄}(x) = 1 ] ≥ 0.
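The pointwise mechanism behind the lemma admits a direct numerical sanity check (ours, not the paper's proof): whenever the filter fires, i.e., |f(x) − f̄(x)| ≥ τ ≥ 2ε_∞, the triangle inequality gives |f(x) − f⋆(x)| ≥ τ − ε_∞ ≥ ε_∞ ≥ |f̄(x) − f⋆(x)|, so the filtered square-loss difference is non-negative at every point:

```python
# Numerical sanity check (ours) of the pointwise fact behind Lemma 3.1.
import random

def filtered_gap(f_x, fbar_x, fstar_x, tau):
    """W^tau_{f,fbar}(x) * ((f(x) - fstar(x))^2 - (fbar(x) - fstar(x))^2)."""
    if abs(f_x - fbar_x) < tau:
        return 0.0                       # the filter does not fire
    return (f_x - fstar_x) ** 2 - (fbar_x - fstar_x) ** 2

random.seed(0)
eps, tau = 0.1, 0.25                     # any tau >= 2 * eps works
for _ in range(50000):
    fstar_x = random.uniform(-1, 1)
    fbar_x = fstar_x + random.uniform(-eps, eps)   # L_inf-close at this point
    f_x = random.uniform(-1, 1)                    # arbitrary candidate value
    assert filtered_gap(f_x, fbar_x, fstar_x, tau) >= 0
print("pointwise non-negativity holds on all sampled points")
```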
The proof requires only algebraic manipulations and actually reveals a stronger, pointwise property: with τ ≥ 2ε_∞, the random variable W^τ_{f,f̄}(x) [ (f(x) − f⋆(x))² − (f̄(x) − f⋆(x))² ] is non-negative almost surely. By the antisymmetry L(f; g) = −L(g; f), the lemma also shows that any L_∞-misspecified f̄ has non-positive population objective.

Step 2: Uniform convergence. Next, we establish the following concentration guarantee.

Lemma 3.2 (Concentration). Fix δ ∈ (0, 1) and τ ≥ 3ε_∞, and define ε_stat := 80 log(|F|/δ) / (3n). Under Assumption 2.3, for any f̄ ∈ F such that ∥f̄ − f⋆∥_∞ ≤ ε_∞, with probability at least 1 − δ we have

∀ f ∈ F : L(f; f̄) ≤ 2 L̂(f; f̄) + ε_stat,  and equivalently,  L̂(f̄; f) ≤ (1/2)( L(f̄; f) + ε_stat ).

The proof is based on Bernstein's inequality and importantly exploits a "self-bounding" property of L̂(f; g), in particular that Var[L̂(f; f̄)] ≤ (12/n) L(f; f̄), analogously to the analysis for ERM in the well-specified case.

Step 3: Analysis of f̂_DBR^(n). Let f̄ ∈ F be any function that is L_∞-close to f⋆, and condition on the high-probability event in Lemma 3.2 holding with this choice of f̄. The DBR minimizer satisfies

L(f̂_DBR^(n); f̄) ≤(i) 2 L̂(f̂_DBR^(n); f̄) + ε_stat ≤(ii) 2 max_{g ∈ F} L̂(f̂_DBR^(n); g) + ε_stat ≤(iii) 2 max_{g ∈ F} L̂(f̄; g) + ε_stat ≤(iv) max_{g ∈ F} L(f̄; g) + 2ε_stat ≤(v) 2ε_stat.

Here, inequalities (i) and (iv) are applications of Lemma 3.2, (ii) and (iii) follow from the definition of f̂_DBR^(n) since f̄ ∈ F, and (v) is an application of Lemma 3.1 along with the antisymmetry L(f; g) = −L(g; f). Eq. (3) now follows from the fact that W^τ_{f,f̄}(x) ≥ 1{|f(x) − f⋆(x)| ≥ τ + ε_∞}. Eq.
(4) follows since, under the event |f(x) − f⋆(x)| ≥ τ + ε_∞, we can lower bound (f(x) − f⋆(x))² − (f̄(x) − f⋆(x))² ≥ (τ + ε_∞)² − ε_∞² = τ² + 2τε_∞.

4 Applications to online and offline reinforcement learning

In this section, we deploy disagreement-based regression to obtain new results in offline and online RL with function approximation. Algorithmically, this is achieved by using DBR as a drop-in replacement for square-loss regression in existing algorithms. We illustrate this by examining and improving the Bellman residual minimization (a.k.a. minimax) algorithm for offline RL (Antos et al., 2008; Chen and Jiang, 2019) (Section 4.1) and the GOLF algorithm (Jin et al., 2021) for online RL (Section 4.2). The analyses also require minimal modifications to those of Xie and Jiang (2021) and Xie et al. (2023), respectively. To emphasize the ease with which DBR can be applied, we adopt the formulations and much of the notation from these works. All proofs for results in this section are deferred to Appendix B.

4.1 Offline reinforcement learning

Setup and notation. We consider a discounted Markov decision process (MDP) M = (P, R, d_0, γ) over states S and actions A, where P : S × A → Δ(S) is the transition operator, R : S × A → [0, 1] is the reward function, d_0 ∈ Δ(S) is the initial state distribution, and γ ∈ [0, 1) is the discount factor. A policy π : S → Δ(A) induces a trajectory s_0, a_0, r_0, s_1, a_1, r_1, …, where s_0 ∼ d_0 and, for each h ∈ N, a_h ∼ π(s_h), r_h = R(s_h, a_h), and s_{h+1} ∼ P(s_h, a_h). We use P_π[·] and E_π[·] to denote probability and expectation under this process. Let d^π_h ∈ Δ(S × A) denote the occupancy measure of π at time-step h, defined as d^π_h(s, a) := P_π[s_h = s, a_h = a], and let d^π := (1 − γ) Σ_{h=0}^∞ γ^h d^π_h. The value of π is denoted J(π) := E_π[ Σ_{h=0}^∞ γ^h r_h ].
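As a concrete check of these definitions, the following tabular sketch (the MDP, policy, and all numbers are our own toy choices) verifies the identity J(π) = E_{(s,a)∼d^π}[R(s,a)] / (1 − γ), which follows directly from d^π = (1 − γ) Σ_h γ^h d^π_h:

```python
# Tiny tabular MDP (our own toy numbers) checking the occupancy identity
# J(pi) = E_{(s,a) ~ d^pi}[R(s,a)] / (1 - gamma).
gamma = 0.9
S, A = [0, 1], [0, 1]
P = {  # P[(s, a)] = distribution over next states
    (0, 0): [0.8, 0.2], (0, 1): [0.3, 0.7],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.5}
d0 = [1.0, 0.0]                       # initial state distribution
pi = {0: 1, 1: 0}                     # deterministic policy: action per state

# d^pi = (1 - gamma) * sum_h gamma^h d^pi_h, truncated at a large horizon.
d = {(s, a): 0.0 for s in S for a in A}
state_dist, weight = d0[:], 1.0
for h in range(500):
    for s in S:
        d[(s, pi[s])] += (1 - gamma) * weight * state_dist[s]
    state_dist = [sum(state_dist[s] * P[(s, pi[s])][s2] for s in S) for s2 in S]
    weight *= gamma

J_from_occupancy = sum(d[sa] * R[sa] for sa in d) / (1 - gamma)

# J(pi) directly, by iterating V(s) = R(s, pi(s)) + gamma * E[V(s')].
V = [0.0, 0.0]
for _ in range(500):
    V = [R[(s, pi[s])] + gamma * sum(P[(s, pi[s])][s2] * V[s2] for s2 in S)
         for s in S]
J_direct = sum(d0[s] * V[s] for s in S)
print(abs(J_from_occupancy - J_direct) < 1e-6)   # True
```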
Each policy π has value functions V^π : s ↦ E_π[ Σ_{h=0}^∞ γ^h r_h | s_0 = s ] and Q^π : (s, a) ↦ E_π[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, a_0 = a ], and it is known that there exists a policy π⋆ that maximizes V^π(s) simultaneously for all s ∈ S. This policy also optimizes J(·) and hence is called the optimal policy. It is also known that the value function Q⋆ := Q^{π⋆} induces the optimal policy via π⋆ : s ↦ argmax_a Q⋆(s, a) and additionally satisfies Bellman's optimality equation: Q⋆(s, a) = [T Q⋆](s, a), where T is the Bellman operator, defined via T f : (s, a) ↦ E[ r_0 + γ max_{a′} f(s_1, a′) | s_0 = s, a_0 = a ].

In the offline value function approximation setting, we are given a dataset of n tuples D_n := {(s_i, a_i, r_i, s′_i)}_{i=1}^n generated i.i.d. from the following process: (s_i, a_i) ∼ µ, where µ ∈ Δ(S × A) is the data collection distribution, r_i = R(s_i, a_i), and s′_i ∼ P(s_i, a_i). We are also given a function class F ⊂ (S × A → R), where each f ∈ F induces the policy π_f : s ↦ argmax_a f(s, a). Given the dataset D_n and function class F, we seek a policy π̂ that has a small suboptimality gap J(π⋆) − J(π̂). We impose the following assumptions on the function class and on the data collection distribution:

• L_∞-misspecified realizability/completeness: There exists f̄ ∈ F such that ∥f̄ − T f̄∥_∞ ≤ ε_∞. Additionally, for any f ∈ F there exists g ∈ F such that ∥g − T f∥_∞ ≤ ε_∞.

• Concentrability: There exists a constant C_conc ∈ [1, ∞) such that max_{π ∈ Π} ∥d^π/µ∥_∞ ≤ C_conc. Here Π := {π_f : f ∈ F} is the policy class induced by F.

There is a large body of recent work studying various function approximation and coverage assumptions in offline RL (cf. Xie and Jiang, 2021).
Arguably the most standard are concentrability, as we use, and exact realizability/completeness, which is stronger than our version with misspecification. Regarding the function approximation assumption, it is not hard to show that misspecification amplification—which in this setting is defined as the suboptimality $J(\pi^\star) - J(\hat\pi)$ scaling as $\Omega(\varepsilon_\infty \sqrt{C_{\mathrm{conc}}})$—is necessary under weaker notions, such as $L_2(\mu)$-misspecification. Regarding coverage, as we will discuss below, the strength of the coverage assumption determines whether misspecification amplification can be avoided or not.

Algorithm and guarantee. The algorithm we study is a minor modification to the minimax algorithm (Antos et al., 2008; Chen and Jiang, 2019). For each function $\tilde f \in \mathcal{F}$ and each tuple $(s_i, a_i, r_i, s_i')$ we can form a regression sample $(s_i, a_i, y_{\tilde f, i} := r_i + \gamma \max_{a'} \tilde f(s_i', a'))$ and define the predictor $\hat f$ via the objective:
$$\hat f := \arg\min_{f \in \mathcal{F}} \max_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n W^\tau_{f,g}(s_i, a_i) \left[ (f(s_i, a_i) - y_{f,i})^2 - (g(s_i, a_i) - y_{f,i})^2 \right]. \quad (7)$$
Here $W^\tau_{f,g}(\cdot)$ is the filter in Eq. (2) with $x = (s,a)$. Given $\hat f$, we output $\hat\pi := \pi_{\hat f}$. Note that the only difference between this algorithm and the original minimax algorithm is the use of the filter $W^\tau_{f,g}(\cdot)$, which is essential for obtaining the following guarantee.

Theorem 4.1 (DBR for offline RL). Fix $\delta \in (0,1)$, assume that $\mathcal{F}$ is $L_\infty$-misspecified and $\mu$ satisfies concentrability (as defined above). Consider the algorithm defined in Eq. (7) with $\tau = 3\varepsilon_\infty$. Then, with probability at least $1 - \delta$, we have
$$J(\pi^\star) - J(\hat\pi) \leq O\left( \frac{\varepsilon_\infty}{1 - \gamma} + \frac{1}{1 - \gamma} \sqrt{\frac{C_{\mathrm{conc}} \log(|\mathcal{F}|/\delta)}{n}} \right).$$

The theorem is best understood via comparison to the guarantee for the standard minimax algorithm, e.g., Theorem 5 of Xie and Jiang (2020).
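When $\mathcal{F}$ is finite and the state-action space is discrete, the objective in Eq. (7) can be enumerated directly. A toy sketch; since Eq. (2) is not restated here, we take the filter to be the disagreement indicator $\mathbb{1}\{|f(x) - g(x)| > \tau\}$ (an assumption), and the regression targets are synthetic stand-ins for $r_i + \gamma \max_{a'} \tilde f(s_i', a')$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, tau = 200, 0.3
nX, nF = 10, 6

# F[j, x] = value of the j-th candidate function at the discrete point x = (s, a).
F = rng.uniform(size=(nF, nX))
x = rng.integers(nX, size=n)                         # sampled state-action indices
# Hypothetical targets y_{f,i}: noisy draws around an arbitrary per-function "T f".
Tf = rng.uniform(size=(nF, nX))
y = Tf[:, x] + 0.1 * rng.standard_normal((nF, n))    # y[j, i] = target for f_j at sample i

def dbr_objective(j, k):
    """Filtered excess square loss of f_j against comparator g_k on f_j's targets."""
    W = (np.abs(F[j, x] - F[k, x]) > tau).astype(float)   # assumed disagreement filter
    return np.mean(W * ((F[j, x] - y[j]) ** 2 - (F[k, x] - y[j]) ** 2))

# f_hat = argmin over f of the max over g of the filtered objective.
scores = np.array([[dbr_objective(j, k) for k in range(nF)] for j in range(nF)])
j_hat = scores.max(axis=1).argmin()
f_hat = F[j_hat]
```

Note that the comparator $g = f$ always contributes zero (the filter kills every sample), so the inner max is nonnegative.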
Under our assumptions ($L_\infty$-misspecification and concentrability), these two bounds differ only in the misspecification term: our theorem scales as $\varepsilon_\infty / (1-\gamma)$ while the guarantee for the minimax algorithm scales as $\varepsilon_\infty \sqrt{C_{\mathrm{conc}}} / (1-\gamma)$.[4] Thus, our algorithm inherits the favorable properties of DBR to avoid misspecification amplification in offline RL.

This feature is notable in light of existing lower bounds for misspecified RL (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020). Formally, these results consider linear function approximation in various online RL models, but the constructions can be extended to offline RL with general function approximation where coverage is measured via the Bellman transfer coefficient. This coefficient is the smallest $C_{\mathrm{transfer}}$ such that
$$\max_{\pi, f \in \mathcal{F}} \frac{\| f - \mathrm{apx}[f] \|^2_{L_2(d^\pi)}}{\| f - \mathrm{apx}[f] \|^2_{L_2(\mu)}} \leq C_{\mathrm{transfer}},$$
where $\mathrm{apx}[f] \in \mathcal{F}$ is the $L_\infty$-approximation of $\mathcal{T} f$.[5] The lower bound states that an asymptotic error of $\Omega(\varepsilon_\infty \sqrt{C_{\mathrm{transfer}}})$ is unavoidable. To contextualize our result with this lower bound, we identify two regimes: the "Bellman transfer regime" where $C_{\mathrm{transfer}} < \infty$ and the "concentrability regime" where $C_{\mathrm{conc}} < \infty$, and note that, since $C_{\mathrm{transfer}} \leq C_{\mathrm{conc}}$, the former is more general. In the Bellman transfer regime, misspecification amplification is unavoidable. In the concentrability regime, Theorem 4.1 avoids misspecification amplification and is sample efficient (i.e., has statistical term scaling as $\mathrm{poly}(C_{\mathrm{conc}}, \log(|\mathcal{F}|/\delta), \frac{1}{n}, \frac{1}{1-\gamma})$).
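As noted in footnote 5, under linear function approximation the transfer coefficient takes the form $\max_\theta \theta^\top \Sigma_\pi \theta / \theta^\top \Sigma_\mu \theta$, i.e., the largest generalized eigenvalue of the pair $(\Sigma_\pi, \Sigma_\mu)$. A sketch with synthetic feature covariances:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# Synthetic second-moment matrices Sigma_pi = E_{d^pi}[phi phi^T], Sigma_mu = E_mu[phi phi^T].
A = rng.standard_normal((d, d)); Sigma_pi = A @ A.T / d
B = rng.standard_normal((d, d)); Sigma_mu = B @ B.T / d + 0.1 * np.eye(d)

# max_theta theta^T Sigma_pi theta / theta^T Sigma_mu theta: whiten by the Cholesky
# factor of Sigma_mu, then take the largest eigenvalue of L^{-1} Sigma_pi L^{-T}.
L = np.linalg.cholesky(Sigma_mu)
M = np.linalg.solve(L, np.linalg.solve(L, Sigma_pi).T).T
C_transfer = np.linalg.eigvalsh(M).max()

# Sanity check against random directions: no theta should exceed C_transfer.
thetas = rng.standard_normal((1000, d))
ratios = np.einsum('ij,jk,ik->i', thetas, Sigma_pi, thetas) / \
         np.einsum('ij,jk,ik->i', thetas, Sigma_mu, thetas)
assert ratios.max() <= C_transfer + 1e-8
```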
This is the first result showing that both of these properties are simultaneously achievable: prior results achieve sample efficiency with misspecification amplification (e.g., Xie and Jiang, 2020), or avoid misspecification amplification with undesirable sample complexity scaling as $\mathrm{poly}(|\mathcal{S}|)$ (the latter is easily achieved under concentrability via a tabular model-based approach). Thus, the regime determines whether misspecification amplification is avoidable or not, and, in the regime where it is avoidable, our algorithm does so in a sample-efficient manner.

[4] Xie and Jiang (2020) consider slightly weaker assumptions: they measure both misspecification and concentrability via the $L_2(\mu)$ norm. Our analysis easily accommodates $L_2(\mu)$-concentrability, as can be seen from the proof. On the other hand, as described in Section 2.1, misspecification amplification is necessary under $L_2(\mu)$-misspecification.

[5] Many Bellman transfer coefficients exist, but a standard one is the smallest $C_{\mathrm{transfer}}$ such that $\max_{\pi, f \in \mathcal{F}} \| f - \mathcal{T} f \|^2_{L_2(d^\pi)} / \| f - \mathcal{T} f \|^2_{L_2(\mu)} \leq C_{\mathrm{transfer}}$. This coincides with ours under exact realizability/completeness, but we believe our definition is more appropriate for the misspecified case because it is equivalent to feature coverage under linear function approximation. Indeed, if $\mathcal{F}$ consists of linear functions in some feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ (but $\mathcal{T} f$ may not be linear due to misspecification), then our definition can be expressed via the features (as $\max_{\pi, \theta \in \mathbb{R}^d} \theta^\top \Sigma_\pi \theta / \theta^\top \Sigma_\mu \theta$, where $\Sigma_d = \mathbb{E}_d[\phi(s,a) \phi(s,a)^\top]$), but the standard definition cannot.

4.2 Online reinforcement learning

Setup and notation. We consider a finite-horizon episodic MDP $(P, R, H, s_1)$ over state space $\mathcal{S}$ and action space $\mathcal{A}$, where $H \in \mathbb{N}$ is the horizon, $P := \{P_h\}_{h=1}^H$ with $P_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the non-stationary
transition operator, $R := \{R_h\}_{h=1}^H$ with $R_h : \mathcal{S} \times \mathcal{A} \to [0,1]$ is the non-stationary reward function, and $s_1$ is a fixed starting state. A (non-stationary) policy $\pi := \{\pi_h\}_{h=1}^H$ is a sequence of mappings $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$, which induces a trajectory $(s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$ where $a_h \sim \pi_h(s_h)$, $r_h = R_h(s_h, a_h)$, and $s_{h+1} \sim P_h(s_h, a_h)$ for each time step. We use $\mathbb{P}_\pi[\cdot]$ and $\mathbb{E}_\pi[\cdot]$ to denote probability and expectation under this process, respectively. Let $d_h^\pi \in \Delta(\mathcal{S} \times \mathcal{A})$ denote the occupancy measure of $\pi$ at time step $h$, defined as $d_h^\pi(s,a) := \mathbb{P}_\pi[s_h = s, a_h = a]$. The value of policy $\pi$ is denoted $J(\pi) := \mathbb{E}_\pi\left[\sum_{h=1}^H r_h\right]$. Each policy has value functions $V_h^\pi : s \mapsto \mathbb{E}_\pi\left[\sum_{h'=h}^H r_{h'} \mid s_h = s\right]$ and $Q_h^\pi : (s,a) \mapsto \mathbb{E}_\pi\left[\sum_{h'=h}^H r_{h'} \mid s_h = s, a_h = a\right]$, and there exists an optimal policy $\pi^\star = \{\pi_h^\star\}_{h=1}^H$ that maximizes $V_h^\pi(s)$ simultaneously for each state $s \in \mathcal{S}$ and hence maximizes $J(\cdot)$. The optimal value function $Q_h^\star := Q_h^{\pi^\star}$ induces $\pi^\star$ via $\pi_h^\star : s \mapsto \arg\max_a Q_h^\star(s,a)$ and satisfies Bellman's equation: $Q_h^\star(s,a) = [\mathcal{T}_h Q_{h+1}^\star](s,a)$, where the Bellman operator $\mathcal{T}_h$ is defined via $[\mathcal{T}_h f_{h+1}](s,a) = R_h(s,a) + \mathbb{E}[\max_{a'} f_{h+1}(s_{h+1}, a') \mid s_h = s, a_h = a]$. We assume per-episode rewards satisfy $\sum_{h=1}^H r_h \in [0,1]$.

In online RL, we interact with the MDP for $T$ episodes, where in each episode $t$ we select a policy $\pi^{(t)}$ and collect the trajectory $(s_1^{(t)}, a_1^{(t)}, r_1^{(t)}, \ldots, s_H^{(t)}, a_H^{(t)}, r_H^{(t)})$ by taking actions $a_h^{(t)} = \pi_h^{(t)}(s_h^{(t)})$. We measure performance via the cumulative regret, defined as $\mathrm{Reg} := \sum_{t=1}^T J(\pi^\star) - J(\pi^{(t)})$. We equip the learner with a value function class $\mathcal{F} := \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where each $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \to [0,1]$.
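The Bellman operators $\mathcal{T}_h$ determine $Q_h^\star$ by backward induction from $Q_{H+1}^\star := 0$. A tabular sketch (synthetic MDP, with rewards scaled so the per-episode sum stays in $[0,1]$):

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, H = 4, 2, 3

P = rng.dirichlet(np.ones(nS), size=(H, nS, nA))   # P[h, s, a] over next states
R = rng.uniform(size=(H, nS, nA)) / H              # per-step rewards <= 1/H

# Backward induction: Q*_h = T_h Q*_{h+1}, starting from Q*_{H+1} = 0.
Q = np.zeros((H + 1, nS, nA))
for h in reversed(range(H)):
    Q[h] = R[h] + P[h] @ Q[h + 1].max(axis=1)

pi_star = Q[:H].argmax(axis=2)                     # greedy policies pi*_h
assert np.all(Q[:H] >= 0) and np.all(Q[:H] <= 1)   # values respect the reward bound
```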
Each $f \in \mathcal{F}$ induces a policy $\pi_f$ which, at time step $h$, takes actions via $\pi_{f,h}(s_h) = \arg\max_a f_h(s_h, a)$. We make the following assumptions:

• $L_\infty$-approximate realizability/completeness: For each $h \in [H]$ there exists $\bar f_h \in \mathcal{F}_h$ such that $\| \bar f_h - \mathcal{T}_h \bar f_{h+1} \|_\infty \leq \varepsilon_\infty$. Additionally, for each $f_{h+1} \in \mathcal{F}_{h+1}$ there exists $f_h \in \mathcal{F}_h$ such that $\| f_h - \mathcal{T}_h f_{h+1} \|_\infty \leq \varepsilon_\infty$.

• Coverability: There exists a constant $C_{\mathrm{cov}} \in [1, \infty)$ such that $\inf_{\mu_1, \ldots, \mu_H \in \Delta(\mathcal{S} \times \mathcal{A})} \sup_{\pi \in \Pi, h} \left\| \frac{d_h^\pi}{\mu_h} \right\|_\infty \leq C_{\mathrm{cov}}$, where $\Pi := \{\pi_f : \pi_{f,h}(s) = \arg\max_a f_h(s,a), f \in \mathcal{F}\}$ is the policy class induced by $\mathcal{F}$.

As in offline RL, there is a large body of recent work studying function approximation and structural conditions for sample-efficient online RL (c.f., Agarwal et al., 2019; Foster and Rakhlin, 2023). It is fairly standard to assume exact realizability and completeness, which is stronger than our version with misspecification. Coverability is a recently proposed structural condition (Xie et al., 2023): $C_{\mathrm{cov}}$ is known to be small in many MDP models of interest, but weaker conditions that enable sample efficiency are known. As we will see, the strength of the structural condition determines whether misspecification amplification can be avoided or not.

Algorithm and guarantee. The algorithm is a very minor modification to GOLF (Jin et al., 2021; Xie et al., 2023). To condense the notation, given a sample $(s_h^{(i)}, a_h^{(i)}, r_h^{(i)}, s_{h+1}^{(i)})$ and a function $f' \in \mathcal{F}_{h+1}$, define $x_h^{(i)} := (s_h^{(i)}, a_h^{(i)})$ and $y_{f',h}^{(i)} := r_h^{(i)} + \max_{a'} f'(s_{h+1}^{(i)}, a')$.
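Before turning to the algorithm, note that at each step $h$ the inner optimization in the coverability definition has a simple closed form: the best $\mu_h$ is proportional to the pointwise maximum occupancy, achieving the value $\sum_{x} \max_{\pi} d_h^\pi(x)$ (an elementary optimization fact, which the sketch below verifies numerically rather than cites):

```python
import numpy as np

rng = np.random.default_rng(6)
nX, n_policies = 6, 5   # nX = |S x A| at a fixed step h; a small induced policy class

# Synthetic occupancy measures standing in for the true d_h^pi of an MDP.
D = rng.dirichlet(np.ones(nX), size=n_policies)   # D[p] = occupancy of the p-th policy

# inf over mu_h of sup_pi ||d_h^pi / mu_h||_inf is attained by mu* proportional to
# the pointwise maximum occupancy, with value sum_x max_pi d_h^pi(x).
pointwise_max = D.max(axis=0)
C_cov_h = pointwise_max.sum()
mu_star = pointwise_max / C_cov_h
assert np.isclose(np.max(D / mu_star), C_cov_h)

# No other mu_h does better: every candidate yields a sup-ratio at least C_cov_h.
for _ in range(100):
    mu = rng.dirichlet(np.ones(nX))
    assert np.max(D / mu) >= C_cov_h - 1e-9
```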
At the beginning of episode $t$, define a version space
$$\mathcal{F}^{(t-1)} := \left\{ f \in \mathcal{F} : \forall h \in [H] : \max_{g_h \in \mathcal{F}_h} \sum_{i=1}^{t-1} W^\tau_{f_h, g_h}(x_h^{(i)}) \left\{ (f_h(x_h^{(i)}) - y_{f_{h+1},h}^{(i)})^2 - (g_h(x_h^{(i)}) - y_{f_{h+1},h}^{(i)})^2 \right\} \leq \beta \right\},$$
where $\beta > 0$ is a hyperparameter we will set below. Then, we define the optimistic value function $f^{(t)} := \arg\max_{f \in \mathcal{F}^{(t-1)}} f_1(s_1, \pi_{f,1}(s_1))$ and the induced policy $\pi^{(t)} := \pi_{f^{(t)}}$, collect a trajectory via $\pi^{(t)}$, and proceed to the next episode. Note that the only difference between this algorithm, which we call GOLF.DBR, and the version of GOLF studied by Xie et al. (2023) is that we use the filter $W^\tau_{f_h, g_h}(\cdot)$ in the construction of the version space. GOLF.DBR enjoys the following guarantee.

Theorem 4.2 (DBR for online RL). Fix $\delta \in (0,1)$, and assume that $\mathcal{F}$ is $L_\infty$-misspecified and $\mu$ satisfies coverability (as defined above). Consider GOLF.DBR with $\tau = 3\varepsilon_\infty$ and $\beta = c \log(TH|\mathcal{F}|/\delta)$. Then, with probability at least $1 - \delta$, we have
$$\mathrm{Reg} \leq O\left( \varepsilon_\infty H T + H \sqrt{C_{\mathrm{cov}} T \log(TH|\mathcal{F}|/\delta)} \log(T) \right).$$

Paralleling the discussion following Theorem 4.1, we emphasize two aspects of the result. The first is that it extends Theorem 1 of Xie et al. (2023) to the misspecified setting, with no degradation of the statistical term and without incurring a dependence on $\varepsilon_\infty \sqrt{C_{\mathrm{cov}}}$. In other words, it avoids misspecification amplification. The second remark is that, when taken with existing lower bounds (Du et al., 2020; Van Roy and Dong, 2019; Lattimore et al., 2020), Theorem 4.2 establishes a separation between coverability and structural parameters defined in terms of Bellman errors, which include the Bellman-Eluder dimension (Jin et al., 2021), bilinear rank (Du et al., 2021), and Bellman rank (Jiang et al., 2017).[6]
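When $\mathcal{F}$ is finite, the version space $\mathcal{F}^{(t-1)}$ can be enumerated directly. A toy sketch; as before, the filter is taken to be the disagreement indicator $\mathbb{1}\{|f - g| > \tau\}$ (our reading of Eq. (2)), and the regression targets are synthetic stand-ins for $y_{f_{h+1},h}^{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(7)
H, nX, nF, t, tau = 2, 8, 5, 50, 0.3

# F[h, j] = the j-th candidate function at step h on a discrete x = (s, a) space;
# candidate j denotes the tuple (f_{1,j}, ..., f_{H,j}).
F = rng.uniform(size=(H, nF, nX))
x = rng.integers(nX, size=(H, t))      # observed x_h^{(i)} indices
y = rng.uniform(size=(H, nF, t))       # hypothetical targets r + max_a' f_{h+1,j}(s', a')

def in_version_space(j, beta):
    """Membership of candidate j in F^{(t-1)}: the filtered excess loss at every step h,
    maximized over comparators g_h, must stay below the threshold beta."""
    for h in range(H):
        fvals = F[h, j, x[h]]
        excess = []
        for k in range(nF):            # max over g_h in F_h
            gvals = F[h, k, x[h]]
            W = np.abs(fvals - gvals) > tau   # assumed disagreement filter
            excess.append(np.sum(W * ((fvals - y[h, j]) ** 2 - (gvals - y[h, j]) ** 2)))
        if max(excess) > beta:
            return False
    return True

version_space = [j for j in range(nF) if in_version_space(j, beta=2.0)]
# GOLF.DBR then plays the greedy policy of the optimistic survivor, i.e. the f in the
# version space maximizing f_1(s_1, pi_{f,1}(s_1)), and repeats.
```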
This separation is more subtle than in offline RL because here, as long as the state-action space is finite, one can always use a "tabular" method and eliminate misspecification altogether, at the cost of $\mathrm{poly}(|\mathcal{S}|, |\mathcal{A}|) \cdot \sqrt{T}$ regret. To rule out this algorithm, we restrict to sample-efficient methods: in a setting where a particular structural parameter (e.g., coverability or Bellman rank) is bounded by $d$, we say that an algorithm is sample efficient if its statistical term scales as $\mathrm{poly}(d, \log(|\mathcal{F}|/\delta), H) \cdot o(T)$. The lower bounds show that, when the structural parameter involves Bellman errors (like the Bellman rank), $\varepsilon_\infty T \sqrt{d}$ misspecification error is necessary for sample-efficient algorithms.[7] On the other hand, under coverability, we can achieve misspecification error with no dependence on the structural parameter, in a sample-efficient manner.[8] This establishes that whether misspecification amplification can be avoided sample-efficiently depends on the structural properties of the MDP. To our knowledge, this is a novel insight into the interaction between the structural and function approximation assumptions in online RL.

5 Related work

There is a vast body of work studying distribution shift broadly and covariate shift in particular. We focus on the most closely related techniques for the covariate shift setting and refer the reader to Quinonero-Candela et al. (2008); Sugiyama and Kawanabe (2012); Shen et al. (2021) for a more comprehensive treatment.

Reweighting and robust optimization. Perhaps the most common way to correct for covariate shift is by reweighting each example $(x, y)$ in the objective function by the density ratio $w(x) := d_{\mathrm{test}}(x) / d_{\mathrm{train}}(x)$. This method has been studied in a long series of works (Shimodaira, 2000; Cortes et al., 2010; Cortes and Mohri, 2014).
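In a squared-loss regression, this correction is a one-line change when the ratio $w(x)$ is known. A sketch with Gaussian train/test covariate distributions (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 2000, 3

# Covariate shift: train draws x ~ N(0, I); f*(x) = x @ w_true with small noise.
X = rng.normal(0.0, 1.0, size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Known density ratio w(x) = d_test(x) / d_train(x) for a mean-shifted test
# distribution N(shift, I): w(x) = exp(shift . x - ||shift||^2 / 2).
mean_shift = np.array([1.0, 0.0, 0.0])
log_ratio = X @ mean_shift - mean_shift @ mean_shift / 2.0
weights = np.exp(log_ratio)

# Importance-weighted least squares: solve (X^T diag(w) X) beta = X^T diag(w) y.
WX = X * weights[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ y)
assert np.allclose(beta, w_true, atol=0.05)   # well-specified, so both are consistent
```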
In its simplest form it requires knowledge of $D_{\mathrm{test}}$ via the density ratios, so it is not directly applicable to our adversarial covariate shift setting. Extensions include approaches that estimate density ratios using unlabeled samples from $D_{\mathrm{test}}$ (Huang et al., 2006; Sugiyama et al., 2007; Gretton et al., 2009; Yu and Szepesvári, 2012) and robust optimization approaches that employ an auxiliary hypothesis class of distributions $\mathcal{P}$ containing $D_{\mathrm{test}}$ (Hashimoto et al., 2018; Sagawa et al., 2020; Duchi and Namkoong, 2021; Agarwal and Zhang, 2022). However, these still require prior knowledge about $D_{\mathrm{test}}$: in particular, it is known that the sample complexity of robust optimization scales with the statistical complexity of the auxiliary class $\mathcal{P}$ (Duchi and Namkoong, 2021), leading to vacuous bounds in the absence of inductive bias.

Ge et al. (2023) study statistical inference under covariate shift in well- and misspecified settings. They show that maximum likelihood estimation on $D_{\mathrm{train}}$ is inconsistent under misspecification, a result which is conceptually similar to our lower bound for ERM. However, their construction is not $L_\infty$-misspecified, so it is not directly comparable. Algorithmically, they use reweighting for the misspecified case, which, as mentioned, cannot be implemented in our setting.

Sup-norm convergence and function class-specific results. Another line of work provides specialized analyses for specific function classes of interest, such as linear (Lei et al., 2021), nonparametric (Kpotufe and Martinet, 2018; Pathak et al., 2022; Ma et al., 2023), and some neural network (Dong and Ma, 2023a) classes. The overarching technical approach in these works is to measure distance between distributions in a manner that captures the structure of the function class, analogously to learning-theoretic results for domain adaptation (Ben-David et al., 2006; Mansour et al., 2009).
A complementary approach is based on sup-norm convergence, which seeks to control $\| \hat f - f^\star \|_\infty$ for a predictor $\hat f$ and is naturally robust to covariate shift. Sup-norm convergence has been studied for various function classes (c.f., Schmidt-Hieber and Zamolodtchikov, 2022; Dong and Ma, 2023b), but unfortunately is not possible in the general statistical learning setup (Dong and Ma, 2023b). We mention sup-norm convergence primarily to contrast with our quantile guarantee in Eq. (4), which controls the probability over $x$ of large errors rather than the magnitude of the errors themselves, and which is attainable for any function class, even with misspecification. All of these works differ from ours in that (a) they consider specific function classes and (b) they operate closer to the well-specified regime than we do (e.g., in the nonparametric setting, one can drive the misspecification error to zero).

[6] As with Bellman transfer coefficients, we believe these definitions should be adjusted to accommodate misspecification. See Definition 10 in Jiang et al. (2017) for an example.

[7] Formally, for any $\zeta > 0$ one requires at least $\exp(d^{2\zeta})$ samples to find a $d^{1/2 - \zeta} \varepsilon_\infty$-suboptimal policy (Lattimore et al., 2020).

[8] We believe that misspecification error $\varepsilon_\infty H T$ is optimal under coverability and that $\varepsilon_\infty H T \sqrt{d}$ is optimal under structural parameters like Bellman rank. However, it remains open to establish the necessity of the horizon factors.

Related work in reinforcement learning. Our results for offline and online RL build directly on the analyses in Xie and Jiang (2020) and Xie et al. (2023), respectively. The former contributes to a long line of work on offline RL (Munos, 2003, 2007; Antos et al.
, 2008; Chen and Jiang, 2019), while the latter is part of a series of works establishing structural conditions under which online reinforcement learning is statistically tractable (c.f., Agarwal et al., 2019; Foster and Rakhlin, 2023). Many of these works do account for misspecification, but the question of whether misspecification amplification can be avoided is not considered.

Results that do focus on misspecification primarily consider linear function approximation. In the simpler offline policy evaluation setting, several works study least squares temporal difference learning (LSTD) (Bradtke and Barto, 1996) with misspecification (Tsitsiklis and Van Roy, 1996; Yu and Bertsekas, 2010; Mou et al., 2022). Recently, Amortila et al. (2023) precisely characterized the optimal misspecification amplification (i.e., approximation factors) achievable across a range of settings, showing that LSTD is essentially optimal in most regimes. The exception is when the offline data distribution is supported on the entire state space: there, one can employ a "tabular" model-based algorithm to incur no approximation error whatsoever, but the sample complexity scales polynomially with $|\mathcal{S}|$. Our offline RL results are conceptually similar because, under concentrability (which essentially implies full support), the standard minimax algorithm does not achieve the optimal approximation factor. A crucial difference is that our disagreement-based variant achieves an improved approximation factor without incurring any sample complexity overhead. For the more challenging offline policy optimization and online RL settings, Du et al. (2020); Van Roy and Dong (2019); Lattimore et al. (2020) establish conditions under which misspecification amplification is necessary.
As discussed above, combining our results with these lower bounds and their variations reveals new tradeoffs between coverage/structural and function approximation conditions, distinct from tradeoffs established by prior work (Xie and Jiang, 2021; Foster et al., 2022).

6 Discussion

This paper highlights an intriguing interplay between misspecification and distribution shift, exposing the undesirable misspecification amplification property of ERM and proposing disagreement-based regression as a remedy. We have shown that using disagreement-based regression in online and offline reinforcement learning yields new technical results and reveals new tradeoffs between coverage/structural assumptions and function approximation assumptions. We close by mentioning several interesting avenues for future work. There are a number of directions that pertain to the core setting of misspecified regression under covariate shift; for example: (a) extending the analysis of DBR to infinite function classes, other loss functions, and other notions of misspecification; (b) deriving a more computationally efficient procedure—perhaps in an oracle model of computation—that avoids misspecification amplification; and (c) determining the optimal achievable approximation factor. Pertaining to reinforcement learning theory, we believe the most pressing direction is to deepen our understanding of the relationship between coverage/structural assumptions (for offline/online RL, respectively) and function approximation assumptions, and we believe misspecification provides a novel lens to study this relationship. It is also worthwhile to consider other applications involving distribution shift where DBR or related procedures may reveal new conceptual insights.
Finally, it would also be interesting to study empirical issues: to understand how pervasive and problematic misspecification amplification is, develop practical interventions, and consider applying them to distribution shift and deep reinforcement learning scenarios. In short, there is much more to understand about the interplay between misspecification and distribution shift, and we look forward to progress in the years to come.

Acknowledgements

We thank Adam Block for helpful feedback on an early version of the manuscript.

References

Alekh Agarwal and Tong Zhang. Minimax regret optimization for robust machine learning under distribution shift. In Conference on Learning Theory, 2022.

Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/, 2019. Version: January 31, 2022.

Philip Amortila, Nan Jiang, and Csaba Szepesvári. The optimal approximation factors in misspecified off-policy value function estimation. In International Conference on Machine Learning, 2023.

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 2022.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.

Jean-Yves Audibert. Progressive mixture rules are deviation suboptimal. Advances in Neural Information Processing Systems, 2007.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization.
Operations Research, 2015.

Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 1996.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.

Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 2014.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 2010.

Kefan Dong and Tengyu Ma. First steps toward understanding the extrapolation of nonlinear models to unseen domains. In International Conference on Learning Representations, 2023a.

Kefan Dong and Tengyu Ma. Toward $L_\infty$-recovery of nonlinear functions: A polynomial sample complexity bound for Gaussian random fields. In Conference on Learning Theory, 2023b.

Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, 2021.

Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.

John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 2021.

Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.

Dylan J Foster and Alexander Rakhlin. Foundations of reinforcement learning and interactive decision making. arXiv:2312.16730, 2023.
Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. In Conference on Learning Theory, 2021.

Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory, 2022.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 2014.

Jiawei Ge, Shange Tang, Jianqing Fan, Cong Ma, and Chi Jin. Maximum likelihood estimation is all you need for well-specified covariate shift, 2023.

Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 2009.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 2014.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 2018.

Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 2006.

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 2021.

Keith Knight.
On the asymptotic distribution of the $L_\infty$ estimator in linear regression. Technical report, University of Toronto, 2017.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, 2021.

Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, 2018.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. Journal of Machine Learning Research, 2019.

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, 2020.

Guillaume Lecué and Philippe Rigollet. Optimal learning with Q-aggregation. The Annals of Statistics, 2014.

Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, 2021.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020.

Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, 2015.

Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Exposing attention glitches with flip-flop language modeling. Advances in Neural Information Processing Systems, 2023.
Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 2023.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms, 2009.

John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, 2021.

Wenlong Mou, Ashwin Pananjady, and Martin J Wainwright. Optimal oracle inequalities for solving projected fixed-point equations. Mathematics of Operations Research, 2022.

Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.

Rémi Munos. Performance bounds in $L_p$-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.

Reese Pathak, Cong Ma, and Martin Wainwright. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, 2022.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, 2020.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, 2019.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.
Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations, 2020.

Johannes Schmidt-Hieber and Petr Zamolodtchikov. Local convergence rates of the least squares estimator with applications to transfer learning, 2022.

Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey, 2021.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.

Masashi Sugiyama and Motoaki Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems, 2007.

Matus Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/, 2021. Version: 2021-10-27 v0.0-e7150f2d (alpha).

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems, 1996.

Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds, 2019.

Tengyang Xie and Nan Jiang. $Q^\star$ approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, 2020.

Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International Conference on Machine Learning, 2021.
Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement learning. In International Conference on Learning Representations, 2023.
Yufei Yi and Matey Neykov. Non-asymptotic bounds for the $L_\infty$ estimator in linear regression with uniform noise. Bernoulli, 2024.
Huizhen Yu and Dimitri P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 2010.
Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. In International Conference on Machine Learning, 2012.
Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with LEGO: a synthetic reasoning task. 2022.

A Proofs for Section 2

A.1 Analysis for ERM

Proposition 2.1 (ERM upper bound). For any $\delta \in (0,1)$, with probability at least $1-\delta$, ERM satisfies
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{ERM}}) \le O\Big( C_\infty \varepsilon_\infty^2 + \frac{C_\infty \log(|\mathcal F|/\delta)}{n} \Big). \]

Proof of Proposition 2.1. The proof of Proposition 2.1 is fairly standard, particularly in the well-specified case where $\varepsilon_\infty = 0$. Our analysis handling misspecification is adapted from the proof of Lemma 16 in Chen and Jiang (2019). For the majority of the proof we only consider $\mathcal D_{\mathrm{train}}$, and we consequently omit the subscript when indexing expectations, variances, and the risk functional. Define
\[ R(f) := \mathbb{E}[(f(x) - f^\star(x))^2] \quad \text{and} \quad \hat R(f) := \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2, \]
so that $\hat f^{(n)}_{\mathrm{ERM}} := \arg\min_{f \in \mathcal F} \hat R(f)$. We establish concentration for the "excess risk" functional $\hat R(f) - \hat R(\bar f)$. For any $f \in \mathcal F$, we establish the following facts:
\[ \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2] = \mathbb{E}[(f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2], \tag{8} \]
\[ \mathrm{Var}[(f(x)-y)^2 - (\bar f(x)-y)^2] \le 8\,\mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2] + 16\varepsilon_\infty^2. \tag{9} \]
Eq.
(8) implies that $\mathbb{E}[\hat R(f) - \hat R(\bar f)] = R(f) - R(\bar f)$, as desired. Eq. (9) will enable us to achieve a fast convergence rate. The former is derived as follows. Observe that, conditional on any $x$, we have
\begin{align*}
\mathbb{E}[(f(x)-y)^2 - (\bar f(x)-y)^2 \mid x] &= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x)+f^\star(x)-y)^2 \mid x] \\
&= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x))^2 - 2(\bar f(x)-f^\star(x))(f^\star(x)-y) - (f^\star(x)-y)^2 \mid x] \\
&= \mathbb{E}[(f(x)-y)^2 - (\bar f(x)-f^\star(x))^2 - (f^\star(x)-y)^2 \mid x] \\
&= f(x)^2 - f^\star(x)^2 - 2\,\mathbb{E}_{\mathrm{train}}[y \mid x]\,(f(x)-f^\star(x)) - (\bar f(x)-f^\star(x))^2 \\
&= (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2.
\end{align*}
Eq. (9) is derived as follows:
\begin{align*}
\mathrm{Var}[(f(x)-y)^2 - (\bar f(x)-y)^2] &\le \mathbb{E}\big[ \big((f(x)-y)^2 - (\bar f(x)-y)^2\big)^2 \big] \\
&= \mathbb{E}\big[ (f(x)-\bar f(x))^2 (f(x)+\bar f(x)-2y)^2 \big] \\
&\le 4\,\mathbb{E}\big[ (f(x)-\bar f(x))^2 \big] \\
&\le 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 + (\bar f(x)-f^\star(x))^2 \big] \\
&= 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 + 2(\bar f(x)-f^\star(x))^2 \big] \\
&\le 8\,\mathbb{E}\big[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 \big] + 16\varepsilon_\infty^2.
\end{align*}
Finally, we apply Eq. (8). Now, Bernstein's inequality and a union bound over $f \in \mathcal F$ give that, with probability at least $1-\delta$,
\[ \forall f \in \mathcal F: \quad R(f) - R(\bar f) - \big(\hat R(f) - \hat R(\bar f)\big) \le \sqrt{\frac{\big(16(R(f)-R(\bar f)) + 32\varepsilon_\infty^2\big)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n}. \]
Since $\hat f^{(n)}_{\mathrm{ERM}}$ minimizes $\hat R(f)$, we have $\hat R(\hat f^{(n)}_{\mathrm{ERM}}) - \hat R(\bar f) \le 0$, so we can deduce that
\[ R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \le \sqrt{\frac{\big(16(R(\hat f^{(n)}_{\mathrm{ERM}})-R(\bar f)) + 32\varepsilon_\infty^2\big)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n}. \]
Using the AM-GM inequality ($\sqrt{ab} \le a/2 + b/2$), the right-hand side can be simplified to yield
\[ R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \le \frac{1}{2}\big( R(\hat f^{(n)}_{\mathrm{ERM}}) - R(\bar f) \big) + \varepsilon_\infty^2 + \frac{28\log(|\mathcal F|/\delta)}{3n}. \]
Re-arranging and using that $R_{\mathrm{train}}(\bar f) \le \varepsilon_\infty^2$, we obtain
\[ R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{ERM}}) \le 3\varepsilon_\infty^2 + \frac{56\log(|\mathcal F|/\delta)}{3n}. \]
Finally, we bound the risk under $\mathcal D_{\mathrm{test}}$ via a standard importance-weighting argument:
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{ERM}}) = \mathbb{E}_{\mathrm{train}}\Big[ \frac{d_{\mathrm{test}}(x)}{d_{\mathrm{train}}(x)} \big(\hat f^{(n)}_{\mathrm{ERM}}(x) - f^\star(x)\big)^2 \Big] \le \sup_{x \in \mathcal X}\Big| \frac{d_{\mathrm{test}}(x)}{d_{\mathrm{train}}(x)} \Big| \cdot \Big( 3\varepsilon_\infty^2 + \frac{56\log(|\mathcal F|/\delta)}{3n} \Big). \]
Note that we crucially use that $(\hat f^{(n)}_{\mathrm{ERM}}(x) - f^\star(x))^2$ is non-negative here. This proves the proposition.

Proposition 2.2 (ERM lower bound). For all $\varepsilon_\infty \in (0,1)$ and $C_\infty \in [1,\infty)$ such that $\sqrt{C_\infty}\,\varepsilon_\infty \le 1/2$, and for all $\zeta > 0$ sufficiently small, there exist distributions $\mathcal D_{\mathrm{train}}, \mathcal D_{\mathrm{test}}$ and a function class $\mathcal F$ with $|\mathcal F| = 2$ satisfying Assumption 2.1–Assumption 2.4 (with parameters $\varepsilon_\infty, C_\infty$) such that $R_{\mathrm{test}}(\hat f^{(\infty)}_{\mathrm{ERM}}) = C_\infty \varepsilon_\infty^2 - \zeta$.

Proof of Proposition 2.2. Fix $\varepsilon_\infty \in (0,1)$ and $C_\infty \ge 1$ such that $\sqrt{C_\infty}\,\varepsilon_\infty \le 1/2$, and let $0 < \zeta < \sqrt{C_\infty}\,\varepsilon_\infty$. Let $\mathcal X = [0,1]$ and let $\mathcal D_{\mathrm{train}}$ be the distribution over $(x,y)$ where $x \sim \mathrm{Uniform}(\mathcal X)$ and $y \sim \mathrm{Ber}(1/2)$. Let $\tilde{\mathcal X} := [0, 1/C_\infty] \subset \mathcal X$ and let $\mathcal D_{\mathrm{test}}$ be the distribution over $(x,y)$ where $x \sim \mathrm{Uniform}(\tilde{\mathcal X})$ and $y \sim \mathrm{Ber}(1/2)$. These choices yield $f^\star(x) = 1/2$ for all $x \in \mathcal X$, satisfy Assumption 2.1, and ensure that $\sup_{x \in \mathcal X}|d_{\mathrm{test}}(x)/d_{\mathrm{train}}(x)| = C_\infty$. Let $\mathcal F = \{\bar f, f_{\mathrm{bad}}\}$, where $\bar f(x) = 1/2 + \varepsilon_\infty$ for all $x \in \mathcal X$ (satisfying Assumption 2.3) and $f_{\mathrm{bad}}$ is defined as
\[ f_{\mathrm{bad}}(x) = \begin{cases} 1/2 & \text{if } x \notin \tilde{\mathcal X} \\ 1/2 + \zeta & \text{if } x \in \tilde{\mathcal X}. \end{cases} \]
By definition, observe that $\hat f^{(\infty)}_{\mathrm{ERM}} = f_{\mathrm{bad}}$ as long as $\|f_{\mathrm{bad}} - f^\star\|^2_{L_2(\mathcal D_{\mathrm{train}})} < \|\bar f - f^\star\|^2_{L_2(\mathcal D_{\mathrm{train}})}$. A direct calculation shows that this inequality is satisfied for any $\zeta < \sqrt{C_\infty}\,\varepsilon_\infty$. However, $f_{\mathrm{bad}}$ has large population risk under $\mathcal D_{\mathrm{test}}$; in particular, $R_{\mathrm{test}}(f_{\mathrm{bad}}) = \mathbb{E}_{\mathrm{test}}[(f_{\mathrm{bad}}(x) - f^\star(x))^2] = \zeta^2$, which we can make arbitrarily close to $C_\infty \varepsilon_\infty^2$.
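The construction above is simple enough to check numerically. The following sketch (the parameter values are illustrative choices, not taken from the analysis) verifies that the population ERM solution is $f_{\mathrm{bad}}$ whenever $\zeta < \sqrt{C_\infty}\,\varepsilon_\infty$, and that its test risk $\zeta^2$ approaches $C_\infty\varepsilon_\infty^2$:

```python
import math

# Illustrative parameters: density-ratio bound and misspecification level.
C_inf, eps_inf = 16.0, 0.1
zeta = math.sqrt(C_inf) * eps_inf - 1e-6  # any zeta < sqrt(C_inf) * eps_inf works

# Population training risks under D_train = Uniform([0, 1]), with f_star = 1/2:
# f_bad deviates by zeta only on the test support [0, 1/C_inf] (mass 1/C_inf),
# while f_bar = 1/2 + eps_inf deviates by eps_inf everywhere.
risk_train_bad = zeta ** 2 / C_inf
risk_train_bar = eps_inf ** 2

# Infinite-data ERM prefers f_bad, since zeta^2 / C_inf < eps_inf^2:
assert risk_train_bad < risk_train_bar

# But under D_test = Uniform([0, 1/C_inf]), f_bad is off by zeta everywhere:
risk_test_bad = zeta ** 2
print(round(risk_test_bad, 6), C_inf * eps_inf ** 2)  # ~0.16 vs ~0.16
```

The amplification factor here is exactly the density ratio: the region where ERM tolerates a large error has vanishing mass under training but full mass under testing.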
A.2 Discussion of other algorithms

Star algorithm. Audibert's star algorithm (Audibert, 2007; Liang et al., 2015) is a two-stage regression procedure that achieves the fast convergence rate for non-convex classes in misspecified or agnostic regression. Given that the construction used to prove Proposition 2.2 has a finite (and hence non-convex) function class, one might ask whether the star algorithm can avoid misspecification amplification. We briefly sketch here why this is not the case. In the context of the construction, where $\mathcal F = \{f_{\mathrm{bad}}, \bar f\}$, the asymptotic version of the star algorithm is to compute
\[ \hat f_{\mathrm{star}} := \arg\min_{f_\alpha : \alpha \in [0,1]} \mathbb{E}_{\mathrm{train}}\big[ (f_\alpha(x) - f^\star(x))^2 \big], \quad \text{where } f_\alpha(x) = (1-\alpha)f_{\mathrm{bad}}(x) + \alpha\bar f(x). \]
We claim that when $\zeta = \sqrt{C_\infty}\,\varepsilon_\infty$, the optimal choice for $\alpha$ is exactly $1/2$. The prediction error under $\mathcal D_{\mathrm{test}}$ for this choice is, unfortunately, exactly $\frac{1}{4}(\sqrt{C_\infty}+1)^2\varepsilon_\infty^2$, which still manifests misspecification amplification. Note that, due to the simplicity of our construction, the same argument applies to other improper learning schemes based on convexification (c.f., Lecué and Rigollet, 2014). To see that the minimum is achieved at $\alpha = 1/2$, we write the optimization problem over $\alpha$ as
\[ \arg\min_{\alpha \in [0,1]} \frac{1}{C_\infty}\Big( (1-\alpha)\sqrt{C_\infty}\,\varepsilon_\infty + \alpha\varepsilon_\infty \Big)^2 + \Big( 1 - \frac{1}{C_\infty} \Big)(\alpha\varepsilon_\infty)^2 = \arg\min_{\alpha \in [0,1]} \alpha^2 + (1-\alpha)^2 + \frac{2\alpha(1-\alpha)}{\sqrt{C_\infty}}. \]
The derivative, w.r.t. $\alpha$, of the latter objective is
\[ \frac{d}{d\alpha}\Big( \alpha^2 + (1-\alpha)^2 + \frac{2\alpha(1-\alpha)}{\sqrt{C_\infty}} \Big) = 2\alpha - 2(1-\alpha) + \frac{2 - 4\alpha}{\sqrt{C_\infty}} = \Big( 2 - \frac{2}{\sqrt{C_\infty}} \Big)(2\alpha - 1). \]
Since $C_\infty > 1$, the second derivative is non-negative, so the optimization problem is convex. Moreover, the derivative is zero at $\alpha = 1/2$, showing that this is a minimizer of the optimization problem.

$L_\infty$ regression.
Given that we assume $L_\infty$-misspecification, and in light of the construction for Proposition 2.2, it is tempting to optimize the maximal absolute deviation instead of the square loss:
\[ \hat f^{(n)}_\infty \leftarrow \arg\min_{f \in \mathcal F} \max_i |f(x_i) - y_i|. \]
This procedure is known as $L_\infty$ regression or the Chebyshev estimator and has been studied in the statistics community (Knight, 2017; Yi and Neykov, 2024). These analyses primarily consider the well-specified setting with noise that is uniformly distributed, i.e., $y_i = f^\star(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathrm{Unif}([-a,a])$ for some $a \ge 0$. We believe such analyses can extend to the $L_\infty$-misspecified setting to show that the procedure avoids misspecification amplification. However, strong assumptions on the noise are crucial, as $L_\infty$ regression can be inconsistent under more general conditions. We illustrate with a simple example. Let $\mathcal X = \{x\}$ be a singleton, $y \sim \mathrm{Ber}(1/4)$, and let $\mathcal F = \{f^\star : x \mapsto 1/4,\ f : x \mapsto 1/2\}$ be a class with two functions. For all $n$ sufficiently large, the dataset will contain the sample $(x,1)$, at which point $f^\star$ will have $L_\infty$ error $3/4$ while $f$ will have error $1/2$. Thus the method will be inconsistent.

A.3 Analysis for DBR

We begin with the proofs of Lemma 3.1 and Lemma 3.2, thus completing steps one and two of the proof. Then we turn to proving the corollaries.

Lemma 3.1 (Non-negativity). With $\tau \ge 2\varepsilon_\infty$ and for any $\bar f \in \mathcal F$ such that $\|\bar f - f^\star\|_\infty \le \varepsilon_\infty$, we have
\[ \mathcal L(f; \bar f) \ge (\tau^2 - 2\tau\varepsilon_\infty)\Pr[W^\tau_{f,\bar f}(x)] \ge 0. \]

Proof of Lemma 3.1. Following the calculation used to derive Eq. (8), we have that, conditional on any $x$,
\[ \mathbb{E}_{\mathrm{train}}[(f(x)-y)^2 - (\bar f(x)-y)^2 \mid x] = (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2. \]
Under the event $x \in W^\tau_{f,\bar f}$ with $\tau \ge 2\varepsilon_\infty$, we claim that this must be non-negative.
In particular,
\[ |f(x) - f^\star(x)| \ge |f(x) - \bar f(x)| - |\bar f(x) - f^\star(x)| \ge \tau - \varepsilon_\infty \ge \varepsilon_\infty \ge 0. \]
Therefore,
\[ (f(x)-f^\star(x))^2 - (\bar f(x)-f^\star(x))^2 \ge (\tau - \varepsilon_\infty)^2 - \varepsilon_\infty^2 \ge \tau^2 - 2\tau\varepsilon_\infty. \]
The right-hand side is non-negative whenever $\tau \ge 2\varepsilon_\infty$.

Lemma 3.2 (Concentration). Fix $\delta \in (0,1)$ and $\tau \ge 3\varepsilon_\infty$, and define $\varepsilon_{\mathrm{stat}} := \frac{80\log(|\mathcal F|/\delta)}{3n}$. Under Assumption 2.3, for any $\bar f \in \mathcal F$ such that $\|\bar f - f^\star\|_\infty \le \varepsilon_\infty$, with probability at least $1-\delta$ we have
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) \le 2\hat{\mathcal L}(f;\bar f) + \varepsilon_{\mathrm{stat}}, \quad \text{and equivalently,} \quad \hat{\mathcal L}(\bar f; f) \le \frac{1}{2}\big( \mathcal L(\bar f; f) + \varepsilon_{\mathrm{stat}} \big). \]

Proof of Lemma 3.2. The concentration inequality is similar to the one used in the proof of Proposition 2.1. We apply Bernstein's inequality and a union bound to the empirical disagreement-based loss $\hat{\mathcal L}(f;\bar f)$ for each $f \in \mathcal F$. To do so, we must calculate the mean, variance, and range of $\hat{\mathcal L}(f;\bar f)$. Note that, by the same calculation as in the proof of Proposition 2.1, we have $\mathbb{E}[\hat{\mathcal L}(f;\bar f)] = \mathcal L(f;\bar f)$, and the range of each random variable in the empirical average is 1. The variance calculation, however, is slightly different:
\begin{align*}
\mathrm{Var}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\} \big] &\le \mathbb{E}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\}^2 \big] \\
&\le \mathbb{E}\big[ W^\tau_{f,\bar f}(x)(f(x)-\bar f(x))^2(f(x)+\bar f(x)-2y)^2 \big] \\
&\le 4\,\mathbb{E}\big[ W^\tau_{f,\bar f}(x)(f(x)-\bar f(x))^2 \big].
\end{align*}
Next, we consider a fixed $x$ and define $a := f(x) - f^\star(x)$ and $b := f^\star(x) - \bar f(x)$, so that we can write $(f(x)-\bar f(x))^2 = (f(x)-f^\star(x)+f^\star(x)-\bar f(x))^2 = (a+b)^2$. Now, when $\tau \ge 3\varepsilon_\infty$ we have
\[ W^\tau_{f,\bar f}(x) = 1 \;\Rightarrow\; |a| = |f(x)-f^\star(x)| \ge |f(x)-\bar f(x)| - \varepsilon_\infty \ge 2\varepsilon_\infty. \]
Along with the fact that $|b| = |\bar f(x) - f^\star(x)| \le \varepsilon_\infty$, this implies that $|b| \le |a|/2$, or equivalently that $b^2 \le a^2/4$.
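This regime admits a quick numerical sanity check (an illustrative sketch, not part of the proof): whenever $b^2 \le a^2/4$, the cross term $(a+b)^2$ is dominated by $3(a^2 - b^2)$, which is the elementary inequality driving the variance bound.

```python
import random

random.seed(0)
# Whenever |b| <= |a| / 2 -- the regime enforced by the disagreement
# indicator with tau >= 3 * eps_inf -- we have (a + b)^2 <= 3 * (a^2 - b^2):
# (a + b)^2 <= (3|a|/2)^2 = 9a^2/4, and 3(a^2 - b^2) >= 3a^2 - 3a^2/4 = 9a^2/4.
for _ in range(100_000):
    a = random.uniform(-1.0, 1.0)
    b = random.uniform(-abs(a) / 2, abs(a) / 2)
    assert (a + b) ** 2 <= 3 * (a ** 2 - b ** 2) + 1e-12
print("inequality verified on random samples")
```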
Using this, we can deduce that
\[ (a+b)^2 \le \frac{9a^2}{4} = 3a^2 - \frac{3a^2}{4} \le 3a^2 - 3b^2 = 3(a^2 - b^2). \]
Re-introducing the definitions of $a$ and $b$, we have
\[ \mathrm{Var}\big[ W^\tau_{f,\bar f}(x) \cdot \{(f(x)-y)^2 - (\bar f(x)-y)^2\} \big] \le 12\,\mathcal L(f;\bar f). \]
Now, applying Bernstein's inequality and a union bound over all $f \in \mathcal F$ yields that, with probability $1-\delta$,
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) - \hat{\mathcal L}(f;\bar f) \le \sqrt{\frac{24\,\mathcal L(f;\bar f)\log(|\mathcal F|/\delta)}{n}} + \frac{4\log(|\mathcal F|/\delta)}{3n} \le \frac{1}{2}\mathcal L(f;\bar f) + \frac{40\log(|\mathcal F|/\delta)}{3n}. \]
Re-arranging proves the first statement, and the second statement follows from the symmetries $\hat{\mathcal L}(f;g) = -\hat{\mathcal L}(g;f)$ and $\mathcal L(f;g) = -\mathcal L(g;f)$.

Corollary 2.1 (Covariate shift for DBR). Fix $\delta \in (0,1)$. Under Assumption 2.1–Assumption 2.4, with probability at least $1-\delta$, $\hat f^{(n)}_{\mathrm{DBR}}$ with $\tau = 3\varepsilon_\infty$ satisfies
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 17\varepsilon_\infty^2 + O\Big( \frac{C_\infty\log(|\mathcal F|/\delta)}{n} \Big). \tag{5} \]

Proof of Corollary 2.1. Beginning with the risk under $\mathcal D_{\mathrm{test}}$ and taking $\tau = 3\varepsilon_\infty$, we can write
\begin{align*}
R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) &= \mathbb{E}_{\mathrm{test}}\big[ (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&= \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| < 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le 16\varepsilon_\infty^2 + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le 17\varepsilon_\infty^2 + \mathbb{E}_{\mathrm{test}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge 4\varepsilon_\infty\} \cdot \{(\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 - \varepsilon_\infty^2\} \big].
\end{align*}
Note that, due to the indicator, the quantity inside the expectation is non-negative. Therefore, via exactly the same importance-weighting argument as used in the proof of Proposition 2.1, the latter term is at most $C_\infty$ times the quantity bounded in Eq. (3).

Corollary 2.2 (Well-specified case). Fix $\delta \in (0,1)$.
Under Assumption 2.1–Assumption 2.4 (with $\varepsilon_\infty = 0$), with probability at least $1-\delta$, $\hat f^{(n)}_{\mathrm{DBR}}$ with $\tau \le O\big( \sqrt{\log(|\mathcal F|/\delta)/n} \big)$ satisfies
\[ R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{DBR}}) \le O\Big( \frac{\log(|\mathcal F|/\delta)}{n} \Big) \quad \text{and} \quad R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le O\Big( \frac{C_\infty\log(|\mathcal F|/\delta)}{n} \Big). \tag{6} \]

Proof of Corollary 2.2. Let $\Delta$ denote the right-hand side of Eq. (3). Note that in the well-specified case where $\varepsilon_\infty = 0$, Theorem 2.1 ensures that
\[ \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \le \Delta. \]
Then, if we take $\tau \le \sqrt{\Delta}$, we have
\begin{align*}
R_{\mathrm{train}}(\hat f^{(n)}_{\mathrm{DBR}}) &= \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| < \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] + \mathbb{E}_{\mathrm{train}}\big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau\} \cdot (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 \big] \\
&\le \tau^2 + \Delta \le 2\Delta.
\end{align*}
This proves the corollary.

A.4 Extensions

In this section, we provide two results mentioned in Section 2. First, we improve the approximation factor in Corollary 2.1 from 17 to 10, albeit at the cost of a worse statistical term. Second, we show how to choose $\tau$ in a data-driven fashion to adapt to an unknown misspecification level $\varepsilon_\infty$.

Proposition A.1 (Improved approximation factor). Under Assumption 2.1–Assumption 2.4, with $\tau = 2\varepsilon_\infty$ and for $\delta \in (0,1)$, we have that, with probability at least $1-\delta$:
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 10\varepsilon_\infty^2 + C_\infty \cdot O\Big( \sqrt{\frac{\log(|\mathcal F|/\delta)}{n}} \Big). \tag{10} \]

Proof sketch. The proof is essentially identical to that of Theorem 2.1, except that we replace the concentration statement of Lemma 3.2 with a simpler one that relies on Hoeffding's inequality. The new concentration statement is that for any $\tau \ge 0$ and $\delta \in (0,1)$, with probability $1-\delta$ we have
\[ \forall f \in \mathcal F: \quad \mathcal L(f;\bar f) \le \hat{\mathcal L}(f;\bar f) + \varepsilon_{\mathrm{slow}}, \quad \text{where } \varepsilon_{\mathrm{slow}} := c\sqrt{\frac{\log(|\mathcal F|/\delta)}{n}} \]
for some universal constant $c > 0$. This follows by a standard application of Hoeffding's inequality and a union bound but, importantly, does not impose the restriction that $\tau \ge 3\varepsilon_\infty$.
Now the analysis used to prove Theorem 2.1 yields that for any $\tau \ge 2\varepsilon_\infty$:
\[ \mathbb{E}_{\mathrm{train}}\Big[ \mathbb{1}\{|\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x)| \ge \tau + \varepsilon_\infty\} \cdot \big\{ (\hat f^{(n)}_{\mathrm{DBR}}(x) - f^\star(x))^2 - (\bar f(x) - f^\star(x))^2 \big\} \Big] \le c\,\varepsilon_{\mathrm{slow}}. \]
Taking $\tau = 2\varepsilon_\infty$ and following the derivation used to prove Corollary 2.1, we get
\[ R_{\mathrm{test}}(\hat f^{(n)}_{\mathrm{DBR}}) \le 10\varepsilon_\infty^2 + c\,\varepsilon_{\mathrm{slow}}. \]
(Note that this requires the non-negativity property provided by Lemma 3.1, which we still have.)

The next result considers adapting to an unknown misspecification level.

Proposition A.2 (Adapting to $\varepsilon_\infty$). Let $\delta \in (0,1)$ and define $S := \{2^i : \tau_{\min} \le 2^i \le \tau_{\max}\}$, where $\tau_{\min} := \sqrt{\frac{160\log(|\mathcal F||S|/\delta)}{3n}}$ and $\tau_{\max} := 1$. Let $\tau^\star := \min\{\tau \in S : \tau \ge 3\varepsilon_\infty\}$. Then there is an algorithm that, without knowledge of $\varepsilon_\infty$ and with probability at least $1-\delta$, computes $\hat f$ satisfying
\[ \mathbb{E}_{\mathrm{train}}\Big[ \mathbb{1}\{|\hat f(x) - f^\star(x)| \ge \tau^\star + \varepsilon_\infty\} \cdot \big\{ (\hat f(x) - f^\star(x))^2 - \varepsilon_\infty^2 \big\} \Big] \le \frac{160\log(2|\mathcal F||S|/\delta)}{3n}. \]
Note that when $\varepsilon_\infty \ll \tau_{\min}$, we are essentially in the realizable regime, so via the proof of Corollary 2.2 the above guarantee with $\tau^\star := \tau_{\min}$ suffices. On the other hand, if $\varepsilon_\infty \ge 1/3$ then $\tau^\star$ is undefined, but due to Assumption 2.4 the guarantee in Theorem 2.1 is vacuous. Thus, the above theorem recovers essentially the same result as Theorem 2.1, but without knowledge of $\varepsilon_\infty$.

Proof sketch. The algorithm is as follows. We run a slight variation of disagreement-based regression for each $\tau \in S$: instead of computing the minimizer of the objective in Eq. (2), we form the version space of near-minimizers. Specifically, define
\[ \forall \tau \in S: \quad \mathcal F_\tau := \Big\{ f \in \mathcal F : \max_{g \in \mathcal F} \frac{1}{n}\sum_{i=1}^n W^\tau_{f,g}(x_i)\big( (f(x_i)-y_i)^2 - (g(x_i)-y_i)^2 \big) \le \varepsilon_{\mathrm{stat}}/2 \Big\}, \]
where we define $\varepsilon_{\mathrm{stat}} = \frac{80\log(|\mathcal F||S|/\delta)}{3n}$. Note this is slightly inflated from the definition in the statement of Lemma 3.2, which accounts for a union bound over all $|S|$ runs of the algorithm.
Next, we define
\[ \hat\tau := \arg\min\Big\{ \tau \in S : \bigcap_{\tau' \in S : \tau' \ge \tau} \mathcal F_{\tau'} \neq \emptyset \Big\}, \]
and return any function in this intersection, i.e., let $\hat f$ be any function in $\bigcap_{\tau' \in S : \tau' \ge \hat\tau} \mathcal F_{\tau'}$. For the analysis, via the analysis of Theorem 2.1 and a union bound over the $|S|$ choices of $\tau$, we have
\[ \forall \tau \ge \tau^\star: \quad \bar f \in \mathcal F_\tau, \qquad \text{and} \qquad f \in \mathcal F_\tau \Rightarrow \mathcal L_\tau(f;\bar f) \le \varepsilon_{\mathrm{stat}}, \]
where $\mathcal L_\tau(f;g)$ is the population objective with parameter $\tau$. The first statement directly implies that $\hat\tau \le \tau^\star$. This in turn implies that $\hat f \in \mathcal F_{\tau^\star}$, and so $\hat f$ achieves the same statistical guarantee as if we ran DBR with parameter $\tau^\star$ (up to the additional union bound).

B Proofs for Section 4

B.1 Offline RL

Theorem 4.1 (DBR for offline RL). Fix $\delta \in (0,1)$, and assume that $\mathcal F$ is $L_\infty$-misspecified and $\mu$ satisfies concentrability (as defined above). Consider the algorithm defined in Eq. (7) with $\tau = 3\varepsilon_\infty$. Then, with probability at least $1-\delta$, we have
\[ J(\pi^\star) - J(\hat\pi) \le O\Big( \frac{\varepsilon_\infty}{1-\gamma} + \frac{1}{1-\gamma}\sqrt{\frac{C_{\mathrm{conc}}\log(|\mathcal F|/\delta)}{n}} \Big). \]

Proof of Theorem 4.1. For each "target" function $f_{\mathrm{trg}} \in \mathcal F$ such that $f_{\mathrm{trg}} \neq \bar f$, let us define $\mathrm{apx}[f_{\mathrm{trg}}] \in \mathcal F$ to be any approximation to the Bellman backup $\mathcal T f_{\mathrm{trg}}$ such that $\|\mathrm{apx}[f_{\mathrm{trg}}] - \mathcal T f_{\mathrm{trg}}\|_\infty \le \varepsilon_\infty$. Define $\mathrm{apx}[\bar f] = \bar f$, which also satisfies $\|\mathrm{apx}[\bar f] - \mathcal T\bar f\|_\infty \le \varepsilon_\infty$ by assumption. Let us define the empirical and population losses for the disagreement-based regression problem with regression targets derived from $f_{\mathrm{trg}}$:
\begin{align*}
\text{(Empirical)} \quad & \hat{\mathcal L}_{f_{\mathrm{trg}}}(f;g) := \frac{1}{n}\sum_{i=1}^n W^\tau_{f,g}(s_i,a_i)\big( (f(s_i,a_i) - y_{f_{\mathrm{trg}},i})^2 - (g(s_i,a_i) - y_{f_{\mathrm{trg}},i})^2 \big), \\
\text{(Population)} \quad & \mathcal L_{f_{\mathrm{trg}}}(f;g) := \mathbb{E}_\mu\big[ W^\tau_{f,g}(s,a)\big( (f(s,a) - y_{f_{\mathrm{trg}}})^2 - (g(s,a) - y_{f_{\mathrm{trg}}})^2 \big) \big].
\end{align*}
Here, recall that $y_{f_{\mathrm{trg}}} := r + \max_{a'} f_{\mathrm{trg}}(s',a')$ is derived from the sample $(s,a,r,s')$. Also note that we use $\mathbb{E}_\mu[\cdot]$ to denote expectation with respect to the data-collection policy.
First, we apply Lemma 3.1 and Lemma 3.2 to each of the $|\mathcal F|$ regression problems. By approximate completeness and the definition of $\mathrm{apx}[f_{\mathrm{trg}}]$, this yields
\[ \forall f_{\mathrm{trg}}, f \in \mathcal F: \quad 0 \le \mathcal L_{f_{\mathrm{trg}}}(f; \mathrm{apx}[f_{\mathrm{trg}}]) \le 2\hat{\mathcal L}_{f_{\mathrm{trg}}}(f; \mathrm{apx}[f_{\mathrm{trg}}]) + \varepsilon_{\mathrm{stat}}, \tag{11} \]
where $\varepsilon_{\mathrm{stat}} := \frac{160\log(|\mathcal F|/\delta)}{3n}$. The above uniform bound holds with probability $1-\delta$. Note that this $\varepsilon_{\mathrm{stat}}$ is twice as large as the one in the proof of Theorem 2.1, which accounts for the additional union bound over all $|\mathcal F|$ regression problems. The main statistical guarantee for $\hat f$ is derived as follows:
\[ \mathcal L_{\hat f}(\hat f; \mathrm{apx}[\hat f]) \overset{(i)}{\le} 2\hat{\mathcal L}_{\hat f}(\hat f; \mathrm{apx}[\hat f]) + \varepsilon_{\mathrm{stat}} \overset{(ii)}{\le} 2\max_{g \in \mathcal F}\hat{\mathcal L}_{\hat f}(\hat f; g) + \varepsilon_{\mathrm{stat}} \overset{(iii)}{\le} 2\max_{g \in \mathcal F}\hat{\mathcal L}_{\bar f}(\bar f; g) + \varepsilon_{\mathrm{stat}} \overset{(iv)}{\le} 2\varepsilon_{\mathrm{stat}}. \]
Here, $(i)$ is the second inequality in Eq. (11), $(ii)$ follows since $\mathrm{apx}[\hat f] \in \mathcal F$, $(iii)$ uses the optimality property of $\hat f$, and $(iv)$ uses Eq. (11) again, noting the symmetry of $\hat{\mathcal L}_{\bar f}(\cdot;\cdot)$ and using $\mathrm{apx}[\bar f] = \bar f$. Since the Bayes regression function defined by the targets $y_{\hat f}$ is $\mathcal T\hat f$, this yields
\[ \mathbb{E}_\mu\Big[ \mathbb{1}\big\{ |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty \big\} \cdot \big\{ (\hat f(s,a) - [\mathcal T\hat f](s,a))^2 - (\mathrm{apx}[\hat f](s,a) - [\mathcal T\hat f](s,a))^2 \big\} \Big] \le 2\varepsilon_{\mathrm{stat}}. \tag{12} \]
We translate this to the squared Bellman error under any other distribution $\nu \in \Delta(\mathcal X \times \mathcal A)$ via a slightly stronger argument than the one used to prove Corollary 2.1.
\begin{align*}
\mathbb{E}_\nu\big[ |\hat f(s,a) - [\mathcal T\hat f](s,a)| \big] &\le \varepsilon_\infty + \mathbb{E}_\nu\big[ |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \big] \\
&\le 4\varepsilon_\infty + \mathbb{E}_\nu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot |\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \big] \\
&\le 4\varepsilon_\infty + \sqrt{\mathbb{E}_\mu\Big[ \Big(\frac{\nu(s,a)}{\mu(s,a)}\Big)^2 \Big]} \cdot \sqrt{\mathbb{E}_\mu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot (\hat f(s,a) - \mathrm{apx}[\hat f](s,a))^2 \big]} \\
&= 4\varepsilon_\infty + \|\nu/\mu\|_{L_2(\mu)} \cdot \sqrt{\mathbb{E}_\mu\big[ \mathbb{1}\{|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty\} \cdot (\hat f(s,a) - \mathrm{apx}[\hat f](s,a))^2 \big]} \\
&\le 4\varepsilon_\infty + \|\nu/\mu\|_{L_2(\mu)} \cdot \sqrt{6\varepsilon_{\mathrm{stat}}}.
\end{align*}
The last inequality is based on the "self-bounding" argument used to control the variance in the proof of Lemma 3.2, which showed that under the event $|\hat f(s,a) - \mathrm{apx}[\hat f](s,a)| \ge 3\varepsilon_\infty$:
\[ \big( \hat f(s,a) - \mathrm{apx}[\hat f](s,a) \big)^2 \le 3\Big( \big( \hat f(s,a) - [\mathcal T\hat f](s,a) \big)^2 - \big( \mathrm{apx}[\hat f](s,a) - [\mathcal T\hat f](s,a) \big)^2 \Big). \]
Note that $\|\nu/\mu\|^2_{L_2(\mu)} \le \|\nu/\mu\|_\infty$ since $\mathbb{E}_\mu[\nu(s,a)/\mu(s,a)] = \mathbb{E}_\nu[1] = 1$. Finally, we appeal to the telescoping performance difference lemma (c.f., Xie and Jiang, 2020, Theorem 2), which states that for an action-value function $f$,
\[ J(\pi^\star) - J(\pi_f) \le \frac{\mathbb{E}_{d^{\pi^\star}}\big[ [\mathcal T f](s,a) - f(s,a) \big]}{1-\gamma} + \frac{\mathbb{E}_{d^{\pi_f}}\big[ f(s,a) - [\mathcal T f](s,a) \big]}{1-\gamma}, \]
where $d^\pi := (1-\gamma)\sum_{h=0}^\infty \gamma^h d^\pi_h$. Both terms are controlled by the distribution-shift argument above and the concentrability coefficient, yielding the theorem.

B.2 Online RL

Theorem 4.2 (DBR for online RL). Fix $\delta \in (0,1)$, and assume that $\mathcal F$ is $L_\infty$-misspecified and $\mu$ satisfies coverability (as defined above). Consider GOLF.DBR with $\tau = 3\varepsilon_\infty$ and $\beta = c\log(TH|\mathcal F|/\delta)$. Then, with probability at least $1-\delta$, we have
\[ \mathrm{Reg} \le O\Big( \varepsilon_\infty H T + H\sqrt{C_{\mathrm{cov}}\,T\log(TH|\mathcal F|/\delta)}\,\log(T) \Big). \]

Proof of Theorem 4.2. The proof makes essentially two modifications to the proof of Theorem 1 of Xie et al. (2023).
The first is a concentration argument, which is essentially a martingale version of Theorem 2.1. The second is the distribution-shift argument, which is very similar to the one used to prove Theorem 4.1. To keep the presentation concise, we focus on these arguments and explain how they fit into the analysis of Xie et al. (2023), but we do not provide a self-contained proof.

Notation. We adopt the following notation. Recall that $\mathcal F^{(t-1)}$ is the version space used in episode $t$ and that $f^{(t)} \in \mathcal F^{(t-1)}$ induces the policy $\pi^{(t)}$ deployed in the episode. As before, let $\mathrm{apx}[f_{h+1}] \in \mathcal F_h$ denote the $L_\infty$-approximation to $\mathcal T_h f_{h+1}$. For each episode $t$, let
\[ \delta^{(t)}_h(\cdot) := f^{(t)}_h(\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot) \]
and
\[ \mathrm{err}^{(t)}_h(\cdot) := \mathbb{1}\big\{ |f^{(t)}_h(\cdot) - \mathrm{apx}[f^{(t)}_{h+1}](\cdot)| \ge 3\varepsilon_\infty \big\} \cdot \big\{ (f^{(t)}_h(\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot))^2 - (\mathrm{apx}[f^{(t)}_{h+1}](\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot))^2 \big\}. \]
Let $d^{(t)}_h = d^{\pi^{(t)}}_h$, define $\tilde d^{(t)}_h(x,a) = \sum_{i=1}^{t-1} d^{(i)}_h(x,a)$, and let $\mu^\star_h$ be the distribution that achieves the value $C_{\mathrm{cov}}$ for layer $h$.

Concentration. By a martingale version of Theorem 2.1, we can show that with probability at least $1-\delta$, for all $t \in [T]$:
\[ (i)\ \bar f \in \mathcal F^{(t)}, \quad \text{and} \quad (ii)\ \forall h \in [H]: \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathrm{err}^{(t)}_h(s,a) \le O(\beta), \tag{13} \]
where $\beta = c\log(TH|\mathcal F|/\delta)$. We do not provide a complete proof of this statement, noting that it is essentially the same guarantee as in Eq. (12), except that (a) it is a non-stationary version with a union bound over each time step $h$ and episode $t$, and (b) it uses martingale concentration (i.e., Freedman's inequality instead of Bernstein's inequality). It is also worth comparing with the concentration guarantee of Xie et al. (2023) under exact realizability/completeness, which is that $Q^\star \in \mathcal F^{(t)}$ and that $\sum_{s,a} \tilde d^{(t)}_h(s,a)\big( \delta^{(t)}_h(s,a) \big)^2 \le O(\beta)$.
Distribution shift. To bound the regret, note that
\[ \mathrm{Reg} \le \sum_{t=1}^T \sum_{h=1}^H \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a) \big]. \]
For distribution shift, we must translate the above on-policy Bellman errors to the "DBR" errors on the historical data $\tilde d^{(t)}_h$, which are controlled by Eq. (13). Following Xie et al. (2023), we consider burn-in and stable phases. Let
\[ \gamma_h(s,a) := \min\big\{ t : \tilde d^{(t)}_h(s,a) \ge C_{\mathrm{cov}} \cdot \mu^\star_h(s,a) \big\}, \]
and decompose
\[ \sum_{t=1}^T \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a) \big] = \sum_{t=1}^T \Big( \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t < \gamma_h(s,a)\} \big] + \mathbb{E}_{(s,a)\sim d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] \Big). \]
The first term is the regret incurred during the burn-in phase, which is bounded by $2C_{\mathrm{cov}}$ following exactly the argument of Xie et al. (2023); this contributes a total regret of $2HC_{\mathrm{cov}}$. The second term is the regret incurred during the stable phase, for which we must perform a distribution-shift argument. To condense the notation, define
\[ \bar\delta^{(t)}_h(\cdot) := \mathrm{apx}[f^{(t)}_{h+1}](\cdot) - [\mathcal T_h f^{(t)}_{h+1}](\cdot), \quad \text{and} \quad \tilde\delta^{(t)}_h(\cdot) := f^{(t)}_h(\cdot) - \mathrm{apx}[f^{(t)}_{h+1}](\cdot). \]
Note that, by assumption, $|\bar\delta^{(t)}_h(s,a)| \le \varepsilon_\infty$.
Then,
\begin{align*}
\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] &= \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \big( \tilde\delta^{(t)}_h(s,a) + \bar\delta^{(t)}_h(s,a) \big)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] \\
&\le \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \tilde\delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] + T\varepsilon_\infty \\
&\le \sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[ \mathbb{1}\{|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty\}\,\tilde\delta^{(t)}_h(s,a)\,\mathbb{1}\{t \ge \gamma_h(s,a)\} \big] + 4T\varepsilon_\infty \\
&\le \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot \sqrt{\sum_{t=1}^T \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathbb{1}\{|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty\}\big( \tilde\delta^{(t)}_h(s,a) \big)^2} + 4T\varepsilon_\infty \\
&\le \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot \sqrt{3\sum_{t=1}^T \sum_{s,a} \tilde d^{(t)}_h(s,a)\,\mathrm{err}^{(t)}_h(s,a)} + 4T\varepsilon_\infty.
\end{align*}
The penultimate inequality is Cauchy–Schwarz, and the final inequality follows from the self-bounding property that we used in the proofs of Lemma 3.2 and Theorem 4.1. In particular, under the event $|\tilde\delta^{(t)}_h(s,a)| \ge 3\varepsilon_\infty$, we can bound
\[ \big( \tilde\delta^{(t)}_h(s,a) \big)^2 \le 3\Big( \big( \delta^{(t)}_h(s,a) \big)^2 - \big( \bar\delta^{(t)}_h(s,a) \big)^2 \Big). \]
Thus we have converted from the on-policy Bellman error to the historical "DBR" errors, i.e., we can further bound by
\[ \sqrt{\sum_{t=1}^T \sum_{s,a} \frac{\big( \mathbb{1}\{t \ge \gamma_h(s,a)\}\,d^{(t)}_h(s,a) \big)^2}{\tilde d^{(t)}_h(s,a)}} \cdot O\big( \sqrt{\beta T} \big) + 4T\varepsilon_\infty. \]
Meanwhile, the density-ratio term is bounded by $O\big( \sqrt{C_{\mathrm{cov}}\log(T)} \big)$ via the analysis of Xie et al. (2023). Repeating this analysis for each time step $h$ proves the theorem.
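To close, the Proposition 2.2 construction also makes the contrast between ERM and disagreement-based regression easy to see at the population level. The following sketch (parameter values are illustrative choices; the `dbr_loss` function is a population-level rendering of the empirical objective in Eq. (2)) shows ERM selecting $f_{\mathrm{bad}}$ while the DBR objective with $\tau = 3\varepsilon_\infty$ selects $\bar f$:

```python
import math

# Population-level comparison of ERM and DBR on the Proposition 2.2
# construction; the parameter values here are illustrative choices.
C_inf, eps_inf = 25.0, 0.1
zeta = math.sqrt(C_inf) * eps_inf - 1e-3  # ERM prefers f_bad at this zeta
tau = 3 * eps_inf

f_star = lambda x: 0.5
f_bar = lambda x: 0.5 + eps_inf
f_bad = lambda x: 0.5 + zeta if x <= 1 / C_inf else 0.5

# D_train = Uniform([0, 1]); both functions are piecewise constant on two
# regions, so population expectations reduce to weighted point evaluations.
regions = [(1 / C_inf, 0.0), (1 - 1 / C_inf, 1.0)]  # (mass, representative x)

def sq_risk(f):
    return sum(m * (f(x) - f_star(x)) ** 2 for m, x in regions)

def dbr_loss(f, g):
    # Population DBR objective L(f; g): the squared-loss gap relative to g,
    # restricted to the region where f and g disagree by at least tau.
    return sum(
        m * ((f(x) - f_star(x)) ** 2 - (g(x) - f_star(x)) ** 2)
        for m, x in regions
        if abs(f(x) - g(x)) >= tau
    )

F = [f_bar, f_bad]
erm = min(F, key=sq_risk)
dbr = min(F, key=lambda f: max(dbr_loss(f, g) for g in F))
print(erm is f_bad, dbr is f_bar)  # True True
```

Because $f_{\mathrm{bad}}$ and $\bar f$ disagree by more than $\tau$ exactly on the test support, the disagreement indicator exposes $f_{\mathrm{bad}}$'s large local error there, while the near-minimizer $\bar f$ incurs no positive disagreement-based loss against any competitor.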
