Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation


Authors: Haochen Zhang, Zhong Zheng, Lingzhou Xue

Department of Statistics, The Pennsylvania State University

(Lingzhou Xue, Email: lzxue@psu.edu, is the corresponding author.)

Abstract

We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound $\tilde{O}(d\sqrt{H^3 K})$, where $d$ is the feature dimension, $H$ is the horizon length, and $K$ is the number of episodes. We bridge this gap by providing the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ (He et al., 2023). Our analysis yields improved dependencies on both $d$ and $H$ compared to previous gap-dependent results. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

1 Introduction

Reinforcement learning (RL) (Sutton & Barto, 2018) provides a formal framework for sequential decision-making, where an agent learns to maximize cumulative rewards through iterative interaction with a dynamic environment. In modern RL, the design of efficient algorithms for problems with large state and action spaces has become a central challenge. A widely adopted approach is function approximation, which enables efficient learning by representing value functions using a restricted function class.

Recently, a large body of literature has focused on establishing regret upper bounds for RL with linear function approximation, where the value function is represented as a linear function of known features. These works can be broadly categorized into two main classes. The first class considers the model-free setting (He et al., 2021a) or the linear Markov decision process (MDP) (Jin et al., 2020), where the transition dynamics and reward functions are assumed to be linear with respect to the known features. In particular, Jin et al. (2020) proposed the first provably efficient algorithm, LSVI-UCB. It is based on the principle of optimism in the face of uncertainty and achieves the regret bound $\tilde{O}(\sqrt{d^3 H^4 K})$. Here, $\tilde{O}(\cdot)$ hides logarithmic factors, $d$ denotes the feature dimension, $H$ is the horizon length, and $K$ is the number of episodes. Subsequently, He et al. (2023) proposed the LSVI-UCB++ algorithm, which leverages adaptive weighted ridge regression and pessimism techniques, and improves the regret bound to the nearly minimax-optimal rate $\tilde{O}(d\sqrt{H^3 K})$ (Zhou et al., 2021a). Agarwal et al. (2023) achieved similar near-optimal results in the time-homogeneous setting. The second class focuses on the model-based setting (He et al., 2021a) or linear mixture MDPs (Ayoub et al., 2020; Zhou et al., 2021a; Zhou & Gu, 2022), where the transition probability is modeled as a linear combination of multiple base models. Ayoub et al. (2020) proposed the UCRL-VTR algorithm with a regret bound $\tilde{O}(d\sqrt{H^4 K})$. Later, Zhou et al.
(2021a) developed a near-optimal algorithm for time-inhomogeneous linear mixture MDPs. Furthermore, Zhou & Gu (2022) proposed a near-optimal, horizon-free algorithm for time-homogeneous linear mixture MDPs.

In practice, RL algorithms often outperform their worst-case guarantees when a positive suboptimality gap exists (i.e., the optimal action at each state is better than suboptimal actions by a non-negligible margin). Gap-dependent analysis is well studied in the tabular setting (Simchowitz & Jamieson, 2019; Dann et al., 2021; Yang et al., 2021; Xu et al., 2020; Zheng et al., 2025b; Zhang et al., 2025b,c; Chen et al., 2025), and it has also gained traction in RL with linear function approximation. He et al. (2021a) pioneered this direction, showing that LSVI-UCB (Jin et al., 2020) achieves an expected regret bound $\tilde{O}(d^3 H^5/\Delta_{\min})$ under linear MDPs, and UCRL-VTR (Ayoub et al., 2020) attains an expected regret bound $\tilde{O}(d^2 H^5/\Delta_{\min})$ under linear mixture MDPs, where the minimum gap $\Delta_{\min}$ is defined as the infimum of the positive suboptimality gaps $\Delta_h(s,a)$ over all state-action-step triples $(s,a,h)$. Papini et al. (2021) and Zhang et al. (2024a) also established the same logarithmic guarantees for the expected regret of different algorithms. Collectively, these results demonstrate that RL with linear function approximation can achieve logarithmic regret.

Despite these advances, existing gap-dependent analyses do not yet cover algorithms that achieve the nearly minimax-optimal worst-case regret bound $\tilde{O}(d\sqrt{H^3 K})$, e.g., LSVI-UCB++ (He et al., 2023) or UCRL-VTR+ (Zhou et al., 2021a). As a result, current gap-dependent regret bounds exhibit a loose dependence on the feature dimension $d$ and horizon length $H$, i.e., $\tilde{O}(d^3 H^5/\Delta_{\min})$ under linear MDPs or $\tilde{O}(d^2 H^5/\Delta_{\min})$ under linear mixture MDPs, failing to reflect the superior efficiency of these algorithms. This suggests that existing analyses may not fully capture the potential of minimax-optimal algorithms in the presence of suboptimality gaps. Consequently, this gap leads to the following open question:

Can we establish improved gap-dependent regret upper bounds for nearly minimax-optimal algorithms in RL with linear function approximation?

Improving the dependence on the feature dimension $d$ and the horizon length $H$ in regret bounds is not only of theoretical interest but also carries significant practical implications. In domains such as robotics and healthcare, agents often operate in complex, high-dimensional environments, where function approximation is essential for tractable learning and control. For instance, in robotic manipulation and grasping tasks, the state space often consists of high-dimensional continuous observations, including robot joint configurations, object positions, and other environmental variables (Kober et al., 2013; Toner et al., 2023). Similarly, in healthcare applications such as optimizing cell growth conditions or sequential treatment planning, agents must handle rich biological state representations and long sequences of decisions, making effective function approximation critical for practical deployment (Al-Hamadani et al., 2024). These applications illustrate that reducing the theoretical dependence on $d$ and $H$ in regret guarantees can directly enhance both the sample efficiency and the practical feasibility of RL algorithms in high-dimensional, long-horizon tasks.
In this paper, we provide an affirmative answer to the open question above by establishing an improved gap-dependent regret upper bound for the nearly minimax-optimal LSVI-UCB++ (He et al., 2023) in the linear MDP setting. Beyond its regret guarantees, LSVI-UCB++ also exhibits infrequent policy updates, which is particularly advantageous for real-world applications where a single agent's data collection capacity is inherently limited and multi-agent collaboration is required. While multi-agent reinforcement learning (MARL) can improve sample efficiency through parallel exploration, a primary challenge is the high communication cost associated with policy synchronization. The low policy-switching property of LSVI-UCB++ suggests that it can serve as a foundation for a concurrent variant in which agents explore in parallel with infrequent synchronizations. This insight motivates the development of our concurrent RL variant, which seeks to accelerate learning without compromising communication efficiency.

Our Contributions. Our contributions are summarized as follows:

(i) Improved Gap-Dependent Regret Bound: We establish the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ with linear function approximation. Our results significantly improve the dependence on the feature dimension $d$ and the horizon length $H$ relative to the existing literature (see Table 1 for a detailed comparison). Moreover, we show that our refined regret bound implies an improved Probably Approximately Correct (PAC) sample complexity (Kakade, 2003), reducing the dependence on the accuracy parameter $\epsilon$ of the number of episodes required to identify an $\epsilon$-optimal policy from the worst-case rate of $\tilde{O}(1/\epsilon^2)$ to $\tilde{O}(1/\epsilon)$.

(ii) Concurrent RL Algorithm: Leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant, Concurrent LSVI-UCB++, that enables efficient parallel exploration. We establish the first gap-dependent sample complexity upper bound for online MARL with linear function approximation, achieving linear speedup with respect to the number of agents.

Algorithm        E[Regret(T)]                        Reference
LSVI-UCB         $\hat{O}(d^3 H^5/\Delta_{\min})$    He et al. (2021a)
UCRL-VTR         $\hat{O}(d^2 H^5/\Delta_{\min})$    He et al. (2021a)
LSVI-UCB         $\hat{O}(d^3 H^5/\Delta_{\min})$    Papini et al. (2021)
Cert-LSVI-UCB    $\hat{O}(d^3 H^5/\Delta_{\min})$    Zhang et al. (2024a)
LSVI-UCB++       $\hat{O}(d^2 H^4/\Delta_{\min})$    (Ours)

Table 1: Comparison of gap-dependent regret bounds for different RL algorithms with linear function approximation. We denote $T$ as the total number of steps, $d$ as the feature dimension, $H$ as the horizon length, and $\Delta_{\min}$ as the minimum gap. The notation $\hat{O}(\cdot)$ hides both $\log T$ dependence and lower-order terms. Our result achieves the tightest dependence on $d$ and $H$.

(iii) Technical Novelty: Refining the worst-case guarantees of LSVI-UCB++ into gap-dependent bounds requires new techniques for bounding the partial sums of both bonuses and estimated variances. To control the partial sums of bonuses, Lemma 5.3 introduces a surrogate matrix that admits a one-step recursive structure, allowing us to obtain tight bounds. For the partial sum of estimated variances, Lemma 5.4 establishes a recursive relationship for the partial sums of value function estimation errors across different steps (Lemmas C.4 and C.5), which yields an upper bound on the estimated variance.
Together, these novel technical developments provide the necessary machinery to achieve refined gap-dependent guarantees. See Section 5 for details.

2 Related Work

Near-Optimal RL. In tabular RL, algorithms are typically categorized into model-based and model-free approaches. Model-based methods explicitly estimate the transition and reward models from data and plan based on the learned models, while model-free methods directly maintain value function estimates and act greedily. A large body of work has focused on model-based algorithms (Agarwal et al., 2020; Agrawal & Jia, 2017; Auer et al., 2008; Azar et al., 2017; Dann et al., 2019; Kakade et al., 2018; Zanette & Brunskill, 2019; Zhang et al., 2024b, 2021; Zhou et al., 2023). Notably, Zhang et al. (2024b) proposed an algorithm achieving a regret bound of $\tilde{O}(\min\{\sqrt{SAH^2 T}, T\})$, which matches the information-theoretic lower bound. Model-free approaches have also been extensively studied (Jin et al., 2018; Li et al., 2023; Ménard et al., 2021b; Yang et al., 2021; Zhang et al., 2020, 2025c), and several works (Zhang et al., 2020; Ménard et al., 2021b; Li et al., 2023; Zhang et al., 2025c) achieved the near-optimal regret bound $\tilde{O}(\sqrt{SAH^2 T})$. Several works also focus on federated RL, including Zheng et al. (2024); Labbi et al. (2024); Zheng et al. (2025a); Zhang et al. (2025c), with the last three attaining the near-optimal regret.

The literature on RL with linear function approximation can be divided based on the structural assumptions imposed on the MDP. One line of work considers linear MDPs (Yang & Wang, 2019; Jin et al., 2020; Wei et al., 2021; Wagenmaker et al., 2022; He et al., 2023; Zhang et al., 2024c). Yang & Wang (2019) provide the first sample-efficient algorithm under a generative model. Subsequently, Jin et al. (2020) propose the first provably efficient algorithm, LSVI-UCB, which achieves a regret bound $\tilde{O}(\sqrt{d^3 H^3 T})$ without access to a generative model. More recently, He et al. (2023) develop the LSVI-UCB++ algorithm, improving the regret to the nearly minimax-optimal bound $\tilde{O}(d\sqrt{H^3 K})$ (Zhou et al., 2021a). Another line of work studies linear mixture MDPs (Jia et al., 2020; Ayoub et al., 2020; Modi et al., 2020; Zhou et al., 2021a,b; Zhou & Gu, 2022). Jia et al. (2020) and Ayoub et al. (2020) proposed the UCRL-VTR algorithm for episodic MDPs, achieving a regret bound of $\tilde{O}(d\sqrt{H^4 K})$. Subsequently, Zhou et al. (2021a) developed a near-optimal algorithm for time-inhomogeneous linear mixture MDPs by proposing a Bernstein-type concentration inequality for self-normalized martingales. Furthermore, Zhou & Gu (2022) proposed a near-optimal, horizon-free algorithm for time-homogeneous linear mixture MDPs. Zhang et al. (2023) provide the near-optimal horizon-free sample complexity in the reward-free time-homogeneous setting.

Gap-Dependent RL. In tabular RL, early works establish asymptotic logarithmic regret bounds (Auer & Ortner, 2007; Tewari & Bartlett, 2008). Later, non-asymptotic bounds have been derived in multiple works (Jaksch et al., 2010; Ok et al., 2018; Simchowitz & Jamieson, 2019; Dann et al., 2021; Yang et al., 2021; Xu et al., 2021; Zheng et al., 2025b; Chen et al., 2025; Zhang et al., 2025b). For model-based algorithms, Simchowitz & Jamieson (2019); Dann et al. (2021); Chen et al. (2025) obtain fine-grained gap-dependent regret bounds. For model-free algorithms, Yang et al.
(2021) provided the first gap-dependent regret bound for UCB-Hoeffding (Jin et al., 2018), which was subsequently refined by the AMB algorithm in Xu et al. (2021). Later, Zheng et al. (2025b) further improved the results by reanalyzing the UCB-Advantage algorithm (Zhang et al., 2020) and the Q-EarlySettled-Advantage algorithm (Li et al., 2023). Zhang et al. (2025a,c) extend the gap-dependent analysis to federated $Q$-learning settings. More recently, Zhang et al. (2025b) provide the first fine-grained gap-dependent regret upper bound for UCB-Hoeffding.

For RL with linear function approximation, gap-dependent regret bounds have also been studied. For online RL, He et al. (2021a) provided the first gap-dependent regret bounds, showing that LSVI-UCB (Jin et al., 2020) achieves $\tilde{O}(d^3 H^5/\Delta_{\min})$ in linear MDPs, and UCRL-VTR (Ayoub et al., 2020) achieves $\tilde{O}(d^2 H^5/\Delta_{\min})$ in linear mixture MDPs. Subsequent works (Papini et al., 2021; Zhang et al., 2024a) improved these results, showing that LSVI-UCB and Cert-LSVI-UCB can achieve constant regret with high probability, independent of the total number of steps $T$.

MARL with Linear Function Approximation. In MARL, Dubey & Pentland (2021) propose the Coop-LSVI algorithm, which extends LSVI-UCB (Jin et al., 2020) from the single-agent setting to cooperative multi-agent parallel RL, achieving provably efficient learning with a limited number of communication rounds among agents. Building on LSVI-UCB as well, Min et al. (2023) introduce an asynchronous variant that preserves the same regret bound while improving communication efficiency relative to Dubey & Pentland (2021). More recently, Hsu et al. (2024) develop two randomized-exploration algorithms that attain the same regret and communication guarantees as Min et al. (2023).

3 Preliminaries

Notation. In this paper, we adopt the convention that $0/0 = 0$. For any $C \in \mathbb{N}_+$, we write $[C] := \{1, 2, \ldots, C\}$. We denote by $\mathbb{I}[x]$ the indicator function, which takes the value 1 if the event $x$ is true and 0 otherwise. For a vector $x \in \mathbb{R}^d$ and a matrix $\Sigma \in \mathbb{R}^{d \times d}$, we use $\|x\|_2$ to denote the Euclidean norm, and $\|x\|_{\Sigma} := \sqrt{x^\top \Sigma x}$. For any $a \le b \in \mathbb{R}$ and $x \in \mathbb{R}$, let $[x]_{[a,b]}$ denote the truncation function $a \cdot \mathbb{I}[x \le a] + x \cdot \mathbb{I}[a < x < b] + b \cdot \mathbb{I}[b \le x]$. We then introduce the mathematical framework of episodic Markov decision processes.

Episodic Markov Decision Processes. An episodic MDP is denoted as $M := (\mathcal{S}, \mathcal{A}, H, \mathbb{P}, r)$, where $\mathcal{S}$ is a measurable space with a possibly infinite number of states, $\mathcal{A}$ is the finite set of actions, $H$ is the number of steps in each episode, $\mathbb{P} := \{\mathbb{P}_h\}_{h=1}^H$ is the transition kernel such that $\mathbb{P}_h(\cdot \mid s,a)$ characterizes the distribution over the next state given the state-action pair $(s,a)$ at step $h$, and $r := \{r_h\}_{h=1}^H$ is the collection of reward functions. We assume that $r_h(s,a) \in [0,1]$ is a deterministic function of $(s,a)$. In each episode, an initial state $s_1$ is selected arbitrarily by an adversary. Then, at each step $h \in [H]$, an agent observes a state $s_h \in \mathcal{S}$, picks an action $a_h \in \mathcal{A}$, receives the reward $r_h = r_h(s_h, a_h)$, and then transitions to the next state $s_{h+1}$. The episode ends when an absorbing state $s_{H+1}$ is reached.
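To make the interaction protocol and the truncation operator concrete, the following is a minimal Python sketch; the `env.reset`/`env.step` interface and the names `clip_to` and `rollout_episode` are our own illustrative assumptions, not part of any algorithm in this paper.

```python
def clip_to(x, a, b):
    # Truncation operator [x]_{[a,b]} from the Notation paragraph:
    # a if x <= a, x if a < x < b, and b if b <= x.
    return min(max(x, a), b)

def rollout_episode(env, policy, H):
    """Run one episode of length H under a step-dependent policy.

    `env` is assumed to expose reset() -> s_1 and step(h, s, a) -> (r, s'),
    matching the episodic protocol: observe s_h, pick a_h, receive
    r_h(s_h, a_h), then transition to s_{h+1} ~ P_h(. | s_h, a_h).
    """
    s = env.reset()                  # adversarially chosen initial state s_1
    trajectory, total_reward = [], 0.0
    for h in range(1, H + 1):
        a = policy(h, s)             # a_h = pi_h(s_h)
        r, s_next = env.step(h, s, a)
        trajectory.append((h, s, a, r, s_next))
        total_reward += r
        s = s_next                   # episode ends at the absorbing s_{H+1}
    return trajectory, total_reward
```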
For convenience, we denote $\mathbb{P}_{s,a,h} f = \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s,a)} f(s')$, $\mathbb{1}_s f = f(s)$, and $\mathbb{V}_{s,a,h}(f) = \mathbb{P}_{s,a,h} f^2 - (\mathbb{P}_{s,a,h} f)^2$ for any function $f: \mathcal{S} \to \mathbb{R}$ and state-action-step triple $(s,a,h)$.

Policy and Value Functions. A policy $\pi$ is a collection of $H$ functions $\{\pi_h : \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h \in [H]}$, where $\Delta_{\mathcal{A}}$ is the set of probability distributions over $\mathcal{A}$. A policy is deterministic if for any $s \in \mathcal{S}$, $\pi_h(s)$ concentrates all the probability mass on an action $a \in \mathcal{A}$. In this case, we write $\pi_h(s) = a$. Let $V^\pi_h : \mathcal{S} \to \mathbb{R}$ denote the state value function at step $h$ under policy $\pi$, so that $V^\pi_h(s)$ represents the expected return when starting from state $s_h = s$ and following $\pi$. Formally,

$$V^\pi_h(s) := \sum_{t=h}^{H} \mathbb{E}_{(s_t, a_t) \sim (\mathbb{P}, \pi)} [r_t(s_t, a_t) \mid s_h = s].$$

We also denote by $Q^\pi_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the state-action value function at step $h$ under policy $\pi$, so that $Q^\pi_h(s,a)$ represents the expected return when starting from the state-action pair $(s_h, a_h) = (s,a)$ and following the policy $\pi$:

$$Q^\pi_h(s,a) := r_h(s,a) + \sum_{t=h+1}^{H} \mathbb{E}_{(s_t, a_t) \sim (\mathbb{P}, \pi)} [r_t(s_t, a_t) \mid (s_h, a_h) = (s,a)].$$

Since the action space and the horizon are both finite, there exists an optimal policy $\pi^\star$ that achieves the optimal value $V^\star_h(s) = \sup_\pi V^\pi_h(s) = V^{\pi^\star}_h(s)$ for all $(s,h) \in \mathcal{S} \times [H]$ (Azar et al., 2017). The Bellman equation and the Bellman optimality equation can be expressed as

$$V^\pi_h(s) = \mathbb{E}_{a' \sim \pi_h(s)}[Q^\pi_h(s,a')], \quad Q^\pi_h(s,a) := r_h(s,a) + \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s,a)} V^\pi_{h+1}(s'), \quad V^\pi_{H+1}(s) = 0,$$
$$V^\star_h(s) = \max_{a' \in \mathcal{A}} Q^\star_h(s,a'), \quad Q^\star_h(s,a) := r_h(s,a) + \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s,a)} V^\star_{h+1}(s'), \quad V^\star_{H+1}(s) = 0, \tag{1}$$

for all $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$.

For any algorithm over $K$ episodes, let $\pi^k$ be the policy used in the $k$-th episode, and $s^k_1$ be the corresponding initial state. The regret over $T = HK$ steps is

$$\mathrm{Regret}(T) := \sum_{k=1}^{K} \big( V^\star_1 - V^{\pi^k}_1 \big)(s^k_1).$$

Suboptimality Gap. For any given MDP, we can provide the following formal definition of the suboptimality gap.

Definition 3.1. For any $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$, the suboptimality gap is defined as $\Delta_h(s,a) := V^\star_h(s) - Q^\star_h(s,a)$.

Equation (1) implies that $\Delta_h(s,a) \ge 0$ for any $(s,a,h)$. It is then natural to define the minimum gap, which is the minimum non-zero suboptimality gap.

Definition 3.2. We define the minimum gap as $\Delta_{\min} := \inf\{\Delta_h(s,a) \mid \Delta_h(s,a) > 0, (s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]\}$.

If the set $\{\Delta_h(s,a) \mid \Delta_h(s,a) > 0, (s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]\}$ is empty, then all policies are optimal, leading to a degenerate MDP. Therefore, we assume that the set is nonempty and $\Delta_{\min} > 0$ in the rest of this paper. Definitions 3.1 and 3.2 and the non-degeneracy assumption are standard in the literature on gap-dependent analysis (Simchowitz & Jamieson, 2019; Dann et al., 2021; Yang et al., 2021; Xu et al., 2020; He et al., 2021a; Zhang et al., 2024a; Zheng et al., 2025b; Zhang et al., 2025b,c).
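As a concrete illustration of Definitions 3.1 and 3.2, the following sketch computes $Q^\star$, $V^\star$, and the gaps by backward induction on Equation (1) in a small random tabular MDP; the sizes, transition arrays, and variable names here are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5                       # toy sizes, chosen only for illustration

# Random tabular MDP: P[h, s, a] is a distribution over next states,
# r[h, s, a] is a deterministic reward in [0, 1].
P = rng.dirichlet(np.ones(S), size=(H, S, A))
r = rng.uniform(0.0, 1.0, size=(H, S, A))

# Backward induction on the Bellman optimality equation (1):
# Q*_h(s,a) = r_h(s,a) + E_{s' ~ P_h(.|s,a)} V*_{h+1}(s'),  V*_{H+1} = 0.
V = np.zeros(S)                          # V*_{H+1}
gaps = []
for h in reversed(range(H)):
    Q = r[h] + P[h] @ V                  # shape (S, A)
    V = Q.max(axis=1)                    # V*_h(s) = max_a Q*_h(s,a)
    gaps.append(V[:, None] - Q)          # Delta_h(s,a) = V*_h(s) - Q*_h(s,a)

all_gaps = np.concatenate([g.ravel() for g in gaps])
positive = all_gaps[all_gaps > 1e-12]
delta_min = positive.min() if positive.size else 0.0
print(f"Delta_min = {delta_min:.4f}")    # infimum of the positive gaps
```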
Linear Markov Decision Processes. In this work, we focus on the linear Markov decision process (Jin et al., 2020; He et al., 2021a, 2023), which is formally defined as follows:

Definition 3.3. An episodic MDP $M := (\mathcal{S}, \mathcal{A}, H, \mathbb{P}, r)$ is a linear MDP if for any $h \in [H]$, there exist an unknown measure $\theta_h(\cdot): \mathcal{S} \to \mathbb{R}^d$ and a known feature mapping $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ such that for each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$ and state $s' \in \mathcal{S}$, we have $\mathbb{P}_h(s' \mid s,a) = \langle \phi(s,a), \theta_h(s') \rangle$. For simplicity, we assume that the norms of $\theta_h(\cdot)$ and $\phi(\cdot,\cdot)$ are upper bounded as follows: $\|\phi(s,a)\|_2 \le 1$ and $\|\theta_h(s)\|_2 \le \sqrt{d}$ for any $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$.

For linear MDPs, we have the following property:

Proposition 3.4 (Proposition 3.3 of He et al. 2021a). For any policy $\pi$, there exist weights $\{w^\pi_h\}_{h=1}^H$ such that for any state-action-step triple $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$, we have $\mathbb{P}_{s,a,h} V^\pi_{h+1} = \langle \phi(s,a), w^\pi_h \rangle$.

4 Theoretical Guarantee

4.1 Algorithm Review

We begin by reviewing the nearly minimax-optimal LSVI-UCB++ algorithm proposed by He et al. (2023), presented as Algorithm 1 below.

Algorithm 1 LSVI-UCB++
Require: Regularization parameter $\lambda > 0$, confidence radii $\beta, \bar{\beta}, \tilde{\beta} > 0$, episode number $K \in \mathbb{N}_+$.
1: Initialize $k_{\mathrm{mid}}, k_{\mathrm{last}} \leftarrow 0$, and for each step $h \in [H]$, set $\Sigma_{0,h}, \Sigma_{1,h} = \lambda I_d$.
2: For each step $h \in [H]$ and state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, set $Q_{0,h}(s,a) = H$, $\check{Q}_{0,h}(s,a) = 0$.
3: for episode $k = 1, \ldots, K$ do
4:   Receive the initial state $s^k_1$.
5:   for step $h = H, \ldots, 1$ do
6:     $\hat{w}_{k,h} = \Sigma^{-1}_{k,h} \sum_{i=1}^{k-1} \bar{\sigma}^{-2}_{i,h} \phi(s^i_h, a^i_h) V_{k,h+1}(s^i_{h+1})$.
7:     $\check{w}_{k,h} = \Sigma^{-1}_{k,h} \sum_{i=1}^{k-1} \bar{\sigma}^{-2}_{i,h} \phi(s^i_h, a^i_h) \check{V}_{k,h+1}(s^i_{h+1})$.
8:     if there exists $h' \in [H]$ such that $\det(\Sigma_{k,h'}) \ge 2 \det(\Sigma_{k_{\mathrm{last}},h'})$, then for any $(s,a) \in \mathcal{S} \times \mathcal{A}$,
9:       $Q_{k,h}(s,a) = \min\big\{ r_h(s,a) + \hat{w}^\top_{k,h} \phi(s,a) + \beta \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},\ Q_{k-1,h}(s,a),\ H \big\}$,
10:      $\check{Q}_{k,h}(s,a) = \max\big\{ r_h(s,a) + \check{w}^\top_{k,h} \phi(s,a) - \bar{\beta} \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},\ \check{Q}_{k-1,h}(s,a),\ 0 \big\}$.
11:      Set $k_{\mathrm{mid}} \leftarrow k$.
12:    else
13:      $Q_{k,h}(s,a) = Q_{k-1,h}(s,a)$, $\check{Q}_{k,h}(s,a) = \check{Q}_{k-1,h}(s,a)$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.
14:    end if
15:    $V_{k,h}(s) = \max_a Q_{k,h}(s,a)$, $\check{V}_{k,h}(s) = \max_a \check{Q}_{k,h}(s,a)$ for all $s \in \mathcal{S}$.
16:  end for
17:  Set the last updating episode $k_{\mathrm{last}} \leftarrow k_{\mathrm{mid}}$.
18:  for step $h = 1, \ldots, H$ do
19:    Take action $a^k_h = \pi^k_h(s^k_h) = \arg\max_a Q_{k,h}(s^k_h, a)$.
20:    Set the estimated variance $\sigma^2_{k,h}$ as in (3).
21:    $\bar{\sigma}^2_{k,h} = \max\big\{ \sigma^2_{k,h},\ H,\ 2 d^3 H^2 \|\phi(s^k_h, a^k_h)\|_{\Sigma^{-1}_{k,h}} \big\}$.
22:    $\Sigma_{k+1,h} = \Sigma_{k,h} + \bar{\sigma}^{-2}_{k,h} \phi(s^k_h, a^k_h) \phi(s^k_h, a^k_h)^\top$.
23:    Receive the next state $s^k_{h+1}$.
24:  end for
25: end for

LSVI-UCB++ reduces the learning of the optimal action-value function to a series of linear regression problems. Based on the relationship $\mathbb{P}_{s,a,h} V^\pi_{h+1} = \langle \phi(s,a), w^\pi_h \rangle$ in Proposition 3.4, Algorithm 1 constructs the estimator $\hat{w}_{k,h}$ by solving the following weighted ridge regression:

$$\hat{w}_{k,h} = \arg\min_{w \in \mathbb{R}^d} \lambda \|w\|_2^2 + \sum_{i=1}^{k-1} \bar{\sigma}^{-2}_{i,h} \big( w^\top \phi(s^i_h, a^i_h) - V_{k,h+1}(s^i_{h+1}) \big)^2.$$

Here, the adjusted estimated variance $\bar{\sigma}^2_{k,h}$ is set as

$$\bar{\sigma}^2_{k,h} = \max\big\{ \sigma^2_{k,h},\ H,\ 2 d^3 H^2 \|\phi(s^k_h, a^k_h)\|_{\Sigma^{-1}_{k,h}} \big\}, \tag{2}$$

where the estimated variance $\sigma^2_{k,h}$ is defined in Equation (3) below.
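The weighted ridge regression above has a closed form, which is exactly what lines 6–7 of Algorithm 1 compute. The following is a minimal sketch of that closed form; the function and argument names are ours and not from He et al. (2023).

```python
import numpy as np

def weighted_ridge_estimate(Phi, y, sigma_bar, lam):
    """Closed-form solution of the weighted ridge regression in
    lines 6-7 of Algorithm 1 (a minimal sketch under the stated setup).

    Phi:       (k-1, d) array whose rows are phi(s^i_h, a^i_h)
    y:         (k-1,) array of regression targets V_{k,h+1}(s^i_{h+1})
    sigma_bar: (k-1,) array of adjusted standard deviations sigma_bar_{i,h}
    lam:       ridge parameter lambda > 0
    """
    d = Phi.shape[1]
    w_inv = 1.0 / sigma_bar**2                        # weights sigma_bar^{-2}
    # Sigma_{k,h} = lam * I + sum_i sigma_bar^{-2} phi_i phi_i^T
    Sigma = lam * np.eye(d) + (Phi * w_inv[:, None]).T @ Phi
    # w_hat = Sigma^{-1} sum_i sigma_bar^{-2} phi_i y_i
    w_hat = np.linalg.solve(Sigma, Phi.T @ (w_inv * y))
    # Exploration bonus ||phi||_{Sigma^{-1}} used in lines 9-10
    bonus = lambda phi: np.sqrt(phi @ np.linalg.solve(Sigma, phi))
    return w_hat, Sigma, bonus
```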
With the help of exploration bonuses, Line 9 of Algorithm 1 constructs an optimistic value estimate $Q_{k,h}(s,a)$ as

$$Q_{k,h}(s,a) \approx r_h(s,a) + \hat{w}^\top_{k,h} \phi(s,a) + \beta \|\phi(s,a)\|_{\Sigma^{-1}_{k,h}}.$$

Here, $Q_{k,h}(s,a)$ is an upper bound on $Q^\star_h(s,a)$ with high probability when the hyperparameter $\beta$ is chosen as $\tilde{\Theta}(\sqrt{d})$. The remaining components in Line 9 ensure boundedness and monotonicity of the $Q$-estimate. The algorithm also uses a pessimistic estimate $\check{Q}_{k,h}(s,a)$, which serves as a lower bound for $Q^\star_h(s,a)$ with high probability. By Equation (1) and Line 17 of Algorithm 1, this estimate allows us to control the value estimation error $V_{k,h}(s) - V^\star_h(s)$ via the error $V_{k,h}(s) - \check{V}_{k,h}(s)$. The pessimistic estimate $\check{Q}_{k,h}(s,a)$ is constructed in a manner analogous to the optimistic one: after obtaining the vector $\check{w}_{k,h}$ by solving the weighted ridge regression

$$\check{w}_{k,h} = \arg\min_{w \in \mathbb{R}^d} \lambda \|w\|_2^2 + \sum_{i=1}^{k-1} \bar{\sigma}^{-2}_{i,h} \big( w^\top \phi(s^i_h, a^i_h) - \check{V}_{k,h+1}(s^i_{h+1}) \big)^2,$$

we compute $\check{Q}_{k,h}$ as

$$\check{Q}_{k,h}(s,a) \approx r_h(s,a) + \check{w}^\top_{k,h} \phi(s,a) - \bar{\beta} \|\phi(s,a)\|_{\Sigma^{-1}_{k,h}},$$

where the hyperparameter $\bar{\beta}$ can be chosen as $\tilde{\Theta}(\sqrt{d^3 H^2})$. Finally, LSVI-UCB++ constructs $\sigma^2_{k,h}$ as follows:

$$\sigma^2_{k,h} = \bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1} + E_{k,h} + D_{k,h} + H. \tag{3}$$

In Equation (3), $\bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1}$ represents the estimated variance of the value functions and is defined as

$$\big[ \tilde{w}^\top_{k,h} \phi(s^k_h, a^k_h) \big]_{[0,H^2]} - \Big[ \big( \hat{w}^\top_{k,h} \phi(s^k_h, a^k_h) \big)^2 \Big]_{[0,H^2]}.$$

Here,

$$\tilde{w}_{k,h} := \arg\min_{w \in \mathbb{R}^d} \lambda \|w\|_2^2 + \sum_{i=1}^{k-1} \bar{\sigma}^{-2}_{i,h} \big( w^\top \phi(s^i_h, a^i_h) - V^2_{k,h+1}(s^i_{h+1}) \big)^2$$

is the solution to the weighted ridge regression problem for the squared value function. In addition,

$$E_{k,h} = \min\Big\{ \tilde{\beta} \sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)},\ H^2 \Big\} + \min\Big\{ 2 H \bar{\beta} \sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)},\ H^2 \Big\},$$

and

$$D_{k,h} = \min\Big\{ 4 d^3 H^2 \Big( \hat{w}^\top_{k,h} \phi(s^k_h, a^k_h) - \check{w}^\top_{k,h} \phi(s^k_h, a^k_h) + 2 \bar{\beta} \sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)} \Big),\ d^3 H^3 \Big\}.$$

Here, $E_{k,h}$ bounds the error between the estimated variance $\bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1}$ and the true variance $\mathbb{V}_{s^k_h, a^k_h, h} V_{k,h+1}$ of $V_{k,h+1}$, and $D_{k,h}$ bounds the error between the variance $\mathbb{V}_{s^k_h, a^k_h, h} V_{k,h+1}$ and the variance $\mathbb{V}_{s^k_h, a^k_h, h} V^\star_{h+1}$.
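The pieces above assemble into lines 20–21 of Algorithm 1. The following is a minimal sketch of that assembly; the function name and signature are illustrative assumptions, and the correction terms $E_{k,h}$, $D_{k,h}$ are taken as precomputed inputs.

```python
import numpy as np

def adjusted_variance(phi, w_hat, w_tilde, E_kh, D_kh, d, H, Sigma):
    """Sketch of the variance construction in lines 20-21 of Algorithm 1.

    Assumes w_hat / w_tilde are the ridge solutions for V_{k,h+1} and
    V^2_{k,h+1}, and E_kh / D_kh are the correction terms of Equation (3).
    """
    clip = lambda x, lo, hi: min(max(x, lo), hi)       # [x]_{[lo,hi]}
    # Estimated variance:
    # [w_tilde^T phi]_{[0,H^2]} - [(w_hat^T phi)^2]_{[0,H^2]}
    var_bar = clip(w_tilde @ phi, 0, H**2) - clip((w_hat @ phi) ** 2, 0, H**2)
    sigma2 = var_bar + E_kh + D_kh + H                 # Equation (3)
    # Adjusted variance of Equation (2): keeps the regression weights
    # bounded below, which stabilizes the design matrix update (line 22).
    bonus = np.sqrt(phi @ np.linalg.solve(Sigma, phi))
    sigma2_bar = max(sigma2, H, 2 * d**3 * H**2 * bonus)
    return sigma2_bar
```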
4.2 Gap-Dependent Regret Upper Bound

Having reviewed the LSVI-UCB++ algorithm, we now present our main theoretical result: the first gap-dependent regret upper bound for a nearly minimax-optimal algorithm in RL with linear function approximation.

Theorem 4.1. For any linear MDP $M$, if we set the parameters $\lambda = 1/H^2$ and confidence radii $\beta, \bar{\beta}, \tilde{\beta}$ as

$$\beta = \Theta\Big( H\sqrt{d\lambda} + \sqrt{d \log^2(1 + dT/(\delta\lambda))} \Big), \quad \bar{\beta} = \Theta\Big( H\sqrt{d\lambda} + \sqrt{d^3 H^2 \log^2(dT/(\delta\lambda))} \Big),$$
$$\tilde{\beta} = \Theta\Big( H^2\sqrt{d\lambda} + \sqrt{d^3 H^4 \log^2(dT/(\delta\lambda))} \Big),$$

with the failure probability $\delta = 1/(18T)$, then $\mathbb{E}[\mathrm{Regret}(T)]$ of Algorithm 1 in the first $T$ steps is upper bounded by

$$O\Big( \frac{d^2 H^4}{\Delta_{\min}} \iota_1^3 + d^6 H^7 \iota_1^2 \Big), \quad \text{where } \iota_1 = \log(1 + dHK/\Delta_{\min}).$$

The full proof is provided in Appendix D, with a proof sketch given in Section 5. Compared with prior gap-dependent expected regret upper bounds for RL with linear function approximation (He et al., 2021a; Papini et al., 2021; Zhang et al., 2024a), our result strictly improves the dependence on both the feature dimension $d$ and the horizon length $H$ in the $\Delta_{\min}^{-1}$ term for the nearly minimax-optimal algorithm LSVI-UCB++. In particular, while existing bounds scale as $\hat{O}(d^3 H^5/\Delta_{\min})$ or $\hat{O}(d^2 H^5/\Delta_{\min})$ (see Table 1 for details), our bound reduces this dependence to $\hat{O}(d^2 H^4/\Delta_{\min})$. We also remark that He et al. (2021a) provide a gap-dependent regret lower bound of $\Omega(dH/\Delta_{\min})$, but whether it is minimax-optimal and achievable remains an open question.

As an immediate corollary of Theorem 4.1, we obtain a gap-dependent Probably Approximately Correct (PAC) sample complexity bound (Kakade, 2003), which characterizes the number of episodes required to learn an $\epsilon$-optimal policy $\pi$ satisfying $V^\star_1(s_1) - V^\pi_1(s_1) < \epsilon$ for a fixed initial state $s_1$. Without loss of generality, we focus on the case where $s_1$ is fixed; the general case reduces to this setting by adding an auxiliary time step at the beginning of each episode.

Corollary 4.2. For any linear MDP, failure probability $\delta \in (0,1)$, and accuracy parameter $\epsilon > 0$, running Algorithm 1 with the parameters specified in Theorem 4.1, with probability at least $1 - \delta$, after

$$K = \tilde{O}\Big( \frac{d^2 H^4}{\Delta_{\min}\, \delta\, \epsilon} + \frac{d^6 H^7}{\delta\, \epsilon} \Big)$$

episodes, the output policy $\hat{\pi} = \sum_{k=1}^{K} \pi^k / K$ is an $\epsilon$-optimal policy. Here, $\pi^k$ denotes the policy used in episode $k$ of LSVI-UCB++.

Compared with worst-case PAC bounds in both tabular RL (Jin et al., 2018; Bai et al., 2019; Dann et al., 2019; Ménard et al., 2021a) and RL with linear function approximation (He et al., 2021b; Wu et al., 2023), our gap-dependent PAC sample complexity improves the dependence on the accuracy level $\epsilon$ from $\tilde{O}(1/\epsilon^2)$ to $\tilde{O}(1/\epsilon)$, which implies that an $\epsilon$-optimal policy can be learned with fewer samples.
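For intuition, the following is a schematic sketch of the standard regret-to-PAC conversion behind Corollary 4.2, under the convention that the mixture policy $\hat{\pi}$ picks $k \sim \mathrm{Unif}([K])$ and runs $\pi^k$; the constants are schematic, not the exact ones from the proof in Appendix E.

```latex
\begin{align*}
V^\star_1(s_1) - V^{\hat{\pi}}_1(s_1)
  &= \frac{1}{K}\sum_{k=1}^{K}\bigl(V^\star_1 - V^{\pi^k}_1\bigr)(s_1)
   = \frac{\mathrm{Regret}(T)}{K}, \\
\mathbb{P}\!\left(\frac{\mathrm{Regret}(T)}{K} \ge \epsilon\right)
  &\le \frac{\mathbb{E}[\mathrm{Regret}(T)]}{K\epsilon} \le \delta
  \qquad \text{(Markov's inequality)}, \\
\text{so it suffices that}\quad
K &\ge \frac{1}{\delta\epsilon}\,
  \tilde{O}\!\left(\frac{d^2H^4}{\Delta_{\min}} + d^6H^7\right),
\end{align*}
```

which matches the order of the episode count in Corollary 4.2.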
4.3 Extension to Concurrent RL

In concurrent RL, multiple agents interact with the environment in parallel and share information to accelerate learning (Bai et al., 2019; Zhang et al., 2020). We consider a setting with $M$ parallel agents, where each agent interacts with an independent copy of the same episodic MDP. Within each episode, the $M$ agents act synchronously without communication. Information exchange and policy updates are permitted only after all agents have completed the episode. A concurrent round is defined as the time period during which the $M$ agents simultaneously complete one episode and communicate to update their policies. The performance of a concurrent algorithm is measured by the number of concurrent rounds required to learn an $\epsilon$-optimal policy.

Algorithm 2 presents the concurrent version of the LSVI-UCB++ algorithm. Following the idea of concurrent UCB-Advantage (Zhang et al., 2020), we simulate the single-agent LSVI-UCB++ algorithm by treating the $M$ episodes collected in a single concurrent round as $M$ consecutive episodes without intermediate policy updates. All collected trajectories are sequentially fed into the single-agent LSVI-UCB++. Whenever an update is triggered during an episode in the single-agent algorithm (see Line 8 of Algorithm 1), the estimated value functions and the policy are updated (Lines 9–10 of Algorithm 1), and any remaining trajectories in the current round are discarded in the learning process.

Algorithm 2 Concurrent LSVI-UCB++
Initialize: Regularization parameter $\lambda = 1/H^2$, failure probability $\delta \in (0,1)$, confidence radii $\beta, \bar{\beta}, \tilde{\beta}$ as specified in Theorem 4.1.
for concurrent rounds $k = 1, 2, 3, \ldots$ do
  All agents follow the same policy $\pi^k$ determined by the current value estimates (Line 19 of Algorithm 1).
  for $i = 1, 2, 3, \ldots, M$ do
    Collect the trajectory and feed it to LSVI-UCB++.
    if an update is triggered (Line 8 of Algorithm 1) then
      Update the value estimates (Lines 9–10 of Algorithm 1).
      break
    end if
  end for
end for

We now present Theorem 4.3, which characterizes the sample complexity of the Concurrent LSVI-UCB++ algorithm. To the best of our knowledge, this result provides the first gap-dependent sample complexity bound for online MARL with linear function approximation.

Theorem 4.3. Given $M$ parallel agents, any $\delta \in (0,1)$ and $\epsilon > 0$, with probability at least $1 - \delta$, Concurrent LSVI-UCB++ requires at most

$$\tilde{O}\Big( dH + \frac{d^2 H^4}{M \Delta_{\min}\, \delta\, \epsilon} + \frac{d^6 H^7}{M \delta\, \epsilon} \Big)$$

concurrent rounds to learn an $\epsilon$-optimal policy.

Theorem 4.3 implies that Concurrent LSVI-UCB++ achieves a linear speedup in the number of agents $M$ when

$$M = \tilde{O}\Big( \min\Big\{ \frac{d H^3}{\Delta_{\min}\, \delta\, \epsilon},\ \frac{d^5 H^6}{\delta\, \epsilon} \Big\} \Big).$$

In particular, when the target accuracy $\epsilon$ is sufficiently small, the first term $\tilde{O}(dH)$ becomes negligible, and the algorithm enjoys an asymptotically linear speedup in $M$. The proof of Theorem 4.3 is deferred to Section F. Compared with existing worst-case sample complexities for MARL algorithms in the tabular setting (Bai et al., 2019; Zhang et al., 2020) or with linear function approximation (Dubey & Pentland, 2021; Min et al., 2023; Hsu et al., 2024), our gap-dependent sample complexity improves the dependence on the accuracy parameter $\epsilon$ from $\tilde{O}(1/\epsilon^2)$ to $\tilde{O}(1/\epsilon)$.

5 Proof Sketch of Theorem 4.1

In this section, we present the key techniques underlying the proof of Theorem 4.1. We begin with Lemma 5.1, which relates the expected regret to the sum of suboptimality gaps.

Lemma 5.1. For any learning algorithm with $K$ episodes and $T = HK$ steps, $\mathbb{E}[\mathrm{Regret}(T)]$ is upper bounded as

$$\mathbb{E}[\mathrm{Regret}(T)] \le \mathbb{E}\Big( \sum_{k=1}^{K} \sum_{h=1}^{H} \Delta_h(s^k_h, a^k_h) \Big).$$

The proof, provided in Appendix D, follows directly from the Bellman equation (1). We therefore focus on controlling the summation of suboptimality gaps. Let $N = \lceil \log_2(H/\Delta_{\min}) \rceil$. Similar to He et al. (2021a), we partition the interval $[\Delta_{\min}, H]$ into dyadic intervals of the form $I_n = [2^{n-1}\Delta_{\min}, 2^n \Delta_{\min})$ for $1 \le n < N$ and $I_N = [2^{N-1}\Delta_{\min}, H]$. Then we have

$$\sum_{k=1}^{K} \sum_{h=1}^{H} \Delta_h(s^k_h, a^k_h) = \sum_{k=1}^{K} \sum_{h=1}^{H} \Delta_h(s^k_h, a^k_h) \sum_{n=1}^{N} \mathbb{I}\big[\Delta_h(s^k_h, a^k_h) \in I_n\big] \le \sum_{h=1}^{H} \sum_{n=1}^{N} 2^n \Delta_{\min} \times K'(h, n-1), \tag{4}$$

where for any step $h \in [H]$ and $0 \le n \le N$, we define

$$K'(h,n) = \sum_{k=1}^{K} \mathbb{I}\big[ \big( Q_{k,h} - Q^{\pi^k}_h \big)(s^k_h, a^k_h) \ge 2^n \Delta_{\min} \big]. \tag{5}$$

The inequality (4) follows from $\Delta_h(s^k_h, a^k_h) \le Q_{k,h}(s^k_h, a^k_h) - Q^{\pi^k}_h(s^k_h, a^k_h)$, which holds because $Q^\star_h(s^k_h, a^k_h) \ge Q^{\pi^k}_h(s^k_h, a^k_h)$ and the optimism property (Lemma C.1) ensures $Q_{k,h}(s^k_h, a^k_h) = V_{k,h}(s^k_h) \ge V^\star_h(s^k_h)$ with high probability. Therefore, to bound the expected regret, it suffices to bound $K'(h,n)$ for any $(h,n) \in [H] \times [N]$. Let $k_0(h,n) = 0$, and for $i \in [K'(h,n)]$, define $k_i(h,n)$ as the smallest episode index satisfying

$$k_i(h,n) = \min\big\{ k : k > k_{i-1}(h,n),\ Q_{k,h}(s^k_h, a^k_h) - Q^{\pi^k}_h(s^k_h, a^k_h) \ge 2^n \Delta_{\min} \big\}. \tag{6}$$

When there is no ambiguity, we use $K'$ and $k_i$ as shorthand for $K'(h,n)$ and $k_i(h,n)$.
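The dyadic decomposition in Equation (4) is mechanical, and the following small sketch (our own toy check, not code from the paper) illustrates it: every positive gap in $[\Delta_{\min}, H]$ falls into exactly one interval $I_n$ and is therefore at most $2^n \Delta_{\min}$.

```python
import numpy as np

def dyadic_gap_decomposition(gaps, delta_min, H):
    """Toy check of the clipping step in Equation (4): each positive gap
    lies in one dyadic interval I_n = [2^{n-1} delta_min, 2^n delta_min)."""
    N = int(np.ceil(np.log2(H / delta_min)))          # number of intervals
    edges = delta_min * 2.0 ** np.arange(N + 1)       # interval endpoints
    total, bound = 0.0, 0.0
    for g in gaps:
        if g <= 0:
            continue
        n = np.searchsorted(edges, g, side="right")   # g falls into I_n
        total += g
        bound += edges[min(n, N)]                     # g <= 2^n * delta_min
    return total, bound                               # total <= bound

gaps = np.array([0.0, 0.13, 0.5, 0.9, 0.02])
print(dyadic_gap_decomposition(gaps, delta_min=0.02, H=1.0))
```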
The following lemma provides an upper bound on $K'(h,n)$.

Lemma 5.2 (Informal). For any $\delta \in (0,1)$, let $\iota_2 = \log(1 + dHNK/\delta)$. With high probability, for any $(h,n) \in [H] \times [N]$, we have

$$K'(h,n) \le O\Big( \frac{d^2 H^3 \iota_2^3}{4^n \Delta^2_{\min}} + \frac{d^6 H^6 \iota_2^2}{2^n \Delta_{\min}} \Big).$$

Combining Lemma 5.2 with Equation (4), we immediately obtain the expected regret upper bound in Theorem 4.1. Next, we explain the key ideas behind the proof of Lemma 5.2. The formal statement is given in Lemma D.2, with the complete proof provided thereafter.

5.1 Proof Sketch of Lemma 5.2

To bound $K'(h,n)$ for any $(h,n) \in [H] \times [N]$, we analyze the partial sum $\sum_{i=1}^{K'} \big( Q_{k_i,h}(s^{k_i}_h, a^{k_i}_h) - Q^{\pi^{k_i}}_h(s^{k_i}_h, a^{k_i}_h) \big)$ and combine its upper and lower bounds to construct an inequality involving $K'$. By the definition of $k_i(h,n)$ in Equation (6), we immediately obtain the lower bound

$$\sum_{i=1}^{K'} \big( Q_{k_i,h} - Q^{\pi^{k_i}}_h \big)(s^{k_i}_h, a^{k_i}_h) \ge 2^n \Delta_{\min} K'. \tag{7}$$

Similar to Equation (E.4) in He et al. (2023), by the optimism property of $Q_{k,h}(s,a)$ and the Bellman equation (1), we obtain the following upper bound:

$$\big( Q_{k,h} - Q^{\pi^k}_h \big)(s^k_h, a^k_h) \le \big( Q_{k,h+1} - Q^{\pi^k}_{h+1} \big)(s^k_{h+1}, a^k_{h+1}) + \big( \mathbb{P}_{s^k_h, a^k_h, h} - \mathbb{1}_{s^k_{h+1}} \big)\big( V_{k,h+1} - V^{\pi^k}_{h+1} \big) + 4 \min\Big\{ \beta \sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)},\ H \Big\}. \tag{8}$$

The details are provided in Equation (26). Summing Equation (8) over $h \le h' \le H$ and all $k_i$ gives

$$\sum_{i=1}^{K'} \big( Q_{k_i,h}(s^{k_i}_h, a^{k_i}_h) - Q^{\pi^{k_i}}_h(s^{k_i}_h, a^{k_i}_h) \big) \le \sum_{i=1}^{K'} \sum_{h'=h}^{H} \big( \mathbb{P}_{s^{k_i}_{h'}, a^{k_i}_{h'}, h'} - \mathbb{1}_{s^{k_i}_{h'+1}} \big)\big( V_{k_i,h'+1} - V^{\pi^{k_i}}_{h'+1} \big) + \sum_{i=1}^{K'} \sum_{h'=h}^{H} 4 \min\Big\{ \beta \sqrt{\phi(s^{k_i}_{h'}, a^{k_i}_{h'})^\top \Sigma^{-1}_{k_i,h'} \phi(s^{k_i}_{h'}, a^{k_i}_{h'})},\ H \Big\}. \tag{9}$$

The first term on the RHS of Equation (9) is a martingale difference sequence and can be bounded via the Azuma–Hoeffding inequality by $O(\sqrt{H^3 \iota_2 K'})$ with high probability. The challenge is bounding the second term, the partial sum of bonuses. We further establish the following result.

Lemma 5.3. Let $\iota = \log(1 + K/(d\lambda))$. For any $h \in [H]$, $n \in [N]$ and parameters $\beta' \ge 1$ and $C \ge 1$, we have

$$\sum_{i=1}^{K'} \min\Big\{ \beta' \sqrt{\phi(s^{k_i}_h, a^{k_i}_h)^\top \Sigma^{-1}_{k_i,h} \phi(s^{k_i}_h, a^{k_i}_h)},\ C \Big\} \le 4 d^3 H^3 C \iota + 10 \beta' d^4 H^2 \iota + 2 \beta' \sqrt{d \iota \sum_{i=1}^{K'} \big( \sigma^2_{k_i,h} + H \big)}.$$

The worst-case regret analysis in He et al. (2023) only requires bounding the total sum of the bonuses $\sum_{k=1}^{K} \beta \sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)}$, where $\Sigma_{k,h}$ admits a one-step recursive update according to Line 22 of Algorithm 1: $\Sigma_{k+1,h} = \Sigma_{k,h} + \bar{\sigma}^{-2}_{k,h} \phi(s^k_h, a^k_h) \phi(s^k_h, a^k_h)^\top$. This recursive structure enables standard techniques. However, in Lemma 5.3, we need to control partial sums of the bonuses. In this case, the matrices $\Sigma_{k_{i+1},h}$ and $\Sigma_{k_i,h}$ no longer admit a one-step recursive relationship, which prevents us from directly applying standard arguments. To address this challenge, for any $h \in [H]$, we introduce a surrogate matrix with $\Sigma'_1 = \lambda I_d$ and

$$\Sigma'_{i+1} = \Sigma'_i + \bar{\sigma}^{-2}_{k_i,h} \phi(s^{k_i}_h, a^{k_i}_h) \phi(s^{k_i}_h, a^{k_i}_h)^\top, \tag{10}$$

which satisfies $\Sigma'_i \preceq \Sigma_{k_i,h}$. As a result, for each $k_i$, we can upper bound the bonus $\sqrt{\phi(s^{k_i}_h, a^{k_i}_h)^\top \Sigma^{-1}_{k_i,h} \phi(s^{k_i}_h, a^{k_i}_h)}$ by $\sqrt{\phi(s^{k_i}_h, a^{k_i}_h)^\top (\Sigma'_i)^{-1} \phi(s^{k_i}_h, a^{k_i}_h)}$.
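Before continuing, here is a small numerical check of this domination property on synthetic data (our own illustration, not code from the paper): the surrogate $\Sigma'_i$ accumulates only the selected episodes $k_1 < k_2 < \cdots$, so $\Sigma'_i \preceq \Sigma_{k_i,h}$, and the surrogate bonus dominates the true one.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, lam = 3, 200, 1.0
phis = rng.normal(size=(K, d))
phis /= np.maximum(np.linalg.norm(phis, axis=1, keepdims=True), 1.0)
sigma_bar2 = rng.uniform(1.0, 4.0, size=K)            # adjusted variances

# Full design matrices Sigma_{k,h} over all episodes (line 22 of Algorithm 1).
Sigma, Sigmas = lam * np.eye(d), []
for k in range(K):
    Sigmas.append(Sigma.copy())
    Sigma = Sigma + np.outer(phis[k], phis[k]) / sigma_bar2[k]

# Subsequence k_1 < k_2 < ... (e.g., episodes with a large optimistic gap)
# and the surrogate recursion of Equation (10).
ks = np.sort(rng.choice(K, size=40, replace=False))
Sigma_s = lam * np.eye(d)                             # Sigma'_1
for k in ks:
    phi = phis[k]
    bonus_true = np.sqrt(phi @ np.linalg.solve(Sigmas[k], phi))
    bonus_surr = np.sqrt(phi @ np.linalg.solve(Sigma_s, phi))
    assert bonus_true <= bonus_surr + 1e-10           # surrogate dominates
    Sigma_s = Sigma_s + np.outer(phi, phi) / sigma_bar2[k]
print("surrogate bonus dominates the true bonus on all selected episodes")
```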
Consequently, the partial sum of the bonuses can be bounded by the total sum of $\sqrt{\phi(s^{k_i}_h, a^{k_i}_h)^\top (\Sigma'_i)^{-1} \phi(s^{k_i}_h, a^{k_i}_h)}$ over $i \in [K']$. Since $\{\Sigma'_i\}_i$ admits a one-step recursive update, as shown in Equation (10), the total sum over $i$ can be bounded using standard arguments. Details are provided in the proof of Lemma C.3.

We further upper bound the partial sum of the estimated variances $\sigma^2_{k,h}$ over $k = k_i$ in the following lemma.

Lemma 5.4 (Informal). With high probability, the partial sum of the estimated variances is bounded as

$$\sum_{i=1}^{K'} \sum_{h=1}^{H} \sigma^2_{k_i,h} \le O\big( H^2 K' + d^{10} H^{11} \iota_2^3 \big).$$

The formal statement is provided in Lemma D.1, together with its complete proof. Here, we briefly outline the main ideas. According to the definition in Equation (3), we need to handle components that include $\bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1}$, $(\hat{w}^\top_{k,h} - \check{w}^\top_{k,h}) \phi(s^k_h, a^k_h)$, the bonus $\sqrt{\phi(s^k_h, a^k_h)^\top \Sigma^{-1}_{k,h} \phi(s^k_h, a^k_h)}$, and constants that depend only on $d$ and $H$. Since the constants can be bounded easily and the bonus can be controlled by Lemma 5.3 when taking partial sums, it suffices to control the first two components.

The first component can be approximated by $\mathbb{V}_{s^k_h, a^k_h, h} V^{\pi^k}_{h+1}$, where the approximation error can be decomposed into two parts: $\bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1} - \mathbb{V}_{s^k_h, a^k_h, h} V_{k,h+1} \le E_{k,h}$ and $\mathbb{V}_{s^k_h, a^k_h, h} V_{k,h+1} - \mathbb{V}_{s^k_h, a^k_h, h} V^{\pi^k}_{h+1}$. The details of bounding $\mathbb{V}_{s^k_h, a^k_h, h} V^{\pi^k}_{h+1}$ and the two parts above are given in Lemma D.1 in Appendix D. This controls the first component. For the second component, we approximate it by the term $\mathbb{P}_{s^k_h, a^k_h, h}(V_{k,h+1} - \check{V}_{k,h+1})$, where the approximation error can be bounded by the bonus at step $h$ due to the optimistic and pessimistic properties of the two $Q$-estimates, $Q_{k,h}$ and $\check{Q}_{k,h}$. We further establish a recursive structure across steps to bound the partial sum of $\mathbb{P}_{s^k_h, a^k_h, h}(V_{k,h+1} - \check{V}_{k,h+1})$ over $k = k_i$ in Lemma C.5 in Appendix C. This controls the second component and completes the proof of Lemma 5.4.

Using Lemma 5.4, the partial sum of bonuses is bounded by Lemma 5.3. Together with the upper bound for the first term on the RHS of Equation (9), $\sum_{i=1}^{K'} \big( Q_{k_i,h} - Q^{\pi^{k_i}}_h \big)(s^{k_i}_h, a^{k_i}_h)$ is bounded by $O\big( d^6 H^6 \iota_2^2 + d\sqrt{H^3 \iota_2^3 K'} \big)$. The details are provided in Equations (28) to (30) in Appendix D. Combining this upper bound with the lower bound in Equation (7), we obtain the inequality

$$2^n \Delta_{\min} K'(h,n) \le O\Big( d^6 H^6 \iota_2^2 + d \sqrt{H^3 \iota_2^3 K'(h,n)} \Big).$$

Solving this inequality for $K'(h,n)$ (using that $x \le a + b\sqrt{x}$ with $a, b \ge 0$ implies $x \le 2a + b^2$) completes the proof of Lemma 5.2.

6 Conclusion

In this paper, we establish the first gap-dependent regret upper bound for a nearly minimax-optimal algorithm with linear function approximation, showing that LSVI-UCB++ achieves an improved regret upper bound with reduced dependence on the feature dimension and horizon length. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we further develop a concurrent variant and establish the first gap-dependent sample complexity for online MARL with linear function approximation, demonstrating a linear speedup with respect to the number of agents.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits.
Advances in Neural Information Processing Systems, 24, 2011.

Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pp. 67–83. PMLR, 2020.

Agarwal, A., Jin, Y., and Zhang, T. VOQL: Towards optimal regret in model-free RL with nonlinear function approximation. In The Thirty Sixth Annual Conference on Learning Theory, pp. 987–1063. PMLR, 2023.

Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30, 2017.

Al-Hamadani, M. N., Fadhel, M. A., Alzubaidi, L., and Harangi, B. Reinforcement learning algorithms and applications in healthcare and robotics: A comprehensive and systematic review. Sensors, 24(8):2461, 2024.

Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp. 49–56. MIT Press, 2007.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.

Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463–474. PMLR, 2020.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.

Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. Provably efficient Q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.

Chen, S., Zhou, R., Zhang, Z., Fazel, M., and Du, S. S. Sharp gap-dependent variance-aware regret bounds for tabular MDPs. arXiv preprint arXiv:2506.06521, 2025.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pp. 1507–1516. PMLR, 2019.

Dann, C., Marinov, T. V., Mohri, M., and Zimmert, J. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1–12, 2021.

Dubey, A. and Pentland, A. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.

He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 4171–4180. PMLR, 2021a.

He, J., Zhou, D., and Gu, Q. Uniform-PAC bounds for reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 34:14188–14199, 2021b.

He, J., Zhao, H., Zhou, D., and Gu, Q. Nearly minimax optimal reinforcement learning for linear Markov decision processes. In International Conference on Machine Learning, pp. 12790–12822. PMLR, 2023.

Hsu, H.-L., Wang, W., Pajic, M., and Xu, P. Randomized exploration in cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 37:74617–74689, 2024.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
Jia, Z., Yang, L., Szepesvari, C., and Wang, M. Model-based reinforcement learning with value-targeted regression. In Learning for Dynamics and Control, pp. 666–686. PMLR, 2020.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020.

Kakade, S., Wang, M., and Yang, L. F. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. University of London, University College London (United Kingdom), 2003.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Labbi, S., Tiapkin, D., Mancini, L., Mangold, P., and Moulines, E. Federated UCBVI: Communication-efficient federated regret minimization with heterogeneous agents. arXiv preprint arXiv:2410.22908, 2024.

Li, G., Shi, L., Chen, Y., and Chi, Y. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Information and Inference: A Journal of the IMA, 12(2):969–1043, 2023.

Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021a.

Ménard, P., Domingues, O. D., Shang, X., and Valko, M. UCB momentum Q-learning: Correcting the bias without forgetting. In International Conference on Machine Learning, pp. 7609–7618. PMLR, 2021b.

Min, Y., He, J., Wang, T., and Gu, Q. Cooperative multi-agent reinforcement learning: Asynchronous communication and linear function approximation. In International Conference on Machine Learning, pp. 24785–24811. PMLR, 2023.

Modi, A., Jiang, N., Tewari, A., and Singh, S. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 2010–2020. PMLR, 2020.

Ok, J., Proutiere, A., and Tranos, D. Exploration in structured reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

Papini, M., Tirinzoni, A., Pacchiano, A., Restelli, M., Lazaric, A., and Pirotta, M. Reinforcement learning in linear MDPs: Constant regret and representation selection. Advances in Neural Information Processing Systems, 34:16371–16383, 2021.

Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tewari, A. and Bartlett, P. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pp. 1505–1512, 2008.

Toner, T., Saez, M., Tilbury, D. M., and Barton, K. Opportunities and challenges in applying reinforcement learning to robotic manipulation: An industrial case study. Manufacturing Letters, 35:1019–1030, 2023.
Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., and Jamieson, K. First-order regret in reinforcement learning with linear function approximation: A robust estimation approach. In International Conference on Machine Learning, pp. 22384–22429. PMLR, 2022.

Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007–3015. PMLR, 2021.

Wu, Y., He, J., and Gu, Q. Uniform-PAC guarantees for model-based RL with bounded eluder dimension. In Uncertainty in Artificial Intelligence, pp. 2304–2313. PMLR, 2023.

Xu, H., Ma, T., and Du, S. Fine-grained gap-dependent bounds for tabular MDPs via adaptive multi-step bootstrap. In Conference on Learning Theory, pp. 4438–4472. PMLR, 2021.

Xu, T., Wang, Z., Zhou, Y., and Liang, Y. Reanalysis of variance reduced temporal difference learning. In International Conference on Learning Representations, 2020.

Yang, K., Yang, L., and Du, S. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pp. 1576–1584. PMLR, 2021.

Yang, L. and Wang, M. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pp. 6995–7004. PMLR, 2019.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304–7312. PMLR, 2019.

Zhang, H., Zheng, Z., and Xue, L. Gap-dependent bounds for federated $Q$-learning. In Forty-second International Conference on Machine Learning, 2025a.

Zhang, H., Zheng, Z., and Xue, L. Q-learning with fine-grained gap-dependent regret. arXiv preprint arXiv:2510.06647, 2025b.

Zhang, H., Zheng, Z., and Xue, L. Regret-optimal Q-learning with low cost for single-agent and federated reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025c. URL https://openreview.net/forum?id=fNOCsycDG4.

Zhang, J., Zhang, W., and Gu, Q. Optimal horizon-free reward-free exploration for linear mixture MDPs. In International Conference on Machine Learning, pp. 41902–41930. PMLR, 2023.

Zhang, W., Fan, Z., He, J., and Gu, Q. Achieving constant regret in linear Markov decision processes. Advances in Neural Information Processing Systems, 37:130694–130738, 2024a.

Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.

Zhang, Z., Ji, X., and Du, S. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pp. 4528–4531. PMLR, 2021.

Zhang, Z., Chen, Y., Lee, J. D., and Du, S. S. Settling the sample complexity of online reinforcement learning. In Conference on Learning Theory, pp. 5213–5219. PMLR, 2024b.

Zhang, Z., Lee, J. D., Chen, Y., and Du, S. S. Horizon-free regret for linear Markov decision processes. In The Twelfth International Conference on Learning Representations, 2024c.

Zheng, Z., Gao, F., Xue, L., and Yang, J. Federated Q-learning: Linear regret speedup with low communication cost. In The Twelfth International Conference on Learning Representations, 2024.

Zheng, Z., Zhang, H., and Xue, L.
Federated Q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost. In The Thirteenth International Conference on Learning Representations, 2025a.

Zheng, Z., Zhang, H., and Xue, L. Gap-dependent bounds for Q-learning using reference-advantage decomposition. In The Thirteenth International Conference on Learning Representations, 2025b.

Zhou, D. and Gu, Q. Computationally efficient horizon-free reinforcement learning for linear mixture MDPs. Advances in Neural Information Processing Systems, 35:36337–36349, 2022.

Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pp. 4532–4576. PMLR, 2021a.

Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793–12802. PMLR, 2021b.

Zhou, R., Zihan, Z., and Du, S. S. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning, pp. 42878–42914. PMLR, 2023.

In the appendix, Appendix A collects several auxiliary lemmas that facilitate the proofs. Appendix B establishes a set of high-probability events. Appendix C summarizes key properties of the value function estimates in LSVI-UCB++. The proof of Theorem 4.1 is provided in Appendix D, while the proofs of Corollary 4.2 and Theorem 4.3 are given in Section E and Section F, respectively.

A Auxiliary Lemmas

In this section, we introduce several auxiliary lemmas that will be used to support our proofs.

Lemma A.1 (Azuma–Hoeffding inequality). Suppose $\{X_k\}_{k=0}^{\infty}$ is a martingale and $|X_k - X_{k-1}| \le c_k$ for all $k \in \mathbb{N}_+$ almost surely. Then for any $N \in \mathbb{N}_+$ and $\epsilon > 0$, it holds that

$$\mathbb{P}(|X_N - X_0| \ge \epsilon) \le 2 \exp\Big( \frac{-\epsilon^2}{2 \sum_{k=1}^{N} c_k^2} \Big).$$

Lemma A.2 (Lemma 12, Abbasi-Yadkori et al. 2011). Suppose $A, B \in \mathbb{R}^{d \times d}$ are two positive definite matrices satisfying $A \succeq B$. Then for any $x \in \mathbb{R}^d$, we have $\|x\|_A \le \|x\|_B \cdot \sqrt{\det(A)/\det(B)}$.

Lemma A.3 (Lemma 11, Abbasi-Yadkori et al. 2011). Let $\{x_k\}_{k=1}^{K}$ be a sequence of vectors in $\mathbb{R}^d$, $\Sigma_0$ a $d \times d$ positive definite matrix, and define $\Sigma_k = \Sigma_0 + \sum_{i=1}^{k} x_i x_i^\top$. Then we have

$$\sum_{i=1}^{k} \min\big\{ 1,\ x_i^\top \Sigma^{-1}_{i-1} x_i \big\} \le 2 \log\Big( \frac{\det \Sigma_k}{\det \Sigma_0} \Big).$$

In addition, if $\|x_i\|_2 \le L$ holds for all $i \in [K]$, then

$$\sum_{i=1}^{k} \min\big\{ 1,\ x_i^\top \Sigma^{-1}_{i-1} x_i \big\} \le 2 \log\Big( \frac{\det \Sigma_k}{\det \Sigma_0} \Big) \le 2 \Big( d \log\big( (\mathrm{trace}(\Sigma_0) + k L^2)/d \big) - \log \det \Sigma_0 \Big).$$

Lemma A.4 (Lemma 4.4, Zhou & Gu 2022). Let $\{\sigma_k, \hat{\beta}_k\}_{k \ge 1}$ be a sequence of non-negative numbers, $\alpha, \gamma > 0$, and $\{a_k\}_{k \ge 1} \subset \mathbb{R}^d$ with $\|a_k\|_2 \le A$. Let $\{\bar{\sigma}_k\}_{k \ge 1}$ and $\{\hat{\Sigma}_k\}_{k \ge 1}$ be (recursively) defined as follows: $\hat{\Sigma}_1 = \lambda I_d$, and for all $k \ge 1$,

$$\bar{\sigma}_k = \max\big\{ \sigma_k,\ \alpha,\ \gamma \|a_k\|^{1/2}_{\hat{\Sigma}^{-1}_k} \big\}, \quad \hat{\Sigma}_{k+1} = \hat{\Sigma}_k + a_k a_k^\top / \bar{\sigma}_k^2.$$

Let $\iota = \log(1 + K A^2/(d \lambda \alpha^2))$. Then we have

$$\sum_{k=1}^{K} \min\big\{ 1,\ \|a_k\|_{\hat{\Sigma}^{-1}_k} \big\} \le 2 d \iota + 2 \gamma^2 d \iota + 2 \sqrt{d \iota} \sqrt{\sum_{k=1}^{K} (\sigma_k^2 + \alpha^2)}.$$
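The elliptical potential bound of Lemma A.3 is easy to verify numerically. The following is a small check on synthetic data of our own choosing, purely illustrative.

```python
import numpy as np

# Numerical check of the elliptical potential bound in Lemma A.3.
rng = np.random.default_rng(2)
d, K, L = 5, 500, 1.0
Sigma0 = np.eye(d)

Sigma = Sigma0.copy()
lhs = 0.0
for _ in range(K):
    x = rng.normal(size=d)
    x *= min(1.0, L / np.linalg.norm(x))              # enforce ||x||_2 <= L
    lhs += min(1.0, x @ np.linalg.solve(Sigma, x))    # uses Sigma_{i-1}
    Sigma += np.outer(x, x)                           # Sigma_i update

logdet = np.linalg.slogdet(Sigma)[1]
rhs = 2.0 * (logdet - np.linalg.slogdet(Sigma0)[1])
print(f"LHS = {lhs:.2f} <= RHS = {rhs:.2f}")          # the bound holds
```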
B Probability Events

In this section, we introduce several high-probability events for LSVI-UCB++.

Lemma B.1. For LSVI-UCB++, we have the following high-probability events.

(a) Define $\mathcal{E}_1$ as the event that the following inequalities hold for all $(s,a,h,k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$ simultaneously:

$$\big| \hat{w}^\top_{k,h} \phi(s,a) - \mathbb{P}_{s,a,h} V_{k,h+1} \big| \le \bar{\beta} \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},$$
$$\big| \tilde{w}^\top_{k,h} \phi(s,a) - \mathbb{P}_{s,a,h} V^2_{k,h+1} \big| \le \tilde{\beta} \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},$$
$$\big| \check{w}^\top_{k,h} \phi(s,a) - \mathbb{P}_{s,a,h} \check{V}_{k,h+1} \big| \le \bar{\beta} \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},$$

where

$$\bar{\beta} = \Theta\Big( H\sqrt{d\lambda} + \sqrt{d^3 H^2 \log^2(dHK/(\delta\lambda))} \Big), \quad \tilde{\beta} = \Theta\Big( H^2\sqrt{d\lambda} + \sqrt{d^3 H^4 \log^2(dHK/(\delta\lambda))} \Big).$$

Then the event $\mathcal{E}_1$ holds with probability at least $1 - 7\delta$.

(b) Define $\mathcal{E}_2$ as the event that for any $(s,a,h,k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$, the weight vector $\hat{w}_{k,h}$ satisfies

$$\big| \hat{w}^\top_{k,h} \phi(s,a) - \mathbb{P}_{s,a,h} V_{k,h+1} \big| \le \beta \sqrt{\phi(s,a)^\top \Sigma^{-1}_{k,h} \phi(s,a)},$$

where $\beta = \Theta\big( H\sqrt{d\lambda} + \sqrt{d \log^2(1 + dKH/(\delta\lambda))} \big)$. Then the event $\mathcal{E}_2$ holds with probability at least $1 - 8\delta$.

(c) With probability at least $1 - \delta$, the following event holds simultaneously for all $(h,n,K') \in [H] \times [N] \times [K]$:

$$\mathcal{E}_3 = \Big\{ \sum_{i=1}^{K'} \sum_{h'=h}^{H} \big( \mathbb{P}_{s^{k_i}_{h'}, a^{k_i}_{h'}, h'} - \mathbb{1}_{s^{k_i}_{h'+1}} \big)\big( V_{k_i,h'+1} - V^{\pi^{k_i}}_{h'+1} \big) \le 2\sqrt{2 H^3 K' \log(HNK/\delta)} \Big\}.$$

(d) With probability at least $1 - \delta$, the following event holds simultaneously for all $(h,n,K') \in [H] \times [N] \times [K]$:

$$\mathcal{E}_4 = \Big\{ \sum_{i=1}^{K'} \sum_{h'=h}^{H} \big( \mathbb{P}_{s^{k_i}_{h'}, a^{k_i}_{h'}, h'} - \mathbb{1}_{s^{k_i}_{h'+1}} \big)\big( V_{k_i,h'+1} - \check{V}_{k_i,h'+1} \big) \le 2\sqrt{2 H^3 K' \log(HNK/\delta)} \Big\}.$$

(e) With probability at least $1 - \delta$, the following event holds simultaneously for all $(h,n,K') \in [H] \times [N] \times [K]$:

$$\mathcal{E}_5 = \Big\{ \sum_{i=1}^{K'} \sum_{h=1}^{H} \mathbb{V}_{s^{k_i}_h, a^{k_i}_h, h} V^{\pi^{k_i}}_{h+1} \le 3 H^2 K' + 3 H^3 \log(HNK/\delta) \Big\}.$$

Proof of Lemma B.1. Parts (a) and (b) follow directly from Lemmas B.1 and B.5 in He et al. (2023), respectively. Part (e) follows from Lemma C.5 in Jin et al. (2018) by applying the Azuma–Bernstein inequality, followed by a union bound over all $(h,n,K') \in [H] \times [N] \times [K]$. Next, we prove parts (c) and (d). By Lemma A.1, with probability at least $1 - \delta/(HNK)$, for any fixed $(h,n,K') \in [H] \times [N] \times [K]$, we have

$$\sum_{i=1}^{K'} \sum_{h'=h}^{H} \big( \mathbb{P}_{s^{k_i}_{h'}, a^{k_i}_{h'}, h'} - \mathbb{1}_{s^{k_i}_{h'+1}} \big)\big( V_{k_i,h'+1} - V^{\pi^{k_i}}_{h'+1} \big) \le 2\sqrt{2 H^3 K' \log(HNK/\delta)},$$

where we use the fact that $\big( \mathbb{P}_{s^{k_i}_{h'}, a^{k_i}_{h'}, h'} - \mathbb{1}_{s^{k_i}_{h'+1}} \big)\big( V_{k_i,h'+1} - V^{\pi^{k_i}}_{h'+1} \big)$ forms a martingale difference sequence bounded by $2H$. Taking a union bound over all $(h,n,K') \in [H] \times [N] \times [K]$ finishes the proof of part (c). The proof of part (d) proceeds similarly, with $V^{\pi^{k_i}}_{h'+1}$ replaced by $\check{V}_{k_i,h'+1}$.

C Properties of Value Function Estimates

Following Lemma B.4 and Lemma B.2 in He et al. (2023), we obtain the following two lemmas.

Lemma C.1. On the event $\mathcal{E}_1 \cap \mathcal{E}_2$, for any $(s,a,h,k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$, we have

$$Q_{k,h}(s,a) \ge Q^\star_h(s,a) \ge \check{Q}_{k,h}(s,a), \quad V_{k,h}(s) \ge V^\star_h(s) \ge \check{V}_{k,h}(s).$$

Lemma C.2. On the event $\mathcal{E}_1 \cap \mathcal{E}_2$, for any $(h,k) \in [H] \times [K]$, the estimated variance satisfies

$$\big| \bar{\mathbb{V}}_{s^k_h, a^k_h, h} V_{k,h+1} - \mathbb{V}_{s^k_h, a^k_h, h} V_{k,h+1} \big| \le E_{k,h}.$$

The following lemma provides a bound on the partial sum of the bonuses.

Lemma C.3 (Restatement of Lemma 5.3). Let $\iota = \log(1 + K/(d\lambda)) = \log(1 + TH/d)$.
C Properties of Value Function Estimates

Following Lemma B.4 and Lemma B.2 in He et al. (2023), we obtain the following two lemmas. Throughout the remainder of the appendix, we abbreviate $\phi_h^k := \phi(s_h^k, a_h^k)$.

Lemma C.1. On the event $\mathcal{E}_1 \cap \mathcal{E}_2$, for any $(s, a, h, k) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K]$, we have
$$Q_{k,h}(s,a) \ge Q_h^{\star}(s,a) \ge \check{Q}_{k,h}(s,a), \qquad V_{k,h}(s) \ge V_h^{\star}(s) \ge \check{V}_{k,h}(s).$$

Lemma C.2. On the event $\mathcal{E}_1 \cap \mathcal{E}_2$, for any $(h, k) \in [H] \times [K]$, the estimated variance satisfies
$$\left|\bar{\mathbb{V}}_{s_h^k, a_h^k, h} V_{k,h+1} - \mathbb{V}_{s_h^k, a_h^k, h} V_{k,h+1}\right| \le E_{k,h}.$$

The following lemma provides a bound on the partial sum of the bonuses.

Lemma C.3 (Restatement of Lemma 5.3). Let $\iota = \log(1 + K/(d\lambda)) = \log(1 + TH/d)$. For any $h \in [H]$, $n \in [N]$, and parameters $\beta' \ge 1$ and $C \ge 1$, the partial sum of bonuses can be bounded as
$$\sum_{i=1}^{K'}\min\left\{\beta'\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, C\right\} \le 4d^3H^3C\iota + 10\beta' d^4H^2\iota + 2\beta'\sqrt{d\iota\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)}.$$

Proof of Lemma C.3. Fix a step $h \in [H]$ and $n \in [N]$. Since $\beta' \ge 1$, the summation of bonuses is bounded by
$$\sum_{i=1}^{K'}\min\left\{\beta'\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, C\right\} \le \sum_{i=1}^{K'}\beta'\min\left\{\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, 1\right\} + C\sum_{i=1}^{K'}\mathbb{I}\left\{\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}} \ge 1\right\}. \tag{11}$$

For the first term in Equation (11), for any $i \in [K']$, define
$$\Sigma_i' = \lambda I_d + \sum_{j=1}^{i-1}\bar{\sigma}_{k_j,h}^{-2}\,\phi_h^{k_j}(\phi_h^{k_j})^{\top}.$$
Then, according to the definition of $\Sigma_{k_i,h}$ in Line 23 of Algorithm 1, we have $\Sigma_i' \preceq \Sigma_{k_i,h}$, and thus
$$\sum_{i=1}^{K'}\beta'\min\left\{\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, 1\right\} \le \sum_{i=1}^{K'}\beta'\min\left\{\sqrt{(\phi_h^{k_i})^{\top}(\Sigma_i')^{-1}\phi_h^{k_i}},\, 1\right\} \le 10\beta' d^4H^2\iota + 2\beta'\sqrt{d\iota\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)}, \tag{12}$$
where the last inequality is by Lemma A.4.

For the second term in Equation (11), let
$$\{i_1, i_2, \ldots, i_m\} = \left\{i \,\middle|\, \sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}} \ge 1\right\},$$
and for any $t \in [m]$, set $\Sigma_0'' = \lambda I_d$ and
$$\Sigma_t'' = \lambda I_d + \sum_{j=1}^{t}\bar{\sigma}_{k_{i_j},h}^{-2}\,\phi_h^{k_{i_j}}(\phi_h^{k_{i_j}})^{\top}.$$
Then it holds that $\Sigma_{t-1}'' \preceq \Sigma_{k_{i_t},h}$, and thus
$$\sum_{t=1}^{m}(\phi_h^{k_{i_t}})^{\top}(\Sigma_{t-1}'')^{-1}\phi_h^{k_{i_t}} \ge \sum_{t=1}^{m}(\phi_h^{k_{i_t}})^{\top}\Sigma_{k_{i_t},h}^{-1}\phi_h^{k_{i_t}} \ge m. \tag{13}$$
On the other hand, noticing that $\bar{\sigma}_{k,h}^2 \le 4d^3H^2/\sqrt{\lambda}$ since $\Sigma_{k,h} \succeq \lambda I_d$ and $\|\phi_h^k\|_2 \le 1$ in Equation (2), we have
$$\sum_{t=1}^{m}(\phi_h^{k_{i_t}})^{\top}(\Sigma_{t-1}'')^{-1}\phi_h^{k_{i_t}} \le \frac{4d^3H^2}{\sqrt{\lambda}}\sum_{t=1}^{m}\bar{\sigma}_{k_{i_t},h}^{-2}\,(\phi_h^{k_{i_t}})^{\top}(\Sigma_{t-1}'')^{-1}\phi_h^{k_{i_t}} \le 4d^3H^3\iota, \tag{14}$$
where the last inequality holds due to Lemma A.3. Combining the results in (13) and (14), we have $m \le 4d^3H^3\iota$. Together with Equation (12), substituting back into Equation (11) completes the proof of Lemma C.3.
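The engine behind Lemma C.3 is the variance-weighted recursion of Lemma A.4. The following sketch simulates that recursion with assumed toy values of $d$, $K$, $\lambda$, $\alpha$, $\gamma$, and randomly drawn noise levels $\sigma_k$, and compares the accumulated potential against the stated upper bound; it is an illustration, not part of the proof.

```python
import numpy as np

# A minimal simulation of the recursion in Lemma A.4:
#   sigma_bar_k     = max{sigma_k, alpha, gamma * ||a_k||_{Sigma_hat_k^{-1}}^{1/2}},
#   Sigma_hat_{k+1} = Sigma_hat_k + a_k a_k^T / sigma_bar_k^2,
# checking sum_k min{1, ||a_k||_{Sigma_hat_k^{-1}}} against the stated bound.
rng = np.random.default_rng(2)
d, K, lam, alpha, gamma, A = 4, 1000, 1.0, 0.5, 1.0, 1.0   # assumed toy values
Sigma_hat = lam * np.eye(d)
lhs, sigma_sq_sum = 0.0, 0.0
for _ in range(K):
    a = rng.normal(size=d)
    a /= max(1.0, np.linalg.norm(a))            # enforce ||a_k||_2 <= A = 1
    sigma = rng.uniform(0.0, 2.0)               # an arbitrary noise level sigma_k
    quad = a @ np.linalg.solve(Sigma_hat, a)    # squared Sigma_hat_k^{-1}-norm of a_k
    sigma_bar = max(sigma, alpha, gamma * quad**0.25)
    lhs += min(1.0, np.sqrt(quad))
    Sigma_hat += np.outer(a, a) / sigma_bar**2
    sigma_sq_sum += sigma**2
iota = np.log(1 + K * A**2 / (d * lam * alpha**2))
rhs = (2*d*iota + 2*gamma**2*d*iota
       + 2*np.sqrt(d*iota) * np.sqrt(sigma_sq_sum + K * alpha**2))
print(lhs, "<=", rhs)                           # the inequality should hold
```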
With the lemma established above, we can then bound the error between the optimistic estimate $V_{k,h}(s)$ and the true value function $V_h^{\pi_k}(s)$.

Lemma C.4. On the event $\bigcap_{i=1}^{3}\mathcal{E}_i$, for any $h \in [H]$ and $n \in [N]$, we have
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - V_{h+1}^{\pi_{k_i}}\right) \le 16d^3H^6\iota + 40\beta d^4H^5\iota + 8H\beta\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 8\sqrt{H^5K'\log(HNK/\delta)}.$$

Proof of Lemma C.4. For any $h \in [H]$ and $k \in [K]$, we have
$$V_{k,h}(s_h^k) - V_h^{\pi_k}(s_h^k) = Q_{k,h}(s_h^k, a_h^k) - Q_h^{\pi_k}(s_h^k, a_h^k)$$
$$\le \min\left\{\widehat{w}_{k_{\mathrm{last}},h}^{\top}\phi_h^k + \beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, H\right\} - \mathbb{P}_{s_h^k, a_h^k, h}V_{k,h+1} + \mathbb{P}_{s_h^k, a_h^k, h}\left(V_{k,h+1} - V_{h+1}^{\pi_k}\right)$$
$$\le \mathbb{P}_{s_h^k, a_h^k, h}\left(V_{k,h+1} - V_{h+1}^{\pi_k}\right) + 2\min\left\{\beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, H\right\}$$
$$\le \mathbb{P}_{s_h^k, a_h^k, h}\left(V_{k,h+1} - V_{h+1}^{\pi_k}\right) + 4\min\left\{\beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k,h}^{-1}\phi_h^k},\, H\right\}$$
$$= V_{k,h+1}(s_{h+1}^k) - V_{h+1}^{\pi_k}(s_{h+1}^k) + \left(\mathbb{P}_{s_h^k, a_h^k, h} - \mathbf{1}_{s_{h+1}^k}\right)\left(V_{k,h+1} - V_{h+1}^{\pi_k}\right) + 4\min\left\{\beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k,h}^{-1}\phi_h^k},\, H\right\}. \tag{15}$$
Here, the first inequality holds due to the update rule of $Q_{k,h}(s_h^k, a_h^k)$ in Line 9 of Algorithm 1 and the Bellman equation in Equation (1), the second inequality holds due to the event $\mathcal{E}_2$ in Lemma B.1, and the last inequality holds due to the update rule in Line 8 of Algorithm 1 and Lemma A.2.

Summing Equation (15) over the steps $h' \in \{h, \ldots, H\}$ and the episodes $k_i$ for all $i \in [K']$, we obtain
$$\sum_{i=1}^{K'}\left(V_{k_i,h}(s_h^{k_i}) - V_h^{\pi_{k_i}}(s_h^{k_i})\right) \le \sum_{i=1}^{K'}\sum_{h'=h}^{H}\left(\mathbb{P}_{s_{h'}^{k_i}, a_{h'}^{k_i}, h'} - \mathbf{1}_{s_{h'+1}^{k_i}}\right)\left(V_{k_i,h'+1} - V_{h'+1}^{\pi_{k_i}}\right) + \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\}$$
$$\le \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\} + 4\sqrt{H^3K'\log(HNK/\delta)}$$
$$\le 16d^3H^5\iota + 40\beta d^4H^4\iota + 8\beta\sum_{h'=h}^{H}\sqrt{d\iota\sum_{i=1}^{K'}\left(\sigma_{k_i,h'}^2 + H\right)} + 4\sqrt{H^3K'\log(HNK/\delta)}$$
$$\le 16d^3H^5\iota + 40\beta d^4H^4\iota + 8\beta\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 4\sqrt{H^3K'\log(HNK/\delta)}, \tag{16}$$
where the second inequality holds due to the event $\mathcal{E}_3$ in Lemma B.1, the third inequality holds due to Lemma C.3, and the last inequality holds due to the Cauchy-Schwarz inequality. Furthermore, we have
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - V_{h+1}^{\pi_{k_i}}\right) = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(V_{k_i,h+1}(s_{h+1}^{k_i}) - V_{h+1}^{\pi_{k_i}}(s_{h+1}^{k_i})\right) + \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h} - \mathbf{1}_{s_{h+1}^{k_i}}\right)\left(V_{k_i,h+1} - V_{h+1}^{\pi_{k_i}}\right)$$
$$\le 16d^3H^6\iota + 40\beta d^4H^5\iota + 8H\beta\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 8\sqrt{H^5K'\log(HNK/\delta)},$$
where the last inequality holds due to (16) and the event $\mathcal{E}_3$ in Lemma B.1. Therefore, we finish the proof of Lemma C.4.

In addition, for the difference between the optimistic estimate $V_{k,h}(s)$ and the pessimistic estimate $\check{V}_{k,h}(s)$, we have the following lemma.

Lemma C.5. On the event $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_4$, the difference between the optimistic value function $V_{k,h}$ and the pessimistic value function $\check{V}_{k,h}$ is upper bounded by
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right) \le 32d^3H^6\iota + 40\left(\beta + \bar{\beta}\right)d^4H^5\iota + 8H\left(\beta + \bar{\beta}\right)\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 8\sqrt{H^5K'\log(HNK/\delta)}.$$
Proof of Lemma C.5. For each step $h \in [H]$ and episode $k \in [K]$, we have
$$V_{k,h}(s_h^k) - \check{V}_{k,h}(s_h^k) \le Q_{k,h}(s_h^k, a_h^k) - \check{Q}_{k,h}(s_h^k, a_h^k)$$
$$\le \mathbb{P}_{s_h^k, a_h^k, h}\left(V_{k,h+1} - \check{V}_{k,h+1}\right) + \min\left\{\widehat{w}_{k_{\mathrm{last}},h}^{\top}\phi_h^k + \beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, H\right\} - \mathbb{P}_{s_h^k, a_h^k, h}V_{k,h+1} - \max\left\{\check{w}_{k_{\mathrm{last}},h}^{\top}\phi_h^k - \bar{\beta}\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, 0\right\} + \mathbb{P}_{s_h^k, a_h^k, h}\check{V}_{k,h+1}$$
$$\le \mathbb{P}_{s_h^k, a_h^k, h}\left(V_{k,h+1} - \check{V}_{k,h+1}\right) + 2\min\left\{\beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, H\right\} + 2\min\left\{\bar{\beta}\sqrt{(\phi_h^k)^{\top}\Sigma_{k_{\mathrm{last}},h}^{-1}\phi_h^k},\, H\right\}$$
$$\le V_{k,h+1}(s_{h+1}^k) - \check{V}_{k,h+1}(s_{h+1}^k) + \left(\mathbb{P}_{s_h^k, a_h^k, h} - \mathbf{1}_{s_{h+1}^k}\right)\left(V_{k,h+1} - \check{V}_{k,h+1}\right) + 4\min\left\{\beta\sqrt{(\phi_h^k)^{\top}\Sigma_{k,h}^{-1}\phi_h^k},\, H\right\} + 4\min\left\{\bar{\beta}\sqrt{(\phi_h^k)^{\top}\Sigma_{k,h}^{-1}\phi_h^k},\, H\right\}, \tag{17}$$
where the first inequality holds due to the fact that $\check{V}_{k,h}(s_h^k) = \max_a \check{Q}_{k,h}(s_h^k, a) \ge \check{Q}_{k,h}(s_h^k, a_h^k)$, the second inequality holds due to the update rules of the value functions $Q_{k,h}(s_h^k, a_h^k)$ and $\check{Q}_{k,h}(s_h^k, a_h^k)$ in Lines 9-10 of Algorithm 1, the third inequality holds due to the events $\mathcal{E}_1$ and $\mathcal{E}_2$ in Lemma B.1, and the last inequality holds due to the update rule in Line 8 of Algorithm 1 and Lemma A.2.

Summing Equation (17) over the steps $h' \in \{h, \ldots, H\}$ and the episodes $k_i$ for all $i \in [K']$, we obtain
$$\sum_{i=1}^{K'}\left(V_{k_i,h}(s_h^{k_i}) - \check{V}_{k_i,h}(s_h^{k_i})\right) \le \sum_{i=1}^{K'}\sum_{h'=h}^{H}\left(\mathbb{P}_{s_{h'}^{k_i}, a_{h'}^{k_i}, h'} - \mathbf{1}_{s_{h'+1}^{k_i}}\right)\left(V_{k_i,h'+1} - \check{V}_{k_i,h'+1}\right) + \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\} + \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\bar{\beta}\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\}$$
$$\le \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\} + \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\bar{\beta}\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\} + 4\sqrt{H^3K'\log(HNK/\delta)}$$
$$\le 32d^3H^5\iota + 40\left(\beta + \bar{\beta}\right)d^4H^4\iota + 8\left(\beta + \bar{\beta}\right)\sum_{h'=h}^{H}\sqrt{d\iota\sum_{i=1}^{K'}\left(\sigma_{k_i,h'}^2 + H\right)} + 4\sqrt{H^3K'\log(HNK/\delta)}$$
$$\le 32d^3H^5\iota + 40\left(\beta + \bar{\beta}\right)d^4H^4\iota + 8\left(\beta + \bar{\beta}\right)\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 4\sqrt{H^3K'\log(HNK/\delta)}, \tag{18}$$
where the second inequality holds due to the event $\mathcal{E}_4$ in Lemma B.1, the third inequality holds due to Lemma C.3, and the last inequality holds due to the Cauchy-Schwarz inequality. Furthermore, we have
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right) = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(V_{k_i,h+1}(s_{h+1}^{k_i}) - \check{V}_{k_i,h+1}(s_{h+1}^{k_i})\right) + \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h} - \mathbf{1}_{s_{h+1}^{k_i}}\right)\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right)$$
$$\le 32d^3H^6\iota + 40\left(\beta + \bar{\beta}\right)d^4H^5\iota + 8H\left(\beta + \bar{\beta}\right)\sqrt{dH\iota\sum_{h=1}^{H}\sum_{i=1}^{K'}\left(\sigma_{k_i,h}^2 + H\right)} + 8\sqrt{H^5K'\log(HNK/\delta)},$$
where the inequality holds due to Equation (18) and the event $\mathcal{E}_4$ in Lemma B.1. Therefore, we finish the proof.
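Before turning to the regret analysis, the following toy sketch illustrates the confidence-interval sandwich behind Lemma C.1 in a one-step linear regression setting. All values below (dimension, sample size, noise level, and the generous radius multiplier $\beta$) are assumptions chosen for illustration; the actual algorithm uses the variance-weighted estimators of Appendix B.

```python
import numpy as np

# Toy illustration of the optimism/pessimism sandwich behind Lemma C.1:
# ridge regression on noisy linear responses, with upper and lower confidence
# bounds built from the elliptical radius beta * ||phi||_{Sigma^{-1}}.
rng = np.random.default_rng(6)
d, K, lam, beta = 5, 500, 1.0, 3.0                  # assumed toy values
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                      # unknown parameter, norm 1
Phi = rng.normal(size=(K, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # features with norm 1
y = Phi @ theta + 0.1 * rng.normal(size=K)          # noisy linear observations
Sigma = lam * np.eye(d) + Phi.T @ Phi               # regularized Gram matrix
w_hat = np.linalg.solve(Sigma, Phi.T @ y)           # ridge estimate of theta
phi = rng.normal(size=d)
phi /= np.linalg.norm(phi)                          # a held-out test feature
radius = beta * np.sqrt(phi @ np.linalg.solve(Sigma, phi))
ucb, lcb = phi @ w_hat + radius, phi @ w_hat - radius
print(lcb, "<=", phi @ theta, "<=", ucb)            # sandwich holds w.h.p.
```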
D Proof of Theorem 4.1

We first prove Lemma 5.1, which connects the expected regret with the cumulative sum of suboptimality gaps.

Proof of Lemma 5.1. We have
$$\left(V_1^{\star} - V_1^{\pi_k}\right)(s_1^k) = V_1^{\star}(s_1^k) - Q_1^{\star}(s_1^k, a_1^k) + \left(Q_1^{\star} - Q_1^{\pi_k}\right)(s_1^k, a_1^k)$$
$$= \Delta_1(s_1^k, a_1^k) + \mathbb{E}\left[\left(V_2^{\star} - V_2^{\pi_k}\right)(s_2^k) \,\middle|\, s_2^k \sim \mathbb{P}_1(\cdot \mid s_1^k, a_1^k)\right]$$
$$= \mathbb{E}\left[\Delta_1(s_1^k, a_1^k) + \Delta_2(s_2^k, a_2^k) \,\middle|\, s_2^k \sim \mathbb{P}_1(\cdot \mid s_1^k, a_1^k)\right] + \mathbb{E}\left[\left(Q_2^{\star} - Q_2^{\pi_k}\right)(s_2^k, a_2^k) \,\middle|\, s_2^k \sim \mathbb{P}_1(\cdot \mid s_1^k, a_1^k)\right]$$
$$= \cdots = \mathbb{E}\left[\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) \,\middle|\, s_{h+1}^k \sim \mathbb{P}_h(\cdot \mid s_h^k, a_h^k),\, h \in [H-1]\right].$$
Here, the second equation holds due to Equation (1). Therefore, we obtain another expression for the expected regret:
$$\mathbb{E}\left(\mathrm{Regret}(T)\right) = \mathbb{E}\left[\sum_{k=1}^{K}\left(V_1^{\star} - V_1^{\pi_k}\right)(s_1^k)\right] = \mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k)\right].$$
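Lemma 5.1 is a purely structural identity, so it can be verified exactly on a small tabular instance. The following sketch builds a random toy MDP (all sizes are assumed values) and checks by backward induction that $V_1^{\star} - V_1^{\pi}$ coincides with the expected sum of gaps $\Delta_h(s,a) = V_h^{\star}(s) - Q_h^{\star}(s,a)$ accumulated along the policy's trajectories.

```python
import numpy as np

# Exact check of Lemma 5.1 on a random tabular MDP (assumed toy instance):
# V*_1(s) - V^pi_1(s) equals the expected sum of suboptimality gaps
# Delta_h(s, a) = V*_h(s) - Q*_h(s, a) collected along pi's trajectories.
rng = np.random.default_rng(3)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] is a distribution
r = rng.uniform(size=(H, S, A))                 # reward function
pi = rng.integers(A, size=(H, S))               # a fixed deterministic policy

V_star, V_pi = np.zeros(S), np.zeros(S)
gap_term = np.zeros(S)                          # E[sum_{h' >= h} Delta | s_h = s]
for h in reversed(range(H)):
    Q_star = r[h] + P[h] @ V_star               # Bellman optimality backup
    V_star = Q_star.max(axis=1)
    Q_pi = r[h] + P[h] @ V_pi                   # Bellman backup under pi
    V_pi = Q_pi[np.arange(S), pi[h]]
    Delta = V_star[:, None] - Q_star            # Delta_h(s, a)
    gap_term = (Delta[np.arange(S), pi[h]]
                + P[h][np.arange(S), pi[h]] @ gap_term)
print(np.allclose(V_star - V_pi, gap_term))     # should print True
```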
We then prove Lemma 5.4, which bounds the partial sum of the estimated variances.

Lemma D.1 (Formal Statement of Lemma 5.4). On the event $\bigcap_{i=1}^{5}\mathcal{E}_i$, the partial sum of the estimated variances is upper bounded as follows:
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\sigma_{k_i,h}^2 \le O\left(H^2K' + d^{10}H^{11}\log^3(1 + dHNK/\delta)\right).$$

Proof of Lemma D.1. According to the definition of $\sigma_{k,h}$ in Equation (3), we have
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\sigma_{k_i,h}^2 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\bar{\mathbb{V}}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1} + E_{k_i,h} + D_{k_i,h} + H\right) = H^2K' + I_1 + I_2 + I_3 + I_4 + I_5, \tag{19}$$
where
$$I_1 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\bar{\mathbb{V}}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1} - \mathbb{V}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1}\right), \qquad I_2 = \sum_{i=1}^{K'}\sum_{h=1}^{H}E_{k_i,h}, \qquad I_3 = \sum_{i=1}^{K'}\sum_{h=1}^{H}D_{k_i,h},$$
$$I_4 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{V}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1} - \mathbb{V}_{s_h^{k_i}, a_h^{k_i}, h}V_{h+1}^{\pi_{k_i}}\right), \qquad I_5 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\mathbb{V}_{s_h^{k_i}, a_h^{k_i}, h}V_{h+1}^{\pi_{k_i}}.$$

For the term $I_1$, according to Lemma C.2, it is upper bounded by
$$I_1 \le \sum_{i=1}^{K'}\sum_{h=1}^{H}E_{k_i,h} = I_2. \tag{20}$$

For the term $I_2$, it is upper bounded by
$$I_2 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\min\left\{\widetilde{\beta}\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, H^2\right\} + \min\left\{2H\bar{\beta}\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, H^2\right\}\right)$$
$$\le 8d^3H^6\iota + \left(10\widetilde{\beta} + 20\bar{\beta}\right)d^4H^4\iota + \left(2\widetilde{\beta} + 4\bar{\beta}\right)\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\sigma_{k_i,h}^2 + H\right)}, \tag{21}$$
where the inequality holds due to Lemma C.3 and the Cauchy-Schwarz inequality.

For the term $I_3$, it is upper bounded by
$$I_3 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\min\left\{4d^3H^2\left(\widehat{w}_{k_i,h}^{\top}\phi_h^{k_i} - \check{w}_{k_i,h}^{\top}\phi_h^{k_i} + 2\bar{\beta}\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}}\right),\, d^3H^3\right\}$$
$$\le \sum_{i=1}^{K'}\sum_{h=1}^{H}\min\left\{4d^3H^2\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right) + 4\bar{\beta}\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}}\right),\, d^3H^3\right\}$$
$$\le \sum_{i=1}^{K'}\sum_{h=1}^{H}4d^3H^2\,\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right) + \sum_{i=1}^{K'}\sum_{h=1}^{H}\min\left\{16d^3H^2\bar{\beta}\sqrt{(\phi_h^{k_i})^{\top}\Sigma_{k_i,h}^{-1}\phi_h^{k_i}},\, d^3H^3\right\}$$
$$\le \sum_{i=1}^{K'}\sum_{h=1}^{H}4d^3H^2\,\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1} - \check{V}_{k_i,h+1}\right) + 64d^6H^7\iota + 160\bar{\beta}d^7H^5\iota + 32d^3H^2\bar{\beta}\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\sigma_{k_i,h}^2 + H\right)}$$
$$\le 192d^6H^8\iota + 320\left(\beta + \bar{\beta}\right)d^7H^7\iota + 64d^3H^3\left(\beta + \bar{\beta}\right)\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\sigma_{k_i,h}^2 + H\right)} + 32d^3\sqrt{H^9K'\log(HNK/\delta)}, \tag{22}$$
where the first inequality holds due to the event $\mathcal{E}_1$ in Lemma B.1, the second inequality holds due to the fact that $V_{k,h+1}(s) \ge V_{h+1}^{\star}(s) \ge \check{V}_{k,h+1}(s)$ by Lemma C.1, the third inequality holds due to Lemma C.3, and the last inequality holds due to Lemma C.5.

For the term $I_4$, it is upper bounded by
$$I_4 = \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1}\right)^2 - \left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1}\right)^2 - \mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{h+1}^{\pi_{k_i}}\right)^2 + \left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}V_{h+1}^{\pi_{k_i}}\right)^2\right)$$
$$\le \sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{k_i,h+1}\right)^2 - \mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}\left(V_{h+1}^{\pi_{k_i}}\right)^2\right)$$
$$\le 2H\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}V_{k_i,h+1} - \mathbb{P}_{s_h^{k_i}, a_h^{k_i}, h}V_{h+1}^{\pi_{k_i}}\right)$$
$$\le 32d^3H^7\iota + 80\beta d^4H^6\iota + 16H^2\beta\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\sigma_{k_i,h}^2 + H\right)} + 16\sqrt{H^7K'\log(HNK/\delta)}, \tag{23}$$
where the first inequality is because $V_{h+1}^{\pi_k}(s') \le V_{h+1}^{\star}(s') \le V_{k,h+1}(s')$ by Lemma C.1, the second inequality holds due to $0 \le V_{k,h+1}(s'), V_{h+1}^{\pi_k}(s') \le H$, and the last inequality holds due to Lemma C.4.

By the event $\mathcal{E}_5$ in Lemma B.1, the term $I_5$ satisfies
$$I_5 \le 3H^2K' + 3H^3\log(HNK/\delta). \tag{24}$$

Substituting the results in (20), (21), (22), (23), and (24) into (19), we have
$$\sum_{i=1}^{K'}\sum_{h=1}^{H}\sigma_{k_i,h}^2 \le 4H^2K' + 240d^6H^8\iota + 360\left(\beta + \widetilde{\beta} + \bar{\beta}\right)d^7H^7\iota + 48d^3\sqrt{H^9K'\log(HNK/\delta)} + 80d^3H^4\left(\beta + \widetilde{\beta} + \bar{\beta}\right)\sqrt{dK'\iota} + 80d^3H^3\left(\beta + \widetilde{\beta} + \bar{\beta}\right)\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\sigma_{k_i,h}^2}.$$
Therefore, using the fact that $x \le a\sqrt{x} + b$ implies $x \le a^2 + 2b$ (solving the quadratic in $\sqrt{x}$ gives $\sqrt{x} \le (a + \sqrt{a^2 + 4b})/2$, and then $(u + v)^2 \le 2u^2 + 2v^2$ yields $x \le a^2 + 2b$), together with $\lambda = 1/H^2$, we can solve the inequality and complete the proof.
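The self-bounding step above is elementary but easy to get wrong by a constant. The following quick check draws random coefficients and confirms that the extremal solution of $x \le a\sqrt{x} + b$ never exceeds $a^2 + 2b$.

```python
import numpy as np

# Quick check of the self-bounding fact used to close the proof of Lemma D.1:
# if x <= a*sqrt(x) + b with a, b >= 0, then x <= a^2 + 2*b.
rng = np.random.default_rng(4)
worst = 0.0
for _ in range(100000):
    a, b = rng.uniform(0, 10, size=2)
    # The largest x satisfying x <= a*sqrt(x) + b is the root of the quadratic.
    x = ((a + np.sqrt(a**2 + 4*b)) / 2)**2
    worst = max(worst, x - (a**2 + 2*b))
print("max violation of x <= a^2 + 2*b:", worst)
# should be non-positive, up to floating-point rounding
```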
With the lemma established above, we can now prove Lemma 5.2.

Lemma D.2 (Formal Statement of Lemma 5.2). Let $\iota_2 = \log(1 + dHNK/\delta)$. On the event $\bigcap_{i=1}^{5}\mathcal{E}_i$, for any $h \in [H]$ and $n \in [N]$, we have
$$\sum_{k=1}^{K}\mathbb{I}\left[Q_{k,h}(s_h^k, a_h^k) - Q_h^{\pi_k}(s_h^k, a_h^k) \ge 2^n\Delta_{\min}\right] = K'(h, n) \le O\left(\frac{d^2H^3\iota_2^3}{4^n\Delta_{\min}^2} + \frac{d^6H^6\iota_2^2}{2^n\Delta_{\min}}\right).$$

Proof of Lemma D.2. For any fixed $h \in [H]$ and $n \in [N]$, by the definitions of $k_i$ in Equation (6) and $K'$ in Equation (5), we first note that
$$\sum_{i=1}^{K'}\left(Q_{k_i,h}(s_h^{k_i}, a_h^{k_i}) - Q_h^{\pi_{k_i}}(s_h^{k_i}, a_h^{k_i})\right) \ge 2^n\Delta_{\min}K'. \tag{25}$$
On the other hand, we upper bound $\sum_{i=1}^{K'}\left(Q_{k_i,h}(s_h^{k_i}, a_h^{k_i}) - Q_h^{\pi_{k_i}}(s_h^{k_i}, a_h^{k_i})\right)$ as follows. By Equation (15) and the relationships $V_{k,h'+1}(s_{h'+1}^k) = Q_{k,h'+1}(s_{h'+1}^k, a_{h'+1}^k)$ and $V_{h'+1}^{\pi_k}(s_{h'+1}^k) = Q_{h'+1}^{\pi_k}(s_{h'+1}^k, a_{h'+1}^k)$, we have
$$Q_{k,h'}(s_{h'}^k, a_{h'}^k) - Q_{h'}^{\pi_k}(s_{h'}^k, a_{h'}^k) \le Q_{k,h'+1}(s_{h'+1}^k, a_{h'+1}^k) - Q_{h'+1}^{\pi_k}(s_{h'+1}^k, a_{h'+1}^k) + \left(\mathbb{P}_{s_{h'}^k, a_{h'}^k, h'} - \mathbf{1}_{s_{h'+1}^k}\right)\left(V_{k,h'+1} - V_{h'+1}^{\pi_k}\right) + 4\min\left\{\beta\sqrt{(\phi_{h'}^k)^{\top}\Sigma_{k,h'}^{-1}\phi_{h'}^k},\, H\right\}. \tag{26}$$
Summing Equation (26) over $h \le h' \le H$ and all $k_i$, we have
$$\sum_{i=1}^{K'}\left(Q_{k_i,h}(s_h^{k_i}, a_h^{k_i}) - Q_h^{\pi_{k_i}}(s_h^{k_i}, a_h^{k_i})\right) \le \sum_{i=1}^{K'}\sum_{h'=h}^{H}\left(\mathbb{P}_{s_{h'}^{k_i}, a_{h'}^{k_i}, h'} - \mathbf{1}_{s_{h'+1}^{k_i}}\right)\left(V_{k_i,h'+1} - V_{h'+1}^{\pi_{k_i}}\right) + \sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\}. \tag{27}$$
For the first term on the right-hand side of Equation (27), by the event $\mathcal{E}_3$ in Lemma B.1, we have
$$\sum_{i=1}^{K'}\sum_{h'=h}^{H}\left(\mathbb{P}_{s_{h'}^{k_i}, a_{h'}^{k_i}, h'} - \mathbf{1}_{s_{h'+1}^{k_i}}\right)\left(V_{k_i,h'+1} - V_{h'+1}^{\pi_{k_i}}\right) \le O\left(\sqrt{H^3\iota_2 K'}\right). \tag{28}$$
For the second term on the right-hand side of Equation (27), by Lemma C.3, we have
$$\sum_{i=1}^{K'}\sum_{h'=h}^{H}4\min\left\{\beta\sqrt{(\phi_{h'}^{k_i})^{\top}\Sigma_{k_i,h'}^{-1}\phi_{h'}^{k_i}},\, H\right\} \le 16d^3H^5\iota + 40\beta d^4H^3\iota + 8\beta\sum_{h'=h}^{H}\sqrt{d\iota\sum_{i=1}^{K'}\left(\sigma_{k_i,h'}^2 + H\right)}$$
$$\le 16d^3H^5\iota + 40\beta d^4H^3\iota + 8\beta\sqrt{dH\iota\sum_{i=1}^{K'}\sum_{h=1}^{H}\left(\sigma_{k_i,h}^2 + H\right)} \le O\left(d^6H^6\iota_2^2 + d\sqrt{H^3\iota_2^3K'}\right), \tag{29}$$
where the second inequality is because of the Cauchy-Schwarz inequality and the last inequality holds due to Lemma D.1.

Substituting Equation (28) and Equation (29) into Equation (27), we have
$$\sum_{i=1}^{K'}\left(Q_{k_i,h}(s_h^{k_i}, a_h^{k_i}) - Q_h^{\pi_{k_i}}(s_h^{k_i}, a_h^{k_i})\right) \le O\left(d^6H^6\iota_2^2 + d\sqrt{H^3\iota_2^3K'}\right). \tag{30}$$
Combining the lower bound (25) with the upper bound (30), we derive the following constraint on $K'$:
$$2^n\Delta_{\min}K' \le O\left(d^6H^6\iota_2^2 + d\sqrt{H^3\iota_2^3K'}\right). \tag{31}$$
Solving for $K'$ in Equation (31) completes the proof.

Back to Theorem 4.1: on the event $\bigcap_{i=1}^{5}\mathcal{E}_i$, we have
$$\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) = \sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) \times \sum_{n=1}^{N}\mathbb{I}\left[\Delta_h(s_h^k, a_h^k) \in I_n\right]$$
$$\le \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{n=1}^{N}2^n\Delta_{\min} \times \mathbb{I}\left[\Delta_h(s_h^k, a_h^k) \in I_n\right]$$
$$\le \sum_{h=1}^{H}\sum_{n=1}^{N}2^n\Delta_{\min} \times \sum_{k=1}^{K}\mathbb{I}\left[V_h^{\star}(s_h^k) - Q_h^{\pi_k}(s_h^k, a_h^k) \ge 2^{n-1}\Delta_{\min}\right]$$
$$\le O\left(\frac{d^2H^4\iota_2^3}{\Delta_{\min}} + d^6H^7\iota_2^2\right). \tag{32}$$
Let $\delta = 1/(18T)$; then the event $\mathcal{E} := \bigcap_{i=1}^{5}\mathcal{E}_i$ holds with probability at least $1 - 1/T$. Therefore, it holds that
$$\mathbb{E}\left(\mathrm{Regret}(T)\right) \le \mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) \,\middle|\, \mathcal{E}\right]\mathbb{P}(\mathcal{E}) + \mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) \,\middle|\, \mathcal{E}^c\right]\mathbb{P}(\mathcal{E}^c)$$
$$\le O\left(\frac{d^2H^4\iota_2^3}{\Delta_{\min}} + d^6H^7\iota_2^2\right) + \frac{1}{T} \cdot HT$$
$$= O\left(\frac{d^2H^4}{\Delta_{\min}}\log^3(1 + dHNK) + d^6H^7\log^2(1 + dHNK)\right)$$
$$\le O\left(\frac{d^2H^4}{\Delta_{\min}}\log^3\left(1 + \frac{dHK}{\Delta_{\min}}\right) + d^6H^7\log^2\left(1 + \frac{dHK}{\Delta_{\min}}\right)\right).$$
This finishes the proof of Theorem 4.1.
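The peeling step in (32) trades a geometric series against the per-bucket counts from Lemma D.2. The sketch below uses placeholder constants standing in for the $d$, $H$, and $\iota_2$ factors (so it illustrates only the arithmetic of the peeling, not the actual constants) and shows how the series collapses to the $O(1/\Delta_{\min})$ rate plus a logarithmic number of buckets.

```python
import numpy as np

# Numerical illustration of the peeling argument in the proof of Theorem 4.1:
# summing 2^n * Delta_min over buckets, weighted by the per-bucket counts from
# Lemma D.2, collapses the geometric series into O(1/Delta_min) + O(N).
Delta_min, A, B = 0.01, 1.0, 1.0            # A, B stand in for d, H, iota factors
N = int(np.ceil(np.log2(1.0 / Delta_min)))  # buckets cover gaps up to O(1)
total = sum(2**n * Delta_min * (A / (4**n * Delta_min**2) + B / (2**n * Delta_min))
            for n in range(1, N + 1))
print(total, "<=", A / Delta_min + B * N)   # geometric series sums below A/Delta_min
```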
E Proof of Corollary 4.2

Following Lemma 6.1 in He et al. (2021a), we have the following conclusion.

Lemma E.1. For any MDP and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following event holds:
$$\mathcal{E}_6 = \left\{\sum_{k=1}^{K}\left(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\right) \le 2\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) + \frac{16H^2\log(HK/\delta)}{3} + 2\right\}.$$

Set $\delta \leftarrow \delta/38$; then the event $\bigcap_{i=1}^{6}\mathcal{E}_i$ holds with probability at least $1 - \delta/2$. On the event $\bigcap_{i=1}^{6}\mathcal{E}_i$, we have
$$\sum_{k=1}^{K}\left(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\right) \le 2\sum_{k=1}^{K}\sum_{h=1}^{H}\Delta_h(s_h^k, a_h^k) + \frac{16H^2\log(38HK/\delta)}{3} + 2 \le O\left(\frac{d^2H^4\iota_2^3}{\Delta_{\min}} + d^6H^7\iota_2^2\right), \tag{33}$$
where the first inequality holds due to $\mathcal{E}_6$ in Lemma E.1 and the last inequality holds due to Equation (32).

Similar to Bai et al. (2019), we now define a stochastic policy $\widehat{\pi}$ as
$$\widehat{\pi} = \frac{1}{K}\sum_{k=1}^{K}\pi_k,$$
where $\pi_k$ is the policy executed in episode $k$ of LSVI-UCB++. By definition, with probability at least $1 - \delta/2$, we have
$$\mathbb{E}\left[V_1^{\star}(s_1) - V_1^{\widehat{\pi}}(s_1)\right] = \frac{1}{K}\sum_{k=1}^{K}\left(V_1^{\star}(s_1) - V_1^{\pi_k}(s_1)\right) \le O\left(\frac{d^2H^4\iota_2^3}{\Delta_{\min}K} + \frac{d^6H^7\iota_2^2}{K}\right).$$
Furthermore, by Markov's inequality, with probability at least $1 - \delta/2$ we have
$$V_1^{\star}(s_1) - V_1^{\widehat{\pi}}(s_1) \le \mathbb{E}\left[V_1^{\star}(s_1) - V_1^{\widehat{\pi}}(s_1)\right]/\delta \le O\left(\frac{d^2H^4\iota_2^3}{\Delta_{\min}K\delta} + \frac{d^6H^7\iota_2^2}{K\delta}\right),$$
where the last inequality holds with probability at least $1 - \delta$ due to Equation (33). Then taking
$$K = O\left(\frac{d^2H^4}{\Delta_{\min}\delta\epsilon}\log^3\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right) + \frac{d^6H^7}{\delta\epsilon}\log^2\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right)\right)$$
bounds $V_1^{\star}(s_1) - V_1^{\widehat{\pi}}(s_1)$ by $\epsilon$.

F Proof of Theorem 4.3

To learn an $\epsilon$-optimal policy, by Corollary 4.2, with probability at least $1 - \delta$, the total number of episodes collected by the concurrent LSVI-UCB++ with $M$ agents is at most
$$K_{\epsilon} \le O\left(\frac{d^2H^4}{\Delta_{\min}\delta\epsilon}\log^3\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right) + \frac{d^6H^7}{\delta\epsilon}\log^2\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right)\right). \tag{34}$$
Suppose that during these $K_{\epsilon}$ episodes the policy is switched $N_{\epsilon}$ times, and let $e_t$ denote the number of episodes executed between the $(t-1)$-th and $t$-th policy switches. Then the number of concurrent rounds required for this segment is $\lceil e_t/M\rceil$. Letting $R$ denote the total number of concurrent rounds, we have
$$R = \sum_{t=1}^{N_{\epsilon}}\left\lceil\frac{e_t}{M}\right\rceil \le \sum_{t=1}^{N_{\epsilon}}\left(1 + \frac{e_t}{M}\right) \le N_{\epsilon} + \frac{K_{\epsilon}}{M}. \tag{35}$$
By Theorem 5.1 of He et al. (2023), in $K_{\epsilon}$ episodes the single-agent LSVI-UCB++ algorithm performs at most
$$N_{\epsilon} \le O\left(dH\log(1 + KH^2)\right) \tag{36}$$
policy switches. Plugging Equations (34) and (36) into Equation (35), we obtain
$$R \le O\left(dH\log(1 + KH^2) + \frac{d^2H^4}{M\Delta_{\min}\delta\epsilon}\log^3\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right) + \frac{d^6H^7}{M\delta\epsilon}\log^2\left(\frac{dH}{\Delta_{\min}\delta\epsilon}\right)\right),$$
which completes the proof.
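The round-counting inequality (35) is the source of the linear speedup. The sketch below uses hypothetical segment lengths $e_t$ (arbitrary assumed values, not outputs of the algorithm) to verify $R = \sum_t \lceil e_t/M\rceil \le N_{\epsilon} + K_{\epsilon}/M$.

```python
import math
import random

# A minimal sketch of the round-counting bound in the proof of Theorem 4.3:
# R = sum_t ceil(e_t / M) <= N_eps + K_eps / M, since ceil(e/M) <= 1 + e/M.
random.seed(5)
M = 8                                                    # number of concurrent agents
segments = [random.randint(1, 500) for _ in range(30)]   # hypothetical e_t values
K_eps, N_eps = sum(segments), len(segments)
R = sum(math.ceil(e / M) for e in segments)              # each round runs M episodes
print(R, "<=", N_eps + K_eps / M)                        # the bound should hold
```

Since $N_{\epsilon}$ grows only logarithmically in $K$ by (36), the $K_{\epsilon}/M$ term dominates, so the number of interaction rounds shrinks almost linearly in the number of agents $M$.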
