Asymptotically Optimal Sequential Testing with Markovian Data



Alhad Sethi (1), Kavali Sofia Sagar (2), Shubhada Agrawal (1), Debabrota Basu (3), and P. N. Karthik (2)

(1) Indian Institute of Science, Bangalore. alhadsethi@iisc.ac.in, shubhada@iisc.ac.in
(2) Indian Institute of Technology, Hyderabad. ai24resch11003@iith.ac.in, pnkarthik@ai.iith.ac.in
(3) Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL. debabrota.basu@inria.fr

February 20, 2026

Abstract

We study one-sided and α-correct sequential hypothesis testing for data generated by an ergodic Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set P of stochastic matrices, and the alternative corresponds to a disjoint set Q. We establish a tight non-asymptotic, instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as α → 0. We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.

Contents

1 Introduction
  1.1 Contributions
  1.2 Related Works
2 Preliminaries: Primer on Markov Chains
3 Instance-dependent Lower Bound
4 Algorithm Design and Optimality
  4.1 Computational Tractability
  4.2 Extensions to Two-Sided Testing
5 Applications
  5.1 Testing Misspecification in MCMC Samplers
  5.2 Testing Linearity of MDPs
6 Discussion and Future Works
A Proofs of Results Appearing in Section 3
  A.1 Wald's Lemma for Markovian Data
  A.2 Bounding the Magnitude of a Solution to the Poisson Equation via Spectral Properties
  A.3 Proofs Related to the Lower Bound
  A.4 Suboptimality of Existing Lower Bounds
B Proofs of the Results Appearing in Section 4
  B.1 Good Event Analysis
  B.2 Concentration Analysis
  B.3 Construction of Mixture Martingale
  B.4 Continuity Properties
    B.4.1 Preliminaries
    B.4.2 Results
  B.5 Proof of Optimality
C Extension to Two-Sided Sequential Testing
D Computationally Tractable Lower Bound
E Proofs for Section 5
F Technical Lemmas
G Experiments
  G.1 Misspecification in MCMC
  G.2 Linearity Testing in MDPs
  G.3 Experiments on Synthetic Data from a Parametric Family
  G.4 Comparison with Baselines

1 Introduction

Hypothesis testing is a cornerstone of statistics and theoretical computer science: from data, one decides whether an unknown data-generating mechanism satisfies a prescribed property (Lehmann & Romano, 2005; Goldreich, 2017). Classical theory largely assumes i.i.d. samples, but many modern data streams are temporally dependent, making hypothesis testing under dependence both practically important and theoretically subtle (Phatarfod, 1965; Gyori & Paulin, 2015; Fauß et al., 2020). A widely used model of dependence is Markovianity, where the future is conditionally independent of the past given the present state (Bengio et al., 1999; Nagaraj et al., 2020). Markov dynamics arise in Markov Chain Monte Carlo (MCMC) (Roy, 2020), reinforcement learning and control (Sutton et al., 1998; Bertsekas, 2019; Beutler & Ross, 1985), and hidden Markov models (Rabiner & Juang, 2003). In these settings, the transition mechanism is typically unknown; theory often proceeds by imposing structural assumptions (e.g., Gaussianity or bilinearity) (Jin et al., 2020; Ouhamma et al., 2023), whose validity may be unclear. This motivates testing model classes under Markovian data (Natarajan, 2003; Fauß et al., 2020).

We study sequential hypothesis testing for finite-state Markov chains. Let [m] := {1, ..., m}, let ∆_m be the probability simplex in R^m, and let M be the set of m × m row-stochastic matrices. A time-homogeneous Markov chain is specified by (P, µ) with P ∈ M and µ ∈ ∆_m, generating X_1, X_2, ... via
P_(P,µ)[X_1 = x_1, ..., X_n = x_n] = µ(x_1) ∏_{t=2}^{n} P(x_{t−1}, x_t).
We assume sequential access to data: at time t we observe X_t. Given an unknown (P, µ), we test whether P lies in a prescribed null class P ⊂ M versus an alternative Q ⊂ M:
H_0: P ∈ P versus H_1: P ∈ Q,
a composite-versus-composite problem. We assume P ∩ Q = ∅ (and impose additional separation/identifiability conditions only when required for sharp characterizations).

We work in the one-sided, α-correct, power-one sequential framework (Darling & Robbins, 1967; Farrell, 1964; Robbins & Siegmund, 1974). For α ∈ (0, 1], an α-correct, power-one sequential test is a stopping time τ_α (rejecting H_0 upon stopping) such that, uniformly over µ ∈ ∆_m,
P_{P,µ}[τ_α < ∞] ≤ α for all P ∈ P, and P_{Q,µ}[τ_α < ∞] = 1 for all Q ∈ Q.
Suppressing µ for brevity, our objective is instance-dependent efficiency under the alternative:

1. For a fixed Q ∈ Q, what is inf E_Q[τ_α] over all α-correct, power-one tests?
2. Can we design a procedure that achieves this infimum to first order as α → 0, simultaneously for all Q ∈ Q?
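The data-generating model above is straightforward to simulate. The following minimal Python sketch draws a trajectory X_1, ..., X_n according to the product formula for P_(P,µ); the matrix P, the initial distribution mu, and the function name are illustrative and not part of the paper.

```python
import numpy as np

def sample_trajectory(P, mu, n, rng=None):
    """Draw X_1, ..., X_n from the chain (P, mu): X_1 ~ mu, X_t | X_{t-1}=x ~ P(x, .)."""
    rng = np.random.default_rng(rng)
    m = P.shape[0]
    traj = np.empty(n, dtype=int)
    traj[0] = rng.choice(m, p=mu)
    for t in range(1, n):
        traj[t] = rng.choice(m, p=P[traj[t - 1]])
    return traj

# Example: a 3-state chain observed one state at a time by the tester.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
mu = np.array([1.0, 0.0, 0.0])
print(sample_trajectory(P, mu, n=10, rng=0))
```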
1.1 Contributions

Our main contributions are as follows.

1. Non-asymptotic instance-dependent lower bound and asymptotically optimal test. We prove the first non-asymptotic, instance-dependent lower bound on E_Q[τ_α] for α-correct, power-one sequential tests under an unknown alternative Q ∈ Q (Theorem 3.3). The leading term is log(1/α) scaled by an information quantity D_M^inf(Q, P), an infimum of stationary-weighted KL divergences over P, plus an α-independent term depending on structural properties of Q. We then construct a sequential test that matches this bound to first order as α → 0 (Theorem 4.1).

2. Composite null is fundamentally harder than a known singleton null. Fields et al. (2025) study a singleton null versus a composite alternative: H_0: P = P_0 for a known P_0, versus H_1: P ∈ Q. Our setting is composite versus composite: even when the data are generated by a specific Q, the test must rule out every P ∈ P while maintaining uniform Type-I control. This necessity of certifying incompatibility with an entire null class (rather than a single known reference) drives both our information characterization (via an infimum over P) and the technical analysis.

3. Two technical tools. We state and prove two results that may be of independent interest: (a) a Pinsker-type inequality (Proposition 4.3) lower bounding a stationary-weighted divergence between two transition matrices by squared gaps between stationary means of suitable functions; and (b) a uniform control of solutions to the Poisson equation in terms of mixing properties (Proposition 3.1).

4. Applications. We instantiate our framework for (i) misspecification detection in MCMC, by testing consistency with a target stationary distribution (Section 5.1) and obtaining optimal detection guarantees (Corollary 5.1); and (ii) structural testing in RL, by sequentially verifying linear transition dynamics in MDPs (Section 5.2).

1.2 Related Works

Markovian sequential testing. Sequential testing dates back to Wald (1945) and the SPRT for two simple hypotheses under i.i.d. data. Work on sequential testing under Markovian dependence is comparatively limited, but includes early contributions such as Phatarfod (1965, 1971); Schmitz & Süselbeck (1983); Dimitriadis & Kazakos (2007); Kiefer & Sistla (2016), which are largely SPRT-type procedures tailored to simple hypotheses (singleton null and singleton alternative). Fauß et al. (2020) study sequential and fixed-sample testing for Markov processes from a minimax/robust perspective, deriving tests with optimal worst-case guarantees. Their objective is complementary to ours: we consider one-sided, α-correct, power-one tests and seek instance-dependent characterizations under the (unknown) alternative. The closest work to ours is Fields et al. (2025), which analyzes one-sided sequential tests for Markov chains based on the plug-in likelihood estimator of Takeuchi et al. (2013). Their setting differs from ours in two key respects: (i) they test a singleton null against its complement, whereas we treat composite null and composite alternative classes; and (ii) their analysis assumes uniformly bounded likelihood ratios, which we do not require. Moreover, the above works do not provide instance-dependent lower bounds on the expected stopping time. In contrast, we establish a non-asymptotic, instance-dependent lower bound and match it (to first order) with an explicit procedure, proving asymptotic optimality without boundedness assumptions.
Multi-armed bandits and RL. Best arm identification in Markovian bandits (Anantharam et al., 2003; Moulos, 2019; Karthik et al., 2024) is closely related: one-sided sequential testing can be viewed as a single-armed fixed-confidence identification problem, where one observes a single evolving process and stops with controlled error once sufficient evidence accumulates. Sharp fixed-confidence results for Markovian bandits, however, typically rely on strong parametric structure, most notably single-parameter exponential family (SPEF) assumptions on the transition models (or reward/transition parametrizations) (Anantharam et al., 2003; Moulos, 2019; Karthik et al., 2024). Even in such regimes, lower bounds are often asymptotic, and achievability arguments exploit the parametric form. While our setting admits a single-process interpretation, we do not require an SPEF assumption: we derive non-asymptotic, instance-dependent lower bounds and matching (first-order) achievability for substantially more general Markov dynamics. A related but distinct line concerns best policy identification (BPI) in MDPs (Al Marjani & Proutiere, 2021; Al Marjani et al., 2021; Wagenmaker et al., 2022; Tuynman et al., 2024), which provides non-asymptotic lower bounds for identifying an optimal policy under active data collection. Directly adapting these bounds to our testing problem can be loose; see Section A.4. Nevertheless, we leverage martingale constructions from this literature in our test design and analysis. Closest in spirit to our work is policy testing in MDPs. Ariu et al. (2025) derive asymptotic lower bounds and propose procedures that match them asymptotically using martingale techniques. Although BPI-style non-asymptotic lower bounds apply in principle, they need not be tight for policy testing. Our instance-dependent, non-asymptotic methodology can be adapted to obtain sharper lower bounds in this setting as well, beyond parametric SPEF-type regimes.

Paper organization. Section 2 introduces background and notation. Section 3 establishes the instance-dependent lower bound. Section 4 presents an asymptotically optimal test with matching sample-complexity guarantees. Section 5 contains applications: Section 5.1 (MCMC misspecification testing) and Section 5.2 (structural testing in MDPs). Proofs are deferred to the appendices.

2 Preliminaries: Primer on Markov Chains

This section collects the notation and standard background on finite-state Markov chains used throughout the paper.

Notation. We denote the natural numbers, real numbers, and strictly positive real numbers by N, R, and R_{++}, respectively. For m ∈ N, let [m] := {1, ..., m}. We write 1_{n×m} for the all-ones matrix (dropping subscripts when dimensions are clear). Any function f: [m] → R is identified with the vector f ∈ R^m, and we use these representations interchangeably. For x ∈ R, define x_+ := max{x, 0}. Let ∆_m be the probability simplex in R^m. For p, q ∈ ∆_m, we write p ≪ q if q(i) = 0 implies p(i) = 0 for all i ∈ [m]. For a random variable X, σ(X) denotes the σ-algebra it generates. For p ∈ ∆_m and f: [m] → R, define E_p[f] := Σ_{i∈[m]} f(i) p(i), and let I denote the indicator function.
Transition matrices and norms. A transition matrix on m states is a nonnegative row-stochastic matrix P ∈ R^{m×m}, satisfying P(i, j) ≥ 0 for all i, j ∈ [m] and P1 = 1. Let M denote the set of all such matrices, and write P(i, ·) ∈ ∆_m for the i-th row of P. For P, Q ∈ M, we say Q is (row-wise) absolutely continuous with respect to P, denoted Q ≪ P, if Q(i, ·) ≪ P(i, ·) for all i ∈ [m]. We use E_Q[·] to denote expectations when the chain evolves according to transition matrix Q (with the initial distribution clear from context). We equip R^{m×m} with the ‖·‖_{1,∞} norm (Wolfer & Kontorovich, 2019), defined as
‖A‖_{1,∞} := max_{i∈[m]} ‖A(i, ·)‖_1, for A ∈ R^{m×m}.
For A ∈ M one has ‖A‖_{1,∞} = 1, but the norm is most useful through the metric it induces on M:
‖P − Q‖_{1,∞} = 2 max_{i∈[m]} ‖P(i, ·) − Q(i, ·)‖_TV,
i.e., twice the worst-case total variation distance between corresponding rows.

Ergodic Markov chains. Let (X_t)_{t≥0} be a Markov chain on [m] with transition matrix P ∈ M and initial distribution µ^⊤ ∈ ∆_m. A distribution π^⊤ ∈ ∆_m is stationary for P if πP = π. We call P (and the induced chain) ergodic if it is irreducible and aperiodic. In this case, P admits a unique stationary distribution π^⊤ with π_* := min_{i∈[m]} π(i) > 0, and π is also the limiting visitation distribution: lim_{n→∞} ‖µP^n − π‖_TV = 0.

Assumption 2.1. The data-generating Markov chain is ergodic.

The above assumption is not too restrictive, as the set of ergodic transition matrices is dense in M. Concretely, for any P ∈ M and ε > 0, the matrix
P_ε := (1 − ε) P + (ε/m) 1_{m×m}
is ergodic and can be made arbitrarily close to P as ε → 0; see Levin et al. (2017, Chapter 1).

Poisson equation. A central tool in Markov chain analysis is the Poisson equation (PE). For an ergodic P ∈ M with stationary distribution π^⊤ and a function f: [m] → R, the PE for (P, f) (in the unknown ω) is
(I − P) ω = f − (πf) 1,   (1)
where πf := E_π[f] is a scalar and 1 is the all-ones column vector. Equation (1) always admits solutions; one convenient choice is
ω_{P,f} := Σ_{n=0}^{∞} (P^n − 1π) f,   (2)
where 1π denotes the rank-one matrix with every row equal to π. We refer to Douc et al. (2018, Section 21.2) for further background. The PE is useful because it enables a standard decomposition of additive functionals into a martingale-difference term plus a remainder that is typically negligible under mixing.

Spectral properties. Spectral information about P quantifies its long-run behavior and mixing. Fix an ergodic P ∈ M with stationary distribution π^⊤. The chain is reversible if it satisfies detailed balance: π(i)P(i, j) = π(j)P(j, i) for all i, j ∈ [m]. For reversible P, all eigenvalues lie in [−1, 1] and can be ordered as 1 = λ_1 ≥ λ_2 ≥ ··· ≥ λ_m (Levin et al., 2017, Lemma 12.2). The (usual) spectral gap is then γ(P) := 1 − λ_2. For non-reversible chains, several notions of spectral gap exist (Fill, 1991; Paulin, 2015; Chatterjee, 2025). We follow Paulin (2015) and work with the pseudo-spectral gap.

Definition 2.2 (Pseudo-spectral gap, Paulin (2015)). For an ergodic P ∈ M with stationary distribution π^⊤, let P* be its time reversal, defined via P*(i, j) := P(j, i) π(j) / π(i). The pseudo-spectral gap of P is
γ_ps(P) := max_{k≥1} γ((P*)^k P^k) / k.

One has γ_ps(P) ∈ (0, 1] for ergodic P, and larger γ_ps(P) corresponds to faster mixing (Paulin, 2015, Proposition 3.4). In our development, γ_ps enters the lower-bound analysis; the algorithm itself is largely insensitive to this particular choice (see Remark A.5).
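For small chains, the quantities of this section can be computed directly. The sketch below is ours (not from the paper): it computes a stationary distribution, the Poisson solution (2) via the fundamental matrix (which coincides with the series in (2)), and the pseudo-spectral gap of Definition 2.2 with an illustrative truncation level k_max.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = np.abs(pi)
    return pi / pi.sum()

def poisson_solution(P, f):
    """Solve (I - P) w = f - (pi f) 1 using the fundamental matrix (I - P + 1 pi)^{-1};
    for ergodic P this equals the series sum_{n>=0} (P^n - 1 pi) f of Eq. (2)."""
    m = P.shape[0]
    pi = stationary_distribution(P)
    Z = np.linalg.inv(np.eye(m) - P + np.outer(np.ones(m), pi))
    return Z @ (f - (pi @ f) * np.ones(m))

def pseudo_spectral_gap(P, k_max=50):
    """gamma_ps(P) = max_k gamma((P*)^k P^k) / k; the maximum is truncated at k_max
    purely for illustration."""
    pi = stationary_distribution(P)
    P_star = np.diag(1.0 / pi) @ P.T @ np.diag(pi)   # time reversal of P
    best = 0.0
    for k in range(1, k_max + 1):
        M = np.linalg.matrix_power(P_star, k) @ np.linalg.matrix_power(P, k)
        lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
        best = max(best, (1.0 - lam[1]) / k)
    return best
```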
3 Instance-dependent Lower Bound

In this section, we derive a non-asymptotic lower bound on the expected stopping time of any α-correct, power-one sequential test for Markov chains. The bound holds for every α ∈ (0, 1). We first introduce two quantities that appear in the statement.

Proposition 3.1. For an ergodic P ∈ M and f ∈ R^m, let ω_{P,f} ∈ R^m be the solution (2) to the PE for (P, f). Then ‖ω_{P,f}‖_∞ ≤ C_P ‖f‖_∞, where, writing γ_ps = γ_ps(P), the constant C_P depends only on P and is given by
C_P := (1 − γ_ps)^{−1/(2γ_ps)} / ( √(π_*) (1 − √(1 − γ_ps)) ) if γ_ps ∈ (0, 1), and C_P := 2 if γ_ps = 1.

Interpretation. The quantity C_P controls the sensitivity of the Poisson solution: it upper bounds the magnitude of ω_{P,f} relative to that of f, in terms of the mixing properties of P (captured by γ_ps and π_*). When γ_ps = 1, the chain behaves as in the i.i.d. case: P has identical rows equal to its stationary distribution, so P^n = 1π for all n ≥ 1. Substituting into (2) yields ω_{P,f} = f − (πf)1 and hence C_P = 2. For γ_ps ∈ (0, 1), the proof combines the representation (2) with bounds on ‖P^n − 1π‖ in terms of γ_ps from Paulin (2015). The complete proof is given in Section A.2.

Next, we define the information-theoretic quantity that governs the hardness of testing.

Definition 3.2 (Stationary-weighted KL divergence). Let P, Q ∈ M be such that Q ≪ P and Q is ergodic, and let π^⊤ ∈ R^m be the stationary distribution of Q. Define
D_M(Q, P) := Σ_{i∈[m]} π_i D_KL(Q(i, ·), P(i, ·)), where, for q, p ∈ ∆_m, D_KL(q, p) := Σ_{j∈[m]} q_j log(q_j / p_j).

With the above notation in place, we now state the first main result of our paper: an instance-dependent lower bound on the expected stopping time.

Theorem 3.3 (Lower bound). Fix α ∈ (0, 1) and an ergodic Q ∈ Q with stationary distribution π^⊤ ∈ R^m. Let τ_α be the stopping time of any α-correct, power-one sequential test. Then, for every P ∈ P,
E_Q[τ_α] ≥ log(1/α) / D_M(Q, P) − E_Q[ ω_{Q,f_P}(X_0) − ω_{Q,f_P}(X_{τ_α}) ] / D_M(Q, P),
where f_P(i) := D_KL(Q(i, ·), P(i, ·)) for all i ∈ [m], and ω_{Q,f_P} denotes the solution (2) for (Q, f_P). Moreover,
E_Q[τ_α] ≥ ( log(1/α) / D_M^inf(Q, P) − 2 C_Q / (min_i π_i) )_+,   (3)
where D_M^inf(Q, P) := inf_{P∈P} D_M(Q, P) and C_Q is the constant in Proposition 3.1.

Proof sketch. We apply a Markov-chain version of Wald's lemma (Moustakides, 1999) to express the expected log-likelihood ratio between Q and P as E_Q[τ_α] D_M(Q, P) plus an additive term involving a Poisson solution for (Q, f_P). We then invoke the data-processing inequality to relate the log-likelihood ratio to the test error, which yields the first inequality. To obtain (3), we optimize over P ∈ P and upper bound the Poisson correction using Proposition 3.1. Full details appear in Section A.

Taking α → 0 in (3) yields the asymptotic relation
lim inf_{α→0} E_Q[τ_α] / log(1/α) ≥ 1 / D_M^inf(Q, P)
for every α-correct, power-one test. In particular, D_M^inf(Q, P) is the fundamental instance-dependent hardness of certifying that the data-generating transition matrix is Q (against P).
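Assuming the helper functions from the preceding sketch, the quantities in Theorem 3.3 can be evaluated numerically when the null class is a finite collection of candidate matrices; that finiteness is a simplification made only for illustration, since the paper treats general compact P.

```python
import numpy as np
# Reuses stationary_distribution, poisson_solution, and pseudo_spectral_gap
# from the sketch in Section 2.

def markov_kl(Q, P):
    """Stationary-weighted KL divergence D_M(Q, P) of Definition 3.2 (requires Q << P)."""
    pi = stationary_distribution(Q)
    with np.errstate(divide="ignore", invalid="ignore"):
        rows = np.where(Q > 0, Q * np.log(Q / P), 0.0)
    return float(pi @ rows.sum(axis=1))

def poisson_constant(Q):
    """The constant C_Q of Proposition 3.1, computed from gamma_ps(Q) and pi_*."""
    g = pseudo_spectral_gap(Q)
    pi_star = stationary_distribution(Q).min()
    if g >= 1.0:
        return 2.0
    return (1.0 - g) ** (-1.0 / (2.0 * g)) / (np.sqrt(pi_star) * (1.0 - np.sqrt(1.0 - g)))

def stopping_time_lower_bound(Q, null_set, alpha):
    """Right-hand side of Eq. (3) when the null is a finite list of candidate matrices."""
    d_inf = min(markov_kl(Q, P) for P in null_set)
    pi_star = stationary_distribution(Q).min()
    return max(np.log(1.0 / alpha) / d_inf - 2.0 * poisson_constant(Q) / pi_star, 0.0)
```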
(a) Relation to non-asymptotic RL lower bounds. Non-asymptotic lower bounds in best policy identification for MDPs (Al Marjani & Proutiere, 2021; Al Marjani et al., 2021; Wagenmaker et al., 2022; Tuynman et al., 2024) often proceed by (i) writing the expected log-likelihood ratio as a weighted sum of per-state divergences f_P(i), (ii) applying data processing, and (iii) optimizing over the weights. This yields denominators of the form sup_{w∈∆_m} inf_{P∈P} w^⊤ f_P. While valid, this relaxation can be loose in our setting because we do not control state visitation: the weights are fixed by the instance Q through its stationary distribution. In particular, even in the simple case P = {P}, the optimizer over w places all mass on the state with the largest KL term, which is not achievable when the visitation proportions are dictated by Q.

(b) Connection to the i.i.d. case. In the i.i.d. setting, Agrawal & Ramdas (2025) characterize hardness via the KL projection of Q onto P. Theorem 3.3 shows that the analogous quantity in the Markovian setting is the stationary-weighted projection D_M^inf(Q, P). A key technical difference is that the Poisson correction term depends on P through f_P. Therefore, naively taking a supremum over P (as one can in the i.i.d. analysis) can make the correction term arbitrarily large and destroy the bound. We avoid this by controlling the correction uniformly using Proposition 3.1, which depends only on the alternative Q.

Remark 3.4 (Recovery of i.i.d. bounds). Consider the following testing problem: given i.i.d. samples from some distribution in ∆_m, we want to test whether the samples are drawn from q ∈ ∆_m or from some distribution in a disjoint set P_iid ⊂ ∆_m. This is a special case of the Markovian problem: it corresponds to the setting where Q ∈ M has identical rows equal to q, and P ⊂ M consists of matrices with identical rows equal to some p ∈ P_iid. In this case, the per-state divergence vector f_P (defined in Theorem 3.3) is constant. Consequently, the zero vector satisfies the Poisson equation for (Q, f_P), allowing us to take C_Q = 0. We thus recover the standard i.i.d. bounds (Agrawal & Ramdas, 2025, Theorem 3.1); a small numerical check appears after remark (d) below.

(c) Implications for best arm identification in Markovian bandits. A one-sided sequential test can be viewed as a single-armed Markovian bandit instance (Anantharam et al., 2003; Moulos, 2019; Karthik et al., 2024). Existing lower bounds in this literature are typically asymptotic (e.g., as α → 0) and/or rely on single-parameter exponential family (SPEF) assumptions. Our proof technique suggests a route to non-asymptotic, instance-dependent lower bounds without the SPEF assumption.

(d) Implications for policy testing. In policy testing, one collects Markovian trajectories under a fixed policy and tests whether the expected cumulative reward exceeds a threshold. Our argument can be adapted to yield non-asymptotic lower bounds for this problem, complementing the currently available asymptotic results (Ariu et al., 2025).
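As a sanity check of Remark 3.4, the snippet below (reusing markov_kl and poisson_solution from the earlier sketches; the specific q and p are arbitrary) embeds an i.i.d. problem into the Markovian framework and verifies that the stationary-weighted divergence collapses to the ordinary KL divergence and that the Poisson correction vanishes.

```python
import numpy as np
# Reuses markov_kl and poisson_solution from the earlier sketches.

def iid_embedding(p):
    """Embed a distribution p on [m] as the transition matrix with identical rows p."""
    return np.tile(p, (len(p), 1))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
Q, P = iid_embedding(q), iid_embedding(p)

# The stationary-weighted divergence reduces to KL(q, p) ...
print(markov_kl(Q, P), np.sum(q * np.log(q / p)))
# ... and f_P is constant over states, so the (centered) Poisson solution is zero
# and the correction term in Theorem 3.3 vanishes, as claimed in Remark 3.4.
f_P = np.sum(np.where(Q > 0, Q * np.log(Q / P), 0.0), axis=1)
print(np.allclose(poisson_solution(Q, f_P), 0.0))
```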
4 Algorithm Design and Optimality

We now present a sequential procedure for testing a compact null hypothesis set P against an alternative hypothesis set Q, with P ∩ Q = ∅. Unlike much of the prior literature, which typically assumes that at least one of P, Q is a singleton (Phatarfod, 1965; Fields et al., 2025), our framework accommodates composite structure for both sets.

Our method, outlined in Algorithm 1, estimates the transition matrix from observed data and employs a martingale-based test statistic. In brief, it constructs an empirical transition kernel Q̂_t from state-transition counts, computes the statistic L_t (Line 16), and stops once L_t exceeds a state-visitation-count-dependent boundary β_t (Line 17), at which point it rejects the null. We establish the asymptotic optimality of Algorithm 1.

Theorem 4.1. For any α ∈ (0, 1), the test (Algorithm 1) is α-correct. Moreover, for any Q ∈ Q,
lim sup_{α→0} E_Q[τ_α] / log(1/α) ≤ 1 / D_M^inf(Q, P).   (4)

Proof sketch. We establish α-correctness by constructing a nonnegative supermartingale and applying Ville's inequality, which yields a uniform bound (over P ∈ P) on the probability of stopping under the null. To bound the sample complexity, we define a good event on which the empirical transition matrix Q̂_t and the visitation frequencies N_x(t)/t concentrate around the true transition kernel Q and its stationary distribution π_Q, respectively. Using concentration inequalities for Markov chains (Wolfer & Kontorovich, 2019; Paulin, 2015), we show that the complementary bad event has exponentially decaying probability and, in particular, is summable over time. We then decompose E_Q[τ_α] into contributions from the good and bad events. On the good event, τ_α is deterministically bounded by a term that scales as log(1/α); since the bad-event probabilities are summable, their total contribution to the expectation is finite. Finally, after establishing lower semicontinuity of a suitable functional under the topology induced by the ‖·‖_{1,∞} norm, taking limits yields the stated asymptotic bound. See Section B for the detailed proof.

Remark 4.2. The (nonnegative) process M_t := e^{L_t − (m−1)ψ_t} is upper bounded by a nonnegative supermartingale (see Appendix B.3, Eq. (22)), and hence is an e-process. We refer the reader to Ramdas & Wang (2025, §7) for a definition of an e-process.

Algorithm 1: Sequential Markov Chain Test
Require: State space [m], (null) set P, α ∈ (0, 1).
1: Initialize: t ← 0; observe initial state X_0.
2: Initialize counts: N_x ← 0, N_{x,y} ← 0 for all x, y ∈ [m].
3: loop
4:   t ← t + 1
5:   Observe next state X_t.
6:   u ← X_{t−1}, v ← X_t, N_u ← N_u + 1, N_{u,v} ← N_{u,v} + 1
7:   for x ∈ [m] do
8:     if N_x > 0 then
9:       Q̂_t(x, ·) ← [N_{x,1}/N_x, ..., N_{x,m}/N_x]
10:    else
11:      Q̂_t(x, ·) ← [1/m, ..., 1/m]
12:    end if
13:  end for
14:  ψ_t ← Σ_{x∈[m]} log( e (1 + N_x/(m−1)) )
15:  β_t ← log(1/α) + (m−1) ψ_t
16:  L_t ← inf_{P∈P} Σ_{x: N_x>0} N_x D_KL( Q̂_t(x, ·), P(x, ·) )
17:  if L_t ≥ β_t then
18:    Stop (reject P) and set τ_α = t
19:  end if
20: end loop
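A minimal Python rendering of Algorithm 1 follows. For concreteness, the inner infimum over P is taken over a finite list of candidate matrices; as discussed in Section 4.1, a convex null set would instead call a convex solver at Line 16. The streaming interface and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def kl_row(q, p):
    """KL divergence between two rows of transition matrices, with 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def sequential_markov_test(stream, m, null_set, alpha):
    """Sketch of Algorithm 1 for a finite null set `null_set` (list of m x m matrices).
    `stream` yields states X_0, X_1, ... in {0, ..., m-1}; returns the rejection time
    tau_alpha, or None if the stream ends before the test stops."""
    it = iter(stream)
    x_prev = next(it)                    # X_0
    N = np.zeros(m)                      # state visit counts N_x
    C = np.zeros((m, m))                 # transition counts N_{x,y}
    for t, x in enumerate(it, start=1):
        N[x_prev] += 1
        C[x_prev, x] += 1
        x_prev = x
        # Empirical kernel (uniform rows for unvisited states), as in Lines 7-13.
        Q_hat = np.where(N[:, None] > 0, C / np.maximum(N, 1)[:, None], 1.0 / m)
        psi_t = np.sum(np.log(np.e * (1.0 + N / (m - 1))))
        beta_t = np.log(1.0 / alpha) + (m - 1) * psi_t
        L_t = min(sum(N[s] * kl_row(Q_hat[s], P[s]) for s in range(m) if N[s] > 0)
                  for P in null_set)
        if L_t >= beta_t:
            return t                     # reject the null at time t
    return None
```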
4.1 Computational Tractability

Since the test statistic L_t optimizes a convex objective over P ∈ P, when P is convex we can compute L_t efficiently using standard convex-optimization tools. In Section 5, we illustrate this in two instances, testing misspecification of MCMC samplers and testing linearity of transitions in Markov Decision Processes (MDPs), where the null set is convex and the statistic can be evaluated efficiently. In general, our setup allows the null set P to be arbitrarily non-convex. Consequently, designing computationally efficient implementations requires exploiting additional structure in P (cf. Al Marjani & Proutiere (2021); Ariu et al. (2025)). When P is a finite union of convex sets, one can solve the convex program over each component and then take the minimum (Carlsson et al., 2024; Das et al., 2025).

For computational efficiency, recent work has also explored Transformer-based surrogates for sequential-testing and pure-exploration style problems (Russo et al., 2025). While promising empirically, such approaches do not, in general, come with (a) rigorous, user-prescribed α-type guarantees of the kind required here, and (b) a clear mechanism that remains well calibrated and scalable in the fixed-confidence regime α → 0. We address both issues in one step via the next proposition, which provides a principled Pinsker-type lower bound on our statistic: thresholding this lower bound preserves α-correctness by construction, while yielding a computationally tractable alternative. The following result may be of independent interest.

Proposition 4.3. Let P, Q ∈ M be ergodic with stationary distributions π_P, π_Q, respectively. For any g: [m] → R that is not constant over states,
D_M(Q, P) ≥ ( E_{π_Q}[g] − E_{π_P}[g] )² / ( 2 ‖ω_{P,g}‖_∞² ),   (5)
where ω_{P,g} is a solution to the PE for the pair (P, g).

If g is constant, the difference in expectations is zero, and we interpret the resulting 0/0 as 0 by convention, so the bound is trivial in that case. A similar approximation (lower bound) for the corresponding complexity term in the i.i.d. setting was proven in Agrawal et al. (2021, Section 3.4). However, proving it in the Markovian setting is substantially more delicate and requires new ideas. The main technical challenge is to pass from a difference in stationary expectations to a sum of row-wise divergences. We overcome this by combining the Poisson equation with the variational characterization of total variation distance, which allows us to rewrite the stationary-expectation gap as a stationary-weighted sum of row-wise expectation gaps of the Poisson solution. Applying Pinsker's inequality and Jensen's inequality then yields the stated bound. A complete proof is provided in Section D.

A computationally tractable surrogate statistic. Proposition 4.3 motivates a natural, computationally tractable surrogate for L_t, obtained by lower bounding the stationary-distribution-weighted KL divergence (Definition 3.2). Concretely, we replace the unknown kernel Q in (5) by its empirical estimate Q̂_t, maximize the resulting bound over the choice of test function g for each fixed P ∈ P, and then take the infimum over P ∈ P. This leads to the statistic
L̄_t := inf_{P∈P} sup_{g:[m]→R} ( E_{π_{Q̂_t}}[g] − E_{π_P}[g] )² / ( 2 ‖ω_{P,g}‖_∞² )   (6)
     = inf_{P∈P} ( E_{π_{Q̂_t}}[g*] )² / ( 2 ‖ω*(η*)‖_∞² ),   (7)
where π_{Q̂_t} denotes a stationary distribution of Q̂_t (whenever it exists), ω_{P,g} denotes the solution to the PE for (P, g) as in (2), and g* = (I − P) ω*(η*). Here, η* is the median of the random variable that takes value (π_{Q̂_t}(I − P))(i) / π_P(i) with probability π_P(i), and ω*(η*) is defined via
ω*_i(η*) = +1 if (π_{Q̂_t}(I − P))(i) > η* π_P(i); ω*_i(η*) = −1 if (π_{Q̂_t}(I − P))(i) < η* π_P(i); and ω*_i(η*) = t_i ∈ [−1, 1] if (π_{Q̂_t}(I − P))(i) = η* π_P(i),
with the t_i ∈ [−1, 1] chosen to satisfy the constraint ⟨ω*(η*), π_P⟩ = 0. Equivalently, for each candidate null model P, L̄_t selects the function g* that maximizes the normalized squared discrepancy between stationary expectations under Q̂_t and P, and then reports the least favorable value over P ∈ P. The inner supremum in (6) can be solved in closed form to give (7) (see Proposition D.1). Note that g* implicitly depends on P.
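The next snippet numerically illustrates Proposition 4.3: for a fixed test function g it evaluates the right-hand side of (5) and compares it against D_M(Q, P). It reuses stationary_distribution, poisson_solution, and markov_kl from the earlier sketches; the random kernels and g are purely illustrative.

```python
import numpy as np
# Reuses stationary_distribution, poisson_solution, and markov_kl from earlier sketches.

def pinsker_type_bound(Q, P, g):
    """Right-hand side of Eq. (5) for a given non-constant test function g."""
    pi_Q, pi_P = stationary_distribution(Q), stationary_distribution(P)
    w = poisson_solution(P, g)                 # Poisson solution for the pair (P, g)
    gap = pi_Q @ g - pi_P @ g                  # difference of stationary means of g
    return gap ** 2 / (2.0 * np.max(np.abs(w)) ** 2)

# Two arbitrary ergodic kernels: per Proposition 4.3 the bound should not exceed D_M(Q, P).
rng = np.random.default_rng(0)
Q = rng.random((4, 4)); Q /= Q.sum(axis=1, keepdims=True)
P = rng.random((4, 4)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal(4)
print(pinsker_type_bound(Q, P, g), "<=", markov_kl(Q, P))
```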
The proof of the simplification of the inner supremum proceeds by reparameterizing the optimization variable using the Poisson equation, which transforms the problem into a constrained linear program. While the natural step would be to analyze the dual of this linear program, we observe that the problem exhibits a duality gap. However, we can rewrite the linear program in an equivalent formulation resembling the dual, which we solve by identifying it as the minimization of an expected absolute loss with respect to a certain random variable. We refer the reader to Appendix D for a detailed proof.

Thresholding L̄_t therefore yields a computationally efficient alternative to thresholding L_t. Since L̄_t is constructed via a lower bound on the Markov divergence, the resulting test is conservative. In particular, thresholding L̄_t preserves α-correctness but may increase the expected stopping time; the magnitude of this increase depends on the tightness of the lower bound. A precise characterization of the resulting sample-complexity gap is beyond the scope of this work.

In addition to enabling a computationally tractable test, Proposition 4.3 also improves interpretability. While Theorem 4.1 characterizes the asymptotic sample complexity through inf_{P∈P} D_M(Q, P), this quantity can be abstract in applications. In Section 5.1, we give an example where Proposition 4.3 yields a natural lower bound in terms of a squared sub-optimality gap.

4.2 Extensions to Two-Sided Testing

In this section, we show how the one-sided (power-one, α-correct) sequential tests studied throughout this work can be combined to construct a two-sided sequential test. We begin by formalizing two-sided testing between P (null) and Q (alternative). For α, β > 0, a level-(α, β) sequential test consists of a stopping time τ_{α,β} and a decision rule D ∈ {0, 1}, where D = 0 and D = 1 correspond to selecting the null and the alternative, respectively. Recall that a one-sided test is specified solely by a stopping time τ_α, with rejection of the null upon stopping. For any initial distribution µ ∈ ∆_m, a two-sided test (D, τ_{α,β}) satisfies
P_{P,µ}[ D(τ_{α,β}) = 1 ] ≤ α for all P ∈ P, and P_{Q,µ}[ D(τ_{α,β}) = 0 ] ≤ β for all Q ∈ Q.
In the following theorem, we present a lower bound and a test that achieves it asymptotically as (α, β) → 0.

Theorem 4.4. For any ergodic Q ∈ Q, P ∈ P, and α, β ∈ (0, 0.5), any level-(α, β) two-sided test with stopping time τ_{α,β} such that E[τ_{α,β}] < ∞ under both the null and the alternative satisfies
E_Q[τ_{α,β}] ≥ ( d(β, 1 − α) / D_M^inf(Q, P) − 2 C_Q / π*_Q )_+ and E_P[τ_{α,β}] ≥ ( d(α, 1 − β) / D_M^inf(P, Q) − 2 C_P / π*_P )_+,
where π*_Q = min_{i∈[m]} π_Q(i), π*_P = min_{i∈[m]} π_P(i), and C_P, C_Q are as defined in Proposition 3.1. Furthermore, there exists a level-(α, β) two-sided test for compact P, Q ⊂ M that is asymptotically optimal:
lim_{α,β→0} E_Q[τ_{α,β}] / log(1/α) = 1 / D_M^inf(Q, P) for all ergodic Q ∈ Q, and lim_{α,β→0} E_P[τ_{α,β}] / log(1/β) = 1 / D_M^inf(P, Q) for all ergodic P ∈ P.

We establish the non-asymptotic lower bound on the expected stopping time using arguments analogous to those employed for one-sided tests. A two-sided sequential test can be constructed via a simple and intuitive approach: running two one-sided tests in parallel, one testing P against Q and the other testing Q against P. This idea has previously been explored in the context of multiple hypothesis testing for i.i.d. data (Lorden, 1976, 1977; Baum & Veeravalli, 2002). We show that this technique yields an asymptotically optimal test for Markovian data. We refer the reader to Section C for a detailed proof of Theorem 4.4.
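The parallel composition above can be simulated offline on a recorded trajectory by evaluating both one-sided tests and reporting whichever stops first. The sketch below reuses the hypothetical sequential_markov_test function from the Algorithm 1 sketch and assumes both classes are given as finite lists of matrices; it is a simulation device, not the streaming implementation analyzed in Section C.

```python
def two_sided_test(trajectory, m, null_set, alt_set, alpha, beta):
    """Offline simulation of the two-sided test of Theorem 4.4 on a recorded trajectory.
    Returns (decision, stopping_time): decision 1 rejects the null class P at level alpha,
    decision 0 rejects the alternative class Q at level beta."""
    t_reject_null = sequential_markov_test(iter(trajectory), m, null_set, alpha)
    t_reject_alt = sequential_markov_test(iter(trajectory), m, alt_set, beta)
    candidates = [(t, 1) for t in [t_reject_null] if t is not None] + \
                 [(t, 0) for t in [t_reject_alt] if t is not None]
    if not candidates:
        return None, None        # neither one-sided test stops within the recorded data
    t_stop, decision = min(candidates)
    return decision, t_stop
```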
5 Applications

We now present two concrete applications of our framework: (i) testing misspecification in MCMC samplers, and (ii) testing linear transition dynamics in MDPs.

5.1 Testing Misspecification in MCMC Samplers

Let π^⊤ ∈ ∆_m be a known, strictly positive target distribution. In MCMC, one constructs an ergodic Markov chain with transition matrix P satisfying the stationarity condition πP = π. Expectations under π are then estimated via time averages of a function f along the trajectory. In practice, especially with approximate kernels or black-box samplers, it is often necessary to validate whether the observed path X_1, X_2, ... could plausibly have been generated by a Markov chain whose stationary distribution is π^⊤. If the underlying kernel violates πP = π, the resulting estimates are asymptotically biased. Consequently, it is important to detect misspecification quickly whenever it occurs. Related work on testing MCMC procedures typically studies threshold tests for stationary expectations of a fixed function (Gyori & Paulin, 2015; Rabinovich et al., 2020). Our goal is different: we test misspecification of the transition structure without committing to a specific test function. We also emphasize that our objective is not to test convergence diagnostics for MCMC (Roy, 2020), but rather to test whether the observed samples could converge to the desired stationary distribution.

Problem formulation. Define the set of valid transition matrices P_π := {P ∈ M : πP = π}. We test P = P_π against Q = P_π^∁. The one-sided nature of our framework is well suited to this setting. Under the null, continued sampling is beneficial, since it reduces the variance of the MCMC estimator. Our test guarantees that, when there is no misspecification, the chain is not stopped with probability at least 1 − α, allowing estimation to proceed in parallel with testing. Under misspecification, further sampling only increases the computational cost of producing biased samples; in this case, our test stops with probability 1, preventing continued waste of computation. To instantiate the test, we require compactness of P_π, which we establish in Lemma E.1 in Section E. This yields the following corollary to Theorem 4.1.

Corollary 5.1. For any ergodic Q ∈ P_π^∁ with stationary distribution π_Q ≠ π, the stopping time τ_α satisfies
lim sup_{α→0} E_Q[τ_α] / log(1/α) ≤ 1 / D_M^inf(Q, P_π).

Interpretation in terms of estimation bias. As discussed in Section 4.1, Proposition 4.3 can be used to obtain interpretable lower bounds on D_M(Q, P). In this application, for any function of interest g, the quantity ∆ := | E_{π_Q}[g] − E_π[g] | is precisely the asymptotic bias of the MCMC estimator under Q. Applying Proposition 4.3 shows that D_M(Q, P) is lower bounded by a term of order ∆², and hence the dominant scaling of the complexity term behaves as 1/D_M(Q, P) = O(1/∆²).
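Because P_π is a convex polytope, the inner infimum defining L_t can be evaluated as a convex program. The sketch below uses CVXPY (cited among the references) and assumes an exponential-cone-capable solver is installed; the function name and interface are ours, not the paper's.

```python
import numpy as np
import cvxpy as cp

def mcmc_misspecification_stat(N, Q_hat, pi):
    """Evaluate L_t = inf_{P : pi P = pi} sum_x N_x KL(Q_hat(x,.), P(x,.)) over the convex
    null set P_pi of Section 5.1. Requires a solver with exponential-cone support."""
    m = len(pi)
    P = cp.Variable((m, m), nonneg=True)
    # Maximize the cross-entropy part sum_x N_x sum_j Q_hat(x,j) log P(x,j) (concave in P).
    terms = [N[x] * Q_hat[x, j] * cp.log(P[x, j])
             for x in range(m) for j in range(m)
             if N[x] > 0 and Q_hat[x, j] > 0]
    constraints = [P @ np.ones(m) == 1,      # row-stochasticity
                   pi @ P == pi]             # stationarity of the target pi
    prob = cp.Problem(cp.Maximize(sum(terms)), constraints)
    prob.solve()
    # Add back the (fixed) term sum_x N_x sum_j Q_hat(x,j) log Q_hat(x,j).
    q_log_q = sum(N[x] * Q_hat[x, j] * np.log(Q_hat[x, j])
                  for x in range(m) for j in range(m)
                  if N[x] > 0 and Q_hat[x, j] > 0)
    return q_log_q - prob.value
```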
Experimental validation. We fix a target distribution π ∈ ∆_5 and set α = 0.05. We compare two kernels: a valid sampler Q_good ∈ P_π and a misspecified sampler Q_bad ∉ P_π. As shown in Figure 1, under the alternative (Q_bad) the statistic grows approximately linearly and tracks the rate D_M^inf(Q_bad, P_π). Under the null (Q_good), the statistic remains below the boundary. Across all trials, we observed no false rejections, suggesting that the boundary may be conservative in practice. Additional details are provided in Section G.

Figure 1: Test statistic trajectories over time for Q_bad (red) and Q_good (green), aggregated over 100 runs. Solid curves and shaded regions show the mean ± 3σ; dotted colored curves denote the boundary. The dotted black line shows the theoretical slope D_M^inf(Q_bad, P_π), and the vertical line marks the mean stopping time for Q_bad.

5.2 Testing Linearity of MDPs

We consider the problem of testing a commonly used structural assumption in reinforcement learning: that the transition dynamics and rewards admit a linear parameterization with respect to a known feature map. Formally, consider an infinite-horizon MDP M with finite state space S and finite action space A. The MDP is specified by transition probabilities p_M(s, a, s′) and reward distributions q_M(s, a), with expected reward r_M(s, a) = E[q_M(s, a)]. A policy π_M(a′ | s′) specifies the probability of selecting action a′ in state s′. Let T_M ∈ R^{|S||A| × |S|} denote the transition matrix with entries T_M((s, a), s′) := p_M(s, a, s′). Let Π_M ∈ R^{|S| × |S||A|} denote the policy matrix with entries Π_M(s, (s′, a′)) := π_M(a′ | s) I{s = s′}.

Definition 5.2 (Linear MDP (Jin et al., 2020)). An MDP M is linear if there exist µ_M ∈ R^{|S| × d} and θ_M ∈ R^d such that
T_M = Φ µ_M^⊤ and r_M = Φ θ_M,
where Φ ∈ R^{|S||A| × d} is a known feature matrix whose row for (s, a) equals ϕ(s, a)^⊤ for a given map ϕ: S × A → R^d. We assume ‖θ_M‖_2 ≤ √d, ‖ϕ(s, a)‖_2 ≤ 1 for all (s, a), and ‖µ_M(s)‖_2 ≤ √d for all s ∈ S.

Executing policy π_M in M induces a Markov chain P_{M,π} on state–action pairs with transition matrix
P_{M,π}((s, a), (s′, a′)) = p_M(s, a, s′) π_M(a′ | s′).
We assume the induced chain is ergodic. For a linear MDP, this induced kernel can be written as P_{M,π} = Φ µ_M^⊤ Π_M. Since Φ and Π_M are known, linearity imposes explicit structural constraints on the induced transitions. We encode these constraints through the null class
P_L = { P : P = Φ µ^⊤ Π_M, µ ∈ R^{|S| × d}, ‖µ‖_{2,∞} ≤ √d }.
We test P = P_L against Q = P_L^∁. We establish compactness of P_L in Lemma E.2 in Section E, which yields the following corollary of Theorem 4.1.

Corollary 5.3. Suppose M is not linear (with respect to Φ), and suppose the policy π_M induces an ergodic Markov chain Q_{M,π} on state–action pairs. Then
lim sup_{α→0} E_{Q_{M,π}}[τ_α] / log(1/α) ≤ 1 / D_M^inf(Q_{M,π}, P_L).
Here, D_M^inf(Q_{M,π}, P_L) quantifies how far the induced dynamics are from admitting the prescribed linear factorization.
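To illustrate how the structural constraint in P_L can be encoded, the sketch below projects an estimated state–action kernel onto the set {Φ µ^⊤ Π_M : ‖µ‖_{2,∞} ≤ √d} in Frobenius norm using CVXPY. Note that Algorithm 1 uses the KL objective of Line 16 rather than this squared loss; the snippet, with its function name and interface, only shows one way the linearity constraint might enter a convex program.

```python
import numpy as np
import cvxpy as cp

def distance_to_linear_class(Q_hat, Phi, Pi_M, d):
    """Frobenius-norm projection of an estimated state-action kernel Q_hat onto the null
    class P_L of Section 5.2. Phi has shape (|S||A|, d), Pi_M has shape (|S|, |S||A|)."""
    n_states = Pi_M.shape[0]
    mu = cp.Variable((n_states, d))
    P_candidate = Phi @ mu.T @ Pi_M                      # candidate induced kernel
    constraints = [cp.norm(mu, axis=1) <= np.sqrt(d)]    # row-wise norm bound on mu
    prob = cp.Problem(cp.Minimize(cp.norm(P_candidate - Q_hat, "fro")), constraints)
    prob.solve()
    return prob.value   # zero (up to solver tolerance) iff Q_hat admits the factorization
```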
Experimental validation. We evaluate our procedure on the Mountain Car environment as implemented in gym (Towers et al., 2024). We discretize the state space (|S| = 8 × 8) and use |A| = 3 actions with a uniformly random policy. We generate radial-basis-function features of varying dimensions and plot the statistic over time in Figure 2. In all cases, the statistic grows steadily, indicating misspecification and rejection of the linear-MDP hypothesis. The lower-dimensional representation (d = 3) exhibits the steepest growth and thus rejects earlier than the higher-dimensional representations (d = 5, 7). Additional experimental results, including a comparison with the algorithm of Fields et al. (2025), are in Section G.

Figure 2: Mean statistic trajectory aggregated over 20 runs. Shaded regions indicate ± 3 standard deviations.

6 Discussion and Future Works

We developed a rigorous framework for one-sided sequential hypothesis testing with Markovian data. Our results substantially generalize prior work by allowing a composite null to be tested against a disjoint, composite alternative. We introduced new tools that yield the first non-asymptotic, instance-dependent lower bounds in this setting, and we proposed a martingale-based test whose asymptotic sample complexity matches these bounds. We illustrated the framework through applications to misspecification testing in MCMC and to structural testing in reinforcement learning. We also established a Pinsker-type inequality for Markov divergences via the Poisson equation, which may be of independent interest.

Several directions remain open. First, it would be valuable to extend the techniques of Fields et al. (2025) to the general composite setting; this likely requires new low-regret learning procedures under Markovian dependence. Second, our results highlight a statistical–computational tradeoff: while the approximation in Section 4.1 yields computationally tractable tests, it may be sample-inefficient. It remains unclear whether one can achieve both computational efficiency and statistical optimality for general (possibly non-convex) P. Third, it is natural to move beyond finite state spaces to countable-state Markov chains. Finally, an important extension is sequential testing under hidden Markov observations, where only a noisy function of the latent chain is observed and the resulting process need not be Markov. This raises major analytical challenges for both α-correct martingale constructions and sharp, instance-dependent lower bounds.

Acknowledgements

SA acknowledges the generous support from the Pratiksha Trust, Bangalore, through the Young Investigator Award, and the DST INSPIRE Faculty Grant IFA24-ENG-389. DB acknowledges the support of the ANR JCJC project REPUBLIC (ANR-22-CE23-0003-01) and the PEPR project FOUNDRY (ANR23-PEIA-0003).

References

Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.

Agrawal, S. and Ramdas, A. On stopping times of power-one sequential tests: Tight lower and upper bounds. arXiv preprint, 2025.

Agrawal, S., Juneja, S. K., and Koolen, W. M. Regret minimization in heavy-tailed bandits. In Proceedings of the Thirty Fourth Conference on Learning Theory (COLT), volume 134, pp. 26–62. PMLR, 2021.

Al Marjani, A. and Proutiere, A. Adaptive sampling for best policy identification in Markov decision processes. In International Conference on Machine Learning, pp. 7459–7468. PMLR, 2021.

Al Marjani, A., Garivier, A., and Proutiere, A.
Navigating to the best policy in Markov decision processes. Advances in Neural Information Pr ocessing Systems , 34:25852–25864, 2021. Anantharam, V ., V araiya, P ., and W alrand, J. Asymptotically efficient allocation rules for the mul- tiarmed bandit problem with multiple plays-part ii: Markovian rewards. IEEE T ransactions on Automatic Control , 32(11):977–982, 2003. Ariu, K., W ang, P .-A., Pr outiere, A., and Abe, K. Policy testing in Markov decision pr ocesses. arXiv preprint arXiv:2505.15342 , 2025. Baum, C. W . and V eeravalli, V . V . A sequential procedure for multihypothesis testing. IEEE T ransactions on Information Theory , 40(6):1994–2007, 2002. Bengio, Y . et al. Markovian models for sequential data. Neural computing surveys , 2(199):129–162, 1999. Berge, C. T opological Spaces: Including a T reatment of Multi-valued Functions, V ector Spaces, and Convexity . Dover Books on Mathematics. Dover Publications, 1997. ISBN 9780486696539. Bertsekas, D. Reinforcement learning and optimal contr ol , volume 1. Athena Scientific, 2019. Beutler , F . J. and Ross, K. W . Optimal policies for controlled Markov chains with a constraint. Journal of mathematical analysis and applications , 112(1):236–252, 1985. Carlsson, E., Basu, D., Johansson, F ., and Dubhashi, D. Pur e exploration in bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics , pp. 334–342. PMLR, 2024. Chatterjee, S. Spectral gap of nonreversible Markov chains. The Annals of Applied Probability , 35(4): 2644–2677, 2025. Cover , T . and Thomas, J. Elements of Information Theory . W iley , 2012. ISBN 9781118585771. Darling, D. A. and Robbins, H. Iterated logarithm inequalities. Proceedings of the National Academy of Sciences of the United States of America , 57(5):1188–1192, 1967. ISSN 00278424. Das, U., Shukla, A., and Basu, D. Frappe: Fast and efficient pr eference-based pur e exploration. In The Thirty-ninth Annual Conference on Neural Information Pr ocessing Systems , 2025. Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research , 17(83):1–5, 2016. Dimitriadis, B. and Kazakos, D. A nonparametric sequential test for data with Markov dependence. IEEE T ransactions on Aerospace and Electr onic Systems , 3:338–347, 2007. Douc, R., Moulines, E., Priouret, P ., and Soulier , P . Markov Chains . Springer International Publishing, 2018. ISBN 9783319977041. doi: 10.1007/978- 3- 319- 97704- 1. 17 Farrell, R. H. Asymptotic behavior of expected sample size in certain one sided tests. The Annals of Mathematical Statistics , pp. 36–72, 1964. Fauß, M., Zoubir , A. M., and Poor , H. V . Minimax optimal sequential hypothesis tests for Markov processes. The Annals of Statistics , 48(5), October 2020. ISSN 0090-5364. doi: 10.1214/19- aos1899. Fields, G., Javidi, T ., and Shekhar , S. Sequential one-sided hypothesis testing of Markov chains. arXiv preprint arXiv:2501.13187 , 2025. Fill, J. A. Eigenvalue Bounds on Convergence to Stationarity for Nonreversible Markov Chains, with an Application to the Exclusion Process. The Annals of Applied Probability , 1(1):62 – 87, 1991. doi: 10.1214/aoap/1177005981. Goldreich, O. Intr oduction to property testing . Cambridge University Press, 2017. Gyori, B. M. and Paulin, D. Hypothesis testing for Markov chain monte carlo. Statistics and Computing , 26(6):1281–1292, July 2015. ISSN 1573-1375. doi: 10.1007/s11222- 015- 9594- 1. Huang, D. and Li, X. 
Bernstein-type inequalities for Markov chains and Markov processes: A simple and robust pr oof. arXiv preprint , 2024. Jin, C., Y ang, Z., W ang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Confer ence on learning theory , pp. 2137–2143. PMLR, 2020. Jonsson, A., Kaufmann, E., M ´ enard, P ., Darwiche Domingues, O., Leurent, E., and V alko, M. Planning in Markov decision pr ocesses with gap-dependent sample complexity . Advances in Neural Information Processing Systems , 33:1253–1263, 2020. Karthik, P . N., T an, V . Y . F ., Mukherjee, A., and T ajer , A. Optimal best arm identification with fixed confidence in restless bandits. IEEE T ransactions on Information Theory , 70(10):7349–7384, 2024. Kiefer , S. and Sistla, A. P . Distinguishing hidden Markov chains. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science , LICS ’16, pp. 66–75, New Y ork, NY , USA, 2016. Kreyszig, E. Introductory Functional Analysis with Applications . W iley classics library . W iley , 1978. ISBN 9780471507314. Lehmann, E. L. and Romano, J. P . T esting statistical hypotheses . Springer , 2005. Levin, D. A., Peres, Y ., and W ilmer , E. L. Markov chains and Mixing T imes: Second edition . 2017. Lorden, G. 2-sprt’s and the modified kiefer -weiss problem of minimizing an expected sample size. The Annals of Statistics , 4(2):281–291, 1976. Lorden, G. Nearly-optimal sequential tests for finitely many parameter values. The Annals of Statistics , 5(1):1–21, 1977. Moulos, V . Optimal best Markovian arm identification with fixed confidence. In W allach, H., Larochelle, H., Beygelzimer , A., d'Alch ´ e-Buc, F ., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc., 2019. Moustakides, G. V . Extension of Wald’s first lemma to Markov processes. Journal of Applied Probability , 36(1):48–59, March 1999. 18 Nagaraj, D., W u, X., Bresler , G., Jain, P ., and Netrapalli, P . Least squares r egression with Markovian data: Fundamental limits and algorithms. Advances in neural information processing systems , 33: 16666–16676, 2020. Natarajan, S. Large deviations, hypotheses testing, and source coding for finite Markov chains. IEEE T ransactions on Information Theory , 31(3):360–365, 2003. Ouhamma, R., Basu, D., and Maillard, O. Bilinear exponential family of mdps: frequentist regr et bound with tractable exploration & planning. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 37, pp. 9336–9344, 2023. Paulin, D. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Pr obability , 20(none):1 – 32, 2015. Phatarfod, R. M. Sequential analysis of dependent observations. i. Biometrika , 52(1–2):157–166, June 1965. ISSN 1464-3510. doi: 10.1093/biomet/52.1- 2.157. Phatarfod, R. M. Sequential tests for normal Markov sequence. Journal of the Australian Mathematical Society , 12(4):433–440, 1971. Polyanskiy , Y . and W u, Y . Information Theory: From Coding to Learning . Cambridge University Press, 2025. Posner , E. Random coding strategies for minimum entropy . IEEE T ransactions on Information Theory , 21 (4):388–391, 1975. doi: 10.1109/TIT .1975.1055416. Rabiner , L. and Juang, B. An introduction to hidden Markov models. ieee assp magazine , 3(1):4–16, 2003. Rabinovich, M., Ramdas, A., Jordan, M. I., and W ainwright, M. J. Function-specific mixing times and concentration away from equilibrium. 
Bayesian Analysis , 15(2), June 2020. ISSN 1936-0975. doi: 10.1214/19- ba1151. Ramdas, A. and W ang, R. Hypothesis testing with e-values . Foundations and T rends in Statistics, 2025. Robbins, H. and Siegmund, D. The expected sample size of some tests of power one. The Annals of Statistics , 2(3):415–436, 1974. Roy , V . Convergence diagnostics for Markov chain monte carlo. Annual Review of Statistics and Its Application , 7(1):387–412, March 2020. Russo, A., W elch, R., and Pacchiano, A. Learning to explor e: An in-context learning appr oach for pur e exploration. arXiv preprint , 2025. Schmitz, N. and S ¨ uselbeck, B. Sequential probability ratio tests for homogeneous Markov chains. In Mathematical Learning Models—Theory and Algorithms: Proceedings of a Conference , pp. 191–202. Springer , 1983. Sutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction , volume 1. MIT pr ess Cambridge, 1998. T akeuchi, J., Kawabata, T ., and Barron, A. R. Properties of jef freys mixtur e for Markov sour ces. IEEE T ransactions on Information Theory , 59(1):438–457, January 2013. 19 T owers, M., Kwiatkowski, A., T erry , J. K., Balis, J. U., Cola, G. D., Deleu, T ., Goul ˜ ao, M., Kallinteris, A., Krimmel, M., KG, A., Perez-V icente, R., Pierr ´ e, A., Schulhoff, S., T ai, J. J., T an, H., and Y ounis, O. G. Gymnasium: A standard interface for reinfor cement learning environments. CoRR , abs/2407.17032, 2024. URL https://doi.org/10.48550/arXiv.2407.17032 . T uynman, A., Degenne, R., and Kaufmann, E. Finding good policies in average-rewar d Markov decision processes without prior knowledge. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. W agenmaker , A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinfor cement learning. In Conference on Learning Theory , pp. 358–418. PMLR, 2022. W ald, A. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics , 16(2):117–186, June 1945. ISSN 0003-4851. doi: 10.1214/aoms/1177731118. W olfer , G. and Kontorovich, A. Minimax learning of ergodic Markov chains. In Garivier , A. and Kale, S. (eds.), Proceedings of the 30th International Confer ence on Algorithmic Learning Theory , volume 98 of Proceedings of Machine Learning Resear ch , pp. 904–930. PMLR, 22–24 Mar 2019. A Proofs of Results Appearing in Section 3 This appendix builds the r equisite tools and then proves the lower bound, Theorem 3.3 . Section A.1 provides a r ecap of the r esults from Moustakides ( 1999 ) on a W ald-type identity for Markovian data, and an application of the same to our problem setting. W e show that the latter leads to certain auxiliary terms which arise as solutions to the Poisson equation (PE). In Section A.2 , we build the primary tool requir ed to obtain a bound on the magnitude of the auxiliary terms via the spectral pr operties of the underlying Markov chain. Finally , we prove the lower bound in Section A.3 . A.1 W ald’ s Lemma for Markovian Data T owards building our machinery , we first recap the W ald’s lemma for Markov chains from Mous- takides ( 1999 ). Lemma A.1 ( Moustakides ( 1999 , Lemma 2.3) ) . Let P ∈ M be aperiodic and irreducible, with a unique stationary distribution π ∈ R 1 × m . Let f ∈ R m . Consider the PE with respect to the unknown ν ∈ R m , given by ( P − I ) ν = − ( P − 1 m × 1 π ) f , where 1 denotes the all-ones vector , and ν satisfies the constraint π ν = 0 . 
Then, the above system of equations possesses a solution given by ν = ∞ ∑ n = 1 ( P n − 1 π ) f . W e note here that the form of the PE appearing in Lemma A.1 differs slightly fr om the standard form in that if ω = ω ∗ denotes a solution to the (standard form of) PE ( I − P ) ω = f − ( π f ) 1 , 20 then ν and ω ∗ satisfy the relation ν = P ω ∗ . Accordingly , the statement of W ald’s lemma in Mous- takides ( 1999 ) can be modified to use the solution ω ∗ of the standard form of the PE with minor modifications to the martingale construction therein. W e recor d this in the following result and provide a pr oof of the same for completeness. Theorem A.2 (W ald’s Lemma for Markov Chains) . Let { X n } n ≥ 0 be an irr educible, aperiodic Markov chain on a finite state space with m states and transition matrix P . Let π denote the unique stationary distribution of P . Let f ∈ R m be any cost function, and let N be a stopping time with respect to the natural filtration generated by the process { X n } n ≥ 0 , with E [ N ] < + ∞ . Define the total accumulated cost S N : = ∑ N − 1 k = 0 f ( X k ) . Let ω P , f ∈ R m be a solution to the PE ( I − P ) ω P , f = f − ( π f ) 1 . Then, we have E [ S N ] = ( π f ) E [ N ] + E [ ω P , f ( X 0 ) − ω P , f ( X N )] . Proof. For the rest of the proof, we dr op the subscript in ω P , f . Define the martingale process { U n } n ≥ 0 as U 0 = ω ( X 0 ) , U n = ω ( X n ) + n − 1 ∑ k = 0 ( f ( X k ) − π f ) , n ≥ 1. T o verify the martingale property , defining F n = σ ( X 0 , X 1 , . . . , X n ) for each n ≥ 0, we have E [ U n + 1 | F n ] = E " ω ( X n + 1 ) + n ∑ k = 0 ( f ( X k ) − π f )      F n # , which may be rewritten as E [ U n + 1 | F n ] = E [ ω ( X n + 1 ) | X n ] + ( f ( X n ) − π f ) + n − 1 ∑ k = 0 ( f ( X k ) − π f ) . Observe that E [ ω ( X n + 1 ) | X n ] = ( P ω )( X n ) . Thus, E [ U n + 1 | F n ] = ( P ω )( X n ) + f ( X n ) − π f + n − 1 ∑ k = 0 ( f ( X k ) − π f ) . Rearranging terms in the PE ( I − P ) ω = f − ( π f ) 1 , we get ( P ω ) ( X n ) = ω ( X n ) − ( f ( X n ) − π f ) , from which it follows that E [ U n + 1 | F n ] = [ ω ( X n ) − ( f ( X n ) − π f ) ] + ( f ( X n ) − π f ) + n − 1 ∑ k = 0 ( f ( X k ) − π f ) = ω ( X n ) + n − 1 ∑ k = 0 ( f ( X k ) − π f ) = U n , thereby pr oving that { U n } n ≥ 0 is a martingale. Noting that the state space is finite, we have that ω , f are bounded. Then, there exists K < + ∞ such 21 that for all n ≥ 1, | U n + 1 − U n | = | ω ( X n + 1 ) − ω ( X n ) + f ( X n ) − π f | ≤ K a.s., thus proving that { U n } n ≥ 0 has bounded increments. Given that N is a stopping time with E [ N ] < ∞ and the martingale { U n } n ≥ 0 has bounded increments, we apply the optional stopping theorem to deduce that E [ U N ] = E [ U 0 ] , which in turn is equivalent to E [ ω ( X N )] + E " N − 1 ∑ k = 0 f ( X k ) # − ( π f ) E [ N ] = E [ ω ( X 0 )] . Rearranging terms, we arrive at the desired r esult. A.2 Bounding the Magnitude of a Solution to the Poisson Equation via Spectral Properties T o apply Theorem A.2 later on, we desir e contr ol over the magnitude of ω ∗ = ω P , f , the solution to the PE for the ergodic Markov chain with transition matrix P and the function f . In the following r esult below , we derive a bound on the magnitude of ω P , f (as given in ( 2 )) that depends on P and f . As a first step, we recor d the following cor ollary to ( Paulin , 2015 , Proposition 3.4). Corollary A.3. Let P be an ergodic transition matrix on a finite state space [ m ] with unique stationary distribution π . 
Let γ ps be the pseudo-spectral gap. Then, for any n ≥ 1 and x ∈ [ m ] , we have ∥ P n ( x , · ) − π ∥ TV ≤ 1 2 ( 1 − γ ps ) 1/ ( 2 γ ps ) p π ( x ) ( 1 − γ ps ) n / 2 . (8) Remark A.4 . It is worthwhile to compare ( 8 ) with ( Fill , 1991 , Theor em 2.1) which gives a similar bound but in terms of the second largest eigenvalue of the multiplicative reversibilization of P (i.e., P ∗ P ). However , as noted in ( Paulin , 2015 , Remark 3.2), there exist cases wher e the second largest eigenvalue is zero, while the pseudo-spectral gap is strictly positive. Proof of Cor ollary A.3 . Observe that a finite state, ergodic Markov chain is uniformly ergodic ( Douc et al. , 2018 , Exer cise 15.6), and consider the initial distribution q of the Markov chain localized at x , i.e., q ( x ) = 1 and q ( y ) = 0 for all y  = x . Applying ( Paulin , 2015 , Proposition 3.4), we have ∥ P n ( x , · ) − π ∥ TV ≤ 1 2 ( 1 − γ ps ) ( n − 1/ γ ps ) / 2 s 1 π ( x ) − 1 ≤ 1 2 ( 1 − γ ps ) 1/ ( 2 γ ps ) p π ( x ) ( 1 − γ ps ) n / 2 , where the second line follows by noting that p 1 − π ( x ) ≤ 1. W ith the above corollary in hand, we provide the pr oof of Pr oposition 3.1 below . 22 Proof of Pr oposition 3.1 . W e use the form of ω = ω P , f given in ( 2 ), i.e. ω = ∞ ∑ n = 0 ( P n − 1 π ) f . More explicitly , for each x ∈ [ m ] , we have ω ( x ) = ∞ ∑ n = 0 ∑ y ∈ [ m ] ( P n ( x , y ) − π ( y ) ) f ( y ) . T aking absolute values on both sides and applying the triangle inequality , we get | ω ( x ) | ≤ ∞ ∑ n = 0 ∑ y ∈ [ m ] | P n ( x , y ) − π ( y ) | | f ( y ) | ≤ 2 ∥ f ∥ ∞ ∞ ∑ n = 0 ∥ P n ( x , · ) − π ∥ TV . Assuming that γ ps ∈ ( 0, 1 ) and using Corollary A.3 to bound the total variation terms, we have | ω ( x ) | ≤ ∥ f ∥ ∞ ( 1 − γ ps ) 1/ ( 2 γ ps ) p π ( x ) ∞ ∑ n = 0 ( 1 − γ ps ) n / 2 = ∥ f ∥ ∞ ( 1 − γ ps ) 1/ ( 2 γ ps ) p π ( x ) 1 1 − p 1 − γ ps ! . Replacing π ( x ) with the π ∗ = min x ∈ [ m ] π ( x ) gives an upper bound that is valid for any x ∈ [ m ] , thereby giving us the desir ed r esult. W e now show boundedness of ω = ω P , f for the corner cases when γ ps ∈ { 0, 1 } . Recall that γ ps ( P ) : = max k ≥ 1 γ ( ( P ∗ ) k P k ) k , From ( Fill , 1991 , Section 2.1), we have that ( P ∗ ) k P k is reversible for each k ≥ 1 and has all eigenvalues in [ 0, 1 ] , thus implying that γ ps ∈ [ 0, 1 ] . • γ ps = 0: W e argue that this case is not possible for ergodic, finite state Markov chains. From the convergence theor em for ergodic Markov chains ( Levin et al. , 2017 , Theorem 4.9), we know that the mixing time, t mix , of an ergodic matrix is finite. Paulin ( 2015 , Proposition 3.4) shows that γ ps ≥ 1/ ( 2 t mix ) for ergodic Markov chains. Thus, we must have that γ ps > 0. • γ ps = 1: W e show below that this case corresponds to the i.i.d. setting, and that in this case, ω P , f = f − ( π f ) 1 . Observe that if γ ps = max k ≥ 1 γ ( ( P ∗ ) k P k ) k = 1, we must have that γ ps = γ ( P ∗ P ) = 1, as γ ( ( P ∗ ) k P k ) k ∈ [ 0, 1 / k ] for any k (noting that γ ( ( P ∗ ) k P k ) ∈ [ 0, 1 ] for all k ≥ 1). From the defi- nition of spectral gap, we have 1 − λ 2 ( P ∗ P ) = 1, thereby implying that λ 2 = λ 2 ( P ∗ P ) = 0. Using the fact that all eigenvalues of P ∗ P lie in [ 0, 1 ] , we must have 0 = λ 2 = λ 3 · · · = λ m . Applying ( Fill , 1991 , Theorem 2.1), we have that ∥ P n ( x , · ) − π ∥ TV ≤ 0 for all n ≥ 1 and x ∈ [ m ] , thus implying that P ( x , · ) = π for all x ∈ [ m ] . 
That is, P has identical rows, all of which match 23 with the stationary distribution, thus implying that the random variables of the pr ocess { X n } n ≥ 0 are independent and identically distributed (i.i.d.), with distribution π . In this case, the PE gives ( I − P ) ω P , f = f − ( π f ) 1 = ⇒ ω P , f − ( π ω P , f ) 1 = f − ( π f ) 1 . W e note that the solution to PE is unique on the subspace of R m orthogonal to the all ones vector , 1 , and the above equation shows that any ω P , f and f are equal on this subspace. W e can then choose an ω P , f such that its projection along 1 is equal to zero. This gives us ω P , f = f − ( π f ) 1 , which we note is consistent with the solution given as ( 2 ) in the main text. The solution now follows from observing: ∥ f ∥ ∞ ≥ π f and | f ( i ) | ≤ ∥ f ∥ ∞ ∀ i ∈ [ m ] . Applying the triangle inequality , we get: | f ( i ) − π f | ≤ 2 ∥ f ∥ ∞ ∀ i ∈ [ m ] and we obtain the result. This completes the proof. T o summarize the above discussion, we have that for an ergodic Markov chain P and any function f : [ m ] → R , ∥ ω P , f ∥ ∞ ∥ f ∥ ∞ ≤      1 ( 1 − γ ps ) 1/ ( 2 γ ps ) √ π ∗ 1 1 − p 1 − γ ps ! , γ ps ∈ ( 0, 1 ) 2, γ ps = 1. (9) Remark A.5 (Choice of Spectral Gap) . One may also bound the solution to the PE with a differ ent notion of the spectral gap, say γ c , defined by Chatterjee ( 2025 ). In particular , we have ( Chatterjee , 2025 , Equation 3.1) ∥ ω ∥ 2, π ≤ 1 γ c ∥ f − ( π f ) 1 ∥ 2, π , where ∥ a ∥ 2, π : = ∑ x ∈ [ m ] π ( x )( a ( x )) 2 . However , the key differ ence arises in the fact that this bound is stated in terms of the norm of the center ed version of the function, i.e., ∥ f − ( π f ) 1 ∥ 2, π . Our bound, on the contrary , is in t erms of the norm of the uncentered function, i.e., ∥ f ∥ ∞ . It is not immediately obvious how the bound with the centered version may be used to derive a bound with the uncenter ed version, or vice-versa. In general, we do not know how such a result would be derived for other notions of spectral gaps. Looking forward, we note that the analysis of the upper bound could potentially be done by a dif ferent choice of spectral gap. This can be done by substituting the relevant concentration inequalities in Section B.2 (for example, by using Huang & Li ( 2024 , Theorem 2.4)). A.3 Proofs Related to the Lower Bound W e are now equipped with all the tools needed to the lower bound. W e proceed as follows: first, we use Theorem A.2 to contr ol the expected log-likelihood ratio between two data generating chains as in Fields et al. ( 2025 , Appendix). Following this, we proceed to show the lower bound for testing Q against a singleton P ∈ P via a change-of-measure argument. While the analogous proof for the independent setting ( Agrawal & Ramdas , 2025 , Theorem C.1) effectively stops here by taking a supr emum over all elements of the null hypothesis set, we requir e the bounds developed in Proposition 3.1 to bound the auxiliary PE terms arising fr om Theorem A.2 , which depend implicitly on P , so that we can safely maximize the lower bound over the whole of the null hypothesis set P . 24 Corollary A.6 (Expected log-likelihood ratio) . Let { X n } n ≥ 0 be an ergodic Markov chain with transition matrix Q on a finite state space, say [ m ] . Let P be any transition matrix on the state space [ m ] , with Q ≪ P . Let τ α be a stopping time with respect to the natural filtration generated by { X n } n ≥ 0 , with E [ τ α ] < + ∞ . 
Then, E Q " τ α ∑ i = 1 log  Q ( X i − 1 , X i ) P ( X i − 1 , X i )  # = D M ( Q , P ) E [ τ α ] + C τ α , where C τ α : = E [ ω Q , f P ( X 0 ) − ω Q , f P ( X τ α )] , ω Q , f P is a solution of the PE for the function f P where f P ( i ) = D KL ( Q ( i , · ) , P ( i , · ) ) for all i ∈ [ m ] . Proof. Consider the function: f ( X n ) : = D KL ( Q ( X n , · ) , P ( X n , · ) ) = E X n + 1 ∼ Q ( X n , · )  log  Q ( X n , X n + 1 ) P ( X n , X n + 1 )  . Using the fact that Q ≪ P , we have f ( X n ) < + ∞ for all n . Applying Theorem A.2 , the desir ed result follows by observing that π f = D M ( Q , P ) . W e now proceed to prove Theor em 3.3 . Proof of Theor em 3.3 . Let { X n } n ≥ 0 be an ergodic Markov chain with transition matrix some Q ∈ Q . If Q  ≪ P for any P ∈ P , then D inf M ( Q , P ) = + ∞ and the lower bound in ( 3 ) holds trivially . Therefor e, fixing P ∈ P such that Q ≪ P , we have E Q " τ α ∑ t = 1 log  Q ( X t − 1 , X t ) P ( X t − 1 , X t )  # = D KL ( Q τ α , P τ α ) , where the divergence is between the joint distributions on the τ α samples induced by the Markov chains Q and P respectively . From Cor ollary A.6 , we have E Q " τ α ∑ t = 1 log  Q ( X t − 1 , X t ) P ( X t − 1 , X t )  # = E Q [ τ α ] D M ( Q , P ) + C τ α , f P , where C τ α , f P = E [ ω Q , f P ( X 0 ) − ω Q , f P ( X τ α )] . Applying the data processing inequality , we have D KL ( Q τ α , P τ α ) ≥ d ( Q τ α ( E ) , P τ α ( E )) ∀ E ∈ F τ α , where F τ α : = n E : E ∩ { τ α = n } ∈ F n ∀ n o . Choosing the event E = { τ α < + ∞ } , and noting that E ∈ F τ α , we have KL ( Q τ α , P τ α ) ≥ d ( 1, α ) = log ( 1/ α ) . Then, for any P ∈ P , we have E Q [ τ α ] ≥ log ( 1 / α ) D M ( Q , P ) − E [ ω Q , f P ( X 0 ) − ω Q , f P ( X τ )] D M ( Q , P ) . The above bound holds for any P ∈ P , however to safely take a supr emum over all such P ∈ P , we 25 get the following worst case bound on the latter term: E [ ω Q , f P ( X 0 ) − ω Q , f P ( X τ α )] D M ( Q , P ) ≤ 2 ∥ ω Q , f P ∥ ∞ D M ( Q , P ) ≤ 2 ∥ ω Q , f P ∥ ∞ π ∗ ∥ f P ∥ ∞ ≤ 2 C Q π ∗ , where the thir d inequality above follows by noting that D M ( Q , P ) = ∑ i ∈ [ M ] π i D KL ( Q ( i , · ) , P ( i , · )) ≥ π ∗ ∑ i ∈ [ M ] D KL ( Q ( i , · ) , P ( i , · )) | {z } f P ( i ) ≥ π ∗ ∥ f P ∥ ∞ , and the last inequality follows from the definition of C Q (see Proposition 3.1 ). Thus, we have the worst case bound: E Q [ τ α ] ≥ log ( 1 / α ) D M ( Q , P ) − 2 C Q π ∗ . Noting that the above inequality holds for any P ∈ P , and replacing D M ( Q , P ) with inf P ∈ P D M ( Q , P ) , we arrive at the desir ed result. Since our worst-case bound on the PE terms may be loose in certain cases, we take a maximum between our obtained bound and 0 (since τ α ≥ 0) to ensure the bound isn’t vacuous in instances where log ( 1 / α ) D inf M ( Q , P ) < 2 C Q π ∗ . A.4 Suboptimality of Existing Lower Bounds W e note here that we could have alternatively used Al Marjani & Pr outiere ( 2021 , Lemma 1) to get a bound of the following form: E Q [ τ α ] ≥ log ( 1 / α ) sup ω ∈ ∆ m inf P ∈ P  ∑ i ∈ [ m ] ω i D KL ( Q ( x , · ) , P ( x , · ) )  . (10) While this is a valid lower bound, we demonstrate that it is suboptimal in our setting with the help of a simple example. Consider testing a singleton null, i.e. P = { P } against a problem instance Q (with corresponding stationary distribution π ). Let f ∈ R m be the vector corr esponding to the row KL divergences, i.e. f ( i ) = D KL ( Q ( i , · ) , P ( i , · ) ) . 
W e choose P and Q such that f is not a constant vector , i.e. ∃ i , j ∈ [ m ] : D KL ( Q ( i , · ) , P ( i , · ) ) < D KL ( Q ( j , · ) , P ( j , · ) ) . In this example, the term in the denominator of ( 10 ) becomes sup ω ∈ ∆ m inf P ∈ P ω ⊤ f . Since the null is a singleton set, the inner infimum disappears, and it is easy to see that sup ω ∈ ∆ m ω ⊤ f = ∥ f ∥ ∞ . Thus, ( 10 ) becomes: E Q [ τ α ] ≥ log ( 1 / α ) ∥ f ∥ ∞ . (11) On the contrary , our proposed lower bound (Theor em 3.3 ) gives: E Q [ τ α ] ≥ log ( 1 / α ) π f − 2 C Q π ∗ . (12) Observe that π f = ∑ i ∈ [ m ] π i D KL ( Q ( i , · ) , P ( i , · ) ) < max i D KL ( Q ( i , · ) , P ( i , · ) ) = ∥ f ∥ ∞ , where the strict inequality comes from the fact that min i π i > 0 and the fact that f is not constant. Thus, the dominant term in ( 12 ) is strictly larger than ( 11 ) . T aking α → 0, we can see that the bound given by ( 11 ) is worse by a multiplicative factor . 26 T o understand why this approach is loose in our setup but not in the context of best policy identification in MDPs, one needs to understand a subtle differ ence in setups. Suppose the supremum in ( 10 ) is achieved by ω ∗ ∈ ∆ m . In the context of best policy identification in MDPs, one interprets ω ∗ as the optimal proportion in which one should visit states and the design of the optimal algorithm focuses on tracking this proportion. On the other hand, in our setup (as well as that in Moulos ( 2019 ); Karthik et al. ( 2024 ); Ariu et al. ( 2025 )), we do not decide the proportion in which to visit the states of the Markov chain, it is, in fact, specified by the pr oblem instance! Owing to this key difference, prior works like Moulos ( 2019 ); Ariu et al. ( 2025 ) proceed via arguing that E [ N x ( τ α ) ] E [ τ α ] = π x + o ( 1 ) and can only obtain a lower bound asymptotically . Our appr oach allows us to characterize this o ( 1 ) term exactly in terms of mixing properties of the pr oblem instance. B Proofs of the Results Appearing in Section 4 In this section, we first collect the necessary results, and subsequently apply them to complete the desired pr oof. B.1 Good Event Analysis For t ≥ 1, define the event A t : =    t inf P ∈ P ∑ x ∈ [ m ] N x ( t ) t D KL  b Q x ( t ) , P x  < log ( 1/ α ) + ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1     . (13) Also, for any ϵ > 0 and t ≥ 1, define the good set E ϵ ( t ) : =  ∀ x ∈ [ m ] ,     N x ( t ) t − π x     ≤ ϵ  ∩ n ∥ c Q t − Q ∥ 1, ∞ ≤ ϵ o . (14) Let A t , ϵ : = A t ∩ E ϵ ( t ) . Lemma B.1 (Good event analysis) . For any ϵ > 0 and t ≥ 1 , A t , ϵ ⊆  t ≤ 2 C C ( ϵ ) log  log ( 1 / α ) + D C ( ϵ )  + log ( 1 / α ) + D C ( ϵ )  , where: C : = m ( m − 1 ) , D : = 2 m ( m − 1 ) , and C ( ϵ ) : = inf Q ′ : ∥ Q ′ − Q ∥ 1, ∞ ≤ ϵ inf P ∈ P ∑ x ∈ [ m ] ( π x − ϵ ) D KL ( Q ′ x , P x ) . Proof. Let R : = log ( 1/ α ) + ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1  , L : = t inf P ∈ P ∑ x ∈ [ m ] N x ( t ) t D KL  b Q x ( t ) , P x  . Noting that N x ( t ) ≤ t for all x ∈ [ m ] and t ≥ 1, we have ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1  = m ( m − 1 ) + ( m − 1 ) ∑ x ∈ [ m ] log  1 + N x ( t ) m − 1  27 ≤ m ( m − 1 ) + ( m − 1 ) ∑ x ∈ [ m ] log  1 + t m − 1  ≤ m ( m − 1 ) + ( m − 1 ) ∑ x ∈ [ m ] log  t  1 + 1 m − 1  ≤ m ( m − 1 ) log t + m ( m − 1 )  1 + log  1 + 1 m − 1  ≤ C log t + D , where C and D are as defined in the statement of the lemma, and the last line follows by noting that log  1 + 1 m − 1  ≤ 1 for all m ≥ 2. 
W e thus have R ≤ log ( 1/ α ) + C log t + D . On the good set E ϵ ( t ) , noting that N x ( t ) / t ≥ ( π x − ϵ ) for all x ∈ [ m ] , we have L ≥ t inf P ∈ P ∑ x ∈ [ m ] ( π x − ϵ ) D KL  b Q x ( t ) , P x  . Let C ( ϵ ) be as defined in the statement of the lemma. Noting that ∥ c Q t − Q ∥ 1, ∞ ≤ ϵ on the set E ϵ ( t ) , it follows that t C ( ϵ ) ≤ L . Thus, on A t , ϵ , we have tC ( ϵ ) ≤ L < R ≤ log ( 1/ α ) + C log t + D . Rearranging terms, we have t ≤ log ( 1 / α ) + D C ( ϵ ) + C log t C ( ϵ ) ≤ 2 C C ( ϵ ) log  log ( 1 / α ) + D C ( ϵ )  + log ( 1 / α ) + D C ( ϵ ) , where the second line above follows fr om the technical Lemma F .1 . This completes the proof. Lemma B.2 (Concentration of frequency vector to stationary distribution) . For an ergodic Markov chain with transition matrix P, associated unique stationary distribution π , and initial distribution µ , ∀ ϵ > 0, P P , µ      N x ( t ) t − π x     > ϵ  ≤ q 2 Π µ · exp − ϵ 2 t 2 γ ps 4 ( t + 1/ γ ps ) + 40 ϵ t ! , (15) where Π µ : = ∑ i ∈ [ m ] µ ( i ) 2 π ( i ) ≤ 1 π ∗ . Proof of Lemma B.2 . Fr om Pr oposition 3.15 of Paulin ( 2015 ), we have P P , µ      N x ( t ) t − π x     > ϵ  ≤ q Π µ  P P , π      N x ( t ) t − π x     > ϵ  1/2 , where P P , π denotes the probability measure induced when the initial distribution is, in fact, the 28 stationary distribution, and the state transitions are governed by the transition matrix P . W e note that P P , π      N x ( t ) t − π x     > ϵ  = P P , π [ | N x ( t ) − t π x | > t ϵ ] . Observe that: 1. N x ( t ) = ∑ t i = 1 I { X i − 1 = x } = ∑ t − 1 i = 0 I { X i = x } . 2. For every x ∈ [ m ] ,    I { X = x } − E π [ I { X = x } ]    ≤ max { π x , 1 − π x } ≤ 1. 3. V ar π ( I { X = x } ) = π x ( 1 − π x ) ≤ 1/ 4. Applying Theorem 3.11 of Paulin ( 2015 ), we get P P , π [ | N x ( t ) − t π x | > ϵ t ] ≤ 2 · exp − ϵ 2 t 2 γ ps 2 ( t + 1/ γ ps ) + 20 ϵ t ! , thereby establishing the desir ed r esult. B.2 Concentration Analysis Lemma B.3 (Concentration Inequality for Empirical T ransition Matrix ( W olfer & Kontorovich , 2019 ) ) . For an ergodic Markov chain with finite state space [ m ] , unknown transition matrix P and its associated unique stationary distribution π . For any initial distribution µ , the empirical transition matrix b P ( t ) , defined via b P x , y ( t ) = N x , y ( t ) N x ( t ) for all x , y ∈ [ m ] , satisfies ∀ t ≥ 1, ∀ ϵ > 0, P P , µ h ∥ P − b P ( t ) ∥ 1, ∞ > ϵ i ≤ 2 m 2 exp  − ϵ 2 t π ∗ 16 m  + m q Π µ exp − γ ps ( 0.5 t π ∗ ) 2 14 t + 4 / γ ps ! , (16) where π ∗ = min x ∈ [ m ] π x . Proof of Lemma B.3 . From ( W olfer & Kontorovich , 2019 , Theorem 1), for each x ∈ [ m ] and t ≥ 1, we have P P , µ h ∥ b P ( x , · ) ( t ) − P ∥ 1 > ϵ , N x ( t ) ∈ [ 0.5 t π x , 1.5 t π x ] i ≤ 2 m exp  − ϵ 2 t π ∗ 16 m  . (17) Furthermore, fr om Lemma B.2 (which in turn follows from Paulin ( 2015 ) and is given as Lemma 6 in W olfer & Kontorovich ( 2019 )), for each x ∈ [ m ] , we have P P , µ [ N x ( t ) / ∈ [ 0.5 t π x , 1.5 t π x ] ] ≤ q Π µ exp − γ ps ( 0.5 t π x ) 2 2 ( 8 ( t + 1 / γ ps ) π x ( 1 − π x ) + 10 t π x ) ! . (18) Let E : = γ ps ( 0.5 t π x ) 2 2 ( 8 ( t + 1 / γ ps ) π x ( 1 − π x ) + 10 t π x ) . 29 T o simplify E further , we note that γ ps ( 0.5 t π x ) 2 ≥ γ ps ( 0.5 t π ∗ ) 2 . 
The denominator term (say D ) in the expression for E may be upper bounded as D = 2 ( 8 ( t + 1/ γ ps ) π x ( 1 − π x ) + 10 t π x ) ≤ 2 ( 2 ( t + 1 / γ ps ) + 10 t ) = 14 t + 4 / γ ps , where the penultimate line follows by using the fact that π x ≤ 1, π x ( 1 − π x ) ≤ 1 / 4 for any x ∈ [ m ] . Therefor e, E ≥ γ ps ( 0.5 t π ∗ ) 2 14 t + 4 / γ ps , from which it follows that P P , µ [ N x ( t ) / ∈ [ 0.5 t π x , 1.5 t π x ] ] ≤ q Π µ exp − γ ps ( 0.5 t π ∗ ) 2 14 t + 4 / γ ps ! . (19) An application of the union bound then yields P P , µ h ∥ P − b P ( t ) ∥ 1, ∞ > ϵ i ≤ ∑ x ∈ [ m ] P P , µ h ∥ b P ( x , · ) ( t ) − P ∥ 1 > ϵ , N x ( t ) ∈ [ 0.5 t π x , 1.5 t π x ] i + ∑ x ∈ [ m ] P P , µ [ N x ( t ) / ∈ [ 0.5 t π x , 1.5 t π x ] ] ≤ 2 m 2 exp  − ϵ 2 t π ∗ 16 m  + m q Π µ exp − γ ps ( 0.5 t π ∗ ) 2 14 t + 4 / γ ps ! , thereby establishing the desir ed r esult. B.3 Construction of Mixture Martingale W e use a mixture martingale construction from MDP literatur e ( Jonsson et al. , 2020 ; Al Marjani et al. , 2021 ). W e specialize Lemma 15 of Al Marjani et al. ( 2021 ) (in turn adapted from Proposition 1 of Jonsson et al. ( 2020 )) to our setting. While exactly the same proof follows, we provide the details below for completeness. W e collect some results to be used in the proof. Lemma B.4 (Lemma 3 in Jonsson et al. ( 2020 )) . For q , p ∈ ∆ m and λ ∈ R m − 1 × { 0 } : λ T q − ϕ q ( λ ) = D KL ( q , p ) − D KL  q , p λ  , where p λ = ∇ ϕ p ( λ ) . Lemma B.5 (Theorem 11.1.3 in Cover & Thomas ( 2012 )) . Let N ∈ N , x ∈ { 0, 1, . . . , N } k such that ∑ i ∈ [ k ] x i = N then:  N x  : = N ! ∏ k i = 1 x i ! ≤ exp ( N H ( x / N )) , where H ( x / N ) is the entropy of the distribution over k letters with corresponding probabilities { x i / N } i ∈ [ k ] . Now , we can prove the main result of this section. Note that for ease of reading, we prove Claim 1 and Claim 2 after the main proof. 30 Lemma B.6. For t ≥ 1 and α ∈ ( 0, 1 ) , let β ( t , α ) : = log ( 1 / α ) + ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1  . Then for all α ∈ ( 0, 1 ) and P ∈ P , P P   ∃ t ≥ 1 : ∑ x ∈ [ m ] N x ( t ) D ( b P ( t ) ( x , · ) , P ( x , · )) ≥ β ( t , α )   ≤ α . Proof. Let | S | = m . For the following definitions, λ ∈ R m − 1 × { 0 } , λ = ( λ 1 , λ 2 , . . . , , λ m − 1 , 0 ) and any distribution p = ( p 1 , p 2 , . . . , p m ) over the m states: 1. λ T p = ∑ m − 1 i = 1 λ i p i . 2. ϕ p ( λ ) : = log ( p m + ∑ m − 1 i = 1 p i e λ i ) . Constructing martingale for each state x ∈ [ m ] . Let b p x ( t ) be the row corresponding to the state x in the empirical transition matrix b P ( t ) and ϕ x be the corresponding function for the tr ue distribution for this row , i.e. ϕ x ( λ ) = ϕ P ( x , · ) ( λ ) . M λ t ( x ) = exp ( N x ( t )( λ T b p x ( t ) − ϕ x ( λ )) . (20) Claim 1 . M λ t ( x ) is a martingale adapted to the sigma algebra generated sequence of states sampled from P i.e. F t = σ ( X 0 , X 1 , . . . , X t ) . Now , we define the mixture martingale given by the prior over λ given by λ q = ( ∇ ϕ x ) − 1 ( q ) where q follows the Dirichlet distribution i.e. q ∼ D ir ( 1, 1, . . . , 1 ) . M t ( x ) : = Z M λ q t ( x ) Γ ( m ) ∏ m i = 1 Γ ( 1 ) m ∏ i = 1 q i dq = Z exp ( N x ( t )( λ T q b p x ( t ) − ϕ x ( λ q )) Γ ( m ) ∏ m i = 1 Γ ( 1 ) m ∏ i = 1 q i dq = Z exp [ N x ( t )( KL ( b p x ( t ) , p x ) − KL ( b p x ( t ) , q )] ( m − 1 ) ! m ∏ i = 1 q i dq (Lemma B.4 ) = exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x ))] ( m − 1 ) ! 
Z m ∏ i = 1 q 1 + N x ( t ) b p x , i ( t ) i dq ( ∵ − KL ( b p x ( t ) , q ) = H ( b p x ( t )) + ∑ b p x ( t )( i ) log ( q ( i ) ) ) = exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x ))] ( m − 1 ) ! ∏ m i = 1 Γ ( 1 + N x ( t ) b p x , i ( t )) Γ ( N x ( t ) + m ) = exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x ))] ( m − 1 ) ! ∏ m i = 1 ( N x ( t ) b p x , i ( t )) ! ( N x ( t ) + m − 1 ) ! = exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x ))] ∏ m i = 1 ( N x ( t ) b p x , i ( t )) ! N x ( t ) ! ( m − 1 ) ! N x ( t ) ! ( N x ( t ) + m − 1 ) ! = exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x ))] 1 ( N x ( t )+ m − 1 m − 1 ) 1 ( N x ( t ) ( N x ( t ) b p x , i ( t )) i ∈ [ k ] ) , 31 where b p x , i ( t ) is the i th component of the probability vector . Now , using Lemma B.5 to control the binomial coefficients: M t ( x ) ≥ exp [ N x ( t )( KL ( b p x ( t ) , p x ) + H ( b p t ( x )) − N x ( t ) H ( b p x ( t )) − ( N x ( t ) + m − 1 ) H (( m − 1 ) / ( N x ( t ) + m − 1 ) ] = exp [ N x ( t ) KL ( b p x ( t ) , p x ) − ( N x ( t ) + m − 1 ) H (( m − 1 ) / ( N x ( t ) + m − 1 ) ] . T o upper bound the entropic term: ( N x ( t ) + m − 1 ) H (( m − 1 ) / ( N x ( t ) + m − 1 ) = ( m − 1 ) log  N x ( t ) + m − 1 m − 1  + N x ( t ) log  N x ( t ) + m − 1 N x ( t )  = ( m − 1 ) log  1 + N x ( t ) m − 1  + N x ( t ) log  1 + m − 1 N x ( t )  ≤ ( m − 1 ) log  1 + N x ( t ) m − 1  + m − 1 ( ∵ log ( 1 + x ) ≤ x ) = ( m − 1 ) log  e  1 + N x ( t ) m − 1  . Thus, we have M t ( x ) ≥ exp  N x ( t ) KL ( b p x ( t ) , p x ) − ( m − 1 ) log  e  1 + N x ( t ) m − 1   . Product martingale over all states Finally , we take a product over all the states to get our final martingale. M t : = ∏ x ∈ [ m ] M t ( x ) . (21) Using the lower bound on each state martingale, we have M t ≥ exp   ∑ x ∈ [ m ] N x ( t ) KL ( b p x ( t ) , p x ) − ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1    . (22) Claim 2 . M t is a martingale adapted to the sigma algebra generated by the sequence of states sampled from the Markov Chain P . Applying V ille’s inequality: P [ ∃ t , M t ≥ 1/ α ] ≤ α E [ M 0 ] = α . (23) 32 From Equation ( 22 ) we have P   ∃ t , ∑ x ∈ [ m ] N x ( t ) KL ( b p x ( t ) , p x ) ≥ log ( 1 / α ) + ( m − 1 ) ∑ x ∈ [ m ] log  e  1 + N x ( t ) m − 1    ≤ P [ ∃ t , M t ≥ 1/ α ] ≤ α . Proof of Claim 1 . Suppose X t − 1  = x , then N x ( t ) = N x ( t − 1 ) and N x , y ( t ) = N x , y ( t − 1 ) ∀ y ∈ [ m ] , therefor e b p x ( t ) = b p x ( t − 1 ) . Thus, since none of the data dependent parameters change: M λ t ( x ) = M λ t − 1 ( x ) . If X t − 1 = x , then: E h λ T b p x ( t ) | X 0 , . . . , X t − 1 − x i = E h λ T b p x ( t ) | X 0 , . . . , X t − 1 = x i = E X t ∼ P ( x , · )  λ T  N x ( t − 1 ) b p x ( t − 1 ) + ¯ X t N x ( t − 1 ) + 1  = 1 N x ( t − 1 ) + 1 E X t ∼ P ( x , · ) h λ T ( N x ( t − 1 ) b p x ( t − 1 ) + ¯ X t ) i , where ¯ X t ∈ { 0, 1 } m is the one-hot encoding of the sample X t i.e. ¯ X t ( i ) = I { X t = i } ∀ i ∈ [ m ] . Now , we have: E h M λ t ( x ) | X 0 , X 1 , . . . , X t − 1 = x i = E h exp ( N x ( t )( λ T b p x ( t ) − ϕ x ( λ )) | X 0 , X 1 , . . . , X t − 1 = x i = E X t ∼ P ( x , · )  exp  ( N x ( t − 1 ) + 1 ) ( λ T  N x ( t − 1 ) b p x ( t − 1 ) + ¯ X t N x ( t − 1 ) + 1  − ϕ x ( λ ))  | . . .  = E X t ∼ P ( x , · ) h exp  ( λ T ( N x ( t − 1 ) b p x ( t − 1 ) + ¯ X t ) − ( N x ( t − 1 ) + 1 ) ϕ x ( λ ))  | . . . 
i = M λ t − 1 ( x ) E X t ∼ P ( x , · ) h exp ( λ T ¯ X t − ϕ x ( λ )) | . . . , i = M λ t − 1 ( x ) . Proof of Claim 2 . Fix some state x ∈ [ m ] . Then observe: E [ M t | X 0 , X 1 , .. X t − 1 = x ] = E " M t ( x ) ∏ y  = x M t ( y ) | X 0 , X 1 , .. X t − 1 = x # = E " M t ( x ) ∏ y  = x M t − 1 ( y ) | X 0 , X 1 , .. X t − 1 = x # = ∏ y  = x M t − 1 ( y ) E [ M t ( x ) | X 0 , X 1 , .. X t − 1 = x ] = M t − 1 , 33 where the second to last equality follows from observing the conditional independence of M t ( x ) and { M t − 1 ( y ) : y  = x } . Since this holds for each state in S , we have the martingale property by applying the tower rule. B.4 Continuity Properties Lemma B.7. Let C ( ϵ ) : = inf Q ′ : ∥ Q ′ − Q ∥ 1, ∞ ≤ ϵ inf P ∈ P ∑ x ∈ [ m ] ( π x − ϵ ) D KL ( Q ′ x , P x ) , then: lim inf ϵ ↓ 0 C ( ϵ ) ≥ D inf M ( Q , P ) . Proof. T o show the result, it suffices to show the lower semi-continuity of C , as done in Lemma B.19 , since that implies lim inf ϵ → 0 C ( ϵ ) ≥ C ( 0 ) = D inf M ( Q , P ) . T o show Lemma B.19 , we requir e some tools from optimization theory . For convenience, we collect some basic definitions and results befor e pr oceeding to the formal arguments. B.4.1 Preliminaries W e recall some basic tools from Chapter 6 of Ber ge ( 1997 ). Definition B.8 (Correspondence) . Let X , Y be topological spaces, a set valued function f : X → 2 Y is said to be a correspondence, i.e. for each x ∈ X , f ( x ) ⊆ Y . Definition B.9 (Lower semi-continuity of a correspondence) . Let X , Y be topological spaces. A correspondence f : X → 2 Y is said to be lower semi-continuous at x 0 ∈ X , if for each open set G intersecting f ( x 0 ) , there exists a neighbour hood U ( x 0 ) of x 0 such that: x ∈ U ( x 0 ) = ⇒ f ( x ) T G  = ∅ . Definition B.10 (Upper semi-continuity of a corr espondence) . Let X , Y be topological spaces. A correspondence f : X → 2 Y is said to be upper semi-continuous at x 0 ∈ X , if for each open set G containing f ( x 0 ) , there exists a neighbourhood U ( x 0 ) of x 0 such that: x ∈ U ( x 0 ) = ⇒ f ( x ) ⊆ G . Observe that if correspondence has co-domain is over the reals and is singleton valued, i.e. a real- valued function, these definitions coincide with the more familiar definitions. Definition B.11 (Upper semi-continuity of real valued functions) . Let X be a topological space, a function f : X → R ∪ { − ∞ , ∞ } is said to be upper semi-continuous at x 0 ∈ X if for every y such that f ( x 0 ) < y , ther e exists a neighbour hood U of x 0 such that f ( x ) < y ∀ x ∈ U . Definition B.12 (Lower semi-continuity of real valued functions) . Let X be a topological space, a function f : X → R ∪ { − ∞ , ∞ } is said to be lower semi-continuous at x 0 ∈ X if for every y such that f ( x 0 ) > y , ther e exists a neighbour hood U of x 0 such that f ( x ) > y ∀ x ∈ U . Remark B.13 . A correspondence is called lower (resp. upper) semi-continuous if it is lower (resp. upper) semi-continuous at each point in its domain. Theorem B.14 ( Maximum Theorem, Ber ge ( 1997 , pg. 116) ) . If ϕ : X × Y → R is an upper semi- continuous function and Γ : X → 2 Y is an upper semi-continuous, compact-valued correspondence such that, for each x , Γ ( x )  = ∅ , then M : X → R defined as: M ( x ) : = max { ϕ ( x , y ) : y ∈ Γ ( x ) } , 34 is upper semi-continuous. W e require the minimizing version of the theor em, which we state her e for completeness. Corollary B.15 (Minimum Theorem) . 
If ϕ : X × Y → R is a lower semi-continuous function and Γ : X → 2 Y is an upper semi-continuous, compact-valued correspondence such that, for each x , Γ ( x )  = ∅ , then M : X → R defined as: M ( x ) : = min { ϕ ( x , y ) : y ∈ Γ ( x ) } , is lower semi-continuous. Proof. Observe that if ϕ is lower semi-continuous, then − ϕ is upper semi-continuous. Applying Maximum theorem, we have that M ′ ( x ) = max {− ϕ ( x , y ) : y ∈ Γ ( x ) } = − min ϕ ( x , y ) : y ∈ Γ ( x ) is upper semi-continuous, thus, M ( x ) = min { ϕ ( x , y ) : y ∈ Γ ( x ) } is lower semi-continuous. B.4.2 Results Let M be the set of m × m stochastic matrices. W e show the following lemmas before showing the requisite lower semi-continuity of C ( ϵ ) . Lemma B.16 (Compactness of M ) . Let M : = { A ∈ R m × m : A is stochastic i.e. ∑ j ∈ [ m ] A ( i , j ) = 1 ∀ i ∈ [ m ] , A ( i , j ) ≥ 0 ∀ i , j } . M ⊂ R n × n is compact under the standard topology . Proof. W e will show that M is closed and bounded in the topology generated by ∥· ∥ 1, ∞ . Observe that for each P ∈ M , ∥ P ∥ 1, ∞ = 1, thus the set is bounded. Consider { P n } → P ∗ . Fix any row i and observe: ∑ j ∈ [ m ] P ∗ ( i , j ) = ∑ j ∈ [ m ] lim n → ∞ P n ( i , j ) = lim n → ∞ ∑ j ∈ [ m ] P n ( i , j ) = 1. Further , observe that P n ( i , j ) ≥ 0 for each entry thus, P ∗ ( i , j ) ≥ 0. Thus, since this set is a closed and bounded subset of a finite dimensional normed space, i.e. R m × m , we have compactness by Kreyszig ( 1978 , Theorem 2.5-3). Lemma B.17 (Upper semi-continuity of correspondence) . The corr espondence Γ ( ϵ ) : = { Q ′ ∈ M : ∥ Q ′ − Q ∥ 1, ∞ ≤ ϵ } is upper semi-continuous and compact valued. Proof. Consider Γ : R ++ → 2 M . Let ϵ 0 > 0 then, Γ ( ϵ 0 ) = { Q ′ ∈ M : sup i ∈ [ m ] ∥ Q ′ ( i , · ) − Q ( i , · ) ∥ 1 ≤ ϵ 0 } . Compactness of Γ ( ϵ 0 ) follows by observing it is a closed and bounded subset of a finite dimen- sional normed vector space ( Kreyszig , 1978 , Theor em 2.5-3). T o show upper semi-continuity of Γ , we demonstrate upper semi-continuity at any ϵ 0 ∈ R ++ . Let G be an arbitrary open set in M such that Γ ( ϵ 0 ) ⊆ G . W e must find a neighbor hood ( ϵ 0 − δ , ϵ 0 + δ ) such that for all ϵ in this neighborhood, Γ ( ϵ ) ⊆ G . W e proceed by contradiction. Suppose no such δ exists. Then, for every n ∈ N , we can find an ϵ n such that | ϵ n − ϵ 0 | < 1 / n and yet Γ ( ϵ n )  ⊆ G . This implies that for each n , there exists a matrix Q ′ n ∈ Γ ( ϵ n ) such that Q ′ n / ∈ G . Consider the sequence ( Q ′ n ) n ∈ N . By definition of Γ ( ϵ n ) , we have ∥ Q ′ n − Q ∥ 1, ∞ ≤ ϵ n . Since the space of stochastic matrices M is compact (Lemma B.16 ), the sequence ( Q ′ n ) must have a convergent subsequence ( Q ′ n k ) that converges to some limit point Q ∗ ∈ M . 35 T aking the limit of the inequality along this subsequence: ∥ Q ∗ − Q ∥ 1, ∞ ≤ lim k → ∞ ∥ Q ′ n k − Q ∥ 1, ∞ ≤ lim k → ∞ ϵ n k = ϵ 0 , where we apply the triangle inequality to ∥ Q ∗ − Q ′ n k + Q ′ n k − Q ∥ 1, ∞ and then take the limit as k → ∞ . Therefor e, Q ∗ ∈ Γ ( ϵ 0 ) . Since Γ ( ϵ 0 ) ⊆ G , it must be that Q ∗ ∈ G . However , G is an open set. Since the subsequence converges to Q ∗ (which is inside G ), the tail of the subsequence must eventually enter G . This contradicts the assumption that Q ′ n / ∈ G for all n . Thus, our assumption was false. A δ must exist, proving that Γ is upper semi-continuous. Lemma B.18 (Joint lower semi-continuity) . Let ϕ : R ++ × M × P → R S { + ∞ } be given as ϕ ( ϵ , Q ′ , P ) : = ∑ x ∈ [ m ] ( π x − ϵ ) D KL  Q ′ x , P x  . 
This function is jointly lower semi-continuous in its arguments for ϵ < π ∗ . Proof. Observe that convergence in the ℓ 1, ∞ norm implies convergence in total variation for each row probability vector . Further , recall that convergence in total variation implies weak convergence. Thus, Q n ℓ 1, ∞ → Q = ⇒ Q n ( x , · ) w → Q ( x , · ) ∀ x ∈ [ m ] . First, it is well known that KL divergence ( p , q ) 7 → KL ( p , q ) is lower-semi-continuous in the topology of weak convergence i.e. for p n w → p , q n w → q , we have lim inf n → ∞ KL ( p n , q n ) ≥ KL ( p , q ) (see Theor em 4.9 in Polyanskiy & W u ( 2025 ), originally shown by Posner ( 1975 )). Since we’re in finite dimensions, lower semi-continuity in the topology of weak convergence is equivalent to lower semi-continuity in the topology induced by the norm ℓ 1, ∞ ( Kreyszig , 1978 , Theor em 4.8-4, part c). Second, consider the weight map w x ( ϵ ) : = π x − ϵ is continuous and positive for ϵ ≤ π ∗ : = min x ∈ [ m ] π x (recall that π ∗ is positive by Perr on Frobenius theorem). Since the product of a positive continuous function and a non-negative lower semi-continuous function is lower semi-continuous, each term ( π x − ϵ ) D K L ( Q ′ x ∥ P x ) is jointly lower semi-continuous. Finally , since the sum of lower semi-continuous functions is lower semi-continuous, we have the result. Lemma B.19. C ( ϵ ) : = inf Q ′ : ∥ Q ′ − Q ∥ 1, ∞ ≤ ϵ inf P ∈ P ∑ x ∈ [ m ] ( π x − ϵ ) D KL ( Q ′ x , P x ) is lower semi-continuous. Proof. W e proceed by applying the Minimum theor em twice. First, consider the lower semi-continuous map ϕ ( ϵ , Q ′ , P ) : = ∑ x ∈ [ m ] ( π x − ϵ ) D KL ( Q ′ x , P x ) with the constant corr espondence given by Γ 1 ( P ) 7 → P ∀ P ∈ P . Clearly , this is compact valued (since P is com- pact) and continuous. Therefor e, from Cor ollary B.15 , we have that ϕ ′ ( ϵ , Q ′ ) : = min P ∈ P ϕ ( ϵ , Q ′ , P ) is lower semi-continuous. Secondly , consider this lower semi-continuous map and the compact valued, upper semi-continuous correspondence ϵ 7 → { Q ′ ∈ M : ∥ Q ′ − Q ∥ 1, ∞ ≤ ϵ } (see Lemma B.17 ). Applying Corollary B.15 , we have that C ( ϵ ) is lower semi-continuous. 36 B.5 Proof of Optimality Proof of Theor em 4.1 . T o show α -correctness, we use the mixture martingale intr oduced in Section B.3 . P P [ τ α < ∞ ] = P P   ∃ t , inf P ′ ∈P ∑ x ∈ [ m ] N x ( t ) D ( b P ( t ) ( x , · ) , P ′ ( x , · )) ≥ β ( t , α )   ≤ P P   ∃ t , ∑ x ∈ [ m ] N x ( t ) D ( b P ( t ) ( x , · ) , P ( x , · )) ≥ β ( t , α )   ≤ α . (using Lemma B.6 ) From Lemma B.3 and Lemma B.2 , we have that for any t ≥ 1: P h E ∁ ϵ ( t ) i ≤ q 2 Π µ · exp − ϵ 2 t 2 γ ps 4 ( t + 1/ γ ps ) + 40 ϵ t ! + 2 m 2 exp  − ϵ 2 t π ∗ 16 m  + m q Π µ exp − γ ps ( 0.5 t π ∗ ) 2 14 t + 4 / γ ps ! , (24) which is summable across time, i.e. ∑ ∞ t = 1 P h E ∁ ϵ ( t ) i < ∞ . From Lemma B.3 , we have that ∥ b Q t − Q ∥ 1, ∞ → 0. Pick ϵ > 0, ϵ < π ∗ , where π ∗ : = min j ∈ [ m ] π j . Define A t and E ϵ as in the previous sections. E Q [ τ α ] = ∞ ∑ t = 1 P [ τ α ≥ t ] ≤ ∞ ∑ t = 1 P [ L t < β ( t , α ) ] = ∞ ∑ t = 1 P [ A t ] = ∞ ∑ t = 1 P [ A t ∩ E ϵ ( t ) ] + ∞ ∑ t = 1 P h E ∁ ϵ ( t ) i ≤ ∞ ∑ t = 1 P  t ≤ 2 C C ( ϵ ) log  log ( 1 / α ) + D C ( ϵ )  + log ( 1 / α ) + D C ( ϵ )  + ∞ ∑ t = 1 P h E ∁ ϵ i ≤ 2 C C ( ϵ ) log  log ( 1 / α ) + D C ( ϵ )  + log ( 1 / α ) + D C ( ϵ ) + ∞ ∑ t = 1 P h E ∁ ϵ ( t ) i . Dividing by log ( 1 / α ) and taking the limit as α → 0, we have lim α → 0 E Q [ τ α ] log ( 1 / α ) = 1 C ( ϵ ) . 
37 Letting ϵ → 0 and using the lower semi-continuity of C ( ϵ ) , shown in Lemma B.7 , we have the result. C Extension to T wo-Sided Sequential T esting In this section, we provide a pr oof of Theor em 4.4 , stated in Section 4.2 . Proof of Theor em 4.4 . Lower Bound: W e remark that the lower bound follows very similarly to the one-sided case, with the only change coming in the event chosen for the data processing inequality . Suppose the alternate hypothesis is true and the data is generated by some Q ∈ Q and we fix some P ∈ P such that Q ≪ P . Following the proof of Theor em 3.3 , we have E Q  τ α , β  D M ( Q , P ) + C Q τ ≥ d ( Q ( E ) , P ( E )) ∀ E ∈ F τ , (25) where C Q τ = E Q [ ω ( X 0 ) − ω ( X τ α , β )] . Choosing E = { D ( τ α , β ) = 0 } . Then, by the ( α , β ) correctness, we must have: P ( E ) ≥ 1 − α , Q ( E ) ≤ β (observe that this implies P ( E ) ≥ 0.5 ≥ Q ( E ) ). From the monotonicity properties of Bernoulli KL divergence, we have d ( Q ( E ) , P ( E ) ) ≥ d ( Q ( E ) , 1 − α ) ≥ d ( β , 1 − α ) . Combining the equations we have the result. Following the proof of Theorem 3.3 , we bound the excess Poisson terms using Proposition 3.1 to allow us to safely maximize over P . A similar proof follows for the case wher e the null is true. Construction of two-sided test: The basic idea for the construction is to r un two one-sided tests in parallel: one testing P vs Q and the other testing Q vs P . W e can choose the parameters such that the requir ed err or levels are met. Let τ P α be an α -correct, power-one test for testing P against Q . By definition, we have P P h τ P α < ∞ i ≤ α , P Q h τ P α < ∞ i = 1, ∀ P ∈ P , Q ∈ Q . (26) Now , let τ Q β be a level β , power-one test for testing Q against P . By definition, we have P Q h τ Q β < ∞ i ≤ β , P P h τ Q β < ∞ i = 1 ∀ P ∈ P , Q ∈ Q . (27) Define the parallel test: τ α , β = min { τ P α , τ Q β } with the decision rule, D ( τ α , β ) = 1 if τ P α ≤ τ Q β , and D ( τ α , β ) = 0 otherwise (since τ P α < ∞ corresponds to choosing H 1 and vice-versa). W e show that this test satisfies the correctness pr operties. Let P ∈ P , then: P P  D ( τ α , β ) = 1  = P P h τ P α ≤ τ Q β i ≤ P P h τ P α < ∞ i ≤ α . Similarly , for any Q ∈ Q , P Q  D ( τ α , β ) = 0  ≤ β . 38 W e remark that this construction yields asymptotically optimal tests in the i.i.d. setting as well, as shown by Lorden ( 1976 , Theor em 1). W e show that the same holds for Markovian data. Asymptotic optimality: Now , we prove the asymptotic optimality of τ α , β described above as α , β → 0. Suppose the data is generated by some Q ∈ P , the lower bound states that: E Q  τ β , α  ≥ d ( β , 1 − α ) D inf M ( Q , P ) − 2 C Q π ∗ Q ! + . Dividing by log ( 1 / α ) on either side and taking the limit α , β → 0, we obtain: lim inf α , β → 0 E Q  τ α , β  log ( 1 / α ) ≥ lim α , β → 0 d ( β , 1 − α ) log ( 1 / α ) D inf M ( Q , P ) = 1 D inf M ( Q , P ) , (28) where the equality follows fr om Lemma F .2 . Now , we consider the test defined above, i.e. τ α , β = min { τ P α , τ Q β } , where τ P α is the α − correct, power - one test for testing compact P against Q given by Algorithm 1 . Similarly , τ Q β is the α − correct, power-one test for testing compact Q against P given by Algorithm 1 . E Q  τ α , β  = E Q h min { τ P α , τ Q β } i ≤ E Q h τ P α i . W e divide by log ( 1 / α ) take the limit as α , β → 0. Optimality follows by comparing this with ( 28 ) . A similar proof holds when the data is generated by some er godic P ∈ P . 
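To make the parallel construction above concrete, the following is a minimal Python sketch of the two-sided rule τ_{α,β} = min{τ^P_α, τ^Q_β} and its decision variable. The callables test_P_vs_Q and test_Q_vs_P are placeholders for any two one-sided, power-one stopping rules (for instance, two instances of Algorithm 1); they are assumptions of this illustration rather than part of our implementation.

```python
def two_sided_test(stream, test_P_vs_Q, test_Q_vs_P):
    """Run two one-sided power-one tests in parallel on the same data stream.

    Each test_* callable takes the observed history and returns True once its
    stopping rule fires.  Decision 1 corresponds to rejecting the null set P
    (tau_P fired first), decision 0 to rejecting Q, mirroring D(tau_{alpha,beta}).
    """
    history = []
    for t, x in enumerate(stream, start=1):
        history.append(x)
        if test_P_vs_Q(history):   # tau_P_alpha <= tau_Q_beta: declare the alternative
            return t, 1
        if test_Q_vs_P(history):   # tau_Q_beta fired first: declare the null
            return t, 0
    return len(history), None      # finite simulated stream exhausted without a decision
```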
D Computationally T ractable Lower Bound Proof of Pr oposition 4.3 . Fix some g : [ m ] → R such that it is not constant across states. Let ergodic P , Q ∈ M such that they have the stationary distributions π P , π Q . Further , suppose ω P , g is a solution to the following PE: ( I − P ) ω P = g − E π P [ g ] . (29) Recall that such a solution always exists for finite state, ergodic chains and since g is not constant, ω  = 0 . Now , observe: E π Q [ g ] − E π P [ g ] = E π Q [ g − E π P [ g ]] = E π Q  ( I − P ) ω P , g  (from ( 29 )) = E π Q  ω P , g  − E π Q  P ω P , g  = E π Q  Q ω P , g  − E π Q  P ω P , g  ( ∵ π Q ω P , g = π Q Q ω P , g ) = E π Q  ( Q − P ) ω P , g  = ∑ i ∈ [ m ] π Q ( i )  E Q ( x , · )  ω P , g  − E P ( x , · )  ω P , g   . From the variational form of T otal V ariation distance ( Polyanskiy & W u , 2025 , Theorem 7.7, (a)), for 39 any Q x , P x ∈ ∆ m : T V ( Q x , P x ) = 1 2 sup f : ∥ f ∥ ∞ ≤ 1 | E Q x [ f ] − E P x [ f ] | . As a consequence, for any f : [ m ] → R , we have: T V ( Q x , P x ) ≥ 1 2 ∥ f ∥ ∞ | E Q x [ f ] − E P x [ f ] | . Combining this with Pinkser ’s inequality , we have D KL ( Q x , P x ) ≥ 1 2 ∥ f ∥ 2 ∞ ( E Q x [ f ] − E P x [ f ] ) 2 . From our pr evious analysis, we have | E π Q [ g ] − E π P [ g ] | ≤ ∑ i ∈ [ m ] π Q ( i )    E Q ( x , · )  ω P , g  − E P ( x , · )  ω P , g     ≤ ∑ i ∈ [ m ] π Q ( i ) q 2 ∥ ω P , g ∥ 2 ∞ D KL ( Q ( i , · ) , P ( i , · ) ) = √ 2 ∥ ω P , g ∥ ∞ ∑ i ∈ [ m ] π Q ( i ) q D KL ( Q ( i , · ) , P ( i , · ) ) ≤ √ 2 ∥ ω P , g ∥ ∞ s ∑ i ∈ [ m ] π Q ( i ) D KL ( Q ( i , · ) , P ( i , · ) ) . (Jensen’s inequality) Thus, we have D M ( Q , P ) ≥ 1 2 ∥ ω P , g ∥ 2 ∞ ( E π Q [ g ] − E π P [ g ] ) 2 . (30) Proposition D.1. For er godic transition matrices Q , P ∈ M with stationary distributions π Q , π P respectively , the set of maximizers of the optimization problem: sup g : [ m ] → R  E π Q [ g ] − E π P [ g ]  2 2 ∥ ω P , g ∥ 2 ∞ , is the set { c g ⋆ + b 1 : c  = 0, b ∈ R } , where g ∗ = ( I − P ) ω ∗ ( η ∗ ) , η ∗ is the median of the random variable that takes value ( I − P ) π Q ( i ) / π P ( i ) with probability π P ( i ) and ω ∗ ( η ∗ ) is defined as: ω ∗ i ( η ) =        + 1, ( I − P ) π Q ( i ) > η ∗ π P ( i ) , − 1, ( I − P ) π Q ( i ) < η ∗ π P ( i ) , t i ∈ [ − 1, 1 ] , ( I − P ) π Q ( i ) = η ∗ π P ( i ) . Proof. Fix an ergodic transition matrix P on [ m ] with stationary distribution π P . Recall the Poisson solution ω P , g ∈ R m as in ( 2 ) for ( P , g ) . Given er godic Q ∈ M with stationary distribution π Q , consider 40 the functional ψ ( g ) : = 1 2 ∥ ω P , g ∥ 2 ∞  ⟨ π Q , g ⟩ − ⟨ π P , g ⟩  2 = 1 2 ∥ ω P , g ∥ 2 ∞ ⟨ π Q − π P , g ⟩ 2 . (31) The map g 7→ ψ ( g ) is invariant to adding constants ( g 7→ g + c 1 ) and to scaling ( g 7→ c g ). Hence, without loss of generality , restrict to ⟨ π P , g ⟩ = 0 and optimize over directions. On the subspace { g : ⟨ π P , g ⟩ = 0 } , ( 1 ) reduces to ( I − P ) ω = g with ⟨ π P , ω ⟩ = 0, so we can reparameterize g = ( I − P ) ω . For a function of this form, we observe that the solution as in ( 2 ) is exactly equal to ω , i.e. ω P , g = ω . Therefor e, we use the two interchangeably thr oughout this section. Using ⟨ π P , ( I − P ) ω ⟩ = 0, we obtain ⟨ π Q − π P , g ⟩ = ⟨ π Q , g ⟩ = ⟨ π Q , ( I − P ) ω ⟩ = ω ⊤ ( I − P ⊤ ) π Q = ⟨ a , ω ⟩ , (32) where a : = ( I − P ⊤ ) π Q . 
Following this, we can rewrite the inner supr emum in ( 6 ) as: sup g ∈ R m : ⟨ π P , g ⟩ = 0 ψ ( g ) = sup g ∈ R m : ⟨ π P , g ⟩ = 0 1 2 ∥ ω P , g ∥ 2 ∞ ⟨ π Q , g ⟩ 2 = sup g ∈ R m : ⟨ π P , g ⟩ = 0 1 2 ∥ ω P , g ∥ 2 ∞ ⟨ π Q , ( I − P ) ω P , g ⟩ 2 (33) W e can now change the optimization variable from g to ω as: sup g ∈ R m : ⟨ π P , g ⟩ = 0 1 2 ∥ ω P , g ∥ 2 ∞ ⟨ π Q , ( I − P ) ω P , g ⟩ 2 = sup ω  = 0 ⟨ π P , ω ⟩ = 0 ⟨ a , ω ⟩ 2 2 ∥ ω ∥ 2 ∞ = 1 2  sup ∥ ω ∥ ∞ = 1 ⟨ π P , ω ⟩ = 0 ⟨ a , ω ⟩  2 . (34) T o justify the change of variable, observe that for every feasible g ∈ R m (i.e., ⟨ π P , g ⟩ = 0), ther e exists a corresponding non-zero ω P , g satisfying ⟨ π P , ω P , g ⟩ = 0. This implies that the value of the first optimization problem is at most that of the second. Conversely , for any non-zero ω such that ⟨ π P , ω ⟩ = 0, the vector defined by g = ( I − P ) ω satisfies the constraint ⟨ π P , g ⟩ = 0. This ensures that the value of the second optimization problem is at most that of the first, pr oving the equality . The inner supremum is a linear pr ogram: max ω ∈ [ − 1,1 ] m ⟨ a , ω ⟩ s.t. ⟨ π P , ω ⟩ = 0. (35) Constructing the Lagrangian with parameter η ∈ R , we have L ( η ; ω ) = ⟨ a , ω ⟩ − η ⟨ π P , ω ⟩ = ∑ i ∈ [ m ] ω i  a i − η π P ( i )  . While a natural approach to solving the linear program in ( 35 ) would be to look at the dual pr oblem to ( 35 ) given by min η ∈ R max ω ∈ [ − 1,1 ] m L ( η ; ω ) , (36) we notice that there exists a duality gap—the value of ( 36 ) is strictly larger than that of ( 35 ) . This motivates us to constrain the inner maximization in the dual problem to reduce its value. T o this end, 41 observe that ( 35 ) can be rewritten in the form, min η ∈ R max ω ∈ [ − 1,1 ] m : ⟨ ω , π P ⟩ = 0 L ( η ; ω ) = min η ∈ R max ω ∈ [ − 1,1 ] m : ⟨ ω , π P ⟩ = 0 ∑ i ∈ [ m ] ω i  a i − η π P ( i )  = min η ∈ R ∑ i ∈ [ m ] ω ⋆ i ( η )  a i − η π P ( i )  = min η ∈ R ⟨ ω ⋆ ( η ) , a ⟩ − η ⟨ π P , ω ⋆ ( η ) ⟩ = min η ∈ R ∑ i ∈ [ m ]    a i − η π P ( i )    = min η ∈ R ∑ i ∈ [ m ] π P ( i ) | r i − η | , (37) where r i : = a i / π P ( i ) for each i ∈ [ m ] (recall, π P ( i ) > 0 for all i since P is ergodic), and the second equality follows from observing that for a fixed η , the optimizer ω ∗ = ω ∗ ( η ) is defined via ω ∗ i ( η ) =        + 1, r i > η , − 1, r i < η , t i ∈ [ − 1, 1 ] , r i = η , with m ∑ i = 1 π P ( i ) ω ∗ i ( η ) = 0, (38) for t i ∈ [ − 1, 1 ] chosen to satisfy the constraint ⟨ ω ∗ ( η ) , π P ⟩ = 0. W e then note that min η ∈ R ∑ i ∈ [ m ] π P ( i ) | r i − η | = min η ∈ R E [ | X − η | ] , where X takes value r i with probability π P ( i ) for each i ∈ [ m ] . The optimal solution η ∗ to the above optimization problem (and ther eby to ( 37 ) ) is the median of X , where by convention we pick η ⋆ as the infimum element whenever the median is defined by a set. Mapping back to ( 33 ), we get g ∗ = ( I − P ) ω ∗ ( η ∗ ) , (39) where ω ∗ ( η ∗ ) is given by ( 38 ) with η = η ∗ . Moreover , the full set of maximizers of ( 33 ) is then { c g ⋆ + b 1 : c  = 0, b ∈ R } , reflecting the scale and shift invariances. E Proofs for Section 5 Lemma E.1. The set P π : = { P ∈ M : π P = π } is compact. Proof of Lemma E.1 . Boundedness follows the fact that each element of P π is a stochastic matrix and thus has ℓ 1, ∞ norm of 1. For closedness, consider the linear , continuous function f : R m × m → R m , f ( P ) = π P . The preimage of the closed set { π } under this function, i.e. f − 1 ( π ) is a closed set. 
Closedness follows from observing that the space of stochastic matrices is closed, that the intersection of closed sets is closed, and that P_π = f⁻¹(π) ∩ M. Compactness then follows from the Heine–Borel theorem (Kreyszig, 1978, Theorem 2.5-3).

Lemma E.2. The set P_L is compact under the standard topology.

Proof of Lemma E.2. Observe that the set {µ ∈ R^{|S|×d} : ∥µ∥_{2,∞} ≤ √d} is compact under the topology induced by the ℓ_{2,∞} norm. Since R^{|S|×d} is a finite-dimensional normed space, all norms induce an equivalent topology. As a consequence, {µ ∈ R^{|S|×d} : ∥µ∥_{2,∞} ≤ √d} is compact under the topology induced by the ℓ_{1,∞} norm. P_L is the image of this set under the linear map µ ↦ Φ µ^⊤ Π_M, which is continuous. Since continuous functions map compact sets to compact sets, we have that P_L is compact.

F Technical Lemmas

Lemma F.1. Let a ≥ 1 and b ≥ 2a. Then:

x ≤ a log(x) + b  ⟹  x ≤ b + 2a log(b).

Proof. We proceed by showing the contrapositive. Let f(x) := x − a log x − b. Then, f′(x) = 1 − a/x. Let x₀ := b + 2a log b. Observe that x > x₀ ⟹ x > b > a ⟹ f′(x) > 0. Therefore, f is strictly increasing for all x > x₀. Next, we show that b² > b + 2a log b:

b² − b = b(b − 1) > b log b > 2a log b,

where the first inequality follows since b > 1, and the second uses b ≥ 2a. Now, consider

f(x₀) = b + 2a log b − a log(b + 2a log b) − b = a (2 log b − log(b + 2a log b)) = a log( b² / (b + 2a log b) ) > 0,

where the final inequality uses b² > b + 2a log b. Since the function is strictly increasing, f(x) > f(x₀) > 0 for all x > x₀, i.e., x > a log x + b for all x > b + 2a log b.

Lemma F.2. Let d(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) be the KL divergence between Ber(x) and Ber(y) for x, y ∈ (0, 1). Then:

lim_{α, β → 0} d(β, 1 − α) / log(1/α) = 1.

Proof. We expand the KL divergence as follows:

d(β, 1 − α) / log(1/α) = β log(β/(1 − α)) / log(1/α) + (1 − β) log((1 − β)/α) / log(1/α)
                       = β log(β/(1 − α)) / log(1/α) + (1 − β) log(1 − β) / log(1/α) + (1 − β).

Taking the limit as α, β → 0, we see that the first two terms go to zero, while the last term goes to 1.

G Experiments

In this section, we provide details about our experimental methodology.

G.1 Misspecification in MCMC

Statistic vs time (fixed α): Here, we fix π = [0.1, 0.1, 0.2, 0.2, 0.4] and α = 0.05 and study the expected stopping times under two data-generating instances.

1. Q ∉ P_π. We take the following matrix to generate the data:

   Q_bad =
   [ 0.1  0.5  0.1  0.1  0.2 ]
   [ 0.2  0.1  0.4  0.2  0.1 ]
   [ 0.1  0.1  0.1  0.6  0.1 ]
   [ 0.3  0.2  0.1  0.1  0.3 ]
   [ 0.1  0.1  0.1  0.1  0.6 ]

   We note that this matrix is positive and hence specifies an ergodic Markov chain. Its corresponding stationary distribution is given as π_bad ≈ [0.1574, 0.1825, 0.1548, 0.1956, 0.3097]. We show the growth of the statistic and demonstrate how evidence accumulates against the null with time. We repeat the experiment over 50 runs and present the results in Figure 3.

2. Q ∈ P_π. We take the following matrix to generate the data:

   Q_good =
   [ 0.50   0.20   0.00   0.00   0.30  ]
   [ 0.20   0.50   0.30   0.00   0.00  ]
   [ 0.00   0.15   0.50   0.35   0.00  ]
   [ 0.00   0.00   0.35   0.50   0.15  ]
   [ 0.075  0.00   0.00   0.075  0.85  ]

   Naturally, the stationary distribution of this chain is π_good = π. We show that in this case the statistic does not grow and stays below the threshold. We repeat the experiment over 50 runs and present the results in Figure 4. A small numerical check of the stationary distributions of both instances is sketched below.
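The following is a minimal sanity check of the two generating instances above (a sketch assuming numpy; it is not the code used for the experiments): it computes the stationary distributions of Q_bad and Q_good and confirms that only the latter matches π.

```python
import numpy as np

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Left Perron eigenvector of a stochastic matrix, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))      # eigenvalue closest to 1
    pi = np.real(vecs[:, k])
    return pi / pi.sum()                   # normalization also fixes the sign

pi = np.array([0.1, 0.1, 0.2, 0.2, 0.4])
Q_bad = np.array([[0.1, 0.5, 0.1, 0.1, 0.2],
                  [0.2, 0.1, 0.4, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.6, 0.1],
                  [0.3, 0.2, 0.1, 0.1, 0.3],
                  [0.1, 0.1, 0.1, 0.1, 0.6]])
Q_good = np.array([[0.50, 0.20, 0.00, 0.00, 0.30],
                   [0.20, 0.50, 0.30, 0.00, 0.00],
                   [0.00, 0.15, 0.50, 0.35, 0.00],
                   [0.00, 0.00, 0.35, 0.50, 0.15],
                   [0.075, 0.00, 0.00, 0.075, 0.85]])

print(stationary_distribution(Q_bad))    # approx [0.157, 0.183, 0.155, 0.196, 0.310] != pi
print(stationary_distribution(Q_good))   # recovers pi = [0.1, 0.1, 0.2, 0.2, 0.4]
```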
We remark that in our testing, the test did not stop erroneously.

Figure 3: On the left side, we show the average trajectory (shaded: ±3 standard deviations) taken by the statistic across time. On the right side, we plot the histogram of stopping times across the same runs.

Figure 4: On the left side, we show the average trajectory (shaded: ±3 standard deviations) taken by the statistic across time. On the right side, we plot the histogram of stopping times across the same runs.

G.2 Linearity Testing in MDPs

Here, we provide details of the experimental setup implemented in Section 5.2.

We evaluate our testing framework on a discretized version of the classic MountainCar-v0 environment. The continuous state space, consisting of position p ∈ [−1.2, 0.6] and velocity v ∈ [−0.07, 0.07], is discretized into a uniform grid. We use an 8 × 8 grid, resulting in a finite state space of size S = 64. The environment has a discrete action space of size A = 3 (accelerate left, no acceleration, accelerate right). The data stream (X₀, A₀, X₁, A₁, …) is generated using a uniform random policy, where actions are selected uniformly at random, Aₜ ∼ Unif(A), at each step.

We construct the feature map ϕ(s, a) using Radial Basis Functions (RBFs). We treat the state-action pair index k ∈ {0, …, SA − 1} as a scalar and define d − 1 Gaussian centers c_j evenly spaced across the domain. The feature vector is given by

ϕ(s, a) = Normalize( [ 1, exp(−(k − c₁)²/(2σ²)), …, exp(−(k − c_{d−1})²/(2σ²)) ]ᵀ ),

where a bias term of 1 is included, and the resulting vector is normalized to sum to 1. We solve the convex optimization problem for the statistic using the cvxpy package (Diamond & Boyd, 2016; Agrawal et al., 2018). We test the linearity assumption across varying feature dimensions (ranks) d ∈ {3, 5, 7} with the choice of hyper-parameters listed in Table 1.

Table 1: Hyperparameters for the Linear MDP Experiment on Discretized MountainCar.

Parameter        Value        Description
α                0.01         Target Type-I error probability
S                64           State space size (8 × 8 discretization)
A                3            Action space size
d                {3, 5, 7}    Tested feature dimensions (ranks)
Check Interval   100          Frequency of statistic evaluation (steps)
T (Horizon)      100,000      Maximum duration of the experiment

G.3 Experiments on Synthetic Data from a Parametric Family

We define a one-parameter exponential family of transition matrices {P_θ : θ ∈ R} derived from P₀ and a fixed feature vector f ∈ R^m. For each θ, define the un-normalized matrix

P̃_θ(i, j) = P₀(i, j) exp(θ · f_j).

Let ρ_θ denote the Perron–Frobenius eigenvalue of P̃_θ and let v_θ be the corresponding positive right eigenvector normalized to satisfy ∑_j v_θ(j) = 1. The normalized transition matrix is defined as

P_θ(i, j) = P̃_θ(i, j) v_θ(j) / (ρ_θ v_θ(i)).

For our experiments, we consider a five-state Markov chain. We randomly generate a transition matrix P₀. We then define null and alternative families as two disjoint parameter intervals:

P = {P_θ : θ ∈ [0.4, 0.8]},    Q = {P_θ : θ ∈ [−0.8, −0.4]}.

A single generating distribution Q ∈ Q is selected by fixing θ_Q = −0.6 from the alternative family. We set Q = P_{θ_Q} and generate the corresponding transition matrix as described above. All observed data sequences are generated from Q in the reported experiments. The feature vector is fixed as f = (1, 1, 0, −1, −1). A minimal sketch of this construction is given below.
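The sketch below (assuming numpy; the random P₀ is only an illustrative stand-in for the matrix used in our experiments) implements the exponential tilting followed by the Perron–Frobenius normalization described above, and checks that the resulting P_θ is row-stochastic.

```python
import numpy as np

def tilted_transition_matrix(P0: np.ndarray, theta: float, f: np.ndarray) -> np.ndarray:
    """P_theta(i, j) = P~_theta(i, j) v(j) / (rho * v(i)), with P~_theta(i, j) = P0(i, j) exp(theta f_j)."""
    P_tilde = P0 * np.exp(theta * f)[None, :]
    vals, vecs = np.linalg.eig(P_tilde)
    k = np.argmax(np.real(vals))            # Perron-Frobenius eigenvalue of the positive matrix
    rho = np.real(vals[k])
    v = np.real(vecs[:, k])
    v = v / v.sum()                         # positive right eigenvector, normalized to sum to 1
    return P_tilde * v[None, :] / (rho * v[:, None])

f = np.array([1.0, 1.0, 0.0, -1.0, -1.0])
rng = np.random.default_rng(0)
P0 = rng.random((5, 5))
P0 /= P0.sum(axis=1, keepdims=True)         # a random five-state chain (illustrative only)
Q = tilted_transition_matrix(P0, theta=-0.6, f=f)   # generating instance at theta_Q = -0.6
assert np.allclose(Q.sum(axis=1), 1.0)      # each row of P_theta sums to one
```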
We run each experiment for 100 epochs across varying values of α, the Type-I error level; the resulting expected stopping times are reported in Figure 5.

Figure 5: Expected stopping time as a function of log(1/α) for the experimental setup described in Appendix G.3.

G.4 Comparison with Baselines

Since there are currently no established baselines for sequential testing with composite null and composite alternative hypotheses, this experiment is not intended as a quantitative comparison of algorithmic efficiency or optimality. Instead, it is designed to illustrate the ability of the proposed method to adapt to varying levels of problem difficulty. To obtain a point of reference, we implement the procedure of Fields et al. (2025), which is designed for a simple null hypothesis and a composite alternative. We apply it in our setting by collapsing the null family to a singleton at θ = 0.2, while retaining the composite alternative θ ∈ [−0.8, −0.4]. All other aspects of the experimental setup are kept identical across methods. The results are shown in Figures 6 and 7.

Figure 6: Expected stopping time as a function of θ ∈ [−0.8, −0.4] for the experimental setup described in Section G.4. Panels: (a) α = 0.01, (b) α = 0.05.

Figure 7: Expected stopping time as a function of log(1/α) for a fixed θ = −0.6 in the experimental setup described in Section G.4.
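For the singleton-null reference setting above, the GLR-type stopping statistic used throughout the paper (cf. the proof of Theorem 4.1) reduces to ∑_{x∈[m]} N_x(t) D_KL(Q̂_x(t), P_x), compared against the threshold β(t, α) from Lemma B.6. The following is a minimal sketch of this evaluation, assuming numpy; it is illustrative only and not the exact code used in our experiments.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two probability vectors (clipped for numerical safety)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def singleton_null_statistic(X, P, alpha):
    """Statistic sum_x N_x(t) KL(Qhat_x(t), P_x) and threshold beta(t, alpha) of Lemma B.6.

    X : observed trajectory (X_0, ..., X_t) as an array of states in {0, ..., m-1}
    P : transition matrix of the singleton null
    """
    m = P.shape[0]
    N = np.zeros(m)                  # N_x(t): visits to x among X_0, ..., X_{t-1}
    counts = np.zeros((m, m))        # observed transition counts
    for x, y in zip(X[:-1], X[1:]):
        N[x] += 1
        counts[x, y] += 1
    L = sum(N[x] * kl(counts[x] / N[x], P[x]) for x in range(m) if N[x] > 0)
    beta = np.log(1.0 / alpha) + (m - 1) * np.sum(np.log(np.e * (1.0 + N / (m - 1))))
    return L, beta                   # the test stops and rejects the null once L >= beta
```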
