Sparse tree search optimality guarantees in POMDPs with continuous observation spaces

Michael H. Lim¹, Claire J. Tomlin¹ and Zachary N. Sunberg²
¹University of California, Berkeley   ²University of Colorado, Boulder
michaelhlim@berkeley.edu, tomlin@eecs.berkeley.edu, zachary.sunberg@colorado.edu

Abstract

Partially observable Markov decision processes (POMDPs) with continuous state and observation spaces have powerful flexibility for representing real-world decision and control problems but are notoriously difficult to solve. Recent online sampling-based algorithms that use observation likelihood weighting have shown unprecedented effectiveness in domains with continuous observation spaces. However, there has been no formal theoretical justification for this technique. This work offers such a justification, proving that a simplified algorithm, partially observable weighted sparse sampling (POWSS), will estimate Q-values accurately with high probability and can be made to perform arbitrarily near the optimal solution by increasing computational power.

1 Introduction

The partially observable Markov decision process (POMDP) is a flexible mathematical framework for representing sequential decision problems where knowledge of the state is incomplete [Kaelbling et al., 1998; Kochenderfer, 2015; Bertsekas, 2005]. The POMDP formalism can represent a wide range of real-world problems including autonomous driving [Sunberg et al., 2017; Bai et al., 2015], cancer screening [Ayer et al., 2012], spoken dialog systems [Young et al., 2013], and others [Cassandra, 1998]. In one of the most successful applications, an approximate POMDP solution is being used in a new aircraft collision avoidance system that will be deployed worldwide [Holland et al., 2013].
A POMDP is an optimization problem for which the goal is to find a policy that specifies actions that will control the state to maximize the expectation of a reward function. One of the most popular ways to deal with the challenging computational complexity of finding such a policy [Papadimitriou and Tsitsiklis, 1987] is to use online tree search algorithms [Silver and Veness, 2010; Ye et al., 2017; Sunberg et al., 2017; Kurniawati and Yadav, 2016]. Instead of attempting to find a global policy that specifies actions for every possible outcome of the problem, online algorithms look for local approximate solutions as the agent is interacting with the environment. Previous online approaches such as ABT [Kurniawati and Yadav, 2016], POMCP [Silver and Veness, 2010], and DESPOT [Ye et al., 2017] have exhibited good performance in large discrete domains.

[Published in the Proceedings of The 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020). The sole copyright holder is IJCAI (International Joint Conferences on Artificial Intelligence), all rights reserved. Full link to the paper: https://www.ijcai.org/Proceedings/2020/0572.pdf.]

Figure 1: Trees generated from the partially observable sparse sampling (POSS) algorithm (left) and the partially observable weighted sparse sampling (POWSS) algorithm (right) with depth D = 2 and width C = 2, for a continuous-observation POMDP. Nodes below the fading edges are omitted for clarity. Square nodes correspond to actions, filled circles to state particles with size representing weight, and unfilled circles to beliefs.
However, many real-world domains, notably those where a robot interacts with the physical world, have continuous observation spaces, and the algorithms mentioned above will not always converge to an optimal policy in problems with continuous or naively discretized observation spaces [Sunberg and Kochenderfer, 2018].

Two recent approaches, POMCPOW [Sunberg and Kochenderfer, 2018] and DESPOT-α [Garg et al., 2019], have employed a weighting scheme inspired by particle filtering to achieve good performance on realistic problems with large or continuous observation spaces. However, there are currently no theoretical guarantees that these algorithms will find optimal solutions in the limit of infinite computational resources. A convergence proof for these algorithms must have the following two components: (1) a proof that the particle weighting scheme is sound, and (2) a proof that the heuristics used to focus search on important parts of the tree are sound. This paper tackles the first component by analyzing a new simplified algorithm that expands every node of a sparse search tree.

First, we naively extend the sparse sampling algorithm of Kearns et al. [2002] to the partially observable domain in the spirit of POMCP and explain why this algorithm, known as partially observable sparse sampling (POSS), will converge to a suboptimal solution when the observation space is continuous. Then, we introduce appropriate weighting that results in the partially observable weighted sparse sampling (POWSS) algorithm. We prove that the value function estimated by POWSS converges to the optimal value function at a rate of O(C^D exp(−t · C)), where C is the planning width and number of particles, D is the depth, and t is a constant specific to the problem and desired accuracy. This yields a policy that can be made arbitrarily close to optimal by increasing the computation.
To our knowledge, POWSS is the first algorithm proven to converge to a globally optimal policy for POMDPs with continuous observation spaces without relying on any discretization schemes.² Since POWSS fully expands all nodes in the sparse tree, it is not computationally efficient and is only practically applicable to toy problems. However, the convergence guarantees justify the weighting schemes in state-of-the-art efficient algorithms like DESPOT-α and POMCPOW that solve realistic problems by only constructing the most important parts of the search tree.

The remainder of this paper proceeds as follows: First, Sections 2 and 3 review preliminary definitions and previous work. Then, Section 4 presents an overview of the POSS and POWSS algorithms. Next, Section 5 contains an importance sampling result used in subsequent sections. Section 6 contains the main contribution, a proof that POWSS converges to an optimal policy, using induction from the leaves to the root to prove that the value function estimate will eventually be accurate with high probability at all nodes. Finally, Section 7 empirically shows convergence of POWSS on a modified tiger problem [Kaelbling et al., 1998].

²Edit (06/03/2023): This characterization is not accurate, since there exist other algorithms that have previously shown convergence for continuous-observation POMDPs [Bai et al., 2014], and others that utilize particle weighting for importance sampling [Luo et al., 2019]. However, we believe it is the first theoretical analysis of online POMDP tree search algorithms that use particle filters weighted using the observation density.

2 Preliminaries

POMDP Formulation

A POMDP is defined by a 7-tuple (S, A, O, T, Z, R, γ): S is the state space, A is the action space, O is the observation space, T is the transition density T(s′ | s, a), Z is the observation density Z(o | a, s′), R is the reward function, and γ ∈ [0, 1) is the discount factor [Kochenderfer, 2015; Bertsekas, 2005]. Since a POMDP agent receives only observations, the agent infers the state by maintaining a belief b_t at each step t and updating it with each new action and observation pair (a_{t+1}, o_{t+1}) via Bayesian updates [Kaelbling et al., 1998]. A policy, denoted with π, maps beliefs b_t generated from histories h_t = (b_0, a_1, o_1, · · · , a_t, o_t) to actions a_t. Thus, to maximize the expected cumulative reward in POMDPs, the agent wants to find the optimal policy π*(b_t). We solve the finite-horizon problem of horizon length D.

We formulate the state value function V and action value function Q for a given belief state b and policy π at step t by Bellman updates for t ∈ [0, D − 1], where bao indicates the belief b updated with (a, o):

$$V_t^\pi(b) = \mathbb{E}\left[\sum_{i=t}^{D-1} \gamma^{i-t} R(s_i, \pi(b_i)) \,\middle|\, b\right], \qquad V_D^\pi(b) = 0 \tag{1}$$

$$Q_t^\pi(b, a) = \mathbb{E}\left[R(s, a) + \gamma V_{t+1}^\pi(bao) \mid b\right] \tag{2}$$

Specifically, the optimal value functions satisfy the following:

$$V_t^*(b) = \max_{a \in A} Q_t^*(b, a) \tag{3}$$

$$\pi_t^*(b) = \operatorname*{arg\,max}_{a \in A} Q_t^*(b, a) \tag{4}$$

$$Q_t^*(b, a) = \mathbb{E}\left[R(s, a) + \gamma V_{t+1}^*(bao) \mid b\right] \tag{5}$$

Generative models. For many problems, the probability densities T and Z may be difficult to determine explicitly. Thus, some approaches only require that samples are generated according to the correct probability. In this case, a generative model G implicitly defines T and Z by generating a new state s′, observation o, and reward r, given the current state s and action a.

Probability notation. We denote probability measures with calligraphic letters (e.g. P, Q) to avoid confusion with the action value function Q(b, a).
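To make the generative-model interface concrete, here is a minimal sketch in Python; the dynamics, noise levels, and function name are illustrative assumptions, not the paper's:

```python
import random

def generative_model(s, a, rng):
    """Hypothetical generative model G for a toy 1-D POMDP.

    Given state s and action a, it implicitly defines T and Z by sampling a
    successor state s', an observation o, and a reward r. The dynamics below
    are made up purely for illustration."""
    s_next = s + a + rng.gauss(0.0, 0.1)  # s' ~ T(. | s, a)
    o = s_next + rng.gauss(0.0, 0.5)      # o  ~ Z(. | a, s')
    r = -abs(s_next)                      # R(s, a): drive the state toward 0
    return s_next, o, r

rng = random.Random(0)
s_next, o, r = generative_model(0.0, 1.0, rng)
```

A planner that only calls such a model never needs T or Z in closed form; POWSS additionally requires evaluating the density Z, per assumption (iv) in Section 6.2.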
Furthermore, for two probability measures P, Q defined on a σ-algebra F, we write P ≪ Q to state that P is absolutely continuous with respect to Q; that is, for every measurable set A, Q(A) = 0 implies P(A) = 0. Also, we use the abbreviation "a.s." for almost surely, and "i.i.d.r.v." for independent and identically distributed random variables.

3 Additional Related Work

In addition to the work mentioned in the introduction, there has been much work in similar areas. There are several online tree search techniques for fully observable Markov decision processes with continuous state spaces, most prominently Sparse-UCT [Bjarnason et al., 2009] and double progressive widening [Couëtoux et al., 2011].

There are also several approaches for solving POMDPs or belief-space MDPs with continuous observation spaces. For example, Monte Carlo Value Iteration (MCVI) can use a classifier to deal with continuous observation spaces [Bai et al., 2014]. Others partition the observation space [Hoey and Poupart, 2005] or assume that the most likely observation is always received [Platt et al., 2010]. Other approaches are based on motion planning [Bry and Roy, 2011; Agha-Mohammadi et al., 2011], locally optimizing precomputed trajectories [Van Den Berg et al., 2012], or optimizing open-loop plans [Sunberg et al., 2013]. McAllester and Singh [1999] also extend the sparse sampling algorithm of Kearns et al. [2002], but they use a belief simplification scheme instead of the particle sampling scheme.

Algorithm 1: Routines common to POWSS and POSS

Algorithm: SelectAction(b_0, γ, G, C, D)
Input: belief state b_0, discount γ, generative model G, width C, max depth D.
Output: an action a*.
1: From the initial belief distribution b_0, sample C particles and store them in b̄_0. For POWSS, weights are also initialized with w_i = 1/C.
2: For each action a ∈ A, calculate Q̂*_0(b̄_0, a) = EstimateQ(b̄_0, a, 0).
3: Return a* = arg max_{a ∈ A} Q̂*_0(b̄_0, a).

Algorithm: EstimateV(b̄, d)
Input: belief particles b̄, current depth d.
Output: a scalar V̂*_d(b̄) that is an estimate of V*_d(b).
1: If d ≥ D, the max depth, then return 0.
2: For each action a ∈ A, calculate Q̂*_d(b̄, a) = EstimateQ(b̄, a, d).
3: Return V̂*_d(b̄) = max_{a ∈ A} Q̂*_d(b̄, a).

4 Algorithms

We first define the algorithmic elements shared by POSS and POWSS, SelectAction and EstimateV, in Algorithm 1. SelectAction is the entry point of the algorithm, which selects the best action for a belief b_0 according to the Q-function by recursively calling EstimateQ. EstimateV is a subroutine that returns the value V for an estimated belief by calling EstimateQ for each action and returning the maximum. We use a belief particle set b̄ at every step d, which contains pairs (s_i, w_i) that correspond to the generated samples and their corresponding weights. The weights at the initial step are uniformly normalized to 1/C, as the samples are drawn directly from b_0. In Algorithms 1 to 3, we omit γ, G, C, D in the subsequent recursive calls for convenience since they are fixed globally.

We define the EstimateQ functions in Algorithm 2 for POSS and Algorithm 3 for POWSS; both methods perform sampling and recursive calls to EstimateV to estimate the Q-function at a given step. The crucial difference between these algorithms is shown in Fig. 1. POSS naively samples the next s′_i, o_i, r_i via the generative model for each state s_i in the belief particle set b̄ at a given step d. Then, for each unique observation o_j generated from the sampling step, POSS inserts the states s′_i into the next-step belief particle set bao_j only if the generated observation o_i matches o_j.
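The exact-match grouping is what breaks down for continuous observations: duplicates occur with probability 0, so every next-step belief particle set becomes a singleton. A small numeric sketch of this effect (the model is illustrative, not the paper's code):

```python
import random
from collections import defaultdict

def gen_model(s, a, rng):
    """Hypothetical generative model with a continuous observation."""
    s_next = s
    o = s_next + rng.gauss(0.0, 1.0)  # continuous noise: repeats have probability 0
    r = 1.0
    return s_next, o, r

rng = random.Random(0)
samples = [gen_model(1.0, 0, rng) for _ in range(100)]
groups = defaultdict(list)
for s_next, o, _ in samples:
    groups[o].append(s_next)  # POSS-style grouping: o_i must equal o_j exactly

# Every sampled observation is distinct, so each next-step belief holds one particle.
assert all(len(g) == 1 for g in groups.values())
```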
This behavior is similar to POMCP, DESPOT, or a particle filter that uses rejection, and it can quickly lead to particle depletion when there are many unique observations. Finally, POSS returns the naive average of the Q-functions calculated via recursive calls to EstimateV for each of the next-step beliefs.

Algorithm 2: POSS

Algorithm: EstimateQ(b̄, a, d)
Input: belief particles b̄, action a, current depth d.
Output: a scalar Q̂*_d(b̄, a) that is an estimate of Q*_d(b, a).
1: For each particle s_i in b̄, generate s′_i, o_i, r_i = G(s_i, a). If i > |b̄|, use s_{i mod |b̄|}.
2: For each unique observation o_j from the previous step, insert all s′_i that satisfy o_i = o_j into a new belief particle set bao_j.
3: Return

$$\hat{Q}^*_d(\bar{b}, a) = \frac{1}{C} \sum_{i=1}^{C} \left( r_i + \gamma \cdot \text{EstimateV}(bao_i, d+1) \right)$$

Algorithm 3: POWSS

Algorithm: EstimateQ(b̄, a, d)
Input: belief particles b̄, action a, current depth d.
Output: a scalar Q̂*_d(b̄, a) that is an estimate of Q*_d(b, a).
1: For each particle-weight pair (s_i, w_i) in b̄, generate s′_i, o_i, r_i from G(s_i, a).
2: For each observation o_j from the previous step, iterate over i = {1, · · · , C} to insert (s′_i, w_i · Z(o_j | a, s′_i)) into a new belief particle set bao_j.
3: Return

$$\hat{Q}^*_d(\bar{b}, a) = \frac{\sum_{i=1}^{C} w_i \left( r_i + \gamma \cdot \text{EstimateV}(bao_i, d+1) \right)}{\sum_{i=1}^{C} w_i}$$

On the other hand, POWSS uses particle weighting rather than using only unweighted particles with matching observation histories as in POSS. POWSS samples the next s′_i, o_i, r_i via the generative model for each state-weight pair (s_i, w_i) in the belief particle set b̄. Then, for each observation o_j generated from the sampling step, POWSS inserts all the states s′_i with the new weights w′_i = w_i · Z(o_j | a, s′_i) into the next-step belief particle set bao_j.
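The POWSS recursion of Algorithms 1 and 3 can be sketched as follows. This is a minimal illustration on a made-up two-state POMDP; the class, the function names, and the problem itself are our own assumptions, not the paper's benchmark:

```python
import math
import random

class ToyPOMDP:
    """Illustrative continuous-observation POMDP: the hidden state in {-1, +1}
    never changes; action 1 'guesses positive', action 0 'guesses negative'."""
    actions = (0, 1)
    gamma = 0.9

    def gen(self, s, a, rng):
        # generative model G: returns (s', o, r)
        s_next = s
        o = s_next + rng.gauss(0.0, 1.0)
        r = 1.0 if (a == 1) == (s_next > 0) else -1.0
        return s_next, o, r

    def obs_density(self, o, a, s_next):
        # observation density Z(o | a, s'): standard normal centered at s'
        return math.exp(-0.5 * (o - s_next) ** 2) / math.sqrt(2 * math.pi)

def estimate_q(pomdp, belief, a, d, D, rng):
    """POWSS EstimateQ (Algorithm 3). belief is a list of (state, weight) pairs."""
    gen = [pomdp.gen(s, a, rng) for s, _ in belief]           # (s'_i, o_i, r_i)
    total_w = sum(w for _, w in belief)
    q = sum(w * r for (_, w), (_, _, r) in zip(belief, gen)) / total_w
    if d + 1 >= D:
        return q                                              # EstimateV is 0 at depth D
    for j, (_, o_j, _) in enumerate(gen):
        # next-step belief for o_j: keep every s'_i, reweighted by Z(o_j | a, s'_i)
        b_next = [(sp, w * pomdp.obs_density(o_j, a, sp))
                  for (_, w), (sp, _, _) in zip(belief, gen)]
        q += pomdp.gamma * belief[j][1] * estimate_v(pomdp, b_next, d + 1, D, rng) / total_w
    return q

def estimate_v(pomdp, belief, d, D, rng):
    """POWSS EstimateV (Algorithm 1): maximize the Q estimate over actions."""
    return max(estimate_q(pomdp, belief, a, d, D, rng) for a in pomdp.actions)

def select_action(pomdp, sample_b0, C, D, rng):
    """SelectAction (Algorithm 1): uniform initial weights 1/C."""
    belief = [(sample_b0(rng), 1.0 / C) for _ in range(C)]
    return max(pomdp.actions,
               key=lambda a: estimate_q(pomdp, belief, a, 0, D, rng))
```

For instance, with a belief concentrated on s = +1 and D = 2, estimate_q for action 1 evaluates to 1 + γ = 1.9 here, since every reward along the way is +1.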
The weights w′_i are the adjusted probability of hypothetically obtaining o_j from state s′_i. POWSS then returns the weighted average of the Q-functions.

5 Importance Sampling

We begin the theoretical portion of this work by stating important properties of self-normalized importance sampling estimators (SN estimators). One goal of importance sampling is to estimate the expected value of a function f(x), where x is drawn from a distribution P, while the estimator only has access to another distribution Q along with the importance weights w_{P/Q}(x) ∝ P(x)/Q(x). This is crucial for POWSS because we wish to estimate the reward for beliefs conditioned on observation sequences, while only being able to generate the marginal distribution of states with correct probability for an action sequence. We define the following quantities:

$$\tilde{w}_{\mathcal{P}/\mathcal{Q}}(x) \equiv \frac{w_{\mathcal{P}/\mathcal{Q}}(x)}{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)} \quad \text{(SN Importance Weight)}$$

$$d_\alpha(\mathcal{P} \,\|\, \mathcal{Q}) \equiv \mathbb{E}_{x \sim \mathcal{Q}}\left[ w_{\mathcal{P}/\mathcal{Q}}(x)^\alpha \right] \quad \text{(Rényi Divergence)}$$

$$\tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \equiv \sum_{i=1}^{N} \tilde{w}_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i) \quad \text{(SN Estimator)}$$

Theorem 1 (SN d_∞-Concentration Bound). Let P and Q be two probability measures on the measurable space (X, F) with P ≪ Q and d_∞(P‖Q) < +∞. Let x_1, · · · , x_N be i.i.d.r.v. sampled from Q, and let f : X → R be a bounded Borel function (‖f‖_∞ < +∞). Then, for any λ > 0 and N large enough that λ > ‖f‖_∞ d_∞(P‖Q)/√N, the following bound holds with probability at least 1 − 3 exp(−N · t²(λ, N)):

$$\left| \mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \right| \le \lambda \tag{6}$$

where t(λ, N) is defined as:

$$t(\lambda, N) \equiv \frac{\lambda}{\|f\|_\infty \, d_\infty(\mathcal{P}\|\mathcal{Q})} - \frac{1}{\sqrt{N}} \tag{7}$$

Theorem 1 builds upon the derivation in Proposition D.3 of Metelli et al. [2018], which provides a polynomially decaying bound by assuming that d_2 exists.
Here, we compromise by further assuming that the infinite Rényi divergence d_∞ exists and is bounded, in order to get an exponentially decaying bound:

$$d_\infty(\mathcal{P} \,\|\, \mathcal{Q}) = \operatorname*{ess\,sup}_{x \sim \mathcal{Q}} w_{\mathcal{P}/\mathcal{Q}}(x) < +\infty$$

The proof of Theorem 1 is given in Appendix A.

This exponential decay is important for the proofs in Section 6.2. We need to ensure that all nodes of the POWSS tree at all depths d reach convergence. The branching of the tree induces a factor proportional to N^D. To offset this, we need a probabilistic bound at each depth that decays exponentially with N. An intuitive explanation of the d_∞ assumption is given at the beginning of Section 6.2.

6 Convergence

6.1 POSS Convergence to QMDP

We present a short informal argument for the convergence of the Q-value estimates of POSS to the QMDP value (Definition 1) in continuous observation spaces. Sunberg and Kochenderfer [2018] provide a formal proof for a similar algorithm.

Definition 1 (QMDP value). Let Q_MDP(s, a) denote the optimal Q-function evaluated at state s and action a for the fully observable MDP relaxation of a POMDP. Then, the QMDP value at belief b, Q_MDP(b, a), is E_{s∼b}[Q_MDP(s, a)].

Since the observations o_i are drawn from a continuous distribution, the probability of obtaining duplicate o_i values in EstimateQ, line 1, is 0. Consequently, when evaluating EstimateQ, all the belief particle sets after the root node contain only a single state particle each (Fig. 1, left), which means that each belief node is merely an alias for a unique state particle. Therefore, EstimateV performs a rollout exactly as if the current state became entirely known after taking a single action, identical to the QMDP approximation. Since QMDP is sometimes suboptimal [Kaelbling et al., 1998], POSS is suboptimal for some continuous-observation POMDPs.
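As a concrete illustration of the self-normalized estimator behind Theorem 1, the sketch below estimates E_{x∼P}[x] for P = N(1, 1) using samples from a wider proposal Q = N(0, 2²). This pair satisfies the d_∞(P‖Q) < +∞ assumption, since the weight ratio is bounded. The names and the example distributions are ours, not the paper's:

```python
import math
import random

def sn_estimate(f, sample_q, weight, n, rng):
    """Self-normalized importance sampling: estimate E_{x~P}[f(x)] from draws
    x_i ~ Q, using unnormalized weights w(x) proportional to P(x)/Q(x)."""
    xs = [sample_q(rng) for _ in range(n)]
    ws = [weight(x) for x in xs]
    return sum(w * f(x) for w, x in zip(ws, xs)) / sum(ws)

def weight(x):
    # w(x) proportional to P(x)/Q(x) for P = N(1, 1), Q = N(0, 4);
    # the shared normalizing constant cancels in the SN estimator
    p = math.exp(-0.5 * (x - 1.0) ** 2)
    q = math.exp(-0.5 * (x / 2.0) ** 2) / 2.0
    return p / q

est = sn_estimate(lambda x: x, lambda r: r.gauss(0.0, 2.0), weight, 20000,
                  random.Random(0))
# est approximates the target mean E_P[x] = 1
```

Here the log-weight is 2 − 3x²/8 + x − 1/2 up to constants, which is concave with a finite maximum, so the weights are bounded: exactly the d_∞ < +∞ condition the theorem needs.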
6.2 Near-Optimality of POWSS

On the other hand, we claim that the POWSS algorithm can be made to perform arbitrarily close to the optimal policy by increasing the width C. In analyzing the near-optimality of POWSS, we view POWSS Q-function estimates as SN estimators, and we apply the concentration inequality from Theorem 1 to show that POWSS estimates at every node have small errors with high probability. Through the near-optimality of the Q-functions, we conclude that the value obtained by employing the POWSS policy is also near-optimal, with further assumptions on the closed-loop POMDP system.

Assumptions for Analyzing POWSS

The following assumptions are needed for the proof:

(i) S and O are continuous spaces, and the action space has a finite number of elements, |A| < +∞.
(ii) For any observation sequence {o_n}_j, the densities Z, T, b_0 are chosen such that the Rényi divergence of the target distribution P_d and sampling distribution Q_d (Eqs. (20) and (21)) is bounded above by d_∞^max < +∞ a.s. for all d = 0, · · · , D − 1:

$$d_\infty(\mathcal{P}_d \,\|\, \mathcal{Q}_d) = \operatorname*{ess\,sup}_{x \sim \mathcal{Q}_d} w_{\mathcal{P}_d/\mathcal{Q}_d}(x) \le d_\infty^{\max}$$

(iii) The reward function R is Borel and bounded by a finite constant ‖R‖_∞ ≤ R_max < +∞ a.s., and V_max ≡ R_max/(1 − γ) < +∞.
(iv) We can sample from the generative model G and evaluate the observation probability density Z.
(v) The POMDP terminates after D < +∞ steps.

Intuitively, condition (ii) means that the ratio of the conditional observation probability to the marginal observation probability cannot be too high. Additionally, our results still hold even when either of S, O is discrete, as long as condition (ii) is not violated, by appropriately switching the integrals to Riemann sums. While we restrict our analysis to the case γ < 1 for a finite-horizon problem, the authors believe that similar results can be derived either when γ = 1 or when dealing with infinite-horizon problems.
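For intuition about condition (ii), both divergences are easy to compute in the discrete case, where the essential supremum is simply the maximum importance weight over the support. A small sketch with hypothetical helper names:

```python
def d_alpha(p, q, alpha):
    """Renyi-divergence-style quantity E_{x~Q}[w_{P/Q}(x)^alpha] for discrete
    distributions given as probability dicts; requires P << Q."""
    assert all(q.get(x, 0.0) > 0.0 for x, px in p.items() if px > 0.0), \
        "P is not absolutely continuous w.r.t. Q"
    return sum(qx * (p.get(x, 0.0) / qx) ** alpha for x, qx in q.items())

def d_infty(p, q):
    """d_inf(P||Q) = ess sup of w_{P/Q}: the maximum ratio p(x)/q(x)."""
    assert all(q.get(x, 0.0) > 0.0 for x, px in p.items() if px > 0.0), \
        "P is not absolutely continuous w.r.t. Q"
    return max(px / q[x] for x, px in p.items() if px > 0.0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}
# d_infty(P, Q) == 2.0: the worst-case weight is 0.5/0.25 on "a"
```

In the analysis, condition (ii) demands exactly this kind of bound for the weight in Eq. (22), uniformly over depths and observation sequences.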
Theorem 2 (Accuracy of POWSS Q-Value Estimates). Suppose conditions (i)–(v) are satisfied. Then, for a given ϵ > 0, choose constants C, λ, δ that satisfy:

$$\lambda = \epsilon(1-\gamma)^2/5, \qquad \delta = \lambda/(V_{\max} D (1-\gamma)^2) \tag{8}$$

$$\delta \ge 3|A|\,(3|A|C)^D \exp(-C \cdot t_{\max}^2) \tag{9}$$

$$t_{\max}(\lambda, C) = \frac{\lambda}{3 V_{\max} d_\infty^{\max}} - \frac{1}{\sqrt{C}} > 0 \tag{10}$$

Then the Q-function estimates obtained for all depths d = 0, · · · , D − 1 and all actions a are near-optimal with probability at least 1 − δ:

$$\left| Q_d^*(b_d, a) - \hat{Q}_d^*(\bar{b}_d, a) \right| \le \frac{\lambda}{1-\gamma} \tag{11}$$

Theorem 3 (POWSS Policy Convergence). In addition to conditions (i)–(v), assume that the closed-loop POMDP Bayesian belief update step is exact. Then, for any ϵ > 0, we can choose a C such that the value obtained by POWSS is within ϵ of the optimal value function at b_0 a.s.:

$$V^*(b_0) - V^{POWSS}(b_0) \le \epsilon \tag{12}$$

Theorems 2 and 3 are proven sequentially in the following subsections. We generally follow the proof strategy of Kearns et al. [2002], but with significant additions to account for belief-based POMDP calculations rather than state-based MDP calculations. We use induction to prove a concentration inequality for the value function at all nodes in the tree, starting at the leaves and proceeding up to the root.

Value Convergence at Leaf Nodes

First, we reason about the convergence at nodes at depth D − 1 (leaf nodes). In the subsequent analysis, we abbreviate some terms of interest with the following notation:

$$T^i_{1:d} \equiv \prod_{n=1}^{d} T(s_{n,i} \mid s_{n-1,i}, a_n), \qquad Z^{i,j}_{1:d} \equiv \prod_{n=1}^{d} Z(o_{n,j} \mid a_n, s_{n,i}) \tag{13}$$

Here d denotes the depth, i denotes the index of the state sample, and j denotes the index of the observation sample. Absence of the indices i, j means that {s_n} and/or {o_n} appear as regular variables.
Intuitively, T^i_{1:d} is the transition density of state sequence i from the root node to depth d, and Z^{i,j}_{1:d} is the conditional density of observation sequence j given state sequence i from the root node to depth d. Additionally, b^i_d denotes b_d(s_{d,i}), r_{d,i} the reward R(s_{d,i}, a_d), and w_{d,i} the weight of s_{d,i}.

Since the problem ends after D steps, the Q-function for nodes at depth D − 1 is simply the expectation of the final reward, and the POWSS estimate has the following form:

$$Q^*_{D-1}(b_{D-1}, a) = \int_S R(s_{D-1}, a) \, b_{D-1} \, ds_{D-1} \tag{14}$$

$$\hat{Q}^*_{D-1}(\bar{b}_{D-1}, a) = \frac{\sum_{i=1}^{C} w_{D-1,i} \, r_{D-1,i}}{\sum_{i=1}^{C} w_{D-1,i}} \tag{15}$$

Lemma 1 (SN Estimator Leaf Node Convergence). Q̂*_{D−1}(b̄_{D−1}, a) is an SN estimator of Q*_{D−1}(b_{D−1}, a), and the following leaf-node concentration bound holds with probability at least 1 − 3 exp(−C · t²_max(λ, C)):

$$\left| Q^*_{D-1}(b_{D-1}, a) - \hat{Q}^*_{D-1}(\bar{b}_{D-1}, a) \right| \le \lambda \tag{16}$$

Proof. First, we show that Q̂*_{D−1}(b̄_{D−1}, a) is an SN estimator of Q*_{D−1}(b_{D−1}, a). By following the recursive belief update, the belief term can be fully expanded:

$$b_{D-1}(s_{D-1}) = \frac{\int_{S^{D-1}} (Z_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-2}}{\int_{S^{D}} (Z_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-1}} \tag{17}$$

Then, Q*_{D−1}(b_{D−1}, a) is equal to the following:

$$Q^*_{D-1}(b_{D-1}, a) = \int_S R(s_{D-1}, a) \, b_{D-1} \, ds_{D-1} \tag{18}$$

$$= \frac{\int_{S^{D}} R(s_{D-1}, a)(Z_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-1}}{\int_{S^{D}} (Z_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-1}} \tag{19}$$

We approximate the Q* function with importance sampling by utilizing problem requirement (iv), where the target density is b_{D−1}. First, we sample the sequences {s_n}_i according to the joint probability (T_{1:D−1}) b_0. Afterwards, we weight the sequences by the corresponding observation density Z_{1:D−1}, obtained from the generated observation sequences {o_n}_j.
For now, we assume the observation sequences {o_n}_j are fixed. Applying the importance sampling formalism to our system for all depths d = 0, · · · , D − 1, P_d is the normalized measure incorporating the probability of observation sequence j on top of the state sequence i up to the node at depth d, and Q_d is the measure of the state sequence. We can think of P_d as being indexed by the observation sequence {o_n}_j:

$$\mathcal{P}_d = \mathcal{P}_d^{\{o_n\}_j}(\{s_n\}_i) = \frac{(Z^{i,j}_{1:d})(T^i_{1:d}) \, b^i_0}{\int_{S^{d+1}} (Z^j_{1:d})(T_{1:d}) \, b_0 \, ds_{0:d}} \tag{20}$$

$$\mathcal{Q}_d = \mathcal{Q}_d(\{s_n\}_i) = (T^i_{1:d}) \, b^i_0 \tag{21}$$

$$w_{\mathcal{P}_d/\mathcal{Q}_d}(\{s_n\}_i) = \frac{Z^{i,j}_{1:d}}{\int_{S^{d+1}} (Z^j_{1:d})(T_{1:d}) \, b_0 \, ds_{0:d}} \tag{22}$$

The weighting step is done by updating the self-normalized weights given in the POWSS algorithm. We define w_{d,i} and r_{d,i} as the weights and rewards obtained at step d for state sequence i from the POWSS simulation. With our recursive definition of the empirical weights, we obtain the full weight of each state sequence i for a fixed observation sequence j:

$$w_{D-1,i} = w_{D-2,i} \cdot Z(o_{D-1,j} \mid a_{D-1}, s_{D-1,i}) \tag{23}$$

$$\propto Z^{i,j}_{1:D-1} \tag{24}$$

Realizing that the marginal observation probability is independent of the indexing by i, we show that Q̂*_{D−1}(b̄_{D−1}, a) is an SN estimator of Q*_{D−1}(b_{D−1}, a):

$$\hat{Q}^*_{D-1}(\bar{b}_{D-1}, a) = \frac{\sum_{i=1}^{C} (Z^{i,j}_{1:D-1}) \, R(s_{D-1,i}, a)}{\sum_{i=1}^{C} (Z^{i,j}_{1:D-1})} \tag{25}$$

$$= \frac{\sum_{i=1}^{C} \frac{Z^{i,j}_{1:D-1}}{\int_{S^D} (Z^j_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-1}} \, R(s_{D-1,i}, a)}{\sum_{i=1}^{C} \frac{Z^{i,j}_{1:D-1}}{\int_{S^D} (Z^j_{1:D-1})(T_{1:D-1}) \, b_0 \, ds_{0:D-1}}} \tag{26}$$

$$= \frac{\sum_{i=1}^{C} w_{\mathcal{P}_{D-1}/\mathcal{Q}_{D-1}}(\{s_n\}_i) \, R(s_{D-1,i}, a)}{\sum_{i=1}^{C} w_{\mathcal{P}_{D-1}/\mathcal{Q}_{D-1}}(\{s_n\}_i)} \tag{27}$$

$$= \sum_{i=1}^{C} \tilde{w}_{\mathcal{P}_{D-1}/\mathcal{Q}_{D-1}}(\{s_n\}_i) \, R(s_{D-1,i}, a) \tag{28}$$

Since {s_n}_1, · · · , {s_n}_C are i.i.d.r.v.
sequences of depth D sampled from Q_{D−1}, and R is a bounded function by problem requirement (iii), we can apply the SN concentration bound in Theorem 1 to prove Lemma 1. Detailed finishing steps of the proof are given in Appendix B.

Induction from Leaf to Root Nodes

Now, we want to show that nodes at all depths have convergence guarantees via induction.

Lemma 2 (SN Estimator Step-by-Step Convergence). Q̂*_d(b̄_d, a) is an SN estimator of Q*_d(b_d, a), and for all d = 0, · · · , D − 1 and all a, the following holds with probability at least 1 − 3|A|(3|A|C)^D exp(−C · t²_max):

$$\left| Q_d^*(b_d, a) - \hat{Q}_d^*(\bar{b}_d, a) \right| \le \alpha_d \tag{29}$$

$$\alpha_d \equiv \lambda + \gamma \alpha_{d+1}; \qquad \alpha_{D-1} = \lambda \tag{30}$$

Proof. First, we set C such that C > (3 V_max d_∞^max / λ)² to satisfy t_max(λ, C) > 0, which ensures that the SN concentration inequality holds with probability 1 − 3 exp(−C · t²_max(λ, C)) at any given step d and action a. Furthermore, we multiply by the worst-case union bound factor (3|A|C)^D, since we want the function estimates to be within their respective concentration bounds for all the actions |A| and child nodes C at each step d = 0, · · · , D − 1, for the 3 times we use the SN concentration bound in the induction step. We once again multiply the final δ by |A| to account for the root node Q-value estimates also satisfying their respective concentration bounds for all actions.

Following our definition of EstimateQ, the value function estimates at step d are given as the following:

$$\hat{V}_d^*(\bar{b}_d) = \max_{a \in A} \hat{Q}_d^*(\bar{b}_d, a) \tag{31}$$

$$\hat{Q}_d^*(\bar{b}_d, a) = \frac{\sum_{i=1}^{C} w_{d,i} \left( r_{d,i} + \gamma \hat{V}_{d+1}^*(\bar{b}_d a o_i) \right)}{\sum_{i=1}^{C} w_{d,i}} \tag{32}$$

The base case d = D − 1 holds by Lemma 1. Then, for the inductive step, we assume Eq. (29) holds for all actions at step d + 1.
Using the triangle inequality for step d, we split the difference into two terms, the reward estimation error (A) and the next-step value estimation error (B):

$$\left| Q_d^*(b_d, a) - \hat{Q}_d^*(\bar{b}_d, a) \right| \le \underbrace{\left| \mathbb{E}[R(s_d, a) \mid b_d] - \frac{\sum_{i=1}^{C} w_{d,i} \, r_{d,i}}{\sum_{i=1}^{C} w_{d,i}} \right|}_{(A)} + \gamma \underbrace{\left| \mathbb{E}[V_{d+1}^*(bao) \mid b_d] - \frac{\sum_{i=1}^{C} w_{d,i} \, \hat{V}_{d+1}^*(\bar{b}_d a o_i)}{\sum_{i=1}^{C} w_{d,i}} \right|}_{(B)} \tag{33}$$

The error terms are bounded by (A) ≤ (R_max / 3V_max) λ and (B) ≤ (1/3) λ + (2/(3γ)) λ + α_{d+1}. We provide a detailed justification of these bounds in Appendix C, which uses the SN concentration bound 3 times. Combining (A) and (B), we prove the inductive hypothesis:

$$\left| Q_d^*(b_d, a) - \hat{Q}_d^*(\bar{b}_d, a) \right| \le \frac{R_{\max}}{3 V_{\max}} \lambda + \gamma \left[ \frac{1}{3}\lambda + \frac{2}{3\gamma}\lambda + \alpha_{d+1} \right] \le \lambda + \gamma \alpha_{d+1} = \alpha_d \tag{34}$$

Therefore, Eq. (29) holds for all d = 0, · · · , D − 1 with probability at least 1 − 3|A|(3|A|C)^D exp(−C · t²_max).

Proof (Theorem 2). First, we choose constants C, λ, δ and densities Z, T, b_0 that satisfy the conditions in Theorem 2. Since α_d ≤ α_0, the following holds for all d = 0, · · · , D − 1 with probability at least 1 − δ through Lemmas 1 and 2:

$$\left| Q_d^*(b_d, a) - \hat{Q}_d^*(\bar{b}_d, a) \right| \le \alpha_0 = \sum_{d=0}^{D-1} \gamma^d \lambda \le \frac{\lambda}{1-\gamma} \tag{35}$$

Note that the convergence rate δ is O(C^D exp(−tC)), where t = (λ/(3 V_max d_∞^max))².

Near-Optimal Policy Performance

We have proven in the previous subsection that the planning step results in a near-optimal Q-value for a given belief. Assuming further that we have a perfect Bayesian belief update in the outer observe-plan-act loop, we prove Theorem 3, which states that the closed-loop POMDP policy generated by POWSS at each planning step results in a near-optimal policy. The proof, given in Appendix D, combines Theorem 2 with results from Kearns et al.
[2002]; Singh and Yee [1994]:

$$V^*(b_0) - V^{POWSS}(b_0) \le \epsilon \quad (36)$$

7 Experiments

The simple numerical experiments in this section confirm the theoretical results of Section 6. Specifically, they show that the value function estimates of POSS converge to the QMDP approximation and the value function estimates of POWSS converge to the optimal value function for a toy problem.

7.1 Continuous Observation Tiger Problem

We consider a simple modification of the classic tiger problem [Kaelbling et al., 1998] that we refer to as the continuous observation tiger (CO-tiger) problem. In the CO-tiger problem, the agent is presented with two doors, left (L) and right (R). One door has a tiger behind it ($S = \{\text{TigerL}, \text{TigerR}\}$). In the classic problem, the agent can either open one of the doors or listen, and the CO-tiger problem has an additional wait action to illustrate the suboptimality of QMDP estimates ($A = \{\text{OpenL}, \text{OpenR}, \text{Wait}, \text{Listen}\}$). If the agent opens a door, the problem terminates immediately; if the tiger is behind that door, a penalty of -10 is received, but if not, a reward of 10 is given. Waiting has a penalty of -1 and listening has a penalty of -2. If the agent waits or listens, a noisy continuous observation between 0 and 1 is received ($O = [0, 1]$). In the wait case, this observation is uniformly distributed, independent of the tiger's position, yielding no information. In the listen case, the observation distribution is piecewise uniform. An observation in $[0, 0.5]$ corresponds to a tiger behind the left door and $(0.5, 1]$ the right door. Listening yields an observation in the correct range 85% of the time. The discount is 0.95, and the terminal depth is 3.

The optimal solution to this problem may be found by simply discretizing the observation space so that any continuous observation in $[0, 0.5]$ is treated as a TigerL observation, and any continuous observation in $(0.5, 1]$ is treated as a TigerR observation. This fully discrete version of the problem may be easily solved by a classical solution method such as the incremental pruning method of Cassandra et al. [1997]. Given an evenly-distributed initial belief, the optimal action is Listen with a value of 4.65, and the Wait action has a value of 3.42. The QMDP estimate for Wait is 8.5 and for Listen is 7.5.

While the CO-tiger problem is too small to be of practical significance, it serves as an empirical demonstration that POWSS converges toward the optimal value estimates and that POSS converges toward the QMDP estimates. In fact, the QMDP estimates generated by POSS are suboptimal in this example and lead to picking the suboptimal Wait action. Both POWSS and POSS were implemented using the POMDPs.jl framework [Egorov et al., 2017], and open-source code can be found at https://github.com/JuliaPOMDP/SparseSampling.jl.

7.2 Results

The results plotted in Fig. 2 show the $Q$-value estimates of POWSS converging toward the optimal $Q$-values as the width $C$ is increased. Each data point represents the mean $Q$-value from 200 runs of the algorithm from a uniformly-distributed belief, with the standard deviation plotted as a ribbon. The estimates for POSS have no uncertainty bounds since the estimates in this problem are the same for all $C$.

With $C = 1$, POWSS suffers from particle depletion and, because of the particular structure of this problem, finds the QMDP $Q$-values. As $C$ increases, one can observe that both bias and variance in the $Q$-value estimates significantly decrease in agreement with our theoretical results, while POSS continues to yield incorrect estimates.

Some estimates by POMCPOW are also included. These are not directly comparable since POMCPOW is parameterized differently.
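To make the CO-tiger specification above concrete, the following Python sketch encodes its reward model and piecewise-uniform observation density. This is our own illustration, not the authors' code (the paper's implementation is in Julia via POMDPs.jl); all function and variable names here are hypothetical.

```python
import random

# Illustrative sketch of the CO-tiger model described in Section 7.1.
# States, actions, and numbers follow the text; names are ours.
STATES = ["TigerL", "TigerR"]
ACTIONS = ["OpenL", "OpenR", "Wait", "Listen"]

def reward(s, a):
    """Opening the tiger door costs 10; the other door pays 10;
    waiting costs 1 and listening costs 2."""
    if a == "OpenL":
        return -10.0 if s == "TigerL" else 10.0
    if a == "OpenR":
        return -10.0 if s == "TigerR" else 10.0
    return -1.0 if a == "Wait" else -2.0

def obs_density(o, a, s):
    """Observation density Z(o | a, s) on O = [0, 1]."""
    if not 0.0 <= o <= 1.0:
        return 0.0
    if a == "Wait":      # uninformative: uniform on [0, 1]
        return 1.0
    if a == "Listen":    # 85% of the mass on the correct half-interval
        correct_half = (o <= 0.5) if s == "TigerL" else (o > 0.5)
        return 2 * 0.85 if correct_half else 2 * 0.15
    return 0.0           # opening a door terminates the problem

def gen_obs(a, s, rng=random):
    """Sample o ~ Z(. | a, s), i.e. the observation part of a generative model."""
    if a == "Wait":
        return rng.random()
    lo, hi = (0.0, 0.5) if s == "TigerL" else (0.5, 1.0)
    if rng.random() > 0.85:  # 15% chance of an observation in the wrong half
        lo, hi = (0.5, 1.0) if s == "TigerL" else (0.0, 0.5)
    return lo + rng.random() * (hi - lo)
```

Because `obs_density` can be evaluated pointwise, the likelihood-weighting step analyzed in this paper applies directly: sampled particles are reweighted by `obs_density(o, a, s)` for a shared observation `o`.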
For these tests, the double progressive widening parameters $k_o = C$, $\alpha_o = 0$ were used to limit the tree width, with $n = C^3$ iterations to keep the particle density high in wider trees (see Sunberg and Kochenderfer [2018] for parameter definitions). POMCPOW's value estimates are strongly biased downwards by exploration actions, but the estimated value for the Listen action is much higher than the estimated value for the Wait action, which is too low to appear on the plot. Thus the correct action will usually still be chosen. At $C = 41$, POMCPOW is about an order of magnitude faster than POWSS.

[Figure 2: Numerical convergence of Q-value estimates for POSS, POWSS, and POMCPOW in the CO-tiger problem. Ribbons indicate standard deviation.]

8 Conclusion

This work has proposed two new POMDP algorithms and analyzed their convergence in POMDPs with continuous observation spaces. Though these algorithms are not computationally efficient and thus not suitable for realistic problems, this work lays the foundation for analysis of more complex algorithms, rigorously justifying the observation likelihood weighting used in POWSS, POMCPOW, and DESPOT-α.

There is a great deal of future work to be done along this path. Most importantly, the theory presented in this work should be extended to more computationally efficient and hence practical algorithms. Before extending to POMCPOW and DESPOT-α, it may be beneficial to apply these techniques to an algorithm that is less conceptually complex, such as a modification of Sparse-UCT [Bjarnason et al., 2009] extended to partially observable domains. Such an algorithm could enjoy strong theoretic guarantees, ease of implementation, and good performance on large problems.
Moreover, the proof techniques in this work may yield insight into which problems are difficult for sparse tree search techniques. For example, the Rényi divergence between the marginal and conditional state distributions (assumption (ii)) may be a difficulty indicator for likelihood-weighted sparse tree solvers, similar to the covering number of the optimal reachable belief space for point-based solvers [Lee et al., 2008].

Acknowledgements

This material is based upon work supported by a DARPA Assured Autonomy Grant, the SRC CONIX program, NSF CPS Frontiers, the ONR Embedded Humans MURI, and the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any aforementioned organizations. The authors also thank Lasse Peters for valuable manuscript feedback.

References

[Agha-Mohammadi et al., 2011] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Feedback controller-based information-state roadmap - a framework for motion planning under uncertainty. In IEEE/RSJ IROS, 2011.
[Ayer et al., 2012] Turgay Ayer, Oguzhan Alagoz, and Natasha K. Stout. A POMDP approach to personalize mammography screening decisions. Operations Research, 60(5):1019-1034, 2012.
[Bai et al., 2014] Haoyu Bai, David Hsu, and Wee Sun Lee. Integrated perception and planning in the continuous space: A POMDP approach. International Journal of Robotics Research, 33(9):1288-1302, 2014.
[Bai et al., 2015] Haoyu Bai, Shaojun Cai, Nan Ye, David Hsu, and Wee Sun Lee. Intention-aware online POMDP planning for autonomous driving in a crowd. In IEEE ICRA, pages 454-460, 2015.
[Bertsekas, 2005] D. Bertsekas. Dynamic Programming and Optimal Control. Athena, 2005.
[Bjarnason et al., 2009] Ronald Bjarnason, Alan Fern, and Prasad Tadepalli.
Lower bounding Klondike solitaire with Monte-Carlo planning. In ICAPS, 2009.
[Bry and Roy, 2011] Adam Bry and Nicholas Roy. Rapidly-exploring random belief trees for motion planning under uncertainty. In IEEE ICRA, pages 723-730, 2011.
[Cassandra et al., 1997] Anthony Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In UAI, pages 54-61, 1997.
[Cassandra, 1998] Anthony R. Cassandra. A survey of POMDP applications. In AAAI Fall Symposium: Planning with POMDPs, 1998.
[Couëtoux et al., 2011] Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization, Rome, Italy, 2011.
[Egorov et al., 2017] Maxim Egorov, Zachary N. Sunberg, Edward Balaban, Tim A. Wheeler, Jayesh K. Gupta, and Mykel J. Kochenderfer. POMDPs.jl: A framework for sequential decision making under uncertainty. Journal of Machine Learning Research, 18(26):1-5, 2017.
[Garg et al., 2019] Neha P. Garg, David Hsu, and Wee Sun Lee. DESPOT-α: Online POMDP planning with large state and observation spaces. In RSS, 2019.
[Hoey and Poupart, 2005] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In IJCAI, pages 1332-1338, 2005.
[Holland et al., 2013] Jessica E. Holland, Mykel J. Kochenderfer, and Wesley A. Olson. Optimizing the next generation collision avoidance system for safe, suitable, and acceptable operational performance. Air Traffic Control Quarterly, 21(3):275-297, 2013.
[Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99-134, 1998.
[Kearns et al., 2002] Michael Kearns, Yishay Mansour, and Andrew Y. Ng.
A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2):193-208, 2002.
[Kochenderfer, 2015] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[Kurniawati and Yadav, 2016] Hanna Kurniawati and Vinay Yadav. An online POMDP solver for uncertainty planning in dynamic environment. In Robotics Research, pages 611-629. Springer, 2016.
[Lee et al., 2008] Wee Sun Lee, Nan Rong, and David Hsu. What makes some POMDP problems easy to approximate? In NeurIPS, pages 689-696, 2008.
[Luo et al., 2019] Yuanfu Luo, Haoyu Bai, David Hsu, and Wee Sun Lee. Importance sampling for online planning under uncertainty. International Journal of Robotics Research, 38(2-3):162-181, 2019.
[McAllester and Singh, 1999] David A. McAllester and Satinder Singh. Approximate planning for factored POMDPs using belief state simplification. In UAI, pages 409-416, 1999.
[Metelli et al., 2018] Alberto Maria Metelli, Matteo Papini, Francesco Faccio, and Marcello Restelli. Policy optimization via importance sampling. In NeurIPS, pages 5442-5454, 2018.
[Papadimitriou and Tsitsiklis, 1987] Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.
[Platt et al., 2010] Robert Platt, Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. In RSS, 2010.
[Silver and Veness, 2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NeurIPS, pages 2164-2172, 2010.
[Singh and Yee, 1994] Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227-233, 1994.
[Sunberg and Kochenderfer, 2018] Zachary Sunberg and Mykel J. Kochenderfer.
Online algorithms for POMDPs with continuous state, action, and observation spaces. In ICAPS, 2018.
[Sunberg et al., 2013] Zachary Sunberg, Suman Chakravorty, and R. Scott Erwin. Information space receding horizon control. IEEE Transactions on Cybernetics, 43(6):2255-2260, 2013.
[Sunberg et al., 2017] Zachary N. Sunberg, Christopher J. Ho, and Mykel J. Kochenderfer. The value of inferring the internal state of traffic participants for autonomous freeway driving. In ACC, 2017.
[Van Den Berg et al., 2012] Jur Van Den Berg, Sachin Patil, and Ron Alterovitz. Motion planning under uncertainty using iterative local optimization in belief space. International Journal of Robotics Research, 31(11):1263-1278, 2012.
[Ye et al., 2017] Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research, 58:231-266, 2017.
[Young et al., 2013] Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160-1179, 2013.

Appendix

Table of Contents
A Proof of Theorem 1
B Proof of Lemma 1 (Continued)
C Proof of Lemma 2 (Continued)
  C.1 Importance Sampling Error
  C.2 Monte Carlo Next-Step Integral Approximation Error
  C.3 Function Estimation Error
D Proof of Theorem 3
  D.1 Belief State Policy Convergence Lemma
  D.2 Proof of Theorem 3

A Proof of Theorem 1

Theorem 1 (SN $d_\infty$-Concentration Bound).
Let $\mathcal{P}$ and $\mathcal{Q}$ be two probability measures on the measurable space $(\mathcal{X}, \mathcal{F})$ with $\mathcal{P} \ll \mathcal{Q}$ and $d_\infty(\mathcal{P}\|\mathcal{Q}) < +\infty$. Let $x_1, \dots, x_N$ be i.i.d. random variables sampled from $\mathcal{Q}$, and $f : \mathcal{X} \to \mathbb{R}$ be a bounded Borel function ($\|f\|_\infty < +\infty$). Then, for any $\lambda > 0$ and $N$ large enough such that $\lambda > \|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})/\sqrt{N}$, the following bound holds with probability at least $1 - 3\exp(-N \cdot t^2(\lambda, N))$:

$$|\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}| \le \lambda \quad (1)$$

where $t(\lambda, N)$ is defined as:

$$t(\lambda, N) \equiv \frac{\lambda}{\|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})} - \frac{1}{\sqrt{N}} \quad (2)$$

Proof. This proof follows similar steps as in Metelli et al. [2018]. Since we have an upper bound on the infinite Rényi divergence $d_\infty(\mathcal{P}\|\mathcal{Q})$, we can start from Hoeffding's inequality for bounded random variables applied to the regular IS estimator $\hat{\mu}_{\mathcal{P}/\mathcal{Q}} = \frac{1}{N}\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)$, which is unbiased. While applying Hoeffding's inequality, we can view importance sampling on $f(x)$ weighted by $w_{\mathcal{P}/\mathcal{Q}}(x)$ as Monte Carlo sampling on $g(x) = w_{\mathcal{P}/\mathcal{Q}}(x) f(x)$, which is a function bounded by $\|g\|_\infty = d_\infty(\mathcal{P}\|\mathcal{Q})\|f\|_\infty$:

$$P\left(\hat{\mu}_{\mathcal{P}/\mathcal{Q}} - \mathbb{E}_{x \sim \mathcal{P}}[f(x)] \ge \lambda\right) = P\left(\hat{\mu}_{\mathcal{P}/\mathcal{Q}} - \mathbb{E}_{x \sim \mathcal{Q}}[w_{\mathcal{P}/\mathcal{Q}}(x) f(x)] \ge \lambda\right) \quad (3)$$
$$\le \exp\left(-\frac{2 N^2 \lambda^2}{\sum_{i=1}^{N} 2\,(d_\infty(\mathcal{P}\|\mathcal{Q})\|f\|_\infty)^2}\right) \quad (4)$$
$$\le \exp\left(-\frac{N \lambda^2}{d^2_\infty(\mathcal{P}\|\mathcal{Q})\|f\|^2_\infty}\right) \equiv \delta \quad (5)$$

$$P\left(|\hat{\mu}_{\mathcal{P}/\mathcal{Q}} - \mathbb{E}_{x \sim \mathcal{P}}[f(x)]| \ge \lambda\right) \le 2\exp\left(-\frac{N \lambda^2}{d^2_\infty(\mathcal{P}\|\mathcal{Q})\|f\|^2_\infty}\right) = 2\delta \quad (6)$$

We prove a similar bound for the SN estimator $\tilde{\mu}_{\mathcal{P}/\mathcal{Q}} = \sum_{i=1}^{N} \tilde{w}_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)$, which is a biased estimator.
However, we need to take a step further and analyze the absolute difference, requiring us to split the difference into two terms:

$$P\left(|\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}| \ge \lambda\right) \quad (7)$$
$$\le P\left(\tilde{\mu}_{\mathcal{P}/\mathcal{Q}} - \mathbb{E}_{x \sim \mathcal{P}}[f(x)] \ge \lambda\right) + P\left(\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \ge \lambda\right) \quad (8)$$
$$\le P\left(\tilde{\mu}_{\mathcal{P}/\mathcal{Q}} - \mathbb{E}_{x \sim \mathcal{Q}}[\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}] \ge \tilde{\lambda}\right) + P\left(\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \ge \lambda\right) \quad (9)$$
$$\le \tilde{\delta} + P\left(\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \ge \lambda\right) \quad (10)$$

The first term is bounded by $\tilde{\delta}$ from the above bound, recasting $\lambda$ to $\tilde{\lambda}$ to account for the bias of the SN estimator:

$$\tilde{\lambda} = \lambda - \left|\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \mathbb{E}_{x \sim \mathcal{Q}}[\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}]\right| \quad (11)$$
$$\tilde{\delta} = \exp\left(-\frac{N \tilde{\lambda}^2}{d^2_\infty(\mathcal{P}\|\mathcal{Q})\|f\|^2_\infty}\right) \quad (12)$$

Note that the bias term of the SN estimator is bounded via the Cauchy-Schwarz inequality, closely following steps from Metelli et al. [2018]:

$$|\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \mathbb{E}_{x \sim \mathcal{Q}}[\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}]| = \left|\mathbb{E}_{x \sim \mathcal{Q}}[\hat{\mu}_{\mathcal{P}/\mathcal{Q}} - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}]\right| \le \mathbb{E}_{x \sim \mathcal{Q}}\left[|\hat{\mu}_{\mathcal{P}/\mathcal{Q}} - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}|\right] \quad (13)$$
$$\le \mathbb{E}_{x \sim \mathcal{Q}}\left[\left|\frac{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)} - \frac{1}{N}\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)\right|\right] \quad (14)$$
$$= \mathbb{E}_{x \sim \mathcal{Q}}\left[\left|\frac{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)}\right|\left|1 - \frac{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)}{N}\right|\right] \quad (15)$$
$$\le \mathbb{E}_{x \sim \mathcal{Q}}\left[\left(\frac{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)}\right)^2\right]^{1/2} \mathbb{E}_{x \sim \mathcal{Q}}\left[\left(1 - \frac{\sum_{i=1}^{N} w_{\mathcal{P}/\mathcal{Q}}(x_i)}{N}\right)^2\right]^{1/2} \quad (16)$$
$$\le \|f\|_\infty \sqrt{\frac{d_2(\mathcal{P}\|\mathcal{Q}) - 1}{N}} \le \frac{\|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})}{\sqrt{N}} \quad (17)$$

In the last step, the first factor is bounded by $\|f\|_\infty$ since the function is bounded, and the second factor is bounded by the fact that we can bound the square root of the variance with the supremum; we square it for the convenience of the definition of $t(\lambda, N)$ later on, so that the $1/\sqrt{N}$ factor is nicely separated. We assume that $N$ is chosen large enough that $\lambda > \|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})/\sqrt{N}$. Using this, we bound the $\tilde{\delta}$ term:

$$\tilde{\delta} \le \exp\left(-\frac{N\left(\lambda - \|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})/\sqrt{N}\right)^2}{d^2_\infty(\mathcal{P}\|\mathcal{Q})\|f\|^2_\infty}\right) \quad (18)$$
$$= \exp\left(-N\left(\frac{\lambda - \|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})/\sqrt{N}}{\|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})}\right)^2\right) \quad (19)$$
$$\equiv \exp\left(-N \cdot t^2(\lambda, N)\right) \quad (20)$$

Here, we define $t(\lambda, N) \equiv \frac{\lambda}{\|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})} - \frac{1}{\sqrt{N}}$, which satisfies $0 < t(\lambda, N) \le \frac{\lambda}{\|f\|_\infty d_\infty(\mathcal{P}\|\mathcal{Q})}$. The second term can be bounded similarly by rebounding the bias term with $\tilde{\lambda}$, using symmetry and Hoeffding's inequality:

$$P\left(\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \ge \lambda\right) \le P\left(\mathbb{E}_{x \sim \mathcal{Q}}[\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}} \ge \tilde{\lambda}\right) \quad (21)$$
$$\le P\left(|\mathbb{E}_{x \sim \mathcal{Q}}[\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}| \ge \tilde{\lambda}\right) \le 2\tilde{\delta} \quad (22)$$

Thus, we obtain the following bound:

$$P\left(|\mathbb{E}_{x \sim \mathcal{P}}[f(x)] - \tilde{\mu}_{\mathcal{P}/\mathcal{Q}}| \ge \lambda\right) \le 3\exp(-N \cdot t^2(\lambda, N)) \quad (23)$$

B Proof of Lemma 1 (Continued)

In the main paper, we show that $\hat{Q}^*_{D-1}(\bar{b}_{D-1}, a)$ is an SN estimator of $Q^*_{D-1}(b_{D-1}, a)$. We apply the concentration inequality proven in Theorem 1 to finish the proof of Lemma 1.

Lemma 1 (SN Estimator Leaf Node Convergence). $\hat{Q}^*_{D-1}(\bar{b}_{D-1}, a)$ is an SN estimator of $Q^*_{D-1}(b_{D-1}, a)$, and the following leaf-node concentration bound holds with probability at least $1 - 3\exp(-C \cdot t^2_{\max}(\lambda, C))$:

$$|Q^*_{D-1}(b_{D-1}, a) - \hat{Q}^*_{D-1}(\bar{b}_{D-1}, a)| \le \lambda \quad (24)$$

Proof. We first bound $R$ by $3V_{\max}$, where we define $V_{\max}$:

$$V_{\max} \equiv \frac{R_{\max}}{1-\gamma} \ge R_{\max} \quad (25)$$

We make this crude upper bound starting at the leaf node so that the probability upper bound at the subsequent steps will be bounded by the same factor. In addition, since $d_\infty(\mathcal{P}_{D-1}\|\mathcal{Q}_{D-1})$ is bounded by $d^{\max}_\infty$ a.s., we can bound the resulting $t_{D-1}(\lambda, C)$ by $t_{\max}(\lambda, C)$ a.s.:

$$t_{D-1}(\lambda, C) = \frac{\lambda}{3V_{\max}\, d_\infty(\mathcal{P}_{D-1}\|\mathcal{Q}_{D-1})} - \frac{1}{\sqrt{C}} \ge \frac{\lambda}{3V_{\max}\, d^{\max}_\infty} - \frac{1}{\sqrt{C}} \equiv t_{\max}(\lambda, C) \quad (26)$$

Note that this algebra holds for all steps $d = 0, \dots, D-1$, which allows us to say $t_d(\lambda, C) \ge t_{\max}(\lambda, C)$. Thus, bounding the concentration inequality probability with $t_{\max}(\lambda, C)$ is justified when we prove Lemma 2 later.
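As a concrete illustration of the two estimators that Theorem 1 relates, the following Python sketch compares the ordinary importance sampling (IS) estimator $\hat{\mu}_{\mathcal{P}/\mathcal{Q}}$ with the self-normalized (SN) estimator $\tilde{\mu}_{\mathcal{P}/\mathcal{Q}}$. The densities are our own toy choice (not from the paper), picked so that the weight function and $d_\infty$ are known in closed form.

```python
import random

# Toy illustration of Theorem 1's estimators (our own example).
# Target P has density p(x) = 2x on [0, 1]; proposal Q is uniform on [0, 1].
# Hence w(x) = p(x)/q(x) = 2x and d_inf(P||Q) = 2 (finite, as required).

def w(x):
    return 2.0 * x          # importance weight w_{P/Q}(x)

def f(x):
    return x                # bounded Borel function; E_{x~P}[f(x)] = 2/3

def is_estimate(xs):
    """Ordinary (unbiased) IS estimator: mean of w(x_i) f(x_i)."""
    return sum(w(x) * f(x) for x in xs) / len(xs)

def sn_estimate(xs):
    """Self-normalized IS estimator: weights normalized to sum to 1.
    Biased, but usable when densities are known only up to a constant."""
    total_w = sum(w(x) for x in xs)
    return sum(w(x) * f(x) for x in xs) / total_w

rng = random.Random(0)
xs = [rng.random() for _ in range(100_000)]   # i.i.d. samples from Q
mu_is = is_estimate(xs)
mu_sn = sn_estimate(xs)
# Both concentrate near E_{x~P}[f(x)] = 2/3 as N grows; Theorem 1
# quantifies the rate for the SN version, including its O(1/sqrt(N)) bias.
```

The SN estimator's bias term, bounded in Eqs. (13)-(17), corresponds to the gap between `total_w / N` and its expectation 1.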
This probabilistic bound holds for any choice of $\{o_n\}_j$, where $\{o_n\}_j$ could be a sequence of random variables correlated with any elements of $\{s_n\}_i$:

$$|Q^*_{D-1}(b_{D-1}, a) - \hat{Q}^*_{D-1}(\bar{b}_{D-1}, a)| \le \lambda \quad (27)$$

Thus, for all $\{o_n\}_j$, $\{a_n\}$ and a fixed $a$, Eq. (27) holds with probability at least $1 - 3\exp(-C \cdot t^2_{\max}(\lambda, C))$.

C Proof of Lemma 2 (Continued)

Lemma 2 (SN Estimator Step-by-Step Convergence). $\hat{Q}^*_d(\bar{b}_d, a)$ is an SN estimator of $Q^*_d(b_d, a)$ for all $d = 0, \dots, D-1$ and $a$, and the following holds with probability at least $1 - 3|A|(3|A|C)^D \exp(-C \cdot t^2_{\max})$:

$$|Q^*_d(b_d, a) - \hat{Q}^*_d(\bar{b}_d, a)| \le \alpha_d \quad (28)$$
$$\alpha_d \equiv \lambda + \gamma\alpha_{d+1}; \qquad \alpha_{D-1} = \lambda \quad (29)$$

In Lemma 2, we split the difference between the SN estimator and the $Q^*$ function into two terms, the reward estimation error (A) and the next-step value estimation error (B):

$$|Q^*_d(b_d, a) - \hat{Q}^*_d(\bar{b}_d, a)| \le \underbrace{\left|\mathbb{E}[R(s_d, a) \mid b_d] - \frac{\sum_{i=1}^{C} w_{d,i}\, r_{d,i}}{\sum_{i=1}^{C} w_{d,i}}\right|}_{(A)} + \gamma\underbrace{\left|\mathbb{E}[V^*_{d+1}(bao) \mid b_d] - \frac{\sum_{i=1}^{C} w_{d,i}\, \hat{V}^*_{d+1}(b_d a o_i)}{\sum_{i=1}^{C} w_{d,i}}\right|}_{(B)} \quad (30)$$

To bound these terms, we will use the SN concentration bound (Theorem 1) 3 times throughout the process. For (A), we use the SN concentration bound to obtain the bound $\frac{R_{\max}}{3V_{\max}}\lambda$; rather than bounding $R$ with $3V_{\max}$ in this step, we instead bound $R$ with $R_{\max}$ and then augment $\lambda$ to $\frac{R_{\max}}{3V_{\max}}\lambda$ in order to obtain the same uniform $t_{\max}$ factor as the other steps. This choice of bound is made to effectively combine the $\lambda$ terms when we add (A) and (B).
For (B), we use the triangle inequality repeatedly to separate it into three terms: the importance sampling error bounded by $\lambda/3$, the Monte Carlo next-step integral approximation error bounded by $2\lambda/(3\gamma)$, and the function estimation error bounded by $\alpha_{d+1}$:

$$(B) \le \underbrace{\left|\mathbb{E}[V^*_{d+1}(bao) \mid b_d] - \frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(s_{d,i}, b_d, a)}{\sum_{i=1}^{C} w_{d,i}}\right|}_{\text{Importance sampling error}} + \underbrace{\left|\frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(s_{d,i}, b_d, a)}{\sum_{i=1}^{C} w_{d,i}} - \frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(b_d a o_i)}{\sum_{i=1}^{C} w_{d,i}}\right|}_{\text{MC next-step integral approximation error}} + \underbrace{\left|\frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(b_d a o_i)}{\sum_{i=1}^{C} w_{d,i}} - \frac{\sum_{i=1}^{C} w_{d,i}\, \hat{V}^*_{d+1}(b_d a o_i)}{\sum_{i=1}^{C} w_{d,i}}\right|}_{\text{Function estimation error}} \quad (31)$$
$$\le \frac{1}{3}\lambda + \frac{2}{3\gamma}\lambda + \alpha_{d+1} \quad (32)$$

The following subsections justify how each of the error terms is bounded.

C.1 Importance Sampling Error

Before we analyze the first term, note that the conditional expectation of the optimal value function at step $d+1$ given $b_d, a$ is calculated by the following, where we introduce $V^*_{d+1}(s_{d,i}, b_d, a)$ as a shorthand for the next-step integration over $(s_{d+1}, o)$ conditioned on $(s_{d,i}, b_d, a)$:

$$V^*_{d+1}(s_{d,i}, b_d, a) \equiv \int_S \int_O V^*_{d+1}(b_d a o)\, Z(o \mid a, s_{d+1})\, T(s_{d+1} \mid s_{d,i}, a)\, ds_{d+1}\, do \quad (33)$$

$$\mathbb{E}[V^*_{d+1}(bao) \mid b_d] = \int_S \int_S \int_O V^*_{d+1}(b_d a o)\,(Z_{d+1})(T_{d,d+1})\, b_d \cdot ds_{d:d+1}\, do \quad (34)$$
$$= \int_S V^*_{d+1}(s_d, b_d, a)\, b_d \cdot ds_d \quad (35)$$
$$= \frac{\int_{S^{d+1}} V^*_{d+1}(s_d, b_d, a)\,(Z_{1:d})(T_{1:d})\, b_0\, ds_{0:d}}{\int_{S^{d+1}} (Z_{1:d})(T_{1:d})\, b_0\, ds_{0:d}} \quad (36)$$

Noting that the first term is then the difference between the SN estimator and the conditional expectation, and that $\|V^*_{d+1}\|_\infty \le V_{\max}$, we can apply the SN inequality for the second time in Lemma 2 to bound it by the augmented $\lambda/3$.
C.2 Monte Carlo Next-Step Integral Approximation Error

The second term can be thought of as Monte Carlo next-step integral approximation error. To estimate $V^*_{d+1}(s_{d,i}, b_d, a)$, we can simply use the quantity $V^*_{d+1}(b_d a o_i)$, as the random vector $(s_{d+1,i}, o_i)$ is jointly generated using $G$ according to the correct probability $Z(o \mid a, s_{d+1})\, T(s_{d+1} \mid s_{d,i}, a)$ given $s_{d,i}$ in the POWSS simulation. Consequently, the quantity $V^*_{d+1}(b_d a o_i)$ for a given $(s_{d,i}, b_d, a)$ is an unbiased 1-sample MC estimate of $V^*_{d+1}(s_{d,i}, b_d, a)$. We define the difference between these two quantities as $\Delta_{d+1}$, which is implicitly a function of the random variables $(s_{d+1,i}, o_i)$:

$$\Delta_{d+1}(s_{d,i}, b_d, a) \equiv V^*_{d+1}(s_{d,i}, b_d, a) - V^*_{d+1}(b_d a o_i) \quad (37)$$

Then, we note that $\|\Delta_{d+1}\|_\infty \le 2V_{\max}$ and $\mathbb{E}\Delta_{d+1} = 0$ by the tower property, conditioning on $(s_{d,i}, b_d, a)$ and integrating over $(s_{d+1,i}, o_i)$ first, which holds for any choice of well-behaved sampling distributions on $\{s_{0:d}\}_i$. Using this fact, we can then consider the second term as an SN estimator of the bias $\mathbb{E}\Delta_{d+1} = 0$, and use our SN concentration bound for the third time. Since $\|\Delta_{d+1}\|_\infty \le 2V_{\max}$, our $\lambda$ factor is then augmented by 2/3:

$$\left|\frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(s_{d,i}, b_d, a)}{\sum_{i=1}^{C} w_{d,i}} - \frac{\sum_{i=1}^{C} w_{d,i}\, V^*_{d+1}(b_d a o_i)}{\sum_{i=1}^{C} w_{d,i}}\right| = \left|\frac{\sum_{i=1}^{C} w_{d,i}\,\Delta_{d+1}(s_{d,i}, b_d, a)}{\sum_{i=1}^{C} w_{d,i}} - 0\right| \le \frac{2}{3}\lambda \le \frac{2}{3\gamma}\lambda \quad (38)$$

C.3 Function Estimation Error

Lastly, the third term is bounded by the inductive hypothesis, since each $i$-th absolute difference of the $Q$-function and its estimate at step $d+1$, and therefore the value function and its estimate at step $d+1$, are all bounded by $\alpha_{d+1}$.

D Proof of Theorem 3

D.1 Belief State Policy Convergence Lemma

Before we prove Theorem 3, we first prove the following lemma, which is an adaptation of Kearns et al.
[2002] and Singh and Yee [1994] for belief states $b$.

Lemma 4. Suppose we obtain a greedy policy implemented by some approximation $\tilde{V}_{d,t}(b) \approx V^*_{t+d}(b)$ by generating a tree at online step $t$ and obtaining a value function estimate at tree depth $d$ for a belief $b$. Define the total loss $L_{\tilde{V},t} \equiv V^*_t(b) - V^{\tilde{V}_0,t}(b)$ as the difference between the value obtained by the optimal policy and the value obtained by the $\tilde{V}_{d,t}$ approximation at online step $t$. If $|\tilde{V}_{d,t}(b) - V^*_{t+d}(b)| \le \beta$ for all online steps $t \in [0, D-1]$ and each corresponding tree depth $d = 0, \dots, D-1-t$, then the total loss incurred by implementing the greedy policy from the beginning is bounded by the following:

$$L_{\tilde{V},0}(b) \le \frac{3}{1-\gamma}\beta \quad (39)$$

Proof. We mirror the proof strategies given in Kearns et al. [2002] and Singh and Yee [1994] for belief states $b$. Consider the optimal action $a = \pi^*_d(b)$ and the greedy action $\tilde{a} = \pi_{\tilde{V},t}(b)$. Here, we denote $R(b, a)$ as shorthand for $\mathbb{E}[R(s, a) \mid b]$.
Since $\tilde{a}$ is greedy, it must look at least as good as $a$ under $\tilde{V}$:

$$R(b, a) + \gamma\,\mathbb{E}[\tilde{V}_{1,t}(bao) \mid b] \le R(b, \tilde{a}) + \gamma\,\mathbb{E}[\tilde{V}_{1,t}(b\tilde{a}o) \mid b] \quad (40)$$

Since we have $|\tilde{V}_{d,t}(b) - V^*_{t+d}(b)| \le \beta$,

$$R(b, a) + \gamma\,\mathbb{E}[V^*_{t+1}(bao) - \beta \mid b] \le R(b, \tilde{a}) + \gamma\,\mathbb{E}[V^*_{t+1}(b\tilde{a}o) + \beta \mid b] \quad (41)$$
$$R(b, a) - R(b, \tilde{a}) \le 2\gamma\beta + \gamma\,\mathbb{E}[V^*_{t+1}(b\tilde{a}o) \mid b] - \gamma\,\mathbb{E}[V^*_{t+1}(bao) \mid b] \quad (42)$$

Then, the loss for $b$ at time $t$ is:

$$L_{\tilde{V},t}(b) = V^*_t(b) - V^{\tilde{V}_0,t}(b) \quad (43)$$
$$= R(b, a) - R(b, \tilde{a}) + \gamma\,\mathbb{E}[V^*_{t+1}(bao) \mid b] - \gamma\,\mathbb{E}[V^{\tilde{V}_0,t+1}(b\tilde{a}o) \mid b] \quad (44)$$

Substituting the reward difference bound from Eq. (42) into the loss expression,

$$L_{\tilde{V},t}(b) = R(b, a) - R(b, \tilde{a}) + \gamma\,\mathbb{E}[V^*_{t+1}(bao) \mid b] - \gamma\,\mathbb{E}[V^{\tilde{V}_0,t+1}(b\tilde{a}o) \mid b] \quad (45)$$
$$\le 2\gamma\beta + \gamma\,\mathbb{E}[V^*_{t+1}(b\tilde{a}o) \mid b] - \gamma\,\mathbb{E}[V^*_{t+1}(bao) \mid b] + \gamma\,\mathbb{E}[V^*_{t+1}(bao) \mid b] - \gamma\,\mathbb{E}[V^{\tilde{V}_0,t+1}(b\tilde{a}o) \mid b] \quad (46)$$
$$\le 2\gamma\beta + \gamma\,\mathbb{E}[V^*_{t+1}(b\tilde{a}o) \mid b] - \gamma\,\mathbb{E}[V^{\tilde{V}_0,t+1}(b\tilde{a}o) \mid b] \quad (47)$$
$$\le 2\gamma\beta + \gamma\,\mathbb{E}[L_{\tilde{V},t+1}(b\tilde{a}o) \mid b] \quad (48)$$

Note that we have $L_{\tilde{V},D-1}(b) \le \beta$ from the root node estimate at the last step, which means we obtain the bound with some over-approximations:

$$L_{\tilde{V},0}(b) \le \sum_{d=1}^{D-1} 2\beta\gamma^d + \gamma^{D-1}\beta \le \sum_{d=0}^{D-1} 3\beta\gamma^d \le \frac{3}{1-\gamma}\beta \quad (49)$$

These over-approximations are made in order to generate constants that can be easily calculated.

D.2 Proof of Theorem 3

We reiterate the conditions and Theorem 3 below:

(i) $S$ and $O$ are continuous spaces, and the action space has a finite number of elements, $|A| < +\infty$.

(ii) For any observation sequence $\{o_n\}_j$, the densities $Z, T, b_0$ are chosen such that the Rényi divergence of the target distribution $\mathcal{P}_d$ and sampling distribution $\mathcal{Q}_d$ (Eqs. (20) and (21)) is bounded above by $d^{\max}_\infty < +\infty$ a.s.
for all $d = 0, \dots, D-1$:

$$d_\infty(\mathcal{P}_d\|\mathcal{Q}_d) = \operatorname*{ess\,sup}_{x \sim \mathcal{Q}_d} w_{\mathcal{P}_d/\mathcal{Q}_d}(x) \le d^{\max}_\infty$$

(iii) The reward function $R$ is Borel and bounded by a finite constant $\|R\|_\infty \le R_{\max} < +\infty$ a.s., and $V_{\max} \equiv \frac{R_{\max}}{1-\gamma} < +\infty$.

(iv) We can evaluate the generating function $G$ as well as the observation probability density $Z$.

(v) The POMDP terminates after $D < \infty$ steps.

Theorem 3 (POWSS Policy Convergence). In addition to conditions (i)-(v), assume that the closed-loop POMDP Bayesian belief update step is exact. Then, for any $\epsilon > 0$, we can choose a $C$ such that the value obtained by POWSS is within $\epsilon$ of the optimal value function at $b_0$ a.s.:

$$V^*(b_0) - V^{POWSS}(b_0) \le \epsilon \quad (50)$$

Proof. In our main report, we have proven Lemmas 1 and 2, which give us root node convergence for all actions. We apply these lemmas as well as Lemma 4 to prove the policy convergence.

From Lemma 2, we have that the error in estimating $Q^*$ with our POWSS policy is bounded by $\lambda/(1-\gamma)$ for all $d, a$ with probability at least $1-\delta$. This directly implies that the $V$-function estimation errors are bounded as well for all steps $d$; if $|Q^*_d(b_d, a) - \hat{Q}^*_d(\bar{b}_d, a)| \le \frac{\lambda}{1-\gamma}$ for all $d, a$, then:

$$\left|\max_{a \in A} Q^*_d(b_d, a) - \max_{a \in A}\hat{Q}^*_d(\bar{b}_d, a)\right| = |V^*_d(b_d) - \hat{V}^*_d(\bar{b}_d)| \le \frac{\lambda}{1-\gamma} \quad (51)$$

For online planning, we require that the POWSS trees generated at each online planning step satisfy the concentration inequalities for all of their nodes. As each of the trees generated over the $D$ steps needs to have good estimates, we worst-case upper bound the union bound probability by multiplying $\delta$ by $D$. Applying Lemma 4, we get that if all the nodes satisfy the concentration inequality, which happens with probability at least $1 - D\delta$, the following holds:

$$V^*(b_0) - V^{POWSS}(b_0) = L_{\hat{V}^*,0}(b_0) \le \frac{3\lambda}{(1-\gamma)^2} \quad (52)$$

Note that the maximum difference between the values obtained by the two policies is bounded by $2V_{\max}$.
At each online step, the maximum possible difference between the rewards obtained via the greedy POWSS policy and the optimal policy is $2R_{\max}$, discounted by $\gamma^i$ at step $i$. We use this bound for the bad case, which occurs with probability at most $D\delta$. Using the definitions of the constants from Theorem 2:

$$V^*(b_0) - V^{POWSS}(b_0) = \mathbb{E}\left[\sum_{i=0}^{D-1}\gamma^i R(s_i, \pi^*_i(s_i)) \,\middle|\, b_0\right] - \mathbb{E}\left[\sum_{i=0}^{D-1}\gamma^i R(s_i, \pi^{POWSS}_i(s_i)) \,\middle|\, b_0\right] \quad (53)$$
$$\le (1 - D\delta)\frac{3\lambda}{(1-\gamma)^2} + D\delta \sup\left\{\sum_{i=0}^{D-1}\gamma^i \left|R(s_i, \pi^*_i(s_i)) - R(s_i, \pi^{POWSS}_i(s_i))\right| \,\middle|\, b_0\right\} \quad (54)$$
$$\le (1 - D\delta)\frac{3\lambda}{(1-\gamma)^2} + D\delta\sum_{i=0}^{D-1}\gamma^i (2R_{\max}) \quad (55)$$
$$\le (1 - D\delta)\frac{3\lambda}{(1-\gamma)^2} + D\delta\frac{2R_{\max}}{1-\gamma} \quad (56)$$
$$\le \frac{3\lambda}{(1-\gamma)^2} + 2D\delta V_{\max} = \frac{5\lambda}{(1-\gamma)^2} = \epsilon \quad (57)$$

Therefore, we obtain our desired bound on the value obtained by the POWSS policy.
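To summarize the construction that the proofs above analyze, here is a compact Python sketch of a likelihood-weighted sparse sampling Q-value estimator in the spirit of POWSS. This is our own simplified illustration, not the authors' Julia implementation: the `pomdp` object with `reward`, generative model `gen`, density `Z`, `actions`, and horizon `D` is an assumed interface, and particle propagation is simplified relative to the actual algorithm (in particular, we assume $Z > 0$ so the normalizing weights never all vanish).

```python
# Simplified likelihood-weighted sparse sampling sketch (ours, illustrative).
# Each belief node holds C weighted particles. For each action, every particle
# generates a (next-state, observation) pair; all particles are then reweighted
# by the observation likelihood Z before recursing, and the reward and
# next-step value are combined with self-normalized (SN) weighting.

def estimate_q(particles, weights, a, depth, pomdp, gamma, rng):
    """SN-weighted estimate of Q_d(b, a) from a weighted particle belief."""
    total_w = sum(weights)
    # (A) self-normalized reward estimate
    r_hat = sum(w * pomdp.reward(s, a) for s, w in zip(particles, weights)) / total_w
    if depth == pomdp.D - 1:        # leaf node: reward only
        return r_hat
    # (B) next-step value: each particle i generates (s'_i, o_i) jointly via G,
    # then every particle is reweighted by Z(o_i | a, s') for the child belief
    v_next = 0.0
    for i, (s_i, w_i) in enumerate(zip(particles, weights)):
        sp_i, o_i = pomdp.gen(s_i, a, rng)
        child_particles, child_weights = [], []
        for s_j, w_j in zip(particles, weights):
            sp_j, _ = pomdp.gen(s_j, a, rng)
            child_particles.append(sp_j)
            child_weights.append(w_j * pomdp.Z(o_i, a, sp_j))  # likelihood weighting
        # particle i keeps its own generated next state and weight
        child_particles[i] = sp_i
        child_weights[i] = w_i * pomdp.Z(o_i, a, sp_i)
        v = max(estimate_q(child_particles, child_weights, ap, depth + 1,
                           pomdp, gamma, rng) for ap in pomdp.actions)
        v_next += w_i * v
    return r_hat + gamma * v_next / total_w
```

The two SN averages in this sketch correspond directly to the (A) and (B) error terms bounded in Lemma 2, which is why the analysis needs the SN concentration bound at every node.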
