Strategies in POMDPs with Stage Duration

Partially observable Markov decision processes (POMDPs) with stage duration provide a framework for approximating continuous-time behavior by scaling transition probabilities with a stage duration parameter $h \in (0,1]$. While previous literature has primarily focused on the limit of the discounted value as the stage duration $h$ vanishes, this paper investigates the global behavior of the asymptotic value, $V(h)$, across varying stage durations. Our main result demonstrates that any strategy in a POMDP with stage duration $h$ can be mimicked in the base POMDP ($h = 1$). Specifically, we provide an explicit construction showing that for any strategy in the POMDP with stage duration $h$, there exists a strategy in the base POMDP that secures the same asymptotic payoff. As a consequence of this theorem, we establish that the value function $V(h)$ is nondecreasing with respect to $h$, and that the continuous-time limit $\lim_{h \to 0} V(h)$ exists.

Authors: Ivan Novikov (Université Paris 1 Panthéon-Sorbonne)

Contents

1 Introduction
2 POMDPs with stage duration
3 Main results
4 Proof of Theorem 1
  4.1 Construction of the strategy $\tilde{\sigma}$
  4.2 Proof of Theorem 1 and Corollaries 1-3 modulo technical lemmas
  4.3 Proof of Lemma 1
  4.4 Proof of Lemma 2
  4.5 Proof of Lemma 3
  4.6 Proof of Lemma 4
5 Concluding remarks
  5.1 Fully observed state
  5.2 Continuity of $V(h)$
6 Acknowledgments
7 Bibliography

Note: Throughout the paper, we use △ to mark the end of a definition or a remark.

1 Introduction

Partially observable Markov decision processes (POMDPs) were introduced by Drake (1962). In this paper, we consider the special case where the signal depends only on the current state and is deterministic. Such a POMDP proceeds in discrete time as follows. In the beginning, the initial state is drawn according to some probability distribution, after which a deterministic signal is generated based on that state. At each stage, the decision maker observes the signal, recalls all previous actions and signals, and chooses an action. This action determines the stage payoff. The next state is then drawn randomly, conditional on the current state and action. The goal of the decision maker is to maximize his overall payoff.

POMDPs can be generalized to the case of two decision makers with opposite interests. In this case, we refer to them as zero-sum stochastic games. In this framework, we can distinguish between the cases of full state observation, where the state is fully observed by both decision makers (introduced by Shapley (1953)); public signals, where the signal is seen by both decision makers (see, e.g., Ziliotto (2016)); and private signals, where each decision maker observes his own signal (see, e.g., Renault (2006)).
Note that in POMDPs, there is no difference between public and private signals because there is only one decision maker.

POMDPs with stage duration were introduced by Neyman (2013) in the context of zero-sum stochastic games with fully observed state. Given a stochastic game $G_1$, Neyman (2013) considers a family $G_h$ of stochastic games in which the leaving probabilities (that is, the transition probabilities between two distinct states; see Definition 1 for the precise definition) and the discount rate are normalized at each stage: both are proportional to $h$. Let $V_\lambda(h)$ denote the value of the game with discount factor $\lambda$ and stage duration $h$. The games with stage duration $h$ approximate continuous-time stochastic games as $h$ vanishes; see, e.g., the PhD thesis (Novikov, 2025b, Introduction, § 2.2). Previous papers on stage duration have examined the limit of $V_\lambda(h)$ as $h$ vanishes:

1. In the case of fully observed state: Neyman (2013), Sorin and Vigeral (2016);
2. In the case of public signals: Sorin (2018), Novikov (2025a), Novikov (2024);
3. In the case of private signals: Cardaliaguet et al. (2016), Gensbittel (2016).

Note that not all of the above papers refer to such games as "games with stage duration"; they are sometimes called, e.g., "games with frequent actions", but the underlying idea is always to approximate continuous-time stochastic games using discrete-time ones.

Now, we define $V(h) = \lim_{\lambda \to 0} V_\lambda(h)$. The main goal of this article is to investigate the global behavior of $V(h)$ across varying stage durations, rather than solely focusing on its limit as $h$ vanishes. Our main result, Theorem 1, establishes that any strategy executed in $G_h$ can be perfectly mimicked in the base POMDP $G_1$. Specifically, if a strategy $\sigma$ in $G_h$ yields an asymptotic payoff of $v$, we provide an explicit construction for a strategy $\tilde{\sigma}$ in $G_1$ that secures the same asymptotic payoff $v$. Theorem 1 has several corollaries:

1. Corollary 1: any strategy in $G_{h_1}$ can be mimicked by a strategy in $G_{h_2}$, provided that $h_1 < h_2$;
2. Corollary 2: the function $h \mapsto V(h)$ is nondecreasing;
3. Corollary 3: the continuous-time limit $\lim_{h \to 0} V(h)$ exists.

Theorem 1, as well as Corollaries 1-3, is new. Note that while many papers have studied the continuous-time limit $\lim_{h \to 0} V(h)$, they considered the case of a fixed discount factor $\lambda$, whereas we consider the asymptotic case $\lambda \to 0$.

2 POMDPs with stage duration

A partially observable Markov decision process (POMDP) is a 7-tuple $(\Omega, A, S, f, g, P, p_1)$, where $\Omega$ is the finite set of states, $A$ is the finite set of actions, $S$ is the finite set of signals, $f : \Omega \to S$ is the function giving a signal for each state, $g : \Omega \times A \to \mathbb{R}$ is the stage payoff function, $P : \Omega \times A \to \Delta(\Omega)$ is the transition probability function, and $p_1 \in \Delta(\Omega)$ is the initial distribution on the states.

The POMDP $(\Omega, A, S, f, g, P, p_1)$ proceeds as follows. An initial state $\omega_1$ is drawn from $p_1$, and the decision maker receives the signal $f(\omega_1)$. At each stage $n \in \mathbb{N}^*$, the decision maker chooses an action $a_n \in A$ and receives the unobserved payoff $g(\omega_n, a_n)$. The next state $\omega_{n+1}$ is drawn according to $P(\omega_n, a_n)$, and the decision maker receives the signal $f(\omega_{n+1})$.
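To make these dynamics concrete, here is a minimal simulation sketch of one run of such a POMDP. The two-state chain, payoff matrix, signal map, and uniform strategy below are illustrative assumptions, not an example taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy POMDP with 2 states, 2 actions, 2 signals (all numbers are illustrative).
P = np.array([[[0.9, 0.1],   # P(. | omega=0, a=0)
               [0.2, 0.8]],  # P(. | omega=0, a=1)
              [[0.5, 0.5],   # P(. | omega=1, a=0)
               [0.3, 0.7]]])
g = np.array([[1.0, 0.0],    # g(omega, a): stage payoff, unobserved by the decision maker
              [0.0, 1.0]])
f = np.array([0, 1])         # deterministic signal f(omega)
p1 = np.array([0.5, 0.5])    # initial distribution on states

def run(n_stages, strategy):
    """Simulate n_stages stages; `strategy` maps the observed history to a mixed action."""
    omega = rng.choice(2, p=p1)
    history = [f[omega]]                        # the decision maker sees only signals
    total = 0.0
    for _ in range(n_stages):
        a = rng.choice(2, p=strategy(history))
        total += g[omega, a]
        omega = rng.choice(2, p=P[omega, a])    # next state drawn from P(omega, a)
        history += [a, f[omega]]
    return total / n_stages

uniform = lambda history: np.array([0.5, 0.5])  # a history-independent mixed strategy
print(run(10_000, uniform))                     # empirical average payoff
```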
A history of length $t \in \mathbb{N}$ in the POMDP $(\Omega, A, S, f, g, P, p_1)$ is $(s_1, a_1, s_2, a_2, \ldots, s_{t-1}, a_{t-1}, s_t)$. The set of all histories of length $t$ is $H_t := S \times (A \times S)^{t-1}$. A (behavior) strategy of the decision maker is a function $\sigma : \bigcup_{t \geq 1} H_t \to \Delta(A)$.

The decision maker's strategy induces a probability distribution on the set $S \times (A \times S)^{\mathbb{N}^*}$. (Indeed, the strategy induces a probability distribution on the set $H_1$, then on the set $H_2$, etc. By the Kolmogorov extension theorem, this probability can be extended in a unique way to the set $S \times (A \times S)^{\mathbb{N}^*}$.) In particular, given an initial probability distribution $p_0 \in \Delta(\Omega)$, a strategy $\sigma : \bigcup_{t \geq 1} H_t \to \Delta(A)$, and the induced probability measure $\mathbb{P}^{p_0}_\sigma$ on $S \times (A \times S)^{\mathbb{N}^*}$, we can consider the expectation $\mathbb{E}^{p_0}_\sigma$ of any random variable defined on $\bigcup_{t \geq 1} H_t$.

Definition 1 (POMDP with stage duration; (Novikov, 2024, Definition 1) and (Neyman, 2013, p. 240)). Fix a POMDP $G_1 = (\Omega, A, S, f, g, P, p_1)$. The POMDP with stage duration $h \in (0,1]$ is the POMDP $G_h = (\Omega, A, S, f, g, P_h, p_1)$, with

$$P_h(\cdot \mid \omega, a) = h\,P(\cdot \mid \omega, a) + (1-h)\,\delta_\omega(\cdot), \quad \text{where } \delta_\omega(\omega') = \begin{cases} 1, & \text{if } \omega' = \omega; \\ 0, & \text{otherwise.} \end{cases}$$

In $G_h$, we consider the asymptotic value

$$V(h) := \lim_{\lambda \to 0} \sup_\sigma \mathbb{E}^h_\sigma \left[ \lambda h \sum_{i=1}^{\infty} (1 - \lambda h)^{i-1} g(\omega_i, a_i) \right]. \quad △$$

In the above definition, we use $\mathbb{E}^h_\sigma$ to denote the expectation generated by the strategy $\sigma$ in $G_h$. Formally, this expectation also depends on the initial probability distribution $p_0$, but since this distribution is independent of the stage duration $h$, we omit $p_0$ from the notation.

Remark 1. Definition 1 coincides with the one given in Novikov (2024), except that the latter considers the $\lambda$-discounted payoff

$$V_\lambda(h) := \sup_\sigma \mathbb{E}^h_\sigma \left[ \lambda h \sum_{i=1}^{\infty} (1 - \lambda h)^{i-1} g(\omega_i, a_i) \right], \quad \text{where } \lambda \in (0,1],$$

whereas we consider its limit $V(h) = \lim_{\lambda \to 0} V_\lambda(h)$. Definition 1 is also similar to the one introduced in Neyman (2013), except that the latter considers the case of full state observation, and uses $V_\lambda(h)$ instead of $V(h)$. △
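Definition 1 amounts to mixing the original kernel with the identity. A minimal sketch of this rescaling, for a hypothetical kernel stored as a (state, action, next state) array:

```python
import numpy as np

def stage_duration_kernel(P, h):
    """P_h(. | omega, a) = h * P(. | omega, a) + (1 - h) * delta_omega(.).

    P has shape (n_states, n_actions, n_states); h in (0, 1] is the stage duration.
    """
    n_states = P.shape[0]
    delta = np.eye(n_states)[:, None, :]        # delta_omega, broadcast over the action axis
    return h * P + (1 - h) * delta

# A hypothetical 2-state, 1-action kernel:
P = np.array([[[0.3, 0.7]],
              [[0.6, 0.4]]])
print(stage_duration_kernel(P, 0.25)[0, 0])     # [0.825, 0.175] = 0.25*[0.3, 0.7] + 0.75*[1, 0]
```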
We will need an alternative expression for $V(h)$. To this end, we provide the following definition.

Definition 2 (cf. (Arapostathis et al., 1993, p. 286)). For each $h \in (0,1]$ and each strategy $\sigma$ in $G_h$, we define the expected long-run average payoff as

$$R(\sigma, h) := \liminf_{T \to \infty} \mathbb{E}^h_\sigma \left[ \frac{1}{T} \sum_{i=1}^{T} g(\omega_i, a_i) \right]. \quad △$$

Remark 2. We have

$$V(h) = \lim_{\lambda \to 0} \sup_\sigma \mathbb{E}^h_\sigma \left[ \lambda h \sum_{i=1}^{\infty} (1-\lambda h)^{i-1} g(\omega_i, a_i) \right] = \lim_{\lambda \to 0} \sup_\sigma \mathbb{E}^h_\sigma \left[ \lambda \sum_{i=1}^{\infty} (1-\lambda)^{i-1} g(\omega_i, a_i) \right] = \sup_\sigma \liminf_{T \to \infty} \mathbb{E}^h_\sigma \left[ \frac{1}{T} \sum_{i=1}^{T} g(\omega_i, a_i) \right],$$

where the first equality is obtained by using the change of variable $\lambda \mapsto \lambda/h$, and the last equality follows from Rosenberg et al. (2002); see also (Chatterjee et al., 2022, Remark 2.1). △

Remark 2 implies that $V(h) = \sup_\sigma R(\sigma, h)$.

3 Main results

Theorem 1. Let $h \in (0,1)$ and let $\sigma$ be a strategy in $G_h$. Then there exists a strategy $\tilde{\sigma}$ in $G_1$ such that $R(\sigma, h) = R(\tilde{\sigma}, 1)$.

Corollary 1. Let $0 < h_1 < h_2 \leq 1$ and let $\sigma$ be a strategy in $G_{h_1}$. Then there exists a strategy $\tilde{\sigma}$ in $G_{h_2}$ such that $R(\sigma, h_1) = R(\tilde{\sigma}, h_2)$.

Corollary 2. The function $h \mapsto V(h)$ is nondecreasing.

Corollary 3. The limit $\lim_{h \to 0} V(h)$ exists.

Remark 3. In the case of fully observed state, the function $h \mapsto V(h)$ is constant; see § 5.1. △

4 Proof of Theorem 1

4.1 Construction of the strategy $\tilde{\sigma}$

We first construct the strategy $\tilde{\sigma}$. We fix $h \in (0,1)$ and a strategy $\sigma$ in $G_h$.

Definition 3.
1. We denote by $X = (X_1, X_2, \ldots)$ the stochastic process of i.i.d. random variables with $\mathbb{P}(X_i = 1) = h$ and $\mathbb{P}(X_i = 0) = 1-h$; that is, $X_i \sim \mathrm{Bernoulli}(h)$.
2. We denote by $T = (T_0, T_1, T_2, \ldots)$ the stochastic process in which $T_0 = 0$ and $T_i = \inf\{n > T_{i-1} \mid X_n = 1\}$ for $i > 0$.
3. We denote by $N = (N_1, N_2, \ldots)$ the stochastic process of i.i.d. random variables defined by $N_i = T_i - T_{i-1}$; that is, for every $i \in \mathbb{N}^*$, $N_i = k$ with probability $h(1-h)^{k-1}$, i.e., $N_i \sim \mathrm{Geometric}(h)$. △

Remark 4 (Interpretation of $X$, $T$, and $N$). In $G_h$, the next state is chosen according to $h\,P(\cdot \mid \omega, a) + (1-h)\,\delta_\omega(\cdot)$. This means that at the end of each stage, the next state is drawn according to $P$ with probability $h$; otherwise, with probability $1-h$, the state remains unchanged (drawn according to $\delta_\omega$). Hence:

1. $X_i$ indicates whether the transition at the end of the $i$-th stage is governed by $P$ (if $X_i = 1$) or by $\delta_\omega$ (if $X_i = 0$).
2. $T_i$ is the stage number at which the state transition is governed by $P$ for the $i$-th time.
3. $N_i$ is the duration of the $i$-th epoch. It represents the number of stages between the $(i-1)$-th and $i$-th times the transition is governed by $P$, consisting of $N_i - 1$ self-transitions ($\delta_\omega$) followed by one $P$-transition. △

Remark 5. This remark summarizes some useful properties of $T$ and $N$ that will be used in subsequent proofs. For $N$, we have

$$\mathbb{P}(N_j \geq m) = (1-h)^{m-1}, \qquad \mathbb{E}(N_j) = \frac{1}{h}, \qquad \mathrm{Var}(N_j) = \frac{1-h}{h^2}.$$

For $T$, we have

$$T_j = \sum_{i=1}^{j} N_i, \qquad \mathbb{E}(T_j) = \frac{j}{h}, \qquad \mathrm{Var}(T_j) = \frac{1-h}{h^2}\, j. \quad △$$
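The processes of Definition 3 are straightforward to sample, and the moments listed in Remark 5 can be checked empirically. A minimal sketch, with an arbitrary stage duration and horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
h, n_stages = 0.25, 200_000             # arbitrary stage duration and horizon

X = rng.random(n_stages) < h            # X_i ~ Bernoulli(h)
T = np.flatnonzero(X) + 1               # T_i: stages whose transition follows P (1-based)
N = np.diff(T, prepend=0)               # N_i = T_i - T_{i-1} ~ Geometric(h)

print(N.mean(), 1 / h)                  # E(N_j)   = 1/h       = 4
print(N.var(), (1 - h) / h**2)          # Var(N_j) = (1-h)/h^2 = 12
j = 1_000
print(T[j - 1], j / h)                  # T_j concentrates around E(T_j) = j/h = 4000
```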
Fix $k \in \mathbb{N}^*$. For a fixed infinite history $\mathcal{H} = (s'_1, a'_1, s'_2, \ldots)$ in $G_h$, we define the filtered history $\mathcal{H}^{fil}_k$ as the random vector

$$\mathcal{H}^{fil}_k = (s'_1, a'_{T_1}, s'_{T_1+1}, a'_{T_2}, \ldots, a'_{T_{k-1}}, s'_{T_{k-1}+1}).$$

The probability measure $\mathbb{P}^h_\sigma$ is defined on an extended joint probability space that encompasses both the histories of $G_h$ and the auxiliary stochastic processes $X$, $T$, and $N$. Similarly, we use $\mathbb{E}^h_\sigma$ to denote the expectation of random variables defined on this probability space.

Definition 4. Let $\eta_k = (s_1, a_1, s_2, \ldots, a_{k-1}, s_k)$ be a history of length $k$ in $G_1$, and let $\tilde{a} \in A$. The strategy $\tilde{\sigma}$ is defined by

$$\tilde{\sigma}(\eta_k)(\tilde{a}) := \mathbb{P}^h_\sigma\big(a'_{T_k} = \tilde{a} \mid \mathcal{H}^{fil}_k = \eta_k\big).$$

If $\mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k) = 0$, then we define $\tilde{\sigma}(\eta_k)$ as an arbitrary fixed mixed action. △

Remark 6 (Intuition behind the construction of $\tilde{\sigma}$). The strategy $\tilde{\sigma}$ is designed to simulate in $G_1$ the strategy $\sigma$ from $G_h$. In $G_h$, the state only truly transitions (according to $P$) at the random stages $T_1, T_2, \ldots$. Between these stages, the state is frozen. The decision maker in $G_1$ does not experience these frozen periods; instead, every stage in $G_1$ corresponds to a true transition. Therefore, to replicate $\sigma$, the decision maker in $G_1$ examines the current history $\eta_k$ and asks: "If I were playing $G_h$ and the sequence of true transitions perfectly matched $\eta_k$, what action would I play at the (random and unobserved) moment $T_k$ when the state is finally about to change?"

As the durations of the frozen periods ($N_i$) and the intermediate actions taken during them are absent in $G_1$, $\tilde{\sigma}$ effectively "filters out" this information. It calculates the expected mixed action that $\sigma$ would play at stage $T_k$, conditioned on the fact that the filtered history $\mathcal{H}^{fil}_k$ of true transitions aligns with the history $\eta_k$ observed in $G_1$. △

Remark 7 (Strategy in the state-blind case). In the case of a single signal, any strategy $\sigma$ in $G_h$ becomes a sequence of (mixed or pure) actions, i.e., $\sigma = (a_1, a_2, \ldots)$. In this case, the strategy $\tilde{\sigma}$ in $G_1$ is given by $\tilde{\sigma} = (\tilde{a}_1, \tilde{a}_2, \ldots)$, where $\tilde{a}_i = \mathbb{E}^h_\sigma(a_{T_i})$. Note that, in general, the actions $\tilde{a}_i$ are mixed, even if $\sigma$ is a pure strategy. △
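In the state-blind setting of Remark 7, the construction reduces to $\tilde{a}_i = \mathbb{E}^h_\sigma(a_{T_i})$, which can be estimated by plain Monte Carlo. A minimal sketch, assuming a hypothetical pure strategy that alternates the two actions:

```python
import numpy as np

rng = np.random.default_rng(0)
h, i, n_runs = 0.5, 3, 100_000          # stage duration, epoch index, Monte Carlo sample size

def sigma(stage):
    """A hypothetical pure state-blind strategy in G_h: alternate the two actions 0 and 1."""
    return stage % 2

counts = np.zeros(2)
for _ in range(n_runs):
    T_i = rng.geometric(h, size=i).sum()  # T_i is a sum of i i.i.d. Geometric(h) durations
    counts[sigma(T_i)] += 1

print(counts / n_runs)                    # estimate of the mixed action a~_i = E(a_{T_i}) in G_1
```

Although $\sigma$ is pure here, the printed estimate is a strictly mixed action, illustrating the last sentence of Remark 7.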
4.2 Proof of Theorem 1 and Corollaries 1-3 modulo technical lemmas

We first state four technical lemmas.

Lemma 1. For each $k \in \mathbb{N}^*$, we have $\mathbb{E}^1_{\tilde{\sigma}}\, g(\omega_k, a_k) = \mathbb{E}^h_\sigma\, g(\omega_{T_k}, a_{T_k})$.

Lemma 2. For each $k \in \mathbb{N}^*$, we have

$$\mathbb{E}^h_\sigma \left( \sum_{j=T_{k-1}+1}^{T_k} g(\omega_j, a_j) \right) = \frac{1}{h}\, \mathbb{E}^h_\sigma\, g(\omega_{T_k}, a_{T_k}).$$

Lemma 3. Let $t_k = \lfloor k/h \rfloor$ for each $k \in \mathbb{N}^*$. We have

$$\liminf_{k \to +\infty} \mathbb{E}^h_\sigma \left( \frac{1}{t_k} \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) = \liminf_{k \to +\infty} \mathbb{E}^h_\sigma \left( \frac{h}{k} \sum_{j=1}^{T_k} g(\omega_j, a_j) \right).$$

Lemma 4. Let $M \in \mathbb{N}^*$. Let $\{x_n\}_{n=1}^\infty$ be a sequence and $\{x_{n_k}\}_{k=1}^\infty$ be a subsequence of $\{x_n\}_{n=1}^\infty$. Suppose that $\lim_{j \to \infty}(x_{j+1} - x_j) = 0$ and $|n_{j+1} - n_j| \leq M$ for all $j \in \mathbb{N}^*$. We then have

$$\liminf_{n \to \infty} x_n = \liminf_{k \to \infty} x_{n_k}.$$

Proof of Theorem 1 modulo technical lemmas. To show that $R(\sigma, h) = R(\tilde{\sigma}, 1)$, we manipulate the expressions for $R(\sigma, h)$ and $R(\tilde{\sigma}, 1)$. We start with $R(\tilde{\sigma}, 1)$. By Lemma 1, we have

$$R(\tilde{\sigma}, 1) = \liminf_{T \to +\infty} \mathbb{E}^1_{\tilde{\sigma}} \left[ \frac{1}{T} \sum_{k=1}^{T} g(\omega_k, a_k) \right] = \liminf_{T \to +\infty} \left[ \frac{1}{T} \sum_{k=1}^{T} \mathbb{E}^1_{\tilde{\sigma}}\, g(\omega_k, a_k) \right] = \liminf_{T \to +\infty} \left[ \frac{1}{T} \sum_{k=1}^{T} \mathbb{E}^h_\sigma\, g(\omega_{T_k}, a_{T_k}) \right]. \tag{1}$$

We now aim to compute $R(\sigma, h)$. Consider the sequence $\{x_t\}_{t=1}^\infty$ and the subsequence $\{x_{t_k}\}_{k=1}^\infty$ of $\{x_t\}_{t=1}^\infty$, where

$$x_t := \mathbb{E}^h_\sigma \left[ \frac{1}{t} \sum_{i=1}^{t} g(\omega_i, a_i) \right] \quad \text{and} \quad t_k = \lfloor k/h \rfloor.$$

Let $M = \sup_{\omega, a} |g(\omega, a)|$. We have

$$|x_{t+1} - x_t| = \left| \mathbb{E}^h_\sigma \left[ \frac{1}{t+1} \sum_{i=1}^{t+1} g(\omega_i, a_i) - \frac{1}{t} \sum_{i=1}^{t} g(\omega_i, a_i) \right] \right| = \left| -\frac{1}{t(t+1)} \sum_{i=1}^{t} \mathbb{E}^h_\sigma\, g(\omega_i, a_i) + \frac{1}{t+1} \mathbb{E}^h_\sigma\, g(\omega_{t+1}, a_{t+1}) \right| \leq \frac{Mt}{t(t+1)} + \frac{M}{t+1} \xrightarrow{t \to \infty} 0. \tag{2}$$

We also have, for any $k \in \mathbb{N}^*$,

$$|t_{k+1} - t_k| \leq 1 + \frac{1}{h}. \tag{3}$$

Now Lemma 4, together with (2) and (3), implies

$$R(\sigma, h) = \liminf_{T \to +\infty} \mathbb{E}^h_\sigma \left[ \frac{1}{T} \sum_{i=1}^{T} g(\omega_i, a_i) \right] = \liminf_{k \to +\infty} \mathbb{E}^h_\sigma \left( \frac{1}{t_k} \sum_{j=1}^{t_k} g(\omega_j, a_j) \right). \tag{4}$$

By Lemmas 2 and 3, we have

$$\liminf_{k \to +\infty} \mathbb{E}^h_\sigma \left( \frac{1}{t_k} \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) = \liminf_{k \to +\infty} \mathbb{E}^h_\sigma \left( \frac{h}{k} \sum_{j=1}^{T_k} g(\omega_j, a_j) \right) = \liminf_{K \to +\infty} \mathbb{E}^h_\sigma \left( \frac{h}{K} \sum_{k=1}^{K} \sum_{j=T_{k-1}+1}^{T_k} g(\omega_j, a_j) \right) = \liminf_{K \to +\infty} \left[ \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}^h_\sigma\, g(\omega_{T_k}, a_{T_k}) \right]. \tag{5}$$

By combining (1), (4), and (5), we obtain $R(\sigma, h) = R(\tilde{\sigma}, 1)$.

Proof of Corollary 1. Given $h_1, h_2$ with $0 < h_1 < h_2 \leq 1$, we consider $G_{h_1}$ as the POMDP with stage duration $h_1$ relative to the base POMDP $G_1$. However, we can also consider $G_{h_1}$ as the POMDP with stage duration $h_1/h_2$ relative to the base POMDP $G_{h_2}$. Indeed, for the transition law $P_{h_1}$ in $G_{h_1}$, we have

$$P_{h_1} = (1 - h_1)\, Id + h_1 P_1 = \left(1 - \frac{h_1}{h_2}\right) Id + \frac{h_1}{h_2}\big((1 - h_2)\, Id + h_2 P_1\big) = \left(1 - \frac{h_1}{h_2}\right) Id + \frac{h_1}{h_2}\, P_{h_2}.$$

Consequently, Theorem 1 implies that there is a strategy $\tilde{\sigma}$ in $G_{h_2}$ such that $R(\sigma, h_1) = R(\tilde{\sigma}, h_2)$.

Proof of Corollary 2. This follows from Corollary 1 and Remark 2. For any $0 < h_1 \leq h_2 \leq 1$, we have

$$V(h_1) = \sup\{ R(\sigma, h_1) \mid \sigma \text{ is a strategy in } G_{h_1} \} \leq \sup\{ R(\sigma, h_2) \mid \sigma \text{ is a strategy in } G_{h_2} \} = V(h_2),$$

where the inequality follows from Corollary 1.

Proof of Corollary 3. This follows directly from Corollary 2 and the fact that the stage payoff function $g$ is bounded.
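The kernel identity used in the proof of Corollary 1 above is easy to verify numerically; the one-action transition matrix below is an illustrative assumption:

```python
import numpy as np

h1, h2 = 0.2, 0.5
Id = np.eye(2)
P1 = np.array([[0.3, 0.7],   # a hypothetical one-action transition matrix of the base POMDP G_1
               [0.6, 0.4]])

P_h1 = (1 - h1) * Id + h1 * P1              # kernel of G_{h1} relative to G_1
P_h2 = (1 - h2) * Id + h2 * P1              # kernel of G_{h2} relative to G_1
rhs = (1 - h1/h2) * Id + (h1/h2) * P_h2     # G_{h1} as stage duration h1/h2 relative to G_{h2}
print(np.allclose(P_h1, rhs))               # True
```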
4.3 Proof of Lemma 1

To avoid ambiguity between the two distinct stochastic processes, we adopt the following notational convention in the proof of Lemma 1. The state $\omega_k$ (resp. action $a_k$, signal $s_k$) refers exclusively to the $k$-th stage state (resp. action, signal) in $G_1$. Conversely, the state $\omega'_k$ (resp. action $a'_k$, signal $s'_k$) refers exclusively to the $k$-th stage state (resp. action, signal) in $G_h$.

In the proof of Lemma 1, when manipulating conditional probabilities, we adopt the standard convention that $0 \times \text{undefined} = 0$. Consequently, whenever we evaluate a conditional probability of the form $\mathbb{P}(A \mid B)$, it is implicitly assumed that $\mathbb{P}(B) > 0$. In cases where $\mathbb{P}(B) = 0$, the conditional probability $\mathbb{P}(A \mid B)$ is technically undefined; however, because such terms only appear multiplied by $\mathbb{P}(B) = 0$ (such as in the law of total probability or Bayes' theorem), they contribute zero to the overall expression. Thus, we do not need to explicitly evaluate or restrict our summations to exclude these zero-probability events.

Proof of Lemma 1. We will prove a more general statement: for any $\omega \in \Omega$, $a \in A$, $k \in \mathbb{N}^*$, and any history $\eta_k$ of length $k$, we have

$$\mathbb{P}^1_{\tilde{\sigma}}(\eta_k, \omega_k = \omega, a_k = a) = \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k} = \omega, a'_{T_k} = a). \tag{6}$$

This identity implies the lemma. Indeed, by (6) we have

$$\mathbb{P}^1_{\tilde{\sigma}}(\omega_k = \omega, a_k = a) = \sum_{\eta_k} \mathbb{P}^1_{\tilde{\sigma}}(\eta_k, \omega_k = \omega, a_k = a) = \sum_{\eta_k} \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k} = \omega, a'_{T_k} = a) = \mathbb{P}^h_\sigma(\omega'_{T_k} = \omega, a'_{T_k} = a), \tag{7}$$

where the sums run over all histories $\eta_k$ of length $k$ in $G_1$. Subsequently, we obtain by (7)

$$\mathbb{E}^1_{\tilde{\sigma}}\, g(\omega_k, a_k) = \sum_{\omega \in \Omega,\, a \in A} \mathbb{P}^1_{\tilde{\sigma}}(\omega_k = \omega, a_k = a) \cdot g(\omega, a) = \sum_{\omega \in \Omega,\, a \in A} \mathbb{P}^h_\sigma(\omega'_{T_k} = \omega, a'_{T_k} = a) \cdot g(\omega, a) = \mathbb{E}^h_\sigma\, g(\omega'_{T_k}, a'_{T_k}).$$

The rest of the proof is dedicated to justifying (6). The proof is by induction on $k$. The base case $k = 1$ holds because the initial probability distribution is the same in both $G_1$ and $G_h$. We now suppose that (6) holds for some $k \in \mathbb{N}^*$, and we will prove that it also holds for $k+1$. Let $\eta_{k+1} = (\eta_k, \bar{a}, s)$, where $\bar{a}$ denotes the $k$-th stage action recorded in the history. In $G_1$, we have

$$\mathbb{P}^1_{\tilde{\sigma}}(\eta_{k+1}, \omega_{k+1} = \omega, a_{k+1} = a) = \mathbb{P}^1_{\tilde{\sigma}}(\eta_{k+1}, \omega_{k+1} = \omega) \cdot \tilde{\sigma}(\eta_{k+1})(a). \tag{8}$$

We have

$$\begin{aligned} \mathbb{P}^1_{\tilde{\sigma}}(\eta_{k+1}, \omega_{k+1} = \omega) &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^1_{\tilde{\sigma}}(\eta_k, \omega_k = \tilde{\omega}, a_k = \bar{a}, \omega_{k+1} = \omega, s_{k+1} = s) \\ &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^1_{\tilde{\sigma}}(\eta_k, \omega_k = \tilde{\omega}, a_k = \bar{a}) \cdot \mathbb{P}^1_{\tilde{\sigma}}(\omega_{k+1} = \omega, s_{k+1} = s \mid \eta_k, \omega_k = \tilde{\omega}, a_k = \bar{a}) \\ &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^1_{\tilde{\sigma}}(\eta_k, \omega_k = \tilde{\omega}, a_k = \bar{a}) \cdot P(\omega \mid \bar{a}, \tilde{\omega}) \cdot \mathbf{1}_{f(\omega) = s}. \end{aligned} \tag{9}$$

In $G_h$, it follows from the definition of conditional probability that

$$\mathbb{P}^h_\sigma(\mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega, a'_{T_{k+1}} = a) = \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) \cdot \mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega). \tag{10}$$

Because the state is frozen between stage $T_k + 1$ and stage $T_{k+1}$, we have

$$\omega'_{T_{k+1}} = \omega'_{T_k + 1}. \tag{11}$$

By the construction of the event $\{\mathcal{H}^{fil}_{k+1} = \eta_{k+1}\}$, we can write it as a union of disjoint events:

$$\{\mathcal{H}^{fil}_{k+1} = \eta_{k+1}\} = \bigcup_{\tilde{\omega} \in \Omega} \{\mathcal{H}^{fil}_k = \eta_k,\ \omega'_{T_k} = \tilde{\omega},\ a'_{T_k} = \bar{a},\ s'_{T_k+1} = s\}. \tag{12}$$

Using (11) and (12), we obtain

$$\begin{aligned} \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k+1} = \omega, \omega'_{T_k} = \tilde{\omega}, a'_{T_k} = \bar{a}, s'_{T_k+1} = s) \\ &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k} = \tilde{\omega}, a'_{T_k} = \bar{a}) \cdot \mathbb{P}^h_\sigma(\omega'_{T_k+1} = \omega, s'_{T_k+1} = s \mid \mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k} = \tilde{\omega}, a'_{T_k} = \bar{a}) \\ &= \sum_{\tilde{\omega} \in \Omega} \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_k = \eta_k, \omega'_{T_k} = \tilde{\omega}, a'_{T_k} = \bar{a}) \cdot P(\omega \mid \bar{a}, \tilde{\omega}) \cdot \mathbf{1}_{f(\omega) = s}. \end{aligned} \tag{13}$$

Thus, by (9), (13), and the induction hypothesis, we have

$$\mathbb{P}^1_{\tilde{\sigma}}(\eta_{k+1}, \omega_{k+1} = \omega) = \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega). \tag{14}$$

By the law of total probability, we obtain

$$\mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) = \sum_{\eta'_{k+1}} \mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \eta'_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) \cdot \mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega), \tag{15}$$

where the sum runs over all histories $\eta'_{k+1}$ of length $T_{k+1}$ in $G_h$. Since $a'_{T_{k+1}}$ depends by definition only on the history of length $T_{k+1}$, we have

$$\mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \eta'_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) = \mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \eta'_{k+1}). \tag{16}$$

By Bayes' theorem, we obtain

$$\mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) = \frac{\mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \eta'_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}) \cdot \mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1})}{\mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1})}. \tag{17}$$

Note that $\eta'_{k+1}$ contains all the information in $\mathcal{H}^{fil}_{k+1}$. Let $\mathcal{H}^{rem}_{k+1}$ denote the supplementary information in $\eta'_{k+1}$ not present in $\mathcal{H}^{fil}_{k+1}$, consisting specifically of the exact holding durations $N_i$ for $1 \leq i \leq k+1$, and the intermediate actions $a'_j$ taken during those waiting periods. By construction, we have

$$\mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \eta'_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}) = \mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \mathcal{H}^{rem}_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}). \tag{18}$$

We claim that $\omega'_{T_{k+1}}$ is conditionally independent of $\mathcal{H}^{rem}_{k+1}$ given $\mathcal{H}^{fil}_{k+1} = \eta_{k+1}$; that is,

$$\mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \mathcal{H}^{rem}_{k+1}, \mathcal{H}^{fil}_{k+1} = \eta_{k+1}) = \mathbb{P}^h_\sigma(\omega'_{T_{k+1}} = \omega \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}). \tag{19}$$

This conditional independence holds due to the structure of the delayed transitions. By definition, the waiting periods $N_i$ are generated by an independent sequence of geometrically distributed random variables, and are therefore independent of the underlying state sequence. Furthermore, during any waiting period $N_i$, the state remains constant. Consequently, the signals observed during these periods simply repeat the last signal recorded in the filtered history $\mathcal{H}^{fil}_{k+1}$. Because the decision maker only observes signals and not the state itself, the intermediate actions in $\mathcal{H}^{rem}_{k+1}$ generated by the strategy $\sigma$ depend only on the sequence of received signals and past actions. Since these signals are completely determined by the filtered history $\mathcal{H}^{fil}_{k+1}$ and the state-independent waiting periods $N_i$, the generation of $\mathcal{H}^{rem}_{k+1}$ is conditionally independent of the true, unobserved state $\omega'_{T_{k+1}}$. Thus, conditioning on $\mathcal{H}^{rem}_{k+1}$ provides no additional information regarding the distribution of $\omega'_{T_{k+1}}$, implying that equality (19) holds.

Thus we have, by (17), (18), and (19),

$$\mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) = \mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}). \tag{20}$$
From (15), (16), and (20), we obtain

$$\mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega) = \sum_{\eta'_{k+1}} \mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \eta'_{k+1}) \cdot \mathbb{P}^h_\sigma(\eta'_{k+1} \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}) = \mathbb{P}^h_\sigma(a'_{T_{k+1}} = a \mid \mathcal{H}^{fil}_{k+1} = \eta_{k+1}) = \tilde{\sigma}(\eta_{k+1})(a). \tag{21}$$

Finally, by combining (8), (10), (14), and (21), we have

$$\mathbb{P}^1_{\tilde{\sigma}}(\eta_{k+1}, \omega_{k+1} = \omega, a_{k+1} = a) = \mathbb{P}^h_\sigma(\mathcal{H}^{fil}_{k+1} = \eta_{k+1}, \omega'_{T_{k+1}} = \omega, a'_{T_{k+1}} = a).$$

4.4 Proof of Lemma 2

Proof of Lemma 2. We have

$$\mathbb{E}^h_\sigma \left( \sum_{j=T_{k-1}+1}^{T_k} g(\omega_j, a_j) \right) = \mathbb{E}^h_\sigma \left[ \sum_{m=1}^{\infty} \mathbf{1}_{\{N_k \geq m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right] = \sum_{m=1}^{\infty} \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{N_k \geq m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right], \tag{22}$$

where the last equality holds by Fubini's theorem (since $g$ is bounded and $\mathbb{E}(N_i) = 1/h < \infty$, the tail sum formula implies absolute convergence). We also have

$$\mathbb{E}^h_\sigma\, g(\omega_{T_k}, a_{T_k}) = \mathbb{E}^h_\sigma \left[ \sum_{m=1}^{\infty} \mathbf{1}_{\{N_k = m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right] = \sum_{m=1}^{\infty} \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{N_k = m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right]. \tag{23}$$

Now, the event $\{N_k = m\}$ can be considered as the intersection of two independent events:

$$\{N_k = m\} = \{N_k \geq m\} \cap \{X_{T_{k-1}+m} = 1\}.$$

Note that the random variable $X_{T_{k-1}+m}$ depends only on $h$ and is entirely independent of the $N_i$, as well as of the states and actions up until stage $T_{k-1}+m$. Consequently, the random variable $\mathbf{1}_{\{X_{T_{k-1}+m} = 1\}}$ is independent of the random variable $\mathbf{1}_{\{N_k \geq m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m})$. Hence, we have

$$\begin{aligned} \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{N_k = m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right] &= \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{X_{T_{k-1}+m} = 1\}} \right] \cdot \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{N_k \geq m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right] \\ &= h \cdot \mathbb{E}^h_\sigma \left[ \mathbf{1}_{\{N_k \geq m\}} \cdot g(\omega_{T_{k-1}+m}, a_{T_{k-1}+m}) \right]. \end{aligned} \tag{24}$$

The lemma now follows directly from (22), (23), and (24).

4.5 Proof of Lemma 3

Proof of Lemma 3. Let $M := \max_{\omega, a} |g(\omega, a)|$. By the triangle inequality, we have

$$\mathbb{E}^h_\sigma \left( \left| \sum_{j=1}^{T_k} g(\omega_j, a_j) - \sum_{j=1}^{t_k} g(\omega_j, a_j) \right| \right) \leq M \cdot \mathbb{E}^h_\sigma \left( |T_k - t_k| \right). \tag{25}$$

By Jensen's inequality, it follows that

$$\mathbb{E}^h_\sigma \left| T_k - \frac{k}{h} \right| = \mathbb{E}^h_\sigma \left| T_k - \mathbb{E}^h_\sigma(T_k) \right| \leq \sqrt{\mathbb{E}^h_\sigma \left[ |T_k - \mathbb{E}^h_\sigma(T_k)|^2 \right]} = \sqrt{\mathrm{Var}(T_k)} = \sqrt{\frac{1-h}{h^2} \cdot k}. \tag{26}$$

Finally, we have

$$\begin{aligned} \left| \mathbb{E}^h_\sigma \left( \frac{h}{k} \sum_{j=1}^{T_k} g(\omega_j, a_j) - \frac{1}{t_k} \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right| &= \left| \mathbb{E}^h_\sigma \left( \frac{h}{k} \sum_{j=1}^{T_k} g(\omega_j, a_j) - \frac{h}{k} \sum_{j=1}^{t_k} g(\omega_j, a_j) + \frac{h}{k} \sum_{j=1}^{t_k} g(\omega_j, a_j) - \frac{1}{t_k} \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right| \\ &\leq \left| \mathbb{E}^h_\sigma \left[ \frac{h}{k} \left( \sum_{j=1}^{T_k} g(\omega_j, a_j) - \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right] \right| + \left| \mathbb{E}^h_\sigma \left( \left( \frac{h}{k} - \frac{1}{t_k} \right) \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right|. \end{aligned} \tag{27}$$

We now evaluate each term of the above sum. We have, by (25) and (26),

$$\left| \mathbb{E}^h_\sigma \left[ \frac{h}{k} \left( \sum_{j=1}^{T_k} g(\omega_j, a_j) - \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right] \right| \leq \frac{Mh}{k}\, \mathbb{E}^h_\sigma(|T_k - t_k|) \leq \frac{Mh}{k}\, \mathbb{E}^h_\sigma \left| T_k - \frac{k}{h} \right| + \frac{Mh}{k} \left| t_k - \frac{k}{h} \right| \leq M \sqrt{\frac{1-h}{k}} + \frac{Mh}{k} \xrightarrow{k \to \infty} 0. \tag{28}$$

We also have

$$\left| \mathbb{E}^h_\sigma \left( \left( \frac{h}{k} - \frac{1}{t_k} \right) \sum_{j=1}^{t_k} g(\omega_j, a_j) \right) \right| \leq \left| \frac{h}{k} - \frac{1}{t_k} \right| \cdot \mathbb{E}^h_\sigma \left( \left| \sum_{j=1}^{t_k} g(\omega_j, a_j) \right| \right) \leq \left| \frac{h}{k} - \frac{1}{t_k} \right| \cdot M t_k = M \left| \frac{h t_k}{k} - 1 \right| \xrightarrow{k \to \infty} 0. \tag{29}$$

The lemma now follows from (27), (28), (29), and the standard fact that $\lim (x_n - y_n) = 0$ implies $\liminf x_n = \liminf y_n$.
4.6 Proof of Lemma 4

Proof of Lemma 4. Since $\{x_{n_k}\}_{k=1}^\infty$ is a subsequence of $\{x_n\}_{n=1}^\infty$, we have

$$\liminf_{n \to \infty} x_n \leq \liminf_{k \to \infty} x_{n_k}. \tag{30}$$

We now prove the reverse inequality. By definition, there exists a subsequence $\{x_{m_k}\}_{k=1}^\infty$ converging to $\liminf_{n \to \infty} x_n$. For each index $m_j$ of $\{x_{m_k}\}_{k=1}^\infty$, let $n_{k_j}$ be the largest index of $\{x_{n_k}\}_{k=1}^\infty$ such that $n_{k_j} \leq m_j$. By the triangle inequality, we have

$$|x_{m_j} - x_{n_{k_j}}| \leq \sum_{i=n_{k_j}}^{m_j - 1} |x_{i+1} - x_i| \leq (m_j - n_{k_j}) \cdot \sup_{i \geq n_{k_j}} |x_{i+1} - x_i| \leq M \cdot \sup_{i \geq n_{k_j}} |x_{i+1} - x_i| \xrightarrow{j \to \infty} 0. \tag{31}$$

Since $n_{k_j} \to \infty$ as $j \to \infty$, the set of accumulation points of the sequence $\{x_{n_{k_j}}\}_{j=1}^\infty$ is contained in the set of accumulation points of the sequence $\{x_{n_k}\}_{k=1}^\infty$. Together, this inclusion and (31) imply

$$\liminf_{k \to \infty} x_{n_k} \leq \liminf_{j \to \infty} x_{n_{k_j}} = \liminf_{j \to \infty} x_{m_j} = \lim_{j \to \infty} x_{m_j} = \liminf_{n \to \infty} x_n, \tag{32}$$

where the last equality follows from the definition of the subsequence $\{x_{m_j}\}_{j=1}^\infty$. The lemma now follows directly from (30) and (32).
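Lemma 4 can also be sanity-checked numerically. In the sketch below, $x_n = \sin(\sqrt{n})$ has vanishing increments, the subsequence indices have gaps bounded by $M = 3$, and both liminfs equal $-1$; the sequence and the subsequence are illustrative choices, not taken from the paper.

```python
import numpy as np

n = np.arange(1, 2_000_000)
x = np.sin(np.sqrt(n))   # satisfies x_{n+1} - x_n -> 0, since the increments of sqrt(n) vanish
sub = x[::3]             # subsequence with indices n_k = 1, 4, 7, ...; gaps bounded by M = 3

# Over a long tail, the minimum approximates the liminf; both values approach -1.
print(x[1_000_000:].min())
print(sub[333_334:].min())
```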
5 Concluding remarks

5.1 Fully observed state

This subsection is dedicated to the case where the signal fully reveals the state, meaning that $S = \Omega$ and $f(\omega) = \omega$. In such a case, (Sorin and Vigeral, 2016, Proposition 5.2) proves that for all $\lambda \in (0,1]$ and $h \in (0,1]$, we have

$$V_\lambda(h) = V_{\frac{\lambda}{1 + \lambda - \lambda h}}(1).$$

This implies that for any $h \in (0,1]$, we have

$$V(h) = \lim_{\lambda \to 0} V_\lambda(h) = \lim_{\lambda \to 0} V_{\frac{\lambda}{1 + \lambda - \lambda h}}(1) = \lim_{\lambda \to 0} V_\lambda(1) = V(1).$$

Hence, $V$ is a constant function in this particular case.

5.2 Continuity of $V(h)$

In this subsection, we examine whether the function $h \mapsto V(h)$ is continuous.

Proposition 1. The function $h \mapsto V(h)$ is lower semi-continuous on $(0,1)$.

Proof. If $h' \in (0,1)$ and $\varepsilon > 0$ is sufficiently small, then the supports of the transition probabilities of the POMDPs in the family $\{G_h\}_{h \in (h'-\varepsilon, h'+\varepsilon)}$ are identical. The proposition now follows directly from (Chatterjee et al., 2022, Corollary 3.10).

However, it is possible to introduce new transitions when moving from $h = 1$ to $h < 1$, which implies that the result of Chatterjee et al. (2022) cannot be applied at $h = 1$. In fact, it is possible that $V(h)$ is not lower semi-continuous at 1, even in the state-blind case.

Example 1 (A POMDP in which $V(h)$ is not lower semi-continuous at 1). Consider a POMDP $G_1$ with:

- the signal set $\{s_1\}$;
- the action set $\{a, b\}$;
- the state set $\{\omega_1, \omega_2, \omega_3\}$;
- the initial state $\omega_1$;
- the stage payoff function $g(a, \omega_1) = g(b, \omega_1) = g(a, \omega_2) = g(b, \omega_2) = 1$ and $g(a, \omega_3) = g(b, \omega_3) = 0$;
- the transitions $P(\omega_1 \mid b, \omega_2) = P(\omega_2 \mid a, \omega_1) = P(\omega_3 \mid b, \omega_1) = P(\omega_3 \mid a, \omega_2) = P(\omega_3 \mid a, \omega_3) = P(\omega_3 \mid b, \omega_3) = 1$.

See Figure 1 for a visual representation.

Now, given the base POMDP $G_1$, consider the POMDP with stage duration $G_h$. In $G_1$, the decision maker can achieve a payoff of 1 by playing the strategy $(a, b, a, b, \ldots)$. However, as soon as $h < 1$, the state stays frozen with probability $1-h$ at every stage. Because of this, the decision maker's belief about the state (conditional on the state not being $\omega_3$) converges to $\frac{1}{2}\omega_1 + \frac{1}{2}\omega_2$. Since there is only a single signal ($s_1$), the decision maker cannot detect when a state transition fails to occur. Once his belief gets sufficiently close to $\frac{1}{2}\omega_1 + \frac{1}{2}\omega_2$, any chosen action carries a probability of approximately $0.5$ of incorrectly guessing the current state. Because of that, the decision maker eventually reaches the absorbing state $\omega_3$ with payoff 0. Thus we have

$$V(h) = \begin{cases} 1, & \text{if } h = 1; \\ 0, & \text{if } h < 1. \end{cases}$$

Figure 1: The POMDP $G_1$ (the two payoff-1 states $\omega_1, \omega_2$, the absorbing payoff-0 state $\omega_3$, and the transitions listed above, labelled by the actions $a$ and $b$).
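A minimal simulation of Example 1 illustrates the collapse of the long-run average payoff once $h < 1$; the encoding of states and actions as integers is the only assumption added here.

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0, 1, 2 stand for omega_1, omega_2, omega_3; actions 0, 1 stand for a, b.
nxt = np.array([[1, 2],    # from omega_1: a -> omega_2, b -> omega_3
                [2, 0],    # from omega_2: a -> omega_3, b -> omega_1
                [2, 2]])   # omega_3 is absorbing under both actions
payoff = np.array([1.0, 1.0, 0.0])

def average_payoff(h, n_stages=100_000):
    """Average payoff of the alternating strategy (a, b, a, b, ...) in G_h."""
    state, total = 0, 0.0
    for stage in range(n_stages):
        action = stage % 2                # play a, b, a, b, ...
        total += payoff[state]
        if rng.random() < h:              # with probability h the transition follows P,
            state = nxt[state, action]    # otherwise the state stays frozen
    return total / n_stages

print(average_payoff(1.0))   # 1.0: in G_1 the alternation always matches the current state
print(average_payoff(0.9))   # close to 0: one frozen stage desynchronizes the alternation
```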
Remark 8. It is not yet known whether $h \mapsto V(h)$ is continuous on $(0,1)$. In general, it is possible for the asymptotic value to be discontinuous even if no new transitions are introduced; see (Chatterjee et al., 2022, Proposition 3.11 and its proof). However, the author was unable to adapt the counterexample from Chatterjee et al. (2022) to construct one for $V(h)$. △

6 Acknowledgments

The author is grateful to Guillaume Vigeral for his help during the writing of this article. The author is also grateful to Raimundo Saona and Eilon Solan for useful discussions.

7 Bibliography

Arapostathis, A., V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus (1993). Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM Journal on Control and Optimization 31, 282-344.

Cardaliaguet, P., C. Rainer, D. Rosenberg, and N. Vieille (2016). Markov games with frequent actions and incomplete information—the limit case. Mathematics of Operations Research 41, 49-71.

Chatterjee, K., R. Saona, and B. Ziliotto (2022). Finite-memory strategies in POMDPs with long-run average objectives. Mathematics of Operations Research 47, 100-119.

Drake, A. W. (1962). Observation of a Markov Process Through a Noisy Channel. Ph.D. thesis, Massachusetts Institute of Technology.

Gensbittel, F. (2016). Continuous-time limit of dynamic games with incomplete information and a more informed player. International Journal of Game Theory 45, 321-352.

Neyman, A. (2013). Stochastic games with short-stage duration. Dynamic Games and Applications 3, 236-278.

Novikov, I. (2024). Asymptotic value in zero-sum stochastic games with vanishing stage duration and public signals. Preprint.

Novikov, I. (2025a). Zero-sum state-blind stochastic games with vanishing stage duration. Dynamic Games and Applications 15, 1094-1115.

Novikov, I. (2025b). Zero-Sum Stochastic Games with Vanishing Stage Duration and Public Signals. Ph.D. thesis, Université Paris-Dauphine. https://theses.hal.science/tel-05450848v1/document

Renault, J. (2006). The value of Markov chain games with lack of information on one side. Mathematics of Operations Research 31, 490-512.

Rosenberg, D., E. Solan, and N. Vieille (2002). Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics 30, 1178-1193.

Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences 39, 1095-1100.

Sorin, S. (2018). Limit value of dynamic zero-sum games with vanishing stage duration. Mathematics of Operations Research 43, 51-63.

Sorin, S. and G. Vigeral (2016). Operator approach to values of stochastic games with varying stage duration. International Journal of Game Theory 45, 389-410.

Ziliotto, B. (2016). Zero-sum repeated games: counterexamples to the existence of the asymptotic value and the conjecture maxmin = lim $v_n$. The Annals of Probability 44, 1107-1133.
