Structure and Optimality of Myopic Policy in Opportunistic Access with Noisy Observations

Abstract: A restless multi-armed bandit problem that arises in multichannel opportunistic communications is considered, where channels are modeled as independent and identical Gilbert-Elliot channels and channel state observations are subject to errors. A simple structure of the myopic policy is established under a certain condition on the false alarm probability of the channel state detector. It is shown that the myopic policy has a semi-universal structure that reduces channel selection to a simple round-robin procedure and obviates the need to know the underlying Markov transition probabilities. The optimality of the myopic policy is proved for the case of two channels and conjectured for the general case based on numerical examples.

Authors: Qing Zhao, Bhaskar Krishnamachari

Index Terms: Myopic policy, opportunistic access, restless multi-armed bandit, cognitive radio.

I. INTRODUCTION

We consider the following stochastic control problem that arises in multichannel opportunistic communications. Assume that there are N independent and stochastically identical Gilbert-Elliot channels [1]. As illustrated in Fig. 1, the state of a channel, "good" or "bad", indicates the desirability of accessing this channel and determines the resulting reward. The transitions between these two states follow a discrete-time Markov chain with transition probabilities {pij} (i, j = 0, 1). This channel model has been commonly used to abstract physical channels with memory (see [2], [3] and references therein).
Consider, for example, the emerging application of cognitive radios for opportunistic spectrum access, where secondary users search in the spectrum for idle channels temporarily unused by primary users [4]. For this application, the good state represents an idle channel while the bad state represents an occupied channel (see Footnote 1).

[This work was supported by the Army Research Laboratory CTA on Communication and Networks under Grant DAAD19-01-2-0011 and by the National Science Foundation under Grants CNS-0627090, ECS-0622200, and CNS-0347621. Part of this work was presented at the 2nd International Conference on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom), August 2007. Q. Zhao (corresponding author; phone: 1-530-752-7390, fax: 1-530-752-8428) is with the Department of Electrical and Computer Engineering, University of California, Davis, CA 95616. Email: qzhao@ece.ucdavis.edu. B. Krishnamachari is with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089. Email: bkrishna@usc.edu. Submitted to IEEE Transactions on Automatic Control in February 2008; revised in August 2008.]

Fig. 1. The Gilbert-Elliot channel model (states 0 (bad) and 1 (good); transition probabilities p00, p01, p10, p11).

In each time slot, a user chooses one of the N channels to sense and subsequently access if the chosen channel is sensed to be in the good state. Sensing is subject to errors: a good channel may be sensed as bad and vice versa. Accessing a good channel results in a unit reward; no access or accessing a bad channel yields zero reward. The design objective is the optimal sensing policy for channel selection that maximizes the expected long-term reward. This problem can be formulated as a partially observable Markov decision process (POMDP) for generally correlated channels, or as a restless multi-armed bandit process for independent channels.
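The Gilbert-Elliot dynamics above are easy to simulate. The following sketch is our own illustrative code, not part of the paper (all function names are ours); it evolves N independent, stochastically identical two-state channels:

```python
import random

def step(state, p01, p11, rng):
    # One Gilbert-Elliot transition: from "bad" (0) the channel turns "good" (1)
    # with probability p01; from "good" (1) it stays "good" with probability p11.
    p_good = p11 if state == 1 else p01
    return 1 if rng.random() < p_good else 0

def simulate(n_channels, n_slots, p01, p11, seed=0):
    # Evolve N independent, stochastically identical channels for n_slots slots
    # and return the full state history (n_slots + 1 rows of N states each).
    rng = random.Random(seed)
    states = [rng.choice((0, 1)) for _ in range(n_channels)]
    history = [list(states)]
    for _ in range(n_slots):
        states = [step(s, p01, p11, rng) for s in states]
        history.append(list(states))
    return history
```

In the slot structure above, the user never sees this full history; it observes a single channel per slot through a noisy detector.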
It has been shown in [5] that obtaining the optimal policy for a general restless multi-armed bandit problem is PSPACE-hard. For special classes of restless bandit processes, however, simple structural policies may exist that achieve optimality with low complexity. As shown in this paper, for the multichannel opportunistic access problem stated above, the myopic policy has a simple and robust structure that reduces channel selection to a round-robin procedure when the false alarm probability of the channel state detector is below a certain value. This structure reveals that the myopic policy does not require knowledge of the transition probabilities of the Markovian model beyond the order of p11 and p01. The myopic policy thus automatically tracks variations in the channel model provided that the order of p11 and p01 remains unchanged. Furthermore, exploiting this simple structure, we prove that the myopic policy is optimal for N = 2. Numerical examples (see Footnote 2) suggest its optimality for general N.

Footnote 1: When the primary network employs load balancing across channels, the occupancy processes of all channels can be considered stochastically identical.

Footnote 2: Actions given by the myopic policy and the optimal policy are compared numerically for randomly chosen p11 and p01 and N = 3, 4, and 5. All examples show the equivalence between the myopic policy and the optimal policy.

This technical note extends our earlier work in [6], which assumes perfect observation of channel states. As shown in Sections II and III, communication constraints, namely, synchronization in channel selection
between the transmitter and its receiver, require changes in the problem formulation when observations are imperfect, and uncertainties in the state of sensed channels complicate the proofs for the structure and optimality of the myopic policy.

II. PROBLEM FORMULATION

A. System Model

Let S(t) ≜ [S1(t), ..., SN(t)] denote the channel states, where Sn(t) ∈ {0 (bad), 1 (good)} is the state of channel n in slot t. At the beginning of each slot, the user first decides which of the N channels to choose for potential access. Once a channel (say channel n) is chosen, the user detects the channel state, which can be considered a binary hypothesis test (see Footnote 3):

H0: Sn(t) = 1 (good)   vs.   H1: Sn(t) = 0 (bad).

The performance of channel state detection is characterized by the probability of false alarm ǫ and the probability of miss detection δ:

ǫ ≜ Pr{decide H1 | H0 is true},   δ ≜ Pr{decide H0 | H1 is true}.

For example, in the application of cognitive radios for opportunistic spectrum access, the user can employ an energy detector to detect the presence of primary signals. If the measured energy is above a certain threshold, the channel is detected as bad (i.e., busy); otherwise, the channel is considered idle and suitable for transmission. The user transmits over the chosen channel if and only if the channel is detected as in the good state. Thus, one of four possible events occurs in each slot: (i) the chosen channel is good and is correctly detected as such, resulting in a successful transmission; (ii) a false alarm occurs, and a communication opportunity is missed; (iii) the chosen channel is bad and is correctly detected, and the transmitter refrains from transmitting; (iv) a miss detection occurs, resulting in a failed transmission. Only in the first event is a unit reward accrued in this slot.
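The four events and their rewards can be captured in a few lines. This sketch is our own illustration, not the authors' code; ǫ and δ are the detector error probabilities defined above:

```python
import random

def slot_outcome(channel_state, eps, delta, rng):
    # Sense-then-access in one slot. Returns (event, reward, ack), where event
    # is one of "i"-"iv" as enumerated in the text, reward is 1 only for a
    # successful transmission, and ack is the feedback the transmitter receives.
    if channel_state == 1:              # good channel (hypothesis H0)
        if rng.random() < eps:          # false alarm: good sensed as bad
            return "ii", 0, 0           # opportunity missed, no transmission
        return "i", 1, 1                # correct detection, successful transmission
    else:                               # bad channel (hypothesis H1)
        if rng.random() < delta:        # miss detection: bad sensed as good
            return "iv", 0, 0           # failed transmission
        return "iii", 0, 0              # correct detection, transmitter stays silent
```

Note that events (ii), (iii), and (iv) all produce the same feedback (no ACK); this ambiguity is exactly the observation model used in the sequel.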
The objective is to maximize the average reward (throughput) over a horizon of T slots by judiciously choosing a sensing policy that governs channel selection in each slot (see Footnote 4).

Footnote 3: We consider here the nontrivial cases with p01 and p11 in the open interval (0, 1). When they take the special value of 0 or 1, channel state detection can be simplified. Extensions to such special cases are straightforward.

Since failed transmissions may occur, acknowledgements are necessary to ensure guaranteed delivery. Specifically, when the receiver successfully receives a packet (event (i)), it sends an acknowledgement to the transmitter at the end of the slot. Otherwise, the receiver does nothing; i.e., a NAK is defined as the absence of an ACK, which occurs when the transmitter did not transmit (events (ii) and (iii)) or transmitted over a bad channel (event (iv)). We assume that acknowledgements are received without error since acknowledgements are always transmitted over a good/idle channel.

B. Value Function and Belief Update

While the full system state S(t) = [S1(t), ..., SN(t)] is not observable, the user can infer the state from its decision and observation history. A sufficient statistic for optimal decision making is given by the conditional probability that each channel is in state 1 given all past decisions and observations [8]. Referred to as the belief vector (or information state), this sufficient statistic is denoted by Ω(t) ≜ [ω1(t), ..., ωN(t)], where ωi(t) is the conditional probability that Si(t) = 1.
In order to ensure that the user and its intended receiver tune to the same channel in each slot, channel selections should be based on common observations: the acknowledgement K(t) ∈ {0 (NAK), 1 (ACK)} in each slot rather than the detection outcome at the transmitter. Given the action a and observation Ka(t) = k (k = 0, 1), the belief vector in slot t + 1 can be obtained via the Bayes rule:

ωi(t+1) = p11                                    if a = i, Ka(t) = 1,
ωi(t+1) = Γ( ǫωi(t) / (ǫωi(t) + 1 − ωi(t)) )     if a = i, Ka(t) = 0,     (1)
ωi(t+1) = Γ( ωi(t) )                             if a ≠ i,

where the operator Γ(·) is defined as Γ(x) ≜ x p11 + (1 − x) p01.

A sensing policy π specifies a sequence of functions π = [π1, π2, ..., πT], where πt maps a belief vector Ω(t) to a sensing action a(t) ∈ {1, ..., N} for slot t. We thus arrive at the stochastic control problem (2) below.

Footnote 4: Note that the design should often be subject to a constraint on the probability of accessing a bad channel, which may cause interference or waste energy. For example, in the application of cognitive radios for opportunistic spectrum access, transmitting over a bad (busy) channel leads to a collision with primary users and should be limited below a prescribed level. This constrained stochastic control problem requires the joint design of the channel state detector (i.e., how to choose the detection threshold to trade off false alarms with miss detections), the access policy that decides the transmission probability based on the imperfect detection outcome, and the sensing policy for channel selection. It has been shown in [7] under a general correlated channel model that the optimal detector is the Neyman-Pearson detector with the probability of miss detection given by the maximum allowable probability of collision, and the optimal access policy is to simply trust the detection outcome: transmit if and only if the channel is detected as good.
The optimal sensing policy can then be designed using this optimal detector and the optimal access policy, without the constraint on accessing a bad channel. This is the problem addressed in this paper.

π* = arg max_π E_π[ Σ_{t=1}^{T} R_{πt(Ω(t))}(t) | Ω(1) ],     (2)

where R_{πt(Ω(t))}(t) is the reward obtained when the belief is Ω(t) and channel a = πt(Ω(t)) is selected, and Ω(1) is the initial belief vector. This problem falls into the general model of POMDPs. It can also be considered a restless multi-armed bandit problem by treating the belief value of each channel as the state of the corresponding arm.

Let Vt(Ω) be the value function, which represents the maximum expected remaining reward that can be accrued starting from slot t when the current belief vector is Ω. We have the following optimality equations:

VT(Ω) = max_{a=1,...,N} ωa(1 − ǫ),
Vt(Ω) = max_{a=1,...,N} { ωa(1 − ǫ) + ωa(1 − ǫ) V_{t+1}(T(Ω | a, 1)) + (1 − ωa(1 − ǫ)) V_{t+1}(T(Ω | a, 0)) },

where T(Ω | a, i) denotes the updated belief vector for slot t + 1 after incorporating the action a and observation K(t) = i as given in (1). In theory, the optimal policy π* can be obtained by solving the above dynamic program. Unfortunately, this approach is computationally prohibitive due to the impact of the current action on the future reward and the uncountable space of the belief vector Ω.

III. STRUCTURE AND OPTIMALITY OF THE MYOPIC POLICY

A myopic policy ignores the impact of the current action on the future reward, focusing solely on maximizing the expected immediate reward E[Ra(t)] = ωa(t)(1 − ǫ). It is an index policy and is stationary: the mapping from belief vectors to actions does not change with time t.
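For small N and T, the belief update (1) and the optimality equation can be evaluated by brute-force recursion over ACK/NAK sample paths. The sketch below is our own illustration (the function names are ours, not the paper's); its exponential blow-up in T is exactly the computational burden that a closed-form policy avoids:

```python
from functools import lru_cache

def make_model(p01, p11, eps):
    # Returns the belief update of (1) and the value function V_t as closures.
    def gamma(x):
        # One-step belief propagation: Gamma(x) = x*p11 + (1 - x)*p01.
        return x * p11 + (1 - x) * p01

    def update(belief, a, k):
        # Bayes update (1): new belief vector given action a and feedback k.
        out = []
        for i, w in enumerate(belief):
            if i == a and k == 1:
                out.append(p11)                                 # ACK: channel was good
            elif i == a and k == 0:
                out.append(gamma(eps * w / (eps * w + 1 - w)))  # NAK
            else:
                out.append(gamma(w))                            # unobserved channel
        return tuple(out)

    @lru_cache(maxsize=None)
    def value(t, horizon, belief):
        # Optimality equation: immediate reward w_a*(1 - eps) plus the expected
        # remaining reward under ACK (prob. w_a*(1 - eps)) and NAK (otherwise).
        best = 0.0
        for a, w in enumerate(belief):
            p_ack = w * (1 - eps)
            v = p_ack
            if t < horizon:
                v += p_ack * value(t + 1, horizon, update(belief, a, 1))
                v += (1 - p_ack) * value(t + 1, horizon, update(belief, a, 0))
            best = max(best, v)
        return best

    return update, value
```

The recursion branches over every action and every ACK/NAK outcome in every slot, which is tractable only for toy horizons.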
The myopic action â(t) in slot t under belief state Ω(t) is simply given by

â(t) = arg max_{a=1,...,N} ωa(t).     (3)

In general, obtaining the myopic action in each slot requires the recursive update of the belief vector Ω(t) as given in (1), which requires knowledge of the transition probabilities {pij}. As shown in Theorem 1, for the problem at hand, the myopic policy has a simple structure that needs neither the update of the belief vector nor the knowledge of the transition probabilities.

The basic element in the structure of the myopic policy is a circular ordering C of the channels. For a circular order, the starting point is irrelevant: a circular order C = (n1, n2, ..., nN) is equivalent to (ni, n_{i+1}, ..., nN, n1, n2, ..., n_{i−1}) for any 1 ≤ i ≤ N.

We now introduce the following notation. For a circular order C, let −C denote its reverse circular order; i.e., for C = (n1, n2, ..., nN), we have −C = (nN, n_{N−1}, ..., n1). For a channel i, let i⁺_C denote the next channel in the circular order C. For example, for C = (1, 2, ..., N), we have i⁺_C = i + 1 for 1 ≤ i < N and N⁺_C = 1.

We present below the structure of the myopic policy. We assume first that the initial belief value ωi(1) of each channel is bounded between p01 and p11. In Appendix B, we show that when this condition on the initial belief values is violated, the same structure holds for t > 2; the only difference is that special care needs to be given to the second slot. This can be seen from the belief update given in (1).
Specifically, for any initial belief value, the updated belief of each channel (observed or unobserved) in slot t ≥ 2 is bounded between p01 and p11; a belief value outside the interval [min{p01, p11}, max{p01, p11}] can only occur in the first slot as a given initial state, and is thus referred to as a transient belief state.

Theorem 1 (Structure of the Myopic Policy): Let Ω(1) = [ω1(1), ..., ωN(1)] denote the initial belief vector. Assume that ωi(1) ∈ [min{p01, p11}, max{p01, p11}] for all i = 1, 2, ..., N. The circular channel order C(1) in slot 1 is determined by a descending order of Ω(1) (i.e., C(1) = (n1, n2, ..., nN) implies that ω_{n1}(1) ≥ ω_{n2}(1) ≥ ... ≥ ω_{nN}(1)). Let â(1) = arg max_{i=1,...,N} ωi(1). The myopic action â(t) in slot t (t > 1) is given as follows.

• Case 1: p11 ≥ p01 and ǫ < p10 p01 / (p11 p00).

â(t) = â(t−1)          if K(t−1) = 1 (ACK),
â(t) = â(t−1)⁺_C(t)    if K(t−1) = 0 (NAK),     (4)

where C(t) = C(1).

• Case 2: p11 < p01 and ǫ < p00 p11 / (p01 p10).

â(t) = â(t−1)          if K(t−1) = 0 (NAK),
â(t) = â(t−1)⁺_C(t)    if K(t−1) = 1 (ACK),     (5)

where C(t) = C(1) when t is odd and C(t) = −C(1) when t is even.

Proof: See Appendix A.

Theorem 1, along with Appendix B, shows that the basic structure of the myopic policy is a round-robin scheme based on a circular ordering of the channels. For p11 ≥ p01 (which corresponds to a positive correlation between the channel states in two consecutive slots), the circular order is constant: C(t) = C(1) in every slot t, where C(1) is determined by a descending order of the initial belief values.
The myopic action is to stay in the same channel after an ACK and switch to the next channel in the circular order after a NAK, provided that the false alarm probability ǫ of the channel state detector is below a certain value. For p11 < p01 (which corresponds to a negative correlation between the channel states in two consecutive slots), the circular order is reversed in every slot: C(t) = C(1) when t is odd and C(t) = −C(1) when t is even, where the initial order C(1) is determined by the initial belief values. The myopic policy stays in the same channel after a NAK; otherwise, it switches to the next channel in the current circular order C(t), which is either C(1) or −C(1) depending on whether the current time t is odd or even (see Footnote 5).

This simple structure makes the myopic sensing policy particularly attractive in implementation. Besides its simplicity, the myopic policy obviates the need for knowing the channel transition probabilities and automatically tracks variations in the channel model. We point out that the structure of the myopic sensing policy in the presence of sensing errors is similar to that under perfect sensing given in [6]. The proof, however, is more involved, since the observations here are acknowledgements and the state of the sensed channel cannot be inferred with certainty from a NAK.

Theorem 2 below shows that the myopic sensing policy with such a simple and robust structure is, in fact, optimal for N = 2.

Theorem 2 (Optimality of the Myopic Policy): For N = 2, the myopic policy is optimal when ǫ < p10 p01 / (p11 p00) for positively correlated channels (p11 ≥ p01) and ǫ < p00 p11 / (p01 p10) for negatively correlated channels (p11 < p01), provided that the initial belief values are bounded between p01 and p11 (see Footnote 6).

Proof: See Appendix C.
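The stay/switch rules of Theorem 1 amount to a few lines of bookkeeping over the initial order C(1). The following sketch is our own illustrative implementation (all names are ours); it is driven purely by ACK/NAK feedback and uses no transition probabilities beyond the sign of p11 − p01:

```python
def myopic_round_robin(initial_beliefs, positively_correlated):
    # Implements the rules (4) and (5). Returns the first action and a callback
    # that maps the ACK/NAK of the current slot to the channel for the next slot.
    n = len(initial_beliefs)
    order = sorted(range(n), key=lambda i: -initial_beliefs[i])  # C(1), descending
    state = {"pos": 0, "t": 1}  # position within C(1), current slot index

    def next_action(ack):
        # Case 1 (p11 >= p01): stay on ACK, advance along C(1) on NAK.
        # Case 2 (p11 < p01): stay on NAK, switch on ACK; the circular order in
        # slot t is C(1) for odd t and -C(1) for even t, so a switch landing in
        # an odd slot steps backwards along C(1).
        stay = (ack == 1) if positively_correlated else (ack == 0)
        if not stay:
            forward = positively_correlated or state["t"] % 2 == 0
            state["pos"] = (state["pos"] + (1 if forward else -1)) % n
        state["t"] += 1
        return order[state["pos"]]

    return order[0], next_action
```

For Case 1 this reduces to the familiar "stay on success, round-robin on failure" rule; for Case 2 the alternation between C(1) and −C(1) is handled by the parity check on the slot index.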
Numerical examples suggest that similar conditions exist for all N under which the myopic policy is optimal. Proving this conjecture turns out to be challenging. A recent work [9] has made progress toward proving a corresponding conjecture under the assumption of perfect sensing, by showing that the optimality holds for N > 2 under the condition p11 > p01. Furthermore, it is shown in [9] that if the myopic policy is optimal under the sum-reward criterion over a finite horizon, it is also optimal for other criteria such as discounted and averaged rewards over a finite or infinite horizon. These results may extend to the case with noisy observations, since the optimality proof given in [9] exploits the simple structure of the myopic policy, which, as shown here, also holds with noisy observations.

Footnote 5: An alternative way to see the channel switching structure of the myopic policy is through the last visit to each channel (once every channel has been visited at least once). Specifically, for p11 ≥ p01, when a channel switch is needed, the policy selects the channel visited the longest time ago. For p11 < p01, when a channel switch is needed, the policy selects, among those channels whose last visit occurred an even number of slots ago, the one most recently visited; if there are no such channels, the user chooses the channel visited the longest time ago.

Footnote 6: Recall that a belief value outside the interval [min{p01, p11}, max{p01, p11}] is transient. For any initial state, the belief values in slots t ≥ 2 are bounded between p01 and p11. As a consequence, Theorem 2 shows that when one or more of the initial belief values are transient, the myopic policy still provides the optimal actions in all slots except possibly the first.
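The N = 2 equivalence asserted by Theorem 2 can be checked by exact enumeration of the dynamic program. The sketch below is our own verification code, not the authors' implementation; the parameter values in the usage note are ours and are chosen to satisfy the false-alarm conditions of Theorem 2:

```python
from functools import lru_cache

def expected_rewards(p01, p11, eps, horizon, omega0):
    # Exact finite-horizon evaluation for N = 2: returns the expected total
    # reward of the optimal policy and of the myopic policy.
    def gamma(x):
        return x * p11 + (1 - x) * p01

    def update(belief, a, k):
        # Bayes update (1): an ACK pins the sensed channel's belief at p11, a
        # NAK shrinks it; unobserved channels evolve through Gamma.
        return tuple(
            (p11 if k == 1 else gamma(eps * w / (eps * w + 1 - w))) if i == a
            else gamma(w)
            for i, w in enumerate(belief)
        )

    @lru_cache(maxsize=None)
    def value(t, belief, myopic):
        if myopic:
            actions = [max(range(2), key=lambda i: belief[i])]  # largest belief
        else:
            actions = range(2)                                  # optimize over both
        best = 0.0
        for a in actions:
            p_ack = belief[a] * (1 - eps)
            v = p_ack
            if t < horizon:
                v += p_ack * value(t + 1, update(belief, a, 1), myopic)
                v += (1 - p_ack) * value(t + 1, update(belief, a, 0), myopic)
            best = max(best, v)
        return best

    b = tuple(omega0)
    return value(1, b, False), value(1, b, True)
```

For example, with p01 = 0.3, p11 = 0.8, and ǫ = 0.05 (which satisfies ǫ < p10 p01 / (p11 p00) ≈ 0.107), and initial beliefs inside [p01, p11], the two expected rewards agree to machine precision, as Theorem 2 predicts.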
Both the structure and the optimality of the myopic policy require a certain level of reliability of the channel state detector. When this level of reliability is not met, the simple structure of the myopic policy may no longer hold, and the myopic actions need to be obtained from (3) and the recursive belief update in (1). The optimality of the myopic policy may also be lost in this case, and a more complex policy, for example, Whittle's index policy [11], may be needed to achieve better performance. This brings out an interesting tradeoff between the complexity of the detector at the physical layer and the complexity of the sensing strategy at the medium access control (MAC) layer. In particular, the reliability of a detector (for example, an energy detector) can always be improved by increasing the sensing time, so that a simple and optimal policy, the myopic policy, can be employed. The caveat is the reduced transmission time for a given slot length. Such a tradeoff can be complex and is beyond the scope of this technical note.

IV. CONCLUSION AND DISCUSSION

We have established a simple structure of the myopic policy for channel selection in an N-channel opportunistic communication system under an i.i.d. Gilbert-Elliot channel model. The optimality of this simple myopic policy is proved for N = 2 and conjectured for N > 2. This is a nontrivial extension of our previous results pertaining to the case of error-free channel state detection [6], as noisy observations make it challenging to maintain synchronous channel selection between the transmitter and its receiver. This communication constraint adds an interesting twist to the resulting stochastic control problem. The optimality of the myopic policy in the context of opportunistic communications may bear significance in the general context of restless multi-armed bandit processes.
While classical bandit problems can be solved optimally using the Gittins index [10], restless bandit problems are known to be PSPACE-hard in general [5]. Whittle proposed a Gittins-like indexing heuristic for restless bandit problems [11], which is shown to be asymptotically optimal in a certain limiting regime [12]. Beyond this asymptotic result, relatively little is known about the structure of the optimal policies for a general restless bandit process. The optimality of the myopic policy shown in this paper and [6] suggests non-asymptotic conditions under which an index policy with a semi-universal structure can actually be optimal for restless bandit processes.

Approximation algorithms for restless bandit problems have also been explored in the literature. In [13], Guha and Munagala developed a constant-factor (1/68) approximation via LP relaxation for the same class of restless bandit processes as considered in this paper. The difference is that the model in [13] allows for non-identical channels but requires every channel to be positively correlated. We point out that negatively correlated processes are significantly harder to deal with due to the loss of monotonicity in the belief updates (see [6]). In [14], Guha et al. developed a factor-2 approximation policy for another class of restless bandit problems (referred to as monotone bandits) via LP relaxation. Raghunathan et al. [15] modeled multicast scheduling in broadcast wireless LANs as a restless bandit problem and provided a closed-form bound on the performance of Whittle's index policy with respect to the optimal.

APPENDIX A: PROOF OF THEOREM 1

We prove Theorem 1 by showing that the channel â(t) given by (4) and (5) is indeed the channel with the largest belief value in slot t.
Specifically, we prove the following lemma.

Lemma 1: Let â(t) = i1 be the channel determined by (4) for p11 ≥ p01 and by (5) for p11 < p01. Let C(t) = (i1, i2, ..., iN) be the circular order of channels in slot t, where we set the starting point to â(t) = i1. We then have, for any t ≥ 1,

ω_{i1}(t) ≥ ω_{i2}(t) ≥ ... ≥ ω_{iN}(t),     (6)

i.e., the channel given by (4) and (5) has the largest belief value in every slot t.

To prove Lemma 1, we note the following properties of the operator Γ(x) defined in (1).

P1. Γ(x) is an increasing function for p11 ≥ p01 and a decreasing function for p11 < p01.

P2. For all 0 ≤ x ≤ 1, we have p01 ≤ Γ(x) ≤ p11 for p11 ≥ p01, and p11 ≤ Γ(x) ≤ p01 for p11 < p01.

P3. For p11 ≥ p01 and ǫ < p10 p01 / (p11 p00), we have Γ( ǫω / (ǫω + 1 − ω) ) ≤ Γ(ω′) for all p01 ≤ ω, ω′ ≤ p11; for p11 < p01 and ǫ < p00 p11 / (p01 p10), we have Γ( ǫω / (ǫω + 1 − ω) ) ≥ Γ(ω′) for all p11 ≤ ω, ω′ ≤ p01.

P1 and P2 follow directly from the definition of Γ(x). To show P3 for p11 ≥ p01, it suffices to show ǫω / (ǫω + 1 − ω) ≤ p01, due to the monotonically increasing property of Γ(x) and the bound on ω′. Noticing that ǫω / (ǫω + 1 − ω) is an increasing function of both ω and ǫ, we arrive at P3 by using the upper bounds on ω and ǫ. Similarly, we can show P3 for p11 < p01.

We now prove Lemma 1 by induction. For t = 1, (6) holds by the definition of C(1). Assume that (6) is true for slot t, where C(t) = (i1, i2, ..., iN) and â(t) = i1. We show that it is also true for slot t + 1. Consider first p11 ≥ p01. We have C(t+1) = C(t) = (i1, i2, ..., iN). When K_{i1}(t) = 1, we have â(t+1) = â(t) = i1 from (4).
Since ω_{i1}(t+1) = p11 achieves the upper bound of the belief values (see P2) and the order of the belief values of the unobserved channels remains unchanged due to P1, we arrive at (6) for t + 1. When K_{i1}(t) = 0, we have â(t+1) = i2 from (4). We again have (6) by noticing that ω_{i1}(t+1) = Γ( ǫω_{i1}(t) / (ǫω_{i1}(t) + 1 − ω_{i1}(t)) ) is the smallest belief value in slot t + 1 (see P3) and C(t+1) = (i2, i3, ..., iN, i1) when the starting point is set to â(t+1) = i2.

For p11 < p01, we have C(t+1) = −C(t) = (i1, iN, i_{N−1}, ..., i2). When K_{i1}(t) = 0, we have â(t+1) = â(t) = i1 from (5). Since ω_{i1}(t+1) = Γ( ǫω / (ǫω + 1 − ω) ) is the largest belief value in slot t + 1 (see P3) and the order of the belief values of the unobserved channels is reversed due to P1, we have, from the induction assumption at t,

ω_{i1}(t+1) ≥ ω_{iN}(t+1) ≥ ω_{i_{N−1}}(t+1) ≥ ... ≥ ω_{i2}(t+1),

which agrees with (6) for t + 1 and C(t+1) = (i1, iN, i_{N−1}, ..., i2). When K_{i1}(t) = 1, we have â(t+1) = iN from (5). We again have (6) by noticing that ω_{i1}(t+1) = p11 achieves the lower bound of the belief values and C(t+1) = (iN, i_{N−1}, ..., i2, i1) when the starting point is set to â(t+1) = iN. This concludes the proof of Lemma 1, and hence Theorem 1.

APPENDIX B: STRUCTURE OF THE MYOPIC POLICY UNDER TRANSIENT INITIAL BELIEF STATES

We now consider the case when one or more initial belief values are transient, i.e., outside the interval [min{p01, p11}, max{p01, p11}]. Let Ω(1) = [ω1(1), ..., ωN(1)] denote the initial belief vector. Without loss of generality, assume that ω1(1) ≥ ω2(1) ≥ ... ≥ ωN(1).
Thus â(1) = 1. Let r denote the rank of ǫω1(1) / (ǫω1(1) + 1 − ω1(1)) in the set { ǫω1(1) / (ǫω1(1) + 1 − ω1(1)), ω2(1), ..., ωN(1) }, with r = 1 when ǫω1(1) / (ǫω1(1) + 1 − ω1(1)) is the largest and r = N when it is the smallest. When one or more of the initial belief values are transient, the myopic action â(t) in slot t (t > 1) is given as follows.

• Case 1: p11 ≥ p01 and ǫ < p10 p01 / (p11 p00).
  – If K_{â(1)}(1) = 1, the myopic action â(t) (t > 1) follows the same structure given by (4) with C(1) = (1, 2, ..., N).
  – If K_{â(1)}(1) = 0, the myopic action in slot t = 2 is â(2) = 1 when r = 1 and â(2) = 2 when r > 1. The myopic action â(t) for t > 2 follows the same structure given by (4) with C(1) = (1, 2, ..., N) when r = 1 and C(1) = (2, 3, ..., r, 1, r+1, r+2, ..., N) when r > 1.

• Case 2: p11 < p01 and ǫ < p00 p11 / (p01 p10).
  – If K_{â(1)}(1) = 1, the myopic action â(t) (t > 1) follows the same structure given by (5) with C(1) = (1, 2, ..., N).
  – If K_{â(1)}(1) = 0, the myopic action in slot t = 2 is â(2) = 1 when r = N and â(2) = N when r < N. The myopic action â(t) for t > 2 follows the same structure given by (5) with C(1) = (1, 2, ..., N) when r = 1 and C(1) = (2, 3, ..., r, 1, r+1, r+2, ..., N) when r > 1.

The above modification can be easily proved based on P1 and P2 given in Appendix A.

APPENDIX C: PROOF OF THEOREM 2

Let V̂t(Ω) denote the total expected reward obtained under the myopic policy starting from slot t, and V̂t(Ω; a) the total expected reward obtained by taking action a in slot t followed by the myopic policy in future slots. The proof is based on the following lemma, which applies to a general POMDP.

Lemma 2: For a T-horizon POMDP, the myopic policy is optimal if, for t = 1, ..., T,

V̂t(Ω) ≥ V̂t(Ω; a)   for all a and Ω.
(7)

Lemma 2 can be proved by reverse induction, where the initial condition, the optimality of the myopic action in the last slot T, is straightforward.

We now prove Theorem 2. Considering all channel state realizations in slot t, we have

V̂t(Ω; a) = (1 − ǫ) ωa + Σ_{s1, s2 ∈ {0,1}} Pr[S(t) = [s1, s2] | Ω(t)] V̂_{t+1}( T(Ω(t) | a, s_a) | S(t) = [s1, s2] ),     (8)

where V̂_{t+1}( T(Ω(t) | a, s_a) | S(t) = [s1, s2] ) is the conditional reward obtained starting from slot t + 1 given that the system state in slot t is [s1, s2]. Next, we establish two lemmas regarding the conditional value function of the myopic policy.

Lemma 3: Under the conditions of Theorem 1, the expected total remaining reward starting from slot t under the myopic policy is determined by the action a(t−1) and the system state S(t−1) in slot t − 1, and is hence independent of the belief vector Ω(t) at the beginning of slot t, i.e.,

V̂t( T(Ω(t−1) | a, s_a) | S(t−1) = [s1, s2] ) = V̂t( T(Ω′(t−1) | a, s_a) | S(t−1) = [s1, s2] ).

Adopting the simplified notation V̂t( a(t−1) | S(t−1) = [s1, s2] ), we further have

V̂t( a(t−1) = 1 | S(t−1) = [s1, s2] ) = V̂t( a(t−1) = 2 | S(t−1) = [s2, s1] ).     (9)

Proof: Given a(t−1) and S(t−1), the myopic actions in slots t to T, governed by the structure given in Theorem 1, are fixed for each sample path of system states and observations, independent of Ω(t). As a consequence, the total reward obtained in slots t to T for each sample path is independent of Ω(t), and so is the expected total reward. Equation (9) follows from the assumption that the channels are statistically identical.
Lemma 4: Under the conditions of Theorem 1, we have, $\forall t, a$,
$$\left| \hat V_t(a(t-1) = a \mid S(t-1) = [1, 0]) - \hat V_t(a(t-1) = a \mid S(t-1) = [0, 1]) \right| \le (1-\epsilon). \tag{10}$$

Proof: Based on (9), it suffices to consider $a(t-1) = 1$. We prove the case $p_{11} < p_{01}$ by reverse induction; the proof for $p_{11} > p_{01}$ is similar. The inequality in (10) holds for $t = T$ since $(1-\epsilon)$ is the maximum expected reward that can be obtained in one slot. Assume that the inequality holds for $t+1$; we show that it holds for $t$.

Consider first $\hat V_t(a(t-1) = 1 \mid S(t-1) = [1, 0])$. With probability $1-\epsilon$, the user successfully identifies that channel 1 is in the good state in slot $t-1$ and receives an acknowledgement at the end of slot $t-1$. According to the structure of the myopic policy, the user switches channels in slot $t$, i.e., $a(t) = 2$. The expected immediate reward in slot $t$ is thus $p_{01}(1-\epsilon)$, since the state of channel 2 in slot $t-1$ is $0$. We thus arrive at the first term of (11), where $\hat V_t(a(t-1) = 1 \mid S(t-1) = [1, 0])$ is given by the sum of $p_{01}(1-\epsilon)$ and the future reward starting from slot $t+1$ conditioned on all four possible system states in slot $t$. With probability $\epsilon$, a false alarm occurs in slot $t-1$, resulting in a NAK; the user thus stays in channel 1 in slot $t$: $a(t) = 1$. We thus arrive at the second term of (11). Similarly, we obtain $\hat V_t(a(t-1) = 1 \mid S(t-1) = [0, 1])$ as given in (12), which follows from the fact that a NAK occurs in slot $t-1$ due to the bad state of the chosen channel 1.
$$\begin{aligned} \hat V_t(1 \mid [1,0]) = {} & (1-\epsilon)\Big\{ p_{01}(1-\epsilon) + p_{10}p_{00}\,\hat V_{t+1}(2 \mid [0,0]) + p_{11}p_{01}\,\hat V_{t+1}(2 \mid [1,1]) \\ & \qquad + p_{11}p_{00}\,\hat V_{t+1}(2 \mid [1,0]) + p_{10}p_{01}\,\hat V_{t+1}(2 \mid [0,1]) \Big\} \\ & + \epsilon\Big\{ p_{11}(1-\epsilon) + p_{10}p_{00}\,\hat V_{t+1}(1 \mid [0,0]) + p_{11}p_{01}\,\hat V_{t+1}(1 \mid [1,1]) \\ & \qquad + p_{11}p_{00}\,\hat V_{t+1}(1 \mid [1,0]) + p_{10}p_{01}\,\hat V_{t+1}(1 \mid [0,1]) \Big\} \end{aligned} \tag{11}$$

$$\hat V_t(1 \mid [0,1]) = p_{01}(1-\epsilon) + p_{00}p_{10}\,\hat V_{t+1}(1 \mid [0,0]) + p_{01}p_{11}\,\hat V_{t+1}(1 \mid [1,1]) + p_{11}p_{00}\,\hat V_{t+1}(1 \mid [0,1]) + p_{10}p_{01}\,\hat V_{t+1}(1 \mid [1,0]) \tag{12}$$

Applying (9) and the upper bound on $\epsilon$, we have
$$\begin{aligned} \left| \hat V_t(1 \mid [0,1]) - \hat V_t(1 \mid [1,0]) \right| & \le (1-\epsilon)p_{01} - (1-\epsilon)\big(\epsilon p_{11} + (1-\epsilon)p_{01}\big) + \epsilon \left| \hat V_{t+1}(1 \mid [1,0]) - \hat V_{t+1}(1 \mid [0,1]) \right| (p_{10}p_{01} - p_{11}p_{00}) \\ & \le 2(1-\epsilon)\epsilon(p_{01} - p_{11}) \\ & \le 2(1-\epsilon)\frac{p_{00}p_{11}}{p_{01}p_{10}}(p_{01} - p_{11}) \\ & < (1-\epsilon), \end{aligned}$$
where the last inequality follows from $\frac{(p_{01} - p_{11})p_{11}}{p_{01}} \le \frac{1}{4}$ and $\frac{p_{00}}{p_{10}} < 1$.

We now show that (7) in Lemma 2 holds. Consider $\Omega(t) = [\omega_1(t), \omega_2(t)]$ with $\omega_1(t) > \omega_2(t)$, i.e., the myopic action in slot $t$ is $a(t) = 1$. Applying (9) and Lemma 4 to (8), we have
$$\hat V_t(\Omega; a=1) - \hat V_t(\Omega; a=2) = (\omega_1 - \omega_2)\big(1 - \epsilon + \hat V_{t+1}(1 \mid [1,0]) - \hat V_{t+1}(1 \mid [0,1])\big) \ge 0.$$

REFERENCES

[1] E. N. Gilbert, "Capacity of burst-noise channels," Bell Syst. Tech. J., vol. 39, pp. 1253-1265, Sept. 1960.
[2] M. Zorzi, R. Rao, and L. Milstein, "Error statistics in data transmission over fading channels," IEEE Trans. Commun., vol. 46, pp. 1468-1477, Nov. 1998.
[3] L. A. Johnston and V. Krishnamurthy, "Opportunistic File Transfer over a Fading Channel: A POMDP Search Theory Formulation with Optimal Threshold Policies," IEEE Trans. Wireless Commun., vol. 5, no. 2, 2006.
[4] Q. Zhao and B. Sadler, "A Survey of Dynamic Spectrum Access," IEEE Signal Processing Magazine: Special Issue on Resource-Constrained Signal Processing, Communications, and Networking, vol. 24, no. 3, pp. 79-89, May 2007.
[5] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of optimal queueing network control," Mathematics of Operations Research, vol. 24, 1999.
[6] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for opportunistic spectrum access: structure, optimality, and performance," to appear in IEEE Trans. Wireless Commun. (also see Proc. of IEEE Workshop on Toward Cognition in Wireless Networks (CogNet), June 2007).
[7] Y. Chen, Q. Zhao, and A. Swami, "Joint design and separation principle for opportunistic spectrum access in the presence of sensing errors," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 2053-2071, May 2008 (also see Proc. of IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 2006).
[8] R. Smallwood and E. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, pp. 1071-1088, 1971.
[9] T. Javidi, B. Krishnamachari, Q. Zhao, and M. Liu, "Optimality of Myopic Sensing in Multi-Channel Opportunistic Access," IEEE ICC, 2008.
[10] J. C. Gittins, "Bandit Processes and Dynamic Allocation Indices," Journal of the Royal Statistical Society, Series B, vol. 41, pp. 148-177, 1979.
[11] P. Whittle, "Restless bandits: activity allocation in a changing world," Journal of Applied Probability, vol. 25, 1988.
[12] R. R. Weber and G. Weiss, "On an index policy for restless bandits," Journal of Applied Probability, vol. 27, pp. 637-648, 1990.
[13] S. Guha and K. Munagala, "Approximation Algorithms for Partial-information based Stochastic Control with Markovian Rewards," IEEE FOCS, 2007.
[14] S. Guha and K. Munagala, "Approximation Algorithms for Restless Bandit Problems," http://arxiv.org/abs/0711.3861.
[15] V. Raghunathan, V. Borkar, M. Cao, and P. R. Kumar, "Index Policies for Real-Time Multicast Scheduling for Wireless Broadcast Systems," IEEE INFOCOM, 2008.
