Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio

We study the problem of dynamic spectrum sensing and access in cognitive radio systems as a partially observed Markov decision process (POMDP). A group of cognitive users cooperatively tries to exploit vacancies in primary (licensed) channels whose o…

Authors: Jayakrishnan Unnikrishnan, Venugopal Veeravalli

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio
1 Algorithms for Dynamic Spectrum Access with Learning for Cogn iti v e Radio Jayakrishnan Unnikrishnan, Student Membe r , IEEE, and V enugopal V . V eerav alli, F ellow , IEEE Abstract W e study the problem of dynamic spectrum sensing and access in c ognitive radio systems as a partially observed Markov decision process (POMDP). A group of cognitive users coop erativ ely tries to exp loit vacancies in prim ary (licensed) chann els whose occup ancies follow a Markovian ev olution. W e fir st consider the scenario where the cognitive users have perf ect knowledge of the distribution of the signals they receive from the prima ry users. For this pro blem, w e obtain a gre edy channel selection and access policy that maximizes the in stantaneou s rew ard, wh ile satisfying a co nstraint on the proba bility of interfering with licensed transmissions. W e a lso derive an analy tical u niversal upper bound on the perform ance of the optimal policy . Th rough simulation, we show that our scheme a chieves good perfo rmance relative to the up per bound and improved perfo rmance r elativ e to an existing schem e. W e then co nsider the m ore practical scena rio where the exact d istribution of the sign al from the primary is unknown. W e assume a parametr ic model f or the distribution and d ev elop an algo rithm that can learn the true d istribution, still guarantee ing the constraint on the in terferenc e probab ility . W e show that this a lgorithm outp erforms the naive de sign that assume s a worst case v alue for th e param eter . W e also provide a proo f fo r the con vergence of the learning algorithm. Index T erms Copyrigh t (c) 2008 IEEE. Personal use of this material is permitted. Howe v er , permission to use this material for an y other purposes must be obtained f rom the IEEE by sending a request to pubs-permissions@ieee.org. This work was supported by a V oda fone Foundation Graduate Fello wship and by NSF Grant CCF 0 7-29031, through the Uni versity of Illinois. This p aper was presented in part at the 47 th IEEE Conference on Decision and Control, Cancun, Mexico, 2008, and at the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, C A, 2008. The authors are with the Coordinated Science Laboratory and Dept. of Electrical and Computer Engineering, Univ ersity of Illinois at Urbana- Champaign, Urbana, IL 61801 , US A. Email { junnikr2 , vvv } @illinois.edu 2 Cognitive radio, dyn amic spectru m access, chan nel selection, partially observed Markov d ecision process (POMDP), learning. I . I N T RO D U C T I O N Cognitive radio s that exploit v acancies in the licensed spectrum hav e been proposed as a solution to t he e ver -increasing demand for radio s pectrum. The idea is to sense t imes when a specific licensed band i s not used at a parti cular place and use thi s band for u nlicensed transmissio ns witho ut causing interference to the l icensed user (referred to as the ‘primary’). An important part of d esigning such s ystems is to develop an effic ient channel selection policy . The cognitive radio (also called t he ‘secondary user’) needs to adopt th e best strategy for selecting channels for s ensing and access. The sensing and access po licies should jo intly ensure t hat the probability of int erfering wi th the prim ary’ s transmi ssion meets a giv en constrain t. In the first part of thi s p aper , we consider the design o f such a j oint sensing and access policy , assuming a M arko vian model for the primary spectrum us age on the channels being m onitored. The second ary users use the observ ations made in each slot t o track the probabil ity o f o ccupancy of the different channels. W e obt ain a subopti mal so lution to the resul tant POMDP problem. In the second part of t he paper , we propose and stud y a more practical problem that arises when the secondary users are not aware of th e exact dist ribution of the signals that they receive from the primary transmitters. W e develop an algorithm that learns these unkno wn s tatistics and show that this scheme gives im proved performance over the naiv e scheme that assumes a worst-case va lue for the unknown di stribution. A. Contribution When the s tatistics of the signals from the primary are known, we show that, under our formulation, the dynamic spectrum access problem with a group of cooperating secondary u sers is equiv alent i n s tructure to a sing le us er problem. W e als o ob tain a new analytical upper bound on t he expected rew ard under the optimal schem e. Our suboptim al solution to the POMDP i s shown via simulat ions to yi eld a performance that is close to the upper bo und and better than that under an e xisting scheme. The main contribution of this paper is th e formulation and solution of the problem studied in t he second part in v olving unknown observation stati stics. W e show that unknown statis tics of 3 the pri mary signals can b e learned and provide an algori thm that learns these statisti cs online and maxim izes the e xpected rew ard still sati sfying a constraint on in terference probabilit y . B. Related W ork In m ost of the existing schemes [1], [2] in the literature on dynamic sp ectrum access for cognitive radio s, t he authors assume that every time a secondary user senses a prim ary channel, it can determine whether or no t the channel is occupied by th e primary . A different s cheme was proposed i n [3] and [4] where t he authors assume that the secondary transm itter receiv es error -free A CK signals from the secondary’ s receiv ers whene ver t heir transmission is successful. The secondary users use t hese A CK s ignals to t rack the chann el states o f the prim ary channels. W e adopt a different strategy i n this paper . W e ass ume that ev ery time the secondary users sense a channel they s ee a random observation w hose distribution depends on the state of t he channel. Our approach is disti nctly dif ferent from and more realis tic than that in [1], [2] since we d o not assume th at the secondary us ers know the prim ary channel stat es perfectly through sensin g. W e provide a detail ed compariso n of our approach with that of [3] and [4] after presenting our solution. In particular , we po int out that while using the schem e of [4] there are som e practical diffic ulties in maint aining synchronizati on between the s econdary transm itter and recei ver . Our scheme provides a way around thi s diffic ulty , albeit w e require a dedi cated control channel between the secondary t ransmitter and receiv er . The problem studied in the second part of this paper th at i n v olves learning of unknown observation statist ics i s new . Howe ver , the idea of combining learning and dynamic spectrum access was also used in [5] where the autho rs propose a reinforcement-learning scheme for learning channel idl ing prob abilities and int erference probabiliti es. W e introduce th e basic spectrum sensi ng and access problem in Section II and describe our p roposed solution in Section III. In Se ction IV, we elaborate o n t he problem where the distributions of the o bserva tions are unkn own. W e p resent simulation resul ts and comparison s with some existing schemes in Section V, and our concl usions in Section VI. I I . P RO B L E M S T A T E M E N T W e consider a slotted system where a group of secondary users monitor a set C of primary channels. The state o f each primary channel switches bet ween ‘occupied’ and ‘unoccupied’ 4 according to t he e volution o f a Markov chain . The secondary us ers can cooperatively sense any one ou t of the channels in C in each sl ot, and can access any one of the L = |C | channels in the same slo t. In each slot, the secondary users must s atisfy a strict const raint on t he probabil ity of interfering with potential primary transmissions on an y channel. When the secondary users access a channel that is free during a given time slot, they rec eiv e a re ward proportional to the bandwidth of the channel that they access. The obj ectiv e of the secondary users is to select the channels for sensing and access i n each slot i n such a way that their total expected reward accrued over all slots is maximized subj ect to the constraint on in terfering with potential primary transmissions e very t ime they access a channel 1 . Since the secondary users do not have explicit knowledge of the st ates of the channels, the result ant problem is a constrained partially ob serva ble Markov decision process (POMDP) problem. W e assume that all channels in C hav e equal bandwid th B , and are statisti cally identical and independent in terms of primary usage. The occupanc y o f each channel fol lows a stationary Markov chain. The state of channel a in any ti me slot k is represented by var iable S a ( k ) and could be either 1 or 0 , where state 0 correspond s t o th e channel bein g free for secondary access and 1 corresponds to t he channel being occupied by some primary user . The secondary sys tem includes a decision center that has access to all the obs erv ations made by the cooperating secondary users 2 . Th e observ ations are transmitted t o t he decision center over a dedicated control channel . The sam e dedicated channel can also be used to m aintain synchronization between the secondary transmitter and secondary recei ver so that the receiv er can tune to the correct channel to recei ve transmissions from th e transmitter . The sensing and access decisi ons in each sl ot are made at this decision center . When channel a is sensed in sl ot k , we use X a ( k ) to denote the vector o f observ ations made b y t he dif ferent coop erating users o n channel a in sl ot k . These o bservations represent the sampled outputs of the wireless receivers tuned to channel a that are employed by the cognitive us ers. The st atistics of these observations are assumed to be tim e-in v ariant and distinct for different channel stat es. The observations on 1 W e do not consider scheduling policies in this paper and assume that the secondary users have some predetermined scheduling policy to decide which user accesses the primary channel ev ery time they determine that a channel is free for access. 2 The scheme proposed in this paper and the analyses presented in this paper are valid ev en if the cooperating secondary users transmit quantized versions of their observa tions to the f usion center . Minor changes are required to account for the discrete nature of the observ ations. 5 channel a in slot k have d istinct joint probability densit y functions f 0 and f 1 when S a ( k ) = 0 and S a ( k ) = 1 respective ly . T he coll ection of all observations up to slot k is denoted by X k , and the collection o f observations on channel a up t o slot k is denot ed by X k a . The channel sensed in slot k is denoted by u k , the sequence of channels sensed up to sl ot k is denoted by u k , and the set of time slots up to slot k when channel a was sensed is denoted by K k a . The decision to access channel a in slo t k is denot ed by a binary variable δ a ( k ) , which takes v alue 1 when channel a is accessed in slot k , and 0 otherwise. Whene ver the secondary users access a free channel in som e time s lot k , they get a rew ard B equal to the bandwidth of each channel in C . The secondary users should satisfy t he following constraint on the probability of interfering wi th the prim ary transmission s in each s lot: P ( { δ a ( k ) = 1 }|{ S a ( k ) = 1 } ) ≤ ζ . In order to simplify the structure of the acc ess policy , we also assume that in each slot th e decision to access a channel i s made using only the ob serva tions made in th at slot. Hence it follows t hat in each sl ot k , the secondary users can access only the channel they sense in slot k , say channel a . Furthermore, th e access decision must be based on a binary hypothesis t est [6] between the two possib le states of channel a , performed on the observ ation X a ( k ) . This leads to an acce ss policy with a s tructure simi lar to that establ ished in [4]. The optimal test [6 ] is t o compare the joint l og-likelihood ratio (LLR) L ( X a ( k )) gi ven by , L ( X a ( k )) = lo g  f 1 ( X a ( k )) f 0 ( X a ( k ))  to some t hreshold ∆ that is chosen to sati sfy , P ( {L ( X a ( k )) < ∆ } |{ S a ( k ) = 1 } ) = ζ (1) and the optimal access decisi on would be to access the sensed channel whene ver the threshold exceeds t he j oint LLR. Hence δ a ( k ) = I {L ( X a ( k )) < ∆ } I { u k = a } (2) and the rew ard obtained in slot k can be expressed as, ˆ r k = B I { S u k ( k )=0 } I { L ( X u k ( k )) < ∆ } (3) where I E represents t he i ndicator functio n o f event E . The main advantage of the structure of the access policy given i n (2) i s that we can obtain a simple sufficient statist ic for the resultant 6 POMDP without having to keep track of all t he past observ ations, as discussed l ater . It also has the added adv antage [4] th at the s econdary users can set the threshol ds ∆ t o meet the constraint o n the probabili ty of interfering with the primary transmissi ons without relying on their knowledge of t he Markov statist ics. Our o bjectiv e is to generate a policy that makes optim al use of primary sp ectrum subject to the interference con straint. W e i ntroduce a discount factor α ∈ (0 , 1) and aim to s olve the infinit e horizon dy namic program with di scounted rew ards [7]. T hat is, w e seek the sequence of channels { u 0 , u 1 , . . . } , such that the ∞ X k =0 α k E [ ˆ r k ] is maximi zed, where th e expectation i s performed over the random observations and channel st ate realizations. W e can sh ow the following re lation based on the assum ption of identical channels : E [ ˆ r k ] = E h B I { S u k ( k )=0 } I { L ( X u k ( k )) < ∆ } i = E h E h B I { S u k ( k )=0 } I { L ( X u k ( k )) < ∆ } | S u k ( k ) ii = E  B (1 − ˆ ǫ ) I { S u k ( k )=0 }  (4) where, ˆ ǫ = P ( {L ( X a ( k )) > ∆ }|{ S a ( k ) = 0 } ) . (5) Since all the channels are assumed to be identical and the statis tics of the observ ations ar e assumed to be cons tant ov er time, ˆ ǫ given by (5) is a constant i ndependent of k . From the structure o f the expected rewa rd i n (4) i t follows that we can redefine our problem such that the re ward in slot k i s now given by , r k = B (1 − ˆ ǫ ) I { S u k ( k )=0 } (6) and the optim ization prob lem is equiv alent t o maximi zing ∞ X k =0 α k E [ r k ] . Since we know th e structure of the opti mal access decisions from (2), t he problem of spectrum sensin g and access boils down to choosing t he optim al channel to sense in each sl ot. Whenev er the secondary users sense some channel and m ake o bserva tions with LLR lower t han the threshold, they are free to access that channel. Thus we have con verted the constrained POMDP problem into an unconstrained POMDP probl em as was done in [4]. 7 I I I . D Y NA M I C P RO G R A M M I N G The state of t he system in slo t k denoted by S ( k ) = ( S 1 ( k ) , S 2 ( k ) , . . . , S L ( k )) ⊤ is the vector of states of th e channels in C t hat ha ve independent and identi cal Markovian e volutions. The channel to b e sensed i n slot k is d ecided in slot k − 1 and i s given by u k = µ k ( I k − 1 ) where µ k is a determinis tic function and I k , ( X k , u k ) represents th e net in formation about past observ ations and d ecisions up to slot k . The rew ard obt ained in sl ot k is a function of the state in sl ot k and u k as given by (6). W e seek the sequ ence of channels { u 0 , u 1 , . . . } , such that ∞ X k =0 α k E [ r k ] is m aximized. It is easily verified th at this probl em is a s tandard dynamic programming problem with i mperfect observations. It is kno wn [7] that for such a POMDP problem, a su f ficient st atistic at the end of any time slot k , is the probability distribution o f t he system state S ( k ) , conditioned on all the past o bservations and decisions, giv en by P ( { S ( k ) = s }| I k ) . Furthermore, since th e M arko vian ev olution of the different channels are independent of each other , this condition al p robability distri bution is equiv alently represented by the set of beliefs about the occupancy st ates of each channel, i.e., the probabil ity of occupancy of each channel in slot k , conditioned on all the past observations on channel a and times when channel a was sens ed. W e use p a ( k ) to represent the belief about channel a at the end of slot k , i.e., p a ( k ) is the probability that the state S a ( k ) of channel a in slot k is 1 , conditioned on all observations and decisions up to ti me slot k , which is given by p a ( k ) = P ( { S a ( k ) = 1 }| X k a , K k a ) . W e u se p ( k ) to denote the L × 1 vector representing the beliefs about the channels in C . The initial values of the belief parameters for all channels are set using the stati onary di stribution of the Markov chain. W e use P to represent the transi tion probabilit y matri x for the s tate transiti ons of each channel, with P ( i, j ) representing the prob ability that a channel t hat is in st ate i in slot k switches to s tate j in sl ot k + 1 . W e define, q a ( k ) = P (1 , 1) p a ( k − 1) + P (0 , 1)(1 − p a ( k − 1)) . (7) 8 This q a ( k ) represents the probabil ity of occupancy of channel a in slot k , conditio ned on the observations up to slot k − 1 . U sing Bayes’ rule, the belief values are updated as follows after the observation in ti me slot k : p a ( k ) = q a ( k ) f 1 ( X a ( k )) q a ( k ) f 1 ( X a ( k )) + (1 − q a ( k )) f 0 ( X a ( k )) (8) when channel a was selected in slot k (i.e., u k = a ), and p a ( k ) = q a ( k ) otherwise. Thus from (8) we see that updates for the sufficient statisti c can be performed usi ng o nly the joint LLR of the observations, L ( X a ( k )) , inst ead of the entire vector o f observations. Furthermo re, from (2) we also see that th e access decisions als o depend only o n the LLRs. H ence we conclude that this probl em with vector observations i s equiv alent to one with scalar observations where the scalars represent the joint LLR of the observations of all the cooperating secondary users. Therefore, in the rest of thi s paper , we use a scalar observa tion m odel wit h t he ob serva tion made on channel a in slot k represented by Y a ( k ) . W e use Y k to d enote the set of all observations up to time slot k and Y k a to denote the set of all observations on channel a u p to slot k . Hence the new access d ecisions are given by δ a ( k ) = I {L ′ ( Y a ( k )) < ∆ ′ } I { u k = a } (9) where L ′ ( Y a ( k )) represents the LLR of Y a ( k ) and the access th reshold ∆ ′ is chosen to satisfy , P ( {L ′ ( Y a ( k )) < ∆ ′ }|{ S a ( k ) = 1 } ) = ζ . (10) Similarly the beli ef updates are p erformed as in (8) with the ev aluations of density functions of X a ( k ) replaced with the ev aluations of the density functi ons f ′ 0 and f ′ 1 of Y a ( k ) : p a ( k ) = q a ( k ) f ′ 1 ( Y a ( k )) q a ( k ) f ′ 1 ( Y a ( k )) + (1 − q a ( k )) f ′ 0 ( Y a ( k )) (11) when channel a is accessed in slot k (i.e., u k = a ), and p a ( k ) = q a ( k ) otherwis e. W e use G ( p ( k − 1) , u k , Y u k ( k )) to denot e t he function that returns the value of p ( k ) given that channel u k was sensed in slo t k . This function can b e calculated using the relations (7) and (11 ). The re ward obtained in slot k can now be expressed as, r k = B (1 − ǫ ) I { S u k ( k )=0 } (12) where ǫ i s giv en by ǫ = P ( {L ′ ( Y a ( k )) > ∆ ′ }|{ S a ( k ) = 0 } ) . (13) 9 From the structu re of the d ynamic prog ram, it can be shown that t he o ptimal sol ution to this dynamic program can b e obtained b y s olving the fol lowing Bellman equation [7] for the optimal re ward-to-go functi on: J ( p ) = max u ∈C [ B (1 − ǫ )(1 − q u ) + α E ( J ( G ( p, u, Y u )))] (14) where p represents the i nitial value o f the beli ef vector , i.e., the prior probabi lity o f channel occupancies in slot − 1 , and q is calculated from p as in (7) b y , q a = P (1 , 1) p a + P (0 , 1)(1 − p a ) , a ∈ C . (15) The expectation in (14) is performed over the random observation Y u . Since it is not easy to find the optimal solution to this Bellman equation, we adopt a suboptimal strategy to obtain a channel selection pol icy that performs well. In the rest of the paper we assume that the transition probability matrix P satisfies the following regularity cond itions: Assumptio n 1 : 0 < P ( j, j ) < 1 , j ∈ { 0 , 1 } (16) Assumptio n 2 : P (0 , 0) > P (1 , 0) (17) The first assumption ensures that t he resultant Marko v chain is irreducible and positive recurrent, while the second assumption ensures t hat it is more l ikely for a channel that is free in the current slot t o remain free in t he next slot than for a chann el that is occupied in th e current slot to swi tch states and become free in the next slot. Whi le the first ass umption is important the second one is used only in the d eri vation of the upper bou nd on the optimal performance and can easily b e relaxed b y separately considering t he case where th e i nequality (17) d oes not hold. A. Gr eedy p olicy A straightforward suboptimal solution t o the channel s election probl em is the greedy policy , i.e., t he pol icy of maximizing the expected in stantaneous rewar d in the current time slot. The expected inst antaneous rewar d obtained by accessing som e channel a i n a g iv en slot k i s giv en by B (1 − ǫ )(1 − q a ( k )) where ǫ is given by (13). Hence the greedy policy i s to choose th e channel a such t hat 1 − q a ( k ) is th e maximum . u gr k = argmax u ∈C { 1 − q u ( k ) } . (18) 10 In other words, in e very sl ot the greedy policy chooses the channel t hat is most likely to be free, conditioned on the p ast o bservations. The greedy pol icy for this problem is in fact equivalent to the Q MDP policy , which is a st andard su boptimal sol ution t o t he POMDP problem (see, e.g., [8]). It is s hown in [1] and [2] th at under som e condi tions on P and L , the g reedy policy is optimal if the observation i n each slot re veals the underlying state of the channel. Hence it can be argued that under the sam e conditions , t he greedy policy would also be optim al for our problem at hi gh SNR. B. An upper bound An upper bou nd o n the op timal re ward for the POMDP of (14) can be obtai ned by assumi ng more informati on than the maximum that can be obtained i n reality . One such assumption that can give us a simple upp er bound is t he Q MDP assumption [8], which is to assume that in all future sl ots, the state of all channels become known exactly after maki ng the o bserva tion in that slot. The opt imal rewa rd under the Q MDP assumption is a functi on o f the ini tial belief vector , i.e., the prior probabil ities of occupancy of the chann els in slot − 1 . W e represent this function by J Q . In practice, a reasonable choice of initial value of th e belief vector is gi ven by the stationary distribution of the Markov chains. Hence for any solution to the POMDP that uses this initialization, an upper bound for the optim al re war d under the Q MDP assumption is g iv en by J U = J Q ( p ∗ 1 ) where p ∗ represents the probabil ity that a channel is o ccupied under t he stati onary distribution of the transition p robability matrix P , and 1 represents an L × 1 vector of all 1 ’ s. The first step i n v olved in ev aluating th is upper bound is to determine the optimal re ward function under the assum ption t hat all the channel states become known exactly after making the observation in each slot i ncluding the current slot. W e call this function ˜ J . That is, we want to ev aluate ˜ J ( x ) for all binary string s x of length L t hat represent the 2 L possible values of the vector representing the st ates of all channels in s lot − 1 . The Q MDP assumption implies that the functions J Q and ˜ J sat isfy the foll owing equation: J Q ( z ) = max u ∈C { [ B (1 − ǫ )(1 − q u ) + X x ∈{ 0 , 1 } L α P ( { S (0) = x } ) ˜ J ( x )] } s.t. p ( − 1) = z (19) where p ( − 1) denotes the a priori belief vector about the channel st ates in slot − 1 and q u is 11 obtained from p u ( − 1) just as in (15). Hence the upper bound J U = J Q ( p ∗ 1 ) can be easily e valuated using P once the function ˜ J i s determined. Now we describe how one can solve for the functi on ˜ J . Under the assum ption that t he states of all the channels become known exactly at th e time of ob serva tion, th e optim al channel selected in any slot k w ould be a function of the states of th e channels in slot k − 1 . Moreover , th e sensing acti on in the current slot would not affect t he rewards in the future slots. Hence the optimal poli cy would b e to m aximize the expected in stantaneous re ward, which is achiev ed b y accessing the channel that is m ost likely to be free in t he current sl ot. No w under the added assumption stated i n (17) earlier 3 , the optimal po licy would alwa ys select some channel that was free in the pre vious time slot, if there is any . If no channel is free in the previous time s lot, then the optim al policy would be to select any one of th e channels in C , since al l of them are equally likely to be free i n the current slot. Hence the deriv ation of the o ptimal total reward for this problem i s st raightforward as illustrated below . T he t otal reward for this policy is a function of the state of the system in the slo t preceding the ini tial slot, i.e., S ( − 1) . ˜ J ( x ) = max u ∈C E  κ [ I { S u (0)=0 } + α ˜ J ( S (0))]     { S ( − 1) = x }  =    κP (0 , 0 ) + αV ( x ) if x 6 = 1 κP (1 , 0 ) + αV ( x ) if x = 1 where V ( x ) = E [ ˜ J ( S (0)) |{ S ( − 1) = x } ] , κ = B (1 − ǫ ) , and 1 i s an L × 1 string of all 1 ’ s. This means that we can wri te ˜ J ( x ) = κ ( P (0 , 0) ∞ X k =0 α k − ( P (0 , 0) − P (1 , 0)) w ( x ) ) = κ  P (0 , 0) 1 − α − ( P (0 , 0) − P (1 , 0)) w ( x )  (20) where w ( x ) , E   X M ≥− 1: S ( M )=1 α M +1      { S ( − 1) = x }   is a scalar function of the vector state x . Here t he expectation is over the random s lots when 3 It i s easy to see that a minor modification of the deriv ation of the upper bound works when assumption (17) does not hold. 12 the syst em reaches state 1 . Now by st ationarity we have, w ( x ) = E   X M ≥ 0: S ( M )=1 α M      { S (0) = x }   . (21) W e use P to denote the matrix of size 2 L × 2 L representing the transition probability matrix of the joint Markov process that d escribes the transitio ns o f the vector o f channel states S ( k ) . The ( i, j ) th element of P represents the probability that t he state of t he system s witches to y in slot k + 1 g iv en that the state of the system is x in slot k , where x is the L -bit binary representation of i − 1 and y is the L -bit binary representation of j − 1 . Using a sli ght ab use of notati on we represent the ( i, j ) th element of P as P ( x , y ) itself. Now equation (21) can be s olved to obtain, w ( x ) = X y α P ( x, y ) w ( y ) + I { x =1 } . (22) This fixed point equation which can b e solved to obtain, w = ( I − α P ) − 1        0 . . . 0 1        2 L × 1 (23) where w is a 2 L × 1 vector whose elements are the values of the function w ( x ) ev aluated at the 2 L diffe rent possible values of t he vector state x of the system in tim e slot − 1 . Again, the i th element of vector w is w ( x ) where x is the L -bit bin ary representation of i − 1 . Thus ˜ J can n ow be ev aluated by using relatio n (20) and the expected re ward for this probl em under the Q MDP assumption can be calculated by ev aluating J U = J Q ( p ∗ 1 ) via equatio n (19). This optim al va lue yields an analyt ical upper bound on t he optim al rewa rd of the origin al problem (14). C. Comparison with [4] f or sing le user pr obl em Although we hav e studied a spectrum access s cheme for a cooperative cognitive r adio network, it can also be employed by a s ingle cognitive user . Und er this setting, our approach to the spectrum access problem described earlier in this section is similar t o that considered in [4] and [3] in t hat sensing d oes n ot re veal the true channel states but only a random variable whose distribution depends on the current st ate o f the sensed channel. As a result, the structu re of our 13 optimal access policy and the sufficient statisti c are sim ilar to th ose in [4]. In thi s section we compare the two schemes. The main dif ference between our formulation and that in [4] is that in our formulation t he secondary us ers use the primary signal receiv ed on the channel to t rack the channel occupancies, while in [4] they use the A CK si gnals exchanged between the secondary transmitter and recei ver . Under t he schem e of [4], in each slo t, the secondary receiv er transmit s an A CK s ignal upo n successful reception of a t ransmission from the secondary recei ver . T he belief up dates are then performed using the singl e b it of information provided by the presence or absence of the A CK signal. The approach of [4] was motivated b y the fact that, under that scheme, the secondary recei ver kno ws in advance the channel on which to expect potential transmis sions from the secondary t ransmitter in each slot , thus obviating the need for control channels for s ynchro- nization. Ho wev er , such sy nchronization b etween the transm itter and receiv er i s n ot reliable in the presence of i nterfering terminals that are hidden [9] from either the receiver or transmi tter , because the A CK si gnals wil l no longer be error-free . In this regard, we bel iev e that a more practical solution t o this problem would be to set aside a dedicated control channel of lo w capacity for the purpose of reliably maint aining synchronization, and use the observations on the primary channel for tracking the channel occupancies. In additi on to guaranteeing synchroni zation, our scheme provides some im provement i n ut ilizing t ransmission opportunit ies over the A CK-based scheme, as we sh ow in sectio n V -A. Another di f ference between ou r formu lation and that in [4] is that we assume that t he statist ics of channel occupancies are independent and identical while [4] considers t he more general case of correlated and non-identi cal channels. Howe ver , the scheme we proposed i n section III can be easily modified to handle this case, with added complexity . The suffi cient statistic would n ow be the po steriori distribution of S ( k ) , the vector of st ates of all channels, and the access thresholds on different channels would be non-identical and depend on the statistics of the o bserva tions the respectiv e channels . W e av oid elaborating on this more general setting t o keep the presentation simple. I V . T H E C A S E O F U N K N O W N D I S T R I B U T I O N S In practice, the s econdary users are typically unaw are of the primary’ s signal characteristi cs and t he channel realization from th e prim ary [10]. Hence cognitiv e radio systems ha ve to rely on 14 some form of non-coherent detection such as energy detectio n while sensing the primary signal s. Furthermore, eve n whi le employing non-coherent detectors, t he secondary users are also unaware of their locations relativ e to the primary and hence are not aware of the shadowing and path loss from the primary to the secondary . Hence it is n ot reasonable to assum e that the secondary users know the exact distributions of the observ ations under the primary-present hypothesis, although it can be ass umed that t he distribution of the obs erv ations under the primary-absent hypothesis is known exactly . This scenario can be modeled by using a parametric description for t he dist ributions of the recei ved signal un der t he primary-present hypothesi s. W e d enote the density functions of t he observations under the two possib le hypotheses as, S a ( k ) = 0 : Y a ( k ) ∼ f θ 0 S a ( k ) = 1 : Y a ( k ) ∼ f θ a where θ a ∈ Θ , ∀ a ∈ { 1 , 2 , . . . , L } (24) where the parameters { θ a } are unknown for all channels a , and θ 0 is known. W e use L θ ( . ) to denote the log-l ikelihood functi on und er f θ defined by , L θ ( x ) , log  f θ ( x ) f θ 0 ( x )  , x ∈ R , θ ∈ Θ . (25) In th is section , we study two possible appro aches for dealing with such a scenario, while restricting to greedy policies for channel selection. For ease of illustration , in this section we consider a secondary sys tem comprised o f a s ingle user , although the sam e ideas can als o be applied for a s ystem with mu ltiple cooperating users. A. W orst-case design for non-random θ a A clos e examination of Section III re veals two specific uses for the d ensity functio n of the observations under the S a ( k ) = 1 hypothesis . T he kno wledge o f this densit y was of crucial importance in setting the access threshold in (10) t o m eet the constraint o n th e probabil ity of interference. The other place where this densi ty was used was in updating the belief prob abilities in (11). When the parameters { θ a } are non-random and unkno wn, we h a ve to guarantee the constraint on the interference p robability for all possi ble realizations of θ a . The optimal access decision would thus be gi ven by , ˆ δ a ( k ) = I { u k = a } Y θ ∈ Θ I {L θ ( Y a ( k )) <τ θ } (26) 15 where τ θ satisfies, P ( {L θ ( Y a ( k )) < τ θ }|{ S a ( k ) = 1 , θ a = θ } ) = ζ . (27) The other con cern that we need to address in this approach is: what dis tribution do we use for the observations under S a ( k ) = 1 i n order to perform t he updates in (11). An intelligent solution is possible provided the dens ities described in (24) satisfy th e condi tion t hat there is a θ ∗ ∈ Θ such that th e for all θ ∈ Θ and for all τ ∈ R the following inequality hol ds: P ( {L θ ∗ ( Y a ( k )) > τ }|{ S a ( k ) = 1 , θ a = θ } ) ≥ P ( {L θ ∗ ( Y a ( k )) > τ }|{ S a ( k ) = 1 , θ a = θ ∗ } ) . (28) The condit ion (28) is sat isfied by severa l parameterized densit ies inclu ding an i mportant practical example discussed later . Under conditi on (28), a good suboptimal solutio n t o the channel selection problem would be to run the greedy pol icy for channel selectio n using f θ ∗ for the dens ity under S a ( k ) = 1 while performing the updates of the channel beliefs i n (11). This i s a consequence of the following lem ma. Lemma 4.1: Assume condition (28) holds. Suppos e f ∗ θ is used i n place o f f ′ 1 for the d istribution of the observations under S a ( k ) = 1 w hile performing belief u pdates in (11). Then, (i) For all γ ∈ Θ and for all β , p ∈ [0 , 1 ] , P γ ( { p a ( k ) > β }|{ S a ( k ) = 1 , p a ( k − 1) = p } ) ≥ P θ ∗ ( { p a ( k ) > β }| { S a ( k ) = 1 , p a ( k − 1) = p } ) (29) where P θ represents the p robability measure when θ a = θ . (ii) Conditioned on { S a ( k ) = 0 } , the distribution of p a ( k ) giv en any value for p a ( k − 1) is identical for all possible values of θ a . Pr oof: (i) Clearly (29) holds with equality when channel a is not sensed in slo t k (i.e. u k 6 = a ). When u k = a , it is easy to see t hat the new belief g iv en by (11) i s a mono tonically increasing function of the log-likelihood function, L θ ∗ ( Y a ( k )) . Hence (29) foll ows from condition (28). (ii) This is obvious since the randomness in p a ( k ) under { S a ( k ) = 0 } is solely due to the observation Y a ( k ) whose distribution f θ 0 does not depend on θ a . Clearly , up dating usi ng f θ ∗ in (11) is optimal if θ a = θ ∗ . When θ a 6 = θ ∗ , t he tracking of beliefs are guaranteed to b e at least as accurate, in the sense describ ed in Lemma 4.1. Hence, under 16 condition (28), a good subop timal soluti on to the channel selecti on problem would b e to run the greedy policy for channel selection u sing f θ ∗ for the densit y under S a ( k ) = 1 whil e performing the up dates of th e channel beliefs in (11). Furthermore, it is kn own that [11] under conditi on (28), th e set of likelihood ratio t ests in t he access decision of (26) can b e replaced wi th a singl e likelihood rati o test und er th e worst case parameter θ ∗ giv en by , ˆ δ a ( k ) = I { u k = a } I {L θ ∗ ( Y a ( k )) <τ θ ∗ } . (30) The st ructure of t he access decision g iv en in (30), and the conclusion from Lemma 4.1 suggests that θ ∗ is a worst-case value o f th e parameter θ a . Hence the s trategy of designing the sensing and access poli cies assuming thi s worst po ssible value of the parameter is optimal i n the follo wing min-max sense: Th e av erage re ward when the true value of θ a 6 = θ ∗ is expected to be no smaller than that obtai ned wh en θ a = θ ∗ since the t racking o f beliefs is worst when θ a = θ ∗ as s hown in Lemm a 4.1. This in tuitive reasoning is seen to hol d in t he simulation results in Section V -B. B. Modeling θ a as random In Section V -B, we show through simulatio ns t hat the worst-case approach of the previous section leads to a se vere decline in performance relative to the scenario where the distribution parameters in (24) are k nown accurately . In practice it may be pos sible to l earn the value of these parameters online. In order to learn the para meters { θ a } we need to have a statis tical model for these parameters and a reliabl e statis tical model for the channel st ate process. In thi s section we mo del the parameters { θ a } as random variables, which are i.i.d. across t he channels and independent of the Mark ov process as well as the noise process. In order to assure the con ve rgence of our learning algorithm , we also assum e that t he cardinality of set Θ is finite 4 and let | Θ | = N . Let { µ i } N 1 denote the elem ents of set Θ . The prior dis tribution of the parameters { θ a } is known to th e secondary users. The beliefs of the different channels no longer form a suffi cient stati stic for this probl em. Instead, we keep t rack of the foll owing set of a posterio ri probabilities which we refer to as joint beli efs : { P ( { ( θ a , S a ( k )) = ( µ i , j ) }| I k ) : ∀ i, j, a } . (31) 4 W e do discuss the scenario when Θ is a compact set in the example considered in Section V -B. 17 Since we assum e that the parameters { θ a } take values in a finit e set, we can keep track of these joint beliefs jus t as we kept track of the beliefs of the states of di ff erent channels in Section III. For the i nitial values of these joint beliefs we use the p roduct dist ribution of the stationary distribution of the Markov chain and the prior di stribution on the parameters { θ a } . W e store these join t beliefs at the end of slot k i n an L × N × 2 array Q ( k ) wi th elements given by , Q a,i,j ( k ) = P ( { ( θ a , S a ( k )) = ( µ i , j ) }| Y k a , K k a ) . (32) The ent ries o f th e array Q ( k ) corresponding t o channel a represent t he j oint a posteriori p rob- ability distribution of th e parameter θ a and the state of channel a in slot k condi tioned on the information av ailable up to s lot k whi ch w e called I k . Now define, H a,i,j ( k ) = X ℓ ∈{ 0 , 1 } P ( ℓ, j ) Q a,i,ℓ ( k − 1) . Again, the values of the array H ( k ) represent the a posteriori prob ability d istributions about the parameters { θ a } and th e channel states in slo t k conditioned on I k − 1 , the in formation up to slot k − 1 . The update equations for the j oint beliefs can now be written as follo ws: Q a,i,j ( k ) =    λH a,i, 0 ( k ) f θ 0 ( Y a ( k )) if j = 0 λH a,i, 1 ( k ) f µ i ( Y a ( k )) if j = 1 when channel a was accessed in slot k , and Q a,i,j ( k ) = H a,i,j ( k ) otherwise. Here λ is just a normalizing factor . It is shown in Appendix that, for each channel a , the a posteriori probability mass funct ion of parameter θ a conditioned on the information up to slot k , con ver ges to a delt a-function at the true value of parameter θ a as k → ∞ , provided we sense channel a frequently enough. This essentially means that we can l earn the value of the actual realization of θ a by just updating the join t beliefs. This observ ation suggests that we could use t his knowledge learned about the parameters i n order to obt ain better performance t han that obt ained under the pol icy o f Section IV -A. W e could, for instance, use t he knowledge of the true value of θ a to be more liberal in our access policy than the sati sfy-all-constraints approach that we used in Section IV -A when we did a worst-case design. W ith this in mind, we propose the following al gorithm for choosi ng the threshol d to be used in each s lot for determin ing whet her or not t o access the spectrum . Assume channel a was sens ed in slo t k . W e first arrange the elements of set Θ i n increasing order of t he a p osteriori probabi lities of parameter θ a . W e partiti on Θ into two groups, a ‘lower’ 18 partition and an ‘upper’ partit ion, such that all elements in the lower partiti on h a ve lower a posteriori probabili ty values than all elements in the up per partition. Th e partitioni ng i s done such that the number of elements in the l ower partition is maximi zed sub ject t o the constraint that the a posteriori probabilities of the elements in the lower partition add up to a value lower than ζ . These elements of Θ can be ignored while designing the access policy since the sum of their a posteriori probabili ties is below t he interference constraint. W e then d esign the access policy such t hat we meet th e int erference constraint conditioned on parameter θ a taking an y value i n t he upper partition . The mathem atical description of t he algori thm is as fol lows. Define b i a ( k ) , X j ∈{ 0 , 1 } Q a,i,j ( k − 1) . The vector ( b 1 a ( k ) , b 2 a ( k ) , . . . , b N a ( k )) ⊤ represents the a posteriori probability mass function of parameter θ a conditioned o n I k − 1 , the information a va ilable up t o sl ot k − 1 . Now let π k ( i ) : { 1 , 2 , . . . , N } 7→ { 1 , 2 , . . . , N } be a permutation of { 1 , 2 , . . . , N } such that { µ π k ( i ) } N i =1 are arranged in in creasing order of post eriori probabilities, i.e. i ≥ j ⇔ b π k ( i ) a ( k ) ≥ b π k ( j ) a ( k ) and let N a ( k ) = max { c ≤ N : c X i =1 b π k ( i ) a ( k ) < ζ } . Now define set Θ a ( k ) = { µ π k ( i ) : i ≥ N a ( k ) } . This set is the upper partiti on mentioned earlier . The access decision on channel a in slot k is giv en by , 5 ˜ δ a ( k ) = I { u k = a } Y θ ∈ Θ a ( k ) I {L θ ( Y a ( k )) <τ θ } (33) where τ θ satisfy (27). The access p olicy g iv en above guarantees that P ( { ˜ δ a ( k ) = 1 }|{ S a ( k ) = 1 } , Y k − 1 , K k − 1 ) < ζ (34) whence the same holds without condition ing on Y k − 1 and K k − 1 . Hence, the interference con- straint is met on an a verage, a veraged ove r the posteriori distributions of θ a . Now it is shown in Appendix that the a po steriori probability mass function of parameter θ a con ve rges to a delta 5 The access policy obtained via the partitioning scheme is simple to implement but is not the optimal policy in general. The optimal access decision on channel a in slot k would be giv en by a l ikelihood -ratio test between f θ 0 and the mixture density P θ ∈ Θ r θ ( k − 1) f θ where r θ ( k − 1) represents the value of the posterior distribution of θ a after slot k − 1 , ev aluated at θ . Ho wev er setting thresholds for such a test is prohibiti vely comple x. 19 function at th e t rue value of parameter θ a almost surely . Hence the constraint is asym ptotically met e ven conditioned on θ a taking the correct value. This follows from the fact that, if µ i ∗ is the actual realization of the random v ariable θ a , and b i ∗ a ( k ) conv er ges to 1 almost surely , t hen, for suffi ciently large k , (33) becomes: ˜ δ a ( k ) = I { u k = a } I {L µ i ∗ ( Y a ( k )) <τ µ i ∗ } with probabilit y one and hence the claim is satis fied. It is imp ortant to note t hat the access policy given in (33) need not be the opti mal access policy for this problem . Unlike in Section II, h ere we are al lowing the access decision in sl ot k to depend on the obs erv ations in all sl ots u p to k via t he joint beliefs. Hence, i t is no longer obvious that the optimal test should be a th reshold test on the LLR of th e observations in th e current s lot eve n if parameter θ a is known. Howe ver , this structure fo r t he access pol icy can be justified from the observation that it is si mpler to implement in practice than som e other pol icy that requires us to keep track of all the p ast observations. The si mulation results that we present in Section V -B als o suggest that this schem e achieves substanti al improvement in performance over the worst-case approach, thus further justifying this structure for th e access policy . Under this scheme th e new greedy policy for channel selection is to sense the channel whi ch promises the highest e xpected instantaneous reward which is no w g iv en by , f u gr k = argmax a ∈C ( N X i =1 h a,i, 0 ( k )(1 − ǫ a ( k )) ) (35) where ǫ a ( k ) = P   [ θ ∈ Θ a ( k ) {L θ ( Y a ( k )) > τ θ }     { S a ( k ) = 0 }   . Howe v er , in order to prove th e con ver gence of the a pos teriori probabili ties of the parameters { θ a } , we need to make a s light modification to this channel selection policy . In our proof, we require that each channel is accessed frequently . T o enforce that this condition is satisfied, we modify the channel selection po licy so t hat the new channel selection s cheme is as follo ws: g u mod k =    C j if k ≡ j mo d C L, j ∈ C f u gr k else (36) where C > 1 is some constant and {C j : 1 ≤ j ≤ L } is s ome ordering of the channels in C . 20 V . S I M U L A T I O N R E S U L T S A N D C O M P A R I S O N S A. Known distr ibutions W e consider a simp le model for the distributions of the observations and illust rate the ad- vantage of our propo sed scheme over that in [4] b y simulatin g the performances obtained by employing the greedy alg orithm on both these schemes. W e also consider a combined scheme that uses bot h the channel observations and the AC K s ignals for updating beli efs. W e simulated th e greedy policy under three different schemes. Our scheme, which we call G 1 , uses only the ob serva tions made on the channels to update the belief vectors. The second one, G 2 , uses only the A CK signals t ransmitted by the secondary receiv er , while the third on e, G 3 , uses both o bservations as w ell as the error-free AC K s ignals. W e have performed the simulations for two differ ent values of the int erference con straint ζ . T he number of channels was kept at L = 2 in bot h cases and the t ransition probability matrix used was, P =   0 . 9 0 . 1 0 . 2 0 . 8   where the first in dex represents state 0 and the second represents st ate 1. Both channels were assumed to h a ve u nit bandwidth, B = 1 and the dis count fac tor was set to α = 0 . 999 . Such a high value o f α was chosen to approxim ate the problem with no discounts whi ch would be the problem of practical interest. As we saw in Section III, the spectrum access problem with a group of cooperating secondary users is equiv alent to that w ith a single user . Hence, i n our simulatio ns we use a scalar observation model with the following simple distributions for Y a ( k ) under the two hypo theses: S a ( k ) = 0 (pri mary OFF) : Y a ( k ) ∼ N (0 , σ 2 ) S a ( k ) = 1 (pri mary ON) : Y a ( k ) ∼ N ( µ, σ 2 ) (37) It is easy to verify t hat th e LLR for these observa tions is an increasing l inear functio n of Y a ( k ) . Hence the ne w access decisi ons are made by comparing Y a ( k ) to a threshold τ chosen such that, P ( { Y a ( k ) < τ }|{ S a ( k ) = 1 } ) = ζ (38) and access decisio ns are give n by , δ a ( k ) = I { Y a ( k ) <τ } I { u k = a } . (39) 21 −5 −4 −3 −2 −1 0 1 2 3 4 5 0 100 200 300 400 500 600 SNR → Reward → Observations only (G 1 ) ACK only (G 2 ) Both (G 3 ) Upper bound ζ = 0.01 ζ = 0.1 Fig. 1. Comparison of performances obtained with greedy policy that uses observa tions and greedy policy that uses A CKs. Performance obtained with greedy policy that uses both ACKs as well as observ ations and the upper bound are also sho wn. The belief u pdates in (8) are no w giv en by , p a ( k ) = q a ( k ) f ( µ, σ 2 , Y a ( k )) q a ( k ) f ( µ, σ 2 , Y a ( k )) + (1 − q a ( k )) f (0 , σ 2 , Y a ( k )) when channel a was selected in slot k (i .e. u k = a ), and p a ( k ) = q a ( k ) oth erwise. Here q a ( k ) is giv en by (7) and f ( x, y , z ) represents th e value of the Gaussian densit y function wit h mean x and var iance y e valuated at z . F or the mean and variance parameters i n (37) we use σ = 1 and choo se µ so that SNR = 2 0 log 10 ( µ/σ ) takes values from − 5 dB t o 5 dB. In t he case o f cooperative sensing, this SNR represents the effecti ve signal-to-noise ratio in the j oint LLR statis tic at the decision center , L ( X a ( k )) . W e perform simulations for two values of the interference constraint, ζ = 0 . 1 and ζ = 0 . 01 . As seen in Fig. 1, the strategy of using on ly A CK si gnals ( G 2 ) performs worse than the one that uses all t he observa tions ( G 1 ), especially for ζ = 0 . 01 , thus demonstrating that relying only on A CK si gnals com promises o n t he am ount of information that can be l earned. W e also 22 observe that the greedy policy attains a performance that i s withi n 10% of t he up per bou nd. It is also seen in the figure that the rew ard values obt ained under G 1 and G 3 are almos t equal. For ζ = 0 . 01 , it is seen that the two curves are overlapping. This obs erv ation s uggests that the extra advantage obtained by incorporating the A CK signals i s insi gnificant especially when the interference cons traint is low . The explanation for this ob serva tion is that the A CK sig nals are recei ved only when t he signal transm itted by the secondary t ransmitter successfully gets across to its receiv er . For th is to happen the state of t he primary channel should be ‘0’ and t he secondary must decide to access the channel. When the v alue of th e interference const raint ζ is low , the secondary accesses the channel only when the value of the Y a ( k ) is low . Hence the observations in this case carry a significant amount of i nformation about the s tates themselves and the additional inform ation that can be o btained from the A CK si gnals is no t signi ficant. Thus l earning using only observations is almost as good as learning using both observations as well as A CK signals in th is case. B. Unknown distributions W e compare th e performances of the two different approaches to the spectrum access problem with unk nown dist ributions that we discus sed in Section IV . W e use a parameterized version of t he observation m odel we used in the example in Section V -A. W e assume th at the primary and secondary users are st ationary and assume that the second ary user is u nawa re of its location relativ e to the prim ary transm itter . W e assum e th at the secondary user employs s ome form of ener gy detection, which means that the l ack of knowledge abo ut the location of t he primary manifests itself in the form of an unknown mean power of the signal from t he prim ary . Using Gaussian distributions as in Section V -A, we model t he l ack of knowledge of the received primary power by assum ing that the mean of the observation under H 1 on channel a is an unknown parameter θ a taking va lues in a finite s et of positive values Θ . The new hypotheses are: S a ( k ) = 0 : Y a ( k ) = N a ( k ) S a ( k ) = 1 : Y a ( k ) = θ a + N a ( k ) where N a ( k ) ∼ N (0 , σ 2 ) , θ a ∈ Θ , min(Θ) > 0 . (40) 23 For the set of parameterized d istributions in (40), the log-likelihood ratio function L θ ( x ) defined in (25 ) is l inear in x for all θ ∈ Θ . Hence comparing L θ ( Y a ( k )) to a threshold is equiv alent to com paring Y a ( k ) to some other threshold. Furthermore, for this set of parameterized distributions, it i s easy to see that the condit ional cumulativ e distribution function (cdf) of the observations Y a ( k ) under H 1 , cond itioned on θ a taking value θ , is monotonically decreasing in θ . Furthermore, under the assumpti on that min Θ > 0 , it fol lows that choosing θ ∗ = min Θ satisfies the conditions of (28). Hence the optimal access decision under the w orst-case approach given in (30) can be written as ˆ δ a ( k ) = I { u k = a } I { Y a ( k ) <τ w } (41) where τ w satisfies P ( { Y a ( k ) < τ w }|{ S a ( k ) = 1 , θ a = θ ∗ } ) = ζ (42) where θ ∗ = min Θ . Thus the worst-case solution for thi s set of parameterized di stributions is identical to that obtain ed for the probl em with kno wn d istributions described in (37) with µ replaced b y θ ∗ . Thu s the st ructures of the access policy , the channel selection policy , and the belief up date equations are id entical to those d eri ved in the example shown in Section V -A with µ replaced by θ ∗ . Similarly , the access policy for the case of random θ a parameters give n in (33) can now be written as ˜ δ a ( k ) = I { u k = a } I { Y a ( k ) <τ r ( k ) } (43) where τ r ( k ) satis fies P ( { Y a ( k ) < τ r ( k ) }|{ S a ( k ) = 1 , θ a = θ # ( k ) } ) = ζ (44) where θ # ( k ) = min Θ a ( k ) . The b elief updates and greedy channel selection are performed as described in Section IV -B. The quanti ty ǫ a ( k ) appearing in (35) can n ow be written as ǫ a ( k ) = P ( { Y a ( k ) > τ r ( k ) }|{ S a ( k ) = 0 } ) . W e simul ated the performances of bo th the schemes on t he hypotheses described in (40). W e used the same v alues of L , P , α and σ as in Section V -A. W e chos e set Θ such that the SNR values in dB given by 20 log µ i σ belong to th e set {− 5 , − 3 , − 1 , 1 , 3 , 5 } . The prior probabi lity distribution for θ a was chosen to be the uniform distribution on Θ . The i nterference cons traint 24 ζ was set to 0 . 01 . Both channels were assumed to have the sam e values of true SNR while t he simulatio ns were p erformed. The rew ard was comp uted over 10000 slots since the remaining slots do not contribute sig nificantly t o the rew ard. The value of C i n (36) was set to a value higher than the num ber of slot s considered so that the greedy channel selection poli cy always uses the second alternativ e in (36). Although we require (36) for our proof of conv er gence of the a posteriori probabil ities in th e Appendix, it was observed in s imulation s that this condit ion was not necessary for con ver gence. The resul ts of the simulations are gi ven in Fig. 2. The net rew ard values obtained under the worst-case desig n of Section IV -A and that obt ained with the algorithm for learning θ a giv en in Section IV -B are plotted. W e have als o in cluded the rewards obtained with t he greedy algorithm G 1 with known θ a values; these values denote the b est rewards that can be obtained with th e greedy policy when the parameters θ a are known exactly . Clearly , we see that the worst-case design gives us alm ost no improvement in performance for high v alues of actual SNR. This is because the threshold we choose i s too conserv ativ e for hig h SNR scenarios leading to several missed op portunities for t ransmitting. The minimal im provement i n performance at high SNR i s due to t he fact that the s ystem now has mo re accurate estim ates of the channel b eliefs although the update equations were designed for a lo wer SNR le vel. The learning scheme, on the other hand, yields a significant performance adv antage ov er the worst-case scheme for high SNR v alues as seen in the figure. It is als o seen t hat t here is a significant gap between th e performance with learning and that wit h known θ a values at high SNR values. This gap is du e to t he fact that the posteriori probabilities about the θ a parameters take som e time t o con v erge. As a result of this delay in conv er gence a conservati ve access t hreshold has to be used i n the initial slots thus leading t o a d rop in th e discount ed infinite horizon rew ard. Howe ver , if we w ere usi ng an ave rage re ward formu lation for the d ynamic program rather than a dis counted rew ard formulation, we would expect the two curves to overlap since the loss in the init ial slots is i nsignificant while computing the lo ng-term average re ward. Remark 5.1: So far in thi s paper , we hav e assumed that the car dinality of set Θ is finite. The p roposed learning algorithm can also be adapted for the case when Θ is a compact set. A simple example illu strates how th is may be do ne. Assuming parameterized dis tributions of the form described in (40), suppose that the value of θ a in dB is uniformly distributed in the int erv al [ − 4 . 5 , 4 . 5] and that we compute the posteriori probabili ties of θ a assuming that i ts value in d B 25 −5 −4 −3 −2 −1 0 1 2 3 4 5 0 50 100 150 200 250 SNR → Reward → Greedy policy (G 1 ) Worst case Learn parameters Fig. 2. Comparison of performances obtained with worst-case approach and learning approach. Performance obtained with greedy policy is also shown. The value of ζ = 0 . 01 . is quantized to th e finite set Θ = {− 5 , − 4 , . . . , 5 } . Now if the actual realization of θ a is between 1 dB and 2 dB, say 1 . 5 dB, then we expect to see low p osteriori probabiliti es for all elements of Θ except 1 dB and 2 d B and in t his case it would be safe to set the access t hreshold ass uming an SNR of 1 dB. Alth ough this t hreshold is not the best that can be set for the actual realization of θ a , it is s till a sign ificant improvement over t he worst-case threshold which would correspond to an SNR of − 4 . 5 dB. W e expect the a po steriori probabilit ies of all elements of Θ other than 1 dB and 2 dB to con ver ge t o 0 , but t he a posteriori p robabilities of these two values may not con ve rge; they may oscillate between 0 and 1 such that t heir sum con ver ges to 1 . A rigorous version of t he above argument w ould require some ordering of the parameterized distri butions as in (28). 26 V I . C O N C L U S I O N S A N D D I S C U S S I O N The results of Section V -A and t he arguments we presented in Section III-C clearly show that our scheme of esti mating the channel o ccupancies us ing the observations yiel ds performance gains and may hav e practical advantages ove r the A CK-based scheme that was proposed in [4]. W e believe that these advantages are si gnificant enough to justify using our scheme even th ough it necessitates the u se of dedi cated con trol channels for synchronization. For the scenario where t he distri butions of the recei ved signals from the primary transmitters are unkno wn and belong to a parameterized family , t he simul ation resul ts in Section V -B sugg est that designi ng for worst-case values of th e unknown parameters can lead to a significant drop in performance relativ e to th e scenario where the dist ributions are known. Our proposed learning- based scheme overcomes this performance drop by learning the p rimary sign al’ s statis tics. Th e ca veat is that the l earning procedure requires a reli able model for the state transit ion process if we n eed to giv e probabilisti c guarantees of t he form (34) and t o ensure con vergence of the beliefs about th e θ a parameters. In most o f the existing l iterature on sensing and access p olicies for cognitiv e radios that use ener gy detectors, the typical practice is to consider a w orst-case mean power under the primary-present hypothesis. The reasoning behind this approach is that the cognitive users hav e to guarantee that the probability of interfering wi th any primary recei ver located within a protected region [10], [9] around the primary transm itter is below the interference constraint. Hence it is natural to assume that th e mean po wer of t he primary s ignal is the worst-case one, i.e., the mean power that one would expect at the edge of the protected region. Howe ver , th e problem w ith this approach is that by designing for the worst-case distribution, the secondary users are forced to set conservati ve thresholds whi le making access decisions. Hence e ven when the second ary us ers are close t o the primary transmitter and the SNR of the si gnal t hey receiv e from the primary transmitter is high, they cannot ef ficiently detect vacancies in the primary spectrum. Inst ead, if they were a ware that they were close to the transmitter they could ha ve detected spectral vac ancies mo re effic iently as demon strated by t he imp rovement in p erformance at higher SNRs observed in the simulation example in Section V -A. This loss in performance is overcome by the learning schem e proposed in Section IV -B. By learning t he value of θ a the secondary users can now set more lib eral t hresholds and hence exploit vacancies in the primary s pectrum better when 27 they are located close to the primary transmitter . Thus , using such a scheme would produce a significant performance imp rove ment in ov erall throu ghput of the cogn itive radio sys tem. A P P E N D I X Here we show t hat for each channel a , the a posteriori probabilit y m ass function of parameter θ a con ve rges to a delta-function at the true va lue of the parameter almost surely under the algorithm described in Section IV -B. Theor em A.1: A ssume th at the transition probability matrix P satisfies (16). Further assu me that the condi tional densities of t he observations give n in (24) satisfy Z f µ i ( y ) | log( f µ i ( y )) | dy < ∞ for all µ i ∈ Θ , (45) and that all densities in (24 ) are distinct. Then, und er the channel sensing scheme that was introduced in (36), for each channel a , P ( { θ a = µ i }| Y n ) a.s. − − − → n →∞ I { θ a = µ i } , for all µ i ∈ Θ . Pr oof: W ithout lo ss of generality , we can restrict ourselves to the proof of the con ver g ence of the a po steriori distribution of θ 1 , the parameter for the first channel. By the modified sensing scheme int roduced in (36), it can be seen that channel 1 i s sensed at least every M L slots. Hence, if the a posteriori dis tribution con ver ges for an algorith m t hat senses channel 1 exactly ev ery M L slots, it sho uld con ver ge even for o ur algo rithm, since our algorithm updates the a pos teriori probabilities more frequently . Furthermore b y considering an M L -tim es undersampled version of the Markov chain that determi nes the ev olution of channel 1 , without lo ss of generality , it i s suffi cient to show con ver gence for a sensing poli cy in which channel 1 is sensed i n eve ry slot . It is obvious that since condi tion (16) holds for th e original Markov chain, it holds ev en for the undersampled version. So n ow we assume that an o bservation Y k is made on channel 1 in every slot k . W e use Y k to represent al l observations on channel 1 up to slot k . W e use µ i ∗ ∈ Θ to represent t he true realization of random variable θ 1 with i ∗ ∈ { 1 , . . . , N } , and π to denote the p rior d istribution of θ 1 . Th e a post eriori probabili ty mass fun ction of θ 1 e valuated at µ i after n t ime slots can be expressed as P ( { θ 1 = µ j }| Y n ) = P j ( Y n ) π ( µ j ) P i P i ( Y n ) π ( µ i ) (46) 28 where we use the notati on P i ( . ) to denot e the distribution of the obs erv ations con ditioned on θ 1 taking the va lue µ i ∈ Θ . It follows from [12, Theorem 1, Theorem 2, and Lem ma 6] that conditioned on { θ 1 = µ i ∗ } we hav e, P i ∗ ( Y n ) P i ( Y n ) a.s. − − − → n →∞ ∞ for all i 6 = i ∗ . Hence, it follows from (46) that condit ioned on { θ 1 = µ i ∗ } we have, P i ∗ ( Y n ) π ( µ i ∗ ) P i P i ( Y n ) π ( µ i ) a.s. − − − → n →∞ 1 which further imp lies that conditio ned on { θ 1 = µ i ∗ } we have, P j ( Y n ) π ( µ j ) P i P i ( Y n ) π ( µ i ) a.s. − − − → n →∞ I { i ∗ = j } . Since this hold s for all possi ble realizations µ i ∗ ∈ Θ of θ 1 , the result follows. A C K N OW L E D G M E N T The authors would like to thank Prof. V ivek Borkar for ass istance with the proof of the con ve rgence of t he l earning algorit hm in Section IV -B. R E F E R E N C E S [1] Q. Zhao, B. Kr ishnamachari, and K. Li u, “On Myopic Sensing for Multi-Channel Opportunistic Access: Structure, Optimality , and Performance, ” IEEE T ra nsactions on W ireless Communications , vol. 7, no. 12, pp. 5431-5440 , December , 2008. [2] S. H. A. Ahmad, M. Liu, T . Javidi, Q. Zhao, and B. Krishnamachari, “Optimality of Myopic Sensing in Multi-Channel Opportunistic Access, ” IEEE T ransac tions on Information Theory , vol. 55, no. 9, pp. 4040-4050, September 2009. [3] Q. Zhao, L. T ong, A. Swami, and Y . Chen, “Decentralized cogniti ve MA C for opportunistic spectrum access in ad hoc networks: A POMDP framew ork, ” IE EE Jou rnal on Selected A r eas in Communications (JSAC): Special Issue on Adaptive, Spectrum Agile and Cognitive W ireless Networks , vol. 25, no. 3, pp. 589-600, April 2007. [4] Y . Chen, Q. Zhao, and A. S wami, “Joint Design and Separation Principle for Oppo rtunistic S pectrum Access in the Presence of Sensing Errors, ” IEEE T ransa ctions on Information Theory , vol. 54, no. 5, pp. 2053-2071, May , 2008. [5] A. Motamedi and A. Bahai, “Optimal Channel Selection for S pectrum-Agile Low-Powe r W ireless Packet Switched Networks in Unlicensed Band, ” EURASIP Journal on W ir eless Communications and Networking , vol. 2008, Article ID 896420 , 10 pages, 2008 . doi:10.1155/2008/89 6420. [6] H. V . Poor, An Intr oduction to Signal Detection and Estimation , 2nd ed. Ne w Y ork: Springer-V erlag, 1994. [7] D. Bertsekas, Dynamic Pr ogr amming and Optimal Contr ol , vol.1, Athena Scientific 2005. [8] D. Aberdeen, “ A (r e vised) survey of approximate methods for solving POMDPs, ” T echnical Report, Dec. 2003, http://users. rsise.anu.edu .au/ ∼ daa/pap ers.html . 29 [9] J. Unnikrishnan and V . V . V eerav alli, “Cooperativ e sensing for primary detection in cognitive radio, ” IEEE J ournal of Selected T opics in Signal Pro cessing, Special Issue on Signal Proce ssing and Networking for Dynamic Spectrum Access , vol. 2, pp. 18-27, February 2008. [10] A. S ahai, N. Hoven, and R. T andra, “Some fundamental limi ts on cogniti ve radio, ” i n Proc. F orty-Second All erton Confer ence on Communication, Contr ol, and Computing , Oct. 2004. [11] V . V . V eerav alli, T . Basar , and H. V . Poor , “Minimax rob ust decentralized detection, ” IEEE T ransac tions on Information Theory , vol. 40, pp. 35-40, January 1994. [12] B. G. Leroux, “Maximum-likelihood estimation of hidden Mark ov models, ” Stoc hastic Proce sses and their Applications , vol. 40, pp. 127-143, 1992.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment