Delay-Optimal Power and Subcarrier Allocation for OFDMA Systems via Stochastic Approximation

In this paper, we consider delay-optimal power and subcarrier allocation design for OFDMA systems with $N_F$ subcarriers, $K$ mobiles and one base station. There are $K$ queues at the base station for the downlink traffic to the $K$ mobiles with hete…

Authors: Vincent K. N. Lau, Ying Cui

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 9, NO. 1, JANUARY 2010

Delay-Optimal Power and Subcarrier Allocation for OFDMA Systems via Stochastic Approximation

Vincent K. N. Lau, Senior Member, IEEE, Ying Cui, Student Member, IEEE
Department of ECE, The Hong Kong University of Science and Technology

Abstract: In this paper, we consider delay-optimal power and subcarrier allocation design for OFDMA systems with $N_F$ subcarriers, $K$ mobiles and one base station. There are $K$ queues at the base station for the downlink traffic to the $K$ mobiles with heterogeneous packet arrivals and delay requirements. We model the problem as a $K$-dimensional infinite horizon average reward Markov Decision Problem (MDP) where the control actions are assumed to be a function of the instantaneous Channel State Information (CSI) as well as the joint Queue State Information (QSI). This problem is challenging because it corresponds to a stochastic Network Utility Maximization (NUM) problem, for which a general solution is still unknown. We propose an online stochastic value iteration solution using stochastic approximation. The proposed power control algorithm, which is a function of both the CSI and the QSI, takes the form of multi-level water-filling. We prove that under the two mild conditions in Theorem 1 (a stepsize condition, and a condition on the accessibility of the Markov chain that is easily satisfied in most cases of interest), the proposed solution converges to the optimal solution almost surely (with probability 1), and the proposed framework offers a possible solution to the general stochastic NUM problem. By exploiting the birth-death structure of the queue dynamics, we obtain a reduced-complexity decomposed solution with linear $O(K N_F)$ complexity and $O(K)$ memory requirement.

Manuscript received January 8, 2009; revised July 4, 2009 and October 5, 2009; accepted October 22, 2009. The associate editor coordinating the review of this paper and approving it for publication was H. Dai. The authors are with the Department of ECE, the Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: eeknlau, cuiying@ust.hk).

I. INTRODUCTION

There is plenty of literature on cross-layer optimization of power and subband allocation in OFDMA systems [1], [2]. Yet, all these works focus on optimizing the physical layer performance, and the power/subband allocation solutions derived are all functions of the channel state information (CSI) only. On the other hand, real-life applications are delay-sensitive, and it is critical to consider the delay performance in addition to the conventional physical layer performance in OFDMA cross-layer design. A combined framework taking into account both queueing delay and physical layer performance is not trivial, as it involves both queueing theory (to model the queue dynamics) and information theory (to model the physical layer dynamics). One approach converts the delay constraint into an average rate constraint using the tail probability in the large-delay regime and solves the optimization problem using an information-theoretical formulation based on the rate constraint [3]-[5]. While this approach allows potentially simple solutions, the derived control policy is a function of the CSI only. In general, delay-optimal control actions should be a function of both the CSI and the queue state information (QSI). In [6], [7], the authors showed that the LQHPR policy is delay-optimal for multiaccess fading channels.
However, the solution utilizes stochastic majorization theory, which requires symmetry among the users and is difficult to extend to other situations. In [8]-[10], the authors studied the queue stability region of various wireless systems using Lyapunov drift. Under the assumption that all queues are large enough, the GPD (Greedy Primal-Dual) algorithm [11] and the RT-SPD (Real Time Stochastic Primal-Dual) algorithm [12] were proposed to solve utility-based optimization problems under a queueing network stability constraint and an average delay constraint, respectively. While all the above works address different aspects of the delay-sensitive resource allocation problem, there are still a number of first-order issues to be addressed.

In this paper, we consider an OFDMA wireless system with $N_F$ subcarriers, $K$ mobiles and a base station. There are $K$ queues for the mobiles at the base station with heterogeneous arrivals and departures. The delay-optimal power and subcarrier allocation actions, which minimize the average delay of the $K$ MSs under the average total power constraint and the subcarrier allocation constraint, are functions of both the CSI and the joint QSI. We elaborate the major challenges behind this problem below.

• The Curse of Dimensionality: A more general approach is to model the problem as a Markov Decision Problem (MDP) [13], [14]. However, a primary difficulty in determining the optimal policy using the MDP approach is the huge state space involved. For instance, the state space is exponentially large$^{1}$ in the number of users. Hence, a brute-force solution by value iteration or policy iteration is not applicable due to the huge complexity and memory requirement involved.
• Issues of Convergence in Stochastic NUM Problems: In conventional iterative solutions for deterministic NUM problems, the updates in the iterative algorithms (such as subgradient search) are performed within the coherence time of the CSI (the CSI remains quasi-static during the iteration updates)$^{2}$ [15]. In stochastic NUM, however, the updates are performed over the ergodic realizations of the system states, and hence the convergence proof is challenging (as for GPD in [11] and RT-SPD in [12]). When we consider the delay-optimal problem, the problem is stochastic and the control actions are defined over the ergodic realizations of the system states (CSI, QSI). Therefore, the convergence proof is also quite challenging.

• Heterogeneous Users: Some works obtained a simple delay-optimal policy for multiple access channels by using majorization theory and exploiting symmetry between users [6], [7]. However, such simplifications cannot be extended easily to our case, in which users have heterogeneous arrivals and delay requirements.

$^{1}$ For a system with 4 users, 6 subcarriers, a buffer size of 50 per user and 4 channel states, the system state space contains $4^{4 \times 6} \times (50+1)^{4}$ states, which is already unmanageable.

$^{2}$ This poses a serious limitation on the practicality of the distributive iterative solutions because the convergence and the optimality of the iterative solutions are not guaranteed if the CSI changes significantly during the update.

In this paper, we address the above issues by proposing a low-complexity solution to the delay-optimal OFDMA system.
To address the open issue concerning the huge complexity involved in solving the $K$-dimensional MDP, we utilize stochastic approximation (SA) techniques [16], [17] to derive a low-complexity online stochastic value iteration algorithm. We show that under some mild conditions, the proposed online stochastic value iteration algorithm converges to the optimal solution almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics, we obtain a reduced-complexity decomposed solution with linear $O(K N_F)$ complexity and $O(K)$ memory requirement.

II. SYSTEM MODELS

In this section, we elaborate the system model, the OFDMA physical layer model as well as the underlying queueing model. Fig. 1 illustrates the top-level system model. There are $K$ user queues at the base station which buffer packets for the $K$ mobile users in the OFDMA system. These $K$ application streams may have different source arrival rates and delay requirements, which corresponds to a heterogeneous user situation. The base station has a cross-layer scheduler which takes the CSI and the joint QSI as inputs and produces a power allocation and subcarrier allocation action as outputs.

[Figure: $K$ queues with arrival rates $\lambda_k$ and QSI-dependent service rates $\mu_k(\mathbf{Q})$, driven by a resource allocation controller that takes the CSI and QSI as inputs and outputs the subcarrier allocation policy $\Omega_s$ and the power allocation policy $\Omega_p$.]
Fig. 1. OFDMA physical layer and queueing model.

A. OFDMA Physical Layer Model

We consider an OFDMA system with $N_F$ subcarriers over a frequency selective fading channel with $L$ independent multipaths. As a result, the received signal at the $k$-th mobile and $n$-th subcarrier is given by $Y_{k,n} = H_{k,n} X_{k,n} + Z_{k,n}$, where $X_{k,n}$ is the transmitted symbol and $H_{k,n}, Z_{k,n} \sim \mathcal{CN}(0,1)$ are the fading coefficient (CSI) and the noise, respectively.
Let $s_{k,n} \in \{0,1\}$ denote the subcarrier allocation index for the $k$-th user at the $n$-th subcarrier. For simplicity, we assume powerful channel coding (such as LDPC) at the transmitter. Furthermore, the mobile receiver has perfect knowledge of the CSI. Hence, the maximum achievable data rate of the $k$-th user is given by the mutual information between the channel inputs $\{X_{k,n} : s_{k,n} = 1\}$ and the channel outputs $\{Y_{k,n} : s_{k,n} = 1\}$, which is given by:

$$R_k = \sum_{n=1}^{N_F} s_{k,n}\, I(X_{k,n}; Y_{k,n} \mid H_{k,n}) = \sum_{n=1}^{N_F} s_{k,n} \log\left(1 + p_{k,n} |H_{k,n}|^2\right) \tag{1}$$

B. Queue Model, System Dynamics and Control Policy

In this paper, the time dimension is partitioned into scheduling slots indexed by $t$. Let $\tau$ (second/slot) be the slot duration.

Assumption 1: The joint CSI $\mathbf{H}(t) = \{H_{k,n}(t)\ \forall n, k\}$ remains quasi-static within a scheduling slot and is i.i.d. between scheduling slots$^{3}$.

There are $K$ queues at the base station transmitter for the $K$ mobiles respectively. Data arrives in packets according to $K$ random arrival processes and each packet is stored in one of the $K$ queues according to its destination. Let $\mathbf{A}(t) = (A_1(t), \cdots, A_K(t))$ and $\mathbf{N}(t) = (N_1(t), \cdots, N_K(t))$ be the random new packet arrivals at the end of the $t$-th scheduling slot and the packet sizes of the packets at the front of the queues for the $K$ users at the beginning of the $t$-th scheduling slot, respectively.

Assumption 2: The arrival process $A_k(t)$ and the random packet size $N_k(t)$ are i.i.d. over scheduling slots.

Let $\mathbf{Q}(t) = (Q_1(t), \cdots, Q_K(t))$ be the joint QSI of the $K$-user OFDMA system, where $Q_k(t)$ denotes the unfinished work (number of packets) in the $k$-th queue at the beginning of the $t$-th slot. Let $\mathbf{R}(t) = (R_1(t), \cdots, R_K(t))$ be the scheduled data rates (bit/second) of the $K$ users, where $R_k(t)$ is given by (1).
At the beginning of the $t$-th scheduling slot, the cross-layer scheduler observes the CSI $\mathbf{H}(t)$ and the joint QSI $\mathbf{Q}(t)$ and calculates the scheduled data rate $\mathbf{R}(t)$. We assume the scheduler at the transmitter is causal in the sense that the new packet arrivals $\mathbf{A}(t)$ at the $t$-th slot occur after the scheduler's action. Hence, the queue dynamics are given by the following equation:

$$Q_k(t+1) = \min\left\{ \left[ Q_k(t) - \frac{R_k(t)\,\tau}{N_k(t)} \right]^{+} + A_k(t),\ N_Q \right\} \tag{2}$$

where $x^{+} = \max\{x, 0\}$ and $N_Q$ denotes the maximum buffer size (number of packets). Thus, the cardinality of the joint QSI is $I = (N_Q + 1)^K$, which grows exponentially with $K$.

$^{3}$ The quasi-static assumption is realistic for pedestrian mobility users, where the channel coherence time is around 50 ms but the typical frame duration is less than 5 ms in next generation wireless systems such as WiMAX. On the other hand, we assume the CSI is i.i.d. between slots (as in much of the literature) in order to capture first-order insights. A similar solution framework can be applied to deal with FSMC fading.

For notational convenience, we denote $\boldsymbol{\chi}(t) = (\mathbf{H}(t), \mathbf{Q}(t))$ to be the global system state at the $t$-th slot. Given an observed system state realization $\boldsymbol{\chi}$, the transmitter may adjust the transmit power and subcarrier allocation according to a stationary power control and subcarrier allocation policy defined below.

Definition 1 (Stationary Power Control and Subcarrier Allocation Policy): A stationary transmit power and subcarrier allocation policy $\Omega = (\Omega_p, \Omega_s)$ is a mapping from the system state $\boldsymbol{\chi}$ to the power and subcarrier allocation actions. A policy $\Omega$ is called feasible if the associated actions satisfy the average total transmit power constraint and the subcarrier assignment constraint.
Specifically, $\Omega_p(\boldsymbol{\chi}) = \mathbf{p}$ and $\Omega_s(\boldsymbol{\chi}) = \mathbf{s}$, where $\mathbf{p} = \{p_{k,n}\}$ are the power allocation actions satisfying

$$\sum_{k=1}^{K} \sum_{n=1}^{N_F} E[p_{k,n}] \le P_0, \qquad p_{k,n} \ge 0 \tag{3}$$

and $\mathbf{s} = \{s_{k,n} \in \{0,1\}\}$ are the subcarrier allocation actions satisfying

$$\sum_{k=1}^{K} s_{k,n} = 1 \qquad \forall n \in \{1, \cdots, N_F\} \tag{4}$$

Note that the $K$ queues are coupled together via the control policy $\Omega$ and the constraints in (3) and (4). The goal of the scheduler at the transmitter is to choose an optimal stationary feasible policy $\Omega^{*}$ that minimizes the average delays of the $K$ users. Assume that the arrival rate vector falls inside the stability region of the system [10]. Given a feasible unichain policy $\Omega$, the induced Markov chain $\{\boldsymbol{\chi}(t)\}$ is ergodic and there exists a unique steady state distribution $\pi_{\chi}$, where $\pi_{\chi}(\boldsymbol{\chi}_0) = \lim_{t \to \infty} \Pr[\boldsymbol{\chi}(t) = \boldsymbol{\chi}_0]$. Hence, by Little's Law [18], the average delay of the $k$-th user under a policy $\Omega$ is given by:

$$T_k(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \frac{E[Q_k(t)]}{\lambda_k} = \frac{E_{\pi_{\chi}}[Q_k]}{\lambda_k}, \qquad \forall k \in \{1, \cdots, K\} \tag{5}$$

where $E_{\pi_{\chi}}$ denotes expectation w.r.t. the underlying measure $\pi_{\chi}$. Similarly, the average transmit power constraint in (3) is given by

$$P_{tx}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\Big[\sum_{k,n} p_{k,n}(t)\Big] = E_{\pi_{\chi}}\Big[\sum_{k,n} p_{k,n}\Big] \le P_0 \tag{6}$$

The average delay is related to the control actions (power and subcarrier allocation) via the packet service rates $\mathbf{R}(t)$ in (1) and the queue dynamics in (2). The delay-optimal scheduler design can be formulated as the following optimization problem$^{4}$.

$^{4}$ We can apply exactly the same solution framework to the optimization problem which maximizes the physical layer throughput under the average delay constraint and the average power constraint, because the Lagrangian function of such a problem has the same form as our delay-optimal problem in (7) (both belong to the class of infinite horizon MDPs).
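As a rough illustration of how (1), (2) and (5) fit together, the sketch below simulates a single user on a single subcarrier with a fixed transmit power. Several simplifications are assumptions made only for this example and are not part of the model above: arrivals are Bernoulli with probability `lam` per slot, the packet size is a constant rather than the random $N_k(t)$, the rate is measured in nats, and all function and parameter names are hypothetical.

```python
import math
import random


def rate(s_k, p_k, h2_k):
    """Eq. (1): sum_n s_{k,n} * log(1 + p_{k,n} |H_{k,n}|^2), in nats."""
    return sum(s * math.log(1.0 + p * h2) for s, p, h2 in zip(s_k, p_k, h2_k))


def queue_step(q, r, tau, packet_size, arrivals, n_q):
    """Eq. (2): serve, clip at zero, then add arrivals and clip at N_Q."""
    after_service = max(q - r * tau / packet_size, 0.0)
    return min(after_service + arrivals, n_q)


def simulate_delay(num_slots=20000, lam=0.3, power=2.0, tau=1.0,
                   packet_size=1.0, n_q=20, seed=0):
    """Estimate the average delay E[Q] / lambda of Eq. (5) for one user."""
    rng = random.Random(seed)
    q, q_sum = 0.0, 0.0
    for _ in range(num_slots):
        q_sum += q                       # Q(t) seen at the start of slot t
        h2 = rng.expovariate(1.0)        # |H|^2 is Exp(1) when H ~ CN(0, 1)
        r = rate([1], [power], [h2])     # single subcarrier, always assigned
        arrivals = 1 if rng.random() < lam else 0   # assumed Bernoulli arrivals
        q = queue_step(q, r, tau, packet_size, arrivals, n_q)
    return (q_sum / num_slots) / lam     # delay in slots via Little's law


if __name__ == "__main__":
    print(f"estimated average delay: {simulate_delay():.3f} slots")
```

Because the power here ignores the queue length, this corresponds to a CSI-only style baseline; a QSI-aware policy would replace the constant `power` with a function of `q` and `h2`.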
Problem 1 (Delay-Optimal Policy): For some constants $\boldsymbol{\beta} = (\beta_1, \cdots, \beta_K)$ ($\beta_k > 0\ \forall k$), we seek to find a stationary policy $\Omega$ that minimizes

$$J_{\beta}^{\Omega} = \sum_{k=1}^{K} \beta_k T_k(\Omega) + \gamma P_{tx}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E\Big[ g\big(\boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t))\big) \Big] \tag{7}$$

where $g(\boldsymbol{\chi}, \{\mathbf{p}, \mathbf{s}\})$ is the per-stage reward given by:

$$g(\boldsymbol{\chi}, \{\mathbf{p}, \mathbf{s}\}) = \sum_{k} \beta_k \frac{Q_k}{\lambda_k} + \gamma \sum_{k,n} p_{k,n} \tag{8}$$

The positive weighting factors $\boldsymbol{\beta}$ in (7) indicate the relative importance of buffer delay among the $K$ data streams, and for each given $\boldsymbol{\beta}$, the solution to (7) corresponds to a point on the Pareto optimal delay tradeoff boundary of a multi-objective optimization problem. The constant $\gamma > 0$ is the Lagrange multiplier for the average transmit power constraint in (6).

III. MARKOV DECISION PROBLEM FORMULATION

In this section, we formulate the delay minimization problem in (7) as an infinite horizon Markov Decision Problem (MDP) and discuss the optimality condition.

A. MDP Formulation

A stationary control policy $\Omega$ induces a joint distribution for the random process $\{\boldsymbol{\chi}(t)\}$. Since the system queue level $\mathbf{Q}(t)$ evolves according to the system dynamics described in (2), and the arrival, departure and CSI processes are memoryless, the random process $\boldsymbol{\chi}(t)$ is a Markov chain$^{5}$. Hence, the optimization problem in (7) can be modeled as an MDP with transition probability given by$^{6}$:

$$\begin{aligned} \Pr[\boldsymbol{\chi}(t+1) \mid \boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t))] &= \Pr[\mathbf{H}(t+1) \mid \boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t))]\, \Pr[\mathbf{Q}(t+1) \mid \boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t))] \\ &= \Pr[\mathbf{H}(t+1)]\, \Pr[\mathbf{Q}(t+1) \mid \boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t))] \end{aligned} \tag{9}$$

As a result, the optimizing policy for the MDP in (7) can be obtained by solving the Bellman equation [13] (Page 308) recursively w.r.t.
$(\theta, \{V(\boldsymbol{\chi})\})$ as below:

$$\theta + V(\boldsymbol{\chi}) = \min_{u(\boldsymbol{\chi})} \Big[ g(\boldsymbol{\chi}, u(\boldsymbol{\chi})) + \sum_{\boldsymbol{\chi}'} \Pr[\boldsymbol{\chi}' \mid \boldsymbol{\chi}, u(\boldsymbol{\chi})]\, V(\boldsymbol{\chi}') \Big] \qquad \forall \boldsymbol{\chi} \tag{10}$$

where $u(\boldsymbol{\chi}) = \Omega(\boldsymbol{\chi}) = \{\mathbf{p}, \mathbf{s}\}$ are the power control and subcarrier allocation actions taken in state $\boldsymbol{\chi}$, and $g(\boldsymbol{\chi}, u(\boldsymbol{\chi}))$ given by (8) is the per-stage reward when the current state is $\boldsymbol{\chi}$ and action $u(\boldsymbol{\chi})$ is taken. If there is a $(\theta, \{V(\boldsymbol{\chi})\})$ satisfying (10), then $\theta = J_{\beta}^{*} = \inf_{\Omega} J_{\beta}^{\Omega}$ is the optimal average reward per stage and the optimizing policy is given by $\Omega^{*}(\boldsymbol{\chi}) = u^{*}(\boldsymbol{\chi})$, where $u^{*}(\boldsymbol{\chi})$ are the optimizing actions of (10) at state $\boldsymbol{\chi}$. Furthermore, since the induced Markov chain $\{\boldsymbol{\chi}(t)\}$ is irreducible for any stationary policy considered, the solution to (10) is unique up to one additive constant.

$^{5}$ Given the current queue state $\mathbf{Q}(t)$, the departure $\mathbf{R}(t)$ (which is determined by the control action $\Omega(\boldsymbol{\chi}(t))$) and the arrival $\mathbf{A}(t)$, the future queue level $\mathbf{Q}(t+1)$ is independent of the previous system states.

$^{6}$ Although the QSI $\mathbf{Q}(t+1)$ and the CSI $\mathbf{H}(t)$ are correlated via the control action $\Omega(\boldsymbol{\chi}(t))$, due to the causality of the control action, $\mathbf{H}(t+1)$ is independent of $\boldsymbol{\chi}(t)$ and $\mathbf{Q}(t+1)$.

B. Reduced State Bellman Equation

Instead of working on the global state space $\boldsymbol{\chi} = (\mathbf{H}, \mathbf{Q})$, we derive a reduced-state Bellman equation from (10) using a partitioning of the control policy $\Omega$ that is based on the partial system state $\mathbf{Q}$ only. Specifically, we partition a unichain policy $\Omega$ into a collection of actions based on the QSI as follows:

Definition 2 (Conditional Actions): Given a control policy $\Omega$, we define $\Omega(\mathbf{Q}) = \{(\mathbf{p}, \mathbf{s}) = \Omega(\boldsymbol{\chi}) : \boldsymbol{\chi} = (\mathbf{Q}, \mathbf{H})\ \forall \mathbf{H}\}$ as the collection of actions under a given QSI $\mathbf{Q}$ for all possible CSI $\mathbf{H}$. The policy $\Omega$ is therefore equal to the union of all the conditional actions, i.e. $\Omega = \bigcup_{\mathbf{Q}} \Omega(\mathbf{Q})$.
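To make the fixed-point structure of the Bellman equation (10) concrete, the sketch below solves it on a toy single-queue MDP by relative value iteration, a standard offline method for average-cost MDPs. This is only an illustrative baseline under stated assumptions, not the online algorithm developed in this paper: the toy model (`build_queue_mdp`, with two actions that trade a hypothetical power cost against a departure probability) and all parameter values are invented for the example.

```python
def build_queue_mdp(n_q=5, arr=0.3, dep=(0.2, 0.5), power=(0.0, 1.0),
                    beta=1.0, gamma=2.0):
    """Toy single-queue MDP: action u picks a (departure prob, power) pair."""
    n = n_q + 1
    # Per-stage cost in the spirit of (8): beta * Q / lambda + gamma * p.
    costs = [[beta * q / arr + gamma * power[u] for u in range(len(dep))]
             for q in range(n)]
    trans = []
    for q in range(n):
        rows = []
        for d in dep:
            row = [0.0] * n
            row[min(q + 1, n_q)] += arr      # one arrival (clipped at N_Q)
            row[max(q - 1, 0)] += d          # one departure (clipped at 0)
            row[q] += 1.0 - arr - d          # no event in this slot
            rows.append(row)
        trans.append(rows)
    return costs, trans


def relative_value_iteration(costs, trans, ref=0, tol=1e-10, max_iter=100000):
    """Solve theta + V(x) = min_u [g(x,u) + sum_y P(y|x,u) V(y)], V(ref) = 0."""
    n = len(costs)
    v = [0.0] * n
    theta, policy = 0.0, [0] * n
    for _ in range(max_iter):
        tv, policy = [], []
        for x in range(n):
            q_vals = [costs[x][u]
                      + sum(p * v[y] for y, p in enumerate(trans[x][u]))
                      for u in range(len(costs[x]))]
            best = min(range(len(q_vals)), key=q_vals.__getitem__)
            tv.append(q_vals[best])
            policy.append(best)
        theta = tv[ref] - v[ref]             # average cost estimate
        new_v = [t - tv[ref] for t in tv]    # pin the reference state to zero
        done = max(abs(a - b) for a, b in zip(new_v, v)) < tol
        v = new_v
        if done:
            break
    return theta, v, policy


if __name__ == "__main__":
    theta, v, policy = relative_value_iteration(*build_queue_mdp())
    print("average cost:", theta)
    print("potentials:", v)
    print("policy:", policy)
```

Pinning the potential of a reference state to zero removes the additive-constant ambiguity noted after (10). The table `v` has one entry per state, which is exactly what becomes infeasible when the state space grows exponentially in $K$.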
The following lemma summarizes the main result.

Lemma 1 (Equivalent Reduced-State Bellman Equation): The control policy obtained by solving the original Bellman equation in (10) is equivalent to the control policy obtained by solving the following reduced-state Bellman equation:

$$\theta + \widetilde{V}(\mathbf{Q}^{i}) = \min_{u(\mathbf{Q}^{i})} \Big[ \widetilde{g}(\mathbf{Q}^{i}, u(\mathbf{Q}^{i})) + \sum_{\mathbf{Q}^{j}} \widetilde{f}(\mathbf{Q}^{j} \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i}))\, \widetilde{V}(\mathbf{Q}^{j}) \Big], \qquad 1 \le i \le I \tag{11}$$

where $\widetilde{V}(\mathbf{Q}) = E[V(\boldsymbol{\chi}) \mid \mathbf{Q}]$ is the conditional potential function, $u(\mathbf{Q}) = \Omega(\mathbf{Q})$ is the collection of actions under a given QSI $\mathbf{Q}$, $\widetilde{g}(\mathbf{Q}, u(\mathbf{Q})) = E[g(\boldsymbol{\chi}, u(\boldsymbol{\chi})) \mid \mathbf{Q}]$ is the conditional per-stage reward, and $\widetilde{f}(\mathbf{Q}^{j} \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i})) = E\big[\Pr[\mathbf{Q}^{j} \mid \boldsymbol{\chi}^{i}, u(\boldsymbol{\chi}^{i})] \mid \mathbf{Q}^{i}\big]$ is the conditional average transition kernel.

Proof: Please refer to Appendix A for the proof.

A solution to (11) is still very complex due to the huge dimensionality of the state space ($I$ is exponential in $K$), and brute-force value iteration or policy iteration [19] has an exponential memory size requirement. As a result, it is desirable to obtain an online and low-complexity solution for the problem.

IV. GENERAL SOLUTION TO THE DELAY OPTIMAL PROBLEM

In this section, we derive a low-complexity (but optimal) solution by proposing an online value iteration to solve the reduced-state Bellman equation in (11). We also establish technical conditions for almost-sure convergence of the online value iteration.

A. Online Value Iteration via Stochastic Approximation

We propose an online sample-path-based iterative learning algorithm to estimate the performance potential and the control policy.
Define a vector mapping $\mathbf{T} : \mathbb{R}^{I} \to \mathbb{R}^{I}$ with the $i$-th component mapping ($1 \le i \le I$) given by

$$T_i(\widetilde{\mathbf{V}}) = \min_{u(\mathbf{Q}^{i})} \Big[ \widetilde{g}(\mathbf{Q}^{i}, u(\mathbf{Q}^{i})) + \sum_{\mathbf{Q}^{j}} \widetilde{f}(\mathbf{Q}^{j} \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i}))\, \widetilde{V}(\mathbf{Q}^{j}) \Big] \tag{12}$$

Since the potential is unique up to a constant, we could set $T_i(\widetilde{\mathbf{V}}) - \widetilde{V}(\mathbf{Q}^{i}) = J_{\beta}$ for some reference state $\mathbf{Q}^{i}$ ($1 \le i \le I$). Let $t \in \{0, 1, 2, \ldots\}$ be the slot index and $\{\mathbf{Q}(0), \cdots, \mathbf{Q}(t), \cdots\}$ be the sample path, i.e. the corresponding realizations of the system states. To perform the online value iteration, we divide the sample path into regenerative periods. A regenerative period is defined as the minimum interval in which each $\mathbf{Q}$ state is visited at least once. Let $l_k(i)$ and $\widehat{\mathbf{V}}_k$ be the number of times that $\mathbf{Q}^{i}$ is visited and the estimated performance potential in the $k$-th regenerative period, respectively. Let $n_0 = 0$ and $n_{k+1} = \min\{t + 1 : t > n_k,\ \min_i l_k(i) = 1\}$ for $k \ge 0$. Then the sample path in the $k$-th regenerative period is $\{\mathbf{Q}(n_k), \cdots, \mathbf{Q}(n_{k+1} - 1)\}$. At the beginning of the $k$-th regenerative period ($n_k \le t \le n_{k+1} - 1$), initialize the following dummy variables as $S_{\widehat{g}}^{k}(i) = 0$, $S_{\widehat{V}}^{k}(i) = 0$ and $l_k(i) = 0$. Within the $k$-th regenerative period, we adopt policy $\Omega_k$.
After observing the queue state $\mathbf{Q}(t+1)$ at the end of the $t$-th slot, update the following metrics of the visited queue state $\mathbf{Q}(t)$ according to

$$\begin{cases} S_{\widehat{g}}^{k}(i) = S_{\widehat{g}}^{k}(i) + g\big(\boldsymbol{\chi}(t), \Omega_k(\boldsymbol{\chi}(t))\big) \\ S_{\widehat{V}}^{k}(i) = S_{\widehat{V}}^{k}(i) + \widehat{V}_k\big(\mathbf{Q}(t+1)\big) \\ l_k(i) = l_k(i) + 1 \end{cases} \qquad \text{if } \mathbf{Q}^{i} = \mathbf{Q}(t) \tag{13}$$

At the end of the $k$-th regenerative period, using the stochastic approximation algorithm [17], we update the estimated potential for the $(k+1)$-th regenerative period, which is

$$\widehat{V}_{k+1}(\mathbf{Q}^{i}) = \widehat{V}_k(\mathbf{Q}^{i}) + \epsilon_k Y_k(\mathbf{Q}^{i}), \qquad 1 \le i \le I \tag{14}$$

where

$$Y_k(\mathbf{Q}^{i}) = \frac{S_{\widehat{g}}^{k}(i)}{l_k(i)} - \left( \frac{S_{\widehat{g}}^{k}(I)}{l_k(I)} + \frac{S_{\widehat{V}}^{k}(I)}{l_k(I)} - \widehat{V}_k(\mathbf{Q}^{I}) \right) + \frac{S_{\widehat{V}}^{k}(i)}{l_k(i)} - \widehat{V}_k(\mathbf{Q}^{i}),$$

$\epsilon_k$ is the step size of the stochastic approximation algorithm and $\mathbf{Q}^{I}$ is the reference state$^{7}$. Accordingly, we update the policy for the $(k+1)$-th regenerative period, which is given by

$$\Omega_{k+1} = \arg\min \mathbf{T}(\widehat{\mathbf{V}}_{k+1}) \tag{15}$$

Therefore, we could construct an online value iteration algorithm as below.

Algorithm 1: Online Value Iteration
· Step 1 (Initialization): Set $t = 0$, and start the system at an initial state $\mathbf{Q}(0)$. Set $k = 0$, and initialize the potential $\widehat{\mathbf{V}}_0$ and the policy $\Omega_0 = \arg\min \mathbf{T}(\widehat{\mathbf{V}}_0)$ in the 0-th regenerative period.
· Step 2 (Online Potential Estimation): At the beginning of the $k$-th regenerative period, set $S_{\widehat{g}}^{k}(i) = 0$, $S_{\widehat{V}}^{k}(i) = 0$ and $l_k(i) = 0\ \forall i$. Run the system with policy $\Omega_k$ to $n_{k+1} - 1$ and accumulate the information of the visited $\mathbf{Q}$ from slot to slot according to (13). At $n_{k+1} - 1$, update the estimated potential $\widehat{\mathbf{V}}_{k+1}$ for the $(k+1)$-th regenerative period according to (14).
· Step 3 (Online Policy Improvement): Update the policy $\Omega_{k+1}$ for the $(k+1)$-th regenerative period according to (15).
· Step 4 (Termination): If $\|\widehat{\mathbf{V}}_{k+1} - \widehat{\mathbf{V}}_k\| < \delta_v$, stop; otherwise, set $k := k + 1$ and go to Step 2.

$^{7}$ Without loss of generality, we set the state in which all $K$ buffers are empty as $\mathbf{Q}^{I}$ and initialize $\widehat{V}_0(\mathbf{Q}^{I}) = 0$.

Fig. 2 illustrates the online value iteration algorithm with an example.

[Figure: timeline of the slots $n_k, n_k + 1, \ldots, n_{k+1} - 1$ of the $k$-th regenerative period, showing the per-slot updates of $l_k(i)$ and $S_{\widehat{V}}^{k}(i)$ under policy $\Omega_k = \arg\min \mathbf{T}(\widehat{\mathbf{V}}_k)$, followed by the switch to $\Omega_{k+1} = \arg\min \mathbf{T}(\widehat{\mathbf{V}}_{k+1})$ at $n_{k+1}$.]
Fig. 2. Illustration of the online sample-path-based potential learning algorithm. $K = 2$, $N_Q = 1$, $I = (N_Q + 1)^2 = 4$; the four joint queue states are 00, 01, 10, 11, which are denoted as 0, 1, 2, 3 for simplicity. The sample path in the $k$-th regenerative period ($n_k \le t \le n_{k+1} - 1$) is $\{3, 1, 2, 1, 0, 1\}$. At the beginning of the $n_k$-th slot, set $S_{\widehat{g}}^{k}(i) = 0$, $S_{\widehat{V}}^{k}(i) = 0$ and $l_k(i) = 0$. In the $k$-th regenerative period, adopt policy $\Omega_k$ and update $S_{\widehat{g}}^{k}(i)$, $S_{\widehat{V}}^{k}(i)$ and $l_k(i)$ according to (13) from slot to slot. At the end of the $(n_{k+1} - 1)$-th slot, update the potential $\widehat{V}_{k+1}(\mathbf{Q}^{i})$ and the policy $\Omega_{k+1}$ for the $(k+1)$-th regenerative period according to (14) and (15).

Remark 1 (Comparison to the deterministic NUM): In conventional iterative solutions for deterministic NUM [15], the iterative updates (with message exchange)$^{8}$ are performed within the CSI coherence time, and hence this limits the number of iterations and the performance. However, in the proposed online algorithm, the updates in the iteration steps evolve on the same time scale as the CSI and QSI.
Hence, the algorithm could converge to a better solution because the number of iterations is no longer limited by the coherence time of the CSI.

B. Convergence Analysis

In this part, we establish the almost-sure convergence of the online value iteration in (14). Assume the sequence of step sizes $\{\epsilon_k\}$ is chosen such that it satisfies the following stepsize conditions:

$$\sum_{k} \epsilon_k = \infty, \qquad \epsilon_k \ge 0, \qquad \epsilon_k \to 0, \qquad \sum_{k} \epsilon_k^{2} < \infty \tag{16}$$

Let $E_k$ and $\Pr_k$ denote the expectation and probability conditioned on the $\sigma$-algebra $\mathcal{F}_k$ generated by $\{\widehat{\mathbf{V}}_0, \mathbf{Y}_i, i < k\}$. Define $\delta M_k(\mathbf{Q}^{i}) \triangleq Y_k(\mathbf{Q}^{i}) - E_k[Y_k(\mathbf{Q}^{i})]$ and

$$\mathbf{A}_{k-1} \triangleq \mathbf{P}_{\Omega_k} \epsilon_{k-1} + (1 - \epsilon_{k-1}) \mathbf{I}, \qquad \mathbf{B}_{k-1} \triangleq \mathbf{P}_{\Omega_{k-1}} \epsilon_{k-1} + (1 - \epsilon_{k-1}) \mathbf{I}$$

where $\mathbf{P}_{\Omega_k}$ and $\mathbf{I}$ are the $I \times I$ transition matrix under policy $\Omega_k$ and the identity matrix, respectively. The convergence of the online value iteration algorithm is summarized in the following theorem.

$^{8}$ Since the iterations within a CSI coherence time involve explicit message passing, there is processing and signaling overhead per iteration, and this limits the total number of iterations within a CSI coherence time.

Theorem 1 (Convergence of Online Value Iteration): Assume $\delta M_k$ is bounded $\forall k$ and the sequence of step sizes $\{\epsilon_k\}$ satisfies the conditions in (16). Assume that for every set of control policies $\Omega_0, \cdots, \Omega_m$, there exist a $\delta_m = O(\epsilon_m) > 0$ and some positive integer $m$ such that

$$[\mathbf{A}_m \cdots \mathbf{A}_1]_{iI} \ge \delta_m, \qquad [\mathbf{B}_m \cdots \mathbf{B}_1]_{iI} \ge \delta_m, \qquad 1 \le i \le I \tag{17}$$

where $[\cdot]_{iI}$ denotes the element in the $i$-th row and $I$-th column of the corresponding $I \times I$ matrix and $\mathbf{e} = [1, \cdots, 1]^{T}$ is the $I \times 1$ all-ones vector.
For an arbitrary initial potential vector $\widehat{\mathbf{V}}_0$, we have $\lim_{k \to \infty} \widehat{\mathbf{V}}_k = \widehat{\mathbf{V}}_{\infty}$ w.p.1, where the steady state potential $\widehat{\mathbf{V}}_{\infty}$ satisfies:

$$\big( T_I(\widehat{\mathbf{V}}_{\infty}) - \widehat{V}_{\infty}(\mathbf{Q}^{I}) \big)\, \mathbf{e} + \widehat{\mathbf{V}}_{\infty} = \mathbf{T}(\widehat{\mathbf{V}}_{\infty}) \tag{18}$$

Furthermore, the optimal reward of the delay-optimal problem is $J_{\beta}^{*} = T_I(\widehat{\mathbf{V}}_{\infty}) - \widehat{V}_{\infty}(\mathbf{Q}^{I})$ and the optimal stationary policy is $\Omega^{*} = \arg\min \mathbf{T}(\widehat{\mathbf{V}}_{\infty})$.

Proof: Please refer to Appendix B.

Remark 2: Note that $\mathbf{A}_k$ and $\mathbf{B}_k$ are related to an equivalent transition matrix of the underlying Markov chain. Condition (17) simply means that state $\mathbf{Q}^{I}$ is accessible from all the $\mathbf{Q}$ states after some finite number of transition steps. This is a very mild condition and will be satisfied in most of the cases of interest.

Remark 3: Note that (18) is equivalent to the reduced-state Bellman equation in (11). As a result, the converged solution of (18) corresponds to the solution of (11).

V. APPLICATION TO OFDMA SYSTEMS WITH POISSON ARRIVAL

In this section, we illustrate the usage of the online value iteration to minimize the average weighted delay of OFDMA systems under Poisson packet arrivals and exponentially distributed packet sizes.

Assumption 3: The arrival process $A_k(t)$ is i.i.d. over scheduling slots following a Poisson distribution with average arrival rate $E[A_k] = \lambda_k$. The random packet size $N_k(t)$ is i.i.d. over scheduling slots following an exponential distribution with mean packet size $\overline{N}_k$.

Assumption 4: The slot duration $\tau$ is sufficiently small compared with the average packet interarrival time as well as the conditional average packet service time$^{9}$, i.e. $\lambda_k \tau \ll 1$ and $\mu_k(\boldsymbol{\chi}) \tau \ll 1$.

Conditioned on the system state $\boldsymbol{\chi}$, the conditional mean departure rate of user $k$ is given by $\mu_k(\boldsymbol{\chi}) = E[R_k(\boldsymbol{\chi}) / N_k \mid \boldsymbol{\chi}] = R_k(\boldsymbol{\chi}) / \overline{N}_k$.
Thus, the conditional probability (conditioned on the current system state realization $\boldsymbol{\chi}(t)$) of a packet departure event at the $t$-th slot is given by

$$\Pr\left[ \frac{N_k(t)}{R_k(t)} < \tau \,\Big|\, \boldsymbol{\chi}(t), \Omega(\boldsymbol{\chi}(t)) \right] = \Pr\left[ \frac{N_k(t)}{\overline{N}_k} < \mu_k(\boldsymbol{\chi}(t))\, \tau \right] = 1 - \exp\big(-\mu_k(\boldsymbol{\chi}(t))\, \tau\big) \approx \mu_k(\boldsymbol{\chi}(t))\, \tau$$

where the "$\approx$" is due to Assumption 4. Note that under Assumption 4, the probabilities of the simultaneous arrival of two or more packets, the simultaneous departure of two or more packets (from the same queue or different queues), and a simultaneous arrival and departure in a slot are $O\big((\lambda_k \tau)^2\big)$, $O\big((\mu_k(\boldsymbol{\chi}(t)) \tau)^2\big)$ and $O\big(\lambda_k \tau\, \mu_k(\boldsymbol{\chi}(t)) \tau\big)$ respectively, which are asymptotically negligible.

$^{9}$ This is a mild assumption which could be justified in many applications. For example, in WiMAX, a frame duration is around 2 ms while the target queueing delay for applications (such as video streaming) is around 200 ms or more.

Therefore, the vector queue dynamics are given by

$$\begin{aligned} \widetilde{f}\big[(Q_1^{i}, \cdots, (Q_k^{i} + 1) \wedge N_Q, \cdots, Q_K^{i}) \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i})\big] &\approx \lambda_k \tau, \quad \forall k \\ \widetilde{f}\big[(Q_1^{i}, \cdots, (Q_k^{i} - 1)^{+}, \cdots, Q_K^{i}) \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i})\big] &\approx \mu_k(\mathbf{Q}^{i})\, \tau, \quad \forall k \\ \widetilde{f}\big[\mathbf{Q}^{i} \mid \mathbf{Q}^{i}, u(\mathbf{Q}^{i})\big] &\approx 1 - \sum_{k} \big( \lambda_k \tau + \mu_k(\mathbf{Q}^{i})\, \tau \big) \end{aligned} \tag{19}$$

where $x \wedge N_Q = \min\{x, N_Q\}$, $\mathbf{Q}^{i} = (Q_1^{i}, \cdots, Q_K^{i})$, and $\mu_k(\mathbf{Q}) = E[\mu_k(\boldsymbol{\chi}) \mid \mathbf{Q}]$. In what follows, we discuss the optimal solution, and an asymptotically optimal solution with only linear complexity and memory requirement, for the OFDMA system with the conditional transition kernel given by (19).

A. Optimal Solution

Given (19), the optimization objective in the R.H.S.
of the Bellman equation in (11) becomes

$$E\left[ \gamma \sum_{k,n} p_{k,n}(\boldsymbol{\chi}^{i}) - \sum_{k=1}^{K} \Delta_k \widetilde{V}(\mathbf{Q}^{i})\, \frac{\tau}{\overline{N}_k} \left( \sum_{n=1}^{N_F} s_{k,n} \log\big(1 + p_{k,n} |H_{k,n}|^2\big) \right) \,\Bigg|\, \mathbf{Q}^{i} \right] \tag{20}$$

where $\Delta_k \widetilde{V}(\mathbf{Q}^{i}) \triangleq \widetilde{V}(Q_1^{i}, \cdots, Q_K^{i}) - \widetilde{V}(Q_1^{i}, \cdots, [Q_k^{i} - 1]^{+}, \cdots, Q_K^{i})$. Using standard optimization techniques, the closed-form solution of the policy improvement step (15) in the online value iteration algorithm is summarized below.

Lemma 2 (Closed-Form Power Control and Subcarrier Allocation of Online Policy Improvement): Under the above setup, the optimizing power control and subcarrier allocation actions in the policy improvement step (15) for given CSI $\mathbf{H}$ and QSI $\mathbf{Q}$ are given by:

$$p_{k,n}(\mathbf{H}, \mathbf{Q}) = s_{k,n}(\mathbf{H}, \mathbf{Q}) \left( \frac{\tau}{\overline{N}_k} \frac{\Delta_k \widetilde{V}(\mathbf{Q})}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^{+} \tag{21}$$

$$s_{k,n}(\mathbf{H}, \mathbf{Q}) = \begin{cases} 1, & \text{if } X_{k,n} = \max_{j} \{X_{j,n}\} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

where $\gamma$ satisfies (6) and

$$X_{k,n} = \frac{\tau}{\overline{N}_k} \Delta_k \widetilde{V}(\mathbf{Q}) \log\left( 1 + |H_{k,n}|^2 \left( \frac{\tau}{\overline{N}_k} \frac{\Delta_k \widetilde{V}(\mathbf{Q})}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^{+} \right) - \gamma \left( \frac{\tau}{\overline{N}_k} \frac{\Delta_k \widetilde{V}(\mathbf{Q})}{\gamma} - \frac{1}{|H_{k,n}|^2} \right)^{+}.$$

Proof: Due to the combinatorial nature of the subcarrier allocation in the OFDMA system, i.e. $s_{k,n} \in \{0,1\}$ and (4), finding the optimal solution by brute-force exhaustive search requires exponential complexity [20]. By applying the continuous relaxation technique [20], [21], the original integer programming problem can be relaxed to a convex optimization problem, and then standard convex optimization techniques can be applied. It can be shown that the original problem and the relaxed problem share the same optimal solution in general. We omit the detailed proof here due to the page limit.

Remark 4 (Structure of the Optimal Power Control and Subcarrier Allocation): The optimal power control and subcarrier allocation solutions in Lemma 2 are both functions of the CSI and the QSI, where they depend on the QSI indirectly via the potential function $\{\widetilde{V}(\mathbf{Q})\}$.
The power control solution has the form of multi-level water-filling, where the power is allocated according to the CSI across subcarriers but the water-level is adaptive to the QSI. Similarly, the subcarrier is selected according to the metric $X_{k,n}$, which depends on both the CSI and the QSI.

While the online value iteration is much simpler than the brute-force MDP solutions, it still suffers from exponentially large memory requirement and computational complexity for storing and computing the potential vector. In the next subsection, we shall exploit the birth-death dynamics and derive an asymptotically optimal solution with linear $O(KN_F)$ computational complexity and $O(K)$ memory requirement.

B. Asymptotically Optimal Solution

We first define a simplified subcarrier allocation policy below and summarize an important structural result of the Bellman equation in (11) under the simplified policy.

Definition 3 (CSI-Only Subcarrier Allocation Policy): A CSI-only subcarrier allocation policy is defined as $\tilde\Omega_s(\mathbf{H})=\big\{\tilde s_{k,n}(\mathbf{H})\in\{0,1\}\,\big|\,\sum_{k=1}^{K}\tilde s_{k,n}=1\ \forall n\big\}$.
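The closed-form actions (21) and (22) of Lemma 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `lemma2_allocation` and its argument names are hypothetical, the Lagrange multiplier $\gamma$ is assumed to be given (in the paper it is fixed by the average power constraint), all channel gains are assumed strictly positive, and ties in the metric $X_{k,n}$ are broken by taking the first maximizer.

```python
import numpy as np

def lemma2_allocation(H_gain, delta_V, tau, N_bar, gamma):
    """Sketch of (21)-(22): multi-level water-filling with metric-based subcarrier selection.

    H_gain  : (K, N_F) array of channel gains |H_{k,n}|^2 (assumed > 0)
    delta_V : (K,) potential increments Delta_k V(Q) for the current QSI
    tau     : slot duration
    N_bar   : (K,) mean packet sizes
    gamma   : Lagrange multiplier of the power constraint (assumed given)
    """
    H_gain = np.asarray(H_gain, dtype=float)
    # Per-user weight tau * Delta_k V / N_bar_k; the water level is weight / gamma.
    w = (tau * np.asarray(delta_V, dtype=float) / np.asarray(N_bar, dtype=float))[:, None]
    # Candidate power if user k were given subcarrier n, floored at zero.
    p_cand = np.maximum(w / gamma - 1.0 / H_gain, 0.0)
    # Selection metric X_{k,n}: weighted rate minus priced power.
    X = w * np.log1p(H_gain * p_cand) - gamma * p_cand
    # Each subcarrier goes to the user with the largest metric, if that metric is positive.
    best = np.argmax(X, axis=0)
    cols = np.arange(X.shape[1])
    positive = X[best, cols] > 0
    s = np.zeros(X.shape, dtype=int)
    s[best[positive], cols[positive]] = 1
    return s * p_cand, s

if __name__ == "__main__":
    gains = [[4.0, 1.0], [1.0, 4.0]]
    # A larger potential increment for user 0 raises that user's water level.
    p, s = lemma2_allocation(gains, delta_V=[2.0, 1.0], tau=1.0, N_bar=[1.0, 1.0], gamma=0.5)
    print(s)
    print(p)
```

The sketch makes the QSI dependence explicit: the CSI enters through `H_gain`, while the QSI enters only through `delta_V`, which scales the water level of each user separately.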
Theorem 2 (Additive Property of the Potential Function): Under the average power constraint in (3) and the CSI-only subcarrier allocation policy, the solution of the Bellman equation in (11) can be expressed in the form $\tilde V(\mathbf{Q})=\sum_k\tilde V_k(Q_k)$ and $\theta=\sum_k\theta_k$, where $\{\tilde V_k(Q_k),\theta_k\}$ is the solution of the $k$-th user's reduced Bellman equation:

$$\theta_k=\min_{u_k(Q_k)}\Big\{\tilde g_k\big(Q_k,u_k(Q_k)\big)+\lambda_k\tau\,\Delta\tilde V_k(Q_k+1)-\mu_k(Q_k)\tau\,\Delta\tilde V_k(Q_k)\Big\} \tag{23}$$

where $u_k(Q_k)=\{(\tilde p(\mathbf{H},Q_k),\tilde s(\mathbf{H})):(\mathbf{H},Q_k)\ \forall\mathbf{H}\}$ is the set of the conditional control actions, $\mu_k(Q_k)=E\big[\sum_{n=1}^{N_F}\tilde s_{k,n}(\mathbf{H})\log\big(1+\tilde p_{k,n}(\mathbf{H},Q_k)|H_{k,n}|^2\big)\,\big|\,Q_k\big]/\overline{N}_k$ is the conditional average service rate and $\Delta\tilde V_k(Q_k)=\tilde V_k(Q_k)-\tilde V_k([Q_k-1]^+)$ is the potential increment of the $k$-th queue.

Proof: Please refer to Appendix C for the proof.

As a result of Theorem 2, if we restrict the subcarrier allocation policy to the CSI-only subcarrier allocation policy, then the potential function $\{\tilde V(\mathbf{Q})\}$ of the joint QSI for the original MDP problem can be decomposed into $K$ individual potential functions $\{\tilde V_k(Q_k)\}$ of the individual QSI for the $K$ individual MDPs. This substantially simplifies the potential estimation, speeds up convergence and reduces the memory requirement. In particular, to satisfy the condition of Theorem 2, we modify the optimal subcarrier allocation solution in Lemma 2 to a CSI-only policy, which is given by

$$\tilde s_{k,n}(\mathbf{H})=\begin{cases}1, & \text{if } |H_{k,n}|^2=\max_j\{|H_{j,n}|^2\}\\ 0, & \text{otherwise}\end{cases} \tag{24}$$

While using (24) will result in some loss of optimality in the strict sense, we shall show in the following corollary that the solution is indeed asymptotically optimal.

Corollary 1 (Reduced Complexity Online Value Iteration): The following reduced complexity online value iteration algorithm has $O(KN_F)$ complexity and $O(K)$ memory requirement. It converges almost surely to a solution which is asymptotically optimal for sufficiently large $K$ (footnote 10).

Footnote 10: As we scale up $K$, we assume the total transmit power $P_0$ is sufficiently large so that $(\lambda_1,\cdots,\lambda_K)$ still remains in the stability region of the system.

Algorithm 2: Reduced Complexity Online Value Iteration

· Step 1 (Initialization): For each user $k=1:K$, initialize the potential $\hat V_k^0$ and the policy $\Omega_k^0=\arg\min T(\hat V_k^0)$ in the 0-th regenerative period. Set $t=0$, and start the OFDMA system at an initial QSI state $Q_k(0)$ for each user.

· Step 2 (Online Potential Estimation): For each user $k=1:K$ in its $l_k$-th regenerative period, set $S^{l_k}_{\hat g_k}(Q_k)=0$, $S^{l_k}_{\hat V_k}(Q_k)=0$ and $j^{l_k}_k(Q_k)=0\ \forall Q_k$. Perform power control and subcarrier allocation according to the policy $\Omega_k^{l_k}$, and accumulate the information of the visited $Q_k$ for each user $k$, if $Q_k=Q_k(t)$, according to

$$\begin{cases}
S^{l_k}_{\hat g_k}(Q_k)=S^{l_k}_{\hat g_k}(Q_k)+g_k\big(\chi_k(t),\Omega_k^{l_k}(\chi_k(t))\big)\\
S^{l_k}_{\hat V_k}(Q_k)=S^{l_k}_{\hat V_k}(Q_k)+\hat V_k^{l_k}\big(Q_k(t+1)\big)\\
j^{l_k}_k(Q_k)=j^{l_k}_k(Q_k)+1
\end{cases}$$

If the current slot corresponds to the end of any of the $K$ regenerative periods, update the corresponding estimated potential $\hat V_k^{l_k}$ for its next regenerative period according to ($0\le Q_k\le N_Q$)

$$\hat V_k^{l_k+1}(Q_k)=\hat V_k^{l_k}(Q_k)+\epsilon_{l_k}\left[\frac{S^{l_k}_{\hat g_k}(Q_k)}{j^{l_k}_k(Q_k)}+\frac{S^{l_k}_{\hat V_k}(Q_k)}{j^{l_k}_k(Q_k)}-\left(\frac{S^{l_k}_{\hat g_k}(Q^I_k)}{j^{l_k}_k(Q^I_k)}+\frac{S^{l_k}_{\hat V_k}(Q^I_k)}{j^{l_k}_k(Q^I_k)}-\hat V_k^{l_k}(Q^I_k)\right)-\hat V_k^{l_k}(Q_k)\right]$$

· Step 3 (Online Policy Improvement): For each user $k=1:K$, if the current slot corresponds to the end of any of the $K$ regenerative periods, then update the policy for its next regenerative period at the BS according to (24) and

$$\tilde p_{k,n}(\mathbf{H},Q_k)=\tilde s_{k,n}(\mathbf{H})\left(\frac{\tau}{\overline{N}_k}\frac{\Delta\hat V_k^{l_k+1}(Q_k)}{\gamma}-\frac{1}{|H_{k,n}|^2}\right)^+ \tag{25}$$

· Step 4 (Termination): If $\|\hat V_k^{l_k+1}-\hat V_k^{l_k}\|<\delta_v\ \forall k$, stop; otherwise, set $l_k:=l_k+1$ and go to Step 2.

Proof: Please refer to Appendix D for the proof.

VI. SIMULATION RESULTS AND DISCUSSIONS

In this section, we compare our proposed optimal and reduced complexity solutions (obtained by online value iteration via stochastic approximation) to the delay-optimal problem, for a system with Poisson arrivals and exponential packet sizes, with three reference baselines. Baseline 1 refers to a throughput optimal policy (footnote 11), namely the Modified Largest Weighted Delay First (M-LWDF) [22]. Baseline 2 refers to the Real Time Stochastic Primal Dual (RT-SPD) algorithm [12]. Baseline 3 refers to Round Robin Scheduling, in which different users are served in TDMA fashion with equally allocated time slots and water-filling power allocation across the subcarriers. In the simulation, we assume there are 64 subcarriers with total bandwidth 10 MHz, and the number of independent subbands $N_F$ is 4. The scheduling slot duration $\tau$ is 5 ms. The buffer size $N_Q$ is 10. The average packet arrival rate $\lambda_k$ of the Poisson process is 20 packets/s.

Footnote 11: A throughput optimal policy is one that stabilizes the queue whenever the arrival rate vector falls within the stability region.

Fig. 3. Average delay per queue versus SNR. The number of users $K=2$, the buffer size $N_Q=10$, the mean packet size $\overline{N}_k=305.2$ Kbyte/pck, the average arrival rate $\lambda_k=20$ pck/s, the queue weight $\beta_1=\beta_2=1$. The packet drop rate of the proposed schemes is 5%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 5%, 5% and 6% respectively.

Fig. 4. Average weighted delay versus SNR. The number of users $K=2$, the buffer size $N_Q=10$, the mean packet size $\overline{N}_k=305.2$ Kbyte/pck, the average arrival rate $\lambda_k=20$ pck/s, the queue weights $\beta_1=1$, $\beta_2=4$. The packet drop rate of the proposed schemes is 4%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 4%, 6% and 6% respectively.

Fig. 3 illustrates the average delay per queue versus SNR for 2 users with equal queue weights. It can be observed that both the optimal solution and the reduced complexity solution have significant gain compared with the three baselines (e.g. more than 5 dB gain when the average delay per queue is less than 9 packets). In addition, the delay performance of the reduced complexity solution, which is asymptotically optimal for a large number of users, is very close to the performance of the optimal solution even in the 2-user case. Fig. 4 depicts the average weighted delay versus SNR for 2 heterogeneous users with different queue weights. The average weighted delay of the reduced complexity solution is close to that of the optimal solution as well. Therefore, the proposed reduced complexity solution, with linear $O(KN_F)$ computational complexity, $O(K)$ memory requirement and near-optimal performance, is of great practical significance.

Fig. 5 illustrates the average delay per queue of the reduced complexity solution versus the number of users with equal queue weights at a transmit SNR = 10 dB. The reduced complexity solution has a clear delay gain over the three baselines across the whole range of user numbers considered.

Fig. 5. Average delay per queue versus the number of users. The buffer size $N_Q=10$, the mean packet size $\overline{N}_k=39$ Kbyte/pck, the average arrival rate $\lambda_k=20$ pck/s, the queue weight $\beta_k=1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 4%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 4%, 4% and 6% respectively.

Fig. 6. Illustration of the convergence property. Average potential function versus the number of iterations (time duration in a communication session). The number of users $K=8$, the buffer size $N_Q=10$, the mean packet size $\overline{N}_k=39$ Kbyte/pck, the average arrival rate $\lambda_k=20$ pck/s, the queue weight $\beta_k=1$ at a transmit SNR = 10 dB. The packet drop rate of the proposed scheme is 1%, while the packet drop rates of Baseline 1 (M-LWDF), Baseline 2 (RT-SPD) and Baseline 3 (Round Robin) are 1%, 1% and 4% respectively.

Fig. 6 illustrates the convergence property of the proposed reduced complexity algorithm. We plot the average potential function of 8 users versus the number of iterations at a transmit SNR = 10 dB. It can be seen that the reduced complexity algorithm converges quite fast.
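To get a feel for the per-user potential update in Step 2 of Algorithm 2, the following toy sketch runs it for a single birth-death queue. It is a simplified illustration under several assumptions that are not taken from the paper: the policy is held fixed (no Step 3), so the per-slot departure probabilities `mu_tau` and per-stage costs `cost` are given inputs; fixed-length blocks of slots stand in for the regenerative periods; the stepsize is $\epsilon_l = 1/(l+1)$; and the reference state is the empty queue. The function name `estimate_user_potential` is hypothetical.

```python
import numpy as np

def estimate_user_potential(lam_tau, mu_tau, cost, n_periods=200, period_len=500, seed=0):
    """Toy stochastic-approximation estimate of one user's potential under a fixed policy.

    lam_tau : per-slot arrival probability (lambda_k * tau)
    mu_tau  : per-slot departure probability for each queue length 0..N_Q
    cost    : per-stage cost for each queue length 0..N_Q
    """
    rng = np.random.default_rng(seed)
    n_states = len(cost)
    V = np.zeros(n_states)          # estimated potential, reference state 0 pinned at 0
    q = 0
    for l in range(n_periods):
        eps = 1.0 / (l + 1)         # assumed stepsize: sum diverges, sum of squares converges
        S_g = np.zeros(n_states)    # accumulated per-stage cost per visited state
        S_V = np.zeros(n_states)    # accumulated potential of the next state
        j = np.zeros(n_states)      # visit counts
        for _ in range(period_len):
            u = rng.random()
            if u < lam_tau:
                q_next = min(q + 1, n_states - 1)   # arrival, truncated at the buffer size
            elif u < lam_tau + mu_tau[q]:
                q_next = max(q - 1, 0)              # departure
            else:
                q_next = q
            S_g[q] += cost[q]
            S_V[q] += V[q_next]
            j[q] += 1
            q = q_next
        if j[0] == 0:
            continue                # reference state not visited in this block: skip update
        visited = j > 0
        T = np.zeros(n_states)
        T[visited] = (S_g[visited] + S_V[visited]) / j[visited]
        # Relative update: subtract the reference-state term so that V[0] never moves.
        V[visited] += eps * (T[visited] - (T[0] - V[0]) - V[visited])
    return V

if __name__ == "__main__":
    V = estimate_user_potential(0.1, [0.0, 0.2, 0.3, 0.4], [0.0, 1.0, 2.0, 3.0])
    print(V)
```

Because each user keeps only a table of $N_Q+1$ entries, running this update independently for $K$ users avoids the $(N_Q+1)^K$ joint state table of the centralized scheme.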
The average delay corresponding to the potential function at the 50-th iteration is 3.8 packets, which is much smaller than that of the baselines.

VII. SUMMARY

In this paper, we propose a low-complexity solution to the delay-optimal power and subcarrier allocation design for OFDMA systems. We model the problem as a $K$-dimensional infinite horizon average reward MDP with the control actions based on the CSI and the joint QSI. We derive the equivalent reduced state Bellman equation and propose an online stochastic value iteration solution using stochastic approximation. We prove that under some mild conditions (footnote 12), the proposed solution converges to the optimal solution almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics under Poisson arrivals, we obtain a reduced complexity decomposed solution with linear $O(KN_F)$ complexity and $O(K)$ memory requirement.

APPENDIX

APPENDIX A: PROOF OF LEMMA 1

$$\begin{aligned}
&\theta+V(\mathbf{H},\mathbf{Q}^i)\quad\forall\mathbf{H},\ 1\le i\le I\\
&=\min_{u(\mathbf{H},\mathbf{Q}^i)}\Big[g\big((\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big)+\sum_{(\mathbf{H}',\mathbf{Q}^j)}\Pr\big[(\mathbf{H}',\mathbf{Q}^j)\,\big|\,(\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big]V(\mathbf{H}',\mathbf{Q}^j)\Big]\\
&\overset{(a)}{=}\min_{u(\mathbf{H},\mathbf{Q}^i)}\Big[g\big((\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big)+\sum_{\mathbf{Q}^j}\Pr\big[\mathbf{Q}^j\,\big|\,(\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big]\Big(\sum_{\mathbf{H}'}\Pr[\mathbf{H}']V(\mathbf{H}',\mathbf{Q}^j)\Big)\Big]\\
&\overset{(b)}{=}\min_{u(\mathbf{H},\mathbf{Q}^i)}\Big[g\big((\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big)+\sum_{\mathbf{Q}^j}\Pr\big[\mathbf{Q}^j\,\big|\,(\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big]\tilde V(\mathbf{Q}^j)\Big]\\
&\overset{(c)}{\Rightarrow}\ \theta+\tilde V(\mathbf{Q}^i)\quad\forall\,1\le i\le I\\
&=E\Big[\min_{u(\mathbf{H},\mathbf{Q}^i)}\Big[g\big((\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big)+\sum_{\mathbf{Q}^j}\Pr\big[\mathbf{Q}^j\,\big|\,(\mathbf{H},\mathbf{Q}^i),u(\mathbf{H},\mathbf{Q}^i)\big]\Big(\sum_{\mathbf{H}'}\Pr[\mathbf{H}']V(\mathbf{H}',\mathbf{Q}^j)\Big)\Big]\,\Big|\,\mathbf{Q}^i\Big]\\
&\overset{(d)}{=}\min_{u(\mathbf{Q}^i)}\ \tilde g\big(\mathbf{Q}^i,u(\mathbf{Q}^i)\big)+\sum_{\mathbf{Q}^j}\tilde f\big(\mathbf{Q}^j\,\big|\,\mathbf{Q}^i,u(\mathbf{Q}^i)\big)\tilde V(\mathbf{Q}^j)
\end{aligned} \tag{26}$$

where (a) is due to (9), (b) is due to the definition $\tilde V(\mathbf{Q})\triangleq E[V(\boldsymbol{\chi})\,|\,\mathbf{Q}]$, (c) is obtained by taking the conditional expectation (conditioned on $\mathbf{Q}^i$) on both sides of (10) and (d) is due to the definition of "conditional actions"
and $\tilde g(\mathbf{Q},u(\mathbf{Q}))=E[g(\boldsymbol{\chi},u(\boldsymbol{\chi}))\,|\,\mathbf{Q}]$, $\tilde f(\mathbf{Q}^j\,|\,\mathbf{Q}^i,u(\mathbf{Q}^i))=E\big[\Pr[\mathbf{Q}^j\,|\,\boldsymbol{\chi}^i,u(\boldsymbol{\chi}^i)]\,\big|\,\mathbf{Q}^i\big]$.

Footnote 12: The mild conditions refer to the two conditions in Theorem 1. One is the stepsize condition. The other is the condition on accessibility of the Markov chain, which is easily satisfied in most cases of interest.

APPENDIX B: PROOF OF THEOREM 1

We shall first show the convergence of the martingale noise in (14). Let

$$q_k(\mathbf{Q}^i)\triangleq E_k[Y_k(\mathbf{Q}^i)]=T_i\big(\hat V_k(\mathbf{Q}^i)\big)-\hat V_k(\mathbf{Q}^i)-\Big(T_I\big(\hat V_k(\mathbf{Q}^I)\big)-\hat V_k(\mathbf{Q}^I)\Big) \tag{27}$$

$Y_k(\mathbf{Q}^i)$ is the noise-corrupted observation of $q_k(\mathbf{Q}^i)$. The martingale difference noise is

$$\begin{aligned}
\delta M_k(\mathbf{Q}^i)&=Y_k(\mathbf{Q}^i)-q_k(\mathbf{Q}^i)\\
&=\left(\frac{S^{k}_{\hat g}(i)}{l^{k}(i)}-\tilde g(\mathbf{Q}^i)\right)-\left(\frac{S^{k}_{\hat g}(I)}{l^{k}(I)}-\tilde g(\mathbf{Q}^I)\right)\\
&\quad+\left(\frac{S^{k+1}_{\hat V}(i)}{l^{k+1}(i)}-\sum_{\mathbf{Q}^j}\tilde f\big(\mathbf{Q}^j\,|\,\mathbf{Q}^i,u(\mathbf{Q}^i)\big)\hat V_k(\mathbf{Q}^j)\right)-\left(\frac{S^{k+1}_{\hat V}(I)}{l^{k+1}(I)}-\sum_{\mathbf{Q}^j}\tilde f\big(\mathbf{Q}^j\,|\,\mathbf{Q}^I,u(\mathbf{Q}^I)\big)\hat V_k(\mathbf{Q}^j)\right)
\end{aligned}$$

with the property that $E_k[\delta M_k(\mathbf{Q}^i)]=0$ and $E[\delta M_k(\mathbf{Q}^i)\,\delta M_{k'}(\mathbf{Q}^i)]=0$, $\forall k\ne k'$. For some $j$, define $M_k(\mathbf{Q}^i)=\sum_{l=j}^{k}\epsilon_l\,\delta M_l(\mathbf{Q}^i)$. Thus, from (14), we have

$$\hat V_{k+1}(\mathbf{Q}^i)=\hat V_k(\mathbf{Q}^i)+\epsilon_k\big(q_k(\mathbf{Q}^i)+\delta M_k(\mathbf{Q}^i)\big)=\hat V_j(\mathbf{Q}^i)+\sum_{l=j}^{k}\epsilon_l\,q_l(\mathbf{Q}^i)+M_k(\mathbf{Q}^i) \tag{28}$$

Since $E_k[M_k(\mathbf{Q}^i)]=M_{k-1}(\mathbf{Q}^i)$, $M_k(\mathbf{Q}^i)$ is a martingale sequence. By the martingale inequality, we have

$$\Pr{}_j\Big\{\sup_{j\le l\le k}|M_l|\ge\lambda\Big\}\le\frac{E_j\big[|M_k(\mathbf{Q}^i)|^2\big]}{\lambda^2}.$$

By the boundedness assumption and the property of the martingale difference noise, as well as the condition on the stepsize sequence, we have, for a finite constant $M$,

$$E_j\big[|M_k(\mathbf{Q}^i)|^2\big]=E_j\Big[\Big|\sum_{l=j}^{k}\epsilon_l\,\delta M_l(\mathbf{Q}^i)\Big|^2\Big]=\sum_{l=j}^{k}E_j\big[\epsilon_l^2\,\delta M_l^2(\mathbf{Q}^i)\big]\le M\sum_{l=j}^{k}\epsilon_l^2\ \Rightarrow\ \lim_{j\to\infty}\Pr{}_j\Big\{\sup_{j\le l\le k}|M_l(\mathbf{Q}^i)|\ge\lambda\Big\}=0.$$
Thus, as $j\to\infty$, (28) goes to $\hat V_{k+1}(\mathbf{Q}^i)=\hat V_j(\mathbf{Q}^i)+\sum_{l=j}^{k}\epsilon_l\,q_l(\mathbf{Q}^i)$ with probability 1, the vector form of which is given by

$$\hat{\mathbf{V}}_{k+1}=\hat{\mathbf{V}}_j+\sum_{l=j}^{k}\epsilon_l\,\mathbf{q}_l=\hat{\mathbf{V}}_j+\sum_{l=j}^{k}\epsilon_l\Big(T(\hat{\mathbf{V}}_l)-\hat{\mathbf{V}}_l-\big(T_I(\hat{\mathbf{V}}_l)-\hat V_l(\mathbf{Q}^I)\big)\mathbf{e}\Big) \tag{29}$$

Next, we shall show the convergence of (29) after the martingale noise is averaged out. In the following proof, we use $i$ instead of $\mathbf{Q}^i$ for simplicity. Let $\Omega_k(i)$ denote the optimal control action attaining the minimum in $T_i(\hat{\mathbf{V}}_k)$. Let $\tilde{\mathbf{g}}_{\Omega_k}$ and $\mathbf{P}_{\Omega_k}$ denote the conditional per-stage reward vector and the conditional average transition probability matrix under the optimal control policy $\Omega_k$. Denote $w_k=T_I(\hat{\mathbf{V}}_k)-\hat V_k(I)$. We have

$$\begin{aligned}
\mathbf{q}_k&=\tilde{\mathbf{g}}_{\Omega_k}+\mathbf{P}_{\Omega_k}\hat{\mathbf{V}}_k-\hat{\mathbf{V}}_k-w_k\mathbf{e}\ \le\ \tilde{\mathbf{g}}_{\Omega_{k-1}}+\mathbf{P}_{\Omega_{k-1}}\hat{\mathbf{V}}_k-\hat{\mathbf{V}}_k-w_k\mathbf{e}\\
\mathbf{q}_{k-1}&=\tilde{\mathbf{g}}_{\Omega_{k-1}}+\mathbf{P}_{\Omega_{k-1}}\hat{\mathbf{V}}_{k-1}-\hat{\mathbf{V}}_{k-1}-w_{k-1}\mathbf{e}\ \le\ \tilde{\mathbf{g}}_{\Omega_k}+\mathbf{P}_{\Omega_k}\hat{\mathbf{V}}_{k-1}-\hat{\mathbf{V}}_{k-1}-w_{k-1}\mathbf{e}\\
\Rightarrow\ &\mathbf{A}_{k-1}\mathbf{q}_{k-1}-(w_k-w_{k-1})\mathbf{e}\ \le\ \mathbf{q}_k\ \le\ \mathbf{B}_{k-1}\mathbf{q}_{k-1}-(w_k-w_{k-1})\mathbf{e},\quad\forall k\ge1\\
\overset{\text{by iterating}}{\Rightarrow}\ &\mathbf{A}_{k-1}\cdots\mathbf{A}_{k-m}\mathbf{q}_{k-m}-(w_k-w_{k-m})\mathbf{e}\ \le\ \mathbf{q}_k\ \le\ \mathbf{B}_{k-1}\cdots\mathbf{B}_{k-m}\mathbf{q}_{k-m}-(w_k-w_{k-m})\mathbf{e}
\end{aligned}$$

From (27), we have $q_k(I)=T_I(\hat V_k(I))-\hat V_k(I)-\big(T_I(\hat V_k(I))-\hat V_k(I)\big)=0$ for all $k$. By the assumption (17), we have

$$\begin{aligned}
&(1-\delta)\min_{i'}q_{k-m}(i')-(w_k-w_{k-m})\ \le\ q_k(i)\ \le\ (1-\delta)\max_{i'}q_{k-m}(i')-(w_k-w_{k-m})\quad\forall i\\
\Rightarrow\ &\begin{cases}\min_{i'}q_k(i')\ \ge\ (1-\delta)\min_{i'}q_{k-m}(i')-(w_k-w_{k-m})\\ \max_{i'}q_k(i')\ \le\ (1-\delta)\max_{i'}q_{k-m}(i')-(w_k-w_{k-m})\end{cases}\\
\Rightarrow\ &\max_{i'}q_k(i')-\min_{i'}q_k(i')\ \le\ (1-\delta)\Big(\max_{i'}q_{k-m}(i')-\min_{i'}q_{k-m}(i')\Big)\\
\Rightarrow\ &\max_{i'}q_k(i')-\min_{i'}q_k(i')\ \le\ \phi_j\prod_{l=1}^{\lfloor\frac{k-j}{m}\rfloor}(1-\delta_{j+lm})
\end{aligned}$$

where $\phi_j>0$. Since $q_k(I)=0$, we have $\max_{i'}q_k(i')\ge0$ and $\min_{i'}q_k(i')\le0$.
Thus, $\forall i$, we have $|q_k(i)|\le\max_{i'}q_k(i')-\min_{i'}q_k(i')\le\phi_j\prod_{l=1}^{\lfloor\frac{k-j}{m}\rfloor}(1-\delta_{j+lm})$. Therefore, as $k\to\infty$, $\mathbf{q}_k\to\mathbf{0}$, i.e. $\hat{\mathbf{V}}_k$ satisfies the Bellman equation (18). By Proposition 1 in Chapter 7 of [13], $J_\beta=T_I(\hat V_k(I))-\hat V_k(I)$ is the optimal value and $\hat{\mathbf{V}}_k$ is the potential vector, which is unique up to a constant vector. However, due to the property that $q_k(I)=0\ \forall k\Rightarrow\hat V_k(I)=\hat V_0(I)\ \forall k$, we have the convergence of the potential vector $\hat{\mathbf{V}}_\infty=\lim_{k\to\infty}\hat{\mathbf{V}}_k$ and the optimal value $J_\beta^*=T_I(\hat{\mathbf{V}}_\infty)-\hat V_\infty(\mathbf{Q}^I)$ by the online value iteration via stochastic approximation algorithm. Accordingly, the optimal stationary policy is $\Omega^*=\arg\min T(\hat{\mathbf{V}}_\infty)$.

APPENDIX C: PROOF OF THEOREM 2

The solution of the Bellman equation in (11) can be obtained by offline relative value iteration [19]. Without loss of generality, we set $\mathbf{Q}^I$ as the reference state. Hence, we have the normalizing equation $\tilde V^l(\mathbf{Q}^I)=0\ \forall l$. Assume $\tilde V^l(\mathbf{Q})=\sum_{k=1}^{K}\tilde V^l_k(Q_k)\ \forall l$. At the $(l-1)$-th iteration, updating the policy according to (15) is equivalent to finding the policy $\Omega^l_p$ which minimizes the objective function (20). Under any given CSI-only subcarrier allocation policy, the optimal power actions for the $l$-th iteration which minimize (20) are given by

$$\begin{aligned}
&\tilde p^{\,l}_{k,n}(\mathbf{H},Q_k)=\tilde s^{\,l}_{k,n}(\mathbf{H})\left(\frac{\tau}{\overline{N}_k}\frac{\Delta\tilde V^{l-1}_k(Q_k)}{\gamma}-\frac{1}{|H_{k,n}|^2}\right)^+\\
\Rightarrow\ &\mu^l_k(Q_k)=E\Big[\sum_{n=1}^{N_F}\tilde s^{\,l}_{k,n}(\mathbf{H})\log\big(1+\tilde p^{\,l}_{k,n}(\mathbf{H},Q_k)|H_{k,n}|^2\big)\,\Big|\,\mathbf{Q}\Big]\Big/\overline{N}_k
\end{aligned}$$

where $\mu^l_k(Q_k)$ is the mean departure rate under $u^l_k(Q_k)=\{u^l_k(\mathbf{H},Q_k):(\mathbf{H},Q_k)\ \forall\mathbf{H}\}$ with $u^l_k(\mathbf{H},Q_k)=(\tilde p^{\,l}(\mathbf{H},Q_k),\tilde s^{\,l}(\mathbf{H}))$, and $\Delta\tilde V^l_k(Q_k)=\tilde V^l_k(Q_k)-\tilde V^l_k([Q_k-1]^+)$ is the potential increment for the $k$-th queue.
At the $l$-th iteration, we determine the potential $\tilde V^l(\mathbf{Q})$ and $\theta^l$ by solving the normalizing equation $\tilde V^l(\mathbf{Q}^I)=0$ together with the $I=(N_Q+1)^K$ Poisson equations

$$\begin{aligned}
&\theta^l+\tilde V^l(\mathbf{Q}^i)=\sum_k\tilde g_k\big(Q^i_k,u^l_k(Q^i_k)\big)+\tilde V^l(\mathbf{Q}^i)+\sum_k\lambda_k\tau\,\Delta\tilde V^l_k(Q^i_k+1)-\sum_k\mu^l_k(Q^i_k)\tau\,\Delta\tilde V^l_k(Q^i_k)\\
\Longrightarrow\ &\theta^l=\sum_k\theta^l_k,\quad\forall\,1\le i\le I
\end{aligned} \tag{30}$$

where, $\forall\,0\le Q_k\le N_Q$,

$$\theta^l_k=\tilde g_k\big(Q^i_k,u^l_k(Q^i_k)\big)+\lambda_k\tau\,\Delta\tilde V^l_k(Q^i_k+1)-\mu^l_k(Q^i_k)\tau\,\Delta\tilde V^l_k(Q^i_k) \tag{31}$$

$$\tilde g_k\big(Q^i_k,u^l_k(Q^i_k)\big)=\beta_k\frac{Q^i_k}{\lambda_k}+\gamma E\Big[\sum_n\tilde p^{\,l}_{k,n}(\mathbf{H}_n,Q^i_k)\,\Big|\,Q^i_k\Big]$$

There are $I$ joint states $\mathbf{Q}=(Q_1,\cdots,Q_K)$, but there are only $N_Q+1$ states for $Q_k$, $\forall k$. Hence, among the original $(N_Q+1)^K$ Poisson equations (30), there are only $N_Q+1$ independent Poisson equations in (31) for the $k$-th ($1\le k\le K$) queue. In addition, we set $\tilde V^l_k(0)=0\ \forall k$ as the individual normalizing equation, which also satisfies $\tilde V^l(\mathbf{Q}^I)=\sum_k\tilde V^l_k(0)=0$. Hence, in the $l$-th iteration, we can obtain $\{\tilde V^l_k(Q_k),\theta^l_k\}$ by solving the $k$-th user's reduced state Poisson equation in (31) together with its individual normalizing equation. Accordingly, $\{\tilde V^l(\mathbf{Q}),\theta^l\}$ is the solution of the original $(N_Q+1)^K$ Poisson equations (30), where $\tilde V^l(\mathbf{Q})=\sum_k\tilde V^l_k(Q_k)$ and $\theta^l=\sum_k\theta^l_k$. We continue the iterations until $\Omega^{(l+1)}_p=\Omega^{(l)}_p$ and obtain $\{\tilde V_k(Q_k),\theta_k\}$, which is the solution of the $k$-th user's reduced Bellman equation in (23). Accordingly, $\{\tilde V(\mathbf{Q}),\theta\}$ is the solution of the original $(N_Q+1)^K$ Bellman equations in (11), where $\tilde V(\mathbf{Q})=\sum_k\tilde V_k(Q_k)$ and $\theta=\sum_k\theta_k$ are the potential for the joint $\mathbf{Q}$ and the optimal average reward respectively.
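The per-user reduced Poisson equation (31), together with the normalizing equation $\tilde V_k(0)=0$, is a small linear system in the $N_Q+1$ unknowns $\theta_k,\tilde V_k(1),\ldots,\tilde V_k(N_Q)$. The sketch below solves it for one user under a fixed policy. It is an illustration rather than the authors' code: the function name `solve_user_poisson` is hypothetical, the per-slot departure probabilities `mu_tau` and per-stage costs `g` are assumed to be already computed for the given policy, and the boundary states follow the truncated birth-death kernel of Appendix D.

```python
import numpy as np

def solve_user_poisson(lam_tau, mu_tau, g):
    """Solve theta + V(q) = g(q) + sum_q' P(q, q') V(q') with V(0) = 0 for one user.

    lam_tau : per-slot arrival probability (lambda_k * tau)
    mu_tau  : per-slot departure probability for each queue length 0..N_Q
    g       : per-stage cost for each queue length 0..N_Q
    """
    g = np.asarray(g, dtype=float)
    mu_tau = np.asarray(mu_tau, dtype=float)
    n = g.size
    # Birth-death transition matrix, truncated at 0 and at the buffer size.
    P = np.zeros((n, n))
    for q in range(n):
        P[q, min(q + 1, n - 1)] += lam_tau
        P[q, max(q - 1, 0)] += mu_tau[q]
        P[q, q] += 1.0 - lam_tau - mu_tau[q]
    # Unknown vector x = (theta, V(1), ..., V(N_Q)); since V(0) = 0, its column
    # in (I - P) is replaced by the all-ones column that multiplies theta.
    A = np.eye(n) - P
    A[:, 0] = 1.0
    x = np.linalg.solve(A, g)
    theta = x[0]
    V = x.copy()
    V[0] = 0.0
    return theta, V

if __name__ == "__main__":
    theta, V = solve_user_poisson(0.1, [0.0, 0.2, 0.3, 0.4], [0.0, 1.0, 2.0, 3.0])
    print(theta, V)
```

Solving $K$ such systems of size $N_Q+1$ replaces a single system of size $(N_Q+1)^K$, which is the source of the complexity reduction that the additive property delivers.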
APPENDIX D: PROOF OF COROLLARY 1

Since $\{\tilde V_k(Q_k),\theta_k\}$ is the solution of the $k$-th user's reduced Bellman equation (23), the original MDP can be decoupled into $K$ individual MDPs with the transition kernel given by $\tilde f(Q_k+1\,|\,Q_k,u_k(Q_k))=\lambda_k\tau$, $\tilde f(Q_k-1\,|\,Q_k,u_k(Q_k))=\mu_k(Q_k)\tau$ and $\tilde f(Q_k\,|\,Q_k,u_k(Q_k))=1-\lambda_k\tau-\mu_k(Q_k)\tau$. Therefore, the online value iteration can be applied to the $K$ individual MDPs respectively. Under the same conditions as in Theorem 1 (footnote 13), almost sure convergence of the online value iteration algorithm in Corollary 1 is guaranteed.

Next, we show that the converged solution is indeed asymptotically optimal. Denote $k^*_n\triangleq\arg\max_k|H_{k,n}|^2$. For large $K$, $|H_{k^*_n,n}|^2$ grows with $\log(K)$ by extreme value theory. Because the traffic loading remains unchanged as we scale up $K$, $\max_{k,j}|\Delta_k\tilde V(\mathbf{Q})-\Delta_j\tilde V(\mathbf{Q})|=O(1)$. Hence, $X_{k^*_n,n}$ grows like $\log(\log(K))$. As $K\to\infty$, $\Pr[k^*_n=\arg\max_k X_{k,n}]\to1$. Thus the optimal subband allocation in (22) and the best-CSI subband allocation in (24) give the same result for large $K$, and (24) and (25) are asymptotically optimal. Therefore, with (24), following the proof of Theorem 2, we can prove $\sum_k\tilde V_k(Q_k)\to\tilde V^*(\mathbf{Q})$ and $\sum_k\theta_k\to\theta^*$ as $K\to\infty$, where $\tilde V^*(\mathbf{Q})$ and $\theta^*$ are the potential and the optimal average reward under the globally optimal power and subcarrier allocation given by Lemma 2.

REFERENCES

[1] K. Seong, M. Mohseni, and J. Cioffi, “Optimal resource allocation for OFDMA downlink systems,” in IEEE Int. Symp. Inform. Theory (ISIT), July 2006, pp. 1394–1398.
[2] C. Y. Wong, R. S. Cheng, K. B. Letaief, and R. D. Murch, “Multiuser OFDM with adaptive subcarrier, bit, and power allocation,” IEEE J. Select. Areas Commun., vol. 17, no. 10, pp. 1747–1758, Oct. 1999.
[3] D. Wu and R.
Negi, “Effective Capacity: A Wireless Link Model for Support of Quality of Service,” IEEE Trans. Wireless Commun., vol. 2, pp. 630–643, July 2003.
[4] D. Hui and V. Lau, “Cross-Layer Design for OFDMA Wireless Systems With Heterogeneous Delay Requirements,” IEEE Trans. Wireless Commun., vol. 6, pp. 2872–2880, Aug. 2007.
[5] J. Tang and X. Zhang, “Quality-of-Service Driven Power and Rate Adaptation over Wireless Links,” IEEE Trans. Wireless Commun., vol. 6, pp. 3059–3068, Aug. 2007.
[6] E. M. Yeh, “Multiaccess and Fading in Communication Networks,” Ph.D. dissertation, MIT, September 2001.
[7] E. M. Yeh and A. Cohen, “Throughput and delay optimal resource allocation in multiaccess fading channels,” in Proc. ISIT, June-July 2003, p. 245.
[8] L. Georgiadis, M. J. Neely, and L. Tassiulas, “Resource allocation and cross-layer control in wireless networks,” Foundations and Trends in Networking, vol. 1, no. 1, pp. 1–144, 2006.
[9] M. J. Neely, “Order optimal delay for opportunistic scheduling in multi-user wireless uplinks and downlinks,” IEEE/ACM Trans. Networking, vol. 16, no. 5, pp. 1188–1199, 2008.
[10] W. Luo and A. Ephremides, “Stability of N interacting queues in random-access systems,” IEEE Trans. Inform. Theory, vol. 45, no. 5, pp. 1579–1587, July 1999.
[11] A. Stolyar, “Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm,” Queueing Systems, vol. 50, no. 4, pp. 401–457, 2005.
[12] X. Wang, G. B. Giannakis, and A. G. Marques, “A unified approach to QoS-guaranteed scheduling for channel-adaptive wireless networks,” Proceedings of the IEEE, vol. 95, no. 12, pp. 2410–2431, 2007.
[13] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models. New Jersey, NJ, USA: Prentice Hall, 1987.
[14] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach, 1st ed. New York: Springer, 2007.
[15] D. P. Palomar and M. Chiang, “A tutorial on decomposition methods for network utility maximization,” IEEE J. Select. Areas Commun., vol. 24, no. 8, pp. 1439–1451, Aug. 2006.
[16] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, 1st ed. Berlin: Birkhäuser Verlag Basel, 1992.
[17] H. J. Kushner and G. G. Yin, Stochastic Approximation and Optimization of Random Systems, 2nd ed. New York: Springer, 2003.
[18] L. Kleinrock, Queueing Systems. Volume 1: Theory, 1st ed. New York: Wiley Interscience, 1975, ch. 2.
[19] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Massachusetts: Athena Scientific, 2007.
[20] L. M. C. Hoo, B. Halder, J. Tellado, and J. M. Cioffi, “Multiuser transmit optimization for multicarrier broadcast channels,” IEEE Trans. Commun., vol. 52, no. 6, June 2004.
[21] W. Yu and J. M. Cioffi, “FDMA capacity of Gaussian multiple-access channels with ISI,” IEEE Trans. Commun., vol. 50, no. 1, Jan. 2002.
[22] M. Andrews, K. Kumaran, K. Ramanan, A. Stolyar, P. Whiting, and R. Vijayakumar, “Providing quality of service over a shared wireless link,” IEEE Communications Magazine, vol. 39, no. 2, pp. 150–154, Feb. 2001.

Footnote 13: It can be easily verified that all states in the birth-death chain are accessible.

Vincent K. N. Lau obtained his B.Eng (Distinction 1st Hons) from the University of Hong Kong in 1992 and his Ph.D. from Cambridge University in 1997. He was with PCCW as a system engineer from 1992 to 1995 and with Bell Labs - Lucent Technologies as a member of technical staff from 1997 to 2003. He then joined the Department of ECE, HKUST as Associate Professor.
His current research interests include robust and delay-sensitive cross-layer scheduling, cooperative and cognitive communications, as well as stochastic approximation and Markov Decision Processes.

Ying Cui received her B.Eng degree (first class honor) in Electronic and Information Engineering from Xi'an Jiaotong University, Xi'an, China in 2007. She is currently a Ph.D. candidate in the Department of Electronic and Computer Engineering, the Hong Kong University of Science and Technology (HKUST). Her current research interests include cooperative and cognitive communications, delay-sensitive cross-layer scheduling, as well as stochastic approximation and Markov Decision Processes.
