On the Optimal Scheduling of Independent, Symmetric and Time-Sensitive Tasks
Authors: Fabio Iannello, Osvaldo Simeone, Umberto Spagnolini
Abstract: Consider a discrete-time system in which a centralized controller (CC) is tasked with assigning at each time interval (or slot) $K$ resources (or servers) to $K$ out of $M \geq K$ nodes. When assigned a server, a node can execute a task. The tasks are independently generated at each node by stochastically symmetric and memoryless random processes and stored in a finite-capacity task queue. Moreover, they are time-sensitive, in the sense that within each slot there is a non-zero probability that a task expires before being scheduled. The scheduling problem is tackled with the aim of maximizing the number of tasks completed over time (or the task-throughput) under the assumption that the CC has no direct access to the state of the task queues. The scheduling decisions at the CC are based on the outcomes of previous scheduling commands and on the known statistical properties of the task generation and expiration processes. Based on a Markovian modeling of the task generation and expiration processes, the CC scheduling problem is formulated as a partially observable Markov decision process (POMDP) that can be cast into the framework of restless multi-armed bandit (RMAB) problems. When the task queues are of capacity one, the optimality of a myopic (or greedy) policy (MP) is proved. It is also demonstrated that the MP coincides with the Whittle index policy. For task queues of arbitrary capacity, instead, the myopic policy is generally suboptimal, and its performance is compared with an upper bound obtained through a relaxation of the original problem. Overall, the settings in this paper provide a rare example where a RMAB problem can be explicitly solved, and in which the Whittle index policy is proved to be optimal.

1,3 F. Iannello (corresponding author: iannello@elet.polimi.it, phone: +390223993604, fax: +390223993413) and U. Spagnolini (spagnoli@elet.polimi.it) are with Politecnico di Milano, P.zza L. da Vinci 32, 20133 Milan, Italy.

1,2 F. Iannello and O. Simeone (osvaldo.simeone@njit.edu) are with the CWCSPR, New Jersey Institute of Technology, U. Heights, Newark, NJ 07102, USA.

I. INTRODUCTION AND SYSTEM MODEL

The problem of scheduling concurrent tasks under resource constraints finds applications in a variety of fields, including communication networks [1], distributed computing [2] and virtual machine scenarios [3]. In this paper we consider a specific instance of this general problem in which a centralized controller (CC) is tasked with assigning, at each time interval (or slot), $K$ resources, referred to as servers, to $K$ out of $M \geq K$ nodes, as shown in Fig. 1. A server can complete a single task per slot and can be assigned to one node per slot. The tasks are generated at the $M$ nodes by stochastically symmetric, independent and memoryless random processes. The tasks are stored by each node in a finite-capacity task queue, and they are time-sensitive in the sense that at each slot there is a non-zero probability that a task expires before being completed successfully. It is assumed that the CC has no direct access to the node queues, and thus it is not fully informed of their actual states. Instead, the scheduling decision is based on the outcomes of previous scheduling commands and on the statistical knowledge of the task generation and expiration processes. If a server is assigned to a node with an empty queue, it remains idle for the whole slot.
The purpose here is thus to pair servers with nodes so as to maximize the average number of successfully completed tasks within either a finite or infinite number of slots (horizon), which we refer to as task-throughput, or simply throughput.

Figure 1. The centralized controller (CC) assigns $K$ resources (servers) to $K$ out of $M \geq K$ nodes to complete their tasks in each slot $t$. The tasks of node $U_i$ at slot $t$ are stored in a task queue $Q_i(t)$.

A. Markov Formulation

We now introduce the stochastic model that describes the evolution of the task queues across slots. In this section we consider task queues of capacity one (see Sec. V for capacity larger than one), where $Q_i(t) \in \{0,1\}$ denotes the number of tasks in the queue of node $U_i$, for $i \in \{1,\dots,M\}$. The stochastic evolution of queue $Q_i(t)$ is shown in Fig. 2 as a function of the scheduling decision $\mathcal{U}(t)$, which consists in the assignment at each slot $t$ of the $K$ servers to a subset $\mathcal{U}(t) \subseteq \{U_1,\dots,U_M\}$ of $K$ nodes, with $|\mathcal{U}(t)| = K$.

Figure 2. Markov model for the evolution of the state of the task queue $Q_i(t) \in \{0,1\}$, when the node $U_i$: a) is not scheduled in slot $t$ (i.e., $U_i \notin \mathcal{U}(t)$); b) is scheduled in slot $t$ (i.e., $U_i \in \mathcal{U}(t)$).

At each slot, node $U_i$ can be either scheduled ($U_i \in \mathcal{U}(t)$) or not ($U_i \notin \mathcal{U}(t)$). If $U_i$ is not scheduled (i.e., $U_i \notin \mathcal{U}(t)$, see Fig. 2-a)) and there is a task in its queue (i.e., $Q_i(t) = 1$), then the task expires with probability (w.p.) $p_{10}^{(0)} = \Pr[Q_i(t+1) = 0 \mid Q_i(t) = 1, U_i \notin \mathcal{U}(t)]$, while it remains in the queue w.p. $p_{11}^{(0)} = 1 - p_{10}^{(0)}$. Instead, if node $U_i$ is scheduled (i.e., $U_i \in \mathcal{U}(t)$, see Fig. 2-b)) and $Q_i(t) = 1$, its task is completed successfully and its queue in the next slot is either empty or full w.p. $p_{10}^{(1)} = \Pr[Q_i(t+1) = 0 \mid Q_i(t) = 1, U_i \in \mathcal{U}(t)]$ and $p_{11}^{(1)} = 1 - p_{10}^{(1)}$, respectively. Probability $p_{11}^{(1)}$ accounts for the possible arrival of a new task. If $Q_i(t) = 0$, the probabilities of receiving a new task when $U_i$ is not scheduled and scheduled are $p_{01}^{(0)} = \Pr[Q_i(t+1) = 1 \mid Q_i(t) = 0, U_i \notin \mathcal{U}(t)]$ and $p_{01}^{(1)} = \Pr[Q_i(t+1) = 1 \mid Q_i(t) = 0, U_i \in \mathcal{U}(t)]$, respectively, while the probabilities of receiving no task are $p_{00}^{(0)} = 1 - p_{01}^{(0)}$ and $p_{00}^{(1)} = 1 - p_{01}^{(1)}$, respectively.

B. Related Work and Contributions

In this work we assume that the CC has no direct access to the state of the task queues $Q_1(t),\dots,Q_M(t)$, while it knows the transition probabilities $p_{xy}^{(u)}$, with $x, y, u \in \{0,1\}$, and the outcomes of previously scheduled tasks. The scheduling problem is thus formalized as a partially observable Markov decision process (POMDP) [4], and then cast into a restless multi-armed bandit (RMAB) problem [5]. A RMAB is constituted by a set of arms (the queues in our model), a subset of which needs to be activated (or scheduled) in each slot by the controller. To elaborate, we assume that the transition probabilities of the Markov chains in Fig. 2, the number of nodes $M$ and the number of servers $K$ are such that

$m = M/K$ is an integer, and  (1a)

$p_{11}^{(1)} \leq p_{01}^{(1)} \leq p_{01}^{(0)} \leq p_{11}^{(0)}$.  (1b)

Assumption (1a) states that the ratio $m = M/K$ between the number $M$ of nodes and the number $K$ of servers is an integer, generalizing the single-server case ($K = 1$). Proving the results provided later in this paper for the case of non-integer $m$ remains an open problem. Assumption (1b) is motivated as follows.
The inequality $p_{11}^{(1)} \leq p_{01}^{(1)}$ imposes that the probability that a new task arrives when the task queue is full and the node is scheduled ($p_{11}^{(1)}$) is no larger than when the task queue is empty ($p_{01}^{(1)}$). This applies, e.g., when the arrival of a new task is independent of the queue's state and of the scheduling decisions (i.e., $p_{11}^{(1)} = p_{01}^{(1)}$), or when a new task is not accepted when the queue is full (i.e., $p_{11}^{(1)} = 0$). The inequality $p_{01}^{(1)} \leq p_{01}^{(0)}$ applies, e.g., when the task generation process does not depend on the queue's state and on the scheduling decisions, so that $p_{01}^{(1)} = p_{01}^{(0)}$, or when a new task cannot be accepted while the node is scheduled even if the queue is empty ($p_{01}^{(1)} = 0$). The inequality $p_{01}^{(0)} \leq p_{11}^{(0)}$ indicates that, when a node is not scheduled, the probability $p_{01}^{(0)}$ that its task queue is full in the next slot, given that it is currently empty, is smaller than the probability $p_{11}^{(0)}$ that the task queue is full in the next slot, given that it is currently full. This applies, e.g., when the task generation and expiration processes are independent of each other.

Main Contributions: When the task queues are of capacity one, and under assumptions (1), we first show that the myopic policy (MP) for the RMAB at hand is a round-robin (RR) strategy that: i) re-numbers the nodes in decreasing order according to the initial probability that their respective task queue is full; and then ii) schedules the nodes periodically in groups of $K$ by exploiting the initial ordering. The MP is then proved to be throughput-optimal. We then show that, for the special case in which $p_{01}^{(0)} = p_{01}^{(1)}$ and $p_{10}^{(0)} = p_{11}^{(1)} = 0$, the MP coincides with the Whittle index policy, which is a generally suboptimal index strategy for RMAB problems [6]. Finally, we extend the model of Sec. I-A to queues with an arbitrary capacity $C$.
Characterizing optimal policies for $C > 1$ is significantly more complicated than for the case $C = 1$. Hence, inspired by the optimality of the MP for $C = 1$, we compare the performance of the MP for $C > 1$ with an upper bound based on a relaxation of the scheduling constraints of the original RMAB problem [6]. It is recalled that the results in this paper represent a rare case in which the optimal policy for a RMAB can be found explicitly [5].

Related Work: The work in this paper is related to the works [7], [8], in which a RMAB problem similar to the one in this paper is addressed. However, the main difference between our RMAB and the one in [7], [8] is the evolution of the arms across slots. In particular, in our RMAB, each arm evolves across a slot depending on the scheduling decision taken by the controller, while in [7], [8] the evolution of the arms does not depend on the scheduling decision. The transition probabilities for the RMAB in [7], [8] are thus equivalent to setting $p_{01}^{(0)} = p_{01}^{(1)}$ and $p_{11}^{(0)} = p_{11}^{(1)}$ in the Markov chains of Fig. 2. For instance, our model applies to scenarios in which the arms are, e.g., data queues, where each arm draws a data packet from its queue only when scheduled. Instead, the model in [7], [8] applies to scenarios in which the arms are, e.g., communication channels, whose quality evolves across slots regardless of whether they are selected for transmission or not. In [7] it is shown that the MP is optimal for $p_{01}^{(0)} = p_{01}^{(1)} \leq p_{11}^{(0)} = p_{11}^{(1)}$ with $K = 1$, while [8] extends this result to arbitrary $K$. The work [7] also demonstrates that the MP is not generally optimal in the case $p_{01}^{(0)} = p_{01}^{(1)} \geq p_{11}^{(0)} = p_{11}^{(1)}$. Finally, paper [9] proves the optimality of the Whittle index policy for $p_{01}^{(0)} = p_{01}^{(1)} \geq p_{11}^{(0)} = p_{11}^{(1)}$.
We emphasize that neither our model nor the one considered in [7], [8] subsumes the other, and the results here and in the aforementioned previous works should be considered as complementary.

Notation: Vectors are denoted in bold, while the corresponding non-bold letters denote the vectors' components. Given a vector $\mathbf{x} = [x_1,\dots,x_M]$ and a set $S = \{i_1,\dots,i_K\} \subseteq \{1,\dots,M\}$ of cardinality $K \leq M$, we define the vector $\mathbf{x}_S = [x_{i_1},\dots,x_{i_K}]$, where $i_1 \leq \dots \leq i_K$. A function $f(\mathbf{x})$ of a vector $\mathbf{x}$ is also denoted as $f(x_1,\dots,x_M)$, or as $f(x_1,\dots,x_l,\mathbf{x}_{\{l+1,\dots,M\}})$ for some $1 \leq l \leq M$, or by similar notations depending on the context. Given a set $A$ and a subset $B \subseteq A$, $B^c$ represents the complement of $B$ in $A$.

II. PROBLEM FORMULATION

Here we formalize the scheduling problem of Sec. I (see Fig. 1), in which the task generation and expiration processes are modeled, independently at each node, by the Markov models of Sec. I-A with queues of capacity one. The extension to task queues of arbitrary capacity is addressed in Sec. V.

A. Problem Definition

The scheduling problem at the CC is addressed in a finite-horizon scenario over slots $t \in \{1,\dots,T\}$. Let $\mathbf{Q}(t) = [Q_1(t),\dots,Q_M(t)]$ be the vector collecting the states of the task queues at slot $t$. At slot $t = 1$, the CC is only aware of the initial probability distribution $\boldsymbol{\omega}(1) = [\omega_1(1),\dots,\omega_M(1)]$ of $\mathbf{Q}(1)$, whose $i$th entry is $\omega_i(1) = \Pr[Q_i(1) = 1]$. Thus, the subset $\mathcal{U}(1)$ of $|\mathcal{U}(1)| = K$ nodes scheduled at slot $t = 1$ is chosen as a function of the initial distribution $\boldsymbol{\omega}(1)$ only. For any node $U_i \in \mathcal{U}(t)$ scheduled at slot $t$, an observation is made available to the CC at the end of the slot, while no observations are available for non-scheduled nodes $U_i \notin \mathcal{U}(t)$. Specifically, if $Q_i(t) = 1$ and $U_i \in \mathcal{U}(t)$, the task of $U_i$ is served within slot $t$, and the CC observes that $Q_i(t) = 1$.
Conversely, if $Q_i(t) = 0$ and $U_i \in \mathcal{U}(t)$, no tasks are completed and the CC observes that $Q_i(t) = 0$. We define $O(t) = \{Q_i(t) : U_i \in \mathcal{U}(t)\}$ as the set of (new) observations available at the CC at the end of slot $t$. At time $t$, the CC hence knows the history of all decisions and previous observations, along with the initial distribution $\boldsymbol{\omega}(1)$, namely $H(t) = \{\mathcal{U}(1),\dots,\mathcal{U}(t-1), O(1),\dots,O(t-1), \boldsymbol{\omega}(1)\}$, with $H(1) = \{\boldsymbol{\omega}(1)\}$.

Since the CC has only partial information about the system state $\mathbf{Q}(t)$, through $O(t)$, the scheduling problem at hand can be modeled as a POMDP. It is well known that a sufficient statistic for taking decisions in such a POMDP is given by the probability distribution of $\mathbf{Q}(t)$ conditioned on the history $H(t)$ [10], referred to as the belief, and represented by the vector $\boldsymbol{\omega}(t) = [\omega_1(t),\dots,\omega_M(t)]$, with $i$th entry given by

$\omega_i(t) = \Pr[Q_i(t) = 1 \mid H(t)]$.  (2)

Since the belief $\boldsymbol{\omega}(t)$ fully summarizes the entire history $H(t)$ of past actions and observations [10], a scheduling policy $\pi = [\mathcal{U}^\pi(1),\dots,\mathcal{U}^\pi(T)]$ is defined as a collection of functions $\mathcal{U}^\pi(t)$ that map the belief $\boldsymbol{\omega}(t)$ to a subset $\mathcal{U}(t)$ of $|\mathcal{U}(t)| = K$ nodes, i.e., $\mathcal{U}^\pi(t): \boldsymbol{\omega}(t) \to \mathcal{U}(t)$. We will refer to $\mathcal{U}^\pi(t)$ as the subset of scheduled nodes, even though, strictly speaking, it is the mapping function defined above. The transition probabilities over the belief space are derived in Sec. II-B. The immediate reward $R(\boldsymbol{\omega}, \mathcal{U})$, accrued by the CC when the belief vector is $\boldsymbol{\omega}$ and action $\mathcal{U}$ is taken, measures the average number of tasks completed within the current slot, and is given by

$R(\boldsymbol{\omega}, \mathcal{U}) = \sum_{U_i \in \mathcal{U}} \omega_i$.  (3)

Notice that $R(\boldsymbol{\omega}, \mathcal{U}) \leq K$, since there are only $K$ servers.
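As a concrete illustration (a Python sketch, not part of the paper; the belief values and the helper `myopic_set` are ours), the immediate reward (3) is simply the sum of the beliefs of the scheduled nodes, and it is maximized by picking the $K$ largest beliefs:

```python
# Sketch of the immediate reward R(omega, U) in eq. (3): the expected number of
# tasks completed in the current slot when the nodes indexed by `scheduled` are
# assigned servers. Belief values below are illustrative, not from the paper.

def immediate_reward(omega, scheduled):
    """R(omega, U) = sum of the beliefs of the scheduled nodes."""
    return sum(omega[i] for i in scheduled)

def myopic_set(omega, K):
    """The K nodes with the largest beliefs maximize the immediate reward."""
    return set(sorted(range(len(omega)), key=lambda i: omega[i], reverse=True)[:K])

omega = [0.75, 0.25, 0.5, 0.375]    # M = 4 nodes
K = 2
best = myopic_set(omega, K)         # -> {0, 2}
print(immediate_reward(omega, best))  # 0.75 + 0.5 = 1.25 <= K
```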
The throughput measures the average number of tasks completed over the slots $\{1,\dots,T\}$ and, by exploiting (3) and under policy $\pi$, is given by

$V_1^\pi(\boldsymbol{\omega}(1)) = \sum_{t=1}^{T} \beta^{t-1}\, \mathbb{E}_\pi[R(\boldsymbol{\omega}(t), \mathcal{U}^\pi(t)) \mid \boldsymbol{\omega}(1)]$.  (4)

In (4), the expectation $\mathbb{E}_\pi[\cdot \mid \boldsymbol{\omega}(1)]$, under policy $\pi$ for initial belief $\boldsymbol{\omega}(1)$, is taken with respect to the distribution of the Markov process $\boldsymbol{\omega}(t)$, as obtained from the transition probabilities to be derived in Sec. II-B. For generality, the definition (4) includes a discount factor $0 \leq \beta \leq 1$ [7], while the infinite-horizon scenario (i.e., $T \to \infty$) will be discussed in Sec. III-C. The goal is to find a policy $\pi^* = [\mathcal{U}^*(1),\dots,\mathcal{U}^*(T)]$ that maximizes the throughput (4), so that

$V_1^*(\boldsymbol{\omega}(1)) = V_1^{\pi^*}(\boldsymbol{\omega}(1)) = \max_\pi V_1^\pi(\boldsymbol{\omega}(1))$, with $\pi^* = \arg\max_\pi V_1^\pi(\boldsymbol{\omega}(1))$.  (5)

B. Transition Probabilities

The belief transition probabilities, given decision $\mathcal{U}(t) = \mathcal{U}$ and $\boldsymbol{\omega}(t) = \boldsymbol{\omega} = [\omega_1,\dots,\omega_M]$, are

$p_{\boldsymbol{\omega}\boldsymbol{\omega}'}^{(\mathcal{U})} = \Pr[\boldsymbol{\omega}(t+1) = \boldsymbol{\omega}' \mid \boldsymbol{\omega}(t) = \boldsymbol{\omega}, \mathcal{U}(t) = \mathcal{U}] = \prod_{i=1}^{M} \Pr[\omega_i(t+1) = \omega_i' \mid \omega_i(t) = \omega_i, \mathcal{U}(t) = \mathcal{U}]$,  (6)

where $\boldsymbol{\omega}(t+1) = \boldsymbol{\omega}' = [\omega_1',\dots,\omega_M']$, while the distribution of entry $\omega_i(t+1)$ is (see Fig. 2)

$\Pr[\omega_i(t+1) = \omega_i' \mid \omega_i(t) = \omega_i, \mathcal{U}(t) = \mathcal{U}] = \begin{cases} \omega_i & \text{if } \omega_i' = p_{11}^{(1)} \text{ and } U_i \in \mathcal{U} \\ 1 - \omega_i & \text{if } \omega_i' = p_{01}^{(1)} \text{ and } U_i \in \mathcal{U} \\ 1 & \text{if } \omega_i' = \tau_0^{(1)}(\omega_i) \text{ and } U_i \notin \mathcal{U}, \end{cases}$  (7)

where we have defined the deterministic function

$\tau_0^{(1)}(\omega) = \Pr[Q_i(t+1) = 1 \mid \omega_i(t) = \omega, U_i \notin \mathcal{U}(t)] = \omega p_{11}^{(0)} + (1 - \omega) p_{01}^{(0)} = \omega \delta_0 + p_{01}^{(0)}$  (8)

to indicate the next slot's belief when $U_i$ is not scheduled ($U_i \notin \mathcal{U}(t)$), with $\delta_0 = p_{11}^{(0)} - p_{01}^{(0)} \geq 0$ due to inequalities (1b). Eq. (8) follows from Fig. 2-a), since the next slot's belief is either $p_{11}^{(0)}$ if $Q_i(t) = 1$ (w.p. $\omega$) or $p_{01}^{(0)}$ if $Q_i(t) = 0$ (w.p. $1 - \omega$).
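The belief update (7)-(8) can be sketched as follows (illustrative Python, not the authors' code; the transition probabilities are arbitrary values chosen to satisfy assumption (1b)):

```python
import random

# Sketch of the one-step belief update in eqs. (7)-(8). The transition
# probabilities below are illustrative and satisfy (1b):
# p11_act <= p01_act <= p01_pas <= p11_pas.
p01_pas, p11_pas = 0.3, 0.8   # node not scheduled (superscript (0))
p01_act, p11_act = 0.3, 0.2   # node scheduled     (superscript (1))

def tau1(omega):
    """Eq. (8): deterministic next-slot belief of a non-scheduled node,
    tau_0^(1)(omega) = omega * p11_pas + (1 - omega) * p01_pas."""
    return omega * p11_pas + (1 - omega) * p01_pas

def next_belief(omega, scheduled, rng=random):
    """Eq. (7): one-step belief transition of a single node.

    A scheduled node is observed, so its belief resets to p11_act (queue was
    full, w.p. omega) or p01_act (queue was empty, w.p. 1 - omega); a
    non-scheduled node's belief moves deterministically to tau1(omega)."""
    if not scheduled:
        return tau1(omega)
    return p11_act if rng.random() < omega else p01_act

# Inequalities (10): a non-scheduled node's belief always dominates the
# post-observation beliefs of a scheduled one.
assert all(p11_act <= p01_act <= tau1(w / 10) for w in range(11))
```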
A generalization of the function $\tau_0^{(1)}(\omega)$, which computes the belief $\omega_i(t+k)$ of node $U_i$ when it is not scheduled for $k$ successive slots, e.g., slots $\{t,\dots,t+k-1\}$, and $\omega_i(t) = \omega$, can be obtained as

$\tau_0^{(k)}(\omega) = \Pr[Q_i(t+k) = 1 \mid \omega_i(t) = \omega, U_i \notin \mathcal{U}(t),\dots,U_i \notin \mathcal{U}(t+k-1)] = \omega \delta_0^k + p_{01}^{(0)} \dfrac{1 - \delta_0^k}{1 - \delta_0}$.  (9)

Eq. (9) can be obtained recursively from (8) as $\tau_0^{(k)}(\omega) = \tau_0^{(1)}(\tau_0^{(k-1)}(\omega))$, for all $k \geq 1$, with $\tau_0^{(0)}(\omega) = \omega$. Under assumptions (1b), it is easy to verify that the function (8) satisfies the inequalities

$p_{11}^{(1)} \leq p_{01}^{(1)} \leq \tau_0^{(1)}(\omega)$, for all $0 \leq \omega \leq 1$, and  (10)

$\tau_0^{(1)}(\omega) \leq \tau_0^{(1)}(\omega')$, for all $\omega \leq \omega'$ with $0 \leq \omega, \omega' \leq 1$.  (11)

The inequalities in (10) guarantee that the belief of a non-scheduled node is always larger than that of a scheduled one, while inequality (11) says that the belief ordering of two non-scheduled nodes is maintained across a slot. These inequalities play a crucial role in the analysis below.

C. Optimality Equations

The dynamic programming (DP) formulation of problem (5) (see, e.g., [11]) allows us to express the throughput recursively over the horizon $\{t,\dots,T\}$, under policy $\pi$ and initial belief $\boldsymbol{\omega}$, as

$V_t^\pi(\boldsymbol{\omega}) = \sum_{j=t}^{T} \beta^{j-t}\, \mathbb{E}_\pi[R(\boldsymbol{\omega}(j), \mathcal{U}^\pi(j)) \mid \boldsymbol{\omega}(t) = \boldsymbol{\omega}] = R(\boldsymbol{\omega}, \mathcal{U}^\pi(t)) + \beta \sum_{\boldsymbol{\omega}'} p_{\boldsymbol{\omega}\boldsymbol{\omega}'}^{(\mathcal{U}^\pi)} V_{t+1}^\pi(\boldsymbol{\omega}')$,  (12)

where $V_t^\pi(\cdot) = 0$ for $t > T$. The DP optimality conditions (or Bellman equations) are then expressed in terms of the value function $V_t^*(\boldsymbol{\omega}) = \max_\pi V_t^\pi(\boldsymbol{\omega})$, which represents the optimal throughput in the interval $\{t,\dots,T\}$ and is given by

$V_t^*(\boldsymbol{\omega}) = \max_{\mathcal{U}(t) = \mathcal{U} \subseteq \{U_1,\dots,U_M\}} \left\{ R(\boldsymbol{\omega}, \mathcal{U}) + \beta \sum_{\boldsymbol{\omega}'} p_{\boldsymbol{\omega}\boldsymbol{\omega}'}^{(\mathcal{U})} V_{t+1}^*(\boldsymbol{\omega}') \right\}$.  (13)

Note that, since the nodes are stochastically equivalent, the value function (13) only depends on the numerical values of the entries of the belief vector $\boldsymbol{\omega}$, regardless of the way it is ordered. Finally, an optimal policy $\pi^* = [\mathcal{U}^*(1),\dots,\mathcal{U}^*(T)]$ (see (5)) is such that $\mathcal{U}^*(t)$ attains the maximum in condition (13) for $t \in \{1, 2,\dots,T\}$.

III. OPTIMALITY OF THE MYOPIC POLICY

We now define the myopic policy (MP) and show that, under assumptions (1), it is a round-robin (RR) policy that schedules the nodes periodically, and that it is optimal for problem (5).

A. The Myopic Policy is Round-Robin

The MP $\pi^{MP} = \{\mathcal{U}^{MP}(1),\dots,\mathcal{U}^{MP}(T)\}$, with throughput $V_t^{MP}(\cdot)$, is the greedy policy that schedules at each slot the $K$ nodes with the largest beliefs, so as to maximize the immediate reward (3); that is, we have

$\mathcal{U}^{MP}(t) = \arg\max_{\mathcal{U}} R(\boldsymbol{\omega}(t), \mathcal{U}) = \arg\max_{\mathcal{U}} \sum_{U_i \in \mathcal{U}} \omega_i(t)$.  (14)

Proposition 1. Under assumptions (1), the MP (14), given an initial belief $\boldsymbol{\omega}'(1)$, is a RR policy that operates as follows: 1) Sort the vector $\boldsymbol{\omega}'(1)$ in decreasing order to obtain $\boldsymbol{\omega}(1) = [\omega_1(1),\dots,\omega_M(1)]$ such that $\omega_1(1) \geq \dots \geq \omega_M(1)$, and re-number the nodes so that $U_i$ has belief $\omega_i(1)$; 2) Divide the nodes into $m$ groups of $K$ nodes each, so that the $g$th group $\mathcal{G}_g$, $g \in \{1,\dots,m\}$, contains all nodes $U_i$ such that $g = \lfloor (i-1)/K \rfloor + 1$, namely $\mathcal{G}_1 = \{U_1,\dots,U_K\}$, $\mathcal{G}_2 = \{U_{K+1},\dots,U_{2K}\}$, and so on; 3) Schedule the groups in a RR fashion with period $m$ slots, so that groups $\mathcal{G}_1,\dots,\mathcal{G}_m, \mathcal{G}_1,\dots$ are sequentially scheduled at slots $t = 1,\dots,m, m+1,\dots$, and so on.

Proof: According to (14), the first scheduled set is $\mathcal{U}^{MP}(1) = \mathcal{G}_1 = \{U_1, U_2,\dots,U_K\}$. The beliefs are then updated through (7).
Recalling (10), the scheduled nodes in $\mathcal{G}_1$ have their beliefs updated to either $p_{11}^{(1)}$ or $p_{01}^{(1)}$, which are both smaller than the belief of any non-scheduled node in $\{U_1,\dots,U_M\} \setminus \mathcal{G}_1$. Moreover, the ordering of the non-scheduled nodes' beliefs is preserved due to (11). Hence, the second scheduled group is $\mathcal{U}^{MP}(2) = \mathcal{G}_2$, the third is $\mathcal{U}^{MP}(3) = \mathcal{G}_3$, and so on. This proves that the MP, upon an initial ordering of the beliefs, is a RR policy.

We emphasize that the MP sorts the beliefs of the nodes only at the first slot in which it is operated, and then keeps scheduling the groups of nodes according to their initial ordering, without needing to recalculate the beliefs.

B. Optimality of the Myopic Policy

We now prove the optimality of the MP by showing that it satisfies the Bellman equations (13). To start with, let us consider a RR policy $\pi^{RR}$ that operates according to steps 2) and 3) of Proposition 1 (i.e., without re-ordering the initial belief), and let its throughput (12) be denoted by $V_t^{RR}(\boldsymbol{\omega})$. Note that, when the initial belief $\boldsymbol{\omega}$ is ordered so that $\omega_1 \geq \dots \geq \omega_M$, then $V_t^{RR}(\boldsymbol{\omega}) = V_t^{MP}(\boldsymbol{\omega})$. Based on backward induction arguments similar to [7], [8], the following lemma establishes a sufficient condition for the optimality of the MP.

Lemma 2. Assume that the MP is optimal at slots $t+1,\dots,T$, i.e., it satisfies (13). To show that the MP is optimal also at slot $t$, it is sufficient to prove the inequality

$V_t^{RR}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}) \leq V_t^{MP}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}) = V_t^{RR}(\omega_1, \omega_2,\dots,\omega_M)$, for all $\omega_1 \geq \omega_2 \geq \dots \geq \omega_M$  (15)

and all sets $S \subseteq \{1,\dots,M\}$ of $K$ elements, with the elements in $\boldsymbol{\omega}_{S^c}$ decreasingly ordered.
Proof: Since the MP is optimal from $t+1$ onward by assumption, it is sufficient to show that scheduling $K$ nodes with arbitrary beliefs at slot $t$ and then following the MP from slot $t+1$ on is no better than following the MP immediately at slot $t$. The performance of the former policy is given by the left-hand side (LHS) of (15): in fact, $V_t^{RR}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c})$, for any set $S$, represents the throughput of a policy that schedules the $K$ nodes with beliefs $\boldsymbol{\omega}_S$ at slot $t$ and then operates as the MP from $t+1$ onward, since the beliefs in $\boldsymbol{\omega}_{S^c}$ are decreasingly ordered. The MP's performance is instead given by the right-hand side (RHS) of (15). Note that, for $t = T$, it is immediate to verify that the MP is optimal. This concludes the proof.

Theorem 3. Under assumptions (1), the MP is optimal for problem (5), so that $\pi^{MP} = \pi^*$.

Proof: To start with, we first prove in Appendix A that the inequality

$V_t^{RR}(\omega_1,\dots,\omega_j, y, x,\dots,\omega_M) \leq V_t^{RR}(\omega_1,\dots,\omega_j, x, y,\dots,\omega_M)$  (16)

holds for any $x \geq y$, with $0 \leq j \leq M-2$, for all $t \in \{1,\dots,T\}$ and all beliefs $\omega_k$ (not necessarily ordered), with $k \in \{1,\dots,M\}$. Inequality (16) for $j = 0$ is intended as $V_t^{RR}(y, x,\dots,\omega_M) \leq V_t^{RR}(x, y,\dots,\omega_M)$. If (16) holds, then inequality (15) is satisfied for all $\omega_1 \geq \dots \geq \omega_M$ and all subsets $S \subseteq \{1,\dots,M\}$ of $K$ elements. In fact, (16) states that the throughput of the RR policy never increases when, for any pair of adjacent nodes, the one with the smallest belief of the pair is scheduled first. Hence, by starting from the RHS of (15) (i.e., $V_t^{RR}(\omega_1, \omega_2,\dots,\omega_M)$) and by applying a suitable number of successive swaps between pairs of adjacent elements of the vector $[\omega_1, \omega_2,\dots,\omega_M]$ so as to obtain $[\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}]$, for any $S$, we obtain a cascade of inequalities through (16) (one for each swap), which guarantees that (15) holds.
By Lemma 2, this is sufficient to prove that the MP is optimal, since inequality (15) holds for any slot $t$.

C. Extension to the Infinite-Horizon Case

We now briefly describe the extension of problem (5) to the infinite-horizon case, for which the throughput under policy $\pi$ and its optimal value are given by (see, e.g., [7])

$V^\pi(\boldsymbol{\omega}(1)) = \sum_{t=1}^{\infty} \beta^{t-1} \mathbb{E}_\pi[R(\boldsymbol{\omega}(t), \mathcal{U}^\pi(t)) \mid \boldsymbol{\omega}(1)]$, and $V^*(\boldsymbol{\omega}(1)) = \max_\pi V^\pi(\boldsymbol{\omega}(1))$,  (17)

where the optimal policy is $\pi^* = \arg\max_\pi V^\pi(\boldsymbol{\omega}(1))$ and $0 \leq \beta < 1$. From standard DP theory [11], the optimal policy $\pi^*$ is stationary, so that the optimal decision $\mathcal{U}^*(t)$ is a function of the current state $\boldsymbol{\omega}(t)$ only, independently of the slot $t$ [11]. By following the same reasoning as in [7, Theorem 3], it can be shown that the optimality of the MP for the finite-horizon setting implies its optimality also for the infinite-horizon scenario. Moreover, by following [7, Theorem 4], it can be shown that the MP is optimal also for the undiscounted average-reward criterion (i.e., $V_{avg}^\pi(\boldsymbol{\omega}(1)) = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\pi[R(\boldsymbol{\omega}(t), \mathcal{U}^\pi(t)) \mid \boldsymbol{\omega}(1)]$).

IV. OPTIMALITY OF THE WHITTLE INDEX POLICY

In this section, we briefly review the Whittle index policy for RMAB problems [5], and then focus on the infinite-horizon scenario of Sec. III-C when conditions (1b) are specialized to

$0 = p_{11}^{(1)} \leq p_{01}^{(1)} = p_{01}^{(0)} = p_{01} \leq p_{11}^{(0)} = 1$,  (18)

and where the task queues are of capacity one. We show that, under assumption (18) (see Sec. I-B for a discussion of these conditions), the RMAB at hand is indexable, and we calculate its Whittle index in closed form. We then show that the Whittle index policy is equivalent to the MP, and thus optimal for problem (17).
We emphasize that our results provide a rare example [5] in which, as in [9], not only is indexability established, but the Whittle index is also obtained in closed form and the Whittle index policy is proved to be optimal. It is finally remarked that our proof technique is inspired by [9], but the different system model poses new challenges that require significant additional work.

A. Whittle Index

The Whittle index policy assigns a numerical value $W(\omega_i)$ to each state $\omega_i$ of node $U_i$, referred to as the index, to measure how rewarding it is to schedule node $U_i$ in the current slot. The $K$ nodes with the largest indices are then scheduled in each slot. As detailed below, the Whittle index is calculated independently for each node, and thus the Whittle index policy is not generally optimal for RMAB problems. Moreover, even the existence of a well-defined Whittle index is not guaranteed [5]. To study the indexability and the Whittle index for the RMAB at hand, we can focus on a restless single-armed bandit (RSAB) model, as defined below [5]. A RSAB is a RMAB with a single arm, in which the only decision that needs to be taken by the CC is whether to activate the (single) arm or to keep it passive.

1) RSAB with Subsidy for Passivity: The Whittle index is based on the concept of subsidy for passivity, whereby the CC is given a subsidy $m \in \mathbb{R}$ when the arm is not scheduled. At each slot $t$, the CC, based on the state $\omega(t)$ of the arm, can decide to activate (or schedule) it, i.e., to set $u(t) = 1$, obtaining an immediate reward $R_m(\omega(t), 1) = \omega(t)$. If, instead, the arm is kept passive, i.e., $u(t) = 0$, a reward $R_m(\omega(t), 0) = m$ equal to the subsidy is accrued. The state $\omega(t)$ evolves through (7), which under (18), and adapted to the simplified notation used here, becomes

$\omega(t+1) = \begin{cases} 0 & \text{w.p. } \omega(t), \text{ if } u(t) = 1 \\ p_{01} & \text{w.p. } 1 - \omega(t), \text{ if } u(t) = 1 \\ \tau_0^{(1)}(\omega(t)) & \text{w.p. } 1, \text{ if } u(t) = 0. \end{cases}$  (19)

The throughput, given policy $\pi = \{u^\pi(1), u^\pi(2),\dots\}$ and initial belief $\omega(1)$, is

$V_m^\pi(\omega(1)) = \sum_{t=1}^{\infty} \beta^{t-1} \mathbb{E}_\pi[R_m(\omega(t), u^\pi(t)) \mid \omega(1)]$.  (20)

The optimal throughput is $V_m^*(\omega(1)) = \max_\pi V_m^\pi(\omega(1))$, while the optimal policy $\pi^* = \arg\max_\pi V_m^\pi(\omega(1))$ is stationary, in the sense that the optimal decisions $u_m^*(\omega) \in \{0,1\}$ are functions of the belief $\omega$ only, independently of the slot $t$ [9]. Removing the slot index from the initial belief, the optimal throughput $V_m^*(\omega)$ and the optimal decision $u_m^*(\omega)$ satisfy the following DP optimality equations for the infinite-horizon scenario (see [9]):

$V_m^*(\omega) = \max_{u \in \{0,1\}} \{V_m(\omega \mid u)\}$,  (21)

and $u_m^*(\omega) = \arg\max_{u \in \{0,1\}} \{V_m(\omega \mid u)\}$.  (22)

In (21)-(22) we defined $V_m(\omega \mid u)$, $u \in \{0,1\}$, as the throughput (20) of a policy that takes action $u$ at the current slot and then follows the optimal policy $u_m^*(\omega)$ onward; we have

$V_m(\omega \mid 0) = m + \beta V_m^*(\tau_0^{(1)}(\omega))$, and  (23)

$V_m(\omega \mid 1) = \omega + \beta [\omega V_m^*(0) + (1 - \omega) V_m^*(p_{01})]$.  (24)

2) Indexability and Whittle Index: We use the notation of [9] to define indexability and the Whittle index for the RSAB at hand. We first define the so-called passive set

$\mathcal{P}(m) = \{\omega : 0 \leq \omega \leq 1 \text{ and } u_m^*(\omega) = 0\}$,  (25)

as the set that contains all the beliefs $\omega$ for which the passive action is optimal (i.e., all $0 \leq \omega \leq 1$ such that $V_m(\omega \mid 0) \geq V_m(\omega \mid 1)$, see (23)-(24)) under the given subsidy for passivity $m \in \mathbb{R}$. The RMAB at hand is said to be indexable if the passive set $\mathcal{P}(m)$, for the associated RSAB problem (in a RMAB with arms characterized by different statistics, this condition must be checked for all arms), is monotonically increasing as $m$ increases within the interval $(-\infty, +\infty)$, in the sense that $\mathcal{P}(m') \subseteq \mathcal{P}(m)$ if $m' \leq m$, with $\mathcal{P}(-\infty) = \emptyset$ and $\mathcal{P}(+\infty) = [0, 1]$.
If the RMAB is indexable, the Whittle index $W(\omega)$ of an arm with state $\omega$ is the infimum subsidy $m$ such that it is optimal to make the arm passive. Equivalently, the Whittle index $W(\omega)$ is the infimum subsidy $m$ that makes the passive and active actions equally rewarding, i.e.,

$W(\omega) = \inf\{m : u_m^*(\omega) = 0\} = \inf\{m : V_m(\omega \mid 0) = V_m(\omega \mid 1)\}$.  (26)

B. Optimality of the Threshold Policy

Here, we show that the optimal policy $u_m^*(\omega)$ for the RSAB of Sec. IV-A1 is a threshold policy over the belief $\omega$. This is crucial in our proof of indexability of the RMAB at hand, given in Sec. IV-D. To this end, we observe that: i) the function $V_m(\omega \mid 1)$ in (24) is linear in the belief $\omega$; ii) the function $V_m(\omega \mid 0) = m + \beta V_m^*(\tau_0^{(1)}(\omega))$ in (23) is convex in $\omega$, since the value function $V_m^*(\omega)$ is convex for the problem at hand (see [9], [10]). We need the following lemma.

Lemma 4. The following inequalities hold:

a) For $0 \leq m < 1$: a.1) $V_m(0 \mid 1) \leq V_m(0 \mid 0) \leq V_m(1 \mid 1)$; a.2) $V_m(1 \mid 0) \leq V_m(1 \mid 1)$;  (27a)

b) For $m < 0$: b.1) $V_m(0 \mid 0) \leq V_m(0 \mid 1) \leq V_m(1 \mid 1)$; b.2) $V_m(1 \mid 0) \leq V_m(1 \mid 1)$;  (27b)

c) For $m \geq 1$: c.1) $V_m(0 \mid 0) \leq V_m(1 \mid 1) \leq V_m(0 \mid 1)$; c.2) $V_m(1 \mid 1) \leq V_m(1 \mid 0)$.  (27c)

Proof: See Appendix B.

Leveraging Lemma 4, we can now establish the optimality of a threshold policy $u_m^*(\omega)$.

Proposition 5. The optimal policy $u_m^*(\omega)$ in (22) for subsidy $m \in \mathbb{R}$ is given by

$u_m^*(\omega) = \begin{cases} 1, & \text{if } \omega > \omega^*(m) \\ 0, & \text{if } \omega \leq \omega^*(m), \end{cases}$  (28)

where $\omega^*(m) \in \mathbb{R}$ is the optimal threshold for a given subsidy $m$. The optimal threshold satisfies $0 \leq \omega^*(m) \leq 1$ if $0 \leq m < 1$, while it is an arbitrary negative number for $m < 0$ and an arbitrary number greater than unity for $m \geq 1$. In other words, we have $u_m^*(\omega) = 1$ for all $\omega$ if $m < 0$, and $u_m^*(\omega) = 0$ for all $\omega$ if $m \geq 1$.

Proof: We start by showing that (28), for $0 \leq m < 1$, satisfies (22) and is thus an optimal policy.
To see this, we refer to Fig. 3, which sketches the functions $V_m(\omega|1)$ and $V_m(\omega|0)$ for different values of the subsidy $m$. From (22), we have $u_m^*(\omega) = 1$ for all $\omega$ such that $V_m(\omega|1) > V_m(\omega|0)$, and $u_m^*(\omega) = 0$ otherwise. For $0 \le m < 1$, from the inequalities of Lemma 4-a), the linearity of $V_m(\omega|1)$ and the convexity of $V_m(\omega|0)$, it follows that there is exactly one intersection $\omega^*(m)$ between $V_m(\omega|1)$ and $V_m(\omega|0)$ with $0 \le \omega^*(m) \le 1$, as shown in Fig. 3-a). Instead, when $m < 0$, by Lemma 4-b), arm activation is always optimal, that is, $u_m^*(\omega) = 1$, since $V_m(\omega|1) > V_m(\omega|0)$ for any $0 \le \omega \le 1$, as shown in Fig. 3-b). Conversely, when $m \ge 1$, by Lemma 4-c), passivity is always optimal, that is, $u_m^*(\omega) = 0$, since $V_m(\omega|0) \ge V_m(\omega|1)$ for any $0 \le \omega \le 1$, as shown in Fig. 3-c).

Figure 3. Illustration of the optimality of a threshold policy for different values of the subsidy for passivity $m$: a) $0 \le m < 1$; b) $m < 0$; c) $m \ge 1$.

C. Closed-Form Expression of the Value Function

By leveraging the optimality of the threshold policy (28), we derive a closed-form expression for $V_m^*(\omega)$ in (21), which is a key step in establishing the RMAB's indexability in Sec. IV-D. Notice that the function $\tau_0^{(k)}(\omega)$ in (9), when specialized to conditions (18), becomes

$$\tau_0^{(k)}(\omega) = 1 - (1 - p_{01})^k (1 - \omega), \quad (29)$$

which is a monotonically increasing function of $k$, so that $\tau_0^{(k)}(\omega) \ge \tau_0^{(i)}(\omega)$ for any $k \ge i$.
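The closed form (29) can be cross-checked against the one-step update it compounds, $\tau_0^{(1)}(\omega) = 1 - (1 - p_{01})(1 - \omega)$, applied $k$ times. The values of $p_{01}$ and the starting belief below are arbitrary illustrative choices:

```python
# Check the closed form (29) for the k-step passive belief update against
# direct iteration of the one-step map. p01 and w0 are illustrative values.
p01 = 0.3

def tau_k_closed(w, k):
    return 1.0 - (1.0 - p01) ** k * (1.0 - w)   # equation (29)

def tau_k_iter(w, k):
    for _ in range(k):
        w = 1.0 - (1.0 - p01) * (1.0 - w)       # one slot of passivity
    return w

w0 = 0.1
vals = [tau_k_closed(w0, k) for k in range(10)]
```

As the text states, the sequence is strictly increasing in $k$ (toward 1), which is what makes the passivity time $L(\omega, \omega')$ below well defined.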
Based on this monotonicity, we can define the number $L(\omega, \omega')$ of slots it takes for the belief to exceed $\omega'$ when starting from $\omega$ while the arm is kept passive, as

$$L(\omega, \omega') = \min\left\{k : \tau_0^{(k)}(\omega) > \omega'\right\} = \begin{cases} 0, & \omega > \omega' \\[4pt] \left\lfloor \dfrac{\ln\frac{1-\omega'}{1-\omega}}{\ln(1-p_{01})} \right\rfloor + 1, & \omega \le \omega' < 1 \\[8pt] \infty, & \omega \le 1 \le \omega'. \end{cases} \quad (30)$$

From (30) we have $L(\omega, \omega') = 1$ for $\omega = \omega'$ since, without loss of optimality, we assumed that the passive action is optimal (i.e., $u_m^*(\omega) = 0$) when $V_m(\omega|0) = V_m(\omega|1)$. For $\omega' \ge 1$ instead (which, according to Proposition 5, occurs for $m \ge 1$), the arm is always kept passive and thus $L(\omega, \omega') = \infty$.

Lemma 6. The optimal throughput $V_m^*(\omega)$ in (21) can be written as

$$V_m^*(\omega) = \frac{1 - \beta^{L(\omega, \omega^*(m))}}{1 - \beta}\, m + \beta^{L(\omega, \omega^*(m))}\, V_m\!\left(\tau_0^{(L(\omega, \omega^*(m)))}(\omega) \,\middle|\, 1\right), \quad (31)$$

where $\omega^*(m)$ is the optimal threshold obtained from Proposition 5.

Proof: According to Proposition 5, the optimal policy $u_m^*(\omega)$ keeps the arm passive as long as the current belief satisfies $\omega \le \omega^*(m)$. Therefore, the arm is kept passive for $L(\omega, \omega^*(m))$ slots, during each of which a reward $R_m(\omega, 0) = m$ is accrued. This leads to a total reward within the passivity period given by the geometric series $\sum_{k=0}^{L(\omega, \omega^*(m))-1} \beta^k m = \frac{1 - \beta^{L(\omega, \omega^*(m))}}{1 - \beta} m$, which corresponds to the first term on the RHS of (31). After $L(\omega, \omega^*(m))$ slots of passivity, the belief exceeds the threshold $\omega^*(m)$ and the arm is activated. The contribution to the value function thus becomes $\beta^{L(\omega, \omega^*(m))} V_m(\tau_0^{(L(\omega, \omega^*(m)))}(\omega) | 1)$, which is the second term on the RHS of (31). Note that, when $\omega > \omega^*(m)$, activation is immediately optimal, and $V_m^*(\omega) = V_m(\omega|1)$.

To evaluate $V_m^*(\omega)$ from (31), we only need to calculate $V_m(\omega|1)$, since the other terms, thanks to (30), are explicitly given once $\omega^*(m)$ is obtained from Proposition 5.
However, from (24), evaluating $V_m(\omega|1)$ only requires $V_m^*(0)$ and $V_m^*(p_{01})$, which are calculated in the lemma below.

Lemma 7. We have

$$V_m^*(0) = \frac{m - 2m\beta^{L_m^*} + \beta^{L_m^*}\upsilon_m^* - \beta^{L_m^*+1}\upsilon_m^* + m\beta^{L_m^*+1} + m\beta^{L_m^*}\upsilon_m^* - m\beta^{L_m^*+1}\upsilon_m^*}{(\beta - 1)\left(\beta^{L_m^*} - \beta^{L_m^*}\upsilon_m^* + \beta^{L_m^*+1}\upsilon_m^* - 1\right)} \quad (32a)$$

$$V_m^*(p_{01}) = \frac{m\beta - m\beta^{L_m^*} + \beta^{L_m^*}\upsilon_m^* - \beta^{L_m^*+1}\upsilon_m^* + m\beta^{L_m^*+1}\upsilon_m^* - m\beta^{L_m^*+2}\upsilon_m^*}{\beta(\beta - 1)\left(\beta^{L_m^*} - \beta^{L_m^*}\upsilon_m^* + \beta^{L_m^*+1}\upsilon_m^* - 1\right)} \quad (32b)$$

where we have defined $L_m^* = L(0, \omega^*(m))$ and $\upsilon_m^* = \tau_0^{(L(0, \omega^*(m)))}(0)$.

Proof: By plugging (24) into (31), and evaluating (31) for $\omega = 0$ and $\omega = p_{01}$, we obtain a linear system in the two unknowns $V_m^*(0)$ and $V_m^*(p_{01})$, which can be solved to yield (32).

D. Indexability and Whittle Index

Here, we prove that the RMAB at hand is indexable, derive the Whittle index in closed form, and show that the resulting index policy is equivalent to the MP and is thus optimal for the RMAB problem (17).

Theorem 8. a) The RMAB at hand is indexable, and b) its Whittle index is

$$W(\omega) = \frac{\frac{1 - \beta^{L(0,\omega)}}{1 - \beta}\, \tau_0^{(L(0,\omega))}(0)\,(1 - \beta)(1 - h)\,\omega + \beta^{L(0,\omega)}\tau_0^{(L(0,\omega))}(0)\,(1 - \beta)(h\beta + 1)}{(\beta - 1)\beta^{L(0,\omega)}\left(1 - \beta(1 - h)\right)\omega - 1 + \beta^{L(0,\omega)}\tau_0^{(L(0,\omega))}(0)\,(1 - \beta) + h\beta}. \quad (33)$$

Proof: Part a). See Appendix C. Part b). By (26), the Whittle index $W(\omega)$ of state $\omega$ is the value of the subsidy $m$ for which activating the arm or not is equally rewarding, so that $V_m(\omega|0) = V_m(\omega|1)$. By (23)-(24), this becomes $\omega + \beta[\omega V_m^*(0) + (1-\omega)V_m^*(p_{01})] = m + \beta V_m^*(\tau_0^{(1)}(\omega))$. Moreover, since the threshold policy is optimal and $\tau_0^{(1)}(\omega) > \omega$, it follows that, when the belief becomes $\tau_0^{(1)}(\omega)$, it is optimal to activate the arm, and thus $V_m^*(\tau_0^{(1)}(\omega)) = V_m(\tau_0^{(1)}(\omega)|1) = \tau_0^{(1)}(\omega) + \beta\,\tau_0^{(1)}(\omega)\,V_m^*(0) + \beta\left(1 - \tau_0^{(1)}(\omega)\right)V_m^*(p_{01})$.
Plugging this result into $V_m(\omega|0) = V_m(\omega|1)$, along with (32a) and (32b), leads to (33), which concludes the proof.

It can be shown that the Whittle index $W(\omega)$ in (33) is an increasing function of $\omega$. Therefore, since the Whittle policy selects in each slot the $K$ arms with the largest indices, we have:

Corollary 9. The Whittle index policy is equivalent to the MP and is thus optimal.

V. EXTENSION TO TASK QUEUES OF ARBITRARY CAPACITY C > 1

The problem of characterizing the optimal policies when $C > 1$ is significantly more complicated than for $C = 1$ and is left open by this work. Moreover, since the dimension of the state space of the belief MDP grows with $C$, even the numerical computation of the optimal policies is cumbersome. Given these difficulties, here we compare the performance of the MP, motivated by its optimality for $C = 1$, with a performance upper bound obtained following the relaxation approach of [6].

Figure 4. Markov model for the evolution of the queue $Q_i(t)$, of arbitrary capacity $C$, when the node $U_i$: a) is not scheduled in slot $t$ (i.e., $U_i \notin \mathcal{U}(t)$); b) is scheduled in slot $t$ (i.e., $U_i \in \mathcal{U}(t)$).

A. System Model and Myopic Policy

Each node $U_i$ has a task queue $Q_i(t) \in \{0, 1, \dots, C\}$ of capacity $C$. We consider the Markov model of Fig. 4 for the task generation and expiration processes at each node (cf. Sec. I-A). The transition probabilities between queue states when node $U_i$ is not scheduled are $p_{xy}^{(0)} = \Pr[Q_i(t+1) = y \mid Q_i(t) = x, U_i \notin \mathcal{U}(t)]$, whereas when $U_i$ is scheduled we have $p_{xy}^{(1)} = \Pr[Q_i(t+1) = y \mid Q_i(t) = x, U_i \in \mathcal{U}(t)]$, for $x, y \in \{0, 1, \dots, C\}$.
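As a concrete illustration, the birth-death model of Fig. 4 can be encoded as a pair of $(C+1)\times(C+1)$ stochastic matrices, one per action. The numerical entries used as defaults below follow the parameter choices of Sec. V-C; the diagonal entries, which the text leaves implicit, are set by normalization (an assumption of this sketch):

```python
def queue_matrices(C, p01_0=0.15, pup_0=0.10, pdn_0=0.05, pCC_0=0.90,
                   p01_1=0.05, pup_1=0.00, pdn_1=0.95, pCC_1=0.05):
    """Transition matrices P[u][x][y] for the queue model of Fig. 4.
    u = 0: node not scheduled; u = 1: node scheduled and one task executed.
    Only y in {x-1, x, x+1} is reachable (at most one task generated or
    dropped per slot); diagonals absorb the remaining probability mass."""
    P = []
    for p01, pup, pdn, pCC in ((p01_0, pup_0, pdn_0, pCC_0),
                               (p01_1, pup_1, pdn_1, pCC_1)):
        M = [[0.0] * (C + 1) for _ in range(C + 1)]
        M[0][1], M[0][0] = p01, 1.0 - p01        # empty queue: arrival or idle
        for x in range(1, C):                    # interior states
            M[x][x + 1], M[x][x - 1] = pup, pdn
            M[x][x] = 1.0 - pup - pdn
        M[C][C] = pCC                            # full queue
        M[C][C - 1] = 1.0 - pCC
        P.append(M)
    return P

P = queue_matrices(C=4)
```

Each row sums to one by construction, and the banded (birth-death) structure matches the constraint $p_{xy}^{(u)} = 0$ for $y < x-1$ or $y > x+1$ stated below.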
When node $U_i$ is scheduled at slot $t$ and $Q_i(t) \ge 1$, one of its tasks is executed, and the node informs the CC about the number of tasks left in its queue (the observation). We assume that at most one task can be generated (or dropped) in a slot, so that $p_{xy}^{(u)} = 0$ for $y < x - 1$ and $y > x + 1$, with $u \in \{0, 1\}$, as shown in Fig. 4. The belief of each $i$th node is represented by a $(C \times 1)$ vector $\boldsymbol{\omega}_i = [\omega_{i,0}, \dots, \omega_{i,C-1}]$, whose $k$th entry $\omega_{i,k}$, for $k \in \{0, 1, \dots, C-1\}$, is given by (cf. (2)) $\omega_{i,k} = \Pr[Q_i(t) = k \mid \mathcal{H}(t)]$. The immediate reward (3), given the initial belief vectors $\boldsymbol{\omega}_1(t), \dots, \boldsymbol{\omega}_M(t)$ and action $\mathcal{U}$, becomes

$$R(\boldsymbol{\omega}_1(t), \dots, \boldsymbol{\omega}_M(t), \mathcal{U}) = \sum_{i=1}^{M} \Pr[Q_i(t) > 0 \mid \mathcal{H}(t)]\, \mathbb{1}(U_i \in \mathcal{U}) = K - \sum_{i \in \mathcal{U}} \omega_{i,0}(t). \quad (34)$$

The performance of interest is the infinite-horizon throughput (17).

1) Myopic Policy: The MP (14), specialized to the immediate reward (34), becomes

$$\mathcal{U}_{MP}(t) = \arg\max_{\mathcal{U}} R(\boldsymbol{\omega}_1(t), \dots, \boldsymbol{\omega}_M(t), \mathcal{U}) = \arg\min_{\mathcal{U}} \sum_{i \in \mathcal{U}} \omega_{i,0}(t). \quad (35)$$

Note that, unlike in Sec. III-A, when $C > 1$ the MP does not generally have a RR structure.

B. Upper Bound

Here we derive an upper bound on the throughput (17) by following the approach proposed for general RMAB problems in [6]. The upper bound relaxes the constraint that exactly $K$ nodes must be scheduled in each slot. Specifically, it allows a variable number $K_\pi(t)$ of scheduled nodes in each $t$th slot under policy $\pi$, with the only constraint that its discounted average satisfies

$$\mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \beta^{t-1} K_\pi(t)\right] = \frac{K}{1 - \beta}. \quad (36)$$

The advantage of this relaxed version of the scheduling problem is that it can be tackled by focusing on each single arm independently of the others [6], [12].
This is because, by the symmetry of the nodes, the constraint (36) can be equivalently handled by imposing that each node is active for a discounted average time $\mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \beta^{t-1} \mathbb{1}(U_i \in \mathcal{U}_\pi(t))\right] = \frac{K}{M(1-\beta)}$. We can thus calculate the optimal solution of the relaxed problem by solving a single RSAB problem.

We now elaborate on this RSAB, dropping the node index. The immediate reward when the arm is in state $\boldsymbol{\omega}$ (a vector, since $C > 1$, see Sec. V-A) and action $u \in \{0, 1\}$ is chosen is $R(\boldsymbol{\omega}, u) = 1 - \omega_0$ if $u = 1$, and $R(\boldsymbol{\omega}, u) = 0$ if $u = 0$, while the Markov evolution of the belief follows from Fig. 4, similarly to Sec. I-A. The problem consists in maximizing the throughput under the constraint $\mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \beta^{t-1} \mathbb{1}(U_i \in \mathcal{U}_\pi(t))\right] = \sum_{t=1}^{\infty} \beta^{t-1} \mathbb{E}_\pi[u_\pi(t)] = \frac{K}{M(1-\beta)}$, as introduced above. Under the assumption that the state $\boldsymbol{\omega}$ belongs to a finite state space $\mathcal{W}$ (to be discussed below), this optimization can be carried out by resorting to a linear programming (LP) formulation [12]. Specifically, let $z_{\boldsymbol{\omega}}^{(u)}$ be the discounted probability of being in state $\boldsymbol{\omega}$ and selecting action $u \in \{0, 1\}$ under a given policy. The optimization at hand leads to the following LP:

$$\text{maximize} \quad \sum_{\boldsymbol{\omega}, u} R(\boldsymbol{\omega}, u)\, z_{\boldsymbol{\omega}}^{(u)}, \quad (37a)$$

$$\text{subject to:} \quad \sum_{\boldsymbol{\omega}, u} z_{\boldsymbol{\omega}}^{(u)} = \frac{1}{1 - \beta}, \quad (37b)$$

$$\sum_{\boldsymbol{\omega}} z_{\boldsymbol{\omega}}^{(1)} = \frac{K}{M(1 - \beta)}, \quad (37c)$$

$$z_{\boldsymbol{\omega}}^{(0)} + z_{\boldsymbol{\omega}}^{(1)} = \delta(\boldsymbol{\omega} - \boldsymbol{\omega}(1)) + \beta \sum_{\boldsymbol{\omega}', u} z_{\boldsymbol{\omega}'}^{(u)}\, p_{\boldsymbol{\omega}' \boldsymbol{\omega}}^{(u)}, \quad \text{for all } \boldsymbol{\omega} \in \mathcal{W}, \quad (37d)$$

where (37c) is the constraint on the average discounted time in which the node is scheduled, while (37d) guarantees that $z_{\boldsymbol{\omega}}^{(u)}$ is a valid discounted occupation measure [12]; here $\delta(\boldsymbol{\omega} - \boldsymbol{\omega}(1)) = 1$ if $\boldsymbol{\omega} = \boldsymbol{\omega}(1)$ and $\delta(\boldsymbol{\omega} - \boldsymbol{\omega}(1)) = 0$ if $\boldsymbol{\omega} \ne \boldsymbol{\omega}(1)$. Note that, as discussed in Sec. II, the term $p_{\boldsymbol{\omega}' \boldsymbol{\omega}}^{(u)}$ is the probability that the next state is $\boldsymbol{\omega}$ given that action $u$ is taken in state $\boldsymbol{\omega}'$. We are left to discuss the cardinality of the set $\mathcal{W}$.
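The structure of the occupation-measure constraints can be checked on a toy example: for any fixed randomized policy on a small finite-state arm, the discounted state-action measure $z$ satisfies the balance equations of the form (37d), and summing them over states gives a total mass of $1/(1-\beta)$. The three-state arm, transition matrices, and policy below are illustrative assumptions, not the belief MDP of the paper:

```python
# Discounted occupation measure of a fixed randomized policy on a toy
# 3-state arm; states, transitions and the policy are illustrative.
beta = 0.9
P = {0: [[0.8, 0.2, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]],   # passive
     1: [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]}   # active: reset
pi = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]                      # pi[s][u]
x = [1.0, 0.0, 0.0]                                            # start in s = 0
z = [[0.0, 0.0] for _ in range(3)]                             # z[s][u]
disc = 1.0
for _ in range(2000):                                          # truncated horizon
    for s in range(3):
        for u in (0, 1):
            z[s][u] += disc * x[s] * pi[s][u]                  # accumulate measure
    x = [sum(x[s] * pi[s][u] * P[u][s][sp] for s in range(3) for u in (0, 1))
         for sp in range(3)]                                   # propagate state law
    disc *= beta

total = sum(z[s][u] for s in range(3) for u in (0, 1))
```

Since every policy induces such a feasible $z$, and conversely every feasible $z$ corresponds to a (randomized) policy [12], maximizing the linear objective over these constraints yields a valid upper bound for the relaxed problem.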
While the belief $\boldsymbol{\omega}$ can in general assume any value in the $C$-dimensional probability simplex, the number of states actually visited by $\boldsymbol{\omega}$ over any finite horizon is finite, due to the finiteness of the action space [10]. In our problem, since the time horizon is unlimited, this fact alone is not sufficient to conclude that the set $\mathcal{W}$ is finite. However, after each slot $t$ in which the arm is activated, the belief at the $(t+1)$th slot can only take $C$ values, since the queue state is learned by the CC. The evolution of the belief is therefore reset after each activation and, in practice, the time between two activations is finite, since the node must be kept active for a discounted fraction of time $\frac{K}{M(1-\beta)}$. Hence, by constraining the maximum time interval between two activations to a sufficiently large value, the state space $\mathcal{W}$ remains finite and the optimal performance is not affected. We used this approach for the numerical evaluation of the upper bound in Sec. V-C.

C. Numerical Results

We now present some numerical results comparing the performance of the MP with the upper bound of Sec. V-B. The performance measure is the throughput (17) normalized by its ideal value $\frac{K}{1-\beta}$, which is attained if the nodes always have a task to complete when scheduled. In Fig. 5 we show the normalized throughput versus the queue capacity $C$ for different ratios $M/K$ between the number $M$ of nodes and the number $K$ of nodes scheduled in each slot. We keep $K = 3$ fixed and vary $M$. We assume a uniform distribution for the initial number of tasks in the queues $Q_i(1)$ of all nodes, so that $\omega_{i,k}(1) = 1/(C+1)$ for all $i, k$. The probabilities that a new task is generated when the arm is kept passive are $p_{01}^{(0)} = 0.15$ and $p_{k,k+1}^{(0)} = 0.1$, for $k \in \{1, \dots, C-1\}$, while under activation they are $p_{01}^{(1)} = 0.05$ and $p_{k,k+1}^{(1)} = 0$.
The probabilities that a task expires when the arm is kept passive and when it is activated are $p_{k,k-1}^{(0)} = 0.05$ and $p_{k,k-1}^{(1)} = 0.95$, respectively. The remaining transition probabilities are $p_{CC}^{(0)} = 0.9$ and $p_{CC}^{(1)} = 0.05$, while $\beta = 0.95$. From Fig. 5 it can be seen that, when $C$ and/or $M/K$ are small, the MP's performance is close to the upper bound. In fact, for small $M/K$, most of the nodes are scheduled in each slot and the relaxed system of Sec. V-B approaches the original one, while for small $C$ we approach the regime in which the MP is optimal ($C = 1$). For moderate to large values of $M/K$ and/or $C$, instead, the greater flexibility of the relaxed system enables larger gains over the MP.

Figure 5. Normalized optimal throughput of the MP in (35) as compared to the upper bound, versus the queue capacity $C$, for different ratios $M/K \in \{1, 3, 10\}$ (system parameters are $K = 3$, $\beta = 0.95$, $\omega_{i,k}(1) = 1/(C+1)$ for all $i, k$, $p_{01}^{(0)} = 0.15$, $p_{01}^{(1)} = 0.05$, $p_{CC}^{(0)} = 0.9$, $p_{CC}^{(1)} = 0.05$, $p_{k,k-1}^{(0)} = 0.05$, $p_{k,k-1}^{(1)} = 0.95$, $p_{k,k+1}^{(0)} = 0.1$, $p_{k,k+1}^{(1)} = 0$, for $k \in \{1, \dots, C-1\}$).

VI. CONCLUSIONS

This paper considers a centralized scheduling problem for independent, symmetric and time-sensitive tasks under resource constraints. The problem is to assign a finite number of resources to a larger number of nodes that may have tasks to be completed in their task queues. It is assumed that the central controller has no direct access to the queue of each node. Based on a Markovian modeling of the task generation and expiration processes, the scheduling problem is formulated as a partially observable Markov decision process (POMDP) and then cast into the framework of restless multi-armed bandit (RMAB) problems. Under the assumption that the task queues are of capacity one, a greedy, or myopic, policy (MP), operating in the space of the a posteriori probabilities (beliefs) of the number of tasks in the queues, is proved to be optimal, under appropriate assumptions, for both finite- and infinite-horizon throughput criteria. The MP selects in each slot the nodes with the largest probability of having a task to be completed. It is shown that the MP is round-robin, since it schedules the nodes periodically. We have also established that the RMAB problem at hand is indexable, derived the Whittle index in closed form, and shown that the Whittle index policy is equivalent to the MP and is thus optimal. Systems in which the task queues have arbitrary capacities have been investigated as well, by comparing the performance of the MP, which is generally suboptimal, with an upper bound based on a relaxation of the scheduling constraint. Overall, this paper proposes a general framework for resource allocation that finds applications in several areas of current interest, including communication networks and distributed computing.

APPENDIX A
PROOF OF THEOREM 3

The proof is divided into two steps. In the first step we derive the throughput of the RR policy in closed form; in the second step we show that inequality (16) holds.

As for the first step, the throughput of the RR policy (and thus of the MP) can be calculated as the sum of the contributions of the nodes taken separately (due to the round-robin structure). To elaborate, let us focus on node $U_i$, with initial belief $\omega_i(1)$, and assume that $U_i \in \mathcal{G}_1$. Nodes in group $\mathcal{G}_1$ are scheduled at slots $t \in \{1 + (j-1)m\}$, for $j \in \{1, 2, \dots\}$.
Let $r_j(\omega_i(1)) = \mathbb{E}_{RR}[\omega_i(1 + (j-1)m) \mid \omega_i(1)]$ be the average reward accrued by the CC from node $U_i$ alone when scheduling it for the $j$th time, at slot $t = 1 + (j-1)m$, under the RR policy (see the RHS of (3)). At slot $t = 1$ we have $r_1(\omega_i(1)) = \omega_i(1)$. To calculate $r_2(\omega_i(1))$, we first derive the average value of the belief (see (7)) after the slot of activity at $t = 1$ as $\mathbb{E}_{RR}[\omega_i(2) \mid \omega_i(1)] = \tau_1^{(1)}(\omega_i(1))$, where $\tau_1^{(1)}(\omega) = \omega\delta_1 + p_{01}^{(1)}$ with $\delta_u = p_{11}^{(u)} - p_{01}^{(u)}$ (cf. (8)). We then account for the $(m-1)$ slots of passivity by exploiting (8), so that $r_2(\omega_i(1)) = \mathbb{E}_{RR}[\omega_i(1+m) \mid \omega_i(1)] = \phi^{(1)}(\omega_i(1))$, where we have set

$$\phi^{(1)}(\omega) = \tau_0^{(m-1)}\big(\tau_1^{(1)}(\omega)\big) = \omega\alpha_m + \psi_m, \quad \text{with } \alpha_m = \delta_1\delta_0^{m-1} \text{ and } \psi_m = p_{01}^{(1)}\delta_0^{m-1} + p_{01}^{(0)}\frac{1 - \delta_0^{m-1}}{1 - \delta_0},$$

and where $\tau_0^{(k)}(\omega) = \tau_0^{(1)}(\tau_0^{(k-1)}(\omega))$ denotes the belief of a node after $k$ slots of passivity when the initial belief is $\omega$ (i.e., $\tau_0^{(k)}(\omega)$ is obtained recursively by applying $\tau_0^{(1)}(\omega)$ to itself $k$ times). In general, we can obtain $r_j(\omega_i(1)) = \mathbb{E}_{RR}[\omega_i(1 + (j-1)m) \mid \omega_i(1)]$, for $j \ge 2$, by iterating the procedure above, i.e., by applying $\phi^{(1)}(\omega)$ to itself $(j-1)$ times. After a little algebra we get

$$\phi^{(j-1)}(\omega) = \phi^{(1)}\big(\phi^{(j-2)}(\omega)\big) = \omega\alpha_m^{j-1} + \psi_m\frac{1 - \alpha_m^{j-1}}{1 - \alpha_m},$$

so that $r_j(\omega_i(1)) = \phi^{(j-1)}(\omega_i(1))$, where we set $\phi^{(0)}(\omega) = \omega$. The reasoning above applies when starting from any arbitrary slot $t$. Finally, the total reward accrued by the CC from a node that is scheduled $H$ times, when its belief at the first slot in which it is scheduled is $\omega$, can be calculated by summing the average rewards $r_j(\cdot)$ over the slots in which the node is scheduled, as

$$\theta^{(H)}(\omega) = \sum_{j=1}^{H} \beta^{(j-1)m} r_j(\omega) = \frac{\psi_m}{1 - \alpha_m}\left(\frac{1 - \beta^{mH}}{1 - \beta^m} - \frac{1 - (\beta^m\alpha_m)^H}{1 - \beta^m\alpha_m}\right) + \frac{1 - (\beta^m\alpha_m)^H}{1 - \beta^m\alpha_m}\,\omega. \quad (38)$$

Note that, for a node $U_i \in \mathcal{G}_g$, with $g \ge 1$ and belief equal to $\omega$ at $t = 1$, the first slot in which the node is scheduled is $t = g$, and thus its belief at time $t = g$ is $\tau_0^{(g-1)}(\omega)$ (i.e., the belief after $(g-1)$ slots of passivity, while the other groups are scheduled). Therefore, for a node $U_i \in \mathcal{G}_g$ with initial belief $\omega$, the total contribution to the throughput is $\beta^{g-1}\theta^{(H)}\big(\tau_0^{(g-1)}(\omega)\big)$.

Let us now focus on the second step, i.e., proving the inequality (16). At $t = T$, the inequality is easily seen to hold due to (3) and (12). We then need to show that (16) also holds at any $t$. To do so, let us denote by $L$ and $R$ the RR policies whose throughputs are given by the LHS and RHS of (16), respectively. The only differences between $L$ and $R$ are the positions of the nodes with beliefs $x$ and $y$ in the initial belief vectors. Therefore, some of the $m$ groups created by the two policies might contain different nodes (see the RR operations in Proposition 1). To simplify the notation, we refer to the node with belief $x$ ($y$) as node $x$ ($y$). Let us assume that nodes $x$ and $y$ belong to groups $\mathcal{G}_{g'}$ and $\mathcal{G}_{g''}$ under policy $R$, respectively, while they belong to groups $\mathcal{G}_{g''}$ and $\mathcal{G}_{g'}$ under policy $L$, respectively, with $g'' \ge g'$ and $g', g'' \in \{1, \dots, m\}$. If $g'' = g'$, the two policies coincide and (16) holds with equality. If $g'' = g' + 1$ (the nodes are adjacent but do not belong to the same group), the only difference between policies $L$ and $R$ is the scheduling order of nodes $x$ and $y$. To verify that inequality (16) holds, we need to prove that scheduling node $y$ in group $\mathcal{G}_{g'}$ and node $x$ in group $\mathcal{G}_{g''}$ is no better than doing the opposite, for any $x \ge y$.
To elaborate, let $H_x^R(t) = H_y^L(t)$ and $H_y^R(t) = H_x^L(t)$ denote the number of times that node $x$ (resp. $y$) is scheduled under policy $R$ and node $y$ (resp. $x$) is scheduled under policy $L$ over the horizon $\{t, t+1, \dots, T\}$. By recalling (38) and the discount factor $\beta$, the contributions generated by nodes $x$ and $y$ under policy $R$ are $\beta^{g'-1}\theta^{(H_x^R(t))}(\tau_0^{(g'-1)}(x))$ and $\beta^{g''-1}\theta^{(H_y^R(t))}(\tau_0^{(g''-1)}(y))$, respectively; similarly, under policy $L$ we have $\beta^{g''-1}\theta^{(H_x^L(t))}(\tau_0^{(g''-1)}(x))$ and $\beta^{g'-1}\theta^{(H_y^L(t))}(\tau_0^{(g'-1)}(y))$. Note that, in the arguments of $\theta^{(\cdot)}(\cdot)$, we have accounted for the fact that the nodes in group $\mathcal{G}_{g'}$ are scheduled for the first time at slot $g'$, so that the belief must be updated through $\tau_0^{(g'-1)}(\cdot)$; similarly, for the nodes in $\mathcal{G}_{g''}$ the first scheduling slot is $g''$. Moreover, the discount factor $\beta^{g'-1}$ is common to all the nodes in group $\mathcal{G}_{g'}$, and so is $\beta^{g''-1}$ for group $\mathcal{G}_{g''}$. Since all the nodes except $x$ and $y$ are scheduled at the same slots under the two policies $R$ and $L$ (thus giving the same contribution to the throughput), the inequality (16) reduces to

$$\beta^{g'-1}\theta^{(H_x^R(t))}\big(\tau_0^{(g'-1)}(x)\big) + \beta^{g''-1}\theta^{(H_y^R(t))}\big(\tau_0^{(g''-1)}(y)\big) - \beta^{g''-1}\theta^{(H_x^L(t))}\big(\tau_0^{(g''-1)}(x)\big) - \beta^{g'-1}\theta^{(H_y^L(t))}\big(\tau_0^{(g'-1)}(y)\big) \ge 0,$$

which must hold for all admissible $H_x^R(t) = H_y^L(t)$ and $H_y^R(t) = H_x^L(t)$, and all $g'' \ge g'$ with $g', g'' \in \{1, \dots, m\}$. There are two cases: 1) $H_x^R(t) = H_y^L(t) = H_y^R(t) = H_x^L(t) = H \ge 1$, that is, nodes $x$ and $y$ are scheduled the same number of times within the horizon of interest under the two policies $R$ and $L$; 2) $H_x^R(t) = H_y^L(t) = H$ and $H_y^R(t) = H_x^L(t) = H - 1$, for $H \ge 1$, namely, node $x$ (resp. $y$) is scheduled one more time than node $y$ (resp. $x$) under policy $R$ (resp. $L$). By exploiting the RHS of (38), after a little algebra one can verify that the inequality above holds in both cases, which concludes the proof of Theorem 3.

APPENDIX B
PROOF OF LEMMA 4

Proof of case a). From (23)-(24), and recalling that $\tau_0^{(1)}(0) = p_{01}$ from (29), the leftmost inequality in (27a.1) follows immediately, as it becomes $V_m(0|1) = \beta V_m^*(p_{01}) \le m + \beta V_m^*(p_{01}) = V_m(0|0)$. For the rightmost inequality in (27a.1), we have $V_m(1|1) = 1 + \beta V_m^*(0)$, while from (21) and the fact that $V_m(0|1) \le V_m(0|0)$ we have $V_m^*(0) = \max\{V_m(0|0), V_m(0|1)\} = V_m(0|0)$. Therefore, $V_m(1|1) = 1 + \beta V_m^*(0) = 1 + \beta V_m(0|0) \ge V_m(0|0)$, where the last inequality is equivalent to $V_m(0|0) \le \frac{1}{1-\beta}$. The latter bound always holds since, for $m < 1$, the infinite-horizon throughput is upper bounded as $V_m^*(\omega) \le \sum_{t=0}^{\infty} \beta^t = \frac{1}{1-\beta}$, given that a reward of at most $R_m(\omega, u) \le 1$ can be accrued in each slot. Hence, the inequalities (27a.1) are proved.

Inequality (27a.2) can be proved by contradiction. Specifically, let us assume that: hp.1) $V_m(1|0) \ge V_m(1|1)$. From (21) we would have $V_m^*(1) = \max\{V_m(1|0), V_m(1|1)\} = V_m(1|0)$, i.e., the passive action would be optimal when $\omega = 1$. Moreover, from (23), and since $\tau_0^{(1)}(1) = 1$ by (29), we would have $V_m(1|0) = m + \beta V_m^*(1) = m + \beta V_m(1|0)$, which can be solved with respect to $V_m(1|0)$ to get $V_m(1|0) = \frac{m}{1-\beta} = V_m^*(1)$.
Therefore, if hypothesis hp.1) holds, we also have $V_m(1|1) = 1 + \beta V_m^*(0) \le V_m(1|0) = V_m^*(1) = \frac{m}{1-\beta}$. However, the value function $V_m^*(\omega)$ is bounded as $\frac{m}{1-\beta} \le V_m^*(\omega) \le \frac{1}{1-\beta}$, where the lower bound is obtained by considering a policy that always chooses the passive action, for any belief $\omega$. The boundedness of the value function thus implies that, if hp.1) holds, then $1 + \beta\frac{m}{1-\beta} \le 1 + \beta V_m^*(0) = V_m(1|1) \le V_m(1|0) = \frac{m}{1-\beta}$, which yields $1 + \beta\frac{m}{1-\beta} \le \frac{m}{1-\beta}$ and thus $(1-\beta)(1-m) \le 0$. But this is clearly impossible, since $m, \beta < 1$. Consequently, we have proved that $V_m(1|1) \ge V_m(1|0)$.

Proof of case b). The inequality $V_m(0|0) \le V_m(0|1)$ follows immediately, since $m + \beta V_m^*(p_{01}) \le \beta V_m^*(p_{01})$ holds for $m < 0$. For the second inequality, $V_m(0|1) \le V_m(1|1)$, note that in this case $V_m^*(0) = \max\{V_m(0|0), V_m(0|1)\} = V_m(0|1)$, so the inequality becomes $V_m(0|1) \le V_m(1|1) = 1 + \beta V_m^*(0) = 1 + \beta V_m(0|1)$, which leads to $V_m(0|1) \le \frac{1}{1-\beta}$; this always holds, as discussed above. The inequality $V_m(1|0) \le V_m(1|1)$ holds since the active action is always optimal when $m < 0$.

Proof of case c). The inequalities hold since the passive action is always optimal for any $m \ge 1$.

APPENDIX C
PROOF OF THEOREM 8

Following the discussion in Sec. IV-A2, to prove indexability it is sufficient to show that the threshold $\omega^*(m)$ is monotonically increasing in the subsidy $m$ for $0 \le m < 1$. In fact, from Proposition 5, the passive set (25) is $\mathcal{P}(m) = \emptyset$ for $m < 0$, while for $m \ge 1$ we have $\mathcal{P}(m) = [0, 1]$. We then only need to prove the monotonicity of $\omega^*(m)$ for $0 \le m < 1$, which has been shown in [9, Lemma 9] to hold if

$$\left.\frac{dV_m(\omega|1)}{dm}\right|_{\omega = \omega^*(m)} < \left.\frac{dV_m(\omega|0)}{dm}\right|_{\omega = \omega^*(m)}. \quad (39)$$

To check whether (39) holds, we first rewrite (23)-(24) at the optimal threshold $\omega = \omega^*(m)$ as

$$V_m(\omega^*(m)|1) = \omega^*(m) + \beta\omega^*(m)V_m^*(0) + \beta(1 - \omega^*(m))V_m^*(p_{01}), \quad (40)$$

$$V_m(\omega^*(m)|0) = m + \beta\left[\tau_0^{(1)}(\omega^*(m))\big(1 + \beta V_m^*(0)\big) + \beta\big(1 - \tau_0^{(1)}(\omega^*(m))\big)V_m^*(p_{01})\right], \quad (41)$$

where (41) follows from (24) and from the fact that $\tau_0^{(1)}(\omega) \ge \omega$ for any $\omega$ (see (29)), and hence $V_m^*(\tau_0^{(1)}(\omega^*(m))) = V_m(\tau_0^{(1)}(\omega^*(m))|1)$, since arm activation is optimal for any $\omega > \omega^*(m)$. Letting $D_m(\omega) = \frac{dV_m^*(\omega)}{dm}$, from (40) we have

$$\left.\frac{dV_m(\omega|1)}{dm}\right|_{\omega = \omega^*(m)} = \beta\omega^*(m)D_m(0) + \beta(1 - \omega^*(m))D_m(p_{01}),$$

while from (41) we get

$$\left.\frac{dV_m(\omega|0)}{dm}\right|_{\omega = \omega^*(m)} = 1 + \beta^2\tau_0^{(1)}(\omega^*)D_m(0) + \beta^2\big(1 - \tau_0^{(1)}(\omega^*)\big)D_m(p_{01}).$$

Finally, after some algebraic manipulations, and recalling that $D_m(0) = \frac{dV_m^*(0)}{dm} = \frac{d(m + \beta V_m^*(p_{01}))}{dm} = 1 + \beta D_m(p_{01})$ (applying the definition of $D_m$ recursively), we can rewrite (39) as

$$D_m(p_{01})\,\beta(1 - \beta)\left[1 - \omega\big(1 - \beta(1 - p_{01})\big)\right] + \beta\left[\omega\big(1 - \beta(1 - p_{01})\big) - \beta p_{01}\right] < 1.$$

To show that this inequality holds when $0 \le m < 1$, we first upper bound the derivative of the value function as $D_m(\omega) \le \frac{1}{1-\beta}$, since $\frac{d}{dm}R_m(\omega, u) \le 1$. Using the upper bound $D_m(p_{01}) \le \frac{1}{1-\beta}$, after a little algebra (39) reduces to $\beta(1 - \beta p_{01}) < 1$, which clearly holds for any $\beta \in [0, 1)$, since $0 \le p_{01} \le 1$. This concludes the proof of Theorem 8.

REFERENCES

[1] D. Bertsekas and R. G. Gallager, Data Networks. Englewood Cliffs, NJ: Prentice Hall, 1992.
[2] A. Benoit, L. Marchal, J.-F. Pineau, Y. Robert, and F. Vivien, "Scheduling concurrent bag-of-tasks applications on heterogeneous platforms," IEEE Trans. Computers, vol. 59, no. 2, pp. 202-217, Feb. 2010.
[3] Y. Bai, C. Xu, and Z. Li, "Task-aware based co-scheduling for virtual machine system," in Proc. ACM Symp. on Applied Computing, Sierre, Switzerland, pp. 181-188, Mar. 2010.
[4] G. E. Monahan, "A survey of partially observable Markov decision processes: Theory, models, and algorithms," Manag. Sci., vol. 28, no. 1, pp. 1-16, 1982.
[5] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed Bandit Allocation Indices. West Sussex, UK: Wiley, 2011.
[6] P. Whittle, "Restless bandits: Activity allocation in a changing world," J. Appl. Probab., vol. 25, pp. 287-298, 1988.
[7] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multi-channel opportunistic access," IEEE Trans. Inf. Theory, vol. 55, no. 9, pp. 4040-4050, Sept. 2009.
[8] S. H. A. Ahmad and M. Liu, "Multi-channel opportunistic access: A case of restless bandits with multiple plays," in Proc. 47th Ann. Allerton Conf. Commun., Contr., Comput., Monticello, IL, pp. 1361-1368, Sept. 2009.
[9] K. Liu and Q. Zhao, "Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access," IEEE Trans. Inf. Theory, vol. 56, no. 11, pp. 5547-5567, Nov. 2010.
[10] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artif. Intell., vol. 101, pp. 99-134, May 1998.
[11] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ: Wiley, 2005.
[12] D. Bertsimas and J. E. Niño-Mora, "Restless bandits, linear programming relaxations, and a primal-dual heuristic," Oper. Res., vol. 48, no. 1, pp. 80-90, Jan. 2000.