Sequential Design of Experiments via Linear Programming
Authors: Sudipto Guha, Kamesh Munagala
Abstract

The celebrated multi-armed bandit problem in decision theory models the central trade-off between exploration, or learning about the state of a system, and exploitation, or utilizing the system. In this paper we study the variant of the multi-armed bandit problem where the exploration phase involves costly experiments and occurs before the exploitation phase, and where each play of an arm during the exploration phase updates a prior belief about the arm. The problem of finding an inexpensive exploration strategy to optimize a certain exploitation objective is NP-Hard even when a single play reveals all information about an arm and all exploration steps cost the same. We provide the first polynomial-time constant-factor approximation algorithm for this class of problems. We show that this framework also generalizes several problems of interest studied in the context of data acquisition in sensor networks. Our analyses also extend to switching and setup costs, and to concave utility objectives. Our solution approach is via a novel linear program rounding technique based on stochastic packing. In addition to yielding exploration policies whose performance is within a small constant factor of the adaptive optimal policy, a nice feature of this approach is that the resulting policies explore the arms sequentially without revisiting any arm. Sequentiality is a well-studied paradigm in decision theory, and is very desirable in domains where multiple explorations can be conducted in parallel, for instance, in the sensor network context.

∗ This combines work from two papers [30, 29] appearing in the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007, and the 39th Annual ACM Symposium on Theory of Computing (STOC), 2007, respectively.
† Department of Computer and Information Sciences, University of Pennsylvania. Email: sudipto@cis.upenn.edu. Research supported in part by an Alfred P. Sloan Research Fellowship and by NSF Award CCF-0430376.
‡ Department of Computer Science, Duke University. Email: kamesh@cs.duke.edu. Research supported in part by NSF CNS-0540347.

1 Introduction

The sequential design of experiments is a classic problem first formulated by Wald in 1947 [49]. The study of this problem gave rise to the general field of decision theory; more specifically, it led Robbins [41] to formulate the celebrated multi-armed bandit problem, and Snell [46] and Robbins [41] to invent the theory of optimal stopping. The copious literature in this field is surveyed by Whittle [51, 52]. The canonical problem of sequential design of experiments is best described in the language of the multi-armed bandit problem: There are n competing options, referred to as "arms" (for instance, clinical treatments), yielding unknown rewards (or having unknown effectiveness) $\{p_i\}$. Playing an arm (or testing a treatment on a patient) yields observations that reveal information about the underlying reward or effectiveness. The goal is to sequentially test the treatments (or sequentially play the arms) in order to ultimately choose the "best" one. Such problems are usually studied in a decision-theoretic setting, where costs and utilities are associated with actions (testing
a treatment) and outcomes (choosing one treatment finally). The goal of any decision procedure is to come up with a plan for testing the treatments (or playing the arms) and choosing an outcome in order to optimize some criterion based on the costs and utilities. The testing procedure is termed exploration, and choosing the outcome is termed exploitation. The crux of the multi-armed bandit problem, and the reason it has been studied so extensively, is that it cleanly models the general trade-off between the cost of exploration (or learning more about the state of the system) and the utility gained from exploitation (or utilizing the system). Various frameworks in decision theory differ in (i) the available information and (ii) the optimization criteria for evaluating a decision plan. We now describe the problem we study from the perspective of these design choices.

From the perspective of available information, we focus exclusively on the Bayesian setting, first formulated by Arrow, Blackwell and Girshick in 1949 [2]. In this setting, each arm (or treatment) is associated with prior information (specified by distributions) that updates via Bayes' rule conditioned on the results of the plays (or tests). More formally, we are given a bandit with n independent arms. The set of possible states of arm i is denoted by $S_i$, and the initial state is $\rho_i \in S_i$. When arm i is played in a state $u \in S_i$, the arm transitions to state $v \in S_i$ with probability $p_{uv}$, depending on the observed outcome of the play. The initial state models the prior knowledge about the arm. The states in general capture the posterior conditioned on the observations from a sequence of plays (or experiments) starting at the root. The cost of a play depends on whether the previous play was for the same arm or not: if the previous play was for the same arm, the play at $u \in S_i$ costs $c_u$; otherwise it costs $c_u + h_i$, where $h_i$ is the setup cost for switching into arm i.¹ Recall that the arms correspond to different treatments or experiments; this cost therefore models setting up the corresponding experiment. Every state $u \in S_i$ is associated with a reward $r_u$, which is the expected reward of playing in this state (conditioned, of course, on the observations from the plays so far). By Bayes' rule, the rewards of the different states evolve according to a martingale property: $r_u = \sum_{v \in S_i} p_{uv} r_v$. We present concrete examples of state spaces in Section 2.

From the optimization perspective, our objective is to maximize future utilization. Any policy explores (or tests) the arms for a certain amount of time and subsequently exploits (or chooses) an arm that yields the best expected posterior (or future) reward. For this objective to be meaningful, we need to constrain the total cost we can incur in exploration before making the exploit decision. A natural example of this is product marketing research, where the entire exploration phase appears before the exploitation phase. Formally, a policy $\pi$ performs a possibly adaptive sequence of plays during exploration. Since the state evolutions are stochastic, the exploration phase leads to a probability distribution over outcomes, $O(\pi)$. In outcome $o \in O(\pi)$, each arm i is in some final state $u^o_i$.
In this outcome o the policy chooses the "best arm" $\max_i r_{u^o_i}$ (or a suitable concave function of the vector $\langle \cdots, r_{u^o_i}, \cdots \rangle$). The expected reward of the policy $\pi$ over the outcomes of exploration is $R(\pi) = \sum_{o \in O(\pi)} q(o, \pi) \max_i r_{u^o_i}$. Let $C(o, \pi)$ denote the cost of the exploration plays made by the policy given an outcome o. In the simplest version, we seek the policy $\pi$ that maximizes $R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O$. As remarked in [2], this problem is solvable by dynamic programming [11, 13]. However, this approach requires computation time polynomial in the joint state space (truncated by the budget constraint) of the multiple arms, which is the product of the individual (truncated) state spaces. Unsurprisingly, the problem becomes NP-Hard even when a single play reveals the full information about an arm and all plays (across different arms) cost the same [27]. Designing a policy that is computationally tractable, at the cost of a bounded loss in performance, is the main goal of this paper. We study the problem from the perspective of approximation algorithms, where we seek a provably near-optimal solution, with the restriction that the algorithm must run in time polynomial in the sum of the sizes of the state spaces. More precisely, we seek an algorithm that guarantees utilization at least $OPT/\alpha$, where $OPT = \max_\pi R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O$; such an algorithm is called an $\alpha$-approximation. Note that we seek a multiplicative approximation because such a result is invariant under scaling of the rewards (see also the discussion on discounted rewards below). Since it is NP-Hard to determine OPT, we use a linear program to determine an upper bound $\gamma^* \ge OPT$ and provide an algorithm that achieves $\gamma^*/\alpha$ in the worst case. The added benefit of such an approach is that we have a concrete upper bound $\gamma^*$ for comparison: an algorithm that guarantees $\gamma^*/\alpha$ in the worst case may have significantly better (and quantifiable, due to the existence of the upper bound) performance in practice. The interested reader may consult [48] for a review of approximation algorithms.

¹ Our algorithms also extend to concave play costs, where the cost of r consecutive plays of an arm is a concave function of r, as well as to switching-out costs; we omit that discussion here.

The necessity of studying this problem is further hastened by the emergence of several applications where the number of arms is large, typically data-intensive applications. Examples of this problem arise in "active learning" [38, 42], where the goal is to learn and choose the most discerning hypothesis by sequentially testing the hypotheses on a set of assisted examples; sensor networks [35], where the goal is sensor placement to maximize a utility function such as information gain, based on sequentially collecting a small number of samples; and databases [7], where the goal is to settle upon a possibly long-running query execution plan, again based on a few carefully chosen samples.

1.1 Related Models

The future utilization objective is well known in the literature (refer, for instance, to Berry and Fristedt [12], Chapter 3.6). The unit-cost version of this problem is a special case of the infinite horizon discounted multi-armed bandit problem.
In the discounted bandit problem, there is an infinite discount sequence $\{\alpha_t \in [0,1] \mid t = 1, 2, \ldots\}$. Any policy $\pi$ plays an arm at each time step; suppose the expected reward from playing at time t is $R_t(\pi)$. The goal is to design an adaptive policy $\pi$ to maximize $\sum_{t \ge 1} \alpha_t R_t(\pi)$. The future utilization objective with an exploration budget C corresponds to $\alpha_1 = \alpha_2 = \cdots = \alpha_C = \alpha_{C+2} = \alpha_{C+3} = \cdots = 0$ and $\alpha_{C+1} = 1$. With this setting, the objective is the reward of the arm chosen at the $(C+1)$st play (exploitation), and the only plays of significance for making this choice are the first C plays (exploration). As observed in [12], this problem seems significantly harder computationally than the case where the discount sequence is monotonically decreasing with time. In fact, when the discount sequence is geometric, i.e., $\alpha_t = \beta^t$ for some $\beta < 1$, the celebrated result of Gittins and Jones shows that there exists an elegant greedy optimal solution termed the Gittins index policy [26]; an index policy ranks the arms based solely on their own characteristics and plays the best arm at every step. The Gittins index is suboptimal both in the finite horizon setting, where $\alpha_t = 1$ for $t \le C$ and 0 otherwise, and in the future utilization setting we consider here [38]. Finally, Banks and Sundaram [10] show that no index exists in the presence of switching in/out costs.

Alternatives to the Bayesian formulation are as old as the original studies of Wald [49] and Robbins [41]. These versions do not assume prior information, but instead perform a min-max optimization over possible underlying rewards via a suitably constructed loss or regret measure. As observed in [12, 50], although min-max objectives are more robust, the Bayesian approach is more widely used since it typically requires fewer samples. Furthermore, the regret criterion naturally forces the optimization to consider the past: what is the minimum loss in the past N trials due to not knowing the true rewards. Note that minimizing regret is not the same as maximizing future utilization, the former being more akin to the finite horizon version with discount sequence $\alpha_t = 1$ for $t \le C$ and 0 otherwise. Intuitively, in the former we attempt to minimize the error during the testing process, while in the latter we do not care about errors in testing, but attempt to ensure that at the end we are truly picking the (near) best option for exploitation. Nevertheless, it is natural to ask whether the algorithms suggested in the context of min-max analysis, particularly the seminal works of Lai and Robbins [36] and Auer, Cesa-Bianchi and Fischer [4] (extended to uniform switching costs in [47, 3]), have good performance guarantees under the future utilization measure. However, these are "model free" algorithms, and it is easy to show that for an appropriately chosen budget C, these algorithms have significantly inferior performance on the future utilization objective compared to algorithms that use the prior information. This is not surprising, because the objectives are different. Similar comments apply to the "experts" problem [18] and subsequent research on adversarial multi-armed bandits [5, 25], where the reward distribution is chosen by an adversary and need not be stochastic.
It is worth pointing out that in the loss-function or min-max approach, the loss or regret arises due to lack of information about the rewards. The difficulty in optimizing future utilization in the Bayesian setting arises from the computational aspect. This is quite similar to the difference between the classes of online and approximation algorithms.

1.2 Structure of the Policies

For the future utilization measure, the general structure of the policies is important. Two classes of policies are noteworthy. The first class is motivated by the stopping time problem, an early example of which is the secretary problem [20]. A policy in this class fixes an ordering of the arms in advance, and samples the arms sequentially, i.e., it does not return to a previously rejected arm. One benefit of such a strategy is that these policies are often succinct to represent and easy to implement in real hardware from the perspective of control. Another benefit, as the reader will have observed, is that it is easy to model switching/setup costs in such policies; these costs can in fact be generalized so that r consecutive plays have a cost that is a concave function of r. We call such policies sequential, because the ordering of the arms is fixed beforehand. Such strategies have been considered in testing between two hypotheses [49], stochastic scheduling [39, 45], stochastic packing [23, 24], and operator placement in databases [8, 9]; however, all except the hypothesis-testing results hold only for two-level state spaces (or arms with point priors), where a single play reveals complete information about the underlying reward of the arm. (Refer to Section 2 for a formal definition.)

The second and more restrictive class of policies performs all the tests (or plays) before observing any of their outcomes. The policy therefore has three disjoint successive phases: test, observe, and select. Such non-adaptive policies are of interest when the observations can be made in parallel, so that the final choice can be made more quickly. Naturally, these strategies are meaningful for two-level state spaces, and have thus been found to be of interest in the context of sensor networks [35], multihoming networks [1], stochastic optimization [27, 30], and database optimization [7].

For both of the above classes, the goal is to show that the performance of an algorithm restricted to the respective class is not significantly worse than that of an adversary whose strategy is fully adaptive. This is known as the adaptivity gap of a strategy. All previous analyses of adaptivity gaps were restricted to two-level state spaces. This paper provides a uniform framework that extends to both of the classes above and applies to multi-level state spaces. It is interesting to note that one of the original goals of Wald [49] in sequential analysis was to explore sequential strategies. Though such strategies are optimal for choosing between two hypotheses, the difficulty of obtaining optimal strategies for testing multiple competing hypotheses has been known since that time.
The major contribution of this work is to show that in a variety of bandit settings, when we are seeking to optimize any concave function of the posterior probabilities, the adaptivity gap incurred by considering sequential strategies is bounded by a constant. In other words, the performance of a fully adaptive solution cannot be significantly better than that of a sequential strategy.

1.3 Problems and Results

We consider three main types of problems in this paper. Recall that there are n independent arms, each with its own state space $S_i$; a policy $\pi$ adaptively explores the arms, paying expected cost $C(\pi)$, before selecting an arm for exploitation based on the observed outcomes. The expected reward of the selected arm over the outcomes of the policy $\pi$ is denoted $R(\pi)$.

• Budgeted (Futuristic) Bandits: There is a cost budget C. A policy $\pi$ is feasible if, for any sequence of plays made by the policy, the cost is at most C. The goal is to find the feasible policy $\pi$ with maximum $R(\pi)$. We have already discussed switching costs. An extension of switching cost is concave play cost, where the cost of sequential uninterrupted plays of an arm is concave in the number of plays. This was first hinted at in [2], though the authors explicitly settled on linear costs. A generalization of the above problem is the budgeted concave utility bandits problem, where the objective function is an arbitrary concave function of the final rewards of the arms. Examples of such functions include choosing the best K arms, power allocation across noisy channels [21], and optimizing "TCP friendly" network utility functions [37].

• Model-Driven Optimization: This is a non-adaptive formulation of the above, where the state space $S_i$ is 2-level and a single play reveals full information about an arm. In such a context, non-adaptive strategies are desirable since the plays can be executed in parallel. A feasible non-adaptive policy $\pi$ chooses a subset of the arms to explore, before seeing the result of any of the plays. There has been a significant number of papers on this topic in recent years, especially in the context of sensor networks. Our paper unifies this thread with the bandit framework.

• Lagrangean (Futuristic) Bandits: Find the policy $\pi$ with maximum $R(\pi) - C(\pi)$. Note that the Lagrangean can be defined in both the adaptive and non-adaptive settings. This is a natural extension of the single-arm optimal stopping time problem.

In this paper, we present a single framework that provides efficient algorithms yielding policies with near-optimal performance for all of the above problems. For the budgeted (futuristic) bandits in the concave cost setting (including switching in/out costs), we show that there exists a sequential strategy that respects the budget, and whose objective value is at most a factor 4 away from that of the optimal fully-adaptive strategy subjected to the same budget. Section 2 discusses different state spaces. Section 3 presents the approximate sequential strategy that respects the budget, for linear utilities (objective functions). We also present a bicriteria $2(1 + \alpha)$-approximation with the cost constraint relaxed by a factor of $1/\alpha$.
In Section 4, we show how the same framework gives a more restricted non-adaptive strategy for 2-level state spaces that is within a constant factor of the best adaptive strategy. In contrast, for multi-level state spaces, any non-adaptive strategy suffers a significant performance loss. We also present a sequential strategy that is a 2-approximation for the Lagrangean bandits in Section 5. In Section 6, we extend the results of Section 3 to concave utilities with a factor 2 loss in the approximation factor. Note that constant-factor approximations are the best possible, both from the standpoint of the adaptivity gap of sequential policies and from that of the integrality gap of the linear programming relaxations we use.

Techniques: We use a linear programming formulation over the state spaces of the individual arms, and thereby achieve a formulation whose size is polynomial in the size of each individual state space. This particular formulation has been used in the past [53, 40] and found to be useful in practice. To the best of our knowledge, we present the first analysis of these relaxations in the finite horizon context. We also bring to bear techniques from the stochastic packing literature, particularly the work on adaptivity gaps by Dean, Goemans and Vondrák [23, 24, 22]. Their results can be viewed as sequential strategies for 2-level state spaces, and are similar to the online nature of the policies considered in stochastic scheduling [39, 45], where there is a strong notion of "irrevocable commitment". While the online notion is related to sequential strategies, they are not the same. In terms of analysis, our results can be thought of as extending the analysis both to arbitrary state spaces and to non-adaptive strategies for the 2-level case. Our overall technique can be thought of as "LP rounding via stochastic packing" – finding this connection between finite horizon multi-armed bandits and stochastic packing, by designing simple LP rounding policies for a very general class of budgeted bandit problems, represents the key contribution of this work.

Related Work: Several heuristics have been proposed for the budgeted (futuristic) bandit problem by Schneider and Moore [42] and Madani et al. [38]. The final algorithm that arises from our framework bears resemblance (but is not identical) to the algorithms proposed therein, but as far as we are aware, there was no prior analysis of any algorithm in this context. A series of papers [27, 35, 30] considered 2-level state spaces (where a single play resolves all information about an arm) for specific problems and presented approximations. The Lagrangean (futuristic) bandit problem with 2-level state spaces was considered before in [31], where a 1.25-approximation is presented. None of those techniques applies to the iterative refinement that is required for multi-level state spaces. Note that most of the other literature on stochastic packing does not consider refinement of information [33, 28]. Our LP relaxation is well studied in the context of multi-armed bandit problems [15, 53, 16] and other loosely coupled systems such as multi-class queueing systems [14, 17]; we present the first provable analysis of this formulation.
Though LP formulations over the state space of outcomes exist for other stochastic optimization problems, such as multi-stage optimization with recourse [34, 43, 19], those formulations are based on sampling scenarios. These problems also do not have a notion of refinement, and are fundamentally different from our setting, where the scenarios would be refinement trajectories [32] that are hard to sample.

2 Types of State Spaces

Recall that each arm is associated with a state that evolves when the arm is played. The state captures the distributional knowledge about the reward distribution of the arm. Formally, the set of possible states of arm i is denoted by $S_i$, and the initial state is $\rho_i \in S_i$. When arm i is played in a state $u \in S_i$, the arm transitions to state $v \in S_i$ with probability $p_{uv}$, depending on the observed outcome of the play. The initial state models the prior knowledge about the arm. The states in general capture the posterior conditioned on the observations from a sequence of plays (or experiments) starting at the root. Every state $u \in S_i$ is associated with a reward $r_u$, which is the expected reward of playing in this state (conditioned, of course, on the observations from the plays so far). By Bayes' rule, the rewards of the different states evolve according to a martingale property: $r_u = \sum_{v \in S_i} p_{uv} r_v$.

We now present two representative scenarios in order to better motivate the abstract problem formulation. In the first scenario, the underlying reward distribution is deterministic, and the distributional knowledge is specified as a distribution over the possible deterministic values; this implies that the uncertainty about an arm is completely resolved in one play by observing the reward. In the second scenario, the uncertainty resolves gradually over time.

Two-level State Space. A two-level state space models the case where the underlying reward of the arm is deterministic, so that the prior knowledge is a distribution over these values. In this setting, a single play resolves this distribution into a deterministic posterior. Formally, the prior distributional knowledge $X_i$ is a discrete distribution over values $\{a_{i1}, a_{i2}, \ldots, a_{im}\}$, so that $\Pr[X_i = a_{ij}] = p_{ij}$ for $j = 1, 2, \ldots, m$. The state space $S_i$ of the arm is as follows: the root node $\rho_i$ has $r_{\rho_i} = E[X_i] = \mu_i$. For $j = 1, 2, \ldots, m$, state $ij$ has $r_{ij} = a_{ij}$ and $p_{\rho_i, ij} = p_{ij}$. Since the underlying reward distribution is simply a deterministic value, the state space is 2-level, defining a star graph with $\rho_i$ being the root and $i1, i2, \ldots, im$ being the leaves.

To motivate budgeted bandits in such state spaces, consider a sensor network where the root server monitors the maximum value [6, 44]. The probability distributions of the values at the various nodes are known to the server via past observations. However, at the current step, probing all nodes to find out their actual values is undesirable, since it requires transmissions from all nodes, consuming their battery life. Consider the simple setting where the network connecting the nodes to the server is a one-level tree, and probing a node consumes battery power of that node. Given a bound on the total battery life consumed, the goal of the root server is to maximize (in expectation) its estimate of the maximum value. Formally, each node corresponds to a distribution $X_i$ with mean $\mu_i$; the exact value sensed at the node can be found by paying a "transmission cost" $c_i$. The goal of the server is to adaptively probe a subset S of nodes with total transmission cost at most C in order to maximize the estimate of the largest value sensed, i.e., maximize $E[\max(\max_{i \in S} X_i, \max_{i \notin S} \mu_i)]$, where the expectation is over the adaptive choice of S and the outcomes of the probes. The term $\max_{i \notin S} \mu_i$ incorporates the means of the unprobed nodes into the estimate of the maximum value; the sketch below illustrates this objective. In this context, it is desirable for the server to probe the nodes in parallel, i.e., use a non-adaptive strategy. The question then becomes how good such a strategy is compared to the optimal adaptive strategy. We show positive results for the context of 2-level state spaces in Section 4.
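To make the objective concrete, the following is a minimal Monte Carlo sketch (ours, not from the paper) that estimates $E[\max(\max_{i \in S} X_i, \max_{i \notin S} \mu_i)]$ for a fixed probe set S; the distributions, probe set, and function names are made up for illustration.

```python
import random

# Hypothetical discrete value distributions X_i: lists of (value, probability) pairs.
dists = [
    [(0.0, 0.5), (1.0, 0.5)],   # node 0
    [(0.2, 0.9), (3.0, 0.1)],   # node 1
    [(0.5, 1.0)],               # node 2 (deterministic)
]
means = [sum(v * p for v, p in d) for d in dists]

def sample(d):
    """Draw one value from a discrete distribution given as (value, prob) pairs."""
    r, acc = random.random(), 0.0
    for v, p in d:
        acc += p
        if r <= acc:
            return v
    return d[-1][0]

def estimate_objective(probe_set, trials=50_000):
    """Monte Carlo estimate of E[max(max_{i in S} X_i, max_{i not in S} mu_i)]."""
    unprobed_best = max((means[i] for i in range(len(dists)) if i not in probe_set),
                        default=float("-inf"))
    total = 0.0
    for _ in range(trials):
        probed_best = max((sample(dists[i]) for i in probe_set), default=float("-inf"))
        total += max(probed_best, unprobed_best)
    return total / trials

print(estimate_objective({0, 1}))
```

Sweeping over feasible probe sets (those with total cost at most C) and comparing the estimates gives the value of the best non-adaptive strategy in this toy instance.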
Multi-level State Spaces. These are the most general state spaces we consider, and they make sense in contexts such as clinical trials, where the underlying effectiveness of a treatment is a random variable following a parametrized distribution with unknown parameters. The prior distribution will then be a distribution over possible parameter values. In the clinical trial setting, each experimental drug is a bandit arm, and the goal is to devise a clinical trial phase to maximize the belief about the effectiveness of the drug finally chosen for marketing. Each drug has an effectiveness that is unknown a priori. The effectiveness can be modeled as a coin whose bias, $\theta$, is unknown a priori – the outcomes of tossing the coin (running a trial) are 0 and 1, which correspond to a trial being ineffective and effective, respectively. The uncertainty in the bias is specified by a prior distribution (or belief) on the possible values it can take. Since the underlying distribution is Bernoulli, its conjugate prior is the Beta distribution. A Beta distribution with parameters $\alpha_1, \alpha_2 \in \{1, 2, \ldots\}$, which we denote $B(\alpha_1, \alpha_2)$, has a p.d.f. of the form $c\theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_2 - 1}$, where c is a normalizing constant. $B(1, 1)$ is the uniform distribution, which corresponds to having no a priori information. The distribution $B(\alpha_1, \alpha_2)$ corresponds to the current (posterior) distribution over the possible values of the bias $\theta$ after having observed $(\alpha_1 - 1)$ 1's and $(\alpha_2 - 1)$ 0's. Given this distribution as our belief, the expected value of the bias or effectiveness is $\frac{\alpha_1}{\alpha_1 + \alpha_2}$.

The state space $S_i$ is a DAG whose root $\rho_i$ encodes the initial belief about the bias, $B(\alpha_1, \alpha_2)$, so that $r_{\rho_i} = \frac{\alpha_1}{\alpha_1 + \alpha_2}$. When the arm is played in this state, the state evolves depending on the outcome observed: if the outcome is 1, which happens with probability $\frac{\alpha_1}{\alpha_1 + \alpha_2}$, the child u has belief $B(\alpha_1 + 1, \alpha_2)$, so that $r_u = \frac{\alpha_1 + 1}{\alpha_1 + \alpha_2 + 1}$ and $p_{\rho u} = \frac{\alpha_1}{\alpha_1 + \alpha_2}$; if the outcome is 0, the child v has belief $B(\alpha_1, \alpha_2 + 1)$, with $r_v = \frac{\alpha_1}{\alpha_1 + \alpha_2 + 1}$ and $p_{\rho v} = \frac{\alpha_2}{\alpha_1 + \alpha_2}$. In general, if the DAG $S_i$ has depth C (corresponding to playing the arm at most C times), it has $O(C^2)$ states; the sketch below makes this construction concrete.
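The following sketch (ours, with illustrative names) constructs the Beta-prior DAG for a single arm and verifies the martingale property $r_u = \sum_{v} p_{uv} r_v$ exactly, using rational arithmetic.

```python
from fractions import Fraction

def beta_arm_dag(alpha1=1, alpha2=1, depth=3):
    """Build the Beta-prior state DAG of one arm played at most `depth` times.

    A state is (a1, a2): the posterior Beta(a1, a2) after observing
    (a1 - alpha1) ones and (a2 - alpha2) zeros.  Returns the reward r_u of
    each state and the transition probabilities p_uv.
    """
    reward, trans = {}, {}
    states = [(alpha1 + k, alpha2 + d - k)
              for d in range(depth + 1) for k in range(d + 1)]
    for (a1, a2) in states:
        reward[(a1, a2)] = Fraction(a1, a1 + a2)       # expected bias under Beta(a1, a2)
        if a1 + a2 - alpha1 - alpha2 < depth:          # state can still be played
            trans[((a1, a2), (a1 + 1, a2))] = Fraction(a1, a1 + a2)  # outcome 1
            trans[((a1, a2), (a1, a2 + 1))] = Fraction(a2, a1 + a2)  # outcome 0
    return reward, trans

reward, trans = beta_arm_dag()
# Martingale check: r_u == sum_v p_uv r_v at every internal state.
internal = {u for (u, _) in trans}
for u in internal:
    assert reward[u] == sum(p * reward[v] for (s, v), p in trans.items() if s == u)
print(len(reward), "states")   # (depth+1)(depth+2)/2 states: O(depth^2), as claimed
```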
We omit further details, since Beta distributions and their multinomial generalizations, the Dirichlet distributions, are standard in the Bayesian context (refer, for instance, to Wetherill and Glazebrook [50]).

3 Budgeted Bandits

We are given a bandit with n independent arms. The set of possible states of arm i is denoted by $S_i$, and the initial state is $\rho_i \in S_i$. When arm i is played in a state $u \in S_i$, the arm transitions to state $v \in S_i$ with probability $p_{uv}$. The reward at a state satisfies $r_u = \sum_{v \in S_i} p_{uv} r_v$. The cost of a play depends on whether the previous play was for the same arm or not: if the previous play was for the same arm, the play at $u \in S_i$ costs $c_u$; otherwise it costs $c_u + h_i$, where $h_i$ is the setup cost for switching into arm i. A policy $\pi$ performs a possibly adaptive sequence of plays during exploration, leading to a probability distribution over outcomes, $O(\pi)$. In outcome $o \in O(\pi)$, each arm i is in some final state $u^o_i$, and the policy chooses $\max_i r_{u^o_i}$. The expected reward of the policy $\pi$ over the outcomes of exploration is $R(\pi) = \sum_{o \in O(\pi)} q(o, \pi) \max_i r_{u^o_i}$. Let $C(o, \pi)$ denote the cost of the exploration plays made by the policy given an outcome o. In this section, we seek the policy $\pi$ that maximizes $R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O$.

We describe the linear programming formulation and rounding technique that yields a 4-approximation. We note that the formulation and solution are polynomial in n, the number of arms, and m, the number of states per arm.

3.1 Linear Programming Formulation

Recall the notation from Section 1.3. Consider any adaptive policy $\pi$. For each arm i and state $u \in S_i$, let: (1) $w_u$ denote the probability that, during the execution of the policy $\pi$, arm i enters state $u \in S_i$; (2) $z_u$ denote the probability that the state of arm i is u and the policy plays arm i in this state; and (3) $x_u$ denote the probability that the policy $\pi$ chooses arm i in state u during the exploitation phase. Since the latter two correspond to mutually exclusive events, we have $x_u + z_u \le w_u$. The following LP has three variables $w_u$, $x_u$, and $z_u$ for each arm i and each $u \in S_i$. A similar LP formulation was proposed for the multi-armed bandit problem by Whittle [53] and Bertsimas and Niño-Mora [40].

Maximize $\sum_{i=1}^{n} \sum_{u \in S_i} x_u r_u$ subject to:
$\sum_{i=1}^{n} \left( h_i z_{\rho_i} + \sum_{u \in S_i} c_u z_u \right) \le C$
$\sum_{i=1}^{n} \sum_{u \in S_i} x_u \le 1$
$\sum_{v \in S_i} z_v p_{vu} = w_u \quad \forall i, \; u \in S_i \setminus \{\rho_i\}$
$x_u + z_u \le w_u \quad \forall i, \; u \in S_i$
$x_u, z_u, w_u \in [0, 1] \quad \forall i, \; u \in S_i$

Let $\gamma^*$ be the optimal LP value, and OPT be the expected reward of the optimal adaptive policy; a sketch of this LP in code follows.
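The following sketch (ours) writes this LP down programmatically. It assumes the third-party PuLP modeling library (any LP solver would do) and a hypothetical dictionary representation of each arm's state-space DAG.

```python
import pulp  # third-party LP modeling library (pip install pulp)

def budgeted_bandit_lp(arms, C):
    """Build the LP of Section 3.1 for per-arm state-space DAGs.

    Each element of `arms` is a dict with (hypothetical) keys:
      'states': list of states u;  'root': the root state rho_i;
      'r': {u: r_u};  'c': {u: c_u};  'h': setup cost h_i;
      'p': {(u, v): p_uv} for the transitions of this arm.
    """
    lp = pulp.LpProblem("budgeted_bandits", pulp.LpMaximize)
    w, x, z = {}, {}, {}
    for i, arm in enumerate(arms):
        for j, u in enumerate(arm['states']):
            w[i, u] = pulp.LpVariable(f"w_{i}_{j}", 0, 1)
            x[i, u] = pulp.LpVariable(f"x_{i}_{j}", 0, 1)
            z[i, u] = pulp.LpVariable(f"z_{i}_{j}", 0, 1)
    # Objective: expected exploitation reward, sum_i sum_u x_u r_u.
    lp += pulp.lpSum(x[i, u] * arm['r'][u]
                     for i, arm in enumerate(arms) for u in arm['states'])
    # Expected exploration cost at most C (setup cost charged at the root).
    lp += pulp.lpSum(arm['h'] * z[i, arm['root']]
                     + pulp.lpSum(arm['c'][u] * z[i, u] for u in arm['states'])
                     for i, arm in enumerate(arms)) <= C
    # In expectation, one arm in one state is chosen for exploitation.
    lp += pulp.lpSum(x.values()) <= 1
    for i, arm in enumerate(arms):
        lp += w[i, arm['root']] == 1   # w.l.o.g., as in Section 3.2
        for u in arm['states']:
            if u != arm['root']:
                # w_u is the probability of being played into state u.
                lp += pulp.lpSum(z[i, v] * arm['p'].get((v, u), 0.0)
                                 for v in arm['states']) == w[i, u]
            lp += x[i, u] + z[i, u] <= w[i, u]   # playing and exploiting are disjoint
    return lp, (w, x, z)
```

Calling `lp.solve()` and reading the variables back with `pulp.value(...)` yields the triple $\langle w^*_u, x^*_u, z^*_u \rangle$ from which the single-arm policies of Section 3.2 are constructed.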
Claim 3.1. $OPT \le \gamma^*$.

Proof. We show that the $w_u, z_u, x_u$ defined above, corresponding to the optimal policy $\pi^*$, are feasible for the constraints of the LP. Since each possible outcome of exploration leads to choosing one arm i in some state $u \in S_i$ for exploitation, in expectation over the outcomes, one arm in one state is chosen for exploitation; this is captured by the second constraint. Further, since on each sequence of outcomes (each decision trajectory) the cost of playing and switching into the arms is at most C, over the entire decision tree the expected cost of switching into the root states $\rho_i$ plus the expected cost of the plays is at most C; this is captured by the first constraint. Note that the LP only takes into account the cost of switching into an arm the very first time the arm is explored, and ignores the remaining switching costs. This is clearly a relaxation, since the optimal policy might switch into an arm multiple times. However, our rounding procedure switches into an arm at most once, preserving the structure of the LP relaxation. The third constraint simply encodes the probability of reaching a state $u \in S_i$ during exploration: it is precisely the probability with which the arm is played in some state $v \in S_i$, times the probability $p_{vu}$ that it reaches u conditioned on that play. The constraint $x_u + z_u \le w_u$ simply captures that playing an arm in a state is an event disjoint from exploiting it in that state. The objective is precisely the expected reward of the policy. Hence, the LP is a relaxation of the optimal policy.

3.2 The Single-arm Policies

The optimal LP solution clearly does not directly correspond to a feasible policy, since the variables do not faithfully capture the joint evolution of the states of different arms. Below, we present an interpretation of the LP solution, and show how it can be converted into a feasible, approximately optimal policy. Let $\langle w^*_u, x^*_u, z^*_u \rangle$ denote the optimal solution to the LP. We can assume w.l.o.g. that $w^*_{\rho_i} = 1$ for all i. Ignoring the first two constraints of the LP for the time being, the remaining constraints encode a separate policy for each arm, as follows. Consider any arm i in isolation. The play starts at state $\rho_i$. The arm is played with probability $z^*_{\rho_i}$, so that state $u \in S_i$ is reached with probability $z^*_{\rho_i} p_{\rho_i u}$. This play incurs cost $h_i + c_{\rho_i}$, which captures the cost of switching into this arm and the cost of playing at the root. At state $\rho_i$, with probability $x^*_{\rho_i}$, the play stops and arm i is chosen for exploitation. The events involving playing the arm and choosing it for exploitation are disjoint. Similarly, conditioned on reaching state $u \in S_i$, with probabilities $z^*_u / w^*_u$ and $x^*_u / w^*_u$, arm i is played and chosen for exploitation, respectively. This yields a policy $\phi_i$ for arm i, which is described in Figure 1 and sketched in code below it. For policy $\phi_i$, it is easy to see by induction that if state $u \in S_i$ is reached by the policy with probability $w^*_u$, then state $u \in S_i$ is reached and arm i is played with probability $z^*_u$. The policy $\phi_i$ sets $E_i = 1$ if, on termination, arm i was chosen for exploitation. If $E_i = 1$ at state $u \in S_i$, then exploiting the arm in this state yields reward $r_u$. Note that $E_i$ is a random variable that depends on the execution of policy $\phi_i$. Let $R_i, C_i$ denote the random variables corresponding to the exploitation reward and the cost of playing and switching, respectively.

Policy $\phi_i$: If arm i is currently in state u, then choose $q \in [0, w^*_u]$ uniformly at random:
1. If $q \in [0, z^*_u]$, then play the arm (explore).
2. If $q \in (z^*_u, z^*_u + x^*_u]$, then stop executing $\phi_i$ and set $E_i = 1$ (exploit).
3. If $q \in (z^*_u + x^*_u, w^*_u]$, then stop executing $\phi_i$ and set $E_i = 0$.

Figure 1: The policy $\phi_i$.
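Given the LP optimum, here is a minimal simulation sketch of $\phi_i$ (ours, with hypothetical names; it reuses the arm representation of the earlier sketch). It returns one realization of the random variables $(E_i, R_i, C_i)$.

```python
import random

def sample_transition(arm, u):
    """Draw the next state v with probability p_uv (hypothetical helper)."""
    out = [(v, p) for (s, v), p in arm['p'].items() if s == u]
    if not out:            # terminal state; a consistent LP optimum has z*_u = 0 here
        return u
    r, acc = random.random(), 0.0
    for v, p in out:
        acc += p
        if r <= acc:
            return v
    return out[-1][0]

def run_phi(arm, wstar, xstar, zstar):
    """One run of policy phi_i (Figure 1); returns a realization (E_i, R_i, C_i)."""
    u, cost, first = arm['root'], 0.0, True
    while True:
        q = random.uniform(0.0, wstar[u])
        # The z*_u > 0 guard avoids a degenerate loop when w*_u = z*_u = 0.
        if q <= zstar[u] and zstar[u] > 0:      # explore: play the arm at u
            cost += arm['c'][u] + (arm['h'] if first else 0.0)
            first = False
            u = sample_transition(arm, u)
        elif q <= zstar[u] + xstar[u]:          # exploit: E_i = 1, reward r_u
            return 1, arm['r'][u], cost
        else:                                   # halt with E_i = 0
            return 0, 0.0, cost
```

Averaging $(E_i, R_i, C_i)$ over many independent runs estimates the quantities $P(\phi_i)$, $R(\phi_i)$, and $C(\phi_i)$ defined next, which the rounding algorithm uses.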
For policy $\phi_i$, define the following quantities:
1. $P(\phi_i) = E[E_i] = \sum_{u \in S_i} \Pr[E_i = 1 \wedge u] = \sum_{u \in S_i} x^*_u$: the probability that the arm is exploited.
2. $R(\phi_i) = E[R_i] = \sum_{u \in S_i} r_u \Pr[E_i = 1 \wedge u] = \sum_{u \in S_i} x^*_u r_u$: the expected reward of exploitation.
3. $C(\phi_i) = E[C_i] = h_i z^*_{\rho_i} + \sum_{u \in S_i} c_u z^*_u$: the expected cost of switching into and playing this arm.

Let $\phi$ denote the policy obtained by executing each $\phi_i$ independently in succession. Since policy $\phi_i$ is obtained by considering arm i in isolation, $\phi$ is not a feasible policy, for the following reasons: (i) the cost $\sum_i C_i$ spent exploring all the arms need not be at most C on every exploration trajectory, and (ii) it could happen that $E_i$ is set to 1 for several arms i, which means several arms could be chosen simultaneously for exploitation. However, all is not lost. First note that the random variables $R_i, C_i, E_i$ for different i are independent. Furthermore, it is easy to see, using the first two constraints and the objective of the LP formulation, that $\phi$ is feasible in the following expected sense: $\sum_i E[C_i] = \sum_i C(\phi_i) \le C$; secondly, $\sum_i E[E_i] = \sum_i P(\phi_i) \le 1$; and finally, $\sum_i E[R_i] = \sum_i R(\phi_i) = \gamma^*$.

Based on the above, we show that policy $\phi$ can be converted into a feasible policy using ideas from the adaptivity gap proofs for stochastic packing problems [23, 24, 22]. We treat each policy $\phi_i$ as an item that takes up cost $C_i$, has size $E_i$, and yields profit $R_i$. These items need to be placed in a knapsack – placing item i corresponds to exploring arm i according to policy $\phi_i$. This placement is an irrevocable decision, and after the placement, the values of $C_i, E_i, R_i$ are revealed. We need $\sum_i C_i$ for the items placed so far to be at most C. Furthermore, the placement (or exploration) stops the first time some $E_i$ is set to 1, and arm i is then used for exploitation (obtaining reward, or profit, $R_i$). Since only one event $E_i = 1$ is allowed before the play stops, this yields the "size constraint" $\sum_i E_i \le 1$. The knapsack therefore has both cost and size constraints, and the goal is to sequentially and irrevocably place the items in the knapsack, stopping when the constraints would be violated. The goal is to choose the order in which to place the items so as to maximize the expected profit, or exploitation gain. This is a two-constraint stochastic packing problem. The LP solution implies that the expected values of the random variables satisfy the packing constraints. We show that the "start-deadline" framework in [22] can be adapted to show that there is a fixed order of exploring the arms according to the $\phi_i$ that yields gain at least $\gamma^*/4$. There is one subtle point: the profit (or gain) is a random variable correlated with the size and cost. Furthermore, the "start deadline" model in [22] would also allow the final packing to violate the constraints by a small amount. We get around this difficulty by presenting an algorithm GreedyOrder that explicitly obeys the constraints, but whose analysis is coupled with the analysis of a simpler policy GreedyViolate that may exceed the budget. The central idea is that although the benefit of the current arm has not been "verified", the alternatives have been ruled out.

3.3 The Rounding Algorithm

The GreedyOrder policy is shown in Figure 2, and sketched in code below it. Note that step (3) ensures that no arm is ever revisited, so that the strategy is sequential.

Algorithm GreedyOrder:
1. Order the arms in decreasing order of $\frac{R(\phi_i)}{P(\phi_i) + C(\phi_i)/C}$ and choose the arms to play in this order.
2. For each arm j in sorted order, play arm j according to $\phi_j$ as follows, until $\phi_j$ terminates:
   (a) If the next play according to $\phi_j$ would violate the budget constraint, then stop exploration and go to step (3).
   (b) If $\phi_j$ has terminated and $E_j = 1$, then stop exploration and go to step (3).
   (c) Else, play arm j according to policy $\phi_j$ and go to step (2a).
3. Choose the last arm played in step (2) for exploitation.

Figure 2: The GreedyOrder policy.
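Putting the pieces together, here is a compact sketch (ours) of GreedyOrder as just stated; it assumes the arm representation and the `sample_transition` helper from the earlier sketches, and implements the per-play budget check of step (2a).

```python
import random   # sample_transition is the helper defined in the sketch above

def greedy_order(arms, sol, C):
    """GreedyOrder (Figure 2): explore arms in one fixed order, never revisiting.

    sol[i] = (wstar, xstar, zstar) restricted to arm i.  Returns the index of
    the arm chosen for exploitation (None if the budget excludes every play).
    """
    def score(i):
        wst, xst, zst = sol[i]
        arm = arms[i]
        P = sum(xst.values())                                       # P(phi_i)
        R = sum(xst[u] * arm['r'][u] for u in arm['states'])        # R(phi_i)
        Cc = arm['h'] * zst[arm['root']] \
             + sum(arm['c'][u] * zst[u] for u in arm['states'])     # C(phi_i)
        d = P + Cc / C
        return R / d if d > 0 else 0.0

    spent, last_played = 0.0, None
    for i in sorted(range(len(arms)), key=score, reverse=True):
        arm, (wst, xst, zst) = arms[i], sol[i]
        u, first = arm['root'], True
        while True:
            q = random.uniform(0.0, wst[u])
            if q <= zst[u] and zst[u] > 0:              # phi_i says: play
                step = arm['c'][u] + (arm['h'] if first else 0.0)
                if spent + step > C:                    # step (2a): budget would break
                    return last_played
                spent, first, last_played = spent + step, False, i
                u = sample_transition(arm, u)
            elif q <= zst[u] + xst[u]:                  # step (2b): E_i = 1, exploit i
                return i
            else:
                break                                   # phi_i halts; try the next arm
    return last_played
```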
For the purpose of analysis, we first present an infeasible policy GreedyViolate, which is simpler to analyze. The algorithm is the same as GreedyOrder except for step (2), which we outline in Figure 3. In GreedyViolate, the cost budget is checked only after fully executing a policy $\phi_j$. Therefore, the policy could violate the budget constraint by at most the exploration cost $c_{max}$ of one arm.

Step 2 (GreedyViolate): For each arm j in sorted order, do the following:
(a) Play arm j according to policy $\phi_j$ until $\phi_j$ terminates.
(b) When the policy $\phi_j$ terminates execution, if the event $E_j = 1$ is observed, or the cost budget C is exhausted or exceeded, then stop exploration and go to step (3).

Figure 3: The GreedyViolate policy.

Theorem 3.2. GreedyViolate spends cost at most $C + c_{max}$ and yields reward at least $\frac{OPT}{4}$.

Proof. We have $\gamma^* = \sum_i R(\phi_i)$ and $\sum_i P(\phi_i) \le 1$. We note that the random variables corresponding to different i are independent. For notational convenience, let $\nu_i = R(\phi_i)$, and let $\mu_i = P(\phi_i) + C(\phi_i)/C$. We therefore have $\sum_i \mu_i \le 2$. The sorted ordering is in decreasing order of $\nu_i / \mu_i$. Re-number the arms according to the sorted ordering, so that the first arm played is numbered 1. Let k denote the smallest integer such that $\sum_{i=1}^{k} \mu_i \ge 1$. By the sorted ordering property, it is easy to see that $\sum_{i=1}^{k} \nu_i \ge \frac{1}{2}\gamma^*$. Arm i is reached and played by the policy iff $\sum_{j<i} E_j = 0$ and the exploration cost of arms $1, \ldots, i-1$ is less than C; by Markov's inequality applied to $\sum_{j<i}(E_j + C_j/C)$, this happens with probability at least $1 - \sum_{j<i} \mu_j$. A rearrangement argument along the lines of [23, 22], using the fact that the $\nu_i/\mu_i$ are sorted in decreasing order, now shows that the expected reward $\sum_{i=1}^{k} \nu_i \left(1 - \sum_{j<i} \mu_j\right)$ is at least $\gamma^*/4 \ge OPT/4$.

4 Model-Driven Optimization

4.1 Lower Bound for Multi-level State Spaces

Theorem 4.1. For multi-level state spaces, the adaptivity gap of non-adaptive policies is $\Omega(\sqrt{n})$; even if the non-adaptive policy is allowed a budget $\gamma > 1$ times the exploration budget, the adaptivity gap remains $\Omega(\sqrt{n/\gamma})$.

Proof. Each arm has an underlying reward distribution over the three values $a_1 = 0$, $a_2 = 1/n^9$, and $a_3 = 1$. Let $q = 1/\sqrt{n}$. The underlying distribution could be one of three possibilities: $R_1, R_2, R_3$. $R_1$ is the deterministic value $a_1$; $R_2$ is deterministically $a_2$; and $R_3$ is $a_3$ w.p. q and $a_2$ w.p. $1 - q$. For each arm, we know in advance that $\Pr[R_1] = 1 - q$, $\Pr[R_2] = q(1 - q)$, and $\Pr[R_3] = q^2$. Therefore, the knowledge for each arm is a prior over the three distributions $R_1, R_2, R_3$. The priors for different arms are i.i.d. All $c_i = 1$ and the total budget is $C = 5n$.

We first show that an adaptive policy chooses an arm with underlying reward distribution $R_3$ with constant probability. This policy first plays each arm once and discards all arms with observed reward $a_1$. With probability at least 1/2, there are at most $2/q$ arms that survive, and at least one of these arms has underlying reward distribution $R_3$. If more arms survive, choose any $2/q$ arms. The policy now plays each of the $2/q$ arms $2\sqrt{n}$ times. The probability that an arm with distribution $R_3$ yields reward $a_3$ on at least one play is $1 - (1 - q)^{2/q} = \Theta(1)$. In this case, it chooses the arm with reward distribution $R_3$ for exploitation.
Since this happens with at least constant probability, the expected exploitation reward is $\Theta(q)$. Note that this is the best possible to within constant factors, since $E[R_3] = \Theta(q)$.

Now consider any non-adaptive policy. With probability $1 - 1/n^{\Theta(1)}$, there are at most $2\log n$ arms with reward distribution $R_3$, and at least $1/(2q)$ arms with reward distribution $R_2$. Let $r \gg 2\log n$. The strategy allocates at most 5r plays to at least $n(1 - 1/r)$ arms – call this set of arms T. With probability $(1 - 1/r)^{2\log n} = \Omega(1 - (2\log n)/r)$, all arms with reward distribution $R_3$ lie in this set T. For any of these arms played O(r) times, with probability $1 - O(qr)$ all observed rewards have value $a_2$. This implies that, with probability $1 - O(qr)$, all arms with distribution $R_3$ yield rewards $a_2$, and so do the $\Omega(1/(2q))$ arms with distribution $R_2$. Since these appear indistinguishable to the policy, it can at best choose one of them at random, obtaining exploitation reward at most $\frac{2\log n}{1/(2q)} \cdot q = O(q^2 \log n)$. Since this situation happens with probability $1 - O(\log n / r)$, and with the remaining probability the exploitation reward is at most q, the strategy therefore has expected exploitation reward $O\left(q \log n \left(\frac{1}{r} + q\right)\right)$. This implies the adaptivity gap is $\Omega(1/q) = \Omega(\sqrt{n})$ if we set $r = 1/q$.

Now suppose we allow the budget to be increased by a factor of $\gamma > 1$. Then the strategy would allocate at most $5\gamma r$ plays to at least $n(1 - 1/r)$ arms. Following the same argument as above, the expected reward is $O\left(q \log n \left(\frac{1}{r} + q\gamma\right)\right)$. This proves the second part of the theorem.

4.2 Upper Bound for Two-level State Spaces

We next show that for 2-level state spaces, which correspond to deterministic underlying rewards (refer to Section 2), the adaptivity gap is at most a factor of 7.

Theorem 4.2. If each state space $S_i$ is a directed star graph with $\rho_i$ as the root, then there is a non-adaptive strategy that achieves reward at least 1/7 of the LP bound.

Proof. In the case of 2-level state spaces, a non-adaptive strategy chooses a subset S of arms and allocates zero or one plays to each of these, so that the total cost of the plays is at most C. We consider two cases based on the LP optimal solution. In the first case, suppose $\sum_i r_{\rho_i} x_{\rho_i} \ge \gamma^*/7$; then not playing anything, but simply choosing the arm with highest $r_{\rho_i}$ directly for exploitation, is a 7-approximation. In the remaining proof, we assume the above is not the case, and compare against the optimal LP solution that sets $x_{\rho_i} = 0$ for all i. This solution has value at least $6\gamma^*/7$. For simplicity of notation, define $z_i = z_{\rho_i}$ as the probability that arm i is played. Define $X_i = \frac{1}{z_i} \sum_{u \in S_i} x_u$ as the probability that the arm is exploited conditioned on being played, and $R_i = \frac{1}{z_i} \sum_{u \in S_i} x_u r_u$ as the expected exploitation reward conditioned on being played. Also define $c_i = c_{\rho_i}$. The LP satisfies the constraint $\sum_i z_i \left( \frac{c_i}{C} + X_i \right) \le 2$, and the LP objective is $\sum_i z_i R_i$, which has value at least $6\gamma^*/7$. A better objective for the LP can be obtained by considering the arms in decreasing order of $\frac{R_i}{c_i/C + X_i}$ and increasing $z_i$ in this order until the constraint $\sum_i z_i \left( \frac{c_i}{C} + X_i \right) \le 1$ becomes tight; set the remaining $z_i = 0$. It is easy to see that $\sum_i z_i R_i \ge \frac{3}{7}\gamma^*$.
At this point, let k denote the index of the last arm that could possibly have $z_k < 1$, and let S denote the set of arms with $z_i = 1$. There are again two cases. In the first case, if $z_k R_k > \gamma^*/7$, then choosing just this arm for exploitation has reward at least $\gamma^*/7$, and is a 7-approximation. In the second and final case, we have a subset S of arms with $\sum_{i \in S} \left( \frac{c_i}{C} + X_i \right) \le 1$ and $\sum_{i \in S} R_i \ge \frac{3}{7}\gamma^* - \frac{1}{7}\gamma^* = \frac{2}{7}\gamma^*$. If all these arms are played, the expected number of arms that are exploited is $\sum_{i \in S} X_i \le 1$, and the expected reward is $\sum_{i \in S} R_i \ge \frac{2}{7}\gamma^*$. The proof of Theorem 3.2 can be adapted to show that choosing the best arm for exploitation yields at least half this reward, i.e., reward at least $\gamma^*/7$.

5 Lagrangean Version

Recall from Section 1.3 that in the Lagrangean version of the problem there are no budget constraints on the plays; the goal is to find a policy $\pi$ such that $R(\pi) - C(\pi)$ is maximized. Denote this quantity the profit of the strategy. The linear programming relaxation is below. The variables are identical to those of the previous formulation, but there is no budget constraint.

Maximize $\sum_{i=1}^{n} \left[ \sum_{u \in S_i} (x_u r_u - c_u z_u) - h_i z_{\rho_i} \right]$ subject to:
$\sum_{i=1}^{n} \sum_{u \in S_i} x_u \le 1$
$\sum_{v \in S_i} z_v p_{vu} = w_u \quad \forall i, \; u \in S_i \setminus \{\rho_i\}$
$x_u + z_u \le w_u \quad \forall i, \; u \in S_i$
$x_u, z_u, w_u \in [0, 1] \quad \forall i, \; u \in S_i$

Let OPT be the optimal net profit and $\gamma^*$ the optimal LP value. The next claim is similar to Claim 3.1.

Claim 5.1. $OPT \le \gamma^*$.

From the LP optimum $\langle w^*_u, x^*_u, z^*_u \rangle$, the policy $\phi_i$ is constructed as described in Figure 1, and the random variables $E_i, C_i, R_i$ and their respective expectations $P(\phi_i)$, $C(\phi_i)$, and $R(\phi_i)$ are obtained as described at the beginning of Section 3.2. Let the random variable $Y_i = R_i - C_i$ denote the profit of playing arm i according to $\phi_i$. Note that $E[Y_i] = \sum_{u \in S_i} (x^*_u r_u - c^{\vphantom{*}}_u z^*_u) - h_i z^*_{\rho_i}$. A nice aspect of the proof of Theorem 3.2 is that it does not require the random variable $R_i$, corresponding to the reward of policy $\phi_i$, to be non-negative: as long as $E[R_i] = R(\phi_i) \ge 0$, the proof holds. This will be crucial for the Lagrangean version.

Claim 5.2. For any arm i, $E[Y_i] = R(\phi_i) - C(\phi_i) \ge 0$.

Proof. For each i, since all $r_u \ge 0$, setting $x_{\rho_i} \leftarrow \sum_{u \in S_i} x_u$, $w_{\rho_i} \leftarrow 1$, and $z_u \leftarrow 0$ for $u \in S_i$ yields a feasible solution whose objective term for arm i is non-negative. The LP optimum therefore guarantees that $\sum_{u \in S_i} (x^*_u r_u - c_u z^*_u) - h_i z^*_{\rho_i} \ge 0$. Therefore, $E[Y_i] \ge 0$ for all i.

The GreedyOrder policy orders the arms in decreasing order of $\frac{R(\phi_i) - C(\phi_i)}{P(\phi_i)}$ and plays them according to their respective $\phi_i$ until some $E_i = 1$.

Theorem 5.3. The expected profit of GreedyOrder is at least $OPT/2$.

Proof. Let $\mu_i = P(\phi_i)$ and $\nu_i = E[Y_i]$ for notational convenience. The LP solution yields $\sum_i \mu_i \le 1$ and $\sum_i \nu_i = \gamma^*$. Re-number the arms according to the sorted ordering of $\frac{\nu_i}{\mu_i}$, so that the first arm played is numbered 1. The event that GreedyOrder plays arm i corresponds to $\sum_{j<i} E_j = 0$, which happens with probability at least $1 - \sum_{j<i} \mu_j$. Since the $\nu_i/\mu_i$ are sorted in decreasing order, $\nu_i \mu_j \le \nu_j \mu_i$ for $j < i$, so $\sum_i \nu_i \sum_{j<i} \mu_j \le \frac{1}{2} \left( \sum_i \nu_i \right) \left( \sum_i \mu_i \right) \le \frac{\gamma^*}{2}$. The expected profit is therefore at least $\sum_i \nu_i \left( 1 - \sum_{j<i} \mu_j \right) \ge \gamma^*/2 \ge OPT/2$.

6 Concave Utility Functions

We now extend the results of Section 3 to concave exploitation objectives. Here, the exploitation decision allocates a weight $y_i \in [0, 1]$ to each arm i; arm i has a size $\sigma_i \le B$, and the weights must satisfy the packing constraint $\sum_i \sigma_i y_i \le B$. If arm i finishes exploration in state $u \in S_i$, allocating weight y to it yields value $g_u(y)$, where each $g_u$ is a non-decreasing concave function; the objective is to maximize the expected value of $\sum_i g_{u_i}(y_i)$ over the outcomes of exploration. The value functions satisfy a super-martingale property: $g_u(y) \ge \sum_{v \in S_i} p_{uv} g_v(y)$.

6.1 Linear Programming Formulation

For $\epsilon > 0$, let $L = n/\epsilon$. Discretize the domain [0, 1] in multiples of 1/L. For $l \in \{0, 1, \ldots, L\}$, let $\zeta_u(l) = g_u(l/L)$. This corresponds to the contribution of arm i to the exploitation value on allocating weight $y_i = l/L$; a sketch of this discretization follows.
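A minimal sketch (ours) of this discretization, with a hypothetical concave utility; the comment records the rounding argument used in Lemma 6.1 below.

```python
import math

def discretize(g, n, eps):
    """Tabulate zeta(l) = g(l / L) for l = 0, ..., L with L = ceil(n / eps).

    Rounding any weight y up to the next multiple of 1/L increases sigma_i * y_i
    by at most sigma_i / L <= eps * B / n (using sigma_i <= B), so the total
    size of an assignment grows by at most eps * B over the n arms.
    """
    L = math.ceil(n / eps)
    return L, [g(l / L) for l in range(L + 1)]

# Example with a hypothetical concave, non-decreasing utility g(y) = sqrt(y):
L, zeta = discretize(math.sqrt, n=10, eps=0.5)
print(L, zeta[0], zeta[L])   # 20, 0.0, 1.0
```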
Define the following linear program:

Maximize $\sum_{i=1}^{n} \sum_{u \in S_i} \sum_{l=0}^{L} x_{ul} \zeta_u(l)$ subject to:
$\sum_{i=1}^{n} \left( h_i z_{\rho_i} + \sum_{u \in S_i} c_u z_u \right) \le C$
$\sum_{i=1}^{n} \sigma_i \sum_{u \in S_i} \sum_{l=0}^{L} l \, x_{ul} \le B L (1 + \epsilon)$
$\sum_{v \in S_i} z_v p_{vu} = w_u \quad \forall i, \; u \in S_i \setminus \{\rho_i\}$
$z_u + \sum_{l=0}^{L} x_{ul} \le w_u \quad \forall i, \; u \in S_i$
$w_u, x_{ul}, z_u \in [0, 1] \quad \forall i, \; u \in S_i, \; \forall l$

Let $\gamma^*$ be the optimal LP value and OPT the value of the optimal adaptive exploration policy.

Lemma 6.1. $OPT \le \gamma^*$.

Proof. In the optimal solution, let $w_u$ denote the probability that the policy reaches state $u \in S_i$, and let $z_u$ denote the probability of reaching state $u \in S_i$ and playing arm i in this state. For $l \ge 1$, let $x_{ul}$ denote the probability of stopping exploration at $u \in S_i$ and allocating weight $y_i \in (\frac{l-1}{L}, \frac{l}{L}]$ to arm i. All the constraints are straightforward, except the constraint involving B. Observe that if the weight assignments $y_i$ in the optimal solution were rounded up to the nearest multiple of 1/L, then the total size of any assignment increases by at most $\epsilon B$, since all $\sigma_i \le B$. Therefore, this constraint is satisfied. Using the same rounding-up argument, if the weight satisfies $y_i \in (\frac{l-1}{L}, \frac{l}{L}]$, then the contribution of arm i to the exploitation value is upper bounded by $\zeta_u(l)$, since the function $g_u(y)$ is non-decreasing in y. The proof follows.

Policy $\phi_i$: If arm i is currently in state u, choose $q \in [0, w^*_u]$ uniformly at random and do one of the following:
1. If $q \in [0, z^*_u]$, then play the arm.
2. Else, stop executing $\phi_i$; find the smallest $l \ge 0$ such that $q \le z^*_u + \sum_{k=0}^{l} x^*_{uk}$, and set $E_i = \frac{l}{L}$ and $R_i = \zeta_u(l)$.

Figure 4: The policy $\phi_i$ for concave value functions.

6.2 Exploration Policy

Let $\langle w^*_u, x^*_{ul}, z^*_u \rangle$ denote the optimal solution to the LP. Assume $w^*_{\rho_i} = 1$ for all i. Also assume w.l.o.g. that $z^*_u + \sum_{l=0}^{L} x^*_{ul} = w^*_u$ for all $u \in S_i$. The LP solution yields a natural (infeasible) exploration policy $\phi$ consisting of one independent policy $\phi_i$ per arm i. Policy $\phi_i$ is described in Figure 4. The policy $\phi_i$ is independent of the states of the other arms. It is easy to see by induction that if state $u \in S_i$ is reached by the policy with probability $w^*_u$, then state $u \in S_i$ is reached and arm i is played with probability $z^*_u$. Let the random variable $C_i$ denote the cost of executing $\phi_i$, and let $C(\phi_i) = E[C_i]$. Denote the overall policy $\phi$ – this corresponds to one independent decision policy $\phi_i$ (determined by $\langle w^*_u, x^*_{ul}, z^*_u \rangle$) per arm. It is easy to see that the following hold for $\phi$:
1. $C(\phi_i) = E[C_i] = h_i z^*_{\rho_i} + \sum_{u \in S_i} c_u z^*_u$, so that $\sum_i C(\phi_i) \le C$.
2. $P(\phi_i) = E[E_i] = \frac{1}{L} \sum_{u \in S_i} \sum_{l=0}^{L} l \, x^*_{ul}$, which implies $\sum_i \sigma_i P(\phi_i) \le B(1 + \epsilon)$.
3. $R(\phi_i) = E[R_i] = \sum_{u \in S_i} \sum_{l=0}^{L} x^*_{ul} \zeta_u(l)$, which implies $\sum_i R(\phi_i) = \gamma^*$.

The GreedyOrder policy for this setting is shown in Figure 5; a sketch of its ordering criterion and final scaling step follows the figure.

Algorithm GreedyOrder:
1. Order the arms in decreasing order of $\frac{R(\phi_i)}{\frac{\sigma_i}{B} P(\phi_i) + \frac{1}{C} C(\phi_i)}$.
2. For each arm j in sorted order, play it according to $\phi_j$ as follows, until $\phi_j$ terminates:
   (a) If the next play would violate the cost constraint, then set $E_j \leftarrow 1$, stop exploration, and go to step (3).
   (b) If $\phi_j$ terminates and $\sum_i \sigma_i E_i \ge B$, then stop exploration and go to step (3).
   (c) Else, play arm j according to policy $\phi_j$ and go to step (2a).
3. Exploitation: scale down each $E_i$ by a factor of 2.

Figure 5: The GreedyOrder policy for concave functions.
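For completeness, a small sketch (ours, with hypothetical names) of the modified greedy score of step (1) and the halving of step (3); by concavity with $g(0) \ge 0$, $g(y/2) \ge g(y)/2$, which is where the factor 2 in the theorems below arises.

```python
def concave_score(R, P, Ccost, sigma, B, C):
    """Ordering criterion of Figure 5: R(phi_i) / (sigma_i P(phi_i)/B + C(phi_i)/C)."""
    d = sigma * P / B + Ccost / C
    return R / d if d > 0 else 0.0

def halve_weights(E, g_final):
    """Step (3): scale every weight E_i down by 2 before exploitation.

    g_final[i] is the value function of arm i at its final state (hypothetical
    representation).  Concavity with g(0) >= 0 gives g(y/2) >= g(y)/2, so at
    most half the exploitation value is lost by the scaling.
    """
    return {i: g_final[i](E[i] / 2.0) for i in E}
```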
The GreedyOrder policy is presented in Figure 5. We again use an infeasible policy GreedyViolate, which is simpler to analyze. The algorithm is the same as GreedyOrder except for step (2), where violation of the cost constraint is only checked after the policy $\phi_j$ terminates.

Theorem 6.2. Let $c_{max}$ denote the maximum cost of exploring a single arm. Then GreedyViolate spends cost at most $C + c_{max}$ and has expected value at least $\frac{OPT}{8}(1 - \epsilon)$.

Proof. Let $\nu_i = R(\phi_i)$ and let $\mu_i = \frac{\sigma_i}{B} P(\phi_i) + \frac{1}{C} C(\phi_i)$. The LP constraints imply that $\gamma^* = \sum_i \nu_i$ and $\sum_i \mu_i \le 2 + \epsilon$. Now, using the same proof as Theorem 3.2, we obtain that the value G of GreedyViolate according to the weight assignment $E_i$ at the end of step (2) is at least $\frac{OPT}{4}(1 - \epsilon)$. This weight assignment could be infeasible because of the last arm, so that the $E_i$ only satisfy $\sum_i \sigma_i E_i \le 2B$. This is made feasible in step (3) by scaling all $E_i$ down by a factor of 2. Since the functions $g_i(y)$ are concave in y, the exploitation value reduces by a factor of at most 2 under this scaling.

Theorem 6.3. The GreedyOrder policy with budget C achieves expected value at least $\frac{OPT}{8}(1 - \epsilon)$.

Proof. Consider the GreedyViolate policy. Suppose the play for arm i reaches state $u \in S_i$, and the next decision of GreedyViolate involves playing arm i, but this would exceed the cost budget. Conditioned on this next decision, GreedyOrder sets $E_i = 1$ and stops exploration. In this case, the exploitation value of GreedyOrder from arm i is at least the expected exploitation gain of GreedyViolate for this arm, by the super-martingale property of the value function g. Therefore, for the assignments at the end of step (2), the gain of GreedyOrder is at least $\frac{OPT}{4}(1 - \epsilon)$. Since step (3) scales the E's down by a factor of 2, the theorem follows.

7 Conclusions

We studied the classical stochastic multi-armed bandit problem under the future utilization objective in the presence of priors. This model is relevant to settings involving data acquisition and the design of experiments. In this problem the exploration phase necessarily precedes the exploitation phase. This makes the problem significantly different from the problems in online optimization, which seek to minimize regret over the past, because online optimization models problems where exploration and exploitation are simultaneous. The central difficulty in online optimization is the lack of information, whereas the difficulty in optimizing future utilization is computational; in fact, the latter is provably NP-Hard. We presented constant-factor approximation algorithms that yield sequential policies for several extensions of this basic problem. These algorithms proceed via LP rounding and reveal a surprising connection to stochastic packing algorithms. We also showed that the sequential policies we develop are within a constant factor of a fully adaptive solution. Note that a constant-factor adaptivity gap result is the best possible. There are several challenging open questions arising from this work; we mention two of them. First, we conjecture that constructing a (possibly adaptive) strategy for the budgeted learning problem is APX-Hard, i.e., that there exists an absolute constant $c > 1$ such that it is NP-Hard to produce a solution that is within factor c of the optimum.
7 Conclusions

We studied the classical stochastic multi-armed bandit problem under the future-utilization objective in the presence of priors. This model is relevant to settings involving data acquisition and the design of experiments. In this problem the exploration phase necessarily precedes the exploitation phase. This makes the problem significantly different from problems in online optimization, which seek to minimize regret over the past, because online optimization models settings where exploration and exploitation are simultaneous. The central difficulty of online optimization is the lack of information, whereas the difficulty in optimizing future utilization is computational; in fact, the latter is provably NP-Hard. We presented constant-factor approximation algorithms that yield sequential policies for several extensions of this basic problem. These algorithms proceed via LP rounding and show a surprising connection to stochastic packing algorithms. We also show that the sequential policy we develop is within a constant factor of a fully adaptive solution. Note that a constant-factor adaptivity gap result is the best possible.

There are several challenging open questions arising from this work; we mention two of them. First, we conjecture that constructing a (possibly adaptive) strategy for the budgeted learning problem is APX-Hard, i.e., there exists an absolute constant c > 1 such that it is NP-Hard to produce a solution within factor c of the optimum. Secondly, we have focused exclusively on utility maximization; it would be interesting to explore other objectives, such as minimizing residual information [35].

Acknowledgment: We would like to thank Jen Burge, Vincent Conitzer, Ashish Goel, Ronald Parr, and Fernando Pereira for helpful discussions.

References

[1] A. Akella, B. M. Maggs, S. Seshan, A. Shaikh, and R. K. Sitaraman. A measurement-based analysis of multihoming. In ACM SIGCOMM Conference, pages 353–364, 2003.
[2] K. J. Arrow, D. Blackwell, and M. A. Girshick. Bayes and minimax solutions of sequential decision problems. Econometrica, 17:213–244, 1949.
[3] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-arm bandit problem. In Proc. of the 1995 Annual Symp. on Foundations of Computer Science, pages 322–331, 1995.
[6] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 28–39, 2003.
[7] S. Babu and P. Bizarro. Proactive reoptimization. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, 2005.
[8] S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive ordering of pipelined stream filters. In Proc. of the 2004 ACM SIGMOD Intl. Conf. on Management of Data, pages 407–418, June 2004.
[9] S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive caching for continuous queries. In Proc. of the 2005 Intl. Conf. on Data Engineering, 2005.
[10] J. S. Banks and R. K. Sundaram. Switching costs and the Gittins index. Econometrica, 62(3):687–694, 1994.
[11] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[12] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK, 1985.
[13] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2001.
[14] D. Bertsimas, D. Gamarnik, and J. Tsitsiklis. Performance of multiclass Markovian queueing networks via piecewise linear Lyapunov functions. Annals of Applied Probability, 11(4):1384–1428, 2002.
[15] D. Bertsimas and J. Niño-Mora. Conservation laws, extended polymatroids and multi-armed bandit problems: A unified polyhedral approach. Math. of Oper. Res., 21(2):257–306, 1996.
[16] D. Bertsimas and J. Niño-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res., 48(1):80–90, 2000.
[17] D. Bertsimas, I. Paschalidis, and J. N. Tsitsiklis. Optimization of multiclass queueing networks: Polyhedral and nonlinear characterizations of achievable performance. Annals of Applied Probability, 4(1):43–75, 1994.
[18] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.
[19] M. Charikar, C. Chekuri, and M. Pál. Sampling bounds for stochastic optimization. In APPROX-RANDOM, pages 257–269, 2005.
[20] Y. S. Chow, S. Moriguti, H. E. Robbins, and S. M. Samuels. Optimal selection based on relative rank – the secretary problem. Israel Journal of Math., 2:81–90, 1964.
[21] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[22] B. Dean. Approximation Algorithms for Stochastic Scheduling Problems. PhD thesis, MIT, 2005.
[23] B. C. Dean, M. X. Goemans, and J. Vondrák. Approximating the stochastic knapsack problem: The benefit of adaptivity. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 208–217, 2004.
[24] B. C. Dean, M. X. Goemans, and J. Vondrák. Adaptivity and approximation for stochastic packing problems. In SODA '05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 395–404, 2005.
[25] A. Flaxman, A. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Annual ACM-SIAM Symp. on Discrete Algorithms, 2005.
[26] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics (European Meeting of Statisticians), 1972.
[27] A. Goel, S. Guha, and K. Munagala. Asking the right questions: Model-driven optimization using probes. In Proc. of the 2006 ACM Symp. on Principles of Database Systems, 2006.
[28] A. Goel and P. Indyk. Stochastic load balancing and related problems. In Proc. of the 1999 Annual Symp. on Foundations of Computer Science, 1999.
[29] S. Guha and K. Munagala. Approximation algorithms for budgeted learning problems. In Proc. ACM Symp. on Theory of Computing (STOC), 2007.
[30] S. Guha and K. Munagala. Model-driven optimization using adaptive probes. In Proc. ACM-SIAM Symp. on Discrete Algorithms (SODA), 2007.
[31] S. Guha, K. Munagala, and S. Sarkar. Jointly optimal probing and transmission strategies for multi-channel wireless systems. CoRR, abs/0804.1724, 2008.
[32] M. J. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS, pages 1001–1007, 1999.
[33] J. Kleinberg, Y. Rabani, and É. Tardos. Allocating bandwidth for bursty connections. SIAM J. Comput., 30(1), 2000.
[34] A. J. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM J. on Optimization, 12(2):479–502, 2002.
[35] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI 2005), 2005.
[36] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[37] S. H. Low and D. E. Lapsley. Optimization flow control-I: Basic algorithm and convergence. IEEE/ACM Trans. Netw., 7(6):861–874, 1999.
[38] O. Madani, D. J. Lizotte, and R. Greiner. Active model selection. In UAI '04: Proc. 20th Conf. on Uncertainty in Artificial Intelligence, pages 357–365, 2004.
[39] R. H. Möhring, A. S. Schulz, and M. Uetz. Approximation in stochastic scheduling: The power of LP-based priority policies. J. ACM, 46(6):924–942, 1999.
[40] J. Niño-Mora. Restless bandits, partial conservation laws and indexability. Adv. in Appl. Probab., 33(1):76–98, 2001.
[41] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
[42] J. Schneider and A. Moore. Active learning in discrete input spaces. In 34th Interface Symp., 2002.
[43] D. Shmoys and C. Swamy. Stochastic optimization is (almost) as easy as deterministic optimization. In Proc. 45th IEEE Symp. on Foundations of Computer Science, pages 228–237, 2004.
[44] A. Silberstein, R. Braynard, C. Ellis, K. Munagala, and J. Yang. A sampling-based approach to optimizing top-k queries in sensor networks. In Proc. of the Intl. Conf. on Data Engineering, 2006.
[45] M. Skutella and M. Uetz. Scheduling precedence-constrained jobs with stochastic processing times on parallel machines. In Proc. 12th ACM-SIAM Symp. on Discrete Algorithms, pages 589–590, 2001.
[46] J. L. Snell. Applications of martingale system theorems. Transactions of the American Math. Society, 73:293–312, 1952.
[47] M. P. van Oyen, D. G. Pandelis, and D. Teneketzis. Optimality of index policies for stochastic scheduling with switching penalties. Journal of Applied Probability, pages 957–966, 1992.
[48] V. Vazirani. Approximation Algorithms. Springer, 2001.
[49] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[50] G. B. Wetherill and K. D. Glazebrook. Sequential Methods in Statistics (Monographs on Statistics and Applied Probability). Chapman & Hall, London, 1986.
[51] P. Whittle. Optimization Over Time: Dynamic Programming and Stochastic Control, Volume 1. Wiley, New York, 1982.
[52] P. Whittle. Optimization Over Time: Dynamic Programming and Stochastic Control, Volume 2. Wiley, New York, 1983.
[53] P. Whittle. Restless bandits: Activity allocation in a changing world. Appl. Prob., 25(A):287–298, 1988.