Multi-scale Online Learning and its Applications to Online Auctions†

Sébastien Bubeck  sebubeck@microsoft.com
Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, USA.

Nikhil Devanur  nikdev@microsoft.com
Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, USA.

Zhiyi Huang  zhiyi@cs.hku.hk
Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong.

Rad Niazadeh  rad@cs.stanford.edu
Department of Computer Science, Stanford University, Stanford, CA 94305, USA.

Editor: Csaba Szepesvari

Abstract

We consider revenue maximization in online auction/pricing problems. A seller sells an identical item in each period to a new buyer, or a new set of buyers. For the online pricing problem, we show regret bounds that scale with the best fixed price, rather than the range of the values. We also show regret bounds that are almost scale free, and match the offline sample complexity, when comparing to a benchmark that requires a lower bound on the market share. These results are obtained by generalizing the classical learning from experts and multi-armed bandit problems to their multi-scale versions. In this version, the reward of each action is in a different range, and the regret with respect to a given action scales with its own range, rather than the maximum range.

Keywords: online learning, multi-scale learning, auction theory, bandit information, sample complexity

1. Introduction

Consider the following revenue maximization problem in a repeated setting, called the online posted pricing problem. In each period, the seller has a single item to sell, and a new prospective buyer. The seller offers to sell the item to the buyer at a given price; the buyer buys the item if and only if the price is below his private valuation for the item.
The private valuation of the buyer itself is never revealed to the seller. How should a monopolistic seller iteratively set the prices if he wishes to maximize his revenue? What if he also cares about the market share, i.e., the fraction of time periods at which the item is sold?

† Following the theoretical computer science convention, we used alphabetical author ordering.

Estimating price sensitivities and demand models in order to optimize revenue and market share is the bedrock of econometrics. The emergence of online marketplaces has enabled sellers to costlessly change prices, as well as collect huge amounts of data. This has renewed the interest in understanding best practices for data-driven pricing. The extreme case of this, when the price is updated for each buyer, is the online pricing problem described above; one can always use this for less frequent price updates. Moreover, this problem is intimately related to the classical experimentation and estimation procedures.

This problem has been studied from an online learning perspective, as a variant of the multi-armed bandit problem. In this variant, there is an arm for each possible price (presumably after an appropriate discretization). The revenue of each arm p is either p or zero, depending on whether the arriving value is at least equal to the price p or smaller than the price p, respectively. The total revenue of a pricing algorithm is then compared to the total revenue of the best fixed posted price in hindsight. The difference between the two, called the regret, is then bounded from above. No assumption is made on the distribution of values; the regret bounds are required to hold for the worst-case sequence of values. Blum et al.
(2004) assume that the buyer valuations are in [1, h], and show the following multiplicative-plus-additive bound on the regret: for any ε ∈ (0, 1), the regret is at most ε times the revenue of the optimal price, plus O(ε⁻² h log h log log h). Blum and Hartline (2005) show that the additive factor can be made to be O(ε⁻³ h log log h), trading off a log h factor for an extra ε⁻¹ factor.

An undesirable aspect of these bounds is that they scale linearly with h; this is particularly problematic when h is an estimate and we might set it to be a generous upper bound on the range of prices we wish to consider. A typical use case is when the same algorithm is used for many different products, with widely varying price ranges. We may not be able to manually tune the range for each product separately. One might wonder if this dependence on h is unavoidable, as it seems to be reflected by the existing lower bounds for this problem in the literature (lower bounds are discussed later in the introduction in more detail). Interestingly, in all of these lower-bound instances the best fixed price is equal to h itself; therefore, it is not clear whether this dependency on h is required for instances where h is only a pessimistic upper bound on the best fixed price. We now ask the following question:

Question: do online learning algorithms exist for the online posted pricing problem, such that their regrets are proportional to the best fixed price instead of the highest value?

Standard off-the-shelf bounds allow regret to depend on the loss of the best arm instead of the worst-case loss. However, even such bounds still depend linearly on the maximum range of all the losses, and thus they would not allow replacing h by the best fixed price. Fortunately, in the online pricing problem the reward function of the arms is well structured.
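To make the arm structure concrete, here is a minimal sketch (the value sequence and price grid are hypothetical, not from the paper) of the best-fixed-price benchmark under this reward model: arm p earns p when the arriving value is at least p, and zero otherwise.

```python
# Sketch of the price-arm reward structure: revenue of arm p on value v
# is p if v >= p, else 0; the benchmark is the best fixed price in hindsight.

def revenue_of_price(p, values):
    """Total revenue of posting the fixed price p against a value sequence."""
    return sum(p for v in values if v >= p)

def best_fixed_price(prices, values):
    """Best fixed price in hindsight over a discretized price grid."""
    return max(prices, key=lambda p: revenue_of_price(p, values))

values = [3.0, 1.5, 8.0, 2.0, 2.5]          # hypothetical buyer valuations
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 8.0]     # discretized price arms
p_star = best_fixed_price(prices, values)   # here: 2.0, with revenue 8.0
```

Note that a high price can earn as much as a low one from far fewer sales, which is exactly why a fixed additive regret scale across all arms is wasteful.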
In particular, a neat observation is that the reward of arm p is upper-bounded by p itself (and not only by the maximum value). Can we use this structure in our favor to improve the standard regret bounds? We answer this question in the affirmative by means of reducing the problem to a pure learning problem termed multi-scale online learning.

1.1 Multi-scale online learning

The main technical ingredients in our results are variants of the classical problems of learning from expert advice and multi-armed bandits. We introduce the multi-scale versions of these problems, where each action has its reward bounded in a different range. Here, we seek to design online learning algorithms that guarantee multi-scale regret bounds, i.e., their regret with respect to each action scales with the range of that particular action, instead of the maximum possible range. These guarantees are in contrast with the regret bounds of the standard versions, which scale with the maximum range.

Main result (informal): we give algorithms for the full information and bandit information versions of the multi-scale online learning problem with multi-scale regret guarantees.

While we use these bounds mostly for designing online auctions and pricing mechanisms, we expect such bounds to be of independent interest. The main idea behind our algorithms is to use a tailored variant of online (stochastic) mirror descent (OSMD) (Bubeck, 2011). In this tailored version, the algorithm uses a weighted negative entropy as the Legendre function (also known as the mirror map), where the weight of each term i (corresponding to arm i) is equal to the range of that arm.
More formally, assuming the range of arm i is equal to c_i, our mirror descent algorithms (Algorithm 1 for full information, and Algorithm 3 for bandit information) use the following mirror map:

F(x) = Σ_{arms i} c_i · x_i ln(x_i)

Intuitively speaking, these algorithms take into account the different ranges of different arms by first normalizing the reward of each arm by its range (i.e., dividing the reward of arm i by its corresponding range c_i), and then projecting the updated weights by performing a smooth multi-scale projection onto the simplex. This projection is an instance of the more general Bregman projection (Bubeck, 2011) for the special case of the weighted negative entropy as the mirror map. The mirror descent framework then gives regret bounds in terms of a "local norm" as well as an "initial divergence", which we then bound differently for each version of the problem. In the technical sections we highlight how subtle variations arise as a result of the different techniques used to bound these two terms.

While our algorithms have the style of the multiplicative weights update (up to a normalization of the rewards), the smooth projection step at each iteration makes them drastically different. To shed some insight on this projection step, which plays an important role in our analysis, consider a very special case of the problem where the reward of each arm i is deterministically equal to c_i. The multiplicative weights algorithm picks arm i with a probability proportional to exp(c_i). However, as is clear from the description of Algorithm 1, our algorithm uniformly scales the weight of each arm first. Then, in the projection step, the weight of each arm i is multiplied by exp(−λ*/c_i) for some parameter λ*.
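The smooth multi-scale projection can be sketched as follows (an assumed implementation via bisection; the function name and tolerance are ours, not the paper's). It finds the λ* that turns the scaled weights into a probability distribution.

```python
import math

# Sketch of the smooth multi-scale projection: find lambda* such that the
# scaled weights w_i * exp(-lambda*/c_i) sum to one over the arms.

def multiscale_project(w, c, tol=1e-12):
    """Bisection for lambda* with sum_i w_i * exp(-lambda*/c_i) = 1."""
    def total(lam):
        return sum(wi * math.exp(-lam / ci) for wi, ci in zip(w, c))
    lo, hi = -1.0, 1.0
    while total(lo) < 1.0:   # total() is decreasing in lambda (c_i > 0),
        lo *= 2.0            # so widen left until it exceeds one
    while total(hi) > 1.0:   # and widen right until it drops below one
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    lam = (lo + hi) / 2.0
    return [wi * math.exp(-lam / ci) for wi, ci in zip(w, c)]

p = multiscale_project([0.5, 0.8, 0.3], [1.0, 2.0, 5.0])  # sums to one
```

Since the exponent −λ*/c_i shrinks in magnitude as c_i grows, a positive λ* penalizes small-range arms the most, which is the mechanism behind the argmax approximation discussed next.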
Hence, arm i will be sampled with a probability proportional to exp(−λ*/c_i) (which is a smooth approximation to i* = argmax_i c_i, but in a different way compared to the vanilla multiplicative weights).

The multi-scale versions exhibit subtle variations that do not appear in the standard versions. First of all, our applications to auctions and pricing have non-negative rewards, and this actually makes a difference. For both the expert and the bandit versions, the minimax regret bounds for non-negative rewards are provably better than those when rewards could be negative. Further, for the bandit version, we can prove a better bound if we only require the bound to hold with respect to the best action, rather than all actions (for non-negative rewards). The various regret bounds and comparisons to standard bounds are summarized in Table 1.

                     | Standard regret bound O(·) | Multi-scale upper bound O(·) (this paper) | Multi-scale lower bound Ω(·) (this paper)
Experts/non-negative | c_max·√(T log k) *         | c_i·√(T log(kT))                          | c_i·√(T log k)
Bandits/non-negative | c_max·√(Tk) †              | c_i·T^{2/3}·(k log(kT))^{1/3}             | c_i·√(Tk)
                     |                            | c_{i*}·√(Tk log k), i* is the best action | —
Experts/symmetric    | c_max·√(T log k) *         | c_i·√(T log(k·c_max/c_min))               | c_i·√(T log k)
Bandits/symmetric    | c_max·√(Tk) †              | c_i·√(Tk·(c_max/c_min)·log(kT·c_max/c_min)) | c_i·√(Tk·c_max/c_min)

* Freund and Schapire (1995); † Audibert and Bubeck (2009).

Table 1: Pure-additive regret bounds for non-negative rewards, i.e., when the reward of any action i at any time is in [0, c_i], and symmetric-range rewards, i.e., when the reward of any action i at any time is in [−c_i, c_i] (here T is the time horizon, A is the action set, and k is the number of actions).
1.2 The implications for online auctions and pricing

As a direct application of our multi-scale online learning framework, somewhat surprisingly,

Second contribution: we show that we can get regret proportional to the best fixed price instead of the highest value for the online posted pricing problem

(i.e., we can replace h by the best fixed price, which is used in the definition of the benchmark). In particular, we show that the additive bound can be made to be O(ε⁻² p* log h), where p* is the best fixed price in hindsight. This allows us to use a very generous estimate for h and let the algorithm adapt to the actual range of prices; we only lose a log h factor. The algorithm balances the exploration probabilities of different prices carefully and automatically zooms in on the relevant price range. This does not violate the known lower bounds, since in those instances p* is close to h.

Bar-Yossef et al. (2002), Blum et al. (2004), and Blum and Hartline (2005) also consider the "full information" version of the problem, or what we call the online (single buyer) auction problem, where the valuations of the buyers are revealed to the algorithm after the buyer has made a decision. Such information may be available in a context where the buyers have to bid for the items, and are awarded the item if their bid is above a hidden price. In this case, the additive term can be improved to O(ε⁻¹ h log(ε⁻¹)), which is tight. Once again, by a reduction to multi-scale online learning, we show that h can be replaced with p*; in particular, we show that the additive term can be made to be O(ε⁻¹ p* log(h ε⁻¹)).
1.3 Purely multiplicative bounds and sample complexity

The regret bounds mentioned above can be turned into a purely multiplicative factor in the following way: for any ε > 0, the algorithm is guaranteed to get a 1 − O(ε) fraction of the best fixed price revenue, provided the number of periods T ≥ E/ε, where E is the additive term in the regret bounds above. This follows from the observation that a revenue of T is a lower bound on the best fixed price revenue. Define the number of periods required to get a 1 − ε multiplicative approximation (as a function of ε) to be the convergence rate of the algorithm.

A 1 − ε multiplicative factor is also the target in the recent line of work on the sample complexity of auctions, started by Balcan et al. (2008); Elkind (2007); Dhangwatnotai et al. (2014); Cole and Roughgarden (2014). (We give a more comprehensive discussion of this line of work in Section 1.4.) Here, i.i.d. samples of the valuations are given from a fixed but unknown distribution, and the goal is to find a price such that its revenue with respect to the hidden distribution is a 1 − ε fraction of the optimum revenue for this distribution. The sample complexity is the minimum number of samples needed to guarantee this (as a function of ε).

The sample complexity and the convergence rate (for the full information setting) are closely related to each other. The sample complexity is always smaller than the convergence rate; the problem is easier for the following reasons.

1. The valuations are i.i.d. in the case of sample complexity, whereas they can be arbitrary (worst case) in the case of convergence rate.

2. Sample complexity corresponds to an offline problem: you get all the samples at once. Convergence rate corresponds to an online problem: you need to decide what to do on a given valuation without knowing what valuations arrive in the future.
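One way to spell out this conversion (with Rev denoting the algorithm's revenue and OPT the best fixed price revenue, shorthand that is ours): when the values are at least 1, posting the minimum price guarantees OPT ≥ T, so for T ≥ E/ε the additive term is absorbed into the multiplicative one:

```latex
\mathrm{Rev}
  \;\ge\; (1-\epsilon)\,\mathrm{OPT} - E
  \;\ge\; (1-\epsilon)\,\mathrm{OPT} - \epsilon T
  \;\ge\; (1-\epsilon)\,\mathrm{OPT} - \epsilon\,\mathrm{OPT}
  \;=\; (1-2\epsilon)\,\mathrm{OPT},
```

i.e., a 1 − O(ε) fraction of the best fixed price revenue.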
This is formalized in terms of an online-to-offline reduction [folklore], which shows that a convergence rate upper bound can be automatically translated to a sample complexity upper bound. This lets us convert sample complexity lower bounds into lower bounds on the convergence rate, and in turn into lower bounds on the additive error E in an additive-plus-multiplicative regret bound. For example, the additive error for the online auction problem (and hence also for the posted pricing problem*) cannot be o(h ε⁻¹) (Huang et al., 2015b). Moreover, it is insightful to compare the convergence rates we show with the best known sample complexity upper bounds; proving better convergence rates would mean improving these bounds as well.

A natural target convergence rate for a problem is therefore the corresponding sample complexity, but achieving this is not always trivial. In particular, we consider an interesting version of the sample complexity bound for auctions, for which no analogous convergence rate bound is known in the literature. This version takes into account both revenue and market share, and gets sample complexity bounds that are scale free; there is no dependence on h, which means it works for unbounded valuations! For any δ ∈ (0, 1), the best fixed price benchmark is relaxed to ignore those prices whose market share (which is equivalent to the probability of sale) is below a δ fraction; as δ increases the benchmark is lower. This is a meaningful benchmark since in many cases revenue is not the only goal, even for a monopolist. A more reasonable goal is to maximize revenue subject to the constraint that the market share is above a certain threshold.

* We conjecture that the lower bound for the posted pricing problem should be worse by a factor of ε⁻¹, since one needs to explore about ε⁻¹ different prices.
What is more, this gives a sample complexity of O(ε⁻² δ⁻¹ log(δ⁻¹ ε⁻¹)) (Huang et al., 2015b). In fact, δ can be set to h⁻¹ without loss of generality when the values are in [1, h],† and the above bound then matches the sample complexity with respect to the best fixed price revenue. In addition, this bound gives a precise interpolation: as the target market share δ increases, the number of samples needed decreases almost linearly.

Third contribution: we show a convergence rate that almost matches the above sample complexity, for the full information setting.

We have a mild dependence on h; the rate is proportional to log log h. Further, we also show a near optimal convergence rate for the online posted pricing problem.‡

Multiple buyers: All of our results in the full information (online auction) setting extend to the multiple buyer model. In this model, in each time period, a new set of n buyers competes for a single item. The seller runs a truthful auction that determines the winning buyer and his payment. The benchmark here is the set of all "Myerson-type" mechanisms. These are mechanisms that are optimal when each period has n buyers of potentially different types, and the value of each buyer is drawn independently from a type-dependent distribution. In fact, our convergence rates also imply new sample complexity bounds for these problems (except that they are not computationally efficient). The various bounds and comparisons to previous work are summarized in Tables 2 & 3.

1.4 Other related work

The online pricing problem, also called dynamic pricing, is a much studied topic across disciplines such as operations research and management science (Talluri and Van Ryzin, 2006), economics (Segal, 2003), marketing, and of course computer science. The multi-armed bandit approach to pricing is particularly popular.
See den Boer (2015) for a recent survey of various approaches to the problem. Kleinberg and Leighton (2003) consider the online pricing problem under the assumption that the values are in [0, 1], and considered purely additive factors. They showed that the minimax additive regret is Θ̃(T^{2/3}), where T is the number of periods. This is similar in spirit to regret bounds that scale with h, since one has to normalize the values so that they are in [0, 1]. The finer distinction about the magnitude of the best fixed price is absent in this work. Recently, Syrgkanis (2017) also considered the online auction problem, with an emphasis on a notion of "oracle based" computational efficiency. They assume the values are all in [0, 1] and do not consider the scaling issue that we do; this makes their contribution orthogonal to ours.

Starting with Dhangwatnotai et al. (2014), there has been a spate of recent results analyzing the sample complexity of pricing and auction problems. Cole and Roughgarden (2014) and Devanur et al. (2016) consider multiple buyer auctions with regular distributions (with unbounded valuations) and give sample complexity bounds that are polynomial in n and ε⁻¹, where n is the number of buyers. Morgenstern and Roughgarden (2015) consider arbitrary distributions with values bounded by h, and gave bounds that are polynomial in n, h, and ε⁻¹. Roughgarden and Schrijvers (2016) and Huang et al. (2015b) give further improvements on the single- and multi-buyer versions respectively; Tables 2 and 3 give a comparison of these results with our bounds, for the problems we consider.

† When the values are in [1, h], we can guarantee a revenue of T by posting a price of 1, and to beat this, any other price (and in particular a price of h) would have to sell at least T/h times.

‡ Unfortunately, we cannot yet guarantee that our online algorithm itself gets a market share of δ, although we strongly believe that it does. Showing such bounds on the market share of the algorithm is an important avenue for future research.

                            | Lower bound            | Best known sample complexity (upper) | Best known convergence rate (upper) | This paper (Thm. 16)
Online single buyer auction | Ω(h/ε²) *              | Õ(h/ε²) †                            | Õ(h/ε²) †                           | Õ(p*/ε²)
Online posted pricing       | Ω(max{h/ε², 1/ε³}) *§  | —                                    | Õ(h/ε³) †                           | Õ(p*/ε³)
Online multi buyer auction  | Ω(h/ε²) *              | O(nh/ε³) ‡                           | —                                   | Õ(nh/ε³)

* Huang et al. (2015b); † Blum et al. (2004); ‡ Devanur et al. (2016); Gonczarowski and Nisan (2017); Elkind (2007); § Kleinberg and Leighton (2003).

Table 2: Number of rounds/samples needed to get a 1 − ε approximation to the best offline price/mechanism. Sample complexity is for the offline case with i.i.d. samples from an unknown distribution. Convergence rate is for the online case with a worst case sequence. Sample complexity is always no larger than the convergence rate. Lower bounds hold for sample complexity too, except for the online posted pricing problem, for which there is no sample complexity version. The additive-plus-multiplicative regret bounds are converted to convergence rates by dividing the additive error by ε. In the last row, n is the number of buyers. In the last column, p* denotes the optimal price.

                            | Lower bound (sample complexity) | Best known sample complexity (upper) | This paper (Thm. 17)
Online single buyer auction | Ω(1/(ε²δ)) *                    | Õ(1/(ε²δ)) *                         | Õ(1/(ε²δ))
Online posted pricing       | Ω(max{1/(ε²δ), 1/ε³}) *†        | —                                    | Õ(1/(ε⁴δ))
Online multi buyer auction  | Ω(1/(ε²δ)) *                    | —                                    | Õ(n/(ε³δ))

* Huang et al. (2015b); † Kleinberg and Leighton (2003).

Table 3: Sample complexity & convergence rate w.r.t. the optimal mechanism/price with market share ≥ δ.
The dynamic pricing problem has also been studied when there is a given number of copies of the item to sell (limited supply) (Agrawal and Devanur, 2014; Babaioff et al., 2015; Badanidiyuru et al., 2013; Besbes and Zeevi, 2009). There are also variants where the seller interacts with the same buyer repeatedly, and the buyer can strategize to influence his utility in future periods (Amin et al., 2013).

Foster et al. (2017) also consider the multi-scale online learning problem, motivated by a model selection problem. They consider additive bounds, for the symmetric case, for full information, but not bandit feedback. Their regret bounds are not comparable to ours in general; our bounds are better for the pricing/auction applications we consider, and their bounds are better for their application.

Organization We start in Section 2 by showing regret upper bounds for the multi-scale experts problem with non-negative rewards (Theorem 1). The corresponding upper bounds for the bandit version are in Section 3 (Theorem 12). In Section 4 we show how the multi-scale regret bounds (Theorems 1 and 12) imply the corresponding bounds for the auction/pricing problems (Theorems 16 and 17). Finally, the regret (upper and lower) bounds for the symmetric range are discussed in Section 5 (Theorems 18, 20, 21, and 23).

2. Full Information Multi-scale Online Learning

We consider a variety of online algorithmic problems that are all part of the multi-scale online learning framework. We start by defining this framework, in which different actions have different ranges. We exploit this structure and express our results in terms of action-specific regret bounds for this general problem. To obtain these results, we use a variant of online mirror descent and propose a multiplicative-weight-update style learning algorithm for our problem, termed the Multi-Scale Multiplicative-Weight (MSMW) algorithm.
Next, we investigate the single buyer auction problem (or equivalently the full-information single buyer dynamic pricing problem) as a canonical application, and show how to get multiplicative-cum-additive approximations here with the help of the multi-scale online learning framework. To show the tightness of our bounds, we compare the convergence rate of our dynamic pricing with the sample complexity of a closely related offline problem, i.e., near optimal Bayesian revenue maximization from samples (Cole and Roughgarden, 2014).

2.1 The framework

Our full-information multi-scale online learning framework is basically the classical learning from expert advice problem. The main difference is that the range of rewards of different experts could be different. More formally, suppose there is a set of actions A.§ The online problem proceeds in T rounds, where in each round t ∈ [T]:¶

• The adversary picks a reward function g(t), where g_i(t) is the reward of action i.
• The algorithm simultaneously picks an action i_t ∈ A.
• Then the algorithm gets the reward g_{i_t}(t) and observes the entire reward function g(t).

The total reward of the algorithm is denoted by G_alg := Σ_{t=1}^T g_{i_t}(t). The standard "best fixed action" benchmark is G_max := max_{i ∈ A} Σ_{t=1}^T g_i(t). We further assume that the action set is finite. Without loss of generality, if the action set is of size k, we identify A = [k]. The reward g(t) is such that for all i ∈ A, g_i(t) ∈ [0, c_i], where c_i ∈ R_+ is the range of action i.

§ We use the terms experts, arms and actions interchangeably in this paper.
¶ We use the notation [n] := {1, 2, ..., n}, for any n ∈ N.

2.2 Multi-scale regret bounds

We prove action-specific regret bounds, which we also call multi-scale regret guarantees.
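The protocol above can be written as a small harness (a hypothetical test rig; the names are ours, not the paper's) that plays an algorithm against a reward sequence and reports the regret against every fixed action:

```python
# Sketch of the full-information multi-scale experts protocol: each arm i
# has rewards in its own range [0, c_i], and we track the action-specific
# quantities regret_i = G_i - G_alg.

def play(algorithm, rewards):
    """rewards[t][i] in [0, c_i]; algorithm maps the round index to an arm."""
    T, k = len(rewards), len(rewards[0])
    g_alg = 0.0
    g_arm = [0.0] * k
    for t in range(T):
        i_t = algorithm(t)
        g_alg += rewards[t][i_t]          # reward collected by the learner
        for i in range(k):
            g_arm[i] += rewards[t][i]     # full information: all arms observed
    return [g_arm[i] - g_alg for i in range(k)]  # regret w.r.t. each action

# Example: two arms with ranges c = [1, 10]; a learner that always plays arm 0
# suffers regret 0 against arm 0 but regret 45 against the large-range arm.
regrets = play(lambda t: 0, [[1.0, 10.0]] * 5)
```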
Towards this end, we define the following quantities:

G_i := Σ_{t ∈ [T]} g_i(t),    (1)
regret_i := G_i − G_alg.    (2)

The regret bound w.r.t. action i, i.e., an upper bound on E[regret_i], depends on the range c_i, as well as on a prior distribution π over the action set A; this way, we can handle countably many actions. Let c_min = inf_{i ∈ A} c_i and c_max = sup_{i ∈ A} c_i (if applicable) be the minimum and the maximum range. We first state a version of the regret bound which is parameterized by ε > 0; such bounds are stronger than the more standard √T-type bounds.

Theorem 1 (Main Result) There exists an algorithm for the full-information multi-scale online learning problem that takes as input any distribution π over A, the ranges c_i for all i ∈ A, and a parameter 0 < ε ≤ 1, and satisfies:

∀i ∈ A:  E[regret_i] ≤ ε · G_i + O( (1/ε) · log(1/π_i) · c_i ).    (3)

Compare this to what you get by using the standard analysis for the experts problem (Arora et al., 2012), where the second term in the regret bound is O((1/ε) · log(k) · c_max). Choosing π to be the uniform distribution in the above theorem gives O((1/ε) · log(k) · c_i). Also, one can compare the pure-additive version of this bound with the classic pure-additive regret bound O(c_max · √(T log k)) for the experts problem, by setting ε = √(log(kT)/T) (Corollary 2).

Corollary 2 There exists an algorithm for the full-information multi-scale online learning problem that takes as input the ranges c_i for all i ∈ A, and satisfies:

∀i ∈ A:  E[regret_i] ≤ O( c_i · √(T log(kT)) ).    (4)

Remark 3 We should stress that a multi-scale regret guarantee provides a separate regret bound for each action, where the bound on the regret of action i scales only linearly with c_i. This type of guarantee should "not" be mistaken for a bound on the worst action.

Here is the map of the rest of this section.
In Section 2.3 we propose an algorithm that exploits the reward structure, and later in Section 2.4 we show how this algorithm is an online mirror descent with the weighted negative entropy as its mirror map. For reward-only instances, we prove the regret bound in Section 2.5. We finally turn our attention to the single buyer online auction problem in Section 2.6.

2.3 Multi-Scale Multiplicative-Weight (MSMW) algorithm

We achieve our regret bound in Theorem 1 by using the MSMW algorithm (Algorithm 1). The main idea behind this algorithm is to take into account the different ranges of different experts, and therefore:

1. We normalize the reward of each expert accordingly, i.e., divide the reward of expert i by its corresponding range c_i;
2. We project the updated weights by performing a smooth multi-scale projection onto the simplex: the algorithm finds a λ* such that multiplying the current weight of each expert i by exp(−λ*/c_i) makes a probability distribution over the experts. It then uses this resulting probability distribution for sampling the next expert.

Algorithm 1 MSMW
1: input: initial distribution µ over A, learning rate 0 < η ≤ 1.
2: initialize p(1) such that p_i(1) = µ_i for all i ∈ A.
3: for t = 1, ..., T do
4:   Randomly pick an action drawn from p(t), and observe g(t).
5:   ∀i ∈ A: w_i(t+1) ← p_i(t) · exp(η · g_i(t)/c_i).
6:   Find λ* (e.g., by binary search) s.t. Σ_{i ∈ A} w_i(t+1) · exp(−λ*/c_i) = 1.
7:   ∀i ∈ A: p_i(t+1) ← w_i(t+1) · exp(−λ*/c_i).
8: end for

2.4 Equivalence to online mirror descent with weighted negative entropy

While it is possible to analyze the regret of the MSMW algorithm (Algorithm 1) from first principles, we take a different approach (the elementary analysis can still be found in the appendix, Section A.2).
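Algorithm 1 can be sketched in Python as follows (an assumed rendering, not the authors' code; the learning rate, seed, and bisection bracket are illustrative choices of ours):

```python
import math, random

# Sketch of MSMW (Algorithm 1): multiplicative update with per-arm
# normalized rewards, then the smooth multi-scale projection
# p_i <- w_i * exp(-lambda*/c_i), with lambda* found by binary search.

def msmw(mu, c, reward_fn, T, eta=0.1):
    """mu: initial distribution; c: per-arm ranges; reward_fn(t) -> rewards."""
    k = len(c)
    p = list(mu)
    total_reward = 0.0
    for t in range(T):
        i_t = random.choices(range(k), weights=p)[0]   # sample from p(t)
        g = reward_fn(t)
        total_reward += g[i_t]
        w = [p[i] * math.exp(eta * g[i] / c[i]) for i in range(k)]
        # Binary search for lambda* with sum_i w_i * exp(-lambda*/c_i) = 1.
        lo, hi = -50.0, 50.0
        for _ in range(100):
            lam = (lo + hi) / 2.0
            s = sum(w[i] * math.exp(-lam / c[i]) for i in range(k))
            if s > 1.0:
                lo = lam
            else:
                hi = lam
        p = [w[i] * math.exp(-lam / c[i]) for i in range(k)]
    return total_reward, p

random.seed(0)
rev, p = msmw([0.5, 0.5], [1.0, 10.0], lambda t: [1.0, 10.0], T=50)
```

Note the special case in the example: both arms always pay their full range, so the normalized rewards coincide and only the projection step differentiates the arms, exactly as in the discussion of Section 1.1.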
We show that this algorithm is indeed an instance of the Online Mirror Descent (OMD) algorithm for a particular choice of the Legendre function (also known as the mirror map).

2.4.1 Preliminaries on online mirror descent

Fix an open convex set D and its closure D̄, which in our case are (0, +∞)^A and [0, +∞)^A respectively, and a closed convex action set A ⊂ D̄, which in our case is Δ_A, i.e., the set of all probability distributions over the experts in A. At the heart of an OMD algorithm there is a Legendre function F: D̄ → R, i.e., a strictly convex function that admits continuous first order partial derivatives on D and satisfies lim_{x → D̄\D} ‖∇F(x)‖ = +∞, where ∇F(·) denotes the gradient map of F. One can think of OMD as a member of the family of projected gradient descent algorithms, where the gradient update happens in the dual space ∇F(D) rather than in the primal space D, and the projection is defined using the Bregman divergence associated with F rather than the ℓ₂-distance (see Figure 1).

Figure 1: Online Mirror Descent (OMD): moving to the dual space by the gradient map (blue), gradient update in the dual space (red), applying the inverse gradient map (green), and finally projecting back to the simplex using the Bregman projection (purple).

Definition 4 (Bregman Divergence (Bubeck, 2011)) Given a Legendre function F over Δ_A, the Bregman divergence associated with F, denoted D_F: Δ_A × Δ_A → R, is defined by

D_F(x, y) = F(x) − F(y) − (x − y)^T ∇F(y).

Definition 5 (Online Mirror Descent (Bubeck, 2011)) Suppose F is a Legendre function.
At every time $t \in [T]$, the online mirror descent algorithm with Legendre function $F$ selects an expert drawn from distribution $p(t)$, and then updates $w(t)$ and $p(t)$ given the rewards $g(t)$ by:

Gradient update:
$$\nabla F(w(t+1)) = \nabla F(p(t)) + \eta \cdot g(t) \;\Rightarrow\; w(t+1) = (\nabla F)^{-1}\big(\nabla F(p(t)) + \eta \cdot g(t)\big) \quad (5)$$

Bregman projection:
$$p(t+1) = \mathop{\mathrm{argmin}}_{p \in \Delta_A} D_F(p, w(t+1)) \quad (6)$$

where $\eta > 0$ is called the learning rate of OMD.

We use the following standard regret bound for OMD (refer to Bubeck (2011) for a thorough discussion of OMD; for completeness, a proof is also provided in the appendix, Section A.3). Roughly speaking, this lemma upper bounds the regret by the sum of two terms: the "local norm" (the first term), which captures the total deviation between $p(t)$ and $w(t+1)$, and the "initial divergence" (the second term), which captures how far the initial distribution is from the target distribution.

Lemma 6 For any learning rate parameter $0 < \eta \leq 1$ and any benchmark distribution $q$ over $A$, the OMD algorithm with Legendre function $F$ satisfies:
$$\sum_{t \in [T]} g(t) \cdot (q - p(t)) \leq \frac{1}{\eta} \sum_{t \in [T]} D_F(p(t), w(t+1)) + \frac{1}{\eta} D_F(q, p(1)) \quad (7)$$

2.4.2 MSMW Algorithm as an OMD

For our application, we focus on a particular choice of Legendre function that captures different learning rates, proportional to $c_i^{-1}$, for different experts, as we saw earlier in Algorithm 1. We start by defining the weighted negative entropy function.

Definition 7 Given expert ranges $\{c_i\}_{i \in A}$, the weighted negative entropy is defined by
$$F(x) = \sum_{i \in A} c_i \cdot x_i \ln(x_i) \quad (8)$$

Corollary 8 It is straightforward to see that $F(x) = \sum_{i \in A} c_i \cdot x_i \ln(x_i)$ is a non-negative Legendre function over $\mathbb{R}_+^A$.
Moreover, $\nabla F(x)_i = c_i (1 + \ln(x_i))$ and $D_F(x, y) = \sum_{i \in A} c_i \cdot \big(x_i \ln(x_i/y_i) - x_i + y_i\big)$.

We now have the following lemma, which shows that Algorithm 1 is indeed an OMD algorithm.

Lemma 9 The MSMW algorithm, i.e., Algorithm 1, is equivalent to an OMD algorithm associated with the weighted negative entropy $F(x) = \sum_{i \in A} c_i \cdot x_i \ln(x_i)$ as its Legendre function.

Proof Look at the gradient update step of OMD, as in Equation (5), with Legendre function $F(x) = \sum_{i \in A} c_i \cdot x_i \ln(x_i)$. Using Corollary 8, we have
$$\nabla F(w(t+1)) = \nabla F(p(t)) + \eta \cdot g(t) \;\Rightarrow\; c_i (1 + \ln(w_i(t+1))) = c_i (1 + \ln(p_i(t))) + \eta \cdot g_i(t),$$
and therefore $w_i(t+1) = p_i(t) \cdot \exp(\eta \cdot g_i(t)/c_i)$. Moreover, for the Bregman projection step we have
$$p(t+1) = \mathop{\mathrm{argmin}}_{p \in \Delta_A} D_F(p, w(t+1)) = \mathop{\mathrm{argmin}}_{p \in \Delta_A} \sum_{i \in A} c_i \cdot \Big(p_i \ln\frac{p_i}{w_i(t+1)} - p_i + w_i(t+1)\Big) \quad (9)$$
This is a convex minimization over a convex set. To find a closed-form solution, we look at the Lagrangian dual function
$$\mathcal{L}(p, \lambda) \triangleq \sum_{i \in A} c_i \cdot \Big(p_i \ln\frac{p_i}{w_i(t+1)} - p_i + w_i(t+1)\Big) + \lambda\Big(\sum_{i \in A} p_i - 1\Big)$$
and the Karush-Kuhn-Tucker (KKT) conditions $\nabla \mathcal{L}(p^*, \lambda^*) = 0$. We have
$$c_i \cdot \ln\frac{p_i^*}{w_i(t+1)} + \lambda^* = 0 \;\Rightarrow\; p_i^* = w_i(t+1) \cdot \exp(-\lambda^*/c_i) \quad (10)$$
As $\sum_{i \in A} p_i^* = 1$, $\lambda^*$ must be the unique number such that $\sum_{i \in A} w_i(t+1) \cdot \exp(-\lambda^*/c_i) = 1$, and then $p_i(t+1) = w_i(t+1) \cdot \exp(-\lambda^*/c_i)$. So Algorithm 1 is equivalent to OMD with the weighted negative entropy as its Legendre function.

By combining Lemma 6, Corollary 8, and Lemma 9, we prove the following regret bound for the MSMW algorithm. We encourage the reader to also look at the appendix, Section A.2, for an alternative proof from first principles.
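As a numerical sanity check on Lemma 9, the sketch below verifies that the closed form of Equation (10) indeed minimizes the weighted-entropy Bregman divergence over the simplex: it solves the KKT condition by bisection and compares the result against random points of the simplex. The weight vector `w` and ranges `c` are arbitrary values of our own choosing.

```python
import math
import random

def bregman_div(x, y, c):
    # D_F(x, y) = sum_i c_i * (x_i ln(x_i / y_i) - x_i + y_i)
    return sum(ci * (xi * math.log(xi / yi) - xi + yi)
               for xi, yi, ci in zip(x, y, c))

random.seed(0)
c = [1.0, 2.0, 4.0]
w = [0.9, 0.7, 0.5]  # unnormalized weights after a gradient update

# lambda* from the KKT condition: sum_i w_i exp(-lam / c_i) = 1 (bisection)
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2.0
    if sum(wi * math.exp(-mid / ci) for wi, ci in zip(w, c)) > 1.0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2.0
p_star = [wi * math.exp(-lam / ci) for wi, ci in zip(w, c)]

# p_star should beat every other point of the simplex in Bregman
# divergence to w, by convexity and the KKT conditions
d_star = bregman_div(p_star, w, c)
for _ in range(1000):
    q = [random.random() for _ in c]
    s = sum(q)
    q = [qi / s for qi in q]
    assert bregman_div(q, w, c) >= d_star - 1e-9
```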
Proposition 10 For any initial distribution $\mu$ over $A$, any learning rate parameter $0 < \eta \leq 1$, and any benchmark distribution $q$ over $A$, the MSMW algorithm satisfies:
$$\sum_{i \in A} q_i \cdot G_i - \mathbb{E}[G_{\mathrm{alg}}] \leq \eta \sum_{t \in [T]} \sum_{i \in A} p_i(t) \frac{(g_i(t))^2}{c_i} + \frac{1}{\eta} \sum_{i \in A} c_i \Big(q_i \ln\frac{q_i}{\mu_i} - q_i + \mu_i\Big).$$

Proof [of Proposition 10] We have:
$$\sum_{i \in A} q_i \cdot G_i - \mathbb{E}[G_{\mathrm{alg}}] = \sum_{t \in [T]} q \cdot g(t) - \sum_{t \in [T]} p(t) \cdot g(t) = \sum_{t \in [T]} g(t) \cdot (q - p(t)) \quad (11)$$
By applying the regret bound of OMD (Lemma 6) to upper bound the RHS, we have
$$\sum_{i \in A} q_i \cdot G_i - \mathbb{E}[G_{\mathrm{alg}}] \leq \frac{1}{\eta} \sum_{t \in [T]} D_F(p(t), w(t+1)) + \frac{1}{\eta} D_F(q, p(1)) \quad (12)$$
To bound the first term in the regret, a.k.a. the local norm, we have:
$$D_F(p(t), w(t+1)) = \sum_{i \in A} c_i \cdot \Big(p_i(t) \ln\frac{p_i(t)}{w_i(t+1)} - p_i(t) + w_i(t+1)\Big) = \sum_{i \in A} c_i \cdot p_i(t) \Big(-\frac{\eta \cdot g_i(t)}{c_i} - 1 + \exp\Big(\frac{\eta \cdot g_i(t)}{c_i}\Big)\Big) \quad (13)$$
Note that $\eta \cdot g_i(t)/c_i \in [-1, 1]$, because $g_i(t) \in [-c_i, c_i]$ and $0 < \eta \leq 1$. By $\exp(x) - x - 1 \leq x^2$ for $-1 \leq x \leq 1$, the above is upper bounded by $\eta^2 \sum_{i \in A} p_i(t) \frac{(g_i(t))^2}{c_i}$. We can also rewrite the second term in the regret. In fact, if we set $p(1) = \mu$, then
$$\frac{1}{\eta} \cdot D_F(q, p(1)) = \frac{1}{\eta} \cdot \sum_{i \in A} c_i \Big(q_i \ln\frac{q_i}{\mu_i} - q_i + \mu_i\Big).$$
By summing the upper bounds $\eta^2 \sum_{i \in A} p_i(t) \frac{(g_i(t))^2}{c_i}$ on each local-norm term in (13) over $t \in [T]$ and putting all the pieces together, we get the desired bound.
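The elementary inequality $\exp(x) - x - 1 \leq x^2$ for $x \in [-1, 1]$, used above to bound the local norm, can be confirmed with a quick numerical check (ours, not the paper's):

```python
import math

# check exp(x) - x - 1 <= x^2 on a fine grid of [-1, 1];
# equality holds at x = 0, and the gap stays non-negative elsewhere
for i in range(-1000, 1001):
    x = i / 1000.0
    assert math.exp(x) - x - 1.0 <= x * x + 1e-12
```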
2.5 Regret Analysis for Non-negative Rewards

Theorem 1 There exists an algorithm for the full-information multi-scale online learning problem that takes as input any distribution $\pi$ over $A$, the ranges $c_i, \forall i \in A$, and a parameter $0 < \epsilon \leq 1$, and satisfies:
$$\forall i \in A: \quad \mathbb{E}[\mathrm{regret}_i] \leq \epsilon \cdot G_i + O\Big(\frac{1}{\epsilon} \log\frac{1}{\epsilon \pi_i}\Big) \cdot c_i \quad (14)$$

Proof [of Theorem 1] Suppose $i_{\min}$ is an action with the minimum $c_i$. Let $\mu = (1-\eta) \cdot \mathbb{1}_{i_{\min}} + \eta \cdot \pi$, and let $q = (1-\eta) \cdot \mathbb{1}_i + \eta \cdot \pi$ in Proposition 10. If $i \neq i_{\min}$, we get (note that $\mu_j = q_j$ for any $j \neq i, i_{\min}$, and that the first term in Proposition 10 is at most $\eta \cdot \mathbb{E}[G_{\mathrm{alg}}]$, since $(g_i(t))^2/c_i \leq g_i(t)$ for non-negative rewards):
$$(1-\eta) \cdot G_i + \eta \cdot \sum_{j \in A} \pi_j \cdot G_j - \mathbb{E}[G_{\mathrm{alg}}] \leq \eta \cdot \mathbb{E}[G_{\mathrm{alg}}] + \frac{1}{\eta} \cdot c_i \Big(q_i \ln\frac{q_i}{\mu_i} - q_i + \mu_i\Big) + \frac{1}{\eta} \cdot c_{i_{\min}} \Big(q_{i_{\min}} \ln\frac{q_{i_{\min}}}{\mu_{i_{\min}}} - q_{i_{\min}} + \mu_{i_{\min}}\Big)$$
By $1 \geq q_i > \mu_i \geq \eta \pi_i$, the second term on the RHS is upper bounded as:
$$\frac{1}{\eta} \cdot c_i \Big(q_i \ln\frac{q_i}{\mu_i} - q_i + \mu_i\Big) \leq \frac{1}{\eta} \cdot c_i \cdot \ln\frac{1}{\eta \pi_i}$$
Similarly, by $1 \geq \mu_{i_{\min}} > q_{i_{\min}} \geq 0$, the third term on the RHS is upper bounded as
$$\frac{1}{\eta} \cdot c_{i_{\min}} \Big(q_{i_{\min}} \ln\frac{q_{i_{\min}}}{\mu_{i_{\min}}} - q_{i_{\min}} + \mu_{i_{\min}}\Big) \leq \frac{1}{\eta} \cdot c_{i_{\min}} \leq \frac{1}{\eta} \cdot c_i$$
Finally, note that $G_j \geq 0$ for all $j \in A$ in reward-only instances. So the LHS is lower bounded by $(1-\eta) \cdot G_i - \mathbb{E}[G_{\mathrm{alg}}] = (1-\eta) \cdot \mathrm{regret}_i - \eta \cdot \mathbb{E}[G_{\mathrm{alg}}]$. Putting all this together, we get that
$$\mathbb{E}[\mathrm{regret}_i] \leq \frac{2\eta}{1-\eta} \cdot \mathbb{E}[G_{\mathrm{alg}}] + O\Big(\frac{1}{\eta} \ln\frac{1}{\eta \pi_i}\Big) \cdot c_i \leq 3\eta \cdot \mathbb{E}[G_{\mathrm{alg}}] + O\Big(\frac{1}{\eta} \ln\frac{1}{\eta \pi_i}\Big) \cdot c_i.$$
The theorem then follows by choosing $\eta = \epsilon/3$ and rearranging terms.

2.6 A Canonical Application: Online Single Buyer Auction

The setup. The simple auction design problem that we consider is as follows. There is a seller with infinitely many identical copies of an item. Buyers arrive over time. In each round, the seller picks a price and the arriving buyer reports her value. If the value is no less than the price, the trade happens; the money goes to the seller and the copy of the item goes to the arriving buyer.
The goal is to maximize the revenue of the seller. Formally, we view this problem as an instance of the full-information multi-scale online learning framework. The action set is $A = [1, h]$.‖ The reward function is such that at round $t$ the adversary (i.e., the arriving buyer) picks a value $v(t) \in [1, h]$, and for any price $p \in A$ picked by the seller (i.e., the algorithm), the reward is $g_p(t) := p \cdot \mathbb{1}(v(t) \geq p)$. This is a full-information setting, because the value $v(t)$ is revealed to the algorithm after each round $t$.

‖ Here, we allow an infinite action set. Later, we show how to discretize to get around this issue.

The additive/multiplicative approximation. In order to obtain a $(1-\epsilon)$-approximation of the optimal revenue, i.e., the revenue of the best fixed price $p^*$ in hindsight, it suffices to consider prices of the form $(1+\epsilon)^j$ for $0 \leq j \leq \lfloor \log_{1+\epsilon} h \rfloor = O(\log h / \epsilon)$. As a result, we reduce the online single buyer auction problem to multi-scale online learning with full information and finitely many actions. The action set has $k = O(\log h / \epsilon)$ actions whose ranges form a geometric sequence $(1+\epsilon)^j$, $0 \leq j < k$.

Recall the definition of $G_{\max}$ in Section 2.1, and let $p^*$ be the best fixed price in hindsight, i.e., the price that achieves $G_{\max}$. We now show how to get a multiplicative-cum-additive approximation for this problem with $G_{\max}$ as the benchmark, à la Blum et al. (2004); Blum and Hartline (2005). The main improvement over these results is that the additive term scales with the best price rather than with $h$.

Theorem 11 There is an algorithm for the online single buyer auction problem that takes as input a parameter $\epsilon > 0$, and satisfies $G_{\mathrm{alg}} \geq (1-\epsilon) G_{\max} - O(E)$, where:
$$E = \frac{p^* \log(\log h / \epsilon)}{\epsilon}.$$
Also, even if $h$ is not known up front, there is a (slightly modified) algorithm that achieves a similar approximation guarantee for the online single buyer auction, with:
$$E = \frac{p^* \log(p^* / \epsilon)}{\epsilon}.$$

Proof [of Theorem 11]

[Part 1: known $h$] Recall the above formulation of the problem as an online learning problem with full information. The proof then follows by Theorem 1, letting $\pi$ be the uniform distribution over the $k = O(\log h / \epsilon)$ actions, i.e., the discretized prices.

[Part 2: unknown $h$] When $h$ is not known up front, we consider a variant of our algorithm (Algorithm 2) that picks the next price in each round $t$ from the set of relevant prices (denoted by $P$), updates this set if necessary, and then updates the weights of the prices in this set as in Algorithm 1. The main new idea here is to update the set of prices $P$ so that it only includes prices that are at most the highest value we have seen so far (let the highest seen value be $1$ at the beginning).

Now, for the sake of analysis, consider a hypothetical algorithm (called $\mathrm{ALG}_H$) that considers a countably infinite action space comprising all prices of the form $(1+\epsilon)^j$, for $j \geq 0$. We first show that this hypothetical algorithm $\mathrm{ALG}_H$ satisfies the required approximation guarantee in Theorem 11. We then show that the expected revenue of Algorithm 2 is at least the expected revenue of $\mathrm{ALG}_H$ (minus a constant that is negligible in our bound), which completes the proof. The proof of the regret bound of Theorem 1 goes through when there are countably many actions (although we cannot implement such algorithms directly). Now, consider simulating $\mathrm{ALG}_H$, and let the prior distribution $\pi$ be such that for any price $p = (1+\epsilon)^j$,
$$\pi_p = \epsilon(\epsilon+2)(1+\epsilon)^{-2(j+1)} = \frac{\epsilon(\epsilon+2)}{(1+\epsilon)^2} \cdot \frac{1}{p^2}$$
(this choice will become clearer later in the proof; in short, we need $\pi_p$ to be proportional to $1/p^2$). The approximation guarantee in Theorem 11 then follows by Theorem 1.
We now argue the following:

• For any round $t$, unless the value in that round is a new highest value, Algorithm 2 gets weakly higher revenue than $\mathrm{ALG}_H$. This is because the probability that Algorithm 2 plays any relevant price in $P$ (i.e., any price with a non-zero gain in this round) is weakly higher than the corresponding probability in $\mathrm{ALG}_H$.

• For any price $p = (1+\epsilon)^j$, consider the first time a value of at least $p$ shows up. Algorithm 2 suffers a loss of at most $p \cdot \pi_p$ compared to $\mathrm{ALG}_H$, due to $\mathrm{ALG}_H$'s probability of playing $p$ in that round, where $\pi_p$ is the probability of playing $p$ in the initial distribution. This is because the probability that $\mathrm{ALG}_H$ plays $p$ in this round is at most $\pi_p$, as $p$ has not received any positive gains before this round.

• Then, by choosing $\pi_p$ to be inversely proportional to $p^2$, we can show that Algorithm 2 has an additive loss of $\sum_p \beta/p = \frac{\epsilon+2}{\epsilon+1} = O(1)$ compared to $\mathrm{ALG}_H$, where $\beta = \big(\sum_p \frac{1}{p^2}\big)^{-1} = \frac{\epsilon(2+\epsilon)}{(1+\epsilon)^2}$ is the normalization constant of the initial distribution $\pi$.

This finishes the proof.

Algorithm 2 Online single buyer auction (for unknown $h$)
1: input learning rate $0 < \eta \leq 1$, price discretization parameter $0 < \epsilon \leq 1$.
2: initialize the set of relevant prices $P = \{1\}$. Let $\alpha_1(1) = 1$.
3: for $t = 1, \ldots, T$ do
4:   Randomly pick a price in $P$ drawn from $\alpha(t)$, and observe $g(t)$.
5:   Update $P$ to be all the prices $(1+\epsilon)^j$ that are at most the highest value seen until time $t$.
6:   $\forall p \in P$: $w_p(t+1) \leftarrow \alpha_p(t) \cdot \exp(\eta \cdot g_p(t)/p)$.
7:   Find $\lambda^*$ (e.g., by binary search) s.t. $\sum_{p \in P} w_p(t+1) \cdot \exp(-\lambda^*/p) = 1$.
8:   $\forall p \in P$: $\alpha_p(t+1) \leftarrow w_p(t+1) \cdot \exp(-\lambda^*/p)$.
9: end for

Bounds on the sample complexity of auctions for the single buyer problem (Huang et al., 2015a) imply that the first bound in this theorem is tight up to log factors: the lower bound is $h \epsilon^{-1}$ in an instance where $p^*$ is actually equal to $h$.
Also, the best previously known upper bound is by Blum et al. (2004); Blum and Hartline (2005), which is $E = \frac{h \log(1/\epsilon)}{\epsilon}$. We conclude that Theorem 11 generalizes the known tight sample complexity upper bound for offline single buyer Bayesian revenue maximization to the online adversarial setting.

3. Multi-Scale Online Learning with Bandit Feedback

In this section, we look at the bandit-feedback version of the multi-scale online learning framework proposed in Section 2.1. Essentially, the only difference here is that after the algorithm picks an arm $i_t$ at time $t$, it only observes the obtained reward, i.e., $g_{i_t}(t)$, and does not observe the entire reward function $g(t)$.

Inspired by the online stochastic mirror descent algorithm (Bubeck, 2011), we introduce the Bandit-MSMW algorithm. Our algorithm follows the standard bandit route of using unbiased estimators for the rewards in a full-information strategy (in this case MSMW). We also mix the MSMW distribution with extra uniform exploration, and use a tailored initial distribution to obtain the desired multi-scale regret bounds.

3.1 Bandit Multi-scale Regret Bounds

For the bandit version, we can get regret guarantees similar to those of Section 2.2 for the full-information variant, but only for the best action. If we require the regret bound to hold for all actions, then we can only get a weaker bound, where the second term has $\epsilon^{-2}$ instead of $\epsilon^{-1}$. The difference between the bounds for the bandit and the full-information settings is essentially a factor of $k$, which is unavoidable.

Theorem 12 There exists an algorithm for the online multi-scale problem with bandit feedback that takes as input the ranges $c_i, \forall i \in A$, and a parameter $0 < \epsilon \leq 1$, and satisfies:

• for $i^* = \arg\max_{i \in A} G_i$,
$$\mathbb{E}[\mathrm{regret}_{i^*}] \leq \epsilon \cdot G_{i^*} + O\Big(\frac{1}{\epsilon} k \log\frac{k}{\epsilon}\Big) \cdot c_{i^*} \quad (15)$$
• for all $i \in A$,
$$\mathbb{E}[\mathrm{regret}_i] \leq \epsilon \cdot G_i + O\Big(\frac{1}{\epsilon^2} k \log\frac{k}{\epsilon}\Big) \cdot c_i \quad (16)$$

Also, one can obtain purely additive versions of the bounds in Theorem 12 by setting $\epsilon = \sqrt{\frac{k \log(kT)}{T}}$ and $\epsilon = \big(\frac{k \log(kT)}{T}\big)^{1/3}$ respectively (Corollary 13), and compare them with the purely additive regret bound $O\big(c_{\max} \cdot \sqrt{Tk}\big)$ for the adversarial multi-armed bandit problem (Audibert and Bubeck, 2009; Auer et al., 1995).

Corollary 13 There exist algorithms for the online multi-scale bandits problem that satisfy:

• for $i^* = \arg\max_{i \in A} G_i$,
$$\mathbb{E}[\mathrm{regret}_{i^*}] \leq O\big(c_{i^*} \cdot \sqrt{Tk \log(kT)}\big) \quad (17)$$

• for all $i \in A$,
$$\mathbb{E}[\mathrm{regret}_i] \leq O\big(c_i \cdot T^{2/3} (k \log(kT))^{1/3}\big) \quad (18)$$

Here is a map of this section. In Section 3.2 we propose our bandit algorithm and prove its general regret guarantee for non-negative rewards. Then, in Section 3.3, we show how to get a multi-scale style regret guarantee (scaling with $c_{i^*}$) for the best arm, and a weaker guarantee (scaling with $c_i$) for all arms $\{c_i\}_{i \in A}$.

3.2 Bandit Multi-Scale Multiplicative Weight (Bandit-MSMW) Algorithm

We present our bandit algorithm (Algorithm 3) for the case where the set of actions $A$ is finite (with $|A| = k$). Let $\eta$ be the learning rate and $\gamma$ be the exploration probability. We show the following regret bound.

Algorithm 3 Bandit-MSMW
1: input exploration parameter $\gamma > 0$, learning rate $\eta > 0$.
2: initialize $p(1) = (1-\gamma) \mathbb{1}_{i_{\min}} + \frac{\gamma}{k} \mathbb{1}$, where $i_{\min}$ is the arm with the minimum range $c_{i_{\min}}$.
3: for $t = 1, \ldots, T$ do
4:   Let $\tilde{p}(t) = (1-\gamma) p(t) + \frac{\gamma}{k} \mathbb{1}$.
5:   Randomly pick an expert $i_t$ drawn from $\tilde{p}(t)$, and observe $g_{i_t}(t)$.
6:   Let $\tilde{g}(t)$ be such that $\tilde{g}_i(t) = \frac{g_i(t)}{\tilde{p}_i(t)}$ if $i = i_t$, and $0$ otherwise.
7:   $\forall i \in A$: $w_i(t+1) \leftarrow p_i(t) \cdot \exp\big(\frac{\eta}{c_i} \cdot \tilde{g}_i(t)\big)$.
8:   Find $\lambda^*$ (e.g., by binary search) s.t. $\sum_{i \in A} w_i(t+1) \cdot \exp(-\lambda^*/c_i) = 1$.
9:   $\forall i \in A$: $p_i(t+1) \leftarrow w_i(t+1) \cdot \exp(-\lambda^*/c_i)$.
10: end for

Lemma 14 For any exploration probability $0 < \gamma \leq \frac{1}{2}$ and any learning rate parameter $0 < \eta \leq \frac{\gamma}{k}$, the Bandit-MSMW algorithm achieves the following regret bound when the gains are non-negative:
$$\forall i \in A: \quad \mathbb{E}[\mathrm{regret}_i] \leq O\Big(\frac{1}{\eta} \log\frac{k}{\gamma}\Big) \cdot c_i + \eta \sum_{j \in A} G_j + \gamma \cdot G_i$$

Proof [of Lemma 14] We further define:
$$\tilde{G}_{\mathrm{alg}} \triangleq \sum_{t \in [T]} g_{i_t}(t) = \sum_{t \in [T]} \tilde{p}(t) \cdot \tilde{g}(t), \qquad \tilde{G}_j \triangleq \sum_{t \in [T]} \tilde{g}_j(t).$$
In expectation over the randomness of the algorithm, we have: 1. $\mathbb{E}[G_{\mathrm{alg}}] = \mathbb{E}[\tilde{G}_{\mathrm{alg}}]$; and 2. $G_j = \mathbb{E}[\tilde{G}_j]$ for any $j \in A$. Hence, to upper bound $\mathbb{E}[\mathrm{regret}_i] = G_i - \mathbb{E}[G_{\mathrm{alg}}]$, it suffices to upper bound $\mathbb{E}[\tilde{G}_i - \tilde{G}_{\mathrm{alg}}]$. By the definition of the probabilities $\tilde{p}(t)$ with which the algorithm picks each arm, we have:
$$\mathbb{E}[\tilde{G}_{\mathrm{alg}}] \geq (1-\gamma) \, \mathbb{E}\Big[\sum_{t \in [T]} p(t) \cdot \tilde{g}(t)\Big].$$
Hence, we have that for any benchmark distribution $q$ over $A$:
$$\sum_{j \in A} q_j \cdot \mathbb{E}[\tilde{G}_j] - \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \leq \mathbb{E}\Big[\sum_{j \in A} q_j \cdot \tilde{G}_j - \sum_{t \in [T]} p(t) \cdot \tilde{g}(t)\Big] + \frac{\gamma}{1-\gamma} \, \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \leq \mathbb{E}\Big[\sum_{j \in A} q_j \cdot \tilde{G}_j - \sum_{t \in [T]} p(t) \cdot \tilde{g}(t)\Big] + 2\gamma \, \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \quad (19)$$
Next, we upper bound the first term on the RHS. Note that the $p(t)$'s are the probabilities of choosing experts by MSMW when the experts have rewards $\tilde{g}(t)$. By Proposition 10, we have that for any benchmark distribution $q$ over $A$, the Bandit-MSMW algorithm satisfies:
$$\sum_{j \in A} q_j \cdot \tilde{G}_j - \sum_{t \in [T]} p(t) \cdot \tilde{g}(t) \leq \eta \sum_{t \in [T]} \sum_{j \in A} \frac{p_j(t)}{c_j} \cdot \tilde{g}_j(t)^2 + \frac{1}{\eta} \sum_{j \in A} c_j \Big(q_j \ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) \quad (20)$$
For any $t \in [T]$ and any $j \in A$, by the definition of $\tilde{g}_j(t)$, it equals $\frac{g_j(t)}{\tilde{p}_j(t)}$ with probability $\tilde{p}_j(t)$, and equals $0$ otherwise.
Thus, if we fix the random coin flips in the first $t-1$ rounds (thereby fixing $\tilde{p}(t)$) and take expectation over the randomness in round $t$, we have:
$$\mathbb{E}\Big[\frac{p_j(t)}{c_j} \cdot \tilde{g}_j(t)^2\Big] = \frac{p_j(t)}{c_j} \cdot \tilde{p}_j(t) \cdot \Big(\frac{g_j(t)}{\tilde{p}_j(t)}\Big)^2 = \frac{p_j(t)}{\tilde{p}_j(t)} \cdot \frac{(g_j(t))^2}{c_j}.$$
Further, note that since $\tilde{p}_j(t) \geq (1-\gamma) p_j(t)$ and $g_j(t) \leq c_j$, the above is upper bounded by $\frac{1}{1-\gamma} g_j(t) \leq 2 g_j(t)$. Putting this together with (20), we have that for any $0 < \eta \leq \frac{\gamma}{k}$:
$$\mathbb{E}\Big[\sum_{j \in A} q_j \cdot \tilde{G}_j - \sum_{t \in [T]} p(t) \cdot \tilde{g}(t)\Big] \leq \eta \sum_{t \in [T]} \sum_{j \in A} 2 g_j(t) + \frac{1}{\eta} \sum_{j \in A} c_j \Big(q_j \ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) = 2\eta \sum_{j \in A} G_j + \frac{1}{\eta} \sum_{j \in A} c_j \Big(q_j \ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big)$$
Combining with (19), we have:
$$\sum_{j \in A} q_j \cdot \mathbb{E}[\tilde{G}_j] - \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \leq 2\eta \sum_{j \in A} G_j + \frac{1}{\eta} \sum_{j \in A} c_j \Big(q_j \ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) + 2\gamma \, \mathbb{E}[\tilde{G}_{\mathrm{alg}}]$$
Let $q = (1-\gamma) \mathbb{1}_i + \frac{\gamma}{k} \mathbb{1}$. Recall that $p(1) = (1-\gamma) \mathbb{1}_{i_{\min}} + \frac{\gamma}{k} \mathbb{1}$ (where $i_{\min}$ is the arm with the minimum range $c_{i_{\min}}$). Similarly to the argument for the experts problem in Section 2.5, the second term on the RHS is upper bounded by $O\big(\frac{1}{\eta} \log\frac{k}{\gamma}\big) \cdot c_i$. Hence, we have:
$$\sum_{j \in A} q_j \cdot \mathbb{E}[\tilde{G}_j] - \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \leq 2\eta \sum_{j \in A} G_j + O\Big(\frac{1}{\eta} \log\frac{k}{\gamma}\Big) \cdot c_i + 2\gamma \, \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \quad (21)$$
Further, the LHS is lower bounded as:
$$(1-\gamma) \, \mathbb{E}[\tilde{G}_i] + \frac{\gamma}{k} \sum_{j \in A} \mathbb{E}[\tilde{G}_j] - \mathbb{E}[\tilde{G}_{\mathrm{alg}}] \geq (1-\gamma) \, \mathbb{E}[\tilde{G}_i] - \mathbb{E}[\tilde{G}_{\mathrm{alg}}].$$
The lemma then follows by plugging this back into (21) and rearranging terms.

3.3 Regret Bounds for Non-negative Rewards: Proof of Theorem 12

Proof [of Theorem 12] Letting $\gamma = \epsilon$ and $\eta = \frac{\gamma}{k} = \frac{\epsilon}{k}$ in Lemma 14, we get that the expected regret w.r.t. an action $i \in A$ is bounded by:
$$O\Big(\epsilon \cdot G_i + \frac{\epsilon}{k} \sum_{j \in A} G_j + c_i \cdot \frac{k}{\epsilon} \ln\frac{k}{\epsilon}\Big).$$
When $i = i^*$ (the best arm), we have $\sum_{j \in A} G_j \leq k \cdot G_{i^*}$, so the regret is bounded by $O\big(\epsilon \cdot G_{i^*} + c_{i^*} \cdot \frac{k}{\epsilon} \ln\frac{k}{\epsilon}\big)$, as desired. For the regret w.r.t. an arbitrary action, note that $\mathbb{E}[G_{\mathrm{alg}}] \geq \frac{\gamma}{k} \sum_{j \in A} G_j$. Thus, the regret bound w.r.t.
an action $i \in A$ in Lemma 14 is further upper bounded by:
$$O\Big(\frac{1}{\eta} \log\frac{k}{\gamma}\Big) \cdot c_i + \Big(\frac{\eta k}{\gamma} + \gamma\Big) \cdot \mathbb{E}\big[\tilde{G}_{\mathrm{alg}}\big]$$
(here the $\gamma \cdot G_i$ term of Lemma 14 is absorbed by writing $G_i = \mathrm{regret}_i + \mathbb{E}[G_{\mathrm{alg}}]$ and moving $\gamma \cdot \mathrm{regret}_i$ to the LHS, which costs only a $\frac{1}{1-\gamma} = O(1)$ factor). The theorem then follows by letting $\gamma = \epsilon$ and $\eta = \frac{\gamma^2}{k} = \frac{\epsilon^2}{k}$.

4. More Applications of Multi-scale Learning for Auctions and Pricing

In this section, we apply the multi-scale online learning framework, developed in Section 2 and Section 3, to design several other online auctions and pricing schemes beyond the single buyer auction (discussed in Section 2.6). Besides the single buyer auction, the problems that we consider are as follows.

• Online posted pricing: The same as the online single buyer auction of Section 2.6, but in the bandit setting. The algorithm only learns the indicator $\mathbb{1}(v(t) \geq p_t)$, where $p_t$ is the price it picks in round $t$.

• Online multi buyer auction: The action set is the set of all "Myerson-type" mechanisms for $n$ buyers, for some $n \in \mathbb{N}$ (see Definition 15). The adversary picks a valuation vector $v(t) \in [1, h]^n$, and the reward of a mechanism $M$ is its revenue when the valuations of the buyers are given by $v(t)$; this is denoted by $\mathrm{rev}_M(v(t))$. The algorithm sees the full vector of valuations $v(t)$.

4.1 Auctions and Pricing as Multi-scale Online Learning Problems

We now show how to reduce the above problems to special cases of multi-scale online learning.

Online multi buyer auction. In multi buyer auctions, we consider the set of all discretized Myerson-type auctions as the action space. We start by defining Myerson-type auctions:

Definition 15 (Myerson-type auctions) A Myerson-type auction is defined by $n$ non-decreasing virtual value mappings $\phi_1, \ldots, \phi_n : [1, h] \mapsto [-\infty, h]$. Given a value profile $v_1, \ldots, v_n$, the item is given to the bidder $j$ with the largest non-negative virtual value $\phi_j(v_j)$. Then, bidder $j$ pays the minimum value that would keep him the winner.
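For concreteness, here is a minimal sketch of running a Myerson-type auction (Definition 15) on a single value profile. The function name, the bisection-based payment computation, and the reserve-style virtual value mappings used in the example are our own illustrative choices; the sketch assumes continuous, non-decreasing mappings and ignores tie-breaking.

```python
def myerson_type_auction(phis, values):
    """Run a Myerson-type auction on one value profile.

    phis:   list of n non-decreasing virtual value mappings (callables)
    values: list of n reported values, each in [1, h]
    Returns (winner, payment), or (None, 0.0) if no virtual value is
    non-negative and the item stays unallocated.
    """
    virt = [phi(v) for phi, v in zip(phis, values)]
    if max(virt) < 0:
        return None, 0.0
    n = len(values)
    winner = max(range(n), key=lambda j: virt[j])
    # payment: minimum value keeping the winner's virtual value both
    # non-negative and at least the best competing one (by bisection)
    threshold = max([virt[j] for j in range(n) if j != winner] + [0.0])
    lo, hi = 1.0, values[winner]
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if phis[winner](mid) >= threshold:
            hi = mid
        else:
            lo = mid
    return winner, hi
```

For example, with reserve-shifted mappings $\phi_1(v) = v - 2$ and $\phi_2(v) = v - 3$ and values $(5, 4)$, the first bidder wins and pays $3$, the smallest value that keeps her virtual value at least her opponent's.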
Myerson (1981) shows that when the bidders' values are drawn from independent (but not necessarily identical) distributions, the revenue-optimal auction is a Myerson-type auction. Devanur et al. (2016, Lemma 5) observe that to obtain a $1-\epsilon$ approximation, it suffices to consider the set of discretized Myerson-type auctions that treat each bidder's value as if it were equal to the closest power of $1+\epsilon$ from below. As a result, it suffices to consider the set of discretized Myerson-type auctions, each of which is defined by the virtual values of the $(1+\epsilon)^j$'s, i.e., by $O(n \log h / \epsilon)$ real numbers $\phi_\ell((1+\epsilon)^j)$, for $\ell \in [n]$ and $0 \leq j \leq \lfloor \log_{1+\epsilon} h \rfloor$.

Furthermore, first Elkind (2007), and later Devanur et al. (2016); Gonczarowski and Nisan (2017), note that a discretized Myerson-type auction is in fact completely characterized by the total ordering of the $\phi_\ell((1+\epsilon)^j)$'s;** their actual values do not matter. Indeed, both the allocation rule and the payment rule are determined by the ordering of the virtual values. As a result, our action space is a finite set with at most $O((n \log h / \epsilon)!)$ actions. The range of an action, i.e., a discretized Myerson-type auction, is the largest price ever charged by the auction, i.e., the largest value $v$ of the form $(1+\epsilon)^j$ such that there exists $\ell \in [n]$ with $\phi_\ell(v) > \phi_\ell((1+\epsilon)^{-1} v)$.

** Cai et al. (2012) also generalize this observation to multi-dimensional types.

4.2 Multiplicative/Additive Approximations

Similarly to Section 2.6, we show how to get multiplicative-cum-additive approximations for these problems with $G_{\max}$ as the benchmark. Recall the definition of $G_{\max}$ in Section 2.1, and let $p^*$ be the best fixed price in hindsight, i.e., the price that achieves $G_{\max}$.
Theorem 16 There are algorithms for the online posted pricing and the online multi buyer auction problems that take as input a parameter $\epsilon > 0$, and satisfy $G_{\mathrm{alg}} \geq (1-\epsilon) G_{\max} - O(E)$, where respectively (for the two problems mentioned above):
$$E = \frac{p^* \log h \log(\log h / \epsilon)}{\epsilon^2}, \quad\text{and}\quad E = \frac{h n \log h \log(n \log h / \epsilon)}{\epsilon^2}.$$
Even if $h$ is not known up front, we can still get a similar approximation guarantee for the online multi buyer auction, with:
$$E = \frac{h n \log h \log(n \log h / \epsilon)}{\epsilon^2}.$$

We conjecture that our bound for the online posted pricing problem is tight up to logarithmic factors, and leave resolving this as an open problem. The second bound is not comparable to the best sample complexity for the multi buyer auction problem by Roughgarden and Schrijvers (2016); it is better than theirs for large $\epsilon$ (when $1/\epsilon \leq o(nh)$), and worse for smaller $\epsilon$ (when $1/\epsilon \geq \omega(nh)$). Also, compare the first bound to the corresponding upper bound for the pricing problem by Blum and Hartline (2005), which is
$$\min\Big(\frac{h \log h \log\log h}{\epsilon^2}, \; \frac{h \log\log h}{\epsilon^3}\Big).$$
Essentially, the main improvement over this result is that the additive term scales with the best price rather than with $h$.

4.3 Proof of Theorem 16

Proof
Online posted pricing. Recall the formulation of the problem as an online learning problem with bandit feedback in Section 4.1. This part then follows by Theorem 12 with $k = O(\log h / \epsilon)$ actions.

Online multi buyer auction. Recall the formulation of the problem as an online learning problem with full information in Section 4.1. The proof then follows by Theorem 1, where we let $\pi$ be the uniform distribution over the $k = O((n \log h / \epsilon)!)$ actions, i.e., the Myerson-type auctions. When $h$ is not known up front, similarly to the proof of Theorem 11, we consider a hypothetical algorithm with a countably infinite action space $A$, as follows.
For any $p = (1+\epsilon)^j$, $j \geq 0$, let the $k_p = O((n \log p / \epsilon)!)$ Myerson-type auctions for values in $[1, p]$ be in $A$; we assume these auctions treat any value greater than $p$ as if it were $p$. Further, we choose the prior distribution $\pi$ such that the probability mass of each auction for the range $[1, p]$ equals $\frac{\epsilon(\epsilon+2)}{(1+\epsilon)^2} \cdot \frac{1}{p^2} \cdot \frac{1}{k_p}$. The approximation guarantee then follows by Theorem 1. To implement this algorithm, we use the same trick as in the proof of Theorem 11, running a modified algorithm that only considers auctions for ranges $[1, p]$ where $p$ is no larger than the highest value seen so far among all the buyers (i.e., a multi buyer auction version of Algorithm 2). The rest of the proof, showing that the revenue loss of this algorithm compared to the hypothetical algorithm is negligible, is similar to the proof of Theorem 11 (and hence omitted for brevity).

4.4 Competing with δ-guarded Benchmarks

For the single buyer auction/pricing problem, we define a δ-guarded benchmark, for any $\delta \in [0, 1]$. This benchmark is restricted to those prices that sell the item in at least a $\delta$ fraction of the rounds:
$$G_{\max}(\delta) := \max\Big\{\textstyle\sum_{t=1}^{T} g_p(t) \;:\; p \in A, \; \sum_{t=1}^{T} \mathbb{1}(v_t \geq p) \geq \delta T\Big\}.$$
As observed in Footnote †, one can replace $\delta$ with $1/h$ and get the corresponding guarantees for $G_{\max}$ rather than $G_{\max}(\delta)$. However, the main point of these results is to show a graceful improvement of the bounds as $\delta$ is chosen to be larger.

Multiple buyers. For the multi buyer auction problem, we define the δ-guarded benchmark as follows. For any sequence of value vectors $v(1), v(2), \ldots, v(T)$, let $\bar{V}$ denote the largest value such that there are at least $\delta T$ distinct $t \in [1:T]$ with $\max_{i \in [n]} v_i(t) \geq \bar{V}$.
Define the δ-guarded benchmark to be
$$G_{\max}(\delta) = \max_M \sum_{t=1}^{T} \mathrm{rev}_M\big(\min(\bar{V} \vec{1}, v(t))\big),$$
where the "min" is taken coordinate-wise, and the "max" is over all Myerson-type mechanisms. In other words, here is how we can describe the δ-guarded benchmark: for each Myerson-type auction $M$, after identifying the value cap $\bar{V}$, we cut all the values that are above $\bar{V}$ down to this quantity, and then run $M$. The benchmark is then the revenue of the best Myerson-type auction under these modified values.

We focus on purely multiplicative approximation factors when competing with $G_{\max}(\delta)$. In particular, for any given $\epsilon > 0$, we are interested in a $1-\epsilon$ approximation. We state our results in terms of the convergence rate. We say that $T(\epsilon, \delta)$ is the convergence rate of an algorithm if, for every time horizon $T \geq T(\epsilon, \delta)$, we are guaranteed that $G_{\mathrm{alg}} \geq (1-\epsilon) G_{\max}(\delta)$. Our main results are as follows.

Theorem 17 There are algorithms for the online single buyer auction, online posted pricing, and online multi buyer auction problems with convergence rates, respectively, of
$$O\Big(\frac{\log(\log h / \epsilon)}{\epsilon^2 \delta}\Big), \quad O\Big(\frac{\log h \log(\log h / \epsilon)}{\epsilon^4 \delta}\Big), \quad\text{and}\quad O\Big(\frac{n \log(1/\delta) \log(n \log(1/\delta) / \epsilon)}{\epsilon^3 \delta} + \frac{\log(\log h / \epsilon)}{\epsilon^2 \delta}\Big).$$
Even if $h$ is not known up front, we can still get the following similar convergence rates for the online single buyer auction and the online multi buyer auction, respectively:
$$O\Big(\frac{\log(p^* / \epsilon)}{\epsilon^2 \delta}\Big), \quad\text{and}\quad O\Big(\frac{n \log(1/\delta) \log(n \log(1/\delta) / \epsilon)}{\epsilon^3 \delta} + \frac{\log(h / \epsilon)}{\epsilon^2 \delta}\Big).$$

Once again, we compare to the sample complexity bounds: our first rate is within a $\log\log h$ factor of the best sample complexity upper bound in Huang et al. (2015b). The lower bound for the online single buyer auction is $\Omega(\delta^{-1} \epsilon^{-2})$, which is also the best lower bound known for the pricing and the multi buyer problems.
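The single buyer version of the δ-guarded benchmark is straightforward to compute for a concrete value sequence; the sketch below (the function name and data are our own illustrative choices) shows how the guard $\delta$ excludes high prices that sell too rarely:

```python
def guarded_benchmark(values, prices, delta):
    """G_max(delta): best revenue over prices selling in >= delta * T rounds."""
    T = len(values)
    best = 0.0
    for p in prices:
        sales = sum(1 for v in values if v >= p)
        if sales >= delta * T:
            best = max(best, p * sales)
    return best

values = [1.0, 2.0, 2.0, 50.0]
prices = [1.0, 2.0, 50.0]
# price 50 sells only once (a 1/4 fraction); the guard delta = 0.5 excludes it
assert guarded_benchmark(values, prices, 0.5) == 6.0    # price 2, three sales
assert guarded_benchmark(values, prices, 0.25) == 50.0  # now price 50 qualifies
```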
†† Cole and Roughgarden (2014) show that at least a linear dependence on $n$ is necessary when the values are drawn from a regular distribution; but, as is, their lower bound needs unbounded valuations. The lower bound probably holds for "large enough $h$", but it is not clear whether it holds for all $h$.

For the online posted pricing problem, we conjecture that the right dependence on $\epsilon$ should be $\epsilon^{-3}$. No sample complexity bounds for the multi buyer problem were known before; in fact, we introduce the definition of a δ-guarded benchmark for this problem.

4.5 Proof of Theorem 17

Proof
Online single buyer auction. By Theorem 1, letting $\pi$ be the uniform distribution over the $k = O(\log h / \epsilon)$ actions, i.e., the discretized prices, we have that for any price $p$ (recall that $c_p = p$):
$$G_{\mathrm{alg}} \geq (1-\epsilon) \cdot G_p - O\Big(\frac{\log(\log h / \epsilon)}{\epsilon}\Big) \cdot p.$$
For the δ-guarded optimal price $p^*$ (i.e., subject to selling in at least $\delta T$ rounds), we have $G_{p^*} \geq \delta T \cdot p^*$. Therefore, when $T \geq O\big(\frac{\log(\log h / \epsilon)}{\epsilon^2 \delta}\big)$, the additive term of the above approximation guarantee is at most $\epsilon \cdot G_{p^*}$. So the theorem holds.

The treatment of the case when $h$ is not known up front is essentially the same as in Theorem 16 and Theorem 11. As a hypothetical algorithm, useful for the analysis, we consider an algorithm (similar to Algorithm 1) with a countably infinite action space comprising all prices of the form $(1+\epsilon)^j$, for $j \geq 0$. Then, let the prior distribution $\pi$ be such that for any price $p = (1+\epsilon)^j$,
$$\pi_p = \epsilon(\epsilon+2)(1+\epsilon)^{-2(j+1)} = \frac{\epsilon(\epsilon+2)}{(1+\epsilon)^2} \cdot \frac{1}{p^2}.$$
The rest of the proof, and the implementation, are the same as in the proof of Theorem 11 (i.e., Algorithm 2).

Online posted pricing. Recall the above formulation of the problem as an online learning problem with bandit feedback. By Theorem 12 with $k = O(\log h / \epsilon)$ actions, we have that for any price $p$:
$$G_{\mathrm{alg}} \geq (1-\epsilon) \cdot G_p - O\Big(\frac{\log h \log(\log h / \epsilon)}{\epsilon^3}\Big) \cdot p.$$
Again, for the δ-guarded optimal price $p^*$ (i.e., subject to selling in at least $\delta T$ rounds), we have $G_{p^*} \ge \delta T\cdot p^*$. Therefore, when $T \ge O\big(\log h\, \log(\log h/\epsilon)/\epsilon^4\delta\big)$, the additive term of the above approximation guarantee is at most $\epsilon\cdot G_{p^*}$. So the theorem holds.

Online multi buyer auction. Suppose $i^*$ is the δ-guarded best Myerson-type auction. Recall that $\bar V$ is the largest value such that there are at least $\delta T$ distinct $v(t)$'s with $\max_{\ell\in[n]} v_\ell(t) \ge \bar V$. So we may assume without loss of generality that $i^*$ does not distinguish values greater than $\bar V$. Hence:
$$c_{i^*} \le \bar V. \quad (22)$$
Further, note that running a second-price auction with anonymous reserve $\bar V$ is a Myerson-type auction (e.g., mapping values less than $\bar V$ to virtual value $-\infty$ and values greater than or equal to $\bar V$ to virtual value $\bar V$), and it gets revenue at least $\delta T\cdot\bar V$. So we have that:
$$G_{i^*} \ge \delta T\cdot\bar V. \quad (23)$$
Finally, the above implies that to obtain a $1-\epsilon$ approximation, it suffices to consider prices that are at least $\delta\bar V$. Hence, it suffices to consider Myerson-type auctions that, for a given $\bar V$, do not distinguish among values greater than $\bar V$, and do not distinguish among values smaller than $\delta\bar V$. There are $O(\log h/\epsilon)$ different values of $\bar V$. Further, given $\bar V$, there are only $O\big(\log(1/\delta)/\epsilon\big)$ distinct values to be considered and, thus, there are at most $O\big((n\log(1/\delta)/\epsilon)!\big)$ distinct Myerson-type auctions of this kind. Hence, the total number of distinct Myerson-type actions that we need to consider is at most:
$$k = O\Big(\frac{\log h}{\epsilon}\cdot\Big(\frac{n\log(1/\delta)}{\epsilon}\Big)!\Big).$$
Letting $\pi$ be the uniform distribution over the $k$ actions in Theorem 1, we have that (recall Eqn. (22)):
$$G_{\mathrm{alg}} \ge (1-\epsilon)\cdot G_{i^*} - O\Big(\frac{n\log(1/\delta)\,\log\big(n\log(1/\delta)/\epsilon\big)}{\epsilon^2} + \frac{\log(\log h/\epsilon)}{\epsilon}\Big)\cdot\bar V.$$
When $T \ge O\Big(\frac{n\log(1/\delta)\,\log(n\log(1/\delta)/\epsilon)}{\epsilon^3\delta} + \frac{\log(\log h/\epsilon)}{\epsilon^2\delta}\Big)$, the additive term of the above approximation guarantee is at most $\epsilon\cdot G_{i^*}$ due to Eqn. (23).
So the theorem holds.

Again, the treatment for the case when $h$ is not known upfront is similar to that in Theorem 16. When $h$ is not known upfront, we consider a hypothetical algorithm with a countably infinite action space $A$ as follows. For any $\bar V = (1+\epsilon)^j$, $j \ge 0$, let the $k' = O\big((n\log(1/\delta)/\epsilon)!\big)$ Myerson-type auctions that do not distinguish among values greater than $\bar V$, and do not distinguish among values smaller than $\delta\bar V$, be in $A$. Further, we choose the prior distribution $\pi$ such that the probability mass of each Myerson-type auction for a given $\bar V$ is equal to $\frac{\epsilon}{1+\epsilon}\cdot\frac{1}{\bar V}\cdot\frac{1}{k'}$. The approximation guarantee then follows by Theorem 1 and essentially the same argument as in the known-$h$ case. The implementation is similar to the proofs of Theorem 16 and Theorem 11 (i.e., a multi-buyer auction version of Algorithm 2). The rest of the proof, which shows that the revenue loss of this algorithm compared to the hypothetical algorithm is negligible, is similar to the proof of Theorem 16 (and hence omitted for brevity).

Remark Devanur et al. (2016) show that when the values are drawn from independent regular distributions, the $\epsilon$-guarded optimal price is a $1-\epsilon$ approximation of the unguarded optimal price. So our convergence rate for the online multi buyer auction problem in Theorem 17 implies an $\tilde O(n\epsilon^{-4})$ sample complexity modulo a mild $\log\log h$ dependency on the range, almost matching the best known sample complexity upper bound for regular distributions.

5. Multi-scale Online Learning with Symmetric Range

In this section, we consider multi-scale online learning when the rewards are in a symmetric range, i.e., for all $i\in A$ and $t\in[T]$, $g_i(t)\in[-c_i, c_i]$. The standard analysis for the experts and the bandit problems holds even if the range of $g_i(t)$ is $[-c_i, c_i]$, instead of $[0, c_i]$.
In contrast, there are subtle differences in the best achievable multi-scale regret bounds between the non-negative and the symmetric range, which we explore in this section. We look at both the full information and the bandit setting, and prove action-specific regret upper bounds. We then prove a tight lower bound in Section 5.3 for the full information case, and an almost tight lower bound in Section 5.5 for the bandit setting.

5.1 Multi-scale regret bounds for symmetric ranges

We first show the following upper bound for the full information setting when the range is symmetric. This bound follows the same style of action-specific regret bounds as in Theorem 1. A more detailed discussion of how the choice of initial distribution $\pi$ affects the bound is deferred to the appendix, Section A.1 (recall that the initial distribution $\pi$ is the distribution over actions that is used in the first round of Algorithm 1).

Theorem 18 There exists an algorithm for the multi-scale experts problem with symmetric range that takes as input any distribution $\pi$ over $A$, the ranges $c_i$, $\forall i\in A$, and a parameter $0 < \epsilon \le 1$, and satisfies:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le \epsilon\cdot\mathbb E\Big[\sum_{t\in[T]} g_i(t)\Big] + O\Big(\frac1\epsilon\log\Big(\frac{1}{\pi_i}\cdot\frac{c_i}{c_{\min}}\Big)\Big)\cdot c_i. \quad (24)$$

Similar to Section 2.1, we can compute the pure-additive version of the bound in Theorem 18 by setting $\epsilon = \sqrt{\log\big(k\cdot\frac{c_{\max}}{c_{\min}}\big)/T}$, as in Corollary 2.

Corollary 19 There exists an algorithm for the online multi-scale experts problem with symmetric range that takes as input the ranges $c_i$, $\forall i\in A$, and satisfies:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le O\Big(c_i\cdot\sqrt{T\log\big(k\cdot\tfrac{c_{\max}}{c_{\min}}\big)}\Big). \quad (25)$$

If we compare the above regret bound with the standard $O(c_{\max}\sqrt{T\log k})$ regret bound for the experts problem, we see that we replace the dependency on $c_{\max}$ in the standard bound with $c_i\sqrt{\log\big(\frac{c_{\max}}{c_{\min}}\big)}$.
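As a quick arithmetic aid for the comparison above (our own sketch; the function names are ours, not the paper's), the tuned parameter of Corollary 19 and the two bounds being compared can be written as:

```python
import math

def multiscale_additive_bound(c_i, c_max, c_min, k, T):
    """Pure-additive regret bound of Corollary 19 (up to the hidden constant),
    obtained by setting eps = sqrt(log(k * c_max / c_min) / T) in Theorem 18."""
    return c_i * math.sqrt(T * math.log(k * c_max / c_min))

def standard_experts_bound(c_max, k, T):
    """Classical O(c_max * sqrt(T log k)) experts bound, for comparison."""
    return c_max * math.sqrt(T * math.log(k))
```

For an action with $c_i \ll c_{\max}$ the multi-scale bound is far smaller, while for $c_i = c_{\max}$ it matches the standard bound up to the $\sqrt{\log(c_{\max}/c_{\min})}$ factor.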
It is natural to ask whether we could get rid of the dependence on $\log(c_i/c_{\min})$ and show a regret bound of $O(c_i\sqrt{T\log k})$, as we did for non-negative rewards. However, the next theorem shows that this dependence on $\log(c_i/c_{\min})$ in the above bound is necessary, in a weak sense, where the constant in the $O(\cdot)$ is universal and does not depend on the ranges $c_i$. This is because the lower bound only holds for "small" values of the horizon $T$, which nonetheless grow with the $c_i$'s.‡‡

Theorem 20 There exists an action set of size $k$, ranges $c_i$, $\forall i\in[k]$, and a time horizon $T$, such that for all algorithms for the online multi-scale experts problem with symmetric range, there is a sequence of $T$ gain vectors such that
$$\exists i\in A:\quad \mathbb E[\mathrm{regret}_i] > \frac{c_i}{4}\cdot\sqrt{T\log\big(k\cdot\tfrac{c_{\max}}{c_{\min}}\big)}.$$

We then show the following upper bound for the bandit setting when the range is symmetric. This bound also follows the same style of action-specific regret bounds as in Theorem 12.

Theorem 21 There exists an algorithm for the multi-scale bandits problem with symmetric range that takes as input the ranges $c_i$, $\forall i\in A$, and a parameter $0 < \epsilon \le 1/2$, and satisfies:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le O\Big(\epsilon T + \frac{k}{\epsilon}\cdot\frac{c_{\max}}{c_{\min}}\log\Big(\frac{k}{\epsilon}\cdot\frac{c_{\max}}{c_{\min}}\Big)\Big)\cdot c_i. \quad (26)$$

Also, similar to Section 2.1, we can compute the pure-additive version of the bound in Theorem 21 by setting $\epsilon = \sqrt{k\,\frac{c_{\max}}{c_{\min}}\log\big(kT\cdot\frac{c_{\max}}{c_{\min}}\big)/T}$, as in Corollary 2. This bound is comparable to the standard regret bound of $O(c_{\max}\sqrt{kT\log k})$ (Auer et al., 1995) for the adversarial multi-armed bandits problem.

Corollary 22 There exists an algorithm for the online multi-scale bandits problem with symmetric range that satisfies:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le O\Big(c_i\cdot\sqrt{Tk\cdot\tfrac{c_{\max}}{c_{\min}}\log\big(kT\cdot\tfrac{c_{\max}}{c_{\min}}\big)}\Big).$$
(27)

Once again, for the bandit problem, the following theorem shows that this bound cannot be improved beyond logarithmic factors (to get a guarantee like that of Theorem 12, for instance).

Theorem 23 There exists an action set of size $k$ and ranges $c_i$, $\forall i\in[k]$, such that for all algorithms for the online multi-scale bandit problem with symmetric range, for all sufficiently large time horizons $T$, there is a sequence of $T$ gain vectors such that
$$\exists i\in A:\quad \mathbb E[\mathrm{regret}_i] > \frac{c_i}{8\sqrt 2}\cdot\sqrt{Tk\cdot\frac{c_{\max}}{c_{\min}}}.$$

5.2 Upper bound for experts with symmetric range - Proof of Theorem 18

Recall the proof of Proposition 10. The proof only requires $g_i(t)\in[-c_i, c_i]$ for all $i\in A$, $t\in[T]$. Choosing $q$ to be $\mathbf 1_i$, a vector with a 1-entry in the $i$-th coordinate and 0-entries elsewhere for an action $i\in A$, and noting that $\sum_{t\in[T]}\sum_{i\in A} p_i(t)\frac{(g_i(t))^2}{c_i} \le \sum_{t\in[T]}\sum_{i\in A} p_i(t)\cdot g_i(t)$, we get the following regret bound as a corollary of Proposition 10.

‡‡ For this reason we chose not to include this bound in Table 1.

Corollary 24 For any initial distribution $\mu$ over $A$, and any learning rate parameter $0 < \eta \le 1$, the MSMW algorithm achieves the following regret bound:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le \eta\cdot\mathbb E\Big[\sum_{t\in[T]} g_i(t)\Big] + \frac1\eta\, c_i\log\frac{1}{\mu_i} + \frac1\eta\sum_{j\in A}\mu_j c_j. \quad (28)$$

Now, we can prove the multi-scale regret upper bound in Theorem 18 using Corollary 24.

Proof [of Theorem 18] The proof follows by choosing an appropriate initial distribution $\mu$ in Corollary 24. By Corollary 24, we have:
$$\mathbb E[\mathrm{regret}_i] \le \eta\cdot\mathbb E\Big[\sum_{t\in[T]} g_i(t)\Big] + \frac1\eta\, c_i\log\frac{1}{\mu_i} + \frac1\eta\sum_{j\in A}\mu_j c_j.$$
Let $i_{\min}$ be an action with the minimum range $c_{i_{\min}} = c_{\min}$.
Consider an initial distribution $\mu_j = \pi_j\frac{c_{\min}}{c_j}$ for all $j\ne i_{\min}$, and $\mu_{i_{\min}} = 1 - \sum_{j\ne i_{\min}}\mu_j$, i.e., put all the remaining probability mass on action $i_{\min}$. Then, the third term on the RHS is upper bounded by:
$$\sum_{j\in A}\mu_j c_j = \sum_{j\ne i_{\min}}\mu_j c_j + \mu_{i_{\min}} c_{i_{\min}} = \sum_{j\ne i_{\min}}\pi_j c_{\min} + \mu_{i_{\min}} c_{\min} \le 2c_{\min} \le 2c_i.$$
For $i\ne i_{\min}$, by the definition of $\mu_i$, we have:
$$\mathbb E[\mathrm{regret}_i] \le \eta\cdot\mathbb E\Big[\sum_{t\in[T]} g_i(t)\Big] + \frac1\eta\, c_i\log\Big(\frac{1}{\pi_i}\cdot\frac{c_i}{c_{\min}}\Big) + \frac1\eta\cdot 2c_{\min} = \eta\cdot\mathbb E\Big[\sum_{t\in[T]} g_i(t)\Big] + O\Big(\frac1\eta\log\Big(\frac{1}{\pi_i}\cdot\frac{c_i}{c_{\min}}\Big)\Big)\cdot c_i.$$
So the theorem follows by choosing $\eta = \epsilon$. For $i = i_{\min}$, note that $\mu_j \le \pi_j$ for all $j\ne i_{\min}$ and, thus, $\mu_{i_{\min}} = 1 - \sum_{j\ne i_{\min}}\mu_j \ge 1 - \sum_{j\ne i_{\min}}\pi_j = \pi_{i_{\min}} = \pi_{i_{\min}}\frac{c_{\min}}{c_{i_{\min}}}$. The theorem then holds following the same calculation as in the $i\ne i_{\min}$ case.

5.3 Lower bound for experts with symmetric range - Proof of Theorem 20

Proof [of Theorem 20] We first show that for any online learning algorithm and any sufficiently large $h > 1$, there is an instance with two experts, $c_1 = 1$ and $c_2 = h$, and $T = \Theta(\log h)$ rounds, such that either $\mathbb E[\mathrm{regret}_1] > \frac12 T + \sqrt h$, or $\mathbb E[\mathrm{regret}_2] > \frac12 Th + \frac15 h\log_2 h$. We will construct this instance, with $T = \frac12\log_2 h - 1$ rounds, adaptively so that action 1 always has gain 0 and action 2 has gain either $h$ or $-h$. The proof of the theorem then follows as $c_{\min} = 1$, $c_{\max} = h$, $T = \frac12\log_2 h - 1$, and $k = 2$ in this instance.

Let $q_t$ denote the probability that the algorithm picks action 2 in round $t$ after having seen gains 0 and $h$ for the two actions respectively in the first $t-1$ rounds. We will first show that (1) if the algorithm has small regret with respect to action 1, then $q_t$ must be upper bounded, since the adversary may let action 2 have gain $-h$ in any round $t$ in which $q_t$ is too large.
Then, we will show that (2) since $q_t$ is upper bounded for any $1\le t\le T$, the algorithm must have large regret with respect to action 2. We proceed with upper bounding the $q_t$'s. Concretely, we will show the following lemma.

Lemma 25 Suppose $\mathbb E[\mathrm{regret}_1] \le \frac12 T + \sqrt h$. Then, for any $1\le t\le T$, we have $q_t \le \frac{2^t}{\sqrt h}$.

Proof [Proof of Lemma 25] We prove by induction on $t$. Consider the base case $t = 1$. Suppose for contradiction that $q_1 > \frac{2}{\sqrt h}$. Then, consider an instance in which action 2 always has gain $-h$. In this case, the expected gain of the algorithm (even if it always correctly picks action 1 in the remaining rounds) is at most $q_1\cdot(-h) < -2\sqrt h$. This contradicts the assumption that $\mathbb E[\mathrm{regret}_1] \le \frac12 T + \sqrt h < 2\sqrt h$.

Next, suppose the lemma holds for all rounds prior to round $t$. Then, the expected gain of the algorithm in the first $t-1$ rounds, if action 2 has gain $h$ in those rounds, is at most
$$\sum_{\ell=1}^{t-1} q_\ell\cdot h \le \sum_{\ell=1}^{t-1} 2^\ell\sqrt h = (2^t - 2)\sqrt h.$$
Suppose for contradiction that $q_t > \frac{2^t}{\sqrt h}$. Then, consider an instance in which action 2 has gain $h$ in the first $t-1$ rounds and $-h$ afterwards. In this case, the expected gain of the algorithm (even if it always correctly picks action 1 after round $t$) is at most
$$(2^t - 2)\sqrt h + q_t\cdot(-h) < (2^t - 2)\sqrt h - 2^t\sqrt h = -2\sqrt h.$$
This contradicts the assumption that $\mathbb E[\mathrm{regret}_1] \le \frac12 T + \sqrt h < 2\sqrt h$.

Consider an instance in which action 2 always has gain $h$. Suppose that $\mathbb E[\mathrm{regret}_1] \le \frac12 T + \sqrt h$. As an immediate implication of the above lemma, the expected gain of the algorithm is upper bounded by:
$$\sum_{t=1}^T q_t\, h \le \sum_{t=1}^T 2^t\sqrt h < 2^{T+1}\sqrt h = h.$$
Note that in this instance $\mathbb E[G_2] = T\cdot h$. Thus, the regret w.r.t. action 2 is at least $(T-1)h$, which is greater than $\frac12\,\mathbb E[G_2] + \frac15 h\log_2 h$ for sufficiently large $h$.
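The counting at the heart of this proof is easy to check numerically. The sketch below (our own illustration; the function name is ours) evaluates the upper bound $\sum_{t=1}^T (2^t/\sqrt h)\cdot h$ on the algorithm's expected gain when action 2 always gains $h$, and confirms that it stays below $h$ when $T = \frac12\log_2 h - 1$:

```python
import math

def max_hedge_gain_upper_bound(h):
    """Numeric sanity check for the final step of the Theorem 20 proof:
    with q_t <= 2^t / sqrt(h) (Lemma 25) and T = (1/2) * log2(h) - 1, the
    algorithm's expected gain when action 2 always gains h is at most
    sum_t (2^t / sqrt(h)) * h < 2^(T+1) * sqrt(h) = h."""
    T = int(0.5 * math.log2(h)) - 1
    bound = sum((2 ** t / math.sqrt(h)) * h for t in range(1, T + 1))
    return T, bound
```

Since the benchmark gain of action 2 is $T\cdot h$ while the algorithm collects less than $h$, the regret with respect to action 2 is at least $(T-1)h$, as claimed.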
5.4 Upper bound for bandits with symmetric range - Proof of Theorem 21

We start by presenting the following regret bound, whose proof is an alteration of that of Lemma 14 to the symmetric range. We then prove Theorem 21.

Lemma 26 For any exploration rate $0 < \gamma \le \min\{\frac12, \frac{c_{\min}}{c_{\max}}\}$ and any learning rate $0 < \eta \le \frac{\gamma}{k}$, the Bandit-MSMW algorithm (Algorithm 3) achieves the following regret bound:
$$\forall i\in A:\quad \mathbb E[\mathrm{regret}_i] \le O\Big(\frac1\eta\log\frac{k}{\gamma}\Big)\cdot c_i + \gamma T\cdot c_{\max}.$$

Proof [of Lemma 26] We further define:
$$\tilde G_{\mathrm{alg}} \triangleq \sum_{t\in[T]}\tilde p(t)\cdot\tilde g(t), \qquad \tilde G_j \triangleq \sum_{t\in[T]}\tilde g_j(t).$$
In expectation over the randomness of the algorithm, we have:

1. $\mathbb E[G_{\mathrm{alg}}] = \mathbb E\big[\tilde G_{\mathrm{alg}}\big]$; and
2. $G_j = \mathbb E\big[\tilde G_j\big]$ for any $j\in A$.

Hence, to upper bound $\mathbb E[\mathrm{regret}_i] = G_i - \mathbb E[G_{\mathrm{alg}}]$, it suffices to upper bound $\mathbb E\big[\tilde G_i - \tilde G_{\mathrm{alg}}\big]$. By the definition of the probability $\tilde p(t)$ with which the algorithm picks each arm, and since the reward of each round is at least $-c_{\max}$, we have that:
$$\mathbb E\big[\tilde G_{\mathrm{alg}}\big] \ge (1-\gamma)\sum_{t\in[T]} p(t)\cdot\tilde g(t) - \gamma T c_{\max}.$$
Hence, for any benchmark distribution $q$ over $A$, we have that:
$$\sum_{j\in A} q_j\cdot\mathbb E\big[\tilde G_j\big] - \mathbb E\big[\tilde G_{\mathrm{alg}}\big] \le \mathbb E\Big[\sum_{j\in A} q_j\cdot\tilde G_j - \sum_{t\in[T]} p(t)\cdot\tilde g(t)\Big] + \frac{\gamma}{1-\gamma}\,\mathbb E\big[\tilde G_{\mathrm{alg}}\big] + \frac{\gamma}{1-\gamma}\, T c_{\max}$$
$$\le \mathbb E\Big[\sum_{j\in A} q_j\cdot\tilde G_j - \sum_{t\in[T]} p(t)\cdot\tilde g(t)\Big] + 2\gamma\,\mathbb E\big[\tilde G_{\mathrm{alg}}\big] + 2\gamma T c_{\max} \le \mathbb E\Big[\sum_{j\in A} q_j\cdot\tilde G_j - \sum_{t\in[T]} p(t)\cdot\tilde g(t)\Big] + 4\gamma T c_{\max}, \quad (29)$$
where the second inequality is due to $\gamma \le \frac12$, and the third inequality follows because $c_{\max}$ is the largest possible reward per round.

Next, we upper bound the first term on the RHS of (29). Note that the $p(t)$'s are the probabilities of choosing experts by MSMW when the experts have rewards $\tilde g(t)$'s.
By Proposition 10, we have that for any benchmark distribution $q$ over $A$, the Bandit-MSMW algorithm satisfies:
$$\sum_{j\in A} q_j\cdot\tilde G_j - \sum_{t\in[T]} p(t)\cdot\tilde g(t) \le \eta\sum_{t\in[T]}\sum_{j\in A}\frac{p_j(t)}{c_j}\cdot\tilde g_j(t)^2 + \frac1\eta\sum_{j\in A} c_j\Big(q_j\ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big). \quad (30)$$
For any $t\in[T]$ and any $j\in A$, by the definition of $\tilde g_j(t)$, it equals $\frac{g_j(t)}{\tilde p_j(t)}$ with probability $\tilde p_j(t)$, and equals 0 otherwise. Thus, if we fix the random coin flips in the first $t-1$ rounds (and thus fix $\tilde p(t)$) and take the expectation over the randomness in round $t$, we have that:
$$\mathbb E\Big[\frac{p_j(t)}{c_j}\cdot\tilde g_j(t)^2\Big] = \frac{p_j(t)}{c_j}\cdot\tilde p_j(t)\cdot\Big(\frac{g_j(t)}{\tilde p_j(t)}\Big)^2 = \frac{p_j(t)}{\tilde p_j(t)}\cdot\frac{(g_j(t))^2}{c_j}.$$
Further noting that $\tilde p_j(t) \ge (1-\gamma)\, p_j(t)$ and $|g_j(t)| \le c_j$, the above is upper bounded by $\frac{1}{1-\gamma}|g_j(t)| \le 2|g_j(t)| \le 2c_{\max}$. Putting this together with (30), we have that for any $0 < \eta \le \frac{\gamma}{k}$:
$$\mathbb E\Big[\sum_{j\in A} q_j\cdot\tilde G_j - \sum_{t\in[T]} p(t)\cdot\tilde g(t)\Big] \le \eta\sum_{t\in[T]}\sum_{j\in A} 2c_{\max} + \frac1\eta\sum_{j\in A} c_j\Big(q_j\ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) = 2\eta Tk\, c_{\max} + \frac1\eta\sum_{j\in A} c_j\Big(q_j\ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big).$$
Combining with (29), we have (recall that $\eta \le \frac{\gamma}{k}$):
$$\sum_{j\in A} q_j\cdot\mathbb E\big[\tilde G_j\big] - \mathbb E\big[\tilde G_{\mathrm{alg}}\big] \le 2\eta Tk\, c_{\max} + \frac1\eta\sum_{j\in A} c_j\Big(q_j\ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) + 4\gamma T c_{\max} \le \frac1\eta\sum_{j\in A} c_j\Big(q_j\ln\frac{q_j}{p_j(1)} - q_j + p_j(1)\Big) + 6\gamma T c_{\max}.$$
Let $q = (1-\gamma)\mathbf 1_i + \frac{\gamma}{k}\mathbf 1$. Recall that $p(1) = (1-\gamma)\mathbf 1_{i_{\min}} + \frac{\gamma}{k}\mathbf 1$ (recall that $i_{\min}$ is the arm with the minimum range $c_{i_{\min}}$). Similar to the discussion for the experts problem in Section 2.5, the first term on the RHS is upper bounded by $O\big(\frac1\eta\log\frac{k}{\gamma}\big)\cdot c_i$. Hence, we have:
$$\sum_{j\in A} q_j\cdot\mathbb E\big[\tilde G_j\big] - \mathbb E\big[\tilde G_{\mathrm{alg}}\big] \le O\Big(\frac1\eta\log\frac{k}{\gamma}\Big)\cdot c_i + 6\gamma T c_{\max}. \quad (31)$$
Further, the LHS is lower bounded as:
$$(1-\gamma)\,\mathbb E\big[\tilde G_i\big] + \frac{\gamma}{k}\sum_{j\in A}\mathbb E\big[\tilde G_j\big] - \mathbb E\big[\tilde G_{\mathrm{alg}}\big] \ge (1-\gamma)\,\mathbb E\big[\tilde G_i\big] - \gamma T c_{\max} - \mathbb E\big[\tilde G_{\mathrm{alg}}\big].$$
The lemma then follows by plugging this back into (31) and rearranging terms.

Proof [of Theorem 21] Let $\gamma = \epsilon\cdot\frac{c_{\min}}{c_{\max}}$ and $\eta = \frac{\gamma}{k}$ in Lemma 26. The theorem follows by noting that $\gamma\, c_{\max} = \epsilon\, c_{\min} \le \epsilon\, c_i$.

5.5 Lower bound for bandits with symmetric range - Proof of Theorem 23

Proof [of Theorem 23] We first show that for any online multi-scale bandits algorithm, there is an instance with two arms, $c_1 = 1$ and $c_2 = h$ for some sufficiently large $h$, a sufficiently large $T$, and $\epsilon = \sqrt{\frac{h}{256T}}$, such that either $\mathbb E[\mathrm{regret}_1] > \epsilon T + \frac{1}{256\epsilon} h$, or $\mathbb E[\mathrm{regret}_2] > \epsilon Th + \frac{1}{256\epsilon} h^2$. We prove the existence of this instance by looking at the stochastic setting, i.e., the gain vectors $g(t)$ are i.i.d. for $1\le t\le T$. We consider two instances, both of which admit a fixed gain of 0 for action 1. In the first instance, the gain of action 2 is $h$ with probability $\frac12 - 2\epsilon$, and $-h$ otherwise. Hence, the expected gain of playing action 2 is $-4\epsilon h$ per round in instance 1. In the second instance, the gain of action 2 is $h$ with probability $\frac12 + 2\epsilon$, and $-h$ otherwise. Hence, the expected gain of playing action 2 is $4\epsilon h$ per round in instance 2. Note that this proves the theorem, as $c_{\min} = 1$, $c_{\max} = h$, $k = 2$, and $T = \frac{h}{256\epsilon^2}$.

Suppose for contradiction that the algorithm satisfies:
$$\mathbb E[\mathrm{regret}_1] \le \epsilon T + \frac{1}{256\epsilon} h = \frac{1}{128\epsilon} h, \qquad \mathbb E[\mathrm{regret}_2] \le \epsilon hT + \frac{1}{256\epsilon} h^2 = \frac{1}{128\epsilon} h^2.$$
Let $N_1$ denote the expected number of times that the algorithm plays action 2 in instance 1. Then, the expected regret with respect to action 1 in instance 1 is $N_1\cdot 4\epsilon h$. By the assumption that $\mathbb E[\mathrm{regret}_1] \le \frac{1}{128\epsilon} h$, we have $N_1 \le \frac{1}{512\epsilon^2}$.

Next, by a standard calculation, we get that the Kullback-Leibler (KL) divergence of the observed rewards in a single round in the two instances is 0 if action 1 is played, and is at most $64\epsilon^2$ (for $0 < \epsilon < 0.1$) if action 2 is played.
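The per-round KL bound claimed above can be verified numerically. In the sketch below (our own, not from the paper) the observed reward when action 2 is played is a scaled Bernoulli; scaling by the constant $h$ leaves the divergence unchanged, so it suffices to compare the two underlying Bernoulli distributions:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def per_round_kl(eps):
    """Observed-reward KL between the two instances when action 2 is played:
    gain is +h w.p. 1/2 - 2*eps (instance 1) vs 1/2 + 2*eps (instance 2)."""
    return kl_bernoulli(0.5 - 2 * eps, 0.5 + 2 * eps)
```

Over the relevant range of $\epsilon$, the divergence indeed stays below $64\epsilon^2$, and it is exactly 0 in rounds where action 1 (with its deterministic gain) is played.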
So the KL divergence of the observed reward sequences in the two instances is at most $64\epsilon^2\cdot N_1 \le \frac18$. We then use a standard inequality about KL divergences: for any measurable function $\psi: X \mapsto \{1, 2\}$, we have
$$\Pr_{X\sim\rho_1}\big[\psi(X) = 2\big] + \Pr_{X\sim\rho_2}\big[\psi(X) = 1\big] \ge \frac12\exp\big(-\mathrm{KL}(\rho_1, \rho_2)\big).$$
For any $1\le t\le T$, let $\rho_1$ and $\rho_2$ be the distributions of observed rewards up to round $t$ in the two instances, and let $\psi(X)$ be the action played by the algorithm. By this inequality and the above bound on the KL divergence between the observed rewards in the two instances, we get that in each round, the probability that the algorithm plays action 2 in instance 1, plus the probability that the algorithm plays action 1 in instance 2, is at least $\frac12\exp(-\frac18) > \frac25$. Thus, the expected number of times that the algorithm plays action 1 in instance 2 from round 1 to $T$, denoted $N_2$, is at least
$$N_2 \ge \frac25\cdot T - N_1 \ge \frac13\cdot T,$$
where the second inequality holds for sufficiently large $h$. Therefore, the expected regret w.r.t. action 2 in instance 2 is at least:
$$4\epsilon h\cdot\frac13\cdot T = \frac43\,\epsilon hT > \frac{1}{128\epsilon} h^2.$$
This contradicts our assumption that $\mathbb E[\mathrm{regret}_2] \le \frac{1}{128\epsilon} h^2$.

6. Conclusion

Revenue management has emerged as a competitive toolbox of strategies for increasing the profit of web-based markets. In particular, dynamic pricing, and dynamic auction design as its less mature relative, have become prevalent market mechanisms in nearly all industries. In this paper, we studied these problems from the perspective of online learning. For the online single buyer auction, we showed regret bounds that scale with the best fixed price, rather than the range of the values (with a generalization to learning auctions).
Moreover, we demonstrated a connection between the optimal regret bounds for this problem and offline sample complexity lower bounds for approximating the optimal revenue, studied in Cole and Roughgarden (2014); Huang et al. (2015a). Using this connection, we showed that our regret bounds are almost optimal, as they match these information-theoretic lower bounds. We further generalized our results to online pricing (bandit feedback) and online auctions with multiple buyers.

The key to our development and improved regret bounds for online auction design is generalizing the classical learning from experts and multi-armed bandit problems to their "multi-scale versions", where the reward of each action is in a different range. Here the objective is to design online learning algorithms whose regret with respect to a given action scales with its own range, rather than the maximum range. We showed how a variant of online mirror descent solves this learning problem.

Acknowledgments

References

Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989–1006. ACM, 2014.

Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 1169–1177, 2013.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995.
Proceedings, 36th Annual Symposium on, pages 322–331. IEEE, 1995.

Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg, and Aleksandrs Slivkins. Dynamic pricing with limited supply. ACM Transactions on Economics and Computation, 3(1):4, 2015.

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 207–216. IEEE, 2013.

Maria-Florina Balcan, Avrim Blum, Jason D Hartline, and Yishay Mansour. Reducing mechanism design to algorithm design via machine learning. Journal of Computer and System Sciences, 74(8):1245–1270, 2008.

Ziv Bar-Yossef, Kirsten Hildrum, and Felix Wu. Incentive-compatible online auctions for digital goods. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 964–970. Society for Industrial and Applied Mathematics, 2002.

Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.

Avrim Blum and Jason D Hartline. Near-optimal online auctions. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1156–1163. Society for Industrial and Applied Mathematics, 2005.

Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137–146, 2004.

Sébastien Bubeck. Introduction to online optimization. Lecture Notes, pages 1–86, 2011.

Yang Cai, Constantinos Daskalakis, and S Matthew Weinberg. Optimal multi-dimensional mechanism design: Reducing revenue to welfare maximization. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 130–139. IEEE, 2012.

Richard Cole and Tim Roughgarden.
The sample complexity of revenue maximization. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 243–252, 2014.

Arnoud V den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.

Nikhil R Devanur, Zhiyi Huang, and Christos-Alexandros Psomas. The sample complexity of auctions with side information. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 426–439. ACM, 2016.

Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single sample. Games and Economic Behavior, 2014.

Edith Elkind. Designing and learning optimal finite support auctions. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 736–745. Society for Industrial and Applied Mathematics, 2007.

Dylan J Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020–6030, 2017.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

Yannai A Gonczarowski and Noam Nisan. Efficient empirical revenue maximization in single-parameter auction environments. In Proceedings of the ACM STOC, 2017.

Zhiyi Huang, Yishay Mansour, and Tim Roughgarden. Making the most of your samples. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC '15, Portland, OR, USA, June 15-19, 2015, pages 45–60, 2015a.

Zhiyi Huang, Yishay Mansour, and Tim Roughgarden. Making the most of your samples. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 45–60.
ACM, 2015b.

Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on, pages 594–605. IEEE, 2003.

Jamie H Morgenstern and Tim Roughgarden. On the pseudo-dimension of nearly optimal auctions. In Advances in Neural Information Processing Systems, pages 136–144, 2015.

Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73, 1981.

Tim Roughgarden and Okke Schrijvers. Ironing in the dark. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 1–18. ACM, 2016.

Ilya Segal. Optimal pricing mechanisms with unknown demand. The American Economic Review, 93(3):509–529, 2003.

Vasilis Syrgkanis. A sample complexity measure with applications to learning optimal auctions. arXiv preprint arXiv:1704.02598, 2017.

Kalyan T Talluri and Garrett J Van Ryzin. The Theory and Practice of Revenue Management, volume 68. Springer Science & Business Media, 2006.

Appendix A. Other Deferred Proofs and Discussions

A.1 Discussion on choice of $\pi$ for symmetric range

We now describe how the choice of initial distribution $\pi$ affects the bound given in Theorem 18.

- When the action set is finite, we can choose $\pi$ to be the uniform distribution to get the term $O\big(\frac1\epsilon\log(k\, c_i/c_{\min})\big)\cdot c_i$. This recovers the standard bound by setting $c_i = c_{\max}$ for all $i\in A$.
- We can choose $\pi_i = \frac{c_i}{\sum_{j\in A} c_j}$ to get $O\big(\frac1\epsilon\log\big(\sum_{j\in A} c_j/c_{\min}\big)\big)\cdot c_i$. In particular, if the $c_i$'s form an arithmetic progression with a constant difference, then this is just $O\big(\frac1\epsilon\log k\big)\cdot c_i$.

A.2 Proof of Proposition 10 from first principles

We also provide an elementary proof of this lemma using first principles.
Proof [of Proposition 10] Based on the update rule of Algorithm 1, we have $g_i(t) = \frac{c_i}{\eta}\log\big(\frac{w_i(t+1)}{p_i(t)}\big)$ for any $i\in A$. Therefore:
$$g(t)\cdot\big(q - p(t)\big) = \sum_{i\in A} g_i(t)\big(q_i - p_i(t)\big) = \sum_{i\in A}\frac{c_i}{\eta}\cdot\log\frac{w_i(t+1)}{p_i(t)}\cdot\big(q_i - p_i(t)\big) = \frac1\eta\Big(\sum_{i\in A} c_i\, q_i\log\frac{w_i(t+1)}{p_i(t)} + \sum_{i\in A} c_i\, p_i(t)\log\frac{p_i(t)}{w_i(t+1)}\Big)$$
$$= \frac1\eta\Big(\sum_{i\in A} c_i\, q_i\log\frac{w_i(t+1)}{p_i(t+1)} + \sum_{i\in A} c_i\, q_i\log\frac{p_i(t+1)}{p_i(t)} + \sum_{i\in A} c_i\, p_i(t)\log\frac{p_i(t)}{w_i(t+1)}\Big). \quad (32)$$
Now, note that due to the normalization step of Algorithm 1, for any $i\in A$ we have:
$$c_i\cdot\log\frac{w_i(t+1)}{p_i(t+1)} = \lambda = \sum_{j\in A} c_j\, p_j(t+1)\cdot\frac{\lambda}{c_j} = \sum_{j\in A} c_j\, p_j(t+1)\cdot\log\frac{w_j(t+1)}{p_j(t+1)}.$$
So the first summation in (32) is equal to:
$$\sum_{i\in A} c_i\, q_i\log\frac{w_i(t+1)}{p_i(t+1)} = \sum_{i\in A} q_i\cdot\sum_{j\in A} c_j\, p_j(t+1)\log\frac{w_j(t+1)}{p_j(t+1)} = \sum_{i\in A} c_i\, p_i(t+1)\log\frac{w_i(t+1)}{p_i(t+1)}. \quad (33)$$
Combining Eqn. (32) and (33), we have:
$$g(t)\cdot\big(q - p(t)\big) = \frac1\eta\sum_{i\in A} c_i\Big(p_i(t)\log\frac{p_i(t)}{w_i(t+1)} + p_i(t+1)\log\frac{w_i(t+1)}{p_i(t+1)}\Big) + \frac1\eta\sum_{i\in A} c_i\, q_i\log\frac{p_i(t+1)}{p_i(t)}.$$
The second part is a telescoping sum when we sum over $t$. We upper bound the first part as follows. By $\log x \le x - 1$, we get that:
$$\sum_{i\in A} c_i\Big(p_i(t)\log\frac{p_i(t)}{w_i(t+1)} + p_i(t+1)\log\frac{w_i(t+1)}{p_i(t+1)}\Big) \le \sum_{i\in A} c_i\Big(p_i(t)\log\frac{p_i(t)}{w_i(t+1)} - p_i(t+1) + w_i(t+1)\Big)$$
$$= \sum_{i\in A} c_i\big(p_i(t) - p_i(t+1)\big) + \sum_{i\in A} c_i\Big(p_i(t)\log\frac{p_i(t)}{w_i(t+1)} - p_i(t) + w_i(t+1)\Big).$$
Again, the first part is a telescoping sum when we sum over $t$. We will further work on the second part.
By the relation between $w_i(t+1)$ and $p_i(t)$, we get that:
$$\sum_{i\in A} c_i\Big(p_i(t)\log\frac{p_i(t)}{w_i(t+1)} - p_i(t) + w_i(t+1)\Big) = \sum_{i\in A} c_i\, p_i(t)\Big(-\frac{\eta\, g_i(t)}{c_i} - 1 + \exp\Big(\frac{\eta\, g_i(t)}{c_i}\Big)\Big).$$
Note that $\frac{\eta\, g_i(t)}{c_i}\in[-1, 1]$ because $g_i(t)\in[-c_i, c_i]$ and $0 < \eta \le 1$. By $\exp(x) - x - 1 \le x^2$ for $-1\le x\le 1$, the above is upper bounded by $\eta^2\sum_{i\in A} p_i(t)\frac{(g_i(t))^2}{c_i}$. Putting everything together, we get that:
$$g(t)\cdot\big(q - p(t)\big) \le \frac1\eta\sum_{i\in A} c_i\Big(q_i\log\frac{p_i(t+1)}{p_i(t)} + p_i(t) - p_i(t+1)\Big) + \eta\sum_{i\in A} p_i(t)\frac{(g_i(t))^2}{c_i}.$$
Summing over $t$, we have:
$$\sum_{t\in[T]} g(t)\cdot\big(q - p(t)\big) \le \frac1\eta\sum_{i\in A} c_i\Big(q_i\log\frac{p_i(T+1)}{p_i(1)} + p_i(1) - p_i(T+1)\Big) + \eta\sum_{t\in[T]}\sum_{i\in A} p_i(t)\frac{(g_i(t))^2}{c_i}.$$
Finally, by $\log x \le x - 1$, we get that $q_i\log\frac{p_i(T+1)}{q_i} \le p_i(T+1) - q_i$. Hence, we have:
$$\sum_{t\in[T]} g(t)\cdot\big(q - p(t)\big) \le \frac1\eta\sum_{i\in A} c_i\Big(q_i\log\frac{q_i}{p_i(1)} + p_i(1) - q_i\Big) + \eta\sum_{t\in[T]}\sum_{i\in A} p_i(t)\frac{(g_i(t))^2}{c_i}.$$
The lemma then follows by our choice of the initial distribution.

A.3 Proof of the OMD regret bound

In order to prove the OMD regret bound, we need some properties of the Bregman divergence.

Lemma 27 (Properties of the Bregman divergence (Bubeck, 2011)) Suppose $F(\cdot)$ is a Legendre function and $D_F(\cdot,\cdot)$ is its associated Bregman divergence as defined in Definition 4. Then:

- $D_F(x, y) > 0$ if $x\ne y$, as $F$ is strictly convex, and $D_F(x, x) = 0$.
- $D_F(\cdot, y)$ is a convex function for any choice of $y$.
- (Pythagorean theorem) If $\mathcal A$ is a convex set, $a\in\mathcal A$, $b\notin\mathcal A$, and $c = \arg\min_{x\in\mathcal A} D_F(x, b)$, then $D_F(a, c) + D_F(c, b) \le D_F(a, b)$.

Given Lemma 27, we are now ready to prove Lemma 6.
Proof [Proof of Lemma 6] To obtain the OMD regret bound, we have:
$$q\cdot g(t) - p(t)\cdot g(t) = \frac1\eta\big(q - p(t)\big)\cdot\big(\nabla F(w(t+1)) - \nabla F(p(t))\big) = \frac1\eta\Big(D_F\big(q, p(t)\big) + D_F\big(p(t), w(t+1)\big) - D_F\big(q, w(t+1)\big)\Big)$$
$$\overset{(1)}{\le} \frac1\eta\, D_F\big(p(t), w(t+1)\big) + \frac1\eta\Big(D_F\big(q, p(t)\big) - D_F\big(q, p(t+1)\big)\Big), \quad (34)$$
where in (1) we use $D_F\big(p(t+1), w(t+1)\big) \ge 0$ and $D_F\big(q, p(t+1)\big) + D_F\big(p(t+1), w(t+1)\big) \le D_F\big(q, w(t+1)\big)$, due to the Pythagorean theorem (Lemma 27). By summing up both sides of (34) for $t = 1, \ldots, T$ we have:
$$\sum_{t\in[T]} g(t)\cdot\big(q - p(t)\big) \le \frac1\eta\sum_{t\in[T]} D_F\big(p(t), w(t+1)\big) + \frac1\eta\, D_F\big(q, p(1)\big). \quad (35)$$
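The proofs in this appendix pin down Algorithm 1's two steps: a per-coordinate exponential update $w_i(t+1) = p_i(t)\exp(\eta\, g_i(t)/c_i)$, followed by a normalization that makes $c_i\log\big(w_i(t+1)/p_i(t+1)\big)$ equal to a common value $\lambda$ across all actions. Here is a minimal Python sketch of one such step (our own reconstruction from the proof; solving for the normalizer $\lambda$ by bisection is an implementation choice of ours, not the paper's):

```python
import math

def msmw_step(p, g, c, eta):
    """One step of multi-scale multiplicative weights, as recovered from the
    proof of Proposition 10: exponential update with per-action scale c_i,
    then a normalization enforcing c_i * log(w_i / p_i') = lambda for all i,
    i.e. p_i' = w_i * exp(-lambda / c_i) with sum_i p_i' = 1."""
    w = [p_i * math.exp(eta * g_i / c_i) for p_i, g_i, c_i in zip(p, g, c)]

    def total(lam):  # sum of w_i * exp(-lam / c_i); strictly decreasing in lam
        return sum(w_i * math.exp(-lam / c_i) for w_i, c_i in zip(w, c))

    lo, hi = -1.0, 1.0           # bracket the root of total(lam) = 1
    while total(lo) < 1.0:
        lo *= 2.0
    while total(hi) > 1.0:
        hi *= 2.0
    for _ in range(100):         # bisection for the normalizer lambda
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    lam = (lo + hi) / 2.0
    return [w_i * math.exp(-lam / c_i) for w_i, c_i in zip(w, c)]
```

By construction, the returned distribution sums to 1 and satisfies the invariant $c_i\log(w_i/p_i') = \lambda$ used in the derivation of Eqn. (33).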