Sparsity regret bounds for individual sequences in online linear regression


Authors: Sébastien Gerchinovitz (DMA, INRIA Paris - Rocquencourt)

Journal of Machine Learning Research 14 (2013) 729-769. Submitted 3/12; Revised 11/12; Published 3/13.

Sparsity Regret Bounds for Individual Sequences in Online Linear Regression*

Sébastien Gerchinovitz (SEBASTIEN.GERCHINOVITZ@ENS.FR)
École Normale Supérieure†, 45 rue d'Ulm, Paris, France

Editor: Nicolò Cesa-Bianchi

Abstract

We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension $d$ can be much larger than the number of time rounds $T$. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.

Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

*. A shorter version appeared in the proceedings of COLT 2011 (see Gerchinovitz 2011).
†. This research was carried out within the INRIA project CLASSIC hosted by École Normale Supérieure and CNRS.

1. Introduction

Sparsity has been extensively studied in the stochastic setting over the past decade. This notion is key to address statistical problems that are high-dimensional, that is, where the number of unknown parameters is of the same order or even much larger than the number of observations. This is the case in many contemporary applications such as computational biology (e.g., analysis of DNA sequences), collaborative filtering (e.g., Netflix, Amazon), satellite and hyperspectral imaging, and high-dimensional econometrics (e.g., cross-country growth regression problems).

A key message about sparsity is that, although high-dimensional statistical inference is impossible in general (i.e., without further assumptions), it becomes statistically feasible if, among the many unknown parameters, only few of them are non-zero. Such a situation is called a sparsity scenario and has been the focus of many theoretical, computational, and practical works over the past decade in the stochastic setting. On the theoretical side, most sparsity-related risk bounds take the form of so-called sparsity oracle inequalities, that is, risk bounds expressed in terms of the number of non-zero coordinates of the oracle vector. As of now, such theoretical guarantees have only been proved under stochastic assumptions.¹

1. One could object that most high-probability risk bounds derived for $\ell_1$-regularization methods are in fact deterministic inequalities that hold true whenever the noise vector $\varepsilon$ belongs to some set $S$ (see, e.g., Bickel et al. 2009). However, the fact that $\varepsilon \in S$ with high probability is only guaranteed via concentration arguments, so it is a consequence of the underlying statistical assumptions.

In this paper we address the prediction possibilities under a sparsity scenario in both deterministic and stochastic settings.
We first prove that theoretical guarantees similar to sparsity oracle inequalities can be obtained in a deterministic online setting, namely, online linear regression on individual sequences. The newly obtained deterministic prediction guarantees are called sparsity regret bounds. We prove such bounds for an online-learning algorithm which, in its most sophisticated version, is fully automatic in the sense that no preliminary knowledge is needed for the choice of its tuning parameters. In the second part of this paper, we apply our sparsity regret bounds—of deterministic nature—to the stochastic setting (regression model with random design). One of our key results is that, thanks to our online tuning techniques, these deterministic bounds imply sparsity oracle inequalities that are adaptive to the unknown variance of the noise (up to logarithmic factors) when the latter is Gaussian. In particular, this solves an open question raised by Dalalyan and Tsybakov (2012a).

In the next paragraphs, we introduce our main setting and motivate the notion of sparsity regret bound from an online-learning viewpoint. We then detail our main contributions with respect to the statistical literature and the machine-learning literature.

1.1 Introduction of a Deterministic Counterpart of Sparsity Oracle Inequalities

We consider the problem of online linear regression on arbitrary deterministic sequences. A forecaster has to predict in a sequential fashion the values $y_t \in \mathbb{R}$ of an unknown sequence of observations, given some input data $x_t \in \mathcal{X}$ and some base forecasters $\varphi_j : \mathcal{X} \to \mathbb{R}$, $1 \le j \le d$, on the basis of which he outputs a prediction $\hat{y}_t \in \mathbb{R}$. The quality of the predictions is assessed by the square loss. The goal of the forecaster is to predict almost as well as the best linear forecaster $u \cdot \varphi \triangleq \sum_{j=1}^d u_j \varphi_j$, where $u \in \mathbb{R}^d$, that is, to satisfy, uniformly over all individual sequences $(x_t, y_t)_{1 \le t \le T}$, a regret bound of the form

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + \Delta_{T,d}(u) \right\}$$

for some regret term $\Delta_{T,d}(u)$ that should be as small as possible and, in particular, sublinear in $T$. (For the sake of introduction, we omit the dependencies of $\Delta_{T,d}(u)$ on the amplitudes $\max_{1\le t\le T} |y_t|$ and $\max_{1\le t\le T} \max_{1\le j\le d} |\varphi_j(x_t)|$.)

In this setting the version of the sequential ridge regression forecaster studied by Azoury and Warmuth (2001) and Vovk (2001) can be tuned to have a regret $\Delta_{T,d}(u)$ of order at most $d \ln\big(T \|u\|_2^2\big)$. When the ambient dimension $d$ is much larger than the number of time rounds $T$, the latter regret bound may unfortunately be larger than $T$ and is thus somehow trivial. Since the regret bound $d \ln T$ is optimal in a certain sense (see, e.g., the lower bound of Vovk 2001, Theorem 2), additional assumptions are needed to get interesting theoretical guarantees. A natural assumption, which has already been extensively studied in the stochastic setting, is that there is a sparse vector $u^*$ (i.e., with $s \ll T/(\ln T)$ non-zero coefficients) such that the linear combination $u^* \cdot \varphi$ has a small cumulative square loss. If the forecaster knew in advance the support $J(u^*) \triangleq \{j : u^*_j \ne 0\}$ of $u^*$, he could apply the same forecaster as above but only to the $s$-dimensional linear subspace $\{u \in \mathbb{R}^d : \forall j \notin J(u^*),\ u_j = 0\}$.
The regret bound of this "oracle" would be roughly of order $s \ln T$ and thus sublinear in $T$. Under this sparsity scenario, a sublinear regret thus seems possible, though, of course, the aforementioned regret bound $s \ln T$ can only be used as an ideal benchmark (since the support of $u^*$ is unknown).

In this paper, we prove that a regret bound proportional to $s$ is achievable (up to logarithmic factors). In Corollary 2 and its refinements (Corollary 7 and Theorem 10), we indeed derive regret bounds of the form

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + \big(\|u\|_0 + 1\big)\, g_{T,d}\big(\|u\|_1, \|\varphi\|_\infty\big) \right\}, \tag{1}$$

where $\|u\|_0$ denotes the number of non-zero coordinates of $u$ and where $g$ grows at most logarithmically in $T$, $d$, $\|u\|_1 \triangleq \sum_{j=1}^d |u_j|$, and $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$. We call regret bounds of the above form sparsity regret bounds.

This work is in connection with several papers that belong either to the statistical or to the machine-learning literature. Next we discuss these papers and some related references.

1.2 Related Works in the Stochastic Setting

The above regret bound (1) can be seen as a deterministic online counterpart of the so-called sparsity oracle inequalities introduced in the stochastic setting in the past decade. The latter are risk bounds expressed in terms of the number of non-zero coordinates of the oracle vector—see (2) below. More formally, consider the regression model with random or fixed design. The forecaster observes independent random pairs $(X_1, Y_1), \dots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ given by

$$Y_t = f(X_t) + \varepsilon_t, \qquad 1 \le t \le T,$$

where the $X_t \in \mathcal{X}$ are either i.i.d. random variables (random design) or fixed elements (fixed design), denoted in both cases by capital letters in this paragraph, and where the $\varepsilon_t$ are i.i.d. square-integrable real random variables with zero mean (conditionally on the $X_t$ if the design is random). The goal of the forecaster is to construct an estimator $\hat{f}_T : \mathcal{X} \to \mathbb{R}$ of the unknown regression function $f : \mathcal{X} \to \mathbb{R}$ based on the sample $(X_t, Y_t)_{1 \le t \le T}$. Depending on the nature of the design, the performance of $\hat{f}_T$ is measured through its risk $R(\hat{f}_T)$:

$$R\big(\hat{f}_T\big) \triangleq \begin{cases} \displaystyle\int_{\mathcal{X}} \big(f(x) - \hat{f}_T(x)\big)^2\, P_X(dx) & \text{(random design)} \\[2mm] \displaystyle\frac{1}{T} \sum_{t=1}^T \big(f(X_t) - \hat{f}_T(X_t)\big)^2 & \text{(fixed design)}, \end{cases}$$

where $P_X$ denotes the common distribution of the $X_t$ if the design is random. With the above notations, and given a dictionary $\varphi = (\varphi_1, \dots, \varphi_d)$ of base forecasters $\varphi_j : \mathcal{X} \to \mathbb{R}$ as previously, typical examples of sparsity oracle inequalities take approximately the form

$$R\big(\hat{f}_T\big) \le C \inf_{u \in \mathbb{R}^d} \left\{ R(u \cdot \varphi) + \frac{\|u\|_0 \ln d + 1}{T} \right\} \tag{2}$$

in expectation or with high probability, for some constant $C \ge 1$. Thus, sparsity oracle inequalities are risk bounds involving a trade-off between the risk $R(u \cdot \varphi)$ and the number of non-zero coordinates $\|u\|_0$ of any comparison vector $u \in \mathbb{R}^d$.
In particular, they indicate that $\hat{f}_T$ has a small risk under a sparsity scenario, that is, if $f$ is well approximated by a sparse linear combination $u^* \cdot \varphi$ of the base forecasters $\varphi_j$, $1 \le j \le d$.

Sparsity oracle inequalities were first derived by Birgé and Massart (2001) via $\ell_0$-regularization methods (through model-selection arguments). Later works in this direction include, among many other papers, those of Birgé and Massart (2007), Abramovich et al. (2006), and Bunea et al. (2007a) in the regression model with fixed design and that of Bunea et al. (2004) in the random design case.

More recently, a large body of research has been dedicated to the analysis of $\ell_1$-regularization methods, which are convex and thus computationally tractable variants of $\ell_0$-regularization methods. A celebrated example is the Lasso estimator introduced by Tibshirani (1996) and Donoho and Johnstone (1994). Under some assumptions on the design matrix,² such methods have been proved to satisfy sparsity oracle inequalities of the form (2) (with $C = 1$ in the recent paper by Koltchinskii et al. 2011). A list of few references—but far from being comprehensive—includes the works of Bunea et al. (2007b), Candes and Tao (2007), van de Geer (2008), Bickel et al. (2009), Koltchinskii (2009a), Koltchinskii (2009b), Hebiri and van de Geer (2011), Koltchinskii et al. (2011), and Lounici et al. (2011). We refer the reader to the monograph by Bühlmann and van de Geer (2011) for a detailed account on $\ell_1$-regularization.

2. Despite their computational efficiency, the aforementioned $\ell_1$-regularized methods still suffer from a drawback: their $\ell_0$-oracle properties hold under rather restrictive assumptions on the design; namely, that the $\varphi_j$ should be nearly orthogonal (see the detailed discussion in van de Geer and Bühlmann 2009).

A third line of research recently focused on procedures based on exponential weighting. Such methods were proved to satisfy sharp sparsity oracle inequalities (i.e., with leading constant $C = 1$), either in the regression model with fixed design (Dalalyan and Tsybakov, 2007, 2008; Rigollet and Tsybakov, 2011; Alquier and Lounici, 2011) or in the regression model with random design (Dalalyan and Tsybakov, 2012a; Alquier and Lounici, 2011). These papers show that a trade-off can be reached between strong theoretical guarantees (as with $\ell_0$-regularization) and computational efficiency (as with $\ell_1$-regularization). They indeed propose aggregation algorithms which satisfy sparsity oracle inequalities under almost no assumption on the base forecasters $(\varphi_j)_j$, and which can be approximated numerically at a reasonable computational cost for large values of the ambient dimension $d$.

Our online-learning algorithm SeqSEW is inspired from a statistical method of Dalalyan and Tsybakov (2008, 2012a). Following the same lines as in Dalalyan and Tsybakov (2012b), it is possible to slightly adapt the statement of our algorithm to make it computationally tractable by means of Langevin Monte-Carlo approximation—without affecting its statistical properties. The technical details are however omitted in this paper, which only focuses on the theoretical guarantees of the algorithm SeqSEW.

1.3 Previous Works on Sparsity in the Framework of Individual Sequences

To the best of our knowledge, Corollary 2 and its refinements (Corollary 7 and Theorem 10) provide the first examples of sparsity regret bounds in the sense of (1). To comment on the optimality of such regret bounds and compare them to related results in the framework of individual sequences, note that (1) can be rewritten in the equivalent form:
For all $s \in \mathbb{N}$ and all $U > 0$,

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\substack{\|u\|_0 \le s \\ \|u\|_1 \le U}} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 \le (s+1)\, g_{T,d}\big(U, \|\varphi\|_\infty\big),$$

where $g$ grows at most logarithmically in $T$, $d$, $U$, and $\|\varphi\|_\infty$. When $s \ll T$, this upper bound matches (up to logarithmic factors) the lower bound of order $s \ln T$ that follows in a straightforward manner from Theorem 2 of Vovk (2001). Indeed, if $s \ll T$, $\mathcal{X} = \mathbb{R}^d$, and $\varphi_j(x) = x_j$, then for any forecaster, there is an individual sequence $(x_t, y_t)_{1 \le t \le T}$ such that the regret of this forecaster on $\{u \in \mathbb{R}^d : \|u\|_0 \le s \text{ and } \|u\|_1 \le d\}$ is bounded from below by a quantity of order $s \ln T$. Therefore, up to logarithmic factors, any algorithm satisfying a sparsity regret bound of the form (1) is minimax optimal on intersections of $\ell_0$-balls (of radii $s \ll T$) and $\ell_1$-balls. This is in particular the case for our algorithm SeqSEW, but this contrasts with related works discussed below.

Recent works in the field of online convex optimization addressed the sparsity issue in the online deterministic setting, but from a quite different angle. They focus on algorithms which output sparse linear combinations, while we are interested in algorithms whose regret is small under a sparsity scenario, that is, on $\ell_0$-balls of small radii. See, for example, the papers by Langford et al. (2009), Shalev-Shwartz and Tewari (2011), Xiao (2010), Duchi et al. (2010) and the references therein. All these articles focus on convex regularization. In the particular case of $\ell_1$-regularization under the square loss, the aforementioned works propose algorithms which predict as a sparse linear combination $\hat{y}_t = \hat{u}_t \cdot \varphi(x_t)$ of the base forecasts (i.e., $\|\hat{u}_t\|_0$ is small), while no such guarantee can be proved for our algorithm SeqSEW. However they prove bounds on the $\ell_1$-regularized regret of the form

$$\sum_{t=1}^T \Big( (y_t - \hat{u}_t \cdot x_t)^2 + \lambda \|\hat{u}_t\|_1 \Big) \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \Big( (y_t - u \cdot x_t)^2 + \lambda \|u\|_1 \Big) + \widetilde{\Delta}_{T,d}(u) \right\}, \tag{3}$$

for some regret term $\widetilde{\Delta}_{T,d}(u)$ which is suboptimal on intersections of $\ell_0$- and $\ell_1$-balls, as explained below. The truncated gradient algorithm of Langford et al. (2009, Corollary 4.1) satisfies such a regret bound³ with $\widetilde{\Delta}_{T,d}(u)$ at least of order $\|\varphi\|_\infty \sqrt{dT}$ when the base forecasts $\varphi_j(x_t)$ are dense in the sense that $\max_{1 \le t \le T} \sum_{j=1}^d \varphi_j^2(x_t) \approx d \|\varphi\|_\infty^2$. This regret bound grows as a power of $d$—and not logarithmically in $d$—as is expected for sparsity regret bounds (recall that we are interested in the case when $d \gg T$).

3. The bound stated in Langford et al. (2009, Corollary 4.1) differs from (3) in that the constant before the infimum is equal to $C = 1/(1 - 2 c_d^2 \eta)$, where $c_d^2 \approx \max_{1 \le t \le T} \sum_{j=1}^d \varphi_j^2(x_t) \le d \|\varphi\|_\infty^2$, and where a reasonable choice for $\eta$ can easily be seen to be $\eta \approx 1/\sqrt{2 c_d^2 T}$. If the base forecasts $\varphi_j(x_t)$ are dense in the sense that $c_d^2 \approx d \|\varphi\|_\infty^2$, then we have $C \approx 1 + \sqrt{2 c_d^2 / T}$, which yields a regret bound with leading constant 1 as in (3) and with $\widetilde{\Delta}_{T,d}(u)$ at least of order $\sqrt{c_d^2 T} \approx \|\varphi\|_\infty \sqrt{dT}$.

The three other papers mentioned above do prove (some) regret bounds with a logarithmic dependence in $d$, but these bounds do not have the dependence in $\|u\|_1$ and $T$ we are looking for. For $p - 1 \approx 1/(\ln d)$, the $p$-norm RDA method of Xiao (2010) and the algorithm SMIDAS of Shalev-Shwartz and Tewari (2011)—the latter being a particular case of the algorithm COMID of Duchi et al.
(2010) specialized to the $p$-norm divergence—satisfy regret bounds of the above form (3) with $\widetilde{\Delta}_{T,d}(u) \approx \mu \|u\|_1 \sqrt{T \ln d}$, for some gradient-based constant $\mu$. Therefore, in all three cases, the function $\widetilde{\Delta}$ grows at least linearly in $\|u\|_1$ and as $\sqrt{T}$. This is in contrast with the logarithmic dependence in $\|u\|_1$ and the fast rate $O(\ln T)$ we are looking for and prove, for example, in Corollary 2.

Note that the suboptimality of the aforementioned algorithms is specific to the goal we are pursuing, that is, prediction on $\ell_0$-balls (intersected with $\ell_1$-balls). On the contrary, the rate $\|u\|_1 \sqrt{T \ln d}$ is more suited and actually nearly optimal for learning on $\ell_1$-balls (see Gerchinovitz and Yu 2011). Moreover, the predictions output by our algorithm SeqSEW are not necessarily sparse linear combinations of the base forecasts. A question left open is thus whether it is possible to design an algorithm which both outputs sparse linear combinations (which is statistically useful and sometimes essential for computational issues) and satisfies a sparsity regret bound of the form (1).

1.4 PAC-Bayesian Analysis in the Framework of Individual Sequences

To derive our sparsity regret bounds, we follow a PAC-Bayesian approach combined with the choice of a sparsity-favoring prior. We do not have the space to review the PAC-Bayesian literature in the stochastic setting and only refer the reader to Catoni (2004) for a thorough introduction to the subject. As for the online deterministic setting, PAC-Bayesian-type inequalities were proved in the framework of prediction with expert advice, for example, by Freund et al. (1997) and Kivinen and Warmuth (1999), or in the same setting as ours with a Gaussian prior by Vovk (2001). More recently, Audibert (2009) proved a PAC-Bayesian result on individual sequences for general losses and prediction sets. The latter result relies on a unifying assumption called the online variance inequality, which holds true, for example, when the loss function is exp-concave. In the present paper, we only focus on the particular case of the square loss. We first use Theorem 4.6 of Audibert (2009) to derive a non-adaptive sparsity regret bound. We then provide an adaptive online PAC-Bayesian inequality to automatically adapt to the unknown range of the observations $\max_{1 \le t \le T} |y_t|$.

1.5 Application to the Stochastic Setting When the Noise Level Is Unknown

In Section 4.1 we apply an automatically-tuned version of our algorithm SeqSEW on i.i.d. data. Thanks to the standard online-to-batch conversion, our sparsity regret bounds—of deterministic nature—imply a sparsity oracle inequality of the same flavor as a result of Dalalyan and Tsybakov (2012a).
However, our risk bound holds on the whole $\mathbb{R}^d$ space instead of $\ell_1$-balls of finite radii, which solves one question left open by Dalalyan and Tsybakov (2012a, Section 4.2). Besides, and more importantly, our algorithm does not need the a priori knowledge of the variance of the noise when the latter is Gaussian. Since the noise level is unknown in practice, adapting to it is important. This solves a second question raised by Dalalyan and Tsybakov (2012a, Section 5.1, Remark 6).

1.6 Outline of the Paper

This paper is organized as follows. In Section 2 we describe our main (deterministic) setting as well as our main notations. In Section 3 we prove the aforementioned sparsity regret bounds for our algorithm SeqSEW, first when the forecaster has access to some a priori knowledge on the observations (Sections 3.1 and 3.2), and then when no a priori information is available (Section 3.3), which yields a fully automatic algorithm. In Section 4 we apply the algorithm SeqSEW to two stochastic settings: the regression model with random design (Section 4.1) and the regression model with fixed design (Section 4.2). Finally the appendix contains some proofs and several useful inequalities.

2. Setting and Notations

The main setting considered in this paper is an instance of the game of prediction with expert advice called prediction with side information (under the square loss) or, more simply, online linear regression (see Cesa-Bianchi and Lugosi 2006, Chapter 11 for an introduction to this setting). The data sequence $(x_t, y_t)_{t \ge 1}$ at hand is deterministic and arbitrary and we look for theoretical guarantees that hold for every individual sequence. We give in Figure 1 a detailed description of our online protocol.

Parameters: input data set $\mathcal{X}$, base forecasters $\varphi = (\varphi_1, \dots, \varphi_d)$ with $\varphi_j : \mathcal{X} \to \mathbb{R}$, $1 \le j \le d$.

Initial step: the environment chooses a sequence of observations $(y_t)_{t \ge 1}$ in $\mathbb{R}$ and a sequence of input data $(x_t)_{t \ge 1}$ in $\mathcal{X}$, but the forecaster has no access to them.

At each time round $t \in \mathbb{N}^* \triangleq \{1, 2, \dots\}$:
1. The environment reveals the input data $x_t \in \mathcal{X}$.
2. The forecaster chooses a prediction $\hat{y}_t \in \mathbb{R}$ (possibly as a linear combination of the $\varphi_j(x_t)$, but this is not necessary).
3. The environment reveals the observation $y_t \in \mathbb{R}$.
4. Each linear forecaster $u \cdot \varphi \triangleq \sum_{j=1}^d u_j \varphi_j$, $u \in \mathbb{R}^d$, incurs the loss $\big(y_t - u \cdot \varphi(x_t)\big)^2$ and the forecaster incurs the loss $(y_t - \hat{y}_t)^2$.

Figure 1: The online linear regression setting.

Note that our online protocol is described as if the environment were oblivious to the forecaster's predictions. Actually, since we only consider deterministic forecasters, all regret bounds of this paper also hold when $(x_t)_{t \ge 1}$ and $(y_t)_{t \ge 1}$ are chosen by an adversarial environment. Two stochastic batch settings are also considered later in this paper: see Section 4.1 for the regression model with random design, and Section 4.2 for the regression model with fixed design.
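To make the protocol of Figure 1 concrete, here is a minimal Python sketch (not part of the paper; all function and class names are hypothetical): it runs an arbitrary forecaster through the loop of Figure 1 and computes its regret against a fixed linear comparison vector $u$.

```python
import numpy as np

def run_protocol(xs, ys, phi, forecaster):
    """Run the online linear regression protocol of Figure 1 and
    return the forecaster's cumulative square loss."""
    total_loss = 0.0
    for x_t, y_t in zip(xs, ys):
        phi_x = np.asarray(phi(x_t))          # step 1: x_t is revealed
        y_hat = forecaster.predict(phi_x)     # step 2: the forecaster predicts
        total_loss += (y_t - y_hat) ** 2      # steps 3-4: y_t revealed, loss incurred
        forecaster.update(phi_x, y_t)
    return total_loss

def regret(xs, ys, phi, forecaster_loss, u):
    """Regret with respect to the fixed linear forecaster u . phi."""
    oracle_loss = sum((y - np.asarray(u) @ np.asarray(phi(x))) ** 2
                      for x, y in zip(xs, ys))
    return forecaster_loss - oracle_loss
```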
2.1 Some Notations

We now define some notations. We write $\mathbb{N} \triangleq \{0, 1, \dots\}$ and $e \triangleq \exp(1)$. Vectors in $\mathbb{R}^d$ will be denoted by bold letters. For all $u, v \in \mathbb{R}^d$, the standard inner product in $\mathbb{R}^d$ between $u = (u_1, \dots, u_d)$ and $v = (v_1, \dots, v_d)$ will be denoted by $u \cdot v \triangleq \sum_{j=1}^d u_j v_j$; the $\ell_0$-, $\ell_1$-, and $\ell_2$-norms of $u = (u_1, \dots, u_d)$ are respectively defined by

$$\|u\|_0 \triangleq \sum_{j=1}^d \mathbb{I}_{\{u_j \ne 0\}} = \big|\{j : u_j \ne 0\}\big|, \qquad \|u\|_1 \triangleq \sum_{j=1}^d |u_j|, \qquad \|u\|_2 \triangleq \left( \sum_{j=1}^d u_j^2 \right)^{1/2}.$$

The set of all probability distributions on a set $\Theta$ (endowed with some $\sigma$-algebra, for example, the Borel $\sigma$-algebra when $\Theta = \mathbb{R}^d$) will be denoted by $\mathcal{M}_1^+(\Theta)$. For all $\rho, \pi \in \mathcal{M}_1^+(\Theta)$, the Kullback-Leibler divergence between $\rho$ and $\pi$ is defined by

$$\mathcal{K}(\rho, \pi) \triangleq \begin{cases} \displaystyle\int_{\mathbb{R}^d} \ln\left( \frac{d\rho}{d\pi} \right) d\rho & \text{if } \rho \text{ is absolutely continuous with respect to } \pi; \\[2mm] +\infty & \text{otherwise}, \end{cases}$$

where $\frac{d\rho}{d\pi}$ denotes the Radon-Nikodym derivative of $\rho$ with respect to $\pi$.

For all $x \in \mathbb{R}$ and $B \ge 0$, we denote by $\lceil x \rceil$ the smallest integer larger than or equal to $x$, and by $[x]_B$ its thresholded (or clipped) value:

$$[x]_B \triangleq \begin{cases} -B & \text{if } x < -B; \\ x & \text{if } -B \le x \le B; \\ B & \text{if } x > B. \end{cases}$$

Finally, we will use the (natural) conventions $1/0 = +\infty$, $(+\infty) \times 0 = 0$, and $0 \ln(1 + U/0) = 0$ for all $U \ge 0$. Any sum $\sum_{s=1}^0 a_s$ indexed from 1 up to 0 is by convention equal to 0.

3. Sparsity Regret Bounds for Individual Sequences

In this section we prove sparsity regret bounds for different variants of our algorithm SeqSEW. We first assume in Section 3.1 that the forecaster has access in advance to a bound $B_y$ on the observations $|y_t|$ and a bound $B_\Phi$ on the trace of the empirical Gram matrix. We then remove these requirements one by one in Sections 3.2 and 3.3.

3.1 Known Bounds $B_y$ on the Observations and $B_\Phi$ on the Trace of the Empirical Gram Matrix

To simplify the analysis, we first assume that, at the beginning of the game, the number of rounds $T$ is known to the forecaster and that he has access to a bound $B_y$ on all the observations $y_1, \dots, y_T$ and to a bound $B_\Phi$ on the trace of the empirical Gram matrix, that is,

$$y_1, \dots, y_T \in [-B_y, B_y] \qquad \text{and} \qquad \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi.$$

The first version of the algorithm studied in this paper is defined in Figure 2 (adaptive variants will be introduced later). We name it SeqSEW for it is a variant of the Sparse Exponential Weighting algorithm introduced in the stochastic setting by Dalalyan and Tsybakov (2007, 2008) which is tailored for the prediction of individual sequences. The choice of the heavy-tailed prior $\pi_\tau$ is due to Dalalyan and Tsybakov (2007). The role of heavy-tailed priors to tackle the sparsity issue was already pointed out earlier; see, for example, the discussion by Seeger (2008, Section 2.1). In high dimension, such heavy-tailed priors favor sparsity: sampling from these prior distributions (or posterior distributions based on them) typically results in approximately sparse vectors, that is, vectors having most coordinates almost equal to zero and the few remaining ones with quite large values.

Parameters: threshold $B > 0$, inverse temperature $\eta > 0$, and prior scale $\tau > 0$, with which we associate the sparsity prior $\pi_\tau \in \mathcal{M}_1^+(\mathbb{R}^d)$ defined by

$$\pi_\tau(du) \triangleq \prod_{j=1}^d \frac{(3/\tau)\, du_j}{2\big(1 + |u_j|/\tau\big)^4}.$$

Initialization: $p_1 \triangleq \pi_\tau$.

At each time round $t \ge 1$,

1. Get the input data $x_t$ and predict as

$$\hat{y}_t \triangleq \int_{\mathbb{R}^d} \big[u \cdot \varphi(x_t)\big]_B\, p_t(du),$$

where the clipping operator $[\,\cdot\,]_B$ is defined in Section 2.

2. Get the observation $y_t$ and compute the posterior distribution $p_{t+1} \in \mathcal{M}_1^+(\mathbb{R}^d)$ as

$$p_{t+1}(du) \triangleq \frac{\exp\left( -\eta \sum_{s=1}^t \big( y_s - [u \cdot \varphi(x_s)]_B \big)^2 \right)}{W_{t+1}}\, \pi_\tau(du), \quad \text{where} \quad W_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\left( -\eta \sum_{s=1}^t \big( y_s - [v \cdot \varphi(x_s)]_B \big)^2 \right) \pi_\tau(dv).$$

Figure 2: The algorithm SeqSEW$^{B,\eta}_\tau$.
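The prediction $\hat{y}_t$ and the posterior $p_{t+1}$ in Figure 2 are integrals over $\mathbb{R}^d$ with no closed form; the paper points to Langevin Monte-Carlo (Dalalyan and Tsybakov, 2012b) for a faithful approximation. As a rough illustration only—a sketch that replaces the Langevin scheme with a plain Monte-Carlo approximation over a fixed sample drawn from $\pi_\tau$, with all names hypothetical—the following code shows the three ingredients of the algorithm: exact sampling from the heavy-tailed prior $\pi_\tau$ (by inverting the CDF of each coordinate), clipping to $[-B, B]$, and exponential weighting of cumulative clipped square losses. It implements the `predict`/`update` interface of the protocol sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(tau, d, m):
    """Draw m i.i.d. vectors from pi_tau: each coordinate has density
    (3/tau) / (2 * (1 + |u|/tau)^4), so |u_j|/tau has CDF 1 - (1 + x)^(-3)."""
    v = rng.random((m, d))
    radius = tau * ((1.0 - v) ** (-1.0 / 3.0) - 1.0)   # inverse-CDF sampling
    signs = rng.choice([-1.0, 1.0], size=(m, d))
    return signs * radius

class SeqSEW:
    """Monte-Carlo caricature of SeqSEW^{B,eta}_tau (Figure 2)."""
    def __init__(self, tau, B, eta, d, m=10_000):
        self.B, self.eta = B, eta
        self.U = sample_prior(tau, d, m)   # fixed sample standing in for pi_tau
        self.cum_loss = np.zeros(m)        # cumulative clipped square losses

    def predict(self, phi_x):
        # y_hat_t = integral of [u . phi(x_t)]_B under the posterior p_t,
        # approximated by an exponentially weighted average over the sample
        preds = np.clip(self.U @ phi_x, -self.B, self.B)
        w = np.exp(-self.eta * (self.cum_loss - self.cum_loss.min()))
        return float(np.average(preds, weights=w))

    def update(self, phi_x, y):
        preds = np.clip(self.U @ phi_x, -self.B, self.B)
        self.cum_loss += (y - preds) ** 2  # losses defining p_{t+1}
```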
Proposition 1. Assume that, for a known constant $B_y > 0$, the $(x_1, y_1), \dots, (x_T, y_T)$ are such that $y_1, \dots, y_T \in [-B_y, B_y]$. Then, for all $B \ge B_y$, all $\eta \le 1/(8B^2)$, and all $\tau > 0$, the algorithm SeqSEW$^{B,\eta}_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + \frac{4}{\eta} \|u\|_0 \ln\left(1 + \frac{\|u\|_1}{\|u\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t). \tag{4}$$

Corollary 2. Assume that, for some known constants $B_y > 0$ and $B_\Phi > 0$, the $(x_1, y_1), \dots, (x_T, y_T)$ are such that $y_1, \dots, y_T \in [-B_y, B_y]$ and $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$. Then, when used with $B = B_y$, $\eta = \frac{1}{8 B_y^2}$, and $\tau = \sqrt{\frac{16 B_y^2}{B_\Phi}}$, the algorithm SeqSEW$^{B,\eta}_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + 32 B_y^2 \|u\|_0 \ln\left(1 + \frac{\sqrt{B_\Phi}\, \|u\|_1}{4 B_y \|u\|_0}\right) \right\} + 16 B_y^2. \tag{5}$$

Note that, if $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$ is finite, then the last corollary provides a sparsity regret bound in the sense of (1). Indeed, in this case, we can take $B_\Phi = dT \|\varphi\|_\infty^2$, which yields a regret bound proportional to $\|u\|_0$ and that grows logarithmically in $d$, $T$, $\|u\|_1$, and $\|\varphi\|_\infty$.

To prove Proposition 1, we first need the following deterministic PAC-Bayesian inequality which is at the core of our analysis. It is a straightforward consequence of Theorem 4.6 of Audibert (2009) when applied to the square loss. An adaptive variant of this inequality will be provided in Section 3.2.

Lemma 3. Assume that for some known constant $B_y > 0$, we have $y_1, \dots, y_T \in [-B_y, B_y]$. For all $\tau > 0$, if the algorithm SeqSEW$^{B,\eta}_\tau$ is used with $B \ge B_y$ and $\eta \le 1/(8B^2)$, then

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - [u \cdot \varphi(x_t)]_B\big)^2\, \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\} \tag{6}$$

$$\le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2\, \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\}. \tag{7}$$

Proof (of Lemma 3). Inequality (6) is a straightforward consequence of Theorem 4.6 of Audibert (2009) when applied to the square loss, the set of prediction functions $\mathcal{G} \triangleq \big\{ x \mapsto [u \cdot \varphi(x)]_B : u \in \mathbb{R}^d \big\}$, and the prior⁴ $\tilde{\pi}_\tau$ on $\mathcal{G}$ induced by the prior $\pi_\tau$ on $\mathbb{R}^d$ via the mapping $u \in \mathbb{R}^d \mapsto [u \cdot \varphi(\cdot)]_B \in \mathcal{G}$. To apply the aforementioned theorem, recall from Cesa-Bianchi and Lugosi (2006, Section 3.3) that the square loss is $1/(8B^2)$-exp-concave on $[-B, B]$ and thus $\eta$-exp-concave,⁵ since $\eta \le 1/(8B^2)$ by assumption.
Therefore, by Theorem 4.6 of Audibert (2009) with the variance function $\delta_\eta \equiv 0$ (see the comments following Remark 4.1 therein), we get

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mu \in \mathcal{M}_1^+(\mathcal{G})} \left\{ \int_{\mathcal{G}} \sum_{t=1}^T \big(y_t - g(x_t)\big)^2\, \mu(dg) + \frac{\mathcal{K}(\mu, \tilde{\pi}_\tau)}{\eta} \right\} \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - [u \cdot \varphi(x_t)]_B\big)^2\, \rho(du) + \frac{\mathcal{K}(\tilde{\rho}, \tilde{\pi}_\tau)}{\eta} \right\},$$

where the last inequality follows by restricting the infimum over $\mathcal{M}_1^+(\mathcal{G})$ to the subset $\{\tilde{\rho} : \rho \in \mathcal{M}_1^+(\mathbb{R}^d)\} \subset \mathcal{M}_1^+(\mathcal{G})$, where $\tilde{\rho} \in \mathcal{M}_1^+(\mathcal{G})$ denotes the probability distribution induced by $\rho \in \mathcal{M}_1^+(\mathbb{R}^d)$ via the mapping $u \in \mathbb{R}^d \mapsto [u \cdot \varphi(\cdot)]_B \in \mathcal{G}$. Inequality (6) then follows from the fact that for all $\rho \in \mathcal{M}_1^+(\mathbb{R}^d)$, we have $\mathcal{K}(\tilde{\rho}, \tilde{\pi}_\tau) \le \mathcal{K}(\rho, \pi_\tau)$ by joint convexity of $\mathcal{K}(\cdot, \cdot)$.

As for Inequality (7), it follows from (6) by noting that

$$\forall y \in [-B, B],\ \forall x \in \mathbb{R}, \qquad \big| y - [x]_B \big| \le |y - x|.$$

Therefore, truncation to $[-B, B]$ can only improve prediction under the square loss if the observations are $[-B, B]$-valued, which is the case here since by assumption $y_t \in [-B_y, B_y] \subset [-B, B]$ for all $t = 1, \dots, T$. ∎

4. The set $\mathcal{G}$ is endowed with the $\sigma$-algebra generated by all the coordinate mappings $g \in \mathcal{G} \mapsto g(x) \in \mathbb{R}$, $x \in \mathcal{X}$ (where $\mathbb{R}$ is endowed with its Borel $\sigma$-algebra).
5. This means that for all $y \in [-B, B]$, the function $x \mapsto \exp\big(-\eta (y - x)^2\big)$ is concave on $[-B, B]$.

Remark 4. As can be seen from the previous proof, if the prior $\pi_\tau$ used to define the algorithm SeqSEW was replaced with any prior $\pi \in \mathcal{M}_1^+(\mathbb{R}^d)$, then Lemma 3 would still hold true with $\pi$ instead of $\pi_\tau$. This fact is natural from a PAC-Bayesian perspective (see, e.g., Catoni, 2004; Dalalyan and Tsybakov, 2008). We only—but crucially—use the particular shape of the sparsity-favoring prior $\pi_\tau$ to derive Proposition 1 from the PAC-Bayesian bound (7).

Proof (of Proposition 1). Our proof mimics the proof of Theorem 5 by Dalalyan and Tsybakov (2008). We thus only write the outline of the proof and stress the minor changes that are needed to derive Inequality (4). The key technical tools provided by Dalalyan and Tsybakov (2008) are reproduced in Appendix B.2 for the convenience of the reader.

Let $u^* \in \mathbb{R}^d$. Since $B \ge B_y$ and $\eta \le 1/(8B^2)$, we can apply Lemma 3 and get

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2\, \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\} \le \underbrace{\int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2\, \rho_{u^*, \tau}(du)}_{(1)} + \underbrace{\frac{\mathcal{K}(\rho_{u^*, \tau}, \pi_\tau)}{\eta}}_{(2)}. \tag{8}$$

In the last inequality, $\rho_{u^*, \tau}$ is taken as the translate of $\pi_\tau$ at $u^*$, namely,

$$\rho_{u^*, \tau}(du) \triangleq \frac{d\pi_\tau}{du}(u - u^*)\, du = \prod_{j=1}^d \frac{(3/\tau)\, du_j}{2\big(1 + |u_j - u^*_j|/\tau\big)^4}.$$

The two terms (1) and (2) can be upper bounded as in the proof of Theorem 5 by Dalalyan and Tsybakov (2008). By a symmetry argument recalled in Lemma 22 (Appendix B.2), the first term (1) can be rewritten as

$$\int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2\, \rho_{u^*, \tau}(du) = \sum_{t=1}^T \big(y_t - u^* \cdot \varphi(x_t)\big)^2 + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t). \tag{9}$$

As for the term (2), we have, as is recalled in Lemma 23,

$$\frac{\mathcal{K}(\rho_{u^*, \tau}, \pi_\tau)}{\eta} \le \frac{4}{\eta} \|u^*\|_0 \ln\left(1 + \frac{\|u^*\|_1}{\|u^*\|_0\, \tau}\right). \tag{10}$$

Combining (8), (9), and (10), which all hold for all $u^* \in \mathbb{R}^d$, we get Inequality (4). ∎
Proof (of Corollary 2). Applying Proposition 1, we have, since $B \ge B_y$ and $\eta \le 1/(8B^2)$,

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + \frac{4}{\eta} \|u\|_0 \ln\left(1 + \frac{\|u\|_1}{\|u\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + \frac{4}{\eta} \|u\|_0 \ln\left(1 + \frac{\|u\|_1}{\|u\|_0\, \tau}\right) \right\} + \tau^2 B_\Phi,$$

since $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$ by assumption. The particular (and nearly optimal) choices of $\eta$ and $\tau$ given in the statement of the corollary then yield the desired inequality (5). ∎

We end this subsection with a natural question about approximate sparsity: Proposition 1 ensures a low regret with respect to sparse linear combinations $u \cdot \varphi$, but what can be said for approximately sparse linear combinations, that is, predictors of the form $u \cdot \varphi$ where $u \in \mathbb{R}^d$ is very close to a sparse vector? As can be seen from the proof of Lemma 23 in Appendix B.2, the sparsity-related term $\frac{4}{\eta} \|u\|_0 \ln\big(1 + \frac{\|u\|_1}{\|u\|_0 \tau}\big)$ in the regret bound of Proposition 1 can actually be replaced with the smaller (and continuous) term

$$\frac{4}{\eta} \sum_{j=1}^d \ln\big(1 + |u_j|/\tau\big).$$

The last term is always smaller than the former and guarantees that the regret is small with respect to any approximately sparse vector $u \in \mathbb{R}^d$.

3.2 Unknown Bound $B_y$ on the Observations but Known Bound $B_\Phi$ on the Trace of the Empirical Gram Matrix

In the previous section, to prove the upper bounds stated in Lemma 3 and Proposition 1, we assumed that the forecaster had access to a bound $B_y$ on the observations $|y_t|$ and to a bound $B_\Phi$ on the trace of the empirical Gram matrix. In this section, we remove the first requirement and prove a sparsity regret bound for a variant of the algorithm SeqSEW$^{B,\eta}_\tau$ which is adaptive to the unknown bound $B_y = \max_{1 \le t \le T} |y_t|$; see Proposition 5 and Remark 6 below.

For this purpose we consider the algorithm of Figure 3, which we call SeqSEW$^*_\tau$ thereafter. It differs from SeqSEW$^{B,\eta}_\tau$ defined in the previous section in that the threshold $B$ and the inverse temperature $\eta$ are now allowed to vary over time and are chosen at each time round as a function of the data available to the forecaster. The idea of truncating the base forecasts was used many times in the past; see, for example, the work of Vovk (2001) in the online linear regression setting, that of Györfi et al. (2002, Chapter 10) for the regression problem with random design, and the papers of Györfi and Ottucsák (2007) and Biau et al. (2010) for sequential prediction of unbounded time series under the square loss. A key ingredient in the present paper is to perform truncation with respect to a data-driven threshold.

Proposition 5. For all $\tau > 0$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + 32 B_{T+1}^2 \|u\|_0 \ln\left(1 + \frac{\|u\|_1}{\|u\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 5 B_{T+1}^2,$$

where $B_{T+1}^2 \triangleq \max_{1 \le t \le T} y_t^2$.

Remark 6. In view of Proposition 1, the algorithm SeqSEW$^*_\tau$ satisfies a sparsity regret bound which is adaptive to the unknown bound $B_y = \max_{1 \le t \le T} |y_t|$. The price for the automatic tuning with respect to $B_y$ consists only of the additive term $5 B_{T+1}^2 = 5 B_y^2$.
Parameter: prior scale $\tau > 0$, with which we associate the sparsity prior $\pi_\tau \in \mathcal{M}_1^+(\mathbb{R}^d)$ defined by

$$\pi_\tau(du) \triangleq \prod_{j=1}^d \frac{(3/\tau)\, du_j}{2\big(1 + |u_j|/\tau\big)^4}.$$

Initialization: $B_1 \triangleq 0$, $\eta_1 \triangleq +\infty$, and $p_1 \triangleq \pi_\tau$.

At each time round $t \ge 1$,

1. Get the input data $x_t$ and predict as

$$\hat{y}_t \triangleq \int_{\mathbb{R}^d} \big[u \cdot \varphi(x_t)\big]_{B_t}\, p_t(du),$$

where the clipping operator $[\,\cdot\,]_B$ is defined in Section 2.

2. Get the observation $y_t$ and update:

• the threshold $B_{t+1} \triangleq \max_{1 \le s \le t} |y_s|$,
• the inverse temperature $\eta_{t+1} \triangleq 1/\big(8 B_{t+1}^2\big)$,
• and the posterior distribution $p_{t+1} \in \mathcal{M}_1^+(\mathbb{R}^d)$ as

$$p_{t+1}(du) \triangleq \frac{\exp\left( -\eta_{t+1} \sum_{s=1}^t \big( y_s - [u \cdot \varphi(x_s)]_{B_s} \big)^2 \right)}{W_{t+1}}\, \pi_\tau(du), \quad \text{where} \quad W_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\left( -\eta_{t+1} \sum_{s=1}^t \big( y_s - [v \cdot \varphi(x_s)]_{B_s} \big)^2 \right) \pi_\tau(dv).$$

Figure 3: The algorithm SeqSEW$^*_\tau$.
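The only change with respect to Figure 2 is the data-driven tuning $B_{t+1} = \max_{1 \le s \le t} |y_s|$ and $\eta_{t+1} = 1/(8 B_{t+1}^2)$, with each past loss computed with the threshold $B_s$ in force at round $s$. In the hypothetical Monte-Carlo sketch introduced after Figure 2, this amounts to the following subclass (again an illustration under the same assumptions, not the paper's exact procedure):

```python
class SeqSEWStar(SeqSEW):
    """Monte-Carlo caricature of SeqSEW*_tau (Figure 3): the threshold and the
    inverse temperature are tuned online from the past observations."""
    def __init__(self, tau, d, m=10_000):
        super().__init__(tau=tau, B=0.0, eta=float("inf"), d=d, m=m)

    def predict(self, phi_x):
        if self.B == 0.0:            # B_t = 0: every clipped forecast equals 0
            return 0.0
        return super().predict(phi_x)

    def update(self, phi_x, y):
        super().update(phi_x, y)     # the loss at round t uses the threshold B_t
        self.B = max(self.B, abs(y))                 # B_{t+1} = max_{s<=t} |y_s|
        if self.B > 0.0:
            self.eta = 1.0 / (8.0 * self.B ** 2)     # eta_{t+1} = 1/(8 B_{t+1}^2)
```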
As in the previous section, several corollaries can be derived from Proposition 5. If the forecaster has access beforehand to a quantity $B_\Phi > 0$ such that $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$, then a suboptimal but reasonable choice of $\tau$ is given by $\tau = 1/\sqrt{B_\Phi}$; see Corollary 7 below. The simpler tuning $\tau = 1/\sqrt{dT}$ of Corollary 8 will be useful in the stochastic batch setting (cf. Section 4).⁶ The proofs of the next corollaries are immediate.

6. The tuning $\tau = 1/\sqrt{dT}$ only uses the knowledge of $T$, which is known by the forecaster in the stochastic batch setting. In that framework, another simple and easy-to-analyse tuning is given by $\tau = 1/(\|\varphi\|_\infty \sqrt{dT})$—which corresponds to $B_\Phi = dT \|\varphi\|_\infty^2$—but it requires that $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$ be finite. Note that the last tuning satisfies the scale-invariant property pointed out by Dalalyan and Tsybakov (2012a, Remark 4).

Corollary 7. Assume that, for a known constant $B_\Phi > 0$, the $(x_1, y_1), \dots, (x_T, y_T)$ are such that $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$. Then, when used with $\tau = 1/\sqrt{B_\Phi}$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + 32 B_{T+1}^2 \|u\|_0 \ln\left(1 + \frac{\sqrt{B_\Phi}\, \|u\|_1}{\|u\|_0}\right) \right\} + 5 B_{T+1}^2 + 1,$$

where $B_{T+1}^2 \triangleq \max_{1 \le t \le T} y_t^2$.

Corollary 8. Assume that $T$ is known to the forecaster at the beginning of the prediction game. Then, when used with $\tau = 1/\sqrt{dT}$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2 + 32 B_{T+1}^2 \|u\|_0 \ln\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 5 B_{T+1}^2,$$

where $B_{T+1}^2 \triangleq \max_{1 \le t \le T} y_t^2$.

As in the previous section, to prove Proposition 5, we first need a key PAC-Bayesian inequality. The next lemma is an adaptive variant of Lemma 3.

Lemma 9. For all $\tau > 0$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - [u \cdot \varphi(x_t)]_{B_t}\big)^2\, \rho(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho, \pi_\tau) \right\} + 4 B_{T+1}^2 \tag{11}$$

$$\le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big(y_t - u \cdot \varphi(x_t)\big)^2\, \rho(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho, \pi_\tau) \right\} + 5 B_{T+1}^2, \tag{12}$$

where $B_{T+1}^2 \triangleq \max_{1 \le t \le T} y_t^2$.

Proof (of Lemma 9). The proof is based on arguments that are similar to those underlying Lemma 3, except that we now need to deal with $B$ and $\eta$ changing over time. In the same spirit as in Auer et al. (2002), Cesa-Bianchi et al. (2007) and Györfi and Ottucsák (2007), our analysis relies on the control of $(\ln W_{t+1})/\eta_{t+1} - (\ln W_t)/\eta_t$, where $W_1 \triangleq 1$ and, for all $t \ge 2$,

$$W_t \triangleq \int_{\mathbb{R}^d} \exp\left( -\eta_t \sum_{s=1}^{t-1} \big( y_s - [u \cdot \varphi(x_s)]_{B_s} \big)^2 \right) \pi_\tau(du).$$

Before controlling $(\ln W_{t+1})/\eta_{t+1} - (\ln W_t)/\eta_t$, we first need a little comment. Note that all $\eta_t$'s such that $\eta_t = +\infty$ (i.e., $B_t = 0$) can be replaced with any finite value without changing the predictions of the algorithm (since the sum $\sum_{s=1}^{t-1}$ above equals zero). Therefore, we assume in the sequel that $(\eta_t)_{t \ge 1}$ is a non-increasing sequence of finite positive real numbers.

First step: On the one hand, we have

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} = \frac{1}{\eta_{T+1}} \ln \int_{\mathbb{R}^d} \exp\left( -\eta_{T+1} \sum_{t=1}^T \big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 \right) \pi_\tau(du) - \frac{1}{\eta_1} \ln 1 = -\inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2\, \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta_{T+1}} \right\}, \tag{13}$$

where the last equality follows from a convex duality argument for the Kullback-Leibler divergence (cf., e.g., Catoni 2004, p. 159) which we recall in Proposition 21 in Appendix B.1.

Second step: On the other hand, we can rewrite $(\ln W_{T+1})/\eta_{T+1} - (\ln W_1)/\eta_1$ as a telescoping sum and get

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} = \sum_{t=1}^T \left( \frac{\ln W_{t+1}}{\eta_{t+1}} - \frac{\ln W_t}{\eta_t} \right) = \sum_{t=1}^T \left( \underbrace{\frac{\ln W_{t+1}}{\eta_{t+1}} - \frac{\ln W'_{t+1}}{\eta_t}}_{(1)} + \underbrace{\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t}}_{(2)} \right), \tag{14}$$

where $W'_{t+1}$ is obtained from $W_{t+1}$ by replacing $\eta_{t+1}$ with $\eta_t$; namely,

$$W'_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\left( -\eta_t \sum_{s=1}^t \big( y_s - [u \cdot \varphi(x_s)]_{B_s} \big)^2 \right) \pi_\tau(du).$$

Let $t \in \{1, \dots, T\}$. The first term (1) is non-positive by Jensen's inequality (note that $x \mapsto x^{\eta_{t+1}/\eta_t}$ is concave on $\mathbb{R}^*_+$ since $\eta_{t+1} \le \eta_t$ by construction). As for the second term (2), by definition of $W'_{t+1}$,

$$\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t} = \frac{1}{\eta_t} \ln \int_{\mathbb{R}^d} \exp\left( -\eta_t \big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 \right) \frac{\exp\left( -\eta_t \sum_{s=1}^{t-1} \big( y_s - [u \cdot \varphi(x_s)]_{B_s} \big)^2 \right)}{W_t}\, \pi_\tau(du) = \frac{1}{\eta_t} \ln \int_{\mathbb{R}^d} \exp\left( -\eta_t \big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 \right) p_t(du), \tag{15}$$

where (15) follows by definition of $p_t$. The next paragraphs are dedicated to upper bounding the last integral above.

First note that this is straightforward in the particular case where $y_t \in [-B_t, B_t]$. Indeed, by definition of $\eta_t \triangleq 1/(8 B_t^2)$ and by the fact that the square loss is $1/(8 B_t^2)$-exp-concave on $[-B_t, B_t]$ (as in Lemma 3),⁷ we get from Jensen's inequality that

$$\int_{\mathbb{R}^d} e^{-\eta_t \left( y_t - [u \cdot \varphi(x_t)]_{B_t} \right)^2}\, p_t(du) \le \exp\left( -\eta_t \left( y_t - \int_{\mathbb{R}^d} [u \cdot \varphi(x_t)]_{B_t}\, p_t(du) \right)^2 \right) = e^{-\eta_t (y_t - \hat{y}_t)^2},$$

where the last equality follows by definition of $\hat{y}_t$. Taking the logarithms of both sides of the last inequality and dividing by $\eta_t$, we can see that the quantity on the right-hand side of (15) is bounded from above by $-(y_t - \hat{y}_t)^2$.

7. To be more exact, we assigned some arbitrary finite value to $\eta_t$ when $B_t = 0$. However, in this case, the square loss is of course $\eta_t$-exp-concave on $[-B_t, B_t] = \{0\}$ whatever the value of $\eta_t$.

In the general case, we cannot assume that $y_t \in [-B_t, B_t]$, since it may happen that $|y_t| > \max_{1 \le s \le t-1} |y_s| \triangleq B_t$. As shown below, we can still use the exp-concavity of the square loss if we replace $y_t$ with its clipped version $[y_t]_{B_t}$.
More precisely, setting $\hat{y}_{t,u} \triangleq [u \cdot \varphi(x_t)]_{B_t}$ for all $u \in \mathbb{R}^d$, the square loss appearing in the right-hand side of (15) equals

$$\big( y_t - \hat{y}_{t,u} \big)^2 = \big( [y_t]_{B_t} - \hat{y}_{t,u} \big)^2 + \big( y_t - [y_t]_{B_t} \big)^2 + 2 \big( y_t - [y_t]_{B_t} \big) \big( [y_t]_{B_t} - \hat{y}_{t,u} \big) = \big( [y_t]_{B_t} - \hat{y}_{t,u} \big)^2 + \big( y_t - [y_t]_{B_t} \big)^2 + 2 \big( y_t - [y_t]_{B_t} \big) \big( [y_t]_{B_t} - \hat{y}_t \big) + c_{t,u}, \tag{16}$$

where we set

$$c_{t,u} \triangleq 2 \big( y_t - [y_t]_{B_t} \big) \big( \hat{y}_t - \hat{y}_{t,u} \big) \ge -4 B_t \big| y_t - [y_t]_{B_t} \big| \ge -4 B_t (B_{t+1} - B_t), \tag{17}$$

where the last two inequalities follow from the property $\hat{y}_t, \hat{y}_{t,u} \in [-B_t, B_t]$ (by construction) and from the elementary⁸ yet useful upper bound $\big| y_t - [y_t]_{B_t} \big| \le B_{t+1} - B_t$. Combining (16) with the lower bound (17) yields that, for all $u \in \mathbb{R}^d$,

$$\big( y_t - \hat{y}_{t,u} \big)^2 \ge \big( [y_t]_{B_t} - \hat{y}_{t,u} \big)^2 + C_t, \tag{18}$$

where we set $C_t \triangleq \big( y_t - [y_t]_{B_t} \big)^2 + 2 \big( y_t - [y_t]_{B_t} \big) \big( [y_t]_{B_t} - \hat{y}_t \big) - 4 B_t (B_{t+1} - B_t)$.

8. To see why this is true, it suffices to rewrite $[y_t]_{B_t}$ in the three cases $y_t < -B_t$, $|y_t| \le B_t$, or $y_t > B_t$.

We can now continue the upper bounding of $(1/\eta_t) \ln(W'_{t+1}/W_t)$. Indeed, substituting the lower bound (18) into (15), we get that

$$\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t} \le \frac{1}{\eta_t} \ln\left( \int_{\mathbb{R}^d} \exp\left( -\eta_t \big( [y_t]_{B_t} - \hat{y}_{t,u} \big)^2 \right) p_t(du) \right) - C_t$$
$$\le \frac{1}{\eta_t} \ln\left[ \exp\left( -\eta_t \left( [y_t]_{B_t} - \int_{\mathbb{R}^d} \hat{y}_{t,u}\, p_t(du) \right)^2 \right) \right] - C_t \tag{19}$$
$$= -\big( [y_t]_{B_t} - \hat{y}_t \big)^2 - C_t \tag{20}$$
$$= -\big( y_t - \hat{y}_t \big)^2 + 4 B_t (B_{t+1} - B_t), \tag{21}$$

where (19) follows by Jensen's inequality (recall that $\eta_t \triangleq 1/(8 B_t^2)$ and that the square loss is $1/(8 B_t^2)$-exp-concave on $[-B_t, B_t]$),⁹ where (20) is entailed by definition of $\hat{y}_{t,u}$ and $\hat{y}_t$, and where (21) follows by definition of $C_t$ above and by elementary calculations.

9. Same remark as in Footnote 7.

Summing (21) over $t = 1, \dots, T$ and using the upper bound $B_t (B_{t+1} - B_t) \le B_{t+1}^2 - B_t^2$, Equation (14) yields

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} \le -\sum_{t=1}^T (y_t - \hat{y}_t)^2 + 4 \sum_{t=1}^T \big( B_{t+1}^2 - B_t^2 \big) = -\sum_{t=1}^T (y_t - \hat{y}_t)^2 + 4 B_{T+1}^2. \tag{22}$$

Third step: Putting (13) and (22) together, we get the PAC-Bayesian inequality

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2\, \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta_{T+1}} \right\} + 4 B_{T+1}^2,$$

which yields (11) since $\eta_{T+1} \triangleq 1/(8 B_{T+1}^2)$ by definition.¹⁰

10. If $B_{T+1} = 0$, then $y_t = \hat{y}_t = 0$ for all $1 \le t \le T$, which immediately yields (11).

The other PAC-Bayesian inequality (12), which is stated for non-truncated base forecasts, is a direct consequence of (11) and of the following two arguments: for all $u \in \mathbb{R}^d$ and all $t = 1, \dots, T$,

$$\big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 \le \big( y_t - u \cdot \varphi(x_t) \big)^2 + (B_{t+1} - B_t)^2 \tag{23}$$

and

$$\sum_{t=1}^T (B_{t+1} - B_t)^2 \le B_{T+1}^2. \tag{24}$$

Complement: proof of (23) and (24). To see why (23) is true, we can distinguish between several cases.
First note that this inequality is straightforward when $|y_t| \le B_t$ (indeed, in this case, clipping $u \cdot \varphi(x_t)$ to $[-B_t, B_t]$ can only improve prediction). We can thus assume that $|y_t| > B_t$, or just¹¹ that $y_t > B_t$. In this case, we can distinguish between three sub-cases:

• if $u \cdot \varphi(x_t) < -B_t$, then clipping improves prediction since $y_t > B_t$;
• if $-B_t \le u \cdot \varphi(x_t) \le B_t$, then the clipping operator $[\,\cdot\,]_{B_t}$ has no effect on $u \cdot \varphi(x_t)$;
• if $u \cdot \varphi(x_t) > B_t$, then $[u \cdot \varphi(x_t)]_{B_t} = B_t$, so that $\big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 = (B_{t+1} - B_t)^2$ since $B_{t+1} = y_t$.

Therefore, in all three sub-cases described above, we have

$$\big( y_t - [u \cdot \varphi(x_t)]_{B_t} \big)^2 \le \max\left\{ \big( y_t - u \cdot \varphi(x_t) \big)^2,\ (B_{t+1} - B_t)^2 \right\},$$

which concludes the proof of (23).

11. If $y_t < -B_t$, it suffices to apply (23) with $-y_t$ and $-u$.

As for (24), it follows from the inequality

$$\sum_{t=1}^T (B_{t+1} - B_t)^2 \le \sup_{\substack{\Delta_1, \dots, \Delta_T \ge 0 \\ \sum_{t=1}^T \Delta_t = B_{T+1}}} \left\{ \sum_{t=1}^T \Delta_t^2 \right\} = B_{T+1}^2,$$

where the last equality is entailed by convexity of the function $(\Delta_1, \dots, \Delta_T) \mapsto \sum_{t=1}^T \Delta_t^2$ on the polytope $\big\{ (\Delta_1, \dots, \Delta_T) \in \mathbb{R}_+^T : \sum_{t=1}^T \Delta_t = B_{T+1} \big\}$. This concludes the proof. ∎

Proof (of Proposition 5). The proof follows exactly the same lines as in Proposition 1 except that we apply Lemma 9 instead of Lemma 3. Indeed, using Lemma 9 and restricting the infimum to the $\rho_{u^*, \tau}$, $u^* \in \mathbb{R}^d$ (cf. (40)), we get that

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u^* \in \mathbb{R}^d} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \big( y_t - u \cdot \varphi(x_t) \big)^2\, \rho_{u^*, \tau}(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho_{u^*, \tau}, \pi_\tau) \right\} + 5 B_{T+1}^2 \le \inf_{u^* \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big( y_t - u^* \cdot \varphi(x_t) \big)^2 + 32 B_{T+1}^2 \|u^*\|_0 \ln\left(1 + \frac{\|u^*\|_1}{\|u^*\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 5 B_{T+1}^2,$$

where the last inequality follows from Lemmas 22 and 23. ∎

3.3 A Fully Automatic Algorithm

In the previous section, we proved that adaptation to $B_y$ was possible. If we also no longer assume that a bound $B_\Phi$ on the trace of the empirical Gram matrix is available to the forecaster, then we can use a doubling trick on the nondecreasing quantity

$$\gamma_t \triangleq \ln\left( 1 + \sqrt{\sum_{s=1}^t \sum_{j=1}^d \varphi_j^2(x_s)} \right)$$

and repeatedly run the algorithm SeqSEW$^*_\tau$ of the previous section for rapidly-decreasing values of $\tau$. This yields a sparsity regret bound with extra logarithmic multiplicative factors as compared to Proposition 5, but which holds for a fully automatic algorithm; see Theorem 10 below.

More formally, our algorithm SeqSEW$^{**}$ is defined as follows. The set of all time rounds $t = 1, 2, \dots$ is partitioned into regimes $r = 0, 1, \dots$ whose final time instances $t_r$ are data-driven. Let $t_{-1} \triangleq 0$ by convention. We call regime $r$, $r = 0, 1, \dots$, the sequence of time rounds $(t_{r-1} + 1, \dots, t_r)$, where $t_r$ is the first date $t \ge t_{r-1} + 1$ such that $\gamma_t > 2^r$. At the beginning of regime $r$, we restart the algorithm SeqSEW$^*_\tau$ defined in Figure 3 with the parameter $\tau$ set to $\tau_r \triangleq 1/\big( \exp(2^r) - 1 \big)$. In particular, on each regime $r$, the current instance of the algorithm SeqSEW$^*_{\tau_r}$ only uses the past observations $y_s$, $s \in \{t_{r-1} + 1, \dots, t - 1\}$, to perform the online truncation and to tune the inverse temperature parameter. Therefore, the algorithm SeqSEW$^{**}$ is fully automatic.
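Continuing the hypothetical Monte-Carlo sketch from the previous sections, the regime structure of SeqSEW$^{**}$ can be written as a wrapper that monitors $\gamma_t$ and restarts a fresh instance of SeqSEW$^*_{\tau_r}$ with $\tau_r = 1/(\exp(2^r) - 1)$ whenever $\gamma_t$ exceeds $2^r$:

```python
class SeqSEWStarStar:
    """Caricature of the fully automatic SeqSEW**: restart SeqSEW*_{tau_r}
    each time gamma_t = ln(1 + sqrt(sum_{s<=t} sum_j phi_j^2(x_s))) exceeds 2^r."""
    def __init__(self, d, m=10_000):
        self.d, self.m, self.r = d, m, 0
        self.trace = 0.0                      # running trace of the Gram matrix
        self._restart()

    def _restart(self):
        tau_r = 1.0 / (np.exp(2.0 ** self.r) - 1.0)
        self.inner = SeqSEWStar(tau=tau_r, d=self.d, m=self.m)

    def predict(self, phi_x):
        return self.inner.predict(phi_x)

    def update(self, phi_x, y):
        self.inner.update(phi_x, y)
        self.trace += float(phi_x @ phi_x)    # adds sum_j phi_j^2(x_t)
        gamma_t = np.log(1.0 + np.sqrt(self.trace))
        if gamma_t > 2.0 ** self.r:           # regime r ends at this round
            while gamma_t > 2.0 ** self.r:
                self.r += 1
            self._restart()                   # forget the past, rescale the prior
```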
Theorem 10. Without requiring any preliminary knowledge at the beginning of the prediction game, the algorithm SeqSEW$^{**}$ satisfies, for all $T \ge 1$ and all $(x_1, y_1), \dots, (x_T, y_T) \in \mathcal{X} \times \mathbb{R}$,

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \big( y_t - u \cdot \varphi(x_t) \big)^2 + 128 \Big( \max_{1 \le t \le T} y_t^2 \Big) \|u\|_0 \ln\left( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \right) + 32 \Big( \max_{1 \le t \le T} y_t^2 \Big) A_T \|u\|_0 \ln\left( 1 + \frac{\|u\|_1}{\|u\|_0} \right) \right\} + \Big( 1 + 9 \max_{1 \le t \le T} y_t^2 \Big) A_T,$$

where $A_T \triangleq 2 + \log_2 \ln\left( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \right)$.

Though the algorithm SeqSEW$^{**}$ is fully automatic, two possible improvements could be addressed in the future. From a theoretical viewpoint, can we construct a fully automatic algorithm with a bound similar to Theorem 10 but without the extra logarithmic factor $A_T$? From a practical viewpoint, is it possible to perform the adaptation to $B_\Phi$ without restarting the algorithm repeatedly (just like we did for $B_y$)? A smoother time-varying tuning $(\tau_t)_{t \ge 2}$ might enable to answer both questions. This would very probably come at the price of a more involved analysis (e.g., if we adapt the PAC-Bayesian bound of Lemma 9, then a third approximation term would appear in (14) since $\pi_{\tau_t}$ changes over time).

Proof sketch (of Theorem 10). The proof relies on the use of Corollary 7 on all regimes $r$ visited up to time $T$. More precisely, note that $\gamma_{t_r - 1} \le 2^r$ by definition of $t_r$ (except maybe in the trivial case when $t_r = t_{r-1} + 1$), which entails that

$$\sum_{t = t_{r-1} + 1}^{t_r - 1} \sum_{j=1}^d \varphi_j^2(x_t) \le \big( e^{2^r} - 1 \big)^2 \triangleq B_{\Phi, r}.$$

Since we tuned the instance of the algorithm SeqSEW$^*_\tau$ on regime $r$ with $\tau = \tau_r \triangleq 1/\sqrt{B_{\Phi, r}}$, we can apply Corollary 7 on regime $r$ for all $r$. Summing the corresponding regret bounds over $r$ then yields the desired result. See Appendix A.1 for a detailed proof. ∎

Theorem 10 yields the following corollary. It upper bounds the regret of the algorithm SeqSEW$^{**}$ uniformly over all $u \in \mathbb{R}^d$ such that $\|u\|_0 \le s$ and $\|u\|_1 \le U$, where the sparsity level $s \in \mathbb{N}$ and the $\ell_1$-diameter $U > 0$ are both unknown to the forecaster. The proof is postponed to Appendix A.1.

Corollary 11. Fix $s \in \mathbb{N}$ and $U > 0$. Then, for all $T \ge 1$ and all $(x_1, y_1), \dots, (x_T, y_T) \in \mathcal{X} \times \mathbb{R}$, the regret of the algorithm SeqSEW$^{**}$ on $\big\{ u \in \mathbb{R}^d : \|u\|_0 \le s \text{ and } \|u\|_1 \le U \big\}$ is bounded by

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\substack{\|u\|_0 \le s \\ \|u\|_1 \le U}} \sum_{t=1}^T \big( y_t - u \cdot \varphi(x_t) \big)^2 \le 128 \Big( \max_{1 \le t \le T} y_t^2 \Big) s \ln\left( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \right) + 32 \Big( \max_{1 \le t \le T} y_t^2 \Big) A_T\, s \ln\left( 1 + \frac{U}{s} \right) + \Big( 1 + 9 \max_{1 \le t \le T} y_t^2 \Big) A_T,$$

where $A_T \triangleq 2 + \log_2 \ln\left( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \right)$.

4. Adaptivity to the Unknown Variance in the Stochastic Setting

In this section, we apply the online algorithm SeqSEW$^*_\tau$ of Section 3.2 to two related stochastic settings: the regression model with random design (Section 4.1) and the regression model with fixed design (Section 4.2). The sparsity regret bounds proved for this algorithm on individual sequences imply in both settings sparsity oracle inequalities with leading constant 1. These risk bounds are of the same flavor as in Dalalyan and Tsybakov (2008, 2012a) but they are adaptive (up to a logarithmic factor) to the unknown variance $\sigma^2$ of the noise if the latter is Gaussian.
In particular, we solve two questions left open by Dalalyan and Tsybakov (2012a) in the random design case. In the sequel, just like in the online deterministic setting, we assume that the forecaster has access to a dictionary $\varphi = (\varphi_1, \dots, \varphi_d)$ of measurable base forecasters $\varphi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \dots, d$.

4.1 Regression Model With Random Design

In this section we apply the algorithm SeqSEW$^*_\tau$ to the regression model with random design. In this batch setting the forecaster is given at the beginning of the game $T$ independent random copies $(X_1, Y_1), \dots, (X_T, Y_T)$ of $(X, Y) \in \mathcal{X} \times \mathbb{R}$ whose common distribution is unknown. We assume thereafter that $\mathbb{E}[Y^2] < \infty$; the goal of the forecaster is to estimate the regression function $f : \mathcal{X} \to \mathbb{R}$ defined by $f(x) \triangleq \mathbb{E}[Y \mid X = x]$ for all $x \in \mathcal{X}$. Setting $\varepsilon_t \triangleq Y_t - f(X_t)$ for all $t = 1, \dots, T$, note that

$$Y_t = f(X_t) + \varepsilon_t, \qquad 1 \le t \le T,$$

and that the pairs $(X_1, \varepsilon_1), \dots, (X_T, \varepsilon_T)$ are i.i.d. and such that $\mathbb{E}[\varepsilon_1^2] < \infty$ and $\mathbb{E}[\varepsilon_1 \mid X_1] = 0$ almost surely. In the sequel, we denote the distribution of $X$ by $P_X$ and we set, for all measurable functions $h : \mathcal{X} \to \mathbb{R}$,

$$\|h\|_{L^2} \triangleq \left( \int_{\mathcal{X}} h(x)^2\, P_X(dx) \right)^{1/2} = \Big( \mathbb{E}\big[ h(X)^2 \big] \Big)^{1/2}.$$

Next we construct an estimator $\hat{f}_T : \mathcal{X} \to \mathbb{R}$ based on the sample $(X_1, Y_1), \dots, (X_T, Y_T)$ that satisfies a sparsity oracle inequality, that is, its expected $L^2$-risk $\mathbb{E}\big[ \|f - \hat{f}_T\|_{L^2}^2 \big]$ is almost as small as the smallest $L^2$-risk $\|f - u \cdot \varphi\|_{L^2}^2$, $u \in \mathbb{R}^d$, up to some additive term proportional to $\|u\|_0$.

4.1.1 Algorithm and Main Result

Even if the whole sample $(X_1, Y_1), \dots, (X_T, Y_T)$ is available at the beginning of the prediction game, we treat it in a sequential fashion. We run the algorithm SeqSEW$^*_\tau$ of Section 3.2 from time 1 to time $T$ with $\tau = 1/\sqrt{dT}$ (note that $T$ is known in this setting). Using the standard online-to-batch conversion (see, e.g., Littlestone 1989; Cesa-Bianchi et al. 2004), we define our estimator $\hat{f}_T : \mathcal{X} \to \mathbb{R}$ as the uniform average

$$\hat{f}_T \triangleq \frac{1}{T} \sum_{t=1}^T \tilde{f}_t \tag{25}$$

of the estimators $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ sequentially built by the algorithm SeqSEW$^*_\tau$ as

$$\tilde{f}_t(x) \triangleq \int_{\mathbb{R}^d} \big[ u \cdot \varphi(x) \big]_{B_t}\, p_t(du). \tag{26}$$

Note that, contrary to much prior work from the statistics community such as those of Catoni (2004), Bunea and Nobel (2008) and Dalalyan and Tsybakov (2012a), the estimators $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ are tuned online. Therefore, $\hat{f}_T$ does not depend on any prior knowledge on the unknown distribution of the $(X_t, Y_t)$, $1 \le t \le T$, such as the unknown variance $\mathbb{E}\big[ (Y - f(X))^2 \big]$ of the noise, the norms $\|\varphi_j\|_\infty$, or the norms $\|f - \varphi_j\|_\infty$ (actually, the functions $\varphi_j$ and $f - \varphi_j$ do not even need to be bounded in $\ell_\infty$-norm). In this respect, this work improves on that of Bunea and Nobel (2008), who tune their online forecasters as a function of $\|f\|_\infty$ and $\sup_{u \in \mathcal{U}} \|u \cdot \varphi\|_\infty$, where $\mathcal{U} \subset \mathbb{R}^d$ is a bounded comparison set.¹² Their technique is not appropriate when $\|f\|_\infty$ is unknown and it cannot be extended to the case where $\mathcal{U} = \mathbb{R}^d$ (since $\sup_{u \in \mathbb{R}^d} \|u \cdot \varphi\|_\infty = +\infty$ if $\varphi \ne 0$). The major technical difference is that we truncate the base forecasts $u \cdot \varphi(X_t)$ instead of truncating the observations $Y_t$. In particular, this enables us to aggregate the base forecasters $u \cdot \varphi$ for all $u \in \mathbb{R}^d$, that is, over the whole $\mathbb{R}^d$ space.

12. Bunea and Nobel (2008) study the case where $\mathcal{U}$ is the (scaled) simplex in $\mathbb{R}^d$ or the set of its vertices.
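In the hypothetical Monte-Carlo caricature used throughout (not the paper's implementation), the online-to-batch conversion (25)-(26) amounts to freezing the state of SeqSEW$^*_\tau$ before each round and averaging the corresponding frozen predictors:

```python
def online_to_batch(X, Y, phi, d, m=10_000):
    """Sketch of the estimator (25)-(26): run SeqSEW*_tau with tau = 1/sqrt(dT)
    over the sample and return f_hat_T, the uniform average of f_tilde_1..f_tilde_T."""
    T = len(Y)
    alg = SeqSEWStar(tau=1.0 / np.sqrt(d * T), d=d, m=m)
    snapshots = []
    for x_t, y_t in zip(X, Y):
        # f_tilde_t depends on the data up to time t-1 only: freeze it now
        snapshots.append((alg.cum_loss.copy(), alg.B, alg.eta))
        alg.update(np.asarray(phi(x_t)), y_t)

    def f_hat_T(x):
        phi_x = np.asarray(phi(x))
        values = []
        for cum_loss, B, eta in snapshots:
            if B == 0.0:                 # f_tilde_t is identically zero
                values.append(0.0)
                continue
            clipped = np.clip(alg.U @ phi_x, -B, B)
            w = np.exp(-eta * (cum_loss - cum_loss.min()))
            values.append(float(np.average(clipped, weights=w)))
        return float(np.mean(values))    # uniform average (25)

    return f_hat_T
```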
The next sparsity oracle inequality is the main result of this section. It follows from the deterministic regret bound of Corollary 8 and from Jensen's inequality. Two corollaries are derived later.

Theorem 12 Assume that $(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$, where $\mathbb{E}[Y^2] < +\infty$ and $\|\varphi_j\|_{L^2}^2 \triangleq \mathbb{E}[\varphi_j(X)^2] < +\infty$ for all $j = 1, \ldots, d$. Then, the estimator $\hat f_T$ defined in (25)-(26) satisfies

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + 5\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}.$$

Proof sketch (of Theorem 12) By Corollary 8 and by definition of $\tilde f_t$ above and $\hat y_t \triangleq \tilde f_t(X_t)$ in Figure 3, we have, almost surely,

$$\sum_{t=1}^{T} \bigl(Y_t - \tilde f_t(X_t)\bigr)^2 \leq \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^{T} \bigl(Y_t - u \cdot \varphi(X_t)\bigr)^2 + 32 \Bigl(\max_{1 \leq t \leq T} Y_t^2\Bigr) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(X_t) + 5 \max_{1 \leq t \leq T} Y_t^2.$$

Taking the expectations of both sides and applying Jensen's inequality yields the desired result. For a detailed proof, see Appendix A.2.

Theorem 12 above can be used under several assumptions on the distribution of the output $Y$. In all cases, it suffices to upper bound the amplitude $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]$. We present below a general corollary and explain later why our fully automatic procedure $\hat f_T$ solves two questions left open by Dalalyan and Tsybakov (2012a) (see Corollary 14 below).

4.1.2 A General Corollary

The next sparsity oracle inequality follows from Theorem 12 and from the upper bounds on $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]$ entailed by Lemmas 24-26 in Appendix B. The proof is postponed to Appendix A.2.

Corollary 13 Assume that $(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$, that $\sup_{1 \leq j \leq d} \|\varphi_j\|_{L^2}^2 < +\infty$, that $\mathbb{E}|Y| < +\infty$, and that one of the following assumptions holds on the distribution of $\Delta Y \triangleq Y - \mathbb{E}[Y]$:

• $(\mathrm{BD}(B))$: $|\Delta Y| \leq B$ almost surely for a given constant $B > 0$;
• $(\mathrm{SG}(\sigma^2))$: $\Delta Y$ is subgaussian with variance factor $\sigma^2 > 0$, that is, $\mathbb{E}\bigl[e^{\lambda \Delta Y}\bigr] \leq e^{\lambda^2 \sigma^2/2}$ for all $\lambda \in \mathbb{R}$;
• $(\mathrm{BEM}(\alpha, M))$: $\Delta Y$ has a bounded exponential moment, that is, $\mathbb{E}\bigl[e^{\alpha |\Delta Y|}\bigr] \leq M$ for some given constants $\alpha > 0$ and $M > 0$;
• $(\mathrm{BM}(\alpha, M))$: $\Delta Y$ has a bounded moment, that is, $\mathbb{E}\bigl[|\Delta Y|^\alpha\bigr] \leq M$ for some given constants $\alpha \geq 2$ and $M > 0$.

Then, the estimator $\hat f_T$ defined above satisfies

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 64 \left(\frac{\mathbb{E}[Y]^2}{T} + \psi_T\right) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + 10 \left(\frac{\mathbb{E}[Y]^2}{T} + \psi_T\right),$$

where

$$\psi_T \triangleq \frac{1}{T}\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(Y_t - \mathbb{E}[Y_t]\bigr)^2\Bigr] \leq \begin{cases} B^2 / T & \text{under Assumption } (\mathrm{BD}(B)), \\ 2\sigma^2 \ln(2eT) / T & \text{under Assumption } (\mathrm{SG}(\sigma^2)), \\ \ln^2\bigl((M+e)T\bigr) / (\alpha^2 T) & \text{under Assumption } (\mathrm{BEM}(\alpha, M)), \\ M^{2/\alpha} / T^{(\alpha-2)/\alpha} & \text{under Assumption } (\mathrm{BM}(\alpha, M)). \end{cases}$$
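As a rough numerical illustration of the four regimes for $\psi_T$ in Corollary 13, the snippet below evaluates the corresponding upper bounds for arbitrary illustrative constants ($B$, $\sigma^2$, $\alpha$, $M$ are chosen for the example, not taken from the paper); the bounded-moment rate is visibly slower than the other three.

import math

# Illustrative comparison of the four upper bounds on psi_T from
# Corollary 13 (the constants below are arbitrary).
B, sigma2, alpha, M = 1.0, 1.0, 4.0, 1.0

for T in (10, 100, 1000, 10000):
    bd = B ** 2 / T                                           # bounded case
    sg = 2 * sigma2 * math.log(2 * math.e * T) / T            # subgaussian
    bem = math.log((M + math.e) * T) ** 2 / (alpha ** 2 * T)  # exp. moment
    bm = M ** (2 / alpha) / T ** ((alpha - 2) / alpha)        # bounded moment
    print(f"T={T:6d}  BD={bd:.2e}  SG={sg:.2e}  BEM={bem:.2e}  BM={bm:.2e}")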
Several comments can be made about Corollary 13.

We first stress that, if $T \geq 2$, then the two "bias" terms $\mathbb{E}[Y]^2 / T$ above can be avoided, at the price of at most a multiplicative factor $2T/(T-1) \leq 4$. This can be achieved via a slightly more sophisticated online clipping; see Remark 19 in Appendix A.2.

Second, under the assumptions $(\mathrm{BD}(B))$, $(\mathrm{SG}(\sigma^2))$, or $(\mathrm{BEM}(\alpha, M))$, the key quantity $\psi_T$ is respectively of the order of $1/T$, $\ln(T)/T$, and $\ln^2(T)/T$. Up to a logarithmic factor, this corresponds to the classical fast rate of convergence $1/T$ obtained in the random design setting for different aggregation problems (see, e.g., Catoni 1999; Juditsky et al. 2008; Audibert 2009 for model-selection-type aggregation and Dalalyan and Tsybakov 2012a for linear aggregation). We were able to get similar rates, with, however, a fully automatic procedure, since our online algorithm SeqSEW$^*_\tau$ is well suited to bounded individual sequences with an unknown bound. More precisely, the finite i.i.d. sequence $Y_1, \ldots, Y_T$ is almost surely uniformly bounded by the random bound $\max_{1 \leq t \leq T} |Y_t|$. Our individual-sequence techniques adapt sequentially to this random bound, yielding a regret bound that scales as $\max_{1 \leq t \leq T} Y_t^2$. As a result, the risk bounds obtained after the online-to-batch conversion scale as $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr] / T$. If the distribution of the output $Y$ has sufficiently light tails (which includes the quite general bounded-exponential-moment assumption), then we can recover the fast rate of convergence $1/T$ up to a logarithmic factor.

We note that there is still a question left open for heavy-tailed output distributions. For example, under the bounded moment assumption $(\mathrm{BM}(\alpha, M))$, the rate $T^{-(\alpha-2)/\alpha}$ that we proved does not match the faster rate $T^{-\alpha/(\alpha+2)}$ obtained by Juditsky et al. (2008) and Audibert (2009) under a similar assumption. Their methods use some preliminary knowledge on the output distribution (such as the exponent $\alpha$). Thus, obtaining the same rate with a procedure tuned in an automatic fashion, just like our method $\hat f_T$, is a challenging task. For this purpose, a different tuning of $\eta_t$ or a more sophisticated online truncation might be necessary.

Third, several variations on the assumptions are possible. First note that several classical assumptions on $Y$ expressed in terms of $f(X)$ and $\varepsilon \triangleq Y - f(X)$ are either particular cases of the above corollary or can be treated similarly. Indeed, each of the four assumptions above on $\Delta Y \triangleq Y - \mathbb{E}[Y] = f(X) - \mathbb{E}[f(X)] + \varepsilon$ is satisfied as soon as both the distribution of $f(X) - \mathbb{E}[f(X)]$ and the conditional distribution of $\varepsilon$ (conditionally on $X$) satisfy the same type of assumption. For example, if $f(X) - \mathbb{E}[f(X)]$ is subgaussian with variance factor $\sigma_X^2$ and if $\varepsilon$ is subgaussian conditionally on $X$ with a variance factor uniformly bounded by a constant $\sigma_\varepsilon^2$, then $\Delta Y$ is subgaussian with variance factor $\sigma_X^2 + \sigma_\varepsilon^2$ (see also Remark 20 in Appendix A.2 to avoid conditioning). The assumptions on $f(X) - \mathbb{E}[f(X)]$ and $\varepsilon$ can also be mixed together. For instance, as explained in Remark 20 in Appendix A.2, under the classical assumptions

$$\|f\|_\infty < +\infty \quad \text{and} \quad \mathbb{E}\bigl[e^{\alpha|\varepsilon|} \,\big|\, X\bigr] \leq M \text{ a.s.} \quad (27)$$

or

$$\|f\|_\infty < +\infty \quad \text{and} \quad \mathbb{E}\bigl[e^{\lambda\varepsilon} \,\big|\, X\bigr] \leq e^{\lambda^2\sigma^2/2} \text{ a.s., for all } \lambda \in \mathbb{R}, \quad (28)$$
the key quantity $\psi_T$ in the corollary can be bounded from above by

$$\psi_T \leq \begin{cases} \dfrac{8\|f\|_\infty^2}{T} + \dfrac{2\ln^2\bigl((M+e)T\bigr)}{\alpha^2 T} & \text{under the set of assumptions (27),} \\[2mm] \dfrac{8\|f\|_\infty^2}{T} + \dfrac{4\sigma^2 \ln(2eT)}{T} & \text{under the set of assumptions (28).} \end{cases}$$

In particular, under the set of assumptions (28), our procedure $\hat f_T$ solves two questions left open by Dalalyan and Tsybakov (2012a). We discuss below our contributions in this particular case.

4.1.3 Questions Left Open by Dalalyan and Tsybakov

In this subsection we focus on the case when the set of assumptions (28) holds true. Namely, the regression function $f$ is bounded (by an unknown constant) and the noise $\varepsilon \triangleq Y - f(X)$ is subgaussian conditionally on $X$ with an unknown variance factor $\sigma^2 > 0$. An important particular case is when $\|f\|_\infty < +\infty$ and the noise $\varepsilon$ is independent of $X$ and normally distributed $\mathcal{N}(0, \sigma^2)$.

Under the set of assumptions (28), the two terms $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]$ of Theorem 12 can be upper bounded in a simpler and slightly tighter way than in the proof of Corollary 13 (we only use the inequality $(x+y)^2 \leq 2x^2 + 2y^2$ once, instead of twice). This yields the following sparsity oracle inequality.

Corollary 14 Assume that $(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$ such that the set of assumptions (28) above holds true. Then, the estimator $\hat f_T$ defined in (25)-(26) satisfies

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 64 \bigl(\|f\|_\infty^2 + 2\sigma^2 \ln(2eT)\bigr) \frac{\|u\|_0}{T} \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + \frac{10}{T} \bigl(\|f\|_\infty^2 + 2\sigma^2 \ln(2eT)\bigr).$$

Proof We apply Theorem 12 and bound $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]$ from above. By the elementary inequality $(x+y)^2 \leq 2x^2 + 2y^2$ for all $x, y \in \mathbb{R}$, we get

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Y_t^2\Bigr] = \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(f(X_t) + \varepsilon_t\bigr)^2\Bigr] \leq 2\Bigl(\|f\|_\infty^2 + \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \varepsilon_t^2\Bigr]\Bigr) \leq 2\bigl(\|f\|_\infty^2 + 2\sigma^2 \ln(2eT)\bigr),$$

where the last inequality follows from Lemma 24 in Appendix B and from the fact that, for all $1 \leq t \leq T$ and all $\lambda \in \mathbb{R}$, we have $\mathbb{E}\bigl[e^{\lambda\varepsilon_t}\bigr] = \mathbb{E}\bigl[e^{\lambda\varepsilon}\bigr] = \mathbb{E}\bigl[\mathbb{E}\bigl[e^{\lambda\varepsilon} \,\big|\, X\bigr]\bigr] \leq e^{\lambda^2\sigma^2/2}$ by (28). (Note that the assumption of conditional subgaussianity in (28) is stronger than what we need, that is, subgaussianity without conditioning.) This concludes the proof.

The above bound is of the same order (up to a $\ln T$ factor) as the sparsity oracle inequality proved in Proposition 1 of Dalalyan and Tsybakov (2012a). For the sake of comparison we state below, with our notations (e.g., $\beta$ therein corresponds to $1/\eta$ in this paper), a straightforward consequence of this proposition, which follows by Jensen's inequality and the particular choice $\tau = \min\bigl\{1/\sqrt{dT},\, R/(4d)\bigr\}$. (Proposition 1 of Dalalyan and Tsybakov (2012a) may seem more general than Corollary 14 at first sight since it holds for all $\tau > 0$, but this is actually also the case for Corollary 14: the proof of the latter would have remained true had we replaced $\tau = 1/\sqrt{dT}$ with any value of $\tau > 0$ (see Proposition 5). We however chose the reasonable value $\tau = 1/\sqrt{dT}$ to make our algorithm parameter-free. As noted earlier, if $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \leq j \leq d} |\varphi_j(x)|$ is finite and known by the forecaster, another simple and easy-to-analyse tuning is given by $\tau = 1/(\|\varphi\|_\infty \sqrt{dT})$.)

Proposition 15 (A consequence of Prop. 1 of Dalalyan and Tsybakov 2012a) Assume that $\sup_{1 \leq j \leq d} \|\varphi_j\|_\infty < \infty$ and that the set of assumptions (28) above holds true. Then, for all $R > 0$ and all $\eta \leq \bar\eta(R) \triangleq \bigl(2\sigma^2 + 2\sup_{\|u\|_1 \leq R} \|u \cdot \varphi - f\|_\infty^2\bigr)^{-1}$, the mirror averaging aggregate $\hat f_T : \mathcal{X} \to \mathbb{R}$ defined by Dalalyan and Tsybakov (2012a, Equations (1) and (3)) with $\tau = \min\bigl\{1/\sqrt{dT},\, R/(4d)\bigr\}$ satisfies

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \leq \inf_{\|u\|_1 \leq R/2} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + \frac{4}{\eta}\, \frac{\|u\|_0}{T+1} \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1 + 2d}{\|u\|_0}\right) \right\} + \frac{4}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + \frac{1}{(T+1)\eta}.$$
We can now discuss the two questions left open by Dalalyan and Tsybakov (2012a).

Risk bound on the whole $\mathbb{R}^d$ space. Despite the similarity of the two bounds, the sparsity oracle inequality stated in Proposition 15 above only holds for vectors $u$ within an $\ell^1$-ball of finite radius $R/2$, while our bound holds over the whole $\mathbb{R}^d$ space. Moreover, the parameter $R$ above has to be chosen in advance, but it cannot be chosen too large since $1/\eta \geq 1/\bar\eta(R)$, which grows as $R^2$ when $R \to +\infty$ (if $\varphi \neq 0$). Dalalyan and Tsybakov (2012a, Section 4.2) thus asked whether it was possible to get a bound with $1/\eta < +\infty$ such that the infimum in Proposition 15 extends to the whole $\mathbb{R}^d$ space. Our results show that, thanks to data-driven truncation, the answer is positive.

Note that it is still possible to transform the bound of Proposition 15 into a bound over the whole $\mathbb{R}^d$ space if the parameter $R$ is chosen (illegally) as $R = 2\|u^*\|_1$ (or as a tight upper bound of the last quantity), where $u^* \in \mathbb{R}^d$ minimizes over $\mathbb{R}^d$ the regularized risk

$$\|f - u \cdot \varphi\|_{L^2}^2 + \frac{4}{\bar\eta(2\|u\|_1)}\, \frac{\|u\|_0}{T+1} \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1 + 2d}{\|u\|_0}\right) + \frac{4}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + \frac{1}{(T+1)\,\bar\eta(2\|u\|_1)}.$$

For instance, choosing $R = 2\|u^*\|_1$ and $\eta = \bar\eta(R)$, we get from Proposition 15 that the expected $L^2$-risk $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$ of the corresponding procedure is upper bounded by the infimum of the above regularized risk over all $u \in \mathbb{R}^d$. However, this parameter tuning is illegal since $\|u^*\|_1$ is not known in practice. On the contrary, thanks to data-driven truncation, the prior knowledge of $\|u^*\|_1$ is not required by our procedure.

Adaptivity to the unknown variance of the noise. The second open question, which was raised by Dalalyan and Tsybakov (2012a, Section 5.1, Remark 6), deals with the prior knowledge of the variance factor $\sigma^2$ of the noise. The latter is indeed required by their algorithm for the choice of the inverse temperature parameter $\eta$. Since the noise level $\sigma^2$ is unknown in practice, the authors asked the important question whether adaptivity to $\sigma^2$ was possible. Up to a $\ln T$ factor, Corollary 14 above provides a positive answer.

4.2 Regression Model With Fixed Design

In this section, we consider the regression model with fixed design. In this batch setting the forecaster is given at the beginning of the game a $T$-sample $(x_1, Y_1), \ldots, (x_T, Y_T) \in \mathcal{X} \times \mathbb{R}$, where the $x_t$ are deterministic elements in $\mathcal{X}$ and where

$$Y_t = f(x_t) + \varepsilon_t, \quad 1 \leq t \leq T, \quad (29)$$

for some i.i.d. sequence $\varepsilon_1, \ldots, \varepsilon_T \in \mathbb{R}$ (with unknown distribution) and some unknown function $f : \mathcal{X} \to \mathbb{R}$.
Next we construct an estimator $\hat f_T : \mathcal{X} \to \mathbb{R}$ of $f$ based on the sample $(x_1, Y_1), \ldots, (x_T, Y_T)$ that satisfies a sparsity oracle inequality; that is, its expected mean squared error $\mathbb{E}\bigl[\frac{1}{T}\sum_{t=1}^T (f(x_t) - \hat f_T(x_t))^2\bigr]$ is almost as small as the smallest mean squared error $\frac{1}{T}\sum_{t=1}^T (f(x_t) - u \cdot \varphi(x_t))^2$, $u \in \mathbb{R}^d$, up to some additive term proportional to $\|u\|_0$. In this setting, just like in Section 4.1, our algorithm and the corresponding analysis are a straightforward consequence of the general results on individual sequences developed in Section 3.

As in the random design setting, the sample $(x_1, Y_1), \ldots, (x_T, Y_T)$ is treated in a sequential fashion. We run the algorithm SeqSEW$^*_\tau$ defined in Figure 3 from time 1 to time $T$ with the particular choice of $\tau = 1/\sqrt{dT}$. We then define our estimator $\hat f_T : \mathcal{X} \to \mathbb{R}$ by

$$\hat f_T(x) \triangleq \begin{cases} \dfrac{1}{n_x} \displaystyle\sum_{\substack{1 \leq t \leq T \\ t : x_t = x}} \tilde f_t(x) & \text{if } x \in \{x_1, \ldots, x_T\}, \\[2mm] 0 & \text{if } x \notin \{x_1, \ldots, x_T\}, \end{cases} \quad (30)$$

where $n_x \triangleq |\{t : x_t = x\}| = \sum_{t=1}^T \mathbb{I}_{\{x_t = x\}}$, and where the estimators $\tilde f_t : \mathcal{X} \to \mathbb{R}$ sequentially built by the algorithm SeqSEW$^*_\tau$ are defined by

$$\tilde f_t(x) \triangleq \int_{\mathbb{R}^d} \bigl[u \cdot \varphi(x)\bigr]_{B_t}\, p_t(\mathrm{d}u). \quad (31)$$

In the particular case when the $x_t$ are all distinct, $\hat f_T$ is simply defined by $\hat f_T(x_t) \triangleq \tilde f_t(x_t)$ for all $t \in \{1, \ldots, T\}$ and by $\hat f_T(x) = 0$ otherwise. Therefore, in this case, $\hat f_T$ only uses the observations $y_1, \ldots, y_{t-1}$ to estimate $f(x_t)$ (in particular, $\hat f_T(x_1)$ is deterministic).
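The definition (30) amounts to averaging, at each design point, the online predictions made at the rounds where that point was observed. A minimal sketch, assuming hashable design points and reusing the list of prediction snapshots collected during the online run (as in the online-to-batch sketch of Section 4.1; the function names are ours):

from collections import defaultdict

def fixed_design_estimator(design, snapshots):
    """Sketch of the estimator (30): at each design point x, average
    the online predictions f~_t(x) over the rounds t with x_t = x.

    `design` is the list (x_1, ..., x_T) and `snapshots` the list of
    prediction functions f~_1, ..., f~_T from the online run.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for x, f_t in zip(design, snapshots):
        sums[x] += f_t(x)       # accumulate f~_t(x_t)
        counts[x] += 1          # n_x = #{t : x_t = x}

    def f_hat(x):
        if x in counts:
            return sums[x] / counts[x]
        return 0.0              # f_hat vanishes outside {x_1, ..., x_T}

    return f_hat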
The next theorem is the main result of this subsection. It follows, as in the random design setting, from the deterministic regret bound of Corollary 8 and from Jensen's inequality. The proof is postponed to Appendix A.3.

Theorem 16 Consider the regression model with fixed design described in (29). Then, the estimator $\hat f_T$ defined in (30)-(31) satisfies

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2\right] \leq \inf_{u \in \mathbb{R}^d} \left\{ \frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT^2} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(x_t) + 5\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}.$$

As in Section 4.1, the amplitude $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]$ can be upper bounded under various assumptions. The proof of the following corollary is postponed to Appendix A.3.

Corollary 17 Consider the regression model with fixed design described in (29). Assume that one of the following assumptions holds on the distribution of $\varepsilon_1$:

• $(\mathrm{BD}(B))$: $|\varepsilon_1| \leq B$ almost surely for a given constant $B > 0$;
• $(\mathrm{SG}(\sigma^2))$: $\varepsilon_1$ is subgaussian with variance factor $\sigma^2 > 0$, that is, $\mathbb{E}\bigl[e^{\lambda\varepsilon_1}\bigr] \leq e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$;
• $(\mathrm{BEM}(\alpha, M))$: $\varepsilon_1$ has a bounded exponential moment, that is, $\mathbb{E}\bigl[e^{\alpha|\varepsilon_1|}\bigr] \leq M$ for some given constants $\alpha > 0$ and $M > 0$;
• $(\mathrm{BM}(\alpha, M))$: $\varepsilon_1$ has a bounded moment, that is, $\mathbb{E}\bigl[|\varepsilon_1|^\alpha\bigr] \leq M$ for some given constants $\alpha \geq 2$ and $M > 0$.

Then, the estimator $\hat f_T$ defined in (30)-(31) satisfies

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2\right] \leq \inf_{u \in \mathbb{R}^d} \left\{ \frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 64 \left(\frac{\max_{1 \leq t \leq T} f^2(x_t)}{T} + \psi_T\right) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT^2} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(x_t) + 10 \left(\frac{\max_{1 \leq t \leq T} f^2(x_t)}{T} + \psi_T\right),$$

where

$$\psi_T \triangleq \frac{1}{T}\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \varepsilon_t^2\Bigr] \leq \begin{cases} B^2 / T & \text{if Assumption } (\mathrm{BD}(B)) \text{ holds}, \\ 2\sigma^2 \ln(2eT) / T & \text{if Assumption } (\mathrm{SG}(\sigma^2)) \text{ holds}, \\ \ln^2\bigl((M+e)T\bigr) / (\alpha^2 T) & \text{if Assumption } (\mathrm{BEM}(\alpha, M)) \text{ holds}, \\ M^{2/\alpha} / T^{(\alpha-2)/\alpha} & \text{if Assumption } (\mathrm{BM}(\alpha, M)) \text{ holds}. \end{cases}$$

The above bound is of the same flavor as that of Dalalyan and Tsybakov (2008, Theorem 5). It has one advantage and one drawback. On the one hand, we note two additional "bias" terms $\bigl(\max_{1 \leq t \leq T} f^2(x_t)\bigr)/T$ as compared to the bound of Dalalyan and Tsybakov (2008, Theorem 5). As of now, we have not been able to remove them using ideas similar to what we did in the random design case (see Remark 19 in Appendix A.2). On the other hand, under Assumption $(\mathrm{SG}(\sigma^2))$, contrary to Dalalyan and Tsybakov (2008), our algorithm does not require the prior knowledge of the variance factor $\sigma^2$ of the noise.

Acknowledgments

The author would like to thank Arnak Dalalyan, Gilles Stoltz, and Pascal Massart for their helpful feedback and suggestions, as well as two anonymous reviewers for their insightful comments, one of which helped us simplify the online tuning carried out in Section 3.2. The author acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant PARCIMONIE (http://www.proba.jussieu.fr/ANR/Parcimonie), and of the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886.

Appendix A. Proofs

In this appendix we provide the proofs of some results stated above.

A.1 Proofs of Theorem 10 and Corollary 11

Before proving Theorem 10, we first need the following comment. Since the algorithm SeqSEW$^*_\tau$ is restarted at the beginning of each regime, the threshold values $B_t$ used on regime $r$ by the algorithm SeqSEW$^*_\tau$ are not computed on the basis of all past observations $y_1, \ldots, y_{t-1}$ but only on the basis of the past observations $y_t$, $t \in \{t_{r-1}+1, \ldots, t-1\}$. To avoid any ambiguity, we set $B_{r, t_{r-1}+1} \triangleq 0$ and

$$B_{r,t} \triangleq \max_{t_{r-1}+1 \leq s \leq t-1} |y_s|, \quad t \in \{t_{r-1}+2, \ldots, t_r\}.$$

Proof (of Theorem 10) We denote by $R \triangleq \min\{r \in \mathbb{N} : T \leq t_r\}$ the index of the last regime. For notational convenience, we re-define $t_R \triangleq T$ (even if $\gamma_T \leq 2^R$).

We upper bound the regret of the algorithm SeqSEW$^{**}$ on $\{1, \ldots, T\}$ by the sum of its regrets on each time interval. To do so, first note that

$$\sum_{t=1}^{T} (y_t - \hat y_t)^2 = \sum_{r=0}^{R} \sum_{t=t_{r-1}+1}^{t_r} (y_t - \hat y_t)^2 = \sum_{r=0}^{R} \left( (y_{t_r} - \hat y_{t_r})^2 + \sum_{t=t_{r-1}+1}^{t_r - 1} (y_t - \hat y_t)^2 \right) \leq \sum_{r=0}^{R} \left( 2\bigl(y_{t_r}^2 + B_{r,t_r}^2\bigr) + \sum_{t=t_{r-1}+1}^{t_r - 1} (y_t - \hat y_t)^2 \right) \quad (32)$$

$$\leq \sum_{r=0}^{R} \left( \sum_{t=t_{r-1}+1}^{t_r - 1} (y_t - \hat y_t)^2 \right) + 4(R+1)\, {y_T^*}^2, \quad (33)$$

where we set $y_T^* \triangleq \max_{1 \leq t \leq T} |y_t|$, where (32) follows from the upper bound $(y_{t_r} - \hat y_{t_r})^2 \leq 2(y_{t_r}^2 + \hat y_{t_r}^2) \leq 2(y_{t_r}^2 + B_{r,t_r}^2)$ (since $|\hat y_{t_r}| \leq B_{r,t_r}$ by construction), and where (33) follows from the inequalities $y_{t_r}^2 \leq {y_T^*}^2$ and $B_{r,t_r}^2 \triangleq \max_{t_{r-1}+1 \leq t \leq t_r - 1} y_t^2 \leq {y_T^*}^2$. (In the trivial cases where $t_r = t_{r-1}+1$ for some $r$, the sum $\sum_{t=t_{r-1}+1}^{t_r - 1} (y_t - \hat y_t)^2$ equals 0 by convention.)
But, for every $r = 0, \ldots, R$, the trace of the empirical Gram matrix on $\{t_{r-1}+1, \ldots, t_r - 1\}$ is upper bounded by

$$\sum_{t=t_{r-1}+1}^{t_r - 1} \sum_{j=1}^{d} \varphi_j^2(x_t) \leq \sum_{t=1}^{t_r - 1} \sum_{j=1}^{d} \varphi_j^2(x_t) \leq \bigl(e^{2^r} - 1\bigr)^2,$$

where the last inequality follows from the fact that $\gamma_{t_r - 1} \leq 2^r$ (by definition of $t_r$). Since in addition $\tau_r \triangleq 1/\sqrt{(e^{2^r}-1)^2}$, we can apply Corollary 7 on each period $\{t_{r-1}+1, \ldots, t_r - 1\}$, $r = 0, \ldots, R$, with $B_\Phi = (e^{2^r}-1)^2$ and get from (33) the upper bound

$$\sum_{t=1}^{T} (y_t - \hat y_t)^2 \leq \sum_{r=0}^{R} \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=t_{r-1}+1}^{t_r - 1} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \Delta_r(u) \right\} + 4(R+1)\, {y_T^*}^2, \quad (34)$$

where

$$\Delta_r(u) \triangleq 32\, B_{r,t_r}^2 \|u\|_0 \ln\!\left(1 + \frac{\bigl(e^{2^r}-1\bigr) \|u\|_1}{\|u\|_0}\right) + 5 B_{r,t_r}^2 + 1. \quad (35)$$

Since the infimum is superadditive and since $\bigl(y_{t_r} - u \cdot \varphi(x_{t_r})\bigr)^2 \geq 0$ for all $r = 0, \ldots, R$, we get from (34) that

$$\sum_{t=1}^{T} (y_t - \hat y_t)^2 \leq \inf_{u \in \mathbb{R}^d} \sum_{r=0}^{R} \left( \sum_{t=t_{r-1}+1}^{t_r} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \Delta_r(u) \right) + 4(R+1)\, {y_T^*}^2 = \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^{T} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \sum_{r=0}^{R} \Delta_r(u) \right\} + 4(R+1)\, {y_T^*}^2. \quad (36)$$

Let $u \in \mathbb{R}^d$. Next we bound $\sum_{r=0}^{R} \Delta_r(u)$ and $4(R+1){y_T^*}^2$ from above. First note that, by the upper bound $B_{r,t_r}^2 \leq {y_T^*}^2$ and by the elementary inequality $\ln(1+xy) \leq \ln\bigl((1+x)(1+y)\bigr) = \ln(1+x) + \ln(1+y)$ with $x = e^{2^r}-1$ and $y = \|u\|_1 / \|u\|_0$, (35) yields

$$\Delta_r(u) \leq 32\, {y_T^*}^2 \|u\|_0\, 2^r + 32\, {y_T^*}^2 \|u\|_0 \ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0}\right) + 5 {y_T^*}^2 + 1.$$

Summing over $r = 0, \ldots, R$, we get

$$\sum_{r=0}^{R} \Delta_r(u) \leq 32\bigl(2^{R+1}-1\bigr) {y_T^*}^2 \|u\|_0 + (R+1) \left( 32\, {y_T^*}^2 \|u\|_0 \ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0}\right) + 5 {y_T^*}^2 + 1 \right). \quad (37)$$

First case: $R = 0$. Substituting (37) in (36), we conclude the proof by noting that $A_T \geq 2 + \log_2 1 \geq 1$ and that $\ln\bigl(e + \sqrt{\sum_{t=1}^{T}\sum_{j=1}^{d}\varphi_j^2(x_t)}\bigr) \geq 1$.

Second case: $R \geq 1$. Since $R \geq 1$, we have, by definition of $t_{R-1}$,

$$2^{R-1} < \gamma_{t_{R-1}} \triangleq \ln\!\left(1 + \sqrt{\sum_{t=1}^{t_{R-1}} \sum_{j=1}^{d} \varphi_j^2(x_t)}\right) \leq \ln\!\left(e + \sqrt{\sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t)}\right).$$

The last inequality entails that $2^{R+1}-1 \leq 4 \cdot 2^{R-1} \leq 4 \ln\bigl(e + \sqrt{\sum_{t=1}^{T}\sum_{j=1}^{d}\varphi_j^2(x_t)}\bigr)$ and that $R+1 \leq 2 + \log_2 \ln\bigl(e + \sqrt{\sum_{t=1}^{T}\sum_{j=1}^{d}\varphi_j^2(x_t)}\bigr) \triangleq A_T$. Therefore, on the one hand, via (37),

$$\sum_{r=0}^{R} \Delta_r(u) \leq 128\, {y_T^*}^2 \|u\|_0 \ln\!\left(e + \sqrt{\sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t)}\right) + 32\, {y_T^*}^2 A_T \|u\|_0 \ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0}\right) + A_T \bigl(5 {y_T^*}^2 + 1\bigr),$$

and, on the other hand, $4(R+1){y_T^*}^2 \leq 4 A_T {y_T^*}^2$. Substituting the last two inequalities in (36) and noting that ${y_T^*}^2 = \max_{1 \leq t \leq T} y_t^2$ concludes the proof.

Proof (of Corollary 11) The proof is straightforward. In view of Theorem 10, we just need to check that the quantity (continuously extended at $s = 0$)

$$128 \Bigl(\max_{1 \leq t \leq T} y_t^2\Bigr) s \ln\!\left(e + \sqrt{\sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t)}\right) + 32 \Bigl(\max_{1 \leq t \leq T} y_t^2\Bigr) A_T\, s \ln\!\left(1 + \frac{U}{s}\right)$$

is non-decreasing in $s \in \mathbb{R}_+$ and in $U \in \mathbb{R}_+$. This is clear for $U$. The fact that it is also non-decreasing in $s$ comes from the following remark. For all $U > 0$, the function $s \in (0, +\infty) \mapsto s \ln(1 + U/s)$ has a derivative equal to

$$\ln\!\left(1 + \frac{U}{s}\right) - \frac{U/s}{1 + U/s} \quad \text{for all } s > 0.$$

From the elementary inequality $\ln(1+u) = -\ln\bigl(\frac{1}{1+u}\bigr) \geq -\bigl(\frac{1}{1+u} - 1\bigr) = \frac{u}{1+u}$, which holds for all $u \in (-1, +\infty)$, the above derivative is nonnegative for all $s > 0$, so that the continuous extension $s \in \mathbb{R}_+ \mapsto s \ln(1 + U/s)$ is non-decreasing.
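The monotonicity used in this proof is also easy to confirm numerically; the following snippet checks on an arbitrary grid (both $U$ and the grid are illustrative choices of ours) that $s \mapsto s \ln(1 + U/s)$ is non-decreasing.

import math

# Quick numerical check that s |-> s * ln(1 + U/s) is non-decreasing,
# as used in the proof of Corollary 11 (U and the grid are arbitrary).
U = 5.0
grid = [0.01 * k for k in range(1, 10001)]
values = [s * math.log(1 + U / s) for s in grid]
assert all(v2 >= v1 for v1, v2 in zip(values, values[1:]))
print("s * ln(1 + U/s) is non-decreasing on the grid")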
A.2 Proofs of Theorem 12 and Corollary 13

In this subsection, we set $\varepsilon \triangleq Y - f(X)$, so that the pairs $(X_1, \varepsilon_1), \ldots, (X_T, \varepsilon_T)$ are independent copies of $(X, \varepsilon) \in \mathcal{X} \times \mathbb{R}$. We also define $\sigma \geq 0$ by $\sigma^2 \triangleq \mathbb{E}[\varepsilon^2] = \mathbb{E}\bigl[(Y - f(X))^2\bigr]$.

Proof (of Theorem 12) By Corollary 8 and the definitions of $\tilde f_t$ in (26) and $\hat y_t \triangleq \tilde f_t(X_t)$ in Figure 3, we have, almost surely,

$$\sum_{t=1}^{T} \bigl(Y_t - \tilde f_t(X_t)\bigr)^2 \leq \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^{T} \bigl(Y_t - u \cdot \varphi(X_t)\bigr)^2 + 32 \Bigl(\max_{1 \leq t \leq T} Y_t^2\Bigr) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(X_t) + 5 \max_{1 \leq t \leq T} Y_t^2.$$

It remains to take the expectations of both sides with respect to $\bigl((X_1, Y_1), \ldots, (X_T, Y_T)\bigr)$. First note that for all $t = 1, \ldots, T$, since $\varepsilon_t \triangleq Y_t - f(X_t)$, we have

$$\mathbb{E}\Bigl[\bigl(Y_t - \tilde f_t(X_t)\bigr)^2\Bigr] = \mathbb{E}\Bigl[\bigl(\varepsilon_t + f(X_t) - \tilde f_t(X_t)\bigr)^2\Bigr] = \sigma^2 + \mathbb{E}\Bigl[\bigl(f(X_t) - \tilde f_t(X_t)\bigr)^2\Bigr],$$

since $\mathbb{E}[\varepsilon_t^2] = \mathbb{E}[\varepsilon^2] \triangleq \sigma^2$ on the one hand, and, on the other hand, $\tilde f_t$ is built on $(X_s, Y_s)_{1 \leq s \leq t-1}$ and $\mathbb{E}\bigl[\varepsilon_t \,\big|\, (X_s, Y_s)_{1 \leq s \leq t-1}, X_t\bigr] = \mathbb{E}[\varepsilon_t \mid X_t] = 0$ (from the independence of $(X_s, Y_s)_{1 \leq s \leq t-1}$ and $(X_t, Y_t)$ and by definition of $f$). In the same way, $\mathbb{E}\bigl[(Y_t - u \cdot \varphi(X_t))^2\bigr] = \sigma^2 + \mathbb{E}\bigl[(f(X_t) - u \cdot \varphi(X_t))^2\bigr]$. Therefore, by Jensen's inequality and the concavity of the infimum, the last inequality becomes, after taking the expectations of both sides,

$$T\sigma^2 + \sum_{t=1}^{T} \mathbb{E}\Bigl[\bigl(f(X_t) - \tilde f_t(X_t)\bigr)^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ T\sigma^2 + \sum_{t=1}^{T} \mathbb{E}\Bigl[\bigl(f(X_t) - u \cdot \varphi(X_t)\bigr)^2\Bigr] + 32\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} Y_t^2\Bigr] \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \sum_{t=1}^{T} \mathbb{E}\bigl[\varphi_j^2(X_t)\bigr] + 5\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} Y_t^2\Bigr].$$

Noting that the terms $T\sigma^2$ cancel out, dividing both sides by $T$, and using the fact that $X_t \sim X$ in the right-hand side, we get

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Bigl[\bigl(f(X_t) - \tilde f_t(X_t)\bigr)^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + 5\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}.$$

The right-hand side of the last inequality is exactly the upper bound stated in Theorem 12. To conclude the proof, we thus only need to check that $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$ is bounded from above by the left-hand side. But by definition of $\hat f_T$ and by convexity of the square loss,

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \triangleq \mathbb{E}\left[\left(f(X) - \frac{1}{T} \sum_{t=1}^{T} \tilde f_t(X)\right)^2\right] \leq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Bigl[\bigl(f(X) - \tilde f_t(X)\bigr)^2\Bigr] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Bigl[\bigl(f(X_t) - \tilde f_t(X_t)\bigr)^2\Bigr].$$

The last equality follows classically from the fact that, for all $t = 1, \ldots, T$, $(X_s, Y_s)_{1 \leq s \leq t-1}$ (on which $\tilde f_t$ is constructed) is independent from both $X_t$ and $X$, and from the fact that $X_t \sim X$.

Remark 18 The fact that the inequality stated in Corollary 8 has a leading constant equal to 1 on individual sequences is crucial to derive in the stochastic setting an oracle inequality in terms of the (excess) risks $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$ and $\|f - u \cdot \varphi\|_{L^2}^2$. Indeed, if the constant appearing in front of the infimum were equal to $C > 1$, then the terms $T\sigma^2$ would not cancel out in the previous proof, so that the resulting expected inequality would contain a non-vanishing additive term $(C-1)\sigma^2$.
Proof (of Corollary 13) We can apply Theorem 12. Then, to prove the upper bound on $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$, it suffices to show that

$$\frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T} \leq 2 \left(\frac{\mathbb{E}[Y]^2}{T} + \psi_T\right). \quad (38)$$

Recall that $\psi_T \triangleq \frac{1}{T}\, \mathbb{E}\bigl[\max_{1 \leq t \leq T} (Y_t - \mathbb{E}[Y_t])^2\bigr] = \frac{1}{T}\, \mathbb{E}\bigl[\max_{1 \leq t \leq T} (\Delta Y)_t^2\bigr]$, where we defined $(\Delta Y)_t \triangleq Y_t - \mathbb{E}[Y_t] = Y_t - \mathbb{E}[Y]$ for all $t = 1, \ldots, T$. From the elementary inequality $(x+y)^2 \leq 2x^2 + 2y^2$ for all $x, y \in \mathbb{R}$, we have

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Y_t^2\Bigr] \triangleq \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(\mathbb{E}[Y] + (\Delta Y)_t\bigr)^2\Bigr] \leq 2\, \mathbb{E}[Y]^2 + 2\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} (\Delta Y)_t^2\Bigr].$$

Dividing both sides by $T$, we get (38).

As for the upper bound on $\psi_T$, since the $(\Delta Y)_t$, $1 \leq t \leq T$, are distributed as $\Delta Y$, we can apply Lemmas 24, 25, and 26 in Appendix B.3 to bound $\psi_T$ from above under the assumptions $(\mathrm{SG}(\sigma^2))$, $(\mathrm{BEM}(\alpha, M))$, and $(\mathrm{BM}(\alpha, M))$ respectively (the upper bound under $(\mathrm{BD}(B))$ is straightforward):

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} (\Delta Y)_t^2\Bigr] \leq \begin{cases} B^2 & \text{if Assumption } (\mathrm{BD}(B)) \text{ holds}, \\ 2\sigma^2 \ln(2eT) & \text{if Assumption } (\mathrm{SG}(\sigma^2)) \text{ holds}, \\ \ln^2\bigl((M+e)T\bigr) / \alpha^2 & \text{if Assumption } (\mathrm{BEM}(\alpha, M)) \text{ holds}, \\ (MT)^{2/\alpha} & \text{if Assumption } (\mathrm{BM}(\alpha, M)) \text{ holds}. \end{cases}$$

Remark 19 If $T \geq 2$, then the two "bias" terms $\mathbb{E}[Y]^2/T$ appearing in Corollary 13 can be avoided, at the price of at most a multiplicative factor $2T/(T-1) \leq 4$. It suffices to use a slightly more sophisticated online clipping defined as follows. The first round $t = 1$ is only used to observe $Y_1$. Then, the algorithm SeqSEW$^*_\tau$ is run with $\tau = 1/\sqrt{dT}$ from round 2 up to round $T$ with the following important modification: instead of truncating the predictions to $[-B_t, B_t]$, which is best suited to the case $\mathbb{E}[Y] = 0$, we truncate them to the interval $\bigl[Y_1 - B'_t,\, Y_1 + B'_t\bigr]$, where $B'_t \triangleq \max_{1 \leq s \leq t-1} |Y_s - Y_1|$. If $\eta_t$ is changed accordingly, that is, if $\eta_t = 1/(8 {B'_t}^2)$, then it is easy to see that the resulting procedure $\hat f_T \triangleq \frac{1}{T-1} \sum_{s=2}^{T} \tilde f_s$ (where $\tilde f_2, \ldots, \tilde f_T$ are the estimators output by SeqSEW$^*_\tau$) satisfies

$$\mathbb{E}\Bigl[\|f - \hat f_T\|_{L^2}^2\Bigr] \leq \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 64 \left(\frac{\mathrm{Var}[Y]}{T-1} + \psi_{T-1}\right) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \|\varphi_j\|_{L^2}^2 + 10 \left(\frac{\mathrm{Var}[Y]}{T-1} + \psi_{T-1}\right),$$

where $\mathrm{Var}[Y] \triangleq \mathbb{E}\bigl[(Y - \mathbb{E}[Y])^2\bigr]$. Comparing the last bound to that of Corollary 13, we note that the two terms $\mathbb{E}[Y]^2/T$ are absent, and that we lose a multiplicative factor of at most 4 since $\mathrm{Var}[Y] \leq \mathbb{E}\bigl[\max_{2 \leq t \leq T} (Y_t - \mathbb{E}[Y_t])^2\bigr] \triangleq (T-1)\psi_{T-1}$, so that

$$\frac{\mathrm{Var}[Y]}{T-1} + \psi_{T-1} \leq 2\psi_{T-1} \leq 2\left(\frac{T}{T-1}\right)\psi_T \leq 4\psi_T.$$

Remark 20 We mentioned after Corollary 13 that each of the four assumptions on $\Delta Y$ is fulfilled as soon as both the distribution of $f(X) - \mathbb{E}[f(X)]$ and the conditional distribution of $\varepsilon$ (conditionally on $X$) satisfy the same type of assumption. This actually extends to the more general case when the conditional distribution of $\varepsilon$ given $X$ is replaced with the distribution of $\varepsilon$ itself (without conditioning).
This relies on the elementary upper bound

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} (\Delta Y)_t^2\Bigr] = \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(f(X_t) - \mathbb{E}[f(X)] + \varepsilon_t\bigr)^2\Bigr] \leq 2\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(f(X_t) - \mathbb{E}[f(X)]\bigr)^2\Bigr] + 2\, \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \varepsilon_t^2\Bigr].$$

From the last inequality, we can also see that assumptions of different nature can be made on $f(X) - \mathbb{E}[f(X)]$ and $\varepsilon$, such as the assumptions given in (27) or in (28).

A.3 Proofs of Theorem 16 and Corollary 17

Proof (of Theorem 16) The proof follows the same lines as the proof of Theorem 12. We thus only sketch the main arguments. In the sequel, we set $\sigma^2 \triangleq \mathbb{E}[\varepsilon_1^2]$. Applying Corollary 8 we have, almost surely,

$$\sum_{t=1}^{T} \bigl(Y_t - \tilde f_t(x_t)\bigr)^2 \leq \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^{T} \bigl(Y_t - u \cdot \varphi(x_t)\bigr)^2 + 32 \Bigl(\max_{1 \leq t \leq T} Y_t^2\Bigr) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(x_t) + 5 \max_{1 \leq t \leq T} Y_t^2.$$

Taking the expectations of both sides, expanding the squares $\bigl(Y_t - \tilde f_t(x_t)\bigr)^2$ and $\bigl(Y_t - u \cdot \varphi(x_t)\bigr)^2$, noting that two terms $T\sigma^2$ cancel out (note that $\mathbb{E}\bigl[(f(x_t) - \tilde f_t(x_t))\varepsilon_t\bigr] = 0$ since $\tilde f_t(x_t)$ and $\varepsilon_t$ are independent, $\tilde f_t$ being built from the past data only; in particular, truncating the predictions to $B = \max_{1 \leq t \leq T} |Y_t|$ might not work, and a similar comment could be made in the random design case of Section 4.1), and then dividing both sides by $T$, we get

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - \tilde f_t(x_t)\bigr)^2\right] \leq \inf_{u \in \mathbb{R}^d} \left\{ \frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT^2} \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(x_t) + 5\, \frac{\mathbb{E}\bigl[\max_{1 \leq t \leq T} Y_t^2\bigr]}{T}.$$

The right-hand side is exactly the upper bound stated in Theorem 16. We thus only need to check that

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2\right] \leq \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} \bigl(f(x_t) - \tilde f_t(x_t)\bigr)^2\right]. \quad (39)$$

This is an equality if the $x_t$ are all distinct. In general we get an inequality which follows from the convexity of the square loss. Indeed, by definition of $n_x$, we have, almost surely,

$$\sum_{t=1}^{T} \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2 = \sum_{x \in \{x_1, \ldots, x_T\}} \sum_{\substack{1 \leq t \leq T \\ t : x_t = x}} \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2 = \sum_{x \in \{x_1, \ldots, x_T\}} n_x \bigl(f(x) - \hat f_T(x)\bigr)^2 = \sum_{x \in \{x_1, \ldots, x_T\}} n_x \left(f(x) - \frac{1}{n_x} \sum_{\substack{1 \leq t \leq T \\ t : x_t = x}} \tilde f_t(x)\right)^2 \leq \sum_{x \in \{x_1, \ldots, x_T\}} n_x\, \frac{1}{n_x} \sum_{\substack{1 \leq t \leq T \\ t : x_t = x}} \bigl(f(x) - \tilde f_t(x)\bigr)^2 = \sum_{t=1}^{T} \bigl(f(x_t) - \tilde f_t(x_t)\bigr)^2,$$

where the second equality is by definition of $\hat f_T$ and where the inequality follows from Jensen's inequality. Dividing both sides by $T$ and taking their expectations, we get (39), which concludes the proof.

Proof (of Corollary 17) First note that

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Y_t^2\Bigr] \triangleq \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \bigl(f(x_t) + \varepsilon_t\bigr)^2\Bigr] \leq 2 \left( \max_{1 \leq t \leq T} f^2(x_t) + \mathbb{E}\Bigl[\max_{1 \leq t \leq T} \varepsilon_t^2\Bigr] \right).$$

The proof then follows exactly the same lines as for Corollary 13, with the sequence $(\varepsilon_t)$ instead of the sequence $\bigl((\Delta Y)_t\bigr)$.

Appendix B. Tools

Next we provide several (in)equalities that prove to be useful throughout the paper.

B.1 A Duality Formula for the Kullback-Leibler Divergence

We recall below a key duality formula satisfied by the Kullback-Leibler divergence, whose proof can be found, for example, in the monograph by Catoni (2004, pp. 159-160).
We use the notations of Section 2.

Proposition 21 For any measurable space $(\Theta, \mathcal{B})$, any probability distribution $\pi$ on $(\Theta, \mathcal{B})$, and any measurable function $h : \Theta \to [a, +\infty)$ bounded from below (by some $a \in \mathbb{R}$), we have

$$-\ln \int_\Theta e^{-h}\, \mathrm{d}\pi = \inf_{\rho \in \mathcal{M}_1^+(\Theta)} \left\{ \int_\Theta h\, \mathrm{d}\rho + \mathcal{K}(\rho, \pi) \right\},$$

where $\mathcal{M}_1^+(\Theta)$ denotes the set of all probability distributions on $(\Theta, \mathcal{B})$, and where the expectations $\int_\Theta h\, \mathrm{d}\rho \in [a, +\infty]$ are always well defined since $h$ is bounded from below.

B.2 Some Tools to Exploit Our PAC-Bayesian Inequalities

In this subsection we recall two results needed for the derivation of Proposition 1 and Proposition 5 from the PAC-Bayesian inequalities (7) and (12). The proofs are due to Dalalyan and Tsybakov (2007, 2008) and we only reproduce them for the convenience of the reader. (The notations are however slightly modified because of the change in the statistical setting and goal: the target predictions $(f(x_1), \ldots, f(x_T))$ are replaced with the observations $(y_1, \ldots, y_T)$, and the prediction loss $\|f - f_u\|_n^2$ is replaced with the cumulative loss $\sum_{t=1}^T (y_t - u \cdot \varphi(x_t))^2$. Moreover, the analysis of the present proof is slightly simpler since we just need to consider the case $L_0 = +\infty$ according to the notations of Theorem 5 by Dalalyan and Tsybakov (2008).)

For any $u^* \in \mathbb{R}^d$ and $\tau > 0$, define $\rho_{u^*, \tau}$ as the translate of $\pi_\tau$ at $u^*$, namely,

$$\rho_{u^*, \tau} \triangleq \frac{\mathrm{d}\pi_\tau}{\mathrm{d}u}(u - u^*)\, \mathrm{d}u = \prod_{j=1}^{d} \frac{(3/\tau)\, \mathrm{d}u_j}{2\bigl(1 + |u_j - u^*_j|/\tau\bigr)^4}. \quad (40)$$
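Since, by (40), each coordinate of $\pi_\tau$ has the explicit density $u \mapsto \bigl(3/(2\tau)\bigr)\bigl(1 + |u|/\tau\bigr)^{-4}$, one can sample from this heavy-tailed sparsity prior by inverting the distribution function of the magnitude $|U|$, whose survival function is $\mathbb{P}(|U| > m) = (1 + m/\tau)^{-3}$. A minimal sketch (the function names are ours, not part of the paper):

import random

def sample_prior_coordinate(tau, rng=random):
    """Draw one coordinate of pi_tau, whose density is
    (3 / (2 * tau)) * (1 + |u| / tau) ** (-4), by inversion:
    P(|U| > m) = (1 + m / tau) ** (-3)."""
    v = 1.0 - rng.random()                    # uniform on (0, 1], avoids v = 0
    magnitude = tau * (v ** (-1.0 / 3.0) - 1.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_prior(d, tau, rng=random):
    # pi_tau is a product distribution over the d coordinates.
    return [sample_prior_coordinate(tau, rng) for _ in range(d)]

The polynomial tails of this prior (in contrast with, say, Gaussian tails) are what make the translation cost in Lemma 23 below logarithmic in $\|u^*\|_1$.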
Lemma 22 For all $u^* \in \mathbb{R}^d$ and $\tau > 0$, the probability distribution $\rho_{u^*, \tau}$ satisfies

$$\int_{\mathbb{R}^d} \sum_{t=1}^{T} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2\, \rho_{u^*, \tau}(\mathrm{d}u) = \sum_{t=1}^{T} \bigl(y_t - u^* \cdot \varphi(x_t)\bigr)^2 + \tau^2 \sum_{j=1}^{d} \sum_{t=1}^{T} \varphi_j^2(x_t).$$

Lemma 23 For all $u^* \in \mathbb{R}^d$ and $\tau > 0$, the probability distribution $\rho_{u^*, \tau}$ satisfies

$$\mathcal{K}(\rho_{u^*, \tau}, \pi_\tau) \leq 4 \|u^*\|_0 \ln\!\left(1 + \frac{\|u^*\|_1}{\|u^*\|_0\, \tau}\right).$$

Proof (of Lemma 22) For all $t \in \{1, \ldots, T\}$ we expand the square $\bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 = \bigl(y_t - u^* \cdot \varphi(x_t) + (u^* - u) \cdot \varphi(x_t)\bigr)^2$ and use the linearity of the integral to get

$$\int_{\mathbb{R}^d} \sum_{t=1}^{T} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2\, \rho_{u^*, \tau}(\mathrm{d}u) = \sum_{t=1}^{T} \bigl(y_t - u^* \cdot \varphi(x_t)\bigr)^2 + \sum_{t=1}^{T} \int_{\mathbb{R}^d} \bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2\, \rho_{u^*, \tau}(\mathrm{d}u) + \sum_{t=1}^{T} 2\bigl(y_t - u^* \cdot \varphi(x_t)\bigr) \underbrace{\int_{\mathbb{R}^d} (u^* - u) \cdot \varphi(x_t)\, \rho_{u^*, \tau}(\mathrm{d}u)}_{= 0}. \quad (41)$$

The last sum equals zero by symmetry of $\rho_{u^*, \tau}$ around $u^*$, which yields $\int_{\mathbb{R}^d} u\, \rho_{u^*, \tau}(\mathrm{d}u) = u^*$. As for the second sum of the right-hand side, it can be computed similarly. Indeed, expanding the inner product and then the square $\bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2$, we have, for all $t = 1, \ldots, T$,

$$\bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2 = \sum_{j=1}^{d} (u^*_j - u_j)^2 \varphi_j^2(x_t) + \sum_{1 \leq j \neq k \leq d} (u^*_j - u_j)(u^*_k - u_k)\, \varphi_j(x_t)\, \varphi_k(x_t).$$

By symmetry of $\rho_{u^*, \tau}$ around $u^*$ and the fact that $\rho_{u^*, \tau}$ is a product distribution, we get

$$\sum_{t=1}^{T} \int_{\mathbb{R}^d} \bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2\, \rho_{u^*, \tau}(\mathrm{d}u) = \sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t) \int_{\mathbb{R}^d} (u^*_j - u_j)^2\, \rho_{u^*, \tau}(\mathrm{d}u) + 0 = \sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t) \int_{\mathbb{R}} (u^*_j - u_j)^2\, \frac{(3/\tau)\, \mathrm{d}u_j}{2\bigl(1 + |u_j - u^*_j|/\tau\bigr)^4} \quad (42)$$

$$= \tau^2 \sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t) \int_{\mathbb{R}} \frac{3 t^2\, \mathrm{d}t}{2(1 + |t|)^4} \quad (43)$$

$$= \tau^2 \sum_{t=1}^{T} \sum_{j=1}^{d} \varphi_j^2(x_t), \quad (44)$$

where (42) follows by definition of $\rho_{u^*, \tau}$, where (43) is obtained by the change of variables $t = (u_j - u^*_j)/\tau$, and where (44) follows from the equality $\int_{\mathbb{R}} \frac{3t^2\, \mathrm{d}t}{2(1+|t|)^4} = 1$, which can be proved by integrating by parts. Substituting (44) into (41) concludes the proof.

Proof (of Lemma 23) By definition of $\rho_{u^*, \tau}$ and $\pi_\tau$, we have

$$\mathcal{K}(\rho_{u^*, \tau}, \pi_\tau) \triangleq \int_{\mathbb{R}^d} \left(\ln \frac{\mathrm{d}\rho_{u^*, \tau}}{\mathrm{d}\pi_\tau}(u)\right) \rho_{u^*, \tau}(\mathrm{d}u) = \int_{\mathbb{R}^d} \ln \left(\prod_{j=1}^{d} \frac{\bigl(1 + |u_j|/\tau\bigr)^4}{\bigl(1 + |u_j - u^*_j|/\tau\bigr)^4}\right) \rho_{u^*, \tau}(\mathrm{d}u) = 4 \int_{\mathbb{R}^d} \left(\sum_{j=1}^{d} \ln \frac{1 + |u_j|/\tau}{1 + |u_j - u^*_j|/\tau}\right) \rho_{u^*, \tau}(\mathrm{d}u). \quad (45)$$

But, for all $u \in \mathbb{R}^d$, by the triangle inequality,

$$1 + |u_j|/\tau \leq 1 + |u^*_j|/\tau + |u_j - u^*_j|/\tau \leq \bigl(1 + |u^*_j|/\tau\bigr)\bigl(1 + |u_j - u^*_j|/\tau\bigr),$$

so that Equation (45) yields the upper bound

$$\mathcal{K}(\rho_{u^*, \tau}, \pi_\tau) \leq 4 \sum_{j=1}^{d} \ln\bigl(1 + |u^*_j|/\tau\bigr) = 4 \sum_{j : u^*_j \neq 0} \ln\bigl(1 + |u^*_j|/\tau\bigr).$$

We now recall that $\|u^*\|_0 \triangleq |\{j : u^*_j \neq 0\}|$ and apply Jensen's inequality to the concave function $x \in (-1, +\infty) \mapsto \ln(1+x)$ to get

$$\sum_{j : u^*_j \neq 0} \ln\bigl(1 + |u^*_j|/\tau\bigr) = \|u^*\|_0\, \frac{1}{\|u^*\|_0} \sum_{j : u^*_j \neq 0} \ln\bigl(1 + |u^*_j|/\tau\bigr) \leq \|u^*\|_0 \ln\!\left(1 + \frac{\sum_{j : u^*_j \neq 0} |u^*_j|}{\|u^*\|_0\, \tau}\right) \leq \|u^*\|_0 \ln\!\left(1 + \frac{\|u^*\|_1}{\|u^*\|_0\, \tau}\right).$$

This concludes the proof.

B.3 Some Maximal Inequalities

Next we prove three maximal inequalities needed for the derivation of Corollaries 13 and 17 from Theorems 12 and 16 respectively. Their proofs are quite standard but we provide them for the convenience of the reader.

Lemma 24 Let $Z_1, \ldots, Z_T$ be $T \geq 1$ (centered) real random variables such that, for a given constant $\nu > 0$, we have

$$\forall t \in \{1, \ldots, T\},\ \forall \lambda \in \mathbb{R}, \quad \mathbb{E}\bigl[e^{\lambda Z_t}\bigr] \leq e^{\lambda^2 \nu / 2}. \quad (46)$$

Then $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Z_t^2\bigr] \leq 2\nu \ln(2eT)$.

Lemma 25 Let $Z_1, \ldots, Z_T$ be $T \geq 1$ real random variables such that, for some given constants $\alpha > 0$ and $M > 0$, we have $\mathbb{E}\bigl[e^{\alpha |Z_t|}\bigr] \leq M$ for all $t \in \{1, \ldots, T\}$. Then $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Z_t^2\bigr] \leq \ln^2\bigl((M+e)T\bigr) / \alpha^2$.

Lemma 26 Let $Z_1, \ldots, Z_T$ be $T \geq 1$ real random variables such that, for some given constants $\alpha \geq 2$ and $M > 0$, we have $\mathbb{E}\bigl[|Z_t|^\alpha\bigr] \leq M$ for all $t \in \{1, \ldots, T\}$. Then $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Z_t^2\bigr] \leq (MT)^{2/\alpha}$.
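Lemma 24 is easy to probe by simulation before reading its proof; the following Monte Carlo check for i.i.d. standard Gaussians ($\nu = 1$; the horizon and number of runs are arbitrary choices of ours) typically reports an empirical value comfortably below the bound $2\ln(2eT)$.

import math
import random

# Monte Carlo sanity check of Lemma 24 for i.i.d. standard Gaussians
# (nu = 1): E[max_t Z_t^2] should stay below 2 * nu * ln(2eT).
T, runs = 100, 2000
rng = random.Random(0)
acc = 0.0
for _ in range(runs):
    acc += max(rng.gauss(0.0, 1.0) ** 2 for _ in range(T))
estimate = acc / runs
bound = 2 * math.log(2 * math.e * T)
print(f"empirical E[max Z_t^2] ~ {estimate:.2f} <= bound {bound:.2f}")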
Proof (of Lemma 24) Let $t \in \{1, \ldots, T\}$. From the subgaussian assumption (46) it is well known (see, e.g., Massart 2007, Chapter 2) that for all $x \geq 0$ we have $\mathbb{P}\bigl(|Z_t| \geq x\bigr) \leq 2 e^{-x^2/(2\nu)}$. Let $\delta \in (0,1)$. By the change of variables $x = \sqrt{2\nu \ln(2T/\delta)}$, the last inequality entails that, for all $t = 1, \ldots, T$, we have $|Z_t| \leq \sqrt{2\nu \ln(2T/\delta)}$ with probability at least $1 - \delta/T$. Therefore, by a union bound, we get, with probability at least $1 - \delta$,

$$\forall t \in \{1, \ldots, T\}, \quad |Z_t| \leq \sqrt{2\nu \ln(2T/\delta)}.$$

As a consequence, with probability at least $1 - \delta$,

$$\max_{1 \leq t \leq T} Z_t^2 \leq 2\nu \ln(2T/\delta) = 2\nu \ln(1/\delta) + 2\nu \ln(2T).$$

It now just remains to integrate the last inequality over $\delta \in (0,1)$, as is made precise below. By the change of variables $\delta = e^{-z}$, the latter inequality yields

$$\forall z \geq 0, \quad \mathbb{P}\left[\left(\frac{\max_{1 \leq t \leq T} Z_t^2 - 2\nu \ln(2T)}{2\nu}\right)_+ > z\right] \leq e^{-z}, \quad (47)$$

where for all $x \in \mathbb{R}$, $x_+ \triangleq \max\{x, 0\}$ denotes the positive part of $x$. Using the well-known fact that $\mathbb{E}[\xi] = \int_0^{+\infty} \mathbb{P}(\xi > z)\, \mathrm{d}z$ for all nonnegative real random variables $\xi$, we get

$$\mathbb{E}\left[\frac{\max_{1 \leq t \leq T} Z_t^2 - 2\nu \ln(2T)}{2\nu}\right] \leq \mathbb{E}\left[\left(\frac{\max_{1 \leq t \leq T} Z_t^2 - 2\nu \ln(2T)}{2\nu}\right)_+\right] = \int_0^{+\infty} \mathbb{P}\left[\left(\frac{\max_{1 \leq t \leq T} Z_t^2 - 2\nu \ln(2T)}{2\nu}\right)_+ > z\right] \mathrm{d}z \leq \int_0^{+\infty} e^{-z}\, \mathrm{d}z = 1,$$

where the last inequality follows from (47) above. Rearranging terms, we get $\mathbb{E}\bigl[\max_{1 \leq t \leq T} Z_t^2\bigr] \leq 2\nu + 2\nu \ln(2T) = 2\nu \ln(2eT)$, which concludes the proof.

Proof (of Lemma 25) We first need the following definitions. Let $\psi_\alpha : \mathbb{R}_+ \to \mathbb{R}$ be a convex majorant of $x \mapsto e^{\alpha\sqrt{x}}$ on $\mathbb{R}_+$ defined by

$$\psi_\alpha(x) \triangleq \begin{cases} e & \text{if } x < 1/\alpha^2, \\ e^{\alpha\sqrt{x}} & \text{if } x \geq 1/\alpha^2. \end{cases}$$

We associate with $\psi_\alpha$ its generalized inverse $\psi_\alpha^{-1} : \mathbb{R} \to \mathbb{R}_+$ defined by

$$\psi_\alpha^{-1}(y) = \begin{cases} 1/\alpha^2 & \text{if } y < e, \\ (\ln y)^2 / \alpha^2 & \text{if } y \geq e. \end{cases}$$

Elementary manipulations show that:

• $\psi_\alpha$ is nondecreasing and convex on $\mathbb{R}_+$;
• $\psi_\alpha^{-1}$ is nondecreasing on $\mathbb{R}$;
• $x \leq \psi_\alpha^{-1}\bigl(\psi_\alpha(x)\bigr)$ for all $x \in \mathbb{R}_+$.

The proof is based on a Pisier-type argument as is done, for example, by Massart (2007, Lemma 2.3) to prove the maximal inequality $\mathbb{E}\bigl[\max_{1 \leq t \leq T} \xi_t\bigr] \leq \sqrt{2\nu \ln T}$ for all subgaussian real random variables $\xi_t$, $1 \leq t \leq T$, with common variance factor $\nu > 0$. From the inequality $x \leq \psi_\alpha^{-1}\bigl(\psi_\alpha(x)\bigr)$ for all $x \in \mathbb{R}_+$ we have

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Z_t^2\Bigr] \leq \psi_\alpha^{-1}\Bigl(\psi_\alpha\Bigl(\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Z_t^2\Bigr]\Bigr)\Bigr) \leq \psi_\alpha^{-1}\Bigl(\mathbb{E}\Bigl[\psi_\alpha\Bigl(\max_{1 \leq t \leq T} Z_t^2\Bigr)\Bigr]\Bigr) = \psi_\alpha^{-1}\Bigl(\mathbb{E}\Bigl[\max_{1 \leq t \leq T} \psi_\alpha\bigl(Z_t^2\bigr)\Bigr]\Bigr),$$

where the last two steps follow by Jensen's inequality (since $\psi_\alpha$ is convex) and the fact that both $\psi_\alpha^{-1}$ and $\psi_\alpha$ are nondecreasing. Since $\psi_\alpha \geq 0$ and $\psi_\alpha^{-1}$ is nondecreasing we get

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Z_t^2\Bigr] \leq \psi_\alpha^{-1}\left(\mathbb{E}\left[\sum_{t=1}^{T} \psi_\alpha\bigl(Z_t^2\bigr)\right]\right) = \psi_\alpha^{-1}\left(\sum_{t=1}^{T} \mathbb{E}\Bigl[\psi_\alpha\bigl(Z_t^2\bigr)\Bigr]\right) \leq \psi_\alpha^{-1}\left(\sum_{t=1}^{T} \mathbb{E}\Bigl[e^{\alpha|Z_t|} + e\Bigr]\right) \leq \psi_\alpha^{-1}\bigl(MT + eT\bigr) = \frac{\ln^2\bigl(MT + eT\bigr)}{\alpha^2},$$

where the second-to-last inequality follows from the bound $\psi_\alpha(x) \leq e + e^{\alpha\sqrt{x}}$ for all $x \in \mathbb{R}_+$, and where the last equality follows from the bounded-exponential-moment assumption and the definition of $\psi_\alpha^{-1}$. This concludes the proof.

Proof (of Lemma 26) As in the previous proof, we have, by Jensen's inequality and the fact that $x \mapsto x^{\alpha/2}$ is convex and nondecreasing on $\mathbb{R}_+$ (since $\alpha \geq 2$),

$$\mathbb{E}\Bigl[\max_{1 \leq t \leq T} Z_t^2\Bigr] \leq \mathbb{E}\left[\Bigl(\max_{1 \leq t \leq T} Z_t^2\Bigr)^{\alpha/2}\right]^{2/\alpha} = \mathbb{E}\Bigl[\max_{1 \leq t \leq T} |Z_t|^\alpha\Bigr]^{2/\alpha} \leq \mathbb{E}\left[\sum_{t=1}^{T} |Z_t|^\alpha\right]^{2/\alpha} \leq (MT)^{2/\alpha}$$

by the bounded-moment assumption, which concludes the proof.

References

F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34(2):584-653, 2006.
P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat., 5:127-145, 2011.
J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. Ann. Statist., 37(4):1591-1646, 2009.
P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. J. Comput. Syst. Sci., 64:48-75, 2002.
K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn., 43(3):211-246, 2001.
G. Biau, K. Bleakley, L. Györfi, and G. Ottucsák. Nonparametric sequential prediction of time series. J. Nonparametr. Stat., 22(3-4):297-317, 2010.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705-1732, 2009.
L. Birgé and P. Massart. Gaussian model selection. J. Eur. Math. Soc., 3:203-268, 2001.
L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields, 138:33-73, 2007.
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg, 2011.
F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inform. Theory, 54(4):1725-1735, 2008.
F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for regression learning. Technical report, 2004. Available at http://arxiv.org/abs/math/0410214.
F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674-1697, 2007a.
F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electron. J. Stat., 1:169-194, 2007b.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313-2351, 2007.
O. Catoni. Universal aggregation rules with exact bias bounds. Technical Report PMA-510, Laboratoire de Probabilités et Modèles Aléatoires, CNRS, Paris, 1999.
O. Catoni. Statistical Learning Theory and Stochastic Optimization. Springer, New York, 2004.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Inform. Theory, 50(9):2050-2057, 2004.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Mach. Learn., 66(2/3):321-352, 2007.
A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory (COLT'07), pages 97-111, 2007.
A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn., 72(1-2):39-61, 2008.
A. Dalalyan and A. B. Tsybakov. Mirror averaging with sparsity priors. Bernoulli, 18(3):914-944, 2012a.
A. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci., 78:1423-1443, 2012b.
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425-455, 1994.
J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT'10), pages 14-26, 2010.
Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC'97), pages 334-343, 1997.
S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. JMLR Workshop and Conference Proceedings, 19 (COLT 2011 Proceedings):377-396, 2011.
S. Gerchinovitz and J. Y. Yu. Adaptive and optimal online linear regression on ℓ1-balls. In J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science, pages 99-113. Springer Berlin/Heidelberg, 2011.
L. Györfi and G. Ottucsák. Sequential prediction of unbounded stationary time series. IEEE Trans. Inform. Theory, 53(5):1866-1872, 2007.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York, 2002.
M. Hebiri and S. van de Geer. The Smooth-Lasso and other ℓ1+ℓ2-penalized methods. Electron. J. Stat., 5:1184-1226, 2011.
A. Juditsky, P. Rigollet, and A. B. Tsybakov. Learning by mirror averaging. Ann. Statist., 36(5):2183-2206, 2008.
J. Kivinen and M. K. Warmuth. Averaging expert predictions. In Proceedings of the 4th European Conference on Computational Learning Theory (EuroCOLT'99), pages 153-167, 1999.
V. Koltchinskii. Sparse recovery in convex hulls via entropy penalization. Ann. Statist., 37(3):1332-1359, 2009a.
V. Koltchinskii. Sparsity in penalized empirical risk minimization. Ann. Inst. Henri Poincaré Probab. Stat., 45(1):7-57, 2009b.
V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302-2329, 2011.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777-801, 2009.
N. Littlestone. From on-line to batch learning. In Proceedings of the 2nd Annual Conference on Learning Theory (COLT'89), pages 269-284, 1989.
K. Lounici, M. Pontil, S. van de Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164-2204, 2011.
P. Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.
P. Rigollet and A. B. Tsybakov. Exponential Screening and optimal rates of sparse estimation. Ann. Statist., 39(2):731-771, 2011.
M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. J. Mach. Learn. Res., 9:759-813, 2008.
S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. J. Mach. Learn. Res., 12:1865-1892, 2011.
R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267-288, 1996.
S. A. van de Geer. High-dimensional generalized linear models and the Lasso. Ann. Statist., 36(2):614-645, 2008.
S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat., 3:1360-1392, 2009.
V. Vovk. Competitive on-line statistics. Internat. Statist. Rev., 69:213-248, 2001.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543-2596, 2010.
