Supermartingales in Prediction with Expert Advice

Alexey Chernov, Yuri Kalnishkan, Fedor Zhdanov, Vladimir Vovk

Computer Learning Research Centre, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
{chernov, yura, fedor, vovk}@cs.rhul.ac.uk

July 24, 2018

Abstract

We apply the method of defensive forecasting, based on the use of game-theoretic supermartingales, to prediction with expert advice. In the traditional setting of a countable number of experts and a finite number of outcomes, the Defensive Forecasting Algorithm is very close to the well-known Aggregating Algorithm. Not only the performance guarantees but also the predictions are the same for these two methods of fundamentally different nature. We also discuss a new setting where the experts can give advice conditional on the learner's future decision. Both algorithms can be adapted to the new setting and give the same performance guarantees as in the traditional setting. Finally, we outline an application of defensive forecasting to a setting with several loss functions.

1 Introduction

The framework of prediction with expert advice was introduced in the late 1980s. In contrast to statistical learning theory, the methods of prediction with expert advice do not require statistical assumptions about the source of data. The role of the assumptions is played by a "pool of experts": the forecaster, called Learner, bases his predictions upon the predictions and performance of the experts. For details and references, see the monograph [6].

Many methods for prediction with expert advice are known. This paper deals with two of them: the Aggregating Algorithm [24] and defensive forecasting [26]. The Aggregating Algorithm (the AA for short) is a member of the family of exponential-weights algorithms and implements a Bayesian-type aggregation; various optimality properties of the AA have been established [25]. Defensive forecasting is a recently developed technique that combines the ideas of game-theoretic probability [21] with Levin and Gács's ideas of neutral measure [10, 16] and Foster and Vohra's ideas of universal calibration [8].

The idea of defensive forecasting comes from an interpretation of probability with the help of perfect-information games. Learner develops his strategy by modelling a game where a probability forecaster plays on the actual data against an imaginary opponent, Sceptic, who represents a law of probability. The capital of Sceptic tends to infinity (or becomes large) if the players' moves lead to a violation of this law. The capital of a strategy for Sceptic, as a function of the other players' moves, is called a (game-theoretic) supermartingale. It is known (see Lemma 4 in this paper) that for any supermartingale there is a forecasting strategy that prevents this supermartingale from growing ("defending" against this strategy of Sceptic), thereby forcing the corresponding law of probability.
The older versions of defensive forecasting (see, e.g., [26]) minimize Learner's actual loss with the help of the following trick: a forecasting strategy is constructed so that the actual losses (Learner's and the experts') are close to the (one-step-ahead conditional) expected losses; at each step Learner minimizes the expected loss (that is, the law of probability used in this case is the conjunction of several laws of large numbers). This paper gives a self-contained description of a different version of the defensive forecasting method. We use certain supermartingales and do not need to talk about the underlying laws of probability.

Defensive forecasting, as well as the AA, can be used for competitive online prediction against "pools of experts" consisting of all functions from a large function class (see [27, 28]). However, the loss bounds proved so far are generally incomparable: for large classes (such as many Sobolev spaces), defensive forecasting is better, whereas for smaller classes (such as classes of analytic functions), the AA works better. Note that the optimality results for the AA are obtained for experts that are free agents, not functions from a given class; thus we need to evaluate the algorithms anew. This general task requires a deeper understanding of the properties of defensive forecasting.

In this paper, the AA and defensive forecasting are discussed in the simple case of a finite number of outcomes. Learner competes with a countable pool of Experts $\Theta$. Experts and Learner give predictions and suffer some loss at each step. A game is a specification of what predictions are admissible and what loss a prediction incurs for each outcome. For every game, we are interested in performance guarantees of the form

\[
\forall \theta \in \Theta \;\; \forall N \quad L_N \le c\, L_N^{\theta} + a_{\theta},
\]

where $L_N$ is the cumulative loss of Learner and $L_N^{\theta}$ is the cumulative loss of Expert $\theta$ over the first $N$ steps, $c$ is some constant, and $a_{\theta}$ depends on $\theta$ only.

Section 2 recalls the AA and its loss bound (Theorem 1) and introduces notation used in the paper. Section 3 presents the main results of the paper. Subsection 3.1 describes the Defensive Forecasting Algorithm (DFA), which is based on the use of game-theoretic supermartingales, and its loss bound (Theorem 5). It turns out that if the AA and the DFA are both applicable to a game, they guarantee the same loss bound. Subsections 3.3–3.6 discuss when the DFA and the AA are applicable. Loosely speaking, if the DFA is applicable then the AA is applicable as well (Theorem 9); and for games satisfying some additional assumptions, if the AA is applicable then the DFA is applicable (Theorems 13 and 20). Subsection 3.7 gives a criterion of AA realizability in terms of supermartingales (Theorem 22) using a rather awkward variant of the DFA. The construction of the supermartingales used in this paper involves a parameterization of the game with the help of a proper loss function. Proper loss functions play an important role in Bayesian statistics, and their meaning in our context is discussed in Subsections 3.4 and 3.6. The rest of the paper is devoted to modifications of the standard setting. Subsection 3.8 applies the DFA in an extended setting where the outcomes form a finite-dimensional simplex.
Section 4 introduces a new setting for prediction with expert advice, where the experts are allowed to "second-guess", that is, to give "conditional" predictions that are functions of Learner's future decision (cf. the notion of internal regret [9]). If the dependence is regular enough (namely, continuous), the DFA works in the new setting virtually without changes (Theorem 26). The AA, with some modification based on a fixed-point theorem, can be applied in the new setting too (Theorem 29). Section 5 briefly outlines one more application of the DFA: a setting with several loss functions.

Some results of the paper appeared in [29] and in the ALT'08 proceedings [4].

2 Games of Prediction and the Aggregating Algorithm

We begin with formulating the setting of prediction with expert advice. A game of prediction consists of three components: a non-empty set $\Omega$ of possible outcomes, a non-empty set $\Gamma$ of possible decisions, and a function $\lambda : \Gamma \times \Omega \to [0,\infty]$ called the loss function. In this paper we assume that the set $\Omega$ is finite. The set

\[
\Lambda = \{ g \in [0,\infty]^{\Omega} \mid \exists \gamma \in \Gamma \;\; \forall \omega \in \Omega \;\; g(\omega) = \lambda(\gamma,\omega) \}
\]

is called the set of predictions of the game. In this paper, we will identify each decision $\gamma \in \Gamma$ with the function $\omega \mapsto \lambda(\gamma,\omega)$ (and also with a point in $|\Omega|$-dimensional Euclidean space with pointwise operations). A loss function can be considered as a parameterization of $\Lambda$ by elements of $\Gamma$. To study the properties of a game, we do not need to know the decision set $\Gamma$ and the loss function; we can forget about them and consider the prediction set $\Lambda$ only. From now on, a game will be specified by a pair $(\Omega, \Lambda)$, where $\Lambda \subseteq [0,\infty]^{\Omega}$. We will use the letter $\gamma$ (as well as $g$) with indices to denote elements of $[0,\infty]^{\Omega}$ (rather than decisions). However, loss functions remain a convenient method to specify a game, and we will use them in examples. An important technical tool will also be a kind of canonical parameterization of $\Lambda$ given by the so-called proper loss functions. Loss functions are also unavoidable in Section 5, where we consider games with several simultaneous losses.

The game of prediction with expert advice is played by Learner, Experts, and Reality; the set ("pool") of Experts is denoted by $\Theta$. We will assume that $\Theta$ is (finite or) countable. There is no loss of generality in assuming that Reality and all Experts are cooperative, since we are only interested in what can be achieved by Learner alone; therefore, we essentially consider a two-player game. The game is played according to Protocol 1. The goal of Learner is to keep $L_n$ smaller, or at least not much greater, than $L_n^{\theta}$, at each step $n$ and for all $\theta \in \Theta$.

Protocol 1 Prediction with Expert Advice

$L_0 := 0$.
$L_0^{\theta} := 0$, for all $\theta \in \Theta$.
for $n = 1, 2, \dots$ do
  All Experts $\theta \in \Theta$ announce $\gamma_n^{\theta} \in \Lambda$.
  Learner announces $\gamma_n \in \Lambda$.
  Reality announces $\omega_n \in \Omega$.
  $L_n := L_{n-1} + \gamma_n(\omega_n)$.
  $L_n^{\theta} := L_{n-1}^{\theta} + \gamma_n^{\theta}(\omega_n)$, for all $\theta \in \Theta$.
end for
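To make the protocol concrete, here is a minimal Python sketch of Protocol 1 for the binary square-loss game. The two constant experts, Learner's naive averaging rule, and the random outcomes are our toy choices for illustration, not part of the paper's setting.

```python
import random

# Minimal sketch of Protocol 1 for the binary square-loss game (Omega = {0, 1}).
# A prediction is represented by a decision p in [0, 1]; its loss vector is
# (p - omega)^2 for omega in {0, 1}. The experts here are toy constant forecasters.

random.seed(0)
experts = {"optimist": 0.9, "pessimist": 0.1}    # hypothetical pool Theta

L = 0.0                                           # Learner's cumulative loss
L_theta = {theta: 0.0 for theta in experts}       # experts' cumulative losses

for n in range(1, 101):
    gamma_theta = dict(experts)                   # experts announce predictions
    gamma = sum(gamma_theta.values()) / len(gamma_theta)  # Learner: naive average
    omega = random.randint(0, 1)                  # Reality announces an outcome
    L += (gamma - omega) ** 2
    for theta, p in gamma_theta.items():
        L_theta[theta] += (p - omega) ** 2

print("Learner:", round(L, 2),
      "experts:", {t: round(v, 2) for t, v in L_theta.items()})
```

The algorithms discussed below replace the naive averaging rule by strategies with provable regret guarantees.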
To analyze the game, we need some additional notation. A point $g \in [0,\infty]^{\Omega}$ is called a superprediction in the game $(\Omega,\Lambda)$ if there is $\gamma \in \Lambda$ such that $\gamma(\omega) \le g(\omega)$ for all $\omega \in \Omega$. It is convenient to write the last condition as $\gamma \le g$. In the sequel, we will use pointwise relations and operations for the elements of $[0,\infty]^{\Omega}$ without special mention.

For a game $(\Omega,\Lambda)$, denote by $\Sigma_{\Lambda}$ the set of all superpredictions. Using operations on sets, this definition can be written as $\Sigma_{\Lambda} = \Lambda + [0,\infty]^{\Omega} = \{\gamma + g \mid \gamma \in \Lambda,\ g \in [0,\infty]^{\Omega}\}$.

The Aggregating Algorithm is a strategy for Learner. It has four parameters: reals $c \ge 1$ and $\eta > 0$, a distribution $P_0$ on $\Theta$ (that is, $P_0(\theta) \in [0,1]$ for every $\theta \in \Theta$ and $\sum_{\theta\in\Theta} P_0(\theta) = 1$), and a substitution function $\sigma : \Sigma_{\Lambda} \to \Lambda$ such that $\sigma(g) \le g$ for any $g \in \Sigma_{\Lambda}$. At step $N$, the AA computes $g_N \in [0,\infty]^{\Omega}$ by the formula

\[
g_N(\omega) = -\frac{c}{\eta}\ln\!\left(\sum_{\theta\in\Theta}\frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta}P_{N-1}(\theta')}\exp\bigl(-\eta\gamma^{\theta}_N(\omega)\bigr)\right),
\]

where

\[
P_{N-1}(\theta) = P_0(\theta)\prod_{n=1}^{N-1}\exp\bigl(-\eta\gamma^{\theta}_n(\omega_n)\bigr)
\]

is the (posterior) distribution on $\Theta$. Then $\gamma_N = \sigma(g_N)$ is announced as Learner's prediction.

The step $N$ of the AA can be performed if and only if $g_N$ is a superprediction ($g_N \in \Sigma_{\Lambda}$), that is, if

\[
\exists\gamma_N\in\Lambda\;\;\forall\omega\quad \gamma_N(\omega) \le -\frac{c}{\eta}\ln\!\left(\sum_{\theta\in\Theta}\frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta}P_{N-1}(\theta')}\exp\bigl(-\eta\gamma^{\theta}_N(\omega)\bigr)\right). \tag{1}
\]

We say that the AA is $(c,\eta)$-realizable (for the game $(\Omega,\Lambda)$) if condition (1) is true regardless of $\Theta$, $N$, $\gamma^{\theta}_N \in \Lambda$, and $P_{N-1}$ (that is, regardless of $P_0$, the history of the previous moves, and the opponents' moves at the last step). This requirement can be restated in several equivalent forms: for any finite set $G \subseteq \Lambda$ and for any distribution $\rho$ on $G$, it holds that

\[
\exists\gamma\in\Lambda\quad \gamma \le -\frac{c}{\eta}\ln\sum_{g\in G}\rho(g)\exp(-\eta g); \tag{2}
\]

or, equivalently, for any finite $G \subseteq \Sigma_{\Lambda}$ and any distribution $\rho$ on $G$, it holds that

\[
\exists\gamma\in\Sigma_{\Lambda}\quad \gamma \le -\frac{c}{\eta}\ln\sum_{g\in G}\rho(g)\exp(-\eta g); \tag{3}
\]

equivalently, in the last formula $\le$ can be replaced by $=$. Indeed, condition (1) implies (2) since $\gamma^{\theta}_N$ and $P_0$ are arbitrary; $G \subseteq \Lambda$ can be replaced by $G \subseteq \Sigma_{\Lambda}$ since the right-hand side of (2) increases when elements of $G$ increase; by definition, (2) means that its right-hand side belongs to $\Sigma_{\Lambda}$, and we get (3) with $=$ instead of $\le$. Clearly, (1) follows from (3) if we allow countably infinite $G$ as well (then we can take $\{\gamma^{\theta}_N \mid \theta\in\Theta\}$ for $G$), which is possible due to the following property of convex sets.

For a given $\eta$, the exp-convex hull of $\Sigma_{\Lambda}$ is the set $\Sigma^{\eta}_{\Lambda} \supseteq \Sigma_{\Lambda}$ that consists of all points in $[0,\infty]^{\Omega}$ of the form

\[
\log_{\mathrm{e}^{-\eta}}\sum_{g\in G}\rho(g)\,\mathrm{e}^{-\eta g} = -\frac{1}{\eta}\ln\sum_{g\in G}\rho(g)\exp(-\eta g), \tag{4}
\]

where $G$ is a finite subset of $\Sigma_{\Lambda}$ and $\rho$ is a distribution on $G$. Actually, $\exp(-\eta\Sigma^{\eta}_{\Lambda})$ is the convex hull of $\exp(-\eta\Sigma_{\Lambda})$. As known from convex analysis, we get the same definition if we allow infinite $G$ (see, e.g., [2, Theorem 2.4.1]). With this notation, condition (3) says that $\Sigma_{\Lambda} \supseteq c\,\Sigma^{\eta}_{\Lambda}$.

Let us state some properties of the set $\Sigma^{\eta}_{\Lambda}$. First, $\Sigma^{\eta}_{\Lambda} = \Sigma^{\eta}_{\Lambda} + [0,\infty]^{\Omega}$; that is, if $\Sigma^{\eta}_{\Lambda}$ is a prediction set then its superprediction set is $\Sigma^{\eta}_{\Lambda}$ itself. (Indeed, if a point $g_0$ of the form (4) belongs to $\Sigma^{\eta}_{\Lambda}$ as a combination of $g_i \in G \subseteq \Sigma_{\Lambda}$ then, for any $g \in [0,\infty]^{\Omega}$, the point $g_0 + g$ belongs to $\Sigma^{\eta}_{\Lambda}$ as the combination of $g_i + g$.) The set $\exp(-\eta\Sigma^{\eta}_{\Lambda})$ is convex (clearly, the points of the form (4) belong to $\Sigma^{\eta}_{\Lambda}$ also if we allow $G \subseteq \Sigma^{\eta}_{\Lambda}$). The convexity of the exponent implies that the set $\Sigma^{\eta}_{\Lambda}$ is convex as well (if $g_1, g_2 \in \Sigma^{\eta}_{\Lambda}$ then

\[
\alpha g_1 + (1-\alpha)g_2 \ge -\frac{1}{\eta}\ln\bigl(\alpha\exp(-\eta g_1) + (1-\alpha)\exp(-\eta g_2)\bigr)
\]

and hence $\alpha g_1 + (1-\alpha)g_2 \in \Sigma^{\eta}_{\Lambda}$ too).

The game $(\Omega,\Lambda)$ is called $\eta$-mixable if the AA is $(1,\eta)$-realizable, that is, if $\Sigma_{\Lambda} = \Sigma^{\eta}_{\Lambda}$. The game is mixable if it is $\eta$-mixable for some $\eta > 0$.
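As an illustration, the following sketch performs one step of the AA in the binary square-loss game with $c = 1$ and $\eta = 2$. The grid-search substitution function is our simplification of $\sigma$ (any $\gamma \le g_N$ will do), not the paper's construction.

```python
import math

def aa_step(weights, expert_preds, c=1.0, eta=2.0):
    """One Aggregating Algorithm step in the binary square-loss game.

    weights:      posterior weights P_{N-1}(theta), not necessarily normalized
                  (after the outcome omega is revealed, each weight should be
                  multiplied by exp(-eta * (p_theta - omega)**2))
    expert_preds: each expert's decision p_theta in [0, 1]
    Returns Learner's decision p and the exp-mixture g_N.
    """
    z = sum(weights)
    # g_N(omega) = -(c/eta) ln sum_theta (P_{N-1}(theta)/Z) exp(-eta loss_theta(omega))
    g = [-(c / eta) * math.log(sum(w / z * math.exp(-eta * (p - omega) ** 2)
                                   for w, p in zip(weights, expert_preds)))
         for omega in (0, 1)]
    # Substitution: any decision p whose loss vector minorizes g_N.  For
    # eta in (0, 2] the game is eta-mixable, so with c = 1 such p exists;
    # the grid search below simply returns the smallest admissible p.
    for i in range(1001):
        p = i / 1000
        if p * p <= g[0] + 1e-12 and (p - 1) ** 2 <= g[1] + 1e-12:
            return p, g
    raise ValueError("g_N is not a superprediction: AA not (c, eta)-realizable")

p, g = aa_step([0.5, 0.5], [0.2, 0.8])
print(p, g)   # p is roughly 0.5; its loss vector minorizes g_N
```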
The game is mixable if it is η -mixable for so me η > 0. The mixa ble games are of sp e c ial interest. In a sense, the AA works with mixa ble ga mes only , and to any non-mixable game (Ω , Λ ) the AA assigns the η -mixable g ame (Ω , Σ η Λ ) and then simply transfers the loss b ound (at the price of a constant factor). Standard examples of mixable g a mes are the square loss ga me [25, Example 4], whic h is η -mixa ble for η ∈ (0 , 2], and the logar ithmic los s game [25, Example 5 ], which is η -mixable for η ∈ (0 , 1]; see Subsection 3.2 . A standar d example of a non-mixable game is the a bsolute los s game [25, Ex ample 3] with the loss function λ ( p, ω ) = | p − ω | , p ∈ [0 , 1], ω ∈ { 0 , 1 } (its predictio n set Λ is { ( x, y ) ∈ [0 , 1 ] 2 | x + y = 1 } ); for the absolute loss ga me, the AA is ( c, η )- realizable for η > 0 a nd c ≥ η/ (2 ln(2 / (1 + e − η ))). A detailed surv ey of the AA, its pro perties, attainable b ounds and rea liz abil- it y conditions for a num b er o f games can b e found in [25]. Here we repro duce the pro of o f the main lo ss b ound in the form that motiv ates our further s tudy . Theorem 1 ([2 4 ]) . If the AA is ( c, η ) -r e alizable t hen the A A with p ar ameters c , η , P 0 , and σ guar ante es t ha t, at e ach step N and for al l exp erts θ , it holds L N ≤ c L θ N + c η ln 1 P 0 ( θ ) . 5 Pr o of. W e need to deduce the p erformance b ound fr o m the condition (1). T o this end, we will rewrite (1) and get a se mi- in v ar ian t of the AA—a v alue that do es not g ro w. Indeed, the inequality (1) is equiv a len t to X θ ∈ Θ P N − 1 ( θ ) ≥ X θ ∈ Θ P N − 1 ( θ ) exp( − η γ θ N ( ω )) exp η c γ N ( ω ) . Multiplying both sides b y Q N − 1 n =1 exp η c γ n ( ω n ) (whic h is indep enden t o f θ a nd hence can b e pla ced under the sum), a nd expanding P N − 1 , we get X θ ∈ Θ P 0 ( θ ) N − 1 Y n =1 exp( − η γ θ n ( ω n )) N − 1 Y n =1 exp η c γ n ( ω n ) ≥ X θ ∈ Θ P 0 ( θ ) N − 1 Y n =1 exp( − η γ θ n ( ω n )) N − 1 Y n =1 exp η c γ n ( ω n ) × ex p( − ηγ θ N ( ω )) exp η c γ N ( ω ) , that is, X θ ∈ Θ P 0 ( θ ) Q N − 1 ( θ ) ≥ X θ ∈ Θ P 0 ( θ ) Q N − 1 ( θ ) exp η γ N ( ω ) c − γ θ N ( ω ) where Q N − 1 is defined by the formula: Q N − 1 ( θ ) = exp η N − 1 X n =1 γ n ( ω n ) c − γ θ n ( ω n ) ! . That is, the condition (1) is equiv ale nt to ∃ γ N ∈ Λ ∀ ω X θ ∈ Θ P 0 ( θ ) ˜ Q N ( θ ) ≤ X θ ∈ Θ P 0 ( θ ) Q N − 1 ( θ ) , (5) where ˜ Q N is the result of substituting ω for ω N in Q N . In o ther words, the AA (when it is ( c, η )-realizable) guar an tees that a f- ter each step n the v alue P θ ∈ Θ P 0 ( θ ) Q n ( θ ) do es not increase wha tev er ω n is chosen by Reality . Since P θ ∈ Θ P 0 ( θ ) Q 0 ( θ ) = P θ ∈ Θ P 0 ( θ ) = 1, we get P θ ∈ Θ P 0 ( θ ) Q N ( θ ) ≤ 1 and Q N ( θ ) ≤ 1 / P 0 ( θ ) fo r ea c h step N . T o complete the pro of it r emains to note that Q N ( θ ) = exp η L N c − L θ N . F or c = 1, the v alue 1 η ln ( P θ P 0 ( θ ) Q N ( θ )) is known as the exp onen tial po ten tial (see [6, Sections 3.3, 3.5]) and plays an imp ortant role in the analysis of weighted average algo rithms. In the next section we show tha t the r eason why condition (5) can b e satisfied is essentially that the function P θ P 0 ( θ ) Q N ( θ ) is a super martingale. 6 3 Sup ermartingales and the AA Let P (Ω) b e the set of a ll distributions on Ω. No te that since Ω is finite w e can identify P (Ω) with a ( | Ω | − 1)-dimensio nal simplex in Euclidea n space R | Ω | equipp e d with the standar d distance and top ology . 
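The semi-invariant $\sum_{\theta}P_0(\theta)Q_n(\theta)$ from the proof can be watched numerically. The sketch below runs the AA in the $\eta$-mixable logarithmic loss game with $c = \eta = 1$, where the exp-mixture $g_N$ is itself a prediction and the AA reduces to the Bayes mixture; the toy experts and the random data are our choices.

```python
import math, random

# Track the semi-invariant sum_theta P0(theta) * Q_n(theta) for the AA in the
# logarithmic loss game with c = eta = 1 (toy constant experts, random data).
random.seed(1)
experts = [0.3, 0.5, 0.7]
P0 = [1.0 / len(experts)] * len(experts)
weights = P0[:]                 # P_{N-1}(theta), unnormalized
Q = [1.0] * len(experts)        # Q_n(theta) = exp(eta * (L_n / c - L_n^theta))

def log_loss(p, omega):
    return -math.log(p if omega == 1 else 1.0 - p)

invariant = sum(w0 * q for w0, q in zip(P0, Q))   # equals 1 initially
for n in range(200):
    z = sum(weights)
    # For eta = 1 the exp-mixture g_N is the loss vector of the posterior-
    # weighted average of the experts' probabilities, so the AA predicts it.
    p = sum(w / z * e for w, e in zip(weights, experts))
    omega = random.randint(0, 1)
    for i, e in enumerate(experts):
        Q[i] *= math.exp(log_loss(p, omega) - log_loss(e, omega))
        weights[i] *= math.exp(-log_loss(e, omega))
    new_invariant = sum(w0 * q for w0, q in zip(P0, Q))
    assert new_invariant <= invariant + 1e-9   # condition (5): never increases
    invariant = new_invariant

print("final value of the semi-invariant:", round(invariant, 6))
```

In this particular game the mixture achieves (5) with equality, so the semi-invariant stays at 1 up to rounding; in general it may decrease.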
3 Supermartingales and the AA

Let $\mathcal{P}(\Omega)$ be the set of all distributions on $\Omega$. Note that since $\Omega$ is finite we can identify $\mathcal{P}(\Omega)$ with a $(|\Omega|-1)$-dimensional simplex in Euclidean space $\mathbb{R}^{|\Omega|}$ equipped with the standard distance and topology.

Let $E$ be any non-empty set. A real-valued function $S$ defined on $(E\times\mathcal{P}(\Omega)\times\Omega)^*$ is called a (game-theoretic) supermartingale if for any $N$, for any $e_1,\dots,e_N\in E$, for any $\pi_1,\dots,\pi_N\in\mathcal{P}(\Omega)$, and for any $\omega_1,\dots,\omega_{N-1}\in\Omega$, it holds that

\[
\sum_{\omega\in\Omega}\pi_N(\omega)\,S(e_1,\pi_1,\omega_1,\dots,e_{N-1},\pi_{N-1},\omega_{N-1},e_N,\pi_N,\omega) \le S(e_1,\pi_1,\omega_1,\dots,e_{N-1},\pi_{N-1},\omega_{N-1}). \tag{6}
\]

For $N=1$, the argument of $S$ on the right-hand side is the empty sequence, and we treat $S()$ as a real constant.

The intuition behind the definition is the following: there is a sequence of events $\omega_n$, each event generated according to its own distribution $\pi_n$ selected (or revealed) at each step anew; when the event happens, we compute the next value of $S$ depending on the outcomes of the previous events, the previous distributions, and some side information $e_n$; the supermartingale property of $S$ means that the expectation of the next value (when the distribution $\pi_n$ has been selected but the outcome is not known yet) never exceeds the previous value of $S$.

Remark 2. The notion of a supermartingale is well known in probability theory. Let $X_1,X_2,\dots$ be a sequence of random elements with values in $\Omega$. Denote by $x_n$ some realization of $X_n$, $n=1,2,\dots$, and let $\pi_n$ be a conditional distribution of $X_n$ given $X_1=x_1,\dots,X_{n-1}=x_{n-1}$. If we fix some values for $e_n$ and substitute $X_n$ for $\omega_n$ in $S$, we can rewrite condition (6) as

\[
\mathbb{E}\,S(x_1,\dots,x_{N-1},X_N) \le S(x_1,\dots,x_{N-1})
\]

(the parameters $e_n$ and $\pi_n$ in $S$ are omitted). We get the usual definition of a (probabilistic) supermartingale $S_N = S(X_1,\dots,X_N)$, $N=1,2,\dots$, with respect to the sequence $X_1,X_2,\dots$: $\mathbb{E}[S_N \mid X_1,\dots,X_{N-1}] \le S_{N-1}$. In a sense, a game-theoretic supermartingale is a family of probabilistic supermartingales parameterized by some $e_n$ and also by probability distributions $\pi_n$, where the latter serve as conditional probabilities of the underlying random process.

Remark 3. A reader familiar with supermartingales in algorithmic probability theory may also find the following connection helpful. Let $\mu:\Omega^*\to[0,1]$ be a measure on $\Omega^\infty$ (where $\Omega^*$ and $\Omega^\infty$ are the sets of finite and infinite sequences of elements from $\Omega$). As defined in, e.g., [17, p. 296], a function $s:\Omega^*\to\mathbb{R}_+$ is called a supermartingale with respect to $\mu$ if for any $N$ and any $\omega_1,\dots,\omega_{N-1}\in\Omega$ it holds that

\[
\sum_{\omega\in\Omega}\mu(\omega\mid\omega_1,\dots,\omega_{N-1})\,s(\omega_1,\dots,\omega_{N-1},\omega) \le s(\omega_1,\dots,\omega_{N-1}),
\]

where $\mu(\omega\mid\omega_1,\dots,\omega_{N-1}) = \mu(\omega_1,\dots,\omega_{N-1},\omega)/\mu(\omega_1,\dots,\omega_{N-1})$ (and $\mu(\omega_1,\dots,\omega_n)$ means the measure of the set of all infinite sequences with the prefix $\omega_1\dots\omega_n$). Let $e_n$ be any functions of $\omega_1,\dots,\omega_{n-1}$, and let $\pi_n(\omega)$ be $\mu(\omega\mid\omega_1,\dots,\omega_{n-1})$. Having substituted these functions in any game-theoretic supermartingale $S$, we get a supermartingale with respect to $\mu$ in the algorithmic sense.

A supermartingale $S$ is called forecast-continuous if for any $N$, for any $e_1,\dots,e_N\in E$, for any $\pi_1,\dots,\pi_{N-1}\in\mathcal{P}(\Omega)$, and for any $\omega_1,\dots,\omega_{N-1},\omega_N\in\Omega$, the function $S(e_1,\pi_1,\omega_1,\dots,e_N,\pi,\omega_N)$ is continuous as a function of $\pi\in\mathcal{P}(\Omega)$.
The main use of forecast-continuous supermartingales in this paper is explained by the following lemma.

Lemma 4. Suppose that $S$ is a forecast-continuous supermartingale. Then for any $N$, for any $e_1,\dots,e_N\in E$, for any $\pi_1,\dots,\pi_{N-1}\in\mathcal{P}(\Omega)$, and for any $\omega_1,\dots,\omega_{N-1}\in\Omega$, it holds that

\[
\exists\pi\in\mathcal{P}(\Omega)\;\;\forall\omega\in\Omega\quad S(e_1,\pi_1,\omega_1,\dots,e_N,\pi,\omega) \le S(e_1,\pi_1,\omega_1,\dots,e_{N-1},\pi_{N-1},\omega_{N-1}).
\]

Note that the property provided by this lemma is similar to condition (5), where the role of $S$ with the first $N-1$ triples of arguments is played by $\sum_{\theta\in\Theta}P_0(\theta)Q_{N-1}(\theta)$, the role of $S(\dots,e_N,\pi,\omega)$ (the left-hand side) is played by $\sum_{\theta\in\Theta}P_0(\theta)\tilde{Q}_N(\theta)$, the variable $\pi$ corresponds to $\gamma_N$, and, for $n=1,\dots,N-1$, the parameters $\pi_n$ and $e_n$ are represented by $\gamma_n$ and the vector of $\gamma^{\theta}_n$, $\theta\in\Theta$, respectively.

A variant of this lemma was originally proved by Levin [16] in the context of the algorithmic theory of randomness. We will prove this lemma later (see Lemma 8); in the next subsection we consider the Defensive Forecasting Algorithm, the main application of this lemma in our paper.

3.1 Defensive Forecasting

The Defensive Forecasting Algorithm (DFA) is another strategy for Learner in the game of prediction with expert advice. Let $(\Omega,\Lambda)$ be a game. The DFA has five parameters: reals $c\ge 1$ and $\eta>0$, a (canonical) loss function $\lambda:\mathcal{P}(\Omega)\to\Sigma_{\Lambda}$, a distribution $P_0$ on $\Theta$, and a substitution function $\sigma:\Sigma_{\Lambda}\to\Lambda$ such that $\sigma(\gamma)\le\gamma$ for all $\gamma\in\Sigma_{\Lambda}$.

Given $\lambda$, $c$, and $\eta$, let us define the following function on $(\Sigma_{\Lambda}\times\mathcal{P}(\Omega)\times\Omega)^*$:

\[
Q(g_1,\pi_1,\omega_1,\dots,g_N,\pi_N,\omega_N) = \exp\Biggl(\eta\sum_{n=1}^{N}\Bigl(\frac{\lambda(\pi_n,\omega_n)}{c}-g_n(\omega_n)\Bigr)\Biggr). \tag{7}
\]

To simplify notation, here and in the sequel we consider $\lambda$ as a function from $\mathcal{P}(\Omega)\times\Omega$ to $[0,\infty]$; that is, we write $\lambda(\pi,\omega)$ instead of $\lambda(\pi)(\omega)$ and $\lambda(\pi,\cdot)$ instead of $\lambda(\pi)$. For $N=0$, we let $Q()=1$, in accordance with the usual convention that a sum of zero terms equals 0. Note that $Q$ is similar to $Q_N(\theta)$ from the proof of Theorem 1, with $g_n$ standing for $\gamma^{\theta}_n$ and $\lambda(\pi_n,\cdot)$ standing for $\gamma_n$.

Given also $P_0$, let us define the function $Q_{P_0}$ on $((\Sigma_{\Lambda})^{\Theta}\times\mathcal{P}(\Omega)\times\Omega)^*$ as the following weighted sum of $Q$:

\[
Q_{P_0}(\{\gamma^{\theta}_1\}_{\theta\in\Theta},\pi_1,\omega_1,\dots,\{\gamma^{\theta}_N\}_{\theta\in\Theta},\pi_N,\omega_N) = \sum_{\theta\in\Theta}P_0(\theta)\,Q(\gamma^{\theta}_1,\pi_1,\omega_1,\dots,\gamma^{\theta}_N,\pi_N,\omega_N). \tag{8}
\]

At step $N$, the DFA chooses any $\pi_N\in\mathcal{P}(\Omega)$ such that

\[
\forall\omega\in\Omega\quad Q_{P_0}(\{\gamma^{\theta}_1\}_{\theta\in\Theta},\pi_1,\omega_1,\dots,\{\gamma^{\theta}_N\}_{\theta\in\Theta},\pi_N,\omega) \le Q_{P_0}(\{\gamma^{\theta}_1\}_{\theta\in\Theta},\pi_1,\omega_1,\dots,\{\gamma^{\theta}_{N-1}\}_{\theta\in\Theta},\pi_{N-1},\omega_{N-1}), \tag{9}
\]

stores this $\pi_N$ for use at later steps, and announces $\gamma_N=\sigma(\lambda(\pi_N,\cdot))$ as Learner's prediction.

Assume that the function $Q$ defined by (7) is a forecast-continuous supermartingale. Clearly, this implies that $Q_{P_0}$ defined by (8) is also a forecast-continuous supermartingale for any $P_0$. Then Lemma 4 guarantees that the DFA can choose $\pi_N$ with the required property.
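In outline, one DFA step can be coded as follows. The grid search standing in for the choice guaranteed by Lemma 4, and the toy log-loss instantiation, are our simplifications; a bisection-based search for binary games is sketched later, in Subsection 3.7.

```python
import math

def dfa_step(W, expert_preds, lam, c=1.0, eta=1.0, grid=1001):
    """One DFA step for a binary game (Omega = {0, 1}), by grid search.

    W:            dict theta -> P0(theta) * Q(gamma^theta_1, pi_1, omega_1, ...),
                  i.e. the running terms of the supermartingale Q_P0 in (8)
    expert_preds: dict theta -> expert's loss vector (g(0), g(1)) at this step
    lam:          canonical loss function: lam(pi) = (loss if 0, loss if 1)
    Returns pi_N satisfying condition (9), up to grid resolution.
    """
    prev = sum(W.values())
    for i in range(grid):
        pi = i / (grid - 1)
        ok = all(
            sum(w * math.exp(eta * (lam(pi)[om] / c - expert_preds[t][om]))
                for t, w in W.items()) <= prev * (1 + 1e-9)
            for om in (0, 1))
        if ok:
            return pi
    raise ValueError("Q defined by (7) is not a supermartingale here")

def update(W, pi_N, omega_N, expert_preds, lam, c=1.0, eta=1.0):
    """Multiply each term of Q_P0 by its factor for the realized outcome."""
    for t in W:
        W[t] *= math.exp(eta * (lam(pi_N)[omega_N] / c
                                - expert_preds[t][omega_N]))

# Example with the log-loss game and its proper loss (c = eta = 1):
log_lam = lambda pi: (-math.log(1 - pi + 1e-12), -math.log(pi + 1e-12))
W = {"a": 0.5, "b": 0.5}
preds = {"a": log_lam(0.3), "b": log_lam(0.8)}
pi = dfa_step(W, preds, log_lam)
update(W, pi, 1, preds, log_lam)
print(pi, round(sum(W.values()), 6))   # Q_P0 stays at most 1
```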
Theorem 5. If $Q$ defined by (7) is a forecast-continuous supermartingale for certain $c$, $\eta$, and $\lambda$, then the DFA with parameters $c$, $\eta$, $\lambda$, $P_0$, and $\sigma$ guarantees that, at each step $N$ and for all experts $\theta$, it holds that

\[
L_N \le c\,L^{\theta}_N + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)}.
\]

Proof. The step of the DFA guarantees that at each step $N$ the value of $Q_{P_0}$ does not increase, regardless of the outcome $\omega_N$. Thus, the value of $Q_{P_0}$ at each step $N$ is not greater than its initial value, 1. Since $Q$ is always non-negative and $Q_{P_0}$, as a sum of non-negative values, can be bounded from below by any of its terms, we get

\[
P_0(\theta)\exp\Biggl(\eta\sum_{n=1}^{N}\Bigl(\frac{\lambda(\pi_n,\omega_n)}{c}-\gamma^{\theta}_n(\omega_n)\Bigr)\Biggr) \le 1,
\]

and therefore

\[
\sum_{n=1}^{N}\lambda(\pi_n,\omega_n) \le c\,L^{\theta}_N + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)}.
\]

It remains to recall that $\gamma_n=\sigma(\lambda(\pi_n,\cdot))\le\lambda(\pi_n,\cdot)$; thus, summing up, we get $L_N\le\sum_{n=1}^{N}\lambda(\pi_n,\omega_n)$.

In Subsections 3.3–3.6 we discuss general conditions under which $Q$ defined by (7) is a supermartingale. In the next subsection we begin with examples for two widely used games of prediction.

3.2 Two Examples of Supermartingales

The logarithmic loss game is defined by the loss function

\[
\lambda_{\log}(p,\omega) := \begin{cases} -\ln p & \text{if } \omega = 1,\\ -\ln(1-p) & \text{if } \omega = 0, \end{cases}
\]

where $\omega\in\{0,1\}$ is the outcome and $p\in[0,1]$ is the decision (notice that the loss function is allowed to take the value $\infty$). It is known [25, Example 5] that this game is $\eta$-mixable for $\eta\in(0,1]$. The corresponding prediction set is $\Lambda_{\log}=\{(x,y)\in\mathbb{R}^2\mid \mathrm{e}^{-x}+\mathrm{e}^{-y}=1\}$. The losses in the game are $L_N:=\sum_{n=1}^{N}\lambda_{\log}(p_n,\omega_n)$ for Learner, who predicts $p_n$, and $L^{\theta}_N:=\sum_{n=1}^{N}\lambda_{\log}(p^{\theta}_n,\omega_n)$ for Expert $\theta$, who predicts $p^{\theta}_n$.

Consider the following function:

\[
\exp\Biggl(\eta\sum_{n=1}^{N}\bigl(\lambda_{\log}(p_n,\omega_n)-\lambda_{\log}(p^{\theta}_n,\omega_n)\bigr)\Biggr). \tag{10}
\]

This function is actually $Q$ defined by (7), where $c=1$ and $\lambda_{\log}(p^{\theta}_n,\cdot)$ stands for $g_n$. The only difference is that $p_n$ is not an element of $\mathcal{P}(\Omega)$. To fix this, let us assign Learner's decision $p\in[0,1]$ (and thereby the prediction $(-\ln(1-p),-\ln p)\in\Lambda_{\log}$) to each distribution $\pi=(1-p,p)$ on $\{0,1\}$. With this identification $\pi\mapsto p$, the expression (10) specifies a function on $([0,1]\times\mathcal{P}(\{0,1\})\times\{0,1\})^*$ with the arguments $p^{\theta}_n$, $\pi_n$ (represented by $p_n=\pi_n(1)$), and $\omega_n$.

Lemma 6. For $\eta\in(0,1]$, the function (10) is a forecast-continuous supermartingale.

Proof. The continuity is obvious. For the supermartingale property, it suffices to check that

\[
p_n\,\mathrm{e}^{\eta(-\ln p_n+\ln p^{\theta}_n)} + (1-p_n)\,\mathrm{e}^{\eta(-\ln(1-p_n)+\ln(1-p^{\theta}_n))} \le 1, \tag{11}
\]

i.e., that $p_n^{1-\eta}\bigl(p^{\theta}_n\bigr)^{\eta} + (1-p_n)^{1-\eta}\bigl(1-p^{\theta}_n\bigr)^{\eta} \le 1$ for all $p_n,p^{\theta}_n,\eta\in[0,1]$. The last inequality immediately follows from the generalized inequality between the arithmetic and geometric means: $u^{\alpha}v^{1-\alpha}\le\alpha u+(1-\alpha)v$ for any $u,v\ge 0$ and $\alpha\in[0,1]$, which after taking the logarithm just expresses the fact that the logarithm is concave. (Remark: the left-hand side of (11) is a special case of what is known as the Hellinger integral in probability theory.)
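Lemma 6 is easy to sanity-check numerically; the following sketch evaluates the left-hand side of (11) on a grid (the grid and tolerance are our choices).

```python
# Numeric sanity check of inequality (11): for eta in (0, 1],
#   p**(1-eta) * q**eta + (1-p)**(1-eta) * (1-q)**eta <= 1
# for all p, q in [0, 1] (p plays p_n, q plays p_n^theta).
worst = 0.0
for eta in (0.25, 0.5, 0.75, 1.0):
    for i in range(1, 100):
        for j in range(1, 100):
            p, q = i / 100, j / 100
            lhs = p ** (1 - eta) * q ** eta + (1 - p) ** (1 - eta) * (1 - q) ** eta
            worst = max(worst, lhs)
assert worst <= 1 + 1e-9
print(f"max over grid: {worst:.6f}")   # equals 1, attained at q = p
```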
In the square loss game, the outcomes are $\omega\in\{0,1\}$ and the decisions are $p\in[0,1]$ as before, and the loss function is $\lambda_{\mathrm{sq}}(p,\omega)=(p-\omega)^2$. It is known [25, Example 4] that this game is $\eta$-mixable for $\eta\in(0,2]$. The corresponding prediction set is $\Lambda_{\mathrm{sq}}=\{(x,y)\in[0,1]^2\mid\sqrt{x}+\sqrt{y}=1\}$. The losses of Learner and Expert $\theta$ are $L_N:=\sum_{n=1}^{N}(p_n-\omega_n)^2$ and $L^{\theta}_N:=\sum_{n=1}^{N}(p^{\theta}_n-\omega_n)^2$, respectively. With the same identification $\pi\mapsto p$, the following expression specifies a function on $([0,1]\times\mathcal{P}(\{0,1\})\times\{0,1\})^*$:

\[
\exp\Biggl(\eta\sum_{n=1}^{N}\bigl((p_n-\omega_n)^2-(p^{\theta}_n-\omega_n)^2\bigr)\Biggr) \tag{12}
\]

(again, note that it is a special case of $Q$ defined by (7)).

Lemma 7. For $\eta\in(0,2]$, the function (12) is a forecast-continuous supermartingale.

Proof. It is sufficient to check that

\[
p_n\,\mathrm{e}^{\eta((p_n-1)^2-(p^{\theta}_n-1)^2)} + (1-p_n)\,\mathrm{e}^{\eta((p_n-0)^2-(p^{\theta}_n-0)^2)} \le 1
\]

for all $p_n,p^{\theta}_n\in[0,1]$ and $\eta\in[0,2]$. To simplify notation, let us substitute $p$ for $p_n$ and $p+x$ for $p^{\theta}_n$. Then, after trivial transformations, we get

\[
p\,\mathrm{e}^{2\eta(1-p)x} + (1-p)\,\mathrm{e}^{-2\eta px} \le \mathrm{e}^{\eta x^2}, \quad \forall x\in[-p,1-p].
\]

The last inequality is a simple corollary of the following well-known variant of Hoeffding's inequality [15, 4.16]:

\[
\ln\mathbb{E}\,\mathrm{e}^{sX} \le s\,\mathbb{E}X + \frac{s^2(b-a)^2}{8},
\]

which is true for any random variable $X$ taking values in $[a,b]$ and for any $s\in\mathbb{R}$; see [6, Lemma A.1] for a proof. Indeed, applying the inequality to the random variable $X$ that is equal to 1 with probability $p$ and to 0 with probability $1-p$, we obtain $p\exp(s(1-p))+(1-p)\exp(-sp)\le\exp(s^2/8)$. Substituting $s:=2\eta x$, we have $p\exp(2\eta(1-p)x)+(1-p)\exp(-2\eta px)\le\exp(\eta^2x^2/2)\le\exp(\eta x^2)$, the last inequality assuming $\eta\le 2$.
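The same kind of numeric sanity check works for Lemma 7; the grid below is our choice.

```python
import math

# Numeric sanity check for Lemma 7: for eta in (0, 2],
#   p*exp(eta*((p-1)**2 - (q-1)**2)) + (1-p)*exp(eta*(p**2 - q**2)) <= 1
# for all p, q in [0, 1] (p plays p_n, q plays p_n^theta).
worst = 0.0
for eta in (0.5, 1.0, 1.5, 2.0):
    for i in range(101):
        for j in range(101):
            p, q = i / 100, j / 100
            lhs = (p * math.exp(eta * ((p - 1) ** 2 - (q - 1) ** 2))
                   + (1 - p) * math.exp(eta * (p ** 2 - q ** 2)))
            worst = max(worst, lhs)
assert worst <= 1 + 1e-9
print(f"max over grid: {worst:.6f}")   # equals 1, attained at q = p
```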
3.3 Supermartingales and the Realizability of the AA

Our next goal is to find out when $Q$ defined by (7) is a supermartingale, depending on the parameters $c$, $\eta$, and $\lambda$. Loosely speaking, we will show that the AA is $(c,\eta)$-realizable if and only if there exists $\lambda$ such that $Q$ is a supermartingale. More precisely, the "only if" part holds for some class of games only. For arbitrary games, the equivalence holds if we relax the supermartingale definition slightly (see Theorem 22).

Let us begin with some notation. For any functions $f:\Omega\to\mathbb{R}$ and $\pi:\Omega\to\mathbb{R}$, denote

\[
\mathbb{E}_\pi f := \sum_{\omega\in\Omega}\pi(\omega)f(\omega).
\]

Actually, this is the scalar product of $f$ and $\pi$ in $\mathbb{R}^{\Omega}$. We will mostly use this for $\pi\in\mathcal{P}(\Omega)$; in this case $\mathbb{E}_\pi f$ can be interpreted as the expectation of $f$ under the distribution $\pi$. For functions $g\in[0,\infty]^{\Omega}$ and $\pi\in\mathcal{P}(\Omega)$, let

\[
\mathbb{E}_\pi g := \sum_{\omega\in\Omega,\;\pi(\omega)\ne 0}\pi(\omega)g(\omega).
\]

Recall that the function $Q$ defined by (7) is a supermartingale if

\[
\mathbb{E}_\pi Q(g_1,\pi_1,\omega_1,\dots,g_N,\pi,\cdot) - Q(g_1,\pi_1,\omega_1,\dots,g_{N-1},\pi_{N-1},\omega_{N-1}) \le 0
\]

for any $g_1,\pi_1,\omega_1,\dots,g_{N-1},\pi_{N-1},\omega_{N-1},g_N$, and $\pi$. Formula (7) can be rewritten as $Q=\prod_{n=1}^{N}q_{g_n}(\pi_n,\omega_n)$, where the functions $q_g:\mathcal{P}(\Omega)\times\Omega\to[0,\infty]$ are defined by the formula

\[
q_g(\pi,\omega) = \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c}-g(\omega)\Bigr)\Bigr) \tag{13}
\]

for any $g\in\Sigma_{\Lambda}$. Clearly, $Q$ is a supermartingale if and only if $\mathbb{E}_\pi q_g(\pi,\cdot)\le 1$ for all $\pi\in\mathcal{P}(\Omega)$ and all $g\in\Sigma_{\Lambda}$.

Let us say that a function $q:\mathcal{P}(\Omega)\times\Omega\to\mathbb{R}$ has the supermartingale property if for any $\pi\in\mathcal{P}(\Omega)$

\[
\mathbb{E}_\pi q(\pi,\cdot) \le 1.
\]

The function $q$ is forecast-continuous if for every $\omega\in\Omega$ it is continuous as a function of $\pi$. So, $Q$ defined by (7) is a forecast-continuous supermartingale if and only if the functions $q_g$ defined by (13) are forecast-continuous and have the supermartingale property for all $g\in\Sigma_{\Lambda}$. In the sequel, we will discuss the properties of $q_g$ instead of $Q$.

Let us begin with a variant of Lemma 4.

Lemma 8. Let a function $q:\mathcal{P}(\Omega)\times\Omega\to\mathbb{R}$ be forecast-continuous. If for all $\pi\in\mathcal{P}(\Omega)$ it holds that $\mathbb{E}_\pi q(\pi,\cdot)\le C$, where $C\in\mathbb{R}$ is some constant, then

\[
\exists\pi\in\mathcal{P}(\Omega)\;\;\forall\omega\in\Omega\quad q(\pi,\omega)\le C.
\]

The proof of the lemma is given in the Appendix. Here let us illustrate the idea behind the proof. Consider the function $\phi(\pi',\pi)=\mathbb{E}_{\pi'}q(\pi,\cdot)$ and assume that it has the minimax property: $\min_\pi\max_{\pi'}\phi(\pi',\pi)=\max_{\pi'}\min_\pi\phi(\pi',\pi)$. Looking at the right-hand side, note that $\min_\pi\phi(\pi',\pi)\le\phi(\pi',\pi')\le C$. Let $\pi$ minimize the left-hand side; then we get $\max_{\pi'}\mathbb{E}_{\pi'}q(\pi,\cdot)\le C$, that is, $\mathbb{E}_{\pi'}q(\pi,\cdot)\le C$ for any $\pi'$, which implies the statement of the lemma if we consider the distributions $\pi'$ concentrated at each $\omega$.

Note that Lemma 4 is a simple corollary of Lemma 8 applied to $C=0$ and

\[
q(\pi,\omega) = S(e_1,\pi_1,\omega_1,\dots,e_N,\pi,\omega) - S(e_1,\pi_1,\omega_1,\dots,e_{N-1},\pi_{N-1},\omega_{N-1}).
\]

Now let us prove that if the $q_g$ defined by (13) have the supermartingale property for all $g\in\Sigma_{\Lambda}$ (in other words, $Q$ is a supermartingale), then the AA is realizable.

Theorem 9. Let $\lambda$ map $\mathcal{P}(\Omega)$ to $\Sigma_{\Lambda}$, and let $c\ge 1$ and $\eta>0$ be reals such that

\[
q_g(\pi,\omega) := \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c}-g(\omega)\Bigr)\Bigr)
\]

are forecast-continuous and have the supermartingale property for all $g\in\Sigma_{\Lambda}$. Then the AA is $(c,\eta)$-realizable.

Proof. Recall that $(c,\eta)$-realizability is equivalent to inequality (3) for any finite $G\subseteq\Sigma_{\Lambda}$ and for any distribution $\rho$ on $G$. Let us consider the following function:

\[
q(\pi,\omega) = \sum_{g\in G}\rho(g)\,q_g(\pi,\omega).
\]

The function $q$ is forecast-continuous and has the supermartingale property as a non-negative weighted sum of forecast-continuous functions with the supermartingale property. By Lemma 8 applied to this $q$ and $C=1$, there exists $\pi\in\mathcal{P}(\Omega)$ such that $q(\pi,\omega)\le 1$ for all $\omega$, that is,

\[
\sum_{g\in G}\rho(g)\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c}-g(\omega)\Bigr)\Bigr) \le 1.
\]

After trivial transformations, we get inequality (3) with $\gamma(\omega)$ replaced by $\lambda(\pi,\omega)$. It remains to note that $\lambda(\pi,\cdot)\in\Sigma_{\Lambda}$.

3.4 Proper Loss Functions

The functions $q_g$ defined by (13) have a loss function $\lambda$ as a parameter. In this subsection, we consider an important property of this loss function. A function $\lambda:\mathcal{P}(\Omega)\times\Omega\to[0,\infty]$ is called a proper loss function if for all $\pi,\pi'\in\mathcal{P}(\Omega)$

\[
\mathbb{E}_\pi\lambda(\pi,\cdot) \le \mathbb{E}_\pi\lambda(\pi',\cdot),
\]

and $\lambda$ is strictly proper if for all $\pi\ne\pi'$ the inequality is strict. The intuition behind this definition is the following. Assume that the outcome $\omega$ is generated according to some distribution $\pi$. Then the expected loss $\mathbb{E}_\pi\lambda(\pi',\cdot)$ is minimal if the prediction $\pi'$ equals the true distribution. Informally speaking, proper loss functions encourage a forecaster to announce his true subjective probabilities. In a sense, if the loss function is proper then the predictions have a real, not just notational, probabilistic meaning. Proper loss functions are well known in the Bayesian context; see [7] and [12] (note that these authors consider gains, or scores, instead of losses, so their notation differs from ours by the sign).

We say that $\lambda$ is proper with respect to a set $X\subseteq[0,\infty]^{\Omega}$ if for all $\pi\in\mathcal{P}(\Omega)$ it holds that $\lambda(\pi,\cdot)\in X$ and for all $g\in X$ it holds that $\mathbb{E}_\pi\lambda(\pi,\cdot)\le\mathbb{E}_\pi g$ (in other words, $\lambda(\pi,\cdot)\in\arg\min_{g\in X}\mathbb{E}_\pi g$). If the inequality holds for a fixed $\pi$ and all $g\in X$, we will say that $\lambda$ is proper at $\pi$. Clearly, if $\lambda$ is proper with respect to $X$ then $\lambda$ is proper in the usual sense.

The definition has a simple geometrical interpretation.
The inequality means that the set $X$ lies on one side of the hyperplane $\{x\in\mathbb{R}^{\Omega}\mid\sum_{\omega\in\Omega}\pi(\omega)x(\omega)=\mathbb{E}_\pi\lambda(\pi,\cdot)\}$, and $X$ touches the hyperplane at $\lambda(\pi,\cdot)\in X$. That is, $\lambda(\pi,\cdot)$ is a point where $X$ touches the supporting hyperplane with normal $\pi$.

Lemma 10. Let $\lambda$ map $\mathcal{P}(\Omega)$ to $\Sigma_{\Lambda}$ and $\eta>0$ be such that the functions $q_g(\pi,\omega):=\mathrm{e}^{\eta(\lambda(\pi,\omega)-g(\omega))}$ are forecast-continuous and have the supermartingale property for all $g\in\Sigma_{\Lambda}$ (the functions $q_g$ are just (13) with $c=1$). Then $\lambda$ is a continuous proper loss function with respect to $\Sigma_{\Lambda}$.

Proof. The continuity is obvious. Since $\mathrm{e}^x\ge 1+x$ for all $x\in\mathbb{R}$, we get

\[
\mathbb{E}_\pi\mathrm{e}^{\eta(\lambda(\pi,\cdot)-g)} \ge \mathbb{E}_\pi\bigl(1+\eta(\lambda(\pi,\cdot)-g)\bigr) = 1+\eta\bigl(\mathbb{E}_\pi\lambda(\pi,\cdot)-\mathbb{E}_\pi g\bigr),
\]

and from the supermartingale property we have $\mathbb{E}_\pi\lambda(\pi,\cdot)\le\mathbb{E}_\pi g$ for all $g\in\Sigma_{\Lambda}$ and all $\pi\in\mathcal{P}(\Omega)$, since $\eta>0$. (Remark: we get the strict inequality $\mathbb{E}_\pi\lambda(\pi,\cdot)<\mathbb{E}_\pi g$ if $\lambda(\pi,\omega_0)\ne g(\omega_0)$ and $\pi(\omega_0)\ne 0$ for some $\omega_0\in\Omega$.)

From Theorem 9 we know that the conditions of the last lemma also imply that the game $(\Omega,\Lambda)$ is $\eta$-mixable. Let us show that the converse statement holds, i.e., that the properness of $\lambda$ and mixability are sufficient for the supermartingale property.

Lemma 11. Suppose that the game $(\Omega,\Lambda)$ is $\eta$-mixable and $\lambda:\mathcal{P}(\Omega)\to\Sigma_{\Lambda}$ is a proper loss function with respect to $\Sigma_{\Lambda}$. Then the functions $q_g(\pi,\omega)=\mathrm{e}^{\eta(\lambda(\pi,\omega)-g(\omega))}$ have the supermartingale property for every $g\in\Sigma_{\Lambda}$. If $\lambda$ is continuous then the $q_g$ are forecast-continuous.

Proof. The forecast-continuity is obvious. Assume that the supermartingale property does not hold, in other words, that $\mathbb{E}_\pi\mathrm{e}^{\eta(\lambda(\pi,\cdot)-g)}=1+\delta$ for some $\pi\in\mathcal{P}(\Omega)$, $g\in\Sigma_{\Lambda}$, and $\delta>0$. For any $\epsilon>0$ consider the point

\[
g_\epsilon = -\frac{1}{\eta}\ln\bigl((1-\epsilon)\,\mathrm{e}^{-\eta\lambda(\pi,\cdot)}+\epsilon\,\mathrm{e}^{-\eta g}\bigr).
\]

The point $g_\epsilon$ belongs to $\Sigma^{\eta}_{\Lambda}$ by the definition of $\Sigma^{\eta}_{\Lambda}$, and $\Sigma^{\eta}_{\Lambda}=\Sigma_{\Lambda}$ since the game is $\eta$-mixable; that is, $g_\epsilon\in\Sigma_{\Lambda}$ for any $\epsilon>0$. When $\epsilon\to 0$, we have

\[
g_\epsilon = \lambda(\pi,\cdot)-\frac{1}{\eta}\ln\Bigl(1+\epsilon\bigl(\mathrm{e}^{\eta(\lambda(\pi,\cdot)-g)}-1\bigr)\Bigr) = \lambda(\pi,\cdot)-\frac{\epsilon}{\eta}\bigl(\mathrm{e}^{\eta(\lambda(\pi,\cdot)-g)}-1\bigr)+O(\epsilon^2).
\]

Taking the expectation $\mathbb{E}_\pi$, we get

\[
\mathbb{E}_\pi g_\epsilon = \mathbb{E}_\pi\lambda(\pi,\cdot)-\frac{\epsilon}{\eta}\Bigl(\mathbb{E}_\pi\mathrm{e}^{\eta(\lambda(\pi,\cdot)-g)}-1\Bigr)+O(\epsilon^2) = \mathbb{E}_\pi\lambda(\pi,\cdot)-\frac{\epsilon\delta}{\eta}+O(\epsilon^2),
\]

where $\delta>0$ by our assumption. If $\epsilon$ is sufficiently small then $(\delta/\eta)\epsilon>O(\epsilon^2)$ and $\mathbb{E}_\pi g_\epsilon<\mathbb{E}_\pi\lambda(\pi,\cdot)$, which is impossible since $\lambda(\pi,\cdot)$ is proper with respect to $\Sigma_{\Lambda}$.

An alternative, more geometrical proof of the last lemma for binary games can be found in [5, Lemma 3].

3.5 The Realizability of the AA and Supermartingales

Theorem 9 shows that if the functions $q_g$ defined by (13) are forecast-continuous and have the supermartingale property, then the AA is realizable. We want to show the converse: that if the AA is realizable then one can find $\lambda$ such that the functions $q_g$ are forecast-continuous and have the supermartingale property. For mixable games, we know already that a proper loss function works (though we do not know yet whether a proper loss function exists). In this subsection we show that we can obtain $\lambda$ in any game if we can construct continuous proper loss functions for mixable games. How to do the latter, and when it is possible, is discussed in the next subsection.
To state and prove the main result of this subsection, we need two standard assumptions (see [25]) about the game $(\Omega,\Lambda)$ and some additional notation.

Assumption 1. $\Lambda$ is a compact subset of $[0,\infty]^{\Omega}$ (in the extended topology).

Assumption 2. There exists $g_{\mathrm{fin}}\in\Lambda$ such that $g_{\mathrm{fin}}(\omega)<\infty$ for all $\omega\in\Omega$.

Note that if $\Lambda$ is compact then $\Sigma_{\Lambda}$ is also compact, as is $\Sigma^{\eta}_{\Lambda}$. A nice feature of compact prediction sets is that the properties of the game are determined by the boundary of the prediction set. For any set $X\subseteq[0,\infty]^{\Omega}$, denote by $\mathcal{M}X$ the set of minimal elements of $X$: $g_0\in\mathcal{M}X$ if and only if for any $g\in X$ the inequality $g_0\ge g$ implies $g_0=g$. For a compact set $X$, for every $g\in X$ there is an element $g_0\in\mathcal{M}X$ such that $g_0\le g$; that is, $X\subseteq(\mathcal{M}X+[0,\infty]^{\Omega})$. Notice that $\mathcal{M}X$ is contained in the boundary $\partial X$ of $X$. Since $\Sigma_{\Lambda}=\Sigma_{\Lambda}+[0,\infty]^{\Omega}=\Lambda+[0,\infty]^{\Omega}$, we have $\mathcal{M}\Sigma_{\Lambda}=\mathcal{M}\Lambda\subseteq\Lambda$. For compact $\Lambda$, we have $\Sigma_{\Lambda}=\mathcal{M}\Sigma_{\Lambda}+[0,\infty]^{\Omega}=\Sigma_{\mathcal{M}\Lambda}=\mathcal{M}\Lambda+[0,\infty]^{\Omega}$. Note also that a game is $\eta$-mixable if and only if $\mathcal{M}\Sigma^{\eta}_{\Lambda}\subseteq\Lambda$, since this is equivalent to $\Sigma^{\eta}_{\Lambda}=\Sigma_{\Lambda}$. A loss function is proper with respect to $\Sigma^{\eta}_{\Lambda}$ if and only if it is proper with respect to $\mathcal{M}\Sigma^{\eta}_{\Lambda}$.

Lemma 12. Suppose that the game $(\Omega,\Lambda)$ satisfies Assumptions 1 and 2 and the AA is $(c,\eta)$-realizable for this game. Then there is a continuous mapping $V:\Sigma^{\eta}_{\Lambda}\to\partial\Sigma_{\Lambda}$ such that $V(g)\le cg$ for all $g\in\Sigma^{\eta}_{\Lambda}$.

The proof is given in the Appendix. The mapping $V$ is actually the central projection from $\Sigma^{\eta}_{\Lambda}$ into the superprediction set $\Sigma_{\Lambda}$ (which contains $c\,\Sigma^{\eta}_{\Lambda}$ when the AA is $(c,\eta)$-realizable).

Theorem 13. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2, the AA be $(c,\eta)$-realizable for this game, and $\lambda_\eta:\mathcal{P}(\Omega)\to\Sigma^{\eta}_{\Lambda}$ be a continuous proper loss function with respect to $\Sigma^{\eta}_{\Lambda}$. Then for any continuous $\lambda:\mathcal{P}(\Omega)\to\partial\Sigma_{\Lambda}$ such that $\lambda(\pi,\cdot)\le c\lambda_\eta(\pi,\cdot)$ for all $\pi\in\mathcal{P}(\Omega)$, the functions $q_g$ defined by (13) are forecast-continuous and have the supermartingale property for every $g\in\Sigma_{\Lambda}$; and there exists a continuous $\lambda:\mathcal{P}(\Omega)\to\partial\Sigma_{\Lambda}$ such that $\lambda(\pi,\cdot)\le c\lambda_\eta(\pi,\cdot)$ for all $\pi\in\mathcal{P}(\Omega)$.

Proof. The forecast-continuity is obvious. Let us check the supermartingale property, i.e., that $\mathbb{E}_\pi\mathrm{e}^{\eta(\lambda(\pi,\cdot)/c-g)}\le 1$ for all $\pi\in\mathcal{P}(\Omega)$ and all $g\in\Sigma_{\Lambda}$. Since $\lambda(\pi,\cdot)\le c\lambda_\eta(\pi,\cdot)$, it suffices that $\mathbb{E}_\pi\mathrm{e}^{\eta(\lambda_\eta(\pi,\cdot)-g)}\le 1$, which follows from Lemma 11 applied to the $\eta$-mixable game $(\Omega,\Sigma^{\eta}_{\Lambda})$ and the proper function $\lambda_\eta$ (note that $\Sigma_{\Lambda}\subseteq\Sigma^{\eta}_{\Lambda}$, hence the lemma works for all $g\in\Sigma_{\Lambda}$). It remains to observe that $\lambda(\pi,\cdot)=V(\lambda_\eta(\pi,\cdot))$, where $V$ is defined in Lemma 12, has the properties we need.

3.6 Construction of a Continuous Proper Loss Function

In this subsection, we fix a game $(\Omega,\Lambda)$, fix $\eta>0$, and consider proper loss functions with respect to $\Sigma^{\eta}_{\Lambda}$. They can also be interpreted as proper loss functions for the $\eta$-mixable game $(\Omega,\Sigma^{\eta}_{\Lambda})$.

Lemma 14. Let $\lambda_1$ and $\lambda_2$ be functions from $\mathcal{P}(\Omega)$ to $\Sigma^{\eta}_{\Lambda}$. Suppose that they are proper with respect to $\Sigma^{\eta}_{\Lambda}$ at some point $\pi\in\mathcal{P}(\Omega)$, that is, $\mathbb{E}_\pi\lambda_i(\pi,\cdot)\le\mathbb{E}_\pi g$, $i=1,2$, for all $g\in\Sigma^{\eta}_{\Lambda}$. Then for all $\omega\in\Omega$ we have

\[
\pi(\omega)\ne 0 \;\Rightarrow\; \lambda_1(\pi,\omega)=\lambda_2(\pi,\omega).
\]

The proof of the lemma is given in the Appendix. Let $\mathcal{P}^{\circ}(\Omega)$ be the set of all non-degenerate distributions, i.e.,

\[
\mathcal{P}^{\circ}(\Omega) = \{\pi\in\mathcal{P}(\Omega)\mid\forall\omega\in\Omega\;\;\pi(\omega)>0\}.
\]
Lemma 14 implies that a proper loss function is uniquely defined on $\mathcal{P}^{\circ}(\Omega)$. The following lemma gives a more explicit specification of the values of a proper loss function on $\mathcal{P}^{\circ}(\Omega)$.

Lemma 15. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2. Let us define the function $H:\mathbb{R}^{\Omega}\to[-\infty,\infty)$ by the formula

\[
H(\pi) = \min_{g\in\Sigma^{\eta}_{\Lambda}}\mathbb{E}_\pi g. \tag{14}
\]

Let $\mathcal{H}$ be the domain where $H$ is differentiable. Then $\mathcal{H}\supseteq\mathcal{P}^{\circ}(\Omega)$, and the components of the gradient of $H$ at $\pi\in\mathcal{H}\cap\mathcal{P}(\Omega)$ constitute a continuous function $\lambda:\mathcal{H}\cap\mathcal{P}(\Omega)\to\Sigma^{\eta}_{\Lambda}$ such that $\mathbb{E}_\pi\lambda(\pi,\cdot)=H(\pi)$. Moreover, if $\pi\in\mathcal{P}^{\circ}(\Omega)$ then $\lambda(\pi,\cdot)$ is the unique point where the minimum in (14) is attained.

Remark 16. The function $H(\pi)$ for $\pi\in\mathcal{P}(\Omega)$ is known as the generalized entropy of the game $(\Omega,\Lambda)$; see [13]. For the logarithmic loss game, $H(\pi)$ becomes the Shannon entropy of $\pi$ (cf. (16)). It is worth mentioning that one can reconstruct the superprediction set $\Sigma_{\Lambda}$ from the generalized entropy of the game, and also from the predictive complexity of the game (see [18] for the definitions and proofs in the case of binary games).

The proof of the lemma is given in the Appendix. The proof is based on the fact that the function $-H(\pi)$ is convex. Note that $\lambda(\pi,\cdot)\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ for any $\pi\in\mathcal{P}^{\circ}(\Omega)$. Indeed, if for some $\pi\in\mathcal{P}^{\circ}(\Omega)$ we have $\lambda(\pi,\cdot)\notin\mathcal{M}\Sigma^{\eta}_{\Lambda}$ then there exists $g\le\lambda(\pi,\cdot)$, $g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$, with $g(\omega)<\lambda(\pi,\omega)$ for at least one $\omega$. Since $\pi(\omega)>0$, we get $\mathbb{E}_\pi g<\mathbb{E}_\pi\lambda(\pi,\cdot)=H(\pi)$, which contradicts the definition of $H$. Recall that if a loss function $\lambda$ is proper with respect to $\Sigma^{\eta}_{\Lambda}$ then $\mathbb{E}_\pi\lambda(\pi,\cdot)=H(\pi)$.

Lemma 15 shows that on $\mathcal{P}^{\circ}(\Omega)$ a proper loss function $\lambda$ exists and is unique and continuous. Our next task is to extend $\lambda$ continuously from $\mathcal{P}^{\circ}(\Omega)$ to $\mathcal{P}(\Omega)$. Unfortunately, this is sometimes impossible. Consider an example. Let $\Omega=\{1,2,3\}$, and let the prediction set be $\Lambda=\{(-\ln p,-\ln(1-p),1)\mid p\in[0,1]\}$. Actually, this is the binary logarithmic loss game with an additional dummy outcome. This game is 1-mixable and $\Sigma^{1}_{\Lambda}=\Sigma_{\Lambda}$. It is easy to check that the proper loss function with respect to $\Sigma_{\Lambda}$ is given on $\mathcal{P}^{\circ}(\Omega)$ by the formulas $\lambda(\pi,i)=-\ln\frac{\pi(i)}{\pi(1)+\pi(2)}$, $i=1,2$, and $\lambda(\pi,3)=1$. This function can be extended continuously to all $\pi$ such that $\pi(1)+\pi(2)\ne 0$, so we have $\lambda(\pi,\cdot)=(\infty,0,1)$ if $\pi(1)=0$ and $\lambda(\pi,\cdot)=(0,\infty,1)$ if $\pi(2)=0$. However, these continuations are inconsistent at the point $\pi=(0,0,1)$. Therefore, there is no continuous function on $\mathcal{P}(\Omega)$ that is proper with respect to $\Sigma_{\Lambda}$ for this game.

Now let us consider three examples of games where a continuous proper (and even strictly proper) loss function exists. The first example is the Brier game (see [31]), which is a generalization of the square loss game:

\[
\lambda_{\mathrm{B}}(\pi,\omega) = \sum_{o\in\Omega}\bigl(\delta_\omega(o)-\pi(o)\bigr)^2,
\]

where $\delta_\omega(o)=1$ if $o=\omega$ and $\delta_\omega(o)=0$ if $o\ne\omega$. For the binary game $\Omega=\{0,1\}$, a distribution $\pi\in\mathcal{P}(\Omega)$ is a pair $(1-p,p)$ where $p\in[0,1]$, and hence $\lambda_{\mathrm{B}}(\pi,\omega)=2(p-\omega)^2$, which is twice the loss $\lambda_{\mathrm{sq}}(p,\omega)=(p-\omega)^2$ in the binary square loss game as defined in Subsection 3.2. The Brier game is 1-mixable, that is, $\Sigma^{\eta}_{\Lambda_{\mathrm{B}}}=\Sigma_{\Lambda_{\mathrm{B}}}$ for $\eta\le 1$.
Let us calculate $H(\pi)$ defined by (14) for $\pi\in\mathcal{P}(\Omega)$:

\[
H_{\mathrm{B}}(\pi) = \min_{g\in\Sigma^{\eta}_{\Lambda_{\mathrm{B}}}}\mathbb{E}_\pi g = \min_{g\in\Lambda_{\mathrm{B}}}\mathbb{E}_\pi g = \min_{\pi'\in\mathcal{P}(\Omega)}\mathbb{E}_\pi\lambda_{\mathrm{B}}(\pi',\cdot)
= \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\sum_{o\in\Omega}\bigl(\delta_\omega(o)-\pi'(o)\bigr)^2
= 1-\sum_{\omega\in\Omega}\pi^2(\omega)+\min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\bigl(\pi'(\omega)-\pi(\omega)\bigr)^2
= 1-\sum_{\omega\in\Omega}\pi^2(\omega).
\]

Clearly, $H_{\mathrm{B}}(\pi)$ is differentiable on $\mathcal{P}(\Omega)$, hence a continuous proper loss function for the Brier game can be computed as the gradient of $H_{\mathrm{B}}$ by Lemma 15. However, it is easier to note that the minimum of $\mathbb{E}_\pi\lambda_{\mathrm{B}}(\pi',\cdot)$ is attained at $\pi'=\pi$ only, and thus the standard form of the loss function $\lambda_{\mathrm{B}}$ is proper.

Remark 17. Note that in the example above we computed the value of $H(\pi)$ assuming that $\pi\in\mathcal{P}(\Omega)$. If we want to compute $\lambda(\pi,\omega)$ as the partial derivatives of $H(\pi)$ with respect to $\pi(\omega)$, we must consider $H(\pi)$ as a function on $\mathbb{R}^{\Omega}$ (as stated in Lemma 15). To this end, just note that $H$ is homogeneous:

\[
H(\pi) = H\!\left(\frac{\pi}{\sum_{\omega\in\Omega}\pi(\omega)}\right)\sum_{\omega\in\Omega}\pi(\omega) \tag{15}
\]

for $\pi\in\mathbb{R}^{\Omega}$. In the Brier game example we have

\[
H_{\mathrm{B}}(\pi) = \left(1-\frac{\sum_{\omega\in\Omega}\pi^2(\omega)}{\bigl(\sum_{\omega\in\Omega}\pi(\omega)\bigr)^2}\right)\sum_{\omega\in\Omega}\pi(\omega),
\]

and the partial derivatives are

\[
1-2\pi(\omega)+\sum_{o\in\Omega}\pi^2(o) = \lambda_{\mathrm{B}}(\pi,\omega)
\]

for any $\pi\in\mathcal{P}(\Omega)$. In general, if we have a function $\phi:\mathbb{R}^{\Omega}\to\mathbb{R}$ such that $\phi(\pi)=H(\pi)$ for all $\pi\in\mathcal{P}(\Omega)$, taking the derivatives of (15) we get that the proper loss function $\lambda$ can be computed by the following formula for any $\pi\in\mathcal{P}(\Omega)$:

\[
\lambda(\pi,\omega) = \phi(\pi)-\sum_{o\in\Omega}\pi(o)\,\phi'_o(\pi)+\phi'_\omega(\pi),
\]

where $\phi'_\omega$ is the partial derivative of $\phi$ with respect to $\pi(\omega)$. This formula is known from the Savage theorem [20] (see also [12, Theorem 3.2]; recall that they consider scores, or gains, $-\lambda$ instead of losses $\lambda$).

The second example is the Hellinger game:

\[
\lambda_{\mathrm{H}}(\pi,\omega) = \frac{1}{2}\sum_{o\in\Omega}\Bigl(\sqrt{\delta_\omega(o)}-\sqrt{\pi(o)}\Bigr)^2.
\]

Similarly to the Brier game, we can find that

\[
H_{\mathrm{H}}(\pi) = \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\Bigl(1-\sqrt{\pi'(\omega)}\Bigr) = \sum_{\omega\in\Omega}\pi(\omega)-\sqrt{\sum_{\omega\in\Omega}\pi^2(\omega)}.
\]

Here the minimum is not attained at $\pi'=\pi$, and $\lambda_{\mathrm{H}}$ is not proper. Taking the derivatives, we find a proper loss function for the Hellinger game:

\[
\lambda(\pi,\omega) = 1-\frac{\pi(\omega)}{\sqrt{\sum_{\omega\in\Omega}\pi^2(\omega)}}.
\]

This loss function is known as the spherical loss. The spherical loss and the Hellinger loss specify the same game, but under different parameterizations.

For binary games, this kind of "reparameterization" was considered in [14, Section 3.1], where a proper function $\lambda(\pi,\cdot)$ was called a Bayes-optimal prediction for bias $\pi$. More precisely, the paper [14] discusses binary games specified by a loss function $\lambda(\gamma,\omega)$, where $\omega$ is 0 or 1 and $\gamma\in[0,1]$. Their Lemma 3.5 states conditions (on the derivatives of $\lambda$ as a function of $\gamma$) under which there exists a unique $\gamma_p$ that minimizes $(1-p)\lambda(\gamma,0)+p\lambda(\gamma,1)$ for each $p\in[0,1]$. This $\gamma_p$ can be obtained from Equation (3.8) in [14]:

\[
(1-p)\,\frac{\mathrm{d}}{\mathrm{d}\gamma}\lambda(\gamma,0)\Big|_{\gamma=\gamma_p} + p\,\frac{\mathrm{d}}{\mathrm{d}\gamma}\lambda(\gamma,1)\Big|_{\gamma=\gamma_p} = 0.
\]

Our Lemma 15 can be regarded as a generalization of this approach.
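The Savage-type formula of Remark 17 is easy to test numerically: taking $\phi(\pi)=1-\sum_{\omega}\pi^2(\omega)$, which agrees with $H_{\mathrm{B}}$ on $\mathcal{P}(\Omega)$, the formula should recover $\lambda_{\mathrm{B}}$. The sketch below does this with central finite differences (the step size and the test point are our choices).

```python
# Recover the Brier loss from its generalized entropy via the Savage-type
# formula lambda(pi, omega) = phi(pi) - sum_o pi(o) phi'_o(pi) + phi'_omega(pi),
# using central finite differences (step size h is our choice).

def phi_brier(pi):                       # phi(pi) = H_B(pi) on P(Omega)
    return 1.0 - sum(x * x for x in pi)

def savage_loss(phi, pi, omega, h=1e-6):
    def partial(o):
        up = list(pi); up[o] += h
        dn = list(pi); dn[o] -= h
        return (phi(up) - phi(dn)) / (2 * h)
    grads = [partial(o) for o in range(len(pi))]
    return phi(pi) - sum(p * g for p, g in zip(pi, grads)) + grads[omega]

pi = [0.2, 0.3, 0.5]
for omega in range(3):
    direct = sum(((1.0 if o == omega else 0.0) - pi[o]) ** 2 for o in range(3))
    via_entropy = savage_loss(phi_brier, pi, omega)
    assert abs(direct - via_entropy) < 1e-6
    print(omega, round(via_entropy, 6))
```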
Our third example is the general logarithmic loss game, defined by $\lambda_{\log}(\pi,\omega)=-\ln\pi(\omega)$. Similarly to the Brier loss function, the logarithmic loss function is strictly proper. Indeed, let us calculate the entropy $H_{\log}$ for $\pi\in\mathcal{P}(\Omega)$:

\[
H_{\log}(\pi) = \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\bigl(-\ln\pi'(\omega)\bigr)
= -\sum_{\omega\in\Omega}\pi(\omega)\ln\pi(\omega) - \max_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\ln\frac{\pi'(\omega)}{\pi(\omega)}
= -\sum_{\omega\in\Omega}\pi(\omega)\ln\pi(\omega). \tag{16}
\]

Here the partial derivatives are infinite at the boundary of $\mathcal{P}(\Omega)$. Nevertheless, it is easy to check that the minimum in the definition of $H_{\log}(\pi)$ is always attained at the single point $\pi'=\pi$ only. The last equality in (16) holds since the logarithm is concave,

\[
\sum_{\omega\in\Omega}\pi(\omega)\ln\frac{\pi'(\omega)}{\pi(\omega)} \le \ln\sum_{\omega\in\Omega}\pi(\omega)\frac{\pi'(\omega)}{\pi(\omega)} = 0,
\]

and the inequality is strict unless the ratios $\pi'(\omega)/\pi(\omega)$ are equal for all $\omega\in\Omega$ or $\pi(\omega_0)=1$ for some $\omega_0$. In the former case, $\pi=\pi'$, since $\pi,\pi'\in\mathcal{P}(\Omega)$. In the latter case, we get $\max_{\pi'\in\mathcal{P}(\Omega)}\ln\pi'(\omega_0)$, which is attained if $\pi'(\omega_0)=1$, and hence $\pi=\pi'$ too.

Now we consider a general way to construct proper loss functions, even in the case when $H$ is not differentiable on all of $\mathcal{P}(\Omega)$. Note that the only way to extend $\lambda$ continuously is to define it at $\mathcal{P}(\Omega)\setminus\mathcal{P}^{\circ}(\Omega)$ as a limit from $\mathcal{P}^{\circ}(\Omega)$, where $\lambda(\pi,\cdot)$ is defined as a point of minimum. The following lemma, proved in the Appendix, states that a limit of such points is again a point of minimum.

Lemma 18. Let $\pi_i\in\mathcal{P}(\Omega)$ and $\gamma_i\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ be such that $\mathbb{E}_{\pi_i}\gamma_i=\min_{g\in\Sigma^{\eta}_{\Lambda}}\mathbb{E}_{\pi_i}g$, $i=1,2,\dots$. Assume that $\pi_i\to\pi$ and $\gamma_i\to\gamma$ as $i\to\infty$. Then $\gamma\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ and $\mathbb{E}_\pi\gamma=\min_{g\in\Sigma^{\eta}_{\Lambda}}\mathbb{E}_\pi g$.

In particular, the lemma implies that a continuous proper loss function exists in games where each minimum is attained at a unique point. Let us formulate this assumption explicitly and prove the existence theorem.

Assumption 3. For every $\pi\in\mathcal{P}(\Omega)$ such that $\pi(\omega_1)=0$ and $\pi(\omega_2)=0$ for some $\omega_1,\omega_2\in\Omega$, $\omega_1\ne\omega_2$, there exists only one point where the minimum of $\mathbb{E}_\pi g$ over all $g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ is attained.

Remark 19. Assumption 3 holds automatically for all binary games. Games with differentiable $H$, such as the general square loss game, satisfy Assumption 3 as well.

Theorem 20. Suppose that the game $(\Omega,\Lambda)$ satisfies Assumptions 1 and 2, and Assumption 3 for a certain $\eta>0$. Then there exists a continuous loss function $\lambda:\mathcal{P}(\Omega)\to\mathcal{M}\Sigma^{\eta}_{\Lambda}$ that is proper, and even strictly proper, with respect to $\Sigma^{\eta}_{\Lambda}$.

Proof. Let us show first that the minimum of $\mathbb{E}_\pi g$ over all $g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ is attained at one point only, for all $\pi\in\mathcal{P}(\Omega)$. For $\pi\in\mathcal{P}^{\circ}(\Omega)$, this follows from Lemma 14. Let $\pi\in\mathcal{P}(\Omega)$ be such that $\pi(\omega_0)=0$ for some $\omega_0\in\Omega$ and $\pi(\omega)\ne 0$ for $\omega\ne\omega_0$. Let $g_1,g_2\in\mathcal{M}\Sigma^{\eta}_{\Lambda}$ be any two points of minimum. Again by Lemma 14, $g_1(\omega)=g_2(\omega)$ for all $\omega\ne\omega_0$. Therefore $g_1\le g_2$ or $g_1\ge g_2$ (since $g_1(\omega_0)$ and $g_2(\omega_0)$ are comparable, being two reals), and the greater of them cannot belong to $\mathcal{M}\Sigma^{\eta}_{\Lambda}$. Thus, $g_1=g_2$. Assumption 3 works for all other $\pi\in\mathcal{P}(\Omega)$.

Let us take $\lambda(\pi,\cdot)=\arg\min_{g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}}\mathbb{E}_\pi g$ for all $\pi\in\mathcal{P}(\Omega)$. Clearly, $\lambda$ is proper with respect to $\Sigma^{\eta}_{\Lambda}$ (recall that every point in $\Sigma^{\eta}_{\Lambda}$ is minorized by some point in $\mathcal{M}\Sigma^{\eta}_{\Lambda}$). Let us prove continuity. Take any converging sequence $\pi_i\in\mathcal{P}(\Omega)$, let $\pi$ be its limit, and consider the corresponding $\lambda(\pi_i,\cdot)$. Lemma 18 implies that all accumulation points of the set $\{\lambda(\pi_i,\cdot)\}$ are points where $\min_{g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}}\mathbb{E}_\pi g$ is attained; therefore $\lambda(\pi,\cdot)$ is the only accumulation point, and $\lambda(\pi_i,\cdot)$ converges to $\lambda(\pi,\cdot)$.
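For the binary square-loss game (which satisfies Assumption 3 by Remark 19, and is $\eta$-mixable for $\eta=2$, so $\Sigma^{\eta}_{\Lambda}=\Sigma_{\Lambda}$), the construction in the proof of Theorem 20 can be carried out numerically: $\lambda(\pi,\cdot)=\arg\min_{g\in\mathcal{M}\Sigma^{\eta}_{\Lambda}}\mathbb{E}_\pi g$ over the prediction curve $g_p=(p^2,(1-p)^2)$. The grid search below is our implementation; it confirms that the minimum is attained at $p=\pi(1)$, i.e., that the square loss is proper for its own game.

```python
# Theorem 20's construction for the binary square-loss game: the proper loss
# lambda(pi, .) is the minimizer of E_pi g over the curve g_p = (p^2, (1-p)^2).
# Grid resolution and test points are our choices.

def proper_loss(q, grid=100001):
    """argmin over p of (1-q) * p**2 + q * (1-p)**2, found by grid search."""
    best_p = min((i / (grid - 1) for i in range(grid)),
                 key=lambda p: (1 - q) * p * p + q * (1 - p) ** 2)
    return best_p, (best_p ** 2, (1 - best_p) ** 2)

for q in (0.0, 0.25, 0.5, 0.9):    # q = pi(1)
    p, lam = proper_loss(q)
    assert abs(p - q) < 1e-4        # the minimum is attained at p = q
    print(q, lam)
```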
3.7 Defensive Forecasting Revisited

Let us review the results we have obtained so far. Theorems 1 and 5 give us the same loss bound for a game $(\Omega,\Lambda)$, if the AA is realizable and if $Q$ defined by (7) is a forecast-continuous supermartingale, respectively. The algorithms are very close in their internal structure. We can say even more: with the same parameters and inputs, they give the same predictions, in some sense. More precisely, two sets coincide: the set of $\gamma_N\in\Lambda$ satisfying (1) and the set of $\gamma_N\in\Lambda$ such that $\gamma_N$ minorizes $\lambda(\pi_N,\cdot)$ for $\pi_N$ satisfying (9). Both algorithms are applicable under almost the same conditions: Theorem 9 says that if $Q$ is a forecast-continuous supermartingale then the AA is realizable; Theorems 13 and 20 show the converse for games satisfying Assumptions 1–3.

Whereas Assumptions 1 and 2 are standard and natural, and the AA is usually considered only for games satisfying these assumptions, Assumption 3 is new and quite cumbersome. However, it turns out that, with the help of a more complicated version of the DFA, we can get rid of Assumption 3 and get a perfect equivalence between the realizability of the AA and a supermartingale condition (under the standard Assumptions 1 and 2 only).

To begin with, let us slightly relax the definitions concerning supermartingales. We say that a function $q:\mathcal{P}^{\circ}(\Omega)\times\Omega\to\mathbb{R}$ has the supermartingale property on $\mathcal{P}^{\circ}(\Omega)$ if for any $\pi\in\mathcal{P}^{\circ}(\Omega)$

\[
\mathbb{E}_\pi q(\pi,\cdot) \le 1.
\]

The function $q$ is forecast-continuous on $\mathcal{P}^{\circ}(\Omega)$ if for every $\omega\in\Omega$ it is continuous as a function of $\pi$ for all $\pi\in\mathcal{P}^{\circ}(\Omega)$.

Lemma 21. Let a function $q:\mathcal{P}^{\circ}(\Omega)\times\Omega\to\mathbb{R}$ be non-negative and forecast-continuous on $\mathcal{P}^{\circ}(\Omega)$. Suppose that for all $\pi\in\mathcal{P}^{\circ}(\Omega)$ it holds that $\mathbb{E}_\pi q(\pi,\cdot)\le C$, where $C\in[0,\infty)$ is some constant. Then there exists a sequence $\{\pi^{(i)}\}_{i\in\mathbb{N}}$ such that $\pi^{(i)}\in\mathcal{P}^{\circ}(\Omega)$, the sequence $\pi^{(i)}$ converges in $\mathcal{P}(\Omega)$, the sequences $q(\pi^{(i)},\omega)$ converge for every $\omega\in\Omega$, and

\[
\forall\omega\in\Omega\quad \lim_{i\to\infty}q(\pi^{(i)},\omega) \le C.
\]

The proof of the lemma is given in the Appendix, after the proof of Lemma 8.

Theorem 22. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2. The AA is $(c,\eta)$-realizable for this game if and only if there exists $\lambda$ such that the functions $q_g$ defined by (13) are forecast-continuous on $\mathcal{P}^{\circ}(\Omega)$ and have the supermartingale property on $\mathcal{P}^{\circ}(\Omega)$ for all $g\in\Sigma_{\Lambda}$.

Proof. The "only if" part easily follows from Lemma 15 combined with (the proof of) Theorem 13. The "if" part is analogous to Theorem 9. We need to prove inequality (3) for any finite $G\subseteq\Sigma_{\Lambda}$ and for any distribution $\rho$ on $G$. Let us consider the function

\[
q(\pi,\omega) = \sum_{g\in G}\rho(g)\,q_g(\pi,\omega),
\]

which is non-negative, forecast-continuous on $\mathcal{P}^{\circ}(\Omega)$, and has the supermartingale property on $\mathcal{P}^{\circ}(\Omega)$. By Lemma 21 applied to this $q$ and $C=1$, there exist $\pi^{(i)}\in\mathcal{P}^{\circ}(\Omega)$ such that

\[
\forall\omega\in\Omega\quad \lim_{i\to\infty}\sum_{g\in G}\rho(g)\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi^{(i)},\omega)}{c}-g(\omega)\Bigr)\Bigr) \le 1.
\]

Let $\gamma^{(i)}=\lambda(\pi^{(i)},\cdot)\in\Sigma_{\Lambda}$. Since $\Sigma_{\Lambda}$ is compact (by Assumption 1), the sequence $\gamma^{(i)}$ contains a convergent subsequence; let $\gamma\in\Sigma_{\Lambda}$ be its limit. Then $\sum_{g\in G}\rho(g)\exp\bigl(\eta(\gamma/c-g)\bigr)$ is the limit of the corresponding convergent subsequence of the sequence $\sum_{g\in G}\rho(g)\exp\bigl(\eta(\gamma^{(i)}/c-g)\bigr)$, and for every $\omega\in\Omega$ we get inequality (3):

\[
\sum_{g\in G}\rho(g)\exp\Bigl(\eta\Bigl(\frac{\gamma(\omega)}{c}-g(\omega)\Bigr)\Bigr) \le 1.
\]

Let us also state the algorithm DFA*, a variant of the DFA suitable for supermartingales on $\mathcal{P}^{\circ}(\Omega)$.
At step $N$, the DFA* defines the function

\[
q(\pi,\omega) = \sum_{\theta\in\Theta}P_0(\theta)\exp\Biggl(\eta\sum_{n=1}^{N-1}\Bigl(\frac{\gamma_n(\omega_n)}{c}-\gamma^{\theta}_n(\omega_n)\Bigr)\Biggr)\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c}-\gamma^{\theta}_N(\omega)\Bigr)\Bigr)
\]

and chooses any sequence of $\pi^{(i)}\in\mathcal{P}^{\circ}(\Omega)$ such that

\[
\forall\omega\in\Omega\quad \lim_{i\to\infty}q(\pi^{(i)},\omega) \le 1.
\]

Then the algorithm chooses as $\gamma$ the limit of any convergent subsequence of the sequence $\lambda(\pi^{(i)},\cdot)$, and announces $\gamma_N=\sigma(\gamma)$ as Learner's prediction. It is clear that the DFA* guarantees the same loss bound as Theorem 5.

It is important for applications that the AA is rather efficient computationally (though it is more complicated than some other algorithms). The DFA* is designed to obtain a nice theory, and it makes little sense to discuss its efficiency. The DFA is much more practical than the DFA*. Unfortunately, the DFA seems to be less practical than the AA. Its main step, hidden in the proof of Lemma 8, requires finding a fixed point (or a minimax), which is generally a hard task (PPAD-complete). For binary games, however, the fixed points can be found by the bisection method, which gives us a not-so-inefficient implementation of the DFA. Some tricks can also help for games with three outcomes.
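Here is one way the bisection method just mentioned can be implemented for binary games. The reduction of Lemma 8 to root-finding is sketched in the comments; the concrete $q$, tolerances, and toy example are our choices.

```python
import math

def choose_p_binary(q, C=1.0, tol=1e-12):
    """Find p in [0, 1] with q(p, 0) <= C and q(p, 1) <= C by bisection.

    Assumes q is forecast-continuous and E_p q(p, .) <= C for all p, as in
    Lemma 8.  Then d(p) = q(p, 1) - q(p, 0) changes sign unless an endpoint
    already works, and at a root both values equal E_p q(p, .) <= C.
    """
    d = lambda p: q(p, 1) - q(p, 0)
    if d(0) <= 0:            # q(0, 1) <= q(0, 0) = E_0 q(0, .) <= C
        return 0.0
    if d(1) >= 0:            # q(1, 0) <= q(1, 1) = E_1 q(1, .) <= C
        return 1.0
    lo, hi = 0.0, 1.0        # d(lo) > 0 > d(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: one DFA step in the log-loss game (c = eta = 1) with two experts.
experts, w = [0.3, 0.8], [0.5, 0.5]
def q(p, omega):
    lam = -math.log(max(p if omega else 1 - p, 1e-300))
    return sum(wi * math.exp(lam + math.log(e if omega else 1 - e))
               for wi, e in zip(w, experts))
print(round(choose_p_binary(q), 6))   # 0.55, the posterior mixture
```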
Remark 23. After this paper was finished, the authors discovered another way to deal with games that do not satisfy Assumption 3. The idea is to consider a multivalued loss function: to every π it assigns all points where the minimum of E_π g is attained. The definition of supermartingale should be modified accordingly, and a variant of Lemma 4 can be proved for such multivalued supermartingales. The details will be added later or published elsewhere.

3.8 On Continuous Outcomes

We have assumed so far that the space of outcomes, Ω, is finite. However, it is often natural to consider a continuous space of outcomes. For example, for the square loss function λ_sq(p, ω) = (p − ω)², one can take ω ∈ [0, 1] instead of ω ∈ {0, 1}. In this subsection we consider one important case of continuous outcome spaces: a finite-dimensional simplex.

We will consider a simplex as the space P(Ω) of distributions on some finite Ω. A game of prediction is a pair (P(Ω), Λ), where Λ ⊆ [0, ∞]^{P(Ω)}; predictions are functions γ : P(Ω) → [0, ∞]; the protocol is the same. Each game of prediction with outcomes from a simplex P(Ω) can be restricted to a game on Ω: we identify each ω ∈ Ω with the distribution δ_ω concentrated on this ω. Thus we may assume P(Ω) ⊃ Ω. Denote by Λ_Ω ⊆ [0, ∞]^Ω the set of functions from Λ restricted to Ω. We will show how the supermartingale technique works for games having a certain regularity property. (A similar extension for the AA is discussed in [14, Section 4.1].)

To motivate this kind of property, let us start from the other side and assume that we have a prediction γ (recall that our prediction is a vector of our losses for every possible outcome) defined on Ω and want to extend it to P(Ω). The most natural way to do this is to say that an element of P(Ω) is just a probability distribution on the outcomes and to consider the expected loss with respect to this distribution, that is, γ(p) := E_p γ for every p ∈ P(Ω). It is also natural to expect that, having this property, one should be able to transfer a regret bound from the game on Ω to the respective game on P(Ω). However, the equality γ(p) = E_p γ is too restrictive. For example, it does not hold for the square loss. At the same time, what does hold for the square loss (and will be checked later) is an equality concerning the difference of two predictions: γ_1(p) − γ_2(p) = E_p(γ_1 − γ_2). This is quite natural in our context, since the difference is a regret, loosely speaking, and a regret is the value we are optimizing. This leads to the following requirement (formally weaker than the condition for the square loss). We say that Λ ⊆ [0, ∞]^{P(Ω)} has the relative exp-convexity property for given c and η if for all γ_1, γ_2 ∈ Λ and for all p ∈ P(Ω) it holds that

$$
\exp\Bigl(\eta\Bigl(\frac{\gamma_1(p)}{c} - \gamma_2(p)\Bigr)\Bigr) \le \sum_{\omega \in \Omega} p(\omega) \exp\Bigl(\eta\Bigl(\frac{\gamma_1(\omega)}{c} - \gamma_2(\omega)\Bigr)\Bigr).
$$

Remark 24. The relative exp-convexity property for any c > 0 and η follows from

$$
\forall \gamma \in \Lambda\ \forall p \in P(\Omega) \quad \gamma(p) = \sum_{\omega \in \Omega} p(\omega)\, \gamma(\omega)
$$

due to the convexity of the exponential function. For c = 1 and any η, it also follows from

$$
\forall \gamma_1, \gamma_2 \in \Lambda\ \forall p \in P(\Omega) \quad \gamma_1(p) - \gamma_2(p) = \sum_{\omega \in \Omega} p(\omega)\bigl(\gamma_1(\omega) - \gamma_2(\omega)\bigr).
$$
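The Jensen step behind Remark 24 is perhaps worth spelling out (our own elaboration, for c = 1 and the difference condition; the general case is identical with γ_1/c in place of γ_1):

$$
\exp\bigl(\eta(\gamma_1(p) - \gamma_2(p))\bigr)
= \exp\Bigl(\sum_{\omega \in \Omega} p(\omega)\, \eta\bigl(\gamma_1(\omega) - \gamma_2(\omega)\bigr)\Bigr)
\le \sum_{\omega \in \Omega} p(\omega) \exp\bigl(\eta(\gamma_1(\omega) - \gamma_2(\omega))\bigr),
$$

the inequality being Jensen's inequality for the convex function exp applied to the convex combination with weights p(ω).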
Let σ_Ω : Λ_Ω → Λ be any mapping inverse to the restriction from Λ to Λ_Ω, that is, for any γ ∈ Λ_Ω, the function σ_Ω(γ) ∈ Λ restricted to Ω is γ. Such a mapping exists since every element of Λ_Ω is a restriction of some element of Λ.

Theorem 25. For a game (P(Ω), Λ), suppose that Λ has the relative exp-convexity property for some c ≥ 1 and η > 0. For the restricted game (Ω, Λ_Ω), suppose that for some λ : P(Ω) → Σ_{Λ_Ω}, the functions q_g defined by (13) are forecast-continuous and have the supermartingale property for all g ∈ Σ_{Λ_Ω}. Let σ : Σ_{Λ_Ω} → Λ_Ω be a substitution function (that is, σ(g) ≤ g for all g ∈ Σ_{Λ_Ω}). Then for the game (P(Ω), Λ) there is Learner's strategy (in fact, a variant of the DFA) with parameters c, η, λ, P_0, σ, and σ_Ω guaranteeing that, at each step N and for all experts θ, it holds that

$$
L_N \le c\, L_N^\theta + \frac{c}{\eta} \ln \frac{1}{P_0(\theta)}.
$$

Proof. Assume that we are at step N and need to announce the next prediction. Let γ_n^θ ∈ Λ, n = 1, …, N, be the experts' predictions up to step N; γ_n, n = 1, …, N − 1, be Learner's previous predictions; and p_n, n = 1, …, N − 1, be the previous outcomes. Define the function Q_{N−1} from Θ to R by

$$
Q_{N-1}(\theta) = \prod_{n=1}^{N-1} \exp\Bigl(\eta\Bigl(\frac{\gamma_n(p_n)}{c} - \gamma_n^\theta(p_n)\Bigr)\Bigr)
$$

and consider the following function on P(Ω) × Ω:

$$
q_N(\pi, \omega) = \sum_{\theta \in \Theta} P_0(\theta)\, Q_{N-1}(\theta) \times \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi, \omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr).
$$

Due to the assumptions about the last multiplier, q_N is forecast-continuous and E_π q_N(π, ·) ≤ Σ_{θ∈Θ} P_0(θ) Q_{N−1}(θ). By Lemma 8, we can find π_N ∈ P(Ω) such that for all ω ∈ Ω

$$
q_N(\pi_N, \omega) \le \sum_{\theta \in \Theta} P_0(\theta)\, Q_{N-1}(\theta).
$$

The prediction of the strategy is γ_N = σ_Ω(σ(λ(π_N, ·))) ∈ Λ. Let p_N ∈ P(Ω) be the outcome at step N. The relative exp-convexity property implies that

$$
\exp\Bigl(\eta\Bigl(\frac{\gamma_N(p_N)}{c} - \gamma_N^\theta(p_N)\Bigr)\Bigr) \le \sum_{\omega \in \Omega} p_N(\omega) \exp\Bigl(\eta\Bigl(\frac{\gamma_N(\omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr).
$$

We have γ_N(ω) = σ(λ(π_N, ·))(ω) by the definition of σ_Ω, hence γ_N(ω) ≤ λ(π_N, ω) by the definition of σ. Thus,

$$
\begin{aligned}
\sum_{\theta \in \Theta} P_0(\theta)\, Q_N(\theta)
&= \sum_{\theta \in \Theta} P_0(\theta)\, Q_{N-1}(\theta) \times \exp\Bigl(\eta\Bigl(\frac{\gamma_N(p_N)}{c} - \gamma_N^\theta(p_N)\Bigr)\Bigr) \\
&\le \sum_{\theta \in \Theta} P_0(\theta)\, Q_{N-1}(\theta) \times \sum_{\omega \in \Omega} p_N(\omega) \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi_N, \omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr) \\
&= \sum_{\omega \in \Omega} p_N(\omega)\, q_N(\pi_N, \omega) \le \sum_{\theta \in \Theta} P_0(\theta)\, Q_{N-1}(\theta),
\end{aligned}
$$

and the loss bound follows as usual.

As an example, let us again consider the Brier game (the general square loss function), now with distributions as outcomes: Ω is a finite non-empty set, outcomes p are from P(Ω), and the loss of decision π ∈ P(Ω) for outcome p is

$$
\lambda_B(\pi, p) = \sum_{\omega \in \Omega} \bigl(p(\omega) - \pi(\omega)\bigr)^2.
$$

It is easy to check that this game has the relative exp-convexity property for c = 1 and any η, due to Remark 24:

$$
\sum_{\omega \in \Omega} p(\omega)\bigl(\lambda_B(\pi_1, \omega) - \lambda_B(\pi_2, \omega)\bigr)
= \sum_{\omega \in \Omega} \bigl(\pi_1^2(\omega) - \pi_2^2(\omega)\bigr) + 2 \sum_{\omega \in \Omega} p(\omega)\bigl(\pi_2(\omega) - \pi_1(\omega)\bigr)
= \lambda_B(\pi_1, p) - \lambda_B(\pi_2, p).
$$

Another important example is the Kullback–Leibler game (its restricted version is the logarithmic loss game):

$$
\lambda_{KL}(\pi, p) = \sum_{\omega \in \Omega} p(\omega) \ln \frac{p(\omega)}{\pi(\omega)}.
$$

This game also has the relative exp-convexity property for c = 1 and any η:

$$
\lambda_{KL}(\pi_1, p) - \lambda_{KL}(\pi_2, p) = \sum_{\omega \in \Omega} p(\omega)\bigl(\lambda_{KL}(\pi_1, \omega) - \lambda_{KL}(\pi_2, \omega)\bigr).
$$
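Both identities are easy to sanity-check numerically. The following sketch (ours; the helper names and test points are arbitrary) verifies the difference identity of Remark 24 for λ_B and λ_KL at random points of the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

def simplex_point(k):
    x = rng.random(k)
    return x / x.sum()

def brier(pi, p):
    return np.sum((p - pi) ** 2)

def kl(pi, p):
    mask = p > 0          # convention: 0 * ln 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / pi[mask]))

k = 4
eye = np.eye(k)           # delta_omega, the vertices of the simplex
for loss in (brier, kl):
    pi1, pi2, p = simplex_point(k), simplex_point(k), simplex_point(k)
    lhs = loss(pi1, p) - loss(pi2, p)
    rhs = sum(p[w] * (loss(pi1, eye[w]) - loss(pi2, eye[w])) for w in range(k))
    assert abs(lhs - rhs) < 1e-12, loss.__name__
print("difference identity holds for Brier and KL at random test points")
```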
4 Second-Guessing Experts

In this section, we apply the supermartingale technique and the DFA to a new variant of the prediction with expert advice setting. Protocol 2 is an extension of Protocol 1, where the game is specified by the same elements (Ω, Λ) as before, but the Experts have a new power.

Protocol 2 Prediction with Second-Guessing Expert Advice
  L_0 := 0.
  L_0^θ := 0, for all θ ∈ Θ.
  for n = 1, 2, … do
    All Experts θ ∈ Θ announce Γ_n^θ : Λ → Λ.
    Learner announces γ_n ∈ Λ.
    Reality announces ω_n ∈ Ω.
    L_n := L_{n−1} + γ_n(ω_n).
    L_n^θ := L_{n−1}^θ + Γ_n^θ(γ_n, ω_n), for all θ ∈ Θ.
  end for

The new protocol contains only one substantial change. Every Expert θ announces a function Γ^θ from Λ to Λ instead of an element of Λ (to simplify notation, we consider Γ also as a function from Λ × Ω to [0, ∞], as we did with the proper loss functions λ). Informally speaking, an expert's opinion is now not a prediction but a conditional statement that specifies the actual prediction depending on Learner's next step. Therefore, the loss of each expert is determined by Learner's prediction as well as by the outcome chosen by Reality. We will call the experts in Protocol 2 second-guessing experts. Second-guessing experts are a generalization of experts in the standard Protocol 1: a standard expert can be interpreted in Protocol 2 as a constant function.

The phenomenon of "second-guessing experts" occurs, for example, in real-world finance. In particular, commercial banks serve as "second-guessing experts" for the central bank when they use variable interest rates (that is, the interest rate for the next period is announced not as a fixed value but as an explicit function of the central bank base rate).

In game theory, the notion of internal regret [9, 3, 22, 23] is somewhat related to the idea of second-guessing experts. Internal regret appears in the framework where for each prediction, which is called an action in that context, there is an expert that consistently recommends this action, and Learner follows one of the experts at each step. The internal regret for a pair of experts (i, j) shows by how much Learner could have decreased his loss if he had followed expert j each time he followed expert i. This can be modeled by a second-guessing expert that "adjusts" Learner's predictions: it agrees with Learner if Learner does not follow i, and recommends following j when Learner follows i.

Internal regret is usually studied in randomized prediction protocols. In the case of deterministic Learner's predictions, one cannot hope to get any interesting loss bound without additional assumptions. Indeed, Experts can always suggest exactly the "opposite" of Learner's prediction (for example, in the log loss game, they predict 1 if Learner predicts p_n, "the probability of 1", less than 0.5 and they predict 0 otherwise), and Reality can "agree" with them (choosing the outcome equal to the Experts' prediction); then the Experts' losses remain zero, but Learner's loss grows linearly in the number of steps.

A non-trivial bound is possible if Learner is allowed to give predictions in the form of a distribution on Experts. This can be formalized as the Freund–Schapire game [25, Example 7]. Then the second-guessing expert modeling an internal regret is a continuous transformation of the distribution given by Learner. The results of [3] and others are bounds of the form L_N ≤ L_N^θ + O(√N) for the Freund–Schapire game, which is non-mixable. A discussion of bounds of this form achievable by the defensive forecasting method will be published elsewhere; in this paper we consider another kind of bound. However, here we will also make the assumption that second-guessing experts modify the prediction of Learner continuously.

4.1 The DFA for Second-Guessing Experts

First consider the case when the Γ_n^θ are continuous mappings from Λ to Λ. The DFA requires virtually no modifications for this task and gives the same loss bounds as in Theorem 5.

Theorem 26. Suppose that for some c, η, and some continuous λ : P(Ω) → Λ, the functions q_g defined by (13) are forecast-continuous and have the supermartingale property for all g ∈ Λ. Then for the game following the protocol of prediction with second-guessing expert advice where all experts θ at all steps n announce continuous functions Γ_n^θ : Λ → Λ, there is Learner's strategy (in fact, the DFA applied to Q_{P_0} defined by (17)) with parameters c, η, λ, and P_0 (where P_0 is a distribution on Θ) guaranteeing that, at each step N and for all experts θ, it holds that

$$
L_N \le c\, L_N^\theta + \frac{c}{\eta} \ln \frac{1}{P_0(\theta)}.
$$

Proof. For any continuous Γ : Λ → Λ consider the function

$$
\tilde q_\Gamma(\pi, \omega) = \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi, \omega)}{c} - \Gamma(\lambda(\pi, \cdot), \omega)\Bigr)\Bigr).
$$

It is forecast-continuous as a composition of continuous functions, and it has the supermartingale property since for any π ∈ P(Ω), taking g = Γ(λ(π, ·)), we have E_π q̃_Γ = E_π q_g ≤ 1. Similarly to (8), define Q_{P_0} on ((C(Λ → Λ))^Θ × P(Ω) × Ω)^*, where C(Λ → Λ) is the set of continuous functions from Λ to Λ, by the formula

$$
Q_{P_0}\bigl(\{\Gamma_1^\theta\}_{\theta \in \Theta}, \pi_1, \omega_1, \ldots, \{\Gamma_N^\theta\}_{\theta \in \Theta}, \pi_N, \omega_N\bigr) = \sum_{\theta \in \Theta} P_0(\theta) \prod_{n=1}^{N} \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi_n, \omega_n)}{c} - \Gamma_n^\theta(\lambda(\pi_n, \cdot), \omega_n)\Bigr)\Bigr). \tag{17}
$$

As in Theorem 5, Q_{P_0} is a forecast-continuous supermartingale. At step N, the strategy chooses any π_N satisfying (9) and announces γ_N = λ(π_N, ·) as Learner's prediction (we do not need a substitution function here since the range of λ is in Λ by the theorem's assumption). The loss bound follows, since

$$
\exp\Bigl(\eta \sum_{n=1}^{N} \Bigl(\frac{\lambda(\pi_n, \omega_n)}{c} - \Gamma_n^\theta(\gamma_n, \omega_n)\Bigr)\Bigr) = \exp\Bigl(\eta \sum_{n=1}^{N} \Bigl(\frac{\lambda(\pi_n, \omega_n)}{c} - \Gamma_n^\theta(\lambda(\pi_n, \cdot), \omega_n)\Bigr)\Bigr) \le \frac{1}{P_0(\theta)}.
$$
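A toy sketch (ours, not the paper's code) of this strategy for the binary log-loss game with c = η = 1: Learner's prediction is parameterized by p, the probability of outcome 1, and an expert announces a continuous map Γ : [0, 1] → [0, 1] turning Learner's p into the expert's own probability. The DFA step then combines the one-step factors of (17) and uses bisection as before.

```python
# Toy sketch: DFA for second-guessing experts, binary log-loss game,
# c = eta = 1.  Names and the two example experts below are ours.

def loglik(u, w):                  # probability assigned to outcome w
    return u if w == 1 else 1.0 - u

def make_q(gammas, weights):
    """q(p, w) = sum_theta w_theta * Gamma_theta(p)[w] / p[w]
    (the one-step factors of (17) for log loss)."""
    def q(p, w):
        return sum(wt * loglik(g(p), w)
                   for g, wt in zip(gammas, weights)) / loglik(p, w)
    return q

def dfa_step(q, eps=1e-6, tol=1e-9):
    """Bisection on h(p) = q(p, 1) - q(p, 0) over [eps, 1 - eps]."""
    h = lambda p: q(p, 1) - q(p, 0)
    lo, hi = eps, 1.0 - eps
    if h(lo) <= 0: return lo       # q(lo, .) already at most ~1
    if h(hi) >= 0: return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Two second-guessing experts: one dampens Learner's p, one opposes it.
gammas  = [lambda p: 0.5 + 0.4 * (p - 0.5), lambda p: 1.0 - p]
weights = [0.5, 0.5]
q = make_q(gammas, weights)
p_N = dfa_step(q)
print(p_N, q(p_N, 0), q(p_N, 1))   # both values equal 1 at the root
```

Here the normalized weights make E_p q(p, ·) = 1 for every p ∈ (0, 1), so at the bisection root both values of q are at most 1, exactly as (9) requires.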
Recall that Theorem 20 provides us (under Assumptions 1–3) with a continuous proper loss function λ : P(Ω) → MΣ_Λ^η. For any η-mixable game, we have MΣ_Λ^η ⊆ Λ, and due to Theorem 13 we can take this λ and get forecast-continuous q_g with the supermartingale property. For non-mixable games there is no guarantee that such a λ exists: Theorem 13 gives a function λ ranging over ∂Σ_Λ (the boundary of the superprediction set Σ_Λ), which is not necessarily contained in Λ. Moreover, it may happen that even for continuous experts Γ_n^θ : Λ → Λ it is impossible to get any interesting loss bound, for any strategy. Indeed, consider a game where Λ is not connected (e.g., the simple prediction game [25, Example 1] with Λ = {(0, 1), (1, 0)}). Then the example with "opposite" predictions works: the experts just need to map Learner's predictions into another connected component.

For this reason, let us consider a modification of Protocol 2 that changes the sets of predictions allowed for Learner and for Experts. Namely, for the game (Ω, Λ), Experts θ ∈ Θ announce Γ_n^θ : ∂Σ_Λ → Σ_Λ, and Learner announces γ_n ∈ ∂Σ_Λ (the rest of Protocol 2 does not change). We will assume that the game satisfies Assumptions 1 and 2 (for non-compact Λ the boundary ∂Σ_Λ may be empty). The modified protocol usually gives more freedom to Learner: since MΛ ⊆ ∂Σ_Λ, the predictions in Λ \ ∂Σ_Λ are minorized by some better predictions in MΛ. The Experts are allowed to give predictions (which are Γ_n^θ(γ_n)) in the larger set Σ_Λ; however, they need to cope with Learner's predictions from a larger set too.

For the modified protocol, Theorem 26 holds with minimal changes: λ is allowed to range over Σ_Λ instead of Λ, the functions q_g have the supermartingale property for all g ∈ Σ_Λ (instead of g ∈ Λ only), and the Γ_n^θ are continuous functions from ∂Σ_Λ to Σ_Λ; the proof does not change. Theorem 13 provides us with a λ such that the q_g have the required properties.

4.2 The AA for Second-Guessing Experts

In contrast to the DFA, the AA cannot be applied to the second-guessing protocol in a straightforward way. However, the AA can be modified for this case. Recall that the AA is based on the inequality (1), which is already solved for γ_N. In the second-guessing protocol, both sides of this inequality contain γ_N:

$$
\gamma_N(\omega_N) \le -\frac{c}{\eta} \ln\Biggl(\sum_{\theta \in \Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta' \in \Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N, \omega_N)\bigr)\Biggr).
$$

The DFA implicitly solves this inequality in (the proof of) Lemma 4, using a kind of fixed point theorem. We will present a modification of the AA which uses a fixed point theorem explicitly.

A topological space X has the fixed point property if every continuous function f : X → X has a fixed point, that is, ∃x ∈ X f(x) = x. Let us show that if the game (Ω, Λ) satisfies Assumptions 1 and 2 then the set MΣ_Λ^η (the set of minimal points of Σ_Λ^η) has the fixed point property for any η > 0. First consider the homeomorphism from [0, ∞]^Ω to [0, 1]^Ω that maps g ↦ exp(−ηg). As mentioned in Section 2, the set exp(−ηΣ_Λ^η) is convex. It is non-empty due to Assumption 2 and compact due to Assumption 1. Thus, exp(−ηΣ_Λ^η) has the fixed point property by [1, Theorem 4.10], and Σ_Λ^η has the property as its homeomorphic image [1, Theorem 4.1].
Now we need the following technical lemma, proved in the Appendix.

Lemma 27. There is a continuous mapping F : Σ_Λ^η → MΣ_Λ^η such that F(g) ≤ g for any g ∈ Σ_Λ^η.

Remark 28. Essentially, the main content of Lemma 27 is a construction of a continuous substitution function. In many natural games, the standard substitution functions are continuous without additional effort.

The definition of MΣ_Λ^η implies that if F(g) ≤ g then F(g) = g for any g ∈ MΣ_Λ^η, and hence the F defined in the lemma is a retraction (by definition, a continuous mapping from a topological space onto a subset that does not move elements of the subset). Due to [1, Theorem 4.2], since Σ_Λ^η has the fixed point property, its retract MΣ_Λ^η has the fixed point property too.

Theorem 29. Suppose that the game (Ω, Λ) satisfies Assumptions 1 and 2 and is η-mixable. Then for the prediction with second-guessing expert advice protocol, there exists Learner's strategy (a modification of the AA) with parameters η and P_0 guaranteeing that, at each step N and for all experts θ, it holds that

$$
L_N \le L_N^\theta + \frac{1}{\eta} \ln \frac{1}{P_0(\theta)}.
$$

Proof. At step N, the modified AA announces as Learner's prediction γ_N any solution of the following equation with respect to γ ∈ MΣ_Λ^η:

$$
\gamma = F\Biggl(-\frac{1}{\eta} \ln\Biggl(\sum_{\theta \in \Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta' \in \Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma)\bigr)\Biggr)\Biggr), \tag{18}
$$

where the Γ_N^θ are announced by the experts, the weights P_{N−1} are defined in the usual way with the help of the previous losses,

$$
P_{N-1}(\theta) = P_0(\theta) \prod_{n=1}^{N-1} \exp\bigl(-\eta\, \Gamma_n^\theta(\gamma_n, \omega_n)\bigr),
$$

and F is the continuous mapping from Lemma 27. Since for an η-mixable game we have Σ_Λ^η = Σ_Λ, and since MΣ_Λ ⊆ Λ, the functions Γ_N^θ are defined on γ. By the definition of Σ_Λ^η, the argument of F in equation (18) belongs to Σ_Λ^η, and F maps it to MΣ_Λ^η. The mapping is continuous as a composition of continuous mappings. Therefore, since MΣ_Λ^η has the fixed point property, equation (18) has a solution. The property F(g) ≤ g implies that

$$
\gamma_N \le -\frac{1}{\eta} \ln\Biggl(\sum_{\theta \in \Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta' \in \Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N)\bigr)\Biggr),
$$

and the usual analysis of the AA gives us the bound.
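To make (18) concrete, here is a toy sketch (ours) for the binary log-loss game, which is 1-mixable. A log-loss prediction is parameterized by u ∈ [0, 1]; the mixture in (18) of log-loss predictions −ln u_θ(·) is itself a log-loss prediction with parameter Σ_θ w̄_θ u_θ, so F can be taken to be the identity, and (18) becomes the scalar fixed-point equation u = Σ_θ w̄_θ Γ_θ(u), solvable by bisection.

```python
# Toy sketch of the modified AA step (18): binary log-loss game, eta = 1,
# F = identity.  gammas are the experts' continuous second-guessing maps
# [0,1] -> [0,1]; w are the normalized weights P_{N-1}.

def aa_step_binary_logloss(gammas, w, tol=1e-9):
    f = lambda u: sum(wt * g(u) for g, wt in zip(gammas, w))  # mixture parameter
    lo, hi = 0.0, 1.0                 # f(lo) >= lo and f(hi) <= hi always hold
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) >= mid else (lo, mid)
    return 0.5 * (lo + hi)            # u with u = f(u), i.e. a solution of (18)

u = aa_step_binary_logloss([lambda p: 1.0 - p,
                            lambda p: 0.5 + 0.4 * (p - 0.5)],
                           [0.5, 0.5])
print(u)  # the announced prediction is gamma_N = (-ln(1 - u), -ln u)
```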
Let us briefly outline how the construction of Theorem 29 can be applied to non-mixable games under the modified second-guessing protocol (where experts are defined on ∂Σ_Λ). Let the AA be (c, η)-realizable. Now we are looking for γ ∈ ∂Σ_Λ satisfying the following equation:

$$
\gamma = V\Biggl(F\Biggl(-\frac{1}{\eta} \ln\Biggl(\sum_{\theta \in \Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta' \in \Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma)\bigr)\Biggr)\Biggr)\Biggr), \tag{19}
$$

where after F we apply V, the mapping defined in the proof of Lemma 12. Since V is continuous and maps Σ_Λ^η to ∂Σ_Λ, we get a continuous mapping of V(Σ_Λ^η) ⊆ ∂Σ_Λ into itself. It remains to show that V(Σ_Λ^η) has the fixed point property. Similarly to the proof of Lemma 12, consider the set

$$
Z = \{ g \in \Sigma_\Lambda^\eta \mid \forall r \in [0, 1)\ rg \notin \Sigma_\Lambda^\eta \}.
$$

For any g ∈ Σ_Λ^η, there exists a unique r such that rg ∈ Z, and the continuity of the mapping g → rg follows in the same way as in the proof of Lemma 12; thus Z is a retract of Σ_Λ^η and has the fixed point property. Since V(g) = V(rg) for any non-negative real r such that g and rg belong to Σ_Λ^η, we have V(Σ_Λ^η) = V(Z). The definition of Z implies that V is bijective on Z, and again as in the proof of Lemma 12 one can show that the inverse mapping V^{−1} : V(Z) → Z is continuous. Therefore V(Z) has the fixed point property as the homeomorphic image of Z. Let γ_N ∈ V(Σ_Λ^η) be any solution of equation (19). By the properties of F and V, we have

$$
\gamma_N \le -\frac{c}{\eta} \ln\Biggl(\sum_{\theta \in \Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta' \in \Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N)\bigr)\Biggr),
$$

and the usual AA bound follows.

5 Predictions with Respect to Several Loss Functions

In this section, we illustrate the use of the supermartingale technique on another extension of Protocol 1: a game with several loss functions (for a more detailed discussion of this setting see [5]). In contrast to the case of second-guessing experts, it is not yet clear whether the AA can help in this case.

Up to now a game was (Ω, Λ), where Λ was the set of admissible predictions, common for Learner and Experts. Here we return to the game specification by a loss function on the decision space P(Ω). However, now each Expert θ has his own loss function λ_θ. So the game is specified by (Ω, P(Ω), {λ_θ}_{θ∈Θ}), where the λ_θ : P(Ω) × Ω → [0, ∞] are proper loss functions. The sets of predictions Λ(θ) and superpredictions Σ_{Λ(θ)} may be different for different experts θ. The game follows Protocol 3.

Protocol 3 Prediction with Expert Evaluators' Advice
  L_0^{(θ)} := 0, for all θ ∈ Θ.
  L_0^θ := 0, for all θ ∈ Θ.
  for n = 1, 2, … do
    All Experts θ ∈ Θ announce π_n^θ ∈ P(Ω).
    Learner announces π_n ∈ P(Ω).
    Reality announces ω_n ∈ Ω.
    L_n^{(θ)} := L_{n−1}^{(θ)} + λ_θ(π_n, ω_n), for all θ ∈ Θ.
    L_n^θ := L_{n−1}^θ + λ_θ(π_n^θ, ω_n), for all θ ∈ Θ.
  end for

There are two changes in Protocol 3 compared to Protocol 1. The accumulated loss L^θ of each Expert θ is calculated according to his own loss function λ_θ. Learner does not have one accumulated loss anymore: the losses L^{(θ)} of Learner are calculated separately for comparison with each Expert θ, according to the loss function of this Expert. Now it does not make much sense to speak about the best expert: their performance is evaluated by different loss functions, and thus the losses may have different scales. What remains meaningful are bounds, for every expert θ, of the form

$$
L_N^{(\theta)} \le c_\theta L_N^\theta + a_\theta,
$$

where c_θ and a_θ may be different for different experts θ ∈ Θ.

Informally speaking, Protocol 3 describes the following situation. We have some practical task and a number of prediction algorithms (they will be our Experts). Each of them minimizes some loss, maybe different for different algorithms. We do not know which algorithm fits our task best. As usual in practice, we do not have a loss function that measures the quality of predictions for our task; we only know that predictions must be close to the real outcomes. A safe option in this case is to predict in such a way that our predictions are not bad compared to the predictions of any of the algorithms, even if the quality is evaluated by the loss function ascribed to that algorithm.
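The bookkeeping of Protocol 3 is easy to state in code. The sketch below (ours, with placeholder strategy hooks) just runs the loop and maintains the per-expert accounts L^{(θ)} and L^θ.

```python
# Sketch of Protocol 3's bookkeeping (not a full strategy): each expert
# theta has its own loss function, and Learner keeps a separate account
# per expert, charged with Learner's own move under that expert's loss.
# `experts` maps theta -> (loss_fn, predict_fn); `learner` returns pi_n.
# These callables are placeholders for real strategies.

def run_protocol3(experts, learner, outcomes):
    L_learner = {th: 0.0 for th in experts}   # L^{(theta)}
    L_expert  = {th: 0.0 for th in experts}   # L^{theta}
    for n, omega in enumerate(outcomes, start=1):
        moves = {th: predict(n) for th, (_, predict) in experts.items()}
        pi = learner(n, moves)
        for th, (loss, _) in experts.items():
            L_learner[th] += loss(pi, omega)
            L_expert[th]  += loss(moves[th], omega)
    return L_learner, L_expert
```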
The DFA can be adapted to Protocol 3 straightforwardly.

Theorem 30. Suppose that for each θ ∈ Θ there exist reals c_θ ≥ 1 and η_θ > 0 such that the functions

$$
\exp\Bigl(\eta_\theta\Bigl(\frac{\lambda_\theta(\pi, \omega)}{c_\theta} - g(\omega)\Bigr)\Bigr)
$$

(direct analogues of the q_g defined by (13)) are forecast-continuous and have the supermartingale property for all g ∈ Σ_{Λ(θ)}. Then for any initial distribution P_0 ∈ P(Θ) there is Learner's strategy (in fact, the DFA applied to Q_{P_0} defined by (20)) guaranteeing that, at each step N and for all experts θ, it holds that

$$
L_N^{(\theta)} \le c_\theta L_N^\theta + \frac{c_\theta}{\eta_\theta} \ln \frac{1}{P_0(\theta)}.
$$

Proof. Similarly to the proofs of Theorems 5 and 26, we can construct the supermartingale Q_{P_0}:

$$
Q_{P_0}\bigl(\{\pi_1^\theta\}_{\theta \in \Theta}, \pi_1, \omega_1, \ldots, \{\pi_N^\theta\}_{\theta \in \Theta}, \pi_N, \omega_N\bigr) = \sum_{\theta \in \Theta} P_0(\theta) \prod_{n=1}^{N} \exp\Bigl(\eta_\theta\Bigl(\frac{\lambda_\theta(\pi_n, \omega_n)}{c_\theta} - \lambda_\theta(\pi_n^\theta, \omega_n)\Bigr)\Bigr) \tag{20}
$$

and choose π_N satisfying (9) with the help of Lemma 4. The loss bound follows in the same way as in Theorem 5.

Protocol 3 can also handle the following task. We have several experts and several candidate loss functions, and a priori some experts may perform well for two or more of the loss functions. In this case, it is natural to require that Learner's loss be small with respect to every expert and with respect to every loss function. A simple trick reduces this task to Protocol 3: for each original expert (supplying us with a prediction), we consider several new experts who announce the same prediction but use different loss functions. If our predictions are good in the game with these new experts, then our predictions are good in the original game with respect to any of the loss functions.

For example, assume that we want to compete with K experts according to the logarithmic loss function and the square loss function in the game with outcomes {0, 1}. Lemmas 6 and 7 imply that the following function is a forecast-continuous supermartingale:

$$
\frac{1}{2K} \sum_{k=1}^{K} \exp\Bigl(\sum_{n=1}^{N} \bigl(-\ln \pi_n(\omega_n) + \ln \pi_n^k(\omega_n)\bigr)\Bigr) + \frac{1}{2K} \sum_{k=1}^{K} \exp\Bigl(2 \sum_{n=1}^{N} \bigl((\omega_n - \pi_n(1))^2 - (\omega_n - \pi_n^k(1))^2\bigr)\Bigr),
$$

where π_n^k is the prediction of Expert k and π_n is the prediction of Learner. Choosing π_n according to Lemma 4, we can achieve that the regret term with respect to the logarithmic loss function is bounded by ln(2K) < ln K + 0.7, and the regret with respect to the square loss function is bounded by 0.5 ln(2K) < 0.5 ln K + 0.4, which is practically the same as the regrets against K experts that are achievable when we compete with respect to only one of the loss functions.
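This mixed supermartingale is concrete enough to implement. The sketch below (ours; the expert predictions are placeholders) maintains the weights of the 2K terms above for binary outcomes and uses the bisection idea from Subsection 3.8's restricted setting to choose π_n.

```python
import math

def step_factors(p, pk_log, pk_sq, omega):
    """One-step multipliers of the 2K supermartingale terms above."""
    pr = p if omega == 1 else 1.0 - p           # Learner's log-loss probability
    log_terms = [(pk if omega == 1 else 1.0 - pk) / pr for pk in pk_log]
    sq_terms = [math.exp(2.0 * ((omega - p) ** 2 - (omega - pk) ** 2))
                for pk in pk_sq]
    return log_terms + sq_terms

def choose_p(weights, pk_log, pk_sq, eps=1e-6, tol=1e-9):
    """Bisection for p with q(p, w) <= sum(weights) for w = 0, 1."""
    q = lambda p, w: sum(wt * f for wt, f in
                         zip(weights, step_factors(p, pk_log, pk_sq, w)))
    h = lambda p: q(p, 1) - q(p, 0)
    lo, hi = eps, 1.0 - eps
    if h(lo) <= 0: return lo
    if h(hi) >= 0: return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# K = 2 experts, the same predictions evaluated under both losses:
preds = [0.3, 0.8]
weights = [1.0 / (2 * len(preds))] * (2 * len(preds))
p_n = choose_p(weights, preds, preds)
print(p_n)
# after Reality announces omega_n, update:
# weights = [wt * f for wt, f in
#            zip(weights, step_factors(p_n, preds, preds, omega_n))]
```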
Acknowledgements

This work was partly supported by EPSRC grant EP/F002998/1. Discussions with Alex Gammerman, Glenn Shafer, and Alexander Shen, and the detailed comments of the anonymous referees for the conference version [4] and for a journal submission have helped us improve the paper.

References

[1] R. Agarwal, M. Meehan, D. O'Regan. Fixed Point Theory and Applications, volume 141 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, England, 2001.

[2] D. Blackwell, M. A. Girshick. Theory of Games and Statistical Decisions. Wiley, New York, 1954.

[3] A. Blum, Y. Mansour. From External to Internal Regret. Journal of Machine Learning Research, 8:1307–1324, 2007.

[4] A. Chernov, Y. Kalnishkan, F. Zhdanov, V. Vovk. Supermartingales in Prediction with Expert Advice. In: Y. Freund, L. Györfi, G. Turán, T. Zeugmann (eds.), ALT 2008 Proceedings, LNCS(LNAI) vol. 5254, pp. 199–213. Springer, 2008.

[5] A. Chernov, V. Vovk. Prediction with Expert Evaluators' Advice. In: R. Gavaldà, G. Lugosi, T. Zeugmann, S. Zilles (eds.), ALT 2009 Proceedings, LNCS(LNAI) vol. 5809, pp. 8–22. Springer, 2009. (See also Technical Report arXiv:0902.4127v1 [cs.LG].)

[6] N. Cesa-Bianchi, G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, England, 2006.

[7] A. P. Dawid. The Geometry of Proper Scoring Rules. Annals of the Institute of Statistical Mathematics, 59:77–93, 2007.

[8] D. Foster, R. Vohra. Asymptotic Calibration. Biometrika, 85:379–390, 1998.

[9] D. Foster, R. Vohra. Regret in the On-line Decision Problem. Games and Economic Behavior, 29:104–130, 1999.

[10] P. Gács. Uniform Test of Algorithmic Randomness over a General Space. Theoretical Computer Science, 341:91–137, 2005.

[11] P. Gács. Lecture Notes on Descriptional Complexity and Randomness. Unpublished, available online at http://www.cs.bu.edu/faculty/gacs/papers/ait-notes.pdf.

[12] T. Gneiting, A. E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102:359–378, 2007.

[13] P. D. Grünwald, A. P. Dawid. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Annals of Statistics, 32(4):1367–1433, 2004.

[14] D. Haussler, J. Kivinen, M. Warmuth. Sequential Prediction of Individual Sequences under General Loss Functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.

[15] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58:13–30, 1963.

[16] L. Levin. Uniform Tests of Randomness. Soviet Mathematics Doklady, 17:337–340, 1976. The Russian original: Doklady AN SSSR, 227(1), 1976.

[17] M. Li, P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, 2nd edition. Springer, New York, 1997.

[18] Y. Kalnishkan, V. Vovk, M. V. Vyugin. Loss Functions, Complexities, and the Legendre Transformation. Theoretical Computer Science, 313(2):195–207, 2004.

[19] R. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[20] L. J. Savage. Elicitation of Personal Probabilities and Expectations. Journal of the American Statistical Association, 66:783–801, 1971.

[21] G. Shafer, V. Vovk. Probability and Finance: It's Only a Game! Wiley, New York, 2001.

[22] G. Stoltz, G. Lugosi. Internal Regret in On-Line Portfolio Selection. Machine Learning, 59:125–159, 2005.

[23] G. Stoltz, G. Lugosi. Learning Correlated Equilibria in Games with Compact Sets of Strategies. Games and Economic Behavior, 59:187–209, 2007.

[24] V. Vovk. Aggregating Strategies. In: M. Fulk, J. Case (eds.), Proceedings of the Third Annual Workshop on Computational Learning Theory, pp. 371–383. Morgan Kaufmann, San Mateo, CA, 1990.

[25] V. Vovk. A Game of Prediction with Expert Advice. Journal of Computer and System Sciences, 56:153–173, 1998.

[26] V. Vovk. Defensive Prediction with Expert Advice. In: S. Jain, H. Simon, E. Tomita (eds.), ALT 2005 Proceedings, LNCS(LNAI) vol. 3734, pp. 444–458. Springer, 2005. (See also: Competitive On-line Learning with a Convex Loss Function. Technical Report arXiv:cs/0506041v3 [cs.LG], arXiv.org e-Print archive, September 2005.)

[27] V. Vovk. On-line Regression Competitive with Reproducing Kernel Hilbert Spaces. In: J. Cai, S. Barry Cooper, A. Li (eds.), TAMC 2006 Proceedings, LNCS(LNAI) vol. 3959, pp. 452–463. Springer, 2006.
(See also Technical Report arXiv:cs/0511058v2 [cs.LG], arXiv.org e-Print archive, January 2006.)

[28] V. Vovk. Metric Entropy in Competitive On-line Prediction. Technical Report arXiv:cs/0609045v1 [cs.LG], arXiv.org e-Print archive, September 2006.

[29] V. Vovk. Defensive Forecasting for Optimal Prediction with Expert Advice. Technical Report arXiv:0708.1503 [cs.LG], arXiv.org e-Print archive, August 2007.

[30] V. Vovk. Continuous and Randomized Defensive Forecasting: Unified View. Technical Report arXiv:0708.2353v2 [cs.LG], arXiv.org e-Print archive, August 2007.

[31] V. Vovk, F. Zhdanov. Prediction with Expert Advice for the Brier Game. In: ICML '08: Proceedings of the 25th International Conference on Machine Learning, pp. 1104–1111, 2008.

Appendix

Proof of Lemma 8. Given the function q, let us define the following function φ on P(Ω) × P(Ω):

$$
\varphi(\pi', \pi) = E_{\pi'}\, q(\pi, \cdot).
$$

For each fixed π′, the function φ(π′, ·) is continuous, since q is continuous. For each fixed π, the function φ(·, π) is linear, and thus concave. Note also that P(Ω) is a convex compact set. Therefore, φ satisfies the conditions of Ky Fan's minimax theorem (see, e.g., [1, Theorem 11.4]), and thus there exists π̃ ∈ P(Ω) such that for any π′ ∈ P(Ω) it holds that

$$
E_{\pi'}\, q(\tilde\pi, \cdot) = \varphi(\pi', \tilde\pi) \le \sup_{\pi \in P(\Omega)} \varphi(\pi, \pi) = \sup_{\pi \in P(\Omega)} E_\pi\, q(\pi, \cdot) \le C. \tag{21}
$$

It is easy to see that π̃ has the property that the lemma must guarantee: q(π̃, ω) ≤ C for all ω ∈ Ω. Indeed, if we substitute the distribution δ_ω (which is concentrated on ω) for π′ in (21), the left-hand side becomes just q(π̃, ω).

Lemma 8 is a very important statement in our supermartingale framework, so let us outline an alternative proof of it (for details see [10, Theorem 6], [11, Theorem 16.1], or [30, Theorem 1]). Consider the sets F_ω = {π | q(π, ω) ≤ C}. These sets are closed, and for any Ω_0 ⊆ Ω the union ∪_{ω∈Ω_0} F_ω contains all the measures concentrated on Ω_0. Then all the F_ω have a non-empty intersection by Sperner's lemma.

Lemma 31. Let a function q : P◦(Ω) × Ω → R be non-negative and forecast-continuous on P◦(Ω). Suppose that for any π ∈ P◦(Ω) it holds that E_π q(π, ·) ≤ C, where C ∈ [0, ∞) is some constant. Then for any ε > 0 it holds that

$$
\exists \pi \in P^\circ(\Omega)\ \forall \omega \in \Omega \quad q(\pi, \omega) \le (1 + \varepsilon) C.
$$

Proof. Let δ ∈ (0, 1) be a constant to be chosen later. Let P_δ(Ω) = {π ∈ P(Ω) | ∀ω ∈ Ω π(ω) ≥ δ}. This set is a non-empty convex compact subset of P◦(Ω). Repeating the construction from the proof of Lemma 8 and applying Ky Fan's theorem to the function on P_δ(Ω), we get that there exists π̃ ∈ P_δ(Ω) such that for any π′ ∈ P_δ(Ω) it holds that E_{π′} q(π̃, ·) ≤ C. For each ω_0, consider the distribution π_{δ,ω_0} such that π_{δ,ω_0}(ω) = δ for ω ≠ ω_0 and π_{δ,ω_0}(ω_0) = 1 − δ(|Ω| − 1). Substituting π_{δ,ω_0} for π′, we get

$$
\bigl(1 - \delta(|\Omega| - 1)\bigr)\, q(\tilde\pi, \omega_0) + \delta \sum_{\omega \ne \omega_0} q(\tilde\pi, \omega) \le C.
$$

Since q(π̃, ω) ≥ 0 (q is non-negative by assumption), the last inequality implies that (1 − δ(|Ω| − 1)) q(π̃, ω_0) ≤ C. It remains to note that we can choose δ so small that 1/(1 − δ(|Ω| − 1)) ≤ 1 + ε.

Proof of Lemma 21. According to Lemma 31, we can find π_k ∈ P◦(Ω) such that

$$
\forall \omega \in \Omega \quad q(\pi_k, \omega) \le \Bigl(1 + \frac{1}{k}\Bigr) C.
$$
Since P(Ω) is compact, there exists a strictly increasing index sequence k(j), j ∈ N, such that the sequence π_{k(j)} converges to some π ∈ P(Ω). The points g_j = q(π_{k(j)}, ·) belong to the compact set [0, 2C]^Ω. Hence there exists a strictly increasing index sequence j(i), i ∈ N, such that the sequence g_{j(i)} converges to some g_0. For every ω ∈ Ω, we have g_j(ω) = q(π_{k(j)}, ω) ≤ (1 + 1/k(j)) C, therefore

$$
g_0(\omega) = \lim_i g_{j(i)}(\omega) \le \lim_i \Bigl(1 + \frac{1}{k(j(i))}\Bigr) C = C.
$$

It remains to set π^(i) = π_{k(j(i))} and note that q(π^(i), ω) = g_{j(i)}(ω).

Proof of Lemma 12. Let 0 : Ω → [0, ∞] be the constant zero function (that is, 0(ω) = 0 for all ω ∈ Ω). If 0 ∈ Σ_Λ then 0 ∈ ∂Σ_Λ and we can let V(g) = 0 for any g. Assume that 0 ∉ Σ_Λ. Let V(g) = R(g) g, where R : Σ_Λ^η → (0, c] is defined by the following rule:

$$
R(g) = \min\{ r \in (0, c] \mid rg \in \Sigma_\Lambda \} \quad \text{for any } g \in \Sigma_\Lambda^\eta.
$$

Since the AA is (c, η)-realizable, it holds that cΣ_Λ^η ⊆ Σ_Λ, that is, cg ∈ Σ_Λ for any g ∈ Σ_Λ^η. The minimum is attained since Σ_Λ is compact (by Assumption 1). Thus R(g) is well defined. It is obvious from the definition that V(g) = R(g) g belongs to the boundary ∂Σ_Λ of Σ_Λ for all g ∈ Σ_Λ^η.

It remains to check that V(g) = R(g) g is continuous in g. We prove that R is continuous; namely, we take any g_i → g_0 and, for any infinite subsequence {g_{i_k}}, we show that if R(g_{i_k}) converges then lim_k R(g_{i_k}) = R(g_0). If R(g_{i_k}) converges then R(g_{i_k}) g_{i_k} converges, and lim_k R(g_{i_k}) g_{i_k} = (lim_k R(g_{i_k})) g_0 ∈ Σ_Λ since Σ_Λ is compact. Therefore R(g_0) ≤ lim_k R(g_{i_k}). For the other inequality, consider

$$
R(g' \mid g) = \min\{ r \in (0, \infty) \mid rg' \ge V(g) \}
$$

for g, g′ ∈ Σ_Λ^η such that if g(ω) ≠ 0 for some ω ∈ Ω then g′(ω) ≠ 0 too. Clearly, the function R(g′ | g) is continuous in g′ for any fixed g (note that R(g′ | g) = max_{ω : g(ω) ≠ 0} V(g)(ω)/g′(ω)), and R(g | g) = R(g). Since rg′ ≥ V(g) ∈ Σ_Λ implies rg′ ∈ Σ_Λ, we have R(g′ | g) ≥ R(g′). In particular, R(g_{i_k} | g_0) ≥ R(g_{i_k}) (assuming k large enough so that g_0(ω) ≠ 0 implies g_{i_k}(ω) ≠ 0) and R(g_0) = lim_k R(g_{i_k} | g_0) ≥ lim_k R(g_{i_k}).

Proof of Lemma 14. Assume that λ_1(π, ω_0) ≠ λ_2(π, ω_0) and π(ω_0) > 0 for some ω_0 ∈ Ω. Since λ_1(π, ·) and λ_2(π, ·) belong to Σ_Λ^η, the point

$$
g = -\frac{1}{\eta} \ln \frac{e^{-\eta \lambda_1(\pi, \cdot)} + e^{-\eta \lambda_2(\pi, \cdot)}}{2}
$$

also belongs to Σ_Λ^η by the definition of Σ_Λ^η. For any reals x, y, we have (e^x + e^y)/2 ≥ e^{(x+y)/2}, and the inequality is strict if x ≠ y. Therefore, g(ω) ≤ (λ_1(π, ω) + λ_2(π, ω))/2 for all ω ∈ Ω, and g(ω_0) < (λ_1(π, ω_0) + λ_2(π, ω_0))/2. Multiplying these inequalities by π(ω) and summing over all ω ∈ Ω, we get

$$
E_\pi g < \tfrac{1}{2}\bigl(E_\pi \lambda_1(\pi, \cdot) + E_\pi \lambda_2(\pi, \cdot)\bigr)
$$

(recall that π(ω_0) > 0). Since λ_1 and λ_2 are proper with respect to Σ_Λ^η, we have E_π λ_1(π, ·) ≤ E_π g and E_π λ_2(π, ·) ≤ E_π g. Hence we get the contradiction E_π g < E_π g.
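The strict mixing step in the proof of Lemma 14 is easy to check numerically; the following sketch (ours, with arbitrary test vectors) compares the mixed point g coordinatewise with the average of two loss vectors.

```python
import numpy as np

# Numeric illustration of the mixing step in the proof of Lemma 14:
# g = -(1/eta) ln((exp(-eta*l1) + exp(-eta*l2)) / 2) is coordinatewise
# at most (l1 + l2)/2, strictly where l1 and l2 differ.

eta = 2.0
l1 = np.array([0.5, 1.2, 0.7])
l2 = np.array([0.5, 0.3, 2.0])
g = -np.log((np.exp(-eta * l1) + np.exp(-eta * l2)) / 2) / eta
print((l1 + l2) / 2 - g)  # nonnegative; zero (up to rounding) where l1 == l2
```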
For a convex function U : R^Ω → [−∞, ∞], a subgradient at a point x ∈ R^Ω is a point x* ∈ R^Ω such that

$$
\forall z \in \mathbf{R}^\Omega \quad U(z) \ge U(x) + \langle x^*, z - x \rangle.
$$

Lemma 32. Suppose that Y is a non-empty closed convex subset of [0, ∞)^Ω. Let U : R^Ω → (−∞, ∞] be the function U(x) = −inf_{y∈Y} ⟨x, y⟩, where ⟨x, y⟩ = Σ_{ω∈Ω} x(ω) y(ω) is the scalar product in R^Ω. Then U(x) is a convex function; for any π ∈ P(Ω) it holds that U(π) < ∞; and π* is a subgradient of U at the point π if and only if −π* ∈ Y and ⟨π, −π*⟩ = −U(π).

Proof. Since Y is not empty, the infimum is not +∞, and therefore U(x) > −∞ for all x. For any α ∈ [0, 1] and any x_1, x_2 ∈ R^Ω, we have

$$
U(\alpha x_1 + (1 - \alpha) x_2) = -\inf_{y \in Y} \bigl( \alpha \langle x_1, y \rangle + (1 - \alpha) \langle x_2, y \rangle \bigr) \le -\inf_{y \in Y} \alpha \langle x_1, y \rangle - \inf_{y \in Y} (1 - \alpha) \langle x_2, y \rangle = \alpha U(x_1) + (1 - \alpha) U(x_2),
$$

thus U is convex. Let us fix some π ∈ P(Ω). Then ⟨π, y⟩ ≥ 0 for all y ∈ Y, and U(π) ≤ 0 < ∞.

Let −π* ∈ Y and ⟨π, −π*⟩ = −U(π). Then

$$
U(\pi) + \langle \pi^*, z - \pi \rangle = -\langle -\pi^*, z \rangle \le -\inf_{y \in Y} \langle z, y \rangle = U(z)
$$

for any z, thus π* is a subgradient of U at π.

Let π* be any subgradient of U at π. Assume that −π* ∉ Y. Then −π* and Y can be strongly separated by Corollary 11.4.2 in [19], and Theorem 11.1(c) there implies that there exists z ∈ R^Ω such that inf_{y∈Y} ⟨y, z⟩ > ⟨−π*, z⟩. Let us choose δ > 0 such that inf_{y∈Y} ⟨y, z⟩ > δ + ⟨−π*, z⟩, and then choose y_0 ∈ Y such that ⟨π + z, y_0⟩ < inf_{y∈Y} ⟨π + z, y⟩ + δ. From the definition of the subgradient, we get U(π + z) ≥ U(π) + ⟨π*, z⟩, and thus

$$
\langle \pi + z, y_0 \rangle - \delta < \inf_{y \in Y} \langle \pi + z, y \rangle \le \inf_{y \in Y} \langle \pi, y \rangle + \langle -\pi^*, z \rangle \le \langle \pi, y_0 \rangle + \langle -\pi^*, z \rangle.
$$

So ⟨z, y_0⟩ < δ + ⟨−π*, z⟩, which contradicts the choice of δ. This means that −π* ∈ Y. It remains to note that the definition of the subgradient implies U(0) ≥ U(π) + ⟨π*, 0 − π⟩, and since U(0) = 0, we get inf_{y∈Y} ⟨y, π⟩ = −U(π) ≥ ⟨−π*, π⟩.

Proof of Lemma 15. By Assumption 2, there exists a finite point g_fin in Σ_Λ^η ∩ [0, ∞)^Ω, for which E_π g_fin is finite for any π. By Assumption 1, Σ_Λ^η is compact, and therefore the minimum is attained for all π ∈ R^Ω. Thus H is well defined. Note also that H(π) ≥ 0 for π ∈ P(Ω) and H(π) = −∞ if π(ω) < 0 for some ω ∈ Ω.

Now let us show that

$$
H(\pi) = \inf_{g \in \Sigma_\Lambda^\eta \cap [0, \infty)^\Omega} E_\pi g.
$$

Again by Assumption 2, the infimum is taken over a non-empty set. If π(ω) < 0 for some ω ∈ Ω then H(π) = −∞ and the infimum is equal to −∞ as well. Thus we need to consider only the case when π(ω) ≥ 0 for all ω ∈ Ω and the minimum in the definition of H is attained at a point g such that g(ω) = ∞ for some values of ω. Note that for these ω we have π(ω) = 0, since H(π) < ∞. Choose a sequence g_n ∈ Σ_Λ^η ∩ [0, ∞)^Ω that converges to g (for example, consider the segment between the points e^{−ηg} and e^{−ηg_fin}, and take a sequence e^{−ηg_n} along this segment). Since g_n(ω) and g(ω) are finite for non-zero π(ω), we get E_π g_n → E_π g = H(π), and thus the infimum is not greater than H(π).

Now we can apply Lemma 32 with Y = Σ_Λ^η ∩ [0, ∞)^Ω and U(π) = −H(π). It implies that for any π ∈ P(Ω), the set of subgradients of U at π consists of the points −g, where g attains the infimum of E_π g over g ∈ Σ_Λ^η ∩ [0, ∞)^Ω. If π ∈ P◦(Ω), the infimum is indeed attained, and the minimizer is unique by Lemma 14. By Theorem 25.1 in [19], the function H is differentiable at π, and the point λ(π, ·) = arg min_{g ∈ Σ_Λ^η} E_π g is the gradient of H. Thus the set of points where H is differentiable includes P◦(Ω).
On the other hand, wherever H is differentiable, the set of subgradients consists of one element only, the gradient. Theorem 25.5 in [19] then implies that the gradient mapping π ↦ λ(π, ·) is continuous on the set where H is differentiable, and in particular on P◦(Ω).

Proof of Lemma 18. Due to Assumption 1, MΣ_Λ^η is compact and therefore contains all its limit points; that is, γ ∈ MΣ_Λ^η. Let E_{π′}^+ g be shorthand for Σ_{ω∈Ω, π(ω)≠0} π′(ω) g(ω) for any g ∈ [0, ∞]^Ω and π′ ∈ P(Ω). By definition, E_π γ = E_π^+ γ. Note first that E_{π_i} g converges to E_π g for any finite g ∈ [0, ∞)^Ω. Note also that E_{π_i}^+ γ_i converges to E_π^+ γ. Indeed, E_{π_i}^+ γ_i ≤ E_{π_i} γ_i ≤ E_{π_i} g_fin ≤ Σ_{ω∈Ω} g_fin(ω) < ∞, where g_fin ∈ Σ_Λ^η ∩ [0, ∞)^Ω exists by Assumption 2. If π(ω) ≠ 0 then π_i(ω) is separated from 0 for sufficiently large i; therefore the γ_i(ω) are bounded, and their limit γ(ω) is finite. And for finite limits γ and π, the convergence is trivial.

Fix any g_0 ∈ [0, ∞)^Ω and any ε > 0. For sufficiently large i, we have E_{π_i}^+ γ_i ≥ E_π^+ γ − ε and E_{π_i} g_0 ≤ E_π g_0 + ε. Taking into account that E_{π_i}^+ γ_i ≤ E_{π_i} γ_i and E_{π_i} γ_i = min_{g ∈ Σ_Λ^η} E_{π_i} g ≤ E_{π_i} g_0, we get E_π γ ≤ E_π g_0 + 2ε. Since ε and g_0 are arbitrary, we have

$$
E_\pi \gamma \le \inf_{g \in \Sigma_\Lambda^\eta \cap [0, \infty)^\Omega} E_\pi g,
$$

and the last infimum can be replaced by min_{g ∈ Σ_Λ^η} as shown in the proof of Lemma 15.

Proof of Lemma 27. We construct a continuous mapping F : Σ_Λ^η → MΣ_Λ^η as a composition of mappings F_ω over all ω ∈ Ω. Each F_ω, when applied to g ∈ Σ_Λ^η, preserves the values g(o) for o ≠ ω and decreases as far as possible the value g(ω) so that the result is still in Σ_Λ^η. Formally, F_ω(g) = g′ such that g′(o) = g(o) for o ≠ ω and

$$
g'(\omega) = \min\{ \gamma(\omega) \mid \gamma \in \Sigma_\Lambda^\eta,\ \forall o \ne \omega\ \gamma(o) = g(o) \}.
$$

Let us show that each F_ω is continuous. It suffices to show that F_ω(g)(ω) depends continuously on g, since the other coordinates do not change. We will show that F_ω(g)(ω) is convex in g; continuity follows (see, e.g., [19]). Indeed, take any t ∈ [0, 1] and g_1, g_2 ∈ Σ_Λ^η. Since Σ_Λ^η is convex, both tg_1 + (1 − t)g_2 ∈ Σ_Λ^η and tF_ω(g_1) + (1 − t)F_ω(g_2) ∈ Σ_Λ^η. The latter point has all the coordinates o ≠ ω the same as the former. Thus, by the definition of F_ω, we get

$$
F_\omega\bigl(tg_1 + (1 - t)g_2\bigr)(\omega) \le \bigl(tF_\omega(g_1) + (1 - t)F_\omega(g_2)\bigr)(\omega) = tF_\omega(g_1)(\omega) + (1 - t)F_\omega(g_2)(\omega),
$$

which was to be shown. None of the F_ω increases any coordinate. Since the set Σ_Λ^η contains, with any point g, all its majorants, F_ω(g_1) = g_1 implies that F_ω(g_2) = g_2 for any g_2 obtained from g_1 by applying any F_{ω′}. Therefore, the image of the composition of the F_ω over all ω ∈ Ω is included in MΣ_Λ^η.