Online variants of the cross-entropy method

The cross-entropy method is a simple but efficient method for global optimization. In this paper we provide two online variants of the basic CEM, together with a proof of convergence.

Authors: István Szita, András Lőrincz

ONLINE VARIANTS OF THE CROSS-ENTROPY METHOD

ISTVÁN SZITA AND ANDRÁS LŐRINCZ

Abstract. The cross-entropy method [2] is a simple but efficient method for global optimization. In this paper we provide two online variants of the basic CEM, together with a proof of convergence.

1. Introduction

It is well known that the cross-entropy method (CEM) [2] has similarities to many other selection-based methods, such as genetic algorithms, estimation-of-distribution algorithms, ant colony optimization, and maximum likelihood parameter estimation. In this paper we provide two online variants of the basic CEM. The online variants reveal similarities to several other optimization methods like stochastic gradient descent or simulated annealing. However, it is not our aim to analyze the similarities and differences between these methods, nor to argue that one method is superior to the other. Here we provide asymptotic convergence results for the new CE variants, which are online.

2. The algorithms

2.1. The basic CE method. The cross-entropy method is shown in Figure 1. For an explanation of the algorithm and its derivation, see e.g. [2]. Extensions of the method allow various generalizations, e.g., decreasing α, varying population size, added noise, etc. In this paper we restrict our attention to the basic algorithm.

2.2. CEM for the combinatorial optimization task. Consider the following problem:

The combinatorial optimization task. Let n ∈ N, D = {0,1}^n and f : D → R. Find a vector x* ∈ D such that x* = arg max_{x ∈ D} f(x).

To apply the CE method to this problem, let the distribution g be the product of n independent Bernoulli distributions with parameter vector p_t ∈ [0,1]^n, and set the initial parameter vector to p_0 = (1/2, ..., 1/2). For Bernoulli distributions, the parameter update is done by the simple procedure shown in Figure 2.
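To make the parametrization concrete, drawing a candidate solution from g(p), the product of n independent Bernoulli distributions, is a one-liner. This is our own illustrative sketch (the paper itself gives only pseudocode); the function name and parameters are not from the source.

```python
import random

def draw_sample(p, rng):
    # one draw from g(p): bit i is 1 with probability p_i, independently
    return [1 if rng.random() < pi else 0 for pi in p]

rng = random.Random(0)
x = draw_sample([0.5] * 8, rng)   # a candidate solution in {0,1}^8
```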
2.3. Online CEM. The algorithm performs batch updates: the sampling distribution is updated once, after drawing and evaluating N samples. We shall transform this algorithm into an online one. Batch processing is used in two steps of the algorithm:

• in the update of the distribution g_t, and
• when the elite threshold is computed (which includes the sorting of the N samples of the last episode).

[Affiliation footnote: Department of Information Systems, Faculty of Informatics, Eötvös Loránd University, Pázmány Péter sétány 1/C, H-1117 Budapest, Hungary. Emails: szityu@gmail.com, andras.lorincz@elte.hu.]

% inputs:
%   population size N
%   selection ratio ρ
%   smoothing factor α
%   number of iterations T

p_0 := initial distribution parameters
for t from 0 to T − 1,
    % draw N samples and evaluate them
    for i from 1 to N,
        draw x^(i) from distribution g(p_t)
        f_i := f(x^(i))
    sort {(x^(i), f_i)} in descending order w.r.t. f_i
    % compute new elite threshold level
    γ_{t+1} := f_{⌈ρ·N⌉}
    % get elite samples
    E_{t+1} := {x^(i) | f_i ≥ γ_{t+1}}
    p_{t+1} := CEBatchUpdate(E_{t+1}, p_t, α)
end loop

Figure 1. The basic cross-entropy method.

procedure p_{t+1} := CEBatchUpdate(E, p_t, α)
    % E: set of elite samples
    % p_t: current parameter vector
    % α: smoothing factor
    N_b := ⌈ρ·N⌉
    p′ := ( Σ_{x ∈ E} x ) / N_b
    p_{t+1} := (1 − α)·p_t + α·p′

Figure 2. The batch cross-entropy update for Bernoulli parameters.

As a first step, note that the contribution of a single sample to the distribution update is α_1 := α/⌈ρ·N⌉ if the sample is contained in the elite set, and zero otherwise. We can perform this update immediately after generating the sample, provided that we know whether it is an elite sample or not. To decide this, we have to wait until the end of the episode.
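The pseudocode of Figures 1 and 2 translates almost line for line into Python. The sketch below is ours, not the authors' code: f is maximized (consistent with the descending sort), the one-max objective used in the test is an illustrative choice, and since ties at the threshold may enlarge the elite set slightly, we divide by the actual elite count rather than ⌈ρ·N⌉.

```python
import math
import random

def ce_batch_update(elite, p, alpha):
    # Figure 2: p' := mean of the elite samples; p_{t+1} := (1-alpha) p_t + alpha p'
    mean = [sum(x[i] for x in elite) / len(elite) for i in range(len(p))]
    return [(1 - alpha) * pi + alpha * mi for pi, mi in zip(p, mean)]

def ce_batch(f, n, N=100, rho=0.1, alpha=0.7, T=30, seed=0):
    # Figure 1: basic CEM over {0,1}^n with product-Bernoulli sampling
    rng = random.Random(seed)
    p = [0.5] * n
    Ne = math.ceil(rho * N)
    for t in range(T):
        samples = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(N)]
        scored = sorted(((f(x), x) for x in samples), reverse=True)
        gamma = scored[Ne - 1][0]                 # new elite threshold level
        elite = [x for fx, x in scored if fx >= gamma]
        p = ce_batch_update(elite, p, alpha)
    return p
```

With `f = sum` (one-max), the parameter vector drifts toward the all-ones optimum.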
However, with a small modification we can get an answer immediately: we can check whether the new sample is among the best ρ-percentile of the last N samples. This corresponds to a sliding window of length N. Algorithmically, we can implement this as a queue Q with at most N elements. The algorithm is summarized in Figure 3.

% inputs:
%   window size N
%   selection ratio ρ
%   smoothing factor α
%   number of samples K

p_0 := initial distribution parameters
Q := {}
for t from 0 to K − 1,
    % draw one sample and evaluate it
    draw x^(t) from distribution g(p_t)
    f_t := f(x^(t))
    % add sample to queue
    Q := Q ∪ {(t, x^(t), f_t)}
    if LengthOf(Q) > N,
        % no updates until we have collected N samples
        delete oldest element of Q
        % compute new elite threshold level
        {f′_t} := sort f-values in Q in descending order
        γ_{t+1} := f′_{⌈ρ·N⌉}
        if f(x^(t)) ≥ γ_{t+1} then
            % x^(t) is an elite sample
            p_{t+1} := CEOnlineUpdate(x^(t), p_t, α/⌈ρ·N⌉)
        endif
    endif
end loop

Figure 3. Online cross-entropy method, first variant.

For Bernoulli distributions, the parameter update is done by the simple procedure shown in Figure 4.

procedure p_{t+1} := CEOnlineUpdate(x, p_t, α_1)
    % x: elite sample
    % p_t: current parameter vector
    % α_1: stepsize
    p_{t+1} := (1 − α_1)·p_t + α_1·x

Figure 4. The online cross-entropy update for Bernoulli parameters.

Note that the behavior of this modified algorithm is slightly different from that of the batch version, as the following example highlights: suppose that the population size is N = 100, and we have just drawn the 114th sample. In the batch version, we check whether this sample belongs to the elite of the set {x^(101), ..., x^(200)} (after all of these samples are known), while in the online version it is checked against the set {x^(14), ..., x^(114)} (which is known immediately).
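A sketch of the first online variant (Figures 3 and 4), again under our own illustrative assumptions (maximization; the one-max objective appears only in the test). Re-sorting the window each step is O(N log N) for clarity; keeping the window in a sorted structure would bring this down.

```python
import math
import random
from collections import deque

def ce_online_update(x, p, alpha1):
    # Figure 4: p_{t+1} := (1 - alpha_1) p_t + alpha_1 x
    return [(1 - alpha1) * pi + alpha1 * xi for pi, xi in zip(p, x)]

def ce_online_sliding(f, n, N=100, rho=0.1, alpha=0.7, K=3000, seed=0):
    # Figure 3: a queue of the last N f-values replaces the batch episode
    rng = random.Random(seed)
    p = [0.5] * n
    Ne = math.ceil(rho * N)
    Q = deque()
    for t in range(K):
        x = [1 if rng.random() < pi else 0 for pi in p]
        fx = f(x)
        Q.append(fx)
        if len(Q) > N:
            Q.popleft()                               # sliding window of length N
            gamma = sorted(Q, reverse=True)[Ne - 1]   # elite threshold level
            if fx >= gamma:                           # x is an elite sample
                p = ce_online_update(x, p, alpha / Ne)
    return p
```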
2.4. Online CEM, memoryless version. The sliding-window online CEM algorithm (Fig. 3) is fully incremental in the sense that each sample is processed immediately, and the per-sample processing time does not increase with increasing t. However, processing time (and required memory) does depend on the size of the sliding window N: in order to determine the elite threshold level γ_t, we have to store the last N samples and sort them.¹ In some applications (for example, when a connectionist implementation is sought), this requirement is not desirable. We shall simplify the algorithm further, so that both the memory requirement and the processing time are constant. This simplification comes at a cost: the performance of the new variant will depend on the range and distribution of the sample values.

Consider now the sample at position N_e = ⌈ρ·N⌉, the value of which determines the threshold. The key observation is that its position cannot change arbitrarily in a single step. First of all, there is a small chance that it will be removed from the queue as the oldest sample. Neglecting this small-probability event, the position of the threshold sample can either jump up or down one place or remain unchanged. More precisely, there are four possible cases, depending on (1) whether the new sample belongs to the elite, and (2) whether the sample that just drops out of the queue belonged to the elite:

(A) Both the new sample and the dropout sample are elite. The threshold position remains unchanged, and so does the threshold level, except with a small probability when the new or the dropout sample was exactly at the boundary. We will ignore this small-probability event.
(B) The new sample is elite but the dropout sample is not. The threshold level increases to γ_{t+1} := γ_t + f_{N_e+1} − f_{N_e} (ignoring a low-probability event).

(C) Neither the new sample nor the dropout sample is elite. The threshold remains unchanged (with high probability).

(D) The new sample is not elite but the dropout sample is. The threshold level decreases to γ_{t+1} := γ_t + f_{N_e−1} − f_{N_e}.

Let F_t denote the σ-algebra generated by knowing all random outcomes up to time step t. Assuming that the positions of the new sample and the dropout sample are distributed uniformly, we get

E(γ_{t+1} | F_t, new sample is elite)
    = γ_t + Pr(case A)·0 + Pr(case B)·E(f_{N_e+1} − f_{N_e} | F_t)
    ≈ γ_t + (1 − ρ)·E(f_{N_e+1} − f_{N_e} | F_t)
    = γ_t + (1 − ρ)·∆_t,

where we introduced the notation ∆_t = E(f_{N_e+1} − f_{N_e} | F_t). Similarly,

E(γ_{t+1} | F_t, new sample is not elite)
    = γ_t + Pr(case C)·0 + Pr(case D)·E(f_{N_e−1} − f_{N_e} | F_t)
    ≈ γ_t + ρ·E(f_{N_e−1} − f_{N_e} | F_t)
    ≈ γ_t − ρ·∆_t,

using the approximation E(f_{N_e−1} − f_{N_e} | F_t) ≈ −E(f_{N_e+1} − f_{N_e} | F_t) = −∆_t.

¹ Processing time can be reduced to O(log N) if insertion sort is used: in each step, there is only one new element to be inserted into the sorted queue.

∆_t can drift as t grows, and its exact value cannot be computed without storing the f-values. Therefore, we have to use some approximation. We present several possibilities:

(1) Use a constant stepsize ∆. Clearly, this approximation works best if the distribution of f-value differences does not change much during the optimization process.

(2) Assume that function values are distributed uniformly over an interval [a, b]. In this case, ∆_t = (b − a)/(N + 1). On the other hand, let D_t = E(|f(x^(t)) − f(x^(t+1))|).
f(x^(t)) and f(x^(t+1)) are independent, uniformly distributed samples, so we obtain D_t = (b − a)/3, i.e., ∆_t = ∆_0^uniform · D_t with ∆_0^uniform = 3/(N + 1). From this, we can obtain an online approximation scheme

∆_{t+1} := (1 − β)·∆_t + β·∆_0^uniform·|f(x^(t)) − f(x^(t+1))|,

where β is an exponential forgetting parameter.

(3) Assume that function values have a normal distribution N(µ, σ²). In this case,

∆_t = σ·(Φ⁻¹(1 − ρ + 1/N) − Φ⁻¹(1 − ρ)),

where Φ is the standard normal cumulative distribution function. On the other hand, let D_t = E(|f(x^(t)) − f(x^(t+1))|). f(x^(t)) and f(x^(t+1)) are independent, normally distributed samples, so we obtain D_t = 2σ/√π, i.e., ∆_t = ∆_0^Gauss · D_t with

∆_0^Gauss = (√π/2)·(Φ⁻¹(1 − ρ + 1/N) − Φ⁻¹(1 − ρ)).

From this, we can obtain an online approximation scheme

∆_{t+1} := (1 − β)·∆_t + β·∆_0^Gauss·|f(x^(t)) − f(x^(t+1))|,

where β is an exponential forgetting parameter.

(4) We can obtain a similar approximation for many other distributions of f, but the constant ∆_0^f does not necessarily have an easy-to-compute form.

The resulting algorithm, using option (1), is summarized in Fig. 5.

3. Convergence analysis

In this section we show that despite the various approximations used, the three variants of the CE method possess the same asymptotic convergence properties. Naturally, the actual performance of these algorithms may differ.

3.1. The classical CE method. Firstly, we review the results of Costa et al. [1] on the convergence of the classical CE method.

Theorem 3.1. If the basic CE method is used for combinatorial optimization with smoothing factor α, ρ > 0 and p_{0,i} ∈ (0, 1) for each i ∈ {1, ..., n}, then p_t converges to a 0/1 vector with probability 1.
The probability that the optimal solution is generated during the process can be made arbitrarily close to 1 if α is sufficiently small.

% inputs:
%   window size N
%   selection ratio ρ
%   smoothing factor α
%   number of samples K

p_0 := initial distribution parameters
γ_0 := arbitrary
for t from 0 to K − 1,
    % draw one sample and evaluate it
    draw x^(t) from distribution g(p_t)
    if f(x^(t)) ≥ γ_t then
        % x^(t) is an elite sample
        % compute new elite threshold level
        γ_{t+1} := γ_t + (1 − ρ)·∆
        p_{t+1} := CEOnlineUpdate(x^(t), p_t, α/(ρ·N))
    else
        % compute new elite threshold level
        γ_{t+1} := γ_t − ρ·∆
    endif
    % optional step: update ∆
    % ∆ := (1 − β)·∆ + β·∆_0·|f(x^(t)) − f(x^(t−1))|
end loop

Figure 5. Online cross-entropy method, memoryless variant.

The statements of the theorem are rather weak, and they are not specific to the particular form of the algorithm. Basically, they state that (1) the algorithm is a "trapped random walk": the probabilities may change up and down, but eventually they converge to one of the two absorbing values, 0 or 1; and (2) if the random walk can last for a sufficiently long time, then the optimal solution is sampled with high probability. We shall transfer the proof to the other two algorithms below.

3.2. The online CE methods.

Theorem 3.2. If either variant of the online CE method is used for combinatorial optimization with smoothing factor α, ρ > 0 and p_{0,i} ∈ (0, 1) for each i ∈ {1, ..., n}, then p_t converges to a 0/1 vector with probability 1. The probability that the optimal solution is generated during the process can be made arbitrarily close to 1 if α is sufficiently small.

Proof. The proof follows closely the proofs of Theorems 1–3 in [1]. We begin by introducing several notations.
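Before filling in the notation, it may help to see the memoryless variant of Figure 5 in executable form. The following Python sketch is ours, not the authors' code: maximization and the one-max objective in the test are illustrative assumptions, γ_0 is set to 0, and the optional ∆ update implements option (2) with the uniform constant ∆_0 = 3/(N + 1) when `delta0` is supplied.

```python
import random

def ce_memoryless(f, n, N=100, rho=0.1, alpha=0.7, delta=0.05,
                  beta=0.0, delta0=None, K=3000, seed=0):
    """Figure 5: memoryless online CEM; gamma_0 is arbitrary (here 0)."""
    rng = random.Random(seed)
    p = [0.5] * n
    gamma = 0.0
    f_prev = None
    for t in range(K):
        x = [1 if rng.random() < pi else 0 for pi in p]
        fx = f(x)
        if fx >= gamma:
            # elite sample: raise the threshold, update the parameters
            gamma += (1 - rho) * delta
            a1 = alpha / (rho * N)
            p = [(1 - a1) * pi + a1 * xi for pi, xi in zip(p, x)]
        else:
            # non-elite sample: lower the threshold
            gamma -= rho * delta
        # optional step: update delta online with forgetting parameter beta
        if beta > 0 and delta0 is not None and f_prev is not None:
            delta = (1 - beta) * delta + beta * delta0 * abs(fx - f_prev)
        f_prev = fx
    return p, gamma
```

With `f = sum` (one-max), the threshold γ climbs toward the upper percentiles of the sampled f-values and the parameter vector drifts toward the all-ones optimum.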
Let x* denote the optimal solution, and let F_t denote the σ-algebra generated by knowing all random outcomes up to time step t. Let φ_t := Pr(x^(t) = x* | F_{t−1}) denote the probability that the optimal solution is generated at time t, and φ_{t,i} := Pr(x_i^(t) = x*_i | F_{t−1}) the probability that component i is identical to that of the optimal solution. Clearly,

φ_{t,i} = p_{t−1,i}·1{x*_i = 1} + (1 − p_{t−1,i})·1{x*_i = 0}

and φ_t = ∏_{i=1}^n φ_{t,i}.

Let p_{t,i}^min and p_{t,i}^max denote the minimum and maximum possible values of p_{t,i}, respectively. In each step of the algorithms, p_{t,i} is either left unchanged or modified with stepsize α_1 := α/N_e. Consequently,

p_{t,i}^min = p_{0,i}·(1 − α_1)^t

and

p_{t,i}^max = p_{0,i}·(1 − α_1)^t + Σ_{j=1}^t α_1·(1 − α_1)^{t−j}
            = p_{0,i}·(1 − α_1)^t + Σ_{j=1}^t (1 − (1 − α_1))·(1 − α_1)^{t−j}
            = p_{0,i}·(1 − α_1)^t + 1 − (1 − α_1)^t.

Using these quantities,

φ_{t,i}^min = p_{t−1,i}^min·1{x*_i = 1} + (1 − p_{t−1,i}^max)·1{x*_i = 0}
            = (1 − α_1)^t·(p_{0,i}·1{x*_i = 1} + (1 − p_{0,i})·1{x*_i = 0})
            = φ_{1,i}·(1 − α_1)^t,

φ_t^min = ∏_{i=1}^n φ_{t,i}^min = φ_1·(1 − α_1)^{nt}.

Let E_t = ∩_{m=1}^t {x^(m) ≠ x*} denote the event that the optimal solution was not generated up to time t. Let R_t denote the set of possible values of φ_t. Clearly, for all r ∈ R_t, r ≥ φ_t^min. Note also that Pr(x^(t) = x* | φ_t = r, E_{t−1}) = r by the construction of the random sampling procedure of CE. Then

Pr(x^(t) = x* | E_{t−1}) = Σ_{r ∈ R_t} Pr(x^(t) = x* | φ_t = r, E_{t−1})·Pr(φ_t = r | E_{t−1})
    = Σ_{r ∈ R_t} r·Pr(φ_t = r | E_{t−1})
    ≥ φ_t^min = φ_1·(1 − α_1)^{nt}.

Using this, we can estimate the probability that the optimal solution has not been generated up to time step T:

Pr(E_T) = Pr(E_1)·∏_{t=2}^T Pr(E_t | E_{t−1})
        = Pr(E_1)·∏_{t=2}^T (1 − Pr(x^(t) = x* | E_{t−1}))
        ≤ Pr(E_1)·∏_{t=2}^T (1 − φ_1·(1 − α_1)^{nt}).
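As a standalone sanity check (ours, not part of the proof): the closed form for p_{t,i}^max above is just the solution of the recursion p ← (1 − α_1)·p + α_1, the extreme case in which every step performs an elite update toward 1, and it agrees with direct iteration of that recursion.

```python
def pmax_closed_form(p0, alpha1, t):
    # p0*(1-a1)^t + sum_{j=1}^t a1*(1-a1)^(t-j)  ==  p0*(1-a1)^t + 1 - (1-a1)^t
    return p0 * (1 - alpha1) ** t + 1 - (1 - alpha1) ** t

def pmax_iterated(p0, alpha1, t):
    # every step performs an elite update toward 1 (the maximal-growth case)
    p = p0
    for _ in range(t):
        p = (1 - alpha1) * p + alpha1 * 1.0
    return p
```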
Using the fact that (1 − u) ≤ e^{−u}, we obtain

Pr(E_T) ≤ Pr(E_1)·∏_{t=2}^T exp(−φ_1·(1 − α_1)^{nt}) = Pr(E_1)·exp(−φ_1·Σ_{t=2}^T (1 − α_1)^{nt}).

Let

h(α_1) := Σ_{t=1}^∞ (1 − α_1)^{nt} = 1/(1 − (1 − α_1)^n) − 1.

With this notation, lim_{T→∞} Pr(E_T) ≤ Pr(E_1)·exp(−φ_1·h(α_1)). However, h(α_1) → ∞ as α_1 → 0, so lim_{T→∞} Pr(E_T) can be made arbitrarily close to zero if α_1 is sufficiently small.

To prove the second part of the theorem, define Z_{t,i} = p_{t,i} − p_{t−1,i}. For the sake of notational convenience, we fix a component i and omit it from the indices. Note that Z_t ≠ 0 if and only if x^(t) is considered an elite sample. Clearly, if x^(t) is not elite, then no probability update is made. On the other hand, an update modifies p_t towards either 0 or 1. Since 0 < p_t < 1 with no equality allowed, such an update does indeed change the probabilities.

Consider the subset of time indices at which the probabilities are updated, I = {t : Z_t ≠ 0}. We need to show that |I| = ∞. This is the only part of the proof that requires slight extra work compared to the proof for the batch variant. We will show that each unbroken sequence of zeros in {Z_t} is finite with probability 1. Consider such a 0-sequence that starts at time t_1, and suppose that it is infinite. Then the sampling distribution p_t is unchanged for t ≥ t_1, and so is the distribution F of the f-values.

Let us examine the first online variant of the CEM. Divide the interval [t_1, ∞) into epochs of length N + 1. The contents of the queue at time steps t_1, t_1 + (N+1), t_1 + 2(N+1), ... are independent and identically distributed, because (a) the samples are generated independently from each other, and (b) the different queues have no common elements.
For a given queue Q_t (with all elements sampled from distribution F) and a new sample x^(t) (also drawn from F), the probability that x^(t) is not elite is exactly 1 − ρ. Therefore, the probability that no sample is considered elite for t ≥ t_1 is at most lim_{k→∞} (1 − ρ)^k = 0.

The situation is even simpler for the memoryless variant of the online CEM: suppose again that no sample is considered elite for t ≥ t_1, and that all samples are drawn from the distribution F. F is a distribution over a finite domain, so it has a finite minimum f_min. As all samples are considered non-elite, the elite threshold is decreased by the constant amount ρ∆ in each step, eventually becoming smaller than f_min, which results in a contradiction.

So, for both online methods, we can consider the (infinitely long) subsequences {Z_t}_{t∈I}, {x^(t)}_{t∈I}, {p_t}_{t∈I}, etc. For the sake of notational simplicity, we shall index these subsequences with t = 1, 2, 3, .... From now on, the proof continues identically to the original. We show that Z_t changes sign only a finite number of times with probability 1. To this end, let τ_k be the random iteration number at which Z_t changes sign for the kth time. For all k,

(1) τ_k = ∞ ⇒ τ_{k+1} = ∞;
(2) Z_{τ_k} < 0 ⇒ p_{τ_k} = (1 − α_1)·p_{τ_k − 1} + α_1·0 ≤ 1 − α_1 < 1;
(3) Z_{τ_k} > 0 ⇒ p_{τ_k} = (1 − α_1)·p_{τ_k − 1} + α_1·1 ≥ α_1 > 0.

From this point on, the proof of Theorem 3 in [1] can be applied without change, showing first that the number of sign changes is finite with probability 1, and then that this implies convergence to either 0 or 1. □

References

[1] Andre Costa, Owen D. Jones, and Dirk P. Kroese. Convergence properties of the cross-entropy method for discrete optimization. Operations Research Letters, 2007. To appear.
[2] Reuven Y. Rubinstein.
The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127–190, 1999.
