Bias-Variance Techniques for Monte Carlo Optimization: Cross-validation for the CE Method
In this paper, we examine the CE method in the broad context of Monte Carlo Optimization (MCO) and Parametric Learning (PL), a type of machine learning. A well-known overarching principle used to improve the performance of many PL algorithms is the bias-variance tradeoff.
Authors: Dev Rajnarayan, David Wolpert
November 2, 2018

1 Introduction

In this paper, we examine the CE method in the broad context of Monte Carlo Optimization (MCO) [Ermoliev and Norkin, 1998, Robert and Casella, 2004] and Parametric Learning (PL), a type of machine learning. A well-known overarching principle used to improve the performance of many PL algorithms is the bias-variance tradeoff [Wolpert, 1997]. This tradeoff has been used to improve PL algorithms ranging from Monte Carlo estimation of integrals [Lepage, 1978], to linear estimation, to general statistical estimation [Breiman, 1996a,b]. Moreover, as described by Wolpert and Rajnarayan [2007], MCO is very closely related to PL. Owing to this similarity, the bias-variance tradeoff affects MCO performance just as it does PL performance.

In this article, we exploit the bias-variance tradeoff to enhance the performance of MCO algorithms. We use cross-validation, a technique based on the bias-variance tradeoff, to significantly improve the performance of the Cross Entropy (CE) method, which is an MCO algorithm. In previous work we have confirmed that other PL techniques improve the performance of other MCO algorithms [see Wolpert and Rajnarayan, 2007]. We conclude that the many techniques pioneered in PL could be investigated as ways to improve MCO algorithms in general, and the CE method in particular.

The rest of the paper is organized as follows. In Sec. 2, we present an overview of the bias-variance tradeoff. In Sec. 3, we describe a few ways to exploit this tradeoff, starting from the relatively simple case of Monte Carlo integration, and proceeding to the more complex case of MCO. We also describe the original exploitation of this tradeoff, as a way to improve PL algorithms. In Sec.
4, we describe how to use cross-validation, a particular technique based on the bias-variance tradeoff, to modify the CE method. Sec. 5 then presents performance comparisons between this modified version of the CE method and the conventional CE method. These comparisons are on continuous, multimodal, unconstrained optimization problems. We show that on these problems, using the modified version of the CE method can significantly improve optimization performance, and never worsens it.

Rajnarayan and Wolpert: Bias Plus Variance for the CE Method

2 The Bias-Variance Tradeoff

Consider a given random variable $\Phi$ and a random variable that we can modify, $\hat{\Phi}$. We wish to use a sample $\hat{\phi}$ of $\hat{\Phi}$ as an estimate of a sample $\phi$ of $\Phi$. The mean squared error between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between $\Phi$ and $\hat{\Phi}$ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent randomness in $\Phi$ and is independent of the estimator $\hat{\Phi}$; accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on $\hat{\Phi}$. The third term, called the bias, is independent of the precise samples of both $\Phi$ and $\hat{\Phi}$, and reflects the difference between the means of $\Phi$ and $\hat{\Phi}$. The fourth term, called the variance, is independent of the precise sample of $\Phi$, and reflects the inherent randomness in the estimator as one samples it. These last two terms can be modified by changing the choice of estimator. In particular, on small sample sets, we can often decrease mean squared error by introducing a small bias that causes a large reduction in the variance. While most commonly used in machine learning, bias-variance tradeoffs are applicable in a much broader context and in a variety of situations.
2.1 A Simple Derivation of the Bias-Variance Decomposition

Suppose we have a Euclidean random variable $\Phi$ taking on values $\phi$ distributed according to a density function $p(\phi)$. We want to estimate a certain value $\phi$ that we cannot access directly, and that was obtained by sampling $p(\phi)$. We can, however, access a different Euclidean random variable $\hat{\Phi}$ taking on values $\hat{\phi}$ distributed according to $p(\hat{\phi})$. We can also modify the distribution of $\hat{\Phi}$, and we want to exploit the coupling between $\hat{\Phi}$ and $\Phi$ to improve our estimate. Assuming a quadratic loss function, the quality of our estimate is measured by its Mean Squared Error (MSE):

$$\mathrm{MSE}(\hat{\Phi}) \equiv \int p(\hat{\phi}, \phi)\, (\hat{\phi} - \phi)^2 \, d\hat{\phi}\, d\phi. \tag{1}$$

In standard bias-variance analysis, the statistical coupling of $\Phi$ and $\hat{\Phi}$ is simply ignored without any justification¹, and the joint distribution $p(\phi, \hat{\phi})$ is replaced with the product of marginals, $p(\phi)\, p(\hat{\phi})$. So our equation for MSE reduces to

$$\mathrm{MSE}(\hat{\Phi}) = \int p(\hat{\phi})\, p(\phi)\, (\hat{\phi} - \phi)^2 \, d\hat{\phi}\, d\phi. \tag{2}$$

Using simple algebra, the right-hand side of Eq. 2 can be written as the sum of three terms. The first is the variance of $\Phi$. This term is justifiably ignored in bias-variance analysis, since it is beyond our control in designing the estimator $\hat{\Phi}$. The second term involves a mean that describes the deterministic component of the error. This term depends on both the distribution of $\Phi$ and that of $\hat{\Phi}$, and quantifies how close the means of those distributions are. The third term is a variance that describes stochastic variations from one sample to the next. This term is independent of the random variable being estimated.

¹ One could account for this coupling by using an additive correction term [see Wolpert, 1997].
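Before deriving the decomposition formally, a small simulation illustrates it. This is a minimal sketch, not from the paper: the target value, the "shrunk mean" estimator, and the sample sizes are illustrative assumptions. It verifies numerically that MSE equals bias² plus variance, and shows a biased estimator beating the unbiased one on small samples:

```python
import random

random.seed(0)
phi = 0.5            # the (here known) value we are estimating
m, trials = 5, 20000 # small sample size per estimate, many repeated estimates

def bias_variance(shrink):
    """MSE, bias^2, and variance of the estimator shrink * (sample mean)."""
    ests = []
    for _ in range(trials):
        xs = [random.gauss(phi, 1.0) for _ in range(m)]
        ests.append(shrink * sum(xs) / m)
    mean = sum(ests) / trials
    var = sum((e - mean) ** 2 for e in ests) / trials
    bias2 = (phi - mean) ** 2
    mse = sum((e - phi) ** 2 for e in ests) / trials
    return mse, bias2, var

mse_u, b2_u, v_u = bias_variance(1.0)  # plain sample mean: unbiased
mse_s, b2_s, v_s = bias_variance(0.8)  # shrunk mean: biased but lower variance
print(mse_u, b2_u + v_u)  # MSE of the unbiased estimator = 0 + variance
print(mse_s, b2_s + v_s)  # the biased estimator has smaller total MSE here
```

The shrinkage factor of 0.8 deliberately introduces bias; for this choice of target and noise level, the variance reduction outweighs the squared bias, exactly the tradeoff described above.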
Formally, up to an overall additive constant, we can write

$$\begin{aligned}
\mathrm{MSE}(\hat{\Phi}) &= \int p(\hat{\phi})\, (\hat{\phi}^2 - 2\phi\hat{\phi} + \phi^2)\, d\hat{\phi} \\
&= \int p(\hat{\phi})\, \hat{\phi}^2\, d\hat{\phi} \;-\; 2\phi \int p(\hat{\phi})\, \hat{\phi}\, d\hat{\phi} \;+\; \phi^2 \\
&= V(\hat{\phi}) + [E(\hat{\phi})]^2 - 2\phi\, E(\hat{\phi}) + \phi^2 \\
&= V(\hat{\phi}) + [\phi - E(\hat{\phi})]^2 \\
&= \text{variance} + \text{bias}^2. \tag{3}
\end{aligned}$$

In light of Eq. 3, one way to try to reduce MSE is to modify an estimator to trade bias for variance. Some of the best-known applications of such bias-variance tradeoffs occur in parametric machine learning, where many techniques have been developed to exploit the tradeoff. There remain extensions of the tradeoff that could be applied in parametric machine learning but have so far been ignored by that community.

3 Applications of the Bias-Variance Tradeoff

In this section, we describe some applications of the bias-variance tradeoff. First, we provide a simple, concrete example that elaborates how conventional bias-variance techniques ignore the statistical coupling described in the previous section. Then, we describe the application of bias-variance tradeoffs to Monte Carlo (MC) techniques for the estimation of integrals, where, as for all unbiased estimators, the bias-variance tradeoff reduces to simple variance reduction. Next, we introduce the field of Monte Carlo Optimization (MCO), and illustrate that there are subtleties involved that are absent in simple MC. Then, we describe the field of Parametric Machine Learning, which is mathematically identical to MCO, and describe how PL practitioners ignore some of the subtleties pertaining to MCO, yet apply bias-variance analysis nonetheless.

3.1 Supervised Learning

Consider the simplest type of supervised machine learning problem, where the aim is to accurately predict the behavior of an input-output system, based on several 'training examples'.
In this example, we consider a finite input space $X$, an output space $Y$ which is just the space of real numbers, and a deterministic input-output system, or 'target function', $f$ that maps each element of $X$ to a single element of $Y$. To be precise, there is a 'prior' probability density function $p(f)$ over target functions, and it gets sampled to produce some particular target function $f$. Next, $f$ is IID sampled at a set of $m$ inputs to produce a 'training set' $D$ of input-output pairs. For simplicity of analysis, say we have a single fixed 'prediction point' $x \in X$. Our goal in supervised learning is to estimate $f(x)$, but $f$ is not known. Accordingly, to perform the estimation, the training set is presented to a 'learning algorithm', which in response produces a guess $\hat{f}(x)$ for the value $f(x)$.

This entire stochastic procedure defines a joint distribution $\pi(f, D, f(x), \hat{f}(x))$. In order to analyze the performance of the learning algorithm, we marginalize to obtain a distribution $\pi(f(x), \hat{f}(x))$. Since $\hat{f}(x)$ is supposed to be an estimate of $f(x)$, we can identify $\hat{f}(x)$ as the value $\hat{\phi}$ of the random variable $\hat{\Phi}$ and $f(x)$ as the value $\phi$ of $\Phi$. In other words, we can define $p(\phi, \hat{\phi}) = \pi(f(x), \hat{f}(x))$. If we then compute the mean squared error in the estimate made by our learning algorithm for the value $f(x)$, we get Eq. 1. Of course, in general, $\Phi$ and $\hat{\Phi}$, i.e., $f(x)$ and $\hat{f}(x)$, are statistically dependent; if they weren't, the learning algorithm would gain nothing from knowing $D$. This dependence can be established by writing

$$\begin{aligned}
p(f(x), \hat{f}(x)) &= \int dD\, p(f(x), \hat{f}(x) \mid D)\, p(D) \\
&= \int dD\, p(\hat{f}(x) \mid f(x), D)\, p(f(x) \mid D)\, p(D) \\
&= \int dD\, p(\hat{f}(x) \mid D)\, p(f(x) \mid D)\, p(D). \tag{4}
\end{aligned}$$

The first two steps comprise a straightforward application of Bayes' rule. In the third step, the conditioning of $\hat{f}(x)$ on $f(x)$ is removed, because the guess of the learning algorithm is determined solely by the training set $D$. Nevertheless, note that Eq. 4 is exact, and in general is not the same as the product

$$p(f(x))\, p(\hat{f}(x)) = \int dD\, p(f(x) \mid D)\, p(D) \int dD'\, p(\hat{f}(x) \mid D')\, p(D'). \tag{5}$$

As we described in Sec. 2, conventional bias-variance analysis approximates Eq. 4 by Eq. 5.

3.2 Monte Carlo Integration

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. Consider a function $f : \mathcal{X} \to \mathbb{R}$, which we want to integrate over some region $X \subseteq \mathcal{X}$, yielding the value $F$, as given by

$$F = \int_X dx\, f(x).$$

We can view $F$ as a degenerate random variable $\Phi$, with density function given by a Dirac delta function centered on $F$. Therefore, the variance of $\Phi$ is 0, and Eq. 3 is exact.

A popular MC method to estimate this integral is importance sampling [Robert and Casella, 2004]. This exploits the law of large numbers as follows: i.i.d. samples $x^{(i)}, i = 1, \ldots, m$ are generated from a so-called importance distribution $h(x)$ that we control, and the associated values of the integrand, $f(x^{(i)})$, are computed. Denote these 'data' by

$$D = \{ (x^{(i)}, f(x^{(i)})),\ i = 1, \ldots, m \}. \tag{6}$$

Now,

$$\phi = F = \int_X dx\, h(x)\, \frac{f(x)}{h(x)} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^m \frac{f(x^{(i)})}{h(x^{(i)})} \quad \text{with probability 1.}$$

Denote by $\hat{\Phi}$ the random variable with value given by the finite sample average for $D$:

$$\hat{\phi} = \frac{1}{m} \sum_{i=1}^m \frac{f(x^{(i)})}{h(x^{(i)})}.$$

We use $\hat{\Phi}$ as our statistical estimator for $\Phi$, as we broadly described in Sec. 1. Assuming a quadratic loss function, $L(\hat{\phi}, \phi) = (\phi - \hat{\phi})^2$, the bias-variance decomposition described in Eq. 3 applies exactly.
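As a concrete illustration of this estimator (a sketch with an assumed integrand and importance distribution, not taken from the paper), the following estimates $F = \int_0^\infty x\,e^{-x}\,dx = 1$ by sampling from an exponential importance distribution:

```python
import math
import random

random.seed(0)

def f(x):
    """Integrand: f(x) = x * exp(-x); its integral over [0, inf) is exactly 1."""
    return x * math.exp(-x)

def h_pdf(x):
    """Importance distribution h: Exponential(1), a distribution we control."""
    return math.exp(-x)

def importance_estimate(m):
    """phi_hat = (1/m) * sum_i f(x_i) / h(x_i), with x_i drawn i.i.d. from h."""
    total = 0.0
    for _ in range(m):
        x = random.expovariate(1.0)  # sample from h
        total += f(x) / h_pdf(x)
    return total / m

est = importance_estimate(100_000)
print(est)  # concentrates near the true value F = 1 as m grows
```

Because the estimator is unbiased, its MSE is pure variance, shrinking as $1/m$; this is the regime where the bias-variance tradeoff degenerates into plain variance reduction.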
It can be shown that the estimator $\hat{\Phi}$ is unbiased, that is, $E[\hat{\Phi}] = \phi$, where the mean is over samples of $h$. Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution $h$ that minimizes this variance is given by [see Robert and Casella, 2004]

$$h^\star(x) = \frac{|f(x)|}{\int_X |f(x')|\, dx'}.$$

By itself, this result is not very helpful, since the equation for the optimal importance distribution contains an integral similar to the one we are trying to estimate. For non-negative integrands $f(x)$, the VEGAS algorithm [Lepage, 1978] describes an adaptive method for finding successively better importance distributions, by iteratively estimating $\Phi$ and then using that estimate to generate the next importance distribution $h$. For this and other unbiased estimators, there is no tradeoff between bias and variance, and minimizing MSE is achieved by minimizing variance.

3.3 Monte Carlo Optimization

Instead of a fixed integral to evaluate, consider a parametrized integral

$$F(\theta) = \int_X dx\, f_\theta(x).$$

Further, suppose we are interested in finding the value of the parameter $\theta \in \Theta$ that minimizes $F(\theta)$:

$$\theta^\star = \arg\min_{\theta \in \Theta} F(\theta).$$

In the case where the functional form of $f_\theta$ is not explicitly known, one approach to this problem is a technique called Monte Carlo Optimization (MCO) [see Ermoliev and Norkin, 1998], involving repeated MC estimation of the integral in question with adaptive modification² of the parameter $\theta$. We proceed analogously to the preceding section. Whereas in MC there was no parameter $\theta$ and we had a single Dirac-delta distribution, we now have a set of such distributions, corresponding to the values of $\theta$. Accordingly, we introduce the $\theta$-indexed vector random variable $\Phi$, each of whose components $\Phi_\theta$ has a degenerate Dirac-delta distribution about the associated value $F(\theta)$.
Next, we introduce our estimator random variable: a similar $\theta$-indexed vector random variable $\hat{\Phi}$, each of whose components $\hat{\Phi}_\theta$ can be sampled to estimate $\Phi_\theta$. Regardless of how $\hat{\Phi}$ is defined, given a sample $\hat{\phi}$, one way to estimate $\theta^\star$ is

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \hat{\phi}_\theta.$$

We call this approach 'natural' MCO.

² The similarity to the CE method is quite clear.

For example, let $D$ be a data set as described in Eq. 6. Then for every $\theta$, any sample of $D$ provides a sample of $\hat{\Phi}_\theta$, namely the associated estimate

$$\hat{\phi}_\theta = \hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$

Given this choice of $\hat{\Phi}$, the natural MCO approach estimates the optimal $\theta$ as follows:

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{i=1}^m \frac{f_\theta(x^{(i)})}{h(x^{(i)})}. \tag{7}$$

As we shall see, this does not work well in practice, and we therefore call this importance-sampling application of natural MCO 'naive' MCO.

In general, consider any algorithm that estimates $\theta^\star$ as a single-valued function of $\hat{\phi}$. The estimate of $\theta^\star$ produced by that algorithm is itself a random variable, since it is a function of the random variable $\hat{\Phi}$. Call this random variable $\hat{\Theta}^\star$, taking on values $\hat{\theta}^\star$. Any MCO algorithm is defined by $\hat{\Theta}^\star$; that random variable encapsulates the estimate made by the algorithm. To analyze the error of such an algorithm, consider the associated random variable $F(\hat{\Theta}^\star)$, taking on the exact values of the integral, $F(\hat{\theta}^\star)$. Since our aim in MCO is to minimize $F(\theta)$, it is reasonable to propose that the cost of estimating $\theta^\star$ by $\hat{\theta}^\star$ is simply the difference between $F(\hat{\theta}^\star)$ and the true minimal value of the integral, $F(\theta^\star) = \min_\theta F(\theta)$. In other words, we adopt the loss function

$$L(\hat{\theta}^\star, \theta^\star) \triangleq F(\hat{\theta}^\star) - F(\theta^\star). \tag{8}$$

This is in contrast to our discussion of MC integration, which involved quadratic loss.
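A toy instance makes the danger of natural MCO concrete. This is a sketch under assumed conditions (a quadratic $F$, i.i.d. Gaussian estimation noise on each component); it is not the paper's experiment. Picking the $\theta$ with the smallest noisy estimate systematically selects over-optimistic estimates, so the expected loss of Eq. 8 is strictly positive:

```python
import random

random.seed(0)
thetas = [i / 10 for i in range(-20, 21)]  # candidate parameter values
F = lambda t: t * t                        # true parametrized integral; theta* = 0
sigma = 0.5                                # std of each estimate phi_hat_theta

total_loss, trials = 0.0, 5000
for _ in range(trials):
    # one independent noisy estimate per component, as in naive MCO
    phi_hat = {t: F(t) + random.gauss(0.0, sigma) for t in thetas}
    theta_hat = min(phi_hat, key=phi_hat.get)  # Eq. 7: argmin over the estimates
    total_loss += F(theta_hat) - 0.0           # Eq. 8 loss; here F(theta*) = 0

print(total_loss / trials)  # positive on average: natural MCO systematically loses
```

Note that each individual estimate $\hat{\phi}_\theta$ is unbiased; the positive expected loss comes entirely from taking an argmin over many noisy components, the extreme-value effect discussed below.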
Up to an unknown additive constant, this loss function is nothing but $F(\hat{\theta}^\star)$. The additive constant $F(\theta^\star)$ is fixed by the MCO problem at hand, is beyond our control, and is not affected by our algorithm. Consequently, we ignore it and write out the associated expected loss:

$$E(L) = \int d\hat{\theta}^\star\, p(\hat{\theta}^\star)\, F(\hat{\theta}^\star). \tag{9}$$

Now change coordinates in this integral from the values of the scalar random variable $\hat{\Theta}^\star$ to the values of the underlying vector random variable $\hat{\Phi}$. The expected loss becomes

$$E(L) = \int d\hat{\phi}\, p(\hat{\phi})\, F(\hat{\theta}^\star(\hat{\phi})).$$

The natural MCO algorithm provides some insight into these results. For that algorithm,

$$E(L) = \int d\hat{\phi}\, p(\hat{\phi})\, F(\arg\min_\theta \hat{\phi}_\theta) = \int d\hat{\phi}_{\theta_1}\, d\hat{\phi}_{\theta_2} \cdots\ p(\hat{\phi}_{\theta_1}, \hat{\phi}_{\theta_2}, \ldots)\, F(\arg\min_\theta \hat{\phi}_\theta). \tag{10}$$

For any fixed $\theta$, there is an error between samples $\hat{\phi}_\theta$ and the true value $F(\theta)$. Bias-variance considerations apply to this error, exactly as in the discussion of MC above. In MCO, however, we are concerned with $\hat{\phi}$ not for a single component $\theta$, but rather for a set of $\theta$'s.

The simplest such case is where the components $\hat{\phi}_\theta$ are independent. Even so, $\arg\min_\theta \hat{\phi}_\theta$ is distributed according to the laws for extrema of multiple independent random variables, and this distribution depends on higher-order moments of each random variable $\hat{\Phi}_\theta$. This means that $E(L)$ also depends on such higher-order moments. Only the first two moments, however, arise in the bias and variance for any single $\theta$. Thus, even in the simplest possible case, bias-variance considerations for the individual $\theta$ do not provide a complete analysis. In most cases, the components of $\hat{\phi}$ are not independent.
Therefore, in order to analyze $E(L)$, in addition to higher moments of the distribution for each individual $\theta$, we must also consider higher-order moments coupling the estimates $\hat{\phi}_\theta$ for different $\theta$. Conventional bias-variance analysis for MCO is therefore incomplete on three fronts: it ignores the coupling between $\Phi$ and $\hat{\Phi}$, it ignores higher-order moments for individual $\theta$'s, and it ignores moments coupling different $\theta$'s.

Due to the coupling between $\theta$'s, it may be quite acceptable for the individual components $\hat{\phi}_\theta$ to have both a large bias and a large variance, as long as the covariances are large. Large covariances would ensure that if some $\hat{\phi}_\theta$ were incorrectly large, then $\hat{\phi}_{\theta'}$, for all $\theta' \neq \theta$, would also be incorrectly large. This would preserve the ordering of the $\theta$'s under $\hat{\Phi}$. So, even with large bias and variance for each $\theta$, the estimator as a whole would still work well. If we could exploit this insight, it might be possible to come up with weaker requirements for accurate estimators. Nevertheless, we can ignore these insights and impose a stronger requirement: design estimators $\hat{\phi}_\theta$ with sufficiently small bias plus variance for each single $\theta$. More precisely, suppose that those terms are very small on the scale of the differences $F(\theta) - F(\theta')$ for any $\theta$ and $\theta'$. Then, by Chebyshev's inequality, we know that the density functions of the random variables $\hat{\Phi}_\theta$ and $\hat{\Phi}_{\theta'}$ have almost no overlap. Accordingly, the probability that a sample of $\hat{\phi}_\theta - \hat{\phi}_{\theta'}$ has the opposite sign of $F(\theta) - F(\theta')$ is almost zero.

Evidently, $E(L)$ is generally determined by a complicated relationship involving bias, variance, covariance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of these effects, and consequently often perform quite poorly in practice. In the next section we discuss some ways of addressing this problem.
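The point about covariances can be made concrete with a sketch using assumed numbers (not from the paper): if one and the same random shift corrupts every component $\hat{\phi}_\theta$, each estimate has large bias and large variance, yet the argmin, and hence the natural-MCO choice, is untouched:

```python
import random

random.seed(0)
thetas = [0.0, 0.5, 1.0, 1.5]
F = lambda t: (t - 1.0) ** 2      # true values; the argmin is theta = 1.0

correct, trials = 0, 2000
for _ in range(trials):
    err = random.gauss(5.0, 3.0)  # huge common error: bias 5, std 3
    # perfectly correlated across theta: every component shifts together
    phi_hat = {t: F(t) + err for t in thetas}
    if min(phi_hat, key=phi_hat.get) == 1.0:
        correct += 1

print(correct / trials)  # 1.0: the ordering of the theta's is always preserved
```

Perfect correlation is the extreme case; the qualitative lesson is that it is the error structure *across* $\theta$'s, not the per-$\theta$ bias and variance alone, that determines $E(L)$.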
3.4 Parametric Machine Learning

There are many versions of the basic MCO problem described in the previous section. Some of the best-explored arise in parametric density estimation and parametric supervised learning, which together comprise the field of Parametric Machine Learning (PL). In particular, parametric supervised learning attempts to solve

$$\arg\min_{\theta \in \Theta} \int dx\, p(x) \int dy\, p(y \mid x)\, L(z_\theta(x), y)$$

for a predictor $z_\theta(x)$ and a loss function $L$. Here, the values $x$ represent inputs, and the values $y$ represent corresponding outputs, generated according to some stochastic process defined by a set of conditional distributions $\{p(y \mid x), x \in X\}$. Typically, one tries to solve this problem by casting it as a single-stage MCO problem. For instance, say we adopt a quadratic loss between the predictor $z_\theta(x)$ and the true value of $y$. Using MCO notation, we can express the associated supervised learning problem as finding $\arg\min_\theta F(\theta)$, where

$$\begin{aligned}
l_\theta(x) &= \int dy\, p(y \mid x)\, (z_\theta(x) - y)^2, \\
f_\theta(x) &= p(x)\, l_\theta(x), \\
F(\theta) &= \int dx\, f_\theta(x). \tag{11}
\end{aligned}$$

Next, the argmin is estimated by minimizing a sample-based estimate of the $F(\theta)$'s. More precisely, we are given a 'training set' of samples of $p(y \mid x)\, p(x)$,

$$\{(x^{(i)}, y_i),\ i = 1, \ldots, m\}.$$

This training set provides a set of associated estimates of $F(\theta)$:

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^m (z_\theta(x^{(i)}) - y_i)^2.$$

These are used to estimate $\arg\min_\theta F(\theta)$, exactly as in MCO. In particular, one could estimate the minimizer of $F(\theta)$ by finding the minimizer of $\hat{F}(\theta)$, just as in natural MCO. As mentioned above, this MCO algorithm can perform very poorly in practice. In PL, this poor performance is called 'overfitting the data'.

Several formal approaches have been explored in PL to try to address such overfitting. Interestingly, none are based on direct consideration of the random variable $\hat{\theta}^\star(\hat{\Phi})$ and the ramifications of its distribution for expected loss (cf. Eq. 10). In particular, no work has applied the mathematics of extrema of multiple random variables to analyze the bias-variance-covariance tradeoffs encapsulated in Eq. 10.

The PL approach that perhaps comes closest to such direct consideration of the distribution of $F(\hat{\theta}^\star)$ is uniform convergence theory, which is a central part of Computational Learning Theory [see Angluin, 1992]. Uniform convergence theory starts by crudely encapsulating the formula for expected loss under natural MCO, Eq. 10. It does this by considering the worst-case bound, over possible $p(x)$ and $p(y \mid x)$, of the probability that $F(\hat{\theta}^\star)$ exceeds $\min_\theta F(\theta)$ by more than $\kappa$. It then examines how that bound varies with $\kappa$. In particular, it relates such variation to characteristics of the set of functions $\{f_\theta : \theta \in \Theta\}$, e.g., the 'VC dimension' of that set [see Vapnik, 1982, 1995].

Another approach is to apply bias-plus-variance considerations to the entire PL algorithm $\hat{\Theta}^\star$, rather than to each $\hat{\Phi}_\theta$ separately. This approach is applicable to algorithms that do not use natural MCO, and even to non-parametric supervised learning. As formulated for parametric supervised learning, this approach combines the formulas in Eq. 11 to write

$$F(\theta) = \int dx\, dy\, p(x)\, p(y \mid x)\, (z_\theta(x) - y)^2.$$

This is then substituted into Eq. 9, giving

$$E(L) = \int d\hat{\theta}^\star\, dx\, dy\, p(x)\, p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 = \int dx\, p(x) \left[ \int d\hat{\theta}^\star\, dy\, p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 \right]. \tag{12}$$

The term in square brackets is an $x$-parametrized expected quadratic loss, which can be decomposed into a bias, a variance, etc., in the usual way.
This formulation eliminates any direct concern for issues like the distribution of extrema of multiple random variables, covariances between $\hat{\Phi}_\theta$ and $\hat{\Phi}_{\theta'}$ for different values of $\theta$, and so on. In some ways, this imposes stronger conditions on the estimators we design, and consideration of the issues outlined above may yield weaker conditions on how to increase estimation accuracy.

There are numerous other approaches for addressing the problems of natural MCO that have been explored in PL. Particularly important among these are Bayesian approaches, e.g., Buntine and Weigend [1991], Berger [1985], Mackay [2003]. Based on these approaches, as well as on intuition, many powerful techniques for addressing data-overfitting have been explored in PL, including regularization, cross-validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO problem, not just PL problems. Since many of these techniques can be justified using Eq. 12, they provide a way to exploit the bias-variance tradeoff in domains besides PL.

4 PLMCO-CE

In this section, we first present a review of the CE method for continuous problems, and describe its various components in the terms used thus far. Kroese et al. [2006] describe the application of the CE method to continuous multi-extremal problems. They describe the use of parametrized continuous distributions, in particular multivariate Gaussians and multivariate Gaussian mixtures, to perform adaptive importance sampling for the optimization of continuous functions. The problem at hand, in our notation, is

$$\text{minimize}_{x \in \mathcal{X}}\ G(x). \tag{13}$$

This is then converted to an Associated Stochastic Problem (ASP) over $\theta$-parametrized probability distributions $q_\theta$ over $\mathcal{X}$. Let $\gamma^\star = \min_{x \in \mathcal{X}} G(x)$, and let $I\{\cdot\}$ be the indicator function. We want to maximize over $\theta$

$$E_{q_\theta} I\{G(x) \le \gamma^\star\}. \tag{14}$$

The solution to this problem is a degenerate Dirac-delta distribution centered on the optimal $x$, provided that the set of allowable $\theta$ permits parametrization of such degenerate distributions. Note also that sampling this degenerate distribution gives the answer to the original problem, so in a way the two problems are equivalent.

The ASP is actually solved using a homotopy method, where one solves a sequence of optimization problems with progressively decreasing values of $\gamma$. The algorithm proceeds as follows. $\theta$ is initialized to some $\theta_0$. At each iteration $t \ge 0$, a set of samples is drawn from $q_{\theta_t}$, $\gamma_{t+1}$ is chosen to correspond to the best $\kappa$-percentile of these samples, and $\theta_{t+1}$ is chosen so as to minimize a KL divergence (or cross entropy) from the suitably normalized indicator distribution (also called a Heaviside distribution) given by

$$p_\gamma(x) \propto \Theta_\gamma(x) = \begin{cases} 1, & G(x) \le \gamma, \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$

The problem of computing $\theta_{t+1}$ is actually an MCO problem: we use a set of samples to search for the $\theta$ that minimizes a parametrized integral,

$$\theta^\star_{t+1} = \arg\min_\theta \mathrm{KL}(p_\gamma \,\|\, q_\theta) = \arg\min_\theta \left[ -\int_{x \in \mathcal{X}} dx\, p_\gamma(x)\, \ln(q_\theta(x)) \right]. \tag{16}$$

If we keep track of which $q_\theta$ each of the elite samples was generated from, we can properly use importance sampling to estimate the above integral. Also, the normalization constant turns out to be irrelevant, enabling the use of $\Theta_\gamma$ rather than $p_\gamma$. The importance-sampled estimate of interest is therefore

$$\sum_{i=1}^m \frac{\Theta_\gamma(x^{(i)})}{q_{\theta_{k_i}}(x^{(i)})}\, \ln q_\theta(x^{(i)}). \tag{17}$$

Here, $q_{\theta_{k_i}}$ is the actual parametrized distribution that generated the $i$th sample. In the CE method, however, the denominator of the likelihood ratio is for some reason ignored: the importance ratio is simply set to unity for the elite samples, and zero for all the others.
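A minimal single-Gaussian CE iteration for the continuous problem of Eq. 13 can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the objective, percentile, sample size, and iteration count are assumptions, and no dynamic smoothing is applied.

```python
import random

random.seed(0)
G = lambda x: (x - 3.0) ** 2   # objective to minimize; optimum at x = 3

mu, sigma = 0.0, 5.0           # theta_0: parameters of the Gaussian q_theta
n_samples, elite_frac = 100, 0.1

for t in range(30):
    xs = sorted((random.gauss(mu, sigma) for _ in range(n_samples)), key=G)
    elite = xs[: int(elite_frac * n_samples)]  # best kappa-percentile; sets gamma_{t+1}
    # For a Gaussian family, minimizing the cross entropy to the indicator
    # distribution (Eq. 16) reduces to fitting the mean and std of the elite set.
    mu = sum(elite) / len(elite)
    sigma = max(1e-6, (sum((x - mu) ** 2 for x in elite) / len(elite)) ** 0.5)

print(mu)  # converges near the optimum, 3.0
```

On this easy unimodal objective the sketch converges quickly; on multimodal problems the unsmoothed update can shrink $\sigma$ prematurely, which is precisely the failure mode the next section addresses.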
4.1 Cross-validation for the CE Method

Using the notation of Sec. 3, we see that the CE method actually performs naive MCO: it uses importance sampling to generate a finite-sample estimate of a parametrized integral, then estimates the integral-optimizing parameters by simply minimizing the finite-sample sum. It is known from extensive experience with learning algorithms in the PL community that such an approach is bound to perform poorly with small sample sizes. In other words, it will suffer from large errors caused by large variance, stemming from overfitting the data. Indeed, to prevent such overfitting, Kroese et al. [2006] recommend the use of 'dynamic smoothing' to prevent 'premature shrinking' of the distributions. Also, the percentile $\kappa$ must be set appropriately: if it is too large, convergence to an optimum will be very slow; if it is too small, the algorithm will converge prematurely, either to a local optimum or, worse, to a point that is not even a local minimum, and make extremely slow progress towards a minimum.

The percentile $\kappa$, or equivalently the elite size $N_{\text{elite}}$, can be thought of as a hyperparameter: a data-independent parameter that affects algorithm performance. In a PL context, hyperparameters are often set using the technique of cross-validation, which works as follows: the given data is partitioned into two parts, a training set and a 'held-out' test set. The learning algorithm (also called a training algorithm) is given the training data, and the estimated optimal parameters $\hat{\theta}^\star$ it generates are tested by estimating the value of the parametrized integral using the test data. The hyperparameters are then chosen so as to optimize this held-out performance. This can be done in many ways, helpfully named 70-30 cross-validation, $k$-fold cross-validation, and leave-one-out cross-validation.
As expected, in 70-30 cross-validation, the given data is randomly partitioned into two parts containing 70% and 30% of the data. The larger partition is used for training, and the smaller one for testing. In $k$-fold cross-validation, the data is divided into $k$ equal-sized partitions; $k-1$ of these are used for training, and the last one is used for testing. This procedure is repeated $k$ times, so that $k$ estimates are generated and tested; the hyperparameters are chosen to optimize average held-out performance. In leave-one-out cross-validation, all but one data point are used for training, and the left-out point is used for testing. Of course, leave-one-out cross-validation only works if the performance measure makes sense on a single point. In some cases, such as prediction error, it does.

In this paper, we use 4-fold cross-validation to trade bias and variance of the CE 'training algorithm', dynamically picking the percentile value $\kappa_t$ for elite samples. For our experiments, we used a fixed set of parameters to implement dynamic smoothing, though these too could be chosen using cross-validation.

In the case of mixture distributions, Kroese et al. [2006] use a technique called augmentation, where each elite sample is presumed to have come from a particular component of the mixture. They do not specify exactly how they perform this partitioning of elite samples. Instead, we use the well-known Expectation Maximization (EM) algorithm, which is widely used for precisely this purpose: maximizing likelihood (or cross-entropy) using mixture distributions. Broadly speaking, instead of deterministically partitioning the elite samples, the EM algorithm iteratively 'soft-partitions' them, using probabilities over a latent variable to specify which mixture component each sample arose from.
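The soft-partitioning step can be sketched for a one-dimensional, two-component Gaussian mixture. The data, initializations, and iteration count below are illustrative assumptions, not the paper's settings: each E-step computes the probability that a sample arose from each component, and each M-step refits the components with those responsibilities as weights.

```python
import math
import random

random.seed(0)
# synthetic 'elite samples' from two well-separated clusters
data = ([random.gauss(0.0, 0.5) for _ in range(100)]
        + [random.gauss(5.0, 0.5) for _ in range(100)])

def npdf(x, mu, sig):
    """Gaussian density with mean mu and standard deviation sig."""
    return math.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * math.sqrt(2 * math.pi))

mus, sigs, ws = [1.0, 4.0], [1.0, 1.0], [0.5, 0.5]  # initial 2-component guess

for _ in range(50):
    # E-step: responsibilities r[i][k] = P(component k | sample i)
    r = []
    for x in data:
        p = [ws[k] * npdf(x, mus[k], sigs[k]) for k in range(2)]
        s = sum(p)
        r.append([pk / s for pk in p])
    # M-step: refit each component using responsibility-weighted statistics
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        mus[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        sigs[k] = max(1e-6, (sum(ri[k] * (x - mus[k]) ** 2
                                 for ri, x in zip(r, data)) / nk) ** 0.5)
        ws[k] = nk / len(data)

print(sorted(mus))  # close to the true cluster centers, 0.0 and 5.0
```

The soft assignments replace the deterministic augmentation step: no sample is ever committed to a single component, which is exactly the distinction drawn in the text above.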
The number of mixture components specifies what is called a model class. In addition to picking hyperparameters, cross-validation is also used to select model classes; this process is called model selection. We use cross-validation to adaptively pick the number of mixing components in the Gaussian mixture $q_{\theta_t}$.

In our case, we want to use cross-validation to pick $\gamma$, or equivalently, $\kappa$. Therefore, rather than use $E_{q_\theta} I\{G(x) \le \gamma\}$ as the parametrized integral to optimize, we choose $E_{q_\theta}[G(x)]$. Even though we use a different ASP, note that the resulting optimization problem is formally equivalent to the CE ASP: they share the same optimal point $\theta$, but differ in optimal value. We do, however, still use the CE method as our training algorithm. In other words, we use varying $\kappa$ in the CE algorithm to generate estimates $\hat{\theta}^\star$ of the optimal $\theta$, and use cross-validation to choose the value of $\kappa$ that minimizes the importance-sampled estimate of $E_{q_\theta}[G(x)]$ on the held-out data. That is, we pick $\kappa$ to minimize

$$\sum_{i=1}^{m_{\text{ho}}} \frac{q_\theta(x^{(i)})}{q_{\theta_{k_i}}(x^{(i)})}\, G(x^{(i)}). \tag{18}$$

Note that the sum is over the held-out samples, denoted by the subscript ho on the summation limit. Also note that we do keep track of the actual likelihood of each sample generated, and use the correct likelihood ratio in our importance-sampled estimate of held-out performance.

We call this CE algorithm, where hyperparameters and models are adaptively picked using cross-validation, PLMCO-CE, since PL techniques are applied to improve the MCO performance of the CE algorithm.

5 Results and Discussion

In this section, we present performance comparisons between the conventional CE method and the new version presented in Sec. 4, on a few simple test problems for unconstrained optimization.

5.1 Test Problems

The test problems are all analytic, low-dimensional, multi-extremal functions.
The problems varied in dimensionality from 4 to 8, and each of them has properties that make it difficult for local optimizers to find the global optima: some, such as the Woods and Rosenbrock problems, are badly scaled; others have multiple local minima, with the worst minima having the largest basin of attraction for gradient-based algorithms; and so on. A few of these problems have been used for testing continuous multi-extremal optimization using the CE method [Kroese et al., 2006]. The 4-dimensional problems are the n-dimensional (extended) Rosenbrock, the Woods function, and the Shekel family of functions. The Hougen function is 5-dimensional. We also experimented with the 6-dimensional Hartman function and the 8-dimensional Rosenbrock.

The n-dimensional extended Rosenbrock is defined as

    G(x) = \sum_{i=1}^{n-1} \left[ (1 - x_i)^2 + 100 (x_i^2 - x_{i+1})^2 \right].    (19)

The Woods function is defined as

    G(x) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2 + 90 (x_4 - x_3^2)^2 + (1 - x_3)^2 + 10.1 \left[ (1 - x_2)^2 + (1 - x_4)^2 \right] + 19.8 (1 - x_2)(1 - x_4).    (20)

The Hougen function is described by Kroese et al. [2006]. The others are described in detail by Dixon and Szegő [1978].

5.2 Algorithms Tested

We tested versions of the CE algorithm described by Kroese et al. [2006], varying the value of the elite size N_elite and the number of components in the Gaussian mixture. We tested three values of N_elite as a fraction of the number of samples taken: 5, 10, and 15%. For each of these, we experimented with a single Gaussian and a 3-component Gaussian mixture. In the attached plots, the single-Gaussian algorithms are denoted by CESxx, where xx represents N_elite as a percentage of the population size, and the mixture-based algorithms are denoted by CEMxx. We also tested two PLMCO-CE algorithms, which dynamically chose a good value of N_elite at each iteration.
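The two badly scaled benchmarks defined in Eqs. (19) and (20) translate directly into code. A sketch, assuming the standard forms with global minimum G = 0 at x = (1, ..., 1):

```python
import numpy as np

def rosenbrock(x):
    """n-dimensional extended Rosenbrock, Eq. (19):
    sum over i of (1 - x_i)^2 + 100 (x_i^2 - x_{i+1})^2."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((1 - x[:-1]) ** 2 + 100.0 * (x[:-1] ** 2 - x[1:]) ** 2))

def woods(x):
    """4-dimensional Woods function, Eq. (20)."""
    x1, x2, x3, x4 = (float(v) for v in x)
    return (100 * (x2 - x1 ** 2) ** 2 + (1 - x1) ** 2
            + 90 * (x4 - x3 ** 2) ** 2 + (1 - x3) ** 2
            + 10.1 * ((1 - x2) ** 2 + (1 - x4) ** 2)
            + 19.8 * (1 - x2) * (1 - x4))
```

Both are smooth, yet their narrow curved valleys make them hard for local optimizers, which is exactly why they appear in this test suite.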
These algorithms used cross-validation to pick the value of N_elite that optimized estimated held-out performance, i.e., E_{q_θ}[G(x)], as described in the preceding section. The version using Gaussian mixtures also dynamically chose the number of mixing components (between 1 and 3) using cross-validation. For a given number of mixing components, the algorithm used cross-validation to find the optimal value of N_elite. This follows the rather standard machine-learning practice of optimizing hyperparameters for each model type using cross-validation, and then picking the model type that optimizes average held-out performance. The PLMCO-CE algorithms using a single Gaussian and mixtures are denoted by CESX and CEMX, respectively.

5.3 Comparing Algorithm Performance

We compare the performance of these algorithms by running each of them 100 times on all 8 problems. For a given trial of a test problem, the same set of initial samples was used for all algorithms. Therefore, performance differences between algorithms reflect directly on the choices made by the algorithms themselves. Of course, this common initial set was varied from trial to trial.

We used two metrics to analyze the results of these experiments. The first is mean performance; since we do not have the true mean, we plot the sample mean of these 100 runs, accompanied by the associated 95% confidence intervals shown as shaded regions. This, however, is not a comprehensive summary of algorithm performance: we also present a semilog plot of algorithm performance, where we plot (G_best − G*) on a log scale against the number of function calls. This is standard practice in conventional optimization, but is seldom done in the evolutionary optimization community.
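The two metrics just described can be sketched as follows; the helper names are ours, since the paper itself reports only the resulting plots.

```python
import numpy as np

def mean_with_ci(trial_curves, z=1.96):
    """Sample mean across independent trials, with a normal-approximation
    95% confidence band (mean +/- z * standard error) at each
    function-call index. trial_curves has shape (n_trials, n_evals)."""
    c = np.asarray(trial_curves, dtype=float)
    mean = c.mean(axis=0)
    stderr = c.std(axis=0, ddof=1) / np.sqrt(c.shape[0])
    return mean, mean - z * stderr, mean + z * stderr

def best_so_far_gap(G_values, G_star):
    """Running gap (G_best - G*): best objective value seen so far minus
    the known optimum, i.e. the quantity plotted on the log scale."""
    return np.minimum.accumulate(np.asarray(G_values, dtype=float)) - G_star
```

Because (G_best − G*) is nonnegative and typically skewed across trials, the second metric is summarized by its median (with best and worst curves), rather than its mean.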
One reason to use a semilog scale is to be able to visually distinguish smaller values when they are displayed on the same plot as larger ones: often, we want to know whether some algorithm consistently finds values 0.001 smaller than some other algorithm, even when the values themselves are much larger than 0.001. In conventional optimization, the other major reason to use a semilog scale is to assess convergence rate. On these semilog plots, we cannot properly display confidence intervals, for two reasons: the upper and lower confidence intervals will not be of the same size, owing to the logarithmic nature of the plot scale, and besides, the value of the mean minus the confidence interval may be negative, which cannot be shown on a log scale. Owing to the positivity of the variable (G_best − G*), this random variable has a rather skewed distribution, and in such cases it is often more informative to consider the median, since the mean is dominated by the large values. So, our second metric is median performance, shown plotted on a semilog scale along with best and worst performance. Of course, these extremal events are those that occurred on our 100 runs, and we cannot make rigorous claims regarding the true best-case or worst-case performance.

5.4 Results

We now present the performance comparisons discussed above. Fig. 1 summarizes performance on the 6-dimensional Hartman problem. As we can see, the plot of mean performance does not show any discernible difference, but this is part of the reason for showing the semilog plot. We see that the median performance of the PLMCO-CE algorithm is many orders of magnitude better, both using Gaussian mixtures and a single Gaussian. This is not surprising on a multimodal problem, where 'overfitting' to a set of unlucky data would lead to convergence to a local optimum. Fig.
2 shows performance on the 4-dimensional Woods problem, which is unimodal but poorly scaled, with large flat regions. Using a single Gaussian, the mean performance of PLMCO-CE shows significant early gains, and yet final performance is no worse than that of the standard CE algorithm. This just means that PLMCO-CE quickly converges to the optimum. With mixtures, median performance, as usual, is greatly improved.

Fig. 3 compares results on the Shekel family of functions, each of whose members is 4-dimensional and multimodal. As before, median performance with mixtures is vastly improved with PLMCO-CE, and is not adversely affected with a single Gaussian. The large variances in mean performance indicate that both PLMCO-CE and standard CE with mixtures are getting trapped in local minima. This is to be expected when using mixtures on a multimodal problem. Nevertheless, PLMCO seems to ameliorate this problem. While it is clear that performance is greatly improved in the initial stages, we cannot be as certain about final mean performance. Still, the overlap in confidence intervals is rather small, and best-case performance is vastly improved, suggesting that using PLMCO advantageously skews the distribution of algorithm performance. With a single Gaussian, we can see that we need more trials: the variance in the means is not small enough to draw valid conclusions about final performance. In all cases, significant early gains from using PLMCO are indisputable.

Figs. 4 and 5 compare performance on the Hougen and Rosenbrock functions. On the Hougen problem, as before, PLMCO provides significant gains in the early stages, but no significant final gains (all variants do eventually converge to the optimum). On the n-dimensional Rosenbrock, all algorithms perform very poorly, mainly owing to the bad scaling (of curvature) of the problem.
Even so, PLMCO does not seem to hurt performance in any way, and arguably improves it slightly.

[Figure 1: Performance comparison on the Hartman6 function. Panels: (a) median and (b) mean performance with Gaussian mixtures; (c) median and (d) mean performance with a single Gaussian.]

[Figure 2: Performance comparison on the Woods function. Panels as in Figure 1.]

[Figure 3: Performance comparison on the Shekel family of functions. Panels as in Figure 1.]

[Figure 4: Performance comparison on the Hougen function. Panels as in Figure 1.]

[Figure 5: Performance comparison on the n-dimensional Rosenbrock. Panels as in Figure 1.]

While the 10-dimensional Rosenbrock seems to have been successfully used by Kroese et al. [2006], careful examination reveals that the 10^5 function evaluations were used to arrive within 0.
02 of the optimum, which is somewhat inefficient for a ten-dimensional problem.

5.5 Discussion and Conclusions

In truth, this is not the whole picture. The relation between MCO and PL can be summarized as follows: each step of an MCO algorithm is essentially an entire PL problem. In addition, between iterations, MCO algorithms acquire more data by sampling, and this choice of samples is largely a heuristic, such as adaptive importance sampling. In the PL community, the field of active learning [Freund et al., 1997, Dasgupta and Kalai, 2005] comes close to addressing this problem, but it still does not address the iterative nature of the MCO algorithm. All of this is in addition to the three points mentioned in the text: when designing estimators for parametrized integrals, there are few or no published results in the literature on tailoring such estimators for a search process. In particular, as described in the text, the authors are unaware of any analyses that consider the higher moments of the random variables required to analyze extremal values. The same is true of the moments coupling different estimators.

Even when all these considerations are ignored, our results seem to strongly imply that PL techniques improve MCO performance. That is, a bias-variance tradeoff at each iteration of the algorithm, for the individual θ-by-θ estimators alone, improves MCO performance even though it does not fully capture the process. In the particular case of the CE method, the use of PLMCO-CE strongly improves mean performance in the early stages of the algorithm. Moreover, these early gains are achieved without degrading final performance. This seems to bear out the main point of our PLMCO hypothesis: a straightforward application of well-known PL techniques to the individual iterations of an iterated MCO process should result in improved performance.
Even when one examines performance more closely, as in our investigation of median performance using a semilog plot, there are often significant gains from using PLMCO. These gains often occur when the problems are multimodal, or the data are sparse: in general, when there is a real possibility of 'overfitting' due to lack of data. One would expect that having to choose between models and tune hyperparameters is the norm rather than the exception: hard optimization problems will probably be multimodal, requiring the use of mixture distributions (and perhaps even more hyperparameters or model classes), and using naive MCO in such cases would indeed result in much overfitting and, consequently, erratic performance.

Our version of PLMCO is extremely rudimentary: we ignore almost all the complexities of an iterative search algorithm, and yet these results seem to indicate that significant gains are to be expected. An investigation of the more subtle nuances of such iterative search processes is likely to lead to even more significant gains. The authors believe that several interesting insights await discovery.

References

Y. M. Ermoliev and V. I. Norkin. Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR-98-009, International Institute for Applied Systems Analysis, March 1998.

C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.

D. H. Wolpert. On bias plus variance. Neural Computation, 9:1211–1244, 1997.

G. P. Lepage. A new algorithm for adaptive multidimensional integration. Journal of Computational Physics, 27:192–203, 1978.

L. Breiman. Stacked regression. Machine Learning, 24(1):49–64, 1996a.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996b.

D. H. Wolpert and D. Rajnarayan.
Parametric learning and Monte Carlo optimization. Available at http://arxiv.org/abs/0704.1274, 2007.

D. Angluin. Computational learning theory: Survey and selected bibliography. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of Computing, May 1992.

V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

W. Buntine and A. Weigend. Bayesian back-propagation. Complex Systems, 5:603–643, 1991.

J. M. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.

D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

D. P. Kroese, S. Porotsky, and R. Y. Rubinstein. The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability, 8(3):383–407, September 2006.

L. C. W. Dixon and G. P. Szegő. The global optimisation problem: An introduction. Towards Global Optimization, 2:1–15, 1978.

Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

S. Dasgupta and A. T. Kalai. Analysis of perceptron-based active learning, pages 249–263. Springer, 2005.