Approximate inference via variational sampling
Authors: Alexis Roche
Siemens Healthcare Sector, Biomedical Imaging Center (CIBM), CH-1015 Lausanne, Switzerland

October 7, 2013

Abstract. A new method called "variational sampling" is proposed to estimate integrals under probability distributions that can be evaluated up to a normalizing constant. The key idea is to fit the target distribution with an exponential family model by minimizing a strongly consistent empirical approximation to the Kullback-Leibler divergence computed using either deterministic or random sampling. It is shown how variational sampling differs conceptually from both quadrature and importance sampling, and it is established that, in the case of random independence sampling, it may have much faster stochastic convergence than importance sampling under mild conditions. The variational sampling implementation presented in this paper requires a rough initial approximation to the target distribution, which may be found, e.g., using the Laplace method, and is then shown to have the potential to substantially improve over several existing approximate inference techniques in estimating moments of order up to two of nearly-Gaussian distributions, which occur frequently in Bayesian analysis. In particular, an application of variational sampling to Bayesian logistic regression in moderate dimension is presented.

1 Introduction

A central task in probabilistic inference is to evaluate expectations with respect to the probability distribution of unobserved variables. Assume that p : R^d -> R_+ is some unnormalized target distribution representing, typically, the joint distribution of some data (omitted from the notation) and some unobserved d-dimensional random vector x. Making an inference on x often involves computing a vector-valued integral of the form:

I(p) = \int p(x) \phi(x) \, dx,   (1)

where \phi : R^d -> R^n is a feature function.
For instance, the posterior mean and variance of x can be recovered from (1) if the components of \phi form a basis of second-order polynomials (including a constant term to capture the normalizing constant of p). A recurring problem, however, is that (1) may be analytically intractable in practice, so that one has to resort to approximation methods, which may be roughly classified as being either sampling-based, or fitting-based, or both.

(Contact: alexis.roche@gmail.com, alexis.roche@epfl.ch)

Sampling-based methods substitute (1) with a finite sum of the form:

\hat{I}(p) = \sum_k w_k \frac{p(x_k)}{\pi(x_k)} \phi(x_k),   (2)

where \pi is a suitable non-vanishing "window" function, while x_k and w_k are evaluation points and weights, respectively, which are determined either deterministically or randomly. In the deterministic case, (2) is called a quadrature rule. Classical Gaussian quadrature is exact, by construction, if the integrand (p/\pi)\phi is polynomial in each component. Other rules known as Bayes-Hermite quadrature model the integrand as a spline or, equivalently, as the outcome of a Gaussian process [24]. For computational efficiency in multiple dimensions, quadrature rules should be implemented using sparse grids [12], a basic example of which is used, e.g., in the unscented Kalman filter [17].

An alternative to quadrature is to sample the evaluation points randomly and independently from the window function \pi considered as a probability density, leading to the popular importance sampling (IS) method [20]. Setting uniform weights w_k = 1/N, where N is the sample size, then guarantees that (2) yields a statistically unbiased estimate of I(p). This procedure, however, may have slow stochastic convergence, or may not converge at all, if the integrand (p/\pi)\phi has large or infinite variance under \pi.
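As an illustration of the IS rule (2) with uniform weights, the following NumPy sketch (the function and argument names are ours, not from the paper) estimates the moments of order 0, 1 and 2 of an unnormalized target, i.e. the components of I(p) when \phi spans second-order polynomials:

```python
import numpy as np

def is_moments(log_p, log_pi, sample, n, rng):
    """Importance sampling estimate of the order-0, 1, 2 moments of an
    unnormalized target p, following rule (2) with w_k = 1/N."""
    x = sample(n, rng)                             # x_k ~ pi, shape (N, d)
    w = np.exp(log_p(x) - log_pi(x)) / n           # w_k p(x_k) / pi(x_k)
    z = w.sum()                                    # normalizing constant of p
    mean = (w[:, None] * x).sum(axis=0) / z        # posterior mean
    xc = x - mean
    cov = np.einsum('k,ki,kj->ij', w, xc, xc) / z  # posterior covariance
    return z, mean, cov
```

For instance, with the unnormalized target p(x) = e^{-x^2/2} and window \pi = N(0, 2), the three estimates approach \sqrt{2\pi}, 0 and 1 as N grows.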
The most common approach to mitigate this problem is to constrain sampling so as to reduce fluctuations of the integrand across sampled points. This has motivated an impressive number of IS enhancements proposed in the literature, including defensive and multiple IS [27], adaptive IS [6, 7], path sampling [11], annealed IS [23], sequential Monte Carlo methods [10], Markov chain Monte Carlo (MCMC) methods [1, 2, 13], and nested sampling [32].

In contrast with sampling-based methods, fitting-based methods evaluate (1) by replacing the target distribution p with an approximation for which the integral is tractable, thereby usually avoiding sampling. For instance, the well-known Laplace method [33] performs a Gaussian approximation using the second-order Taylor expansion of log p at the mode. In some problems, however, more accurate Gaussian approximations may be obtained by iterative analytical calculation using variational methods such as lower bound maximization [25, 30, 14, 26] or expectation propagation (EP) [21, 22]. Other types of explicit approximations (not necessarily Gaussian) may be computed using a latent variable model and applying the variational Bayes algorithm [4], or using other structured variational inference methods [16, 22].

Fitting-based methods, however, are not always applicable, as they typically rely on parametric or structural assumptions regarding the target distribution. When applicable, they may be computationally efficient but may lack accuracy since they intrinsically rely on approximations. On the other hand, sampling-based methods are widely applicable and virtually not limited in accuracy, but tend to be time-consuming in practice in that they may require many function evaluations to converge.

This paper explores a third group of methods that combine sampling and fitting ideas. As in variational methods, the evaluation of (1) is recast into an optimization problem.
However, rather than choosing the objective function for analytical tractability, the idea is to approximate, by sampling, an intractable objective whose optimization is known to yield an exact result. The rationale behind this variational sampling idea is that a well-chosen variational formulation may be statistically more efficient than a direct integral evaluation as in (2). While previous methods in this vein used weighted log-likelihood as a sampling-based approximation to the Kullback-Leibler (KL) divergence [34, 9, 8], this paper advocates another very natural way to approximate the KL divergence that yields more accurate integral estimates under mild conditions and, perhaps surprisingly, does not seem to have been proposed before.

2 Variational sampling

Our starting point is a well-known variational argument: computing the integral (1) is equivalent, under some existence conditions, to substituting p with the distribution q^* that minimizes the inclusive KL divergence:

L^*(q) = D(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx - \int p(x) \, dx + \int q(x) \, dx,   (3)

over the exponential family:

q_\theta(x) = e^{\theta^\top \phi(x)},   (4)

which is induced by the feature function \phi. Here, we consider the generalized KL divergence as, e.g., in [22], which is a positive-valued convex function of (p, q) for any pair of unnormalized distributions, and vanishes iff p = q. Remarkably, since I(q^*) = I(p), the variational approximation q^* yields an error-free integral for this particular choice of divergence and approximating family.

We note that I(q^*) has a closed-form expression if the exponential family consists of Gaussian or finite distributions, or products of convenient univariate distributions. The experiments reported in Section 4 focus on the Gaussian family, where \phi spans the space of second-order polynomials in x, in which case q^*
is the Gaussian distribution with the same normalizing constant, mean and variance matrix as p.

Note that, since we have the decomposition D(p \| q_\theta) = D(p \| q^*) + D(q^* \| q_\theta) for any q_\theta in the exponential family, minimizing (3) boils down to minimizing the excess KL divergence D(q^* \| q_\theta), which is clearly a positive quantity that vanishes iff q_\theta = q^*. Provided that \theta defines a unique parametrization, the excess KL divergence is nonzero whenever I(q_\theta) differs from I(q^*) = I(p), and therefore it defines a natural error measure on the integral (1), which we will use in Section 4 to assess estimation accuracy.

2.1 Surrogate KL divergence

Unfortunately, evaluating (3) for any q, and thus finding q^*, is just as difficult as evaluating (1). However, it seems natural to use a sampling approximation of L^*(q) as a surrogate cost function to be minimized. To this end, we remark that L^*(q) is the expected value under p of the loss function \ell(x, q) = -\log r(x) - 1 + r(x), with r(x) \equiv q(x)/p(x). Applying an integration rule such as (2) to \ell(x, q), we may propose the following KL approximation:

L(q) = \sum_k w_k \left[ \frac{p(x_k)}{\pi(x_k)} \log \frac{p(x_k)}{q(x_k)} - \frac{p(x_k)}{\pi(x_k)} + \frac{q(x_k)}{\pi(x_k)} \right].   (5)

As depicted in Figure 1, the loss \ell(x, q) at any point x is minimized and vanishes iff r(x) = 1, meaning q(x) = p(x). This implies the strong consistency property that L(p) \le L(q) for any unnormalized distribution q, possibly outside the approximating exponential family, with equality iff q coincides with p at the sampled points. In other words, L defines a valid cost function not just in an asymptotic sense, but for any finite sample. Another way to see this is to remark that (5) may be interpreted as a finite KL divergence D(\bar{p} \| \bar{q}), where the weighted sample vectors:

\bar{p} = \left[ w_1 \frac{p(x_1)}{\pi(x_1)}, w_2 \frac{p(x_2)}{\pi(x_2)}, \ldots \right]^\top,  \bar{q} = \left[ w_1 \frac{q_\theta(x_1)}{\pi(x_1)}, w_2 \frac{q_\theta(x_2)}{\pi(x_2)}, \ldots \right]^\top,   (6)

are considered as unnormalized finite distributions. This remark suggests that our variational sampling strategy essentially recasts a continuous distribution fitting problem into a discrete one.

Figure 1: Loss function \ell = -\log r - 1 + r associated with the generalized KL divergence.

We note that another possible sampling-based approximation of (3) is:

L'(q) = \sum_k w_k \frac{p(x_k)}{\pi(x_k)} \left[ \log \frac{p(x_k)}{q(x_k)} - 1 \right] + \int q(x) \, dx,   (7)

which slightly differs from (5) in that the sampling approximation to the integral of q is replaced by its exact expression. Minimizing (7) over the exponential family is akin to a familiar maximum likelihood estimation problem, yielding the moment-matching condition:

\int q_\theta(x) \phi(x) \, dx = \sum_k w_k \frac{p(x_k)}{\pi(x_k)} \phi(x_k).

In other words, minimizing L' provides the same integral approximation as a direct application of the integration rule to p, and therefore has no practical value in the context of estimating integrals. L' has been used previously, by analogy with maximum likelihood estimation, for various distribution approximation tasks, e.g., in the Monte Carlo expectation-maximization algorithm [34] or in the cross-entropy method [9]. However, L' cannot be interpreted as a discrete divergence, contrary to L, and turns out not to satisfy the above-mentioned strong consistency property of L of being globally minimized by p, meaning that it is generally possible to find a distribution q, even in a restricted approximation space, such that L'(q) < L'(p), as illustrated in Figure 2. In the sequel, we refer to variational sampling (VS) as the minimization of L as defined in (5).
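To make (5) concrete, here is a small NumPy sketch (the helper names are ours) that evaluates the surrogate KL divergence for the Gaussian exponential family, where \phi collects the second-order monomials:

```python
import numpy as np

def quad_features(x):
    """phi(x) = (1, x_i, x_i x_j): a basis of second-order polynomials,
    n = (d + 2)(d + 1)/2 features for x of shape (N, d)."""
    n_pts, d = x.shape
    iu = np.triu_indices(d)
    quad = (x[:, :, None] * x[:, None, :])[:, iu[0], iu[1]]
    return np.hstack([np.ones((n_pts, 1)), x, quad])

def surrogate_kl(theta, F, w, lp, lpi):
    """Empirical KL approximation (5), with q = exp(theta^T phi),
    F[k, j] = phi_j(x_k), and lp, lpi = log p, log pi at the points x_k."""
    lq = F @ theta - lpi                          # log(q_k / pi_k)
    p_, q_ = np.exp(lp - lpi), np.exp(lq)         # p_k/pi_k, q_k/pi_k
    return np.sum(w * (p_ * (lp - lpi - lq) - p_ + q_))
```

If p itself lies in the family, the surrogate vanishes exactly at the matching \theta and is strictly positive elsewhere, reflecting the strong consistency property.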
Owing to the strong consistency of L as a KL divergence approximation, VS may produce more accurate integral estimates than a direct application of the integration rule if the target distribution is "close enough" to the fitting family.

Figure 2: Exactness of variational sampling for fitting a normal distribution N(0, 1) using a Gauss-Hermite rule (left) and random independent sampling (right), both with window function \pi = N(-2, 4). Blue lines: target distribution and VS Gaussian fits. Red lines: Gaussian fits obtained by direct integral approximations (equivalent to IS in the random sampling case). Green lines: interpolating Gaussian splines.

2.2 Minimization

When choosing the approximation space as the exponential family (4), L(q_\theta) is seen to be a smooth, convex function of \theta, abbreviated L(\theta) in the following, with gradient and Hessian matrix:

\nabla_\theta L = \Phi^\top (\bar{q}_\theta - \bar{p}),  \nabla_\theta \nabla_\theta^\top L = \Phi^\top \mathrm{diag}(\bar{q}_\theta) \Phi,

where \bar{p} and \bar{q}_\theta are the weighted samples of p and q as defined in (6), and \Phi is the N x n matrix with general element \Phi_{kj} = \phi_j(x_k). The Hessian may be interpreted as a Gram matrix and is thus positive definite everywhere provided that \Phi has rank n. While there is generally no closed-form solution to the minimization of (5), the following result provides a simple sufficient condition for the existence and uniqueness of a minimum.

Proposition 2.1. If the matrix \Phi has rank n, then L(\theta) admits a unique minimizer in R^n.

Proof. If \Phi has rank n, then \Phi^\top \mathrm{diag}(\bar{q}_\theta) \Phi is positive definite for any \theta. This implies that L(\theta) is strictly convex, which guarantees uniqueness. Existence then follows from the fact that L(\theta) is coercive, in the sense that \lim_{\|\theta\| \to \infty} L(\theta) = +\infty.
To prove this, we decompose \theta as \theta = t u, where t is a scalar and u a fixed unit-norm vector. \Phi having rank n means that u^\top \Phi \neq 0, hence there exists at least one k for which \sum_i u_i \phi_i(x_k) \neq 0, implying that \lim_{t \to +\infty} q_{tu}(x_k)/p(x_k) \in \{0, +\infty\}, and therefore \lim_{t \to +\infty} \ell(x_k, q_{tu}) = +\infty, which completes the proof.

For many choices of \phi functions, such as linearly independent polynomials, this condition is verified as soon as N \ge n distinct points are sampled. For instance, the minimum sample size to guarantee uniqueness when fitting a full Gaussian model is n = (d + 2)(d + 1)/2.

In small dimension, say d < 10, we perform numerical optimization using the Newton method with adaptive step size, which involves inverting the Hessian matrix at each iteration using a Cholesky decomposition. This step is time-consuming in larger dimension, hence we resort to a quasi-Newton approach where the Hessian is approximated by the fixed matrix \Phi^\top \mathrm{diag}(\bar{p}) \Phi, which gives a rough estimate of the Hessian at the minimum. Using this trick, the optimizer typically requires more iterations to converge but runs faster overall because iterations are much cheaper.

3 Monte Carlo implementation

So far, we have considered a general definition of variational sampling that relies on any deterministic or stochastic integration rule of type (2). In practice, it is simple to implement the IS rule, where points are sampled independently from a probability density \pi and weights are uniform, w_k = 1/N. In this context, VS is a Monte Carlo integration method comparable to IS, and enjoys an analogous central limit theorem which states, in short, that the VS integral estimator is asymptotically unbiased with variance inversely proportional to N.

Proposition 3.1.
Under the conditions of Proposition 2.1, let \hat{\theta} = \arg\min_{\theta \in R^n} L(\theta) and let \hat{I}(p) = I(q_{\hat{\theta}}) be the corresponding integral estimator. If D(p \| q) admits a unique minimizer q^* in the exponential family, then \hat{I}(p) converges in distribution to I(p):

\sqrt{N} [\hat{I}(p) - I(p)] \to_d N(0, \Sigma),  with  \Sigma = \int \frac{[p(x) - q^*(x)]^2}{\pi(x)} \phi(x) \phi(x)^\top \, dx,   (8)

provided that \Sigma exists and is finite.

Sketch of proof. The proof is based on standard Taylor expansion arguments from asymptotic theory and is straightforward using a general convergence result on Monte Carlo approximated expected loss minimization given by Shao [31], Theorem 3. This theorem implies, under the stated conditions, that \hat{\theta} converges in distribution to \theta^*. The result then easily follows from a Taylor expansion of I(q_\theta) around \theta^*.

As corollaries of Proposition 3.1, we also have that \sqrt{N} (\hat{\theta} - \theta^*) converges in distribution to N(0, H^{-1} \Sigma H^{-1}), where H = \int q^*(x) \phi(x) \phi(x)^\top \, dx is the Hessian of the continuous KL divergence at the minimum, and that the minimum KL divergence is overestimated by a vanishing portion:

L^*(\hat{\theta}) - L^*(\theta^*) = D(q^* \| q_{\hat{\theta}}) = \frac{1}{2N} \mathrm{trace}(\Sigma H^{-1}) + o(1/N),

meaning that the integral estimation error, as measured by the excess KL divergence, also decreases in 1/N.

The VS asymptotic variance \Sigma / N is to be compared with that of the IS estimator, which is given by \Sigma_0 / N with:

\Sigma_0 = \int \frac{p(x)^2}{\pi(x)} \phi(x) \phi(x)^\top \, dx - I(p) I(p)^\top.   (9)

Both variances decrease in O(1/N), yet with factors that might differ by orders of magnitude if the function p - q^* takes on small values. Substituting p - q^* for p in (9), we get the fundamental insight that VS is asymptotically equivalent to IS applied to p - q^*, even though q^* is unknown.
An interpretation of this result is that VS tries to compensate for fluctuations of the importance weighting function p/\pi by subtracting from it a best fit (in the KL divergence sense), q^*/\pi. We retrieve the fact that VS is exact in the limiting case p = q^*. In this light, VS appears complementary to IS extensions that reduce variance by constraining the sampling mechanism [6, 11, 27, 7, 23, 10].

From Proposition 3.1, a sufficient condition for stochastic convergence is that \Sigma be finite. Loosely speaking, this means that \pi should be "wide enough" to sample regions where p and q^* differ significantly. In practice, choosing \pi as an initial guess of q^* found, e.g., using the Laplace method, is often good enough to obtain fast convergence. This strategy, however, does not offer a strict guarantee. In cases where p or q^* have significantly stronger tails than the chosen \pi, convergence might not occur or may be too slow. While such cases may be diagnosed using an empirical estimate of \Sigma, they will necessitate restarting the method, e.g., by rescaling the variance of \pi.

A strategy that alleviates the need for such a calibration step is to pre-multiply p by a fast-decreasing "context" function c and search for a fit of the form c(x) q_\theta(x). In this straightforward extension, VS minimizes a sampling approximation to the localized KL divergence D_c(p \| q_\theta) = D(cp \| cq_\theta), as opposed to the global divergence D(p \| q_\theta), and faster convergence then comes at the price of a biased integral estimate. This approach may be viewed as a tradeoff between the Laplace method and global KL divergence minimization.

4 Experiments

We implemented the Monte Carlo-based VS method described in Sections 2 and 3 in Scientific Python (www.scipy.org). The code is available in open source at https://github.com/alexis-roche/variational_sampler.
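The Newton scheme of Section 2.2 is easy to sketch in NumPy; the following is our own condensed reading of it, not the released implementation, with plain step halving standing in for the paper's adaptive step size:

```python
import numpy as np

def surrogate_kl(theta, F, w, lp, lpi):
    """Surrogate KL (5) for q = exp(theta^T phi); F[k, j] = phi_j(x_k),
    lp and lpi are log p and log pi at the sample points x_k."""
    lq = F @ theta - lpi
    p_, q_ = np.exp(lp - lpi), np.exp(lq)
    return np.sum(w * (p_ * (lp - lpi - lq) - p_ + q_))

def fit_vs(F, w, lp, lpi, theta0, max_iter=100):
    """Newton minimization of (5): gradient Phi^T (q_bar - p_bar),
    Hessian Phi^T diag(q_bar) Phi (Section 2.2), with step halving."""
    p_bar = w * np.exp(lp - lpi)
    theta = theta0.astype(float).copy()
    L_cur = surrogate_kl(theta, F, w, lp, lpi)
    for _ in range(max_iter):
        q_bar = w * np.exp(F @ theta - lpi)
        grad = F.T @ (q_bar - p_bar)
        # SPD system: solvable by Cholesky as in the paper
        step = np.linalg.solve(F.T @ (q_bar[:, None] * F), grad)
        t = 1.0
        while surrogate_kl(theta - t * step, F, w, lp, lpi) > L_cur and t > 1e-12:
            t *= 0.5                      # adaptive step size
        theta -= t * step
        L_new = surrogate_kl(theta, F, w, lp, lpi)
        if L_cur - L_new < 1e-14:
            break
        L_cur = L_new
    return theta
```

On a deterministic grid with a Gaussian target that belongs to the fitting family, the minimizer recovers the target's natural parameters exactly, as guaranteed by the strong consistency of L (compare Figure 2).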
The following describes some experiments using VS to illustrate its practical value compared to other methods.

4.1 Simulations

We simulated non-Gaussian target distributions as mixtures of 100 normal kernels with random centers and unit variance:

p_\delta(x) = \frac{1}{100} \sum_{i=1}^{100} N(x; \mu_i, I_d),  \mu_i \sim_{i.i.d.} N\left(0, \frac{\delta^2}{d} I_d\right).   (10)

The parameter \delta^2 represents the average squared Euclidean distance between centers, and thus controls the deviation of p_\delta from normality.

Figure 3 compares the performances of VS, IS, Bayesian Monte Carlo (BMC) [29] and the Laplace method in estimating the n = (d + 2)(d + 1)/2 moments of order 0, 1 and 2 of p_\delta, in different dimensions, for different values of \delta and for different sample sizes. VS, IS and BMC were run on the same random samples drawn from the Laplace approximation to p_\delta. VS was implemented to fit a full Gaussian model to p_\delta, as described in Section 2. The BMC method works by fitting an interpolating spline and was implemented here using an isotropic Gaussian correlation function with fixed scale 1 that matches the Gaussian kernels in p_\delta.

Errors are measured by the excess KL divergence D(q^* \| \hat{q}) (see Section 2), where \hat{q} is the Gaussian fit output by a method and q^* is the known KL-optimal fit analytically derived from p_\delta. Note that D(q^* \| \hat{q}) can be calculated analytically as it is the KL divergence between two unnormalized Gaussian distributions. It is a global error measure that combines errors on moments of order 0, 1 and 2, hence it departs from zero whenever either the normalizing constant, or the posterior mean, or the posterior variance matrix is off.

These simulations illustrate the accelerated stochastic convergence of VS over IS for target distributions that are close to Gaussian. For almost-Gaussian distributions, VS is merely equivalent to the Laplace method.
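The simulated targets (10) and their exact KL-optimal Gaussian fits q^* can be sketched as follows (helper names are ours; we assume the covariance of the centers is \delta^2/d per component, matching (10)). The mixture moments are available in closed form, which is what makes the excess KL divergence computable analytically:

```python
import numpy as np

def make_target(d, delta, rng, n_centers=100):
    """Mixture target (10): p(x) = mean_i N(x; mu_i, I_d),
    with centers mu_i ~ N(0, (delta^2 / d) I_d)."""
    mu = rng.normal(scale=delta / np.sqrt(d), size=(n_centers, d))
    def log_p(x):
        # stable log-sum-exp over the unit-variance kernels
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        m = (-0.5 * d2).max(axis=1)
        return (m + np.log(np.exp(-0.5 * d2 - m[:, None]).mean(axis=1))
                - 0.5 * d * np.log(2.0 * np.pi))
    return mu, log_p

def gaussian_fit_star(mu):
    """KL-optimal Gaussian fit q*: normalizing constant 1, mean of the
    centers, and covariance I_d plus the scatter of the centers."""
    c = mu - mu.mean(axis=0)
    return mu.mean(axis=0), np.eye(mu.shape[1]) + (c.T @ c) / len(mu)
```

In one dimension the exact moments of the fit can be checked against numerical integration of p_\delta on a grid.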
As the target departs from normality, VS can significantly improve over the Laplace method while still requiring much smaller samples than IS for comparable accuracy.

Figure 3: Excess KL divergences of VS, IS, BMC and the Laplace method for simulated Gaussian mixture target distributions as in (10). Left: \delta = 1.5. Right: \delta = 3. Top: dimension 5. Bottom: dimension 30. BMC errors in dimension 30 are out of the represented range. Dots and vertical segments respectively represent median errors and inter-quartile ranges across 100 sampling trials.

The BMC method was competitive with IS in dimension 5, but broke down in higher dimension. These findings must be tempered by the fact that VS requires more computation time than IS for a given sample size, as shown in Figure 4. The good news is that estimation time scales linearly with sample size, similarly to IS, but unlike BMC, where it scales quadratically. As we will demonstrate now, one may actually save considerable computation time by running VS on a moderate-size sample rather than running IS on a larger sample for equivalent accuracy.

4.2 Logistic regression

We now investigate VS and other methods for Bayesian inference in logistic regression [28].
In this case, the unnormalized target distribution reads p(x) = p_0(x) \prod_{i=1}^M \sigma(y_i a_i^\top x), where y_i \in \{-1, 1\} are binary measurements in number M, \sigma(u) = 1/(1 + e^{-u}) is the logistic function, a_i are known attribute vectors, p_0 is a prior, and x is an unknown vector of coefficients.

Figure 4: Estimation time (excluding sampling time) for VS, IS and BMC in dimension 30.

Quantities of special interest are the posterior mean and possibly the posterior variance of x, used to make MAP or averaged predictions [28]. It is customary to approximate these using the Laplace method, which is fast but usually inaccurate. In the case of probit regression, where \sigma is replaced with the standard normal cumulative distribution, there exists an instance of the EP algorithm with fully explicit updating rules (hereafter referred to as EP-probit) that generally yields better results [28]. While explicit EP does not exist for logistic regression, one may resort to the variational bound Gaussian approximation proposed by Jaakkola & Jordan [15], denoted VB-logistic in the following, which minimizes an upper bound of the exclusive KL divergence. A related variational bound method using a mean-field approximation is discussed in [18]. Other structured variational methods such as Power EP [22] have the potential to work for logistic regression but have not been reported so far. In [13], several MCMC methods are compared on benchmark logistic regression problems, among which the best performers tend to have high computational cost.

The strategy we propose here is to use VS with sampling kernel \pi chosen as the Laplace approximation to p.
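For concreteness, the unnormalized log-posterior and its Laplace approximation (used here as the sampling kernel \pi) can be sketched as follows; the helper names and the Newton mode search are our own choices, not the paper's released code:

```python
import numpy as np

def log_posterior(x, A, y, prior_var=1e5):
    """log p(x) up to a constant: Gaussian prior N(0, prior_var I) plus
    sum_i log sigma(y_i a_i^T x), with log sigma(z) = -log(1 + e^{-z})."""
    z = y * (A @ x)
    return -np.logaddexp(0.0, -z).sum() - 0.5 * (x @ x) / prior_var

def laplace_fit(A, y, prior_var=1e5, n_iter=50):
    """Laplace approximation N(m, V): m is the posterior mode found by
    Newton iterations, V the inverse Hessian of -log p at m."""
    d = A.shape[1]
    m = np.zeros(d)
    for _ in range(n_iter):
        s = 1.0 / (1.0 + np.exp(-(A @ m)))        # sigma(a_i^T m)
        sy = 1.0 / (1.0 + np.exp(-y * (A @ m)))   # sigma(y_i a_i^T m)
        grad = A.T @ (y * (1.0 - sy)) - m / prior_var
        H = (A * (s * (1.0 - s))[:, None]).T @ A + np.eye(d) / prior_var
        m = m + np.linalg.solve(H, grad)
    return m, np.linalg.inv(H)
```

VS would then draw its evaluation points from N(m, V) and refit the second-order parameters by minimizing the surrogate KL (5).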
We report below results on several datasets from the UCI Machine Learning Repository [3], with attribute dimension ranging from d = 4 to 34, where VS is compared to IS and BMC in the same sampling scenario, as well as to EP-probit and VB-logistic. In each experiment, a constant attribute was added to account for uneven odds. All attributes were conventionally normalized to unit Euclidean norm, and a zero-mean Gaussian prior with large isotropic variance 10^5 was used.

We evaluated VS, IS and BMC on random samples of increasing sizes N corresponding to multiples 2, 4, 8, 16, ... of the number of Gaussian parameters, n = (d + 2)(d + 1)/2, provided that the overall computation time, including Laplace approximation (which took only 10-50 milliseconds), sampling and moment estimation, was below 10 seconds. All simulations satisfying this criterion were repeated 250 times. BMC was implemented using an isotropic Gaussian correlation function with scale parameter tuned empirically to yield roughly optimal performance rather than optimized by marginal likelihood maximization [29]. This was done to keep computation time comparable with both IS and VS. Ground-truth parameters were estimated using IS with samples of size 10^7.

Excess KL divergences as defined in Section 2 are reported in Table 1 for each method. For VS, IS and BMC, the reported value corresponds to the largest sample size that enabled less than 10 seconds of computation. While both deterministic variational methods EP-probit and VB-logistic ran fast (converging in 1-2 seconds), they both performed poorly in logistic regression. This could be expected for EP-probit, as it relies on a probit model; it indeed turned out to be the best performer for probit regression. VB-logistic, however, was designed for logistic regression but only works well in dimension d = 1 in our experience.
Overall, the logistic regression results show that VS was the only method to clearly improve over the Laplace method on all datasets. VS proved both more accurate and more precise than IS and BMC, while IS was more accurate than BMC but generally less precise. Figure 5 shows VS, IS and BMC errors as decreasing functions of computation time, and confirms that VS has the potential to massively overcome its lower computational efficiency at a given sample size compared to IS. Note that, since sampling time is proportional to the number of measurements M, the computational overhead of VS relative to IS is a decreasing function of M. An additional observation is that IS and BMC tended to provide more accurate posterior mean estimates than the Laplace method, but worse variance estimates, except for the low-dimensional "Haberman" dataset.

Figure 5: Estimation errors in logistic regression on the "Ionosphere" UCI dataset. Left: errors on the posterior mean (Euclidean distances). Right: excess KL divergences. Dots and vertical segments respectively represent median errors and inter-quartile ranges across 250 sampling trials. Errors for EP-probit and VB-logistic are out of the represented range.

These results suggest VS as a fast and accurate method for Bayesian logistic regression. It is perhaps less useful in this implementation for probit regression where, as shown by Table 1, VS still performed well and compared favorably with IS and BMC, but was less accurate than EP-probit within 10 seconds of computation time. Since EP does not exactly minimize the KL divergence [21, 22], VS could beat the EP solution using larger samples, as a consequence of Proposition 3.1; however, such refinement might be considered too costly in practice.
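The error measure used throughout Section 4, the generalized KL divergence between two unnormalized Gaussian fits, has a simple closed form obtained by splitting off the normalizing constants (a sketch under our own notation, writing p = z_p N(m_p, V_p) and q = z_q N(m_q, V_q)):

```python
import numpy as np

def excess_kl(zp, mp, vp, zq, mq, vq):
    """Generalized KL divergence D(p||q), definition (3), between two
    unnormalized Gaussians p = zp N(mp, vp) and q = zq N(mq, vq):
    D = zp [log(zp/zq) + KL(N_p || N_q)] - zp + zq."""
    d = len(mp)
    iq = np.linalg.inv(vq)
    dm = mq - mp
    kl = 0.5 * (np.trace(iq @ vp) + dm @ iq @ dm - d
                + np.log(np.linalg.det(vq) / np.linalg.det(vp)))
    return zp * (np.log(zp / zq) + kl) - zp + zq
```

The measure vanishes iff the two fits coincide, and departs from zero whenever the normalizing constant, the mean, or the variance matrix is off.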
Dataset            d   M    Model     EP-probit  VB-logistic  VS             IS             BMC
Haberman           4   306  Logistic  1941.9     88.91        0.001 +- 0.000  0.007 +- 0.004  0.015 +- 0.007
                            Probit    0.0005     4811.11      0.001 +- 0.001  0.157 +- 0.074  0.025 +- 0.021
Parkinsons [19]    23  195  Logistic  13.22      295.21       0.12 +- 0.01    0.24 +- 0.09    1.21 +- 0.05
                            Probit    0.05       129.90       0.10 +- 0.01    0.32 +- 0.13    1.04 +- 0.04
Ionosphere         33  351  Logistic  17.07      139.82       0.22 +- 0.02    0.69 +- 0.40    2.30 +- 0.05
                            Probit    0.06       58.60        0.18 +- 0.02    0.56 +- 0.27    3.92 +- 0.05
Prognostic Breast  34  194  Logistic  6.26       24.49        0.61 +- 0.04    0.83 +- 0.22    1.29 +- 0.04
Cancer                      Probit    0.01       26.10        0.12 +- 0.01    0.48 +- 0.20    1.62 +- 0.06

Table 1: Excess KL divergences (divided by the corresponding excess KL divergences of the Laplace method) in logistic and probit regression on UCI datasets: mid-hinges and inter-quartile ranges for several inference methods restricted to 10 seconds of computation. Values below 1 indicate higher accuracy than the Laplace method.

5 Conclusion

In summary, VS is an asymptotically exact moment-matching technique that relies on few assumptions regarding the target distribution and can provide an efficient alternative to the Laplace method, as well as to conventional sampling-based and variational inference methods. VS may be viewed as a nonlinear integration rule that is exact for Gaussian functions (or, more generally, arbitrary exponential families). Although the VS implementation tested in this paper uses random sampling, we stress that VS is not intrinsically a Monte Carlo method, as it can be constructed from any integration rule. Monte Carlo VS was shown to work well in problems of moderate dimension (up to 34 in the presented logistic regression examples) for which a reasonable initial approximation to the target distribution was available and was used as a sampling distribution.
There are many Bayesian inference problems where such an initial approximation can be obtained efficiently using the Laplace method or other analytical methods. Extension to high dimension is challenging, if only because the number of free parameters in a full Gaussian model scales quadratically with dimension. We see different potential ways to overcome this difficulty. One is to use a sparse Gaussian approximation, such as one with a diagonal covariance model, for which the number of parameters n = 2d + 1 reduces to a linear function of dimension. This can help if one is interested in marginal moments only, but can also slow down convergence if strong component-wise correlations exist. Another possible workaround is to try and factorize the target distribution into a product of elementary distributions that each involve few variables, and use VS as a building block within an EP-like algorithm [21, 22] to repeatedly perform factor-wise approximations if those cannot be computed analytically. Finally, a more general approach is to use stochastic gradient descent methods [5] as a substitute for quasi-Newton minimization in large-dimensional problems, in order to cut down computational complexity and memory load.

References

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[2] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343-373, 2008.
[3] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[4] M. J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7. Oxford University Press, 2003.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161-168, 2008.
[6] C. G. Bucher.
Adaptiv e sampling – an iterative fast Mon te Carlo pro cedure. Structur al Safety , 5:119–126, 1988. [7] O. Capp ´ e, A. Guilin, J.-M. Marin, and C. Rob ert. P opulation Monte Carlo. Journal of Computational and Gr aphic al Statistics , 13(4):907–929, 2004. [8] P . Carbonetto, M. King, and F. Hamze. A Sto c hastic appro ximation metho d for inference in probabilistic graphical mo dels. In A dvanc es in Neur al Information Pr o c essing Systems , pages 216–224, 2009. [9] P .-T. De Bo er, D. Kro ese, S. Mannor, and R. Rubinstein. A tutorial on the cross-entrop y metho d. Annals of Op er ations R ese ar ch , 134(1):19–67, 2005. [10] P . Del Moral, A. Doucet, and A. Jasra. Sequen tial Monte Carlo samplers. Journal of the R oyal Statistic al So ciety: Series B , 68(3):411–436, 2006. [11] A. Gelman and X.-L. Meng. Simulating normalizing constan ts: F rom importance sampling to bridge sampling to path sampling. Statistic al Scienc e , 13:163–185, 1998. [12] T. Gerstner and M. Grieb el. Numerical integration using sparse grids. Numeric al algo- rithms , 18(3-4):209–232, 1998. [13] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Mon te Carlo metho ds. Journal of the R oyal Statistic al So ciety: Series B , 73(2):123–214, 2011. [14] P . Hall, J. Ormerod, and M. W and. Theory of Gaussian v ariational approximation for a P oisson mixed mo del. Statistic a Sinic a , 21:369–389, 2011. [15] T. Jaakkola and M. Jordan. A v ariational approach to Ba y esian logistic regression mo d- els and their extensions. In Sixth International Workshop on Artificial Intel ligenc e and Statistics , 1997. [16] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to v ariational metho ds for graphical mo dels. Machine L e arning , 37(2):183–233, 1999. [17] S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Pr o c e e dings of the IEEE , 92(3):401–422, 2004. [18] D. A. Kno wles and T. Mink a. 
Non-conjugate v ariational message passing for m ultinomial and binary regression. In A dvanc es in Neur al Information Pr o c essing Systems , pages 1701– 1709, 2011. [19] M. A. Little, P . E. McSharry , S. J. Rob erts, D. A. Costello, and I. M. Moroz. Exploiting nonlinear recurrence and fractal scaling prop erties for v oice disorder detection. BioMe dic al Engine ering OnLine , 6(1):23, 2007. [20] N. Metrop olis. The b eginning of the Mon te Carlo metho d. L os Alamos Scienc e , 15:125–130, 1987. Sp ecial Issue. 12 [21] T. P . Mink a. Exp ectation Propagation for approximate Bay esian inference. In Unc ertainty in Artificial Intel ligenc e , pages 362–369. Morgan Kaufmann, 2001. [22] T. P . Mink a. Div ergence measures and message passing. T ec hnical Rep ort MSR-TR-2005- 173, Microsoft Research Ltd., Cam bridge, UK, Dec. 2005. [23] R. Neal. Annealed imp ortance sampling. Statistics and Computing , 11(2):125–139, 2001. [24] A. O’Hagan. Ba y es-Hermite quadrature. Journal of Statistic al Planning and Infer enc e , 29:245–260, 1991. [25] M. Opp er and C. Archam b eau. The v ariational Gaussian appro ximation revisited. Neur al c omputation , 21(3):786–792, 2009. [26] J. Ormero d and M. W and. Gaussian v ariational approximate inference for generalized linear mixed mo dels. Journal of Computational and Gr aphic al Statistics , 21(1):2–17, 2012. [27] A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the Americ an Statistic al Asso ciation , 95(449):135–143, 2000. [28] C. Rasm ussen and C. Williams. Gaussian pr o c esses for machine le arning . MIT Press, 2006. [29] C. E. Rasm ussen and Z. Ghahramani. Ba yesian Monte Carlo. In A dvanc es in Neur al Information Pr o c essing Systems , pages 489–496. MIT Press, 2003. [30] M. W. Seeger and D. P . Wipf. V ariational Ba yesian inference tec hniques. Signal Pr o c essing Magazine, IEEE , 27(6):81–91, 2010. [31] J. Shao. Mon te Carlo approximations in Ba yesian decision theory. 
Journal of the Americ an Statistic al Asso ciation , 84(407):727–732, 1989. [32] J. Skilling. Nested sampling for general Bay esian computation. Bayesian A nalysis , 1(4):833–860, 2006. [33] L. Tierney and J. B. Kadane. Accurate appro ximations for p osterior momen ts and marginal densities. Journal of the A meric an Statistic al Asso ciation , 81(393):82–86, 1986. [34] G. W ei and M. A. T anner. A Monte Carlo implemen tation of the EM algorithm and the p oor man’s data augmentation algorithm. Journal of the A meric an Statistic al Asso ciation , 85:699–704, 1990. 13