Gamma Processes, Stick-Breaking, and Variational Inference
Authors: Anirban Roychowdhury, Brian Kulis
Department of Computer Science and Engineering, The Ohio State University
roychowdhury.7@osu.edu, kulis@cse.ohio-state.edu
October 7, 2014

Abstract

While most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithms on nonnegative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors.

1 Introduction

The gamma process is a versatile pure-jump Lévy process with widespread applications in various fields of science. Of late it is emerging as an increasingly popular prior in the Bayesian nonparametric literature within the machine learning community; it has recently been applied to exchangeable models of sparse graphs [1] as well as to nonparametric ranking models [2]. It has also been used as a prior for infinite-dimensional latent indicator matrices [3]. This latter application is one of the earliest Bayesian nonparametric approaches to count modeling, and as such can be thought of as an extension of the venerable Indian Buffet Process to modeling latent structures where each feature can occur multiple times for a datapoint, instead of being simply binary.

The flexibility of gamma process models allows them to be applied in a wide variety of Bayesian nonparametric settings, but their relative complexity makes principled inference nontrivial. In particular, all direct applications of the gamma process in the Bayesian nonparametric literature use Markov chain Monte Carlo samplers (typically Gibbs sampling) for posterior inference, which often suffers from poor scalability. For other Bayesian nonparametric models, in particular those involving the Dirichlet process or beta process, a successful thread of research has considered variational alternatives to standard sampling methods [4, 5, 6].
One first derives an explicit construction of the underlying "weights" of the atomic measure component of the random measures underlying the infinite priors; the so-called "stick-breaking" processes for the Dirichlet and beta processes yield such constructions. These weights are then truncated and integrated into a mean-field variational inference algorithm. For instance, stick-breaking was derived for the Dirichlet process in the seminal paper by Sethuraman [7], which was in turn used for variational inference in Dirichlet process models [4]. Similar stick-breaking representations for a special case of the Indian Buffet Process [8] and the beta process [9] have been constructed, and have naturally led to mean-field variational inference algorithms for nonparametric models involving these priors [10, 11]. Such variational inference algorithms have been shown to be more scalable than the sampling-based inference techniques normally used; moreover, they work with the full model posterior without marginalizing out any variables.

In this paper we propose a variational inference framework for gamma process priors using a novel stick-breaking construction of the process. We use the characterization of the gamma process as a completely random measure (CRM), which allows us to leverage Poisson process properties to arrive at a simple derivation of the rate measure of our stick-breaking construction, and show that it is indeed equal to the Lévy measure of the gamma process. We also use the Poisson process formulation to derive a bound on the error of the truncated version compared to the full process, analogous to the bounds derived for the Dirichlet process [12], the Indian Buffet Process [10] and the beta process [11]. We then, as a particular example, focus on the infinite Gamma-Poisson model of [3] (note that variational inference need not be limited to this model). This model is a prior on infinitely wide latent indicator matrices with non-negative integer-valued entries; each column has an associated parameter independently drawn from a gamma distribution, and the matrix values are independently drawn from Poisson distributions with these parameters as means. We develop a mean-field variational technique using a truncated version of our stick-breaking construction, and a sampling algorithm that uses Monte Carlo integration for parameter marginalization, similar to [9], as a baseline inference algorithm for comparison. Finally, we compare the two algorithms on a nonnegative matrix factorization task involving the Psychological Review, NIPS, KOS and New York Times document corpora.

Related Work. To our knowledge there has been no previous exposition of an explicit recursive "stick-breaking"-like construction of the gamma CRM, and by extension no instance of variational algorithms for such priors. The very general inverse Lévy measure algorithm of [13] requires inversion of the exponential integral, as does the generalized CRM construction technique of [14] when applied to the gamma process; since no closed-form inverse of the exponential integral is known, these techniques do not give us an analytic construction of the weights, and hence cannot be adapted to variational techniques in a straightforward manner.
Other constructive definitions of the gamma process include [15], which discusses a sampling-based scheme for the weights of a gamma process by sampling from a Poisson process. Further, the characterization of the Dirichlet process as a normalized gamma process may possibly be utilized for sampling gamma process weights, but to our knowledge no existing methods for variational inference employ these approaches. As an alternative to gamma process-based models for count modeling, recent research has examined the negative binomial-beta process and its variants [16, 17, 18]; the stick-breaking construction of [9] readily extends to such models since they have beta process priors. The beta stick-breaking construction has also been used for variational inference in beta-Bernoulli process priors [11], though these have scalability issues when applied to the count modeling problems addressed in this work, as we show in the experimental section.

2 Background

2.1 Completely random measures

A completely random measure [19, 20] $G$ on a space $(\Omega, \mathcal{F})$ is defined as a stochastic process on $\mathcal{F}$ such that for any two disjoint Borel subsets $A_1$ and $A_2$ in $\mathcal{F}$, the random variables $G(A_1)$ and $G(A_2)$ are independent. The canonical way of constructing a completely random measure $G$ is to first take a $\sigma$-finite product measure $H$ on $\Omega \otimes \mathbb{R}^+$, then draw a countable set of points $\{(\omega_k, p_k)\}$ from a Poisson process on a Borel $\sigma$-algebra on $\Omega \otimes \mathbb{R}^+$ with $H$ as the rate measure. The CRM is then constructed as $G = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$, where the measure given to a measurable Borel set $B \subset \Omega$ is $G(B) = \sum_{k : \omega_k \in B} p_k$. In this notation the $p_k$ are referred to as weights and the $\omega_k$ as atoms.

If the rate measure is defined on $\Omega \otimes [0,1]$ as $H(d\omega, dp) = c p^{-1} (1-p)^{c-1} B_0(d\omega)\, dp$, where $B_0$ is an arbitrary finite continuous measure on $\Omega$ and $c$ is some constant (or function of $\omega$), then the corresponding CRM constructed as above is known as a beta process. If the rate measure is defined as $H(d\omega, dp) = c p^{-1} e^{-cp} G_0(d\omega)\, dp$, with the same restrictions on $c$ and $G_0$, then the corresponding CRM constructed as above is known as the gamma process. The total mass of the gamma process, $G(\Omega)$, is distributed as $\mathrm{Gamma}(cG_0(\Omega), c)$. The improper distributions in these rate measures integrate to infinity over their respective domains, ensuring a countably infinite set of points in a draw from the Poisson process. For the beta process the weights $p_k$ lie in $[0,1]$, whereas for the gamma process they lie in $[0, \infty)$. In both cases, however, the sum of the weights is finite, as can be seen from Campbell's theorem [19], and is governed by $c$ and the total mass of the base measure on $\Omega$. For completeness we note that completely random measures as defined in [19] have three components: a set of fixed atoms, a deterministic measure (usually assumed absent), and a random discrete measure. It is this third component that is explicitly generated using a Poisson process, though the fixed component can be readily incorporated into this construction [21].
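To make the Poisson process construction concrete, the following sketch (ours, not from the paper) simulates a gamma CRM by restricting the weights to $[\epsilon, \infty)$, where the rate measure becomes finite: its total mass is $c\, G_0(\Omega)\, E_1(c\epsilon)$, with $E_1$ the exponential integral, so the atom count is Poisson and the weights can be drawn by rejection from a shifted exponential. The uniform base measure and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(x) = integral_x^inf t^{-1} e^{-t} dt

rng = np.random.default_rng(0)

def sample_gamma_crm(c=1.0, mass_G0=2.0, eps=1e-3):
    """Simulate atoms of a gamma CRM with rate measure c p^{-1} e^{-cp} G0(dw) dp,
    truncated to weights p >= eps. Assumption (ours): G0 is uniform on [0, 1]
    with total mass mass_G0."""
    # Number of atoms: Poisson with mean H([0,1] x [eps, inf)) = mass_G0 * c * E1(c*eps).
    n_atoms = rng.poisson(mass_G0 * c * exp1(c * eps))
    atoms = rng.uniform(0.0, 1.0, size=n_atoms)  # locations ~ G0 / G0(Omega)
    # Weights ~ density proportional to p^{-1} e^{-cp} on [eps, inf):
    # propose p = eps + Exp(rate c) and accept with probability eps / p.
    weights = np.empty(n_atoms)
    for i in range(n_atoms):
        while True:
            p = eps + rng.exponential(1.0 / c)
            if rng.uniform() < eps / p:
                weights[i] = p
                break
    return atoms, weights

atoms, weights = sample_gamma_crm()
# As eps -> 0, sum(weights) converges in distribution to Gamma(c * mass_G0, c),
# whose mean is mass_G0; the truncated total below should already be close.
print(len(atoms), weights.sum())
```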
If we create an atomic measure by normalizing the weights $\{p_k\}$ from the gamma process, i.e. $D = \sum_{k=0}^{\infty} \pi_k \delta_{\omega_k}$ where $\pi_k = p_k / \sum_{i=0}^{\infty} p_i$, then $D$ is known as a Dirichlet process [22], denoted $D \sim \mathrm{DP}(\alpha_0, H_0)$ where $\alpha_0 = G_0(\Omega)$ and $H_0 = G_0 / \alpha_0$. It is not a CRM, as the random variables induced on disjoint sets lack independence because of the normalization; it belongs to the class of normalized random measures with independent increments (NRMIs).

2.2 Stick-breaking for the Dirichlet and Beta Processes

A recursive way to generate the weights of random measures is given by stick-breaking, where a unit interval is subdivided into fragments based on draws from suitably chosen distributions. For example, the stick-breaking construction of the Dirichlet process [7] is given by

$$D = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, \delta_{\omega_i}, \quad \text{where } V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha),\; \omega_i \overset{iid}{\sim} H_0.$$

Here the length of the first break from a unit-length stick is given by $V_1$. In the next round, a fraction $V_2$ of the remaining stick of length $1 - V_1$ is broken off, and we are left with a piece of length $(1 - V_2)(1 - V_1)$. The length of the piece broken off in the following round is therefore given by $V_3 (1 - V_2)(1 - V_1)$, and so on. Note that the weights belong to $(0,1)$, and since this is a normalized measure, the weights sum to 1 almost surely. This is consistent with the use of the Dirichlet process as a prior on probability distributions. This construction was generalized in [9] to yield stick-breaking for the beta process:

$$B = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} V_{ij}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{ij}^{(l)}\right) \delta_{\omega_{ij}}, \qquad (1)$$

where $V_{ij}^{(i)} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, $C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$, $\omega_{ij} \overset{iid}{\sim} \frac{1}{\gamma} B_0$. We use this representation as the basis for our stick-breaking-like construction of the gamma CRM, and use Poisson process-based proof techniques similar to [23] to derive the rate measure.
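As a concrete illustration (ours, not part of the paper), the Dirichlet process stick-breaking weights can be simulated directly from the recursion above; truncating at a finite number of breaks gives weights summing to just under 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_stick_breaking(alpha, n_sticks):
    """Truncated Sethuraman stick-breaking for DP(alpha, H0):
    weight_i = V_i * prod_{j<i} (1 - V_j), with V_i ~ Beta(1, alpha)."""
    V = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))  # prod_{j<i}(1 - V_j)
    return V * remaining

w = dp_stick_breaking(alpha=2.0, n_sticks=1000)
print(w.sum())  # close to 1; the weights sum to 1 almost surely as n_sticks -> infinity
```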
3 The Stick-breaking Construction of the Gamma Process

3.1 Constructions and proof of correctness

We propose a simple recursive construction of the gamma process CRM, based on the stick-breaking construction for the beta process proposed in [9, 23]. In particular, we augment (or 'mark') a slightly modified stick-breaking beta process with an independent gamma-distributed random measure and show that the resultant Poisson process has the rate measure $H(d\omega, dp) = c p^{-1} e^{-cp} G_0(d\omega)\, dp$ as defined above. We show this by directly deriving the rate measure of the marked Poisson process using product distribution formulae. Our proposed stick-breaking construction is as follows:

$$G = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} G_{ij}^{(i)} V_{ij}^{(i)} \prod_{l=1}^{i} \left(1 - V_{ij}^{(l)}\right) \delta_{\omega_{ij}}, \qquad (2)$$

where $G_{ij}^{(i)} \overset{iid}{\sim} \mathrm{Gamma}(\alpha+1, c)$, $V_{ij}^{(i)} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, $C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$, $\omega_{ij} \overset{iid}{\sim} \frac{1}{\gamma} H_0$.

As with the beta process stick-breaking construction, the product of beta random variables allows us to interpret each $j$ as corresponding to a stick that is being broken into an infinite number of pieces. Note that the expected weight on an atom in round $i$ is $\alpha^i / \left(c (1+\alpha)^i\right)$. The parameter $c$ can therefore be used to control the weight decay cadence along with $\alpha$.

The above representation provides the clearest view of the construction, but is somewhat cumbersome to deal with in practice, mostly due to the introduction of the additional gamma random variable. We reduce the number of random variables by noting that the product of a $\mathrm{Beta}(1, \alpha)$ and a $\mathrm{Gamma}(\alpha+1, c)$ random variable has an $\mathrm{Exp}(c)$ distribution; we also perform a change of variables on the product of the $(1 - V_{ij})$s to arrive at the following equivalent construction, for which we now prove correctness:

Theorem 1. A gamma CRM with positive concentration parameters $\alpha$ and $c$ and finite base measure $H_0$ may be constructed as

$$G = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} E_{ij} e^{-T_{ij}}\, \delta_{\omega_{ij}}, \qquad (3)$$

where $E_{ij} \overset{iid}{\sim} \mathrm{Exp}(c)$, $T_{ij} \overset{ind}{\sim} \mathrm{Gamma}(i, \alpha)$, $C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$, $\omega_{ij} \overset{iid}{\sim} \frac{1}{\gamma} H_0$.

Proof. Note that, by construction, in each round $i$ in (3), each set of weighted atoms $\{\omega_{ij}, E_{ij} e^{-T_{ij}}\}_{j=1}^{C_i}$ forms a Poisson point process, since the $C_i$ are drawn from a $\mathrm{Poisson}(\gamma)$ distribution. In particular, each of these sets is a marked Poisson process [21], where the atoms $\omega_{ij}$ of the Poisson process on $\Omega$ are marked with the random variables $E_{ij} e^{-T_{ij}}$, which have a probability measure on $(0, \infty)$. The superposition theorem of [21] tells us that a countable union of Poisson processes is itself a Poisson process on the same measure space; therefore, denoting $G_i = \sum_{j=1}^{C_i} E_{ij} e^{-T_{ij}} \delta_{\omega_{ij}}$, we can say that $G = \bigcup_{i=1}^{\infty} G_i$ is a Poisson process on $\Omega \times [0, \infty)$. We show below that the rate measure of this process equals that of the gamma CRM.

Now, we note that the random variable $E_{ij} e^{-T_{ij}}$ has a probability measure on $[0, \infty)$; denote this by $q_{ij}$. We are going to mark the underlying Poisson process with this measure. The density corresponding to this measure can be readily derived using product distribution formulae. To that end, ignoring indices, if we denote $W = \exp(-T)$, then we can derive its distribution by a change of variable. Then, denoting $Q = E \times W$ where $E \sim \mathrm{Exp}(c)$ and $T \sim \mathrm{Gamma}(i, \alpha)$, we can use the product distribution formula to write the density of $Q$ as

$$f_Q(q) = \int_0^1 \frac{\alpha^i}{\Gamma(i)} (-\log w)^{i-1}\, w^{\alpha - 2}\, c\, e^{-c q / w}\, dw.$$

Formally speaking, this is the Radon-Nikodym density corresponding to the measure $q$, since it is absolutely continuous with respect to the Lebesgue measure on $[0, \infty)$ and $\sigma$-finite by virtue of being a probability measure. Furthermore, these conditions hold for all the measures in our union of marked Poisson processes; this allows us to write the density of the combined measure as

$$f(p) = \sum_{i=1}^{\infty} \int_0^1 \frac{\alpha^i}{\Gamma(i)} (-\log w)^{i-1} w^{\alpha-2} c\, e^{-c p / w}\, dw = \int_0^1 \sum_{i=1}^{\infty} \frac{\alpha^i}{\Gamma(i)} (-\log w)^{i-1} w^{\alpha-2} c\, e^{-c p / w}\, dw$$

by monotone convergence, so that

$$f(p) = \int_0^1 \alpha\, w^{-2} c\, e^{-c p / w}\, dw = \alpha\, p^{-1} e^{-cp} = c\, p^{-1} e^{-cp} \cdot \frac{\alpha}{c}.$$

Note that the measure defined on $\mathcal{B}([0, \infty))$ by the "improper" gamma distribution $p^{-1} e^{-cp}$ is $\sigma$-finite, in the sense that we can decompose $[0, \infty)$ into the countable union of disjoint intervals $[1/k, 1/(k-1))$, $k = 1, 2, \ldots$, each of which has finite measure. In particular, the measure of the interval $[1, \infty)$ is given by the exponential integral. Therefore the rate measure of the process $G$ as constructed here is $G(d\omega, dp) = c p^{-1} e^{-cp} G_0(d\omega)\, dp$, where $G_0$ equals $H_0$ up to the multiplicative constant $\frac{\alpha}{c}$, and therefore satisfies the finiteness assumption imposed on $H_0$. ∎

We use the form specified in the theorem above in our variational inference algorithm, since the variational distributions on almost all the parameters and variables in this construction lend themselves to simple closed-form exponential family updates.
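A minimal simulation of the construction in Theorem 1 (our sketch; the variable names and the Monte Carlo check are ours). Truncating at $R$ rounds, the expected total mass is $\frac{\gamma}{c} \sum_{i=1}^{R} \left(\frac{\alpha}{1+\alpha}\right)^i$, which approaches $\gamma \alpha / c$, the mean of $\mathrm{Gamma}(\gamma\alpha, c)$, consistent with the rate measure derived in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma_process_stick_breaking(alpha, c, gamma_, n_rounds):
    """Draw weights of the gamma CRM via construction (3): round i has
    C_i ~ Poisson(gamma_) atoms, each with weight E * exp(-T),
    where E ~ Exp(rate c) and T ~ Gamma(shape i, rate alpha)."""
    weights = []
    for i in range(1, n_rounds + 1):
        C_i = rng.poisson(gamma_)
        E = rng.exponential(1.0 / c, size=C_i)            # numpy uses scale = 1/rate
        T = rng.gamma(shape=i, scale=1.0 / alpha, size=C_i)
        weights.append(E * np.exp(-T))
    return np.concatenate(weights)

alpha, c, gamma_ = 2.0, 1.0, 3.0
totals = [gamma_process_stick_breaking(alpha, c, gamma_, n_rounds=60).sum()
          for _ in range(2000)]
# Total mass is approximately Gamma(gamma_ * alpha, c), with mean gamma_ * alpha / c.
print(np.mean(totals), gamma_ * alpha / c)
```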
As an aside, we note that the random variables $(1 - V_{ij})$ have a $\mathrm{Beta}(\alpha, 1)$ distribution; therefore, if we denote $U_{ij} = 1 - V_{ij}$, then the construction in (2) is equivalent to

$$G = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} E_{ij}^{(i)} \prod_{l=1}^{i} U_{ij}^{(l)}\, \delta_{\omega_{ij}},$$

where $E_{ij}^{(i)} \overset{iid}{\sim} \mathrm{Exp}(c)$, $U_{ij}^{(i)} \overset{iid}{\sim} \mathrm{Beta}(\alpha, 1)$, $C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$, $\omega_{ij} \overset{iid}{\sim} \frac{1}{\gamma} H_0$. This notation relates our construction to the stick-breaking construction of the Indian Buffet Process [8], where the Bernoulli probabilities $\pi_k$ are generated as products of iid $\mathrm{Beta}(\alpha, 1)$ random variables: $\pi_1 = \nu_1$, $\pi_k = \prod_{i=1}^{k} \nu_i$ where $\nu_i \overset{iid}{\sim} \mathrm{Beta}(\alpha, 1)$. In particular, we can view our construction as a generalization of the IBP stick-breaking, where the stick-breaking weights are multiplied with independent $\mathrm{Exp}(c)$ random variables, with the summation over $j$ providing an explicit Poissonization.

3.2 Truncation analysis

The variational algorithm requires a truncation level for the number of atoms for tractability. We therefore need to analyze the closeness between the marginal distributions of the data drawn from the full prior and the truncated prior, with the stick-breaking prior weights integrated out. Our construction leads to a simpler truncation analysis if we truncate the number of rounds (indexed by $i$ in the outer sum), which automatically truncates the atoms to a finite number. For this analysis, we will use the stick-breaking gamma process as the base measure of a Poisson likelihood process, which we denote by $PP$; this is precisely the model for which we develop variational inference in the next section. If we denote the gamma process as $G = \sum_{k=0}^{\infty} g_k \delta_{\omega_k}$, with $g_k$ the recursively constructed weights, then $PP$ can be written as $PP = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$ where $p_k \sim \mathrm{Poisson}(g_k)$. Under this model we can obtain the following result, which is analogous to the error bounds derived for other nonparametric models in the literature [12, 10, 11].

Theorem 2. Let $N$ samples $X = (X_1, \ldots, X_N)$ be drawn from $PP(G)$. If $G \sim \Gamma P(c, G_0)$, the full gamma process, denote the marginal density of $X$ as $m_\infty(X)$. If $G$ is a gamma process truncated after $R$ rounds, denote the marginal density of $X$ as $m_R(X)$. Then

$$\frac{1}{4} \int \left| m_\infty(X) - m_R(X) \right| dX \;\leq\; 1 - \exp\left\{ -\frac{N \gamma \alpha}{c} \left( \frac{\alpha}{1+\alpha} \right)^{R} \right\}.$$

Proof. The starting intuition is that if we truncate the process after $R$ rounds, then the error in the marginal distribution of the data will depend on the probability of positive indicator values appearing for atoms after the $R$th round in the infinite version. Combining this with ideas analogous to those in [?] and [12], we get the following bound on the difference between the marginal distributions:

$$\frac{1}{4} \int \left| m_\infty(X) - m_R(X) \right| dX \;\leq\; P\left( \exists\, (k, j),\; k > \sum_{r=1}^{R} C_r,\; 1 \leq n \leq N \;\text{ s.t. }\; X_n(\omega_{kj}) > 0 \right).$$

Since we have a Poisson likelihood on the underlying gamma process, this probability can be written as

$$P(\cdot) = 1 - \mathbb{E}\left[ \prod_{r=R+1}^{\infty} \prod_{j=1}^{C_r} e^{-N \pi_{rj}} \right], \quad \text{where } \pi_{rj} = G_{rj}^{(r)} V_{rj}^{(r)} \prod_{l=1}^{r} \left(1 - V_{rj}^{(l)}\right).$$

We may then use Jensen's inequality to bound it as follows:

$$P(\cdot) \;\leq\; 1 - \exp\left\{ N \sum_{r=R+1}^{\infty} \mathbb{E}\left[ \sum_{j=1}^{C_r} \log e^{-\pi_{rj}} \right] \right\} = 1 - \exp\left\{ -\frac{N \gamma}{c} \sum_{r=R+1}^{\infty} \left( \frac{\alpha}{1+\alpha} \right)^{r} \right\} = 1 - \exp\left\{ -\frac{N \gamma \alpha}{c} \left( \frac{\alpha}{1+\alpha} \right)^{R} \right\}. \;∎$$
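The bound in Theorem 2 can be inverted to choose the number of rounds $R$ for a desired truncation error; a small sketch (ours) using the theorem's notation:

```python
import math

def rounds_for_error(eps, N, alpha, c, gamma_):
    """Smallest R with 1 - exp(-(N * gamma_ * alpha / c) * q**R) <= eps,
    where q = alpha / (1 + alpha), per Theorem 2."""
    target = -math.log(1.0 - eps)          # need (N gamma_ alpha / c) * q**R <= target
    q = alpha / (1.0 + alpha)
    R = math.log(target * c / (N * gamma_ * alpha)) / math.log(q)
    return max(1, math.ceil(R))

# e.g. N = 1000 datapoints, alpha = 2, c = 1, gamma_ = 3, 1% error -> R = 33
print(rounds_for_error(0.01, N=1000, alpha=2.0, c=1.0, gamma_=3.0))
```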
4 Variational Inference

As discussed in Section 3.2, we focus on the infinite Gamma-Poisson model, where a gamma process prior is used in conjunction with a Poisson likelihood function. With the weights of the gamma process integrated out, this process is known to yield a nonparametric prior for sparse, infinite count matrices [3]. We note that our approach should be easily applicable to other models involving gamma process priors.

4.1 The Model

To effectively perform variational inference, we re-write $G$ as a single sum of weighted atoms, using indicator variables $\{d_k\}$ for the rounds in which the atoms occur, similar to [9]:

$$G = \sum_{k=1}^{\infty} E_k e^{-T_k} \delta_{\omega_k}, \qquad (4)$$

where $E_k \overset{iid}{\sim} \mathrm{Exp}(c)$, $T_k \overset{ind}{\sim} \mathrm{Gamma}(d_k, \alpha)$, $\sum_{k=1}^{\infty} \mathbb{1}(d_k = r) \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$, $\omega_k \overset{iid}{\sim} \frac{1}{\gamma} H_0$. We also place gamma priors on $\alpha$, $\gamma$ and $c$: $\alpha \sim \mathrm{Gamma}(a_1, a_2)$, $\gamma \sim \mathrm{Gamma}(b_1, b_2)$, $c \sim \mathrm{Gamma}(c_1, c_2)$. Denoting the data, the latent prior variables and the model hyperparameters by $D$, $\Pi$ and $\Lambda$ respectively, the full likelihood may be written as $P(D, \Pi \mid \Lambda) = P(D, \Pi_{-G} \mid \Pi_G, \Lambda) \cdot P(\Pi_G \mid \Lambda)$, where

$$P(\Pi_G \mid \Lambda) = P(\alpha) \cdot P(\gamma) \cdot P(c) \cdot P(d \mid \gamma) \cdot \prod_{k=1}^{K} P(E_k \mid c) \cdot P(T_k \mid d_k, \alpha) \cdot \prod_{n=1}^{N} P(z_{nk} \mid E_k, T_k).$$

We truncate the infinite gamma process to $K$ atoms, and take $N$ to be the total number of datapoints. $\Pi_{-G}$ denotes the set of latent variables excluding those from the Poisson-Gamma prior; for instance, in factor analysis for topic models, it contains the Dirichlet-distributed factor variables (or topics).

From the Poisson likelihood, we have $z_{nk} \mid E_k, T_k \sim \mathrm{Poisson}(E_k e^{-T_k})$, independently for each $n$. The distributions of $T_k$ and $d$ involve indicator functions on the round indicator variables $d_k$:

$$P(T_k \mid d_k, \alpha) = \frac{\alpha^{\nu_k(0)}}{\prod_{r \geq 1} \Gamma(r)^{\mathbb{1}(d_k = r)}}\, T_k^{\nu_k(1)} e^{-\alpha T_k}, \quad \text{where } \nu_k(s) = \sum_{r \geq 1} (r - s)\, \mathbb{1}(d_k = r),$$

and

$$P(d \mid \gamma) = \prod_{r=1}^{\infty} \frac{\gamma^{\sum_k \mathbb{1}(d_k = r)}}{\left( \sum_k \mathbb{1}(d_k = r) \right)!} \cdot \exp\left\{ -\gamma\, \mathbb{1}\left( \sum_{r'=r}^{\infty} \sum_{k=1}^{\infty} \mathbb{1}(d_k = r') > 0 \right) \right\}.$$

See [11] for a discussion of how to approximate these factors in the variational algorithm.

4.2 The Variational Prior Distribution

Mean-field variational inference involves minimizing the KL divergence between the model posterior and a suitably constructed variational distribution, which is used as a more tractable alternative to the actual posterior. To that end, we propose a fully factorized variational distribution on the Poisson-Gamma prior as follows:

$$Q = q(\alpha) \cdot q(\gamma) \cdot q(c) \cdot \prod_{k=1}^{K} q(E_k) \cdot q(T_k) \cdot q(d_k) \cdot \prod_{n=1}^{N} q(z_{nk}),$$

where $q(E_k) = \mathrm{Gamma}(\acute{\xi}_k, \acute{\kappa}_k)$, $q(T_k) = \mathrm{Gamma}(\acute{u}_k, \acute{\upsilon}_k)$, $q(\alpha) = \mathrm{Gamma}(\kappa_1, \kappa_2)$, $q(\gamma) = \mathrm{Gamma}(\tau_1, \tau_2)$, $q(c) = \mathrm{Gamma}(\rho_1, \rho_2)$, $q(z_{nk}) = \mathrm{Poisson}(\lambda_{nk})$, and $q(d_k) = \mathrm{Mult}(\varphi_k)$.

Instead of working with the actual KL divergence between the full posterior and the factorized proxy distribution, variational inference maximizes what is canonically known as the evidence lower bound (ELBO), a function equal, up to an additive constant, to the negative KL divergence. In our case it may be written as $\mathcal{L} = \mathbb{E}_Q[\log P(D, \Pi \mid \Lambda)] - \mathbb{E}_Q[\log Q]$. We omit the full representation here for brevity.
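For concreteness, the factorized variational family can be represented by a small container of parameter arrays; this sketch is ours, the key names are ours, and the choice of maximum round index is an illustrative assumption.

```python
import numpy as np

def init_variational_params(K, N, rng=np.random.default_rng(3)):
    """Parameters of the fully factorized Q over the K-truncated Poisson-Gamma prior.
    Gamma factors are stored as (shape, rate) pairs; q(d_k) is a K x R_max simplex."""
    R_max = K  # cap on the round index; an illustrative choice, not from the paper
    return {
        "xi": np.ones(K), "kappa_E": np.ones(K),    # q(E_k) = Gamma(xi_k, kappa_k)
        "u": np.ones(K), "v": np.ones(K),           # q(T_k) = Gamma(u_k, v_k)
        "kappa": (1.0, 1.0),                        # q(alpha)
        "tau": (1.0, 1.0),                          # q(gamma)
        "rho": (1.0, 1.0),                          # q(c)
        "lam": rng.gamma(1.0, 1.0, size=(N, K)),    # q(z_nk) = Poisson(lam_nk)
        "phi": np.full((K, R_max), 1.0 / R_max),    # q(d_k) = Mult(phi_k)
    }
```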
4.3 The Variational Parameter Updates

Since we are using exponential family variational distributions, we leverage the closed-form variational updates for exponential families wherever we can, and perform gradient ascent on the ELBO for the parameters of those distributions which do not have closed-form updates. We list the updates on the distributions of the prior below. The closed-form updates for the hyperparameters in $q(E_k)$, $q(\alpha)$, $q(c)$ and $q(\gamma)$ are as follows:

$$\acute{\xi}_k = \sum_{n=1}^{N} \mathbb{E}_Q(z_{nk}) + 1, \qquad \acute{\kappa}_k = \mathbb{E}(c) + N \cdot \mathbb{E}_Q\left( e^{-T_k} \right),$$

$$\kappa_1 = \sum_{k=1}^{K} \sum_{r \geq 1} r\, \varphi_k(r) + a_1, \qquad \kappa_2 = \sum_{k=1}^{K} \mathbb{E}_Q(T_k) + a_2,$$

$$\rho_1 = c_1 + K, \qquad \rho_2 = \sum_{k=1}^{K} \mathbb{E}_Q(E_k) + c_2,$$

$$\tau_1 = b_1 + K, \qquad \tau_2 = \sum_{r \geq 1} \left( 1 - \prod_{k=1}^{K} \sum_{\acute{r}=1}^{r-1} \varphi_k(\acute{r}) \right) + b_2.$$

The updates for the multinomial probabilities in $q(d_k)$ are given by

$$\varphi_k(r) \propto \exp\left\{ r\, \mathbb{E}_Q(\log \alpha) - \log \Gamma(r) + (r-1)\, \mathbb{E}_Q(\log T_k) - \zeta \cdot \sum_{i \neq k} \varphi_i(r) - \mathbb{E}_Q(\gamma) \sum_{j=2}^{r} \prod_{k' \neq k} \sum_{r'=1}^{j-1} \varphi_{k'}(r') \right\}.$$

In addition to these updates, our variational algorithm requires gradient ascent updates on $q(T_k)$, along with updates on $q(\Pi_{-G})$ and $q(z_{nk})$, as follows. The gradients for the two variational parameters in $q(T_k)$ are:

$$\frac{\partial \mathcal{L}}{\partial \acute{u}_k} = \sum_{r \geq 1} (r-1)\, \varphi_k(r)\, \psi'(\acute{u}_k) - \frac{\mathbb{E}_Q(\alpha)}{\acute{\upsilon}_k} - \sum_{n=1}^{N} \mathbb{E}_Q(E_k) \left( \frac{\acute{\upsilon}_k}{\acute{\upsilon}_k + 1} \right)^{\acute{u}_k} \log\left( \frac{\acute{\upsilon}_k}{\acute{\upsilon}_k + 1} \right) - \sum_{n=1}^{N} \mathbb{E}_Q(z_{nk})\, \frac{1}{\acute{\upsilon}_k} - (\acute{u}_k - 1)\, \psi'(\acute{u}_k) + 1,$$

$$\frac{\partial \mathcal{L}}{\partial \acute{\upsilon}_k} = -\sum_{r \geq 1} (r-1)\, \varphi_k(r)\, \frac{1}{\acute{\upsilon}_k} + \mathbb{E}_Q(\alpha)\, \frac{\acute{u}_k}{\acute{\upsilon}_k^2} - \sum_{n=1}^{N} \mathbb{E}_Q(E_k)\, \frac{\acute{u}_k\, \acute{\upsilon}_k^{\acute{u}_k - 1}}{(\acute{\upsilon}_k + 1)^{\acute{u}_k + 1}} + \sum_{n=1}^{N} \mathbb{E}_Q(z_{nk})\, \frac{\acute{u}_k}{\acute{\upsilon}_k^2} - \frac{1}{\acute{\upsilon}_k}.$$

For the topic modeling problems, we model the observed vocabulary-vs-document corpus count matrix $D$ as $D \sim \mathrm{Poi}(\Phi Z)$, where the $V \times K$ matrix $\Phi$ models the factor loadings and the $K \times N$ matrix $Z$ models the actual factor counts in the documents. We put the $K$-truncated Poisson-Gamma prior on $Z$, and put a $\mathrm{Dirichlet}(\beta_1, \ldots, \beta_V)$ prior on the columns of $\Phi$. The variational distribution $Q$ consequently gets a $\mathrm{Dirichlet}(\Phi \mid \{b\}_k)$ factor multiplied into it, where $b = (b_1, \ldots, b_V)$ are the variational Dirichlet hyperparameters. This setup does not immediately lend itself to closed-form updates for the $b$s, so we resort to gradient ascent. The gradient of the ELBO with respect to each variational hyperparameter is

$$\frac{\partial \mathcal{L}}{\partial b_{vk}} = -\mathbb{E}_Q(z_{nk}) \cdot \frac{\sum_v b_{vk} - b_{vk}}{\left( \sum_v b_{vk} \right)^2} + \psi'(b_{vk}) \cdot \left( \beta_v - b_{vk} + \sum_n d_{vn} \right) + \psi'\left( \sum_v b_{vk} \right) \cdot \left( \sum_v b_{vk} - V - \beta_v - \sum_n d_{vn} + 1 \right).$$

In practice, however, we found a closed-form update facilitated by a simple lower bound on the ELBO to converge faster. We describe the update here. First note that the part of the ELBO relevant to a potential closed-form variational update of $\phi_{vk}$ can be written as

$$\mathcal{L} = -\phi_{vk} \cdot \sum_n \mathbb{E}_Q(z_{nk}) + \sum_n d_{vn} \cdot \log \phi_{vk} + \cdots,$$

which can then be lower bounded as

$$\mathcal{L} \geq \log \phi_{vk} \cdot \left( -\sum_n \mathbb{E}_Q(z_{nk}) + \sum_n d_{vn} \right) + \cdots.$$

This allows us to analytically update $b_{vk}$ as $b_{vk} = -\sum_n \mathbb{E}_Q(z_{nk}) + \sum_n d_{vn} + \beta_v$. This frees us from having to choose appropriate corpus-specific initializations and learning rates for the $\Phi$s. A similar lower bound on the ELBO allows us to update the variational parameters of $q(z_{nk})$ as $\lambda_{nk} = -1 - \sum_v d_{vn} + \mathbb{E}_Q(\log E_k) + \mathbb{E}_Q(T_k)$.
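The closed-form updates above translate directly into array operations. Below is a sketch (ours) of one coordinate-ascent sweep over the Gamma factors, reusing the parameter container sketched in Section 4.2 and the identity $\mathbb{E}_Q[e^{-T_k}] = (\acute{\upsilon}_k / (\acute{\upsilon}_k + 1))^{\acute{u}_k}$ for $q(T_k) = \mathrm{Gamma}(\acute{u}_k, \acute{\upsilon}_k)$; the occupancy term in $\tau_2$ is omitted for brevity.

```python
import numpy as np

def update_gamma_factors(params, Ez, a1, a2, b1, b2, c1, c2):
    """One sweep of the closed-form updates for q(E_k), q(alpha), q(c), q(gamma).
    Ez is the N x K matrix of E_Q(z_nk); params["phi"] is the K x R matrix q(d_k)."""
    N, K = Ez.shape
    u, v, phi = params["u"], params["v"], params["phi"]
    E_c = params["rho"][0] / params["rho"][1]        # E_Q(c) for Gamma(rho1, rho2)
    E_exp_negT = (v / (v + 1.0)) ** u                # E_Q(exp(-T_k))
    params["xi"] = Ez.sum(axis=0) + 1.0              # shape update for q(E_k)
    params["kappa_E"] = E_c + N * E_exp_negT         # rate update for q(E_k)
    r = np.arange(1, phi.shape[1] + 1)
    params["kappa"] = ((phi * r).sum() + a1, (u / v).sum() + a2)  # q(alpha)
    E_E = params["xi"] / params["kappa_E"]           # E_Q(E_k)
    params["rho"] = (c1 + K, E_E.sum() + c2)         # q(c)
    # tau_2 also needs the round-occupancy sum from the text; kept fixed here.
    params["tau"] = (b1 + K, params["tau"][1])
    return params
```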
5 The MCMC Sampler

As a baseline, we also derive and compare with a standard MCMC sampler for this model. We use the construction in (4) for sampling from the model. To avoid inferring the latent variables in all the atom weights of the Poisson-Gamma prior, we use Monte Carlo techniques to integrate them out, as in [9]. This affects posterior inference for the indicators $z_{nk}$, the round indicators $d$ and the hyperparameters $c$ and $\alpha$. The posterior distribution for $\gamma$ has closed form, as do those for the likelihood latent variables in $\Pi_{-G}$. We re-write the construction of the Poisson-Gamma prior:

$$G = \sum_{k=1}^{\infty} E_k e^{-T_k} \delta_{\omega_k}, \quad E_k \overset{iid}{\sim} \mathrm{Exp}(c),\; T_k \overset{ind}{\sim} \mathrm{Gamma}(d_k, \alpha),\; \sum_{k=1}^{\infty} \mathbb{1}(d_k = r) \overset{iid}{\sim} \mathrm{Pois}(\gamma),\; \omega_k \overset{iid}{\sim} \frac{1}{\gamma} H_0.$$

We put improper flat priors on $\alpha$ and $c$, and a noninformative Gamma prior on $\gamma$. The indicator counts are given by $Z_{nk} \sim \mathrm{Pois}(g_k)$, where $g_k = E_k e^{-T_k}$. To avoid sampling the atom weights $E_k$ and $T_k$, we integrate them out using Monte Carlo techniques in the sampling steps for the prior.

5.1 Sampling the round indicators

The conditional posterior for the round indicators $d = \{d_k\}_{k=1}^{K}$ can be written as

$$p\left( d_k = i \mid \{d_l\}_{l=1}^{k-1}, \{Z_{nk}\}_{n=1}^{N}, \alpha, c, \gamma \right) \propto p\left( \{Z_{nk}\}_{n=1}^{N} \mid d_k = i, \alpha, c \right)\, p\left( d_k = i \mid \{d_l\}_{l=1}^{k-1} \right).$$

For the first factor, we collapse out the stick-breaking weights and approximate the resulting integral using Monte Carlo techniques as follows:

$$p\left( \{Z_{nk}\}_{n=1}^{N} \mid d_k = i, \alpha, c \right) = \int_{[0,\infty)^i} \prod_{n=1}^{N} \mathrm{Pois}(Z_{nk} \mid g_k)\, dG \approx \frac{1}{S} \sum_{s=1}^{S} \prod_{n=1}^{N} \mathrm{Pois}\left( Z_{nk} \mid g_k^{(s)} \right),$$

where $g_k^{(s)} = E_k^{(s)} e^{-T_k^{(s)}} \overset{d}{=} V_{k, d_k}^{(s)} \prod_{l=1}^{d_k} \left(1 - V_{kl}^{(s)}\right)$. Here $S$ is the number of simulated samples from the integral over the stick-breaking weights; we take $S = 1000$ in our experiments. The second factor is the same as in [9]:

$$p\left( d_k = d \mid \gamma, \{d_l\}_{l=1}^{k-1} \right) = \begin{cases} 0 & \text{if } d < d_{k-1}, \\[4pt] \dfrac{1 - \sum_{t=1}^{D_{k-1}} \mathrm{Pois}(t \mid \gamma)}{1 - \sum_{t=1}^{D_{k-1}-1} \mathrm{Pois}(t \mid \gamma)} & \text{if } d = d_{k-1}, \\[10pt] \left( 1 - \dfrac{1 - \sum_{t=1}^{D_{k-1}} \mathrm{Pois}(t \mid \gamma)}{1 - \sum_{t=1}^{D_{k-1}-1} \mathrm{Pois}(t \mid \gamma)} \right) \left(1 - \mathrm{Pois}(0 \mid \gamma)\right) \mathrm{Pois}(0 \mid \gamma)^{h-1} & \text{if } d = d_{k-1} + h, \end{cases}$$

where $D_k \triangleq \sum_{j=1}^{k} \mathbb{1}(d_j = d_k)$. Normalizing the product of these two factors over all $i$ is infeasible, so we evaluate the product for increasing $i$ until it drops below $10^{-2}$, and normalize over the gathered values.
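A sketch (ours) of the Monte Carlo estimate of the first factor: for a candidate round $i$, simulate $S$ weights $g^{(s)} = E^{(s)} e^{-T^{(s)}}$ with $T^{(s)} \sim \mathrm{Gamma}(i, \alpha)$ and average the Poisson likelihoods. We work in log space for numerical stability; the toy data at the end is an assumption for demonstration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

rng = np.random.default_rng(4)

def log_marginal_given_round(z_k, i, alpha, c, S=1000):
    """Monte Carlo estimate of log p({z_nk} | d_k = i, alpha, c), collapsing the
    atom weight: g^(s) = E^(s) exp(-T^(s)), E ~ Exp(rate c), T ~ Gamma(i, rate alpha)."""
    E = rng.exponential(1.0 / c, size=S)
    T = rng.gamma(shape=i, scale=1.0 / alpha, size=S)
    g = E * np.exp(-T)                                   # S simulated weights
    # log of (1/S) * sum_s prod_n Pois(z_nk | g_s), via log-sum-exp
    log_liks = poisson.logpmf(z_k[:, None], g[None, :]).sum(axis=0)  # shape (S,)
    return logsumexp(log_liks) - np.log(S)

z_k = rng.poisson(0.5, size=100)   # toy indicator column
print([round(log_marginal_given_round(z_k, i, 2.0, 1.0), 2) for i in (1, 2, 3)])
```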
5.2 Sampling the factor variables

Here we consider the Poisson factor modeling scenario that we use to model vocabulary-document count matrices. Recall that a $V \times N$ count matrix $D$ is modeled as $D = \mathrm{Poi}(\Phi Z)$, where the $V \times K$ matrix $\Phi$ models the factor loadings and the $K \times N$ matrix $Z$ models the actual factor counts in the documents. We put the Poisson-Gamma prior on $Z$ and symmetric $\mathrm{Dirichlet}(\beta_1, \ldots, \beta_V)$ priors on the columns of $\Phi$. The sampling steps for $\Phi$ and $Z$ are described next.

5.2.1 Sampling Φ

First note that the elements of the count matrix are modeled as $d_{vn} = \mathrm{Poi}\left( \sum_{k=1}^{K} \phi_{vk} z_{kn} \right)$, which can be equivalently written as $d_{vn} = \sum_{k=1}^{K} d_{vkn}$, with $d_{vkn} = \mathrm{Poi}(\phi_{vk} z_{kn})$. Standard manipulations then allow us to sample the $d_{vkn}$s from $\mathrm{Mult}(d_{vn};\, p_{v1n}, \ldots, p_{vKn})$, where $p_{vkn} = \phi_{vk} z_{kn} / \sum_{k'} \phi_{vk'} z_{k'n}$. Now we have $\phi_k \sim \mathrm{Dirichlet}(\beta_1, \ldots, \beta_V)$. Using standard relationships between the Poisson and multinomial distributions, we can derive the posterior distribution of the $\phi_k$s as $\mathrm{Dirichlet}(\beta_1 + d_{1k}, \ldots, \beta_V + d_{Vk})$, where $d_{vk} = \sum_{n=1}^{N} d_{vkn}$.

5.2.2 Sampling Z

In our algorithm we sample each $z_{nk}$ conditioned on all the other variables in the model; the conditional posterior distribution can therefore be written as

$$p(z_{nk} \mid D, \Phi, Z_{n,-k}, d, \alpha, c, \gamma) = p(D \mid Z_n, \Phi)\, p(z_{nk} \mid d, \alpha, c, Z_{n,-k}) = \prod_{v=1}^{V} \mathrm{Poi}\left( d_{vn} \,\middle|\, \sum_{k=1}^{K} \phi_{vk} z_{kn} \right) \frac{p(Z_n \mid d, \alpha, c)}{p(Z_{n,-k} \mid d, \alpha, c)}.$$

The distributions in both the numerator and denominator of the second factor can be sampled from using the Monte Carlo techniques described above, by integrating out the stick-breaking weights.

5.3 Sampling hyperparameters

As mentioned above, we put a noninformative Gamma prior on $\gamma$ and improper flat priors on $\alpha$ and $c$. The posterior sampling steps are described below.

5.3.1 Sampling γ

Given the round indicators $d = \{d_k\}$, we can recover the round-specific atom counts as described above. Then the conjugacy between the Gamma prior on $\gamma$ and the Poisson distribution of the $C_i$ gives us a closed-form posterior distribution for $\gamma$:

$$p(\gamma \mid d, Z, \alpha, c) = \mathrm{Gamma}\left( \gamma \,\middle|\, a + \sum_{i=1}^{K} C_i,\; b + d_K \right).$$

5.3.2 Sampling α

The conditional posterior distribution of $\alpha$ may be written as

$$p(\alpha \mid Z, d, c) \propto p(\alpha) \prod_{n=1}^{N} \prod_{k=1}^{K} p(z_{nk} \mid d, \alpha, c).$$

We calculate the likelihood terms using the Monte Carlo techniques described above. Then we discretize the search space for $\alpha$ around its current value as $(\alpha_{cur} + t \Delta\alpha)_{t=L}^{U}$, where the lower and upper bounds $L$ and $U$ are chosen so that the unnormalized posterior falls below $10^{-2}$. The search space is also clipped below at 0. $\alpha$ is then drawn from a multinomial distribution on the search values after normalization.

5.3.3 Sampling c

We sample $c$ in exactly the same way as $\alpha$. We first write the conditional posterior as

$$p(c \mid Z, d, \alpha) \propto p(c) \prod_{n=1}^{N} \prod_{k=1}^{K} p(z_{nk} \mid d, \alpha, c).$$

The search space ($c > 0$) is then discretized using appropriate upper and lower bounds as above, with the likelihood terms again estimated by Monte Carlo. $c$ is then drawn from a multinomial distribution on the search values after normalization.
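A sketch (ours) of the discretized draw for $\alpha$: evaluate the unnormalized posterior on a grid around the current value, extend the grid in both directions until the value falls below the $10^{-2}$ threshold, and sample an index from the normalized values. The step size and the toy posterior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_alpha_grid(alpha_cur, log_post, delta=0.05, thresh=1e-2):
    """Draw alpha from its discretized conditional posterior, as in Section 5.3.2.
    log_post(alpha) returns the unnormalized log posterior, e.g. the Monte Carlo
    estimate of log p(Z | d, alpha, c) plus the log prior."""
    grid, vals = [], []
    for direction in (-1, +1):                 # extend both ways from alpha_cur
        t = 0 if direction > 0 else 1
        while True:
            a = alpha_cur + direction * t * delta
            if a <= 0:
                break                          # search space clipped below at 0
            p = np.exp(log_post(a))
            grid.append(a); vals.append(p)
            if p < thresh:                     # stop once mass is negligible
                break
            t += 1
    probs = np.array(vals) / np.sum(vals)
    return rng.choice(grid, p=probs)

# toy example: pretend the unnormalized posterior is Gaussian-shaped around 2
print(sample_alpha_grid(1.8, lambda a: -((a - 2.0) ** 2) / 0.5))
```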
6 Experiments

We consider the problem of learning latent topics in document corpora. Given an observed set of counts of vocabulary words in a set of documents, represented by a $V \times N$ count matrix, where $V$ is the vocabulary size and $N$ the number of documents, we aim to learn $K$ latent factors and their vocabulary realizations using Poisson factor analysis. In particular, we model the observed corpus count matrix $D$ as $D \sim \mathrm{Poi}(\Phi I)$, where the $V \times K$ matrix $\Phi$ models the factor loadings and the $K \times N$ matrix $I$ models the actual factor counts in the documents. As a baseline, we compare against the standard MCMC sampler derived in Section 5, which samples from the construction in (4) and uses Monte Carlo techniques to integrate out the atom weights.

We implemented and analyzed the performance of three variational algorithms corresponding to three different priors on $I$: the Poisson-Gamma process prior from this paper (abbreviated hereafter as VGP), the Bernoulli-beta prior from [11] (VBP) and the IBP prior from [10] (VIBP), along with the MCMC sampler mentioned above (SGP). For the Bernoulli-beta priors we modeled $I$ as $I = W \circ Z$, as in [11], where the nonparametric priors are put on $Z$ and a vague Gamma prior is put on $W$. For the VGP and SGP models we set $I = Z$. In addition, for all four algorithms, we put a symmetric $\mathrm{Dirichlet}(\beta_1, \ldots, \beta_V)$ prior on the columns of $\Phi$, and added corresponding variational distributions for the variables in the collection denoted $\Pi_{-G}$ above. We use held-out per-word test log-likelihoods and the time required to update all variables in $\Pi$ in each iteration as our comparison metrics, with 80% of the data used for training. We used the same likelihood metric as [16], with the samples replaced by the expectations of the variational distributions.

Synthetic data. As a warm-up, we consider the performance of VGP and SGP on some synthetic data generated from this model. We generate 200 weighted atoms from the gamma prior using the stick-breaking construction, and use the Poisson likelihood to generate 3000 values for each atom to yield the indicator matrix $Z$. We simulated a vocabulary of 200 terms, generated a $200 \times 200$ factor-loading matrix $\Phi$ using symmetric Dirichlet priors, and then generated $D = \mathrm{Poi}(\Phi Z)$. For the VGP, we measure the test likelihood after every iteration and average the results across 10 random restarts. These measurements are plotted in fig. 1a. As shown, VGP's measured held-out likelihood converges within 10 iterations. The SGP traceplot shows the first thirty held-out likelihoods measured after burn-in. Per-iteration times were 15 seconds and 2.36 minutes for VGP (with $K = 125$) and SGP respectively. The SGP learned $K$ online, with values oscillating around 50. SNBP refers to the Poisson-Gamma mixture ("NB process") sampler from [16]; its traceplot shows the first 30 likelihoods measured after 1000 burn-in iterations. We see that it performed similarly to our algorithms, though slightly worse.

Real data. We used a similar framework to model the count data from the Psychological Review (PsyRev)¹, NIPS², KOS³ and New York Times³ corpora. The vocabulary sizes are 2566, 13649, 6906 and 100872 respectively, while the document counts are 1281, 1740, 3430 and 300000 respectively. For each dataset, we ran all three variational algorithms with 10 random restarts each, measuring held-out log-likelihoods and per-iteration run times for different values of the truncation factor $K$. The learning rates for the gradient ascent updates were kept on the order of $10^{-4}$ for both VGP and VBP, with 5 gradient steps per iteration. A representative subset of results is shown in figs. 1b through 1f. We used vague gamma priors on the hyperparameters $\alpha$, $\gamma$ and $c$ in the variational algorithms, and improper flat priors for the sampler. We found the test likelihoods to be independent of these initializations. The results for the variational algorithms were dependent on the Dirichlet prior $\beta$ on $\Phi$, as noted in fig. 1b. We therefore used the learned test likelihood after 100 iterations as a heuristic to select $\beta$. We found the three variational algorithms to attain very similar test likelihoods across all four datasets after a few hours of CPU time, with the VGP and VBP having a slight edge over the VIBP. The sampler somewhat unexpectedly did not attain a competitive score for any dataset, unlike in the synthetic case. For instance, as shown in fig. 1c, it oscillated around -7.45 for the PsyRev dataset, whereas the variational algorithms attained -7.23.
For comparison, the NB process sampler from [16] attains -7.25 per iteration after 1000 iterations of burn-in. Also as seen in fig. 1c, VGP was faster to converge (fewer than 10 iterations, in about 5 seconds) than VIBP and VBP (about 50 iterations each). The test log-likelihoods after a few hours of runtime were largely independent of the truncation $K$ for the three variational algorithms. Behavior for the other datasets was similar.

Among the three variational algorithms, the VIBP scaled best for small to medium datasets as a function of the truncation factor, due to all updates being closed-form, in spite of having to learn the additional weight matrix $W$. The VGP running times were competitive for small values of $K$ on these datasets. However, on the large NYT dataset, VGP was orders of magnitude faster than the Bernoulli-beta algorithms (note the log scale in fig. 1f). For example, with a truncation of 100 atoms, VGP took around 45 seconds per iteration, whereas both VIBP and VBP took more than 3 minutes. The VBP scaled poorly for all datasets, as seen in figs. 1d through 1f. The reason for this is three-fold: learning the parameters of the additional matrix $W$, which is directly affected by dimensionality (also the reason for VIBP being slow on the NYT dataset); gradient updates for two variables (as opposed to one for VGP); and a Taylor approximation required for these gradient updates (see [11]). The sampler SGP required around 7 minutes per iteration for the small datasets and an hour and 40 minutes on average for NYT.

To summarize, we found the VGP to post running times that are competitive with the fastest algorithm (VIBP) on small to medium datasets, and to outperform the other methods completely on the large NYT dataset, all the while providing accuracy similar to variational algorithms for related models, as measured by held-out likelihood. It was also the fastest to converge, typically taking fewer than 15 iterations. Compared with SGP, our variational method is substantially faster (particularly on large-scale data) and produces higher likelihood scores on real data.

¹ http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
² http://www.stats.ox.ac.uk/~teh/data.html
³ https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

7 Conclusion

We have described a novel stick-breaking representation for gamma processes and used it to derive a variational inference algorithm. This algorithm has been shown to be far more scalable for large datasets than variational algorithms for related models while attaining similar accuracy, and it also outperforms sampling-based methods. We expect that recent improvements to variational techniques can also be applied to our algorithm, potentially yielding even further scalability.

Figure 1: Plots of held-out test likelihoods and per-iteration running times. Plots (d), (e) and (f) are for PsyRev, KOS, and NYT respectively. Plots (b) and (c) are for the PsyRev dataset. Algorithm trace colors are common to all plots. See text for full details. [Figure panels (a) through (f) not reproduced here.]

References

[1] F. Caron and E. B. Fox. Bayesian nonparametric models of sparse and exchangeable random graphs, 2013.

[2] F. Caron, Y. W. Teh, and B. T. Murphy. Bayesian nonparametric Plackett-Luce models for the analysis of clustered ranked data, 2013.

[3] M. Titsias. The infinite Gamma-Poisson model. In Advances in Neural Information Processing Systems, 2008.

[4] D. Blei and M. Jordan. Variational methods for Dirichlet process mixtures. Bayesian Analysis, 1:121-144.
[5] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS, 2007.

[6] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.

[7] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

[8] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.

[9] J. Paisley, A. Zaas, C. W. Woods, G. S. Ginsburg, and L. Carin. A stick-breaking construction of the beta process. In International Conference on Machine Learning, 2010.

[10] F. Doshi-Velez, K. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian Buffet Process. In AISTATS, 2009.

[11] J. Paisley, L. Carin, and D. M. Blei. Variational inference for stick-breaking beta process priors. In International Conference on Machine Learning, 2011.

[12] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.

[13] R. Wolpert and K. Ickstadt. Simulation of Lévy random fields. In Practical Nonparametric and Semiparametric Bayesian Statistics. Springer-Verlag, 1998.

[14] P. Orbanz and S. Williamson. Unit-rate Poisson representations of completely random measures.

[15] R. J. Thibaux. Nonparametric Bayesian Models for Machine Learning. PhD thesis, University of California at Berkeley, 2008.

[16] M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In NIPS, 2012.

[17] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.

[18] T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. arXiv, 2011.

[19] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59-78, 1967.

[20] M. I. Jordan. Hierarchical models, nested models and completely random measures. In M.-H. Chen, D. Dey, P. Mueller, D. Sun, and K. Ye, editors, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer, 2010.

[21] J. F. C. Kingman. Poisson Processes, volume 3 of Oxford Studies in Probability. Oxford University Press, New York, 1993.

[22] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230, 1973.

[23] J. Paisley, D. M. Blei, and M. I. Jordan. Stick-breaking beta processes and the Poisson process. In Artificial Intelligence and Statistics, 2012.