Neural Gaussian Copula for Variational Autoencoder
Prince Zizhuang Wang
Department of Computer Science, University of California, Santa Barbara
zizhuangwang@ucsb.edu

William Yang Wang
Department of Computer Science, University of California, Santa Barbara
william@cs.ucsb.edu

Abstract

Variational language models seek to estimate the posterior of latent variables with an approximated variational posterior. The model often assumes the variational posterior to be factorized even when the true posterior is not. The learned variational posterior under this assumption does not capture the dependency relationships over latent variables. We argue that this causes a typical training problem, called posterior collapse, observed in other variational language models. We propose the Gaussian Copula Variational Autoencoder (VAE) to avert this problem. Copulas are widely used to model the correlations and dependencies of high-dimensional random variables, and they are therefore helpful for maintaining the dependency relationships that are lost in VAE. The empirical results show that by modeling the correlation of latent variables explicitly using a neural parametric copula, we can avert this training difficulty while obtaining competitive results among other VAE approaches.[1]

1 Introduction

Variational Inference (VI) (Wainwright et al., 2008; Hoffman et al., 2013) methods are inspired by the calculus of variations (Gelfand et al., 2000). VI dates back to the 18th century, when it was mainly used to study the change of a functional, which is defined as a mapping from functions to the real space and can be understood as a function of functions. VI treats a distribution as a functional and then studies the problem of matching this distribution to a target distribution using the calculus of variations. After the rise of deep learning (Krizhevsky et al.
, 2012), a deep generative model called the Variational Autoencoder (Kingma and Welling, 2014; Hoffman et al., 2013) was proposed based on the theory of VI, and it has achieved great success on a large number of tasks, such as transfer learning (Shen et al., 2017), unsupervised learning (Jang et al., 2017), image generation (Gregor et al., 2015), semi-supervised classification (Jang et al., 2017), and dialogue generation (Zhao et al., 2017). VAE is able to learn a continuous space of latent random variables that is useful for many classification and generation tasks.

[1] Code will be released at https://github.com/kingofspace0wzz/copula-vae-lm

Figure 1: Intuitive illustration of VI. The ellipse P is a distribution family containing the true posterior p ∈ P, and the circle Q is a Mean-field variational family containing a standard normal prior N. The optimal solution q* is the one in Q that has the smallest KL(q||p). In reality these two families may not overlap.

Recent studies (Bowman et al., 2015; Yang et al., 2017; Xiao et al., 2018; Xu and Durrett, 2018) show that when it comes to text generation and language modeling, VAE does not perform well and often generates random texts without making good use of the learned latent codes. This phenomenon is called Posterior Collapse: the Kullback-Leibler (KL) divergence between the posterior and the prior (often assumed to be a standard Gaussian) vanishes. It makes the latent codes completely useless, because any text input is mapped to a standard Gaussian variable. Many recent studies (Yang et al., 2017; Xu and Durrett, 2018; Xiao et al., 2018; Miao et al., 2016; He et al., 2018) try to address this issue by providing new model architectures or by changing the objective functions. Our research lies in this second direction.
We review the theory of VAE, and we argue that one of its most widely used assumptions, the Mean-field assumption, is problematic. It assumes, for tractability, that all approximated solutions in the family of variational distributions are factorized, i.e., dimension-wise independent. We argue that this leads to the posterior collapse problem, since any variational posterior learned in this way does not maintain the correlation among latent codes and will never match the true posterior, which is unlikely to be factorized. We avert this problem by proposing a Neural Gaussian Copula (Copula-VAE) model to train VAE on text data. Copulas (Nelsen, 2007) can model dependencies of high-dimensional random variables and have been very successful in risk management (Kole et al., 2007; McNeil et al., 2005), financial management (Wang and Hua, 2014), and other tasks that require the modeling of dependencies. We provide a reparameterization trick (Kingma and Welling, 2014) to incorporate the copula into VAE for language modeling. We argue that by maintaining the dependency relationships over latent codes, we can dramatically improve the performance of variational language modeling and avoid posterior collapse.

Our major contributions can be summarized as follows:

• We propose a neurally parameterized Gaussian copula to obtain a better estimate of the posterior over latent codes.

• We provide a reparameterization technique for the Gaussian Copula VAE. The experiments show that our method achieves competitive results among other variational language modeling approaches.

• We perform a thorough analysis of the original VAE and Copula-VAE. The results and analysis reveal the salient drawbacks of VAE and explain how introducing a copula model helps avert the posterior collapse problem.
2 Related Work

Copula: Before the Rise of Deep Learning. A copula (Nelsen, 2007) is a multivariate distribution whose marginals are all uniformly distributed. Over the years, it has been widely used to extract correlations within high-dimensional random variables, and it has achieved great success in many fields such as risk management (Kole et al., 2007; McNeil et al., 2005), finance (Wang and Hua, 2014), civil engineering (Chen et al., 2012; Zhang and Singh, 2006), and visual description generation (Wang and Wen, 2015). In the past, copulas were often estimated by the Maximum Likelihood method (Choroś et al., 2010; Jaworski et al., 2010) via parametric or semi-parametric approaches (Tsukahara, 2005; Choroś et al., 2010). One major difficulty when estimating the copula and extracting dependencies is the dimensionality of the random variables. To overcome the curse of dimensionality, a graphical model called the vine copula (Joe and Kurowicka, 2011; Czado, 2010; Bedford et al., 2002) was proposed to estimate a high-dimensional copula density by breaking it into a set of bivariate conditional copula densities. However, this approach is often hand-designed, requiring human experts to define the form of each bivariate copula, and hence often results in overfitting. Therefore, the Gaussian copula (Xue-Kun Song, 2000; Frey et al., 2001) is often used instead, since its multivariate expression has a simple form and hence does not suffer from the curse of dimensionality.

VAE for Text. Bowman et al. (2015) proposed to use VAE for text generation with an LSTM encoder-decoder. The encoder maps the hidden states to a set of latent variables, which are further used to generate sentences. While achieving relatively low sample perplexity and being able to generate easy-to-read texts, the LSTM VAE often suffers from posterior collapse, where the learned latent codes become useless for text generation.
Recently, many studies have focused on how to avert this training problem. They either propose a new model architecture or modify the VAE objective. Yang et al. (2017) replace the LSTM (Hochreiter and Schmidhuber, 1997) decoder with a CNN decoder to control model expressiveness, as they suspect that the over-expressive LSTM is one reason the KL vanishes. Xiao et al. (2018) introduce a topic variable and pre-train a Latent Dirichlet Allocation (Blei et al., 2003) model to obtain a prior distribution over the topic information. Xu and Durrett (2018) believe the "soap bubble" effect of the high-dimensional Gaussian distribution is the main cause of KL vanishing, and therefore learn a hyper-spherical posterior over the latent codes.

3 Variational Inference

The problem of inference in probabilistic modeling is to estimate the posterior density p(z|x) of a latent variable z given input samples {x_i}, i = 1, ..., D. The direct computation of the posterior is intractable in most cases, since the normalizing constant ∫ p(z, x) dz lacks an analytic form. To obtain an approximation of the posterior, many approaches use sampling methods such as Markov chain Monte Carlo (MCMC) (Gilks et al., 1995) and Gibbs sampling (George and McCulloch, 1993). The downside of sampling methods is that they are inefficient, and it is hard to tell how close the approximation is to the true posterior. The other popular inference approach, variational inference (VI) (Wainwright et al., 2008; Hoffman et al., 2013), does not have this shortcoming, as it provides a distance metric to measure the fitness of an approximated solution.

In VI, we assume a variational family of distributions Q to approximate the true posterior. The Kullback-Leibler (KL) divergence is used to measure how close q ∈ Q is to the true p(z|x).
The optimal variational posterior q* ∈ Q is then the one that minimizes the KL divergence

    KL(q || p) = Σ_z q(z|x) log [ q(z|x) / p(z|x) ]

Based on this, the variational autoencoder (VAE) (Kingma and Welling, 2014) was proposed as a latent generative model that seeks to learn a posterior over the latent codes by minimizing the KL divergence between the true joint density p_θ(x, z) and the variational joint density q_φ(z, x). This is equivalent to maximizing the following evidence lower bound (ELBO),

    L(θ; φ; x) = −KL(q_φ(z, x) || p_θ(x, z))
               = E_{q(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ]

In this case, the Mean-field assumption (Kingma and Welling, 2014) is often adopted for simplicity. That is, we assume the members of the variational family Q are dimension-wise independent, so that the posterior q can be written as q(z|x) = ∏_{i=1}^D q(z_i|x). The simplicity of this form makes the estimation of the ELBO very easy. However, it also leads to a particular training difficulty called posterior collapse, where the KL divergence term becomes zero and the factorized variational posterior collapses to the prior. The latent codes z then become useless, since the generative model p(x|z) no longer depends on them.

We believe the problem comes from the nature of the variational family itself, and hence we propose Copula-VAE, which makes use of the dependency-modeling ability of the copula to guide the variational posterior toward the true posterior. We provide more details in the following sections.

We hypothesize that the Mean-field assumption is problematic in itself, as the q under this assumption can never recover the true structure of p. Copula-VAE, on the other hand, makes use of the dependency relationships maintained by a copula model to guide the variational posterior to match the true posterior.
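Under the Mean-field assumption with a diagonal Gaussian posterior q_φ(z|x) = N(µ, σ²I) and a standard normal prior p(z) = N(0, I), the KL term of the ELBO has the standard closed form KL = ½ Σ_i (µ_i² + σ_i² − log σ_i² − 1). A minimal numerical sketch of this identity (illustrative code with hypothetical names, not the authors' implementation):

```python
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# When the posterior equals the prior, the KL vanishes (posterior collapse).
assert kl_diag_gaussian_vs_standard_normal([0.0, 0.0], [1.0, 1.0]) == 0.0

# Any deviation from the prior gives a strictly positive KL.
kl = kl_diag_gaussian_vs_standard_normal([1.0, -0.5], [0.5, 2.0])
print(round(kl, 4))  # → 1.75
```

Posterior collapse corresponds exactly to the first case: the optimizer drives (µ, σ) to (0, 1) for every input, so the KL, and the information carried by z, vanishes.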
Our approach differs from Copula-VI (Tran et al., 2015) and Gaussian Copula-VAE (Suh and Choi, 2016) in that we use the copula to estimate the joint density p(z) rather than the empirical data density p(x).

4 Our Approach: Neural Gaussian Copula

4.1 Gaussian Copula

In this section, we review the basic concepts of the Gaussian copula. A copula is defined as a probability distribution over the high-dimensional unit cube [0, 1]^d whose univariate marginal distributions are uniform on [0, 1]. Formally, given a set of uniformly distributed random variables U_1, U_2, ..., U_n, a copula is a joint distribution defined as

    C(u_1, u_2, ..., u_n) = P(U_1 ≤ u_1, ..., U_n ≤ u_n)

What makes the copula model above so useful is the famous Sklar's Theorem. It states that for any joint cumulative distribution function (CDF) over a set of random variables {x_i}, i = 1, ..., d, with marginal CDFs F_i(x_i) = P(X_i ≤ x_i), there exists one unique copula function such that the joint CDF is

    F(x_1, ..., x_d) = C(F_1(x_1), ..., F_d(x_d))

By the probability integral transform, each marginal CDF evaluated at its own variable is a uniform random variable on [0, 1]; hence the copula above is a valid one. Since for each joint CDF there is one unique copula function associated with it given a set of marginals, we can easily construct any joint distribution whose univariate marginal distributions are the given F_i(x_i). Conversely, for a given joint distribution, we can also find the corresponding copula, which is the joint CDF of the probability integral transforms of the given marginals.
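The probability integral transform invoked above can be checked numerically: pushing samples of a continuous random variable through its own CDF yields Uniform(0, 1) values. A small sketch (illustrative, not from the paper; the standard normal CDF is written via the error function):

```python
import math
import numpy as np

def normal_cdf(x):
    """CDF of the standard normal, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)      # X ~ N(0, 1)
u = np.vectorize(normal_cdf)(x)       # U = Phi(X) should be ~ Uniform(0, 1)

# Empirical moments of U match Uniform(0, 1): mean 1/2, variance 1/12.
assert abs(u.mean() - 0.5) < 0.01
assert abs(u.var() - 1.0 / 12.0) < 0.01
print("probability integral transform check passed")
```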
A useful representation that Sklar's Theorem gives us for a continuous copula is

    C(u_1, ..., u_d) = F(F_1^{-1}(u_1), ..., F_d^{-1}(u_d))

If we further restrict the marginals to be Gaussian, we obtain an expression for the Gaussian copula,

    C_Σ(u_1, ..., u_d) = Φ(Φ_1^{-1}(u_1), ..., Φ_d^{-1}(u_d); 0, Σ)

where Φ(·; 0, Σ) is the CDF of a multivariate Gaussian distribution N(0, Σ), and {Φ_i^{-1}} are the inverses of the set of marginal Gaussian CDFs.

To calculate the joint density of a copula function, we take the derivative with respect to the random variables u and get

    c_Σ(u_1, ..., u_d) = ∂C_Σ(u_1, ..., u_d) / (∂u_1 ··· ∂u_d)
                       = [∂C_Σ(u_1, ..., u_d) / (∂q_1 ··· ∂q_d)] ∏_{i=1}^d ∂q_i/∂u_i
                       = (∏_{i=1}^d σ_i) |Σ|^{-1/2} exp(−(1/2) qᵀ M q)

where M = Σ^{-1} − diag(Σ)^{-1} and q_i = Φ_i^{-1}(u_i). Then, if the joint density p(x_1, ..., x_d) has a Gaussian form, it can be expressed by a copula density and its marginal densities; that is,

    p(x_1, ..., x_d) = ∂F(·) / (∂x_1 ··· ∂x_d)
                     = [∂C_Σ(·) / (∂u_1 ··· ∂u_d)] ∏_i ∂u_i/∂x_i
                     = c_Σ(u_1, ..., u_d) ∏_i p(x_i)

Therefore, we can decompose the problem of estimating the joint density into two smaller sub-problems: the estimation of the marginals, and the estimation of the copula density function c_Σ. In many cases, we assume independence over the random variables due to the intractability of the joint density. For example, in variational inference we apply the Mean-field assumption, which requires the variational distribution family to have a factorized form so that we can get a closed-form KL divergence with respect to the prior. This assumption, however, sacrifices the useful dependency relationships over the latent random variables and often leads to training difficulties such as the posterior collapse problem.
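The decomposition p(x_1, ..., x_d) = c_Σ(u_1, ..., u_d) ∏_i p(x_i) can be verified numerically for a zero-mean bivariate Gaussian: the copula density from the formula above, multiplied by the marginal normal densities, must reproduce the joint normal density. A sketch under these assumptions (illustrative code, not the authors'):

```python
import numpy as np

def log_gaussian_copula_density(x, cov):
    """log c_Sigma evaluated at q = x (the marginals are already Gaussian,
    so q_i = Phi_i^{-1}(u_i) = x_i):
    sum_i log sigma_i - 0.5 log|Sigma| + 0.5 q^T (diag(Sigma)^-1 - Sigma^-1) q."""
    sigma = np.sqrt(np.diag(cov))
    M = np.diag(1.0 / sigma**2) - np.linalg.inv(cov)
    return np.sum(np.log(sigma)) - 0.5 * np.log(np.linalg.det(cov)) + 0.5 * x @ M @ x

def log_mvn_density(x, cov):
    """log density of N(0, Sigma) at x."""
    d = len(x)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + x @ np.linalg.inv(cov) @ x)

def log_marginal_densities(x, cov):
    """Sum of the log densities of the Gaussian marginals N(0, Sigma_ii)."""
    sigma2 = np.diag(cov)
    return np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + x**2 / sigma2))

cov = np.array([[1.0, 0.6], [0.6, 2.0]])
x = np.array([0.3, -1.2])

lhs = log_mvn_density(x, cov)
rhs = log_gaussian_copula_density(x, cov) + log_marginal_densities(x, cov)
assert abs(lhs - rhs) < 1e-12
print("joint density = copula density * marginals: OK")
```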
If we assume the joint posterior of the latent variables to be Gaussian, then the Gaussian copula model above can be used to recover the correlation among the latent variables, which helps obtain a better estimate of the joint posterior density. In the VAE setting, we can already model the marginally independent posterior of the latent variables, so the only problem left is how to efficiently estimate the copula density function c_Σ. In the next section, we introduce a neurally parameterized Gaussian copula model, and we provide a way to incorporate it into the reparameterization technique used in VAE.

4.2 Neural Gaussian Copula for VI

Under the Mean-field assumption, we construct a variational family Q by assuming that each member q ∈ Q can be factorized,

    q(z) = ∏_{i=1}^d q(z_i)

This assumption, however, loses the dependencies over the latent codes and hence does not account for the non-factorized form of the true posterior. In this case, as pictured in Figure 2, when we search for the optimal q*, it will never reach the true posterior p. If we relax the assumption, the variational family may overlap with the posterior family; however, this is intractable, as the Monte Carlo estimator of the objective then often has very high variance (Kingma and Welling, 2014). Hence, we need a way to match the variational posterior to the true posterior while keeping the objective simple and tractable, so that the gradient estimator of the expectation is simple and precise.

This is where the Gaussian copula comes into the story. Given a factorized posterior, we can construct a Gaussian copula for the joint posterior,

    q_φ(z|x) = c_Σ(Φ_1(z_1), ..., Φ_d(z_d)) ∏_{i=1}^d q_φ(z_i|x)

where c_Σ is the Gaussian copula density. Taking the log on both sides, we have

    log q_φ(z|x) = log c_Σ(u_1, ..., u_d) + Σ_{i=1}^d log q_φ(z_i|x)

Figure 2: Training stage of VAE.
Initially, the model tries to maximize the ELBO by maximizing E_{q(z|x)}[p(x|z)]. Once E_{q(z|x)}[p(x|z)] is maximized, the model maximizes the ELBO by minimizing the KL. During this stage, the posterior starts to move closer to the prior. In the final stage, the posterior collapses to the prior; but the ELBO and E_{q(z|x)}[p(x|z)] are already maximized, which means the model keeps constraining the KL, and there are not enough gradients to move the posterior away from the prior anymore.

Figure 3: Ideal final stage of Copula-VAE. The family of distributions that contains the true posterior is now a subset of the variational family.

Note that the second term on the right-hand side is just the factorized log posterior we have in the original VAE model. By the reparameterization trick (Kingma and Welling, 2014), latent codes sampled from the posterior are parameterized as a deterministic function of µ and σ²; that is, z = µ + σ · ε, ε ∼ N(0, I), where µ and σ² are parameterized by two neural networks whose inputs are the final hidden states of the LSTM encoder. Since ∏_i q_φ(z_i|x) = N(µ, σ²I), we can compute the sum of the log densities of the posterior by

    Σ_{i=1}^d log q_φ(z_i|x) = −Σ_{i=1}^d log |σ_i| − Σ_{i=1}^d (z_i − µ_i)² / (2σ_i²) − (d/2) log 2π

Now, to estimate the log copula density log c_Σ(·), we provide a reparameterization method for the copula samples q ∼ C_Σ(Φ_1(q_1), ..., Φ_d(q_d)). As suggested by Kingma and Welling (2014) and Hoffman et al. (2013), reparameterization is needed because it gives a differentiable, low-variance estimator of the objective function. Here, we parameterize the copula samples as a deterministic function of the Cholesky factor L of the covariance matrix Σ. We use the fact that a linear transformation of multivariate Gaussian random variables is again a multivariate Gaussian random variable.
Formally, if X ∼ N(µ, Σ) and Y = AX, then Y ∼ N(Aµ, AΣAᵀ). Hence, for a Gaussian copula of the form c_Σ = N(0, Σ), we can reparameterize its samples q by

    ε ∼ N(0, I),    q = L · ε

It is easy to see that q = L · ε ∼ N(0, L I Lᵀ = Σ) is indeed a sample from the Gaussian copula model. This is the standard way of sampling from a Gaussian distribution with covariance matrix LLᵀ. To ensure the numerical stability of the above reparameterization, and to ensure that the covariance Σ = LLᵀ is positive definite, we use the following algorithm to parameterize L.

Algorithm 1: Neural reparameterization of the copula (Cholesky approach)
    h = LSTM(x)
    w = ReLU(W_1 · h + b_1)
    a = Tanh(W_2 · h + b_2)
    Σ = w · I + a aᵀ
    L = CholeskyFactorization(Σ)

In Algorithm 1, we first parameterize the covariance matrix and then perform a Cholesky factorization (Chen et al., 2008) to get the Cholesky factor L. The covariance matrix Σ = w · I + a aᵀ formed in this way is guaranteed to be positive definite. It is worth noting that we do not sample the latent codes from the Gaussian copula; z still comes from the independent Gaussian distribution. Rather, we draw a sample q from the Gaussian copula C_Σ so that we can compute the log copula density term below, which is then used as a regularization term during training to force the learned z to respect the dependencies among individual dimensions. Now, to calculate the log copula density, we only need to compute

    log c_Σ = Σ_{i=1}^d log σ_i − (1/2) log |Σ| + (1/2) qᵀ M q

where M = diag(Σ)^{-1} − Σ^{-1}.

Table 1: Qualitative comparison between VAE and our proposed approach. First row: PTB samples generated from the prior p(z) by VAE (upper half) and Copula-VAE (lower half). Second row: Yelp samples generated from the prior p(z) by VAE (upper half) and Copula-VAE (lower half).

PTB samples:
    the company said it will be sold to the company 's promotional programs and UNK
    the company also said it will sell $ n million of soap eggs turning millions of dollars
    the company said it will be UNK by the company 's UNK division n
    the company said it would n't comment on the suit and its reorganization plan
    mr . UNK said the company 's UNK group is considering a UNK standstill agreement with the company
    traders said that the stock market plunge is a UNK of the market 's rebound in the dow jones industrial average
    one trader of UNK said the market is skeptical that the market is n't UNK by the end of the session
    the company said it expects to be fully operational by the company 's latest recapitalization

Yelp samples:
    i was excited to try this place out for the first time and i was disappointed . the food was good and the food
    a few weeks ago , i was in the mood for a UNK of the UNK
    i love this place . i 've been here a few times and i 'm not sure why i 've been
    this place is really good . i 've been to the other location many times and it 's very good .
    i had a great time here .
    i was n't sure what i was expecting . i had the UNK and the
    i have been here a few times and have been here several times . the food is good , but the food is good
    this place is a great place to go for lunch . i had the chicken and waffles . i had the chicken and the UNK
    i really like this place . i love the atmosphere and the food is great .
    the food is always good .

To make sure that our model maintains the dependency structure of the latent codes, we seek to maximize both the ELBO and the joint posterior log-likelihood log q(z|x) during training. In other words, we maximize the following modified ELBO,

    L′ = L + λ (log c_Σ(·) + Σ_{i=1}^d log q_φ(z_i|x))

where L is the original ELBO and λ is the weight of the log density of the joint posterior. It controls how strongly the model maintains the dependency relationships of the latent codes.
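The covariance construction in Algorithm 1 can be sketched numerically. Here w and a are stand-ins for the outputs of the ReLU and Tanh heads (the dimension and values are illustrative assumptions, not the authors' code): Σ = w·I + aaᵀ is positive definite whenever w > 0, since vᵀΣv = w‖v‖² + (aᵀv)² > 0 for any v ≠ 0, so the Cholesky factor always exists and q = Lε has covariance LLᵀ = Σ.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)

# Stand-ins for the two network heads: the ReLU head gives w > 0,
# the Tanh head gives a with entries in (-1, 1).
w = 0.7
a = np.tanh(rng.standard_normal(d))

# Sigma = w * I + a a^T is symmetric positive definite for w > 0.
Sigma = w * np.eye(d) + np.outer(a, a)
assert np.all(np.linalg.eigvalsh(Sigma) > 0)

# The Cholesky factorization succeeds, and L L^T reconstructs Sigma.
L = np.linalg.cholesky(Sigma)
assert np.allclose(L @ L.T, Sigma)

# Reparameterized copula sample q = L @ eps: a deterministic function of eps,
# so gradients can flow through L (and hence through w and a) during training.
eps = rng.standard_normal(d)
q = L @ eps
print(q.shape)  # (4,)
```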
The reparameterization tricks for both z and q make the above objective fully differentiable with respect to µ, σ², and Σ. Maximizing L′ thus maximizes both the input log-likelihood log p(x) and the joint posterior log-likelihood log q(z|x). If the posterior collapses to the prior and has a factorized form, the joint posterior likelihood is not maximized, since the joint posterior is unlikely to be factorized. Therefore, maximizing the joint posterior log-likelihood along with the ELBO forces the model to generate readable texts while also respecting the dependency structure of the true posterior distribution, which is never factorized.

4.3 Evidence Lower Bound

Another interpretation can be obtained by looking at the prior. If we compose the copula density with the prior, then, as with the non-factorized posterior, we obtain a non-factorized prior,

    log p_θ(z) = log c_Σ(u_1, ..., u_d) + Σ_{i=1}^d log p_θ(z_i)

and the corresponding ELBO is

    L(θ; φ; x) = E_{q(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p_θ(z)) ]
               = E_{q(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − E_{q_φ(z|x)} [log q_φ(z|x) − log p(z)]
                            + E_{q_φ(z|x)} [log c_Σ(u_1, ..., u_d)] ]

As with Normalizing Flows (Rezende and Mohamed, 2015), maximizing the log copula density then learns a more flexible prior than a standard Gaussian. The dependency among the z_i is restored, since the KL term pushes the posterior toward this more complex prior.

We argue that relaxing the Mean-field assumption by maintaining the dependency structure can avert the posterior collapse problem. As shown in Figure 2, during the training of the original VAE, if KL annealing (Bowman et al., 2015) is used, the model first seeks to maximize the expectation E_{q(z|x)}[p(x|z)].
Then, since q(z|x) can never reach the true p(z|x), q reaches a boundary and the expectation can no longer increase. During this stage, the model starts to maximize the ELBO by minimizing the KL divergence. Since the expectation is maximized and can no longer counteract the KL, the posterior collapses to the prior, and there is no sufficient gradient to move it away, because the ELBO is already maximized. On the other hand, if we introduce a copula model to help maintain the dependency structure of the true posterior by maximizing the joint posterior likelihood, then, in the ideal case, the variational family can approximate distributions of any form, since it is no longer restricted to be factorized; it is therefore more likely for q in Figure 3 to be close to the true posterior p. In this case, E_{q(z|x)}[p(x|z)] can be higher, since we now have latent codes sampled from a more accurate posterior, and this expectation can then offset the decrease of the KL even in the final training stage.

5 Experimental Results

5.1 Datasets

Data     Train    Valid   Test    Vocab
Yelp13   62522    7773    8671    15K
PTB      42068    3370    3761    10K
Yahoo    100K     10K     10K     20K

Table 2: Size and vocabulary size of each dataset.

In this paper, we use the Penn Treebank (Marcus et al., 1993), Yahoo Answers (Xu and Durrett, 2018; Yang et al., 2017), and Yelp 13 reviews (Xu et al., 2016) to test our model on variational language modeling tasks. We use these three large datasets because they are widely used in other variational language modeling approaches (Bowman et al., 2015; Yang et al., 2017; Xu and Durrett, 2018; Xiao et al., 2018; He et al., 2018; Kim et al., 2018). Table 2 shows the number of samples in Train/Validation/Test and the vocabulary size for each dataset.

Model                               NLL     KL     PPL
LSTM-LM (Yang et al., 2017)         116.2   -      104.2
VAE (Bowman et al., 2015)           105.2   1.74   121
vmf-VAE (Xu and Durrett, 2018)      96.0    5.7    79.6
VAE-NF                              96.8    0.87   82.9
non-diag-VAE                        105.1   4.9    121.0
copula-VAE (cho), λ = 0.4           92.2    7.3    67.2

Table 3: Variational language modeling on PTB.

Model                               NLL     KL     PPL
VAE (Bowman et al., 2015)           197.1   0.03   58.2
vmf-VAE (Xu and Durrett, 2018)      198.0   6.4    59.3
VAE-NF                              200.4   0.1    62.5
non-diag-VAE                        198.3   4.7    59.7
copula-VAE (cho), λ = 0.5           187.7   10.0   48.0

Table 4: Variational language modeling on Yelp Reviews 13.

Model                               NLL     KL     PPL
VAE (Bowman et al., 2015)           351.6   0.3    81.6
vmf-VAE (Xu and Durrett, 2018)      359.3   17.9   89.9
VAE-NF                              353.8   0.1    83.0
lagging-VAE (He et al., 2018)       326.6   6.7    64.9
non-diag-VAE                        352.2   5.7    82.3
copula-VAE (cho), λ = 0.5           344.2   15.4   74.4

Table 5: Variational language modeling on Yahoo.

5.2 Experimental Setup

We use an experimental setup similar to those of Bowman et al. (2015); Xiao et al. (2018); Yang et al. (2017); Xu and Durrett (2018). We use an LSTM as our encoder-decoder model, where the number of hidden units for each hidden state is 512. The word embedding size is 512, and the dimension of the latent codes is 32. For both the encoder and the decoder, we apply a dropout layer to the initial input, with dropout rate α = 0.5. For inference, we pass the final hidden state to a linear layer followed by a Batch Normalization (Ioffe and Szegedy, 2015) layer to get reparameterized samples from ∏_i q(z_i|x) and from the Gaussian copula C_Σ. During training, the maximum vocabulary size for all inputs is set to 20000 and the maximum sequence length to 200. The batch size is 32, and we train for 30 epochs on each dataset using the Adam optimizer (Kingma and Ba, 2014) with learning rate r = 10⁻³. We use KL annealing (Bowman et al., 2015) during training.
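KL annealing scales the KL term by a weight that ramps from 0 to 1 over training, letting the decoder learn to use the latent code before the full KL penalty applies. A minimal linear schedule is sketched below (the linear shape and step counts are illustrative assumptions; the paper does not specify its exact schedule):

```python
def kl_anneal_weight(step, warmup_steps=10_000):
    """Linear KL-annealing weight beta_t: 0 at step 0, 1 after warmup_steps."""
    return min(1.0, step / warmup_steps)

# The annealed objective is then:  E[log p(x|z)] - beta_t * KL(q || p).
assert kl_anneal_weight(0) == 0.0
assert kl_anneal_weight(5_000) == 0.5
assert kl_anneal_weight(20_000) == 1.0
```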
We also observe that the weight of the log copula density is the most important factor in determining whether our model avoids the posterior collapse problem. We tuned this hyperparameter to find the optimal value.

Figure 4: Training and validation KL divergence and sequential loss on PTB using the Cholesky Neural Copula with different copula density weights λ. (Panels: (a) train loss, (b) train KL divergence, (c) validation loss, (d) validation KL divergence.) The optimal λ plays a large role in alleviating the posterior collapse problem and does not result in overfitting as VAE does. Models are trained with a 1-layer LSTM with 200 hidden units for the encoder/decoder, with embedding size 200 and latent dimension 32.

Figure 5: Reconstruction loss, KL divergence, and PPL (sum of loss and KL) on PTB. As we gradually increase λ, the KL divergence increases and the test reconstruction loss decreases.

5.3 Language Modeling Comparison

5.3.1 Comparison with Other Variational Models

We compare variational language modeling results on the three datasets, reporting the negative log-likelihood (NLL), the KL divergence, and the sample perplexity (PPL) for each model. The NLL is approximated by the evidence lower bound.

First, we observe that KL annealing does not alleviate the posterior collapse problem on larger datasets such as Yelp, but the problem is solved if we maintain the latent codes' dependencies by maximizing the copula likelihood when we maximize the ELBO. We also observe that the weight λ of the log copula density affects the results dramatically, although all values of λ produce competitive results compared with other methods. Here, we report the numbers for the weights λ that produce the lowest PPL. On PTB, Copula-VAE achieves the lowest sample perplexity and the best NLL approximation, and does not suffer from posterior collapse, when λ = 0.4. On Yelp, the lowest sample perplexity is achieved when λ = 0.5.

We also compare with VAE models trained with normalizing flows (Rezende and Mohamed, 2015), and we observe that our model is superior to flow-based VAE. It is worth noting that the Wasserstein Autoencoder trained with normalizing flows (Wang and Wang, 2019) achieves the lowest PPL: 66 on PTB and 41 on Yelp. However, the problem of designing flexible normalizing flows is orthogonal to our research.

5.3.2 Generation

Table 1 presents the results of the text generation task. We first randomly sample z from p(z) and then feed it into the decoder p(x|z) to generate text using greedy decoding. We can tell whether a model suffers from posterior collapse by examining the diversity of the generated sentences. The original VAE tends to generate the same type of sequence for different z. This is very obvious on PTB, where the posterior of the original VAE collapses to the prior completely. Copula-VAE, however, does not have this issue and always generates a diverse set of texts.

5.4 Hyperparameter Tuning: Copula Weights Play a Huge Role in the Training of VAE

In this section, we investigate the influence of the log copula weight λ on training. From Figure 5, we observe that our model's performance is very sensitive to the value of λ. When λ is small, the log copula density contributes only a small part to the objective and therefore does not help maintain the dependencies over the latent codes. In this case, the model behaves like the original VAE, where the KL divergence becomes zero in the end. When we increase λ, the test KL becomes larger and the test reconstruction loss becomes smaller. This phenomenon is also observed on the validation sets, as shown in Figure 4. The training PPL is monotonically decreasing in general.
However, when λ is small and the dependency relationships over the latent codes are lost, the model quickly overfits: the KL divergence quickly becomes zero and the validation loss starts to increase. This further confirms what we showed in Figure 2. For the original VAE, the model first maximizes E_{q(z|x)}[log p(x|z)], which decreases both the training and validation loss. Then, since q(z|x) can never match the true posterior, E_{q(z|x)}[log p(x|z)] reaches its ceiling, and the KL term decreases instead, as this is what is needed to keep maximizing the ELBO. During this stage, the LSTM decoder starts to learn how to generate texts from standard Gaussian latent variables, which causes the increase in validation loss. On the other hand, if we gradually increase the contribution of the copula density by increasing λ, the model is able to maintain the dependencies among the latent codes and hence the structure of the true posterior. In this case, E_{q(z|x)}[log p(x|z)] is much higher, offsetting the larger KL divergence, and the decoder is forced to generate texts from non-standard Gaussian latent codes. Therefore, the validation loss also decreases monotonically in general.

One major drawback of our model is the training time, which is 5 times longer than that of the original VAE. In terms of performance, Copula-VAE achieves the lowest reconstruction loss when λ = 0.6. It is clear from Figure 5 that increasing λ results in larger KL divergence.

6 Conclusion

In this paper, we introduce Copula-VAE with a Cholesky reparameterization method for the Gaussian copula. This approach averts posterior collapse by using the Gaussian copula to maintain the dependency structure of the true posterior. Our results show that Copula-VAE significantly improves on the language modeling results of other VAEs.

References

Tim Bedford, Roger M Cooke, et al. 2002.
Vines – a new graphical model for dependent random variables. The Annals of Statistics, 30(4):1031–1068.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. CoNLL.

Lu Chen, Vijay P Singh, Shenglian Guo, Ashok K Mishra, and Jing Guo. 2012. Drought analysis using copulas. Journal of Hydrologic Engineering, 18(7):797–808.

Yanqing Chen, Timothy A Davis, William W Hager, and Sivasankaran Rajamanickam. 2008. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS), 35(3):22.

Barbara Choroś, Rustam Ibragimov, and Elena Permiakova. 2010. Copula estimation. In Copula Theory and Its Applications, pages 77–91. Springer.

Claudia Czado. 2010. Pair-copula constructions of multivariate copulas. In Copula Theory and Its Applications, pages 93–109. Springer.

Rüdiger Frey, Alexander J McNeil, and Mark Nyfeler. 2001. Copulas and credit models. Risk, 10(111114.10).

Izrail Moiseevitch Gelfand, Richard A Silverman, et al. 2000. Calculus of Variations. Courier Corporation.

Edward I George and Robert E McCulloch. 1993. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889.

Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. 1995. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. ICML.

Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Lagging inference networks and posterior collapse in variational autoencoders.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. ICLR.

Piotr Jaworski, Fabrizio Durante, Wolfgang Karl Hardle, and Tomasz Rychlik. 2010. Copula Theory and Its Applications, volume 198. Springer.

Harry Joe and Dorota Kurowicka. 2011. Dependence Modeling: Vine Copula Handbook. World Scientific.

Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. 2018. Semi-amortized variational autoencoders. In International Conference on Machine Learning, pages 2683–2692.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. stat, 1050:10.

Erik Kole, Kees Koedijk, and Marno Verbeek. 2007. Selecting copulas for risk management. Journal of Banking & Finance, 31(8):2405–2423.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Alexander J McNeil, Rüdiger Frey, Paul Embrechts, et al. 2005. Quantitative Risk Management: Concepts, Techniques and Tools, volume 3. Princeton University Press.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In ICML.

Roger B Nelsen. 2007. An Introduction to Copulas. Springer Science & Business Media.

Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS.

Suwon Suh and Seungjin Choi. 2016. Gaussian copula variational autoencoders for mixed data. arXiv preprint arXiv:1604.04960.

Dustin Tran, David Blei, and Edo M Airoldi. 2015. Copula variational inference. In Advances in Neural Information Processing Systems, pages 3564–3572.

Hideatsu Tsukahara. 2005. Semiparametric estimation in copula models. Canadian Journal of Statistics, 33(3):357–375.

Martin J Wainwright, Michael I Jordan, et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305.

Prince Zizhuang Wang and William Yang Wang. 2019. Riemannian normalizing flow on variational Wasserstein autoencoder for text modeling. arXiv preprint arXiv:1904.02399.

William Yang Wang and Zhenhao Hua. 2014. A semiparametric Gaussian copula regression model for predicting financial risks from earnings calls. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1155–1165.

William Yang Wang and Miaomiao Wen. 2015. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA. ACL.

Yijun Xiao, Tiancheng Zhao, and William Yang Wang. 2018. Dirichlet variational autoencoder for text modeling. arXiv preprint arXiv:1811.00135.

Jiacheng Xu, Danlu Chen, Xipeng Qiu, and Xuanjing Huang. 2016. Cached long short-term memory neural networks for document-level sentiment classification. EMNLP.

Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. EMNLP.

Peter Xue-Kun Song. 2000. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305–320.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

LSVP Zhang and VP Singh. 2006. Bivariate flood frequency analysis using the copula method. Journal of Hydrologic Engineering, 11(2):150–164.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. ACL.