Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models


Authors: Aditya Grover, Manik Dhar, Stefano Ermon

Computer Science Department, Stanford University
{adityag, dmanik, ermon}@cs.stanford.edu

Abstract

Adversarial learning of probabilistic models has recently emerged as a promising alternative to maximum likelihood. Implicit models such as generative adversarial networks (GAN) often generate better samples compared to explicit models trained by maximum likelihood. Yet, GANs sidestep the characterization of an explicit density, which makes quantitative evaluations challenging. To bridge this gap, we propose Flow-GANs, a generative adversarial network for which we can perform exact likelihood evaluation, thus supporting both adversarial and maximum likelihood training. When trained adversarially, Flow-GANs generate high-quality samples but attain extremely poor log-likelihood scores, inferior even to a mixture model memorizing the training data; the opposite is true when trained by maximum likelihood. Results on MNIST and CIFAR-10 demonstrate that hybrid training can attain high held-out likelihoods while retaining visual fidelity in the generated samples.

1 Introduction

Highly expressive parametric models have enjoyed great success in supervised learning, where learning objectives and evaluation metrics are typically well-specified and easy to compute. On the other hand, the learning objective for unsupervised settings is less clear. At a fundamental level, the idea is to learn a generative model that minimizes some notion of divergence with respect to the data distribution. Minimizing the Kullback-Leibler divergence between the data distribution and the model, for instance, is equivalent to performing maximum likelihood estimation (MLE) on the observed data.
Maximum likelihood estimators are asymptotically statistically efficient, and serve as natural objectives for learning prescribed generative models (Mohamed and Lakshminarayanan 2016). In contrast, an alternate principle that has recently attracted much attention is based on adversarial learning, where the objective is to generate data indistinguishable from the training data. Adversarially learned models such as generative adversarial networks (GAN; (Goodfellow et al. 2014)) can sidestep specifying an explicit density for any data point and belong to the class of implicit generative models (Diggle and Gratton 1984).

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The lack of an explicit density characterization in GANs is, however, problematic for two reasons. First, several application areas of deep generative models rely on density estimates; for instance, count-based exploration strategies based on density estimation using generative models have recently achieved state-of-the-art performance on challenging reinforcement learning environments (Ostrovski et al. 2017). Secondly, it makes the quantitative evaluation of the generalization performance of such models challenging. The typical evaluation criteria based on ad-hoc sample quality metrics (Salimans et al. 2016; Che et al. 2017) do not address this issue, since it is possible to generate good samples by memorizing the training data, by missing important modes of the distribution, or both (Theis, Oord, and Bethge 2016). Alternatively, density estimates based on approximate inference techniques such as annealed importance sampling (AIS; (Neal 2001; Wu et al. 2017)) and non-parametric methods such as kernel density estimation (KDE; (Parzen 1962; Goodfellow et al.
2014)) are computationally slow and crucially rely on assumptions of a Gaussian observation model for the likelihood, which can lead to misleading estimates, as we shall demonstrate in this paper.

To sidestep the above issues, we propose Flow-GANs, a generative adversarial network with a normalizing flow generator. A Flow-GAN generator transforms a prior noise density into a model density through a sequence of invertible transformations. By using an invertible generator, Flow-GANs allow us to tractably evaluate exact likelihoods using the change-of-variables formula and to perform exact posterior inference over the latent variables, while still permitting efficient ancestral sampling — desirable properties of any probabilistic model that a typical GAN would not provide.

Using a Flow-GAN, we perform a principled quantitative comparison of maximum likelihood and adversarial learning on benchmark datasets, viz. MNIST and CIFAR-10. While adversarial learning outperforms MLE on sample quality metrics, as expected based on strong evidence in prior work, the log-likelihood estimates of adversarial learning are orders of magnitude worse than those of MLE. The difference is so stark that a simple Gaussian mixture model baseline outperforms adversarially learned models on both sample quality and held-out likelihoods. Our quantitative analysis reveals that the poor likelihoods of adversarial learning can be explained by an ill-conditioned Jacobian matrix for the generator function, suggesting mode collapse rather than overfitting to the training dataset.

To resolve the dichotomy of perceptually good-looking samples at the expense of held-out likelihoods in the case of adversarial learning (and vice versa in the case of MLE), we propose a hybrid objective that bridges implicit and prescribed learning by augmenting the adversarial training objective with an additional term corresponding to the log-likelihood of the observed data.
While the hybrid objective achieves the intended effect of smoothly trading off the two goals in the case of CIFAR-10, it has a regularizing effect on MNIST, where it outperforms both MLE and adversarial learning on held-out likelihoods as well as sample quality metrics.

Overall, this paper makes the following contributions:
1. We propose Flow-GANs, a generative adversarial network with an invertible generator that can perform efficient ancestral sampling and exact likelihood evaluation.
2. We propose a hybrid learning objective for Flow-GANs that attains good log-likelihoods and generates high-quality samples on the MNIST and CIFAR-10 datasets.
3. We demonstrate the limitations of AIS and KDE for log-likelihood evaluation and ranking of implicit models.
4. We analyze the singular value distribution of the Jacobian of the generator function to explain the low log-likelihoods observed under adversarial learning.

2 Preliminaries

We begin with a review of maximum likelihood estimation and adversarial learning in the context of generative models. For ease of presentation, all distributions are w.r.t. any arbitrary $x \in \mathbb{R}^d$, unless otherwise specified. We use upper-case to denote probability distributions and assume they all admit absolutely continuous densities (denoted by the corresponding lower-case notation) on a reference measure $\mathrm{d}x$.

Consider the following setting for learning generative models. Given some data $X = \{x_i \in \mathbb{R}^d\}_{i=1}^m$ sampled i.i.d. from an unknown probability density $p_{\text{data}}$, we are interested in learning a probability density $p_\theta$, where $\theta$ denotes the parameters of a model. Given a parametric family of models $\mathcal{M}$, the typical approach to learn $\theta \in \mathcal{M}$ is to minimize a notion of divergence between $P_{\text{data}}$ and $P_\theta$. The choice of divergence and the optimization procedure dictate learning, leading to the following two objectives.
2.1 Maximum likelihood estimation

In maximum likelihood estimation (MLE), we minimize the Kullback-Leibler (KL) divergence between the data distribution and the model distribution. Formally, the learning objective can be expressed as:

$$\min_{\theta \in \mathcal{M}} KL(P_{\text{data}}, P_\theta) = \mathbb{E}_{x \sim P_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right]$$

Since $p_{\text{data}}$ is independent of $\theta$, the above optimization problem can be equivalently expressed as:

$$\max_{\theta \in \mathcal{M}} \mathbb{E}_{x \sim P_{\text{data}}}[\log p_\theta(x)] \qquad (1)$$

Hence, evaluating the learning objective for MLE in Eq. (1) requires the ability to evaluate the model density $p_\theta(x)$. Models that provide an explicit characterization of the likelihood function are referred to as prescribed generative models (Mohamed and Lakshminarayanan 2016).

2.2 Adversarial learning

A generative model can be learned to optimize divergence notions beyond the KL divergence. A large family of divergences can be conveniently expressed as:

$$\max_{\phi \in \mathcal{F}} \mathbb{E}_{x \sim P_\theta}[h_\phi(x)] - \mathbb{E}_{x \sim P_{\text{data}}}\left[h'_\phi(x)\right] \qquad (2)$$

where $\mathcal{F}$ denotes a set of parameters, and $h_\phi$ and $h'_\phi$ are appropriate real-valued functions parameterized by $\phi$. Different choices of $\mathcal{F}$, $h_\phi$, and $h'_\phi$ lead to a variety of $f$-divergences, such as the Jensen-Shannon divergence, as well as integral probability metrics, such as the Wasserstein distance. For instance, the GAN objective proposed by Goodfellow et al. (2014) can also be cast in the form of Eq. (2):

$$\max_{\phi \in \mathcal{F}} \mathbb{E}_{x \sim P_\theta}[\log(1 - D_\phi(x))] + \mathbb{E}_{x \sim P_{\text{data}}}[\log D_\phi(x)] \qquad (3)$$

where $\phi$ denotes the parameters of a neural network $D_\phi$. We refer the reader to (Nowozin, Cseke, and Tomioka 2016; Mescheder, Nowozin, and Geiger 2017b) for further details on other possible choices of divergences. Importantly, a Monte Carlo estimate of the objective in Eq. (2) requires only samples from the model.
Hence, any model that allows tractable sampling can be used to evaluate the following minimax objective:

$$\min_{\theta \in \mathcal{M}} \max_{\phi \in \mathcal{F}} \mathbb{E}_{x \sim P_\theta}[h_\phi(x)] - \mathbb{E}_{x \sim P_{\text{data}}}\left[h'_\phi(x)\right]. \qquad (4)$$

As a result, even differentiable implicit models, which do not provide a characterization of the model likelihood¹ but allow tractable sampling, can be learned adversarially by optimizing minimax objectives of the form given in Eq. (4).

¹ This could be either due to computational intractability in evaluating likelihoods or because the likelihood is ill-defined.

2.3 Adversarial learning of latent variable models

From a statistical perspective, maximum likelihood estimators are statistically efficient asymptotically (under some conditions), and hence minimizing the KL divergence is a natural objective for many prescribed models (Huber 1967). However, not all models allow for a well-defined, tractable, and easy-to-optimize likelihood.

For example, exact likelihood evaluation and sampling are tractable in directed, fully observed models such as Bayesian networks and autoregressive models (Larochelle and Murray 2011; Oord, Kalchbrenner, and Kavukcuoglu 2016). Hence, they are usually trained by maximum likelihood. Undirected models, on the other hand, provide only unnormalized likelihoods and are sampled from using expensive Markov chains. Hence, they are usually learned by approximating the likelihood using methods such as contrastive divergence (Carreira-Perpinan and Hinton 2005) and pseudolikelihood (Besag 1977). The likelihood is generally intractable to compute in latent variable models (even directed ones), as it requires marginalization. These models are typically learned by optimizing a stochastic lower bound to the log-likelihood using variational Bayes approaches (Kingma and Welling 2014).

Directed latent variable models allow for efficient ancestral sampling, and hence these models can also be trained using other divergences, e.g.
, adversarially (Mescheder, Nowozin, and Geiger 2017a; Mao et al. 2017; Song, Zhao, and Ermon 2017). A popular class of latent variable models learned adversarially consists of generative adversarial networks (GAN; (Goodfellow et al. 2014)). GANs comprise a pair of generator and discriminator networks. The generator $G_\theta : \mathbb{R}^k \to \mathbb{R}^d$ is a deterministic function differentiable with respect to the parameters $\theta$. The function takes as input a source of randomness $z \in \mathbb{R}^k$ sampled from a tractable prior density $p(z)$ and transforms it into a sample $G_\theta(z)$ through a forward pass. Evaluating likelihoods assigned by a GAN is challenging, because the model density $p_\theta$ is specified only implicitly via the prior density $p(z)$ and the generator function $G_\theta$. In fact, the likelihood for any data point is ill-defined (with respect to the Lebesgue measure over $\mathbb{R}^d$) if the prior distribution over $z$ is defined over a support smaller than the support of the data distribution.

GANs are typically learned adversarially with the help of a discriminator network. The discriminator $D_\phi : \mathbb{R}^d \to \mathbb{R}$ is another real-valued function that is differentiable with respect to a set of parameters $\phi$. Given the discriminator function, we can express the functions $h$ and $h'$ in Eq. (4) as compositions of $D_\phi$ with divergence-specific functions. For instance, the Wasserstein GAN (WGAN; (Arjovsky, Chintala, and Bottou 2017)) optimizes the following objective:

$$\min_\theta \max_{\phi \in \mathcal{F}} \mathbb{E}_{x \sim P_{\text{data}}}[D_\phi(x)] - \mathbb{E}_{z \sim P_z}[D_\phi(G_\theta(z))] \qquad (5)$$

where $\mathcal{F}$ is defined such that $D_\phi$ is 1-Lipschitz. Empirically, GANs generate excellent samples of natural images (Radford, Metz, and Chintala 2015), audio signals (Pascual, Bonafonte, and Serrà 2017), and of behaviors in imitation learning (Ho and Ermon 2016; Li, Song, and Ermon 2017).
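Because Eq. (5) involves only expectations under the data distribution and the prior, it can be estimated from samples alone. A minimal numpy sketch (the linear critic and affine generator here are illustrative assumptions, not the architectures used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    # Illustrative linear critic D_phi(x) = x . w; a real WGAN critic is a
    # neural network constrained to be 1-Lipschitz.
    return x @ w

def wgan_estimate(x_data, z, w, generator):
    # Monte Carlo estimate of Eq. (5): E_data[D(x)] - E_z[D(G(z))].
    return critic(x_data, w).mean() - critic(generator(z), w).mean()

# Toy generator G(z) = 2z + 1 matches data drawn from N(1, 4),
# so the estimated objective should be close to zero.
G = lambda z: 2.0 * z + 1.0
x_data = rng.normal(loc=1.0, scale=2.0, size=(2000, 1))
z = rng.normal(size=(2000, 1))
w = np.ones(1)
est = wgan_estimate(x_data, z, w, G)
```

Note that the estimate never touches a density, only samples — this is exactly why implicit models can be trained this way.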
3 Flow Generative Adversarial Networks

As discussed above, generative adversarial networks can tractably generate high-quality samples but have intractable or ill-defined likelihoods. Monte Carlo techniques such as AIS and non-parametric density estimation methods such as KDE get around this by assuming a Gaussian observation model $p_\theta(x \mid z)$ for the generator.² This assumption alone is not sufficient for quantitative evaluation, since the marginal likelihood of the observed data, $p_\theta(x) = \int p_\theta(x, z)\,\mathrm{d}z$, would in this case be intractable, as it requires integrating over all the latent factors of variation. This would then require approximate inference (e.g., Monte Carlo or variational methods), which is itself a computational challenge for high-dimensional distributions. To circumvent these issues, we propose flow generative adversarial networks (Flow-GANs).

² The true observation model for a GAN is a Dirac delta distribution, i.e., $p_\theta(x \mid z)$ is infinite when $x = G_\theta(z)$ and zero otherwise.

A Flow-GAN consists of a pair of generator-discriminator networks, with the generator specified as a normalizing flow model (Dinh, Krueger, and Bengio 2014). A normalizing flow model specifies a parametric transformation from a prior density $p(z) : \mathbb{R}^d \to \mathbb{R}^+_0$ to another density over the same space, $p_\theta(x) : \mathbb{R}^d \to \mathbb{R}^+_0$, where $\mathbb{R}^+_0$ is the set of non-negative reals. The generator transformation $G_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is invertible, such that there exists an inverse function $f_\theta = G_\theta^{-1}$. Using the change-of-variables formula and letting $z = f_\theta(x)$, we have:

$$p_\theta(x) = p(z)\left|\det \frac{\partial f_\theta(x)}{\partial x}\right| \qquad (6)$$

where $\frac{\partial f_\theta(x)}{\partial x}$ denotes the Jacobian of $f_\theta$ at $x$. The above formula can be applied recursively over compositions of many invertible transformations to produce a complex final density.
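Eq. (6) can be checked numerically on a toy invertible map. In the sketch below (an illustrative assumption; the paper's generators are NICE/Real-NVP networks), the generator is the scalar affine map $G(z) = Az + B$ with a standard normal prior, so the change-of-variables density must agree with the analytic $\mathcal{N}(B, A^2)$ density:

```python
import numpy as np

A, B = 2.0, 1.0  # hypothetical invertible generator G(z) = A*z + B

def log_prior(z):
    # log density of the standard normal prior p(z)
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_p_theta(x):
    # Eq. (6): log p_theta(x) = log p(f(x)) + log|det df/dx|,
    # with f(x) = G^{-1}(x) = (x - B)/A and df/dx = 1/A
    z = (x - B) / A
    return log_prior(z) + np.log(1.0 / abs(A))

def log_gaussian(x, mu, sigma):
    # analytic log density of N(mu, sigma^2), for comparison
    return -0.5 * ((x - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

x = 0.3
assert np.isclose(log_p_theta(x), log_gaussian(x, B, A))
```

The same recipe — inverse pass plus log-determinant — is what makes exact likelihoods tractable for the multi-layer flows below.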
Hence, we can evaluate and optimize the log-likelihood assigned by the model to a data point, as long as the prior density is tractable and the determinant of the Jacobian of $f_\theta$ evaluated at $x$ can be efficiently computed.

Evaluating the likelihood assigned by a Flow-GAN model via Eq. (6) requires overcoming two major challenges. First, requiring the generator function $G_\theta$ to be invertible imposes the constraint that the dimensionality of the latent variable $z$ match that of the data $x$; moreover, the transformations between the various layers of the generator must each be invertible, so that their overall composition results in an invertible $G_\theta$. Secondly, the Jacobian of a high-dimensional transformation can be computationally expensive to compute. If the transformations are designed such that the Jacobian is an upper or lower triangular matrix, then the determinant can be evaluated simply as the product of its diagonal entries. We consider two such families of transformations:

1. Volume preserving transformations. Here, the Jacobians of the transformations have a unit determinant. For example, the NICE model consists of several layers performing a location transformation (Dinh, Krueger, and Bengio 2014). The top layer is a diagonal scaling matrix with a non-zero log determinant.

2. Non-volume preserving transformations. The determinant of the Jacobian of the transformations is not necessarily unity. For example, in Real-NVP, layers perform both location and scale transformations (Dinh, Sohl-Dickstein, and Bengio 2017).

For brevity, we direct the reader to Dinh, Krueger, and Bengio (2014) and Dinh, Sohl-Dickstein, and Bengio (2017) for the specifications of NICE and Real-NVP respectively. Crucially, both volume preserving and non-volume preserving transformations are invertible, such that the determinant of the Jacobian can be computed tractably.
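As a concrete sketch of the first family, an additive (NICE-style) coupling layer shifts one partition of the input by a function of the other partition; its Jacobian is triangular with unit diagonal, so it is volume preserving and trivially invertible. The tanh coupling function below is an illustrative stand-in for the learned neural networks used in the actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))  # parameters of the illustrative coupling function

def m(x1):
    # stand-in for the learned coupling network
    return np.tanh(x1 @ W)

def forward(x):
    x1, x2 = x[:, :2], x[:, 2:]
    y1 = x1              # identity on the first partition
    y2 = x2 + m(x1)      # shift the second partition by m(x1)
    return np.concatenate([y1, y2], axis=1)

def inverse(y):
    y1, y2 = y[:, :2], y[:, 2:]
    return np.concatenate([y1, y2 - m(y1)], axis=1)

# The Jacobian is lower triangular with unit diagonal, so log|det| = 0
# and the layer inverts exactly.
x = rng.normal(size=(4, 4))
assert np.allclose(inverse(forward(x)), x)
```

Stacking such layers (alternating which partition is shifted) yields an expressive yet exactly invertible generator.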
3.1 Learning objectives

In a Flow-GAN, the likelihood is well-defined and computationally tractable for exact evaluation, even for expressive volume preserving and non-volume preserving transformations. Hence, a Flow-GAN can be trained via maximum likelihood estimation using Eq. (1), in which case the discriminator is redundant. Additionally, we can perform ancestral sampling just like a regular GAN, whereby we sample a random vector $z \sim P_z$ and transform it into a model-generated sample via $G_\theta = f_\theta^{-1}$. This makes it possible to learn a Flow-GAN using an adversarial learning objective (for example, the WGAN objective in Eq. (5)).

Figure 1: Samples generated by Flow-GAN models with different objectives — (a) MLE, (b) ADV, (c) Hybrid — for MNIST (top) and CIFAR-10 (bottom).

A natural question to ask is why one should use adversarial learning, given that MLE is statistically efficient asymptotically (under some conditions). Besides difficulties that could arise due to optimization (in both MLE and adversarial learning), the optimality of MLE holds only when there is no model misspecification for the generator, i.e., the true data distribution $P_{\text{data}}$ is a member of the parametric family of distributions under consideration (White 1982). This is generally not the case for high-dimensional distributions, and hence the choice of the learning objective becomes largely an empirical question. Unlike other models, a Flow-GAN allows both maximum likelihood and adversarial learning, and hence we can investigate this question experimentally.

3.2 Evaluation metrics and experimental setup

Our criteria for evaluation are based on held-out log-likelihoods and sample quality metrics. We focus on natural images, since they allow visual inspection as well as quantification using recently proposed metrics. A "good" generative model should generalize to images outside the training data and assign high log-likelihoods to held-out data.
The Inception and MODE scores are standard quantitative measures of the quality of generated samples of natural images for labelled datasets (Salimans et al. 2016; Che et al. 2017). The Inception score is computed as:

$$\exp\left(\mathbb{E}_{x \sim P_\theta}\left[KL(p(y \mid x) \,\|\, p(y))\right]\right)$$

where $x$ is a sample generated by the model, $p(y \mid x)$ is the softmax probability over the labels $y$ assigned by a pretrained classifier for $x$, and $p(y)$ is the overall distribution of labels in the generated samples (as predicted by the pretrained classifier). The intuition is that the conditional distribution $p(y \mid x)$ should have low entropy for good-looking images, while the marginal distribution $p(y)$ should have high entropy to ensure sample diversity. Hence, a generative model performs well on this metric if the KL divergence between the two distributions (and consequently the Inception score for the generated samples) is large. The MODE score given below modifies the Inception score to take into account the distribution of labels in the training data, $p^*(y)$:

$$\exp\left(\mathbb{E}_{x \sim P_\theta}\left[KL(p(y \mid x) \,\|\, p^*(y))\right] - KL(p^*(y) \,\|\, p(y))\right).$$

We compare learning of Flow-GANs using MLE and adversarial learning (ADV) on the MNIST dataset of handwritten digits (LeCun, Cortes, and Burges 2010) and the CIFAR-10 dataset of natural images (Krizhevsky and Hinton 2009). The normalizing flow generator architectures are chosen to be NICE (Dinh, Krueger, and Bengio 2014) and Real-NVP (Dinh, Sohl-Dickstein, and Bengio 2017) for MNIST and CIFAR-10 respectively. We fix the Wasserstein distance as the choice of divergence optimized by ADV (see Eq. (5)), with the Lipschitz constraint over the critic imposed by penalizing the norm of the gradient with respect to the input (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017). The discriminator is based on the DCGAN architecture (Radford, Metz, and Chintala 2015).
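The Inception score above can be computed directly from the classifier's softmax outputs; a minimal numpy sketch (using illustrative probability vectors in place of a real pretrained classifier):

```python
import numpy as np

def inception_score(probs):
    # probs: (n, num_classes) array of softmax outputs p(y|x)
    p_y = probs.mean(axis=0)  # marginal label distribution p(y)
    # KL(p(y|x) || p(y)) per sample, then exp of the mean
    kl = np.sum(probs * (np.log(probs) - np.log(p_y)), axis=1)
    return np.exp(kl.mean())

# Confident and diverse predictions score high ...
diverse = np.full((4, 4), 0.01)
np.fill_diagonal(diverse, 0.97)
# ... while uninformative predictions give the minimum score of 1.
uniform = np.full((4, 4), 0.25)

assert inception_score(diverse) > inception_score(uniform)
assert np.isclose(inception_score(uniform), 1.0)
```

The MODE score follows the same pattern with the training label distribution $p^*(y)$ substituted into the first KL term and a correction term subtracted.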
The above choices are among the current state-of-the-art in maximum likelihood estimation and adversarial learning and greatly stabilize GAN training. Further experimental setup details are provided in Appendix A. The code for reproducing the results is available at https://github.com/ermongroup/flow-gan.

Figure 2: Learning curves for negative log-likelihood (NLL) evaluation on MNIST (top, in nats) and CIFAR (bottom, in bits/dim) for (a) MLE and (b) ADV, showing train NLL, validation NLL, and (for ADV) the WGAN loss against generator iterations. Lower NLLs are better.

3.3 Evaluation results

Log-likelihood. The log-likelihood learning curves for Flow-GAN models learned using MLE and ADV are shown in Figure 2a and Figure 2b respectively. Following convention, we report the negative log-likelihoods (NLL) in nats for MNIST and in bits/dimension for CIFAR-10.

MLE. In Figure 2a, we see that normalizing flow models attain low validation NLLs (blue curves) after few gradient updates, as expected, because they explicitly optimize for the MLE objective in Eq. (1). Continued training, however, could lead to overfitting, as the train NLLs (red curves) begin to diverge from the validation NLLs.

ADV.
Surprisingly, ADV models show a consistent increase in validation NLLs as training progresses, as shown in Figure 2b (for CIFAR-10, the estimates are reported on a log scale). Based on the learning curves, we can disregard overfitting as an explanation, since the increase in NLLs is observed even on the training data. The training and validation NLLs closely track each other, suggesting that ADV models are not simply memorizing the training data.

Comparing the left vs. right panels in Figure 2, we see that the log-likelihoods attained by ADV are orders of magnitude worse than those attained by MLE after sufficient training. Finally, we note that the WGAN loss (green curves) does not correlate well with the NLL estimates. While the WGAN loss stabilizes after a few iterations of training, the NLLs continue to increase. This observation is in contrast to prior work showing the loss to be strongly correlated with sample quality metrics (Arjovsky, Chintala, and Bottou 2017).

Sample quality. Samples generated from the MLE- and ADV-based models with the best MODE/Inception scores are shown in Figure 1a and Figure 1b respectively. ADV models significantly outperform MLE with respect to the final MODE/Inception scores achieved. Visual inspection of the samples confirms the observations made on the basis of the sample quality metrics. Curves monitoring the sample quality metrics at every training iteration are given in Appendix B.

3.4 Gaussian mixture models

The above experiments suggest that ADV can produce excellent samples but assigns low likelihoods to the observed data. However, a direct comparison of ADV with the log-likelihoods of MLE is unfair, since the latter explicitly optimizes for the desired objective. To highlight that generating good samples at the expense of low likelihoods is not a challenging goal, we propose a simple baseline.
We compare the adversarially learned Flow-GAN model that achieves the highest MODE/Inception score against a Gaussian mixture model (GMM) consisting of $m$ isotropic Gaussians with equal weights, centered at each of the $m$ training points. The bandwidth hyperparameter, $\sigma$, is the same for each of the mixture components and is optimized for the lowest validation NLL by a line search in $(0, 1]$. We show results for CIFAR-10 in Figure 3. Our observations below hold for MNIST as well; those results are deferred to Appendix C.

We overload the $y$-axis in Figure 3 to report both NLLs and sample quality metrics. The horizontal maroon and cyan dashed lines denote the best attainable MODE/Inception scores and the corresponding validation NLLs, respectively, attained by the adversarially learned Flow-GAN model. The GMM can clearly attain better sample quality metrics, since it explicitly overfits to the training data for low values of the bandwidth parameter (any $\sigma$ for which the red curve is above the maroon dashed line). Surprisingly, the simple GMM also outperforms the adversarially learned model with respect to the NLLs attained, for several values of the bandwidth parameter (any $\sigma$ for which the blue curve is below the cyan dashed line). Bandwidth parameters for which GMM models outperform the adversarially learned model on both log-likelihoods and sample quality metrics are highlighted by the green shaded area. We show samples from the GMM in the appendix. Hence, a trivial baseline that is memorizing the training data can generate high quality samples and attain better held-out log-likelihoods, suggesting that the log-likelihoods attained by adversarial training are very poor.
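The baseline is straightforward to implement: place one isotropic Gaussian per training point and line-search the shared bandwidth $\sigma$ on held-out NLL. A sketch on synthetic data (the actual baseline operates on CIFAR-10 images):

```python
import numpy as np
from scipy.special import logsumexp

def gmm_nll(x_eval, x_train, sigma):
    # Mean NLL of x_eval under a mixture of len(x_train) isotropic
    # Gaussians N(x_i, sigma^2 I) with equal weights.
    d = x_train.shape[1]
    sq = ((x_eval[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    log_comp = -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_p = logsumexp(log_comp, axis=1) - np.log(len(x_train))
    return -log_p.mean()

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 3))
val = rng.normal(size=(50, 3))

# line search for the bandwidth over (0, 1]
sigmas = np.linspace(0.05, 1.0, 20)
best_sigma = min(sigmas, key=lambda s: gmm_nll(val, train, s))
```

Sampling from this baseline is equally trivial: pick a training point uniformly at random and add isotropic Gaussian noise of scale $\sigma$.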
Figure 3: Gaussian mixture models outperform adversarially learned models on both held-out log-likelihoods (validation NLL, in bits/dim) and sampling metrics (Inception scores) on CIFAR-10 across a range of bandwidths (green shaded region). Dashed lines mark the NLL and best Inception score of the ADV model.

4 Hybrid learning of Flow-GANs

In the previous section, we observed that adversarially learned Flow-GAN models attain poor held-out log-likelihoods. This makes it challenging to use such models for applications requiring density estimation. On the other hand, Flow-GANs learned using MLE are "mode covering" but do not generate high quality samples. With a Flow-GAN, it is possible to trade off the two goals by combining the learning objectives corresponding to both these inductive principles.

Without loss of generality, let $V(G_\theta, D_\phi)$ denote the minimax objective of any GAN model (such as WGAN). The hybrid objective of a Flow-GAN can be expressed as:

$$\min_\theta \max_\phi V(G_\theta, D_\phi) - \lambda\, \mathbb{E}_{x \sim P_{\text{data}}}[\log p_\theta(x)] \qquad (7)$$

where $\lambda \geq 0$ is a hyperparameter for the algorithm. By varying $\lambda$, we can interpolate between plain adversarial training ($\lambda = 0$) and MLE (very high $\lambda$).

We summarize the results from MLE, ADV, and Hybrid for log-likelihood and sample quality evaluation in Table 1 and Table 2 for MNIST and CIFAR-10 respectively. The tables report the test log-likelihoods corresponding to the best validated MLE and ADV models and the highest MODE/Inception scores observed during training. The samples generated by the models with the best MODE/Inception scores for each objective are shown in Figure 1c.
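For a given minibatch, the generator side of Eq. (7) simply adds a scaled NLL penalty to the adversarial loss; a minimal sketch (the arrays stand in for critic outputs on generated samples and Flow-GAN log-likelihoods of real data):

```python
import numpy as np

def hybrid_generator_loss(critic_fake, log_p_real, lam):
    # Eq. (7), generator side: the WGAN generator minimizes -E_z[D(G(z))],
    # and the hybrid objective subtracts lam * E_data[log p_theta(x)],
    # i.e. adds lam times the mean NLL of the real data.
    adv_loss = -critic_fake.mean()
    nll = -log_p_real.mean()
    return adv_loss + lam * nll

critic_fake = np.array([0.5, -0.2, 0.1])   # D_phi(G_theta(z)) on a minibatch
log_p_real = np.array([-3.0, -2.5, -3.5])  # exact log p_theta(x) via Eq. (6)

# lam = 0 recovers plain adversarial training
assert np.isclose(hybrid_generator_loss(critic_fake, log_p_real, 0.0),
                  -critic_fake.mean())
# larger lam weights the maximum-likelihood term more heavily
assert hybrid_generator_loss(critic_fake, log_p_real, 1.0) > \
       hybrid_generator_loss(critic_fake, log_p_real, 0.1)
```

The discriminator update is unchanged from the underlying GAN, since the likelihood term does not depend on $\phi$.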
While the results on CIFAR-10 are along expected lines, the hybrid objective interestingly outperforms both MLE and ADV on test log-likelihoods as well as sample quality metrics in the case of MNIST. One potential explanation is that the ADV objective can regularize MLE to generalize to the test set and, in turn, the MLE objective can stabilize the optimization of the ADV objective. Hence, the hybrid objective in Eq. (7) can smoothly balance the two objectives using the tunable hyperparameter $\lambda$, and in some cases, such as MNIST, the performance on both tasks can improve as a result.

Table 1: Best MODE scores and test negative log-likelihood estimates for Flow-GAN models on MNIST.

Objective          MODE Score   Test NLL (in nats)
MLE                7.42         -3334.56
ADV                9.24         -1604.09
Hybrid (λ = 0.1)   9.37         -3342.95

Table 2: Best Inception scores and test negative log-likelihood estimates for Flow-GAN models on CIFAR-10.

Objective          Inception Score   Test NLL (in bits/dim)
MLE                2.92              3.54
ADV                5.76              8.53
Hybrid (λ = 1)     3.90              4.21

5 Interpreting the results

Our findings are in contrast with prior work, which reports much better log-likelihoods for adversarially learned models with a standard generator architecture, based on annealed importance sampling (AIS; (Wu et al. 2017)) and kernel density estimation (KDE; (Goodfellow et al. 2014)). These methods rely on approximate inference techniques for log-likelihood evaluation and make assumptions about a Gaussian observation model, which does not hold for GANs. Since Flow-GANs allow us to compute exact log-likelihoods, we can evaluate the quality of the approximations made by AIS and KDE for density estimation of invertible generators. For a detailed description of the methods, we refer the reader to prior work (Neal 2001; Parzen 1962).

We consider the MNIST dataset, where these methods have been previously applied by Wu et al. (2017) and Goodfellow et al.
(2014) respecti vely . Since both AIS and KDE inherently rely on the samples generated, we ev al- uate these methods for the MLE, AD V , and Hybrid Flow- GAN model checkpoints corresponding to the best MODE scores observed during training. In T able 3, we observe that both AIS and KDE produce estimates of log-likelihood that are far from the ground truth, accessible through the ex- act Flow-GAN log-likelihoods. Even worse, the ranking of log-likelihood estimates for AIS (AD V > Hybrid > MLE) and KDE (Hybrid > MLE > AD V) do not obey the relativ e rank- ings of the Flow-GAN estimates (MLE > Hybrid > AD V). 5.1 Explaining log-likelihood trends In order to explain the v ariation in log-lik elihoods attained by v arious Flow-GAN learning objectives, we in vestigate the distribution of the magnitudes of singular v alues for the Jacobian matrix of sev eral generator functions, G θ for MNIST in Figure 4 e v aluated at 64 noise v ectors z randomly sampled from the prior density p ( z ) . The x -axis of the figure shows the singular value magnitudes on a log scale and for each singular v alue s , we show the corresponding cumula- tiv e distrib ution function v alue on the y -axis which signifies the fraction of singular values less than s . The results on CIF AR-10 in Appendix D show a similar trend. The Jacobian is a good first-order approximation of the generator function locally . In Figure 4, we observe that the T able 3: Comparison of inference techniques for negati ve log-likelihood estimation of Flo w-GAN models on MNIST . Objectiv e Flo w-GAN NLL AIS KDE MLE -3287.69 -2584.40 -167.10 AD V 26350.30 -2916.10 -3.03 Hybrid -3121.53 -2703.03 -205.69 singular value distribution for the Jacobian of an in vertible generator learned using MLE (orange curves) is concen- trated in a narrow range, and hence the Jacobian matrix is well-conditioned and easy to in vert. 
In the case of invertible generators learned using ADV with the Wasserstein distance (green curves), however, the spread of singular values is very wide, and hence the Jacobian matrix is ill-conditioned. The average log determinants of the Jacobian matrices for the MLE, ADV, and Hybrid models are $-4170.34$, $-15588.34$, and $-5184.40$ respectively, which translates to the trend ADV < Hybrid < MLE. This indicates that the ADV models are trying to squish a sphere of unit volume centered at a latent vector $z$ to a very small volume in the observed space $x$. Tiny perturbations of training as well as held-out datapoints can hence manifest as poor log-likelihoods. In spite of not being limited in representational capacity to cover the entire space of the data distribution (the dimensions of $z$ (i.e., $k$) and $x$ (i.e., $d$) match for invertible generators), ADV prefers to learn a distribution over a smaller support.

The Hybrid learning objective (blue curves), however, is able to correct for this behavior, and its distribution of singular value magnitudes matches closely that of MLE. We also considered variations involving the standard DCGAN architectures with $k = d$, minimizing the Wasserstein distance (red curves) and the Jensen-Shannon divergence (purple curves). The relative shift of the distribution of singular value magnitudes toward lower values is apparent even in these cases.

6 Discussion

Any model which allows for efficient likelihood evaluation and sampling can be trained using maximum likelihood and adversarial learning. This line of reasoning has been explored to some extent in prior work that combines the objectives of prescribed latent variable models such as VAEs (maximizing an evidence lower bound on the data) with adversarial learning (Larsen et al. 2015; Mescheder, Nowozin, and Geiger 2017a; Srivastava et al. 2017).
However, the benefits of such procedures do not come for "free", since we still need some form of approximate inference to get a handle on the log-likelihoods. This could be expensive; for instance, combining a VAE with a GAN introduces an additional inference network that increases the overall model complexity.

Our approach sidesteps the additional complexity due to approximate inference by considering a normalizing flow model. The trade-off made by a normalizing flow model is that the generator function needs to be invertible, while other generative models such as VAEs have no such requirement. On the positive side, we can tractably evaluate exact log-likelihoods assigned by the model for any data point. Normalizing flow models have been previously used

Figure 4: Cumulative distribution of the magnitudes of the singular values $s$ of the generator Jacobian on MNIST.
