Multi-Agent Diverse Generative Adversarial Networks

We propose MAD-GAN, an intuitive generalization of Generative Adversarial Networks (GANs) and their conditional variants to address the well-known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple gen…

Authors: Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri

Multi-Agent Diverse Generative Adversarial Networks
Arnab Ghosh* (University of Oxford, UK) arnabg@robots.ox.ac.uk
Viveka Kulharia* (University of Oxford, UK) viveka@robots.ox.ac.uk
Vinay Namboodiri (IIT Kanpur, India) vinaypn@iitk.ac.in
Philip H.S. Torr (University of Oxford, UK) philip.torr@eng.ox.ac.uk
Puneet K. Dokania (University of Oxford, UK) puneet@robots.ox.ac.uk

Abstract

We propose MAD-GAN, an intuitive generalization of Generative Adversarial Networks (GANs) and their conditional variants to address the well-known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple generators and one discriminator. Second, to enforce that different generators capture diverse high-probability modes, the discriminator of MAD-GAN is designed such that, along with distinguishing real from fake samples, it is also required to identify the generator that generated a given fake sample. Intuitively, to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. We perform extensive experiments on synthetic and real datasets and compare MAD-GAN with different variants of GAN. We show high-quality diverse sample generations for challenging tasks such as image-to-image translation and face generation. In addition, we show that MAD-GAN is able to disentangle different modalities when trained on a highly challenging diverse-class dataset (e.g., a dataset with images of forests, icebergs, and bedrooms). Finally, we show its efficacy on the unsupervised feature representation task. In the Appendix, we introduce a similarity-based competing objective (MAD-GAN-Sim) which encourages different generators to generate diverse samples based on a user-defined similarity metric. We show its performance on image-to-image translation, and also show its effectiveness on the unsupervised feature representation task.

1.
Introduction

*Joint first authors. This is an updated version of our CVPR'18 paper with the same title. In this version, we also introduce MAD-GAN-Sim in Appendix B.

Figure 1: Diverse-class data generation using MAD-GAN. A diverse-class dataset contains images from different classes/modalities (in this case, forests, icebergs, and bedrooms). Each row represents generations by a particular generator and each column represents generations for a given random noise input z. As shown, once trained on this dataset, the generators of MAD-GAN are able to disentangle the different modalities; hence, each generator generates images from a particular modality.

Generative models have attracted considerable attention recently. The underlying idea behind such models is to attempt to capture the distribution of high-dimensional data such as images and texts. Though these models are highly useful in various applications, it is computationally expensive to train them as they require intractable integration in a very high-dimensional space. This drastically limits their applicability. However, recently there has been considerable progress in deep generative models – conglomerates of deep neural networks and generative models – as they do not explicitly require the intractable integration and can be efficiently trained using the back-propagation algorithm. Two such famous examples are Generative Adversarial Networks (GANs) [13] and Variational Autoencoders [17]. In this paper we focus on GANs as they are known to produce sharp and plausible images.

Briefly, GANs employ a generator and a discriminator, where both are involved in a minimax game. The task of the discriminator is to learn the difference between real samples (from the true data distribution p_d) and fake samples (from the generator distribution p_g), whereas the task of the generator is to maximize the mistakes of the discriminator.
At convergence, the generator learns to produce real-looking images. A few successful applications of GANs are video generation [30], image inpainting [25], image manipulation [33], 3D object generation [31], interactive image generation using few brush strokes [33], image super-resolution [20], diagrammatic abstract reasoning [18], and conditional GANs [23, 27].

Despite the remarkable success of GANs, they suffer from the major problem of mode collapse [2, 7, 8, 22, 28]. Theoretically, convergence guarantees that the generator learns the true data distribution; in practice, however, reaching the true equilibrium is difficult and not guaranteed, which potentially leads to the aforementioned problem of mode collapse. Broadly speaking, there are two schools of thought for addressing this issue: (1) improving the learning of GANs to reach better optima [2, 22, 28]; and (2) explicitly enforcing GANs to capture diverse modes [7, 8, 21]. Here we focus on the latter.

Borrowing from the multi-agent algorithm [1] and coupled GAN [21], we propose to use multiple generators with one discriminator. We call this framework the Multi-Agent GAN architecture, as shown in Fig. 2. In detail, similar to the standard GAN, the objective of each generator here is to maximize the mistakes of the common discriminator. Depending on the task, it might be useful for different generators to share information. This is done by tying the initial layer parameters of the generators. Another reason for sharing these parameters is that the initial layers capture low-frequency structures, which are almost the same for a particular type of dataset (for example, faces); therefore, sharing them reduces redundant computation. However, when the dataset contains images from completely different modalities, one can avoid sharing these parameters.
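The parameter-sharing scheme described above can be sketched as follows. This is a minimal numpy illustration of the idea only, not the paper's actual networks: the layer sizes, the single shared matrix, and all names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One weight matrix tied across all generators (the shared "initial layers"),
# plus a private head per generator (the layers that are not shared).
shared_W = rng.normal(size=(16, 8))                   # tied parameters
heads = [rng.normal(size=(8, 4)) for _ in range(3)]   # k = 3 generators

def generate(z, head_W):
    h = np.tanh(z @ shared_W)    # common low-frequency features
    return np.tanh(h @ head_W)   # generator-specific output

z = rng.normal(size=(5, 16))                 # same latent input for all
samples = [generate(z, W) for W in heads]    # k diverse outputs
```

During training, gradients with respect to `shared_W` would accumulate from all generators, while each head is updated only by its own generator; for diverse-class data one would simply give each generator its own copy of the first layer as well.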
Naively using multiple generators may lead to the trivial solution where all the generators learn to generate similar samples. To resolve this issue and generate different, visually plausible samples capturing diverse high-probability modes, we propose to modify the objective function of the discriminator. In the modified objective, along with finding the real and the fake samples, the discriminator also has to correctly identify the generator that generated a given fake sample. Intuitively, in order to succeed in this task, the discriminator must learn to push generations corresponding to different generators towards different identifiable modes. Combining the Multi-Agent GAN architecture with the diversity-enforcing term allows us to generate diverse plausible samples, hence the name Multi-Agent Diverse GAN (MAD-GAN).

As an example, an intuitive setting where mode collapse occurs is when a GAN is trained on a dataset containing images from different modalities/classes, for example, a diverse-class dataset containing images of forests, icebergs, and bedrooms. This is of particular interest as it not only requires the model to disentangle intra-class variations, but also requires inter-class disentanglement. Fig. 1 demonstrates the surprising effectiveness of MAD-GAN in this challenging setting. The generators among themselves are able to disentangle inter-class variations, and each generator is also able to capture intra-class variations.

In addition, we analyze MAD-GAN through extensive experiments and compare it with several variants of GAN. First, as a proof of concept, we perform experiments in controlled settings using a synthetic dataset (mixture of Gaussians) and the complicated Stacked/Compositional MNIST datasets with hand-engineered modes.
In these settings, we empirically show that our approach outperforms all other GAN variants we compare with, and is able to generate high-quality samples while capturing a large number of modes. In a more realistic setting, we show high-quality diverse sample generations for the challenging tasks of image-to-image translation [14] (conditional GAN) and face generation [8, 26]. Using the SVHN dataset [24], we also show the efficacy of our framework for learning feature representations in an unsupervised setting.

We also provide a theoretical analysis of this approach and show that the proposed modification in the objective of the discriminator allows the generators to learn together as a mixture model, where each generator represents a mixture component. We show that at convergence, the global optimum value of -(k+1) log(k+1) + k log k is achieved, where k is the number of generators.

Figure 2: Multi-Agent Diverse GAN (MAD-GAN). The discriminator outputs k+1 softmax scores signifying the probability of its input sample being from either one of the k generators or the real distribution.

2. Related Work

The recent work InfoGAN [8] proposed an information-theoretic extension to GANs in order to address the problem of mode collapse. Briefly, InfoGAN disentangles the latent representation by assuming a factored representation of the latent variables. In order to enforce that the generator learns factor-specific generations, InfoGAN maximizes the mutual information between the factored latents and the generator distribution. Che et al. [7] proposed a mode-regularized GAN (ModeGAN) which uses an encoder-decoder paradigm. The basic idea behind ModeGAN is that if a sample from the true data distribution p_d belongs to a particular mode, then the fake sample generated by the generator when the true sample is passed through the encoder-decoder is likely to belong to the same mode.
ModeGAN assumes that there exist enough true samples from a mode for the generator to be able to capture it. Another work, by Metz et al. [22], proposed a surrogate objective for the update of the generator with respect to the unrolled optimization of the discriminator (UnrolledGAN) to address the issue of convergence of the GAN training process. This improves the training of the generator, which in turn allows the generator to achieve better coverage of the true data distribution. Liu et al. [21] presented Coupled GAN, a method for training two generators with shared parameters to learn the joint distribution of the data. The shared parameters guide both generators towards similar subspaces, but since they are trained independently on two domains, they promote diverse generations. Durugkar et al. [10] proposed a model with multiple discriminators, whereby an ensemble of multiple discriminators has been shown to stabilize the training of the generator by guiding it to produce better samples. WGAN [3] is a recent technique which employs integral probability metrics based on the earth mover's distance rather than the JS-divergence that the original GAN uses. BEGAN [5] builds upon WGAN using an autoencoder-based equilibrium-enforcing technique alongside the Wasserstein distance. DCGAN [26] was a seminal technique which, for the first time, used a fully convolutional generator and discriminator along with batch normalization, thus stabilizing the training procedure, and was able to produce compelling generations. GoGAN [16] introduced a training procedure for the discriminator using a maximum-margin formulation alongside the earth mover's distance based on the Wasserstein-1 metric. [4] introduced a technique and theoretical formulation stating the importance of multiple generators and discriminators in order to completely model the data distribution.
In terms of employing multiple generators, our work is closest to [4, 21, 11]. However, while using multiple generators, our method explicitly enforces them to capture diverse modes.

3. Preliminaries

Here we present a brief review of GANs [13]. Given a set of samples D = (x_i)_{i=1}^n from the true data distribution p_d, the GAN learning problem is to obtain the optimal parameters θ_g of a generator G(z; θ_g) that can sample from an approximate data distribution p_g, where z ∼ p_z is the prior input noise (e.g., samples from a normal distribution). In order to learn the optimal θ_g, the GAN objective (Eq. (1)) employs a discriminator D(x; θ_d) that learns to differentiate between a real (from p_d) and a fake (from p_g) sample x. The overall GAN objective is:

min_{θ_g} max_{θ_d} V(θ_d, θ_g) := E_{x∼p_d} [log D(x; θ_d)] + E_{z∼p_z} [log(1 − D(G(z; θ_g); θ_d))]    (1)

The above objective is optimized in a block-wise manner, where θ_d and θ_g are optimized one at a time while fixing the other. For a given sample x (either from p_d or p_g) and parameters θ_d, the function D(x; θ_d) ∈ [0, 1] produces a score that represents the probability of x belonging to the true data distribution p_d (i.e., the probability of it being real). The objective of the discriminator is to learn parameters θ_d that maximize this score for the true samples (from p_d) while minimizing it for the fake ones x̃ = G(z; θ_g) (from p_g). In the case of the generator, the objective is to minimize E_{z∼p_z} log(1 − D(G(z; θ_g); θ_d)), or equivalently to maximize E_{z∼p_z} log D(G(z; θ_g); θ_d). Thus, the generator learns to maximize the scores for the fake samples (from p_g), which is exactly the opposite of what the discriminator is trying to achieve. In this manner, the generator and the discriminator are involved in a minimax game where the task of the generator is to maximize the mistakes of the discriminator.
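As a quick sanity check of Eq. (1): for fixed densities, the pointwise optimal discriminator is D*(x) = p_d(x) / (p_d(x) + p_g(x)), and when p_g = p_d the objective evaluates to −log 4. A small numeric check on discrete stand-in "densities" (the four-point toy distribution is our own example):

```python
import math

# Toy discrete "densities" over 4 points.
p_d = [0.1, 0.4, 0.3, 0.2]
p_g = list(p_d)  # generator has matched the data distribution

# Pointwise optimal discriminator: D*(x) = p_d / (p_d + p_g) = 1/2 everywhere.
D = [pd / (pd + pg) for pd, pg in zip(p_d, p_g)]

# V(θ_d*, θ_g) = E_{p_d}[log D] + E_{p_g}[log(1 - D)]
V = sum(pd * math.log(d) for pd, d in zip(p_d, D)) \
  + sum(pg * math.log(1 - d) for pg, d in zip(p_g, D))

print(V)  # -log 4 ≈ -1.3863
```

This is the k = 1 case of the mixture result derived in Section 4: the optimum value −log 4 is exactly −(k+1) log(k+1) + k log k at k = 1.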
Theoretically, at equilibrium, the generator learns to generate real samples, which means p_g = p_d.

4. Multi-Agent Diverse GAN

In the GAN objective, one can argue that the task of the generator is much harder than that of the discriminator, as it has to produce real-looking images to maximize the mistakes of the discriminator. This, along with the minimax nature of the objective, raises several challenges for GANs [2, 7, 8, 22, 28]: (1) mode collapse; (2) difficult optimization; and (3) trivial solutions. In this work we propose a new framework to address the first challenge, mode collapse, by increasing the capacity of the generator, while using well-known tricks to partially avoid the other challenges [2]. Briefly, we propose a Multi-Agent GAN architecture that employs multiple generators and one discriminator in order to generate different samples from high-probability regions of the true data distribution. In addition, we show theoretically that our formulation allows the generators to act as a mixture model, with each generator capturing one component.

4.1. Multi-Agent GAN Architecture

Here we describe our proposed architecture (Fig. 2). It involves k generators and one discriminator. In the case of homogeneous data (all images belong to the same class, e.g., faces or birds), we allow all the generators to share information by tying most of the initial layer parameters. This is essential to avoid redundant computation, as the initial layers of a generator capture low-frequency structures which are almost the same for a particular type of dataset. This also allows the different generators to converge faster.

Figure 3: Visualization of different generators being pushed towards different modes. Here, M1 and M2 could each be a cluster of modes, where each cluster itself contains many modes. The arrows abstractly represent generator-specific gradients for the purpose of building intuition.

However, in the case of diverse-class data (e.g.,
a dataset with a mixture of different classes such as forests, icebergs, etc.), it is necessary to avoid sharing these parameters, so that each generator can capture content-specific structures. Thus, the extent to which one should share these parameters depends on the task at hand.

More specifically, given z ∼ p_z for the i-th generator, similar to the standard GAN, the first step involves generating a sample (for example, an image) x̃_i. Since each generator receives the same latent input sampled from the same distribution, naively using this simple approach may lead to the trivial solution where all the generators learn to generate similar samples. In what follows, we propose an intuitive solution to avoid this issue and allow the generators to capture diverse modes.

4.2. Enforcing Diverse Modes

Inspired by the discriminator formulation for semi-supervised learning [28], we use a generator-identification based objective function that, along with minimizing the score D(x̃; θ_d), requires the discriminator to identify the generator that generated the given fake sample x̃. To do so, as opposed to the standard GAN objective function where the discriminator outputs a scalar value, we modify it to output k+1 softmax scores. In more detail, given the set of k generators, the discriminator produces a softmax probability distribution over k+1 classes. The score at the (k+1)-th index, D_{k+1}(·), represents the probability that the sample belongs to the true data distribution, and the score at the j-th index, j ∈ {1, ..., k}, represents the probability of it being generated by the j-th generator. Under this setting, while learning θ_d, we optimize the cross-entropy between the softmax output of the discriminator and the Dirac delta distribution δ ∈ {0, 1}^{k+1}, where δ(j) = 1 if the sample belongs to the j-th generator (j ∈ {1, ..., k}), and δ(k+1) = 1 otherwise.
Thus, the objective of the discriminator, which optimizes θ_d while keeping θ_g constant (refer to Eq. (1)), is modified to:

max_{θ_d} E_{x∼p} H(δ, D(x; θ_d))

where Supp(p) = ∪_{i=1}^k Supp(p_{g_i}) ∪ Supp(p_d) and H(·, ·) is the negative of the cross-entropy function. Intuitively, in order to correctly identify the generator that produced a given fake sample, the discriminator must learn to push different generators towards different identifiable modes. The objective of each generator, however, remains the same as in the standard GAN. Thus, for the i-th generator, the objective is to minimize the following:

E_{x∼p_d} log D_{k+1}(x; θ_d) + E_{z∼p_z} log(1 − D_{k+1}(G_i(z; θ_g^i); θ_d))

To update the parameters, the gradient for each generator is simply computed as ∇_{θ_g^i} log(1 − D_{k+1}(G_i(z; θ_g^i); θ_d)). Notice that all the generators in this case can be updated in parallel. For the discriminator, given x ∼ p (real or fake) and the corresponding δ, the gradient is ∇_{θ_d} log D_j(x; θ_d), where D_j(x; θ_d) is the j-th index of D(x; θ_d) for which δ(j) = 1. Therefore, this approach requires only very minor modifications to the standard GAN optimization algorithm and can easily be used with different variants of GAN. An intuitive visualization is shown in Fig. 3.

Theorem 1 shows that the above objective function allows the generators to form a mixture model, where each generator represents a mixture component, and the global optimum of −(k+1) log(k+1) + k log k is achieved when p_d = (1/k) Σ_{i=1}^k p_{g_i}. Notice that for k = 1, which is the case with one generator, we obtain exactly the same Jensen-Shannon divergence based objective function as shown in [13], with the optimal value of −log 4.

Theorem 1.
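In implementation terms, the modified discriminator update is just a standard (k+1)-class cross-entropy. A minimal numpy sketch (the logit values and the label layout, with index k standing for the "real" class, are our own illustrative choices):

```python
import numpy as np

k = 3  # number of generators

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_loss(logits, label):
    """Cross-entropy against the delta label: label in {0, ..., k-1} marks
    the generator that produced a fake sample; label == k marks a real one."""
    probs = softmax(logits)
    return -np.log(probs[label])

rng = np.random.default_rng(0)
logits = rng.normal(size=k + 1)                    # k+1 scores for one sample
loss_fake_from_g2 = discriminator_loss(logits, 1)  # fake, from generator 2
loss_real = discriminator_loss(logits, k)          # real sample
```

This is why the method drops into existing GAN code so easily: only the final layer of the discriminator (one output to k+1 softmax outputs) and its loss change, while each generator keeps the usual non-saturating/minimax update against D_{k+1}.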
Given the optimal discriminator, the objective for training the generators boils down to minimizing

KL( p_d(x) || p_avg(x) ) + k KL( (1/k) Σ_{i=1}^k p_{g_i}(x) || p_avg(x) ) − (k+1) log(k+1) + k log k    (2)

where p_avg(x) = ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ) / (k+1). The above objective function obtains its global minimum if p_d = (1/k) Σ_{i=1}^k p_{g_i}, with the objective value of −(k+1) log(k+1) + k log k.

Proof. The joint objective of all the generators is to minimize the following:

E_{x∼p_d} log D_{k+1}(x) + Σ_{i=1}^k E_{x∼p_{g_i}} log(1 − D_{k+1}(x))

Using Corollary 1, we substitute the optimal discriminator into the above equation and obtain:

E_{x∼p_d} log[ p_d(x) / ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ) ] + Σ_{i=1}^k E_{x∼p_{g_i}} log[ Σ_{i=1}^k p_{g_i}(x) / ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ) ]
= E_{x∼p_d} log[ p_d(x) / p_avg(x) ] + k E_{x∼p_g} log[ p_g(x) / p_avg(x) ] − (k+1) log(k+1) + k log k    (3)

where p_g = ( Σ_{i=1}^k p_{g_i} ) / k and p_avg(x) = ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ) / (k+1). Note that Eq. (3) is exactly the same as Eq. (2). When p_d = ( Σ_{i=1}^k p_{g_i} ) / k, both KL terms become zero and the global minimum is achieved.

Corollary 1. For fixed generators, the optimal distribution learned by the discriminator D has the following form:

D_{k+1}(x) = p_d(x) / ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ),
D_i(x) = p_{g_i}(x) / ( p_d(x) + Σ_{i=1}^k p_{g_i}(x) ),  ∀ i ∈ {1, ..., k}

where D_i(x) represents the i-th index of D(x; θ_d), p_d the true data distribution, and p_{g_i} the distribution learned by the i-th generator.

Proof. For fixed generators, the objective function of the discriminator is to maximize

E_{x∼p_d} log D_{k+1}(x) + Σ_{i=1}^k E_{x_i∼p_{g_i}} log D_i(x_i)

where Σ_{i=1}^{k+1} D_i(x) = 1 and D_i(x) ∈ [0, 1] ∀ i.
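Theorem 1 is easy to verify numerically on discrete distributions: substituting the optimal discriminator of Corollary 1 into the joint generator objective, with p_d set to the average of the generator distributions, should give exactly −(k+1) log(k+1) + k log k. A small check (the three toy generator distributions are our own):

```python
import math

k = 3
# Toy generator distributions over 3 points; p_d is taken as their average,
# i.e. the condition under which Theorem 1 predicts the global optimum.
p_g = [[0.6, 0.3, 0.1],
       [0.2, 0.5, 0.3],
       [0.1, 0.2, 0.7]]
p_d = [sum(p_g[i][x] for i in range(k)) / k for x in range(3)]

# Joint generator objective with the optimal discriminator substituted in:
# E_{p_d} log D_{k+1}(x) + sum_i E_{p_gi} log(1 - D_{k+1}(x))
value = 0.0
for x in range(3):
    s = sum(p_g[i][x] for i in range(k))
    d_real = p_d[x] / (p_d[x] + s)          # optimal D_{k+1}(x) = 1/(k+1) here
    value += p_d[x] * math.log(d_real)
    for i in range(k):
        value += p_g[i][x] * math.log(1 - d_real)

optimum = -(k + 1) * math.log(k + 1) + k * math.log(k)
print(value, optimum)  # both ≈ -2.2493
```

At the optimum D_{k+1}(x) collapses to 1/(k+1) everywhere, so the value is log(1/(k+1)) + k log(k/(k+1)) = −(k+1) log(k+1) + k log k, matching the theorem.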
The above equation can be written as:

∫_x p_d(x) log D_{k+1}(x) dx + Σ_{i=1}^k ∫_x p_{g_i}(x) log D_i(x) dx = ∫_{x ∈ Supp(p)} Σ_{i=1}^{k+1} p_i(x) log D_i(x) dx    (4)

where p_{k+1}(x) := p_d(x), p_i(x) := p_{g_i}(x) ∀ i ∈ {1, ..., k}, and Supp(p) = ∪_{i=1}^k Supp(p_{g_i}) ∪ Supp(p_d). Therefore, for a given x, the optimum of the objective function defined in Eq. (4), with the constraints defined above, can be obtained using Proposition 1.

Proposition 1. Given y = (y_1, ..., y_n), y_i ≥ 0, and a_i ∈ R, the optimal solution of the objective function defined below is achieved at y*_i = a_i / Σ_{j=1}^n a_j, ∀ i:

max_y Σ_{i=1}^n a_i log y_i,  s.t.  Σ_{i=1}^n y_i = 1

Proof. The Lagrangian of the above problem is:

L(y, λ) = Σ_{i=1}^n a_i log y_i + λ ( Σ_{i=1}^n y_i − 1 )

Differentiating w.r.t. y_i and λ, and equating to zero:

a_i / y_i + λ = 0,   Σ_{i=1}^n y_i − 1 = 0

Solving the above two equations, we obtain y*_i = a_i / Σ_{j=1}^n a_j.

5. Experiments

We present an extensive quantitative and qualitative analysis of MAD-GAN on various synthetic and real-world datasets. First, we use a simple 1D mixture of Gaussians and the Stacked/Compositional MNIST datasets (1000 modes) to compare MAD-GAN with several known variants of GANs, such as DCGAN [26], WGAN [3], BEGAN [5], GoGAN [16], Unrolled GAN [22], Mode-Reg GAN [7], and InfoGAN [8]. Furthermore, we created another baseline, called MA-GAN (Multi-Agent GAN), which is a trivial extension of GAN with multiple generators and one discriminator. As opposed to MAD-GAN, MA-GAN has a simple Multi-Agent architecture without modifications to the objective of the discriminator. This comparison allows us to understand the effect of explicitly enforcing diversity in the objective of MAD-GAN. We use the KL-divergence [19] and the number of modes recovered [7] as the criteria for comparison, and show superior results compared to all the other methods.
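Proposition 1's closed form can be verified directly: y*_i = a_i / Σ_j a_j maximizes Σ_i a_i log y_i over the probability simplex. A quick numeric check against random feasible points (the values of a are our own example):

```python
import math
import random

a = [2.0, 5.0, 3.0]                      # a_i >= 0
y_star = [ai / sum(a) for ai in a]       # closed-form optimum from Proposition 1

def objective(y):
    return sum(ai * math.log(yi) for ai, yi in zip(a, y))

best = objective(y_star)

# No random point on the simplex should beat the closed form.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in a]
    y = [r / sum(raw) for r in raw]
    assert objective(y) <= best + 1e-12
```

Applied pointwise to Eq. (4) with a = (p_{g_1}(x), ..., p_{g_k}(x), p_d(x)), this yields exactly the optimal discriminator of Corollary 1.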
Additionally, we show diverse generations for the challenging tasks of image-to-image translation [14], diverse-class data generation, and face generation. It is non-trivial to devise a metric to evaluate diversity on these high-quality generation tasks, so we perform a qualitative assessment. Note that the image-to-image translation objective is known to learn a delta distribution and is thus agnostic to the input noise vector; however, we show that MAD-GAN is able to produce highly plausible diverse generations for this task as well. Finally, we show the efficacy of MAD-GAN on the unsupervised feature representation learning task. We provide a detailed overview of the architectures, datasets, and parameters used in our experiments in Appendix C.

Figure 4: A toy example to understand the behaviour of different GAN variants in comparison with MAD-GAN (each method was trained for 198,000 iterations): (a) DCGAN, (b) WGAN, (c) BEGAN, (d) GoGAN, (e) Unrolled GAN, (f) Mode-Reg DCGAN, (g) InfoGAN, (h) MA-GAN, (i) MAD-GAN (Ours). The orange bars show the density estimate of the training data and the blue ones that of the generated data points. After careful cross-validation, we chose a bin size of 0.1.

Figure 5: A toy example to understand the behavior of MAD-GAN with different numbers of generators, from (a) 1 generator to (h) 8 generators (each method was trained for 198,000 iterations). The orange bars show the density estimate of the training data and the blue ones that of the generated data points. After careful cross-validation, we chose a bin size of 0.1. In the case of InfoGAN [8], we varied the dimension of the categorical variable, depicting the number of modes, to obtain the best cross-validated results.

5.1.
Non-Parametric Density Estimation

In order to understand the behavior of MAD-GAN and different state-of-the-art GAN models, we first perform a very simple synthetic experiment, much easier than generating high-dimensional complex images. We consider a 1D GMM [6] with five mixture components, with modes at 10, 20, 60, 80, and 110, and standard deviations of 3, 3, 2, 2, and 1, respectively. While the first two modes overlap significantly, the fifth mode stands isolated, as shown in Fig. 4. We train the different GAN models using 200,000 samples from this distribution and generate 65,536 data points from each model. In order to compare the learned distribution with the ground-truth distribution, we estimate both using bins over the data points and create histograms. These histograms are carefully created using different bin sizes, and the best bin size (found to be 0.1) is chosen. Then, we use the Chi-square distance and the KL-divergence to compute the distance between the two histograms. From Fig. 4 and Tab. 1 it is evident that MAD-GAN is able to capture all the clustered modes, including the significantly overlapping ones. MAD-GAN obtains the minimum value in terms of both the Chi-square distance and the KL-divergence. In this experiment, both MAD-GAN and MA-GAN used four generators. In the case of InfoGAN, we used a 5-dimensional categorical variable, which provided the best result.

Table 1: Synthetic experiment on the 1D GMM (Fig. 4).

GAN Variants        | Chi-square (×10^5) | KL-Div
DCGAN [26]          | 0.90               | 0.322
WGAN [3]            | 1.32               | 0.614
BEGAN [5]           | 1.06               | 0.944
GoGAN [16]          | 2.52               | 0.652
Unrolled GAN [22]   | 3.98               | 1.321
Mode-Reg DCGAN [7]  | 1.02               | 0.927
InfoGAN [8]         | 0.83               | 0.21
MA-GAN              | 1.39               | 0.526
MAD-GAN (Ours)      | 0.24               | 0.145

Table 2: Synthetic experiment with different numbers of MAD-GAN generators (same setup as in Fig. 4).

# Generators | Chi-square (×10^7) | KL-Div
1            | 1.27               | 0.57
2            | 1.38               | 0.42
3            | 3.15               | 0.71
4            | 0.39               | 0.28
5            | 3.05               | 0.88
6            | 0.54               | 0.29
7            | 0.97               | 0.78
8            | 4.83               | 0.68
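The evaluation protocol above (histogram the samples at bin size 0.1, then compare histograms) can be sketched as follows. The GMM parameters are those from the text; the equal component weights, the histogram range, and the smoothing constant are our own assumptions, and a second draw from the true GMM stands in for a model's generated samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D GMM from the text: modes 10, 20, 60, 80, 110; std devs 3, 3, 2, 2, 1.
means = np.array([10, 20, 60, 80, 110], dtype=float)
stds = np.array([3, 3, 2, 2, 1], dtype=float)

def sample_gmm(n):
    comp = rng.integers(0, 5, size=n)       # equal mixture weights (assumed)
    return rng.normal(means[comp], stds[comp])

def histogram(x, edges, eps=1e-10):
    h, _ = np.histogram(x, bins=edges)
    h = h.astype(float) + eps               # smooth to avoid log(0)
    return h / h.sum()

edges = np.arange(-5.0, 125.0, 0.1)         # bin size 0.1, as in the text
p = histogram(sample_gmm(200_000), edges)   # "real" samples
q = histogram(sample_gmm(65_536), edges)    # stand-in for generated samples

kl = float(np.sum(p * np.log(p / q)))                # KL-divergence
chi2 = float(np.sum((p - q) ** 2 / (p + q)))         # Chi-square distance
```

A mode-collapsed model would concentrate q on a few bins, inflating both distances; a model matching the data keeps them near zero, which is the behaviour Tab. 1 quantifies.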
To understand the effect of varying the number of generators in MAD-GAN, we use the same synthetic experimental setup, i.e., the real data distribution is the same GMM with 5 Gaussians. For better non-parametric estimation, we use 1 million sample points from the real distribution (instead of 65,536). We generate an equal number of points from each of the generators such that they sum up to 1 million. The results are shown in Fig. 5 and the corresponding Tab. 2. It is quite clear that as the number of generators is increased up to 4, the sampling keeps getting more realistic. When multiple modes significantly overlap or are clustered, a single generator can capture the whole cluster of modes. Therefore, for this real data distribution, 4 generators are enough to capture all 5 modes. With 5 or more generators, all the modes were still captured, but the two overlapping modes have more than two generation peaks. This is mainly because multiple generators capture this region, and all the generators (mixture components) were assigned equal weights during sampling. Other works using more than one generator [21, 4] also treat the number of generators as a hyper-parameter, as knowing a priori the number of modes in real-world data (e.g., images) is in itself an open problem.

Table 3: Stacked-MNIST experiments and comparisons. Note that three generators are used for MAD-GAN.

GAN Variants        | KL-Div | # Modes Covered
DCGAN [26]          | 2.15   | 712
WGAN [3]            | 1.02   | 868
BEGAN [5]           | 1.89   | 819
GoGAN [16]          | 2.89   | 672
Unrolled GAN [22]   | 1.29   | 842
Mode-Reg DCGAN [7]  | 1.79   | 827
InfoGAN [8]         | 2.75   | 840
MA-GAN              | 3.4    | 700
MAD-GAN (Ours)      | 0.91   | 890

Table 4: Compositional-MNIST experiments and comparisons. Note that three generators are used for MAD-GAN.

GAN Variants        | KL-Div | # Modes Covered
DCGAN [26]          | 0.18   | 980
WGAN [3]            | 0.25   | 1000
BEGAN [5]           | 0.19   | 999
GoGAN [16]          | 0.87   | 972
Unrolled GAN [22]   | 0.091  | 1000
Mode-Reg DCGAN [7]  | 0.12   | 992
InfoGAN [8]         | 0.47   | 990
MA-GAN              | 1.62   | 997
MAD-GAN (Ours)      | 0.074  | 1000

5.2.
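The mode-counting metric used in Tables 3 and 4 classifies each digit of a generated sample and treats the resulting digit triple as a mode ID. A sketch of the bookkeeping, with random dummy predictions standing in for the pretrained MNIST classifier's outputs:

```python
import math
import random
from collections import Counter

rng = random.Random(0)

# Dummy per-digit predictions standing in for a pretrained MNIST classifier
# applied to each channel (Stacked) or quadrant (Compositional) of a sample.
preds = [(rng.randrange(10), rng.randrange(10), rng.randrange(10))
         for _ in range(25_600)]

# Mode ID: three digits -> one of 1000 modes.
modes = Counter(d1 * 100 + d2 * 10 + d3 for d1, d2, d3 in preds)
num_modes_covered = len(modes)

# KL divergence between the empirical mode distribution and the uniform
# distribution over the 1000 true modes (summed over observed modes).
n = sum(modes.values())
kl = sum((c / n) * math.log((c / n) / (1 / 1000)) for c in modes.values())
```

Under mode collapse, `num_modes_covered` drops well below 1000 and the KL to the uniform mode distribution grows, which is exactly the pattern the weaker baselines show in Tab. 3.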
Stacked and Compositional MNIST

We now perform experiments on a more challenging setup, similar to [7, 22], in order to examine and compare MAD-GAN with other GAN variants. [22] created a Stacked-MNIST dataset with 25,600 samples, where each sample has three channels stacked together, with a random MNIST digit in each of them. This creates 1000 distinct modes in the data distribution. [22] used a stripped-down version of the generator and discriminator pair to reduce the modeling capacity. We do the same for a fair comparison and use the same architecture as mentioned in their paper. Similarly, [7] created Compositional-MNIST, whereby they took 3 random MNIST digits and placed them in 3 quadrants of a 64 × 64 image. This also results in a data distribution with 1000 hand-designed modes. The distribution of the resulting generated samples was estimated using a pretrained MNIST classifier, which classifies each of the digits (in the channels or the quadrants) to decide the mode a sample belongs to.

Tables 3 and 4 provide a comparison of our method with variants of GAN, in terms of the KL divergence and the number of modes recovered, for the Stacked and Compositional MNIST datasets, respectively. On Stacked-MNIST, as evident from Tab. 3, MAD-GAN outperforms all other variants of GAN on both criteria. Interestingly, in the case of Compositional-MNIST, as shown in Tab. 4, MAD-GAN, WGAN, and Unrolled GAN were all able to recover all 1000 modes. However, in terms of KL divergence, the distribution generated by MAD-GAN is the closest to the true data distribution.

Figure 7: Diverse generations for the edges-to-handbags task. In each sub-figure, the first column is the input, columns 2-4 are generations by MAD-GAN (using three generators), and columns 5-7 are generations by InfoGAN (using three categorical codes). Clearly, different generators of MAD-GAN produce diverse results capturing different colors, textures, design patterns, etc. However, the InfoGAN generations are visually almost the same, indicating mode collapse.

Figure 9: InfoGAN for the edges-to-handbags task when sharing the discriminator and the Q network. In each sub-figure, the first column is the input, columns 2-4 are generations when the input is the categorical code besides the conditioning image, and columns 5-7 are generations with noise as an additional input. The generations for both architectures are visually the same irrespective of the categorical code value, which clearly indicates that InfoGAN is not able to capture diverse modes.

5.3. Diverse Samples for Image-to-Image Translation and Comparison to InfoGAN

Here we present experimental results on the challenging task of image-to-image translation [14], which uses the conditional variant of GANs [23]. The conditional GAN for this task is known to learn a delta distribution and thus generates the same image irrespective of the variations in the input noise vector. Generating diverse samples in this setting is in itself an open problem. We show that MAD-GAN is able to generate diverse samples in these experiments as well. We use three generators for the MAD-GAN experiments and show three diverse generations.

Figure 11: Diverse generations for the night-to-day image generation task. The first column in each sub-figure represents the input. The remaining three columns show the diverse generations of the three different generators of MAD-GAN (Ours).

Figure 13: Face generations using MAD-GAN. Each sub-figure shows generations by a single generator. The first generator is generating faces with a very dark background. The second one is generating female faces with long hair on a light background, while the third one is generating faces with a colored background and a casual look (based on facial direction and expression).
Note that we do not claim to capture all the possible modes present in the data distribution: first, we cannot estimate the number of modes a priori; second, even if we could, we do not know how diverse the generations would be for a given number of generators. We follow the same approach as [14] and employ a patch-based conditional GAN.

We compare MAD-GAN with InfoGAN [8] in these experiments, as it is the closest to our approach and can be used for the image-to-image translation task. Theoretically, the latent codes in InfoGAN should enable diverse generations. However, InfoGAN can only be used when the bias introduced by the categorical variables has a significant impact on the generator network; for image-to-image translation and high-resolution generation, the categorical variable does not have sufficient impact on the generations. As will be seen shortly, we validate this hypothesis by comparing our method with InfoGAN on this task. For the InfoGAN generator, to capture three kinds of distinct modes, the categorical code is chosen to take three values. Since we are dealing with images, the categorical code in this case is a 2D matrix in which we set one third of the entries to 1 and the remaining to 0 for each category. The generator is fed the input image with the categorical code appended channel-wise. The architecture of the Q network is the same as that of the pix2pix discriminator [14], except that the output is a vector of size 3 for predicting the categorical code. Note that we tried different variations of the categorical codes but did not observe any significant variation in the generations.

Fig. 7 shows generations by MAD-GAN and InfoGAN for the edges-to-handbags task, where, given the edges of handbags, the objective is to generate real-looking handbags.
Clearly, each MAD-GAN generator is able to produce meaningful images that differ from those of the remaining generators in terms of color, texture, and patterns. However, the InfoGAN generations are almost the same for all three categorical codes. The InfoGAN results shown here were obtained without sharing the discriminator and Q network parameters.

To make our baseline as strong as possible, we performed further InfoGAN experiments on the edges-to-handbags task. For Fig. 9, we ran two experiments in which all the initial layers of the discriminator and the Q network are shared. In the first experiment, the input is the categorical code besides the conditioning image; in the second, noise is also added as an input. The architecture details are given in Appendix C.3.2. Fig. 9 shows the results of both experiments side by side: there are no perceivable changes as we vary the categorical code values. The generator simply learns to ignore the input noise, as was also pointed out by [14].

In addition, Fig. 11 shows diverse generations for the night-to-day task, where, given night images of places, the objective is to generate their corresponding day images. As can be seen, the generated day images in Fig. 11 differ in terms of lighting conditions, sky patterns, weather conditions, and many other minute yet useful cues.

5.4. Diverse-Class Data Generation

To further explore the mode-capturing capacity of MAD-GAN, we experimented with the much more challenging task of diverse-class data generation. In detail, we trained MAD-GAN (three generators) on a combined dataset consisting of highly diverse images of islets, icebergs, broadleaf-forest, bamboo-forest, and bedroom, obtained from the Places dataset [32]. Images were randomly selected from each class, creating a training dataset of 24,000 images. The generators have the same architecture as that of DCGAN.

Figure 14: Face generations using MAD-GAN. Each generator employed is DCGAN.
Each row represents a generator; each column represents generations for a given random noise input z. Note that the first generator generates faces pointing to the left, the second generates female faces with long hair, and the third generates images with light backgrounds.

In this case, as the images in the dataset belong to different classes, we did not share the generator parameters. As shown in Fig. 1, to our surprise we found that, even in this highly challenging setting, the generations from different generators belong to different classes. This clearly indicates that the generators in MAD-GAN are able to disentangle inter-class variations. In addition, each generator generates diverse samples for different noise inputs, indicating intra-class diversity.

5.5. Diverse Face Generation

Here we show diverse face generations (CelebA dataset) using MAD-GAN, with DCGAN [26] as each of our three generators. Again, we use the same setting as provided in DCGAN. The high-quality face generations are shown in Fig. 14. To better understand the possible diversities, we show additional generations in Fig. 13.

5.6. Unsupervised Representation Learning

Similar to DCGAN [26], we train our framework on the SVHN dataset [24]. The trained discriminator is used to extract features, on which we train an SVM for the classification task. For MAD-GAN with three generators, we obtain a misclassification error of 17.5%, which is almost 5% better than the result reported for DCGAN (22.48%). This clearly indicates that our framework is able to learn a better feature space in an unsupervised setting.

6. Conclusion

We presented a very simple and effective framework, Multi-Agent Diverse GAN (MAD-GAN), for generating diverse and meaningful samples.
We showed the efficacy of our approach, comparing it with various GAN variants, and demonstrated that it captures diverse modes while producing high-quality samples. We presented a theoretical analysis of MAD-GAN with conditions for global optimality. Looking forward, an interesting future direction would be to estimate a priori the number of generators needed for a particular dataset; it is not clear how to do this given that we do not have access to the true data distribution. In addition, we would like to theoretically understand the limiting cases that depend on the relationship between the number of generators and the complexity of the data distribution. Another interesting direction would be to exploit different generators such that their combinations can be used to capture diverse modes.

7. Acknowledgements

This work was supported by the EPSRC, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, and EPSRC/MURI grant EP/N019474/1.

References

[1] M. Abadi and D. Andersen. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918, 2016.
[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In International Conference on Machine Learning, 2017.
[4] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, 2017.
[5] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[6] C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. 2006, corr. 2nd printing edn. Springer, New York, 2007.
[7] T. Che, Y. Li, A. Jacob, Y. Bengio, and W. Li.
Mode-regularized generative adversarial networks. In International Conference on Learning Representations, 2017.
[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
[10] I. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. In International Conference on Learning Representations, 2017.
[11] A. Ghosh, V. Kulharia, and V. Namboodiri. Message passing multi-agent GANs. arXiv preprint, 2016.
[12] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition, 2017.
[15] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 2009.
[16] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Gang of GANs: Generative adversarial networks with maximum margin ranking. arXiv preprint arXiv:1704.04865, 2017.
[17] D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[18] V. Kulharia, A. Ghosh, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation.
In AAAI Conference on Artificial Intelligence, 2017.
[19] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 1951.
[20] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Computer Vision and Pattern Recognition, 2017.
[21] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 2016.
[22] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In International Conference on Learning Representations, 2017.
[23] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition, 2016.
[26] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.
[27] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 2016.
[28] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces.
In International Conference on Machine Learning, 2004.
[30] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016.
[31] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, 2016.
[32] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[33] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, 2016.

Appendix

Here, we first give better insight into Theorem 1 and discuss how and when MAD-GAN leads to diverse generations. In Appendix B, we introduce another way of getting different generators to generate diverse samples: an intuitive similarity based competing objective (MAD-GAN-Sim) which encourages different generators to generate diverse samples. Finally, in Appendix C, we provide architecture details and data preparation for all the experiments reported for MAD-GAN and MAD-GAN-Sim.

A. Insights for Diversity in MAD-GAN

One obvious question that could arise is: is it possible that all the generators learn to capture the same mode? The short answer is: theoretically yes, and in practice no. Let us begin the discussion to understand this. Theoretically, according to Theorem 1, the minimum objective value can also be achieved if p_{g_i} = p_d for all i. This implies that, in the worst case, MAD-GAN would perform the same as the standard GAN.
However, as discussed below, this is possible only in the following highly unlikely situations:

- All the generators always generate exactly the same samples, so that the discriminator is not able to differentiate them. In this case, the discriminator will learn a uniform distribution over the generator indices, and the gradients passed through the discriminator will be exactly the same for all the generators. However, this situation is in general not possible, as all the generators are initialized differently. Even a slight variation in the samples from the generators will be enough for the discriminator to identify them and pass different gradient information to each generator. In addition, the objective of each generator is only to generate real samples, so there is nothing that encourages them to generate exactly the same samples.

- The discriminator does not have enough capacity to learn the optimal parameters. This is in contrast to the assumption made in Theorem 1 that the discriminator is optimal; it should thus have enough capacity to learn a feature representation that correctly identifies samples from different generators. In practice this is a very easy task, and we did not have to modify anything up to the feature-representation stage of the discriminator architecture. We used the standard architectures (explained in Appendix C) for all the tasks.

Hence, with random initializations and a generator/discriminator of sufficient capacity, we can easily avoid the trivial solution in which all the generators focus on exactly the same region of the true data distribution. This is clearly supported by the various experiments showing diverse generations by MAD-GAN.

B. Similarity based competing objective

We have discussed the MAD-GAN architecture with its generator identification based objective.
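For concreteness, the generator identification objective just recalled trains the discriminator as a (k+1)-way classifier: a fake sample from generator i takes class i, and real samples take class k. The following PyTorch sketch illustrates the resulting discriminator loss; the feature dimension and layer sizes are illustrative assumptions, not the paper's architectures.

```python
import torch
import torch.nn as nn

k = 3  # number of generators

# Discriminator head with k+1 logits: classes 0..k-1 identify the generator
# that produced a fake sample; class k means "real".
disc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, k + 1))
ce = nn.CrossEntropyLoss()

def discriminator_loss(real_batch, fake_batches):
    """fake_batches[i] holds samples produced by generator i."""
    # Real samples are labelled with class index k.
    loss = ce(disc(real_batch),
              torch.full((len(real_batch),), k, dtype=torch.long))
    # Fake samples from generator i are labelled with class index i.
    for i, fake in enumerate(fake_batches):
        loss = loss + ce(disc(fake),
                         torch.full((len(fake),), i, dtype=torch.long))
    return loss
```

To succeed at this classification task, the discriminator must find features separating the generators, which is what pushes them towards different identifiable modes.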
In this section, we propose a different extension to the standard GAN: a similarity based competing objective, which we call MAD-GAN-Sim. Here, we augment the GAN objective function with a diversity enforcing term. It ensures that the generations from different generators are diverse, where the diversity depends on a user-defined, task-specific function. The architecture is the same as that of MAD-GAN discussed in Section 4.1 (refer to Fig. 15).

Figure 15: MAD-GAN-Sim compared with MAD-GAN. All the generators share parameters of all the layers except the last one. The two proposed diversity enforcing objectives, 'competing' (MAD-GAN-Sim) and 'generator identification' (MAD-GAN), are shown at the end of the discriminator.

B.1. Approach

The approach presented here is motivated by the fact that samples from different modes must look different. For example, in the case of images, such samples should differ in terms of texture, color, shading, and various other cues. Thus, different generators must generate dissimilar samples, where the dissimilarity comes from a task-specific function.

Before delving into the details, let us first define some notation in order to avoid clutter. We denote by θ_g^i the parameters of the i-th generator. The set of generators is indexed by K = {1, ..., k}. Given random noise z to the i-th generator, the corresponding generated sample G_i(z; θ_g^i) is denoted g_i(z). Using this notation and following the above intuitions, we impose the following constraints on the i-th generator while updating its parameters:

D(G_i(z; θ_g^i); θ_d) ≥ D(G_j(z; θ_g^j); θ_d) + ∆(φ(g_i(z)), φ(g_j(z))),  ∀ j ∈ K \ {i},   (5)

where φ(g_i(z)) denotes the mapping of the generated image g_i(z) into a feature space, and ∆(·,·) ∈ [0, 1] is the similarity function: the higher the value of ∆(·,·), the more similar its arguments are.
Intuitively, the above set of constraints ensures that the discriminator score for each generator is higher than that of every other generator by a margin proportional to the similarity score: the more similar the samples, the larger the margin and the more active the constraints. We use an unsupervised learning based representation as our mapping function φ(·). Precisely, given a generated sample g_i(z), φ(g_i(z)) is the feature vector obtained using the discriminator of our framework. This is motivated by the feature matching based approach to improving the stability of GAN training [28]. The ∆(·,·) function used in this work is the standard cosine similarity based function. The above constraints can be satisfied by maximizing an equivalent unconstrained objective function, defined below:

U(θ_g^i, θ_d) := f( D(G_i(z; θ_g^i); θ_d) − (1/(k−1)) Σ_{j ∈ K\{i}} [ D(G_j(z; θ_g^j); θ_d) + ∆(ψ_i, ψ_j) ] ),

where f(a) = min(0, a), ψ_i = φ(g_i(z)), and ψ_j = φ(g_j(z)). Intuitively, if the argument of f(·) is positive, the desired constraint is satisfied and there is nothing to do; otherwise, we maximize the argument with respect to θ_g^i. Note that, instead of using all the constraints independently, we use their average. Another approach would be to use the constraint corresponding to the j-th generator that maximally violates the set of constraints in Eq. 5; experimentally, we found training with the average-constraint objective more stable than with the maximally-violated-constraint objective. The intuition behind these constraints comes from the well-known 1-slack formulation of the structured SVM framework [15, 29]. Thus, the overall objective for the i-th generator is:

min_{θ_g^i}  V(θ_d, θ_g^i) − λ U(θ_g^i, θ_d),

where λ ≥ 0 is a hyperparameter.
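A per-generator update for this objective might look as follows in PyTorch. This is a sketch under stated assumptions: D outputs a probability, phi plays the role of φ (in the paper's setup it would be the discriminator's penultimate-layer activations), and we rescale cosine similarity from [−1, 1] to [0, 1] to match ∆(·,·) ∈ [0, 1]; the modules themselves are placeholders.

```python
import torch
import torch.nn.functional as F

def sim(a, b):
    """Delta(., .): cosine similarity rescaled to [0, 1] (our assumption)."""
    return 0.5 * (F.cosine_similarity(a, b, dim=-1).mean() + 1.0)

def generator_losses(generators, D, phi, sample_z, lam):
    """One MAD-GAN-Sim step: returns one scalar loss per generator."""
    k = len(generators)
    losses = []
    for i, G_i in enumerate(generators):
        z = sample_z()                     # fresh z for each generator update
        x_i = G_i(z)
        psi_i = phi(x_i)
        nu = 0.0
        for j, G_j in enumerate(generators):
            if j == i:
                continue
            x_j = G_j(z).detach()          # constraints only move generator i
            nu = nu + D(x_j).mean() + sim(psi_i, phi(x_j))
        nu = D(x_i).mean() - nu / (k - 1)  # argument of f(.) in U
        loss = torch.log(1.0 - D(x_i).mean() + 1e-8)  # standard GAN term V
        if nu < 0:                         # constraint violated: U = nu < 0
            loss = loss - lam * nu         # minimize V - lambda * U
        losses.append(loss)
    return losses
```

When ν ≥ 0 the diversity constraints are satisfied and the update reduces to the standard GAN generator loss, exactly as in lines 10-14 of Algorithm 1 below.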
Algorithm 1 shows how to compute the gradients for the different generators under the above objective. Notice that, once sampled, the same z is passed through all the generators in order to enforce the constraints for a particular generator (as shown in Eq. 5); however, so that the constraints for different generators do not contradict each other, a fresh z is sampled from p_z when updating the next generator. Algorithm 1 is shown for a batch size of one and generalizes trivially to any batch size. For the discriminator, the gradients have exactly the same form as in the standard GAN objective; the only difference is that the fake samples are generated by k generators instead of one.

Algorithm 1: Updating generators for MAD-GAN-Sim
input: θ_d; p_z; θ_g^i for all i ∈ {1, ..., k}; λ.
1: for each generator i ∈ {1, ..., k} do
2:   Sample noise from the given noise prior: z ∼ p_z.
3:   Obtain the generated sample G_i(z; θ_g^i) and the corresponding feature vector ψ_i = φ(G_i(z; θ_g^i)).
4:   ν ← 0.
5:   for each generator j ∈ {1, ..., k} \ {i} do
6:     Compute the feature vector ψ_j = φ(G_j(z; θ_g^j)).
7:     ν ← ν + D(G_j(z; θ_g^j); θ_d) + ∆(ψ_i, ψ_j).
8:   end for
9:   ν ← D(G_i(z; θ_g^i); θ_d) − ν / (k − 1).
10:  if ν ≥ 0 then
11:    ∇_{θ_g^i} log(1 − D(G_i(z; θ_g^i); θ_d)).
12:  else
13:    ∇_{θ_g^i} [ log(1 − D(G_i(z; θ_g^i); θ_d)) − λ U(θ_g^i, θ_d) ].
14:  end if
15: end for

B.2. Experiments

We present the efficacy of MAD-GAN-Sim on real-world datasets.

B.3. Diverse Samples for Image-to-Image Translation

We show diverse and highly appealing results using the diversity promoting objective. We use cosine based similarity to enforce diverse generations, an important criterion for real images.
As before, we show results for the following two situations where diverse solutions are useful: (1) given the edges of handbags, generate real-looking handbags (Fig. 17); and (2) given night images of places, generate their equivalent day images (Fig. 19). We clearly observe that each generator is able to produce meaningful and diverse images.

B.4. Unsupervised Representation Learning

We repeat the SVHN experiment of Section 5.6. With MAD-GAN-Sim we obtain a misclassification error of 18.3%, which is better than DCGAN (22.48%). This clearly indicates that MAD-GAN-Sim is also able to learn a better feature representation in an unsupervised setting.

C. Network Architectures and Parameters

Here we provide all the details of the architectures and parameters used in the various experiments. For the experiment concerning non-parametric density estimation, the MAD-GAN parameters are randomly initialized using Xavier initialization with normally distributed random sampling [12].

Figure 17: MAD-GAN-Sim: diverse generations for the edges-to-handbags image generation task. The first column in each sub-figure is the input; the remaining three columns show the diverse outputs of the different generators. It is evident that different generators produce very diverse results, capturing color (brown, pink, black), texture, design patterns, and shininess, among others.

Figure 19: MAD-GAN-Sim: diverse generations for the night-to-day image generation task. The first column in each sub-figure is the input; the remaining three columns show the diverse outputs of the different generators. It is evident that different generators produce very diverse results, capturing different lighting conditions, sky patterns (cloudy vs. clear), weather conditions (winter, summer, rain), and landscapes, among many other minute yet useful cues.
For all the other experiments, the initialization is the same as in the base architecture that MAD-GAN adapts.

C.1. Non-Parametric Density Estimation

Architecture details: The generator has two fully connected hidden layers with 128 neurons each (each followed by an exponential linear unit) and a fully connected output layer. For MAD-GAN and MA-GAN, we use 4 generators with the parameters of the first two layers shared. The generator produces 1D samples. The input to each generator is 64-dimensional uniform noise U(−1, 1). For InfoGAN, a 5-dimensional categorical code, randomly sampled from the multinomial distribution, is further concatenated with the uniform noise to form the input. The discriminator architectures for the respective networks are shown in Tab. 5. The Mode-Regularized GAN architecture has an encoder, BEGAN has an encoder and a decoder, and InfoGAN has a Q network; their details are also given in Tab. 5.

MAD-GAN uses a multi-label cross-entropy loss; MA-GAN uses a binary cross-entropy loss. For training, we use the Adam optimizer with a batch size of 128 and a learning rate of 1e−4. In each mini-batch, MAD-GAN takes 128 samples from each of the generators as well as from the real distribution, while MA-GAN takes 128 samples from the real distribution and 128 from all the generators combined.

Dataset generation: We generate synthetic 1D data using a GMM with 5 Gaussians, with means 10, 20, 60, 80, and 110 and standard deviations 3, 3, 2, 2, and 1. The first two modes overlap significantly, while the fifth one is peaky and stands isolated.

C.2. Stacked and Compositional MNIST Experiments

Architecture details: The architecture for Stacked-MNIST is similar to the one used in [22]. Please refer to Tab. 6 for the generator architecture, and to Tab. 7 for the discriminator architecture and the Q network architecture of InfoGAN. The architecture for the Compositional-MNIST experiment is the same as DCGAN [26].
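The Tab. 6 generator can be sketched in PyTorch as follows. The kernel sizes and paddings are our assumptions; the table specifies only the layer types, output channel counts, and strides.

```python
import torch
import torch.nn as nn

class StackedMNISTGenerator(nn.Module):
    """Sketch of the Tab. 6 generator: z in R^256 -> 32x32x3 image."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 4 * 4 * 64)  # fully connected, reshaped to 4x4x64
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.Conv2d(8, 3, 3, stride=1, padding=1), nn.Tanh(),             # 3-channel output
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 4, 4))
```

In the MAD-GAN variant of this generator, all layers except the last two rows of Tab. 6 would be shared across generators.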
Please refer to Tab. 8 for the discriminator architecture and the Q network architecture of InfoGAN. In both experiments, the Q network of InfoGAN shares all layers except the last with the discriminator.

Dataset preparation: The MNIST database of handwritten digits is used for both tasks.

C.3. Image-to-Image Translation

C.3.1 MAD-GAN/MAD-GAN-Sim

Architecture details: The network architecture is adapted from [14]; the experiments were conducted with the U-Net generator and a patch-based discriminator. In more detail, let Ck denote a Convolution-BatchNorm-ReLU layer with k filters, and CDk a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%.

Table 5: Non-parametric density estimation architectures for the discriminators (Disc), encoders (Enc), decoders (Dec), and Q network (QNet). nGen is the number of generators; fc is a fully connected layer.

Columns: DCGAN / Unrolled GAN / InfoGAN / MA-GAN Disc | Mode-Reg DCGAN Disc | Mode-Reg DCGAN Enc | WGAN / GoGAN Disc | BEGAN Enc | BEGAN Dec | MAD-GAN Disc | InfoGAN QNet
Input dimension: 1 | 1 | 1 | 32 | 1 | 1 | 1 | 1
Hidden layers (all columns): fc 128, leaky ReLU; fc 128, leaky ReLU
Output fc dimension: 1 | 1 | 64 | 1 | 32 | 1 | 5 | (nGen+1) | 5
Final activation: sigmoid | identity | softmax

Table 6: Generator architecture for the 1000-class Stacked-MNIST experiment. For MAD-GAN, all layers except those in the last two rows are shared.

layer | number of outputs | stride
Input: z ∼ N(0, I_256) | |
Fully connected | 4·4·64 | Reshape to 4x4x64 image
Transposed convolution | 32 | 2
Transposed convolution | 16 | 2
Transposed convolution | 8 | 2
Convolution | 3 | 1

Table 7: Discriminator architecture for the 1000-class Stacked-MNIST experiment. For MAD-GAN with k generators, it is adapted to have a (k+1)-dimensional last-layer output. For InfoGAN, with 156-dimensional salient variables and 100-dimensional incompressible noise, the Q network is adapted to have a 156-dimensional output.

layer | number of outputs | stride
Input: 32x32 color image | |
Convolution | 4 | 2
Convolution | 8 | 2
Convolution | 16 | 2
Flatten | |
Fully connected | 1 |
All convolutions use 4 x 4 spatial filters with a stride of 2. Convolutions in the encoder and in the discriminator downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.

Generator architectures: We used the U-Net generator based architecture from [14]:
- U-Net encoder: C64-C128-C256-C512-C512-C512-C512-C512
- U-Net decoder: CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128. Note that, in the case of MAD-GAN, the last layer does not share parameters with the other generators.

After the last layer in the decoder, a convolution maps to 3 output channels, followed by a tanh function. BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with a slope of 0.2, while ReLUs in the decoder are not leaky. The U-Net architecture has skip connections between each layer i in the encoder and layer n−i in the decoder, where n is the total number of layers; the skip connections concatenate the activations from layer i to layer n−i, which changes the number of channels in the decoder.

Table 8: Discriminator architecture for the 1000-class Compositional-MNIST experiment. For MAD-GAN with k generators, it is adapted to have a (k+1)-dimensional last-layer output. For InfoGAN, with 156-dimensional salient variables and 100-dimensional incompressible noise, the Q network is adapted to have a 156-dimensional output.

layer | number of outputs | stride
Input: color image (64x64) | |
Convolution | 64 | 2
Convolution | 128 | 2
Convolution | 256 | 2
Convolution | 512 | 2
Flatten | |
Fully connected | 1 |

Discriminator architectures: The patch-based 70 x 70 discriminator architecture was used in this case: C64-C128-C256-C512.

Diversity term:
- MAD-GAN: After the last layer, a convolution maps the output to dimension k+1 (where k is the number of generators in MAD-GAN), followed by a softmax layer for normalization.
- MAD-GAN-Sim: After the last layer, a convolution maps to a 1-dimensional output, followed by a sigmoid function. For the unsupervised feature representation φ(·), the feature activations from the penultimate layer (C256) of the discriminator are used to compute the cosine similarity.

For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator), λ_L1 = 10 (the hyperparameter corresponding to the L1 regularizer), λ = 1e−3 (corresponding to MAD-GAN-Sim), and a batch size of 1.

C.3.2 InfoGAN

The network architecture is adapted from [14]; the experiments were conducted with the U-Net generator and a patch-based discriminator.

Generator architectures: The U-Net generator is exactly the same as in [14], except that the number of input channels is increased from 3 to 4. For the experiment in Fig. 9 that takes noise as input, the number of input channels is increased to 5 (one extra input channel for noise).

Discriminator architectures: The discriminator is exactly the same as in [14]: C64-C128-C256-C512.

Q network architectures: The Q network architecture is C64-C128-C256-C512-Convolution3-Convolution3, where the first Convolution3 outputs 30 x 30 patches with 3 channels and the second Convolution3 outputs a 3-dimensional vector. For the experiments in Fig. 9, all layers except the last two are shared with the discriminator.

Diversity term: To capture three kinds of distinct modes, the categorical code can take three values. Hence, in this case, the categorical code is a 2D matrix in which one third of the entries are set to 1 and the remaining to 0 for each category. The generator is fed the input image with the categorical code appended channel-wise. For the experiment in Fig. 9 that takes noise as input, the generator input is further appended channel-wise with a 2D matrix of normal noise.
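The channel-wise categorical code described above might be constructed as follows. Which third of the entries is set to 1 is not specified in the text, so partitioning by rows is our assumption.

```python
import numpy as np

def categorical_code_channel(category, height, width, n_categories=3):
    """2D code appended channel-wise to the input image: the slab of rows
    belonging to `category` is set to 1, the rest to 0 (row partitioning
    is an assumption; the paper only fixes the one-third proportion)."""
    code = np.zeros((height, width), dtype=np.float32)
    rows_per_cat = height // n_categories
    code[category * rows_per_cat:(category + 1) * rows_per_cat, :] = 1.0
    return code
```

The resulting matrix would be stacked as a fourth channel onto the 3-channel conditioning image before it is fed to the generator.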
For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generator and the discriminator), λ_L1 = 10 (the hyperparameter corresponding to the L1 regularizer), and a batch size of 1.

Dataset preparation:
- Edges-to-handbags: We used 137,000 Amazon handbag images from [33], with the same random train/test split as [33].
- Night-to-day: We used 17,823 training images extracted from 91 webcams. We thank Jun-Yan Zhu for providing the dataset.

C.4. Diverse-Class Data Generation

Architecture details: The network architecture is adapted from DCGAN [26]. Concretely, the discriminator architecture is described in Tab. 11 and the generator architecture in Tab. 10. We use three generators without sharing any parameters. The residual layers helped improve image quality, since the data manifold is much more complicated and the discriminator needs more capacity to accommodate it.

For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator) and a batch size of 64.

Dataset preparation: The training data is obtained by combining highly diverse images of islets, icebergs, broadleaf-forest, bamboo-forest, and bedroom from the Places dataset [32]. Images were randomly selected from each class, creating a dataset of 24,000 images.

C.5. Diverse Face Generation with DCGAN

Architecture details: The network architecture is adapted from DCGAN [26]. Concretely, the discriminator architecture is described in Tab. 11 and the generator architecture in Tab. 10. In this case, all parameters of the generators except the last layer were shared. The residual layers helped improve image quality, since the data manifold (and the manifold of each generator) is much more complicated and the discriminator needs more capacity to accommodate it.
Discriminator D (Tab. 9)
  Input: 64x64 color image
  4x4 conv, 64, leakyReLU, stride 2, batchnorm
  4x4 conv, 128, leakyReLU, stride 2, batchnorm
  4x4 conv, 256, leakyReLU, stride 2, batchnorm
  4x4 conv, 512, leakyReLU, stride 2, batchnorm
  4x4 conv, output, leakyReLU, stride 1

Table 9: DCGAN discriminator. The last layer is adapted to produce a (k+1)-dimensional output for MAD-GAN with k generators (the normalizer is softmax).

Generator G (Tab. 10)
  Input ∈ R^100
  4x4 upconv, 512, ReLU, batchnorm, shared
  4x4 upconv, 256, ReLU, stride 2, batchnorm, shared
  4x4 upconv, 128, ReLU, stride 2, batchnorm, shared
  4x4 upconv, 64, ReLU, stride 2, batchnorm, shared
  4x4 upconv, 3, tanh, stride 2

Table 10: DCGAN generator. All layers except the last one are shared among the three generators.

Residual Discriminator D (Tab. 11)
  Input: 64x64 color image
  7x7 conv, 64, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 64, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 128, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 256, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 512, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 512, leakyReLU, stride 2, pad 1, batchnorm
  3x3 conv, 512, leakyReLU, stride 2, pad 1, batchnorm
  RESIDUAL-(N512, K3, S1, P1)
  RESIDUAL-(N512, K3, S1, P1)
  RESIDUAL-(N512, K3, S1, P1)

Table 11: Discriminator architecture for diverse-class data generation and diverse face generation. The last-layer output is (k+1)-dimensional for MAD-GAN with k generators (the normalizer is softmax). The 'RESIDUAL' layer is elaborated in Tab. 12.

RESIDUAL - Residual Layer (Tab. 12)
  Input: previous-layer-output
  c1: CONV-(N512, K3, S1, P1), BN, ReLU
  c2: CONV-(N512, K3, S2, P1), BN
  SUM(c2, previous-layer-output)

Table 12: Residual layer description for Tab. 11.
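Tab. 9 does not state padding explicitly; assuming the padding of 1 used by the standard DCGAN architecture, the spatial sizes can be checked with the usual convolution arithmetic (a small sketch, our own helper, not code from the paper):

```python
def conv_out(n, k, s, p):
    # Spatial output size of a convolution:
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

size = 64
for _ in range(4):                  # four 4x4, stride-2 convolutions
    size = conv_out(size, 4, 2, 1)  # 64 -> 32 -> 16 -> 8 -> 4
final = conv_out(size, 4, 1, 0)     # final 4x4, stride-1 convolution
print(size, final)                  # -> 4 1
```

Under this assumption the feature map shrinks 64 -> 32 -> 16 -> 8 -> 4, and the last 4x4 stride-1 convolution collapses it to 1x1, where the (k+1)-way softmax is applied.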
Training details: We used the Adam optimizer with a learning rate of 2e-4 (for both the generator and the discriminator) and a batch size of 64.

Dataset preparation: We used the CelebA dataset as described for the face generation experiments. For ImageNet generation, all 14,197,122 images from the ImageNet-1k dataset [9] were used to train the DCGAN with 3 generators under the MAD-GAN objective. Images from both CelebA and ImageNet-1k were resized to 64x64.

C.6. Unsupervised Representation Learning

Architecture details: Our architecture follows the one proposed in DCGAN [26]. Similar to the DCGAN experiment on the SVHN dataset (32x32x3) [24], we removed the penultimate layer of the generator (second-to-last row in Tab. 10) and the first layer of the discriminator (first convolution layer in Tab. 9).

Classification task: We trained our model on the SVHN dataset [24]. For feature extraction using the discriminator, we followed the same method as described in the DCGAN paper [26]. The features were then used to train a regularized linear L2-SVM. The ablation study is presented in Tab. 13.

Technique       2 Gen    3 Gen    4 Gen
MAD-GAN         20.5%    18.2%    17.5%
MAD-GAN-Sim     20.2%    19.6%    18.3%

Table 13: The misclassification error of MAD-GAN and MAD-GAN-Sim on SVHN while varying the number of generators.

Dataset preparation: We used the SVHN dataset [24], consisting of 73,257 digits for training, 26,032 digits for testing, and 531,131 extra training samples. As in DCGAN [26], we used 1,000 uniformly class-distributed random samples for training, 10,000 samples from the non-extra set for validation, and 1,000 samples for testing. For training, we used the Adam optimizer with a learning rate of 2e-4 (for both the generator and the discriminator), λ = 1e-4 (weighting the competing objective), and a batch size of 64.
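The 1,000-sample, uniformly class-distributed training split amounts to stratified sampling: 100 examples for each of the 10 SVHN digit classes. A sketch with the standard library (the labels below are synthetic stand-ins, and `stratified_sample` is our own helper, not the paper's code):

```python
import random

def stratified_sample(labels, per_class, seed=0):
    # Return indices giving an equal number of examples per class.
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    chosen = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], per_class))
    return chosen

# Synthetic stand-in labels: 7,300 examples over 10 classes.
labels = [i % 10 for i in range(7300)]
subset = stratified_sample(labels, per_class=100)  # 1,000 indices
```

On real SVHN the same routine would be applied to the 73,257 training labels before extracting discriminator features for the L2-SVM.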
