Conditional Generative Adversarial Nets
Mehdi Mirza
Département d'informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7
mirzamom@iro.umontreal.ca

Simon Osindero
Flickr / Yahoo Inc.
San Francisco, CA 94103
osindero@yahoo-inc.com

Abstract

Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags that are not part of the training labels.

1 Introduction

Generative adversarial nets were recently introduced as an alternative framework for training generative models that sidesteps the difficulty of approximating many intractable probabilistic computations. Adversarial nets have the advantages that Markov chains are never needed, only backpropagation is used to obtain gradients, no inference is required during learning, and a wide variety of factors and interactions can easily be incorporated into the model. Furthermore, as demonstrated in [8], they can produce state-of-the-art log-likelihood estimates and realistic samples.

In an unconditioned generative model, there is no control over the modes of the data being generated. However, by conditioning the model on additional information it is possible to direct the data generation process. Such conditioning could be based on class labels, on some part of the data for inpainting as in [5], or even on data from a different modality.

In this work we show how to construct a conditional adversarial net.
For empirical results we demonstrate two sets of experiments: one on the MNIST digit dataset conditioned on class labels, and one on the MIR Flickr 25,000 dataset [10] for multi-modal learning.

2 Related Work

2.1 Multi-modal Learning for Image Labelling

Despite the many recent successes of supervised neural networks (and convolutional networks in particular) [13, 17], it remains challenging to scale such models to accommodate an extremely large number of predicted output categories. A second issue is that much of the work to date has focused on learning one-to-one mappings from input to output. However, many interesting problems are more naturally thought of as probabilistic one-to-many mappings. For instance, in the case of image labelling there may be many different tags that could appropriately be applied to a given image, and different (human) annotators may use different (but typically synonymous or related) terms to describe the same image.

One way to help address the first issue is to leverage additional information from other modalities: for instance, by using natural language corpora to learn a vector representation for labels in which geometric relations are semantically meaningful. When making predictions in such spaces, we benefit from the fact that when we make prediction errors we are still often 'close' to the truth (e.g. predicting 'table' instead of 'chair'), and also from the fact that we can naturally make predictive generalizations to labels that were not seen at training time. Works such as [3] have shown that even a simple linear mapping from image feature-space to word-representation-space can yield improved classification performance.

One way to address the second problem is to use a conditional probabilistic generative model: the input is taken to be the conditioning variable and the one-to-many mapping is instantiated as a conditional predictive distribution.
The authors of [16] take a similar approach to this problem, and train a multi-modal Deep Boltzmann Machine on the MIR Flickr 25,000 dataset, as we do in this work. Additionally, in [12] the authors show how to train a supervised multi-modal neural language model, and they are able to generate descriptive sentences for images.

3 Conditional Adversarial Nets

3.1 Generative Adversarial Nets

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two 'adversarial' models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. Both G and D could be non-linear mapping functions, such as multi-layer perceptrons.

To learn a generator distribution p_g over data x, the generator builds a mapping function from a prior noise distribution p_z(z) to data space, G(z; θ_g). The discriminator, D(x; θ_d), outputs a single scalar representing the probability that x came from the training data rather than p_g.

G and D are both trained simultaneously: we adjust the parameters of G to minimize log(1 − D(G(z))) and the parameters of D to maximize log D(x), as if they were following a two-player min-max game with value function V(G, D):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].   (1)

3.2 Conditional Adversarial Nets

Generative adversarial nets can be extended to a conditional model if both the generator and discriminator are conditioned on some extra information y. Here y could be any kind of auxiliary information, such as class labels or data from other modalities. We can perform the conditioning by feeding y into both the discriminator and the generator as an additional input layer.
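To make Eq. 1 concrete, the value function can be estimated by Monte Carlo from a batch of data samples and a batch of noise samples. The following is a minimal numerical sketch; the one-dimensional D and G below are hypothetical stand-ins chosen only to illustrate the two expectation terms, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

def gan_value(D, G, x_real, z):
    """Monte Carlo estimate of the GAN value function V(D, G) in Eq. 1:
    E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]."""
    real_term = np.mean(np.log(D(x_real)))
    fake_term = np.mean(np.log(1.0 - D(G(z))))
    return real_term + fake_term

# Toy 1-D example: D is a fixed logistic scorer, G shifts the noise.
D = lambda x: 1.0 / (1.0 + np.exp(-x))   # discriminator output in (0, 1)
G = lambda z: z - 2.0                    # "generator" pushes samples away from the data

x_real = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from p_data
z = rng.uniform(-1.0, 1.0, size=1000)               # prior noise p_z(z)

v = gan_value(D, G, x_real, z)
# G is trained to minimise V while D is trained to maximise it.
```

Since both terms are logs of probabilities, the estimate is always negative; training adjusts θ_g and θ_d in opposite directions on this same quantity.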
In the generator the prior input noise p_z(z) and y are combined in a joint hidden representation, and the adversarial training framework allows for considerable flexibility in how this hidden representation is composed.¹ In the discriminator, x and y are presented as inputs to a discriminative function (embodied again by an MLP in this case).

The objective function of the two-player minimax game is then as in Eq. 2:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x | y)] + E_{z∼p_z(z)}[log(1 − D(G(z | y)))].   (2)

Fig. 1 illustrates the structure of a simple conditional adversarial net.

Figure 1: Conditional adversarial net

4 Experimental Results

4.1 Unimodal

We trained a conditional adversarial net on MNIST images conditioned on their class labels, encoded as one-hot vectors.

In the generator net, a noise prior z with dimensionality 100 was drawn from a uniform distribution within the unit hypercube. Both z and y are mapped to hidden layers with Rectified Linear Unit (ReLU) activations [4, 11], with layer sizes 200 and 1000 respectively, before both are mapped to a second, combined hidden ReLU layer of dimensionality 1200. We then have a final sigmoid unit layer as our output for generating the 784-dimensional MNIST samples.

¹For now we simply have the conditioning input and prior noise as inputs to a single hidden layer of an MLP, but one could imagine using higher-order interactions allowing for complex generation mechanisms that would be extremely difficult to work with in a traditional generative framework.

Model | MNIST
DBN [1] | 138 ± 2
Stacked CAE [1] | 121 ± 1.6
Deep GSN [2] | 214 ± 1.1
Adversarial nets | 225 ± 2
Conditional adversarial nets | 132 ± 1.8

Table 1: Parzen window-based log-likelihood estimates for MNIST. We followed the same procedure as [8] for computing these values.

The discriminator maps x to a maxout [6] layer with 240 units and 5 pieces, and y to a maxout layer with 50 units and 5 pieces.
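The generator wiring just described (z into a 200-unit ReLU layer, y into a 1000-unit ReLU layer, concatenation into a 1200-unit ReLU layer, then a 784-unit sigmoid output) can be sketched as below. The weights are random and untrained; this is a shape-level illustration of the conditioning mechanism only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Layer sizes from the paper: z (100) -> 200, y (10) -> 1000,
# concatenated joint layer -> 1200, output -> 784.
W_z = rng.normal(scale=0.05, size=(100, 200))
W_y = rng.normal(scale=0.05, size=(10, 1000))
W_j = rng.normal(scale=0.05, size=(1200, 1200))
W_o = rng.normal(scale=0.05, size=(1200, 784))

def generator(z, y):
    """Conditional generator G(z | y): z and y are mapped to separate
    ReLU layers, concatenated into a joint 1200-unit hidden layer, then
    mapped to a 784-d sigmoid output (one flattened 28x28 MNIST image)."""
    h_z = relu(z @ W_z)                                 # (batch, 200)
    h_y = relu(y @ W_y)                                 # (batch, 1000)
    h = relu(np.concatenate([h_z, h_y], axis=1) @ W_j)  # (batch, 1200)
    return sigmoid(h @ W_o)                             # (batch, 784) pixel intensities

z = rng.uniform(0.0, 1.0, size=(5, 100))  # noise prior in the unit hypercube
y = np.eye(10)[[3, 3, 3, 3, 3]]           # condition all five samples on digit "3"
x_fake = generator(z, y)                  # shape (5, 784)
```

Biases are omitted for brevity; the point is that the label y enters G as an ordinary additional input layer.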
Both of these hidden layers are mapped to a joint maxout layer with 240 units and 4 pieces before being fed to the sigmoid layer. (The precise architecture of the discriminator is not critical as long as it has sufficient power; we have found that maxout units are typically well suited to the task.)

The model was trained using stochastic gradient descent with mini-batches of size 100 and an initial learning rate of 0.1, which was exponentially decreased down to 0.000001 with a decay factor of 1.00004. Momentum was also used, with an initial value of 0.5 which was increased up to 0.7. Dropout [9] with probability 0.5 was applied to both the generator and discriminator. The best estimate of log-likelihood on the validation set was used as the stopping point.

Table 1 shows Gaussian Parzen window log-likelihood estimates for the MNIST dataset test data. 1000 samples were drawn from each of the 10 classes and a Gaussian Parzen window was fitted to these samples. We then estimated the log-likelihood of the test set under the Parzen window distribution. (See [8] for more details of how this estimate is constructed.)

The conditional adversarial net results that we present are comparable with some other network-based approaches, but are outperformed by several other approaches, including non-conditional adversarial nets. We present these results more as a proof-of-concept than as a demonstration of efficacy, and believe that with further exploration of hyper-parameter space and architecture the conditional model should match or exceed the non-conditional results.

Fig. 2 shows some of the generated samples. Each row is conditioned on one label and each column is a different generated sample.

Figure 2: Generated MNIST digits, each row conditioned on one label

4.2 Multimodal

Photo sites such as Flickr are a rich source of labeled data in the form of images and their associated user-generated metadata (UGM), in particular user-tags.
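The Gaussian Parzen window estimate used for Table 1 above can be sketched as follows. The kernel bandwidth sigma and the toy two-dimensional data are placeholders (in [8] the bandwidth is chosen by cross-validation on a validation set, and the data are 784-dimensional MNIST samples):

```python
import numpy as np

def parzen_log_likelihood(samples, x_test, sigma):
    """Gaussian Parzen-window estimate of log p(x) for each test point:
    log p(x) = logsumexp_i[-||x - s_i||^2 / (2 sigma^2)]
               - log(N) - (d/2) log(2 pi sigma^2),
    where s_i are the N model samples the window is fitted to."""
    n, d = samples.shape
    diffs = x_test[:, None, :] - samples[None, :, :]   # (num_test, N, d)
    sq = np.sum(diffs ** 2, axis=2)                    # squared distances
    e = -sq / (2.0 * sigma ** 2)
    m = e.max(axis=1, keepdims=True)                   # stable log-sum-exp
    lse = m[:, 0] + np.log(np.sum(np.exp(e - m), axis=1))
    return lse - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))  # stand-in for samples drawn from the model
x_test = rng.normal(size=(100, 2))    # stand-in for held-out test data
ll = parzen_log_likelihood(samples, x_test, sigma=0.5)
mean_ll = ll.mean()                   # the kind of number reported in Table 1
```

The reported figure is the mean test log-likelihood under this kernel density estimate fitted to generated samples.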
User-generated metadata differ from more 'canonical' image labelling schemes in that they are typically more descriptive, and are semantically much closer to how humans describe images with natural language, rather than just identifying the objects present in an image. Another aspect of UGM is that synonymy is prevalent: different users may use different vocabulary to describe the same concepts, so having an efficient way to normalize these labels becomes important. Conceptual word embeddings [14] can be very useful here, since related concepts end up being represented by similar vectors.

In this section we demonstrate automated tagging of images, with multi-label predictions, using conditional adversarial nets to generate a (possibly multi-modal) distribution of tag-vectors conditioned on image features.

For image features we pre-train a convolutional model similar to the one from [13] on the full ImageNet dataset with 21,000 labels [15]. We use the output of the last fully connected layer, with 4096 units, as the image representation.

For the word representation we first gather a corpus of text from the concatenation of user-tags, titles and descriptions from the YFCC100M² dataset metadata. After pre-processing and cleaning the text we trained a skip-gram model [14] with a word vector size of 200. We omitted any word appearing fewer than 200 times from the vocabulary, ending up with a dictionary of size 247,465.

We keep the convolutional model and the language model fixed during training of the adversarial net, and leave experiments in which we backpropagate through these models as future work.

For our experiments we use the MIR Flickr 25,000 dataset [10], and extract the image and tag features using the convolutional model and language model described above. Images without any tags were omitted from our experiments, and annotations were treated as extra tags. The first 150,000 examples were used as the training set.
Images with multiple tags were repeated once inside the training set for each associated tag.

For evaluation, we generate 100 samples for each image and find the top 20 closest words, using cosine similarity between each sample and the vector representations of the words in the vocabulary. We then select the 10 most common words among all 100 samples. Table 2 shows some samples of the user-assigned tags and annotations along with the generated tags.

The best working model's generator receives Gaussian noise of size 100 as the noise prior and maps it to a 500-dimensional ReLU layer, and maps the 4096-dimensional image feature vector to a 2000-dimensional ReLU hidden layer. Both of these layers are mapped to a joint 200-dimensional linear layer, which outputs the generated word vectors.

The discriminator consists of 500- and 1200-dimensional ReLU hidden layers for word vectors and image features respectively, and a maxout layer with 1000 units and 3 pieces as the joint layer, which is finally fed to a single sigmoid unit.

The model was trained using stochastic gradient descent with mini-batches of size 100 and an initial learning rate of 0.1, which was exponentially decreased down to 0.000001 with a decay factor of 1.00004. Momentum was also used, with an initial value of 0.5 which was increased up to 0.7. Dropout with probability 0.5 was applied to both the generator and discriminator.

The hyper-parameters and architectural choices were obtained by cross-validation and a mix of random grid search and manual selection (albeit over a somewhat limited search space).

5 Future Work

The results shown in this paper are extremely preliminary, but they demonstrate the potential of conditional adversarial nets and show promise for interesting and useful applications.

In future explorations between now and the workshop we expect to present more sophisticated models, as well as a more detailed and thorough analysis of their performance and characteristics.
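The tag-evaluation scheme of Section 4.2 (100 generated tag-vectors per image, each sample's 20 nearest vocabulary words by cosine similarity, then the 10 most common words overall) can be sketched as below. The tiny vocabulary and random vectors are hypothetical placeholders standing in for the skip-gram embeddings and generator outputs:

```python
import numpy as np
from collections import Counter

def generated_tags(samples, vocab_vecs, vocab, top_words=20, top_tags=10):
    """For each generated word vector, find the closest vocabulary words
    by cosine similarity, then keep the most common words across all
    samples generated for one image."""
    # Normalise rows so a dot product equals cosine similarity.
    s = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    v = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = s @ v.T                             # (num_samples, vocab_size)
    counts = Counter()
    for row in sims:
        nearest = np.argsort(row)[::-1][:top_words]
        counts.update(vocab[i] for i in nearest)
    return [w for w, _ in counts.most_common(top_tags)]

rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(50)]   # hypothetical tiny vocabulary
vocab_vecs = rng.normal(size=(50, 200))   # stand-in skip-gram vectors (200-d)
samples = rng.normal(size=(100, 200))     # 100 generated tag-vectors for one image
tags = generated_tags(samples, vocab_vecs, vocab)
```

In the experiments the vocabulary has 247,465 entries, so the nearest-neighbour search would typically be batched or approximated rather than done with a dense similarity matrix.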
²Yahoo Flickr Creative Commons 100M: http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67.

User tags + annotations | Generated tags
montanha, trem, inverno, frio, people, male, plant life, tree, structures, transport, car | taxi, passenger, line, transportation, railway station, passengers, railways, signals, rail, rails
food, raspberry, delicious, homemade | chicken, fattening, cooked, peanut, cream, cookie, house made, bread, biscuit, bakes
water, river | creek, lake, along, near, river, rocky, treeline, valley, woods, waters
people, portrait, female, baby, indoor | love, people, posing, girl, young, strangers, pretty, women, happy, life

Table 2: Samples of generated tags

Also, in the current experiments we only use each tag individually. By using multiple tags at the same time (effectively posing the generative problem as one of 'set generation') we hope to achieve better results. Another obvious direction left for future work is to construct a joint training scheme to learn the language model. Works such as [12] have shown that a language model can be learned that is suited to the specific task.

Acknowledgments

This project was developed in the Pylearn2 [7] framework, and we would like to thank the Pylearn2 developers. We would also like to thank Ian Goodfellow for helpful discussions during his affiliation with the University of Montreal. The authors gratefully acknowledge the support of the Vision & Machine Learning and Production Engineering teams at Flickr (in alphabetical order: Andrew Stadlen, Arel Cordero, Clayton Mellina, Cyprien Noel, Frank Liu, Gerry Pesavento, Huy Nguyen, Jack Culpepper, John Ko, Pierre Garrigues, Rob Hess, Stacey Svetlichnaya, Tobi Baumgartner, and Ye Lu).

References

[1] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML '2013.

[2] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014).
Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML '14).

[3] Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. (2013). DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.

[4] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.

[5] Goodfellow, I., Mirza, M., Courville, A., and Bengio, Y. (2013a). Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556.

[6] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In ICML '2013.

[7] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[8] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS '2014.

[9] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report.

[10] Huiskes, M. J. and Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA. ACM.

[11] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV '09.

[12] Kiros, R., Zemel, R., and Salakhutdinov, R. (2013). Multimodal neural language models. In Proc. NIPS Deep Learning Workshop.
[13] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS '2012).

[14] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.

[15] Russakovsky, O. and Fei-Fei, L. (2010). Attribute learning in large-scale datasets. In European Conference on Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece.

[16] Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS '2012.

[17] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going deeper with convolutions. arXiv preprint arXiv:1409.4842.