Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions

Osaid Rehman Nasir+*, Shailesh Kumar Jha+*, Manraj Singh Grover*, Yi Yu†, Ajit Kumar‡ and Rajiv Ratn Shah*
*MIDAS Lab, IIIT-Delhi, Delhi, India. Email: midas@iiitd.ac.in
†NII, Tokyo, Japan. Email: yiyu@nii.ac.jp
‡Adobe Systems. Email: ajikumar@adobe.com
+ Equal contribution

Abstract: Powerful generative adversarial networks (GANs) have been developed to automatically synthesize realistic images from text. However, most existing tasks are limited to generating simple images, such as flowers, from captions. In this work, we extend this problem to the less addressed domain of face generation from fine-grained textual descriptions of faces, e.g., "A person has curly hair, oval face, and mustache". We are motivated by the potential of automated face generation to impact and assist critical tasks such as criminal face reconstruction. Since current datasets for this task are either very small or do not contain captions, we generate captions for the images in the CelebA dataset with an algorithm that automatically converts a list of attributes into a set of captions. We then model the highly multi-modal problem of text-to-face generation as learning the conditional distribution of faces (conditioned on text) in the same latent space. We utilize the current state-of-the-art GAN (DC-GAN with the GAN-CLS loss) to learn this conditional multi-modality. The presence of more fine-grained details and the variable length of the captions make the problem easier for a user but more difficult to handle compared to other text-to-image tasks. We flip the labels for real and fake images and add noise to the discriminator. Generated images for diverse textual descriptions show promising results. Finally, we show why the widely used Inception Score is not a good metric to evaluate the performance of generative models used for synthesizing faces from text.

Keywords: Datasets, Generative Adversarial Networks, Text to Image, Facial Attributes, Face Generation

I. INTRODUCTION

Photographic text-to-face synthesis is a mainstream problem with potential applications in image editing, video games, and accessibility. The task can be addressed as learning a mapping from a semantic text space describing facial features, e.g., "Pointy nose" and "Wavy hair", to the RGB pixel space. The community has traditionally addressed faces in the context of image recognition [1], where the task is to recognize human faces from the visual descriptions of the images. Such tasks involve extracting fine-grained details, mapping them to a latent space, and learning their distribution in that latent space.

Figure 1: Example of an image generated from a caption in the "zero-shot" setting. (Input caption: "The woman has high cheekbones. She has straight hair which is black in colour. She has big lips with arched eyebrows. The smiling, young woman has rosy cheeks and heavy makeup. She is wearing lipstick.")

Recent advances in generative modelling [2] have spurred a lot of interest in the research community in generating faces by learning a mapping from a latent noise space to the pixel space. While works such as BeautyGAN [3], which demonstrates style transfer on faces, and face captioning using GANs [4] exist, the problem of face synthesis from textual descriptions remains largely unaddressed due to the following obstacles:
1) Widely used datasets such as Flickr8K [5], Flickr30K [6], VLT2K [7], and MS COCO [8] contain textual descriptions at a concrete conceptual level, broadly describing the object and the context without saying anything about the inferences that could be drawn from the images. While helpful, these captions do not contain the physical descriptions of faces, such as skin color, eyes, and hairstyle, that are necessary for generating faces.

2) Existing face datasets such as LFW [9] and MegaFace [10] lack any additional description, while others such as LFWA [9] and CelebA [11] have a list of attributes associated with the images. Although attributes such as "Blond hair" and "Arched eyebrows" provide fine-grained information about faces, using them requires knowledge of the domain. As a result, attributes cannot be used for general-purpose user-facing applications.

3) The conditional distribution of a face (conditioned on text) is highly multimodal, since multiple pixel arrangements can be semantically consistent with the facial features present in the text. The presence of more fine-grained details in facial descriptions than in scene descriptions makes learning the joint representation difficult in the "zero-shot" setting.

In this paper, we address the aforementioned problem as learning the joint distribution of images in the pixel space and text mapped to a latent encoding space. Natural language provides a generic interface to represent information on facial features. Hence, captions describing faces combine the discriminative power of attributes with the generality of natural language.

As a solution to dataset unavailability, we create captions for the CelebA [11] dataset from the attributes provided. We divide each caption into six sentences, with each sentence capturing the features specific to a certain part of the face; e.g., the first sentence captures the face outline, such as high cheekbones, while the second sentence captures the hairstyle, such as wavy hair (see Table I). Automatic generation ensures that the captions are free from the bias introduced by the subjective nature of human-generated captions. The generated captions are encoded using the Skip-Thought [12] model to better capture the facial features as well as their spatial orientation, so as to maintain consistency with the general semantics of a face ("the mouth should be above the nose").

The advent of GANs marked a major breakthrough in generative modelling, and GANs have become the mainstream solution to the problem of learning conditional multi-modality. We solve the problem of learning the joint distribution of text and images by using the generator to generate the face while conditioning both the generator and the discriminator on the encoded facial descriptions. Apart from leveraging the property of the discriminator network acting as an adaptive loss function, we explicitly provide the discriminator with the sources of error, as the discriminator has to differentiate whether the joint ⟨image, text⟩ pair is real or fake, as in the GAN-CLS [13] algorithm. During experimentation we faced the problem of the discriminator loss converging to 0 too quickly; to tackle this, we introduce noise into the discriminator by swapping the real and the fake images after every three iterations. Figure 1 shows the image generated by our model for the given caption.
We evaluate our GAN model using the widely used Inception Score, which requires around 50K generated samples. The generated samples are classified by the InceptionV3 [14] model, and the predicted classes are used to calculate the marginal distribution p(y) and the conditional distribution p(y|x) for all images x and classes y (see Equation 1).

p(y) = \int_x p(y \mid x) \, dx    (1)

Popular datasets for image synthesis from text, such as Oxford-102 Flowers [15] and Caltech-UCSD Birds [16], contain classes with high intra-class similarity and very low inter-class similarity. This property ensures that if the captions selected to generate the images (during evaluation) are uniform across classes, then the Inception Score reflects the clarity and diversity of the images. We finally show why the widely used Inception Score is not a good metric to evaluate the performance of GANs on face datasets (see Section V).

The main contributions of this paper are as follows:
1) Caption creation¹ for the CelebA dataset to facilitate face generation from textual descriptions.
2) A GAN model to synthesize faces from descriptions of fine-grained facial features.
3) An evaluation of the GAN model using the Inception Score, and a justification as to why it is not a good metric for face datasets.

¹ We will release the captions to the public for research purposes.

The rest of the paper is organised as follows. Section II discusses previous work on text-to-image conversion and style transfer on faces using GANs [2]. Section III provides the necessary background on GANs and the Inception Score needed to understand the impact of randomness in image generation on the Inception Score. Our methodology for automatically generating captions from an attribute list, together with our GAN [2] architecture, is discussed in Section IV. Section V presents the evaluation of the model and the inferences drawn from the Inception Score; it discusses how the Inception Score is affected by randomness in the generated images. Finally, Section VI concludes the paper and presents possible extensions of this work.

II. RELATED WORK

Deep learning has led to substantial progress in the field of generative image modelling with the introduction of deep generative models such as GANs [2], [17], Variational Auto-Encoders [18], and others.

Multimodal deep learning has been shown to learn related features across modalities such as text [19], audio [20], visual [21], [22], and more [23]. One natural extension of image generation is text-to-image synthesis, which requires prediction of data in one modality (image) conditioned on data in another modality (text). Reed et al. [13] tackled this problem by using a deep convolutional generative adversarial network (DC-GAN) [17] conditioned on text features encoded by a hybrid character-level convolutional recurrent neural network. With this model they were able to produce 64 × 64 images. Zhang et al. [24] proposed a two-stage training strategy to produce 256 × 256 images. More recently, Zhang et al. [25] proposed a GAN architecture with hierarchically-nested discriminators, which allows them to create 512 × 512 images. These models are usually evaluated on the Oxford-102 Flowers [15], Caltech-UCSD Birds [16], and MS-COCO [26] datasets.

Due to the absence of an objective function and the high cost of human evaluation, text-to-image synthesis models are evaluated using an automated method such as the Inception Score [27]. The Inception Score measures both the objectness and the diversity of the generated images. It requires fine-tuning the Inception model [14], pre-trained on ImageNet, on the target dataset.
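As a concrete illustration of this evaluation setup, the sketch below shows one way such a fine-tuning step could look in PyTorch: an ImageNet-pretrained Inception v3 is re-headed for identity classes and trained on face crops. The data loader, the number of identities, and the training schedule are placeholders, not details taken from the paper.

```python
# Hypothetical sketch: fine-tuning Inception v3 so it can be used as the classifier
# behind the Inception Score on a face dataset (identities as classes).
import torch
import torch.nn as nn
from torchvision import models

num_identities = 1000  # assumption: number of identity classes kept for scoring
model = models.inception_v3(pretrained=True, aux_logits=True)
model.fc = nn.Linear(model.fc.in_features, num_identities)                       # replace main head
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_identities)   # and auxiliary head

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, identity_labels in train_loader:   # train_loader: assumed DataLoader of 299x299 face crops
    optimizer.zero_grad()
    logits, aux_logits = model(images)          # in train mode inception_v3 returns main and aux logits
    loss = criterion(logits, identity_labels) + 0.4 * criterion(aux_logits, identity_labels)
    loss.backward()
    optimizer.step()
```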
A very similar, yet relatively less researched, problem is text-to-face generation, which requires generating images of faces from an input text description. This problem is difficult to solve mainly due to the absence of a paired text and face image dataset. The current dataset, Face2Text [28], consists of only 400 facial images with a textual caption for each of them. Complex models cannot be used on such a small dataset, as the generator can easily memorize the entire dataset and hence will not produce any results for unseen text descriptions (the zero-shot setting). In [29], the authors used a hybrid of StackGAN and ProGAN to generate faces from captions; however, the results they obtained are very poor (see Figure 2).

Figure 2: Existing experiments [29] on the Face2Text dataset.

To solve this problem, we leverage the CelebA dataset [30] by introducing captions. These captions are generated by segregating the attributes of the images into six sentences based on the structure of the face, facial hair, hairstyle, fine-grained face details, accessories worn, and attributes that enhance the appearance. The final caption is the concatenation of all six sentences. However, since most images have only a subset of the attributes, creating all six sentences is not always possible; hence the length of each caption can vary widely. Due to the instability of GAN training and the inconsistency of caption length, the problem of text-to-face synthesis becomes more difficult. We employ a variety of methods [31], such as maximizing log(D) instead of minimizing log(1 − D) and adding noise to the discriminator labels, to deal with the instability of GAN training. To deal with inconsistent caption length, we use Skip-Thought vectors [12].

Face synthesis has also been performed from audio input. In WAV2PIX [32] the authors generate faces from raw audio. They train their model in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos, using high-quality YouTube videos in which the speaker is expressive in both speech and visual signals. Recently, Karras et al. [33] proposed a GAN architecture that enables unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair).

Building on the ideas of these previous models, we provide state-of-the-art results for the problem of text-to-face generation. We then report the Inception Score for our model by fine-tuning the Inception model on the CelebA dataset.

III. BACKGROUND

In this section we describe the previous work our model builds on. We first describe how Generative Adversarial Networks (GANs) work. We then describe the GAN-CLS architecture, in which GANs are used for text-to-image synthesis. We further describe Skip-Thought vectors and how they are useful for our problem. Finally, we describe the Inception Score and how it is computed.

A. Generative Adversarial Networks

A Generative Adversarial Network (GAN) is a framework for learning a function, or program, that can generate samples very similar to samples drawn from a given training distribution. It consists of a generator G and a discriminator D that compete in a minimax game [2]. D tries to distinguish between real training data and synthetic data, while G tries to fool D. This minimax game is given by Equations 2 and 3:

J^{(D)} = -\frac{1}{2} E_{x \sim p_{data}} \log D(x) - \frac{1}{2} E_{z} \log(1 - D(G(z)))    (2)

J^{(G)} = -J^{(D)}    (3)

where J^{(D)} is the discriminator cost, J^{(G)} is the generator cost, and p_data is the probability distribution of the given data. Goodfellow et al. [2] proved that the Nash equilibrium of this game is reached when the samples produced by G are indistinguishable from samples coming from the training data (provided G and D have enough capacity).
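For illustration only, the two costs of Equations 2 and 3 can be written out as follows; this is a minimal PyTorch-style sketch in which D and G stand for any discriminator and generator modules, and the generator helper uses the common non-saturating log D heuristic rather than the literal zero-sum form.

```python
# Illustrative sketch of the costs in Equations 2 and 3; not the paper's code.
# D maps an image batch to probabilities in (0, 1); G maps noise vectors to images.
import torch

def discriminator_cost(D, G, real_images, z, eps=1e-8):
    # J(D) = -1/2 E_x[log D(x)] - 1/2 E_z[log(1 - D(G(z)))]
    d_real = D(real_images)
    d_fake = D(G(z).detach())          # detach so this step does not update G
    return (-0.5 * torch.log(d_real + eps).mean()
            - 0.5 * torch.log(1.0 - d_fake + eps).mean())

def generator_cost(D, G, z, eps=1e-8):
    # Zero-sum form: J(G) = -J(D).  In practice (and later in this paper) the
    # non-saturating variant -E_z[log D(G(z))] is minimized instead.
    return -torch.log(D(G(z)) + eps).mean()
```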
B. Matching-aware Discriminator (GAN-CLS)

Text-to-image synthesis can easily be modelled using conditional GANs by treating the text and image pairs as joint observations; the discriminator then has to judge the pairs as real or fake. In a vanilla conditional GAN, the discriminator must discriminate between real images with matching text and synthetic images with arbitrary text. Therefore, it must implicitly learn to distinguish synthetic images from realistic images with incorrect captions. To tackle this problem, Reed et al. [13] modified the discriminator by adding a third input consisting of real images with mismatched text (see Equations 4, 5, and 6):

J^{(D)} = J^{(D)}_{adv} - E_{x \sim p_{data}} \log(1 - D(x, \varphi(\hat{t})))    (4)

J^{(D)}_{adv} = -E_{x \sim p_{data}} \log D(x, \varphi(t)) - E_{z} \log(1 - D(G(z, \varphi(t))))    (5)

J^{(G)} = -J^{(D)}    (6)

where J^{(D)} is the discriminator cost, J^{(G)} is the generator cost, p_data is the probability distribution of the given data, φ(t) is the text embedding corresponding to a given image, and φ(t̂) is the text embedding corresponding to a different image.

C. Skip-Thought Vectors

We need to encode the given input text to learn the mapping between the text and the face image. We use Skip-Thought vectors [12] to encode the input text into a 4800-dimensional vector using the pretrained model provided by the authors. Skip-Thought vectors are produced by training an encoder-decoder model: the encoder maps a sentence to a vector, and the decoder generates the surrounding sentences from that vector. Kiros et al. used an RNN encoder with GRU activations and an RNN decoder with a conditional GRU. These vectors obtain very good results on the image retrieval task of MS COCO (retrieving images that are a good fit to a given query sentence).

D. Inception Score

The lack of an objective function makes it difficult to evaluate and compare Generative Adversarial Networks. The primary evaluation method is human annotation of the generated images; however, depending on the motivation of the annotators and the task setup, such human evaluation can be subjective. To overcome this, Salimans et al. [27] proposed the Inception Score metric to automatically evaluate the performance of GANs [2]. It uses the Inception model [14] to calculate the conditional distribution p(y|x) and the marginal distribution p(y), as shown in Equation 7:

p(y) = \int_x p(y \mid x) \, dx    (7)

The final Inception Score is calculated from the KL divergence between these distributions (see Equation 8):

E_x \left[ KL\big( p(y \mid x) \,\|\, p(y) \big) \right]    (8)

where x is the random variable for images and y for classes. The conditional distribution p(y|x) captures the clarity of the generated images, and the marginal distribution p(y) captures the diversity of the GAN model. A higher Inception Score corresponds to a skewed p(y|x), since the Inception model [14] then predicts the class of a given image with high confidence. Moreover, the marginal distribution p(y) should be uniform, reflecting that the GAN model is not biased towards any particular class. For a good model, p(y|x) should therefore have low entropy while p(y) should have high entropy.
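A minimal sketch of how this score is usually computed from classifier outputs is shown below, assuming a matrix of softmax predictions from the fine-tuned Inception model for the generated samples; the helper names and shapes are illustrative, not released code.

```python
# Sketch of the score defined by Equations 7 and 8, following Salimans et al. [27].
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: array of shape (num_generated_images, num_classes); row i is p(y | x_i)
    p_y_given_x = np.clip(probs, eps, 1.0)
    p_y = p_y_given_x.mean(axis=0, keepdims=True)       # Equation 7: marginal over generated images
    kl = np.sum(p_y_given_x * (np.log(p_y_given_x) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))                     # exponentiated mean KL of Equation 8

# probs = classifier_softmax(generated_images)          # ~50K samples in the usual protocol
# score = inception_score(probs)
```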
IV. METHODOLOGY

In this section, we describe our algorithm for automatic caption generation along with our modelling of text-to-face synthesis as learning the conditional distribution of faces (conditioned on text). We begin by providing the algorithm and a justification of why it captures all the features of an image in meaningful and versatile captions. We then explain why mapping text to faces is unsupervised learning of a conditional representation and how conditional multimodality comes into the picture. Finally, we show how to model it using GANs [2] and describe our modifications for preventing the discriminator from converging too quickly. Figure 3 shows the architecture of our text conditional-convolutional GAN, which is conditioned on captions.

Figure 3: Our text conditional-convolutional GAN architecture conditioned on captions. The noise z ~ U(0, 1) is concatenated with the caption embedding φ(t) and fed to the generator; the discriminator scores D(G(z), φ(t)), D(x, φ(t)), and D(x', φ(t)) for the generated, real, and wrong 64×64 images. The real and fake images are swapped after every third iteration.

A. Caption Generation

To convert the attribute list provided for the images in the CelebA [11] dataset into meaningful captions, we create six groups of features in response to six questions which progressively describe the face, starting from the face outline and ending with the facial features that enhance the appearance (see Table I). Apart from these sets of attributes, we use words describing the gender of the celebrity, e.g., "she", "he", and so on.

Table I: Questions and the corresponding sets of attributes used as responses.

  What is the structure of the face? | Chubby face, Double chin, Oval face, High cheekbones
  What facial hairstyle does the person sport? | 5 o'clock shadow, Goatee, Mustache, Sideburns
  What hairstyle does the person sport? | Bald, Straight hair, Black hair, Blond hair, Brown hair, Gray hair, Bangs, Wavy hair, Receding hairline
  What is the description of the other facial features? | Big lips, Big nose, Pointy nose, Narrow eyes, Arched eyebrows, Bushy eyebrows, Mouth slightly open
  What are the attributes that enhance the appearance? | Young, Attractive, Smiling, Pale skin, Rosy cheeks, Heavy makeup
  What are the accessories worn? | Earrings, Hat, Necklace, Necktie, Eyeglasses, Lipstick

The questions are ordered to assist the generator in GANs [2] in building the face: it first learns to create the face outline, then adds hair in the specified hairstyle, followed by the eyes, nose, etc., then enhances the appearance with features like "young" and "attractive", and finally adds the accessories specified in the captions.

We maintain a dictionary with attributes as keys and, as values, the sets of words that replace them in a sentence, e.g., "Mouth Slightly Open": "slightly open mouth". To create a sentence from a given set of attributes, we use a queue. We first add the start of the sentence to the queue (e.g., "He sports a"). Then we add the corresponding value for the first feature to the queue (e.g., "5 o'clock shadow"). For every subsequent attribute we add a conjunction or punctuation to the queue before the attribute, provided there is already an attribute at the end of the queue; otherwise we add the next attribute directly (see Algorithm 1).

Suppose the list of attributes has "goatee" and "mustache" as the features describing facial hair. The queue initially contains "He sports a" (note that the back of the queue holds "a", which is not an attribute). We add the first feature, i.e., goatee, directly; the queue is now "He sports a goatee". The next feature is mustache. Since the back of the queue now holds an attribute, we add a conjunction (i.e., "and") to the queue before adding mustache. The final queue is therefore "He sports a goatee and mustache".

Our algorithm has O(nl) running time, where n is the number of images and l is the length of the attribute list. For the CelebA dataset [11], l = 40, hence the running time becomes O(n), i.e., linear in n.

Algorithm 1: Caption creation for facial hair attributes
  procedure FACIALHAIRCAPTION(isPresent)
      Q ← {He, sports, a}                                   ▷ Q is a queue
      L ← {5 o'clock shadow, Goatee, Mustache, Sideburns}
      conjunction[Goatee] ← ","
      conjunction[Mustache] ← "and"
      conjunction[Sideburns] ← "with"
      for all l ∈ L do
          if isPresent(l) then
              if Q.back() = "a" then
                  if l ≠ Sideburns then
                      Q.push(l)
                  else
                      Q.clear()
                      Q ← {He, has, sideburns}
              else
                  Q.push(conjunction[l])
                  Q.push(l)
      return Q
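For concreteness, the facial-hair branch of Algorithm 1 can be written in Python roughly as follows; the attribute spellings follow the CelebA list, but the helper itself is an illustrative sketch rather than the exact implementation used to build the dataset.

```python
# Python sketch of the facial-hair branch of Algorithm 1 (illustrative only).
def facial_hair_caption(is_present):
    # is_present: dict mapping attribute name -> bool (taken from the CelebA attribute vector)
    attributes = ["5 o'clock shadow", "goatee", "mustache", "sideburns"]
    conjunction = {"goatee": ",", "mustache": "and", "sideburns": "with"}
    queue = ["He", "sports", "a"]
    for attr in attributes:
        if not is_present.get(attr, False):
            continue
        if queue[-1] == "a":                       # no attribute has been added yet
            if attr != "sideburns":
                queue.append(attr)
            else:                                   # "a sideburns" is ungrammatical
                queue = ["He", "has", "sideburns"]
        else:                                       # an attribute already sits at the back
            queue.append(conjunction[attr])
            queue.append(attr)
    return " ".join(queue)                          # punctuation spacing kept simple for clarity

# facial_hair_caption({"goatee": True, "mustache": True})  -> "He sports a goatee and mustache"
```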
B. Network Architecture

The generator network is represented as G : R^Z × R^T → R^I and the discriminator as D : R^I × R^T → (0, 1), where Z is the dimension of the noise vector input to the generator, T is the dimension of the skip-thought embedding of the caption, and I is the dimension of the generated image. We sample the input noise z ∈ R^Z ∼ U(0, 1) with dimension 100 and encode the text caption t with the skip-thought encoder φ(t) (we use 4800 as the dimension of the encoding). We reduce the dimension of the text encoding φ(t) to 256 using fully connected layers followed by a leaky ReLU activation. We then concatenate the reduced encoding φ(t) with the noise z to form a vector θ of length 356, which is the input to the generator.

The generator is a deconvolutional network with a projection operation, four deconvolutional layers, and finally a tanh layer. The deconvolutional layers are followed by batch normalization and leaky ReLU activations. The generator first projects θ to a vector θ_proj of dimension 8192 (see Equation 9):

\theta_{proj} = W^T \theta + B    (9)

where W is the projection matrix of dimension 356 × 8192 and B is the bias. θ_proj is then reshaped into a tensor of height 4, width 4, and 512 channels. The deconvolutional layers then successively halve the number of channels and double the height and width. The last deconvolutional layer, followed by tanh, converts the resulting 32 × 32 × 64 tensor into a 64 × 64 × 3 RGB image.

The discriminator is a convolutional network with four convolutional layers of stride 2, a dimension expansion (after the fourth convolutional layer), and finally a sigmoid layer. The convolutional layers are followed by batch normalization and leaky ReLU activations. The first convolutional layer converts an RGB image of dimension 64 × 64 × 3 into a tensor of height 32, width 32, and 64 channels. The next three convolutional layers progressively halve the height and width and double the number of channels, giving a tensor γ of dimension 4 × 4 × 512. The reduced text encoding φ(t) is then expanded to 4 × 4 × 256 and concatenated with γ along the channel dimension, and the result is convolved over by the final convolutional layer. The output of the final convolutional layer is passed to a sigmoid layer to produce a confidence score in (0, 1).
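The architecture just described can be summarized in a compact PyTorch-style sketch. The stated dimensions (z of size 100, text embedding reduced to 256, a 356-dimensional generator input, a 4 × 4 × 512 projection, and a 64 × 64 × 3 output) follow the text above; kernel sizes, padding, and the exact placement of normalization are assumptions rather than details taken from the paper.

```python
# Assumed PyTorch sketch of the conditional generator and discriminator described above.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, text_dim=4800, reduced_dim=256):
        super().__init__()
        self.reduce_text = nn.Sequential(nn.Linear(text_dim, reduced_dim), nn.LeakyReLU(0.2))
        self.project = nn.Linear(z_dim + reduced_dim, 4 * 4 * 512)          # Equation 9
        def block(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))
        self.deconv = nn.Sequential(
            block(512, 256), block(256, 128), block(128, 64),                # 4x4 -> 32x32
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),   # 32x32 -> 64x64
            nn.Tanh())

    def forward(self, z, text_embedding):
        theta = torch.cat([z, self.reduce_text(text_embedding)], dim=1)      # length 356
        x = self.project(theta).view(-1, 512, 4, 4)
        return self.deconv(x)

class Discriminator(nn.Module):
    def __init__(self, text_dim=4800, reduced_dim=256):
        super().__init__()
        self.reduce_text = nn.Sequential(nn.Linear(text_dim, reduced_dim), nn.LeakyReLU(0.2))
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))
        self.conv = nn.Sequential(block(3, 64), block(64, 128),
                                  block(128, 256), block(256, 512))          # 64x64 -> 4x4
        self.judge = nn.Sequential(nn.Conv2d(512 + reduced_dim, 1, kernel_size=4), nn.Sigmoid())

    def forward(self, image, text_embedding):
        gamma = self.conv(image)                                             # (N, 512, 4, 4)
        t = self.reduce_text(text_embedding).view(-1, 256, 1, 1).expand(-1, -1, 4, 4)
        return self.judge(torch.cat([gamma, t], dim=1)).view(-1)             # confidence in (0, 1)
```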
GANs [2] suffer from the problem of the discriminator converging faster than the generator, which leaves the generator unable to learn. For conditional GANs this becomes even more difficult, as the generator has to generate images in the pixel space while maintaining semantic similarity in the text space. When the discriminator learns faster than the generator, D(x) ≈ 1 and D(G(z, φ(t))) ≈ 0. Equations 10 and 11 show how the log losses converge to 0:

\log(D(x)) \approx 0    (10)

\log(1 - D(G(z, \varphi(t)))) \approx 0    (11)

Hence, in Equation 2, J^{(D)} ≈ 0 and the generator cannot learn anything from that point on. To tackle this, we swap the real and the generated images for the discriminator after every three iterations. This fools the discriminator into believing that generated images are real, slowing down its learning and providing essential time for the generator to catch up to the discriminator.
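Putting the pieces together, a single training step that combines the GAN-CLS objective of Section III-B with the swap schedule described above can be sketched as follows. The generator and discriminator are assumed to be instances of modules like those sketched earlier, the data loader is assumed to yield matching and mismatched caption embeddings, and the optimizer settings follow Section V; label noise and the other training tricks mentioned earlier are omitted for brevity.

```python
# Assumed training-step sketch: GAN-CLS losses plus the real/fake swap every third iteration.
import torch

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.5))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.5))
bce = torch.nn.BCELoss()

for iteration, (real_images, text_emb, mismatched_emb) in enumerate(loader):   # batch size 64
    z = torch.rand(real_images.size(0), 100)            # z ~ U(0, 1), as in Section IV-B
    fake_images = generator(z, text_emb)

    # Swap real and generated images for the discriminator every third iteration
    # to slow its convergence (Section IV-B).
    if (iteration + 1) % 3 == 0:
        d_real_input, d_fake_input = fake_images.detach(), real_images
    else:
        d_real_input, d_fake_input = real_images, fake_images.detach()

    ones = torch.ones(real_images.size(0))
    zeros = torch.zeros(real_images.size(0))

    # GAN-CLS discriminator cost (Equations 4 and 5): real image + matching text,
    # fake image + matching text, and real image + mismatched caption as the extra error source.
    d_loss = (bce(discriminator(d_real_input, text_emb), ones)
              + bce(discriminator(d_fake_input, text_emb), zeros)
              + bce(discriminator(real_images, mismatched_emb), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Non-saturating generator cost: maximize log D(G(z), phi(t)).
    g_loss = bce(discriminator(fake_images, text_emb), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```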
V. EVALUATION AND RESULTS

We ran our model for 200 epochs on 10,000 randomly selected images from the CelebA [11] dataset with our created captions. The training set consists of 7,500 images and the test set of 2,500 images. We trained with a batch size of 64. The learning rate was set to 0.0002 for the generator and 0.0001 for the discriminator, and we used Adam [34] with β1 = 0.5 and β2 = 0.5 for both networks. We used the Inception Score [27] to evaluate the performance of our model and also present the generated images for visual inspection (see Figure 4). The identities of the celebrities were used as the classes.

We kept the number of captions from every class uniform to ensure that the generated images are not biased towards a specific class. A non-uniform distribution of captions over classes could lead to more images being generated for the classes with more captions, making the class distribution (conditioned on the generated images) skewed and the Inception Score poor. No conclusion could be drawn from such a result, since the same model could also produce a uniform class distribution (conditioned on the generated images) and hence a good Inception Score.

A. Results and Inferences

Our model gave an Inception Score of 1.4 ± 0.7 over 5 evaluation runs. The images generated by our model show promising results. Our model does not suffer from mode collapse, which can be observed in the last two images of Figure 4: they are significantly different even though they have very similar captions. The high variance of 0.7 suggests randomness in the marginal distribution as computed by Equation 12:

p(y) = \int_z p(y \mid x = G(z)) \, dz    (12)

In some runs the predicted classes are uniformly distributed, while in others they are highly skewed. The low Inception Score shows that the marginal distribution p(y) has high entropy and is very similar to p(y|x) for all images x, classes y, and text encodings z.

Popular datasets such as Oxford-102 Flowers [15] and Caltech-UCSD Birds [16] have classes for which the captions exhibit high intra-class similarity and very low inter-class similarity. A description of a lily, e.g., "This flower is white and pink in color, with petals that have veins", is clearly semantically dissimilar from that of a sunflower, e.g., "The flower has yellow petals and the center of it is brown". For these datasets, if the captions used when computing the Inception Score are uniformly distributed over classes and the model is good, then the generated images are classified with high confidence and a uniform class distribution. Zhang et al. [24] report Inception Scores of 2.88 ± 0.04 for Oxford-102 Flowers [15] and 2.66 ± 0.06 for Caltech-UCSD Birds [16].

Figure 4: Qualitative results for visual inspection. Each generated image is shown with its fine-grained input caption (e.g., "The man has oval face and high cheekbones. He has wavy hair which is brown in colour. ..."). The images contain selected features and are generated in the "zero-shot" setting, i.e., from unseen text.
Figure 5: Similarity in facial features for celebrities with different identities. The two images shown have nearly identical captions ("The woman has oval face and high cheekbones. She has straight hair which is brown in colour. ... The smiling, young attractive woman has heavy makeup. She is wearing earrings and lipstick."), differing only in a couple of features such as "big lips and narrow eyes", yet they belong to two different celebrities.

A person's identity, or any other class based on attributes, is a very poor choice for classifying the images, because the captions have high inter-class similarity (similar facial features are very likely to occur across classes), as shown in Figure 5. In this figure, both captions are almost identical, yet they belong to two different celebrities. As a result, when conditioned on a caption t, the model may generate a semantically similar face G(φ(t)) belonging to any of the classes whose captions capture facial features similar to those in the query caption. This randomness can result in many generated images for a few classes and very few for others. As discussed above, even a good Inception Score in some run of the experiment cannot be used to infer that the GAN [2] performs better in terms of producing high-quality images semantically consistent with the query captions. This argument is strengthened by the fact that the generated images are of good quality and semantically similar to the textual descriptions.

VI. CONCLUSION AND FUTURE WORK

In this work we presented captions for the CelebA dataset to facilitate face synthesis from text. We then used a Generative Adversarial Network to learn the conditional multimodality of synthesizing faces from captions. Finally, we demonstrated why the Inception Score, commonly used to measure the performance of GANs [2], fails to evaluate their performance on our dataset. We plan to extend this work in the following directions:

1) Improve the selection of the wrong image for the GAN-CLS [13] algorithm. Currently, we randomly select images from the dataset as the wrong image. One possibility is to select a wrong caption for the real image rather than a wrong image; this could be done by selecting the caption with the lowest cosine similarity to the caption of the real image.

2) Explore better language models such as BERT, and analyze and compare the performance of other GAN architectures against our model for face generation from captions.

3) Propose a better evaluation metric that captures the semantic similarity of the generated faces with their captions, without using classes.

4) Improve the resolution of the generated faces, e.g., to 128 × 128 and 256 × 256.

VII. ACKNOWLEDGEMENT

Rajiv Ratn Shah is partly supported by the Infosys Center of AI, IIIT Delhi and an ECRA Grant by SERB, Govt. of India. This work was partially supported by JSPS Grant-in-Aid for Scientific Research (C) under Grant No. 19K11987.

REFERENCES

[1] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[3] Tingting Li, Ruihe Qian, Chao Dong, Si Liu, Qiong Yan, Wenwu Zhu, and Liang Lin, "BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network," in 2018 ACM Multimedia Conference. ACM, 2018, pp. 645–653.

[4] Omid Mohamad Nezami, Mark Dras, Peter Anderson, and Len Hamey, "Face-Cap: Image captioning using facial expression analysis," CoRR, vol. abs/1807.02250, 2018.

[5] Micah Hodosh, Peter Young, and Julia Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.

[6] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.

[7] Desmond Elliott and Frank Keller, "Image description using visual dependency representations," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1292–1302.

[8] T. Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," arXiv preprint arXiv:1405.0312, 2014.

[9] Gary B. Huang, Marwan Mattar, Honglak Lee, and Erik Learned-Miller, "Learning to align from scratch," in NIPS, 2012.

[10] Ira Kemelmacher-Shlizerman, Steven M. Seitz, Daniel Miller, and Evan Brossard, "The MegaFace benchmark: 1 million faces for recognition at scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4873–4882.

[11] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, "Deep learning face attributes in the wild," 2015.

[12] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, "Skip-thought vectors," in Advances in Neural Information Processing Systems, 2015, pp. 3294–3302.

[13] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee, "Generative adversarial text to image synthesis," arXiv preprint arXiv:1605.05396, 2016.

[14] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.

[15] Maria-Elena Nilsback and Andrew Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.

[16] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona, "Caltech-UCSD Birds 200," 2010.

[17] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint, 2015.

[18] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.

[19] Rajiv Ratn Shah, Yi Yu, Akshay Verma, Suhua Tang, Anwar Dilawar Shaikh, and Roger Zimmermann, "Leveraging multimodal information for event summarization and concept-level sentiment analysis," Knowledge-Based Systems, vol. 108, pp. 102–109, 2016, New Avenues in Knowledge Bases for Natural Language Processing.
[20] Yi Yu, Suhua Tang, Francisco Raposo, and Lei Chen, "Deep cross-modal correlation learning for audio and lyrics in music retrieval," ACM Trans. Multimedia Comput. Commun. Appl., vol. 15, no. 1, pp. 20:1–20:16, Feb. 2019.

[21] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, April 2019.

[22] Rajiv Ratn Shah, Yi Yu, and Roger Zimmermann, "Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings," in Proceedings of the 22nd ACM International Conference on Multimedia, New York, NY, USA, 2014, MM '14, pp. 607–616, ACM.

[23] Rajiv Shah and Roger Zimmermann, Multimodal Analysis of User-Generated Multimedia Content, Springer International Publishing, 2017.

[24] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.

[25] Zizhao Zhang, Yuanpu Xie, and Lin Yang, "Photographic text-to-image synthesis with a hierarchically-nested adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6199–6208.

[26] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.

[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," pp. 2234–2242, 2016.

[28] Albert Gatt, Marc Tanti, Adrian Muscat, Patrizia Paggio, Reuben A. Farrugia, Claudia Borg, Kenneth P. Camilleri, Mike Rosner, and Lonneke van der Plas, "Face2Text: Collecting an annotated image description corpus for the generation of rich face descriptions," arXiv preprint, 2018.

[29] akanimax, "T2F: Text to face generation using deep learning," https://github.com/akanimax/T2F, 2019, Accessed: 2019-04-09.

[30] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang, "From facial parts responses to face detection: A deep learning approach," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.

[31] Soumith Chintala, Emily Denton, Martin Arjovsky, and Michael Mathieu, "How to train a GAN? Tips and tricks to make GANs work," 2016.

[32] Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, and Xavier Giro-i-Nieto, "Wav2Pix: Speech-conditioned face generation using generative adversarial networks," 2019.

[33] Tero Karras, Samuli Laine, and Timo Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.

[34] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," 2015.
