Fashion Editing with Adversarial Parsing Learning

Haoye Dong¹, Xiaodan Liang¹, Yixuan Zhang², Xujie Zhang¹, Zhenyu Xie¹, Bowen Wu¹, Ziqi Zhang¹, Xiaohui Shen³, Jian Yin¹
¹Sun Yat-sen University, ²Petuum Inc, ³ByteDance AI Lab
donghy7@mail2.sysu.edu.cn, xdliang328@gmail.com

Abstract

Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to restore the incomplete image, which tends to wash away the sketch and color information. In this paper, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which is capable of manipulating fashion images with free-form sketches and sparse color strokes. FE-GAN consists of two modules: 1) a free-form parsing network that learns to control human parsing generation by manipulating sketch and color; 2) a parsing-aware inpainting network that renders detailed textures with semantic guidance from the human parsing map. A new attention normalization layer is further applied at multiple scales in the decoder of the inpainting network to enhance the quality of the synthesized image. Extensive experiments on high-resolution fashion image datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods on image manipulation.

1 Introduction

Fashion image manipulation aims to generate high-resolution realistic fashion images from user-provided sketches and color strokes. It has huge potential value in various applications.
For example, a fashion designer can easily edit clothing designs with different styles; filmmakers can design characters by controlling the facial expression, hairstyle, and body shape of an actor or actress. In this paper, we propose FE-GAN, a fashion image manipulation network that enables flexible and efficient user interactions such as simple sketches and a few sparse color strokes. Some interactive manipulation results of FE-GAN are shown in Figure 1, which indicates that it can generate realistic images with convincing and desired details by controlling the sketch and color strokes.

In general, image manipulation has made great progress due to the significant improvement of neural network techniques [2, 6, 7, 14, 17, 21, 35]. However, previous methods often treat it as an end-to-end one-stage image completion problem without flexible user interactions [12, 16, 19, 20, 25, 32, 33]. Those methods usually do not explicitly estimate and then leverage the semantic structural information in the image. Furthermore, they rely excessively on conventional convolutional layers and batch normalization, which significantly dissolve the sketch and color information from the input during propagation. As a result, the generated images usually contain unrealistic artifacts and undesired textures.

To address the above challenges, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which consists of a free-form parsing network and a parsing-aware inpainting network with multi-scale attention normalization layers. Different from previous methods, we do not directly generate the complete image in one stage. Instead, we first generate a complete parsing map from the incomplete inputs, and then render detailed textures on the layout induced from the generated

Preprint. Under review.

Figure 1: Some interactive results of our FE-GAN (columns: Original, Input, Output).
The input contains a free-form mask, sketch, and sparse color strokes. Zoom in for details.

parsing map. Specifically, in the training stage, given an incomplete parsing map obtained from the image, a sketch, sparse color strokes, a binary mask, and noise sampled from a Gaussian distribution, the free-form parsing network learns to reconstruct a complete human parsing map guided by the sketch and color. A parsing-aware inpainting network then takes the generated parsing map, the incomplete image, and composed masks as the input of its encoders, and synthesizes the final edited image. To better capture the sketch and color information, we design an attention normalization layer, which learns an attention map to select more effective features conditioned on the sketch and color. The attention normalization layer is inserted at multiple scales in the decoder of the inpainting network. Moreover, we develop a foreground-based partial convolutional encoder for the inpainting network that is conditioned only on the valid pixels of the foreground, to enable more accurate and efficient feature encoding from the image.

We conduct experiments on our newly collected fashion dataset, named FashionE, and two challenging datasets: DeepFashion [36] and MPV [4]. The results demonstrate that incorporating the multi-scale attention normalization layers and the free-form parsing network helps our FE-GAN significantly outperform the state-of-the-art methods on image manipulation, both qualitatively and quantitatively.

The main contributions are summarized as follows: 1) We propose a free-form parsing network that enables users to control parsing generation flexibly by manipulating the sketch and color. 2) We develop a new attention normalization layer for extracting features effectively based on a learned attention map.
3) We design a parsing-aware inpainting network with foreground-aware partial convolutional layers and multi-scale attention normalization layers, which can generate high-resolution realistic edited fashion images.

2 Related Work

Image Manipulation. Image manipulation with Generative Adversarial Networks (GANs) [6] is a popular topic in computer vision, which includes image translation, image completion, image editing, etc. Based on conditional GANs [18], Pix2Pix [11] was proposed for image-to-image translation. Targeting high-resolution photo-realistic image synthesis, Pix2PixHD [27] introduced a novel framework with coarse-to-fine generators and multi-scale discriminators. [22, 33] designed frameworks to restore low-resolution images with a regular (square) mask; they generate artifacts when facing free-form masks and do not allow image editing. To make up for these deficiencies, Deepfillv2 [12] utilizes a user's sketch as input and introduces a free-form mask to replace the regular mask. On top of Deepfillv2, Xiong et al. [30] further investigated a foreground-aware image inpainting approach that explicitly disentangles structure inference and content completion. Faceshop [25] is a face editing system that takes sketch and color as input. However, its synthesized images have blurry edges in the restored region, and it produces undesirable results if too much area is erased. Recently, another face editing system, SC-FEGAN [32], was proposed, which generates high-quality images when users provide free-form input. However, SC-FEGAN is designed for face editing. In this paper, we propose a novel fashion editing system conditioned on the sketch and sparse color, utilizing features contained in the parsing map, which are usually ignored by previous methods.
Besides, we introduce a novel multi-scale attention normalization to extract more significant features conditioned on the sketch and color.

Normalization Layers. Normalization layers have become an indispensable component in modern deep neural networks. Batch Normalization (BN), used in the Inception-v2 network [9], makes the training of deep neural networks easier. Other popular normalization layers, including Instance Normalization (IN) [3], Layer Normalization (LN) [13], Weight Normalization (WN) [24], and Group Normalization (GN) [34], are classified as unconditional normalization layers because no external data is utilized during normalization. In contrast, conditional normalization layers require external data. Specifically, layer activations are first normalized to zero mean and unit deviation. Then a learned affine transformation, inferred from the external data, is utilized to denormalize the normalized activations. The affine transformations vary among different tasks. For style transfer tasks [26, 31], the affine parameters are spatially-invariant since they only control the global style of the output images. For semantic image synthesis tasks, SPADE [23] applies a spatially-varying affine transformation to preserve the semantic information. In this paper, we propose a novel normalization technique named attention normalization. Instead of learning the affine transformation directly, attention normalization learns an attention map to extract significant information from the normalized activations. Moreover, compared to the SPADE ResBlk in SPADE [23], attention normalization has a more compact structure and occupies fewer computational resources.

3 Fashion Editing

We propose a novel method for editing fashion images, allowing users to edit images with a few sketches and sparse color strokes on a region of interest.
The overview of our FE-GAN is shown in Figure 2. The main components of FE-GAN are a free-form parsing network and a parsing-aware inpainting network with multi-scale attention normalization layers. We first discuss the free-form parsing network in Section 3.1. It can manipulate human parsing guided by free-form sketch and color, and is crucial for helping the parsing-aware inpainting network produce convincing interactive results, as described in Section 3.2. Then, in Section 3.3, we describe the attention normalization layers inserted at multiple scales in the inpainting decoder, which can selectively extract effective features and enhance visual quality. Finally, in Section 3.4, we give a detailed description of the learning objective function used in our FE-GAN.

3.1 Free-form Parsing Network

Compared to directly restoring an incomplete image, predicting a parsing map from an incomplete parsing map is more feasible since there are fewer details in the parsing map. Meanwhile, the semantic information in the parsing map can guide the precise rendering of detailed textures in each part of an image. To this end, we propose a free-form parsing network to synthesize a complete parsing map given an incomplete parsing map and arbitrary sketch and color strokes. The architecture of the free-form parsing network is illustrated in the upper left part of Figure 2. It is based on an encoder-decoder architecture like U-net [21]. The encoder receives five inputs: an incomplete parsing map, a binary sketch that describes the structure of the removed region, a noise map sampled from a Gaussian distribution, sparse color strokes, and a mask. More details about the input data are discussed in Section 4.2.
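The five encoder inputs above can be assembled by depth (channel-wise) concatenation. A minimal NumPy sketch follows; the channel counts (a 20-class one-hot parsing map, single-channel sketch, noise, and mask, RGB color strokes) are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

H, W = 512, 320  # image size used in the paper (width 320, height 512)

# Illustrative inputs (channel-last); channel counts are assumptions.
parsing = np.zeros((H, W, 20))      # one-hot human parsing map
sketch = np.zeros((H, W, 1))        # binary free-form sketch
noise = np.random.randn(H, W, 1)    # noise map sampled from a Gaussian
color = np.zeros((H, W, 3))         # sparse color strokes (RGB)
mask = np.ones((H, W, 1))           # binary mask (1 = removed region)

# The incomplete parsing map is the full map with the masked region zeroed out.
parsing_incomplete = parsing * (1 - mask)

# Depth concatenation forms the encoder input tensor.
encoder_input = np.concatenate(
    [parsing_incomplete, sketch, noise, color, mask], axis=-1)
print(encoder_input.shape)  # (512, 320, 26)
```

The actual network consumes a tensor of this kind through its convolutional encoder; only the stacking of the five inputs is shown here.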
It is worth noting that given the same incomplete parsing map and various sketch and color strokes, the free-form parsing network can synthesize different parsing maps, which indicates that our parsing generation model is controllable. This is significant for our fashion editing system, since different parsing maps guide the rendering of different contents in the edited image.

3.2 Parsing-aware Inpainting Network

The architecture of the parsing-aware inpainting network is illustrated at the bottom of Figure 2. Inspired by [16], we introduce a partial convolution encoder to extract features from the valid region of incomplete images. Our partial convolution encoder differs slightly from the original version: instead of using the mask directly, we utilize a composed mask to make the network focus only on the foreground region. The composed mask can be expressed as:

M′ = (1 − M) ⊙ M_f,   (1)

Figure 2: The overview of our FE-GAN. We first feed the incomplete human parsing, sketch, noise, color, and mask into the free-form parsing network to obtain the complete synthesized parsing.
Then, the incomplete image, composed mask, and synthesized parsing are fed into the parsing-aware inpainting network for manipulating the image with the sketch and color.

where M′, M, and M_f are the composed mask, original mask, and foreground mask, respectively, and ⊙ denotes element-wise multiplication. Besides the partial convolution encoder, we introduce a standard convolution encoder to extract semantic features from the synthesized parsing map. The human parsing map carries semantic and location information that guides the inpainting, since the content within a region of the same semantics should be similar. Given the semantic features, the network can render textures on a particular region more precisely. The two encoded feature maps are concatenated channel-wise, and the concatenated feature map then passes through several dilated residual blocks. During the upsampling process, well-designed multi-scale attention normalization layers are introduced to obtain attention maps conditioned on the sketch and color strokes. Unlike SC-FEGAN, the learned attention maps help select more effective features in the forward activations. We explain the details in the next section.

3.3 Attention Normalization Layers

Attention Normalization Layers (ANLs) are similar to SPADE [23] to some extent and can be regarded as a variant of conditional normalization. However, instead of inferring an affine transformation from external data directly, ANLs learn an attention map which is used to extract the significant information in the normalized activations. The upper right part of Figure 2 illustrates the design of ANLs, detailed below. Let x^i denote the activations of layer i in the deep neural network, N the number of samples in one batch, C^i the number of channels of x^i, and H^i and W^i the height and width of the activation map in layer i, respectively.
When the activations x^i pass through an ANL, they are first normalized in a channel-wise manner. The normalized activations are then modulated by the learned attention map and bias. Finally, the modulated activations pass through a rectified linear unit (ReLU) and a convolution layer, and are concatenated with the original normalized activations. The activation value before the final concatenation at position (n ∈ N, c ∈ C^i, h ∈ H^i, w ∈ W^i) is given by:

f( α^i_{c,h,w}(d) · (x^i_{n,c,h,w} − μ^i_c) / σ^i_c + β^i_{c,h,w}(d) ),   (2)

where f(·) denotes the ReLU and convolution operations, x^i_{n,c,h,w} is the activation value at a particular position before normalization, and μ^i_c and σ^i_c are the mean and standard deviation of the activations in channel c. As in BN [9], we formulate them as:

μ^i_c = (1 / (N H^i W^i)) Σ_{n,h,w} x^i_{n,c,h,w},   (3)

σ^i_c = sqrt( (1 / (N H^i W^i)) Σ_{n,h,w} (x^i_{n,c,h,w})² − (μ^i_c)² ).   (4)

α^i_{c,h,w}(d) and β^i_{c,h,w}(d) are the learned attention map and bias for modulating the normalization layer, conditioned on the external data d, namely the sketch, color strokes, and noise in this paper. Our implementations of α^i_{c,h,w} and β^i_{c,h,w} are straightforward. The external data is first projected into an embedding space through a convolution layer. Then the bias is produced by another convolution layer, and the attention map is generated by a convolution layer followed by a sigmoid, which limits the feature map values to the range between zero and one and ensures the output is an attention map. The effectiveness of ANLs is due to their inherent characteristics. Similar to SPADE [23], ANLs avoid washing away semantic information in the activations, since the attention map and bias are spatially-varying.
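The modulation of Eqs. (2)–(4) can be sketched in NumPy as below. This is an assumption-laden illustration: the convolutions that produce α and β from the external data d, and the final convolution inside f(·), are omitted, so α and β are supplied as given arrays:

```python
import numpy as np

def attention_norm(x, alpha, beta, eps=1e-5):
    """Apply the ANL modulation of Eqs. (2)-(4) to activations x.

    x:     activations of shape (N, C, H, W)
    alpha: attention map in [0, 1], shape (C, H, W)  (conv + sigmoid on d)
    beta:  bias, shape (C, H, W)                     (conv on d)
    """
    # Channel-wise batch statistics over (N, H, W), as in Eqs. (3)-(4).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt((x ** 2).mean(axis=(0, 2, 3), keepdims=True) - mu ** 2 + eps)

    normalized = (x - mu) / sigma
    modulated = alpha[None] * normalized + beta[None]   # inside f(.) in Eq. (2)

    activated = np.maximum(modulated, 0)                # f: ReLU (conv omitted)
    # Final depth concatenation with the original normalized activations.
    return np.concatenate([activated, normalized], axis=1)

x = np.random.randn(4, 8, 16, 16)
alpha = 1 / (1 + np.exp(-np.random.randn(8, 16, 16)))   # sigmoid-range map
beta = np.random.randn(8, 16, 16)
out = attention_norm(x, alpha, beta)
print(out.shape)  # (4, 16, 16, 16)
```

The output doubles the channel count because of the concatenation, matching the depth-concatenation step shown in Figure 2.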
Moreover, the multi-scale ANLs can not only adapt to the various scales of activations during upsampling but also extract coarse-to-fine semantic information from the external data, which guides the fashion editing more precisely.

3.4 Learning Objective Function

Due to the complex textures of the incomplete image and the variety of sketches and color strokes, training the free-form parsing network and the parsing-aware inpainting network is challenging. To address this, we apply several losses that make the training easier and more stable in different respects. Specifically, we apply an adversarial loss L_adv [6], perceptual loss L_perceptual [14], style loss L_style [14], parsing loss L_parsing [5], multi-scale feature loss L_feat [27], and total variation loss L_TV [14] to regularize the training. We define a face TV loss L_faceTV by applying L_TV to the face region to remove facial artifacts. We define a mask loss using the L1 norm on the masked area. Let I_gen be the generated image, I_real the ground truth, and M the mask; the mask loss is computed as:

L_mask = || I_gen ⊙ M − I_real ⊙ M ||_1.   (5)

We also define a foreground loss to enhance the foreground quality. Let M_foreground be the mask of the foreground part; then L_foreground is computed as:

L_foreground = || I_gen ⊙ M_foreground − I_real ⊙ M_foreground ||_1.   (6)

Similar to L_foreground, we formulate a face loss L_face to improve the quality of the face region. The overall objective function L_free-form-parser for the free-form parsing network is formulated as:

L_free-form-parser = γ_1 L_parsing + γ_2 L_feat + γ_3 L_adv,   (7)

where the hyper-parameters γ_1, γ_2, and γ_3 are the weights of each loss.
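The masked L1 losses of Eqs. (5)–(6) can be sketched as follows (a minimal NumPy illustration; the shapes and toy images are made up):

```python
import numpy as np

def masked_l1(i_gen, i_real, mask):
    """L1 norm restricted to the masked region, as in Eqs. (5)-(6)."""
    return np.abs(i_gen * mask - i_real * mask).sum()

i_real = np.zeros((4, 4, 3))   # toy ground-truth image
i_gen = np.ones((4, 4, 3))     # toy generated image, off by 1 everywhere
mask = np.zeros((4, 4, 1))
mask[:2] = 1.0                 # only the top half counts toward the loss

print(masked_l1(i_gen, i_real, mask))  # 24.0
```

The same function serves as L_mask with the free-form mask M and as L_foreground with M_foreground; only the mask argument changes.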
The overall objective function L_inpainter for the parsing-aware inpainting network is written as:

L_inpainter = λ_1 L_mask + λ_2 L_foreground + λ_3 L_face + λ_4 L_faceTV + λ_5 L_perceptual + λ_6 L_style + λ_7 L_adv,   (8)

where the hyper-parameters λ_i (i = 1, …, 7) are the weights of each loss.

4 Experiments

4.1 Datasets and Metrics

We conduct our experiments on DeepFashion [36], using the Fashion Image Synthesis track. It contains 38,237 images, split into a train set of 29,958 images and a test set of 8,279 images.

Figure 3: Qualitative comparisons with Deepfill v1 [33], Partial Conv [16], and Edge-connect [19] on DeepFashion [36], MPV [4], and FashionE, respectively.

MPV [4] contains 35,687 images, split into a train set of 29,469 samples and a test set of 6,218 samples. To better contribute to the fashion editing community, we collected a new fashion dataset, named FashionE. It contains 7,559 images of size 320 × 512. In our experiments, we split it into a train set of 6,106 images and a test set of 1,453 images. The dataset will be released upon the publication of this work. The image size is 320 × 512 across all datasets. We utilize the Irregular Mask Dataset provided by [16] in our experiments. The original dataset contains 55,116 masks for training and 24,866 masks for testing. We randomly select 12,000 masks, splitting them into a train set of 9,600 masks and a test set of 2,400 masks. To mimic free-form color strokes, we utilize an irregular mask dataset from [10] as the Irregular Strokes Dataset; the mask region stands for a stroke in our experiments. We split it into a train set of 50,000 masks and a test set of 10,000 masks.
All masks are resized to 320 × 512.

Metrics. We evaluate our proposed method and the compared approaches on three metrics: PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity index) [28], and FID (Fréchet Inception Distance) [8]. We use Amazon Mechanical Turk (AMT) to evaluate the qualitative results.

4.2 Implementation Details

Training Procedure. The training procedure has two stages. The first stage trains the free-form parsing network; we use γ_1 = 10, γ_2 = 10, γ_3 = 1 in the loss function. The second stage trains the parsing-aware inpainting network; we use λ_1 = 5.0, λ_2 = 50, λ_3 = 1.0, λ_4 = 0.1, λ_5 = 0.05, λ_6 = 200, λ_7 = 0.001 in the loss function. For both stages, we use the Adam optimizer [15] with β_1 = 0.5, β_2 = 0.999, and a learning rate of 0.0002. The batch size is 20 in stage 1 and 8 in stage 2. In each training cycle, we train one step for the generator and one step for the discriminator. All experiments are conducted on 4 Nvidia 1080 Ti GPUs.

Sketch & Color Domain. The way of extracting the sketch and color domain from images is similar to SC-FEGAN, except that instead of using HED [29], we generate sketches with the Canny edge detector [1]. Relying on the result of human parsing, we use the median color of each segmented area to represent the color of that area. More details are presented in the supplementary material.

Discriminators. The discriminator used in the free-form parsing network has a structure similar to the multi-scale discriminator in Pix2PixHD [27], with two PatchGAN discriminators. The discriminator used in the parsing-aware inpainting network has a structure similar to the inpainting discriminator in Edge-connect [19], with five convolution and spectral norm blocks.

Compared Approaches.
To make a comprehensive evaluation of our proposed method, we conduct three comparison experiments based on recent state-of-the-art approaches to image inpainting [16, 19, 33]; Edge-connect [19] comprises an edge generator and an image completion module. The re-implementations follow the source code provided by the authors. To make a fair comparison, the inputs consist of incomplete images, masks, sketch, color domain, and noise across all comparison experiments.

Figure 4: Some interactive comparisons with Deepfill v1 [33], Partial Conv [16], and Edge-connect [19] on DeepFashion [36], MPV [4], and FashionE, respectively.

Table 1: Quantitative comparisons on the DeepFashion [36], MPV [4], and FashionE datasets.

                        DeepFashion [36]         MPV [4]                  FashionE
Model                   PSNR    SSIM   FID       PSNR    SSIM   FID       PSNR    SSIM   FID
Deepfill v1 [33]        16.885  0.781  60.994    18.450  0.808  58.742    19.170  0.814  56.738
Partial Conv [16]       19.103  0.827  17.728    20.408  0.850  22.751    20.635  0.848  20.148
Edge-connect [19]       26.236  0.901  12.633    27.557  0.924   7.888    29.154  0.926   5.182
FE-GAN (Ours)           29.552  0.928   3.700    30.602  0.944   3.796    30.974  0.938   3.246

4.3 Quantitative Results

PSNR computes the peak signal-to-noise ratio between images. SSIM measures the similarity between two images. Higher PSNR and SSIM values mean better results. FID has tended to replace the Inception Score as one of the most significant metrics for measuring the quality of generated images; it computes the Fréchet distance between two multivariate Gaussians, the smaller the better. As mentioned in [28], there is no perfect numerical metric for image inpainting; furthermore, our focus goes beyond regular inpainting. As we can observe from Table 1, our FE-GAN achieves the best PSNR, SSIM, and FID scores, outperforming all other methods on all three datasets.
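As an illustration of the PSNR metric used above, here is its standard definition sketched in NumPy (the test images are synthetic, not from the datasets):

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

real = np.full((512, 320, 3), 100.0)  # constant gray ground truth
gen = real + 10.0                     # uniform error of 10 gray levels
print(round(psnr(gen, real), 3))      # 28.131
```

SSIM and FID involve windowed statistics and a pretrained Inception network, respectively, and are typically computed with existing library implementations rather than from scratch.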
4.4 Qualitative Results

Beyond numerical evaluation, we present visual comparisons on the image completion task among the three datasets and four methods, shown in Figure 3. The three rows, from top to bottom, are results on DeepFashion, MPV, and FashionE. Interactive results for these methods are shown in Figure 4. The last column of Figure 4 shows the results of the free-form parsing network. We can observe that the free-form parsing network obtains promising parsing results by manipulating the sketch and color. Thanks to the multi-scale attention normalization layers and the synthesized parsing from the free-form parsing network, our FE-GAN outperforms all other baselines in the visual comparisons.

4.5 Human Evaluation

To further demonstrate the robustness of our proposed FE-GAN, we conduct a human evaluation deployed on the Amazon Mechanical Turk platform on DeepFashion [36], MPV [4], and FashionE. In each test, we provide two images, one from a compared method and the other from our proposed method, and workers are asked to choose the more realistic of the two. During the evaluation, K images from each dataset are chosen, and n workers evaluate these K images; in our case, K = 100 and n = 10. As we can observe from Table 2, our proposed method performs far better than the other baselines. This confirms the effectiveness of our FE-GAN, comprised of a free-form parsing network and a parsing-aware inpainting network, which generates more realistic fashion images.

Table 2: Human evaluation results of pairwise comparisons with other methods.
Comparison Method Pair        DeepFashion [36]    MPV [4]           FashionE
Ours vs Deepfill v1 [33]      0.849 vs 0.151      0.845 vs 0.155    0.857 vs 0.143
Ours vs Partial Conv [16]     0.917 vs 0.083      0.864 vs 0.136    0.799 vs 0.201
Ours vs Edge-connect [19]     0.790 vs 0.210      0.691 vs 0.309    0.656 vs 0.344

5 Ablation Study

To evaluate the impact of each proposed component of our FE-GAN, we conduct an ablation study on FashionE using the model at 20 epochs. As shown in Table 3 and Figure 5, we report the results of different versions of our FE-GAN. We first compare the results with attention normalization to the results without it: incorporating the attention normalization layers into the decoder of the inpainting module significantly improves the performance of image completion. We then verify the effectiveness of the proposed free-form parsing network. From Table 3 and Figure 5, we observe that the performance drops dramatically without using parsing, which depicts the human layout and guides image manipulation with higher-level structure constraints. The results show that the main performance improvements are achieved by the attention normalization and the human parsing. We also explore the impact of our designed objective function and find that each of the losses substantially improves the results.

Table 3: Ablation studies on FashionE.

Method                  PSNR    SSIM   FID
Full                    30.035  0.932  4.092
w/o attention norm      29.185  0.920  5.191
w/o parsing             29.109  0.923  5.355
w/o L_mask              28.813  0.921  4.773
w/o L_foreground        29.848  0.927  5.030

Figure 5: Ablation studies on FashionE. (a1)(b1): Ours (Full); (a2): w/o attention norm; (b2): w/o parsing.

6 Conclusions

We propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which enables users to manipulate fashion images with an arbitrary sketch and a few sparse color strokes.
FE-GAN incorporates a free-form parsing network to predict the complete human parsing map that guides fashion image manipulation. Moreover, we develop a foreground-based partial convolutional encoder and design an attention normalization layer used at multiple scales in the decoder of the inpainting network. Experiments on fashion datasets demonstrate that our FE-GAN outperforms the state-of-the-art methods and achieves high-quality results with convincing details.

References

[1] John F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8:679–698, 1986.
[2] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[3] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[4] Haoye Dong, Xiaodan Liang, Bochao Wang, Hanjiang Lai, Jia Zhu, and Jian Yin. Towards multi-pose guided virtual try-on network. arXiv preprint arXiv:1902.11026, 2019.
[5] Ke Gong, Xiaodan Liang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, pages 6757–6765, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[7] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. VITON: An image-based virtual try-on network. arXiv preprint arXiv:1711.08447, 2017.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[10] Kazizat T. Iskakov. Semi-parametric image inpainting. arXiv preprint arXiv:1807.02855, 2018.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[12] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. arXiv preprint, 2018.
[13] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.
[15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.
[17] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NIPS, 2017.
[18] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[19] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
[20] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
[22] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa.
Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
[23] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[24] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
[25] Tiziano Portenier, Qiyang Hu, Attila Szabó, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. FaceShop: Deep sketch-based face image editing. ACM Transactions on Graphics (TOG), 37(4):99, 2018.
[26] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2016.
[27] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.
[28] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. TIP, 13(4):600–612, 2004.
[29] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, 2015.
[30] Wei Xiong, Jiahui Yu, Zhe L. Lin, Jimei Yang, Xin Lu, Connelly Barnes, and Jiebo Luo. Foreground-aware image inpainting. arXiv preprint arXiv:1901.05945, 2019.
[31] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[32] Youngjoo Jo and Jongyoul Park. SC-FEGAN: Face editing generative adversarial network with user's sketch and color. arXiv preprint, 2019.
[33] Jiahui Yu, Zhe L. Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In CVPR, pages 5505–5514, 2018.
[34] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[35] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[36] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.

Appendix

Figure 6: Example of model inputs, shown in the second row. The inputs of the free-form parsing network consist of the incomplete parsing, sketch, color, mask, and noise; the inputs of the parsing-aware inpainting network contain the incomplete image, composed mask, and synthesized parsing. The inputs of the attention normalization layers are the sketch, color, and noise. We first generate the sketches using Canny [1], shown in the second column of the first row. Then, we use a human parser [5] to extract the median color of each part of the person, shown in the last column of the first row.

Figure 7: Some interactive comparisons of Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN (Ours). The results of our FE-GAN are shown in the last column. Zoom in for details.

Figure 8: Some interactive results of our FE-GAN, shown in the third column. The input contains a free-form mask, sketch, and sparse color strokes. The results of our free-form parsing network are shown in the last column. Zoom in for details.

Figure 9: Some interactive results of our FE-GAN, shown in the third column. The input contains a free-form mask, sketch, and sparse color strokes. The results of our free-form parsing network are shown in the last column. Zoom in for details.
Figure 10: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on FashionE.

Figure 11: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on FashionE.

Figure 12: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on FashionE.

Figure 13: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on DeepFashion [36].

Figure 14: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on DeepFashion [36].

Figure 15: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on DeepFashion [36].

Figure 16: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on MPV [4].

Figure 17: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on MPV [4].
Figure 18: Qualitative comparisons between Deepfill v1 [33], Partial Conv [16], Edge-connect [19], and FE-GAN on MPV [4].
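The partial convolution used by the inpainting encoder conditions each convolution on the validity mask, so that hole pixels do not contribute to the output and the mask shrinks layer by layer. The paper's encoder is a foreground-based variant; as a rough illustration, here is a minimal single-channel NumPy sketch of the standard partial convolution of Liu et al. [16] (function name and "valid" padding are our simplifications, not the paper's implementation):

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias):
    """Single-channel partial convolution (after Liu et al. [16]), 'valid' padding.

    x:      (H, W) input image with holes
    mask:   (H, W) binary mask, 1 = valid pixel, 0 = hole
    weight: (k, k) convolution kernel
    bias:   scalar bias
    Returns (out, new_mask): convolved output and the updated (shrunken-hole) mask.
    """
    k = weight.shape[0]
    H, W = x.shape
    out = np.zeros((H - k + 1, W - k + 1))
    new_mask = np.zeros_like(out)
    total = float(k * k)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                # convolve only over valid pixels, then re-weight by
                # the ratio of window size to the number of valid pixels
                patch = x[i:i + k, j:j + k] * m
                out[i, j] = (weight * patch).sum() * (total / valid) + bias
                new_mask[i, j] = 1.0  # any valid input pixel makes the output valid
            # else: the window is entirely inside the hole; output and mask stay 0
    return out, new_mask
```

The mask update is what lets stacked partial convolutions gradually fill large free-form holes: after enough layers, `new_mask` becomes all ones and the layer reduces to an ordinary convolution.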