Structured Prediction using cGANs with Fusion Discriminator
Published as a workshop paper at ICLR 2019

Faisal Mahmood, Wenhao Xu, Nicholas J. Durr
Department of Biomedical Engineering
Johns Hopkins University
Baltimore, MD 21218, USA
{faisalm,wenhao1,ndurr}@jhu.edu

Jeremiah W. Johnson
Department of Applied Engineering & Sciences
University of New Hampshire
Manchester, NH 03101, USA
jeremiah.johnson@unh.edu

Alan Yuille
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218, USA
ayuille1@jhu.edu

ABSTRACT

We propose the fusion discriminator, a single unified framework for incorporating conditional information into a generative adversarial network (GAN) for a variety of distinct structured prediction tasks, including image synthesis, semantic segmentation, and depth estimation. Much like commonly used convolutional neural network-conditional Markov random field (CNN-CRF) models, the proposed method is able to enforce higher-order consistency in the model, but without being limited to a very specific class of potentials. The method is conceptually simple and flexible, and our experimental results demonstrate improvement on several diverse structured prediction tasks.

1 INTRODUCTION

Convolutional neural networks (CNNs) have demonstrated groundbreaking results on a variety of different learning tasks. However, on tasks where high-dimensional structure in the data needs to be preserved, per-pixel regression losses typically result in unstructured outputs, since they do not take into consideration non-local dependencies in the data. Structured prediction frameworks such as graphical models and joint CNN-graphical-model architectures, e.g. CNN-CRFs, have been used to impose spatial contiguity using non-local information (Lin et al., 2016; Chen et al., 2018a; Schwing & Urtasun, 2015; Mahmood & Durr, 2018).
The motivation to use CNN-CRF models stems from their ability to capture some structured information from second-order statistics using the pairwise part. However, statistical interactions beyond the second order are tedious to incorporate and render the models complicated (Arnab et al., 2016; Kohli et al., 2009).

Generative models provide another way to represent the structure and spatial contiguity in large high-dimensional datasets with complex dependencies. Implicit generative models specify a stochastic procedure to produce outputs from a probability distribution. Such models are appealing because they do not demand parametrization of the probability distribution they are trying to model. Recently, there has been great interest in CNN-based implicit generative models using autoregressive (Chen et al., 2018b) and adversarial training frameworks (Luc et al., 2016).

Generative adversarial networks (GANs) (Goodfellow et al., 2014) can be seen as a two-player minimax game in which the first player, the generator, is tasked with transforming a random input to a specific distribution such that the second player, the discriminator, cannot distinguish between the true and synthesized distributions. The most distinctive feature of adversarial networks is the discriminator, which assesses the discrepancy between the current and target distributions. The discriminator acts as a progressively more precise critic of an increasingly accurate generator. Despite their structured prediction capabilities, such a training paradigm is often unstable. However, recent work on spectral normalization (SN) and gradient penalty has significantly increased training stability (Miyato et al., 2018; Gulrajani et al., 2017).
Conditional GANs (cGANs) (Mirza & Osindero, 2014) incorporate conditional information in the discriminator and have been widely used for class-conditioned image generation (Miyato et al., 2018; Miyato & Koyama, 2018). To that effect, unlike in standard GANs, a discriminator for cGANs discriminates between the generated distribution and the target distribution on pairs of samples y and conditional information x. For class conditioning, several unique strategies have been presented to incorporate class information in the discriminator (Reed et al., 2016; Miyato & Koyama, 2018; Odena et al., 2016).

Figure 1: Discriminator models for image conditioning: (a) concatenated image conditioning (Mirza et al., 2014; Isola et al., 2017); (b) fusion discriminator (proposed). We propose fusing the features of the input and the ground truth or generated image rather than concatenating them.

However, a cGAN can also be conditioned by structured data such as an image. Such conditioning is much more useful for structured prediction problems. Since the discriminator in an image-conditioned GAN has access to large portions of the image, the adversarial loss can be interpreted as a learned loss that incorporates higher-order statistics, essentially eliminating the need to manually design higher-order loss functions. This variation of cGANs has extensively been used for image-to-image translation tasks (Isola et al., 2017; Zhu et al., 2017). However, the best way of incorporating conditional image information into a GAN is not clear, and methods of feeding generated and conditional images to the discriminator tend to use a naive concatenation approach. In this work we address this gap by proposing a discriminator architecture specifically designed for image conditioning.
Such a discriminator contributes to the promise of generalization that GANs bring to structured prediction problems by providing a single, simple setup for capturing higher-order non-local structural information from high-dimensional data without complicated modeling of energy functions.

Contributions. We propose an approach to incorporating conditional information into a cGAN using a fusion discriminator architecture (Fig. 1b). In particular, we make the following key contributions:

1. We propose a novel discriminator architecture designed to incorporate conditional information for structured prediction tasks. The method is designed to incorporate conditional information in feature space in a way that allows the discriminator to enforce higher-order consistency in the model, and is conceptually simpler than alternative structured prediction methods such as CNN-CRFs, where higher-order potentials have to be manually incorporated in the loss function.

2. We demonstrate the effectiveness of this method on a variety of distinct structured prediction tasks, including semantic segmentation, depth estimation, and generating real images from semantic masks. Our empirical study demonstrates that the fusion discriminator is effective in preserving high-order statistics and structural information in the data and is flexible enough to be used successfully for many structured prediction tasks.

2 RELATED WORK

2.1 CNN-CRF MODELS

Models for structured prediction have been extensively studied in computer vision. In the past these models often entailed the construction of hand-engineered features. In 2015, Long et al. (2015) demonstrated that a fully convolutional approach to semantic segmentation could yield state-of-the-art results at that time with no need for hand-engineered features. Chen et al.
(2014) showed that post-processing the results of a CNN with a conditional Markov random field led to significant improvements. Subsequent work by many authors has refined this approach by incorporating the CRF as a layer within a deep network, thereby enabling the parameters of both models to be learnt simultaneously (Knöbelreiter et al., 2017). Many researchers have used this approach for other structured prediction problems, including image-to-image translation and depth estimation (Liu et al., 2015; Mahmood & Durr, 2018; Mahmood et al., 2018).

In most cases CNN-CRF models only incorporate unary and pairwise potentials. Arnab et al. (2016) investigated incorporating higher-order potentials into CNN-based models for semantic segmentation, and found that while it is possible to learn the parameters of these potentials, they can be tedious to incorporate and render the model quite complex. Thus there is a need to develop methods that can incorporate higher-order statistical information without requiring manual modeling of higher-order potentials.

2.2 GENERATIVE ADVERSARIAL NETWORKS

Adversarial Training. Generative adversarial networks were introduced in Goodfellow et al. (2014). A GAN consists of a pair of models (G, D), where G attempts to model the distribution of the source domain and D attempts to evaluate the divergence between the generative distribution q and the true distribution p. GANs are trained by training the discriminator and the generator in turn, iteratively refining both the quality of the generated data and the discriminator's ability to distinguish between p and q. The result is that D and G compete to reach a Nash equilibrium expressed by the training procedure.
While GAN training is often unstable and prone to issues such as mode collapse, recent developments such as spectral normalization and gradient penalty have increased GAN training stability (Miyato et al., 2018; Gulrajani et al., 2017). Furthermore, GANs have the advantage of being able to access the joint configuration of many variables, thus enabling a GAN to enforce higher-order consistency that is difficult to enforce via other methods (Luc et al., 2016; Isola et al., 2017).

Conditional GANs. A conditional GAN (cGAN) is a GAN designed to incorporate conditional information (Mirza & Osindero, 2014). cGANs have shown promise for several tasks such as class-conditional image synthesis and image-to-image translation (Mirza & Osindero, 2014; Isola et al., 2017). There are several advantages to using the cGAN model for structured prediction, including the simplicity of the framework. Image-conditioned cGANs can be seen as a structured prediction problem tasked with learning a new representation of an input image while making use of non-local dependencies. However, the method by which the conditional information should be incorporated into the model is often unmotivated. Usually, the conditional data is concatenated to some layers in the discriminator (often the input layers). A notable exception to this methodology is the projection cGAN, where the data is assumed to follow certain simple distributions, allowing a hard mathematical rule for incorporating conditional data to be derived from the underlying probabilistic graphical model (Miyato & Koyama, 2018). As mentioned in Miyato & Koyama (2018), the method is less likely to produce good results if the data does not follow one of the prescribed distributions. For structured prediction tasks involving conditioning with image data, this is often not the case. In the following section we introduce the fusion discriminator and explain the motivation behind it.
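Spectral normalization, mentioned above as a key stabilizer, rescales each weight matrix to unit spectral norm. A minimal NumPy sketch, assuming an exact SVD in place of the cheap power-iteration estimate used in practice (Miyato et al., 2018); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def spectral_normalize(w):
    """Rescale w to unit spectral norm, as in spectral normalization
    (Miyato et al., 2018). Here the largest singular value is computed
    exactly with SVD; the original method approximates it cheaply with
    one persistent power-iteration step per training update."""
    sigma = np.linalg.svd(w, compute_uv=False)[0]  # largest singular value
    return w / sigma

w = rng.standard_normal((8, 5))
w_sn = spectral_normalize(w)

# The normalized weight is 1-Lipschitz: ||w_sn v|| <= ||v|| for every v.
v = rng.standard_normal(5)
assert np.linalg.norm(w_sn @ v) <= np.linalg.norm(v) + 1e-12
```

Constraining every layer this way bounds the Lipschitz constant of the whole discriminator, which is the property the stability results rely on.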
3 PROPOSED METHOD: CGANS WITH FUSION DISCRIMINATOR

As mentioned, the most significant part of cGANs for structured prediction is the discriminator. The discriminator has continuous access to pairs of the generated or real data y and the conditional information (i.e. the image) x. The cGAN discriminator can then be defined as D_cGAN(x, y, θ) := A(f(x, y, θ)), where A is the activation function, f is a function of x and y, and θ represents the parameters of f. Let p and q designate the true and the generated distributions. The adversarial loss for the discriminator can then be defined as

L(D) = −E_{q(y)}[E_{q(x|y)}[log D(x, y, θ)]] − E_{p(y)}[E_{p(x|y)}[log(1 − D(x, G(x), θ))]].   (1)

Here, A is the sigmoid function, D is the conditional discriminator, and G is the generator. By design, this framework allows the discriminator to significantly affect the generator (Goodfellow et al., 2014).

Figure 2: Fusion discriminator architecture: (a) a four-layer fusion discriminator; (b) a VGG16-style fusion discriminator. Each branch encodes the input (x) or the output (y) with Conv + SN + ReLU blocks; the branches are joined by fusion layers before the final convolution and real/fake classification.

The most common approach currently in use to incorporate conditional image information into a GAN is to concatenate the conditional image information to the input of the discriminator at some layer, often the first (Isola et al., 2017). Other approaches for conditional information fusion are limited to class-conditional fusion, where the conditional information is often a one-hot vector rather than higher-dimensional structured data. Since the discriminator classifies pairs of input and output images, concatenating high-dimensional data may not exploit inherent dependencies in the structure of the data.
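Eq. 1 is a conditional binary cross-entropy. A minimal NumPy sketch of the discriminator loss, with hypothetical logit arrays standing in for the network f(x, y, θ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cgan_discriminator_loss(f_real, f_fake):
    """Binary cross-entropy form of Eq. 1.

    f_real: discriminator logits f(x, y) on (conditional image, real output) pairs
    f_fake: discriminator logits f(x, G(x)) on (conditional image, generated output) pairs
    The sigmoid plays the role of the activation A in D = A(f(.)).
    """
    d_real = sigmoid(f_real)
    d_fake = sigmoid(f_fake)
    # L(D) = -E[log D(x, y)] - E[log(1 - D(x, G(x)))]
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# A confident discriminator (large positive logits on real pairs, large
# negative logits on fake pairs) drives the loss toward zero.
loss = cgan_discriminator_loss(np.array([8.0, 9.0]), np.array([-8.0, -9.0]))
```

At logits of zero (D = 0.5 everywhere), the loss sits at 2 log 2, the chance-level value for this objective.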
Fusing the input and output information in an intuitive way that preserves these dependencies is instrumental in designing an adversarial framework with high structural capacity. We propose the use of a fusion discriminator architecture with two branches. The branches of this discriminator are convolutional neural networks with identical architectures, say ψ(x) and φ(y), that learn representations from the conditional data (ψ(x)) and the generated or real data (φ(y)), respectively. The learned representations are then fused at various stages (Fig. 2). This architecture is similar to the encoder portion of the FuseNet architecture, which has previously been used to incorporate depth information from RGB-D images for semantic segmentation (Hazirbas et al., 2017). In Figure 2, we illustrate a four-layer and a VGG16-style fusion discriminator, in which both branches are similar in depth and structure to the VGG16 model (Simonyan & Zisserman, 2014).

The key ingredient of the fusion discriminator architecture is the fusion block, which combines the learned representations of x and y. The fusion layer (red, Fig. 2) is implemented as element-wise summation and is always inserted after a convolution → spectral normalization → ReLU instance. The fusion layer modifies the signal passed through the φ branch by adding in learned representations of x from the ψ branch. This preserves representations from both x and y. For structured prediction tasks, x and y will often have learned representations that complement each other; for instance, in tasks like depth estimation, semantic segmentation, and image synthesis, x and y have highly complementary features.

3.1 MOTIVATION

Theoretical Motivation. When data is passed through two networks with identical architectures and the activations at corresponding layers are added, the effect is to pass through the combined network (the upper branch in Fig.
2) a stronger signal than would be passed forward by applying an activation to concatenated data. To see this in the case of the ReLU activation function, denote the kth feature map in the lth layer by h^(l)_k, and let the weights and bias for this feature and layer be denoted W^(l)_k = [U^(l)_k V^(l)_k] and b^(l)_k = c^(l)_k + d^(l)_k, respectively. Let h = [x^T y^T]^T, where x and y represent the learned features from the conditional and real or generated data, respectively. Then

h^(l+1)_k = max(0, W^(l)_k h + b^(l)_k)                                            (2)
          = max(0, U^(l)_k x^(l) + V^(l)_k y^(l) + (c^(l)_k + d^(l)_k))            (3)
          ≤ max(0, U^(l)_k x^(l) + c^(l)_k) + max(0, V^(l)_k y^(l) + d^(l)_k).     (4)

Eq. 4 demonstrates that the fusion of the activations in ψ(x) and φ(y) produces a stronger signal than the activation on concatenated inputs.¹ Strengthening some activations does not guarantee improved performance in general; however, in the context of structured prediction the fusing operation results in the strongest signals being passed through the discriminator specifically at those places where the model finds useful information simultaneously in both the conditional data and the real or generated data.

A similar mechanism can be found at work in many other successful models that require higher-order structural information to be preserved; to take one example, consider the neural algorithm of artistic style proposed by Gatys et al. (2015). This algorithm successfully transfers highly structured data from an existing image x onto a randomly initialized image y by minimizing the content loss function

L_content(x, y, l) = (1/2) Σ_{i,j} (F^l_{ij} − P^l_{ij})²,                          (5)

where F^l_{ij} and P^l_{ij} denote the activations at location i, j in layer l of x and y, respectively.
The loss-function mechanism used here differs from the fusing mechanism used in the fusion discriminator, but the underlying principle of capturing high-level structural information from a pair of images by combining signals from common layers in parallel networks is the same. The neural algorithm of artistic style succeeds in content transfer by ensuring that the activations containing information of structural importance are similar in both the generated image and the content image. In the case of image-conditioned cGAN training, it can be assumed that the activations of the real or generated data and the conditional data will be similar, and by fusing these activations and passing forward a strengthened signal, the network is better able to attend to those locations containing important structural information in both the real or generated data and the conditional data; c.f. Fig. 3.

Empirical Motivation. We use gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017), which uses the class-specific gradient information flowing into the final convolutional layer of a trained CNN to produce a coarse localization map of the important regions in the image. We visualized the outputs of a fusion and a concatenated discriminator for several different tasks to observe the structure and strength of the signal being passed forward. We observed that the fusion discriminator architecture always had a visually strong signal at important features for the given task. Representative images from classifying x and y pairs as 'real' for two different structured prediction tasks are shown in Fig. 3. This provides visual evidence that a fusion discriminator preserves more structural information from the input and output image pairs and classifies overlapping patches based on that information.
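The inequality in Eqs. 2–4 (the fused per-branch ReLU activations dominate the ReLU of the concatenated input) can be checked numerically. A minimal NumPy sketch, with random vectors standing in for learned features; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Learned features from the conditional branch psi(x) and the
# real/generated branch phi(y), with weights U, V and a split bias b = c + d.
x, y = rng.standard_normal(64), rng.standard_normal(64)
U, V = rng.standard_normal((16, 64)), rng.standard_normal((16, 64))
c, d = rng.standard_normal(16), rng.standard_normal(16)

# Concatenated discriminator (Eqs. 2-3): one ReLU over W h + b,
# with W = [U V] and h = [x; y].
concat_signal = np.maximum(0.0, U @ x + V @ y + (c + d))

# Fusion discriminator (right-hand side of Eq. 4): per-branch ReLU,
# then element-wise summation.
fused_signal = np.maximum(0.0, U @ x + c) + np.maximum(0.0, V @ y + d)

# max(0, a + b) <= max(0, a) + max(0, b) holds element-wise,
# so the fused signal is at least as strong everywhere.
assert np.all(fused_signal >= concat_signal - 1e-12)
```

This sketch compares one fusion layer against its concatenation counterpart; in the actual architecture the fusion happens after Conv + SN + ReLU blocks at several depths (Fig. 2).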
Admittedly, this is not evidence that a stronger signal will lead to a more accurate classification, but it is a heuristic justification that more representative features from x and y will be used to make the determination.

4 EXPERIMENTS

In order to evaluate the effectiveness of the proposed fusion discriminator we conducted three sets of experiments on structured prediction problems: 1) generating real images from semantic masks (Cityscapes); 2) semantic segmentation (Cityscapes); 3) depth estimation (NYU v2). For all three tasks we used a U-Net-based generator. We applied spectral normalization to all weights of the generator and discriminator to regularize the Lipschitz constant. The Adam optimizer was used for all experiments with hyper-parameters α = 0.0002, β1 = 0, β2 = 0.9.

¹ Equations 2–4 apply only for the ReLU, but similar statements can easily be proven for many commonly used activation functions; see Section 6 for additional discussion.

Figure 3: Visualizing discriminator features using gradient-weighted Class Activation Maps (Grad-CAM) to produce a coarse localization map of the important regions in the image. The fusion discriminator passes a stronger and more structured signal on important features in comparison to a concatenated discriminator. (Panels show the input x, ground truth y*, and Grad-CAM maps of concatenated and fusion discriminators for RGB → depth and RGB → semantic-label tasks, with PatchGAN-style overlapping patch pairs.)

Figure 4: Comparative analysis of concatenation and fusion discriminators on three different structured prediction tasks: a) semantic masks to real image transformation, b) semantic segmentation, c) depth estimation. (Columns: input x, ground-truth output y, L1+GAN and GAN with a concatenated discriminator (CD), L1+GAN with a fusion discriminator (FD), and L1+GAN with a VGG fusion discriminator.)
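The reported optimizer setting (Adam with α = 0.0002, β1 = 0, β2 = 0.9) has a simple reading: with β1 = 0 there is no momentum, so each update is the current gradient scaled by a running RMS of past gradients. A single-step NumPy sketch (illustrative, not the authors' training code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-4, beta1=0.0, beta2=0.9, eps=1e-8):
    """One Adam update with the hyper-parameters used in the paper
    (alpha=0.0002, beta1=0, beta2=0.9). With beta1=0 the update direction
    is just the current gradient, normalized by the running RMS of gradients."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (== grad here)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
# On the very first step the bias-corrected RMS equals |grad|, so each
# parameter moves by roughly lr in the direction opposite its gradient.
theta, m, v = adam_step(theta, np.array([1.0, -2.0, 0.5]), m, v, t=1)
```

Setting β1 = 0 is a common choice in spectrally normalized GAN training (Miyato et al., 2018), trading momentum for robustness to the rapidly changing adversarial loss surface.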
The fusion discriminator preserves more structural details.

4.1 IMAGE-TO-IMAGE TRANSLATION

In order to demonstrate the structure-preserving abilities of our discriminator, we use the proposed setup in the image-to-image translation setting. We focus on the application of generating realistic images from semantic labels. This application has recently been studied for generating realistic synthetic data for self-driving cars (Wang et al., 2018; Chen & Koltun, 2017). Unlike recent approaches where the objective is to generate increasingly realistic high-definition (HD) images, the purpose of this experiment is to explore whether a generic fusion discriminator can outperform a concatenated discriminator when using a simple generator. We used 2,975 training images from the Cityscapes dataset (Cordts et al., 2016) and re-scaled them to 256 × 256 for computational efficiency. The provided Cityscapes test set with 500 images was used for testing. Our ablation study focused on changing the discriminator between the standard 4-layer concatenation discriminator used in the seminal image-to-image translation work (Isola et al., 2017), a combination of this 4-layer discriminator with spectral normalization (SN) (Miyato et al., 2018), a VGG16 concatenation discriminator, and the proposed 4-layer and VGG16 fusion discriminators.

Figure 5: A comparative analysis of concatenation, projection and fusion discriminators on three different structured prediction tasks, i.e., image synthesis, semantic segmentation, and depth estimation.

Table 1: PSPNet-based semantic segmentation IoU and accuracy scores using generated images from different discriminators. Our results outperform concatenation-based methods by a large margin and are close to the accuracy and IoU on actual images (GT/Oracle).

Discriminator                             | Mean IoU | Pixel Accuracy
4-Layer Concat. (Isola et al. (2017))     | 0.3617   | 74.34%
4-Layer Concat. + SN                      | 0.4022   | 76.49%
4-Layer Fusion + SN                       | 0.4569   | 79.23%
VGG16 Concat. + SN                        | 0.4125   | 77.62%
Projection + SN (Miyato & Koyama (2018))  | 0.4696   | 79.11%
VGG16 Fusion + SN                         | 0.5483   | 83.07%
GT / Oracle                               | 0.5937   | 85.13%

4.1.1 EVALUATION

Since standard GAN evaluation metrics such as Inception Score and FID cannot directly be applied to image-to-image translation tasks, we use an evaluation technique previously used for such image synthesis (Isola et al., 2017; Wang et al., 2017). To quantitatively evaluate the effectiveness of our proposed discriminator architecture, we perform semantic segmentation on synthesized images and compare the similarity between the predicted segments and the input. The intuition behind this kind of experimentation is that if the generated images correspond to the input label map, an existing semantic segmentation model such as PSPNet (Zhao et al., 2017) should be able to predict the input segmentation mask. Similar experimentation has been suggested in Isola et al. (2017) and Wang et al. (2017). Table 1 reports both pixel-wise segmentation accuracy and overall intersection-over-union (IoU). The proposed fusion discriminator outperforms the concatenated discriminator by a large margin, and our result is close to the theoretical upper bound achieved by real images. This confirms that the fusion discriminator contributes to structure preservation in the output image. The fusion discriminator could be used with high-definition images; however, such analysis is beyond the scope of the current study. Representative images for this task are shown in Fig. 4. The projection discriminator was modified for image conditioning according to the explanation given in Miyato & Koyama (2018) for the super-resolution task. Fig.
5 shows a comparative analysis of the concatenation, projection and fusion discriminators in an ablation study up to 550k iterations.

Table 2: GAN-based semantic segmentation using different discriminators, tested on the Cityscapes dataset rescaled to 256 × 256 images.

Discriminator                             | Mean IoU | Pixel Accuracy
4-Layer Concat. (Isola et al. (2017))     | 0.2925   | 81.41%
4-Layer Concat. + SN                      | 0.3162   | 83.49%
4-Layer Fusion + SN                       | 0.4471   | 85.23%
VGG16 Concat. + SN                        | 0.4066   | 84.62%
Projection + SN (Miyato & Koyama (2018))  | 0.4687   | 85.97%
VGG16 Fusion + SN                         | 0.6642   | 92.17%
CNN-CRF Postprocess                       | 0.5425   | 87.41%
CNN-CRF Joint Training                    | 0.6042   | 90.25%

4.2 SEMANTIC SEGMENTATION

Semantic segmentation is vital for visual scene understanding and is often formulated as a dense labeling problem where the objective is to predict the category label for each individual pixel. Semantic segmentation is a classical structured prediction problem, and CNNs with pixel-wise loss often fail to make accurate predictions (Luc et al., 2016). Much better results have been achieved by incorporating higher-order statistics in the image using CRFs as a post-processing step or by jointly training them with CNNs (Chen et al., 2018a). It has been shown that incorporating higher-order potentials continues to improve semantic segmentation, making this an ideal task for evaluating the structured prediction capabilities of GANs and their enhancement using our proposed discriminator.

Here, we empirically validate that the adversarial framework with the fusion discriminator can preserve more spatial context in comparison to CNN-CRF setups. We demonstrate that our proposed fusion discriminator is equipped with the ability to preserve higher-order details. For comparative analysis we compare with relatively shallow and deep architectures for both concatenation and fusion discriminators.
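The Mean IoU and pixel-accuracy columns in Tables 1 and 2 are standard confusion-matrix statistics. A minimal NumPy sketch with tiny illustrative label maps (not the PSPNet evaluation pipeline):

```python
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """Mean IoU and pixel accuracy from predicted and ground-truth label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        conf[g, p] += 1                    # rows: ground truth, cols: prediction
    pixel_acc = np.trace(conf) / conf.sum()
    ious = []
    for c in range(num_classes):
        inter = conf[c, c]
        union = conf[c, :].sum() + conf[:, c].sum() - inter
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)), float(pixel_acc)

gt   = np.array([[0, 0, 1], [1, 2, 2]])   # tiny 2x3 ground-truth label map
pred = np.array([[0, 1, 1], [1, 2, 2]])   # one mislabeled pixel
miou, acc = segmentation_scores(pred, gt, num_classes=3)
```

With one of six pixels wrong, pixel accuracy is 5/6 while mean IoU drops further, which is why IoU is the more discriminating column in the tables.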
We also conduct an ablation study to analyze the effect of spectral normalization. The generator for all semantic segmentation experiments was a U-Net. For the experiment without spectral normalization, we trained each model for 950k iterations, which was sufficient for the training of the concatenated discriminator to stabilize. For all other experiments, we trained for 800k iterations. The discriminator was trained twice as often as the generator.

4.3 DEPTH ESTIMATION

Depth estimation is another structured prediction task that has been extensively studied because of its widespread applications in computer vision. As with semantic segmentation, both per-pixel losses and non-local losses such as CNN-CRFs have been widely used for depth estimation. The state of the art in depth estimation has been achieved using a hierarchical chain of non-local losses. We argue that it is possible to incorporate higher-order information using a simple adversarial loss with a fusion discriminator.

In order to validate our claims we conducted a series of experiments with different discriminators, similar to the series of experiments conducted for semantic segmentation. We used the Eigen test-train split for the NYU v2 dataset (Nathan Silberman & Fergus, 2012), containing 1,449 images for training and 464 images for testing. We observed that, as with image synthesis and semantic segmentation, the fusion discriminator outperforms concatenation-based methods and pairwise CNN-CRF methods every time.

5 CONCLUSIONS

Structured prediction problems can be posed as image-conditioned GAN problems. The discriminator plays a crucial role in incorporating non-local information in adversarial training setups for structured prediction problems. Image-conditioned GANs usually feed concatenated input and output pairs to the discriminator.
In this research, we proposed a model for the discriminator of cGANs that involves fusing features from both the input and the output image in feature space.

Table 3: Depth estimation results on the NYU v2 dataset using various discriminators.

Discriminator                             | relative error | rms   | log10
4-Layer Concat. (Isola et al. (2017))     | 0.1963         | 0.784 | 0.087
4-Layer Concat. + SN                      | 0.1442         | 0.592 | 0.059
4-Layer Fusion + SN                       | 0.1315         | 0.583 | 0.057
VGG16 Concat. + SN                        | 0.1374         | 0.547 | 0.054
Projection + SN (Miyato & Koyama (2018))  | 0.1417         | 0.573 | 0.059
VGG16 Fusion + SN                         | 0.1254         | 0.491 | 0.052
CNN-CRF Postprocess                       | 0.311          | 1.025 | 0.129
CNN-CRF Joint Training                    | 0.232          | 0.824 | 0.094

This method provides the discriminator a hierarchy of features at different scales from the conditional data, and thereby allows the discriminator to capture higher-order statistics from the data. We qualitatively demonstrate and empirically validate that this simple modification can significantly improve the general adversarial framework for structured prediction tasks. The results presented in this paper strongly suggest that the mechanism of feeding paired information into the discriminator in image-conditioned GAN problems is of paramount importance.

REFERENCES

Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, and Philip HS Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, pp. 524–540. Springer, 2016.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018a.
Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1511–1520, 2017.

Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model, 2018b. URL https://openreview.net/forum?id=Sk7gI0CUG.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Shang-Hong Lai, Vincent Lepetit, Ko Nishino, and Yoichi Sato (eds.), Computer Vision – ACCV 2016, pp. 213–228, Cham, 2017. Springer International Publishing. ISBN 978-3-319-54181-5.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.

Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid CNN-CRF models for stereo.
In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Pushmeet Kohli, Philip HS Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.

Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3194–3203, 2016.

Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162–5170, 2015.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. arXiv preprint, 2016.

Faisal Mahmood and Nicholas J Durr. Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. Medical Image Analysis, 2018.

Faisal Mahmood, Richard Chen, and Nicholas J Durr. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging, 2018.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByS1VpgRZ.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=B1QRgziT-.

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In 33rd International Conference on Machine Learning, pp. 1060–1069, 2016.

Alexander G Schwing and Raquel Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8798–8807, 2018.

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890, 2017.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
6 SUPPLEMENTARY MATERIAL

6.1 CGAN OBJECTIVE

The objective function for a conditional GAN can be defined as

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x)))].  (6)

The generator G tries to minimize the loss expressed by equation 6 while the discriminator D tries to maximize it. In addition, we impose an L1 reconstruction loss:

L_{L1}(G) = E_{x,y}[ ||y − G(x)||_1 ],  (7)

leading to the objective

G* = arg min_G max_D L_cGAN(G, D) + λ L_{L1}(G).  (8)

6.2 GENERATOR ARCHITECTURE

We adapt our network architectures from those explained in (Isola et al., 2017). Let CSRk denote a Convolution-SpectralNorm-ReLU layer with k filters, and let CSRDk denote the same layer with dropout at a rate of 0.5. All convolutions use 4 × 4 spatial filters applied with stride 2, and in the decoder they are up-sampled by 2. All networks were trained from scratch, with weights initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. All images were cropped and rescaled to 256 × 256, upsampled to 286 × 286, and then randomly cropped back to 256 × 256 to incorporate random jitter into the model.

Encoder: CSR64 → CSR128 → CSR256 → CSR512 → CSR512 → CSR512 → CSR512 → CSR512
Decoder: CSRD512 → CSRD1024 → CSRD1024 → CSR1024 → CSR1024 → CSR512 → CSR256 → CSR128

The last layer of the decoder is followed by a convolution that maps to the number of output channels (3 in the case of image synthesis and semantic labels, 1 in the case of depth estimation), followed by a Tanh function. Leaky ReLUs with a slope of 0.2 were used throughout the encoder; regular ReLUs were used in the decoder. Skip connections are placed between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers.
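As a minimal illustration (a NumPy stand-in for the concatenation step, not the authors' implementation), a skip connection simply stacks the mirrored encoder activation onto the decoder activation along the channel axis; a 512-channel decoder activation plus a 512-channel skip is consistent with the 1024s in the decoder specification above:

```python
import numpy as np

def skip_concat(decoder_act, encoder_act):
    """Concatenate a mirrored encoder activation onto a decoder activation
    along the channel axis (channels-first layout assumed)."""
    assert decoder_act.shape[1:] == encoder_act.shape[1:], "spatial sizes must match"
    return np.concatenate([decoder_act, encoder_act], axis=0)

# A 512-channel decoder activation plus a 512-channel skip activation
# yields a 1024-channel input for the next decoder layer.
dec = np.zeros((512, 8, 8))
enc = np.zeros((512, 8, 8))
assert skip_concat(dec, enc).shape == (1024, 8, 8)
```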
The skip connections concatenate the activations of the ith encoder layer onto those of the (n − i)th decoder layer.

6.3 ACTIVATIONS WITH NEGATIVE BRANCHES

Equations 2–4 of section 3.1 illustrate that when the ReLU activation is used in a fusion block, the fusing operation results in a positive signal at least as large as that obtained by concatenation. For activations with negative branches, the following similar claim holds.

Lemma 1. Denote the kth feature map in the lth layer by h_k^{(l)}, and let the weights and biases for this feature and layer be denoted W_k^{(l)} = [U_k^{(l)} V_k^{(l)}]^T and b_k^{(l)} = [c_k^{(l)} d_k^{(l)}]^T respectively. Let h = [x^T y^T]^T, where x and y represent the learned features from the conditional and the real or generated data respectively, and let σ represent an activation function. If sign(σ(U_k^{(l)} x^{(l)} + c_k^{(l)})) = sign(σ(V_k^{(l)} y^{(l)} + d_k^{(l)})), then

|σ(W_k^{(l)} h + b_k^{(l)})| ≤ |σ(U_k^{(l)} x^{(l)} + c_k^{(l)})| + |σ(V_k^{(l)} y^{(l)} + d_k^{(l)})|.  (9)

Proof. Trivial; cf. equations 2–4.

However, the general situation can be much more complex. Ideally, if sign(σ(U_k^{(l)} x^{(l)} + c_k^{(l)})) ≠ sign(σ(V_k^{(l)} y^{(l)} + d_k^{(l)})), then |σ(W_k^{(l)} h + b_k^{(l)})| ≥ |σ(U_k^{(l)} x^{(l)} + c_k^{(l)})| + |σ(V_k^{(l)} y^{(l)} + d_k^{(l)})|, but this claim cannot be made in general. A counterexample is given by the leaky ReLU function σ(x) = max(0, x) − α max(0, −x), where α ∈ R+. In the case where σ(U_k^{(l)} x^{(l)} + c_k^{(l)}) ≥ 0 and σ(V_k^{(l)} y^{(l)} + d_k^{(l)}) ≤ 0, fusing leads to the activation U_k^{(l)} x^{(l)} + c_k^{(l)} + α(V_k^{(l)} y^{(l)} + d_k^{(l)}), while concatenation results in the activation −α(U_k^{(l)} x^{(l)} + c_k^{(l)} + V_k^{(l)} y^{(l)} + d_k^{(l)}).
The value of α plays a significant role in shaping the combined activation, and in some instances fusing can lead to a stronger signal than concatenation despite the disagreement between the incoming signals.
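As a sanity check (a scalar sketch, not from the paper), the inequality of Lemma 1 and the leaky-ReLU counterexample can be exercised numerically; here a and b stand in for the two branch pre-activations U_k x + c_k and V_k y + d_k:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.2):
    # sigma(x) = max(0, x) - alpha * max(0, -x)
    return np.maximum(0.0, z) - alpha * np.maximum(0.0, -z)

def fused(act, a, b):
    # Fusing applies the activation to the summed pre-activations;
    # concatenation activates each branch separately.
    return act(a + b)

# Lemma 1: when the branch activations share a sign, the fused magnitude
# is bounded by the sum of the per-branch magnitudes.
a, b = 1.5, 2.0
assert abs(fused(relu, a, b)) <= abs(relu(a)) + abs(relu(b))

# With a leaky ReLU and disagreeing signs, the reverse inequality
# |sigma(a + b)| >= |sigma(a)| + |sigma(b)| fails in this instance.
a, b = 2.0, -1.0
assert abs(fused(leaky_relu, a, b)) < abs(leaky_relu(a)) + abs(leaky_relu(b))
```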