Neural Photo Editing with Introspective Adversarial Networks



Published as a conference paper at ICLR 2017

Andrew Brock, Theodore Lim, & J.M. Ritchie
School of Engineering and Physical Sciences
Heriot-Watt University, Edinburgh, UK
{ajb5, t.lim, j.m.ritchie}@hw.ac.uk

Nick Weston
Renishaw plc, Research Ave, North, Edinburgh, UK
Nick.Weston@renishaw.com

ABSTRACT

The increasingly photorealistic sample quality of generative image models suggests their feasibility in applications beyond image generation. We present the Neural Photo Editor, an interface that leverages the power of generative neural networks to make large, semantically coherent changes to existing images. To tackle the challenge of achieving accurate reconstructions without loss of feature quality, we introduce the Introspective Adversarial Network, a novel hybridization of the VAE and GAN. Our model efficiently captures long-range dependencies through use of a computational block based on weight-shared dilated convolutions, and improves generalization performance with Orthogonal Regularization, a novel weight regularization method. We validate our contributions on CelebA, SVHN, and CIFAR-100, and produce samples and reconstructions with high visual fidelity.

1 INTRODUCTION

Editing photos typically involves some form of manipulating individual pixels, and achieving desirable results often requires significant user expertise. Given a sufficiently powerful image model, however, a user could quickly make large, photorealistic changes with ease by instead interacting with the model's controls.
Two recent advances, the Variational Autoencoder (VAE) (Kingma & Welling, 2014) and Generative Adversarial Network (GAN) (Goodfellow et al., 2014), have shown great promise for use in modeling the complex, high-dimensional distributions of natural images, but significant challenges remain before these models can be used as general-purpose image editors. VAEs are probabilistic graphical models that learn to maximize a variational lower bound on the likelihood of the data by projecting into a learned latent space, then reconstructing samples from that space. GANs learn a generative model by training one network, the "discriminator," to distinguish between real and generated data, while simultaneously training a second network, the "generator," to transform a noise vector into samples which the discriminator cannot distinguish from real data. Both approaches can be used to generate and interpolate between images by operating in a low-dimensional learned latent space, but each comes with its own set of benefits and drawbacks.

VAEs have stable training dynamics, but tend to produce images that discard high-frequency details when trained using maximum likelihood. Using the intermediate activations of a pre-trained discriminative neural network as features for comparing reconstructions to originals (Lamb et al., 2016) mollifies this effect, but requires labels in order to train the discriminative network in a supervised fashion. By contrast, GANs have unstable and often oscillatory training dynamics, but produce images with sharp, photorealistic features. Basic GANs lack an inference mechanism, though techniques to train an inference network (Dumoulin et al., 2016; Donahue et al., 2016) have recently been developed, as well as a hybridization that uses the VAE's inference network (Larsen et al., 2015).

Two key issues arise when attempting to use a latent-variable generative model to manipulate natural images.
First, producing acceptable edits requires that the model be able to achieve close-to-exact reconstructions by inferring latents, or else the model's output will not match the original image. This simultaneously necessitates an inference mechanism (or inference-by-optimization) and careful design of the model architecture, as there is a tradeoff between reconstruction accuracy and learned feature quality that varies with the size of the information bottleneck.

Figure 1: The Neural Photo Editor. The original image is center. The red and blue tiles are visualizations of the latent space, and can be directly manipulated as well.

Second, achieving a specific desired edit requires that the user be able to manipulate the model's latent variables in an interpretable way. Typically, this would require that the model's latent space be augmented during training and testing with a set of labeled attributes, such that interpolating along a latent such as "not smiling/smiling" produces a specific change. In the fully unsupervised setting, however, such semantically meaningful output features are generally controlled by an entangled set of latents which cannot be directly manipulated.

In this paper, we present the Neural Photo Editor, an interface that handles both of these issues, enabling a user to make large, coherent changes to the output of unsupervised generative models by indirectly manipulating the latent vector with a "contextual paintbrush." By applying a simple interpolating mask, we enable this same exploration for existing photos despite reconstruction errors. Complementary to the Neural Photo Editor, we develop techniques to improve on common design tradeoffs in generative models.
Our model, the Introspective Adversarial Network (IAN), is a hybridization of the VAE and GAN that leverages the power of the adversarial objective while maintaining the VAE's efficient inference mechanism, improving upon previous VAE/GAN hybrids both in parametric efficiency and output quality. We employ a novel convolutional block based on dilated convolutions (Yu & Koltun, 2016) to efficiently increase the network's receptive field, and Orthogonal Regularization, a novel weight regularizer. We demonstrate the qualitative sampling, reconstructing, and interpolating ability of the IAN on CelebA (Liu et al., 2015), SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky & Hinton, 2009), and Imagenet (Russakovsky et al., 2015), and quantitatively demonstrate its inference capabilities with competitive performance on the semi-supervised SVHN classification task. Further quantitative experiments on CIFAR-100 (Krizhevsky & Hinton, 2009) verify the generality of our dilated convolution blocks and Orthogonal Regularization.

2 NEURAL PHOTO EDITING

We present an interface, shown in Figure 1, that turns a coarse user input into a refined, photorealistic image edit by indirectly manipulating the latent space with a "contextual paintbrush." The key idea is simple: a user selects a paintbrush size and color (as with a typical image editor) and paints on the output image. Instead of changing individual pixels, the interface backpropagates the difference between the local image patch and the requested color, and takes a gradient descent step in the latent space to minimize that difference. This step results in globally coherent changes that are semantically meaningful in the context of the requested color change.

Figure 2: Visualizing the interpolation mask. Top, left to right: Reconstruction, reconstruction error, original image. Bottom: Modified reconstruction, ∆, output.

Given an output image X̂ and a user-requested color X_user, the change in latent values is −d‖X_user − X̂‖²/dZ, evaluated at the current paintbrush location each time a user requests an edit. For example, if a user has an image of a person with light skin, dark hair, and a widow's peak, by painting a dark color on the forehead, the system will automatically add hair in the requested area. Similarly, if a user has a photo of a person with a closed-mouth smile, the user can produce a toothy grin by painting bright white over the target's mouth.

This technique enables exploration of samples generated by the network, but fails when applied directly to existing photos, as it relies on the manipulated image being completely controlled by the latent variables, and reconstructions are usually imperfect. We circumvent this issue by introducing a simple masking technique that transfers edits from a reconstruction back to the original image. We take the output image to be a sum of the reconstruction and a masked combination of the requested pixel-wise changes and the reconstruction error:

Y = X̂ + M∆ + (1 − M)(X − X̂)    (1)

where X is the original image, X̂ is the model's reconstruction of X, and ∆ is the difference between the modified reconstruction and X̂. The mask M is the channel-wise mean of the absolute value of ∆, smoothed with a Gaussian filter g and truncated pointwise to be between 0 and 1:

M = min(g(|∆|), 1)    (2)

The mask is designed to allow changes to the reconstruction to show through based on their magnitude. This relaxes the accuracy constraints by requiring that the reconstruction be feature-aligned rather than pixel-perfect, as only modifications to the reconstruction are applied to the original image. As long as the reconstruction is close enough and interpolations are smooth and plausible, the system will successfully transfer edits. A visualization of the masking technique is shown in Figure 2.
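The masking step of Equations 1 and 2 can be sketched in numpy as follows. This is an illustrative sketch, not the authors' code: the Gaussian filter parameters (`sigma`, `radius`) and the helper names are our assumptions, as the paper does not specify the filter width.

```python
import numpy as np

def gaussian_blur(img, sigma, radius):
    """Separable Gaussian filter g (a simple stand-in; parameters are assumed)."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    for axis in (0, 1):
        img = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, img)
    return img

def edit_transfer(X, X_hat, X_modified, sigma=1.0, radius=2):
    """Transfer an edit made on the reconstruction back onto the original image.

    X          : original image, shape (H, W, C)
    X_hat      : model reconstruction of X, same shape
    X_modified : reconstruction after the latent-space edit, same shape
    Returns Y = X_hat + M * delta + (1 - M) * (X - X_hat)   (Eq. 1)
    """
    delta = X_modified - X_hat                 # requested pixel-wise change
    mag = np.abs(delta).mean(axis=-1)          # channel-wise mean of |delta|
    M = np.minimum(gaussian_blur(mag, sigma, radius), 1.0)[..., None]  # Eq. 2
    return X_hat + M * delta + (1.0 - M) * (X - X_hat)
```

Note that when no edit is made (∆ = 0), the mask vanishes and the output reduces exactly to the original image X, so untouched regions are never corrupted by reconstruction error.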
This method adds minimal computational cost to the underlying latent space exploration and produces convincing changes of features including hair color and style, skin tone, and facial expression. A video of the interface in action is available online.¹

Figure 3: The Introspective Adversarial Network (IAN).

3 INTROSPECTIVE ADVERSARIAL NETWORKS

Complementary to the Neural Photo Editor, we introduce the Introspective Adversarial Network (IAN), a novel hybridization of the VAE and GAN motivated by the need for an image model with photorealistic outputs that achieves high-quality reconstructions without loss of representational power. There is typically a design tradeoff between these two goals related to the size of the latent space: a higher-dimensional latent space (i.e. a wider representational bottleneck) tends to learn less descriptive features, but produces higher quality reconstructions. We thus seek techniques to improve the capacity of the latent space without increasing its dimensionality.

Similar to VAE/GAN (Larsen et al., 2015), we use the decoder network of the autoencoder as the generator network of the GAN, but instead of training a separate discriminator network, we combine the encoder and discriminator into a single network. Central to the IAN is the idea that features learned by a discriminatively trained network tend to be more expressive than those learned by an encoder network trained via maximum likelihood (i.e. more useful on semi-supervised tasks), and thus better suited for inference. As the Neural Photo Editor relies on high-quality reconstructions, the inference capacity of the underlying model is critical. Accordingly, we use the discriminator of the GAN, D, as a feature extractor for an inference subnetwork, E, which is implemented as a fully-connected layer on top of the final convolutional layer of the discriminator.
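As a structural sketch of this weight sharing, the toy numpy network below shows the encoder E as a fully-connected head on the discriminator's features. Everything here is invented for illustration: the layer sizes, the single dense "trunk" standing in for the convolutional stack, and the three-unit discriminator head are assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Shared trunk: image -> feature vector (stands in for the conv stack of D).
W_trunk = rng.normal(0.0, 0.02, (64 * 64 * 3, 256))

def features(x_flat):
    return relu(x_flat @ W_trunk)

# Discriminator head: features -> 3 logits (real / generated / reconstructed).
W_disc = rng.normal(0.0, 0.02, (256, 3))

def discriminator(x_flat):
    return features(x_flat) @ W_disc

# Encoder head E: a fully-connected layer on the SAME features, producing
# the mean and log-variance of q(Z|X).
latent_dim = 100
W_enc = rng.normal(0.0, 0.02, (256, 2 * latent_dim))

def encode(x_flat):
    stats = features(x_flat) @ W_enc
    return stats[:latent_dim], stats[latent_dim:]  # mu, logvar

x = rng.normal(size=64 * 64 * 3)
mu, logvar = encode(x)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=latent_dim)  # reparameterization
```

The point of the sketch is that `features` is computed once and reused: the encoder adds only one extra layer's worth of parameters on top of the discriminator.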
We infer latent values Z ∼ E(X) = q(Z|X) for reconstruction, and sample random values Z ∼ p(Z) from a standard normal for random image generation using the generator network, G. Similar to VAE/GAN and DeePSiM (Dosovitskiy & Brox, 2016), we use three distinct loss functions:

• L_img, the L1 pixel-wise reconstruction loss, which we prefer to the L2 reconstruction loss for its higher average gradient.
• L_feature, the feature-wise reconstruction loss, evaluated as the L2 difference between the original and reconstruction in the space of the hidden layers of the discriminator.
• L_adv, the ternary adversarial loss, a modification of the adversarial loss that forces the discriminator to label a sample as real, generated, or reconstructed (as opposed to a binary real vs. generated label).

Including the VAE's KL divergence between the inferred latents E(X) and the prior p(Z), the loss function for the generator and encoder network is thus:

L_{E,G} = λ_adv L_Gadv + λ_img L_img + λ_feature L_feature + D_KL(E(X) || p(Z))    (3)

where the λ terms weight the relative importance of each loss. We set λ_img to 3 and leave the other terms at 1. The discriminator is updated solely using the ternary adversarial loss. During each training step, the generator produces reconstructions G(E(X)) (using the standard VAE reparameterization trick) from data X and random samples G(Z), while the discriminator observes X as well as the reconstructions and random samples, and both networks are simultaneously updated.

¹ https://www.youtube.com/watch?v=FDELBFSeqQs

3.1 FEATURE-WISE LOSS

We compare reconstructions using the intermediate activations, f(G(E(X))), of all convolutional layers of the discriminator, mirroring the perceptual losses of Discriminative Regularization (Lamb et al., 2016), VAE/GAN (Larsen et al., 2015), and DeePSiM (Dosovitskiy & Brox, 2016). We note that Feature Matching (Salimans et al., 2016) is designed to operate in a similar fashion, but without the guidance of an inference mechanism to match latent values Z to particular values of f(G(Z)). We find that using this loss to complement the pixel-wise difference results in sharper reconstructions that better preserve high-frequency features and edges.

3.2 TERNARY ADVERSARIAL LOSS

The standard GAN discriminator network is trained using an implicit label source (real vs. fake); noting the success of augmenting the discriminator's objective with supervised labels (Odena et al., 2016), we seek additional sources of implicit labels, in the hopes of achieving similar improvements. The ternary loss provides an additional source of supervision to the discriminator by asking it to determine if a sample is real, generated, or a reconstruction, while the generator's goal is still to have the discriminator assign a high "real" probability to both samples and reconstructions. We thus modify the discriminator to have three output units with a softmax nonlinearity, and train it to minimize the categorical cross-entropy:

L_Dadv = −log(D_real(X)) − log(D_generated(G(Z))) − log(D_reconstructed(G(E(X))))    (4)

where each D term in Equation 4 indicates the discriminator output unit assigned to each label class.
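Equation 4 is an ordinary three-way softmax cross-entropy over implicit labels; it can be sketched in numpy as follows (the function names and logit values are ours, for illustration only):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ternary_d_loss(logits_real, logits_gen, logits_rec):
    """Eq. 4: -log D_real(X) - log D_generated(G(Z)) - log D_reconstructed(G(E(X))).

    Each argument holds the discriminator's 3-way logits (real, generated,
    reconstructed) for the corresponding batch of inputs, shape (batch, 3).
    """
    p_real = softmax(logits_real)[..., 0]  # probability assigned to "real"
    p_gen = softmax(logits_gen)[..., 1]    # probability assigned to "generated"
    p_rec = softmax(logits_rec)[..., 2]    # probability assigned to "reconstructed"
    return -(np.log(p_real) + np.log(p_gen) + np.log(p_rec)).mean()
```

When the discriminator is maximally uncertain (uniform logits), each term contributes log 3, so the loss starts near 3 log 3 and falls as the three sources become separable.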
The generator is trained to produce outputs that maximize the probability of the label "real" being assigned by the discriminator by minimizing L_Gadv:

L_Gadv = −log(D_real(G(Z))) − log(D_real(G(E(X))))    (5)

We posit that this loss helps maintain the balance of power early in training by preventing the discriminator from learning a small subset of features (e.g. artifacts in the generator's output) that distinguish real and generated samples, reducing the range of useful features the generator can learn from the discriminator. We also find that this loss leads to higher sample quality, perhaps because the additional source of supervision leads to the discriminator ultimately learning a richer feature space.

3.3 ARCHITECTURE

Our model has the same basic structure as DCGAN (Radford et al., 2015), augmented with Multiscale Dilated Convolution (MDC) blocks in the generator, and Minibatch Discrimination (Salimans et al., 2016) in the discriminator. As in (Radford et al., 2015), we use Batch Normalization (Ioffe & Szegedy, 2015) and Adam (Kingma & Ba, 2014) in both networks. All of our code is publicly available.²

3.4 MULTISCALE DILATED CONVOLUTION BLOCKS

We propose a novel Inception-style (Szegedy et al., 2016) convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network's expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network's receptive field. The Multiscale Dilated Convolution (MDC) block applies a single FxF filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter's output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network's receptive field without requiring an increase in depth or the number of parameters. Dilated convolutions have previously been successfully applied in semantic segmentation (Yu & Koltun, 2016), and a similar scheme, minus the parameter sharing, is proposed in (Chen et al., 2016).

² https://github.com/ajbrock/Neural-Photo-Editor

Figure 4: (a) Multiscale Dilated Convolution Block. (b) Visualizing a 3d3 MDC filter composition.

As shown in Figure 4(a), each block is parameterized by a bank of N FxF filters W, applied with S factors of dilation, and a set of N*S scalars k, which relatively weight the output of each filter at each scale. This is naturally and efficiently implemented by reparameterizing a sparsely populated F+(S-1)*(F-1) filterbank, as displayed in Figure 4(b). We propose two variants: Standard MDC, where the filter weights are tied to a base W, and Full-Rank MDC, where filters are given the sparse layout of Figure 4(b) but the weights are not tied. Selecting Standard versus Full-Rank MDC blocks allows for a design tradeoff between parametric efficiency and model flexibility. In our architecture, we replace the hidden layers of the generator with Standard MDC blocks, using F=5 and D=2; we specify MDC blocks by their base filter size and their maximum dilation factor (e.g. 5d2).

3.5 ORTHOGONAL REGULARIZATION

Orthogonality is a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding. We note the success of initializing weights with orthogonal matrices (Saxe et al., 2014), and posit that maintaining orthogonality throughout training is also desirable.
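The Standard MDC filter composition of Section 3.4 can be sketched in numpy: a shared base FxF filter W is expanded at each dilation factor, scaled by its per-scale weight k, and summed into a single sparse composite kernel of size F+(S−1)*(F−1). The centered alignment of the dilated copies is our assumption; the paper specifies only the composite layout of Figure 4(b).

```python
import numpy as np

def dilate_kernel(W, d):
    """Insert d-1 zeros between the taps of an FxF kernel (dilation factor d)."""
    F = W.shape[0]
    out = np.zeros((F + (d - 1) * (F - 1),) * 2)
    out[::d, ::d] = W
    return out

def mdc_kernel(W, k):
    """Standard MDC: one FxF filter W shared across S = len(k) dilation factors,
    combined with per-scale scalars k into one sparse composite kernel."""
    S = len(k)
    size = W.shape[0] + (S - 1) * (W.shape[0] - 1)
    out = np.zeros((size, size))
    for d in range(1, S + 1):
        kd = dilate_kernel(W, d) * k[d - 1]
        pad = (size - kd.shape[0]) // 2        # center each dilated copy (assumed)
        out[pad:pad + kd.shape[0], pad:pad + kd.shape[0]] += kd
    return out
```

A single convolution with the composite kernel then computes the weighted sum over all scales in one pass; the Full-Rank variant would keep the same sparse layout but untie the weights at each scale.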
To this end, we propose a simple weight regularization technique, Orthogonal Regularization, that encourages weights to be orthogonal by pushing them towards the nearest orthogonal manifold. We augment our objective with the cost:

L_ortho = Σ(|WW^T − I|)    (6)

where Σ indicates a sum across all filter banks, W is a filter bank, and I is the identity matrix.

4 RELATED WORK

Our architecture builds directly off of previous VAE/GAN hybrids (Larsen et al., 2015; Dosovitskiy & Brox, 2016), with the key difference being our combination of the discriminator and the encoder to improve computational and parametric efficiency (by reusing discriminator features) as well as reconstruction accuracy (as demonstrated in our CelebA ablation studies). The methods of ALI (Dumoulin et al., 2016) and BiGAN (Donahue et al., 2016) provide an orthogonal approach to GAN inference, in which an inference network is trained by an adversarial (as opposed to a variational) process.

Figure 5: CelebA and SVHN samples.

The method of iGAN (Zhu et al., 2016) bears the most relation to our interface. The iGAN interface allows a user to impose shape or color constraints on an image of an object through use of a brush tool, then optimizes to solve for the output of a DCGAN (Radford et al., 2015) which best satisfies those constraints. Photorealistic edits are transferred to existing images via motion and color flow estimation. Both iGAN and the Neural Photo Editor turn coarse user input into refined outputs through use of a generative model, but the methods differ in several key ways. First, we focus on editing portraits, rather than objects such as shoes or handbags, and are thus more concerned with modifying features, as opposed to overall color or shape, for which our method is less well-suited.
Our edit transfer technique follows this difference as well: we directly transfer the local image changes produced by the model back onto the original image, rather than estimating and mimicking motion and color flow. Second, our interface applies user edits one step at a time, rather than iteratively optimizing the output. This highlights the difference in design approaches: iGAN seeks to produce outputs that best match a given set of user constraints, while we seek to allow a user to guide the latent space traversal. Finally, we explicitly tailor our model design to the task at hand and jointly train an inference network which we use at test time to produce reconstructions in a single shot. In contrast, iGAN trains an inference network to minimize the L2 loss after training the generator network, and uses the inference network to get an initial estimate of the inferred latents, which are then iteratively optimized.

Another related interface (Champandard, 2016) refines simple user input into complex textures through use of artistic style transfer (Gatys et al., 2015). Other related work (White, 2016) also circumvents the need for labeled attributes by constructing latent vectors by analogy and bias-correcting them.

5 EXPERIMENTS

We qualitatively evaluate the IAN on 64x64 CelebA (Liu et al., 2015), 32x32 SVHN (Netzer et al., 2011), 32x32 CIFAR-10 (Krizhevsky & Hinton, 2009), and 64x64 Imagenet (Russakovsky et al., 2015). Our models are implemented in Theano (Theano Development Team, 2016) with Lasagne (Dieleman et al., 2015). Samples from the IAN, randomly selected and shown in Figure 5, display the visual fidelity typical of adversarially trained networks. The IAN demonstrates high quality reconstructions on previously unseen data, shown in Figure 6, and smooth, plausible interpolations, even between drastically different samples. CIFAR and Imagenet samples, along with additional comparisons to samples from other models, are available in the appendix.
5.1 DISCRIMINATIVE EXPERIMENTS

We quantitatively demonstrate the effectiveness of our MDC blocks and Orthogonal Regularization on the CIFAR-100 (Krizhevsky & Hinton, 2009) benchmark. Using standard data augmentation, we train a set of 40-layer, k=12 DenseNets (Huang et al., 2016) for 50 epochs, annealing the learning rate at 25 and 37 epochs. We add varying amounts of Orthogonal Regularization and modify the standard DenseNet architecture by replacing every 3x3 filterbank with 3d3 MDC blocks, and report the test error after training in Table 1. In addition, we compare to performance using full 7x7 filters.

Figure 6: CelebA and SVHN Reconstructions and Interpolations. The outermost images are originals; the adjacent images are reconstructions.

There is a noticeable increase in performance with the progressive addition of our modifications, despite a negligible increase in the number of parameters. Adding Orthogonal Regularization improves the network's generalization ability; we suspect this is because it encourages the filter weights to remain close to a desirable, non-zero manifold, increasing the likelihood that all of the available model capacity is used by preventing the magnitude of the weights from overly diminishing. Replacing 3x3 filters with MDC blocks yields additional performance gains; we suspect this is due to an increase in the expressive power and receptive field of the network, allowing it to learn longer-range dependencies with ease. We also note that substituting Full-Rank MDC blocks into a 40-layer DenseNet improves performance by a relative 5%, with the only increased computational cost coming from using the larger filters.

For use in evaluating the IAN, we additionally train 40-layer, k=12 DenseNets on the CelebA attribute classification task with varying amounts of Orthogonal Regularization.
A plot of the train and validation error during training is available in Figure 7. The addition of Orthogonal Regularization improves the validation error from 6.55% to 4.22%, further demonstrating its utility.

5.2 EVALUATING MODIFICATIONS

For use in editing photos, a model must produce reconstructions which are photorealistic and feature-aligned, and have smooth, plausible interpolations between outputs. We perform an ablation study to investigate the effects of our proposals, and employ several metrics to evaluate model quality given these goals. In this study, we progressively add modifications to a VAE/GAN (Larsen et al., 2015) baseline, and train each network for 50 epochs.

For reconstruction accuracy, pixel-wise distance does not tend to correlate well with perceptual similarity. In addition to pixel-wise L2 distance, we therefore compare model reconstruction accuracy in terms of:

• Feature-wise L2 distance in the final layer of a 40-layer k=12 DenseNet trained for the CelebA attribute classification task.
• Trait reconstruction error. We run our classification DenseNet to predict a binary attribute vector y(X) given an image X, and y(G(E(X))) given a model's reconstruction, then measure the percent error.
• Fiducial keypoint error, measured as the mean L2 distance between the facial landmarks predicted by the system of (Sankaranarayanan et al., 2016).

Model                                      | # Params | MDC  | Ortho. Reg. | Error (%)
Baseline DenseNet (D=40, K=12)             | 1.0M     | ✗    | ✗           | 26.71
DenseNet with Ortho. Reg.                  | 1.0M     | ✗    | 1e-3        | 26.51
DenseNet with Ortho. Reg.                  | 1.0M     | ✗    | 1e-1        | 26.46
DenseNet with 7x7 Filters                  | 5.0M     | ✗    | ✗           | 26.39
DenseNet with 3d3 MDC                      | 1.0M     | ✓    | ✗           | 26.02
DenseNet with Ortho. Reg. & MDC            | 1.0M     | ✓    | 1e-3        | 25.72
DenseNet with Ortho. Reg. & MDC            | 1.0M     | ✓    | 1e-1        | 25.39
DenseNet (Huang et al., 2016), 300 epochs  | 1.0M     | ✗    | ✗           | 24.42
DenseNet with Full MDC, 300 epochs         | 2.8M     | full | ✗           | 23.30

Table 1: Error rates on CIFAR-100+ after 50 epochs.

MDC  Ortho. Reg.  Ternary | Pixel  Feature  Trait (%)  Keypoint  Inception
VAE/GAN Baseline          | 0.295  4.86     0.197      2.21      1389 (±64)
 ✗       ✗         ✗      | 0.285  4.76     0.189      2.11      1772 (±37)
 ✗       ✓         ✗      | 0.258  4.67     0.182      1.79      2160 (±70)
 ✓       ✗         ✗      | 0.248  4.69     0.172      1.54      2365 (±97)
 ✓       ✓         ✗      | 0.230  4.39     0.165      1.47      3158 (±98)
 ✗       ✗         ✓      | 0.254  4.60     0.177      1.67      2648 (±69)
 ✗       ✓         ✓      | 0.239  4.51     0.164      1.57      3161 (±70)
 ✓       ✗         ✓      | 0.221  4.37     0.158      0.99      3300 (±123)
 ✓       ✓         ✓      | 0.192  4.33     0.155      0.97      3627 (±146)

Table 2: CelebA investigations.

Gauging the visual quality of the model's outputs is notoriously difficult, but the Inception score recently proposed by (Salimans et al., 2016) has been found to correlate positively with human-evaluated sample quality. Using our CelebA attribute classification network in place of the Inception (Szegedy et al., 2016) model, we compare the Inception score of each model evaluated on 50,000 random samples. We posit that this metric is also indicative of interpolation quality, as a high visual quality score on a large sample population suggests that the model's output quality remains high regardless of the state of the latent space.

Results of this ablation study are presented in Table 2; samples and reconstructions from each configuration are available in the appendix, along with comparisons between a fully-trained IAN and related models. As with our discriminative experiments, we find that the progressive addition of modifications results in consistent performance improvements across our reconstruction metrics and the Inception score.
We note that the single largest gains come from the inclusion of MDC blocks, suggesting that the network's receptive field is a critical aspect of network design for both generative and discriminative tasks, with an increased receptive field correlating positively with reconstruction accuracy and sample quality. The improvements from Orthogonal Regularization suggest that encouraging weights to lie close to the orthogonal manifold is beneficial for improving the sample and reconstruction quality of generative neural networks by preventing learned weights from collapsing to an undesirable manifold; this is consistent with our experience iterating through network designs, where we have found mode collapse to occur less frequently while using Orthogonal Regularization. Finally, the increase in sample quality and reconstruction accuracy through use of the ternary adversarial loss suggests that including the "reconstructed" target in the discriminator's objective does lead to the discriminator learning a richer feature space. This accords with our observation that, when training with the ternary loss, the generator and discriminator losses tend to be more balanced than when training with the standard binary loss.

Method                                  | Error rate
VAE (M1 + M2) (Kingma et al., 2014)     | 36.02%
SWWAE with dropout (Zhao et al., 2015)  | 23.56%
DCGAN + L2-SVM (Radford et al., 2015)   | 22.18% (±1.13%)
SDGM (Maaløe et al., 2016)              | 16.61% (±0.24%)
ALI (L2-SVM) (Dumoulin et al., 2016)    | 19.14% (±0.50%)
IAN (ours, L2-SVM)                      | 18.50% (±0.38%)
IAN (ours, Improved-GAN)                | 8.34% (±0.91%)
Improved-GAN (Salimans et al., 2016)    | 8.11% (±1.3%)
ALI (Improved-GAN)                      | 7.3%

Table 3: Error rates on Semi-Supervised SVHN with 1000 training examples.

Figure 7: Performance on CelebA Classification task with varying Orthogonal Regularization.

5.3 SEMI-SUPERVISED LEARNING WITH SVHN

We quantitatively evaluate the inference abilities of our architecture by applying it to the semi-supervised SVHN classification task using two different procedures. We first evaluate using the procedure of (Radford et al., 2015) by training an L2-SVM on the output of the FC layer of the encoder subnetwork, and report average test error and standard deviation across 100 different SVMs, each trained on 1000 random examples from the training set. Next, we use the procedure of (Salimans et al., 2016), where the discriminator outputs a distribution over the K object categories and an additional "fake" category, for a total of K+1 outputs. The discriminator is trained to predict the category when given labeled data, to assign the "fake" label when provided data from the generator, and to assign k ∈ {1, ..., K} when provided unlabeled real data. We modify feature-matching based Improved-GAN to include the encoder subnetwork and reconstruction losses detailed in Section 3, but do not include the ternary adversarial loss.

Our performance, as shown in Table 3, is competitive with other networks evaluated in these fashions, achieving 18.5% mean classification error when using SVMs and 8.34% error when using the method of Improved-GAN. When using SVMs, our method tends to demonstrate improvement over previous methods, particularly over standard VAEs. We believe this is due to the encoder subnetwork being based on more descriptive features (i.e. those of the discriminator), and therefore better suited to discriminating between SVHN classes.
We find the lack of improvement when using the method of Improved-GAN unsurprising, as the IAN architecture does not change the goal of the discriminator; any changes in behavior are thus indirectly due to changes in the generator, whose loss is only slightly modified from feature-matching Improved-GAN.

6 CONCLUSION

We introduced the Neural Photo Editor, a novel interface for exploring the learned latent space of generative models and for making specific semantic changes to natural images. Our interface makes use of the Introspective Adversarial Network, a hybridization of the VAE and GAN that outputs high-fidelity samples and reconstructions, and achieves competitive performance in a semi-supervised classification task. The IAN makes use of Multiscale Dilated Convolution Blocks and Orthogonal Regularization, two improvements designed to improve model expressivity and feature quality for convolutional networks.

ACKNOWLEDGMENTS

This research was made possible by grants and support from Renishaw plc and the Edinburgh Centre for Robotics. The work presented herein is also partially funded under the European H2020 Programme BEACONING project, Grant Agreement no. 687676.

REFERENCES

A.J. Champanard. Semantic style transfer and turning two-bit doodles into fine artwork. arXiv preprint arXiv:1603.01768, 2016.
L-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A.L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S.K. Sønderby, D. Nouri, and E. Battenberg. Lasagne: First release, 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
A. Dosovitskiy and T. Brox.
Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.0070, 2016.
L.A. Gatys, A.S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
G. Huang, Z. Liu, K.Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML 2015, 2015.
D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
D.P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR 2014, 2014.
D.P. Kingma, S. Mohamed, D.J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
A. Lamb, V. Dumoulin, and A. Courville. Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220, 2016.
A.B.L. Larsen, S.K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
L. Maaløe, C.K. Sønderby, S.K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
Y.
Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 4, Granada, Spain, 2011.
A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
S. Sankaranarayanan, R. Ranjan, C.D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. arXiv preprint arXiv:1611.00851, 2016.
A.M. Saxe, J.L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR 2014, 2014.
C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
The Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
T. White. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.
F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR 2016, 2016.
J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.
J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A.A. Efros.
Generative visual manipulation on the natural image manifold. In ECCV 2016, 2016.

APPENDIX: ADDITIONAL VISUAL COMPARISONS

Figure 7: Comparing samples from different models. From top: VAE (Kingma & Welling, 2014), DCGAN (Radford et al., 2015), VAE/GAN (Larsen et al., 2015), ALI (Dumoulin et al., 2016), IAN (ours).

Table 4: Reconstructions and samples from the CelebA ablation study. Each row shows two reconstructions and three samples (images omitted here, along with the Original and VAE/GAN baseline references) for one configuration of the three components:

MDC   Ortho. Reg.   Ternary
 ✗         ✗           ✗
 ✗         ✓           ✗
 ✓         ✗           ✗
 ✓         ✓           ✗
 ✗         ✗           ✓
 ✗         ✓           ✓
 ✓         ✗           ✓
 ✓         ✓           ✓

Figure 8: Samples, reconstructions, and interpolations on CIFAR-10. Top three rows: samples; bottom three rows: reconstructions and interpolations. Our model achieves an Inception score of 6.88 (±0.08), on par with the 6.86 (±0.06) achieved by Improved-GAN with historical averaging.

Figure 9: Samples, reconstructions, and interpolations on ImageNet. Top three rows: samples; bottom three rows: reconstructions and interpolations. Our model achieves an Inception score of 8.56 (±0.09).
