ZM-Net: Real-time Zero-shot Image Manipulation Network


Authors: Hao Wang, Xiaodan Liang, Hao Zhang, Dit-Yan Yeung, Eric P. Xing

Hao Wang¹, Xiaodan Liang², Hao Zhang²˒³, Dit-Yan Yeung¹, and Eric P. Xing³
¹Hong Kong University of Science and Technology, ²Carnegie Mellon University, ³Petuum Inc.
{hwangaz, dyyeung}@cse.ust.hk, {xiaodan1, hao}@cs.cmu.edu, eric.xing@petuum.com

Figure 1: Example results from our Zero-shot image Manipulation Network (ZM-Net), which can manipulate images guided by any personalized signal in real time. Row 2: zero-shot style transfer guided by the different landscape paintings in Row 1. Row 3: image manipulation conditioned on descriptive attributes; from left to right are the descriptive attributes, the input image, and the five transformed images corresponding to the text 'noon', 'afternoon', 'morning', 0.5 'morning' + 0.5 'night', and 'night', respectively.

Abstract

Many problems in image processing and computer vision (e.g. colorization, style transfer) can be posed as "manipulating" an input image into a corresponding output image given a user-specified guiding signal. A holy-grail solution towards generic image manipulation should be able to efficiently alter an input image with any personalized signal (even signals unseen during training), such as diverse paintings and arbitrary descriptive attributes. However, existing methods are either inefficient at simultaneously processing multiple signals (let alone generalizing to unseen signals) or unable to handle signals from other modalities. In this paper, we make the first attempt to address the zero-shot image manipulation task. We cast this problem as manipulating an input image according to a parametric model whose key parameters can be conditionally generated from any guiding signal (even unseen ones).
To this end, we propose the Zero-shot Manipulation Net (ZM-Net), a fully-differentiable architecture that jointly optimizes an image-transformation network (TNet) and a parameter network (PNet). The PNet learns to generate key transformation parameters for the TNet given any guiding signal, while the TNet performs fast zero-shot image manipulation according to both the signal-dependent parameters from the PNet and the signal-invariant parameters from the TNet itself. Extensive experiments show that ZM-Net can perform high-quality image manipulation conditioned on different forms of guiding signals (e.g. style images and attributes) in real time (tens of milliseconds per image), even for unseen signals. Moreover, a large-scale style dataset with over 20,000 style images is also constructed to promote further research.

1. Introduction

Image manipulation, which aims to manipulate an input image based on personalized guiding signals expressed in diverse modalities (e.g. art paintings or text attributes), has recently attracted ever-growing research interest and given rise to various real-world applications, such as attribute-driven image editing and artistic style transfer (e.g. Prisma). An image manipulation model is usually deployed on a variety of devices, ranging from GPU desktops to mobile phones. For such a solution to be applicable, we argue that it must meet three requirements. First, the model should be zero-shot: it must immediately capture the intrinsic manipulation principles conveyed by the guiding signal and apply them to the target image, without retraining a distinct model for every user input. Second, to support downstream mobile applications, inference on a target image should be efficient (regardless of whether it runs on a remote server or a local device), so that the user can immediately obtain the desired output without waiting seconds to minutes.
Third, a personalized guiding signal can come in different forms: it could be an artistic style conveyed by an art painting (Fig. 1, second row), descriptive phrases typed in by the user (Fig. 1, third row), or even a speech instruction. It is therefore preferable that the model be capable of receiving arbitrary guiding signals in multiple modalities.

A variety of relevant approaches have been developed towards the goal of real-time zero-shot image manipulation. Existing approaches, such as [4, 19, 16, 3, 17, 25], mainly focus on training transformation networks that correspond to a small set of guiding signals, such as a few art paintings. Among them, some CNN-based methods can process images (nearly) in real time [9]; however, each of their networks is tied to one specific guiding signal (e.g. a single style image) and cannot generalize to unseen types specified by users, short of retraining as many networks as there are guiding signals, which is computationally prohibitive. Although some recent approaches [12] try to encode multiple styles within a single network, they fail to perform zero-shot style transfer and cannot process guiding signals in real time or from distinct modalities (e.g. text attributes).

In this paper, we make, to the best of our knowledge, the first attempt to explore the real-time zero-shot image manipulation task. This task is challenging because the model must exploit and transform diverse and complex patterns from arbitrary guiding signals into transformation parameters, and perform the image manipulation in real time. To this end, we propose a novel Zero-shot Manipulation Net (ZM-Net) that combines a parameter network (PNet) and an image-transformation network (TNet) into an end-to-end framework.
The PNet is a generic model that produces a hierarchy of key transformation parameters, while the TNet takes these generated parameters, combines them with its own signal-invariant parameters, and generates a new image. For image style transfer, the PNet can embed any style image into the hierarchical parameters, which the TNet uses to transform the content image into a stylized image. We show that ZM-Net can digest over 20,000 style images in a single network, rather than training one network per style as most previous methods do. It can also be trained to process guiding signals in other forms, such as descriptive attributes. Moreover, with its ability to perform fast zero-shot manipulation, ZM-Net can generate an animation from a single image in real time (tens of milliseconds per image) even though the model is trained on images rather than videos.

In summary, our main contributions are as follows: (1) To the best of our knowledge, this is the first scalable solution to the real-time zero-shot image manipulation task: the proposed ZM-Net is able to digest over 20,000 style images with a single model and perform zero-shot style transfer in real time. Interestingly, even in the zero-shot setting (no retraining/finetuning), ZM-Net can still generate images whose quality is comparable to previous methods that must retrain models for new style images. (2) ZM-Net can handle more general image manipulation tasks (beyond style transfer) with different forms of guiding signals (e.g. text attributes). (3) Using a small set of 984 seed style images, we construct a much larger dataset of 23,307 style images with far more content diversity. Experiments show that training on this dataset decreases the testing loss by nearly half.

2. Related Work

A great deal of research effort has been devoted to the image manipulation task. The most common and efficient approach is to train a convolutional neural network (CNN) that directly outputs a transformed image for the input content image [4, 19, 16, 3, 17, 25, 15]. For example, in [1, 24] a CNN is trained to colorize input images, and in [9, 22, 2, 12] to transform content images according to specific styles. Although the most recent method [9] can process images (nearly) in real time, it has to train a separate network for each specific type of manipulation (e.g. a specific style image in style transfer) and cannot generalize to other types of manipulation (new style images or other forms of guiding signals) without retraining the model for every type, which usually takes several hours and prevents these methods from scaling to real-world applications. One of the works most relevant to ours [12] tries to encode multiple styles within a single network; however, their model focuses on increasing the diversity of output images and is still unable to handle diverse and unseen guiding signals from distinct modalities (e.g. text attributes). On the other hand, several iterative approaches [8, 7, 6, 13] have been proposed to manipulate images, either patch by patch [8, 6] or by iteratively updating the input image over hundreds of refinement steps [7]. Although these methods require no additional training for each new guiding signal, the iterative evaluation process usually takes tens of seconds even with GPU acceleration [9], which can be impractical, especially for online users.

3. Real-time Zero-shot Image Manipulation

In this section we first review the pipelines of current state-of-the-art CNN-based methods and discuss their limitations in the zero-shot setting. Then we present the Zero-shot Manipulation Network (ZM-Net), a unified network structure that jointly optimizes a parameter network (PNet) and an image-transformation network (TNet).

Figure 2: An image transformation network with a fixed loss network, as described in [9]. For the style transfer task, the guiding signal is a style image. Note that one transformation network works for only one style.

3.1. Image Manipulation with CNNs

An image manipulation task [9, 7] can be formally defined as follows: given a content image X_c ∈ R^{H×W×3} and a guiding signal (e.g. a style image) X_s ∈ R^{H×W×3}, output a transformed image Y ∈ R^{H×W×3} such that Y is similar to X_c in content and simultaneously similar to X_s in style. Learning effective representations of content and style is hence essential for plausible image manipulation. As in [7], using a fixed deep CNN φ(·), the feature maps φ_l(X) ∈ R^{C_l×H_l×W_l} in layer l can represent the content of image X, and the Gram matrix of φ_l(X), denoted G(φ_l(X)) ∈ R^{C_l×C_l} and computed as

G(φ_l(X))_{c,c'} = Σ_{h=1}^{H_l} Σ_{w=1}^{W_l} φ_l(X)_{c,h,w} φ_l(X)_{c',h,w}    (1)

can express the style patterns of image X. Two images are assessed to be similar in content or style if the difference between the corresponding representations (i.e. φ_l(X) or G(φ_l(X))) has a small Frobenius norm. Therefore, we can train a feedforward image transformation network Y = T(X_c), typically a deep CNN, with the loss function

L = λ_s L_s(Y) + λ_c L_c(Y),
L_s(Y) = Σ_{l∈S} (1/Z_l²) ‖G(φ_l(Y)) − G(φ_l(X_s))‖_F²,
L_c(Y) = Σ_{l∈C} (1/Z_l) ‖φ_l(Y) − φ_l(X_c)‖_F²,    (2)

where L_s(Y) is the style loss for the generated image Y, L_c(Y) is the content loss, and λ_s, λ_c are hyperparameters. S is the set of "style layers", C is the set of "content layers", and Z_l is the total number of neurons in layer l [9].
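As a concrete illustration, the Gram-matrix style representation of Equation (1) and a single-layer version of the losses in Equation (2) can be sketched as follows. This is our own minimal NumPy sketch (the function names and the choice of NumPy over a GPU framework are illustration choices, not the authors' implementation):

```python
import numpy as np

def gram_matrix(feat):
    # feat: feature map of shape (C, H, W).
    # Returns G of shape (C, C) with G[c, c'] = sum over (h, w) of
    # feat[c, h, w] * feat[c', h, w], as in Eq. (1).
    C, H, W = feat.shape
    F = feat.reshape(C, H * W)
    return F @ F.T

def layer_loss(feat_y, feat_c, feat_s, lam_s=1.0, lam_c=1.0):
    # Single-layer version of Eq. (2): the style term is the squared
    # Frobenius distance between Gram matrices scaled by 1/Z^2, the
    # content term the distance between raw feature maps scaled by 1/Z,
    # where Z is the number of neurons in the layer.
    Z = feat_y.size
    L_s = np.sum((gram_matrix(feat_y) - gram_matrix(feat_s)) ** 2) / Z ** 2
    L_c = np.sum((feat_y - feat_c) ** 2) / Z
    return lam_s * L_s + lam_c * L_c
```

In the full loss, these per-layer terms are summed over the style layers S and content layers C, with the features φ_l(·) taken from the fixed loss network.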
After the transformation network T(·) is trained, given a new content image X'_c we can generate the stylized image Y = T(X'_c) without using the loss network; Figure 2 shows an overview of this model. Note that the computation of φ_l(·) is defined by a fixed loss network (e.g. a 16-layer VGG network [21] pretrained on ImageNet [20]), while the transformation network T(·) is learned given a set of training content images and a style image. Although performing image manipulation with a single feedforward CNN pass is usually three orders of magnitude faster than the optimization-based methods of [7], this approach [9] is largely restricted by the fact that a single transformation network is tied to one specific style image, meaning that N separate networks have to be trained to enable transfer from N style images. The disadvantages are obvious: (1) it is time-consuming to train N separate networks; (2) storing N networks requires far more memory, which is impractical for mobile devices; (3) it is not scalable and cannot generalize to new styles (a new model must be trained for every new incoming style).

3.2. ZM-Net

To address the aforementioned problems and enable real-time zero-shot image manipulation, we propose a general architecture, ZM-Net, that combines an image-transformation network (TNet) and a parameter network (PNet). Different from prior works that adopt only a TNet to transform images, we train an extra parameter network (PNet) to produce key parameters of the TNet conditioned on the guiding signals (e.g. style images). As the parameters are generated on the fly given arbitrary guiding signals, ZM-Net avoids training and storing many different sets of network parameters for distinct signals, as prior works must. Moreover, as the PNet learns to embed the guiding signal into a shared space, ZM-Net is able to perform zero-shot image manipulation given unseen guiding signals.
Here we generalize the notion of style images (in style transfer) to guiding signals (in general image manipulation tasks); i.e. the input X_s can be any guiding signal beyond style images, for example, word embeddings that express descriptive attributes in order to impose specific semantics on the input image X_c (Section 4), or color histograms (vectors representing the pixel color distribution) to guide the colorization of X_c. In the following, we first present the design of a TNet with our proposed dynamic instance normalization based on [23], then introduce the PNet and its variants, the serial PNet and the parallel PNet.

3.2.1 TNet with Dynamic Instance Normalization

To enable zero-shot image manipulation, we must design a principled way to dynamically specify the network parameters of the TNet during testing, so that it can handle unseen signals. A naive approach would be to directly generate the filters of the TNet from feature maps of the PNet conditioned on the guiding signal X_s. However, in practice, each layer of the TNet typically has over 100,000 parameters (e.g. 128 × 128 × 3 × 3), while the feature maps in each layer of the PNet usually have about 1,000,000 entries (e.g. 128 × 80 × 80); it is thus difficult to efficiently transform one such high-dimensional vector into another.

Figure 3: An overview of the serial architecture (left) and the parallel architecture (right) of our ZM-Net. Details of the loss network are the same as in Figure 2 and are omitted here.

Inspired by [2], we instead dynamically augment the instance normalization (performed after each convolutional layer in the TNet) [23] with scaling and shifting parameters γ(X_s) and β(X_s) produced by the PNet. These scaling and shifting factors are treated as the key parameters of each layer of the TNet. Formally, let x ∈ R^{C_l×H_l×W_l} be a tensor before instance normalization.
Let x_{ijk} denote the ijk-th element, where i indexes the feature maps and j, k span the spatial dimensions. The output y ∈ R^{C_l×H_l×W_l} of our dynamic instance normalization (DIN) is then computed as (the layer index l is omitted):

y_{ijk} = ((x_{ijk} − μ_i) / √(σ_i² + ε)) γ_i(X_s) + β_i(X_s),    (3)
μ_i = (1/(HW)) Σ_{j=1}^{H} Σ_{k=1}^{W} x_{ijk},
σ_i² = (1/(HW)) Σ_{j=1}^{H} Σ_{k=1}^{W} (x_{ijk} − μ_i)²,

where μ_i is the average value in feature map i and σ_i² is the corresponding variance. γ_i(X_s) is the i-th element of the C_l-dimensional vector γ(X_s) generated by the PNet, and similarly for β_i(X_s). If γ_i(X_s) = 1 and β_i(X_s) = 0, DIN degenerates to vanilla instance normalization [23]. If γ_i(X_s) = γ_i and β_i(X_s) = β_i, they become directly learnable parameters independent of the PNet, and DIN degenerates to the conditional instance normalization (CIN) of [2]. In both cases, the model loses its ability to do zero-shot learning and therefore cannot generalize to unseen signals.

The PNet that generates γ(X_s) and β(X_s) can be a CNN, a multilayer perceptron (MLP), or even a recurrent neural network (RNN). We use a CNN and an MLP as the PNet in Section 4 to demonstrate the generality of ZM-Net. Since content images and guiding signals are inherently different, the input pair for image manipulation is non-exchangeable, making this problem much harder than typical problems, such as image matching, whose input image pair is exchangeable. Due to this non-exchangeability, the connection between the TNet and the PNet should be asymmetric.

3.2.2 Parameter Network (PNet)

To drive the TNet with dynamic instance normalization, the PNet can have either a serial or a parallel architecture.

Serial PNet. In a serial PNet, one can use a deep CNN, with a structure similar to the TNet, to generate γ^{(l)}(X_s) and β^{(l)}(X_s) at layer l.
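The DIN operation of Equation (3) can be sketched as follows; this is our own illustrative NumPy code, not the authors' implementation, with eps playing the role of the small constant ε inside the square root:

```python
import numpy as np

def dynamic_instance_norm(x, gamma, beta, eps=1e-5):
    # x: pre-normalization tensor of shape (C, H, W).
    # gamma, beta: C-dimensional vectors gamma(X_s), beta(X_s)
    # produced by the PNet for this layer.
    mu = x.mean(axis=(1, 2), keepdims=True)    # per-channel mean mu_i
    var = x.var(axis=(1, 2), keepdims=True)    # per-channel variance sigma_i^2
    x_hat = (x - mu) / np.sqrt(var + eps)      # instance-normalized activations
    # Scale and shift with signal-dependent parameters; with gamma = 1 and
    # beta = 0 this reduces to vanilla instance normalization.
    return gamma[:, None, None] * x_hat + beta[:, None, None]
```

Because the normalization is per feature map, each channel of the output has (approximately) mean β_i(X_s) and standard deviation γ_i(X_s), which is how the guiding signal steers the TNet's activations.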
Figure 3 (left) shows an overview of this serial architecture. In the serial PNet, γ^{(l)}(X_s) and β^{(l)}(X_s) in Equation (3) (yellow and blue boxes in Figure 3) are conditioned on the feature maps in layer l of the PNet, denoted ψ_l(X_s). Specifically,

γ^{(l)}(X_s) = ψ_l(X_s) W_γ^{(l)} + b_γ^{(l)},    (4)
β^{(l)}(X_s) = ψ_l(X_s) W_β^{(l)} + b_β^{(l)}.    (5)

If the input X_s is an image, ψ_l(X_s) can be the output of convolutional layers; if X_s is a word embedding (a vector), ψ_l(X_s) can be the output of fully connected layers. W_γ^{(l)}, b_γ^{(l)}, W_β^{(l)}, and b_β^{(l)} are parameters to learn. Note that in Equation (3), y_{ijk} with different j and k share the same β_i(X_s); this design significantly reduces the number of parameters and improves the generalization of the model. Interestingly, if we let γ_i(X_s) = 1 and replace β_i(X_s) with β_{ijk}(X_s), computed as the output of a convolutional layer with input ψ_{l−1}(X_s) followed by vanilla instance normalization, Equation (3) becomes equivalent to concatenating φ_{l−1}(X_c) and ψ_{l−1}(X_s) followed by a convolutional layer and vanilla instance normalization, as used in [12]. Our preliminary experiments show that although structures similar to [12] have sufficient model capacity to perform image manipulation given guiding signals (e.g. style images) from the training set, they generalize poorly to unseen guiding signals and cannot be used for zero-shot image manipulation.

Parallel PNet. Alternatively, one can use separate shallow networks (either fully connected or convolutional) to generate ψ_l(X_s) at each layer l, which is then used to compute γ^{(l)}(X_s) and β^{(l)}(X_s) according to Equations (4) and (5). Figure 3 (right) shows the architecture of this parallel PNet.
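Equations (4) and (5) amount to a learned affine map from the PNet's layer-l features to the per-channel DIN parameters. A minimal sketch, assuming for simplicity that ψ_l(X_s) has already been pooled or flattened into a vector (the pooling step and all variable names here are our own illustration, not the paper's implementation):

```python
import numpy as np

def din_params(psi, W_gamma, b_gamma, W_beta, b_beta):
    # psi: D-dimensional summary of the PNet's layer-l features psi_l(X_s).
    # W_gamma, W_beta: (D, C) weight matrices; b_gamma, b_beta: (C,) biases,
    # where C is the channel count of the corresponding TNet layer.
    gamma = psi @ W_gamma + b_gamma   # Eq. (4)
    beta = psi @ W_beta + b_beta      # Eq. (5)
    return gamma, beta
```

The resulting gamma and beta vectors are exactly the signal-dependent parameters consumed by the DIN operation of Equation (3).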
Different from the serial PNet, where higher-level γ^{(l)}(X_s) and β^{(l)}(X_s) are generated from higher-level ψ_l(X_s), here the transformation from X_s to γ^{(l)}(X_s) and β^{(l)}(X_s) follows a shallow, parallel structure. Our experiments (Section 4.2) show that this design limits the effectiveness of the PNet and slightly decreases the quality of the generated TNet parameters, and consequently of the generated images Y. Therefore, in Section 4 we use the serial PNet unless otherwise specified.

Figure 4: Results (Columns 3 to 6) of OST [7], FST [9], CIN [2], and a 10-style ZM-Net. Column 1 is the content image and Column 2 contains 2 of the 10 style images used during training. Golden Gate Bridge photograph by Rich Niewiroski Jr.

Figure 5: Results of a 20,938-style ZM-Net. Column 1 is the content image; Columns 2 to 6 show randomly selected training style images and the corresponding generated images.

Training and Testing. ZM-Net can be trained end-to-end with supervision from the loss network, as shown in Figure 3. During testing, the content image X_c and the guiding signal X_s are fed into the TNet and the PNet, respectively, generating the transformed image Y. Note that the loss network is not used during testing.

4. Experiments

In this section, we first demonstrate ZM-Net's capacity to digest over 20,000 style images in one single network (with a TNet and a PNet), followed by experiments showing the model's ability to perform zero-shot image manipulation (generalizing to unseen guiding signals). In another set of experiments, we use simplified word embeddings expressing descriptive attributes, rather than style images, as guiding signals to embed specific semantics in content images.
We show that with its ability to perform zero-shot learning and fast image manipulation, our model can generate an animation from a single image in real time even though the model is image-based.

Table 1: Comparison of optimization-based style transfer [7], fast style transfer [9, 22, 11, 2], and our ZM-Net. Note that ZM-Net's time cost per image is up to 0.038s the first time it processes a new style, and drops to 0.015s after that.

             [7]      [9, 22, 11, 2]   ZM-Net
Speed        15.86s   0.015s           0.015s–0.038s
Zero-shot    yes      no               yes

4.1. Fast Zero-shot Style Transfer

As shown in Table 1, current methods for fast style transfer [9, 22, 11] need to train different networks for different styles, costing substantial time (several hours per network) and memory. It is also impossible for these methods to generalize to unseen styles (zero-shot style transfer). On the other hand, although the original optimization-based style transfer (OST) method [7] is capable of zero-shot transfer, it is several orders of magnitude slower than [9, 22, 11] when generating stylized images. Our ZM-Net gets the best of both worlds, performing style transfer that is both fast and zero-shot.

Datasets. We use the MS-COCO dataset [14] for content images. For ZM-Net to generalize well to unseen styles, the style images in the training set need sufficient diversity to prevent the model from overfitting to just a few styles. Unfortunately, unlike photos, which can be massively produced, artwork such as paintings (especially famous ones) is rare and difficult to collect. To address this problem, we use the 984 Impressionist paintings in the Pandora dataset [5] as seed style images to produce a larger dataset of 23,307 style images. Specifically, we first split the 984 images into 784 training images, 100 validation images (for choosing hyperparameters), and 100 testing images.
We then randomly select a content image and an Impressionist painting from one of the three sets as input to OST [7], producing a new style image with a similar style but different content. Note that, different from traditional dataset expansion, our expansion process introduces much more content diversity into the dataset and hence prevents training from overfitting to the content of the style images. Our experiments show that using the expanded dataset rather than the original one cuts the testing loss L nearly in half (from 58342.2 to 31860.4).

Experimental Settings. For the baselines OST [7], fast style transfer (FST) [9], and CIN [2], we use the network structures and hyperparameters given in the respective papers. For our ZM-Net, we follow the network structure of [9] (with residual connections) for both the TNet and the PNet, except for the part connecting to DIN. We use a serial PNet for the style transfer task. As in [9, 2], we use the VGG-16 loss network with the same content and style layers. All models are trained with a minibatch size of 4 for 40,000 iterations using Adam [10] (for efficiency, content images in the same minibatch share the same style image). As an exception, we train the 20,938-style ZM-Net for 160,000 iterations with an initial learning rate of 1 × 10⁻³, decayed by a factor of 0.1 every 40,000 iterations.

Model Capacity. To show that ZM-Net has enough model capacity to digest multiple styles with a single network, we train ZM-Net with up to 20,938 style images and evaluate its ability to stylize new content images with style images from the training set. Figure 4 shows the results of a 10-style ZM-Net (last column), OST [7], FST [9], and CIN [2] (see the supplementary material for more results).
Note that both FST and CIN need to train different networks for different style images¹, while ZM-Net can be trained on multiple styles simultaneously with a single network. As the figure shows, ZM-Net achieves comparable performance with one single network. Similarly, Figure 5 shows the results of a 20,938-style ZM-Net: ZM-Net has no problem digesting as many as 20,938 styles with only one network. Quantitatively, the final training loss (averaged over the last 100 iterations) of the 20,938-style ZM-Net is very close to that of CIN [2] (157382.7 versus 148374.3), which again demonstrates ZM-Net's sufficient model capacity.

¹Although CIN can share the parameters of convolutional layers across different styles, the rest of the parameters still need to be trained separately for each style.

Fast Zero-shot Style Transfer. Note that style transfer involves two levels of generalization: (1) generalization to new content images, which is achieved by [2, 22, 9], and (2) generalization not only to new content images but also to new style images. Since the second level involves style transfer with style images (guiding signals) unseen during training, we call it zero-shot style transfer. Figure 6 shows the results of fast zero-shot style transfer using our 10-style ZM-Net, our 20,938-style ZM-Net, and FST [9] (see the supplementary material for more results). The 10-style ZM-Net severely overfits the 10 style images in the training set and generalizes poorly to unseen styles. The 20,938-style ZM-Net, helped by the diversity of its training style images, performs satisfactory style transfer even for unseen styles, while models like FST [2, 22, 9] are tied to specific styles and fail to generalize.
Note that both the TNet and the PNet in ZM-Net have 10 layers (5 of them residual blocks with 2 convolutional layers each), and the PNet connects to the TNet through the first 9 layers via the DIN operations of Equation (3). To investigate the function of DIN in different layers, we turn off the DIN operations in some layers (setting γ_i = 1 and β_i = 0) and perform zero-shot style transfer with ZM-Net. As shown in Figure 7, DIN in layers 1–3 focuses on generating content details (e.g., edges), DIN in layers 4–6 on roughly adjusting colors, and DIN in layers 7–9 on transferring texture-related features.

[2] proposes CIN, which shares convolutional layers across different styles and finetunes only the scaling/shifting factors of the instance normalization, γ_i and β_i, for a new style. Figure 9 shows style transfer for an unseen style image after finetuning CIN [2] and ZM-Net for 1–40 iterations. With its zero-shot ability, ZM-Net performs much better than CIN even without finetuning for the new style. Figure 8 shows the training and testing loss (sum of content and style loss) for training FST (a transformation network trained from scratch), finetuning CIN, and finetuning our ZM-Net. We conclude that (1) finetuning CIN has a much lower initial training/testing loss than FST, and finetuning ZM-Net does even better; and (2) ZM-Net converges faster and to a lower training/testing loss.

4.2. Word Embeddings as Guiding Signals

Besides style transfer, which uses style images as guiding signals, we also run ZM-Net with word embeddings as input to embed specific semantics into images. For example, given the word embedding of 'night', the model should transform a photo taken during daytime into a photo with a night view.
In this setting, if we train ZM-Net with only the words 'noon' and 'night', a successful zero-shot manipulation would take the word embedding of 'morning' or 'afternoon' and transform a content image taken at noon into an image taken in the morning or the afternoon (though 'morning' and 'afternoon' never appear in the training set). To perform such tasks, we design a ZM-Net with a deep convolutional TNet identical to the one used for style transfer and a deep fully connected PNet with residual connections (see the supplementary material for details on the structure). To facilitate analysis and avoid overfitting, we compress the pretrained 50-dimensional word embeddings from [18] into 2-dimensional vectors. We crawl 30 images with the tag 'noon' and 30 with the tag 'night' as training images.

Figure 6: Fast zero-shot style transfer results (Rows 2 to 4) using our 10-style ZM-Net, our 20,938-style ZM-Net, and FST [9]. Row 1 shows the content image and the style images.

Figure 7: Zero-shot style transfer using a 20,938-style ZM-Net with DIN turned on in some layers. Column 1: style image. Column 2: content images. Column 3: DIN off in all layers. Columns 4 to 6: DIN on in layers 1–3, 4–6, and 7–9, respectively. Column 7: DIN on in all layers.

Figure 8: Training loss for all iterations (left), training loss for the first 100 iterations (middle), and testing loss for all iterations (right) of FST, CIN, and our ZM-Net. Testing loss is computed every 1,000 iterations.
Note that, different from the ZM-Net for style transfer, where the same style image is used both as input to the PNet and as input to the fixed loss network (as shown in Figure 3), here we use word embeddings as input to the PNet and the corresponding 'noon'/'night' images as input to the loss network. In each iteration, we randomly select the word embedding of 'noon' or 'night' as the guiding signal and feed a corresponding image into the loss network. Different from style transfer, even for the same guiding signal, different 'noon'/'night' images are fed into the loss network. In this case, ZM-Net actually extracts the common patterns/semantics of 'noon' or 'night' images rather than simply learning to perform style transfer.

Figure 9: Column 1: the content image and style image. Columns 2 to 7: style transfer for the unseen style image after finetuning ZM-Net (Row 1) and CIN [2] (Row 2) for 1, 10, 20, 30, 40, and 50 iterations. The CIN model is first trained on another style image before finetuning.

Figure 10: Zero-shot image manipulation with word embeddings as guiding signals, compared to simply changing the image illumination (Row 1). Row 2 shows the 6 images corresponding to the compressed word embeddings of 'noon', 0.5 'noon' + 0.5 'afternoon', 'afternoon', 'morning', 0.5 'morning' + 0.5 'night', and 'night' when a serial PNet is used. Row 3 shows the results when a parallel PNet is used. Column 1 shows the content image and the compressed word embeddings.

Row 2 of Figure 10 shows zero-shot image manipulation with a serial PNet in ZM-Net. We train the model with the word embeddings of 'noon' and 'night' and use the word embeddings of 'morning' and 'afternoon' (which never appear during training) as guiding signals during testing.
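The intermediate guiding signals in Figure 10 (e.g. 0.5 'morning' + 0.5 'night') are simple convex combinations of the compressed word embeddings. A sketch of this blending (the helper function and example embedding values are our own illustration, not taken from the paper):

```python
import numpy as np

def blend_signals(emb_a, emb_b, alpha):
    # Convex combination of two guiding-signal embeddings:
    # alpha = 0.5 gives, e.g., 0.5*'morning' + 0.5*'night'.
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return alpha * emb_a + (1 - alpha) * emb_b
```

Feeding such blended embeddings to the PNet is what produces the smooth day-to-night transitions, and sweeping alpha continuously yields the real-time single-image animation described above.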
As we can see, the transformed images gradually change from daytime (noon) views (with a bright sky and buildings) to nighttime views (with a dark sky and buildings with their lights on), with 'morning'/'afternoon' views in between. Note that with ZM-Net's ability to perform fast zero-shot manipulation, it can generate animation from a single image in real time even though the model is image-based (see the demonstration in the supplementary material). As a baseline, Row 1 of Figure 10 shows the results of a simple illumination change. We can see that ZM-Net automatically transfers the lighting effect (lights in the buildings) to the content image, while the simple illumination change fails to do so. Besides the serial PNet, we also perform the same task with a parallel PNet and report the results in Row 3 of Figure 10. We can see that, compared to the results using a serial PNet, the parallel PNet produces many redundant yellow pixels surrounding the buildings, which is not reasonable for a daytime photo. The comparison shows that the serial PNet, with its deep structure, tends to perform higher-quality image manipulation than the parallel PNet.

5. Conclusion

In this paper we present ZM-Net, a general network architecture with dynamic instance normalization, to perform real-time zero-shot image manipulation. Experiments show that ZM-Net produces high-quality transformed images with different modalities of guiding signals (e.g. style images and text attributes) and can generalize to unseen guiding signals. ZM-Net can even produce real-time animation for a single image even though the model is trained on still images. Besides, we construct the largest dataset of 23,307 style images to provide much more content diversity and reduce the testing loss by nearly half.

References

[1] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In ICCV, pages 415-423, 2015.
[2] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style.
CoRR, abs/1610.07629, 2016.
[3] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650-2658, 2015.
[4] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. TPAMI, 35(8):1915-1929, 2013.
[5] C. Florea, R. Condorovici, C. Vertan, R. Butnaru, L. Florea, and R. Vrânceanu. Pandora: Description of a painting database for art movement recognition with baselines and perspectives. In EUSIPCO, pages 918-922. IEEE, 2016.
[6] O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In CVPR, pages 553-561, 2016.
[7] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414-2423, 2016.
[8] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH, pages 327-340, 2001.
[9] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694-711, 2016.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[11] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, pages 702-716, 2016.
[12] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. arXiv preprint arXiv:1703.01664, 2017.
[13] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. CoRR, abs/1701.01036, 2017.
[14] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740-755, 2014.
[15] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In CVPR, pages 5162-5170, 2015.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[17] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520-1528, 2015.
[18] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[19] P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, pages 82-90, 2014.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[22] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349-1357, 2016.
[23] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
[24] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, pages 649-666, 2016.
[25] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529-1537, 2015.
