A Closed-form Solution to Universal Style Transfer

Ming Lu*1, Hao Zhao1, Anbang Yao2, Yurong Chen2, Feng Xu3, and Li Zhang1
1 Department of Electronic Engineering, Tsinghua University
2 Intel Labs China
3 BNRist and School of Software, Tsinghua University
{lu-m13@mails, zhao-h13@mails, feng-xu@mail, chinazhangli@mail}.tsinghua.edu.cn
{anbang.yao, yurong.chen}@intel.com

Abstract

Universal style transfer tries to explicitly minimize the losses in feature space, so it does not require training on any pre-defined styles. It usually uses different layers of a VGG network as the encoders and trains several decoders to invert the features into images; the effect of style transfer is then achieved by feature transform. Although plenty of methods have been proposed, a theoretical analysis of feature transform is still missing. In this paper, we first propose a novel interpretation by treating it as an optimal transport problem. Then, we demonstrate the relations of our formulation to former works such as Adaptive Instance Normalization (AdaIN) and Whitening and Coloring Transform (WCT). Finally, we derive a closed-form solution named Optimal Style Transfer (OST) under our formulation by additionally considering the content loss of Gatys. Comparatively, our solution preserves structure better and achieves visually pleasing results. It is simple yet effective, and we demonstrate its advantages both quantitatively and qualitatively. Besides, we hope our theoretical analysis can inspire future work in neural style transfer. Code is available at https://github.com/lu-m13/OptimalStyleTransfer.

1. Introduction

A variety of methods for neural style transfer have been proposed since the seminal work of Gatys [8]. These methods can be roughly categorized into image optimization and model optimization [13].
* This work was done when Ming Lu was an intern at Intel Labs China, supervised by Anbang Yao, who is responsible for correspondence.

Methods based on image optimization directly obtain the stylized output by minimizing the content loss and style loss. The style loss can be defined by the Gram matrix [8], histograms [25], or Markov Random Fields (MRFs) [16]. Contrary to that, methods based on model optimization train neural networks on large datasets such as COCO [22]. The training loss can be defined as a perceptual loss [14] or an MRFs loss [17]. Subsequent works [3, 6, 32] further study the problem of training one network for multiple styles. Recently, [12] proposed to use AdaIN as the feature transform to train one network for arbitrary styles. Apart from image and model optimization, many other works study semantic style transfer [23, 21, 1], video style transfer [11, 2, 26, 27], portrait style transfer [28], and stereoscopic style transfer [4]. [13] provides a thorough review of the works on style transfer.

In this paper, we study the problem of universal style transfer [19]. Our motivation is to explicitly minimize the losses defined by Gatys [8]. Therefore, our approach does not require training on any pre-defined styles. Similar to WCT [19], our method is based on a multi-scale encoder-feature transform-decoder framework. We use different layers of the VGG network [31] as the encoders and train the decoders to invert features into images. The effect of style transfer is achieved by the feature transform between encoder and decoder. Therefore, the key to universal style transfer is the feature transform. In this work, we focus on the theoretical analysis of feature transform and propose a new closed-form solution.

Although AdaIN [12] trains its decoder on a large dataset of style images, AdaIN itself is also a feature transform method.
It considers the feature of each channel as a Gaussian distribution and assumes the channels are independent. For each channel, AdaIN first normalizes the content feature and then matches it to the style feature. This means it only matches the diagonal elements of the covariance matrices. WCT [19] proposes to use whitening and coloring as the feature transform. Compared with AdaIN, WCT improves the results by matching all the elements of the covariance matrices. Since the channels of deep Convolutional Neural Networks (CNNs) are correlated, the non-diagonal elements are essential to represent the style. However, WCT only matches the covariance matrices, which shares a similar spirit with minimizing the style loss of Gatys. It does not consider the content loss and cannot preserve the image structure well. Moreover, multiplying an orthogonal matrix between the whitening and coloring matrices also matches the covariance matrices, as pointed out by [18].

[20] shows that matching Gram matrices is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with the second-order polynomial kernel. However, it does not give a closed-form solution. Instead, our work reformulates style transfer as an optimal transport problem. Optimal transport tries to find a transformation that matches two high-dimensional distributions. For neural style transfer, considering the neural feature at each activation as a high-dimensional sample, we assume the samples of the content and style images come from two Multivariate Gaussian (MVG) distributions. Style transfer is then equivalent to transforming the content samples to fit the distribution of the style samples. Assuming the transformation is linear, we find that both AdaIN and WCT are special cases of our formulation. Although [18] also assumes the transformation is linear, it still follows the whitening and coloring pipeline and trains two meta networks for the whitening and coloring matrices.
Contrary to that, we directly find the transformation under the optimal transport formulation.

As described above, there are still infinitely many transformations: for example, multiplying an orthogonal matrix between the whitening and coloring matrices also yields a solution. Therefore, we seek the transformation that additionally minimizes the difference between the transformed feature and the original content feature. This shares a similar spirit with minimizing the content loss of Gatys [8]. We prove that a unique closed-form solution, named Optimal Style Transfer (OST), can be found once the content loss is considered. We give the detailed proof of OST in the method section. Since OST further considers the content loss, it preserves structure better than WCT.

Our contributions can be summarized as follows:
1. We present a novel interpretation of neural style transfer by treating it as an optimal transport problem, and we elucidate the theoretical relations of our interpretation to former works on feature transform, for example, AdaIN and WCT.
2. We find the unique closed-form solution, named OST, under the optimal transport interpretation by additionally considering the content loss.
3. Our closed-form solution preserves structure better and achieves visually pleasing results.

2. Related Work

Image Optimization. Methods based on image optimization directly obtain the stylized output by minimizing the content loss and style loss defined in feature space. The optimization is usually based on back-propagation. [7, 8] propose to use the Gram matrix to define the style of an example image. [16] improves the results by combining MRFs with Convolutional Neural Networks. [1] uses semantic masks to define the style losses within corresponding regions.
To improve the results for portrait style transfer, [28] proposes to modify the feature maps so as to transfer the local color distributions of the example painting onto the content image. This is similar to the gain map proposed by [30]. [9] studies the problem of controlling perceptual factors during style transfer. [25] improves the results of neural style transfer by incorporating a histogram loss. [26] incorporates a temporal consistency loss into the optimization for video style transfer. Since all the above methods solve the optimization by back-propagation, they are intrinsically time-consuming.

Model Optimization. To remove the speed bottleneck of back-propagation, [14, 17] propose to train a feed-forward network to approximate the optimization process. Instead of optimizing the image, they optimize the parameters of the network. Since it is tedious to train one network for each style, [3, 6, 32] further study the problem of training one network for multiple styles. Later, [5] presents a method based on patch swap for arbitrary style transfer. First, the content and style images are forwarded through a deep neural network to extract features. Then style transfer is formulated as a neural patch swap that yields the reconstructed feature map, which is inverted back to image space by a decoder network. Since then, the encoder-feature transform-decoder framework has been widely explored for arbitrary style transfer. [12] uses AdaIN as the feature transform and trains the decoder over large collections of content and style images. [18] trains two meta networks for the whitening and coloring matrices, following the formulation of WCT [19]. Many other works extend neural style transfer to video [2, 11, 27] and stereoscopic style transfer [4]. These works usually jointly train additional networks apart from the style transfer network.

Universal Style Transfer.
Universal style transfer [19] is also based on the encoder-feature transform-decoder framework. Unlike AdaIN [12], it does not require network training on any style image. It directly uses different layers of the VGG network as the encoders and trains the decoders to invert the features into images. The style transfer effect is achieved by the feature transform. [5] replaces the patches of the content feature with the most similar patches of the style feature. However, the nearest-neighbor search achieves a weaker transfer effect since it tends to preserve the original appearance. AdaIN considers the activation of each channel as a Gaussian distribution and matches the content and style images through mean and variance. However, since the channels of a CNN are correlated, AdaIN cannot achieve a visually pleasing transfer effect. WCT [19] proposes to use feature whitening and coloring to match the covariance matrices of the style and content images. However, as pointed out by [18], WCT is not the only approach to matching the covariance matrices. [29] proposes a method to combine patch match with WCT and AdaIN.

[Figure 1. (a) The pipeline of OST for universal style transfer. First, we extract features for the content and style images using the encoder. Then we use the feature transform method to obtain the stylized feature. Finally, the decoder inverts the stylized feature into an image. The output of the top layer is used as the input content image for the bottom layer. (b) The decoder inverts the feature of a certain layer to the image. Although [10, 29] propose to train the decoder to invert the feature to its bottom layer's feature, which might be more efficient, we use the image decoder in this work since the decoder is not our contribution. (c) We use the feature loss (blue arrow) and the reconstruction loss (red arrow) to train DecoderX (X = 1, 2, ..., 5).]
Instead of finding the nearest neighbor with the original feature, [29] conducts the search with the projected feature, which can be generated by AdaIN or WCT. However, the above methods all fail to give a theoretical analysis of feature transform. The key observation of current works such as WCT is matching the covariance matrices, which is not enough to pin down a good solution.

3. Motivation

The pipeline of OST is shown in Figure 1. It is similar to WCT [19]. We use different layers of the pre-trained VGG network as the encoders. For every encoder, we train the corresponding decoder to invert the feature into an image, as illustrated in Figure 1 (b, c). Although [10, 29] propose to train the decoder to invert the feature to its bottom layer's feature, which might be more efficient, we use the image decoder [19] in this work since the framework is not our contribution.

We start by reformulating neural style transfer as the optimal transport problem. We denote the content image as I_c and the style image as I_s, and the features of the content and style images as F_c ∈ R^{C×H_cW_c} and F_s ∈ R^{C×H_sW_s} respectively, where H_cW_c and H_sW_s are the numbers of activations and C is the number of channels. We view the columns of F_c and F_s as samples from two Multivariate Gaussian (MVG) distributions N(µ_c, Σ_c) and N(µ_s, Σ_s), where µ_c, µ_s ∈ R^C are the mean vectors and Σ_c, Σ_s ∈ R^{C×C} are the covariance matrices. We further denote a sample from the content distribution as u and a sample from the style distribution as v, so u ∼ N(µ_c, Σ_c) and v ∼ N(µ_s, Σ_s). Assuming the optimal transformation is linear, we can write it as

    t(u) = T(u − µ_c) + µ_s,    (1)

where T ∈ R^{C×C} is the transformation matrix. Since we assume the features come from two MVG distributions, T must satisfy the following equation to match them:
    T Σ_c Tᵀ = Σ_s,    (2)

where Tᵀ is the transpose of T. When Eq. 2 is satisfied, we obtain t(u) ∼ N(µ_s, Σ_s). We now relate our formulation to AdaIN [12] and WCT [19]. We denote by D_c and D_s the diagonal matrices containing the per-channel standard deviations of the content and style features. For AdaIN, the transformation matrix is T = D_s ./ D_c, where ./ denotes element-wise division. Therefore, AdaIN does not satisfy Eq. 2, since it ignores the correlations between channels; only the diagonal elements are matched. As for WCT, the transformation matrix is T = Σ_s^{1/2} Σ_c^{−1/2}. Since both Σ_s^{1/2} and Σ_c^{−1/2} are symmetric matrices, WCT satisfies Eq. 2. However, WCT is not the only solution to Eq. 2: T = Σ_s^{1/2} Q Σ_c^{−1/2}, where Q is any orthogonal matrix, is a whole family of solutions, as also pointed out by [18]. Theoretically, there are infinitely many solutions if only Eq. 2 is considered. We show the style transfer results of multiplying a random orthogonal matrix between the coloring and whitening matrices in Figure 2: although T = Σ_s^{1/2} Q Σ_c^{−1/2} satisfies Eq. 2, the results vary significantly.

[Figure 2. Style transfer results of T = Σ_s^{1/2} Q Σ_c^{−1/2}, where Q is an orthogonal matrix. Although T satisfies Eq. 2, the results vary significantly.]

Our motivation is to find an optimal solution by additionally considering the content loss of Gatys. Therefore, our formulation can be written as follows, where E denotes the expectation:

    T = arg min_T E(‖t(u) − u‖₂²)
        s.t.  t(u) = T(u − µ_c) + µ_s,
              T Σ_c Tᵀ = Σ_s.    (3)

4. Method

In this section, we derive the closed-form solution to Eq. 3. Substituting Eq. 1 into the expectation term of Eq. 3, we obtain

    E[(T(u − µ_c) + µ_s − u)ᵀ (T(u − µ_c) + µ_s − u)].    (4)

We denote u* = u − µ_c, v* = T u* and δ = µ_s − µ_c, so that u* ∼ N(0, Σ_c) and v* ∼ N(0, Σ_s).
Besides, δ is a constant C-dimensional vector. Using u*, v* and δ, we can rewrite Eq. 4 as

    E[(v* + δ − u*)ᵀ (v* + δ − u*)].    (5)

Expanding Eq. 5 gives

    E[v*ᵀv* + δᵀv* − u*ᵀv* + v*ᵀδ + δᵀδ − u*ᵀδ − v*ᵀu* − δᵀu* + u*ᵀu*].    (6)

Since u* ∼ N(0, Σ_c), v* ∼ N(0, Σ_s) and δ is a constant C-dimensional vector, we have E[δᵀv*] = E[v*ᵀδ] = 0 and E[u*ᵀδ] = E[δᵀu*] = 0. Besides, E[δᵀδ] is constant. Therefore, minimizing Eq. 6 is equivalent to minimizing

    E[v*ᵀv* + u*ᵀu* − u*ᵀv* − v*ᵀu*].    (7)

Using the matrix trace, Eq. 7 can be rewritten as

    tr(E[v*v*ᵀ + u*u*ᵀ − v*u*ᵀ − u*v*ᵀ]),    (8)

where tr denotes the trace of a matrix. Since E[v*v*ᵀ] = Σ_s, E[u*u*ᵀ] = Σ_c and E[v*u*ᵀ] = E[u*v*ᵀ]ᵀ = φ, where φ denotes the cross-covariance matrix of v* and u*, the solution to Eq. 3 can be reformulated as

    T = arg max_T tr(φ).    (9)

Next, we introduce a lemma proved by [24]; we do not repeat the proof due to limited space.

Lemma 4.1. Given two high-dimensional distributions X and Y, where X ∼ N(0, Σ_11) and Y ∼ N(0, Σ_22), define the joint distribution of (X, Y) as N(0, Σ), where

    Σ = [ Σ_11   φ
          φᵀ    Σ_22 ].    (10)

The problem max tr(2φ) has a unique solution, which can be represented as

    φ = Σ_11 Σ_22^{1/2} (Σ_22^{1/2} Σ_11 Σ_22^{1/2})^{−1/2} Σ_22^{1/2}.    (11)

With the above lemma, letting X = v*, Y = u*, Σ_11 = Σ_s and Σ_22 = Σ_c, we obtain the solution to Eq. 9: φ = Σ_s Σ_c^{1/2} (Σ_c^{1/2} Σ_s Σ_c^{1/2})^{−1/2} Σ_c^{1/2}. We can also rewrite the cross-covariance matrix as φ = E[v*u*ᵀ] = E[v*(T^{−1}v*)ᵀ] = E[v*v*ᵀ](T^{−1})ᵀ = Σ_s (T^{−1})ᵀ.
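Before carrying the algebra through to T, Lemma 4.1 can be sanity-checked numerically. The sketch below is ours, not from the paper's code: random SPD matrices stand in for Σ_s and Σ_c, and `sqrtm_psd` is a small helper we define. Every member of the family T = Σ_s^{1/2} Q Σ_c^{−1/2} satisfies Eq. 2 and induces φ = Σ_s (T^{−1})ᵀ = Σ_s^{1/2} Q Σ_c^{1/2}; sampling many orthogonal Q, none should exceed the trace of the φ from Eq. 11.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
C = 4
A, B = rng.normal(size=(C, C)), rng.normal(size=(C, C))
Sigma_c = A @ A.T + np.eye(C)   # random SPD stand-in for the content covariance
Sigma_s = B @ B.T + np.eye(C)   # random SPD stand-in for the style covariance
Sc_half, Ss_half = sqrtm_psd(Sigma_c), sqrtm_psd(Sigma_s)

# phi from Eq. 11 with Sigma_11 = Sigma_s and Sigma_22 = Sigma_c:
inner = np.linalg.inv(sqrtm_psd(Sc_half @ Sigma_s @ Sc_half))
phi_star = Sigma_s @ Sc_half @ inner @ Sc_half

# Each T = Sigma_s^{1/2} Q Sigma_c^{-1/2} induces phi = Sigma_s^{1/2} Q Sigma_c^{1/2};
# sample many orthogonal Q and track the best achievable trace.
best_trace = -np.inf
for _ in range(500):
    Q, _ = np.linalg.qr(rng.normal(size=(C, C)))  # random orthogonal matrix
    best_trace = max(best_trace, np.trace(Ss_half @ Q @ Sc_half))
```

In this sketch `best_trace` never exceeds tr(φ) from Eq. 11, consistent with the lemma's claim of a unique maximizer.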
Therefore, (T^{−1})ᵀ = Σ_c^{1/2} (Σ_c^{1/2} Σ_s Σ_c^{1/2})^{−1/2} Σ_c^{1/2}, and the final T can be represented as

    T = Σ_c^{−1/2} (Σ_c^{1/2} Σ_s Σ_c^{1/2})^{1/2} Σ_c^{−1/2}.    (12)

Remarks: The final solution of our method is very simple. Since our method additionally considers the content loss, we preserve structure better than WCT. Contrary to former works, we provide a complete theoretical proof of the proposed method, and we also demonstrate its relations to former works. We believe both the closed-form solution and the theoretical proof will inspire future work in neural style transfer.

Table 1. Processing speed comparison.
    Gatys     Patch Swap  AdaIN   AdaIN+  WCT     Ours
    207.12s   13.15s      0.49s   0.16s   3.47s   4.06s

5. Results

In this section, we first qualitatively compare our method with Gatys [8], Patch Swap [5], AdaIN (with our decoder) [12], AdaIN+ (with their decoder) [12], and WCT [19] in Section 5.1. Then we provide a quantitative comparison against the same methods in Section 5.2. Following former works, we also show results of linear interpolation and semantic style transfer in Section 5.3. Finally, we discuss the limitations of our method in Section 5.4.

Parameters: We train the decoders on the COCO dataset [22]. The weight balancing the feature loss and the reconstruction loss is set to 1, as in [19]. For the results in this work, the resolution of the input is fixed at 512 × 512.

Performance: We implement the proposed method on a server with an NVIDIA Titan Xp graphics card. The processing speed comparison at an input resolution of 512 × 512 is listed in Table 1. We run the comparison with the published implementations on our server, which might result in slight differences from the papers.

5.1. Qualitative Results

Our Method versus Gatys: Gatys [8] is the pioneering work of neural style transfer and it can handle arbitrary styles.
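Before the comparisons, Eq. 12 can be exercised numerically. The sketch below is ours (synthetic Gaussian features, a `sqrtm_psd` helper we define; not the paper's released code): it builds the OST matrix and WCT's matrix from the same empirical statistics, and checks that both satisfy Eq. 2 while OST incurs the lower content cost of Eq. 3.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def ost_matrix(Sigma_c, Sigma_s):
    """Eq. 12: T = Sigma_c^{-1/2} (Sigma_c^{1/2} Sigma_s Sigma_c^{1/2})^{1/2} Sigma_c^{-1/2}."""
    Sc_half = sqrtm_psd(Sigma_c)
    Sc_neg_half = np.linalg.inv(Sc_half)
    return Sc_neg_half @ sqrtm_psd(Sc_half @ Sigma_s @ Sc_half) @ Sc_neg_half

rng = np.random.default_rng(1)
C, N = 6, 4096
Fc = rng.normal(size=(C, N)) * rng.uniform(0.5, 2.0, size=(C, 1))  # toy content feature
Ls = np.tril(rng.uniform(0.5, 1.5, size=(C, C)))  # mixing -> correlated channels
Fs = Ls @ rng.normal(size=(C, N)) + 1.0           # toy style feature

mu_c = Fc.mean(axis=1, keepdims=True)
mu_s = Fs.mean(axis=1, keepdims=True)
Sigma_c = np.cov(Fc, bias=True)
Sigma_s = np.cov(Fs, bias=True)

T_ost = ost_matrix(Sigma_c, Sigma_s)
T_wct = sqrtm_psd(Sigma_s) @ np.linalg.inv(sqrtm_psd(Sigma_c))  # WCT's choice

F_ost = T_ost @ (Fc - mu_c) + mu_s  # Eq. 1 with the OST matrix
F_wct = T_wct @ (Fc - mu_c) + mu_s

# Content cost of Eq. 3, averaged over the activations:
cost_ost = np.mean(np.sum((F_ost - Fc) ** 2, axis=0))
cost_wct = np.mean(np.sum((F_wct - Fc) ** 2, axis=0))
```

Both transforms match the style covariance exactly; the OST output additionally stays closer to the original content feature, which is the extra property the derivation buys.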
Although it uses time-consuming back-propagation to minimize the content loss and style loss, we still compare with it since its formulation is the foundation of our method. As shown in Figure 3, Gatys usually achieves reasonable results; however, the results are not strongly stylized, since the iterative solver cannot reach the optimal solution in a limited number of iterations. Instead, our method finds the closed-form solution, which explicitly minimizes the style loss and content loss. Comparatively, our results are more stylized and they also preserve the structures of the content images well.

Our Method versus Patch Swap: As far as we know, Patch Swap [5] is the first work to use the encoder-feature transform-decoder framework. It chooses a certain layer of the VGG network as the encoder and trains the corresponding decoder. The feature transform is formulated as a neural patch swap. However, a neural patch swap using the original feature tends to simply reconstruct the feature, so the results are not stylized. Besides, Patch Swap only transfers the style at a single layer, which also reduces the style transfer effect. [29] proposes to match the neural patches in projected domains, for example, the whitened feature [19]. Apart from this, [29] uses multiple layers to transfer the style, achieving more stylized results. Our work does not use the idea of neural patch match; instead, we focus on the theoretical analysis to deliver the closed-form solution. As can be seen in Figure 3, our result is more stylized than Patch Swap's.

Our Method versus AdaIN and AdaIN+: As discussed in the motivation, AdaIN [12] assumes the channels of the CNN feature are independent. For each channel, AdaIN matches two one-dimensional Gaussian distributions. However, the channels of the CNN feature are actually correlated. Therefore, using AdaIN as the feature transform cannot achieve visually stylized results.
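This diagonal-only behavior is easy to see in code. A minimal sketch (ours, with toy features; not the paper's implementation): AdaIN matches each channel's mean and standard deviation, so the per-channel statistics of the output equal the style's, but the cross-channel covariance is left unmatched.

```python
import numpy as np

def adain(Fc, Fs, eps=1e-5):
    """Channel-wise AdaIN: normalize each content channel, then rescale
    and shift it to the corresponding style channel's statistics."""
    mu_c = Fc.mean(axis=1, keepdims=True)
    mu_s = Fs.mean(axis=1, keepdims=True)
    std_c = Fc.std(axis=1, keepdims=True)
    std_s = Fs.std(axis=1, keepdims=True)
    return std_s * (Fc - mu_c) / (std_c + eps) + mu_s

rng = np.random.default_rng(0)
C, N = 6, 4096
Fc = rng.normal(size=(C, N))                      # uncorrelated "content" channels
L = np.tril(rng.uniform(0.5, 1.0, size=(C, C)))   # mixing matrix -> correlated style
Fs = L @ rng.normal(size=(C, N))                  # "style" feature with strong
                                                  # cross-channel correlations
Ft = adain(Fc, Fs)
```

After the transform, each row of `Ft` has the style's mean and standard deviation, yet the off-diagonal entries of its covariance stay far from the style's: exactly the gap that motivates matching full covariance matrices.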
Instead of using AdaIN as the feature transform method, AdaIN+ [12] trains a decoder on large collections of content and style images. Although AdaIN+ only transfers the feature at a single layer, it trains the decoder with style losses defined at multiple layers. We compare with both AdaIN and AdaIN+. As illustrated by Figure 3, the results of AdaIN and AdaIN+ are similar, and both fail to achieve visually pleasing transfer results. We therefore believe AdaIN and AdaIN+ fail because they ignore the correlations between the channels of the CNN feature. Our work considers these correlations and thus achieves more stylized results, as shown in Figure 3.

Our Method versus WCT: WCT [19] proposes to use feature whitening and coloring as the solution to style transfer. It chooses ZCA whitening, and we test some other whitening methods with the feature of ReLU3_1, as shown in Figure 4. As can be seen, only ZCA whitening achieves reasonable results. This is because ZCA whitening is the optimal choice that minimizes the difference between the content feature and the whitened feature. Although the ZCA-whitened image can preserve the structure of the content image, there is no constraint on the final transformed feature. Contrary to that, we minimize the difference between the content feature and the final transformed feature. As analyzed in the motivation section, WCT satisfies Eq. 2: it perfectly matches two high-dimensional Gaussian distributions. However, it ignores the content loss of Gatys. Instead, we seek the closed-form solution that additionally minimizes the content loss. As can be seen in Figure 3, our transformation preserves structure better (see the red rectangles).

We also note that the final feature can be a linear combination of the original content feature and the transformed feature, as shown in Eq. 13, where α is the weight of the transformed feature:
    t*(u) = α t(u) + (1 − α) u    (13)

[Figure 3. Qualitative results (columns: Input, Gatys, Patch Swap, AdaIN, AdaIN+, WCT, Our Result). We compare our method against Gatys [8], Patch Swap [5], AdaIN (with our decoder) [12], AdaIN+ (with their decoder) [12], and WCT [19]. AdaIN ignores the non-diagonal elements of the covariance matrices, which results in less stylized output. WCT does not consider the content loss and cannot preserve the structure of the content image well, as shown in the red rectangles. Our method achieves both stylized and content-preserving results.]

Table 2. Average content loss and style losses. * means fully matching the statistics of the content and style features.
    Method        Gatys      Patch Swap  AdaIN     AdaIN+    WCT*      Ours*
    Content Loss  0.096      0.086       0.167     0.151     0.296     0.255
    Style Loss-1  23.77      100.8       15.85     15.9      3.89      3.60
    Style Loss-2  8577.04    30647.6     5351.6    3355.5    594.4     457.8
    Style Loss-3  6749.7     15607.2     4564.7    4905.5    1226.6    1203
    Style Loss-4  325939     562192      245133    202767    187907    129695
    Style Loss-5  15.96      17.73       14.1      12.48     24.33     12.37

[Figure 4. Illustration of different whitening methods. We test several whitening methods with the feature of ReLU3_1. As can be seen, ZCA whitening achieves better results. "cor" means correlated; details of the whitening methods can be found in [15].]

[Figure 5. Illustration of linear interpolation. The top row shows the results of our method and the bottom row the results of WCT. Linearly combining the content feature with the transformed feature helps preserve structure. With smaller α in Eq. 13, WCT preserves more structure; however, there are still obvious artifacts. Instead, our method consistently achieves pleasing results.]

We show the results of different α values in Figure 5. Adjusting the weight changes the degree of style transfer. With smaller α, WCT can preserve more structure of the content image; however, there are still obvious artifacts even with small α. Instead, our method
consistently achieves visually pleasing results.

Table 3. Average scores of the user study.
    Gatys  Patch Swap  AdaIN  AdaIN+  WCT   Ours
    2.17   1.05        2.00   1.94    2.67  3.07

5.2. Quantitative Results

User Study: Style transfer is a very subjective research topic. Although we have theoretically proved the advantages of our method, we further conduct a user study to quantitatively compare our work against Gatys, Patch Swap, AdaIN, AdaIN+ and WCT. The study uses 16 content images and 35 style images collected from published implementations, so 560 stylized images are generated by each method. We show the content, style and stylized images to testers and ask them to rate the quality of the style transfer on a scale from 1 (worst) to 5 (best). We ran this user study with 50 testers online. The average scores are listed in Table 3. The study shows that our method improves over former works.

Content Loss and Style Loss: In addition to the user study, we also evaluate the content loss and style losses defined by Gatys [8]. We calculate the average content loss and style losses over the user-study images for each method, normalizing the content loss by the number of neural activations. The average losses are listed in Table 2. Compared with WCT, our method achieves a lower content loss and a similar style loss. As for Gatys, Patch Swap, AdaIN and AdaIN+, they fail to achieve stylized results and have high style losses, as analyzed in the qualitative comparison.

5.3. More Results

We show more results demonstrating the generalization of our method in Figure 6, where α is set to 1. To further evaluate the linear interpolation, we show two samples with different α values in Figure 7. We also combine our method with semantic style transfer, as shown in Figure 7.
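The blending of Eq. 13 behind these interpolation results is a straightforward convex combination of the two feature maps; a minimal sketch (ours):

```python
import numpy as np

def interpolate(F_content, F_transformed, alpha):
    """Eq. 13: t*(u) = alpha * t(u) + (1 - alpha) * u, applied feature-wise.
    alpha = 0 keeps the content feature; alpha = 1 is fully stylized."""
    return alpha * F_transformed + (1.0 - alpha) * F_content

rng = np.random.default_rng(0)
F_content = rng.normal(size=(4, 16))      # toy content feature map
F_transformed = rng.normal(size=(4, 16))  # toy transformed (stylized) feature map
F_half = interpolate(F_content, F_transformed, 0.5)
```

Sweeping α between 0 and 1 is what produces the gradual stylization shown in Figures 5 and 7.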
Although we assume in the proof that the neural features are sampled from MVG distributions, these results are all visually pleasing, which demonstrates the generalization ability of the proposed method.

5.4. Limitations

Our method still has some limitations. For example, we evaluate the frame-by-frame results of video style transfer. Although our method preserves structure better than former works, the frame-by-frame results still contain obvious jittering. We find that the temporal jittering is caused not only by the feature transform but also by the information loss of the encoder networks: a deep encoder network causes obvious temporal jittering even without any feature transform.

[Figure 6. More results, with α set to 1.]

[Figure 7. Linear interpolation and semantic style transfer. Although we assume the neural features are sampled from MVG distributions in the proof, these results are all visually pleasing, which demonstrates the generalization ability of our work.]

Besides, style transfer is a very subjective problem. Although the Gram matrix representation proposed by Gatys has been widely used, mathematically modeling what people really perceive as style is still an unsolved problem. Exploring the relation between deep neural networks and image style is an interesting topic.

6. Conclusion

In this paper, we first presented a novel interpretation of neural style transfer by treating it as an optimal transport problem. Then we demonstrated the theoretical relations between our interpretation and former works, for example, AdaIN and WCT. Based on our formulation, we derived the unique closed-form solution by additionally considering the content loss. Our solution preserves structure better than former works due to the minimization of the content loss. We hope this paper can inspire future work in style transfer.

Acknowledgements.
This work was supported by the National Key R&D Program of China 2018YFA0704000, the NSFC (Grant No. 61822111, 61727808, 61671268, 61132007, 61172125, 61601021, and U1533132) and the Beijing Natural Science Foundation (L182052).

References

[1] Alex J Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.
[2] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 1105–1114, 2017.
[3] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
[4] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6654–6663, 2018.
[5] Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint, 2016.
[6] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style.
[7] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[8] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[9] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3985–3993, 2017.
[10] Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8222–8231, 2018.
[11] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 783–791, 2017.
[12] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[13] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. arXiv preprint arXiv:1705.04058, 2017.
[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[15] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.
[16] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[17] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[18] Xueting Li, Sifei Liu, Jan Kautz, and Ming-Hsuan Yang. Learning linear transformations for fast arbitrary style transfer. arXiv preprint, 2018.
[19] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms.
In Advances in neural information pr ocessing systems , pages 386–396, 2017. [20] Y anghao Li, Naiyan W ang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer . arXiv pr eprint arXiv:1701.01036 , 2017. [21] Jing Liao, Y uan Y ao, Lu Y uan, Gang Hua, and Sing Bing Kang. V isual attribute transfer through deep image analogy . arXiv pr eprint arXiv:1705.01088 , 2017. [22] Tsung-Y i Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Dev a Ramanan, Piotr Doll ´ ar , and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Eur opean conference on computer vision , pages 740–755. Springer , 2014. [23] Ming Lu, Hao Zhao, Anbang Y ao, Feng Xu, Y urong Chen, and Li Zhang. Decoder network over lightweight recon- structed feature for fast semantic style transfer . In Proceed- ings of the IEEE International Confer ence on Computer V i- sion , pages 2469–2477, 2017. [24] Ingram Olkin and Friedrich Pukelsheim. The distance be- tween two random vectors with giv en dispersion matrices. Linear Algebra and its Applications , 48:257–263, 1982. [25] Eric Risser, Pierre W ilmot, and Connelly Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv pr eprint arXiv:1701.08893 , 2017. [26] Manuel Ruder , Alexey Doso vitskiy , and Thomas Brox. Artistic style transfer for videos. In German Confer ence on P attern Recognition , pages 26–36. Springer , 2016. [27] Manuel Ruder , Alexey Doso vitskiy , and Thomas Brox. Artistic style transfer for videos and spherical images. Inter- national Journal of Computer V ision , 126(11):1199–1219, 2018. [28] Ahmed Selim, Mohamed Elgharib, and Linda Doyle. Paint- ing style transfer for head portraits using con volutional neural networks. ACM T ransactions on Graphics (T oG) , 35(4):129, 2016. [29] Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang W ang. A vatar- net: Multi-scale zero-shot style transfer by feature decora- tion. 
In Pr oceedings of the IEEE Conference on Computer V ision and P attern Recognition , pages 8242–8250, 2018. [30] Y iChang Shih, Sylvain P aris, Connelly Barnes, W illiam T Freeman, and Fr ´ edo Durand. Style transfer for headshot por- traits. ACM T ransactions on Graphics (TOG) , 33(4):148, 2014. [31] Karen Simonyan and Andrew Zisserman. V ery deep con vo- lutional networks for large-scale image recognition. arXiv pr eprint arXiv:1409.1556 , 2014. [32] Hang Zhang and Kristin Dana. Multi-style generati ve net- work for real-time transfer . In Eur opean Confer ence on Computer V ision , pages 349–365. Springer , 2018.