Music-oriented Dance Video Synthesis with Pose Perceptual Loss


Authors: Xuanchi Ren, Haoran Li, Zijian Huang, Qifeng Chen (HKUST)

Figure 1: Our synthesized dance video conditioned on the music "I Wish". We show 5 frames from a 5-second synthesized video. The top row shows the skeletons, and the bottom row shows the corresponding synthesized video frames. More results are shown in the supplementary video at https://youtu.be/0rMuFMZa_K4.

Abstract

We present a learning-based approach with pose perceptual loss for automatic music video generation. Our method can produce a realistic dance video that conforms to the beats and rhythms of almost any given music. To achieve this, we first generate a human skeleton sequence from music and then apply the learned pose-to-appearance mapping to generate the final video. In the stage of generating skeleton sequences, we utilize two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss to produce natural dances. Besides, we also provide a new cross-modal evaluation of dance quality, which estimates the similarity between the two modalities of music and dance. Finally, a user study demonstrates that dance videos synthesized by the presented approach are surprisingly realistic. Source code and data are available at https://github.com/xrenaa/Music-Dance-Video-Synthesis.

1. Introduction

Music videos have become unprecedentedly popular all over the world: nearly all of the top 10 most-viewed YouTube videos¹ are music videos with dancing. While these music videos are made by professional artists, we wonder if an intelligent system can automatically generate personalized and creative music videos. In this work, we study automatic dance music video generation given almost any music. We aim to synthesize a coherent and photo-realistic dance video that conforms to the given music.
With such music video generation technology, a user can share a personalized music video on social media. In Figure 1, we show frames of our synthesized dance video given the music "I Wish" by Cosmic Girls.

The dance video synthesis task is challenging for various technical reasons. Firstly, the mapping between dance motion and background music is ambiguous: different artists may compose distinctive dance motions for the same music. This suggests that a simple machine learning model with L1 or L2 distance [19, 35] can hardly capture the relationship between dance and music. Secondly, it is technically difficult to model the space of human body dance; the model should avoid generating unnatural dancing movements, and even slight deviations from normal human poses can appear unnatural. Thirdly, no high-quality dataset is available for our task. Previous motion datasets [18, 33] mostly focus on action recognition. Tang et al. [35] provide a 3D joint dataset for our task; however, when we tried to use it, we found that the dance motion and the music are not aligned.

Nowadays, a large number of music videos with dancing are available online and can be used for the music video generation task. To build a dataset for our task, we apply OpenPose [4, 5, 42] to extract dance skeleton sequences from online videos. However, the skeleton sequences acquired by OpenPose are very noisy: some estimated human poses are inaccurate. Correcting such a dataset by removing inaccurate poses is time-consuming and thus not suitable for extensive applications. Furthermore, prior work [19, 35, 43] trains networks with only L1 or L2 distance, which has been demonstrated to disregard some specific motion characteristics [24].

¹ https://www.digitaltrends.com/web/most-viewed-youtube-videos/
To tackle these challenges, we propose a novel pose perceptual loss so that our model can be trained on the noisy data (imperfect human poses) obtained by OpenPose.

Dance synthesis has been well studied in the literature by searching for dance motion in a database using music as a query [1, 16, 31]. These approaches cannot generalize well to music beyond the training data, and they lack creativity, which is the most indispensable factor of dance. To overcome such obstacles, we choose the generative adversarial network (GAN) [12] to deal with the cross-modal mapping. However, Cai et al. [3] showed that human pose constraints are too complicated to be captured by an end-to-end model trained with a direct GAN method. Thus, we propose to use two discriminators that focus on local coherence and global harmony, respectively.

In summary, the contributions of our work are:

• With the proposed pose perceptual loss, our model can be trained on a noisy dataset (without human labels) to synthesize realistic dance videos that conform to almost any given music.

• With the Local Temporal Discriminator and the Global Content Discriminator, our framework can generate a coherent dance skeleton sequence that matches the length, rhythm, and emotion of the music.

• For our task, we build a dataset containing paired music and skeleton sequences, which will be made public for research. To evaluate our model, we also propose a novel cross-modal evaluation that measures the similarity between music and a dance skeleton sequence.

2. Related Work

GAN-based Video Synthesis. A generative adversarial network (GAN) [12] is a popular approach for image generation. The images generated by GANs are usually sharper and more detailed than those trained with L1 or L2 distance. Recently, GANs have also been extended to video generation tasks [21, 25, 36, 37]. The simplest adaptations of GANs to video are proposed in [29, 37].
The GAN model in [37] replaced the standard 2D convolutional layer with a 3D convolutional layer to capture temporal features, although this approach is limited to videos of fixed length. TGAN [29] overcame this limitation, but at the cost of constraints imposed on the latent space. MoCoGAN [36] generates videos by combining the advantages of RNN-based GAN models and sliding-window techniques, so that motion and content are disentangled in the latent space.

Another advantage of GAN models is that they are widely applicable to many tasks, including the cross-modal audio-to-video problem. Chen et al. [34] proposed a GAN-based encoder-decoder architecture using CNNs to convert between audio spectrograms and frames. Furthermore, Vougioukas et al. [38] adapted a temporal GAN to automatically synthesize a talking character conditioned on speech signals.

Dance Motion Synthesis. A line of work focuses on the mapping between acoustic and motion features. On the basis of labeling music with joint positions and angles, Shiratori et al. [31, 16] incorporated gravity and beats as additional features for predicting dance motion. Recently, Alemi et al. [1] proposed to combine acoustic features with the motion features of previous frames. However, these approaches are entirely dependent on the prepared database and may only create rigid motion for music with similar acoustic features.

Recently, Yalta et al. [43] accomplished dance synthesis using standard deep learning models. The most recent work is by Tang et al. [35], who proposed a model based on an LSTM-autoencoder architecture to generate dance pose sequences. Their approach is trained with an L2 distance loss, and their evaluation only includes comparisons with randomly sampled dances that are not on a par with those by real artists. Their approach may not work well on the noisy data obtained by OpenPose.
3. Overview

To generate a dance video from music, we split our system into two stages. In the first stage, we propose an end-to-end model that directly generates a dance skeleton sequence from the audio input. In the second stage, we apply an improved pix2pixHD GAN [39, 7] to transfer the dance skeleton sequence to a dance video. In this overview, we mainly describe the first stage, shown in Figure 2.

Figure 2: Our framework for human skeleton sequence synthesis. The input music signal is divided into pieces of 0.1-second music. The generator contains an audio encoder, a bidirectional GRU, and a pose generator. The output skeleton sequence of the generator is fed into the Global Content Discriminator together with the music. The generated skeleton sequence is also divided into overlapping sub-sequences, which are fed into the Local Temporal Discriminator.

Let V be the number of joints of the human skeleton; the dimension of a 2D coordinate (x, y) is 2. We formulate a dance skeleton sequence X as a sequence of human skeletons across T consecutive frames: X ∈ R^{T × 2V}, where each skeleton frame X_t ∈ R^{2V} is a vector containing all (x, y) joint locations. Our goal is to learn a function G: R^{TS} → R^{T × 2V} that maps audio signals with sample rate S per frame to a sequence of joint location vectors.

Generator. The generator is composed of a music encoding part and a pose generator. The input audio signal is divided into pieces of 0.1-second music. These pieces are encoded using 1D convolutions and then fed into a bidirectional 2-layer GRU in chronological order, resulting in output hidden states O = {H_1, H_2, ..., H_T}.
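As a concrete illustration, the music encoding described above (1D convolutions over 0.1-second pieces, followed by a bidirectional 2-layer GRU) can be sketched in PyTorch. The kernel sizes, channel widths, and the average pooling over each piece are our assumptions for a minimal sketch, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Minimal sketch of the music encoding sub-network: a 1D ConvNet
    encodes each 0.1 s audio piece, and a bidirectional 2-layer GRU
    produces one hidden state per piece (layer sizes are assumptions)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=15, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4), nn.ReLU(),
        )
        self.gru = nn.GRU(64, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, pieces):
        # pieces: (batch, T, samples_per_piece)
        b, t, s = pieces.shape
        x = self.conv(pieces.reshape(b * t, 1, s))  # conv over raw samples
        x = x.mean(dim=2).reshape(b, t, 64)         # one vector per piece
        hidden_states, _ = self.gru(x)              # (b, T, 2 * hidden)
        return hidden_states

enc = AudioEncoder()
music = torch.randn(2, 50, 1600)  # 2 clips, T=50 pieces of 0.1 s at 16 kHz
H = enc(music)
print(H.shape)  # torch.Size([2, 50, 256])
```

With S = 16000 samples per second, each 0.1-second piece holds 1600 raw samples, and the bidirectional GRU with hidden size 128 yields one 256-dimensional hidden state H_t per piece.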
These hidden states are fed into the pose generator, a multilayer perceptron, to produce a skeleton sequence X.

Local Temporal Discriminator. The output skeleton sequence X is divided into K overlapping sub-sequences in R^{t × 2V}. These sub-sequences are fed into the Local Temporal Discriminator, which is a two-branch convolutional network. In the end, a small classifier outputs K scores that determine the realism of these skeleton sub-sequences.

Global Content Discriminator. The input to the Global Content Discriminator includes the music M ∈ R^{TS} and the dance skeleton sequence X. For the pose part, the skeleton sequence X is encoded by the pose discriminator as F_P ∈ R^{256}. For the music part, similar to the sub-network of the generator, the music is encoded using 1D convolutions and then fed into a bidirectional 2-layer GRU, resulting in an output O_M = {H_{M1}, H_{M2}, ..., H_{MT}}; O_M is then passed into the self-attention component of [22] to obtain a comprehensive music feature F_M ∈ R^{256}. In the end, we concatenate F_M and F_P along channels and use a small classifier, composed of a 1D convolutional layer and a fully-connected (FC) layer, to determine whether the skeleton sequence matches the music.

Pose Perceptual Loss. Recently, the Graph Convolutional Network (GCN) has been extended to model skeletons, since the human skeleton structure is graph-structured data. Features extracted by a GCN thus retain high-level spatial-structural information between different body parts. Matching activations in a pre-trained GCN gives a better constraint on both the details and the layout of a pose than traditional measures such as L1 or L2 distance. Figure 3 shows the pipeline of the pose perceptual loss. With the pose perceptual loss, our output skeleton sequences do not need an additional smoothing step or any other post-processing.

Figure 3: Overview of the pose perceptual loss based on ST-GCN. G is our generator in the first stage. y is the ground-truth skeleton sequence, and ŷ is the generated skeleton sequence.

4. Pose Perceptual Loss

Perceptual loss, or feature matching loss [2, 8, 11, 15, 27, 39, 40], is a popular loss for measuring the similarity between two images in image processing and synthesis. For tasks that generate human skeleton sequences [3, 19, 35], only L1 or L2 distance has been used to measure pose similarity. With an L1 or L2 loss, we find that our model tends to generate poses conservatively (repetitively) and fails to capture the semantic relationships across motions appropriately. Moreover, the datasets generated by OpenPose [4, 5, 42] are very noisy, as shown in Figure 5. Correcting inaccurate human poses in a large number of videos is labor-intensive and undesirable: a two-minute video at 10 FPS has 1200 poses to verify. To tackle these difficulties, we propose a novel pose perceptual loss.

The idea of perceptual loss was originally studied in the image domain, where it is used to match activations in a visual perception network such as VGG-19 [32, 8]. To use the traditional perceptual loss, we would need to draw generated skeletons on images, which is complicated and seemingly suboptimal. Instead of projecting pose joint coordinates onto an image, we propose to directly match activations in a pose recognition network that takes human skeleton sequences as input. Such networks are mainly aimed at pose recognition or prediction tasks, and ST-GCN [44], a Graph Convolutional Network (GCN), is well suited to serve as the perception network in our case. ST-GCN utilizes a spatial-temporal graph to form a hierarchical representation of skeleton sequences and is capable of automatically learning both spatial and temporal patterns from data.
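To make the idea concrete, the following sketch matches activations layer by layer in a frozen pose network. A toy temporal ConvNet stands in for the pre-trained ST-GCN (the actual perception network), so the layer shapes and the example λ weights are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class PosePerceptionNet(nn.Module):
    """Stand-in for a pre-trained skeleton feature extractor (the paper
    uses ST-GCN; this toy temporal ConvNet only illustrates the idea)."""

    def __init__(self, joints=18):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(2 * joints, 64, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv1d(64, 128, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv1d(128, 256, 3, padding=1), nn.ReLU()),
        ])

    def forward(self, seq):
        # seq: (batch, T, 2V) -> channel-first for 1D convolution
        x = seq.transpose(1, 2)
        activations = []
        for layer in self.layers:
            x = layer(x)
            activations.append(x)
        return activations

def pose_perceptual_loss(net, real, fake, lambdas=(20.0, 5.0, 1.0)):
    """L1 distance between per-layer activations, weighted by lambdas."""
    loss = 0.0
    for lam, a_real, a_fake in zip(lambdas, net(real), net(fake)):
        loss = loss + lam * (a_real - a_fake).abs().mean()
    return loss

net = PosePerceptionNet()
for p in net.parameters():      # the perception network stays frozen
    p.requires_grad_(False)
real = torch.randn(4, 50, 36)   # T=50 frames, V=18 joints -> 2V=36
fake = torch.randn(4, 50, 36)
print(float(pose_perceptual_loss(net, real, real)))  # 0.0 for identical input
```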
To test the impact of the pose perceptual loss on noisy data, we prepare a 20-video dataset with much noise due to incorrect pose detections from OpenPose. As shown in Figure 4, our generator stably generates poses when trained with the pose perceptual loss.

Given a pre-trained GCN network Φ, we define a collection of its layers as {Φ_l}. For a training pair (P, M), where P is the ground-truth skeleton sequence and M is the corresponding piece of music, our perceptual loss is

    L_P = Σ_l λ_l ‖Φ_l(P) − Φ_l(G(M))‖_1.    (1)

Here G is the first-stage generator in our framework. The hyperparameters {λ_l} balance the contribution of each layer l to the loss.

5. Implementation

5.1. Pose Discriminator

To evaluate whether a skeleton sequence is an excellent dance, we believe the most indispensable factors are the intra-frame representation of joint co-occurrences and the inter-frame representation of skeleton temporal evolution. To extract features of a pose sequence, we explore multi-stream CNN-based methods and adopt the Hierarchical Co-occurrence Network framework [20] to enable the discriminators to differentiate real and fake pose sequences.

Two-Stream CNN. The input to the pose discriminator is a skeleton sequence X. The temporal difference is interpolated to have the same shape as X. The skeleton sequence and the temporal difference are then fed into the network directly as two streams of input. Their feature maps are fused by concatenation along channels, after which convolutional and fully-connected layers extract the features.

5.2. Local Temporal Discriminator

One of the objectives of the pose generator is the temporal coherence of the generated skeleton sequence. For example, when a dancer moves the left foot, the right foot should keep still for multiple frames.
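For illustration, the slicing of a generated sequence into the K overlapping sub-sequences fed to the Local Temporal Discriminator (Section 3) can be sketched as follows; the stride of 3 is our inference from the reported T = 50, t = 5, and K = 16, not a value stated in the paper:

```python
import torch

def overlapping_subsequences(seq, length=5, stride=3):
    """Split a skeleton sequence (T, 2V) into overlapping windows.
    With T=50, length=5, stride=3 this yields K=16 sub-sequences,
    each of shape (length, 2V)."""
    # unfold gives (K, 2V, length); permute back to (K, length, 2V)
    return seq.unfold(0, length, stride).permute(0, 2, 1)

seq = torch.randn(50, 36)            # T=50 frames, 2V=36 coordinates
subs = overlapping_subsequences(seq)
print(subs.shape)                    # torch.Size([16, 5, 36])
```

Each sub-sequence then receives its own realism score from the small classifier, so temporal artifacts are penalized locally rather than averaged over the whole sequence.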
Similar to PatchGAN [14, 47, 40], we propose the Local Temporal Discriminator, a 1D version of PatchGAN, to achieve coherence between consecutive frames. The Local Temporal Discriminator consists of a trimmed pose discriminator and a small classifier.

Figure 4: In each section, the first image is a skeleton generated by the model without pose perceptual loss, and the second image is a skeleton generated by the model with pose perceptual loss for the same piece of music.

5.3. Global Content Discriminator

Dance is closely related to music, and the harmony between music and dance is a crucial criterion for evaluating a dance sequence. Inspired by [38], we propose the Global Content Discriminator to deal with the relationship between music and dance.

As mentioned previously, the music is encoded as a sequence O_M = {H_{M1}, H_{M2}, ..., H_{MT}}. Though a GRU can capture long-term dependencies, it is still challenging for a GRU to encode the entire music information. In our experiments, using only H_{MT} to represent the music feature F_M leads to a collapse of the beginning part of the skeleton sequence. Therefore, we use the self-attention mechanism [22] to assign a weight to each hidden state and obtain a comprehensive embedding. Below, we briefly describe the self-attention mechanism used in our framework.

Self-attention mechanism. Given O_M ∈ R^{T × k}, we compute a weight for each time step by

    r = W_{s2} tanh(W_{s1} O_M^T),    (2)

    a_i = −log( exp(r_i) / Σ_j exp(r_j) ),    (3)

where r_i is the i-th element of r, W_{s1} ∈ R^{l × k}, and W_{s2} ∈ R^{1 × l}; a_i is the weight assigned to the i-th time step in the sequence of hidden states. The music feature F_M is then computed by multiplying the weights A = [a_1, a_2, ..., a_T] with O_M, written as F_M = A O_M.

5.4. Other Loss Functions

GAN loss L_adv.
The Local Temporal Discriminator (D_local) is trained on overlapping skeleton sub-sequences sampled with S(·) from a whole skeleton sequence. The Global Content Discriminator (D_global) judges the harmony between the skeleton sequence and the input music m. We write x = G(m) for the generated sequence and p for the ground-truth skeleton sequence. We also apply a gradient penalty [13] term to D_global. The adversarial loss is defined as

    L_adv = E_p[log D_local(S(p))] + E_{x,m}[log(1 − D_local(S(x)))]
          + E_{p,m}[log D_global(p, m)] + E_{x,m}[log(1 − D_global(x, m))]
          + w_GP E_{x̂,m}[(‖∇_{x̂,m} D_global(x̂, m)‖_2 − 1)²],    (4)

where w_GP is the weight of the gradient penalty term.

Figure 5: Noisy data caused by occlusion and overlapping. The first part of the K-pop dataset contains a large number of such skeletons; the second part contains few inaccurate skeletons.

L1 distance L_{L1}. Given a ground-truth dance skeleton sequence Y with the same shape as X ∈ R^{T × 2V}, the reconstruction loss at the joint level is

    L_{L1} = Σ_{j ∈ [0, 2V]} ‖Y_j − X_j‖_1.    (5)

Feature matching loss L_FM. We adopt the feature matching loss from [39] to stabilize the training of the Global Content Discriminator D:

    L_FM = E_{p,m} Σ_{i=1}^{M} ‖D_i(p, m) − D_i(G(m), m)‖_1,    (6)

where M is the number of layers in D and D_i denotes the i-th layer of D. We omit the normalization term of the original L_FM to fit our architecture.

Full objective. Our full objective is

    arg min_G max_D  L_adv + w_P L_P + w_FM L_FM + w_{L1} L_{L1},    (7)

where w_P, w_FM, and w_{L1} are the weights of the loss terms.

5.5. Pose to Video

Recently, researchers have been studying motion transfer, especially for transferring dance motion between two videos [7, 23, 40, 46]. Among these methods, we adopt the approach proposed by Chan et al. [7] for its simplicity and effectiveness.
Given a skeleton sequence and a video of a target person, this framework can transfer the movement of the skeleton sequence to the target person. We use a third-party implementation².

² https://github.com/CUHKSZ-TQL/EverybodyDanceNow_reproduce_pytorch

Table 1: Details of our datasets. All datasets are cut into 5-second pieces; "Number" is the number of pieces. For the Let's Dance Dataset and FMA, 70% is used for training, 5% for validation, and 25% for testing.

(a) Let's Dance Dataset:
    Ballet 1165, Break 3171, Cha 4573, Flamenco 2271, Foxtrot 2981,
    Jive 3765, Latin 2205, Pasodoble 2945, Quickstep 2776, Rumba 4459,
    Samba 3143, Square 5649, Swing 3528, Tango 3321, Tap 2860, Waltz 3046.

(b) K-pop:
    Clean Train 1636, Clean Val 146, Noisy Train 656, Noisy Val 74.

(c) FMA:
    Pop 4334, Rock 4324, Instrumental 4299, Electronic 4333,
    Folk 4346, International 4341, Hip-Hop 4303, Experimental 4323.

6. Experiments

6.1. Datasets

K-pop dataset. To build our dataset, we apply OpenPose [4, 5, 42] to online videos to obtain skeleton sequences. In total, we collected 60 videos, each about 3 minutes long, with a single female dancer, and split them into two datasets. The first part, with 20 videos, is very noisy, as shown in Figure 5; it is used to test the performance of the pose perceptual loss on noisy data, with 18 videos for training and 2 for evaluation. The second part, with 40 videos, is relatively clean and is used for our automatic dance video generation task, with 37 videos for training and 3 for evaluation. Details of this dataset are shown in Table 1b.

Let's Dance Dataset. Castro et al. [6] released a dataset containing 16 classes of dance, presented in Table 1a. The dataset provides human skeleton sequences for pose recognition.
Though there exist enormous motion datasets [18, 30, 33] with skeleton sequences, we choose the Let's Dance Dataset to pre-train our ST-GCN for the pose perceptual loss, since dance differs from ordinary human motion.

FMA. Our cross-modal evaluation requires extracting music features. To this end, we adopt CRNN [9] and train it on the Free Music Archive (FMA) dataset, which provides genre labels and music content for genre classification. Details of FMA are shown in Table 1c.

Table 2: Results of our model and baselines. On the cross-modal evaluation, lower is better; for BRISQUE, higher is better. The baselines are described in Section 6.3.

    Metric        Cross-modal    BRISQUE
    Rand Frame    0.151          –
    Rand Seq      0.204          –
    L1            0.312          40.66
    Global D      0.094          40.93
    Local D       0.068          41.46
    Our model     0.046          41.11

6.2. Experimental Setup

All models are trained on an Nvidia GeForce GTX 1080 Ti GPU. The first stage of our framework is implemented in PyTorch [28] and takes approximately one day to train for 400 epochs. For the hyperparameters, we set V = 18, T = 50, t = 5, K = 16, and S = 16000. For the self-attention mechanism, we set k = 256 and l = 40. For the loss function, the hyperparameters {λ_l} are set to [20, 5, 1, 1, 1, 1, 1, 1, 1], with w_GP = 1, w_P = 1, w_FM = 1, and w_L1 = 200. Though the weight of the L1 distance loss is relatively large, the absolute value of the L1 loss is quite small. We use Adam [17] for all networks, with a learning rate of 0.003 for the generator, 0.003 for the Local Temporal Discriminator, and 0.005 for the Global Content Discriminator. The second stage, which transfers poses to video, takes approximately three days to train and adopts the same hyperparameters as [7]. ST-GCN and CRNN are also pre-trained with Adam [17] with a learning rate of 0.002.
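For reference, the reported hyperparameters and the per-network Adam optimizers can be collected as follows; the three `nn.Linear` modules are placeholders standing in for the actual generator and discriminators:

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in Section 6.2, gathered in one place.
config = {
    "V": 18, "T": 50, "t": 5, "K": 16, "S": 16000,
    "attention": {"k": 256, "l": 40},
    "lambda_layers": [20, 5, 1, 1, 1, 1, 1, 1, 1],
    "w_GP": 1, "w_P": 1, "w_FM": 1, "w_L1": 200,
    "lr": {"G": 0.003, "D_local": 0.003, "D_global": 0.005},
}

# One Adam optimizer per network, with the learning rates above.
# The modules here are placeholders, not the real architectures.
G, D_local, D_global = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
opt_G = torch.optim.Adam(G.parameters(), lr=config["lr"]["G"])
opt_D_local = torch.optim.Adam(D_local.parameters(), lr=config["lr"]["D_local"])
opt_D_global = torch.optim.Adam(D_global.parameters(), lr=config["lr"]["D_global"])
```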
ST-GCN achieves 46% precision on the Let's Dance Dataset. CRNN is pre-trained on FMA with a top-2 accuracy of 67.82%.

6.3. Evaluation

We evaluate the following baselines and our model:

• L1: the generator is trained with only the L1 distance.

• Global D: based on L1, we add the Global Content Discriminator.

• Local D: based on Global D, we add the Local Temporal Discriminator.

• Our model: based on Local D, we add the pose perceptual loss.

These conditions are used in Table 2.

Figure 6: Synthesized music video conditioned on the music "LIKEY" by TWICE. For each 5-second dance video, we show 4 frames. The top row shows the skeleton sequence, and the bottom row shows the synthesized video frames conditioned on different target videos.

6.3.1 User Study

To evaluate the quality of the generated skeleton sequences (our main contribution), we conduct a user study comparing the synthesized skeleton sequences with the ground-truth skeleton sequences. We randomly sample 10 pairs of sequences of different lengths and render the sequences as videos. To make the study fair, we verify the ground-truth skeletons and re-annotate the noisy ones. In the user study, each participant watches the video of the synthesized skeleton sequence and the video of the ground-truth skeleton sequence in random order, and then chooses one of two options: 1) the first video is better, or 2) the second video is better. As shown in Figure 7, in 43.0% of the comparisons, participants voted for our synthesized skeleton sequence. This user study shows that our model can choreograph at a level close to that of real artists.

6.3.2 Cross-modal Evaluation

It is challenging to evaluate whether a dance sequence is suitable for a piece of music. To the best of our knowledge, there is no existing method to evaluate the mapping between music and dance.
Therefore, we propose a two-step cross-modal metric, shown in Figure 8, to estimate the similarity between music and dance.

Consider a training set X = {(P, M)}, where P is a dance skeleton sequence and M is the corresponding music. With a pre-trained music feature extractor E_m [9], we aggregate all the music embeddings F = {E_m(M), M ∈ X} into an embedding dictionary.

Figure 7: Results of the user study comparing the synthesized skeleton sequences with the ground truth ("Ours is better": 43%; "Ground truth is better": 57%), broken down by dancer and non-dancer participants. There are 27 participants in total, including seven dancers. In nearly half of the comparisons, users cannot tell which skeleton sequence is better given the music. To make the results reliable, we ensure there is no unclean skeleton in the study.

The input to our evaluation is a piece of music M_u. With our generator G, we obtain the synthesized skeleton sequence P_u = G(M_u). The first step is to find a skeleton sequence that represents the music M_u: we obtain the music feature F_u = E_m(M_u), let F_v be the nearest neighbor of F_u in the embedding dictionary, and use its corresponding skeleton sequence P_v to represent the music M_u. The second step is to measure the similarity between the two skeleton sequences with the metric learning objective based on a triplet architecture and Maximum Mean Discrepancy proposed by Coskun et al. [10]. More implementation details of this metric are given in the supplementary materials.

Figure 8: Cross-modal evaluation. We first project all the music pieces in the training set of the K-pop dataset into an embedding dictionary. We train the pose metric network based on the K-means clustering result of the embedding dictionary; for the K-means clustering, we choose K = 5 according to the Silhouette Coefficient. The similarity between M_u and P_u is measured by ‖S − S′‖_2.

6.3.3 Quantitative Evaluation

To evaluate the quality of a final result when there is no reference for the synthesized frames, Chan et al. [7] propose to transfer motion between the same video and measure the output with SSIM [41] and LPIPS [45]. For our task, such metrics are inapplicable because there are no reference frames for the generated dance video. Instead, we apply BRISQUE [26], a no-reference image quality assessment, to measure the quality of our final generated dance videos.

Figure 9: Our synthesized music video with a male student as a dancer.

As shown in Table 2, utilizing the Global Content Discriminator and the Local Temporal Discriminator improves the score even for single-frame results. With the addition of the pose perceptual loss, the poses become more plausible, and transferring these more diverse poses to frames may lead to a slight decline in the score; more significant differences can be observed in our video. To validate our proposed evaluation, we also try two random conditions:

• Rand Frame: randomly select 50 frames from the training dataset for the input music instead of feeding the music into the generator.

• Rand Seq: randomly select a skeleton sequence from the training dataset for the input music instead of feeding the music into the generator.

To make the random results stable, we run ten random processes and average the scores.

7. Conclusion

We have presented a two-stage framework to generate dance videos, given almost any music. With our proposed pose perceptual loss, our model can be trained on dance videos with noisy pose skeleton sequences (no human labels). Our approach can create arbitrarily long, good-quality videos.
We hope that this pipeline of synthesizing skeleton sequences and dance videos with the pose perceptual loss can support future work, including more creative video synthesis for artists.

References

[1] Omid Alemi, Jules Françoise, and Philippe Pasquier. GrooveNet: Real-time music-driven dance movement generation using artificial neural networks. 2017.
[2] Joan Bruna, Pablo Sprechmann, and Yann LeCun. Super-resolution with deep convolutional sufficient statistics. In ICLR, 2016.
[3] Haoye Cai, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. Deep video generation, prediction and completion of human action sequences. In ECCV, 2018.
[4] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. 2018.
[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[6] Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, and Irfan Essa. Let's Dance: Learning from online dance videos. 2018.
[7] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. In ICCV, 2019.
[8] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
[9] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In ICASSP, 2017.
[10] Huseyin Coskun, David Joseph Tan, Sailesh Conjeti, Nassir Navab, and Federico Tombari. Human motion analysis with deep metric learning. In ECCV, 2018.
[11] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NeurIPS, 2016.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[13] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[16] Jae Woo Kim, Hesham Fouad, and James K. Hahn. Making them dance. In AAAI Fall Symposium: Aurally Informed Performance, 2006.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[19] Juheon Lee, Seohyun Kim, and Kyogu Lee. Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. CoRR, 2018.
[20] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, 2018.
[21] Yitong Li, Martin Renqiang Min, Dinghan Shen, David E. Carlson, and Lawrence Carin. Video generation from text. 2017.
[22] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.
[23] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid Warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In ICCV, 2019.
[24] Julieta Martinez, Michael J. Black, and Javier Romero. On human motion prediction using recurrent neural networks. In CVPR, 2017.
[25] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[26] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Processing, 2012.
[27] Anh Mai Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NeurIPS, 2016.
[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[29] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
[30] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In CVPR, 2016.
[31] Takaaki Shiratori, Atsushi Nakazawa, and Katsushi Ikeuchi. Dancing-to-music character animation. Computer Graphics Forum, 2006.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, 2012.
[34] T. Chen and R. R. Rao. Audio-visual integration in multimodal communication. Proceedings of the IEEE, 1998.
[35] Taoran Tang, Jia Jia, and Hanyang Mao. Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis. In ACM Multimedia, 2018.
[36] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
[37] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
[38] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. End-to-end speech-driven facial animation with temporal GANs. In BMVC, 2018.
[39] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
[40] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Nikolai Yakovenko, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
[41] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Processing, 2004.
[42] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
[43] Nelson Yalta. Sequential deep learning for dancing motion generation.
[44] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
[45] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[46] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L. Berg. Dance Dance Generation: Motion transfer for internet videos. CoRR, 2019.
[47] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
