Dilated Convolution with Dilated GRU for Music Source Separation


Authors: Jen-Yu Liu, Yi-Hsuan Yang

Research Center for IT Innovation, Academia Sinica
jenyuliu.tw@gmail.com, yang@citi.sinica.edu.tw

Abstract

Stacked dilated convolutions used in Wavenet have been shown effective for generating high-quality audio. By replacing pooling/striding with dilation in convolution layers, they can preserve high-resolution information and still reach distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is to separate different sound sources while maintaining the quality of the separated sounds. Therefore, this paper investigates using stacked dilated convolutions as the backbone for music source separation. However, while stacked dilated convolutions can reach a wider context than standard convolutions, their effective receptive fields are still fixed and may not be wide enough for complex music audio signals. To reach information at remote locations, we propose to combine dilated convolution with a modified version of gated recurrent units (GRU), called the 'Dilated GRU,' to form a block. A Dilated GRU unit receives information from k steps before instead of the previous step, for a fixed k. This modification allows a GRU unit to reach a location with fewer recurrent steps and run faster because it can execute partially in parallel. We show that the proposed model with a stack of such blocks performs equally well or better than the state-of-the-art models for separating vocals and accompaniments.

1 Introduction

Music source separation has received much attention and has made impressive progress in recent years [Stöter et al., 2018]. It has different use cases such as generating accompaniment from pop songs for karaoke [Rafii et al.
, 2018], separating specific sources as a pre-processing tool for other tasks such as music transcription [Paulus and Virtanen, 2005], and DJ-related applications [Vande Veire and De Bie, 2018].

In recent years, neural network-based methods have obtained promising results for music source separation [Nugraha et al., 2016; Uhlich et al., 2017; Takahashi et al., 2018; Liutkus et al., 2017; Stöter et al., 2018]. Among them, [Nugraha et al., 2016] proposed a DNN architecture with fully-connected layers, one of the first neural network models for music source separation. Music source separation requires the model to produce high-resolution predictions. This has mostly been achieved either by using encoder-decoder architectures with skip connections [Jansson et al., 2017; Takahashi and Mitsufuji, 2017; Takahashi et al., 2018; Liu and Yang, 2018; Stoller et al., 2018] or by using recurrent neural networks [Uhlich et al., 2017].

We notice that the stacked dilated convolutions used in Wavenet [van den Oord et al., 2016] might also work well for this purpose. In Wavenet, the kernels in convolution layers are dilated more and more as the network goes deeper, so the entire network can access neighboring information as well as distant information, depending on how many dilated convolution layers are stacked. Dilated convolutions have been used in audio generation [van den Oord et al., 2016], machine translation [Kalchbrenner et al., 2017], speech recognition [Sercu and Goel, 2016], semantic segmentation [Yu and Koltun, 2016], and video generation [Kalchbrenner et al., 2017]. To the best of our knowledge, however, dilated convolutions have not been used for audio regression problems such as music source separation.

Music audio signals are usually very long, even when using the spectrogram as the feature.
For example, a 3-minute audio recording under a 44,100 Hz sampling rate and a 1,024-sample hop size for the short-time Fourier transform has 7,752 frames in its spectrogram representation. It is not easy to stack enough dilated convolutions to reach that far with limited computational resources.

We propose to combine a recurrent layer, specifically a gated recurrent unit (GRU) [Cho et al., 2014; Chung et al., 2014], and a dilated convolution to form a block. A GRU can in theory reach very far away in one layer if the information does not decay too fast. As mentioned above, music audio signals can be very long. This can sometimes be a problem for a recurrent layer like the GRU, because it has to process its input sequence sequentially.

We use the Dilated GRU to alleviate this problem. A Dilated GRU unit receives information from k steps before instead of the previous step, for a fixed k. This modification allows a GRU unit to reach a location with fewer recurrent steps and run faster because it can execute partially in parallel. The dilated version of recurrent layers has also been used in other contexts. For example, [Chang et al., 2017] stacked multiple recurrent layers with increasing dilations for speaker identification, while [Vezhnevets et al., 2017] used them as managers in reinforcement learning; both are pure RNN architectures. In contrast, we combine them with dilated grouped convolutions to form processing blocks. We call these blocks the 'Dilated recurrent-Dilated convolution' (D2) blocks.

We conduct extensive experiments to verify the capability of the proposed model, as well as the relative importance of its components. We also investigate how the D2 blocks work. Our evaluation shows that our model (GRU dilation 1) outperforms the state-of-the-art model [Takahashi et al., 2018] by 0.25 dB and 0.57 dB in the signal-to-distortion ratio (SDR) for vocal and accompaniment separation, respectively.
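The frame count quoted above can be checked in a few lines. This is a minimal sketch (the function name is ours) assuming the common one-frame-per-hop convention plus one initial frame; exact counts depend on the STFT padding convention:

```python
def stft_frame_count(duration_sec, sample_rate, hop_size):
    """Frames in a spectrogram: one frame per hop, plus one for the start."""
    total_samples = duration_sec * sample_rate
    return total_samples // hop_size + 1

# A 3-minute recording at 44,100 Hz with a 1,024-sample hop:
print(stft_frame_count(180, 44100, 1024))  # 7752
```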
2 Proposed model

Our model works on the magnitude of spectrograms. For a D × T tensor M_mix of the mixture magnitude, the goal of source separation is to predict a D × T tensor M_s for each source s ∈ S, where D, T, and S denote the feature dimension, the sequence length, and the set of sources, respectively. Our model takes M_mix as the input and predicts all the source tensors M_s at once.

2.1 Dilated GRU

The GRU¹ [Cho et al., 2014; Chung et al., 2014] is a popular design choice for recurrent layers. In this paper, we use a modified version of the GRU where the unit at a temporal point receives information from k steps before instead of the previous step, for a fixed k. The same idea of dilating the temporal connections of an RNN has also been used by [Vezhnevets et al., 2017; Chang et al., 2017].

A Dilated GRU with dilation k involves the following operations:

r_t = σ(W_ir x_t + b_ir + W_hr h_{t−k} + b_hr),
z_t = σ(W_iz x_t + b_iz + W_hz h_{t−k} + b_hz),
n_t = tanh(W_in x_t + b_in + r_t ⊙ (W_hn h_{t−k} + b_hn)),
h_t = (1 − z_t) ⊙ n_t + z_t ⊙ h_{t−k},

where the W's are weight matrices, the b's are bias terms, σ is the sigmoid function, and ⊙ denotes element-wise multiplication; r, z, n, and h are all vectors. The temporal indices are partitioned into k disjoint sets, meaning that they can be processed independently in parallel. Figure 1 provides an illustration.

2.2 Dilated Recurrent-Dilated Convolution Block (D2 block)

In contrast to a conventional convolution, the kernel of a dilated convolution is dilated so that it takes as input farther neighbors instead of the immediate neighbors. We propose to stack a dilated grouped convolution on top of a Dilated GRU layer with bidirectional connections.
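The Dilated GRU recurrence can be sketched directly. This is a minimal NumPy version for illustration only: the function name and the stacked r/z/n weight layout are our own choices (mirroring PyTorch's convention), and a practical implementation would process the k independent index chains in parallel rather than loop over t:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_gru(x, k, Wi, Wh, bi, bh):
    """Run a unidirectional Dilated GRU over x of shape (T, D_in).

    Wi: (3H, D_in) input weights for r, z, n stacked; Wh: (3H, H) hidden
    weights; bi, bh: (3H,) biases. h_{t-k} replaces h_{t-1} in every gate,
    so the T temporal indices split into k independent chains.
    """
    T, _ = x.shape
    H = Wh.shape[1]
    h = np.zeros((T, H))
    Wir, Wiz, Win = np.split(Wi, 3)
    Whr, Whz, Whn = np.split(Wh, 3)
    bir, biz, bin_ = np.split(bi, 3)
    bhr, bhz, bhn = np.split(bh, 3)
    for t in range(T):
        h_prev = h[t - k] if t >= k else np.zeros(H)
        r = sigmoid(Wir @ x[t] + bir + Whr @ h_prev + bhr)
        z = sigmoid(Wiz @ x[t] + biz + Whz @ h_prev + bhz)
        n = np.tanh(Win @ x[t] + bin_ + r * (Whn @ h_prev + bhn))
        h[t] = (1 - z) * n + z * h_prev
    return h
```

With dilation k, the outputs at indices 0, k, 2k, ... depend only on the inputs at those indices, which is exactly why the k chains can run in parallel.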
The input to the block, the output of the Dilated GRU, and the output of the dilated grouped convolution are summed together to form the output of the block. This is illustrated in Figure 2. Leaky ReLU and weight normalization [Salimans and Kingma, 2016] are applied to all the convolution layers.

¹ We use the GRU version implemented by PyTorch (https://pytorch.org/docs/stable/nn.html#gru), which is slightly different from the original version.

Figure 1: Parallel branches in Dilated GRU. The locations with odd indices and even indices can be processed in parallel with dilation 2.

In stacked dilated convolutions, the model could potentially pay less attention to neighboring locations as the dilation in convolution layers increases. We hope to alleviate this problem by using vertical skip connections from the block input to the block output, as also shown in Figure 2.

Grouping in a convolution layer is also used in architectures such as MobileNet [Howard et al., 2017] and ShuffleNet [Zhang et al., 2018]. It reduces the connections between input and output, and hence the memory usage and computation, while still maintaining the capacity of the network. In one layer, the different groups have their own inputs and cannot directly communicate with each other, so grouped convolutions are usually followed by mixing layers such as fully-connected layers [Howard et al., 2017] or shuffle layers [Zhang et al., 2018]. We do not use these extra layers in the proposed model, but instead delegate this communication task to the GRU layer, since the computation in a GRU is fully-connected across channels. Therefore, the Dilated GRU at the beginning of each block mixes the information from different groups, in addition to its other task of aggregating information through time.

2.3 Full model

The full model is shown in Figure 4.
The input is first processed by a convolution layer. Then, several D2 blocks are stacked together. At the end, the output of the last D2 block is processed by a convolution layer, which outputs the separation predictions of all the sources at once.

Most top models in SiSEC2018 use certain forms of denoising auto-encoders that down-sample and up-sample along the temporal axis in the encoders and decoders, respectively [Jansson et al., 2017; Takahashi and Mitsufuji, 2017; Takahashi et al., 2018; Liu and Yang, 2018]. They also use symmetric skip connections connecting a pair of encoder and decoder layers to compensate for the loss of high-resolution information due to down-sampling. In contrast, our models maintain the temporal resolution without down-sampling and up-sampling during the whole process, so the high-resolution information is not lost. In this approach, the blocks can be easily stacked without considering the symmetry of skip connections across different blocks. More details of our model can be found in Table 1.

Figure 2: Dilated GRU-Dilated convolution block (D2 block). A Dilated GRU layer is followed by a dilated convolution with groups. The input of the block, the output of the Dilated GRU, and the output of the dilated convolution are added together to form the output of the block. Two views of the proposed architecture are shown, since the architecture contains temporal connections (better seen in the Temporal view) as well as grouped channels in convolutions (better seen in the Channel view). (a) Temporal view; (b) Channel view.

Figure 3: Track-by-track comparison of the 50 test songs. The SDR of a track is represented by a block colored on the yellow-green-black color map (high to low). 'Proposed' represents the model with GRU dilation 1 trained with 20-sec clips. (a) Vocals; (b) Accompaniment.

Figure 4: Network architecture of the proposed model. A convolution layer processes the input feature map, followed by a stack of D2 blocks. A convolution layer processes the output of the last D2 block and outputs all the predicted separations.

3 Evaluation

3.1 Evaluation Setup

We evaluate the proposed model for music source separation on the MUSDB18 dataset² used in SiSEC2018 [Stöter et al., 2018]. MUSDB18 contains 100 songs for training and 50 songs for evaluation, all with a 44,100 Hz sampling rate. Each song contains the source audios of 'vocals,' 'drums,' 'bass,' and 'other.' The evaluation is conducted using the evaluation package provided by the SiSEC2018 organizers.³ The evaluation metrics are computed by taking the median over all the tracks in the evaluation set, as done in the official report [Stöter et al., 2018], using the SiSEC analysis package.⁴

Table 1: Model specification. The two numbers in the parentheses after '1D Conv' are kernel size and dilation size, respectively. All the 1D convolution layers in Blocks 1, 2, and 3 have 32 groups, where each group has 64 input channels and 64 output channels.

Unit                 Spec.            Input  Output
Input convolution    1D Conv (3, 1)   2049   2048
Block 1              Dilated GRU      2048   2048
                     1D Conv (3, 2)   2048   2048
Block 2              Dilated GRU      2048   2048
                     1D Conv (3, 4)   2048   2048
Block 3              Dilated GRU      2048   2048
                     1D Conv (3, 8)   2048   2048
Output convolution   1D Conv (3, 1)   2048   4 × 2049

² https://sigsep.github.io/datasets/musdb.html

We evaluate the performance of the models mainly with objective metrics commonly used in music source separation. Unless otherwise specified, we report the SDR in the tables.

Our models use the log-scale magnitudes of the complex spectrogram as the feature. First, complex spectrograms are derived by applying a short-time Fourier transform (STFT) to the waveforms, with a 4,096-sample window size and 3/4 overlap. Then, the magnitudes of the complex spectrograms are computed.
The log(1 + magnitude) of the mixture audios and the source audios are used as the inputs and the training targets, respectively. Mean squared error is used as the loss function, and Adam [Kingma and Ba, 2015] is used to update the weights. As data augmentation has been found useful in the literature [Takahashi et al., 2018; Uhlich et al., 2017; Liu and Yang, 2018], we use data augmentation in the training process by randomly shuffling the audio clips within each source and then collecting the audio clips from the four sources in the shuffled orders. A mixture clip is formed by summing the collected source clips. The source audio clips are shuffled at the beginning of every epoch.

Batch sizes of 20 and 5 are used for 5-sec and 20-sec training, respectively, so that they have roughly the same number of weight updates in training. 1/10 of the training set is used for validation. The weights from the epoch with the best validation loss during 500-epoch training are kept for each model.

To convert the outputs of the model to waveforms, the above process is reversed. The phases of the mixture complex spectrograms are used with the predicted spectrogram magnitudes to construct the complex spectrograms. Before converting back to waveforms, a multi-channel Wiener filter is applied to the complex spectrograms, as widely done in recent source separation systems [Nugraha et al., 2016; Uhlich et al., 2017; Takahashi et al., 2018; Liu and Yang, 2018].

3.2 Comparison with Participants of SiSEC2018

Table 2 shows the performance of the proposed models and the top-performing models of SiSEC2018 [Rafii et al., 2018].⁵ We show the top 5 models (one top model from each group) that are trained without extra data in SiSEC2018. In this subsection, each setting is run three times, and the mean score and standard deviation of the three runs are reported.
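The source-shuffle augmentation described in the training setup above can be sketched as a toy version (the function name and dict layout are ours; a real pipeline would operate on arrays of audio samples rather than plain lists):

```python
import random

def remix_epoch(source_clips, seed=None):
    """Source-shuffle augmentation: independently shuffle the clip order of
    each source, then sum clips position-wise to form new mixture clips.
    `source_clips` maps a source name to a list of equal-length clips
    (each clip is a plain list of samples in this toy version)."""
    rng = random.Random(seed)
    shuffled = {}
    for name, clips in source_clips.items():
        order = list(range(len(clips)))
        rng.shuffle(order)  # re-drawn at the start of every epoch
        shuffled[name] = [clips[i] for i in order]
    n_clips = len(next(iter(shuffled.values())))
    mixtures = []
    for i in range(n_clips):
        length = len(next(iter(shuffled.values()))[i])
        mix = [sum(s[i][j] for s in shuffled.values()) for j in range(length)]
        mixtures.append(mix)
    return mixtures, shuffled
```

Because each source's clip order is shuffled independently, the number of distinct mixtures grows combinatorially with the number of clips, even though no new source audio is created.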
First, we can see that the model using stacked dilated convolutions alone, without recurrent layers and skip connections, already has performance comparable with JY3, a top-3 model in SiSEC2018. This verifies our intuition that stacked dilated convolutions are strong not only in generation problems but also in music source separation.

³ https://github.com/sigsep/sigsep-mus-eval
⁴ https://github.com/sigsep/sigsep-mus-2018-analysis/blob/master/sisec-2018-paper-figures/boxplot.py
⁵ The raw data of the evaluation metrics are available at https://github.com/sigsep/sigsep-mus-2018

The best performance in both 'vocals' and 'accompaniment' is achieved by the proposed model with dilation 1 trained with 20-sec clips. In general, models trained with longer and shorter clips have similar performance.

Dilation 2 and dilation 1 are both strong settings. The performance difference between the proposed models with dilation 1 and dilation 2 is not as large as the difference between the proposed models and TAK1. In practice, the proposed model with dilation 2 has the benefit of parallel computation.

Our models are strong in 'vocals,' 'other,' and the overall 'accompaniment.' They perform slightly worse in 'drums' and 'bass' compared to TAK1 and UHL2. Our conjecture is that the different sources compete for resources in the model, because we train one model for all four sources. The sounds in 'vocals' and 'other' are usually louder than those in 'drums' and 'bass,' so they are weighted more in the loss function. In contrast, TAK1 and UHL2 do not have this problem because they train one model for each source [Takahashi et al., 2018; Uhlich et al., 2017]. This can be a trade-off between resources and performance.

In Table 3, we compare our models with TAK1 using other performance metrics in addition to SDR. Our models consistently outperform TAK1 in 'vocals' and 'other' in all metrics.
We also show the track-by-track comparison in Figure 3. In general, our proposed model performs better on vocals in almost all tracks, while our model and TAK1 have their own advantages on accompaniment in different tracks.

Table 2: Comparison with models in SiSEC2018 (in SDR). The table shows the top models in SiSEC2018 in the upper part, the baseline stacked dilated convolutions in the middle part, and the proposed models in the lower part. All the proposed models contain three blocks.

Model                          Description      vocals       drums        bass         other        accomp.
TAK1 [Takahashi et al., 2018]                   6.60         6.43         5.16         4.15         12.83
UHL2 [Uhlich et al., 2017]                      5.93         5.92         5.03         4.19         12.23
JY3 [Liu and Yang, 2018]                        5.74         4.66         3.67         3.40         12.08
MDL1 [Mimilakis et al., 2018]                   4.02         NA           NA           NA           9.92
RGT1 [Roma et al., 2018]                        3.85         3.44         2.70         2.63         NA
Stacked dilated convolutions   20-sec training  5.34 ± 0.17  5.20 ± 0.05  3.72 ± 0.10  3.33 ± 0.12  11.88 ± 0.05
Stacked dilated convolutions   5-sec training   5.56 ± 0.07  5.35 ± 0.19  3.76 ± 0.04  3.52 ± 0.07  11.95 ± 0.09
Proposed, GRU dilation 2       20-sec training  6.76 ± 0.06  5.85 ± 0.07  4.84 ± 0.09  4.49 ± 0.08  13.19 ± 0.02
Proposed, GRU dilation 2       5-sec training   6.78 ± 0.05  5.66 ± 0.09  4.96 ± 0.03  4.48 ± 0.09  13.22 ± 0.05
Proposed, GRU dilation 1       20-sec training  6.85 ± 0.04  5.86 ± 0.12  4.86 ± 0.05  4.65 ± 0.04  13.40 ± 0.11
Proposed, GRU dilation 1       5-sec training   6.81 ± 0.15  5.72 ± 0.06  4.58 ± 0.10  4.48 ± 0.11  13.26 ± 0.08

Table 3: Performance comparison with the state-of-the-art model (TAK1) [Takahashi et al., 2018] using SDR and other metrics. Our models (GRU dilations 1 and 2) are trained with 20-sec subclips. 'D' represents dilation.

SDR   vocals       drums        bass         other       accomp.
TAK1  6.60         6.43         5.16         4.15        12.83
D=2   6.76 ± .06   5.85 ± .07   4.84 ± .09   4.49 ± .08  13.19 ± .02
D=1   6.85 ± .04   5.86 ± .12   4.86 ± .05   4.65 ± .04  13.40 ± .11

SIR   vocals       drums        bass         other       accomp.
TAK1  14.37        11.81        10.54        6.41        16.69
D=2   14.60 ± .30  11.65 ± .13  9.44 ± .13   7.37 ± .27  18.22 ± .24
D=1   14.33 ± .31  12.26 ± .21  9.20 ± .40   7.58 ± .13  18.43 ± .31

SAR   vocals       drums        bass         other       accomp.
TAK1  6.37         6.64         5.69         4.83        14.08
D=2   6.50 ± .18   6.05 ± .12   5.89 ± .19   4.98 ± .02  13.63 ± .12
D=1   6.56 ± .05   6.13 ± .23   5.82 ± .10   4.88 ± .06  13.72 ± .18

ISR   vocals       drums        bass         other       accomp.
TAK1  11.56        12.02        9.92         9.86        22.56
D=2   13.24 ± .05  10.62 ± .24  9.63 ± .28   9.56 ± .17  22.38 ± .13
D=1   13.50 ± .10  10.69 ± .20  10.20 ± .31  9.58 ± .12  22.39 ± .31

3.3 Ablation Study

We consider ablated versions of our model in this evaluation. Specifically, we use the model with dilation 2 trained with 5-sec subclips as the baseline. The score of one training run for each setting is reported in this subsection. We refer to the model in Section 2.2 and Table 1 as the proposed model.

First, we compare different block designs. The first is the proposed block; the second, 'Dense,' is similar to the proposed block with an additional skip connection adding the input of the block to the output of the Dilated GRU; the third, 'Residual,' has a residual connection for each layer, including the Dilated GRU and the convolutions; and the fourth has the same skip connections as the proposed block but places the dilated convolution layer before the Dilated GRU. To investigate the effectiveness of the GRU, we replace the GRU with a convolution in the fifth row. The performance of these variants is presented in Table 4. The proposed design has the best performance.

Table 4: Performance comparison (in SDR) of different block designs, using dilation 2 and 3 blocks. 'DGRU' represents Dilated GRU with dilation 2; 'DGConv' represents the dilated grouped convolution described in Section 2.2. 'DGRU-DGConv' is the design presented in Table 1. 'Conv-DGConv' replaces all GRUs with non-grouped 1D convolutions with kernel size 1.

              vocals  drums  bass  other  accomp.
DGRU-DGConv   6.74    5.71   5.00  4.61   13.25
*(Dense)      6.60    5.71   4.72  4.36   13.19
*(Residual)   6.50    5.86   4.83  4.26   13.16
DGConv-DGRU   6.59    5.64   4.42  4.17   13.15
Conv-DGConv   5.81    5.28   3.97  3.65   12.24

Second, we compare different GRU dilations in Table 5. Dilations 1 and 2 each perform better than dilation 4 in vocals and accompaniment. [Chang et al., 2017] constructed a recurrent neural network with increasing dilations in the recurrent layers. We also evaluate our model with increasing dilations, but find no improvement, as shown in the last row of Table 5.

Table 5: SDR of the models with different GRU dilations, all with 3 blocks.

Dilation  vocals  drums  bass  other  accomp.
1         6.99    5.79   4.55  4.49   13.22
2         6.74    5.71   5.00  4.61   13.25
4         6.72    5.48   4.41  4.22   13.07
2, 4, 8   6.53    5.70   4.68  4.29   12.86

To evaluate the running speed, we run the model with one batch 10 times for each setting and compute the average running time. The averages are 423 ± 4.90, 340 ± 8.91, and 327 ± 10.2 ms for the models with dilation 1, dilation 2, and dilation 4, respectively. The running time is reduced by about 20% from dilation 1 to dilation 2, but there is only a marginal reduction from dilation 2 to dilation 4.

Third, models with different numbers of D2 blocks are compared in Table 6. 3 blocks and 4 blocks give close results.

Table 6: SDR of variants of our model with different numbers of blocks, all with GRU dilation 2.

# blocks  vocals  drums  bass  other  accomp.
2         6.49    5.49   4.89  4.32   13.01
3         6.74    5.71   5.00  4.61   13.25
4         6.81    5.64   4.84  4.50   13.26

3.4 How the Model Works

We investigate here how our model works. In a D2 block, the block input is added to the output of the last layer, as shown in Figure 2. Intuitively, this operation could fix the semantics of each channel, from the output of the first convolution through the output of the final block. To verify this intuition, we remove the blocks from the top of the stack one at a time in the trained dilation-2 model.
This results in an altered model without any blocks, an altered model with block 1, an altered model with blocks 1 and 2, and the original model with blocks 1, 2, and 3. We apply these altered models to songs and convert the outputs back to audio just as when evaluating the full separation model described in Section 3.1. Note that these altered models use the same weights as the full model and are not re-trained.

Figure 5: Waveforms evolving as the blocks are returned (added) to a trained model. The model with no block, the one with block 1, the one with blocks 1 and 2, and the one with all blocks 1, 2, and 3 are represented in gray-scale shades from the darkest to the lightest. The 'other' source and the other three sources show opposite behaviors as the number of blocks increases. (a) Vocals; (b) Drums; (c) Bass; (d) Other.

We find that the outputs from these altered models have very good sound quality subjectively. In terms of separation performance, it gets better as more blocks are returned (added) to the model. The fact that the separations from these altered models are audible verifies the intuition that the semantics of the channels are fixed to some degree.

By plotting the waveforms converted from the outputs of the altered models, we can get a grasp of how the original model works. Figure 5 shows such plots. We observe that 'vocals,' 'drums,' and 'bass' get more and more activations as more blocks are used. In contrast, 'other' gets less and less activations as more blocks are used. These observations give us some hints on the strategy the model has developed. In the lower blocks, the model stores most of the information in the 'other' source. As the process moves forward, 'vocals' recovers more and more information from the channels related to the other three sources. The changes in 'drums' and 'bass' are relatively smaller compared to the change in 'vocals.'
The objective metrics also give us some hints on the process, as shown in Table 7. 'Drums' and 'bass' are relatively simple signals compared to 'vocals' and 'other,' so the model can already roughly separate these two sources even without any blocks. In contrast, 'vocals' and 'other' are very poor without any blocks. The capacity of the model increases as the blocks are added back.

Table 7: Performance comparison (in SDR) as the blocks are added back to a trained model. In 'No blocks,' all D2 blocks are removed from the model. Then, the blocks are added one by one.

                vocals  drums  bass  other
No blocks       0.82    2.28   1.65  -0.05
Block 1         1.54    2.58   1.56  1.01
Blocks 1, 2     2.82    3.69   2.86  1.85
Blocks 1, 2, 3  6.74    5.71   5.00  4.61

4 Conclusion

We have presented a model for music source separation. It uses a stack of dilated convolutions as the backbone and consists of a stack of D2 blocks. In a D2 block, a Dilated GRU is combined with a dilated convolution. The Dilated GRU runs faster than a standard GRU while maintaining the performance. The proposed model achieves state-of-the-art performance in separating 'vocals' and 'accompaniment.'

Currently, music source separation focuses on separating sources in pop music with vocals. In the future, we aim to separate other kinds of sources, such as the different instruments in a symphony and the rich electronic sounds in EDM.

References

[Chang et al., 2017] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. In Proc. Advances in Neural Information Processing Systems, 2017.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.

[Chung et al.
, 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. Proc. NIPS Workshop on Deep Learning, 2014.

[Howard et al., 2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint, April 2017.

[Jansson et al., 2017] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proc. Int. Society for Music Information Retrieval Conf., 2017.

[Kalchbrenner et al., 2017] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In Proc. Int. Conf. Machine Learning, pages 1771–1779, 2017.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learning Representations, 2015.

[Liu and Yang, 2018] Jen-Yu Liu and Yi-Hsuan Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In Proc. IEEE Int. Conf. Machine Learning and Applications, pages 773–778, 2018.

[Liutkus et al., 2017] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. The 2016 signal separation evaluation campaign. In Proc. LVA/ICA, pages 323–332, 2017.

[Mimilakis et al., 2018] Stylianos Ioannis Mimilakis, Konstantinos Drossos, Joao F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio. Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask. In IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 721–725, 2018.

[Nugraha et al.
, 2016] Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent. Multichannel music separation with deep neural networks. In Proc. European Signal Processing Conf., pages 1748–1752, 2016.

[Paulus and Virtanen, 2005] Jouni Paulus and Tuomas Virtanen. Drum transcription with non-negative spectrogram factorisation. In Proc. European Signal Processing Conf., pages 1–4, 2005.

[Rafii et al., 2018] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, and Bryan Pardo. An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 26(8):1307–1335, August 2018.

[Roma et al., 2018] Gerard Roma, Owen Green, and Pierre Alexandre Tremblay. Improving single-network single-channel separation of musical audio with convolutional layers. In Proc. LVA/ICA, pages 306–315, 2018.

[Salimans and Kingma, 2016] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Proc. Advances in Neural Information Processing Systems, pages 901–909, 2016.

[Sercu and Goel, 2016] Tom Sercu and Vaibhava Goel. Dense prediction on sequences with time-dilated convolutions for speech recognition. arXiv preprint, November 2016.

[Stoller et al., 2018] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proc. Int. Society for Music Information Retrieval Conf., pages 334–340, 2018.

[Stöter et al., 2018] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In Proc. LVA/ICA, pages 293–305, 2018.

[Takahashi and Mitsufuji, 2017] Naoya Takahashi and Yuki Mitsufuji. Multi-scale multi-band DenseNets for audio source separation. In Proc.
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 21–25, 2017.

[Takahashi et al., 2018] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110, 2018.

[Uhlich et al., 2017] Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 261–265, 2017.

[van den Oord et al., 2016] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint, September 2016.

[Vande Veire and De Bie, 2018] Len Vande Veire and Tijl De Bie. From raw audio to a seamless mix: creating an automated DJ system for drum and bass. EURASIP J. Audio, Speech, and Music Processing, 2018(1):13, 2018.

[Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proc. Int. Conf. Machine Learning, pages 3540–3549, 2017.

[Yu and Koltun, 2016] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. Proc. International Conference on Learning Representations (ICLR), 2016.

[Zhang et al., 2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
