Semi-Supervised Monaural Singing Voice Separation With a Masking Network Trained on Synthetic Mixtures

Michael Michelashvili¹, Sagie Benaim¹, Lior Wolf¹·²
¹Tel Aviv University, ²Facebook AI Research

ABSTRACT

We study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music. Our solution employs a single mapping function g, which, applied to a mixed sample, recovers the underlying instrumental music, and, applied to an instrumental sample, returns the same sample. The network g is trained using purely instrumental samples, as well as on synthetic mixed samples that are created by mixing reconstructed singing voices with random instrumental samples. Our results indicate that we are on a par with or better than fully supervised methods, which are also provided with training samples of unmixed singing voices, and are better than other recent semi-supervised methods.

Index Terms — Singing voice separation, Adversarial training, Semi-supervised learning

1. INTRODUCTION

The problem of separating a given mixed signal into its components without direct supervision is ubiquitous. For example, in single-cell gene expression conducted in cancer research, one obtains a gene expression that contains both the cancer cell of interest and the expression of immune cells that attach to it. In what is known in biology as gene expression deconvolution [1], one would like to obtain the expression of the cancer cell itself, while only having access to a dataset of such mixed readings and another dataset containing gene expression profiles of immune cells.

In the task of singing voice separation, which is the focus of this work, examples of mixed music, which contain both singing and instrumental music, are abundant. It is also relatively easy to label parts of a song where no singing is present.
However, it is much harder to separate out pure voice samples. Without such samples, one cannot use the supervised methods that were suggested for this separation task.

In this work, we propose a novel method for performing the separation. The method is based on applying a learned function twice: once on the mixtures, in order to recover estimated singing voice samples, and once on synthetic mixes, in which the reconstructed singing samples are crossed with real instrumental samples from the training set. The advantage of these crosses over the original mixed samples is that the underlying components of these mixed samples are known and, therefore, added losses can be applied when training the separating function on them.

2. RELATED WORK

Single-channel source separation is a long-standing task that has been researched extensively. Classical works on blind source separation include single-channel ICA [2] and, specifically in singing voice separation, RPCA [3]. These methods utilize hand-crafted priors on the sources, such as a low-rank assumption on the instrumental music.

The problem of singing voice separation is often studied in the supervised case, where the mixed samples are provided with the target source. Often, a simple masking model in the spectral domain is assumed, and the desired source b is given by a point-wise multiplication of the mixed signal a and some mask m, i.e., b = a ⊙ m, where ⊙ is the Hadamard product. In our work, we use a network g such that b ≈ a − g(a), where the architecture of g includes the masking, i.e., g(a) = a ⊙ m(a), for some subnetwork m with outputs in [0, 1].

The GRA3 method [4], similarly to ours, estimates the mask m directly from the mixed sample a. This is done using an ensemble of four deep neural networks, trained with different losses. The architecture we use is of the type commonly used for the autoencoding of images.
Similar architectures are used in other work to directly estimate all the sources from the mixtures, e.g., [5, 6]. The GRU-RIS-L method of Mimilakis et al. [7] employs RNNs of stochastic depth in order to recover the time-frequency mask. The usage of RNNs allows for efficient modeling of longer time dependencies of the input data. This is extended in [8] (MaDTwinNet) by introducing a technique called TwinNet, which regularizes the RNNs. Our analysis of long sequences is segment by segment and does not exploit long-range dependencies.

Adversarial training using GANs [9] is a powerful method for unconditional image generation. GANs are composed of two parts: (i) a generator g that synthesizes realistic images, and (ii) a discriminator d that distinguishes real from fake images. The objective of the generator is to create images that are realistic enough to fool the discriminator. The objective of the discriminator is to detect the fake images. The method was later extended to perform unsupervised image-to-image mapping [10]. In this setting, the generator is conditioned on an input image from the source domain and generates a "fake" sample in the target domain. As in the unconditional setting, the discriminator attempts to differentiate between real and generated images. Adversarial training was used for supervised source separation, where the distribution of each of the mixture components is known and modeled by a GAN, by Stoller et al. [11] and Subakan et al. [12]. The adversarial training was motivated as being better able to deal with correlated sources. A semi-supervised approach using adversarial training was used by Higuchi et al. [13] for the task of speech enhancement.

In the setting of semi-supervised audio source separation, in which we work, the task is to separate mixtures of two sources given mixed samples as well as samples from only one of the sources. Previous solutions were typically based on NMF [14] or the related PLCA [15].
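The masking relation used throughout (g(a) = a ⊙ m(a), voice estimate b ≈ a − g(a)) can be illustrated with a minimal numpy sketch. Here `toy_mask` is an illustrative stand-in for the mask subnetwork m (a random linear map plus a sigmoid), not the paper's actual architecture:

```python
import numpy as np

def toy_mask(a, W):
    """Stand-in for the mask subnetwork m: a linear map followed by a
    sigmoid, so every mask value lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W @ a)))

def separate(a, W):
    """g(a) = a * m(a) estimates the instrumental part; the singing
    voice estimate is the residual a - g(a)."""
    m = toy_mask(a, W)
    g_a = a * m          # Hadamard product of mixture and mask
    b_hat = a - g_a      # estimated singing voice
    return g_a, b_hat

rng = np.random.default_rng(0)
a = np.abs(rng.normal(size=8))   # a toy magnitude-spectrogram column
W = rng.normal(size=(8, 8))      # placeholder mask-network weights
g_a, b_hat = separate(a, W)
```

Because the mask is confined to (0, 1) and the input magnitudes are non-negative, the instrumental estimate can never exceed the mixture, and the two estimates sum back to the mixture exactly.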
The most similar method to ours is NES [16], which separates mixed samples into a sum of two samples: one from an observed domain and one from an unobserved domain. The method consists of an iterative process: (i) estimation of samples from the unobserved distribution; (ii) synthesis of mixed signals by combining training samples from the observed domain and the estimated samples from the unobserved one; (iii) training of a mapping from the mixed domain to the observed domain. It was demonstrated in [16] that, due to its iterative nature, NES is sensitive to the initialization method. Our method, in contrast, performs a non-iterative end-to-end training that includes the synthetic mixtures as part of the network. This also allows us to apply additional losses, such as GAN-based losses and the constraint that the learned function g is idempotent (g ∘ g = g) [17]. As can be seen in Sec. 5, our results are significantly stronger than those obtained by [16].

3. METHOD

In the problem of semi-supervised separation, the learning algorithm is provided with unlabeled datasets from two domains: a domain of mixtures A and a domain of observed components C. There also exists a target domain B, from which no samples are presented. The goal is to learn a function g : A → C, which maps a sample a ∈ A to a component c in domain C, such that there exists a component b ∈ B for which the equality a = b + c holds.

During training, we obtain two sets of unmatched samples: the set S_A of mixed samples in domain A, and the set S_C of samples in the observed domain C. Due to the lack of training samples in B, we rely on the generation of a synthetic training set of samples in domain B:

    S̄_B = { a − g(a) | a ∈ S_A }.

The network mixes the samples in S̄_B with random samples in C, in order to create the following set of synthetic crosses:

    S̄_{B×C} = { b̄ + c | b̄ ∈ S̄_B, c ∈ S_C }.

For each sample ā ∈ S̄_{B×C}, we memorize the underlying samples b̄, c that were used to create it, and mark these samples as b(ā) and c(ā), respectively.

[Fig. 1. The transformations and constraints of our method. Blue arrows stand for functions. Dashed lines represent losses, which are of two types: reconstruction losses (black) and GAN loss terms (red).]

In addition to g, we train two discriminator networks d_C and d_A, which provide adversarial signals that enforce the distribution of the recovered samples from domain C to match the distribution of the training set S_C, and the mixed synthetic samples to match the distribution of S_A. Specifically, d_C is applied to samples of the form g(a), where a ∈ S_A; d_A is applied to samples of the form ā ∈ S̄_{B×C}. The following losses are used to train the network g:

    L_R1 = Σ_{c ∈ S_C} ||g(c) − c||_1                     (1)
    L_R2 = Σ_{a ∈ S_A} ||g(g(a)) − g(a)||_1               (2)
    L_R3 = Σ_{ā ∈ S̄_{B×C}} ||g(ā) − c(ā)||_1             (3)
    L_R4 = Σ_{ā ∈ S̄_{B×C}} ||(ā − g(ā)) − b(ā)||_1       (4)
    L_GAN_C = Σ_{a ∈ S_A} −ℓ(d_C(g(a)), 0)                (5)
    L_GAN_A = Σ_{ā ∈ S̄_{B×C}} −ℓ(d_A(ā), 0),             (6)

where ℓ is the least-squares loss, following [18]; that is, ℓ(x, y) = (x − y)². Note that g appears in L_GAN_A, and appears more than once in L_R3 and L_R4, since it takes part in the formation of the set S̄_{B×C}.

The first loss requires that g, applied to samples in C, is the identity operator. The second loss enforces idempotence on g (since g maps to domain C, applying it again should be the same as applying the identity), and the next two losses enforce that the separation of the synthetic cross samples results in the known components. The last two losses are GAN-based losses in the domains C and A. The full objective for g is defined as:

    L_g = L_R1 + L_R2 + L_R3 + L_R4 + 0.5(L_GAN_C + L_GAN_A)

The discriminators of the GAN losses, d_C and d_A, are trained with the following losses, respectively:

    L_dC = Σ_{a ∈ S_A} ℓ(d_C(g(a)), 0) + Σ_{c ∈ S_C} ℓ(d_C(c), 1)       (7)
    L_dA = Σ_{ā ∈ S̄_{B×C}} ℓ(d_A(ā), 0) + Σ_{a ∈ S_A} ℓ(d_A(a), 1)     (8)

4. IMPLEMENTATION DETAILS

An Adam optimizer is used with β1 = 0.5, β2 = 0.999 and a batch size of one. The learning rate is initially set to 0.0001 and is halved after 100,000 iterations.

4.1. Network architecture

The underlying network architecture adapts that used in [19]. Let C7S1-k denote a 7×7 1-stride convolution with k filters. Similarly, let C4S2-k denote a 4×4 2-stride convolution with k filters. Let Rk denote a residual block with two 3×3 convolutional blocks and k filters, and let uk denote a 2× nearest-neighbor upsampling layer, followed by a 5×5 convolutional block with k filters and stride 1.

Recall that g(a) = a ⊙ m(a). m is built as an autoencoder. The encoder consists of two downsampling convolutional layers, C7S1-64 and C4S2-128. This is followed by four residual blocks of type R256. Each convolutional layer of the encoder is followed by an Instance Normalization layer and a ReLU activation. The decoder consists of four residual blocks of type R256. This is followed by two upsampling blocks, u128 and u256, and a convolutional layer C7S1-3. Each convolutional layer of the decoder is followed by an Adaptive Instance Normalization [20] layer and a ReLU activation. To obtain mask values between 0 and 1, the ReLU of the last layer is replaced by a sigmoid activation function.

A multi-scale discriminator is used for d_C and d_A, as in [21], to produce accurate low-level details, as well as capture global structure. Each discriminator consists of the following sequence of layers: C4S2-64, C4S2-128, C4S2-256 and C4S2-512.
Each convolutional layer is followed by a leaky ReLU with a slope parameter of 0.2.

4.2. Audio processing

To convert an audio file to an input to network g, we perform the following pre-processing: The audio file is re-sampled to 20480 Hz. It is then split into clips of duration 0.825 seconds. We then compute the Short-Time Fourier Transform (STFT) with a window size of 40 ms, a hop size of 64 and an FFT size of 512, resulting in an STFT of size 257 × 256. Lastly, we take the absolute values and apply a power-law compression with p = 0.3, i.e., we obtain |A|^0.3, where |A| is the magnitude of the STFT. The highest frequency bin is trimmed, resulting in an input audio representation of size 256 × 256.

To convert the method's output b̄ = a − g(a) back to audio, we apply the ISTFT to the multiplication of the magnitude spectrogram of b̄ with the phase of the original mixture, and add back the top frequency by padding with zeros. To process an entire audio file, we simply process each non-overlapping segment individually and concatenate the results.

5. EVALUATION

We perform a comparison to other semi-supervised methods, using the evaluation protocol of [16]. In addition, we compare our semi-supervised method to the state-of-the-art supervised methods, following the protocol used in [8]. Finally, ablation experiments are run to study the relative importance of the various losses.

5.1. Comparison to semi-supervised methods

For semi-supervised methods, our evaluation protocol follows closely the one of [16]. We evaluated our method against the five methods reported there: (1) Semi-supervised Non-negative Matrix Factorization (NMF) [15]: The method learns a set of l = 3 bases from the samples in S_C by Sparse NMF [22, 23] as S_C = H_c ∗ W_c, with mixture components H_c and basis vectors W_c, where the two matrices are non-negative, using the fast Non-negative Least Squares solver of [24].
Then, the mixture S_A is decomposed with 2l bases, where the first l bases are simply W_c: S_A = H_ac ∗ W_c + H_ab ∗ W_b. The estimated components from domain B are then given by: S̄_B = H_ab ∗ W_b. (2) GAN: A masking function m is learned so that, after masking, the training mixtures are indistinguishable from the source samples by a discriminator d, similar to our L_GAN_C loss. (3) GLO Masking (GLOM): This method learns an explicit generative GLO [25] model for both domain A and domain C, fits the parameters to each given sample a, and then approximates the solution by a mask between 0 and 1 that is multiplied by the mixed signal a. (4) Neural Egg Separation (NES): The iterative method of [16], which is initialized by taking the mixture components to be each half of the mixed signal a. (5) Fine-tuned NES (NES-FT): Initializing NES with the GLOM solution above.

The semi-supervised experiments are performed on the MUSDB18 [26] dataset, which consists of 150 music tracks, 100 of which are in the train set and 50 in the test set. Each music track is comprised of separate signal streams of the mixture, drums, bass, the accompaniment, and the vocals. In our method, samples are preprocessed as described in Sec. 4.2 and then trained using the method of Sec. 3.

We compare the performance of our method, using the signal-to-distortion ratio (SDR), in Tab. 1. We can observe that NMF, GAN, GLOM and NES perform much worse than NES-FT and our method. There is also a significant gap between NES-FT and our method (2.1 dB vs. 3.2 dB).

5.2. Comparison to fully supervised methods

We next compare with fully supervised methods that solely deal with singing voice separation. For this comparison, our evaluation protocol follows closely the one of [8], except that our method does not employ the training samples of the singing voices and is unaware of the matching pairs (a, c).
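The training signals of Sec. 3, used above, can be sketched numerically. In this hedged illustration, `g` is an arbitrary placeholder separator (a fixed soft mask, not the paper's network), and only the reconstruction losses L_R1–L_R4 are shown, since the GAN terms require trained discriminators:

```python
import numpy as np

def l1(x, y):
    """L1 distance, the norm used by the reconstruction losses."""
    return np.abs(x - y).sum()

def g(a):
    """Placeholder separator: a fixed 0.5 soft mask (illustrative only)."""
    return a * 0.5

# Toy samples: a mixture a from S_A and an instrumental sample c from S_C.
rng = np.random.default_rng(1)
a = np.abs(rng.normal(size=16))
c = np.abs(rng.normal(size=16))

# Synthetic cross: estimate a voice b_bar = a - g(a), then mix it with c.
b_bar = a - g(a)
a_bar = b_bar + c      # a member of S_bar_{B x C}, with known components

L_R1 = l1(g(c), c)                  # g should act as the identity on C
L_R2 = l1(g(g(a)), g(a))            # idempotence: g o g = g
L_R3 = l1(g(a_bar), c)              # recover the known instrumental part
L_R4 = l1(a_bar - g(a_bar), b_bar)  # recover the known voice part
```

With a real network, all four terms are driven toward zero jointly; here they merely show how the memorized components of a synthetic cross supply direct supervision.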
Baseline results are shown for GRA3 [4], GRU-RIS-L [7] and MaDTwinNet [8], which are discussed in Sec. 2, and for the following methods: CHA [6], which uses a CNN to estimate time-frequency soft masks; STO2 [27], which is based on a signal representation that divides the complex spectrogram into a grid of patches of arbitrary sizes; and JEO2 [3], which is based on robust principal component analysis (RPCA). The results for all of the above approaches are obtained from [8].

The development subset of the Demixing Secret Dataset (DSD100) [28] and the non-bleeding/non-instrumental stems of MedleyDB [29] are used for training. Baseline approaches here are trained in a supervised fashion, while our method is trained in a semi-supervised manner. For evaluation, the evaluation subset of DSD100, which consists of 50 samples, is used. For these methods, the literature reports both the signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR), and we report both, using the mir_eval Python library.

The comparison is shown in Tab. 2. As can be seen, SDR values for our method are better than those of GRA3 and CHA, but worse than STO2, JEO2, GRU-RIS-L and MaDTwinNet. Our SIR value is significantly higher than all baselines, achieving a gap of 7.0 over the second-best method. This is consistent with our observation: the network seems to filter out all the instrumental music very well for most samples. However, for some samples, there is a slight distortion of the generated voice. Samples, in comparison to those published by [7], are available at https://sagiebenaim.github.io/Singing/.

5.3. Ablation study

We perform an ablation analysis to understand the relative contribution of the different losses in our method. This is done by removing various losses from the training objective and retraining.

Table 1. Median SDR (dB) for our method and previous semi-supervised approaches evaluated on the MUSDB18 [26] dataset. Baselines are from [16], which did not report SIR.

    Approach   SDR   SIR
    NMF        0.0   -
    GAN        0.3   -
    GLOM       0.6   -
    NES        0.3   -
    NES-FT     2.1   -
    Ours       3.2   14.2

Table 2. Median SDR and SIR (dB) values for the proposed method and previous supervised approaches, which solely deal with singing voice separation, evaluated on the evaluation subset of the DSD100 [28] dataset.

    Approach         Supervision       SDR    SIR
    GRA3 [4]         supervised        -1.7   1.3
    CHA [6]          supervised        1.6    5.2
    STO2 [27]        supervised        3.9    6.7
    JEO2 [3]         supervised        4.1    6.1
    GRU-RIS-L [7]    supervised        4.2    7.9
    MaDTwinNet [8]   supervised        4.6    8.2
    Ours             semi-supervised   3.5    15.2

Table 3. Ablation study: Median SDR and SIR values for the proposed method without (w/o) selected losses, evaluated on the evaluation subset of DSD100 [28].

    Losses                  SDR     SIR
    All losses              3.5     15.2
    w/o L_R1                -0.9    3.4
    w/o L_R2                2.3     9.7
    w/o L_R3                -4.3    13.3
    w/o L_R4                -6.3    -4.7
    w/o L_GAN_A             -6.3    -4.2
    w/o L_GAN_C             -4.1    -2.4
    w/o L_GAN_A & L_GAN_C   -17.0   -3.6

As can be seen in Tab. 3, L_R2 has a smaller significance than the other losses. The most significant losses are L_R4 and the GAN losses L_GAN_C and L_GAN_A; without even one of these, the two metrics drop considerably. L_R3 is also very significant, and without it the SDR is greatly diminished.

6. CONCLUSIONS

We present a new method for semi-supervised singing voice separation that is competitive with some of the state-of-the-art supervised methods and with all of the semi-supervised methods in the literature. The crux of the method is the use of compound losses, applied to synthetic mixes, and the application of two GANs. This setup could be extended to multiple sources, due to the superposition principle of audio signals that is satisfied by the compound losses, and will be inspected in future work. In addition, using time-domain architectures can be explored.
The method is applied sequentially to fixed-length audio clips; as future work, we would like to employ overlapping segments and even incorporate longer-term dependencies.

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).

7. REFERENCES

[1] Yingdong Zhao and Richard Simon, "Gene expression deconvolution in clinical samples," Genome Medicine, vol. 2, no. 12, pp. 93, 2010.
[2] Mike Davies and Christopher James, "Source separation using single channel ICA," Signal Processing, 2007.
[3] Il-Young Jeong and Kyogu Lee, "Singing voice separation using RPCA with weighted l1-norm," in Int. Conf. on Latent Variable Analysis and Signal Separation, 2017.
[4] Emad M. Grais, Gerard Roma, Andrew J. R. Simpson, and Mark D. Plumbley, "Single-channel audio source separation using deep neural network ensembles," in Audio Engineering Society Convention 140, 2016.
[5] S. Uhlich, F. Giron, and Y. Mitsufuji, "Deep neural network based instrument extraction from music," in ICASSP, 2015.
[6] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez, "Monoaural audio source separation using deep convolutional neural networks," in Int. Conf. on Latent Variable Analysis and Signal Separation, 2017.
[7] Stylianos I. Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, "Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask," in ICASSP, 2018.
[8] Konstantinos Drossos, Stylianos Ioannis Mimilakis, et al., "MaD TwinNet: Masker-denoiser architecture with twin networks for monaural sound source separation," in Int. Joint Conf. on Neural Networks, 2018.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al., "Generative adversarial nets," in NIPS, 2014.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[11] Daniel Stoller, Sebastian Ewert, and Simon Dixon, "Adversarial semi-supervised audio source separation applied to singing voice extraction," arXiv preprint arXiv:1711.00048, 2017.
[12] Cem Subakan and Paris Smaragdis, "Generative adversarial source separation," arXiv preprint arXiv:1710.10779, 2017.
[13] T. Higuchi et al., "Adversarial training for data-driven speech enhancement without parallel corpus," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2017, pp. 40–47.
[14] Tom Barker and Tuomas Virtanen, "Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation," in Int. Joint Conf. on Neural Networks, 2014.
[15] Paris Smaragdis, Bhiksha Raj, et al., "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Int. Conf. on Independent Component Analysis and Signal Separation, 2007.
[16] Yedid Hoshen, Tavi Halperin, and Ariel Ephrat, "Neural separation of observed and unobserved distributions," in Submitted to Int. Conf. on Learning Representations, 2019.
[17] Tomer Galanti and Lior Wolf, "A theory of output-side unsupervised domain adaptation," arXiv preprint arXiv:1703.01606, 2017.
[18] Xudong Mao, Qing Li, Haoran Xie, et al., "Least squares generative adversarial networks," in Int. Conf. on Computer Vision (ICCV), 2017.
[19] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, "Multimodal unsupervised image-to-image translation," arXiv preprint, 2018.
[20] Xun Huang and Serge J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Int. Conf. on Computer Vision (ICCV), 2017.
[21] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, et al., "High-resolution image synthesis and semantic manipulation with conditional GANs," in CVPR, 2018.
[22] Patrik O. Hoyer, "Non-negative matrix factorization with sparseness constraints," JMLR, 2004.
[23] Hyunsoo Kim and Haesun Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares," Bioinformatics, 2007.
[24] Jingu Kim and Haesun Park, "Fast nonnegative matrix factorization: An active-set-like method and comparisons," SIAM J. Scientific Computing, 2011.
[25] Piotr Bojanowski, Armand Joulin, et al., "Optimizing the latent space of generative networks," in ICML, 2018.
[26] Zafar Rafii et al., "The MUSDB18 corpus for music separation," Dec. 2017.
[27] Fabian-Robert Stöter et al., "Common fate model for unison source separation," in ICASSP, 2016.
[28] Antoine Liutkus, Fabian-Robert Stöter, et al., "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation, 2015.
[29] Rachel Bittner, Justin Salamon, Mike Tierney, et al., "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in ISMIR, 2014.
