Referenceless Performance Evaluation of Audio Source Separation using Deep Neural Networks
Emad M. Grais 1,3, Hagen Wierstorf 1*, Dominic Ward 1, Russell Mason 2, Mark D. Plumbley 1

1 Centre for Vision, Speech and Signal Processing, University of Surrey, UK
2 Institute of Sound Recording, University of Surrey, UK
3 Electronics, Communications, and Computer Engineering Department, Helwan University, Egypt

* The first two authors started this work when they were researchers at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK. They contributed equally to this work. Hagen Wierstorf is now with audEERING GmbH, Gilching, Germany.

ABSTRACT

Current performance evaluation for audio source separation depends on comparing the processed or separated signals with reference signals. Therefore, common performance evaluation toolkits are not applicable to real-world situations where the ground truth audio is unavailable. In this paper, we propose a performance evaluation technique that does not require reference signals in order to assess separation quality. The proposed technique uses a deep neural network (DNN) to map the processed audio into its quality score. Our experimental results show that the DNN is capable of predicting the sources-to-artifacts ratio from the blind source separation evaluation toolkit [1] without the need for reference signals.

Index Terms — Performance evaluation, deep learning, audio source separation, BSS-Eval sources-to-artifacts ratio.

1. INTRODUCTION

Audio source separation aims to separate one or more target audio sources from mixture signals [2, 3]. The separated sources often contain distortions, artifacts, and unwanted signals from the other sources in the mixtures. An evaluation of the quality of the separated sources is essential to guide the development of separation algorithms or to select the most suitable algorithm for a given mixture signal or application type. This requires either perceptual evaluation, where experienced listeners judge the quality of the estimated sources according to different perceptual attributes [4, 5, 6, 7, 8, 9, 10], or objective metrics that estimate the proportion of distortions, artifacts, or interference present in the separated sources by comparing these with the reference clean sources [1, 11].

In experimental situations, the reference sources are usually available for use in evaluating the performance of a given source separation approach. However, for practical applications of source separation, the mixtures are available but the separate original sources (the reference signals) are not. Without these reference sources, the most common objective metrics cannot be employed, and the only way to evaluate the quality of the separated sources is to ask listeners to give quality scores. Using listeners to evaluate the quality of the separated sources is time consuming, and often unfeasible; hence an automated system for evaluating the quality of the separated signal using neither listeners nor reference signals would be preferable.
Such an automated referenceless evaluation method could be useful, for example, for selecting the most appropriate source separation algorithm for soloing or karaoke applications for each song, or for automatically evaluating whether the separated signals are of sufficient quality or whether extra work is needed to further improve them using post-processing or additional separation techniques, e.g. [12, 13].

The concept of referenceless quality evaluation of processed signals has been introduced in many signal processing domains, including the perceptual evaluation of image enhancement approaches [14] and the evaluation of the quality and intelligibility of speech signals [15, 16]. In this paper we propose a referenceless method to evaluate the quality of separated audio sources without using the reference sources. The main idea of the proposed method is to train a deep neural network (DNN) to map the estimated separated sources to the output of a reference-based evaluation metric. The metric used in this paper is the Sources-to-Artifacts Ratio (SAR) from the Blind Source Separation Evaluation (BSS-Eval) toolkit [1]. SAR is selected as a case study, but it is intended that the proposed method will also be used for other objective metrics, or for the results of subjective judgments. The DNN is first trained to map the separated signals from one or more source separation algorithms to their SAR scores; in this training stage, the SAR is calculated using the reference signals of each source. The trained DNN is then used to estimate the SAR for separated sources without using any reference signals.

We consider three different scenarios of using DNNs to estimate the SAR values. The first scenario is to evaluate how well a DNN can predict the SAR results for the same single source separation algorithm for which it is trained: we refer to this scenario as a within-algorithm test. The second scenario is to evaluate how well a DNN can predict the SAR results for a range of separation algorithms when trained using data from that same set of separation algorithms: we refer to this scenario as an across-known-algorithms test. The third scenario is to evaluate how well a DNN can predict the SAR results for a range of separation algorithms when trained using data from a different set of separation algorithms: we refer to this scenario as an across-unknown-algorithm test.

2. THE BLIND SOURCE SEPARATION EVALUATION TOOLKIT

The Blind Source Separation Evaluation (BSS-Eval) toolkit [1] is the most frequently used tool for evaluating source separation algorithms. BSS-Eval decomposes the error between the reference/target source and the extracted/separated source into a target distortion component reflecting spatial or filtering errors, an artifacts component pertaining to artificial noise, and an interference component associated with the unwanted sources. The salience of these components is quantified using three energy ratios: source Image-to-Spatial distortion Ratio (ISR), Sources-to-Artifacts Ratio (SAR), and Source-to-Interference Ratio (SIR). A fourth metric, the Source-to-Distortion Ratio (SDR), measures the global performance (all impairments combined).
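The original BSS-Eval toolkit is a MATLAB implementation; purely as an illustration, the sketch below shows how the SDR, SIR, and SAR ratios can be obtained from the Python reimplementation in the mir_eval package, assuming the reference sources are available. The function and array names are illustrative and not part of the paper.

```python
import numpy as np
import mir_eval  # Python reimplementation of the BSS-Eval metrics


def bss_eval_ratios(reference_sources, estimated_sources):
    """Return (SDR, SIR, SAR) in dB for each estimated source.

    Both inputs are arrays of shape (n_sources, n_samples). Reference
    signals are required, which is exactly the limitation addressed
    by the referenceless method proposed in this paper.
    """
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return sdr, sir, sar
```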
Computing these metrics depends on comparing the reference signals with their corresponding estimated signals from the source separation system for each source. Without the reference sources, the BSS-Eval toolkit cannot provide information regarding the quality of the estimated sources.

3. DEEP NEURAL NETWORK FOR REFERENCELESS SAR PREDICTION

In this paper we use a deep neural network to predict the BSS-Eval SAR scores from the output signals of a source separation system. The DNNs we use are fully connected feedforward neural networks, as shown in Fig. 1. SAR was selected as a case study: it has been shown to be an indicator of the magnitude of perceptual artifacts in the separated signals [5, 6]. The DNN is trained to map features extracted from the separated sources to their corresponding SAR values. In this training stage of the DNN, we assume the reference signals are available. Given the reference (clean) signals and their corresponding estimated signals from the source separation technique, the SAR is calculated using BSS-Eval [1]. We extract features from the separated sources and use these features as input to the DNN. The features we use in this work are the mel-frequency spectrogram (MFS), calculated by converting the spectrograms of the estimated signals to a mel-frequency scale with 128 frequency channels.

Fig. 1. The deep neural network structure used in this work. The input is the estimated separated signal and the output is its corresponding quality score.

The DNN parameters are trained by minimizing the mean squared error between the SAR values estimated by the DNN and the corresponding SAR values calculated using BSS-Eval. The trained DNN is then used to estimate the SAR values for a new set of separated sources without using the reference signals: the MFS features are extracted from the separated sources and fed to the trained DNN, which estimates the SAR values of the input features.

4. EXPERIMENTS

We undertook a pilot study to predict the sources-to-artifacts ratio (SAR) as provided by BSS-Eval. The audio data and the source separation algorithms were taken from the SiSEC-2016-MUS-task challenge [17]. The data consists of 100 stereo songs, though four of them are corrupted and so were removed. Each song is a mixture of vocals, bass, drums, and other musical instruments. The SiSEC-2016-MUS-task involved separating these four sources from each song in the dataset. In total, 24 different source separation algorithms with differing performance were submitted to this challenge. The following submitted algorithms are blind source separation algorithms: DUR [18], KAM [19], OZE [20], RAF [21], JEO [22], and HUA [23]; the following are supervised source separation algorithms using deep neural networks: STO [24], UHL [3], NUG [25], CHA [26], GRA [27], and KON [28]. The separated signals obtained using the Ideal Binary Mask (IBM) [17] are also included in this data. More details about each algorithm can be found on the SiSEC-2016 website [28]. These source separation algorithms produced separated signals with a wide range of SAR values (from -10 dB to 20 dB).
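To make the MFS features of Sec. 3 concrete, the following is a minimal sketch of extracting a 128-channel mel-frequency spectrogram from an estimated source, assuming the librosa package; the paper does not name an implementation, and the FFT size and hop length below are illustrative assumptions rather than values stated by the authors.

```python
import librosa  # assumption: the paper does not specify a feature-extraction library


def mel_frequency_spectrogram(wav_path, n_mels=128, n_fft=2048, hop_length=512):
    """Return the mel-frequency spectrogram (MFS) of a separated source.

    The stereo file is averaged to mono, as described in Sec. 4; the FFT
    size and hop length are illustrative, not values stated in the paper.
    """
    y, sr = librosa.load(wav_path, sr=None, mono=True)   # mono = mean of the two channels
    mfs = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return mfs.T                                         # shape: (n_frames, 128)
```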
In our experiments we aimed to predict the SAR for the vocal source separated from each song by all of the source separation algorithms submitted to this challenge. We tested three different scenarios of varying difficulty:

• Test 1: The DNN model was used to predict the SAR for the source separation algorithm for which it had been trained. We call this test a within-algorithm test. This was conducted separately for each separation algorithm to examine any algorithm-dependence in the results.

• Test 2: The DNN model was trained using data from all 24 source separation algorithms simultaneously, then used to predict SAR values for each of the 24 source separation algorithms. We call this test an across-known-algorithms test.

• Test 3: The DNN model was trained using data from 17 source separation algorithms simultaneously, then used to predict SAR values for 7 source separation algorithms not used in the training. We call this test an across-unknown-algorithm test.

The 96 available (non-corrupted) songs from the SiSEC-2016 dataset were split into 67 training songs and 29 test songs, all processed by the algorithms used in the tests. As the perceptual quality of musical signals varies over time, the SAR was calculated every 117 milliseconds (ms) over a time window of 464 ms on a 116 second (s) excerpt of every song. The goal of the trained DNNs was to predict the time-varying SAR for every song and source separation algorithm in the test data set.

The DNNs were deep fully connected feedforward networks, as shown in Fig. 1, consisting of three hidden layers of 500 nodes each with rectified linear unit (ReLU) activation functions, followed by a linear output layer. The input features were calculated as follows: the stereo inputs were converted to mono by taking the average of the two channels; the spectrogram was calculated and converted to a mel-frequency spectrogram (MFS) with 128 frequency channels. We stacked 40 neighbouring MFS frames to form the inputs of the DNN, with dimension 40 x 128 = 5120 MFS values, where 40 is the number of stacked frames and each frame contains 128 frequency bands.
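A minimal sketch of the network just described is given below, assuming TensorFlow/Keras; the paper does not state the framework, optimiser, or training schedule, so those choices are illustrative only.

```python
import numpy as np
import tensorflow as tf  # assumption: framework and optimiser are not specified in the paper


def stack_frames(mfs, n_frames=40):
    """Stack 40 neighbouring MFS frames into 40 x 128 = 5120-dimensional input vectors."""
    return np.array([mfs[i:i + n_frames].ravel()
                     for i in range(mfs.shape[0] - n_frames + 1)])


def build_sar_predictor(input_dim=40 * 128):
    """Fully connected feedforward DNN of Fig. 1: three ReLU hidden layers of 500 nodes, linear output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(500, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(1),  # linear output: predicted SAR in dB
    ])
    # Training minimises the mean squared error against the reference-based BSS-Eval SAR targets.
    model.compile(optimizer="adam", loss="mse")
    return model
```

Each 5120-dimensional input vector then maps to a single predicted SAR value for the corresponding time window.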
To evaluate how well the DNNs could predict the SAR values without using the reference signals, we compared the SAR estimated by the DNNs with the SAR values calculated by the BSS-Eval toolkit using the reference signals; the mean absolute error and the correlation between these were used to assess the accuracy of the DNN predictions.

Algorithm    Test 1          Test 2          Test 3
             Error   Corr.   Error   Corr.   Error   Corr.
CHA          1.2     0.82    1.5     0.83    0.7     0.89
GRA2         1.4     0.87    1.5     0.86    1.3     0.92
GRA3         1.3     0.80    1.6     0.81    1.7     0.89
IBM          1.3     0.90    2.9     0.86    3.1     0.93
JEO1         0.8     0.89    1.3     0.76    0.9     0.89
KAM1         1.2     0.83    1.2     0.79    0.9     0.87
KAM2         0.9     0.81    1.0     0.75    0.6     0.85
KON          1.3     0.90    1.3     0.88    1.3     0.92
NUG1         1.4     0.89    1.1     0.88    0.5     0.95
NUG2         1.3     0.89    1.1     0.88    0.5     0.96
NUG3         1.4     0.89    1.2     0.89    0.8     0.95
OZE          1.0     0.72    1.1     0.73    0.9     0.80
RAF1         0.9     0.75    1.3     0.72    1.2     0.78
STO1         1.1     0.90    1.0     0.87    0.5     0.94
UHL3         1.5     0.86    1.8     0.85    1.5     0.93
NUG4         1.5     0.89    1.2     0.89    1.6     0.92
UHL2         1.5     0.84    1.7     0.85    1.5     0.90
---------------------------------------------------------
DUR          1.2     0.75    1.7     0.72    3.7     0.74
HUA          0.8     0.66    1.1     0.61    4.4     0.30
JEO2         0.8     0.95    1.1     0.93    1.6     0.93
RAF2         1.0     0.77    1.1     0.73    1.4     0.70
RAF3         1.0     0.82    1.4     0.78    2.0     0.79
STO2         1.1     0.90    1.0     0.88    1.1     0.88
UHL1         1.4     0.85    1.3     0.86    1.5     0.86

Table 1. The mean absolute error (dB) and the mean correlation between the referenceless SAR values estimated by the DNNs and the SAR calculated by BSS-Eval with reference signals (reference SAR), for each source separation algorithm. The horizontal line separates the algorithms used for training (above the line) from those used for testing (below the line) in Test 3.

5. RESULTS

Table 1 shows the mean absolute error and the mean correlation between the referenceless SAR values estimated by the DNNs and the SAR calculated by BSS-Eval with reference signals (reference SAR) for the three scenarios (Test 1 to Test 3).

5.1. Test 1: the within-algorithm test

Test 1 was intended to be a case where a DNN could be trained individually for a given separation algorithm, and hence should give the most favourable results, as the DNN is customised for a single case. For this, we independently trained 24 DNNs, one for each source separation algorithm. Each DNN was then used to estimate the SAR for the separation algorithm for which it was trained. The same set of training songs and the same set of test songs was used for each algorithm, with no overlap between the two sets of songs. The error in the predictions was calculated as the difference between the SAR predicted by each DNN and the reference SAR for the same separated signal. The mean absolute error between the predicted and reference SAR was 1.2 dB, and ranged from 0.8 dB to 1.5 dB across the separation algorithms. The correlation between the predicted and measured SAR ranged from 0.66 to 0.95 across the algorithms, with an average over the 24 algorithms of 0.84.

Compared to the range of SAR values of -10 dB to 20 dB, the mean absolute error of 1.2 dB represents 4% of the range. This suggests that the SAR values estimated without using a reference could be used to discriminate between the performance of some combinations of algorithm and song. However, it may not be possible to discriminate between the average results of some of the algorithms in the SiSEC-2016-MUS-task [17], and hence further refinement is required.

Fig. 2. The correlation between the estimated and reference SAR values for a song separated by source separation algorithm GRA2.

5.2. Test 2: the across-known-algorithms test

Test 2 was intended to be a case where a single DNN was trained using a set of separation algorithms and then used to predict the results of any separation algorithm included in its training set. This requires a more generalised set of predictions compared to Test 1, and hence was intended to be a more challenging test. The single DNN was trained using the same training set of songs employed in Test 1, though this time using the results from all 24 source separation algorithms. The trained DNN was then used to evaluate the separated vocal signals from the test-set songs individually for each of the same 24 source separation algorithms. The results are shown in Table 1: the mean absolute error between the predicted and reference SAR was 1.4 dB, and ranged from 1.0 dB to 2.9 dB across the separation algorithms.
The correlation between the predicted and measured SAR ranged from 0.61 to 0.93 across the algorithms, with an average over the 24 algorithms of 0.82.

As an example, Fig. 2 shows the correlation between the estimated and reference SAR values for a song separated by source separation algorithm GRA2. As can be seen from the figure, the estimated SAR values are highly correlated with the reference SAR.

Compared to the range of SAR values of -10 dB to 20 dB, the mean absolute error of 1.4 dB represents nearly 5% of the range. Though the performance is less accurate for this more challenging test, even the worst-case mean absolute error of 2.9 dB indicates that the referenceless SAR prediction could be used to discriminate between the performance of some combinations of algorithm and song, but again further refinement is required.

5.3. Test 3: the across-unknown-algorithm test

Test 3 was intended to be a case where a single DNN was trained using a set of separation algorithms and then used to predict the results of any separation algorithm, including those not included in its training set. This requires further generalisation of the results, to both songs and algorithms outside of the training set, and is the most challenging of the tests used. For this, the first 17 source separation algorithms in Table 1 were used for training and validation, and the last 7 algorithms (below the horizontal line in Table 1) were used for testing; the training and testing were again undertaken using separate sets of songs. In addition, the DNN was tested separately for each source separation algorithm using solely the songs from the test set, with the results shown in Table 1. The mean absolute error between the predicted and reference SAR was 2.3 dB, and ranged from 1.1 dB to 4.4 dB across the separation algorithms in the test set, and from 0.5 dB to 3.1 dB across the separation algorithms in the training set. The average correlation between the predicted and measured SAR time series was 0.74, with a range of 0.3 to 0.93 for the test set and 0.78 to 0.96 for the training set.

As expected, the performance was less accurate for this test, though the worst-case error would still allow discrimination between some combinations of algorithm and song.

6. CONCLUSIONS

In this paper we introduced a novel referenceless evaluation method to assess a range of audio source separation systems without the need for the original sources. We used a deep neural network to predict the sources-to-artifacts ratio (SAR) [1] of singing-voice recordings extracted from music mixtures of varying genres. Our experimental results show that the DNNs were capable of predicting the SAR without the reference signals, in most cases with an error low enough (mostly < 1.5 dB) to allow discrimination between the performance of some combinations of algorithm and song, and with a high correlation (mostly > 0.80) with the SAR computed by BSS-Eval using the reference signals.
This work indicates that using DNNs to predict the output of objective source separation evaluation toolkits without the use of reference signals produces useful results, and it can be extended by training the DNNs to predict the other BSS-Eval metrics or perceptually related quality scores.

Acknowledgment

This work is supported by grant EP/L027119/2 from the UK Engineering and Physical Sciences Research Council (EPSRC).

7. REFERENCES

[1] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, July 2006.
[2] T. Virtanen, "Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, pp. 1066-1074, Mar. 2007.
[3] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in Proc. ICASSP, 2017, pp. 261-265.
[4] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046-2057, Sept. 2011.
[5] D. Ward, H. Wierstorf, R. Mason, E. M. Grais, and M. D. Plumbley, "BSS Eval or Peass? Predicting the perception of singing-voice separation," in Proc. ICASSP, 2018, pp. 596-600.
[6] H. Wierstorf, D. Ward, R. Mason, E. M. Grais, C. Hummersone, and M. D. Plumbley, "Perceptual evaluation of source separation for remixing music," in Proc. AES 143, 2017.
[7] M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman, "Fast and easy crowdsourced perceptual audio evaluation," in Proc. ICASSP, 2016, pp. 619-623.
[8] P. Coleman, Q. Liu, J. Francombe, and P. Jackson, "Perceptual evaluation of blind source separation in object-based audio production," in Proc. LVA/ICA, 2018.
[9] E. Cano, D. FitzGerald, and K. Brandenburg, "Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics," in Proc. EUSIPCO, 2016.
[10] ITU-R BS.1534-3, "Method for the subjective assessment of intermediate quality level of audio systems," Tech. Rep., International Telecommunication Union, 2015.
[11] R. Huber and B. Kollmeier, "PEMO-Q: A new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1902-1911, 2006.
[12] E. M. Grais and H. Erdogan, "Spectro-temporal post-enhancement using MMSE estimation in NMF based single-channel source separation," in Proc. InterSpeech, 2013.
[13] D. S. Williamson, Y. Wang, and D. L. Wang, "A two-stage approach for improving the perceptual quality of separated speech," in Proc. ICASSP, 2014, pp. 7034-7038.
[14] H. Talebi and P. Milanfar, "Learned perceptual image enhancement," 2017.
[15] S. Fu, Y. Tsao, H. Hwang, and H. Wang, "Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM," in Proc. InterSpeech, 2018.
[16] C. Spille, S. D. Ewert, B. Kollmeier, and B. T. Meyer, "Predicting speech intelligibility with deep neural networks," Computer Speech and Language, vol. 48, pp. 51-66, 2018.
[17] A. Liutkus, F. Stoter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 Signal Separation Evaluation Campaign," in Proc. LVA/ICA, 2017, pp. 323-332.
[18] J.-L. Durrieu, B. David, and G. Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1118-1133, Oct. 2011.
[19] A. Liutkus, D. FitzGerald, Z. Rafii, and L. Daudet, "Scalable audio separation with light kernel additive modelling," in Proc. ICASSP, 2015, pp. 76-80.
[20] A. Ozerov, E. Vincent, and F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1118-1133, Oct. 2012.
[21] Z. Rafii and B. Pardo, "REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 71-82, Jan. 2013.
[22] I.-Y. Jeong and K. Lee, "Singing voice separation using RPCA with weighted l1-norm," in Proc. LVA/ICA, 2017, pp. 553-562.
[23] P. Huang, S. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. ICASSP, 2012, pp. 57-60.
[24] F.-R. Stoter, A. Liutkus, R. Badeau, B. Edler, and P. Magron, "Common fate model for unison source separation," in Proc. ICASSP, 2016, pp. 126-130.
[25] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652-1664, 2016.
[26] P. Chandna, M. Miron, J. Janer, and E. Gomez, "Monoaural audio source separation using deep convolutional neural networks," in Proc. LVA/ICA, 2017, pp. 258-266.
[27] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, "Single channel audio source separation using deep neural network ensembles," in Proc. AES-140, 2016.
[28] SiSEC 2016 website, https://www.sisec17.audiolabs-erlangen.de, 2017.