DRAFT

A PAIRWISE APPROACH TO SIMULTANEOUS ONSET/OFFSET DETECTION FOR SINGING VOICE USING CORRENTROPY

Sungkyun Chang and Kyogu Lee

Music and Audio Research Group
Seoul National University, 151-742 Seoul, Korea
e-mail: {rayno1,kglee}@snu.ac.kr

ABSTRACT

In this paper, we propose a novel method to search for precise locations of paired note onsets and offsets in a singing voice signal. Our approach differs from existing onset detection algorithms in two key respects. First, we employ correntropy, a generalized correlation function inspired by Rényi's entropy, as a detection function to capture the instantaneous flux while remaining insensitive to outliers. Second, a novel peak-picking algorithm is specially designed for this detection function. By calculating the fitness of a pre-defined inverse hyperbolic kernel to the detection function, it is possible to find an onset and its corresponding offset simultaneously. Experimental results show that the proposed method achieves performance significantly better than or comparable to other state-of-the-art techniques for onset detection in singing voice.

Index Terms: onset detection, offset detection, singing voice, entropy, pairwise peak picking

1. INTRODUCTION

Onset detection is the problem of finding the precise location of discrete musical events, and thus is an important pre-processing step in many music applications, including pitch estimation, beat tracking, and automatic music transcription, to name a few. In music signal processing, note onset detection remains a challenging problem, particularly for singing voice, for several reasons such as a large variance in articulation, singer-dependent timbral characteristics, and slowly varying onset envelopes.
The report from the Music Information Retrieval Evaluation eXchange 2012 (MIREX 2012) for solo singing voice reflects these difficulties: the best-performing algorithm yields an F-measure of merely 55.9% [12]. It is noteworthy that the singing class gives a much lower F-measure than other classes such as polyphonic pitched, solo brass, and wind instruments.

This research was supported by Samsung Electronics and the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the National IT Industry Promotion Agency (NIPA-2013-H0301-13-4005).

[Fig. 1. Overview of the system. Blocks: time-domain signal, ERB filterbank, correntropy function ($\sigma$), ad hoc kernel optimization, detection function $\Delta W(t)$, pairwise peak picking, {onset, offset}.]

Numerous methods have been proposed so far to solve the onset detection problem [1]. The majority of existing methods follow a two-stage procedure: detection function and peak picking [1–8]. Generating a detection function refers to transforming audio signals into feature vectors that are more relevant for indicating onsets. Because singing voice contains many soft onsets [1], a recent study focuses on finding the harmonic regularity instead of using conventional, energy-based techniques [4]. If a properly designed detection function is provided, onset candidates give rise to certain local peaks. The peak-picking procedure then finds precise onset locations from the detection function.

This paper explores a new detection function and a peak-picking method for onset/offset detection in singing voice.
The use of correntropy for the detection function brings two major advantages. First, it provides a compact front-end like a conventional correlation function. Second, correntropy preserves robustness to outliers, such as noise or subdominant changes in frequency, amplitude or phase, by projecting an input signal into a high-dimensional Reproducing Kernel Hilbert Space (RKHS).

Figure 1 illustrates an overview of the proposed system. We describe a novel detection function based on correntropy in Section 2. An ad hoc kernel optimizer continuously updates a localized parameter required for correntropy estimation. The detection function yields a regular shape with distinct local peaks, or (−)/(+) peaks, which correspond to onset/offset, respectively. This regular shape motivates us to design a pairwise peak-picking method in Section 3. The experimental results are presented in Section 4, followed by concluding remarks in Section 5. The relation to prior work is addressed in Section 6.

[Fig. 2. Singing voice example. From top to bottom: (a), (b) and (c) are the input signal, $W_t(\tau)$ and the detection function $\Delta W(t)$, respectively. Red dashed lines are onset positions that correspond to (−) peaks in (c). Black solid lines are (+) peaks near the offset.]

2. DETECTION FUNCTION BASED ON CORRENTROPY

Our detection function first passes the time-domain input signal, sampled at 11,025 Hz, through an auditory filterbank [18]. We map the center frequencies of a 64-channel gammatone filterbank according to the Equivalent Rectangular Bandwidth (ERB) scale between 80 Hz and 4,000 Hz. Hereafter, $x_c(t)$ refers to the amplitude of the filtered parallel output, where $c$ and $t$ denote the channel and discrete time indices, respectively.

2.1. Correntropy-based Formulation

The correntropy function incorporates both the distribution and the temporal structure of a time series in random processes [10]. Correntropy has various applications in non-linear and non-Gaussian signal processing where the correlation function is not sufficient. Let $\{x_t : t \in T\}$ be a random process with $T$ denoting an index set. The correntropy function $V$ is defined as

$$V(x_{t_1}, x_{t_2}) = E[\kappa_\sigma(x_{t_1}, x_{t_2})], \tag{1}$$

where $\kappa_\sigma(\cdot)$ is a Parzen kernel with bandwidth $\sigma$ and $E$ is the expectation operator. $V$ can also be extended to measure the distance between two discrete vectors.

Its properties reveal that correntropy has characteristics very similar to those of a conventional correlation function. Liu et al. showed that a sufficient condition for $V(t, t - \tau) = V(\tau)$ is that the input random process be time-shift invariant on the even moments. This is a stronger condition than wide-sense stationarity, which involves only second-order moments.

[Fig. 3. Examples of $W_t(\tau)$ in Section 2.2, plotted as lag $\tau$ against time $t$ (s): (a) good case, where $\sigma = 0.017$ preserves good contrast between stationary and non-stationary regions; (b) bad case using the global optimum for 45 s; (c) improved result with "loosely" localized $\sigma$; (d), (e) zoomed-in parts of (b) and (c). We observe more contrast in (e).]

We estimate the correntropy $V$ for given $t$, $c$ and lag $\tau$ by computing the sample mean over a window of size $N$ as follows:

$$V_{t,c}(\tau) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{N}_\sigma\big(x_c(t+n),\, x_c(t+n+\tau)\big), \tag{2}$$

where

$$\mathcal{N}_\sigma(p, q) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(p-q)^2}{2\sigma^2} \right\},$$

and $\sigma$ is the Gaussian bandwidth parameter.
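As a minimal sketch of Equation (2), assuming a single filterbank channel $x_c$ is already available as a NumPy array, the sample-mean estimate could be computed as below. The function names `gaussian_kernel` and `correntropy`, the zero-based lag range and the argument layout are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(p, q, sigma):
    """Gaussian kernel N_sigma(p, q) as written under Equation (2)."""
    return np.exp(-((p - q) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def correntropy(x, t, max_lag, window, sigma):
    """Estimate V_{t,c}(tau) for tau = 0 .. max_lag - 1 on one channel x.

    Assumes len(x) > t + window + max_lag - 1 so that every index is valid.
    Starting the lags at 0 is a choice made for this sketch.
    """
    n = np.arange(1, window + 1)          # n = 1 .. N, as in Equation (2)
    v = np.empty(max_lag)
    for tau in range(max_lag):
        # Sample mean of the kernel between the frame and its lagged copy.
        v[tau] = gaussian_kernel(x[t + n], x[t + n + tau], sigma).mean()
    return v
```

At lag 0 the two arguments coincide, so the estimate equals the kernel maximum $1/(\sqrt{2\pi}\,\sigma)$, which gives a quick sanity check for any implementation.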
In practice, both the window size $N$ and the maximum $\tau$ are set to (sampling rate)/80, since the lowest band limit is 80 Hz. The correntropy coefficients of all channels are then "pooled" into a non-negative summary matrix $W$ as follows [9]:

$$W_t(\tau) = \sum_{c} V_{t,c}(\tau). \tag{3}$$

The detection function $\Delta W(t)$ is then calculated by the rectified difference with a hop size $h$ as follows:

$$\Delta W(t) = \sum_{\tau} W_{t+h}(\tau) - \sum_{\tau} W_t(\tau). \tag{4}$$

A monophonic singing example in Figure 2 illustrates onset/offset regions and their relation to the detection function. We estimate the onset and offset locations by finding the first time $t$ at which $\Delta W(t) < 0$ and the first subsequent time at which $\Delta W(t) > 0$, respectively.

[Fig. 4. Illustration of the peak-picking algorithm. (a) Pre-defined kernel matrix $\Lambda$. (b) Cross section of $\Lambda$ with $\omega$. In (c), the search starts at $t = 0$ and the fitness is calculated by stretching $\Lambda$. In (d), the onset is found at the maximum fitness location. In (e), the search starts from the previous onset location using $-\Lambda$. In (f), the corresponding offset is found similarly as in (d). We repeat (c) to (f) until the end.]

2.2. Ad hoc Kernel Optimization

The free parameter $\sigma$ in Equation (2) acts as a sensitivity controller for the detection function: the larger $\sigma$ becomes, the faster the higher-order moments decay. To keep both nonlinearity and discrimination ability, one way to select the optimal $\sigma$ is given by Silverman's rule of thumb [16]. Assuming that $\kappa$ is a Gaussian kernel, the optimal $\sigma$ simplifies to

$$\sigma = b \cdot \hat{\psi}\, N^{-1/5}, \tag{5}$$

where $b$, $\hat{\psi}$ and $N$ denote a constant scale factor, the sample standard deviation, and the number of samples, respectively.

The sensitivity of the detection function can be observed from the color contrast in $W_t(\tau)$.
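Equations (4) and (5) are short enough to sketch directly. The following is an illustration under stated assumptions rather than the authors' implementation: the summary matrix is assumed to be stored with one row per frame (frames already spaced by the hop size), the difference is kept signed because the text relies on (−) and (+) peaks, and the constant $b = 1.06$ is the value commonly used with Gaussian kernels in Silverman's rule, which the paper does not state. The names `silverman_sigma` and `detection_function` are hypothetical.

```python
import numpy as np

def silverman_sigma(samples, b=1.06):
    """Equation (5): sigma = b * std * N^(-1/5).

    b = 1.06 is an assumed default (the usual Gaussian-kernel constant).
    """
    samples = np.asarray(samples, dtype=float)
    # ddof=1 gives the sample standard deviation.
    return b * samples.std(ddof=1) * samples.size ** (-0.2)

def detection_function(W, hop=1):
    """Equation (4): Delta W(t) = sum_tau W[t + hop] - sum_tau W[t].

    W has shape (frames, lags); row t holds W_t(tau) from Equation (3).
    The result is signed and has `hop` fewer entries than W has rows.
    """
    summed = np.asarray(W, dtype=float).sum(axis=1)   # pool over lags
    return summed[hop:] - summed[:-hop]
```

With this layout, a negative entry marks a frame where the pooled correntropy drops and a positive entry marks a frame where it rises.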
Figure 3(a) is a good example where the boundaries would yield a relevant detection function using the global optimum $\sigma$. However, Figure 3(b) reveals that the global optimum parameter does not always guarantee good contrast over the entire song. On the other hand, a strongly localized parameter would not guarantee temporal contrast. In this situation, a "loosely" localized optimal parameter is required for robust detection of real-world singing onsets. Figure 3(c) displays $W_t(\tau)$ with contrast improved by an ad hoc optimizer, which updates $\sigma$ at an interval $h$. In practice, we achieve good performance by computing the optimal $\sigma$ using Equation (5) with an observation window of 7 s and $h = 5$ ms. This improves the precision by over 10% in comparison with a global optimization method.

3. PAIRWISE SIMULTANEOUS PEAK PICKING

The pairwise peak-picking approach is motivated by the regular shape of the detection function observed in Figure 2(c): the basic idea is to capture a pair of falling and rising peaks by calculating the fitness to a pre-defined kernel. Figure 4 illustrates this concept.

We first generate a set of pre-defined inverse hyperbolic kernels, whose shape is similar to an expansion or shrinkage of the detection function. This kernel $\Lambda$ is defined as

$$\Lambda(z) = \frac{z}{1 + \alpha - |z|}, \tag{6}$$

such that $-1 + 10^{-5} \le z \le 1 - 10^{-5}$, where $\alpha$ is a sharpness factor. The range of $z$ is set to avoid division by 0, and $\alpha \approx 0.15$ was found empirically. Given an observation window length $\omega$ and sample index $i = \{1, 2, ..., \omega\}$, we chop $z$ into $\omega$ samples on a linear scale. In our default settings, $\omega_{min}$ is set to 4 samples (= 20 ms) and $\omega_{max}$ is set to 500 samples or larger ($\ge 2.5$ s) for a correntropy hop size $h$ of 5 ms. Hence, any event shorter than 20 ms is ignored. Figure 4(a) displays the generated kernel matrix for a set of observation lengths $\omega$.
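The kernel generation described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function names `inverse_hyperbolic_kernel` and `kernel_bank` are hypothetical, and the values are left unscaled because the text does not say whether the kernel is normalized before it is compared with the detection function.

```python
import numpy as np

def inverse_hyperbolic_kernel(omega, alpha=0.15, eps=1e-5):
    """Sample Equation (6) at `omega` points on a linear scale.

    z runs over [-1 + eps, 1 - eps], the range given in the text.
    """
    z = np.linspace(-1.0 + eps, 1.0 - eps, omega)
    return z / (1.0 + alpha - np.abs(z))

def kernel_bank(omega_min=4, omega_max=500, alpha=0.15):
    """One kernel per observation length, keyed by omega (in frames)."""
    return {w: inverse_hyperbolic_kernel(w, alpha)
            for w in range(omega_min, omega_max + 1)}
```

The kernel is odd and strictly increasing in $z$, so negating it, as done for the offset search, turns a rising shape into a falling one of the same steepness.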
To find an onset/offset pair, we calculate the fitness between the detection function and the pre-defined kernel as we expand the kernel size. The fitness is calculated by

$$\mathrm{Fit}(\Lambda'_\omega, W'_\omega) = (\Lambda'_\omega - W'_\omega)^2 \cdot \omega^{-k}, \tag{7}$$

where $\Lambda'_\omega$ is the pre-defined kernel sampled at $\omega$ points, and $W'_\omega$ is the detection function under a rectangular window of length $\omega$. $k$ is a weighting factor for close peaks, and we set $k = 1$. For reference, Equation (7) is derived from the lack-of-fit sum of squares, which has been widely used in classical F-test statistics [17]. To find the corresponding offset, we start from the last onset position and perform the same calculation using $-\Lambda$. The above procedures are summarized in Algorithm 1. This pairwise approach makes sense for monophonic sources: if an onset is found, then its corresponding offset must exist before the next onset.

Algorithm 1: Pseudo code for pairwise peak picking described in Section 3.

Input: ω = window size, Λ = inverse hyperbolic kernel, W = detection function.
Output: ⋆, ◦ are onset and offset markings on time index t.

    while t → T do
        % find onset, ⋆
        for t′ : (t + ω_min) → (t + ω_max) do
            F(t′) = Fit(Λ′, W′)
        end
        ⋆ = t + arg max_{t′} (F(t′))
        t = ⋆
        % find offset, ◦
        for t′ : (t + ω_min) → (t + ω_max) do
            F(t′) = Fit(−Λ′, W′)
        end
        ◦ = t + arg max_{t′} (F(t′))
        t = ◦
    end

4. EXPERIMENT

4.1. Dataset

The dataset was obtained from the authors of the referenced paper [4], which allows us to directly compare the performance of the proposed algorithm with theirs. The total length of the audio clips is about 13 minutes, which is much longer than the singing data included in the MIREX audio onset detection task [11]. The dataset consists of recordings of popular songs by 13 male and 2 female singers. Onset labels were cross-validated by three people with professional careers in music. In total, the dataset contains 1,567 annotated onsets. Audio files were produced in mono at a sampling rate of 44,100 Hz.

4.2. Results and Discussion

We followed the evaluation procedure for onset detection described in MIREX [11]. The tolerance value is set to ±50 ms, so that any detected onset within this range of a ground-truth onset is considered a true positive. A ground-truth onset with no detection within this range is counted as a false negative. A false positive is defined as any detected onset outside all the tolerance ranges. The overall results are summarized in Table 1.

Class     # of Onsets   Precision (%)   Recall (%)   F-measure (%)
male      1,533         80.9            80.1         80.3
female    34            93.8            88.4         91.4
Total     1,567         81.1            80.2         80.6

Table 1. Performance of the proposed algorithm

Using the same dataset, we compared the performance of the proposed algorithm with others, including a recent algorithm proposed by Heo et al. [4] as well as more conventional ones [14, 1, 3, 13]. It can be seen from Figure 5 that the proposed algorithm achieves results significantly better than all the other algorithms in all metrics. Although not directly comparable, it is also remarkable that the performance is over 30% higher than that of the best-performing algorithm for the singing voice class in MIREX 2012 [12].

[Fig. 5. Comparison of performance with existing onset detection algorithms. From left to right: the proposed and other algorithms [4, 14, 1, 3, 13]. Bar groups: Proposed, HCR, Energy-based, HFC, SF, Equal Loudness; bars show Precision, Recall and F-measure in % (0 to 100).]

The proposed algorithm is able to detect offsets as well. However, we could not find any reliable dataset with offset annotations. Here we briefly report the result of an extra experiment. Extra offset annotations were prepared for a one-minute-long excerpt each from the previous singing dataset and from another clarinet dataset. The tolerance value for onsets was set to 50 ms, as in the main experiment. The offset tolerance, on the other hand, was set more generously to 100 ms, since human listeners cannot locate such diffuse offsets precisely by ear. The result was F = 83.5% for singing onsets and 95.0% for clarinet onsets. For offsets, we obtained F = 67.5% and F = 95.0%, respectively.

5. CONCLUSION

We proposed a pairwise approach to onset/offset detection for singing voice recordings. The proposed method differs from previous approaches in two main aspects. First, we employed higher-order statistics to capture onset/offset events in a time-domain signal. We also demonstrated that a compact adaptive kernel method improved the results. Second, a new peak-picking algorithm was derived for this detection function. By searching for the precise location where the fitness between a pre-defined kernel and the detection function is maximized, an onset and its corresponding offset were found simultaneously. We evaluated the proposed method with a recognized dataset. The average F-measure for onset detection by the proposed algorithm was 80.6%, which is the highest among all methods in the comparison.

6. RELATION TO PRIOR WORK

The central idea of this paper is that higher-order statistics using correntropy [10] provide more robustness for the general onset detection problem. A previous application to monophonic pitch detection exists [9]. The use of the rule of thumb for kernel parameter optimization was recommended by Liu et al. [10]. We extended these concepts to a novel feature representation for a detection function.

For peak picking, the adaptive threshold method has been widely used [1]. Others have formulated this decision making as a machine learning problem [5–7].
The proposed method is novel in that it jointly estimates onset/offset, and it can be generalized as dynamic programming.

7. REFERENCES

[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies and M. B. Sandler: "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, 13(5), pp. 1035–1047, 2005.
[2] S. Dixon: "Onset detection revisited," In Proc. of DAFx, Montreal, Canada, pp. 133–137, 2006.
[3] C. Duxbury, M. Sandler and M. Davies: "A hybrid approach to musical note onset detection," In Proc. of DAFx, Hamburg, Germany, pp. 33–38, 2002.
[4] H. Heo, D. Sung and K. Lee: "Note Onset Detection based on Harmonic Cepstrum Regularity," IEEE Int. Conf. on Multimedia and Expo, San Jose, USA, pp. 1–6, 2013.
[5] C. C. Toh, B. Zhang and Y. Wang: "Multiple-feature fusion based onset detection for solo singing voice," In Proc. of Int. Society for Music Information Retrieval, 2009.
[6] F. Eyben, S. Böck, B. Schuller and A. Graves: "Universal onset detection with bidirectional long short-term memory neural networks," In Proc. of Int. Society for Music Information Retrieval, 2010.
[7] S. Böck, A. Arzt, F. Krebs and M. Schedl: "Online real-time onset detection with recurrent neural networks," In Proc. of DAFx, 2012.
[8] W. Wang, et al.: "Non-negative matrix factorization for note onset detection of audio signals," In Proc. of IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, 2006.
[9] J. Xu and J. C. Principe: "A pitch detector based on a generalized correlation function," IEEE Transactions on Audio, Speech, and Language Processing, 16(8), pp. 1420–1432, 2008.
[10] W. Liu, P. Pokharel and J. C. Principe: "Correntropy: properties and applications in non-Gaussian signal processing," IEEE Transactions on Signal Processing, 55(11), pp. 5286–5298, 2007.
[11] J. S. Downie, A. F. Ehmann, M. Bay and M. C. Jones: "The Music Information Retrieval Evaluation eXchange: Some observations and insights," Advances in Music Information Retrieval, Springer Berlin Heidelberg, pp. 93–115, 2010.
[12] MIREX 2012 Onset Detection F-Measure per Class. Available from http://nema.lis.illinois.edu/nema_out/mirex201
[13] A. Klapuri: "Sound onset detection by applying psychoacoustic knowledge," In Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing, Phoenix, USA, pp. 115–118, 1999.
[14] A. W. Schloss: "On the automatic transcription of percussive music from acoustic signal to high-level analysis," Ph.D. thesis, Dept. Hearing and Speech, Stanford Univ., Stanford, CA, 1985, Tech. Rep. STAN-M-27.
[15] O. Lartillot and P. Toiviainen: "A Matlab toolbox for musical feature extraction from audio," In Proc. of DAFx, Bordeaux, 2007.
[16] B. W. Silverman: "Density estimation for statistics and data analysis," London: Chapman and Hall, 1986.
[17] M. H. Kutner, C. J. Nachtsheim, J. Neter and W. Li: "Diagnostics and Remedial Measures," In Applied Linear Statistical Models, 5th ed. New York: McGraw-Hill/Irwin, pp. 119–124, 2005.
[18] B. C. J. Moore and B. R. Glasberg: "Suggested formulae for calculating auditory filter bandwidths and excitation patterns," The Journal of the Acoustical Society of America, 74(3), pp. 750–753, 1983.
