A Pairwise Approach to Simultaneous Onset/Offset Detection for Singing Voice Using Correntropy

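This paper proposes a new method that uses a correntropy-based detection function to find note onsets and offsets in singing voice simultaneously. The detection function exploits higher-order statistics, which makes it robust to noise and outliers, and a pairwise peak-picking algorithm that maximizes the fitness to a pre-defined inverse hyperbolic kernel extracts accurate onset/offset pairs. In experiments, the method outperforms or matches existing state-of-the-art techniques, and in particular achieves an F-measure above 80% for onset detection…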

Authors: Sungkyun Chang, Kyogu Lee

DRAFT

A PAIRWISE APPROACH TO SIMULTANEOUS ONSET/OFFSET DETECTION FOR SINGING VOICE USING CORRENTROPY

Sungkyun Chang and Kyogu Lee

Music and Audio Research Group, Seoul National University, 151-742 Seoul, Korea
e-mail: {rayno1,kglee}@snu.ac.kr

ABSTRACT

In this paper, we propose a novel method to search for precise locations of paired note onsets and offsets in a singing voice signal. Our approach differs from existing onset detection algorithms in two key respects. First, we employ correntropy, a generalized correlation function inspired by Rényi's entropy, as a detection function to capture the instantaneous flux while remaining insensitive to outliers. Second, a novel peak-picking algorithm is specially designed for this detection function. By calculating the fitness of a pre-defined inverse hyperbolic kernel to the detection function, it is possible to find an onset and its corresponding offset simultaneously. Experimental results show that the proposed method achieves performance significantly better than or comparable to other state-of-the-art techniques for onset detection in singing voice.

Index Terms: onset detection, offset detection, singing voice, entropy, pairwise peak picking

1. INTRODUCTION

Onset detection is the problem of finding the precise location of discrete musical events, and is thus an important pre-processing step in many music applications, including pitch estimation, beat tracking, and automatic music transcription, to name a few. In music signal processing, note onset detection remains a challenging problem, particularly for singing voice, for several reasons: a large variance in articulation, singer-dependent timbral characteristics, and slowly varying onset envelopes.
The report from the Music Information Retrieval Evaluation eXchange 2012 (MIREX 2012) for solo singing voice reflects these difficulties: the best-performing algorithm yields an F-measure of merely 55.9% [12]. It is noteworthy that the singing class gives a much lower F-measure than other classes such as the polyphonic pitched, solo brass, and wind instrument classes.

This research was supported by Samsung Electronics and the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the National IT Industry Promotion Agency. NIPA-2013-H0301-13-4005.

[Fig. 1. Overview of the system. Blocks: time-domain signal, ERB filterbank, correntropy function (σ), ad-hoc kernel optimization, detection function ΔW(t), pairwise peak picking, {onset, offset}.]

Numerous methods have been proposed so far to solve the onset detection problem [1]. The majority of existing methods can be generalized as a two-stage procedure: detection function and peak picking [1–8]. Generating a detection function refers to transforming audio signals into feature vectors that are more relevant to indicating onsets. Because singing voice contains many soft onsets [1], a recent study focuses on finding the harmonic regularity instead of using conventional energy-based techniques [4]. If a properly designed detection function is provided, onset candidates give rise to certain local peaks. The peak-picking procedure then finds precise onset locations from the detection function.

This paper explores a new detection function and a peak-picking method for onset/offset detection in singing voice.
The use of correntropy for the detection function brings two major advantages: first, it provides a compact front-end like a conventional correlation function; second, correntropy preserves robustness to outliers (such as noise or subdominant changes in frequency, amplitude or phase) by projecting an input signal into a high-dimensional Reproducing Kernel Hilbert Space (RKHS).

Figure 1 illustrates an overview of the proposed system. We describe a novel detection function based on correntropy in Section 2. An ad-hoc kernel optimizer continuously updates a localized parameter required for correntropy estimation. The detection function yields a regular shape with distinct local peaks, or (−)/(+) peaks, which correspond to onset/offset, respectively. This regular shape motivates us to design a pairwise peak-picking method in Section 3. The experimental results are presented in Section 4, followed by concluding remarks in Section 5. A relation to prior work is addressed in Section 6.

[Fig. 2. Singing voice example. From top to bottom: (a), (b) and (c) are the input signal, $W_t(\tau)$ and the detection function $\Delta W(t)$, respectively. Red dashed lines are onset positions that correspond to (−) peaks in (c). Black solid lines are (+) peaks near the offset.]

2. DETECTION FUNCTION BASED ON CORRENTROPY

Our detection function first passes the time-domain input signal, sampled at 11,025 Hz, through an auditory filterbank [18]. We map the center frequencies of a 64-channel gammatone filterbank according to the Equivalent Rectangular Bandwidth (ERB) scale between 80 Hz and 4,000 Hz. Hereafter, $x_c(t)$ refers to the amplitude of the filtered parallel output, where $c$ and $t$ denote the channel and discrete time indices, respectively.

2.1. Correntropy-based Formulation

The correntropy function incorporates both the distribution and the temporal structure of a time series in random processes [10]. There are various applications of correntropy to non-linear and non-Gaussian signal processing where the correlation function is not sufficient. Let $\{x_t : t \in T\}$ be a random process with $T$ denoting an index set. The correntropy function $V$ is defined as

$$V(x_{t_1}, x_{t_2}) = E\left[\kappa(x_{t_1}, x_{t_2} \mid \sigma)\right], \tag{1}$$

where $\kappa_\sigma(\cdot)$ is a Parzen kernel and $E$ is the expectation operator. $V$ is put forward as a measure of the distance between two discrete vectors.

Its properties reveal that correntropy has characteristics very similar to those of a conventional correlation function. Liu et al. showed that the sufficient condition to satisfy $V(t, t-\tau) = V(\tau)$ is that the input random process must be time shift-invariant on the even moments. More specifically, this is a stronger condition than wide-sense stationarity, which involves only second-order moments.

[Fig. 3. Examples of $W_t(\tau)$ in Section 2.2, plotted over time $t$ (s) and lag $\tau$: (a) σ = 0.017 preserves good contrast between stationary and non-stationary regions. (b) Bad case using the global optimum for 45 s. (c) Improved result with "loosely" localized σ. (d), (e) Zoomed-in parts from (b) and (c). We observe more contrast in (e).]

We estimate the correntropy $V$ given $t$, $c$ and lag $\tau$ by computing the sample mean over a window of size $N$ as follows:

$$V_{t,c}(\tau) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{N}_\sigma\big(x_c(t+n),\, x_c(t+n+\tau)\big), \tag{2}$$

where $\mathcal{N}_\sigma(p, q) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(p-q)^2}{2\sigma^2}\right\}$, and $\sigma$ is a Gaussian bandwidth parameter.
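As a rough illustration of Equation (2), the following Python sketch estimates the correntropy of a single filterbank channel for a range of lags. The function names, the zero-based lag range and the bounds check are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(p, q, sigma):
    """Gaussian kernel N_sigma(p, q) as written under Equation (2)."""
    return np.exp(-((p - q) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)


def correntropy(x_c, t, max_lag, window, sigma):
    """Estimate V_{t,c}(tau) for tau = 0 .. max_lag - 1 on one channel x_c.

    Sample mean of the kernel over n = 1 .. window, as in Equation (2).
    Starting the lags at zero and raising on short signals are
    illustrative choices, not details from the paper.
    """
    x_c = np.asarray(x_c, dtype=float)
    if t + window + max_lag - 1 >= len(x_c):
        raise ValueError("signal too short for the requested window and lags")
    n = np.arange(1, window + 1)
    base = x_c[t + n]
    v = np.empty(max_lag)
    for tau in range(max_lag):
        v[tau] = gaussian_kernel(base, x_c[t + n + tau], sigma).mean()
    return v


# Usage: a sinusoid with a period of 20 samples.
x = np.sin(2 * np.pi * 0.05 * np.arange(400))
v = correntropy(x, t=0, max_lag=50, window=100, sigma=0.5)
```

For a periodic input, the estimate returns to its zero-lag maximum at lags equal to the period, which is the behaviour that makes correntropy usable like a correlation function.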
In practice, both the window size $N$ and the maximum $\tau$ are set to (sampling rate)/80, since the lowest band limit is 80 Hz. Then the correntropy coefficients of all channels are "pooled" into a non-negative summary matrix $W$ as follows [9]:

$$W_t(\tau) = \sum_{c} V_{t,c}(\tau). \tag{3}$$

The detection function $\Delta W(t)$ is then calculated by the rectified difference with a hop size $h$ as follows:

$$\Delta W(t) = \sum_{\tau} W_{t+h}(\tau) - \sum_{\tau} W_t(\tau). \tag{4}$$

A monophonic singing example in Figure 2 illustrates onset/offset regions and their relation to the detection function. We estimate onset/offset locations by finding the first time $t$ at which $\Delta W(t) < 0$, and the first time after $t$ where $\Delta W(t) > 0$, respectively.

[Fig. 4. Illustration of the peak-picking algorithm. (a) Pre-defined kernel matrix Λ. (b) Cross section of Λ with ω. In (c), start the search at t = 0 and calculate the fitness by stretching Λ. In (d), the onset is found at the maximum-fitness location. In (e), start the search from the previous onset location using −Λ. In (f), the corresponding offset is found similarly as in (d). We repeat (c) to (f) until the end.]

2.2. Ad hoc Kernel Optimization

The free parameter $\sigma$ in Equation (2) acts as a sensitivity controller for the detection function: the larger $\sigma$ becomes, the faster the higher-order moments decay. To keep both nonlinearity and discrimination ability, one way to select the optimal $\sigma$ is given by Silverman's Rule of Thumb [16]. Assuming that $\kappa$ is a Gaussian kernel, the optimal $\sigma$ can be simplified to

$$\sigma = b \cdot \hat{\psi} \, N^{-1/5}, \tag{5}$$

where $b$, $\hat{\psi}$ and $N$ denote the constant scale factor, the sample standard deviation, and the number of samples, respectively.

The sensitivity of the detection function can be observed from the contrast in colors in $W_t(\tau)$.
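Equations (3) to (5) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the array layout (frames, channels, lags), the function names, and the default `b = 1.06` (the usual constant in Silverman's Gaussian reference rule) are choices made here and are not specified in the paper. The `hop` argument counts frames, so it is 1 when consecutive frames of W are already spaced by h.

```python
import numpy as np


def summary_matrix(v):
    """Equation (3): pool correntropy over channels.

    v is assumed to have shape (frames, channels, lags).
    """
    return np.asarray(v, dtype=float).sum(axis=1)


def detection_function(w, hop=1):
    """Equation (4): lag-summed W at frame t + hop minus lag-summed W at frame t."""
    s = np.asarray(w, dtype=float).sum(axis=1)
    return s[hop:] - s[:-hop]


def silverman_sigma(samples, b=1.06):
    """Equation (5): sigma = b * sample standard deviation * N^(-1/5).

    b = 1.06 is an assumed default, not a value reported in the paper.
    """
    samples = np.asarray(samples, dtype=float)
    return b * samples.std(ddof=1) * samples.size ** (-0.2)


# Usage: five frames, three channels, four lags; the pooled value doubles at frame 2.
v = np.ones((5, 3, 4))
v[2:] *= 2
w = summary_matrix(v)
d = detection_function(w)
```

Only the sign and position of the jump in `d` matter for the onset/offset rule stated after Equation (4).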
Figure 3(a) is a good example where the boundaries would yield a relevant detection function using the global optimum $\sigma$. However, Figure 3(b) reveals that the global optimum parameter does not always guarantee good contrast over the entire song. On the other hand, a strongly localized parameter would not guarantee temporal contrast. In this situation, a "loosely" localized optimal parameter is required for robust detection of real-world singing onsets. Figure 3(c) displays $W_t(\tau)$ with contrast improved by an ad-hoc optimizer, which updates $\sigma$ at an interval $h$. In practice, we achieve good performance by computing the optimal $\sigma$ using Equation (5) with an observation window size of 7 s and $h$ = 5 ms. This improves the precision by over 10% in comparison with a global optimization method.

3. PAIRWISE SIMULTANEOUS PEAK PICKING

A pairwise peak-picking approach is motivated by the regular shape of the detection function observed in Figure 2(c): the basic idea is to capture a pair of falling and rising peaks by calculating the fitness to a pre-defined kernel. Figure 4 illustrates this concept.

We first generate a set of pre-defined inverse hyperbolic kernels, whose shape is similar to the expansion or shrinkage of the detection function. This kernel $\Lambda$ is defined as

$$\Lambda(z) = \frac{z}{1 + \alpha - |z|}, \tag{6}$$

such that $-1 + 10^{-5} \le z \le 1 - 10^{-5}$, where $\alpha$ is a sharpness factor. The range of $z$ is set to avoid division by 0, and $\alpha \approx 0.15$ is found empirically. Given an observation window length $\omega$ and sample index $i = \{1, 2, ..., \omega\}$, we chop $z$ into $\omega$ samples on a linear scale. In our default settings, $\omega_{min}$ is set to 4 samples (= 20 ms) and $\omega_{max}$ is set to 500 samples or larger (≥ 2.5 s) for the 5 ms correntropy hop size $h$. Hence, any event shorter than 20 ms will be ignored. Figure 4(a) displays the generated kernel matrix for a set of observation lengths $\omega$.
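The kernel of Equation (6) and the set of kernels over observation lengths can be sketched as below. The function names and the dictionary used to hold the kernel set are hypothetical; the defaults for α, the range of z and ω_min/ω_max follow the values quoted above.

```python
import numpy as np


def inverse_hyperbolic_kernel(omega, alpha=0.15, eps=1e-5):
    """Sample Lambda(z) = z / (1 + alpha - |z|) at omega linearly spaced points.

    z runs from -1 + eps to 1 - eps, matching the range given with Equation (6).
    """
    z = np.linspace(-1.0 + eps, 1.0 - eps, omega)
    return z / (1.0 + alpha - np.abs(z))


def kernel_bank(omega_min=4, omega_max=500, alpha=0.15):
    """One kernel per observation length (a dict is an illustrative container)."""
    return {w: inverse_hyperbolic_kernel(w, alpha) for w in range(omega_min, omega_max + 1)}


# Usage: the shortest default kernel (4 samples, i.e. 20 ms at a 5 ms hop).
k = inverse_hyperbolic_kernel(4)
bank = kernel_bank(4, 10)
```

Each sampled kernel is odd-symmetric and increases monotonically, so negating it gives the mirrored shape used when searching for the offset.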
To find the onset of a pair, we calculate the fitness between the detection function and the pre-defined kernel as we expand the kernel size. The fitness is calculated by

$$\mathrm{Fit}(\Lambda'_\omega, W'_\omega) = (\Lambda'_\omega - W'_\omega)^2 \cdot \omega^{-k}, \tag{7}$$

where $\Lambda'_\omega$ is the pre-defined kernel sampled at $\omega$, and $W'_\omega$ is the detection function under an $\omega$-long rectangular window. $k$ is a weighting factor for close peaks, and we set $k = 1$. For reference, Equation (7) is derived from the lack-of-fit sum of squares, which has been widely used in classical F-test statistics [17]. To find the corresponding offset, we start from the last onset position and perform the same calculation but use $-\Lambda$. The above procedures are summarized in Algorithm 1. This pairwise approach makes sense for monophonic sources: if an onset is found, then its corresponding offset must exist before the next onset.

Algorithm 1: Pseudocode for the pairwise peak picking described in Section 3.

Input: ω = window size, Λ = inverse hyperbolic kernel, W = detection function.
Output: ⋆, ◦ are the onset and offset markings on time index t.

while t → T do
    % find onset, ⋆
    for t′ : (t + ω_min) → (t + ω_max) do
        F(t′) = Fit(Λ′, W′)
    end
    ⋆ = t + arg max_{t′} F(t′)
    t = ⋆
    % find offset, ◦
    for t′ : (t + ω_min) → (t + ω_max) do
        F(t′) = Fit(−Λ′, W′)
    end
    ◦ = t + arg max_{t′} F(t′)
    t = ◦
end

4. EXPERIMENT

4.1. Dataset

The dataset was obtained from the authors of the referenced paper [4], which allows us to directly compare the performance of the proposed algorithm with theirs. The total length of the audio clips is about 13 minutes, which is much longer than the singing data included in the MIREX audio onset detection task [11]. The dataset consists of recordings of popular songs by 13 male and 2 female singers. Onset labels were cross-validated by three persons who have professional careers in music. In total, the dataset contains 1,567 annotated onsets. Audio files were produced in mono with a sampling rate of 44,100 Hz.

4.2. Results and Discussion

We followed the evaluation procedure for onset detection described in MIREX [11]. The tolerance value is set to ±50 ms, which allows us to consider any detected onset within this range from the ground-truth onset as a true positive. If not, then it is counted as a false negative. A false positive is defined as any detected onset outside all the tolerance ranges.

The overall results are summarized in Table 1. Using the same dataset, we compared the performance of the proposed algorithm with others, including a recent algorithm proposed by Heo et al. [4] as well as more conventional ones [14, 1, 3, 13].

Table 1. Performance of the proposed algorithm

Class     # of Onsets   Precision (%)   Recall (%)   F-measure (%)
male      1,533         80.9            80.1         80.3
female    34            93.8            88.4         91.4
Total     1,567         81.1            80.2         80.6

It can be seen from Figure 5 that the proposed algorithm achieves results significantly better than all the other algorithms in all metrics. Although not directly comparable, it is also remarkable that the performance is over 30% higher than the best-performing algorithm for the singing voice class from MIREX 2012 [12].

[Fig. 5. Comparison of performance with existing onset detection algorithms. From left to right: the proposed algorithm and other algorithms [4, 14, 1, 3, 13]. Bars show precision, recall and F-measure (%) for Proposed, HCR, Energy-based, HFC, SF and Equal Loudness.]

The proposed algorithm is able to detect offsets as well. However, we could not find any reliable dataset with offset annotations. Here we briefly report the result of an extra experiment. An extra annotation for offsets was prepared for a one-minute-long excerpt each of the previous singing dataset and of another clarinet dataset. The tolerance value for onsets was set to 50 ms, as in the main experiment. On the other hand, the offset tolerance was set more generously to 100 ms. In practice, humans cannot find such precise offset locations by ear. The result was F = 83.5% for singing onsets and 95.0% for clarinet onsets. For offsets, we obtained F = 67.5% and F = 95.0%, respectively.

5. CONCLUSION

We proposed a pairwise approach to onset/offset detection for singing voice recordings. The proposed method differs from previous approaches in two main aspects. First, we employed higher-order statistics to capture onset/offset events in a time-domain signal. We also demonstrated that a compact adaptive kernel method improved the results. Second, a new peak-picking algorithm was derived for this detection function. By searching for the precise location where the fitness between a pre-defined kernel and the detection function is maximized, an onset and its corresponding offset were found simultaneously. We evaluated the proposed method with a recognized dataset. The average F-measure for onset detection by the proposed algorithm was 80.6%, which is the highest among all methods in the comparison.

6. RELATION TO PRIOR WORK

The central idea of this paper is that higher-order statistics using correntropy [10] would provide more robustness to the general onset detection problem. A previous application to monophonic pitch detection exists [9]. The use of the Rule of Thumb for kernel parameter optimization was recommended in Liu [10]. We extended these concepts to a novel feature representation for a detection function.

For peak picking, the adaptive threshold method has been widely used [1]. Others have formulated such decision making as a machine learning problem [5–7].
The proposed method is novel in that it jointly estimates onset/offset, and it can be generalized as dynamic programming.

7. REFERENCES

[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies and M. B. Sandler: "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, 13(5), pp. 1035–1047, 2005.
[2] S. Dixon: "Onset detection revisited," In Proc. of DAFx, Montreal, Canada, pp. 133–137, 2006.
[3] C. Duxbury, M. Sandler, and M. Davies: "A hybrid approach to musical note onset detection," In Proc. of DAFx, Hamburg, Germany, pp. 33–38, 2002.
[4] H. Heo, D. Sung and K. Lee: "Note Onset Detection based on Harmonic Cepstrum Regularity," IEEE Int. Conf. on Multimedia and Expo, San Jose, USA, pp. 1–6, 2013.
[5] C. C. Toh, B. Zhang, and Y. Wang: "Multiple-feature fusion based onset detection for solo singing voice," In Proc. of Int. Society for Music Information Retrieval, 2009.
[6] F. Eyben, S. Böck, B. Schuller and A. Graves: "Universal onset detection with bidirectional long short-term memory neural networks," In Proc. of Int. Society for Music Information Retrieval, 2010.
[7] S. Böck, A. Arzt, F. Krebs and M. Schedl: "Online real-time onset detection with recurrent neural networks," In Proc. of DAFx, 2012.
[8] W. Wang, et al.: "Non-negative matrix factorization for note onset detection of audio signals," In Proc. of IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, 2006.
[9] J. Xu and J. C. Principe: "A pitch detector based on a generalized correlation function," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16.8, pp. 1420–1432, 2008.
[10] W. Liu, P. Pokharel and J. C. Principe: "Correntropy: properties and applications in non-Gaussian signal processing," IEEE Transactions on Signal Processing, Vol. 55.11, pp. 5286–5298, 2007.
[11] J. S. Downie, A. F. Ehmann, M. Bay and M. C. Jones: "The Music Information Retrieval Evaluation eXchange: Some observations and insights," Advances in Music Information Retrieval, Springer Berlin Heidelberg, pp. 93–115, 2010.
[12] MIREX 2012 Onset Detection F-Measure per Class. Available from http://nema.lis.illinois.edu/nema_out/mirex201
[13] A. Klapuri: "Sound onset detection by applying psychoacoustic knowledge," In Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing, Phoenix, USA, pp. 115–118, 1999.
[14] A. W. Schloss: "On the automatic transcription of percussive music from acoustic signal to high-level analysis," Ph.D. thesis, Dept. Hearing and Speech, Stanford Univ., Stanford, CA, 1985, Tech. Rep. STAN-M-27.
[15] O. Lartillot and P. Toiviainen: "A Matlab toolbox for musical feature extraction from audio," In Proc. of DAFx, Bordeaux, 2007.
[16] B. W. Silverman, "Density estimation for statistics and data analysis," London: Chapman and Hall, 1986.
[17] M. H. Kutner, C. J. Nachtsheim, J. Neter and W. Li, "Diagnostics and Remedial Measures," In Applied Linear Statistical Models, 5th ed. New York: McGraw-Hill/Irwin, pp. 119–124, 2005.
[18] Brian C. J. Moore and Brian R. Glasberg, "Suggested formulae for calculating auditory filter bandwidths and excitation patterns," In The Journal of the Acoustical Society of America, 74.3, pp. 750–753, 1983.
