Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

A Preprint

Lerato Lerato, Thomas Niesler
Department of Electrical and Electronic Engineering
University of Stellenbosch, Stellenbosch, South Africa
llerato@sun.ac.za, trn@sun.ac.za

October 31, 2018

Abstract

Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments. Experiments using MFCC and PLP parametrisations extracted from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and statistically significant improvements in the quality of the resulting clusters in terms of F-measure and normalised mutual information (NMI).

Keywords: Dynamic time warping · Feature trajectory · Speech segments · Agglomerative hierarchical clustering

1 Introduction

Dynamic Time Warping (DTW) is a method of optimally aligning two distinct time series of generally different length. In addition to the alignment, DTW computes a score indicating the similarity of the two sequences. This ability to quantify the similarity between time series led to the application of DTW in automatic speech recognition (ASR) systems several decades ago [1, 2]. It has remained popular in this field, with more recent developments reported in [3] and [4]. DTW has also found application in fields related to ASR. For example, it has been used successfully in keyword spotting and information retrieval (IR) systems [5, 6, 7].
To accomplish IR, sub-sequences in a speech signal that match a template with a certain degree of time warping are detected. In the related task of acoustic pattern discovery, DTW can be allowed to consider multiple local alignments between speech signals during the overall search [8]. In this way DTW can find similar segment pairs in speech audio, followed by a clustering step [9]. The resulting cluster labels are used to train hidden Markov models (HMMs).

In an effort to improve performance, several variations of DTW have been proposed since its inception. For example, a one-against-all index (OAI) for each time series under consideration is proposed in [4]. The OAI is subsequently used to weight the corresponding DTW alignment score in a speech recognition system. DTW has also been modified to allow the direct matching of points along the best alignment for use in a signature verification system [10]. A stability function is subsequently applied, and the resulting score is used as a similarity measure.

We describe a modification of DTW and demonstrate its improved performance when used as a similarity measure to cluster speech segments. Our DTW modification exploits the asynchronous temporal structure of features extracted from speech. Related work has considered such feature trajectories by training separate hidden Markov models (HMMs) for each MFCC feature dimension [11]. This work reports improvements in both phoneme and word recognition. The clustering of speech segments also has several useful applications in ASR [12, 13, 14]. Recently it has been particularly useful in the automatic discovery of sub-word units [15, 16].

Section 2 reviews the standard formulation of DTW and Section 3 describes our proposed modification.
Section 4 presents the evaluation tools we employ and Section 5 describes the data we use for experimentation. Section 6 presents an experimental evaluation of the proposed method. Section 7 discusses the results and concludes the paper.

2 Classical Dynamic Time Warping (DTW)

We consider speech segments as temporal sequences of multidimensional feature vectors in the Euclidean space. Sequences are of arbitrary and generally different length, but all vectors are of equal dimension. The DTW algorithm recursively determines the best alignment between two such vector time series by minimising a cumulative path cost that is commonly based on Euclidean distances between time-aligned vectors [2, 17].

Consider N such sequences X_i, i = 1, 2, ..., N, each composed of T_i feature vectors, as defined in Equation 1.

    X_i = \{ x_{i1}, x_{i2}, \ldots, x_{iT_i} \}, \quad i = 1, 2, \ldots, N    (1)

Each feature vector x_{it} has m dimensions, as indicated in Equation 2.

    x_{it} = \langle x_{it}^{(1)}, x_{it}^{(2)}, \ldots, x_{it}^{(m)} \rangle, \quad t = 1, 2, \ldots, T_i    (2)

Two sequences X_i and X_j are aligned by constructing a T_i-by-T_j distance matrix D_{ij}(p, q) whose entries contain the distances d(x_{ip}, x_{jq}). Typical choices for d are the Euclidean distance and the Manhattan distance. A matrix of minimum accumulated distances \gamma_{ij}(p, q) is then constructed by considering all paths from D_{ij}(1, 1) to D_{ij}(p, q). Using local and global path constraints, \gamma_{ij}(p, q) is computed recursively according to the principle of dynamic programming, as shown in Equation 3 [2].

    \gamma_{ij}(p, q) = D_{ij}(p, q) + \min\{ \gamma_{ij}(p-1, q-1), \; \gamma_{ij}(p-1, q), \; \gamma_{ij}(p, q-1) \}    (3)

The similarity DTW(X_i, X_j) between vector sequences X_i and X_j is then given by Equation 4. Here K is the length of the optimal path from D_{ij}(1, 1) to D_{ij}(T_i, T_j) and is used to normalise the similarity value.
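As a concrete reference, the dynamic-programming recursion of Equation 3 and the path-length normalisation of Equation 4 can be sketched in pure Python. This is our own illustration of the standard formulation, not the authors' implementation; all function and variable names are ours.

```python
import math

def dtw(x, y):
    """Classical DTW similarity (Equation 4): the minimum accumulated
    distance gamma(T_i, T_j) divided by the optimal path length K."""
    cost, K = dtw_unnormalised(x, y)
    return cost / K

def dtw_unnormalised(x, y):
    """Return (gamma(T_i, T_j), K): Equation 3 evaluated at the end of
    both sequences, together with the length of the optimal path."""
    def d(a, b):
        # Euclidean local distance; scalars are treated as 1-D vectors
        if isinstance(a, (int, float)):
            return abs(a - b)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    Ti, Tj = len(x), len(y)
    INF = float("inf")
    # g[p][q] holds gamma(p, q); row and column 0 act as boundary values
    g = [[INF] * (Tj + 1) for _ in range(Ti + 1)]
    g[0][0] = 0.0
    for p in range(1, Ti + 1):
        for q in range(1, Tj + 1):
            g[p][q] = d(x[p - 1], y[q - 1]) + min(
                g[p - 1][q - 1], g[p - 1][q], g[p][q - 1])

    # backtrack from (T_i, T_j) to (1, 1) to count the path length K
    p, q, K = Ti, Tj, 1
    while (p, q) != (1, 1):
        p, q = min([(p - 1, q - 1), (p - 1, q), (p, q - 1)],
                   key=lambda c: g[c[0]][c[1]])
        K += 1
    return g[Ti][Tj], K
```

For example, `dtw([0, 1, 2], [0, 1, 1, 2])` is 0.0, since the shorter sequence can be warped onto the longer one at no cost.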
    DTW(X_i, X_j) = \frac{1}{K} \, \gamma_{ij}(T_i, T_j)    (4)

This standard formulation of dynamic time warping will in the remainder of the paper be referred to as classical DTW. Figure 1 shows the classical DTW alignment between two different sequences of 21-dimensional spectral feature vectors representing the same sound uttered by different speakers. These spectral features are obtained by straightforward binning of the short-time power spectra. To avoid clutter, the alignment of just four of the feature vectors is shown.

Figure 1: Alignment by classical DTW of spectral features extracted from the triphone b-aa+dx as uttered by (a) male speaker mrfk0 and (b) by female speaker fdml0 in the TIMIT corpus.

3 Feature Trajectory DTW (FTDTW)

We define a feature trajectory X_i^{(l)} as the time series obtained when considering the l-th element of each feature vector in a sequence X_i, as shown in Equation 5.

    X_i^{(l)} = \{ x_{i1}^{(l)}, x_{i2}^{(l)}, \ldots, x_{iT_i}^{(l)} \}, \quad l = 1, 2, \ldots, m    (5)

Hence X_i^{(l)} is a 1-dimensional time series for feature l. We now calculate the similarity of two feature vector sequences by applying classical DTW to each corresponding pair of feature trajectories, and subsequently normalise the sum, as shown in Equation 6.

    FTDTW(X_i, X_j) = \frac{1}{\beta} \sum_{l=1}^{m} DTW\left( X_i^{(l)}, X_j^{(l)} \right)    (6)

where \beta = \sqrt{ \sum_{l=1}^{m} K_l^2 }, K_l is the optimal path length of the l-th alignment, and DTW(\cdot) is non-normalised classical DTW.

As illustration, we repeat the alignment of the two speech segments shown in Figure 1 with FTDTW. Figure 2(a) identifies seven features from each of the four feature vectors shown in Figure 1(a). Figure 2(b) demonstrates how each of these seven features aligns with the second speech segment. The features themselves are the same as those illustrated in Figure 1.
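Equations 5 and 6 translate directly into code. The pure-Python sketch below (our own illustration; the names are ours, not the authors') extracts each feature trajectory, aligns corresponding trajectory pairs with non-normalised scalar DTW, and normalises the summed scores by β.

```python
import math

def _scalar_dtw(x, y):
    """Non-normalised classical DTW between two 1-D trajectories.
    Returns (accumulated cost, optimal path length K)."""
    Ti, Tj = len(x), len(y)
    INF = float("inf")
    g = [[INF] * (Tj + 1) for _ in range(Ti + 1)]
    g[0][0] = 0.0
    for p in range(1, Ti + 1):
        for q in range(1, Tj + 1):
            g[p][q] = abs(x[p - 1] - y[q - 1]) + min(
                g[p - 1][q - 1], g[p - 1][q], g[p][q - 1])
    # backtrack to count the optimal path length K
    p, q, K = Ti, Tj, 1
    while (p, q) != (1, 1):
        p, q = min([(p - 1, q - 1), (p - 1, q), (p, q - 1)],
                   key=lambda c: g[c[0]][c[1]])
        K += 1
    return g[Ti][Tj], K

def ftdtw(X, Y):
    """FTDTW (Equation 6): align each pair of feature trajectories
    independently, sum the non-normalised DTW scores and divide by
    beta = sqrt(sum of squared path lengths K_l).
    X and Y are sequences of m-dimensional feature vectors."""
    m = len(X[0])
    total, beta_sq = 0.0, 0.0
    for l in range(m):
        # l-th feature trajectory of each segment (Equation 5)
        cost, K = _scalar_dtw([v[l] for v in X], [v[l] for v in Y])
        total += cost
        beta_sq += K * K
    return total / math.sqrt(beta_sq)
```

Note that each trajectory pair is free to choose its own warping path, which is exactly the decoupling that distinguishes FTDTW from classical DTW.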
For the illustrated example, application of Equation 6 involves 21 separate alignments, each between corresponding feature trajectories, as also indicated in Figure 2. The resulting 21 scores are summed and normalised by β. Figure 2 illustrates how, in contrast to classical DTW, FTDTW does not require features coincident in time in one segment to align with features in the other segment that are also coincident in time.

4 Evaluation

We evaluate the effectiveness of our proposed modification to DTW by using it to compute similarities between speech segments, and then using these similarities to perform agglomerative hierarchical clustering [18, 19]. We will cluster speech segments corresponding to triphones extracted from the TIMIT corpus as well as isolated digits extracted from the Spoken Arabic Digit Dataset (SADD). Since the phonetic alignment is provided in the former and the word alignments in the latter, the ground truth is available. Hence we can use the external metrics F-measure and normalised mutual information (NMI) to quantify the quality of the resulting clusters [20, 21, 22].

4.1 Agglomerative Hierarchical Clustering (AHC)

In AHC, the agglomeration of data objects (speech segments in the case of our experimental evaluation) is initialised by the assumption that each object is the sole occupant of its own cluster. A binary tree referred to as a dendrogram is created by successively merging the closest cluster pairs until a single cluster remains [23]. We use the popular Ward method to quantify inter-cluster similarity [24]. The input to the AHC algorithm is a symmetric N × N proximity matrix populated by the values of DTW(·, ·) or FTDTW(·, ·), and the output consists of the R clusters.
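A naive O(N³) sketch of this clustering step, using the Lance-Williams recurrence for Ward's criterion on a precomputed proximity matrix, is shown below. Two caveats: Ward's method is formally defined for squared Euclidean distances, so applying it to DTW or FTDTW scores (as here, and in the paper's use of the Ward method) is a heuristic; and the code is our own illustration, not the authors' implementation.

```python
def ahc_ward(D, R):
    """Agglomerative hierarchical clustering with Ward's linkage.
    D is a symmetric N x N proximity matrix (e.g. DTW or FTDTW values);
    merging continues until R clusters remain, which are returned as
    lists of original object indices."""
    n = len(D)
    sizes = {i: 1 for i in range(n)}        # active cluster sizes
    members = {i: [i] for i in range(n)}    # active cluster memberships
    # squared dissimilarities between active clusters, keyed (low id, high id)
    d = {(i, j): float(D[i][j]) ** 2
         for i in range(n) for j in range(i + 1, n)}
    nxt = n                                 # id assigned to the next merge
    while len(sizes) > R:
        i, j = min(d, key=d.get)            # closest pair under Ward's criterion
        ni, nj, dij = sizes[i], sizes[j], d[(i, j)]
        for k in list(sizes):
            if k == i or k == j:
                continue
            nk = sizes[k]
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Lance-Williams update for Ward's criterion
            d[(k, nxt)] = ((ni + nk) * dik + (nj + nk) * djk
                           - nk * dij) / (ni + nj + nk)
        # retire clusters i and j, activate the merged cluster
        sizes[nxt] = ni + nj
        members[nxt] = members.pop(i) + members.pop(j)
        del sizes[i], sizes[j]
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        nxt += 1
    return list(members.values())
```

For instance, clustering four 1-D points 0, 0.1, 5, 5.1 by their pairwise absolute differences with R = 2 recovers the two obvious groups.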
Figure 2: Alignment by FTDTW of spectral features extracted from the triphone b-aa+dx as uttered by (a) male speaker mrfk0 and (b) by female speaker fdml0 in the TIMIT corpus.

4.2 F-measure

The F-measure is based on the quantities precision (PR) and recall (RE). Precision indicates the degree to which a cluster is dominated by a particular class, while recall indicates the degree to which a particular class is concentrated in a specific cluster. Precision and recall are defined in Equations 7 and 8 respectively.

    PR(r, v) = \frac{n_{rv}}{n_r}    (7)

    RE(r, v) = \frac{n_{rv}}{n_v}    (8)

Here n_{rv} indicates the number of objects of class v in cluster r; n_r and n_v are the number of objects in cluster r and class v respectively. The F-measure (F) is given in Equation 9.

    F(r, v) = \frac{2 \times RE(r, v) \times PR(r, v)}{RE(r, v) + PR(r, v)}    (9)

When the clusters are perfect, n_{rv} = n_r = n_v and hence F(r, v) = 1.

4.3 Normalised mutual information

Normalised mutual information (NMI) employs the following formulations:

• the set of R clusters G = {G_1, G_2, ..., G_R}, and
• the set of V classes C = {C_1, C_2, ..., C_V} representing ground truth.

NMI is based on the mutual information I(G, C) between classes and clusters [22, 25]. The mutual information is not sensitive to a varying number of clusters and is therefore normalised by a factor based on the cluster entropy H(G) and class entropy H(C). These entropies measure cluster and class cohesiveness respectively. The NMI criterion is given in Equation 10.

    NMI(G, C) = \frac{2 \, I(G, C)}{H(G) + H(C)}    (10)

The mutual information I(G, C) and the entropies H(G) and H(C) are given in Equations 11, 12 and 13 respectively.
    I(G, C) = \sum_{r \in G} \sum_{v \in C} P(G_r \cap C_v) \log \frac{P(G_r \cap C_v)}{P(G_r) \, P(C_v)}    (11)

In Equation 11, P(G_r), P(C_v) and P(G_r \cap C_v) are the probabilities of a segment belonging to cluster G_r, to class C_v, and to the intersection of G_r and C_v respectively.

    H(G) = - \sum_{r \in G} P(G_r) \log P(G_r)    (12)

    H(C) = - \sum_{v \in C} P(C_v) \log P(C_v)    (13)

It can be shown that NMI(G, C) is zero when the clustering is random with respect to class membership and that it achieves a maximum of 1.0 for perfect clustering [25].

5 Data

Our first set of experiments uses speech segments taken from the TIMIT speech corpus [26]. TIMIT has been chosen because it includes accurate time-aligned phonetic transcriptions, meaning that both phonetic labels and their start/end times are known. As our desired clusters we use triphones, which are phones in specific left and right contexts [27]. We consider triphones that occur at least 20 times and at most 25 times in the corpus. This leads to an evenly balanced set of 8772 speech segments, which also corresponds approximately to the number of segments in our second set of experiments.

For comparison and confirmation purposes, we performed a second set of experiments using the Spoken Arabic Digit Dataset (SADD) [28]. SADD consists of 8800 utterances already parametrised as 13-dimensional MFCCs. The utterances were spoken by 44 male and 44 female Arabic speakers. Each utterance in the SADD corresponds to a single Arabic digit, and will therefore be considered to be a single segment in our experiments. Each digit (0 to 9) was uttered ten times by each speaker.

A third set of experiments is based on 10 independent subsets of speech segments drawn from the TIMIT SI and SX utterances, irrespective of occurrence frequency. This better represents the unbalanced distribution of triphones that may be expected in unconstrained speech.
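Both external criteria of Section 4 can be computed directly from the cluster and class label counts. The pure-Python sketch below is our own illustration; in particular, the paper defines the F-measure only per cluster-class pair (Equations 7-9), so aggregating the per-pair values into a single score by taking the best F per class, weighted by class size, is an assumption on our part (a common convention, cf. [21]).

```python
import math
from collections import Counter

def f_measure(clusters, classes):
    """Overall F-measure from parallel lists of cluster assignments and
    ground-truth class labels (Equations 7-9). The class-size-weighted
    best-F-per-class aggregation is our assumption, not the paper's."""
    n_rv = Counter(zip(clusters, classes))  # objects of class v in cluster r
    n_r = Counter(clusters)                 # cluster sizes
    n_v = Counter(classes)                  # class sizes

    def f(r, v):
        pr = n_rv[(r, v)] / n_r[r]          # precision, Equation 7
        re = n_rv[(r, v)] / n_v[v]          # recall, Equation 8
        return 2 * pr * re / (pr + re) if pr + re > 0 else 0.0

    total = len(classes)
    return sum(n_v[v] / total * max(f(r, v) for r in n_r) for v in n_v)

def nmi(clusters, classes):
    """Normalised mutual information (Equations 10-13) from parallel
    lists of cluster and class labels."""
    n = len(clusters)
    p_r = {r: c / n for r, c in Counter(clusters).items()}
    p_v = {v: c / n for v, c in Counter(classes).items()}
    p_rv = {rv: c / n for rv, c in Counter(zip(clusters, classes)).items()}
    # mutual information, Equation 11
    i = sum(p * math.log(p / (p_r[r] * p_v[v])) for (r, v), p in p_rv.items())
    h_g = -sum(p * math.log(p) for p in p_r.values())   # Equation 12
    h_c = -sum(p * math.log(p) for p in p_v.values())   # Equation 13
    return 2 * i / (h_g + h_c)                          # Equation 10
```

A perfect clustering yields 1.0 under both measures, while a clustering that is random with respect to the classes drives the NMI to zero.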
Table 1 summarises the datasets used in each of the three sets of experiments.

Table 1: Datasets used for experimental evaluation.

  Dataset  Description
  1        8772 TIMIT triphones (evenly balanced).
  2        8800 SADD isolated digits (evenly balanced).
  3        123182 TIMIT SI and SX triphones divided randomly into 10 subsets (not evenly balanced).

We considered two feature vector parametrisations popular in the field of speech processing, namely mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) coefficients [29, 30]. For the former, log frame energy was appended to the first 12 MFCCs to produce a 13-dimensional feature vector. First and second differentials (velocity and acceleration) were subsequently added to produce the final 39-dimensional MFCC feature vector. For the latter, 13 PLP coefficients were considered, to which velocity and acceleration were added, again resulting in a 39-dimensional feature vector. One such feature vector was extracted for each 10ms frame of speech, where consecutive frames overlapped by 5ms. All TIMIT feature vectors were computed using HTK [31]. SADD provides pre-computed MFCC features, and hence PLP features were not used in the associated experiments.

6 Experiments

To evaluate the performance of Feature Trajectory DTW (FTDTW) as an alternative to classical DTW as a similarity measure, we employ it to perform AHC of the speech segments described in Section 5. The quality of the automatically-determined clusters is assessed using the F-measure and in several cases also NMI.

In a first set of experiments, we cluster Dataset 1 (Table 1). Figure 3 reflects the clustering performance in terms of (a) the F-measure and (b) NMI when using MFCCs as features. Both the F-measure and NMI are plotted as a function of the number of clusters.
Note that the F-measure continues to decline as the number of clusters exceeds 1200.

Figure 3: Clustering performance for Dataset 1 when using MFCC features in terms of (a) F-measure and (b) NMI.

Figures 3(a) and 3(b) show that FTDTW improves on the performance of classical DTW in this clustering task in terms of both F-measure and NMI. Especially in terms of F-measure, this improvement is substantial. A corresponding set of experiments using PLP features was carried out for Dataset 1, and the results are shown in Figure 4. The same trends seen for MFCCs in Figure 3 are observed, with substantial improvements particularly in terms of F-measure.

Figure 4: Clustering performance for Dataset 1 when using PLP features in terms of (a) F-measure and (b) NMI.

In a second set of experiments, we clustered Dataset 2 (Table 1), which consists of isolated Arabic digits. Figure 5 indicates the clustering performance, both in terms of F-measure and NMI, for this dataset. Again we observe that FTDTW outperforms classical DTW in terms of both F-measure and NMI in practically all cases.

In a third and final set of experiments, we considered Dataset 3 (Table 1). The 10 independent subsets of the TIMIT training set each contained between 12034 and 12495 triphone segments. In contrast to the experiments for Dataset 1, all triphone tokens were considered irrespective of occurrence frequency. Furthermore, the number of clusters was chosen to be 2394, a figure which corresponds to the number of triphone types with more than 10 occurrences in the data. A single number of clusters, rather than a range as presented in Figures 3, 4 and 5, has been used here in order to make the required computations practical. Figure 6 presents the clustering performance for each of the 10 subsets in terms of F-measure.
We observe that FTDTW achieves an improvement over classical DTW in all cases. A paired t-test indicated p < 0.0001, and hence the improvements are statistically highly significant. Similar improvements were observed in terms of NMI.

Figure 5: Clustering performance for Dataset 2 in terms of (a) F-measure and (b) NMI.

Figure 6: Clustering performance for the 10 independent subsets of Dataset 3 in terms of F-measure.

7 Discussion and conclusions

The experiments in Section 6 have applied our modified DTW algorithm (FTDTW) to the clustering of speech segments. Our experiments show consistent and statistically significant improvements over the classical DTW baseline for both MFCC and PLP parametrisations and across three datasets. We conclude that FTDTW is more effective as a similarity measure for speech signals than classical DTW.

Because classical DTW operates on a feature-vector-by-feature-vector basis, it enforces absolute temporal synchrony between the feature trajectories. In contrast, FTDTW does not impose this synchrony constraint, but aligns feature trajectories independently on a pair-by-pair basis. Since FTDTW is observed to lead to better clusters in our experiments, we conclude that the strict temporal synchrony imposed by classical DTW is counter-productive in the case of speech signals. We further speculate that segments of speech that human listeners would regard as similar also exhibit such differing time-scale warping among the feature trajectories. It remains to be seen whether this decoupling of the feature trajectories is advantageous for signals other than speech.

Finally, and noting that it is not a focus of this paper, we may consider the maxima observed in the F-measure in Figures 3 and 4, and in both the F-measure and NMI in Figure 5. A peak in the quality of the clusters as a function
of the number of clusters may be taken to indicate the best estimate of the 'true' number of clusters in the data. For the experiments using the MFCC parametrisation of Dataset 1 (Figure 3), we see that an optimum in the F-measure is reached at 501 and 421 clusters for FTDTW and classical DTW respectively. The 'true' number of clusters corresponds to the number of triphone types in Dataset 1, which is 404. Hence both DTW formulations over-estimate the number of clusters. A similar tendency is seen for the PLP parametrisations of the same dataset, where the F-measure peaks at 439 and 559 clusters for classical DTW and the proposed DTW respectively, and also for Dataset 2 in Figure 5.

Although the ground truth is known, the class definitions (triphones for Datasets 1 and 3, and isolated digits for Dataset 2) may be called into question. In particular, although all triphones correspond to acoustic segments from the same phone within the same left and right contexts, there are many other possible sources of systematic variability, such as the accent of the speaker. Hence it may be reasonable to expect that a larger number of clusters is needed to optimally model the data. To determine whether this is the case, the clusters should be used to determine acoustic models for an ASR system. Then the performance of varying clusterings of the data can be compared by comparing the performance of the resulting ASR systems. We intend to address this question in ongoing work.

References

[1] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.

[2] C. Myers, L.R. Rabiner, and A.E. Rosenberg. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6):623–635, 1980.

[3] Lindasalwa Muda, Mumtaj Begam, and I. Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083, 2010.

[4] Xianglilan Zhang, Jiping Sun, and Zhigang Luo. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions. PLoS ONE, 9(2):e85458, 2014.

[5] Yaodong Zhang and James R. Glass. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. IEEE Automatic Speech Recognition & Understanding Workshop (ASRU), pages 398–403, 2009.

[6] Xavier Anguera. Information retrieval-based dynamic time warping. In Proc. Interspeech, pages 1–5, 2013.

[7] Lin-Shan Lee, James Glass, Hung-Yi Lee, and Chun-An Chan. Spoken content retrieval—beyond cascading speech recognition with text retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(9):1389–1420, 2015.

[8] A. Park and J. Glass. Towards unsupervised pattern discovery in speech. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2005.

[9] Oliver Walter, Timo Korthals, Reinhold Haeb-Umbach, and Bhiksha Raj. A hierarchical system for word discovery exploiting DTW-based initialization. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 386–391, 2013.

[10] A. Piyush Shanker and A.N. Rajagopalan. Off-line signature verification using DTW. Pattern Recognition Letters, 28(12):1407–1414, 2007.

[11] Shigeki Sagayama, Shigeki Matsuda, Mitsuru Nakai, and Hiroshi Shimodaira. Asynchronous-transition HMM for acoustic modeling.
In International Workshop on Automatic Speech Recognition and Understanding, pages 99–102, 1999.

[12] T. Svendsen, K.K. Paliwal, E. Harborg, and P.O. Husoy. An improved sub-word based speech recognizer. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 729–732, 1989.

[13] K.K. Paliwal. Lexicon-building methods for an acoustic sub-word based speech recognizer. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 108–111, 1990.

[14] H. Wang, T. Lee, C. Leung, B. Ma, and H. Li. Unsupervised mining of acoustic subword units with segment-level Gaussian posteriograms. In Proc. Interspeech, pages 2297–2301, 2013.

[15] K. Livescu, E. Fosler-Lussier, and F. Metze. Subword modeling for automatic speech recognition: Past, present, and emerging approaches. IEEE Signal Processing Magazine, pages 44–57, 2012.

[16] Herman Kamper, Aren Jansen, Simon King, and Sharon Goldwater. Unsupervised lexical clustering of speech segments using fixed dimensional acoustic embeddings. In Proc. IEEE Spoken Language Technology Workshop (SLT), 2014.

[17] Eamonn J. Keogh and Michael J. Pazzani. Derivative dynamic time warping. In SDM, volume 1, pages 5–7. SIAM, 2001.

[18] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.

[19] Fionn Murtagh and Pedro Contreras. Methods of hierarchical clustering. arXiv preprint arXiv:1105.0121, 2011.

[20] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.

[21] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proc.
ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 16–22, New York, USA, 1999.

[22] N.X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalisation and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010.

[23] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[24] Fionn Murtagh and Pierre Legendre. Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion? Journal of Classification, 31(3):274–295, 2014.

[25] C.D. Manning and P. Raghavan. Introduction to Information Retrieval. Cambridge University Press, New York, USA, 2008.

[26] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, David S. Pallett, Nancy Dahlgren, and Victor Zue. DARPA TIMIT acoustic-phonetic continuous speech corpus LDC93S1. 1993.

[27] B. Imperl, Z. Kacic, B. Horvat, and A. Zgank. Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones. Speech Communication, 39(4):353–366, 2003.

[28] M. Lichman. UCI machine learning repository, 2013.

[29] Steven B. Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357–366, 1980.

[30] Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.

[31] S.J. Young, G. Evermann, M.J.F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P.C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.