Exact Indexing for Massive Time Series Databases under Time Warping Distance
Among many existing distance measures for time series data, Dynamic Time Warping (DTW) distance has been recognized as one of the most accurate and suitable distance measures due to its flexibility in sequence alignment. However, DTW distance calcula…
Authors: Vit Niennattrakul, Pongsakorn Ruengronghirunya, Chotirat Ann Ratanamahatana
Abstract Among many existing distance measures for time series data, Dynamic Time Warping (DTW) distance has been recognized as one of the most accurate and suitable distance measures due to its flexibility in sequence alignment. However, DTW distance calculation is computationally intensive. Especially in very large time series databases, a sequential scan through the entire database is definitely impractical, even with random access that exploits some index structures, since the high dimensionality of time series data incurs extremely high I/O cost. More specifically, a sequential structure consumes high CPU but low I/O costs, while an index structure requires low CPU but high I/O costs. In this work, we therefore propose a novel indexed sequential structure called TWIST (Time Warping in Indexed Sequential sTructure) which benefits from both sequential access and index structure. When a query sequence is issued, TWIST calculates lower bounding distances between a group of candidate sequences and the query sequence, and then identifies the data access order in advance, hence reducing a great number of both sequential and random accesses. Impressively, our indexed sequential structure achieves significant speedup in a querying process by a few orders of magnitude. In addition, our method shows superiority over existing rival methods in terms of query processing time, number of page accesses, and storage requirement with no false dismissal guaranteed.

Keywords Time Series, Indexing, Dynamic Time Warping

V. Niennattrakul, Department of Computer Engineering, Chulalongkorn University. E-mail: g49vnn@cp.eng.chula.ac.th
P. Ruengronghirunya, Department of Computer Engineering, Chulalongkorn University. E-mail: g51prn@cp.eng.chula.ac.th
C.A. Ratanamahatana, Department of Computer Engineering, Chulalongkorn University. E-mail: ann@cp.eng.chula.ac.th

1 Introduction

Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994; Ratanamahatana and Keogh, 2004, 2005; Sakurai et al, 2007) has been known as one of the best distance measures (Ding et al, 2008; Keogh and Kasetty, 2003) suited for the time series domain over the traditional Euclidean distance because DTW distance has much more flexibility in sequence alignment. In addition, DTW distance tries to find the best warping, while Euclidean distance is calculated in a one-to-one manner, as shown in Figure 1. However, DTW distance has a major drawback, i.e., it requires extremely high computational cost, especially when DTW distance is used in similarity search problems, including top-k query. More specifically, in the top-k querying problem, after a query sequence has been issued, a set of k candidate sequences most similar to the query sequence, ranked by DTW distance, is returned. Traditionally, the naïve approach needs to calculate DTW distances for all candidate sequences. As a result, its query processing time mainly depends on distance calculation and the number of data accesses.

Fig. 1 The comparison of sequence alignments between a) Euclidean distance and b) DTW distance

So far, many speedup techniques have been proposed, including lower bounding functions and index structures. Lower bounding functions (Yi et al, 1998; Kim et al, 2001; Keogh and Ratanamahatana, 2005; Zhu and Shasha, 2003; Sakurai et al, 2005), whose complexity is typically much lower than that of a DTW distance measure, are used for a lower bounding distance calculation which guarantees that DTW distance must be equal to or larger than the lower bounding distance.
Additionally, in sequential scan, before calculating DTW distance between the query sequence and a candidate sequence, a lower bounding function is utilized to approximate and prune off the candidate sequence which has a larger lower bounding distance than the current best-so-far distance. In indexing, the lower bounding distance is also used to guide the similarity search. Currently, many lower bounding functions have been proposed to reduce computational costs, including LB_Yi (Yi et al, 1998), LB_Kim (Kim et al, 2001), LB_Keogh (Keogh and Ratanamahatana, 2005), LB_PAA (Keogh and Ratanamahatana, 2005), LB_NewPAA (Zhu and Shasha, 2003), and LBS (Sakurai et al, 2005). It has been widely known that LB_Keogh and LBS are among the most efficient lower bounding functions, where LB_Keogh has lower time complexity, while LBS has a tighter bound.

Besides lower bounding functions, various index structures for DTW distance have been proposed to guide the search to access only some parts of the database. In other words, the search result is returned while only a small portion of the database is accessed for distance calculation, i.e., when querying, the index structure determines which parts of the database are likely to contain answers, and then the raw data on disk are randomly accessed. Generally, this index structure should be small enough to fit in main memory. Currently, two exact indexing approaches are typically used, i.e., the GEMINI framework with LB_PAA (Keogh and Ratanamahatana, 2005), and a more recent approach, FTW indexing (Sakurai et al, 2005). Note that exact indexing returns a set of querying results with no false dismissal guaranteed; in other words, the best answers must be included in the results. The GEMINI framework (Faloutsos et al, 1994) typically utilizes a multi-dimensional tree, e.g., R*-tree (Beckmann et al, 1990), as an index structure, while FTW indexing stores indices in a flat file.
However, current indexing techniques are burdened with a huge amount of I/O cost since random access to the database is typically 5 to 10 times slower than sequential access (Weber et al, 1998). Therefore, indexing is efficient only when less than 20% of the raw data sequences are accessed on average (the break-even point for a fivefold slowdown). In practice, current indexing techniques still consume large I/O overheads, which makes them unsuitable for massive databases.

In this work, we propose a novel index structure and access method under DTW distance called TWIST (Time Warping in Indexed Sequential sTructure). TWIST utilizes advantages from both sequential structure and index structure, i.e., low I/O and low CPU costs. Instead of randomly accessing the raw time series data like other indexing techniques, TWIST separates and stores a collection of time series data in sequential structures or flat files. For each file, TWIST generates a representative sequence (called an envelope) and stores this sequence in an index structure. When a query sequence is issued, a lower bounding distance is calculated for each envelope using our newly proposed lower bounding function for a group of sequences (LBG). The lower bounding distance between an envelope and a query sequence guarantees that the DTW distance between the query sequence and each and every candidate sequence under this envelope is never smaller than this lower bounding distance. Consequently, if the lower bounding distance is larger than the best-so-far distance, no access to the sequences within the envelope is needed; otherwise, every sequence in the envelope is sequentially accessed for DTW distance calculation.

We evaluate our proposed method, TWIST, by comparing it with the current best approaches, i.e., FTW indexing and sequential scan with the LB_Keogh lower bounding function.
As will be demonstrated, TWIST prunes off a large number of candidate sequences and is faster than the rival methods by a few orders of magnitude. Furthermore, when the size of databases exponentially increases, our query processing time only grows linearly.

The rest of the paper is organized as follows. Section 2 provides a literature review of related work in speeding up similarity search under DTW distance. Section 3 gives the necessary background. In Section 4, our proposed index structure TWIST, its access method, and the novel proposed lower bounding distance functions are described. We show the superiority of TWIST over the best existing methods in Section 5. Finally, we conclude our work and provide the direction of future research.

2 Related Work

Since the Dynamic Time Warping (DTW) distance measure (Berndt and Clifford, 1994) was introduced to the data mining community (Keogh and Kasetty, 2003; Loh et al, 2004; Wang et al, 2006; Vlachos et al, 2006; Bagnall et al, 2006; Lin et al, 2007), it has shown superiority in similarity matching over the traditional Euclidean distance, long used in time series data mining, due to its great flexibility in sequence alignment. Specifically, DTW distance utilizes dynamic programming to find the optimal warping path and calculate the distance between two time series sequences. Unfortunately, to calculate DTW distance, exhaustive computation is generally required. In addition, since DTW distance does not qualify as a distance metric, neither distance-based (Ciaccia et al, 1997; Yianilos, 1993) nor spatial-based (Berchtold et al, 1996; Guttman, 1984; Beckmann et al, 1990) index structures can be used efficiently in similarity search under DTW distance. Therefore, various lower bounding functions and indexing techniques for DTW distance have been proposed to resolve these problems.

Yi et al. (Yi et al, 1998) first propose a lower bounding function, LB_Yi, using two features of a time series sequence, i.e., the minimum and maximum values. LB_Yi creates an envelope over a query sequence from these minimum and maximum values, and then the distance is computed from the summation of areas between the envelope and a candidate sequence, as shown in Figure 2 a). Instead of using only two features, Kim et al. (Kim et al, 2001) suggest two additional features, i.e., the first and the last values of the sequence. LB_Kim then calculates the distance from the feature tuples of a query sequence and a candidate sequence, as shown in Figure 2 b). Although these two lower bounding functions have low time complexity, the use of LB_Yi and LB_Kim is not practical since their lower bounding distances cannot prune off many of the DTW distance calculations.

Fig. 2 Illustration of lower bounding distance calculation between a query sequence and a candidate sequence when using a) LB_Yi, b) LB_Kim, and c) LB_Keogh

Fig. 3 Shapes of a) Sakoe-Chiba band, b) Itakura Parallelogram, and c) Ratanamahatana-Keogh band

Keogh et al. propose a tighter lower bounding function, LB_Keogh, utilizing global constraints (Sakoe and Chiba, 1978; Itakura, 1975; Ratanamahatana and Keogh, 2004), which are generally used to limit the scope of warping in the distance matrix to prevent undesirable paths. Various well-known global constraints have been proposed, e.g., Sakoe-Chiba band (Sakoe and Chiba, 1978), Itakura Parallelogram (Itakura, 1975), and Ratanamahatana-Keogh (R-K) band (Ratanamahatana and Keogh, 2004). To be more illustrative, Figure 3 shows different shapes of global constraints.
Note that the R-K band is an arbitrary-shaped constraint which can represent any band by using only a single one-dimensional array. LB_Keogh first creates an envelope over a query sequence according to the shape and size of the global constraint. Its lower bounding distance is then the area between the envelope and a candidate sequence, as shown in Figure 2 c).

In addition, Keogh et al. also propose an indexing technique which utilizes a discretized version of their lower bounding function, LB_PAA. In order to create an index structure, they reduce the dimensions of each time series sequence using the Piecewise Aggregate Approximation (PAA) technique (Keogh et al, 2001), and store the reduced sequence in a multi-dimensional index structure such as R*-tree (Beckmann et al, 1990). Each leaf node of the tree, stored on disk, contains a collection of segmented sequences, where each sequence points to its raw time series data. In the querying process, an envelope of the query sequence is created and discretized. Then, each MBR (Minimum Bounding Rectangle) of the R*-tree is retrieved and compared with the segmented query sequence until a leaf node is retrieved in a random-access manner. Next, lower bounding distances are calculated with LB_PAA for all discretized candidate sequences in the leaf node. If the lower bounding distance from LB_PAA is smaller than the best-so-far distance, the raw time series sequence is also retrieved by random access, and the distances are determined using LB_Keogh and DTW distance, respectively. It is clear that Keogh et al.'s index structure requires too many random accesses even when the database size increases only slightly. Note that although Zhu et al. later propose a tighter lower bounding function, LB_NewPAA (Zhu and Shasha, 2003), the index structure still consumes high I/O cost.

Sakurai et al. (Sakurai et al, 2005) propose a new lower bounding function, LBS (Lower Bounding distance measure with Segmentation), which requires a quadratic time complexity $O(n^2/t^2)$, where $n$ is the length of the time series and $t$ is the size of a segment. To calculate the lower bounding distance, LBS first quantizes a query sequence and a candidate sequence into sequences of segments. Each segment contains two values that indicate the maximum and minimum among the data points in the segment. Then, dynamic programming is used to find the optimal distance between these two segmented sequences, and the resulting distance is a lower bound of the DTW distance. Despite the fact that LBS requires more computational time and space than LB_PAA at the same resolution, LBS achieves a much tighter lower bounding distance. An example of segmented sequences is shown in Figure 4.

Fig. 4 Illustration of segmented sequences with various resolutions

To use LBS in indexing, Sakurai et al. propose an index structure which stores pre-calculated segmented sequences. For each time series, a set of segmented sequences is generated by varying segment sizes from the coarsest to the finest, and each segmented sequence is stored in a flat file with a pointer to the raw time series data. In the querying process, a query sequence is segmented, and then the index structure is sequentially accessed to calculate lower bounding distances with the pre-segmented candidate sequences. If the lower bounding distance is smaller than the best-so-far distance, the raw time series data is retrieved in a random-access manner. However, the main drawback of FTW is that the size of the index structure is approximately twice the size of the raw time series database.
Therefore, this index structure is definitely impractical for massive time series databases since the entire index file, which is larger than the raw data, must be read once for every single query, causing large I/O overheads.

It is worth noting that the existing index structures are not designed for massive databases. For example, since LB_PAA utilizes PAA to reduce the number of dimensions, its pruning power significantly decreases as the database size increases; therefore, a huge number of sequences must be accessed for distance calculation. Similarly for FTW indexing, when the database size increases, the index, at roughly twice the size of the raw data, grows along with it. In Section 5, our experiments will demonstrate that when the database exceeds the size of the main memory, our proposed method significantly outperforms these rival methods.

3 Background

Before describing our proposed method, TWIST, we provide some background knowledge, i.e., Dynamic Time Warping (DTW) distance, global constraints, and lower bounding distance functions including LB_Keogh and LBS.

3.1 Dynamic Time Warping Distance

Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994; Ratanamahatana and Keogh, 2005, 2004) is a well-known shape-based similarity measure. It uses a dynamic programming technique to find an optimal warping path between two time series sequences. To calculate the distance, it first creates a distance matrix, where each element is the local distance plus the minimum cumulative distance among its three surrounding neighbors. Suppose we have two time series, a sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ and a sequence $C = \langle c_1, \ldots, c_j, \ldots, c_m \rangle$.
First, we create an $n$-by-$m$ matrix, and then each $(i, j)$ element, $\gamma_{i,j}$, of the matrix is defined as:

\[ \gamma_{i,j} = |q_i - c_j|^p + \min\{\gamma_{i-1,j-1},\ \gamma_{i-1,j},\ \gamma_{i,j-1}\} \tag{1} \]

where $\gamma_{i,j}$ is the summation of $|q_i - c_j|^p$ and the minimum cumulative distance of the three elements surrounding the $(i, j)$ element, and $p$ is the dimension of the $L_p$-norm. For the time series domain, $p = 2$, corresponding to Euclidean distance, is typically used. After we have all distance elements in the matrix, to find an optimal path, we choose the path $W = \langle w_1, \ldots, w_k, \ldots, w_K \rangle$ that yields the minimum cumulative distance at $(n, m)$, where $w_k$ is the position $(i, j)$ at the $k$-th element of a warping path, $w_1 = (1, 1)$, and $w_K = (n, m)$. The DTW distance is defined as:

\[ DTW(Q, C) = \min_{\forall W \in \mathbb{W}} \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{2} \]

where $d_{w_k}$ is the $L_p$ distance at the position $w_k$, $p$ is the dimension of the $L_p$-norm in Equation 1, and $\mathbb{W}$ is the set of all possible warping paths. The recursive functions are shown in Equations 3 and 4. Note that, in the original DTW, the $p$-th root of the distance must be computed; however, for fast computation, we usually omit this calculation since the ranking of distance values does not change.

\[ DTW(Q, C) = \sqrt[p]{D(n, m)} \tag{3} \]

\[ D(i, j) = |q_i - c_j|^p + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{4} \]

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, and $1 \le j \le m$.

3.2 Global Constraints

Although the unconstrained DTW distance measure gives an optimal distance between two time series, an unwanted warping path may be generated. A global constraint efficiently limits the optimal path to give a more suitable alignment. Recently, the R-K band (Ratanamahatana and Keogh, 2004), a general model of global constraints, has been proposed. It can be specified by a one-dimensional array $R$, i.e., $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, where $n$ is the length of the time series, and $r_i$ is the height above the diagonal in the $y$ direction and the width to the right of the diagonal in the $x$ direction, as shown in Figure 5. Each $r_i$ value is arbitrary; therefore, the R-K band is also an arbitrary-shaped global constraint. Note that when $r_i = 0$ for all $1 \le i \le n$, the R-K band represents the well-known Euclidean distance, and when $r_i = n$ for all $1 \le i \le n$, it represents the original DTW distance with no global constraint. The R-K band can also represent the Sakoe-Chiba (S-C) band by setting all $r_i = c$, where $c$ is the width of the global constraint.

Fig. 5 Global constraint on DTW distance matrix when applying a specific R-K band

3.3 Lower Bounding Distance Function

A lower bounding distance function for DTW distance is a function that calculates a lower bounding distance which must always be smaller than or equal to the exact DTW distance (Yi et al, 1998; Kim et al, 2001; Keogh and Ratanamahatana, 2005; Zhu and Shasha, 2003; Sakurai et al, 2005). Therefore, in similarity search, the lower bounding function is used to prune off the candidate sequences that are definitely not the answers. Typically, a lower bounding function consumes much less computational time than the DTW distance does. In this work, we consider two lower bounding functions, i.e., LB_Keogh (Keogh and Ratanamahatana, 2005) proposed by Keogh et al. and LBS (Sakurai et al, 2005) proposed by Sakurai et al., since LB_Keogh is the best existing lower bounding function used in sequential search, and LBS is the tightest lower bounding function used in indexing. LB_Keogh creates an envelope from a query sequence, and then the lower bounding distance is calculated from the areas between the envelope and a candidate sequence.
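The recurrence of Equations 3 and 4, restricted by a global constraint, together with the envelope-based bound just described, can be illustrated with a minimal sketch. It assumes a constant band width `r` (the Sakoe-Chiba special case of an R-K band with all $r_i = c$) and equal-length sequences for the bound; the function names `dtw` and `lb_keogh` are hypothetical and not taken from the paper.

```python
import math

def dtw(q, c, r, p=2):
    """Banded DTW following Equations 3-4; cell (i, j) is allowed iff |i - j| <= r."""
    n, m = len(q), len(c)
    # D has an extra row/column so that D(0,0)=0 and D(i,0)=D(0,j)=inf.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = abs(q[i - 1] - c[j - 1]) ** p
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)

def lb_keogh(q, c, r, p=2):
    """Envelope-based lower bound: envelope built over the query, compared with c."""
    n = len(q)
    total = 0.0
    for i in range(n):
        window = q[max(0, i - r): min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if c[i] > upper:        # candidate point above the envelope
            total += (c[i] - upper) ** p
        elif c[i] < lower:      # candidate point below the envelope
            total += (lower - c[i]) ** p
    return total ** (1.0 / p)

q = [0, 1, 2, 3, 2, 1, 0, 0]
c = [0, 0, 1, 2, 3, 2, 1, 0]   # q shifted by one time step
print(dtw(q, c, 0), dtw(q, c, 1), lb_keogh(q, c, 1))
```

With `r = 0` the band collapses to the diagonal and the result equals the Euclidean distance, while `r = 1` is already enough to absorb the one-step shift in this example.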
Unlike LB_Keogh, LBS creates a segmented query sequence and a segmented candidate sequence, and then these two segmented sequences are used to determine a lower bounding distance using dynamic programming.

3.3.1 LB_Keogh

To calculate LB_Keogh (Keogh and Ratanamahatana, 2005), an envelope $E = \langle e_1, \ldots, e_i, \ldots, e_n \rangle$ is generated from a query sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$, where $e_i = \{u_i, l_i\}$, and $u_i$ and $l_i$ are the upper and lower values of $e_i$. With a specified global constraint $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, elements $u_i$ and $l_i$ are computed from $u_i = \max\{q_{i-r_i}, \ldots, q_{i+r_i}\}$ and $l_i = \min\{q_{i-r_i}, \ldots, q_{i+r_i}\}$, respectively. The lower bounding distance $LB_{Keogh}(Q, C)$ between sequences $Q$ and $C$ can be computed by the following equation.

\[ LB_{Keogh}(Q, C) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |c_i - u_i|^p & \text{if } c_i > u_i \\ |l_i - c_i|^p & \text{if } c_i < l_i \\ 0 & \text{otherwise} \end{cases}} \tag{5} \]

where $p$ is the dimension of the $L_p$-norm. The proof of $LB_{Keogh}(Q, C) \le DTW(Q, C)$ can be found in the original paper (Keogh and Ratanamahatana, 2005).

3.3.2 LBS

To calculate LBS (Lower Bounding distance measure with Segmentation), a query sequence and a candidate sequence must first be segmented. The segmented sequence $S^T = \langle s^T_1, \ldots, s^T_b, \ldots, s^T_t \rangle$ is calculated from the sequence $S = \langle s_1, \ldots, s_a, \ldots, s_A \rangle$ with a given segment size $T$, where $s^T_b = \{us^T_b, ls^T_b\}$, $us^T_b = \max\{s_x, \ldots, s_y\}$, $ls^T_b = \min\{s_x, \ldots, s_y\}$, $x = (b - 1) \cdot T + 1$, $y = b \cdot T$, and $1 \le T \le A$. Although LBS is capable of supporting segments with different lengths, in this work, we consider segments of equal length to demonstrate the maximum performance of LBS. The lower bounding distance $LBS(Q^T, C^T)$ between a segmented query sequence $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_n \rangle$ and a segmented candidate sequence $C^T = \langle c^T_1, \ldots, c^T_j, \ldots, c^T_m \rangle$, where $n$ and $m$ here denote the numbers of segments, can be computed by the following equations.

\[ LBS(Q^T, C^T) = \sqrt[p]{D(n, m)} \tag{6} \]

\[ D(i, j) = T \cdot d(q^T_i, c^T_j) + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{7} \]

\[ d(q^T_i, c^T_j) = \begin{cases} |lq^T_i - uc^T_j|^p & \text{if } lq^T_i > uc^T_j \\ |lc^T_j - uq^T_i|^p & \text{if } lc^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{8} \]

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, $1 \le j \le m$, $q^T_i = \{uq^T_i, lq^T_i\}$, $c^T_j = \{uc^T_j, lc^T_j\}$, and $p$ is the dimension of the $L_p$-norm. The proof of $LBS(Q^T, C^T) \le DTW(Q, C)$ can be found in Sakurai et al.'s original paper (Sakurai et al, 2005).

4 Time Warping in Indexed Sequential sTructure (TWIST)

In this work, we propose a novel index structure called TWIST (Time Warping in Indexed Sequential sTructure) which consists of both sequential structures and an index structure. Each sequential structure stores a collection of raw time series sequences, and the index structure stores a representative of each sequential structure together with a pointer to it. The intuitive idea of TWIST is to minimize both the number of random accesses and the number of distance calculations, making TWIST a much more suitable choice for massive databases than the existing methods, which do not scale well.

4.1 Problem Definition

We are interested in generic top-k querying in this work since many other mining tasks, e.g., classification and clustering, require this best-matched querying as a typical subroutine. Given a query sequence $Q$, a set $\mathbb{C}$ of equal-length time series sequences, a global constraint $R$, and an integer $k$, a top-k query returns the set of $k$ nearest-neighbor sequences of $Q$ from $\mathbb{C}$ under the DTW distance measure with the constraint $R$.

4.2 Data Structure

In this section, we describe the data structure of TWIST, which is specially designed to minimize both the I/O and CPU costs in the querying process.
TWIST consists of two main components, i.e., a set of sequential structures (called Data Sequence Files, DSFs) and an index structure (called the Envelope Sequence File, ESF). In addition, TWIST groups similar sequences into the same sequential structure so that, in the querying process, if a sequential structure greatly differs from a query sequence, TWIST will simply bypass that structure. To measure the difference between a query sequence and all the sequences in a sequential structure, a representative sequence (called an envelope) is pre-determined and stored in the index structure. The main benefit of the sequential structure is that we can access all the data in it much faster than by random access (Weber et al, 1998). A sample data structure of TWIST is shown in Figure 6.

Suppose there is a set $\mathbb{S}$ of time series sequences, each of the form $S = \langle s_1, \ldots, s_i, \ldots, s_n \rangle$; a DSF simply stores these sequences sequentially. For each DSF, an envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$ for the group of time series sequences is generated, where $eg_i = \{ueg_i, leg_i\}$, $ueg_i = \max_{S \in \mathbb{S}}\{s_i\}$, and $leg_i = \min_{S \in \mathbb{S}}\{s_i\}$. In addition, the data structure of the ESF is basically an array $A$ of objects $O = \{P, EG\}$, each containing a pointer $P$ to a DSF and its envelope $EG$. Figure 7 illustrates an envelope construction for each DSF. The envelope is determined from the upper bound and the lower bound of a group of sequences.

Fig. 6 A sample data structure of TWIST

Fig. 7 An envelope created from a group of sequences

4.3 Lower Bounding Distance for a Group of Sequences

In this work, we propose a novel lower bounding distance function for a group of sequences called LBG.
Instead of calculating a lower bounding distance between a query sequence and a single candidate sequence, LBG returns a lower bounding distance between a query sequence and a set of candidate sequences; in other words, the DTW distance between the query sequence and any candidate sequence in the set is never smaller than the lower bounding distance from LBG. Therefore, if the lower bounding distance is larger than the best-so-far distance, LBG can prune off all those candidate sequences since the real DTW distances of the candidate sequences are guaranteed not to be any smaller. More specifically, TWIST utilizes LBG by determining an LBG distance for each DSF from the envelope sequence stored in the ESF so that only some DSFs are accessed, which significantly reduces both CPU and I/O costs.

Suppose we are given a query sequence $Q = \langle q_1, \ldots, q_a, \ldots, q_n \rangle$ and an envelope $EG = \langle eg_1, \ldots, eg_b, \ldots, eg_n \rangle$, where $eg_b = \{ueg_b, leg_b\}$. LBG first creates a segmented query sequence $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_t \rangle$ and a segmented envelope $EG^T = \langle eg^T_1, \ldots, eg^T_j, \ldots, eg^T_t \rangle$ with segment size $T$, where $q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$. An element $q^T_i$ of the segmented query sequence $Q^T$ is computed by $uq^T_i = \max\{q_x, \ldots, q_y\}$ and $lq^T_i = \min\{q_x, \ldots, q_y\}$, where $x = (i - 1) \cdot T + 1$ and $y = i \cdot T$. On the other hand, to segment an envelope $EG$, elements $ueg^T_j$ and $leg^T_j$ are created as $ueg^T_j = \max\{ueg_x, \ldots, ueg_y\}$ and $leg^T_j = \min\{leg_x, \ldots, leg_y\}$, where $x = (j - 1) \cdot T + 1$ and $y = j \cdot T$. To be more illustrative, Figure 8 shows the segmented envelope $EG^T$ created from an envelope $EG$.

Fig. 8 Illustration of a) an envelope used to generate b) a segmented envelope when calculating LBG

The lower bounding distance $LBG(Q^T, EG^T)$ between a segmented query sequence $Q^T$ and a segmented envelope $EG^T$ can be computed by the following equations.

\[ LBG(Q^T, EG^T) = \sqrt[p]{D(n, m)} \tag{9} \]

\[ D(i, j) = T \cdot d(q^T_i, eg^T_j) + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{10} \]

\[ d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{11} \]

where $n$ and $m$ denote the numbers of segments of $Q^T$ and $EG^T$ (both equal to $t$ here), $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, $1 \le j \le m$, and $p$ is the dimension of the $L_p$-norm.

Theorem 1 Let $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_t \rangle$ and $EG^T = \langle eg^T_1, \ldots, eg^T_j, \ldots, eg^T_t \rangle$ be the approximate segments of a sequence $Q$ and of the envelope $EG$ of a group of time series sequences $\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, respectively, where $q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$. Then

\[ LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{12} \]

where $C_{opt}$ is the sequence in $\mathbb{C}$ which gives the minimum distance to sequence $Q$, and $C^T_{opt}$ is the segmented sequence of $C_{opt}$.

Proof Following from the proof of LBS (Sakurai et al, 2005), we have

\[ LBS(Q^T, C^T_{opt}) \le DTW(Q, C_{opt}) \tag{13} \]

Since $ueg^T_j \ge uc^T_{opt,j}$ and $leg^T_j \le lc^T_{opt,j}$ for all $j$,

\[ d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \ \le\ \begin{cases} |lq^T_i - uc^T_{opt,j}|^p & \text{if } lq^T_i > uc^T_{opt,j} \\ |lc^T_{opt,j} - uq^T_i|^p & \text{if } lc^T_{opt,j} > uq^T_i \\ 0 & \text{otherwise} \end{cases} \ =\ d(q^T_i, c^T_{opt,j}) \tag{14} \]

Since $d(q^T_i, eg^T_j) \le d(q^T_i, c^T_{opt,j})$ for every pair of segments, then

\[ LBG(Q^T, EG^T) \le LBS(Q^T, C^T_{opt}) \tag{15} \]

Therefore, from Equation 13, we have

\[ LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{16} \]

Q.E.D.

LBG thus realizes the concept of a lower bounding distance calculation between a query and a group of sequences.
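As an illustration of Equations 9 to 11, the following minimal sketch builds a group envelope, segments both the query and the envelope with segment size `T`, and runs the segment-level dynamic programming. It assumes equal-length sequences whose length is a multiple of `T`; the helper names `group_envelope`, `segment`, `seg_dist`, and `lbg` are hypothetical and do not come from the paper.

```python
import math

def group_envelope(group):
    """Pointwise upper and lower bounds (ueg, leg) over equal-length sequences."""
    upper = [max(col) for col in zip(*group)]
    lower = [min(col) for col in zip(*group)]
    return upper, lower

def segment(upper, lower, T):
    """Collapse each window of T points into an (upper, lower) pair."""
    return [(max(upper[x:x + T]), min(lower[x:x + T]))
            for x in range(0, len(upper), T)]

def seg_dist(a, b, p=2):
    """Equation 11: gap between two (upper, lower) ranges, zero if they overlap."""
    (ua, la), (ub, lb) = a, b
    if la > ub:
        return (la - ub) ** p
    if lb > ua:
        return (lb - ua) ** p
    return 0.0

def lbg(q, envelope, T, p=2):
    """Equations 9-10: DP over segments, each cell weighted by the segment size T."""
    upper, lower = envelope
    if len(q) % T or len(upper) % T:
        raise ValueError("sequence length must be a multiple of T in this sketch")
    qs = segment(q, q, T)            # a plain sequence is its own envelope
    es = segment(upper, lower, T)
    n, m = len(qs), len(es)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = T * seg_dist(qs[i - 1], es[j - 1], p) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)

env = group_envelope([[0, 0, 0, 0], [1, 1, 1, 1]])
print(lbg([3, 3, 3, 3], env, 2))
```

In this toy group the nearest member to the query is the all-ones sequence, so the printed bound can be compared by hand against the DTW distance to that sequence.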
We also propose a lower bounding distance function for a group of sequences extended from LB_Keogh, called LBG$_K$. LBG$_K$ obtains a lower bounding distance from a query sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ and an envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$, where $eg_i = \{ueg_i, leg_i\}$. Given a query sequence $Q$, an envelope $EG$, and a global constraint $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, LBG$_K$ first creates an envelope of global constraint $EGC = \langle egc_1, \ldots, egc_i, \ldots, egc_n \rangle$ from $EG$, where $egc_i = \{uegc_i, legc_i\}$. Elements $uegc_i$ and $legc_i$ are calculated by $uegc_i = \max\{ueg_{i-r_i}, \ldots, ueg_{i+r_i}\}$ and $legc_i = \min\{leg_{i-r_i}, \ldots, leg_{i+r_i}\}$, respectively. The lower bounding distance $LBG_K(Q, EG)$ between the query sequence $Q$ and the envelope $EG$ is determined by Equation 17, which is followed by its proof of correctness.

\[ LBG_K(Q, EG) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases}} \tag{17} \]

where $p$ is the dimension of the $L_p$-norm.

Theorem 2 Let $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ be a query sequence and $EGC = \langle egc_1, \ldots, egc_i, \ldots, egc_n \rangle$ be an envelope of global constraint created from an envelope $EG$ of a group of sequences $\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, where $C_k = \langle c_{k1}, \ldots, c_{ki}, \ldots, c_{kn} \rangle$. Then

\[ LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{18} \]

where $C_{opt}$ is the sequence which gives the minimum DTW distance to $Q$ in $\mathbb{C}$.

Proof We have

\[ DTW(Q, C_{opt}) = \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{19} \]

where $d_{w_k}$ is the $k$-th distance calculation between sequence $Q$ and its nearest sequence $C_{opt}$ along the optimal warping path, i.e., the distance between some $q_i$ and $c^{opt}_j$.
For $uegc_i$ and $legc_i$, we have

$$uegc_i = \max_{i-r_i \le j \le i+r_i} \{ueg_j\} = \max_{i-r_i \le j \le i+r_i} \max_{1 \le k \le m} c^k_j \ge c^{opt}_j \tag{20}$$

$$legc_i = \min_{i-r_i \le j \le i+r_i} \{leg_j\} = \min_{i-r_i \le j \le i+r_i} \min_{1 \le k \le m} c^k_j \le c^{opt}_j \tag{21}$$

To establish $LBG_K(Q, EG) \le DTW(Q, C_{opt})$, we must show that

$$\sqrt[p]{\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases}} \le \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{22}$$

Since $K \ge n$ from the DTW's conditions, each term of the left-hand sum can be matched with a term $d_{w_k}$ of the right-hand sum, and there are three possible cases to verify, i.e., $|q_i - uegc_i|^p \le d_{w_k}$, $|legc_i - q_i|^p \le d_{w_k}$, and $0 \le d_{w_k}$. Consider the first case,

$$|q_i - uegc_i|^p \le d_{w_k} \tag{23}$$

DTW requires that, for $d_{w_k}$, each data point $q_i$ be compared with some $c^{opt}_j$ with $i - r_i \le j \le i + r_i$. It therefore suffices to show

$$|q_i - uegc_i|^p \le |q_i - c^{opt}_j|^p \tag{24}$$

which holds because $q_i > uegc_i$ in this case and, from Equation 20,

$$uegc_i \ge c^{opt}_j \tag{25}$$

The case $|legc_i - q_i|^p \le d_{w_k}$ follows from a similar argument, and $0 \le d_{w_k}$ always holds since $d_{w_k}$ is nonnegative. Hence,

$$LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{26}$$

Q.E.D.

4.4 Querying Process

When a query sequence is issued, the ESF is accessed first, and the lower bounding distance from LBG is calculated for each envelope. If the LBG of any DSF is larger than the best-so-far distance, all time series sequences in that DSF are guaranteed not to be answers. TWIST uses this distance to prune off a large number of candidate sequences at only a very small CPU and I/O cost.

Instead of calculating only one level of lower bounding distance, LBG calculates the lower bounding distance iteratively. First, the best-so-far distance is initialized with an LBG distance between the coarsest segmented sequences of a query sequence and an envelope. Subsequently, each finer envelope sequence is used for the LBG calculation in turn. If the LBG distance is still smaller than the best-so-far distance, the DSF is accessed, and all data sequences in the DSF are then sequentially searched.
If, however, a finer LBG returns a distance larger than the best-so-far distance, the next DSF is considered instead. The process terminates when all envelope sequences in the ESF are exhausted. The pseudo code of TWIST with LBG is described in Table 1.

Although the implementations of LBG and $LBG_K$ over TWIST are different, we provide solutions for both. The advantages of $LBG_K$ over LBG are that $LBG_K$ needs to access the ESF only once, while LBG requires twice the access, and that $LBG_K$ is faster when a small global constraint is applied in the querying. However, LBG generally achieves a better query processing time than $LBG_K$ since LBG returns a tighter lower bounding distance, independent of the global constraint.

To query with $LBG_K$ under top-$k$ querying, each envelope sequence is sequentially retrieved, and its lower bounding distance is calculated. The $LBG_K$ distances are then sorted in a priority queue. The DSF with the smallest $LBG_K$ distance is accessed first, and sequential search over the candidate sequences in that DSF is used to find the best-so-far sequence. Once the DSF access is completed, the $LBG_K$ distance for the next DSF is considered. If the lower bounding distance for the envelope of the next DSF is larger than the best-so-far distance, the search is terminated, and the set of nearest-neighbor sequences is returned. The pseudo code is provided in Table 2.
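The $LBG_K$-based top-$k$ search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: all function names are hypothetical, each group is held in memory as an `(upper, lower, sequences)` tuple instead of DSF and ESF files, the global constraint is a single band width `r` rather than a per-point $r_i$, and the per-sequence lower bound applied before DTW in the paper is omitted for brevity.

```python
import heapq
import math

def lbg_k(query, upper, lower, r, p=2):
    """Eq. 17: LB_Keogh-style bound against a group envelope widened by r."""
    n = len(upper)
    total = 0.0
    for i, q in enumerate(query):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        u = max(upper[lo:hi])     # uegc_i
        l = min(lower[lo:hi])     # legc_i
        if q > u:
            total += (q - u) ** p
        elif q < l:
            total += (l - q) ** p
    return total ** (1.0 / p)

def dtw(a, b, r, p=2):
    """DTW restricted to a band of width r around the diagonal."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            step = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
            D[i][j] = abs(a[i - 1] - b[j - 1]) ** p + step
    return D[n][m] ** (1.0 / p)

def top_k(query, groups, k, r, p=2):
    """Visit groups in increasing LBG_K order; stop once the bound exceeds
    the k-th best true distance found so far."""
    order = sorted((lbg_k(query, u, l, r, p), idx)
                   for idx, (u, l, _) in enumerate(groups))
    best = []                     # max-heap emulated with negated distances
    for bound, idx in order:
        if len(best) == k and bound > -best[0][0]:
            break                 # no remaining group can contain an answer
        for seq in groups[idx][2]:
            d = dtw(query, seq, r, p)
            if len(best) < k:
                heapq.heappush(best, (-d, seq))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, seq))
    return sorted((-nd, s) for nd, s in best)
```

Because groups are visited in increasing order of their lower bound, the first group whose bound exceeds the current $k$-th best distance ends the search without reading any further group.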
Table 1 Top-k querying under TWIST with LBG

Algorithm [C] = LBG_TopKQuerying[Q, k]
Let:
    C be a priority queue of answer sequences
    P be a pointer to DSF
    EG be an envelope
    d_best = PositiveInfinite be the best-so-far distance
    T be the coarsest resolution
for all {P, EG} in ESF    // Finding d_best from the coarsest version of ESF
    d_EG = LBG(Q^T, EG^T)
    if (d_EG < d_best) d_best = d_EG endif
endfor
for all {P, EG} in ESF
    while (T is not the finest resolution)    // Use LBG to prune ESF
        d_EG = LBG(Q^T, EG^T)
        if (d_EG > d_best) Break and go to the next {P, EG} endif
        Set T to be a finer resolution
    endwhile
    for all C in DSF_P
        d_lower = LB(Q, C)
        if (d_lower ≤ d_best)
            d_true = DTW(Q, C)
            if (C.size() < k)
                C.enqueue({C, d_true})
            else
                if (d_true ≤ d_best)
                    C.enqueue({C, d_true})
                    C.dequeue()
                    d_best = C.peek().d_true
                endif
            endif
        endif
    endfor
endfor
Return C

Table 2 Top-k querying under TWIST with LBG_K

Algorithm [C] = LBGK_TopKQuerying[Q, k]
Let:
    W be a priority queue of envelope distances
    C be a priority queue of answer sequences
    P be a pointer to DSF
    EG be an envelope
    d_best = PositiveInfinite be the best-so-far distance
Initialize d_best = PositiveInfinite
for all {P, EG} in ESF    // Calculate LBG_K distance from ESF for all DSF
    d_W = LBG_K(Q, EG)
    W.enqueue({P, d_W})
endfor
// Dequeue {P, d_W} with smallest d_W
// keep searching for an answer while d_W ≤ d_best
while ({P, d_W} = W.dequeue() and d_W ≤ d_best)
    for all C in DSF_P
        d_lower = LB(Q, C)
        if (d_lower ≤ d_best)
            d_true = DTW(Q, C)
            if (C.size() < k)
                C.enqueue({C, d_true})
            else
                if (d_true ≤ d_best)
                    C.enqueue({C, d_true})
                    C.dequeue()
                    d_best = C.peek().d_true
                endif
            endif
        endif
    endfor
endwhile
Return C

Although this paper emphasizes top-$k$ querying, it can easily be adapted to range queries. Instead of using the best-so-far distance to prune off the database, the range distance is used to specify the maximum distance between a query sequence and a candidate sequence. In addition, the integer $k$ is set to positive infinity.

4.5 Indexing Process

To maintain the data structure, we also propose a mechanism to efficiently insert and delete data sequences over our proposed index structure TWIST.

4.5.1 Data Sequence Insertion

For an insertion, suppose that DSFs and an ESF already exist. The cost of insertion between the new sequence and an envelope is computed for all envelopes in the ESF, and the new sequence is assigned to the envelope with the minimum cost. After the minimum-cost envelope has been found, the envelope's DSF is accessed, and the new sequence is added. The envelope is then updated accordingly in the ESF. Generally, the cost is computed from the size of an envelope after insertion. If a DSF exceeds the maximum number of sequences per file (maximum page size), TWIST splits this DSF into two DSFs, and two new envelopes are also generated and stored in the ESF. For clarification, we provide the insertion algorithm in Table 3.

Note that the maximum page size is a user-defined parameter which determines the maximum number of sequences within each DSF.
Table 3 Inserting a new sequence to TWIST

Algorithm Insertion[C]
// Find the minimum-cost DSF
Initialize cost_min = PositiveInfinite, P_min = null
for all {P, EG_P} in ESF
    cost_EG = Cost(EG_P, C)
    if (cost_EG < cost_min)
        cost_min = cost_EG
        P_min = P
    endif
endfor
Add C in DSF_Pmin
// Check if the size of DSF_Pmin exceeds α
if (DSF_Pmin.size() > α)
    // Split DSF_Pmin into two DSFs, DSF_X and DSF_Y
    [{X, EG_X}, {Y, EG_Y}] = SplitDSF(DSF_Pmin)
    Delete {P_min, EG_Pmin} from ESF
    Add {X, EG_X}, {Y, EG_Y} to ESF
else
    // Update EG_Pmin from C
    EG_Pmin = UpdateEnvelope(EG_Pmin, C)
    Update {P_min, EG_Pmin} to ESF
endif

Fig. 9 Shadowed area represents the total cost of insertion between a new candidate sequence C and an envelope EG

Table 4 Cost function for an insertion of a sequence C into EG

Algorithm Cost[EG, C]
Let:
    cost_sum = 0
for each c_i, ueg_i, leg_i
    if (c_i > ueg_i)
        cost_sum = cost_sum + |c_i − leg_i|^p
    else if (c_i < leg_i)
        cost_sum = cost_sum + |ueg_i − c_i|^p
    endif
endfor
Return cost_sum

Generally, the cost function is calculated from the total area of an envelope after a new sequence is inserted. To illustrate, the shadowed areas in Figure 9 indicate the cost of insertion. Given a new time series sequence $C = \langle c_1, \ldots, c_i, \ldots, c_n \rangle$ and an envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$, where $eg_i = \{ueg_i, leg_i\}$, the cost function $Cost(EG, C)$ is defined as follows (also shown in Table 4).

$$Cost(EG, C) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |c_i - leg_i|^p & \text{if } c_i > ueg_i \\ |ueg_i - c_i|^p & \text{if } c_i < leg_i \\ 0 & \text{otherwise} \end{cases}} \tag{27}$$

where $p$ is the dimension of $L_p$-norms. If the number of sequences in a DSF exceeds the maximum page size, the DSF needs to be split into two DSFs to reduce the envelope size.
Generally, TWIST tries to split sequences into two groups so that each new envelope sequence is tight and the two envelopes overlap only slightly. In this paper, $k$-means clustering (MacQueen, 1967) ($k = 2$) with Euclidean distance is adopted as a heuristic function for separating the data into two appropriate groups. Other algorithms, such as the splitting algorithms of the R-tree (Guttman, 1984) and the R*-tree (Beckmann et al, 1990), can be used in place of $k$-means clustering since they are also designed to separate data and minimize the Minimum Bounding Rectangle (MBR); however, these splitting algorithms have relatively large time complexity. Pseudo code of the splitting algorithm is provided in Table 5.

After new DSFs are created in the insertion step, new envelopes are generated by the algorithm described in Table 6, which finds the maximum and minimum values for each DSF. If the number of sequences in the DSF does not exceed the maximum allowed, the envelope in the ESF is simply updated using the existing envelope and the new sequence. To update the existing envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$ from a new time series sequence $C = \langle c_1, \ldots, c_i, \ldots, c_n \rangle$, elements are updated by $ueg_i = \max\{ueg_i, c_i\}$ and $leg_i = \min\{leg_i, c_i\}$, where $eg_i = \{ueg_i, leg_i\}$. The updating algorithm is described in Table 7.
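The insertion cost of Equation 27 and the envelope maintenance rules can be sketched in Python as below. The sketch makes several assumptions of its own: groups are plain dictionaries with the hypothetical keys `upper`, `lower`, and `seqs` rather than DSF and ESF files, all function names are illustrative, and the split performed when a group exceeds the maximum page size is left out.

```python
def cost(upper, lower, seq, p=2):
    """Eq. 27: growth of the envelope if seq were added to the group."""
    total = 0.0
    for c, u, l in zip(seq, upper, lower):
        if c > u:                 # point rises above the envelope
            total += abs(c - l) ** p
        elif c < l:               # point falls below the envelope
            total += abs(u - c) ** p
    return total ** (1.0 / p)

def create_envelope(sequences):
    """Point-wise maximum and minimum over all sequences of a group."""
    upper = [max(col) for col in zip(*sequences)]
    lower = [min(col) for col in zip(*sequences)]
    return upper, lower

def update_envelope(upper, lower, seq):
    """Widen an existing envelope so that it also encloses seq."""
    return ([max(u, c) for u, c in zip(upper, seq)],
            [min(l, c) for l, c in zip(lower, seq)])

def insert(groups, seq, p=2):
    """Add seq to the minimum-cost group and refresh that group's envelope.
    Splitting an over-full group is intentionally omitted in this sketch."""
    best = min(groups, key=lambda g: cost(g["upper"], g["lower"], seq, p))
    best["seqs"].append(seq)
    best["upper"], best["lower"] = update_envelope(
        best["upper"], best["lower"], seq)
    return best
```

A sequence that already lies inside an envelope has zero cost for that group, so it is always placed there without widening any envelope.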
Table 5 Splitting algorithm, separating a DSF into two DSFs

Algorithm SplitDSF[DSF]
// Run k-means clustering algorithm
// Fix k = 2
[DSF_X, DSF_Y] = KMeans(DSF)
// Create EG_X and EG_Y
EG_X = CreateEnvelope(DSF_X)
EG_Y = CreateEnvelope(DSF_Y)
Return [{X, EG_X}, {Y, EG_Y}]

Table 6 An envelope construction algorithm

Algorithm CreateEnvelope[DSF]
Let:
    EG be an envelope
for each sequence C in DSF
    for each c_i, ueg_i, leg_i
        ueg_i = max{ueg_i, c_i}
        leg_i = min{leg_i, c_i}
    endfor
endfor
Return EG

Table 7 An envelope sequence update algorithm after a new sequence insertion

Algorithm UpdateEnvelope[EG, C]
for each c_i, ueg_i, leg_i
    ueg_i = max{ueg_i, c_i}
    leg_i = min{leg_i, c_i}
endfor
Return EG

4.5.2 Data Sequence Deletion

To delete a data sequence, the corresponding DSF is accessed and the sequence is simply deleted. However, when a DSF changes, the ESF needs to be updated as well. In particular, we provide two deletion policies, i.e., eager deletion and lazy deletion. For eager deletion, after each sequence deletion, TWIST immediately recalculates a new envelope from the entire set of sequences in that DSF and writes the changes to the ESF. Lazy deletion, on the other hand, simply deletes a sequence from the DSF without updating the ESF; the old envelope still encloses every remaining sequence, so TWIST guarantees that false dismissals will never occur in the lower bounding calculation of LBG. The tradeoff between the two policies is deletion time versus tightness of the envelope. With eager deletion, the deletion time increases but the envelope sequence is tighter, while with lazy deletion the deletion is very fast but the envelope sequence is not as tight. We provide pseudo code for the deletion algorithm in Table 8.
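The two deletion policies can be contrasted with a short Python sketch. As before, this is an illustration under assumptions rather than the authors' implementation: a group is a dictionary with the hypothetical keys `upper`, `lower`, and `seqs`, and the function names are chosen here for readability.

```python
def create_envelope(sequences):
    """Point-wise maximum and minimum over all sequences of a group."""
    upper = [max(col) for col in zip(*sequences)]
    lower = [min(col) for col in zip(*sequences)]
    return upper, lower

def delete(group, seq, eager=True):
    """Remove seq from the group; rebuild the envelope only if eager."""
    group["seqs"].remove(seq)
    if eager and group["seqs"]:
        # Eager policy: recompute a tight envelope from what remains.
        group["upper"], group["lower"] = create_envelope(group["seqs"])
    # Lazy policy: keep the old, possibly looser envelope. It still
    # encloses every remaining sequence, so lower bounds stay valid.
    return group
```

After a lazy deletion the stored envelope may be wider than necessary, which weakens pruning but never causes a false dismissal.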
Table 8 Delete an existing sequence from TWIST

Algorithm Deletion[C]
Select DSF_P which contains C
Delete C from DSF_P
if (IsEager)
    EG_P = CreateEnvelope(DSF_P)
    Update {P, EG_P} to ESF
endif

5 Experimental Evaluation

In the experimental evaluation, we compare our proposed method, TWIST, with the best existing indexing method, FTW (Sakurai et al, 2005), and the best naïve method, sequential search with LB_Keogh (Keogh and Ratanamahatana, 2005), on several evaluation metrics, i.e., querying time, indexing time, the number of page accesses, and storage requirement. In addition, two solutions of our proposed method are evaluated, i.e., TWIST with LBG and TWIST with $LBG_K$.

Although FTW indexing outperforms R*-tree with LB_PAA (Keogh and Ratanamahatana, 2005), our method shows superiority over FTW by a few orders of magnitude. Sequential search with LB_Keogh is also evaluated to show the best performance of a naïve method when no indexing structure is utilized. It is important to note that we make our best effort to tune the rival methods to run at their best performance by applying the early abandon (Keogh and Ratanamahatana, 2005) and early stopping (Sakurai et al, 2005) techniques; however, as will be demonstrated, our proposed method still outperforms them in all terms.

To verify that our proposed method is scalable to massive time series databases, we use databases whose size exceeds the main memory; otherwise, the operating system is likely to cache the data in main memory. Our database size therefore ranges from 256 MB to 4 GB. We perform our experiments on a Windows XP computer with an Intel Core 2 Duo 2.77 GHz, 2 GB of RAM, and an 80 GB 5400 rpm internal hard drive. All code in our experiments is implemented in Java 1.5.
5.1 Datasets

To visualize the performance in various dimensions, the datasets listed below are generated by varying the number of sequences in the database ($2^{16} = 65536$, $2^{17} = 131072$, $2^{18} = 262144$, and $2^{19} = 524288$ sequences) and the sequence length (512, 1024, and 2048 data points). All data sequences are Z-normalized; some examples for each dataset are shown in Figure 10.

1. Random Walk I (Sakurai et al, 2005; Assent et al, 2008): To demonstrate the scalability of our proposed method, a large number of sequences are generated by the following equation: $t_{i+1} = t_i + N(0, 1)$, where $N(0, 1)$ is a random value drawn from a normal distribution.
2. Random Walk II (Assent et al, 2008): We generate a set of random walk sequences from the following equation: $t_{i+1} = 2t_i - t_{i-1} + N(0, 1)$, where $N(0, 1)$ is a random value drawn from a normal distribution.
3. Electrocardiogram (Moody and Mark, 1983): This dataset is recorded at 250 samples per second from human subjects with atrial fibrillation. It was made at Boston's Beth Israel Hospital and revised for the MIT-BIH Arrhythmia Database. To build the dataset, we segment all the original sequences into small subsequences.

Fig. 10 Example sequences for datasets a) Random Walk I, b) Random Walk II, and c) Electrocardiogram

5.2 Querying Time

In this experiment, query processing times are averaged over 100 runs and are compared on the best-matched problem by varying five parameters, i.e., the dataset size (the number of time series sequences), the sequence length, the width of the global constraint, the integer $k$, and the maximum page size (only for TWIST).
In order to observe the trend for each parameter, the default values are fixed as follows: the dataset size is 524288 ($2^{19}$) sequences, the length of each time series sequence is 2048 data points, the width of the global constraint is 10% of the sequence length, the integer $k$ in top-$k$ querying is 1, and the maximum number of sequences in a DSF is 128. A dataset of 524288 sequences of length 2048 is approximately 4 GB in size, and a global constraint width of 10% is typically used in the time series data mining community (Ratanamahatana and Keogh, 2005). Note that for LBS, the default segment sizes proposed in the original paper are used, i.e., 1024, 256, 64, and 16, and LBG uses the same segment sizes as LBS. In the sequential search within a DSF, we implement LBS to reduce the number of DTW distance calculations. However, the segmented sequences are generated online; in other words, no index structure is stored in the DSF.

Fig. 11 TWIST outperforms the rival methods and is only slightly affected by an increase in the dataset size, where sequence length, global constraint, an integer k, and page size are set to 2048, 10%, 1, and 128, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: dataset size (data sequences); y-axis: query processing time per query (sec.); methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Fig. 12 Although sequence length increases, TWIST requires only a small query processing time compared with FTW and LB_Keogh, where database size, global constraint, an integer k, and page size are set to 524288, 10%, 1, and 128, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: sequence length (data points); y-axis: query processing time per query (sec.); methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Fig. 13 TWIST is faster than FTW and LB_Keogh for all values of k, where database size, sequence length, global constraint size, and maximum page size are set to 524288, 2048, 10%, and 128, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: k; y-axis: query processing time per query (sec.); methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Fig. 14 TWIST and FTW are not affected by the increase of the global constraint's width; however, TWIST outperforms both FTW and LB_Keogh, where database size, sequence length, an integer k, and page size are set to 524288, 2048, 1, and 128, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: width of global constraint (% of time series length); y-axis: query processing time per query (sec.); methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Fig. 15 When maximum page size changes, TWIST still outperforms the rival methods, where database size, sequence length, global constraint size, and an integer k are set to 524288, 2048, 10%, and 1, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: maximum page size (data sequences per page); y-axis: query processing time per query (sec.); methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Figures 11, 12, 13, 14, and 15 illustrate the performance of TWIST in terms of querying time against the two rival methods when varying the dataset size, the sequence length, the integer $k$, the width of the global constraint, and the maximum number of sequences in a DSF, respectively. As expected, TWIST greatly outperforms sequential search with LB_Keogh and FTW indexing.

5.3 Indexing Time

Indexing time is the wall clock time that an algorithm consumes to build the index structure. In this experiment, we only compare the indexing time with FTW indexing since sequential search with LB_Keogh does not need an index structure.
From the experiment shown in Figure 16, our indexing time is comparable to FTW's; however, if the maximum page size is larger, TWIST can greatly reduce the indexing time, at the possible expense of querying time (see Figure 15). The parameters used in this experiment are set to the default values of the previous experiments. Although the indexing time is comparable to that of FTW indexing, TWIST requires very small storage space compared with FTW indexing (as will be shown in Section 5.5).

Fig. 16 As page size increases, the indexing time of TWIST significantly reduces and is comparable to FTW's. Note that TWIST still queries faster than FTW for all page sizes (see Figure 15). Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: dataset size (data sequences); y-axis: indexing time (sec.); methods: FTW and TWIST with page sizes 64, 128, 256, and 512.

Fig. 17 The number of page accesses of TWIST is smaller than that of the rival methods, especially in Random Walk I and Random Walk II, when the speedup factor is 5. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: dataset size (data sequences); y-axis: number of page accesses; methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

Fig. 18 The number of page accesses of TWIST is smaller than that of the rival methods, especially in Random Walk I and Random Walk II, when the speedup factor is 10. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: dataset size (data sequences); y-axis: number of page accesses; methods: LB_Keogh, FTW, TWIST/LBG_K, TWIST/LBG.

5.4 Number of Page Accesses

The number of page accesses ($\eta$) is generally evaluated in order to estimate the I/O cost. We calculate the number of page accesses for TWIST with LBG and TWIST with $LBG_K$ according to the following equations.

$$\eta_{LBG} = 2\alpha + \frac{\beta}{SF} + \delta \tag{28}$$

$$\eta_{LBG_K} = \alpha + \frac{\beta}{SF} + \delta \tag{29}$$

where $\alpha$ is the number of envelopes in the ESF, $\beta$ is the number of accessed candidate sequences, $\delta$ is the number of random accesses to DSFs, and $SF$ is the Speedup Factor proposed by Weber et al (1998), who state that sequential access is faster than random access by a factor of 5 to 10. Generally, two values of $SF$ are considered, i.e., 5 and 10, which represent the traditional and the practical speedup factor of sequential access over random access.

Since sequential scan accesses the entire database, it can be considered an upper bound. Surprisingly, as shown in Figures 17 and 18, the number of page accesses of FTW indexing is approximately equal to that of the sequential scan, and is very large compared with our proposed method TWIST, because FTW retrieves the entire index structure, which is nearly double the database size.
On the other hand, in average cases, TWIST can greatly reduce the number of data accesses since it tries to minimize both the number of DSF accesses and the number of accessed candidates. The experimental parameters, i.e., dataset size, sequence length, maximum page size, global constraint, and $k$, are set to 524288, 2048, 128, 10%, and 1, respectively.

5.5 Storage Requirement

In this section, we demonstrate the storage required for an index file in comparison with the rival method, FTW. Since FTW creates a set of segmented sequences for each candidate sequence, the index file's size is larger than the data file's. Therefore, the FTW index structure is not practical in real world applications. Unlike FTW, TWIST's index file requires only a small amount of storage, i.e., only the envelopes from all groups of data sequences are stored. Figure 19 shows the comparison of storage requirement between TWIST and FTW. When the dataset size is $2^{19}$ sequences or 4 GB, FTW requires nearly 5 GB, but as expected TWIST requires only 110 MB; in other words, TWIST requires approximately 51 times less storage space than FTW, while still outperforming it in terms of query processing time.

Fig. 19 Illustration of storage requirement comparison showing that TWIST's index file requires only a small amount of storage compared with FTW's, where dataset size, sequence length, and maximum page size are set to 524288, 2048, and 64, respectively. Panels: (a) Random Walk I, (b) Random Walk II, (c) Electrocardiogram; x-axis: dataset size (data sequences); y-axis: storage requirement (bytes); methods: FTW, TWIST.
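As a small worked illustration of the page access estimates in Section 5.4 (Equations 28 and 29), the Python sketch below evaluates $\eta$ for given counts. It reads the middle term as the number of accessed candidates divided by the speedup factor, which is an interpretation made here; the function name, parameter names, and the sample counts in the usage note are hypothetical and are not measurements from the paper.

```python
def page_accesses(num_envelopes, num_candidates, num_random,
                  speedup=5, esf_passes=1):
    """Estimated page accesses: ESF reads, plus sequentially read candidates
    discounted by the speedup factor, plus random DSF accesses.
    Use esf_passes=2 for LBG (Eq. 28) and esf_passes=1 for LBG_K (Eq. 29)."""
    return esf_passes * num_envelopes + num_candidates / speedup + num_random
```

For example, with 4096 envelopes, 1280 accessed candidates, 10 random accesses, and a speedup factor of 5, the LBG estimate is 2 × 4096 + 256 + 10 = 8458, while the $LBG_K$ estimate is 4096 + 256 + 10 = 4362.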
5.6 Discussion

As expected, query processing time increases for all approaches when the dataset size and the sequence length grow. However, from Figures 11 and 12, we can see that FTW indexing and the naïve method require a much longer time for a single query than TWIST with LBG and $LBG_K$, and that their query processing time also grows much faster as the database size increases. In Figure 14, if the global constraint changes, only the naïve method with LB_Keogh and TWIST with $LBG_K$ are affected, since LB_Keogh and $LBG_K$ lose their tightness when the width of the global constraint increases. Although best-matched querying ($k = 1$) is typically used in several domains, we also evaluate TWIST when varying $k$, as shown in Figure 13. Obviously, when $k$ increases, the query processing time also increases, since for a large value of $k$ the best-so-far distance is also large. If the best-so-far distance is large, the search cannot use the lower bounding distance to prune off the database. However, from Figure 15, TWIST still retrieves an answer efficiently compared with the other methods.

The maximum page size is another important parameter that must be considered because TWIST uses it to balance the number of pages in the database and the number of sequences in each page. In other words, if the maximum page size is small, the number of random accesses increases; otherwise, the number of sequential accesses increases. However, the experiment shows that when the maximum page size changes, TWIST still outperforms FTW and sequential search with LB_Keogh. Note that when we set the maximum page size to one, TWIST is identical to FTW, but when the maximum page size is set to infinity, TWIST is similar to the naïve method, i.e., sequential scan. Therefore, both FTW indexing and the naïve method are special cases of TWIST.
To evaluate the indexing time, we compare TWIST with FTW indexing by varying the database size and the maximum page size in Figure 16. In our insertion algorithm, if the number of sequences exceeds the maximum page size, TWIST splits the DSF into two DSFs. Therefore, if the maximum page size is large, TWIST makes fewer calls to the splitting function; this reduces the indexing time, since the splitting algorithm relies on $k$-means clustering, whose time complexity is linear in the page size. Although a large maximum page size reduces the indexing time, it comes at the cost of querying performance.

Although we provide the evaluation in terms of query processing time in Section 5.2, the number of page accesses also needs to be evaluated since it reflects the I/O cost of each approach. The number of page accesses is formulated and calculated according to (Sakurai et al, 2005; Weber et al, 1998), which state that sequential access is five to ten times faster than random access. From Figures 17 and 18, the number of page accesses of FTW indexing is always larger than that of the naïve approach since FTW indexing reads all segmented sequences in the index file, whose number equals the number of sequences in the database. TWIST consumes only a small number of page accesses because it is designed to reduce both sequential and random accesses.

Regarding the size of the index structure, TWIST utilizes only a small amount of space compared with FTW indexing, which always requires space twice the database size. In Figure 19, we demonstrate TWIST's storage requirement by varying the database size and the maximum page size, since the size of the ESF depends solely on the number of DSFs in the database.
6 Conclusion

In this work, we propose a novel indexed sequential structure called TWIST (Time Warping in Indexed Sequential sTructure) which significantly reduces querying time, by up to 50 times, compared with the best existing methods, i.e., FTW indexing and sequential scan with LB_Keogh. More specifically, TWIST groups similar time series sequences together in the same file, and then a representative of each group of sequences is calculated and stored in the index structure. When a query sequence is issued, a lower bounding distance for a group of sequences is determined from the query sequence and the representative retrieved from the index file. If the lower bounding distance for a group of sequences is larger than the best-so-far distance, none of the candidate sequences in the group needs to be accessed. This can prune off an impressively large number of candidate sequences and makes TWIST feasible for massive time series databases.

Acknowledgement

This research is partially supported by the Thailand Research Fund (Grant No. MRG5080246), the Thailand Research Fund given through the Royal Golden Jubilee Ph.D. Program (PHD/0141/2549 to V. Niennattrakul), and the Chulalongkorn University Graduate Scholarship to Commemorate the 72nd Anniversary of His Majesty King Bhumibol Adulyadej.

References

Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-tree: efficient time series search and retrieval. In: Proceedings of 11th International Conference on Extending Database Technology (EDBT 2008), Nantes, France, pp 252-263
Bagnall AJ, Ratanamahatana CA, Keogh EJ, Lonardi S, Janacek GJ (2006) A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery 13(1):11-40
Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: An efficient and robust access method for points and rectangles.
In: Pro eedings of the 1990 A CM SIGMOD In ternational Conferene on Managemen t of Data (SIGMOD 90), A tlan ti Cit y , pp 322331 Ber h told S, Keim D A, Kriegel HP (1996) The X-tree : An index struture for high-dimensional data. In: Pro eedings of 22nd In ternational Conferene on V ery Large Data Bases (VLDB 96), Mum bai (Bom ba y), India, pp 2839 32 Berndt DJ, Cliord J (1994) Using dynami time w arping to nd patterns in time series. In: the 1994 AAAI W orkshop on Kno wledge Diso v ery in Databases, Seattle, W ashington, pp 359370 Ciaia P , P atella M, Zezula P (1997) M-tree: An eien t aess metho d for similarit y sear h in metri spaes. In: Pro eedings of 23rd In ternational Conferene on V ery Large Data Bases (VLDB 97), A thens, Greee, pp 426 435 Ding H, T ra jevski G, S heuermann P , W ang X, Keogh E (2008) Querying and mining of time series data: Exp erimen tal omparison of represen tations and distane measures. In: Pro eedings of 34th In ternational Conferene on V ery Large Data Bases (VLDB 2008), Au kland, New Zealand F aloutsos C, Ranganathan M, Manolop oulos Y (1994) F ast subsequene mat hing in time-series databases. In: Pro eedings of the 1994 A CM SIG- MOD In ternational Conferene on Managemen t of Data (SIGMOD 94), Minneap olis, Minnesota, pp 419429 Guttman A (1984) R-trees: A dynami index struture for spatial sear hing. In: Y ormark B (ed) Pro eedings of Ann ual Meeting SIGMOD'84, A CM Press, Boston, Massa h usetts, pp 4757 Itakura F (1975) Minim um predition residual priniple applied to sp ee h reognition. IEEE T ransations on A oustis, Sp ee h, and Signal Pro essing 23(1):6772 Keogh E, Ratanamahatana CA (2005) Exat indexing of dynami time w arp- ing. Kno wledge and Information Systems 7(3):358386 Keogh EJ, Kasett y S (2003) On the need for time series data mining b en h- marks: A surv ey and empirial demonstration. 
Data Mining and Kno wledge Diso v ery 7(4):349371 Keogh EJ, Chakrabarti K, P azzani MJ, Mehrotra S (2001) Dimensionalit y redution for fast similarit y sear h in large time series databases. Kno wledge and Information Systems 3(3):263286 Kim SW, P ark S, Ch u WW (2001) An index-based approa h for similarit y sear h supp orting time w arping in large sequene databases. In: Pro eedings of the 17th In ternational Conferene on Data Engineering (ICDE 2001), Heidelb erg, German y , pp 607614 Lin J, Keogh EJ, W ei L, Lonardi S (2007) Exp eriening SAX: a no v el sym- b oli represen tation of time series. Data Mining and Kno wledge Diso v ery 15(2):107144 Loh WK, Kim SW, Whang KY (2004) A subsequene mat hing algorithm that supp orts normalization transform in time-series databases. Data Mining and Kno wledge Diso v ery 9(1):528 MaQueen JB (1967) Some metho ds for lassiation and analysis of m ul- tiv ariate observ ations. In: Cam LML, Neyman J (eds) Pro eedings of 5th Berk eley Symp osium on Mathematial Statistis and Probabilit y , Univ ersit y of California Press, v ol 1, pp 281297 Mo o dy GB, Mark R G (1983) A new metho d for deteting atrial brillation using RR in terv als. Computers in Cardiology 10:227230 33 Ratanamahatana CA, Keogh EJ (2004) Making time-series lassiation more aurate using learned onstrain ts. In: Pro eedings of 4th SIAM In terna- tional Conferene on Data Mining (SDM 2004), Lak e Buena Vista, Florida, USA, pp 1122 Ratanamahatana CA, Keogh EJ (2005) Three m yths ab out dynami time w arping data mining. In: Pro eedings of 2005 SIAM In ternational Data Mining Conferene (SDM 2005), Newp ort Bea h, CL, USA, pp 506510 Sak o e H, Chiba S (1978) Dynami programming algorithm optimization for sp ok en w ord reognition. IEEE T ransations on A oustis, Sp ee h, and Sig- nal Pro essing 26(1):4349 Sakurai Y, Y oshik a w a M, F aloutsos C (2005) FTW: F ast similarit y sear h under the time w arping distane. 
In: Pro eedings of 24th A CM SIGA CT- SIGMOD-SIGAR T Symp osium on Priniples of Database Systems, Balti- more, ML, USA, pp 326337 Sakurai Y, F aloutsos C, Y amam uro M (2007) Stream monitoring under the time w arping distane. In: Pro eedings of IEEE 23rd In ternational Confer- ene on Data Engineering (ICDE 2007), Istan bul, T urk ey , pp 10461055 Vla hos M, Y u PS, Castelli V, Meek C (2006) Strutural p erio di measures for time-series data. Data Mining and Kno wledge Diso v ery 12(1):128 W ang X, Smith KA, Hyndman RJ (2006) Charateristi-based lustering for time series data. Data Mining and Kno wledge Diso v ery 13(3):335364 W eb er R, S hek HJ, Blott S (1998) A quan titativ e analysis and p erformane study for similarit y-sear h metho ds in high-dimensional spaes. In: Gupta A, Shm ueli O, Widom J (eds) Pro eedings of 24th In ternational Conferene on V ery Large Data Bases (VLDB 98), Morgan Kaufmann, New Y ork Cit y , NY, pp 194205 Yi BK, Jagadish HV, F aloutsos C (1998) Eien t retriev al of similar time sequenes under time w arping. In: Pro eedings of 14th In ternational Con- ferene on Data Engineering (ICDE 98), Orlando, FL, USA, pp 201208 Yianilos PN (1993) Data strutures and algorithms for nearest neigh b or sear h in general metri spaes. In: Pro eedings of 4th Ann ual A CM-SIAM Symp o- sium on Disrete Algorithms (SOD A 93), So iet y for Industrial and Applied Mathematis, Philadelphia, P A, USA, pp 311321 Zh u Y, Shasha D (2003) W arping indexes with en v elop e transforms for query b y h umming. In: Pro eedings of the 2003 A CM SIGMOD In ternational Con- ferene on Managemen t of Data (SIGMOD 2003), San Diego, CA, USA, pp 181192