Exact Indexing for Massive Time Series Databases under Time Warping Distance

Vit Niennattrakul · Pongsakorn Ruengronghirunya · Chotirat Ann Ratanamahatana

Abstract Among many existing distance measures for time series data, Dynamic Time Warping (DTW) distance has been recognized as one of the most accurate and suitable distance measures due to its flexibility in sequence alignment. However, DTW distance calculation is computationally intensive. Especially in very large time series databases, a sequential scan through the entire database is definitely impractical, even with random access that exploits some index structures, since the high dimensionality of time series data incurs extremely high I/O cost. More specifically, a sequential structure consumes high CPU but low I/O costs, while an index structure requires low CPU but high I/O costs. In this work, we therefore propose a novel indexed sequential structure called TWIST (Time Warping in Indexed Sequential sTructure) which benefits from both sequential access and index structure. When a query sequence is issued, TWIST calculates lower bounding distances between a group of candidate sequences and the query sequence, and then identifies the data access order in advance, hence reducing a great number of both sequential and random accesses. Impressively, our indexed sequential structure achieves significant speedup in the querying process by a few orders of magnitude. In addition, our method shows superiority over existing rival methods in terms of query processing time, number of page accesses, and storage requirement with no false dismissal guaranteed.

Keywords Time Series, Indexing, Dynamic Time Warping

V. Niennattrakul
Department of Computer Engineering, Chulalongkorn University
E-mail: g49vnn@cp.eng.chula.ac.th

P. Ruengronghirunya
Department of Computer Engineering, Chulalongkorn University
E-mail: g51prn@cp.eng.chula.ac.th

C.A. Ratanamahatana
Department of Computer Engineering, Chulalongkorn University
E-mail: ann@cp.eng.chula.ac.th

1 Introduction

Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994; Ratanamahatana and Keogh, 2004, 2005; Sakurai et al, 2007) has been known as one of the best distance measures (Ding et al, 2008; Keogh and Kasetty, 2003) suited for the time series domain over the traditional Euclidean distance because DTW distance has much more flexibility in sequence alignment. In addition, DTW distance tries to find the best warping, while Euclidean distance is calculated in a one-to-one manner, as shown in Figure 1. However, DTW distance has a major drawback, i.e., it requires extremely high computational cost, especially when DTW distance is used in similarity search problems, including top-k queries. More specifically, in the top-k querying problem, after a query sequence has been issued, a set of k candidate sequences most similar to the query sequence, ranked by DTW distance, is returned. Traditionally, the naïve approach needs to calculate DTW distances for all candidate sequences. As a result, its query processing time mainly depends on distance calculation and the number of data accesses.

Fig. 1 The comparison of sequence alignments between (a) Euclidean distance and (b) DTW distance

So far, many speedup techniques have been proposed including lower bounding functions and index structures.
Lower bounding functions (Yi et al, 1998; Kim et al, 2001; Keogh and Ratanamahatana, 2005; Zhu and Shasha, 2003; Sakurai et al, 2005), whose complexity is typically much lower than that of a DTW distance measure, are used for a lower bounding distance calculation which guarantees that DTW distance must be equal to or larger than the lower bounding distance. Additionally, in sequential scan, before calculating DTW distance between the query sequence and a candidate sequence, a lower bounding function is utilized to approximate and prune off any candidate sequence whose lower bounding distance is larger than the current best-so-far distance. In indexing, the lower bounding distance is also used to guide the similarity search. Currently, many lower bounding functions have been proposed to reduce computational costs, including LB_Yi (Yi et al, 1998), LB_Kim (Kim et al, 2001), LB_Keogh (Keogh and Ratanamahatana, 2005), LB_PAA (Keogh and Ratanamahatana, 2005), LB_NewPAA (Zhu and Shasha, 2003), and LBS (Sakurai et al, 2005). It has been widely known that LB_Keogh and LBS are among the most efficient lower bounding functions, where LB_Keogh has lower time complexity, while LBS has a tighter bound.

Besides lower bounding functions, various index structures for DTW distance have been proposed to guide the search to access only some parts of the database. In other words, the search result is returned while only a small portion of the database is accessed for distance calculation, i.e., when querying, the index structure determines which parts of the database are likely to contain answers, and then the raw data on disk are randomly accessed. Generally, this index structure should be small enough to fit in main memory.
Currently, two exact indexing approaches are typically used, i.e., the GEMINI framework with LB_PAA (Keogh and Ratanamahatana, 2005), and a more recent approach, FTW indexing (Sakurai et al, 2005). Note that exact indexing returns a set of querying results with no false dismissal guaranteed; in other words, the best answers must be included in the results. The GEMINI framework (Faloutsos et al, 1994) typically utilizes a multi-dimensional tree, e.g., R*-tree (Beckmann et al, 1990), as an index structure, while FTW indexing stores indices in a flat file. However, current indexing techniques are burdened with a huge amount of I/O cost since random access to the database is typically 5 to 10 times slower than sequential access (Weber et al, 1998). Therefore, indexing is efficient only when less than 20% of the raw data sequences are accessed on average, and the I/O overheads of current indexing techniques remain too large for massive databases.

In this work, we propose a novel index structure and access method under DTW distance called TWIST (Time Warping in Indexed Sequential sTructure). TWIST utilizes advantages from both sequential structure and index structure, i.e., low I/O and low CPU costs. Instead of randomly accessing the raw time series data like other indexing techniques, TWIST separates and stores a collection of time series data in sequential structures or flat files. For each file, TWIST generates a representative sequence (called an envelope) and stores this sequence in an index structure. Therefore, when a query sequence is issued, a lower bounding distance is calculated for each envelope using our newly proposed lower bounding function for a group of sequences (LBG).
The lower bounding distance between an envelope and a query sequence guarantees that the DTW distance between the query sequence and each and every candidate sequence under this envelope is never smaller than this lower bounding distance. Additionally, if the lower bounding distance is larger than the best-so-far distance, no access to the sequences within the envelope is needed; otherwise, every sequence in the envelope is sequentially accessed for DTW distance calculation.

We evaluate our proposed method, TWIST, comparing it with the current best approaches, i.e., FTW indexing and sequential scan with the LB_Keogh lower bounding function. As will be demonstrated, TWIST prunes off a large number of candidate sequences and is faster than the rival methods by a few orders of magnitude. Furthermore, when the size of databases exponentially increases, our query processing time only grows linearly.

The rest of the paper is organized as follows. Section 2 provides a literature review of related work in speeding up similarity search under DTW distance. Section 3 provides the necessary background. In Section 4, our proposed index structure, TWIST, its access method, and novel proposed lower bounding distance functions are described. We show the superiority of TWIST over the best existing methods in Section 5. Finally, in Section 6, we conclude our work and provide the direction of future research.

2 Related Work

Time series data mining has long been studied, and since the Dynamic Time Warping (DTW) distance measure (Berndt and Clifford, 1994) was introduced to the data mining community (Keogh and Kasetty, 2003; Loh et al, 2004; Wang et al, 2006; Vlachos et al, 2006; Bagnall et al, 2006; Lin et al, 2007), it has shown superiority in similarity matching over the traditional Euclidean distance due to its great flexibility in sequence alignment.
Specifically, DTW distance utilizes dynamic programming to find the optimal warping path and calculate the distance between two time series sequences. Unfortunately, to calculate DTW distance, exhaustive computation is generally required. In addition, since DTW distance does not qualify as a distance metric, neither distance-based (Ciaccia et al, 1997; Yianilos, 1993) nor spatial-based (Berchtold et al, 1996; Guttman, 1984; Beckmann et al, 1990) index structures can be used efficiently in similarity search under DTW distance. Therefore, various lower bounding functions and indexing techniques for DTW distance have been proposed to resolve these problems.

Yi et al. (Yi et al, 1998) first propose a lower bounding function, LB_Yi, using two features of a time series sequence, i.e., the minimum and maximum values. LB_Yi creates an envelope over a query sequence from these minimum and maximum values, and then the distance is computed from the summation of areas between the envelope and a candidate sequence, as shown in Figure 2 a). Instead of using only two features, Kim et al. (Kim et al, 2001) suggest two additional features, i.e., the first and the last values of the sequence. LB_Kim then calculates the distance from the feature tuples of a query sequence and a candidate sequence, as shown in Figure 2 b). Although these two lower bounding functions have small time complexity, the use of LB_Yi and LB_Kim is not practical since their lower bounding distances cannot prune off many of the DTW distance calculations.

Fig. 2 Illustration of lower bounding distance calculation between a query sequence and a candidate sequence when using a) LB_Yi, b) LB_Kim, and c) LB_Keogh

Fig. 3 Shapes of a) Sakoe-Chiba band, b) Itakura Parallelogram, and c) Ratanamahatana-Keogh band

Keogh et al. propose a tighter lower bounding function, LB_Keogh, utilizing global constraints (Sakoe and Chiba, 1978; Itakura, 1975; Ratanamahatana and Keogh, 2004), which are generally used to limit the scope of warping in the distance matrix to prevent undesirable paths. Various well-known global constraints have been proposed, e.g., the Sakoe-Chiba band (Sakoe and Chiba, 1978), the Itakura Parallelogram (Itakura, 1975), and the Ratanamahatana-Keogh (R-K) band (Ratanamahatana and Keogh, 2004). To be more illustrative, Figure 3 shows different shapes of global constraints. Note that the R-K band is an arbitrary-shaped constraint which can represent any band by using only a single one-dimensional array. LB_Keogh first creates an envelope over a query sequence according to the shape and size of the global constraint. Its lower bounding distance is then the area between the envelope and a candidate sequence, as shown in Figure 2 c).

In addition, Keogh et al. also propose an indexing technique which utilizes a discretized version of their lower bounding function, LB_PAA. In order to create an index structure, they reduce the dimensions of each time series sequence using the Piecewise Aggregate Approximation (PAA) technique (Keogh et al, 2001), and store the reduced sequence in a multi-dimensional index structure such as R*-tree (Beckmann et al, 1990). Each leaf node of the tree, stored on disk, contains a collection of segmented sequences, where each sequence points to its raw time series data. In the querying process, an envelope of the query sequence is created and discretized. Then, each MBR (Minimum Bounding Rectangle) of the R*-tree is retrieved and compared with the segmented query sequence until a leaf node is retrieved in a random-access manner.
Then, all discretized candidate sequences in the leaf node undergo lower bounding distance calculation using LB_PAA. If the lower bounding distance from LB_PAA is smaller than the best-so-far distance, the raw time series sequence is also retrieved by random access, and the distances are determined using LB_Keogh and DTW distance, respectively. It is clear that Keogh et al.'s index structure requires too many random accesses even when the database size increases only slightly. Note that although Zhu et al. later propose a tighter lower bounding function, LB_NewPAA (Zhu and Shasha, 2003), the index structure still consumes high I/O cost.

Sakurai et al. (Sakurai et al, 2005) propose a new lower bounding function, LBS (Lower Bounding distance measure with Segmentation), which requires a quadratic time complexity $O(n^2/t^2)$, where $n$ is the length of the time series and $t$ is the size of a segment. To calculate the lower bounding distance, LBS first quantizes a query sequence and a candidate sequence into sequences of segments. Each segment contains two values that indicate the maximum and minimum among the data points in the segment. Then, dynamic programming is used to find the optimal distance between these two segmented sequences, and the resulting distance is a lower bound of the DTW distance. Despite the fact that LBS requires more computational time and space than LB_PAA at the same resolution, LBS achieves a much tighter lower bounding distance. An example of segmented sequences is shown in Figure 4.

Fig. 4 Illustration of segmented sequences with various resolutions

To use LBS in indexing, Sakurai et al. proposed an index structure which stores pre-calculated segmented sequences.
For each time series, a set of segmented sequences is generated by varying segment sizes from the coarsest to the finest, and each segmented sequence is stored in a flat file with a pointer to the raw time series data. In the querying process, a query sequence is segmented, and then the index structure is sequentially accessed and lower bounding distances are calculated against the pre-segmented candidate sequences. If the lower bounding distance is smaller than the best-so-far distance, the raw time series data is retrieved in a random-access manner. However, the main drawback of FTW is that the size of the index structure is approximately twice the size of the raw time series database. Therefore, this index structure is definitely impractical for massive time series databases since the entire index file, which is larger than the raw data, must be read once for every single query, causing large I/O overheads.

It is worth noting that the existing index structures are not designed for massive databases. For example, since LB_PAA utilizes PAA to reduce the number of dimensions, as the database size increases, its pruning power significantly decreases; therefore, a huge number of sequences must be accessed for distance calculation. Similarly for FTW indexing, when the database size increases, the index, at about twice the size of the raw data, grows along with it. In Section 5, our experiments will demonstrate that when the database exceeds the size of the main memory, our proposed method significantly outperforms these rival methods.

3 Background

Before describing our proposed method, TWIST, we provide some background knowledge, i.e., Dynamic Time Warping (DTW) distance, global constraints, and lower bounding distance functions including LB_Keogh and LBS.

3.1 Dynamic Time Warping Distance

Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994; Ratanamahatana and Keogh, 2005, 2004) is a well-known shape-based similarity measure. It uses a dynamic programming technique to find an optimal warping path between two time series sequences. To calculate the distance, it first creates a distance matrix, where each element is a local distance plus the minimum cumulative distance among its three surrounding neighbors. Suppose we have two time series, a sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ and a sequence $C = \langle c_1, \ldots, c_j, \ldots, c_m \rangle$. First, we create an $n$-by-$m$ matrix, and then each $(i, j)$ element, $\gamma_{i,j}$, of the matrix is defined as:

$$\gamma_{i,j} = |q_i - c_j|^p + \min\{\gamma_{i-1,j-1},\ \gamma_{i-1,j},\ \gamma_{i,j-1}\} \tag{1}$$

where $\gamma_{i,j}$ is the summation of $|q_i - c_j|^p$ and the minimum cumulative distance of the three elements surrounding the $(i, j)$ element, and $p$ is the dimension of the $L_p$-norm. For the time series domain, $p = 2$, corresponding to Euclidean distance, is typically used. After we have all distance elements in the matrix, to find an optimal path, we choose the path $W = \langle w_1, \ldots, w_k, \ldots, w_K \rangle$ that yields the minimum cumulative distance at $(n, m)$, where $w_k$ is the position $(i, j)$ of the $k$-th element of the warping path, $w_1 = (1, 1)$, and $w_K = (n, m)$. The DTW distance is defined as:

$$DTW(Q, C) = \min_{\forall W \in \mathbb{W}} \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{2}$$

where $d_{w_k}$ is the $L_p$ distance at the position $w_k$, $p$ is the dimension of the $L_p$-norm in Equation 1, and $\mathbb{W}$ is the set of all possible warping paths. The recursive form is shown in Equations 3 and 4. Note that, in the original DTW, the $p$-th root of the distance must be computed; however, for fast computation, we usually omit this calculation since the ranking of distance values does not change.

$$DTW(Q, C) = \sqrt[p]{D(n, m)} \tag{3}$$

$$D(i, j) = |q_i - c_j|^p + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{4}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, and $1 \le j \le m$.

3.2 Global Constraints

Although the unconstrained DTW distance measure gives an optimal distance between two time series, an unwanted warping path may be generated. A global constraint efficiently limits the optimal path to give a more suitable alignment. Recently, the R-K band (Ratanamahatana and Keogh, 2004), a general model of global constraints, has been proposed. It can be specified by a one-dimensional array $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, where $n$ is the length of the time series, and $r_i$ is the height above the diagonal in the $y$ direction and the width to the right of the diagonal in the $x$ direction, as shown in Figure 5. Each $r_i$ value is arbitrary; therefore, the R-K band is also an arbitrary-shaped global constraint. Note that when $r_i = 0$ for all $1 \le i \le n$, the R-K band represents the well-known Euclidean distance, and when $r_i = n$ for all $1 \le i \le n$, the R-K band represents the original DTW distance with no global constraint. The R-K band can also represent the Sakoe-Chiba (S-C) band by setting all $r_i = c$, where $c$ is the width of the global constraint.

Fig. 5 Global constraint on the DTW distance matrix when applying a specific R-K band

3.3 Lower Bounding Distance Function

A lower bounding distance function for DTW distance is a function used to calculate a lower bounding distance, which must always be smaller than or equal to the exact DTW distance (Yi et al, 1998; Kim et al, 2001; Keogh and Ratanamahatana, 2005; Zhu and Shasha, 2003; Sakurai et al, 2005). Therefore, in similarity search, the lower bounding function is used to prune off the candidate sequences that are definitely not the answers.
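
The recurrence of Equations 3 and 4, restricted by a band from Section 3.2, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `dtw` and its parameters are hypothetical, and the band is read row-wise (cell $(i, j)$ is allowed when $|i - j| \le r_i$), which coincides with the R-K band for constant-width bands such as Sakoe-Chiba but simplifies the general asymmetric case.

```python
import math


def dtw(q, c, r=None, p=2):
    """DTW distance via Equations 3 and 4.

    r is an optional band array (one width per element of q); None means
    no global constraint. Row-wise band check is a simplifying assumption.
    """
    n, m = len(q), len(c)
    if r is None:
        r = [max(n, m)] * n  # band wide enough to allow every cell
    # D(0,0) = 0 and D(i,0) = D(0,j) = infinity, as in Equation 4.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - r[i - 1])
        hi = min(m, i + r[i - 1])
        for j in range(lo, hi + 1):
            cost = abs(q[i - 1] - c[j - 1]) ** p
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)


q = [0, 0, 1, 2, 1, 0]
c = [0, 1, 2, 1, 0, 0]           # q shifted by one time step
print(dtw(q, c))                 # unconstrained DTW absorbs the shift
print(dtw(q, c, r=[0] * 6))      # r_i = 0 reduces to Euclidean distance
```

With $r_i = 0$ only the diagonal is reachable, so the result equals the Euclidean distance, which matches the remark about the R-K band in Section 3.2.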
Typically, a lower bounding function consumes much less computational time than the DTW distance does. In this work, we consider two lower bounding functions, i.e., LB_Keogh (Keogh and Ratanamahatana, 2005) proposed by Keogh et al. and LBS (Sakurai et al, 2005) proposed by Sakurai et al., since LB_Keogh is the best existing lower bounding function used in sequential search, and LBS is the tightest lower bounding function used in indexing. LB_Keogh creates an envelope from a query sequence, and then the lower bounding distance is calculated from the areas between the envelope and a candidate sequence. Unlike LB_Keogh, LBS creates a segmented query sequence and a segmented candidate sequence, and then these two segmented sequences are used to determine a lower bounding distance using dynamic programming.

3.3.1 LB_Keogh

To calculate LB_Keogh (Keogh and Ratanamahatana, 2005), an envelope $E = \langle e_1, \ldots, e_i, \ldots, e_n \rangle$ is generated from a query sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$, where $e_i = \{u_i, l_i\}$, and $u_i$ and $l_i$ are the upper and lower values of $e_i$. With a specified global constraint $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, elements $u_i$ and $l_i$ are computed from $u_i = \max\{q_{i-r_i}, \ldots, q_{i+r_i}\}$ and $l_i = \min\{q_{i-r_i}, \ldots, q_{i+r_i}\}$, respectively. The lower bounding distance $LB_{Keogh}(Q, C)$ between sequences $Q$ and $C$ can be computed by the following equation.

$$LB_{Keogh}(Q, C) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |c_i - u_i|^p & \text{if } c_i > u_i \\ |l_i - c_i|^p & \text{if } c_i < l_i \\ 0 & \text{otherwise} \end{cases}} \tag{5}$$

where $p$ is the dimension of the $L_p$-norm. The proof of $LB_{Keogh}(Q, C) \le DTW(Q, C)$ can be found in the original paper (Keogh and Ratanamahatana, 2005).

3.3.2 LBS

To calculate LBS (Lower Bounding distance measure with Segmentation), a query sequence and a candidate sequence must first be segmented. The segmented sequence $S^T = \langle s^T_1, \ldots, s^T_b, \ldots, s^T_t \rangle$ is calculated from the sequence $S = \langle s_1, \ldots, s_a, \ldots, s_A \rangle$ with a given segment size $T$, where $s^T_b = \{us^T_b, ls^T_b\}$, $us^T_b = \max\{s_x, \ldots, s_y\}$, $ls^T_b = \min\{s_x, \ldots, s_y\}$, $x = (b - 1) \cdot T + 1$, $y = b \cdot T$, and $1 \le T \le A$. Although LBS is capable of supporting segments with different lengths, in this work, we consider segments of equal length to demonstrate the maximum performance of LBS. The lower bounding distance $LBS(Q^T, C^T)$ between a segmented query sequence $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_n \rangle$ and a segmented candidate sequence $C^T = \langle c^T_1, \ldots, c^T_j, \ldots, c^T_m \rangle$ can be computed by the following equations.

$$LBS(Q^T, C^T) = \sqrt[p]{D(n, m)} \tag{6}$$

$$D(i, j) = T \cdot d(q^T_i, c^T_j) + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{7}$$

$$d(q^T_i, c^T_j) = \begin{cases} |lq^T_i - uc^T_j|^p & \text{if } lq^T_i > uc^T_j \\ |lc^T_j - uq^T_i|^p & \text{if } lc^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, $1 \le j \le m$, $q^T_i = \{uq^T_i, lq^T_i\}$, $c^T_j = \{uc^T_j, lc^T_j\}$, and $p$ is the dimension of the $L_p$-norm. Here $n$ and $m$ denote the numbers of segments in $Q^T$ and $C^T$. The proof of $LBS(Q^T, C^T) \le DTW(Q, C)$ can be found in Sakurai et al.'s original paper (Sakurai et al, 2005).

4 Time Warping in Indexed Sequential sTructure (TWIST)

In this work, we propose a novel index structure called TWIST (Time Warping in Indexed Sequential sTructure) which consists of both sequential structures and an index structure. Each sequential structure stores a collection of raw time series sequences, and the index structure stores a representative of each sequential structure together with a pointer to that sequential structure.
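
The two background lower bounds of Section 3.3 can be sketched as follows, as an illustration only. The function names (`lb_keogh`, `segment`, `seg_dist`, `lbs`) are hypothetical, the series length is assumed to be a multiple of the segment size $T$, and the envelope window is simply clipped at the ends of the sequence.

```python
import math


def lb_keogh(q, c, r, p=2):
    """Equation 5: envelope around q under band r, compared against c."""
    n, total = len(q), 0.0
    for i in range(n):
        window = q[max(0, i - r[i]): min(n, i + r[i] + 1)]
        u, l = max(window), min(window)
        if c[i] > u:
            total += (c[i] - u) ** p
        elif c[i] < l:
            total += (l - c[i]) ** p
    return total ** (1.0 / p)


def segment(s, T):
    """(upper, lower) pair for each consecutive block of T points."""
    if len(s) % T != 0:
        raise ValueError("sketch assumes len(s) is a multiple of T")
    return [(max(s[x:x + T]), min(s[x:x + T])) for x in range(0, len(s), T)]


def seg_dist(a, b, p=2):
    """Equation 8: gap between two (upper, lower) ranges, 0 if they overlap."""
    (ua, la), (ub, lb) = a, b
    if la > ub:
        return (la - ub) ** p
    if lb > ua:
        return (lb - ua) ** p
    return 0.0


def lbs(q, c, T, p=2):
    """Equations 6 and 7: dynamic programming over segmented sequences."""
    qT, cT = segment(q, T), segment(c, T)
    n, m = len(qT), len(cT)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = T * seg_dist(qT[i - 1], cT[j - 1], p) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)


q = [0, 0, 1, 2, 1, 0]
c = [3, 3, 4, 5, 4, 3]
print(lb_keogh(q, c, [1] * 6), lbs(q, c, 2), lbs(q, c, 6))
```

On this toy pair, the coarsest segmentation ($T = 6$, one segment) gives a looser bound than $T = 2$, which is the trade-off between resolution and tightness described for LBS.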

The intuitive idea of TWIST is to minimize both the number of random accesses and the number of distance calculations, making TWIST a much more suitable choice for massive databases than the existing methods, which do not scale well.

4.1 Problem Definition

We are interested in generic top-k querying in this work since many other mining tasks, e.g., classification and clustering, require this best-matched querying as a typical subroutine. Given a query sequence $Q$, a set $\mathbb{C}$ of equal-length time series sequences, a global constraint $R$, and an integer $k$, a top-k query returns the set of $k$ nearest-neighbor sequences of $Q$ from $\mathbb{C}$ under the DTW distance measure with the constraint $R$.

4.2 Data Structure

In this section, we describe the data structure of TWIST, which is specially designed to minimize both the I/O and CPU costs in the querying process. TWIST consists of two main components, i.e., a set of sequential structures (each called a Data Sequence File, DSF) and an index structure (called the Envelope Sequence File, ESF). In addition, TWIST groups similar sequences into the same sequential structure so that in the querying process, if this sequential structure greatly differs from a query sequence, TWIST will simply bypass that structure. To measure the difference between a query sequence and all the sequences in a sequential structure, a representative sequence (called an envelope) is pre-determined and stored in the index structure. The main benefit of the sequential structure is that we can access all the data in the sequential structure much faster than by random access (Weber et al, 1998). A sample data structure of TWIST is shown in Figure 6.

Suppose there is a set $\mathbb{S}$ of time series sequences, each of the form $S = \langle s_1, \ldots, s_i, \ldots, s_n \rangle$; a DSF simply stores these sequences sequentially. For each DSF, an envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$ for the group of time series sequences is generated, where $eg_i = \{ueg_i, leg_i\}$, $ueg_i = \max_{S \in \mathbb{S}}\{s_i\}$, and $leg_i = \min_{S \in \mathbb{S}}\{s_i\}$. In addition, the data structure of the ESF is basically an array $A$ of objects $O = \{P, EG\}$, each containing a pointer $P$ to a DSF and its envelope $EG$. Figure 7 illustrates the envelope construction for each DSF. The envelope is determined from the upper bound and the lower bound of a group of sequences.

Fig. 6 A sample data structure of TWIST (ESF and DSFs)

Fig. 7 An envelope created from a group of sequences

4.3 Lower Bounding Distance for a Group of Sequences

In this work, we propose a novel lower bounding distance function for a group of sequences called LBG. Instead of calculating a lower bounding distance between a query sequence and a single candidate sequence, LBG returns a lower bounding distance between a query sequence and a set of candidate sequences; in other words, the DTW distance between the query sequence and any candidate sequence in the set is never smaller than the lower bounding distance from LBG. Therefore, if the lower bounding distance is larger than the best-so-far distance, LBG can prune off all those candidate sequences since the real DTW distances of the candidate sequences are guaranteed not to be any smaller. More specifically, TWIST utilizes LBG by determining an LBG distance for each DSF from the envelope sequence stored in the ESF so that only some DSFs are accessed, which significantly reduces both CPU and I/O costs.

Suppose we are given a query sequence $Q = \langle q_1, \ldots, q_a, \ldots, q_n \rangle$ and an envelope $EG = \langle eg_1, \ldots, eg_b, \ldots, eg_n \rangle$, where $eg_b = \{ueg_b, leg_b\}$. LBG first creates a segmented query sequence $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_t \rangle$ and a segmented envelope $EG^T = \langle eg^T_1, \ldots, eg^T_j, \ldots, eg^T_t \rangle$ with segment size $T$, where $q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$. An element $q^T_i$ of the segmented query sequence $Q^T$ is computed by $uq^T_i = \max\{q_x, \ldots, q_y\}$ and $lq^T_i = \min\{q_x, \ldots, q_y\}$, where $x = (i - 1) \cdot T + 1$ and $y = i \cdot T$. On the other hand, to segment an envelope $EG$, elements $ueg^T_j$ and $leg^T_j$ are created as follows: $ueg^T_j = \max\{ueg_x, \ldots, ueg_y\}$ and $leg^T_j = \min\{leg_x, \ldots, leg_y\}$, where $x = (j - 1) \cdot T + 1$ and $y = j \cdot T$. To be more illustrative, Figure 8 shows the segmented envelope $EG^T$ created from an envelope $EG$.

Fig. 8 Illustration of a) an envelope used to generate b) a segmented envelope when calculating LBG

The lower bounding distance $LBG(Q^T, EG^T)$ between a segmented query sequence $Q^T$ and a segmented envelope $EG^T$ can be computed by the following equations.

$$LBG(Q^T, EG^T) = \sqrt[p]{D(t, t)} \tag{9}$$

$$D(i, j) = T \cdot d(q^T_i, eg^T_j) + \min \begin{cases} D(i-1, j-1) \\ D(i-1, j) \\ D(i, j-1) \end{cases} \tag{10}$$

$$d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le t$, $1 \le j \le t$, and $p$ is the dimension of the $L_p$-norm.

Theorem 1 Let $Q^T = \langle q^T_1, \ldots, q^T_i, \ldots, q^T_t \rangle$ and $EG^T = \langle eg^T_1, \ldots, eg^T_j, \ldots, eg^T_t \rangle$ be the approximate segments of sequence $Q$ and envelope $EG$ of a group of time series sequences $\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, respectively, where $q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$. Then

$$LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{12}$$

where $C_{opt}$ is a sequence in $\mathbb{C}$ which gives the minimum distance to sequence $Q$, and $C^T_{opt}$ is the segmented sequence of $C_{opt}$.

Proof Following from the proof of LBS (Sakurai et al, 2005), we have

$$LBS(Q^T, C^T_{opt}) \le DTW(Q, C_{opt}) \tag{13}$$

Let $c^T_j = \{uc^T_j, lc^T_j\}$ denote the $j$-th segment of $C^T_{opt}$. Since $ueg^T_j \ge uc^T_j$ and $leg^T_j \le lc^T_j$ for all $j$,

$$d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \ \le\ \begin{cases} |lq^T_i - uc^T_j|^p & \text{if } lq^T_i > uc^T_j \\ |lc^T_j - uq^T_i|^p & \text{if } lc^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \ =\ d(q^T_i, c^T_j) \tag{14}$$

Since $d(q^T_i, eg^T_j) \le d(q^T_i, c^T_j)$ for every pair of segments, then

$$LBG(Q^T, EG^T) \le LBS(Q^T, C^T_{opt}) \tag{15}$$

Therefore, from Equation 13, we have

$$LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{16}$$

Q.E.D.

Since LBG rests on the concept of a lower bounding distance calculation between a query and a group of sequences, the same concept can be applied to other lower bounds. We therefore also propose a lower bounding distance function extended from LB_Keogh, called LBG$_K$. LBG$_K$ obtains a lower bounding distance from a query sequence $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ and an envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$, where $eg_i = \{ueg_i, leg_i\}$. Given a query sequence $Q$, an envelope $EG$, and a global constraint $R = \langle r_1, \ldots, r_i, \ldots, r_n \rangle$, LBG$_K$ first creates an envelope of the global constraint $EGC = \langle egc_1, \ldots, egc_i, \ldots, egc_n \rangle$ from $EG$, where $egc_i = \{uegc_i, legc_i\}$. Elements $uegc_i$ and $legc_i$ are calculated by $uegc_i = \max\{ueg_{i-r_i}, \ldots, ueg_{i+r_i}\}$ and $legc_i = \min\{leg_{i-r_i}, \ldots, leg_{i+r_i}\}$, respectively. The lower bounding distance $LBG_K(Q, EG)$ between the query sequence $Q$ and the envelope $EG$ is determined by Equation 17, followed by its proof of correctness.

$$LBG_K(Q, EG) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases}} \tag{17}$$

where $p$ is the dimension of the $L_p$-norm.

Theorem 2 Let $Q = \langle q_1, \ldots, q_i, \ldots, q_n \rangle$ be a query sequence and $EGC = \langle egc_1, \ldots, egc_i, \ldots, egc_n \rangle$ be an envelope of the global constraint created from an envelope $EG$ of a group of sequences $\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, where $C_k = \langle c^k_1, \ldots, c^k_i, \ldots, c^k_n \rangle$. Then

$$LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{18}$$

where $C_{opt}$ is the sequence in $\mathbb{C}$ which gives the minimum DTW distance to $Q$.

Proof By definition,

$$DTW(Q, C_{opt}) = \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{19}$$

where $d_{w_k}$ is the $k$-th distance calculation on the optimal warping path between sequence $Q$ and the nearest sequence $C_{opt}$, i.e., a distance between some $q_i$ and some $c^{opt}_j$. For $uegc_i$ and $legc_i$, and for every $j$ with $i - r_i \le j \le i + r_i$,

$$uegc_i = \max_{i - r_i \le j' \le i + r_i} \{ueg_{j'}\} = \max_{i - r_i \le j' \le i + r_i} \left\{ \max_{1 \le k \le m} \{c^k_{j'}\} \right\} \ge c^{opt}_j \tag{20}$$

$$legc_i = \min_{i - r_i \le j' \le i + r_i} \{leg_{j'}\} = \min_{i - r_i \le j' \le i + r_i} \left\{ \min_{1 \le k \le m} \{c^k_{j'}\} \right\} \le c^{opt}_j \tag{21}$$

We must show that $LBG_K(Q, EG) \le DTW(Q, C_{opt})$, i.e.,

$$\sqrt[p]{\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases}} \le \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{22}$$

Since $K \ge n$ from the DTW conditions, every $q_i$ takes part in at least one $d_{w_k}$, so it suffices to show that each term on the left-hand side is bounded by such a $d_{w_k}$. There are three possible cases, i.e., $|q_i - uegc_i|^p \le d_{w_k}$, $|legc_i - q_i|^p \le d_{w_k}$, and $0 \le d_{w_k}$. Consider the first case, $q_i > uegc_i$, for which we need

$$|q_i - uegc_i|^p \le d_{w_k} \tag{23}$$

The global constraint requires that each data point $q_i$ be compared only with points $c^{opt}_j$ where $i - r_i \le j \le i + r_i$, so that $d_{w_k} = |q_i - c^{opt}_j|^p$ and Equation 23 becomes

$$|q_i - uegc_i|^p \le |q_i - c^{opt}_j|^p \tag{24}$$

which holds because, by Equation 20,

$$uegc_i \ge c^{opt}_j \tag{25}$$

The case $|legc_i - q_i|^p \le d_{w_k}$ follows from a similar argument using Equation 21, and $0 \le d_{w_k}$ always holds since $d_{w_k}$ is nonnegative. Hence,

$$LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{26}$$

Q.E.D.

4.4 Querying Process

When a query sequence is issued, the ESF is first accessed and the lower bounding distance from LBG for each envelope is calculated. If the LBG distance for any DSF is larger than the best-so-far distance, all time series sequences in that DSF are guaranteed not to be the answers.
TWIST ould utilize this distane to prune o a signian tly large n um b er of andidate sequenes b y using only a v ery small amoun t of b oth CPU and I/O osts. Instead of alulating only one lev el of lo w er b ounding distane, LBG al- ulates lo w er b ounding distane iterativ ely . First, the b est-so-far distane is initialized with an LBG distane b et w een the oarsest segmen ted sequenes of a query sequene and an en v elop e. Subsequen tly , ea h ner en v elop e sequene is used b y LBG alulation again and again. If LBG distane is still smaller than the b est-so-far distane, the DSF is aessed, and all data sequenes in DSF are then sequen tially sear hed. But if ner LBG is returned with an y- thing larger than the b est-so-far distane, the next DSF is then onsidered. The pro ess is terminated when all en v elop e sequenes in ESF are exhausted. The pseudo o de of TWIST with LBG is desrib ed in T able 1 . Although implemen tations of LBG and LBG K o v er TWIST are dieren t, w e pro vide solutions for b oth. The adv an tages of LBG K o v er LBG are that LBG K requires to aess ESF only one, while LBG requires t wie the aess, and when the small global onstrain t is applied in the querying, LBG K is faster. Ho w ev er, LBG a hiev es a b etter query p erformane in terms of query pro ess- ing time than LBG K sine LBG returns a tigh ter lo w er b ounding distane, indep enden t of the global onstrain t. T o query with LBG K under top- k querying, ea h en v elop e sequene is sequen tially retriev ed, and its lo w er b ounding distane is alulated. Then LBG K distanes are sorted in to a priorit y queue. DSF with smallest LBG K distane will rst b e aessed. Then for ea h andidate sequene in the DSF, sequen tial sear h is utilized to nd the b est-so-far sequene. One the DSF aess is ompleted, the lo w er b ounding distane from LBG K distane for the next DSF will then b e onsidered. 
If the lo w er b ounding distane b et w een the en v elop e of the next DSF is larger than the b est-so-far distane, the sear h is terminated, and a set of nearest-neigh b or sequenes is returned. The pseudo o de is pro vided in T able 2. 17 T able 1 T op- k querying under TWIST with LBG Algorithm [ C ℄ = LBG Top-k Quer ying [ Q, k ℄ 1 Let: 2 C b e a priorit y queue of answ er sequenes 3 P b e a p oin ter to DSF 4 E G b e an en v elop e 5 d best = PositiveInfinite b e the b est-so-far distane 6 T b e the oarsest resolution 7 for all { P , E G } in ESF // Finding d best from the oarsest v ersion of ESF 8 d E G = LB W LBS ( Q T , E G T ) 9 if ( d E G < d best ) d best = d E G endif 10 endfor 11 for all { P , W } in ESF 12 while ( T is not the nest resolution) // Use LB W LBS to prune E S F 13 d W = LB W LBS ( Q T , E G T ) 14 if ( d W > d best ) Break and go to the next { P, E G } endif 15 Set T to b e a ner resolution 16 endwhile 17 for all C in D S F P 18 d lower = LB ( Q , C ) 19 if ( d lower ≤ d best ) 20 d tr ue = DT W ( Q, C ) 21 if ( C .siz e () < k ) 22 C .enq ueue ( { C, d tr ue } ) 23 else 24 if ( d tr ue ≤ d best ) 25 C .enq ueue ( { C, d tr ue } ) 26 C .deq ueue () 27 d best = C .peek () .d tr ue 28 endif 29 endif 30 endif 31 endfor 32 endfor 33 Return C 18 T able 2 T op- k querying under TWIST with LBG K Algorithm [ C ℄ = LBG K Top-k Quer ying [ Q, k ℄ 1 Let: 2 W b e a priorit y queue of en v elop e distanes 3 C b e a priorit y queue of answ er sequenes 4 P b e a p oin ter to DSF 5 E G b e an en v elop e 6 d best = PositiveInfinite b e the b est-so-far distane 7 Initialize d best = PositiveInfinite 8 for all { P, E G } in ESF // Calulate LBG distane from ESF for all DSF 9 d W = LBW K ( Q, W ) 10 W .enq ueue ( { P, d W } ) 11 endfor 12 // Dequeue { P, d W } with smallest d W 13 // k eep sear hing for an answ er while d W ≤ d best 14 while ( { P , d W } = W .deq ueue () and d W ≤ d best ) 15 for all C in D S F P 16 d lower = LB ( Q, C ) 17 
if ( d lower ≤ d best ) 18 d tr ue = DT W ( Q, C ) 19 if ( C .siz e () < k ) 20 C .enq ueue ( { C, d tr ue } ) 21 else 22 if ( d tr ue ≤ d best ) 23 C .enq ueue ( { C, d tr ue } ) 24 C .deq ueue () 25 d best = C .peek () .d tr ue 26 endif 27 endif 28 endif 29 endfor 30 endwhile 31 Return C Although this pap er emphasizes on top- k querying, range query an simply b e adapted. Instead of using the b est-so-far distane to prune o the database, the range distane is used to sp eify the maxim um distane b et w een a query sequene and a andidate sequene. In addition, an in teger k is set to b e p ositiv e innite. 4.5 Indexing Pro ess T o main tain a data struture, w e also prop ose a ma hanism to eien tly insert and delete data sequenes o v er our prop osed index struture TWIST. 4.5.1 Data Se quen e Insertion In ase of insertion, supp ose there exist DSF s and ESF, ost of insertion b e- t w een a new sequene and an en v elop e is omputed for all en v elop es in ESF, the new sequene will b e in the minim um ost en v elop e. After the minim um-ost 19 en v elop e has b een found, the en v elop e's DSF is aessed, and the new sequene is added. The en v elop e is up dated aordingly to the ESF. Generally , the ost is omputed from the size of an en v elop e after insertion. If DSF exeeds the maxim um n um b er of sequenes p er le (maxim um page size), TWIST splits this DSF in to t w o DSF s, and t w o new en v elop es are also generated and stored in the ESF. F or lariation, w e pro vide the insertion algorithm in T able 3 . Note that the maxim um page size is a user-dened parameter whi h deter- mines a maxim um n um b er of sequenes within ea h DSF. 
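The insertion step described above can be sketched in Python as follows. This is a minimal in-memory sketch under stated assumptions: each group is a dictionary holding an `envelope` (a list of `(upper, lower)` pairs) and its `sequences`, standing in for one ESF entry and its DSF; the names `insertion_cost` and `insert_sequence` are hypothetical; and the cost follows the paper's cost function (Table 4 and Equation 27). Splitting is left to the caller and is only signalled by the returned flag.

```python
def insertion_cost(envelope, seq, p=2):
    """Cost of adding seq to a group: where seq leaves the envelope, charge the
    p-th power of the resulting envelope height at that point (Equation 27)."""
    total = 0.0
    for (upper, lower), x in zip(envelope, seq):
        if x > upper:
            total += (x - lower) ** p
        elif x < lower:
            total += (upper - x) ** p
    return total ** (1.0 / p)


def insert_sequence(groups, seq, max_page_size, p=2):
    """Add seq to the minimum-cost group.

    Returns (group index, needs_split). When the group now exceeds
    max_page_size the envelope is left untouched, because the caller is
    expected to split the group and rebuild both envelopes.
    """
    best = min(range(len(groups)),
               key=lambda g: insertion_cost(groups[g]["envelope"], seq, p))
    group = groups[best]
    group["sequences"].append(seq)
    if len(group["sequences"]) > max_page_size:
        return best, True
    # Otherwise widen the existing envelope so that it also covers seq.
    group["envelope"] = [(max(u, x), min(l, x))
                         for (u, l), x in zip(group["envelope"], seq)]
    return best, False
```

Because the $p$-th root is monotone, dropping it would select the same group; it is kept here only so that the returned value matches Equation 27.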
T able 3 Inserting a new sequene to TWIST Algorithm Inser tion [ C ℄ 1 // Find the minim um-ost DSF 2 Initialize cost min = PositiveInfinite , P min = null 3 for all { P , E G P } in ESF 4 cost E G = C ost ( E G P , C ) 5 if ( cost E G < cost min ) 6 cost min = cost E G 7 P min = P 8 endif 9 endfor 10 A dd C in D S F P min 11 // Che k if the size of D S F P min exeeds α 12 if ( D S F P min .siz e () > α ) 13 // Split D S F P min in to t w o DSF s, D S F X and D S F Y 14 [ { S, E G S } , { T , E G T } ] = S plitD S F ( DS F P min ) 15 Delete ˘ P min , E G P min ¯ from ESF 16 A dd { X, E G T } , { X, E G Y } to ESF 17 else 18 // Up date E G P min from C 19 E G P min = U pdateE nv elope ( E G P min , C ) 20 Up date ˘ P min , E G P min ¯ to ESF 21 endif Envelope EG New candidate sequence C Fig. 9 Shado w ed area represen ts total ost of insertion b et w een a sequene C and an en v elop e E G 20 T able 4 Cost funtion for an insertion of a sequene C in to E G Algorithm Cost [ E G, C ℄ 1 Let: 2 cost sum = 0 3 for ea h c i , ueg i , leg i 4 if ( c i > ueg i ) 5 cost sum = cost sum + | c i − l e g i | p 6 else if ( c i < leg i ) 7 cost sum = cost sum + | ueg i − c i | p 8 endif 9 Return cost sum Generally , the ost funtion is alulated from total area of an en v elop e after a new sequene is inserted. T o b e more illustrativ e, the shado w ed areas in Figure 9 indiate the ost of insertion. Giv en a new time series sequene C = h c 1 , . . . , c i , . . . , c n i and an en v elop e E G = h eg 1 , . . . , eg i , . . . , eg n i , where eg i = { ueg i , l eg i } , the ost funtion C ost ( E G, C ) is dened as (also sho wn in T able 4). C ost ( E G, C ) = p v u u u t n X i =1    | c i − l eg i | p if c i > ueg i | ueg i − c i | p if c i < le g i 0 otherwise (27) where p is the dimension of L p -norms. If the n um b er of sequene in DSF exeeds the maxim um page size, the DSF needs to split in to t w o DSF s to redue the en v elop e size. 
Generally, TWIST tries to split the sequences into two groups so that each new envelope sequence is tight and the two envelopes have only small overlaps. In this paper, $k$-means clustering (MacQueen, 1967) ($k = 2$) with Euclidean distance is adopted as a heuristic function for separating the data into two appropriate groups. Other algorithms, such as the splitting algorithms in R-tree (Guttman, 1984) and R*-tree (Beckmann et al, 1990), can be used in place of the $k$-means clustering algorithm, since they are also designed to separate entries and minimize the Minimum Bounding Rectangle (MBR); however, these splitting algorithms have relatively large time complexity. Pseudo code of the splitting algorithm is provided in Table 5.

After new DSFs are created in the insertion step, new envelopes are generated by the algorithm described in Table 6 by finding the maximum and minimum values for each DSF. If the number of sequences in the DSF does not exceed the maximum allowed, the envelope in the ESF is simply updated using the existing envelope and the new sequence. To update the existing envelope $EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$ from a new time series sequence $C = \langle c_1, \ldots, c_i, \ldots, c_n \rangle$, elements are updated by $ueg_i = \max\{ueg_i, c_i\}$ and $leg_i = \min\{leg_i, c_i\}$, where $eg_i = \{ueg_i, leg_i\}$. The updating algorithm is described in Table 7.
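The envelope construction, the envelope update, and the two-way split described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, sequences are plain lists held in memory, and the 2-means step uses plain Lloyd iterations with a deterministic initialisation (the first sequence and the sequence farthest from it). The paper does not specify the initialisation, so that choice is an assumption of this example.

```python
def create_envelope(sequences):
    """Point-wise (upper, lower) bounds over all sequences of one DSF."""
    return [(max(col), min(col)) for col in zip(*sequences)]


def update_envelope(envelope, seq):
    """Widen an existing envelope so that it also covers seq."""
    return [(max(u, x), min(l, x)) for (u, l), x in zip(envelope, seq)]


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _centroid(seqs):
    """Point-wise mean of a non-empty list of sequences."""
    return [sum(col) / len(col) for col in zip(*seqs)]


def split_dsf(sequences, iterations=10):
    """Split one DSF into two groups with 2-means under Euclidean distance.

    Returns two (sequences, envelope) pairs.
    """
    c0 = sequences[0]
    c1 = max(sequences, key=lambda s: _sq_dist(s, c0))
    left, right = list(sequences), []
    for _ in range(iterations):
        new_left = [s for s in sequences if _sq_dist(s, c0) <= _sq_dist(s, c1)]
        new_right = [s for s in sequences if _sq_dist(s, c0) > _sq_dist(s, c1)]
        if not new_left or not new_right:
            break
        left, right = new_left, new_right
        c0, c1 = _centroid(left), _centroid(right)
    if not right:
        # Degenerate input (for example identical sequences): cut in half.
        half = len(sequences) // 2
        left, right = sequences[:half], sequences[half:]
    return (left, create_envelope(left)), (right, create_envelope(right))
```

Rebuilding both envelopes from scratch after a split keeps them as tight as possible, whereas `update_envelope` only ever widens an envelope and therefore never needs to re-read the DSF.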
Table 5 Splitting algorithm, separating a DSF into two DSFs
Algorithm SplitDSF [DSF]
// Run k-means clustering algorithm
// Fix k = 2
[DSF_X, DSF_Y] = KMeans(DSF)
// Create EG_X and EG_Y
EG_X = CreateEnvelope(DSF_X)
EG_Y = CreateEnvelope(DSF_Y)
Return [{X, EG_X}, {Y, EG_Y}]

Table 6 An envelope construction algorithm
Algorithm CreateEnvelope [DSF]
Let:
  EG be an envelope with ueg_i = −∞ and leg_i = +∞ for all i
for each sequence C in DSF
  for each c_i, ueg_i, leg_i
    ueg_i = max{ueg_i, c_i}
    leg_i = min{leg_i, c_i}
  endfor
endfor
Return EG

Table 7 An envelope sequence update algorithm after a new sequence insertion
Algorithm UpdateEnvelope [EG, C]
for each c_i, ueg_i, leg_i
  ueg_i = max{ueg_i, c_i}
  leg_i = min{leg_i, c_i}
endfor
Return EG

4.5.2 Data Sequence Deletion

To delete a data sequence, the corresponding DSF is accessed and the sequence is simply deleted. However, when a DSF changes, the ESF needs to be updated as well. In particular, we provide two deletion policies, i.e., eager deletion and lazy deletion. For eager deletion, after each sequence deletion, TWIST immediately recalculates a new envelope from the entire set of sequences in that DSF, and writes the changes to the ESF. Lazy deletion, on the other hand, simply deletes a sequence from the DSF without updating the ESF; this is safe because the old envelope still encloses every remaining sequence, so TWIST guarantees that false dismissals will never occur in the lower bounding calculation of LBG. The trade-off between these two deletion policies is, of course, between deletion time and the tightness of the envelope. With eager deletion, the deletion time increases but the envelope sequence is tighter, while with lazy deletion the deletion is very fast but the envelope sequence is not as tight. We provide pseudo code for the deletion algorithm in Table 8.
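The difference between the two deletion policies can be illustrated with a small Python sketch. The dictionary layout (`envelope`, `sequences`) and the names `delete_sequence` and `encloses` are assumptions made for this example and are not part of the original system. The point of the sketch is that a lazily kept envelope still encloses every remaining sequence, which is why it remains usable for lower bounding, only less tightly.

```python
def create_envelope(sequences):
    """Point-wise (upper, lower) bounds over all sequences of one DSF."""
    return [(max(col), min(col)) for col in zip(*sequences)]


def encloses(envelope, sequences):
    """True if every point of every sequence lies inside the envelope."""
    return all(l <= x <= u
               for seq in sequences
               for (u, l), x in zip(envelope, seq))


def delete_sequence(group, seq, eager):
    """Remove seq from the group; rebuild the envelope only when eager."""
    group["sequences"].remove(seq)
    if eager and group["sequences"]:
        # Eager policy: one pass over the remaining sequences of the DSF.
        group["envelope"] = create_envelope(group["sequences"])
    return group
```

With lazy deletion the only work is the removal itself; the envelope can be tightened later, for example the next time the group is rebuilt for another reason.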
Table 8 Delete an existing sequence from TWIST
Algorithm Deletion [C]
Select DSF_P which contains C
Delete C from DSF_P
if (IsEager)
  EG_P = CreateEnvelope(DSF_P)
  Update {P, EG_P} to ESF
endif

5 Experimental Evaluation

In the experimental evaluation, we compare our proposed method, TWIST, with the best existing indexing method, FTW (Sakurai et al, 2005), and the best naïve method, sequential search with LB_Keogh (Keogh and Ratanamahatana, 2005), on several evaluation metrics, i.e., querying time, indexing time, the number of page accesses, and storage requirement. In addition, two solutions of our proposed method are evaluated, i.e., TWIST with LBG and TWIST with $LBG_K$. Although FTW indexing outperforms R*-tree with LB_PAA (Keogh and Ratanamahatana, 2005), our method shows superiority over FTW by a few orders of magnitude. Sequential search with LB_Keogh is also evaluated to show the best performance of the naïve method when no indexing structure is utilized. It is important to note that we make our best effort in tuning the rival methods to run at their best performance by applying the early abandon (Keogh and Ratanamahatana, 2005) and early stopping (Sakurai et al, 2005) techniques; however, as will be demonstrated, our proposed method still outperforms them on all metrics.

To verify that our proposed method is scalable for massive time series databases, we use a database whose size exceeds the main memory. Otherwise, the operating system is likely to cache the data in the main memory. Therefore, our database size ranges from 256 MB to 4 GB. We perform our experiments on a Windows XP computer with an Intel Core 2 Duo 2.77 GHz, 2 GB of RAM, and an 80 GB, 5400 rpm internal hard drive. All code in our experiments is implemented in Java 1.5.
5.1 Datasets

To visualize the performance in various dimensions, the datasets listed below are generated by varying the number of sequences in the database ($2^{16} = 65536$, $2^{17} = 131072$, $2^{18} = 262144$, and $2^{19} = 524288$ sequences) and the sequence length (512, 1024, and 2048 data points). All data sequences are Z-normalized; some examples for each dataset are shown in Figure 10.

1. Random Walk I (Sakurai et al, 2005; Assent et al, 2008): To demonstrate the scalability of our proposed method, a large number of sequences are generated by the following equation: $t_{i+1} = t_i + N(0, 1)$, where $N(0, 1)$ is a random value drawn from a normal distribution.
2. Random Walk II (Assent et al, 2008): We generate a set of random walk sequences from the following equation: $t_{i+1} = 2t_i - t_{i-1} + N(0, 1)$, where $N(0, 1)$ is a random value drawn from a normal distribution.
3. Electrocardiogram (Moody and Mark, 1983): This dataset is recorded from human subjects with atrial fibrillation, at 250 samples per second. The recordings were made at Boston's Beth Israel Hospital and revised for the MIT-BIH Arrhythmia Database. To build the dataset, we segment all the original sequences into small subsequences.

Fig. 10 Example sequences for datasets (a) Random Walk I, (b) Random Walk II, and (c) Electrocardiogram

5.2 Querying Time

In this experiment, query processing times are averaged over 100 runs and are compared in the best-matched problem by varying four parameters, i.e., the dataset size (number of time series sequences), the sequence length, the width of the global constraint, and the integer $k$, together with the maximum page size (only for TWIST).
In order to observ e the trend for ea h parameter, the default v alues are xed as follo ws, the dataset size as 524288 ( 2 19 ) sequenes, the length of time se- ries sequene as 2048 data p oin ts, the default width of global onstrain t as 10% of sequene length, an in teger k in top- k querying as 1, and the maxi- m um n um b er of sequenes in DSF as 128 sequenes. In addition, a dataset of 524288 sequenes with length 2048, giving appro ximately 4 GB in size, and 10% onstrain t width of global onstrain t is t ypially used in time series data mining omm unit y (Ratanamahatana and Keogh , 2005 ). Note that for LBS, the default segmen t size prop osed in the original pap er is used, i.e., 1024, 256, 24 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 500 1000 1500 2000 2500 The dataset size (data sequence) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 50 100 150 200 250 300 350 400 Dataset size (data sequence) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG (a) Random W alk I (b) Random W alk I I 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 500 1000 1500 2000 2500 3000 3500 4000 4500 The dataset size (data sequence) Query processing time per query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG () Eletro ardiogram Fig. 11 TWIST outp erforms the riv al metho ds, and is sligh tly aeted b y an inrease in the dataset size, where sequene length, global onstrain t, an in teger k , and page size are set to 2048, 10%, 1, and 128, resp etiv ely 400 600 800 1000 1200 1400 1600 1800 2000 0 500 1000 1500 2000 2500 Sequence length (data point) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG 400 600 800 1000 1200 1400 1600 1800 2000 0 50 100 150 200 250 300 350 400 Length of time series sequence (data point) Query processing time per one query (sec. 
) LB_Keogh FTW TWIST/LBG_K TWIST/LBG (a) Random W alk I (b) Random W alk I I 400 600 800 1000 1200 1400 1600 1800 2000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Length of time series sequence (data point) Query processing time per one query (sec. ) LB_Keogh FTW TWIST/LBG_K TWIST/LBG () Eletro ardiogram Fig. 12 Although sequene length inreases, TWIST requires only small query pro ess- ing time omparing with FTW and LB_Keogh, where database size, global onstrain t, an in teger k , and page size are set to 524288, 10%, 1, and 128, resp etiv ely 25 10 20 30 40 50 60 70 80 90 100 0 500 1000 1500 2000 2500 3000 3500 4000 k Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG 10 20 30 40 50 60 70 80 90 100 0 100 200 300 400 500 600 k Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG (a) Random W alk I (b) Random W alk I I 10 20 30 40 50 60 70 80 90 100 0 2000 4000 6000 8000 10000 k Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG () Eletro ardiogram Fig. 13 TWIST is faster than FTW and LB_Keogh for all v alues of k , where database size, sequene length, global onstrain t size, and maxim um page size are set to 524288, 2048, 10%, and 128, resp etiv ely 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 2000 4000 6000 8000 10000 12000 14000 The width of global constraint (% of time series length) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 500 1000 1500 2000 2500 The width of global constraint (% of time series length) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG (a) Random W alk I (b) Random W alk I I 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 The width of global constraint (% of time series length) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG () Eletro ardiogram Fig. 
14 TWIST and FTW are not aeted b y the inremen t of the global onstrain t's width; ho w ev er, TWIST outp erforms b oth FTW and LB_Keogh, where database size, sequene length, an in teger k , and page size are set to 524288, 2048, 1, and 128, resp etiv ely 26 100 150 200 250 300 350 400 450 500 500 1000 1500 2000 2500 3000 Maxmum page size (data sequence per page) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG 100 150 200 250 300 350 400 450 500 0 100 200 300 400 500 Maximum page size (data sequence per page) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG (a) Random W alk I (b) Random W alk I I 100 150 200 250 300 350 400 450 500 0 1000 2000 3000 4000 5000 6000 Maximum page size (data sequence per page) Query processing time per one query (sec.) LB_Keogh FTW TWIST/LBG_K TWIST/LBG () Eletro ardiogram Fig. 15 When maxim um page size  hanges, TWIST still outp erforms the riv al metho ds, where database size, sequene length, global onstrain t size, and an in teger k are set to 524288, 2048, 10%, and 1, resp etiv ely 64, and 16, and LBG uses the same segmen t size to that of LBS. In sequen tial sear h in DSF, w e implemen t LBS to redue the DTW distane alulation. Ho w ev er, the segmen ted sequene is generated online; in other w ords, no index struture is stored on DSF. Figures 11 , 12 , 13 , 14 , and 15 illustrate the p erformane of TWIST, om- paring in terms of querying time against t w o riv al metho ds b y v arying the dataset size, sequene length, the width of global onstrain t, an in teger k , and maxim um n um b er of sequenes in DSF. As exp eted, TWIST greatly outp er- forms sequen tial sear h with LB_Keogh and FTW indexing. 5.3 Indexing Time Indexing time is a w all lo  k time that an algorithm onsumes to build the index struture. 
In this exp erimen t, w e only ompare the indexing time with FTW indexing sine the sequen tial sear h with LB_Keogh do es not need an index struture. F rom an exp erimen t sho wn in Figure 16 , our indexing time is omparable to FTW's; ho w ev er, if the maxim um page size is larger, TWIST an greatly redue indexing time, but it ma y trade o with querying time (see Figure 15 ). The parameters used in this exp erimen t are set to b e the same as the default parameters from the example in the previous exp erimen ts. Although the indexing time is omparable to the FTW indexing, TWIST requires v ery small storage spae omparing with FTW indexing (as will b e sho wn in Setion 5.5 ). 27 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 1000 2000 3000 4000 5000 6000 7000 8000 Dataset size (data sequence) Indexing time (sec.) FTW TWIST with page size 64 TWIST with page size 128 TWIST with page size 256 TWIST with page size 512 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Dataset size (data sequence) Indexing time (sec.) FTW TWIST with page size 64 TWIST with page size 128 TWIST with page size 256 TWIST with page size 512 (a) Random W alk I (b) Random W alk I I 1 1.5 2 2.5 3 3.5 4 4.5 5 x 1 0 5 0 2000 4000 6000 8000 10000 12000 14000 Dataset size (data sequence) Indexing time (sec.) FTW TWIST with page size 64 TWIST with page size 128 TWIST with page size 256 TWIST with page size 512 () Eletro ardiogram Fig. 16 As page size inreases, the indexing time of TWIST signian tly redues and is omparable to FTW's. Note that TWIST still queries faster than FTW for all page sizes (see Figure 15 ). 
Fig. 17 [Plots of the number of page accesses versus dataset size (data sequences) for LB_Keogh, FTW, TWIST/LBG_K, and TWIST/LBG on (a) Random Walk I, (b) Random Walk II, and (c) Electrocardiogram] The number of page accesses of TWIST is smaller than that of the rival methods, especially in Random Walk I and Random Walk II, when the speedup factor is 5.

Fig. 18 [Plots of the number of page accesses versus dataset size (data sequences) for the same four methods and three datasets] The number of page accesses of TWIST is smaller than that of the rival methods, especially in Random Walk I and Random Walk II, when the speedup factor is 10.

5.4 Number of Page Accesses

The number of page accesses ($\eta$) is generally evaluated in order to estimate the I/O cost. We calculate the number of page accesses for TWIST with LBG and TWIST with $LBG_K$ according to the following equations.

\[ \eta_{LBG} = \frac{2\alpha + \beta}{SF} + \delta \tag{28} \]

\[ \eta_{LBG_K} = \frac{\alpha + \beta}{SF} + \delta \tag{29} \]

where $\alpha$ is the number of envelopes in the ESF, $\beta$ is the number of accessed candidate sequences, $\delta$ is the number of random accesses to DSFs, and $SF$ is the Speedup Factor proposed by Weber et al. (Weber et al, 1998), who state that sequential access is 5 to 10 times faster than random access.
Generally, two values of SF are considered, i.e., 5 and 10, which represent the traditional and the practical speedup factor of sequential access over random access. Since sequential scan accesses the entire database, it can be considered as an upper bound. Surprisingly, as shown in Figures 17 and 18, the number of page accesses of FTW indexing is approximately equal to that of the sequential scan, and is very large compared with our proposed method TWIST, because FTW retrieves the entire index structure, which nearly doubles the database size. On the other hand, in average cases, TWIST can avoid a great number of data accesses since it tries to minimize both the number of DSF accesses and the number of accessed candidates. For the experimental parameters, dataset size, sequence length, maximum page size, global constraint, and $k$ are set to 524288, 2048, 128, 10%, and 1, respectively.

5.5 Storage Requirement

In this section, we demonstrate the storage requirement for storing an index file in comparison with the rival method, FTW. Since FTW creates a set of segmented sequences for each candidate sequence, the index file's size is larger than the data file's. Therefore, the FTW index structure is not practical in real-world applications. Unlike FTW, TWIST's index file requires only a small amount of storage, i.e., only the envelopes from all groups of data sequences are stored. Figure 19 shows the comparison of storage requirement between TWIST and FTW. When the dataset size is $2^{19}$ sequences or 4 GB, FTW requires nearly 5 GB, but as expected TWIST requires only 110 MB; in other words, TWIST requires approximately 51 times less storage space than FTW, while still outperforming it in terms of query processing time.
Fig. 19 [Plots of storage requirement (bytes) versus dataset size (data sequences) for FTW and TWIST on (a) Random Walk I, (b) Random Walk II, and (c) Electrocardiogram] Illustration of the storage requirement comparison, showing that TWIST's index file requires only a small amount of storage compared with FTW's, where dataset size, sequence length, and maximum page size are set to 524288, 2048, and 64, respectively.

5.6 Discussion

As expected, query processing time increases for all approaches when the dataset size and the sequence length are larger. However, from Figures 11 and 12, we can see that FTW indexing and the naïve method require a much longer time for a single query than TWIST with LBG and $LBG_K$, and when the database size increases, their query processing time also grows much larger. In Figure 14, if the global constraint changes, only the naïve method with LB_Keogh and TWIST with $LBG_K$ are affected, since LB_Keogh and $LBG_K$ lose their tightness when the width of the global constraint increases. Although best-matched querying ($k = 1$) is typically used in several domains, we also evaluate TWIST when varying $k$, as shown in Figure 13. Obviously, when $k$ increases, the query processing time also increases, since for a large value of $k$ the best-so-far distance is also large. If the best-so-far distance is large, the search cannot use the lower bounding distance to prune off the database. However, from Figure 15, TWIST still retrieves an answer efficiently compared with the other methods.
The maximum page size is another important parameter that must be considered, because TWIST uses it to balance the number of pages in the database against the number of sequences in each page. In other words, if the maximum page size is small, the number of random accesses increases; otherwise, the number of sequential accesses increases. However, from the experiment, when the maximum page size changes, TWIST still outperforms FTW and sequential search with LB_Keogh. Note that when we set the maximum page size to one, TWIST is identical to FTW, but when the maximum page size is set to infinity, TWIST is similar to the naïve method, i.e., sequential scan. Therefore, both FTW indexing and the naïve method are special cases of TWIST.

To evaluate the indexing time, we compare TWIST with FTW indexing by varying the database size and the maximum page size in Figure 16. In our insertion algorithm, if the number of sequences exceeds the maximum page size, TWIST splits the DSF into two DSFs. Therefore, if the maximum page size is large, TWIST reduces the number of splitting function calls; this in turn reduces the indexing time, since the splitting algorithm relies on the $k$-means clustering algorithm, whose time complexity is linear in the page size. Although a large maximum page size reduces the indexing time, it trades off against query performance.

Although we provide the evaluation in terms of query processing time in Section 5.2, the number of page accesses also needs to be evaluated, since it reflects the I/O cost of each approach. The number of page accesses is formulated and calculated according to (Sakurai et al, 2005; Weber et al, 1998), which state that sequential access is five to ten times faster than random access.
From Figures 17 and 18, the number of page accesses of FTW indexing is always larger than that of the naïve approach, since FTW indexing reads all segmented sequences in the index file, whose number is equal to the number of sequences in the database. Obviously, TWIST consumes only a small number of page accesses because TWIST is designed to reduce both sequential and random accesses.

For the size of the index structure, TWIST utilizes only a small amount of space compared with FTW indexing, which always requires space twice the database size. In Figure 19, we demonstrate TWIST's storage requirement by varying the database size and the maximum page size, since the size of the ESF solely depends on the number of DSFs in the database.

6 Conclusion

In this work, we propose a novel indexed sequential structure called TWIST (Time Warping in Indexed Sequential sTructure) which significantly reduces querying time, by up to 50 times, compared with the best existing methods, i.e., FTW indexing and sequential scan with LB_Keogh. More specifically, TWIST groups similar time series sequences together in the same file, and then a representative of each group of sequences is calculated and stored in the index structure. When a query sequence is issued, a lower bounding distance for a group of sequences is determined from the query sequence and the representative retrieved from the index file. Therefore, if the lower bounding distance for a group of sequences is larger than the best-so-far distance, all candidate sequences in the group do not need to be accessed. This can prune off an impressively large number of candidate sequences and makes TWIST feasible for massive time series databases.

Acknowledgement

This research is partially supported by the Thailand Research Fund (Grant No. MRG5080246), the Thailand Research Fund given through the Royal Golden Jubilee Ph.D. Program (PHD/0141/2549 to V.
Niennattrakul), and the Chulalongkorn University Graduate Scholarship to Commemorate the 72nd Anniversary of His Majesty King Bhumibol Adulyadej.

References

Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-tree: efficient time series search and retrieval. In: Proceedings of 11th International Conference on Extending Database Technology (EDBT 2008), Nantes, France, pp 252-263
Bagnall AJ, Ratanamahatana CA, Keogh EJ, Lonardi S, Janacek GJ (2006) A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery 13(1):11-40
Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: An efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD 90), Atlantic City, pp 322-331
Berchtold S, Keim DA, Kriegel HP (1996) The X-tree: An index structure for high-dimensional data. In: Proceedings of 22nd International Conference on Very Large Data Bases (VLDB 96), Mumbai (Bombay), India, pp 28-39
Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: the 1994 AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington, pp 359-370
Ciaccia P, Patella M, Zezula P (1997) M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of 23rd International Conference on Very Large Data Bases (VLDB 97), Athens, Greece, pp 426-435
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: Experimental comparison of representations and distance measures. In: Proceedings of 34th International Conference on Very Large Data Bases (VLDB 2008), Auckland, New Zealand
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases.
In: Pro eedings of the 1994 A CM SIG- MOD In ternational Conferene on Managemen t of Data (SIGMOD 94), Minneap olis, Minnesota, pp 419429 Guttman A (1984) R-trees: A dynami index struture for spatial sear hing. In: Y ormark B (ed) Pro eedings of Ann ual Meeting SIGMOD'84, A CM Press, Boston, Massa h usetts, pp 4757 Itakura F (1975) Minim um predition residual priniple applied to sp ee h reognition. IEEE T ransations on A oustis, Sp ee h, and Signal Pro essing 23(1):6772 Keogh E, Ratanamahatana CA (2005) Exat indexing of dynami time w arp- ing. Kno wledge and Information Systems 7(3):358386 Keogh EJ, Kasett y S (2003) On the need for time series data mining b en h- marks: A surv ey and empirial demonstration. Data Mining and Kno wledge Diso v ery 7(4):349371 Keogh EJ, Chakrabarti K, P azzani MJ, Mehrotra S (2001) Dimensionalit y redution for fast similarit y sear h in large time series databases. Kno wledge and Information Systems 3(3):263286 Kim SW, P ark S, Ch u WW (2001) An index-based approa h for similarit y sear h supp orting time w arping in large sequene databases. In: Pro eedings of the 17th In ternational Conferene on Data Engineering (ICDE 2001), Heidelb erg, German y , pp 607614 Lin J, Keogh EJ, W ei L, Lonardi S (2007) Exp eriening SAX: a no v el sym- b oli represen tation of time series. Data Mining and Kno wledge Diso v ery 15(2):107144 Loh WK, Kim SW, Whang KY (2004) A subsequene mat hing algorithm that supp orts normalization transform in time-series databases. Data Mining and Kno wledge Diso v ery 9(1):528 MaQueen JB (1967) Some metho ds for lassiation and analysis of m ul- tiv ariate observ ations. In: Cam LML, Neyman J (eds) Pro eedings of 5th Berk eley Symp osium on Mathematial Statistis and Probabilit y , Univ ersit y of California Press, v ol 1, pp 281297 Mo o dy GB, Mark R G (1983) A new metho d for deteting atrial brillation using RR in terv als. 
Computers in Cardiology 10:227230 33 Ratanamahatana CA, Keogh EJ (2004) Making time-series lassiation more aurate using learned onstrain ts. In: Pro eedings of 4th SIAM In terna- tional Conferene on Data Mining (SDM 2004), Lak e Buena Vista, Florida, USA, pp 1122 Ratanamahatana CA, Keogh EJ (2005) Three m yths ab out dynami time w arping data mining. In: Pro eedings of 2005 SIAM In ternational Data Mining Conferene (SDM 2005), Newp ort Bea h, CL, USA, pp 506510 Sak o e H, Chiba S (1978) Dynami programming algorithm optimization for sp ok en w ord reognition. IEEE T ransations on A oustis, Sp ee h, and Sig- nal Pro essing 26(1):4349 Sakurai Y, Y oshik a w a M, F aloutsos C (2005) FTW: F ast similarit y sear h under the time w arping distane. In: Pro eedings of 24th A CM SIGA CT- SIGMOD-SIGAR T Symp osium on Priniples of Database Systems, Balti- more, ML, USA, pp 326337 Sakurai Y, F aloutsos C, Y amam uro M (2007) Stream monitoring under the time w arping distane. In: Pro eedings of IEEE 23rd In ternational Confer- ene on Data Engineering (ICDE 2007), Istan bul, T urk ey , pp 10461055 Vla hos M, Y u PS, Castelli V, Meek C (2006) Strutural p erio di measures for time-series data. Data Mining and Kno wledge Diso v ery 12(1):128 W ang X, Smith KA, Hyndman RJ (2006) Charateristi-based lustering for time series data. Data Mining and Kno wledge Diso v ery 13(3):335364 W eb er R, S hek HJ, Blott S (1998) A quan titativ e analysis and p erformane study for similarit y-sear h metho ds in high-dimensional spaes. In: Gupta A, Shm ueli O, Widom J (eds) Pro eedings of 24th In ternational Conferene on V ery Large Data Bases (VLDB 98), Morgan Kaufmann, New Y ork Cit y , NY, pp 194205 Yi BK, Jagadish HV, F aloutsos C (1998) Eien t retriev al of similar time sequenes under time w arping. 
In: Pro eedings of 14th In ternational Con- ferene on Data Engineering (ICDE 98), Orlando, FL, USA, pp 201208 Yianilos PN (1993) Data strutures and algorithms for nearest neigh b or sear h in general metri spaes. In: Pro eedings of 4th Ann ual A CM-SIAM Symp o- sium on Disrete Algorithms (SOD A 93), So iet y for Industrial and Applied Mathematis, Philadelphia, P A, USA, pp 311321 Zh u Y, Shasha D (2003) W arping indexes with en v elop e transforms for query b y h umming. In: Pro eedings of the 2003 A CM SIGMOD In ternational Con- ferene on Managemen t of Data (SIGMOD 2003), San Diego, CA, USA, pp 181192
