Motif Detection Inspired by Immune Memory

The search for patterns or motifs in data represents an area of key interest to many researchers. In this paper we present the Motif Tracking Algorithm, a novel immune inspired pattern identification tool that is able to identify variable length unkn…


William Wilson, Phil Birkin and Uwe Aickelin
School of Computer Science, University of Nottingham, UK
wow,pab,uxa@cs.nott.ac.uk

Abstract. The search for patterns or motifs in data represents an area of key interest to many researchers. In this paper we present the Motif Tracking Algorithm, a novel immune inspired pattern identification tool that is able to identify variable length unknown motifs which repeat within time series data. The algorithm searches from a completely neutral perspective that is independent of the data being analysed and the underlying motifs. In this paper we test the flexibility of the motif tracking algorithm by applying it to the search for patterns in two industrial data sets. The algorithm is able to identify a population of motifs successfully in both cases, and the value of these motifs is discussed.

1 Introduction

The investigation and analysis of time series data is a popular and well-studied area of research. Common goals of time series analysis include the desire to identify known patterns in a time series, to predict future trends given historical information and the ability to classify data into similar clusters. These processes generate summarised representations of large data sets that can be more easily interpreted by the user. Historically, statistical techniques have been applied to this problem domain. However, the use of Immune System inspired (IS) techniques in this field has remained fairly limited.
In our previous work [15] we proposed an IS approach to identify patterns embedded in price data using a population of trackers that evolve using proliferation and mutation. This early research proved successful on small data sets but suffered when scaled to larger data sets with more complex motifs. In this paper we describe the Motif Tracking Algorithm (MTA), a deterministic but non-exhaustive approach to identifying repeating patterns in time series data that directly addresses this scalability issue. The MTA represents a novel Artificial Immune System (AIS) using principles abstracted from the human immune system, in particular the immune memory theory of Eric Bell [16]. Implementing principles from immune memory to be used as part of a solution mechanism is of great interest to the immune system community and here we are able to take advantage of such a system. The MTA implements the Bell immune memory theory by proliferating and mutating a population of solution candidates using a derivative of the clonal selection algorithm [3].

A subsequence of a time series that is seen to repeat within that time series is defined as a motif. The objective of the MTA is to find those motifs. The power of the MTA comes from the fact that it has no prior knowledge of the time series to be examined or what motifs exist. It searches in a fast and efficient manner and the flexibility incorporated in its generic approach allows the MTA to be applied across a diverse range of problems.

Considerable research has already been performed on identifying known patterns in time series [9]. In contrast, little research has been performed on looking for unknown motifs in time series. This provides an ideal opportunity for an AIS driven approach to tackle the problem of motif detection, as a distinguishing feature of the MTA is its ability to identify variable length unknown patterns that repeat in a time series using an evolutionary system. In many data sets there is no prior knowledge of what patterns exist, so traditional detection techniques are unsuitable.

In this paper we test the generic properties of the MTA by applying it to motif identification in two industrial data sets to assess its ability to find variable length unknown motifs. The paper is structured as follows: Section 2 provides a discussion of the work that has been performed in motif detection, then various terms and definitions used by the MTA are introduced in Section 3. The pseudo code for the MTA is described in Section 4. Section 5 presents the results of the MTA when applied to the two industrial data sets before moving on to conclude in Section 6.

2 Related Work

The search for patterns in data is relevant to a diverse range of fields, including biology, business, finance, and statistics. Work by Guan [6] addresses DNA pattern matching using lookup table techniques that exhaustively search the data set to find recurring patterns.
Investigations using a piecewise linear segmentation scheme [7] and discrete Fourier transforms [4] provide examples of mechanisms to search a time series for a particular motif of interest. Work by Singh [12] searches for patterns in financial time series by taking a sequence of the most recent data items and looking for re-occurrences of this pattern in the historical data. An underlying assumption in all these approaches is that the pattern to be found is known in advance. The matching task is therefore much simpler as the algorithm just has to find re-occurrences of that particular pattern.

The search for unknown motifs is at the heart of the work conducted by Keogh et al. Keogh's probabilistic [2] and viztree [8] algorithms are very successful in identifying unknown motifs but they require additional parameters compared to the MTA. They also assume prior knowledge of the length of the motif to be found, so the motif is "only partially unknown". Motifs longer and potentially shorter than this predefined length may remain undetected in full. Work by Tanaka [13] attempts to address this issue by using minimum description length to discover the optimal length for the motif.

Fu et al. [5] use self-organising maps to identify unknown patterns in stock market data, by representing patterns as perceptually important points. This provides an effective solution but again the patterns found are limited to a predetermined length. A more flexible approach is seen in the TEIRESIAS algorithm [11], which is able to identify patterns in biological sequences.
TEIRESIAS finds patterns of an arbitrary length by isolating individual building blocks that comprise the subsets of the pattern; these are then combined into larger patterns. The methodology of building up motifs by finding and combining their component parts is at the heart of the MTA.

The MTA takes an IS approach, evolving a population of trackers that is able to detect motifs by making fewer assumptions about the data set and the potential motifs. It focuses on the search for unknown motifs of an arbitrary length, leading to a novel and unique solution.

3 Motif Detection: Terms and Definitions

Here we define some of the terms used by the MTA.

Definition 1. Time series. A time series T = t_1, ..., t_m is a time ordered set of m real or integer valued variables. In order to identify patterns in T we break T up into subsequences of length n using a sliding window mechanism.

Definition 2. Motif. A subsequence from T that is seen to repeat at least once throughout T is defined as a motif. We use Euclidean distance to examine the relationship between two subsequences C_1 and C_2, comparing ED(C_1, C_2) against a match threshold r. If ED(C_1, C_2) ≤ r the subsequences are deemed to match and thus are saved as a motif. The motifs prevalent in a time series are detected by the MTA through the evolution of a population of trackers.

Definition 3. Tracker. A tracker represents a signature for a motif sequence that is seen to repeat. It has within it a sequence of 1 to w symbols that are used to represent a dimensionally reduced equivalent of a subsequence.
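
The sliding window of Definition 1 and the match test of Definition 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function names `subsequences`, `euclidean_distance` and `is_match`, the toy series and the threshold value are all hypothetical.

```python
import math

def subsequences(T, n):
    """Slide a window of length n across T, one point at a time."""
    return [T[i:i + n] for i in range(len(T) - n + 1)]

def euclidean_distance(c1, c2):
    """ED(C1, C2) for two subsequences of equal length."""
    if len(c1) != len(c2):
        raise ValueError("subsequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))

def is_match(c1, c2, r):
    """Two subsequences are deemed to match when ED(C1, C2) <= r."""
    return euclidean_distance(c1, c2) <= r

# Toy series (made up for illustration): the shape 0, 1, 2 appears twice.
T = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.1]
subs = subsequences(T, 3)
repeats = is_match(subs[0], subs[4], r=0.5)   # same shape at positions 0 and 4
differs = is_match(subs[0], subs[2], r=0.5)   # rising versus falling shape
```

Comparing every pair of raw subsequences in this way is expensive, which is why the MTA first filters candidates through symbol strings and applies the distance test only to the survivors.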

The subsequences generated from the time series are converted into a discrete symbol string. The trackers are then used as a tool to identify which of these symbol strings represent a recurring motif. The trackers also include a match count variable to indicate the level of stimulation received during the matching process.

4 The Motif Tracking Algorithm

This section provides a listing of the MTA pseudo code along with a description of its main operations. We direct the reader's attention to [16] for a more in-depth description of this algorithm, along with a review of the immunological inspiration behind the MTA, which we do not have space to cover here. The parameters required in the MTA include the length of a symbol s, the match threshold r, and the alphabet size a.

MTA Pseudo Code

Initiate MTA (s, r, a)
Convert Time series T to symbolic representation
Generate Symbol Matrix S
Initialise Tracker population to size a
While ( Tracker population > 0 )
{
    Generate motif candidate matrix M from S
    Match trackers to motif candidates
    Eliminate unmatched trackers
    Examine T to confirm genuine motif status
    Eliminate unsuccessful trackers
    Store motifs found
    Proliferate matched trackers
    Mutate matched trackers
}
Memory motif streamlining

Convert Time Series T to Symbolic Representation. The MTA takes as input a univariate time series consisting of real or integer values.
Taking the first order difference of T we look at movements between data points, allowing a comparison of subsequences across different amplitudes. To further minimise amplitude scaling issues we normalise the time series. In our previous work [15] the algorithm investigated motifs through consideration of each data point individually, creating a solution that was not scalable to larger data sets. In the MTA this problem is resolved as we investigate motifs by combining individual data points into sequences and comparing and combining those sequences to form motifs.

Piecewise Aggregate Approximation (PAA) [2] is used to discretise the time series. PAA is a powerful compression tool that uses a discrete, finite symbol set to generate a dimensionally reduced version of a time series that consists of symbol strings. This intuitive representation has been shown to rival more sophisticated reduction methods such as Fourier transforms and wavelets [2].

Using PAA we slide a window of size s across the time series T one point at a time. Each sliding window represents a subsequence from T. The MTA calculates the average of the values from the sliding window and uses that average to represent the subsequence. The MTA converts this average into a symbol string. The user predefines the size a of the alphabet used to represent the time series T. Given that T has been normalised, we can identify the breakpoints for the alphabet characters that divide the area under the Gaussian curve into a equal sized regions [2].
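
The conversion steps described so far (differencing, normalisation, window averaging and Gaussian breakpoints) can be sketched as follows. This is a hedged illustration, not the MTA source: it assumes z-normalisation (the paper only says the series is normalised), it labels symbols with the letters 'a', 'b', ... as an arbitrary choice, and all function names are hypothetical. The breakpoints come from the inverse CDF of the standard normal distribution in Python's standard library.

```python
from statistics import NormalDist, mean, pstdev

def first_difference(T):
    """Movements between consecutive data points."""
    return [b - a for a, b in zip(T, T[1:])]

def normalise(values):
    """Z-normalise (assumed form of normalisation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def breakpoints(a):
    """The a - 1 cut points that split the standard normal into a equal areas."""
    nd = NormalDist()
    return [nd.inv_cdf(k / a) for k in range(1, a)]

def to_symbol(value, cuts):
    """Map a window average to a letter by counting the cut points below it."""
    index = sum(1 for c in cuts if value > c)
    return chr(ord("a") + index)

def symbol_matrix(T, s, a):
    """One single-symbol word per sliding window of size s."""
    z = normalise(first_difference(T))
    cuts = breakpoints(a)
    return [to_symbol(mean(z[i:i + s]), cuts) for i in range(len(z) - s + 1)]

# Toy input, made up for illustration.
S = symbol_matrix([0, 1, 3, 2, 5, 4, 6, 9, 7, 8], s=3, a=4)
```

With a = 4 the cut points fall at roughly -0.67, 0 and 0.67, so each of the four symbols is equally likely for normally distributed window averages.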

The average value calculated for the sliding window is then examined against the breakpoints and converted into the appropriate symbol. This process is repeated for all sliding windows across T to generate m-s+1 subsequences, each consisting of symbol strings comprising one character.

Generate Symbol Matrix S. The string of symbols representing a subsequence is defined as a word. Each word generated from the sliding window is entered into the symbol matrix S. The MTA examines the time series T using these words and not the original data points to speed up the search process. Symbol string comparisons can be performed efficiently to filter out bad motif candidates, ensuring the computationally expensive Euclidean distance calculation is only performed on those motif candidates that are potentially genuine.

Having generated the symbol matrix S, the novelty of the MTA comes from the way in which, each generation, a selection of words from S corresponding to the length of the motif under consideration is extracted in an intuitive manner as a reduced set and presented to the tracker population for matching.

Initialise Tracker Population to Size a. The trackers are the primary tool used to identify motif candidates in the time series. A tracker comprises a sequence of 1 to w symbols. The symbol string contained within the tracker represents a sequence of symbols that are seen to repeat throughout T.
Tracker initialisation and evolution is tightly regulated to avoid proliferation of ineffective motif candidates. The initial tracker population is constructed of size a to contain one of each of the viable alphabet symbols predefined by the user. Each tracker is unique, to avoid unnecessary duplication.

Trackers are created of a length of one symbol and matched to motif candidates via the words presented from the stage matrix S. Trackers that match a word are stimulated and become candidates for proliferation as they indicate words that are repeated in T. Given a motif and a tracker that matches part of that motif, proliferation enables the tracker to extend its length by one symbol each generation until its length matches that of the motif.

Generate Motif Candidate Matrix M from S. The symbol matrix S contains a time ordered list of all words, each containing just one symbol, that are present in the time series T. Neighbouring words in S contain significant overlap as they were extracted via sliding windows. Presenting all words in S to the tracker population would result in inappropriate motifs being identified between neighbouring words. To prevent this issue such 'trivial' match candidates are removed from the symbol matrix S in a similar fashion to that used in [2].

Trivial match elimination is achieved as a word is only transferred from S for presentation to the tracker population if it differs from the previous word extracted.
This allows the MTA to focus on significant variations in the time series and prevents time being wasted on the search across uninteresting variations.

Excessively aggressive trivial match elimination is prevented by limiting the maximum number of consecutive trivial match eliminations to s, the number of data points encompassed by a symbol. In this way a subsequence can eliminate as trivial all subsequences generated from sliding windows that start in locations contained within that subsequence (if they generate the same symbol string) but no others. The reduced set of words selected from S is transferred to the motif candidate matrix M and presented to the tracker population for matching.

Match Trackers to Motif Candidates. During an iteration each tracker is taken in turn and compared to the set of words in M. Matching is performed using a simple string comparison between the tracker and the word. A match occurs if the comparison function returns a value of 0, indicating a perfect match between the symbol strings. Each matching tracker is stimulated by incrementing its match counter by 1.

Eliminate Unmatched Trackers. Trackers that have a match count > 1 indicate symbols that are seen to repeat throughout T and are viable motif candidates. Eliminating all trackers with a match count < 2 ensures the MTA only searches for motifs from amongst these viable candidates. Knowledge of possible motif candidates from T is carried forward by the tracker population.
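
The trivial match elimination, matching and elimination steps above can be sketched as follows. This is an illustrative reading rather than the authors' code: the `Tracker` class and function names are hypothetical, and the cap on consecutive eliminations is implemented as "skip at most s identical words in a row", which is one interpretation of the limit described above. Survivors have their match count reset to 0 ready for the confirmation stage.

```python
from dataclasses import dataclass

@dataclass
class Tracker:
    symbols: str          # the symbol string carried by the tracker
    match_count: int = 0  # level of stimulation received during matching

def candidate_matrix(S, s):
    """Build M as (position, word) pairs from S, dropping trivial matches."""
    M = []
    skipped = 0
    for pos, word in enumerate(S):
        if M and word == M[-1][1] and skipped < s:
            skipped += 1          # same word as the last one kept: trivial
            continue
        M.append((pos, word))     # word differs, or the cap of s was reached
        skipped = 0
    return M

def match_and_eliminate(trackers, M):
    """Stimulate trackers by exact string match, keep those matched twice or more."""
    for t in trackers:
        t.match_count = sum(1 for _, word in M if word == t.symbols)
    survivors = [t for t in trackers if t.match_count >= 2]
    for t in survivors:
        t.match_count = 0         # reset before the confirmation stage
    return survivors

# Toy symbol matrix and a three-letter alphabet, made up for illustration.
S = list("aaaabbbaab")
M = candidate_matrix(S, s=2)
population = [Tracker(ch) for ch in "abc"]   # one tracker per alphabet symbol
survivors = match_and_eliminate(population, M)
```

Here the tracker for 'c' is eliminated because no word in M matches it, while 'a' and 'b' each match at least twice and survive.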

After elimination the match count of the surviving trackers is reset to 0.

Examine T to Confirm Genuine Motif Status. The surviving tracker population indicates which words in M represent viable motif candidates. However, motif candidates with identical words may not represent a true match when looking at the time series data underlying the subsequences comprising those words. In order to confirm whether two matching words X and Y, containing the same symbol strings, correspond to a genuine motif we need to apply a distance measure to the original time series data associated with those candidates. The MTA uses the Euclidean distance to measure the relationship between two motif candidates ED(X,Y) [16]. If ED(X,Y) ≤ r a motif has been found and the match count of that tracker is stimulated. A memory motif is created to store the symbol string associated with X and Y. The start locations of X and Y are also saved. For further information on the derivation of this matching threshold please refer to [16].

The MTA then continues its search for motifs, focusing only on those words in M that match the surviving tracker population in an attempt to find all occurrences of the potential motifs. The trackers therefore act as a pruning mechanism, reducing the potential search space to ensure the MTA only focuses on viable candidates.

Eliminate Unsuccessful Trackers. The MTA now removes any unstimulated trackers from the tracker population.
These trackers represent symbol strings that were seen to repeat but upon further investigation with the underlying data were not proven to be valid motifs in T.

Store Motifs Found. The motifs identified during the confirmation stage are stored in the memory pool for review. Comparisons are made to remove any duplication. The final memory pool represents the compressed representation of the time series, containing all the re-occurring patterns found.

Proliferate Matched Trackers. Proliferation and mutation are needed to extend the length of the tracker so it can capture more of the complete motif. At the end of the first generation the surviving trackers, each consisting of a word with a single symbol, represent all the symbols that are applicable to the motifs in T. Complete motifs in T only consist of combinations of these symbols. These trackers are stored as the mutation template for use by the MTA. Proliferation and mutation to lengthen trackers will only involve symbols from the mutation template and not the full symbol alphabet, as any other mutations would lead to unsuccessful motif candidates. During proliferation the MTA takes each surviving tracker in turn and generates a number of clones equal to the size of the mutation template. The clones adopt the same symbol string as their parent.

Mutate Matched Trackers. The clones generated from each parent are taken in turn and extended by adding a symbol taken consecutively from the mutation template.
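
Proliferation and mutation as described above amount to extending every surviving tracker by each symbol in the mutation template. A minimal sketch, assuming trackers are represented as plain strings and using the hypothetical function name `proliferate_and_mutate`:

```python
def proliferate_and_mutate(survivors, mutation_template):
    """Clone each survivor once per template symbol, then extend each clone."""
    next_generation = []
    for parent in survivors:
        for symbol in mutation_template:
            clone = parent                            # clone adopts the parent's string
            next_generation.append(clone + symbol)    # mutation adds one template symbol
    return next_generation

# After generation one the survivors are also the mutation template.
template = ["a", "b"]
generation_two = proliferate_and_mutate(template, template)
generation_three = proliferate_and_mutate(["ab", "ba"], template)
```

Because each clone of a given parent receives a different template symbol, the new population contains no duplicates as long as the survivors themselves are unique.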
This creates a tracker population with maximal coverage of all potential motif solutions and no duplication. This process forms the equivalent of the short term memory pool identified by Bell [1] and is illustrated in more detail in [16].

The tracker pool is fed back into the MTA ready for the next generation. A new motif candidate matrix M consisting of words with two symbols must now be formulated to present to the evolved tracker population. In this way the MTA builds up the representation of a motif one symbol at a time each generation to eventually map to the full motif using feedback from the trackers.

Given the symbol length s we can generate a word consisting of two consecutive symbols by taking the symbol from matrix S at position i and that from position i+s. Repeating this across S, and applying trivial match elimination, the MTA obtains a new motif candidate matrix M in generation two, each entry of which contains a word of two symbols, each of length s.

The MTA continues to prepare and present new motif candidate matrix data to the evolving tracker population each generation. The motif candidates are built up one symbol at a time and matched to the lengthening trackers. This flexible approach enables the MTA to identify unknown motifs of a variable length. This process continues until all trackers are eliminated as non-matching and the tracker population is empty.
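
The construction of multi-symbol words from S can be sketched as below. The paper describes the two-symbol case (positions i and i+s); extending the same stride to g symbols in generation g is an assumption made here for illustration, as is the function name `words_for_generation`. Trivial match elimination is not applied in this sketch.

```python
def words_for_generation(S, s, g):
    """Words of g symbols, taken from S at positions i, i+s, ..., i+(g-1)*s."""
    span = (g - 1) * s
    return [
        "".join(S[i + k * s] for k in range(g))
        for i in range(len(S) - span)
    ]

# Toy symbol matrix with distinct letters so the stride is easy to follow.
S = list("abcdefgh")
gen_one = words_for_generation(S, s=3, g=1)
gen_two = words_for_generation(S, s=3, g=2)
gen_three = words_for_generation(S, s=3, g=3)
```

Each additional generation shortens the list by s entries, since a word starting at position i needs symbols up to position i+(g-1)*s.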
Any further extension to the tracker population will not improve their fit to any of the underlying motifs in T.

Memory Motif Streamlining. The MTA streamlines the memory pool, removing duplicates and those encapsulated within other motifs to produce a final list of motifs that forms the equivalent of the long term memory pool.

5 Results

Here we examine the MTA's performance on two publicly available industrial data sets. The MTA was written in C++ and run on a Windows XP machine with a Pentium M 1.7 GHz processor with 1 GB of RAM.

5.1 Steamgen Data

The steamgen data set was generated using fuzzy models applied to the model of a steam generator at the Abbott Power Plant in Champaign [10] and is available from http://homes.esat.kuleuven.be/~tokka/daisydata.html. The steamgen data set consists of every tenth observation taken from the steam flow output, starting with the first observation. This specific data selection was used by Keogh and has been followed for the purposes of comparison.

The steamgen data set contains 960 items with significant amplitude variation. Parameters s = 10, a = 6, r = 0.5 were established as suitable after numerous runs of the MTA. Sensitivity analysis on these parameters can be found in [16]. The MTA identified 104 motifs of lengths varying between ten and 60 data points. Some of the motifs of length ten are seen to repeat up to 15 times throughout the data, others of length 20 are noted to repeat up to four times.
One significant motif of length 60, seen to occur twice in the data, at locations 75 and 883, dominates the motif pool. This motif is plotted in Figure 1.

Fig. 1. The plot of a motif found in the steamgen data by the MTA. It consists of the subsequences starting at locations 75 and 883, both of length 60. The X axis refers to the motif length, whilst the Y axis refers to steam flow.

In order to provide some grounding for the MTA, we compared the MTA result to that of Keogh's probabilistic motif search algorithm [2]. Keogh was kind enough to provide a teaching version of the algorithm which we applied to the steamgen data, using parameters established by Keogh. Given a predefined length of 80, the algorithm was able to identify a dominant motif consisting of sequences starting at points 66 and 874, which is consistent with the motif found by the MTA, as illustrated in Figure 2.

Fig. 2. The plot of a motif found in the steamgen data by Keogh's probabilistic algorithm. It consists of the subsequences starting at locations 66 and 874, both of length 80. The X axis refers to the motif length, whilst the Y axis refers to steam flow.

Comparing Figures 1 and 2 it appears that the MTA has only detected a subset of the motif found by the probabilistic algorithm, missing off the first and last ten data points of the longer motif. However, the Euclidean distances across the first and last ten point sequences are 5.48 and 11.17 respectively. Given s = 10 and r = 0.5 per unit, a match threshold of 5.0 is applied to each ten point sequence, resulting in the rejection of both omitted subsequences as non-matching.

Sensitivity analysis was performed on the MTA to changes in s, r, and a [16]. The MTA is sensitive to changes in s and r but not a. Reducing s from 20 to 10 and then to 5 increases execution times by 278% and 776% respectively, as the more detailed search takes significantly longer. Reducing r by 50% reduces execution time by approximately 92% as the stricter bind condition reduces the number of motif candidates investigated.

If the user is aware of the motif length then Keogh's probabilistic algorithm produces satisfactory results. It could be run for alternative motif lengths to find improved motifs, however this reduces the algorithm's effectiveness. Without knowledge of motif length the MTA provides a successful alternative. It is able to dynamically build up identification of the motif, symbol by symbol, until the match threshold is exceeded. The search process is driven by the match criteria and not a predetermined motif length, leading to better fitting motifs.

5.2 Power Demand Data

Having compared the MTA to an alternative approach we now focus the MTA on the power demand data set (www.cs.ucr.edu/~eamonn/TSDMA/index.html) which contains 35,040 fifteen minute averaged values of power demand (kW) for the ECN research centre during 1997 [14]. A subset of 5,000 data points was extracted from data point 5,000 onwards for evaluation.
Figure 3 plots this data subset. The five weekday peaks in power demand are clearly evident, with minimal demand seen to occur over the weekends.

Fig. 3. Power demand data subset with Motif A of length 500 highlighted in light grey, with five occurrences (listed 1 to 5) starting at locations 508, 1182, 1854, 2525 and 3869.

Running the MTA with the previously determined alphabet size a = 6, and increasing s to 500 and r to 4 given the larger data set and the magnitude of actual data values, the MTA is able to identify 18 motifs within 798,298 ms. The most frequently repeated motif A is plotted in Figure 3. Motif A represents the power demand from Thursday through to Tuesday, including a normal weekend with minimal demand on Saturday and Sunday, that is seen to occur five times.

The intervals between the first four occurrences of the motif are approximately 675 data points, or seven days given that each data point is a 15 minute interval, whilst that between the fourth and fifth occurrences is 14 days. This implies a potential motif is missing from data point 3200 to 3700. However, this subsequence relates to the period from the 26th to the 31st March 1997 and in the Netherlands the 28th and 31st of March are bank holidays during which there was no power demand. The sequence from 3200 to 3700 is therefore not consistent with motif A and was omitted. This simple case shows the MTA has been able to find a motif that represents all occurrences of a two day weekend that has no associated bank holidays.
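
The interval arithmetic above can be checked directly from the start locations listed in the Figure 3 caption. A small sketch (the variable names are illustrative only):

```python
POINTS_PER_DAY = 24 * 60 // 15   # 96 fifteen-minute readings per day

# Start locations of the five occurrences of Motif A (from the Fig. 3 caption).
starts = [508, 1182, 1854, 2525, 3869]

gaps = [b - a for a, b in zip(starts, starts[1:])]
gap_days = [gap / POINTS_PER_DAY for gap in gaps]

for gap, days in zip(gaps, gap_days):
    print(f"{gap} points = {days:.2f} days")
```

The first three gaps are each within a few points of 672, which is exactly seven days of fifteen-minute readings, while the last gap is twice that, consistent with one weekly occurrence being absent.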
Reducing the symbol size s to 400 for a more detailed search, the MTA identifies 21 motifs in 661,902ms. One Motif B, of length 400, is seen to occur twice at locations 2880 and 3648, see Figure 4. The MTA found a motif that corresponds to the two weeks that incorporate a bank holiday. One could argue that the four day patterns found in motif B should be found to repeat as a subset of all the other five day working weeks. However the MTA is able to distinguish between the bank holiday weeks and normal five day working weeks as it identified another motif C of length 1308, seen to occur twice in the data at positions 539 and 1884. This motif encapsulated a fortnight of normal working days that was seen to repeat twice, covering the period up to the start of motif B. Motif C did not occur for a third time after this due to the existence of the bank holidays which broke the matching criteria for that sequence.

Fig. 4. Power demand data subset with Motif B of length 400 highlighted in light grey, occurring twice as labelled 1 and 2, at locations 3924 and 3648

6 Conclusion

Motifs and patterns are key tools for use in data analysis. By extracting motifs that exist in data we gain some understanding as to the nature and characteristics of that data. The motifs provide an obvious mechanism to cluster, classify and summarise the data, placing great value on these patterns. Whilst most research has focused on the search for known motifs, little research has been performed looking for variable length unknown motifs in time series.
The MTA takes up this challenge, building on our earlier work to generate a novel immune inspired approach to evolve a population of trackers that seek out and match motifs present in a time series. The MTA uses a minimal number of parameters with minimal assumptions and requires no knowledge of the data examined or the underlying motifs, unlike other alternative approaches. Previous issues of scalability were addressed by using a discrete, finite symbol set to generate a dimensionally reduced version of the time series for investigation.

The MTA was evaluated using two industrial data sets and the algorithm was able to identify a motif population for each. In the steamgen data set a dominant motif was identified and compared to results from an alternative author. The ability of the MTA to find improved variable length motifs due to its immune memory inspired tracker evolution was also highlighted, a distinguishing feature over other algorithms. In the power demand data set the MTA was able to identify motifs that had meaningful significance to the user. The MTA found as motifs i) those periods that correspond to weekends not associated with bank holidays, ii) the four day working weeks that contain a bank holiday, and iii) the normal five day working weeks. From these results we believe the MTA offers a valuable contribution to an area of research that at present has received surprisingly little attention.

References

1. E. B. Bell, S. M. Sparshott, and C. Bunce. CD4+ T-cell memory, CD45R subsets and the persistence of antigen - a unifying concept. Immunology Today, 19:60–64, February, 1998.
2. B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. SIGKDD, August, 2003.
3. L. N. de Castro and F. J. Von Zuben. Learning and optimization using the clonal selection principle. IEEE Transactions on Evolutionary Computation, 6(3):239–251, 2002.
4. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time series databases. In proceedings of the SIGMOD conference, pages 419–429, 1994.
5. T. C. Fu, F. L. Chung, V Ng, and R Luk. Pattern discovery from stock market time series using self organizing maps. Workshop notes of KDD2001 workshop on temporal data mining. San Francisco, CA, pages 27–37, 2001.
6. X. Guan and E. C. Uberbacher. A fast look up algorithm for detecting repetitive DNA sequences. Pacific symposium on biocomputing, Hawaii, IEEE Tran. Control Systems Tech., December 1996.
7. E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In proceedings of the third international conference of knowledge discovery and data mining, pages 20–24, 1997.
8. J. Lin, E. Keogh, and S. Lonardi. Visualizing and discovering non trivial patterns in large time series databases. Information visualization, 4, issue 2:61–82, 2005.
9. J. Lin, E. Keogh, S. Lonardi, and P. Patel. Finding motifs in time series. In the 2nd workshop on temporal data mining, at the 8th ACM SIGKDD international conference on knowledge discovery and data mining, July, 2002.
10. G. Pellegrinetti and J. Bentsman. Nonlinear control oriented boiler modelling, a benchmark problem for controller design. IEEE Tran. Control Systems Tech., 4, No 1, January, 1996.
11. I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: TEIRESIAS algorithm. Bioinformatics, 14 no. 1:55–67, 1998.
12. S. Singh. Pattern modelling in time series forecasting. Cybernetics and systems - an international journal, 31, issue 1, 2000.
13. Y. Tanaka and K Uehara. Discover motifs in multi-dimensional time series using the principal component analysis and the MDL principle. In 3rd international conference on machine learning and data mining in pattern recognition, Leipzig, Germany, pages 252–265, 2003.
14. J. J van Wijk and E. R. van Selow. Cluster and calendar based visualization of time series data. In IEEE Symposium on information visualization INFOVIS '99, San Francisco, October 25-26, 1999.
15. W. O. Wilson, P. Birkin, and U. Aickelin. Price trackers inspired by immune memory. In Proceedings of Artificial immune systems, 5th International conference, ICARIS 2006, Oeiras, Portugal, pages 362–375, 2006.
16. W. O. Wilson, P. Birkin, and U Aickelin. The motif tracking algorithm. In IEEE Transactions on Evolutionary Computation, 2007. Under review.
