Practical Algorithmic Techniques for Several String Processing Problems

The domains of data mining and knowledge discovery make use of large amounts of textual data, which need to be handled efficiently. Specific problems, like finding the maximum weight ordered common subset of a set of ordered sets or searching for spe…

Authors: Mugurel Ionut Andreica, Nicolae Tapus

Mugurel Ionuţ Andreica, Nicolae Ţăpuş
Computer Science and Engineering Department, Politehnica University of Bucharest, Bucharest, Romania
e-mail: {mugurel.andreica, nicolae.tapus}@cs.pub.ro

Abstract: The domains of data mining and knowledge discovery make use of large amounts of textual data, which need to be handled efficiently. Specific problems, like finding the maximum weight ordered common subset of a set of ordered sets or searching for specific patterns within texts, occur frequently in this context. In this paper we present several novel and practical algorithmic techniques for processing textual data (strings) in order to efficiently solve multiple problems. Our techniques make use of efficient string algorithms and data structures, like KMP, suffix arrays, tries and deterministic finite automata.

Keywords: string processing; prefix query; trie; suffix array; KMP; deterministic finite automaton

I. INTRODUCTION

Textual data exists and is constantly being produced in large amounts: web pages, digital libraries and official state documents all need to be processed efficiently in order to automatically extract the information contained within them. Web search engines classify and store text data in efficient data structures, data mining algorithms exploit similarities between multiple pieces of data, and biologists analyze large DNA sequences in order to extract information regarding genes and to identify patterns. In this paper we present novel and practical algorithmic solutions for several string processing problems. Our techniques make use of efficient string algorithms and data structures, like KMP, suffix arrays, tries, deterministic finite automata, and others. In Section II we discuss two types of string prefix queries.
In Section III we address the problem of optimally concatenating a set of strings, in order to optimize an objective metric. In Section IV we study constrained optimal length common subsequences and we present novel solutions for the shortest common contiguous non-subsequence of a set of strings. In Section V we construct and count strings having specific properties. Finally, in Section VI we discuss related work and in Section VII we conclude.

II. STRING PREFIX QUERIES

We consider a string S=c_1c_2…c_n (of length n). We want to preprocess the string in order to be able to answer the following types of queries:
1) PQ(i,j) = is the prefix c_1c_2…c_i of S equal to c_(j-i+1)c_(j-i+2)…c_j? (j ≥ i);
2) LPQ(j,k) = which is the largest value of i such that i ≤ k (0 ≤ i ≤ k ≤ j) and PQ(i,j)=true?

The solution for both types of queries consists of first running the preprocessing stage of the Knuth-Morris-Pratt (KMP) algorithm [4]. As a result of this stage, the table p(i) is computed, where p(i) is the largest value (p(i) […] 0 do: (1) if (w(Anc(v,j))>k) then v=Anc(v,j); (2) j=j-1. In the end, if w(v)>k (and v is not the tree root) we set v=p(v). This approach has the disadvantage that it requires O(n·log(n)) memory.

We can reduce the memory usage at the expense of running time, as follows. We will compute a value Anc(i) = an ancestor of vertex i which is located c levels higher than i (or the closest ancestor which is located on a level which is a multiple of c). While traversing the tree, we maintain the DFS stack Stk of vertices. When first entering a vertex i, this vertex is pushed on top of the stack, and when exiting a vertex i, this vertex is popped from the stack. Let us assume that the topmost level of the stack is ltop (the one containing vertex i). If ltop ≤ c we have Anc(i) = the tree's root; otherwise, we may set Anc(i)=Stk(ltop-c) or Anc(i)=Stk(ltop-1-((ltop-1) mod c)).
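The stride-c ancestor table can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tree in question is the KMP failure tree (vertex i is the prefix of length i, its parent is p(i), the root is 0) with weights w(v)=v, and the names failure_table, stride_ancestors, lpq and pq are hypothetical. The lookup in lpq follows the three-step walk this section gives for finding the ancestor with the largest weight not exceeding k.

```python
def failure_table(s):
    """p[i] = length of the longest proper border of the prefix of length i."""
    n = len(s)
    p = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and s[i - 1] != s[k]:
            k = p[k]
        if s[i - 1] == s[k]:
            k += 1
        p[i] = k
    return p


def stride_ancestors(parent, root, c):
    """anc[v] = ancestor c levels above v (the root if v is on one of the first c levels)."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for v in range(n):
        if v != root:
            children[parent[v]].append(v)
    anc = [root] * n
    path = []                       # DFS stack: root ... current vertex
    todo = [(root, False)]
    while todo:
        v, leaving = todo.pop()
        if leaving:
            path.pop()
            continue
        path.append(v)
        ltop = len(path)            # 1-based level of v on the stack
        anc[v] = root if ltop <= c else path[ltop - c - 1]
        todo.append((v, True))
        for child in children[v]:
            todo.append((child, False))
    return anc


def lpq(p, anc, j, k):
    """Largest i <= k such that the prefix of length i is a suffix of the prefix of length j."""
    v = j
    while v != 0 and anc[v] > k:    # long jumps, c levels at a time (w(v) = v assumed)
        v = anc[v]
    while v != 0 and p[v] > k:      # at most about c single steps
        v = p[v]
    if v != 0 and v > k:
        v = p[v]
    return v


def pq(p, anc, i, j):
    """PQ(i, j), answered here through LPQ (an assumption about the elided text)."""
    return lpq(p, anc, j, i) == i
```

With c close to the square root of n, the table needs O(n) memory and each lookup takes O(sqrt(n)) steps in the worst case; the choice of c is left to the caller.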
Then, in order to find the ancestor v of vertex i with the largest weight w(v) ≤ k we proceed as follows. We initialize v=i. While (v is not the tree root) and w(Anc(v))>k we set v=Anc(v). Then, while (v is not the tree root) and (w(p(v))>k) we set v=p(v). In the end, if w(v)>k (and v is not the tree root) we set v=p(v).

III. OPTIMAL STRING CONCATENATIONS

A. Optimal Concatenation of Strings from Two Sets

We consider two sets A and B, each containing N strings. We want to compute the shortest string S which can be obtained both as a concatenation of some strings from A and as a concatenation of some strings from B. When considering the concatenation of the strings from each set, a string may occur zero, one, or multiple times. We will denote by len(x,y) the length of the string x from the set A (if y=1) or B (if y=2). We will conceptually construct two strings S_1 and S_2, where S_1 is obtained by concatenating some strings from A and S_2 is obtained by concatenating some strings from B. Let us assume that, initially, S_1 contains a string from A and S_2 is empty. The algorithm for constructing the strings S_1 and S_2 will add a string from A (B) to S_1 (S_2) if S_1 (S_2) is shorter than S_2 (S_1). This way, the shorter string will always be a prefix of the longer string; the longer string may have some extra characters which are the suffix of a string from the corresponding set.
We will compute the following table: L(i,j,p) = the minimum length of the shorter string (among S_1 and S_2), such that: (1) if p=1, then S_1 is longer than S_2: S_1 has the structure S_2+q (+ is concatenation), where q is the suffix of the string i from the set A, starting at the position j (0 ≤ j ≤ len(i,1)); we consider the positions of the strings to be indexed starting from 0; (2) if p=2, then S_2 is longer than S_1: S_2 has the structure S_1+q, where q is the suffix of the string i from the set B, which starts at the position j (0 ≤ j ≤ len(i,2)).

We now have to solve a shortest path problem in a directed graph with O(N·(LMAX+1)) vertices (LMAX = the maximum length of a string from A or B), or O(sum of the lengths of all the strings from A and B). In order to compute the neighbors of a vertex (i,j,p), we will consider all the strings i_1 from the set A (if p=2) or B (if p=1). For every string i_1, we will verify if this string matches the string i from the set A (if p=1) or B (if p=2), starting from the position j. Let us assume that the two strings match. Then, if len(i_1,3-p) ≤ len(i,p)-j, the vertex (i,j,p) will have a directed edge towards the vertex (i, j+len(i_1,3-p), p), and the cost of this edge will be len(i_1,3-p); otherwise, the vertex (i,j,p) will have a directed edge towards the vertex (i_1, len(i,p)-j, 3-p), because the first len(i,p)-j characters of i_1 are consumed by the match and the remaining suffix of i_1 becomes the new overhang, and the cost of the edge will be len(i,p)-j.

For every vertex (i,j,p) and every considered string i_1 we can test in O(LMAX) time if the string i_1 matches the corresponding suffix of the string i (starting at the position j). However, we can preprocess all this information using KMP.
For each string i from A [p=1] (B [p=2]) we consider every string i_1 from B (A); we can compute in O(LMAX) time all the positions j of the string i at which the string i_1 may start such that it matches the suffix of the string i starting at j (j=j'-len(i_1,3-p)+1 if i_1 matches i on the positions j,…,j', or j=len(i,p)-j'', where j'' is the length of any prefix of i_1 which is an ancestor in i_1's failure tree of the longest prefix of i_1 matching the end of i). Thus, we can test a match in O(1) amortized time for every tuple (i,j,p,i_1).

The graph has O(N·(LMAX+1)) vertices and O(N^2·(LMAX+1)) (directed) edges. Using Dijkstra's algorithm, the time complexity is O(N^2·(LMAX+1)^2), or O(N^2·(LMAX+1)·log(N·(LMAX+1))) (if we use binary heaps), or O(N^2·(LMAX+1)+N·(LMAX+1)·log(N·(LMAX+1))) (if we use Fibonacci heaps). At first, we will have L(*,0,*)=0 (the vertices (i,0,p) are the source vertices; we can always have multiple source vertices if we insert them all at the beginning in the queue/binary heap/Fibonacci heap) and L(*,j>0,*)=+∞. At the end of the algorithm, the length of the shortest string S that can be written as both a concatenation of strings from A and from B is min{L(i, len(i,p), p)}. The string S can be computed by tracing back the way we computed the values L(i,j,p) (starting from the vertex which minimizes the length of S).

If we did not care about finding the shortest string S with the given properties (and any string S would be acceptable), then we could simply traverse the constructed graph (using DFS or BFS), starting from the vertices (i,0,p) (we introduce all these vertices at the beginning in the BFS queue; in the DFS case, we start a new traversal from each such vertex, taking care not to traverse the vertices which were marked as traversed during previous DFS traversals or during the current one).
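The shortest-path formulation over the states (i,j,p) can be sketched as below. This is an illustrative sketch under several assumptions: all strings are non-empty, p is encoded as 0 for A and 1 for B instead of 1 and 2, matches are tested directly with str.startswith rather than with the KMP preprocessing, and the function name is hypothetical. When a string i_1 from the other set is longer than the current overhang, the new state stores the offset len(i,p)-j inside i_1, which is where its unmatched suffix begins.

```python
import heapq


def shortest_common_concatenation_length(A, B):
    """Length of the shortest string that is a concatenation of strings from A
    and also a concatenation of strings from B; None if no such string exists."""
    sets = (A, B)                       # p = 0 stands for A, p = 1 for B
    dist, heap = {}, []
    for p in (0, 1):                    # sources: one whole string is the overhang
        for i in range(len(sets[p])):
            dist[(i, 0, p)] = 0
            heap.append((0, i, 0, p))
    heapq.heapify(heap)
    while heap:
        d, i, j, p = heapq.heappop(heap)
        if d > dist[(i, j, p)]:
            continue                    # stale heap entry
        s = sets[p][i]
        if j == len(s):
            return d                    # empty overhang: both concatenations coincide
        rest = s[j:]                    # overhang of the longer string
        for i1, t in enumerate(sets[1 - p]):
            if len(t) <= len(rest):
                if not rest.startswith(t):
                    continue
                nxt, cost = (i, j + len(t), p), len(t)
            else:
                if not t.startswith(rest):
                    continue
                nxt, cost = (i1, len(rest), 1 - p), len(rest)
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                heapq.heappush(heap, (d + cost, *nxt))
    return None
```

Since all edge costs are positive, the first terminal state popped from the heap carries the minimum length; recovering the string itself would additionally require storing predecessor pointers.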
If such a traversal visits a vertex (i,len(i,p),p), then a string S with the given properties exists (and can be found by tracing back the path towards the source vertex/vertices).

An application of this problem is the following. We have a set of strings A and we want to compute the shortest palindrome that can be obtained as a concatenation of some strings from the set (any string may be used zero, one, or multiple times). We will construct the set B as being formed of the strings from A, but reversed. Then, we will compute the same values as before. A vertex (i,j,p) denotes a solution if the substring starting at the position j of the string i of the set corresponding to p (A for p=1, and B for p=2) and ending at the end of the string is a palindrome (this substring may also have a length of 0 or 1). The answer will be min{2·L(i,j,p)+len(i,p)-j | the suffix of the string i from the set A (B) if p=1 (p=2), starting at the position j, is a palindrome}. The concatenation of the strings from the set B is reversed and attached at the end of the concatenation of the strings from A.

B. Minimum Lexicographic Concatenation

We consider N strings: S(1), …, S(N). We want to sort these strings in some order p(1), …, p(N), such that the string Q=S(p(1))+…+S(p(N)) (+ denotes the concatenation of two strings) is lexicographically minimum. Let len(X) be the length of the string X. We will use any sorting algorithm for sorting "increasingly" the N strings. When we need to compare two strings S(i) and S(j), we can decide that:
• if S(i)+S(j) <lex S(j)+S(i) then S(i) "<" S(j)
• if S(i)+S(j) >lex S(j)+S(i) then S(i) ">" S(j)
• if S(i)+S(j)=S(j)+S(i) then: if len(S(i)) ≤ len(S(j)) then S(i) "≤" S(j) else S(i) ">" S(j)

IV. OPTIMAL LENGTH COMMON SUBSEQUENCES

A. Longest Common Contiguous Subsequence

We consider N strings: S(1), …, S(N).
We want to compute the longest contiguous substring which, for at least F (0 ≤ F ≤ N) of the N given strings, occurs at least a(i) times in S(i) (0 ≤ a(i); 1 ≤ i ≤ N); in the other strings S(i), the substring may occur fewer than a(i) times. We will construct a string Z=S(1)$_1S(2)$_2...$_(N-1)S(N), structured as follows: the string S(1), followed by the character $_1, then followed by S(2), then followed by $_2, ..., then followed by S(N). The characters $_1, ..., $_(N-1) are distinct and they do not occur in any of the strings S(i) (1 ≤ i ≤ N). We will construct a suffix array associated to Z, obtained by lexicographically sorting the suffixes of Z: su(1), ..., su(|Z|) (su(i) denotes the position of the first character from Z of the corresponding suffix; |Z|=len(Z)). For each suffix su(i) we know exactly to which string S(j) its first character belongs, because of its position in Z (the first |S(1)| positions are marked as belonging to S(1), the next position contains $_1, the next |S(2)| positions are marked as belonging to S(2), and so on); the suffixes whose first character is one of the characters $_1, ..., $_(N-1) are of no interest. We will also compute the values LCP(i) = the length of the longest common prefix of the suffixes su(i) and su(i+1) (1 ≤ i ≤ |Z|-1). The suffix array and the array LCP can be constructed in O(|Z|·log(|Z|)) time.

We will traverse the suffix array su(*) with two pointers, left and right. Initially, we will have left=1 and right=0. We will maintain an array x, where x(i) = the number of suffixes whose starting position belongs to S(i), among the suffixes su(j) with left ≤ j ≤ right; initially, x(i)=0 (1 ≤ i ≤ N). We will also maintain a counter nok = the number of strings S(i) for which a(i) ≤ x(i). Initially, nok is equal to the number of indices j for which a(j)=0 (1 ≤ j ≤ N).
If nok ≥ F from the start, then there are at least F values a(i)=0 and the answer is given by the string S(q) having the maximum length. Another particular case occurs when there are exactly F-1 values a(i) equal to 0, and the other values are at least 1; in this case, the maximum length subsequence is the string S(k) with the maximum length such that a(k) ≥ 1.

After removing the special cases, we will perform |Z| steps. At the beginning of every step we will set right=right+1. If the suffix su(right) belongs to a string S(j), then we will set x(j)=x(j)+1; if x(j) becomes equal to a(j), then we set nok=nok+1. Then, we will increment the variable left by 1 as long as one of the following conditions is true: (1) the first position of the suffix su(left) belongs to no string S(j) (i.e. it corresponds to a character $_q); (2) the first position of su(left) belongs to a string S(j) for which ((x(j)>a(j)) or ((x(j)=a(j)) and (nok>F))): in this case we first decrement x(j) by 1 (and, if x(j) becomes smaller than a(j), we also decrement nok by 1) and only after this do we increment left by 1.

After the (possible) changes of the variable left, we check if nok ≥ F. If it is, then we will compute the value W=min{LCP(j) | left ≤ j ≤ right-1} (if left=right, then W = the length of the suffix su(left), i.e. |Z|-su(left)+1; if the first character of su(left) is a $_q character, then W=-∞). We can compute these values in O(1) time, using the Range Minimum Query (RMQ) technique, which requires a simple O(|Z|·log(|Z|)) time preprocessing (or a more complicated O(|Z|) time one). W is the length of the longest substring of the strings S(i) which satisfies the constraints, considering only the suffixes on the positions left, left+1, …, right. We compare W with the largest length L found so far and we set L=max{W,L} (initially, L=0).
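The two-pointer sweep can be illustrated on a simplified instance. The sketch below assumes a(i)=1 for every string (the substring must occur at least once in at least F of the strings, with 1 ≤ F ≤ N), builds the suffix array by naive sorting, and takes the window minimum directly instead of using RMQ, so it does not reach the stated O(|Z|·log(|Z|)) bound. It also appends a separator after the last string so that every suffix is cut at a separator. All names are hypothetical.

```python
def longest_substring_in_at_least_f(strings, f):
    """Longest substring that occurs in at least f of the given strings."""
    z, owner = [], []
    for idx, s in enumerate(strings):
        z.extend(ord(ch) for ch in s)
        owner.extend([idx] * len(s))
        z.append(-(idx + 1))            # unique separator, smaller than any character
        owner.append(-1)
    n = len(z)
    remain = [0] * (n + 1)              # characters from position i up to the next separator
    for i in range(n - 1, -1, -1):
        remain[i] = 0 if owner[i] < 0 else remain[i + 1] + 1
    su = sorted(range(n), key=lambda i: z[i:])      # naive suffix array

    def lcp(a, b):
        k = 0
        while a + k < n and b + k < n and z[a + k] == z[b + k]:
            k += 1
        return k

    lcps = [lcp(su[i], su[i + 1]) for i in range(n - 1)]
    x = [0] * len(strings)              # x[j] = suffixes of string j inside the window
    nok = left = best_len = best_start = 0
    for right in range(n):
        j = owner[su[right]]
        if j >= 0:
            x[j] += 1
            if x[j] == 1:
                nok += 1
        while left <= right:            # shrink the window from the left
            j = owner[su[left]]
            if j < 0:
                left += 1               # separator suffix: never useful
            elif x[j] > 1 or (x[j] == 1 and nok > f):
                x[j] -= 1
                if x[j] == 0:
                    nok -= 1
                left += 1
            else:
                break
        if nok >= f and left <= right:
            w = remain[su[left]] if left == right else min(lcps[left:right])
            if w > best_len:
                best_len, best_start = w, su[left]
    return "".join(chr(v) for v in z[best_start:best_start + best_len])
```

Because the separators are pairwise distinct, a common prefix of two different suffixes can never extend past a separator, which is why the plain LCP values can be used without truncation.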
Finally, after all these operations, we can skip to the next step. The time complexity of the entire algorithm is O(|Z|·log(|Z|)) (the preprocessing stage takes O(|Z|·log(|Z|)) time; all the other |Z| steps take O(|Z|) time overall).

B. Longest Common Non-Contiguous Subsequence

We consider K strings: S(1), …, S(K), composed of characters from an alphabet with N symbols (numbered from 1 to N); each position j of a string S(i) has a weight wp(i,j) ≥ 0. A string A is a (not necessarily contiguous) subsequence of another string B if it can be obtained from B by erasing 0 or more characters from B. We want to compute a string S which is a common subsequence of all the K given strings and whose aggregate weight is maximum. The weight of a string is computed as an aggregate function agg_1 of the weights of its characters. The weight of a character on a position of the string is equal to an aggregate agg_2 of the weights of the positions in which the character matches each of the K given strings; agg_1 is a non-decreasing function, defined for non-negative values (e.g., addition, max); agg_2 may be any function returning non-negative values (e.g. sum, multiplication, max, min). Each character i occurs num(j,i) ≥ 1 times in S(j) (otherwise, we could remove the character from the alphabet and from every string which contains it). We know that the total number of tuples of positions (p_1, ..., p_K), such that S(1)(p_1)=S(2)(p_2)=...=S(K)(p_K), is at most PMAX. The positions of a string are numbered starting from 1.

We will generate all the tuples (p(1), ..., p(K)) such that S(1)(p(1))=...=S(K)(p(K)). In order to do this, we will traverse every string S(i) and, for every character j, we will construct a list L(i,j) (initially, all these lists are empty).
As we traverse the string S(i) and we reach the character on a position q, we insert q at the end of the list L(i,S(i)(q)). Thus, the elements of each list are added in increasing order. Then, we will generate all the tuples we mentioned, in lexicographic order. We will traverse, one at a time, the positions q(1) from S(1). For every position q(1) we will consider, in order, every position q(2) from L(2, S(1)(q(1))); for every pair (q(1), q(2)) we consider, in order, every position q(3) from L(3, S(1)(q(1))), and so on, for every tuple (q(1), …, q(r)) (with r […] len(St(F)) and St(G)(y)=|A|+1, if y>len(St(G)).

If c_1+1 ≤ c_2-1 then we add to PS(F,G) any character labeled with a number lb from the interval [c_1+1, c_2-1]. If, however, c_1+1=c_2, then we have two possibilities (and we will choose the one which leads to a shorter string). The first possibility occurs if c_1>0. We add to PS(F,G) the character c_1. Then, we need to compute the number of consecutive characters equal to |A| in St(F), starting from position LCP(F,G)+2. Let x=cnt(F, |A|, LCP(F,G)+2) be this number. We add to PS(F,G) x characters |A| and then we add a character which is larger than St(F)(LCP(F,G)+2+x). The second possibility occurs if c_2 ≤ |A|. We add to PS(F,G) the character c_2. Then, we need to compute the number of consecutive characters equal to 1 in St(G), starting from position LCP(F,G)+2. Let x=cnt(G, 1, LCP(F,G)+2) be this number. We add to PS(F,G) x characters 1 and then we add a character which is smaller than St(G)(LCP(F,G)+2+x).

We can tabulate the cnt(*,*,*) values as follows. For y>len(St(U)) we have cnt(U,c,y)=0; for 1 ≤ y ≤ len(St(U)) we have: if (St(U)(y)=c) then cnt(U,c,y)=1+cnt(U,c,y+1); otherwise, cnt(U,c,y)=0.
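The cnt recurrence fills naturally from right to left. A minimal sketch for one string and one character (the function name run_lengths and the use of a Python string for St(U) are illustrative assumptions):

```python
def run_lengths(s, c):
    """cnt[y] = number of consecutive characters equal to c in s,
    starting at the 1-indexed position y; cnt[y] = 0 for y > len(s)."""
    n = len(s)
    cnt = [0] * (n + 2)                 # index 0 is unused, index n + 1 is the sentinel
    for y in range(n, 0, -1):
        cnt[y] = 1 + cnt[y + 1] if s[y - 1] == c else 0
    return cnt
```

For example, for the string "aabaaa" and the character "a", the entries at positions 1 to 6 are 2, 1, 0, 3, 2, 1. In the paper's setting only the characters 1 and |A| are queried, so two such tables per string would suffice.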
For completeness, we add an empty virtual string at the beginning of the sorted order and an empty string at the end of the sorted order. We can implement all these operations in O(N·log(N)) time using the suffix array technique (N = the sum of the lengths of the R given strings, plus R-1): we construct a large string S=St(1)$_1St(2)$_2…$_(R-1)St(R), where $_1, …, $_(R-1) are different characters which are not part of A. Then, we can sort all the suffixes of S in O(N·log(N)) time and compute the longest common prefix (LCP) of any two consecutive suffixes in the sorted order in O(log(N)) time (by storing auxiliary information). Afterwards, we will maintain only the suffixes starting at the initial position of a string St(i) and truncated, such that they do not contain $_j characters. We can compute the LCP between any pair of suffixes in the original order by using the Range Minimum Query (RMQ) technique on the array of LCP values between consecutive suffixes in the sorted order; the length of the LCP between two truncated suffixes is reduced if it exceeds the length of either of them.

Another method for computing the LCP of two suffixes starting at any positions a and b of S was mentioned to us by C. Negruşeri. We will compute a hash value h(i) for every position i of S: h(0)=0 and h(1 ≤ i ≤ N)=hash(h(i-1), S(i), i). With these values, we will be able to efficiently compute a hash value for any (contiguous) substring S(i:j) (from the position i to j) of S. Two substrings with the same content will have identical hash values. Then, we will binary search the LCP between 0 and N-max{a,b}+1. We have LCP ≥ L if hashValue(a, a+L-1)=hashValue(b, b+L-1) (with a high probability) and LCP […] q' do not make us move to the next character of the string (i.e. they are non-absorbing).
Every state q stores 4 lists of adjacent edges, for the following cases: incoming/outgoing and absorbing/non-absorbing. Moreover, within each list, it stores M sub-lists, one for each character of the alphabet (sub-list c contains the edges marked with c from the corresponding list).

At first, we will process the DFA. For every character c from the alphabet, we consider DFA'(c) = the DFA containing all the states but only the non-absorbing edges marked with c. From each state q there is at most one outgoing edge marked with c. We will identify the cycles in DFA'(c) and remove from DFA and DFA'(c) all the edges which are part of a cycle in DFA'(c). In order to identify the cycles, we consider that all the states are unmarked. Then, we consider every state q of DFA'(c). If q is unmarked, we move along the edges, starting from the one going out of q (if any) and mark all the states we visit. If we get back to q, then we found a cycle. After removing these edges, DFA'(c) is a directed acyclic graph. We will compute a topological sort of DFA'(c): ts(c,1), …, ts(c,V), such that all the states q' for which there is an edge q'->q (directed from q' to q) are located before q in the topological sort.

Then, we will compute Cnt(q,l) = the number of strings of length l which can make the DFA reach the state q. We have Cnt(q_0,0)=1 and Cnt(q≠q_0,0)=0 (q_0 = the initial state). Then, for l=1,…,Lmax we perform the following steps: (1) for every character c we consider the states ts(c,1), …, ts(c,V) (in this order): we compute Cnt'(ts(c,i),l,c) as Cnt(ts(c,i),l-1) plus the sum of the values Cnt'(ts(c,j),l,c) (such that there is a non-absorbing edge marked with c from ts(c,j) to ts(c,i)); (2) we consider every state q and set Cnt(q,l) to the sum of the values Cnt'(q',l,c) (where q' is a state such that there is an absorbing directed edge from q' to q marked with the character c).
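The cycle removal, the per-character topological order and the Cnt recurrence can be sketched together as below. The representation is an assumption made for illustration only: states are numbered 0..V-1, the edges are given as two dictionaries mapping (state, character) to a target state, each state has at most one outgoing edge per character, forbidden states are taken to be already removed, and the set of admissible lengths is non-empty. The function name and parameters are hypothetical.

```python
from collections import deque


def count_strings(num_states, alphabet, absorbing, non_absorbing, q0, finals, lengths):
    """Sum of Cnt(q, l) over final states q and lengths l in `lengths`."""
    succ, order = {}, {}
    for c in alphabet:
        nxt = [non_absorbing.get((q, c)) for q in range(num_states)]
        mark = [0] * num_states         # 0 unvisited, 1 on the current walk, 2 finished
        for q in range(num_states):
            walk, v = [], q
            while v is not None and mark[v] == 0:
                mark[v] = 1
                walk.append(v)
                v = nxt[v]
            if v is not None and mark[v] == 1:      # the walk closed a cycle through v
                u = v
                while True:                         # drop every edge of that cycle
                    u_next = nxt[u]
                    nxt[u] = None
                    u = u_next
                    if u == v:
                        break
            for u in walk:
                mark[u] = 2
        indeg = [0] * num_states        # topological order of the remaining edges
        for q in range(num_states):
            if nxt[q] is not None:
                indeg[nxt[q]] += 1
        queue = deque(q for q in range(num_states) if indeg[q] == 0)
        topo = []
        while queue:
            q = queue.popleft()
            topo.append(q)
            t = nxt[q]
            if t is not None:
                indeg[t] -= 1
                if indeg[t] == 0:
                    queue.append(t)
        succ[c], order[c] = nxt, topo
    cnt = [0] * num_states
    cnt[q0] = 1
    total = 1 if (0 in lengths and q0 in finals) else 0
    for l in range(1, max(lengths) + 1):
        new = [0] * num_states
        for c in alphabet:
            reach = cnt[:]              # plays the role of Cnt'(., l, c)
            for q in order[c]:
                if succ[c][q] is not None:
                    reach[succ[c][q]] += reach[q]
            for q in range(num_states):
                t = absorbing.get((q, c))
                if t is not None:
                    new[t] += reach[q]
        cnt = new
        if l in lengths:
            total += sum(cnt[q] for q in finals)
    return total
```

As a usage example, a three-state matcher for the pattern "ab" over the alphabet {a, b}, with fallback transitions modelled as non-absorbing edges towards state 0, counts the strings ending in "ab": absorbing = {(0,'a'):1, (0,'b'):0, (1,'b'):2}, non_absorbing = {(1,'a'):0, (2,'a'):0, (2,'b'):0}, initial state 0 and final state 2.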
The final answer is the sum of the values Cnt(q,Len), where q is a final state and Len ∈ SLen.

We will now consider a problem which is very similar to the first problem from this section. We are given the same input data, except that every string i from L_2 also has a weight w(i). We want to construct a string SL with maximum weight, which obeys the same constraints as in the previous problem. The weight of a string SL is the aggregate agg of the weights of each occurrence of a string S(i) from L_2 in SL. Additionally, for some of the strings S(j) from L_2 we do not care how many times they occur in SL.

We will proceed as follows. At first, we compute the DFA of all the strings from L_1 ∪ L_2 and we mark the forbidden states (as before). Then, for each non-forbidden state q, we compute ws(q) = the aggregate agg of the weights w(i) of the strings i from L_2 which are part of SE(q). After this, we remove from L_2 and from all the sets SE(*) those strings i for which we do not care how many times they occur in the optimal string SL. After removing these strings, we will renumber the remaining strings from 1 to |L_2| (we modify accordingly the numbering in L_2 and in every set SE(*)). Then, we will run a dynamic programming algorithm which is very similar to the one from the first problem. We will compute Wmax(q, l, o(1), …, o(|L_2|)) = the maximum weight of a string SL of length l whose suffix matches the string associated to the state q of the DFA, in which every (remaining) string i from L_2 occurs o(i) times and in which none of the strings from L_1 occur. We have Wmax(q_0, 0, o(1)=…=o(|L_2|)=0)=0 and -∞ for all the other tuples with l=0. For each length l from 1 to Lmax (in increasing order) we consider all the non-forbidden states q. We initialize Wmax(q, l, o(1)=*, …, o(|L_2|)=*)=-∞ and Prev(q, l, o(1)=*, …, o(|L_2|)=*)=undefined.
Then, we consider all the directed edges entering q from other non-forbidden states q'. For each such state q' we consider all the tuples (q', l-1, o(1), …, o(|L_2|)) such that Wmax(q', l-1, o(1), …, o(|L_2|))>-∞ and if agg(Wmax(q', l-1, o(1), …, o(|L_2|)), ws(q)) > Wmax(q, l, o(1)+x(q,1), …, o(i)+x(q,i), …, o(|L_2|)+x(q,|L_2|)) then we set Wmax(q, l, o(1)+x(q,1), …, o(i)+x(q,i), …, o(|L_2|)+x(q,|L_2|))=agg(Wmax(q', l-1, o(1), …, o(|L_2|)), ws(q)) and Prev(q, l, o(1)+x(q,1), …, o(i)+x(q,i), …, o(|L_2|)+x(q,|L_2|))=q'.

In the end, we will compute wm=max{Wmax(q, Len, o(1), …, o(|L_2|)) | Len ∈ SLen and o(i) ∈ Occ(i) (1 ≤ i ≤ |L_2|)}. If wm>-∞ then let (q, Len, o(1), …, o(|L_2|)) be a tuple such that Wmax(q, Len, o(1), …, o(|L_2|))=wm. We will construct the string from the end towards the front. While Len>0 do: (1) let q'=Prev(q, Len, o(1), …, o(|L_2|)); (2) the character on the position Len of SL is the character on the directed edge from q' to q; (3) o(i)=o(i)-x(q,i) (1 ≤ i ≤ |L_2|); (4) q=q'; (5) Len=Len-1.

VI. RELATED WORK

The failure tree produced by the KMP algorithm was mentioned in [1], but was not used directly as it is. In [2], the authors studied several non-substring and non-subsequence problems, focusing on whether they belong to the P or NP class. They also gave a trivial, but inefficient, polynomial algorithm for the shortest non-contiguous subsequence problem, while we presented an efficient solution. String concatenations are related to the shortest common supersequence problem [6]. A survey of longest common subsequence algorithms (of only two strings) was presented in [5]. The algorithms and data structures we used in this paper (e.g. KMP, trie, deterministic finite automaton) are well presented in several textbooks, like [3] and [4].

VII.
CONCLUSIONS AND FUTURE WORK

In this paper we presented new and practical algorithmic solutions for several string processing problems which are important in the automatic knowledge extraction and data mining fields. Our solutions make use of several standard, but efficient, algorithmic techniques and data structures. The presented algorithms can easily be implemented in many data mining and knowledge discovery applications.

REFERENCES

[1] M. Gu, M. Farach and R. Beigel, "An Efficient Algorithm for Dynamic Text Indexing", Proc. of the 5th ACM-SIAM Symp. on Discrete Algorithms, 1994, pp. 697-704.
[2] A. R. Rubinov and V. G. Timkovsky, "String Noninclusion Optimization Problems", SIAM J. Discrete Math., vol. 11 (3), 1998, pp. 456-467.
[3] D. Gusfield, "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology", Press Syndicate of the Univ. of Cambridge, 1997.
[4] M. Crochemore and W. Rytter, "Jewels of Stringology: Text Algorithms", World Scientific Publishing, 2002.
[5] L. Bergroth, H. Hakonen and T. Raita, "A Survey of Longest Common Subsequence Algorithms", Proc. of the Intl. Symp. on String Processing and Information Retrieval, 2000, pp. 39-48.
[6] A. Blum, T. Jiang, M. Li, J. Tromp and M. Yannakakis, "Linear Approximation of Shortest Superstrings", Journal of the ACM, vol. 41 (4), 1994, pp. 630-647.
