Fixed Parameter Polynomial Time Algorithms for Maximum Agreement and Compatible Supertrees

Symposium on Theoretical Aspects of Computer Science 2008 (Bordeaux), pp. 361-372
www.stacs-conf.org

VIET TUNG HOANG (1, 2) AND WING-KIN SUNG (1, 2)

(1) Department of Computer Science, National University of Singapore
E-mail address: {hoangvi2,ksung}@comp.nus.edu.sg
(2) Genome Institute of Singapore

Abstract. Consider a set of labels $L$ and a set of trees $\mathcal{T} = \{T^{(1)}, T^{(2)}, \ldots, T^{(k)}\}$ where each tree $T^{(i)}$ is distinctly leaf-labeled by some subset of $L$. One fundamental problem is to find the biggest tree (denoted as supertree) to represent $\mathcal{T}$ which minimizes the disagreements with the trees in $\mathcal{T}$ under certain criteria. This problem finds applications in phylogenetics, database, and data mining. In this paper, we focus on two particular supertree problems, namely, the maximum agreement supertree problem (MASP) and the maximum compatible supertree problem (MCSP). These two problems are known to be NP-hard for $k \ge 3$. This paper gives the first polynomial time algorithms for both MASP and MCSP when both $k$ and the maximum degree $D$ of the trees are constant.

1. Introduction

Consider a set of labels $L$ and a set of unordered trees $\mathcal{T} = \{T^{(1)}, \ldots, T^{(k)}\}$ where each tree $T^{(i)}$ is distinctly leaf-labeled by some subset of $L$. The supertree method tries to find a tree to represent all trees in $\mathcal{T}$ which minimizes the possible conflicts in the input trees. The supertree method finds applications in phylogenetics, database, and data mining. For instance, in the Tree of Life project [10], the supertree method is the basic tool to infer the phylogenetic tree of all species. Many supertree methods have been proposed in the literature [2, 5, 6, 8].
This paper focuses on two particular supertree methods, namely the Maximum Agreement Supertree (MASP) [8] and the Maximum Compatible Supertree (MCSP) [2]. Both methods try to find a consensus tree with the largest number of leaves which can represent all the trees in $\mathcal{T}$ under certain criteria. (Please read Section 2 for the formal definition.)

MASP and MCSP are known to be NP-hard as they are the generalization of the Maximum Agreement Subtree problem (MAST) [1, 3, 9] and the Maximum Compatible Subtree problem (MCT) [7, 4] respectively. Jansson et al. [8] proved that MASP remains NP-hard even if every tree is a rooted triplet, i.e., a binary tree of 3 leaves. For $k = 2$, Jansson et al. [8] and Berry and Nicolas [2] proposed a linear time algorithm to transform MASP and MCSP for 2 input trees to MAST and MCT respectively. For $k \ge 3$, positive results for computing MASP/MCSP are reported only for rooted binary trees. Jansson et al. [8] gave an $O(k(2n)^{3k^2})$ time solution to this problem. Recently, Guillemot and Berry [6] further improved the running time to $O(8^k n^k)$.

In general, the trees in $\mathcal{T}$ may be neither binary nor rooted. Hence, Jansson et al. [8] posed an open problem and asked if MASP can be solved in polynomial time when $k$ and the maximum degree of the trees in $\mathcal{T}$ are constant.

Table 1: Summary of previous and new results († stands for new result).

                                      Rooted                          Unrooted
  MASP for k trees of max degree D    O((kD)^{kD+3} (2n)^k) †         O((kD)^{kD+3} (4n)^k) †
  MCSP for k trees of max degree D    O(2^{2kD} n^k) †                O(2^{2kD} n^k) †
  MASP/MCSP for k binary trees        O(k (2n^2)^{3k^2}) [8]
                                      O(8^k n^k) [6]
                                      O(6^k n^k) †

1998 ACM Subject Classification: Algorithms, Biological computing.
Key words and phrases: maximum agreement supertree, maximum compatible supertree.
© Hoang and Sung, Creative Commons Attribution-NoDerivs License
This paper gives an affirmative answer to this question. We show that both MASP and MCSP can be solved in polynomial time when $\mathcal{T}$ contains a constant number of bounded degree trees. For the special case where the trees in $\mathcal{T}$ are rooted binary trees, we show that both MASP and MCSP can be solved in $O(6^k n^k)$ time, which improves the previous best result. Table 1 summarizes the previous and new results.

The rest of the paper is organized as follows. Section 2 gives the formal definition of the problems. Then, Sections 3 and 4 describe the algorithms for solving MCSP for both rooted and unrooted cases. Finally, Sections 5 and 6 detail the algorithms for solving MASP for both rooted and unrooted cases. Proofs omitted due to space limitation will appear in the full version of this paper.

2. Preliminary

A phylogenetic tree is defined as an unordered and distinctly leaf-labeled tree. Given a phylogenetic tree $T$, the notation $L(T)$ denotes the leaf set of $T$, and the size of $T$ refers to $|L(T)|$. For any label set $S$, the restriction of $T$ to $S$, denoted $T|S$, is a phylogenetic tree obtained from $T$ by removing all leaves in $L(T) - S$ and then suppressing all internal nodes of degree two. (See Figure 1 for an example of restriction.) For two phylogenetic trees $T$ and $T'$, we say that $T$ refines $T'$, denoted $T \unrhd T'$, if $T'$ can be obtained by contracting some edges of $T$. (See Figure 1 for an example of refinement.)

Maximum Compatible Supertree Problem: Consider a set of $k$ phylogenetic trees $\mathcal{T} = \{T^{(1)}, \ldots, T^{(k)}\}$. A compatible supertree of $\mathcal{T}$ is a tree $Y$ such that $Y|L(T^{(i)}) \unrhd T^{(i)}|L(Y)$ for all $i \le k$. The Maximum Compatible Supertree Problem (MCSP) is to find a compatible supertree with as many leaves as possible. Figure 2 shows an example of a compatible supertree $Y$ of two rooted phylogenetic trees $T^{(1)}$ and $T^{(2)}$.
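The definitions of restriction, refinement, and compatible supertree above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not part of the paper: a rooted tree is encoded as a leaf label (a string) or a `frozenset` of child trees, `None` stands for the empty tree, "suppressing degree-two nodes" is read as removing internal nodes left with a single child, and refinement is tested through the standard cluster characterisation (on equal leaf sets, $T \unrhd T'$ exactly when every cluster of $T'$ is a cluster of $T$). All function names are hypothetical.

```python
def leaves(tree):
    """Leaf set L(T); None encodes the empty tree (an assumption of this sketch)."""
    if tree is None:
        return frozenset()
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, labels):
    """T|S: drop leaves outside S, then suppress nodes left with one child."""
    if tree is None:
        return None
    if isinstance(tree, str):
        return tree if tree in labels else None
    kept = [r for r in (restrict(child, labels) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]          # unary node: suppress it
    return frozenset(kept)


def clusters(tree):
    """Leaf sets hanging below every node of the tree."""
    if tree is None:
        return set()
    found = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clusters(child)
    return found


def refines(t, t_prime):
    """True if t' can be obtained from t by contracting edges (rooted case)."""
    return leaves(t) == leaves(t_prime) and clusters(t_prime) <= clusters(t)


def is_compatible_supertree(y, trees):
    """Check Y|L(T_i) refines T_i|L(Y) for every input tree."""
    return all(refines(restrict(y, leaves(t)), restrict(t, leaves(y))) for t in trees)


ab = frozenset({"a", "b"})
t1 = frozenset({ab, "c"})                       # ((a,b),c)
t2 = frozenset({ab, "d"})                       # ((a,b),d)
y = frozenset({frozenset({ab, "c"}), "d"})      # (((a,b),c),d)
print(is_compatible_supertree(y, [t1, t2]))
```

Replacing `refines` by plain equality of the two restrictions would give the corresponding check for an agreement supertree.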
If all input trees have the same leaf sets, MCSP is referred to as the Maximum Compatible Subtree Problem (MCT).

Maximum Agreement Supertree Problem: Consider a set of $k$ phylogenetic trees $\mathcal{T} = \{T^{(1)}, \ldots, T^{(k)}\}$. An agreement supertree of $\mathcal{T}$ is a tree $X$ such that $X|L(T^{(i)}) = T^{(i)}|L(X)$ for all $i \le k$. The Maximum Agreement Supertree Problem (MASP) is to find an agreement supertree with as many leaves as possible. Figure 2 shows an example of an agreement supertree $X$ of two rooted phylogenetic trees $T^{(1)}$ and $T^{(2)}$. If all input trees have the same leaf sets, MASP is referred to as the Maximum Agreement Subtree Problem (MAST).

Figure 1: Three rooted trees. A tree $T$, a tree $T'$ such that $T' = T|\{a, c, d\}$, and a tree $T''$ such that $T'' \unrhd T$.

Figure 2: An agreement supertree $X$ and a compatible supertree $Y$ of 2 rooted phylogenetic trees $T^{(1)}$ and $T^{(2)}$.

In the following discussion, for the set of phylogenetic trees $\mathcal{T} = \{T^{(1)}, \ldots, T^{(k)}\}$, we denote $n = |\bigcup_{i=1..k} L(T^{(i)})|$, and $D$ stands for the maximum degree of the trees in $\mathcal{T}$. We assume that none of the trees in $\mathcal{T}$ has an internal node of degree two, so that each tree contains at most $n - 1$ internal nodes. (If a tree $T^{(i)}$ has some internal nodes of degree two, we can replace it by $T^{(i)}|L(T^{(i)})$ in linear time.)

3. Algorithm for MCSP of rooted trees

Let $\mathcal{T}$ be a set of $k$ rooted phylogenetic trees. This section presents a dynamic programming algorithm to compute the size of a maximum compatible supertree of $\mathcal{T}$ in $O(2^{2kD} n^k)$ time. The maximum compatible supertree can be obtained in the same asymptotic time bound by backtracking.
For every compatible supertree $Y$ of $\mathcal{T}$, there exists a binary tree that refines $Y$. This binary tree is also a compatible supertree of $\mathcal{T}$, and is of the same size as $Y$. Hence in this section, every compatible supertree is implicitly assumed to be binary.

Definition 3.1 (Cut-subtree). A cut-subtree of a tree $T$ is either an empty tree or a tree obtained by first selecting some subtrees attached to the same internal node in $T$ and then connecting those subtrees by a common root.

Definition 3.2 (Cut-subforest). Given a set of $k$ rooted (or unrooted) trees $\mathcal{T}$, a cut-subforest of $\mathcal{T}$ is a set $\mathcal{A} = \{A^{(1)}, \ldots, A^{(k)}\}$, where $A^{(i)}$ is a cut-subtree of $T^{(i)}$ and at least one element of $\mathcal{A}$ is not an empty tree.

Figure 3: A cut-subforest $\mathcal{A}$ of $\mathcal{T}$.

For example, in Figure 3, $\{A^{(1)}, A^{(2)}\}$ is a cut-subforest of $\{T^{(1)}, T^{(2)}\}$. Let $\mathcal{O}$ denote the set of all possible cut-subforests of $\mathcal{T}$.

Lemma 3.3. There are $O(2^{kD} n^k)$ different cut-subforests of $\mathcal{T}$.

Proof. We claim that each tree $T^{(i)}$ contributes $2^D n$ or fewer cut-subtrees; therefore there are $O(2^{kD} n^k)$ cut-subforests of $\mathcal{T}$. At each internal node $v$ of $T^{(i)}$, since the degree of $v$ does not exceed $D$, we have at most $2^D$ ways of selecting the subtrees attached to $v$ to form a cut-subtree. Including the empty tree, the number of cut-subtrees in $T^{(i)}$ cannot go beyond $(n - 1)2^D + 1 < 2^D n$.

Figure 4 demonstrates that a compatible supertree of some cut-subforest $\mathcal{A}$ of $\mathcal{T}$ may not be a compatible supertree of $\mathcal{T}$. To circumvent this irregularity, we define the embedded supertree as follows.

Definition 3.4 (Embedded supertree). For any cut-subforest $\mathcal{A}$ of $\mathcal{T}$, a tree $Y$ is called an embedded supertree of $\mathcal{A}$ if $Y$ is a compatible supertree of $\mathcal{A}$, and $L(Y) \cap L(T^{(i)}) \subseteq L(A^{(i)})$ for all $i \le k$.
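The counting argument in the proof of Lemma 3.3 can be checked on small inputs with the following sketch. It is an illustration only and rests on assumptions that are not in the paper: rooted trees are nested `frozenset`s with string leaves, `None` stands for the empty tree, a selection consisting of a single subtree is identified with that subtree itself, and the number of children of a node is used in place of its degree. Here `n` is the leaf count of the single tree, which is at most the paper's $n$, so the bound being tested is no weaker than the one in the lemma.

```python
from itertools import combinations


def internal_nodes(tree):
    """All internal nodes of a rooted tree given as nested frozensets."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree:
        nodes.extend(internal_nodes(child))
    return nodes


def count_leaves(tree):
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def cut_subtrees(tree):
    """Enumerate the cut-subtrees of Definition 3.1 (None is the empty tree)."""
    result = {None}
    for node in internal_nodes(tree):
        children = list(node)
        for size in range(1, len(children) + 1):
            for chosen in combinations(children, size):
                # Assumption: one selected subtree is kept as is, without an extra root.
                result.add(chosen[0] if size == 1 else frozenset(chosen))
    return result


def lemma_3_3_bound(tree):
    """The per-tree bound (n - 1) * 2^D + 1 from the proof of Lemma 3.3."""
    n = count_leaves(tree)
    d = max((len(node) for node in internal_nodes(tree)), default=0)
    return (n - 1) * 2 ** d + 1


t = frozenset({frozenset({"a", "b"}), "c"})     # ((a,b),c)
print(len(cut_subtrees(t)), "<=", lemma_3_3_bound(t))
```

Because different selections can produce the same tree, the enumerated set is usually smaller than the bound.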
Note that a compatible supertree of $\mathcal{T}$ is also an embedded supertree of $\mathcal{T}$. For each cut-subforest $\mathcal{A}$ of $\mathcal{T}$, let $mcsp(\mathcal{A})$ denote the maximum size of embedded supertrees of $\mathcal{A}$. Our aim is to compute $mcsp(\mathcal{T})$. Below, we first define the recursive equation for computing $mcsp(\mathcal{A})$ for all cut-subforests $\mathcal{A} \in \mathcal{O}$. Then, we describe our dynamic programming algorithm.

Figure 4: Consider $\mathcal{T} = \{T^{(1)}, T^{(2)}\}$ and its cut-subforest $\mathcal{A} = \{A^{(1)}, A^{(2)}\}$. Although $Z$ is a compatible supertree of $\mathcal{A}$, it is not a compatible supertree of $\mathcal{T}$. The maximum compatible supertree of $\mathcal{T}$ is $Y$, which contains only 2 leaves.

We partition the cut-subforests in $\mathcal{O}$ into two classes. A cut-subforest $\mathcal{A}$ of $\mathcal{T}$ is terminal if each element $A^{(i)}$ is either an empty tree or a leaf of $T^{(i)}$; it is called non-terminal otherwise.

For each terminal cut-subforest $\mathcal{A}$, let

$$\Lambda(\mathcal{A}) = \Big\{ \ell \in \bigcup_{j=1..k} L(A^{(j)}) \;\Big|\; \ell \notin L(T^{(i)}) - L(A^{(i)}) \text{ for } i = 1, 2, \ldots, k \Big\}. \qquad (3.1)$$

For example, with $\mathcal{T}$ in Figure 2, if $A^{(1)}$ and $A^{(2)}$ are leaves labeled by $a$ and $d$ respectively then $\Lambda(\mathcal{A}) = \{d\}$. In Lemma 3.5, we show that $mcsp(\mathcal{A}) = |\Lambda(\mathcal{A})|$.

Lemma 3.5. If $\mathcal{A}$ is a terminal cut-subforest then $mcsp(\mathcal{A}) = |\Lambda(\mathcal{A})|$.

Proof. Consider any embedded supertree $Y$ of $\mathcal{A}$. By Definition 3.4, every leaf of $Y$ belongs to $\Lambda(\mathcal{A})$. Hence the value $mcsp(\mathcal{A})$ does not exceed $|\Lambda(\mathcal{A})|$. It remains to give an example of some embedded supertree of $\mathcal{A}$ whose leaf set is $\Lambda(\mathcal{A})$. Let $C$ be a rooted caterpillar (see footnote 1) whose leaf set is $\Lambda(\mathcal{A})$. The definition of $\Lambda(\mathcal{A})$ implies that $L(C) \cap L(T^{(i)}) \subseteq L(A^{(i)})$ for every $i \le k$. Since each $A^{(i)}$ has at most one leaf, it is straightforward that $C$ is a compatible supertree of $\mathcal{A}$. Hence $C$ is the desired example.

Definition 3.6 (Bipartite).
Let $\mathcal{A}$ be a cut-subforest of $\mathcal{T}$. We say that the cut-subforests $\mathcal{A}_L$ and $\mathcal{A}_R$ bipartition $\mathcal{A}$ if for every $i \le k$, the trees $A_L^{(i)}$ and $A_R^{(i)}$ can be obtained by (1) partitioning the subtrees attached to the root of $A^{(i)}$ into two sets $S_L^{(i)}$ and $S_R^{(i)}$; and (2) connecting the subtrees in $S_L^{(i)}$ (resp. $S_R^{(i)}$) by a common root to form $A_L^{(i)}$ (resp. $A_R^{(i)}$).

Figure 5 shows an example of the preceding definition. For each non-terminal cut-subforest $\mathcal{A}$, we compute $mcsp(\mathcal{A})$ based on the $mcsp$ values of $\mathcal{A}_L$ and $\mathcal{A}_R$ for each bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of $\mathcal{A}$. More precisely, we prove that

$$mcsp(\mathcal{A}) = \max\{ mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R) \mid \mathcal{A}_L \text{ and } \mathcal{A}_R \text{ bipartition } \mathcal{A} \}. \qquad (3.2)$$

The identity (3.2) is then established by Lemmas 3.8 and 3.10.

Lemma 3.7. Consider a bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of some cut-subforest $\mathcal{A}$ of $\mathcal{T}$. If $Y_L$ and $Y_R$ are embedded supertrees of $\mathcal{A}_L$ and $\mathcal{A}_R$ respectively then $Y$ is an embedded supertree of $\mathcal{A}$, where $Y$ is formed by connecting $Y_L$ and $Y_R$ to a common root.

Footnote 1: A rooted caterpillar is a rooted, unordered, and distinctly leaf-labeled binary tree where every internal node has at least one child that is a leaf.

Figure 5: A bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of a cut-subforest $\mathcal{A}$. The empty tree is represented by a white circle.

Lemma 3.8. Let $\mathcal{A}$ be a cut-subforest of $\mathcal{T}$. If $(\mathcal{A}_L, \mathcal{A}_R)$ is a bipartite of $\mathcal{A}$ then $mcsp(\mathcal{A}) \ge mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R)$.

Proof. Consider an embedded supertree $Y_L$ of $\mathcal{A}_L$ such that $|L(Y_L)| = mcsp(\mathcal{A}_L)$. Define $Y_R$ for $\mathcal{A}_R$ similarly. Let $Y$ be a tree formed by connecting $Y_L$ and $Y_R$ with a common root. Note that $Y$ is of size $mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R)$. By Lemma 3.7, $Y$ is an embedded supertree of $\mathcal{A}$ and hence the lemma follows.

Lemma 3.9.
Given a cut-subforest $\mathcal{A}$ of $\mathcal{T}$, let $Y$ be a binary embedded supertree of $\mathcal{A}$ with left subtree $Y_L$ and right subtree $Y_R$. There exists a bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of $\mathcal{A}$ such that either (i) $Y$ is an embedded supertree of $\mathcal{A}_L$; or (ii) $Y_L$ and $Y_R$ are embedded supertrees of $\mathcal{A}_L$ and $\mathcal{A}_R$ respectively.

Lemma 3.10. For each non-terminal cut-subforest $\mathcal{A}$ of $\mathcal{T}$, there exists a bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of $\mathcal{A}$ such that $mcsp(\mathcal{A}) \le mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R)$.

Proof. Let $Y$ be a binary embedded supertree of $\mathcal{A}$ such that $|L(Y)| = mcsp(\mathcal{A})$. By Lemma 3.9, there exists a bipartite $(\mathcal{A}_L, \mathcal{A}_R)$ of $\mathcal{A}$ such that either (1) $Y$ is an embedded supertree of $\mathcal{A}_L$; or (2) $Y_L$ and $Y_R$ are embedded supertrees of $\mathcal{A}_L$ and $\mathcal{A}_R$ respectively, where $Y_L$ is the left subtree and $Y_R$ is the right subtree of $Y$. In both cases, $|L(Y)| \le mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R)$. Then the lemma follows.

The above discussion then leads to Theorem 3.11.

Theorem 3.11. For every cut-subforest $\mathcal{A}$ of $\mathcal{T}$, the value $mcsp(\mathcal{A})$ equals

$$mcsp(\mathcal{A}) = \begin{cases} |\Lambda(\mathcal{A})|, & \text{if } \mathcal{A} \text{ is terminal,} \\ \max\{ mcsp(\mathcal{A}_L) + mcsp(\mathcal{A}_R) \mid \mathcal{A}_L \text{ and } \mathcal{A}_R \text{ bipartition } \mathcal{A} \}, & \text{otherwise.} \end{cases}$$

We define an ordering of the cut-subforests in $\mathcal{O}$ as follows. For any cut-subforests $\mathcal{A}_1, \mathcal{A}_2$ in $\mathcal{O}$, we say that $\mathcal{A}_1$ is smaller than $\mathcal{A}_2$ if $A_1^{(i)}$ is a cut-subtree of $A_2^{(i)}$ for $i = 1, 2, \ldots, k$. Our algorithm enumerates $\mathcal{A} \in \mathcal{O}$ in topologically increasing order and computes $mcsp(\mathcal{A})$ based on Theorem 3.11. Theorem 3.12 states the complexity of our algorithm.

Theorem 3.12. A maximum compatible supertree of $k$ rooted phylogenetic trees can be obtained in $O(2^{2kD} n^k)$ time.

Proof. Testing if a cut-subforest is terminal takes $O(k)$ time, and each terminal cut-subforest $\mathcal{A}$ then requires $O(k^2)$ time for the computation of $\Lambda(\mathcal{A})$.
In view of Lemma 3.3, it suffices to show that each non-terminal cut-subforest $\mathcal{A}$ has $O(2^{kD})$ bipartites. This result follows from the fact that for each $i \le k$, there are at most $2^D$ ways to partition the set of the subtrees attached to the root of $A^{(i)}$.

In the special case where every tree $T^{(i)}$ is binary, Theorem 3.13 shows that our algorithm actually has a better time complexity. Note that the concepts of agreement supertree and compatible supertree coincide for binary trees. Hence, our algorithm improves the $O(8^k n^k)$-time algorithm in [6] for computing a maximum agreement supertree of $k$ rooted binary trees.

Theorem 3.13. If every tree in $\mathcal{T}$ is binary, a maximum compatible supertree (or a maximum agreement supertree) can be computed in $O(6^k n^k)$ time.

Proof. We claim that the processing of non-terminal cut-subforests of $\mathcal{T}$ requires $O(6^k n^k)$ time. The argument in the proof of Theorem 3.12 tells that the remaining computation runs within the same asymptotic time bound.

Consider an integer $r \in \{0, 1, \ldots, k\}$. We shall be dealing with a cut-subforest $\mathcal{A}$ such that there are exactly $r$ cut-subtrees $A^{(i)}$ whose roots are internal nodes of $T^{(i)}$. The key of this proof is to show that the number of those cut-subforests does not exceed $\binom{k}{r}(n-1)^r(n+1)^{k-r}$, and the running time for each cut-subforest is $O(4^r 2^{k-r})$. Hence, the total running time for all non-terminal cut-subforests is

$$\sum_{r=0}^{k} \binom{k}{r} (n-1)^r (n+1)^{k-r} \, O(4^r 2^{k-r}) = O(6^k n^k),$$

since, by the binomial theorem, $\sum_{r=0}^{k} \binom{k}{r} (4(n-1))^r (2(n+1))^{k-r} = (6n-2)^k$.

We can count the number of the specified cut-subforests $\mathcal{A}$ as follows. First, there are $\binom{k}{r}$ options for the $r$ indices $i$ such that the roots of the cut-subtrees $A^{(i)}$ are internal nodes of $T^{(i)}$. For those cut-subtrees, we then appoint one of the $(n-1)$ or fewer internal nodes of $T^{(i)}$ to be the root node of $A^{(i)}$.
Every other cut-subtree of $\mathcal{A}$ is a leaf or the empty tree, and then can be determined from at most $n + 1$ alternatives. Multiplying those possibilities gives us the bound stipulated in the preceding paragraph.

It remains to estimate the running time for each specified cut-subforest $\mathcal{A}$. This task requires us to bound the number of bipartites of each cut-subforest. If the root $v$ of $A^{(i)}$ is an internal node of $T^{(i)}$ then $A^{(i)}$ contributes 4 or fewer ways of partitioning the set of the subtrees attached to $v$. Otherwise, we have at most 2 ways of partitioning this set. Hence $\mathcal{A}$ owns at most $4^r 2^{k-r}$ bipartites, and this completes the proof.

4. Algorithm for MCSP of unrooted trees

Let $\mathcal{T}$ be a set of $k$ unrooted phylogenetic trees. This section extends the algorithm in Section 3 to find the size of a maximum compatible supertree of $\mathcal{T}$. The maximum compatible supertree can be obtained by backtracking. Surprisingly, the extended algorithm for unrooted trees runs within the same asymptotic time bound as the original algorithm for rooted trees.

We will follow the same approach as Section 3, i.e., for each cut-subforest $\mathcal{A}$ of $\mathcal{T}$, we find an embedded supertree of $\mathcal{A}$ of maximum size. Definitions 3.1, 3.2, and 3.4 for cut-subforest and embedded supertree in the previous section are still valid for unrooted trees. Notice that although $\mathcal{T}$ is the set of unrooted trees, each cut-subforest $\mathcal{A}$ of $\mathcal{T}$ consists of rooted trees. (See Figure 6 for an example of cut-subforest for unrooted trees.) Hence we can use the algorithm in Section 3 to find the maximum embedded supertree of $\mathcal{A}$. We then select the biggest tree $T$ among those maximum embedded supertrees for all cut-subforests of $\mathcal{T}$, and unroot $T$ to obtain the maximum compatible supertree of $\mathcal{T}$.
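Since this section reuses the Section 3 recurrence on every cut-subforest, a minimal sketch of that recurrence for rooted input trees may be useful. It follows Theorem 3.11 with memoisation in place of an explicit bottom-up ordering and makes no attempt to match the stated running time. The encoding and all names are assumptions of this sketch rather than the paper's: trees are nested `frozenset`s with string leaves, `None` is the empty tree, a side that receives a single subtree is represented by that subtree itself, and bipartitions in which one side consists only of empty trees are skipped because a cut-subforest needs at least one nonempty element.

```python
from functools import lru_cache
from itertools import product


def leaves(tree):
    if tree is None:
        return frozenset()
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def join(parts):
    """Connect subtrees by a common root (a single subtree is kept as is)."""
    if not parts:
        return None
    if len(parts) == 1:
        return parts[0]
    return frozenset(parts)


def splits(tree):
    """All ways to divide the subtrees at the root into a left and a right part."""
    if tree is None:
        return [(None, None)]
    if isinstance(tree, str):
        return [(tree, None), (None, tree)]
    children = list(tree)
    result = []
    for mask in range(2 ** len(children)):
        left = [c for i, c in enumerate(children) if (mask >> i) & 1]
        right = [c for i, c in enumerate(children) if not (mask >> i) & 1]
        result.append((join(left), join(right)))
    return result


def mcsp_size(trees):
    """Size of a maximum compatible supertree of rooted trees (Theorem 3.11)."""
    trees = tuple(trees)
    full = [leaves(t) for t in trees]

    def lam(forest):
        # Lambda(A) from (3.1) for a terminal cut-subforest.
        candidates = set().union(*(leaves(a) for a in forest))
        return {l for l in candidates
                if all(l not in full[i] - leaves(a) for i, a in enumerate(forest))}

    @lru_cache(maxsize=None)
    def mcsp(forest):
        if all(a is None or isinstance(a, str) for a in forest):
            return len(lam(forest))
        best = 0
        for choice in product(*(splits(a) for a in forest)):
            left = tuple(l for l, _ in choice)
            right = tuple(r for _, r in choice)
            if all(a is None for a in left) or all(a is None for a in right):
                continue  # not a pair of cut-subforests
            best = max(best, mcsp(left) + mcsp(right))
        return best

    return mcsp(trees)


t1 = frozenset({frozenset({"a", "b"}), "c"})    # ((a,b),c)
t2 = frozenset({"a", frozenset({"b", "c"})})    # (a,(b,c))
print(mcsp_size([t1, t2]))
```

On the two conflicting triplets above the sketch reports 2, in line with the situation described for Figure 4. For unrooted inputs, the same routine would be applied to each cut-subforest and the largest value kept.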
Figure 6: The set of rooted trees $\mathcal{A} = \{A^{(1)}, A^{(2)}\}$ is a cut-subforest of $\mathcal{T} = \{T^{(1)}, T^{(2)}\}$.

Theorem 4.1 shows that the extended algorithm has the same asymptotic time bound as the algorithm in Section 3.

Theorem 4.1. We can find a maximum compatible supertree of $k$ unrooted phylogenetic trees in $O(2^{2kD} n^k)$ time.

Proof. Using a similar proof as Lemma 3.3, we can prove that there are $O(2^{kD} n^k)$ cut-subforests of $\mathcal{T}$. As given in the proof of Theorem 3.12, finding the maximum embedded supertrees of each cut-subforest takes $O(2^{kD})$ time. Hence the extended algorithm runs within the specified time bound.

5. Algorithm for MASP of rooted trees

Let $\mathcal{T}$ be a set of $k$ rooted phylogenetic trees. This section presents a dynamic programming algorithm to compute the size of a maximum agreement supertree of $\mathcal{T}$ in $O((kD)^{kD+3}(2n)^k)$ time. The maximum agreement supertree can be obtained in the same asymptotic time bound by backtracking.

The idea here is similar to that of Section 3. However, while we can assume that compatible supertrees are binary, the maximum degree of agreement supertrees can grow up to $kD$. This is the reason why we have the factor $O((kD)^{kD+3})$ in the complexity.

Definition 5.1 (Sub-forest). Given a set of $k$ rooted trees $\mathcal{T}$, a sub-forest of $\mathcal{T}$ is a set $\mathcal{A} = \{A^{(1)}, \ldots, A^{(k)}\}$, where each $A^{(i)}$ is either an empty tree or a complete subtree rooted at some node of $T^{(i)}$, and at least one element of $\mathcal{A}$ is not an empty tree.

Notice that the definition of sub-forest does not coincide with the concept of cut-subforest in Definition 3.2 of Section 3. For example, the cut-subforest $\mathcal{A}$ in Figure 3 is not a sub-forest of $\mathcal{T}$, because $A^{(2)}$ is not a complete subtree rooted at some node of $T^{(2)}$. Let $\mathcal{O}$ denote the set of all possible sub-forests of $\mathcal{T}$.
Then $|\mathcal{O}| = O((2n)^k)$.

Definition 5.2 (Enclosed supertree). For any sub-forest $\mathcal{A}$ of $\mathcal{T}$, a tree $X$ is called an enclosed supertree of $\mathcal{A}$ if $X$ is an agreement supertree of $\mathcal{A}$, and $L(X) \cap L(T^{(i)}) \subseteq L(A^{(i)})$ for all $i \le k$.

For each sub-forest $\mathcal{A}$ of $\mathcal{T}$, let $masp(\mathcal{A})$ denote the maximum size of enclosed supertrees of $\mathcal{A}$. We use a similar approach as Section 3, i.e., we compute $masp(\mathcal{A})$ for all $\mathcal{A} \in \mathcal{O}$, and $masp(\mathcal{T})$ is the size of a maximum agreement supertree of $\mathcal{T}$.

We partition the sub-forests in $\mathcal{O}$ into two classes. A sub-forest $\mathcal{A}$ is terminal if each $A^{(i)}$ is either an empty tree or a leaf. Otherwise, $\mathcal{A}$ is called non-terminal.

Notice that for a terminal sub-forest, the definition of enclosed supertree coincides with the concept of embedded supertree in Definition 3.4 of Section 3. Then by Lemma 3.5, we have $masp(\mathcal{A}) = |\Lambda(\mathcal{A})|$. (Please refer to the formula (3.1) in the paragraph preceding Lemma 3.5 for the definition of the function $\Lambda$.)

Definition 5.3 (Decomposition). Let $\mathcal{A}$ be a sub-forest of $\mathcal{T}$. We say that sub-forests $\mathcal{B}_1, \ldots, \mathcal{B}_d$ (with $d \ge 2$) decompose $\mathcal{A}$ if for all $i \le k$, either
(i) Exactly one of $B_1^{(i)}, \ldots, B_d^{(i)}$ is isomorphic to $A^{(i)}$ while the others are empty trees; or
(ii) There are at least 2 nonempty trees in $B_1^{(i)}, \ldots, B_d^{(i)}$, and all those nonempty trees are isomorphic to pairwise distinct subtrees attached to the root of $A^{(i)}$.

Figure 7: A decomposition $(\mathcal{B}_1, \mathcal{B}_2, \mathcal{B}_3)$ of a sub-forest $\mathcal{A}$. The empty trees are represented by white circles.

Figure 7 illustrates the concept of decomposition. For each sub-forest $\mathcal{A}$ of $\mathcal{T}$, we will prove that

$$masp(\mathcal{A}) = \max\{ masp(\mathcal{B}_1) + \ldots + masp(\mathcal{B}_d) \mid \mathcal{B}_1, \ldots, \mathcal{B}_d \text{ decompose } \mathcal{A} \}. \qquad (5.1)$$

The identity (5.1) is then established by Lemmas 5.5 and 5.7.

Lemma 5.4. Suppose $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$ is a decomposition of some sub-forest $\mathcal{A}$ of $\mathcal{T}$. Let $\tau_1, \ldots, \tau_d$ be some enclosed supertrees of $\mathcal{B}_1, \ldots, \mathcal{B}_d$ respectively, and let $X$ be the tree obtained by connecting $\tau_1, \ldots, \tau_d$ to a common root. Then, $X$ is an enclosed supertree of $\mathcal{A}$.

Lemma 5.5. If $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$ is a decomposition of a sub-forest $\mathcal{A}$ of $\mathcal{T}$ then $masp(\mathcal{A}) \ge masp(\mathcal{B}_1) + \ldots + masp(\mathcal{B}_d)$.

Proof. For each $\mathcal{B}_j$, let $\tau_j$ be an enclosed supertree of $\mathcal{B}_j$ such that $|L(\tau_j)| = masp(\mathcal{B}_j)$. Let $X$ be the tree obtained by connecting $\tau_1, \ldots, \tau_d$ to a common root. By Lemma 5.4, $X$ is an enclosed supertree of $\mathcal{A}$. Hence $|L(\tau_1)| + \ldots + |L(\tau_d)| = |L(X)| \le masp(\mathcal{A})$.

Lemma 5.6. Let $X$ be an enclosed supertree of some sub-forest $\mathcal{A}$ of $\mathcal{T}$, and let $\tau_1, \ldots, \tau_d$ be all subtrees attached to the root of $X$. Then either
(i) There is a decomposition $(\mathcal{B}_1, \mathcal{B}_2)$ of $\mathcal{A}$ such that $X$ is an enclosed supertree of $\mathcal{B}_1$; or
(ii) There is a decomposition $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$ of $\mathcal{A}$ such that each $\tau_j$ is an enclosed supertree of $\mathcal{B}_j$.

Lemma 5.7. For each non-terminal sub-forest $\mathcal{A}$ of $\mathcal{T}$, there is a decomposition $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$ of $\mathcal{A}$ such that $masp(\mathcal{A}) \le masp(\mathcal{B}_1) + \ldots + masp(\mathcal{B}_d)$.

Proof. Let $X$ be an enclosed supertree of $\mathcal{A}$ such that $|L(X)| = masp(\mathcal{A})$ and let $\tau_1, \ldots, \tau_d$ be all subtrees attached to the root of $X$. By Lemma 5.6, either (i) there exists a decomposition $(\mathcal{B}_1, \mathcal{B}_2)$ of $\mathcal{A}$ such that $X$ is an enclosed supertree of $\mathcal{B}_1$; or (ii) there is a decomposition $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$ of $\mathcal{A}$ such that each $\tau_j$ is an enclosed supertree of $\mathcal{B}_j$. In case (i), we have $|L(X)| \le masp(\mathcal{B}_1) \le masp(\mathcal{B}_1) + masp(\mathcal{B}_2)$. On the other hand, in case (ii), we have

$$|L(X)| = |L(\tau_1)| + \ldots + |L(\tau_d)| \le masp(\mathcal{B}_1) + \ldots + masp(\mathcal{B}_d).$$

The above discussion then leads to Theorem 5.8.

Theorem 5.8. For every sub-forest $\mathcal{A}$ of $\mathcal{T}$, the value $masp(\mathcal{A})$ equals

$$masp(\mathcal{A}) = \begin{cases} |\Lambda(\mathcal{A})|, & \text{if } \mathcal{A} \text{ is terminal,} \\ \max\{ masp(\mathcal{B}_1) + \ldots + masp(\mathcal{B}_d) \mid \mathcal{B}_1, \ldots, \mathcal{B}_d \text{ decompose } \mathcal{A} \}, & \text{otherwise.} \end{cases}$$

We define an ordering of the sub-forests in $\mathcal{O}$ as follows. For any sub-forests $\mathcal{A}_1, \mathcal{A}_2$ in $\mathcal{O}$, we say $\mathcal{A}_1$ is smaller than $\mathcal{A}_2$ if $A_1^{(i)}$ is either an empty tree or a subtree of $A_2^{(i)}$ for $i = 1, 2, \ldots, k$. Our algorithm enumerates $\mathcal{A} \in \mathcal{O}$ in topologically increasing order and computes $masp(\mathcal{A})$ based on Theorem 5.8.

In Lemma 5.9, we bound the number of decompositions of each sub-forest of $\mathcal{T}$. Theorem 5.10 states the complexity of the algorithm.

Lemma 5.9. Each sub-forest of $\mathcal{T}$ has $O((kD)^{kD+1})$ decompositions, and generating those decompositions takes $O(k^2 D^2)$ time per decomposition.

Proof. Let $\mathcal{A}$ be a sub-forest of $\mathcal{T}$. Since the maximum degree of any agreement supertree of $\mathcal{A}$ is bounded by $kD$, we consider only decompositions that consist of at most $kD$ elements. We claim that for each $d \in \{2, \ldots, kD\}$, the sub-forest $\mathcal{A}$ owns $O((d+2)^{kD})$ decompositions $(\mathcal{B}_1, \ldots, \mathcal{B}_d)$. Summing up those asymptotic terms gives us the specified bound.

The key of this proof is to prove that for each $s \in \{1, \ldots, k\}$, the tree $A^{(s)}$ contributes at most $(d+1)^D + d < (d+2)^D$ sequences $B_1^{(s)}, \ldots, B_d^{(s)}$, and generating those sequences requires $O(d)$ time per sequence. We have two cases, each corresponding to a type of the above sequence.

Case 1: One term in the sequence is $A^{(s)}$; therefore the other terms are empty trees. Then, we can generate this sequence by assigning $A^{(s)}$ to exactly one term and setting the rest to be empty trees. This case provides exactly $d$ sequences and enumerates them in $O(d)$ time per sequence.
Case 2: No term in the above sequence is $A^{(s)}$. Consider an integer $r \in \{0, 1, \ldots, d\}$ and assume that the sequence consists of exactly $r$ terms that are nonempty trees. Then those $r$ nonempty trees are isomorphic to pairwise distinct subtrees attached to the root of $A^{(s)}$. Let $\delta$ be the degree of the root of $A^{(s)}$. We generate the sequence as follows. First we draw $r$ pairwise distinct subtrees attached to the root of $A^{(s)}$. Next, we select $r$ terms in the sequence and distribute the above subtrees to them. Finally we set the remaining terms to be empty trees. Hence this case gives at most

$$\sum_{r \le \min\{\delta, d\}} \binom{\delta}{r} \frac{d!}{(d-r)!} < \sum_{r=0}^{D} \binom{D}{r} d^r = (d+1)^D$$

sequences, and generates them in $O(d)$ time per sequence.

Theorem 5.10. A maximum agreement supertree of $k$ rooted phylogenetic trees can be obtained in $O((kD)^{kD+3}(2n)^k)$ time.

Proof. Testing if a sub-forest is terminal takes $O(k)$ time, and each terminal sub-forest $\mathcal{A}$ then requires $O(k^2)$ time for computing $\Lambda(\mathcal{A})$. By Lemma 5.9, each non-terminal sub-forest requires $O((kD)^{kD+3})$ running time. Summing up those asymptotic terms for the $O((2n)^k)$ sub-forests of $\mathcal{T}$ gives us the specified time bound.

6. Algorithm for MASP of unrooted trees

Let $\mathcal{T}$ be a set of $k$ unrooted phylogenetic trees. This section extends the algorithm in Section 5 to find the size of a maximum agreement supertree of $\mathcal{T}$ in $O((kD)^{kD+3}(4n)^k)$ time. The maximum agreement supertree can be obtained by backtracking.

We say that a set of $k$ rooted trees $\mathcal{F} = \{F^{(1)}, \ldots, F^{(k)}\}$ is a rooted variant of $\mathcal{T}$ if we can obtain each $F^{(i)}$ by rooting $T^{(i)}$ at some internal node. One naive approach is to use the algorithm in the previous section to solve MASP for each rooted variant of $\mathcal{T}$.
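The notion of a rooted variant can be illustrated with a short sketch that enumerates all of them for small inputs. The encoding is an assumption made only for this illustration: an unrooted tree is an adjacency dictionary whose degree-one nodes are the labeled leaves, internal node names such as `u` and `v` are arbitrary, and the rooted result is a nested `frozenset` of leaf labels.

```python
from itertools import product


def root_at(adj, root):
    """Root an unrooted tree (adjacency dict) at the given internal node."""
    def build(node, parent):
        kids = [x for x in adj[node] if x != parent]
        if not kids:
            return node                      # a leaf is represented by its label
        return frozenset(build(x, node) for x in kids)
    return build(root, None)


def rooted_variants(unrooted_trees):
    """All rooted variants: one internal root chosen per input tree."""
    per_tree = []
    for adj in unrooted_trees:
        internal = [v for v in adj if len(adj[v]) > 1]
        per_tree.append([root_at(adj, v) for v in internal])
    return list(product(*per_tree))


# Hypothetical unrooted quartet ab|cd with internal nodes u and v.
quartet = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["u", "c", "d"],
}
print(len(rooted_variants([quartet, quartet])))
```

Each tree with at most $n - 1$ internal nodes yields at most $n - 1$ rootings, so the product over $k$ trees grows as $O(n^k)$.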
Each rooted variant then gives us a solution, and the maximum of those solutions is the size of a maximum agreement supertree of $\mathcal{T}$. Because there are $O(n^k)$ rooted variants of $\mathcal{T}$, this approach adds an $O(n^k)$ factor to the complexity of the algorithm for rooted trees.

We now show how to improve the above naive algorithm. As mentioned in the previous section, the computation of each rooted variant of $\mathcal{T}$ consists of $O((2n)^k)$ sub-problems which correspond to its sub-forests. (Please refer to Definition 5.1 for the concept of sub-forest.) Since different rooted variants may have some common sub-forests, the total number of sub-problems we have to run is much smaller than $O(2^k n^{2k})$. More precisely, we will show that the total number of sub-problems is only $O((4n)^k)$.

A (rooted or unrooted) tree is trivial if it is a leaf or an empty tree. A maximal subtree of an unrooted tree $T$ is a rooted tree obtained by first rooting $T$ at some internal node $v$ and then removing at most one nontrivial subtree attached to $v$. Let $\mathcal{O}$ denote the set of sub-forests of all rooted variants of $\mathcal{T}$.

Lemma 6.1. Let $\mathcal{A} = \{A^{(1)}, \ldots, A^{(k)}\}$ be a set of rooted trees. Then $\mathcal{A} \in \mathcal{O}$ if and only if each $A^{(i)}$ is either a trivial subtree or a maximal subtree of $T^{(i)}$.

Proof. Let $\mathcal{F}$ be a rooted variant of $\mathcal{T}$ such that $\mathcal{A}$ is a sub-forest of $\mathcal{F}$. Fix an index $s \in \{1, \ldots, k\}$ and let $v$ be the root node of $A^{(s)}$. Our claim is straightforward if either $A^{(s)}$ is trivial or $v$ is the root node of $F^{(s)}$. Otherwise, let $u$ be the parent of $v$ in $F^{(s)}$. Hence $A^{(s)}$ is the maximal subtree of $T^{(s)}$ obtained by first rooting $T^{(s)}$ at $v$ and then removing the complete subtree rooted at $u$.

Conversely, we construct a rooted variant $\mathcal{F}$ of $\mathcal{T}$ such that $\mathcal{A}$ is a sub-forest of $\mathcal{F}$ as follows.
For each $i \le k$, if $A^{(i)}$ is trivial or $A^{(i)}$ is a tree obtained by rooting $T^{(i)}$ at some internal node then constructing $F^{(i)}$ is straightforward. Otherwise $A^{(i)}$ is a maximal subtree of $T^{(i)}$ obtained by first rooting $T^{(i)}$ at some internal node $v$ and then removing exactly one nontrivial subtree $\tau$ attached to $v$. Hence $F^{(i)}$ is the tree obtained by rooting $T^{(i)}$ at $u$, where $u$ is the root of $\tau$.

Theorem 6.2. We can find a maximum agreement supertree of $k$ unrooted phylogenetic trees in $O((kD)^{kD+3}(4n)^k)$ time.

Proof. The key of this proof is to show that each tree $T^{(i)}$ contributes at most $(3n - 1)$ maximal subtrees. Together with the at most $n + 1$ trivial subtrees (the leaves and the empty tree), Lemma 6.1 then gives at most $4n$ choices for each $A^{(i)}$, and it follows that $|\mathcal{O}| \le (4n)^k$. The specified running time of our algorithm is then straightforward because each subproblem requires $O((kD)^{kD+3})$ time as given in the proof of Theorem 5.10.

Assume that the tree $T^{(i)}$ has exactly $L$ leaves, with $L \le n$. We now count the number of maximal subtrees $T$ of $T^{(i)}$ in two cases.

Case 1: $T$ is obtained by rooting $T^{(i)}$ at some internal node. Hence this case provides at most $L - 1 < n$ maximal subtrees.

Case 2: $T$ is obtained by first rooting $T^{(i)}$ at some internal node $v$ and then removing a nontrivial subtree $\tau$ attached to $v$. Notice that there is a one-to-one correspondence between the tree $T$ and the directed edge $(v, u)$ of $T^{(i)}$, where $u$ is the root node of $\tau$. There are $2L - 2$ or fewer undirected edges in $T^{(i)}$ but exactly $L$ of them are adjacent to the leaves. Hence this case gives us at most $2(2L - 2 - L) < 2n - 1$ maximal subtrees.

References

[1] A. Amir and D. Keselman. Maximum Agreement Subtree in a set of Evolutionary Trees: Metrics and Efficient Algorithms. SIAM Journal on Computing, 26(6):1656–1669, 1997.
[2] V. Berry and F. Nicolas. Maximum Agreement and Compatible Supertrees. In Proc. 15th Symposium on Combinatorial Pattern Matching (CPM 2004), Lect. Notes in Comp. Science 3109, pp. 205–219. Springer, 2004.
[3] M. Farach, T. Przytycka, and M. Thorup. On the agreement of many trees. Information Processing Letters, 55:297–301, 1995.
[4] G. Ganapathysaravanabavan and T. Warnow. Finding a maximum compatible tree for a bounded number of trees with bounded degree is solvable in polynomial time. In Proc. 1st Workshop on Algorithms in Bioinformatics (WABI 2001), Lect. Notes in Comp. Science 2149, pp. 156–163. Springer, 2001.
[5] A. G. Gordon. Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labelled leaves. Journal of Classification, 3:335–348, 1986.
[6] Sylvain Guillemot and Vincent Berry. Fixed-Parameter Tractability of the Maximum Agreement Supertree Problem. In Proc. 18th Symposium on Combinatorial Pattern Matching (CPM 2007), Lect. Notes in Comp. Science 4580, pp. 274–285. Springer, 2007.
[7] J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing evolutionary trees. Discrete Applied Mathematics, 71:153–169, 1996.
[8] Jesper Jansson, Joseph H.-K. Ng, Kunihiko Sadakane, and Wing-Kin Sung. Rooted Maximum Agreement Supertrees. Algorithmica, 43:293–307, 2005.
[9] M.-Y. Kao, T.-W. Lam, W.-K. Sung, and H.-F. Ting. An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings. Journal of Algorithms, 40(2):212–233, 2001.
[10] Maddison, D.R., and K.-S. Schulz (eds.). The Tree of Life Web Project. http://tolweb.org, 1996-2006.

This work is licensed under the Creative Commons Attribution-NoDerivs License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/3.0/.
