The zero exemplar distance problem

Minghui Jiang
Department of Computer Science, Utah State University, Logan, UT 84322, USA
mjiang@cc.usu.edu

November 5, 2010

Abstract

Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this paper, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order, and prove the analogous result that it is also NP-hard even if each gene appears at most two times in each genome. For the positive direction, we show that both variants of Zero Exemplar Distance admit polynomial-time algorithms if each gene appears exactly once in one genome and at least once in the other genome. In addition, we present a polynomial-time algorithm for the related problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence. This answers an open question of Bonizzoni et al. We also show that Zero Exemplar Distance for multichromosomal genomes without gene order is fixed-parameter tractable if the parameter is the maximum number of chromosomes in each genome.
This work was supported in part by NSF grant DBI-0743670. A preliminary version of this paper (including Theorem 1, Theorem 2, and a weaker version of Theorem 4) appeared in Proceedings of the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG 2010) [11].

1 Introduction

Given two genomes with duplicate genes, Genome Rearrangement with Gene Families [12] is the problem of deleting all but one copy of each gene in each genome, so as to minimize some rearrangement distance between the two reduced genomes. The minimum rearrangement distance thus attained is called the exemplar distance between the two genomes. For example, each of the following two monochromosomal genomes

    $G_1:\ -4\ \ {+1}\ \ {+2}\ \ {+3}\ \ -5\ \ {+1}\ \ {+2}\ \ {+3}\ \ -6$
    $G_2:\ -1\ \ -4\ \ {+1}\ \ {+2}\ \ -5\ \ {+3}\ \ -2\ \ -6\ \ {+3}$

has at most two copies of each gene, and each of the following two reduced genomes

    $G'_1:\ -4\ \ {+1}\ \ {+2}\ \ -5\ \ {+3}\ \ -6$
    $G'_2:\ -4\ \ {+1}\ \ {+2}\ \ -5\ \ {+3}\ \ -6$

has exactly one copy of each gene.

Recall that in the study of genome rearrangement, a gene is usually represented by a signed integer: the absolute value of the integer (the unsigned integer) denotes the gene family to which the gene belongs; the sign of the integer denotes the orientation of the gene in its chromosome. Then a chromosome is a sequence of signed integers, and a genome is a collection of chromosomes.

Genome Rearrangement with Gene Families is not a single problem but a whole class of related problems, because the choice of rearrangement distance is not unique. This choice becomes irrelevant, however, when we ask the fundamental question: Is the distance zero? In the example above, the two reduced genomes $G'_1$ and $G'_2$ are identical, thus the exemplar distance between the two original genomes $G_1$ and $G_2$ is zero for any reasonable choice of rearrangement distance.
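The example above can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: it verifies that a candidate reduced genome is a common subsequence of both genomes and contains each gene family exactly once. The function names `is_subsequence`, `is_exemplar`, and `is_common_exemplar` are hypothetical, chosen only for illustration.

```python
def is_subsequence(short, long):
    """Return True if `short` is a (not necessarily contiguous) subsequence of `long`."""
    it = iter(long)
    # `gene in it` advances the iterator until the gene is found (or exhausted).
    return all(gene in it for gene in short)


def is_exemplar(reduced, genome):
    """Return True if `reduced` has exactly one gene per gene family of `genome`."""
    families = [abs(g) for g in reduced]
    return len(families) == len(set(families)) and set(families) == {abs(g) for g in genome}


def is_common_exemplar(reduced, g1, g2):
    """Check that `reduced` witnesses zero exemplar distance between g1 and g2."""
    return all(is_subsequence(reduced, g) and is_exemplar(reduced, g) for g in (g1, g2))


# The two genomes and the common reduced genome from the example in the text.
G1 = [-4, 1, 2, 3, -5, 1, 2, 3, -6]
G2 = [-1, -4, 1, 2, -5, 3, -2, -6, 3]
REDUCED = [-4, 1, 2, -5, 3, -6]
print(is_common_exemplar(REDUCED, G1, G2))
```

Note that this only verifies a given witness; finding one is the hard part, as the results below show.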
In this paper, we study the most basic version of the problem Genome Rearrangement with Gene Families: Given two sequences of signed integers, Zero Exemplar Distance (for monochromosomal genomes) is the problem of deciding whether the two sequences have a common subsequence including each unsigned integer exactly once in either positive or negative form.

Due to its generic nature, the problem Zero Exemplar Distance has been extensively studied by several groups of researchers [5, 4, 2] focusing on different rearrangement distances, and, not surprisingly, has acquired several different names. Except for trivial distinctions, Zero Exemplar Distance is essentially the same problem as Zero Exemplar Conserved Interval Distance [5], Exemplar Longest Common Subsequence (deciding whether a feasible solution exists) [4], and Zero Exemplar Breakpoint Distance [2].

It is easy to check that if only one of the two genomes has duplicate genes, then Zero Exemplar Distance can be solved in linear time: we simply need to decide whether the genome without duplicates is a subsequence of the genome with duplicates. In sharp contrast, if both genomes contain duplicate genes, then even if each gene appears at most three times in each genome, the problem Zero Exemplar Distance is already NP-hard, as shown independently in three papers [5, 4, 2]. The quest for the exact boundary between polynomial solvability and NP-hardness led to the following open question first raised by Chen et al. in 2006:

Question 1 (Chen, Fowler, Fu, and Zhu, 2006 [5]). Is the problem Zero Exemplar Distance for monochromosomal genomes still NP-hard if each gene appears at most two times in each genome?

This question was finally settled in the affirmative by Blin et al. in 2009:

Theorem 1 (Blin, Fertin, Sikora, and Vialette, 2009 [3]). Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome.

In Section 2, we give a very simple alternative proof of this theorem.

Both the previous proof of Theorem 1 [3] and our alternative proof depend crucially on the order of the genes in the chromosomes. One may naturally wonder whether the complexity of Zero Exemplar Distance would change if gene order is not known. Note that genome rearrangement distances such as the syntenic distance [8] can be defined in the absence of gene order. Now model each chromosome as a set of unsigned integers instead of a sequence of signed integers. Then Zero Exemplar Distance for multichromosomal genomes without gene order is the following problem: Given two collections $G_1$ and $G_2$ of subsets of the same ground set $S$ of unsigned integers, decide whether both $G_1$ and $G_2$ can be reduced, by deleting elements from subsets and deleting subsets from collections, to the same collection $G'$ of subsets of $S$ such that each unsigned integer in $S$ is contained in exactly one subset in $G'$, i.e., $G'$ is a partition of $S$. For example,

    $S:\ \{1, 2, 3, 4, 5\}$
    $G_1:\ \{1, 2, 3\}\ \{2, 3, 4\}\ \{4, 5\}$
    $G_2:\ \{1, 2\}\ \{2, 3, 4\}\ \{3, 4, 5\}\ \{1, 5\}$
    $G':\ \{1, 2\}\ \{3\}\ \{4, 5\}$

In Section 3, we prove the following theorem analogous to Theorem 1:

Theorem 2. Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard even if each gene appears at most two times in each genome.

As decision problems, both variants of Zero Exemplar Distance, for monochromosomal genomes and for multichromosomal genomes without gene order, are in NP. Thus, following the NP-hardness results in Theorem 1 and Theorem 2, these two decision problems are both NP-complete.
Moreover, the NP-hardness results in Theorem 1 and Theorem 2 imply that unless NP = P, the corresponding minimization problems of computing the exemplar distance between two genomes do not admit any approximation. We refer to [5, 6, 4, 2, 1] for related results.

The problem Zero Exemplar Distance for monochromosomal genomes, as mentioned earlier, has been studied under several different names. Given two sequences $A$ and $B$ over an alphabet $\Sigma = \Sigma_1 \cup \Sigma_2$, where $\Sigma_1$ is a set of mandatory symbols and $\Sigma_2$ is a set of optional symbols, Exemplar Longest Common Subsequence [4] is the problem of finding a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$. For example, if $\Sigma_1 = \{1, 2, 3\}$ and $\Sigma_2 = \{4, 5\}$, then $C = 124355$ is an exemplar longest common subsequence of the two sequences $A = 12423545$ and $B = 1142443555$.

Due to the strict requirement on mandatory symbols, Exemplar Longest Common Subsequence does not always have a feasible solution. It is not difficult to see that simply deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences $A$ and $B$ is the same as the problem Zero Exemplar Distance for two monochromosomal genomes $A'$ and $B'$ obtained from $A$ and $B$ by deleting all optional symbols.

Recall that the problem Zero Exemplar Distance for monochromosomal genomes becomes trivial when only one of the two genomes has duplicate genes. For the equivalent problem of deciding whether a feasible solution to Exemplar Longest Common Subsequence exists, Bonizzoni et al. [4] showed another tractable special case: If each mandatory symbol appears a total of at most three times in $A$ and $B$, then there is a polynomial-time algorithm, based on 2SAT, that decides whether $A$ and $B$ have a common subsequence containing all mandatory symbols.
This algorithm does not solve the maximization problem, however, and the following question was left open:

Question 2 (Bonizzoni et al. [4]). Is there a polynomial-time algorithm for Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears a total of at most three times in the two input sequences?

Without loss of generality, we assume that each input sequence contains each symbol in the alphabet at least once. If each mandatory symbol appears a total of at most three times in the two input sequences, then it must appear exactly once in one sequence, and at least once in the other sequence, as in the example shown earlier. In Section 4, we prove the following theorem that complements Theorem 1 and answers the open question of Bonizzoni et al. in the affirmative:

Theorem 3. Zero Exemplar Distance for monochromosomal genomes admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome. Exemplar Longest Common Subsequence admits a polynomial-time algorithm in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence.

Finally, in Section 5, we prove the following theorem that complements Theorem 2:

Theorem 4. Zero Exemplar Distance for multichromosomal genomes without gene order admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome, and is fixed-parameter tractable if the parameter is the maximum number of chromosomes in each genome.

2 Alternative Proof of Theorem 1

We prove that Zero Exemplar Distance for monochromosomal genomes is NP-hard by a reduction from the well-known NP-complete problem 3SAT [9].
Let $(V, E)$ be a 3SAT instance, where $V = \{v_1, \ldots, v_n\}$ is a set of $n$ boolean variables, $E = \{e_1, \ldots, e_m\}$ is a conjunctive boolean formula of $m$ clauses, and each clause in $E$ is a disjunction of exactly three literals of the variables in $V$. We will construct two sequences (genomes) $G_1$ and $G_2$ over $2n + 6m + 1$ distinct unsigned integers (genes):

• Two variable genes $x_i, y_i$ for each variable $v_i$, $1 \le i \le n$;
• Three clause genes $a_j, b_j, c_j$ for each clause $e_j$, $1 \le j \le m$;
• Three literal genes $r_j, s_j, t_j$ for the three literals of each clause $e_j$, $1 \le j \le m$;
• One separator gene $z$.

In our construction, all genes appear in the positive orientation in the two genomes, so we will omit the signs in our description. The two genomes $G_1$ and $G_2$ are represented schematically as follows:

    $G_1:\ \langle v_1 \rangle \ldots \langle v_n \rangle\ z\ \langle e_1 \rangle \ldots \langle e_m \rangle$
    $G_2:\ \langle v_1 \rangle \ldots \langle v_n \rangle\ z\ \langle e_1 \rangle \ldots \langle e_m \rangle$

For each variable $v_i$, the variable gadget $\langle v_i \rangle$ consists of one copy of $x_i$ and two copies of $y_i$ in $G_1$, two copies of $x_i$ and one copy of $y_i$ in $G_2$, and, for each literal of the variable in the clauses, one copy of the corresponding literal gene ($r_j$, $s_j$, or $t_j$ for some clause $e_j$) in each genome. Let $p_{i,1}, \ldots, p_{i,k_i}$ be the literal genes for the positive literals of $v_i$, and let $q_{i,1}, \ldots, q_{i,l_i}$ be the literal genes for the negative literals of $v_i$. The genes $x_i, y_i, p_{i,1}, \ldots, p_{i,k_i}, q_{i,1}, \ldots, q_{i,l_i}$ in the variable gadget $\langle v_i \rangle$ are arranged in the following pattern in the two genomes:

    $G_1\langle v_i \rangle:\ y_i\ \ p_{i,1} \ldots p_{i,k_i}\ \ x_i\ \ q_{i,1} \ldots q_{i,l_i}\ \ y_i$
    $G_2\langle v_i \rangle:\ p_{i,1} \ldots p_{i,k_i}\ \ x_i\ \ y_i\ \ x_i\ \ q_{i,1} \ldots q_{i,l_i}$

For each clause $e_j$, the clause gadget $\langle e_j \rangle$ consists of two copies of each clause gene $a_j, b_j, c_j$ and one copy of each literal gene $r_j, s_j, t_j$.
These genes in $\langle e_j \rangle$ are arranged in the following pattern in the two genomes:

    $G_1\langle e_j \rangle:\ r_j\ a_j\ b_j\ c_j\ s_j\ a_j\ b_j\ c_j\ t_j$
    $G_2\langle e_j \rangle:\ a_j\ r_j\ b_j\ a_j\ s_j\ c_j\ b_j\ t_j\ c_j$

This completes the construction. It is easy to check that each gene appears at most two times in each genome, and that each genome includes exactly $3n + 12m + 1$ genes including duplicates. We give an example:

Example 1. For a 3SAT instance of 4 variables and 2 clauses $e_1 = \{ r_1 = v_1,\ s_1 = \neg v_2,\ t_1 = \neg v_3 \}$ and $e_2 = \{ r_2 = \neg v_1,\ s_2 = v_3,\ t_2 = v_4 \}$, the reduction constructs the following two genomes:

    $G_1:\ y_1\ r_1\ x_1\ r_2\ y_1\ \ y_2\ x_2\ s_1\ y_2\ \ y_3\ s_2\ x_3\ t_1\ y_3\ \ y_4\ t_2\ x_4\ y_4\ \ z\ \ r_1\ a_1\ b_1\ c_1\ s_1\ a_1\ b_1\ c_1\ t_1\ \ r_2\ a_2\ b_2\ c_2\ s_2\ a_2\ b_2\ c_2\ t_2$
    $G_2:\ r_1\ x_1\ y_1\ x_1\ r_2\ \ x_2\ y_2\ x_2\ s_1\ \ s_2\ x_3\ y_3\ x_3\ t_1\ \ t_2\ x_4\ y_4\ x_4\ \ z\ \ a_1\ r_1\ b_1\ a_1\ s_1\ c_1\ b_1\ t_1\ c_1\ \ a_2\ r_2\ b_2\ a_2\ s_2\ c_2\ b_2\ t_2\ c_2$

The assignment $v_1 = \text{true}$, $v_2 = \text{false}$, $v_3 = \text{false}$, $v_4 = \text{true}$ satisfies the 3SAT instance and corresponds to the following common reduced genome:

    $G':\ r_1\ x_1\ y_1\ \ y_2\ x_2\ s_1\ \ y_3\ x_3\ t_1\ \ t_2\ x_4\ y_4\ \ z\ \ a_1\ b_1\ c_1\ \ r_2\ a_2\ s_2\ b_2\ c_2$

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 1. The 3SAT instance $(V, E)$ is satisfiable if and only if the two genomes $G_1$ and $G_2$ have a common subsequence $G'$ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance $(V, E)$ is satisfiable. We will compose a common subsequence $G'$ of the two genomes $G_1$ and $G_2$ from a common subsequence of each variable gadget $\langle v_i \rangle$, the separator gene $z$ in the middle, and a common subsequence of each clause gadget $\langle e_j \rangle$. Consider a truth assignment that satisfies the 3SAT instance. For each variable $v_i$, take the subsequence $p_{i,1} \ldots p_{i,k_i}\ x_i\ y_i$ if $v_i$ is set to true, and take the subsequence $y_i\ x_i\ q_{i,1} \ldots q_{i,l_i}$ if $v_i$ is set to false. For each clause $e_j$, at least one of its three literals is true; correspondingly, at least one of the three literal genes $r_j, s_j, t_j$ has been taken from some variable gadget $\langle v_i \rangle$. Now take a subsequence from the clause gadget $\langle e_j \rangle$ following one of three cases:

1. If $r_j$ has been taken, then take the subsequence $a_j\ b_j\ \underline{s_j}\ c_j\ \underline{t_j}$.
2. If $s_j$ has been taken, then take either the subsequence $\underline{r_j}\ b_j\ a_j\ c_j\ \underline{t_j}$ or the subsequence $\underline{r_j}\ a_j\ c_j\ b_j\ \underline{t_j}$.
3. If $t_j$ has been taken, then take the subsequence $\underline{r_j}\ a_j\ \underline{s_j}\ b_j\ c_j$.

Here an underlined literal gene is omitted from the subsequence taken from the clause gadget $\langle e_j \rangle$ if its other copy has already been taken from some variable gadget $\langle v_i \rangle$. The common subsequence $G'$ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes $G_1$ and $G_2$ have a common subsequence $G'$ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance $(V, E)$ as follows. Due to the strategic location of the separator gene $z$ in the two genomes, each literal gene must appear in the common subsequence either before $z$ in both genomes, in some variable gadget $\langle v_i \rangle$, or after $z$ in both genomes, in some clause gadget $\langle e_j \rangle$. The crucial property of the clause gadget $\langle e_j \rangle$ is that it cannot have a common subsequence including exactly one copy of each clause gene $a_j, b_j, c_j$ unless at least one of the three literal genes $r_j, s_j, t_j$ is omitted. A literal gene omitted from the common subsequence of the clause gadget $\langle e_j \rangle$ has to appear in the common subsequence of some variable gadget $\langle v_i \rangle$, where the two variable genes $x_i$ and $y_i$ must appear in the order $x_i\, y_i$ if the literal is positive and appear in the order $y_i\, x_i$ if the literal is negative.
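As an aside, the crucial property of the clause gadget is small enough to confirm by exhaustive search. The following Python sketch is an illustration added here, not part of the original proof: it enumerates every subsequence of the $G_1$ clause gadget that has exactly one copy of each clause gene, keeps those that are also subsequences of the $G_2$ clause gadget, and records which literal genes each one retains. The names `G1_GADGET`, `G2_GADGET`, and `feasible_literal_sets` are hypothetical, and the subscript $j$ is dropped.

```python
from itertools import combinations

# Clause gadgets from the construction, with the subscript j dropped.
G1_GADGET = ["r", "a", "b", "c", "s", "a", "b", "c", "t"]
G2_GADGET = ["a", "r", "b", "a", "s", "c", "b", "t", "c"]


def is_subsequence(short, long):
    """Return True if `short` is a subsequence of `long`."""
    it = iter(long)
    return all(gene in it for gene in short)


def feasible_literal_sets():
    """Sets of literal genes that can be kept alongside exactly one a, one b, one c."""
    found = set()
    positions = range(len(G1_GADGET))
    for size in range(3, len(G1_GADGET) + 1):
        for idx in combinations(positions, size):
            candidate = [G1_GADGET[i] for i in idx]
            clause_genes = sorted(g for g in candidate if g in "abc")
            if clause_genes != ["a", "b", "c"]:
                continue  # need exactly one copy of each clause gene
            if is_subsequence(candidate, G2_GADGET):
                found.add(frozenset(g for g in candidate if g in "rst"))
    return found


kept = feasible_literal_sets()
# All three literal genes can never be kept together ...
print(frozenset("rst") in kept)
# ... but any two of them can, matching the three cases of the direct implication.
print(all(frozenset(pair) in kept for pair in ("st", "rt", "rs")))
```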
Now set each variable $v_i$ to true if the two variable genes $x_i$ and $y_i$ appear in the common subsequence $G'$ in the order $x_i\, y_i$, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 1.

3 Proof of Theorem 2

We prove that Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard by a reduction again from 3SAT. Let $(V, E)$ be a 3SAT instance, where $V = \{v_1, \ldots, v_n\}$ is a set of $n$ boolean variables, $E = \{e_1, \ldots, e_m\}$ is a conjunctive boolean formula of $m$ clauses, and each clause in $E$ is a disjunction of exactly three literals of the variables in $V$. Without loss of generality, assume that no clause in $E$ contains two literals of the same variable in $V$. We will construct two genomes $G_1$ and $G_2$ over $n + 9m$ distinct genes:

• One variable gene $x_i$ for each variable $v_i$, $1 \le i \le n$;
• Six clause genes $a_j, b_j, c_j, a'_j, b'_j, c'_j$ for each clause $e_j$, $1 \le j \le m$;
• Three literal genes $r_j, s_j, t_j$ for the three literals of each clause $e_j$, $1 \le j \le m$.

For each variable $v_i$, let $p_{i,1}, \ldots, p_{i,k_i}$ be the literal genes for the positive literals of $v_i$, and let $q_{i,1}, \ldots, q_{i,l_i}$ be the literal genes for the negative literals of $v_i$. $G_1$ includes one subset and $G_2$ includes two subsets of genes including $x_i$:

    $G_1\langle v_i \rangle:\ \{p_{i,1}, \ldots, p_{i,k_i}, x_i, q_{i,1}, \ldots, q_{i,l_i}\}$
    $G_2\langle v_i \rangle:\ \{p_{i,1}, \ldots, p_{i,k_i}, x_i\}\ \ \{x_i, q_{i,1}, \ldots, q_{i,l_i}\}$

For each clause $e_j$, $G_1$ includes six subsets and $G_2$ includes seven subsets of clause/literal genes:

    $G_1\langle e_j \rangle:\ \{a_j, b_j\}\ \{b_j, c_j\}\ \{c_j, a_j\}\ \{a'_j, r_j\}\ \{b'_j, s_j\}\ \{c'_j, t_j\}$
    $G_2\langle e_j \rangle:\ \{a_j, b_j, c_j\}\ \{a_j, a'_j, r_j\}\ \{b_j, b'_j, s_j\}\ \{c_j, c'_j, t_j\}\ \{a'_j\}\ \{b'_j\}\ \{c'_j\}$

This completes the construction.
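The construction above is mechanical, so a short Python sketch may help make it concrete. This is an illustration under stated assumptions, not the paper's own code: clauses are encoded as triples of nonzero integers (literal `+i` for $v_i$, `-i` for $\neg v_i$, a DIMACS-like convention chosen here), genes are encoded as strings such as `"x1"` or `"a'2"`, and the names `reduce_3sat_unordered` and `max_copies` are hypothetical.

```python
from collections import Counter


def reduce_3sat_unordered(n, clauses):
    """Build (G1, G2) as lists of frozensets from a 3SAT instance.

    Assumed encoding: `clauses` is a list of 3-tuples of nonzero ints,
    where +i stands for v_i and -i for its negation.
    """
    pos = {i: [] for i in range(1, n + 1)}  # literal genes p_{i,*}
    neg = {i: [] for i in range(1, n + 1)}  # literal genes q_{i,*}
    for j, clause in enumerate(clauses, start=1):
        for name, lit in zip("rst", clause):
            (pos if lit > 0 else neg)[abs(lit)].append(f"{name}{j}")

    g1, g2 = [], []
    # Variable gadgets: one subset in G1, two subsets in G2.
    for i in range(1, n + 1):
        x = f"x{i}"
        g1.append(frozenset(pos[i] + [x] + neg[i]))
        g2.append(frozenset(pos[i] + [x]))
        g2.append(frozenset([x] + neg[i]))
    # Clause gadgets: six subsets in G1, seven subsets in G2.
    for j in range(1, len(clauses) + 1):
        a, b, c = f"a{j}", f"b{j}", f"c{j}"
        ap, bp, cp = f"a'{j}", f"b'{j}", f"c'{j}"
        r, s, t = f"r{j}", f"s{j}", f"t{j}"
        g1 += [frozenset(p) for p in
               ((a, b), (b, c), (c, a), (ap, r), (bp, s), (cp, t))]
        g2 += [frozenset(p) for p in
               ((a, b, c), (a, ap, r), (b, bp, s), (c, cp, t), (ap,), (bp,), (cp,))]
    return g1, g2


def max_copies(genome):
    """Largest number of chromosomes of `genome` that share a single gene."""
    return max(Counter(g for chrom in genome for g in chrom).values())


# A 3SAT instance with 4 variables and clauses (v1, not v2, not v3), (not v1, v3, v4).
G1, G2 = reduce_3sat_unordered(4, [(1, -2, -3), (-1, 3, 4)])
print(sum(len(c) for c in G1), sum(len(c) for c in G2))
print(max_copies(G1), max_copies(G2))
```

The two printed sums are the genome sizes including duplicates, which can be compared against the closed-form counts for $n = 4$ and $m = 2$.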
It is easy to check that each gene appears at most two times in each genome, $G_1$ includes exactly $n + 15m$ genes including duplicates, and $G_2$ includes exactly $2n + 18m$ genes including duplicates. We give an example:

Example 2. For a 3SAT instance of 4 variables and 2 clauses $e_1 = \{ r_1 = v_1,\ s_1 = \neg v_2,\ t_1 = \neg v_3 \}$ and $e_2 = \{ r_2 = \neg v_1,\ s_2 = v_3,\ t_2 = v_4 \}$, the reduction constructs the following two genomes:

    $G_1:\ \{r_1, x_1, r_2\}\ \{x_2, s_1\}\ \{s_2, x_3, t_1\}\ \{t_2, x_4\}$
    $\quad\ \ \{a_1, b_1\}\ \{b_1, c_1\}\ \{c_1, a_1\}\ \{a'_1, r_1\}\ \{b'_1, s_1\}\ \{c'_1, t_1\}$
    $\quad\ \ \{a_2, b_2\}\ \{b_2, c_2\}\ \{c_2, a_2\}\ \{a'_2, r_2\}\ \{b'_2, s_2\}\ \{c'_2, t_2\}$
    $G_2:\ \{r_1, x_1\}\ \{x_1, r_2\}\ \{x_2\}\ \{x_2, s_1\}\ \{s_2, x_3\}\ \{x_3, t_1\}\ \{t_2, x_4\}\ \{x_4\}$
    $\quad\ \ \{a_1, b_1, c_1\}\ \{a_1, a'_1, r_1\}\ \{b_1, b'_1, s_1\}\ \{c_1, c'_1, t_1\}\ \{a'_1\}\ \{b'_1\}\ \{c'_1\}$
    $\quad\ \ \{a_2, b_2, c_2\}\ \{a_2, a'_2, r_2\}\ \{b_2, b'_2, s_2\}\ \{c_2, c'_2, t_2\}\ \{a'_2\}\ \{b'_2\}\ \{c'_2\}$

The assignment $v_1 = \text{true}$, $v_2 = \text{false}$, $v_3 = \text{false}$, $v_4 = \text{true}$ satisfies the 3SAT instance and corresponds to the following common reduced genome:

    $G':\ \{r_1, x_1\}\ \{x_2, s_1\}\ \{x_3, t_1\}\ \{t_2, x_4\}$
    $\quad\ \ \{a_1\}\ \{b_1, c_1\}\ \{a'_1\}\ \{b'_1\}\ \{c'_1\}$
    $\quad\ \ \{c_2\}\ \{a_2, b_2\}\ \{a'_2, r_2\}\ \{b'_2, s_2\}\ \{c'_2\}$

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 2. The 3SAT instance $(V, E)$ is satisfiable if and only if the two genomes $G_1$ and $G_2$ have a common reduced genome $G'$ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance $(V, E)$ is satisfiable. We will compose a common reduced genome $G'$ of the two genomes $G_1$ and $G_2$ as follows. Consider a truth assignment that satisfies the 3SAT instance. For each variable $v_i$, take the subset $\{p_{i,1}, \ldots, p_{i,k_i}, x_i\}$ if $v_i$ is set to true, and take the subset $\{x_i, q_{i,1}, \ldots, q_{i,l_i}\}$ if $v_i$ is set to false. For each clause $e_j$, at least one of its three literals is true; correspondingly, at least one of the three literal genes $r_j, s_j, t_j$ has been taken from some variable gadget $\langle v_i \rangle$. Now take some subsets of clause/literal genes following one of three cases:

1. If $r_j$ has been taken, then take the subsets $\{a_j\}$, $\{b_j, c_j\}$, $\{a'_j\}$, $\{b'_j, \underline{s_j}\}$, $\{c'_j, \underline{t_j}\}$.
2. If $s_j$ has been taken, then take the subsets $\{b_j\}$, $\{c_j, a_j\}$, $\{a'_j, \underline{r_j}\}$, $\{b'_j\}$, $\{c'_j, \underline{t_j}\}$.
3. If $t_j$ has been taken, then take the subsets $\{c_j\}$, $\{a_j, b_j\}$, $\{a'_j, \underline{r_j}\}$, $\{b'_j, \underline{s_j}\}$, $\{c'_j\}$.

Here an underlined literal gene is omitted from the subset taken from the clause gadget $\langle e_j \rangle$ if its other copy has already been taken from some variable gadget $\langle v_i \rangle$. The reduced genome $G'$ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes $G_1$ and $G_2$ have a common reduced genome $G'$ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance $(V, E)$ as follows. The crucial property of the clause gadget $\langle e_j \rangle$ is that it cannot have a common reduced genome including exactly one copy of each clause gene $a_j, b_j, c_j, a'_j, b'_j, c'_j$ unless at least one of the three literal genes $r_j, s_j, t_j$ is omitted. A literal gene omitted from the clause gadget $\langle e_j \rangle$ has to appear in a subset in $G'$ that contains some variable gene $x_i$. By the construction of the variable gadgets, this subset contains, besides $x_i$, either literal genes for positive literals, or literal genes for negative literals.
Now set each variable $v_i$ to true if the subset in $G'$ that contains $x_i$ also contains at least one literal gene for a positive literal, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 2.

4 Proof of Theorem 3

Let $A$ and $B$ be two sequences of lengths $n$ and $m$, respectively, over an alphabet $\Sigma = \Sigma_1 \cup \Sigma_2$, where $\Sigma_1$ is a set of mandatory symbols and $\Sigma_2$ is a set of optional symbols. In the special case that each mandatory symbol in $\Sigma_1$ appears exactly once in one sequence and at least once in the other sequence, we have the obvious but important property that any common subsequence of the two sequences can contain each mandatory symbol at most once. This property leads to a very simple algorithm that decides whether a feasible solution to Exemplar Longest Common Subsequence exists in this special case:

Algorithm 1.
1. Obtain two sequences $A'$ and $B'$ from $A$ and $B$ by deleting all optional symbols in $\Sigma_2$.
2. Compute a longest common subsequence $C^*$ of $A'$ and $B'$.
3. If $C^*$ contains all mandatory symbols in $\Sigma_1$, return yes. Otherwise, return no.

The time complexity of Algorithm 1 is $O(nm)$ by using a standard dynamic programming algorithm for longest common subsequence [10]. The correctness of Algorithm 1 is justified by the following lemma:

Lemma 3. $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if the longest common subsequence $C^*$ of $A'$ and $B'$ contains all mandatory symbols in $\Sigma_1$.

Proof. The reduction from $A$ and $B$ to $A'$ and $B'$ preserves the mandatory symbols. Thus $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if $A'$ and $B'$ have a common subsequence containing all mandatory symbols in $\Sigma_1$.
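Before the proof continues, here is a minimal Python sketch of Algorithm 1, added for illustration. It uses the textbook $O(nm)$ dynamic program for longest common subsequence; the function names `lcs` and `has_feasible_solution` are hypothetical. The answer is only guaranteed to be correct in the special case of this section, where each mandatory symbol appears exactly once in one of the two sequences.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b (standard O(nm) DP)."""
    n, m = len(a), len(b)
    # dp[i][j] = length of an LCS of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Trace back from the start to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def has_feasible_solution(a, b, mandatory):
    """Algorithm 1: valid only when each mandatory symbol occurs exactly once
    in one of the two sequences and at least once in the other."""
    mandatory = set(mandatory)
    a_prime = [s for s in a if s in mandatory]  # step 1: drop optional symbols
    b_prime = [s for s in b if s in mandatory]
    c_star = lcs(a_prime, b_prime)              # step 2
    return set(c_star) == mandatory             # step 3


# The sequences from the introduction, with mandatory symbols {1, 2, 3}.
print(has_feasible_solution("12423545", "1142443555", "123"))
```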
It remains to prove the equivalent claim that $A'$ and $B'$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if $C^*$ contains all mandatory symbols in $\Sigma_1$. The "if" direction of the claim is trivial because $C^*$ is a common subsequence of $A'$ and $B'$. To prove the "only if" direction, recall that in any common subsequence of $A'$ and $B'$, each mandatory symbol can appear at most once. Thus the length of any common subsequence of $A'$ and $B'$ is at most the size of $\Sigma_1$. Moreover, if the length of some common subsequence of $A'$ and $B'$ is equal to the size of $\Sigma_1$, then this common subsequence must contain all mandatory symbols in $\Sigma_1$, and vice versa. Now suppose that $A'$ and $B'$ have a common subsequence $C'$ containing all mandatory symbols in $\Sigma_1$. Then the length of $C'$ must be equal to the size of $\Sigma_1$. Since the length of $C^*$ is at least the length of $C'$, the length of $C^*$ must also be equal to the size of $\Sigma_1$. Then $C^*$ must contain all mandatory symbols in $\Sigma_1$ too. This completes the proof.

Since deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences $A$ and $B$ is the same as the problem Zero Exemplar Distance for two monochromosomal genomes $A'$ and $B'$ obtained from $A$ and $B$ by deleting all optional symbols, we also have an $O(nm)$ algorithm for Zero Exemplar Distance for monochromosomal genomes in the special case that each gene appears exactly once in one genome and at least once in the other genome.

We next present an algorithm for the maximization problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence:

Algorithm 2.
1. Assign each mandatory symbol in $\Sigma_1$ a weight of $w = \min\{n, m\} + 1$, and assign each optional symbol in $\Sigma_2$ a weight of 1. Compute a common subsequence $C^*$ of $A$ and $B$ of the maximum total weight.
2. If $C^*$ contains all mandatory symbols in $\Sigma_1$, return $C^*$. Otherwise, report that no feasible solution exists.

If $A$ and $B$ have no common subsequence containing all mandatory symbols in $\Sigma_1$, then clearly the maximum-weight common subsequence $C^*$ of $A$ and $B$ cannot contain all mandatory symbols in $\Sigma_1$, and hence the algorithm correctly reports that no feasible solution exists. Otherwise, the correctness of Algorithm 2 is justified by the following lemma:

Lemma 4. If $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$, then the maximum-weight common subsequence $C^*$ of $A$ and $B$ is a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$.

Proof. Suppose that $A$ and $B$ have a common subsequence $C$ containing all mandatory symbols in $\Sigma_1$. We first show that the maximum-weight common subsequence $C^*$ of $A$ and $B$ contains all mandatory symbols in $\Sigma_1$. Note that the number of optional symbols in $C^*$ is at most the length of $C^*$, which is at most $\min\{n, m\}$. Also recall that any common subsequence of $A$ and $B$ can contain each mandatory symbol at most once. If $C^*$ does not contain all mandatory symbols in $\Sigma_1$, then by our choice of $w = \min\{n, m\} + 1$, the total weight of $C^*$ would be at most

    $(|\Sigma_1| - 1) \cdot w + \min\{n, m\} \cdot 1 < (|\Sigma_1| - 1) \cdot w + w \cdot 1 = |\Sigma_1| \cdot w.$

On the other hand, since $C$ contains all mandatory symbols in $\Sigma_1$, the weight of $C$ is at least $|\Sigma_1| \cdot w$. This contradicts the assumption that $C^*$ is a maximum-weight common subsequence of $A$ and $B$.

Now, since $C^*$ contains all mandatory symbols and can contain each mandatory symbol at most once, $C^*$ must contain each mandatory symbol exactly once.
Then, to have the maximum total weight, $C^*$ must be a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$.

Again, the overall time complexity of Algorithm 2 is clearly $O(nm)$. This completes the proof of Theorem 3.

5 Proof of Theorem 4

We present two algorithms for Zero Exemplar Distance for multichromosomal genomes without gene order. Let $k_1$ and $k_2$, respectively, be the numbers of chromosomes in $G_1$ and $G_2$. Let $A_1, \ldots, A_{k_1}$ be the $k_1$ chromosomes in $G_1$. Let $B_1, \ldots, B_{k_2}$ be the $k_2$ chromosomes in $G_2$. Let $k = \max\{k_1, k_2\}$. Let $n$ be the total number of genes in $G_1$ and $G_2$, i.e., $n = \sum_{i=1}^{k_1} |A_i| + \sum_{j=1}^{k_2} |B_j|$.

We first present a polynomial-time algorithm for Zero Exemplar Distance for multichromosomal genomes without gene order in the special case that each gene appears exactly once in one genome and at least once in the other genome. Our algorithm is based on maximum-weight matching in bipartite graphs:

Algorithm 3.
1. Construct a complete bipartite graph $G = (V_1 \cup V_2, V_1 \times V_2)$ with vertices $V_1 = \{A_1, \ldots, A_{k_1}\}$ and $V_2 = \{B_1, \ldots, B_{k_2}\}$. Associate with each edge between $A_i \in V_1$ and $B_j \in V_2$ a reduced chromosome $C_{ij} = A_i \cap B_j$ and a weight equal to its size.
2. Compute a maximum-weight matching $M$ in the graph $G$.
3. If the set of reduced chromosomes corresponding to the edges in $M$ includes all the genes, return yes. Otherwise, return no.

To see the correctness of Algorithm 3, note that each reduced chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome, and corresponds to an edge of a matching in the complete bipartite graph.
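Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation: chromosomes are modeled as Python sets, the maximum-weight matching is computed with a textbook $O(k^3)$ Hungarian (Kuhn-Munkres) routine written out here, and the smaller genome is padded with empty chromosomes so that the weight matrix is square (an implementation convenience assumed for this sketch; empty chromosomes only contribute zero-weight edges). The names `max_weight_assignment` and `zero_exemplar_distance_special` are hypothetical.

```python
def max_weight_assignment(weight):
    """Hungarian algorithm on a square matrix; returns match[i] = column of row i."""
    k = len(weight)
    inf = float("inf")
    u, v = [0] * (k + 1), [0] * (k + 1)   # dual potentials (1-based)
    p = [0] * (k + 1)                     # p[j] = row matched to column j, 0 if none
    way = [0] * (k + 1)
    for i in range(1, k + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (k + 1), [False] * (k + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, k + 1):
                if not used[j]:
                    # Negated weights turn maximization into minimization.
                    cur = -weight[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(k + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * k
    for j in range(1, k + 1):
        match[p[j] - 1] = j - 1
    return match


def zero_exemplar_distance_special(g1, g2):
    """Algorithm 3; intended for the special case where each gene appears
    exactly once in one genome and at least once in the other."""
    k = max(len(g1), len(g2))
    a = [set(c) for c in g1] + [set() for _ in range(k - len(g1))]
    b = [set(c) for c in g2] + [set() for _ in range(k - len(g2))]
    weight = [[len(ai & bj) for bj in b] for ai in a]          # step 1
    match = max_weight_assignment(weight)                      # step 2
    covered = set().union(*(a[i] & b[match[i]] for i in range(k)))
    return covered == set().union(*a, *b)                      # step 3


print(zero_exemplar_distance_special([{1, 2}, {3, 4, 5}], [{1, 2, 3}, {3, 4, 5}, {2, 4}]))
```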
In the special case that each gene appears exactly once in one genome and at least once in the other genome, no gene can appear more than once in the reduced chromosomes corresponding to the edges of a matching. Thus the maximum possible weight of a matching is equal to the number of distinct genes, and a common reduced genome that includes all the genes corresponds to a matching of the maximum weight.

We now analyze the time complexity of Algorithm 3. Steps 1 and 3 can be easily implemented in $O(n^2)$ time. Step 2 can be implemented in $O(k^3)$ time using a standard algorithm for weighted bipartite matching; see e.g. [13]. Thus the overall time complexity is $O(n^2 + k^3)$.

We next present a fixed-parameter tractable algorithm for this problem without any assumption on the distribution of duplicate genes. Refer to [7] for basic concepts in parameterized complexity theory. The parameter of our algorithm is $k = \max\{k_1, k_2\}$:

Algorithm 4.
1. Add $k - k_1$ empty chromosomes $A_{k_1+1}, \ldots, A_k$ to $G_1$, or add $k - k_2$ empty chromosomes $B_{k_2+1}, \ldots, B_k$ to $G_2$, such that $G_1$ and $G_2$ have the same number $k$ of chromosomes.
2. For each permutation $\pi$ of $\langle 1, \ldots, k \rangle$, compute $C_\pi = \bigcup_{i=1}^{k} (A_i \cap B_{\pi(i)})$.
3. If for some permutation $\pi$ the set $C_\pi$ includes all the genes, return yes. Otherwise return no.

To see the correctness of Algorithm 4, note again that each chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome. All other chromosomes of the two input genomes that do not contribute to the common reduced genome are deleted. To handle the matching and the deletion of the chromosomes in a uniform way, we can think of each chromosome deleted from one genome as matched to a chromosome deleted from the other genome or to an empty chromosome.
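A minimal Python sketch of Algorithm 4 is given here for illustration. It assumes chromosomes are modeled as sets and takes "all the genes" to mean the union of the genes of both genomes; the function name `zero_exemplar_distance_fpt` is hypothetical. The running time grows as $k!$, so it is only practical for genomes with few chromosomes.

```python
from itertools import permutations


def zero_exemplar_distance_fpt(g1, g2):
    """Algorithm 4: try every pairing of chromosomes after padding to equal size."""
    k = max(len(g1), len(g2))
    # Step 1: pad the smaller genome with empty chromosomes.
    a = [set(c) for c in g1] + [set() for _ in range(k - len(g1))]
    b = [set(c) for c in g2] + [set() for _ in range(k - len(g2))]
    all_genes = set().union(*a, *b)
    # Steps 2 and 3: look for a permutation whose intersections cover every gene.
    for perm in permutations(range(k)):
        covered = set()
        for i in range(k):
            covered |= a[i] & b[perm[i]]
        if covered == all_genes:
            return True
    return False


# The multichromosomal example from the introduction.
G1 = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
G2 = [{1, 2}, {2, 3, 4}, {3, 4, 5}, {1, 5}]
print(zero_exemplar_distance_fpt(G1, G2))
```

When a covering permutation is found, a common reduced genome is obtained from the intersections $A_i \cap B_{\pi(i)}$ by deleting surplus copies of any gene that occurs in more than one of them.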
Thus by padding the two genomes to the same number of chromosomes, we only need to consider perfect matchings as permutations. The time complexity of Algorithm 4 is $O(k!\, n^2)$, with $O(n^2)$ time for each of the $k!$ permutations. This completes the proof of Theorem 4.

We remark that the problem Zero Exemplar Distance for multichromosomal genomes without gene order is unlikely to have a fixed-parameter tractable algorithm if the parameter is the maximum number of genes in any single chromosome. This is because 3SAT remains NP-hard even if for each variable there are at most five clauses that contain its literals [9]. As a result, the number of genes in each chromosome need not be more than some constant in our reduction from 3SAT.

Acknowledgment

The author would like to thank Binhai Zhu for a brief discussion on Question 1 during a visit to University of Texas – Pan American in May 2008, and thank Guillaume Fertin for communicating the recent result [3]. The alternative proof of Theorem 1 was obtained independently by the author in February 2010 without knowing the recent progress [3]. The author also thanks Pedro J. Tejada for bringing the open question of Bonizzoni et al. [4], Question 2, to his attention.

References

[1] S. S. Adi, M. D. V. Braga, C. G. Fernandes, C. E. Ferreira, F. V. Martinez, M.-F. Sagot, M. A. Stefanes, C. Tjandraatmadja, and Y. Wakabayashi. Repetition-free longest common subsequence. Discrete Applied Mathematics, 158:1315–1324, 2010.
[2] S. Angibaud, G. Fertin, I. Rusu, A. Thévenin, and S. Vialette. On the approximability of comparing genomes with duplicates. Journal of Graph Algorithms and Applications, 13:19–53, 2009.
[3] G. Blin, G. Fertin, F. Sikora, and S. Vialette. The Exemplar Breakpoint Distance for non-trivial genomes cannot be approximated. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM'09), pages 357–368, 2009.
[4] P. Bonizzoni, G. Della Vedova, R. Dondi, G. Fertin, R. Rizzi, and S. Vialette. Exemplar longest common subsequence. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4:535–543, 2007.
[5] Z. Chen, R. H. Fowler, B. Fu, and B. Zhu. On the inapproximability of the exemplar conserved interval distance problem of genomes. Journal of Combinatorial Optimization, 15:201–221, 2008. (A preliminary version appeared in Proceedings of the 12th Annual International Conference on Computing and Combinatorics (COCOON'06), pages 245–254, 2006.)
[6] Z. Chen, B. Fu, and B. Zhu. The approximability of the exemplar breakpoint distance problem. In Proceedings of the 2nd International Conference on Algorithmic Aspects in Information and Management (AAIM'06), pages 291–302, 2006.
[7] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer-Verlag, 1999.
[8] V. Ferretti, J. H. Nadeau, and D. Sankoff. Original synteny. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM'96), pages 159–167, 1996.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[10] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[11] M. Jiang. The zero exemplar distance problem. In Proceedings of the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'10), pages 74–82, 2010.
[12] D. Sankoff. Genome rearrangement with gene families. Bioinformatics, 15:909–917, 1999.
[13] R. E. Tarjan. Data Structures and Network Algorithms. SIAM, 1983.
