The zero exemplar distance problem

Given two genomes with duplicate genes, \textsc{Zero Exemplar Distance} is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Ferti…

Authors: Minghui Jiang

The zero exemplar distance problem*

Minghui Jiang
Department of Computer Science, Utah State University, Logan, UT 84322, USA
mjiang@cc.usu.edu

November 5, 2010

Abstract

Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this paper, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order, and prove the analogous result that it is also NP-hard even if each gene appears at most two times in each genome. For the positive direction, we show that both variants of Zero Exemplar Distance admit polynomial-time algorithms if each gene appears exactly once in one genome and at least once in the other genome. In addition, we present a polynomial-time algorithm for the related problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence. This answers an open question of Bonizzoni et al. We also show that Zero Exemplar Distance for multichromosomal genomes without gene order is fixed-parameter tractable if the parameter is the maximum number of chromosomes in each genome.
*Supported in part by NSF grant DBI-0743670. A preliminary version of this paper (including Theorem 1, Theorem 2, and a weaker version of Theorem 4) appeared in Proceedings of the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG 2010) [11].

1 Introduction

Given two genomes with duplicate genes, Genome Rearrangement with Gene Families [12] is the problem of deleting all but one copy of each gene in each genome, so as to minimize some rearrangement distance between the two reduced genomes. The minimum rearrangement distance thus attained is called the exemplar distance between the two genomes. For example, each of the following two monochromosomal genomes

$G_1$: $-4 \; +1 \; +2 \; +3 \; -5 \; +1 \; +2 \; +3 \; -6$
$G_2$: $-1 \; -4 \; +1 \; +2 \; -5 \; +3 \; -2 \; -6 \; +3$

has at most two copies of each gene, and each of the following two reduced genomes

$G'_1$: $-4 \; +1 \; +2 \; -5 \; +3 \; -6$
$G'_2$: $-4 \; +1 \; +2 \; -5 \; +3 \; -6$

has exactly one copy of each gene.

Recall that in the study of genome rearrangement, a gene is usually represented by a signed integer: the absolute value of the integer (the unsigned integer) denotes the gene family to which the gene belongs; the sign of the integer denotes the orientation of the gene in its chromosome. Then a chromosome is a sequence of signed integers, and a genome is a collection of chromosomes.

Genome Rearrangement with Gene Families is not a single problem but a whole class of related problems, because the choice of rearrangement distance is not unique. This choice becomes irrelevant, however, when we ask the fundamental question: Is the distance zero? In the example above, the two reduced genomes $G'_1$ and $G'_2$ are identical, thus the exemplar distance between the two original genomes $G_1$ and $G_2$ is zero for any reasonable choice of rearrangement distance.
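The introductory example can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: it represents each genome as a list of signed integers and tests whether a candidate reduced genome is a subsequence that keeps exactly one copy of each gene family. The helper names `is_subsequence` and `is_exemplar` are illustrative assumptions.

```python
from collections import Counter

def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(gene in it for gene in short)

def is_exemplar(reduced, genome):
    """True if `reduced` is a subsequence of `genome` that keeps exactly one
    copy of every gene family (absolute value) occurring in `genome`."""
    families = Counter(abs(g) for g in reduced)
    return (is_subsequence(reduced, genome)
            and set(families) == {abs(g) for g in genome}
            and all(count == 1 for count in families.values()))

# Genomes from the example: signs are orientations, absolute values are families.
G1 = [-4, 1, 2, 3, -5, 1, 2, 3, -6]
G2 = [-1, -4, 1, 2, -5, 3, -2, -6, 3]
G_reduced = [-4, 1, 2, -5, 3, -6]

# The same reduced genome is an exemplar of both inputs, so the distance is zero.
print(is_exemplar(G_reduced, G1), is_exemplar(G_reduced, G2))
```

This only verifies a given reduced genome; finding one is the hard part addressed in the rest of the paper.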
In this paper, we study the most basic version of the problem Genome Rearrangement with Gene Families: Given two sequences of signed integers, Zero Exemplar Distance (for monochromosomal genomes) is the problem of deciding whether the two sequences have a common subsequence including each unsigned integer exactly once in either positive or negative form.

Due to its generic nature, the problem Zero Exemplar Distance has been extensively studied by several groups of researchers [5, 4, 2] focusing on different rearrangement distances, and, not surprisingly, has acquired several different names. Except for trivial distinctions, Zero Exemplar Distance is essentially the same problem as Zero Exemplar Conserved Interval Distance [5], Exemplar Longest Common Subsequence (deciding whether a feasible solution exists) [4], and Zero Exemplar Breakpoint Distance [2].

It is easy to check that if only one of the two genomes has duplicate genes, then Zero Exemplar Distance can be solved in linear time: we simply need to decide whether the genome without duplicates is a subsequence of the genome with duplicates. In sharp contrast, if both genomes contain duplicate genes, then even if each gene appears at most three times in each genome, the problem Zero Exemplar Distance is already NP-hard, as shown independently in three papers [5, 4, 2]. The quest for the exact boundary between polynomial solvability and NP-hardness led to the following open question first raised by Chen et al. in 2006:

Question 1 (Chen, Fowler, Fu, and Zhu, 2006 [5]). Is the problem Zero Exemplar Distance for monochromosomal genomes still NP-hard if each gene appears at most two times in each genome?

This question was finally settled in the affirmative by Blin et al. in 2009:

Theorem 1 (Blin, Fertin, Sikora, and Vialette, 2009 [3]). Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome.

In Section 2, we give a very simple alternative proof of this theorem.

Both the previous proof of Theorem 1 [3] and our alternative proof depend crucially on the order of the genes in the chromosomes. One may naturally wonder whether the complexity of Zero Exemplar Distance would change if gene order is not known. Note that genome rearrangement distances such as the syntenic distance [8] can be defined in the absence of gene order. Now model each chromosome as a set of unsigned integers instead of a sequence of signed integers. Then Zero Exemplar Distance for multichromosomal genomes without gene order is the following problem: Given two collections $G_1$ and $G_2$ of subsets of the same ground set $S$ of unsigned integers, decide whether both $G_1$ and $G_2$ can be reduced, by deleting elements from subsets and deleting subsets from collections, to the same collection $G'$ of subsets of $S$ such that each unsigned integer in $S$ is contained in exactly one subset in $G'$, i.e., $G'$ is a partition of $S$. For example,

$S$: $\{1, 2, 3, 4, 5\}$
$G_1$: $\{1, 2, 3\} \; \{2, 3, 4\} \; \{4, 5\}$
$G_2$: $\{1, 2\} \; \{2, 3, 4\} \; \{3, 4, 5\} \; \{1, 5\}$
$G'$: $\{1, 2\} \; \{3\} \; \{4, 5\}$

In Section 3, we prove the following theorem analogous to Theorem 1:

Theorem 2. Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard even if each gene appears at most two times in each genome.

As decision problems, both variants of Zero Exemplar Distance, for monochromosomal genomes and for multichromosomal genomes without gene order, are in NP. Thus, following the NP-hardness results in Theorem 1 and Theorem 2, these two decision problems are both NP-complete.
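Membership in NP means that a proposed common reduced genome can be verified quickly. The sketch below, which is an illustration and not taken from the paper, checks a certificate for the set example above. The certificate format is an assumption made here: each entry lists a subset of $G'$ together with the (hypothetical) indices of the chromosomes in $G_1$ and $G_2$ from which it is obtained.

```python
def verify_certificate(ground_set, genome1, genome2, certificate):
    """Check a claimed common reduced genome in time polynomial in the input.

    certificate: list of (subset, i, j), meaning that `subset` is kept from
    chromosome genome1[i] and from chromosome genome2[j].
    """
    subsets = [frozenset(c) for c, _, _ in certificate]
    rows = [i for _, i, _ in certificate]
    cols = [j for _, _, j in certificate]
    # Each chromosome of either genome may be used at most once.
    distinct_sources = len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
    # Each reduced chromosome must be a subset of both of its sources.
    contained = all(c <= genome1[i] and c <= genome2[j]
                    for c, i, j in zip(subsets, rows, cols))
    # The reduced chromosomes must be non-empty and partition the ground set.
    covered = sorted(g for c in subsets for g in c)
    is_partition = all(subsets) and covered == sorted(ground_set)
    return distinct_sources and contained and is_partition

S = {1, 2, 3, 4, 5}
G1 = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
G2 = [{1, 2}, {2, 3, 4}, {3, 4, 5}, {1, 5}]
certificate = [({1, 2}, 0, 0), ({3}, 1, 1), ({4, 5}, 2, 2)]
print(verify_certificate(S, G1, G2, certificate))
```

Including the source indices in the certificate keeps the check trivially polynomial; without them, the verifier would need a bipartite matching step to assign reduced chromosomes to sources.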
Moreover, the NP-hardness results in Theorem 1 and Theorem 2 imply that unless NP = P, the corresponding minimization problems of computing the exemplar distance between two genomes do not admit any approximation. We refer to [5, 6, 4, 2, 1] for related results.

The problem Zero Exemplar Distance for monochromosomal genomes, as mentioned earlier, has been studied under several different names. Given two sequences $A$ and $B$ over an alphabet $\Sigma = \Sigma_1 \cup \Sigma_2$, where $\Sigma_1$ is a set of mandatory symbols and $\Sigma_2$ is a set of optional symbols, Exemplar Longest Common Subsequence [4] is the problem of finding a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$. For example, if $\Sigma_1 = \{1, 2, 3\}$ and $\Sigma_2 = \{4, 5\}$, then $C = 124355$ is an exemplar longest common subsequence of the two sequences $A = 12423545$ and $B = 1142443555$.

Due to the strict requirement on mandatory symbols, Exemplar Longest Common Subsequence does not always have a feasible solution. It is not difficult to see that simply deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences $A$ and $B$ is the same as the problem Zero Exemplar Distance for two monochromosomal genomes $A'$ and $B'$ obtained from $A$ and $B$ by deleting all optional symbols.

Recall that the problem Zero Exemplar Distance for monochromosomal genomes becomes trivial when only one of the two genomes has duplicate genes. For the equivalent problem of deciding whether a feasible solution to Exemplar Longest Common Subsequence exists, Bonizzoni et al. [4] showed another tractable special case: If each mandatory symbol appears a total of at most three times in $A$ and $B$, then there is a polynomial-time algorithm, based on 2SAT, that decides whether $A$ and $B$ have a common subsequence containing all mandatory symbols.
This algorithm does not solve the maximization problem, however, and the following question was left open:

Question 2 (Bonizzoni et al. [4]). Is there a polynomial-time algorithm for Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears a total of at most three times in the two input sequences?

Without loss of generality, we assume that each input sequence contains each symbol in the alphabet at least once. If each mandatory symbol appears a total of at most three times in the two input sequences, then it must appear exactly once in one sequence, and at least once in the other sequence, as in the example shown earlier. In Section 4, we prove the following theorem that complements Theorem 1 and answers the open question of Bonizzoni et al. in the affirmative:

Theorem 3. Zero Exemplar Distance for monochromosomal genomes admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome. Exemplar Longest Common Subsequence admits a polynomial-time algorithm in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence.

Finally, in Section 5, we prove the following theorem that complements Theorem 2:

Theorem 4. Zero Exemplar Distance for multichromosomal genomes without gene order admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome, and is fixed-parameter tractable if the parameter is the maximum number of chromosomes in each genome.

2 Alternative Proof of Theorem 1

We prove that Zero Exemplar Distance for monochromosomal genomes is NP-hard by a reduction from the well-known NP-complete problem 3SAT [9].
Let $(V, E)$ be a 3SAT instance, where $V = \{v_1, \dots, v_n\}$ is a set of $n$ boolean variables, $E = \{e_1, \dots, e_m\}$ is a conjunctive boolean formula of $m$ clauses, and each clause in $E$ is a disjunction of exactly three literals of the variables in $V$. We will construct two sequences (genomes) $G_1$ and $G_2$ over $2n + 6m + 1$ distinct unsigned integers (genes):

• Two variable genes $x_i, y_i$ for each variable $v_i$, $1 \le i \le n$;
• Three clause genes $a_j, b_j, c_j$ for each clause $e_j$, $1 \le j \le m$;
• Three literal genes $r_j, s_j, t_j$ for the three literals of each clause $e_j$, $1 \le j \le m$;
• One separator gene $z$.

In our construction, all genes appear in the positive orientation in the two genomes, so we will omit the signs in our description. The two genomes $G_1$ and $G_2$ are represented schematically as follows:

$G_1$: $\langle v_1 \rangle \dots \langle v_n \rangle \; z \; \langle e_1 \rangle \dots \langle e_m \rangle$
$G_2$: $\langle v_1 \rangle \dots \langle v_n \rangle \; z \; \langle e_1 \rangle \dots \langle e_m \rangle$

For each variable $v_i$, the variable gadget $\langle v_i \rangle$ consists of one copy of $x_i$ and two copies of $y_i$ in $G_1$, two copies of $x_i$ and one copy of $y_i$ in $G_2$, and, for each literal of the variable in the clauses, one copy of the corresponding literal gene ($r_j$, $s_j$, or $t_j$ for some clause $e_j$) in each genome. Let $p_{i,1}, \dots, p_{i,k_i}$ be the literal genes for the positive literals of $v_i$, and let $q_{i,1}, \dots, q_{i,l_i}$ be the literal genes for the negative literals of $v_i$. The genes $x_i, y_i, p_{i,1}, \dots, p_{i,k_i}, q_{i,1}, \dots, q_{i,l_i}$ in the variable gadget $\langle v_i \rangle$ are arranged in the following pattern in the two genomes:

$G_1 \langle v_i \rangle$: $y_i \; p_{i,1} \dots p_{i,k_i} \; x_i \; q_{i,1} \dots q_{i,l_i} \; y_i$
$G_2 \langle v_i \rangle$: $p_{i,1} \dots p_{i,k_i} \; x_i \; y_i \; x_i \; q_{i,1} \dots q_{i,l_i}$

For each clause $e_j$, the clause gadget $\langle e_j \rangle$ consists of two copies of each clause gene $a_j, b_j, c_j$ and one copy of each literal gene $r_j, s_j, t_j$.
These genes in $\langle e_j \rangle$ are arranged in the following pattern in the two genomes:

$G_1 \langle e_j \rangle$: $r_j \; a_j \; b_j \; c_j \; s_j \; a_j \; b_j \; c_j \; t_j$
$G_2 \langle e_j \rangle$: $a_j \; r_j \; b_j \; a_j \; s_j \; c_j \; b_j \; t_j \; c_j$

This completes the construction. It is easy to check that each gene appears at most two times in each genome, and that each genome includes exactly $3n + 12m + 1$ genes including duplicates. We give an example:

Example 1. For a 3SAT instance of 4 variables and 2 clauses $e_1 = \{r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3\}$ and $e_2 = \{r_2 = \neg v_1, s_2 = v_3, t_2 = v_4\}$, the reduction constructs the following two genomes:

$G_1$: $y_1\, r_1\, x_1\, r_2\, y_1 \;\; y_2\, x_2\, s_1\, y_2 \;\; y_3\, s_2\, x_3\, t_1\, y_3 \;\; y_4\, t_2\, x_4\, y_4 \;\; z \;\; r_1\, a_1\, b_1\, c_1\, s_1\, a_1\, b_1\, c_1\, t_1 \;\; r_2\, a_2\, b_2\, c_2\, s_2\, a_2\, b_2\, c_2\, t_2$
$G_2$: $r_1\, x_1\, y_1\, x_1\, r_2 \;\; x_2\, y_2\, x_2\, s_1 \;\; s_2\, x_3\, y_3\, x_3\, t_1 \;\; t_2\, x_4\, y_4\, x_4 \;\; z \;\; a_1\, r_1\, b_1\, a_1\, s_1\, c_1\, b_1\, t_1\, c_1 \;\; a_2\, r_2\, b_2\, a_2\, s_2\, c_2\, b_2\, t_2\, c_2$

The assignment $v_1 = \text{true}$, $v_2 = \text{false}$, $v_3 = \text{false}$, $v_4 = \text{true}$ satisfies the 3SAT instance and corresponds to the following common reduced genome:

$G'$: $r_1\, x_1\, y_1 \;\; y_2\, x_2\, s_1 \;\; y_3\, x_3\, t_1 \;\; t_2\, x_4\, y_4 \;\; z \;\; a_1\, b_1\, c_1 \;\; r_2\, a_2\, s_2\, b_2\, c_2$

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 1. The 3SAT instance $(V, E)$ is satisfiable if and only if the two genomes $G_1$ and $G_2$ have a common subsequence $G'$ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance $(V, E)$ is satisfiable. We will compose a common subsequence $G'$ of the two genomes $G_1$ and $G_2$ from a common subsequence of each variable gadget $\langle v_i \rangle$, the separator gene $z$ in the middle, and a common subsequence of each clause gadget $\langle e_j \rangle$. Consider a truth assignment that satisfies the 3SAT instance. For each variable $v_i$, take the subsequence $p_{i,1} \dots p_{i,k_i} \; x_i \; y_i$ if $v_i$ is set to true, and take the subsequence $y_i \; x_i \; q_{i,1} \dots q_{i,l_i}$ if $v_i$ is set to false. For each clause $e_j$, at least one of its three literals is true; correspondingly, at least one of the three literal genes $r_j, s_j, t_j$ has been taken from some variable gadget $\langle v_i \rangle$. Now take a subsequence from the clause gadget $\langle e_j \rangle$ following one of three cases:

1. If $r_j$ has been taken, then take the subsequence $a_j \; b_j \; \underline{s_j} \; c_j \; \underline{t_j}$.
2. If $s_j$ has been taken, then take either the subsequence $\underline{r_j} \; b_j \; a_j \; c_j \; \underline{t_j}$ or the subsequence $\underline{r_j} \; a_j \; c_j \; b_j \; \underline{t_j}$.
3. If $t_j$ has been taken, then take the subsequence $\underline{r_j} \; a_j \; \underline{s_j} \; b_j \; c_j$.

Here an underlined literal gene is omitted from the subsequence taken from the clause gadget $\langle e_j \rangle$ if its other copy has already been taken from some variable gadget $\langle v_i \rangle$. The common subsequence $G'$ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes $G_1$ and $G_2$ have a common subsequence $G'$ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance $(V, E)$ as follows. Due to the strategic location of the separator gene $z$ in the two genomes, each literal gene must appear in the common subsequence either before $z$ in both genomes, in some variable gadget $\langle v_i \rangle$, or after $z$ in both genomes, in some clause gadget $\langle e_j \rangle$. The crucial property of the clause gadget $\langle e_j \rangle$ is that it cannot have a common subsequence including exactly one copy of each clause gene $a_j, b_j, c_j$ unless at least one of the three literal genes $r_j, s_j, t_j$ is omitted. A literal gene omitted from the common subsequence of the clause gadget $\langle e_j \rangle$ has to appear in the common subsequence of some variable gadget $\langle v_i \rangle$, where the two variable genes $x_i$ and $y_i$ must appear in the order $x_i \, y_i$ if the literal is positive and appear in the order $y_i \, x_i$ if the literal is negative.
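The crucial property of the clause gadget can be confirmed by exhaustive search, since the gadget has only six distinct genes. The following Python sketch is an illustration added here, with the subscript $j$ dropped and the function names chosen for this example only. It enumerates every ordering of the genes to be kept and tests whether it is a subsequence of both gadget patterns.

```python
from itertools import permutations

def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(symbol in it for symbol in short)

# Clause gadget patterns in G1 and G2 (subscript j omitted).
G1_CLAUSE = "r a b c s a b c t".split()
G2_CLAUSE = "a r b a s c b t c".split()

def common_exemplars(kept_literals):
    """All common subsequences of the two clause gadgets that contain a, b, c
    and every literal gene in `kept_literals` exactly once."""
    symbols = ["a", "b", "c", *kept_literals]
    return [p for p in permutations(symbols)
            if is_subsequence(p, G1_CLAUSE) and is_subsequence(p, G2_CLAUSE)]

# Keeping all three literal genes leaves no common subsequence at all,
# while omitting any one literal gene makes a solution possible.
print(common_exemplars("rst"))
print(len(common_exemplars("st")) > 0,
      len(common_exemplars("rt")) > 0,
      len(common_exemplars("rs")) > 0)
```

The three non-empty cases correspond to the three cases used in the direct implication.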
Now set each variable $v_i$ to true if the two variable genes $x_i$ and $y_i$ appear in the common subsequence $G'$ in the order $x_i \, y_i$, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 1.

3 Proof of Theorem 2

We prove that Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard by a reduction again from 3SAT. Let $(V, E)$ be a 3SAT instance, where $V = \{v_1, \dots, v_n\}$ is a set of $n$ boolean variables, $E = \{e_1, \dots, e_m\}$ is a conjunctive boolean formula of $m$ clauses, and each clause in $E$ is a disjunction of exactly three literals of the variables in $V$. Without loss of generality, assume that no clause in $E$ contains two literals of the same variable in $V$. We will construct two genomes $G_1$ and $G_2$ over $n + 9m$ distinct genes:

• One variable gene $x_i$ for each variable $v_i$, $1 \le i \le n$;
• Six clause genes $a_j, b_j, c_j, a'_j, b'_j, c'_j$ for each clause $e_j$, $1 \le j \le m$;
• Three literal genes $r_j, s_j, t_j$ for the three literals of each clause $e_j$, $1 \le j \le m$.

For each variable $v_i$, let $p_{i,1}, \dots, p_{i,k_i}$ be the literal genes for the positive literals of $v_i$, and let $q_{i,1}, \dots, q_{i,l_i}$ be the literal genes for the negative literals of $v_i$. $G_1$ includes one subset and $G_2$ includes two subsets of genes including $x_i$:

$G_1 \langle v_i \rangle$: $\{p_{i,1}, \dots, p_{i,k_i}, x_i, q_{i,1}, \dots, q_{i,l_i}\}$
$G_2 \langle v_i \rangle$: $\{p_{i,1}, \dots, p_{i,k_i}, x_i\} \; \{x_i, q_{i,1}, \dots, q_{i,l_i}\}$

For each clause $e_j$, $G_1$ includes six subsets and $G_2$ includes seven subsets of clause/literal genes:

$G_1 \langle e_j \rangle$: $\{a_j, b_j\} \; \{b_j, c_j\} \; \{c_j, a_j\} \; \{a'_j, r_j\} \; \{b'_j, s_j\} \; \{c'_j, t_j\}$
$G_2 \langle e_j \rangle$: $\{a_j, b_j, c_j\} \; \{a_j, a'_j, r_j\} \; \{b_j, b'_j, s_j\} \; \{c_j, c'_j, t_j\} \; \{a'_j\} \; \{b'_j\} \; \{c'_j\}$

This completes the construction.
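The construction for the unordered case is short enough to write out. The following Python sketch is an illustration and not the author's code: the function name `build_genomes`, the string gene names, and the signed-integer clause encoding (`+i` for $v_i$, `-i` for $\neg v_i$) are all assumptions made here. It assumes, as in the text, that no clause contains two literals of the same variable.

```python
from collections import Counter

def build_genomes(num_vars, clauses):
    """Build (G1, G2) as lists of frozensets following the construction above.

    clauses: list of 3-tuples of nonzero ints, +i for v_i and -i for its negation.
    """
    pos = {i: [] for i in range(1, num_vars + 1)}   # literal genes p_{i,*}
    neg = {i: [] for i in range(1, num_vars + 1)}   # literal genes q_{i,*}
    for j, clause in enumerate(clauses, start=1):
        for name, lit in zip("rst", clause):
            (pos if lit > 0 else neg)[abs(lit)].append(f"{name}{j}")
    g1, g2 = [], []
    for i in range(1, num_vars + 1):                # variable gadgets
        x = f"x{i}"
        g1.append(frozenset(pos[i] + [x] + neg[i]))
        g2.append(frozenset(pos[i] + [x]))
        g2.append(frozenset([x] + neg[i]))
    for j in range(1, len(clauses) + 1):            # clause gadgets
        a, b, c = f"a{j}", f"b{j}", f"c{j}"
        a2, b2, c2 = f"a'{j}", f"b'{j}", f"c'{j}"
        r, s, t = f"r{j}", f"s{j}", f"t{j}"
        g1 += [frozenset(p) for p in
               ([a, b], [b, c], [c, a], [a2, r], [b2, s], [c2, t])]
        g2 += [frozenset(p) for p in
               ([a, b, c], [a, a2, r], [b, b2, s], [c, c2, t], [a2], [b2], [c2])]
    return g1, g2

# Instance with n = 4 variables and m = 2 clauses:
# e1 = (v1, not v2, not v3), e2 = (not v1, v3, v4).
g1, g2 = build_genomes(4, [(1, -2, -3), (-1, 3, 4)])
counts1 = Counter(g for chrom in g1 for g in chrom)
counts2 = Counter(g for chrom in g2 for g in chrom)
print(sum(counts1.values()), sum(counts2.values()))   # n + 15m and 2n + 18m
print(max(counts1.values()), max(counts2.values()))   # at most two copies each
```

For this instance the totals are $n + 15m = 34$ and $2n + 18m = 44$ genes including duplicates, over $n + 9m = 22$ distinct genes.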
It is easy to check that each gene appears at most two times in each genome, $G_1$ includes exactly $n + 15m$ genes including duplicates, and $G_2$ includes exactly $2n + 18m$ genes including duplicates. We give an example:

Example 2. For a 3SAT instance of 4 variables and 2 clauses $e_1 = \{r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3\}$ and $e_2 = \{r_2 = \neg v_1, s_2 = v_3, t_2 = v_4\}$, the reduction constructs the following two genomes:

$G_1$: $\{r_1, x_1, r_2\} \; \{x_2, s_1\} \; \{s_2, x_3, t_1\} \; \{t_2, x_4\}$
  $\{a_1, b_1\} \; \{b_1, c_1\} \; \{c_1, a_1\} \; \{a'_1, r_1\} \; \{b'_1, s_1\} \; \{c'_1, t_1\}$
  $\{a_2, b_2\} \; \{b_2, c_2\} \; \{c_2, a_2\} \; \{a'_2, r_2\} \; \{b'_2, s_2\} \; \{c'_2, t_2\}$
$G_2$: $\{r_1, x_1\} \; \{x_1, r_2\} \; \{x_2\} \; \{x_2, s_1\} \; \{s_2, x_3\} \; \{x_3, t_1\} \; \{t_2, x_4\} \; \{x_4\}$
  $\{a_1, b_1, c_1\} \; \{a_1, a'_1, r_1\} \; \{b_1, b'_1, s_1\} \; \{c_1, c'_1, t_1\} \; \{a'_1\} \; \{b'_1\} \; \{c'_1\}$
  $\{a_2, b_2, c_2\} \; \{a_2, a'_2, r_2\} \; \{b_2, b'_2, s_2\} \; \{c_2, c'_2, t_2\} \; \{a'_2\} \; \{b'_2\} \; \{c'_2\}$

The assignment $v_1 = \text{true}$, $v_2 = \text{false}$, $v_3 = \text{false}$, $v_4 = \text{true}$ satisfies the 3SAT instance and corresponds to the following common reduced genome:

$G'$: $\{r_1, x_1\} \; \{x_2, s_1\} \; \{x_3, t_1\} \; \{t_2, x_4\}$
  $\{a_1\} \; \{b_1, c_1\} \; \{a'_1\} \; \{b'_1\} \; \{c'_1\}$
  $\{c_2\} \; \{a_2, b_2\} \; \{a'_2, r_2\} \; \{b'_2, s_2\} \; \{c'_2\}$

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 2. The 3SAT instance $(V, E)$ is satisfiable if and only if the two genomes $G_1$ and $G_2$ have a common reduced genome $G'$ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance $(V, E)$ is satisfiable. We will compose a common reduced genome $G'$ of the two genomes $G_1$ and $G_2$ as follows. Consider a truth assignment that satisfies the 3SAT instance. For each variable $v_i$, take the subset $\{p_{i,1}, \dots, p_{i,k_i}, x_i\}$ if $v_i$ is set to true, and take the subset $\{x_i, q_{i,1}, \dots, q_{i,l_i}\}$ if $v_i$ is set to false. For each clause $e_j$, at least one of its three literals is true; correspondingly, at least one of the three literal genes $r_j, s_j, t_j$ has been taken from some variable gadget $\langle v_i \rangle$. Now take some subsets of clause/literal genes following one of three cases:

1. If $r_j$ has been taken, then take the subsets $\{a_j\}$, $\{b_j, c_j\}$, $\{a'_j\}$, $\{b'_j, \underline{s_j}\}$, $\{c'_j, \underline{t_j}\}$.
2. If $s_j$ has been taken, then take the subsets $\{b_j\}$, $\{c_j, a_j\}$, $\{a'_j, \underline{r_j}\}$, $\{b'_j\}$, $\{c'_j, \underline{t_j}\}$.
3. If $t_j$ has been taken, then take the subsets $\{c_j\}$, $\{a_j, b_j\}$, $\{a'_j, \underline{r_j}\}$, $\{b'_j, \underline{s_j}\}$, $\{c'_j\}$.

Here an underlined literal gene is omitted from the subset taken from the clause gadget $\langle e_j \rangle$ if its other copy has already been taken from some variable gadget $\langle v_i \rangle$. The reduced genome $G'$ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes $G_1$ and $G_2$ have a common reduced genome $G'$ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance $(V, E)$ as follows. The crucial property of the clause gadget $\langle e_j \rangle$ is that it cannot have a common reduced genome including exactly one copy of each clause gene $a_j, b_j, c_j, a'_j, b'_j, c'_j$ unless at least one of the three literal genes $r_j, s_j, t_j$ is omitted. A literal gene omitted from the clause gadget $\langle e_j \rangle$ has to appear in a subset in $G'$ that contains some variable gene $x_i$. By the construction of the variable gadgets, this subset contains, besides $x_i$, either literal genes for positive literals, or literal genes for negative literals.
Now set each variable $v_i$ to true if the subset in $G'$ that contains $x_i$ also contains at least one literal gene for a positive literal, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 2.

4 Proof of Theorem 3

Let $A$ and $B$ be two sequences of lengths $n$ and $m$, respectively, over an alphabet $\Sigma = \Sigma_1 \cup \Sigma_2$, where $\Sigma_1$ is a set of mandatory symbols and $\Sigma_2$ is a set of optional symbols. In the special case that each mandatory symbol in $\Sigma_1$ appears exactly once in one sequence and at least once in the other sequence, we have the obvious but important property that any common subsequence of the two sequences can contain each mandatory symbol at most once. This property leads to a very simple algorithm that decides whether a feasible solution to Exemplar Longest Common Subsequence exists in this special case:

Algorithm 1.
1. Obtain two sequences $A'$ and $B'$ from $A$ and $B$ by deleting all optional symbols in $\Sigma_2$.
2. Compute a longest common subsequence $C^*$ of $A'$ and $B'$.
3. If $C^*$ contains all mandatory symbols in $\Sigma_1$, return yes. Otherwise, return no.

The time complexity of Algorithm 1 is $O(nm)$ by using a standard dynamic programming algorithm for longest common subsequence [10]. The correctness of Algorithm 1 is justified by the following lemma:

Lemma 3. $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if the longest common subsequence $C^*$ of $A'$ and $B'$ contains all mandatory symbols in $\Sigma_1$.

Proof. The reduction from $A$ and $B$ to $A'$ and $B'$ preserves the mandatory symbols. Thus $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if $A'$ and $B'$ have a common subsequence containing all mandatory symbols in $\Sigma_1$.
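Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the author's implementation, and the function names are assumptions. Because any common subsequence contains each mandatory symbol at most once in the special case, the test in step 3 is carried out here by comparing the length of the longest common subsequence with $|\Sigma_1|$, which avoids reconstructing $C^*$.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence, standard O(len(a) * len(b)) DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def has_feasible_solution(a, b, mandatory):
    """Algorithm 1. Valid only when every mandatory symbol occurs exactly once
    in one sequence and at least once in the other."""
    a_reduced = [x for x in a if x in mandatory]   # step 1: drop optional symbols
    b_reduced = [y for y in b if y in mandatory]
    # Steps 2 and 3: in the special case, C* contains all mandatory symbols
    # exactly when its length equals the number of mandatory symbols.
    return lcs_length(a_reduced, b_reduced) == len(mandatory)

# Example sequences from the introduction, with mandatory symbols {1, 2, 3}.
print(has_feasible_solution("12423545", "1142443555", set("123")))
print(has_feasible_solution("12", "21", set("12")))
```

The second call is a small infeasible instance: the two mandatory symbols occur in opposite orders, so no common subsequence contains both.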
It remains to prove the equivalent claim that $A'$ and $B'$ have a common subsequence containing all mandatory symbols in $\Sigma_1$ if and only if $C^*$ contains all mandatory symbols in $\Sigma_1$. The "if" direction of the claim is trivial because $C^*$ is a common subsequence of $A'$ and $B'$. To prove the "only if" direction, recall that in any common subsequence of $A'$ and $B'$, each mandatory symbol can appear at most once. Thus the length of any common subsequence of $A'$ and $B'$ is at most the size of $\Sigma_1$. Moreover, if the length of some common subsequence of $A'$ and $B'$ is equal to the size of $\Sigma_1$, then this common subsequence must contain all mandatory symbols in $\Sigma_1$, and vice versa. Now suppose that $A'$ and $B'$ have a common subsequence $C'$ containing all mandatory symbols in $\Sigma_1$. Then the length of $C'$ must be equal to the size of $\Sigma_1$. Since the length of $C^*$ is at least the length of $C'$, the length of $C^*$ must also be equal to the size of $\Sigma_1$. Then $C^*$ must contain all mandatory symbols in $\Sigma_1$ too. This completes the proof.

Since deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences $A$ and $B$ is the same as the problem Zero Exemplar Distance for two monochromosomal genomes $A'$ and $B'$ obtained from $A$ and $B$ by deleting all optional symbols, we also have an $O(nm)$ algorithm for Zero Exemplar Distance for monochromosomal genomes in the special case that each gene appears exactly once in one genome and at least once in the other genome.

We next present an algorithm for the maximization problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence:

Algorithm 2.
1. Assign each mandatory symbol in $\Sigma_1$ a weight of $w = \min\{n, m\} + 1$, and assign each optional symbol in $\Sigma_2$ a weight of 1. Compute a common subsequence $C^*$ of $A$ and $B$ of the maximum total weight.
2. If $C^*$ contains all mandatory symbols in $\Sigma_1$, return $C^*$. Otherwise, report that no feasible solution exists.

If $A$ and $B$ have no common subsequence containing all mandatory symbols in $\Sigma_1$, then clearly the maximum-weight common subsequence $C^*$ of $A$ and $B$ cannot contain all mandatory symbols in $\Sigma_1$, and hence the algorithm correctly reports that no feasible solution exists. Otherwise, the correctness of Algorithm 2 is justified by the following lemma:

Lemma 4. If $A$ and $B$ have a common subsequence containing all mandatory symbols in $\Sigma_1$, then the maximum-weight common subsequence $C^*$ of $A$ and $B$ is a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$.

Proof. Suppose that $A$ and $B$ have a common subsequence $C$ containing all mandatory symbols in $\Sigma_1$. We first show that the maximum-weight common subsequence $C^*$ of $A$ and $B$ contains all mandatory symbols in $\Sigma_1$. Note that the number of optional symbols in $C^*$ is at most the length of $C^*$, which is at most $\min\{n, m\}$. Also recall that any common subsequence of $A$ and $B$ can contain each mandatory symbol at most once. If $C^*$ does not contain all mandatory symbols in $\Sigma_1$, then by our choice of $w = \min\{n, m\} + 1$, the total weight of $C^*$ would be at most

$(|\Sigma_1| - 1) \cdot w + \min\{n, m\} \cdot 1 < (|\Sigma_1| - 1) \cdot w + w \cdot 1 = |\Sigma_1| \cdot w.$

On the other hand, since $C$ contains all mandatory symbols in $\Sigma_1$, the weight of $C$ is at least $|\Sigma_1| \cdot w$. This contradicts the assumption that $C^*$ is a maximum-weight common subsequence of $A$ and $B$. Now, since $C^*$ contains all mandatory symbols and can contain each mandatory symbol at most once, $C^*$ must contain each mandatory symbol exactly once.
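Algorithm 2 amounts to a weighted variant of the standard longest common subsequence recurrence. The Python sketch below is one possible realization under stated assumptions: the function name `exemplar_lcs`, the suffix-based table `best`, and the traceback order are choices made for this illustration, not details given in the paper.

```python
def exemplar_lcs(a, b, mandatory):
    """Algorithm 2. Valid only when every mandatory symbol occurs exactly once
    in one sequence and at least once in the other. Returns a list of symbols,
    or None if no feasible solution exists."""
    n, m = len(a), len(b)
    w = min(n, m) + 1                      # weight of a mandatory symbol

    def weight(symbol):
        return w if symbol in mandatory else 1

    # best[i][j]: maximum total weight of a common subsequence of a[i:] and b[j:].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if a[i] == b[j]:
                best[i][j] = max(best[i][j], best[i + 1][j + 1] + weight(a[i]))

    # Trace back one maximum-weight common subsequence.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j] and best[i][j] == best[i + 1][j + 1] + weight(a[i]):
            result.append(a[i])
            i, j = i + 1, j + 1
        elif best[i][j] == best[i + 1][j]:
            i += 1
        else:
            j += 1
    return result if mandatory <= set(result) else None

# Example from the introduction: a feasible solution of length 6 exists.
print(exemplar_lcs("12423545", "1142443555", set("123")))
print(exemplar_lcs("12", "21", set("12")))
```

Several maximum-weight common subsequences may exist, so the sequence returned for the first call depends on the tie-breaking in the traceback; only its length and the presence of all mandatory symbols are determined by the lemma.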
Since $C^*$ contains each mandatory symbol exactly once, to have the maximum total weight it must be a longest common subsequence of $A$ and $B$ that contains all mandatory symbols in $\Sigma_1$.

Again, the overall time complexity of Algorithm 2 is clearly $O(nm)$. This completes the proof of Theorem 3.

5 Proof of Theorem 4

We present two algorithms for Zero Exemplar Distance for multichromosomal genomes without gene order. Let $k_1$ and $k_2$, respectively, be the numbers of chromosomes in $G_1$ and $G_2$. Let $A_1, \dots, A_{k_1}$ be the $k_1$ chromosomes in $G_1$. Let $B_1, \dots, B_{k_2}$ be the $k_2$ chromosomes in $G_2$. Let $k = \max\{k_1, k_2\}$. Let $n$ be the total number of genes in $G_1$ and $G_2$, i.e., $n = \sum_{i=1}^{k_1} |A_i| + \sum_{j=1}^{k_2} |B_j|$.

We first present a polynomial-time algorithm for Zero Exemplar Distance for multichromosomal genomes without gene order in the special case that each gene appears exactly once in one genome and at least once in the other genome. Our algorithm is based on maximum-weight matching in bipartite graphs:

Algorithm 3.
1. Construct a complete bipartite graph $G = (V_1 \cup V_2, V_1 \times V_2)$ with vertices $V_1 = \{A_1, \dots, A_{k_1}\}$ and $V_2 = \{B_1, \dots, B_{k_2}\}$. Associate with each edge between $A_i \in V_1$ and $B_j \in V_2$ a reduced chromosome $C_{ij} = A_i \cap B_j$ and a weight equal to its size.
2. Compute a maximum-weight matching $M$ in the graph $G$.
3. If the set of reduced chromosomes corresponding to the edges in $M$ includes all the genes, return yes. Otherwise, return no.

To see the correctness of Algorithm 3, note that each reduced chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome, and corresponds to an edge of a matching in the complete bipartite graph.
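Algorithm 3 can be sketched as follows. The paper only asks for a standard maximum-weight bipartite matching routine; the sketch below substitutes a textbook Hungarian-method implementation (an assumption of this illustration, with hypothetical function names) and negates the weights so that the minimum-cost assignment is a maximum-weight matching. Since all weights are non-negative and the graph is complete, some maximum-weight matching saturates the smaller side, so an assignment suffices.

```python
def max_weight_assignment(weight):
    """Hungarian method, O(k^3), on a matrix with rows <= cols and
    non-negative weights; returns match[i] = column assigned to row i."""
    n, m = len(weight), len(weight[0])
    INF = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)       # dual potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -weight[i0 - 1][j - 1] - u[i0] - v[j]   # cost = -weight
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                              # augment along the found path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match

def zero_exemplar_distance_sets(g1, g2):
    """Algorithm 3. Valid only when each gene occurs exactly once in one
    genome and at least once in the other; genomes are non-empty lists of sets."""
    if len(g1) > len(g2):
        g1, g2 = g2, g1                        # keep rows <= cols
    weight = [[len(a & b) for b in g2] for a in g1]          # step 1
    match = max_weight_assignment(weight)                    # step 2
    covered = set().union(*(a & g2[j] for a, j in zip(g1, match)))
    return covered == set().union(*g1, *g2)                  # step 3

print(zero_exemplar_distance_sets([{1, 2}, {3, 4}, {5}],
                                  [{1, 2, 3}, {3, 4, 5}, {5, 1}]))
print(zero_exemplar_distance_sets([{1, 2}, {3}], [{1, 3}, {2, 3}]))
```

In the second call, genes 1 and 2 share a chromosome in the first genome but not in the second, so no matching covers all three genes.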
In the special case that each gene appears exactly once in one genome and at least once in the other genome, no gene can appear more than once in the reduced chromosomes corresponding to the edges of a matching. Thus the maximum possible weight of a matching is equal to the number of distinct genes, and a common reduced genome that includes all the genes corresponds to a matching of the maximum weight.

We now analyze the time complexity of Algorithm 3. Steps 1 and 3 can be easily implemented in $O(n^2)$ time. Step 2 can be implemented in $O(k^3)$ time using a standard algorithm for weighted bipartite matching; see e.g. [13]. Thus the overall time complexity is $O(n^2 + k^3)$.

We next present a fixed-parameter tractable algorithm for this problem without any assumption on the distribution of duplicate genes. Refer to [7] for basic concepts in parameterized complexity theory. The parameter of our algorithm is $k = \max\{k_1, k_2\}$:

Algorithm 4.
1. Add $k - k_1$ empty chromosomes $A_{k_1+1}, \dots, A_k$ to $G_1$, or add $k - k_2$ empty chromosomes $B_{k_2+1}, \dots, B_k$ to $G_2$, such that $G_1$ and $G_2$ have the same number $k$ of chromosomes.
2. For each permutation $\pi$ of $\langle 1, \dots, k \rangle$, compute $C_\pi = \bigcup_{i=1}^{k} (A_i \cap B_{\pi(i)})$.
3. If for some permutation $\pi$ the set $C_\pi$ includes all the genes, return yes. Otherwise return no.

To see the correctness of Algorithm 4, note again that each chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome. All other chromosomes of the two input genomes that do not contribute to the common reduced genome are deleted. To handle the matching and the deletion of the chromosomes in a uniform way, we can think of each chromosome deleted from one genome as matched to a chromosome deleted from the other genome or to an empty chromosome.
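Algorithm 4 translates almost directly into code. The following Python sketch is illustrative only (the function name is an assumption): it pads the smaller genome with empty chromosomes and tries every permutation, so it is practical only for small $k$.

```python
from itertools import permutations

def zero_exemplar_distance_fpt(g1, g2):
    """Algorithm 4: exhaustive search over the k! ways of pairing chromosomes.
    Genomes are lists of sets; no assumption on duplicate genes is needed."""
    k = max(len(g1), len(g2))
    a = list(g1) + [frozenset()] * (k - len(g1))    # step 1: pad with empty
    b = list(g2) + [frozenset()] * (k - len(g2))    # chromosomes
    all_genes = set().union(*g1, *g2)
    for pi in permutations(range(k)):               # step 2: each permutation
        covered = set().union(*(a[i] & b[pi[i]] for i in range(k)))
        if covered == all_genes:                    # step 3: C_pi has all genes
            return True
    return False

# Multichromosomal example from the introduction.
G1 = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
G2 = [{1, 2}, {2, 3, 4}, {3, 4, 5}, {1, 5}]
print(zero_exemplar_distance_fpt(G1, G2))
print(zero_exemplar_distance_fpt([{1, 2}, {3}], [{1, 3}, {2, 3}]))
```

When $C_\pi$ covers every gene but some gene lies in more than one intersection, the extra copies can simply be deleted, which is why a coverage test is enough.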
Thus by padding the two genomes to the same number of chromosomes, we only need to consider perfect matchings as permutations. The time complexity of Algorithm 4 is $O(k!\, n^2)$, with $O(n^2)$ time for each of the $k!$ permutations. This completes the proof of Theorem 4.

We remark that the problem Zero Exemplar Distance for multichromosomal genomes without gene order is unlikely to have a fixed-parameter tractable algorithm if the parameter is the maximum number of genes in any single chromosome. This is because 3SAT remains NP-hard even if for each variable there are at most five clauses that contain its literals [9]. As a result, the number of genes in each chromosome need not be more than some constant in our reduction from 3SAT.

Acknowledgment. The author would like to thank Binhai Zhu for a brief discussion on Question 1 during a visit to University of Texas – Pan American in May 2008, and thank Guillaume Fertin for communicating the recent result [3]. The alternative proof of Theorem 1 was obtained independently by the author in February 2010 without knowing the recent progress [3]. The author also thanks Pedro J. Tejada for bringing the open question of Bonizzoni et al. [4], Question 2, to his attention.

References

[1] S. S. Adi, M. D. V. Braga, C. G. Fernandes, C. E. Ferreira, F. V. Martinez, M.-F. Sagot, M. A. Stefanes, C. Tjandraatmadja, and Y. Wakabayashi. Repetition-free longest common subsequence. Discrete Applied Mathematics, 158:1315–1324, 2010.
[2] S. Angibaud, G. Fertin, I. Rusu, A. Thévenin, and S. Vialette. On the approximability of comparing genomes with duplicates. Journal of Graph Algorithms and Applications, 13:19–53, 2009.
[3] G. Blin, G. Fertin, F. Sikora, and S. Vialette. The Exemplar Breakpoint Distance for non-trivial genomes cannot be approximated. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM'09), pages 357–368, 2009.
[4] P. Bonizzoni, G. Della Vedova, R. Dondi, G. Fertin, R. Rizzi, and S. Vialette. Exemplar longest common subsequence. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4:535–543, 2007.
[5] Z. Chen, R. H. Fowler, B. Fu, and B. Zhu. On the inapproximability of the exemplar conserved interval distance problem of genomes. Journal of Combinatorial Optimization, 15:201–221, 2008. (A preliminary version appeared in Proceedings of the 12th Annual International Conference on Computing and Combinatorics (COCOON'06), pages 245–254, 2006.)
[6] Z. Chen, B. Fu, and B. Zhu. The approximability of the exemplar breakpoint distance problem. In Proceedings of the 2nd International Conference on Algorithmic Aspects in Information and Management (AAIM'06), pages 291–302, 2006.
[7] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer-Verlag, 1999.
[8] V. Ferretti, J. H. Nadeau, and D. Sankoff. Original synteny. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM'96), pages 159–167, 1996.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[10] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[11] M. Jiang. The zero exemplar distance problem. In Proceedings of the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG'10), pages 74–82, 2010.
[12] D. Sankoff. Genome rearrangement with gene families. Bioinformatics, 15:909–917, 1999.
[13] R. E. Tarjan. Data Structures and Network Algorithms. SIAM, 1983.
