A variant of the tandem duplication - random loss model of genome rearrangement

Mathilde Bouvel a, Dominique Rossin a

a LIAFA, Université Paris Diderot, CNRS, Case 7014, 75205 Paris Cedex 13

Abstract

In (4), Chaudhuri, Chen, Mihaescu and Rao study algorithmic properties of the tandem duplication - random loss model of genome rearrangement, well-known in evolutionary biology. In their model, the cost of one step of duplication-loss of width $k$ is $\alpha^k$ for $\alpha = 1$ or $\alpha \geq 2$. In this paper, we study a variant of this model, where the cost of one step of width $k$ is $1$ if $k \leq K$ and $\infty$ if $k > K$, for any value of the parameter $K \in \mathbb{N} \cup \{\infty\}$. We first show that permutations obtained after $p$ steps of width $K$ define classes of pattern-avoiding permutations. We also compute the numbers of duplication-loss steps of width $K$ necessary and sufficient to obtain any permutation of $S_n$, in the worst case and on average. In this second part, we may also consider the case $K = K(n)$, a function of the size $n$ of the permutation on which the duplication-loss operations are performed.

Key words: Sorting, Permutations, Pattern

1 Introduction

1.1 The model

In the usual models of genome rearrangement, duplications and losses of genes are not taken into account. There were attempts to incorporate them into the classical models, but the resulting combinatorial complexity of the models so obtained made their study quite difficult. Following (4), we focus on the duplication-loss problem by considering the tandem duplication - random loss model of genome rearrangement, in which genomes are modified only by duplications and losses of genes.
One step of tandem duplication - random loss, or duplication-loss for short, consists of (1) the tandem duplication of a contiguous fragment of the genome, i.e., the duplicated fragment is inserted immediately after the original fragment, and (2) the loss of one of the two copies of every duplicated gene. We assume that the loss occurs immediately after the duplication of genes, which is, on an evolutionary time-scale, a good approximation to reality. The width of a step is the number of duplicated genes. See Figure 1 for an example.

Preprint submitted to Elsevier, 4 November 2018

    1 2 [3 4 5 6] 7
    1 2 [3 4 5 6] [3 4 5 6] 7    (tandem duplication)
    1 2 [4 5] [3 6] 7            (random loss: 3 and 6 are lost in the first copy, 4 and 5 in the second)
    1 2 4 5 3 6 7

Fig. 1. Example of one step of tandem duplication - random loss of width 4

From a formal point of view, a genome consisting of $n$ genes is modelled by a permutation $\pi \in S_n$ of the set of integers $\{1, 2, \ldots, n\}$.

In (4), the authors define the cost of a duplication-loss step of width $k$ to be $\alpha^k$, $\alpha \geq 1$ being a constant parameter. They suggest that other cost functions can be considered, and in particular affine functions. In this paper, we consider a piecewise constant cost function: the cost of a step of width $k$ is $1$ if $k \leq K$ and is infinite for $k > K$, for some fixed parameter $K \in \mathbb{N} \cup \{\infty\}$. Obviously, for this model to be meaningful, we assume that $K \geq 2$. We also consider the possibility that $K = K(n)$ depends on the size $n$ of the permutation on which the duplication-loss operations are performed.

Both models are generalizations of the whole genome duplication - random loss model: it corresponds to the case $\alpha = 1$ in the model of (4), and to $K = \infty$ or $K = K(n) = n$ in our model.

Many models of evolution of permutations are inspired by computational biology issues: see (2), (5), (6), (7) for examples in the literature.
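The duplication-loss step and the piecewise constant cost of Section 1.1 can be sketched as follows. This is a minimal illustration, not code from the paper: the function names `duplication_loss_step` and `step_cost`, the 0-based `start` index and the `keep_first` argument (the set of genes kept in the first copy) are assumptions made for the example.

```python
import math

def duplication_loss_step(perm, start, width, keep_first):
    """Apply one tandem duplication - random loss step to the list `perm`.

    The fragment perm[start:start+width] is duplicated in tandem; genes in
    `keep_first` survive in the first copy, all others in the second copy.
    (Hypothetical helper; `start` is a 0-based index.)
    """
    fragment = perm[start:start + width]
    first_copy = [g for g in fragment if g in keep_first]
    second_copy = [g for g in fragment if g not in keep_first]
    return perm[:start] + first_copy + second_copy + perm[start + width:]

def step_cost(width, K):
    """Piecewise constant cost: 1 if width <= K, infinite otherwise."""
    return 1 if width <= K else math.inf

# The step of Figure 1: fragment 3 4 5 6, genes 4 and 5 kept in the first copy.
result = duplication_loss_step([1, 2, 3, 4, 5, 6, 7], 2, 4, {4, 5})
```

With these arguments the call reproduces the step of Figure 1, and `step_cost(4, K)` is finite exactly when $K \geq 4$.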
Our model of evolution of permutations can be viewed in the framework of permuting machines defined in (1). Such a machine takes a permutation as input and transforms it into an output permutation, the transformation being required to satisfy two properties: independence with respect to the values, and stability with respect to pattern-involvement (see (1) for more details). The important point is that the duplication-loss transformation satisfies these two properties. Thus, one duplication-loss step (in one of the models defined above) corresponds to running an adequate permuting machine once. When we consider permutations obtained after a sequence of duplication-loss steps, they correspond to permutations obtained as output of a combination in series of identical permuting machines.

For ease of exposition in some proofs, we will sometimes use a graphical representation of permutations, as shown in Figure 2.

[Figure: grid with positions 1 to 8 on the horizontal axis and values 1 to 8 on the vertical axis]

Fig. 2. The graphical representation of $\sigma = 68135427$

1.2 Pattern-avoiding classes of permutations

Though it is not apparent yet, there exist strong links between the duplication-loss model and some pattern-avoiding classes of permutations. Hence, we need to recall a few definitions concerning those classes.

A permutation $\sigma \in S_n$ is a bijective map from $[1..n]$ to itself. The integer $n$ is called the size of $\sigma$, denoted $|\sigma|$. We denote by $\sigma_i$ the image of $i$ under $\sigma$. A permutation can be seen as a word $\sigma_1 \sigma_2 \ldots \sigma_n$ containing exactly once each letter $i \in [1..n]$. For each entry $\sigma_i$ of a permutation $\sigma$, we call $i$ its position and $\sigma_i$ its value.

Definition 1 A permutation $\pi \in S_k$ is a pattern of a permutation $\sigma \in S_n$ if there is a subsequence of $\sigma$ which is order-isomorphic to $\pi$; in other words, if there is a subsequence $\sigma_{i_1} \sigma_{i_2} \ldots \sigma_{i_k}$ of $\sigma$ (with $1 \leq i_1 < i_2 < \ldots < i_k \leq n$) such that $\sigma_{i_\ell} < \sigma_{i_m}$ whenever $\pi_\ell < \pi_m$. We also say that $\pi$ is involved in $\sigma$ and call $\sigma_{i_1} \sigma_{i_2} \ldots \sigma_{i_k}$ an occurrence of $\pi$ in $\sigma$.

We write $\pi \prec \sigma$ to denote that $\pi$ is a pattern of $\sigma$. A permutation $\sigma$ that does not contain $\pi$ as a pattern is said to avoid $\pi$. The class of all permutations avoiding the patterns $\pi_1, \pi_2, \ldots, \pi_k$ is denoted $S(\pi_1, \pi_2, \ldots, \pi_k)$, and $S_n(\pi_1, \pi_2, \ldots, \pi_k)$ denotes the set of permutations of size $n$ avoiding $\pi_1, \pi_2, \ldots, \pi_k$. We say that $S(\pi_1, \pi_2, \ldots, \pi_k)$ is a class of pattern-avoiding permutations of basis $\{\pi_1, \pi_2, \ldots, \pi_k\}$.

Example 2 For example, $\sigma = 142563$ contains the pattern $1342$, and $1563$, $1463$, $2563$ and $1453$ are the occurrences of this pattern in $\sigma$. But $\sigma \in S(321)$: $\sigma$ avoids the pattern $321$, as no subsequence of size 3 of $\sigma$ is order-isomorphic to $321$, i.e., is decreasing.

1.3 Outline of the paper

In the tandem duplication - random loss model described above, we will focus on two kinds of problems.

First, as hinted before, we will consider permutations obtained after a certain number of duplication-loss steps, that is to say permutations obtained as output of a combination in series of a certain number of permuting machines. For this, we define the class $C(K, p)$ as follows:

Definition 3 The class $C(K, p)$ denotes the class of all permutations obtained from $12 \ldots n$ (for any $n$) after $p$ duplication-loss steps of width at most $K$, for some constant parameters $p$ and $K$.

We do not consider the case $K = K(n)$ here. Note that the duplication-loss steps are not reversible, as noticed in (4), and that consequently $C(K, p)$ is not the class of permutations that can be sorted to $12 \ldots n$ in $p$ duplication-loss steps of width at most $K$.
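The pattern relation of Definition 1 can be checked by brute force on small permutations. The sketch below is only an illustration under that assumption (it enumerates all subsequences, so it is exponential in the pattern size); the names `occurrences` and `contains` are hypothetical.

```python
from itertools import combinations

def occurrences(sigma, pi):
    """Return all occurrences of the pattern pi in sigma, as tuples of values."""
    k = len(pi)
    found = []
    for positions in combinations(range(len(sigma)), k):
        sub = [sigma[i] for i in positions]
        # Order-isomorphism: sub compares exactly like pi at every pair of indices.
        if all((sub[a] < sub[b]) == (pi[a] < pi[b])
               for a in range(k) for b in range(a + 1, k)):
            found.append(tuple(sub))
    return found

def contains(sigma, pi):
    """True if pi is a pattern of sigma."""
    return len(occurrences(sigma, pi)) > 0

# Example 2: sigma = 142563 contains 1342 and avoids 321.
sigma = [1, 4, 2, 5, 6, 3]
occ = occurrences(sigma, [1, 3, 4, 2])
```

On the permutation of Example 2, `occ` lists the four occurrences given there, and `contains(sigma, [3, 2, 1])` is false.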
As for the various classes of permutations obtained after a combination in series of permuting machines considered in (1), we obtain combinatorial properties of $C(K, p)$ in terms of pattern-avoidance. Namely, we show that $C(K, p)$ is a class of pattern-avoiding permutations. In the case $p = 1$ (Section 2.2), we give a precise description of the basis $B$ of excluded patterns: $B = \{321, 3142, 2143\} \cup D$, $D$ being the set of all permutations of $S_{K+1}$ that do not start with $1$ nor end with $K + 1$, and containing exactly one descent. In particular, $B$ is of cardinality $3 + 2^{K-1}$ and contains patterns of size at most $K + 1$. For the general case (Section 2.3), we cannot get such a precise result but only a bound on the size of the excluded patterns: we show that $C(K, p)$ is a class of pattern-avoiding permutations whose basis contains patterns of size at most $(Kp + 2)^2 - 2$.

A second point of view is to examine how many steps of a given width are necessary to obtain any permutation of $S_n$ starting from $12 \ldots n$. Namely, in Section 3 we fix a width $K$ (constant, or $K = K(n)$) and a size $n$, and search for the number $p$ such that any permutation of $S_n$ can be obtained from $12 \ldots n$ in at most $p$ duplication-loss steps of width at most $K$. We describe an algorithm computing a possible scenario of duplications and losses for any $\pi \in S_n$, this scenario involving $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right)$ duplication-loss steps in the worst case and on average. We also show that $\Omega\left(\log n + \frac{n^2}{K^2}\right)$ steps are necessary (in the worst case and on average) to obtain any permutation of $S_n$ from $12 \ldots n$. These upper and lower bounds coincide in most cases.
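The description of the basis $B$ for $p = 1$ can be enumerated directly for small $K$. The following sketch is an illustration only (brute force over $S_{K+1}$, with the hypothetical names `descents`, `excluded_D` and `basis_one_step`); it lists $D$ and lets one check the cardinality $3 + 2^{K-1}$ on small cases.

```python
from itertools import permutations

def descents(s):
    """Number of positions i with s[i] > s[i+1]."""
    return sum(1 for a, b in zip(s, s[1:]) if a > b)

def excluded_D(K):
    """Permutations of S_{K+1} with exactly one descent that neither
    start with 1 nor end with K+1 (the set D of the outline)."""
    n = K + 1
    return [s for s in permutations(range(1, n + 1))
            if s[0] != 1 and s[-1] != n and descents(s) == 1]

def basis_one_step(K):
    """The basis B = {321, 3142, 2143} union D."""
    return [(3, 2, 1), (3, 1, 4, 2), (2, 1, 4, 3)] + excluded_D(K)

D4 = excluded_D(4)
B4 = basis_one_step(4)
```

For $K = 4$ the list `D4` consists of the eight patterns of size 5 given in Example 10, so `B4` has $3 + 2^3 = 11$ elements.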
2 Characterization with excluded patterns

Before focusing on the classes $C(K, 1)$ and $C(K, p)$ defined for our model, we return to the simpler whole genome duplication - random loss model (corresponding to $K = \infty$ in our model, but defined previously by other authors). We will not prove new theorems, but will interpret the existing results from the pattern-avoidance point of view.

2.1 The whole genome duplication - random loss model through the pattern-avoidance prism

Let us recall that in the whole genome duplication - random loss model, any duplication-loss step has cost 1, so that we can consider w.l.o.g. that the duplicated fragment is the whole permutation at any step. The cost of obtaining a permutation $\sigma \in S_n$ from the identity is just the minimal number of steps of a duplication-loss scenario transforming $12 \ldots n$ into $\sigma$.

A statistic of permutations that matters for our purpose is their number of descents.

Definition 4 Given a permutation $\sigma$ of size $n$, we say that there is a descent (resp. ascent) at position $i$, $1 \leq i \leq n - 1$, if $\sigma_i > \sigma_{i+1}$ (resp. $\sigma_i < \sigma_{i+1}$). We write $desc(\sigma)$ for the number of descents of the permutation $\sigma$.

Example 5 For example, $\sigma = 524316$ has 3 descents, namely at positions 1, 3 and 4.

A permutation $\sigma$ of size $n$ has at most $n - 1$ descents, the case of exactly $n - 1$ descents corresponding to the reversed identity permutation $n (n-1) \ldots 21$. It is also well known that the average number of descents among permutations of size $n$ is $\frac{n-1}{2}$.

In (4), the authors prove the following theorem.

Theorem 6 Let $\sigma \in S_n$. In the whole genome duplication - random loss model, $\lceil \log_2(desc(\sigma) + 1) \rceil$ steps are necessary and sufficient to obtain $\sigma$ from $12 \ldots n$.
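The quantities in Definition 4 and Theorem 6 are straightforward to compute. The sketch below is an illustration with hypothetical function names; it uses the integer identity $\lceil \log_2 m \rceil = $ `(m - 1).bit_length()` for $m \geq 1$ to avoid floating-point rounding.

```python
from itertools import permutations

def desc(sigma):
    """Number of descents of sigma (Definition 4)."""
    return sum(1 for a, b in zip(sigma, sigma[1:]) if a > b)

def descent_positions(sigma):
    """1-based positions i with sigma_i > sigma_{i+1}."""
    return [i + 1 for i in range(len(sigma) - 1) if sigma[i] > sigma[i + 1]]

def whole_genome_steps(sigma):
    """ceil(log2(desc(sigma) + 1)), the step count of Theorem 6."""
    return desc(sigma).bit_length()  # equals ((desc + 1) - 1).bit_length()

def average_descents(n):
    """Average number of descents over S_n, by exhaustive enumeration."""
    perms = list(permutations(range(1, n + 1)))
    return sum(desc(s) for s in perms) / len(perms)

sigma = [5, 2, 4, 3, 1, 6]  # the permutation of Example 5
```

For the permutation of Example 5 this gives 3 descents at positions 1, 3 and 4, hence $\lceil \log_2 4 \rceil = 2$ steps in the whole genome model.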
It is equivalent to say that the permutations that can be obtained in at most $p$ steps in the whole genome duplication - random loss model are exactly those whose number of descents is at most $2^p - 1$.

Now, we can notice that the property of being obtainable in at most $p$ steps is stable for the pattern-involvement relation $\prec$: if $\sigma$ can be obtained in at most $p$ steps, and if $\pi \prec \sigma$, then $\pi$ can also be obtained in at most $p$ steps. Indeed, it is enough to perform the same duplication-loss scenario as for $\sigma$, keeping track only of the elements of $\sigma$ that form an occurrence of $\pi$.

This stability for $\prec$ implies that the class of permutations obtainable in at most $p$ steps is a class of pattern-avoiding permutations, whose excluded patterns are the minimal (again in the sense of $\prec$) permutations that cannot be obtained in $p$ steps. Then, by Theorem 6, the excluded patterns are the minimal permutations with $2^p$ descents.

We initiated a study of the minimal permutations with $d$ descents in (3). However, it is simple to notice that a permutation with $d$ descents and minimal for this criterion has size at most $2d$, since it does not contain two consecutive ascents by minimality. An immediate consequence is that the number of excluded patterns is finite.

This allows us to state the following version of Theorem 6:

Theorem 7 The permutations that can be obtained in at most $p$ steps in the whole genome duplication - random loss model form a class of pattern-avoiding permutations. The excluded patterns are the permutations with exactly $2^p$ descents that are minimal (in the sense of $\prec$) for this criterion. These excluded patterns are finite in number.

In (3), we give a simpler description and some properties of these minimal permutations with $d$ descents.
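The minimal permutations with $d$ descents can be listed by exhaustive search for small $d$, using the size bound $2d$ stated above. The sketch below is an illustration, not the method of (3); it relies on the observation that deleting one element never increases the number of descents, so minimality only needs to be tested against single deletions. The name `minimal_with_descents` is hypothetical.

```python
from itertools import permutations

def desc(s):
    """Number of descents of the sequence s."""
    return sum(1 for a, b in zip(s, s[1:]) if a > b)

def minimal_with_descents(d):
    """Permutations with exactly d descents such that every single deletion
    has fewer than d descents. Sizes range from d + 1 to 2d."""
    result = []
    for n in range(d + 1, 2 * d + 1):
        for s in permutations(range(1, n + 1)):
            if desc(s) != d:
                continue
            # Descents are unchanged by renormalizing values, so the
            # sequence with one entry removed can be tested directly.
            if all(desc(s[:i] + s[i + 1:]) < d for i in range(n)):
                result.append(s)
    return result

minimal_2 = minimal_with_descents(2)
```

For $d = 2$ (the case $p = 1$) the search returns $321$, $2143$ and $3142$, the three patterns with two descents that appear in the basis of Section 2.2.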
2.2 Permutations obtained in one step of width K

As an introduction to the study of $C(K, p)$, we deal in this section with the simpler case of the class $C(K) = C(K, 1)$ of permutations obtained from $12 \ldots n$ in one duplication-loss step of width at most $K$.

Assume in this section that the parameter $K \geq 2$ is fixed. Throughout this section, when referring to a duplication-loss step, we always mean a duplication-loss step of width $K$, unless explicitly stated otherwise.

It is easily noticed that no permutation of $C(K)$ has more than one descent. Conversely, any permutation of size at most $K$ having exactly one descent belongs to $C(K)$.

Although it is a technical point of importance in the proof of Theorem 9, the following proposition is straightforward:

Proposition 8 The permutations of size $K + 1$ that do not belong to $C(K)$ and have exactly one descent are exactly those of $S_{K+1}$ with one descent that do not start with $1$ nor end with $K + 1$.

PROOF. Let $\sigma = \sigma_1 \sigma_2 \ldots \sigma_{K+1}$ be a permutation of size $K + 1$ that does not belong to $C(K)$ but has exactly one descent. Now, if $\sigma_1 = 1$, then $\sigma' = \sigma_2 \ldots \sigma_{K+1}$ is a permutation (of $\{2, 3, \ldots, K + 1\}$) of size $K$ having one descent, and therefore $\sigma'$ can be obtained from $23 \ldots K + 1$ in one duplication-loss step. Applying the same transformation to $123 \ldots K + 1$ will then produce $\sigma$, contradicting that $\sigma \notin C(K)$. The same reasoning holds when $\sigma_{K+1} = K + 1$. So $\sigma$ does not start with $1$ nor end with $K + 1$.

Conversely, if $\sigma$ is a permutation of size $K + 1$ having exactly one descent, that does not start with $1$ nor end with $K + 1$, we claim that $\sigma$ cannot be obtained from $12 \ldots K + 1$ in one duplication-loss step. This is because no duplication-loss step of width $K$ can move both $1$ and $K + 1$ in $12 \ldots K + 1$.

Theorem 9 The class $C(K)$ of permutations obtained from $12 \ldots n$ (for some $n \geq 1$) in one duplication-loss step of width $K$ is a class $S(B)$ of pattern-avoiding permutations whose basis $B$ is composed of $3 + 2^{K-1}$ patterns of size at most $K + 1$. Namely $B = \{321, 3142, 2143\} \cup D$, $D$ being the set of all permutations of $S_{K+1}$ that do not start with $1$ nor end with $K + 1$, and containing exactly one descent.

Example 10 $C(4) = S(321, 3142, 2143, 23451, 23514, 24513, 34512, 25134, 35124, 45123, 51234)$

PROOF. We prove the equivalent statement: $\sigma \notin S(B)$ if and only if $\sigma$ cannot be obtained from an identity permutation in one duplication-loss step of width $K$.

Assume $\sigma \notin S(B)$. Then there exists $b \in B$ such that $b \prec \sigma$. If $b = 321$, $3142$ or $2143$, then $\sigma$ has at least 2 descents and cannot be obtained in one duplication-loss step. Otherwise, using Proposition 8, there exists $\rho \in S_{K+1}$ such that $\rho \prec \sigma$ and $\rho \notin C(K)$. Now if $\sigma$ could be obtained in one duplication-loss step, then so could $\rho$, yielding a contradiction. So $\sigma \notin C(K)$.

Conversely, assume that $\sigma \notin C(K)$. If $\sigma$ contains at least 2 descents, then $\sigma$ contains an occurrence of $321$ or $3142$ or $2143$, since these three are the minimal permutations (in the sense of the relation $\prec$) with 2 descents. Consequently, $\sigma \notin S(B)$. Thus we may assume that $\sigma$ has exactly one descent.

We decompose $\sigma \in S_n$ into $\sigma = 12 \ldots p_1 \, \hat{\sigma} \, p_2 (p_2 + 1) \ldots n$, where $\hat{\sigma}$ is a permutation of the set $\{p_1 + 1, p_1 + 2, \ldots, p_2 - 1\}$ that does not start with $p_1 + 1$ nor end with $p_2 - 1$, and contains exactly one descent. This decomposition is shown in Figure 3. We denote by $\hat{K}$ the size of $\hat{\sigma}$. Since $\sigma \notin C(K)$, necessarily $\hat{K} \geq K + 1$, or we would get a contradiction. If $\hat{K} = K + 1$, we get that $\hat{\sigma}$ is an occurrence of some pattern of $D \subset B$ in $\sigma$. As a consequence, $\sigma \notin S(B)$. What is left to prove is that this extends to the case $\hat{K} > K + 1$.
We just need to show that we can remove elements in $\hat{\sigma}$ without violating any of the properties below:

• the permutation does not start with its smallest element
• the permutation does not end with its greatest element
• the permutation has exactly one descent

until we get a permutation of size $K + 1$. At that point $\hat{\sigma}$ contains an occurrence of a pattern in $D$, and so does $\sigma$, and we get that $\sigma \notin S(B)$.

Now, because of the conditions on $\hat{\sigma}$, the only descent in $\hat{\sigma}$ necessarily goes from the greatest to the smallest element in $\hat{\sigma}$, ensuring that it is possible to remove elements without violating any of the properties above (see Figure 3).

Fig. 3. Decomposition $\sigma = 12 \ldots p_1 \, \hat{\sigma} \, p_2 (p_2 + 1) \ldots n$ on the graphical representation of $\sigma$, and shape of $\hat{\sigma}$

2.3 Permutations obtained in p steps of width K

As for the case of $C(K, 1)$ in Section 2.2, we prove (Theorem 19) in this section that the class $C(K, p)$ of all permutations obtained from an identity permutation after $p$ duplication-loss steps of width at most $K$ is a class of pattern-avoiding permutations. However, we do not get a precise description of the basis of this class, but only an upper bound on the size of the excluded patterns. As in the previous section, when referring to a duplication-loss step, we always mean a duplication-loss step of width $K$, unless explicitly stated otherwise.

To prove the announced result, we will need a few more notations and technical lemmas.

The vector from $i$ to $j$ in a permutation $\sigma$ consists of all elements whose positions lie between the positions of $i$ and $j$, $i$ and $j$ being included. The size of a vector is the number of elements in it. For example, the vector from 7 to 2 in the permutation $4123576$ is $\overleftarrow{2357}$, and has size 4.

Definition 11 Let $\sigma$ be a permutation of $S_n$. The value-position vector associated with $i \in [1..n]$ (vp-vector for short) is the vector of $\sigma$ going from $i$ to $\sigma_i$, if $i$ is not a fixpoint of $\sigma$. In the case $i = \sigma_i$, the vp-vector associated with $i$ is empty.

As this definition suggests, the vp-vector associated with $i$, going from the element of $\sigma$ which has value $i$ to the element of $\sigma$ at position $i$, represents the move necessary for $i$ to reach its position in the sorted permutation $12 \ldots n$. As can be seen in Figure 4, on the graphical representation of permutations used throughout the paper, the vp-vector associated with $i$ is an arrow going horizontally from the element at ordinate $i$ to the diagonal. We can also notice that a non-empty vp-vector contains at least two elements.

To take into account all the moves necessary to sort $\sigma$ to $12 \ldots n$, it is convenient to introduce the value-position domain:

Definition 12 Let $\sigma$ be a permutation of $S_n$. The value-position domain of $\sigma$ (vp-domain for short) is composed of all elements of $\sigma$ appearing in at least one vp-vector.

These two definitions are illustrated in Figure 4.

    σ = 4 1 2 3 5 7 6
    vp-domain of σ = {1, 2, 3, 4, 6, 7}

Fig. 4. vp-vectors and vp-domain for $\sigma = 4123576$, in the usual and in the graphical representations

Now, observe that for any permutation, the vp-vectors are reversible, in the sense that reversing all the arrows gives a set of vectors that represent the moves of elements that are necessary to "unsort" $12 \ldots n$ into $\sigma$.

It is easily seen from Definitions 11 and 12 and this remark that for any permutation $\sigma \in C(K, p)$, any element belonging to the vp-domain of $\sigma$ also belongs to at least one of the duplication-loss steps used to obtain $\sigma$ from $12 \ldots n$. Consequently, the vp-domain of $\sigma$ contains at most $Kp$ elements.
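Definitions 11 and 12 translate directly into code. The sketch below is a minimal illustration with hypothetical function names; permutations are Python lists of the values $1..n$, and a vector is returned as the list of its elements read from left to right.

```python
def vector(sigma, i, j):
    """Elements of sigma whose positions lie between those of the
    values i and j, both included."""
    pos_i, pos_j = sigma.index(i), sigma.index(j)
    lo, hi = min(pos_i, pos_j), max(pos_i, pos_j)
    return sigma[lo:hi + 1]

def vp_vector(sigma, i):
    """vp-vector associated with i: from the value i to sigma_i
    (the element at position i); empty if i is a fixpoint."""
    target = sigma[i - 1]  # sigma_i, positions being 1-based
    if target == i:
        return []
    return vector(sigma, i, target)

def vp_domain(sigma):
    """Set of all elements appearing in at least one vp-vector."""
    domain = set()
    for i in range(1, len(sigma) + 1):
        domain.update(vp_vector(sigma, i))
    return domain

sigma = [4, 1, 2, 3, 5, 7, 6]  # the permutation of Figure 4
```

On the permutation of Figure 4, `vector(sigma, 7, 2)` is the vector of size 4 given in the text, and `vp_domain(sigma)` is $\{1, 2, 3, 4, 6, 7\}$: only the fixpoint 5 lies outside the domain.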
Lemma 13 Consider a permutation $\sigma$, and the permutation $\tau$ obtained from $\sigma$ by the removal of some element $j$. Then for any element $i \neq j$ such that $i \neq \sigma_i$, either this element becomes a fixpoint in $\tau$, or the size of the vp-vector associated with this element in $\tau$ remains constant, increases by 1 or decreases by 1 with respect to the size of the vp-vector associated with $i$ in $\sigma$.

PROOF. It is easily seen on the graphical representation of $\sigma$. Any element that does not lie just above or just below the diagonal cannot become a fixpoint when removing an element $j$. For elements that do not become fixpoints, the horizontal distance to the diagonal can only change by $0$, $+1$ or $-1$ when removing some element $j$ (see Figure 5).

Fig. 5. Variation of the size of vp-vectors due to the removal of an element $j$ above or below the diagonal.

Lemma 14 For any permutation $\sigma$, there is at least one element $j$ such that the permutation $\tau$ obtained from $\sigma$ by the removal of $j$ contains at most one more fixpoint than $\sigma$.

PROOF. It is convenient to introduce the quasi-diagonal elements of $\sigma$, defined as follows: $i$ is a quasi-diagonal element of $\sigma$ if $\sigma_{i-1} = i$ or $\sigma_{i+1} = i$. These two cases correspond respectively to elements of $\sigma$ lying just above or just below the diagonal in the graphical representation of $\sigma$. Any element of $\sigma$ that may become a fixpoint in $\tau$ is necessarily a quasi-diagonal element.

If there is no quasi-diagonal element, then we can remove any element $j$ to obtain a permutation $\tau$ that does not have more fixpoints than $\sigma$. If there are some, then we pick $j$ among the quasi-diagonal elements. We claim that at most one fixpoint is created while removing $j$. The argument is simple.
Suppose $j$ is such that $\sigma_{j-1} = j$, the other case being similar. Then the only fixpoint that may appear is $j - 1$, if $\sigma_j = j - 1$. This should appear clearly in Figure 6.

Fig. 6. The only fixpoint that can appear when removing a quasi-diagonal element.

Lemma 15 Consider a permutation $\sigma \notin C(K, p)$ such that for any strict pattern $\tau$ of $\sigma$, $\tau \in C(K, p)$. Then the vp-domain of $\sigma$ is of size at most $2Kp + 2$.

PROOF. By Lemma 14, we can choose some $\tau \prec \sigma$ with $|\tau| + 1 = |\sigma|$ and such that $\tau$ has at most one more fixpoint than $\sigma$. Call $j$ the element deleted in $\sigma$ to obtain $\tau$. By a previous remark, since $\tau \in C(K, p)$, the vp-domain of $\tau$ is of size at most $Kp$, and is therefore composed of at most $Kp$ vp-vectors. Each of these vp-vectors in $\tau$ yields a vp-vector in $\sigma$, whose size is smaller, equal, or possibly increased by 1. Let us denote by $\overrightarrow{V}$ the set of vp-vectors of $\sigma$ obtained from a vp-vector of $\tau$. Then the number of elements of $\sigma$ that belong to a vp-vector of $\overrightarrow{V}$ is at most $2Kp$.

However, $\overrightarrow{V}$ is not yet the vp-domain of $\sigma$. We must complete it with up to two vp-vectors: the one associated with the deleted element $j$, and the one associated with the fixpoint of $\tau$ that was not a fixpoint in $\sigma$, if such a point exists. If such an element exists, then it is a quasi-diagonal element in $\sigma$ and its vp-vector (denoted $\vec{v}$) in $\sigma$ is necessarily of size 2, so that $\overrightarrow{V} \cup \{\vec{v}\}$ has total size at most $2Kp + 2$. Now it is easily observed that any element of $\sigma$ belonging to one vp-vector necessarily belongs to at least two vp-vectors (this can be seen as a "balance condition").
Consequently, all the elements of the vp-vector associated with $j$ are already covered by a vector of $\overrightarrow{V} \cup \{\vec{v}\}$, so that the vp-domain of $\sigma$ is exactly the set of elements covered by $\overrightarrow{V} \cup \{\vec{v}\}$. Therefore, its size is at most $2Kp + 2$.

Lemma 16 Consider a permutation $\sigma \notin C(K, p)$ of size $n > (Kp + 2)^2 - 2$ such that for any strict pattern $\tau$ of $\sigma$, $\tau \in C(K, p)$. Then $\sigma$ is of the form $\sigma = I \, i (i + 1) \ldots (i + Kp) \, J$ with $I$ a permutation of $[1..i - 1]$ and $J$ a permutation of $[i + Kp + 1..n]$. It is possible that $I$ or $J$ is empty.

PROOF. By Lemma 15, the vp-domain of $\sigma$ is of size at most $2Kp + 2$. We can decompose $\sigma$ into free windows of consecutive elements outside the vp-domain of $\sigma$, separated by windows of consecutive elements of the vp-domain. Now, there are at most $Kp + 1$ windows of consecutive elements of the vp-domain, and consequently, there are at most $Kp + 2$ free windows in $\sigma$.

Since $\sigma$ is of size $n > (Kp + 2)^2 - 2 = (Kp + 2)Kp + 2Kp + 2$, at least one of the free windows of $\sigma$ has size strictly greater than $Kp$, i.e., contains at least $Kp + 1$ elements. By definition, these elements do not belong to the vp-domain of $\sigma$, and hence they allow the decomposition of $\sigma$ into $\sigma = I \, i (i + 1) \ldots (i + Kp) \, J$ with $I$ a permutation of $[1..i - 1]$ and $J$ a permutation of $[i + Kp + 1..n]$.

Figure 7 represents the decomposition of $\sigma$ used in this proof.

Fig. 7. Proof of Lemma 16: $\sigma$ decomposes into at most $Kp + 1$ vp-windows and at most $Kp + 2$ free windows

Lemma 17 Consider a permutation $\sigma = \sigma' (j + 1)(j + 2) \ldots n$ where $\sigma'$ is a permutation of $[1..j]$. If $\sigma$ is obtainable after $p$ duplication steps of size at most $K$, then $\sigma$ is obtainable after $p$ duplication steps of size at most $K$ such that the duplicated window for each step does not intersect $[j + 1..n]$.

PROOF. The key idea is to consider the first sequence $s_1, s_2, \ldots, s_p$ of duplication-loss steps and create a new sequence $s'_1, s'_2, \ldots, s'_p$ such that:

• Each step $s'_i$ concerns only elements of $[1..j]$.
• After every step $s'_i$, the elements $1, 2, \ldots, j$ are in the same order as after performing steps $s_1, s_2, \ldots, s_i$.

Then the proof is by induction on the number of steps. If there is only one step, then the proof is straightforward. Suppose now that the above statement is true up to $p - 1$ steps. Then for the last step, we use our hypothesis for $p - 1$, so that we have operations $s'_1, s'_2, \ldots, s'_{p-1}$ respecting the above conditions. For $s'_p$, only notice that the elements of $[1..j]$ involved in $s_p$ are also in a window of size $K$ in the permutation obtained after $s'_{p-1}$, and in the same relative order by our induction hypothesis, which proves the existence of $s'_p$.

Using these lemmas, we state and prove a key proposition:

Proposition 18 Consider a permutation $\sigma \notin C(K, p)$. Then either $\sigma$ is of size at most $(Kp + 2)^2 - 2$, or there exists a strict pattern $\tau$ of $\sigma$ that does not belong to $C(K, p)$.

PROOF. Consider a permutation $\sigma \notin C(K, p)$ such that any strict pattern $\tau$ of $\sigma$ belongs to $C(K, p)$. We want to show that $\sigma$ is of size $n \leq (Kp + 2)^2 - 2$. Let us assume the contrary. By Lemma 16, there exist $i \in [1..n]$, $I$ a permutation of $[1..i - 1]$ and $J$ a permutation of $[i + Kp + 1..n]$ such that $\sigma = I \, i (i + 1) \ldots (i + Kp) \, J$. Let us denote by $\hat{\sigma}$ the permutation $\hat{\sigma} = I \, i (i + 1) \ldots (i + Kp - 1)(J - 1)$, where $(J - 1)$ is the permutation of $[i + Kp..n - 1]$ obtained from $J$ by subtracting 1 from every element of $J$. $\hat{\sigma}$ is a strict pattern of $\sigma$, hence $\hat{\sigma} \in C(K, p)$. Consider a shortest sequence of duplication-loss steps of width at most $K$ that produces $\hat{\sigma}$ from $12 \ldots (n - 1)$. This sequence has at most $p$ steps, each of width at most $K$.
It implies that the total distance crossed by the elements that are duplicated is at most $Kp$. Consequently, it is not possible to bring an element of $I$ and an element of $J - 1$ into two consecutive positions. So the duplication-loss steps of the scenario we consider are necessarily internal to $I$ and $J - 1$. We can reproduce these steps in $I$ and $J$ to obtain $\sigma$ from $12 \ldots n$ in at most $p$ duplication-loss steps of width at most $K$, contradicting that $\sigma \notin C(K, p)$.

It is then quite easy to prove Theorem 19:

Theorem 19 The class $C(K, p)$ of all permutations obtained from an identity permutation after $p$ duplication-loss steps of width at most $K$ is a class of pattern-avoiding permutations whose basis is finite and contains only patterns of size at most $(Kp + 2)^2 - 2$.

PROOF. We set $B = \{\pi : \pi \notin C(K, p) \text{ and } |\pi| \leq (Kp + 2)^2 - 2\}$ and show that $S(B) = C(K, p)$.

Consider $\sigma \notin C(K, p)$. If $|\sigma| \leq (Kp + 2)^2 - 2$, then $\sigma \in B$ and $\sigma \notin S(B)$. Otherwise, if $|\sigma| > (Kp + 2)^2 - 2$, then by Proposition 18, there exists a strict pattern $\tau$ of $\sigma$ that does not belong to $C(K, p)$. Reasoning by induction on the size of the permutations, we deduce from $\tau \notin C(K, p)$ that $\tau \notin S(B)$. A direct consequence is that $\sigma \notin S(B)$. This proves that $S(B) \subseteq C(K, p)$.

Conversely, consider $\sigma \in C(K, p)$. Then any pattern $\tau$ of $\sigma$ is also obtainable from an identity permutation in at most $p$ steps of width at most $K$ (using the sequence of duplication-loss steps associated with $\sigma$), i.e., $\tau \in C(K, p)$. Then $\sigma$ does not contain an occurrence of any pattern of $B$, i.e., $\sigma \in S(B)$. This shows that $C(K, p) \subseteq S(B)$, ending the proof of the theorem.
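The results of this section can be checked by exhaustive search on small sizes. The sketch below is an illustration only, with hypothetical function names: `reachable(n, K, p)` computes the permutations of size $n$ in $C(K, p)$ by applying every possible step of width at most $K$ (a step that loses the whole second copy is allowed, so fewer than $p$ effective steps are covered), and `avoiders` computes $S_n(B)$ by brute force.

```python
from itertools import combinations, permutations

def one_step_results(perm, K):
    """All results of one duplication-loss step of width <= K on the tuple perm."""
    n, results = len(perm), {perm}
    for width in range(2, min(K, n) + 1):
        for start in range(n - width + 1):
            frag = perm[start:start + width]
            for r in range(1, width):
                for kept in combinations(frag, r):  # genes kept in the first copy
                    rest = tuple(g for g in frag if g not in kept)
                    results.add(perm[:start] + kept + rest + perm[start + width:])
    return results

def reachable(n, K, p):
    """Permutations of size n obtained from 12...n in p steps of width <= K."""
    current = {tuple(range(1, n + 1))}
    for _ in range(p):
        current = set().union(*(one_step_results(s, K) for s in current))
    return current

def contains(sigma, pi):
    """True if pi is a pattern of sigma (brute force over subsequences)."""
    k = len(pi)
    return any(all((sub[a] < sub[b]) == (pi[a] < pi[b])
                   for a in range(k) for b in range(a + 1, k))
               for sub in combinations(sigma, k))

def avoiders(n, basis):
    """Permutations of size n avoiding every pattern of the basis."""
    return {s for s in permutations(range(1, n + 1))
            if not any(contains(s, b) for b in basis)}

def basis_one_step(K):
    """Basis B of Theorem 9: {321, 3142, 2143} union D."""
    D = [s for s in permutations(range(1, K + 2))
         if s[0] != 1 and s[-1] != K + 1
         and sum(a > b for a, b in zip(s, s[1:])) == 1]
    return [(3, 2, 1), (3, 1, 4, 2), (2, 1, 4, 3)] + D

def size_bound(K, p):
    """Upper bound (Kp + 2)^2 - 2 on the size of excluded patterns (Theorem 19)."""
    return (K * p + 2) ** 2 - 2
```

For instance, `reachable(5, 3, 1) == avoiders(5, basis_one_step(3))` compares both sides of Theorem 9 for $K = 3$ and $n = 5$, and `size_bound(2, 2)` evaluates to 34. The search is exponential and only meant for very small $n$, $K$ and $p$.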
3 Number of steps of width K to obtain any permutation of size n

The whole genome duplication - random loss model is studied in (4), and the authors describe a method to compute an optimal duplication-loss scenario, i.e., a scenario of duplications (of the whole genome in this case) and losses whose number of steps is minimal. Our model with bounded-size duplication operations reduces to the whole genome duplication - random loss case when $K = n$, and thus to a radix-sort algorithm as shown in (4), and to a bubble-sort when $K = 2$. We therefore give an algorithm whose complexity matches the two extremal cases and shows some continuity between the two sorting algorithms.

It is worth noticing that any scenario in our model can be viewed as a whole genome duplication - random loss scenario. Consequently, the number of steps of an optimal whole genome duplication - random loss scenario is a lower bound on the number of steps of an optimal scenario in our duplication-loss model. It is also easy to see that, when considering permutations of size at most $K$, our model and the whole genome duplication - random loss model coincide. Indeed, we will use for our purpose the procedure of (4), which is given in Algorithm 1. We omit the proof of correctness and optimality of this algorithm; see (4) for details.

Algorithm 1 An optimal whole genome duplication - random loss scenario from $12 \ldots K$ to $\sigma \in S_K$
1: $\pi = 12 \ldots K$
2: Partition $\sigma$ into maximal increasing substrings, from left to right
3: Each element of $[1..K]$ appearing in the $i$th maximal increasing substring gets as a label the binary representation of $i$
4: for $j = 1$ to $\lceil \log_2(desc(\sigma) + 1) \rceil$ do
5:   Perform a duplication-loss step on $\pi$ that keeps in the first copy of $\pi$ exactly the elements whose label has a 0 in its $j$th least significant bit
6: end for

In order to examine every bit of the labels given to the elements of $[1..K]$, the number of steps in the loop on line 4 is $\lceil \log_2(\text{number of maximal increasing substrings of } \sigma) \rceil = \lceil \log_2(desc(\sigma) + 1) \rceil$.

A consequence is that the number of steps in an optimal whole genome duplication - random loss scenario from $12 \ldots n$ to $\sigma$ is $\Theta(\log n)$ in the worst case and on average (see equation (1) for the average case).

Note that the same algorithm can be used to compute an optimal whole genome duplication - random loss scenario from $i_1 i_2 \ldots i_k$, with $k \le K$ and $i_1 < i_2 < \ldots < i_k$, to any permutation of $\{i_1, i_2, \ldots, i_k\}$.

3.1 Upper bound

In this section, we provide an algorithm that computes, for any input permutation $\sigma \in S_n$, a possible scenario of duplications and losses to obtain $\sigma$ from $12 \ldots n$. We restrict ourselves to duplication-loss steps of width at most $K$, so that the number of duplication-loss steps corresponds to the cost of the scenario in our cost model. We are interested in the number of duplication-loss steps of the scenario produced by the algorithm, in the worst case and on average. It provides an upper bound on the number of duplication-loss steps that are necessary to obtain a permutation.

The algorithm we use is described in Algorithm 2. The following remarks are a few keys to understanding it.

Algorithm 2 A duplication-loss scenario from $12 \ldots n$ to $\sigma \in S_n$
1: $\pi \leftarrow 12 \ldots n$
2: for $i = 1$ to $\left\lceil \frac{n - K}{\lfloor K/2 \rfloor} \right\rceil$ do
3:   Let $L_i = \{\sigma_j : n - i\lfloor K/2 \rfloor + 1 \le j \le n - (i - 1)\lfloor K/2 \rfloor\}$
4:   Perform duplication-loss steps on $\pi$ to move from left to right the elements of $L_i$ to the positions $n - i\lfloor K/2 \rfloor + 1$ to $n - (i - 1)\lfloor K/2 \rfloor$ of $\pi$, without changing their respective order
5: end for
6: for $i = 1$ to $\left\lceil \frac{n - K}{\lfloor K/2 \rfloor} \right\rceil$ do
7:   Perform Algorithm 1 on the window of $\pi$ between the indices $n - i\lfloor K/2 \rfloor + 1$ and $n - (i - 1)\lfloor K/2 \rfloor$
8: end for
9: Perform Algorithm 1 on the window of $\pi$ between the indices 1 and $n - \left\lceil \frac{n - K}{\lfloor K/2 \rfloor} \right\rceil \lfloor K/2 \rfloor$

The set $L_i$ of values defined at line 3 represents the rightmost $\lfloor K/2 \rfloor$ elements of $\sigma$ not yet examined. The algorithm consists of two different loops, the first one corresponding to lines 2 to 5 of the algorithm and the second one to lines 6 to 8. At the end of the first loop (line 5), $\pi$ is decomposed into windows of width $\lfloor K/2 \rfloor$ (except the leftmost one, which is of width at most $K$), and each of these windows is an increasing sequence containing exactly the same elements as the window of $\sigma$ corresponding to the same indices. In the second loop, we consider these windows from right to left and, since they are of width at most $K$, we can call Algorithm 1 (which implements whole genome duplication - random loss) on each window successively to transform $\pi$ into $\sigma$.

An example is given with $\sigma = 2\ 10\ 1\ 7\ 6\ 5\ 8\ 9\ 3\ 4$ and $K = 6$. We first cut $\sigma$ in chunks of size $\lfloor K/2 \rfloor = 3$ and obtain 2 10 1 | 7 6 5 | 8 9 3 | 4. Then the first loop of the algorithm (lines 2 to 5) starts from 1 2 3 4 5 6 7 8 9 10 and takes the elements in increasing order to the same chunk they belong to in $\sigma$. This gives 1 2 10 | 5 6 7 | 3 8 9 | 4. Then the second loop sorts each chunk separately to obtain $\sigma$ using the radix sort Algorithm 1 introduced in (4).
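The two algorithms above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dup_loss_step`, `algorithm1` and `algorithm2` are hypothetical; labels are taken 0-based (label $i - 1$ for the $i$th maximal increasing substring), an assumption made here so that $\lceil \log_2(desc(\sigma) + 1) \rceil$ bits suffice; windows are aligned from the right, as in line 3 of Algorithm 2; and line 4 is realized by one particular strategy, sliding a window of width $K$ to the right by $\lceil K/2 \rceil$ positions per step.

```python
def dup_loss_step(pi, start, width, keep_first):
    """One duplication-loss step on pi[start:start+width]: elements of
    keep_first survive in the first copy, the others in the second copy."""
    window = pi[start:start + width]
    first = [x for x in window if x in keep_first]
    second = [x for x in window if x not in keep_first]
    return pi[:start] + first + second + pi[start + width:]


def algorithm1(pi, lo, hi, target):
    """Radix-sort the increasing window pi[lo:hi] into `target` (Algorithm 1).
    Returns the new permutation and the number of steps used."""
    labels, run = {}, 0
    for pos, x in enumerate(target):
        if pos > 0 and target[pos - 1] > x:
            run += 1                      # a descent starts a new substring
        labels[x] = run                   # 0-based substring index (assumption)
    steps = run.bit_length()              # equals ceil(log2(desc(target) + 1))
    for j in range(steps):
        keep = {x for x in target if not (labels[x] >> j) & 1}
        pi = dup_loss_step(pi, lo, hi - lo, keep)
    return pi, steps


def algorithm2(sigma, K):
    """Sketch of Algorithm 2: returns (final permutation, number of steps)."""
    n, k = len(sigma), K // 2
    pi = list(range(1, n + 1))
    m = max(0, -(-(n - K) // k))          # ceil((n - K) / floor(K/2))
    steps = 0
    # First loop (lines 2-5): bring each L_i, in increasing order, to its window.
    for i in range(1, m + 1):
        hi = n - (i - 1) * k              # exclusive 0-based end of the window
        L = set(sigma[hi - k:hi])
        s = min(pi.index(x) for x in L)   # leftmost element of L_i in pi
        while True:
            s = min(s, hi - K)
            keep = set(pi[s:s + K]) - L   # elements of L_i go to the second copy
            new_pi = dup_loss_step(pi, s, K, keep)
            steps += new_pi != pi         # do not count steps that change nothing
            pi = new_pi
            if s + K == hi:
                break
            s += K - k                    # slide right by ceil(K/2) positions
    # Second loop (lines 6-8) and leftmost window (line 9).
    for i in range(1, m + 1):
        hi = n - (i - 1) * k
        pi, t = algorithm1(pi, hi - k, hi, sigma[hi - k:hi])
        steps += t
    left = n - m * k
    pi, t = algorithm1(pi, 0, left, sigma[:left])
    return pi, steps + t
```

With right-aligned windows, $\sigma = 2\ 10\ 1\ 7\ 6\ 5\ 8\ 9\ 3\ 4$ and $K = 6$ give the windows 2 10 1 7 | 6 5 8 | 9 3 4; `algorithm2([2, 10, 1, 7, 6, 5, 8, 9, 3, 4], 6)` then rebuilds $\sigma$ in 6 steps, 3 in the first loop and 3 in the second.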
Notice here that we use in the second loop (except for the leftmost window) only duplication-loss steps of width $\lfloor K/2 \rfloor$. An improvement we considered is to use whole genome duplication - random loss scenarios on windows of width $K$ that are nonetheless increasing sequences. Unfortunately, we were not able to analyse how many duplication-loss steps there are in a scenario produced by such an algorithm.

We now analyse the number of steps of the scenario produced by Algorithm 2.

Proposition 20 The number of duplication-loss steps of a scenario produced by Algorithm 2 on a permutation of size $n$ is at most $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right)$ asymptotically.

PROOF. Suppose we are at iteration $i$ of the first loop. We have to move the $\lfloor K/2 \rfloor$ elements of $L_i$ to their positions (from $n - i\lfloor K/2 \rfloor + 1$ to $n - (i - 1)\lfloor K/2 \rfloor$) by duplication-loss steps of width at most $K$. The worst situation is when the elements of $L_i$ are at the beginning of $\pi$. But in this case, we can move the elements of $L_i$ to the right by $\lceil K/2 \rceil$ positions at every duplication-loss step, until they reach their position. The total number of duplication-loss steps in this first process is then at most

\[ \sum_{i=1}^{\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil} \left\lceil \frac{n - i\lfloor K/2 \rfloor}{\lceil K/2 \rceil} \right\rceil = \Theta\left(\frac{n^2}{K^2}\right). \]

Consider now the second loop of Algorithm 2. In each window of size $\lfloor K/2 \rfloor$, it performs at most $\lceil \log \lfloor K/2 \rfloor \rceil$ duplication-loss steps (line 7) and in the leftmost window (line 9), at most $\lceil \log K \rceil$, by the result of (4). Consequently the number of duplication-loss steps produced by the second loop is

\[ \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \lceil \log \lfloor K/2 \rfloor \rceil + \lceil \log K \rceil = \Theta\left(\frac{n}{K} \log K\right). \]

We finally get that the total number of duplication-loss steps in a scenario produced by Algorithm 2 is at most $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right)$ asymptotically in the worst case.

It is easily noticed that this worst case corresponds to the reversed identity permutation $n (n-1) \ldots 2 1$. This corresponds to our intuition of a worst case situation in this context.

We can also notice that $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right) = \Theta\left(\frac{n^2}{K^2}\right)$ for "small" values of $K$, namely as long as $K = o\left(\frac{n}{\log n}\right)$. If on the contrary $\frac{n}{\log n} = o(K)$, then $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right) = \Theta\left(\frac{n}{K} \log K\right)$. When $K = \Theta\left(\frac{n}{\log n}\right)$, the two terms are of the same order.

We can also compute the average number of duplication-loss steps of a scenario produced by Algorithm 2.

Proposition 21 The number of duplication-loss steps of a scenario produced by Algorithm 2 on a permutation of size $n$ is on average $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right)$ asymptotically.

PROOF. First, we introduce a few notations. Consider a permutation $\sigma$ of size $n$, and decompose it from right to left into $p = \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil + 1$ windows of width $\lfloor K/2 \rfloor$, except the leftmost one, whose width is $n - \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \lfloor K/2 \rfloor \le K$. We denote this decomposition by $\sigma = \sigma_1 \sigma_2 \ldots \sigma_p$.

Now, let us denote by $c(\sigma)$ the number of duplication-loss steps produced in the first loop of Algorithm 2 on $\sigma$. In particular, we denote by $c_p(\sigma)$ the number of such steps produced by the first iteration of this loop, i.e., the number of steps to move the elements of $L_1$ to the end of the permutation.

For computing the average number of such steps, we consider $u_n = \sum_{\sigma \in S_n} c(\sigma)$. It is simple to see that

\[
\begin{aligned}
u_n &= \sum_{\sigma \in S_n} \big( c_p(\sigma) + c(\sigma_1 \ldots \sigma_{p-1}) \big) \\
&= \sum_{\sigma \in S_n} c_p(\sigma) + n(n-1) \cdots (n - \lfloor K/2 \rfloor + 1) \sum_{\sigma \in S_{n - \lfloor K/2 \rfloor}} c(\sigma) \\
&= \sum_{\sigma \in S_n} c_p(\sigma) + \frac{n!}{(n - \lfloor K/2 \rfloor)!}\, u_{n - \lfloor K/2 \rfloor}.
\end{aligned}
\]

Let us focus on $\sum_{\sigma \in S_n} c_p(\sigma)$. Figure 8 should convince the reader that

\[ \frac{n + 1 - \lfloor K/2 \rfloor - \min(\sigma_p)}{K} \le c_p(\sigma) \le \frac{n + 1 - \lfloor K/2 \rfloor - \min(\sigma_p)}{\lfloor K/2 \rfloor}. \]

Fig. 8. Bounding $c_p(\sigma)$ (figure not reproduced; it depicts the positions $1, 2, \ldots, n$ with $\min(\sigma_p)$ marked, the $\lfloor K/2 \rfloor$ rightmost positions, and the $n + 1 - \min(\sigma_p)$ elements from $\min(\sigma_p)$ onwards).

Now, we can notice that the number of permutations $\sigma$ of size $n$ such that $\min(\sigma_p) = i$ is $\binom{n-i}{\lfloor K/2 \rfloor - 1} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!$. This yields

\[
\begin{aligned}
\sum_{\sigma \in S_n} \big( n + 1 - \lfloor K/2 \rfloor - \min(\sigma_p) \big)
&= \sum_{i=1}^{n - \lfloor K/2 \rfloor + 1} (n + 1 - \lfloor K/2 \rfloor - i) \binom{n-i}{\lfloor K/2 \rfloor - 1} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \sum_{i = \lfloor K/2 \rfloor - 1}^{n-1} (i + 1 - \lfloor K/2 \rfloor) \binom{i}{\lfloor K/2 \rfloor - 1} \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!\, \lfloor K/2 \rfloor \sum_{i = \lfloor K/2 \rfloor}^{n-1} \binom{i}{\lfloor K/2 \rfloor} \\
&= (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor!\, \lfloor K/2 \rfloor \binom{n}{\lfloor K/2 \rfloor + 1}.
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
\sum_{\sigma \in S_n} c_p(\sigma) &\le (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \binom{n}{\lfloor K/2 \rfloor + 1}, \\
\sum_{\sigma \in S_n} c_p(\sigma) &\ge \frac{\lfloor K/2 \rfloor}{K} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \binom{n}{\lfloor K/2 \rfloor + 1} \ge \frac{1}{3} (n - \lfloor K/2 \rfloor)!\, \lfloor K/2 \rfloor! \binom{n}{\lfloor K/2 \rfloor + 1},
\end{aligned}
\]

giving after a few computations

\[ \frac{1}{3} \cdot \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + \frac{u_{n - \lfloor K/2 \rfloor}}{(n - \lfloor K/2 \rfloor)!} \le \frac{u_n}{n!} \le \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + \frac{u_{n - \lfloor K/2 \rfloor}}{(n - \lfloor K/2 \rfloor)!}. \]

Therefore, we consider two sequences $(v_n)$ and $(w_n)$ satisfying the relations $v_n = \frac{1}{3} \cdot \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + v_{n - \lfloor K/2 \rfloor}$ and $w_n = \frac{n - \lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + w_{n - \lfloor K/2 \rfloor}$ respectively if $n > K$, and $v_n = w_n = \frac{u_n}{n!}$ for any $n \le K$. Then we have $v_n \le \frac{u_n}{n!} \le w_n$ for all $n \in \mathbb{N}$.

We can solve the recurrence equations for $v_n$ and $w_n$; if we write $n = \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \lfloor K/2 \rfloor + r$ (then $\lfloor K/2 \rfloor \le r \le K$), we get:

\[
\begin{aligned}
v_n &= \frac{1}{3} \sum_{i=1}^{\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil} \frac{n - i\lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + v_r \\
&= \frac{1}{3(\lfloor K/2 \rfloor + 1)} \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \left( n - \lfloor K/2 \rfloor \frac{\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil + 1}{2} \right) + v_r = \Theta\left(\frac{n^2}{K^2}\right)
\end{aligned}
\]

and

\[ w_n = \sum_{i=1}^{\left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil} \frac{n - i\lfloor K/2 \rfloor}{\lfloor K/2 \rfloor + 1} + w_r = \Theta\left(\frac{n^2}{K^2}\right). \]

Consequently, the average number of duplication-loss steps produced by the first loop of Algorithm 2 on permutations of size $n$ is $\frac{u_n}{n!} = \Theta\left(\frac{n^2}{K^2}\right)$.

What is left to compute is the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$.
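This number is given by

\[
\begin{aligned}
\frac{1}{n!} \sum_{\sigma \in S_n} \sum_{i=1}^{p} \lceil \log(desc(\sigma_i) + 1) \rceil
&= \frac{1}{n!} \left( \sum_{i=2}^{p} \sum_{\sigma \in S_n} \lceil \log(desc(\sigma_i) + 1) \rceil + \sum_{\sigma \in S_n} \lceil \log(desc(\sigma_1) + 1) \rceil \right) \\
&= \frac{1}{n!} \left( \sum_{i=2}^{p} (n - \lfloor K/2 \rfloor)! \binom{n}{\lfloor K/2 \rfloor} \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(desc(\sigma) + 1) \rceil + (n - |\sigma_1|)! \binom{n}{|\sigma_1|} \sum_{\sigma \in S_{|\sigma_1|}} \lceil \log(desc(\sigma) + 1) \rceil \right) \\
&= \frac{1}{\lfloor K/2 \rfloor!} (p - 1) \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(desc(\sigma) + 1) \rceil + \frac{1}{|\sigma_1|!} \sum_{\sigma \in S_{|\sigma_1|}} \lceil \log(desc(\sigma) + 1) \rceil.
\end{aligned}
\]

Since $p = \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil + 1$, we deduce that the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$ is $\Theta\left( \frac{1}{\lfloor K/2 \rfloor!} \left( \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil + 1 \right) \sum_{\sigma \in S_{\lfloor K/2 \rfloor}} \lceil \log(desc(\sigma) + 1) \rceil \right)$.

Hence we focus on the computation of $\frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(desc(\sigma) + 1) \rceil$ for $k = \lfloor K/2 \rfloor$. Reversing a permutation of $S_k$ exchanges its descents and its ascents, so $desc(\sigma)$ and $k - 1 - desc(\sigma)$ have the same distribution over $S_k$; in particular $\frac{1}{k!} \sum_{\sigma \in S_k} (desc(\sigma) + 1) = \frac{k+1}{2}$, and at least half of the permutations of $S_k$ satisfy $desc(\sigma) + 1 \ge \frac{k+1}{2}$. Since $\log(desc(\sigma) + 1) \ge 0$ for every $\sigma$, we get that

\[ \frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(desc(\sigma) + 1) \rceil \ge \frac{1}{k!} \sum_{\sigma \in S_k} \log(desc(\sigma) + 1) \ge \frac{1}{2} \log\left(\frac{k+1}{2}\right). \]

Moreover, it is clear that

\[ \frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(desc(\sigma) + 1) \rceil \le \lceil \log(k) \rceil, \]

so that we deduce that

\[ \frac{1}{k!} \sum_{\sigma \in S_k} \lceil \log(desc(\sigma) + 1) \rceil = \Theta(\log(k)). \qquad (1) \]

Consequently, the average number of duplication-loss steps produced by the second loop of Algorithm 2 on permutations of size $n$ is $\Theta\left( \left\lceil \frac{n-K}{\lfloor K/2 \rfloor} \right\rceil \log(\lfloor K/2 \rfloor) \right) = \Theta\left(\frac{n}{K} \log K\right)$.

Finally, we end the proof by concluding that the total number of duplication-loss steps in a scenario produced by Algorithm 2 on a permutation of size $n$ is $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right)$ on average.

3.2 Lower bound

It is possible to provide very simple lower bounds on the number of duplication-loss steps necessary to obtain a permutation. These lower bounds are given and proved in Propositions 22 and 23 below. They are tight in most cases, though not in all of them.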
Indeed, the upper and lower bounds coincide up to a constant factor whenever $K$ is a constant, or when $K = K(n)$, except when $\frac{n}{\log n} \ll K(n) \ll n$.

Proposition 22 In the worst case, $\Omega\left(\log n + \frac{n^2}{K^2}\right)$ duplication-loss steps of width $K$ are necessary to obtain a permutation of $S_n$ from $123 \ldots n$.

PROOF. Let us consider first the number of inversions that a duplication-loss step $s$ of width $K$ can create in a permutation. It is easily seen that each of these new inversions involves two elements of the window duplicated by $s$. Call $i$ the number of elements of $s$ that are kept in the first copy. Then the maximum number of inversions that can be created by $s$ is $i(K - i) \le \frac{K^2}{4}$. Now, a permutation $\sigma \in S_n$ has up to $\frac{n(n-1)}{2}$ inversions, so that at least $\frac{2n(n-1)}{K^2}$ duplication-loss steps are necessary to transform $123 \ldots n$ into $\sigma$.

To get the other term of the lower bound, we just refer to the result of (4) recalled at the beginning of this section, namely that $\log n$ steps are necessary in the worst case in the whole genome duplication - random loss model, in which duplication-loss operations are less restricted.

Finally, we get a lower bound of $\Omega\left(\log n + \frac{n^2}{K^2}\right)$ necessary duplication-loss steps to obtain a permutation of $S_n$ from $123 \ldots n$ in the worst case.

Proposition 23 On average, $\Omega\left(\log n + \frac{n^2}{K^2}\right)$ duplication-loss steps of width $K$ are necessary to obtain a permutation of $S_n$ from $123 \ldots n$.

PROOF. As before, a duplication-loss step can create at most $\frac{K^2}{4}$ inversions in a permutation. But the average number of inversions in a permutation of $S_n$ is $\frac{n(n-1)}{4}$, so that on average at least $\frac{n(n-1)}{K^2}$ duplication-loss steps are necessary to transform $123 \ldots n$ into $\sigma \in S_n$.
Again, (4) provides us with the $\Omega(\log n)$ lower bound, referring to the whole genome duplication - random loss model, which is more general than ours, so that this bound applies in our context.

We conclude that a lower bound on the average number of duplication-loss steps necessary to obtain a permutation of $S_n$ from $123 \ldots n$ is $\Omega\left(\log n + \frac{n^2}{K^2}\right)$.

4 Conclusion

We discuss the results of Section 3 on the average (or worst case) number of steps of width $K$ to obtain a permutation of size $n$. We could not provide lower bounds that always coincide with the upper bounds given by our algorithm, but we claim that they are tight in many cases. Indeed, whenever $K = o\left(\frac{n}{\log n}\right)$, we get that $\frac{n}{K} \log K = o\left(\frac{n^2}{K^2}\right)$, and consequently the upper bound can be rewritten as $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right) = \Theta\left(\frac{n^2}{K^2}\right)$, which coincides up to a constant factor with the lower bound $\Omega\left(\log n + \frac{n^2}{K^2}\right) = \Omega\left(\frac{n^2}{K^2}\right)$. For the case $K = \Theta\left(\frac{n}{\log n}\right)$, the same argument holds, but the constant factor between the lower and the upper bound might be much greater. Finally, if $K = \Theta(n)$, then $\Theta\left(\frac{n}{K} \log K + \frac{n^2}{K^2}\right) = \Theta(\log n)$ and $\Omega\left(\log n + \frac{n^2}{K^2}\right) = \Omega(\log n)$, so that upper and lower bounds coincide again.

On the contrary, when $\frac{n}{\log n} \ll K \ll n$, the upper and lower bounds provided do not coincide. We leave as an open question the problem of finding an algorithm that computes a duplication-loss scenario whose number of steps is optimal (on average and in the worst case) up to a constant factor, when the width $K$ of the duplicated windows satisfies $\frac{n}{\log n} \ll K \ll n$.

Several other questions are still open. First of all, neither of our algorithms is optimal for a specific permutation, and our results are only optimal asymptotically, on average and/or in the worst case.
It could be interesting to provide algorithms that produce optimal duplication-loss scenarios on any permutation $\sigma$, for $K = K(n)$, in order to provide some continuity between the bubble sort (corresponding to $K = 2$) and the radix sort (corresponding to $K(n) = n$).

References

[1] M.H. Albert, R.E.L. Aldred, M.D. Atkinson, H.P. van Ditmarsch, C.C. Handley, D.A. Holton, and D.J. McCaughan. Compositions of pattern restricted sets of permutations. Technical report, University of Otago, 2004. Technical report number OUCS-2004-12.
[2] S. Bérard, A. Bergeron, C. Chauve, and C. Paul. Perfect sorting by reversals is not always difficult. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1), 2007.
[3] M. Bouvel and E. Pergola. Permutations arising in the duplication-loss model. In preparation.
[4] K. Chaudhuri, K. Chen, R. Mihaescu, and S. Rao. On the tandem duplication-random loss model of genome rearrangement. In SODA, pages 564–570, 2006.
[5] M.C. Chen and R.C.T. Lee. Sorting by transpositions based on the first increasing substring concept. In BIBE '04: Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering, page 553, Washington, DC, USA, 2004. IEEE Computer Society.
[6] Anthony Labarre. A new tight upper bound on the transposition distance. In Rita Casadio and Gene Myers, editors, Algorithms in Bioinformatics, 5th International Workshop, WABI 2005, Mallorca, Spain, October 3-6, 2005, Proceedings, volume 3692 of Lecture Notes in Computer Science, pages 216–227. Springer, 2005.
[7] Anthony Labarre. New bounds and tractable instances for the transposition distance. IEEE/ACM Trans. Comput. Biology Bioinform., 3(4):380–394, 2006.
