Global Alignment of Molecular Sequences via Ancestral State Reconstruction

Alexandr Andoni ∗, Constantinos Daskalakis †, Avinatan Hassidim ‡, Sebastien Roch §

October 27, 2021

∗ CSAIL, MIT.
† CSAIL, MIT. costis@mit.edu. Part of this work was done while the author was a postdoctoral researcher at Microsoft Research.
‡ MIT.
§ Department of Mathematics, UCLA. Part of this work was done while the author was a postdoctoral researcher at Microsoft Research.

Abstract

Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we open a new direction on this topic by considering a version of the multiple sequence alignment in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining information-theoretically optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.

In the TRPT problem we begin with a random sequence $x_1, \ldots, x_k$ at the root of a $d$-ary tree. If vertex $v$ has the sequence $y_1, \ldots, y_{k_v}$, then each one of its $d$ children will have a sequence which is generated from $y_1, \ldots, y_{k_v}$ by flipping three biased coins for each bit. The first coin has probability $p_s$ for Heads, and determines whether this bit will be substituted or not. The second coin has probability $p_d$, and determines whether this bit will be deleted, and the third coin has probability $p_i$ and determines whether a new random bit will be inserted. The input to the procedure is the sequences of the $n$ leaves of the tree, as well as the tree structure (but not the sequences of the inner vertices), and the goal is to reconstruct an approximation to the sequence of the root (the DNA of the ancestral father). For every $\epsilon > 0$, we present a deterministic algorithm which outputs an approximation of $x_1, \ldots, x_k$ if $p_i + p_d < O(1/(k^{2/3} \log n))$ and $(1 - 2p_s)^2 > O(d^{-1} \log d)$. To our knowledge, this is the first rigorous trace reconstruction result on a tree in the presence of indels.

1 Introduction

Trace reconstruction on a star. In the "trace reconstruction problem" (TRP) [Lev01a, Lev01b, BKKM04, KM05, HMPW08, VS08], a random binary string $X$ of length $k$ generates an i.i.d. collection of traces $Y_1, \ldots, Y_n$ that are identical to $X$ except for random mutations which consist in indels, i.e., the deletion of an old site or the insertion of a new site between existing sites, and substitutions, i.e., the flipping of the state at an existing site (see Footnote 1). (In keeping with biological terminology, we refer to the components or positions of a string as sites.) The goal is to reconstruct efficiently the original string with high probability from as few random traces as possible.
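The three-coin channel described above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function names `evolve_edge` and `star_traces` are hypothetical, and the sketch assumes that an inserted bit is placed immediately to the right of the current site and that the insertion coin is flipped even when the site itself is deleted.

```python
import random


def evolve_edge(parent, p_s, p_d, p_i, rng):
    """Pass a list of 0/1 sites through one edge of the channel.

    Assumption (illustrative): the three coins are independent per site,
    and an inserted uniform bit lands to the right of the current site,
    whether or not that site was deleted.
    """
    child = []
    for bit in parent:
        substituted = rng.random() < p_s  # first coin: flip the state
        deleted = rng.random() < p_d      # second coin: drop the site
        inserted = rng.random() < p_i     # third coin: add a new site
        if not deleted:
            child.append(1 - bit if substituted else bit)
        if inserted:
            child.append(rng.randrange(2))  # new site is uniform in {0, 1}
    return child


def star_traces(x, n, p_s, p_d, p_i, seed=0):
    """Generate n i.i.d. traces of x, i.e. the star-shaped (TRP) case."""
    rng = random.Random(seed)
    return [evolve_edge(x, p_s, p_d, p_i, rng) for _ in range(n)]
```

Applying `evolve_edge` recursively along the edges of a tree, rather than $n$ times to the same root string, gives the tree version of the problem.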
An important motivation for this problem is the reconstruction of ancestral DNA sequences in computational biology [BKKM04, KM05]. One can think of $X$ as a gene in an (extinct) ancestor species $0$. Through speciation, the ancestor $0$ gives rise to a large number of descendants $1, \ldots, n$ and gene $X$ evolves independently through the action of mutations into sequences $Y_1, \ldots, Y_n$ respectively. Inferring the sequence $X$ of an ancient gene from extant descendant copies $Y_1, \ldots, Y_n$ is a standard problem in evolutionary biology [Tho04]. The inference of $X$ typically requires the solution of an auxiliary problem, the multiple sequence alignment problem (which is an important problem in its own right in computational biology): site $t_i$ of sequence $Y_i$ and site $t_j$ of sequence $Y_j$ are said to be homologous (in this simplified TRP setting) if they descend from a common site $t$ of $X$ only through substitutions; in the multiple sequence alignment problem, we seek roughly to uncover the homology relation between $Y_1, \ldots, Y_n$. Once homologous sites have been identified, the original sequence $X$ can be estimated, for instance, by site-wise majority.

The TRP as defined above is an idealized version of the ancestral sequence reconstruction problem in one important aspect. It ignores the actual phylogenetic relationship between species $1, \ldots, n$. A phylogeny is a (typically binary) tree relating a group of species. The leaves of the tree correspond to extant species. Internal nodes can be thought of as extinct ancestors. In particular the root of the tree represents the most recent common ancestor of all species in the tree. Following paths from the root to the leaves, each bifurcation indicates a speciation event whereby two new species are created from a parent. An excellent introduction to phylogenetics is [SS03].
A standard assumption in computational phylogenetics is that genetic information evolves from the root to the leaves according to a Markov model on the tree. Hence, the stochastic model used in trace reconstruction can be seen as a special case where the phylogeny is star-shaped. (The substitution model used in trace reconstruction is known in biology as the Cavender-Farris-Neyman (CFN) [Cav78, Far73, Ney71] model.) It may seem that a star is a good first approximation for the evolution of DNA sequences. However, extensive work on the so-called "reconstruction problem" in theoretical computer science and statistical physics has highlighted the importance of taking into account the full tree model in analyzing the reconstruction of ancestral sequences.

The "reconstruction problem." In the "reconstruction problem" (RP), we have a single site which evolves through substitutions only from the root to the leaves of a tree. In the most basic setup, which we will consider here, the tree is $d$-ary and each edge is an independent symmetric indel-free channel where the probability of a substitution is a constant $p_s > 0$. The goal is to reconstruct the state at the root given the vector of states at the leaves. More generally, one can consider a sequence of length $k$ at the root where each site evolves independently according to the Markov process above. Denote by $n$ the number of leaves in the tree. The RP has attracted much attention in the theoretical computer science literature due to its deep connections to computational phylogenetics [Mos03, Mos04, DMR06, Roc08] and statistical physics [Mos98, EKPS00, Mos01, MP03, MSW04, JM04, BKMP05, BCMR06, GM07, BVVD07, Sly09a, Sly09c]. See e.g. [Roc07, Sly09b] for background.
Unlike the star case, the RP on a tree exhibits an interesting thresholding effect: on the one hand, information is lost at an exponential rate along each path from the root; on the other hand, the number of paths grows exponentially with the number of levels. When the substitution probability is low, the latter "wins" and vice versa. This "phase transition" has been thoroughly analyzed in the theoretical computer science and mathematical physics literature, although much remains to be understood. More formally, we say that the RP is solvable when the correlation between the root and the leaves persists no matter how large the tree is. Note that unlike the TRP we do not require high-probability reconstruction in this case as it is not information-theoretically achievable for $d$ constant: simply consider the information lost on the first level below the root. Moreover the "number of traces" is irrelevant here as it is governed by the depth of the tree and the solvability notion implies nontrivial correlation for any depth. When the RP is unsolvable, the correlation decays to $0$ for large trees. The results of [BRZ95, EKPS00, Iof96, BKMP05, MSW04, BCMR06] show that for the CFN model, if $p_s < p^*$, then the RP is solvable, where
\[
d\,(1 - 2p^*)^2 = 1.
\]
This is the so-called Kesten-Stigum bound [KS66]. If, on the other hand, $p_s > p^*$, then the RP is unsolvable. Moreover in this case, the correlation between the root state and any function of the character states at the leaves decays as $n^{-\Omega(1)}$. The positive result above is obtained by taking a majority vote over the leaf states.

Footnote 1: One can also consider the case where $X$ is arbitrary rather than random. We will not discuss this problem here.

Like the TRP, the RP is only an idealized version of the ancestral sequence reconstruction problem: it ignores the presence of indels.
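To make the threshold concrete, solving $d(1 - 2p^*)^2 = 1$ for the root with $p^* < 1/2$ gives $p^* = \tfrac{1}{2}(1 - d^{-1/2})$. The short sketch below evaluates this; the function names are hypothetical and chosen only for illustration.

```python
import math


def kesten_stigum_threshold(d):
    """Return p* solving d * (1 - 2 p*)**2 = 1 with 0 <= p* < 1/2."""
    return 0.5 * (1.0 - 1.0 / math.sqrt(d))


def is_solvable_cfn(p_s, d):
    """True when p_s lies strictly below the Kesten-Stigum threshold.

    Assumes 0 <= p_s < 1/2, the regime discussed in the text.
    """
    return p_s < kesten_stigum_threshold(d)
```

For instance, on a $4$-ary tree the threshold is $p^* = 1/4$, and the threshold approaches $1/2$ as $d$ grows.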
In other words, the RP assumes that the multiple sequence alignment problem has been solved perfectly. This is in fact a long-standing assumption in evolutionary biology where one typically preprocesses sequence data by running it through a multiple sequence alignment heuristic and then one only has to model the substitution process. This simplification has come under attack in the biology literature, where it has been argued that alignment procedures often create systematic biases that affect analysis [LG08, WSH08]. Much empirical work has been devoted to the proper joint estimation of alignments and phylogenies [TKF91, TKF92, Met03, MLH04, SR06, RE08, LG08, LRN+09].

Our results. We make progress in this recent new direction by analyzing the RP in the presence of indels, which we also refer to as the TRP on a tree (TRPT). We consider a $d$-ary tree where each edge is an independent channel with substitution probability $p_s$, deletion probability $p_d$, and insertion probability $p_i$ (see Section 1.1 for a precise statement of the model). The root sequence has length $k$ and is assumed to be uniform in $\{0,1\}^k$. As in the standard RP, we drop the requirement of high-probability reconstruction and seek instead a reconstructed sequence that has correlation with the true root sequence uniformly bounded in the depth. We give an efficient recursive procedure which solves the TRPT for $p_s > 0$ a small enough constant (strictly below, albeit close to, the Kesten-Stigum bound) and $p_d, p_i = O(k^{-2/3} \log^{-1} n)$. As a by-product of our analysis we also obtain a partial global alignment of the sequences at the leaves. Our method provides a framework for separating the indel process from the substitution process by identifying well-preserved subsequences which then serve as markers for alignment and reconstruction (see Section 1.2 for a high-level description of our techniques).
As far as we are aware, our results are the first rigorous results for this problem.

Results on the RP have been used in previous work to advance the state of the art in rigorous phylogenetic tree reconstruction methods [Mos04, DMR06, MHR08, Roc08]. A central component in these methods is to solve the RP on a partially reconstructed phylogeny to obtain sequence information that is "close" to the evolutionary past; then this sequence information is used to obtain further structural information about the phylogeny. The whole phylogeny is built by alternating these steps. Our method sets up a framework for extending these techniques beyond substitution-only models. Partial results of this type will be given in the full version of the paper.

Related work. Much work has been devoted to the trace reconstruction problem on a star [Lev01a, Lev01b, BKKM04, KM05, HMPW08, VS08]. In particular, in [HMPW08], it was shown that, when there are only deletions, it is possible to tolerate a small constant deletion rate using $\mathrm{poly}(k)$ traces. For a different range of parameters, Viswanathan and Swaminathan [VS08] showed that, under constant substitution probability and $O(1/\log k)$ indel probability, $O(\log k)$ traces suffice. Both results assume that the root sequence $X$ is uniformly random.

The multiple sequence alignment problem as a combinatorial optimization problem (finding the best alignment under some pairwise scoring function) is known to be NP-hard [WJ94, Eli06]. Most heuristics used in practice, such as CLUSTAL [HS88], T-Coffee [NHH00], MAFFT [KMKM02], and MUSCLE [Edg04], use the idea of a guide tree, that is, they first construct a very rough phylogenetic tree from the data (using edit distance as a measure of evolutionary distance), and then recursively construct local alignments produced by "aligning alignments." Our work can be thought of as the first attempt to analyze rigorously this type of procedure.

Finally, our work is tangentially related to the study of edit distance. Edit distance and pattern matching in random environments has been studied, e.g., by [Nav01, NBYST, AK08].

1.1 Definitions

We now define our basic model of sequence evolution.

Definition 1.1 (Model of sequence evolution) Let $T^{(d)}_H$ be the $d$-ary tree with $H$ levels and $n = d^H$ leaves. For simplicity, we assume throughout that $d$ is odd. We consider the following model of evolution on $T^{(d)}_H$. The sequence at the root of $T^{(d)}_H$ has length $k$ and is drawn uniformly at random over $\{0,1\}^k$. Along each edge of the tree, each site (or position) undergoes the following mutations independently of the other sites:

• Substitution. The site state is flipped with probability $p_s > 0$.
• Deletion. The site is deleted with probability $p_d > 0$.
• Insertion. A new site is created to the right of the current site with probability $p_i > 0$. The state of this new site is uniform in $\{0,1\}$.

These operations occur independently of each other. The last two are called indels. We let $p_{id} = p_i + p_d$ and $\theta_s = 1 - 2p_s$. The parameters $p_s, p_d, p_i$ may depend on $k$ and $n$, where $n$ is the number of leaves.

Remark 1.2 For convenience, our model of insertion is intentionally simplistic. In the biology literature, related continuous-time Markov models are instead used for this kind of process [TKF91, TKF92, Met03, MLH04, RE08, DR09]. It should be possible to extend our results to such generalizations by proper modifications to the algorithm.

1.2 Results

Statement of results. Our main result is the following. Denote by $X = x_1, \ldots, x_k$ a binary uniform sequence of length $k$. Run the evolutionary process on $T^{(d)}_H$ with root sequence $X$ and let $Y_1, \ldots, Y_n$ be the sequences obtained at the leaves, where $Y_i = y^i_1, \ldots, y^i_{k_i}$.

Theorem 1.3 (Main result) For all $\chi > 0$ and $\beta = O(d^{-1} \log d)$, there are $\Phi, \Phi', \Phi'' > 0$ such that the following holds for $d$ large enough. There is a polynomial-time algorithm $A$ with access to $Y_1, \ldots, Y_n$ such that for all
\[
(1 - 2p_s)^2 > \Phi\, \frac{\log d}{d}, \qquad p_i + p_d < \frac{\Phi'}{k^{2/3} \log n}, \qquad \Phi'' \log^3 n < k < \mathrm{poly}(n),
\]
the algorithm $A$ outputs a binary sequence $\widehat{X}$ which satisfies the following with probability at least $1 - \chi$:

1. $\widehat{X} = \hat{x}_1, \ldots, \hat{x}_k$ has length $k$.
2. For all $j = 1, \ldots, k$, $\mathbb{P}[\hat{x}_j = x_j] > 1 - \beta$.

Remark 1.4 Notice that we assume that the (leaf-labelled) tree and the sequence length of the root are known. The requirement that the sequence length is known is not crucial. We adopt it for simplicity in the presentation.

Remark 1.5 In fact, we prove a stronger result which allows $\chi = o(1)$ and shows that the "agreement" between $\widehat{X}$ and $X$ "dominates" an i.i.d. sequence. See Lemma B.1 and Section 5.2.

Proof sketch. We give a brief proof sketch. As discussed previously, in the presence of indels the reconstruction of ancestral sequences requires the solution of the multiple sequence alignment problem. However, in addition to being computationally intractable, global alignment through the optimization of a pairwise scoring function may create biases and correlations that are hard to quantify. Therefore, we require a more probabilistic approach.

From a purely information-theoretic point of view the pairwise alignment of sequences that are far apart in the tree is difficult. A natural solution to this problem is instead to perform local alignments and ancestral reconstructions, and recurse our way up the tree.

This recursive approach raises its own set of issues. Consider a parent node and its $d$ children.
It may be easy to perform a local alignment of the children's sequences and derive a good approximation to the parent sequence (for example, through site-wise majority). Note however that, to allow a recursion of this procedure all the way to the root, we have to provide strong guarantees about the probabilistic behavior of our local ancestral reconstruction. As is the case for global alignment, a careless alignment procedure creates biases and correlations that are hard to control. For instance, it is tempting to treat misaligned sites as independent unbiased noise but this idea presents difficulties: Consider a site $j$ of the parent sequence and suppose that for this site we have succeeded in aligning all but two of the children, say $1$ and $2$. Let $x^i_{j_i}$ denote the site in the $i$'th child which was used to estimate the $j$'th site. By the independence assumption on the root sequence and the inserted sites, $x^1_{j_1}$ and $x^2_{j_2}$ are uniform and independent of $(x^i_{j_i})_{i=3}^{d}$. However, $x^1_{j_1}$ and $x^2_{j_2}$ may originate from the same neighboring site of the parent sequence and therefore are themselves correlated.

Quantifying the effect of this type of correlation appears to be nontrivial. Instead, we use an adversarial approach to local ancestral reconstruction. That is, we treat the misaligned sites as being controlled by an adversary who seeks to flip the reconstructed value. This comes at a cost: it produces an asymmetry in our ancestral reconstruction. Although the RP is well-studied in the symmetric noise case, much remains to be understood in the asymmetric case. In particular, obtaining tight results in terms of substitution probability here may not be possible as the critical threshold of the RP may be hard to identify.
We do however provide a tailored analysis of the particular instance of the RP by recursive majority obtained through this adversarial approach and we obtain results that are close to the known threshold for the symmetric case. Unlike the standard RP, the reconstruction error is not i.i.d. but we show instead that it "dominates" an i.i.d. noise. (See Section 4.2 for a definition.) This turns out to be enough for a well-controlled recursion.

We first define a local alignment procedure which has a fair success probability (independent of $n$). However, applying this alignment procedure multiple times in the tree is bound to fail sometimes. We therefore prove that the local reconstruction procedure is somewhat robust in the sense that even if one of the $d$ inputs to the reconstruction procedure is faulty, it still has a good probability of success.

As for our local alignment procedure, we adopt an anchor approach. Anchors were also used by [KM05, HMPW08], although in a quite different way. We imagine a partition of every node's sequence into islands of length $O(k^{1/3})$. (The precise choice of the island length comes from a trade-off between the length and the number of islands in bounding the "bad" events below; see the proof of Lemma 3.3.) At the beginning of each island we have an anchor of length $O(\log n)$. Through this partition of the sequences in islands and anchors we aim to guarantee the following. Given a specific father node $v$, with fair probability: 1) all the anchors in the children nodes are indel-free; and 2) for all parent islands, almost all of the corresponding children islands have no indel at all and, moreover, at most one child island may have a single indel. The "bad" children islands (those that do not satisfy these properties) are treated as controlled by an adversary.
We show that Conditions 1) and 2) are sufficient to guarantee that the anchors of all islands can be aligned with high probability and single indel events between anchors can be identified. This allows a local alignment of all islands with at most one "bad" child per island and is enough to perform a successful adversarial recursive majority vote as described above. The bound on the maximum indel probability sustained by our reconstruction algorithm comes from satisfying Conditions 1) and 2) above.

Notation. For a sequence $X = x_1, \ldots, x_k$, we let $X[i:j] = x_i, \ldots, x_j$. We use the expression "with high probability (w.h.p.)" to mean "with probability at least $1 - 1/\mathrm{poly}(n)$" where the polynomial in $n$ can be made of arbitrarily high degree (by choosing the appropriate constants large enough). We denote by $\mathrm{Bin}(n, p)$ a random variable with binomial distribution of parameters $n, p$. For two random variables $X, Y$ we denote by $X \sim Y$ the equality in distribution.

Organization. The rest of the paper is organized as follows. We describe the algorithm in Section 2. The proof of our main result is divided into two sections. In Section 3, we prove a series of high-probability claims about the evolutionary process. Then, conditioning on these claims, we provide a deterministic analysis of the correctness of the algorithm in Section 5.2. All proofs are in the Appendix.

2 Description of the Algorithm

In this section we describe our algorithm for TRPT. Our algorithm is recursive, proceeding from the leaves of the tree to the root. We describe the recursive step applied to a non-leaf node of the tree.

Recursive Setup: Our Goal. For our discussion in this section, let us consider a non-leaf node $v$ with $d$ children, denoted $u_i$ for $i \in [d]$. For notational convenience, we drop the index $u$ and denote its children by $1, \ldots, d$.
Our goal for the recursive step of the algorithm is to reconstruct the sequence at the node $v$ given the sequences of the children. Denote the sites of the father by $X_0 = x^0_1, \ldots, x^0_{k_0}$, and the sites of the $i$'th child by $X_i = x^i_1, \ldots, x^i_{k_i}$. During the reconstruction process, we do not have access to the children's sequences, but rather to reconstructed sequences denoted by $\widehat{X}_i = \hat{x}^i_1, \ldots, \hat{x}^i_{\hat{k}_i}$.

Let us consider the following partition of the sequence of $v$ into subsequences, called islands. Of course our algorithm does not have access to the sequence at $v$ during the recursive step of the algorithm. We define the partition as a means to describe our algorithm: The sites of $v$ are partitioned into islands of length $\ell = k^{1/3}$ (except for the last one which is possibly shorter). Denote by $N_0 = \lceil k_0/\ell \rceil$ the number of islands in $v$. Each island starts with an anchor of $a$ bits. That is, the islands are the bitstrings $X_0[1:\ell], X_0[\ell+1:2\ell], \ldots$ and the anchors are the bitstrings $X_0[1:a], X_0[\ell+1:\ell+a], \ldots$.

Our algorithm tries to identify for each island $X_0[(i-1)\ell+1 : i\ell]$ the substrings of each of the $d$ children that correspond to this island (i.e., contain the sites of the island), called "child islands." We do so iteratively for $i = 1, \ldots, N_0$. We use the islands that did not have indels for sequence reconstruction, using the substitution-only model. Some islands will have indels however. This leads to two "modes of failure": one invalidates the entire (parent) node, and the other invalidates only an island of a child.

More specifically, a node becomes invalidated (i.e., useless) when indels are not evenly distributed, that is: when an indel occurred in an anchor, or two (or more) indels occurred in a specific island over all $d$ children. This is a rare event.
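The island and anchor partition above is purely positional, so it can be sketched directly. The helper names below are hypothetical, and the sketch assumes that the island length $\ell$ and the anchor length $a$ are passed in as integers (the paper sets $\ell = k^{1/3}$ and leaves rounding implicit).

```python
import math


def island_bounds(k0, ell):
    """1-indexed inclusive (start, end) bounds of the islands of a sequence
    of length k0; the last island may be shorter than ell."""
    n_islands = math.ceil(k0 / ell)  # N_0 in the text
    return [((i - 1) * ell + 1, min(i * ell, k0))
            for i in range(1, n_islands + 1)]


def anchor_bounds(k0, ell, a):
    """1-indexed inclusive bounds of the anchor opening each island."""
    return [(start, min(start + a - 1, end))
            for start, end in island_bounds(k0, ell)]


def substring(x, i, j):
    """X[i:j] in the paper's notation: sites i through j, both inclusive."""
    return x[i - 1:j]
```

With `k0 = 10`, `ell = 4`, and `a = 2`, this produces three islands, the last of length two, and each island's anchor is its first two sites.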
Barring this event, we expect that each island suffers at most one indel over all children. The island (of a child) that has exactly one indel is invalidated (second mode of failure), and is thus deemed useless for reconstruction purposes. As long as the parent node is not invalidated, each island will have at least $d - 2$ non-invalidated children islands (one additional island is potentially lost to a child node that may have been invalidated at an earlier stage).

Even when the algorithm identifies that a child island has an indel somewhere, the island is not ignored. The algorithm still needs to compute the length of the island in order to know the start of the next island in this child. For this purpose, we use the anchor of the next island and match it to the corresponding anchors of the other (non-invalidated) child islands. In fact the same procedure lets us detect which of the child islands are invalidated.

More formally, we define $d$ functions
\[
f_i : \{1, \ldots, k_0\} \to \{1, \ldots, k_i\} \cup \{\dagger\},
\]
where $f_i$ takes the sites of $v$ to the corresponding sites of the $i$'th child or to the special symbol $\dagger$ if the site was deleted. Note that for each $i$, $f_i$ is monotone, when ignoring sites which are mapped to $\dagger$. For $t = \ell r$, let
\[
s_i(r) = f_i(t+1) - (t+1)
\]
denote the displacement of the site corresponding to the $(t+1)$st site of the parent, in the $i$th child. By convention, we take $s_i(0) = 0$. If there is no indel between $t = \ell r$ and $t' = \ell r'$ then $s_i(r) = s_i(r')$. Note that, in the specific case of one indel operation in the island, we have that $|s_i(r) - s_i(r')| = 1$.

Algorithm. Our algorithm estimates the values of $s_i(r)$ and uses these estimates to match the starting positions of the islands in the children. The full algorithm is given in Figure 1 in the Appendix. We use the following additional notation. For $x \in \{0,1\}$ we let $\langle x \rangle = 2x - 1$.
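The map $f_i$ and the displacement $s_i(r)$ can be illustrated on a known edit history. The sketch below is hypothetical: it assumes the per-site deletion and insertion flags of one parent-to-child edge are given (the algorithm itself never sees them and has to estimate $s_i(r)$), and it represents $\dagger$ by `None`.

```python
def site_map(deleted, inserted):
    """Build f: parent site t (1-indexed) -> child position, or None for
    the dagger symbol when site t was deleted.

    deleted[t-1] and inserted[t-1] are booleans for parent site t; an
    insertion adds one child site just to the right of site t.
    """
    f = {}
    pos = 0  # last child position used so far
    for t, (was_deleted, has_insertion) in enumerate(
            zip(deleted, inserted), start=1):
        if was_deleted:
            f[t] = None
        else:
            pos += 1
            f[t] = pos
        if has_insertion:
            pos += 1  # the inserted site occupies the next child position
    return f


def displacement(f, ell, r):
    """s(r) = f(t + 1) - (t + 1) with t = ell * r."""
    t = ell * r
    image = f[t + 1]
    if image is None:
        raise ValueError("site t + 1 was deleted; s(r) is undefined here")
    return image - (t + 1)
```

For example, with islands of length 3, a deletion at site 2 shifts the start of the second island by $-1$, and an insertion after site 5 shifts the start of the third island back to $0$, so consecutive displacements differ by exactly one across each island containing a single indel.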
Then, for two $\{0,1\}^m$-sequences $Y = y_1, \ldots, y_m$ and $Z = z_1, \ldots, z_m$, we define their (empirical) correlation as
\[
\mathrm{Corr}(Y, Z) = \frac{1}{m} \sum_{j=1}^{m} \langle y_j \rangle \langle z_j \rangle .
\]
Note that $y \mapsto \langle y \rangle$ maps $1$ to $1$ and $0$ to $-1$. One can think of $\mathrm{Corr}(Y, Z)$ as a form of normalized centered Hamming distance between $Y$ and $Z$. In particular, a large value of $\mathrm{Corr}(Y, Z)$ implies that $Y$ and $Z$ tend to agree. We will use the following threshold (which will be justified in Section 5.1)
\[
\gamma = (1 - \delta)(1 - 2p_s)^2 - 4\beta,
\]
where $\delta$ is chosen so that
\[
(1 - \delta)(1 - 2p_s)^2 - 8\beta > \delta + 8\beta,
\]
where again $\beta = O(d^{-1} \log d)$.

3 Analyzing the Indel Process

We define $a \geq C \log n$ and $\alpha \leq \varepsilon/d < 1$, for constants $C, \varepsilon$ to be determined later. We require $a < k^{1/3} < \mathrm{poly}(n)$ (see Footnote 2). We assume that the indel probability per site satisfies
\[
p_{id} = \frac{\alpha}{4 d k^{2/3} a} = O\!\left(\frac{1}{k^{2/3} \log n}\right).
\]
Throughout, we denote the tree by $T = (V, E)$.

Footnote 2: A variant of the algorithm where the anchors have length $O(\log k)$ also works when $k \gg n$.

3.1 Bound on the Sequence Length

As the indel probability is defined per site, longer sequences suffer more indel operations than shorter ones. We begin by bounding the effect of this process. We show that with high probability the lengths of all sequences are roughly equal.

Lemma 3.1 (Bound on sequence length) For all $\zeta > 0$ (small), there exists $C' > 0$ (large) so that for all $u$ in $V$, we have
\[
k_u \in [\underline{k}, \bar{k}] \equiv [(1 - \zeta)k, (1 + \zeta)k],
\]
with high probability given $k \geq C' \log^3 n$. We denote this event by $\mathcal{L}$.

3.2 Existence of a Dense Stable Subtree

In this section, we show that with probability close to $1$ there exists a dense subtree of $T$ with a "good indel structure," as defined below. Our algorithm will try to identify this subtree and perform reconstruction on it, as described in Section 4.

Indel structure of a node. Recall that $\ell = k^{1/3}$.

Definition 3.2 (Indel structure) For a node (parent) $v$, we say that $v$ is radioactive if one of the following events happens:

1. Event $B_1$: Node $v$ has a child $u$ such that when evolving from $v$ to $u$ an indel operation occurred in at least one of the sites which are located in an anchor.
2. Event $B_2$: There is an island $I$ and two children $u, u'$, such that an indel occurred in $I$ in the transition from $v$ to $u$ and in the transition from $v$ to $u'$.
3. Event $B_3$: There is an island $I$ and a child $u$, such that two indel operations (or more) happened in $I$ in the transition from $v$ to $u$.

Otherwise the node $v$ is stable. By definition, the leaves of $T$ are stable. A subtree of $T$ is stable if all of its nodes are stable.

Lemma 3.3 (Bound on radioactivity) For all $\alpha > 0$, there exists a choice of $\zeta > 0$ small enough in Lemma 3.1 such that, conditioning on the event $\mathcal{L}$ occurring, any vertex $v$ is radioactive with probability at most $\alpha$.

As a corollary we obtain the following.

Lemma 3.4 (Existence of a dense stable subtree) For all $\chi > 0$, there is a choice of $\zeta > 0$ small enough in Lemma 3.1 such that, conditioning on the event $\mathcal{L}$ occurring, with probability at least $1 - \chi$, the root of $T$ is the father of a $(d-1)$-ary stable subtree of $T$. We denote this event by $\mathcal{S}$.

4 A Stylized Reconstruction Process

In this section, we lay out the basic lemmas that we need to analyze our ancestral reconstruction method. We do this by way of describing a hypothetical sequence reconstruction process performed on the stable tree defined by the indel process (see Lemma 3.4). We analyze this reconstruction process (assuming that the radioactive nodes and the islands with indels are controlled by an adversary) and show in Lemma 4.5 that the process gives strong reconstruction guarantees. Then we argue in Section 5 that our algorithm performs at least as well as the reconstruction process against the adversary described in this section.
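The three events of Definition 3.2 can be checked mechanically once the indel locations are known. The following sketch is illustrative only: the function name `is_radioactive` is hypothetical, indels are assumed to be recorded by the 1-indexed parent site at which they occur (an insertion is attributed to the site to whose right it lands), and $\ell$ and $a$ are integers.

```python
from collections import Counter


def is_radioactive(indel_sites_per_child, ell, a):
    """Return True if a parent node is radioactive (B1, B2, or B3 holds).

    indel_sites_per_child[i] lists the parent sites where an indel occurred
    on the edge to child i; a site is repeated if it suffers two indels.
    """
    children_hit = Counter()  # island index -> number of children with an indel
    for sites in indel_sites_per_child:
        # B1: an indel falls within the first a sites of some island.
        if any((t - 1) % ell < a for t in sites):
            return True
        per_island = Counter((t - 1) // ell for t in sites)
        # B3: two or more indels in one island on a single edge.
        if any(count >= 2 for count in per_island.values()):
            return True
        children_hit.update(per_island.keys())
    # B2: the same island has an indel in two different children.
    return any(count >= 2 for count in children_hit.values())
```

A node for which this returns `False` is stable in the sense of the definition: every island is hit by at most one indel across all children, and never inside an anchor.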
Throughout this section we suppose that a stable tree exists and is given to us, together with the "orbit" of every site of the sequence at the root of the tree (see function $F$ below). However, we are given no information about the substitution process. Let $v \in V$ and assume $v$ is the root of a $(d-1)$-ary stable subtree $T^* = (V^*, E^*)$ of $T$. (We make the stable subtree below $v$ into a $(d-1)$-ary tree by potentially removing arbitrary nodes from it, at random.) Let $u \in V^*$. For each island $I$ in $u$, at most one child $u'$ of $u$ in $T^*$ contains an indel, in which case it contains exactly one indel. We say that such an $I$ is a corrupted island of $u'$. The basic intuition behind our analysis is that, provided the alignment on $T^*$ is performed correctly (which we defer to Section 5.2), the ancestral reconstruction step of our algorithm is a recursive majority procedure against an adversary which controls the corrupted islands and the radioactive nodes (as well as all their descendants). Below we analyze this adversarial process.

Recursive majority. We begin with a formal definition of recursive majority. Let $\mathrm{Maj} : \{0, 1, \sharp\}^d \to \{0, 1\}$ be the function that returns the majority value over non-$\sharp$ values, and flips an unbiased coin in case of a tie (including the all-$\sharp$ vector). Let $n_0 = d^{H_0}$ be the number of leaves in $T$ below $v$. Consider the following recursive function of $z = (z_1, z_2, \ldots, z_{n_0}) \in \{0, 1, \sharp\}^{n_0}$: $\mathrm{Maj}^0(z_1) = z_1$, and
$$\mathrm{Maj}^j(z_1, \ldots, z_{d^j}) = \mathrm{Maj}\left(\mathrm{Maj}^{j-1}(z_1, \ldots, z_{d^{j-1}}), \ldots, \mathrm{Maj}^{j-1}(z_{d^j - d^{j-1} + 1}, \ldots, z_{d^j})\right),$$
for all $j = 1, \ldots, H_0$. Then, $\mathrm{Maj}^{H_0}(z)$ is the $d$-wise recursive majority of $z$.

Let $X^0 = x^0_1, \ldots, x^0_{k_0}$ be the sequence at $v$. For $u \in V^*$ and $t = 1, \ldots, k_0$, we denote by $F_u(t)$ the position of site $x^0_t$ in $u$, or $\dagger$ if the site has been deleted on the path to $u$. We say that $C_{u,t}$ holds if $F_u(t)$ is in a corrupted island of $u$. Let $\mathrm{Path}(u, v)$ be the set of nodes on the path between $u$ and $v$.

Definition 4.1 (Gateway node) A node $u$ is a gateway for site $t$ if:
1. $F_u(t) \neq \dagger$; and
2. For all $u' \in \mathrm{Path}(u, v) - \{v\}$, $C_{u',t}$ does not hold.

We let $T^{**}_t = (V^{**}_t, E^{**}_t)$ be the subtree of $T^*$ containing all gateway nodes for $t$. By construction, $T^{**}_t$ is at least $(d-2)$-ary and for convenience we remove arbitrary nodes, at random, to make it exactly $(d-2)$-ary. Notice that, for $t, t' \in [1 : k_0]$, the subtrees $T^{**}_t$ and $T^{**}_{t'}$ are random and correlated. However, they are independent of the substitution process.

We will show in Section 5.2 that the reconstructed sequence produced by our method at $v$ "dominates" (see below) the following reconstruction process. Let $L_v = u_1, \ldots, u_{n_0}$ be the leaves below $v$ ordered according to a planar realization of the subtree below $v$. Denote by $X^i = x^i_1, \ldots, x^i_{k_i}$ the sequence at $u_i$. For $t = 1, \ldots, k_0$, let $L^{**}_t$ be the leaves of $T^{**}_t$. We define the following auxiliary sequences: for $u_i \in L_v$, we let $\Xi^i = \xi^i_1, \ldots, \xi^i_{k_0}$ where for $t = 1, \ldots, k_0$
$$\xi^i_t = \begin{cases} x^i_{F_{u_i}(t)} & \text{if } u_i \in L^{**}_t, \\ 1 - x^0_t & \text{otherwise.} \end{cases}$$
In words, $\xi^i_t$ is the descendant of $x^0_t$ if $u_i$ is a gateway to $t$ and is the opposite of the value $x^0_t$ otherwise. Because of the monotonicity of recursive majority, the latter choice is in some sense the "worst adversary" (ignoring correlations between sites; we will come back to this point later). We then define a reconstructed sequence at $v$ as $\hat{\Xi}^0 = \hat{\xi}^0_1, \ldots, \hat{\xi}^0_{k_0}$ where for $t = 1, \ldots, k_0$
$$\hat{\xi}^0_t = \mathrm{Maj}^{H_0}(\xi^1_t, \ldots, \xi^{n_0}_t).$$
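The recursive majority defined above can be illustrated with a minimal Python sketch. It is not the authors' code: the names `SHARP`, `maj` and `recursive_majority` are hypothetical, `None` stands in for the symbol $\sharp$, and the leaf values are assumed to be listed in planar order with length exactly $d^{H}$.

```python
import random

SHARP = None  # hypothetical stand-in for the erased symbol "sharp"


def maj(values, rng=random):
    """Majority over the non-erased bits; fair coin on a tie (including all erased)."""
    ones = sum(1 for v in values if v == 1)
    zeros = sum(1 for v in values if v == 0)
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return rng.randint(0, 1)  # tie: unbiased coin flip


def recursive_majority(leaves, d, rng=random):
    """d-wise recursive majority of d**H leaf values, computed level by level."""
    level = list(leaves)
    if not level:
        raise ValueError("need at least one leaf value")
    while len(level) > 1:
        if len(level) % d != 0:
            raise ValueError("number of leaves must be a power of d")
        # Replace each block of d siblings by the majority of the block.
        level = [maj(level[i:i + d], rng) for i in range(0, len(level), d)]
    return level[0]
```

For instance, with $d = 3$ and leaf values `[0, 0, 1, 1, 1, 0, 0, SHARP, 0]`, the three blocks vote 0, 1 and 0, so the root value is 0. The adversarial sequence $\xi^i_t$ would be fed in as the leaf values, one site $t$ at a time.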
We now analyze the accuracy of this (hypothetical) estimator, which we refer to as the adversarial reconstruction of $X^0$. We show in Section 5.2 that our actual estimator is at least as good as $\hat{\Xi}^0$ w.h.p.

4.1 Recursive Majority Against an Adversary

To analyze the performance of the adversarial reconstruction $\hat{\Xi}^0$, we consider the following stylized process.

Definition 4.2 (Recursive Majority Against an Adversary) We consider the following process:
1. Run the evolutionary process on $T^{(d-2)}_{H_0}$ at one position only, starting with root state 0, without indels, that is, taking $p_{id} = 0$.
2. Then complete $T^{(d-2)}_{H_0}$ into $T^{(d)}_{H_0}$ and associate to each additional node the state 1.
3. Let $R^{(d)}_{H_0}$ be the random variable in $\{0, 1\}$ obtained by running recursive majority on the leaf states obtained above.
We call this process the recursive majority against an adversary on $T^{(d)}_{H_0}$.

We show the following.

Lemma 4.3 (Accuracy of recursive majority) For all $\beta > 0$, there exists a constant $C'' > 0$ such that, taking
$$\theta_s^2 > C'' \frac{\log d}{d},$$
and $d$ large enough, the probability that the recursive majority against an adversary on $T^{(d)}_{H_0}$ correctly reconstructs root state 0 is at least $1 - \beta$ uniformly in $H_0$.

In comparison, note that the Kesten-Stigum bound for binary symmetric channels on $d$-ary trees is $\theta^2 > d^{-1}$ [KS66, Hig77]. As a corollary of Lemma 4.3, we have the following.

Definition 4.4 (Bernoulli sequence) For $q > 0$ and $m \in \mathbb{N}$, the $(q, m)$-Bernoulli sequence is the product distribution on $\{0, 1\}^m$ such that each position is 1 independently with probability $1 - q$. We denote by $B_{q,m}$ the corresponding random variable.

Lemma 4.5 (Subsequence reconstruction) Assume $v$ is the root of a $(d-1)$-ary stable subtree. For all $\beta > 0$, choosing $C'' > 0$ as in Lemma 4.3, the following holds for $d$ large enough. For $t, m \in \{1, \ldots, k_0\}$, let $\Lambda = (\lambda_1, \ldots, \lambda_m)$ be the agreement vector between $\hat{\Xi}^0[t+1 : t+m]$ and $X^0[t+1 : t+m]$, that is, $\lambda_i = 1$ if recursive majority correctly reconstructs position $i$. Then there is $0 \le \beta' \le \beta$ such that $\Lambda \sim B_{\beta',m}$. (Here, $\beta'$ may depend on $H_0$ but $\beta$ does not.)

4.2 Stochastic Domination and Correlation

In our discussion so far we have assumed that a stable tree exists and is given to us, together with the function $F$. This allowed us to define the stylized recursive majority process against an adversary (Definition 4.2), for which we established strong reconstruction guarantees (Lemmas 4.3 and 4.5). In reality, we have no access to the stable tree. We are going to construct it recursively from the leaves towards the root. At the same time we will align sequences, discover corrupted islands, and reconstruct sequences of internal nodes. The stylized recursive majority process will be used to provide a "lower bound" on the actual reconstruction process. The notion of "lower bound" that is of interest to us is captured by stochastic domination, which we proceed to define formally.

Definition 4.6 (Stochastic domination) Let $X, Y$ be two random variables in $\{0, 1\}^m$. We say that $Y$ stochastically dominates $X$, denoted $X \preceq Y$, if there is a joint random variable $(\tilde{X}, \tilde{Y})$ such that the marginals satisfy $X \sim \tilde{X}$ and $Y \sim \tilde{Y}$ and moreover $P[\tilde{X} \le \tilde{Y}] = 1$.

Note that in the definition above $X$ and $Y$ may (and typically do) live in different probability spaces. Then, the joint variable $(\tilde{X}, \tilde{Y})$ is a coupled version of $X$ and $Y$. In our case, $X$ is the adversarial recursive process whereas $Y$ is the actual reconstruction performed by the algorithm. We now show how to use this property for correlation estimation.

Correlation.
The analysis of the previous section guarantees that the sequences output by the adversarial reconstruction process are well correlated with the true sequences. But if we are only going to use the adversarial process as a lower bound for the true reconstruction process, it is important to establish that stochastic domination preserves correlation. In preparing the ground for such a claim let us establish an important property of the adversarial process. Let $T_u$ and $T_v$ be two disjoint copies of $T^{(d)}_h$ rooted at the nodes $u$ and $v$ respectively, and let $X = x_1, x_2, \ldots, x_m \in \{0, 1\}^m$ and $Y = y_1, y_2, \ldots, y_m \in \{0, 1\}^m$ be sequences at the nodes $u$ and $v$. Assume that $u$ and $v$ are the roots of $(d-1)$-ary stable subtrees. Let $\hat{X}' = \hat{x}'_1, \hat{x}'_2, \ldots, \hat{x}'_m \in \{0, 1\}^m$ and $\hat{Y}' = \hat{y}'_1, \hat{y}'_2, \ldots, \hat{y}'_m \in \{0, 1\}^m$ be the reconstructions of $X$ and $Y$ obtained by the adversarial reconstruction process. Let $\Lambda = \lambda_1, \ldots, \lambda_m$ and $\Theta = \theta_1, \ldots, \theta_m$ be the resulting agreement vectors. We show the following:

Lemma 4.7 (Concentration of bias) Let $\beta', \beta$ be as in Lemma 4.5. Then, with probability at least $1 - e^{-\Omega(m\beta^2)}$, the following are satisfied:
$$\left| \frac{1}{m} \sum_{i=1}^m \langle\lambda_i\rangle\langle\theta_i\rangle - (1 - 2\beta')^2 \right| \le \frac{1}{2}\beta;$$
$$\left| \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{\langle\lambda_i\rangle = -1\} - \beta' \right| \le \frac{1}{2}\beta;$$
$$\left| \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{\langle\theta_i\rangle = -1\} - \beta' \right| \le \frac{1}{2}\beta.$$

We use the previous lemma to argue that stochastic domination does not affect our correlation computations.

Lemma 4.8 (Correlation bound) Let $\hat{X}, \hat{Y} \in \{0, 1\}^m$ be random strings defined on the same probability space as $\hat{X}'$ and $\hat{Y}'$. Denote by $Z$ (resp. $W$) the agreement vector of $\hat{X}$ (resp. $\hat{Y}$) with $X$ (resp. $Y$). Assume that $\Lambda \le Z$ and $\Theta \le W$ with probability 1, where $\Lambda$ and $\Theta$ are the agreement vectors of $\hat{X}'$ and $\hat{Y}'$ with $X$ and $Y$ as explained above.
Then,
$$|\mathrm{Corr}(X, Y) - \mathrm{Corr}(\hat{X}, \hat{Y})| \le 1 - \frac{1}{m} \sum_{i=1}^m \left( \langle\lambda_i\rangle\langle\theta_i\rangle - \mathbb{1}\{\langle\lambda_i\rangle = -1\} - \mathbb{1}\{\langle\theta_i\rangle = -1\} \right),$$
with probability 1. Furthermore, conditioned on the conclusions of Lemma 4.7, we have, with probability 1:
$$|\mathrm{Corr}(X, Y) - \mathrm{Corr}(\hat{X}, \hat{Y})| \le 8\beta.$$

5 Analyzing the True Reconstruction Process

We provide the proof of Theorem 1.3. In Section 5.1, we show that, if a stable subtree exists, the adversarial reconstructions of aligned anchors exhibit a strong correlation signal, while misaligned anchors exhibit a weak signal. This holds true for sequences that stochastically dominate the adversarial reconstructions. We use this property to complete the analysis of our reconstruction method in Section 5.2.

5.1 Anchor Alignment

Consider a parent $v$ that is stable. Let $i, j$ be two children with sequences $X^i = x^i_1, \ldots, x^i_{k_i}$ and $X^j = x^j_1, \ldots, x^j_{k_j}$. Let $t = \ell r$ and consider the following subsequences (of length $a$) at $i$ and $j$:
$$A^i_r = x^i[t + s_i(r) + 1 : t + s_i(r) + a], \quad \text{and} \quad A^j_r = x^j[t + s_j(r) + 1 : t + s_j(r) + a].$$
These are related (but not identical) to the definition of anchors in the algorithm of Section 2. In particular, note that by definition $A^i_r$ and $A^j_r$ are always aligned, in the sense that they correspond to the same subsequence of $v$. Consider also the following subsequences:
$$D^j_r = x^j[t + s_j(r) : t + s_j(r) + a - 1] \quad \text{and} \quad I^j_r = x^j[t + s_j(r) + 2 : t + s_j(r) + a + 1].$$
These are the one-site shifted subsequences for $j$. The following lemma bounds the correlation between these strings. More precisely, we show that $A^i_r$ is always significantly more correlated to its aligned brother $A^j_r$ than to the misaligned ones $D^j_r$ and $I^j_r$. This follows from the fact that the misaligned subsequences are sitewise independent.
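The contrast between aligned and one-site shifted anchors can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: it takes $\langle b \rangle = 2b - 1$ (consistent with the indicator $\mathbb{1}\{\langle\lambda_i\rangle = -1\}$ marking disagreement), it models substitutions as independent bit flips with probability $p_s$, and it ignores indels altogether. The helper names `corr`, `anchor` and `substitute` are hypothetical.

```python
import random


def corr(x, y):
    """Empirical correlation (1/m) * sum <x_i><y_i>, assuming <b> = 2b - 1."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(x, y)) / len(x)


def anchor(seq, t, shift, a, offset=0):
    """Sites [t+shift+1+offset : t+shift+a+offset] in 1-based inclusive notation.

    offset=0 gives the A-type anchor, offset=-1 the D-type (deletion) window,
    and offset=+1 the I-type (insertion) window.
    """
    start = t + shift + offset  # 0-based index of the first site of the window
    return seq[start:start + a]


def substitute(seq, p_s, rng):
    """Toy substitution channel: flip each bit independently with probability p_s."""
    return [b ^ 1 if rng.random() < p_s else b for b in seq]
```

With a uniformly random parent sequence and two children produced by `substitute`, the aligned correlation concentrates near $(1 - 2p_s)^2$, while the two shifted windows give values near 0, which is the gap that a threshold test on anchors would exploit.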
Lemma 5.1 (Anchor correlations) For all $\delta > 0$ such that
$$(1 - \delta)(1 - 2p_s)^2 - 8\beta > \delta + 8\beta,$$
there is $C > 0$ large enough so that, with $a = C \log n$, the following hold:
1. Aligned anchors.
$$P\left[\mathrm{Corr}(A^i_r, A^j_r) > (1 - \delta)(1 - 2p_s)^2\right] > 1 - \exp(-\Omega(a)) = 1 - 1/\mathrm{poly}(n).$$
2. Misaligned anchors.
$$P\left[\mathrm{Corr}(A^i_r, D^j_r) < \delta\right] > 1 - \exp(-\Omega(a)) = 1 - 1/\mathrm{poly}(n),$$
and similarly for $I^j_r$.
We denote by $\mathcal{A}_{i,j,r}$ the above events and their symmetric counterparts under $i \leftrightarrow j$.

Lemma 5.2 (Anchor correlations: Reconstructed version) Let $\hat{X}^i = (\hat{x}^i_\iota)_{\iota=1}^{k_i}$ and $\hat{X}^j = (\hat{x}^j_\iota)_{\iota=1}^{k_j}$ dominate the adversarial reconstructions $\hat{X}'^{i}$ and $\hat{X}'^{j}$ of $X^i$ and $X^j$, as defined in Lemma 4.8. Let
$$\hat{A}^i_r = \hat{x}^i[t + s_i(r) + 1 : t + s_i(r) + a]$$
and similarly for all other possibilities $\hat{A} \leftrightarrow \hat{D}, \hat{I}$ and/or $i \leftrightarrow j$. Denote by $\mathcal{B}_{i,j,r}$ the event that the conclusions of Lemma 4.7 hold for $\hat{X}'^{i}$ and $\hat{X}'^{j}$ over all pairs of intervals involving $[t + s_i(r) : t + s_i(r) + a - 1]$, $[t + s_i(r) + 1 : t + s_i(r) + a]$, and $[t + s_i(r) + 2 : t + s_i(r) + a + 1]$, with $i \leftrightarrow j$ as necessary. Then, conditioned on $\mathcal{B}_{i,j,r}$ and $\mathcal{A}_{i,j,r}$, we have
$$\mathrm{Corr}(\hat{A}^i_r, \hat{A}^j_r) > (1 - \delta)(1 - 2p_s)^2 - 8\beta, \quad \mathrm{Corr}(\hat{A}^i_r, \hat{D}^j_r) < \delta + 8\beta, \quad \mathrm{Corr}(\hat{A}^i_r, \hat{I}^j_r) < \delta + 8\beta,$$
as well as their symmetric counterparts under $i \leftrightarrow j$.

5.2 Proof of Correctness

We show that our recursive procedure reconstructs the desired sequence at the root of the tree whenever a collection of good events occurs. Recall the definitions of the events $\mathcal{L}$, $\mathcal{S}$, $\mathcal{B}_{i,j,r}$, $\mathcal{A}_{i,j,r}$ from Lemmas 3.1, 3.4, 5.1 and 5.2.
Footnote 3: Event $\mathcal{L}$ guarantees that there is no big variance in the nodes' sequence lengths; event $\mathcal{S}$ guarantees that a stable $(d-1)$-ary subtree exists; the events $\mathcal{B}_{i,j,r}$ guarantee that the adversarial reconstruction process is successful, also in preserving correlations between sequences of nodes; and the events $\mathcal{A}_{i,j,r}$ guarantee that aligned anchors (across sequences of a node's children) exhibit a strong correlation signal, while misaligned anchors give a weak correlation signal.

Conditioning on $\mathcal{L}$ and $\mathcal{S}$, denote by $T^* = (V^*, E^*)$ the stable $(d-1)$-ary subtree of $T$. Then, for all $v \in V^*$, all pairs of children $i, j$ of $v$ in $T^*$, and all $r = 1, \ldots, \bar{k}/\ell$, we condition on the events $\mathcal{B}_{i,j,r}$ and $\mathcal{A}_{i,j,r}$. Note that, having conditioned on $\mathcal{L}$, there is only a polynomial number of such events, since all sequence lengths are bounded by $\bar{k}$. (If $r \cdot \ell$ is larger than a node's sequence length we assume that the corresponding events are vacuously satisfied.) Finally recall that, conditioning on $\mathcal{L}$, the event $\mathcal{S}$ occurs with probability $1 - \chi$ and all other events occur with high probability. We denote the collection of events by $\mathcal{E}$.

Conditioning on $\mathcal{E}$, the proof of correctness of the algorithm follows from a bottom-up induction. The gist of the argument is the following. Suppose that at a recursive step of the algorithm we have reconstructed sequences for all children of a node $v$, which are strongly correlated with the true sequences (in the sense of dominating the corresponding adversarial reconstructions). Having conditioned on the events $\mathcal{A}_{i,j,r}$ and $\mathcal{B}_{i,j,r}$, it follows that the correct alignments of anchors exhibit a strong correlation signal while the incorrect alignments exhibit a weak correlation signal. Hence, our correlation tests between anchors discover the corrupted islands and do the anchor alignments correctly (at least for all nodes lying inside the stable tree).
Hence the shift functions $\hat{s}_i$ are correctly inferred, and the reconstruction of $v$'s sequence can be shown to dominate the corresponding adversarial reconstruction. The complete proof details are given in Appendix D.

References

[AK08] A. Andoni and R. Krauthgamer. The smoothed complexity of edit distance. Lecture Notes in Computer Science, 5125:357–369, 2008.
[BCMR06] Christian Borgs, Jennifer T. Chayes, Elchanan Mossel, and Sébastien Roch. The Kesten-Stigum reconstruction bound is tight for roughly symmetric binary channels. In FOCS, pages 518–530, 2006.
[BKKM04] Tuğkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In SODA '04: Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 910–918, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
[BKMP05] N. Berger, C. Kenyon, E. Mossel, and Y. Peres. Glauber dynamics on trees and hyperbolic graphs. Probab. Theory Rel., 131(3):311–340, 2005. Extended abstract by Kenyon, Mossel and Peres appeared in proceedings of 42nd IEEE Symposium on Foundations of Computer Science (FOCS) 2001, 568–578.
[BRZ95] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov. On the purity of the limiting Gibbs state for the Ising model on the Bethe lattice. J. Statist. Phys., 79(1-2):473–482, 1995.
[BVVD07] N. Bhatnagar, J. Vera, E. Vigoda, and D. Weitz. Reconstruction for colorings on trees. Preprint available at arxiv.org/abs/0711.3664, 2007.
[Cav78] J. A. Cavender. Taxonomy with confidence. Math. Biosci., 40(3-4), 1978.
[DMR06] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Optimal phylogenetic reconstruction. In STOC '06: Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 159–168, New York, 2006. ACM.
[DR09] Constantinos Daskalakis and Sebastien Roch. Alignment-free phylogenetic reconstruction. Preprint, 2009.
[Edg04] Robert C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res., 32(5):1792–1797, 2004.
[EKPS00] W. S. Evans, C. Kenyon, Y. Peres, and L. J. Schulman. Broadcasting on trees and the Ising model. Ann. Appl. Probab., 10(2):410–433, 2000.
[Eli06] Isaac Elias. Settling the intractability of multiple alignment. Journal of Computational Biology, 13(7):1323–1339, 2006. PMID: 17037961.
[Far73] J. S. Farris. A probability model for inferring evolutionary trees. Syst. Zool., 22(4):250–256, 1973.
[GM07] Antoine Gerschenfeld and Andrea Montanari. Reconstruction for models on random graphs. In FOCS '07: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, pages 194–204, Washington, DC, USA, 2007. IEEE Computer Society.
[Hig77] Y. Higuchi. Remarks on the limiting Gibbs states on a $(d+1)$-tree. Publ. Res. Inst. Math. Sci., 13(2):335–348, 1977.
[HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 389–398, Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics.
[HS88] D. G. Higgins and P. M. Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237–244, 1988.
[Iof96] D. Ioffe. On the extremality of the disordered state for the Ising model on the Bethe lattice. Lett. Math. Phys., 37(2):137–143, 1996.
[JM04] S. Janson and E. Mossel. Robust reconstruction on trees is determined by the second eigenvalue. Ann. Probab., 32:2630–2649, 2004.
[KM05] S. Kannan and A. McGregor. More on reconstructing strings from random traces: insertions and deletions. In Proceedings of ISIT, pages 297–301, 2005.
[KMKM02] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl. Acids Res., 30(14):3059–3066, 2002.
[KS66] H. Kesten and B. P. Stigum. Additional limit theorems for indecomposable multidimensional Galton-Watson processes. Ann. Math. Statist., 37:1463–1481, 1966.
[Lev01a] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2–22, 2001.
[Lev01b] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory Ser. A, 93(2):310–332, 2001.
[LG08] Ari Loytynoja and Nick Goldman. Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis. Science, 320(5883):1632–1635, 2008.
[LRN+09] Kevin Liu, Sindhu Raghavan, Serita Nelesen, C. Randal Linder, and Tandy Warnow. Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees. Science, 324(5934):1561–1564, 2009.
[Met03] Dirk Metzler. Statistical alignment based on fragment insertion and deletion models. Bioinformatics, 19(4):490–499, 2003.
[MHR08] Radu Mihaescu, Cameron Hill, and Satish Rao. Fast phylogeny reconstruction through learning of ancestral sequences. CoRR, abs/0812.1587, 2008.
[MLH04] I. Miklos, G. A. Lunter, and I. Holmes. A "Long Indel" Model For Evolutionary Sequence Alignment. Mol Biol Evol, 21(3):529–540, 2004.
[Mos98] E. Mossel. Recursive reconstruction on periodic trees. Random Struct. Algor., 13(1):81–97, 1998.
[Mos01] E. Mossel. Reconstruction on trees: beating the second eigenvalue. Ann. Appl. Probab., 11(1):285–300, 2001.
[Mos03] E. Mossel. On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol., 10(5):669–678, 2003.
[Mos04] E. Mossel. Phase transitions in phylogeny. Trans. Amer. Math. Soc., 356(6):2379–2404, 2004.
[MP03] E. Mossel and Y. Peres. Information flow on trees. Ann. Appl. Probab., 13(3):817–844, 2003.
[MSW04] F. Martinelli, A. Sinclair, and D. Weitz. Glauber dynamics on trees: boundary conditions and mixing time. Comm. Math. Phys., 250(2):301–334, 2004.
[Nav01] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
[NBYST] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. Bulletin of the Technical Committee on, page 19.
[Ney71] J. Neyman. Molecular studies of evolution: a source of novel statistical problems. In S. S. Gupta and J. Yackel, editors, Statistical decision theory and related topics, pages 1–27. Academic Press, New York, 1971.
[NHH00] C. Notredame, D. G. Higgins, and J. Heringa. T-coffee: A novel method for fast and accurate multiple sequence alignment. 2000.
[RE08] Elena Rivas and Sean R. Eddy. Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol, 4(9):e1000172, 09 2008.
[Roc07] S. Roch. Markov Models on Trees: Reconstruction and Applications. PhD thesis, UC Berkeley, 2007.
[Roc08] Sébastien Roch. Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In FOCS, pages 729–738, 2008.
[Sly09a] A. Sly. Reconstruction of random colourings. Communications in Mathematical Physics, 288(3):943–961, 2009.
[Sly09b] A. Sly. Spatial and Temporal Mixing of Gibbs Measures. PhD thesis, UC Berkeley, 2009.
[Sly09c] Allan Sly. Reconstruction for the Potts model. In STOC '09: Proceedings of the 41st annual ACM symposium on Theory of computing, pages 581–590, New York, NY, USA, 2009. ACM.
[SR06] Marc A. Suchard and Benjamin D. Redelings. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics, 22(16):2047–2048, 2006.
[SS03] C. Semple and M. Steel. Phylogenetics, volume 22 of Mathematics and its Applications series. Oxford University Press, 2003.
[Tho04] J. W. Thornton. Resurrecting ancient genes: experimental analysis of extinct molecules. Nat. Rev. Genet., 5(5):366–375, 2004.
[TKF91] Jeffrey L. Thorne, Hirohisa Kishino, and Joseph Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution, 33(2):114–124, 1991.
[TKF92] Jeffrey L. Thorne, Hirohisa Kishino, and Joseph Felsenstein. Inching toward reality: An improved likelihood model of sequence evolution. Journal of Molecular Evolution, 34(1):3–16, 1992.
[VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 399–408, Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics.
[WJ94] Lusheng Wang and Tao Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1(4):337–348, 1994.
[WSH08] Karen M. Wong, Marc A. Suchard, and John P. Huelsenbeck. Alignment Uncertainty and Genomic Analysis. Science, 319(5862):473–476, 2008.

A Algorithm

1. Input. Children sequences $\hat{x}^1, \ldots, \hat{x}^d$.
2. Initialization. Set $\hat{s}_i(0) := 0$ for all $i$, $\ell = k^{1/3}$, $r = 1$, and $t = \ell$.
3. Main loop. While $\hat{x}^i[t + \hat{s}_i(r-1) + 1 : t + \hat{s}_i(r-1) + a]$ is non-empty for all $i$,
   (a) Current position. Set $t = \ell r$.
   (b) Anchor definition. For each $i$, set $\hat{A}^i_r = \hat{x}^i[t + \hat{s}_i(r-1) + 1 : t + \hat{s}_i(r-1) + a]$. We say that $\hat{A}^i_r$ is the $r$'th anchor of the $i$'th child.
       (If the remaining sequences are not long enough to produce an anchor of length $a$, we repeat the previous step with the full remaining sequences.)
   (c) Alignment. For each anchor, we define the set of anchors which agree with it. Formally, $G^i_r = \{ j \in [d] : \mathrm{Corr}(\hat{A}^i_r, \hat{A}^j_r) \ge \gamma \}$.
   (d) Update. We define the set of aligned children $G_r = \{ i : |G^i_r| \ge d - 2 \}$.
       i. Aligned anchors. For each $i \in G_r$, set $\hat{s}_i(r) = \hat{s}_i(r-1)$.
       ii. Misaligned anchors. For each $i \notin G_r$ define two strings $\hat{D}^i_r = \hat{x}^i[t + \hat{s}_i(r-1) : t + \hat{s}_i(r-1) + a - 1]$ and $\hat{I}^i_r = \hat{x}^i[t + \hat{s}_i(r-1) + 2 : t + \hat{s}_i(r-1) + a + 1]$. If $|\{ j \in [d] - \{i\} : \mathrm{Corr}(\hat{D}^i_r, \hat{A}^j_r) \ge \gamma \}| \ge d - 2$, set $\hat{s}_i(r) = \hat{s}_i(r-1) - 1$. If $|\{ j \in [d] - \{i\} : \mathrm{Corr}(\hat{I}^i_r, \hat{A}^j_r) \ge \gamma \}| \ge d - 2$, set $\hat{s}_i(r) = \hat{s}_i(r-1) + 1$.
   (e) Ancestral sequence. Compute $\hat{x}^0_{t-\ell+1}, \ldots, \hat{x}^0_t$ by performing a sitewise majority on the children in $G_r$. (If the remaining children sequences are too short to produce a full island, we use whatever is left, which should all have equal length by our proof.)
   (f) Increment. Set $r := r + 1$.
4. Output. Output $\hat{x}^0$ and set $\hat{k}_0$ to its length.

Figure 1: This is the basic recursive step of our reconstruction algorithm. It takes as input the $d$ inferred sequences of the children $\hat{x}^1, \ldots, \hat{x}^d$ and computes a sequence for the parent $\hat{x}^0$. If any of the steps above cannot be accomplished, we abort the reconstruction of the parent and declare it radioactive.

B Further Lemmas

For $\alpha$ going to 0, we have more precisely:

Lemma B.1 (Limit $\alpha \to 0$) Condition on $\mathcal{L}$. Let
$$\alpha = \frac{1}{h(n)},$$
for $h(n) = \omega(1)$. Then, for $n$ large enough, the root is the father of a $(d-1)$-ary stable subtree with probability at least
$$1 - \chi = 1 - \frac{1}{\sqrt{h(n)}}.$$
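As a numerical sanity check of Lemma B.1, one can iterate the recursion $\nu_r \ge (1 - \alpha)\, g(\nu_{r-1})$ with $g(\nu) = \nu^d + d\nu^{d-1}(1 - \nu)$ that is used in the proof of Lemma 3.4 (Appendix C). The Python sketch below is only an illustration: the function names and the sample values of $d$ and $h$ are assumptions, and the recursion is iterated with equality, which gives a lower bound on the true probabilities.

```python
import math


def g(nu, d):
    """Probability that at least d-1 of d independent events occur, each w.p. nu."""
    return nu ** d + d * nu ** (d - 1) * (1 - nu)


def stable_root_lower_bounds(alpha, d, height):
    """Iterate nu_r = (1 - alpha) * g(nu_{r-1}) starting from nu_0 = 1 (stable leaves)."""
    nus = [1.0]
    for _ in range(height):
        nus.append((1 - alpha) * g(nus[-1], d))
    return nus
```

For example, with the hypothetical values $d = 5$ and $h = 10^4$ (so $\alpha = 10^{-4}$), every iterate stays above $1 - 1/\sqrt{h} = 0.99$, in line with the statement of the lemma.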
Proof of Lemma B.1: Plugging $\alpha = 1/h(n)$ and $\nu = 1 - 1/\sqrt{h(n)}$ into the recursion derived in the proof of Lemma 3.4, we get
$$(1 - \alpha)\, g(\nu) = \left(1 - \frac{1}{h(n)}\right) \left[ \left(1 - \frac{d}{\sqrt{h(n)}} + O\left(\frac{1}{h(n)}\right)\right) + d \left(1 - \frac{d-1}{\sqrt{h(n)}} + O\left(\frac{1}{h(n)}\right)\right) \frac{1}{\sqrt{h(n)}} \right] = \left(1 - \frac{1}{h(n)}\right)\left(1 - O\left(\frac{1}{h(n)}\right)\right) \ge 1 - \frac{1}{\sqrt{h(n)}},$$
for $n \to +\infty$. $\square$

C Proofs

Proof of Lemma 3.1: We prove the upper bound by assuming there is no deletion. The lower bound can be proved similarly. The proof goes by induction. Let $v$ be a node at graph distance $i$ from the root. We show that there is $C'' > 0$ independent of $i$ such that
$$k_v \le k + i \sqrt{C'' k \log n}.$$
Since the depth of $T$ is $O(\log n)$, this implies the main claim as long as
$$\sqrt{C'' k \log n}\, \log n \le \zeta k,$$
which follows from our assumption for $C' > 0$ large enough. The base case of the induction is satisfied trivially. Assume the induction claim holds for $v$, the parent of $u$. It suffices to show that the number of new insertions is at most $\sqrt{C'' k \log n}$. By our induction hypothesis, the number of insertions is bounded above by a binomial $Z$ with parameters $k + (i-1)\sqrt{C'' k \log n} \le (1 + \zeta) k$ and $p_{id}$ w.h.p. By Hoeffding's inequality, taking
$$\eta = \sqrt{\frac{C''' \log n}{(1 + \zeta) k}},$$
we have
$$P[Z > (1 + \zeta) k p_{id} + (1 + \zeta) k \eta] < \exp\left(-2((1 + \zeta) k \eta)^2 / [(1 + \zeta) k]\right) = 1/\mathrm{poly}(n).$$
By our assumption on $p_{id}$, we have
$$(1 + \zeta) k p_{id} = O\left(\frac{\alpha k^{1/3}}{\log n}\right),$$
so that choosing $C''$ large enough gives
$$(1 + \zeta) k p_{id} + (1 + \zeta) k \eta \le \sqrt{C'' k \log n}.$$
This proves the claim. $\square$

Proof of Lemma 3.3: According to Lemma 3.1, the length of the sequence at $v$ is in $[\underline{k}, \bar{k}]$ w.h.p. We denote that event by $\mathcal{L}_v$. We bound the probability of events $\mathcal{B}_1$, $\mathcal{B}_2$, $\mathcal{B}_3$ separately. Let $N = \bar{k}/\ell = (1 + \zeta) k^{2/3}$. Conditioned on $\mathcal{L}_v$, there are at most $N$ anchors, each of length $a$.
By a union bound, the probability that at least one of the sites in the anchors has an indel operation in any child is upper bounded by
$$P[\mathcal{B}_1] = P[\mathcal{B}_1 \mid \mathcal{L}_v] P[\mathcal{L}_v] + P[\mathcal{B}_1 \mid \mathcal{L}_v^c] P[\mathcal{L}_v^c] \le N a d\, p_{id} + 1/\mathrm{poly}(n) = \frac{\alpha a d N}{4 k^{2/3} a d} + 1/\mathrm{poly}(n) = \frac{(1 + \zeta) k^{2/3}}{k^{2/3}} \cdot \frac{\alpha}{4} + 1/\mathrm{poly}(n) < \alpha (1/3 - 1/\mathrm{poly}(n)),$$
where we choose $\zeta$ small enough. The quantity we want to estimate is in fact $P[\mathcal{B}_1 \mid \mathcal{L}]$ (which is not the same as conditioning on $\mathcal{L}_v$ only). But notice that
$$P[\mathcal{B}_1] = P[\mathcal{B}_1 \mid \mathcal{L}] P[\mathcal{L}] + P[\mathcal{B}_1 \mid \mathcal{L}^c] P[\mathcal{L}^c] \ge P[\mathcal{B}_1 \mid \mathcal{L}] P[\mathcal{L}],$$
which implies
$$P[\mathcal{B}_1 \mid \mathcal{L}] \le \frac{\alpha (1/3 - 1/\mathrm{poly}(n))}{1 - 1/\mathrm{poly}(n)} < \alpha/3.$$
(This argument shows that it suffices to condition on $\mathcal{L}_v$. We apply the same trick below.)

To bound the probability of the second event, consider an island $I$ and a son $u$. The probability that there is an indel when evolving from $v$ to $u$ is at most
$$p_{id}\, \ell = \frac{\alpha}{4 k^{2/3} a d}\, k^{1/3} = \frac{\alpha}{4 k^{1/3} a d}.$$
Thus, the probability that more than one child of $v$ experiences an indel in $I$ is at most
$$\sum_{i=2}^{d} \binom{d}{i} \left(\frac{\alpha}{4 k^{1/3} a d}\right)^i \le \sum_{i=2}^{d} \frac{d^i}{i!} \left(\frac{\alpha}{4 k^{1/3} a d}\right)^i$$
$$\le \sum_{i=2}^{d} \frac{1}{i!} \left(\frac{\alpha}{4 k^{1/3} a}\right)^i \le e \left(\frac{\alpha}{4 k^{1/3} a}\right)^2 = \frac{e \alpha^2}{16 k^{2/3} a^2},$$
where we used that the expression in parentheses on the second line is $< 1$. Taking a union bound over all islands, the probability that at least two children experience an indel in the same island is at most
$$P[\mathcal{B}_2 \mid \mathcal{L}] \le N \cdot \frac{e \alpha^2}{16 k^{2/3} a^2} = \frac{(1 + \zeta) e \alpha^2}{16 a^2} < \frac{\alpha}{3},$$
where we used that $\alpha < 1$.

For the third event, consider again an island $I$ and a child $u$. The probability of at least two indel operations in $I$ when evolving from $v$ to $u$ is at most
$$\sum_{i=2}^{2\ell} \binom{2\ell}{i} \left(\frac{\alpha}{4 a d k^{2/3}}\right)^i \le \sum_{i=2}^{2\ell} \frac{1}{i!} \left(\frac{2 \ell \alpha}{4 a d k^{2/3}}\right)^i \le \sum_{i=2}^{2\ell} \frac{1}{i!} \left(\frac{\alpha}{2 a d k^{1/3}}\right)^i \le e \left(\frac{\alpha}{2 a d k^{1/3}}\right)^2 \le \frac{e \alpha^2}{4 a^2 d^2 k^{2/3}}.$$
(We use $2\ell$ to account for insertions and deletions.)
Taking a union bound over all islands and children, the probability that there are two indel operations in the same child in the same island is bounded by
$$P[\mathcal{B}_3 \mid \mathcal{L}] \le d N \frac{e \alpha^2}{4 a^2 d^2 k^{2/3}} \le \frac{(1 + \zeta) e \alpha^2}{4 a^2 d} < \alpha/3.$$
Taking a union bound over the three ways in which a node can become radioactive proves the lemma. $\square$

Proof of Lemma 3.4: We follow a proof of [Mos01]. Let $v$ be a node at distance $r$ from the leaves. We let $\nu_r$ be the probability that $v$ is the root of a $(d-1)$-ary stable subtree. Let
$$g(\nu) = \nu^d + d \nu^{d-1} (1 - \nu).$$
Then, from Lemma 3.3,
$$\nu_r \ge (1 - \alpha)\, g(\nu_{r-1}).$$
Note that
$$g'(\nu) = d(d-1) \nu^{d-2} (1 - \nu).$$
In particular, $g$ is monotone, $g(1) = 1$, and $g'(1) = 0$. Hence, there is $1 - \chi < \nu^* < 1$ such that $g(\nu^*) > \nu^*$. Then, taking $1 - \alpha > \nu^*/g(\nu^*)$, we have
$$\nu_r \ge (1 - \alpha)\, g(\nu_{r-1}) \ge \frac{\nu^*}{g(\nu^*)}\, g(\nu_{r-1}) \ge \nu^* > 1 - \chi,$$
by the induction hypothesis that $\nu_{r-1} \ge \nu^*$. Note in particular that $\nu_0 = 1 \ge \nu^*$. $\square$

Proof of Lemma 4.3: Recall that we assume the root state is 0 and all adversarial nodes are 1. Because of the bias towards 1, we cannot apply standard results about recursive majority for symmetric channels [Mos98, Mos04]. Instead, we perform a tailored analysis of this particular channel.

We take asymptotics as $d \to +\infty$ and we show that the probability of reconstruction can be taken to be
$$1 - \beta = 1 - \frac{1}{d},$$
for $C''$ large enough. Let $v$ be the root of $T^{(d)}_{H_0}$. We denote by $Z_v$ the number of non-adversarial children of $v$ in state 0 and by $Z'_v$ the number of nodes among them that return 0 upon applying recursive majority to their respective subtree. Let $q^0_{H_0}$ be the probability of incorrect reconstruction at $v$ (given that the state at $v$ is 0). Then
$$1 - q^0_{H_0} \ge P\left[Z'_v \ge \frac{d+1}{2}\right] \ge \sum_{i=0}^{d-2} P\left[Z'_v \ge \frac{d+1}{2} \,\Big|\, Z_v = i\right] P[Z_v = i], \tag{1}$$
where we simply ignored the contribution of the children who flipped to 1.
We prove $q^0_{H_0} \le 1/d$ by induction on the height. Let $u$ be a non-adversarial node in $T^{(d)}_{H_0}$ at height $h$ from the leaves, to which we associate as above the variables $Z_u$, $Z'_u$ and the quantity $q^0_h$. Note that $q^0_0 = 0$. We assume the induction hypothesis holds for $h - 1$. Note that, conditioned on the state at $u$ being 0, $Z_u$ is $\mathrm{Bin}(d-2, 1 - p_s)$ where
$$1 - p_s = \frac{1 + \theta_s}{2} = \frac{1}{2} + \Theta\left(\sqrt{\frac{\log d}{d}}\right),$$
as $d \to +\infty$. Similarly, given $Z_u = i$, the variable $Z'_u$ is $\mathrm{Bin}(i, 1 - q^0_{h-1})$. In particular, the quantity
$$P\left[Z'_u \ge \frac{d+1}{2} \,\Big|\, Z_u = i\right]$$
is monotone in $i$. We use Chernoff's bound on $Z_u$ to truncate the lower bound (1). Indeed, let
$$\mu = (1 - p_s)(d - 2) = \frac{d}{2} + \Upsilon(d),$$
with $\Upsilon(d) = \Theta(\sqrt{d \log d})$, and
$$\mu (1 - \eta) = \frac{d}{2} + \frac{\Upsilon(d)}{2},$$
where in particular
$$\eta = \Theta\left(\sqrt{\frac{\log d}{d}}\right).$$
Then, we have
$$P[Z_u < \mu(1 - \eta)] < \exp\left(-\mu \eta^2 / 2\right) = d^{-\Omega(1)},$$
for $C''$ large enough. Applying this to (1) leads to the lower bound
$$1 - q^0_h \ge (1 - d^{-\Omega(1)})\, P\left[\mathrm{Bin}\left(\frac{d}{2} + \frac{\Upsilon(d)}{2},\, 1 - q^0_{h-1}\right) \ge \frac{d+1}{2}\right].$$
By the induction hypothesis, $q^0_{h-1} \le 1/d$. By applying Chernoff's bound again we get
$$P\left[\mathrm{Bin}\left(\frac{d}{2} + \frac{\Upsilon(d)}{2},\, 1 - q^0_{h-1}\right) \ge \frac{d+1}{2}\right] > 1 - d^{-\Omega(1)},$$
and therefore $q^0_h \le 1/d$. This proves the claim. $\square$

Proof of Lemma 4.5: As we pointed out earlier, although the subtrees $(T^{**}_{t'})_{t'=t+1}^{t+m}$ are correlated by the construction of the islands, they are independent of the substitution process. By forcing (randomly) the subtrees $(T^{**}_{t'})_{t'=t+1}^{t+m}$ to be $(d-2)$-ary and fixing the adversarial nodes to 1, we restore the i.i.d. nature of the sites, from which the result follows. $\square$

Proof of Lemma 4.7: This follows from Lemma 4.5, the independence of $\Lambda$ and $\Theta$, and three applications of Hoeffding's lemma. $\square$

Proof of Lemma 4.8: Note that
$$\mathrm{Corr}(\hat{X}, \hat{Y}) = \frac{1}{m} \sum_{i=1}^m \langle\hat{x}_i\rangle\langle\hat{y}_i\rangle = \frac{1}{m} \sum_{i=1}^m \langle x_i\rangle\langle y_i\rangle\langle z_i\rangle\langle w_i\rangle.$$
Hence,
$$|\mathrm{Corr}(X, Y) - \mathrm{Corr}(\hat{X}, \hat{Y})| \le \frac{1}{m} \sum_{i=1}^{m} (1 - \langle z_i \rangle \langle w_i \rangle) = 1 - \frac{1}{m} \sum_{i=1}^{m} \langle z_i \rangle \langle w_i \rangle.$$
Now notice by case analysis that
$$\langle z_i \rangle \langle w_i \rangle \ge \langle \lambda_i \rangle \langle \theta_i \rangle - \mathbf{1}_{\langle \lambda_i \rangle = -1} - \mathbf{1}_{\langle \theta_i \rangle = -1}.$$
This proves the first claim. The second claim follows from the bounds in Lemma 4.7. ∎

Proof of Lemma 5.1: For the first claim, note that
$$\mathbb{E}[\mathrm{Corr}(A^i_r, A^j_r)] = \theta_s^2 = (1 - 2 p_s)^2,$$
where we used that 1) there is no indel in the sites $[t+1 : t+a]$ between $v$ and $i, j$; 2) the sites are perfectly aligned; and 3) the substitution process is independent of the indel process. We also used the well-known fact that the $\theta_s$'s are multiplicative along a path under our model of substitution [SS03]. The result then follows from Hoeffding's inequality.

For the second claim, because the anchors are now misaligned, the $t'$-th term in $\mathrm{Corr}(A^i_r, D^j_r)$ for $t' \in [t+1 : t+a]$ is the variable
$$\langle x^i_{t' + s_i(r)} \rangle \langle x^j_{t' + s_j(r) - 1} \rangle,$$
which is uniform in $\{-1, +1\}$. In particular, we now have $\mathbb{E}[\mathrm{Corr}(A^i_r, D^j_r)] = 0$. The result follows from the method of bounded differences with respect to the independent vectors $\{ (x^i_{t' + s_i(r)}, x^j_{t' + s_j(r)}) \}_{t'=t}^{t+a}$. ∎

Proof of Lemma 5.2: This follows from Lemmas 4.8 and 5.1 and the triangle inequality. ∎

D Completing the Proof of the Main Theorem

Having conditioned on the event $\mathcal{E}$, we justify the correctness of our reconstruction method via the following induction. The top level of the induction establishes Theorem 1.3.

Induction hypothesis. Consider a parent $v$ in $T^*$; in particular, $v$ is stable. We assume that the following conditions, denoted by $(\star)$, are satisfied: For all children $i \in [d]$ of $v$ belonging to $T^*$

1. Alignment. For all children $i'$ of $i$ with $i' \in T^*$ and all $r = 1, \ldots, \bar{k}/\ell - 1$,
$$\hat{s}_{i'}(r) = s_{i'}(r).$$
(2) (This condition is trivially satisfied for values of $r\ell$ that are larger than the sequence length of $i'$.)

2. Reconstruction. Moreover, we have $\hat{k}_i = k_i$ and for all $t = 1, \ldots, k_i$, the following holds: Let $L_i$ be the leaves below $i$ with $n_i = |L_i|$. Let $H$ be the level of $v$. Let $L^{**}_t$ be the gateway leaves for site $t$. For $u \in L^{**}_t$ let $F_u(t)$ be the position of site $t$ in $u$. Note that $\hat{x}^i_t$ can be written as
$$\hat{x}^i_t = \mathrm{Maj}^{H-1}(z_1, \ldots, z_{n_i}),$$
where $z_j$ is either $\sharp$ or $x^j_{\flat_j}$ for an appropriate function $\flat_j$. Our hypothesis is that
$$\forall u \in L^{**}_t, \quad \flat_u = F_u(t). \qquad (3)$$
In particular, the ancestral reconstruction $\hat{X}_i$ dominates the adversarial reconstruction $\hat{X}'_i$.

The base case where $v$ is a leaf is trivially satisfied.

Alignment. We begin with the correctness of the alignment.

Lemma D.1 (Induction: Alignment) Assuming $\mathcal{E}$ and $(\star)$, the algorithm infers $s_i$ correctly for all children $i \in [d]$ which are also in $T^*$, that is, (2) holds for $v$.

Proof of Lemma D.1: Let $\Pi$ denote the set of children of $v$ in $T^*$. The proof follows by induction on $r$. The base case $r = 0$ is trivial. Assume correctness for $r - 1$. If there is no indel in any of the children $i \in \Pi$ between the sites $(r-1)\ell$ and $r\ell$ of $v$, then under $\mathcal{E}$, $(\star)$ and Lemma 5.2 we have $\Pi \subseteq G_r$. In that case, for all $i \in \Pi$ we have
$$\hat{s}_i(r) = \hat{s}_i(r-1) = s_i(r-1) = s_i(r),$$
where the second equality is from $(\star)$.

If there is an indel operation in island $r$, then since $v$ is stable only one indel operation occurred in one child. Denote the child with an indel by $j$. Assume the indel is a deletion. (The case of the insertion is handled similarly.) If $j$ is not in $T^*$ we are back to the previous case. So assume $j$ is in $T^*$. Again, from $\mathcal{E}$, $(\star)$ and Lemma 5.2 the other children in $T^*$ are added to the set $G_r$, and the shift value will be computed correctly for them.
Moreover by $(\star)$, for every $i \in \Pi - \{j\}$,
$$f_i(r\ell + 1) = r\ell + 1 + s_i(r) = r\ell + 1 + \hat{s}_i(r) = r\ell + 1 + \hat{s}_i(r-1),$$
which is the starting point of $\hat{A}^i_r$. Also,
$$f_j(r\ell + 1) = r\ell + 1 + s_j(r) = r\ell + 1 + s_j(r-1) - 1 = r\ell + 1 + \hat{s}_j(r-1) - 1 = r\ell + \hat{s}_j(r-1),$$
which is the starting point of $\hat{D}^j_r$. Thus according to Lemma 5.2, $\hat{D}^j_r$ matches $\hat{A}^i_r$ for all $i \in \Pi \cap G_r$. As there are $d - 2$ children in $\Pi \cap G_r$, we get that the algorithm sets
$$\hat{s}_j(r) = \hat{s}_j(r-1) - 1 = s_j(r-1) - 1 = s_j(r),$$
as required. Note also that in this case, according to Lemma 5.2 again, $\hat{A}^j_r$ does not have high correlation with $\hat{A}^i_r$ for any $i \in \Pi \cap G_r$, and thus we will consider $\hat{I}^j_r$ and $\hat{D}^j_r$. Similarly, $\hat{I}^j_r$ does not have high correlation with $\hat{A}^i_r$ for any $i \in \Pi \cap G_r$, and thus we will not try to set $\hat{s}_j(r)$ twice. ∎

Ancestral reconstruction. We use Lemma D.1 to prove that the ancestral reconstruction dominates the adversarial reconstruction. In the algorithm, we perform a sitewise majority vote over the children of $v$ in $G_r$ (these are the aligned children; see the description of the algorithm in Figure 1). For notational convenience, we assume that in fact we perform a majority vote over all children but we replace the states of the children outside $G_r$ with $\sharp$.

Lemma D.2 (Induction: Reconstruction) Assuming $\mathcal{E}$, $(\star)$ and the conclusion of Lemma D.1, (3) holds for $v$. In particular, the ancestral reconstruction $\hat{X}_v$ dominates the adversarial reconstruction $\hat{X}'_v$.

Proof of Lemma D.2: The second claim follows from the first one together with the construction of the adversarial process and the monotonicity of majority. As for the first claim, by Lemma D.1 for each site of $v$ there are $d - 2$ uncorrupted children islands containing this site such that the children are also in $T^*$.
In particular, the $d - 2$ corresponding sites in the children are correctly aligned. Moreover, by the induction hypothesis, each corresponding site in the children satisfies (3). By taking a majority vote over these sites we get (3) for $v$ as well.

A small technical detail is handling the case where the last island has fewer than $a$ sites, and thus does not contain an anchor. However, in this case, if the father is stable then there are no indel operations at all in the last island, and therefore aligning it according to the previous one gives the right result. ∎
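To make the two induction steps above more concrete, the following Python sketch illustrates, under simplifying assumptions, how a shift might be inferred from anchor correlations (as in Lemma D.1) and how a sitewise majority vote with $\sharp$ placeholders might be taken (as in Lemma D.2). It is not the algorithm of Figure 1. The function names `corr`, `infer_shift` and `sitewise_majority`, the use of a single reference anchor instead of pairwise comparisons among children, the encoding of $\sharp$ as `0`, the tie-breaking rule, and the threshold value are all assumptions made for illustration.

```python
import random

MISSING = 0  # assumption: encodes the symbol ♯ (child outside G_r); sites are in {-1, +1}


def corr(x, y):
    """Empirical correlation (1/m) * sum_i x_i * y_i of two equal-length ±1 lists."""
    assert len(x) == len(y) and len(x) > 0
    return sum(a * b for a, b in zip(x, y)) / len(x)


def infer_shift(ref_anchor, child, start, prev_shift, threshold):
    """Try the three candidate shifts for one island of one child.

    ref_anchor : anchor sites taken from an already aligned sequence (assumption)
    child      : full ±1 sequence of the child
    start      : index in the reference where the anchor begins
    prev_shift : shift inferred for the previous island
    Returns the first shift whose window correlates with ref_anchor above
    `threshold`, or None if no candidate matches.
    """
    a = len(ref_anchor)
    # Candidates: no indel (as for A), one deletion (D), one insertion (I).
    for delta in (0, -1, +1):
        shift = prev_shift + delta
        lo = start + shift
        if lo < 0 or lo + a > len(child):
            continue  # window falls outside the child sequence
        if corr(ref_anchor, child[lo:lo + a]) > threshold:
            return shift
    return None


def sitewise_majority(children):
    """Majority vote per site over equal-length lists with entries in {-1, 0, +1}.

    Entries equal to MISSING abstain. Ties are broken toward +1, which is an
    arbitrary choice made for this sketch.
    """
    n = len(children[0])
    votes = [sum(child[t] for child in children) for t in range(n)]
    return [1 if v >= 0 else -1 for v in votes]


def random_sequence(length, seed):
    """Uniform ±1 sequence from a seeded generator (illustration only)."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(length)]
```

For instance, with `parent = random_sequence(400, 0)`, a child obtained by deleting one site before position 100 should yield `infer_shift(parent[100:300], child, 100, 0, 0.5) == -1`, because the misaligned window has correlation concentrated near 0 while the window shifted by one site matches exactly. No substitution noise is modeled here, so the correlations are idealized.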
