Coverage statistics for sequence census methods

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion …

Authors: Steven N. Evans, Valerie Hower, Lior Pachter

November 4, 2021

Abstract

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions.

Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed.

Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.

1 Introduction

The classic “Lander-Waterman model” [15] provides statistical estimates for the read coverage in a whole genome shotgun (WGS) sequencing experiment via the Poisson approximation to the Binomial distribution.
Although originally intended for estimating the extent of coverage when mapping by fingerprinting random clones, the Lander-Waterman model has served as an essential tool for estimating sequencing requirements for modern WGS experiments [17]. Although it makes a number of simplifying assumptions (e.g. fixed fragment length and uniform fragment selection) that are violated in actual experiments, extensions and generalizations [19, 18] have continued to be developed and applied in a variety of settings.

The advent of “high-throughput sequencing”, which refers to massively parallel sequencing technologies, has greatly increased the scope and applicability of sequencing experiments. With the increasing scope of experiments, new statistical questions about coverage statistics have emerged. In particular, in the context of sequence census methods, it has become important to understand the shape of coverage functions, rather than just coverage statistics at individual sites.

Sequence census methods [20] are experiments designed to assess the content of a mixture of molecules via the creation of DNA fragments whose abundances can be used to infer those of the original molecules. The DNA fragments are identified by sequencing, and the desired abundances inferred by solution of an inverse problem. An example of a sequence census method is ChIP-Seq. In this experiment, the goal is to determine the locations in the genome where a specific protein binds. An antibody to the protein is used to “pull down” fragments of DNA that are bound via a process called chromatin immunoprecipitation (abbreviated by ChIP). These fragments form the “mixture of molecules” and after purifying the DNA, the fragments are determined by sequencing. The resulting sequences are compared to the genome, leading to a coverage function that records, at each site, the number of sequenced fragments that contained it.
As with many sequence census methods, “noise” in the experiment leads to random sequenced fragments that may not correspond to bound DNA, and therefore it is necessary to identify regions of the coverage function that deviate from what is expected according to a suitable null model.

The purpose of this paper is not to develop methods for the analysis of ChIP-Seq (or any other sequence census method), but rather to present a null model for the shape of a coverage function that is of general utility. That is, we propose a definition for the shape of a fragment coverage function, and describe a random instance assuming that fragments are selected at random from a genome, with lengths of fragments given by a known distribution. The distinction between our work and previous statistical studies of sequencing experiments is that we go beyond the description of coverage at a single location, to a description of the change in coverage along a genome.

2 The shape of a fragment coverage function

We begin by explaining what we mean by a coverage function. Given a genome modeled as a string of fixed length $N$, a coverage function is a function $f : \{1, \dots, N\} \to \mathbb{Z}_{\ge 0}$. The interpretation of this function is that $f(i)$ is the number of sequenced fragments obtained from a sequencing experiment that cover position $i$ in the genome. It is important to note that $N$ is typically large; for example, the human genome consists of approximately 2.8 billion bases. Because $N$ is very large, we replace the finite set $\{1, \dots, N\}$ with $\mathbb{R}$, and re-define a coverage function to be a function $f : \mathbb{R} \to \mathbb{Z}_{\ge 0}$. This helps to simplify our analysis.

We next introduce an object that describes a sequence coverage function’s shape. Our approach is motivated by recent applications of topology including persistent homology [2, 21] and the use of critical points in shape analysis [1, 5, 6].
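To make the discrete definition of a coverage function concrete, here is a minimal Python sketch that computes $f(1), \dots, f(N)$ from a list of fragments. The representation of a fragment as a `(start, length)` pair with a 1-based start, and the names `coverage` and `example`, are assumptions made for illustration; they are not taken from the paper.

```python
def coverage(fragments, genome_length):
    """Return [f(1), ..., f(N)], where f(i) counts fragments covering position i.

    Assumed convention: a fragment (start, length) covers the positions
    start, start + 1, ..., start + length - 1 (1-based, inclusive).
    """
    # Difference array: +1 where a fragment begins, -1 just past its end.
    diff = [0] * (genome_length + 2)
    for start, length in fragments:
        lo = max(start, 1)
        hi = min(start + length - 1, genome_length)
        if lo > hi:
            continue  # fragment lies entirely outside the genome
        diff[lo] += 1
        diff[hi + 1] -= 1

    # A running sum of the difference array gives the coverage at each site.
    cov = []
    running = 0
    for i in range(1, genome_length + 1):
        running += diff[i]
        cov.append(running)
    return cov


# Three toy fragments on a genome of length 8.
example = coverage([(2, 4), (4, 3), (5, 1)], 8)
```

The difference-array approach takes time proportional to the number of fragments plus $N$, rather than the total fragment length, which matters when $N$ is in the billions.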
For a given coverage function $f : \mathbb{R} \to \mathbb{Z}_{\ge 0}$, we will define a rooted tree, which is a particular type of directed graph with all the directed edges pointing away from the root. This tree $T_f$ is based on the upper-excursion sets of $f$:

\[ U_h := \{ (x, f(x)) \mid f(x) \ge h \}, \quad h \in \mathbb{Z}_{\ge 0}, \]

and keeps track of how the sets $U_h$ evolve as $h$ decreases. Long paths in $T_f$ represent features of the coverage function that persist through many values of $h$.

Specifically, for each $h \in \mathbb{Z}_{\ge 0}$, let $C_h$ denote the set of connected components of the upper-excursion set $U_h$. We define the rooted tree $T_f = (V, E)$ as follows:

• Vertices in $V$ correspond to the connected components in the collection $\{C_h\}_{h \in \mathbb{Z}_{\ge 0}}$.
• $(i, j) \in E$ provided their corresponding connected components $c_i \in C_{h_i}$ and $c_j \in C_{h_j}$ with $h_i < h_j$ satisfy $h_i = h_j - 1$ and $c_j \subset c_i$.

Note that the root of $T_f$ corresponds to the single connected component in $C_0$. The tree $T_f$ is very similar to a contour tree [1, §4.1], which is built using level sets of a function, and a join tree [3]. Indeed, suppose we ignore every vertex that is adjacent to only one vertex with greater height. Then, the remaining vertices of $T_f$ correspond to (equivalence classes of) local extrema of $f$. Each local maximum of $f$ yields the birth of a new connected component as we sweep down through $h \in \mathbb{Z}_{\ge 0}$, while a local minimum of $f$ merges connected components. Since we do not require $f$ to have distinct critical values (as is frequently assumed), the vertices in $T_f$ can have arbitrary degrees, as is depicted in Figure 1C.

In the sequel, we will use the following equivalent characterization that can be found in [7, §2.3]. Given a coverage function $f : \mathbb{R} \to \mathbb{Z}_{\ge 0}$ with $f(a) = f(b) = 0$ and $f(x) > 0$ for $x \in (a, b)$, we form an integer-valued sequence $x_0, \dots, x_{2n}$ that records the changes in height of $f$ on the interval $[a, b]$. The sequence $x_0, \dots, x_{2n}$ consists of the $y$ values that $f$ travels through from $x_0 := f(a) = 0$ to $x_{2n} := f(b) = 0$ and satisfies

\[ x_0 = x_{2n} = 0, \qquad x_i > 0 \ \text{for } 0 < i < 2n, \qquad |x_i - x_{i-1}| = 1 \ \text{for } 1 \le i \le 2n. \]

Such a sequence is called a lattice path excursion away from 0. Next, we define an equivalence relation on the set $\{0, 1, \dots, 2n\}$ by setting

\[ i \equiv j \iff x_i = x_j = \min_{i \le k \le j} x_k. \]

The equivalence classes under this relation are in 1:1 correspondence with the connected components in the upper-excursion sets of $f|_{[a,b]}$. One equivalence class is $\{0, 2n\}$, and if $\{i_1, \dots, i_p\}$ is an equivalence class with $0 < i_1 < i_2 < \dots < i_p$ then $x_{i_1 - 1} = x_{i_1} - 1$, whereas $x_{i_q - 1} = x_{i_q} + 1$ for $2 \le q \le p$. Conversely, any index $i$ with $x_{i-1} = x_i - 1$ is the minimal element of an equivalence class. We use the minimal element of each equivalence class as its representative. Thus, we can view the vertices of $T_{f|_{[a,b]}}$ as the set

\[ \{0\} \cup \{ i \mid x_{i-1} = x_i - 1 \}. \]

Two indices $i_1 < i_2$ are adjacent in $T_{f|_{[a,b]}}$ provided $x_{i_2} = x_{i_1} + 1$ and $x_k \ge x_{i_1}$ for $i_1 \le k \le i_2$. Figure 1 gives an example of a coverage function together with its lattice path excursion

(0, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 3, 2, 1, 0)

and rooted tree. The minimal elements of each equivalence class in Figure 1B are depicted with red squares.

Figure 1: A coverage function (A) with its lattice path excursion (B) and rooted tree (C).

3 Planar Poisson processes from sequencing experiments

In order to model random coverage along the genome, we use a Poisson process to give random starting locations to the fragments. Specifically, suppose that we have a stationary Poisson point process on $\mathbb{R}$ with intensity $\rho$. At each point of the Poisson point process we lay down an interval that has that point as its left end-point.
The lengths of the successive intervals are independent and identically distributed with common distribution $\mu$. We will use the notation $X$ for a coverage function built from this process and $X_t$ for the height at a point $t$.

Let $t_1, t_2, \dots$ be the left end-points and $l_1, l_2, \dots$ be the corresponding lengths of intervals. The interval given by $(t_i, l_i)$ will cover a nucleotide $t_0$ provided $t_i \le t_0$ and $t_i + l_i \ge t_0$. We can view this pictorially by plotting the points $\{(t_j, l_j)\}$ in the plane. Then $X_{t_0}$, the number of intervals covering $t_0$, is the number of points in the triangular region shown in Figure 2.

Figure 2: A two dimensional view of a sequencing experiment. (The wedge in the $(t, l)$-plane is bounded by the line $l = t_0 - t$ and the vertical line through $t_0$.)

We now recall the definition of a two-dimensional Poisson process and refer the reader to [10, §6.13] or [4, §2.4] for the details. Suppose $\Gamma$ is a locally finite measure on the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^2)$. A random countable subset $\Pi$ of $\mathbb{R}^2$ is called a non-homogeneous Poisson process with mean measure $\Gamma$ if, for all Borel subsets $A$, the random variables $N(A) := \#(A \cap \Pi)$ satisfy:

1. $N(A)$ has the Poisson distribution with parameter $\Gamma(A)$, and
2. If $A_1, \dots, A_k$ are disjoint Borel subsets of $\mathbb{R}^2$, then $N(A_1), \dots, N(A_k)$ are independent random variables.

The following theorem is a consequence of [14, Proposition 12.3].

Theorem 3.0.1. The collection $\{(t_i, l_i)\}$ of points obtained as described above is a non-homogeneous Poisson process with mean measure $\rho\, m \otimes \mu$. Here $m$ is Lebesgue measure on $\mathbb{R}$.

We compute the expected value $E[X_t] = \rho\, m \otimes \mu(\text{wedge})$:

\[ \rho\, m \otimes \mu(\text{wedge}) = \rho \int_{-\infty}^{t} \int_{t-u}^{\infty} \mu(dv)\, du = \rho \int_{-\infty}^{t} \mu((t-u, \infty))\, du = \rho \int_{0}^{\infty} \mu((s, \infty))\, ds. \]

3.1 Fragment lengths have the exponential distribution

We treat the simplest case first, namely the case where the distribution $\mu$ of fragment lengths is exponential with rate $\lambda$.
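The construction is straightforward to simulate, which gives a quick sanity check on the expected-value formula. The sketch below is an illustration under stated assumptions, not code from the paper: the function names, the parameter values, and the truncation of the start sites to a finite window are all choices made here. It uses exponential lengths and compares the empirical mean coverage at $t = 0$ with $\rho \int_0^\infty \mu((s, \infty))\, ds$, which equals $\rho$ times the mean fragment length.

```python
import random


def simulate_fragments(rho, sample_length, t_min, t_max, rng):
    """Sample points (t_i, l_i): Poisson start sites on [t_min, t_max], i.i.d. lengths."""
    points = []
    t = t_min
    while True:
        # Exponential gaps with rate rho yield a Poisson process of intensity rho.
        t += rng.expovariate(rho)
        if t > t_max:
            break
        points.append((t, sample_length(rng)))
    return points


def coverage_at(points, t0):
    """Count the points in the wedge {t <= t0, t + l >= t0}."""
    return sum(1 for t, l in points if t <= t0 <= t + l)


rng = random.Random(0)
rho, lam = 2.0, 0.5   # assumed illustrative values
window = 60.0         # start sites further back than this are ignored (truncation)
trials = 2000

total = 0
for _ in range(trials):
    pts = simulate_fragments(rho, lambda r: r.expovariate(lam), -window, 0.0, rng)
    total += coverage_at(pts, 0.0)

mean_cov = total / trials
expected = rho * (1.0 / lam)  # rho times the mean of an exponential(lam) length
```

Passing a different `sample_length` function (for example one returning a constant) exercises the same formula for other length distributions, provided the window is long compared with typical fragment lengths.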
Then, we have $\mu((s, \infty)) = P\{l > s\} = e^{-\lambda s}$, and

\[ E(X_t) = \rho \int_0^\infty e^{-\lambda s}\, ds = \frac{\rho}{\lambda}. \]

Claim 1. The process $X$ is a stationary, time-homogeneous Markov process.

Proof. It is clear that $X$ is stationary because of the manner in which it is constructed from a Poisson process on $\mathbb{R}^2$ that has a distribution which is invariant under translations in the $t$ direction; that is, the random set $\{(t_i, l_i)\}$ has the same distribution as $\{(t_i + t, l_i)\}$ for any fixed $t \in \mathbb{R}$. Since $\mu$ is exponential, it is memoryless, meaning that for any interval length $l$ with an exponential distribution, $P\{l > a + b \mid l > a\} = P\{l > b\}$. This means that the probability that an interval covers $t_2$ knowing that it covers $t_1$ is the same as the probability that an interval starting at $t_1$ covers $t_2$. Thus, the probability that $X_{t_2} = k$ given $X_t$ for $t \le t_1$ only depends on the value of $X_{t_1}$. Indeed, in terms of time, $P\{X_{t_2} = k \mid X_{t_1} = k_0\}$ depends only on $t_2 - t_1$. More specifically, $X$ is a birth-and-death process with birth rate $\beta(k) = \rho$ in all states $k$ and death rate $\delta(k) = k\lambda$ in state $k \ge 1$.

Note that as the exponential distribution is the only distribution with the memoryless property, we lose the Markov property when $\mu$ is not exponential.

To build the tree of §2, we are interested in the jumps of the coverage function $f(t) = X_t$. We hence consider the jump chain of $X$, a discrete-time Markov chain with transition matrix

\[ P(i, j) = \begin{cases} 1, & \text{if } i = 0 \text{ and } j = 1, \\ \dfrac{\rho}{\rho + i\lambda}, & \text{if } i \ge 1 \text{ and } j = i + 1, \\ \dfrac{i\lambda}{\rho + i\lambda}, & \text{if } i \ge 1 \text{ and } j = i - 1, \\ 0, & \text{otherwise.} \end{cases} \]

Suppose now we have a lattice path excursion starting at 0. Given a vertex $v$ of the associated tree at height $k$, we are interested in the number of offspring (at height $k + 1$) of this vertex.
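Offspring counts of this kind can be read directly off a lattice path excursion using the characterization of §2, in which vertices are the index 0 together with every index reached by an up-step. The following Python sketch is one convenient way to do this. The function name `excursion_tree` and the stack-based traversal are choices made here for illustration, not a procedure given in the paper. The stack holds the currently open vertex at each height, so the top of the stack before an up-step is the parent prescribed by the adjacency rule.

```python
def excursion_tree(path):
    """Map each vertex (minimal class representative) to its list of children.

    `path` is a lattice path excursion x_0, ..., x_{2n}: it starts and ends
    at 0, is positive in between, and moves by +1 or -1 at every step.
    """
    if len(path) < 3 or path[0] != 0 or path[-1] != 0:
        raise ValueError("path must start and end at 0")
    children = {0: []}
    stack = [0]  # stack[h] is the open vertex at height h
    for i in range(1, len(path)):
        if i < len(path) - 1 and path[i] <= 0:
            raise ValueError("excursion must stay positive between its end-points")
        step = path[i] - path[i - 1]
        if step == 1:
            # Up-step: index i is a new vertex whose parent is the open
            # vertex one level below, i.e. the current top of the stack.
            children[stack[-1]].append(i)
            children[i] = []
            stack.append(i)
        elif step == -1:
            # Down-step: the component at the previous height is closed.
            stack.pop()
        else:
            raise ValueError("steps must be +1 or -1")
    return children


# The lattice path excursion of Figure 1.
figure1_path = [0, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 3, 2, 1, 0]
tree = excursion_tree(figure1_path)
offspring = {v: len(kids) for v, kids in tree.items()}
```

For the Figure 1 excursion this yields nine vertices, and the vertex with representative 2 (at height 2) has three children, with representatives 3, 7 and 13.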
Suppose $i_0$ is the minimal equivalence class representative for vertex $v$, and suppose $[i_0] = \{i_0, i_1, \dots, i_n\}$ with $i_0 < i_1 < \dots < i_n$. Then, we have $x_{i_r} = k$ for $0 \le r \le n$, $x_{i_r + 1} = k + 1$ for $0 \le r \le n - 1$, $x_{i_n + 1} = k - 1$, and $x_t > k$ for $i_0 < t < i_n$ with $t$ not equal to any $i_r$. From the Markov property, for $0 \le j \le n$,

\[ P\{x_{i_j + 1} = k + 1 \mid x_{i_j} = k\} = \frac{\rho}{\rho + \lambda k} \quad \text{and} \quad P\{x_{i_j + 1} = k - 1 \mid x_{i_j} = k\} = \frac{\lambda k}{\rho + \lambda k}. \]

The resulting tree is a Galton-Watson tree with generation-dependent offspring distributions (see [8, 9, 12, 13] for more on Galton-Watson trees). Indeed, we have

\[ P\{\text{a vertex at height } k \text{ has } n \text{ offspring}\} = \left( \frac{\rho}{\rho + \lambda k} \right)^{n} \frac{\lambda k}{\rho + \lambda k}, \]

which is the probability of $n$ failures before the first success in a sequence of independent Bernoulli trials where the probability of success equals $\frac{\lambda k}{\rho + \lambda k}$.

3.2 Fragment lengths have a general distribution

Suppose that we have a general distribution $\mu$ for the fragment lengths. We observe $X$ at some fixed “time”, which might as well be 0 because of stationarity, and ask for the conditional probability given $X_0$ that the next jump of $X$ will be upwards. We know from the above that if $\mu$ is exponential with rate $\lambda$, then conditional on $X_0 = k$ this is $\rho / (\rho + k\lambda)$.

Let $T$ denote the time until the next segment comes along. This random variable has an exponential distribution with rate $\rho$ and is independent of $X_0$ [4, §2.1]. If we condition on $X_0 = k$, the two-dimensional Poisson point process must have $k$ points in the region

\[ A := \{ (t, l) : -\infty < t \le 0,\ -t < l < \infty \}. \]

Figure 3: A wedge from the planar Poisson process. (Shown in the $(t, l)$-plane, with the points 0 and $T$ marked on the $t$ axis.)
Conditionally, these $k$ points in $A$ have the same distribution as $k$ points chosen at random in $A$ according to the probability measure

\[ \frac{\rho\, m \otimes \mu(B)}{\rho\, m \otimes \mu(A)} \quad \text{for } B \subset A. \]

However, in order that the next jump after 0 is upwards, the two-dimensional Poisson point process must have no points in the orange region $\{ (t, l) : -\infty < t \le 0,\ -t < l < T - t \}$, as these segments end before time $T$. This leaves the $k$ points lying in the blue region

\[ B_T := \{ (t, l) : -\infty < t \le 0,\ T - t \le l < \infty \}, \]

which occurs with probability

\[ \left( \frac{\rho \int_T^\infty \mu((u, \infty))\, du}{\rho \int_0^\infty \mu((u, \infty))\, du} \right)^{k}. \]

Thus, conditional on $X_0 = k$, the probability that the next jump will be upwards is

\[ \int_0^\infty \left( \frac{\int_t^\infty \mu((u, \infty))\, du}{\int_0^\infty \mu((u, \infty))\, du} \right)^{k} \rho e^{-\rho t}\, dt. \]

Write $p(k)$ for this quantity. A reasonable approximation to the jump skeleton $Z$ of $X$ is to take it to be a discrete-time Markov chain on the nonnegative integers with transition probabilities

\[ P(i, j) = \begin{cases} 1, & \text{if } i = 0 \text{ and } j = 1, \\ p(i), & \text{if } i \ge 1 \text{ and } j = i + 1, \\ 1 - p(i), & \text{if } i \ge 1 \text{ and } j = i - 1, \\ 0, & \text{otherwise.} \end{cases} \]

The resulting tree is then a Galton-Watson tree with generation-dependent offspring distributions, where

\[ P\{\text{a vertex at height } k \text{ has } n \text{ offspring}\} = p(k)^{n} (1 - p(k)). \]

Example 3.2.1. Suppose $\mu$ is the point mass at $L$ (that is, all segment lengths are $L$). Then

\[ \mu((u, \infty)) = \begin{cases} 1, & u < L, \\ 0, & u \ge L, \end{cases} \qquad \text{and} \qquad \int_t^\infty \mu((u, \infty))\, du = \begin{cases} \int_t^L du = L - t, & t < L, \\ 0, & t \ge L. \end{cases} \]

This gives

\[ p(k) = \int_0^L \frac{(L - t)^k}{L^k} \rho e^{-\rho t}\, dt = \int_0^1 w^k \rho e^{-\rho (L - Lw)} L\, dw = \theta e^{-\theta} \int_0^1 w^k e^{\theta w}\, dw \quad \text{for } k \ge 1, \]

where $\theta := \rho L = E[X_0]$. We integrate by parts and find that $p(k) = \theta e^{-\theta} q(k)$, where

\[ q(k) := \int_0^1 w^k e^{\theta w}\, dw = \left. \frac{w^k e^{\theta w}}{\theta} \right|_{w=0}^{w=1} - \frac{k}{\theta} \int_0^1 w^{k-1} e^{\theta w}\, dw = \frac{e^{\theta}}{\theta} - \frac{k}{\theta} q(k - 1) \quad \text{for } k \ge 2, \]

which yields the recursion

\[ p(k) = 1 - \frac{k}{\theta} p(k - 1), \quad k \ge 2, \qquad \text{with } p(1) = 1 - \frac{1}{\theta} + \frac{e^{-\theta}}{\theta}. \]

Solving explicitly, we obtain

\[ p(k) = k! \left( \sum_{j=0}^{k} \frac{(-1)^{k-j}}{j!\, \theta^{k-j}} + \frac{(-1)^{k-1} e^{-\theta}}{\theta^{k}} \right) \quad \text{for } k \ge 1. \]

4 Discussion

Our observation that randomly sequenced fragments from a genome form a planar Poisson process in (position, length) coordinates has implications beyond the coverage function analysis performed in this paper. For example, we have found that the visualization of sequencing data in this novel form is useful for quickly identifying instances of sequencing bias by eye, as it is easy to “see” deviations from the Poisson process. An example is shown in Figure 4, where fragments from an Illumina sequencing experiment are compared with an idealized simulation (where the fragments are placed uniformly at random). Specifically, paired-end reads from an RNA-Seq experiment conducted on a GAII sequencer were mapped back to the genome and fragments inferred from the read end locations. Bias in the sequencing is immediately visible, likely due to non-uniform PCR amplification [11] and other effects. We hope that others will find this approach to visualizing fragment data of use.

Figure 4: (A) Fragments from a sequencing experiment shown in the $(t, l)$ plane. (B) The spatial Poisson process resulting from fragments with the same length distribution as (A) but with position sampled uniformly at random.

The “shape” we have proposed for coverage functions was motivated by persistence ideas from topological data analysis (TDA). In the context of TDA, our setting is very simple (1-dimensional); however, unlike what is typically done in TDA, we have provided a detailed probabilistic analysis that can be used to construct a null hypothesis for coverage-based test statistics. For example, we envision computing test statistics [16] based on the trees constructed from coverage functions and comparing those to the statistics expected from the Galton-Watson trees.
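As one ingredient for such a comparison, the parameter $p(k)$ that defines the Galton-Watson null model for fixed-length fragments (Example 3.2.1) can be evaluated numerically. The sketch below is a hedged illustration rather than code from the paper: the function names, the value of $\theta$, and the midpoint quadrature rule are all assumptions made here. It computes $p(k)$ in three ways (the recursion, the closed form, and direct integration of $\theta e^{-\theta} \int_0^1 w^k e^{\theta w}\, dw$) so that they can be cross-checked.

```python
import math


def p_recursive(k, theta):
    """p(k) from p(1) = 1 - 1/theta + exp(-theta)/theta and p(j) = 1 - (j/theta) p(j-1)."""
    p = 1.0 - 1.0 / theta + math.exp(-theta) / theta
    for j in range(2, k + 1):
        p = 1.0 - (j / theta) * p
    return p


def p_closed_form(k, theta):
    """p(k) from the explicit alternating sum in Example 3.2.1."""
    s = sum((-1) ** (k - j) / (math.factorial(j) * theta ** (k - j))
            for j in range(k + 1))
    s += (-1) ** (k - 1) * math.exp(-theta) / theta ** k
    return math.factorial(k) * s


def p_quadrature(k, theta, steps=20000):
    """p(k) by midpoint-rule integration of theta * w^k * exp(theta * (w - 1)) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * h
        total += w ** k * math.exp(theta * (w - 1.0))
    return theta * h * total


theta = 3.0  # assumed illustrative mean coverage, theta = rho * L
table = [(k, p_recursive(k, theta), p_closed_form(k, theta), p_quadrature(k, theta))
         for k in range(1, 7)]
```

Both the forward recursion and the alternating sum can lose precision in floating point when $k$ is large relative to $\theta$, so the quadrature is likely the safer choice in that regime.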
It should be interesting to perform similar analyses with high-dimensional generalizations, for which we believe many of our ideas can be translated. There are also biological applications, for example in the analysis of pooled experiments where fragments may be sequenced from different genomes simultaneously.

Indeed, we believe that the study of sequence coverage functions that we have initiated may be of use in the analysis of many sequence census methods. The number of proposed protocols has exploded in the past two years, as a result of dramatic drops in the price of sequencing. For example, in January 2010, the company Illumina announced a new sequencer, the HiSeq 2000, that they claim “changes the trajectory of sequencing” and can be used to sequence 25Gb per day. Although technologies such as the HiSeq 2000 were motivated by human genome sequencing, a surprising development has been the fact that the majority of sequencing is in fact being used for sequence census experiments [20]. The vast amounts of sequence being produced in the context of complex sequencing protocols mean that a detailed probabilistic understanding of random sequencing is likely to become increasingly important in the coming years.

5 Acknowledgements

SNE is supported in part by NSF grant DMS-0907630 and VH is funded by NSF fellowship DMS-0902723. We thank Adam Roberts for his help in making Figure 4.

6 Author Contributions

LP proposed the problem of understanding the random behaviour of coverage functions in the context of sequence census methods. VH investigated the jump skeleton based on ideas from topological data analysis. SNE developed the probability theory and identified the relevance of Theorem 3.0.1. SNE, VH and LP worked together on all aspects of the paper and wrote the manuscript.

References

[1] S. Biasotti, D. Giorgi, M. Spagnuolo, and B. Falcidieno. Reeb graphs for shape analysis and applications. Theoretical Computer Science, 392(1-3):5–22, 2008. Computational Algebraic Geometry and Applications.
[2] Gunnar Carlsson. Topology and data. Bull. Amer. Math. Soc. (N.S.), 46(2):255–308, 2009.
[3] Hamish Carr, Jack Snoeyink, and Ulrike Axen. Computing contour trees in all dimensions. Comput. Geom., 24(2):75–94, 2003. Special issue on the Fourth CGC Workshop on Computational Geometry (Baltimore, MD, 1999).
[4] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes. Springer Series in Statistics. Springer-Verlag, New York, 1988.
[5] Mark de Berg and Marc van Kreveld. Trekking in the Alps without freezing or getting tired. Algorithmica, 18(3):306–323, 1997. First European Symposium on Algorithms (Bad Honnef, 1993).
[6] Herbert Edelsbrunner, John Harer, and Afra Zomorodian. Hierarchical Morse-Smale complexes for piecewise linear 2-manifolds. Discrete Comput. Geom., 30(1):87–107, 2003. ACM Symposium on Computational Geometry (Medford, MA, 2001).
[7] Steven N. Evans. Probability and real trees, volume 1920 of Lecture Notes in Mathematics. Springer, Berlin, 2008. Lectures from the 35th Summer School on Probability Theory held in Saint-Flour, July 6–23, 2005.
[8] Dean H. Fearn. Galton-Watson processes with generation dependence. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. IV: Biology and health, pages 159–172, Berkeley, Calif., 1972. Univ. California Press.
[9] I. J. Good. The joint distribution for the sizes of the generations in a cascade process. Proc. Cambridge Philos. Soc., 51:240–242, 1955.
[10] Geoffrey R. Grimmett and David R. Stirzaker. Probability and random processes. Oxford University Press, New York, third edition, 2001.
[11] K Hansen, SE Brenner, and S Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 2010.
[12] Theodore E. Harris. The theory of branching processes. Dover Phoenix Editions. Dover Publications Inc., Mineola, NY, 2002. Corrected reprint of the 1963 original [Springer, Berlin; MR0163361 (29 #664)].
[13] Peter Jagers. Galton-Watson processes in varying environments. J. Appl. Probability, 11:174–178, 1974.
[14] Olav Kallenberg. Foundations of modern probability. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2002.
[15] ES Lander and MS Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2:231–239, 1988.
[16] FA Matsen. A geometric approach to tree shape statistics. Systematic Biology, 4:652–661, 2006.
[17] JL Weber and EW Myers. Human whole-genome shotgun sequencing. Genome Research, 7:401–409, 1997.
[18] MC Wendl. A general coverage theory for shotgun DNA sequencing. Journal of Computational Biology, 13:1177–1196, 2006.
[19] MC Wendl and W Brad Barbazuk. Extension of Lander-Waterman theory for sequencing filtered DNA libraries. BMC Bioinformatics, 6:245, 2005.
[20] B Wold and RM Myers. Sequence census methods for functional genomics. Nature Methods, 5:19–21, 2008.
[21] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete Comput. Geom., 33(2):249–274, 2005.
