Efficient Parallel and Out of Core Algorithms for Constructing Large Bi-directed de Bruijn Graphs

The genomic sequence of an organism is a string from the alphabet Σ = {A, T, G, C}. This string is also referred as the Deoxyribonucleic acid (DNA) sequence. DNA sequences exist as complementary pairs (A -T , G -C) due to the double strandedness of the underlying DNA structure. Several characteristics of an organism are encoded in its DNA sequence, thereby reducing the biological analysis of the organism to the analysis of its DNA sequence. Identifying the unknown DNA sequence of an organism is known as de novo sequencing and is of fundamental biological importance. On the other hand the existing sequencing technology is not mature enough to identify/read the entire sequence of the genomeespecially for complex organisms like the mammals. However small fragments of the genome can be read with acceptable accuracy. The shotgun method employed in many sequencing projects breaks the genome randomly at several places and generates several small fragments (reads) of the genome. The problem of reassembling all the fragmented reads into a small sequence close to the original sequence is known as the Sequence Assembly (SA) problem. Although the SA problem seems similar to the Shortest Common Super string (SCS) problem, there are in fact some fundamental differences. Firstly, the genome sequence might contain several repeating regions. However, in any optimal solution to the SCS problem we will not be able to find repeating regions -because we want to minimize the length of the solution string. In addition to the repeats, there are other issues such as errors in reads and double strandedness of the reads which make the reduction to SCS problem very complex. The literature on algorithms to address the SA problem can be classified into two broad categories. The first class of algorithms model a read as a vertex in a directed graph -known as the overlap graph [2]. The second class of algorithms model every substring of length k (i.e., a kmer) in a read as a vertex in a (subgraph of) a de Bruijn graph [3]. In an overlap graph, for every pair of overlapping reads, directed edges are introduced consistent with the orientation of the overlap. Since the transitive edges in the overlap graph are redundant for the assembly process they are removed and the resultant graph is called the string graph [2]. The edges of the string graph are classified into optional, required and exact. The SA problem is reduced to the identification of a shortest walk in the string graph which includes all the required and exact constraints on the edges. Identifying such a walk -minimum S-walk -on the string graph is known to be NP-hard [4]. When a de Bruijn graph is employed, we model every substring of length k (i.e., a k-mer) in a read as a vertex [3]. A directed edge is introduced between two k-mers if there exists some read in which these two k-mers overlap by exactly k -1 symbols. Thus every read in the input is mapped to some path in the de Bruijn graph. The SA problem is reduced to a Chinese Postman Problem (CPP) on the de Bruijn graph, subject to the constraint that the resultant CPP tour include all the paths corresponding to the reads. This problem is also known to be NP-hard. Thus solving the SA problem exactly on both these graph models is intractable. Overlap graph based algorithms were found to perform better (see [5] [6] [7] [8]) with Sanger based read methods. Sanger methods produce reads typically around 1000 base pairs long. However these can produce significant read errors. To overcome the issues with Sanger reads new read technologies such as the pyrosequencing (454sequencing) have emerged. These read technologies can produce reliable and accurate genome fragments which are very short (up to 100 base-pairs long). On the other hand short read technologies can increase the number of reads in the SA problem by a large magnitude. Overlap based graph algorithms do not scale well in practice since they represent every read as a vertex. De Bruijn graph based algorithms seem to handle short reads very efficiently (see [9]) in practice compared to the overlap graph based algorithms. However the existing sequential algorithms [9] to construct these graphs are sub-optimal and require significant amounts of memory. This limits the applicability of these methods to large scale SA problems. In this paper we address this issue and present algorithms to construct large de Bruijn graphs very efficiently. Our algorithm is optimal in the sequential, parallel and out-of-core models. A recent work by Jackson and Aluru [1] yielded parallel algorithms to build these de Bruijn graphs efficiently. They present a parallel algorithm that runs in O(n/p) time using p processors (assuming that n is a constantdegree ploynomial in p). The message complexity of their algorithm is Θ(nΣ). By message complexity we mean the total number of messages (i.e., k-mers) communicated by all the processors in the entire algorithm. One of the major contributions of our work is to show that we can accomplish this in Θ(n/p) time with a message complexity of Θ(n). An experimental comparison of these two algorithms on an SGI Altix machine shows that our algorithm is considerably faster. In addition, our algorithm works optimally in an out-of-core setting. In particular, our algorithm requires only Θ( n log(n/B) B log(M/B) ) I/O operations. The organization of the paper is as follows. In Section II we introduce some preliminaries and define a bidirected de Bruijn graph formally. Section III discusses our main algorithm in a sequential setting. Section V and Section VI show how our main idea can easily be extended to parallel and out-of-core models optimally. In Section V-A we provide some remarks on the parallel algorithm of Jackson and Aluru [1]. Section VII gives algorithms to perform the simplification operation on the bi-directed de Bruijn graph. Section VIII discusses how our simplified bi-directed de Bruijn graph algorithm can replace the graph construction algorithm in a popular sequence assembly program VELVET [9]. Finally we present experimental results in Section IX. Let s ∈ Σ n be a string of length n. Any substring s j (i.e., s[j]s The set of all k-mers of a given string s is called the k-spectrum of s and is denoted by S(s, k). Given a k-mer s j , sj denotes the reverse complement of s j (e.g., if s j = AAGT A then sj = T ACT T ). Let ≤ be the partial ordering among the strings of equal length, then s i ≤ s j indicates that the string s i is lexicographically smaller than s j . Given any k-mer s i , let ŝi be the lexicographically smaller string between s i and si . We call ŝi the canonical k-mer of s i . In other words, if s i ≤ si then ŝi = s i otherwise ŝi = si . A k-molecule of a given k-mer s i is a tuple ( ŝi , si ) consisting of the canonical k-mer of s i and the reverse complement of the canonical k-mer. In the rest of this paper we use the terms positive strand and canonical k-mer interchangeably. Likewise the noncanonical k-mer is referred to as the negative strand. A bi-directed graph is a generalized version of a standard directed graph. In a directed graph every edge has only one arrow head (-⊲ or ⊳-). On the other hand in a bi-directed graph every edge has two arrow heads attached to it (⊳-⊲, ⊳-⊳,⊲-⊳ or ⊲-⊲). Let V be the set of vertices and {⊳, ⊲}} be the set of bi-directed edges in a bi-directed graph G(V, E). For any edge e = ] and o 2 = e[o -] refer to the orientations of the arrow heads on the vertices v i and v j , respectively. A walk W (v i , v j ) between two nodes v i , v j ∈ V of a bi-directed graph G(V, E) is a sequence v i , e i1 , v i1 , e i2 , v i2 , . . . , v im , e im+1 , v j , such that for every intermediate vertex v il , 1 ≤ l ≤ m the orientation of the arrow head on the incoming edge adjacent on v il is opposite to the orientation of the arrow head on the out going edge. To make this clearer, let e il , v il , e il+1 be a sub-sequence in the walk then for the walk to be valid it should be the case that o 2 = o ′ 1 . Figure 1(a) illustrates an example of a bi-directed graph. Figure 1(b) shows a simple bi-directed walk between the nodes A and E. Bi-directed walk between two nodes may not be simple. Figure 1(c) shows a bi-directed walk between A and E which is not simple -because B repeats twice. A de Bruijn graph D k (s) of order k on a given string s is defined as follows. The vertex set V of D k (s) is defined as the k-spectrum of s (i.e. V = S(s, k)). We use the notation suf (v i , l) (pre(v i , l), respectively) to denote the suffix (prefix, respectively) of length l in the string v i . Let the symbol • denote the concatenation operation between two strings. The set of directed edges E of D k (s) is defined as follows: We can also define de Bruijn graphs for sets of strings as follows. If S = {s 1 , s 2 . . . s n } is any set of strings, a de Bruijn graph B k (S) of order k on S has To model the double strandedness of the DNA molecules we should also consider the reverse complements ( S = { s1 , s2 . . . sn }) while we build the de Bruijn graph. To address this a bi-directed de Bruijn graph BD k (S ∪ S) has been suggested in [4]. The set of vertices V of BD k (S ∪ S) consists of all possible k-molecules from S ∪ S. The set of bi-directed edges for BD k (S ∪ S) is defined as follows. Let x, y be two k-mers which are next to each other in some input string z ∈ S ∪ S. Then an edge is introduced between the k-molecules v i and v j corresponding to x and y, respectively. Please note that two consecutive k-mers in some input string always overlap by k -1 symbols. The converse need not be true. The orientations of the arrow heads on the edges are chosen as follows. If both x and y are the positive strands in v i and v j , respectively, then an edge (v i , v j , ⊲, ⊲) is introduced. If x is the positive strand in v i and y is the negative strand in v j an edge (v i , v j , ⊲, ⊳) is introduced. Finally, if x is the negative strand in v i and y is the positive strand in v j an edge (v i , v j , ⊳, ⊲) is introduced. Figure 2 illustrates a simple example of the bi-directed de Bruijn graph of order k = 3 from a set of reads AT GG, CCAT, GGAC, GT T C, T GGA and T GGT observed from a DNA sequence AT GGACCAT and its reverse complement AT GGT CCAT . Consider two 3-molecules v 1 = (GGA, T CC) and v 2 = (GAC, GT C). Because the positive strand x = GGA in v 1 overlaps the positive strand y = GAC in v 2 by string GA, an edge (v 1 , v 2 , ⊲, ⊲) is introduced. Note that the negative strand GT C in v 2 also overlaps the negative strand T CC in v 2 by string T C, so the two overlapping strings GA and T C are drawn above the edge (v 1 , v 2 , ⊲, ⊲) in Figure 2. A bi-directed walk on the example bi-directed de Bruijn graph as illustrated by the dash line is corresponding to the original DNA sequence with the first letter omitted T GGACCAT . We would like to remark that the parameter k is always chosen to be odd to ensure that the forward and reverse complements of a k-mer are not the same. DE BRUIJN GRAPHS In this section we describe our algorithm BiConstruct to construct a bi-directed de Bruijn graph on a given set of reads. The following are the main steps in our algorithm to build the bi-directed de Bruijn graph. Let R f = {r 1 , r 2 . . . r n }, r i ∈ Σ r be the input set of reads. ). R k+1 is the set of all (k + 1)-mers from the input reads and their reverse complements. Recall that x and ŷ are the canonical k-mers of x and y, respectively. Create a canonical bi-directed edge ( vi , vj , o 1 , o 2 ) for each (k + 1)-mer as follows. Reduce multiplicity: Sort all the bidirected edges in [STEP-1], using radix sort. Since the parameter k is always odd this guarantees that a pair of canonical k-mers have exactly one orientation. Remove the duplicates and record the multiplicities of each canonical edge. Gather all the unique canonical edges into an edge list E. Theorem 1: Algorithm BiConstruct builds a bidirected de Bruijn graph of the order k in Θ(n) time. Here n is number of characters/symbols in the input. Proof: Without loss of generality assume that all the reads are of the same size r. Let N be the number of reads in the input. This generates a total of (rk)N (k + 1)-mers in [STEP -1]. The radix sort needs to be applied at most 2k log(|Σ|) passes, resulting in 2k log(|Σ|)(rk)N operations. Since n = N r is the total number of characters/symbols in the input, the radix sort takes Θ(kn log(|Σ|)) operations assuming that in each pass of sorting only a constant number of symbols are used. If k log(|Σ|) = O(log N ), the sorting takes only O(n) time. In practice since N is very large in relation to k and |Σ|, the above condition readily holds. Since the time for this step dominates that of all the other steps, the runtime of the algorithm BiConstruct is Θ(n). BI-DIRECTED DE BRUIJN GRAPH In this section we illustrate a parallel implementation of our algorithm. Let p be the number of processors available. We first distribute N/p reads for each processor. All the processors can execute [STEP-1] in parallel. In [STEP-2] we need to perform parallel sorting on the list E. Parallel radix/bucket sort -which does not use any all-to-all communications-can be employed to accomplish this. For example, the integer sorting algorithm of Kruskal, Rudolph and Snir takes O n p log n log(n/p) time. This will be O(n/p) if n is a constant degree polynomial in p. In other words, for coarse-grain parallelism the run time is asymptotically optimal. In practice coarse-grain parallelism is what we have. Here n = N (rk + 1). We call this algorithm Par-BiConstruct. Theorem 2: Algorithm Par-BiConstruct builds a bidirected de Bruijn graph in time O(n/p). The message complexity is O(n). The algorithm of Jackson and Aluru [1] first identifies the vertices of the bi-directed graph -which they call representative nodes. Then for every representative node |Σ| many-to-many messages were generated. These messages correspond to potential bi-directed edges which can be adjacent on that representative node. A bi-directed edge is successfully created if both the representatives of the generated message exist in some processor, otherwise the edge is dropped. This results in generating a total of Θ(n|Σ|) many-to-many messages. The authors in the same paper demonstrate that communicating many-tomany messages is a major bottleneck and does not scale well. On the other had we remark that the algorithm BiConstruct does not involve any many-to-many communications and does not have any scaling bottlenecks. On the other hand the algorithm presented in their paper [1] can potentially generate spurious bi-directed edges. According to the definition [4] of the bi-directed de Bruijn graph in the context of SA problem, a bi-directed edge between two k-mers/vertices exists iff there exists some read in which these two kmers are adjacent. We illustrate this by a simple example. Consider a read r i = AAT GCAT C. If we wish to build a bi-directed graph of order 3, then {AAT, AT G, T GC, GCA, CAT, AT C} form a subset of the vertices of the bi-directed graph. In this example we see that k-mers AAT and AT C overlap by exactly 2 symbols. However there cannot be any bi-directed edge between them according to the definition -because they are not adjacent. On the other hand the algorithm presented in [1] generates the following edges with respect to k-mer AAT : {(AAT, AT A), (AAT, AT G), (AAT, AT T ), (AAT, AT C)}. The edges (AAT, AT A) and (AAT, AT C) are purged since the k-mers AT A and AT C are missing. However bi-directed edges with corresponding orientations are established between AT G and AT C. Unfortunately (AAT, AT C) is a spurious edge and can potentially generate wrong Fig. 3. Problems with pointer jumping on bi-directed chains assembly solutions. In contrast to their algorithm [1] our algorithm does not use all-to-all communicationsalthough we use point-point communications. Theorem 3: There exists an out-of core algorithm to construct a bi-directed de Bruijn graph using an optimal number of I/O's. Proof: Sketch: Replace the radix sorting with an external R-way merge which takes only Θ( n log(n/B) B log(M/B) ). Where M is the main memory size, n is the sum of the lengths of all reads, and B is the block size of the disk. The bi-directed de Bruijn graph constructed in the previous section may contain several linear chains. These chains have to be compacted to save space as well as time. The graph that results after this compaction step is refered to as the simplified bi-directed graph. A linear chain of bi-directed edges between nodes u and v can be compacted only if we can find a valid bi-directed walk connecting u and v. All the k-mers/vertices in a compactable chain can be merged into a single node, and relabelled with the corresponding forward and reverse complementary strings. In Figure 4 we can see that the nodes X 1 and X 3 can be connected with a valid bidirected walk and hence these nodes are merged into a single node. In practice the compaction of chains seems to play a very crucial role. It has been reported that merging the linear chains can reduce the number of nodes in the graph by up-to 30% [9]. Although bi-directed chain compaction problem seems like a list ranking problem there are some fundamental differences. Firstly, a bi-directed edge can be traversed in both the directions. As a result, applying pointer jumping directly on a bi-directed graph can lead to cycles and cannot compact the bi-directed chains correctly. Figure 3 illustrates the first phase of pointer jumping. As we In consistent bi-directed edges Fig. 4. Issues with partially compacted bi-directed chains x ȳ ŷ Fig. 5. Transforming bi-directed chain compaction to list ranking can see, the green arcs indicate valid pointer jumps from the starting nodes. However since the orientation of the node Y 4 is reverse relative to the direction of pointer jumping a cycle results. In contrast, a valid bidirected chain compaction would merge all the nodes between Y 1 and Y 5 since there is a valid bi-directed walk between Y 1 and Y 5 . On the other hand, bi-directed chain compaction may result in inconsistent bi-directed edges and these edges should be recognised and removed. Consider a bi-directed chain in Figure 4; this chain contains two possible bi-directed walks -Y 1 to Y 4 and X 1 to X 3 . The walk from Y 1 to Y 4 (Y 4 to Y 1 ) spells out a label AT AGGT (ACCT AT ) after compaction. Once we perform this compaction the edge between Y 4 and Z 1 in the original graph is no longer valid -because the k-mer on Z 1 cannot overlap with the label ACCT AT . The same is true for the compaction of the bi-directed walk between X 1 and X 3 . The redundant edges after compaction are marked in red. Since bi-directed chain compaction has a lot of practical importance efficient and correct algorithms are essential. We now provide algorithms for the bi-directed chain compaction problem. Our key idea here is to transform a bi-directed graph into a directed graph and then apply list ranking. Given a list of candidate canonical bi-directed edges, we apply a ListRankingTransform (see Figure 5) which introduces two new nodes v + , v -for every node v in the original graph. Directed edges corresponding to the orientation are introduced. See Figure 5. Lemma 1: Let BG(V, E) be a bi-directed graph; let BG t (V t , E t ) be the directed graph after applying Lis-tRankingTransform. Two nodes u, v ∈ V are connected by a bi-directed path iff u Proof: We first prove the forward direction by induction on the number of nodes in the bi-directed graph. Consider the basis of induction when |V | = 2, let v 0 , v 1 ∈ V . Clearly we are only interested when v 0 and v 1 are connected by a bi-directed edge. By the definition of ListRankingTransform the Lemma in this case is trivially true. Now consider a bi-directed graph with |V | = n + 1 nodes, if the path between v i , i < n and v j , j < n does not involve node v n the lemma still holds by induction on the sub bidirected graph BG(V -{v n }, E). Now assume that v i . . . v p , v n , v q . . . v j is the bi-directed path between v i and v j involving the node v n ; see Figure 6(a). Also Figure 6(a) shows how the transformed directed graph look like; observe the colors of bi-directed edges and corresponding directed edges. By induction hypothesis on the sub bi-directed paths v i . . . v p , v n and v n , v q . . . v j we have the following. v + i is connected to v + n or v - n by some directed path P 1 (See Figure 6(b);v + n is connected to v + j or v - j by some directed path P 2 . We examine three possible cases depending on how the directed edge from P 1 and P 2 is incident on v + n . In CASE-1 we have both P 1 and P 2 pointing into node v + n . This implies that the orientation of the bi-directed edges in the original graph is according to Figure 6(b). In this case we cannot have a bi-directed walk involving the node v n , which contradicts our original assumption. Similarly CASE-2(Figure 6(c)) would also lead to a similar contradiction. Only CASE-3 would let node v n involve in a bi-directed walk. In this case v + i will be connected to either v + j or v - j by concatenation of the paths P 1 , P 2 . We can make a similar argument to prove the reverse direction. We first identify a set of candidate bi-directed edges which can potentially form a chain. One possible criterion will be to include all the edges which are adjacent on bi-directed nodes with exactly one in and out degree. Each candidate bi-directed edge is transformed using ListRankingTranform and list ranking is applied on resultant set. As a consequence of the symmetry in ListRankingTransform we would see both forward and reverse complements of the compacted chains in the output. We can further canonicalize each chain and remove the duplicates by sorting. This results in unique bi-directed chains from the candidate bi-directed edges. Finally we report only the chains which are resultant of compaction of at least three bi-directed nodes. This removes all the inconsistent edges (see Figure 4) from further consideration. As a consequence of Lemma 1 all the bi-directed chains are correctly compacted. Let E l be the list of candidate edges for compaction. To do compaction in parallel, we can use a Segmented Parallel Prefix on p processors to accomplish this in time O(|2E l |/p + log(p)). On the other hand list ranking can also be done out-of-core as follows. Without loss of generality we can treat the input for the list ranking problem as a set S of ordered tuples of the form (x, y). Given S we create a copy and call it S ′ . We now perform an external sort of S, S ′ with respect to y (i.e., using the y value of tuple (x, y) as the key) and x respectively. The two sorted lists are scanned linearly to identify tuples (x, y) ∈ S, (x ′ , y ′ ) ∈ S ′ such that y = x ′ . These two tuples are merged into a single tuple (x, y ′ ) and are added to a list E ′ l . This process is now repeated on E ′ l . Note that if the underlying graph induced by E l does not have any cycles then |E ′ l | ≤ |E l |/2; which means that the size of E ′ l geometrically decreases after every iteration. The I/O complexity of each iteration is dominated by the external sorting and hence bi-directed compaction can be accomplished out-of-core with Care should be taken to deal with cycles. There are two ways of dealing with cycles. One way is to first identify all the cycles in the bi-directed graph (generated in the previous section) and then break these cycles by removing appropriate edges. A second easy is to identify the cycles on the fly and break them. We employ the second approach. In this section we briefly describe how our algorithms can be used to some of the existing SA programs. As an example, we consider VELVET [9]. VELVET is a suite of programs -velveth and velvetg, which has recently gained acclamation in assembling short reads. VELVET program builds a simplified bidirected graph from a set of reads. We now briefly describe the algorithm used in VELVET to build this graph. VELVET program puts all the k-mers from the input into a hash table and then identifies the k-mers which are present in at least 2 reads -this information is called the roadmap in VELVET's terminology. The program then builds a de Bruijn graph using these kmers. Finally it takes every read and threads it on these k-mers. The worst case time complexity is O(n log(n))assuming that k and |Σ| are constants. On the other hand since VELVET builds this graph entirely in-memory, this has some serious scalability problems especially on large scale assembly projects. However VELVET seems have some very good assembly heuristics to remove errors and identify redundant assembly paths, etc. Thus our algorithm can act as a replacement to code in VELVET which performs in-memory graph construction. We have compared the performance of our algorithm and that of Jackson and Aluru [1]. We refer to the later algorithm as JA. To make this comparison fair, we have implemented the JA algorithm also on the same platform that our algorithm runs on. We have I shows the user and system times for both our algorithm and the JA algorithm. We can clearly see that the system time (communication time) is consistently higher for the JA algorithm. Also, as we move from one million to eight million reads the increase in the system time is quite significant for the JA algorithm (e.g., the system time for JA increases from 0.621 sec to 4.120 sec, which is almost a 7X increase. On the other hand there is only a 3X increase in our algorithm). The JA algorithm has a higher communication cost because it enumerates all the bi-directed edges and uses many-to-many messages to send to the right processor. The user time of our algorithm is also consistently superior compared to the user time of JA. This clearly is because we do much less local computations. In contrast, JA needs to do a lot of local processing which arises from processing all the received edges, removing redundant ones, and collecting the necessary edges to perform many-to-many communications. Since JA was taking a significant amount of time on for inputs larger than 8 million we have compared these algorithms only for input sizes up to 8 million. The experimental results reported in [1] start with at least 64 processors. We however show the scalablity of our algorithm upto 128 million reads in Table II. Table II clearly demonstrates the scalability of our algorithm. We make our implementations available at http://trinity.engr.uconn.edu/ ∼ vamsik/ParBidirected.tgz. In this paper we have presented an efficient algorithm to build a bi-directed de Bruijn graph which is • •

Efficient Parallel and Out of Core Algorithms for Constructing Large Bi-directed de Bruijn Graphs

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment