Reference Sequence Construction for Relative Compression of Genomes
Relative compression, where a set of similar strings is compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In this paper, we explore using the dictionary of repeats generated by the Comrad, Re-pair and Dna-x algorithms as reference sequences for relative compression. We show this technique allows better compression and supports random access just as well. The technique also allows more general repetitive datasets to be compressed using relative compression.
Authors: Shanika Kuruppu, Simon Puglisi, Justin Zobel
1 Introduction

Rapid advancements in the field of high-throughput sequencing have led to a large number of whole-genome DNA sequencing projects. Some of these projects take advantage of the improved sequencing speeds and costs to obtain genomes of species that are unsequenced to date; for example, the Genome 10K project (www.genome10k.org). Others focus on resequencing, where individual genomes from a given species are sequenced to understand variation between individuals. Examples are the 1000 Genomes project (www.1000genomes.org) for humans and the 1001 Genomes project (www.1001genomes.org) for the plant Arabidopsis thaliana. The assembled sequences from these projects can range from terabytes to petabytes in size. Therefore, algorithms and data structures to efficiently store, access and search these large datasets are necessary. Some progress has already been made [3, 7, 11, 17, 12], but significant challenges remain.
* This work was supported by the Australian Research Council and the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Center of Excellence program. A version of this paper is to appear in the Proceedings of SPIRE 2011.
† National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne, Australia, {kuruppu,jz}@csse.unimelb.edu.au
‡ School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Australia, simon.puglisi@rmit.edu.au

DNA sequences may contain repeated substrings within a sequence; however, in a database of sequences, the most significant repeats occur between sequences, usually those of the same or similar species. To help manage large genomic databases, compression algorithms that capture and efficiently encode this repeated information are employed. Compression algorithms specific to DNA sequences have been around for some time [1, 4, 5, 6, 9, 10, 19]. However, most existing algorithms are unsuitable for compressing large datasets of multiple sequences. More recently, algorithms that compress large repetitive datasets and also support random access and search on the compressed sequences, known as self-indexes, have emerged. Some of these algorithms are specific to DNA compression and support random access queries [13, 14]. Others can compress general datasets and also implement search queries on the compressed sequences [11, 17]. One of the most effective ways to compress a repetitive dataset containing multiple sequences from the same or very similar species, or sequences serving the same biological functions, is to compress each sequence with respect to a chosen reference sequence [4, 14, 17].
The need for such a compression method for DNA sequences was first realised by Grumbach and Tahi [9]. XM, a statistical algorithm that implements this feature, can also generate probabilities for the level of similarity between the reference sequence and the sequence being compressed [4]. Christley et al. proposed a solution to store just the variations of each human genome with respect to the reference genome [7], and a similar approach is taken by Brandon et al. [3]. Mäkinen et al. introduce more general methods to compress highly repetitive collections which also support searching in the compressed data [17]. The Rlz method, which is used in this paper, represents each sequence as an LZ77 parsing [20] with respect to a reference sequence chosen from the dataset [14]. Recently, Grabowski and Deorowicz engineered Rlz to improve run time and compression performance [8]. Relative compression algorithms like Rlz produce good compression results because the reference sequence acts as a static "dictionary" that includes most of the repeats present in the dataset being compressed. Compression is fast because the sequences can be compressed in a single pass over the collection, once an index on the reference sequence has been built. The static reference also makes random access fast and easy to support. The main drawback is the difficulty of selecting an appropriate reference sequence. Selecting a reference sequence from a dataset containing only individual genomes from the same strain of the same species is simple, as any sequence will act as a good reference. However, this will not be effective for datasets containing sequences from different species, or from different strains of the same species. Grabowski and Deorowicz [8] attempt to address this issue by adjusting the composition of the reference sequence during compression.
When substrings of a certain minimum length that do not occur in the reference sequence are encountered, they are appended to the reference sequence, so that later occurrences of those substrings can be encoded as references. Results in [8] show that such a mechanism can provide a slight improvement in compression with no effect on the compression or decompression times. However, this method over-compensates and adds more substrings to the reference sequence than necessary. We compare our results with those of Grabowski and Deorowicz in Section 4.

Figure 1: The change in the compressed size of the S. cerevisiae dataset when the reference sequence is changed. The y-axis shows the compressed size, measured in Megabytes, and the x-axis shows the reference sequence used.

Our contribution: In this paper we explore the artificial construction of reference sequences from the phrases built by popular dictionary compressors. We find that artificially constructed reference sequences allow superior compression, while retaining the principal advantage of relative compression: fast random access to the collection.

2 Reference Sequence Selection

Before we explore ways to generate an appropriate reference sequence, we first analyse the effect on compression when "good" and "bad" reference sequences are used. As an example, we use the Rlz algorithm to compress the S. cerevisiae dataset containing 39 yeast genomes from different strains.
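For intuition, the core of an Rlz-style relative parse can be sketched in a few lines. This is a hypothetical, quadratic-time illustration of the idea only: the real algorithm finds longest matches using a suffix array over the reference, and the factor encoding here (position/length pairs, with literals for symbols absent from the reference) is our simplification, not the authors' implementation.

```python
def rlz_parse(reference, sequence):
    """Greedily factor `sequence` against `reference` (LZ77-style).

    Returns a list of factors: (position, length) for a match in the
    reference, or (symbol, 0) for a literal when no match exists.
    """
    factors = []
    i = 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the match one symbol at a time (naive; a suffix array
        # over the reference makes this step efficient in practice).
        while i + length <= len(sequence):
            pos = reference.find(sequence[i:i + length])
            if pos == -1:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append((sequence[i], 0))  # symbol absent from reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(reference, factors):
    """Invert the parse by copying each factor out of the reference."""
    parts = []
    for pos, length in factors:
        parts.append(reference[pos:pos + length] if length else pos)
    return "".join(parts)
```

Decoding only ever reads from the static reference, which is why random access into the compressed collection remains fast.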
The dataset was compressed 39 times, with a different sequence being used as the reference each time. Figure 1 shows that the reference sequence chosen can impact compression significantly. For instance, choosing the sequence DBVPG6765 results in a compressed size of 16.65 MB for the S. cerevisiae dataset, while choosing the sequence UWOPS05_227_2 results in 24.42 MB. The experimental results of Rlz in [14] use the reference genome REF for the S. cerevisiae species. Using REF, a compressed size of 17.89 MB was achieved, not far from the best result of 16.65 MB. This example illustrates that a more principled approach to selecting a reference sequence is necessary.

The naive way to select the best reference sequence from a dataset is to follow the approach taken to generate Figure 1: compress the dataset many times, each time using a different sequence as the reference, then select the sequence that gives the best compression. In this manner, DBVPG6765 is chosen as the reference sequence for the S. cerevisiae dataset. This technique is feasible for small datasets but is ultimately not scalable.

Figure 2: The position components of the first 100 aligning factors for each sequence in the S. cerevisiae dataset. Only the factors that start at positions in the range 24000-34000 are visible. The x-axis is the position on the reference sequence and the y-axis gives the sequence names for sequences in the dataset.
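The naive selection procedure described above is easy to state as code. The sketch below uses zlib's preset-dictionary mode as a crude stand-in for the relative compressor (an assumption made purely for illustration; Rlz is not zlib, and zlib only exploits a bounded window of the dictionary):

```python
import zlib

def compressed_size(reference, sequences):
    """Total compressed size, in bytes, of `sequences` relative to `reference`."""
    total = 0
    for seq in sequences:
        # A preset dictionary lets the compressor reference substrings
        # of `reference` without storing them, mimicking relative compression.
        c = zlib.compressobj(zdict=reference.encode())
        total += len(c.compress(seq.encode()) + c.flush())
    return total

def pick_reference(sequences):
    """Try every sequence as the reference; keep the one compressing best.

    This is the quadratic-cost scheme from the text: feasible for small
    collections, but ultimately not scalable.
    """
    return min(sequences, key=lambda ref: compressed_size(ref, sequences))
```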
Moreover, a single reference sequence still may not be representative of the repetitions present in the whole dataset. A sequence may be highly similar to a few other sequences in the dataset but quite different from others. In other words, the sequences may form clusters. This is plausible for datasets containing genomes from various strains of a species. To test this hypothesis, we used the factors generated by Rlz that form alignments to the reference sequence (LISS factors, which encode the segments of DNA that are not mutations [15]). We graphed the position component of these aligning factors for the S. cerevisiae dataset, when sequence REF is used as the reference. If the set of aligning factors is the same across two sequences, then those two sequences align to the reference sequence in the same way, and hence the two sequences are similar.

The aligning factors for each sequence in the S. cerevisiae dataset, for the position range 24,000-34,000 of the reference sequence, are illustrated in Figure 2. The graph highlights clusters of similar sequences. Most sequences have factors that start at the same position, especially those in the top half of the graph. The lower half of the graph has clusters of sequences that have similar factor positions. As an example, YPS606 and YPS128 seem to align to the reference sequence in the same way, and so do the sequences UWOPS03_461_4 and UWOPS05_227_2.

An alternative to using multiple reference sequences is to use a single reference sequence that includes the significant repeats of the whole dataset. The substrings that are shared among the sequences within clusters can be used to create a reference sequence. Dictionary compression algorithms find the repeated substrings of the dataset being compressed, and the dictionary stores these repeats.
Hence, a dictionary compression algorithm that detects global repetitions can be used to generate a dictionary whose entries can then be concatenated to construct a reference sequence. We experiment with this idea next.

3 Reference Sequence Construction

We choose three dictionary compression algorithms to generate reference sequences for our test datasets: Re-pair [16], a well-known dictionary compression algorithm; Comrad [13], similar to Re-pair but tailored for DNA compression; and Dna-x [18], a DNA-specific implementation of the algorithm by Bentley and McIlroy [2]. We first compress our test datasets with Re-pair, Comrad and Dna-x, and then use the dictionary of repeats as a reference sequence for relative compression. Below we explain each algorithm briefly, along with the process used to generate the reference sequence from its dictionary.

Re-pair

The Re-pair algorithm [16] operates in multiple iterations. In the first iteration, a count of all the distinct pairs of symbols in the input sequence is recorded. Then the most frequent symbol pair is replaced by a new symbol, and the counts are updated to reflect the replacement. In this manner, the algorithm substitutes the symbol pair with the highest count at each iteration, until no symbol pair is left with a count of more than one. The new symbols generated by the algorithm are identified as 'non-terminals', while the symbols in the original input are identified as 'terminals'. The algorithm outputs the input sequence with all its repeated substrings replaced by non-terminal symbols, and a dictionary of rules that map each non-terminal to the symbol pair it replaced. The dictionary is hierarchical, since during later iterations rules of the form B ← C D, B ← c D or B ← C d are generated, where upper-case symbols are non-terminals and lower-case symbols are terminals.
The non-terminals C and D may in turn represent other non-terminals, and so on. The dictionary of rules generated by Re-pair contains the repeated substrings of the input sequence. The right-hand sides of the rules can be expanded recursively to obtain the repeated substrings, which can then be concatenated to create a reference sequence. It is not necessary to add all of the expanded rules to the reference sequence. Some of the rules lower in the hierarchy have already been incorporated into the repeated substrings of rules higher in the hierarchy that refer to them, so adding these to the reference sequence would be redundant. For example, expanding rule Z in the set of rules Z ← X Y, X ← a A, Y ← C D would result in rules X, Y, A, C and D being expanded. Once Z is expanded, it is redundant to individually expand X, Y, A, C and D. To implement this, we use a bit vector whose length is the total number of rules. To begin with, all the bits are set to zero. When a rule Y appears on the right-hand side of another rule Z, the bit for rule Y is set to 1 to indicate that when it is Y's turn to be expanded later, it can be skipped. The non-terminals generated by Re-pair are identified by unique integers. The higher the non-terminal number, the later the rule was generated and the higher up in the hierarchy the rule is likely to be. So, starting from the highest-numbered rule down to the lowest, rule Z is expanded if and only if Z has not already been covered by a previously expanded rule, as indicated by the bit vector. If rule Z is expanded, the resulting substring is appended to the reference sequence. This continues until all of the rules have been considered for expansion. The concatenation of the expanded substrings forms the reference sequence.
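The expansion procedure just described can be sketched as follows. The rule representation (a map from integer non-terminal ids, assigned in generation order, to pairs whose elements are either terminal characters or rule ids) is our assumption for illustration, not the layout used by the Re-pair implementation:

```python
def expand(rule_id, rules):
    """Recursively expand a non-terminal into the substring it represents."""
    return "".join(
        expand(sym, rules) if isinstance(sym, int) else sym
        for sym in rules[rule_id]
    )

def build_reference(rules):
    """Concatenate expansions of rules not referenced by any other rule.

    `used` is the bit vector from the text: a rule appearing on the
    right-hand side of another rule is already contained in that rule's
    expansion, so expanding it separately would be redundant.
    """
    used = [False] * len(rules)
    for pair in rules.values():
        for sym in pair:
            if isinstance(sym, int):
                used[sym] = True
    reference = []
    # Highest-numbered rules were generated last and sit highest in the
    # hierarchy, so they are considered first.
    for rule_id in sorted(rules, reverse=True):
        if not used[rule_id]:
            reference.append(expand(rule_id, rules))
    return "".join(reference)
```

For a toy dictionary such as {0: ("a", "c"), 1: (0, "g"), 2: (1, 0)}, only rule 2 survives the bit-vector filter, and its expansion alone becomes the reference.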
Comrad

Like Re-pair, Comrad [13] is a dictionary compression algorithm that detects repeated substrings in the input and encodes them efficiently to achieve compression. Comrad also operates in multiple iterations; however, it is a DNA-specific, disk-based algorithm designed to compress large DNA datasets. Instead of replacing pairs of frequent symbols, Comrad replaces repeated substrings of longer lengths to reduce the number of iterations. The first iteration of Comrad counts distinct substrings of length L, and the repeated substrings, from most frequent to least frequent, are replaced with non-terminals, forming a dictionary. The input sequence now consists of a combination of terminals and non-terminals. In subsequent iterations, the counts of distinct substrings that satisfy a certain set of patterns are recorded (see [13]), and again substrings from most frequent to least frequent are replaced with non-terminals. The iterations continue until no substrings of the above form remain with a count of at least F (only substrings with frequency at least F are eligible for replacement). The algorithm outputs the input sequence with repeated substrings replaced by non-terminals and, like Re-pair, a dictionary mapping the non-terminals to the substrings they replace. As with the Re-pair dictionary, we expand the non-terminals and concatenate them to create a reference sequence.

Dna-x

Unlike Re-pair and Comrad, Dna-x is a single-pass dictionary compression algorithm. As the input is read, the fingerprint of every B-th substring of length B is stored in a hash table. To encode the next substring, all overlapping B-mers in the so-far unencoded part of the input are searched for in the hash table until there is a match. The hash table gives the positions of the earlier occurrences of the B-mer. Each of these occurrences is checked to find the longest possible match.
Then the prefix preceding the match, followed by a reference to the matching substring, is encoded. Searching and encoding continue until no symbols remain to be encoded. The longest matching substrings encoded by the algorithm are the repeated substrings we use to construct the reference sequence. We modified the implementation of Dna-x by Manzini and Rastero to output only the concatenation of the longest matching substrings detected by the algorithm. We use this output as the reference sequence.

4 Experimental Results

To test the performance of the reference construction method, we use Rlz as the relative compressor. We use three test datasets containing repetitive genomes: 39 strains of S. cerevisiae and 36 strains of the S. paradoxus species of yeast, and 33 strains of E. coli bacteria. We ran Re-pair, Comrad and Dna-x on all three datasets. For Re-pair, we used the default parameters, which place no restrictions on the number or length of repeats that can be detected. For Comrad, we used a starting substring length L of 16 and a threshold frequency F of 2. For Dna-x we set the substring length B to 16, to be consistent with Comrad. The repeated substrings extracted from the dictionaries were used to generate the reference sequences as described above.

Compression results are in Table 1. The first section contains the results for compressing with Rlz using the original reference sequence. The size of each dataset in megabases (including the reference sequence) and its 0-order entropy are in the first row. The second and third rows contain the compression results from using the reference sequences available in the dataset with RLZ-std and RLZ-opt (with the full set of optimisations), respectively. The results show that RLZ-opt achieves better compression than RLZ-std.
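The 0-order entropy figures reported alongside the compressed sizes can, in principle, be reproduced from symbol frequencies alone; a minimal sketch (our own illustration, not part of the Rlz tool chain):

```python
from collections import Counter
from math import log2

def zero_order_entropy(sequence):
    """Bits per symbol under a 0-order model: H = -sum_i p_i * log2(p_i)."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A DNA sequence with all four bases equally frequent gives exactly 2 bits per base, which is why the Original rows of the tables sit near 2.0 bpb.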
The second section of Table 1 contains results for using the Comrad-generated reference sequence. The two rows contain results for the standard implementation of Rlz (RLZ-std-C) and the optimised Rlz with look-ahead and short factor encoding enabled (RLZ-opt-C)¹, respectively. The S. cerevisiae and S. paradoxus datasets compress better using the Comrad-generated reference sequence. The biggest improvement (a factor of two) is for E. coli. The original reference sequence was the K12 strain from the dataset, since the species does not have a reference genome. Evidently K12 is not a sequence that represents the dataset well, and the Comrad-generated reference sequence is a much better representation.

The third section of Table 1 contains the results for the Re-pair-generated reference sequences, which are very similar to the Comrad results. The compression results improved for all three datasets, with the most significant improvements being for E. coli. Overall, using the Re-pair-generated reference sequences led to slightly better compressed sizes than using the Comrad-generated reference sequences.

The Dna-x-generated reference sequences are not as promising. We found that Dna-x generated large reference sequences, as some of the repeats it output were redundant. For example, the reference sequences for S. cerevisiae are 124.46 Mbases, 127.95 Mbases and 439.27 Mbases for Re-pair, Comrad and Dna-x, respectively. Filtering such duplicate repeats is difficult, as there are no non-terminal numbers with which to identify multiple occurrences of the same repeat.

Next we show that using a reference sequence containing repeats from the whole dataset is better than using a single sequence from the dataset as a reference.
As in Section 2, for all three datasets we ran RLZ-opt multiple times, with each sequence from the dataset being used as the reference in turn, to select the single sequence from each dataset that achieves the best compression result when used as a reference. The best compression results achieved were 9.33 MB for S. cerevisiae using the reference genome, 13.23 MB for S. paradoxus using the Z1 strain, and 18.69 MB for E. coli using the Sakai strain. Comparing these results to those in the second and third sections of Table 1 shows that even if the sequence that gives the best compressed size is chosen as the reference sequence for a dataset, the compression results are still worse than those that can be achieved with a Comrad- or Re-pair-generated reference sequence. This confirms that a single sequence is unlikely to capture all the repeats in a dataset of similar sequences, and that it is worth constructing a reference sequence that captures all the significant repeats of the dataset to achieve better compression.

¹ LISS factor encoding was not used, as the reference is not a sequence from the dataset and so there is no reason to expect factor positions to be predictable. For completeness, we also compressed with the LISS option on, and the compression results were worse than with standard Rlz.

Table 2 shows compression and decompression times. Naturally, the compression time increases significantly when using a generated reference sequence, as the reference must now be generated. Generated references also tend to be longer, so more time is needed to construct the suffix and LCP arrays used to perform the Rlz parsing, and to compress the reference sequence with 7zip. This is particularly the case for Dna-x. Still, performance for all methods remains at an acceptable level: the two largest datasets can be compressed in approximately 20 minutes.
More importantly, decompression times are not affected at all.

Table 3 shows compression results for the Rlcsa, Lz-end, Comrad, XM and Re-pair algorithms on the three test datasets. The results clearly show that using Rlz with the Comrad- or Re-pair-generated dictionaries achieves much better compression than even the best results in Table 3. While Re-pair-generated reference sequences seem to compress the datasets a little better than those of Comrad, the resource requirements of the algorithms should be taken into account. Comrad and Re-pair have comparable run times (Re-pair required a little over half the time of Comrad; see Table 2). However, the main memory usage of Re-pair is much higher, with S. cerevisiae and S. paradoxus using approximately 12 GB and 11 GB, respectively. On the other hand, Comrad only requires 277 MB and 554 MB for S. cerevisiae and S. paradoxus, respectively. Dna-x has the lowest resource usage, but a better process is needed to extract the necessary repeats from its dictionary to obtain a better-quality reference sequence.

We next experiment with datasets which do not contain a specific reference. These were a Hemoglobin dataset containing 15,199 DNA sequences of proteins associated with hemoglobin, an Influenza dataset containing 78,041 sequences of various strains of the influenza virus, and a Mitochondria dataset containing 1,521 mitochondrial DNA sequences from various species. Reference sequences were generated for these datasets using Comrad, Re-pair and Dna-x. The results are presented in Table 4. The first section of Table 4 contains the performance of Rlz when the first sequence in the dataset is chosen to be the reference. We only used standard Rlz, since the reference sequences chosen were arbitrary, so none of the Rlz optimisations would be an advantage to the compression.
The compression results for Rlz are worse than on the previous datasets, where a specific reference is available.

Table 1: Compression results using Comrad-, Re-pair- and Dna-x-generated reference sequences. The columns give, for each dataset, the compressed size in Mbytes (original dataset size in Mbases for the first row) and the average number of bits used per base when compressed. The sections give the compression results of Rlz with the original reference, and with Comrad-, Re-pair- and Dna-x-generated reference sequences, respectively. In the first section, RLZ-opt includes all the optimisations; in the last two sections, RLZ-opt includes only look-ahead and short factor encoding.

  Dataset      S. cerevisiae     S. paradoxus      E. coli
               Size     Ent.     Size     Ent.     Size     Ent.
               (Mbytes) (bpb)    (Mbytes) (bpb)    (Mbytes) (bpb)
  Original     485.87   2.18     429.27   2.12     164.90   2.00
  RLZ-std       17.89   0.29      23.38   0.44      24.27   1.18
  RLZ-opt        9.33   0.15      13.44   0.25      19.30   0.94
  RLZ-std-C      8.20   0.14       9.64   0.18       8.70   0.42
  RLZ-opt-C      7.99   0.13       9.08   0.17       8.07   0.39
  RLZ-std-R      7.78   0.13       9.10   0.17       8.21   0.40
  RLZ-opt-R      7.64   0.13       8.67   0.16       7.72   0.37
  RLZ-std-D      9.80   0.16      13.38   0.25      11.06   0.54
  RLZ-opt-D      9.64   0.16      13.01   0.24      10.57   0.51

The results in the second section of Table 4 are for the Comrad-generated reference sequences. Compression clearly improves for all three datasets. The most significant improvement is for the Influenza dataset, followed by the Hemoglobin dataset. The Mitochondria dataset did not compress very well, but compression still improves.

Compression also improved significantly for all datasets when a Re-pair-generated reference was used. The Influenza dataset had the most significant improvement, followed by Hemoglobin. The Mitochondria dataset still does not compress well. The fourth section of the table contains the results of using the Dna-x-generated reference.
The results have improved compared to using the original reference sequence, but the gains are smaller than with the other two algorithms.

According to Table 4, if there is enough repetition in the dataset, it is feasible to generate a reference sequence, using Re-pair or Comrad or any other dictionary compression algorithm, that can be used by Rlz to compress an arbitrary repetitive dataset. There is no significant difference between using a Comrad-generated reference sequence and a Re-pair-generated one; however, current implementations of Re-pair are less scalable than Comrad. Table 5 shows the compression and decompression times.

Finally, we compare the new results for S. cerevisiae and S. paradoxus to those obtained by Grabowski and Deorowicz [8]. The results they achieve without the improved reference sequence are 7.18 Mbytes and 9.62 Mbytes, and with the improved reference sequence 6.94 Mbytes and 9.01 Mbytes, for S. cerevisiae and S. paradoxus, respectively. Our best results, 7.64 Mbytes for S. cerevisiae and 8.67 Mbytes for S. paradoxus using Re-pair, are comparable. It may be possible to combine the techniques to achieve even better results.

Table 2: Compression and decompression times using Comrad-, Re-pair- and Dna-x-generated reference sequences. The columns give the time taken to compress and decompress each dataset, measured in seconds. Compression times include the time taken to generate the reference sequences, where necessary.

  Dataset      S. cerevisiae     S. paradoxus      E. coli
               Comp.    Dec.     Comp.    Dec.     Comp.    Dec.
               (sec)    (sec)    (sec)    (sec)    (sec)    (sec)
  RLZ-std        143       9       182       6       125       3
  RLZ-opt        233       8       241       6       140       3
  RLZ-std-C     1561       4      1619       4       588       2
  RLZ-opt-C     1783       4      1832       3       658       2
  RLZ-std-R     1170       4      1134       4       455       2
  RLZ-opt-R     1482       4      1353       4       499       2
  RLZ-std-D     2272       8      1787       7       618       4
  RLZ-opt-D     2901       7      2492       7       843       4

Table 3: Compression results for the yeast and E. coli datasets using other compression algorithms. The first row gives the original size of each dataset (in Mbases); the remaining rows give the compression performance of the Rlcsa, Lz-end, Comrad, XM and Re-pair algorithms. The two columns per dataset show the size in Mbytes and the 0-order entropy (in bits per base).

  Dataset      S. cerevisiae     S. paradoxus      E. coli
               Size     Ent.     Size     Ent.     Size     Ent.
               (Mbytes) (bpb)    (Mbytes) (bpb)    (Mbytes) (bpb)
  Original     485.87   2.18     429.27   2.12     164.90   2.00
  Rlcsa         41.39   0.57      47.35   0.88      34.94   1.67
  Lz-end        42.52   0.70      57.18   1.07      55.25   2.68
  Comrad        15.29   0.25      18.33   0.34      13.44   0.65
  XM            74.53   1.26      13.17   0.25       8.82   0.43
  Re-pair        8.85   0.15      11.75   0.22      11.89   0.58

5 Concluding Remarks

Relative compression is a powerful technique for compressing collections of related genomes, which are now becoming commonplace. In this paper we have shown that these genomic collections can contain clusters of sequences which are more highly related than others. We have also shown that impressive gains in compression can be achieved by exploiting these clusters. Our specific approach has been to detect repetitions across the dataset and build an artificial "reference sequence", relative to which the sequences are subsequently compressed. This method retains the principal advantage of relative compression: fast random access. The drawback is slower compression, as time must now be spent finding repeats with which to generate the reference. Future work will attempt to address this problem. We also believe it may be fruitful to apply clustering algorithms to related genomes to isolate strains.

References

[1] B. Behzadi and F. Le Fessant.
DNA compression challenge revisited: A dynamic programming approach. In Proc. 16th Annual Symposium on Combinatorial Pattern Matching (CPM'05), pages 190-200, 2005.

[2] J. Bentley and D. McIlroy. Data compression using long common strings. In Proc. Data Compression Conference (DCC'99), pages 287-295, 1999.

Table 4: Compression results for Comrad-, Re-pair- and Dna-x-generated reference sequences on repetitive datasets that do not have an explicit reference sequence. The columns give, for each dataset, the compressed size in Mbytes (original dataset size in Mbases for the first row) and the average number of bits used per base. The sections give the results of Rlz using the first sequence of each dataset as the reference, and Comrad-, Re-pair- and Dna-x-generated reference sequences, respectively. Rlz is run in standard mode (RLZ-std) and optimised mode (RLZ-opt).

  Dataset      Hemoglobin        Influenza         Mitochondria
               Size     Ent.     Size     Ent.     Size     Ent.
               (Mbytes) (bpb)    (Mbytes) (bpb)    (Mbytes) (bpb)
  Original       7.38   2.07     112.64   1.97      25.26   1.95
  RLZ-std        3.81   4.13      43.65   3.10       9.31   2.95
  RLZ-std-C      1.31   1.42       3.31   0.23       6.55   2.07
  RLZ-opt-C      1.17   1.27       3.00   0.21       6.05   1.92
  RLZ-std-R      1.32   1.43       3.00   0.21       6.69   2.12
  RLZ-opt-R      1.19   1.28       2.82   0.20       6.20   1.96
  RLZ-std-D      1.42   1.54       3.68   0.26       7.13   2.26
  RLZ-opt-D      1.27   1.38       3.49   0.25       6.59   2.09

[3] M. Brandon, D. Wallace, and P. Baldi. Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25(14):1731-1738, 2009.

[4] M. D. Cao, T. Dix, L. Allison, and C. Mears. A simple statistical algorithm for biological sequence compression. In Proc. Data Compression Conference (DCC'07), pages 43-52, 2007.

[5] X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its applications in genome comparison. In Proc. 4th Conference on Research in Computational Molecular Biology (RECOMB'00), pages 107-117, 2000.

[6] X. Chen, M. Li, B. Ma, and J. Tromp.
DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18(12):1696–1698, 2002.

[7] S. Christley, Y. Lu, C. Li, and X. Xie. Human genomes as email attachments. Bioinformatics, 25(2):274–275, 2009.

[8] S. Grabowski and S. Deorowicz. Engineering relative compression of genomes, 2011.

[9] S. Grumbach and F. Tahi. Compression of DNA sequences. In Proc. Data Compression Conference (DCC'93), pages 340–350, 1993.

[10] S. Grumbach and F. Tahi. A new challenge for compression algorithms: Genetic sequences. Information Processing & Management, 30(6):875–886, 1994.

[11] S. Kreft and G. Navarro. LZ77-like compression with fast random access. In Proc. Data Compression Conference (DCC'10), pages 239–248, 2010.

Dataset         Hemoglobin         Influenza          Mitochondria
                Comp.    Dec.      Comp.    Dec.      Comp.    Dec.
                (sec)    (sec)     (sec)    (sec)     (sec)    (sec)
RLZ-std           5        1        67       11        11       1
RLZ-std-C        20        1       196        1       106       1
RLZ-opt-C        21        1       235        1       107       1
RLZ-std-R        11        1       189        1        64       1
RLZ-opt-R        12        1       220        1        64       1
RLZ-std-D        14        1       247        3        69       1
RLZ-opt-D        17        1       364        3        72       1

Table 5: Compression and decompression times when Comrad-, Re-pair- and Dna-x-generated reference sequences are used for compressing repetitive datasets that do not have an explicit reference sequence. The columns are, respectively, the algorithm used and the total times taken to compress and decompress, measured in seconds. All compression times include the time taken to generate the reference sequence.

[12] S. Kreft and G. Navarro. Self-indexing based on LZ77. In Proc. 22nd Annual Symposium on Combinatorial Pattern Matching (CPM'11), 2011. To appear.

[13] S. Kuruppu, B. Beresford-Smith, T. Conway, and J. Zobel. Iterative dictionary construction for compression of large DNA datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2011. To appear.

[14] S. Kuruppu, S. J. Puglisi, and J. Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc.
17th Symposium on String Processing and Information Retrieval (SPIRE'10), pages 201–206, 2010.

[15] S. Kuruppu, S. J. Puglisi, and J. Zobel. Optimized relative Lempel-Ziv compression of genomes. In Proc. 34th Australasian Computer Science Conference (ACSC'11), pages 91–98, 2011.

[16] N. J. Larsson and A. Moffat. Offline dictionary-based compression. In Proc. Data Compression Conference (DCC'99), pages 296–305, 1999.

[17] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.

[18] G. Manzini and M. Rastero. A simple and fast DNA compressor. Software: Practice and Experience, 34(14):1397–1411, 2004.

[19] E. Rivals, J. Delahaye, M. Dauchet, and O. Delgrange. A guaranteed compression scheme for repetitive DNA sequences. In Proc. Data Compression Conference (DCC'96), page 453, 1996.

[20] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
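As an illustrative aside, the relative-compression idea discussed in the conclusion, parsing each sequence as a series of substrings of a reference, can be sketched as a minimal greedy Lempel-Ziv factorisation. This is our own sketch; the function names `rlz_factorise` and `rlz_extract` are hypothetical and this is not the RLZ implementation evaluated above. A practical implementation would index the reference with a suffix array or FM-index rather than scanning it repeatedly with `str.find`.

```python
def rlz_factorise(reference: str, target: str):
    """Greedily parse `target` into factors that point into `reference`.

    Each factor is (pos, length): the longest prefix of the remaining
    target that occurs somewhere in the reference. A symbol that never
    occurs in the reference (e.g. 'N') is emitted as a literal (char, 0).
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        l = 1
        # Extend the match one symbol at a time; a real implementation
        # would use a suffix array or FM-index over the reference.
        pos = reference.find(target[i:i + l])
        while pos != -1:
            best_pos, best_len = pos, l
            l += 1
            if i + l > len(target):
                break
            pos = reference.find(target[i:i + l])
        if best_len == 0:
            factors.append((target[i], 0))  # literal: symbol not in reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_extract(reference: str, factors):
    """Decompress by concatenating referenced substrings and literals."""
    parts = []
    for pos, length in factors:
        parts.append(pos if length == 0 else reference[pos:pos + length])
    return "".join(parts)
```

Because every factor is an absolute (position, length) pointer into the reference, a stored sequence can be decompressed from any factor boundary without touching earlier factors, which is the fast-random-access property the paper relies on.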