Comparative genomics methods are widely used to aid the functional annotation of non-coding DNA regions. However, aligning non-coding sequences requires new algorithms and strategies that take into account extensive rearrangements and turnover during evolution. Here we present a novel large-scale alignment strategy which aims at drawing a precise map of conserved non-coding regions between genomes, even when these regions have undergone small-scale rearrangement events and a certain degree of sequence variability. We applied our alignment approach to obtain a genome-wide catalogue of conserved non-coding blocks (CNBs) between Drosophila melanogaster and 11 other Drosophila species. Interestingly, we observe numerous small-scale rearrangement events, such as local inversions, duplications and translocations, which are not observable in the whole-genome alignments currently available. The high rate of observed small-scale reshuffling shows that this database of CNBs can constitute the starting point for several investigations related to the evolution of regulatory DNA in Drosophila and the in silico identification of unannotated functional elements.
Deep Dive into DrosOCB: a high resolution map of conserved non coding sequences in Drosophila.
The functional annotation of eukaryotic DNA sequences represents a great challenge in post-genomic biological research. The identification of functional non-coding elements, such as untranslated regions (UTRs), genes for non-protein-coding RNAs, and cis-regulatory elements, is extremely difficult, as the rules governing their structure and function are far from being well understood.
A great aid to the functional annotation of genome sequences is provided by comparative genomics methods, which in recent years have also been extended to non-coding DNA regions. The basic assumption of the comparative genomic approach is that common features of two organisms are encoded within the DNA that is conserved between the species, due to purifying selection during evolution. Under the same assumption, the DNA sequences controlling the expression of genes that are regulated similarly in two related species should also be conserved by selection during evolution.
However, comparison of non-coding sequences requires new algorithms and strategies to take into account the different evolutionary mechanisms affecting regulatory sequences. Recent studies examining the evolution of cis-regulatory modules in Drosophila reveal that regulatory sequences may frequently evolve through compensatory gain and loss of transcription factor binding sites, events that produce little functional change [1], [2]. Great plasticity in the arrangement of binding sites within cis-regulatory modules is another remarkable evolutionary feature, observed in vertebrates [3].
Once complete genomes from different species are available, a global alignment procedure is suitable for finding a map of collinear conserved segments between the input sequences, discarding alignments that overlap or cross over. Global alignment methods are widely used to identify highly similar regions in the sequences which appear in the same order and orientation. In contrast, local alignment algorithms are generally very useful in finding similarity between regions that may be related but are inverted or rearranged with respect to each other.
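The contrast between the two families can be illustrated with a minimal sketch of local alignment scoring. The code below is a textbook Smith-Waterman score with a linear gap penalty, run against both strands of the second sequence; it is not the alignment procedure used in this work, and the function names and the scoring parameters (match, mismatch, gap) are assumptions chosen for illustration only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local: a poor
            # prefix is dropped instead of being carried along.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_strand(a, b):
    """Score b on both strands and report the better one as (strand, score)."""
    forward = local_alignment_score(a, b)
    reverse = local_alignment_score(a, reverse_complement(b))
    return ("+", forward) if forward >= reverse else ("-", reverse)


if __name__ == "__main__":
    block = "ACGTTGCAAGGC"
    seq_a = "GGGG" + block + "GGGG"
    # seq_b carries the same block, but inverted (reverse-complemented).
    seq_b = "CCCC" + reverse_complement(block) + "CCCC"
    print(best_strand(seq_a, seq_b))
```

A procedure that only accepts matches in the same order and orientation would miss the inverted block in this toy example, whereas scoring both strands locally recovers it.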
Recently, the novel notion of glocal alignment, a sophisticated combination of global and local methods, has been introduced [5]. This class of alignment algorithms creates a map that transforms one sequence into the other while allowing for rearrangement events. This procedure, which underlies the Shuffled-LAGAN algorithm [6], is able to take into account large-scale genomic rearrangements, but fails at smaller scales.
Here, we present a novel large-scale alignment strategy which aims at drawing a precise map of conserved non-coding regions between genomes, even when these regions have undergone small-scale rearrangement events. Our procedure is optimized to take into account the great plasticity of non-coding DNA, such as shuffling and sequence variability of binding sites within functional modules, and small-scale translocations, inversions and duplications. We use a “gene-centric” approach: it starts with a list of orthologous genes between two species, and applies a local alignment algorithm to the corresponding flanking intergenic regions and intronic regions of these orthologous pairs. Hence, it is a local alignment strategy applied systematically on a genome-wide scale and, for this reason, we decided to call it “lobal”.
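As a rough sketch of the gene-centric idea, the code below derives, for each gene on one chromosome, its upstream and downstream intergenic intervals (bounded by the neighboring genes) and its introns. Everything here is an assumption made for illustration: the `Gene` record, the `flanking_regions` function, the 1-based inclusive coordinates, and the simplification that genes do not overlap. It does not reproduce the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str    # "+" or "-"
    exons: tuple   # sorted (start, end) pairs of one reference transcript


def flanking_regions(genes, chrom_length):
    """Map each gene name to its upstream, downstream and intronic intervals.

    Assumes all genes lie on one chromosome and do not overlap.
    An interval is None when the neighboring gene is directly adjacent.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    regions = {}
    for i, gene in enumerate(ordered):
        # Flanks extend to the neighboring gene, or to the chromosome end.
        left_start = ordered[i - 1].end + 1 if i > 0 else 1
        right_end = ordered[i + 1].start - 1 if i + 1 < len(ordered) else chrom_length
        left = (left_start, gene.start - 1) if left_start <= gene.start - 1 else None
        right = (gene.end + 1, right_end) if gene.end + 1 <= right_end else None
        # On the minus strand, upstream is the right-hand flank.
        upstream, downstream = (left, right) if gene.strand == "+" else (right, left)
        # Introns are the gaps between consecutive exons.
        introns = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(gene.exons, gene.exons[1:])
            if prev_end + 1 <= next_start - 1
        ]
        regions[gene.name] = {
            "upstream": upstream,
            "downstream": downstream,
            "introns": introns,
        }
    return regions


if __name__ == "__main__":
    toy = [
        Gene("geneA", 100, 200, "+", ((100, 120), (150, 200))),
        Gene("geneB", 300, 400, "-", ((300, 400),)),
    ]
    for name, found in flanking_regions(toy, chrom_length=500).items():
        print(name, found)
```

In this toy layout the interval between the two genes is reported once for each of them, which is a natural consequence of defining flanks per gene rather than per intergenic region.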
The recent availability of sequences and annotations for 12 Drosophila species [7] offers a complete and reliable genomic dataset for developing and testing methods for comparative genomics of non-coding DNA. We applied our lobal alignment approach to align Drosophila melanogaster to several other Drosophila species (D. yakuba, D. pseudoobscura, D. virilis, …), for which a reliable genome build and annotation are available.
For each Drosophila species examined (listed in Tab.1 and referred to as D.xxx), we compile a list of genes orthologous to a D.melanogaster (D.mel) gene, according to the “12 Drosophila genomes project” data (Tab.1 and Material and Methods). For each pair of D.mel/D.xxx orthologous genes, we extract the upstream, downstream and intronic regions in both species. Upstream and downstream regions are extracted up to the next neighboring gene (see Material and Methods for more details), taking the longest transcript as a reference in case of multiple transcripts. All sequences are first masked for repeats using the RepeatMasker program [8]. At this stage, the comparison procedure crucially depends on the availability of genomic annotations (i.e. gene coordinates and orthology relationships). The orthologous regions are then aligned using a local alignment procedure described later. For the alignment, the orthologous regions are oriented such that the corresponding genes are in the same orientation. Using this gene-centric approach, most intergenic regions are considered twice. For example, the region chr4:64404-68333 in D.melanogaster is first considered as the upstream region of the Ple
…(Full text truncated)…