A New Distance for High Level RNA Secondary Structure Comparison
Julien Allali and Marie-France Sagot
Abstract—We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new
operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in
the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent
RNAs and the goal is to find a common structural core of two RNAs. Although the algorithm's complexity has an exponential term, this
term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions. The
algorithm therefore remains efficient in practice, and we illustrate it on ribosomal RNAs as well as on other types of RNAs.
Index Terms—Tree comparison, edit operation, distance, RNA, secondary structure.
1 INTRODUCTION
RNAs are one of the fundamental elements of a cell. Their
role in regulation has recently been shown to be far
more prominent than initially believed (see the 20 December 2002
issue of Science, which designated small RNAs with
regulatory function as the scientific breakthrough of the
year). It is now known, for instance, that there is massive
transcription of noncoding RNAs. Yet current mathematical
and computer tools remain mostly inadequate to identify,
analyze, and compare RNAs.
An RNA may be seen as a string over the alphabet of
nucleotides (also called bases), {A, C, G, U}. Inside a cell,
RNAs do not retain a linear form, but instead fold in space.
The fold is given by the set of nucleotide bases that pair. The
main type of pairing, called canonical, corresponds to bonds
of the type A-U and G-C. Other, rarer types of bonds
may be observed, the most frequent of which is G-U,
also called the wobble pair. Fig. 1 shows the sequence of a
folded RNA. Each box represents a consecutive sequence of
bonded pairs, corresponding to a helix in 3D space. The
secondary structure of an RNA is the set of helices (or the
list of paired bases) making up the RNA. Pseudoknots,
which may be described as a pair of interleaved helices, are
in general excluded from the secondary structure of an
RNA. RNA secondary structures can thus be represented as
planar graphs. An RNA primary structure is its sequence of
nucleotides while its tertiary structure corresponds to the
geometric form the RNA adopts in space.
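The pairing rules and the exclusion of pseudoknots described above can be sketched in a few lines of Python. This is only an illustration: the function names `can_pair` and `has_pseudoknot`, and the choice to represent a secondary structure as a list of index pairs `(i, j)` with `i < j`, are assumptions made here and do not come from the paper.

```python
# Illustrative sketch; names and data layout are assumptions, not the paper's.
CANONICAL = {frozenset("AU"), frozenset("GC")}  # canonical bonds A-U and G-C
WOBBLE = frozenset("GU")                        # the G-U wobble pair


def can_pair(x, y, allow_wobble=True):
    """Return True if bases x and y can form a canonical (or wobble) bond."""
    pair = frozenset((x, y))
    if pair in CANONICAL:
        return True
    return allow_wobble and pair == WOBBLE


def has_pseudoknot(pairs):
    """Return True if two pairs interleave, i.e. i < k < j < l.

    `pairs` holds (i, j) positions with i < j, each position being
    used by at most one pair.
    """
    ordered = sorted(pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False
```

A structure for which `has_pseudoknot` returns `False` is properly nested, which is the property that makes the tree encodings discussed later in the text possible.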
Apart from helices, the other main structural elements in
an RNA are:
1. hairpin loops which are sequences of unpaired bases closing a helix;
2. internal loops which are sequences of unpaired bases linking two different helices;
3. bulges which are internal loops with unpaired bases on one side only of a helix;
4. multiloops which are unpaired bases linking at least three helices.
Stems are successions of one or more among helices,
internal loops, and/or bulges.
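The four loop types listed above can be told apart by looking at what a base pair directly encloses. The following is a minimal sketch under stated assumptions: the structure is pseudoknot-free and given as a list of `(i, j)` pairs, and the helper names `enclosed_pairs` and `classify` are hypothetical. The extra label "stacked pair" (two consecutive bonded pairs inside one helix) is introduced here for completeness and is not one of the paper's four categories.

```python
# Illustrative sketch; assumes a pseudoknot-free list of (i, j) pairs, i < j.
def enclosed_pairs(pairs, i, j):
    """Pairs lying directly inside (i, j), in 5'-3' order."""
    inner = sorted(p for p in pairs if i < p[0] and p[1] < j)
    direct, limit = [], i
    for k, l in inner:
        if k > limit:          # not nested inside the previous direct pair
            direct.append((k, l))
            limit = l
    return direct


def classify(pairs, i, j):
    """Name the structural element closed by the pair (i, j)."""
    children = enclosed_pairs(pairs, i, j)
    if not children:
        return "hairpin loop"          # only unpaired bases inside
    if len(children) >= 2:
        return "multiloop"             # links at least three helices
    k, l = children[0]
    left, right = k - i - 1, j - l - 1  # unpaired bases on each side
    if left == 0 and right == 0:
        return "stacked pair"          # continuation of the same helix
    if left == 0 or right == 0:
        return "bulge"                 # unpaired bases on one side only
    return "internal loop"
```

For example, with `pairs = [(0, 20), (1, 19), (3, 18), (4, 10), (5, 9), (12, 17), (13, 16)]`, the pair `(1, 19)` closes a bulge, `(3, 18)` closes a multiloop, and `(5, 9)` closes a hairpin loop.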
The comparison of RNA secondary structures is one of
the main basic computational problems raised by the study
of RNAs. It is the problem we address in this paper. The
motivations are many. RNA structure comparison has been
used in at least one approach to RNA structure prediction
that takes as initial data a set of unaligned sequences
assumed to have a common structural core [1]. For each
sequence, a set of structural predictions is made (for
instance, all suboptimal structures predicted by an algorithm
like Zuker’s MFOLD [15], or all suboptimal sets of
compatible helices or stems). The common structure is then
found by comparing all the structures obtained from the
initial set of sequences, and identifying a substructure
common to all, or to some of the sequences. RNA structure
comparison is also an essential element in the discovery of
RNA structural motifs, or profiles, or of more general
models that may then be used to search for other RNAs of
the same type in newly sequenced genomes. For instance,
general models for tRNAs and introns of group I have been
derived by hand [3], [10]. It is an open question whether
models at least as accurate as these, or perhaps even more
accurate, could have been derived in an automatic way. The
identification of smaller structural motifs is an equally
important topic that requires comparing structures.
As we saw, the comparison of RNA structures may
concern known RNA structures (that is, structures that were
experimentally determined) or predicted structures. The
objective in both cases is the same: to find the common
parts of such structures.
In [11], Shapiro suggested mathematically modeling RNA
secondary structures without pseudoknots by means of
trees. The trees are rooted and ordered, which means that
the order among the children of a node matters. This order
corresponds to the 5’-3’ orientation of an RNA sequence.
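To make the idea of a rooted, ordered tree concrete, the sketch below builds one from a pseudoknot-free list of base pairs. It is a deliberate simplification and not the encoding of [11] or of this paper: here every base pair becomes a node, a virtual root is added, and children are stored in 5'-3' order. The names `Node`, `build_tree`, and `to_nested` are hypothetical.

```python
# Illustrative sketch; a simplified encoding, not the one used in the paper.
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # kept in 5'-3' order


def build_tree(pairs):
    """Build a rooted ordered tree from nested (i, j) pairs with i < j."""
    root = Node("root")
    stack = [(root, float("inf"))]  # (node, closing position of its pair)
    for i, j in sorted(pairs):
        # Leave every pair that closes before position i.
        while i > stack[-1][1]:
            stack.pop()
        node = Node(f"{i}-{j}")
        stack[-1][0].children.append(node)
        stack.append((node, j))
    return root


def to_nested(node):
    """Render the tree as nested (label, children) tuples for inspection."""
    return (node.label, [to_nested(c) for c in node.children])
```

Because the pairs are processed in increasing order of their 5' position, siblings are appended left to right, so the child order reflects the orientation of the sequence without any extra sorting step.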
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
. J. Allali is w
…(Full text truncated)…