A stitch in time: Efficient computation of genomic DNA melting bubbles
Eivind Tøstesen∗
Department of Tumor Biology, Norwegian Radium Hospital, N-0310 Oslo, Norway,
and Department of Mathematics, University of Oslo, N-0316 Oslo, Norway
(Dated: November 10, 2021)
Background: It is of biological interest to make genome-wide predictions of the locations of DNA
melting bubbles using statistical mechanics models. Computationally, this poses the challenge that
a generic search through all combinations of bubble starts and ends is quadratic in the sequence
length N.
Results: An efficient algorithm is described, which shows that the time complexity of the task is
O(N log N) rather than quadratic. The algorithm exploits the fact that bubble lengths may be limited,
but without a prior assumption of a maximal bubble length. No approximations, such as windowing,
have been introduced to reduce the time complexity. More than just finding the bubbles, the
algorithm produces a stitch profile, which is a probabilistic graphical model of bubbles and helical
regions. The algorithm applies a probability peak finding method based on a hierarchical analysis
of the energy barriers in the Poland-Scheraga model.
Conclusions: Exact and fast computation of genomic stitch profiles is thus feasible. Sequences of
several megabases have been computed, limited only by computer memory. Possible applications
are the genome-wide comparisons of bubbles with promoters, transcription start sites (TSS), viral
integration sites, and other melting-related regions.
PACS numbers: 87.14.Gg, 87.15.Ya, 05.70.Fh, 02.70.Rr
I. BACKGROUND
Models of DNA melting make it possible to compute which regions are
single-stranded (ss) and which are double-stranded (ds). Based on
statistical mechanics, such model predictions are probabilistic by
nature. Bubbles, or single-stranded regions, play an essential role in
fundamental biological processes, such as transcription, replication,
viral integration, repair, and recombination, and in determining
chromatin structure [1, 2]. It is therefore interesting to apply DNA
melting models to genomic DNA sequences, although the available models
are so far limited to in vitro knowledge. Genomic applications began
around 1980 [3, 4], and have been gaining momentum over the years with
the increasing availability of sequences, faster computers, and model
development.
It has been found that predicted ds/ss boundaries are often located at
or very close to exon-intron junctions, the correspondence being
stronger in some genomes than in others [5, 6, 7, 8], which suggested a
gene finding method [9]. In the same vein, comparisons of actin cDNA
melting maps in animals, plants, and fungi suggested that intron
insertion could have targeted the sites of such melting fork junctions
in ancient genes [10, 11]. In other studies, bubbles in promoter regions
were computed to test the hypothesis that the stability of the double
helix contributes to transcriptional regulation [12, 13, 14, 15, 16, 17].
Bubbles induced by superhelicity have also been found to correlate with
replication origins as well as promoters [18, 19, 20, 21]. In addition
to the testing of specific hypotheses, a strategy has been to provide
whole genomes with annotations of their melting properties [22, 23].
Combined with all other existing annotations, such melting data allow
exploratory data mining and possibly the formation of new hypotheses
[24]. For example, the human genomic melting map was made available,
compared to a wide range of other annotations, and was shown to provide
more information than the local GC content [23].
∗Email: eivindto@math.uio.no
In the genomic studies, various melting features have proved to be of
particular interest. These include the bubbles and helical regions,
bubble nucleation sites, cooperative melting domains, melting fork
junctions, breathers, sites of high or low stability, and SIDD sites.
Most often we want to know their locations, but additional information
is sometimes useful, such as probabilities, dynamics, stabilities, and
context. DNA melting models based on statistical mechanics are powerful
tools for calculating such properties, especially those models that can
be solved by dynamic programming in polynomial time. For many features
of interest, however, algorithms for making such predictions remain to
be developed.
The existing melting algorithms typically produce melting profiles of
some numerical quantity for each sequence position. The prototypical
example is Poland’s probability profile [25], but profiles of melting
temperatures (melting maps), free energies, or other quantities are
also computed per basepair. The result can be plotted as a curve,
whereas the wanted features often have the format of regions, junctions,
and other sites.
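The mismatch between a per-basepair curve and region-type features can be made concrete with a deliberately naive conversion. The sketch below applies simple thresholding to a profile of closed-state probabilities and reports the stretches that fall below the threshold as candidate open regions. This is an ad hoc illustration only, not the stitch profile method of this paper; the function name `open_regions` and the default threshold of 0.5 are assumptions.

```python
def open_regions(closed_prob, threshold=0.5):
    """Return half-open (start, end) index intervals where closed_prob < threshold."""
    regions = []
    start = None  # start index of the region currently being scanned, if any
    for i, p in enumerate(closed_prob):
        if p < threshold and start is None:
            start = i                      # a new open region begins here
        elif p >= threshold and start is not None:
            regions.append((start, i))     # the region ended at the previous position
            start = None
    if start is not None:                  # a region that runs to the sequence end
        regions.append((start, len(closed_prob)))
    return regions


if __name__ == "__main__":
    print(open_regions([0.9, 0.2, 0.1, 0.8, 0.3]))  # prints [(1, 3), (4, 5)]
```

Such a conversion discards the probabilistic information: the reported boundaries depend entirely on the chosen threshold, and the output carries no probability for a region as a whole.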
Some genomics data mining tools also require data in these formats
rather than curves. As a remedy, melting profiles have been subjected
to ad hoc post-processing methods to extract the wanted features, such
as segmentation algorithms [23],
…(Full text truncated)…