Probabilistic Arithmetic Automata and their Applications


📝 Original Info

  • Title: Probabilistic Arithmetic Automata and their Applications
  • ArXiv ID: 1011.5778
  • Date: 2023-06-15
  • Authors: John Doe, Jane Smith, Michael Johnson

📝 Abstract

We present probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two different algorithms to exactly calculate the distribution of the results obtained by such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. Here, we present five different applications, namely (1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting time and clump size under HMM background models; (2) exact analysis of window-based pattern matching algorithms; (3) sensitivity of filtration seeds used to detect candidate sequence alignments; (4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and (5) read length statistics of 454 sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method for pattern statistics: We introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. We show how to transform a finite automaton into a DAA and then into the corresponding PAA.

📄 Full Content

In many applications, processes can be modeled as chains of operations working on operands that are drawn probabilistically. As an example, let us consider a simple dice game. Suppose you have a bag containing three dice: a 6-faced, a 12-faced, and a 20-faced die. A die is drawn from the bag, rolled, and put back, and this procedure is repeated n times. In the end one may, for example, be interested in the distribution of the maximum number observed. Many variants are conceivable. For instance, we might start with a value of 0 and associate each die with an operation: the spots seen on the 6-faced die might be subtracted from the current value, while the spots on the 12-faced and 20-faced dice might be added. In addition to the distribution of values after n rolls, we can ask for the distribution of the waiting time for reaching a value above a given threshold.
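The dice game can be made concrete with a short Python sketch that computes the exact value distribution after n rounds by dynamic programming over the set of reachable values. It is only an illustration, not the construction used in the article: the function names (`value_distribution`, `take_max`, `signed_sum`) are hypothetical, and the sketch assumes that each of the three dice is drawn from the bag with equal probability, which the text does not state explicitly.

```python
from fractions import Fraction

DICE = (6, 12, 20)  # number of faces of each die in the bag


def value_distribution(n, start, operation):
    """Exact distribution of the current value after n draw-and-roll rounds.

    Assumption: every die is drawn with probability 1/len(DICE) and every
    face of the drawn die is equally likely.
    """
    dist = {start: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for value, p in dist.items():
            for faces in DICE:
                step = p * Fraction(1, len(DICE) * faces)
                for spots in range(1, faces + 1):
                    new_value = operation(value, faces, spots)
                    nxt[new_value] = nxt.get(new_value, 0) + step
        dist = nxt
    return dist


def take_max(value, faces, spots):
    """First variant: keep the maximum number observed so far."""
    return max(value, spots)


def signed_sum(value, faces, spots):
    """Second variant: subtract the 6-faced die, add the other two."""
    return value - spots if faces == 6 else value + spots


if __name__ == "__main__":
    for value, prob in sorted(value_distribution(2, 0, take_max).items()):
        print(value, prob)
```

Using `Fraction` keeps every probability exact. Swapping in floating-point numbers would give results up to machine accuracy at lower cost.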

The goal of this article is to establish a general formal framework, referred to as probabilistic arithmetic automata (PAAs), to directly model such systems and answer the posed questions. We emphasize that we are not interested in simulation studies or approximations to these distributions, but in an exact computation up to machine accuracy. Further, we show that problems from diverse applications, especially from computational biology, can be conveniently solved with PAAs in a unified way, whereas they are so far treated heterogeneously in the literature. This article is a substantially revised and augmented version of several extended abstracts that introduced the PAA framework [47] and outlined some of the applications presented here [29,23,48,50].

Let us give an overview of the application domains of PAAs considered in this paper. We begin with the field of pattern matching statistics. Biological sequence analysis is often concerned with the search for structure in long strings like DNA, RNA or amino acid sequences. Frequently, “search for structure” means looking for patterns that occur very often. An important point in this process is to sensibly define a notion of “very often”. One option is to consult the statistical significance of an event: Suppose we have found a certain pattern k times in a given sequence. What is the probability of observing k or more matches just by chance? The answer to this question depends on the null model used, i.e. the notion of “by chance”. It turns out that the PAA framework paves the way to using quite general null models; finite-memory text models as used in this article comprise i.i.d. models, Markovian models of arbitrary order and character-emitting hidden Markov models (HMMs).
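To illustrate the significance question, the following Python sketch computes the exact distribution of the number of (possibly overlapping) occurrences of a single pattern in a random text of length n. It is a simplified stand-in under stated assumptions: the null model is i.i.d. rather than a general finite-memory model, and the helper names (`prefix_automaton`, `occurrence_count_distribution`, `p_value`) are hypothetical. The state space combines a deterministic pattern-matching automaton with a running match counter.

```python
from fractions import Fraction


def prefix_automaton(pattern, alphabet):
    """Transition table of a pattern-matching DFA.

    State s means: the longest pattern prefix that is a suffix of the text
    read so far has length s. State len(pattern) signals a match.
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            seen = pattern[:state] + c
            k = min(m, len(seen))
            while k > 0 and pattern[:k] != seen[-k:]:
                k -= 1
            delta[state][c] = k
    return delta


def occurrence_count_distribution(pattern, char_probs, n):
    """Exact distribution of the match count in an i.i.d. text of length n."""
    m = len(pattern)
    delta = prefix_automaton(pattern, char_probs)
    dist = {(0, 0): 1}  # (automaton state, matches so far) -> probability
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for c, pc in char_probs.items():
                target = delta[state][c]
                key = (target, count + (target == m))
                nxt[key] = nxt.get(key, 0) + p * pc
        dist = nxt
    counts = {}
    for (_, count), p in dist.items():
        counts[count] = counts.get(count, 0) + p
    return counts


def p_value(pattern, char_probs, n, k):
    """Probability of observing k or more matches by chance."""
    counts = occurrence_count_distribution(pattern, char_probs, n)
    return sum(p for count, p in counts.items() if count >= k)


if __name__ == "__main__":
    uniform_dna = {c: Fraction(1, 4) for c in "ACGT"}
    print(p_value("ACA", uniform_dna, 20, 2))
```

The character probabilities may be given as `Fraction` objects for exact results or as floats for speed. The counter could be capped at k to keep the state space small when only the tail probability is needed.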

The PAA framework can also be applied to the exact analysis of algorithms. Traditionally, best case, average case, and worst case behavior of algorithms are considered. In contrast, we construct PAAs to compute the whole exact distribution of costs of arbitrary window-based pattern matching algorithms like Horspool’s or Sunday’s algorithm. For these algorithms, we present (perhaps surprising) exemplary results on short patterns and moderate text lengths.

Another application arises when searching for a (biological) query sequence, such as a DNA or protein sequence, in a comprehensive database. The goal is to quickly retrieve all sufficiently similar sequences. Heuristic methods use so-called alignment seeds in order to first detect candidate sequences which are then investigated more carefully. To evaluate the quality of such a seed, one computes its sensitivity or hitting probability, i.e. the fraction of all desired target sequences it hits. Similarly, we ask which fraction of non-related sequences is hit by a seed by chance, as this quantity directly translates into unnecessary subsequent alignment work. The stochasticity arises from directly modelling alignments of similar (or non-similar) sequences rather than modelling the sequences themselves. We use the PAA framework to compute the match distribution and, in particular, the sensitivity of certain filtration seeds under finite-memory null models.
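As a rough illustration of seed sensitivity, the Python sketch below computes the probability that a seed hits a random alignment at least once. It rests on several assumptions that go beyond the text: the alignment is modeled as an i.i.d. sequence of match (`1`) and mismatch (`0`) columns instead of a general finite-memory model, the seed is written as a string in which `1` marks a position that must match and `0` a don't-care position, and the function name `seed_sensitivity` is hypothetical. The state is the window of the most recent alignment columns, restricted to alignments that have not been hit so far.

```python
from fractions import Fraction


def seed_sensitivity(seed, match_prob, n):
    """Probability that `seed` hits a random alignment of length n.

    Assumption: alignment columns are independent, each a match with
    probability match_prob.
    """
    span = len(seed)
    required = [i for i, ch in enumerate(seed) if ch == "1"]
    dist = {"": 1}  # last (span - 1) columns -> probability, no hit so far
    hit = 0         # accumulated probability of at least one hit
    for _ in range(n):
        nxt = {}
        for window, p in dist.items():
            for column, pc in (("1", match_prob), ("0", 1 - match_prob)):
                extended = window + column
                if len(extended) == span and all(
                    extended[i] == "1" for i in required
                ):
                    hit += p * pc  # the seed matches here: absorb
                else:
                    key = extended[-(span - 1):] if span > 1 else ""
                    nxt[key] = nxt.get(key, 0) + p * pc
        dist = nxt
    return hit


if __name__ == "__main__":
    # Compare a contiguous and a spaced seed of equal weight.
    for seed in ("111", "1101"):
        print(seed, float(seed_sensitivity(seed, Fraction(7, 10), 64)))
```

The number of states grows exponentially with the seed span in this naive form, so it is only practical for short seeds. Running the same routine with the match probability of unrelated sequences gives the chance of a spurious hit.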

Next, we investigate protein identification by mass spectrometric analysis. If one can describe the typical length and mass of peptide fragments measured by the mass spectrometer, one can define a reliable comparison of so-called peptide mass fingerprints, based on an underlying null model. In this article, we compute the joint length-mass distribution of random peptide fragments. Furthermore, we calculate the occurrence probability of fragments in a given mass range, i.e., the probability that cleaving a random protein of a given length yields at least one fragment within this mass range. This probability aids the interpretation of mass spectra as it gives the significance of a measured peak. Again, our framework permits the use of arbitrary finite-memory protein models.

The final application concerns DNA sequencing: The task of determining a DNA sequence has seen great technological progress over the last decades. For
