p-Adic numbers in bioinformatics: from genetic code to PAM-matrix


📝 Original Info

  • Title: p-Adic numbers in bioinformatics: from genetic code to PAM-matrix
  • ArXiv ID: 0903.0137
  • Date: 2011-05-10
  • Authors: Researchers from original ArXiv paper

📝 Abstract

In this paper we demonstrate that the system of 2-adic numbers provides new insight into some problems of genetics, in particular the degeneracy of the genetic code and the structure of the PAM matrix in bioinformatics. The 2-adic distance is an ultrametric, and applications of ultrametrics in bioinformatics are not surprising. However, by using the 2-adic numbers we match the ultrametric with a number-theoretic structure. In this way we find new applications of an ultrametric that differ from those known so far in bioinformatics. We obtain the following results. We show that the PAM matrix A can be expanded into the sum of two matrices, A=A^{(2)}+A^{(\infty)}, where the matrix A^{(2)} is 2-adically regular (i.e. its matrix elements are close to locally constant with respect to the 2-adic parametrization of the genetic code discussed earlier by the authors), and the matrix A^{(\infty)} is sparse. We discuss the structure of the matrix A^{(\infty)} in relation to the side chain properties of the corresponding amino acids.


📄 Full Content

Various clustering procedures play a crucial role in bioinformatics, in particular in genetics; see, e.g., [1,2] or [3]. An important class of such procedures is based on introducing various metrics on the space of information strings, see e.g. [5]. A metric with new interesting features has recently been used in theoretical physics (from string theory to the theory of disordered systems and spin glasses), see e.g. [6], [7], [8], [9], and in cognitive science, psychology and image analysis [10]. This is the so-called p-adic metric (in fact, a class of metrics depending on the parameter p, a prime number). The main distinguishing feature of this metric is its sensitivity to hierarchical patterns in information whose structure matches the p-adic encoding of information. A few years ago the 2-adic metric was applied to study the problem of degeneracy of the genetic code, see [11,12,13]. These p-adic models can be considered a new development in the investigation of the structure of the genetic code from the point of view of coding theory, see [14,15,16].

In the present paper we discuss the structure of the PAM matrix used in bioinformatics (see for example [1]) from the point of view of p-adic analysis. We use the 2-adic parametrization of the genetic (amino acid) code obtained in [11] (see also [12] for a different p-adic parametrization).

In [11,12] it was shown that, after some special parametrization of the space of codons (triples of nucleotides) the genetic code becomes a locally constant map of p-adic argument. Moreover, the degeneracy of the genetic code in this language takes the form of local constancy of the corresponding mapping.
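To make the notion of a locally constant map concrete, here is a minimal Python sketch. It is a toy illustration and not the parametrization of [11]: the 16-point domain, the labels A, B, C, D, and the names `toy_code` and `is_locally_constant` are all assumptions introduced here. A map is locally constant at scale k if it takes a single value on each 2-adic ball of radius 2^(-k), that is, on each residue class modulo 2^k.

```python
def is_locally_constant(f, domain, k):
    """Return True if f is constant on every 2-adic ball of radius 2**(-k).

    Two integers x, y lie in the same such ball exactly when
    |x - y|_2 <= 2**(-k), i.e. when x and y agree modulo 2**k.
    """
    value_on_ball = {}
    for x in domain:
        ball = x % (2 ** k)          # label of the ball containing x
        fx = f(x)
        if value_on_ball.setdefault(ball, fx) != fx:
            return False
    return True


# Hypothetical toy "code": 16 inputs mapped to 4 labels.
# The label depends only on the two lowest binary digits of x,
# so four different inputs share each label (a toy degeneracy).
def toy_code(x):
    return "ABCD"[x % 4]


print(is_locally_constant(toy_code, range(16), 2))  # constant on balls mod 4
print(is_locally_constant(toy_code, range(16), 1))  # not constant on balls mod 2
```

In this toy picture, degeneracy (several inputs sharing one label) is the same thing as the map being constant on balls, which is the sense in which the text relates degeneracy to local constancy.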

Let us also mention the application of the p-adic parametrization to the description of the Parisi matrix from the replica symmetry breaking approach to spin glasses [17,18]. After the p-adic parametrization of the row and column indices, the Parisi matrix becomes a locally constant block matrix.

It is natural to check, using the p-adic parametrization approach, the structure of the PAM matrix. The PAM matrix is used in bioinformatics for sequence alignment and is constructed using a Markov chain model of point mutations for a protein chain.

We assume that the structure of the PAM matrix has some relation to the structure of the genetic code. Using this idea we enumerate the rows and columns of the PAM matrix using the 2-adic parametrization of the genetic code. After this parametrization the PAM matrix becomes more regular: for the majority of matrix elements, the dependence of the element A_{ij} on the indices i and j is close to locally constant with respect to the 2-adic norm.

There are some exceptions to this rule, and it is easy to see that they are related to several amino acids, namely Y, W, C, F, L. To describe these deviations from 2-adicity we introduce the following construction: we expand (by hand) the PAM matrix into the sum of two matrices,

A = A^{(2)} + A^{(\infty)}.

In this expansion the matrix A^{(2)} is 2-adically regular (close to locally constant). The matrix A^{(\infty)} is sparse (the majority of its matrix elements are zero, and the nonzero elements are mainly concentrated in the rows and columns related to the amino acids Y, W, C, F, L).
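The idea of splitting a matrix into a nearly block-constant part plus a sparse remainder can be sketched in Python. This is not the authors' procedure, which is carried out by hand on the PAM matrix; it is a swapped-in automated heuristic based on block medians. The function name `split_regular_sparse`, the tolerance `tol`, the block partition, and the 4x4 toy matrix are all assumptions made for illustration, and the numbers are not PAM data. The blocks stand in for 2-adic balls of indices.

```python
from statistics import median


def split_regular_sparse(A, blocks, tol=1):
    """Split a square matrix A into (A2, Ainf) with A = A2 + Ainf entrywise.

    blocks: list of index lists partitioning range(len(A)); each list plays
            the role of one ball of indices (an assumption of this sketch).
    Entries within tol of their block median are kept in A2 unchanged, so A2
    is close to constant on each block. Outliers are replaced by the block
    median in A2, and their deviation is stored in the sparse part Ainf.
    """
    n = len(A)
    A2 = [[0] * n for _ in range(n)]
    Ainf = [[0] * n for _ in range(n)]
    for rows in blocks:
        for cols in blocks:
            m = median([A[i][j] for i in rows for j in cols])
            for i in rows:
                for j in cols:
                    if abs(A[i][j] - m) <= tol:
                        A2[i][j] = A[i][j]
                    else:
                        A2[i][j] = m
                        Ainf[i][j] = A[i][j] - m
    return A2, Ainf


# Toy symmetric matrix with one outlying pair of entries (not PAM values).
A = [
    [5, 5, 1, 1],
    [5, 4, 1, 9],
    [1, 1, 6, 6],
    [1, 9, 6, 6],
]
A2, Ainf = split_regular_sparse(A, blocks=[[0, 1], [2, 3]])
nonzero = [(i, j) for i in range(4) for j in range(4) if Ainf[i][j] != 0]
print(nonzero)  # positions carried by the sparse part
```

The median is used because a few large outliers do not shift it, so the regular part is determined by the majority of entries in each block; other robust choices would serve equally well in a sketch like this.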

One can see that the deviations from 2-adicity (i.e. the non-zero matrix elements of A^{(\infty)}) are related to amino acids which are in some sense special: the aromatic amino acids Y, W, F, and cysteine (C), which contains the SH group.

We also mention that the 2-adic structure of the genetic code is related to some chemical properties of the amino acids. In particular, hydrophobic amino acids are clustered in two balls with respect to the 2-adic norm. Therefore the 2-adic parametrization makes it possible to separate the impact of the chemical and geometrical properties of aromatic amino acids on the structure of the PAM matrix.

The structure of the present paper is the following.

In Section 2 we discuss a family of ultrametric spaces.

In Section 3 we describe the 2-adic 2-dimensional parametrization of the genetic code of [11].

In Section 4 we present the PAM250 matrix.

In Section 5 we describe the reshuffling of the rows and columns of the PAM matrix corresponding to the 2-adic parametrization of the genetic code of Section 3.

In Section 6 we introduce the expansion of the PAM matrix into the sum of the two matrices, one of which is 2-adically regular (close to locally constant) and the other is sparse (majority of matrix elements are equal to zero).

Sections 7 and 8 are appendices in which the definitions of PAM matrices and the eukaryotic genetic code are given.

An ultrametric space is a metric space where the metric d(x, y) satisfies the strong triangle inequality:

d(x, y) ≤ max(d(x, z), d(y, z)) ∀x, y, z.
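As a concrete instance, the 2-adic distance on the integers satisfies this inequality. The following minimal Python sketch checks it by brute force on a small range; the function names `two_adic_norm`, `two_adic_distance`, and `is_ultrametric` are illustrative choices made here, and the check covers only the integers 0 to 15 rather than proving the property in general.

```python
from itertools import product


def two_adic_norm(n: int) -> float:
    """2-adic norm |n|_2 = 2**(-v), where 2**v is the largest power of 2 dividing n."""
    if n == 0:
        return 0.0
    v = 0
    while n % 2 == 0:
        n //= 2
        v += 1
    return 2.0 ** (-v)


def two_adic_distance(x: int, y: int) -> float:
    """2-adic distance d(x, y) = |x - y|_2."""
    return two_adic_norm(x - y)


def is_ultrametric(points) -> bool:
    """Check d(x, y) <= max(d(x, z), d(y, z)) for all triples from points."""
    pts = list(points)
    return all(
        two_adic_distance(x, y)
        <= max(two_adic_distance(x, z), two_adic_distance(y, z))
        for x, y, z in product(pts, repeat=3)
    )


print(two_adic_distance(0, 8))    # 0.125, since 8 = 2**3
print(two_adic_distance(1, 2))    # 1.0, since 1 - 2 is odd
print(is_ultrametric(range(16)))  # True
```

Note that integers differing by a high power of 2 are close in this metric, which is the opposite of the usual distance on the number line.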

The strong triangle inequality can be stated geometrically: each side of a triangle is at most as long as the longest one of the two other sides. Such a triangle is quite restricted when consid

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
