Nature of protein family signatures: Insights from singular value analysis of position-specific scoring matrices

📝 Original Info

  • Title: Nature of protein family signatures: Insights from singular value analysis of position-specific scoring matrices
  • ArXiv ID: 0711.1010
  • Date: 2008-04-14
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Position-specific scoring matrices (PSSMs) are useful for detecting weak homology in protein sequence analysis, and they are thought to contain some essential signatures of the protein families. In order to elucidate what kind of ingredients constitute such family-specific signatures, we apply singular value decomposition to a set of PSSMs and examine the properties of dominant right and left singular vectors. The first right singular vectors were correlated with various amino acid indices including relative mutability, amino acid composition in protein interior, hydropathy, or turn propensity, depending on proteins. A significant correlation between the first left singular vector and a measure of site conservation was observed. It is shown that the contribution of the first singular component to the PSSMs acts to disfavor potentially but falsely functionally important residues at conserved sites. The second right singular vectors were highly correlated with hydrophobicity scales, and the corresponding left singular vectors with contact numbers of protein structures. It is suggested that sequence alignment with a PSSM is essentially equivalent to threading supplemented with functional information. The presented method may be used to separate functionally important sites from structurally important ones, and thus it may be a useful tool for predicting protein functions.

📄 Full Content

Protein sequence alignment using a position-specific scoring matrix (PSSM) or sequence profile [1,2] is now a standard tool for sequence analysis [3,4]. Using a PSSM, it is often possible to detect very distantly related proteins which cannot be detected by the standard pairwise alignment based on a position-independent amino acid substitution matrix (AASM).

An AASM is a 20 × 20 real (usually symmetric) matrix, each element of which reflects the tendency of substitution between amino acid residues. Many kinds of AASMs have been developed to date, among which the most popular include the PAM [5] and the BLOSUM series [6]. General properties of AASMs are now well clarified [7][8][9][10]. Tomii and Kanehisa found that the PAM matrices can be well approximated by the volume and hydrophobicity of amino acid residues [8]. A similar result was obtained by Pokarowski et al. [10], who also pointed out the importance of the coil preferences of amino acid residues. Using eigenvalue decomposition, Kinjo and Nishikawa [9] showed that the most dominant component of AASMs is the relative mutability [5] for closely related homologs, but that it changes to hydrophobicity below a sequence identity of 30%; this transition of dominant modes was related to the so-called twilight zone of sequence comparison [11,12]. There are also AASMs specifically optimized to overcome the twilight zone [13,14].

Detection of very distant homologs is often possible by using PSSM-based sequence alignment methods such as PSI-BLAST [4] or hidden Markov models [3,15], because a PSSM is specific to a particular protein family so that some family-specific features can be exploited. In a PSSM, family-specific features are expressed as position-dependent substitution scores, and hence a PSSM is an N × 20 matrix where N is the length of the protein or protein family it represents. Since PSSMs can be regarded as an extension of sequence motifs [15], family-specific features are, to a first approximation, a pattern of amino acid residues around functionally or structurally important sites expressed in a probabilistic manner. To further understand the mechanism by which PSSMs achieve their effectiveness, however, it is necessary to elucidate more general characteristics of PSSMs that are shared across different protein families.
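To make the N × 20 layout concrete, the following minimal sketch scores a sequence against a PSSM position by position, without gaps. The column order `AMINO_ACIDS`, the helper `ungapped_score`, and the toy matrix are illustrative assumptions, not the conventions or data of any particular tool; real PSSM-based aligners such as PSI-BLAST also handle gaps, which this sketch ignores.

```python
import numpy as np

# Assumed column order for the 20 amino acid types (an arbitrary but fixed choice).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def ungapped_score(pssm, sequence):
    """Sum the position-specific scores of `sequence` aligned to `pssm` without gaps."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.shape != (len(sequence), len(AMINO_ACIDS)):
        raise ValueError("PSSM must have shape (len(sequence), 20)")
    columns = [AA_INDEX[aa] for aa in sequence]
    rows = np.arange(len(sequence))
    # Row i holds the scores of all 20 residue types at alignment position i.
    return float(pssm[rows, columns].sum())


# Hypothetical 3-position PSSM: mostly zeros, with a few favored residues.
toy_pssm = np.zeros((3, 20))
toy_pssm[0, AA_INDEX["A"]] = 2.0
toy_pssm[1, AA_INDEX["C"]] = 5.0
toy_pssm[2, AA_INDEX["L"]] = 3.0
toy_pssm[2, AA_INDEX["I"]] = 1.0

print(ungapped_score(toy_pssm, "ACL"))
print(ungapped_score(toy_pssm, "ACI"))
```

Unlike an AASM, the score of a residue here depends on the row (position) as well as the residue type, which is what allows family-specific features to be encoded.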

To delineate the general properties of PSSMs, we analyze them by using singular value decomposition (SVD, Eq. 4 in the Methods section). By applying SVD, a PSSM can be decomposed into 20 orthogonal components of varying importance. Each singular component consists of a singular value (a scalar), a right singular vector (r-SV), and a left singular vector (l-SV); see the Methods section for details. A singular value represents the relative importance of its component. The corresponding r-SV (a 20-vector) represents a property of the 20 amino acid types, and the l-SV may be regarded as a one-dimensional (1D) numerical representation of the amino acid sequence that is “dual” to the property represented by the r-SV. Since r-SVs can be regarded as amino acid indices [8,16,17], we can infer their meaning by comparing them with the entries of the AAindex database [18], which compiles many amino acid indices published to date. This is a natural generalization of a previous work where AASMs were analyzed by using eigenvalue decomposition [9]. The present analysis revealed a tendency of PSSMs that is analogous to that of the AASMs for close homologs: the first principal component disfavors any substitution, with potentially functionally important residues penalized more severely, and the second component is highly correlated with sequence and structural properties related to hydrophobicity. These features are expected to contribute to the effectiveness of sequence alignment based on PSSMs.
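The decomposition described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the PSSM is a random placeholder rather than real data, no centering or scaling is applied before the SVD (the paper's Eq. 4 may differ), and the Kyte-Doolittle hydropathy scale stands in for a single AAindex entry. The helper names are hypothetical. Because singular vectors are defined only up to sign, the correlation is reported as an absolute value.

```python
import numpy as np

# Assumed column order of the PSSM and the Kyte-Doolittle hydropathy values in that order.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
KYTE_DOOLITTLE = np.array([
    1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
    3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2,
])


def svd_components(pssm):
    """Return (l-SVs as columns, singular values, r-SVs as rows) of an N x 20 PSSM."""
    pssm = np.asarray(pssm, dtype=float)
    left, sigma, right_t = np.linalg.svd(pssm, full_matrices=False)
    return left, sigma, right_t


def rank_k_approximation(left, sigma, right_t, k):
    """Rebuild the PSSM from its first k singular components."""
    return (left[:, :k] * sigma[:k]) @ right_t[:k, :]


def index_correlation(r_sv, amino_acid_index):
    """Absolute Pearson correlation between an r-SV and an amino acid index."""
    return abs(float(np.corrcoef(r_sv, amino_acid_index)[0, 1]))


# Placeholder PSSM with N = 120 positions (random numbers, not real scores).
rng = np.random.default_rng(0)
pssm = rng.normal(size=(120, 20))

left, sigma, right_t = svd_components(pssm)
first_l_sv = left[:, 0]      # one value per sequence position (length N)
first_r_sv = right_t[0, :]   # one value per amino acid type (length 20)
print(index_correlation(first_r_sv, KYTE_DOOLITTLE))
```

With a real PSSM, repeating `index_correlation` over many amino acid indices and keeping the best match would mimic the comparison against AAindex described in the text; the l-SVs could be compared with per-site quantities such as conservation in the same way.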

In order to check to what extent a subset of singular components can explain the original PSSM, we calculated the accumulative contribution ratio of each PSSM. The accumulative contribution ratio up to the k-th singular value is defined as

$$S_k = \frac{\sum_{\alpha=1}^{k} \sigma_\alpha}{\sum_{\alpha=1}^{20} \sigma_\alpha}$$

where σ_α is the α-th singular value, which is non-negative. The averages of S_k for k = 1, ..., 20 are shown in Fig. 2. We observe that the first singular value contributes 17% of the total singular values in the PDB set, and 24% in the Pfam set. Thus, the contribution of the first singular component is relatively larger in the Pfam PSSMs than in the PSI-BLAST-generated PSSMs of PDB entries. This tendency may be related to the higher specificity of the Pfam hidden Markov models. 50% contributions are made by the first 4 or 5 components in the PDB or Pfam sets, respectively, whereas 90% contributions are made by the first 15 components in both sets. Compared to the case with AASMs, where 50% and 90% contributions are made by the first 3 and 10 singular values (or eigenvalues) [9], the “compressibility” of PSSMs is lower in the sense that more components are needed to explain the same fraction

…(Full text truncated)…
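The accumulative contribution ratio discussed in the text can be computed in a few lines. This is a minimal sketch, assuming the ratio is taken over the singular values themselves (as the phrase "of the total singular values" suggests) and that the input is sorted in decreasing order, as `numpy.linalg.svd` returns it. The function names and the example values are hypothetical.

```python
import numpy as np


def accumulative_contribution(sigma):
    """Return S_k for k = 1..len(sigma): cumulative sum of singular values over their total."""
    sigma = np.asarray(sigma, dtype=float)
    if np.any(sigma < 0):
        raise ValueError("singular values must be non-negative")
    return np.cumsum(sigma) / sigma.sum()


def components_needed(sigma, fraction):
    """Smallest k such that S_k >= fraction."""
    ratios = accumulative_contribution(sigma)
    # searchsorted finds the first index whose ratio reaches `fraction`; k is 1-based.
    return int(np.searchsorted(ratios, fraction) + 1)


# Hypothetical singular values, sorted in decreasing order.
example_sigma = [4.0, 3.0, 2.0, 1.0]
print(accumulative_contribution(example_sigma))
print(components_needed(example_sigma, 0.5))
```

Applied to the singular values of each PSSM and averaged over a data set, `accumulative_contribution` would give curves of the kind the text attributes to Fig. 2, and `components_needed` would give the number of components required to reach a 50% or 90% contribution.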

Reference

This content is AI-processed based on ArXiv data.