A method to search for local structural similarities in proteins at atomic resolution is presented. It is demonstrated that a huge amount of structural data can be handled within a reasonable CPU time by using a conventional relational database management system with appropriate indexing of geometric data. This method, which we call geometric indexing, can enumerate ligand binding sites that are structurally similar to sub-structures of a query protein among more than 160,000 possible candidates within a few hours of CPU time on an ordinary desktop computer. After a set of high-scoring ligand binding sites has been detected by the geometric indexing search, structural alignments at atomic resolution are constructed by iteratively applying the Hungarian algorithm, and the statistical significance of the final score is estimated from an empirical model based on a gamma distribution. Applications of this method to several protein structures clearly show that significant similarities can be detected between local structures of non-homologous as well as homologous proteins.
According to the 'sequence determines structure determines function' paradigm, it should be possible to predict a protein's structure from its amino acid sequence and, in turn, to predict its function from the structure. Experience has shown, however, that ab initio approaches to both of these problems are extremely difficult.
Currently, the most practical and reliable methods for protein structure prediction are those based on sequence comparison. In such homology-based methods, sequence similarities imply structural similarities. It is tempting to assume that the same argument applies to the prediction of protein functions; that is, we expect to be able to infer some functional information whenever two protein structures are similar. However, it has been demonstrated that protein folds (approximate overall structures) are not significantly correlated with protein functions. Since many protein functions, such as enzymatic catalysis and ligand binding, are performed by a small subset of protein atoms or residues, it seems necessary to perform local structure comparison in addition to (or instead of) fold comparison when inferring protein function by similarity.
A number of methods have been proposed for searching for local similarities in protein structures 1. However, some of them limit the data size because of the prohibitive amount of CPU time and/or RAM they require 2,3,4, while others sacrifice structural details or diversity for the sake of search efficiency 5,6,7. The ever-increasing structural data in the Protein Data Bank (PDB) 8 include many proteins of unknown function, and hence efficient and thorough methods for local structure comparison are urgently needed for inferring protein functions. At the same time, however, such rapidly increasing data only make conventional methods more and more inefficient.
Methods for local structure comparison must therefore scale reasonably well with the rapid increase of data.
In this Note, we introduce techniques to construct a scalable method for similarity search for local protein structures. In this method, ligand binding sites consisting of protein atoms are first compiled as a table in a relational database management system (RDBMS) 9. For a given protein structure as a query, the method searches the database for atoms that are structurally equivalent to the atoms in the query structure. This search process can be executed efficiently owing to the indexing mechanism of the RDBMS. We call this technique geometric indexing (GI). After matching ligand binding sites have been identified, alignments at atomic resolution are obtained by using the Hungarian algorithm 10,11. The present method is similar in spirit to the geometric hashing (GH) algorithm. However, since the total size of the structural data may well exceed several gigabytes, it is usually not possible to implement the GH method naively, because it must keep a huge hash table in RAM. An RDBMS, on the other hand, stores all the data on a hard disk, which is much cheaper and larger than RAM, and hence lets us overcome the data size problem. In addition, almost any modern RDBMS provides an efficient indexing mechanism that allows us to retrieve data satisfying a given set of constraints rather quickly. By using the technique introduced here, it becomes possible to keep up with the rapidly increasing structural data without sacrificing the efficiency of searching or the details and diversity of structural information.
We first extract ligand binding sites (templates) from PDBML files 12 and save them in XML files called LBSML (Ligand Binding Site Markup Language) files. An LBSML file contains information on the atoms that are in contact with a ligand, along with reference sets (refsets) for local coordinate systems (see below). We then compile the refsets and the atomic coordinates in local coordinate systems into a set of relational database (RDB) tables. This pre-processing stage needs to be carried out only once, unless the database has to be updated (Figure 1, left part).
Then a database search is carried out for a given protein structure as a query (Figure 1, right part). A search is divided into two stages. In the first stage, called geometric indexing search (“GI Search” in Figure 1), the database is scanned by exploiting the indexing mechanism of the RDBMS, and possible atomic correspondences are counted.
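The indexed counting step can be illustrated with a minimal sketch. It is not the schema or the query used in this work: the table name `atoms`, its columns, the tolerance `tol`, and the use of SQLite are all assumptions made only to keep the example self-contained. Template atoms are stored with coordinates in some local frame, a composite index is built over atom type and coordinates, and each query atom votes for every template that has a same-type atom within a small box around it.

```python
import sqlite3

def build_db(template_atoms):
    """template_atoms: iterable of (template_id, atom_type, x, y, z).

    Hypothetical schema; coordinates are assumed to be already expressed
    in a local coordinate system.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE atoms ("
        "template_id INTEGER, atom_type TEXT, x REAL, y REAL, z REAL)"
    )
    con.executemany("INSERT INTO atoms VALUES (?, ?, ?, ?, ?)", template_atoms)
    # Composite index so the engine can narrow candidates by atom type and
    # x range instead of scanning the whole table.
    con.execute("CREATE INDEX idx_atoms_geom ON atoms (atom_type, x, y, z)")
    return con

def count_matches(con, query_atoms, tol=0.5):
    """Count, per template, how many query atoms have a same-type template
    atom inside a box of half-width tol (an assumed tolerance)."""
    sql = (
        "SELECT DISTINCT template_id FROM atoms "
        "WHERE atom_type = ? "
        "AND x BETWEEN ? AND ? AND y BETWEEN ? AND ? AND z BETWEEN ? AND ?"
    )
    counts = {}
    for atom_type, x, y, z in query_atoms:
        box = (x - tol, x + tol, y - tol, y + tol, z - tol, z + tol)
        for (tid,) in con.execute(sql, (atom_type, *box)):
            counts[tid] = counts.get(tid, 0) + 1
    return counts

# Toy data: template 1 resembles the query, template 2 shares one atom only.
templates = [
    (1, "N", 0.0, 0.0, 0.0), (1, "O", 1.5, 0.0, 0.0), (1, "C", 0.0, 2.0, 0.0),
    (2, "N", 5.0, 5.0, 5.0), (2, "O", 1.4, 0.1, 0.0),
]
query = [("N", 0.1, 0.0, 0.0), ("O", 1.6, 0.1, 0.1), ("C", 0.2, 1.9, 0.0)]
con = build_db(templates)
counts = count_matches(con, query)
```

Templates with the highest counts would then be passed on for refinement. Because the data live in the database rather than in an in-memory hash table, the same query pattern applies unchanged when the table is stored on disk.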
In the second stage (“IR Procedure” in Figure 1), a predefined number of high-scoring templates are subjected to iterative refinement of their alignment to the sub-structures of the query.
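A single assignment step of the kind used in such a refinement can be sketched as follows. This is only an illustration under stated assumptions: the cost is taken to be the squared distance between atoms that are already superimposed, the function names `hungarian`, `sq_dist` and `align_atoms` are hypothetical, and the iteration and scoring of the actual procedure are not reproduced. The solver is a standard O(n^3) potential-based formulation of the Hungarian algorithm.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list a in which row i is assigned to column a[i].
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j]: row matched to column j, 0 if free
    way = [0] * (m + 1)   # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment

def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def align_atoms(query, template):
    """Pair each query atom with one template atom so that the total
    squared distance is minimal (requires len(query) <= len(template))."""
    cost = [[sq_dist(q, t) for t in template] for q in query]
    return hungarian(cost)

# Toy data: the template is a shuffled, slightly perturbed copy of the query.
query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
template = [(0.1, 4.0, 0.0), (0.0, 0.1, 0.0), (3.0, 0.0, 0.1)]
pairs = align_atoms(query, template)
```

In an iterative scheme, one would typically recompute the superposition from the current pairs and repeat the assignment until the pairing stops changing; that loop is omitted here.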
We downloaded all the PDBML 12 files (43,755 entries) on June 6, 2007. From these PDB entries, we discarded those that do not contain a protein chain and those that do not contain any hetero atoms other than water.
As in the geometric hashing algorithm, all atomic coordinates are expressed in various local coordinate systems defined by reference