A more appropriate Protein Classification using Data Mining


📝 Original Info

  • Title: A more appropriate Protein Classification using Data Mining
  • ArXiv ID: 1111.2514
  • Date: 2023-09-15
  • Authors: Wan K. Kim, Dan M. Bolser, Jong H. Park

📝 Abstract

Research in bioinformatics is complex because it spans two knowledge domains, biology and computer science. This paper introduces an efficient data mining approach that classifies proteins into useful groups by representing them in a hierarchical tree structure. Several techniques exist for classifying proteins, but most have drawbacks in how they group them. The most efficient of these is the grouping technique used by PSIMAP (Protein Structural Interactome Map). Although PSIMAP successfully incorporates most proteins, it fails to classify proteins with the scale-free property. Our technique overcomes this drawback and maps all proteins into groups, including the scale-free proteins that PSIMAP fails to group. Our approach selects six major attributes of a protein, a) structure comparison, b) sequence comparison, c) connectivity, d) cluster index, e) interactivity and f) taxonomic diversity, and uses them to group the proteins in the databank by generating a hierarchical tree structure. For each protein newly entered into the system, the proposed approach calculates its degree (probability) of similarity to the proteins already in the system by applying probability theory to each of the six properties.
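The similarity calculation outlined in the abstract can be pictured with a minimal Python sketch. The text does not state how the six per-attribute scores are computed or combined, so everything here is an assumption made for illustration: each attribute is taken to yield a similarity in [0, 1], the scores are combined by a plain arithmetic mean, and the names `ATTRIBUTES`, `combined_similarity` and `best_match` are hypothetical.

```python
from statistics import mean

# The six attributes named in the abstract (identifiers are illustrative).
ATTRIBUTES = (
    "structure",
    "sequence",
    "connectivity",
    "cluster_index",
    "interactivity",
    "taxonomy",
)


def combined_similarity(scores):
    """Average six per-attribute similarity scores, each in [0, 1].

    Assumption: an unweighted mean; the paper's actual rule is not given.
    """
    missing = [a for a in ATTRIBUTES if a not in scores]
    if missing:
        raise ValueError(f"missing attribute scores: {missing}")
    for attribute in ATTRIBUTES:
        if not 0.0 <= scores[attribute] <= 1.0:
            raise ValueError(f"{attribute} score must lie in [0, 1]")
    return mean(scores[a] for a in ATTRIBUTES)


def best_match(candidates):
    """Return (name, score) of the existing protein most similar to a new one.

    `candidates` maps each existing protein name to its per-attribute scores
    against the newly entered protein.
    """
    scored = ((name, combined_similarity(s)) for name, s in candidates.items())
    return max(scored, key=lambda pair: pair[1])
```

In a hierarchical tree, the new protein would then be attached near the best-matching existing protein; a weighted combination could replace the mean if some attributes are considered more informative.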

📄 Full Content

Classifying proteins by their various properties is a crucial issue in many fields of biological science. Research in pharmacy, biochemistry, genetic engineering and even agriculture relies heavily on appropriate protein grouping techniques. Recognizing the importance of protein classification, several bioinformatics research groups have started projects aimed at deriving appropriate classification algorithms. Proteins can be classified by several properties, namely a) structure comparison, b) sequence comparison, c) connectivity, d) cluster index, e) interactivity and f) taxonomic and age diversity [1]. So far, individual research groups have attempted to classify proteins using only one or two of these properties. For example, the BMC Bioinformatics research group has developed an in silico classification system called HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPs) are used to summarize protein-protein interaction characteristics. The system was applied to a dataset of 64 proteins of the death domain superfamily and used to classify each member into its proper subfamily. Two classification methods were tried: a heuristic approach and support vector machine learning. Both were tested with 5-fold cross-validation. The heuristic approach yielded 61% average accuracy, while the machine learning approach yielded 89%. Although this is a good technique, it concentrates only on the protein-protein interaction property [2].
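As a rough illustration of the 5-fold cross-validation mentioned above, the following Python sketch splits sample indices into five train/test partitions. How HODOCO actually partitioned its 64 proteins is not stated; the contiguous, unshuffled folds and the function name `k_fold_indices` are assumptions for this example.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index partitions.

    Assumption: contiguous folds without shuffling; real evaluations
    usually shuffle (and often stratify) the samples first.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


# Example: partition 64 samples, the dataset size quoted in the text.
folds = k_fold_indices(64, k=5)
```

Each sample appears in exactly one test fold, so the reported average accuracy is the mean over five models, each evaluated on data it was not trained on.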

Wan K. Kim, Dan M. Bolser and Jong H. Park [1] used PSIMAP for a large-scale co-evolution analysis of protein structural interlogues. They investigated the degree of co-evolution for more than 900 family pairs in a global protein structure interactome map. They constructed PSIMAP by systematically extracting all protein domain contacts in the web-based Protein Data Bank (PDB). Their PSIMAP contained 37387 interacting domain pairs with five or more contacts within 5 Å. They first confirmed that correlated evolution is observed extensively throughout the interacting pairs of structural families in the PDB, indicating that it is a general property of protein evolution. The overall average correlation was 0.73 for a relatively reliable set of 454 family pairs, of which 78% showed significant correlation at 99% confidence. In total, 918 family pairs were investigated, with an average correlation of 0.61. However, statistical validity was weak for family pairs with a small N (the number of member domain pairs). This was the first step in protein classification to combine two properties of proteins, namely structure comparison and interactivity.
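The contact criterion quoted above (five or more contacts within 5 Å) can be sketched in Python as follows. This is only an illustration under stated assumptions: each domain is represented as a list of 3D coordinates, a "contact" is taken to be any pair of points, one from each domain, no further apart than the cutoff, and the function names are hypothetical. Whether PSIMAP counts atom pairs or residue pairs is not specified in this text.

```python
from math import dist


def count_contacts(domain_a, domain_b, cutoff=5.0):
    """Count point pairs (one from each domain) within `cutoff` angstroms."""
    return sum(
        1
        for p in domain_a
        for q in domain_b
        if dist(p, q) <= cutoff
    )


def domains_interact(domain_a, domain_b, cutoff=5.0, min_contacts=5):
    """Apply the 'five or more contacts within 5 angstroms' rule."""
    return count_contacts(domain_a, domain_b, cutoff) >= min_contacts
```

The brute-force double loop is quadratic in the number of points; for whole-PDB extraction a spatial index (for example a grid or k-d tree) would be the usual choice.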

Jong Park and Dan Bolser established a bioinformatics research group in the UK named MRC-DUNN. They started their research on protein networks, worked on protein structure, and also used the PSIMAP concept. Their limitation is that they focused only on protein interactivity and taxonomic diversity. As a result, their work contributed little to protein structure analysis using the PSIMAP concept.

In February 2003, Jong Park and Dan Bolser again tried to integrate the biological network evolution hypothesis with the protein structural interactome. PSIMAP was used to identify all structurally observed interactions at the structure family level. To assess the functional and evolutionary differences between the most interactive and the least interactive folds, they used the latest HIINFOLD and LOINFOLD comparison sets (Park and Bolser, 2001): high-interaction structure families and low-interaction structure families. The major problem with their system is its claim that scale-free topology is robust, which does not hold in practice.
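The distinction between high-interaction and low-interaction families can be illustrated by counting interaction partners in a family-level network. The Python sketch below is an assumption-laden example, not the procedure used to build HIINFOLD and LOINFOLD: the selection criteria for those sets are not given here, so the `threshold` parameter and the function names are hypothetical.

```python
from collections import defaultdict


def interaction_degrees(pairs):
    """Map each family to its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in pairs:
        if a == b:
            # Self-interactions register the family but add no partner.
            partners[a]
            continue
        partners[a].add(b)
        partners[b].add(a)
    return {family: len(p) for family, p in partners.items()}


def split_by_degree(degrees, threshold):
    """Separate families into high- and low-interaction groups.

    Assumption: a single cutoff on partner count; purely illustrative.
    """
    high = sorted(f for f, d in degrees.items() if d >= threshold)
    low = sorted(f for f, d in degrees.items() if d < threshold)
    return high, low
```

In a scale-free network a few families (hubs) have very many partners while most have few, which is why a fixed layout or a single grouping rule tends to treat hubs poorly.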

The BMC Bioinformatics research group developed a concept for visualization and graph-theoretic analysis of a large-scale protein structural interactome. They presented a global analysis of PSIMAP using several distinct network measures relating to centrality, interactivity, fault-tolerance, and taxonomic diversity. However, to obtain a proper structure and layout, they placed several proteins according to maximum similarity. As a result, some proteins are placed in the wrong positions, and many scale-free proteins are not placed properly.

Sungsam Gong, Giseok Yoon, Insoo Jang, Dan Bolser, Panos Dafas and other scientists developed PSIbase, the web server and database for the Protein Structural Interactome map (PSIMAP). It contains (1) domain-domain and protein-protein interaction information from proteins whose 3D structures are identified, (2) a protein interaction map and its viewer at the protein superfamily and family levels, (3) protein interaction interface viewers and (4) structural domain prediction tools for possible interactions by detecting homologous matches in the Protein Data Bank (PDB) fro

