DeepPNI: Language- and graph-based model for mutation-driven protein-nucleic acid energetics

๐Ÿ“ Abstract

The interaction between proteins and nucleic acids is crucial for processes that sustain cellular function, including DNA maintenance and the regulation of gene expression and translation. Amino acid mutations in protein-nucleic acid complexes often lead to severe diseases. Experimental techniques for measuring mutational effects in protein-nucleic acid complexes have their own specific limitations. In this study, we compiled a large dataset of 1951 mutations covering both protein-DNA and protein-RNA complexes and integrated structural and sequential features to build a deep learning-based regression model named DeepPNI. This model estimates mutation-induced binding free energy changes in protein-nucleic acid complexes. The structural features are encoded via an edge-aware relational graph convolutional network (RGCN), and the sequential features are extracted using the protein language model ESM-2. The model achieved a high average Pearson correlation coefficient (PCC) of 0.76 on the full dataset in five-fold cross-validation. Consistent performance across the individual protein-DNA and protein-RNA datasets, and across datasets split by experimental temperature, indicates that the model generalizes. The model also performed well in complex-based five-fold cross-validation, which supports its robustness. In addition, DeepPNI performed well in external dataset validation and in comparison with existing tools.

📄 Content

Cellular processes are maintained through communication among numerous biomolecules.

Protein-nucleic acid interactions (PNIs) play key roles in the regulation of essential biological functions such as DNA replication, repair, recombination, gene expression, and translation. 1,2 The affinity between proteins and nucleic acids (DNA and RNA) within a protein-nucleic acid complex is governed by intermolecular forces, physicochemical properties, and structural features. 3,4 These factors are often disrupted by missense mutations occurring in either or both partners of the complex. 3 Numerous pathologies related to these mutations have been described in the literature. 5-16 For instance, mutations in TDP-43 (transactive response DNA-binding protein 43), an RNA/DNA-binding protein, have been linked to neurodegenerative diseases such as ALS (amyotrophic lateral sclerosis) and FTD (frontotemporal dementia). 6 Melanoma, a potentially fatal type of skin cancer, is mainly caused by solar radiation-induced mutations in DNA. 7 These DNA mutations have been observed to amplify TERT (telomerase reverse transcriptase) promoter activity. Under normal conditions, TERT in complex with RNA contributes to the synthesis of telomeres at chromosomal ends. However, abnormally heightened TERT activity can promote carcinogenesis. 5 Further, Leigh syndrome is a progressive neurodegenerative disorder that has been reported to be caused by pathogenic mutations in mitochondrial tRNAs. Upon mutation, these tRNAs become substrates for different aminoacyl-tRNA synthetases, which hinders normal protein synthesis. 17-19 These examples show that mutations in protein-nucleic acid complexes can trigger several diseases, so it is crucial to investigate the effects of these alterations thoroughly.

The impact of a mutation can be assessed experimentally using techniques such as surface plasmon resonance, FRET, or isothermal titration calorimetry to determine the change in binding affinity. 20-24 However, with the rapid growth of genomic data, the demand for high-throughput analysis is increasing proportionally. 25 Conventional experimental methods are not well suited for this purpose because of their low throughput and high cost. 26 The advent of high-throughput experimental methods such as high-throughput SELEX, 27 protein-binding microarrays, 28 and mechanically induced trapping of molecular interactions 29 has partially addressed this challenge. Nevertheless, these techniques also have limitations that restrict their wide applicability. 28-30 Computational methods such as free energy perturbation and thermodynamic integration can accurately calculate binding free energy. 31 However, these methods are also not appropriate for large-scale studies because they are computationally expensive. 32-37

In this work, we developed a deep learning model for PNI predictions trained on a large dataset of 1951 mutations across protein-DNA and protein-RNA complexes from the NABE 38 database. Multiple types of nodes and edges were used to build an edge-aware relational graph convolutional network that encodes the structural features. These structural features were then fused with sequential features from a protein language model to predict the binding free energy change upon amino acid mutation. We tested the model's performance on a variety of data splits, including individual protein-DNA and protein-RNA data and a temperature-based split. The robustness of the model was supported by its good performance in a complex-based five-fold cross-validation experiment. The model outperformed recent state-of-the-art methods and performed well in external database validation. The novelty of this work lies in combining sequential features with an edge-aware, atomic-level relational graph convolutional network on a large dataset.
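To make the evaluation protocol more concrete, the following Python sketch shows one way to build complex-based folds and to compute the PCC. It assumes that "complex-based" means all mutations from the same complex are kept in a single fold, so that no complex appears in both training and test data. The function names, the greedy fold assignment, and the complex identifiers are illustrative and are not taken from DeepPNI.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def complex_based_folds(complex_ids: Sequence[str], k: int = 5) -> List[List[int]]:
    """Split sample indices into k folds without splitting any complex.

    Assumption: each mutation carries the identifier of its parent complex.
    Complexes are assigned greedily, largest first, to the currently
    smallest fold to keep fold sizes roughly balanced.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, complex_id in enumerate(complex_ids):
        groups[complex_id].append(index)

    folds: List[List[int]] = [[] for _ in range(k)]
    for complex_id in sorted(groups, key=lambda c: (-len(groups[c]), c)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[complex_id])
    return folds


# Hypothetical complex labels, one per mutation in a toy dataset.
toy_complex_ids = ["A", "A", "A", "B", "B", "C", "D", "E", "F"]
toy_folds = complex_based_folds(toy_complex_ids, k=5)
```

In each round, one fold serves as the test set and the remaining four as the training set; the PCC between predicted and experimental values is computed per fold and then averaged.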

This study aims to quantitatively predict how single amino acid mutations across protein-nucleic acid interfaces affect binding affinity. The impact of a mutation is measured by the binding free energy change (ΔΔG), defined as the difference in binding free energies between the mutant and the wild-type complex:

$$\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild-type}}$$
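As a worked illustration of this definition, the sketch below computes a binding free energy change from dissociation constants. The conversion from a dissociation constant to a binding free energy (RT ln Kd, with a 1 M standard state) is a standard thermodynamic relation and is an assumption here, since the text does not state how the experimental values were derived. The function names and the example Kd values are hypothetical.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses dG = R * T * ln(Kd) with a 1 M standard state (assumed convention).
    """
    if kd_molar <= 0.0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def binding_free_energy_change(dg_mutant: float, dg_wild_type: float) -> float:
    """Mutant minus wild-type binding free energy, as defined in the text."""
    return dg_mutant - dg_wild_type


# Hypothetical example: the mutation weakens binding tenfold (1 nM to 10 nM).
dg_wt = binding_free_energy(1e-9)
dg_mut = binding_free_energy(1e-8)
ddg = binding_free_energy_change(dg_mut, dg_wt)
```

With this convention, a tenfold weaker binder gives a positive value of R * T * ln(10), about 1.36 kcal/mol at 298.15 K. Because the result scales with temperature, the experimental temperature matters when comparing measurements.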

A positive ΔΔG indicates a destabilizing mutation (reduced binding affinity), whereas a negative value represents a stabilizing mutation (enhanced binding affinity). This task is formulated as a supervised regression problem, where each mutation is represented by a combination of structural and sequential features. Specifically, a local atomic graph centered at the mutation site was encoded using a graph convolutional network (GCN) to capture spatial and physicochemical information, while an ESM-based protein language model embedding described the sequence context. The concatenated feature representation $z_{\text{combined}} = [z_G \,\|\, e_{\text{esm}}]$ was used to learn a mapping function $f: z_{\text{combined}} \to \Delta\Delta G$, predicting the experimentall
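The feature fusion described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the DeepPNI architecture: it applies one standard relational graph convolution step (a self transform plus a mean over neighbors per edge type), mean-pools the node states into a graph vector, concatenates that vector with a stand-in sequence embedding, and applies a linear map in place of the learned function f. All weights, dimensions, relation names, and function names are hypothetical, and edge features beyond the edge type are not modeled.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, str]  # (source node, target node, relation type)


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relational_graph_conv(
    node_states: List[Vector],
    edges: List[Edge],
    w_self: Matrix,
    w_rel: Dict[str, Matrix],
) -> List[Vector]:
    """One relational graph convolution step followed by ReLU.

    For each node: self transform plus, for every relation type, the mean
    of the transformed states of incoming neighbors under that relation.
    """
    updated = []
    for target in range(len(node_states)):
        acc = matvec(w_self, node_states[target])
        for relation, weight in w_rel.items():
            sources = [s for (s, t, r) in edges if t == target and r == relation]
            for source in sources:
                message = matvec(weight, node_states[source])
                acc = [a + m / len(sources) for a, m in zip(acc, message)]
        updated.append([max(0.0, a) for a in acc])
    return updated


def mean_pool(node_states: List[Vector]) -> Vector:
    """Average node states into a single graph-level vector z_G."""
    count = len(node_states)
    return [sum(column) / count for column in zip(*node_states)]


def predict_ddg(z_graph: Vector, e_esm: Vector, weights: Vector, bias: float) -> float:
    """Concatenate [z_G || e_esm] and apply a linear regression head."""
    z_combined = z_graph + e_esm  # list concatenation
    return sum(w * x for w, x in zip(weights, z_combined)) + bias


# Toy two-atom graph with two hypothetical relation types.
states = [[1.0, 0.0], [0.0, 1.0]]
toy_edges = [(0, 1, "covalent"), (1, 0, "contact")]
identity = [[1.0, 0.0], [0.0, 1.0]]
relation_weights = {
    "covalent": [[2.0, 0.0], [0.0, 2.0]],
    "contact": [[0.5, 0.0], [0.0, 0.5]],
}

hidden = relational_graph_conv(states, toy_edges, identity, relation_weights)
z_g = mean_pool(hidden)
toy_esm = [0.2, -0.4]  # stand-in for a real language model embedding
prediction = predict_ddg(z_g, toy_esm, [1.0, 1.0, 1.0, 1.0], 0.0)
```

In practice the weights would be learned by minimizing a regression loss against experimental ΔΔG values, and the sequence embedding would come from the protein language model rather than a hand-written vector.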

This content is AI-processed based on ArXiv data.
