A DNN Biophysics Model with Topological and Electrostatic Features
In this project, we present a deep neural network (DNN)-based biophysics model that uses multi-scale and uniform topological and electrostatic features to predict protein properties, such as Coulomb energies or solvation energies. The topological features are generated using element-specific persistent homology (ESPH) on a selection of heavy atoms or carbon atoms. The electrostatic features are generated using a novel Cartesian treecode, which adds underlying electrostatic interactions to further improve the model prediction. These features are uniform in number for proteins of varying sizes; therefore, the widely available protein structure databases can be used to train the network. These features are also multi-scale, allowing users to balance resolution and computational cost. The optimal model trained on more than 17,000 proteins for predicting Coulomb energy achieves an MSE of approximately 0.024, a MAPE of 0.073, and an $R^2$ of 0.976. Meanwhile, the optimal model trained on more than 4,000 proteins for predicting solvation energy achieves an MSE of approximately 0.064, a MAPE of 0.081, and an $R^2$ of 0.926, showing the efficiency and fidelity of these features in representing the protein structure and force field. The feature generation algorithms also have the potential to serve as general tools for assisting machine-learning-based prediction of protein properties and functions.
💡 Research Summary
This paper introduces a deep neural network (DNN) framework for predicting protein physical properties—specifically Coulomb energy and electrostatic solvation energy—by constructing uniform, multi‑scale features that capture both the geometric topology of protein structures and their underlying electrostatic interactions.
The authors first generate topological descriptors using element‑specific persistent homology (ESPH). By selecting subsets of atoms (e.g., heavy atoms or carbon atoms) and varying a distance threshold ε, they build Vietoris‑Rips complexes and compute persistence barcodes for dimensions 0, 1, and 2. From these barcodes they extract statistics such as the number of connected components, loops, cavities, persistence lengths, birth/death times, and summary moments. Because the ESPH pipeline aggregates information across scales, the resulting topological feature vector has a fixed length (e.g., 200–300 entries) regardless of protein size, enabling straightforward batching of heterogeneous proteins.
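To make the fixed-length property concrete, the following is a minimal sketch restricted to dimension 0 (connected components) of a Vietoris‑Rips filtration. It is not the authors' ESPH pipeline: the function names, the toy coordinates, and the grid of eight thresholds are all assumptions chosen for illustration, and a real implementation would also handle dimensions 1 and 2 and element selection.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Death scales of 0-dimensional bars in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    first edge that merges it into another component (Kruskal-style
    union-find over edges sorted by length).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 finite bars; one component never dies


def betti0_features(points, thresholds):
    """Fixed-length vector: number of connected components at each scale."""
    deaths = h0_death_times(points)
    return [len(points) - sum(1 for d in deaths if d <= eps) for eps in thresholds]


# Hypothetical scale grid (0.5 .. 4.0) and toy "atoms" of two different sizes.
thresholds = [0.5 * k for k in range(1, 9)]
small = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
large = small + [(5.0, 1.5, 0.0), (0.0, 3.0, 0.0)]

print(betti0_features(small, thresholds))  # [3, 2, 2, 2, 2, 2, 2, 1]
print(betti0_features(large, thresholds))  # [5, 4, 3, 3, 3, 2, 2, 1]
```

Both point sets yield a vector of length eight, because the length is set by the threshold grid and not by the number of atoms. This is the property that lets proteins of different sizes share one network input layout.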
To incorporate electrostatic information, the paper adapts the Cartesian treecode algorithm. Traditional pairwise Coulomb calculations scale as O(N²), but the treecode replaces particle‑particle interactions with particle‑cluster interactions, achieving O(N log N) complexity. The authors further modify the method by representing atomic charges as multipole moments at cluster centers, thereby preserving both charge magnitude and spatial distribution across multiple hierarchical levels. For each level they compute average potentials, potential variances, and distance‑weighted charge moments, producing a compact electrostatic feature vector that is also size‑independent.
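The particle‑cluster idea can be sketched as follows. This is an illustrative first‑order example only, not the authors' treecode: it compares the exact potential of a small cluster of charges with an approximation built from the cluster's monopole and dipole moments about its center. The charges, positions, and function names are assumptions, units are omitted (the Coulomb constant is taken as 1), and an actual treecode would use higher‑order expansions, a hierarchical tree, and an acceptance criterion for well‑separated clusters.

```python
import math


def direct_potential(target, sources):
    """Exact Coulomb potential at `target`: sum of q / r over all sources."""
    return sum(q / math.dist(target, pos) for pos, q in sources)


def cluster_moments(sources, center):
    """Monopole (total charge) and dipole moment about the cluster center."""
    total = sum(q for _, q in sources)
    dipole = [
        sum(q * (pos[k] - center[k]) for pos, q in sources) for k in range(3)
    ]
    return total, dipole


def cluster_potential(target, center, total, dipole):
    """First-order Taylor approximation: Q / R + (p . R) / R^3."""
    r_vec = [target[k] - center[k] for k in range(3)]
    r = math.sqrt(sum(c * c for c in r_vec))
    p_dot_r = sum(dipole[k] * r_vec[k] for k in range(3))
    return total / r + p_dot_r / r**3


# Hypothetical cluster of three charges near the origin, target far away.
sources = [
    ((0.1, 0.0, 0.0), 1.0),
    ((-0.1, 0.2, 0.0), -0.5),
    ((0.0, -0.1, 0.1), 0.8),
]
center = (0.0, 0.0, 0.0)
target = (10.0, 0.0, 0.0)

total, dipole = cluster_moments(sources, center)
exact = direct_potential(target, sources)
monopole_only = total / math.dist(target, center)
approx = cluster_potential(target, center, total, dipole)

print(f"exact={exact:.6f} monopole={monopole_only:.6f} monopole+dipole={approx:.6f}")
```

Once the moments are computed, evaluating the cluster's contribution at a distant target costs the same no matter how many charges the cluster holds, which is where the savings over the pairwise sum come from. In this toy case the dipole term reduces the error of the monopole‑only estimate.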
The two feature sets are concatenated and fed into a five‑layer fully connected DNN (input dimension ≈ 500–600, hidden layers 1024‑512‑256‑128, ReLU activations, batch normalization). Training uses the Adam optimizer (learning rate 1e‑4) with an L2 regularization term, early stopping, and a train/validation/test split of 8:1:1. Labels are obtained from two sources: (1) direct pairwise Coulomb energy computed from atomic charges for > 17,000 proteins, and (2) solvation free energy derived from the matched‑interface‑and‑boundary Poisson‑Boltzmann (MIBPB) solver for > 4,000 proteins.
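As a rough check on the size of such a network and on the 8:1:1 split, the sketch below counts the weights and biases of the stated layer widths and partitions a list of sample indices. The input width of 512, the single scalar output, the seed, and the function names are assumptions; batch‑normalization parameters are not counted, and the summary does not say how the authors shuffled or split their data.

```python
import random


def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def split_811(items, seed=0):
    """Shuffle, then split into train/validation/test with an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# Assumed input width 512 and one regression output; hidden widths from the text.
layer_sizes = [512, 1024, 512, 256, 128, 1]
print(dense_param_count(layer_sizes))  # 1214465

train, val, test = split_811(range(17000))
print(len(train), len(val), len(test))  # 13600 1700 1700
```

Under these assumptions the dense layers hold roughly 1.2 million trainable parameters, with most of them in the first two layers.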
Performance metrics demonstrate that the combined feature model outperforms baselines that use only geometric or only electrostatic descriptors. For Coulomb energy prediction the model achieves MSE ≈ 0.024, MAPE ≈ 0.073, and R² ≈ 0.976; for solvation energy it reaches MSE ≈ 0.064, MAPE ≈ 0.081, and R² ≈ 0.926. Ablation studies reveal that removing electrostatic features degrades solvation energy accuracy substantially, confirming their critical contribution. Feature generation is fast: ESPH takes ~0.8 s per protein, treecode ~0.4 s, making the pipeline suitable for large‑scale databases.
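For reference, the three reported metrics can be computed as in the sketch below, using their standard definitions. MAPE is returned as a fraction, in line with values such as 0.073 above. The toy labels are invented for illustration, and the summary does not state whether the authors normalized or rescaled energies before computing these numbers.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; requires nonzero labels."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy labels and predictions, not data from the paper.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 2.5, 3.5]

print(round(mse(y_true, y_pred), 4))   # 0.1667
print(round(mape(y_true, y_pred), 4))  # 0.125
print(round(r2(y_true, y_pred), 4))    # 0.8929
```

MSE depends on the scale of the labels, while MAPE and R² are scale‑free, so the three numbers are best read together.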
The authors emphasize that their feature extraction is independent of any specific implicit solvent model (Poisson‑Boltzmann or Generalized Born), allowing the same descriptors to be reused for other tasks such as pKa prediction, binding affinity estimation, or mutation impact analysis. They suggest future extensions including higher‑dimensional persistence (beyond the dimensions 0 to 2 used here), integration with graph neural networks to capture local connectivity, and application to broader biophysical property prediction. Overall, the study shows how mathematically rigorous, multi‑scale topological and electrostatic representations can support accurate and scalable machine‑learning models for protein physics.