GeneNetMiner: accurately mining gene regulatory networks from literature

GeneNetMiner is standalone software that parses sentences from iHOP and captures regulatory relations. These relations are either gene-gene regulations or gene-biological process relations. Capturing gene-biological process relations is a feature unique among tools of this kind. The relations can be used to build gene regulatory networks for specific biological processes, diseases, or phenotypes. Users can search for genes and biological processes to find the regulatory relationships between them. Each regulatory relationship is assigned a confidence score, which indicates the probability that the relation is true. Furthermore, the software reports the sentence containing the queried terms, which allows users to manually check whether the relation is true if they wish. GeneNetMiner is able to accurately capture the regulatory relationships between genes from literature.

💡 Research Summary

GeneNetMiner is a standalone software platform designed to extract regulatory relationships from the biomedical literature, specifically leveraging the iHOP (Information Hyperlinked over Proteins) sentence database. Unlike most existing text‑mining tools, which focus solely on gene‑gene interactions, GeneNetMiner captures two classes of relations: (1) direct gene‑gene regulatory links (e.g., activation, repression) and (2) gene‑biological‑process associations (e.g., a gene participates in autophagy, apoptosis, or cell‑cycle regulation). Including gene‑process relations enables users to construct functional regulatory networks that are directly tied to phenotypes, diseases, or specific biological pathways.

The system operates in three sequential modules. First, a comprehensive entity dictionary is built by integrating identifiers from NCBI Gene, HGNC, UniProt, and Gene Ontology. This dictionary normalizes synonyms, abbreviations, and common misspellings, allowing robust recognition of gene symbols (e.g., “p53”, “TP53”, “tumor protein p53”) and process terms (e.g., “autophagy”, “cell‑death”). Sentences retrieved from iHOP are tokenized, part‑of‑speech tagged, and then passed through the dictionary for entity annotation.
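The dictionary lookup described above can be illustrated with a minimal sketch. It is not GeneNetMiner's implementation: the `SYNONYMS` table, the function names, and the longest-match regular expression are all assumptions made for illustration, and a real dictionary would be compiled from NCBI Gene, HGNC, UniProt, and Gene Ontology exports rather than written by hand.

```python
import re

# Hypothetical synonym table (assumption): canonical name -> surface forms.
SYNONYMS = {
    "TP53": ["p53", "TP53", "tumor protein p53"],
    "autophagy": ["autophagy"],
    "cell death": ["cell death", "cell-death"],
}


def build_lookup(synonyms):
    """Map every lower-cased surface form to its canonical identifier."""
    return {
        form.lower(): canonical
        for canonical, forms in synonyms.items()
        for form in forms
    }


def annotate(sentence, lookup):
    """Return (start, end, canonical) for each dictionary hit in the sentence."""
    # Longer forms come first in the alternation so that
    # "tumor protein p53" is preferred over the shorter "p53".
    forms = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in forms)
    # The lookarounds stop matches that begin or end inside a longer token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + alternation + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), lookup[m.group(1).lower()])
        for m in pattern.finditer(sentence)
    ]


lookup = build_lookup(SYNONYMS)
hits = annotate("Tumor protein p53 induces autophagy.", lookup)
```

Here `hits` holds one entry per recognized mention, each carrying character offsets and the normalized identifier, which is the form a downstream relation extractor would consume.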

Second, regulatory relations are identified using a hybrid approach that combines rule‑based pattern matching with a fine‑tuned BERT‑based deep‑learning classifier. The rule engine encodes classic biological phrasing such as “X activates Y”, “X represses Y”, and “X is involved in Z”. These high‑precision patterns generate candidate relations that are subsequently scored by the neural model, which has been trained on a curated corpus of biomedical sentences enriched for process‑oriented verbs (“regulates”, “modulates”, “participates in”). An ensemble strategy merges the two evidence streams, producing a confidence score between 0 and 1 for each extracted relation. Users can adjust the confidence threshold to trade off precision against recall according to their downstream needs.
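A rough sketch of how rule-generated candidates might be combined with a model score is given below. The patterns, the per-rule prior values, the weighted-average ensemble, and the `model_score` callable are all assumptions for illustration; the summary does not specify how the two evidence streams are actually merged, and the lambda used here is only a stand-in for the fine-tuned classifier.

```python
import re
from typing import Callable, List, NamedTuple


class Relation(NamedTuple):
    source: str
    kind: str
    target: str
    confidence: float


# Hypothetical rules modeled on the phrasings quoted above.
# Each rule carries an assumed prior reflecting how precise the pattern is.
RULES = [
    (re.compile(r"(\w+) activates (\w+)"), "activation", 0.9),
    (re.compile(r"(\w+) represses (\w+)"), "repression", 0.9),
    (re.compile(r"(\w+) is involved in (\w+)"), "participation", 0.8),
]


def extract(
    sentence: str,
    model_score: Callable[[str, str, str], float],
    rule_weight: float = 0.5,
) -> List[Relation]:
    """Generate candidates with the rules, then blend rule prior and model score."""
    relations = []
    for pattern, kind, prior in RULES:
        for m in pattern.finditer(sentence):
            source, target = m.group(1), m.group(2)
            probability = model_score(sentence, source, target)
            # Assumed ensemble: a convex combination, so the result stays in [0, 1].
            confidence = rule_weight * prior + (1 - rule_weight) * probability
            relations.append(Relation(source, kind, target, confidence))
    return relations


def filter_by_confidence(relations: List[Relation], threshold: float) -> List[Relation]:
    """Keep relations at or above the user-chosen confidence threshold."""
    return [r for r in relations if r.confidence >= threshold]


# Stand-in scorer (assumption): a real system would call the trained classifier.
candidates = extract("BECN1 activates autophagy", lambda s, a, b: 0.7)
strict = filter_by_confidence(candidates, threshold=0.85)
```

Raising the threshold discards lower-scoring candidates and so favors precision, while lowering it admits more candidates and favors recall, which is the trade-off the paragraph above describes.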

Third, the results are presented through an intuitive web‑based graphical user interface. Users input a gene name, a process term, or a combination thereof; the system returns a table listing the relation type, directionality (activation, repression, participation), confidence score, and the original sentence that triggered the extraction. A dynamic network visualization (Cytoscape‑style) displays the entire regulatory graph, allowing users to explore hub genes, pathway clusters, and edge weights interactively. The interface also supports export of the network in standard formats (e.g., GraphML, JSON) for downstream analysis.
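To make the export step concrete, the following sketch serializes a list of extracted relations to JSON. The node-link layout and the field names are assumptions, since the summary does not describe the schema GeneNetMiner writes; the relation tuples are illustrative values, not results from the tool.

```python
import json


def to_node_link(relations):
    """Convert (source, kind, target, confidence) tuples to a node-link dict."""
    node_ids = sorted(
        {name for source, _, target, _ in relations for name in (source, target)}
    )
    return {
        "nodes": [{"id": node_id} for node_id in node_ids],
        "edges": [
            {"source": source, "target": target, "type": kind, "confidence": conf}
            for source, kind, target, conf in relations
        ],
    }


# Illustrative relations only (assumption), not output of the actual system.
relations = [
    ("ATG5", "participation", "autophagy", 0.8),
    ("BECN1", "participation", "autophagy", 0.9),
]
graph = to_node_link(relations)
serialized = json.dumps(graph, indent=2)
```

A layout like this keeps the confidence score on each edge, so a downstream tool can still apply its own threshold after export.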

Performance was benchmarked against two reference sets. For gene‑gene interactions, GeneNetMiner was evaluated against curated edges from BioGRID and STRING. The hybrid model achieved an average F1‑score improvement of roughly 8% over purely rule‑based or purely deep‑learning baselines, reflecting the complementary strengths of deterministic patterns and contextual language understanding. For gene‑process associations, the authors used Gene Ontology annotations as a gold standard; existing literature‑mining tools typically score near zero on this task, whereas GeneNetMiner attained an F1‑score of 0.62, demonstrating its capacity to capture functional relationships that are otherwise under‑represented in interaction databases.
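For readers unfamiliar with the metric, the sketch below shows how an F1‑score can be computed by comparing predicted edges against a gold‑standard edge set. The edge sets are made-up placeholders, and exact-match comparison of (source, target) pairs is an assumption; the summary does not say how predicted relations were aligned with BioGRID, STRING, or Gene Ontology entries.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of (source, target) edges by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder edges (assumption), not data from the evaluation.
predicted_edges = {("A", "B"), ("A", "C"), ("B", "D")}
gold_edges = {("A", "B"), ("B", "D"), ("C", "D"), ("D", "E")}
precision, recall, f1 = precision_recall_f1(predicted_edges, gold_edges)
```

With these placeholder sets, two of the three predicted edges are correct and two of the four gold edges are recovered, giving precision 2/3, recall 1/2, and F1 = 4/7.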

A case study focusing on the process “autophagy” illustrated practical utility. Querying the system returned both well‑known regulators (e.g., ATG5, BECN1) and a recently published (2023) finding linking ATG5 to mTOR signaling, an association not yet indexed in major interaction repositories. The system also displayed the supporting sentence, enabling rapid manual verification.

Limitations are acknowledged. iHOP provides only sentence‑level extracts, so relations expressed across multiple sentences or in full‑text sections may be missed. The rule component, while precise, is static and may not capture novel linguistic constructions without periodic updates. Confidence scores reflect model probabilities rather than experimental validation, so high‑scoring edges should still be corroborated experimentally, especially in critical biomedical applications.

Future work will incorporate full‑text mining from open‑access repositories, expand the rule library through automated pattern discovery, and integrate orthogonal evidence (e.g., ChIP‑seq peaks, expression co‑variation) to refine confidence estimates.

In summary, GeneNetMiner delivers a powerful, user‑friendly pipeline that merges traditional pattern‑based extraction with state‑of‑the‑art deep learning to generate comprehensive gene regulatory networks enriched with gene‑process links. By providing both automated scoring and transparent sentence evidence, it empowers researchers to rapidly assemble hypothesis‑driven networks for disease modeling, target identification, and systems‑level exploration of biological pathways.