INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies
Motivation: Predicting gene-disease associations (GDAs) is the problem of determining which genes are associated with a disease. GDA prediction can be framed as a ranking problem in which genes are ranked for a query disease based on features such as phenotypic similarity. When phenotypes are described using phenotype ontologies, ontology-based semantic similarity measures can be used. However, traditional semantic similarity measures use only the ontology taxonomy. Recent methods based on ontology embeddings compare phenotypes in latent space; these methods can use all ontology axioms as well as a supervised signal, but they are inherently transductive, i.e., query entities must already be known at the time the embeddings are learned. These methods therefore do not generalize to novel diseases (sets of phenotypes) at inference time.
Results: We developed INDIGENA, an inductive disease-gene association method for ranking genes based on a set of phenotypes. Our method first uses a graph projection to map axioms from phenotype ontologies to a graph structure, and then uses graph embeddings to create latent representations of phenotypes. We use an explicit aggregation strategy to combine phenotype embeddings into representations of genes or diseases, allowing us to generalize to novel sets of phenotypes. We also develop a method to make the phenotype embeddings and the similarity measure task-specific by including a supervised signal from known gene-disease associations. We apply our method to mouse models of human disease and demonstrate that we can significantly improve over the inductive semantic similarity baseline measures, and reach a performance similar to transductive methods for predicting gene-disease associations while being more general.
Availability and Implementation: https://github.com/bio-ontology-research-group/indigena
Research Summary
The paper introduces INDIGENA, an inductive framework for predicting disease‑gene associations (GDAs) based on phenotype ontologies. Traditional GDA prediction relies on semantic similarity measures such as Resnik or Lin, which only exploit the hierarchical structure of ontologies and are hand‑crafted, limiting their adaptability. Recent ontology‑embedding approaches (e.g., Onto2Vec, DL2Vec) embed phenotypes in a latent space but are inherently transductive: they require all query entities (diseases) to be present during training, preventing generalisation to novel disease phenotypes.
INDIGENA overcomes this limitation through three key steps. First, it projects the UPheno cross‑species phenotype ontology, together with gene‑phenotype, disease‑phenotype, and gene‑disease annotations, into a unified knowledge graph. Each annotation is encoded as an OWL axiom (e.g., Gene ⊑ ∃has_phenotype.Phenotype) and then transformed into graph edges using the projection rules of OWL2Vec*. This yields four progressively richer graphs: (1) ontology only, (2) + gene‑phenotype links, (3) + disease‑phenotype links, and (4) + known gene‑disease links (the supervised signal).
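The projection step described above can be illustrated with a minimal sketch. It assumes a deliberately simplified rule set in the spirit of OWL2Vec*, not the actual OWL2Vec* implementation: an axiom A ⊑ B becomes a `subClassOf` edge, and A ⊑ ∃R.B becomes an edge labelled R. The `Axiom` type, the function names, the `associated_with` relation, and all identifiers are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Axiom(NamedTuple):
    """Simplified axiom: sub ⊑ sup, or sub ⊑ ∃relation.sup when relation is set."""
    sub: str
    sup: str
    relation: Optional[str] = None


def project(axioms):
    """Map each axiom to one (head, relation, tail) edge."""
    edges = set()
    for ax in axioms:
        rel = ax.relation if ax.relation is not None else "subClassOf"
        edges.add((ax.sub, rel, ax.sup))
    return edges


def build_graphs(ontology, gene_pheno, disease_pheno, gene_disease):
    """Return the four progressively richer edge sets (graphs 1 to 4)."""
    g1 = project(ontology)
    g2 = g1 | project(gene_pheno)
    g3 = g2 | project(disease_pheno)
    g4 = g3 | project(gene_disease)  # adds the supervised signal
    return g1, g2, g3, g4


# Toy inputs; identifiers are illustrative, not taken from the paper's data.
ontology = [Axiom("PHENO:2", "PHENO:1")]
gene_pheno = [Axiom("GENE:a", "PHENO:2", "has_phenotype")]
disease_pheno = [Axiom("DISEASE:x", "PHENO:2", "has_phenotype")]
gene_disease = [Axiom("GENE:a", "DISEASE:x", "associated_with")]

graphs = build_graphs(ontology, gene_pheno, disease_pheno, gene_disease)
for i, g in enumerate(graphs, start=1):
    print(f"graph {i}: {len(g)} edges")
```

Building each graph as a superset of the previous one makes it easy to train the same embedding model on all four and compare what each annotation source contributes.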
Second, the authors train several knowledge‑graph embedding models (TransE, TransH, TransD, and ConvKB, with ConvKB‑D initialized from TransD) on these graphs. The embeddings capture both structural and relational patterns. Importantly, the supervised gene‑disease edges in graph 4 are used as a task‑specific signal, guiding the embeddings toward features useful for GDA prediction.
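As a rough illustration of how a translational model scores an edge, the sketch below implements the standard TransE score (the negative distance between head + relation and tail) and a margin ranking loss. The toy vectors, the margin values, and the choice of a corrupted tail as the negative example are assumptions made for illustration; they are not the training configuration used by the authors.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    dist = math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))
    return -dist


def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin ranking loss: zero once the true edge beats the corrupted edge by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# Toy 2-d embeddings (illustrative values only).
gene = [0.0, 1.0]
associated_with = [1.0, 0.0]
true_disease = [1.0, 1.0]       # gene + relation lands exactly here
corrupted_disease = [3.0, 1.0]  # negative example with a replaced tail

pos = transe_score(gene, associated_with, true_disease)
neg = transe_score(gene, associated_with, corrupted_disease)
print(pos, neg, margin_loss(pos, neg, margin=3.0))
```

When gene‑disease edges are present in the training graph, minimising this loss pulls the embeddings of associated genes and diseases into a consistent geometric relationship, which is one way to read the "task‑specific signal" mentioned above.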
Third, INDIGENA aggregates phenotype embeddings into a single vector for each gene or disease using an explicit, linear aggregation (e.g., averaging). Because aggregation is independent of the training set, a disease represented by a novel combination of phenotypes can be embedded at inference time without retraining, which is what makes the method inductive. Similarity between a query disease and a candidate gene is then computed as the cosine similarity of their aggregated vectors.
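The aggregation and ranking step can be sketched as follows, assuming mean pooling as the aggregator and a plain dictionary of already trained phenotype embeddings. The function names, phenotype and gene identifiers, and embedding values are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def rank_genes(query_phenotypes, gene_to_phenotypes, pheno_emb):
    """Rank genes by cosine similarity to the pooled query phenotype set."""
    query = mean_pool([pheno_emb[p] for p in query_phenotypes])
    scores = {
        gene: cosine(query, mean_pool([pheno_emb[p] for p in phenos]))
        for gene, phenos in gene_to_phenotypes.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy phenotype embeddings (illustrative values only).
pheno_emb = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
gene_to_phenotypes = {"GENE:a": ["P1", "P3"], "GENE:b": ["P2"]}

# A "novel disease" is just a phenotype set that was never seen as a unit.
ranking = rank_genes(["P1"], gene_to_phenotypes, pheno_emb)
print(ranking)
```

Only the phenotype embeddings are learned; the disease vector is computed on the fly, so any phenotype set whose members have embeddings can be used as a query.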
The evaluation uses mouse‑gene–phenotype data from MGI, human‑disease–phenotype data from HPO, and the UPheno ontology. A 10‑fold disease‑split cross‑validation ensures that test diseases are never seen during training. INDIGENA is compared against traditional semantic similarity baselines (Resnik‑BMA, Lin‑BMA, SimGIC) and against transductive embedding baselines (the same embedding models trained on graphs that already contain test diseases). Results show that INDIGENA consistently outperforms semantic similarity methods (≈15–20 % higher AUROC and precision@10) and matches or slightly exceeds the performance of transductive embeddings, while retaining the ability to handle unseen diseases. ConvKB‑D achieves the best results among the embedding variants, indicating that relation‑specific projections combined with convolutional processing capture complex phenotype semantics effectively.
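A disease‑level split of the kind described above can be sketched as follows. The helper names, the fixed seed, and the toy associations are assumptions for illustration; the sketch only shows how to guarantee that no test disease appears among the training associations, and does not reproduce the authors' evaluation code.

```python
import random


def disease_folds(diseases, k=10, seed=0):
    """Shuffle diseases deterministically and deal them into k disjoint folds."""
    shuffled = sorted(diseases)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def split_associations(associations, test_diseases):
    """Split (gene, disease) pairs so that test diseases never occur in training."""
    held_out = set(test_diseases)
    train = [(g, d) for g, d in associations if d not in held_out]
    test = [(g, d) for g, d in associations if d in held_out]
    return train, test


# Toy data: 20 diseases, each with one associated gene (illustrative only).
diseases = [f"DISEASE:{i}" for i in range(20)]
associations = [(f"GENE:{i}", f"DISEASE:{i}") for i in range(20)]

folds = disease_folds(diseases, k=10)
train, test = split_associations(associations, folds[0])
print(len(train), len(test))
```

Splitting by disease rather than by individual association is what makes the evaluation inductive: every query in the test fold is a disease the model has not seen during training.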
Key contributions are: (1) a generalizable pipeline that converts ontology axioms and external annotations into a knowledge graph; (2) incorporation of supervised GDA information into embedding learning for task‑specific representations; (3) an explicit aggregation strategy that enables inductive prediction for arbitrary phenotype sets. The authors discuss future extensions such as applying INDIGENA directly to patient‑level data, integrating variant‑prioritisation tools (e.g., Exomiser, EmbedPVP), and exploring graph neural network‑based aggregators.
In summary, INDIGENA bridges the gap between handcrafted semantic similarity and powerful embedding methods, delivering a scalable, inductive solution for disease‑gene association prediction that is especially valuable for rare or newly described diseases where prior knowledge is scarce.