Demand-Driven Clustering in Relational Domains for Predicting Adverse Drug Events
Learning from electronic medical records (EMR) is challenging due to their relational nature and the uncertain dependence between a patient's past and future health status. Statistical relational learning is a natural fit for analyzing EMRs but is less adept at handling their inherent latent structure, such as connections between related medications or diseases. One way to capture the latent structure is via a relational clustering of objects. We propose a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during learning. We evaluate our algorithm on three real-world tasks where the goal is to use EMRs to predict whether a patient will have an adverse reaction to a medication. We find that our approach is more accurate than performing no clustering, pre-clustering, or using expert-constructed medical hierarchies.
💡 Research Summary
The paper tackles the problem of learning predictive models from electronic medical records (EMRs), which are inherently relational and contain latent structures such as hidden connections among drugs, diseases, and patients. Traditional statistical relational learning (SRL) methods can model the explicit relational graph but struggle to capture these latent groupings, especially when the data are sparse and the relationships are complex. Existing approaches typically pre‑cluster objects using unsupervised algorithms (e.g., K‑means) or rely on expert‑crafted medical hierarchies (e.g., ATC or ICD taxonomies). However, pre‑clustering requires manual selection of the number of clusters, depends heavily on domain knowledge, and must be recomputed whenever new entities appear.
To overcome these limitations, the authors propose a demand‑driven clustering (DDC) mechanism that integrates clustering directly into the learning loop. The key idea is to let the model decide, on the fly, which objects should share parameters based on the current prediction error. The algorithm proceeds in four stages: (1) Relation Exploration – identify the most influential relational predicates for the target task (e.g., patient‑drug, drug‑diagnosis links); (2) Dynamic Cluster Creation – assign the objects involved in those predicates to temporary clusters, allowing them to share a common set of parameters; (3) Joint Optimization – minimize a composite loss consisting of the standard predictive loss (cross‑entropy) plus a regularization term that penalizes excessive cluster complexity (controlled by a hyper‑parameter λ); (4) Merge‑Split Operations – periodically evaluate cluster similarity using a Bayesian Information Criterion‑like score, merging clusters that are too similar and splitting those that become internally heterogeneous.
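The four stages above can be sketched in code. The following is a minimal, hypothetical illustration of stages (3) and (4) only: the composite loss with the λ cluster-complexity penalty, and a BIC-like merge test over per-cluster parameters. All names (`composite_loss`, `bic_like_score`, `merge_split`) and the one-parameter-per-cluster simplification are our own assumptions, not the authors' implementation.

```python
import math

LAMBDA = 0.1  # regularization weight on cluster complexity (the paper's λ)

def cross_entropy(y_true, p_pred):
    """Standard binary cross-entropy over a batch of predictions."""
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, p_pred)) / len(y_true)

def composite_loss(y_true, p_pred, clusters):
    """Predictive loss plus a penalty on the number of distinct clusters."""
    return cross_entropy(y_true, p_pred) + LAMBDA * len(set(clusters.values()))

def bic_like_score(theta_a, theta_b, n):
    """Toy BIC-style test: negative means merging two clusters (one shared
    parameter, some fit error) beats keeping them separate (two parameters,
    zero fit error). n is the number of training examples."""
    merged = (theta_a + theta_b) / 2
    sse_merged = (theta_a - merged) ** 2 + (theta_b - merged) ** 2
    return (n * sse_merged + math.log(n)) - (0.0 + 2 * math.log(n))

def merge_split(clusters, params, n):
    """Periodically merge clusters whose parameters are too similar.
    clusters: object -> cluster id; params: cluster id -> parameter."""
    ids = sorted(set(clusters.values()))
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a not in params or b not in params:
                continue  # b was already absorbed into an earlier cluster
            if bic_like_score(params[a], params[b], n) < 0:
                params[a] = (params[a] + params[b]) / 2
                for obj in clusters:
                    if clusters[obj] == b:
                        clusters[obj] = a
                params.pop(b)
    return clusters, params
```

For example, two antibiotic clusters with near-identical ADE parameters (0.90 and 0.92) would be merged, while an antidepressant cluster at 0.10 would stay separate. A split step would run the same test in reverse on internally heterogeneous clusters; it is omitted here for brevity.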
The implementation is built on top of ProbLog, a probabilistic logic programming framework. A new logical atom cluster(Object, ClusterID) encodes the current grouping, and stochastic gradient descent updates both the SRL parameters and the cluster assignments in an alternating fashion. Because clustering is performed only when the model signals high error, the overhead is limited to early training epochs; later epochs focus on fine-tuning the shared parameters.
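To make the parameter-sharing idea concrete, here is a hedged sketch, not the paper's ProbLog code: the cluster(Object, ClusterID) atom is encoded as a plain dict, and drugs in the same cluster share one logistic-regression weight that receives pooled SGD updates. The drug names, cluster IDs, and toy model are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# cluster(Object, ClusterID) encoded as a dict; assignments are illustrative
cluster = {"gentamicin": "c1", "vancomycin": "c1", "amoxicillin": "c2"}
weights = {"c1": 0.0, "c2": 0.0}  # one shared parameter per cluster
bias = 0.0

def predict(drugs_taken):
    """P(ADE) from the shared per-cluster weights of the drugs taken."""
    z = bias + sum(weights[cluster[d]] for d in drugs_taken)
    return sigmoid(z)

def sgd_step(drugs_taken, y, lr=0.1):
    """One SGD update; every drug in a cluster updates the same weight."""
    global bias
    p = predict(drugs_taken)
    g = p - y  # gradient of cross-entropy w.r.t. the logit
    bias -= lr * g
    for d in drugs_taken:
        weights[cluster[d]] -= lr * g  # shared parameter pools the updates

# toy training data: c1 drugs associated with an ADE, the c2 drug not
data = [(["gentamicin"], 1), (["vancomycin"], 1), (["amoxicillin"], 0)]
for _ in range(200):
    for drugs, label in data:
        sgd_step(drugs, label)
```

The point of the sketch is the statistical-strength sharing: gentamicin and vancomycin jointly train a single parameter, so evidence about one transfers to the other, which is what makes clustering pay off on sparse EMR data.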
The method is evaluated on three real‑world adverse drug event (ADE) prediction tasks derived from a large hospital EMR system (2015‑2020): (1) acute kidney injury after antibiotic use, (2) gastrointestinal bleeding after antiplatelet therapy, and (3) suicide risk after antidepressant prescription. Each dataset contains several thousand patients and tens of thousands of prescription, diagnosis, and laboratory records, with a pronounced class imbalance. The authors compare four configurations: (i) plain SRL with no clustering, (ii) SRL preceded by a static K‑means clustering of objects, (iii) SRL using expert‑defined medical hierarchies (ATC for drugs, ICD for diseases), and (iv) the proposed DDC‑SRL. Evaluation metrics include accuracy, precision, recall, F1‑score, and AUC‑ROC.
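With the pronounced class imbalance noted above, threshold-free metrics such as AUC-ROC matter most. As a minimal reference, the five reported metrics can be computed in plain Python (a library such as scikit-learn would normally be used; the function name and 0.5 threshold are our choices):

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall, F1 at a fixed threshold, plus AUC-ROC."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    # AUC-ROC: probability a random positive outranks a random negative
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    auc = sum((p > n) + 0.5 * (p == n) for p in pos
              for n in neg) / (len(pos) * len(neg))
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc_roc": auc}
```

The rank-based AUC computation is insensitive to the decision threshold, which is why it is the headline metric for imbalanced ADE prediction.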
Results show that DDC‑SRL consistently outperforms the baselines across all metrics. The average AUC‑ROC improvement ranges from 3 to 7 percentage points, with the largest gains observed in the bleeding and suicide‑risk tasks where relational patterns are especially sparse. Moreover, DDC‑SRL automatically discovers clinically plausible clusters (e.g., groups of nephrotoxic antibiotics or serotonergic antidepressants) without any prior medical taxonomy. Training time increases modestly (≈12 % relative to the no‑clustering baseline) due to the additional merge‑split operations, but the performance boost justifies the cost.
The authors discuss several strengths of their approach: it eliminates the need for manual hierarchy design, adapts seamlessly to new drugs or diagnoses, and improves model robustness by sharing statistical strength among similar entities. Limitations include sensitivity to the regularization weight λ, potential memory overhead for very large graphs, and the current focus on binary ADE outcomes. Future work will explore extensions to multi‑label and continuous risk scores, automatic Bayesian tuning of the clustering penalty, and online learning scenarios where EMR data arrive as a stream.
In conclusion, demand‑driven clustering provides a principled and effective way to incorporate latent relational structure into SRL models for EMR analytics. By coupling clustering decisions with predictive error signals, the method achieves higher accuracy than traditional pre‑clustering or expert‑crafted hierarchies, paving the way for more reliable clinical decision support systems that can adapt to the ever‑evolving landscape of medical data.