A variational Bayes latent class approach for EHR-based patient phenotyping in R
The VBphenoR package for R provides a closed-form variational Bayes approach to patient phenotyping using Electronic Health Records (EHR) data. We implement a variational Bayes Gaussian Mixture Model (GMM) algorithm using closed-form coordinate ascent variational inference (CAVI) to determine the patient phenotype latent class. We then implement a variational Bayes logistic regression to determine the probability of the phenotype in the supplied EHR cohort, estimate the shift in biomarkers for patients with the phenotype of interest versus a healthy population, and evaluate the predictive performance of binary indicator clinical codes and medication codes. The logistic model likelihood conditions on the latent class obtained from the GMM step.
💡 Research Summary
The paper introduces VBphenoR, an R package that implements a fully closed-form variational Bayes (VB) framework for patient phenotyping using electronic health records (EHR). The authors address two major limitations of existing Bayesian latent class approaches: the computational burden of Markov chain Monte Carlo (MCMC) and the instability of automatic differentiation variational inference (ADVI) when applied to large, heterogeneous clinical data. Their solution is a two-stage pipeline: a variational Gaussian mixture model (GMM) discovers the latent class, and a Bayesian logistic regression then quantifies biomarker shifts and evaluates the predictive performance of binary clinical codes and medication codes.
Stage 1 – Variational GMM:
Patient covariates (e.g., age, sex, ethnicity, BMI) are modeled with a multivariate Gaussian mixture. The mixing proportions π receive a Dirichlet prior with hyper-parameter α, while each component's mean μ and precision Λ are governed by a Normal-Wishart prior (parameters m, λ, W, ν). Using coordinate ascent variational inference (CAVI), the posterior factors q(π) and q(μ,Λ) are updated in closed form, so no stochastic gradients are needed. The authors emphasize the role of α: a small α lets the data dominate, whereas a large α lets the prior dictate the number and size of clusters. For highly imbalanced data (e.g., rare diseases), they recommend initializing the mixture with DBSCAN rather than k-means, because DBSCAN yields robust seeds for the minor component. Convergence is monitored via the evidence lower bound (ELBO): iterations stop when the ELBO begins to decrease, which the authors use as a guard against over-fitting.
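The influence of α can be illustrated with the standard posterior mean of the mixing proportions under a symmetric Dirichlet prior, E[π_k] = (α + N_k) / (Kα + N), where N_k is the effective number of patients assigned to component k. The sketch below is a minimal illustration only: the function name and the counts are hypothetical and are not taken from VBphenoR or from the paper's data.

```python
def expected_mixing_weights(alpha, counts):
    """Posterior mean of mixing proportions under a symmetric Dirichlet(alpha) prior.

    `counts` holds the effective number of observations per component
    (the sums of the responsibilities in a variational GMM).
    """
    k = len(counts)
    total = sum(counts)
    return [(alpha + n_k) / (k * alpha + total) for n_k in counts]


# Hypothetical effective counts: 997 "healthy" and 3 "diseased" patients.
counts = [997.0, 3.0]

for alpha in (0.001, 1.0, 1000.0):
    weights = expected_mixing_weights(alpha, counts)
    print(f"alpha={alpha}: " + ", ".join(f"{w:.4f}" for w in weights))
```

With a very small α the rare component keeps a weight close to its empirical share, while a very large α pulls both weights toward equal proportions regardless of the data.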
Stage 2 – Bayesian Logistic Regression:
The latent binary indicator Dᵢ obtained from the GMM is treated as fixed in a second variational model. The regression coefficients for biomarker availability (β_R), biomarker values (β_Y), clinical codes (β_W), and medication codes (β_P) each receive multivariate normal priors. Importantly, β_Y captures the mean shift of each biomarker between the Dᵢ = 1 (diseased) and Dᵢ = 0 (healthy) groups, allowing direct clinical interpretation. Priors for β are centered on healthy-population means, which guides posterior estimates toward clinically plausible values. The inverse-logit (expit) function g(x) = exp(x)/(1+exp(x)) is used throughout, and the model yields closed-form variational updates for all coefficient posteriors. Sensitivity and specificity of each binary code can be expressed analytically as functions of the corresponding β coefficients (e.g., sensitivity = expit(β_W0 + β_W1)).
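As a small worked example of that last point, the sketch below maps a pair of code coefficients to sensitivity and specificity. The sensitivity formula is the one quoted above. The specificity formula, 1 − expit(β_W0), is an assumption that follows if β_W0 is the intercept for patients with Dᵢ = 0. The coefficient values and function names are hypothetical.

```python
import math


def expit(x):
    """Inverse-logit: maps a real-valued linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def code_accuracy(beta_w0, beta_w1):
    """Sensitivity and specificity of a binary clinical code.

    beta_w0: intercept, the log-odds of the code being present when D = 0.
    beta_w1: change in log-odds of the code being present when D = 1.
    """
    sensitivity = expit(beta_w0 + beta_w1)  # P(code = 1 | D = 1)
    specificity = 1.0 - expit(beta_w0)      # P(code = 0 | D = 0), assumed form
    return sensitivity, specificity


# Hypothetical coefficients, chosen only for illustration.
sens, spec = code_accuracy(beta_w0=-3.0, beta_w1=5.0)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```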
Implementation Details:
The package provides a single high‑level function run_Model() that orchestrates the GMM and logistic steps. Users supply the list of biomarkers, the matrix of covariates for the GMM, initialization parameters (e.g., DBSCAN eps and minPts), and prior specifications for both stages. The output includes posterior means of biomarker shifts, logistic coefficients, ELBO trajectories, and diagnostic plots. All computations are performed in pure R without reliance on external C++ or Python libraries, ensuring reproducibility and ease of installation from CRAN.
Empirical Evaluation:
Two datasets illustrate the methodology.
- Sickle Cell Disease (SCD) Cohort: The authors construct a synthetic SCD cohort with a realistic prevalence of 0.3 %. Using DBSCAN for initialization and a very low α (0.001), the variational GMM isolates the rare disease component despite extreme class imbalance. The subsequent logistic regression estimates a CBC shift of –7.93 g/dL and an RC (reticulocyte count) shift of +3.67 %, matching known hematologic signatures of SCD.
- Faithful Dataset: This classic dataset (Old Faithful geyser eruption durations and waiting times) serves to demonstrate the effect of α on cluster granularity. With α set low, medium, and high, the number of inferred components varies from a data-driven sparse solution to a more evenly split configuration, confirming the theoretical expectations about Dirichlet concentration.
Performance metrics (ROC curves, AUC, sensitivity, specificity) show that VBphenoR's logistic stage outperforms existing phenotyping tools such as PheNorm, particularly when informative priors are incorporated.
Discussion and Future Work:
The authors argue that closed‑form VB provides a pragmatic balance between computational scalability and statistical rigor, making it suitable for nationwide EHR repositories containing hundreds of thousands of patients. They acknowledge that prior specification is critical; clinical expertise should guide the choice of α, prior means, and covariance structures to avoid bias. Planned extensions include multi‑class latent structures, temporal dynamics (e.g., hidden Markov models), and integration of unstructured text (clinical notes) via embeddings.
In summary, VBphenoR delivers a fast, interpretable, and fully Bayesian solution for patient phenotyping in R, filling a gap in the current ecosystem of EHR analysis tools and offering a solid foundation for future methodological enhancements.