Bayesian Profile Regression using Variational Inference to Identify Clusters of Multiple Long-Term Conditions Conditioning on Mortality in Population-Scale Data


Multiple long-term conditions (MLTC) are increasingly observed in clinical practice globally. Clustering methods that group diseases into commonly co-occurring clusters have attracted interest as a way to understand how MLTC group together and how they affect patient outcomes. However, such approaches require large, often population-scale datasets. Bayesian Profile Regression (BPR) is a statistical model that combines a Dirichlet Process Mixture model with a hierarchical regression model in order to form clusters of items conditional on covariates and an outcome of interest. We developed a BPR model using full-rank Stochastic Variational Inference (SVI) for application to large-scale data. We assessed its performance in simulation studies comparing fits obtained with the No-U-Turn Sampler (NUTS) and with full-rank SVI. We then fit a BPR model to find clusters of MLTC in population-scale data held in the Secure Anonymised Information Linkage (SAIL) databank. Results from full-rank SVI compared well with results from NUTS in the simulation study, and the improved fitting performance allowed models to be fitted to population-scale datasets. There were 1,296,463 individuals in our electronic health record (EHR) cohort. The clustering model was conditioned on age at cohort entry, socioeconomic deprivation and sex, with mortality as the outcome. We used the Elixhauser comorbidity index disease definitions and found 33 disease clusters. Clusters featuring metastatic cancer and cardiovascular diseases, such as congestive heart failure, were most strongly associated with the probability of mortality. Our findings show that SVI can be a useful and accurate method for fitting Bayesian models, especially when the dataset size would make Monte Carlo methods prohibitively time-consuming or impossible.


💡 Research Summary

This paper presents a scalable Bayesian Profile Regression (BPR) framework that integrates a Dirichlet Process Mixture Model (DPMM) with a hierarchical generalized linear mixed-effects model to identify clusters of multiple long-term conditions (MLTC) while conditioning on an outcome of interest (here, mortality). Traditional clustering approaches often separate the discovery of disease groups from outcome analysis, which can miss rare but high-risk clusters. BPR overcomes this limitation by jointly modelling disease co-occurrence and the binary mortality outcome, with cluster-specific intercepts and covariate effects (age, sex, socioeconomic deprivation) that are shared across clusters.
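As a rough illustration of this joint structure, the model can be written as a finite mixture truncated at C components. The notation here is an assumption made for exposition and is not taken from the paper: z_i is the cluster label of individual i, x_ij is the binary indicator for disease j, w_i is the covariate vector, θ_cj is the probability of disease j in cluster c, and α_c is the cluster-specific intercept. The Dirichlet-process (stick-breaking) prior on ϕ and the priors on the other parameters are omitted.

```latex
\begin{align*}
z_i \mid \phi &\sim \mathrm{Categorical}(\phi_1, \dots, \phi_C), \\
x_{ij} \mid z_i = c &\sim \mathrm{Bernoulli}(\theta_{cj}), \qquad j = 1, \dots, J, \\
y_i \mid z_i = c &\sim \mathrm{Bernoulli}\!\left(\operatorname{logit}^{-1}\!\left(\alpha_c + w_i^{\top}\beta\right)\right), \\
p(x_i, y_i \mid \phi, \theta, \alpha, \beta)
  &= \sum_{c=1}^{C} \phi_c
     \left[\prod_{j=1}^{J} \theta_{cj}^{x_{ij}} (1 - \theta_{cj})^{1 - x_{ij}}\right]
     p(y_i \mid z_i = c).
\end{align*}
```

The last line sums over the label z_i, so the outcome term reweights each cluster's contribution: a cluster is favoured for an individual only if it explains both the disease profile and the observed mortality outcome.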

The major methodological contribution is the implementation of full-rank Stochastic Variational Inference (SVI) for fitting the BPR model on population-scale data. The No-U-Turn Sampler (NUTS), a Hamiltonian Monte Carlo method, yields asymptotically exact posterior draws, but it becomes computationally infeasible for datasets with millions of rows and dozens of disease variables. SVI reframes inference as an optimization problem: it maximizes the Evidence Lower Bound (ELBO) over a family of approximating distributions. By adopting a multivariate normal variational family with a full covariance matrix (full-rank), the authors preserve correlations among all model parameters, mitigating the under-estimation of uncertainty typical of mean-field approximations. Discrete cluster allocation variables are analytically marginalised, and posterior predictive checks are performed after fitting. The SVI workflow uses mini-batches (≥10 % of the data), a learning rate of 0.01, and 30 ELBO samples per iteration, and it runs on GPU-accelerated JAX operations via NumPyro.
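To make the idea of analytically marginalising the discrete allocations concrete, the following is a minimal pure-Python sketch and not the authors' NumPyro implementation. It assumes binary disease indicators, a logistic outcome model with a cluster-specific intercept, and a fixed number of clusters; all function names, parameter names, and toy numbers are hypothetical. The cluster label is summed out with a log-sum-exp, which leaves a likelihood that depends only on continuous parameters and can therefore be optimised by gradient-based methods.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def marginal_log_lik(x, y, w, phi, theta, alpha, beta):
    """Log-likelihood of one individual with the cluster label summed out.

    x: 0/1 disease indicators; y: 0/1 outcome; w: covariate values.
    phi[c]: mixture weight; theta[c][j]: P(disease j | cluster c);
    alpha[c]: cluster-specific intercept; beta: shared coefficients.
    Returns (marginal log-likelihood, per-cluster joint log terms).
    """
    linear = sum(b * wk for b, wk in zip(beta, w))
    per_cluster = []
    for c in range(len(phi)):
        lp = math.log(phi[c])
        for xj, t in zip(x, theta[c]):
            lp += math.log(t) if xj else math.log1p(-t)
        eta = alpha[c] + linear
        lp += log_sigmoid(eta) if y else log_sigmoid(-eta)
        per_cluster.append(lp)
    return log_sum_exp(per_cluster), per_cluster


def responsibilities(per_cluster):
    """Posterior cluster probabilities for one individual, given parameters."""
    total = log_sum_exp(per_cluster)
    return [math.exp(lp - total) for lp in per_cluster]


# Toy example: two clusters, two diseases (illustrative numbers only).
phi = [0.5, 0.5]
theta = [[0.9, 0.1], [0.1, 0.9]]
alpha = [0.0, 0.0]
beta = [0.0]
log_lik, per_cluster = marginal_log_lik([1, 0], 1, [1.0], phi, theta, alpha, beta)
post = responsibilities(per_cluster)
print(log_lik, post)
```

Because the labels are summed out rather than sampled, cluster membership is recovered afterwards from the per-cluster terms, as `responsibilities` does here for a single individual.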

A simulation study validates the SVI approach. With 8,000 synthetic observations and five true clusters, the authors compare SVI to NUTS across 800 replicates. Both methods recover mixture probabilities (ϕ) and regression coefficients (β) with comparable bias; however, smaller ϕ values are harder to estimate for both methods. Coverage of 95 % credible intervals for β is below nominal (≈80–90 %) for SVI, reflecting its approximate nature, while NUTS achieves coverage closer to nominal. Scaling the simulation to 100,000 observations demonstrates that SVI maintains stable bias and coverage, whereas NUTS becomes intractable. Additional experiments show that batch sizes of at least 10 % of the dataset ensure convergence without sacrificing ELBO stability.
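The two summary metrics used in this comparison, bias and credible-interval coverage, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' evaluation code; the function names and toy numbers are hypothetical and are not results from the paper. Each replicate is assumed to supply a point estimate and a 95 % interval for a parameter whose true value is known.

```python
from statistics import mean


def bias(estimates, truth):
    """Average estimate across replicates minus the true parameter value."""
    return mean(estimates) - truth


def coverage(intervals, truth):
    """Fraction of replicates whose (lower, upper) interval contains the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


# Toy replicates for a single coefficient (illustrative numbers only).
true_beta = 1.0
point_estimates = [0.9, 1.1, 1.3, 0.7]
credible_intervals = [(0.5, 1.2), (0.8, 1.5), (1.1, 1.6), (0.2, 0.9)]

print(bias(point_estimates, true_beta))
print(coverage(credible_intervals, true_beta))
```

A well-calibrated 95 % interval should give coverage near 0.95 over many replicates; values below that indicate intervals that are too narrow, which is the usual failure mode of variational approximations.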

The authors then apply the model to a cohort of 1,296,463 individuals in Wales, drawn from electronic health records held in the Secure Anonymised Information Linkage (SAIL) databank. Covariates are age at cohort entry, sex, and deprivation quintile; the binary outcome is all-cause mortality. Using the Elixhauser comorbidity definitions, the BPR-SVI model finds 33 disease clusters. Clusters characterised by metastatic cancer and by cardiovascular conditions such as congestive heart failure show the strongest positive association with mortality, which matches clinical expectations and demonstrates the model's ability to surface high-risk phenotypes without pre-specifying them.

The discussion acknowledges the inherent approximation error of variational methods, particularly the tendency to underestimate posterior variance and the need for post‑hoc label‑switching correction. Nevertheless, the full‑rank variational family substantially improves uncertainty quantification relative to diagonal (mean‑field) approximations. The authors suggest future extensions such as hybrid MCMC‑SVI schemes, Stein variational gradient descent, or sparse variational techniques to further balance accuracy and scalability.

In conclusion, the study shows that full‑rank SVI provides a practical, accurate alternative to MCMC for Bayesian profile regression on massive health‑record datasets. It enables simultaneous discovery of disease clusters and assessment of their impact on mortality, opening avenues for risk stratification, resource allocation, and targeted interventions in public health. The open‑source implementation (GitHub) and detailed methodological exposition make the approach readily adoptable by researchers handling large‑scale electronic health records.

