Joint Modeling of Longitudinal EHR Data with Shared Random Effects for Informative Visiting and Observation Processes

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Longitudinal electronic health record (EHR) data offer opportunities to study biomarker trajectories; however, association estimates (the primary inferential target) from standard models designed for regular observation times may be biased by a two-stage hierarchical missingness mechanism. The first stage is the visiting process (informative presence), where encounters occur at irregular times driven by patient health status; the second is the observation process (informative observation), where biomarkers are selectively measured during visits. To address these mechanisms, we propose a unified semiparametric joint modeling framework that simultaneously characterizes the visiting, biomarker observation, and longitudinal outcome processes. Central to this framework is a shared subject-specific Gaussian latent variable that captures unmeasured frailty and induces dependence across all components. We develop a three-stage estimation procedure and establish the consistency and asymptotic normality of our estimators. We also introduce a sequential procedure that imputes missing biomarkers prior to adjusting for irregular visiting and examine its performance. Simulation results demonstrate that our method yields unbiased estimates under this mechanism, whereas existing approaches can be substantially biased; notably, methods adjusting only for irregular visiting may exhibit even greater bias than those ignoring both mechanisms. We apply our framework to data from the All of Us Research Program to investigate associations between neighborhood-level socioeconomic status indicators and six blood-based biomarker trajectories, providing a robust tool for outpatient settings where irregular monitoring and selective measurement are prevalent.

💡 Research Summary

The paper tackles a pervasive problem in longitudinal electronic health record (EHR) research: the data are generated only when patients attend clinical visits, and even then only a subset of biomarkers is measured. Consequently, two intertwined “informative missingness” mechanisms arise. The first, informative visiting, means that the timing of visits depends on the patient’s underlying health status. The second, informative observation, means that the decision to measure a particular biomarker during a visit is also driven by that health status. If either mechanism is ignored, standard mixed‑effects models that assume regular, non‑selective observation can produce severely biased estimates of the association between covariates (e.g., socioeconomic factors) and biomarker trajectories.

To address this, the authors propose a unified semiparametric joint modeling framework that simultaneously describes the visiting process, the observation (measurement) process, and the longitudinal outcome process. The cornerstone of the framework is a shared subject‑specific Gaussian latent variable (frailty) that captures unobserved health status and induces dependence across all three components.

Model structure

Visiting process – Modeled as a time‑dependent intensity λi(t)=ui·λ0(t)·exp(βV·Zi), where ui is the shared frailty, λ0(t) is a non‑parametric baseline hazard, and Zi are baseline covariates influencing visit frequency.
Observation process – Conditional on a visit, the probability that biomarker k is measured follows a logistic model logit(pik)=αk·Wi+γk·ui, with Wi representing visit‑specific factors (e.g., type of encounter) and the same frailty ui entering to capture health‑driven selection.
Longitudinal outcome – Measured biomarker values are modeled with a linear mixed‑effects formulation Yik(t)=β0k+β1k·t+bk·ui+εik(t), where t is the observation time, bk is a biomarker‑specific loading that scales the frailty's contribution (as written, it shifts the level of the trajectory), and εik(t) is independent error.

By embedding the same frailty in all three sub‑models, the framework explicitly links the unobserved health trajectory to when patients appear in the data, which biomarkers are recorded, and how those biomarkers evolve over time.
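The shared-frailty structure can be illustrated with a small data-generating sketch. This is not the authors' simulation design: the constant baseline rate, the use of exp(u) to keep the visit intensity positive, the replacement of the visit-level covariates Wi by a single intercept `alpha`, and every numeric default are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import numpy as np


def simulate_subject(rng, z, beta_v=0.5, alpha=-0.5, gamma=1.0,
                     beta0=1.0, beta1=0.2, b=0.8,
                     base_rate=2.0, tau=5.0, sigma_u=0.5, sigma_e=0.3):
    """Simulate visit times, measurement indicators and biomarker values
    for one subject and one biomarker (illustrative assumptions only)."""
    # Shared subject-specific Gaussian latent variable.
    u = rng.normal(0.0, sigma_u)
    # Visiting process: homogeneous Poisson process on [0, tau].
    # Assumption: the latent variable enters through exp(u) so the rate stays positive.
    rate = base_rate * np.exp(u + beta_v * z)
    n_visits = rng.poisson(rate * tau)
    times = np.sort(rng.uniform(0.0, tau, size=n_visits))
    # Observation process: logistic probability that the biomarker is measured at a visit.
    p_measure = 1.0 / (1.0 + np.exp(-(alpha + gamma * u)))
    measured = rng.random(n_visits) < p_measure
    # Longitudinal outcome: linear trajectory shifted by the same latent variable.
    y = beta0 + beta1 * times + b * u + rng.normal(0.0, sigma_e, size=n_visits)
    y = np.where(measured, y, np.nan)  # unmeasured values are missing
    return u, times, measured, y


def pooled_residual_mean(n_subjects=2000, seed=0, beta0=1.0, beta1=0.2):
    """Mean of y - (beta0 + beta1 * t) over all measured values, pooled across subjects.
    It would be close to zero if visiting and measurement were uninformative."""
    rng = np.random.default_rng(seed)
    resid = []
    for _ in range(n_subjects):
        _, times, measured, y = simulate_subject(rng, z=0.0, beta0=beta0, beta1=beta1)
        resid.append(y[measured] - beta0 - beta1 * times[measured])
    return float(np.mean(np.concatenate(resid)))
```

Under these assumed settings, subjects with a larger latent value both visit more often and are measured more often, so the pooled residual mean is pulled away from zero even though the latent variable itself has mean zero. That is the selection effect the joint model is meant to correct.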

Estimation strategy
The authors develop a three‑stage estimation algorithm:

Stage 1 – Partial likelihood for the visiting and observation sub‑models provides initial estimates of βV, αk, and γk.
Stage 2 – An EM algorithm treats the frailties as latent variables; the E‑step computes their conditional expectations given current parameter values, and the M‑step updates the longitudinal parameters (β0k, β1k, bk) and variance components.
Stage 3 – A sandwich variance estimator yields robust standard errors and confidence intervals.

They prove that, under regularity conditions, the estimators are consistent and asymptotically normal. The paper also describes a sequential imputation procedure that first imputes missing biomarkers (e.g., via multiple imputation) and then adjusts only for irregular visiting. Simulations show that this two‑step approach reduces bias relative to naïve models but remains inferior to the fully joint method.
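To make the E‑step concrete, the conditional expectation of a subject's frailty can be approximated numerically. The sketch below uses Gauss-Hermite quadrature for a single subject and a single biomarker; it is an assumption-laden illustration rather than the paper's algorithm. In particular it ignores the contribution of the visit times, uses a single intercept `alpha` in place of αk·Wi, and the function name and arguments are hypothetical.

```python
import numpy as np


def posterior_mean_frailty(measured, resid, alpha, gamma, b,
                           sigma_u, sigma_e, n_nodes=30):
    """Approximate E[u | data] for one subject by Gauss-Hermite quadrature.

    measured : boolean array, one entry per visit (was the biomarker measured?)
    resid    : y - beta0 - beta1 * t at the measured visits
    Assumption: visit times are not used; only the measurement indicators
    and the measured values inform the latent variable.
    """
    measured = np.asarray(measured, dtype=bool)
    resid = np.asarray(resid, dtype=float)
    # Nodes and weights for integrals against exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables so the nodes represent draws from N(0, sigma_u^2).
    u = np.sqrt(2.0) * sigma_u * x
    # Log-likelihood of the measurement indicators at each node (logistic model).
    eta = alpha + gamma * u
    n_yes = int(measured.sum())
    n_no = measured.size - n_yes
    log_obs = -n_yes * np.logaddexp(0.0, -eta) - n_no * np.logaddexp(0.0, eta)
    # Gaussian log-likelihood of the measured values at each node (constants dropped).
    diff = resid[:, None] - b * u[None, :]
    log_out = -0.5 * np.sum(diff ** 2, axis=0) / sigma_e ** 2
    # Normalised posterior weights on the quadrature grid.
    log_post = np.log(w) + log_obs + log_out
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    return float(np.sum(post * u))
```

When `gamma` is zero the measurement indicators carry no information and the result should agree with the closed-form Gaussian posterior mean, which gives a simple sanity check on the quadrature.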

Simulation results
In a series of Monte‑Carlo experiments, the authors vary the strength of informative visiting and observation, the correlation between frailty and covariates, and the proportion of missing measurements. When both mechanisms are present, standard mixed‑effects models exhibit bias ranging from 20 % to 45 % in the estimated association between a covariate and biomarker slope. Models that adjust only for visiting sometimes amplify bias (up to 50 %) because they ignore the selection occurring at measurement time. The proposed joint model consistently reduces bias to under 2 % and improves mean‑squared error, demonstrating robustness across a wide range of scenarios.
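The bias and mean-squared-error figures quoted above are standard Monte Carlo summaries. A minimal sketch of how such summaries are typically computed from replicate estimates follows; the function name is hypothetical, and the true parameter value is assumed to be nonzero so that relative bias is defined.

```python
import numpy as np


def monte_carlo_summary(estimates, truth):
    """Summarise replicate estimates of a single parameter.

    estimates : one estimate per simulated data set
    truth     : the true parameter value used to generate the data (nonzero)
    """
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - truth
    variance = estimates.var()  # population variance across replicates (ddof=0)
    mse = np.mean((estimates - truth) ** 2)  # equals bias**2 + variance
    return {
        "bias": float(bias),
        "relative_bias_pct": float(100.0 * bias / truth),
        "variance": float(variance),
        "mse": float(mse),
    }
```

For example, `monte_carlo_summary([1.1, 0.9, 1.3], truth=1.0)` reports a relative bias of 10 percent, and its `mse` equals the squared bias plus the variance.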

Application to All of Us
The methodology is applied to data from the All of Us Research Program, encompassing roughly 150,000 adult participants. Six blood‑based biomarkers (CRP, HbA1c, LDL‑C, serum creatinine, vitamin D, triglycerides) are examined in relation to neighborhood‑level socioeconomic status (SES) indicators such as median income, educational attainment, and employment rate. Visits are irregular and driven by health events; biomarker measurement is selective (e.g., lipid panels are ordered more often for patients with known cardiovascular risk).

Using the joint model, the authors find that lower‑SES neighborhoods are associated with steeper upward trajectories of inflammatory markers (CRP) and triglycerides, even after fully accounting for the irregular visiting and selective measurement processes (p < 0.001). By contrast, conventional mixed‑effects models either attenuate these associations or, in the case of models adjusting only for visiting, sometimes reverse the direction of effect for certain biomarkers. These findings underscore the practical importance of jointly modeling both missingness mechanisms when drawing public‑health conclusions from EHR data.

Implications and limitations
The study makes several methodological contributions: (1) it formalizes a single framework for two informative‑missingness mechanisms that had previously been treated in isolation; (2) it demonstrates that a shared frailty can efficiently capture unobserved health status across disparate processes; (3) it provides a semiparametric approach that avoids strong parametric assumptions about visit timing or measurement probabilities; and (4) it validates the method on a large, real‑world cohort.

Limitations include the reliance on a Gaussian frailty assumption, which may be restrictive if the underlying health heterogeneity is skewed. The EM‑based algorithm, while statistically sound, can be computationally intensive for very large datasets, necessitating high‑performance computing resources. Future work could explore non‑Gaussian frailty distributions, variational inference for scalability, or Bayesian hierarchical extensions that incorporate prior information about visit patterns.

Conclusion
By integrating the visiting, observation, and longitudinal outcome processes through a shared random effect, the paper offers a robust solution to the bias introduced by irregular, health‑driven data collection in EHR studies. The framework enables researchers to obtain unbiased estimates of covariate effects on biomarker trajectories, thereby improving the validity of epidemiologic and health‑services research that relies on routinely collected clinical data.
