Longitudinal electronic health record (EHR) data offer opportunities to study biomarker trajectories; however, association estimates (the primary inferential target) from standard models designed for regular observation times may be biased by a two-stage hierarchical missingness mechanism. The first stage is the visiting process (informative presence), where encounters occur at irregular times driven by patient health status; the second is the observation process (informative observation), where biomarkers are selectively measured during visits. To address these mechanisms, we propose a unified semiparametric joint modeling framework that simultaneously characterizes the visiting, biomarker observation, and longitudinal outcome processes. Central to this framework is a shared subject-specific Gaussian latent variable that captures unmeasured frailty and induces dependence across all components. We develop a three-stage estimation procedure and establish the consistency and asymptotic normality of our estimators. We also introduce a sequential procedure that imputes missing biomarkers prior to adjusting for irregular visiting and examine its performance. Simulation results demonstrate that our method yields unbiased estimates under this mechanism, whereas existing approaches can be substantially biased; notably, methods adjusting only for irregular visiting may exhibit even greater bias than those ignoring both mechanisms. We apply our framework to data from the All of Us Research Program to investigate associations between neighborhood-level socioeconomic status indicators and six blood-based biomarker trajectories, providing a robust tool for outpatient settings where irregular monitoring and selective measurement are prevalent.
Longitudinal data collected in routine clinical care pose inferential challenges that differ fundamentally from those in prospective cohort studies or clinical trials. In designed studies, a protocol governs both the schedule of patient encounters and the panel of measurements obtained at each encounter. In routine care, neither is predetermined; instead, both the timing of a visit and whether a particular biomarker is measured arise from a complex interplay of disease severity, clinical judgment, and patient characteristics and behavior (Hripcsak and Albers, 2013; Goldstein et al., 2016). When these clinically driven mechanisms depend on the latent health trajectory under study, analyses that treat the data collection process as fixed or exogenous yield systematically biased estimates of exposure-biomarker associations (Lin et al., 2004; Pullenayegum and Lim, 2016). These issues are of growing importance as electronic health record (EHR) data are now widely used in large-scale biomedical initiatives such as the All of Us Research Program in the United States (All of Us Research Program Investigators, 2019) and the UK Biobank (Bycroft et al., 2018), where routinely collected laboratory measurements serve as longitudinal outcomes for studying associations with clinical, demographic, genetic, environmental, behavioral, and social determinants of health. A prominent example is the laboratory-wide association study (LabWAS; Goldstein et al., 2020; Dennis et al., 2021), in which repeated biomarker measurements are linked to genetic risk factors on a phenome-wide scale. Although LabWAS has yielded empirical insights, collapsing repeated measurements into simple summaries neglects within-subject variability and fails to account for the informative nature of the data collection process. Specifically, two clinically driven mechanisms (the timing of visits and the decision to measure a particular biomarker) distort the available data in ways that standard methods do not address.
These two mechanisms reflect a two-stage hierarchy inherent in EHR data collection (Figure 1). The first stage is the visiting process, which generates patient encounters at irregular, clinically driven times. When patients with greater disease burden visit more frequently, the resulting overrepresentation of adverse health states introduces a bias known as informative presence (IP; Lin et al., 2004; Sun et al., 2007). However, the direction of bias under IP is not definitive: limited access to care may reduce visit frequency among the sickest patients (Obermeyer et al., 2019), while heterogeneities in diagnostic testing rates may inflate encounter frequency (Ellenbogen et al., 2024), making the magnitude and sign of bias under IP difficult to determine a priori. The second stage is the observation process: conditional on a visit occurring, a clinician decides whether to measure the biomarker of interest (which serves as the longitudinal outcome in our modeling framework) based on the patient’s current symptoms, prior test results, and clinical judgment. When this decision depends on factors related to the health trajectory that are not fully captured by observed covariates, it gives rise to a second source of bias that we term informative observation (IO; Wells et al., 2013; Haneuse and Daniels, 2016). Consider patient 2 in the left panel of Figure 1 as an illustration of this hierarchy. At the first visit, denoted by an open circle, no laboratory measurement is taken, reflecting the absence of clinical indication. At the second visit, a measurement is observed (solid dot) in response to an adverse condition. Finally, at the third visit, a follow-up measurement is obtained to confirm recovery; notably, this decision is conditional on both the patient’s current condition and clinical history.
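The two-stage hierarchy described above can be illustrated with a small simulation. The following is a minimal sketch under assumed settings, not the estimation procedure proposed in this paper: a single standard normal latent variable `b` per subject raises the daily visit probability, the measurement probability given a visit, and the biomarker level. The logistic forms, the coefficients, and the function name `simulate_two_stage` are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_two_stage(n_subjects=2000, n_days=200, seed=0):
    """Toy simulation of informative presence and informative observation.

    All functional forms and coefficients are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Shared subject-level latent variable (hypothetical frailty), N(0, 1).
    b = rng.normal(0.0, 1.0, size=n_subjects)

    # Stage 1 (visiting process): daily visit probability increases with b.
    p_visit = 1.0 / (1.0 + np.exp(-(-3.0 + 0.8 * b)))
    visit = rng.random((n_subjects, n_days)) < p_visit[:, None]

    # Stage 2 (observation process): given a visit, the probability that
    # the biomarker is measured also increases with b.
    p_obs = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * b)))
    measured = visit & (rng.random((n_subjects, n_days)) < p_obs[:, None])

    # Outcome process: biomarker with population mean 0, shifted by b.
    y = b[:, None] + rng.normal(0.0, 1.0, size=(n_subjects, n_days))

    return {
        # Mean over all subject-days (known only because this is simulated).
        "true_mean": float(y.mean()),
        # Mean over visit days, as if every visit had a measurement.
        "mean_at_visits": float(y[visit].mean()),
        # Mean over the values that would actually be recorded.
        "mean_at_measurements": float(y[measured].mean()),
    }


result = simulate_two_stage()
```

In this toy setting the mean over visit days exceeds the population mean, and the mean over recorded measurements exceeds both, because each stage selects further on the same latent variable. Reversing the sign of a coefficient would change the direction of the distortion, consistent with the point that the sign of the bias is hard to determine a priori.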
In the All of Us data, biomarker measurements are observed only at a fraction of outpatient visits; for example, even among patients with at least one glucose record, the median per-patient measurement rate is only 9.1%. The extent of missingness varies substantially across biomarkers (Table 4). Moreover, this missingness is unlikely to be ignorable: the unmeasured factors driving both the frequency of visits and the decision to measure (including latent health status, access to care, and heterogeneity in clinical decision-making) are often correlated with the biomarker trajectory under study, so that both stages constitute a missing not at random (MNAR) mechanism that cannot be resolved by addressing either stage in isolation.
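As a concrete reading of the per-patient measurement rate mentioned above, the sketch below assumes the rate is the number of visits with a recorded value divided by the patient's total number of visits, and that the median is taken over patients with at least one recorded value. This definition, the function name `median_measurement_rate`, and the toy records are assumptions for illustration; they are not taken from the All of Us data.

```python
from statistics import median


def median_measurement_rate(records):
    """Median per-patient measurement rate.

    records: iterable of (patient_id, value) pairs, one pair per visit,
    where value is None when the biomarker was not measured at that visit.
    """
    visits, measured = {}, {}
    for pid, value in records:
        visits[pid] = visits.get(pid, 0) + 1
        measured[pid] = measured.get(pid, 0) + (value is not None)

    # Keep only patients with at least one recorded measurement.
    rates = [measured[p] / visits[p] for p in visits if measured[p] > 0]
    return median(rates)


# Hypothetical long-format records: one row per visit.
toy = [
    ("A", None), ("A", 5.4), ("A", None), ("A", None),  # 1 of 4 visits
    ("B", 6.1), ("B", None),                            # 1 of 2 visits
    ("C", None), ("C", None),                           # never measured
    ("D", 5.0), ("D", 5.2), ("D", None), ("D", 5.1),    # 3 of 4 visits
]
rate = median_measurement_rate(toy)
```

Patient C is excluded because no value was ever recorded, mirroring the restriction to patients with at least one record.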
Figure 1: Left panel: Patient visits are generated by the visiting process driven by covariates $X^V_i$. At each visit, the observation process (driven by $X^O_i(t)$) determines whether the biomarker outcome $Y_i(t)$ is measured (solid dots, $R^Y_i(t) = 1$) or unmeasured (hollow circles, $R^Y_i(t) = 0$), while the underlying longitudinal biomarker trajectory is driven by $X^Y_i(t)$. Right panel: The resulting long-format dataset used for analysis, where “NA” [...]