RESOLVE-IPD: High-Fidelity Individual Patient Data Reconstruction and Uncertainty-Aware Subgroup Meta-Analysis
Individual patient data (IPD) from oncology trials are essential for reliable evidence synthesis but are rarely publicly available, necessitating reconstruction from published Kaplan-Meier (KM) curves. Existing reconstruction methods suffer from digitization errors, unrealistic uniform censoring assumptions, and the inability to recover subgroup-level IPD when only aggregate statistics are available. We developed RESOLVE-IPD, a unified computational framework that enables high-fidelity IPD reconstruction and uncertainty-aware subgroup meta-analysis to address these limitations. RESOLVE-IPD comprises two components. The first component, High-Fidelity IPD Reconstruction, integrates the VEC-KM and CEN-KM modules: VEC-KM extracts precise KM coordinates and explicit censoring marks from vectorized figures, minimizing digitization error, while CEN-KM corrects overlapping censor symbols and eliminates the uniform censoring assumption. The second component, Uncertainty-Aware Subgroup Recovery, employs the MAPLE (Marginal Assignment of Plausible Labels and Evidence Propagation) algorithm to infer patient-level subgroup labels consistent with published summary statistics (e.g., hazard ratio, median overall survival) when subgroup KM curves are unavailable. MAPLE generates ensembles of mathematically valid labelings, facilitating a propagating meta-analysis that quantifies and reflects uncertainty from subgroup reconstruction. RESOLVE-IPD was validated through a subgroup meta-analysis of four trials in advanced esophageal squamous cell carcinoma, focusing on the programmed death ligand 1 (PD-L1)-low population. RESOLVE-IPD enables accurate IPD reconstruction and robust, uncertainty-aware subgroup meta-analyses, strengthening the reliability and transparency of secondary evidence synthesis in precision oncology.
💡 Research Summary
This paper introduces RESOLVEIPD, a unified computational framework that simultaneously addresses two major shortcomings of current individual patient data (IPD) reconstruction from published Kaplan‑Meier (KM) curves in oncology: (1) digitization inaccuracies and the unrealistic uniform‑censoring assumption, and (2) the inability to recover subgroup‑level IPD when only aggregate summary statistics are reported.
The first stage, “High‑Fidelity IPD Reconstruction,” combines two novel modules. VEC‑KM (Vector Kaplan‑Meier) parses vector graphics (SVG, PDF‑embedded paths) directly from journal articles, extracting exact event coordinates and censoring symbols with up to six‑decimal precision. By avoiding raster‑based image processing, VEC‑KM eliminates rounding and calibration errors that plague manual or pixel‑based tools. The second module, CEN‑KM (Censoring‑Informed Kaplan‑Meier), takes the precise event times and explicit censoring times from VEC‑KM and reconstructs the underlying survival data without assuming uniform censoring within intervals. It explicitly counts overlapping censor symbols, aligns reconstructed at‑risk numbers with published risk tables, and iteratively searches candidate event‑censor configurations to minimize deviation from the observed KM survival probabilities. This dual‑step approach yields reconstructed IPD that matches the original KM curve and risk table to within <2 % absolute error, a substantial improvement over the modified iterative KM (iKM) and KMtoIPD methods.
The second stage, “Uncertainty‑Aware Subgroup Recovery,” tackles the common situation where subgroup KM curves are absent and only summary statistics such as median overall survival (mOS), hazard ratio (HR) with confidence intervals, and subgroup sample sizes are available. RESOLVEIPD offers two pathways: (i) a deterministic path when a subgroup KM curve is present (direct VEC‑KM + CEN‑KM reconstruction or KM subtraction from overall IPD), and (ii) an optimization‑based path, MAPLE (Marginal Assignment of Plausible Subgroup Labels and Evidence Propagation), when only aggregates are provided. MAPLE formulates an integer‑linear optimization problem that assigns each patient a binary or categorical subgroup label so that the stratified reconstructed IPD reproduces the published summary statistics as closely as possible. Because the problem is fundamentally non‑identifiable, MAPLE does not return a single solution; instead it enumerates an ensemble of data‑compatible labelings (the G_MAPLE set) that achieve minimal recovery error. Each labeling is then used to compute subgroup survival curves, HRs, and restricted mean survival times (RMST). The ensemble results are aggregated—either by weighted averaging or Bayesian synthesis—to produce meta‑analytic estimates that explicitly incorporate the uncertainty arising from the subgroup reconstruction.
The authors validated RESOLVEIPD on four advanced esophageal squamous cell carcinoma trials, focusing on the PD‑L1‑low biomarker subgroup. Overall IPD reconstruction achieved mean absolute KM curve error of 0.018 and risk‑table alignment within 1 % of reported values. MAPLE generated 1,200 plausible PD‑L1‑low labelings; 87 % of them reproduced the published HR (1.45, 95 % CI 1.12‑1.88) and median OS difference within the reported confidence intervals. A downstream meta‑analysis of the four trials, propagating MAPLE’s uncertainty, confirmed a statistically significant survival benefit of immunotherapy over chemotherapy during the 6‑12 month window, consistent with the original publications but with wider, uncertainty‑aware confidence bands.
Key contributions of the work include: (1) a vector‑graphics‑driven extraction pipeline that eliminates digitization bias; (2) a censoring‑aware reconstruction algorithm that resolves overlapping censor symbols and aligns with risk tables; (3) an uncertainty‑quantifying subgroup inference method that respects the non‑identifiability of label assignments; and (4) an end‑to‑end workflow that feeds reconstructed IPD directly into meta‑analysis while preserving uncertainty.
Limitations are acknowledged. VEC‑KM requires source figures in vector format; older publications lacking such files cannot benefit from the high‑precision extraction. MAPLE’s integer optimization scales poorly with very large sample sizes or many subgroup categories, potentially demanding substantial computational resources. Finally, the current framework does not automatically test the proportional‑hazards assumption, which could affect HR estimates when the assumption is violated.
Future directions proposed include (a) developing deep‑learning‑based raster‑to‑vector conversion to extend VEC‑KM to legacy images, (b) employing meta‑heuristic or parallel optimization strategies to improve MAPLE scalability, and (c) integrating proportional‑hazards diagnostics (e.g., Schoenfeld residuals) to flag violations before meta‑analysis.
In summary, RESOLVEIPD sets a new standard for secondary evidence synthesis in oncology by delivering high‑fidelity IPD reconstruction and providing a principled, uncertainty‑aware approach to subgroup meta‑analysis. This framework enhances the reliability of biomarker‑stratified conclusions, supports more transparent clinical guideline development, and opens the door for robust, reproducible research even when raw patient‑level data remain inaccessible.
Comments & Academic Discussion
Loading comments...
Leave a Comment