A Causal Machine Learning Framework for Treatment Personalization in Clinical Trials: Application to Ulcerative Colitis
Randomized controlled trials estimate average treatment effects, but treatment response heterogeneity motivates personalized approaches. A critical question is whether statistically detectable heterogeneity translates into improved treatment decisions – these are distinct questions that can yield contradictory answers. We present a modular causal machine learning framework that evaluates each question separately: permutation importance identifies which features predict heterogeneity, best linear predictor (BLP) testing assesses statistical significance, and doubly robust policy evaluation measures whether acting on the heterogeneity improves patient outcomes. We apply this framework to patient-level data from the UNIFI maintenance trial of ustekinumab in ulcerative colitis, comparing placebo, standard-dose ustekinumab every 12 weeks, and dose-intensified ustekinumab every 8 weeks, using cross-fitted X-learner models with baseline demographics, medication history, week-8 clinical scores, laboratory biomarkers, and video-derived endoscopic features. BLP testing identified strong associations between endoscopic features and treatment effect heterogeneity for ustekinumab versus placebo, yet doubly robust policy evaluation showed no improvement in expected remission from incorporating endoscopic features, and out-of-fold multi-arm evaluation showed worse performance. Diagnostic comparison of prognostic contribution against policy value revealed that endoscopic scores behaved as disease severity markers – improving outcome prediction in untreated patients but adding noise to treatment selection – while clinical variables (fecal calprotectin, age, CRP) captured the decision-relevant variation. These results demonstrate that causal machine learning applications to clinical trials should include policy-level evaluation alongside heterogeneity testing.
Research Summary
This paper introduces a modular causal‑machine‑learning pipeline designed to move from detecting treatment‑effect heterogeneity in randomized controlled trial (RCT) data to evaluating whether that heterogeneity can improve actual treatment decisions. The authors separate three distinct questions: (1) which baseline variables predict heterogeneity (permutation‑based feature importance), (2) whether the heterogeneity associated with a set of variables is statistically significant (Best Linear Predictor, BLP, testing), and (3) whether incorporating those variables into a decision rule yields higher expected patient outcomes (doubly robust policy evaluation).
The methodology is demonstrated on individual‑patient data from the UNIFI maintenance trial in ulcerative colitis, which randomized patients to placebo, ustekinumab every 12 weeks (Q12) or every 8 weeks (Q8). Baseline demographics, medication history, week‑8 clinical scores, laboratory biomarkers, and video‑derived endoscopic features (segmental Mayo scores and a cumulative disease score) were used as covariates.
For heterogeneity estimation the authors use the X‑learner. Separate outcome models μ₀(X), μ₁₂(X), μ₈(X) are fit with gradient‑boosted trees (XGBoost) using five‑fold cross‑fitting. Pseudo‑outcomes are then constructed as imputed individual effects (the observed outcome minus the predicted control outcome for treated patients, and the predicted treated outcome minus the observed outcome for controls) and regressed on the covariates to obtain conditional average treatment effects (CATEs). Because treatment assignment is randomized, constant propensity scores equal to the empirical arm probabilities are used.
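The X-learner steps described above can be sketched for a single treatment-versus-placebo contrast. This is a minimal illustration and not the authors' implementation: ordinary least squares stands in for XGBoost, cross-fitting is omitted for brevity, and the names `fit_linear` and `x_learner_cate` as well as the toy data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function.
    (Assumption: a simple stand-in for the gradient-boosted base learner.)"""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def x_learner_cate(X, y, t, e):
    """Two-arm X-learner with a constant propensity e = P(T = 1)."""
    treated, control = t == 1, t == 0
    # Stage 1: separate outcome models per arm.
    mu0 = fit_linear(X[control], y[control])
    mu1 = fit_linear(X[treated], y[treated])
    # Stage 2: pseudo-outcomes (imputed individual effects).
    d1 = y[treated] - mu0(X[treated])   # observed minus predicted control
    d0 = mu1(X[control]) - y[control]   # predicted treated minus observed
    tau1 = fit_linear(X[treated], d1)
    tau0 = fit_linear(X[control], d0)
    # Stage 3: blend the two CATE models using the propensity as weight.
    return e * tau0(X) + (1 - e) * tau1(X)

# Toy data (hypothetical): noise-free outcome with true effect 1 + 2 * x0.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
t = rng.integers(0, 2, size=400)
y = X[:, 1] + t * (1.0 + 2.0 * X[:, 0])
cate = x_learner_cate(X, y, t, e=t.mean())
```

Passing `e=t.mean()` mirrors the use of the empirical arm probability as a constant propensity score in a randomized trial. In a three-arm setting the same construction would be repeated for each active arm against placebo.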
Permutation importance is calculated on the second‑stage CATE model (or on a “gap” model for the multi‑arm case) by randomly shuffling each feature ten times and measuring the increase in mean‑squared error. The BLP test aggregates each feature group (endoscopic vs. clinical) into two summary statistics per patient (mean and standard deviation across the group) and regresses the estimated CATEs on these summaries. Inference is performed with a multiplier bootstrap (1,000 replications) and a Wald‑type statistic that sums the squared z‑scores of the group coefficients.
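The permutation step can be illustrated with a short sketch. It assumes a fitted second-stage model exposed as a `predict` function and a `target` vector (for example, the pseudo-outcomes); both names, the helper `permutation_importance`, and the toy model are hypothetical and not taken from the paper.

```python
import numpy as np

def permutation_importance(predict, X, target, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((predict(X) - target) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            # Break the link between feature j and the target.
            Xp[:, j] = rng.permutation(Xp[:, j])
            increases.append(np.mean((predict(Xp) - target) ** 2) - base_mse)
        importances[j] = np.mean(increases)
    return importances

# Toy model (hypothetical): the prediction depends only on feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
predict = lambda M: 3.0 * M[:, 0]
target = predict(X)
imp = permutation_importance(predict, X, target)
```

Here shuffling feature 1 leaves the error unchanged, while shuffling feature 0 raises it, which is the pattern the importance ranking relies on.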
The results show that endoscopic features dominate permutation importance and achieve a highly significant BLP statistic (z‑score sum = 8.28, p < 0.001), indicating that they explain a substantial portion of the estimated heterogeneity between ustekinumab and placebo. However, when the authors construct a treatment‑assignment policy that selects, for each patient, the arm with the highest predicted remission probability (π(X) = argmaxₜ μₜ(X)), doubly robust evaluation yields a 95% confidence interval for the gain in policy value of −1.6 to +6.6 percentage points relative to a baseline policy. The interval includes zero, so no improvement in expected remission is demonstrated. In out‑of‑fold multi‑arm validation, the policy that uses all features attains only 30.5% remission, compared with 36.8% for a policy that relies solely on clinical and laboratory variables.
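A doubly robust policy value can be sketched with the standard augmented inverse-probability-weighted (AIPW) score, using the known randomization probabilities. This is an illustration under stated assumptions and may differ from the estimator used in the paper; the function `dr_policy_value` and the toy arrays are hypothetical.

```python
import numpy as np

def dr_policy_value(y, t, mu_hat, policy, arm_probs):
    """AIPW estimate of the mean outcome under `policy`.

    y: outcomes (n,); t: assigned arm index (n,);
    mu_hat: predicted outcome for every arm (n, k);
    policy: recommended arm index (n,); arm_probs: P(T = arm) (k,).
    """
    idx = np.arange(len(y))
    direct = mu_hat[idx, policy]              # model-based prediction
    match = (t == policy).astype(float)       # patient received the recommended arm
    correction = match / arm_probs[t] * (y - mu_hat[idx, t])
    scores = direct + correction
    se = scores.std(ddof=1) / np.sqrt(len(y))
    return scores.mean(), se

# Toy check (hypothetical): three equally likely arms, only arm 2 yields
# remission, and the outcome model is deliberately wrong (all zeros).
n = 300
t = np.arange(n) % 3
y = (t == 2).astype(float)
mu_hat = np.zeros((n, 3))
arm_probs = np.full(3, 1 / 3)
policy = np.full(n, 2)                        # always recommend arm 2
value, se = dr_policy_value(y, t, mu_hat, policy, arm_probs)
```

Even with a misspecified outcome model, the weighting correction recovers the true value of the always-arm-2 policy in this toy case, which is the sense in which the estimator is doubly robust. For an argmax rule, `policy` would be `mu_hat.argmax(axis=1)`, ideally computed from out-of-fold predictions.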
A diagnostic comparison of “prognostic contribution” (incremental Brier‑score reduction) versus “policy contribution” (incremental policy value) reveals that endoscopic scores act as strong disease‑severity markers: they improve outcome prediction for untreated patients but add noise when deciding which treatment to give. By contrast, fecal calprotectin, age, and CRP retain predictive power across arms and improve the policy value.
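The prognostic side of this comparison rests on the Brier score, the mean squared difference between predicted probabilities and binary outcomes. A minimal sketch follows; the probabilities and outcomes are invented solely to illustrate the arithmetic and are not trial data.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

# Hypothetical predictions for four patients.
outcomes = [1, 0, 0, 1]
base_probs = [0.5, 0.5, 0.5, 0.5]   # model without the feature group
full_probs = [0.8, 0.2, 0.4, 0.6]   # model with the feature group added

# Incremental Brier-score reduction attributed to the added features.
prognostic_gain = brier_score(base_probs, outcomes) - brier_score(full_probs, outcomes)
```

A positive reduction indicates better outcome prediction. It says nothing by itself about whether the added features change which arm is best for a patient, which is why the policy contribution is assessed separately.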
The authors argue that causal‑machine‑learning applications to clinical trials must include policy‑level evaluation; statistical heterogeneity alone can be misleading. Their framework also demonstrates how to handle multi‑arm trials (nested cross‑validation, pseudo‑outcome gaps) and how to combine X‑learner CATE estimation with doubly robust policy evaluation. Limitations include reliance on a single trial, potential measurement error in manually scored endoscopic features, and lack of external validation.
In summary, the paper provides a rigorous, reproducible workflow that moves from heterogeneity detection to actionable treatment personalization, showing that variables that are strong predictors of disease severity need not be effective effect modifiers. This work highlights the importance of distinguishing prognostic from predictive signals when designing personalized therapeutic strategies in inflammatory bowel disease and beyond.