Comparing methods to assess treatment effect heterogeneity in general parametric regression models
This paper reviews and compares methods to assess treatment effect heterogeneity in the context of parametric regression models. These methods include the standard likelihood ratio tests, bootstrap likelihood ratio tests, and Goeman’s global test motivated by testing whether the random effect variance is zero. We place particular emphasis on tests based on the score-residual of the treatment effect and explore different variants of tests in this class. All approaches are compared in a simulation study, and the approach based on residual scores is illustrated in a clinical trial with time-to-event outcome comparing treatment versus placebo. Our findings demonstrate that score-residual based methods provide practical, flexible and reliable tools for exploring treatment effect heterogeneity and treatment effect modifiers, and can provide useful guidance for decision making around treatment effect heterogeneity.
💡 Research Summary
This paper provides a comprehensive review and empirical comparison of statistical methods for assessing treatment‑effect heterogeneity (TEH) within the framework of general parametric regression models. The authors focus on four classes of approaches: (1) the conventional likelihood‑ratio test (LRT) comparing a model with only main effects to one that also includes treatment‑by‑covariate interaction terms; (2) a parametric bootstrap version of the LRT that approximates the finite‑sample distribution of the likelihood‑ratio statistic; (3) Goeman’s global test, originally devised for high‑dimensional prognostic covariates, which tests whether the variance component of a random‑effects formulation of the interaction terms is zero; and (4) a novel set of procedures based on the score residuals with respect to the treatment‑effect parameter.
The paper begins by discussing why TEH matters in randomized clinical trials, emphasizing that the magnitude and even the direction of a treatment effect can depend on the scale (absolute risk difference, relative risk, odds ratio) and on the covariates included in the model. The authors adopt the WATCH (Workflow for Assessing Treatment‑Effect Heterogeneity) framework, which structures TEH exploration into four steps: planning, data preparation, TEH exploration, and multidisciplinary assessment. The third step is further divided into (a) testing the global null hypothesis of homogeneity, (b) ranking covariates that act as effect modifiers, and (c) visual displays.
Methodologically, the paper defines the treatment‑effect estimand as the coefficient δ in the regression model η_i = β_0 + x_i′β_x + δ·z_i (for Cox models, δ is the log‑hazard ratio). The global TEH test based on interaction terms compares this main‑effects model with one that adds k treatment‑by‑covariate interactions; under large‑sample theory the likelihood‑ratio statistic follows a χ² distribution with k degrees of freedom, but this approximation can be poor for moderate sample sizes, especially with binary outcomes. The bootstrap LRT mitigates this by repeatedly simulating data from the fitted null model, refitting both the null and interaction models, and using the resulting statistics as an empirical null distribution. Goeman’s test recasts the interaction terms as random effects and tests whether their common variance component equals zero.
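The parametric bootstrap LRT described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian linear model with a single covariate (so k = 1), a 1:1 randomized binary treatment, and simulated data. All function names (`solve`, `fit_rss`, `lrt_stat`, `bootstrap_lrt`) and parameter values are hypothetical, and the Gaussian case is chosen only for brevity, although the summary notes the bootstrap matters most for binary outcomes.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_rss(X, y):
    """Ordinary least squares; returns (residual sum of squares, coefficients)."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    res = [yi - sum(bj * v for bj, v in zip(beta, row)) for row, yi in zip(X, y)]
    return sum(r * r for r in res), beta

def lrt_stat(x, z, y):
    """Gaussian LRT statistic: main-effects model vs. model with x*z interaction."""
    n = len(y)
    X0 = [[1.0, xi, zi] for xi, zi in zip(x, z)]
    X1 = [[1.0, xi, zi, xi * zi] for xi, zi in zip(x, z)]
    rss0, beta0 = fit_rss(X0, y)
    rss1, _ = fit_rss(X1, y)
    return n * math.log(rss0 / rss1), beta0, rss0

def bootstrap_lrt(x, z, y, n_boot=200, seed=0):
    """Parametric bootstrap p-value: simulate outcomes from the fitted null model."""
    rng = random.Random(seed)
    n = len(y)
    obs, beta0, rss0 = lrt_stat(x, z, y)
    sigma0 = math.sqrt(rss0 / n)  # ML estimate of the error SD under the null
    mu0 = [beta0[0] + beta0[1] * xi + beta0[2] * zi for xi, zi in zip(x, z)]
    exceed = 0
    for _ in range(n_boot):
        y_star = [m + rng.gauss(0.0, sigma0) for m in mu0]
        stat, _, _ = lrt_stat(x, z, y_star)
        if stat >= obs:
            exceed += 1
    return obs, (exceed + 1) / (n_boot + 1)

# Simulated trial with a true treatment-by-covariate interaction (assumed values).
data_rng = random.Random(42)
n = 300
x = [data_rng.gauss(0.0, 1.0) for _ in range(n)]
z = [1.0 if data_rng.random() < 0.5 else 0.0 for _ in range(n)]
y = [1.0 + 0.5 * xi + 1.0 * zi + 1.0 * xi * zi + data_rng.gauss(0.0, 1.0)
     for xi, zi in zip(x, z)]

obs_stat, p_value = bootstrap_lrt(x, z, y)
print(f"LRT statistic = {obs_stat:.2f}, bootstrap p-value = {p_value:.3f}")
```

The p-value uses the common (exceedances + 1) / (B + 1) convention, so it can never be exactly zero. For other outcome types, only the model-fitting and simulation steps would change.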
The core contribution is the score‑residual approach. For each subject i, the score residual with respect to δ is s_i = ∂Ψ_i/∂δ |_{β̂,δ̂}, where Ψ_i is subject i's contribution to the negative log‑likelihood, evaluated at the estimates from the main‑effects model. In generalized linear models with a canonical link this reduces, up to sign and a dispersion factor, to s_i = (y_i – μ̂_i)·z_i, which is identically zero for control subjects (z_i = 0). To avoid this loss of information, the authors propose centering the treatment indicator: z̃_i = z_i – E(z_i|x_i). In a randomized trial E(z_i|x_i) equals the known randomization probability, so z̃_i has mean zero and every participant, treated or control, makes a non‑zero contribution. The collection of s_i’s can be examined globally (e.g., via a sum‑of‑squares statistic) to test the null hypothesis of a homogeneous treatment effect, and regressed on each covariate individually to obtain a measure of variable importance. This approach is computationally cheap (linear in N and k) and extends naturally to continuous, binary, count, and time‑to‑event outcomes.
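A small sketch may help make the score‑residual construction concrete. It assumes the simplest case, a Gaussian linear model with unit dispersion, one covariate, and a known randomization probability of 0.5. The function names (`solve`, `ols`, `score_residuals`, `slope`) and the simulated data are hypothetical and are not taken from the paper. The sketch computes s_i = (y_i – μ̂_i)·z̃_i from the main‑effects fit and then regresses s_i on the covariate as a crude variable‑importance measure.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Ordinary least-squares coefficients via the normal equations."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def score_residuals(x, z, y, pi=0.5):
    """Score residuals w.r.t. the treatment coefficient, using centered treatment.

    pi is the (assumed known) randomization probability E(z | x).
    """
    z_c = [zi - pi for zi in z]                       # centered indicator
    X = [[1.0, xi, zc] for xi, zc in zip(x, z_c)]     # main-effects model only
    beta = ols(X, y)
    mu = [sum(bj * v for bj, v in zip(beta, row)) for row in X]
    return [(yi - mi) * zc for yi, mi, zc in zip(y, mu, z_c)]

def slope(u, v):
    """Least-squares slope of v regressed on u."""
    ub, vb = sum(u) / len(u), sum(v) / len(v)
    return (sum((a - ub) * (c - vb) for a, c in zip(u, v))
            / sum((a - ub) ** 2 for a in u))

# Simulated trial in which x modifies the treatment effect (assumed values).
rng = random.Random(1)
n = 2000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
y = [1.0 + 0.5 * xi + 1.0 * zi + 1.0 * xi * zi + rng.gauss(0.0, 1.0)
     for xi, zi in zip(x, z)]

s = score_residuals(x, z, y)
total = sum(s)            # score equation: sums to ~0 at the fitted estimates
slope_sx = slope(x, s)    # association between score residuals and x
n_nonzero_controls = sum(1 for si, zi in zip(s, z) if zi == 0.0 and si != 0.0)
print(total, slope_sx, n_nonzero_controls)
```

Because the fitted model includes z̃_i as a regressor, the score residuals sum to zero up to floating‑point error. The information about heterogeneity lies in how they vary with the covariates, which is what the slope picks up. Control subjects now contribute non‑zero residuals, which would not be the case with the uncentered indicator.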
A simulation study evaluates all four methods across four model families (linear, logistic, Poisson, Cox) with varying numbers of covariates (k = 5, 10, 20) and interaction strengths. The score‑residual based global test maintains the nominal Type I error rate and exhibits higher power than the bootstrap LRT and Goeman’s test, especially when the interaction effects are modest. The bootstrap LRT becomes computationally intensive as k grows, while Goeman’s test tends to be conservative. Variable‑importance rankings derived from score residuals correctly identify the true effect modifiers with higher precision than the other methods.
The authors illustrate the methodology on a real clinical trial with a time‑to‑event endpoint (progression‑free survival). A Cox model is fitted, score residuals are computed using the centered treatment indicator, and the global test yields a p‑value of approximately 0.03, indicating statistically detectable heterogeneity. Subsequent regression of the residuals on baseline covariates highlights age, ECOG performance status, and a specific biomarker as the most influential modifiers. Visual displays (e.g., stratified survival curves) are generated to aid interpretation.
In conclusion, the paper demonstrates that score‑residual based methods provide a practical, flexible, and statistically reliable toolkit for exploring TEH and identifying potential effect modifiers across a wide range of outcome types. When embedded within the WATCH workflow, these methods enable systematic, transparent, and reproducible heterogeneity assessments, supporting more nuanced decision‑making in drug development. Future work may extend the approach to non‑linear effects, higher‑order interactions, and Bayesian formulations.