Accounting for Heavy Censoring in Evaluating the Risk Stratification Abilities of Existing Models for Time to Diagnosis of Huntington Disease

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Huntington disease (HD) is a neurodegenerative disease with progressively worsening symptoms. Accurately modeling time to HD diagnosis is essential for clinical trial design. Langbehn’s model, the CAG-Age Product (CAP) model, the Prognostic Index Normed (PIN) model, and the Multivariate Risk Score (MRS) model have all been proposed for this task. However, these models may yield conflicting predictions and few studies have systematically compared their performance. Further, those that have could be misleading due to testing the models on the same data used to train them and failing to account for high rates of right censoring (80%+) in performance metrics. We discuss the theoretical foundations of these models, offering intuitive comparisons about their practical feasibility. We externally validate their risk stratification abilities using data from the ENROLL-HD study and two censoring-appropriate performance metrics, guiding model selection for HD clinical trial design. As these models were developed in HD studies that ended more than a decade ago, we compared their predictive performance using published parameters versus updated ones (re-estimated using ENROLL-HD). We show how these models can be used to estimate sample sizes for an HD clinical trial. Based on either metric and using published or updated parameters, the MRS model, which incorporates the most covariates, performed best. However, the simpler PIN model offered similarly good performance while requiring fewer variables, many of which would require patients to undergo additional tests. In illustrating an HD clinical trial design, we defined an optimal threshold based on model performance metrics to determine which patients are more likely to be diagnosed. Sample size calculations using an optimal threshold based on metrics that did not account for censoring, as in previous studies, are shown to lead to underpowered trials.


💡 Research Summary

Huntington disease (HD) is a fully penetrant neurodegenerative disorder caused by CAG repeat expansion, and the time from genetic confirmation to clinical diagnosis can span decades. Early‑stage (prodromal) preventive trials require tools that predict when an at‑risk individual will be diagnosed, thereby enabling sample enrichment strategies that focus enrollment on high‑risk participants. Four major time‑to‑diagnosis models have been proposed: the Langbehn model (logistic‑based probability using CAG repeats), the CAG‑Age Product (CAP) model (accelerated‑failure model using the product of CAG and age), the Prognostic Index Normed (PIN) model (semi‑parametric risk index with several clinical covariates), and the Multivariate Risk Score (MRS) model (a more complex parametric model incorporating ten or more clinical and cognitive variables).
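Of the four, the CAP model has the simplest functional form: a burden score proportional to age times the CAG expansion beyond a centering constant. A minimal sketch follows; the centering constant 33.66 is a commonly cited parameterization used here for illustration only, not a value stated in this summary:

```python
# Illustrative sketch of a CAP-style burden score: CAP = age * (CAG - c).
# The centering constant c = 33.66 follows a commonly cited
# parameterization and is an assumption for illustration.
def cap_score(age, cag, c=33.66):
    """CAG-Age Product: a simple genetic-burden index for HD."""
    return age * (cag - c)

# Example: a 45-year-old carrier with 42 CAG repeats.
print(round(cap_score(45, 42), 1))  # → 375.3
```

Because the score is a deterministic function of age and CAG length, it needs no additional clinical testing, which is part of why CAP is attractive despite the weaker discrimination reported below.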

Previous comparative work suffered from three major flaws: (1) only three of the four models were evaluated; (2) performance metrics (standard ROC) ignored the very high right‑censoring rates (≈80%) typical of HD observational cohorts; and (3) the models were tested on the same datasets used for their development, creating double‑dipping bias. To address these issues, the authors performed an external validation using the ENROLL‑HD cohort (5,173 participants, 88% censored). They applied two censoring‑appropriate metrics: Uno's C‑statistic, which weights concordance by the inverse probability of censoring, and a Kaplan‑Meier‑based time‑specific ROC analysis.
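Uno's C‑statistic handles censoring by reweighting comparable pairs with the inverse squared Kaplan‑Meier estimate of the censoring survival function G(t). A pure‑Python sketch of the idea (simplified: no truncation time τ and no tie handling; a real analysis would use a vetted implementation such as scikit‑survival's `concordance_index_ipcw`):

```python
# Minimal sketch of Uno's IPCW concordance (censoring-aware C-statistic).
# Pairs anchored at an observed event are weighted by 1 / G(t)^2, where
# G is the Kaplan-Meier estimate of the *censoring* distribution.

def km_censoring_survival(times, events):
    """Kaplan-Meier estimate of the censoring distribution G(t).

    Treats censoring (event == 0) as the 'event' of interest and
    returns a function giving the left-continuous estimate G(t-).
    """
    pts = sorted(set(times))
    surv, prob = [], 1.0
    for t in pts:
        at_risk = sum(1 for u in times if u >= t)
        cens = sum(1 for u, e in zip(times, events) if u == t and e == 0)
        surv.append((t, prob))                # G just before t
        if at_risk > 0:
            prob *= 1.0 - cens / at_risk
    def G(t):
        g = 1.0
        for s, p in surv:
            if s <= t:
                g = p
            else:
                break
        return g
    return G

def uno_c(times, events, risks):
    """IPCW concordance: P(risk_i > risk_j | T_i < T_j), reweighted."""
    G = km_censoring_survival(times, events)
    num = den = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue                          # only events anchor pairs
        w = 1.0 / max(G(times[i]), 1e-12) ** 2
        for j in range(len(times)):
            if times[j] > times[i]:
                den += w
                num += w * (risks[i] > risks[j])
    return num / den if den else float("nan")
```

With no censoring, G ≡ 1 and the statistic reduces to Harrell's C; under heavy censoring (as in ENROLL‑HD), the weights prevent the metric from being dominated by the small observed‑event subset.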

Both the original published parameters and re‑estimated parameters (using ENROLL‑HD) were evaluated for each model. The MRS consistently achieved the highest Uno’s C‑statistic (0.78 with published parameters, 0.81 after updating) and the highest time‑specific AUC (≈0.80–0.83). The PIN model followed closely (C‑statistic ≈0.75–0.77, AUC ≈0.73–0.76) while requiring far fewer covariates (≈5 vs. >10). The CAP model performed poorly (C‑statistic ≈0.62), suggesting that its accelerated‑failure assumption does not capture the age‑CAG interaction in modern cohorts. The Langbehn model, which does not produce a log‑risk score, was transformed into a short‑term probability; its concordance was modest (≈0.68).

The authors then demonstrated how these models can be used for sample enrichment. By selecting an optimal risk‑score threshold that maximizes Uno's C‑statistic, the proportion of participants who receive a diagnosis during a typical 3‑year trial increased roughly threefold, reducing the required sample size for 80% power by 30–40% compared with enrolling all eligible prodromal individuals. In contrast, using thresholds derived from standard ROC (which ignores censoring) underestimates the needed sample size by 15–20%, leading to underpowered studies.
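The sample‑size logic can be sketched with Schoenfeld's approximation for a log‑rank test: enrichment does not change the number of events required, but raising the per‑participant event probability shrinks the enrollment needed to observe them. All numbers below (hazard ratio, event probabilities) are hypothetical, not values from the paper:

```python
# Sketch of event-driven sample sizing (Schoenfeld's approximation for a
# log-rank test), illustrating how risk-score enrichment shrinks trials.
# The hazard ratio, power, and event probabilities are hypothetical.
from math import ceil, log
from statistics import NormalDist

def logrank_sample_size(hr, p_event, alpha=0.05, power=0.80, alloc=0.5):
    """Total N so that the expected event count powers a log-rank test."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(power)
    # Required number of events (Schoenfeld; alloc=0.5 is 1:1 allocation)
    d = (z_a + z_b) ** 2 / (alloc * (1 - alloc) * log(hr) ** 2)
    return ceil(d / p_event)

# Unenriched cohort: say 12% are diagnosed within the 3-year trial.
n_all = logrank_sample_size(hr=0.6, p_event=0.12)
# Enriched cohort above a model threshold: say 36% diagnosed (3x).
n_enriched = logrank_sample_size(hr=0.6, p_event=0.36)
print(n_all, n_enriched)  # enrichment cuts N roughly threefold
```

This also makes the underpowering mechanism concrete: a threshold chosen with a censoring‑blind ROC overstates `p_event` among those enrolled, so the computed N is too small for the events that actually accrue.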

From a practical standpoint, the MRS offers the best predictive accuracy but demands extensive data collection (genetic, motor, cognitive, and imaging measures). The PIN model provides a compelling trade‑off: comparable discrimination with a modest set of variables that are routinely collected in HD observational studies. Importantly, the study shows that re‑estimating model parameters on contemporary cohorts markedly improves performance, underscoring the necessity of periodic model updating.

In conclusion, when evaluating time‑to‑diagnosis models for HD in the presence of heavy right‑censoring, censoring‑aware metrics must be employed. The MRS model is the top performer, but the simpler PIN model may be preferable in many trial settings due to lower data‑collection burden. Properly calibrated risk‑stratification can substantially reduce sample size requirements for preventive HD trials, enhancing feasibility and cost‑effectiveness.

