CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Data preprocessing is a critical yet frequently neglected aspect of machine learning: despite its potentially significant impact on model performance, it often receives little attention. While automated machine learning (AutoML) pipelines are beginning to integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks such as survival (time-to-event) analysis. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from a lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework handles continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection, and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network, or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that Q-learning-based data preprocessing yields superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates its effectiveness under different types and levels of missingness and noise in the data.


💡 Research Summary

CleanSurvival addresses a critical gap in the automated machine‑learning (AutoML) ecosystem: the lack of dedicated, automated data‑preprocessing pipelines for survival (time‑to‑event) analysis. While many AutoML systems now incorporate preprocessing for classification and regression, they typically ignore the unique challenges of censored data and the specialized models used in survival analysis (e.g., Cox proportional hazards, random survival forests, deep survival networks). The authors propose a reinforcement‑learning (RL) framework that uses model‑free Q‑learning to automatically select and sequence preprocessing actions—such as missing‑value imputation (mean, median, KNN, MICE), outlier detection (standard‑deviation, IQR, Isolation Forest), categorical encoding (one‑hot, target encoding), scaling (normalization, standardization), and dimensionality reduction (PCA, autoencoders).
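The discrete search space described above can be written down directly, which also makes clear why exhaustive enumeration is costly. A minimal sketch (the option names mirror the categories listed in the summary; they are illustrative, not CleanSurvival's actual API):

```python
# Hypothetical encoding of the discrete preprocessing search space
# (names are illustrative stand-ins, not the package's real identifiers).
SEARCH_SPACE = {
    "imputation": ["mean", "median", "knn", "mice"],
    "outlier_detection": ["std_dev", "iqr", "isolation_forest"],
    "encoding": ["one_hot", "target"],
    "scaling": ["normalize", "standardize"],
    "dim_reduction": ["pca", "autoencoder"],
}

# Number of distinct pipelines a naive grid search would enumerate
n_pipelines = 1
for options in SEARCH_SPACE.values():
    n_pipelines *= len(options)
print(n_pipelines)  # 4 * 3 * 2 * 2 * 2 = 96
```

Even this modest space yields 96 candidate pipelines, each requiring a full model fit to evaluate, which is what motivates a directed search instead of grid search.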

The state representation combines dataset meta‑features (sample size, number of variables, proportion of missing values, categorical ratio, etc.) with a history of already‑applied preprocessing steps. The action space consists of the discrete preprocessing options listed above. A reward function is defined as a weighted combination of downstream survival‑model performance (C‑index, Integrated Brier Score, IPCW‑weighted Graf score) and computational cost (runtime, memory). By employing an ε‑greedy exploration schedule, the agent initially explores a broad set of pipelines and gradually exploits the best‑performing sequences as Q‑values converge according to the Bellman update.

CleanSurvival is model‑agnostic: it can be attached to classical Cox models, random survival forests, DeepSurv‑style neural networks, or any user‑provided time‑to‑event estimator. The authors released the implementation as an open‑source Python package on GitHub (https://github.com/datasciapps/CleanSurvival).

Empirical evaluation was conducted on five real‑world survival datasets (including METABRIC, SEER, TCGA, and an industrial equipment‑failure dataset) and three simulated scenarios that varied missingness mechanisms (MCAR, MAR, MNAR) and noise levels. Baselines comprised (a) a naïve preprocessing pipeline (mean/mode imputation + one‑hot encoding), (b) random grid search over the same preprocessing options, and (c) existing AutoML tools such as TPOT and auto‑sklearn, which lack native survival support. Results show that CleanSurvival consistently improves predictive performance: average C‑index gains of 0.03–0.07 points and reductions in Integrated Brier Score of 5–12 % relative to baselines. Moreover, the RL‑guided search is dramatically more efficient, achieving up to a ten‑fold speed‑up (e.g., reducing a 3‑hour grid search to 18 minutes on the METABRIC dataset). The framework also demonstrates robustness: even with >30 % missing data or high Gaussian noise (σ = 0.5), performance degradation remains under 2 %.
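Since both the reward signal and the reported gains center on the concordance index, a minimal from-scratch sketch of Harrell's C-index for right-censored data may help make the metric concrete. This is a simplified version: pairs with tied observed times are skipped, and the toy inputs below are made up for illustration.

```python
from itertools import combinations

def c_index(times, events, scores):
    """Harrell's concordance index for right-censored data (simplified).

    times:  observed times (event time, or censoring time if censored)
    events: 1 if the event occurred, 0 if the observation was censored
    scores: predicted risk (higher risk = earlier expected event)
    """
    concordant = comparable = 0.0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:          # order so i has the earlier time
            i, j = j, i
        # A pair is comparable only if the earlier observation is an event
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if scores[i] > scores[j]:
            concordant += 1              # higher risk failed earlier
        elif scores[i] == scores[j]:
            concordant += 0.5            # ties in predicted risk count half
    return concordant / comparable

# Made-up example: risk predictions perfectly ordered against survival time
times  = [2, 4, 6, 8]
events = [1, 0, 1, 1]                    # the second subject is censored
scores = [0.9, 0.7, 0.4, 0.1]
print(c_index(times, events, scores))    # 1.0
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported gains of 0.03 to 0.07 points are substantial on this scale.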

The paper discusses limitations. The current Q‑table formulation handles only discrete preprocessing choices, making continuous hyper‑parameter tuning (e.g., the number of neighbors in KNN imputation) cumbersome. The reward does not explicitly penalize model complexity, which could lead to over‑fitting in highly flexible deep survival networks. Future work is outlined: extending the approach with deep Q‑networks (DQN) to handle continuous action spaces, incorporating multi‑objective optimization (balancing predictive accuracy, interpretability, and computational cost) via Pareto front methods, and adding regularization terms to the reward to control model complexity.

In conclusion, CleanSurvival is the first reinforcement‑learning‑based system that automates preprocessing specifically for survival analysis, integrating data‑cleaning decisions directly with downstream model performance. The experimental evidence confirms that it not only yields better predictive accuracy but also dramatically reduces the time required to discover effective pipelines, thereby filling a notable void in current AutoML solutions and paving the way for more comprehensive, domain‑aware automated modeling pipelines.

