Setting the Standard: Recommended Practices for Data Preprocessing in Data-Driven Climate Prediction
Artificial intelligence (AI) - and specifically machine learning (ML) - applications for climate prediction across timescales are proliferating quickly. The emergence of these methods prompts a re-examination of the impact of data preprocessing, a topic long familiar to the climate community because more traditional statistical models work with relatively small sample sizes. Indeed, the skill and confidence in the forecasts produced by data-driven models are directly influenced by the quality of the datasets and how they are treated during model development, thus yielding the colloquialism, “garbage in, garbage out.” As such, this article establishes protocols for the proper preprocessing of input data for AI/ML models designed for climate prediction (i.e., subseasonal to decadal and longer). The three aims are to: (1) educate researchers, developers, and end users on the effects that preprocessing has on climate predictions; (2) provide recommended practices for data preprocessing for such applications; and (3) empower end users to decipher whether the models they are using are properly designed for their objectives. Specific topics covered in this article include the creation of (standardized) anomalies, dealing with non-stationarity and the spatiotemporally correlated nature of climate data, and handling of extreme values and variables with potentially complex distributions. Case studies will illustrate how using different preprocessing techniques can produce different predictions from the same model, which can create confusion and decrease confidence in the overall process. Ultimately, implementing the recommended practices set forth in this article will enhance the robustness and transparency of AI/ML in climate prediction studies.
💡 Research Summary
The paper “Setting the Standard: Recommended Practices for Data Preprocessing in Data‑Driven Climate Prediction” addresses a critical gap in the rapidly expanding field of artificial‑intelligence (AI) and machine‑learning (ML) applications for climate forecasting. While much recent attention has focused on model architecture, hyper‑parameter tuning, and interpretability, the authors argue that the quality of the input data—and the way it is pre‑processed—often determines whether a model succeeds or fails. They therefore propose a comprehensive, step‑by‑step protocol that can serve as a community standard for climate‑prediction projects ranging from subseasonal to decadal timescales.
The paper is organized around three overarching goals: (1) educate researchers and end‑users about the profound impact of preprocessing choices; (2) provide concrete, reproducible recommendations for data preparation; and (3) empower users to assess whether a given model’s pipeline matches their scientific objectives.
Key Recommendations
- **Problem Definition First** – Clearly state the target variable, forecast horizon, and appropriate AI/ML paradigm (supervised vs. unsupervised). This guides the selection of predictors and the design of the training‑validation‑test split.
- **Exploratory Data Analysis (EDA)** – Compute the effective sample size (N_eff) by accounting for spatio‑temporal autocorrelation, identify missing or erroneous entries, and characterize distributions (skewness, kurtosis, multimodality).
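The effective-sample-size adjustment can be sketched with the common lag-1 autocorrelation approximation, N_eff ≈ N(1 − r1)/(1 + r1). This is one of several estimators and the paper does not prescribe a specific one; the AR(1) series below is synthetic and purely illustrative.

```python
import numpy as np

def effective_sample_size(x):
    """Estimate N_eff from the lag-1 autocorrelation of a 1-D series.

    Uses the approximation N_eff ~= N * (1 - r1) / (1 + r1), which
    shrinks the nominal sample size when the series is positively
    autocorrelated in time.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xa = x - x.mean()
    r1 = np.dot(xa[:-1], xa[1:]) / np.dot(xa, xa)  # lag-1 autocorrelation
    r1 = np.clip(r1, -0.99, 0.99)                  # guard against degenerate series
    return n * (1.0 - r1) / (1.0 + r1)

# A red-noise (AR(1)) series has far fewer independent samples than N.
rng = np.random.default_rng(0)
n = 2000
ar1 = np.empty(n)
ar1[0] = rng.standard_normal()
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + rng.standard_normal()

n_eff = effective_sample_size(ar1)
```

With an AR(1) coefficient of 0.8, the 2000 nominal samples collapse to only a few hundred effectively independent ones, which matters for significance testing of skill scores.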
- **Feature Engineering & Dimensionality Reduction** – Use physically meaningful transformations (e.g., anomalies, detrended fields) and statistical tools such as Principal Component Analysis (PCA), Empirical Orthogonal Functions (EOF), or autoencoders to compress high‑resolution fields while preserving dominant modes of variability.
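One minimal way to implement the EOF/PCA compression is a plain SVD of the time-by-space anomaly matrix; the single-mode synthetic field below is an assumption for illustration, not data from the paper.

```python
import numpy as np

def eof_decomposition(field, n_modes=3):
    """Compute leading EOFs of a (time, space) anomaly matrix via SVD.

    Returns spatial patterns (EOFs), principal-component time series,
    and the fraction of total variance explained by each mode.
    """
    anom = field - field.mean(axis=0)       # remove the time mean per grid point
    u, s, vt = np.linalg.svd(anom, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)          # explained-variance fraction per mode
    pcs = u[:, :n_modes] * s[:n_modes]      # PC time series
    eofs = vt[:n_modes]                     # spatial patterns
    return eofs, pcs, var_frac[:n_modes]

# Synthetic field: one dominant spatial mode plus weak noise.
rng = np.random.default_rng(1)
t = np.linspace(0, 20, 300)
pattern = np.sin(np.linspace(0, np.pi, 50))
field = np.outer(np.sin(t), pattern) + 0.1 * rng.standard_normal((300, 50))

eofs, pcs, var_frac = eof_decomposition(field, n_modes=2)
```

Retaining only the leading PCs as predictors compresses the field while keeping the dominant variability, which is the intent of this recommendation.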
- **Handling Non‑Stationarity** – Remove long‑term trends and seasonal cycles using methods appropriate to the data: simple linear or polynomial detrending, Seasonal‑Trend decomposition using Loess (STL), or Empirical Mode Decomposition (EMD). The authors stress that detrending must be performed only on the training period to avoid data leakage.
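The leakage-free detrending rule can be sketched as follows: fit the trend coefficients on the training years only, then subtract that same fixed trend from the test years. The warming series below is synthetic, with a linear trend assumed for simplicity.

```python
import numpy as np

def fit_trend(t_train, y_train, degree=1):
    """Fit a polynomial trend on the training period only."""
    return np.polyfit(t_train, y_train, degree)

# Synthetic warming series: linear trend plus noise, 1940-2024.
rng = np.random.default_rng(2)
years = np.arange(1940, 2025)
y = 0.02 * (years - 1940) + rng.normal(0, 0.2, years.size)

train = years < 2011                         # training window: 1940-2010
coeffs = fit_trend(years[train], y[train])   # leakage-free: training data only

# Apply the *training* trend to both periods; never refit on the test years.
y_detrended = y - np.polyval(coeffs, years)
```

Refitting the trend on the full 1940-2024 record would fold test-period information into the anomalies, which is exactly the leakage the paper warns about.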
- **Dealing with Extreme Values and Non‑Normal Distributions** – Apply robust scaling (e.g., median‑based IQR clipping), Winsorization, or distribution‑specific transforms such as log, Box‑Cox, or Yeo‑Johnson. For precipitation‑type variables, quantile mapping or kernel density estimation can improve model calibration.
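A sketch of median/IQR clipping followed by a Yeo-Johnson transform, using SciPy's maximum-likelihood lambda estimate. The gamma-distributed, precipitation-like data and injected outliers are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def winsorize_iqr(x, k=1.5):
    """Clip values beyond median +/- k * IQR (robust outlier handling)."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    return np.clip(x, med - k * iqr, med + k * iqr)

# Skewed, precipitation-like data (gamma-distributed) with a few outliers.
rng = np.random.default_rng(3)
precip = rng.gamma(shape=1.5, scale=2.0, size=5000)
precip[:5] = 100.0                              # injected extreme outliers

clipped = winsorize_iqr(precip)
transformed, lmbda = stats.yeojohnson(clipped)  # lambda estimated by MLE

skew_before = stats.skew(precip)
skew_after = stats.skew(transformed)
```

The transform brings a heavily right-skewed input much closer to symmetric, which typically improves the calibration of models that assume roughly Gaussian inputs.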
- **Scaling and Normalization** – After addressing trends and extremes, standardize variables (z‑score) or use min‑max scaling, but compute scaling parameters exclusively on the training/validation set and then apply them unchanged to the test set.
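This fit-on-training-only discipline can be sketched with a minimal standardizer (the class name and synthetic data are illustrative assumptions):

```python
import numpy as np

class TrainOnlyStandardizer:
    """z-score scaler whose parameters come from the training set only."""

    def fit(self, x_train):
        self.mean_ = x_train.mean(axis=0)
        self.std_ = x_train.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=(1000, 3))
x_train, x_test = x[:800], x[800:]             # chronological split first

scaler = TrainOnlyStandardizer().fit(x_train)  # parameters from training only
z_train = scaler.transform(x_train)
z_test = scaler.transform(x_test)              # same parameters, unchanged
```

The test set is deliberately never passed to `fit`; its standardized values may have a nonzero mean, and that is the correct, leakage-free behavior.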
- **Preventing Data Leakage** – The most emphasized point is to split the dataset before any preprocessing. Temporal block splitting (e.g., training 1940‑2010, testing 2011‑2024) with a gap that exceeds the autocorrelation length ensures that the model cannot inadvertently learn from future information. The paper illustrates how using the full time series for detrending creates a biased “cooler” ground truth in the test period, inflating skill scores.
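The temporal block split with a buffer gap can be sketched directly (the two-year gap below is an illustrative assumption; the appropriate gap depends on the target's decorrelation time):

```python
import numpy as np

def temporal_split_with_gap(years, train_end, gap_years):
    """Split indices into train/test blocks separated by a buffer gap.

    The gap should exceed the decorrelation time of the target so that
    autocorrelation cannot leak information across the boundary.
    """
    years = np.asarray(years)
    train_idx = np.where(years <= train_end)[0]
    test_idx = np.where(years > train_end + gap_years)[0]
    return train_idx, test_idx

years = np.arange(1940, 2025)
train_idx, test_idx = temporal_split_with_gap(years, train_end=2010, gap_years=2)
```

The years inside the gap are simply discarded; losing a little data at the boundary is the price of an honest skill estimate.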
- **Cross‑Validation (CV) Strategies** – Standard k‑fold CV assumes independent and identically distributed (i.i.d.) samples, which is rarely true for climate data. The authors recommend:
  - **Time‑Series CV** – respects chronological order; suitable when long‑term memory dominates.
  - **Spatial CV** – creates spatially disjoint folds to test generalization across regions.
  - **Stratified CV** – maintains class proportions for imbalanced categorical events (e.g., El Niño vs. La Niña).
  - **Hybrid approaches** – combine temporal and spatial blocking for complex datasets.
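The time-series variant above can be sketched as a hand-rolled expanding-window splitter; the fold sizes and three-step gap are illustrative assumptions, not values from the paper.

```python
import numpy as np

def time_series_cv_splits(n_samples, n_splits, gap=0):
    """Expanding-window CV: each fold trains on the past and validates on
    the next block, with an optional gap to absorb autocorrelation."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        val_start = train_end + gap
        val_end = min(val_start + fold_size, n_samples)
        yield np.arange(train_end), np.arange(val_start, val_end)

# 120 monthly samples, four folds, three-step gap between train and validation.
splits = list(time_series_cv_splits(120, n_splits=4, gap=3))
```

Each successive fold trains on a longer history, so validation always lies strictly in the future of its training block, unlike shuffled k-fold.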
- **Feature Selection** – Use statistical relevance (correlation, mutual information) and model‑based importance metrics (SHAP values, permutation importance) to prune irrelevant predictors, thereby reducing model complexity and over‑fitting risk.
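As one filter-style sketch of the correlation screening mentioned above (mutual information or SHAP-based ranking would slot in the same way, and should likewise be computed on training data only):

```python
import numpy as np

def correlation_screen(X, y, top_k):
    """Rank predictors by absolute Pearson correlation with the target
    and keep the top_k (a simple filter-style feature selection)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum())
    )
    ranked = np.argsort(-np.abs(corr))   # strongest predictors first
    return ranked[:top_k], corr

# Ten candidate predictors; only the first two actually drive the target.
rng = np.random.default_rng(5)
X = rng.standard_normal((500, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * rng.standard_normal(500)

keep, corr = correlation_screen(X, y, top_k=2)
```

The screen recovers the two informative predictors and discards the eight noise columns, shrinking the input space before model fitting.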
- **Transparency & Reproducibility** – Publish the full preprocessing pipeline (code, configuration files, metadata) alongside model weights. Report training/validation/test split ratios, transformation parameters, and CV scheme in the manuscript to enable fair comparison across studies.
Case Studies
Case Study 1 examines temperature anomaly prediction using two detrending approaches: a simple linear trend (1‑degree polynomial) versus a quadratic trend (2‑degree polynomial). When the detrending is performed on the entire 1940‑2024 record, the test period exhibits an artificial cooling bias, inflating skill metrics. Restricting detrending to the training window eliminates this bias and yields more realistic performance.
Case Study 2 focuses on Indian monsoon precipitation, a variable with a strongly skewed distribution. The authors compare raw, log‑transformed, Box‑Cox, and Yeo‑Johnson transformed inputs. The Yeo‑Johnson transformation, combined with robust outlier clipping, improves the LSTM model’s Continuous Ranked Probability Score (CRPS) by ~12 % and reduces RMSE by ~9 % relative to the raw data baseline.
These examples demonstrate that preprocessing decisions can have a larger impact on forecast skill than the choice of model architecture itself.
Conclusions
The authors argue that a standardized, transparent preprocessing workflow is essential for the credibility of AI/ML climate prediction. By systematically addressing non‑stationarity, extreme values, spatio‑temporal autocorrelation, and data leakage, researchers can produce models that are both more accurate and more trustworthy. The recommended practices are intended to become a community‑wide benchmark, facilitating reproducibility, fair model inter‑comparison, and ultimately, more reliable climate forecasts for decision‑makers.