Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage

Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS), which are influenced by many factors including natural variability and human-driven modifications, remains unclear. Here, using the open-access, globally representative HydroGlobe dataset, which comprises a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation, we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.

💡 Research Summary

The paper investigates whether state‑of‑the‑art deep‑learning architectures truly outperform simple statistical methods when predicting terrestrial water storage (TWS), a variable that integrates natural hydrological processes and human water‑use interventions. Using the publicly available HydroGlobe dataset, the authors construct two experimental settings: (1) a baseline version derived solely from a land‑surface model (LSM) simulation, and (2) an advanced version that assimilates multi‑source remote‑sensing observations (e.g., GRACE, SMAP, MODIS) and incorporates anthropogenic water‑management signals such as irrigation and reservoir releases. Both versions provide global coverage at 0.5° × 0.5° spatial resolution and monthly temporal resolution, making them suitable for assessing model generalisation across diverse climatic regimes and socio‑economic contexts.

Three predictive models are trained on identical input covariates (precipitation, temperature, soil moisture, evapotranspiration, irrigation, reservoir outflows, etc.). The first is a linear regression model equipped with L2 (Ridge) regularisation and a stepwise variable‑selection procedure to avoid over‑fitting. The second is a Long Short‑Term Memory (LSTM) network, tuned for the number of hidden layers, units per layer, and dropout rates via cross‑validation. The third is a Temporal Fusion Transformer (TFT), which combines multi‑head attention, a variable‑selection network, and static‑dynamic feature handling. Hyper‑parameter optimisation for the deep models follows a systematic grid search, and all models are evaluated using the same train‑validation‑test splits.
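To make the linear benchmark concrete, the following is a minimal sketch of an L2-regularised (ridge) regression fitted in closed form. It is not the authors' implementation: the data are synthetic rather than taken from HydroGlobe, the function names `fit_ridge` and `predict_ridge`, the penalty `alpha = 1.0`, and the chronological 180/60 month split are all illustrative assumptions, and the stepwise variable-selection step is omitted.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit; the intercept is left unpenalised.

    X: (n_samples, n_features) covariates, y: (n_samples,) target.
    alpha is a hypothetical penalty strength, not a value from the paper.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centre covariates so the intercept separates out
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w  # recover the intercept
    return w, b


def predict_ridge(X, w, b):
    """Apply the fitted linear model."""
    return X @ w + b


# Synthetic stand-in for one grid cell: 240 months, 4 covariates (assumed).
rng = np.random.default_rng(0)
n_months, n_features = 240, 4
X = rng.normal(size=(n_months, n_features))
true_w = np.array([1.5, -0.8, 0.0, 0.3])
y = X @ true_w + 2.0 + 0.1 * rng.normal(size=n_months)

# Chronological split (an assumption): first 180 months train, rest test.
split = 180
w, b = fit_ridge(X[:split], y[:split], alpha=1.0)
y_hat = predict_ridge(X[split:], w, b)
```

With `alpha=0.0` the same function reduces to ordinary least squares, which makes it easy to check how much the penalty changes the fitted coefficients.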

Performance is quantified with root‑mean‑square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) on a per‑grid‑cell basis, then aggregated globally. On the baseline HydroGlobe data, the linear regression achieves RMSE = 3.2 cm, MAE = 2.5 cm, R² = 0.78, outperforming LSTM (RMSE = 3.6 cm, MAE = 2.9 cm, R² = 0.71) and TFT (RMSE = 3.8 cm, MAE = 3.0 cm, R² = 0.68). In the advanced, data‑assimilation version, the linear model remains superior (RMSE = 2.9 cm, MAE = 2.2 cm, R² = 0.81) while the deep models show modest improvements over the baseline but still lag behind the simple benchmark (LSTM RMSE = 3.3 cm, MAE = 2.6 cm, R² = 0.74; TFT RMSE = 3.5 cm, MAE = 2.8 cm, R² = 0.71). Notably, the deep models exhibit higher variance across regions, suggesting sensitivity to data quality heterogeneity and potential over‑fitting to noisy signals.
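The per-grid-cell evaluation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's evaluation code: the function names `per_cell_metrics` and `aggregate` are hypothetical, the toy arrays are made up, and the optional cosine-of-latitude area weighting is one plausible way to aggregate globally that the summary does not confirm.

```python
import numpy as np


def per_cell_metrics(obs, pred):
    """Compute RMSE, MAE and R^2 for each grid cell.

    obs, pred: arrays of shape (n_cells, n_times).
    Cells with a constant observed series would give a zero denominator
    in R^2; this sketch does not handle that case.
    """
    err = pred - obs
    rmse = np.sqrt(np.mean(err ** 2, axis=1))
    mae = np.mean(np.abs(err), axis=1)
    ss_res = np.sum(err ** 2, axis=1)
    ss_tot = np.sum((obs - obs.mean(axis=1, keepdims=True)) ** 2, axis=1)
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


def aggregate(metrics, weights=None):
    """Average per-cell metrics; weights=None gives a plain mean."""
    return {name: float(np.average(vals, weights=weights))
            for name, vals in metrics.items()}


# Toy data: two grid cells, four time steps (values are made up).
obs = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, 4.0, 6.0, 8.0]])
pred = obs + np.array([[0.5, -0.5, 0.5, -0.5],
                       [1.0, -1.0, 1.0, -1.0]])

cell_scores = per_cell_metrics(obs, pred)
global_mean = aggregate(cell_scores)

# Assumed area weighting: cells at latitudes 0 and 60 degrees.
lat_weights = np.cos(np.deg2rad(np.array([0.0, 60.0])))
global_weighted = aggregate(cell_scores, weights=lat_weights)
```

Keeping the per-cell scores before aggregating also makes it possible to inspect the regional spread of errors, which is the quantity behind the observation that the deep models vary more across regions.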

The authors draw several key conclusions. First, the relationship between the selected predictors and TWS appears sufficiently linear at the global scale, limiting the advantage of highly non‑linear architectures. Second, the inclusion of a simple linear regression as a baseline is essential; without it, the apparent superiority of deep models could be misinterpreted. Third, the study underscores the necessity of globally representative benchmark datasets that capture both natural variability and anthropogenic modifications. HydroGlobe’s dual‑version design provides a valuable testbed for such assessments, but the authors argue that broader community‑wide benchmarks are still needed to avoid regional bias and to standardise model comparison.

Future research directions proposed include (i) hybrid approaches that combine dimensionality‑reduction or feature‑selection techniques with deep learning, (ii) physics‑informed neural networks that embed land‑surface model constraints within the learning process, and (iii) scenario‑based long‑term forecasting that integrates climate‑change projections and water‑policy pathways. By highlighting that linear regression remains a robust, interpretable, and computationally inexpensive benchmark for TWS prediction, the paper calls for a more nuanced view of deep learning’s role in hydrology: it should be pursued when clear non‑linear dynamics are present, but always evaluated against strong statistical baselines and on datasets that faithfully represent the full spectrum of natural and human drivers.

📜 Original Paper Content