An Efficient Regional Storm Surge Surrogate Model Training Strategy Under Evolving Landscape and Climate Scenarios


Coastal communities face significant risk from storm-induced coastal flooding, which causes substantial societal and economic losses worldwide. Machine learning techniques have increasingly been integrated into coastal hazard modeling, particularly for storm surge prediction, due to advances in computational capacity. However, incorporating multiple projected future climate and landscape scenarios requires extensive numerical simulations of synthetic storm suites over large geospatial domains, resulting in rapidly escalating computational costs. This study proposes a cost-effective training data reduction strategy for machine-learning-based storm surge surrogate models that enables efficient incorporation of new future scenarios while minimizing computational burden. The proposed strategy reduces training data across three dimensions: grid points, input features, and storm suite size. Reducing the storm suite size for future scenario simulations is highly effective in guiding numerical simulations, yielding substantial reductions in simulation cost. The performance of surrogate models trained on reduced datasets was evaluated using different machine learning algorithms. Results demonstrate that the proposed reduction strategy is robust across different model types. When trained using 5,000 out of 80,000 grid points, 10 out of 12 input features, and 60 out of 90 storms, the total training dataset is reduced to approximately 5% of its original size. Despite this reduction, the trained model achieves a correlation coefficient of 0.94, comparable to models trained on the full dataset. In addition, storm selection methodologies are introduced to support efficient storm set expansion for future scenario analyses.


💡 Research Summary

The paper addresses the growing computational burden associated with training machine‑learning (ML) surrogate models for regional storm‑surge prediction under multiple future climate and landscape scenarios. Traditional approaches rely on exhaustive numerical simulations of large synthetic storm suites (up to 645 storms) across high‑resolution meshes (≈80 000 grid points). When many “what‑if” scenarios are required—different sea‑level rise rates, subsidence patterns, restoration measures—the total number of simulations quickly becomes prohibitive.

To overcome this, the authors propose a three‑dimensional data‑reduction strategy that simultaneously trims (1) the number of grid points (GPs) used for training, (2) the number of input features supplied to the model, and (3) the size of the storm suite. Each reduction dimension is investigated separately, then combined into a unified workflow that can be applied iteratively as new scenario data become available.

Grid‑point reduction: Because the surrogate predicts a single scalar (peak surge) per GP, conventional output‑dimensionality reduction (e.g., PCA) cannot be applied directly. Instead, the authors use k‑means clustering on a feature set that includes geographic coordinates, elevation, landscape parameters (canopy, Manning’s n, surface roughness Z₀), and the first PCA eigenvector derived from surge responses (only for the baseline S00Y00 scenario, which contains the full 645‑storm set). Missing surge values at dry nodes are imputed with a weighted k‑nearest‑neighbors approach. Representative GPs are selected as the points nearest each cluster centroid. Two schemes are compared: a fixed GP subset derived from the baseline clustering, and a flexible subset obtained by reclustering each scenario individually. Sensitivity tests show that with few GPs the flexible approach yields lower RMSE, but as the GP count rises (≈5 000), performance converges. Importantly, using inconsistent GP counts across scenarios (e.g., many GPs for the baseline but few for others) degrades spatial accuracy, especially along the Mississippi River west of New Orleans, highlighting the need for balanced training data.
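The clustering-and-imputation step above can be sketched with scikit-learn. The array names, toy sizes, and cluster count below are illustrative stand-ins for the paper's roughly 80 000-GP setup, not its actual data:

```python
# Sketch of grid-point reduction: weighted-kNN imputation of dry nodes,
# then k-means clustering of GP features, keeping the GP nearest each centroid.
# All arrays are synthetic placeholders (toy sizes for demonstration).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
n_gp, n_feat, n_storms = 500, 6, 20              # toy sizes; paper uses ~80,000 GPs
gp_features = rng.normal(size=(n_gp, n_feat))    # coords, elevation, landscape, PCA vector
surge = rng.normal(size=(n_storms, n_gp))        # peak surge per storm per GP
surge[rng.random(surge.shape) < 0.05] = np.nan   # simulate dry (missing) nodes

# Impute missing values with distance-weighted k-nearest neighbors across GPs
surge_filled = KNNImputer(n_neighbors=5, weights="distance").fit_transform(surge.T).T

# Cluster grid points; the representative GP is the one nearest each centroid
k = 50                                           # number of representative GPs
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gp_features)
rep_idx = np.array([
    np.argmin(np.linalg.norm(gp_features - c, axis=1)) for c in km.cluster_centers_
])
print(rep_idx.shape)
```

The "fixed" scheme would reuse `rep_idx` from the baseline scenario for all scenarios, while the "flexible" scheme would rerun the clustering on each scenario's own features.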

Input‑feature reduction: The original model uses 12 variables: five storm parameters (central pressure P₀, forward speed V, radius of maximum wind Rₘₐₓ, landfall angle θ, landfall longitude), six GP‑specific landscape variables (latitude, longitude, Manning's n, canopy coefficient, surface roughness Z₀, topographic/bathymetric elevation), and one climate variable (mean sea level, MSL). Correlation analysis and feature‑importance metrics identify redundant variables among the storm parameters and landscape variables. Removing two of them reduces the dimensionality to 10 inputs without sacrificing predictive skill, while also cutting training time and mitigating over‑fitting.
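A minimal sketch of correlation-based pruning on a synthetic 12-column input matrix. The 0.9 threshold and the injected redundant column are illustrative; the paper's exact criteria and importance metrics are not reproduced here:

```python
# Drop one feature from each highly correlated pair (|r| > threshold).
# X is a synthetic stand-in for the 12-variable input matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
X[:, 5] = 0.98 * X[:, 2] + rng.normal(scale=0.05, size=200)  # inject redundancy

corr = np.abs(np.corrcoef(X, rowvar=False))
to_drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if corr[i, j] > 0.9 and i not in to_drop and j not in to_drop:
            to_drop.add(j)            # keep the earlier feature, drop the later
keep = [i for i in range(X.shape[1]) if i not in to_drop]
X_reduced = X[:, keep]
print(X_reduced.shape)
```

In practice this correlation filter would be combined with a model-based importance ranking (e.g., permutation importance) before deciding which member of a correlated pair to keep.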

Storm‑suite reduction: This dimension yields the largest cost savings. The authors explore clustering‑based representative storm selection and adaptive sampling that prioritize storms contributing most to the Joint Probability Method (JPM) integral used for hazard‑curve construction. An optimization algorithm minimizes the error in hazard‑curve integration, resulting in a reduced set of 60 storms (out of the original 90 used in the CMP2023 scenario set, and far fewer than the full 645). The reduced suite retains the statistical diversity of wind speed, pressure, track, and size, ensuring that the surrogate can still capture extreme surge responses.
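One way to realize such a selection is a greedy search that reweights a candidate subset and minimizes its mismatch with the full-suite JPM exceedance curve. The probabilities, surge values, and greedy scheme below are illustrative, not the paper's actual optimization algorithm:

```python
# Toy sketch of storm-suite reduction: greedily pick storms whose reweighted
# JPM-style exceedance curve best matches the full 90-storm curve.
import numpy as np

rng = np.random.default_rng(2)
n_storms = 90
peak_surge = rng.gamma(shape=2.0, scale=1.5, size=n_storms)  # per-storm peak surge (m)
prob = rng.dirichlet(np.ones(n_storms))                      # JPM storm probabilities
z = np.linspace(0.0, peak_surge.max(), 50)                   # hazard-curve surge levels

def exceedance(surges, weights):
    # P(surge > z): probability-weighted indicator sum, as in the JPM integral
    return np.array([(weights * (surges > zi)).sum() for zi in z])

target = exceedance(peak_surge, prob)
selected = []
for _ in range(60):                                          # reduced suite size
    best, best_err = None, np.inf
    for i in range(n_storms):
        if i in selected:
            continue
        idx = selected + [i]
        w = prob[idx] / prob[idx].sum()                      # renormalize weights
        err = np.sqrt(np.mean((exceedance(peak_surge[idx], w) - target) ** 2))
        if err < best_err:
            best, best_err = i, err
    selected.append(best)
print(len(selected))
```

A full implementation would also enforce coverage of the storm-parameter space (wind speed, pressure, track, size) so the reduced suite retains the statistical diversity the paragraph describes.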

Model architecture and evaluation: The surrogate is a feed‑forward neural network with four hidden layers (256 neurons each, ReLU activation) and a linear output layer predicting peak surge, trained with a learning rate of 0.001. The authors also test three additional ML algorithms (Random Forest, Gradient Boosting with XGBoost, and Support Vector Regression) to verify that the reduction strategy is algorithm‑agnostic. All models trained on the reduced dataset (5 000 GPs, 10 inputs, 60 storms; roughly 5 % of the original size) achieve a correlation coefficient (R) of 0.94 and RMSE comparable to those of models trained on the full dataset.
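A stand-in for the described architecture using scikit-learn's MLPRegressor (the paper's actual training framework is not specified in this summary); the data here are synthetic:

```python
# Feed-forward surrogate sketch: 4 hidden layers x 256 ReLU units,
# linear output, learning rate 0.001, as described above.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))                             # 10 reduced input features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=300)   # stand-in peak surge

model = MLPRegressor(
    hidden_layer_sizes=(256, 256, 256, 256),  # four hidden layers, 256 neurons each
    activation="relu",
    learning_rate_init=0.001,
    max_iter=200,
    random_state=0,
)
model.fit(X, y)
r = np.corrcoef(model.predict(X), y)[0, 1]    # correlation coefficient (R)
print(round(float(r), 2))
```

Swapping `MLPRegressor` for `RandomForestRegressor`, an XGBoost regressor, or `SVR` on the same reduced dataset mirrors the algorithm-agnostic comparison the authors perform.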

Validation and generalization: Training utilizes the baseline S00Y00 scenario (2020, static sea level) and two future climate pathways (S07 and S08) spanning 2030–2070, each with 90 storms. The unseen S09Y50 scenario (2070, extreme sea‑level rise) serves as an out‑of‑sample test, confirming that the reduced‑data surrogate maintains high fidelity under conditions not seen during training.

Practical implications: By cutting simulation and training costs by roughly 95 % while preserving predictive accuracy, the proposed framework enables coastal planners to rapidly update storm‑surge hazard curves as new climate projections, land‑use changes, or engineering interventions are introduced. The storm‑selection methodology also guides future numerical experiments, ensuring that each additional simulation adds maximal information for hazard‑curve refinement.

In summary, the paper delivers a robust, scalable approach for building ML‑based storm‑surge surrogates that can accommodate an expanding suite of future scenarios without overwhelming computational resources. Its three‑pronged reduction strategy—grid‑point clustering, feature pruning, and storm‑suite optimization—demonstrates that careful data curation can yield near‑full‑accuracy models at a fraction of the original cost, paving the way for more agile, data‑driven coastal risk management.

