TCBench: A Benchmark for Tropical Cyclone Track and Intensity Forecasting at the Global Scale
TCBench is a benchmark for evaluating global, short- to medium-range (1–5 day) forecasts of tropical cyclone (TC) track and intensity. To allow a fair and model-agnostic comparison, TCBench builds on the IBTrACS observational dataset and formulates TC forecasting as predicting the time evolution of an existing tropical system conditioned on its initial position and intensity. TCBench includes state-of-the-art dynamical (TIGGE) and neural weather models (AIFS, Pangu-Weather, FourCastNet v2, GenCast). Where tracks are not readily available, baselines are derived consistently from model outputs using the TempestExtremes library. For evaluation, TCBench provides deterministic and probabilistic storm-following metrics. On 2023 test cases, neural weather models skillfully forecast TC tracks, while skillful intensity forecasts require additional steps such as post-processing. Designed for accessibility, TCBench helps AI practitioners tackle domain-relevant TC challenges and equips tropical meteorologists with data-driven tools and workflows to improve prediction and TC process understanding. By lowering barriers to reproducible, process-aware evaluation of extreme events, TCBench aims to democratize data-driven TC forecasting.
💡 Research Summary
TCBench is introduced as a comprehensive benchmark for global tropical cyclone (TC) forecasting, covering both track (latitude‑longitude) and intensity (maximum sustained wind speed Vmax and minimum sea‑level pressure pmin) over a short‑to‑medium range of 1–5 days. The benchmark builds on the International Best Track Archive for Climate Stewardship (IBTrACS) as the ground‑truth observational dataset, retaining only the standard 6‑hourly timestamps (00, 06, 12, 18 UTC) to ensure temporal consistency.
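The 6-hourly subsetting described above is a simple timestamp filter. The sketch below illustrates the idea with pandas on a toy, IBTrACS-style record; the column names (`time`, `vmax`) are illustrative, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical storm record with 3-hourly fixes (IBTrACS mixes reporting
# frequencies; only the synoptic times are retained by the benchmark).
track = pd.DataFrame({
    "time": pd.date_range("2023-09-01 00:00", periods=9, freq="3h"),
    "vmax": [35, 38, 40, 45, 48, 50, 55, 58, 60],  # max sustained wind, kt
})

# Keep only the standard synoptic timestamps: 00, 06, 12, 18 UTC.
synoptic = track[track["time"].dt.hour.isin([0, 6, 12, 18])].reset_index(drop=True)
print(synoptic["time"].dt.hour.tolist())  # → [0, 6, 12, 18, 0]
```

Dropping off-synoptic fixes this way keeps every storm on the same temporal grid, so forecasts initialized or verified at synoptic times can be matched one-to-one with observations.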
TCBench integrates forecasts from state-of-the-art physics-based global ensemble systems (the TIGGE archive, including GEFS and IFS) and several leading neural weather models: AIFS, Pangu-Weather, FourCastNet v2, and GenCast. To guarantee a fair, model-agnostic comparison, all model outputs are processed through a unified pipeline: tracks and intensities are extracted using the TempestExtremes and HuracanPy libraries, then standardized to the IBTrACS identifier and unit conventions. The benchmark requires at least two daily initializations (00 and 12 UTC), 6-hourly output, and a minimum forecast horizon of five days.
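The three submission requirements (twice-daily initializations at 00 and 12 UTC, 6-hourly output, a horizon of at least five days) amount to a straightforward compliance check. The function below is a minimal sketch of such a check, not the benchmark's actual validation code.

```python
from datetime import datetime

def meets_requirements(init_times, lead_hours):
    """Check a candidate forecast set against the stated minimums:
    initializations at 00 and 12 UTC, 6-hourly output, and a forecast
    horizon of at least five days (120 h). Illustrative sketch only."""
    has_both_inits = {0, 12} <= {t.hour for t in init_times}
    six_hourly = all(h % 6 == 0 for h in lead_hours)
    five_day_horizon = max(lead_hours) >= 120
    return has_both_inits and six_hourly and five_day_horizon

# A compliant example: 00 and 12 UTC runs, leads 0, 6, ..., 120 h.
inits = [datetime(2023, 9, 1, 0), datetime(2023, 9, 1, 12)]
leads = list(range(0, 126, 6))
print(meets_requirements(inits, leads))  # → True
```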
Evaluation is multi-faceted. Deterministic metrics include mean absolute track error (km) and intensity error (m s⁻¹ for Vmax, hPa for pmin). Probabilistic "storm-following" scores such as the Continuous Ranked Probability Score (CRPS) and the Brier score assess the quality of ensemble forecasts. In addition, rapid intensification (RI) is treated as a binary classification problem: a storm is labeled RI if Vmax increases by ≥30 kt (≈15.4 m s⁻¹) within 24 h, corresponding to roughly the 95th percentile of intensification rates. Classification performance is reported via ROC-AUC, F1-score, and precision-recall curves, making the benchmark directly relevant to operational warning systems.
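The core quantities above are compact enough to write down directly. The sketch below shows, under stated assumptions and without claiming to reproduce TCBench's evaluation code, (1) track error as great-circle distance, (2) the empirical ensemble CRPS, and (3) the ≥30 kt/24 h RI label.

```python
import math

def track_error_km(lat_f, lon_f, lat_o, lon_o, radius_km=6371.0):
    """Haversine great-circle distance (km) between forecast and observed
    storm centers — the basis of mean absolute track error."""
    phi_o, phi_f = math.radians(lat_o), math.radians(lat_f)
    dphi = math.radians(lat_f - lat_o)
    dlmb = math.radians(lon_f - lon_o)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi_o) * math.cos(phi_f) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def crps_ensemble(members, obs):
    """Empirical CRPS for an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    n = len(members)
    term1 = sum(abs(m - obs) for m in members) / n
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return term1 - term2

def is_rapid_intensification(vmax_kt, dt_hours=6, threshold_kt=30, window_h=24):
    """Label a storm RI if Vmax rises by >= 30 kt within any 24 h window.
    `vmax_kt` is a 6-hourly series of maximum sustained winds in knots."""
    steps = window_h // dt_hours
    return any(vmax_kt[i + steps] - vmax_kt[i] >= threshold_kt
               for i in range(len(vmax_kt) - steps))

# One degree of latitude is ~111 km:
print(round(track_error_km(11.0, 0.0, 10.0, 0.0)))      # → 111
print(is_rapid_intensification([50, 55, 60, 70, 85]))   # 85-50 = 35 kt/24 h → True
```

Lower CRPS is better; for a single-member "ensemble" it reduces to absolute error, which makes it a convenient common yardstick for deterministic and probabilistic models alike.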
The authors train and validate on TC seasons 2017‑2022 and evaluate on the 2023 season. Results show that neural weather models consistently outperform the physics‑based ensembles in track prediction, achieving average errors of 80–120 km across the 1‑5‑day lead times. FourCastNet v2 and GenCast, especially when used in ensemble mode, lead the pack. However, intensity forecasts from neural models exhibit systematic bias and larger errors than the physics baselines. Post‑processing that incorporates oceanic predictors (e.g., sea‑surface temperature, ocean heat content) or applies statistical‑dynamical corrections akin to the SHIPS framework markedly improves intensity skill. For RI prediction, neural models demonstrate modest discriminative ability but still lag behind dedicated statistical‑dynamical schemes such as SHIPS‑RII.
TCBench is fully open‑source. The repository provides data ingestion scripts, preprocessing utilities, evaluation code, and visualization tools (built on the troPYcal package). Researchers can submit new model outputs to a public leaderboard, compare against provided baselines, and extend the benchmark with additional predictors or basin‑specific modules via a well‑documented API.
Limitations acknowledged by the authors include the focus on a single test year (2023), the absence of long‑range (>5 days) verification, and the current lack of direct ocean‑atmosphere coupling in the neural models, which constrains intensity forecasting. Nevertheless, TCBench establishes a reproducible, extensible platform that bridges AI research and operational meteorology, offering a clear pathway for future advances in data‑driven TC prediction.