UT-GraphCast Hindcast Dataset: A Global AI Forecast Archive from UT Austin for Weather and Climate Applications

The UT-GraphCast Hindcast Dataset is a comprehensive global weather forecast archive covering 1979 to 2024, generated with the Google DeepMind GraphCast Operational model. Developed by researchers at The University of Texas at Austin under the WCRP umbrella, the dataset provides daily 15-day deterministic forecasts initialized at 00 UTC on an approximately 25 km global grid for a 45-year period. GraphCast is a physics-informed graph neural network trained on the ECMWF ERA5 reanalysis. It predicts more than a dozen key atmospheric and surface variables on 37 vertical levels and delivers a full medium-range forecast in under one minute on modern hardware.


💡 Research Summary

The paper presents the UT‑GraphCast Hindcast Dataset, a comprehensive archive of global deterministic weather forecasts spanning the period from 1979 to 2024. Produced by researchers at the University of Texas at Austin under the World Climate Research Programme (WCRP) umbrella, the dataset leverages the operational GraphCast model originally developed by Google DeepMind. GraphCast is a physics‑informed graph neural network (GNN) that represents the atmosphere as a graph of nodes (grid points) and learns atmospheric dynamics through message‑passing operations. The model was pre‑trained on the ECMWF ERA5 reanalysis, which provides a high‑quality, physically consistent reference for the atmospheric state.
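To make the phrase "message‑passing operations" concrete, the sketch below runs one hand‑written update on a four‑node ring. It is a toy illustration only and is not GraphCast's architecture: the function name `message_passing_step` and the fixed `self_weight` blend are hypothetical, whereas a trained GNN learns its message and update functions from data.

```python
# Toy message-passing step on a tiny graph.
# Illustrative only: NOT the GraphCast architecture; all names are hypothetical.

def message_passing_step(node_values, edges, self_weight=0.5):
    """Blend each node's value with the mean of the values sent by its neighbours.

    node_values: list of floats, one per node (e.g. a field sampled at grid points).
    edges: list of (sender, receiver) index pairs.
    self_weight: fixed blend weight (assumption); a trained GNN learns this mapping.
    """
    incoming = [[] for _ in node_values]
    for sender, receiver in edges:
        incoming[receiver].append(node_values[sender])

    updated = []
    for value, messages in zip(node_values, incoming):
        if messages:
            aggregated = sum(messages) / len(messages)  # mean aggregation
            updated.append(self_weight * value + (1.0 - self_weight) * aggregated)
        else:
            updated.append(value)  # a node with no incoming edges keeps its value
    return updated


# Four nodes on a ring, with edges in both directions.
ring_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 0), (0, 3)]
state = [1.0, 0.0, 0.0, 0.0]
state_after = message_passing_step(state, ring_edges)
print(state_after)
```

After one step the anomaly at node 0 has spread to its two neighbours while node 2, two hops away, is still unaffected. Stacking several such steps is what lets information travel across a graph.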

Key technical specifications:

  • Spatial resolution ≈ 25 km on a regular latitude‑longitude grid covering the entire globe.
  • Vertical dimension of 37 pressure levels, matching the count of the standard ERA5 pressure‑level set (1000 hPa to 1 hPa).
  • Forecast horizon of 15 days, issued daily at 00 UTC.
  • Predicts more than a dozen essential atmospheric and surface variables, including 2‑m temperature, 10‑m wind components, surface pressure, mean sea‑level pressure, temperature and humidity profiles, precipitation, and cloud cover.
  • Each 15‑day forecast is generated in under one minute on modern GPU hardware, enabling rapid production of large‑scale hindcasts.
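The specifications above imply a rough size for each forecast, which the following back‑of‑envelope sketch works out. Two inputs are assumptions rather than values stated in the summary: a 0.25° grid spacing (consistent with "≈ 25 km") and a 6‑hour autoregressive time step. Whether the archive stores every step and every level is not stated here, so the result is an upper‑bound style estimate.

```python
# Back-of-envelope sizing of one forecast.
# GRID_SPACING_DEG and STEP_HOURS are assumptions, not values from the dataset documentation.
GRID_SPACING_DEG = 0.25   # assumed; consistent with "approximately 25 km"
STEP_HOURS = 6            # assumed autoregressive time step
FORECAST_DAYS = 15        # from the specification list
PRESSURE_LEVELS = 37      # from the specification list

n_lat = int(180 / GRID_SPACING_DEG) + 1   # +1 because both poles are grid rows
n_lon = int(360 / GRID_SPACING_DEG)       # longitude wraps, so no extra column
points_per_level = n_lat * n_lon
steps_per_forecast = FORECAST_DAYS * 24 // STEP_HOURS
values_per_3d_variable = points_per_level * PRESSURE_LEVELS * steps_per_forecast

print("grid:", n_lat, "x", n_lon, "=", points_per_level, "points per level")
print("steps per forecast:", steps_per_forecast)
print("values per 3-D variable per forecast:", values_per_3d_variable)
```

Under these assumptions a single upper‑air variable already amounts to a few billion values per forecast, which is why the storage layout and any subsetting of levels or lead times matter for an archive of this length.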

The dataset creation workflow consists of three stages. First, the GraphCast model is fine‑tuned on ERA5 data to incorporate recent model improvements and to align the network with the target forecast period. Second, initial conditions are extracted from ERA5 for every day between 1 January 1979 and 31 December 2024 and fed into the trained GraphCast model, which generates one 15‑day deterministic forecast per initialization. Third, the output is stored in standard NetCDF files with comprehensive metadata, including variable names, units, grid specifications, and quality‑control flags.
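The second stage amounts to a loop over daily initialization times. The sketch below enumerates those times and the valid times of one forecast using only the Python standard library. The date range and the 15‑day horizon come from the text; the 6‑hour step and the helper names `init_times` and `valid_times` are assumptions made for illustration, and the model call itself is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

FIRST_INIT = datetime(1979, 1, 1, 0, tzinfo=timezone.utc)
LAST_INIT = datetime(2024, 12, 31, 0, tzinfo=timezone.utc)
FORECAST_DAYS = 15
STEP_HOURS = 6  # assumed output interval, not stated in the summary


def init_times(first, last):
    """Yield one 00 UTC initialization time per day, inclusive of both ends."""
    current = first
    while current <= last:
        yield current
        current += timedelta(days=1)


def valid_times(init, days=FORECAST_DAYS, step_hours=STEP_HOURS):
    """Return the valid times of a forecast started at `init` (lead time 0 excluded)."""
    n_steps = days * 24 // step_hours
    return [init + timedelta(hours=step_hours * k) for k in range(1, n_steps + 1)]


all_inits = list(init_times(FIRST_INIT, LAST_INIT))
first_valid = valid_times(all_inits[0])

print("number of initializations:", len(all_inits))
print("first forecast ends at:", first_valid[-1].isoformat())
```

Counting the initializations this way also gives a quick completeness check for a downloaded copy of the archive: the number of forecast files or records per variable should match the number of days in the range.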

Performance evaluation shows that GraphCast‑based hindcasts achieve a mean absolute error (MAE) reduction of roughly 10 % compared to traditional numerical weather prediction (NWP) reanalysis‑based hindcasts, and they score better on the continuous ranked probability score (CRPS, for which lower values are better) for precipitation and extreme events. The speed advantage (≈ 1 min per 15‑day forecast) makes the dataset suitable for both research and operational use cases that require massive ensembles or real‑time generation.
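For readers unfamiliar with the two metrics, the sketch below implements MAE and a standard ensemble estimator of the CRPS, E|X − y| − ½ E|X − X′|. It is not the authors' evaluation code and the sample numbers are made up. One point it makes visible: for a single deterministic forecast the spread term vanishes, so the CRPS reduces to the absolute error.

```python
# Minimal metric sketch; NOT the authors' evaluation code. Sample values are invented.

def mean_absolute_error(forecasts, observations):
    """Average of |forecast - observation| over paired samples."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(observations)


def ensemble_crps(members, observation):
    """Ensemble CRPS estimator: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread


mae = mean_absolute_error([2.0, 4.0], [1.0, 6.0])
crps_two_members = ensemble_crps([1.0, 3.0], 2.0)
crps_deterministic = ensemble_crps([3.0], 2.0)

print("MAE:", mae)
print("CRPS, two members:", crps_two_members)
print("CRPS, one member (equals absolute error):", crps_deterministic)
```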

Potential applications are extensive. Climate scientists can use the 45‑year time series to study long‑term variability, trends, and the frequency of extreme weather events. Machine‑learning practitioners gain a high‑resolution, multi‑variable training set for tasks such as downscaling, bias correction, data assimilation, and the development of hybrid AI‑NWP systems. Impact‑oriented sectors such as energy, agriculture, and disaster risk management can generate scenario ensembles for risk assessments, resource planning, and early‑warning system testing.

The authors acknowledge several limitations. Because GraphCast is trained on ERA5, any systematic biases present in the reanalysis may be inherited by the hindcasts. The graph topology, while efficient for mid‑latitude dynamics, can encounter connectivity challenges near the poles, potentially reducing forecast skill in polar regions. Moreover, the current variable set focuses on core atmospheric and surface fields; extensions to include chemistry, radiation, or oceanic variables would broaden the dataset’s utility for Earth system modeling.
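One general reason polar regions are awkward on a regular latitude‑longitude grid is that meridians converge, so east‑west spacing shrinks with the cosine of latitude. The sketch below quantifies this for an assumed 0.25° grid and a spherical Earth of radius 6371 km. It illustrates the geometry only and is not the authors' analysis of forecast skill near the poles.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius, spherical approximation
GRID_SPACING_DEG = 0.25    # assumed grid spacing, consistent with "approximately 25 km"


def zonal_spacing_km(lat_deg, spacing_deg=GRID_SPACING_DEG):
    """East-west distance between neighbouring grid points at a given latitude."""
    return EARTH_RADIUS_KM * math.radians(spacing_deg) * math.cos(math.radians(lat_deg))


for lat in (0, 45, 60, 80, 89):
    print(f"latitude {lat:2d} deg: {zonal_spacing_km(lat):6.2f} km between grid columns")
```

Near the equator the spacing is close to 28 km, while one degree from the pole it falls below 1 km, so the same number of grid columns describes a far smaller physical area.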

In conclusion, the UT‑GraphCast Hindcast Dataset represents a novel fusion of state‑of‑the‑art AI forecasting with a long‑term, high‑resolution historical archive. By providing open, freely accessible data, the authors aim to accelerate research across climate science, AI‑driven weather prediction, and applied domains that rely on accurate, granular weather information. The dataset is released under a permissive license, encouraging unrestricted download, analysis, and redistribution.

