ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning
Research Summary
ClimateSet addresses a critical gap in climate machine-learning research by providing a large-scale, consistent, and ML-ready dataset that combines inputs from Input4MIPs with outputs from CMIP6 for 36 Earth system models. The authors first motivate the need for such a resource: traditional climate simulations are computationally expensive, and most existing ML studies rely on a single climate model, limiting both the volume of training data and the ability to capture inter-model uncertainty. While datasets such as ClimateBench, WeatherBench, and EarthNet2021 have advanced ML for weather forecasting or single-model climate emulation, none offers a multi-model, scenario-rich climate archive suitable for policy-relevant projections.
The core contribution is a modular data pipeline that automatically downloads, harmonizes, and preprocesses CMIP6 ScenarioMIP outputs (surface temperature and precipitation) and Input4MIPs forcing fields (CO₂, CH₄, SO₂, and black carbon). The pipeline aligns spatial resolution (a nominal 250 km grid), temporal frequency (monthly), calendar conventions, and units across all models, producing a ready-to-use tensor format. The resulting core dataset covers five scenarios (historical plus SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5) for the period 2015–2100, with as many ensemble members as each model provides.
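To make the harmonization step concrete, here is a minimal sketch of two typical unit conversions and the final stacking into an ML-ready tensor. The function names, toy grid, and constant values are illustrative assumptions, not the pipeline's actual code; in CMIP6, precipitation (`pr`) is stored as a flux in kg m⁻² s⁻¹ and surface air temperature (`tas`) in kelvin.

```python
import numpy as np

# Hypothetical sketch of one harmonization step (not ClimateSet's real code).
# A 1 kg m^-2 layer of water is 1 mm deep, so a flux in kg m^-2 s^-1 becomes
# mm/day when multiplied by the number of seconds per day.
SECONDS_PER_DAY = 86_400

def pr_flux_to_mm_per_day(pr_flux):
    """Convert a CMIP6 precipitation flux (kg m-2 s-1) to mm/day."""
    return np.asarray(pr_flux, dtype=float) * SECONDS_PER_DAY

def kelvin_to_celsius(tas):
    """Convert surface air temperature from kelvin to degrees Celsius."""
    return np.asarray(tas, dtype=float) - 273.15

# Toy monthly fields on a coarse lat x lon grid, standing in for model output.
months, n_lat, n_lon = 12, 72, 144
tas = np.full((months, n_lat, n_lon), 288.15)  # K
pr = np.full((months, n_lat, n_lon), 3e-5)     # kg m-2 s-1

# Stack the harmonized variables into one (time, channel, lat, lon) tensor,
# the kind of ready-to-use layout the summary describes.
tensor = np.stack([kelvin_to_celsius(tas), pr_flux_to_mm_per_day(pr)], axis=1)
print(tensor.shape)  # (12, 2, 72, 144)
```

The same pattern extends to calendar and grid alignment: every variable is mapped onto a shared monthly time axis and common grid before stacking, so downstream models never see model-specific conventions.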
To demonstrate utility, the authors benchmark several state-of-the-art ML architectures (e.g., ConvLSTM, U-Net, Transformer-based models) on a climate-emulation task. They compare two training regimes: (i) model-specific emulators trained on a single climate model, and (ii) a "super-emulator" trained jointly on all 36 models. Results show that the super-emulator achieves lower root-mean-square error (a 10–15 % improvement) and better generalization to unseen SSP scenarios, especially for precipitation, where variability is high. Moreover, the multi-model training captures inter-model variability, enabling uncertainty quantification that is essential for policy makers.
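The comparison above rests on an RMSE-style skill score between emulated and simulated fields. The sketch below shows one common variant, latitude-weighted RMSE (weighting by cos(latitude) so oversampled polar cells do not dominate); the function name, shapes, and synthetic data are assumptions for illustration, not the authors' exact benchmark code.

```python
import numpy as np

def latitude_weighted_rmse(pred, target, lats):
    """RMSE over (time, lat, lon) fields, weighted by cos(latitude)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    w = np.cos(np.deg2rad(lats))
    w = w / w.mean()  # normalize so unweighted and weighted scales match
    sq_err = (pred - target) ** 2 * w[None, :, None]
    return float(np.sqrt(sq_err.mean()))

# Synthetic stand-ins: a "truth" field plus two emulators with different
# error levels, mimicking an out-of-distribution single-model emulator vs.
# a jointly trained super-emulator.
rng = np.random.default_rng(0)
lats = np.linspace(-88.75, 88.75, 72)
truth = rng.normal(size=(12, 72, 144))
single_model_pred = truth + rng.normal(scale=0.5, size=truth.shape)
super_emulator_pred = truth + rng.normal(scale=0.4, size=truth.shape)

print(latitude_weighted_rmse(single_model_pred, truth, lats)
      > latitude_weighted_rmse(super_emulator_pred, truth, lats))
```

With many models contributing predictions, the same metric evaluated per model also yields the spread used for the uncertainty quantification the summary mentions.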
The paper also outlines a broad set of downstream applications: downscaling to higher spatial resolution, extreme-event prediction under different warming pathways, training large climate-focused AI models, and rapid scenario testing for decision support. The authors emphasize the extensibility of ClimateSet: users can add more models, variables, vertical levels, finer spatial/temporal grids, or additional forcing agents as long as the data exist on the ESGF repository. The pipeline and code are publicly released on GitHub, and the core dataset is hosted via the Digital Research Alliance of Canada.
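Extending the dataset amounts to enlarging the request the pipeline resolves against ESGF. The snippet below is a hypothetical configuration sketch: the field names and schema are invented for illustration (the real pipeline has its own config format), but the CMIP6 identifiers (`tas`, `pr`, `ssp126`, etc.) are standard.

```python
from itertools import product

# Hypothetical config sketch (illustrative schema, not ClimateSet's own).
# Model names, scenario codes, and variable IDs follow CMIP6 conventions.
config = {
    "models": ["NorESM2-LM", "MPI-ESM1-2-HR"],
    "scenarios": ["historical", "ssp126", "ssp245", "ssp370", "ssp585"],
    "variables": ["tas", "pr"],  # extend with further ESGF-hosted variables
    "frequency": "mon",
}

def download_tasks(cfg):
    """Enumerate (model, scenario, variable) requests to resolve on ESGF."""
    return list(product(cfg["models"], cfg["scenarios"], cfg["variables"]))

tasks = download_tasks(config)
print(len(tasks))  # 2 models x 5 scenarios x 2 variables = 20
```

Because the request is just a cartesian product over the config lists, adding a variable or scenario is a one-line change, which is what makes the extensibility claim practical.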
Limitations are candidly discussed. Currently, only surface temperature and precipitation are provided, and the forcing fields are limited to four agents. The 250 km resolution may be insufficient for regional impact studies, and the exclusion of the low-forcing SSP1-1.9 scenario reduces coverage of optimistic pathways. Not all models contain every scenario or ensemble member, leading to some data imbalance. The authors suggest future work to incorporate more variables (e.g., wind, humidity), higher resolutions, additional aerosol species, and integration with observational or satellite datasets.
In conclusion, ClimateSet delivers a much-needed infrastructure that bridges climate science and machine learning, offering a scalable training corpus, a reproducible preprocessing workflow, and a benchmark for multi-model emulation. By enabling "super-emulators" that learn from many climate models simultaneously, it opens the door to faster, uncertainty-aware climate projections that can directly inform policy and societal decision-making.