Machine Learning-Ready Data Sets for the Analysis and Nowcasting of Atmospheric Radiation at Aviation Altitudes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Nowcasting and forecasting of the radiation environment in the Earth’s lower atmosphere are critical for the safety of aircraft and spacecraft crews and passengers. Currently, this problem is addressed by employing statistical and physics-based models that take into account particle transport and precipitation. However, given the increased number of radiation measurements available to the community, it is possible to start developing data-driven approaches. We prepared Machine Learning-ready (ML-ready) datasets to nowcast the effective dose rates at aviation altitudes. The presented datasets contain 92,476 individual measurements from 589 flights obtained by the Automated Radiation Measurements for Aerospace Safety (ARMAS) experiment from 2013 to 2023. The ARMAS measurements are augmented with the properties of the Geospace environment, such as solar soft X-ray and proton fluxes, solar wind properties, secondary cosmic ray neutrons, space weather indexes, and global solar activity indicators (such as daily sunspot number). ARMAS data are separated into three partitions, ensuring that (1) the data points from a single flight remain within the same partition, and (2) each partition samples the flight locations and Geospace environment conditions equally. Several versions of the datasets allow predictions based on point-in-time measurements and use up to 24 hours of Geospace parameter history. The test of the use case demonstrates a possibility of nowcasting ARMAS measurements with accuracies slightly better than the considered physics-based models. The publicly available ML-ready datasets could serve as the first step in data preparation for ML-driven nowcasting and forecasting of the radiation environment.


💡 Research Summary

The paper presents the first publicly available machine‑learning‑ready (ML‑ready) dataset for nowcasting and forecasting atmospheric radiation at aviation altitudes (approximately 8–17 km). The core of the dataset consists of 92,476 effective dose‑rate measurements collected by the Automated Radiation Measurements for Aerospace Safety (ARMAS) experiment during 589 commercial flights between June 2013 and December 2023. After rigorous quality control—including removal of non‑science data, electromagnetic interference periods, unrealistically high dose values (>50 µSv h⁻¹), and selection of the most reliable ARMAS device per flight based on Pearson correlation with the NAIRAS v3 model—the authors retained a clean set of measurements that span a wide geographic area (continental United States, Pacific Ocean, North Atlantic, Antarctica) and cover the declining phase of Solar Cycle 24, the solar minimum, and the rising phase of Solar Cycle 25.
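The quality-control steps described above can be illustrated with a minimal sketch. It is not the authors' code: the device ids, the data layout (pairs of measured and modeled dose rates per device), and the minimum of three retained points are assumptions made only for this example. Only the 50 µSv h⁻¹ threshold and the use of Pearson correlation against a model come from the summary.

```python
from math import sqrt

MAX_DOSE_RATE = 50.0  # µSv/h, the upper threshold quoted in the summary


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def select_device(flight, min_points=3):
    """Return (device_id, r) for the device that best tracks the model.

    `flight` maps a hypothetical device id to a list of
    (measured_dose, model_dose) pairs recorded during one flight.
    `min_points` is an assumed guard, not a value from the paper.
    """
    best_id, best_r = None, float("-inf")
    for device_id, pairs in flight.items():
        # Drop unrealistically high readings before correlating.
        kept = [(m, p) for m, p in pairs if m <= MAX_DOSE_RATE]
        if len(kept) < min_points:
            continue
        r = pearson([m for m, _ in kept], [p for _, p in kept])
        if r > best_r:  # comparisons with NaN are False, so NaN never wins
            best_id, best_r = device_id, r
    return best_id, best_r


# Made-up numbers for illustration only.
flight = {
    "dev_a": [(2.0, 2.1), (3.0, 3.2), (4.1, 4.0), (5.2, 5.1)],
    "dev_b": [(2.0, 2.1), (80.0, 3.2), (1.0, 4.0), (6.0, 5.1), (0.5, 3.0)],
}
best_id, best_r = select_device(flight)
```

Here `dev_b` loses its 80 µSv/h reading to the threshold filter, and its remaining points follow the model less closely than those of `dev_a`, so `dev_a` is selected.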

To make the data useful for data‑driven prediction, the ARMAS measurements are augmented with a comprehensive suite of geospace parameters:

  1. Neutron monitor counts from five stations (Oulu, Newark, South Pole, Thule, Izmir) corrected for pressure and efficiency, sampled at 5‑minute cadence.
  2. Solar‑wind properties (density, temperature, three components of velocity and magnetic field) from the OMNIWeb database, also at 5‑minute resolution.
  3. Energetic particle fluxes measured by GOES satellites: proton fluxes in seven energy channels (≥1 MeV to ≥100 MeV) and electron flux in the ≥2 MeV channel, with linear interpolation for gaps.
  4. Solar soft X‑ray fluxes (1‑8 Å and 0.5‑4 Å) from GOES, averaged over 1 minute.
  5. Geomagnetic activity indices (hourly Kp, Ap, Dst) from OMNIWeb.
  6. Long‑term solar activity indicators: hourly F10.7 radio flux, daily sunspot number, and solar polar magnetic fields.

All auxiliary data are interpolated to fill missing values, ensuring a continuous time series for each flight point. The authors also compute derived features such as lagged values and moving averages up to 24 hours prior to each ARMAS observation, enabling models to exploit temporal context.
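The gap filling and the history features can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the published preprocessing: the function names, the specific lags (1, 6, and 24 hours), the use of Unix-style seconds for time, and the choice to hold the edge value outside the series are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def drop_missing(times, values):
    """Remove samples whose value is None, keeping the two lists aligned."""
    pairs = [(t, v) for t, v in zip(times, values) if v is not None]
    return [t for t, _ in pairs], [v for _, v in pairs]


def interpolate_at(times, values, t):
    """Linearly interpolate a time-sorted series (no None values) at time t."""
    if t <= times[0]:
        return values[0]  # assumption: hold the first value before the series
    if t >= times[-1]:
        return values[-1]  # assumption: hold the last value after the series
    i = bisect_right(times, t)
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def history_features(times, values, t_obs, lags_h=(1, 6, 24), window_h=24):
    """Value at the observation time, lagged values, and a trailing mean."""
    feats = {"now": interpolate_at(times, values, t_obs)}
    for lag in lags_h:
        feats[f"lag_{lag}h"] = interpolate_at(times, values, t_obs - lag * 3600)
    # Trailing mean over the samples that fall inside the history window.
    lo = bisect_left(times, t_obs - window_h * 3600)
    hi = bisect_right(times, t_obs)
    window = values[lo:hi]
    feats[f"mean_{window_h}h"] = sum(window) / len(window) if window else feats["now"]
    return feats


# Hourly toy series (value equals the hour) with one missing sample at hour 10.
raw_times = [h * 3600 for h in range(49)]
raw_values = [float(h) for h in range(49)]
raw_values[10] = None
times, values = drop_missing(raw_times, raw_values)

filled = interpolate_at(times, values, 10 * 3600)
feats = history_features(times, values, t_obs=int(36.5 * 3600))
```

Because each ARMAS observation is matched to the auxiliary series by time, the same `history_features` call could be repeated for every geospace parameter to build one feature row per measurement.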

A key contribution is the partitioning strategy. The full dataset is split into three mutually exclusive subsets (training, validation, test) such that all measurements from a single flight belong to the same subset, preventing data leakage. Moreover, each subset is constructed to sample the multidimensional space of flight locations and geospace conditions uniformly, using hierarchical clustering and stratified sampling. This design guarantees that model performance evaluated on the test set reflects genuine generalization to unseen flights and space‑weather conditions.
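A simplified version of a flight-level split might look like the sketch below. It is a stand-in, not the procedure used in the paper: the `stratum` label (one coarse bin per flight, for example a region or activity class), the record keys, and the 70/15/15 fractions are assumptions, and no clustering step is included. What it does demonstrate is the core constraint that every measurement from a flight lands in exactly one partition.

```python
import random
from collections import defaultdict


def split_by_flight(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole flights to train/validation/test within each stratum.

    `records` is a list of dicts with hypothetical keys "flight_id" and
    "stratum"; the stratum is assumed to be constant within a flight.
    """
    flights_by_stratum = defaultdict(set)
    for rec in records:
        flights_by_stratum[rec["stratum"]].add(rec["flight_id"])

    rng = random.Random(seed)
    assignment = {}
    for stratum in sorted(flights_by_stratum):
        flights = sorted(flights_by_stratum[stratum])
        rng.shuffle(flights)
        cut1 = round(fractions[0] * len(flights))
        cut2 = cut1 + round(fractions[1] * len(flights))
        for i, flight_id in enumerate(flights):
            if i < cut1:
                assignment[flight_id] = "train"
            elif i < cut2:
                assignment[flight_id] = "validation"
            else:
                assignment[flight_id] = "test"

    # Route every record to the partition chosen for its flight.
    partitions = {"train": [], "validation": [], "test": []}
    for rec in records:
        partitions[assignment[rec["flight_id"]]].append(rec)
    return partitions


# Toy data: 20 flights in two made-up strata, five measurements each.
records = [
    {"flight_id": f"F{k:02d}", "stratum": "polar" if k % 2 else "mid_latitude", "dose": 1.0 + j}
    for k in range(20)
    for j in range(5)
]
parts = split_by_flight(records)
```

Splitting per stratum keeps each partition roughly balanced across the strata, while assigning by `flight_id` rather than by row is what prevents leakage between partitions.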

The authors demonstrate a simple nowcasting use case using a Gradient Boosting Regressor. Input features include the instantaneous ARMAS dose rate, the current geospace parameters, and up to 24 hours of historical values. Model performance is assessed with root‑mean‑square error (RMSE) and mean absolute error (MAE) against the ARMAS ground truth. Compared with the physics‑based NAIRAS v3 model, the ML approach achieves roughly a 5 % reduction in error, with the most pronounced gains in low‑geomagnetic‑cutoff regions where NAIRAS tends to under‑predict. This result indicates that the enriched, high‑quality dataset can capture subtle dependencies that are not fully represented in current physics‑based formulations.
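The two error metrics and the relative comparison between models can be written out directly. The sketch below uses made-up dose rates purely to show the arithmetic; the numbers are not results from the paper, and `relative_reduction` is a hypothetical helper name.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between measurements and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between measurements and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, candidate_error):
    """Fractional error reduction of a candidate relative to a baseline."""
    return (baseline_error - candidate_error) / baseline_error


# Illustrative values only (µSv/h); not taken from the paper.
measured = [2.0, 4.0, 6.0, 8.0]
physics_model = [2.5, 3.0, 7.0, 8.5]
ml_model = [2.2, 3.5, 6.5, 8.0]

physics_rmse = rmse(measured, physics_model)
ml_rmse = rmse(measured, ml_model)
gain = relative_reduction(physics_rmse, ml_rmse)
```

RMSE penalizes large misses more heavily than MAE, so reporting both helps distinguish a model that is slightly better everywhere from one that mainly avoids a few large errors.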

The complete dataset, together with metadata, preprocessing scripts, and example notebooks, is hosted on the Radiation Data Portal (https://dmlab.cs.gsu.edu/rdp/ml-dataset.html) in both CSV and Parquet formats. By providing an “out‑of‑the‑box” ML‑ready resource, the authors lower the barrier for the community to develop, benchmark, and deploy data‑driven radiation prediction models. They outline future directions, including deep‑learning time‑series architectures (LSTM, Transformer), multi‑task learning (simultaneous dose prediction and SEP event classification), and integration into real‑time aviation decision‑support systems.

In summary, this work delivers a meticulously curated, richly annotated, and openly accessible dataset that bridges the gap between abundant radiation measurements and modern machine‑learning techniques. It offers a solid foundation for advancing nowcasting and forecasting of atmospheric radiation, ultimately contributing to improved radiation risk management for crew, passengers, and space‑flight operations.

