"What is a realistic forecast?" Assessing data-driven weather forecasts, a journey from verification to falsification

"What is a realistic forecast?" Assessing data-driven weather forecasts, a journey from verification to falsification
Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

The artificial intelligence revolution is fueling a paradigm shift in weather forecasting: forecasts are generated with machine learning models trained on large datasets rather than with physics-based numerical models that solve partial differential equations. This new approach has proved successful in improving forecast performance as measured with standard verification metrics such as the root mean squared error. At the same time, the realism of data-driven weather forecasts is often questioned and considered an Achilles’ heel of machine learning models. How ‘forecast realism’ can be defined and how this forecast attribute can be assessed are the two questions simultaneously addressed here. Inspired by the seminal work of Murphy (1993) on the definition of ‘forecast goodness’, we identify three types of realism and discuss methodological paths for their assessment. In this framework, falsification arises as a complementary process to verification and diagnostics when assessing data-driven weather models.


💡 Research Summary

The paper addresses a timely and increasingly important question in modern meteorology: how should we define and assess the “realism” of data‑driven, machine‑learning (ML) weather forecasts? Building on Murphy’s (1993) classic framework for forecast goodness, the author proposes three distinct types of realism—functional, structural, and physical—and outlines concrete verification, diagnostic, and falsification methods for each.

Functional realism (type 1) concerns the closeness of an individual forecast to the corresponding observation. It is measured with a scoring function v(xᵢ, yᵢ) such as RMSE, MAE, CRPS, or other proper scoring rules. This is the traditional verification step, where each forecast instance receives a numerical error that can be averaged to produce an overall accuracy metric.
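To make this verification step concrete, here is a minimal Python sketch of such scoring functions. The function names and toy data are illustrative rather than taken from the paper; the CRPS uses the standard empirical kernel form, E|X − y| − ½·E|X − X′|.

```python
import numpy as np

def rmse(forecast, obs):
    """Root mean squared error over paired forecast/observation arrays."""
    return np.sqrt(np.mean((np.asarray(forecast) - np.asarray(obs)) ** 2))

def mae(forecast, obs):
    """Mean absolute error over paired forecast/observation arrays."""
    return np.mean(np.abs(np.asarray(forecast) - np.asarray(obs)))

def crps_ensemble(ens, obs):
    """Empirical CRPS for one ensemble forecast (1-D array of members)
    and a scalar observation: E|X - y| - 0.5 * E|X - X'|."""
    ens = np.asarray(ens)
    term1 = np.mean(np.abs(ens - obs))
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2

# Toy usage: score a 10-member ensemble against a single observation.
rng = np.random.default_rng(0)
members = rng.normal(loc=0.2, scale=1.0, size=10)
print(f"CRPS = {crps_ensemble(members, 0.0):.3f}")
```

Each such score is computed per forecast instance and averaged over many cases to yield the overall accuracy metric described above.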

Structural realism (type 2) refers to the statistical consistency between the forecast distribution and the observed distribution. Summary statistics—bias, variance, activity bias, power spectra, resolution, granularity, and sharpness—are compared via a diagnostic measure d(X, Y). This step evaluates reliability: whether the forecast can be taken at face value without additional statistical correction. Diagnostic tools also identify systematic model weaknesses (regional bias, insufficient ensemble spread, over‑smoothing) that guide model development.
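A minimal sketch of such a diagnostic comparison d(X, Y) is shown below, assuming forecasts and observations are stacked into (n_samples, n_gridpoints) arrays. The statistics chosen (mean bias, activity ratio, spectral power ratio) mirror those listed above, but the helper itself and its output format are hypothetical.

```python
import numpy as np

def diagnostics(forecasts, observations):
    """Compare distributional statistics of forecast and observed fields.
    Both inputs: arrays of shape (n_samples, n_gridpoints)."""
    stats = {
        "mean_bias": forecasts.mean() - observations.mean(),
        # Activity: overall variability; a ratio well below 1 is the
        # over-smoothing signature often reported for ML forecasts.
        "activity_ratio": forecasts.std() / observations.std(),
    }
    # Sample-averaged 1-D power spectra: damped power at the highest
    # wavenumbers is another symptom of over-smoothing.
    spec_f = np.mean(np.abs(np.fft.rfft(forecasts, axis=1)) ** 2, axis=0)
    spec_o = np.mean(np.abs(np.fft.rfft(observations, axis=1)) ** 2, axis=0)
    stats["high_wavenumber_power_ratio"] = spec_f[-10:].sum() / spec_o[-10:].sum()
    return stats

# Toy usage: a crude spatial smoother stands in for an over-smooth forecast.
rng = np.random.default_rng(1)
obs = rng.normal(size=(500, 128))
fc = 0.5 * (obs + np.roll(obs, 1, axis=1))
print(diagnostics(fc, obs))  # activity and high-wavenumber ratios < 1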

Physical realism (type 3) is the novel contribution of the paper. It asks whether a forecast is physically plausible given our scientific knowledge base K (the laws of physics, conservation principles, dynamical constraints). A falsification test f(Xᵢ, K) checks if the forecast violates any known physical rule. The author draws on Popperian falsification, reversing the usual direction: instead of observations falsifying a theory, a forecast is falsified if it lies outside the physically admissible space. Examples include violations of energy or mass conservation, unrealistic ageostrophic motions, or the “hallucinations” that generative AI models sometimes produce. Physical realism can be assessed through direct dynamical tests, conservation‑law checks, or expert subjective verification.
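The falsification test f(Xᵢ, K) can be pictured as a battery of constraint checks that a forecast must survive. The sketch below shows two illustrative ones: non-negativity of humidity, and near-constancy of global mean surface pressure as a proxy for dry-air mass conservation. The field names, dict layout, and tolerance are assumptions for illustration, not the paper's actual tests.

```python
import numpy as np

def falsify(state, tol=1e-3):
    """Return the list of physical constraints violated by one forecast.
    `state` is a dict of gridded fields; an empty list means the forecast
    survives this round of falsification (it is not thereby 'verified')."""
    violations = []
    # Admissibility: specific humidity must be non-negative everywhere.
    if np.any(state["specific_humidity"] < 0):
        violations.append("negative specific humidity")
    # Conservation: global mean surface pressure proxies total dry-air
    # mass and should stay nearly constant over the forecast window.
    p0 = state["surface_pressure_t0"].mean()
    p1 = state["surface_pressure_t1"].mean()
    if abs(p1 - p0) / p0 > tol:
        violations.append("global dry mass not conserved")
    return violations

# Toy usage: a field containing one hallucinated negative-humidity point.
state = {
    "specific_humidity": np.array([[0.010, -0.002], [0.008, 0.011]]),
    "surface_pressure_t0": np.full((2, 2), 101325.0),
    "surface_pressure_t1": np.full((2, 2), 101320.0),
}
print(falsify(state))  # ['negative specific humidity']
```

In the Popperian spirit described above, passing such checks never proves a forecast is physically realistic; it only means the forecast has not yet been falsified.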

The paper explores the relationships among the three types. In probabilistic forecasting, improving structural realism (reliability) generally improves functional realism because proper scoring rules decompose into reliability and resolution components. In deterministic settings, however, bias correction (first‑moment calibration) may improve the score while variance correction (second‑moment calibration) can degrade it—a phenomenon the author calls the accuracy‑activity trade‑off.
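The deterministic case is easy to reproduce with a toy experiment: damping an unbiased but noisy forecast toward climatology lowers its RMSE (the MSE-optimal damping factor in this setup is 0.5) while making it under-active. The sketch below is illustrative only and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
obs = rng.normal(0.0, 1.0, n)        # observed anomalies
raw = obs + rng.normal(0.0, 1.0, n)  # unbiased but noisy forecast
damped = 0.5 * raw                   # MSE-optimal damping toward climatology

for name, fc in [("raw", raw), ("damped", damped)]:
    rmse = np.sqrt(np.mean((fc - obs) ** 2))
    activity = fc.std() / obs.std()  # variability ratio; ~1 is ideal
    print(f"{name:7s} RMSE = {rmse:.3f}  activity = {activity:.3f}")

# Expected: damping improves RMSE (~0.71 vs ~1.00) but cuts the activity
# ratio from ~1.41 to ~0.71 -- a better functional score at the cost of
# structural realism.
```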

A thought experiment with a “perfect” forecast illustrates the hierarchy: a perfect type 1 forecast automatically yields perfect type 2 and type 3 realism; a perfect type 2 forecast ensures that forecasts are drawn from the same distribution as observations but does not guarantee instance‑by‑instance accuracy; a perfect type 3 forecast satisfies physical laws but may still be statistically biased or inaccurate.

Finally, the author emphasizes a fit‑for‑purpose perspective. Different users may prioritize different realism types: short‑range operational forecasters may value functional realism, climate modelers may care more about structural realism, and scientific research or safety‑critical applications may demand high physical realism. The proposed verification‑diagnostics‑falsification workflow provides a systematic way to assess and balance these requirements.

In summary, the paper offers a comprehensive, theoretically grounded framework for evaluating ML‑based weather forecasts, extending traditional verification to include model diagnostics and a physics‑based falsification step. By doing so, it addresses a key barrier to the operational adoption of data‑driven forecasts—concerns about realism—and provides practical guidance for developers, forecasters, and decision‑makers alike.

