Benchmarking CO$_2$ Storage Simulations: Results from the 11th Society of Petroleum Engineers Comparative Solution Project

The 11th Society of Petroleum Engineers Comparative Solution Project (shortened SPE11 herein) benchmarked simulation tools for geological carbon dioxide (CO$_2$) storage. A total of 45 groups from leading research institutions and industry across the globe signed up to participate, with 18 ultimately contributing valid results that were included in the comparative study reported here. This paper summarizes the SPE11. A comprehensive introduction and qualitative discussion of the submitted data are provided, together with an overview of online resources for accessing the full depth of data. A global metric for analyzing the relative distance between submissions is proposed and used to conduct a quantitative analysis of the submissions. This analysis attempts to statistically resolve the key aspects influencing the variability between submissions. The study shows that the major qualitative variation between the submitted results is related to thermal effects, dissolution-driven convective mixing, and resolution of facies discontinuities. Moreover, a strong dependence on grid resolution is observed across all three versions of the SPE11. However, our quantitative analysis suggests that the observed variations are predominantly influenced by factors not documented in the technical responses provided by the participants. We therefore identify that unreported variations due to human choices within the process of setting up, conducting, and reporting on the simulations underlying each SPE11 submission are at least as impactful as the computational choices reported.

💡 Research Summary

The paper presents the results of the 11th Society of Petroleum Engineers Comparative Solution Project (SPE 11), a benchmark designed to evaluate the performance and consistency of numerical simulators used for geological carbon‑dioxide (CO₂) storage. Forty‑five research and industry groups initially signed up, but only eighteen submitted complete, technically valid datasets that were included in the comparative analysis. The authors first describe the benchmark scenario – a typical deep saline aquifer injection case with defined injection rates, duration, initial pressure‑temperature conditions, and a heterogeneous geological model containing facies discontinuities.

All participants used their own simulation platforms, ranging from commercial reservoir simulators to in‑house finite‑difference and finite‑element codes. The reported technical information covered the choice of physical processes (thermal effects, CO₂‑water dissolution, convective mixing, capillary pressure models), numerical schemes (implicit/explicit time stepping, upwinding, solver tolerances), and grid specifications (cell size, refinement around faults, vertical layering). To enable a quantitative comparison, the authors constructed a Global Distance Metric (GDM). Each submission was interpolated onto a common spatio‑temporal grid, and the L2 norm of the differences in pressure, CO₂ saturation, temperature, and mass balance was computed with process‑specific weighting factors. This metric provides a single scalar distance that captures the overall deviation of any submission from any other.

Statistical analysis of the GDM values revealed three dominant sources of variability. First, simulations that included thermal coupling exhibited markedly different pressure and saturation fields because temperature‑dependent fluid properties (density, viscosity) altered plume buoyancy and mobility. Second, the treatment of CO₂‑water dissolution and the resulting solutal‑driven convection produced a wide spread in the timing and intensity of convective cells, which in turn affected long‑term storage efficiency by 5–15 %. Third, the resolution of geological facies discontinuities (e.g., fault planes, low‑permeability lenses) strongly influenced plume migration pathways; coarse representations allowed artificial high‑speed channels, inflating estimated storage volumes.

Grid resolution emerged as a consistent factor across all three benchmark versions. Halving the cell size reduced average pressure errors from roughly 0.3 % to 0.1 % and decreased saturation variance by about 30 %. However, when the authors performed multivariate regression and principal‑component analysis using only the documented modelling choices (thermal inclusion, dissolution model, capillary pressure function, etc.), these variables explained only a modest fraction of the total variance. Unreported “human” decisions—such as subtle adjustments of initial and boundary conditions, fine‑tuning of solver tolerances, and ad‑hoc smoothing of permeability fields—accounted for an estimated 40–55 % of the observed spread. This finding underscores the difficulty of achieving reproducibility when critical setup details are omitted from technical reports.

A post‑hoc audit of the raw grid files and log outputs confirmed that even when the same physical model was nominally selected, differences in code implementation (e.g., pressure‑saturation coupling algorithms, nonlinear convergence criteria) generated cumulative discrepancies that grew over the multi‑year simulation horizon. In long‑term runs extending to several thousand years, these small numerical divergences could translate into tens of percent differences in predicted residual CO₂ trapping.

The authors conclude that while the inclusion of thermal effects, dissolution‑driven convection, and accurate facies representation are essential for reducing inter‑submission variability, the impact of undocumented human choices is at least as large. They recommend that future comparative projects enforce stricter reporting standards, including detailed specifications of initial conditions, boundary treatments, solver settings, and any preprocessing steps applied to the geological model. Moreover, they advocate for the use of shared reference datasets and cross‑code verification exercises to isolate pure numerical differences from user‑driven variations. Implementing these measures would improve the reliability of CO₂ storage simulations and provide a more robust basis for risk assessment and regulatory decision‑making.

💡 Research Summary

📜 Original Paper Content