Comparison of generative algorithms for conceptual groundwater modeling of coastal volcanic aquifer features with disparate, sparse and extremely imbalanced data
In coastal volcanic aquifers, the reliability of freshwater-seawater exchange simulations is governed by the accuracy of the conceptual groundwater model (CGM). Traditional CGMs are constructed by qualitatively combining independent hydrogeophysical features, limiting their ability to capture the complexity of volcanic terrains. To integrate these disparate, sparse, and imbalanced features, we propose an AI-assisted workflow. First, the self-organizing map (SOM) is applied to estimate a deterministic set of transdisciplinary features called the reference model. Second, generative algorithms are applied to the reference model, and empirical distributions are constructed, to obtain sets of stochastic point clouds called the site model. Data quality metrics identify the preferred generative algorithm, whose set of stochastic features is mapped using SOM to the groundwater model grid and assigned as the stochastic CGM. The proposed algorithm is applied to extremely imbalanced multiclass features and multiple discrete numerical features observed at the Halawa-Moanalua aquifer, Oahu, Hawaii. At this site, the Copula Generative Adversarial Network is deemed the preferred generative algorithm, and its set of stochastic transdisciplinary features represents the Halawa-Moanalua CGM. The simulated spatial geologic units correspond to published surface maps, and the simulated conductance, temperature, and barometric pressure profiles correlate with those measured at deep monitoring wells. Inspecting the 3-dimensional conductance models reveals groundwater flow and discharge driven by the aquifer hydraulic gradient, freshwater pumping, seawater intrusion induced by onshore withdrawals, and preferred pathways for freshwater-seawater exchange, such as landward intrusion of seawater and seaward discharge of freshwater.
Research Summary
The paper addresses a fundamental challenge in coastal volcanic aquifer modeling: the integration of highly disparate, sparse, and extremely imbalanced hydro‑geophysical data into a reliable conceptual groundwater model (CGM). Traditional CGMs are built by qualitatively stitching together independent datasets, which often fails to capture the complex three‑dimensional heterogeneity of volcanic terrains. To overcome this, the authors propose an AI‑assisted workflow that combines a self‑organizing map (SOM) with several state‑of‑the‑art generative algorithms, and they apply the workflow to the Halawa‑Moanalua (H‑M) aquifer on Oʻahu, Hawaiʻi.
Workflow Overview
- Data Cube Construction – All available categorical (geologic units, confinement status, ocean/open‑water flag) and continuous (resistivity, gravity, temperature, pressure, etc.) observations are compiled into a unified “data cube”. Categorical variables are one‑hot encoded; continuous variables are log‑transformed and augmented to mitigate severe class imbalance and spatial skewness.
- Reference Model via SOM – A two‑stage SOM is trained. The first, unsupervised stage learns the intrinsic topology of the high‑dimensional data, projecting it onto a 2‑D lattice while preserving neighborhood relationships. The second, supervised stage links each SOM node to target aquifer properties, using node‑based regression for continuous variables and probabilistic class assignment for categorical variables. The result is a deterministic, data‑driven reference model that captures the dominant multivariate structure of the aquifer.
- Generative Modeling (Site Model) – Four generative approaches are evaluated: Conditional Tabular GAN (CTGAN), Tabular Variational AutoEncoder (TVAE), Tabular Gaussian Copula (TGC), and Copula Generative Adversarial Network (CopulaGAN). Each model learns the joint distribution of the reference‑model variables and generates synthetic “site” samples. Model performance is quantified with a suite of data‑quality metrics, including Kolmogorov‑Smirnov distance, Wasserstein distance, marginal distribution fidelity, and preservation of inter‑variable correlations. CopulaGAN emerges as the best performer, especially in retaining rare geological classes and extreme values.
- Stochastic CGM Construction – Synthetic site samples from the preferred CopulaGAN are re‑projected onto the SOM grid, assigning probabilistic geological, geophysical, engineering, and water‑quality attributes to each cell of a numerical groundwater model grid. This yields a stochastic CGM that can be sampled repeatedly to assess uncertainty.
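The generator-selection step above relies on distribution-level metrics such as the Kolmogorov-Smirnov and Wasserstein distances. The following is a minimal sketch of how two of those metrics could be computed for a single numerical feature and used to rank candidate generators. It is not the authors' implementation: the function names, the candidate labels `model_a` and `model_b`, and the toy values are all hypothetical, and the sketch ignores categorical features and inter-variable correlations.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The supremum of |F_a - F_b| is attained at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def wasserstein_1d(real, synthetic):
    """1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    points = sorted(set(a + b))
    total = 0.0
    # Both CDFs are constant between consecutive sample points.
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def rank_generators(real, candidates):
    """Order candidate names by (KS, Wasserstein); smaller means closer to `real`."""
    scores = {
        name: (ks_statistic(real, sample), wasserstein_1d(real, sample))
        for name, sample in candidates.items()
    }
    return sorted(scores, key=scores.get), scores


# Hypothetical toy data: one reference feature and two synthetic samples.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "model_a": [1.0, 2.0, 3.0, 4.0, 6.0],  # close to the reference
    "model_b": [3.0, 3.0, 3.0, 3.0, 3.0],  # collapses to a single value
}
ranking, scores = rank_generators(reference, candidates)
print(ranking, scores)
```

In this toy case `model_a` ranks first because its empirical CDF departs from the reference only in the upper tail, whereas `model_b` has lost all spread. A real comparison would aggregate such scores over every feature and add checks for rare classes and correlations, as the summary describes.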
Results
- The stochastic geological units generated by the workflow align closely with published surface geological maps of the H‑M aquifer.
- Simulated conductance, temperature, and barometric pressure profiles correlate with those measured in deep monitoring wells, supporting the physical plausibility of the model.
- Inspection of the 3‑D conductance models reveals groundwater flow and discharge driven by the hydraulic gradient, freshwater pumping, and seawater intrusion induced by on‑shore withdrawals. Preferred pathways for freshwater‑seawater exchange, such as landward seawater intrusion and seaward freshwater discharge, are identifiable.
Key Contributions
- Demonstrates a quantitative, reproducible pipeline for integrating highly imbalanced, multimodal hydro‑geophysical data.
- Shows that SOM can serve both as a dimensionality‑reduction tool and as a bridge between deterministic reference models and stochastic generative models.
- Provides a systematic comparative assessment of modern tabular generative models, establishing CopulaGAN as the most suitable for this class of environmental datasets.
- Delivers a 3‑D stochastic CGM that not only matches independent observations but also offers a probabilistic framework for uncertainty propagation in groundwater simulations.
Limitations and Future Work
- SOM’s ability to preserve complex non‑linear relationships may be limited; alternative manifold learning techniques (e.g., UMAP, t‑SNE) could be explored.
- The training dataset remains relatively small, especially for deep borehole measurements, raising concerns about over‑fitting in the generative stage.
- Physical constraints (e.g., Darcy’s law linking conductance and permeability) are not explicitly enforced in the generative models; integrating physics‑informed GANs or Bayesian networks could improve realism.
- Extending the workflow to incorporate time‑varying data (e.g., seasonal water‑level fluctuations) would broaden its applicability to transient groundwater management scenarios.
Overall, the study provides a compelling proof‑of‑concept that modern generative AI, when coupled with unsupervised clustering, can transform the way conceptual groundwater models are built for data‑poor, highly heterogeneous volcanic aquifers.