Using Machine Learning to Discern Eruption in Noisy Environments: A Case Study using CO2-driven Cold-Water Geyser in Chimayo, New Mexico


We present an approach based on machine learning (ML) to distinguish eruption and precursory signals of the Chimayó geyser (New Mexico, USA) in noisy environments. This geyser can be considered a natural analog of $\mathrm{CO}_2$ intrusion into shallow water aquifers. By studying this geyser, we can understand the upwelling of $\mathrm{CO}_2$-rich fluids from depth, which is relevant to leak monitoring in a $\mathrm{CO}_2$ sequestration project. ML methods such as Random Forests (RF) are known to be robust multi-class classifiers that perform well under unfavorable noisy conditions. However, the accuracy of the RF method is poorly understood for this $\mathrm{CO}_2$-driven geysering application. The current study aims to quantify the performance of RF classifiers in discerning the geyser state. Toward this goal, we first present the data collected from the seismometer installed near the Chimayó geyser. The seismic signals collected at this site contain different types of noise, such as daily temperature variations, seasonal trends, animal movement near the geyser, and human activity. We then remove this noise from the signals by combining a Butterworth high-pass filter and an autoregressive method in a multi-level fashion. We show that combining these filtering techniques hierarchically reduces the noise in the seismic data without removing the precursor and eruption signals. Finally, we use RF on the filtered data to classify the state of the geyser into three classes: remnant noise, precursor, and eruption. We show that the classification accuracy using RF on the filtered data is greater than 90%. These aspects make the proposed ML framework attractive for event discrimination and signal enhancement under noisy conditions, with strong potential for application to monitoring leaks in $\mathrm{CO}_2$ sequestration.


Research Summary

This paper presents a comprehensive machine‑learning framework for distinguishing eruption, precursor, and background noise states of the CO₂‑driven cold‑water Chimayo geyser in New Mexico, using noisy seismic recordings. The authors collected 18 days of continuous 200 Hz seismic data from a station located ~3 m from the well, with 14 days allocated for training and 4 days for testing. Ground‑truth eruption times were obtained from a motion‑sensor camera that captured images whenever CO₂‑rich fluid was expelled. The raw seismic signal is heavily contaminated by daily temperature cycles, seasonal trends, wind, rain, and anthropogenic/animal activity, making simple threshold‑based detection infeasible.

To address this, the authors devised a two‑stage denoising pipeline. First, a Butterworth high‑pass filter removes low‑frequency seasonal and diurnal components. Second, an autoregressive (AR) model of order p (selected via the Akaike Information Criterion) is trained on a 5 % subset of the data that contains only background noise. The AR model predicts the stationary noise component, and the prediction error (original signal minus AR prediction) is taken as the filtered signal. This “prediction‑error filter” effectively suppresses stationary noise while preserving the non‑stationary precursor and eruption signatures.
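The two-stage pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the 1 Hz cutoff, the filter order, the maximum AR order, and all function names are assumptions chosen for the example, and the AR coefficients are fitted by ordinary least squares, which may differ from the estimator used in the paper.

```python
import numpy as np
from scipy.signal import butter, lfilter, sosfiltfilt


def highpass(x, fs, cutoff_hz=1.0, order=4):
    """Stage 1: zero-phase Butterworth high-pass (cutoff and order are assumed)."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)


def lag_matrix(x, p):
    """Row for time t holds [x[t-1], ..., x[t-p]], for t = p .. len(x)-1."""
    return np.column_stack([x[p - k : len(x) - k] for k in range(1, p + 1)])


def fit_ar(noise, p):
    """Least-squares AR(p) fit; returns the coefficients and an AIC value."""
    X, y = lag_matrix(noise, p), noise[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    aic = len(y) * np.log(np.mean(resid**2)) + 2 * p
    return coeffs, aic


def select_ar(noise, max_order=10):
    """Pick the AR order with the lowest AIC (max_order is an assumption)."""
    fits = [fit_ar(noise, p) for p in range(1, max_order + 1)]
    return min(fits, key=lambda fit: fit[1])[0]


def prediction_error(x, coeffs):
    """Stage 2: signal minus its AR prediction (first p samples left at zero)."""
    p = len(coeffs)
    out = np.zeros_like(x)
    out[p:] = x[p:] - lag_matrix(x, p) @ coeffs
    return out


# Synthetic one-minute record at 200 Hz: stationary AR(2) background noise,
# a slow drift, and a short transient standing in for an eruption signal.
fs = 200
rng = np.random.default_rng(0)
n = 60 * fs
t = np.arange(n) / fs
background = lfilter([1.0], [1.0, -1.5, 0.7], rng.normal(size=n))
drift = 20.0 * np.sin(2 * np.pi * 0.01 * t)
burst = np.zeros(n)
burst[8000:8400] = rng.normal(scale=5.0, size=400)
raw = background + drift + burst

hp = highpass(raw, fs)
coeffs = select_ar(hp[: n // 20])  # train on the first 5 %, which is noise only
filtered = prediction_error(hp, coeffs)
```

Because the AR model is fitted only to background noise, it predicts the stationary component well and the transient poorly, so the transient stands out in the prediction error.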

After denoising, the signal is segmented into 1‑minute sliding windows (12 000 samples each). For each window, the authors extract more than 700 time‑series features using the Python library tsfresh, covering statistical moments, entropy, trend, spectral density, FFT coefficients, wavelet transforms, ARIMA parameters, etc. Feature relevance testing reduces the set to roughly 100 informative descriptors; the top ten include partial autocorrelation lag, several FFT coefficients, spectral centroid, variance, skewness, kurtosis of the absolute Fourier spectrum, aggregated linear trend, and change quantiles. This dimensionality reduction mitigates the curse of dimensionality and yields a compact feature vector suitable for tree‑based classifiers.
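To make the windowing and feature step concrete, here is a small sketch using NumPy only. It is a stand-in for illustration and does not use or reproduce the tsfresh API: the four hand-written features, the 50 % window overlap, and the function names are all assumptions, since the summary does not state the stride or list every descriptor.

```python
import numpy as np

FS = 200            # sampling rate in Hz, from the summary
WINDOW = 60 * FS    # 1-minute window = 12 000 samples


def make_windows(x, window=WINDOW, step=WINDOW // 2):
    """Sliding windows over x; the 50 % overlap (step) is an assumption."""
    return np.lib.stride_tricks.sliding_window_view(x, window)[::step]


def window_features(w, fs=FS):
    """A few illustrative descriptors: variance, skewness, spectral centroid, trend."""
    centered = w - w.mean()
    var = centered.var()
    skew = np.mean(centered**3) / var**1.5
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(w), d=1.0 / fs)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    slope = np.polyfit(np.arange(len(w)) / fs, w, 1)[0]
    return np.array([var, skew, centroid, slope])


# Five minutes of a synthetic 10 Hz tone, used only to exercise the functions.
t = np.arange(5 * 60 * FS) / FS
signal = np.sin(2 * np.pi * 10.0 * t)

windows = make_windows(signal)
features = np.array([window_features(w) for w in windows])
```

Each row of `features` is one window's feature vector; in the paper's pipeline the equivalent table would hold several hundred tsfresh features before relevance testing trims it to roughly 100.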

A Random Forest (RF) ensemble is then trained on the labeled windows. The three classes are defined as: (1) “remnant noise” – data points more than 3 minutes away from any major eruption; (2) “precursor” – data within 1–3 minutes before an eruption; (3) “eruption” – data during the 2‑minute active eruption phase. RF’s bootstrap aggregation and random feature selection make it robust to residual noise and capable of handling multi‑class problems with low bias and variance.
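The class definitions above translate into a simple time-based labeling rule, sketched below. The summary leaves some cases open (for example, the last minute before an eruption and the minutes just after one), so this sketch returns `None` for those times; that choice, the reading of "more than 3 minutes away" as measured from both the start and the end of each eruption, and the function name are assumptions.

```python
from typing import Optional, Sequence

ERUPTION_S = 120         # 2-minute active eruption phase
PRECURSOR_START_S = 180  # precursor window opens 3 minutes before an eruption
PRECURSOR_END_S = 60     # and closes 1 minute before it
NOISE_MARGIN_S = 180     # remnant noise is more than 3 minutes from any eruption


def label_time(t: float, eruption_starts: Sequence[float]) -> Optional[str]:
    """Label a time stamp (seconds) given the eruption start times (seconds)."""
    for start in eruption_starts:
        if start <= t < start + ERUPTION_S:
            return "eruption"
    for start in eruption_starts:
        if start - PRECURSOR_START_S <= t <= start - PRECURSOR_END_S:
            return "precursor"
    # Assumption: the margin applies before the start and after the end.
    if all(
        t < start - NOISE_MARGIN_S or t > start + ERUPTION_S + NOISE_MARGIN_S
        for start in eruption_starts
    ):
        return "remnant_noise"
    return None  # not covered by the three class definitions


eruption_starts = [1000.0]  # hypothetical eruption start time, in seconds
labels = {t: label_time(t, eruption_starts) for t in (500, 880, 970, 1050, 1200, 1400)}
```

Windows labeled `None` would simply be excluded from training in this sketch, which keeps the three classes cleanly separated in time.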

Performance evaluation shows that the RF classifier applied to fully filtered data achieves >90 % overall accuracy (≈0.92 F1 score). When only the seasonal trend is removed (partial filtering), accuracy drops modestly to 87 %, indicating the importance of the AR‑based noise suppression. In contrast, a Dynamic Time Warping (DTW) classifier on the same filtered data attains only 44 % accuracy, underscoring the superiority of the feature‑based RF approach under noisy conditions. Inference time per window is on the order of 10⁻⁴ seconds on a standard laptop, demonstrating feasibility for near‑real‑time monitoring.
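For readers who want to see how such accuracy and F1 figures are typically computed, the following sketch trains a Random Forest with scikit-learn and scores it on held-out data. The feature vectors here are synthetic Gaussian clusters generated for illustration only, so the resulting scores say nothing about the geyser data; the number of trees, the feature count, and the use of a macro-averaged F1 are assumptions, as the summary does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

CLASSES = ["remnant_noise", "precursor", "eruption"]
rng = np.random.default_rng(0)


def synthetic_features(n_per_class, n_features=10):
    """Toy stand-in for per-window feature vectors: one Gaussian cluster per class."""
    X, y = [], []
    for label in range(len(CLASSES)):
        centre = np.zeros(n_features)
        centre[:3] = 3.0 * label  # only the first three features are informative
        X.append(rng.normal(centre, 1.0, size=(n_per_class, n_features)))
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)


# Separate draws play the role of the training days and the held-out test days.
X_train, y_train = synthetic_features(200)
X_test, y_test = synthetic_features(60)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
macro_f1 = f1_score(y_test, pred, average="macro")
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
```

With real data the split would be chronological, as in the paper's 14 training days and 4 test days, so that overlapping windows from the same eruption do not appear in both sets.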

The study’s contributions are fourfold: (i) a hierarchical signal‑processing chain (Butterworth high‑pass + AR prediction‑error filter) that effectively isolates geyser‑related seismic energy from complex environmental noise; (ii) a systematic large‑scale feature extraction and selection pipeline that compresses high‑frequency seismic windows into a manageable set of discriminative descriptors; (iii) empirical evidence that a model‑free ensemble learner (Random Forest) can classify geyser states with >90 % accuracy even when the signal‑to‑noise ratio is low; and (iv) a demonstration of the method’s relevance to CO₂ sequestration monitoring, where early detection of leak‑related precursory signals is critical.

Limitations include reliance on a single field site and a relatively short observation period, which may constrain the generalizability of the trained model to other geysers or CO₂ leak scenarios with different geological or climatic contexts. The labeling depends on visual confirmation from the motion sensor, introducing potential human error. Future work could explore multi‑sensor fusion (e.g., pressure, gas concentration), deep‑learning architectures that learn features directly from raw waveforms, and validation against controlled CO₂ leak experiments. Extending the framework to a network of stations and integrating it into an automated early‑warning system would further enhance its applicability to real‑world carbon‑capture and storage projects.

