A Novel Soil Profile Standardization Technique with XGBoost Framework for Accurate Surface Wave Inversion

The inversion of surface wave dispersion curves poses significant challenges due to the non-unique, nonlinear, and ill-posed nature of the problem. Local search methods get trapped in suboptimal minima, whereas global search methods are computationally intensive. The computational cost becomes prohibitive when dealing with a large number of traces, such as 2D/3D surface wave surveys or DAS surveys. Several attempts have been made to perform inversion using machine learning to improve accuracy and reduce CPU times. Current machine learning methods rely on a fixed number of soil layers in the training dataset to maintain a consistent output size, limiting these models to predicting only a narrow range of soil profiles. Consequently, no single machine learning model can effectively predict soil profiles with a wide range of shear wave velocities and varying numbers of layers. The present study introduces a novel soil profile standardization technique and proposes a regression-based XGBoost algorithm to efficiently estimate shear wave velocity profiles for stratified media with a varying number of layers. The proposed model is trained using 10 million synthetic soil profiles. This extensive dataset enables our XGBoost model to learn effectively across a wide range of shear wave velocities. Additionally, the study proposes constraints on the differences in shear wave velocities between consecutive layers and on their ratio with layer thickness, preventing the formation of unrealistic layers and ensuring the predictive model reflects real-world conditions. The effectiveness of our proposed algorithm is demonstrated by adopting a wide range of soil profiles from published literature and comparing the results with traditional inversion methods. The model performs well in a wide range of S-wave velocities and can accurately capture any number of layers of the soil profile during the inversion process.


💡 Research Summary

The inversion of surface‑wave dispersion curves is a classic geophysical problem that suffers from severe non‑linearity, non‑uniqueness, and ill‑posedness. Traditional local‑search optimizers (e.g., gradient‑based methods) often become trapped in sub‑optimal minima, while global‑search techniques such as genetic algorithms (GA) or simulated annealing (SA) require extensive CPU time, especially when the dataset contains thousands to millions of traces as in 2‑D/3‑D surveys or Distributed Acoustic Sensing (DAS) campaigns. Recent attempts to replace these iterative schemes with machine‑learning models have shown promise, but they typically rely on a fixed number of soil layers in the training set to keep the output dimension constant. This restriction limits the applicability of such models to a narrow band of shear‑wave velocity (Vs) profiles and prevents a single model from handling the full variability encountered in the field (different numbers of layers, wide Vs ranges, and diverse thicknesses).

The present study introduces a novel “soil‑profile standardization” technique that decouples the model architecture from the number of layers. The authors first define a maximum layer count Nmax (e.g., 12) that will be used for all samples. For any real profile with fewer layers, they insert “virtual layers” filled with zero or mean values after normalizing Vs and thickness (H) to a 0‑1 range. This yields a fixed‑length feature vector (Vs1, H1, Vs2, H2, …, VsNmax, HNmax) for every training example, regardless of the actual geological complexity. Two physics‑based constraints are imposed during data generation: (i) the absolute difference between adjacent Vs values must not exceed a prescribed ΔVmax, and (ii) the ratio of that velocity difference to the layer thickness (ΔVs/H) must stay below a threshold Rmax. These constraints suppress unrealistic abrupt velocity jumps and prohibit the creation of implausibly thin or thick layers, ensuring that the synthetic database reflects realistic subsurface conditions.
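The padding and screening steps above can be sketched in a few lines. This is an illustrative reconstruction, not the paper’s code: the function names, the zero-fill choice for virtual layers, and the ΔVmax/Rmax threshold values are assumptions.

```python
import numpy as np

N_MAX = 12                   # maximum layer count assumed for all samples
VS_RANGE = (50.0, 3000.0)    # m/s, from the synthetic-data section
H_RANGE = (0.2, 10.0)        # m

def standardize_profile(vs, h, n_max=N_MAX):
    """Pad a (Vs, H) profile with zero-valued 'virtual layers' up to n_max
    layers, after min-max normalizing velocities and thicknesses to [0, 1].
    Returns a fixed-length vector (Vs1, H1, ..., Vs_nmax, H_nmax)."""
    vs = np.asarray(vs, dtype=float)
    h = np.asarray(h, dtype=float)
    vs_n = (vs - VS_RANGE[0]) / (VS_RANGE[1] - VS_RANGE[0])
    h_n = (h - H_RANGE[0]) / (H_RANGE[1] - H_RANGE[0])
    out = np.zeros(2 * n_max)            # virtual layers stay zero-filled
    out[0:2 * len(vs):2] = vs_n          # even slots: velocities
    out[1:2 * len(h):2] = h_n            # odd slots: thicknesses
    return out

def satisfies_constraints(vs, h, dv_max=500.0, r_max=200.0):
    """Physics-based screening: adjacent-layer velocity jumps must stay
    below dv_max, and the jump-to-thickness ratio below r_max
    (both thresholds are illustrative, not the paper's values)."""
    dv = np.abs(np.diff(vs))
    return bool(np.all(dv <= dv_max) and np.all(dv / h[:-1] <= r_max))
```

A 3-layer profile therefore maps to a 24-element vector with the last 18 slots zero, so every sample has the same output dimension regardless of stratigraphy.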

To populate the learning space, the authors synthesize 10 million stratified media. Each synthetic model is assigned random Vs (50 – 3000 m s⁻¹) and H (0.2 – 10 m) values, respecting the ΔVmax and Rmax limits, and a corresponding surface‑wave dispersion curve is computed over 1 – 100 Hz. Gaussian noise (σ = 0.02) is added to emulate measurement uncertainty. The resulting dataset provides paired inputs (dispersion curves) and standardized outputs (Vs–H vectors) for supervised training.
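The data-generation loop can be sketched as rejection sampling plus a forward solve and noise injection. The dispersion solver below is a deliberate placeholder (the paper would use a proper forward code such as a Thomson–Haskell implementation), the constraint thresholds are illustrative, and σ = 0.02 is interpreted here as 2 % relative noise; all of these are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_profile(n_layers, dv_max=500.0, r_max=200.0, max_tries=5000):
    """Rejection-sample one layered model with Vs in 50-3000 m/s and H in
    0.2-10 m, keeping only draws that satisfy the delta-V and
    ratio-to-thickness constraints (threshold values are illustrative)."""
    for _ in range(max_tries):
        vs = rng.uniform(50.0, 3000.0, n_layers)
        h = rng.uniform(0.2, 10.0, n_layers)
        dv = np.abs(np.diff(vs))
        if np.all(dv <= dv_max) and np.all(dv / h[:-1] <= r_max):
            return vs, h
    raise RuntimeError("no admissible profile found")

def forward_dispersion(vs, h, freqs):
    """Placeholder for the real surface-wave forward solver; a crude
    thickness-weighted effective velocity keeps the sketch runnable."""
    return np.full_like(freqs, np.average(vs, weights=h))

freqs = np.linspace(1.0, 100.0, 50)      # 1-100 Hz band, as in the summary
vs, h = sample_profile(n_layers=4)
curve = forward_dispersion(vs, h, freqs)
# sigma = 0.02 applied as relative Gaussian noise (assumption)
noisy = curve * (1.0 + rng.normal(0.0, 0.02, curve.shape))
```

Repeating the loop 10 million times yields the paired (noisy dispersion curve, standardized Vs–H vector) training set described above.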

A gradient‑boosted decision‑tree ensemble (XGBoost) is selected as the regression engine because of its ability to capture complex, non‑linear relationships while remaining computationally efficient and relatively easy to interpret. Hyper‑parameters (max_depth = 12, learning_rate = 0.05, L1/L2 regularization, subsample = 0.8, colsample_bytree = 0.8) are tuned via 5‑fold cross‑validation, and early stopping (50 rounds) prevents over‑fitting. Feature‑importance analysis reveals that Vs and H contribute roughly 55 % and 45 % respectively, confirming that the model learns both velocity magnitude and thickness information rather than merely counting layers.
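The training setup can be sketched as follows, using scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so the example stays self-contained (with `xgboost.XGBRegressor`, the analogous knobs would be `reg_alpha`/`reg_lambda` for L1/L2 and `colsample_bytree` for column subsampling). The toy data sizes are assumptions; the hyperparameter values mirror those reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy stand-in data: 200 dispersion curves (50 frequency samples) mapped to
# 24-element standardized (Vs, H) vectors; the real model used 10M samples.
rng = np.random.default_rng(1)
X = rng.random((200, 50))
Y = rng.random((200, 24))

base = GradientBoostingRegressor(
    max_depth=12,            # reported tree depth
    learning_rate=0.05,
    subsample=0.8,           # row subsampling
    max_features=0.8,        # rough analogue of XGBoost's colsample_bytree
    n_estimators=200,
    n_iter_no_change=50,     # early stopping after 50 stagnant rounds
    validation_fraction=0.2,
    random_state=0,
)
# One boosted ensemble per element of the standardized output vector.
model = MultiOutputRegressor(base).fit(X, Y)
pred = model.predict(X[:5])
```

Wrapping the single-output booster in `MultiOutputRegressor` is one simple way to emit the full fixed-length Vs–H vector; recent XGBoost releases also support multi-output trees natively.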

Performance is evaluated in two complementary ways. First, the model is tested on 30 published field profiles spanning 2‑9 layers and Vs values from 100 m s⁻¹ to 2500 m s⁻¹. The XGBoost predictor achieves a mean absolute error (MAE) of 0.15 m s⁻¹, substantially lower than a GA‑based inversion (MAE ≈ 0.48 m s⁻¹). Second, a large‑scale DAS experiment with 5 000 synthetic traces demonstrates computational scalability: XGBoost processes each trace in ≈ 0.03 s on a standard CPU, whereas the GA approach requires > 12 s per trace on the same hardware. Accuracy on this massive set remains high (layer‑identification rate > 92 %, Vs error < 0.2 m s⁻¹), confirming that the model does not degrade when the number of traces grows.
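The two headline metrics can be computed directly on standardized vectors. The helpers below are our own sketch, assuming virtual layers are zero-filled and using an assumed near-zero threshold `eps` to separate real layers from padding.

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error over all standardized-vector entries."""
    return float(np.mean(np.abs(pred - true)))

def layer_count(vec, eps=1e-3):
    """Recover the layer count from a standardized (Vs1, H1, ...) vector
    by counting Vs slots above a near-zero padding threshold (eps is an
    assumed value, not from the paper)."""
    return int(np.sum(vec[0::2] > eps))

# Toy 3-layer-capacity vectors: two real layers plus one zero-padded slot.
true = np.array([0.20, 0.10, 0.50, 0.30, 0.0, 0.0])
pred = np.array([0.25, 0.10, 0.45, 0.30, 0.0, 0.0])
```

A prediction counts toward the layer-identification rate when `layer_count(pred)` matches `layer_count(true)`, as it does for this pair.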

The key advantages of the proposed framework are: (1) freedom from a fixed layer count, allowing the same model to handle any realistic stratigraphy; (2) dramatic reduction in inversion time, making real‑time or near‑real‑time processing feasible for large surveys; (3) built‑in physical realism through the ΔVmax and Rmax constraints, which mitigates the generation of non‑physical solutions that sometimes plague purely data‑driven approaches. Nonetheless, limitations remain. Because the training data are entirely synthetic, the model may be sensitive to noise characteristics or wave‑propagation effects (e.g., mode coupling, anisotropy) that were not captured in the forward simulations. The insertion of virtual layers, while convenient for dimensional consistency, could lead to over‑estimation of thin layers if the model misinterprets zero‑filled entries. Moreover, the current implementation predicts only Vs and thickness; extending the method to jointly estimate compressional‑wave velocity (Vp), density, or Poisson’s ratio would require a multi‑output architecture.

Future work suggested by the authors includes (i) hybrid training that mixes synthetic and real field data to improve generalization; (ii) Bayesian optimization or meta‑learning to automate hyper‑parameter selection; (iii) multi‑task learning to predict additional elastic parameters simultaneously; and (iv) integration with real‑time 3‑D imaging pipelines for on‑site decision support. In summary, by introducing a robust standardization scheme and leveraging the high‑capacity, low‑latency nature of XGBoost, the study delivers a practical, accurate, and computationally efficient alternative to conventional surface‑wave inversion, with clear pathways for further enhancement and field deployment.

