A data-driven choice of misfit function for FWI using reinforcement learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In the workflow of Full-Waveform Inversion (FWI), we often tune the parameters of the inversion to avoid cycle skipping and obtain high-resolution models. For example, we typically start with objective functions that are robust to cycle skipping, such as tomographic or image-based misfits, or by using only low frequencies, and later switch to the least-squares misfit to admit high-resolution information. We may also perform an isotropic (acoustic) inversion to first update the velocity model and then switch to a multi-parameter anisotropic (elastic) inversion to fully recover the complex physics. Such hierarchical approaches are common in FWI, and they often depend on manual intervention based on many factors; the results, of course, depend on experience. However, with the large data sizes often involved in the inversion and the complexity of the process, making optimal choices is difficult even for an experienced practitioner. Thus, as an example within the framework of reinforcement learning, we utilize a deep Q-network (DQN) to learn an optimal policy for determining the proper timing to switch between different misfit functions. Specifically, we train the state-action value function (Q) to predict when to use the conventional L2-norm misfit function or the more advanced optimal-transport matching-filter (OTMF) misfit so as to mitigate cycle skipping, obtain high resolution, and improve convergence. We use a simple yet demonstrative shifted-signal inversion example to illustrate the basic principles of the proposed method.


💡 Research Summary

The paper presents a novel application of reinforcement learning (RL) to automate the selection of misfit functions during Full‑Waveform Inversion (FWI). In conventional practice, geophysicists employ a hierarchical strategy: they begin with cycle‑skipping‑robust objective functions such as low‑frequency, tomographic, or optimal‑transport‑based matching‑filter (OTMF) misfits, and later switch to the conventional L2‑norm to extract high‑resolution details once the model is sufficiently close to the true solution. Determining the optimal switching point, however, relies heavily on expert intuition and is complicated by the massive data volumes, limited bandwidth, aperture constraints, and the non‑linear nature of FWI.

The authors formulate the misfit‑selection problem as a Markov Decision Process (MDP). The state sₜ consists of the predicted and observed waveforms at iteration t (a single trace in a 1‑D setting). The action aₜ is a discrete choice between two misfit functions: the standard L2‑norm and the OTMF, which is known to mitigate cycle‑skipping by comparing global distributions via the Wasserstein‑2 distance. The reward rₜ is defined as the negative normalized L2‑norm of either the model error or the data residual, encouraging the agent to minimize cumulative misfit over the entire inversion history.
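The MDP components described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names and the number of time samples are assumptions based on the summary.

```python
import numpy as np

NT = 200  # number of time samples per trace (nt = 200 in the paper)

def make_state(predicted: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """State s_t: the predicted and observed traces (single trace in 1-D),
    concatenated into one feature vector."""
    return np.concatenate([predicted, observed])

# Action a_t: a discrete choice between the two misfit functions.
ACTIONS = {0: "L2", 1: "OTMF"}

def reward(model_error: np.ndarray) -> float:
    """Reward r_t: negative normalized L2-norm of the model error
    (the data residual could be used instead, per the summary)."""
    return -np.linalg.norm(model_error) / model_error.size
```

Because the reward is the negative of a misfit, maximizing the cumulative reward over an episode is equivalent to driving the error down as fast as possible across the whole inversion history.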

A Deep Q‑Network (DQN) is employed to approximate the action‑value function Q(s,a;θ). The network receives the waveform pair as input, passes it through a single hidden layer of size equal to the number of time samples (nt = 200), and outputs two scalar Q‑values corresponding to the two actions. Training follows the standard DQN pipeline: experience replay stores transitions (sₜ,aₜ,rₜ,sₜ₊₁) in a buffer of capacity N, minibatches of size 128 are sampled to compute the temporal‑difference (TD) error (Equation 7), and gradient descent updates the network weights. An ε‑greedy exploration policy starts with ε = 0.90 and decays exponentially to 0.05, ensuring sufficient exploration early on. A target network Q̂ is synchronized with the online network every C steps to stabilize learning.
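The training pipeline above can be sketched with a minimal numpy implementation. This is a sketch under stated assumptions: the paper does not specify the framework, learning rate, ε decay rate, buffer capacity N, or sync interval C, so those values here are placeholders; only the batch size (128), hidden size (nt = 200), and the ε range (0.90 → 0.05) come from the summary.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

NT, N_ACTIONS, HIDDEN = 200, 2, 200          # hidden size = nt, as in the paper
GAMMA, BATCH, CAPACITY, C = 0.99, 128, 10_000, 100  # GAMMA/CAPACITY/C assumed

class QNet:
    """One-hidden-layer Q-network with manual gradients (illustrative)."""
    def __init__(self):
        self.W1 = rng.normal(0, 0.01, (2 * NT, HIDDEN))
        self.W2 = rng.normal(0, 0.01, (HIDDEN, N_ACTIONS))

    def forward(self, s):
        h = np.maximum(0.0, s @ self.W1)     # ReLU hidden layer
        return h, h @ self.W2                # two scalar Q-values

    def update(self, s, a, target, lr=1e-3):
        h, q = self.forward(s)
        err = np.zeros_like(q)
        idx = np.arange(len(a))
        err[idx, a] = q[idx, a] - target     # TD error on the taken action
        dh = (err @ self.W2.T) * (h > 0)     # backprop through ReLU
        self.W2 -= lr * h.T @ err
        self.W1 -= lr * s.T @ dh

online, target_net = QNet(), QNet()
target_net.W1, target_net.W2 = online.W1.copy(), online.W2.copy()
replay = deque(maxlen=CAPACITY)              # experience replay buffer

def epsilon(step, eps0=0.90, eps_min=0.05, decay=1e-3):
    """Exponential decay from 0.90 to 0.05, as described in the summary."""
    return eps_min + (eps0 - eps_min) * np.exp(-decay * step)

def select_action(s, step):
    if rng.random() < epsilon(step):
        return int(rng.integers(N_ACTIONS))  # explore
    return int(np.argmax(online.forward(s[None])[1]))  # exploit

def train_step(step):
    if len(replay) < BATCH:
        return
    s, a, r, s2 = map(np.array, zip(*random.sample(replay, BATCH)))
    q_next = target_net.forward(s2)[1].max(axis=1)   # bootstrapped target
    online.update(s, a, r + GAMMA * q_next)
    if step % C == 0:                        # sync target network every C steps
        target_net.W1, target_net.W2 = online.W1.copy(), online.W2.copy()
```

In practice one would use an autodiff framework, but the structure is the same: an online network selects actions, the replay buffer decorrelates samples, and the periodically frozen target network stabilizes the TD targets.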

The experimental test case is a synthetic time‑shift inversion problem. A Ricker wavelet with peak frequency 3 Hz is generated, and the forward model simply shifts the waveform by a parameter τ. The true τ and the initial τ are drawn randomly from 0.4 s to 1.2 s. Each episode consists of 12 inversion iterations, and 10 000 episodes are run for training. The loss (Equation 7) decreases smoothly, and the accumulated reward rises, indicating successful convergence of the RL agent.
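The synthetic test case is simple enough to reproduce in a few lines. In this sketch, the sampling interval `DT` is an assumption chosen so that 200 samples cover the 0.4–1.2 s shift range; the 3 Hz peak frequency and the uniform draw of τ follow the summary.

```python
import numpy as np

DT, NT, F_PEAK = 0.008, 200, 3.0   # DT assumed; nt = 200 and 3 Hz from the paper

def ricker(t0: float) -> np.ndarray:
    """Ricker wavelet with peak frequency F_PEAK, centered at time t0 (s)."""
    t = np.arange(NT) * DT - t0
    arg = (np.pi * F_PEAK * t) ** 2
    return (1.0 - 2.0 * arg) * np.exp(-arg)

# "Forward model": the observed trace is the wavelet delayed by tau.
# True and initial tau are drawn uniformly from [0.4 s, 1.2 s].
rng = np.random.default_rng(0)
tau_true, tau_init = rng.uniform(0.4, 1.2, size=2)
observed = ricker(tau_true)
predicted = ricker(tau_init)
```

The single model parameter is the shift τ, so each of the 12 iterations per episode updates τ using the gradient of whichever misfit the agent selected.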

Analysis of the learned Q‑function reveals a clear decision boundary: when the relative time shift between predicted and observed data is less than roughly 0.15 s (approximately half the period of the 3 Hz wavelet), the Q‑value for the L2‑norm exceeds that for OTMF, prompting the agent to select L2. For larger shifts, the OTMF Q‑value dominates, leading to a switch to the transport‑based misfit. This boundary aligns with the classic “half‑cycle rule” used by practitioners, but here it emerges automatically from data without any hand‑crafted heuristics.
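The half-cycle arithmetic behind this boundary is easy to check: a 3 Hz wavelet has a period of 1/3 s, so half a cycle is about 0.167 s, close to the ~0.15 s boundary the agent learned. A heuristic mirroring the learned policy might look like this (the threshold and function name are illustrative):

```python
F_PEAK = 3.0                        # Hz, peak frequency of the Ricker wavelet
half_period = 1.0 / (2.0 * F_PEAK)  # half a cycle of a 3 Hz signal, in seconds

def choose_misfit(time_shift: float) -> str:
    """Hand-written analogue of the learned decision boundary:
    L2 when the residual shift is within half a cycle, OTMF otherwise."""
    return "L2" if abs(time_shift) < half_period else "OTMF"
```

The point of the paper is that the DQN recovers this rule from data alone; the hand-written version here only serves to verify that the learned boundary matches the classic heuristic.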

The study demonstrates that a DQN can learn a policy that dynamically chooses the most appropriate misfit function based solely on waveform information, thereby reducing reliance on expert judgment. The approach is inherently extensible: in principle, the state could include multi‑trace or multi‑parameter information, and the action space could be expanded to more sophisticated misfit options. However, the current work is limited to a 1‑D, single‑parameter synthetic example. Future research must address scalability to realistic 2‑D/3‑D FWI, robustness to noise and incomplete data, and more nuanced reward designs that incorporate geological constraints or computational cost.

In conclusion, the paper provides a proof‑of‑concept that reinforcement learning, specifically Deep Q‑Learning, can be integrated into the FWI workflow to automate the selection of objective functions, achieving fast convergence and cycle‑skipping avoidance in a data‑driven manner.

