Causal Flow Q-Learning for Robust Offline Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Expressive flow-matching policies have recently been applied successfully in reinforcement learning (RL) due to their ability to model complex action distributions from offline data. These algorithms build on standard policy-gradient methods, which assume there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when the demonstrator's and the learner's sensory capabilities are mismatched, introducing implicit confounding biases into the offline data. We address this challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes a policy's worst-case performance under the confounding biases that may arise. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120% that of confounding-unaware, state-of-the-art offline RL methods.


💡 Research Summary

The paper tackles a critical yet under‑explored problem in offline reinforcement learning (RL): the presence of unobserved confounding when the demonstrator’s observations differ from the learner’s sensory inputs, a situation common in pixel‑based demonstrations. Traditional offline RL methods assume that the behavioral policy that generated the dataset and the target policy share the same support over the state‑action space (the “no unobserved confounder” or NUC condition). When this assumption is violated, hidden exogenous variables (denoted U) simultaneously influence actions, next states, and rewards, creating spurious correlations that bias value estimation and policy improvement.

To formalize this, the authors introduce the Confounded Markov Decision Process (CMDP), an extension of the standard MDP that explicitly incorporates an unobserved noise variable U and three deterministic functions: transition f_S, behavioral policy f_X, and reward f_Y. The data‑generating process is depicted as a causal graph with bidirected edges between X (action) and Y (reward) as well as between X and the subsequent state S′, representing the hidden confounder. Under this model, the observational distribution P(X_{1:T}, S_{1:T}, Y_{1:T}) is insufficient to uniquely identify the true transition and reward functions, which is why standard offline RL objectives fail.
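The confounding structure above can be illustrated with a minimal sketch of a CMDP data-generating process. The dynamics, function forms, and coefficients below are hypothetical stand-ins for f_S, f_X, and f_Y; the only point is that the hidden noise u drives the action, the reward, and the next state, yet never appears in the logged tuples:

```python
import numpy as np

def rollout_confounded(T=50, seed=0):
    """Toy CMDP rollout (hypothetical dynamics, not the paper's environment).

    A hidden variable u_t enters the behavioral policy f_X, the reward f_Y,
    and the transition f_S, so the logged (s, x, y) tuples carry a spurious
    x–y correlation that a NUC-assuming learner will misattribute to x."""
    rng = np.random.default_rng(seed)
    s = 0.0
    traj = []
    for _ in range(T):
        u = rng.normal()                           # unobserved confounder U_t
        x = float(np.tanh(s + u))                  # behavioral action f_X(s, u)
        y = -(s ** 2) + 0.5 * u                    # reward f_Y(s, x, u): U leaks into Y
        s_next = 0.9 * s + 0.1 * x + 0.05 * u      # transition f_S(s, x, u)
        traj.append((s, x, y))                     # u itself is never logged
        s = s_next
    return traj

traj = rollout_confounded()
```

Fitting Q-values by regression on such a dataset attributes the 0.5·u reward component to the action, which is exactly the bias the causal objective guards against.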

The core theoretical contribution is a worst‑case lower bound on the state‑value function for any target policy π in a CMDP. Building on prior work that derived a “Causal Bellman” equation for discrete actions, the authors prove Theorem 3.1, which yields a closed‑form bound (Equation 8) that is valid for continuous, multimodal action spaces. The bound distinguishes two cases for each sampled tuple (s, x, x′): (1) if the target action x equals the observed action x′, the update reduces to the ordinary Bellman backup using the observed next state; (2) if x ≠ x′, the update uses a pessimistic estimate based on the minimal Q‑value over all possible next states s*. This construction guarantees that the learned policy is safe with respect to the worst‑case environment compatible with the confounded observations.
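The two-case structure of the bound can be sketched as a modified Bellman backup. The notation below follows the summary (x′ is the logged action, x* ∼ π is the target policy's action); the scalar a, standing for a lower bound on the attainable reward, is an assumption of this sketch rather than a quantity the summary defines:

```latex
Q(s, x) \;\ge\; \mathbb{E}_{(x', y, s') \sim \mathcal{D}}\!\left[
  \mathbb{1}\{x = x'\}\,\bigl(y + \gamma\, Q(s', x^{*})\bigr)
  \;+\; \mathbb{1}\{x \ne x'\}\,\bigl(a + \gamma \min_{s^{*}} Q(s^{*}, x^{*})\bigr)
\right]
```

The first branch is the ordinary Bellman backup on observed transitions; the second discards the (potentially confounded) observed outcome and falls back to the worst next state compatible with the data.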

Algorithmically, the paper proposes Causal Flow Q‑Learning (CFQL), which integrates the worst‑case bound into a flow‑matching offline RL framework. CFQL learns two continuous normalizing‑flow policies: (i) a behavioral cloning policy μ_ω that imitates the nominal behavioral distribution P(X|S) extracted from the dataset, and (ii) a target policy π_θ that is optimized for performance. A deep discriminator D(s, x, x′) approximates the indicator 1_{x = x′} by taking a state‑action pair from the dataset and an action sampled from π_θ. The discriminator’s output weights the two terms in the Q‑loss (Equation 9): when D≈1 the loss reduces to the standard Q‑learning objective; when D≈0 the loss substitutes the pessimistic term a + γ min_{s*} Q(s*, x*). This dynamic weighting effectively “turns on” the robust correction only when the target policy deviates from the observed behavior, thereby mitigating bias without sacrificing sample efficiency.
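The discriminator-weighted target described above can be written as a small interpolation between the two backups. This is a hedged sketch of the idea behind Equation 9, not the paper's implementation: the arguments r, q_next, q_min, and d stand in for the sampled reward, Q(s′, x*), min over next states of Q(s*, x*), and the discriminator output respectively, and a_min plays the role of the pessimistic reward floor a:

```python
def robust_q_target(r, q_next, q_min, d, gamma=0.99, a_min=0.0):
    """Discriminator-weighted Q-target (sketch of the idea in Eq. 9).

    d ≈ 1{x = x'}: when d → 1 this reduces to the standard Bellman target
    r + γ·Q(s', x*); when d → 0 it falls back to the pessimistic target
    a + γ·min_{s*} Q(s*, x*), activating the robust correction only where
    the target policy deviates from the observed behavior."""
    return d * (r + gamma * q_next) + (1.0 - d) * (a_min + gamma * q_min)

# d = 1: ordinary backup; d = 0: fully pessimistic backup
standard = robust_q_target(r=1.0, q_next=2.0, q_min=0.5, d=1.0)   # 1 + 0.99*2.0
pessimistic = robust_q_target(r=1.0, q_next=2.0, q_min=0.5, d=0.0) # 0 + 0.99*0.5
```

In the full algorithm the target is computed with a frozen target network and regressed onto by the critic; the dynamic weighting is what lets the robust term engage only off the behavioral support.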

The authors evaluate CFQL on 25 pixel‑based robotic manipulation tasks derived from the OGBench suite. Baselines include state‑of‑the‑art flow‑matching methods (Flow Q‑Learning, Value‑Flow), diffusion‑based offline RL, and classic conservative offline algorithms (CQL, BCQ). Results show that CFQL achieves an average success rate 120% that of the best confounding‑unaware baseline, with some tasks even surpassing policies trained on true state observations. Ablation studies confirm that both the discriminator and the behavioral cloning flow are essential: removing the discriminator collapses performance to that of standard FQL, while omitting μ_ω leads to unstable policy updates. Moreover, the authors demonstrate that the robust Q‑loss can be plugged into other continuous‑action policy classes (e.g., diffusion policies) with comparable gains, highlighting the generality of the approach.

In summary, the paper makes three intertwined contributions: (1) a causal graphical model (CMDP) that captures hidden confounding in offline RL, (2) a mathematically rigorous worst‑case value lower bound applicable to continuous action spaces, and (3) a practical algorithm (CFQL) that combines flow‑matching, a deep discriminator, and the robust bound to learn policies that are provably safe under confounded data. The work opens several avenues for future research, such as extending the framework to multiple latent confounders, integrating partial causal knowledge, and exploring safe online‑offline hybrid learning scenarios.

