Confounding Robust Continuous Control via Automatic Reward Shaping


Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents’ training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potential in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.


💡 Research Summary

This paper tackles the long‑standing problem of designing effective reward‑shaping functions for continuous‑control reinforcement learning when only offline data are available and those data may be contaminated by unobserved confounders. The authors formalize the setting as a Confounded Markov Decision Process (CMDP), where the observed state S, action X, and reward Y are all jointly influenced by a latent exogenous variable U. Because U is hidden, standard off‑policy methods that assume “no unobserved confounding” (NUC) cannot reliably identify transition dynamics or reward functions, leading to biased value estimates and poor online performance.

To overcome this, the paper introduces a Causal Bellman Optimality Equation for stationary infinite‑horizon CMDPs. The equation defines an optimistic upper bound V(s) on the true optimal value V*(s). Unlike the classic Bellman equation, the causal version incorporates two terms: (i) the expected reward and next‑state value conditioned on the observed action distribution P(x|s), and (ii) a worst‑case term that accounts for the probability of “not taking” the observed action (i.e., the effect of the hidden confounder), multiplied by a known reward upper bound b and the maximal possible future value. Formally:

V(s) = maxₓ [ P(x | s) · E[ Y + γ V(S′) | s, x ] + (1 − P(x | s)) · ( b + γ · maxₛ′ V(s′) ) ]
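The fixed point of this optimistic backup can be computed by plain value iteration in a tabular setting. The sketch below is illustrative only and assumes hypothetical tabular estimates (`P_next`, `R`, `pi_b`) extracted from offline data; the paper's actual method learns the bound for continuous states, so none of these names come from the authors' code.

```python
# Sketch: value iteration under the causal Bellman upper bound, then
# potential-based reward shaping (PBRS) with the bound as the potential.
# Toy tabular setting; all names and shapes here are hypothetical.
import numpy as np

def causal_value_upper_bound(P_next, R, pi_b, b_max, gamma=0.9, iters=500):
    """Iterate the optimistic backup
        V(s) <- max_x [ pi_b(x|s) * (R[s,x] + gamma * E[V(S')|s,x])
                        + (1 - pi_b(x|s)) * (b_max + gamma * max_s' V(s')) ].
    P_next: (S, X, S) transition probabilities estimated from offline data
    R:      (S, X) expected one-step rewards under the behavior policy
    pi_b:   (S, X) observed action distribution P(x|s)
    b_max:  known upper bound on the one-step reward
    """
    S, X = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        observed = pi_b * (R + gamma * (P_next @ V))          # term (i)
        worst_case = (1.0 - pi_b) * (b_max + gamma * V.max())  # term (ii)
        V = (observed + worst_case).max(axis=1)
    return V

def pbrs_shaped_reward(r, s, s_next, V, gamma=0.9):
    # PBRS: shaped reward r + gamma * Phi(s') - Phi(s), with Phi = V.
    return r + gamma * V[s_next] - V[s]
```

The shaped reward could then be fed to any off-the-shelf actor-critic learner (e.g., SAC) in place of the raw environment reward; because the shaping is potential-based, it preserves the set of optimal policies.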

