Learning Optimal and Sample-Efficient Decision Policies with Guarantees
The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, using RL in practice remains challenging, particularly when learning decision policies in high-stakes applications that require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where such interactions are costly, dangerous, or infeasible. Learning from offline datasets avoids this issue, but is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We use instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restriction (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence-rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and on synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis for real-world decision making.
💡 Research Summary
The dissertation tackles three intertwined challenges that hinder the deployment of reinforcement learning (RL) in high‑stakes domains: (1) the prohibitive cost or danger of online interaction, (2) the presence of hidden confounders in offline datasets, and (3) the need to learn high‑level, temporally extended objectives. The author proposes a unified causal‑inference‑driven framework that leverages instrumental variables (IVs) to identify causal effects in offline settings and solves the resulting conditional moment restriction (CMR) problems using a double‑machine‑learning (DML) approach.
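As background, the canonical IV-regression instance of a CMR can be written as follows (standard notation; the thesis's own symbols may differ). Here $X$ is the confounded treatment, $Y$ the outcome, and $Z$ the instrument:

```latex
% Conditional moment restriction defining the structural function f:
\mathbb{E}\!\left[\, Y - f(X) \,\middle|\, Z \,\right] = 0 \quad \text{almost surely.}
% The restriction identifies f when Z is relevant (correlated with X),
% excluded (affects Y only through X), and independent of the hidden confounder.
```

Solving for $f$ from this restriction is ill-posed in general, which is why specialised estimators such as the DML approach summarised above are needed.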
In Chapter 4, the core algorithm DML‑CMR is introduced. By constructing a Neyman‑orthogonal score and employing cross‑fitting, the estimator attains bias reduction and an N^(-1/2) convergence rate under a novel DML‑identifiability condition. The method is extended to the offline IV bandit problem, where theoretical guarantees on sub‑optimality are derived. A computationally efficient variant replaces deep networks with tree‑based learners while preserving statistical guarantees. Empirical results on synthetic IV regression, offline bandits, and proximal causal learning benchmarks demonstrate lower mean‑squared error and higher cumulative reward than state‑of‑the‑art baselines, even when instruments are weak.
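The cross-fitting idea can be sketched with a linear two-stage least-squares stand-in (the thesis uses flexible learners and an orthogonal score; this minimal sketch, with invented toy data, only illustrates how out-of-fold first-stage predictions feed the second stage):

```python
import numpy as np

def crossfit_2sls(Z, X, Y, n_folds=2, seed=0):
    """Cross-fitted two-stage least squares: a linear stand-in for DML-style
    CMR estimation. The first stage (X on Z) is always fit on held-out folds,
    so the second stage never reuses the data that trained its nuisance."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    X_hat = np.empty_like(X, dtype=float)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # first stage: project the treatment onto the instrument, out of fold
        coef, *_ = np.linalg.lstsq(Z[train], X[train], rcond=None)
        X_hat[test] = Z[test] @ coef
    # second stage: regress the outcome on the cross-fitted prediction
    beta, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return beta

# toy confounded data: U affects both X and Y, Z is a valid instrument
rng = np.random.default_rng(1)
n = 20000
U = rng.normal(size=n)
Z = rng.normal(size=(n, 1))
X = Z[:, 0] + U + 0.5 * rng.normal(size=n)
Y = 2.0 * X - U + 0.5 * rng.normal(size=n)   # true causal effect is 2.0
X = X.reshape(-1, 1)
beta = crossfit_2sls(Z, X, Y)                # recovers a value close to 2.0
```

A naive regression of Y on X would be biased here because X and the error share the confounder U; the instrument-based estimate is not.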
Chapter 5 adapts the CMR estimator to offline imitation learning. The author formalizes MDPs with hidden confounders and shows that the imitation problem can be cast as a CMR problem. The resulting DML‑IL algorithm inherits the same orthogonal‑score structure, yielding an imitation gap that shrinks at the N^(-1/2) rate. Experiments on a ticket‑pricing environment and MuJoCo robotics tasks confirm that DML‑IL outperforms behavior cloning, inverse‑RL, and recent causal imitation methods, and remains robust when combined with alternative CMR estimators.
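A minimal toy illustration of the failure mode this chapter targets (not the DML‑IL algorithm itself; data and coefficients are made up): when an expert acts on a hidden variable that also correlates with the observed state, naive behavior cloning learns a systematically biased policy.

```python
import numpy as np

# Toy setup: the expert sees a hidden confounder U; the imitator only sees S.
rng = np.random.default_rng(0)
n = 50000
U = rng.normal(size=n)              # hidden confounder, observed by the expert
S = U + rng.normal(size=n)          # observed state, correlated with U
A = 1.0 * S + U                     # expert action uses both S and U

# Behavior cloning regresses actions on observed states only:
bc_coef = np.cov(S, A)[0, 1] / np.var(S)
# True state coefficient is 1.0, but the Cov(S, U) / Var(S) = 0.5 term
# inflates the estimate toward 1.5: the cloned policy over-reacts to S.
```

An instrument-based CMR estimator, of the kind summarised above, is one way to remove this bias without ever observing U.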
Chapter 6 addresses the learning of high‑level objectives expressed in linear temporal logic (LTL). By translating LTL specifications into limit‑deterministic Büchi automata (LDBA) and forming a product MDP, the author designs a Q‑learning‑based LTL learner that incorporates “counterfactual imagining” to correct for hidden confounding during policy updates. Theoretical analysis shows a logarithmic‑linear sample complexity improvement over prior LTL‑RL approaches. Benchmarks on probabilistic gate, Frozen Lake, and Office World MDPs validate the method’s ability to achieve near‑optimal policies with fewer samples.
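Schematically, the product-MDP construction behind such LTL learners can be sketched as follows. This is a hypothetical toy (a chain MDP and a two-state automaton for "eventually reach the goal", trained with plain tabular Q-learning), not the chapter's algorithm or its counterfactual-imagining update:

```python
import random

# Chain MDP with states 0..4 and a two-state automaton for "F goal"
# (eventually visit state 4). The learner works on product states (s, q).
N_S, GOAL = 5, 4

def mdp_step(s, a):                # a in {-1, +1}; walls at both ends
    return min(max(s + a, 0), N_S - 1)

def aut_step(q, s):                # q=0: goal not yet seen; q=1: accepting
    return 1 if (q == 1 or s == GOAL) else q

def q_learning(episodes=500, eps=0.2, alpha=0.5, gamma=0.95, seed=0):
    rng = random.Random(seed)
    Q = {}                         # product state (s, q) -> action values
    for _ in range(episodes):
        s, q = 0, 0
        for _ in range(20):
            key = (s, q)
            Q.setdefault(key, [1.0, 1.0])   # optimistic init aids exploration
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: Q[key][i])
            s2 = mdp_step(s, (-1, 1)[a])
            q2 = aut_step(q, s2)
            r = 1.0 if q2 == 1 and q == 0 else 0.0  # reward on acceptance
            Q.setdefault((s2, q2), [1.0, 1.0])
            Q[key][a] += alpha * (r + gamma * max(Q[(s2, q2)]) - Q[key][a])
            s, q = s2, q2
    return Q

Q = q_learning()
# the greedy policy on the product MDP now heads right until the automaton
# reaches its accepting state, i.e. until the LTL objective is satisfied
```

Rewarding transitions into accepting automaton states is the standard way such reductions turn an LTL objective into a scalar-reward RL problem; LDBA-based methods generalise this idea to the full logic.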
The thesis concludes with a discussion of practical considerations, strengths and limitations of the proposed methods, and a roadmap for future work, including extensions to continuous‑time settings, robust IV selection, and large‑scale deployment in healthcare and finance. Overall, the work delivers a coherent set of algorithms that jointly guarantee sample efficiency and optimality for offline decision‑making under hidden confounding, advancing both the theory and practice of causal reinforcement learning.