Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
💡 Research Summary
The paper addresses a fundamental limitation of most offline reinforcement learning (RL) methods: they assume a homogeneous environment and therefore learn a single policy that is applied to all individuals. In many real‑world settings—especially in healthcare, mobile health, and personalized robotics—different subjects experience distinct state‑transition dynamics and reward functions, and some sub‑populations are under‑represented in the data. Ignoring this heterogeneity can lead to biased value estimates and sub‑optimal policies, potentially exacerbating disparities.
To tackle this, the authors formalize the problem as a collection of time‑stationary heterogeneous Markov decision processes (MDPs), one for each individual i, denoted M(i) = {S, A, P_i, r_i, γ}. The transition kernel P_i and reward function r_i may differ across individuals, while the state and action spaces are shared. The goal is to learn an individualized stationary policy π_i* for each i that maximizes the expected discounted return J_i(π_i).
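The per-individual setup above can be sketched in code. This is a minimal illustration, not the paper's implementation: the class and function names (`HeteroMDP`, `discounted_return`) and the Monte Carlo evaluation are assumptions for a small tabular case, where each individual shares (S, A, γ) but carries its own P_i and r_i.

```python
# Hypothetical sketch of one individual's MDP M(i) = {S, A, P_i, r_i, gamma}.
# Names and the Monte Carlo evaluator are illustrative, not from the paper.
from dataclasses import dataclass
import numpy as np

@dataclass
class HeteroMDP:
    """One individual's MDP: shared (S, A, gamma), individual P_i and r_i."""
    n_states: int
    n_actions: int
    P: np.ndarray        # shape (S, A, S): individual transition kernel P_i
    r: np.ndarray        # shape (S, A): individual reward function r_i
    gamma: float         # shared discount factor

def discounted_return(mdp: HeteroMDP, policy: np.ndarray, s0: int,
                      horizon: int = 200, rng=None) -> float:
    """Monte Carlo estimate of J_i(pi) = E[sum_t gamma^t r_i(s_t, a_t)]."""
    if rng is None:
        rng = np.random.default_rng(0)
    s, total, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(mdp.n_actions, p=policy[s])   # sample a_t ~ pi(.|s_t)
        total += disc * mdp.r[s, a]                  # accumulate gamma^t * r_i
        disc *= mdp.gamma
        s = rng.choice(mdp.n_states, p=mdp.P[s, a])  # s_{t+1} ~ P_i(.|s_t, a_t)
    return total
```

The individualized policy π_i* is then the maximizer of `discounted_return` (in expectation) for that individual's own P_i and r_i.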
The methodological contribution consists of two tightly coupled components:
- Heterogeneous Latent Variable Model – Each individual is associated with a latent vector z_i that parameterizes both its Q‑function and its policy. By imposing a multi‑centroid penalty on the latent vectors, the model encourages individuals with similar latent representations to cluster together, thereby sharing statistical strength. This structure captures commonalities across the population while preserving individual specificity.
- Penalized Pessimistic Personalized Policy Learning (P4L) – Policy evaluation is performed in a pessimistic manner: given a set of candidate Q‑functions, the algorithm evaluates a policy using the most pessimistic (i.e., lowest) Q‑value. This guards against over‑estimation caused by distributional shift between the behavior policy that generated the offline data and the target policy being evaluated. Crucially, the algorithm only requires a partial coverage assumption: the average visitation distribution across all individuals’ behavior policies must dominate the visitation distribution induced by each individual’s target policy. This is far weaker than the full coverage condition traditionally required for off‑policy evaluation.
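The multi-centroid penalty in the first component can be sketched as follows. The specific functional form shown here (squared distance from each latent vector to its nearest of K centroids) is an assumption for illustration; the paper's exact penalty may differ.

```python
# Illustrative multi-centroid penalty: each individual's latent vector z_i is
# pulled toward its nearest centroid, encouraging cluster structure across
# individuals. The squared-distance-to-nearest-centroid form is an assumption.
import numpy as np

def multi_centroid_penalty(Z: np.ndarray, C: np.ndarray, lam: float = 1.0) -> float:
    """lam * sum_i min_k ||z_i - c_k||^2 for N latent vectors Z (N, d)
    and K centroids C (K, d)."""
    # Pairwise squared distances, shape (N, K)
    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    # For each individual, keep only the distance to its closest centroid
    return lam * d2.min(axis=1).sum()
```

Individuals whose latent vectors sit near the same centroid effectively share statistical strength, which is how under-represented subjects can still borrow information from similar ones.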
The authors derive a finite‑sample regret bound for the policies learned by P4L. Under mild regularity conditions, the average regret across the N individuals decays at rate O(1/√N_T), where N_T is the total number of observed transitions. This rate matches that of an oracle that knows the true subgroup structure, so the latent‑variable clustering incurs no asymptotic penalty. To handle the computational difficulty of the pessimistic optimization over an uncertainty set of Q‑functions, the authors formulate a Lagrangian dual problem; assuming convexity of the Q‑function class, solving the dual preserves the same regret bound while dramatically reducing computational cost.
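The pessimistic evaluation step can be illustrated with a toy max-min selection. This sketch uses a finite set of candidate Q-functions and a finite menu of policies, which is a deliberate simplification: the paper optimizes over a Q-function class via the Lagrangian dual rather than enumerating candidates, and the names below are hypothetical.

```python
# Toy pessimistic (max-min) policy selection: score each policy by its worst
# value over an uncertainty set of candidate Q-functions, then pick the policy
# with the best worst-case score. The finite-set setting is a simplification.
import numpy as np

def pessimistic_value(q_set: list, policy: np.ndarray, mu0: np.ndarray) -> float:
    """min over Q in q_set of E_{s ~ mu0}[ sum_a policy(a|s) * Q(s, a) ].

    Each Q and policy has shape (S, A); mu0 is the initial-state distribution.
    """
    values = [mu0 @ (Q * policy).sum(axis=1) for Q in q_set]
    return min(values)

def select_policy(q_set: list, policies: list, mu0: np.ndarray) -> np.ndarray:
    """Choose the policy maximizing the pessimistic value (max-min)."""
    return max(policies, key=lambda pi: pessimistic_value(q_set, pi, mu0))
```

Scoring by the lowest candidate Q-value is what protects against over-estimation under distributional shift: a policy only looks good if every Q-function consistent with the offline data says it is good.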
Empirical validation is carried out in two settings. In synthetic simulations, the authors vary the degree of heterogeneity and the amount of data per individual, comparing P4L against (i) group‑wise policy learning methods, (ii) meta‑RL approaches that require massive online interaction, and (iii) naïve individual Q‑learning. P4L consistently achieves higher estimated returns and lower regret, especially for individuals with scarce data. In a real‑world case study involving cardiac patients, the method learns personalized treatment regimes that outperform existing dynamic treatment regime algorithms in terms of clinical outcome metrics.
Overall, the paper makes several notable advances: it introduces a principled framework for heterogeneous offline RL, proposes a latent‑variable sharing mechanism with multi‑centroid regularization, leverages pessimism to relax coverage requirements, provides rigorous regret guarantees, and demonstrates practical superiority on both simulated and real data. The work opens a clear path toward deploying offline RL in domains where population heterogeneity is the norm, such as precision medicine, adaptive education, and individualized recommendation systems.