Provable Offline Reinforcement Learning for Structured Cyclic MDPs
We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous, stage-specific transition dynamics, rewards, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 Diabetes datasets demonstrate CycleFQI’s effectiveness.
💡 Research Summary
This paper introduces a new formalism for Markov decision processes that exhibit cyclic, stage‑specific dynamics, called “Cyclic MDPs.” In many real‑world sequential decision problems—such as type‑1 diabetes management, traffic control, or any system that repeats a set of distinct phases—the dynamics, state‑action spaces, transition kernels, rewards, and discount factors differ from one stage to another. The authors model the environment as a collection of K stages that repeat indefinitely. Each stage k has its own state space S_k, action space A_k, transition kernel P_k, reward function R_k, a set of terminal state‑action pairs T_k, a deterministic inter‑stage mapping φ_k, and a discount factor γ_k applied when moving to the next stage. The overall discount for a full cycle is γ_cycle = ∏_{k=1}^K γ_k, guaranteeing bounded total return as long as at least one γ_k < 1.
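The stage-wise specification above can be sketched as a small data structure. This is an illustrative sketch only; the class and field names (`CyclicStage`, `CyclicMDP`) are assumptions, not the paper's notation, and the callables stand in for the components P_k, R_k, T_k, and φ_k.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CyclicStage:
    """One stage k of the cycle, carrying its own components."""
    n_states: int                          # stand-in for the state space S_k
    n_actions: int                         # stand-in for the action space A_k
    gamma: float                           # stage discount γ_k, applied on exit to stage k+1
    transition: Optional[Callable] = None  # transition kernel P_k
    reward: Optional[Callable] = None      # reward function R_k
    is_terminal: Optional[Callable] = None # membership test for the terminal set T_k
    phi: Optional[Callable] = None         # deterministic inter-stage mapping φ_k

@dataclass
class CyclicMDP:
    stages: List[CyclicStage]

    @property
    def gamma_cycle(self) -> float:
        """Overall discount for one full pass through the K stages: ∏_k γ_k."""
        g = 1.0
        for stage in self.stages:
            g *= stage.gamma
        return g

# Returns stay bounded whenever gamma_cycle < 1, i.e. at least one γ_k < 1.
mdp = CyclicMDP([CyclicStage(4, 2, 1.0), CyclicStage(3, 2, 0.9), CyclicStage(5, 3, 1.0)])
print(mdp.gamma_cycle)  # → 0.9
```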
A central difficulty in offline reinforcement learning (RL) is distribution mismatch: when a policy is updated, the state‑distribution of subsequent stages changes, and errors can compound across the cycle. Standard offline algorithms such as Fitted Q‑Iteration (FQI), Conservative Q‑Learning (CQL), or Batch Constrained Q‑Learning (BCQ) assume a stationary environment and therefore suffer from error explosion in cyclic settings.
To overcome this, the authors propose a modular structural approach. They maintain a separate Q‑function Q_k for each stage and define a constrained Bellman operator T_U, where U ⊆ {1,…,K} denotes the set of stages whose policies are to be learned; stages not in U follow a fixed behavior policy π∘_k. The operator first computes a stage‑wise value V_k(s) that is either a max over actions (if k ∈ U) or an expectation under the fixed policy (if k ∉ U). It then updates each Q_k using the immediate reward, the intra‑stage transition, and either the intra‑stage value or the discounted value of the next stage (when a terminal transition occurs). The authors prove that T_U is non‑expansive and that after H = K applications it contracts by γ_cycle, guaranteeing a unique fixed point Q^*_U.
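The stage-wise value rule of the constrained operator can be illustrated with a tabular sketch. This is not the paper's implementation; `stage_value` and its arguments are assumed names, with Q_k stored as a states-by-actions array and π∘_k as a matrix of action probabilities.

```python
import numpy as np

def stage_value(q_k: np.ndarray, k: int, U: set,
                behavior_policy: np.ndarray = None) -> np.ndarray:
    """V_k(s) from Q_k: max over actions if stage k is controlled (k ∈ U),
    otherwise an expectation under the fixed behavior policy π∘_k (k ∉ U)."""
    if k in U:
        return q_k.max(axis=1)                  # greedy: max_a Q_k(s, a)
    return (behavior_policy * q_k).sum(axis=1)  # E_{a ~ π∘_k}[Q_k(s, a)]

# Toy Q-table with 2 states and 2 actions, uniform fixed policy for stage 2.
q = np.array([[1.0, 3.0], [2.0, 0.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
print(stage_value(q, 1, {1}))      # [3. 2.]  — stage 1 is optimized
print(stage_value(q, 2, {1}, pi))  # [2. 1.]  — stage 2 follows π∘_2
```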
The algorithm CycleFitted Q‑Iteration (CycleFQI) implements this theory. Given offline data D = {D_k}_{k=1}^K, where D_k consists of n_k tuples (s_i^k, a_i^k, r_i^k, s’_i^k) collected under some behavior policies, CycleFQI proceeds iteratively. At iteration m, for each transition in stage k it constructs a target
y_i^k = r_i^k + V_k^{(m−1)}(s’_i^k) if (s_i^k, a_i^k) ∉ T_k
y_i^k = r_i^k + γ_k V_{k+1}^{(m−1)}(φ_k(s’_i^k)) if (s_i^k, a_i^k) ∈ T_k,
where V_{k}^{(m‑1)} is derived from Q_k^{(m‑1)} using the max or expectation rule according to U. Once the targets are fixed, each stage solves an independent regression problem (typically least‑squares over a function class F_k) to obtain Q_k^{(m)}. This decoupling makes the per‑iteration computation fully parallelizable while preserving the cyclic dependencies through the target construction.
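The two-case target construction above can be sketched in tabular form. This is a minimal sketch under stated assumptions: `build_targets` and its signature are invented for illustration, stage values are stored as dictionary lookups rather than fitted regressors, and the next stage wraps from K back to 1.

```python
import numpy as np

def build_targets(dataset_k, k, K, V_prev, gamma_k, phi_k, terminal_k):
    """Targets y_i^k for stage k at iteration m, from stage values V^{(m-1)}."""
    targets = []
    for (s, a, r, s_next) in dataset_k:
        if (s, a) in terminal_k:
            # terminal transition: cycle to stage k+1 (wrapping K → 1),
            # discounting the next stage's value by γ_k
            nxt = k % K + 1
            y = r + gamma_k * V_prev[nxt][phi_k(s_next)]
        else:
            # intra-stage transition: no discount within the stage
            y = r + V_prev[k][s_next]
        targets.append(y)
    return np.array(targets)

# Toy two-stage example: V_prev maps stage index -> state-value lookup.
V_prev = {1: {0: 1.0, 1: 2.0}, 2: {0: 0.5}}
data = [(0, 0, 1.0, 1),   # intra-stage: y = 1.0 + V_1(1) = 3.0
        (1, 0, 2.0, 0)]   # terminal:    y = 2.0 + 0.9 * V_2(φ(0)) = 2.45
print(build_targets(data, k=1, K=2, V_prev=V_prev,
                    gamma_k=0.9, phi_k=lambda s: 0, terminal_k={(1, 0)}))
```

Once the targets are in hand, each stage's regression over F_k can run independently, which is what makes the per-iteration step parallelizable.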
For finite‑sample analysis the authors assume each Q_k belongs to a Besov space B^{α_k}_{p,q} with smoothness α_k and dimension d_k. Using covering‑number arguments they bound the approximation error of the regression step and propagate it through the Bellman operator. The main result shows that after M iterations
‖Q^{(M)} − Q^*_U‖_∞ ≤ C · max_k n_k^{−α_k/(2α_k+d_k)} + γ_cycle^M ‖Q^{(0)} − Q^*_U‖_∞,
where C depends on the Besov constants and the Bellman contraction. Crucially, the error scales with the hardest stage (largest d_k/α_k ratio) rather than the product of all stages, demonstrating that CycleFQI mitigates the curse of dimensionality that plagues monolithic baselines which would suffer a factor K in the exponent.
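A quick numeric illustration of the statistical term in the bound, with made-up sample sizes and smoothness/dimension values: the rate is governed by the single hardest stage (largest d_k/α_k ratio), not by a product over all K stages.

```python
def stage_rate(n: int, alpha: float, d: int) -> float:
    """Nonparametric regression rate n^{-α/(2α+d)} for one stage."""
    return n ** (-alpha / (2 * alpha + d))

# (n_k, α_k, d_k) per stage — illustrative numbers only.
stages = [(10_000, 2.0, 3),   # d/α = 1.5
          (10_000, 1.0, 8),   # d/α = 8.0  ← hardest stage
          (10_000, 3.0, 2)]   # d/α ≈ 0.67
rates = [stage_rate(*s) for s in stages]
print(max(rates))  # dominated by the stage with α = 1, d = 8
```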
Beyond point estimation, the paper tackles statistical inference for the optimal policy value V^*_U. Under a margin condition that guarantees a non‑trivial gap between the optimal and sub‑optimal actions, the authors adapt sieve estimation techniques to obtain asymptotic normality of the estimated value. This enables construction of confidence intervals and hypothesis tests for policy performance, a rare result in offline RL literature.
Empirical evaluation comprises two settings. In synthetic experiments, the authors construct a three‑stage cyclic MDP where each stage has distinct transition dynamics and reward structures. CycleFQI consistently outperforms standard FQI, BCQ, and CQL, achieving 10–15 % higher cumulative reward and showing robustness when only a subset of stages (e.g., U = {1,3}) is optimized. In a real‑world type‑1 diabetes dataset, the day is partitioned into morning, afternoon, and night phases. CycleFQI learns stage‑specific insulin dosing policies that reduce glucose variability and increase the proportion of time spent within target glucose ranges compared to baseline policies. Importantly, the method accommodates partial control scenarios where certain phases must follow clinically prescribed protocols, preserving safety while still improving overall outcomes.
In summary, the paper delivers a comprehensive framework for offline RL in environments with structured cyclicity. Its contributions include (1) a formal cyclic MDP model with stage‑wise heterogeneity, (2) a modular Bellman operator and the CycleFQI algorithm that enable parallel learning across stages, (3) finite‑sample error bounds that break the dimensionality barrier by leveraging Besov regularity, and (4) a sieve‑based inference procedure for policy value estimation. The work opens avenues for applying offline RL to many domains where processes naturally repeat with phase‑specific dynamics, and suggests future extensions to stochastic inter‑stage mappings, continuous‑time formulations, and deep neural network function approximators.