Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?
Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner's actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves $\widetilde{O}(H^2|\Xi|\sqrt{K})$ regret. For large, continuous endogenous state spaces, we introduce LSVI-PE, a simple linear-approximation method whose regret is polynomial in the feature dimension, exogenous state space, and horizon, independent of the endogenous state and action spaces. Our analysis introduces two new tools, counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to maintain accurate value estimates without optimism. Experiments on synthetic and resource-management tasks show that PEL consistently outperforms baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.
💡 Research Summary
The paper investigates reinforcement learning (RL) in Exogenous Markov Decision Processes (Exo‑MDPs), a class of problems where the state is split into an endogenous component (the system’s internal configuration) and an exogenous component (external randomness such as demand, arrivals, or prices). By definition, actions can influence only the endogenous part; the exogenous process evolves independently of the agent’s decisions. This structural property is common in many operations‑research domains (inventory control, energy storage, cloud resource allocation) and suggests that the information contained in a single exogenous trajectory can be reused to evaluate any policy, because the trajectory’s distribution does not depend on the policy that generated it.
The authors ask a bold question: “Is explicit exploration necessary for learning near‑optimal policies in Exo‑MDPs, especially when we use linear function approximation (LFA) for large endogenous state spaces?” Their answer is no: exploration is not required. Their contributions can be grouped into three parts.
1. Warm‑up: Exo‑Bandits and Tabular Exo‑MDPs
When the horizon $H = 1$, the problem reduces to an “Exo‑Bandit”: at each round the learner picks an arm, an exogenous signal $\xi$ is drawn, and the reward $r(a, \xi)$ becomes known for all arms after observing $\xi$. In this full‑information setting, the classic Follow‑the‑Leader (FTL) algorithm, which simply chooses the arm with the highest empirical mean, achieves per‑round simple regret $O(\sigma^{2} \log A / k)$ after $k$ rounds and cumulative regret $O(\sigma \sqrt{K \log A})$, matching optimal expert‑type bounds. This illustrates that full‑information exogenous feedback eliminates the need for exploration.
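A minimal sketch of FTL in this full‑information Exo‑Bandit setting (the `reward_fn` interface and names are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def ftl_exo_bandit(reward_fn, xi_samples, n_arms):
    """Follow-the-Leader in a full-information Exo-Bandit.

    reward_fn(a, xi) is a known deterministic reward map; the only
    randomness is the exogenous draw xi. Observing xi reveals every
    arm's reward, so all empirical means are updated each round
    without pulling the other arms."""
    sums = np.zeros(n_arms)
    choices = []
    for xi in xi_samples:
        a = int(np.argmax(sums))  # greedy: arm with highest running total
        choices.append(a)
        # Full information: evaluate r(a', xi) for every arm a'.
        sums += np.array([reward_fn(ap, xi) for ap in range(n_arms)])
    return choices, sums / len(xi_samples)
```

Because the argmax of the running totals equals the argmax of the empirical means, no separate count bookkeeping is needed.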
Extending to finite‑horizon, finite‑state Exo‑MDPs, the authors define a policy‑level FTL (the Pure Exploitation Learning, PEL, algorithm). At episode $k$, PEL builds empirical Q‑values for every state‑action pair by averaging rewards over all previously observed exogenous traces, then acts greedily with respect to these estimates. No optimism or random perturbation is added. The authors prove a regret bound of $\widetilde{O}(H^{2}|\Xi|\sqrt{K})$, where $|\Xi|$ is the number of exogenous states and $H$ the horizon. Notably, the bound is independent of the sizes of the endogenous state space and the action set, reflecting the fact that all needed information is captured by the exogenous process.
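The tabular PEL update can be sketched as a backward pass over empirical Q‑values; the deterministic transition map `f` and reward `r` below are hypothetical stand‑ins for the known Exo‑MDP dynamics:

```python
import numpy as np

def pel_greedy_policy(traces, H, X, A, f, r):
    """Tabular PEL sketch: build empirical Q-values by averaging over all
    previously observed exogenous traces, then act greedily -- no bonus,
    no optimism. f(x, a, xi) -> next endogenous state and r(x, a, xi) ->
    reward are assumed known, per the Exo-MDP definition."""
    V = np.zeros((H + 1, X))            # V[H] = 0 (terminal)
    pi = np.zeros((H, X), dtype=int)
    for h in reversed(range(H)):
        xis = [tr[h] for tr in traces]  # exogenous states seen at step h
        Q = np.zeros((X, A))
        for x in range(X):
            for a in range(A):
                # Empirical average over the shared exogenous dataset:
                # every (x, a) pair reuses the same observed traces.
                Q[x, a] = np.mean([r(x, a, xi) + V[h + 1, f(x, a, xi)]
                                   for xi in xis])
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return pi, V
```

The key point the sketch makes concrete: the same exogenous traces feed the estimates for every state‑action pair, which is why no state‑action exploration is needed.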
2. Linear Function Approximation: LSVI‑PE
Real‑world problems often have continuous or high‑dimensional endogenous states, making tabular methods infeasible. The paper therefore introduces LSVI‑PE (Least‑Squares Value Iteration with Pure Exploitation). The algorithm proceeds in three steps each episode:
- Empirical Exogenous Model – Using all past exogenous traces, it estimates the transition kernel $\hat{P}_h(\cdot \mid \xi_h)$ via maximum likelihood.
- Post‑Decision State Construction – After taking action a at endogenous state x and observing the next exogenous state ξ′, the algorithm forms a post‑decision tuple (x, a, ξ′). This separates the action choice from the stochastic exogenous shock, allowing the reward to be expressed as a deterministic function of the tuple.
- Linear Regression Backward Pass – Starting from the last step, it solves a regularized least‑squares problem to fit a linear action‑value function $Q_h(s,a) \approx \phi(s,a)^\top \theta_h$ using the constructed post‑decision targets. All data come from greedy trajectories only; no exploratory actions are ever taken.
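The regression backward pass (step 3) can be sketched as follows, under an assumed data layout in which each greedy transition stores its feature vector, reward, and the feature vectors of all next‑state actions (this interface is illustrative, not the paper's exact one):

```python
import numpy as np

def lsvi_pe_backward(data, H, d, lam=1.0):
    """LSVI-PE backward pass sketch: one ridge regression per step.

    data[h] is a list of (phi, reward, phi_next) tuples from greedy
    trajectories: phi is the d-dim feature of the visited state-action
    pair, and phi_next stacks the features of every action at the next
    state so the greedy max can be taken when forming the target."""
    thetas = [np.zeros(d) for _ in range(H + 1)]  # thetas[H] stays zero
    for h in reversed(range(H)):
        Phi = np.stack([phi for phi, _, _ in data[h]])        # (n, d)
        # Regression target: reward plus greedy next-step value estimate.
        y = np.array([rwd + max(nxt @ thetas[h + 1])
                      for _, rwd, nxt in data[h]])
        A = Phi.T @ Phi + lam * np.eye(d)      # regularized Gram matrix
        thetas[h] = np.linalg.solve(A, Phi.T @ y)
    return thetas
```

Note there is no bonus term added to `y`: the targets are the plain empirical Bellman backups, which is exactly what distinguishes pure exploitation from optimism‑based LSVI.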
The analysis hinges on two novel technical tools:
- Counterfactual Trajectory Construction – For any policy π, the authors define a “counterfactual” trajectory that follows the same exogenous realizations as the observed greedy trajectory but applies π’s actions. This construction shows that the empirical value estimates are unbiased for any policy, despite having been computed from a single greedy dataset.
- Bellman‑Closed Feature Transport – They prove that the linear feature map φ is closed under the Bellman operator when combined with the estimated exogenous model. Consequently, the error introduced by approximating the Bellman update does not amplify across steps.
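The counterfactual‑trajectory idea can be illustrated with a short replay routine; `f`, `r`, and the policy interface are hypothetical names for the known Exo‑MDP dynamics:

```python
def counterfactual_value(trace, pi, x0, f, r):
    """Replay one observed exogenous trace under an arbitrary policy pi
    (counterfactual trajectory sketch). Because the exogenous process is
    action-independent, a trace collected by the greedy policy is a valid
    sample for *any* policy, so averaging this quantity over traces
    estimates V^pi without ever executing pi."""
    x, total = x0, 0.0
    for h, xi in enumerate(trace):
        a = pi(h, x)             # counterfactual action choice
        total += r(x, a, xi)     # same exogenous shock, different action
        x = f(x, a, xi)          # endogenous transition
    return total
```

Averaging `counterfactual_value` over the stored traces gives the unbiased off‑policy value estimate that the analysis relies on.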
Using these tools, they derive a regret bound of $\widetilde{O}(d^{2}|\Xi|H^{2}\sqrt{K})$, where $d$ is the feature dimension. Crucially, the bound does not depend on the cardinalities of the endogenous state space or the action set, which can be infinite. This matches the best known rates for linear MDPs under optimism‑based exploration, but here no optimism or forced exploration is employed.
3. Necessity of the Exo‑MDP Structure and Empirical Validation
The authors also present a negative result: if either the endogenous transition dynamics or the reward function is unknown (i.e., the problem no longer fits the Exo‑MDP definition), then any pure‑exploitation algorithm can suffer linear regret. This demonstrates that the independence of the exogenous process from actions is not merely sufficient but also necessary for the proposed approach.
Empirically, the paper evaluates PEL (tabular) and LSVI‑PE (linear) on synthetic benchmarks and on two classic OR problems: a stochastic inventory‑control model and an energy‑storage management task. Baselines include optimism‑driven LSVI, UCB‑type methods, and model‑based approaches that incorporate explicit exploration. Across all experiments, PEL and LSVI‑PE achieve lower cumulative regret and converge faster, especially when the exogenous state space is large. Moreover, the computational overhead scales linearly with $|\Xi|$ and $d$, confirming the practicality of the methods for large‑scale applications.
Overall Impact
This work overturns the prevailing belief in RL that exploration is indispensable for achieving sublinear regret. By exploiting the structural independence of exogenous dynamics, the authors show that a learner can reuse a single stream of exogenous data to evaluate any policy, thereby eliminating the need for optimism or random exploration. The theoretical contributions (new regret decomposition, counterfactual trajectories, Bellman‑closed feature transport) are likely to inspire further research on exploitation‑only algorithms in other factored or partially observable settings. For practitioners in operations research and industry, the results provide a rigorous justification for the long‑standing practice of “greedy” approximate dynamic programming without explicit exploration, while also offering scalable algorithms that come with provable performance guarantees.