Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns up to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon return, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove fast finite-sample convergence: minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and standard RL environments: JumpRiverswim, FrozenLake, and AnyTrading.


💡 Research Summary

The paper tackles a largely overlooked setting in reinforcement learning: online learning in non‑episodic, finite‑horizon Markov Decision Processes (MDPs). In this regime an agent experiences a single trajectory of length T without resets, and the objective is to maximize the cumulative reward up to a known terminal time. Existing theory focuses either on infinite‑horizon discounted or average‑reward problems, or on episodic finite‑horizon tasks where the environment restarts after each episode. Both families rely on structural properties—discount contraction, stationary distributions, or repeated resets—that do not hold when the horizon is fixed and the process never restarts. Consequently, standard model‑based and model‑free algorithms suffer linear regret Ω(T) in this setting because they must estimate the full‑horizon Q‑function from a single trajectory, which incurs high variance.

Key Idea.
The authors propose to deliberately truncate the planning horizon. Instead of learning the full‑horizon Q‑function Q*_h(s,a), they learn a K‑step lookahead Q‑function Q*_{T−K}(s,a), which only predicts the expected return over the next K steps. When K=1 the problem reduces to a contextual bandit, which is known to admit constant regret. To further reduce sample complexity they introduce a thresholding mechanism: at each time step t a time‑varying threshold γ_t is defined, and the agent selects an action only if its estimated K‑step lookahead value exceeds γ_t; otherwise it defaults to the greedy K‑step action. This "K‑step lookahead thresholding" policy, denoted π_{K,γ}, focuses learning on actions that are already promising, thereby limiting exploration to a smaller, more informative set.
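The thresholding rule above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the schedule γ_t = γ₀·t^{−α} and the function/parameter names (`threshold_action`, `gamma0`, `alpha`) are assumptions made here; the paper only requires a time‑varying threshold.

```python
import numpy as np

def threshold_action(q_k, t, gamma0=1.0, alpha=0.5, rng=None):
    """Select an action under the K-step lookahead thresholding rule.

    q_k : estimated K-step lookahead values for each action in the
          current state (shape: [num_actions]).
    t   : current time step (1-indexed), driving the decaying threshold.
    The schedule gamma_t = gamma0 * t**(-alpha) is an illustrative choice.
    """
    rng = rng or np.random.default_rng()
    gamma_t = gamma0 * t ** (-alpha)           # time-varying threshold
    promising = np.flatnonzero(q_k > gamma_t)  # actions clearing the bar
    if promising.size > 0:
        return int(rng.choice(promising))      # explore among promising actions
    return int(np.argmax(q_k))                 # fall back to greedy K-step action
```

With a large estimated value above the threshold, only that action is eligible; when no action clears the bar, the rule degrades gracefully to the greedy K‑step choice.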

Theoretical Contributions.

  1. Policy Optimality.

    • When K ≥ T, the K‑step greedy policy coincides with the optimal finite‑horizon policy π⁎.
    • For binary‑state MDPs satisfying a stochastic dominance assumption (the action with highest immediate reward also maximizes the probability of transitioning to the higher‑reward state), the K‑step greedy policy is optimal for any K ≥ 1 (Theorem 3.3).
    • In general multi‑state MDPs, the authors construct instances where any K < T yields an optimality gap linear in T (Theorem 3.4), showing that truncation inevitably trades off long‑term optimality for sample efficiency.
  2. Algorithm – LGKT (LCB‑Guided K‑step Thresholding).

    • For each state‑action pair, LGKT maintains upper and lower confidence bounds on the K‑step lookahead value using Hoeffding‑type concentration.
    • The threshold γ_t follows a pre‑specified decreasing schedule (e.g., γ_t = γ₀ · t^{−α}).
    • At decision time, the algorithm forms the set of actions whose upper confidence bound exceeds γ_t; if the set is non‑empty, it samples uniformly from it, otherwise it executes the greedy K‑step action.
    • After observing the immediate reward and next state, the K‑step return estimate is updated, and confidence intervals are tightened.
  3. Regret Analysis.

    • For K = 1, LGKT reduces to a standard stochastic bandit algorithm and achieves minimax‑optimal constant regret O(1) (Theorem 4.2).
    • For any K ≥ 2, the regret against the benchmark policy π_{K,γ} is bounded by
      O(max(K−1, C_{K−1}) · √(SAT log T)).

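The LGKT decision loop described in item 2 can be sketched as a small tabular agent. This is a minimal sketch under stated assumptions, not the authors' code: the class name `LGKTSketch`, the Hoeffding‑style radius, and the schedule γ_t = γ₀·t^{−α} are illustrative choices; the paper's confidence bounds and constants may differ.

```python
import numpy as np

class LGKTSketch:
    """Illustrative sketch of the LGKT decision loop (tabular setting).

    Maintains, per (state, action), a running mean of the observed K-step
    return and a Hoeffding-style confidence radius shrinking as 1/sqrt(n).
    """

    def __init__(self, n_states, n_actions, gamma0=1.0, alpha=0.5,
                 delta=0.05, seed=0):
        self.mean = np.zeros((n_states, n_actions))   # K-step return estimates
        self.count = np.zeros((n_states, n_actions))  # visit counts
        self.gamma0, self.alpha, self.delta = gamma0, alpha, delta
        self.rng = np.random.default_rng(seed)

    def _radius(self, s):
        # Hoeffding-type confidence radius for rewards in [0, 1]
        n = np.maximum(self.count[s], 1)
        return np.sqrt(np.log(2.0 / self.delta) / (2.0 * n))

    def act(self, s, t):
        # Form the set of actions whose upper confidence bound clears
        # the time-varying threshold; sample uniformly from it if non-empty.
        ucb = self.mean[s] + self._radius(s)
        gamma_t = self.gamma0 * t ** (-self.alpha)
        candidates = np.flatnonzero(ucb > gamma_t)
        if candidates.size > 0:
            return int(self.rng.choice(candidates))
        return int(np.argmax(self.mean[s]))  # greedy K-step action otherwise

    def update(self, s, a, k_step_return):
        # Incremental mean update after observing a K-step return sample
        self.count[s, a] += 1
        self.mean[s, a] += (k_step_return - self.mean[s, a]) / self.count[s, a]
```

As t grows the threshold decays, so more actions become eligible for uniform sampling, while the shrinking confidence radii concentrate estimates on the actions actually played.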