Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting
Distributional reinforcement learning (RL) is a powerful framework increasingly adopted in safety-critical domains for its ability to optimize risk-sensitive objectives. However, the role of the discount factor is often overlooked, as it is typically treated as a fixed parameter of the Markov decision process or a tunable hyperparameter, with little consideration of its effect on the learned policy. In the literature, it is well-known that the discounting function plays a major role in characterizing the time preferences of an agent, which an exponential discount factor cannot fully capture. Building on this insight, we propose a novel framework that supports flexible discounting of future rewards and optimization of risk measures in distributional RL. We provide a technical analysis of the optimality of our algorithms, show that our multi-horizon extension fixes issues raised with existing methodologies, and validate the robustness of our methods through extensive experiments. Our results highlight that discounting is a cornerstone in decision-making problems for capturing more expressive temporal and risk preference profiles, with potential implications for real-world safety-critical applications.
💡 Research Summary
The paper tackles a fundamental limitation of conventional reinforcement learning (RL) – the conflation of time preferences and risk preferences into a single scalar discount factor. Traditional Markov decision processes (MDPs) employ a fixed exponential discount γ, which simplifies analysis but cannot capture the rich, often non‑exponential, temporal preferences observed in humans and animals (e.g., hyperbolic or quasi‑hyperbolic discounting). Moreover, most risk‑sensitive RL approaches treat risk separately from discounting, leading to a fragmented treatment of preference modeling.
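As a minimal illustration (not code from the paper), the three discount families mentioned above can be written as plain functions of the delay t; the parameter names `gamma`, `k`, and `beta` are conventional choices, not the paper's notation:

```python
# Sketch of three common discount functions d(t), illustrating how
# non-exponential forms weight delayed rewards differently.

def exponential(t, gamma=0.99):
    # Standard MDP discounting: d(t) = gamma^t.
    return gamma ** t

def hyperbolic(t, k=0.1):
    # Hyperbolic discounting: d(t) = 1 / (1 + k*t).
    return 1.0 / (1.0 + k * t)

def quasi_hyperbolic(t, beta=0.7, gamma=0.99):
    # "Beta-delta" discounting: the immediate reward is undiscounted,
    # all future rewards carry an extra one-off factor beta.
    return 1.0 if t == 0 else beta * gamma ** t

for t in (0, 1, 10, 100):
    print(t, exponential(t), hyperbolic(t), quasi_hyperbolic(t))
```

Note that the hyperbolic curve decays much more slowly in the tail than the exponential one, which is exactly the long-horizon behavior an exponential γ cannot reproduce.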
To address this gap, the authors propose a unified framework that (1) allows arbitrary, non‑increasing discount functions d(t) (including exponential, hyperbolic, and quasi‑hyperbolic forms) and (2) integrates Optimized Certainty Equivalent (OCE) risk measures, a broad class encompassing expectation, CVaR, entropy‑regularized utilities, and more. The key technical device is “stock augmentation”: the environment state is extended with a statistic Cᵈₜ that tracks the accumulated, appropriately scaled reward history. This stock evolves according to Cᵈₜ₊₁ = (Cᵈₜ + Rₜ₊₁) / d̂ₜ, where d̂ₜ = dₜ₊₁ / dₜ. The combination Cᵈ₀ + Gᵈ₀ (total unscaled return) can be expressed recursively as dₜ·(Cᵈₜ + Gᵈₜ), providing an “any‑time proxy” that enables dynamic programming (DP) even when the discounting is non‑stationary.
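A toy numerical sketch of the stock recursion (the helper names `step_stock` and `d_hat` are ours, not the paper's) makes the invariant concrete: with C₀ = 0 and d(0) = 1, the quantity d(t)·Cᵈₜ accumulates exactly the discounted return collected so far:

```python
# Hypothetical sketch of the stock-augmented update described above.
# The ratio d_hat_t = d(t+1) / d(t) rescales the reward stock C so that
# d(t) * (C_t + G_t) is preserved along the trajectory.

def d_hat(d, t):
    return d(t + 1) / d(t)

def step_stock(C_t, r_next, d, t):
    # C_{t+1} = (C_t + R_{t+1}) / d_hat_t
    return (C_t + r_next) / d_hat(d, t)

def discounted_return(rewards, d):
    # G_0 = sum_t d(t) * R_{t+1}, with d(0) = 1.
    return sum(d(t) * r for t, r in enumerate(rewards))

rewards = [1.0, -0.5, 2.0]
d = lambda t: 0.9 ** t          # exponential case for a quick check
C = 0.0
for t, r in enumerate(rewards):
    C = step_stock(C, r, d, t)
# After consuming all rewards, d(T) * C_T recovers the discounted return.
```

The same recursion runs unchanged for a hyperbolic d, which is the point: the stock keeps the DP backup well-defined even when d̂ₜ varies with t.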
Two structural properties of the risk functional K are required: (i) indifference to scaling (preferences are preserved when outcomes are multiplied by the discount factor) and (ii) indifference to mixtures (preference orderings are preserved under convex combinations). These ensure that the distributional Bellman operator remains monotone and, for exponential discounting, contractive.
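To make the OCE family concrete, here is a small brute-force sketch (our own illustration, not the paper's implementation): OCE_u(X) = sup_λ { λ + E[u(X − λ)] } for a concave utility u, where u(x) = x recovers the expectation and u(x) = −(1/α)·max(−x, 0) recovers CVaR_α in its Rockafellar–Uryasev form:

```python
import numpy as np

# Sketch of an Optimized Certainty Equivalent (OCE) risk measure:
#   OCE_u(X) = sup_lambda { lambda + E[u(X - lambda)] }.
# The sup is brute-forced over a grid of candidate lambda values.

def oce(samples, u, grid):
    return max(lam + np.mean(u(samples - lam)) for lam in grid)

def cvar_utility(alpha):
    # u(x) = -(1/alpha) * max(-x, 0): plugging this in yields CVaR_alpha,
    # the mean of the lower alpha-tail of the distribution.
    return lambda x: -np.maximum(-x, 0.0) / alpha

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)              # samples of a return distribution
grid = np.linspace(-4.0, 4.0, 801)

mean_oce = oce(x, lambda t: t, grid)       # recovers E[X]
cvar_oce = oce(x, cvar_utility(0.1), grid) # lower 10%-tail mean, well below E[X]
```

For a standard normal, CVaR₀.₁ is roughly −1.75, far below the mean of 0, which is the risk-aversion the OCE objective injects into the Bellman backup.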
The paper develops algorithms for three horizon settings:
- Finite horizon – With a known terminal time T, backward induction yields exact optimal policies despite time‑varying discounts.
- Multi‑horizon – A single network simultaneously learns policies for a set of discount functions, enabling a single training run to produce agents with diverse temporal preferences.
- Infinite horizon – When ˆdₜ → 1 (as in hyperbolic discounting), standard contraction arguments fail. The authors propose a hybrid approach: a risk‑neutral policy with exponential discount serves as a stabilizing baseline, while the primary policy follows the general discount/OCE objective. They prove an error bound of order O(1/√N) using properties of the stock‑augmented representation.
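The finite-horizon case can be sketched in a few lines. The following toy example (our own risk-neutral illustration on an invented 2-state MDP, not the paper's distributional algorithm) shows the mechanical point: backward induction with the per-step ratio d̂ₜ = d(t+1)/d(t), which varies with t under hyperbolic discounting and therefore produces a time-indexed, possibly non-stationary policy:

```python
import numpy as np

# Backward induction on a tiny 2-state, 2-action MDP with a
# time-varying discount ratio d_hat_t = d(t+1) / d(t).

T = 5
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition kernel
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a] immediate rewards
              [0.0, 2.0]])

def d(t, k=0.2):
    return 1.0 / (1.0 + k * t)            # hyperbolic discounting

V = np.zeros(n_states)                    # terminal condition V_T = 0
policy = np.zeros((T, n_states), dtype=int)
for t in reversed(range(T)):
    d_hat = d(t + 1) / d(t)               # effective discount at step t
    Q = R + d_hat * (P @ V)               # Q[s, a] one-step backup
    policy[t] = Q.argmax(axis=1)
    V = Q.max(axis=1)
# Because d_hat_t changes with t, policy[t] may differ across t:
# the optimal finite-horizon policy is non-stationary in general.
```

The paper's actual algorithms perform this backup on return distributions under an OCE objective rather than on scalar values, but the time-indexed structure of the solution is the same.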
Theoretical contributions are complemented by extensive experiments. In a synthetic American put option trading task, the time‑consistent, risk‑sensitive policies outperform the time‑inconsistent baseline of Fedus et al. (2019), achieving higher risk‑adjusted returns. In the Windy Lunar Lander benchmark, the proposed method yields substantially better mean returns and lower variance. Finally, multi‑horizon learning is evaluated on a suite of Atari 2600 games, where it demonstrates faster convergence and higher final scores compared to standard distributional algorithms (C51, QR‑DQN, etc.). The experiments confirm that enforcing time‑consistency (i.e., allowing policies to be non‑stationary when the discount is non‑exponential) is crucial for performance.
In summary, the paper delivers a comprehensive solution for decoupling time and risk in RL: a mathematically sound stock‑augmented DP framework, rigorous optimality and error analyses for various horizons, practical algorithms that respect non‑stationary optimal policies, and empirical evidence of superior performance across finance, control, and high‑dimensional gaming domains. This work opens the door to more expressive preference modeling in safety‑critical applications such as autonomous driving, robotic manipulation, and financial portfolio management.