Risk-Sensitive Exponential Actor Critic
Model-free deep reinforcement learning (RL) algorithms have achieved tremendous success on a range of challenging tasks. However, safety concerns remain when these methods are deployed in real-world applications, necessitating risk-aware agents. A common utility for learning such risk-aware agents is the entropic risk measure, but current policy gradient methods optimizing this measure must perform high-variance and numerically unstable updates. As a result, existing risk-sensitive model-free approaches are limited to simple tasks and tabular settings. In this paper, we provide a comprehensive theoretical justification for policy gradient methods on the entropic risk measure, including on- and off-policy gradient theorems for the stochastic and deterministic policy settings. Motivated by theory, we propose risk-sensitive exponential actor-critic (rsEAC), an off-policy model-free approach that incorporates novel procedures to avoid the explicit representation of exponential value functions and their gradients, and optimizes its policy w.r.t. the entropic risk measure. We show that rsEAC produces more numerically stable updates compared to existing approaches and reliably learns risk-sensitive policies in challenging risky variants of continuous tasks in MuJoCo.
💡 Research Summary
The paper tackles a fundamental challenge in risk‑sensitive reinforcement learning (RL): how to optimize policies with respect to the entropic risk measure (also known as exponential utility) in a model‑free, deep‑learning setting without suffering from the high variance and numerical instability that have plagued prior approaches. The authors first provide a rigorous theoretical foundation by deriving on‑policy policy‑gradient theorems for both stochastic and deterministic policies under the entropic risk objective. For stochastic policies, Theorem 1 shows that the gradient involves a “twisted” state distribution ρ*_π(s) and an exponential weighting e^{β(Q_β(s,a)−V_β(s))}, which captures the risk preference β but can cause overflow when β is large. For deterministic policies, Theorem 2 eliminates the need for the exponential weighting altogether, yielding a gradient that depends only on the gradient of the soft‑action‑value function Q_β with respect to actions, evaluated at the deterministic policy output. This deterministic formulation is both computationally cheaper (no action integral) and numerically safer.
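For reference, the entropic risk measure underlying both theorems has the following standard form (this definition and its Taylor expansion are textbook facts about exponential utility, not quoted from the paper):

```latex
% Entropic risk objective for the return R under policy \pi, with risk parameter \beta:
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_\pi\!\left[e^{\beta R}\right],
\qquad R = \textstyle\sum_{t} r_t .
% A Taylor expansion around \beta = 0 reveals the risk preference it encodes:
J_\beta(\pi) \;\approx\; \mathbb{E}_\pi[R] \;+\; \frac{\beta}{2}\,\mathrm{Var}_\pi[R].
```

Negative β penalizes return variance (risk-averse behavior), while positive β rewards it (risk-seeking); this is why large |β| both strengthens the risk preference and inflates the exponential weighting in Theorem 1.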
Next, the authors extend the theory to off‑policy learning. By defining an off‑policy objective J_β^b(µ) that integrates the soft‑value function of the target deterministic policy over the state distribution induced by a behavior policy b, they propose an approximate gradient g(µ) that drops the term involving ∇_θ Q_β. Theorem 3 proves that, when the policy is represented tabularly, updating the parameters with this approximate gradient guarantees monotonic improvement of the soft‑value function, thus justifying the approximation.
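The approximate gradient g(µ) has the same shape as a deterministic policy gradient: a chain rule through the action, with the term involving ∇_θ Q_β dropped. The toy NumPy sketch below illustrates this structure under assumed forms (the quadratic Q_β(s,a) = −(a−s)², linear policy µ_θ(s) = θs, and Gaussian behavior states are all illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dq_da(s, a):
    # Analytic action gradient of a toy soft action-value
    # Q_beta(s, a) = -(a - s)^2, which is maximized at a = s.
    return -2.0 * (a - s)

def approx_gradient(theta, states):
    # g(mu) ~ E_{s ~ behavior}[ d mu_theta/d theta * dQ/da |_{a = mu_theta(s)} ];
    # the term involving grad_theta Q_beta is dropped, as in the paper's g(mu).
    actions = theta * states      # mu_theta(s) = theta * s
    dmu_dtheta = states           # d(theta * s) / d theta
    return np.mean(dmu_dtheta * dq_da(states, actions))

theta = 0.2
states = rng.normal(size=1000)    # states sampled under a behavior policy
for _ in range(200):              # gradient ascent on the off-policy objective
    theta += 0.1 * approx_gradient(theta, states)
# theta converges toward 1, where mu_theta(s) = s maximizes the toy Q.
```

Even though the gradient is only approximate, each step here increases the toy value function, mirroring the monotonic-improvement guarantee of Theorem 3 in the tabular case.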
Building on these results, the paper introduces the risk-sensitive exponential actor-critic (rsEAC), a practical off-policy algorithm. The key innovation lies in how the critic is represented. Instead of directly learning the exponential value function Z(s,a)=e^{βQ(s,a)} (which quickly leads to overflow/underflow and exploding gradients), the authors parametrize Z as Z_ψ(s,a)=e^{Q_ψ(s,a)}, where Q_ψ is a conventional neural network approximating the soft action-value. The loss is the squared exponential TD error: (e^{Q_ψ(s,a)}−e^{βr+Q'_ψ(s')})². Because the gradient with respect to ψ contains the factor ∇Q_ψ(s,a) taken in log-space, the update is far more stable. To further tame the exponential term, they factor out the larger exponent and define a bounded helper function f(x,y).