Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning
We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
💡 Research Summary
This paper introduces a novel framework for integrating Cumulative Prospect Theory (CPT) into reinforcement learning (RL) and develops a first‑order policy gradient method for optimizing CPT‑based objectives. Traditional RL maximizes the expected cumulative reward, assuming risk‑neutral agents. Existing risk‑sensitive RL approaches (e.g., variance, CVaR) capture only limited aspects of risk and do not model the asymmetric gain‑loss perception and probability distortion that characterize human decision making. CPT, a cornerstone of behavioral economics, addresses both phenomena by (i) applying separate utility transformations to gains and losses relative to a reference point, and (ii) distorting probabilities through weighting functions that over‑weight rare events and under‑weight common ones.
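To make the two ingredients concrete, here is a minimal sketch of the classic Tversky–Kahneman (1992) parameterization: a power utility that is concave for gains, convex and steeper for losses (loss aversion), and an inverse-S probability weighting function that over-weights small probabilities. The specific parameter values (α = 0.88, λ = 2.25, γ = 0.61) are the commonly cited median estimates from that paper, not values taken from this one.

```python
import numpy as np

def tk_utility(x, alpha=0.88, lam=2.25, ref=0.0):
    """Tversky-Kahneman value function around a reference point `ref`:
    concave (x - ref)^alpha on gains, and -lam * (ref - x)^alpha on
    losses, so losses loom larger than gains when lam > 1."""
    d = np.asarray(x, dtype=float) - ref
    gains = np.clip(d, 0.0, None) ** alpha
    losses = -lam * np.clip(-d, 0.0, None) ** alpha
    return np.where(d >= 0, gains, losses)

def tk_weight(p, gamma=0.61):
    """Inverse-S probability weighting: over-weights rare events
    (small p) and under-weights common ones (large p)."""
    p = np.asarray(p, dtype=float)
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)
```

For example, a loss of 1 is felt as −2.25 while a gain of 1 is worth +1, and a 1% probability is distorted upward to roughly 5.5%.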
The authors formalize the problem in a finite‑horizon Markov Decision Process (MDP) with a parametrized stochastic policy πθ, where θ∈ℝ^d. The objective is the CPT value C(X) of the random total return X=∑_{t=0}^{H‑1} r_t, defined as
C(X)=∫_0^∞ w⁺(P(u⁺(X)>z))dz − ∫_0^∞ w⁻(P(u⁻(X)>z))dz.
When the weighting functions w⁺, w⁻ are the identity and the utility functions u⁺, u⁻ are linear, C(X) reduces to the ordinary expectation, showing that CPT generalizes the standard RL objective.
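The summary mentions a Monte Carlo estimator of C(X) based on order statistics. A minimal sketch of such an estimator (in the spirit of the order-statistics CPT estimators of Prashanth et al.; the exact estimator used in this paper may differ) sorts the return samples and weights each one by a difference of distorted empirical tail probabilities:

```python
import numpy as np

def cpt_value(samples, u_plus, u_minus, w_plus, w_minus):
    """Order-statistics Monte Carlo estimate of the CPT value C(X).

    samples: 1-D array of i.i.d. draws of the total return X.
    u_plus, u_minus: nonnegative utilities applied to gains / losses.
    w_plus, w_minus: probability weighting functions on [0, 1].
    """
    x = np.sort(np.asarray(samples, dtype=float))  # X_(1) <= ... <= X_(n)
    n = len(x)
    i = np.arange(1, n + 1)
    # Gain term: the i-th smallest sample is exceeded with empirical
    # probability (n - i)/n, so it receives the distorted mass
    # w+((n - i + 1)/n) - w+((n - i)/n).
    gain_w = w_plus((n - i + 1) / n) - w_plus((n - i) / n)
    # Loss term: the mirrored construction on the lower tail.
    loss_w = w_minus(i / n) - w_minus((i - 1) / n)
    return np.sum(u_plus(x) * gain_w) - np.sum(u_minus(x) * loss_w)
```

As a sanity check, with identity weighting functions and the ramp utilities u⁺(x) = max(x, 0), u⁻(x) = max(−x, 0), every weight collapses to 1/n and the estimator reduces to the sample mean, matching the reduction to the standard RL objective noted above.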
The core technical contribution is a policy gradient theorem for CPT. Under mild regularity assumptions (continuous utilities, Lipschitz and differentiable weighting functions, differentiable policy parameterization), the gradient of the CPT objective J(θ)=C(X) admits the closed‑form expression:
∇_θ J(θ)=E