Boosting CVaR Policy Optimization with Quantile Gradients
Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) suffers from significant sample inefficiency. This inefficiency stems from its focus on tail-end performance, which discards many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective, since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
💡 Research Summary
The paper tackles a fundamental inefficiency of Conditional Value‑at‑Risk policy gradient (CVaR‑PG), a popular method for risk‑averse reinforcement learning. CVaR‑PG updates the policy using only the worst‑case α‑fraction of sampled trajectories, discarding the remaining 1 − α fraction of samples and ignoring high‑reward trajectories—a phenomenon termed “blindness to success.” Moreover, when the return distribution is flat or discrete, the gradient can vanish, halting learning. Prior attempts to improve sample efficiency (cross‑entropy sampling, policy mixing, per‑step reward re‑weighting) either require control over environment stochasticity or rely on strong alignment between risk‑neutral and risk‑averse actions, limiting their practicality.
The authors observe that CVaR is the expectation of the Value‑at‑Risk (VaR, i.e., quantile) over the tail:
$$\mathrm{CVaR}_\alpha(Z) \;=\; \frac{1}{\alpha}\int_0^{\alpha} \mathrm{VaR}_\tau(Z)\,\mathrm{d}\tau$$
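This tail-average identity is easy to verify numerically. The sketch below (my own illustration, not the paper's code) compares the mean of the worst α-fraction of samples against the average of the quantiles over the interval [0, α):

```python
import numpy as np

# Numerical check of the identity
#   CVaR_alpha(Z) = (1/alpha) * integral_0^alpha VaR_tau(Z) dtau,
# i.e. CVaR is the average of the quantiles (VaR) over the tail.
rng = np.random.default_rng(0)
z = rng.normal(size=200_000)            # sampled returns; lower = worse
alpha = 0.05

var_a = np.quantile(z, alpha)           # VaR_alpha
cvar_direct = z[z <= var_a].mean()      # mean of the worst alpha-fraction

# Midpoint rule over quantile levels tau in [0, alpha)
taus = np.linspace(0.0, alpha, 1_000, endpoint=False) + alpha / 2_000
cvar_from_quantiles = np.quantile(z, taus).mean()
```

The two estimates agree up to discretization and sampling noise, which is why optimizing an expected-quantile term over the tail leaves the CVaR objective unchanged.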