Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization
Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, precisely the failure mode that motivates heuristics such as PPO’s clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), enforcing overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
💡 Research Summary
The paper addresses a fundamental limitation of current trust‑region policy‑gradient methods such as TRPO and PPO: they constrain updates using the Kullback‑Leibler (KL) divergence, which only controls the average log‑density shift between the old and new policies. While KL limits the overall “distance” of a policy change, it does not directly prevent rare but extreme importance‑weight (likelihood‑ratio) spikes. These spikes dominate the variance of the gradient estimator, cause update shrinkage, and often lead to premature stagnation, especially in high‑dimensional continuous‑control tasks.
To remedy this, the authors propose a new trust‑region geometry based on distributional overlap, measured by the Bhattacharyya coefficient (BC), which determines the squared Hellinger distance. They first re‑parameterize a stochastic policy by its square‑root density ψθ(a|s)=√πθ(a|s). In the L² Hilbert space this representation lives on the unit sphere, and the inner product ⟨ψθ,ψθ′⟩ yields the BC ρs(θ,θ′)=∫√(πθ(a|s)πθ′(a|s)) da. The BC is bounded between 0 and 1, directly quantifies overlap, and its complement 1‑ρ equals the squared Hellinger distance (under the normalization H²=1‑ρ). Crucially, a second‑order Taylor expansion of the BC around the current parameters involves the same Fisher information matrix that appears in the KL expansion, i.e., locally the BC‑based trust region is equivalent to the KL‑based one up to a constant factor. Thus the new geometry preserves the desirable local Fisher structure while offering a globally bounded, symmetric distance.
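The relationship between overlap, Hellinger distance, and KL can be checked numerically in a simple case. The sketch below is illustrative only and is not taken from the paper: it assumes univariate Gaussian policies (for which the Bhattacharyya coefficient has a well-known closed form) and the normalization H² = 1 − BC. The function names are hypothetical.

```python
import math


def bhattacharyya_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form Bhattacharyya coefficient between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def hellinger_sq_gauss(mu1, sigma1, mu2, sigma2):
    """Squared Hellinger distance under the normalization H^2 = 1 - BC (an assumption here)."""
    return 1.0 - bhattacharyya_gauss(mu1, sigma1, mu2, sigma2)


def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
        - 0.5
    )


# Identical distributions overlap perfectly.
bc_same = bhattacharyya_gauss(0.0, 1.0, 0.0, 1.0)

# For a small mean shift, both quantities are quadratic forms in the Fisher
# information, so their ratio approaches a constant (1/4 in this normalization).
small_shift = 0.01
local_ratio = hellinger_sq_gauss(0.0, 1.0, small_shift, 1.0) / kl_gauss(
    0.0, 1.0, small_shift, 1.0
)

# For a large shift, KL grows without bound while H^2 saturates below 1.
far_h2 = hellinger_sq_gauss(0.0, 1.0, 10.0, 1.0)
far_kl = kl_gauss(0.0, 1.0, 10.0, 1.0)
```

In this Gaussian case the local ratio H²/KL tends to 1/4, which illustrates the "equivalent up to a constant factor" statement, while the far-apart case shows the bounded behavior of the overlap-based distance.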
The key technical device is the square‑root importance ratio q(s,a)=√r(s,a)=exp(Δ/2), where Δ=log πθ−log πold. Because r=q², a deviation δ=q−1 produces r−1=2δ+δ², so bounding q near 1 also bounds r. The authors derive a first‑order surrogate objective by expanding r≈1+2(q−1) and discarding higher‑order terms, yielding the Hellinger‑weighted surrogate L_Hell(θ)=E_old
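To make the square-root ratio concrete, here is a minimal sketch of computing q from log-probabilities and clipping it. The exact BPPO objective is not given in the text above, so the pessimistic min-with-clip form below is an assumption borrowed from standard PPO, and the names `sqrt_ratio`, `clipped_sqrt_surrogate`, `implied_ratio_bounds`, and the default `eps` are hypothetical.

```python
import math


def sqrt_ratio(logp_new, logp_old):
    """q = sqrt(r) = exp((log pi_new - log pi_old) / 2)."""
    return math.exp(0.5 * (logp_new - logp_old))


def clipped_sqrt_surrogate(logp_new, logp_old, advantages, eps=0.1):
    """Mean of min(q * A, clip(q, 1 - eps, 1 + eps) * A) over samples.

    Assumption: PPO-style pessimistic clipping applied to q instead of r.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        q = sqrt_ratio(lp_new, lp_old)
        q_clipped = min(max(q, 1.0 - eps), 1.0 + eps)
        total += min(q * adv, q_clipped * adv)
    return total / len(advantages)


def implied_ratio_bounds(eps):
    """Clipping q to [1 - eps, 1 + eps] confines r = q^2 to these bounds."""
    return (1.0 - eps) ** 2, (1.0 + eps) ** 2


# A sample whose likelihood ratio is r = 4 (so q = 2) with positive advantage
# is clipped at q = 1 + eps.
clipped_value = clipped_sqrt_surrogate([math.log(4.0)], [0.0], [1.0], eps=0.1)

# The same sample with negative advantage is left unclipped (pessimistic min).
unclipped_value = clipped_sqrt_surrogate([math.log(4.0)], [0.0], [-1.0], eps=0.1)

r_low, r_high = implied_ratio_bounds(0.1)
```

With eps = 0.1, the clip range on q corresponds to r in roughly [0.81, 1.21], which shows how a symmetric band on q becomes an asymmetric band on the likelihood ratio.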