Improved Stochastic Optimization of LogSumExp

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The LogSumExp function, dual to the Kullback-Leibler (KL) divergence, plays a central role in many important optimization problems, including entropy-regularized optimal transport (OT) and distributionally robust optimization (DRO). In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. We propose a novel convexity- and smoothness-preserving approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. Our experiments and theoretical analysis of the LogSumExp-based stochastic optimization, arising in DRO and continuous OT, demonstrate the advantages of our approach over existing baselines.


💡 Research Summary

The paper tackles a fundamental computational bottleneck that arises when optimizing objectives containing the LogSumExp (LSE) function with a large or infinite number of exponential terms. In many modern machine learning and operations‑research problems—soft‑max classification, entropy‑regularized optimal transport (OT), distributionally robust optimization (DRO), and others—the LSE appears as a log‑partition functional. Computing its exact gradient requires differentiating every term, which is infeasible at scale.
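To make the bottleneck concrete: the log sits outside the expectation, so a plain minibatch estimate of a log-mean-exp is biased. A minimal numerical sketch (synthetic scores from a standard normal; all names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=100_000)  # synthetic scores phi(x) for x ~ mu

# Exact log-partition over the full sample: log E_mu[exp(phi)]
full = np.log(np.mean(np.exp(phi)))

# Naive approach: estimate it as the average of per-minibatch
# log-mean-exp values. Jensen's inequality (E[log X] <= log E[X])
# makes this systematically biased low.
batches = phi.reshape(-1, 10)  # minibatches of size 10
naive = np.mean(np.log(np.mean(np.exp(batches), axis=1)))

print(full, naive)  # naive lands noticeably below full
```

This bias is what rules out naively minibatching the LSE itself and motivates an approximation whose stochastic gradients are unbiased.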

The authors propose a principled approximation that preserves convexity and smoothness while enabling stochastic gradient methods. Starting from the Gibbs variational principle, they replace the Kullback–Leibler (KL) divergence in the dual formulation with a new $f$-divergence they call the “safe KL divergence” $D_{\rho}$. For a parameter $0<\rho<1$ the generator is
$$f_{\rho}(t)=t\log t+1+\frac{1-\rho t}{\rho}\log(1-\rho t)$$
for $0\le t\le 1/\rho$, and $+\infty$ otherwise. This construction caps the density ratio $d\nu/d\mu$ at $1/\rho$, preventing the uncontrolled growth that can cause numerical overflow in OT or DRO settings.
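A direct implementation makes the domain cap explicit. The following sketch is ours, not the paper's code; `f_rho` simply evaluates the formula above and returns $+\infty$ outside $[0,1/\rho]$:

```python
import numpy as np

def f_rho(t, rho):
    """Generator of the safe KL divergence, as stated above:
    f_rho(t) = t log t + 1 + ((1 - rho t)/rho) log(1 - rho t)
    on 0 <= t <= 1/rho, and +inf otherwise."""
    t = np.asarray(t, dtype=float)
    out = np.full_like(t, np.inf)          # +inf outside the domain
    ok = (t >= 0) & (t <= 1.0 / rho)
    ts, u = t[ok], 1.0 - rho * t[ok]
    # use the convention 0 * log 0 = 0 at both endpoints
    tlogt = np.where(ts > 0, ts * np.log(np.where(ts > 0, ts, 1.0)), 0.0)
    ulogu = np.where(u > 0, (u / rho) * np.log(np.where(u > 0, u, 1.0)), 0.0)
    out[ok] = tlogt + 1.0 + ulogu
    return out

rho = 0.5
print(f_rho([0.0, 1.0, 2.0, 2.5], rho))  # finite up to 1/rho = 2, then inf
```

Any density ratio above $1/\rho$ incurs infinite penalty, which is precisely how the divergence "caps" $d\nu/d\mu$.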

The convex conjugate of $f_{\rho}$ is a scaled SoftPlus:
$$f_{\rho}^{*}(s)=\frac{1}{\rho}\log\bigl(1+\rho e^{s}\bigr)-1.$$
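The closed form can be sanity-checked by brute-forcing the conjugate $\sup_t\{st-f_{\rho}(t)\}$ on a grid over $(0,1/\rho)$; a small sketch, assuming the generator as stated above:

```python
import numpy as np

rho, s = 0.5, 1.2

# closed form from above: f*_rho(s) = (1/rho) log(1 + rho e^s) - 1
closed = np.log1p(rho * np.exp(s)) / rho - 1.0

# brute force:  sup_t { s t - f_rho(t) }  over the interior of the domain
t = np.linspace(1e-9, 1.0 / rho - 1e-9, 1_000_001)
f = t * np.log(t) + 1.0 + ((1.0 - rho * t) / rho) * np.log(1.0 - rho * t)
brute = np.max(s * t - f)

print(closed, brute)  # the two values agree
```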
Using this, the original log-partition functional $F(\phi;\mu)=\log\int e^{\phi}\,d\mu$ is approximated by
$$F_{\rho}(\phi;\mu)=\min_{\lambda\in\mathbb{R}}\Bigl\{\lambda+\int f_{\rho}^{*}(\phi-\lambda)\,d\mu\Bigr\}.$$
Because the objective is an expectation for every fixed $\lambda$, minibatch gradients in $\phi$ and $\lambda$ are unbiased, which is what makes stochastic gradient methods applicable.
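Assuming the Fenchel dual form $\min_{\lambda}\{\lambda+\int f_{\rho}^{*}(\phi-\lambda)\,d\mu\}$, a toy sketch on synthetic scores (function names are ours) illustrates two consequences of $\tfrac{1}{\rho}\log(1+\rho e^{s})\le e^{s}$: the approximation never exceeds the exact log-mean-exp, and it tightens as $\rho\to 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.normal(size=20_000)  # synthetic scores phi(x_i), x_i ~ mu

def f_star(s, rho):
    # scaled SoftPlus conjugate: (1/rho) * log(1 + rho * e^s) - 1
    return np.log1p(rho * np.exp(s)) / rho - 1.0

def F_rho(phi, rho, lams):
    # dual form: min over lambda of  lambda + E_mu[ f*_rho(phi - lambda) ].
    # For each fixed lambda the objective is a plain expectation, so
    # minibatch gradients in (phi, lambda) are unbiased.
    return min(lam + np.mean(f_star(phi - lam, rho)) for lam in lams)

lse = np.log(np.mean(np.exp(phi)))      # exact log-mean-exp on this sample
lams = np.linspace(-4.0, 4.0, 1601)     # crude grid stands in for the min
for rho in (0.9, 0.5, 0.1):
    print(rho, F_rho(phi, rho, lams), lse)  # F_rho <= lse, gap shrinks
```

As $\rho\to 0$ the conjugate $f_{\rho}^{*}(s)\to e^{s}-1$ and the minimization recovers the exact LSE, so $\rho$ trades off fidelity against the numerical safety of the cap.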

