A Unified View of Regularized Dual Averaging and Mirror Descent with Implicit Updates

We study three families of online convex optimization algorithms: follow-the-proximally-regularized-leader (FTRL-Proximal), regularized dual averaging (RDA), and composite-objective mirror descent. We first prove equivalence theorems that show all of these algorithms are instantiations of a general FTRL update. This provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that RDA is even more effective at producing sparsity. Our results demonstrate that FOBOS uses subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm can be seen as a hybrid of these two, and outperforms both on a large, real-world dataset. Our second contribution is a unified analysis which produces regret bounds that match (up to logarithmic terms) or improve the best previously known bounds. This analysis also extends these algorithms in two important ways: we support a more general type of composite objective and we analyze implicit updates, which replace the subgradient approximation of the current loss function with an exact optimization.


💡 Research Summary

The paper investigates three prominent families of online convex optimization algorithms—Follow‑the‑Regularized‑Leader with proximal updates (FTRL‑Proximal), Regularized Dual Averaging (RDA), and Composite‑objective Mirror Descent (often referred to as FOBOS). The authors first establish a set of equivalence theorems showing that each of these methods can be expressed as a specific instantiation of a unified FTRL update rule. This unifying perspective clarifies why, despite FOBOS explicitly handling an ℓ₁ regularizer, RDA empirically yields far sparser solutions. The key insight is that FOBOS approximates the ℓ₁ penalty using subgradients from previous rounds, whereas RDA incorporates the cumulative ℓ₁ term in closed form during each proximal step. Consequently, RDA’s updates directly shrink many coordinates to zero, while FOBOS’s incremental approximations leave more non‑zero entries. FTRL‑Proximal can be viewed as a hybrid: it centers its stabilizing proximal terms at the past iterates (as FOBOS does) while treating the accumulated regularizer exactly in each update (as RDA does). Empirical results on a large real‑world dataset confirm that this hybrid outperforms both pure approaches.
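The contrast between per‑round and cumulative handling of the ℓ₁ penalty can be made concrete with a minimal per‑coordinate sketch. The closed forms below follow the standard FOBOS and ℓ₁‑RDA updates from the literature; the step size `eta`, the RDA proximal scaling `beta_t = gamma*sqrt(t)`, and all function names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink x toward zero by tau (the prox operator of tau*|.|_1)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fobos_step(w, g, eta, lam):
    """One FOBOS step: gradient step, then apply the L1 prox once,
    with threshold eta*lam -- only this round's share of the penalty."""
    return soft_threshold(w - eta * g, eta * lam)

def rda_step(g_sum, t, lam, gamma=1.0):
    """One RDA step: the weight vector is recomputed from the running
    gradient sum, and the cumulative L1 penalty is applied in closed
    form -- any coordinate whose average gradient magnitude is <= lam
    comes out exactly zero."""
    g_avg = g_sum / t
    beta = gamma * np.sqrt(t)  # illustrative choice of prox strength
    return -(t / beta) * soft_threshold(g_avg, lam)
```

Note the asymmetry: FOBOS only ever shrinks by `eta*lam` per round, so a coordinate driven near zero can be pushed away again by later gradients, whereas RDA re‑derives every coordinate from the gradient sum and zeroes it exactly whenever its average gradient magnitude falls below `lam`.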

The second major contribution is a unified regret analysis that subsumes the existing bounds for each algorithm. By framing all three methods within a single FTRL template, the authors derive a regret bound of order O(√(T log T)) under standard Lipschitz and strong convexity assumptions. This bound matches the best known results for each algorithm up to logarithmic factors and, in some settings, improves them. Importantly, the analysis is extended in two directions. First, it accommodates a more general composite objective where the regularizer may be any convex function (not just separable ℓ₁). Second, it incorporates implicit updates: instead of replacing the current loss by a subgradient linearization, the algorithm solves the exact proximal subproblem for the current loss. The authors prove that implicit updates do not degrade the regret guarantee, while often providing practical benefits such as faster convergence and better numerical stability.
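The gap between a linearized and an implicit update is easiest to see for the squared loss, where the exact proximal subproblem has a closed form. The sketch below is a minimal illustration under that assumption; the function names and the fixed step size `eta` are illustrative, not the paper's notation.

```python
import numpy as np

def explicit_step(w, x, y, eta):
    """Linearized (subgradient) step: use the gradient of the squared
    loss 0.5*(x@w - y)**2 evaluated at the current iterate w."""
    return w - eta * (x @ w - y) * x

def implicit_step(w, x, y, eta):
    """Implicit step: solve
        argmin_v 0.5*(x@v - y)**2 + (1/(2*eta)) * ||v - w||^2
    exactly. For the squared loss the minimizer divides the residual
    by 1 + eta*||x||^2, so the step never overshoots the target even
    for a very large eta."""
    residual = x @ w - y
    return w - eta * residual / (1.0 + eta * (x @ x)) * x
```

For small `eta` the two steps nearly coincide; the benefit of the implicit form shows up when `eta` is large, where the explicit step overshoots wildly while the implicit step remains a contraction toward the loss minimizer.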

Technical details include: (1) a precise definition of the generalized FTRL update
 w_{t+1} = argmin_w { Σ_{i=1}^t ℓ_i(w) + ψ_t(w) + (1/2) wᵀ A_t w },
where ψ_t aggregates the regularizer up to round t and A_t is a positive‑definite matrix controlling the proximal geometry; (2) derivations showing how FOBOS, RDA, and FTRL‑Proximal correspond to different choices of ψ_t (linearized vs. exact) and A_t (identity vs. adaptive); (3) a regret decomposition that isolates the error due to linearizing ψ_t (present in FOBOS) from the error due to approximating the current loss (present in subgradient methods). By bounding each term separately, the authors obtain the unified O(√(T log T)) result.
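For the common case ψ_t(w) = λt‖w‖₁ with a diagonal adaptive A_t, the FTRL‑Proximal argmin has a per‑coordinate closed form, which is why the exact cumulative penalty costs no more to apply than a subgradient step. The sketch below follows that well‑known closed‑form recipe; the class name, the state variables, and the α, β step‑size parameters are illustrative assumptions, not the paper's notation.

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-Proximal with an exact cumulative L1 penalty
    (a minimal sketch; alpha/beta set the adaptive per-coordinate
    learning-rate schedule)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, lam=1.0):
        self.alpha, self.beta, self.lam = alpha, beta, lam
        self.z = np.zeros(dim)  # proximally adjusted gradient sums
        self.n = np.zeros(dim)  # sums of squared gradients (diagonal A_t)

    def weights(self):
        """Closed-form argmin of the FTRL objective: every coordinate
        with |z_i| <= lam is exactly zero; the rest are soft-thresholded
        and scaled by the adaptive per-coordinate step size."""
        w = np.zeros_like(self.z)
        active = np.abs(self.z) > self.lam
        denom = (self.beta + np.sqrt(self.n[active])) / self.alpha
        w[active] = -(self.z[active] - np.sign(self.z[active]) * self.lam) / denom
        return w

    def update(self, g):
        """Fold this round's gradient into the state; the sigma term
        recenters the proximal regularizer at the current iterate."""
        w = self.weights()
        sigma = (np.sqrt(self.n + g**2) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g**2
```

Because the ℓ₁ term enters only through the threshold on `z`, the learned model is sparse by construction: a coordinate stays exactly zero until its accumulated (adjusted) gradient outweighs the full cumulative penalty.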

The experimental section evaluates the three algorithms on (a) a massive click‑through‑rate prediction dataset with millions of examples and hundreds of thousands of features, and (b) synthetic data where sparsity levels can be controlled. Results show that RDA consistently produces models with roughly 15 % more zero coefficients than FOBOS, while FTRL‑Proximal achieves the lowest cumulative loss and the highest sparsity simultaneously. Moreover, when the implicit update variant is employed, convergence accelerates by about 10 % without harming the theoretical regret bound.

In conclusion, the paper provides a coherent theoretical framework that unifies several seemingly disparate online learning methods. By revealing the precise role of cumulative regularization versus per‑round subgradient approximations, it explains observed empirical phenomena and guides practitioners toward algorithmic choices that balance sparsity, computational efficiency, and predictive performance. The generalized analysis also opens the door for future extensions, such as incorporating group‑structured regularizers, adaptive metrics, or higher‑order implicit updates, while retaining provable regret guarantees.

