Less Regret via Online Conditioning
We analyze and evaluate an online gradient descent algorithm with adaptive per-coordinate adjustment of learning rates. Our algorithm can be thought of as an online version of batch gradient descent with a diagonal preconditioner. This approach leads to regret bounds that are stronger than those of standard online gradient descent for general online convex optimization problems. Experimentally, we show that our algorithm is competitive with state-of-the-art algorithms for large scale machine learning problems.
💡 Research Summary
The paper introduces an online gradient descent algorithm that adapts the learning rate separately for each coordinate, effectively acting as an online version of batch gradient descent equipped with a diagonal preconditioner. Traditional online gradient descent (OGD) uses a single global step size, which fails to account for heterogeneous feature scales and varying gradient magnitudes across dimensions. By maintaining, for each coordinate i, the cumulative sum of squared sub‑gradients G_{t,i} = ∑_{s=1}^{t} g_{s,i}², the algorithm sets the per‑coordinate step size η_{t,i} = η/√(G_{t,i}+ε). This scheme mirrors the AdaGrad update rule but is derived and analyzed within the online convex optimization framework, hence the term “online conditioning.”
The authors provide two main theoretical contributions. First, for general convex loss functions, they prove a regret bound of
R_T ≤ η ∑_{i=1}^{d} √(∑_{t=1}^{T} g_{t,i}²),
which improves upon the classic O(√T) bound of OGD: the regret scales with the realized gradient magnitudes in each coordinate rather than with a worst‑case bound on the gradient norm. Consequently, when most coordinates experience small gradients, the overall regret can be substantially lower. Second, for μ‑strongly convex losses, they derive a logarithmic regret bound of the form
R_T ≤ (1/μ) ∑_{i=1}^{d} log(1 + ∑_{t=1}^{T} g_{t,i}²),
matching the optimal O(log T) rate while preserving the per‑coordinate adaptivity. Both results are obtained using standard potential‑based analysis and demonstrate that the diagonal preconditioner does not sacrifice theoretical optimality.
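As a sanity check on the general convex bound (a step not spelled out in the summary), the Cauchy–Schwarz inequality relates it to the classic rate:

∑_{i=1}^{d} √(∑_{t=1}^{T} g_{t,i}²) ≤ √d · √(∑_{i=1}^{d} ∑_{t=1}^{T} g_{t,i}²) ≤ √(dT) · max_t ‖g_t‖₂,

so the adaptive bound never exceeds O(√T) by more than a √d factor, and it is strictly smaller whenever the squared‑gradient mass is concentrated in a few coordinates.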
From an implementation perspective, the algorithm requires only O(d) additional memory to store G_{t,i} and O(d) arithmetic per iteration, making it suitable for high‑dimensional, sparse data. The update rule is w_{t+1} = w_t − η_t ⊙ g_t, where η_t is the vector of per‑coordinate step sizes η_{t,i} and ⊙ denotes element‑wise multiplication. The authors emphasize that the method is straightforward to integrate into existing online learning pipelines and incurs negligible overhead compared with vanilla OGD.
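The update rule above can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the paper; the function name and the toy objective are assumptions:

```python
import numpy as np

def adaptive_ogd(grad_fn, w0, eta=0.5, eps=1e-8, steps=200):
    """Online gradient descent with per-coordinate step sizes
    eta_{t,i} = eta / sqrt(G_{t,i} + eps), where G_{t,i} accumulates
    the squared gradients seen so far in coordinate i."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                 # cumulative squared gradients, O(d) memory
    for _ in range(steps):
        g = grad_fn(w)
        G += g * g                       # G_{t,i} = sum_{s<=t} g_{s,i}^2
        w -= eta / np.sqrt(G + eps) * g  # w_{t+1} = w_t - eta_t ⊙ g_t
    return w

# toy example: a deliberately ill-scaled quadratic f(w) = 0.5 * sum_i D_i * w_i^2
D = np.array([100.0, 1.0, 0.01])
w = adaptive_ogd(lambda w: D * w, w0=[1.0, 1.0, 1.0])
```

On this quadratic the normalization by √G_{t,i} makes the effective step in each coordinate independent of its scale D_i, which is the conditioning effect the paper exploits.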
Empirically, the authors evaluate the method on several large‑scale machine‑learning tasks: (1) logistic regression for text classification on RCV1 and Reuters‑21578, (2) linear SVM for image data, and (3) stochastic matrix factorization for recommendation. Baselines include standard OGD, AdaGrad, RMSProp, Adam, and the Follow‑the‑Regularized‑Leader (FTRL‑Proximal) algorithm. Results show that the proposed online conditioning algorithm consistently reaches target accuracies faster and often attains higher final performance, especially in settings with highly sparse features and pronounced scale disparities. For example, on RCV1 the method achieves >95 % accuracy within ten epochs, whereas AdaGrad requires roughly fifteen epochs and Adam needs about twenty. Sensitivity analysis reveals that the global step size η can be chosen in a broad range (0.1–1.0) without destabilizing training, while ε≈10⁻⁸ suffices for numerical stability.
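Because G_{t,i} changes only where the gradient is nonzero, the update can be applied lazily over sparse examples, touching O(nnz(g)) coordinates instead of all d. A hypothetical sketch (function name and signature are illustrative, not from the paper):

```python
def sparse_adaptive_step(w, G, grad_items, eta=0.5, eps=1e-8):
    """Apply one per-coordinate adaptive step, touching only the
    coordinates where the gradient is nonzero (O(nnz) work).
    grad_items: iterable of (index, gradient_value) pairs."""
    for i, g in grad_items:
        G[i] += g * g                           # update cumulative squared gradient
        w[i] -= eta / (G[i] + eps) ** 0.5 * g   # per-coordinate step

# example: a gradient with one nonzero entry leaves other coordinates untouched
w = [1.0, 1.0, 1.0]
G = [0.0, 0.0, 0.0]
sparse_adaptive_step(w, G, [(0, 2.0)])
```

This is why the method is well suited to the highly sparse text features (e.g. RCV1) in the experiments: per-example cost tracks the number of active features, not the full dimensionality.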
The discussion section outlines future research directions. Extending the diagonal preconditioner to a full (non‑diagonal) matrix could capture correlations between features and further reduce regret, though it would raise computational costs. Adapting the framework to non‑convex losses, integrating with stochastic variance‑reduced gradients, and exploring distributed implementations where cumulative gradient statistics are shared across workers are identified as promising avenues.
In summary, the paper demonstrates that per‑coordinate adaptive learning rates—implemented as an online diagonal preconditioner—yield stronger regret guarantees than standard OGD and translate into practical performance gains on real‑world large‑scale problems. The work bridges the gap between theoretical online optimization and scalable machine‑learning practice, offering a simple yet powerful tool for practitioners seeking faster convergence and lower regret in online settings.