Adaptive Regularization for Weight Matrices
Algorithms for learning distributions over weight vectors, such as AROW, were recently shown empirically to achieve state-of-the-art performance on various problems, with strong theoretical guarantees. Extending these algorithms to matrix models poses a challenge, since the number of free parameters in the covariance of the distribution scales as $n^4$ with the dimension $n$ of the matrix, and $n$ tends to be large in real applications. We describe, analyze, and experiment with two new algorithms for learning distributions over matrix models. Our first algorithm maintains a diagonal covariance over the parameters and can handle large covariance matrices. The second algorithm factors the covariance to capture inter-feature correlations while keeping the number of parameters linear in the size of the original matrix. We analyze both algorithms in the mistake-bound model and show that our approach achieves superior precision over other algorithms on two tasks: retrieving similar images and ranking similar documents. The factored algorithm is also shown to attain a faster convergence rate.
💡 Research Summary
The paper tackles the problem of extending adaptive regularization methods—originally devised for weight-vector models such as AROW—to matrix-valued models, where the naïve treatment of a full covariance matrix would require storing and updating on the order of $n^{4}$ parameters (with $n$ the matrix dimension). To make learning tractable for realistic, high-dimensional matrices, the authors propose two novel algorithms that impose structural constraints on the covariance while preserving the essential benefits of a probabilistic online learning framework.
The first algorithm, Diagonal-Covariance Adaptive Regularization (DC-AR), forces the covariance to be diagonal. Consequently, each parameter maintains an independent variance estimate, reducing storage to $O(n^{2})$ and simplifying the update rule to a scalar scaling of the gradient. Although this ignores inter-parameter correlations, the authors demonstrate that for many high-dimensional sparse problems the loss in expressive power is modest, while the method retains the same theoretical mistake bound as AROW (i.e., a cumulative error of order $\sqrt{T}$ over $T$ rounds).
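The element-wise nature of a diagonal-covariance round can be sketched as follows. This is a minimal NumPy illustration of an AROW-style update restricted to a diagonal covariance, not the paper's exact DC-AR derivation; the function name `dcar_update`, the margin threshold of 1, and the regularizer `r` are illustrative assumptions.

```python
import numpy as np

def dcar_update(w, sigma, x, y, r=1.0):
    """One AROW-style round with a diagonal covariance (illustrative sketch).

    w     : (d,) mean weight vector
    sigma : (d,) per-parameter variances (the diagonal of the covariance)
    x     : (d,) feature vector; y in {-1, +1}
    r     : regularization constant (hyperparameter, assumed here)
    """
    margin = y * w.dot(x)
    if margin >= 1.0:                        # confident enough: no update
        return w, sigma
    v = np.sum(sigma * x * x)                # confidence x^T Sigma x, elementwise
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta            # hinge-loss-scaled step size
    w = w + alpha * y * sigma * x            # mean update: Sigma x is elementwise
    sigma = sigma - beta * (sigma * x) ** 2  # variance shrinks on observed features
    return w, sigma
```

Because `sigma` is a vector rather than a matrix, both the confidence computation and the covariance shrinkage are element-wise, which is what keeps the per-round cost linear in the number of parameters.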
The second algorithm, Factored-Covariance Adaptive Regularization (FC-AR), approximates the full covariance by a low-rank factorization $C = UU^{\top}$, where $U\in\mathbb{R}^{n^{2}\times k}$ and $k\ll n^{2}$. This factorization captures useful correlations among matrix entries while keeping both memory and computational costs linear in the number of original parameters ($O(n^{2}k)$). A key contribution is a dynamic rank-adjustment scheme: the algorithm starts with a small rank and incrementally adds new basis vectors whenever the residual error suggests that additional capacity is needed. The update equations are derived by extending the AROW update to the factored form, yielding $\mathbf{w}\leftarrow\mathbf{w}+\alpha\, y\, UU^{\top}\mathbf{x}$ and a corresponding low-rank update of $U$. The authors prove that FC-AR also satisfies the same $\sqrt{T}$ mistake bound, and they show analytically that the low-rank approximation leads to a faster convergence rate compared with the diagonal approach.
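A factored round can be sketched in the same style. This is one plausible low-rank analogue of the AROW shrinkage, not the paper's exact FC-AR update: the shrinkage coefficient `gamma` is chosen here so that the updated factor reproduces the full-covariance shrinkage $\Sigma - \beta\,\Sigma x x^{\top}\Sigma$ exactly, and the names `fcar_update` and `r` are assumptions. The dynamic rank-growth step is omitted for brevity.

```python
import numpy as np

def fcar_update(w, U, x, y, r=1.0):
    """One round with a factored covariance C = U U^T (illustrative sketch).

    w : (d,) mean weight vector;  U : (d, k) covariance factor
    x : (d,) feature vector; y in {-1, +1}; r : regularizer (assumed)
    """
    margin = y * w.dot(x)
    if margin >= 1.0:
        return w, U
    s = U.T.dot(x)                       # k-dim projection U^T x, O(dk)
    v = float(s.dot(s))                  # confidence x^T U U^T x
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    w = w + alpha * y * U.dot(s)         # w += alpha y U U^T x, never forms U U^T
    if v > 0.0:
        # Shrink U along U^T x; gamma solves 2*gamma - gamma^2 * v = beta,
        # so that U'U'^T = U U^T - beta (U U^T x)(U U^T x)^T holds exactly.
        gamma = (1.0 - np.sqrt(1.0 - beta * v)) / v
        U = U - gamma * np.outer(U.dot(s), s)
    return w, U
```

Every operation touches only the $d\times k$ factor, so the per-round cost stays $O(dk)$ (with $d = n^{2}$ flattened matrix parameters), matching the linear-cost claim above.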
The theoretical analysis is grounded in the standard online margin-based loss framework. Both algorithms maintain a mean vector $\mathbf{w}$ and a covariance estimate, and they perform an update only when the margin $y\,\mathbf{w}^{\top}\mathbf{x}$ falls below a predefined threshold. The adaptive learning rates $\alpha$ and $\beta$ are computed from the current margin and the covariance, ensuring that the confidence in each direction is appropriately scaled. For DC-AR the covariance is a diagonal matrix, so the computation reduces to element-wise operations; for FC-AR the factorized form enables efficient matrix-vector products via $U(U^{\top}\mathbf{x})$.
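The efficiency of the factored product is just associativity: multiplying as $U(U^{\top}\mathbf{x})$ costs two thin matrix-vector products instead of materializing the full $d\times d$ covariance. A small check with illustrative dimensions:

```python
import numpy as np

# d plays the role of the n^2 flattened matrix parameters; k << d.
rng = np.random.default_rng(0)
d, k = 1000, 10
U = rng.standard_normal((d, k))
x = rng.standard_normal(d)

fast = U @ (U.T @ x)    # two thin matvecs: O(d k)
slow = (U @ U.T) @ x    # forms the d-by-d covariance first: O(d^2 k) + O(d^2)
assert np.allclose(fast, slow)
```

For $d = n^{2}$ with large $n$, the second variant is exactly the $O(n^{4})$ blow-up the factorization is designed to avoid.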
Empirical evaluation is conducted on two real‑world tasks. In an image‑retrieval experiment, the authors represent each image as a learned weight matrix (e.g., a filter bank) and rank database images by the Frobenius distance to a query matrix. On the CIFAR‑10‑derived dataset, FC‑AR achieves a mean average precision (MAP) improvement of 5–7 % over baseline AROW‑vector and SVM‑based methods, while DC‑AR still outperforms those baselines by roughly 3 %. In a document‑ranking scenario, TF‑IDF matrices are embedded into a low‑dimensional space, and the learned similarity matrix is used to rank documents. FC‑AR attains the highest normalized discounted cumulative gain (NDCG) scores and converges in far fewer epochs; its runtime per epoch is an order of magnitude lower than that of a full‑covariance online learner.
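The retrieval step in the image experiment can be illustrated in a few lines. This is a minimal sketch of ranking by Frobenius distance only; the paper's pipeline additionally involves the learned similarity model, which is omitted here, and the function name `rank_by_frobenius` is an assumption.

```python
import numpy as np

def rank_by_frobenius(query, database):
    """Return database indices ordered by Frobenius distance to the query
    matrix, closest first (illustrative sketch of the retrieval step)."""
    dists = [np.linalg.norm(query - m, ord='fro') for m in database]
    return np.argsort(dists)
```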
Overall, the paper makes a significant contribution by showing that adaptive regularization can be scaled to matrix models without incurring prohibitive computational costs. The diagonal method offers a simple, memory‑efficient solution suitable for extremely large problems, whereas the factored method provides a principled way to capture inter‑feature correlations and achieve faster learning. The dynamic rank‑adjustment mechanism is particularly appealing for streaming or resource‑constrained environments, as it balances model expressiveness against computational budget on the fly. Future work suggested by the authors includes extending the factorization to non‑linear kernels, integrating the approach directly into deep neural network layers, and exploring distributed implementations for massive-scale data.