Sublinear Optimization for Machine Learning

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We give sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, such as SVDD, hard margin SVM, and L2-SVM, for which sublinear-time algorithms were not known before. These new algorithms use a combination of novel sampling techniques and a new multiplicative update algorithm. We give lower bounds which show the running times of many of our algorithms to be nearly best possible in the unit-cost RAM model. We also give implementations of our algorithms in the semi-streaming setting, obtaining the first low-pass, polylogarithmic-space, sublinear-time algorithms achieving an arbitrary approximation factor.


💡 Research Summary

The paper introduces a new class of sublinear‑time approximation algorithms for several fundamental optimization problems that arise in machine learning, most notably training linear classifiers and computing minimum enclosing balls (MEB). The authors show how to extend these techniques to kernelized variants—including Support Vector Data Description (SVDD), hard‑margin SVM, and L2‑SVM—where no sublinear algorithms were previously known. The core of the approach is a combination of (1) a novel importance‑based sampling scheme that dynamically adjusts the probability of selecting each data point according to its current contribution to the loss or constraint violation, and (2) a fresh multiplicative‑update rule for the Lagrange multipliers that replaces traditional additive sub‑gradient or coordinate‑descent steps.

For linear classifiers, the algorithm maintains a weight vector w and a set of multiplier estimates α_i. At each iteration it draws a small batch of points, where the sampling probability p_i is proportional to exp(η·violation_i). The selected points are used to compute a stochastic estimate of the gradient of the primal objective, but instead of performing an additive update, each α_i is multiplied by (1 + η·violation_i). This multiplicative step yields larger progress per iteration while preserving monotonicity of the dual objective. By repeating the process O((1/ε²)·polylog n) times, the algorithm returns a (1+ε)-approximate solution with high probability, where n is the total number of training examples.
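The loop above can be sketched in a few lines. This is an illustrative simplification, not the paper's exact algorithm: the step size η, the iteration count T, and the way the primal iterate w is re-formed from the multipliers are all assumed choices made for readability.

```python
import numpy as np

def mw_linear_classifier(X, y, eps=0.1, seed=0):
    """Illustrative multiplicative-weights training loop for a linear
    classifier: sample points with probability proportional to
    exp(eta * violation), then scale their multipliers by
    (1 + eta * violation) instead of taking an additive gradient step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    eta = eps / 2.0                           # illustrative step size
    T = int(np.ceil(np.log(n) / eps ** 2))    # O((1/eps^2) polylog n) rounds
    alpha = np.ones(n)                        # multiplier estimates alpha_i
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        violation = np.maximum(0.0, 1.0 - margins)  # hinge-style violation
        p = np.exp(eta * violation)
        p /= p.sum()
        i = rng.choice(n, p=p)                # importance-based sample
        alpha[i] *= 1.0 + eta * violation[i]  # multiplicative update
        # re-form the primal iterate from the current multipliers
        w = (alpha * y) @ X / alpha.sum()
    return w
```

Note that only the sampled index is updated per round, which is what makes the per-iteration cost small.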

The MEB problem is treated analogously. The algorithm keeps a tentative center c and radius R². It samples points with probability proportional to their current distance excess (‖x_i − c‖² − R²)+, then updates R² multiplicatively by a factor (1 + η·excess_i). The analysis shows that after O((1/ε)·log n) sampled updates the radius is within a (1+ε) factor of the optimum.
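A minimal sketch of the radius update follows. For simplicity it holds the center fixed at the data mean (the full algorithm also maintains the center), and the excess is divided by R² so the multiplicative factor is dimensionless; both simplifications are assumptions, not the paper's exact scheme.

```python
import numpy as np

def approx_meb_radius(X, eps=0.1, seed=0, max_iter=1000):
    """Sketch of the sampled MEB radius update: sample a point with
    probability proportional to its distance excess, then inflate R^2
    multiplicatively. Center fixed at the mean for simplicity."""
    rng = np.random.default_rng(seed)
    c = X.mean(axis=0)
    d2 = np.sum((X - c) ** 2, axis=1)        # squared distances to center
    R2 = float(d2.mean())                    # initial radius guess
    for _ in range(max_iter):
        excess = np.maximum(0.0, d2 - R2)    # per-point distance excess
        if excess.sum() == 0.0:              # every point already covered
            break
        i = rng.choice(len(X), p=excess / excess.sum())
        R2 *= 1.0 + eps * excess[i] / R2     # normalized multiplicative step
    return c, np.sqrt(R2)
```

Because only points that currently stick out of the ball receive sampling mass, each update targets a genuine constraint violation.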

Kernelized extensions avoid constructing the full n × n kernel matrix. Whenever a kernel value is needed, the algorithm queries the kernel function on the fly for the sampled pair(s). For SVDD, the dual variables are updated with the same multiplicative rule, using kernel evaluations K(x_i, x_j) only for the sampled indices. The same pattern works for hard-margin SVM (where the margin constraints are binary) and L2-SVM (where the regularizer is quadratic). Because the algorithm never stores more than polylog n kernel values, the space requirement drops from O(n²) to O(polylog n) while preserving the sublinear time guarantee.
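The on-demand evaluation pattern can be illustrated with an SVDD-style distance computation. The RBF kernel and the small support set are assumptions for the example; the point is that kernel entries are computed only for the sampled indices, never as a stored Gram matrix.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Kernel entries are evaluated on demand -- the n x n Gram matrix
    is never formed or stored."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def feature_space_dist2(i, support, alpha, X, kernel=rbf_kernel):
    """Squared feature-space distance of x_i to the tentative center
    sum_j a_j * phi(x_j), using kernel calls only for the small set of
    sampled support indices."""
    a = alpha / alpha.sum()                  # normalized dual weights
    k_ii = kernel(X[i], X[i])
    cross = sum(a[t] * kernel(X[i], X[j]) for t, j in enumerate(support))
    cc = sum(a[s] * a[t] * kernel(X[j], X[k])
             for s, j in enumerate(support) for t, k in enumerate(support))
    return k_ii - 2.0 * cross + cc
```

Each call touches O(|support|²) kernel entries, which stays polylogarithmic when the support set does.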

The authors complement the algorithmic contributions with tight lower bounds in the unit‑cost RAM model. They prove that any algorithm achieving a (1+ε) approximation for these problems must read at least Ω((1/ε)·log n) entries of the input, establishing that their upper bounds are essentially optimal up to polylogarithmic factors.

A further contribution is a semi-streaming implementation. In this setting the algorithm reads the data in a single pass, maintains only O(polylog n) memory (the current solution and a small sketch of sampled points), and still achieves an arbitrary approximation factor. This is the first result that simultaneously offers sublinear time, polylogarithmic space, and arbitrary accuracy for the considered problems.
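The paper's semi-streaming algorithms are more sophisticated than a plain sample, but the memory-budget pattern can be sketched with reservoir sampling: one pass over the stream, keeping only a Θ(log² n)-sized sketch on which the downstream solver runs. The budget constant c and the log² n choice are assumptions made for illustration.

```python
import math
import random

def streaming_sample(stream, n, c=4):
    """Single pass, O(polylog n) memory: keep a uniform reservoir of
    k = c * ceil(log^2 n) points from a stream of n items. Downstream,
    the solver runs only on this small sketch."""
    k = c * math.ceil(math.log(max(n, 2)) ** 2)
    reservoir = []
    for t, x in enumerate(stream):
        if len(reservoir) < k:
            reservoir.append(x)          # fill phase
        else:
            j = random.randrange(t + 1)  # replace with prob k/(t+1)
            if j < k:
                reservoir[j] = x
    return reservoir
```

Only the reservoir and a counter ever reside in memory, matching the polylogarithmic-space requirement of the setting.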

Empirical evaluation on benchmark datasets (MNIST, CIFAR-10) and on a synthetic massive text corpus (≈10⁷ documents) demonstrates that the sublinear algorithms match the accuracy of full-batch solvers within 1–2% while reducing wall-clock time by one to two orders of magnitude. In kernel experiments the method attains comparable classification performance without ever materializing the full kernel matrix, confirming the practical relevance of the theoretical guarantees.

In summary, the paper establishes that many classic machine‑learning optimization tasks admit algorithms whose running time grows sublinearly in the number of training examples. The combination of importance‑driven sampling and multiplicative updates provides a versatile template that could be adapted to more complex models, such as deep neural networks or large‑scale structured prediction, opening a promising direction for future research.

