Fast Support Vector Machines Using Parallel Adaptive Shrinking on Distributed Systems
Support Vector Machines (SVMs), a popular machine learning technique, have been applied to a wide range of domains such as science, finance, and social networks for supervised learning. Whether it is identifying high-risk patients by health-care professionals, or potential high-school students to enroll in college by school districts, SVMs can play a major role for social good. This paper undertakes the challenge of designing a scalable parallel SVM training algorithm for large scale systems, which includes commodity multi-core machines, tightly connected supercomputers and cloud computing systems. Intuitive techniques for improving the time-space complexity, including adaptive elimination of samples for faster convergence and sparse format representation, are proposed. Under sample elimination, several heuristics ranging from *earliest possible* to *lazy* elimination of non-contributing samples are proposed. For cases where an early sample elimination might result in a false positive, low-overhead mechanisms for reconstruction of key data structures are proposed. The algorithm and heuristics are implemented and evaluated on various publicly available datasets. Empirical evaluation shows up to 26x speed improvement on some datasets against the sequential baseline, when evaluated on multiple compute nodes, and improvements in execution time of 30-60% are readily observed on a number of other datasets against our parallel baseline.
💡 Research Summary
The paper addresses the scalability bottlenecks of Support Vector Machine (SVM) training on modern large‑scale platforms, including multi‑core workstations, tightly‑coupled supercomputers, and cloud clusters. Traditional SMO‑based SVM solvers keep the entire training set in memory and repeatedly evaluate the full kernel matrix, which becomes prohibitive when the number of samples N reaches millions and the data are high‑dimensional. To overcome these limitations, the authors propose a parallel algorithm that integrates three complementary techniques: (1) adaptive shrinking, (2) sparse data representation, and (3) a hybrid MPI/Global Arrays communication model.
Adaptive shrinking is a well‑known heuristic for sequential SMO that discards samples whose Lagrange multipliers α have settled at a bound (0 or the regularization constant C) and are therefore unlikely to re‑enter the working set. The novelty of this work lies in extending shrinking to a distributed environment. Each process maintains a local copy of the gradient vector γ for its data block, updates it using the standard SMO update rule, and participates in a global reduction to obtain the current β_up and β_low thresholds. Based on three levels of aggressiveness—earliest, average, and conservative—the algorithm decides whether a sample can be removed immediately, after a few iterations, or only when it is safely far from the decision boundary. To guard against false positives (premature removal of a future support vector), a low‑overhead reconstruction step periodically re‑examines previously eliminated samples and restores any that violate the KKT conditions.
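A per-sample shrinking test of this kind can be sketched as follows. This is an illustrative helper, not the paper's exact criterion: the function name `should_shrink` and the sign conventions (one common Keerthi-style formulation, where a bound sample whose gradient lies safely outside the (β_up, β_low) window cannot violate the KKT conditions) are assumptions for the sketch.

```python
def should_shrink(alpha, grad, y, b_up, b_low, C, tol=1e-3):
    """Decide whether one sample is a shrink candidate (illustrative sketch).

    A sample whose multiplier sits at a bound (alpha == 0 or alpha == C)
    can be pruned when its gradient places it safely outside the
    (b_up, b_low) window, so it cannot re-enter the working set.
    Sign conventions here are one common formulation, not the paper's.
    """
    at_lower = alpha <= tol          # alpha stuck at 0
    at_upper = alpha >= C - tol      # alpha stuck at C
    if at_lower:
        return grad > b_up + tol if y > 0 else grad < b_low - tol
    if at_upper:
        return grad < b_low - tol if y > 0 else grad > b_up + tol
    return False  # free multipliers (0 < alpha < C) always stay active
```

An *earliest* policy would act on this test immediately; a *lazy* policy would require it to hold over several consecutive iterations before pruning.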
The second contribution is the use of compressed sparse row (CSR) storage for the input matrix X. Empirical analysis shows that many benchmark datasets have densities below 20 %, so CSR reduces memory consumption by 70‑90 % compared with dense storage. Moreover, the authors deliberately avoid building a full N × N kernel cache, whose space complexity Θ(N²) is infeasible on large systems. Instead, kernel values Φ(x_i)·Φ(x_j) are recomputed on demand. Modern CPUs and GPUs provide high‑throughput fused‑multiply‑add instructions, making on‑the‑fly computation cheaper than the cost of moving rows of a massive kernel matrix across the network or storing them in memory.
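With CSR storage, an on-demand linear kernel entry K(i, j) = x_i · x_j reduces to a merge of two sorted index lists, so no N × N cache is needed. A minimal sketch (pure Python; a real implementation would operate on the distributed CSR arrays):

```python
def csr_row_dot(indptr, indices, data, i, j):
    """Dot product of rows i and j of a CSR matrix.

    indptr[k]:indptr[k+1] delimits row k's entries; indices holds the
    (sorted) column ids and data the corresponding values. Merging the
    two sorted index runs costs O(nnz_i + nnz_j), independent of N.
    """
    a, a_end = indptr[i], indptr[i + 1]
    b, b_end = indptr[j], indptr[j + 1]
    s = 0.0
    while a < a_end and b < b_end:
        if indices[a] == indices[b]:      # column present in both rows
            s += data[a] * data[b]
            a += 1
            b += 1
        elif indices[a] < indices[b]:
            a += 1
        else:
            b += 1
    return s

# Two rows, dense form [[1, 0, 2], [0, 3, 4]], stored as CSR:
indptr, indices, data = [0, 2, 4], [0, 2, 1, 2], [1.0, 2.0, 3.0, 4.0]
k01 = csr_row_dot(indptr, indices, data, 0, 1)  # only column 2 overlaps: 2*4 = 8
```

For a dataset at 20% density, each such recomputation touches roughly a fifth of the entries a dense dot product would, which is why on-the-fly evaluation can beat caching once memory and network costs are counted.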
Communication is handled by a combination of MPI for collective operations (e.g., Allreduce of β_up/β_low) and Global Arrays for one‑sided remote memory access to the distributed CSR arrays. Global Arrays enables each process to read and write contiguous blocks of the training data without explicit message‑passing, simplifying load balancing and improving cache locality. The authors co‑locate related structures (labels y, multipliers α, gradients γ) with the feature matrix to further increase spatial locality and reduce synchronization overhead.
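The threshold reduction is the only collective per iteration: β_up is the minimum over all processes' local candidates and β_low the maximum. The toy, single-process stand-in below illustrates the semantics; an actual run would call mpi4py's `comm.allreduce` with `MPI.MIN`/`MPI.MAX`, and the numeric values here are made up.

```python
# Each process computes its best local threshold candidates from its
# block of gradients; the global values come from MIN/MAX reductions.
local_b_up = [0.8, -0.5, 0.9]    # per-process beta_up candidates (made up)
local_b_low = [0.2, 0.1, 0.4]    # per-process beta_low candidates (made up)

b_up = min(local_b_up)           # stands in for Allreduce with MPI.MIN
b_low = max(local_b_low)         # stands in for Allreduce with MPI.MAX

eps = 1e-3
converged = b_up + 2 * eps >= b_low  # stopping test from the summary
```

Because only these two scalars cross the network each iteration, the collective traffic stays small relative to the one-sided Global Arrays accesses to the CSR blocks.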
Algorithmic flow: (1) Initialize α = 0, γ = −y, and distribute X in CSR across processes. (2) In each iteration, each process selects its local working pair (i_up, i_low) based on the current γ values, updates α for the pair, and recomputes γ for all affected samples using the SMO update equation. (3) After local updates, a global reduction yields the new β_up and β_low. (4) Apply the chosen shrinking heuristic to prune samples; if a sample is eliminated, it is removed from the active set. (5) Periodically invoke the reconstruction routine to re‑activate any mistakenly pruned samples. (6) Terminate when β_up + 2ε ≥ β_low, where ε is a user‑specified tolerance.
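Step (2) above, the working-pair update, can be sketched for a linear kernel as follows. This is a textbook SMO pair step under the box constraints implied by Σ α_k y_k = 0, not the paper's exact distributed code; gradients are recomputed from scratch here for clarity, whereas the algorithm maintains γ incrementally.

```python
def smo_pair_update(alpha, y, K, i, j, C):
    """One SMO update on the working pair (i, j); returns new (alpha_i, alpha_j).

    K is the kernel matrix (or on-demand kernel evaluations), y the labels
    in {-1, +1}, and C the regularization constant.
    """
    n = len(alpha)
    # f(x_k) = sum_m alpha_m * y_m * K(m, k);  error E_k = f(x_k) - y_k
    f = lambda k: sum(alpha[m] * y[m] * K[m][k] for m in range(n))
    E_i, E_j = f(i) - y[i], f(j) - y[j]
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]  # curvature along the pair
    if eta <= 0:
        return alpha[i], alpha[j]            # skip degenerate pairs
    # Feasible interval [L, H] for alpha_j from 0 <= alpha <= C and the
    # equality constraint sum_k alpha_k y_k = 0.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    a_j = min(max(alpha[j] + y[j] * (E_i - E_j) / eta, L), H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
    return a_i, a_j

# Two separable points x1 = (1, 0), y1 = +1 and x2 = (-1, 0), y2 = -1:
# linear kernel gives K = [[1, -1], [-1, 1]], and one step from alpha = 0
# lands on the optimum alpha = (0.5, 0.5).
a_i, a_j = smo_pair_update([0.0, 0.0], [1, -1], [[1.0, -1.0], [-1.0, 1.0]], 0, 1, 1.0)
```

In the distributed setting, each process runs this update on its locally selected pair and then recomputes γ only for the samples still in the active set, which is where shrinking saves most of its time.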
The experimental evaluation uses several public datasets (USPS, Mushrooms, MNIST, w8a, etc.) on systems ranging from 8‑core workstations to clusters with dozens of nodes. Compared with a parallel baseline that does not employ shrinking, the proposed method achieves 30‑60 % reduction in execution time. Against the sequential libsvm implementation, speedups of 5‑8 × are reported for typical datasets, and a maximum of 26 × for the most favorable case. Memory usage drops dramatically due to CSR, and network traffic is limited to a few collective reductions per SMO iteration, representing less than 15 % of total communication volume. Classification accuracy is essentially unchanged; the worst observed loss is under 0.1 % even with the most aggressive shrinking policy.
In summary, the paper delivers a practical, scalable SVM training framework that leverages adaptive shrinking, sparse data structures, and efficient distributed communication. It demonstrates that high‑dimensional, sparse learning problems can be solved on commodity clusters without resorting to massive kernel caches, making SVMs viable for real‑time or near‑real‑time analytics in scientific, financial, and social‑good applications. Future work suggested includes extending the approach to non‑linear kernels (e.g., RBF) with kernel approximation techniques, and automating the selection of shrinking parameters to further improve robustness across diverse workloads.