Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM
Understanding the bottlenecks in implementing a stochastic gradient descent (SGD)-based distributed support vector machine (SVM) algorithm is important for training on larger data sets. The communication time required for model synchronization across the parallel processes is the main bottleneck that makes the training process inefficient. The model synchronization cost is directly affected by the mini-batch size of data processed before each global synchronization. To produce an efficient distributed model, the communication time spent on model synchronization must be kept as low as possible while retaining high testing accuracy. The effect of model synchronization frequency on the convergence of the algorithm and the accuracy of the generated model must be well understood in order to design an efficient distributed model. In this research, we identify the bottlenecks in model synchronization in a parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that, by optimizing the MSF for the data sets we used, communication time can be reduced by 98% (a 16x-24x speedup) relative to high-frequency model synchronization. The training model optimization discussed in this paper yields higher accuracy than the sequential algorithm along with faster convergence.
💡 Research Summary
The paper investigates the performance bottleneck caused by model synchronization in parallel stochastic gradient descent (PSGD) based support vector machine (SVM) training. While distributed SVM can alleviate the computational and memory demands of large‑scale data, the communication overhead incurred when synchronizing the weight vector across processes often dominates execution time, especially when synchronization is performed after every data point (high‑frequency synchronization). The authors introduce the notion of Model Synchronization Frequency (MSF), defined as the number of data points processed locally before a global averaging step, and study how varying MSF influences convergence speed, testing accuracy, and overall runtime.
The theoretical foundation follows the standard linear‑kernel binary SVM formulation:
$$J(w) = \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\max\bigl(0,\ 1 - y_i\, w^{\top}x_i\bigr)$$
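Expanding this objective into the per-sample step used by SGD may help make the algorithm concrete. This is a standard derivation added here for clarity; note that conventions differ on whether the hinge term in a single-sample estimate is scaled by $C$ or $nC$:

$$w_{t+1} =
\begin{cases}
  w_t - \alpha_t\,\bigl(w_t - C\, y_i x_i\bigr), & \text{if } y_i\, w_t^{\top} x_i < 1,\\
  (1 - \alpha_t)\, w_t, & \text{otherwise,}
\end{cases}$$

where $i$ is the sampled index at step $t$: the regularizer contributes $w_t$ to the subgradient at every step, while the hinge term contributes $-C\,y_i x_i$ only when the margin is violated.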
Stochastic gradient descent updates are performed with a diminishing learning rate $\alpha_t = 1/(1+t)$. The paper presents three algorithmic variants: (1) a baseline sequential SGD-SVM, (2) a “Sequential Replica of Distributed Model Synchronizing” (SRDMS) that mimics block-wise synchronization without network traffic, and (3) the actual distributed implementation (DMS), which uses MPI_AllReduce to average local weight vectors after each block.
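The baseline sequential variant can be sketched in a few lines. This is a minimal NumPy illustration of a hinge-loss SGD-SVM with the $\alpha_t = 1/(1+t)$ schedule, not the authors' code; the function and parameter names are ours:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, epochs=5, seed=0):
    """Sequential SGD for a linear hinge-loss SVM with the
    diminishing learning rate alpha_t = 1/(1+t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):       # one pass over the data in random order
            alpha = 1.0 / (1.0 + t)
            if y[i] * (w @ X[i]) < 1:      # margin violated: hinge subgradient active
                w -= alpha * (w - C * y[i] * X[i])
            else:                          # only the regularizer contributes
                w -= alpha * w
            t += 1
    return w
```

On linearly separable data this converges to a separating direction within a few epochs, since the step sizes shrink harmonically while early updates dominate.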
Experiments are conducted on three public datasets of varying size and dimensionality: Epsilon (400 k samples, 2 k features), Ijcnn1 (35 k samples, 22 features), and Webspam (350 k samples, 254 features). The authors systematically vary block size (i.e., MSF) and the number of parallel processes (K) to observe three aspects: (a) the effect of block size on convergence and cross‑validation accuracy in the sequential setting, (b) the impact of MSF and parallelism on convergence in the distributed setting, and (c) the total execution time as a function of MSF for different K values.
Key findings include:
- High‑frequency synchronization (block size = 1) yields fast per‑epoch convergence but incurs prohibitive communication cost, leading to longer overall runtimes.
- Increasing block size to 512–1024 dramatically reduces the number of AllReduce calls (by roughly two orders of magnitude), cutting communication time by up to 98% and delivering 16×–24× speed-ups compared with the high-frequency baseline.
- Despite less frequent synchronization, the objective function value and cross‑validation accuracy remain essentially unchanged; in some cases the distributed approach even surpasses the sequential baseline by 1–2 % in test accuracy.
- The optimal MSF balances the communication‑computation ratio, allowing the distributed algorithm to achieve both faster convergence and higher predictive performance.
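The block-synchronization scheme underlying these findings can be simulated without real MPI, in the spirit of the SRDMS variant: K logical workers each take `msf` local SGD steps on a private copy of the weights, after which the copies are averaged in place of an actual MPI_AllReduce. The sketch below is our illustration under those assumptions; names and the shared step-counter convention are ours:

```python
import numpy as np

def psgd_svm_sim(X, y, K=4, msf=256, C=1.0, epochs=3, seed=0):
    """Simulated block-synchronized PSGD-SVM: K logical workers each take
    `msf` local hinge-loss SGD steps on a private copy of w, then the
    copies are averaged -- standing in for MPI_AllReduce over processes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    shards = np.array_split(rng.permutation(n), K)  # static data partition
    w = np.zeros(d)
    t = 0  # shared step counter driving the 1/(1+t) schedule
    for _ in range(epochs):
        for start in range(0, min(len(s) for s in shards), msf):
            local = np.tile(w, (K, 1))  # every worker starts from the global model
            for k, shard in enumerate(shards):
                tk = t
                for i in shard[start:start + msf]:
                    alpha = 1.0 / (1.0 + tk)
                    if y[i] * (local[k] @ X[i]) < 1:   # hinge active
                        local[k] -= alpha * (local[k] - C * y[i] * X[i])
                    else:                              # regularizer only
                        local[k] -= alpha * local[k]
                    tk += 1
            t += msf                  # advance the shared schedule by one block
            w = local.mean(axis=0)    # the model synchronization step
    return w
```

Varying `msf` here changes only how often the averaging line runs, which mirrors the paper's trade-off: a small `msf` synchronizes often (many averaging steps, high communication in a real cluster), while a large `msf` synchronizes rarely with little loss in model quality.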
The authors conclude that MSF is a critical tunable parameter for scaling PSGD‑based SVMs. By selecting an appropriate block size, practitioners can minimize network overhead without sacrificing model quality. The paper suggests future work on adaptive MSF strategies (changing block size during training), asynchronous parameter‑server architectures, and communication‑compression techniques (quantization, sparsification) to further improve scalability on bandwidth‑constrained clusters.