A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark
Training deep networks is expensive and time-consuming, with the training period growing with both data size and the number of model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs in Apache Spark. The framework implements both Data Parallelism and Model Parallelism, making it suitable for deep networks that require huge training data and whose model parameters are too large to fit into the memory of a single machine. It can be scaled easily over a cluster of cheap commodity hardware to attain significant speedup and obtain better results, making it quite economical compared to a farm of GPUs or a supercomputer. We propose a new algorithm for training deep networks when the network is partitioned across machines (Model Parallelism), along with a detailed cost analysis and a proof of convergence. We have developed implementations for Fully-Connected Feedforward Networks, Convolutional Neural Networks, Recurrent Neural Networks, and Long Short-Term Memory architectures. We present the results of extensive simulations demonstrating the speedup and accuracy obtained by our framework for different sizes of data and model parameters, with variation in the number of worker cores/partitions, thereby showing that our proposed framework achieves significant speedup (up to 11X for CNNs) and is also quite scalable.
💡 Research Summary
The paper presents a novel framework for distributed training of deep neural networks on a cluster of commodity CPUs using Apache Spark. Recognizing that modern deep learning models often exceed the memory capacity of a single machine and that training time scales poorly with data size, the authors combine data parallelism (splitting the training dataset across workers) with model parallelism (splitting the network’s weight matrices across workers). The key contribution is a concrete algorithm for model‑parallel backpropagation: the weight matrix of each layer is partitioned column‑wise, a master process coordinates the forward pass by broadcasting the current layer’s activation vector to all workers, each worker computes its local slice of the next‑layer activations, and the master aggregates the results. The backward pass mirrors this pattern: the master broadcasts the error vector, workers compute local gradients and partial error contributions, and the master sums these to propagate errors further backward. Communication is organized as an all‑to‑all broadcast on a hypercube topology, yielding a logarithmic number of messages per worker (log₂F, where F is the number of processes).
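The column-wise partitioning scheme described above can be illustrated with a small single-machine simulation. This is a hedged sketch, not the authors' Spark implementation: the F workers are simulated as list comprehensions, the sigmoid activation is an assumption, and the master's broadcast/aggregate steps are modeled as plain concatenation and summation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_out, F = 8, 6, 3                      # F = number of (simulated) worker processes

W = rng.standard_normal((n_in, n_out))        # one layer's weight matrix
slices = np.array_split(np.arange(n_out), F)  # column-wise partition of W across workers

# Forward pass: the master broadcasts the current layer's activation vector a;
# each worker computes its local slice of the next-layer activations.
a = rng.standard_normal(n_in)
local_acts = [sigmoid(a @ W[:, cols]) for cols in slices]
a_next = np.concatenate(local_acts)           # master aggregates the slices

# Backward pass: the master broadcasts the next-layer error vector; each worker
# computes its local weight gradient and a partial previous-layer error contribution.
delta = rng.standard_normal(n_out)
local_grads = [np.outer(a, delta[cols]) for cols in slices]
partial_errs = [W[:, cols] @ delta[cols] for cols in slices]
err_prev = np.sum(partial_errs, axis=0)       # master sums the partial contributions

# The distributed computation matches the single-machine result.
assert np.allclose(a_next, sigmoid(a @ W))
assert np.allclose(err_prev, W @ delta)
```

The two final assertions verify that partitioning the layer column-wise and aggregating at the master reproduces the undistributed forward and backward computations exactly.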
A detailed cost model is derived, separating computation time (T_comp) and communication time (T_comm). The authors express T_comm as K·