Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters


Authors: Hao Zhang, Zeyu Zheng, Shizhen Xu

Hao Zhang (1,2), Zeyu Zheng (2), Shizhen Xu (1), Wei Dai (1,2), Qirong Ho (2), Xiaodan Liang (1), Zhiting Hu (1,2), Jinliang Wei (1), Pengtao Xie (1,2), Eric P. Xing (2)
(1) Carnegie Mellon University, (2) Petuum Inc.

Abstract

Deep learning models can take weeks to train on a single GPU-equipped machine, necessitating scaling out DL training to a GPU cluster. However, current distributed DL implementations can scale poorly due to substantial parameter synchronization over the network, because the high throughput of GPUs allows more data batches to be processed per unit time than CPUs, leading to more frequent network synchronization. We present Poseidon, an efficient communication architecture for distributed DL on GPUs. Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication. Moreover, Poseidon uses a hybrid communication scheme that optimizes the number of bytes required to synchronize each layer, according to layer properties and the number of machines. We show that Poseidon is applicable to different DL frameworks by plugging Poseidon into Caffe and TensorFlow. We show that Poseidon enables Caffe and TensorFlow to achieve 15.5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification. Moreover, Poseidon-enabled TensorFlow achieves 31.5x speed-up with 32 single-GPU machines on Inception-V3, a 50% improvement over the open-source TensorFlow (20x speed-up).

1 Introduction

Deep learning (DL) is a class of machine learning (ML) approaches that has achieved notable success across a wide spectrum of tasks, including speech recognition [10], visual recognition [34, 35] and language understanding [21, 20].
These DL models exhibit a high degree of model complexity, with many parameters in deeply layered structures that usually take days to weeks to train on a GPU-equipped machine. The high computational cost of DL programs on large-scale data necessitates training on distributed GPU clusters in order to keep the training time acceptable.

DL software such as TensorFlow [1] and Caffe [14] allows practitioners to easily experiment with DL models on a single machine. However, their distributed implementations can scale poorly for larger models. For example, we find that on the VGG19-22K network (229M parameters), open-source TensorFlow on 32 machines can be slower than a single machine (Section 5.1). This observation underlines the challenge of scaling DL on GPU clusters: the high computational throughput of GPUs allows more data batches to be processed per minute (than CPUs), leading to more frequent network synchronization that grows with the number of machines. Existing communication strategies, such as parameter servers (PS) for ML [31, 19], can be overwhelmed by the high volume of communication [7]. Moreover, despite the increasing availability of faster network interfaces such as Infiniband or 40GbE Ethernet, GPUs have continued to grow rapidly in computational power, and continue to produce parameter updates faster than they can be naively synchronized over the network. For instance, on a 16-machine cluster with 40GbE Ethernet and one Titan X GPU per machine, updates from the VGG19-22K model will bottleneck the network, so that only an 8x speedup over a single machine is achieved (Section 5.1).
These scalability limitations in distributed DL stem from at least two causes: (1) the gradient updates to be communicated are very large matrices, which quickly saturate network bandwidth; (2) the iterative nature of DL algorithms causes the updates to be transmitted in bursts (at the end of an iteration or batch of data), with significant periods of low network usage in between. We propose that a solution to these two problems should exploit the structure of DL algorithms on two levels: on one hand, it should identify ways in which the matrix updates can be separated from each other, and then schedule them in a way that avoids bursty network traffic. On the other hand, the solution should also exploit the structure of the matrix updates themselves, and wherever possible reduce their size and thus the overall load on the network. For such a solution to be relevant to practitioners (who may have strong preferences for particular frameworks), we prefer not to exploit specific traits of TensorFlow's or Caffe's design, but strive to be relevant to as many existing frameworks as possible.

With this motivation, we design Poseidon, an efficient communication architecture for data-parallel DL on distributed GPUs. Poseidon exploits the sequential layer-by-layer structure in DL programs, finding independent GPU computation operations and network communication operations in the training algorithm, so that they can be scheduled together to reduce bursty network communication.
Moreover, Poseidon implements a hybrid communication scheme that accounts for each DL program layer's mathematical properties as well as the cluster configuration, in order to compute the network cost of different communication methods and select the cheapest one – currently, Poseidon implements and supports a parameter server scheme [31], which is well-suited to small matrices, and a sufficient factor broadcasting scheme [32], which performs well on large matrices. We focus on synchronous parallel training, which is shown to yield faster convergence than asynchronous training in distributed DL (as measured by wall clock time) on GPUs [7, 2]. Unless otherwise specified, our discussion in this paper assumes synchronous replication of model parameters in each training iteration, although we note that Poseidon's design can easily be applied to asynchronous or bounded-asynchronous consistency models [12, 8].

To demonstrate Poseidon's applicability to multiple DL frameworks, we implement it in two different DL frameworks, Caffe and TensorFlow, and show that Poseidon allows them to scale almost linearly in algorithm throughput with additional machines, while incurring little additional overhead even in the single-machine setting. For distributed execution with 40GbE network bandwidth available, Poseidon consistently delivers near-linear increases in throughput across various models and engines: 31.5x speedup on training the Inception-V3 network using the TensorFlow engine on 32 nodes, which improves 50% upon the original TensorFlow (20x); when training a 229M-parameter network (VGG19-22K), Poseidon still achieves near-linear speedup (30x on 32 nodes) using both the Caffe and TensorFlow engines, while distributed TensorFlow sometimes experiences negative scaling with additional machines [37].
Our experiments also confirm that Poseidon successfully alleviates network communication bottlenecks by reducing the bandwidth required to parallelize large models. For example, when training VGG19-22K under limited bandwidth (10GbE), in contrast to a PS-based parallelization that achieves only 4x speedup with 16 machines, Poseidon effectively reduces the communication overheads by automatically specializing the best communication method for each layer, and maintains linear scaling of throughput. Compared to other communication reduction methods [4, 36], Poseidon demonstrates either systems advantages (increased algorithm throughput) or statistical advantages (fewer algorithm steps or iterations to reach a fixed termination criterion). Poseidon does not suffer much from imbalanced communication loads, which we found to be the case when using the sufficient factor strategy of Project Adam [4]. Poseidon also guarantees that the number of algorithm steps to reach termination remains unchanged, unlike the 1-bit quantization strategy used in CNTK [36], which is approximate and can hurt statistical performance in some applications.

The rest of the paper is organized as follows. Section 2 motivates Poseidon with an introduction to large-scale DL, parameter servers and sufficient factor broadcasting. Sections 3 and 4 elaborate Poseidon's design and implementation, respectively. Section 5 evaluates Poseidon by training different models over multiple datasets, including comparisons to state-of-the-art GPU-based distributed DL systems. Section 6 discusses related work and Section 7 concludes.

2 Large-scale Deep Learning

In this section, we formulate DL training as an iterative-convergent algorithm, and describe the parameter server (PS) and sufficient factor broadcasting (SFB) approaches for parallelizing such computation on clusters.
2.1 Distributed Deep Learning

DL programs are distinguished from other ML programs mainly by their use of neural networks (NNs), a family of hierarchical models containing many layers, from as few as 5-10 [16] to as many as 100s [11]. Figure 1 illustrates a neural network with 6 layers. The first layer (green) is an input layer that reads data in application-specific formats, e.g., raw pixels if the network is trained to classify images. The input layer is connected to a sequence of intermediate layers (cyan, orange), each of which consists of a few neurons, where each neuron applies a function transformation f on its input and produces an output. A vector output is obtained by concatenating the outputs of all neurons in a layer. By stacking multiple intermediate layers, the NN can transform raw input data one layer at a time, first into a series of intermediate representations, and finally into the desired output or prediction (red). DL programmers usually need to specify the computation of a layer by defining two properties of its neurons. The first is the transformation function f(W, x), where x is the input to the neuron and W is an optional trainable parameter. The other is the connectivity that determines how the neuron is connected to its adjacent layers. For instance, a convolutional neural network has two types of neurons: convolutional (CONV) neurons (cyan), which are only locally connected to a subset of neurons in the previous layer, and fully-connected (FC) neurons (orange).

Figure 1: A convolutional neural network with 6 layers.

Figure 2: An illustration of (a) the parameter server and (b) sufficient factor broadcasting for distributed ML.
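To make the layer-by-layer transformation above concrete, the sketch below walks a feed-forward pass through stacked layers, each neuron computing f(W, x) on the previous layer's output vector. All names here are illustrative; this is not Poseidon or Caffe code.

```python
def relu(z):
    # a common choice of transformation function f
    return [max(0.0, v) for v in z]

def forward_layer(W, x, f):
    # each row of W is one neuron's weights; the layer output is the
    # concatenation of all neuron outputs, as described in the text
    z = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    return f(z)

def forward(layers, x):
    # stacking intermediate layers: each layer's output feeds the next
    for W, f in layers:
        x = forward_layer(W, x, f)
    return x

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], relu),   # hidden layer with 2 neurons
    ([[1.0, 1.0]], lambda z: z),         # output layer, identity f
]
prediction = forward(layers, [2.0, 1.0])
```

With the toy weights above, the hidden layer produces [1.0, 1.5] and the output layer sums them, so `prediction` is [2.5].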
Most NNs need to be trained with data to give accurate predictions. Stochastic gradient descent (SGD) and backpropagation are commonly employed to train NNs iteratively – each iteration performs a feed-forward (FF) pass followed by a backpropagation (BP) pass. In the FF pass, the network takes a training sample as input and forwards it from the input layer to the output layer to produce a prediction. A loss function is defined to evaluate the prediction error, which is then backpropagated through the network in reverse, during which the network parameters are updated by their gradients in the direction that decreases the error. After repeating a sufficient number of passes, the network usually converges to a state where the loss is close to a minimum, and the training is then terminated.

Mathematically, given data D and a loss function L, fitting the parameters θ of a NN can be formulated as an iterative-convergent algorithm that repeatedly executes the update equation

θ(t) = θ(t−1) + ε · ∇L(θ(t−1), D(t))    (1)

until θ reaches some stopping criterion, where t denotes the iteration. The update function ∇L calculates the gradients of L over the current data. The gradients are then scaled by a learning rate ε and applied to θ as updates. As the gradients are additive over data samples i, i.e., θ(t) = θ(t−1) + ε · Σ_i ∇L(θ(t−1), D_i), for efficiency we usually feed a batch of training samples D(t) (D(t) ⊂ D) at each training iteration t, as in Eq. 1.

In large-scale deep learning, the data D is usually too large to process on a single machine in acceptable time. To speed up the training, we usually resort to data parallelism, a parallelization strategy that partitions the data D and distributes the partitions to a cluster of computational worker machines (indexed by p = 1, ..., P), as illustrated in Figure 2.
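As a toy instance of the iterative-convergent update in Eq. 1, the sketch below fits a 1-D linear model y = θx with batched gradient steps. The names and data are illustrative; we write the update with the conventional descent sign, whereas the text folds the descent direction into ∇L.

```python
def grad_L(theta, batch):
    # gradient of the squared error sum((y - theta*x)^2) over the batch;
    # gradients are additive over samples, as noted in the text
    return sum(-2.0 * (y - theta * x) * x for x, y in batch)

def train(theta, batches, lr):
    for batch in batches:                     # one batch D(t) per iteration t
        theta = theta - lr * grad_L(theta, batch)
    return theta

# data generated from y = 3x; theta should converge toward 3
data = [(1.0, 3.0), (2.0, 6.0)]
theta = train(0.0, [data] * 20, lr=0.05)
```

Each iteration contracts the error geometrically here, so after 20 iterations θ is within a tiny tolerance of 3.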
At each iteration t, every worker fetches a batch D_p(t) from its data partition and computes the gradients ∇L(θ(t), D_p(t)). Gradients from all workers are then aggregated and applied to update θ(t) to θ(t+1) following

θ(t+1) = θ(t) + ε · Σ_{p=1..P} ∇L(θ(t), D_p(t))    (2)

Data parallelism allows the data to be locally partitioned to each worker, which is advantageous for large datasets. However, it requires every worker to have read and write access to the shared model parameters θ, which causes communication among workers; this shared access can be provided by a parameter server architecture [31, 4] (Figure 2a) or a peer-to-peer broadcasting architecture [32] (Figure 2b), both of which are designed for general-purpose data-parallel ML programs on CPUs.

Parameter Server. A parameter server (PS) is a distributed shared memory system that provides a systematic abstraction of iterative-convergent algorithms in data-parallel distributed ML. Typically, a PS enables each worker to access the global model parameters θ via network communication following a client-server scheme. DL can be trivially parallelized over distributed workers using a PS with the following 3 steps: (1) each worker computes the gradients (∇L) on its own data partition and sends them to remote servers; (2) the servers receive the updates and apply (+) them to the globally shared parameters; (3) a consistency scheme coordinates the synchronization among servers and workers (Figure 2a).

Sufficient Factor Broadcasting. Many ML models represent their parameters θ as matrices. For example, for fully-connected NNs trained using SGD, the gradient ∇θ over a single training sample is a rank-1 matrix, which can be cast as the outer product of two vectors u, v: ∇θ = uv^T, where u and v are called sufficient factors (SFs).
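The rank-1 property just stated can be checked directly: a receiver holding only the vectors u and v per sample can rebuild the full M × N gradient matrix, transmitting M + N values per sample instead of M × N. The sketch below is purely illustrative (pure Python, tiny dimensions).

```python
def outer(u, v):
    # rank-1 matrix u v^T from the two sufficient factors
    return [[u_i * v_j for v_j in v] for u_i in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# two training samples for a 2 x 3 FC layer (M = 2, N = 3)
u1, v1 = [1.0, 2.0], [3.0, 4.0, 5.0]
u2, v2 = [0.5, -1.0], [2.0, 0.0, 1.0]

# sender-side batch gradient: sum of per-sample rank-1 gradients
batch_grad = mat_add(outer(u1, v1), outer(u2, v2))

# receiver-side reconstruction from the four vectors alone
reconstructed = mat_add(outer(u1, v1), outer(u2, v2))
assert reconstructed == batch_grad
# per sample: M + N = 5 values on the wire instead of M * N = 6;
# for a 4096 x 4096 layer the gap is 8,192 vs ~16.8 million values
```

Note that, as point (2) below Eq. 2 will emphasize, the factors themselves are not additive: each sample contributes its own (u, v) pair, so SFB's traffic grows with the number of samples rather than the number of batches.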
Sufficient factor broadcasting (SFB) [32] is designed to parallelize these models by broadcasting SFs among workers and then reconstructing the gradient matrices ∇θ from u, v locally. SFB presents three key differences from PS: (1) SFB uses a P2P communication strategy that transmits SFs instead of full matrices; (2) unlike gradients, SFs are not additive over training samples, i.e., the number of SFs to be transmitted grows linearly with the number of data samples (not data batches); (3) the overall communication overhead of SFB increases quadratically with the number of workers.

2.2 Parallel DL on Distributed GPUs

Modern DL models are mostly trained using NVIDIA GPUs, because the primary computational steps in DL (e.g., matrix-matrix multiplications) match the SIMD operations that can be performed efficiently by GPUs. In practice, DL practitioners often use single-node software frameworks, such as Caffe [14] and Torch [6], which mathematically derive the correct training algorithm and execute it on a GPU by calling GPU-based acceleration libraries, such as cuBLAS and cuDNN. It is thus straightforward to parallelize these programs across distributed GPUs using either PS or SFB, by moving the computation from CPU to GPU, and performing memory copy operations (between DRAM and GPUs) or communication (among multiple nodes) whenever needed. However, we argue below, and show empirically in Section 5, that this usually leads to suboptimal performance.

The inefficiency is mainly caused by parameter synchronization via the network. Compared to CPUs, GPUs are an order of magnitude more efficient in matrix computations; gradients are produced on GPUs much faster than they can be naively synchronized over the network. As a result, the training computations are usually bottlenecked by communication.
For example, when training AlexNet [16] (61.5M parameters) on a Titan X with a standard batch size of 256, 240 million gradients will be generated per second on each GPU (0.25s/batch). If we parallelize the training on 8 nodes using a PS, with every node also holding 1/8 of the parameters as a PS shard, then every node needs to transfer 240M × 7/8 × 4 = 840M float parameters in one second to ensure the next iteration of computation is not blocked. Apparently, the demanded throughput (>26Gbps) exceeds the bandwidth that commodity Ethernet (i.e., 1GbE and 10GbE Ethernet) provides; the GPUs distributed across the cluster cannot be fully utilized. In practice, it is usually difficult to partition the parameters completely equally, which results in even more severe bandwidth demands, or bursty communication traffic on several server nodes (as we will show in Section 5.3), which prevents the trivial realization of efficient DL on distributed GPUs.[1] We next describe our strategies and system design to overcome the aforementioned obstacles.

3 Poseidon Design

In this section, we first analyze the DL program in both a single-node and a distributed environment by decomposing the program into a sequence of operations. Based on this, we introduce two strategies to address the issues.

The Structure of DL Programs. At the core of a DL program is the BP algorithm that performs forward-backward passes through the network repeatedly. If we define a forward and a backward pass through the l-th layer of an L-layer network as f_t^l and b_t^l, respectively, then the Computation step at iteration t is denoted C_t = [f_t^1, ..., f_t^L, b_t^L, ..., b_t^1], as illustrated in Fig. 3(a).

[1] Frequent memory copy operations between DRAM and GPU memory can also cause extra overhead, which is minor compared to the network communication according to our empirical results. However, our strategies in this paper can also alleviate this overhead.
Figure 3: (a) Traditional backpropagation and (b) wait-free backpropagation in a distributed environment.

When executing on distributed GPUs, inter-machine communications are required after each C step to guarantee the synchronized replication of model parameters. We similarly define the Synchronization step S_t as the process in which a worker sends out locally generated updates and then receives updated parameters from remote workers at iteration t. Therefore, a naive parallelization of DL training over distributed GPUs using either PS or SFB can be expressed as alternating C_t and S_t as defined above. We note that DL training is highly sequential; communication and computation are performed sequentially, each waiting for the other to finish (Fig. 3a).

Fortunately, we also note that as every layer of a NN contains an independent set of parameters, S_t can be decoupled as S_t = (s_t^1, ..., s_t^L), by defining s_t^l as the synchronization of the parameters of layer l. If we further decompose s_t^l = [o_t^l, i_t^l], i.e., first sending out the local updates of layer l (o_t^l) and then reading in the updated parameters remotely (i_t^l), we can rewrite a training iteration as [C_t, S_t] = [f_t^1, ..., f_t^L, b_t^L, ..., b_t^1, o_t^L, ..., o_t^1, i_t^L, ..., i_t^1]. The sequential nature of the BP algorithm presents an opportunity to overlap computation and communication. Our first strategy, wait-free backpropagation, overlaps C_t and S_t by partially rescheduling those b_t and s_t that are independent. Our second strategy, hybrid communication, utilizes the independency among s_t, and tries to reduce communication overheads by specializing different communication methods for different s_t.
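The first strategy can be sketched with a single communication thread: each s^l is enqueued the moment b^l finishes, so it proceeds concurrently with the remaining backward passes. Everything here is an illustrative stand-in for Poseidon's actual pipeline; backward and comm are placeholder callables.

```python
import queue
import threading

def train_iteration(num_layers, backward, comm):
    """One overlapped iteration: backward passes b_L..b_1 run on the main
    thread while per-layer synchronization s_l runs on a comm thread."""
    synced = []
    q = queue.Queue()

    def comm_worker():
        while True:
            l = q.get()
            if l is None:          # end-of-iteration sentinel
                return
            comm(l)                # o_l then i_l for layer l
            synced.append(l)

    t = threading.Thread(target=comm_worker)
    t.start()
    for l in range(num_layers, 0, -1):   # b_L, ..., b_1
        backward(l)
        q.put(l)       # start s_l immediately, without waiting for lower layers
    q.put(None)
    t.join()           # barrier: all layers synchronized before next iteration
    return synced

synced = train_iteration(3, backward=lambda l: None, comm=lambda l: None)
```

The final join acts as the per-iteration barrier; in the naive schedule of Fig. 3a, every comm(l) would instead run only after all backward passes had completed.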
3.1 Wait-free Backpropagation

Wait-free backpropagation (WFBP) is designed to overlap communication overheads with computation, based on two key independencies in the program: (1) the send-out operation o_t^l is independent of the backward operations b_t^i (i < l), so they can be executed concurrently without blocking each other; (2) the read-in operation i_t^l can update the layer parameters as soon as b_t^l has finished, without blocking the subsequent backward operations b_t^i (i < l). Therefore, we can let each layer l start its communication once its gradients are generated after b_t^l, so that the time spent on operation s_t^l is overlapped with that of b_t^i (i < l), as shown in Fig. 3b.

WFBP is most beneficial for training DL models whose parameters concentrate in the upper layers (FC layers) but whose computation concentrates in the lower layers (CONV layers) [2], e.g., VGG [26] and AdamNet [4, 7], because it overlaps the communication of top layers (90% of communication time) with the computation of bottom layers (90% of computation time) [37, 7]. Besides chain-like NNs, WFBP is generally applicable to non-chain-like structures (e.g., tree-like structures): as the parameter optimization for deep neural networks depends on adjacent layers (and not the whole network), there is always an opportunity for parameter optimization (i.e., computation) and communication from different layers to be performed concurrently.

Some DL frameworks, such as TensorFlow, represent the data dependencies of DL programs using graphs, and therefore implicitly enable auto-parallelization. However, they fail to explore the potential opportunities for parallelization between iterations.
For example, TensorFlow needs to fetch the updated parameters from remote storage at the beginning of each iteration, while it is possible to overlap this communication procedure with the computation procedure of the previous iteration. In comparison, WFBP enforces this overlapping by explicitly pipelining compute, send and receive procedures. We describe our implementation of WFBP in Section 4 and empirically show its effectiveness in Section 5.1.

3.2 Hybrid Communication

While WFBP overlaps communication and computation, it does not reduce the communication overhead. In situations where the network bandwidth is limited (e.g., commodity Ethernet, or Ethernet shared with other communication-heavy applications), communication would still be unacceptably slow. To address this issue, we introduce a hybrid communication (HybComm) strategy that combines the best of PS and SFB by being aware of both the mathematical properties of DL models and the structure of computing clusters. Our idea comes from two observations: first, as presented in Section 3, the synchronization operations s_t^l (l = 1, ..., L) are independent of each other, meaning that we can use different communication methods for different s_t^l by specializing o_t^l and i_t^l according to the two methods described in Figure 2; second, a NN structure is usually predefined and fixed throughout the training – by measuring the number of parameters that need to be transferred, we can estimate the communication overhead, and therefore choose the optimal method even before the communication happens.

Consider training the VGG19 network [26]. The overheads of s_t^l can be estimated as follows (Table 1): assume the batch size K = 32, and the numbers of worker and server nodes P1 = P2 = 8 (with parameters equally partitioned over all server shards).[2]

Method     | Server                | Worker               | Server & Worker
PS         | 2 P1 M N / P2         | 2 M N                | 2 M N (P1 + P2 − 2) / P2
SFB        | N/A                   | 2 K (P1 − 1)(M + N)  | N/A
Adam (max) | P1 M N + P1 K (M + N) | K (M + N) + M N      | (P1 − 1)(M N + K M + K N)

Table 1: Estimated communication cost of PS, SFB and Adam for synchronizing the parameters of an M × N FC layer on a cluster with P1 workers and P2 servers, when the batch size is K.

On one hand, if l is an FC layer (with shape 4096 × 4096, i.e., M = N = 4096), synchronizing its parameters via PS will transfer 2MN ≈ 34 million parameters for a worker node, 2 P1 M N / P2 ≈ 34 million for a server node, and 2 M N (P1 + P2 − 2) / P2 ≈ 58.7 million for a node that is both a server and a worker, compared to 2 K (M + N)(P1 − 1) ≈ 3.7 million for a single node using SFB. On the other hand, if l is a CONV layer, the updates are indecomposable and sparse, so we can directly resort to PS. Therefore, the synchronization overheads depend not only on the model (type, shape and size of the layer), but also on the size of the cluster. The optimal solution usually changes with M, N, K, P1, P2. HybComm takes these factors into account and dynamically adjusts the communication method for different parts of a model – it always chooses the best available method, whichever results in less communication overhead.

Microsoft Adam [4] employs a different communication strategy from those in Figure 2. Instead of broadcasting SFs across workers, they first send SFs to a parameter server shard, then pull back the whole updated parameter matrices.

[2] Most classification models will fall into this family if the number of classes to be classified is large.
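The arithmetic in the VGG19 example can be checked directly against the Table 1 formulas; the helper names below are illustrative, and the costs are in transmitted floats per node.

```python
def ps_cost_both(M, N, P1, P2):
    # PS cost for a node acting as both worker and server (Table 1)
    return 2 * M * N * (P1 + P2 - 2) / P2

def sfb_cost(M, N, K, P1):
    # SFB cost per worker: 2K(P1 - 1)(M + N)
    return 2 * K * (P1 - 1) * (M + N)

M = N = 4096             # the 4096 x 4096 FC layer from the text
K, P1, P2 = 32, 8, 8     # batch size, workers, servers

print(round(ps_cost_both(M, N, P1, P2) / 1e6, 1))  # ~58.7 million floats
print(round(sfb_cost(M, N, K, P1) / 1e6, 1))       # ~3.7 million floats
```

Comparing these two quantities for a given layer and cluster is exactly the test Algorithm 1 performs when deciding between SFB and PS; note that SFB's advantage shrinks as K or P1 grows, since its cost scales with both.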
This seems to reduce the total number of parameters to be communicated, but usually leads to load imbalance; the server node that holds the corresponding parameter shard overloads because it has to broadcast the parameter matrices to all workers (P1 M N + P1 K (M + N) messages need to be broadcast), which easily causes a communication bottleneck (Section 5.3). It is worth noting that reconstructing gradients from SFs may incur extra computation cost, which however is often negligible compared to communication. We describe our implementation of HybComm in the next section, and assess its effectiveness in Section 5.

4 Implementation

This section first elaborates Poseidon's system architecture and APIs, and then describes how to modify a framework using Poseidon to enable distributed execution.

4.1 System Implementation and APIs

Figure 4 illustrates the architecture of Poseidon: a C++ communication library that manages parameter communication for DL programs running on distributed GPUs. It has three main components: the coordinator, which maintains the model and the cluster configuration; the KV store, a shared-memory key-value store that provides support for parameter-server-based communication; and the client library, which is plugged into DL programs to handle parameter communication. Their APIs are listed in Table 2.

Method     | Owner       | Arguments                         | Description
BestScheme | Coordinator | A layer name or index             | Get the best communication scheme of a layer
Query      | Coordinator | A list of property names          | Query information from the coordinator's information book
Send       | Syncer      | None                              | Send out the parameter updates of the corresponding layer
Receive    | Syncer      | None                              | Receive parameter updates from either parameter servers or peer workers
Move       | Syncer      | A GPU stream and a move direction | Move contents between GPU and CPU; transform and apply updates if needed
Send       | KV store    | Updated parameters                | Send out the updated parameters
Receive    | KV store    | Parameter buffer of KV stores     | Receive gradient updates from workers

Table 2: Poseidon APIs for parameter synchronization.

Algorithm 1: Get the best comm method of layer l
 1: function BestScheme(l)
 2:   layer_property = Query(l.name)
 3:   P1, P2, K = Query('n_worker', 'n_server', 'batchsize')
 4:   if layer_property.type == 'FC' then
 5:     M = layer_property.width
 6:     N = layer_property.height
 7:     if 2K(P1 − 1)(M + N) ≤ 2MN(P1 + P2 − 2)/P2 then
 8:       return 'SFB'
 9:     end if
10:   end if
11:   return 'PS'
12: end function

Figure 4: An overview of the architecture of Poseidon.

Coordinator. To set up distributed training, the client program (e.g., Caffe) first instantiates Poseidon by creating a coordinator within its process. Coordinators first collect the necessary information, including the cluster configuration (e.g., the number of worker and server nodes, and their IP addresses) and the model architecture (e.g., the number of layers, layer types, number of neurons, how they are connected, etc.). With this information, the coordinator initializes the KV stores and the client library in two steps: (1) allocate proper communication ports for each PS shard and peer worker; (2) determine which parameters should be transmitted via the KV store and which by SFB, hash the parameters equally across the KV stores if necessary, and save the mapping in the information book, which, throughout the whole training, is maintained and synchronized across nodes, and can be accessed elsewhere through the coordinator's Query API.
Besides, the coordinator provides another API, BestScheme, which takes in a layer and returns the optimal communication scheme for it according to the strategy described in Section 3.2 (Algorithm 1).

KV Store. The KV store is implemented based on a bulk synchronous parameter server [31, 7], and is instantiated by coordinators on a list of user-specified "server" machines. Each instance of the KV store holds one shard of the globally shared model parameters in the form of a set of KV pairs, each of which is stored on a chunk of DRAM. Poseidon sets the size of a KV pair to a fixed small size (e.g., 2MB), so as to partition and distribute model parameters to server nodes as equally as possible, reducing the risk of an Ethernet bottleneck. Each KV store instance manages a parameter buffer in RAM, and provides PS-like APIs, such as Receive and Send, for receiving and applying updates from client libraries, or sending out parameters. It regularly checkpoints current parameter states for fault tolerance.

Client Library. Poseidon coordinates with DL programs via its client library. In particular, users plug the client library into their training program, and the client library creates a syncer for each NN layer during network assembly (so that each layer maps one-to-one to a syncer), accounting for its parameter synchronization. Each syncer is then initialized, for example, setting up connections to its corresponding PS shards or (remote) peer syncers according to the coordinator's information book, and allocating a small memory buffer for receiving remote parameter matrices or SFs. The client library manages a CPU thread pool and a GPU stream pool on the worker machine, which can be allocated via the syncer APIs when a syncer job is created. The syncer has three main APIs, Send, Receive and Move, to be used in client programs.
The Move API takes care of memory movement between RAM and GPU memory, and performs necessary computation, e.g., the transformation between SFs and gradients, and the application of updates. It is multi-threaded using the CUDA asynchronous APIs, and triggers an allocation from the client library's thread/stream pools when a syncer job starts (see L14 of Algorithm 2). Send and Receive are communication APIs that synchronize layer parameters across different model replicas. The Send API is nonblocking; it sends out parameter updates during backpropagation once they are generated, following the protocol returned by the coordinator's BestScheme API. The Receive API is called once Send has finished. It requests either fresh parameter matrices from the KV stores or SFs from its peer syncers, and blocks its current thread until it has received everything it requested. The received messages are put into the syncer's memory buffer for the Move API to fetch.

Managing Consistency. Poseidon implements the bulk synchronous parallel (BSP) consistency model as follows. The client library maintains a binary vector C, whose length is the number of syncers, with all entries reset to zero at the start of each iteration. A syncer sets its corresponding entry in C to 1 when its job finishes, and the client starts the next iteration when all entries are 1. Meanwhile, the KV store maintains a zero-initialized count value for each KV pair at the start of each iteration. Each time an update is applied to a KV pair, its count value is increased by 1. The KV pair is broadcast via its Send API when its count equals the number of workers. Poseidon handles stragglers by simply dropping them.
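The BSP bookkeeping just described is simple enough to sketch directly; the class names below are illustrative, not Poseidon's actual C++ types.

```python
class SyncerVector:
    """Client-side binary vector C: one entry per syncer, reset per iteration."""
    def __init__(self, n_syncers):
        self.c = [0] * n_syncers
    def mark_done(self, i):
        self.c[i] = 1              # syncer i's job finished this iteration
    def iteration_complete(self):
        return all(self.c)         # client may start the next iteration
    def reset(self):
        self.c = [0] * len(self.c)

class KvPair:
    """Server-side counter: a pair is broadcast once every worker's
    update has been applied to it."""
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.count = 0
    def apply_update(self):
        self.count += 1
        return self.count == self.n_workers   # True => ready to Send
    def reset(self):
        self.count = 0
```

Both sides reset their state at iteration boundaries, which is what makes the scheme bulk synchronous: no worker runs ahead of the slowest non-dropped peer.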
Although asynchronous models can alleviate the straggler problem in distributed ML [12], Poseidon focuses on synchronous parallel training, because synchronous execution yields the fastest per-iteration improvement in accuracy for distributed DL (as measured by wall clock time) on GPUs [7, 2] (see Section 5.1).

Algorithm 2 Parallelize a DL library using Poseidon
1: function TRAIN(net)
2:   for iter = 1 → T do
3:     sync_count = 0
4:     net.Forward()
5:     for l = L → 1 do
6:       net.BackwardThrough(l)
7:       thread_pool.Schedule(sync(l))
8:     end for
9:     wait_until(sync_count == net.num_layers)
10:  end for
11: end function
12: function SYNC(l)
13:   stream = stream_pool.Allocate()
14:   syncers[l].Move(stream, GPU2CPU)
15:   syncers[l].method = coordinator.BestScheme(l)
16:   syncers[l].Send()
17:   syncers[l].Receive()
18:   syncers[l].Move(stream, CPU2GPU)
19:   sync_count++
20: end function

4.2 Integrating Poseidon with DL Libraries

Poseidon can be plugged into most existing DL frameworks to enable efficient distributed execution. Algorithm 2 provides an example. Specifically, one first includes Poseidon's client library in the framework, then locates where backpropagation proceeds (L6), and inserts Poseidon's syncer APIs between gradient generation and application (L7). We demonstrate in Section 5.1 that with slight modifications (150 and 250 LoC for Caffe and TensorFlow, respectively), both Poseidon-enabled Caffe and TensorFlow deliver near-linear scaling on up to 32 GPU machines. Poseidon respects the programming interfaces of the native DL library and stores the arguments necessary for distributed execution as environment variables, allowing zero changes to DL application programs.

5 Evaluation

In this section, we evaluate Poseidon's performance on scaling up DL with distributed GPUs. We focus on the image classification task, where DL has been most successfully applied.
Our evaluation reveals the following results: (1) Poseidon incurs little overhead when plugged into existing frameworks; it achieves near-linear speedups across different NNs and frameworks, on up to 32 Titan X-equipped machines. (2) Poseidon's system design effectively improves GPU and bandwidth utilization. (3) Poseidon's communication strategy HybComm effectively alleviates the communication bottleneck, and thus achieves better speedups under limited bandwidth; moreover, Poseidon compares favorably to other communication-reduction methods, such as the SF strategy in Adam [4] and the 1-bit quantization in CNTK [36].

Cluster Configuration. We conduct our experiments on a GPU cluster in which each node is equipped with an NVIDIA GeForce TITAN X GPU card, an Intel 16-core CPU and 64GB RAM, interconnected via a 40-Gigabit Ethernet switch. All cluster nodes have shared access to an NFS and read data through the Ethernet interface. We run our system on Ubuntu 16.04, with NVIDIA driver version 361.62, CUDA 8.0 and cuDNN v5.

Computation Engines. We deploy Poseidon on two DL frameworks, Caffe [14] and TensorFlow [1]. For Caffe, we use the official version as of 2016/06/30 as the single-node baseline, and modify it using Poseidon's client library API for distributed execution. For TensorFlow, we use its open-source version r0.10, parallelize its single-node version with Poseidon's client library, and compare to its original distributed version.³

Dataset and Models. Our experiments use three well-known image classification datasets.
(1) CIFAR-10 [15], which contains 32×32 colored images of 10 classes, with 50K images for training and 10K for testing; (2) ILSVRC12 [23], a subset of ImageNet22K that has 1.28 million training images and 50K validation images in 1,000 categories; (3) ImageNet22K [23], the largest public dataset for image classification, including 14,197,087 labeled images from 21,841 categories.

³ Note that, as the distributed engine of TensorFlow is highly optimized (e.g., auto-parallelization of graphs [1]), Poseidon avoids leveraging any built-in optimization of distributed TensorFlow by parallelizing its single-node version instead.

Model           # Params   Dataset      Batchsize
CIFAR-10 quick  145.6K     CIFAR10      100
GoogLeNet       5M         ILSVRC12     128
Inception-V3    27M        ILSVRC12     32
VGG19           143M       ILSVRC12     32
VGG19-22K       229M       ImageNet22K  32
ResNet-152      60.2M      ILSVRC12     32

Table 3: Neural networks for evaluation. Single-node batchsize is reported. The batchsize is chosen based on the standards reported in the literature (usually the maximum batch size that fits in GPU memory).

We test Poseidon's scalability across different neural networks: (1) CIFAR-10 quick: a toy CNN from Caffe that converges at 73% accuracy for classifying images in the CIFAR-10 dataset; (2) GoogLeNet [27]: a 22-layer CNN with 5M parameters; (3) Inception-V3 [28]: an ImageNet winner, an improved version of GoogLeNet from TensorFlow; (4) VGG19: a popular feature extraction network in the computer vision community [26] that has 16 CONV layers and 3 FC layers, 143M parameters in total; (5) VGG19-22K: we modify the VGG19 network by replacing its 1000-way classifier with a 21,841-way classifier, to classify images from the ImageNet22K dataset; the modified network has 229M parameters; (6) ResNet-152: an ImageNet winner network with 152 layers. We list their statistics and configurations in Table 3.

Metrics.
In this paper, we mainly focus on metrics that measure system performance, such as speedup on throughput (number of images scanned per second). Our experiments focus on medium-scale distributed clusters with up to 32 machines, the regime where distributed DL empirically benefits most. Larger clusters require larger batch sizes, which hurt the per-iteration convergence rate [3, 7]. For completeness, we also report the statistical performance (time/epoch to convergence) on ResNet-152. Poseidon uses synchronized replication, which enables many models to converge in fewer steps [1, 7, 3, 2].

5.1 Scalability

To demonstrate Poseidon's scalability, we train CNNs using Poseidon with different computational engines, and compare different systems in terms of their speedups on throughput. For the Caffe engine, we train the GoogLeNet, VGG19 and VGG19-22K networks; for the TensorFlow engine, we train Inception-V3, VGG19 and VGG19-22K.

Caffe Engine. Figure 5 shows the throughput vs. number of workers when training the three networks using the Caffe engine, given 40GbE Ethernet bandwidth. We compare the following systems: (1) Caffe: unmodified Caffe executing on a single GPU; (2) Caffe+PS: Caffe parallelized using a vanilla PS, i.e., parameter synchronization happens sequentially after the backpropagation in each iteration; (3) Caffe+WFBP: Caffe parallelized using Poseidon, so communication and computation are overlapped; however, HybComm is disabled, so parameters are synchronized only via the PS; (4) Poseidon: the full version of Poseidon-Caffe.
Poseidon shows little overhead when combined with Caffe. Running on a single node with no communication involved, Poseidon-Caffe can process 257, 35.5 and 34.2 images per second when training GoogLeNet, VGG19 and VGG19-22K, respectively, compared to the original Caffe, which processes 257, 35.5 and 34.6 images per second, and Caffe+PS, which processes only 213.3, 21.3 and 18.5 images per second. Caffe+PS suffers overheads from memory copy operations between RAM and GPU, which Poseidon overlaps with computation. In a distributed environment, the rescheduling of computation and communication significantly improves throughput: when training GoogLeNet and VGG19, incorporating WFBP achieves almost linear scaling on up to 32 machines, and for the larger VGG19-22K network, Caffe+WFBP achieves a 21.5x speedup on 32 machines. We conclude that rescheduling and multi-threading the communication and computation are key to the performance of distributed DL on GPUs, even when bandwidth is abundant. Poseidon provides an effective implementation that overlaps these operations for DL frameworks, guaranteeing better GPU utilization.

When the available bandwidth is sufficient, Poseidon's HybComm strategy shows small improvements on training GoogLeNet and VGG19. However, when training VGG19-22K, whose three FC layers hold 91% of the model parameters, it improves over Caffe+WFBP from a 21.5x to a 29.5x speedup on 32 nodes.

TensorFlow Engine. We also modify TensorFlow using Poseidon, and compare the following systems in terms of speedup on throughput: (1) TF: TensorFlow with its original distributed execution; (2) TF+WFBP: TensorFlow modified using Poseidon's client library.
Specifically, we change the assign operator in TensorFlow so that, instead of being applied locally, parameter updates are synchronized via Poseidon's PS interface with WFBP; (3) Poseidon: the full version of Poseidon-parallelized TensorFlow with HybComm enabled.

We train the Inception-V3, VGG19 and VGG19-22K models and report the results in Figure 6. Running on a single node, Poseidon processes 43.2, 38.2 and 34.5 images per second when training Inception-V3, VGG19 and VGG19-22K, while the original TensorFlow processes 43.2, 38.5 and 34.8 images per second on these three models, respectively; little overhead is introduced by our modification. In distributed execution, Poseidon achieves almost linear speedup on up to 32 machines. Distributed TensorFlow, however, demonstrates only a 10x speedup on training Inception-V3, and even fails to scale on training the other two networks in our experiments.

Figure 5: Throughput scaling when training GoogLeNet, VGG19 and VGG19-22K using Poseidon-parallelized Caffe and 40GbE bandwidth. Single-node Caffe is set as the baseline (i.e., speedup = 1).

Figure 6: Throughput scaling when training Inception-V3, VGG19 and VGG19-22K using Poseidon-parallelized TensorFlow and 40GbE bandwidth. Single-node TensorFlow is set as the baseline (i.e., speedup = 1).

Figure 7: Breakdown of GPU computation and stall time when training the three networks on 8 nodes using different systems.

To investigate the problem with TensorFlow and explain how Poseidon improves upon it, we illustrate in Figure 7 the (averaged) ratio of busy and stall time of a GPU when training the three networks using different systems on 8 nodes. Observe that Poseidon keeps GPUs busy most of the time, while TensorFlow wastes much time waiting for parameter synchronization. The inefficiency of distributed TensorFlow stems from two sources. First, TensorFlow partitions model parameters at a coarse granularity: each tensor (instead of a KV pair) in the model is assigned to a PS shard. A big tensor (such as a parameter matrix in VGG19) is thus likely to create a communication bottleneck on the server node hosting it. Poseidon fixes this problem by partitioning parameters among server nodes at a finer granularity using KV pairs, so that every node has an evenly distributed communication load; as evidence, TF+WFBP demonstrates a higher computation-to-stall ratio in Figure 7. Second, TensorFlow cannot reduce the communication overheads, while Poseidon's HybComm effectively reduces the size of messages. As a result, Poseidon further improves upon TF+WFBP from a 22x to a 30x speedup on 32 nodes.

Multi-GPU Settings. Poseidon's key strategies can be directly extended to support distributed multi-GPU environments with minor modifications. Specifically, when there is more than one GPU on a worker node, Poseidon first collects the gradient updates, following WFBP locally (either as full matrices or SFs), from the multiple GPUs to a leader GPU using the cudaMemcpy(DeviceToDevice) API. If those updates are to be communicated as full matrices, Poseidon aggregates them locally before sending them out.
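A minimal sketch of this local aggregation step, with plain Python lists standing in for per-GPU gradient buffers (names are hypothetical; in Poseidon the gather uses device-to-device copies on the leader GPU):

```python
# Illustrative sketch (hypothetical names): sum the gradient buffers of one
# layer from all local GPUs into the leader's buffer, so only one aggregated
# update per node is sent over the network instead of one update per GPU.
# Plain lists stand in for GPU buffers; the real system would copy each
# buffer to the leader GPU with cudaMemcpy(DeviceToDevice) before adding.

def aggregate_local_gradients(per_gpu_grads):
    """Sum gradient buffers from all local GPUs into a leader buffer."""
    leader = list(per_gpu_grads[0])        # leader GPU's own gradient
    for grad in per_gpu_grads[1:]:
        leader = [a + b for a, b in zip(leader, grad)]
    return leader

# Example: 4 GPUs, each holding a gradient for a small 6-element layer
grads = [[float(i + 1)] * 6 for i in range(4)]
agg = aggregate_local_gradients(grads)
assert agg == [10.0] * 6  # 1 + 2 + 3 + 4 on every element
```

The network then sees one message per node rather than one per GPU, which is why intra-node memory movement stays off the critical path.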
Using the Caffe engine on a single node, Poseidon achieves linear scaling on up to 4 Titan X GPUs when training all three networks, outperforming Caffe's multi-GPU version, which shows only 3x and 2x speedups when training GoogLeNet and VGG19, respectively. When running on AWS p2.8xlarge instances (8 GPUs per node), Poseidon reports 32x and 28x speedups when training GoogLeNet and VGG19 with 4 nodes (32 GPUs in total), confirming our statement that the overheads caused by memory movement between GPUs are usually negligible compared to network communication.⁴

⁴ The K80 GPUs on p2.8xlarge have fewer GFLOPS than the Titan X used in our main experiments, so the communication burden is less severe.

Figure 8: Throughput scaling when training GoogLeNet, VGG19 and VGG19-22K using Poseidon-parallelized Caffe with varying network bandwidth. Single-node Caffe is set as the baseline (speedup = 1).

Figure 9: (a) Speedup vs. number of nodes and (b) top-1 test error vs. epochs for training ResNet-152 using Poseidon-TensorFlow and the original TensorFlow.

Figure 10: Averaged communication load when training VGG19 using TF+WFBP, Adam and Poseidon with the TensorFlow engine. Each bar represents the network traffic on a node.

Statistical Performance. For completeness, we report in Figure 9 the statistical performance for training ResNet-152 using Poseidon. Poseidon achieves near-linear speedups on both system throughput and statistical convergence: it delivers a 31x speedup in terms of throughput, and reaches the reported 0.24 error in fewer than 90 epochs with both 16 and 32 nodes, thus scaling linearly in time-to-accuracy compared to 8 nodes with batchsize = 32 × 8, a standard setting as in [11]. This echoes recent results that synchronous training on distributed GPUs yields better performance than asynchronous training in terms of time-to-quality for most NNs [7, 2]. For the other NNs in Table 3, Poseidon delivers the same accuracies as reported in their papers [16, 28, 27, 26] on up to 32 GPUs.

5.2 Bandwidth Experiments

To further assess Poseidon's HybComm strategy, we simulate environments where network bandwidth is limited. We use the Linux traffic control tool tc to lower the available bandwidth on each node, and compare the training throughput with and without HybComm. We focus on the Caffe engine in this section because it is lighter and less optimized than TensorFlow.

Figure 8 plots the speedup on throughput vs. number of workers when training GoogLeNet, VGG19 and VGG19-22K under different maximum bandwidths. Clearly, limited bandwidth prevents a standard PS-based system from scaling linearly with the number of nodes; for example, given 10GbE bandwidth (a commonly deployed Ethernet configuration on most cloud computing platforms), training VGG19 using a PS on 16 nodes is accelerated by only 8x. This observation confirms our argument that limited bandwidth results in a communication bottleneck when training big models on distributed GPUs. Fortunately, Poseidon significantly alleviates this issue.
Under limited bandwidth, it consistently improves throughput by directly reducing the size of the messages that need to be communicated, especially when the batch size is small; when training VGG19 and VGG19-22K, Poseidon achieves near-linear speedup on 16 machines using only 10GbE bandwidth, which an optimized PS would otherwise need 30GbE or more to achieve. Note that Poseidon never underperforms a traditional PS scheme, because it reduces to a parameter server whenever that results in less communication overhead; for instance, we observe that Poseidon reduces to a PS when training GoogLeNet on 16 nodes, because GoogLeNet has only one thin FC layer (1000 × 1024) and is trained with a large batch size (128).

5.3 Comparisons to Other Methods

In this section, we compare Poseidon against other communication methods, including Adam [4] and CNTK 1-bit quantization [36], and show Poseidon's advantages.

Adam. To save bandwidth, Adam [4] synchronizes the parameters of an FC layer by first pushing the SFs generated on all workers to a PS node, and then pulling back the full parameter matrices. As direct comparisons to Adam [4] are inaccessible, we implement its strategy in Poseidon, and compare it (denoted Adam) to TF+WFBP and Poseidon by monitoring the network traffic of each machine when training VGG19 on 8 nodes using the TensorFlow engine. As shown in Figure 10, the communication workload is highly imbalanced under Adam's strategy. Unlike a traditional PS (TF+WFBP), where the parameters are equally distributed over multiple shards, Adam cannot partition the parameters of FC layers because of its use of SFs. Although the "push" operation uses SFs to reduce message size, the "pull" requires some server nodes to broadcast big matrices to each worker node, which creates bursty traffic that results in a communication bottleneck on those nodes.
By contrast, Poseidon either partitions parameters equally over multiple PS shards, or transmits SFs among peer workers; both are load-balanced and avoid bursty communication. Quantitatively, Adam delivers only a 5x speedup with 8 nodes when training VGG19.

CNTK. We compare Poseidon to the 1-bit quantization technique proposed in CNTK [36]. We create a baseline, Poseidon-1bit, which uses the 1-bit strategy to quantize the gradients in FC layers and adds the residual to the updates of the next iteration. We then train the CIFAR-10 quick network, and plot the training loss and test error vs. iterations for the two systems (both achieve linear scaling on throughput). As shown in Figure 11, 1-bit quantization yields worse convergence in terms of accuracy: on 4 GPUs, it reaches 0.5 error after 3K iterations, while Poseidon quickly converges to 0.3 error at iteration 1000. We conjecture this is caused by the quantization residual, which is equivalent to delayed updates that may hurt convergence when training NNs on images, as confirmed by [7]. We also directly train VGG19 using the CNTK-1bit system, and observe 5.8x, 11x and 20x speedups on 8, 16 and 32 nodes, respectively; thus lower scale-up than Poseidon, and also compromised statistical performance due to approximated updates.

Figure 11: Training loss and test error vs. iterations when training the CIFAR-10 quick network using Poseidon and Poseidon-1bit on 4 GPUs with the Caffe engine.

6 Related Work

PS-based Distributed DL Systems. Based on the parameter server [31, 19] architecture, a number of CPU-based distributed DL systems have been developed, such as [38, 29, 9, 17] and Adam [4].
They are purely PS-based systems on CPU-only clusters, whereas we address the more challenging case of GPU clusters.

Scaling up DL on distributed GPUs is an active field of research. Coates et al. [5] build a GPU-based multi-machine system for DL using model parallelism rather than data parallelism; their implementation is specialized for a fixed model structure and demands specialized hardware, such as InfiniBand networking. TensorFlow [1] is Google's distributed ML platform that uses a dataflow graph to represent DL models and synchronizes model parameters via a PS. It therefore cannot dynamically adjust its communication method depending on layer and cluster information, as Poseidon does. MXNet [3] is another DL system that uses a PS for distributed execution and supports TensorFlow-like graph representations of DL models. By auto-parallelizing independent subgraphs, both frameworks implicitly overlap communication and computation. By contrast, Poseidon overlaps them explicitly via its client library; hence, Poseidon can also be used to parallelize non-graph-based frameworks. Moreover, neither MXNet nor TensorFlow addresses the bottleneck caused by limited network bandwidth, which undermines their scalability when training large models with dense layers (e.g., a big softmax). Besides, Cui et al. propose GeePS [7], which manages the limited GPU memory and reports speedups on distributed GPUs; however, GeePS does not address the issue of limited network bandwidth, so Poseidon's techniques could be combined with it to enable better training speedups. Also of note are several efforts to port Caffe onto other distributed platforms, such as SparkNet [22], YahooCaffe [33] and FireCaffe [13]; the first reports a 4-5x speedup with 10 machines (and hence less scalability than our results herein).

Other distributed ML systems.
CNTK [36] is a DL framework that supports distributed execution and addresses the communication bottleneck via a 1-bit quantization technique. CNTK demonstrates little negative impact on convergence in speech domains [25, 24]. However, in some other domains (Section 5.3), performance is compromised by noisy gradients [1, 7]. By contrast, Poseidon's HybComm reduces communication while always guaranteeing synchronous training. There is also growing interest in parallelizing ML applications using peer-to-peer communication, such as MALT [18], SFB [32] and Ako [30]. Poseidon draws inspiration from these works but goes one step further: it is an adaptive, best-of-both-worlds protocol that selects client-server communication whenever that results in fewer overheads.

7 Conclusion

We present Poseidon, a scalable and efficient communication architecture for large-scale DL on distributed GPUs. Poseidon's design is orthogonal to TensorFlow, Caffe and other DL frameworks; the techniques present in Poseidon could be used to produce better distributed versions of them. We empirically show that Poseidon consistently delivers near-linear speedups using up to 32 nodes and limited bandwidth on a variety of neural networks, datasets and computation engines, and compares favorably to Adam and Microsoft CNTK.

Acknowledgments

We thank our shepherd Yu Hua and the ATC reviewers for their helpful feedback. We thank the CMU Parallel Data Laboratory for their machine resources and Henggang Cui for insightful discussion. This research is supported by NSF Big Data IIS1447676 and NSF XPS Parallel CCF1629559.

References

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695 (2016).
[2] Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
[3] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[4] Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. In OSDI (2014).
[5] Coates, A., Huval, B., Wang, T., Wu, D. J., Ng, A. Y., and Catanzaro, B. Deep learning with COTS HPC systems. In ICML (2013).
[6] Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A Matlab-like environment for machine learning. In NIPSW (2011).
[7] Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems (2016), ACM, p. 4.
[8] Dai, W., Kumar, A., Wei, J., Ho, Q., Gibson, G., and Xing, E. P. Analysis of high-performance distributed ML at scale through parameter server consistency models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (2015).
[9] Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In NIPS (2012).
[10] Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M. L., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. Recent advances in deep learning for speech research at Microsoft. In ICASSP (2013).
[11] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).
[12] Ho, Q., Cipar, J., Cui, H., Kim, J. K., Lee, S., Gibbons, P. B., Gibson, G. A., Ganger, G. R., and Xing, E. P. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS (2013).
[13] Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2592–2600.
[14] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM (2014).
[15] Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[16] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS (2012).
[17] Le, Q. V., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. Building high-level features using large scale unsupervised learning. In ICML (2012).
[18] Li, H., Kadav, A., Kruus, E., and Ungureanu, C. MALT: Distributed data-parallelism for existing ML applications. In Proceedings of the Tenth European Conference on Computer Systems (2015), ACM, p. 3.
[19] Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (2014), pp. 583–598.
[20] Liang, X., Hu, Z., Zhang, H., Gan, C., and Xing, E. P. Recurrent topic-transition GAN for visual paragraph generation. arXiv preprint arXiv:1703.07022 (2017).
[21] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In ICLRW (2013).
[22] Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. SparkNet: Training deep networks in Spark. arXiv preprint arXiv:1511.06051 (2015).
[23] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV (2015), 1–42.
[24] Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH (2014), pp. 1058–1062.
[25] Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. On parallelizability of stochastic gradient descent for speech DNNs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), IEEE, pp. 235–239.
[26] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR (2015).
[27] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR (2015).
[28] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015).
[29] Wang, W., Chen, G., Dinh, T. T. A., Gao, J., Ooi, B. C., Tan, K.-L., and Wang, S. SINGA: Putting deep learning in the hands of multimedia users. In MM (2015).
[30] Watcharapichat, P., Morales, V. L., Fernandez, R. C., and Pietzuch, P. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing (2016), ACM, pp. 84–97.
[31] Wei, J., Dai, W., Qiao, A., Ho, Q., Cui, H., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Managed communication and consistency for fast data-parallel iterative analytics. In SoCC (2015).
[32] Xie, P., Kim, J. K., Zhou, Y., Ho, Q., Kumar, A., Yu, Y., and Xing, E. Distributed machine learning via sufficient factor broadcasting. In arXiv (2015).
[33] Yahoo. Caffe on Spark. http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop.
[34] Yan, Z., Zhang, H., Jagadeesh, V., DeCoste, D., Di, W., and Piramuthu, R. HD-CNN: Hierarchical deep convolutional neural network for image classification. ICCV (2015).
[35] Yan, Z., Zhang, H., Wang, B., Paris, S., and Yu, Y. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG) 35, 2 (2016), 11.
[36] Yu, D., Eversole, A., Seltzer, M., Yao, K., Huang, Z., Guenter, B., Kuchaiev, O., Zhang, Y., Seide, F., Wang, H., et al. An introduction to computational networks and the computational network toolkit. Tech. rep.
[37] Zhang, H., Hu, Z., Wei, J., Xie, P., Kim, G., Ho, Q., and Xing, E. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216 (2015).
[38] Zou, Y., Jin, X., Li, Y., Guo, Z., Wang, E., and Xiao, B. Mariana: Tencent deep learning platform and its applications. In VLDB Endowment (2014).
