Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning
Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, Shreyan Gupta, Hongzi Mao, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
{addanki, bjjvnkt, shreyang, hongzi, alizadeh}@mit.edu

Abstract

We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computation graph, Placeto can learn generalizable device placement policies that can be applied to any graph. We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; (2) we use graph embeddings to capture relevant information about the structure of the computation graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto requires up to 6.1× fewer training steps to find placements that are on par with or better than the best placements found by prior approaches. Moreover, Placeto is able to learn a generalizable placement policy for any given family of graphs, which can then be used without any retraining to predict optimized placements for unseen graphs from the same family. This eliminates the large overhead incurred by prior RL approaches, whose lack of generalizability necessitates re-training from scratch every time a new graph is to be placed.

1 Introduction & Related Work

The computational requirements for training neural networks have steadily increased in recent years. As a result, a growing number of applications [11, 17] use distributed training environments in which a neural network is split across multiple GPU and CPU devices.
A key challenge for distributed training is how to split a large model across multiple heterogeneous devices to achieve the fastest possible training speed. Today, device placement is typically left to human experts, but determining an optimal device placement can be very challenging, particularly as neural networks grow in complexity (e.g., networks with many interconnected branches) or approach device memory limits. In shared clusters, the task is made even more challenging due to interference and variability caused by other applications. Motivated by these challenges, a recent line of work [10, 9, 5] has proposed an automated approach to device placement based on reinforcement learning (RL). In this approach, a neural network policy is trained to optimize the device placement through repeated trials. For example, Mirhoseini et al. [10] use a recurrent neural network (RNN) to process a computation graph and predict a placement for each operation. They show that the RNN, trained to minimize computation time, produces device placements that outperform both human experts and graph partitioning heuristics such as Scotch [15]. Subsequent work [9] improved the scalability of this approach with a hierarchical model and explored more sophisticated policy optimization techniques [5].

Although RL-based device placement is promising, existing approaches have a key drawback: they require a significant amount of re-training to find a good placement for each computation graph. For example, Mirhoseini et al. [10] report 12 to 27 hours of training time to find the best device placement for several vision and natural language models; more recently, the same authors report 12.5 GPU-hours of training to find a placement for a neural machine translation (NMT) model [9]. While this overhead may be acceptable in some scenarios (e.g., training a stable model on large amounts of data), it is undesirable in many cases.

Preprint. Under review.
For example, high device placement overhead is problematic during model development, which can require many ad-hoc model explorations. Also, in a shared, non-stationary environment, it is important to make a placement decision quickly, before the underlying environment changes. Existing methods have high overhead because they do not learn generalizable device placement policies. Instead, they optimize the device placement for a single computation graph. Indeed, the training process in these methods can be thought of as a search for a good placement for one computation graph, rather than a search for a good placement policy for a class of computation graphs. Therefore, for a new computation graph, these methods must train the policy network from scratch. Nothing learned from previous graphs carries over to new graphs, neither to improve placement decisions nor to speed up the search for a good placement.

In this paper, we present Placeto, a reinforcement learning (RL) approach to learn an efficient device placement algorithm for a given family of computation graphs. Unlike prior work, Placeto is able to transfer a learned placement policy to unseen computation graphs from the same family without requiring any retraining. Placeto incorporates two key ideas to improve training efficiency and generalizability. First, it models the device placement task as finding a sequence of iterative placement improvements. Specifically, Placeto's policy network takes as input a current placement for a computation graph, along with one of its nodes, and outputs a device for that node. By applying this policy sequentially to all nodes, Placeto is able to iteratively optimize the placement. This placement improvement policy, operating on an explicitly-provided input placement, is simpler to learn than a policy representation that must output a final placement for the entire graph in one step.
Placeto's second idea is a neural network architecture that uses graph embeddings [3, 4, 7] to encode the computation graph structure in the placement policy. Unlike prior RNN-based approaches, Placeto's neural network policy does not depend on the sequential order of nodes or an arbitrary labeling of the graph (e.g., to encode adjacency information). Instead, it naturally captures graph structure (e.g., parent-child relationships) via iterative message passing computations performed on the graph.

Our experiments show that Placeto learns placement policies that outperform the RNN-based approach on three neural network models: Inception-V3 [19], NASNet [24] and NMT [23]. For example, on the NMT model Placeto finds a placement that runs 16.5% faster than the one found by the RNN-based approach. Moreover, it learns these placement policies substantially faster, with up to 6.1× fewer placement evaluations than the RNN approach. Given any family of graphs, Placeto learns a generalizable placement policy that can then be used to predict optimized placements for unseen graphs from the same family without any re-training. This avoids the large overheads incurred by RNN-based approaches, which must repeat the training from scratch every time a new graph is to be placed.

Concurrently with this work, Paliwal et al. [14] propose using graph embeddings to learn a generalizable policy for device placement and schedule optimization. However, their approach does not involve optimizing placements directly; instead, a complex genetic search algorithm needs to be run for several thousands of iterations every time the placement for a new graph is to be optimized [14]. This incurs the large penalty of evaluating thousands of placements and schedules, rendering the generalizability of the learned policy ineffective.
2 Learning Method

The computation graph of a neural network can be modeled as a graph G(V, E), where V denotes the atomic computational operations (also referred to as "ops") in the neural network, and E is the set of data communication edges. Each op v ∈ V performs a specific computational function (e.g., convolution) on input tensors that it receives from its parent ops. For a set of devices D = {d_1, ..., d_m}, a placement for G is a mapping π : V → D that assigns a device to each op. The goal of device placement is to find a placement π that minimizes ρ(G, π), the duration of G's execution when its ops are placed according to π. To reduce the number of placement actions, we partition ops into predetermined groups and place ops from the same group on the same device, similar to Mirhoseini et al. [9]. For ease of notation, henceforth we will use G(V, E) to denote the graph of op groups. Here V is the set of op groups and E is the set of data communication edges between op groups. An edge is drawn between two op groups if there exists a pair of ops, from the respective op groups, that have an edge between them in the neural network.

Figure 1: MDP structure of Placeto's device placement task. At each step, Placeto updates the placement for a node (shaded) in the computation graph. These incremental improvements amount to the final placement at the end of an MDP episode.

Placeto finds an efficient placement for a given input computation graph by executing an iterative placement improvement policy on the graph. The policy is learned using RL over computation graphs that are structurally similar to the input graph (i.e., coming from the same underlying probability distribution).
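To make the objective concrete, the sketch below is a toy stand-in for ρ(G, π). The paper measures actual execution time (on real hardware or via a simulator); this simple additive model with per-group compute costs and cross-device communication costs is only meant to illustrate what a placement trades off, and all names and numbers here are made up.

```python
def runtime(graph, placement, compute_cost, comm_cost):
    """Toy stand-in for rho(G, pi): total compute time plus a communication
    cost for every edge whose endpoints sit on different devices. (The paper
    measures real or simulated execution time; this additive model is only
    for illustration.)"""
    total = sum(compute_cost[v] for v in graph["nodes"])
    for (u, v) in graph["edges"]:
        if placement[u] != placement[v]:
            total += comm_cost[(u, v)]
    return total

# A 3-op-group graph: group "a" feeds groups "b" and "c".
graph = {"nodes": ["a", "b", "c"], "edges": [("a", "b"), ("a", "c")]}
compute = {"a": 1.0, "b": 2.0, "c": 2.0}
comm = {("a", "b"): 0.5, ("a", "c"): 0.5}

r_single = runtime(graph, {"a": 0, "b": 0, "c": 0}, compute, comm)
r_split = runtime(graph, {"a": 0, "b": 1, "c": 1}, compute, comm)
print(r_single, r_split)  # 5.0 6.0
```

In this additive model the split placement only pays communication cost without any benefit; the point of the real objective is that overlapping computation across devices can outweigh that cost, which a serial cost model like this one cannot capture.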
In the following, we present the key ideas of this learning procedure: the Markov decision process (MDP) formalism in §2.1, graph embedding and the neural network architecture for encoding the placement policy in §2.2, and the training/testing methodology in §2.3. We refer the reader to [18] for a primer on RL.

2.1 MDP Formulation

Let G be a family of computation graphs for which we seek to learn an effective placement policy. We consider an MDP where a state observation s comprises a graph G(V, E) ∈ G with the following features on each node v ∈ V: (1) the estimated run time of v, (2) the total size of tensors output by v, (3) the current device placement of v, (4) a flag indicating whether v has been "visited" before, and (5) a flag indicating whether v is the "current" node for which the placement has to be updated. At the initial state s_0 for a graph G(V, E), the nodes are assigned to devices arbitrarily, the visit flags are all 0, and an arbitrary node is selected as the current node. At step t in the MDP, the agent selects an action to update the placement for the current node v in state s_t. The MDP then transitions to a new state s_{t+1} in which v is marked as visited, and an unvisited node is selected as the new current node. The episode ends in |V| steps, when the placements of all the nodes have been updated. This procedure is illustrated in Figure 1 for an example graph to be placed over two devices.

We consider two approaches for assigning rewards in the MDP: (1) assigning a zero reward at each intermediate step in the MDP, and a reward equal to the negative run time of the final placement at the terminal step; (2) assigning an intermediate reward of r_t = ρ(s_{t+1}) − ρ(s_t) at the t-th round for each t = 0, 1, ..., |V| − 1, where ρ(s) is the execution time of the placement in state s.
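One MDP episode, together with both reward schemes, can be sketched as follows. This is a minimal illustration with hypothetical names: the real state also carries the runtime and tensor-size features of §2.1, the policy is a trained neural network rather than a lambda, and ρ is measured, not computed from a toy formula.

```python
def run_episode(nodes, policy, runtime_of):
    """One MDP episode: visit every node once, update its placement, and
    record rho(s_t) after each step. `policy` and `runtime_of` are
    illustrative stand-ins for the learned policy and the runtime measurement."""
    state = {v: {"device": 0, "visited": False, "current": False}
             for v in nodes}                      # features (3)-(5) of Sec. 2.1
    runtimes = [runtime_of(state)]                # rho(s_0)
    for v in nodes:                               # episode lasts |V| steps
        state[v]["current"] = True
        state[v]["device"] = policy(state, v)     # action: pick a device for v
        state[v]["visited"] = True
        state[v]["current"] = False
        runtimes.append(runtime_of(state))        # rho(s_{t+1})
    # Reward scheme (2): r_t = rho(s_{t+1}) - rho(s_t) at every step.
    inter = [runtimes[t + 1] - runtimes[t] for t in range(len(nodes))]
    # Reward scheme (1): zero everywhere, negative final runtime at the end.
    term = [0.0] * (len(nodes) - 1) + [-runtimes[-1]]
    return inter, term

# Toy runtime: 4 units, minus 1 for every node moved off device 0.
rt = lambda s: 4.0 - sum(f["device"] for f in s.values())
inter, term = run_episode(["a", "b", "c"], lambda s, v: 1, rt)
print(inter)  # [-1.0, -1.0, -1.0]
print(term)   # [0.0, 0.0, -1.0]
```

Note that both reward lists sum to the same return (minus the initial runtime offset for scheme (2)); the difference is only in how credit is distributed across steps.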
Intermediate rewards can help improve credit assignment in long training episodes and reduce the variance of the policy gradient estimates [2, 12, 18]. However, training with intermediate rewards is more expensive, as it must determine the computation time of a placement at each step, as opposed to once per episode. We contrast the benefits of the two reward designs through evaluations in Appendix A.4. To find a valid placement that fits without exceeding the memory limit of the devices, we include a penalty in the reward proportional to the peak memory utilization if it crosses a certain threshold M (details in Appendix A.7).

2.2 Policy Network Architecture

Figure 2: Placeto's RL framework for device placement. The state input to the agent is represented as a DAG with features (such as computation types and current placement) attached to each node. The agent uses a graph neural network to parse the input and a policy network to output a probability distribution over devices for the current node. The incremental reward is the difference between the runtimes of consecutive placement plans.

Placeto learns effective placement policies by directly parametrizing the MDP policy using a neural network, which is then trained using a standard policy-gradient algorithm [22]. At each step t of the MDP, the policy network takes the graph configuration in state s_t as input, and outputs an updated placement for the t-th node. However, to compute this placement action using a neural network, we
need to first encode the graph-structured information of the state as a real-valued vector. Placeto achieves this vectorization via a graph embedding procedure that is implemented using a specialized graph neural network and learned jointly with the policy. Figure 2 summarizes how node placements are updated during each round of an RL episode. Next, we describe Placeto's graph neural network.

Figure 3: Placeto's graph embedding approach. It maps the raw features associated with each op group to the device placement action. (a) Example computation graph of op groups; the shaded node is taking the current placement action. (b) Two-way message passing scheme applied to all nodes in the graph. (c) Partitioning the message-passed op groups. (d) Taking a placement action on two candidate devices for the current op group.

Graph embedding. Recent works [3, 4, 7, 8] have proposed graph embedding techniques that achieve state-of-the-art performance on a variety of graph tasks, such as node classification, link prediction and job scheduling. Moreover, the embeddings produced by these methods can generalize (and scale) to unseen graphs. Inspired by this line of work, in Placeto we present a graph embedding architecture for processing the raw features associated with each node in the computation graph. Our embedding approach is customized for the placement problem and has the following three steps (Figure 3):

1. Computing per-group attributes (Figure 3a).
As raw features for each op group, we use the total execution time of the ops in the group, the total size of their output tensors, a one-hot encoding of the device (e.g., device 1 or device 2) that the group is currently placed on, a binary flag indicating whether the current placement action is for this group, and a binary encoding of whether a placement action has already been made for the group. We collect the runtime of each op on each device from on-device measurements (we refer to Appendix 5 for details).

2. Local neighborhood summarization (Figure 3b). Using the raw features on each node, we perform a sequence of message passing steps [4, 7] to aggregate neighborhood information for each node. Letting x_v denote the features of op group v, the message passing updates take the form x_v ← g(Σ_{u∈ξ(v)} f(x_u)), where ξ(v) is the set of neighbors of v, and f, g are multilayer perceptrons with trainable parameters. We construct two directions of message passing (top-down from the root groups and bottom-up from the leaf groups) with separate parameters. The top-down messages summarize information about the subgraph of nodes that can reach v, while the bottom-up messages do so for the subgraph reachable from v. The parameters in the transformation functions f, g are shared across message passing steps in each direction, among all nodes. We repeat the message passing updates k times to propagate local structural information across the graph, where k is a hyperparameter. As we show in our experiments (§3), reusing the same message passing function everywhere provides a natural way to transfer the learned policy to unseen computation graphs.

3. Pooling summaries (Figures 3c and 3d). After message passing, we aggregate the embeddings computed at each node to create a global summary of the entire graph.
Specifically, for the node v for which a placement decision has to be made, we perform three separate aggregations: on the set S_parents(v) of nodes that can reach v, the set S_children(v) of nodes that are reachable from v, and the set S_parallel(v) of nodes that can neither reach nor be reached from v. On each set S_i(v), we perform the aggregation using h_i(Σ_{u∈S_i(v)} l_i(x_u)), where the x_u are the node embeddings and h_i, l_i are multilayer perceptrons with trainable parameters as above. Finally, node v's embedding and the results of the three aggregations are concatenated as input to the subsequent policy network.

The above three steps define an end-to-end policy mapping from the raw features associated with each op group to the device placement action.

2.3 Training

Placeto is trained using a standard policy-gradient algorithm [22], with a timestep-based baseline [6] (see Appendix A.1 for details). During each training episode, a graph from a set G_T of training graphs is sampled and used for performing the rollout. The neural network design of Placeto's graph embedding procedure and policy network allows the training parameters to be shared across episodes, regardless of the input graph type or size. This allows Placeto to learn placement policies that generalize well to unseen graphs during testing. We present further details on training in §3.

3 Experimental Setup

3.1 Dataset

We use TensorFlow to generate a computation graph given any neural network model, which can then be run to perform one step of stochastic gradient descent on a mini-batch of data.
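Returning to the embedding procedure of §2.2, steps 2 and 3 can be sketched end-to-end in a minimal NumPy illustration. The trainable MLPs f, g, h_i, l_i are stood in for by fixed random linear maps with a ReLU, only the top-down message passing direction is shown, and the graph and feature values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension

def mlp(dim_in, dim_out):
    # Stand-in for a trainable MLP: fixed random linear map + ReLU.
    W = rng.normal(scale=0.5, size=(dim_in, dim_out))
    return lambda x: np.maximum(x @ W, 0.0)

def message_pass(x, nbrs, f, g, k=2):
    """k rounds of x_v <- g(sum_{u in xi(v)} f(x_u)), with f, g shared by
    all nodes. `nbrs[v]` is xi(v) in one direction (parents, for top-down)."""
    for _ in range(k):
        new_x = {}
        for v in x:
            if nbrs[v]:
                new_x[v] = g(sum(f(x[u]) for u in nbrs[v]))
            else:
                new_x[v] = x[v]  # root groups keep their features
        x = new_x
    return x

def summarize(x, v, parents, children, parallel, h, l):
    """Pooled summary h_i(sum_{u in S_i(v)} l_i(x_u)) for each of the three
    node sets around the current node v, concatenated with v's embedding."""
    parts = [x[v]]
    for node_set in (parents, children, parallel):
        s = sum((l(x[u]) for u in node_set), np.zeros(d))
        parts.append(h(s))
    return np.concatenate(parts)

# Toy graph: a -> b -> d and a -> c; the current node is b.
x = {v: rng.normal(size=d) for v in "abcd"}
parents_of = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"]}
emb = message_pass(x, parents_of, mlp(d, d), mlp(d, d), k=2)
summary = summarize(emb, "b", {"a"}, {"d"}, {"c"}, mlp(d, d), mlp(d, d))
print(summary.shape)  # (16,) -> fed to the policy network's softmax head
```

In Placeto the pooled summary feeds a softmax over devices, and all of these maps are trained jointly with the policy gradient; here the fixed random weights only demonstrate the data flow and shapes.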
We evaluate our approach on computation graphs corresponding to the following three popular deep learning models: (1) Inception-V3 [19], a widely used convolutional neural network which has been successfully applied to a large variety of computer vision tasks; (2) NMT [23], a language translation model that uses an LSTM-based encoder-decoder and attention architecture for natural language translation; (3) NASNet [24], a computer vision model designed for image classification. For more detailed descriptions of these models, we refer to Appendix A.2.

We also evaluate on three synthetic datasets, each comprising 32 graphs, spanning a wide range of graph sizes and structures. We refer to these datasets as cifar10, ptb and nmt. Graphs from the cifar10 and ptb datasets are synthesized using an automatic model design approach called ENAS [16]. The nmt dataset is constructed by varying the RNN length and batch size hyperparameters of the NMT model [23]. We randomly split these datasets for training and test purposes. Graphs in the cifar10 and ptb datasets are grouped to have about 128 nodes each, whereas graphs from nmt have 160 nodes. Further details on how these datasets are constructed can be found in Appendix A.3.

3.2 Baselines

We compare Placeto against the following heuristics and baselines from prior work [10, 9, 5]: (1) Single GPU, where all the ops in a model are placed on the same GPU. For graphs that can fit on a single device and don't have significant inherent parallelism in their structure, this baseline can often lead to the fastest placement, as it eliminates any cost of communication between devices. (2) Scotch [15], a graph-partitioning-based static mapper that takes as input the computation graph, the cost associated with each node, and the amount of data associated with connecting edges, and then outputs a placement which minimizes communication costs while keeping the load balanced across devices within a specified tolerance.
(3) Human expert. For NMT models, we place each LSTM layer on a separate device, as recommended by Wu et al. [23]. We also colocate the attention and softmax layers with the final LSTM layer. Similarly, for vision models, we place each parallel branch on a different device. (4) RNN-based approach [9], in which the placement problem is posed as finding a mapping from an input sequence of op-groups to its corresponding sequence of optimized device placements. An RNN model with an encoder-decoder architecture and a content-based attention mechanism is used to learn this mapping. We use an open source implementation from Mirhoseini et al. [9] available as part of the official TensorFlow repository [20]. We use the included hyperparameter settings and tune them extensively as required.

3.3 Training Details

Co-location groups. To decide which sets of ops have to be co-located in an op-group, we follow the same strategy as described by Mirhoseini et al. [10] and use the final grouped graph as input to both Placeto and the RNN-based approach. We found that even after this grouping, there could still be a few operation groups with very small memory and compute costs left over. We eliminate such groups by iteratively merging them with their neighbors, as detailed in Appendix A.6.

Simulator. Since it can take a long time to execute placements on real hardware and measure the elapsed time [9, 10], we built a reliable simulator that can quickly predict the runtime of any given placement for a given device configuration. We discuss details of how the simulator works and its accuracy in Appendix A.5. This simulator is used only for training purposes. All the reported runtime improvements have been obtained by evaluating the learned placements on real hardware, unless explicitly specified otherwise. Further details on the training of Placeto and the RNN-based approach are given in Appendix A.7.
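The actual simulator is described in Appendix A.5; as a rough illustration of what such a runtime predictor does, the sketch below assumes a much simpler model than the real one: ops run serially per device in topological order, and each cross-device edge adds a transfer delay of output size divided by bandwidth. All names and the cost model are illustrative.

```python
def simulate(nodes, edges, compute, out_size, placement, bw=1.0):
    """Toy runtime predictor: visit ops in topological order, run each op
    serially on its assigned device, and charge a transfer delay of
    out_size / bw for every edge that crosses devices. Not the paper's
    actual simulator; an assumed, simplified cost model."""
    device_free = {}   # time at which each device next becomes idle
    finish = {}        # finish time of each op
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    for v in nodes:    # `nodes` is assumed to be topologically sorted
        ready = 0.0
        for u in preds[v]:
            xfer = out_size[u] / bw if placement[u] != placement[v] else 0.0
            ready = max(ready, finish[u] + xfer)
        start = max(ready, device_free.get(placement[v], 0.0))
        finish[v] = start + compute[v]
        device_free[placement[v]] = finish[v]
    return max(finish.values())

# Diamond graph: a -> {b, c} -> d, with two equally heavy branches.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
compute = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 1.0}
size = {v: 1.0 for v in nodes}

r_serial = simulate(nodes, edges, compute, size, {v: 0 for v in nodes})
r_parallel = simulate(nodes, edges, compute, size,
                      {"a": 0, "b": 0, "c": 1, "d": 0}, bw=10.0)
print(r_serial)    # 6.0
print(r_parallel)  # ~4.2: running b and c in parallel beats the serial plan
```

Unlike the additive toy cost in §2, this model captures the central trade-off of device placement: splitting the two branches across devices pays a transfer delay but overlaps their computation, so the diamond finishes earlier.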
4 Results

In this section, we first evaluate the performance of Placeto and compare it with the aforementioned baselines (§4.1). Then we evaluate Placeto's generalizability compared to the RNN-based approach (§4.2). Finally, we provide empirical validation for Placeto's design choices (§4.3).

4.1 Performance

Table 1 summarizes the performance of Placeto and the baseline schemes for the Inception-V3, NMT and NASNet models. We quantify performance along two axes: (i) the runtime of the best placement found, and (ii) the time taken to find the best placement, measured in terms of the number of placement evaluations required by the RL-based schemes while training. For all considered graphs, Placeto is able to rival or outperform the best competing scheme. Placeto also finds optimized placements much faster than the RNN-based approach. For Inception-V3 on 2 GPUs, Placeto is able to find a placement that is 7.8% faster than the expert placement. Additionally, it requires about 4.8× fewer samples than the RNN-based approach. Similarly, for the NASNet model Placeto outperforms the RNN-based approach using up to 4.7× fewer episodes. For the NMT model with 2 GPUs, Placeto is able to optimize placements to the same extent as the RNN-based scheme, while using 3.5× fewer samples. For NMT distributed across 4 GPUs, Placeto finds a non-trivial placement that is 16.5% faster than the existing baselines. We visualize this placement in Figure 4. The expert placement heuristic for NMT fails to meet the memory constraints of the GPU devices. This is because, in an attempt to maximize parallelism, it places each layer on a different GPU, requiring the outputs of the i-th layer to be copied over to the GPU hosting the (i+1)-th layer. These copies have to be retained until they can be fed in as inputs to the co-located gradient operations during the back-propagation phase. This results in a large memory footprint, which ultimately leads to an OOM error.
On the other hand, Placeto learns to exploit parallelism and minimize the inter-device communication overheads while remaining within the memory constraints of all the devices. The above results show the advantage of Placeto's simpler policy representation: it is easier to learn a policy that incrementally improves placements than a policy that decides placements for all nodes in one shot.

4.2 Generalizability

We evaluate the generalizability of the learning-based schemes by training them over a family of graphs, and using the learned policies to predict effective placements for unseen graphs from the same family. If the placements predicted by a policy are as good as the placements found by separate optimizations over the individual test graphs, we conclude that the placement scheme generalizes well. Such a policy can then be applied to a wide variety of structurally-similar graphs without requiring re-training. We consider three families of graphs (the nmt, ptb and cifar10 datasets) for this experiment.

                                    Placement runtime (sec)          Training time (# placements sampled)   Improvement
Model         CPU only  Single GPU  #GPUs  Expert  Scotch  Placeto  RNN-based  Placeto  RNN-based  Runtime reduction  Speedup factor
Inception-V3  12.54     1.56        2      1.28    1.54    1.18     1.17       1.6 K    7.8 K      -0.85%             4.8×
                                    4      1.15    1.74    1.13     1.19       5.8 K    35.8 K     5%                 6.1×
NMT           33.5      OOM         2      OOM     OOM     2.32     2.35       20.4 K   73 K       1.3%               3.5×
                                    4      OOM     OOM     2.63     3.15       94 K     51.7 K     16.5%              0.55×
NASNet        37.5      1.28        2      0.86    1.28    0.86     0.89       3.5 K    16.3 K     3.4%               4.7×
                                    4      0.84    1.22    0.74     0.76       29 K     37 K       2.6%               1.3×

Table 1: Running times of placements found by Placeto compared with the RNN-based approach [10], Scotch and the human-expert baseline. The number of measurements needed to find the best placements for Placeto and the RNN-based approach are also shown (K stands for kilo). Reported runtimes are measured on real hardware. Runtime reductions and speedup factors are calculated with respect to the RNN-based approach. Lower runtimes and lower training times are better.
OOM: Out of Memory. For the NMT model, the number of LSTM layers is chosen based on the number of GPUs.

Figure 4: Optimized placement across 4 GPUs for a 4-layer NMT model with attention, found by Placeto. The top LSTM layers correspond to the encoder and the bottom layers to the decoder. All the layers are unrolled to a maximum sequence length of 32. Each color represents a different GPU. This non-trivial placement meets the memory constraints of the GPUs, unlike the expert-based placement and the Scotch heuristic, which result in an Out of Memory (OOM) error. It also runs 16.5% faster than the one found by the RNN-based approach.

For each test graph in a dataset, we compare placements generated by the following schemes: (1) Placeto Zero-Shot. A Placeto policy trained over graphs from the dataset and used to predict placements for the test graph without any further re-training. (2) Placeto Optimized. A Placeto policy trained specifically over the test graph to find an effective placement. (3) Random. A simple strawman policy that generates a placement for each node by sampling from a uniform random distribution. We define RNN Zero-Shot and RNN Optimized in a similar manner for the RNN-based approach.

Figure 5 shows CDFs of the runtimes of the placements generated by the above-defined schemes for test graphs from the nmt, ptb and cifar10 datasets. We see that the runtimes of the placements generated by Placeto Zero-Shot are very close to those generated by Placeto Optimized. Due to Placeto's generalizability-first design, Placeto Zero-Shot avoids the significant overhead incurred by the Placeto Optimized and RNN Optimized approaches, which search through several thousands of placements before finding a good one. Figure 5 also shows that RNN Zero-Shot performs significantly worse than RNN Optimized. In fact, its performance is very similar to that of Random.
When trained on a graph, the RNN-based approach learns a policy to search for an effective placement for that graph. However, this learned search strategy is closely tied to the assignment of node indices and the traversal order of the nodes in the graph, which are arbitrary and have a meaning only within the context of that specific graph. As a result, the learned policy cannot be applied to graphs with a different structure, or even to the same graph under a different assignment of node indices or traversal order.

4.3 Placeto Deep Dive

In this section, we evaluate how the node traversal order of a graph during training affects the policy learned by the different learning schemes. We also present an ablation study of Placeto's policy network architecture.

Figure 5: CDFs of the runtimes of placements found by the different schemes for test graphs from the (a), (d) nmt, (b), (e) ptb and (c), (f) cifar10 datasets. The top row of figures ((a), (b), (c)) corresponds to Placeto and the bottom row ((d), (e), (f)) to the RNN-based approach. Placeto Zero-Shot performs almost on par with fully optimized schemes like Placeto Optimized and RNN Optimized even without any re-training. In contrast, RNN Zero-Shot performs much worse and only slightly better than the randomly initialized policy used in the Random scheme.

In Appendix A.4 we conduct a similar study on the benefits of providing intermediate rewards in Placeto during training.

Node traversal order. Unlike the RNN-based approach, Placeto's use of a graph neural network eliminates the need to assign arbitrary indices to nodes while embedding the graph features. This aids Placeto's generalizability, and allows it to learn effective policies that are not tied to the specific node traversal orders seen during training.
To verify this claim, we train Placeto and the RNN-based approach on the Inception-V3 model, following one of 64 fixed node traversal orders at each episode. We then use the learned policies to predict placements under 64 unseen random node traversal orders for the same model. With Placeto, we observe that the predicted placements have runtimes within 5% of that of the optimized placement on average, with a difference of about 10% between the fastest and slowest placements. However, the RNN-based approach predicts placements that are about 30% worse on average.

Alternative policy architectures. To highlight the role of Placeto's graph neural network architecture (§2.2), we consider the following two alternative policy architectures and compare their generalizability against Placeto's on the nmt dataset. (1) Simple aggregator, in which a feed-forward network is used to aggregate all the node features of the input graph, which is then fed to another feed-forward network with softmax output units for predicting a placement. This simple aggregator performs very poorly, with its predicted placements on the test dataset about 20% worse on average compared to Placeto's. (2) Simple partitioner, in which the node features corresponding to the parent, child and parallel nodes of the node for which a decision is to be made are aggregated independently by three different feed-forward networks. Their outputs are then fed to a separate feed-forward network with softmax output units, as in the simple aggregator. Note that this is similar to Placeto's policy architecture (§2.2), except for the local neighborhood summarization step (i.e., step 2 in §2.2). This results in the simple partitioner predicting placements that run 13% slower on average compared to Placeto's.
Thus, local neighborhood aggregation and pooling summaries from parent, child and parallel nodes are both essential steps for transforming raw node features into generalizable embeddings in Placeto.

5 Conclusion

We presented Placeto, an RL-based approach for finding device placements that minimize the training time of deep-learning models. By structuring the policy decisions as incremental placement-improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs.

References

[1] EC2 instance types. https://aws.amazon.com/ec2/instance-types/, 2018. Accessed: 2018-10-19.
[2] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
[3] P. W. Battaglia et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[5] Y. Gao, L. Chen, and B. Li. Spotlight: Optimizing device placement for training deep neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1676–1684, Stockholm, Sweden, 2018. PMLR.
[6] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
[7] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[8] H. Mao, M. Schwarzkopf, S. B.
Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning scheduling algorithms for data processing clusters. arXiv preprint, 2018.
[9] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean. A hierarchical model for device placement. In International Conference on Learning Representations, 2018.
[10] A. Mirhoseini, H. Pham, Q. Le, M. Norouzi, S. Bengio, B. Steiner, Y. Zhou, N. Kumar, R. Larsen, and J. Dean. Device placement optimization with reinforcement learning. 2017.
[11] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
[12] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[13] ONNX Developers. ONNX model zoo, 2018.
[14] A. Paliwal, F. Gimeno, V. Nair, Y. Li, M. Lubin, P. Kohli, and O. Vinyals. REGAL: Transfer learning for fast optimization of computation graphs. arXiv preprint, 2019.
[15] F. Pellegrini. A parallelisable multi-level banded diffusion scheme for computing balanced partitions with smooth boundaries. In EuroPar, volume 4641 of Lecture Notes in Computer Science, pages 195–204, Rennes, France, Aug. 2007. Springer.
[16] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint, 2018.
[17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[18] R. S. Sutton and A. G. Barto.
Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[20] TensorFlow contributors. TensorFlow official repository, 2017.
[21] Wikipedia contributors. Perplexity — Wikipedia, the free encyclopedia, 2019. [Online; accessed 26-April-2019].
[22] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[23] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv e-prints, 2016.
[24] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint, 2(6), 2017.

Appendices

A Implementation Details

A.1 REINFORCE Algorithm

Placeto is trained using the REINFORCE policy-gradient algorithm [22], in which a Monte-Carlo estimate of the gradient is used to update the policy parameters. During each training episode, a graph $G$ is sampled from the set of training graphs $\mathcal{G}_T$ (see §2.1) and a rollout $(s_t, a_t, r_t)_{t=0}^{N-1}$ is performed on $G$ using the current policy $\pi_\theta$. Here $s_t$, $a_t$ and $r_t$ refer to the state, action and reward at time-step $t$ respectively, and $\theta$ is the parameter vector encoding the policy.
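The rollout data $(s_t, a_t, r_t)$ feeds a Monte-Carlo policy-gradient update. The following minimal sketch uses a hypothetical linear-softmax policy over a small discrete action space; the toy policy, shapes, and `eta` default are our assumptions for illustration, not Placeto's actual graph-neural-network policy.

```python
# Minimal sketch of a Monte-Carlo policy-gradient (REINFORCE) update
# with a hypothetical linear-softmax policy (illustration only).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, baselines, eta=0.01):
    """theta += eta * sum_i grad log pi(a_i|s_i) * (return-to-go_i - b_i)."""
    # Returns-to-go: sum of rewards from step i to the end of the episode.
    returns = np.cumsum(np.asarray(rewards)[::-1])[::-1]
    grad = np.zeros_like(theta)
    for s, a, G, b in zip(states, actions, returns, baselines):
        probs = softmax(theta @ s)       # pi_theta(. | s)
        dlogp = np.outer(-probs, s)      # grad of log pi: -p_k * s per row...
        dlogp[a] += s                    # ...plus s for the action taken
        grad += dlogp * (G - b)
    return theta + eta * grad            # gradient *ascent* on reward
```

After one update with a positive advantage, the probability of the sampled action increases, which is exactly the "good actions become more likely" intuition stated below.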
At the end of each episode, the policy parameter $\theta$ is updated as

$$\theta \leftarrow \theta + \eta \sum_{i=0}^{N-1} \nabla_\theta \log \pi_\theta(a_i \mid s_i) \left( \sum_{i'=i}^{N-1} r_{i'} - b_i \right), \qquad (1)$$

where $b_i$ is a baseline for reducing the variance of the estimate, and $\eta$ is a learning-rate hyperparameter. Placeto uses a time-based baseline in which $b_i$ is computed as the average of the cumulative rewards $\sum_{i'=i}^{N-1} r_{i'}$ at time-step $i$ over multiple independent rollouts of graph $G$ using the current policy $\pi_\theta$. Intuitively, the update rule in Equation (1) shifts $\theta$ so that the probability of making "good" placement actions (i.e., actions for which the cumulative reward is higher than the average reward) is increased, and vice versa. Thus, over the course of training, Placeto gradually learns placement policies that minimize the overall running time of graphs coming from the same distribution as $\mathcal{G}_T$.

A.2 Models

We evaluate our approach on the following popular deep-learning models from computer vision and NLP tasks:

1. Inception-V3 [19] is a widely used convolutional neural network that has been successfully applied to a large variety of computer vision tasks. Its network consists of a chain of blocks, each of which has multiple branches made up of convolutional and pooling operations. While the branches within a block can be executed in parallel, each block has a sequential data dependency on its predecessor. The network's input is a batch of 64 images, each of dimension 299 × 299 × 3. Its computational graph in TensorFlow has 3002 operations.

2. NMT [23] Neural Machine Translation with attention is a language translation model that uses an LSTM-based encoder-decoder architecture to translate a source sequence into a target sequence. When its computational graph is unrolled to handle input sequences of length up to 32, the memory footprint needed to hold the LSTM hidden states can be large, motivating the use of model parallelism.
We consider 2-layer as well as 4-layer versions, depending on the number of GPUs available for placement. Their computational graphs in TensorFlow have 6361 and 10812 operations respectively. We use a batch size of 128.

3. NASNet [24] is a computer vision model designed for image classification. Its network consists of a series of cells, each of which has multiple branches of computation that are finally reduced at the end to form the input to the next cell. Its computational graph consists of 12942 operations. We use a batch size of 64.

Prior works [10, 9, 5] report significant opportunities for runtime improvement for several of the above models when they are placed over multiple GPUs.

A.3 Datasets

We evaluate the generalizability of each placement scheme by measuring how well it transfers a placement policy learned on graphs from a training dataset to unseen graphs from a test dataset. To our knowledge, there is no available compilation of TensorFlow models suitable for use as a training dataset for the device placement problem. For example, one of the most popular TensorFlow model collections, ONNX [13], has only a handful of models, and most of them do not have any inherent model parallelism in their computational graph structure. To overcome this difficulty, we use an automatic model design approach called ENAS [16] to generate a variety of neural network architectures of different shapes and sizes. ENAS uses a reinforcement-learning-based controller to discover neural network architectures by searching for an optimal subgraph within a larger graph. It is trained to maximize expected reward on a validation set. We use the classification accuracy on the CIFAR-10 dataset as the reward signal to the controller, so that over the course of its training it generates several neural network architectures designed to achieve high accuracy on the CIFAR-10 image classification task.
We randomly sample from these architectures to form a family of $N$ TensorFlow graphs, which we refer to as the cifar-10 dataset. Furthermore, for each of these graphs, the batch size is chosen by sampling uniformly from the interval $[bs_{low}, bs_{high}]$, creating a range of memory requirements for the resulting graphs. We use a fraction $f$ of these graphs for training and the remainder for testing.

Similar to the cifar-10 dataset, we use the inverse of the validation perplexity [21] on the Penn Treebank dataset as the reward signal to generate a class of TensorFlow graphs suited to language modeling, which we refer to as the ptb dataset. Furthermore, we also vary the number of unrolled steps $L$ of the recurrent cell by sampling uniformly from $[L_{low}, L_{high}]$.

In addition to the above two datasets created using the ENAS method, we create a third dataset made of graphs based on the NMT model, which we refer to as the nmt dataset. We generate $N$ different variations of the 2-layer NMT model by sampling the number of unrolled steps $L$ from $[L_{low}, L_{high}]$ and the batch size from $[bs_{low}, bs_{high}]$. This creates a range of complex graphs based on the common encoder-decoder-with-attention structure, with a wide range of memory requirements.

For our experiments, we use the following settings: $N = 32$, $f = 0.5$, $bs_{low} = 240$, $bs_{high} = 360$ for the cifar10 dataset; $N = 32$, $f = 0.5$, $bs_{low} = 1536$, $bs_{high} = 3072$, $L_{low} = 25$, $L_{high} = 40$ for the ptb dataset; and $N = 32$, $f = 0.5$, $bs_{low} = 64$, $bs_{high} = 128$, $L_{low} = 16$, $L_{high} = 32$ for the nmt dataset. We visualize some sample graphs from the cifar-10 and ptb datasets in Figures 6 and 7.

A.4 Intermediate Rewards

Placeto's MDP reformulation allows us to provide intermediate reward signals, which are known to help with the temporal credit assignment problem. Figure 8 empirically shows the benefits of providing intermediate rewards as opposed to a single reward at the end.
They lead to faster convergence and lower variance in the cumulative episodic reward terms used by REINFORCE to estimate policy gradients during training. Placeto's policy network learns to incrementally generate the whole placement through iterative improvement steps over the course of the episode, starting from a trivial placement.

A.5 Simulator

Over the course of training, runtimes for thousands of sampled placements need to be determined before a policy can be trained to converge to a good placement. Since it is costly to execute the placements on real hardware and measure the elapsed time for one batch of gradient descent [9, 10], we built a simulator that can quickly predict the runtime of any given placement for a given device configuration.

For any given model to place, our simulator first profiles each operation in its computational graph by measuring the time it takes to run on each of the available devices. We model the communication cost between devices as linearly proportional to the size of the intermediate data flowing across operations. The simulator maintains the following two FIFO queues for each device $d$:

• $Q^{op}_d$: the collection of operations that are ready to run on $d$.
• $Q^{transfer}_d$: the collection of output tensors that are ready to be transferred from $d$ to some other device.

We deem an operation runnable on a device $d$ only after all of its parent operations have finished executing and their corresponding output tensors have been transferred to $d$.

Our simulator uses an event-based design to generate an execution timeline. Each event has a timestamp at which it is triggered, and includes some metadata for easy referencing. We define the following event types:

• Op-done: indicates that an operation has finished executing. Its timestamp is determined from the initial profiling step, which measures how long the operation takes to run on its assigned device.
• Transfer-done: indicates the completion of an inter-device transfer of an output tensor. Its timestamp is determined using an estimated communication bandwidth $b$ between devices and the size of the tensor.
• Wakeup: signals the wakeup of a device (or a bus) that was marked as free after its operation queue (or transfer queue) became empty and there was no pending work for it to do.

We now define event handlers for each of the above event types.

Figure 6: Sample graphs from the cifar-10 dataset. Each color represents a different GPU in the optimized placement. The size of a node indicates its compute cost, and the edge thickness visualizes the communication cost. The graphs exhibit a wide range of structure and connectivity.

Figure 7: A few of the recurrent cells used to generate the sequence-based models in the ptb dataset. Each color indicates a different operation from the following: Identity (Id), Sigmoid (Sig.), Tanh (tanh), ReLU (ReLU). $x[t]$ is the input to the cell and the final add operation is its output.

Figure 8: (a) Cumulative episodic rewards when Placeto is trained with and without intermediate rewards on the NMT model. Providing intermediate rewards within an episode, as opposed to a single reward at the end, leads to lower variance in the runtime. (b) Runtime improvement observed in the final episode, starting from the initial trivial placement.

Op-done event-handler: Whenever an operation $o$ has completed running on device $d$, the simulator performs the following actions in order:

• For every child operation $o'$ placed on device $d'$:
  – Enqueue the output tensor $t_o$ of $o$ into $Q^{transfer}_d$ if $d \neq d'$.
  – Check whether $o'$ is runnable. If so, enqueue it into $Q^{op}_{d'}$.
  – Add the appropriate Wakeup events after the above two steps in case $d'$ happens to be free.
• If $Q^{op}_d$ is empty, mark device $d$ as free. Otherwise, pick the next operation from this queue and create its corresponding Op-done event.
Transfer-done event-handler: Whenever a tensor $t$ has been transferred from device $d$ to $d'$, the simulator performs the following actions in order:

• Check whether the operation $o$ on device $d'$ that takes $t$ as input is runnable. If so, enqueue it into $Q^{op}_{d'}$. Add a Wakeup event for device $d'$ if necessary.
• If $Q^{transfer}_d$ is empty, mark the bus corresponding to device $d$ as free. Otherwise, pick the next transfer operation and create its corresponding Transfer-done event.

Wakeup event-handler: If a device or its corresponding bus receives a wakeup signal, its corresponding queue must be non-empty. Pick the first element from this queue and create a new Op-done or Transfer-done event based on it.

We initialize the queues with operations that have no data dependencies and create their corresponding Op-done events. The simulation ends when there are no more events left to process and all operations have finished executing. The timestamp of the last Op-done event is taken as the simulated runtime. During simulation, we keep track of the start and end timestamps of each operation. Along with the tensor sizes, these are used to predict the peak memory utilization of each device.

Note that we have modeled our simulator on the real execution engine used in TensorFlow. We have validated that the following key aspects of our design match TensorFlow's implementation: (a) per-device FIFO queues holding runnable operations; (b) communication overlapping with compute; (c) no more than one operation running on a device at a time. As a result, an RL-based scheme trained with the simulator exhibits nearly identical runtimes compared to training directly on the actual system. We demonstrate this by comparing the runtimes in the learning curves of the RNN-based approach [10] on real hardware and on our simulator (Figure 9).
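The event loop described above can be sketched as follows. This is a simplified reimplementation for illustration, not the authors' code: it models only compute (Op-done) events over per-device FIFO queues, and omits transfer queues, bandwidth modeling, and Wakeup bookkeeping.

```python
# Illustrative sketch of the event-driven simulator core: a priority
# queue of timestamped Op-done events drives per-device FIFO queues of
# runnable operations. Transfers are omitted for brevity.
import heapq
from collections import deque

def simulate(ops, placement, compute_cost):
    """ops: {op: [children]}, placement: {op: device},
    compute_cost: {op: seconds}. Returns the simulated makespan."""
    parents_left = {o: 0 for o in ops}
    for o, children in ops.items():
        for c in children:
            parents_left[c] += 1

    ready = {}    # per-device FIFO queue of runnable ops
    events = []   # min-heap of (timestamp, op) "Op-done" events
    busy = set()  # devices currently running an operation

    def try_start(dev, now):
        if dev not in busy and ready.get(dev):
            op = ready[dev].popleft()
            busy.add(dev)
            heapq.heappush(events, (now + compute_cost[op], op))

    for o in ops:  # seed the queues with ops that have no dependencies
        if parents_left[o] == 0:
            ready.setdefault(placement[o], deque()).append(o)
    for dev in list(ready):
        try_start(dev, 0.0)

    makespan = 0.0
    while events:
        t, op = heapq.heappop(events)  # Op-done event-handler
        makespan = t
        busy.discard(placement[op])
        for child in ops[op]:          # children may become runnable
            parents_left[child] -= 1
            if parents_left[child] == 0:
                ready.setdefault(placement[child], deque()).append(child)
        for dev in list(ready):        # start next ops on free devices
            try_start(dev, t)
    return makespan
```

For example, two parallel 1 s and 2 s operations on different GPUs feeding a final 1 s operation finish in 3 s: the join operation cannot start until its slower parent completes.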
A.6 Merge-and-Colocate heuristic

Merge-and-Colocate is a simple heuristic designed to reduce the size of a graph by colocating small operations with their neighbors. Given any input graph $G_i$, the Merge-and-Colocate heuristic first merges the node with the lowest cost into its neighbor. If the node has no neighbors, its predecessor is used instead. This step is repeated until the graph size reaches a desired value $N$, or alternatively until there are no more nodes with cost below a certain threshold $C$. The merged nodes are then colocated on the same device. For our experiments, we use the size of an operation's output tensor as the cost metric for the above procedure.

Figure 9: The RNN-based approach exhibits a nearly identical learning curve whether the reward signal comes from the simulator or directly from measurements on real machines.

A.7 Training details

Here we describe the training details for Placeto and the RNN-based model. Unless otherwise specified, we use the same methodology for setting the hyperparameters of both approaches.

Entropy. We add an entropy term to the loss function to encourage exploration. We tune the entropy factor separately for Placeto and the RNN-based model so that exploration starts off high and decays gradually to a low value towards the final training episodes.

Optimization. We tune the initial learning rate for each of the models on which we report results. For each model, we decay the learning rate linearly to smooth convergence. We use the Adam optimizer to update the policy weights.

Workers. We use 8 worker threads and a master coordinator, which also serves as a parameter server. At the beginning of every episode, each worker synchronizes its policy weights with the parameter server.
Each worker then independently performs an episode rollout and collects the rewards for its sampled placement. It then computes the gradients of the REINFORCE loss function with respect to all the policy parameters. All workers send their respective gradients to the parameter server, which sums them up and updates the parameters to be used for the next episode.

Baselines. For Placeto, we use a separate moving-average baseline for each stage of the episode. The baseline for time step $t$ is the average of the cumulative rewards at step $t$ over the past $k$ episodes, where $k$ is a tunable hyperparameter. For the RNN-based approach, we use the baseline described in Mirhoseini et al. [10].

Neural network architecture. For Placeto, we use single-layer feed-forward networks during the message passing and aggregation steps, with the same number of hidden units as the input dimension. We feed the outputs of the aggregator into a two-layer feed-forward neural network with a softmax output layer. We use ReLU as the default activation function. For the RNN-based approach, we use a bi-directional RNN with a hidden size of 512 units.

Training details. We use distributed learning with synchronous SGD to train Placeto's policy network: a parameter server coordinates updates across the 8 worker nodes following the rollout-and-aggregate procedure described above. To train a policy using multiple graphs, a different graph is used by each worker.
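One round of the synchronous parameter-server update described above can be sketched as follows. This is schematic only: real workers run in parallel threads, and each gradient function here stands in for a full episode rollout plus a REINFORCE gradient computation; the names are our own.

```python
# Schematic sketch of one synchronous parameter-server round:
# broadcast theta, gather one gradient per worker, sum, update once.
import numpy as np

def sync_sgd_round(theta, worker_grad_fns, lr=1e-3):
    """Apply a single update from the summed per-worker gradients."""
    grads = [grad_fn(theta.copy()) for grad_fn in worker_grad_fns]
    return theta + lr * np.sum(grads, axis=0)  # ascent on episodic reward
```

Because every worker synchronizes its weights at the start of each episode, all gradients in a round are computed against the same parameter vector, which is what makes the update synchronous.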
More details about the training process, including optimization, RL exploration, the reward baseline used, and the neural network architecture, are provided in Appendix A.7.

Reward. Given any placement $p$ with runtime $r$ (in seconds) and maximum peak memory utilization $m$ (in GB) across all devices, we define the memory-penalized runtime $R(p)$ as

$$R(p) = \begin{cases} r & \text{if } m \le M, \\ r + c \cdot (m - M) & \text{otherwise,} \end{cases}$$

where $M$ is the total available memory on the device with maximum peak memory utilization and $c$ is a scale factor. For our experiments, we use $c = 2$.

To find a valid placement that fits without exceeding the memory limit on devices, we include a penalty proportional to the peak memory utilization whenever it crosses the threshold $M$. This threshold can be used to control the memory footprint of placements under execution environments with high memory pressure (e.g., GPUs). For instance, we use $M = 10.7$ GB in our experiments to find placements that fit on Tesla K80 GPUs, which have about 11 GB of memory available for use.

For an MDP episode of length $T$, we propose the following two ways to assign reward:

• Terminal reward: a non-zero reward is given only at the end of the episode; that is, $r_1 = 0, r_2 = 0, \ldots, r_T = -R(p_T)$. This requires evaluating only one placement per episode, but leads to high variance in the policy-gradient estimates due to a lack of temporal credit assignment.
• Intermediate rewards: the change in runtime between successive time steps of an episode is used as an intermediate reward signal; that is, $r_1 = R(p_1) - R(p_0), r_2 = R(p_2) - R(p_1), \ldots, r_T = R(p_T) - R(p_{T-1})$. Although this requires evaluating $T + 1$ placements per episode, intermediate rewards result in better convergence properties in RL [22].
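The penalized-runtime definition translates directly to code; the parameter defaults below follow the values stated in the text ($M = 10.7$ GB, $c = 2$).

```python
# Memory-penalized runtime R(p): runtime r (seconds) plus a penalty
# proportional to the peak-memory overshoot m - M (GB). Defaults follow
# the experimental settings in the text.
def penalized_runtime(r, m, M=10.7, c=2.0):
    """Return r if peak memory m fits within M, else r + c * (m - M)."""
    return r if m <= M else r + c * (m - M)
```

For example, a 1.5 s placement peaking at 9 GB is unpenalized (`penalized_runtime(1.5, 9.0)` returns 1.5), while one peaking at 12.7 GB incurs a 4 s penalty (`penalized_runtime(1.5, 12.7)` returns 5.5), steering the policy toward placements that fit on the K80's memory.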
Devices. We target the following device configuration for optimizing placements: TensorFlow r1.9 running on a p2.8xlarge instance from AWS EC2 [1], equipped with 8 NVIDIA Tesla K80 GPUs and a Xeon E5-2686 Broadwell processor.