Accelerating Data Loading in Deep Neural Network Training

Chih-Chieh Yang, Data Centric Systems, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA (chih.chieh.yang@ibm.com)
Guojing Cong, Data Centric Systems, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA (gcong@us.ibm.com)

Abstract—Data loading can dominate deep neural network training time on large-scale systems. We present a comprehensive study on accelerating data loading performance in large-scale distributed training. We first identify performance and scalability issues in current data loading implementations. We then propose optimizations that utilize CPU resources in the data loader design. We use an analytical model to characterize the impact of data loading on the overall training time and establish the performance trend as we scale up distributed training. Our model suggests that I/O rate limits the scalability of distributed training, which inspires us to design a locality-aware data loading method. By utilizing software caches, our method can drastically reduce the data loading communication volume in comparison with the original data loading implementation. Finally, we evaluate the proposed optimizations with various experiments. We achieved more than 30x speedup in data loading using 256 nodes with 1,024 learners.

Index Terms—machine learning, distributed training, scalability, data loading, data locality

I. INTRODUCTION

Deep Neural Network (DNN) models work incredibly well in real-world scenarios such as image classification, speech recognition, and autonomous driving. However, DNN training can take a long time — days, weeks, or even months — which makes it difficult to optimize and re-train models. Researchers have devoted much effort to speeding up DNN training.
On the hardware side, vendors have incorporated stronger machine learning capabilities into processor architecture designs and introduced special-purpose accelerators for machine learning [1]. On the software side, researchers developed optimized libraries such as cuDNN [2] and MKL-DNN [3]; created easy-to-use frameworks such as Caffe [4], PyTorch [5], and TensorFlow [6]; and invented new learning algorithms that converge faster. They also ran DNN training on large-scale high-performance computing (HPC) systems, which leads to many interesting research problems: for example, scaling up while maintaining the convergence rate, finding a good computation-to-communication ratio, and synchronizing results efficiently. Finding solutions to these problems has reduced training time immensely — take ImageNet-1K [7] training with the ResNet50 [8] model for example: the state-of-the-art training time was reduced from an hour [9] to minutes [10]–[13] within a year.

In large-scale distributed DNN training, we can break down the training time into three major components: computation time, communication time, and data loading time. While the former two draw great attention from researchers, data loading time is often omitted in the literature, since different techniques exist that circumvent the data loading problem. For example, it is possible to cache a dataset entirely in fast local storage such as SSD or DRAM, instead of loading data from a network-based file system that has higher I/O overhead. One can also preprocess a dataset to reduce its size so it fits in local storage, if the original dataset is too large. However, such techniques do not always apply. Fast local storage may not fit the whole dataset even after preprocessing, and preprocessing may take a long time and cause loss of pertinent information.
Considering common usage scenarios in HPC, it is important to design efficient methods to load data from a network-based file system or a data server, so that the data loading time does not become a bottleneck in DNN training. In this work, we propose data loader optimizations and bandwidth requirement optimizations to significantly improve data loading time in large-scale distributed DNN training. We evaluated the proposed methods on PyTorch, and found that our methods can provide more than 30x speedup in data loading using 256 nodes with 1,024 learners. For ImageNet-1K classification, our optimizations give a 92% improvement in per-epoch training cost over the regular distributed training implementation. Our contributions are as follows:
• Data loader optimizations that improve data loading cost;
• A locality-aware data loading method which greatly reduces bandwidth requirement and improves scalability;
• An analytical model that models the cost and establishes the performance trend at different system scales;
• A performance evaluation that showcases the effects of our proposed optimizations.

The rest of the paper is organized as follows. Section II describes necessary background information. In Section III, we present data loading optimizations that better utilize the CPU to reduce overhead. Next, we explain the performance trend when scaling up with an analytical model in Section IV. In Section V, we propose a locality-aware data loading method which greatly reduces the bandwidth requirement of mini-batch SGD. In Section VI, we show improvements brought by our optimizations. In Section VII, we summarize related works. Finally, we draw the conclusions and describe future work in Section VIII.

This paper has been accepted for publication in HiPC 2019.

II. BACKGROUND

A. Mini-batch SGD

Mini-batch stochastic gradient descent (SGD) [14] approximates the optimal solution of an objective function. It iteratively feeds a mini-batch (i.e.,
a set of random samples of a dataset) to a neural network model for forward propagation, and computes the loss from the output and the target values. Next, it performs backpropagation to compute gradients, which are then used to update the model weights. The frequency of model updates, which depends on the choice of mini-batch size, can affect the convergence rate. In this context, we use a step to refer to training on a single mini-batch, and an epoch to refer to training on the whole dataset over multiple steps. Depending on the dataset and the model, it takes a varying number of epochs for the training to converge.

Mini-batch SGD can be parallelized in a data-parallel fashion. A typical implementation uses multiple distributed processes (hereafter referred to as learners), each with a copy of the model. The learners perform a step of mini-batch SGD collectively with the following procedure:
1) Each learner acquires the same global mini-batch sequence (a sequence of sample indices instead of the actual samples) that all learners will collectively load.
2) Each learner takes an even-sized disjoint slice of the global mini-batch sequence.
3) Each learner loads the samples of its slice from the data source (e.g., a network file system) to form a local batch.
4) Each learner trains with its local batch independently to compute local gradients.
5) All learners synchronize (i.e., all-reduce) to produce the global gradients of the current step.
6) Each learner updates the model weights with the same global gradients.

The above-mentioned procedure is synchronous mini-batch SGD, as the model is globally synchronized in each step. This data-parallel approach scales well when the mini-batch size is large enough to have a good compute-to-communication ratio. However, two issues exist: (1) in general, using a larger mini-batch makes it harder to generalize; (2) synchronization overhead increases with the number of learners.
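The six-step procedure above can be sketched as a minimal single-process simulation (a toy sketch: the model is a single scalar weight, the learners run in a loop, the all-reduce of step 5 is a plain sum, and the gradient formula is a made-up stand-in for a real model):

```python
import random

def synchronous_sgd_step(weights, dataset, num_learners, batch_size, lr=0.1):
    """One synchronous data-parallel mini-batch SGD step (simulated)."""
    # Step 1: every learner derives the same global mini-batch sequence
    # of sample *indices*; a shared seed stands in for a shared sequence.
    rng = random.Random(42)
    global_indices = rng.sample(range(len(dataset)), batch_size)

    slice_size = batch_size // num_learners
    local_grads = []
    for j in range(num_learners):
        # Step 2: even-sized disjoint slice of the global sequence.
        my_slice = global_indices[j * slice_size:(j + 1) * slice_size]
        # Step 3: load the samples (here just index the in-memory dataset).
        local_batch = [dataset[i] for i in my_slice]
        # Step 4: compute a local gradient; for illustration we use the
        # gradient of the mean squared error of w against the samples.
        grad = sum(weights - x for x in local_batch) / len(local_batch)
        local_grads.append(grad)

    # Step 5: all-reduce (sum, then average) the local gradients.
    global_grad = sum(local_grads) / num_learners
    # Step 6: every learner applies the same update.
    return weights - lr * global_grad
```

Because every learner sees the same global sequence and the same reduced gradient, repeating the step with identical inputs yields an identical update.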
There has been rigorous research work on these two issues, and recent results in [15], [16] show that distributed SGD can scale to extreme-scale systems.

B. Cost and scalability of distributed training

Considering the cost of a single step in synchronous mini-batch SGD training, there are three major components: computation time (forward propagation and backpropagation), communication time (synchronization of gradients), and data loading time.

Forward propagation and backpropagation to produce the local gradients take up most of the computation time in training. The actual cost depends on the complexity of the model used and the local batch size. At a small scale, the computation time usually dominates the overall cost in deep learning. Various techniques can be applied to reduce this cost, such as using an optimized DNN library like cuDNN [2] for Nvidia GPUs, and using lower-precision floating-point arithmetic. In the foreseeable future, new hardware architectural advances will improve the computation time faster than the other components.

In synchronous mini-batch SGD, the synchronization of gradients happens every step, since the model weights must be updated before the next step starts. There have been many recent research works on improving synchronization algorithms, including [11], [13]. The gradients of different model layers can also be synchronized separately [17]. The communication and computation costs are tightly coupled, as there are optimization techniques such as layer-by-layer synchronization. In the rest of this paper, we use training time to refer to the overall cost of computation and communication, and discuss its relationship with data loading time.

Data loading in the machine learning context refers to the actions required to move data samples from a storage location to form a batch in the memory co-located with the compute units for training.
The I/O cost (typically read-only) of moving data samples depends on the bandwidth of the storage system. Beyond the I/O cost, making data usable in training often requires preprocessing or data augmentation, depending on the training requirements. Take an image classification task for example: one needs to decompress the image files, randomly clip and resize the images, and perform other image transformations. These operations can be time-consuming.

To understand the overall cost in deep learning applications, let us first consider the single-learner case. A learner waits if data is not prepared in time, as the training progress depends on input data. If a learner performs data loading and computation sequentially, there will be gaps between computation tasks caused by data loading. In comparison, a common practice is to use prefetching in a background process (or thread) to overlap data loading with training. The data loading overhead can then be partially or completely hidden.

Now, let us consider scaling up distributed training on a fixed-size input dataset. Since our training is data-parallel, when we get more computing resources the training time decreases, as the computation can easily be parallelized. The data loading cost also decreases initially, because more processes participate in preprocessing the same amount of data, and more nodes can load data simultaneously, which increases the effective bandwidth. However, the bandwidth of the storage system is eventually upper-bounded.

Figure 1 shows the data loading scalability problem. On LLNL's Lassen system, we used distributed processes to train ResNet50 with the ImageNet-1K dataset. There were four learners per node. The batches were globally and randomly shuffled, and the local batch size of a learner was fixed at 128. The global batch size increases with the number of nodes.
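The prefetching pattern described above, a background worker keeping a bounded queue of ready batches so the trainer rarely waits, can be sketched with the standard library (a toy sketch; `load_batch` and `train_step` are hypothetical stand-ins for real I/O and computation):

```python
import queue
import threading

def prefetching_loop(load_batch, train_step, num_batches, depth=4):
    """Overlap data loading with training using a background thread."""
    batches = queue.Queue(maxsize=depth)  # bounded prefetch queue

    def worker():
        for step in range(num_batches):
            batches.put(load_batch(step))  # blocks when the queue is full
        batches.put(None)                  # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()

    results = []
    while True:
        batch = batches.get()  # waits only if loading falls behind
        if batch is None:
            break
        results.append(train_step(batch))
    return results
```

The bounded queue is what hides the overhead: while `train_step` runs on batch t, the worker is already loading batch t+1.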
We measured the average training time per epoch, shown as the orange bars, and the average waiting time for data to be ready for training, shown as the blue bars. The sum of the two is the average cost of an epoch.

Fig. 1. Average epoch time to train ResNet50 with the ImageNet-1K dataset at different scales on LLNL Lassen. The cost stopped decreasing when the data loading overhead stopped scaling.

Since data loading is overlapped with training, the time spent waiting for data appears only when the data loading overhead is not completely hidden. We can see that for 2, 4, and 8 nodes, the waiting for data was minimal and the performance scaled well. However, while the training time kept decreasing with more participating nodes, the data loading cost could no longer be fully hidden. The waiting time was significant starting from the 16-node case, and eventually dominated the cost as we added more nodes. This is because even though the load volume per node decreased while scaling up, the overall data supply rate could not keep up with the consumption rate; as a result, the data loading time stopped decreasing, and the cost stopped scaling down.

These observations motivated us to attack the data loading problem from two different angles: (1) reducing the data loading cost, resulting in an overall improvement for all cases, and (2) reducing the data loading volume, so that the storage system's limited capability becomes less of a problem. We address the former aspect in Section III and the latter in Section V.

III. DATA LOADER OPTIMIZATIONS

To optimize data loading cost, we need to identify the overhead of data loading at a finer granularity. An illustrative typical learner execution timeline, similar to the visualization of profiling tools such as nvprof, is shown in Figure 2. Each timeline represents a computing resource.
The colored bars in a timeline denote tasks being performed and the white space represents idling. The main process (the middle timeline) drives the training progress and interacts with the data loader workers to issue batch-loading requests and retrieve data; it also interacts with the GPU to perform computations. As illustrated, if we consider only the training time meaningful work, there can be overhead due to data loading. In the following subsections, we discuss different optimization strategies that address different overheads.

A. Multiprocessing

The time to load a batch can be significant, as illustrated by the green bars in Figure 2. While we can use a background worker to prefetch batches and hide some overhead, adding more workers can overlap the loading of different batches and improve performance further. The PyTorch data loader implementation can spawn concurrent background worker processes to load multiple batches in parallel and maximize data loading throughput. The main process communicates with workers through multiprocessing.Queue instances. The main process prefetches data by submitting more batch-loading requests than its immediate demand. When using more workers, the throughput increases because workers simultaneously load samples from the data source, increasing the effective bandwidth. It also parallelizes the preprocessing of batches.

B. Multithreading

While multiprocessing loads multiple batches in parallel, there is untapped parallelism within the loading of a single batch. It is often the case that the preprocessing of individual samples can be done independently in parallel. Thus, we can use multithreading to parallelize sample preprocessing within a batch, so that the loading time reduces further. Figure 3 illustrates the effects of multiprocessing and multithreading.
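Within a worker, the per-sample parallelism inside one batch can be fanned out to a thread pool along these lines (a minimal sketch; `decode_and_transform` and `fetch` are hypothetical stand-ins for a real preprocessing pipeline and sample reader):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_and_transform(sample):
    # Hypothetical stand-in for real preprocessing (file read, JPEG
    # decode, random crop/resize). Such steps spend their time in
    # native code that releases the GIL, which is what makes
    # threading effective here.
    return sample * 2

def load_batch_threaded(pool, sample_indices, fetch):
    """Preprocess all samples of one batch concurrently."""
    samples = pool.map(lambda i: decode_and_transform(fetch(i)),
                       sample_indices)
    return list(samples)

pool = ThreadPoolExecutor(max_workers=8)
```

The pool is created once per worker and reused across batches, so thread start-up cost is paid only once.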
Multiprocessing overlaps batch loading across processes, while multithreading within a worker shortens the loading time per batch by preprocessing samples in parallel. While multithreaded data loading exists in other deep learning frameworks, in PyTorch we have to modify the data loader implementation to create a ThreadPoolExecutor instance along with each data loader worker. Instead of loading and preprocessing the samples one after another sequentially in a single thread, we use ThreadPoolExecutor.map() to load the samples in a batch concurrently. Note that, due to the Python global interpreter lock (GIL) issue [18], multithreading works well only if the preprocessing pipeline stages call native library routines and release the GIL correctly. In our ImageNet-1K training experiments, system calls such as file I/O and the image transformations do release the GIL, and we see significant performance improvement with multithreading (see Section VI-A).

Fig. 2. Illustrative execution timeline of a learner.

Fig. 3. Parallelizing data loading. Multiprocessing allows overlapping of batch loading while multithreading further reduces batch loading time by utilizing parallelism within a batch.

C. Caching

The data access pattern of mini-batch SGD is repetitive and random. The learners collectively load the same training dataset with a randomized sequence every epoch. Since the samples are reused, there is temporal locality that we can utilize to improve performance. We can allocate a software cache, either in memory or in high-speed local storage such as an SSD of a compute node, to store samples that have been loaded in earlier epochs. The cached samples are then used again in subsequent epochs. With a cache hit, a learner loads a sample with a much shorter latency. Caching also reduces the number of accesses to the storage system, making the I/O bandwidth less likely to be saturated.
While caching in memory grants the best data access time, training datasets that are too large to fit in the local DRAM can be cached in SSDs. For very large datasets that do not fit in the local cache, caching a partial subset locally can still improve performance, although the improvement can be limited. For example, for a compute node that caches 10% of the training dataset, the cache hit rate is 0.1; in other words, 90% of the samples are still loaded from the storage system.

To avoid being limited by the local cache size, all the participating compute nodes can share their local caches with each other to form an aggregated cache that is many times larger than the individual caches, similar to the high-speed parallel data staging method mentioned in [16]. With the aggregated cache, compute nodes may cache disjoint partitions of a large dataset. We refer to this technique as distributed caching.

Distributed caching changes data loading in mini-batch SGD. During training, a sample load can be a local cache hit, a remote cache hit, or a cache miss served by the storage system. The local cache hit rate is likely to be small, assuming the local cache only holds a small subset of the whole training dataset. But the remote cache hit rate can be very high if the aggregated cache holds most of the dataset. The cache miss rate can be zero if the whole training dataset is collectively cached. The technique bears many similarities to the hierarchical hardware caches in modern processor architectures.

With distributed caching, learners can exchange cached samples to create their local batches (i.e., slices of a mini-batch sequence). The exchanges utilize the high-speed network among compute nodes instead of loading from the storage system. In this way, the storage system bandwidth is no longer a bottleneck.
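The three-way outcome of a sample load under distributed caching can be sketched as follows (a toy model of our own: the cache directory is a dict mapping sample index to the node that caches it, and indices absent from it must come from the storage system):

```python
def classify_loads(batch_indices, my_node, cache_directory):
    """Split one local batch into local hits, remote hits, and misses.

    cache_directory maps sample index -> node id holding that sample.
    """
    local_hits, remote_hits, misses = [], [], []
    for idx in batch_indices:
        owner = cache_directory.get(idx)
        if owner is None:
            misses.append(idx)          # load from the storage system
        elif owner == my_node:
            local_hits.append(idx)      # fast local cache access
        else:
            remote_hits.append(idx)     # fetch over the node interconnect
    return local_hits, remote_hits, misses
```

With disjoint partitions, each index has exactly one owner, so the three lists partition the batch.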
However, the bandwidth among compute nodes, albeit typically larger, can still be a limiting factor, because the data loaded collectively per epoch is still close to the whole dataset size. We model the cost of distributed training in the next section, and in Section V we propose a new data loading method that reduces the bandwidth requirement of distributed mini-batch SGD.

IV. PERFORMANCE MODEL

We present a simple analytical model to help analyze the cost at different system scales. The model contains the following parameters (uppercase letters denote constants and lowercase letters denote variables):

D: the dataset size. To simplify the analysis, we assume that the data samples are the same size, and D equals the total number of samples in the dataset.
p: the number of participating compute nodes.
V: the maximum training rate of a compute node.
R: the I/O rate from the storage system. We let this be the maximum loading rate.
R_c: the I/O rate from the remote caches. We can reasonably assume R_c is much larger than R due to the high-speed interconnect that HPC systems typically have.
U: the maximum preprocessing rate of a compute node.
α: the ratio of the cached subset (in the aggregated cache) to the whole dataset.

From the discussion in Section II-B, we know that when data loading is overlapped with training, the true overall cost is the larger of the training cost and the data loading cost. The training cost and the data loading cost of an epoch can then be derived from:

Training time = D / (p · V)    (1)
Sample I/O time = D / R    (2)
Sample preprocessing time = D / (p · U)    (3)
Data loading time = (2) + (3) = D / R + D / (p · U)    (4)

Now, let us revisit Figure 1. There was only data loading without training in that experiment. In the plot, the data loading cost was high initially when there were few nodes, but it decreased as more nodes were added until it hit a plateau.
In (4), the preprocessing time decreases as p increases. It eventually becomes insignificant relative to the constant sample I/O time, and the data loading cost is at least D/R, which is a constant. This explains the plateau.

We can also determine the true cost by comparing the training cost and the data loading cost. To simplify the analysis, we assume that the preprocessing rate is much higher than the training rate (i.e., U ≫ V), and that p is large enough so that the preprocessing cost is relatively insignificant. Thus, we only concern ourselves with the relationship between (1) and (2). If the training time dominates the true cost:

(1) ≥ (2)
D / (p · V) ≥ D / R
p ≤ R / V    (5)

From (5), we know that for small p, the training time dominates. As p increases, more computing resources can be used to reduce the training cost while the sample I/O time remains constant. The true cost per epoch can thus be expressed as the following:

True cost = D / (p · V) for p ≤ R / V;  D / R for p > R / V    (6)

Now, let us consider distributed caching. We assume that the cost of local cache hits is insignificant. The optimization does not change the training cost or the preprocessing cost but only affects the sample I/O cost:

Sample I/O time = (1 − α) · D / R  [storage system]  +  α · (D / R_c) · ((p − 1) / p)  [remote caches × local cache miss rate]    (7)

We learn two things from (7): (a) the local cache miss rate (p − 1)/p can be very high when p is large, so local cache hits do not help much even though they are fast; (b) both α and R_c have to be large for distributed caching to perform well. While scaling up, it is easy to have a large α, since scaling up increases the amount of aggregated memory available to store a fixed-size dataset. R_c ≫ R is also a reasonable assumption in modern HPC systems. However, R_c does not grow linearly with p, and eventually the performance scaling is limited by the bandwidth among compute nodes.

V.
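The model is easy to evaluate numerically. The sketch below encodes equations (1)–(4), (6), and (7) directly; the parameter values in the test are made up for illustration:

```python
def training_time(D, p, V):
    return D / (p * V)                      # eq. (1)

def data_loading_time(D, p, R, U):
    return D / R + D / (p * U)              # eq. (4) = (2) + (3)

def true_cost(D, p, V, R):
    # eq. (6): training-bound for p <= R/V, I/O-bound beyond that
    return D / (p * V) if p <= R / V else D / R

def cached_io_time(D, p, R, R_c, alpha):
    # eq. (7): storage-system traffic plus remote-cache traffic,
    # the latter discounted by the local cache hit fraction 1/p
    return (1 - alpha) * D / R + alpha * (D / R_c) * (p - 1) / p
```

Plugging in a few node counts reproduces the plateau: once p exceeds R/V, the true cost is pinned at D/R until caching raises the effective I/O rate.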
LOCALITY-AWARE DATA LOADING

The optimizations described in Section III improve the data loading rate of learners. However, as illustrated both in the experiment (Figure 1) and through performance modeling, a distributed learning application scales only as far as the storage system's capability allows. We need a way to reduce the bandwidth requirement of distributed DNN training to overcome this limitation. In the rest of this section, we describe a data loading method which adds a locality-aware flavor to distributed caching. It not only reduces data loading from the storage system, but also minimizes the overall data loading volume to a fraction of the dataset size.

Instead of exchanging samples among learners to form designated mini-batch slices, learners can assemble a mini-batch from their locally cached samples to greatly reduce data loading. A key property of SGD makes this possible: for a given global mini-batch sequence, as long as all the samples in the sequence are used in the training step, the ordering of the samples within the global batch does not affect the training results after synchronization (i.e., the all-reduce operation).

A. Methodology

As in distributed caching, learners must populate their local caches before the locality-aware data loading method can be applied. This can either be a cache-populating phase before training, or caching the samples loaded from the storage system on the fly during the first epoch. As long as the cached subsets are disjoint, how samples are cached is not important, since the mini-batch sequences are randomly shuffled. However, it may be advantageous to populate the caches in a way that sample locations (i.e., the nodes where samples are cached) can be easily determined, to avoid extra bookkeeping. We assume a cache directory exists for tracking sample locations; the directory is duplicated across all learners and stays the same (i.e., no cache replacement) after the caches are populated in the first epoch.
With the locality-aware data loading method, in a training step a learner goes through a given predefined global mini-batch sequence, looks for samples that are cached locally, and trains with them. Given the number of compute nodes p, if the whole dataset is evenly split among all caches, and a global mini-batch sequence is uniform-randomly sampled, a compute node should find close to 1/p of the global mini-batch in its local cache. It can then use these samples in the training step as its local batch. The results of training with this subset of the global mini-batch sequence are the learner's contribution to the training step. In Section V-B, we prove that training with this method produces equivalent results to the regular method.

When learners look for locally cached samples, they may find themselves caching varying-sized subsets of the global mini-batch sequence. In other words, the sample distribution can be imbalanced. Letting learners train with imbalanced local batches, while giving the same training results and potentially performing zero remote data loading, can cause some learners to become stragglers and increase the training time of a step in synchronous SGD. We need load balancing for optimal performance, and we discuss this further in Section V-C.

Assuming the caches have been populated with samples, the procedure of locality-aware data loading is as follows:
1) Get a global mini-batch sequence that is the same across all learners.
2) Determine the sample distribution of the global mini-batch among the distributed learners.
3) Determine data loading, either for samples missing from the aggregated cache or for load balancing.

Fig. 4. Conventional method: learners load even-sized slices.

Fig. 5. Locality-aware method: sample distribution in learner caches.

First, all learners get the same global mini-batch sequence.
Next, each learner independently goes through the global sequence and determines the sample distribution by looking up the cache directory. Then, the learners need to agree on how to load samples locally so that they collectively assemble the global mini-batch. Samples not in the caches are loaded from the storage system. As for load balancing, the learners can exchange data to achieve load balance, or they can load from the storage system. If learners exchange samples for load balancing, this creates point-to-point communication traffic. We provide both theory and simulation results in Section V-C to show that this traffic is a small fraction of the data movement that the regular loading method requires.

Figures 4 and 5 illustrate the differences between the regular mini-batch SGD and the locality-aware method. We assume 3 learners — Red, Green, and Blue — collectively load a global mini-batch of 12 samples. In the regular method (Figure 4), the global mini-batch sequence is split into multiple slices, and each learner loads a slice of 4 samples to train with before synchronization. In the locality-aware method, the learners look for the locally cached samples that belong to the mini-batch. In Figure 5, Red has 2 samples, Green has 6 samples, and Blue has 4 samples in their local caches. A way to balance the load is to let Red load 2 samples from Green before training. The total volume loaded for this global mini-batch is 2 ÷ 12 ≈ 17% of the regular method.

B. Proof of equivalence

Here we give a formal proof that the locality-aware data loading method produces the same results as the regular loading method. Consider the following optimization problem solved by mini-batch SGD:

min_{w ∈ X} F(w)

where F: R^m → R is continuously differentiable but not necessarily convex over X, and X ⊂ R^m is a nonempty open subset. The objective F can be seen as the empirical risk F(w) = n^{-1} Σ_{i=1}^{n} g_i(w, x_i).
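The per-learner distribution and the resulting balancing traffic can be sketched as follows (a toy sketch mirroring the Red/Green/Blue example above; the cache directory maps sample index to the learner that caches it):

```python
def local_batches(global_batch, cache_directory):
    """Group a global mini-batch by the learner caching each sample."""
    batches = {}
    for idx in global_batch:
        batches.setdefault(cache_directory[idx], []).append(idx)
    return batches

def balancing_traffic(batches, num_learners, batch_size):
    """Number of samples that must move so every learner
    trains on exactly batch_size / num_learners samples."""
    target = batch_size // num_learners
    return sum(max(0, target - len(batches.get(j, [])))
               for j in range(num_learners))
```

With 12 samples cached 2/6/4 across three learners, only 2 samples (about 17% of the mini-batch) need to move, versus 12 loads in the regular method.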
Here x_i, 1 ≤ i ≤ n, are data samples. The regular data loader and the locality-aware data loader implement two sampling schemes for P learners. We call the regular one Reg, and the locality-aware one Loc.

Theorem 1. Assuming the same sequence of random numbers is generated for Reg and Loc, distributed mini-batch SGD produces the same w with both sampling schemes after the same number of training steps.

Proof. By induction. Denote x_t^k as the k-th (1 ≤ k ≤ B) sample in the t-th mini-batch, where B is the size of the global mini-batch. Since the mini-batch is evenly distributed to each learner in a block fashion, at the j-th (1 ≤ j ≤ P) learner L_j, the local mini-batch includes {x_t^{(B/P)(j−1)+1}, ..., x_t^{(B/P)j}}.

Assume that after step t = s, s ≥ 1, w is the same under Reg and Loc. Then at step s+1, Reg produces a global mini-batch sequence {x_{s+1}^1, x_{s+1}^2, ..., x_{s+1}^B}. With Reg the global mini-batch sequence is block-distributed, and at learner L_j the local batch is:

{x_{s+1}^{(B/P)(j−1)+1}, x_{s+1}^{(B/P)(j−1)+2}, ..., x_{s+1}^{(B/P)j}}

So the local gradient is:

∇F(w; {x_{s+1}^{(B/P)(j−1)+1}, ..., x_{s+1}^{(B/P)j}}) = Σ_{1 ≤ k ≤ B/P} ∇F(w; x_{s+1}^{(B/P)(j−1)+k})

And the global gradient after reduction is:

∇Reg = Σ_j Σ_{1 ≤ k ≤ B/P} ∇F(w; x_{s+1}^{(B/P)(j−1)+k})

Since Loc uses the same random number sequence, it produces the same global mini-batch sequence {x_{s+1}^1, x_{s+1}^2, ..., x_{s+1}^B} as Reg. However, due to the locality optimization, the sequence is not distributed to the learners in a block fashion. In fact, the local batches may have different sizes.
From the convergence perspective, the locality-aware optimization in effect permutes the sampling sequence {x_{s+1}^1, x_{s+1}^2, ..., x_{s+1}^B} into {x_{s+1}^{g_1}, x_{s+1}^{g_2}, ..., x_{s+1}^{g_B}}, and distributes it unevenly in a block fashion to the learners. Suppose learner L_j gets samples g_j^b to g_j^e. Then the local gradient is:

∇F(w; {x_{s+1}^{g_j^b}, x_{s+1}^{g_j^b + 1}, ..., x_{s+1}^{g_j^e}}) = Σ_{g_j^b ≤ k ≤ g_j^e} ∇F(w; x_{s+1}^k)

And the global gradient after reduction:

∇Loc = Σ_j Σ_{g_j^b ≤ k ≤ g_j^e} ∇F(w; x_{s+1}^k)

By the commutative law of addition, ∇Loc = ∇Reg. Therefore, w_{s+1} of the two methods are the same. Obviously, the base case w_1 is the same for both sampling schemes. This completes our proof.

Theorem 1 shows that our locality-aware data loading scheme produces the same gradients as the original approach for each step of distributed SGD. In current practice, batch normalization is frequently used to improve training accuracy and time. In theory, batch normalization should be applied to the whole mini-batch; in this case, Theorem 1 still holds. If batch normalization is applied to each local part of the mini-batch, the mean and the variance obviously differ from those of the original data loading scheme. However, from the training perspective, the impact of our locality-aware scheme on batch normalization is similar to that of using a different random permutation sequence. It should have minimal impact on training results. This is confirmed by our experimental results.

Fig. 6. Simulated imbalance of the global mini-batch sample distribution in distributed caching. p is the number of compute nodes.

C. Load Imbalance

Here, we discuss the load imbalance of the locality-aware data loading method. We first analyze the data imbalance among the caches of the learners.
The distribution of the data samples in a global mini-batch to the caches is a random process. To characterize the number of samples of a global mini-batch that fall in the cache of a certain learner, we consider the process of uniformly at random placing $b$ balls into $p$ bins. Let $M$ be the random variable that counts the maximum number of balls in any bin. Then $Pr[M > K_\alpha] = o(1)$ for $\alpha > 1$ and $K_\alpha = \frac{b}{p} + \alpha\sqrt{\frac{2b \log p}{p}}$, with $p \log p \ll b \le p \cdot \mathrm{polylog}(p)$ (see Theorem 1 of [19]). In theory, the imbalance in the number of samples per learner for a mini-batch is thus unlikely to be large.

We ran simulations to show the traffic volume needed to balance the batch samples, using different local batch sizes and different numbers of compute nodes. Each simulation started with a fixed-size dataset evenly partitioned and distributed to $p$ compute nodes. Then mini-batch sequences were generated, and the sample distributions were determined. The imbalance traffic volume percentage is calculated by summing the deficits of every learner and dividing by the mini-batch size. We collected the imbalance numbers over many steps and render them as the box plot in Figure 6.

We can make two observations from the figure. First, the imbalance depends on the local batch size. For example, the green boxes are the results for the same local batch size of 64, and they have very close median values across different configurations; the same applies to the other local batch sizes. Second, the imbalance is in general a small percentage for moderate to large local batch sizes. The median imbalance percentages for local batch sizes 32, 64, and 128 are approximately 6.9%, 4.8%, and 3.4%, respectively. Both the theory and the simulation results show that the load imbalance of locality-aware data loading is small.
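The simulation described above can be sketched in a few lines of Python; the node count, trial count, and seed here are illustrative, not the exact configurations used in the experiments:

```python
import random

def imbalance_pct(p, local_batch, trials=200, seed=0):
    """Place b = p * local_batch samples into p caches uniformly at random
    and return the mean imbalance traffic percentage over many steps."""
    rng = random.Random(seed)
    b = p * local_batch  # global mini-batch size
    pcts = []
    for _ in range(trials):
        counts = [0] * p
        for _ in range(b):
            counts[rng.randrange(p)] += 1  # sample lands in a random cache
        # Traffic needed to balance: total deficit relative to mini-batch size.
        deficit = sum(max(0, local_batch - cnt) for cnt in counts)
        pcts.append(100.0 * deficit / b)
    return sum(pcts) / trials

for lb in (32, 64, 128):
    print(lb, round(imbalance_pct(p=16, local_batch=lb), 1))
# the imbalance percentage shrinks as the local batch size grows
```

With these parameters the mean imbalance comes out in the single-digit percentage range and decreases with the local batch size, consistent with the medians reported above.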
Still, imbalance in the amount of data present in the cache of each learner creates imbalance in the computation time of forward and backward propagation during training. To achieve perfect load balancing, learners with a data surplus need to send samples to learners with a deficit. These transfers incur communication among the learners, and we want to minimize the number of transfers (since the total amount of data, measured in bytes, transferred under any scheme is the same). This optimization problem is equivalent to an existing problem, minimum common integer partition, and turns out to be NP-complete (see [20]).

We propose an approximation algorithm. Its formal description is given in Algorithm 1. In the algorithm, we build two heaps: $H_s$ for learners with a surplus, and $H_d$ for learners with a deficit. Each heap element contains two fields: imbalance, the current workload imbalance, and ID, the learner identifier. The algorithm greedily finds the currently largest imbalanced elements $h_s$ in $H_s$ and $h_d$ in $H_d$, records in the schedule list $S$ that min($h_s$.imbalance, $h_d$.imbalance) samples are to be sent from the surplus learner to the deficit learner, updates the heaps, and continues.

Algorithm 1 Balance($p$, $L$)
 1: Make a max-heap $H_s$ of all surpluses in $L$, keyed by imbalance
 2: Make a max-heap $H_d$ of all deficits in $L$, keyed by imbalance
 3: $S \leftarrow \{\}$
 4: while $H_s$ is not empty do
 5:   $h_s \leftarrow$ heap-find-max($H_s$)
 6:   $h_d \leftarrow$ heap-find-max($H_d$)
 7:   $m \leftarrow$ min($h_s$.imbalance, $h_d$.imbalance)
 8:   $h_s$.imbalance $\leftarrow h_s$.imbalance $- m$
 9:   $h_d$.imbalance $\leftarrow h_d$.imbalance $- m$
10:   $S$.append(($h_s$.ID, $h_d$.ID, $m$))
11:   if $h_s$.imbalance $= 0$ then
12:     heap-remove($H_s$)
13:   else
14:     heap-decrease-key($H_s$)
15:   end if
16:   if $h_d$.imbalance $= 0$ then
17:     heap-remove($H_d$)
18:   else
19:     heap-decrease-key($H_d$)
20:   end if
21: end while
22: return $S$
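A runnable sketch of Algorithm 1 using Python's `heapq` (a min-heap, so keys are negated to obtain max-heap order); the learner IDs and imbalance values in the example are illustrative:

```python
import heapq

def balance(imbalances):
    """Greedy matching of surplus learners to deficit learners.
    imbalances: dict learner_id -> signed imbalance (positive = surplus).
    Assumes total surplus equals total deficit. Returns a schedule of
    (sender, receiver, amount) transfers."""
    # heapq is a min-heap, so store negated magnitudes for max-heap order.
    surplus = [(-v, lid) for lid, v in imbalances.items() if v > 0]
    deficit = [(v, lid) for lid, v in imbalances.items() if v < 0]
    heapq.heapify(surplus)
    heapq.heapify(deficit)
    schedule = []
    while surplus:
        s_amt, s_id = heapq.heappop(surplus)  # largest surplus
        d_amt, d_id = heapq.heappop(deficit)  # largest deficit
        m = min(-s_amt, -d_amt)
        schedule.append((s_id, d_id, m))
        if -s_amt - m > 0:                    # reinsert leftover surplus
            heapq.heappush(surplus, (s_amt + m, s_id))
        if -d_amt - m > 0:                    # reinsert leftover deficit
            heapq.heappush(deficit, (d_amt + m, d_id))
    return schedule

sched = balance({"L1": 5, "L2": -3, "L3": -2, "L4": 3, "L5": -3})
print(sched)
```

Each iteration fully satisfies at least one learner, which is what bounds the number of transfers at $p - 1$ in the analysis that follows.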
Since each iteration removes the imbalance of at least one learner, and each heap operation takes at most $\log p$ time, Algorithm 1 runs in $O(p \log p)$ time.

Theorem 2. Algorithm 1 is a 2-approximation algorithm.

Proof. In the worst case, the number of messages sent by Algorithm 1 is at most $p - 1$, as each iteration fixes at least one imbalanced learner, while the minimum number of messages is $p/2$. The approximation ratio is therefore $\frac{p-1}{p/2} \approx 2$.

We extend the performance model described in Section IV to incorporate the locality-aware data loading method. The training time and preprocessing time are the same as previously described. We focus on the sample I/O time here, since it dominates the cost when $p$ is large. Two new parameters are needed:

$R_b$: the I/O rate of the data movements for load balancing. If we choose to load the samples from remote caches, we can let $R_b = R_c$.

$\beta$: the ratio of the load-balancing traffic volume to the given dataset size.

The sample I/O time using the locality-aware data loading method is:

$$\text{Sample I/O time} = \underbrace{(1-\alpha)\cdot\frac{D}{R}}_{\text{Storage system}} + \underbrace{\alpha\cdot\frac{D}{R_b}\cdot\beta}_{\text{Load balancing cost}} \quad (8)$$

From the previous analyses, we know that $\beta$ is small (i.e., $0 \le \beta \ll 1$) because the load imbalance is unlikely to be large. Equation (8) differs from (7) only in the second term. When $p$ is large, $\frac{p-1}{p} \approx 1 \gg \beta$; thus, compared with distributed caching, the locality-aware data loading method greatly reduces the I/O cost.

VI. EXPERIMENTS

We conducted experiments on Lawrence Livermore National Lab's Lassen system using up to 256 nodes (1,024 GPUs). A compute node has two IBM POWER9 processors (44 cores in total), 256 GB system memory, 4 NVIDIA V100 (Volta) GPUs with 16 GB memory per GPU, and an InfiniBand EDR interconnect among compute nodes.
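The sample I/O model in (8) above is straightforward to evaluate. The sketch below compares it with the distributed-caching cost, taking the second term of (7) to be $\alpha \cdot \frac{D}{R_c} \cdot \frac{p-1}{p}$ as the comparison in the text implies; all numeric parameters are illustrative, not measured values:

```python
def sample_io_time_loc(D, R, R_b, alpha, beta):
    # Eq. (8): storage-system term plus load-balancing term.
    return (1 - alpha) * D / R + alpha * (D / R_b) * beta

def sample_io_time_cache(D, R, R_c, alpha, p):
    # Assumed second term of Eq. (7): remote-cache traffic scales with
    # (p - 1) / p, i.e. nearly the whole cached fraction moves every epoch.
    return (1 - alpha) * D / R + alpha * (D / R_c) * (p - 1) / p

# Illustrative parameters: D in GB, rates in GB/s, alpha = cached fraction.
D, R, R_c, alpha, p, beta = 150.0, 1.0, 10.0, 0.9, 256, 0.05
t_loc = sample_io_time_loc(D, R, R_b=R_c, alpha=alpha, beta=beta)
t_cache = sample_io_time_cache(D, R, R_c, alpha, p)
print(t_loc, t_cache)  # locality-aware loading shrinks the second term by ~beta
```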
Compute nodes have access to IBM Spectrum Scale (GPFS), a high-performance parallel file system.

We studied a PyTorch implementation of ImageNet-1K classification using ResNet50, adapted from the PyTorch examples [21]. The dataset, ImageNet-1K, contains around 1.28 million JPEG images, each several hundred KB; the total dataset size is about 150 GB. The distributed implementation spawns multiple learner processes, each associated with a GPU. The learners execute in a data-parallel fashion and synchronize with each other using the NCCL library, which provides an optimized all-reduce operation for the synchronizations in training.

We also include data loading performance results for the UCF101 dataset [22]. The dataset originally consists of videos and was converted into two image datasets: RGB and optical flow (referred to as FLOW), with approximately 2.5 million and 5 million images of average sizes 24.2 KB and 4.6 KB, respectively. We conducted data-loading-only experiments (including I/O and video transformations) with the optimized data loader to see how it performs on datasets other than ImageNet-1K.

To understand how our approach performs in loading very large datasets, we used another, 892 GB dataset generated from molecular dynamics (MD) simulations conducted using the Multiscale Machine-Learned Modeling Infrastructure (MuMMI) [23]. The dataset contains ~7M files that are derived MD trajectory frames. Each file contains a single frame of a constant size, 131 KB. The frames are stored in NumPy array format and can be used in ML training directly after loading; in other words, no sample preprocessing is required.

A. Effects of Optimizations

We examined the ImageNet-1K sample loading rate running a single data-loading-only learner (i.e., no training) with different numbers of workers and threads per worker to find a good combination. The case with zero threads (i.e., multithreading off) is the default PyTorch data loader.
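The workers-times-threads structure examined here can be sketched with the standard library. This is a stand-in for the actual PyTorch-based loader: `decode_sample` is a hypothetical placeholder for reading and decoding one JPEG, and threads stand in for the worker processes of the real loader to keep the sketch self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_sample(sample_id):
    # Hypothetical stand-in for reading and decoding one JPEG sample.
    return sample_id * 2

def worker_load(chunk, threads):
    # threads == 0 mirrors the default loader: no multithreading per worker.
    if threads == 0:
        return [decode_sample(s) for s in chunk]
    # Otherwise fan the worker's chunk out to a thread pool, overlapping
    # I/O waits with decode work.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(decode_sample, chunk))

def load_batch(samples, workers, threads):
    # One chunk per worker (the real loader uses worker *processes*).
    chunks = [samples[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: worker_load(c, threads), chunks)
    return [x for part in parts for x in part]

print(sorted(load_batch(list(range(16)), workers=2, threads=4)))
```

The two knobs swept in the experiment correspond to `workers` and `threads` here; the zero-thread path is the baseline configuration.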
As shown in Figure 7, the loading rate in general increases both with more threads and with more workers. Our multithreading optimization delivered better performance with relatively few workers, which is preferable because the overhead of spawning more workers grows quickly. The maximum loading rate measured is around 800 samples per second.

Next, we compared the results of the regular PyTorch data loader with those of our locality-aware data loader. On each compute node, we created 4 learners, one per GPU, and let each learner spawn 10 background workers, since that configuration gave the maximum sample loading rate in the previous experiment. The cache size of each learner is capped at 25 GB, but in most cases the learners use less than that, because we populate at most $1/p$ of the dataset per learner in the first epoch, with no cache replacement afterwards. The regular data loader read samples of a designated slice from a randomly permuted global mini-batch sequence in every step, while the locality-aware data loader went through a global mini-batch sequence, determined its contribution, and trained with mostly locally cached samples. We let the locality-aware data loader train with balanced local batches (using Algorithm 1) to avoid the negative effects of stragglers.

We ran a set of experiments at different scales. We removed the per-step synchronizations and kept only one at the end of each epoch. We experimented both with multithreading (4 threads per worker) and without, to see whether parallelized preprocessing helps the overall performance. We report the average time spent per epoch, excluding the first epoch, during which the caches are populated in the locality-aware method.

The results of loading ImageNet-1K in Figure 8 show that for the regular data loader, the cost did not decrease as we scaled up. The slightly higher cost at 64 nodes was very likely caused by interference from other jobs.
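The per-step behavior of the two loaders compared above can be sketched as follows; the sample IDs and the $1/P$ cache partition are illustrative:

```python
import random

def reg_step(global_seq, learner, P):
    # Regular loader: read a designated block slice of the permuted sequence.
    n = len(global_seq) // P
    return global_seq[learner * n:(learner + 1) * n]

def loc_step(global_seq, cache):
    # Locality-aware loader: scan the same sequence and keep the samples
    # already resident in this learner's cache.
    return [s for s in global_seq if s in cache]

dataset = list(range(32))
P = 4
# First epoch populates each cache with a 1/P partition, no replacement after.
caches = [set(dataset[i::P]) for i in range(P)]

rng = random.Random(0)
step = rng.sample(dataset, 8)  # one permuted global mini-batch

reg_parts = [reg_step(step, j, P) for j in range(P)]
loc_parts = [loc_step(step, caches[j]) for j in range(P)]
# Both schemes cover the same global mini-batch; only the split differs.
print(sorted(sum(reg_parts, [])) == sorted(sum(loc_parts, [])))
```

The local batches under `loc_step` can be uneven, which is why Algorithm 1 is applied before training on them.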
Regardless, since the locality-aware data loaders fetched drastically fewer samples, data loading scales well with the number of learners. In most configurations, for the same data loader at the same scale, a multithreaded trial finished sooner than a single-threaded one. We see a general improvement from multithreading at all node counts: for the regular data loader, multithreaded runs are 24%–71% faster; for the locality-aware data loader, multithreaded runs are 105%–113% faster. However, multithreading alone did not help when scaling the regular data loader up to more nodes. In contrast, the locality-aware data loading method clearly improved scalability. Once the effective I/O bandwidth stopped scaling for the regular data loader (the time did not decrease further even though more nodes participated), the locality-aware data loader outperformed it significantly, thanks to the lowered bandwidth requirement from reusing cached samples. At 256 nodes (1,024 learners), the locality-aware data loader achieved close to 34x speedup over the regular data loader at the same node count.

Fig. 7. The ImageNet-1K sample loading rate of a single learner using different workers/threads combinations.
Fig. 8. Cost to collectively load the ImageNet-1K dataset at different scales.
Fig. 9. Cost to collectively load the UCF101-RGB dataset at different scales.
Fig. 10. Cost to collectively load the UCF101-FLOW dataset at different scales.
Fig. 11. Cost to collectively load the MuMMI dataset at different scales.

Figures 9 and 10 show the results of loading the two sets from UCF101. For the regular data loader, the performance trend looks slightly different from the ImageNet-1K results.
Without multithreading, the regular data loader spent more time per epoch to load both UCF101-RGB and UCF101-FLOW, even though the volume to load per learner decreased as more compute nodes participated. This trend also appeared in loading UCF101-FLOW with multithreading. Since our jobs did not have exclusive access to the cluster during the experiment, we attribute the performance degradation to interference from other jobs that were simultaneously loading from GPFS. This phenomenon shows again that, at very large scale, the conventional way of data loading does not scale.

In contrast to the regular data loader, our locality-aware data loader delivered much better performance on the UCF101 data loading tasks at all scales: for UCF101-RGB it is 2.8x–55.5x faster, and for UCF101-FLOW it is 2.2x–60.6x faster.

For the largest dataset, MuMMI, we observe even more encouraging results with the locality-aware loading method, as shown in Figure 11. Our optimized data loader provides 18x, 35x, 70x, and 120x speedup over the regular data loader at 16, 32, 64, and 128 nodes, respectively. The multithreading optimization does not affect performance significantly here, since the samples are NumPy arrays and no preprocessing is needed after loading into DRAM.

B. ImageNet-1K ResNet50 Training

We ran ImageNet-1K classification with the ResNet50 model to measure the performance results and the validation accuracy after 90 epochs at three different scales, comparing the two data loader implementations. We enabled multithreading (4 threads per worker) in all runs. As we scaled up the distributed training, we also increased the global mini-batch size. We tried to reproduce the validation accuracy in [9] for the 8K mini-batch using the same fine-tuning techniques, including batch normalization. For larger batch sizes, some of the highest known accuracy numbers involve LARS [10] and elaborate learning rate tuning; since our goal is not to achieve the highest accuracy but to show comparable accuracy results, we did not implement those.

TABLE I
IMAGENET-1K RESNET50 VALIDATION ACCURACY COMPARISON BETWEEN THE REGULAR DATA LOADER AND THE LOCALITY-AWARE DATA LOADER

Number of nodes | Mini-batch size | Regular loader (%) | Locality-aware loader (%)
16 | 8,192 | 76.67 | 76.81
32 | 16,384 | 75.33 | 75.12
64 | 32,768 | 68.69 | 69.54

In Table I, we present the results. Using the locality-aware data loader resulted in validation accuracy comparable to that of the regular PyTorch data loader: the differences are below 1%.

Fig. 12. Average epoch time of ImageNet-1K ResNet50 training at different numbers of nodes.

Figure 12 shows the average time per epoch of the runs. With training on GPUs, the data loading overhead should be hidden except when $p$ is large. For 16 nodes, the GPU training time dominated the cost, and the time per epoch was comparable between the two loaders. For 32 and 64 nodes, the time per epoch using the regular data loader was lower-bounded by the constant data loading cost, which was limited by the I/O rate. In contrast, the locality-aware data loader helped the per-epoch training time decrease further as more nodes participated. We observe 1.9x speedup over the regular data loader at 64 nodes (256 learners). The results show that our locality-aware data loading method works well in practice.

VII. RELATED WORK

Several papers [10]–[13] addressed the topic of optimizing ImageNet classification with ResNet50 on distributed systems. They mostly aimed to reach a validation accuracy similar to Goyal's work [9] (75% after 90 epochs) while improving the total training time.
Various novel methods that improve GPU computation time and synchronization time were proposed, but they often omit the data loading problem.

There have been mentions of the data loading problem and attempts to solve it on very large-scale deployments. In DeepIO [24], data servers store subsets of the training dataset in an in-memory cache and prioritize reuse of data from that cache. While this reduces accesses to the storage system, the mechanism can change the mini-batch sequences and impact model accuracy. In comparison, our method does not change the predefined mini-batch sequences. In [16], distributed caching successfully scaled the application to 4,560 nodes. It relies on the high-speed interconnect among compute nodes to reduce data loading from the storage system and lower the I/O cost, but the total volume of data movement among compute nodes remains high. Our locality-aware method complements distributed caching by reducing the data movement to a small fraction, which can make applications scale to even larger systems.

VIII. CONCLUSION

Efficient data loading is fundamental for distributed DNN training to scale to large HPC systems. We investigated the issues in the existing data loader design and proposed performance optimizations. We also identified, through both performance modeling and empirical results, that the inability to load data faster limits the scalability of distributed mini-batch SGD. Our locality-aware data loading method utilizes caches to potentially eliminate data loading from the storage system after the first epoch, and also reduces the total data loading volume to a tiny fraction of the input dataset size. The method thus lowers the bandwidth requirement effectively and makes distributed DNN training much more scalable.
Our experiments show that with the proposed optimizations, we speed up data loading with 1,024 learners by 34x, 55x, and 60x for ImageNet-1K, UCF101-RGB, and UCF101-FLOW, respectively. We also obtain 120x speedup loading an 892 GB MuMMI dataset with 512 learners. Applying the optimizations to the practical ImageNet-1K classification task also shows that simply using our data loader granted ≈2x speedup with 1,024 learners while achieving comparable validation accuracy.

Our prototype implementation is based on PyTorch. We plan to develop a general software package of the optimized data loader that can be used with any machine learning framework. We also plan to study the feasibility of applying our methods to other machine learning optimization methods, and to explore using SSDs, which provide ample space and fast access and are ideal for a hierarchical caching design.

ACKNOWLEDGMENT

This work was supported under CORAL NRE Contract B604142.

REFERENCES

[1] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017, pp. 1–12.
[2] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[3] Intel, "MKL-DNN," https://github.com/intel/mkl-dnn, 2019.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint, 2014.
[5] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[6] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[7] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[10] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, "ImageNet training in minutes," in Proceedings of the 47th International Conference on Parallel Processing - ICPP 2018, 2018. [Online]. Available: http://dx.doi.org/10.1145/3225058.3225069
[11] X. Jia et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," 2018.
[12] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," 2018.
[13] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama, "ImageNet/ResNet-50 training in 224 seconds," 2018.
[14] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
[15] T. Kurth et al., "Deep learning at 15PF," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '17, 2017. [Online]. Available: http://dx.doi.org/10.1145/3126908.3126916
[16] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe et al., "Exascale deep learning for climate analytics," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 2018, p. 51.
[17] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018.
[18] Python Wiki contributors, "Global interpreter lock," 2019, [Online; accessed 01-March-2019]. [Online]. Available: https://wiki.python.org/moin/GlobalInterpreterLock
[19] M. Raab and A. Steger, "Balls into bins — a simple and tight analysis," in International Workshop on Randomization and Approximation Techniques in Computer Science. Springer, 1998, pp. 159–170.
[20] X. Chen, L. Liu, Z. Liu, and T. Jiang, "On the minimum common integer partition problem," ACM Transactions on Algorithms (TALG), vol. 5, no. 1, p. 12, 2008.
[21] Facebook, "pytorch/examples," https://github.com/pytorch/examples, 2019.
[22] K. Soomro, A. Roshan Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," in CRCV-TR-12-01, 2012.
[23] F. Di Natale et al., "A massively parallel infrastructure for adaptive multiscale simulations: Modeling RAS initiation pathway for cancer," in Supercomputing: The International Conference for High Performance Computing, Networking, Storage, and Analysis. ACM, 2019.
[24] Y. Zhu, F. Chowdhury, H. Fu, A. Moody, K. Mohror, K. Sato, and W. Yu, "Entropy-aware I/O pipelining for large-scale deep learning on HPC systems," in 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2018, pp. 145–156.
