Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
Authors: Yu Emma Wang, Gu-Yeon Wei, David Brooks
John A. Paulson School of Engineering and Applied Sciences, Harvard University
{ywang03,gywei,dbrooks}@g.harvard.edu

ABSTRACT

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.

1. INTRODUCTION

Deep learning has revolutionized many application domains, defeating world champions in the game of Go [49], surpassing humans in image classification [28], and achieving accuracy competitive with humans in speech recognition [4] and language translation [57], to name a few. As such, there has been growing demand for new and better hardware and software platforms to support the training and deployment of even more sophisticated models. As researchers from both academia and industry scramble to propose and deploy new systems to meet this demand, there is a great need to concurrently develop a systematic and scientific approach to platform benchmarking.
This benchmarking should not only compare performance of different platforms running a broad range of deep learning models, but also support deeper analysis of the interactions across the spectrum of different model attributes (e.g., hyperparameters), hardware design choices, and software support.

Announced in May 2017, the Tensor Processing Unit (TPU) v2 is a custom ASIC. Each TPU v2 device delivers a peak of 180 TFLOPS on a single board. TPU v3 was announced a year later and improves the peak performance to 420 TFLOPS. Cloud TPU became available for early academic access in February 2018, and it is used in this paper. The NVIDIA Tesla V100 Tensor Core is a Graphics Processing Unit (GPU) with the Volta architecture, released in 2017. CPUs have been found to be suitable for training in certain cases [20] and, therefore, are an important platform to include for comparison. This study shows that no one platform is best for all scenarios: different platforms offer advantages for different models based on their respective characteristics. Moreover, given how rapidly deep learning models evolve and change, benchmarking must be updated continuously and run frequently.

Recent benchmarking efforts have been limited to relatively small collections of seemingly arbitrary DNN models [41, 3, 12, 51]. Focusing on well-known models such as ResNet-50 [21] and Transformer [54] can lead to misleading conclusions. For example, Transformer is a large FC model that trains 3.5× faster on the TPU compared to the GPU; yet focusing on this single model would not reveal the severe TPU memory bandwidth bottleneck that arises with FCs with more than 4k nodes. This highlights the risk of overly optimizing hardware and/or compilers for certain models. This paper proposes a collection of deep learning models (for training) created and curated to benchmark a set of state-of-the-art deep learning platforms.
In order to support broad and comprehensive benchmark studies, we introduce ParaDnn, a parameterized deep learning benchmark suite. ParaDnn seamlessly generates thousands of parameterized multi-layer models, comprising fully-connected models (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN). ParaDnn allows systematic benchmarking across almost six orders of magnitude of model parameter size, exceeding the range of existing benchmarks. We combine these parameterized models with a collection of six real-world models, which serve as unique points within a broad spectrum of model attributes, to provide comprehensive benchmarking of hardware platforms. Table 1 summarizes fourteen observations and insights described throughout the paper that can inform future domain-specific architecture, system, and software design. We specifically mark the insights enabled by ParaDnn. We start with a deep dive into the TPU v2 and v3 in Section 4, revealing architectural bottlenecks in computation capability, memory bandwidth, multi-chip overhead, and device-host balance (observations 1 through 5). Section 5 provides a comprehensive comparison of TPU and GPU performance, highlighting important differences between the two platforms (observations 6 through 11). The final three observations are detailed in Section 6, which explores the performance improvements of specialized software stacks and quantized datatypes.

Table 1: A summary of major observations and insights, grouped by section of the paper. Observations marked * are enabled by ParaDnn: without ParaDnn the insights are not revealed, and/or lack deep explanations.

1.* TPU does not exploit the parallelism from the model depth (layer count). [Fig 2]
2.* Many FC and CNN operations are bottlenecked by TPU memory bandwidth. [Fig 3]
3.* TPU suffers large overheads due to inter-chip communication bottlenecks. [Fig 4]
4.  TPU performance can be improved by ≥ 34% by improving data infeed. [Fig 5]
5.* TPU v3 optimizes compute-bound MatMuls by 2.3×, memory-bound ones by 3×, and large embeddings by > 3×, compared to v2. [Fig 6]
    Insight for 1–5: to design or upgrade new specialized systems, architects need to consider interactions between the operation mix of key workloads (arithmetic intensity) and system configurations (FLOPS, memory bandwidth/capacity, and intra-chip and host-device interconnect). The TPU serves as a great example.
6.* The largest FC models prefer CPU due to memory constraints. [Fig 7] Insight: need for model parallelism on GPU and TPU.
7.  Models with large batch size prefer TPU; those with small batch size prefer GPU. [Figs 8, 10] Insight: large batches pack well on systolic arrays; warp scheduling is flexible for small batches.
8.* Smaller FC models prefer TPU and larger FC models prefer GPU. [Fig 8] Insight: FC needs more memory bandwidth per core (GPU).
9.* TPU speedup over GPU increases with larger CNNs. [Fig 10] Insight: TPU architecture is highly optimized for large CNNs.
10.* TPU achieves 2× (CNN) and 3× (RNN) FLOPS utilization compared to GPU. [Fig 11] Insight: TPU is optimized for both CNN and RNN models.
11.* GPU performance scales better with RNN embedding size than TPU. [Fig 10] Insight: GPU is more flexible to parallelize non-MatMuls.
12.* Within seven months, the software stack specialized for TPU improved by up to 2.5× (CNN), 7× (FC), and 9.7× (RNN). [Fig 12] Insight: it is easier to optimize for certain models than to benefit all models at once.
13. Quantization from 32 bits to 16 bits significantly improves TPU and GPU performance. [Figs 5, 12] Insight: smaller data types save memory traffic and enable larger batch sizes, resulting in super-linear speedups.
14.* TensorFlow and CUDA teams provide substantial performance improvements in each update. [Fig 12] Insight: there is huge potential to optimize compilers even after the hardware has shipped.

It is important to identify limitations of the study.
This paper highlights optimization opportunities in current architecture and system designs, as they provide valuable lessons for future design; optimization details are beyond its scope. For example, the analysis focuses on training and not inference. We do not study the performance of multi-GPU platforms or 256-node TPU systems, which may lead to different conclusions. Section 7 discusses these and other limitations of the study, which also motivate future work.

2. DEEP LEARNING BENCHMARKING

The recent success of deep learning (DL) has motivated development of benchmark suites, but existing suites have limitations. There are two types: real-world benchmark suites, such as MLPerf [41], Fathom [3], BenchNN [12], and BenchIP [51], and micro-benchmark suites, such as DeepBench [43] and BenchIP. Each real-world suite contains a handful of popular DL models spanning a variety of model architectures. Their limitation is that they only contain today's deep learning models, which may become obsolete as DL models evolve rapidly. Further, they fail to reveal deep insights into interactions between DL model attributes and hardware performance, since the benchmarks are sparse points in the vast space of deep learning models. Micro-benchmark suites exercise basic operations (e.g., matrix multiplication or convolution) that are common in neural networks, but they cannot simulate complex dependencies between different operations in end-to-end models.

To complement existing benchmark suites for this study, we introduce ParaDnn, a parameterized benchmark suite for deep learning. (We plan to open-source ParaDnn.) ParaDnn has the advantages of both of the above approaches, with the goal of providing large "end-to-end" models covering current and future applications, and parameterizing the models to explore a much larger design space of DNN model attributes.
For example, a single end-to-end CNN model from ParaDnn contains a mixture of many different layers with different sizes of convolution, batch normalization, pooling, and FC layers. The complexity of ParaDnn workloads is comparable to that of real-world models (e.g., ResNet-50 and Transformer), as will be shown in Figure 1. Insights about hardware performance sensitivity to model attributes allow interpolating and extrapolating to future models of interest. These insights could not be discovered with either the small point-space exploration of the real-world benchmark suites or DeepBench's microbenchmarks, which do not capture inter-operation dependencies as ParaDnn does.

2.1 ParaDnn Models

ParaDnn includes end-to-end fully connected models (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN). These model types cover 95% of Google's TPU workloads [32], all of Facebook's deep learning models [20], and eight out of nine MLPerf models [41] (with reinforcement learning (minigo) as the exception). The image classification/detection and sentiment analysis models are CNNs; the recommendation and translation models are FCs; the RNN translator and another version of sentiment analysis are RNNs. Speech recognition (DeepSpeech2) is a combination of CNN and GRU models.

Fully-Connected Models FC models comprise multiple fully-connected layers. The architecture is Input → [Layer [Node]] → Output, where [Layer] means the number of layers is variable. We can sweep the number of layers, the number of nodes per layer, and the numbers of input and output units of the datasets.

Convolutional Neural Networks CNN models are residual networks, the state-of-the-art model for image classification.
Table 2: The ranges of the hyperparameters and dataset variables chosen in this paper.

(a) Fully Connected Models
    Variable    Layer  Nodes  Input  Output  Batch Size
    Min         4      32     2000   200     64
    Max         128    8192   8000   1000    16384
    Inc         ×2     ×2     +2000  +200    ×2

(b) Convolutional Neural Nets: Residual and Bottleneck Blocks
    Variable    Block  Filter  Image  Output  Batch Size
    Min         1      16      200    500     64
    Max         8      64      300    1500    1024
    Inc         +1     ×2      +50    +500    ×2

(c) Recurrent Neural Networks: RNN, LSTM, GRU
    Variable    Layer  Embed  Length  Vocab  Batch Size
    Min         1      100    10      2      16
    Max         13     900    90      1024   1024
    Inc         +4     +400   +40     ×4     ×4

The architecture of ParaDnn CNNs is Input → [Residual/Bottleneck Block] × 4 → FC → Output. A residual network contains four groups of blocks [21]. Each group can consist of residual blocks or bottleneck blocks, followed by a fully-connected layer. Residual blocks have two convolutional layers and two batch normalization layers, while bottleneck blocks have three of each. Usually the minimum number of filters of a residual network is 64, and it doubles in every group, so the maximum is 512 filters. We sweep the number of blocks per group, the minimum filters, and the datasets, including input images and the number of categories as outputs. An input image is square with three channels and is represented by its side length. To keep the study tractable, we constrain each group to have the same number of blocks.

Recurrent Neural Networks RNNs comprise multiple layers of basic RNN, LSTM, or GRU cells, as shown below: Input → [RNN/LSTM/GRU Cell] → Output. Each token of the input sequence is embedded in a fixed-length vector, and the length of that vector is the embedding size. In ParaDnn, the number of layers and the embedding size are variable. The variables in the dataset include the maximum length per input sequence and the vocabulary size.
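To make the FC family and its sweep concrete, here is a pure-Python sketch (our illustration; ParaDnn's actual implementation is planned for open-source release and may differ) that enumerates the Table 2(a) grid and counts trainable parameters, assuming one weight matrix and one bias vector per fully-connected layer:

```python
from itertools import product

def fc_param_count(layers, nodes, n_input, n_output):
    """Trainable parameters of Input -> [Layer [Node]] -> Output,
    assuming a weight matrix and a bias vector per layer."""
    dims = [n_input] + [nodes] * layers + [n_output]
    weights = sum(a * b for a, b in zip(dims, dims[1:]))
    biases = sum(dims[1:])
    return weights + biases

def geometric(lo, hi, factor):
    """Values lo, lo*factor, ... up to hi (Table 2 'x' increments)."""
    vals = []
    while lo <= hi:
        vals.append(lo)
        lo *= factor
    return vals

# Table 2(a) ranges for the FC sweep.
layers_r = geometric(4, 128, 2)           # 4 ... 128, x2
nodes_r  = geometric(32, 8192, 2)         # 32 ... 8192, x2
input_r  = list(range(2000, 8001, 2000))  # 2000 ... 8000, +2000
output_r = list(range(200, 1001, 200))    # 200 ... 1000, +200
batch_r  = geometric(64, 16384, 2)        # 64 ... 16384, x2

configs = list(product(layers_r, nodes_r, input_r, output_r, batch_r))
sizes = [fc_param_count(l, n, i, o) for l, n, i, o, _ in configs]
print(len(configs), min(sizes), max(sizes))
```

Whether ParaDnn sweeps the full cross product is not stated in the paper; under that assumption, this grid alone yields thousands of FC configurations, consistent with the claim that ParaDnn generates thousands of multi-layer models.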
Range of Hyperparameters and Datasets We choose the ranges of hyperparameters and datasets to cover the real models (Section 2.2), and we make sure the design space is reasonable. Table 2 summarizes the variables for each network type and how they are swept. We also sweep training batch sizes.

2.2 Real-World Models

In addition to ParaDnn, we include two of the three workloads written in TensorFlow from MLPerf [41], i.e., Transformer (translation) [54] and ResNet-50 (image classification) [21], because currently the TPU only supports TensorFlow. We also select other real-world deep learning workloads [42], including RetinaNet [37], DenseNet [28], MobileNet [27], and SqueezeNet [29]. We refer to them as real workloads or real models. The batch sizes are the largest supported on the hardware platform. For example, on TPU with bfloat16, we use batch size 64 for RetinaNet, 4k for Transformer, and 1024 for the rest of the workloads.

[Figure 1: The numbers of trainable parameters for all workloads in this paper. Those from ParaDnn (FC, Residual, Bottleneck, RNN, LSTM, GRU) range from 10k to nearly a billion parameters, which is larger than the range of real workload sizes (SqueezeNet, MobileNet, DenseNet, ResNet-50, RetinaNet, Transformer), shown as dots.]

Figure 1 shows the numbers of trainable parameters across all workloads to quantify the sizes of the models. The ParaDnn workloads are shown as ranges and the real workloads as dots. ParaDnn covers a large range of models, from 10k to nearly a billion parameters. Transformer is the largest real FC, and RetinaNet is the largest real CNN. The small models, SqueezeNet and MobileNet, reflect models typically targeted towards mobile applications. RetinaNet and ResNet-50 provide state-of-the-art image classification accuracy.
3. HARDWARE PLATFORMS

Our selection of hardware reflects the latest configurations widely available in cloud platforms at paper submission time. Platform specifications are summarized in Table 3.

CPU Platform The CPU is an n1-standard-32 instance from Google Cloud Platform with the Skylake architecture. It has 16 cores and 32 threads. It has the largest memory (120 GB) and the lowest peak FLOPS (2 TFLOPS) of the three platforms. GeekBench 4 produced the bandwidth measurement.

GPU Platform The GPU is an NVIDIA V100 in a DGX-1 GPU platform that contains 8 V100 packages (SXM2) connected via 300 GB/s NVLink 2.0 interconnect. We currently measure the performance of a single SXM2 node. One node has 16 GB of memory and 900 GB/s memory bandwidth. A V100 has 640 tensor cores and is able to run mixed-precision training using float16 to compute and float32 to accumulate, making its peak performance 125 TFLOPS.

TPU Platform The TPU is a Cloud TPU instance to which we were given academic access in February 2018. Its system architecture includes a Cloud Engine VM, a Cloud TPU server, Google Cloud storage, and a Cloud TPU board [2]. Each TPU board contains four TPU packages (the default Cloud TPU configuration) [14]. One TPU v2 package supports 45 TFLOPS and contains 2 cores; one core has one matrix unit (MXU). Total ML acceleration for a Cloud TPU v2 platform is therefore 180 TFLOPS. Memory size is 8 GB per core, or 64 GB per board, with 2400 GB/s overall memory bandwidth. TPU v2 supports mixed-precision training, using bfloat16 to compute and float32 to accumulate. Compared to v2, TPU v3 doubles the number of MXUs and the HBM capacity per core [2]. The memory bandwidth of v3 has not been disclosed, but empirical results show that it is increased by 1.5×. TPU v3 has a peak of 420 TFLOPS, 2.3× greater than v2, likely because of higher frequency. Because v3 is an upgrade from v2, we focus on studying v2. In this paper, TPU refers to Cloud TPU v2, unless specified otherwise.
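The single-precision CPU peak quoted above (and footnoted in Table 3) is a straight product of FMA throughput, SIMD width, core count, and clock. A sketch of that arithmetic (the factor names follow the paper's footnote, not a detailed microarchitectural derivation):

```python
def peak_flops(flops_per_fma, sp_lanes, cores, freq_hz):
    """Peak FLOPS = FLOPs/FMA x single-precision lanes x cores x clock."""
    return flops_per_fma * sp_lanes * cores * freq_hz

# Skylake VM, per the Table 3 footnote: 2 FMA x 32 SP x 16 cores x 2 GHz.
cpu_peak = peak_flops(2, 32, 16, 2e9)
print(cpu_peak / 1e12)  # ~2 single-precision TFLOPS
```

The same style of back-of-envelope product underlies the 180 TFLOPS TPU v2 board figure (4 packages × 45 TFLOPS).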
Table 3: Hardware platforms under study.

    Platform  Unit               Version       Mem Type  Mem (GB)  Mem Bdw (GB/s)  Peak FLOPS
    CPU       1 VM               Skylake       DDR4      120       16.6            2T SP †
    GPU       1 Pkg (SXM2)       V100 (DGX-1)  HBM2      16        900             125T
    TPU       1 Board (8 cores)  v2            HBM       8         2400            180T
    TPU v3    8 cores            v3            HBM       16        3600 *          420T

† Single precision: 2 FMA × 32 SP × 16 cores × 2 GHz frequency = 2 SP TFLOPS.
* Estimated based on empirical results (Section 4.5).

Understanding TPU memory size. Data parallelism is implemented on the TPU, where one batch of training data is split evenly and sent to the 8 cores on the TPU board. The model is not distributed; every TPU core keeps a whole copy of it. Therefore, memory size per core determines the maximum model supported, while total on-board memory determines the maximum data batch size. That is why, in Section 5.1, the GPU platform supports larger models than the TPU, and the TPU supports larger batch sizes (Section 5.2).

Comparison rationale. We evaluate one V100 package and one TPU board (4 packages) because they are the minimal units available. The configurations are encapsulated: on Cloud TPU, distribution of computation across the four TPU packages on a board happens automatically, while multi-GPU performance depends largely on the user's implementation. Multi-GPU/TPU performance is beyond the scope of this work, as discussed in Section 7. Therefore, note that conclusions in this paper do not apply to multi-GPU or larger TPU systems.

4. TPU ARCHITECTURAL IMPLICATIONS

As the end of Dennard scaling and Moore's law has slowed the performance improvement of general-purpose microprocessors [23], the design of domain-specific hardware is becoming more and more relevant. The TPU is a prominent example of domain-specific hardware [32, 14]. Its development was motivated by the observation that, with conventional CPUs, Google would have had to double their datacenter footprint to meet the internal demand for machine learning workloads.
Google has been using TPUs for their large-scale production systems, including Search, Translate, and Gmail. Analyzing the architecture of such systems can provide valuable insights into future deep learning accelerator design. In this section, we study the performance characteristics of TPU v2 and v3 [14, 2], with a focus on v2, from the computation capability in the core (FLOPS) to the system balance. Based on our observations, we discuss possible steps to improve TPU performance, which can be generalized to other deep learning accelerator systems. The following is a summary of our key observations and insights:

• FLOPS (Section 4.1): The TPU makes good use of the parallelism exposed by batch size and model width, but parallelism due to model depth is under-exploited, suggesting opportunities for model pipelining [8].

• Memory bandwidth (Section 4.2): Memory bandwidth is the performance bottleneck of many models. Even highly-optimized compute-bound models show a significant fraction of memory-bound operations (13% in ResNet-50). Improving memory access for such operations is key to further performance improvement.

• Multi-chip overhead (Section 4.3): Communication overhead in a multi-chip system is non-negligible (up to 13% for CNNs with sizes similar to ResNet-50) but can be amortized with large batch sizes. Reducing the communication overhead can lead to performance gains.

• Host-device balance (Section 4.4): Data quantization can make compute-bound workloads data-infeed-bound. Resolving the data-infeed bottleneck can improve performance by at least 34%.

• TPU v3 (Section 4.5): The maximum speedup of TPU v3 over v2 is up to 3×, exceeding the 2.3× FLOPS increase. TPU v3 benefits from its doubled memory capacity (which allows twice the batch size of v2) as well as increased memory bandwidth.
4.1 FLOPS Utilization

Floating-point operations per second (FLOPS) utilization is the ratio of average FLOPS to peak FLOPS, measuring how efficiently the computation capacity of a platform is used. We discuss the TPU FLOPS utilization of the parameterized models in this section. We first visualize how the model hyperparameters listed in Table 2 affect FLOPS utilization; then we introduce an analysis methodology to quantify the hyperparameter effects using linear regression.

FLOPS Utilization Heat Maps Figures 2(a)–(c) present heat maps of FLOPS utilization for FC, CNN, and RNN models, obtained by sweeping the hyperparameters within the ranges listed in Table 2. For each model type, we choose the two hyperparameters that affect FLOPS utilization the most (see below for how we choose them) and show them on the x- and y-axes, while keeping the other hyperparameters fixed. Specifically, we fix layer count (32), input (2000), and output units (1000) for FCs; block count (6), input image size (300 × 300 × 3), and output units (1000) for CNNs; and layer count (9), vocabulary size (32), and maximum length (50) for RNNs. Figures 2(a)–(c) show that the FLOPS utilization of all three model types increases with batch size. Beyond that, the FLOPS utilization of FCs increases with the number of nodes per layer (Figure 2(a)), that of CNNs increases with filters, and that of RNNs with embedding size. This indicates that the TPU is capable of leveraging the parallelism within a batch (the former) and within the width of the models (the latter).

Studying Parameterized Models with Linear Regression Having discussed the qualitative effects of hyperparameters on FLOPS utilization, we now build a linear regression (LR) model and use its weights to quantify these effects. Note that the LR model is only for measuring the effects of hyperparameters; we do not use it for prediction.
In the case of FC, the linear regression model is

FLOPS = w0 × layer + w1 × node + w2 × input + w3 × output + w4 × batch_size,

where w0–w4 are the weights of the hyperparameters. To train the LR model, all the values are normalized to the same scale, so that we can use the weights as a measure of importance. For example, a positive w1 indicates that node count affects performance positively. If the absolute value of w1 is larger than that of w0, node count has a larger effect on FLOPS than layer count. Other similar metrics for feature selection, including the T-test and F-test, may be used for this purpose [26]. We choose LR mainly to get the signs of the weights, which indicate the positive or negative effects of the hyperparameters on performance, while the T-test and F-test only report positive values as importance.

[Figure 2: FLOPS utilization and its correlation with hyperparameters. (a)–(c) show FLOPS utilization of parameterized FC, CNN, and RNN models as heat maps over batch size and nodes/filters/embedding size, respectively. (d)–(f) quantify the effects of model hyperparameters on FLOPS utilization, using linear regression weights.]

Figures 2(d)–(f) show the LR weights of the model hyperparameters. The x- and y-axes in Figures 2(a)–(c) are the hyperparameters with the highest absolute weights in Figures 2(d)–(f). Figure 2(d) shows that the FLOPS utilization of FC is largely affected by batch size and node count, while layer count, output, and input do not matter as much.
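The weight-as-importance methodology can be sketched end to end on synthetic data (illustrative only, not the paper's measurements): generate targets from two already-normalized features, solve the least-squares normal equations, and compare the signed weights.

```python
def lr_weights_2d(xs, ys, targets):
    """Least-squares weights for t ~ w0*x + w1*y (no intercept),
    solved via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxt = sum(x * t for x, t in zip(xs, targets))
    syt = sum(y * t for y, t in zip(ys, targets))
    det = sxx * syy - sxy * sxy
    w0 = (syy * sxt - sxy * syt) / det
    w1 = (sxx * syt - sxy * sxt) / det
    return w0, w1

# Synthetic "hyperparameters", already normalized to [0, 1]:
# utilization depends strongly on feature x, weakly on feature y.
xs = [i / 9 for i in range(10)]
ys = [(i * 7 % 10) / 9 for i in range(10)]
util = [0.8 * x + 0.1 * y for x, y in zip(xs, ys)]

w0, w1 = lr_weights_2d(xs, ys, util)
# w0 ~ 0.8 and w1 ~ 0.1: x matters far more than y, and both help.
```

With real measurements the fit is not exact, but the signs and relative magnitudes of the weights are read the same way.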
Similarly, Figure 2(e) shows that filter count is the most important for CNNs, and batch size is more important than block count, while input and output have minimal impact. The TPU FLOPS of RNNs is not affected by maximum length, number of layers, or vocabulary size.

Architectural Implications The TPU takes advantage of parallelism due to large batch size and model width, including that from nodes per layer in FC, filters in CNN, and embedding sizes in RNN. Parallelism opportunities from large numbers of layers remain to be explored, by approaches such as model parallelism [15, 30] and pipelining [8].

4.2 Roofline Model Analysis

The FLOPS utilization in the previous section shows the computation capability of the TPU, but the core is only part of the problem when designing an accelerator. In particular, memory bandwidth is another important aspect that can have a significant impact on performance. In this section, we use the roofline model [56] to analyze the computation and memory bandwidth of FCs and CNNs. Roofline models are useful to demonstrate memory and computation bottlenecks [56, 32]. We omit RNN models because the TPU profiler reports incorrect numbers for the memory bandwidth of RNN models.

The Roofline Model Figure 3 shows the roofline plots. The y-axis is FLOPS and the x-axis is arithmetic intensity, i.e., floating-point operations per byte transferred from memory. The roofline (the red line in Figure 3) has a slanted part and a horizontal part, and represents the highest achievable FLOPS at a given arithmetic intensity. Any data point (x, y) on the slanted part has y/x = memory bandwidth. The horizontal
part is the peak FLOPS of the hardware. A workload or operation (a point in Figure 3) close to the slanted roofline is memory-bound; one close to the horizontal part is compute-bound. A workload or operation not close to the roofline stresses neither the memory interconnect nor the compute units.

[Figure 3: Rooflines for FC and CNN on the TPU. Workloads with matrix multiply (MatMul) operations are compute-bound; even compute-bound workloads like Transformer and ResNet-50 have more than 10% memory-bound operations. (a) and (c) show rooflines of parameterized (by nodes/filters and batch size) and real-world models; (b) and (d) show the operation breakdowns, e.g., fused MatMul is 66.0% of Transformer and 85.2% of ResNet-50.]

Figures 3(a) and 3(c) show all the parameterized FC and CNN models (dots) plus Transformer and ResNet-50 (stars). Figures 3(b) and 3(d) show the operation breakdowns. Transformer and ResNet-50 are just instances (sparse design points) in ParaDnn, so the stars overlap some of the dots. This is because ParaDnn enables more comprehensive model architecture design space exploration and supports benchmarking hardware systems more systematically. An exception is that some operations of Transformer do not align closely with those of FCs.
This results from a choice in this paper, not a fundamental flaw of ParaDnn: ParaDnn uses the RMSProp optimizer and keeps nodes per layer uniform in a parameterized FC, while Transformer uses the Adafactor optimizer and has layers with 4k, 2k, and 512 nodes.

FC Figure 3(a) shows that large batch sizes make FCs more compute-bound, and more nodes make FCs more memory-bound. That is because FCs with more nodes need to transfer more weights/activations from memory, while large batch sizes increase the computation per weight/activation transferred, i.e., the arithmetic intensity. For example, for FCs with ≥ 2k nodes, using large batch sizes turns memory-bound FCs into compute-bound ones; specifically, the FCs with ≥ 2k nodes per layer and ≥ 8k batch size are compute-bound. Transformer is close to compute-bound, and it uses a 4k batch size, which causes it to overlap with FCs having 4k batch sizes.

CNN Figure 3(c) shows that models close to ResNet-50 are compute-bound, while the majority of the CNNs are bottlenecked by memory bandwidth. As the plot is in log scale, it shows that the practically achievable memory bandwidth for the CNNs is less than the theoretical bandwidth. The CNNs' higher FLOPS comes from higher arithmetic intensity caused by more filters. When memory bandwidth is the bottleneck, the way to increase FLOPS is to increase arithmetic intensity.

Operation Breakdown The triangles in Figures 3(a) and 3(c) are selected memory-bound models: the FC has 8 layers, 8192 nodes per layer, and batch size 512; the CNN has 1 block per group, 16 filters, and batch size 64. Figures 3(b) and 3(d) show the TensorFlow operations taking more than 1% of the workload execution time and more than 0 TPU FLOPS. The arithmetic intensity of such operations can be as low as 0.125.² The TensorFlow breakdown in Figure 3 is generated after operation fusion, a technique that combines and executes several operations together for higher efficiency.
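The roofline in Figure 3 reduces to a one-line formula: attainable FLOPS at arithmetic intensity I is min(peak, I × bandwidth). The following sketch evaluates it with the Cloud TPU v2 board numbers from Table 3 (180 TFLOPS peak, 2400 GB/s), alongside the 0.125 FLOPs/byte accumulation intensity mentioned above:

```python
TPU_V2_PEAK = 180e12  # FLOPS per board (Table 3)
TPU_V2_BW   = 2400e9  # bytes/s overall memory bandwidth (Table 3)

def attainable_flops(intensity, peak=TPU_V2_PEAK, bandwidth=TPU_V2_BW):
    """Roofline: memory-bound at low intensity, compute-bound at high."""
    return min(peak, intensity * bandwidth)

# Ridge point: intensity above which the board can be compute-bound.
ridge = TPU_V2_PEAK / TPU_V2_BW  # 75 FLOPs/byte

# A float32 accumulation does one add per two 4-byte operands loaded:
acc_intensity = 1 / (2 * 4)      # 0.125 FLOPs/byte
# Far below the ridge point, so such an operation is memory-bound:
# attainable_flops(0.125) = 0.125 * 2400e9 = 300 GFLOPS << 180 TFLOPS.
```

Whether an operation lands on the slanted or flat part of the curve is thus determined entirely by its intensity relative to the ridge point.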
Large MatMuls Figures 3(b) and 3(d) show that the only compute-bound operation is the large fused MatMul (matrix multiply fused with other operations), so a compute-bound model needs to have compute-bound MatMuls. Other operations are closer to the slanted line, indicating they are constrained by memory bandwidth. For example, in Figures 3(a) and (c), Transformer and ResNet-50 are compute-bound because they have compute-bound MatMuls in Figures 3(b) and 3(d).

Memory-bound Operations Interestingly, even compute-bound FC/CNN models contain a noticeable fraction of memory-bound operations. Transformer has three memory-bound operations: (1) input fusion (9.0%), which includes multiply, subtract, and reduce; (2) loop fusion (7.0%), which consists of control-flow operations (e.g., select and equal-to); and (3) CrossReplicaSum (3.9%), which sums up the values across multiple weight replicas. These three operations contribute 19.9% of the total execution time. (Another 12.3% of the execution time is for data formatting, which has no arithmetic intensity or TPU FLOPS.) Even compute-bound ResNet-50 has many memory-bound operations, including loop fusion (9%), MaxPoolGrad (2.9%), and CrossReplicaSum (1.1%), which sum to 13%, showing the need for both end-to-end and per-operation optimization of deep learning accelerators.

Architectural Implications Compute-bound FCs and CNNs have large MatMul operations. Surprisingly, even compute-bound models contain non-negligible fractions (19.9% for Transformer and 13% for ResNet-50) of memory-bound operations. Given the current TPU system, memory-bound operations need more attention. Potential ways to speed up memory-bound operations include increasing memory bandwidth and reducing memory traffic. Traditional architectural efforts to reduce memory traffic can be adopted, such as exploiting memory locality by caching [24].
Software/compiler approaches include better operation fusion [1, 11, 44], more aggressive data quantization [6], and weight and gradient compression [17, 38].

²For example, an activation accumulation operation (CrossReplicaSum in TensorFlow) uses float32 even with bfloat16 model weights. In this case, the arithmetic intensity is 1/(2 × 4 bytes) = 0.125, i.e., one floating-point addition for every two data points loaded.

4.3 Multi-Chip Overhead

This section analyzes communication overhead in a multi-chip system. Previous sections focus on the compute and memory bandwidth of a TPU core, but these are not the only factors that affect training performance, because typical large-scale training systems use multiple chips [15]. This section evaluates the scalability of a multi-chip TPU system.

Figure 4: Communication overhead in a multi-chip system is non-negligible, but is reduced with large batch sizes. [(a) FC; (b) CNN. Axes: 1-core FLOPS% vs. 8-core FLOPS%.]

To quantify the multi-chip overhead, we compare the FLOPS utilization of the 1-core (x-axis) and 8-core TPU (y-axis) in Figure 4. If there were no multi-chip overhead, the FLOPS utilization of 1 core and 8 cores should be the same, i.e., all points should lie on the dashed line x = y in Figure 4. On the 8-core TPU, FCs need at least a 16k batch size to achieve more than 50% FLOPS utilization. Specifically, FCs with ≥ 256 nodes and ≤ 512 batch size are faster on the 1-core TPU than on the 8-core TPU; we therefore consider only FCs with batch size larger than 1024 in Figure 4. As shown in the figure, the 8-core TPU shows noticeably lower FLOPS utilization than the 1-core TPU, indicating significant inter-core communication overhead. For FC, the maximum FLOPS utilization on the 8-core TPU is 62%, compared to 100% on the 1-core TPU.
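The footnote's 0.125 figure follows directly from counting bytes per arithmetic operation. A minimal restatement (the only inputs are the footnote's own numbers: one add per pair of loaded values, 4 bytes per float32 value):

```python
def accumulation_intensity(values_per_add=2, bytes_per_value=4):
    """FLOPs per byte for an accumulation op: one add per pair of loads."""
    return 1 / (values_per_add * bytes_per_value)

# CrossReplicaSum accumulates in float32 (4 bytes) even with bfloat16 weights:
print(accumulation_intensity())                    # 0.125 FLOPs/byte
# Hypothetically accumulating in 2-byte values would double the intensity:
print(accumulation_intensity(bytes_per_value=2))   # 0.25
```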
Multi-chip overhead is less noticeable for CNNs, with FLOPS utilization decreasing from 55% on the 1-core TPU to 40% on the 8-core TPU. It is worse for FCs because there are more weights to synchronize across the TPU cores than for CNNs. Based on Amdahl's law, we calculate that the maximum non-parallel fraction of the workloads is up to 60% for FC and 40% for CNN. The FLOPS utilization difference is smaller with larger batch sizes for both FC and CNN, because a larger batch increases the computation without increasing the weight synchronization. Using the largest batch size shown in Figure 4, the 90th-percentile non-parallel fractions are 16% for FC and 8.8% for CNN.

Architectural Implications We show that communication overhead in multi-chip systems is non-negligible even for large FCs and CNNs. Using a large batch size can reduce the overhead by increasing the computation parallelism without increasing weight transfers. Possible optimizations include relaxed synchronization, model parallelism [15], gradient compression [38], and algorithm and architecture support for weight pruning and compression [17] before synchronization.

4.4 Host-Device Balance

Previous subsections have focused on the performance of the accelerator itself. This section focuses on "data infeed," the process of preparing and moving input data to the TPU board. The ParaDnn analysis avoids part of the data infeed overhead by synthesizing data on the CPU host. We now describe a case study with real-world workloads to show the importance of balancing accelerators and the host in a system.

TPU Device and Host The TPU system is composed of a CPU host and a TPU device [14]. For real-world CNNs, the host fetches images from the network, decodes, preprocesses, and feeds them to the device.
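The Amdahl's-law calculation above can be sketched by inverting the law from observed per-core utilization. The paper does not state which data points yield its 60%/40% maxima, so the input utilizations below are illustrative, not the paper's:

```python
def serial_fraction(util_1core, util_ncore, n=8):
    """Invert Amdahl's law from FLOPS utilization on 1 vs. n cores.

    Observed speedup over one core: S = n * util_ncore / util_1core.
    Amdahl: S = 1 / ((1 - p) + p / n), so the serial fraction f = 1 - p is
        f = (n / S - 1) / (n - 1).
    """
    s = n * util_ncore / util_1core
    return (n / s - 1) / (n - 1)

# Perfect scaling (same utilization on 1 and 8 cores) -> no serial fraction:
print(serial_fraction(0.50, 0.50))              # 0.0
# Illustrative point dropping from 55% (1 core) to 40% (8 cores):
print(round(serial_fraction(0.55, 0.40), 3))    # 0.054
```

A point-by-point sweep of Figure 4 with this formula is how per-workload non-parallel fractions (and their 90th percentiles) can be tabulated.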
Figure 5 calls this step data preparation. The device then performs training computation on the images. Data infeed covers network overhead, host compute, and the bandwidth between host and device.

Figure 5: FLOPS utilization (top) and infeed time (bottom) of the real models using float32 and bfloat16, with and without data preparation (real vs. synthetic data). Models with a large infeed time percentage, i.e., RetinaNet and SqueezeNet, are limited by data infeed.

Infeed Overhead Analysis To quantify the infeed overhead, we run real-world workloads both with and without data preparation, the latter by directly feeding synthetic data as postprocessed inputs. We also compare models using float32 to those using bfloat16, because replacing float32 with bfloat16 can affect the execution time of both data infeed and device computation. First, the arithmetic intensity of all operations doubles, because the same computation can be performed with half the bytes transferred. Second, the FLOPS of memory-bound operations improves on the device, because increased arithmetic intensity moves those operations toward the upper right in the roofline model of Figure 3. Third, improved device performance increases the need for faster data infeed, which puts more pressure on the host.

Figure 5 shows the FLOPS utilization and infeed time of the real-world workloads. FLOPS utilization measures computation efficiency and infeed time measures how long the device waits for data; both are collected from the TPU profiler. The error bars are one standard deviation of the one-minute samples from the profiler.
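The first effect above (halving bytes doubles arithmetic intensity) is worth making explicit, since it is what moves memory-bound operations up the roofline. A small sketch with a hypothetical operation:

```python
def intensity(flops, n_values, bytes_per_value):
    """Arithmetic intensity: FLOPs per byte moved to/from memory."""
    return flops / (n_values * bytes_per_value)

# Hypothetical memory-bound op: 1e9 FLOPs over 4e8 values.
op_flops, op_values = 1e9, 4e8
print(intensity(op_flops, op_values, 4))  # float32: 0.625 FLOPs/byte
print(intensity(op_flops, op_values, 2))  # bfloat16: 1.25, exactly doubled
```

For a memory-bound operation, attainable FLOPS is intensity times bandwidth, so the doubled intensity translates directly into doubled device throughput, which in turn doubles the data rate the host must sustain.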
The figure shows that the bottleneck of a workload can lie on the device or in data infeed, to different degrees under different circumstances. Data infeed bottlenecks RetinaNet and SqueezeNet, as their performance increases noticeably when data preparation is skipped. Eliminating that bottleneck brings 37% and 180% speedups, respectively, for RetinaNet and SqueezeNet using bfloat16. RetinaNet's bottleneck likely arises because it uses the COCO dataset (640 × 640 images), while the others use the ImageNet dataset (224 × 224 images). ResNet-50 is bottlenecked by the device when using float32, and by data infeed when using bfloat16: the bitwidth reduction speeds device execution and increases FLOPS utilization, so that training throughput on the device surpasses data preparation throughput on the host. If the resulting data infeed bottleneck could be resolved, the performance of bfloat16 ResNet-50 would improve by 34%. Switching RetinaNet and SqueezeNet from float32 to bfloat16 with real data slightly increases the data infeed percentage for similar reasons. This also shows that whenever infeed time increases, there is performance to be gained by resolving it.

DenseNet and MobileNet have zero data infeed time. Compared with ResNet, they train fewer images per second, putting less stress on the host to infeed data. Switching from float32 to bfloat16 increases the performance of both workloads using real data, so they are likely bottlenecked by memory bandwidth on the device.

Unlike CNNs, Transformer processes sequences, which are smaller than images and demand minimal computation for data decoding and/or preprocessing. So Transformer does not have significant infeed time, as expected. Unfortunately, its tensor2tensor implementation does not support synthetic data, so we omit the shaded bars for Transformer in Figure 5.

Architectural Implications Scaling the performance of the CPU host to match the TPU device is crucial for utilizing the accelerator's computation resources. For workloads limited by data infeed from the host to the device, resolving the bottleneck can improve performance by at least 34%. Such workloads include RetinaNet, ResNet-50, and SqueezeNet using bfloat16. Sequence models such as Transformer do not stress data infeed as much as CNNs. By increasing FLOPS utilization, data quantization can turn a compute-bound workload into one that is infeed-starved. With a powerful CPU host, further data quantization can yield greater performance gains, when it is valid; 8-bit training is an example [6].

Figure 6: (a) Speedup of TPU v3 over v2 running end-to-end models. (b) and (c) Speedup comparison for FC and CNN operations. TPU v3's larger memory supports doubled batch sizes, so memory-bound operations have 3× speedup if they benefit from the larger batch size, and 1.5× speedup if not. Operations that remain compute-bound on v3 have 2.3× speedup. The red line (75 Ops/Byte) is the inflection point in the TPU v2 roofline. (See roofline and legends in Figure 3.)

4.5 TPU v3

This section focuses on the differences between TPU v2 and v3. Figure 6 compares TPU v3 and v2 using FC, CNN with bottleneck blocks, and basic RNN models. The batch size for v3 is twice that for v2, thanks to its doubled memory capacity. Figure 6(a) shows the speedups of end-to-end ParaDnn models. Because end-to-end model speedup depends on the underlying operations, we first discuss the operation breakdown in detail. Figures 6(b)–(c) show arithmetic intensity on the x-axis and the speedup of FC and CNN operations on the y-axis. Data points are colored by operation type, consistently with Figures 3(b) and (d).
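The v2-versus-v3 comparison hinges on the two rooflines' inflection points and three speedup classes, which this subsection derives. A schematic restatement (the 420 TFLOPS v3 peak and the 1.5× bandwidth factor are the values this subsection uses, the latter inferred rather than officially disclosed):

```python
# Roofline inflection point (FLOPs/byte) = peak FLOPS / memory bandwidth.
V2_PEAK, V2_BW = 180e12, 2.4e12        # TPU v2: 180 TFLOPS, 2.4 TB/s
V3_PEAK, V3_BW = 420e12, 2.4e12 * 1.5  # TPU v3: 420 TFLOPS, inferred 1.5x BW

v2_inflect = V2_PEAK / V2_BW           # 75 FLOPs/byte
v3_inflect = V3_PEAK / V3_BW           # ~117 FLOPs/byte

def v3_over_v2_speedup(intensity, doubles_with_batch=False):
    """Schematic v2->v3 speedup classes for an op of given v2 intensity."""
    if intensity >= v3_inflect:        # compute-bound on both chips
        return V3_PEAK / V2_PEAK       # ~2.33x, the peak-FLOPS ratio
    if intensity < v2_inflect:         # memory-bound on both chips:
        # 1.5x from bandwidth, doubled again if the op scales with batch size
        return 1.5 * (2 if doubles_with_batch else 1)
    return None                        # boundary case: between 1.5x and 2.3x

print(v2_inflect, round(v3_inflect))   # 75.0 117
```

Here `v3_over_v2_speedup(200)` gives roughly 2.33, `v3_over_v2_speedup(10, doubles_with_batch=True)` gives 3.0 (a memory-bound MatMul), and `v3_over_v2_speedup(0.125)` gives 1.5, matching the three levels discussed next.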
As a reference, the red dashed line is the inflection point in the TPU v2 roofline from Figure 3, where the arithmetic intensity is 75 Ops/Byte (180 TFLOPS / 2.4 TB/s). Operations to the left of the red line are memory-bound, and those to the right are compute-bound. We can group the operations into four classes, as follows.

Compute-Bound Ops The peak FLOPS of TPU v3 is 2.3× that of v2, so compute-bound operations are improved by about 2.3× on v3. Such operations are to the right of the red dashed line in Figure 6(b).

Memory-Bound Ops (2× batch size) The maximum speedup of the memory-bound operations (mainly the MatMuls in Figures 6(b)–(c)) is 3×. The tripled speedup comes from the doubled batch size (enabled by the doubled memory capacity) plus a memory bandwidth improvement. We can thus infer that v3 has a 1.5× bandwidth improvement over v2 (3.6 TB/s per board), although its memory bandwidth has not been officially disclosed. On the slanted line of a roofline model, doubled batch size means doubled arithmetic intensity, and thus doubled FLOPS, because the ratio of FLOPS to arithmetic intensity is fixed; switching from v2's roofline to v3's then gives a FLOPS improvement equal to the bandwidth improvement. The fact that the overall speedup is 3× indicates that the bandwidth improvement is 3/2 = 1.5×.

Other Memory-Bound Ops The 1.5× bandwidth improvement assumption is corroborated by the 1.5× speedup of the other memory-bound operations, represented by the non-MatMul FC operations in the lower left corner of Figure 6(b). The performance of those operations does not increase with larger batch size, as shown by the vertical alignment of each operation type in Figure 3(b). Thus the 1.5× performance improvement in Figure 6(b) comes from the bandwidth improvement.

Boundary Cases The compute-bound MatMuls in Figure 6(c) become memory-bound on TPU v3, so their speedup is < 2.3×.
Such operations have arithmetic intensity between 75 and 117, because the roofline inflection point of v3 is at x = 420 / (2.4 × 1.5) = 117. CrossReplicaSum (yellow dots) is slowed down on TPU v3, which may be because of more replicas across more MXUs.

End-to-End Models In Figure 6(a) the maximum speedups are 2.83× (FC), 2.31× (CNN), and 3.11× (RNN). Speedup increases with model width (second column of Table 2), and the maximum speedup is achieved at the largest width. FCs with close to 3× speedup are dominated by memory-bound MatMuls. The exceptions are RNNs with more than 3× speedup; these have the largest embedding size (900), indicating that TPU v3 optimizes large embedding computations.

Architectural Implications ParaDnn enables users to examine a wide range of workloads, from memory-bound to compute-bound. Compared to v2, TPU v3 shows three main levels of speedup: 2.3× for compute-bound operations, 3× for memory-bound MatMuls, and 1.5× for other memory-bound operations. This is the result of its 2.3× FLOPS, 2× memory capacity, and 1.5× memory bandwidth. For architects, the relative improvement of FLOPS and memory is a trade-off based on key workloads and budgets.

5. CROSS-PLATFORM COMPARISON

In this section, we conduct a cross-platform comparison of the TPU, GPU, and CPU, so that users can choose the most suitable platform based on their models of interest. We find that there are scenarios in which each of the platforms is valuable, trading off flexibility and specialization. We also discuss the implications for future architecture designs. The following is a summary of the key takeaways:

• TPU is highly optimized for large batches and CNNs, and has the highest training throughput.
• GPU shows better flexibility and programmability for irregular computations, such as small batches and non-MatMul computations. The training of large FC models also benefits from its sophisticated memory system and higher bandwidth.

• CPU has the best programmability, so it achieves the highest FLOPS utilization for RNNs, and it supports the largest models because of its large memory capacity.

Figure 7: Examples/second of fully-connected models with fixed layer count (64), for (a) CPU, (b) GPU, and (c) TPU. Examples/second decreases with node count and increases with batch size. White squares indicate models that encounter out-of-memory issues. The CPU platform runs the largest models because of its large memory.

Figure 8: Small FC models with large batch sizes prefer the TPU, and large models with small batch sizes prefer the GPU, indicating that systolic arrays are better with large matrices, and that warp scheduling on the GPU is more flexible for small matrices. [(a) LR weights; (b) speedups colored by batch size; (c) speedups colored by node count.]

We consider two performance metrics, examples/second and speedup. Examples/second measures the number of examples trained per second, i.e., throughput. We use it as a proxy for end-to-end performance.
The speedup of one platform over another is the ratio of the former's performance (examples/second) to the latter's.

5.1 Fully-Connected DNNs

This subsection provides a systematic analysis of the performance and speedups of fully-connected (FC) models.

Examples/second Figure 7 shows throughput for varying node counts and batch sizes with a fixed layer count (64). We use the LR weights introduced in Section 4.1 to quantify the hyperparameter effects (not shown owing to space limitations). Layer and node counts have negative weights, because it is time consuming to train large models with many layers and nodes. Batch size greatly improves examples/second on GPU and TPU, but not on CPU, because the parallelism available with small batch sizes is already enough to highly utilize the CPU.

It is interesting to note that only the CPU supports the largest models, and the GPU supports larger models than the TPU. This is because every hardware core keeps one copy of the model, so the largest supported model is determined by memory per core, as explained in Section 3. In Figure 7, the white squares indicate models that encounter out-of-memory (OOM) issues. The CPU has the highest memory per core (120 GB), and the GPU (16 GB) is higher than the TPU (8 GB). While TPUs and GPUs may draw more attention, as of today the only choice for extremely large models is the CPU, which supports all model sizes. For example, Facebook reports using dual-socket, high-memory CPU servers to train ranking models for News Feed and to perform anomaly detection (Sigma), both of which are fully-connected networks [20]. That fact emphasizes the need for model parallelism and pipelining [15, 30, 8] on GPU and TPU, so that those powerful accelerators can support larger models.

TPU over GPU Speedup To further investigate the best hardware platform for an FC model, we analyze TPU over GPU speedups. Figure 8(a) plots the linear regression weights across FC hyperparameters for the TPU over GPU speedup.
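The memory-per-core argument can be sanity-checked with a back-of-the-envelope footprint estimate. This is a rough sketch, not the paper's methodology: the input/output widths and the assumption of three float32 copies per parameter (weights, gradients, optimizer state) are hypothetical, and activations and framework overhead are ignored:

```python
# Per-core memory limits quoted in the text, in GB.
MEM_GB = {"CPU": 120, "GPU": 16, "TPU": 8}

def fc_weight_params(layers, nodes, n_input=2000, n_output=1000):
    """Parameter count of an FC net; n_input/n_output are hypothetical."""
    return n_input * nodes + (layers - 1) * nodes * nodes + nodes * n_output

def fits(platform, layers, nodes, copies=3, bytes_per=4):
    """Would weights + gradients + optimizer state fit on one core?"""
    need = fc_weight_params(layers, nodes) * copies * bytes_per
    return need <= MEM_GB[platform] * 2**30

# The largest swept FC in Figure 7: 64 layers, 8192 nodes per layer.
for p in ("CPU", "GPU", "TPU"):
    print(p, fits(p, 64, 8192))   # CPU True, GPU False, TPU False
```

Even this crude estimate (~4.3 billion parameters, ~48 GB for three float32 copies) reproduces the qualitative picture: only the CPU's 120 GB per core accommodates the largest swept FC.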
To show the design space of FC models, Figures 8(b)–8(c) are scatter plots showing the number of model parameters on the x-axis and speedups on the y-axis. To display the effects of the hyperparameters, we color the data points by batch size (Figure 8(b)) and node count (Figure 8(c)). Overall, 62% of the FC models perform better on the TPU (speedup > 1).

The TPU is well suited to large-batch training, because systolic arrays are very good at increasing throughput [35]. The positive weight in Figure 8(a) and the horizontal color bands in Figure 8(b) show that large batch size is the key to higher TPU over GPU speedup. This suggests that the matrix multiply units (MXUs) of the TPU, implemented with systolic arrays [32, 14], need large batches to reach full utilization. The GPU is a better choice for small batch sizes, because it executes computation in warps, so it packs small batches onto stream multiprocessors and schedules them more easily [39].

The GPU is a better choice for large models and datasets, suggesting that it is more optimized for the memory reuse/streaming requirements of large FCs. Large models and datasets lower the speedups, as shown by the negative weights of node count, layer count, and input size in Figure 8(a) and the scatter plot of Figure 8(c), and corroborated by the overall negative correlation of speedup with the number of parameters in Figure 8. FC models have minimal weight reuse, and large models have more weights, so they put a lot of pressure on the memory system. The GPU has a more mature memory system and higher memory bandwidth than the TPU, which makes the GPU better suited to the memory requirements of large FC models.

GPU over CPU Speedup The speedup of GPU over CPU is an interesting comparison to TPU over GPU. Figure 9(a) shows the LR weights from learning the GPU over CPU speedup, and Figure 9(b) shows the design space colored by node count.
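The LR-weight readings above can be sketched as an ordinary least-squares fit of speedup against normalized hyperparameters. The data below is synthetic, and the exact normalization and target transform used in the paper's Section 4.1 are assumptions; the point is only how the sign and magnitude of each weight are read off, as in Figure 8(a):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design space: each column is a normalized hyperparameter.
names = ["batch", "node", "layer", "input", "output"]
X = rng.uniform(-1, 1, size=(200, len(names)))
# Fabricated ground truth mimicking Fig. 8(a): batch size helps the TPU,
# while node/layer/input counts favor the GPU.
true_w = np.array([0.8, -0.5, -0.2, -0.3, 0.05])
log_speedup = X @ true_w + rng.normal(0, 0.1, size=200)

# Least-squares fit; the fitted weights are read like Fig. 8(a)'s bars.
w, *_ = np.linalg.lstsq(X, log_speedup, rcond=None)
for n, v in zip(names, w):
    print(f"{n:>6}: {v:+.2f}")
```

A positive weight (here, batch size) means increasing that hyperparameter raises the TPU-over-GPU speedup; a negative weight means it favors the GPU.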
The GPU is a better platform for large FC models, because its architecture better exploits the extra parallelism of large batches and models. As shown by Figure 9, large models have higher speedups on the GPU. We also observe that large FC models prefer the GPU over the TPU, witnessed by the positive trend in Figure 9(b) and the negative trends in Figures 8(b)–8(c). So the GPU is the best platform for large FC models, but models with large batch sizes perform best on the TPU, and better on the GPU than on the CPU.

Figure 9: Large FC models with large batch sizes are better suited to the GPU than the CPU because the GPU's architecture can better utilize the extra parallelism. [(a) LR weights; (b) design space colored by node count.]

5.2 CNN and RNN

We now describe the speedups of CNNs and RNNs. Since our conclusions for CPUs and the hyperparameter LR weights on examples/second are similar to those in the previous section, we omit those results in the interest of brevity.

CNN Figures 10(a)–10(c) show the speedups of TPU over GPU. All CNNs perform better on the TPU. Batch size is still the key to better TPU over GPU speedup for CNNs, shown by its positive LR weight in Figure 10(a) and the increasing speedup with batch size in Figure 10(b). The TPU is the best platform for large CNNs, suggesting that the TPU architecture is highly optimized for the spatial reuse characteristics of CNNs. This is shown by the positive weights in Figures 10(a) and 10(c), where models with more filters and blocks have higher speedups. This differs from Section 5.1, which showed that the TPU is not preferred for large FCs. It suggests that it is easier for the TPU to optimize for large CNNs than for large FCs, which may be because CNNs reuse weights; FC models barely reuse weights, which introduces more memory traffic. The GPU is a feasible choice for small CNNs.
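The weight-reuse contrast between CNNs and FCs can be quantified as FLOPs per weight byte. A rough sketch with hypothetical layer shapes (a square FC layer versus a 3×3 convolution on a 56×56 feature map):

```python
def fc_reuse(batch, n_in, n_out, bytes_per=4):
    """FLOPs per weight byte for one FC layer (a single GEMM)."""
    flops = 2 * batch * n_in * n_out
    weight_bytes = n_in * n_out * bytes_per
    return flops / weight_bytes            # = 2 * batch / bytes_per

def conv_reuse(batch, h, w, k, c_in, c_out, bytes_per=4):
    """FLOPs per weight byte for one k-by-k convolution layer."""
    flops = 2 * batch * h * w * k * k * c_in * c_out
    weight_bytes = k * k * c_in * c_out * bytes_per
    return flops / weight_bytes            # = 2 * batch * h * w / bytes_per

print(fc_reuse(128, 4096, 4096))           # 64.0 FLOPs per weight byte
print(conv_reuse(128, 56, 56, 3, 64, 64))  # 200704.0 FLOPs per weight byte
```

Each convolution weight is reused at every spatial position, so its reuse exceeds the FC layer's by a factor of h × w (here 56 × 56 = 3136), which is why large CNNs stay compute-bound where large FCs become memory-bound.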
These conclusions about CNNs apply only to single-GPU performance; the multi-GPU case may be different.

RNN Figures 10(d)–10(e) show the speedups of TPU over GPU. We display the embedding size in Figure 10(e), because the magnitude of its weight is the greatest in Figure 10(d). Embedding size has a negative weight in Figure 10(d), and embedding computation is sparser than matrix multiplication. This suggests that the TPU is less flexible for non-MatMul computations than the GPU; the TPU is better at dense computations such as MatMuls. Even so, RNNs are still up to 20× faster on the TPU. Optimizing non-MatMul computations is another opportunity for TPU enhancement.

5.3 Overall Comparison

This section summarizes the speedup of TPU over GPU and the FLOPS utilization of all parameterized and real models. We do not show results for training CNNs on CPUs, because doing so is extremely time consuming and unlikely to contribute additional insights.

TPU over GPU Speedup Figure 11(top) summarizes the TPU over GPU speedups of all models. Note that the real workloads use larger batch sizes on the TPU than on the GPU. The speedup of TPU over GPU depends heavily on the nature of the workload measured. The speedups of the parameterized models span a large range, from less than 1× to 10×, while the speedups of the real workloads range from 3× (DenseNet) to 6.8× (SqueezeNet). ParaDnn represents a more complete view of potential workloads, and each real workload represents the concerns of certain users.
Benchmarking platforms with both kinds of workloads offers a more systematic understanding of their behavior than benchmarking with only one kind.

Figure 10: (a)–(c) The TPU is a better choice than the GPU for large CNNs, suggesting that the TPU is highly optimized for CNNs. (d)–(e) While the TPU is a better choice for RNNs, it is not as flexible as the GPU for embedding computations.

To further compare TPU and GPU while relaxing the constraint on the GPU's software stack, we also include the speedup relative to the GPU performance of ResNet-50 reported in NVIDIA's Developer Blog [9] (annotated as NVIDIA in Figure 11(top)). We note that NVIDIA's version of ResNet-50 uses unreleased libraries, and we were unable to reproduce the results. The speedup using ResNet-50 from Google is 6.2× compared to 4.2×, which suggests that software optimization can significantly impact performance.

FLOPS Utilization Figure 11(bottom) shows the FLOPS utilization of all workloads and platforms. On average, the maximum FLOPS utilization of the TPU is 2.2× that of the GPU for all CNN models, and the ratio is 3× for RNNs. The TPU FLOPS utilization of Transformer is consistent with FCs with 4k batch size, as shown in Figure 2. For RNNs, the TPU has less than 26% FLOPS utilization and the GPU has less than 9%. In contrast, the CPU has up to 46% utilization.
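FLOPS utilization, the metric compared throughout this subsection, is simply achieved FLOPS over peak FLOPS. A minimal sketch (the per-example FLOP count and throughput below are hypothetical, not measured values from the paper):

```python
def flops_utilization(examples_per_sec, flops_per_example, peak_flops):
    """Achieved FLOPS as a fraction of the platform's peak."""
    return examples_per_sec * flops_per_example / peak_flops

# Hypothetical model: 2e10 FLOPs per training example, 1000 examples/s,
# on a 180-TFLOPS TPU v2 board.
u = flops_utilization(1000, 2e10, 180e12)
print(f"{u:.1%}")   # 11.1%
```

The same throughput thus maps to very different utilization figures on platforms with different peaks, which is why the paper reports both examples/second and utilization.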
RNNs have irregular computations compared to FCs and CNNs, owing to the temporal dependence in the cells and the variable-length input sequences. The parameterized RNNs are very basic, however; advanced RNN optimizations may be able to increase utilization on GPU and TPU.

ResNet-50 and RetinaNet have higher FLOPS utilization than DenseNet and SqueezeNet. The real workloads are ranked by number of trainable parameters, shown in Figure 1. DenseNet has lower utilization because it has fewer filters than ResNet-50: DenseNet's maximum number of filters is 24 [28], while the minimum in ResNet-50 is 64 [21]. SqueezeNet is designed specifically to have fewer parameters through the use of 1×1 filters [29]. Parallel operations therefore represent a smaller portion of the whole workload, and as a consequence of Amdahl's law, these small models are unable to utilize the parallelism available on the GPU or TPU. ResNet-50 has higher FLOPS utilization than the parameterized CNNs with bottleneck blocks. This is because the parameterized CNNs keep the number of blocks the same in each group, while ResNet-50 has more blocks in the groups with more filters, which increases FLOPS.

Figure 11: (Top) TPU over GPU speedups of all workloads. Note that the real workloads use larger batch sizes on TPU than on GPU. The NVIDIA version of ResNet-50 is from [9]. (Bottom) FLOPS utilization comparison for all platforms.

6. SOFTWARE STACK ADVANCES

Custom hardware for deep learning opens opportunities for dramatic library, toolkit, and compiler optimizations. We now describe how different versions of TensorFlow (TF) and CUDA affect performance. We study data type quantization together with software versions, because it depends on software support.
As a reminder, for all results in the previous sections we use the latest version of each software stack with 16-bit quantization support. Software versions are summarized in the legends of Figure 12. ParaDnn can reveal a software release's optimization focus (e.g., TF 1.9 optimizes small-batch CNNs); we omit these details for brevity.

6.1 TensorFlow Versions and TPU Performance

The compiler for the TPU is XLA [36], shipped with TF. Figure 12(a) shows TPU speedups obtained by running TF 1.7 to 1.12, treating TF 1.7 with float32 as the baseline. The speedup is per model, maximizing the batch size in each setting; for example, using bfloat16 instead of float32 allows a larger batch size and thus a higher speedup (these experiments do not consider the impact of quantization on model accuracy). Moving from TF 1.7 to 1.12 improves performance for all ParaDnn models. Although FC and CNN encounter a performance regression with TF 1.8, TF 1.9 fixes this anomaly and improves overall performance. RNN performance is not improved much until TF 1.11, which shows 10× speedup for RNN and 7.5× for LSTM and GRU. Transformer, ResNet-50, and RetinaNet improve continuously over the TF updates. Interestingly, SqueezeNet improves starting from TF 1.11, while DenseNet and MobileNet see little benefit.

In the 7 months (222 days) between the releases of TF 1.7.0 (03/29/2018) and TF 1.12.0 (11/05/2018), software stack performance improved significantly. The 90th-percentile speedup on TPU is 7× for FC, 1.5× for residual-block CNNs, 2.5× for bottleneck-block CNNs, 9.7× for RNN, and 6.3× for LSTM and GRU.

The use of bfloat16 enables significant performance improvements for parameterized FC and CNN models. The 90th-percentile speedups are up to 1.8× for FC and bottleneck-block CNNs, and 1.3× for residual-block CNNs. Depending on the relative memory sizes of the data and model, the TPU can usually
support doubled batch sizes by using 16 bits. Transmitting 16 bits also relieves bandwidth pressure, which can speed up memory-bound operations, as discussed in Sections 4.2 and 4.4. Larger performance increases may be possible with further reductions in bitwidth.

Figure 12: (a) TPU performance across TensorFlow updates (baseline: TF 1.7 with float32; bars: TF 1.7 through TF 1.12 with bfloat16). All ParaDnn models improve; Transformer, RetinaNet, and ResNet-50 improve steadily. (b) GPU speedups across versions of CUDA and TF (baseline: TF 1.7 with CUDA 9.0 and float32). CUDA 9.2 improves CNNs more than other ParaDnn models, and ResNet-50 more than other real models. CUDA 10 does not improve RNNs or SqueezeNet.

6.2 CUDA Versions and GPU Performance

Figure 12(b) shows GPU performance across versions of CUDA and TF. The baseline is TF 1.7 and CUDA 9.0 with float32. TF 1.8 does not improve GPU performance. By lowering memory traffic and enabling larger batch sizes, bitwidth reduction can speed up CNNs by more than 2×. We note that CUDA 9.2 speeds up ResNet-50 significantly more (8%) than the other real workloads (< 1%); CUDA 9.2 also speeds up ParaDnn CNNs more than FCs or RNNs. CUDA 10 speeds up the other models, but not SqueezeNet. CUDA 10 also improves speedups for ParaDnn FCs and CNNs, but not as much for RNNs. The overall 90th-percentile improvement for FCs is 5.2×.
For ParaDnn residual-block and bottleneck-block models it is 2.9× and 2.6×, respectively. In contrast, the 90th-percentile improvements for the parameterized models are 8.6% for RNN, 3.5% for LSTM, and 5.9% for GRU. The improvement from CUDA updates is less than that from TF updates on the TPU, likely because CUDA and GPU platforms have matured greatly since becoming popular before 2010, while TPU v2 for training was only announced in May 2017.

7. LIMITATIONS OF THIS WORK

Scope of this Work This work does not study DL inference, cloud overhead, multi-node systems, accuracy, or convergence. We intentionally leave these topics to future work, as each deserves in-depth study. For example, evaluating inference entails different metrics, such as latency, and a different experimental setup, as network overhead may have a large effect. Section 4.4 provides insight towards quantifying the network overhead, and we use synthetic data to minimize the cloud overhead, but virtualization, resource allocation, and job scheduling raise further research questions.

NVIDIA's eight-node DGX-1 and Google's 256-TPU systems are not studied here. Studying multi-node systems involves more system parameters, including the number of nodes, inter-node bandwidth, interconnect topology, and synchronization mechanisms. Cloud system overhead also becomes more acute in multi-node systems.

The validity of extrapolating training throughput to time-to-accuracy remains an open question. Recent work studied the number of training steps to accuracy as a function of batch size [47]. It shows that very large batch sizes result in sub-linear scaling, but the best batch size depends largely on the model and optimizer. In a multi-node system, synchronization becomes more complicated, which results in different convergence behavior.
Tractability. To keep the experiments tractable, we constrain the parameters in this work, including the ParaDnn hyperparameters (Table 2) and the TPU iterations. For example, we focus on large batches, as the platforms were designed for large-batch training, and extremely small batches may lead to different conclusions. We use the RMSProp optimizer throughout; SGD with momentum performs faster than RMSProp.

8. RELATED WORK

Benchmarks. “For better or worse, benchmarks shape a field,” said David Patterson [40]. Indeed, benchmarks have been the driving force behind compiler and architecture design for decades; notable examples include the SPEC CPU [25] and PARSEC multiprocessor [7] benchmarks. Recently, work has focused on domain-specific benchmark suites, including CortexSuite [52], TonicSuite [18], Sirius [19], Fathom [3], DAWNBench [13], and MLPerf [41]. It is impossible to draw performance conclusions without benchmarks, and benchmark designers must take care to avoid bias. Existing benchmark suites come with limitations, as discussed in Section 2. ParaDnn is the first parameterized benchmark suite for deep learning in the literature. In the same spirit, synthetic benchmarks such as BenchMaker [31] and SYMPO [16] have commonly been used, constructing benchmarks with hardware-independent characteristics. Some try to match the statistical characteristics of real applications [55, 33]. Synthetic approaches are common in domain-specific benchmarking, e.g., CAD [53, 50], statistical network inference [46], and databases [45].

Benchmarking. Our use of deep learning models to compare up-to-date platforms, Google's TPU v2/v3 and NVIDIA's V100 GPU, distinguishes this work from previous cross-platform comparisons. Shi et al. compare CPU (Intel i7-3820 and E5-2630v3) and GPU (GTX 980, GTX 1080, and K80) platforms and deep learning frameworks [48]. Bahrampour et al.
compare deep learning frameworks [5]. Others compare cloud computing providers [34], heterogeneous platforms [10], and cloud support for HPC [22].

9. CONCLUSION

This paper provides a comprehensive benchmarking analysis of deep neural network training hardware and software, and valuable lessons learned for future system designs. We present architectural bottlenecks of the TPU platform and provide suggestions for future improvement. Using ParaDnn, our parameterized benchmark suite for end-to-end deep learning, along with six real-world models, we compare the hardware and software of the TPU, GPU, and CPU platforms. We present several new observations and insights into the design of specialized hardware and software for deep learning and motivate the need for further work in this field.

10. ACKNOWLEDGEMENT

This work was supported in part by Google's TensorFlow Research Cloud (TFRC) program, NSF Grant #CCF-1533737, and the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by DARPA. The authors would like to thank Frank Chen, Blake Hechtman, Jim Held, Glenn Holloway, Dan Janni, Peter Mattson, Lifeng Nai, David Patterson, Francesco Pontiggia, Parthasarathy Ranganathan, Vijay Reddi, Bjarke Roune, Brennan Saeta, Zak Stone, Sophia Shao, Anitha Vijayakumar, Shibo Wang, Qiumin Xu, Doe Hyun Yoon, and Cliff Young for their support and feedback.

11. REFERENCES

[1] “TensorFlow: Using JIT compilation,” https://www.tensorflow.org/xla/jit, 2018.
[2] “TPU system architecture,” https://cloud.google.com/tpu/docs/system-architecture, Google Cloud Documentation, 2018.
[3] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: Reference workloads for modern deep learning methods,” in Workload Characterization (IISWC), 2016 IEEE International Symposium on. IEEE, 2016, pp. 1–10.
[4] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J.
Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
[5] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, “Comparative study of Caffe, Neon, Theano, and Torch for deep learning,” in ICLR, 2016.
[6] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8-bit training of neural networks,” arXiv preprint, 2018.
[7] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[8] Google AI Blog, “Introducing GPipe, an open source library for efficiently training large-scale neural network models,” https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html, 2019.
[9] L. Case, “Volta Tensor Core GPU achieves new AI performance milestones,” NVIDIA Developer Blog, 2018.
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Workload Characterization (IISWC), 2009 IEEE International Symposium on. IEEE, 2009, pp. 44–54.
[11] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: End-to-end optimization stack for deep learning,” arXiv preprint arXiv:1802.04799, 2018.
[12] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “BenchNN: On the broad potential application scope of hardware neural network accelerators,” in Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012, pp. 36–45.
[13] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M.
Zaharia, “DAWNBench: An end-to-end deep learning benchmark and competition,” Training, vol. 100, no. 101, p. 102, 2017.
[14] J. Dean, “Recent advances in artificial intelligence and the implications for computer system design,” Hot Chips, 2017.
[15] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[16] K. Ganesan, J. Jo, W. L. Bircher, D. Kaseridis, Z. Yu, and L. K. John, “System-level max power (SYMPO): A systematic approach for escalating system-level power consumption using synthetic benchmarks,” in Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on. IEEE, 2010, pp. 19–28.
[17] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[18] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge, R. G. Dreslinski, J. Mars, and L. Tang, “DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 27–40.
[19] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. G. Dreslinski, T. Mudge, V. Petrucci, L. Tang et al., “Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,” in the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 50, no. 4. ACM, 2015, pp. 223–238.
[20] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X.
Wang, “Applied machine learning at Facebook: A datacenter infrastructure perspective,” in High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 2018, pp. 620–629.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[22] Q. He, S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, “Case study for running HPC applications in public clouds,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 395–401.
[23] J. Hennessy and D. Patterson, “A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced.”
[24] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[25] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[26] R. V. Hogg, J. McKean, and A. T. Craig, Introduction to Mathematical Statistics. Pearson Education, 2005.
[27] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, 2017, p. 3.
[29] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[30] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” arXiv preprint, 2018.
[31] A. Joshi, L. Eeckhout, and L.
John, “The return of synthetic benchmarks,” in 2008 SPEC Benchmark Workshop, 2008, pp. 1–11.
[32] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017, pp. 1–12.
[33] K. Kim, C. Lee, J. H. Jung, and W. W. Ro, “Workload synthesis: Generating benchmark workloads from statistical execution profile,” in Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 2014, pp. 120–129.
[34] K. Kothari, “Comparison of several cloud computing providers,” Elixir Comp. Sci. & Engg., 2011.
[35] H.-T. Kung, “Why systolic architectures?” IEEE Computer, vol. 15, no. 1, pp. 37–46, 1982.
[36] C. Leary and T. Wang, “XLA: TensorFlow, compiled,” TensorFlow Dev Summit, 2017.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” arXiv preprint arXiv:1708.02002, 2017.
[38] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[39] J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, 2010.
[40] D. Patterson, “For better or worse, benchmarks shape a field,” Communications of the ACM, vol. 55, 2012.
[41] ——, “MLPerf: SPEC for ML,” https://rise.cs.berkeley.edu/blog/mlperf-spec-for-ml/, 2018.
[42] TensorFlow repository for TPU models, https://github.com/tensorflow/tpu, GitHub, 2018.
[43] Baidu Research, “DeepBench,” https://github.com/baidu-research/DeepBench, 2017.
[44] N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman, R. Levenstein, B. Maher, S. Nadathur, J. Olesen et al.
, “Glow: Graph lowering compiler techniques for neural networks,” arXiv preprint arXiv:1805.00907, 2018.
[45] M. Saleem, Q. Mehmood, and A.-C. N. Ngomo, “FEASIBLE: A feature-based SPARQL benchmark generation framework,” in International Semantic Web Conference. Springer, 2015, pp. 52–69.
[46] T. Schaffter, D. Marbach, and D. Floreano, “GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods,” Bioinformatics, vol. 27, no. 16, pp. 2263–2270, 2011.
[47] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the effects of data parallelism on neural network training,” arXiv preprint, 2018.
[48] S. Shi, Q. Wang, P. Xu, and X. Chu, “Benchmarking state-of-the-art deep learning software tools,” in Cloud Computing and Big Data (CCBD), 2016 7th International Conference on. IEEE, 2016, pp. 99–104.
[49] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
[50] D. Stroobandt, P. Verplaetse, and J. Van Campenhout, “Generating synthetic benchmark circuits for evaluating CAD tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 9, pp. 1011–1022, 2000.
[51] J.-H. Tao, Z.-D. Du, Q. Guo, H.-Y. Lan, L. Zhang, S.-Y. Zhou, L.-J. Xu, C. Liu, H.-F. Liu, S. Tang, W. Chen, S.-L. Liu, and Y.-J. Chen, “BenchIP: Benchmarking intelligence processors,” Journal of Computer Science and Technology, vol. 33, no. 1, pp. 1–23, 2018.
[52] S. Thomas, C. Gohkale, E. Tanuwidjaja, T. Chong, D. Lau, S. Garcia, and M. B. Taylor, “CortexSuite: A synthetic brain benchmark suite,” in IISWC, 2014, pp. 76–79.
[53] M. Turki, H. Mehrez, Z. Marrakchi, and M. Abid, “Towards synthetic benchmarks generator for CAD tool evaluation,” in Ph.D.
Research in Microelectronics and Electronics (PRIME), 2012 8th Conference on. VDE, 2012, pp. 1–4.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[55] W. Wei, L. Xu, L. Jin, W. Zhang, and T. Zhang, “AI Matrix: Synthetic benchmarks for DNN,” arXiv preprint arXiv:1812.00886, 2018.
[56] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[57] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.