Sparse Deep Neural Network Graph Challenge
Jeremy Kepner 1,2,3, Simon Alford 2, Vijay Gadepally 1,2, Michael Jones 1, Lauren Milechin 4, Ryan Robinett 3, Sid Samsi 1

1 MIT Lincoln Laboratory Supercomputing Center, 2 MIT Computer Science & AI Laboratory, 3 MIT Mathematics Department, 4 MIT Dept. of Earth, Atmospheric, & Planetary Sciences

Abstract—The MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The proposed Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The Sparse DNN Challenge is based on a mathematically well-defined DNN inference computation and can be implemented in any programming environment. Sparse DNN inference is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The input data sets are derived from the MNIST handwritten digits. The surrounding I/O and verification provide the context for each sparse DNN inference that allows rigorous definition of both the input and the output. Furthermore, since the proposed Sparse DNN Challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Reference implementations have been developed and their serial and parallel performance have been measured. Specifications, data, and software are publicly available at GraphChallenge.org.

I. INTRODUCTION

MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. GraphChallenge.org provides a well-defined community venue for stimulating research and highlighting innovations in graph and sparse data analysis software, hardware, algorithms, and systems. The target audience for these challenges is any individual or team that seeks to highlight their contributions to graph and sparse data analysis software, hardware, algorithms, and/or systems.

As research in artificial neural networks progresses, the sizes of state-of-the-art deep neural network (DNN) architectures put increasing strain on the hardware needed to implement them [1], [2]. In the interest of reduced storage and runtime costs, much research over the past decade has focused on the sparsification of artificial neural networks [3]–[13]. In the listed resources alone, the methodology of sparsification includes Hessian-based pruning [3], [4], Hebbian pruning [5], matrix decomposition [9], and graph techniques [10]–[13].

This material is based in part upon work supported by the NSF under grants DMS-1312831 and CCF-1533644, and USD(R&E) under contract FA8702-15-D-0001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or USD(R&E).

Increasingly, large amounts of data are collected from social media, sensor feeds (e.g., cameras), and scientific instruments, and are analyzed with graph and sparse data analytics to reveal the complex relationships between different data feeds [14].
Many graph and sparse data analytics are executed in large data centers on large cached or static data sets. The processing required is a function of both the size and the type of the data being processed. There is also an increasing need to make decisions in real time to understand how relationships represented in graphs or sparse data evolve. Previous research on streaming analytics has been limited by the amount of processing required. Graph and sparse analytic updates must be performed at the speed of the incoming data. Sparseness can make the application of analytics on current processors extremely inefficient. This inefficiency has either limited the size of the data that can be addressed to only what can be held in main memory, or has required an extremely large cluster of computers to make up for it. The development of a novel sparse AI analytics system has the potential to enable the discovery of relationships as they unfold in the field, rather than relying on forensic analysis in data centers. Furthermore, data scientists could explore associations previously thought impractical due to the amount of processing required.

The Subgraph Isomorphism Graph Challenge [15] and the Stochastic Block Partition Challenge [16] have enabled a new generation of graph analysis systems by highlighting the benefits of novel innovations in these systems. Similarly, the proposed Sparse DNN Challenge seeks to highlight innovations that are applicable to emerging sparse AI and machine learning systems.

Challenges such as YOHO [17], MNIST [18], HPC Challenge [19], ImageNet [20], and VAST [21], [22] have played important roles in driving progress in fields as diverse as machine learning, high performance computing, and visual analytics. YOHO is the Linguistic Data Consortium database for voice verification systems and has been a critical enabler of speech research. The MNIST database of handwritten digits has been a bedrock of the computer vision research community for two decades. HPC Challenge has been used by the supercomputing community to benchmark and acceptance-test the largest systems in the world, as well as to stimulate research on new parallel programming environments. ImageNet populated an image dataset according to the WordNet hierarchy consisting of over 100,000 meaningful concepts (called synonym sets, or synsets) [20], with an average of 1000 images per synset, and has become a critical enabler of vision research. The VAST Challenge is an annual visual analytics challenge that has been held every year since 2006; each year, VAST offers a new topic and submissions are processed like conference papers. The Sparse DNN Graph Challenge seeks to draw on the best of these challenges, and particularly the VAST Challenge, in order to highlight innovations across the algorithms, software, hardware, and systems spectrum.

The focus on graph analytics allows the Sparse DNN Graph Challenge to also draw upon significant work from the graph benchmarking community. The Graph500 (Graph500.org) benchmark (based on [23]) provides a scalable power-law graph generator [24] (used to build the world's largest synthetic graphs) with the goal of optimizing the rate of building a tree of the graph. The Firehose benchmark (see http://firehose.sandia.gov) simulates computer network traffic for performing real-time analytics on network traffic.
The PageRank Pipeline benchmark [25], [26] uses the Graph500 generator (or any other graph) and provides reference implementations in multiple programming languages to allow users to optimize the rate of computing PageRank (the first eigenvector) on a graph. Finally, miniTri (see mantevo.org) [27], [28] takes an arbitrary graph as input and optimizes the time to count triangles.

The organization of the rest of this paper is as follows. Section II provides information on the relevant DNN mathematics and computations. Section III describes the synthetic DNNs used in the Challenge. Section IV provides the specifics of the input feature dataset based on MNIST images. Section V lays out the Sparse DNN Challenge steps and example code. Section VI discusses relevant metrics for the Challenge. Section VII summarizes the work and describes future directions.

II. DEEP NEURAL NETWORKS

Machine learning has been the foundation of artificial intelligence since its inception [29]–[36]. Standard machine learning applications include speech recognition [31], computer vision [32], and even board games [33], [37].
Fig. 1: Typical network elements i and j showing connection weights w (reproduced from [30]).

Drawing inspiration from biological neurons to implement machine learning was the topic of the first paper presented at the first machine learning conference in 1955 [29], [30] (see Figure 1). It was recognized very early on in the field that direct computational training of neural networks was computationally infeasible with the computers that were available at that time [35]. The many-fold improvement in neural network computation and theory has made it possible to create neural networks capable of better-than-human performance in a variety of domains [38]–[41]. The production of validated data sets [42]–[44] and the power of graphics processing units (GPUs) [45]–[48] have allowed the effective training of deep neural networks (DNNs) with 100,000s of input features, $N$, and 100s of layers, $L$, that are capable of choosing from among 100,000s of categories, $M$ (see Figure 2).

Fig. 2: Four-layer ($L = 4$) deep neural network architecture for categorizing images. The input features $y_0$ of an image are passed through a series of network layers $W_{\ell=0,1,2,3}$, with bias terms $b_{\ell=0,1,2,3}$, that produce scores for categories $y_{L=4}$. (Figure adapted from [49].)
The impressive performance of large DNNs provides motivation to explore even larger networks. However, increasing $N$, $L$, and $M$ each by a factor of 10 results in a 1000-fold increase in the memory required for a DNN. Because of these memory constraints, trade-offs are currently being made in terms of precision and accuracy to save storage and computation [11], [50]–[52]. Thus, there is significant interest in exploring the effectiveness of sparse DNN representations where many of the weight values are zero. As a comparison, the human brain has approximately 86 billion neurons and 150 trillion synapses [53]. Its graph representation would have approximately 2,000 edges per node, or a density of $2 \times 10^3 / 86 \times 10^9 \approx 0.000002\%$. If a large fraction of the DNN weights can be set to zero, storage and computation costs can be reduced proportionately [6], [54]. The interest in sparse DNNs is not limited to their computational advantages. There has also been extensive theoretical work exploring the potential neuromorphic and algorithmic benefits of sparsity [8], [55]–[58].

The primary mathematical operation performed by a DNN is the inference, or forward propagation, step. Inference is executed repeatedly during training to determine both the weight matrices $W_\ell$ and the bias vectors $b_\ell$ of the DNN. The inference computation shown in Figure 2 is given by

$$y_{\ell+1} = h(y_\ell W_\ell + b_\ell)$$

where $h()$ is a nonlinear function applied to each element of the vector. The Sparse DNN Challenge uses the standard graph community convention whereby $W(i,j) \neq 0$ implies a connection between neuron $i$ and neuron $j$. In this convention the $y_\ell$ are row vectors and left matrix multiplication is used to progress through the network. Standard AI definitions can be recovered by transposing all matrices and multiplying on the right. A commonly used function is the rectified linear unit (ReLU), given by $h(y) = \max(y, 0)$, which sets values less than 0 to 0 and leaves other values unchanged. For the Sparse DNN Challenge, $h()$ also has an upper limit set to 32. When training a DNN, or performing inference on many different inputs, it is usually necessary to compute multiple $y_\ell$ vectors at once in a batch that can be denoted as the matrix $Y_\ell$. In matrix form, the inference step becomes

$$Y_{\ell+1} = h(Y_\ell W_\ell + B_\ell)$$

where $B_\ell$ is a replication of $b_\ell$ along columns given by $B_\ell = b_\ell \, |Y_\ell \mathbf{1}|_0$, $\mathbf{1}$ is a column array of 1's, and $|\,|_0$ is the zero norm.
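To make the notation concrete, the following is a minimal MATLAB sketch of a single inference step. The names yl, Wl, and bl are illustrative rather than part of the challenge distribution, and the bias is applied only to nonzero entries, matching the zero-norm replication above and the reference code in Section V.

```matlab
% Minimal sketch of one inference step for a single row vector yl
% through one layer (weight matrix Wl, bias row vector bl).
YMAX = 32;                          % upper limit on h() per the challenge
Z  = yl * Wl;                       % row vector times (sparse) weight matrix
Z  = Z + double(logical(Z)) .* bl;  % add bias only where Z is nonzero,
                                    % matching the |.|_0 replication above
yl1 = min(max(Z, 0), YMAX);         % h(): ReLU clipped at YMAX
```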
III. NEURAL NETWORK DATA

Scale is an important driver of the Graph Challenge, and graphs with billions to trillions of edges are of keen interest. Real sparse neural networks of this size are difficult to obtain from real data. Until such data is available, a reasonable first step is to simulate data with the desired network properties, with an emphasis on the difficult part of the problem, in this case large sparse DNNs.

The RadiX-Net synthetic sparse DNN generator [59] is used to efficiently generate a wide range of pre-determined DNNs. RadiX-Net produces DNNs with a number of desirable properties, such as an equal number of paths between all inputs, outputs, and intermediate layers. The RadiX-Net DNN generation algorithm uses mixed radices to generate DNNs of specified connectedness (see Figure 3), which are then expanded via Kronecker products into larger DNNs. For the Sparse DNN Challenge, different DNNs were created with different numbers of neurons per layer. The RadiX-Net parameters used to create the base DNNs are given in Table I. The base DNNs are then grown into much deeper DNNs by repeatedly randomly permuting and appending the base DNNs. The permutation process preserves the base DNN properties. The scales of the resulting large sparse DNNs are shown in Table II.

Fig. 3: 6-layer, 64-neurons-per-layer, 2-connections-per-neuron RadiX-Net DNN produced from radix set [[2,2,2,2,2,2]]. The Kronecker product of this DNN with [16,16,16,16,16,16,16] produces a 6-layer, 1024-neurons-per-layer DNN with 32 connections per neuron.

Table I: RadiX-Net radix and Kronecker parameters and the resulting base DNN layers, neurons per layer, fraction of nonzeros in the weight matrices (density), and bias value. All DNNs have 32 connections per neuron.

| Radix Set | Kronecker Set | Layers | Neurons/Layer | Density | Bias |
|---|---|---|---|---|---|
| [[2,2,2,2,2,2]] | [16,16,16,16,16,16,16] | 6 | 1024 | 0.03 | -0.30 |
| [[2,2,2,2,2,2,2,2]] | [16,16,16,16,16,16,16,16,16] | 8 | 4096 | 0.008 | -0.35 |
| [[2,2,2,2,2,2,2,2,2,2]] | [16,16,16,16,16,16,16,16,16,16,16] | 10 | 16384 | 0.002 | -0.40 |
| [[2,2,2,2,2,2,2,2,2,2,2,2]] | [16,16,16,16,16,16,16,16,16,16,16,16,16] | 12 | 65536 | 0.0005 | -0.45 |

Table II: Total number of connections = 32 × (Layers) × (Neurons) for the different large sparse DNNs used in the Sparse DNN Challenge.

| Layers | 1024 Neurons | 4096 Neurons | 16384 Neurons | 65536 Neurons |
|---|---|---|---|---|
| 120 | 3,932,160 | 15,728,640 | 62,914,560 | 251,658,240 |
| 480 | 15,728,640 | 62,914,560 | 251,658,240 | 1,006,632,960 |
| 1920 | 62,914,560 | 251,658,240 | 1,006,632,960 | 4,026,531,840 |

IV. INPUT DATA SET

Executing the Sparse DNN Challenge requires input data, or feature vectors, $Y_0$. MNIST (Modified National Institute of Standards and Technology) is a large database of handwritten digits that is widely used for training and testing DNN image processing systems [18]. MNIST consists of 60,000 28×28 pixel images. The Sparse DNN Graph Challenge uses interpolated sparse versions of this entire corpus as input (see Figure 4). Each 28×28 pixel image is resized to 32×32 (1024 neurons), 64×64 (4096 neurons), 128×128 (16384 neurons), and 256×256 (65536 neurons). The resized images are thresholded so that all values are either 0 or 1. Each image is flattened into a single row to form a feature vector. The nonzero values are written as triples to a .tsv file, where each row corresponds to a different image, each column is a nonzero pixel location, and the value is 1.

Fig. 4: The MNIST data set consists of 60,000 handwritten digits [18]. (top) Original 28×28 pixel versions of four MNIST images. (bottom) 256×256 resampled, thresholded versions of the same images.
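The preprocessing above is straightforward to reproduce. The following is a hedged MATLAB sketch for a single image; the challenge distributes the preprocessed .tsv files, so this is for illustration only. The input file name and the 0.5 threshold are assumptions (the exact threshold and pixel ordering are not specified here), and imresize requires the Image Processing Toolbox.

```matlab
% Sketch of preprocessing one MNIST image into a sparse feature row.
% Assumptions: the input file name, the 0.5 threshold, and the
% row-major pixel ordering are all illustrative.
img  = double(imread('mnist_digit.png')) / 255;  % hypothetical 28x28 input
big  = imresize(img, [256 256]);                 % interpolate to 256x256
bits = big > 0.5;                                % threshold to {0,1}
row  = reshape(bits', 1, []);                    % flatten to a 1x65536 row
cols = find(row);                                % nonzero pixel locations

% Write triples: imageID, nonzero column, value (always 1).
fid = fopen('image0001.tsv', 'w');
fprintf(fid, '%d\t%d\t%d\n', ...
        [ones(numel(cols),1), cols(:), ones(numel(cols),1)]');
fclose(fid);
```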
V. SPARSE DNN CHALLENGE

The core of the Sparse DNN Challenge is timing DNN inference using the provided DNNs on the provided MNIST input data and verifying the output with the provided truth categories. The complete process for performing the challenge consists of the following steps (a sketch of a timing harness for these steps follows the reference code below):

• Download from GraphChallenge.org: DNN weight matrices $W_\ell$, sparse MNIST input data $Y_0$, and truth categories
• Load a DNN and its corresponding input
• Create and set the appropriately sized bias vectors $b_\ell$ from Table I
• Timed: Evaluate the DNN equation for all layers: $Y_{\ell+1} = h(Y_\ell W_\ell + B_\ell)$
• Timed: Identify the categories (rows) in the final matrix with entries > 0
• Compare the computed categories with the truth categories to check correctness
• Compute the rate for the DNN: (# inputs) × (# connections) / time
• Report the time and rate for each DNN measured

Reference serial implementations in various programming languages are available at GraphChallenge.org. The Matlab serial reference implementation of the inference calculation is as follows:

```matlab
function Y = inferenceReLUvec(W, bias, Y0)
  % Performs ReLU inference on input feature matrix Y0 using the
  % cell arrays of DNN weight matrices W and bias vectors bias.
  YMAX = 32;                            % upper limit on the ReLU output
  Y = Y0;
  for i = 1:length(W)
    Z = Y * W{i};                       % propagate through layer i
    b = bias{i};
    Y = Z + (double(logical(Z)) .* b);  % apply bias to nonzero entries only
    Y(Y < 0) = 0;                       % ReLU: threshold negative values
    Y(Y > YMAX) = YMAX;                 % apply the upper limit
  end
end
```
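Below is a hedged sketch of a timing harness around the reference code, covering the timed steps and the rate computation from the list above. The loader loadChallengeData and the problem sizes are illustrative placeholders, not part of the official distribution.

```matlab
% Sketch of a timing harness for one DNN (names are hypothetical).
NInputs = 60000; NNeurons = 1024; NLayers = 120;
[W, bias, Y0, trueCategories] = loadChallengeData(NNeurons, NLayers); % hypothetical loader

tic;
Y = inferenceReLUvec(W, bias, Y0);            % timed: evaluate all layers
categories = find(sum(Y, 2) > 0);             % timed: rows with entries > 0
challengeTime = toc;

assert(isequal(categories, trueCategories));  % correctness check (not timed)
NConnections = 32 * NLayers * NNeurons;       % from Table II
rate = NInputs * NConnections / challengeTime;
fprintf('time: %f s, rate: %e (inputs x connections)/s\n', challengeTime, rate);
```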
For a given implementation of the Sparse DNN Challenge, an implementor should keep the following guidance in mind.

Do:
• Use an implementation that could work on real-world data
• Create compressed binary versions of the inputs to accelerate reading the data
• Split the inputs and run in data-parallel mode to achieve higher performance (this requires replicating the weight matrices on every processor and can require a lot of memory)
• Split up the layers and run in pipeline-parallel mode to achieve higher performance (this saves memory, but requires communicating results after each group of layers)
• Use other reasonable optimizations that would work on real-world data

Avoid:
• Exploiting the repetitive structure of the weight matrices, weight values, and bias values
• Exploiting layer independence of the results
• Using optimizations that would not work on real-world data

VI. COMPUTATIONAL METRICS

Submissions to the Sparse DNN Challenge will be evaluated on the overall innovations highlighted by the implementation and on two metrics: correctness and performance.

A. Correctness

Correctness is evaluated by comparing the reported categories with the ground truth categories provided.

B. Performance

The performance of the algorithm implementation should be reported in terms of the following metrics:

• Total number of nonzero connections in the given DNN: this measures the amount of data processed
• Execution time: total time required to perform DNN inference
• Rate: the throughput of the implementation, computed as the number of inputs (e.g., number of MNIST images) times the number of connections in the DNN, divided by the execution time
• Processor: number and type of processors used in the computation

C. Timing Measurements

Serial timing measurements of the Matlab code are shown in Table III and provide one example for reporting results.

Table III: Serial timing measurements of inference rate on different sparse DNNs.

| Neurons per Layer | Layers | Connections (edges) | Time (seconds) | Rate (inputs × edges/sec) |
|---|---|---|---|---|
| 1024 | 120 | 3,932,160 | 626 | 376 × 10^6 |
| 1024 | 480 | 15,728,640 | 2440 | 386 × 10^6 |
| 1024 | 1920 | 62,914,560 | 9760 | 386 × 10^6 |
| 4096 | 120 | 15,728,640 | 2446 | 385 × 10^6 |
| 4096 | 480 | 62,914,560 | 10229 | 369 × 10^6 |
| 4096 | 1920 | 251,658,240 | 40245 | 375 × 10^6 |
| 16384 | 120 | 62,914,560 | 10956 | 344 × 10^6 |
| 16384 | 480 | 251,658,240 | 45268 | 333 × 10^6 |
| 16384 | 1920 | 1,006,632,960 | 179401 | 336 × 10^6 |
| 65536 | 120 | 251,658,240 | 45813 | 329 × 10^6 |
| 65536 | 480 | 1,006,632,960 | 202393 | 299 × 10^6 |
| 65536 | 1920 | 4,026,531,840 | — | — |

Parallel implementations of the Sparse DNN benchmark were developed and tested on the MIT SuperCloud TX-Green supercomputer using pMatlab [60].

Fig. 5: Inference rate versus number of processors for various DNN sizes (N = 1024, 4096, 16384, 65536 neurons per layer; L = 120, 480, 1920 layers).

VII. SUMMARY

The MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The machine learning, high performance computing, and visual analytics communities have wrestled with these difficulties for decades and have developed methodologies for creating challenges to move these communities forward. The proposed Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The Sparse DNN Challenge is based on a mathematically well-defined DNN inference kernel and can be implemented in any programming environment. Sparse DNN inference is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The input data sets are derived from the MNIST handwritten digits. The surrounding I/O and verification provide the context for each sparse DNN inference that allows rigorous definition of both the input and the output. Furthermore, since the proposed Sparse DNN Challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Reference implementations have been developed and their serial and parallel performance have been measured. Specifications, data, and software are publicly available at GraphChallenge.org.

ACKNOWLEDGMENTS

The authors wish to acknowledge the following individuals for their contributions and support: William Arcand, David Bestor, William Bergeron, Bob Bond, Chansup Byun, Alan Edelman, Matthew Hubbell, Anne Klein, Charles Leiserson, Dave Martinez, Mimi McClure, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Victor Roytburd, Siddharth Samsi, Charles Yee, and the entire GraphBLAS.org community for their support and helpful suggestions.

REFERENCES

[1] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
[2] Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Sid Samsi. Sparse deep neural network exact solutions. In High Performance Extreme Computing Conference (HPEC). IEEE, 2018.
[3] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[4] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.
[5] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[6] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[7] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. CoRR, abs/1507.06149, 2015.
[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[9] Baoyuan Liu, Min Wang, H. Foroosh, M. Tappen, and M. Penksy. Sparse convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 806–814, June 2015.
[10] Jeremy Kepner and John Gilbert. Graph Algorithms in the Language of Linear Algebra. SIAM, 2011.
[11] Jeremy Kepner, Manoj Kumar, José Moreira, Pratap Pattnaik, Mauricio Serrano, and Henry Tufo. Enabling massive deep neural networks with the GraphBLAS. In High Performance Extreme Computing Conference (HPEC). IEEE, 2017.
[12] M Kumar, WP Horn, J Kepner, JE Moreira, and P Pattnaik. IBM POWER9 and cognitive computing. IBM Journal of Research and Development, 2018.
[13] J Kepner and H Jananthan. Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs. MIT Press, 2018.
[14] DARPA. Hierarchical Identify Verify Exploit. https://www.fbo.gov/index?s=opportunity&mode=form&id=e3d5ebb6da9795cd0697cb24293f9302, 2017. [Online; accessed 01-January-2017].
[15] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and J. Kepner. Static graph challenge: Subgraph isomorphism. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, Sep. 2017.
[16] Edward Kao, Vijay Gadepally, Michael Hurley, Michael Jones, Jeremy Kepner, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Siddharth Samsi, William Song, Diane Staheli, and Steven Smith. Streaming graph challenge: Stochastic block partition. Sept 2017.
[17] J. P. Campbell. Testing with the YOHO CD-ROM voice verification corpus. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 341–344, May 1995.
[18] Y. LeCun, C. Cortes, and C. J.C. Burges. The MNIST Database. http://yann.lecun.com/exdb/mnist/, 2017. [Online; accessed 01-January-2017].
[19] HPC Challenge. http://www.hpcchallenge.org, 2017. [Online; accessed 01-January-2017].
[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[21] Kristin A. Cook, Georges Grinstein, and Mark A. Whiting. The VAST Challenge: History, scope, and outcomes: An introduction to the special issue. Information Visualization, 13(4):301–312, Oct 2014.
[22] Jean Scholtz, Mark A. Whiting, Catherine Plaisant, and Georges Grinstein. A reflection on seven years of the VAST Challenge.
In Proceedings of the 2012 BELIV Workshop: Beyond Time and Errors - Novel Evaluation Methods for Visualization, BELIV '12, pages 13:1–13:8. ACM, 2012.
[23] D Bader, Kamesh Madduri, John Gilbert, Viral Shah, Jeremy Kepner, Theresa Meuse, and Ashok Krishnamurthy. Designing scalable synthetic compact applications for benchmarking high productivity computing systems. Cyberinfrastructure Technology Watch, 2:1–10, 2006.
[24] Jurij Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 133–145. Springer, 2005.
[25] Patrick Dreher, Chansup Byun, Chris Hill, Vijay Gadepally, Bradley Kuszmaul, and Jeremy Kepner. PageRank pipeline benchmark: Proposal for a holistic system benchmark for big-data platforms. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pages 929–937. IEEE, 2016.
[26] Mauro Bisson, Everett Phillips, and Massimiliano Fatica. A CUDA implementation of the PageRank pipeline benchmark. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE, pages 1–7. IEEE, 2016.
[27] M. M. Wolf, J. W. Berry, and D. T. Stark. A task-based linear algebra building blocks approach for scalable graph analytics. In 2015 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, Sept 2015.
[28] M. M. Wolf, H. C. Edwards, and S. L. Olivier. Kokkos/Qthreads task-parallel approach to linear algebra based graph analytics. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7, Sept 2016.
[29] Willis H Ware. Introduction to session on learning machines. In Proceedings of the March 1-3, 1955, Western Joint Computer Conference, pages 85–85. ACM, 1955.
[30] Wesley A Clark and Bernard G Farley. Generalization of pattern recognition in a self-organizing system. In Proceedings of the March 1-3, 1955, Western Joint Computer Conference, pages 86–91. ACM, 1955.
[31] Oliver G Selfridge. Pattern recognition and modern computers. In Proceedings of the March 1-3, 1955, Western Joint Computer Conference, pages 91–93. ACM, 1955.
[32] GP Dinneen. Programming pattern recognition. In Proceedings of the March 1-3, 1955, Western Joint Computer Conference, pages 94–100. ACM, 1955.
[33] Allen Newell. The chess machine: An example of dealing with a complex task by adaptation. In Proceedings of the March 1-3, 1955, Western Joint Computer Conference, pages 101–108. ACM, 1955.
[34] John McCarthy, Marvin L Minsky, Nathaniel Rochester, and Claude E Shannon. A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine, 27(4):12, 2006.
[35] Marvin Minsky and Oliver G Selfridge. Learning in random nets. In Information Theory: Papers Read at a Symposium on Information Theory Held at the Royal Institution, London, August 29th to September 2nd, pages 335–347. Butterworths, London, 1960.
[36] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
[37] Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
[38] Richard Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2):4–22, 1987.
[39] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn.
Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[41] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[42] Joseph P Campbell. Testing with the YOHO CD-ROM voice verification corpus. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 341–344. IEEE, 1995.
[43] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998.
[44] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[45] Murray Campbell, A Joseph Hoane, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
[46] Michael P McGraw-Herdeg, Douglas P Enright, and B Scott Michel. Benchmarking the NVIDIA 8800GTX with the CUDA development platform. HPEC 2007 Proceedings, 2007.
[47] Andrew Kerr, Dan Campbell, and Mark Richards. GPU performance assessment with the HPEC challenge. In HPEC Workshop 2008, 2008.
[48] Edward A Epstein, Marshall I Schor, BS Iyer, Adam Lally, Eric W Brown, and Jaroslaw Cwiklik. Making Watson fast. IBM Journal of Research and Development, 56(3.4):15–1, 2012.
[49] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[50] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[51] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.
[52] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017.
[53] Frederico A.C. Azevedo, Ludmila R.B. Carvalho, Lea T. Grinberg, José Marcelo Farfel, Renata E.L. Ferretti, Renata E.P. Leite, Wilson Jacob Filho, Roberto Lent, and Suzana Herculano-Houzel. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. The Journal of Comparative Neurology, 513(5):532–541, 2009.
[54] Shaohuai Shi and Xiaowen Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.
[55] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, pages 873–880, 2008.
[56] Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, pages 1185–1192, 2008.
[57] Xavier Glorot, Antoine Bordes, and Yoshua Bengio.
Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.
[58] Dong Yu, Frank Seide, Gang Li, and Li Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4409–4412. IEEE, 2012.
[59] Ryan Robinett and Jeremy Kepner. RadiX-Net: Structured sparse matrices for deep neural networks. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW '19. IEEE Computer Society, 2019.
[60] Jeremy Kepner. Parallel MATLAB for Multicore and Multinode Computers. SIAM, 2009.