Run-Time Efficient RNN Compression for Inference on Edge Devices
Urmish Thakker (Senior Research Engineer, Arm ML Research, urmish.thakker@arm.com), Jesse Beu (Staff Research Engineer, Arm ML Research, jesse.beu@arm.com), Dibakar Gope (Senior Research Engineer, Arm ML Research, dibakar.gope@arm.com), Ganesh Dasika (Principal Research Engineer, Arm ML Research, ganesh.dasika@arm.com), Matthew Mattina (Senior Director, ML and AI Research, Arm ML Research, matthew.mattina@arm.com)

Abstract
Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. HMD creates dense matrices that result in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. On the benchmarks evaluated in this paper, this results in faster inference run-time than pruning and better accuracy than matrix factorization for compression factors of 2-4×.

Keywords: RNN, Compression

ACM Reference Format:
Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, and Matthew Mattina. 2020. Run-Time Efficient RNN Compression for Inference on Edge Devices. In Proceedings of 2019 (EMC2 Workshop). ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction
Recurrent neural networks have shown state-of-the-art results for a wide variety of applications. Though many of these applications run on mobile devices, they are typically enabled by querying a cloud-based system to do most of the computation. The energy, latency, and privacy implications associated with running a query on the cloud are changing where users run a neural network application. We should, therefore, expect an increase in the number of RNNs running on embedded devices. Due to the energy and power constraints of edge devices, embedded SoCs frequently use lower-bandwidth memory technologies and smaller caches compared to desktop and server processors. Thus, there is a need for good compression techniques to enable large RNN models to fit into an edge device and to ensure that they run efficiently on devices with smaller caches [15]. Additionally, compressing models should not negatively impact the inference run-time, as these tasks may have real-time deadlines to provide a good user experience.

In order to choose a compression scheme for a particular network, one needs to consider three different axes: the compression factor, the speedup over the baseline, and the accuracy. Ideally, a good compression algorithm should not sacrifice improvement along one axis for improvement along another. For example, network pruning [7] has been shown to be an effective compression technique, but pruning creates a sparse matrix representation that is inefficient to execute on most modern CPUs. Our analysis shows that pruned networks can achieve a faster run-time than the baseline only at significantly high compression factors.
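To illustrate where that inefficiency comes from, the listing below is a minimal, generic sketch of a matrix-vector product over a CSR (compressed sparse row) matrix, the kind of data structure pruned weights are typically stored in (see Section 4). It is an illustration only, not the kernel used in our experiments, and the names are ours.

#include <cstddef>
#include <vector>

// Illustrative CSR matrix-vector product y = A * x.
// values/col_idx hold the non-zero weights; row_ptr[i]..row_ptr[i+1]
// delimits row i. The per-element indirection through col_idx is the
// traversal overhead that can make a pruned layer slower than a dense
// one unless the sparsity (compression factor) is very high.
void csr_matvec(const std::vector<float>& values,
                const std::vector<int>& col_idx,
                const std::vector<int>& row_ptr,
                const std::vector<float>& x,
                std::vector<float>& y) {
    const std::size_t m = row_ptr.size() - 1;
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            acc += values[j] * x[col_idx[j]];   // gather: irregular accesses into x
        }
        y[i] = acc;
    }
}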
Low-rank matrix factorization (LMF) is another popular compression technique that can achieve speedup proportional to the compression factor. However, LMF has had mixed results in maintaining model accuracy [1, 5, 12]. This is because LMF reduces the rank of a matrix significantly, reducing its expressibility. Lastly, structured matrices [3, 17] can also be used to compress neural networks. While these techniques show a significant reduction in computation, this reduction only translates to a realized run-time improvement for larger matrices [18] or while using specialized hardware [9].

To overcome the problem of finding an alternative to pruning when LMF leads to a loss in accuracy, we introduce a new compression technique called Hybrid Matrix Decomposition (HMD), which can act as an effective compression technique for edge use cases. The results are very promising: HMD achieves iso-accuracy at a large compression factor (2× to 3×), improves the run-time over pruning by a factor of 2×, improves the run-time over a structured matrix-based technique by a factor of 30×, and achieves better model accuracy than LMF.

Figure 1. Representation of a matrix using hybrid decomposition (D1). The original matrix A of size m × n is split into an unconstrained upper block A′ of size r × n and a constrained lower block built from two rank-1 products: B of size (m − r) × 1 with C of size 1 × (n/2), and D of size (m − r) × 1 with E of size 1 × (n/2).

Algorithm 1: Reconstructing A in D1
Input: matrices A′ of dimension r × n, B of dimension (m − r) × 1, C of dimension 1 × (n/2), D of dimension (m − r) × 1, E of dimension 1 × (n/2)
Output: matrix A of dimension m × n
1: G ← B × C
2: H ← D × E
3: K ← concatenate(G, H, column)
4: A ← concatenate(A′, K, row)

The key contributions of this paper are:
• Introduction of a new compression technique called Hybrid Matrix Decomposition that can regain most of the baseline accuracy at 2× to 3× compression factors.
• Comparison of the model accuracy, inference run-time, and compression trade-offs of HMD with network pruning and matrix factorization.

2 Related Work
The research in NN compression can be categorized under four topics: pruning [7, 21], structured matrix based techniques [3, 13, 14, 16], quantization [4], and tensor decomposition [8, 19]. HMD belongs in the structured matrix category. We compare our method against pruning, structured matrix, and tensor decomposition techniques. Quantization is an orthogonal technique and can further compress the models presented in this paper.

3 HMD-Based RNN Compression

Algorithm 2: Matrix-vector product when a matrix uses the HMD technique as shown in D1
Input 1: matrices A′ of dimension r × n, B of dimension (m − r) × 1, C of dimension 1 × (n/2), D of dimension (m − r) × 1, E of dimension 1 × (n/2)
Input 2: vector I of dimension n × 1
Output: vector O of dimension m × 1
1: O_{1:r} ← A′ × I
2: Temp1Scalar ← C × I_{1:n/2}
3: Temp1 ← B ∘ Temp1Scalar
4: Temp2Scalar ← E × I_{1+n/2:n}
5: Temp2 ← D ∘ Temp2Scalar
6: O_{r+1:m} ← Temp1 + Temp2
7: O ← concatenate(O_{1:r}, O_{r+1:m})
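The listing below is a minimal C++ sketch of Algorithms 1 and 2 using the Eigen library (the same library used for the run-time measurements in Section 4). It assumes single-precision weights, an even n, and that the row factors C and E are passed as ordinary vectors of length n/2; the function names are illustrative and do not correspond to our benchmark code.

#include <Eigen/Dense>

using Eigen::MatrixXf;
using Eigen::VectorXf;

// Algorithm 1: expand the HMD (D1) blocks back into the full m x n matrix A.
// Ap  : r x n        (dense, unconstrained upper part)
// B,D : (m - r) x 1  (column factors of the two rank-1 blocks)
// C,E : length n/2   (row factors of the two rank-1 blocks)
MatrixXf reconstruct_hmd(const MatrixXf& Ap, const VectorXf& B, const VectorXf& C,
                         const VectorXf& D, const VectorXf& E) {
    const Eigen::Index r = Ap.rows(), n = Ap.cols(), m = r + B.size();
    MatrixXf G = B * C.transpose();   // (m - r) x (n/2)
    MatrixXf H = D * E.transpose();   // (m - r) x (n/2)
    MatrixXf K(m - r, n);
    K << G, H;                        // column-wise concatenation
    MatrixXf A(m, n);
    A << Ap, K;                       // row-wise concatenation
    return A;
}

// Algorithm 2: O = A * I without ever forming A, using associativity:
// the lower sub-vector is B * (C . I_upper) + D * (E . I_lower).
VectorXf hmd_matvec(const MatrixXf& Ap, const VectorXf& B, const VectorXf& C,
                    const VectorXf& D, const VectorXf& E, const VectorXf& I) {
    const Eigen::Index r = Ap.rows(), n = Ap.cols(), m = r + B.size();
    VectorXf O(m);
    O.head(r) = Ap * I;               // dense ("richer") upper sub-vector
    float t1 = C.dot(I.head(n / 2));  // scalar C x I_{1:n/2}
    float t2 = E.dot(I.tail(n / 2));  // scalar E x I_{n/2+1:n}
    O.tail(m - r) = B * t1 + D * t2;  // constrained lower sub-vector
    return O;
}

Note that hmd_matvec touches each stored parameter exactly once, so the operation count follows directly from the block sizes; this is the reduction quantified in Equation (2) below.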
The output of an RNN layer is a vector. Each element of the vector is derived from multiple fully connected layers followed by a non-linearity operation. Thus, every element of an output vector is connected to every element of the input and hidden vectors of an RNN layer. This leads to a large number of parameters. Generally, not all elements of the output vector need to be connected this way to derive useful information from the input and the hidden vector. Pruning exploits these sparse connections in an unstructured manner. Additionally, most RNN networks are followed by a fully-connected softmax layer or another RNN layer. Even if the order of the elements in the output of a particular RNN layer changes, the weights in the subsequent fully connected or RNN layers can adjust to accommodate that. Thus, the order of the output vectors of RNN hidden layers is not strictly important. These two properties of an RNN layer can be used to create a more hardware-friendly compression scheme. This paper introduces one such scheme: Hybrid Matrix Decomposition.

HMD splits the input and recurrent matrices in an RNN layer into two parts: a fully parameterized upper part and a lower part composed of rank-1 blocks. The upper part is used to generate the elements of an output vector that need dense connectivity, while the lower part generates the elements of the output vector that can derive useful information from sparse connectivity. There are multiple ways to constrain the lower part using rank-1 blocks. Figure 1 shows one such technique, D1. The D1 technique consists of an unconstrained upper half A′ and a constrained lower half. The lower half is composed of two rank-1 blocks. Algorithm 1 shows how to expand A′, B, C, D, and E to get a matrix of size m × n. In this paper, whenever we discuss HMD, we refer to the D1 method of decomposing the matrix. If we decompose the weight matrix using the D1 technique, the storage reduction is given by:

(m × n) / (r × n + 2 × (m − r + n/2))    (1)

Apart from the storage reduction, HMD also leads to a reduction in the number of computations. Assuming a batch size of 1 during inference, Algorithm 2 shows how to calculate the matrix-vector product when the matrix is represented using HMD. This algorithm avoids expanding A′, B, C, D, and E into A as shown in Algorithm 1 and instead uses the associative property of matrix products to gain the computation speedup. The reduction in the number of operations when we use Algorithm 2 is:

(m × n) / (r × n + 2 × (n/2 + m − r) + (m − r))    (2)

As discussed previously, HMD divides the output into two stacked sub-vectors: one is the result of the fully-parameterized multiplication (A′ × I) and the other is the result of the low-rank multiplications (B × C × I_{1:n/2} + D × E × I_{1+n/2:n}). Thus, the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features.
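As a concrete illustration, with numbers chosen only for this example (they do not correspond to any of the evaluated networks): for a square weight matrix with m = n = 256 and r = 64, Equation (1) gives 65,536 / (16,384 + 2 × (192 + 128)) = 65,536 / 17,024 ≈ 3.85× storage compression, and Equation (2) gives 65,536 / (16,384 + 2 × (128 + 192) + 192) = 65,536 / 17,216 ≈ 3.81× fewer operations. The reduction in operations therefore tracks the reduction in storage almost exactly.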
4 Results
We perform an extensive comparison of HMD with two other compression techniques: model pruning and matrix factorization. Additionally, we also compared HMD with a structured matrix-based compression technique called block circulant decomposition (BCD) [2, 9]. BCD-compressed networks were able to recover the baseline accuracy for 2×-4× compression. However, the run-time of the compressed network was 30× slower than the baseline. As a result, we do not discuss the results of BCD compression in the rest of the paper.

Model pruning [21] induces sparsity in the matrices of a neural network, creating sparse matrices that are stored in a specialized CSR data structure. The overhead of traversing these data structures while performing the matrix-vector multiplication can lead to poorer inference run-time than executing the baseline, non-sparse network. Low-rank matrix factorization (LMF) [8] expresses a larger matrix A of dimension m × n as a product of two smaller matrices U and V of dimensions m × d and d × n, respectively. The parameter d controls the compression factor. Unlike pruning, matrix factorization is able to improve the run-time over the baseline for all compression factors.
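The sketch below shows why LMF's speedup tracks its compression factor. It is an illustrative Eigen snippet under the same assumptions as the HMD sketch above (batch size 1, single-precision weights), not the exact kernel used in our measurements.

#include <Eigen/Dense>

// LMF: approximate the m x n weight matrix A by U (m x d) times V (d x n).
// With batch size 1, U * (V * x) costs d*n + m*d multiply-accumulates instead
// of m*n, so the speedup roughly follows the compression factor
// m*n / (d * (m + n)).
Eigen::VectorXf lmf_matvec(const Eigen::MatrixXf& U, const Eigen::MatrixXf& V,
                           const Eigen::VectorXf& x) {
    return U * (V * x);   // associativity: never form the full m x n product
}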
4.1 Comparison of compression techniques across different ML tasks
The impact of compression on accuracy is compared on three benchmarks covering two different tasks: Human Activity Recognition and Language Modeling. These tasks are among the important applications that run on edge and embedded devices. In order to compare the inference run-time of RNN cells compressed using the three techniques discussed above, we implemented these cells in C++ using the Eigen library. We ran our experiments on a single Cortex-A73 core of the Hikey 960 board. The size of the L3 cache is 2 MB.

Figure 2. Accuracy vs. speedup for the HAR1 network, comparing the baseline with a smaller baseline and with the baseline compressed using different compression schemes at varying compression factors (2×, 2.5×, 3.33×, and 5×). Speed-up values > 1 indicate a decrease in inference run-time and values < 1 indicate an increase in inference run-time. For each compression factor, the compression scheme closest to the top-right is the ideal choice and is highlighted in bold italics. P = pruning, LMF = low-rank matrix factorization, HMD = hybrid matrix decomposition, SB = smaller baseline.

We compress the network using pruning, LMF, and HMD. Additionally, we train a smaller baseline with a number of parameters equal to that of the compressed network.

4.1.1 Human Activity Recognition (HAR)
We train two different networks for human activity recognition. Both of these networks are trained on the Opportunity dataset [11]. However, they differ in the way they process the dataset and in the body sensors they choose to train their networks on.

HAR1: The first HAR network is based on the work in [6]. The network uses a bidirectional LSTM with a hidden vector of size 179, followed by a softmax layer, to get an accuracy of 91.9%. The input is of dimension 77 and is fed over 81 time steps. The total number of parameters in this network is 374,468. Figure 2 shows the result of compressing the LSTM layers in the baseline by 2×, 2.5×, 3.33×, and 5×. As we increase the compression, the accuracy degradation becomes larger for all compression schemes. Thus, the best compression scheme for each compression factor is a function of the task accuracy and the speedup required to run the application. For 2× compression, HMD and pruning achieve better accuracy than the smaller baseline and LMF. Additionally, the HMD-compressed network is 2× faster than the pruned network. Similar observations can be made for 2.5× and 3.33× compression. Thus, HMD can be used as the preferred compression scheme for these compression factors. At 5× compression, HMD is slightly more accurate than LMF while being 15% slower. The preferred compression scheme then depends on which criterion (accuracy or speed) one is willing to sacrifice. Finally, all three compression schemes have better accuracy than the smaller baseline.

Figure 3. Accuracy vs. speedup for the HAR2 network, comparing the baseline with a smaller baseline and with the baseline compressed using different compression schemes at varying compression factors. Speed-up values > 1 indicate a decrease in inference run-time and values < 1 indicate an increase in inference run-time. For each compression factor, the compression scheme closest to the top-right is the ideal choice and is highlighted in bold italics. P = pruning, LMF = low-rank matrix factorization, HMD = hybrid matrix decomposition, SB = smaller baseline.

HAR2: The second HAR network is based on the work in [10]. It uses 113 sensors from the Opportunity dataset. The network has 4 convolutional layers followed by 2 LSTM layers and a softmax layer. The total number of parameters in the network is 3,964,754. The LSTM layers are of size 128 and contribute more than 95% of the total parameters. Figure 3 shows the result of compressing the LSTM layers in the baseline by 2×, 2.5×, 3.33×, and 5×. As we increase the compression, the accuracy degradation becomes larger for all compression schemes. For the 2× and 2.5× compression factors, HMD is the superior technique, achieving better run-time than pruning (2× faster) and better accuracy than LMF (an improvement of 0.4%). For higher compression factors, LMF becomes an attractive option to compress the HAR2 application. For 3.33× compression, LMF, HMD, and pruning achieve equivalent accuracy. However, LMF is slightly faster than HMD and more than 2× faster than pruning. Finally, all three compression schemes have better accuracy than the smaller baseline.

4.1.2 Language Modeling
We use the small model from [20] as our baseline. The baseline has 2 LSTM layers, each with a hidden vector of size 200. Additionally, it uses 10,000 words from the English vocabulary. Together with the input and output word embeddings, the total size of the network is 4,171,000 parameters.

Figure 4. Perplexity vs. speedup for the PTB-LM network, comparing the baseline with a smaller baseline and with the baseline compressed using different compression schemes at varying compression factors. Speed-up values > 1 indicate a decrease in inference run-time and values < 1 indicate an increase in inference run-time. In the case of perplexity, lower values are better. Thus, for each compression factor, the compression scheme closest to the bottom-right is the ideal choice and is highlighted in bold italics. P = pruning, LMF = low-rank matrix factorization, HMD = hybrid matrix decomposition, SB = smaller baseline.

Figure 4 shows the results of compressing the LSTM layers in the baseline by 2×, 2.5×, 3.33×, and 5×. In the case of LM, the lower the perplexity, the better the model.
Pruning consistently achieves better accuracy than the baseline and the other compression techniques. However, pruning never achieves a better speedup than the other compression techniques. LMF achieves better perplexity than the baseline for 2× and 2.5× compression and achieves a speedup for all compression factors. However, LMF does not beat the perplexity values achieved by HMD. HMD simultaneously achieves better perplexity than the baseline for most compression factors, better perplexity than LMF for all compression factors, and faster inference run-time than the baseline and pruned networks for all compression factors. Thus, HMD makes a strong case for being the preferred compression scheme.

5 Conclusion
Choosing the right compression technique requires looking at three criteria: the compression factor, the accuracy, and the run-time. Pruning is an effective compression technique, but it can sacrifice speedup over the baseline for certain compression factors. LMF achieves better speedup than the baseline for all compression factors, but it can lead to accuracy degradation. This paper introduces a new compression scheme called HMD, which is extremely effective when compression using pruning does not lead to a speedup over the baseline and LMF leads to accuracy degradation.

References
[1] Ting Chen, Ji Lin, Tian Lin, Song Han, Chong Wang, and Denny Zhou. 2018. Adaptive Mixture of Low-Rank Factorizations for Compact Neural Modeling. Advances in Neural Information Processing Systems (CDNNRIA Workshop). https://openreview.net/forum?id=B1eHgu-Fim
[2] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. 2017. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-circulant Weight Matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 395-408. https://doi.org/10.1145/3123939.3124552
[3] Caiwen Ding, Ao Ren, Geng Yuan, Xiaolong Ma, Jiayu Li, Ning Liu, Bo Yuan, and Yanzhi Wang. 2018. Structured Weight Matrices-Based Hardware Accelerators in Deep Neural Networks: FPGAs and ASICs. In Proceedings of the 2018 Great Lakes Symposium on VLSI (GLSVLSI '18). ACM, New York, NY, USA, 353-358. https://doi.org/10.1145/3194554.3194625
[4] Dibakar Gope, Jesse Beu, Urmish Thakker, and Matthew Mattina. 2019. Ternary MobileNets via Per-Layer Hybrid Filter Banks. arXiv:1911.01028.
[5] Artem M. Grachev, Dmitry I. Ignatov, and Andrey V. Savchenko. 2017. Neural Networks Compression for Language Modeling. In Pattern Recognition and Machine Intelligence, B. Uma Shankar, Kuntal Ghosh, Deba Prasad Mandal, Shubhra Sankar Ray, David Zhang, and Sankar K. Pal (Eds.). Springer International Publishing, Cham, 351-357.
[6] Nils Y. Hammerla, Shane Halloran, and Thomas Ploetz. 2016. Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables. IJCAI 2016.
[7] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR).
[8] Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization Tricks for LSTM Networks. CoRR abs/1703.10722. http://arxiv.org/abs/1703.10722
[9] Zhe Li, Shuo Wang, Caiwen Ding, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. Efficient Recurrent Neural Networks Using Structured Matrices in FPGAs. CoRR abs/1803.07661. http://arxiv.org/abs/1803.07661
[10] Francisco Javier Ordóñez and Daniel Roggen. 2016. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 16, 1. https://doi.org/10.3390/s16010115
[11] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. d. R. Millán. 2010. Collecting Complex Activity Datasets in Highly Rich Networked Sensor Environments. In 2010 Seventh International Conference on Networked Sensing Systems (INSS). 233-240. https://doi.org/10.1109/INSS.2010.5573462
[12] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. 2013. Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 6655-6659. https://doi.org/10.1109/ICASSP.2013.6638949
[13] Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. 2015. Structured Transforms for Small-Footprint Deep Learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3088-3096.
[14] Urmish Thakker, Jesse G. Beu, Dibakar Gope, Chu Zhou, Igor Fedorov, Ganesh Dasika, and Matthew Mattina. 2019. Compressing RNNs for IoT Devices by 15-38x Using Kronecker Products. CoRR abs/1906.02876. http://arxiv.org/abs/1906.02876
[15] Urmish Thakker, Ganesh Dasika, Jesse G. Beu, and Matthew Mattina. 2019. Measuring Scheduling Efficiency of RNNs for NLP Applications. CoRR abs/1904.03302. http://arxiv.org/abs/1904.03302
[16] Urmish Thakker, Igor Fedorov, Jesse G. Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika, and Matthew Mattina. 2019. Pushing the Limits of RNN Compression. arXiv abs/1910.02558.
[17] Urmish Thakker, Paul Whatmough, Matthew Mattina, and Jesse Beu. 2020. Compressing Language Models Using Doped Kronecker Products. arXiv:2001.08896.
[18] Anna Thomas, Albert Gu, Tri Dao, Atri Rudra, and Christopher Ré. 2018. Learning Compressed Transforms with Low Displacement Rank. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 9066-9078. http://papers.nips.cc/paper/8119-learning-compressed-transforms-with-low-displacement-rank.pdf
[19] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Compressing Recurrent Neural Network with Tensor Train. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 4451-4458.
[20] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. CoRR abs/1409.2329. http://arxiv.org/abs/1409.2329
[21] Michael Zhu and Suyog Gupta. 2017. To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. arXiv:1710.01878.