GLU Variants Improve Transformer


Author: Noam Shazeer, Google (noam@google.com). February 14, 2020.

Abstract

Gated Linear Units [Dauphin et al., 2016] consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer [Vaswani et al., 2017] sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

1 Introduction

The Transformer [Vaswani et al., 2017] sequence-to-sequence model alternates between multi-head attention and what it calls "position-wise feed-forward networks" (FFN). The FFN takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations, represented by the matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$. A rectified-linear (ReLU) [Glorot et al., 2011] activation function is applied between the two linear transformations:

$$\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 \tag{1}$$

Following the T5 codebase [Raffel et al., 2019], we use a version with no bias (also in the interest of ML fairness):

$$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1,\, 0)\,W_2 \tag{2}$$

Subsequent work has proposed replacing the ReLU with other nonlinear activation functions such as the Gaussian Error Linear Unit, $\mathrm{GELU}(x) = x\,\Phi(x)$ [Hendrycks and Gimpel, 2016], and $\mathrm{Swish}_\beta(x) = x\,\sigma(\beta x)$ [Ramachandran et al., 2017]:

$$
\begin{aligned}
\mathrm{FFN}_{\mathrm{GELU}}(x, W_1, W_2) &= \mathrm{GELU}(xW_1)\,W_2 \\
\mathrm{FFN}_{\mathrm{Swish}}(x, W_1, W_2) &= \mathrm{Swish}_1(xW_1)\,W_2
\end{aligned}
\tag{3}
$$

2 Gated Linear Units (GLU) and Variants

[Dauphin et al., 2016] introduced Gated Linear Units (GLU), a neural network layer defined as the component-wise product of two linear transformations of the input, one of which is sigmoid-activated. They also suggest omitting the activation, which they call a "bilinear" layer and attribute to [Mnih and Hinton, 2007]:

$$
\begin{aligned}
\mathrm{GLU}(x, W, V, b, c) &= \sigma(xW + b) \otimes (xV + c) \\
\mathrm{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c)
\end{aligned}
\tag{4}
$$

We can also define GLU variants using other activation functions:

$$
\begin{aligned}
\mathrm{ReGLU}(x, W, V, b, c) &= \max(0,\, xW + b) \otimes (xV + c) \\
\mathrm{GEGLU}(x, W, V, b, c) &= \mathrm{GELU}(xW + b) \otimes (xV + c) \\
\mathrm{SwiGLU}(x, W, V, b, c, \beta) &= \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)
\end{aligned}
\tag{5}
$$

In this paper, we propose additional variations on the Transformer FFN layer which use GLU or one of its variants in place of the first linear transformation and the activation function. Again, we omit the bias terms:

$$
\begin{aligned}
\mathrm{FFN}_{\mathrm{GLU}}(x, W, V, W_2) &= (\sigma(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{Bilinear}}(x, W, V, W_2) &= (xW \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{ReGLU}}(x, W, V, W_2) &= (\max(0,\, xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) &= (\mathrm{GELU}(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) &= (\mathrm{Swish}_1(xW) \otimes xV)\,W_2
\end{aligned}
\tag{6}
$$

All of these layers have three weight matrices, as opposed to two for the original FFN. To keep the number of parameters and the amount of computation constant, we reduce the number of hidden units $d_{ff}$ (the second dimension of $W$ and $V$ and the first dimension of $W_2$) by a factor of $\frac{2}{3}$ when comparing these layers to the original two-matrix version.
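For concreteness, here is a minimal NumPy sketch of the baseline FFN of Eq. (2) and two of the gated variants of Eq. (6), together with a check of the two-thirds parameter-matching rule. The helper names and the erf-based GELU formulation are our own illustration, not code from the T5 codebase.

```python
import numpy as np
from scipy.special import erf  # Gaussian CDF for the exact GELU

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard-normal CDF
    # [Hendrycks and Gimpel, 2016].
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigmoid(beta * x) [Ramachandran et al., 2017].
    return x * sigmoid(beta * x)

def ffn_relu(x, w1, w2):
    # Eq. (2): bias-free ReLU FFN, as in the T5 codebase.
    return np.maximum(x @ w1, 0.0) @ w2

def ffn_geglu(x, w, v, w2):
    # Eq. (6): GELU-activated projection, gated component-wise
    # by a second linear projection, then projected back down.
    return (gelu(x @ w) * (x @ v)) @ w2

def ffn_swiglu(x, w, v, w2):
    # Eq. (6): same structure with Swish_1 as the activation.
    return (swish(x @ w) * (x @ v)) @ w2

# Parameter matching: reducing d_ff by a factor of 2/3 keeps the
# three-matrix GLU-variant layers the same size as the baseline.
d_model, d_ff = 768, 3072
d_ff_glu = d_ff * 2 // 3  # 2048
assert 2 * d_model * d_ff == 3 * d_model * d_ff_glu  # 4,718,592 each
```

With `x` of shape `(seq_len, d_model)`, `w` and `v` of shape `(d_model, 2048)`, and `w2` of shape `(2048, d_model)`, the gated layers match the baseline `ffn_relu` (with $d_{ff} = 3072$) in both parameter count and output shape.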
3 Experiments on Text-to-Text Transfer Transformer (T5)

We test the FFN variants described above on the transfer-learning setup from [Raffel et al., 2019]. An encoder-decoder Transformer model [Vaswani et al., 2017] is trained on a denoising objective of predicting missing text segments, and subsequently fine-tuned on various language understanding tasks.

3.1 Model Architecture

We use the same codebase, model architecture, and training task as the base model from [Raffel et al., 2019]. The encoder and decoder each consist of 12 layers, with $d_{model} = 768$. For the attention layers, $h = 12$ and $d_k = d_v = 64$. The FFN layers have hidden size $d_{ff} = 3072$. As described above, for the GLU-variant-based FFN layers, which have three weight matrices instead of two, we reduce the hidden layer to $d_{ff} = 2048$, so as to maintain the same parameter and operation counts as the base model.

Table 1: Heldout-set log-perplexity for Transformer models on the segment-filling task from [Raffel et al., 2019]. All models are matched for parameters and computation. Parenthesized values are standard deviations over four runs at 65,536 steps.

| Training Steps | 65,536 | 524,288 |
|---|---|---|
| FFN_ReLU (baseline) | 1.997 (0.005) | 1.677 |
| FFN_GELU | 1.983 (0.005) | 1.679 |
| FFN_Swish | 1.994 (0.003) | 1.683 |
| FFN_GLU | 1.982 (0.006) | 1.663 |
| FFN_Bilinear | 1.960 (0.005) | 1.648 |
| FFN_GEGLU | 1.942 (0.004) | 1.633 |
| FFN_SwiGLU | 1.944 (0.010) | 1.636 |
| FFN_ReGLU | 1.953 (0.003) | 1.645 |

3.2 Pre-Training and Perplexity Results

Identically to [Raffel et al., 2019], we pre-train for 524,288 steps on the span-filling objective on the C4 dataset. Each training batch consists of 128 examples, each of which has an input of 512 tokens and an output of 114 tokens, the output containing multiple spans of tokens deleted from the input. Each training step took approximately 0.15 seconds on a 32-core TPUv2 cluster.

Similarly to [Raffel et al., 2019], we use the Adafactor optimizer [Shazeer and Stern, 2018] and an inverse-square-root learning-rate schedule; we also decay the learning rate linearly for the final 10 percent of the training steps. Our main departure from [Raffel et al., 2019] is that we use no dropout during pre-training, which we find to produce superior results. We compute the log-perplexity on the training objective on a heldout shard of C4, which we believe to be a good indicator of model quality. For each model architecture, we also trained four models for a shorter period (65,536 steps) to measure inter-run variability. The results are listed in Table 1. The GEGLU and SwiGLU variants produce the best perplexities.
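To make the schedule concrete, the sketch below combines an inverse-square-root decay with the linear ramp-down over the final 10 percent of steps. The 10,000-step warmup floor is an assumption carried over from the T5 recipe; the paper itself does not state the constant.

```python
def learning_rate(step, total_steps=524_288, warmup=10_000):
    """Inverse-square-root schedule with linear decay at the end.

    The `warmup` floor is an assumed constant (T5 uses 10^4); the
    linear decay over the final 10% of steps follows the paper.
    """
    lr = max(step, warmup) ** -0.5
    decay_start = int(0.9 * total_steps)
    if step > decay_start:
        lr *= (total_steps - step) / (total_steps - decay_start)
    return lr
```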
3.3 Fine-Tuning

We then fine-tune each fully-trained model once on an examples-proportional mixture of the Stanford Question-Answering Dataset (SQuAD) [Rajpurkar et al., 2016] and all the language understanding tasks in the GLUE [Wang et al., 2018] and SuperGLUE [Wang et al., 2019] benchmarks; a sketch of the mixture sampling appears below. (This departs from [Raffel et al., 2019], who fine-tuned separately on the different tasks; we chose a single fine-tuning run for simplicity.) Fine-tuning consists of 131,072 steps with a learning rate of $10^{-3}$. As in training, the input sequences for each step have a combined length of approximately 65,536 tokens. Following [Raffel et al., 2019], we use a dropout rate of 0.1 on the layer outputs, feed-forward hidden layers, and attention weights. The embedding matrices are fixed during fine-tuning.
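The following sketch illustrates the examples-proportional draw in its simplest form. The task names and sizes are illustrative, and T5's actual mixing includes refinements (such as capping very large tasks) that we omit here.

```python
import random

# Hypothetical task -> training-example counts (illustrative only).
task_sizes = {"squad": 87_599, "mnli": 392_702, "rte": 2_490}

def sample_task(sizes):
    # Examples-proportional mixing: draw a task with probability
    # proportional to its number of training examples.
    tasks, weights = zip(*sizes.items())
    return random.choices(tasks, weights=weights, k=1)[0]

counts = {t: 0 for t in task_sizes}
for _ in range(10_000):
    counts[sample_task(task_sizes)] += 1
print(counts)  # roughly proportional to task_sizes
```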
Tables 2, 3 and 4 show results on the development sets. For each task, we report the best score of any of the checkpoints recorded during fine-tuning. While the results are noisy, the new GLU variants perform best on most of the tasks. For comparison, at the bottom of each table we list the results from [Raffel et al., 2019], whose model is identical to our FFN_ReLU model. Their results are notably worse, which we believe was caused by their use of dropout during pre-training. Also listed are the inter-run standard deviations measured by [Raffel et al., 2019].

Table 2: GLUE Language-Understanding Benchmark [Wang et al., 2018] (dev).

| Model | Score (Avg) | CoLA (MCC) | SST-2 (Acc) | MRPC (F1) | MRPC (Acc) | STSB (PCC) | STSB (SCC) | QQP (F1) | QQP (Acc) | MNLIm (Acc) | MNLImm (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFN_ReLU | 83.80 | 51.32 | 94.04 | 93.08 | 90.20 | 89.64 | 89.42 | 89.01 | 91.75 | 85.83 | 86.42 | 92.81 | 80.14 |
| FFN_GELU | 83.86 | 53.48 | 94.04 | 92.81 | 90.20 | 89.69 | 89.49 | 88.63 | 91.62 | 85.89 | 86.13 | 92.39 | 80.51 |
| FFN_Swish | 83.60 | 49.79 | 93.69 | 92.31 | 89.46 | 89.20 | 88.98 | 88.84 | 91.67 | 85.22 | 85.02 | 92.33 | 81.23 |
| FFN_GLU | 84.20 | 49.16 | 94.27 | 92.39 | 89.46 | 89.46 | 89.35 | 88.79 | 91.62 | 86.36 | 86.18 | 92.92 | 84.12 |
| FFN_GEGLU | 84.12 | 53.65 | 93.92 | 92.68 | 89.71 | 90.26 | 90.13 | 89.11 | 91.85 | 86.15 | 86.17 | 92.81 | 79.42 |
| FFN_Bilinear | 83.79 | 51.02 | 94.38 | 92.28 | 89.46 | 90.06 | 89.84 | 88.95 | 91.69 | 86.90 | 87.08 | 92.92 | 81.95 |
| FFN_SwiGLU | 84.36 | 51.59 | 93.92 | 92.23 | 88.97 | 90.32 | 90.13 | 89.14 | 91.87 | 86.45 | 86.47 | 92.93 | 83.39 |
| FFN_ReGLU | 84.67 | 56.16 | 94.38 | 92.06 | 89.22 | 89.97 | 89.85 | 88.86 | 91.72 | 86.20 | 86.40 | 92.68 | 81.59 |
| [Raffel et al., 2019] | 83.28 | 53.84 | 92.68 | 92.07 | 88.92 | 88.02 | 87.94 | 88.67 | 91.56 | 84.24 | 84.57 | 90.48 | 76.28 |
| ibid. stddev. | 0.235 | 1.111 | 0.569 | 0.729 | 1.019 | 0.374 | 0.418 | 0.108 | 0.070 | 0.291 | 0.231 | 0.361 | 1.393 |

Table 3: SuperGLUE Language-Understanding Benchmark [Wang et al., 2019] (dev).

| Model | Score (Avg) | BoolQ (Acc) | CB (F1) | CB (Acc) | COPA (Acc) | MultiRC (F1) | MultiRC (EM) | ReCoRD (F1) | ReCoRD (EM) | RTE (Acc) | WiC (Acc) | WSC (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFN_ReLU | 72.76 | 80.15 | 83.37 | 89.29 | 70.00 | 76.93 | 39.14 | 73.73 | 72.91 | 83.39 | 67.71 | 77.88 |
| FFN_GELU | 72.98 | 80.64 | 86.24 | 91.07 | 74.00 | 75.93 | 38.61 | 72.96 | 72.03 | 81.59 | 68.34 | 75.96 |
| FFN_Swish | 72.40 | 80.43 | 77.75 | 83.93 | 67.00 | 76.34 | 39.14 | 73.34 | 72.36 | 81.95 | 68.18 | 81.73 |
| FFN_GLU | 73.95 | 80.95 | 77.26 | 83.93 | 73.00 | 76.07 | 39.03 | 74.22 | 73.50 | 84.12 | 67.71 | 87.50 |
| FFN_GEGLU | 73.96 | 81.19 | 82.09 | 87.50 | 72.00 | 77.43 | 41.03 | 75.28 | 74.60 | 83.39 | 67.08 | 83.65 |
| FFN_Bilinear | 73.81 | 81.53 | 82.49 | 89.29 | 76.00 | 76.04 | 40.92 | 74.97 | 74.10 | 82.67 | 69.28 | 78.85 |
| FFN_SwiGLU | 74.56 | 81.19 | 82.39 | 89.29 | 73.00 | 75.56 | 38.72 | 75.35 | 74.55 | 85.20 | 67.24 | 86.54 |
| FFN_ReGLU | 73.66 | 80.89 | 86.37 | 91.07 | 67.00 | 75.32 | 40.50 | 75.07 | 74.18 | 84.48 | 67.40 | 79.81 |
| [Raffel et al., 2019] | 71.36 | 76.62 | 91.22 | 91.96 | 66.20 | 66.13 | 25.78 | 69.05 | 68.16 | 75.34 | 68.04 | 78.56 |
| ibid. stddev. | 0.416 | 0.365 | 3.237 | 2.560 | 2.741 | 0.716 | 1.011 | 0.370 | 0.379 | 1.228 | 0.850 | 2.029 |

Table 4: SQuAD [Rajpurkar et al., 2016] v1.1 (dev).

| Model | EM | F1 |
|---|---|---|
| FFN_ReLU | 83.18 | 90.87 |
| FFN_GELU | 83.09 | 90.79 |
| FFN_Swish | 83.25 | 90.76 |
| FFN_GLU | 82.88 | 90.69 |
| FFN_GEGLU | 83.55 | 91.12 |
| FFN_Bilinear | 83.82 | 91.06 |
| FFN_SwiGLU | 83.42 | 91.03 |
| FFN_ReGLU | 83.53 | 91.18 |
| [Raffel et al., 2019] | 80.88 | 88.81 |
| ibid. stddev. | 0.343 | 0.226 |

4 Conclusions

We have extended the GLU family of layers and proposed their use in the Transformer. In a transfer-learning setup, the new variants seem to produce better perplexities for the de-noising objective used in pre-training, as well as better results on many downstream language-understanding tasks. These architectures are simple to implement, and have no apparent computational drawbacks. We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.

References

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. CoRR, abs/1612.08083, 2016. URL http://arxiv.org/abs/1612.08083.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315-323, 2011.

Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.

Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641-648, 2007.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.
