Multi-Task Learning of Keyphrase Boundary Classification


Authors: Isabelle Augenstein (Department of Computer Science, University College London, i.augenstein@ucl.ac.uk) and Anders Søgaard (Department of Computer Science, University of Copenhagen, soegaard@di.ku.dk). Both authors contributed equally.

Abstract

Keyphrase boundary classification (KBC) is the task of detecting keyphrases in scientific articles and labelling them with respect to predefined types. Although important in practice, this task is so far underexplored, partly due to the lack of labelled data. To overcome this, we explore several auxiliary tasks, including semantic super-sense tagging and identification of multi-word expressions, and cast the task as a multi-task learning problem with deep recurrent neural networks. Our multi-task models perform significantly better than previous state-of-the-art approaches on two scientific KBC datasets, particularly for long keyphrases.

1 Introduction

The scientific keyphrase boundary classification (KBC) task consists of a) determining keyphrase boundaries, and b) labelling keyphrases with their types according to a predefined schema. KBC is motivated by the need to efficiently search scientific literature, which can be summarised by its keyphrases. Several companies are working on keyphrase-based recommender systems for scientific literature, or on search interfaces where scientific articles decorate graphs in which nodes are keyphrases. Such keyphrases must be dynamically retrieved from the articles, because important scientific concepts emerge on a daily basis, and the most recent concepts are typically the ones of interest to scientists.

KBC is not a common task in NLP, and there are only a few small annotated datasets for inducing supervised KBC models, made available recently (QasemiZadeh and Schumann, 2016; Augenstein et al., 2017). Typical KBC approaches therefore rely on hand-crafted gazetteers (Hasan and Ng, 2014) or reduce the task to extracting a list of keyphrases for each document (Kim et al., 2010), instead of identifying mentions of keyphrases in sentences. For related, more common NLP tasks such as named entity recognition and identification of multi-word expressions, neural sequence labelling methods have been shown to be useful (Lample et al., 2016). In order to overcome the small data problem, we study using more widely available data for tasks related to KBC and exploit their synergies in a deep multi-task learning setup.

Multi-task learning has become popular within natural language processing and machine learning over the last few years; in particular, hard parameter sharing of hidden layers in deep learning models. This approach to multi-task learning has three advantages: a) it significantly reduces Rademacher complexity (Baxter, 2000; Maurer, 2007), i.e., the risk of over-fitting; b) it is space-efficient, reducing the number of parameters; and c) it is easy to implement.

This paper shows how hard parameter sharing can be used to improve gazetteer-free keyphrase boundary classification models, by exploiting different syntactically and semantically annotated corpora, as well as more readily available data such as hyperlinks.
Contributions. We study the so far widely underexplored, though in practice important, task of scientific keyphrase boundary classification, for which only a small amount of training data is available. We overcome this by identifying good auxiliary tasks and casting it as a multi-task learning problem. We evaluate our models across two new, manually annotated corpora of scientific articles and outperform single-task approaches by up to 9.64% F1, mostly due to better performance for long keyphrases.

2 Keyphrase Boundary Classification

Consider the following sentence from a scientific paper:

(1) We find that simple interpolation methods, like log-linear and linear interpolation, improve the performance but fall short of the performance of an oracle.

This sentence occurs in the ACL RD-TEC 2.0 corpus. Here, 'interpolation methods' and 'log-linear and linear interpolation' are annotated as technical keyphrases, 'performance' as a keyphrase related to measurements, and 'oracle' is a keyphrase labelled as miscellaneous. Below, we are interested in predicting the boundaries and the types of all keyphrases.

3 Multi-Task Learning

Multi-task learning is an approach to learning in which generalisation is improved by taking advantage of the inductive bias in the training signals of related tasks. When abundant labelled data is available for an auxiliary task, but little data for the target task, multi-task learning can act as a form of semi-supervised learning combined with a distant supervision signal. Inducing a model from only the sparse target task data may lead to overfitting to random noise in the data, but relying on auxiliary data helps the model generalise, making it easier to abstract away from noise, as well as leveraging the marginal distribution of auxiliary input data. From a representation learning perspective, auxiliary tasks can be used to induce representations that may be beneficial for the target task. Caruana (1993) also suggests that the auxiliary task can help focus attention in the induction of the target task model. Finally, multi-task learning can be cast as a regulariser, as studies show reductions in Rademacher complexity in multi-task architectures over single-task architectures (Baxter, 2000; Maurer, 2007).

Here, we follow the probably most common approach to multi-task learning, known as hard parameter sharing. This was introduced in Caruana (1993) in the context of deep neural networks, in which hidden layers can be shared among tasks. We assume $T$ different training sets, $D_1, \ldots, D_T$, where each $D_t$ contains pairs of input-output sequences $(w_{1:n}, y^t_{1:n})$, $w_i \in V$, $y^t_i \in L^t$. The input vocabulary $V$ is shared across tasks, but the output vocabularies (tagsets) $L^t$ are task-dependent. At each step in the training process we choose a random task $t$, followed by a random training instance $(w_{1:n}, y^t_{1:n}) \in D_t$. We use the tagger to predict the labels $\hat{y}^t_i$, suffer a loss with respect to the true labels $y^t_i$, and update the model parameters. The parameters are trained jointly for a sentence, i.e., a cross-entropy loss over each sentence is employed. Each task is associated with an independent classification function, but all tasks share the hidden layers. Note that for our experiments, we only consider one auxiliary task at a time (i.e., $T = 2$). A minimal code sketch of this training scheme follows at the end of this section.
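To make hard parameter sharing concrete, the following is a minimal PyTorch sketch of a multi-task tagger with a shared BiLSTM encoder and one classification head per task, trained by sampling a random task and then a random sentence at each step, as described above. This is an illustration under our reading of Section 3, not the authors' released implementation; the class name `MultiTaskTagger` and the tensor conventions are assumptions.

```python
# A minimal sketch of hard parameter sharing (not the authors' code):
# shared embeddings + BiLSTM, one task-specific output layer per task.
import random
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim, hidden_dim, tagset_sizes):
        super().__init__()
        # Shared layers: embeddings + stacked BiLSTM (hard parameter sharing).
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=3,
                               bidirectional=True, batch_first=True)
        # One independent classification function per task t over tagset L^t.
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_dim, n_tags) for n_tags in tagset_sizes)

    def forward(self, words, task):
        states, _ = self.encoder(self.embed(words))
        return self.heads[task](states)  # per-token tag scores

def train(model, datasets, epochs=10, lr=0.001):
    # datasets[t]: list of (word_ids, tag_ids) sentence pairs for task t.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for _ in range(sum(len(d) for d in datasets)):
            t = random.randrange(len(datasets))       # random task ...
            words, tags = random.choice(datasets[t])  # ... random sentence
            scores = model(words.unsqueeze(0), t)
            # Sentence-level cross-entropy updates shared and head parameters.
            loss = loss_fn(scores.squeeze(0), tags)
            opt.zero_grad(); loss.backward(); opt.step()
```

With one auxiliary task at a time, as in our experiments, `datasets` would hold two entries: the KBC training set and one auxiliary dataset.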
4 Experiments

Experimental Setup. We perform experiments for both keyphrase boundary identification (unlabelled), and keyphrase boundary identification and classification (labelled). The metrics measured are token-level precision, recall and F1, micro-averaged across keyphrase types; a sketch of the computation is given at the end of this section. Types are defined by the two datasets studied.

Auxiliary tasks. We experiment with five auxiliary tasks: (1) syntactic chunking, using annotations extracted from the English Penn Treebank, following Søgaard and Goldberg (2016); (2) frame target annotation from FrameNet 1.5 (corresponding to the target identification and classification tasks in Das et al. (2014)); (3) hyperlink prediction, using the dataset from Spitkovsky et al. (2010); (4) identification of multi-word expressions, using the Streusle corpus (Schneider and Smith, 2015); and (5) semantic super-sense tagging, using the Semcor dataset, following Johannsen et al. (2014). We train our models on the main task with one auxiliary task at a time. Note that the datasets for the auxiliary tasks are not annotated with keyphrase boundary identification or classification labels.

Datasets. We evaluate on the SemEval 2017 Task 10 dataset (Augenstein et al., 2017) and the ACL RD-TEC 2.0 dataset (QasemiZadeh and Schumann, 2016). The SemEval 2017 dataset is annotated with three keyphrase types, the ACL RD-TEC dataset with seven. For the former, we test on the development portion of the dataset, as the test set is not released yet. We randomly split ACL RD-TEC into a training and test set, reserving 1/3 for testing. Key dataset characteristics are summarised in Table 1. One important observation is that the SemEval 2017 dataset contains a significantly higher proportion of long keyphrases than the ACL dataset.

| Characteristic | SemEval 2017 Task 10 | ACL RD-TEC |
|---|---|---|
| Labels | Material, Process, Task | Technology and Method; Tool and Library; Language Resource; Language Resource Product; Measures and Measurements; Models; Other |
| Topics | Computer Science, Physics, Material Science | Natural Language Processing |
| Number of keyphrases | 5730 | 2939 |
| Proportion of singleton keyphrases | 31% | 83% |
| Proportion of single-word mentions | 18% | 23% |
| Proportion of mentions with word length ≥ 2 | 82% | 77% |
| Proportion of mentions with word length ≥ 3 | 51% | 33% |
| Proportion of mentions with word length ≥ 5 | 22% | 8% |

Table 1: Characteristics of the SemEval 2017 Task 10 and ACL RD-TEC corpora (statistics of the training sets).

Models. Our single- and multi-task networks are three-layer, bi-directional LSTMs (Graves and Schmidhuber, 2005) with pre-trained SENNA embeddings (http://ronan.collobert.com/senna/). For the multi-task networks, we follow the training procedure outlined in Section 3. The dimensionality of the embeddings is 50, and we follow Søgaard and Goldberg (2016) in using the same dimensionality for the hidden layers. We add a dropout of 0.1 to the input and train these architectures with momentum SGD, with an initial learning rate of 0.001 and a momentum of 0.9, for 10 epochs.

Baselines. Our baselines are Finkel et al. (2005) (http://nlp.stanford.edu/software/CRF-NER.shtml) and Lample et al. (2016) (https://github.com/clab/stack-lstm-ner), chosen in order to compare to a lexicalised and a state-of-the-art neural method. We use the implementations released by the authors and re-train the models on our data.
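As a concrete reading of the evaluation setup above, the sketch below computes token-level precision, recall and F1, micro-averaged across keyphrase types, from per-token gold and predicted labels. It reflects our understanding of standard micro-averaging rather than the exact evaluation script; the convention that non-keyphrase tokens carry an `O` tag, and the handling of mislabelled tokens as both false positives and false negatives, are assumptions.

```python
# Token-level precision/recall/F1, micro-averaged across keyphrase types.
# A sketch of standard micro-averaging, not the exact evaluation script;
# assumes non-keyphrase tokens carry the tag "O".
def micro_prf(gold, pred):
    """gold, pred: flat lists of per-token tags, e.g. "Process" or "O"."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != "O" and p == g:
            tp += 1          # correctly labelled keyphrase token
        elif p != "O":
            fp += 1          # spurious or mislabelled prediction
        if g != "O" and p != g:
            fn += 1          # missed or mislabelled gold token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one mislabelled token and one missed token.
gold = ["O", "Process", "Process", "O", "Material"]
pred = ["O", "Process", "Task",    "O", "O"]
print(micro_prf(gold, pred))  # (0.5, 0.333..., 0.4)
```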
5 Results and Analysis

Results for the SemEval 2017 Task 10 corpus are presented in Table 2, and for the ACL RD-TEC corpus in Table 3. For the SemEval corpus, all five labelled multi-task learning models outperform both examples of previous work, as well as our single-task BiLSTM baseline, by some margin. For ACL RD-TEC, three of our five labelled multi-task learning models perform better than the single-task BiLSTM baseline.

On the SemEval corpus, the F1 error reduction of the best labelled model over the Stanford tagger is 9.64%. The lexicalised Finkel et al. (2005) model shows a surprisingly competitive performance on the ACL RD-TEC corpus, where it is only 2 points in F1 behind our best-performing labelled model and on par with our best-performing unlabelled model. Results with Lample et al. (2016), on the other hand, are lower than the Finkel et al. (2005) baseline. This might be due to the model having a large set of parameters to model state transitions, which poses a difficulty for small training datasets.

Overall, multi-task models show bigger improvements over baselines for the SemEval corpus, and all models achieve better results on ACL RD-TEC. The statistics shown in Table 1 help to explain this. Most noticeably, the SemEval dataset contains a significantly higher proportion of long keyphrases than the ACL dataset. Interestingly, ACL RD-TEC contains a large proportion of keyphrases which only appear once in the training set (singletons), significantly fewer keyphrases, and more keyphrase types, but that does not seem to impact results as much as a high proportion of long keyphrases.

| Method | Unlabelled Precision | Unlabelled Recall | Unlabelled F1 | Labelled Precision | Labelled Recall | Labelled F1 |
|---|---|---|---|---|---|---|
| Finkel et al. (2005) | 77.89 | 50.27 | 61.10 | 49.90 | 27.97 | 35.85 |
| Lample et al. (2016) | 71.92 | 49.37 | 58.55 | 41.36 | 28.47 | 33.72 |
| BiLSTM | 81.58 | 57.86 | 67.71 | 45.80 | 32.48 | 38.01 |
| BiLSTM + Chunking | 82.88 | 52.08 | 63.96 | 55.54 | 34.90 | 42.86 |
| BiLSTM + Framenet | 77.86 | 56.05 | 65.18 | 54.04 | 38.91 | 45.24 |
| BiLSTM + Hyperlinks | 76.59 | 60.53 | 67.62 | 46.99 | 44.09 | 41.13 |
| BiLSTM + Multi-word | 74.80 | 70.18 | 72.42 | 46.99 | 44.09 | 45.49 |
| BiLSTM + Super-sense | 83.70 | 51.76 | 63.93 | 56.94 | 35.25 | 43.54 |

Table 2: Results for keyphrase boundary classification on the SemEval 2017 Task 10 corpus.

| Method | Unlabelled Precision | Unlabelled Recall | Unlabelled F1 | Labelled Precision | Labelled Recall | Labelled F1 |
|---|---|---|---|---|---|---|
| Finkel et al. (2005) | 84.16 | 80.08 | 82.07 | 59.97 | 53.86 | 56.75 |
| Lample et al. (2016) | 65.60 | 86.06 | 74.45 | 31.30 | 41.07 | 35.53 |
| BiLSTM | 83.40 | 80.36 | 81.85 | 59.62 | 57.45 | 58.51 |
| BiLSTM + Chunking | 83.36 | 79.46 | 81.37 | 59.26 | 57.24 | 57.84 |
| BiLSTM + Framenet | 84.11 | 79.39 | 81.68 | 60.64 | 57.24 | 58.89 |
| BiLSTM + Hyperlinks | 83.94 | 79.12 | 81.46 | 60.18 | 56.73 | 58.40 |
| BiLSTM + Multi-word | 84.86 | 76.92 | 80.69 | 59.81 | 54.21 | 56.87 |
| BiLSTM + Super-sense | 84.67 | 78.29 | 81.36 | 61.35 | 56.73 | 58.95 |

Table 3: Results for keyphrase boundary classification on the ACL RD-TEC corpus.

All models struggle with semantically vague or broad keyphrases (e.g. 'items', 'scope', 'key') and long keyphrases, especially those containing clauses (e.g. 'complete characterisation of the oxide particles', 'earley deduction proof procedure for definite clauses'). The multi-task models generally outperform the BiLSTM baseline for long keyphrases (e.g. 'language-independent system for automatic discovery of text in parallel translation', 'honeycomb network of graphite bricks'). Being able to recognise long keyphrases correctly is part of the reason our multi-task models outperform the baselines, especially on the SemEval dataset, which contains many such long keyphrases.

6 Related Work

Multi-Task Learning. Hard sharing of all hidden layers was introduced in Caruana (1993), and popularised in NLP by Collobert et al. (2011a). Several variants have been introduced, including hard sharing of selected layers (Søgaard and Goldberg, 2016) and sharing of parts (subspaces) of layers (Liu et al., 2015). Søgaard and Goldberg (2016) show that hard parameter sharing is an effective regulariser, also on heterogeneous tasks such as the ones considered here. Hard parameter sharing has been studied for several tasks, including CCG supertagging (Søgaard and Goldberg, 2016), text normalisation (Bollman and Søgaard, 2016), neural machine translation (Dong et al., 2015; Luong et al., 2016), and super-sense tagging (Martínez Alonso and Plank, 2017). Sharing of information can further be achieved by extending LSTMs with an external memory shared across tasks (Liu et al., 2016). A further instance of multi-task learning is to optimise a supervised training objective jointly with an unsupervised training objective, as shown in Yu et al. (2016) for natural language generation and auto-encoding, and in Rei (2017) for different sequence labelling tasks and language modelling.

Boundary Classification. KBC is very similar to named entity recognition (NER), though arguably harder. Deep neural networks have been applied to NER in Collobert et al. (2011b) and Lample et al. (2016). Other successful methods rely on conditional random fields, thereby modelling the probability of each output label conditioned on the label at the previous time step. Lample et al. (2016), currently state-of-the-art for NER, stack CRFs on top of recurrent neural networks. We leave exploring such models in combination with multi-task learning for future work.

Keyphrase detection methods specific to the scientific domain often use keyphrase gazetteers as features or exploit citation graphs (Hasan and Ng, 2014). However, previous methods relied on corpora annotated for type-level identification, not for mention-level identification (Kim et al., 2010; Sterckx et al., 2016). While most applications rely on extracting keyphrases (as types), this has the unfortunate consequence that previous work ignores acronyms and other short-hand forms referring to methods, metrics, etc. Further, relying on gazetteers makes overfitting likely, obtaining lower scores on out-of-gazetteer keyphrases.

7 Conclusions and Future Work

We present a new state of the art for keyphrase boundary classification, using data from related, auxiliary tasks; in particular, super-sense tagging and identification of multi-word expressions. Deep multi-task learning improves significantly on previous approaches to KBC, with error reductions of up to 9.64%, mostly due to better identification and labelling of long keyphrases.

In future work, we want to explore alternative multi-task learning regimes to hard parameter sharing and experiment with additional auxiliary tasks.
The auxiliary tasks considered here are standard NLP tasks, hyperlink prediction aside. Other tasks may be more directly relevant, such as predicting the layout of calls for papers for scientific conferences, or predicting hashtags in tweets by scientists, since both data sources contain scientific keyphrases.

Acknowledgments

We would like to thank Elsevier for supporting this work.

References

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. SemEval-2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications. In Proceedings of SemEval, to appear.

Jonathan Baxter. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research 12:149–198.

Marcel Bollman and Anders Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING.

Rich Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011a. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Dipanjan Das, Desai Chen, Andre Martins, Nathan Schneider, and Noah Smith. 2014. Frame-semantic parsing. Computational Linguistics 40(1):9–56.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task Learning for Multiple Language Translation. In Proceedings of ACL.

Jenny Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks 18(5):602–610.

Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of ACL.

Anders Johannsen, Dirk Hovy, Héctor Martínez, Barbara Plank, and Anders Søgaard. 2014. More or less supervised supersense tagging of Twitter. In Proceedings of *SEM.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles. In Proceedings of SemEval.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT, pages 260–270.

Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings. In Proceedings of EMNLP.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Deep Multi-Task Learning with Shared Memory for Text Classification. In Proceedings of EMNLP.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task Sequence to Sequence Learning. In Proceedings of ICLR.

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of EACL.

Andreas Maurer. 2007. Bounds for Linear Multi Task Learning. Journal of Machine Learning Research 7:117–139.

Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods. In Proceedings of LREC.

Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL, to appear.

Nathan Schneider and Noah Smith. 2015. A Corpus and Model Integrating Multiword Expressions and Supersenses. In Proceedings of NAACL-HLT.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of ACL.

Valentin Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing. In Proceedings of ACL.

Lucas Sterckx, Cornelia Caragea, Thomas Demeester, and Chris Develder. 2016. Supervised Keyphrase Extraction as Positive Unlabeled Learning. In Proceedings of EMNLP.

Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online Segment to Segment Neural Transduction. In Proceedings of EMNLP.
