Multi-Task Learning of Keyphrase Boundary Classification


Authors: Isabelle Augenstein (Department of Computer Science, University College London, i.augenstein@ucl.ac.uk) and Anders Søgaard (Department of Computer Science, University of Copenhagen, soegaard@di.ku.dk). Both authors contributed equally.

Abstract

Keyphrase boundary classification (KBC) is the task of detecting keyphrases in scientific articles and labelling them with respect to predefined types. Although important in practice, this task is so far underexplored, partly due to the lack of labelled data. To overcome this, we explore several auxiliary tasks, including semantic super-sense tagging and identification of multi-word expressions, and cast the task as a multi-task learning problem with deep recurrent neural networks. Our multi-task models perform significantly better than previous state-of-the-art approaches on two scientific KBC datasets, particularly for long keyphrases.

1 Introduction

The scientific keyphrase boundary classification (KBC) task consists of a) determining keyphrase boundaries, and b) labelling keyphrases with their types according to a predefined schema. KBC is motivated by the need to efficiently search scientific literature, which can be summarised by its keyphrases. Several companies are working on keyphrase-based recommender systems for scientific literature, or on search interfaces where scientific articles decorate graphs in which nodes are keyphrases. Such keyphrases must be dynamically retrieved from the articles, because important scientific concepts emerge on a daily basis, and the most recent concepts are typically the ones of interest to scientists.

KBC is not a common task in NLP, and there are only a few small annotated datasets for inducing supervised KBC models, made available recently (QasemiZadeh and Schumann, 2016; Augenstein et al., 2017). Typical KBC approaches therefore rely on hand-crafted gazetteers (Hasan and Ng, 2014) or reduce the task to extracting a list of keyphrases for each document (Kim et al., 2010), instead of identifying mentions of keyphrases in sentences. For related, more common NLP tasks such as named entity recognition and identification of multi-word expressions, neural sequence labelling methods have been shown to be useful (Lample et al., 2016). In order to overcome the small data problem, we study using more widely available data for tasks related to KBC and exploit their synergies in a deep multi-task learning setup.

Multi-task learning has become popular within natural language processing and machine learning over the last few years; in particular, hard parameter sharing of hidden layers in deep learning models. This approach to multi-task learning has three advantages: a) it significantly reduces Rademacher complexity (Baxter, 2000; Maurer, 2007), i.e., the risk of over-fitting; b) it is space-efficient, reducing the number of parameters; and c) it is easy to implement.

This paper shows how hard parameter sharing can be used to improve gazetteer-free keyphrase boundary classification models, by exploiting different syntactically and semantically annotated corpora, as well as more readily available data such as hyperlinks.
Contributions. We study the so far widely underexplored, though in practice important, task of scientific keyphrase boundary classification, for which only a small amount of training data is available. We overcome this by identifying good auxiliary tasks and casting it as a multi-task learning problem. We evaluate our models across two new, manually annotated corpora of scientific articles and outperform single-task approaches by up to 9.64% F1, mostly due to better performance for long keyphrases.

2 Keyphrase Boundary Classification

Consider the following sentence from a scientific paper:

(1) We find that simple interpolation methods, like log-linear and linear interpolation, improve the performance but fall short of the performance of an oracle.

This sentence occurs in the ACL RD-TEC 2.0 corpus. Here, 'interpolation methods' and 'log-linear and linear interpolation' are annotated as technical keyphrases, 'performance' as a keyphrase related to measurements, and 'oracle' is a keyphrase labelled as miscellaneous. Below, we are interested in predicting the boundaries and the types of all keyphrases.

3 Multi-Task Learning

Multi-task learning is an approach to learning in which generalisation is improved by taking advantage of the inductive bias in the training signals of related tasks. When abundant labelled data is available for an auxiliary task, but little data for the target task, multi-task learning can act as a form of semi-supervised learning combined with a distant supervision signal. Inducing a model from only the sparse target task data may lead to overfitting to random noise in the data, but relying on auxiliary data helps the model generalise, making it easier to abstract away from noise, as well as leveraging the marginal distribution of auxiliary input data. From a representation learning perspective, auxiliary tasks can be used to induce representations that may be beneficial for the target task. Caruana (1993) also suggests that the auxiliary task can help focus attention in the induction of the target task model. Finally, multi-task learning can be cast as a regulariser, as studies show reductions in Rademacher complexity in multi-task architectures over single-task architectures (Baxter, 2000; Maurer, 2007).

Here, we follow the probably most common approach to multi-task learning, known as hard parameter sharing. This was introduced in Caruana (1993) in the context of deep neural networks, in which hidden layers can be shared among tasks. We assume $T$ different training sets, $D_1, \ldots, D_T$, where each $D_t$ contains pairs of input-output sequences $(w_{1:n}, y^t_{1:n})$, $w_i \in V$, $y^t_i \in L^t$. The input vocabulary $V$ is shared across tasks, but the output vocabularies (tagsets) $L^t$ are task-dependent. At each step in the training process we choose a random task $t$, followed by a random training instance $(w_{1:n}, y^t_{1:n}) \in D_t$. We use the tagger to predict the labels $\hat{y}^t_i$, suffer a loss with respect to the true labels $y^t_i$, and update the model parameters. The parameters are trained jointly for a sentence, i.e., a cross-entropy loss over each sentence is employed. Each task is associated with an independent classification function, but all tasks share the hidden layers. Note that for our experiments, we only consider one auxiliary task at a time (i.e., $T = 2$). A minimal code sketch of this training scheme follows at the end of this section.
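To make hard parameter sharing concrete, the following is a minimal PyTorch sketch of a multi-task tagger with a shared BiLSTM encoder and one classification head per task, trained by sampling a random task and then a random sentence at each step, as described above. This is an illustration under our reading of Section 3, not the authors' released implementation; the class name `MultiTaskTagger` and the tensor conventions are assumptions.

```python
# A minimal sketch of hard parameter sharing (not the authors' code):
# shared embeddings + BiLSTM, one task-specific output layer per task.
import random
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim, hidden_dim, tagset_sizes):
        super().__init__()
        # Shared layers: embeddings + stacked BiLSTM (hard parameter sharing).
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=3,
                               bidirectional=True, batch_first=True)
        # One independent classification function per task t over tagset L^t.
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_dim, n_tags) for n_tags in tagset_sizes)

    def forward(self, words, task):
        states, _ = self.encoder(self.embed(words))
        return self.heads[task](states)  # per-token tag scores

def train(model, datasets, epochs=10, lr=0.001):
    # datasets[t]: list of (word_ids, tag_ids) sentence pairs for task t.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for _ in range(sum(len(d) for d in datasets)):
            t = random.randrange(len(datasets))       # random task ...
            words, tags = random.choice(datasets[t])  # ... random sentence
            scores = model(words.unsqueeze(0), t)
            # Sentence-level cross-entropy updates shared and head parameters.
            loss = loss_fn(scores.squeeze(0), tags)
            opt.zero_grad(); loss.backward(); opt.step()
```

With one auxiliary task at a time, as in our experiments, `datasets` would hold two entries: the KBC training set and one auxiliary dataset.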
4 Experiments

Experimental Setup. We perform experiments for both keyphrase boundary identification (unlabelled), and keyphrase boundary identification and classification (labelled). The metrics measured are token-level precision, recall and F1, micro-averaged across keyphrase types; a sketch of the computation is given at the end of this section. Types are defined by the two datasets studied.

Auxiliary tasks. We experiment with five auxiliary tasks: (1) syntactic chunking, using annotations extracted from the English Penn Treebank, following Søgaard and Goldberg (2016); (2) frame target annotation from FrameNet 1.5 (corresponding to the target identification and classification tasks in Das et al. (2014)); (3) hyperlink prediction, using the dataset from Spitkovsky et al. (2010); (4) identification of multi-word expressions, using the Streusle corpus (Schneider and Smith, 2015); and (5) semantic super-sense tagging, using the Semcor dataset, following Johannsen et al. (2014). We train our models on the main task with one auxiliary task at a time. Note that the datasets for the auxiliary tasks are not annotated with keyphrase boundary identification or classification labels.

Datasets. We evaluate on the SemEval 2017 Task 10 dataset (Augenstein et al., 2017) and the ACL RD-TEC 2.0 dataset (QasemiZadeh and Schumann, 2016). The SemEval 2017 dataset is annotated with three keyphrase types, the ACL RD-TEC dataset with seven. For the former, we test on the development portion of the dataset, as the test set is not released yet. We randomly split ACL RD-TEC into a training and test set, reserving 1/3 for testing. Key dataset characteristics are summarised in Table 1. One important observation is that the SemEval 2017 dataset contains a significantly higher proportion of long keyphrases than the ACL dataset.

| Characteristic | SemEval 2017 Task 10 | ACL RD-TEC |
|---|---|---|
| Labels | Material, Process, Task | Technology and Method; Tool and Library; Language Resource; Language Resource Product; Measures and Measurements; Models; Other |
| Topics | Computer Science, Physics, Material Science | Natural Language Processing |
| Number of keyphrases | 5730 | 2939 |
| Proportion of singleton keyphrases | 31% | 83% |
| Proportion of single-word mentions | 18% | 23% |
| Proportion of mentions with word length ≥ 2 | 82% | 77% |
| Proportion of mentions with word length ≥ 3 | 51% | 33% |
| Proportion of mentions with word length ≥ 5 | 22% | 8% |

Table 1: Characteristics of the SemEval 2017 Task 10 and ACL RD-TEC corpora (statistics of the training sets).

Models. Our single- and multi-task networks are three-layer, bi-directional LSTMs (Graves and Schmidhuber, 2005) with pre-trained SENNA embeddings (http://ronan.collobert.com/senna/). For the multi-task networks, we follow the training procedure outlined in Section 3. The dimensionality of the embeddings is 50, and we follow Søgaard and Goldberg (2016) in using the same dimensionality for the hidden layers. We add a dropout of 0.1 to the input and train these architectures with momentum SGD, with an initial learning rate of 0.001 and a momentum of 0.9, for 10 epochs.

Baselines. Our baselines are Finkel et al. (2005) (http://nlp.stanford.edu/software/CRF-NER.shtml) and Lample et al. (2016) (https://github.com/clab/stack-lstm-ner), chosen in order to compare to a lexicalised and a state-of-the-art neural method. We use the implementations released by the authors and re-train the models on our data.
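As a concrete reading of the evaluation setup above, the sketch below computes token-level precision, recall and F1, micro-averaged across keyphrase types, from per-token gold and predicted labels. It reflects our understanding of standard micro-averaging rather than the exact evaluation script; the convention that non-keyphrase tokens carry an `O` tag, and the handling of mislabelled tokens as both false positives and false negatives, are assumptions.

```python
# Token-level precision/recall/F1, micro-averaged across keyphrase types.
# A sketch of standard micro-averaging, not the exact evaluation script;
# assumes non-keyphrase tokens carry the tag "O".
def micro_prf(gold, pred):
    """gold, pred: flat lists of per-token tags, e.g. "Process" or "O"."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != "O" and p == g:
            tp += 1          # correctly labelled keyphrase token
        elif p != "O":
            fp += 1          # spurious or mislabelled prediction
        if g != "O" and p != g:
            fn += 1          # missed or mislabelled gold token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one mislabelled token and one missed token.
gold = ["O", "Process", "Process", "O", "Material"]
pred = ["O", "Process", "Task",    "O", "O"]
print(micro_prf(gold, pred))  # (0.5, 0.333..., 0.4)
```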
5 Results and Analysis

Results for the SemEval 2017 Task 10 corpus are presented in Table 2, and for the ACL RD-TEC corpus in Table 3. For the SemEval corpus, all five labelled multi-task learning models outperform both examples of previous work, as well as our single-task BiLSTM baseline, by some margin. For ACL RD-TEC, three of our five labelled multi-task learning models perform better than the single-task BiLSTM baseline.

On the SemEval corpus, the F1 error reduction of the best labelled model over the Stanford tagger is 9.64%. The lexicalised Finkel et al. (2005) model shows a surprisingly competitive performance on the ACL RD-TEC corpus, where it is only 2 points in F1 behind our best-performing labelled model and on par with our best-performing unlabelled model. Results with Lample et al. (2016), on the other hand, are lower than the Finkel et al. (2005) baseline. This might be due to the model having a large set of parameters to model state transitions, which poses a difficulty for small training datasets.

Overall, multi-task models show bigger improvements over baselines for the SemEval corpus, and all models achieve better results on ACL RD-TEC. The statistics shown in Table 1 help to explain this. Most noticeably, the SemEval dataset contains a significantly higher proportion of long keyphrases than the ACL dataset. Interestingly, ACL RD-TEC contains a large proportion of keyphrases which only appear once in the training set (singletons), significantly fewer keyphrases, and more keyphrase types, but that does not seem to impact results as much as a high proportion of long keyphrases.

| Method | Unlabelled Precision | Unlabelled Recall | Unlabelled F1 | Labelled Precision | Labelled Recall | Labelled F1 |
|---|---|---|---|---|---|---|
| Finkel et al. (2005) | 77.89 | 50.27 | 61.10 | 49.90 | 27.97 | 35.85 |
| Lample et al. (2016) | 71.92 | 49.37 | 58.55 | 41.36 | 28.47 | 33.72 |
| BiLSTM | 81.58 | 57.86 | 67.71 | 45.80 | 32.48 | 38.01 |
| BiLSTM + Chunking | 82.88 | 52.08 | 63.96 | 55.54 | 34.90 | 42.86 |
| BiLSTM + Framenet | 77.86 | 56.05 | 65.18 | 54.04 | 38.91 | 45.24 |
| BiLSTM + Hyperlinks | 76.59 | 60.53 | 67.62 | 46.99 | 44.09 | 41.13 |
| BiLSTM + Multi-word | 74.80 | 70.18 | 72.42 | 46.99 | 44.09 | 45.49 |
| BiLSTM + Super-sense | 83.70 | 51.76 | 63.93 | 56.94 | 35.25 | 43.54 |

Table 2: Results for keyphrase boundary classification on the SemEval 2017 Task 10 corpus.

| Method | Unlabelled Precision | Unlabelled Recall | Unlabelled F1 | Labelled Precision | Labelled Recall | Labelled F1 |
|---|---|---|---|---|---|---|
| Finkel et al. (2005) | 84.16 | 80.08 | 82.07 | 59.97 | 53.86 | 56.75 |
| Lample et al. (2016) | 65.60 | 86.06 | 74.45 | 31.30 | 41.07 | 35.53 |
| BiLSTM | 83.40 | 80.36 | 81.85 | 59.62 | 57.45 | 58.51 |
| BiLSTM + Chunking | 83.36 | 79.46 | 81.37 | 59.26 | 57.24 | 57.84 |
| BiLSTM + Framenet | 84.11 | 79.39 | 81.68 | 60.64 | 57.24 | 58.89 |
| BiLSTM + Hyperlinks | 83.94 | 79.12 | 81.46 | 60.18 | 56.73 | 58.40 |
| BiLSTM + Multi-word | 84.86 | 76.92 | 80.69 | 59.81 | 54.21 | 56.87 |
| BiLSTM + Super-sense | 84.67 | 78.29 | 81.36 | 61.35 | 56.73 | 58.95 |

Table 3: Results for keyphrase boundary classification on the ACL RD-TEC corpus.

All models struggle with semantically vague or broad keyphrases (e.g. 'items', 'scope', 'key') and long keyphrases, especially those containing clauses (e.g. 'complete characterisation of the oxide particles', 'earley deduction proof procedure for definite clauses'). The multi-task models generally outperform the BiLSTM baseline for long keyphrases (e.g. 'language-independent system for automatic discovery of text in parallel translation', 'honeycomb network of graphite bricks'). Being able to recognise long keyphrases correctly is part of the reason our multi-task models outperform the baselines, especially on the SemEval dataset, which contains many such long keyphrases.

6 Related Work

Multi-Task Learning. Hard sharing of all hidden layers was introduced in Caruana (1993), and popularised in NLP by Collobert et al. (2011a). Several variants have been introduced, including hard sharing of selected layers (Søgaard and Goldberg, 2016) and sharing of parts (subspaces) of layers (Liu et al., 2015). Søgaard and Goldberg (2016) show that hard parameter sharing is an effective regulariser, also on heterogeneous tasks such as the ones considered here. Hard parameter sharing has been studied for several tasks, including CCG supertagging (Søgaard and Goldberg, 2016), text normalisation (Bollman and Søgaard, 2016), neural machine translation (Dong et al., 2015; Luong et al., 2016), and super-sense tagging (Martínez Alonso and Plank, 2017). Sharing of information can further be achieved by extending LSTMs with an external memory shared across tasks (Liu et al., 2016). A further instance of multi-task learning is to optimise a supervised training objective jointly with an unsupervised training objective, as shown in Yu et al. (2016) for natural language generation and auto-encoding, and in Rei (2017) for different sequence labelling tasks and language modelling.

Boundary Classification. KBC is very similar to named entity recognition (NER), though arguably harder. Deep neural networks have been applied to NER in Collobert et al. (2011b) and Lample et al. (2016). Other successful methods rely on conditional random fields, thereby modelling the probability of each output label conditioned on the label at the previous time step. Lample et al. (2016), currently state-of-the-art for NER, stack CRFs on top of recurrent neural networks. We leave exploring such models in combination with multi-task learning for future work.

Keyphrase detection methods specific to the scientific domain often use keyphrase gazetteers as features or exploit citation graphs (Hasan and Ng, 2014). However, previous methods relied on corpora annotated for type-level identification, not for mention-level identification (Kim et al., 2010; Sterckx et al., 2016). While most applications rely on extracting keyphrases (as types), this has the unfortunate consequence that previous work ignores acronyms and other short-hand forms referring to methods, metrics, etc. Further, relying on gazetteers makes overfitting likely, obtaining lower scores on out-of-gazetteer keyphrases.

7 Conclusions and Future Work

We present a new state of the art for keyphrase boundary classification, using data from related, auxiliary tasks; in particular, super-sense tagging and identification of multi-word expressions. Deep multi-task learning improves significantly on previous approaches to KBC, with error reductions of up to 9.64%, mostly due to better identification and labelling of long keyphrases.

In future work, we want to explore alternative multi-task learning regimes to hard parameter sharing and experiment with additional auxiliary tasks.
The auxiliary tasks considered here are standard NLP tasks, hyperlink prediction aside. Other tasks may be more directly relevant, such as predicting the layout of calls for papers for scientific conferences, or predicting hashtags in tweets by scientists, since both data sources contain scientific keyphrases.

Acknowledgments

We would like to thank Elsevier for supporting this work.

References

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. SemEval-2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications. In Proceedings of SemEval, to appear.

Jonathan Baxter. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research 12:149–198.

Marcel Bollman and Anders Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING.

Rich Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011a. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Dipanjan Das, Desai Chen, Andre Martins, Nathan Schneider, and Noah Smith. 2014. Frame-semantic parsing. Computational Linguistics 40(1):9–56.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task Learning for Multiple Language Translation. In Proceedings of ACL.

Jenny Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks 18(5):602–610.

Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of ACL.

Anders Johannsen, Dirk Hovy, Héctor Martínez, Barbara Plank, and Anders Søgaard. 2014. More or less supervised supersense tagging of Twitter. In Proceedings of *SEM.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles. In Proceedings of SemEval.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT, pages 260–270.

Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings. In Proceedings of EMNLP.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Deep Multi-Task Learning with Shared Memory for Text Classification. In Proceedings of EMNLP.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task Sequence to Sequence Learning. In Proceedings of ICLR.

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of EACL.

Andreas Maurer. 2007. Bounds for Linear Multi Task Learning. Journal of Machine Learning Research 7:117–139.

Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods. In Proceedings of LREC.

Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL, to appear.

Nathan Schneider and Noah Smith. 2015. A Corpus and Model Integrating Multiword Expressions and Supersenses. In Proceedings of NAACL-HLT.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of ACL.

Valentin Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing. In Proceedings of ACL.

Lucas Sterckx, Cornelia Caragea, Thomas Demeester, and Chris Develder. 2016. Supervised Keyphrase Extraction as Positive Unlabeled Learning. In Proceedings of EMNLP.

Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online Segment to Segment Neural Transduction. In Proceedings of EMNLP.
