Bidirectional LSTM-CRF for Clinical Concept Extraction


Authors: Raghavendra Chalapathy, Ehsan Zare Borzeshi, Massimo Piccardi

Raghavendra Chalapathy, University of Sydney / Capital Markets CRC, rcha9612@uni.sydney.edu.au
Ehsan Zare Borzeshi, Capital Markets CRC, ezborzeshi@cmcrc.com
Massimo Piccardi, University of Technology Sydney, Massimo.Piccardi@uts.edu.au

Abstract

Automated extraction of concepts from patient clinical records is an essential facilitator of clinical research. For this reason, the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records introduced a concept extraction task aimed at identifying and classifying concepts into predefined categories (i.e., treatments, tests and problems). State-of-the-art concept extraction approaches rely heavily on handcrafted features and domain-specific resources which are hard to collect and define. For this reason, this paper proposes an alternative, streamlined approach: a recurrent neural network (the bidirectional LSTM with CRF decoding) initialized with general-purpose, off-the-shelf word embeddings. The experimental results achieved on the 2010 i2b2/VA reference corpora using the proposed framework outperform all recent methods and rank closely to the best submission from the original 2010 i2b2/VA challenge.

1 Introduction

Patient clinical records typically contain longitudinal data about patients' health status, diseases, conducted tests and response to treatments. Analysing such information can prove of immense value not only for clinical practice, but also for the organisation and management of healthcare services. Concept extraction (CE) aims to identify mentions of medical concepts such as problems, tests and treatments in clinical records (e.g., discharge summaries and progress reports) and classify them into predefined categories. The concepts in clinical records are often expressed in unstructured, "free" text, making their automatic extraction a challenging task for clinical Natural Language Processing (NLP) systems. Traditional approaches have extensively relied on rule-based systems and lexicons to recognise the concepts of interest. Typically, the concepts represent drug names, anatomical nomenclature and other specialized names and phrases which are not part of everyday vocabularies. For instance, "resp status" should be interpreted as "response status". Such use of abbreviated phrases and acronyms is very common within the medical community, with many abbreviations having a specific meaning that differs from that of other lexicons. Dictionary-based systems perform concept extraction by looking up terms in medical ontologies such as the Unified Medical Language System (UMLS) (Kipper-Schuler et al., 2008). Intrinsically, dictionary- and rule-based systems are laborious to implement and inflexible to new cases and misspellings (Liu et al., 2015). Although these systems can achieve high precision, they tend to suffer from low recall (i.e., they may miss a significant number of concepts). To overcome these limitations, various machine learning approaches have been proposed (e.g., conditional random fields (CRFs), maximum-entropy classifiers and support vector machines) to simultaneously exploit the textual and contextual information while reducing the reliance on lexicon lookup (Lafferty et al., 2001; Berger et al., 1996; Joachims, 1998).
State-of-the-art machine learning approaches usually follow a two-step process of feature engineering and classification. The feature engineering task is, in its own right, very laborious and demanding on expert knowledge, and it can become the bottleneck of the overall approach. For this reason, this paper proposes a highly streamlined alternative: to employ a contemporary neural network, the bidirectional LSTM-CRF, initialized with general-purpose, off-the-shelf word embeddings such as GloVe (Pennington et al., 2014a) and Word2Vec (Mikolov et al., 2013b). The experimental results over the authoritative 2010 i2b2/VA benchmark show that the proposed approach outperforms all recent approaches and ranks closely to the best from the literature.

Sentence:       His  HCT     had  dropped  from  36.7  despite  2U           PRBC         and
Concept class:  O    B-test  O    O        O     O     O        B-treatment  I-treatment  O

Table 1: Example sentence in a concept extraction task. The concept classes are represented in the standard in/out/begin (IOB) format.

2 Related Work

Most of the research to date has framed CE as a specialized case of named-entity recognition (NER) and employed a number of supervised and semi-supervised machine learning algorithms with domain-dependent attributes and text features (Uzuner et al., 2011). Hybrid models obtained by cascading CRF and SVM classifiers along with several pattern-matching rules have been shown to produce effective results (Boag et al., 2015). Moreover, Fu and Ananiadou (2014) have given evidence of the importance of including preprocessing steps such as truecasing and annotation combination. The system that has reported the highest accuracy on the 2010 i2b2/VA concept extraction benchmark is based on unsupervised feature representations obtained by Brown clustering and a hidden semi-Markov model as classifier (deBruijn et al., 2011). However, the use of a "hard" clustering technique such as Brown clustering is not suitable for capturing multiple relations between the words and the concepts. For this reason, Jonnalagadda et al. (2012) demonstrated that a random indexing model with distributed word representations can improve clinical concept extraction. Moreover, Wu et al. (2015) have jointly used word embeddings derived from the entire English Wikipedia (Collobert et al., 2011) and binarized word embeddings derived from domain-specific corpora (e.g., the MIMIC-II corpus (Saeed et al., 2011)). In the broader field of machine learning, recent years have witnessed a proliferation of deep neural networks, with outstanding results in tasks as diverse as visual, speech and named-entity recognition (Hinton et al., 2012; Krizhevsky et al., 2012; Lample et al., 2016). One of the main advantages of neural networks over traditional approaches is that they can learn the feature representations automatically from the data, thus avoiding the expensive feature-engineering stage. Given the promising performance of deep neural networks and the recent success of unsupervised word embeddings in general NLP tasks (Pennington et al., 2014a; Mikolov et al., 2013b; Lebret and Collobert, 2014), this paper sets out to explore the use of a state-of-the-art deep sequential model initialized with general-purpose word embeddings for the task of clinical concept extraction.
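Before moving to the model description, the following minimal Python sketch (ours, not the authors' released code) shows how the IOB tags of Table 1 can be grouped back into concept spans; the token list, tag list and the helper function extract_concepts are purely illustrative.

```python
# Illustrative sketch (not from the paper): the Table 1 sentence encoded in IOB format.
tokens = ["His", "HCT", "had", "dropped", "from", "36.7", "despite", "2U", "PRBC", "and"]
tags   = ["O", "B-test", "O", "O", "O", "O", "O", "B-treatment", "I-treatment", "O"]

def extract_concepts(tokens, tags):
    """Group IOB tags into (concept_text, concept_class) spans."""
    concepts, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                concepts.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                concepts.append((" ".join(current), label))
            current, label = [], None
    if current:
        concepts.append((" ".join(current), label))
    return concepts

print(extract_concepts(tokens, tags))
# [('HCT', 'test'), ('2U PRBC', 'treatment')]
```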
3 The Proposed Approach

CE can be formulated as a joint segmentation and classification task over a predefined set of classes. As an example, consider the input sentence provided in Table 1. The notation follows the widely adopted in/out/begin (IOB) entity representation with, in this instance, HCT as the test and 2U PRBC as the treatment. In this paper, we approach the CE task with the bidirectional LSTM-CRF framework, where each word in the input sentence is first mapped to either a random vector or a vector from a word embedding. We therefore provide a brief description of both the word embeddings and the model hereafter.

Word embeddings are vector representations of natural language words that aim to preserve the semantic and syntactic similarities between them. The vector representations can be generated either by count-based approaches such as Hellinger-PCA (Lebret and Collobert, 2014) or by trained models such as Word2Vec (including skip-grams and continuous bag-of-words) and GloVe, trained over large, unsupervised corpora of general-nature documents. In its embedded representation, each word in a text is represented by a real-valued vector, $x$, of arbitrary dimensionality, $d$.

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors $(x_1, x_2, \ldots, x_n)$ and output a sequence of class posterior probabilities, $(y_1, y_2, \ldots, y_n)$. An intermediate layer of hidden nodes, $(h_1, h_2, \ldots, h_n)$, is also part of the model.

              Training set  Test set
notes         170           256
sentences     16315         27626
problem       7073          12592
test          4608          9225
treatment     4844          9344

Table 2: Statistics of the training and test data sets used for the 2010 i2b2/VA concept extraction.

In an RNN, the value of the hidden node at time $t$, $h_t$, depends on both the current input, $x_t$, and the previous hidden node, $h_{t-1}$. This recurrent connection from the past timeframe enables a form of short-term memory and makes RNNs suitable for the prediction of sequences. Formally, the value of a hidden node is described as:

$$h_t = f(U \cdot x_t + V \cdot h_{t-1}) \qquad (1)$$

where $U$ and $V$ are trained weight matrices between the input and the hidden layer, and between the past and current hidden layers, respectively. Function $f(\cdot)$ is the sigmoid function, $f(x) = 1/(1 + e^{-x})$, which adds non-linearity to the layer. Eventually, $h_t$ is fed into the output layer and multiplied by the output weight matrix, $W$:

$$y_t = g(W \cdot h_t), \quad \text{with} \quad g(z_m) = \frac{e^{z_m}}{\sum_{k=1}^{K} e^{z_k}} \qquad (2)$$

The output is then normalized by the multi-class logistic function, $g(\cdot)$, to become a proper probability over the class set. Therefore, the output dimensionality is equal to the number of concept classes. Although an RNN can, in theory, learn long-term dependencies, in practice it tends to be biased towards its most recent inputs. For this reason, the Long Short-Term Memory (LSTM) network incorporates an additional "gated" memory cell that can store long-range dependencies (Hochreiter and Schmidhuber, 1997). In its bidirectional version, the LSTM computes both a forward, $\overrightarrow{h}_t$, and a backward, $\overleftarrow{h}_t$, hidden representation at each timeframe $t$. The final representation is created by concatenating them as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
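As a concrete illustration of Eqs. (1)-(2), the following NumPy sketch implements the plain RNN recurrence and softmax output described above; the dimensions and variable names are our own illustrative choices, not the paper's implementation.

```python
# Minimal sketch of the plain RNN of Eqs. (1)-(2); all sizes are illustrative.
import numpy as np

d, H, K = 50, 100, 7                    # embedding size, hidden size, number of concept classes
U = np.random.uniform(-1, 1, (H, d))    # input-to-hidden weights
V = np.random.uniform(-1, 1, (H, H))    # hidden-to-hidden (recurrent) weights
W = np.random.uniform(-1, 1, (K, H))    # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(x_seq):
    """x_seq: list of d-dimensional word vectors; returns per-word class posteriors."""
    h = np.zeros(H)
    posteriors = []
    for x_t in x_seq:
        h = sigmoid(U @ x_t + V @ h)       # Eq. (1): current input plus previous hidden state
        posteriors.append(softmax(W @ h))  # Eq. (2): normalized class posteriors
    return posteriors

# A bidirectional LSTM replaces this plain recurrence with gated memory cells run in
# both directions, concatenating the two hidden states: h_t = [h_fwd_t ; h_bwd_t].
```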
In all these networks, the hidden layer can be regarded as an implicit, learned feature representation that enables concept prediction. A further improvement to this model is provided by performing joint decoding of the entire input sequence in a Viterbi-style manner using a CRF (Lafferty et al., 2001) as the final output layer. The resulting network is commonly referred to as the bidirectional LSTM-CRF (Lample et al., 2016).

4 Experiments

4.1 Dataset

The 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records include a concept extraction task focused on the extraction of medical concepts from patient reports. For the challenge, a total of 394 concept-annotated reports for training, 477 for testing, and 877 unannotated reports were de-identified and released to the participants alongside a data use agreement (Uzuner et al., 2011). However, part of this dataset is no longer being distributed due to restrictions later introduced by the Institutional Review Board (IRB). Table 2 therefore summarizes the basic statistics of the training and test data sets which are currently publicly available and which we have used in our experiments.

4.2 Evaluation Methodology

Our models have been blindly evaluated on the 2010 i2b2/VA CE test data using a strict evaluation criterion requiring the predicted concepts to exactly match the annotated concepts in terms of both boundary and class.

Methods                                                    Precision  Recall  F1 Score
Hidden semi-Markov Model (deBruijn et al., 2011)           86.88      83.64   85.23
Distributional Semantics CRF (Jonnalagadda et al., 2012)   85.60      82.00   83.70
Binarized Neural Embedding CRF (Wu et al., 2015)           85.10      80.60   82.80
CliNER (Boag et al., 2015)                                 79.50      81.20   80.00
Truecasing CRFSuite (Fu and Ananiadou, 2014)               80.83      71.47   75.86
Random - Bidirectional LSTM-CRF                            81.06      75.40   78.13
Word2Vec - Bidirectional LSTM-CRF                          82.61      80.03   81.30
GloVe - Bidirectional LSTM-CRF                             84.36      83.41   83.88

Table 3: Performance comparison between the bidirectional LSTM-CRF (bottom three lines) and state-of-the-art systems (top five lines) over the 2010 i2b2/VA concept extraction task.

To facilitate the replication of our experimental results, we have used a publicly available library for the implementation of the LSTM (i.e., the Theano neural network toolkit (Bergstra et al., 2010)) and we publicly release our code[1]. We have split the training set into two parts (sized at approximately 70% and 30%, respectively), using the first for training and the second for selection of the hyper-parameters ("validation") (Bergstra and Bengio, 2012). The hyper-parameters include the embedding dimension, $d$, chosen over {50, 100, 300, 500}, and two additional parameters, the learning and drop-out rates, which were sampled from a uniform distribution in the range [0.05, 0.1]. All weight matrices were randomly initialized from the uniform distribution within the range [-1, 1]. The word embeddings were either initialized randomly in the same way or fetched from Word2Vec and GloVe (Mikolov et al., 2013a; Pennington et al., 2014b). Approximately 25% of the tokens were alphanumeric, abbreviated or domain-specific strings that were not available as pre-trained embeddings and were always randomly initialized.
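The hyper-parameter selection described above amounts to a small random search (Bergstra and Bengio, 2012). A minimal sketch of such sampling is given below; the function and variable names are our own assumptions, not taken from the released code.

```python
# Illustrative sketch of random hyper-parameter sampling in the style described above.
import random

def sample_config():
    """Draw one hyper-parameter configuration at random."""
    return {
        "embedding_dim": random.choice([50, 100, 300, 500]),
        "learning_rate": random.uniform(0.05, 0.1),
        "dropout_rate":  random.uniform(0.05, 0.1),
    }

configs = [sample_config() for _ in range(20)]
# Each configuration would be trained on the 70% split and scored on the 30%
# validation split; the best-scoring model is then retained.
```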
Early stopping of training was set at 50 epochs to mitigate over-fitting, and the model that gave the best performance on the validation set was retained. The accuracy is reported in terms of the micro-average F1 score computed using the CoNLL score function (Nadeau and Sekine, 2007).

4.3 Results and Analysis

Table 3 shows the performance comparison between state-of-the-art CE systems and the proposed bidirectional LSTM-CRF with different initialization strategies. As a first note, the bidirectional LSTM-CRF initialized with GloVe outperforms all recent approaches (2012-2015). On the other hand, the best submission from the 2010 i2b2/VA challenge (deBruijn et al., 2011) still outperforms our approach. However, based on the description provided in (Uzuner et al., 2011), these results are not directly comparable since the experiments in (deBruijn et al., 2011; Jonnalagadda et al., 2012) had used the original dataset, which has a significantly larger number of training samples. Using general-purpose, pre-trained embeddings improves the F1 score by over 5 percentage points over a random initialization. In general, the results achieved with the proposed approach are close to, and in many cases above, the results achieved by systems based on hand-engineered features.

Conclusion

This paper has explored the effectiveness of the contemporary bidirectional LSTM-CRF for clinical concept extraction. The most appealing feature of this approach is its ability to provide end-to-end recognition using general-purpose, off-the-shelf word embeddings, thus sparing the effort of time-consuming feature construction. The experimental results over the authoritative 2010 i2b2/VA reference corpora look promising, with the bidirectional LSTM-CRF outperforming all recent approaches and ranking closely to the best submission from the original 2010 i2b2/VA challenge. A potential way to further improve its performance would be to explore the use of unsupervised word embeddings trained on domain-specific resources such as the MIMIC-III corpora (Johnson et al., 2016).

[1] https://github.com/raghavchalapathy/Bidirectional-LSTM-CRF-for-Clinical-Concept-Extraction

References

[Berger et al. 1996] Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

[Bergstra and Bengio 2012] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 13:281-305.

[Bergstra et al. 2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In The 9th Python in Science Conference, pages 1-7.

[Boag et al. 2015] William Boag, Kevin Wacome, Tristan Naumann, and Anna Rumshisky. 2015. CliNER: A lightweight tool for clinical named entity recognition. In AMIA Joint Summits on Clinical Research Informatics (poster).

[Collobert et al. 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research (JMLR), 12:2493-2537.
[deBruijn et al. 2011] Berry deBruijn, Colin Cherry, Svetlana Kiritchenko, Joel Martin, and Xiaodan Zhu. 2011. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. Journal of the American Medical Informatics Association, 18(5):557-562.

[Fu and Ananiadou 2014] Xiao Fu and Sophia Ananiadou. 2014. Improving the extraction of clinical concepts from clinical records. In Proceedings of the 4th Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BioTxtM04).

[Hinton et al. 2012] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

[Joachims 1998] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML), pages 137-142. Springer.

[Johnson et al. 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo A. Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.

[Jonnalagadda et al. 2012] Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, and Graciela Gonzalez. 2012. Enhancing clinical concept extraction with distributional semantics. Journal of Biomedical Informatics, 45(1):129-140.

[Kipper-Schuler et al. 2008] Karin Kipper-Schuler, Vinod Kaggal, James Masanz, Philip Ogren, and Guergana Savova. 2008. System evaluation on a named entity corpus from clinical notes. In Language Resources and Evaluation Conference (LREC), pages 3001-3007.

[Krizhevsky et al. 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS).

[Lafferty et al. 2001] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Machine Learning Conference (ICML), pages 282-289.

[Lample et al. 2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In The North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

[Lebret and Collobert 2014] Rémi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In European Chapter of the Association for Computational Linguistics (EACL).

[Liu et al. 2015] Shengyu Liu, Buzhou Tang, Qingcai Chen, and Xiaolong Wang. 2015. Drug name recognition: Approaches and resources. Information, 6(4):790-810.

[Mikolov et al. 2013a] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. GloVe: Global Vectors for Word Representation. Scientific Data. Accessed: 2016-08-30.
[Mikolov et al. 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), pages 3111-3119.

[Nadeau and Sekine 2007] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26.

[Pennington et al. 2014a] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014a. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

[Pennington et al. 2014b] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014b. GloVe: Global vectors for word representation. Scientific Data. Accessed: 2016-08-30.

[Saeed et al. 2011] Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. 2011. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39:952-960.

[Uzuner et al. 2011] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552-556.

[Wu et al. 2015] Yonghui Wu, Jun Xu, Min Jiang, Yaoyun Zhang, and Hua Xu. 2015. A study of neural word embeddings for named entity recognition in clinical text. In AMIA Annual Symposium Proceedings, volume 2015.
