Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks

Savelie Cornegruta, Robert Bakewell, Samuel Withey and Giovanni Montana
Department of Biomedical Engineering, King's College London, UK
giovanni.montana@kcl.ac.uk

Abstract

Motivated by the need to automate medical information extraction from free-text radiological reports, we present a bi-directional long short-term memory (BiLSTM) neural network architecture for modelling radiological language. The model has been used to address two NLP tasks: medical named-entity recognition (NER) and negation detection. We investigate whether learning several types of word embeddings improves BiLSTM's performance on those tasks. Using a large dataset of chest x-ray reports, we compare the proposed model to a baseline dictionary-based NER system and a negation detection system that leverages the hand-crafted rules of the NegEx algorithm and the grammatical relations obtained from the Stanford Dependency Parser. Compared to these more traditional rule-based systems, we argue that BiLSTM offers a strong alternative for both our tasks.

1 Introduction

Radiological reports represent a large part of all Electronic Medical Records (EMRs) held by medical institutions. For instance, in England alone, upwards of 22 million plain radiographs were reported over the 12-month period from March 2015 (NHS, 2016). A radiological report is a written document produced by a Radiologist, a physician who specialises in interpreting medical images. A report typically states any technical factors relevant to the acquired image as well as the presence or absence of radiological abnormalities. When an abnormality is noted, the Radiologist often gives further description, including anatomical location and the extent of the disease.
Whilst Radiologists are taught to review radiographs in a systematic and comprehensive manner, their reporting style can vary quite dramatically (Reiner and Siegel, 2006) and the same findings can often be described in a multitude of different ways (Sobel et al., 1996). The radiological reports may contain broken grammar and misspellings, which are often the result of voice recognition software or the dictation-transcript method (McGurk et al., 2014). Applying text mining techniques to these reports poses a number of challenges due to extensive variability in language, ambiguity and uncertainty, which are typical problems for natural language.

In this work we are motivated by the need to automatically extract standardised clinical information from digitised radiological reports. A system for the fully-automated extraction of this information could be used, for instance, to characterise the patient population and help health professionals improve day-to-day services. The extracted structured data could also be used to build management dashboards (Simpao et al., 2014) summarising and presenting the most prevalent conditions. Another potential use is the automatic labelling of medical images, e.g. to support the development of computer-aided diagnosis software (Shin et al., 2015).

In this paper we propose a recurrent neural network (RNN) architecture for modelling radiological language and investigate its potential advantages on two different tasks: medical named-entity recognition (NER) and negation detection. The model, a bi-directional long short-term memory (BiLSTM) network, does not use any hand-engineered features, but learns them using a relatively small amount of labelled data and a larger but unlabelled corpus of radiological reports.
In addition, we explore the combined use of BiLSTM with other language models such as GloVe (Pennington et al., 2014) and a novel variant of GloVe, proposed here, that makes use of a medical ontology. The performance of the BiLSTM model is assessed comparatively to a rule-based system that has been optimised for the tasks at hand and builds upon well-established techniques for medical NER and negation detection. In particular, for NER, the system uses a baseline dictionary-based text mining component relying on a curated dictionary of medical terms. As a baseline for the negation detection task, the system implements a hybrid component based on the NegEx algorithm (Chapman et al., 2013) in conjunction with grammatical relations obtained from the Stanford Dependency Parser (Chen and Manning, 2014).

The article is organised as follows. In Section 2 we provide a brief review of the existing body of work in NLP for medical information extraction and briefly discuss the use of artificial neural networks for NLP tasks. In Section 3 we describe the datasets used for our experiments, and in Section 4 we introduce the BiLSTM model. The results are presented in Section 6, where we also compare BiLSTM against the rule-based baseline systems described in Section 5.

2 Related Work

2.1 Medical NER

A large proportion of NLP systems for medical text mining use dictionary-based methods for extracting medical concepts from clinical documents (Friedman et al., 1995; Johnson et al., 1997; Aronson, 2001; Savova et al., 2010). The dictionaries that contain the correspondence between a single- or multi-word phrase and a medical concept are usually built from medical ontologies such as the Unified Medical Language System (UMLS) (NLM, 2016b) and Medical Subject Headings (MeSH) (NLM, 2016a). These ontologies contain hundreds of thousands of medical concepts.
There are also domain-specific ontologies such as RadLex (Langlotz, 2006), which has been developed for the Radiology domain and currently contains over 68,000 concepts.

The Medical Language Extraction and Encoding System (MEDLEE) (Friedman et al., 1995) is one of the earliest automated systems originally developed for handling radiological reports, and later expanded to other medical domains. MEDLEE parses the given clinical documents by string matching: the words are matched to a pre-defined dictionary of medical terms or semantic groups (e.g. Central Finding, Bodyloc Modifier, Certainty Modifier and Region Modifier). Once the words have been associated with a semantic group, a Compositional Regularizer stage combines them according to a list of pre-defined mappings to form regularized multi-word phrases. The final stage looks up the regularized terms in a dictionary of medical concepts (e.g. enlarged heart is mapped to the corresponding concept cardiomegaly). A separate study evaluated MEDLEE on 150 manually annotated radiology reports (Hripcsak et al., 2002); MEDLEE was assessed on its ability to detect 24 clinical conditions, achieving an average sensitivity and specificity of 0.81 and 0.99, respectively.

A more recent system for general medical information extraction is the Mayo Clinic's Text Analysis and Knowledge Extraction System (cTAKES) (Savova et al., 2010), which also implements an NLP pipeline. During an initial shallow parsing stage, cTAKES attempts to group words into multi-word expressions by identifying constituent parts of the sentence (e.g. noun, prepositional, and verb phrases). It then string matches the identified phrases to a concept in UMLS. A new set of semantic groups was also derived from the UMLS ontology (Ogren et al., 2007). The NER performance of cTAKES was evaluated on the semantic groups, achieving an F1-score of 0.715 for exact matches and 0.824 for overlapping matches.

In general, dictionary-based systems perform with high precision on NER tasks but have a low recall, showing a lack of generalisation. Low recall is usually caused by the inability to identify multi-word phrases as concepts unless exact matches can be found in the dictionary. In addition, such systems are not able to easily deal with disjoint entities. For instance, in the phrase lungs are mildly hyperexpanded, hyperexpanded lungs constitutes a clinical finding. In an attempt to deal with disjoint entities, rule-based systems such as MEDLEE, MetaMap (Aronson, 2001) and cTAKES implement additional parsing stages to find grammatical relations between different words in a sentence, thus aiming to create disjoint multi-word phrases. However, state-of-the-art syntactic parsers are still likely to fail when parsing sentences with broken grammar, as often occurs in clinical documents.

In an attempt to improve upon dictionary-based information extraction systems, Hassanpour (2015) recently used a first-order linear-chain Conditional Random Field (CRF) model (Lafferty et al., 2001) in a medical NER task involving five semantic groups (anatomy, anatomy modifier, observation, observation modifier, and uncertainty). The features used for the CRF model included part-of-speech (POS) tags, word stems, word n-grams, word shape, and negations extracted using the NegEx algorithm. The model was trained and tested using 10-fold cross-validation on a corpus of 150 multi-institutional Radiology reports and achieved a precision score of 0.87, recall of 0.84, and F1-score of 0.85.

2.2 Medical negation detection

NegEx, a popular negation detection algorithm, is usually applied to medical concepts after the entity recognition stage. This tool uses a curated list of phrases (e.g. no, no sign of, free of), which are string matched to the medical text to detect a negation trigger, i.e.
a word or phrase indicating the presence of a negated medical entity in the sentence. The target entities falling inside a window starting at the negation trigger are then classified as negated. In light of its simplicity, speed and reasonable results, NegEx has been used as a component by many medical NLP systems (Wu et al., 2014). It has been shown that NegEx achieves an accuracy of 0.94 as part of the cTAKES evaluation (Savova et al., 2010). However, the window approach that is used for classifying the negations may result in a large number of false positives, especially if there are multiple entities within the 6-word window.

Aiming to reduce the number of false positives, recent efforts have integrated NegEx with machine learning models that can be trained on annotated datasets. For instance, Shivade (2015) introduced a kernel-based approach that uses features built using the type of negation trigger, features derived from the existence of conjunctions in the sentence, and features that weight the NegEx output against the bag-of-words in the dataset. The kernel-based model outperformed the original NegEx algorithm by 2.7 F1-score points when trained and tested on the NegEx dataset. At around the same time, Mehrabi (2015) introduced DEEPEN, an algorithm that filters the NegEx output using the grammatical relations extracted by the Stanford Dependency Parser. DEEPEN succeeded at reducing the number of false positives, although it showed a marginally lower F1-score when compared with NegEx on concepts from the Disorders semantic group in the Mayo Clinic dataset (Ogren et al., 2007).

2.3 Neural networks for NLP tasks

In recent years, deep artificial neural networks have been found to yield consistently good results on various NLP tasks.
The SENNA system (Collobert et al., 2011), which used a convolutional neural network (CNN) architecture, came close to achieving state-of-the-art performance across the tasks of POS tagging, shallow parsing, NER, and semantic role labeling. More recently, recurrent neural networks (RNNs) have been shown to achieve very high performance, and often reach state-of-the-art results, in various language modelling tasks (Mikolov and Zweig, 2012). RNNs have also been shown to outperform more traditional machine learning models, such as Logistic Regression and CRF, at the slot filling task in spoken language understanding (Mesnil et al., 2013). In a NER task on publicly available datasets in four languages, bidirectional long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), a variant of RNN, outperformed CNNs, CRFs and other models (Lample et al., 2016).

Neural networks have also been used to learn language models in an unsupervised learning setting. Some popular models include Skip-gram and continuous bag-of-words (CBOW) (Mikolov et al., 2013). These yield word representations, or embeddings, that are able to carry the syntactic and semantic information of a language. Collobert (2011) showed that integrating pre-trained word embeddings into a neural network can help the supervised learning process.

[Figure 1: Example of manual annotation of a radiology report performed using BRAT]

3 A Radiology corpus

3.1 Dataset

For this study, we produced an in-house radiology corpus consisting of 745,480 historical chest X-ray (radiograph) reports provided by Guy's and St Thomas' Trust (GSTT). This Trust runs two hospitals within the National Health Service (NHS) in England, serving a large area in South London. The reports cover the period between January 2005 and March 2016, and were generated by 276 different reporters including consultant Radiologists, trainee Radiologists and reporting Radiographers.
Our repository consists of text written or dictated by the clinicians after radiograph analysis, and does not contain any referral information or patient-identifying data, such as names, addresses or dates of birth. However, many reports refer to the clinical history of the patient. The reports had a minimum of 1 word and a maximum of 311 words, with an average of 25.3 words and a standard deviation of 19.9 words. On average there were 2.9 sentences per report. After lemmatization, converting to lower case, and discounting words that occur fewer than 3 times in the corpus, the resulting vocabulary contained 8,031 words.

A sample of 2,000 reports was randomly selected from the corpus for the purpose of creating a training and validation dataset for the NER and negation detection tasks, whilst the remainder of the reports were utilised for pre-training word embeddings. The reports selected for manual annotation were written for all types of patients (Inpatient: 1072, A&E Attender: 515, Outpatient: 229, GP Direct Access Patient: 165, Ward Attender: 9, Day Case Patient: 8) by 144 different clinicians.

  Semantic Group     # of entities   # of tokens
  Body Location      5686            10113
  Clinical Finding   5396            8906
  Descriptor         3458            3845
  Medical Device     1711            3361
  Total              16251           26225
  Negated entities   1851            2557

Table 1: Frequency distribution of entities by class in 2,000 manually annotated reports

We introduce a simple word-level annotation schema that includes four classes or semantic groups: Clinical Finding, Body Location, Descriptor and Medical Device. Clinical Finding encompasses any clinically-relevant radiological abnormality; Body Location refers to the anatomical area where the finding is present; and Descriptor includes all adjectives used to describe the other classes. The Medical Device class is used to label any medical apparatus seen on chest radiographs, such as pacemakers, intravascular lines, and nasogastric tubes.
Our annotation schema allows for the same token to belong to several semantic groups. For example, as shown in Figure 1, the word heart was associated with both the Clinical Finding and Body Location classes. We have also introduced a negation attribute to indicate the absence of any of these entities.

3.2 Gold standard

Two clinicians (RB and SW) annotated the reports using BRAT (Stenetorp et al., 2012), a collaborative tool for text annotation that was configured to use our own schema. The BRAT output was then transformed to the IOBES tagging schema. Here, we interpret I as a token in the middle of an entity; O as a token not part of an entity; B and E as the beginning and end of an entity, respectively; and S as a single-word entity. We work with the assumption that entities may be disjoint, and that tokens surrounded by a disjoint entity may belong to a different semantic group. For example, according to the annotation performed by the clinicians, in the sentence Heart is slightly enlarged the phrase heart enlarged represents an entity that belongs to the semantic group Clinical Finding, and slightly is a Descriptor. The resulting breakdown of all entities by semantic group can be found in Table 1.

4 Methodology

In this Section we describe a model for NER that extracts five types of entities: the four semantic groups described in Section 3.1, as well as negation, which is treated here as an additional class, analogously to the semantic groups.

4.1 Bi-directional LSTM

The RNN is a neural network architecture designed to model time series, but it can be applied to other types of sequential data (Rumelhart et al., 1988). As information passes through the network, it can persist indefinitely in its memory. This facilitates the process of capturing sequential dependencies. The RNN makes a prediction after processing each element of the input sequence.
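The IOBES transformation described above can be sketched as a small conversion routine. This is a hypothetical illustration (not the authors' actual preprocessing code), mapping token-level entity spans onto IOBES tags for one semantic group:

```python
# Sketch: converting word-level entity spans to IOBES tags.
# Spans are (start, end) token index pairs, end exclusive.

def to_iobes(n_tokens, spans):
    tags = ["O"] * n_tokens          # default: token is outside any entity
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"        # single-word entity
        else:
            tags[start] = "B"        # entity beginning
            tags[end - 1] = "E"      # entity end
            for i in range(start + 1, end - 1):
                tags[i] = "I"        # tokens inside the entity
    return tags

# "Heart is slightly enlarged": tagging the two single-word pieces of a
# disjoint entity would use two spans, as here.
print(to_iobes(4, [(0, 1), (3, 4)]))  # ['S', 'O', 'O', 'S']
print(to_iobes(3, [(0, 3)]))          # ['B', 'I', 'E']
```

In the paper's setup one such tag sequence exists per annotation class, so a token can carry tags from several semantic groups at once.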
Hence, the output sequence can be of the same length as the input sequence. The RNN architecture lends itself as a natural model for the proposed NER task, where the objective is to predict the IOBES tags for each of the input words.

The RNN is trained using the error backpropagation through time algorithm (Werbos, 1990) and a variant of the gradient descent algorithm. However, training these models is notoriously challenging due to the problem of exploding and vanishing gradients, especially when trained with long input sequences (Bengio et al., 1994). For the exploding gradient problem, numerical stability can be achieved by clipping the gradients (Graves, 2013). The problem of vanishing gradients can be addressed by replacing the standard RNN cell with a long short-term memory (LSTM) cell, which allows for a constant error flow along the input sequence (Hochreiter and Schmidhuber, 1997). A more constant error also means that the network is able to learn better long-term dependencies over the input sequence. By combining the outputs of two RNNs that pass the information in opposing directions, it is possible to capture the context from both ends of the sequence. The resulting architecture is known as the Bidirectional LSTM (BiLSTM) (Graves and Schmidhuber, 2005).

We start by defining a vocabulary V = {v_1, v_2, ..., v_8031} that contains the words extracted from the corpus as described in Section 3.1. We assume that, in order to perform NER on the words in any given sentence, it is sufficient to consider only the information contained in that sentence. Therefore we pass the BiLSTM one sentence at a time. For each input sentence of n words we define an n-dimensional vector x whose elements are the indices in V corresponding to the words appearing in the sentence, preserving their order.
The input x is passed to an Embedding Layer that returns the sequence S = {w_j | j = x_1, x_2, ..., x_n}, where w_j is the j-th row of a dense matrix W ∈ R^{|V|×d} and d ∈ N is a hyperparameter. The vector w_j is a low-dimensional vector representation, or word embedding, and W is the corresponding embedding matrix.

The sequence of word embeddings S is then passed as input to two LSTM layers that process it in opposing directions (forwards and backwards), similar to the architecture introduced by Graves (2005). Figure 2 shows the LSTM layers in their "unrolled" form as they read the input. Each LSTM layer contains k LSTM memory cells, which are based on the implementation by Graves (2013). The output from each of the LSTM layers is H = {h_t ∈ R^k | t = 1, 2, ..., n}. Next, we concatenate and flatten H_forward and H_backward, obtaining a vector p ∈ R^{2kn}. We pass p through a linear transformation layer and reshape its output to a tensor of size n × C × T, where C is the number of annotation classes (5 in total: 4 semantic groups and 1 class for negation) and T is the number of possible tags (5 for the IOBES tags). Finally, we apply the softmax function along the last dimension of the tensor to approximate the probability of each of the possible tags for each annotation class.

4.2 Word embeddings

We explored 4 different techniques for learning word embeddings from the text. The embeddings are subsequently used to initialise the embedding matrix W that is required by BiLSTM for the NER task. In previous work, the initialisation of W with pre-trained embeddings has been found to improve the training process (Collobert et al., 2011; Mesnil et al., 2013).

Random Embeddings

Random embeddings were obtained by drawing from a uniform distribution in the (−0.01, 0.01) range.
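The tensor shapes of the output head described above (concatenate H_forward and H_backward, apply a linear layer, reshape to n × C × T, softmax over tags) can be sketched with NumPy. The LSTM outputs and weights are random placeholders, so this illustrates only the dimensions, not a trained model:

```python
import numpy as np

# Shape sketch for n tokens, k = 100 cells, C = 5 classes, T = 5 IOBES tags.
rng = np.random.default_rng(0)
n, k, C, T = 7, 100, 5, 5

H_fwd = rng.normal(size=(n, k))      # placeholder forward LSTM outputs
H_bwd = rng.normal(size=(n, k))      # placeholder backward LSTM outputs
p = np.concatenate([H_fwd, H_bwd], axis=1).reshape(-1)  # p in R^{2kn}

W_out = rng.normal(size=(2 * k * n, n * C * T)) * 0.01  # linear layer weights
logits = (p @ W_out).reshape(n, C, T)

# softmax along the tag dimension
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

print(probs.shape)  # (7, 5, 5): one tag distribution per token per class
```

Each token thus receives an independent IOBES distribution for every annotation class, which is what allows a token to belong to several semantic groups.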
As such, the positions of the words in the vector space do not provide any information regarding patterns of relationships between words.

[Figure 2: An illustration of the BiLSTM architecture for joint medical entity recognition and negation detection]

BiLSTM Embeddings

These embeddings were obtained after adapting the BiLSTM for a language modelling task. Following a previously described strategy (Collobert and Weston, 2008), the input words were randomly replaced, with probability 0.2, with a word extracted from V. We then created a corresponding vector of binary labels to be used as prediction targets: each element of the vector is either 0 or 1, where 0 indicates a word that has been replaced and 1 indicates an unchanged word. The model outputs the probability of the labels for each word in the given sentence. After training this language model on the unlabelled part of our corpus, we extracted the word embeddings from W.

GloVe Embeddings

Word embeddings were also obtained using GloVe, an unsupervised method (Pennington et al., 2014). On word similarity and analogy tasks, it has the potential to outperform competing models such as Skip-gram and CBOW. The GloVe objective function is

  \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X is the word-word co-occurrence matrix, f is a weighting function, w are word embeddings, and \tilde{w} ∈ R^d are context word embeddings, with b and \tilde{b} the respective bias terms. The GloVe embeddings w are trained using the AdaGrad optimisation algorithm (Duchi et al., 2011), stochastically sampling nonzero elements from X.

GloVe-Ontology Embeddings

Furthermore, we introduced a modified version of GloVe, denoted GloVe-Ontology, with the objective of leveraging the RadLex ontology during the word embedding estimation process.
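The GloVe objective above can be evaluated directly for a toy co-occurrence matrix. The weighting function f used here is the one from the original GloVe paper, (x/x_max)^0.75 capped at 1, which is an assumption since the text does not spell it out; the embeddings are random, so this sketch only shows how the loss is computed:

```python
import numpy as np

# Sketch of the GloVe weighted least-squares loss on a tiny vocabulary.
rng = np.random.default_rng(0)
V, d = 4, 3
X = np.array([[0, 2, 1, 0],
              [2, 0, 5, 1],
              [1, 5, 0, 3],
              [0, 1, 3, 0]], dtype=float)   # toy co-occurrence counts

w   = rng.normal(size=(V, d))   # word embeddings
w_c = rng.normal(size=(V, d))   # context word embeddings
b   = np.zeros(V)               # word biases
b_c = np.zeros(V)               # context biases

def f(x, x_max=100.0, alpha=0.75):
    # standard GloVe weighting: down-weights rare pairs, caps frequent ones
    return np.minimum((x / x_max) ** alpha, 1.0)

loss = 0.0
for i in range(V):
    for j in range(V):
        if X[i, j] > 0:  # only nonzero co-occurrences enter the objective
            err = w[i] @ w_c[j] + b[i] + b_c[j] - np.log(X[i, j])
            loss += f(X[i, j]) * err ** 2
print(loss)
```

Training then amounts to minimising this loss with AdaGrad over stochastically sampled nonzero entries of X, as the text describes.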
The rationale is to impose some constraints on the estimated distance between words using semantic relationships extracted from RadLex; this is an idea somewhat inspired by previous work (Yu and Dredze, 2014). The RadLex data was initially represented as a tree, τ, by considering only the relation is parent of between concepts. We then attempted to string match every word v in V to a concept in τ. Every v matched with a RadLex concept was then assigned the vector that enumerates all ancestors of that concept; otherwise it was associated with a zero vector. We denote the resulting vector by φ. We imposed the constraint that words close to each other in τ should also be close in the learned embedding space. Accordingly, GloVe's original objective function was modified to incorporate this additional penalty:

  \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} - \alpha \, sim(\phi_i, \phi_j) \right)^2

In this expression, α is a parameter controlling the influence of the additional constraint, and sim is taken to be the cosine similarity function. No major changes to the training algorithm were required compared to the original GloVe methodology.

4.3 BiLSTM implementation and training

The BiLSTM was implemented using two open-source libraries, Theano (Theano Development Team, 2016) and Lasagne (Dieleman et al., 2015). The number of memory cells in each LSTM layer, k, was set to 100. We limited the maximum length of the input sequence to 40 words; for shorter inputs we used a binary mask at the input and cropped the output predictions accordingly. The loss function was the categorical cross-entropy between the predicted probabilities of the IOBES tags and the true tags. BiLSTM was trained on a GPU for 20 epochs in batches of 10 sentences using Stochastic Gradient Descent (SGD) with Nesterov momentum and with the learning rate set to 0.5. The embedding size d was set to 50.
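The ancestor vectors φ and the cosine-similarity term used in the GloVe-Ontology penalty can be sketched with a toy is-parent-of tree standing in for RadLex (the tree and words here are illustrative assumptions):

```python
import numpy as np

# Toy "is parent of" tree: child -> parent.
tree = {"heart": "organ", "lung": "organ", "organ": "anatomical entity"}
concepts = ["anatomical entity", "organ", "heart", "lung"]
idx = {c: i for i, c in enumerate(concepts)}

def phi(word):
    # Binary vector marking all ancestors of the matched concept;
    # words with no RadLex match get the zero vector.
    vec = np.zeros(len(concepts))
    node = word
    while node in tree:
        node = tree[node]
        vec[idx[node]] = 1.0
    return vec

def cos_sim(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

# heart and lung share every ancestor, so sim = 1 and the penalty term
# alpha * sim(phi_i, phi_j) pulls their embeddings together; an unmatched
# word contributes no penalty.
print(cos_sim(phi("heart"), phi("lung")))        # 1.0
print(cos_sim(phi("heart"), phi("pacemaker")))   # 0.0
```

Subtracting α·sim(φ_i, φ_j) inside the squared error effectively raises the target inner product for ontologically related word pairs, which is how the constraint enters the otherwise unchanged GloVe training loop.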
The GloVe, GloVe-Ontology and BiLSTM word embeddings were trained on the 743,480 unlabelled radiology reports. The α parameter in the GloVe-Ontology objective was set to 0.5.

One aspect of the training was to allow or block the optimisation algorithm from updating the matrix W in the Embedding Layer of the BiLSTM. In Section 6 we refer to this aspect of training as fine-tuning. Previous work (Collobert et al., 2011) has shown that fine-tuning can boost the results of several supervised tasks in NLP.

5 A competing rule-based system

Two clinicians (RB and SW) built a comprehensive dictionary of medical terms. In the dictionary, the key is the name of the term and the corresponding value specifies the semantic group, which was identified using a number of resources. We iterated over all RadLex concepts using the field Preferred Label as the dictionary key for each new entry. To obtain the semantic group we traversed up the ontology tree until an ancestor concept was found that had been manually mapped to a semantic group. For example, one of the ancestor concepts of heart is Anatomical entity, which we had manually mapped to the semantic group Body Location. The same procedure was also performed on the MeSH ontology, using the MeSH Heading field as the dictionary key. Finally, we added 202 more terms that were common in day-to-day reporting but were not present in RadLex or MeSH.

The sentences were tokenized and split using the Stanford CoreNLP suite (Manning et al., 2014), and also converted to lower case and lemmatized using NLTK (Bird et al., 2009). Next, for each sentence, the algorithm attempted to match the longest possible sequence of words, a target phrase, to an entry in the dictionary of medical terms. When the match was successful, the target phrase was annotated with the corresponding semantic group.
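The longest-match dictionary lookup described above can be sketched as follows. The dictionary entries and the 5-token phrase limit are illustrative assumptions, not the clinicians' curated resource:

```python
# Sketch: greedy longest-match annotation against a term dictionary.
dictionary = {
    "pleural effusion": "Clinical Finding",
    "effusion": "Clinical Finding",
    "left": "Descriptor",
    "pacemaker": "Medical Device",
}
MAX_LEN = 5  # assumed upper bound on phrase length

def annotate(tokens):
    out, i = [], 0
    while i < len(tokens):
        # try the longest candidate phrase first, then shrink
        for L in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + L])
            if phrase in dictionary:
                out.append((phrase, dictionary[phrase]))
                i += L
                break
        else:            # no match at any length: advance one token
            i += 1
    return out

print(annotate("small left pleural effusion".split()))
# [('left', 'Descriptor'), ('pleural effusion', 'Clinical Finding')]
```

Trying the longest span first is what lets "pleural effusion" win over the shorter entry "effusion" when both are present in the dictionary.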
When no match was found, the algorithm attempted to look up the target phrase in the English Wikipedia redirects database. In case of a match, the name of the target Wikipedia article was checked against our curated dictionary and the target phrase was annotated with the corresponding semantic group (e.g. oedema redirects to edema, which is how this concept is named in RadLex). For all the string matching operations we used SimString (Okazaki and Tsujii, 2010), a fast and efficient approximate string matching tool. We arbitrarily chose the cosine similarity measure and a similarity threshold value of 0.85. Using SimString allowed the system to match misspelled words (e.g. cardiomegally to the correct concept cardiomegaly).

For negation detection, the system first obtained NegEx predictions for the entities extracted in the NER task. Next, it generated a graph of grammatical relations, as defined by the Universal Dependencies (De Marneffe et al., 2014), from the Stanford Dependency Parser. It then removed all relations in the graph except neg, the negation relation, and conj:or, the or disjunction. Given the NegEx output and the reduced dependency graph, the system finally classified an entity as negated if either of the following two conditions was found to be true: (1) any of the words that are part of the entity were in a neg relation, or in a conj:or relation with another word that was in a neg relation; (2) the entity was classified by NegEx as negated, it was the closest entity to the negation trigger, and there were no neg relations in the sentence. Our hybrid approach is somewhat similar to DEEPEN, with the difference that the latter considers all first-order dependency relations between the negation trigger and the target entity.

6 Experimental Results

We evaluated the BiLSTM model on the medical NER task by measuring the overlap between the predicted semantic groups and the ground truth labels.
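The two negation conditions above can be sketched as a small decision routine. The dependency graph and NegEx outputs are mocked as plain Python inputs; this is an illustration of the rule logic, not the system's actual code:

```python
# Sketch of the hybrid negation rules.
# neg_words: words that participate in a "neg" relation
# or_pairs:  word pairs linked by "conj:or"
def is_negated(entity_words, neg_words, or_pairs,
               negex_negated, closest_to_trigger):
    # Rule 1: an entity word is in a neg relation, or linked by conj:or
    # to a word that is.
    for w in entity_words:
        if w in neg_words:
            return True
        for a, b in or_pairs:
            other = b if w == a else a if w == b else None
            if other in neg_words:
                return True
    # Rule 2: NegEx says negated, the entity is the one closest to the
    # trigger, and the sentence contains no neg relations at all.
    return negex_negated and closest_to_trigger and not neg_words

# "no effusion or consolidation": effusion carries the neg relation;
# consolidation is reached through conj:or.
print(is_negated({"consolidation"}, {"effusion"},
                 [("effusion", "consolidation")], True, False))  # True
```

Restricting the graph to neg and conj:or keeps the rule set small, whereas DEEPEN, as noted above, inspects all first-order relations between trigger and entity.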
The evaluation was performed at the granularity of a single word and using 5-fold cross-validation. The BiLSTM model was always trained on 80% of the annotated corpus and tested on the remaining 20%.

  Embeddings       Fine-tuning   P       R       F1
  Random           TRUE          0.878   0.869   0.873
  GloVe            TRUE          0.869   0.829   0.849
  GloVe-ontology   TRUE          0.875   0.860   0.867
  BiLSTM           TRUE          0.878   0.870   0.874
  Random           FALSE         0.829   0.727   0.775
  GloVe            FALSE         0.866   0.828   0.847
  GloVe-ontology   FALSE         0.850   0.839   0.844
  BiLSTM           FALSE         0.870   0.849   0.859
  Rule-based                     0.706   0.698   0.702

Table 2: Comparison of the BiLSTM model and the rule-based system. BiLSTM is trained using different word embeddings and evaluated using 5-fold cross-validation. The evaluation considers the overlap span of the semantic group predictions against the gold standard annotations.

  Semantic Group     P       R       F1
  Body Location      0.896   0.887   0.891
  Medical Device     0.898   0.923   0.910
  Clinical Finding   0.871   0.895   0.883
  Descriptor         0.824   0.725   0.771
  Total              0.878   0.870   0.874

Table 3: BiLSTM: performance metrics broken down by semantic group for the NER task. All results were obtained using BiLSTM word embeddings.

  Semantic Group     P       R       F1
  Body Location      0.724   0.839   0.778
  Medical Device     0.976   0.538   0.694
  Clinical Finding   0.862   0.551   0.672
  Descriptor         0.467   0.780   0.584
  Total              0.706   0.698   0.702

Table 4: Rule-based system: performance metrics broken down by semantic group for the NER task.

  Model            P       R       F1
  BiLSTM           0.903   0.912   0.908
  NegEx            0.664   0.944   0.780
  NegEx-Stanford   0.944   0.912   0.928

Table 5: Comparison of BiLSTM, NegEx and NegEx-Stanford for negation detection. All algorithms predicted whether a given medical entity was negated or affirmed.
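A minimal sketch of word-level overlap scoring, assuming a token counts as a hit when both the prediction and the gold standard assign it a non-O tag (one plausible reading of the word-granularity evaluation, not the authors' exact scoring script):

```python
# Sketch: word-level precision, recall and F1 for one semantic group.
def prf1(pred_tags, gold_tags):
    tp = sum(p != "O" and g != "O" for p, g in zip(pred_tags, gold_tags))
    fp = sum(p != "O" and g == "O" for p, g in zip(pred_tags, gold_tags))
    fn = sum(p == "O" and g != "O" for p, g in zip(pred_tags, gold_tags))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One spurious non-O prediction out of three: P = 2/3, R = 1.0, F1 = 0.8.
print(prf1(["S", "O", "B", "E"], ["S", "O", "O", "E"]))
```

Averaging such scores over the 5 cross-validation folds yields figures of the kind reported in Tables 2-4.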
Table 2 compares the performance of the various BiLSTM variants, obtained with and without fine-tuning of the word embeddings, to the performance of our baseline rule-based system. Without fine-tuning, the BiLSTM NER model that was initialised with the embeddings trained in an unsupervised manner using the BiLSTM language model achieves the best F1-score (0.859), and outperforms the next best variant by 0.012. With fine-tuning, the same BiLSTM variant improves the F1-score by a further 0.015 and outperforms the baseline rule-based system by an F1-score of 0.172. Table 3 shows its performance measures for each of the semantic groups.

  nodule          pacemaker   small       remains   fracture
  bulla           ppm         tiny        remain    fractures
  nodules         icd         minor       appears   deformity
  opacity         wires       mild        is        body
  opacities       drains      dense       are       scoliosis
  opacification   leads       extensive   were      abnormality

Table 6: For each of the five words in the header row, the five nearest neighbours found in the embedding space learnt by BiLSTM.

The evaluation of negation detection was measured on complete entities. If any of the words within an entity were tagged with an I, B, E or S, that entity was considered to be negated. As shown in Table 5, the BiLSTM (BiLSTM language model embeddings, fine-tuning allowed) achieved an F1-score of 0.908, which outperformed NegEx by 0.128. However, the best F1-score of 0.928 was achieved by the NegEx-Stanford system.

7 Discussion

In Table 3 we show the predictive performance of the best BiLSTM NER model for each of the semantic groups. Body Location, Medical Device and Clinical Finding show balanced precision and recall, and similar F1-scores. Descriptor has a lower F1-score, caused by a low recall that may be the result of the larger variability in the words used for this semantic group. Table 4 shows the corresponding results for the rule-based NER system.
Medical Device and Clinical Finding show a typical performance for a dictionary-based NER system, with high precision and low recall. Body Location has relatively high precision and recall values, which suggests that this semantic group is well covered by our dictionary of medical terms. In contrast, Descriptor shows a very low precision, which is the result of a high number of false positives. The false positives are caused by many Descriptor entries in our dictionary of medical terms that had been automatically extracted from RadLex and MeSH but which do not correspond to the definition of a Descriptor used by the clinicians who produced the labelled data.

As a qualitative assessment, Table 6 shows the 5 nearest neighbours obtained from the BiLSTM language model embeddings of some frequent words used by Radiologists. We note that there is a clear semantic similarity between the nearest neighbour words. Additionally, the embeddings encode syntactic information, as the nearest neighbour words are parts of speech of the same type as the target word. We also summed the vectors for heart and enlarged, which yielded vec(cardiomegaly) as the nearest vector. Similarly, the closest vector to vec(heart) + vec(not) + vec(enlarged) is vec(normal). These examples suggest that word embeddings may encode information about the compositionality of words, as discussed by Mikolov et al. (2013).

Table 2 shows that, without fine-tuning, the Embedding Layer weights can affect the performance of the NER task. When fine-tuning is allowed, there is only a marginal advantage in using pre-trained embeddings, as the BiLSTM performs equally well when initialised with random embeddings. Therefore, despite a positive qualitative assessment, the pre-trained word embeddings seem to offer only a small advantage when used for the proposed NER task, as BiLSTM is able to learn well using the annotated data during the supervised learning phase.
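The nearest-neighbour and vector-arithmetic probes above are standard cosine-similarity queries over the embedding matrix. A minimal sketch follows; the toy 2-dimensional vectors in the test are ours and stand in for the real learned embeddings, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def vec_sum(*vectors):
    """Component-wise sum, e.g. vec(heart) + vec(enlarged)."""
    return [sum(components) for components in zip(*vectors)]


def nearest(query, embeddings, exclude=(), k=5):
    """The k vocabulary words whose embeddings lie closest to `query`.

    embeddings: dict mapping word -> vector; `exclude` removes the
    query's own constituents, as is usual for analogy probes.
    """
    scored = sorted(
        ((cosine(query, emb), word)
         for word, emb in embeddings.items() if word not in exclude),
        reverse=True,
    )
    return [word for _, word in scored[:k]]
```

With toy vectors chosen so that cardiomegaly lies in the direction of heart + enlarged, the probe behaves as described in the text:

```python
toy = {"heart": [1.0, 0.0], "enlarged": [0.0, 1.0],
       "cardiomegaly": [0.9, 0.9], "normal": [-1.0, -1.0]}
nearest(vec_sum(toy["heart"], toy["enlarged"]), toy,
        exclude={"heart", "enlarged"}, k=1)  # → ["cardiomegaly"]
```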
8 Conclusions

In this paper we have shown that a recurrent neural network architecture, BiLSTM, can learn to detect clinical findings and negations using only a relatively small amount of manually labelled radiological reports. Using a manually curated medical corpus, we have provided initial evidence that BiLSTM outperforms a dictionary-based system on the NER task. For the detection of negations, on our dataset BiLSTM approaches the performance of a negation detection system that was built using the popular NegEx algorithm and uses grammatical relations obtained from the Stanford Dependency Parser and hand-crafted rules. We believe that increasing the size of the annotated training dataset can result in much improved performance on this task, and plan to pursue this investigation in future work.

We have also investigated potential performance gains that can be achieved by using pre-trained word embeddings, i.e. BiLSTM, GloVe and GloVe-Ontology embeddings, in the context of BiLSTM-based modelling for the NER task. Our initial experimental results suggest that there is marginal benefit in using BiLSTM-learned embeddings, while pre-training using GloVe and GloVe-Ontology embeddings did not offer any significant improvements over a random initialisation.

References

[Aronson2001] Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.
[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[Bird et al.2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc.
[Chapman et al.2013] Wendy W Chapman, Dieter Hilert, Sumithra Velupillai, Maria Kvist, Maria Skeppstedt, Brian E Chapman, Michael Conway, Melissa Tharp, Danielle L Mowery, and Louise Deleger. 2013. Extending the NegEx lexicon for multiple languages. Studies in Health Technology and Informatics, 192:677.
[Chen and Manning2014] Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.
[Collobert and Weston2008] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
[De Marneffe et al.2014] Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In LREC, volume 14, pages 4585–4592.
[Dieleman et al.2015] Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, diogo149, Brian McFee, Hendrik Weideman, takacsg84, peterderivaz, Jon, instagibbs, Dr. Kashif Rasul, CongLiu, Britefury, and Jonas Degrave. 2015. Lasagne: First release, August. Available at http://dx.doi.org/10.5281/zenodo.27878.
[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
[Friedman et al.1995] Carol Friedman, George Hripcsak, William DuMouchel, Stephen B Johnson, and Paul D Clayton. 1995. Natural language processing in an operational clinical information system. Natural Language Engineering, 1(01):83–108.
[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.
[Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Hassanpour and Langlotz2015] Saeed Hassanpour and Curtis P Langlotz. 2015. Information extraction from multi-institutional radiology reports. Artificial Intelligence in Medicine.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[Hripcsak et al.2002] George Hripcsak, John HM Austin, Philip O Alderson, and Carol Friedman. 2002. Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. Radiology, 224(1):157–163.
[Johnson et al.1997] David B. Johnson, Ricky K. Taira, Alfonso F. Cardenas, and Denise R. Aberle. 1997. Extracting information from free text radiology reports. Int. J. Digit. Libr., 1(3):297–308, December.
[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, volume 1, pages 282–289.
[Lample et al.2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
[Langlotz2006] Curtis P Langlotz. 2006. RadLex: a new method for indexing online educational materials. Radiographics, 26(6):1595–1597.
[Manning et al.2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
[McGurk et al.2014] S McGurk, K Brauer, TV Macfarlane, and KA Duncan. 2014. The effect of voice recognition software on comparative error rates in radiology reports. The British Journal of Radiology.
[Mehrabi et al.2015] Saeed Mehrabi, Anand Krishnan, Sunghwan Sohn, Alexandra M Roch, Heidi Schmidt, Joe Kesterson, Chris Beesley, Paul Dexter, C Max Schmidt, Hongfang Liu, et al. 2015. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. Journal of Biomedical Informatics, 54:213–219.
[Mesnil et al.2013] Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. 2013. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In INTERSPEECH, pages 3771–3775.
[Mikolov and Zweig2012] Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In SLT, pages 234–239.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
[NHS2016] England NHS. 2016. Diagnostic Imaging Dataset Statistical Release. Available at https://www.england.nhs.uk/statistics/statistical-work-areas/diagnostic-imaging-dataset/diagnostic-imaging-dataset-2015-16-data/.
[NLM2016a] United States National Library of Medicine NLM. 2016a. Medical Subject Headings. Available at https://www.nlm.nih.gov/mesh/.
[NLM2016b] United States National Library of Medicine NLM. 2016b. Unified Medical Language System. Available at https://uts.nlm.nih.gov/home.html.
[Ogren et al.2007] Philip V Ogren, Guergana K Savova, Christopher G Chute, et al. 2007. Constructing evaluation corpora for automated clinical named entity recognition. In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, page 2325. IOS Press.
[Okazaki and Tsujii2010] Naoaki Okazaki and Jun'ichi Tsujii. 2010. Simple and efficient algorithm for approximate dictionary matching. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 851–859. Association for Computational Linguistics.
[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.
[Reiner and Siegel2006] Bruce Reiner and Eliot Siegel. 2006. Radiology reporting: returning to our image-centric roots. American Journal of Roentgenology, 187(5):1151–1155.
[Rumelhart et al.1988] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.
[Savova et al.2010] Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.
[Shin et al.2015] Hoo-Chang Shin, Le Lu, Lauren Kim, Ari Seff, Jianhua Yao, and Ronald M Summers. 2015. Interleaved text/image deep mining on a very large-scale radiology database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1099.
[Shivade et al.2015] Chaitanya Shivade, Marie-Catherine de Marneffe, Eric Fosler-Lussier, and Albert M Lai. 2015. Extending NegEx with kernel methods for negation detection in clinical text. ExProM 2015, page 41.
[Simpao et al.2014] Allan F Simpao, Luis M Ahumada, Jorge A Gálvez, and Mohamed A Rehman. 2014. A review of analytics and clinical informatics in health care. Journal of Medical Systems, 38(4):1–7.
[Sobel et al.1996] Jeffrey L Sobel, Marjorie L Pearson, Keith Gross, Katherine A Desmond, Ellen R Harrison, Lisa V Rubenstein, William H Rogers, and Katherine L Kahn. 1996. Information content and clarity of radiologists' reports for chest radiography. Academic Radiology, 3(9):709–717.
[Stenetorp et al.2012] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107. Association for Computational Linguistics.
[Theano Development Team2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.
[Werbos1990] Paul J Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.
[Wu et al.2014] Stephen Wu, Timothy Miller, James Masanz, Matt Coarr, Scott Halgrim, David Carrell, and Cheryl Clark. 2014. Negations not solved: generalizability versus optimizability in clinical natural language processing. PloS One, 9(11):e112774.
[Yu and Dredze2014] Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In ACL (2), pages 545–550.