Transfer Learning for Named-Entity Recognition with Neural Networks

Ji Young Lee* (MIT, jjylee@mit.edu), Franck Dernoncourt* (MIT, francky@mit.edu), Peter Szolovits (MIT, psz@mit.edu)

* These authors contributed equally to this work.

Abstract

Recent approaches based on artificial neural networks (ANNs) have shown promising results for named-entity recognition (NER). To achieve high performance, ANNs need to be trained on a large labeled dataset. However, labels might be difficult to obtain for the dataset on which the user wants to perform NER: label scarcity is particularly pronounced for patient note de-identification, which is an instance of NER. In this work, we analyze to what extent transfer learning may address this issue. In particular, we demonstrate that transferring an ANN model trained on a large labeled dataset to another dataset with a limited number of labels improves upon the state-of-the-art results on two different datasets for patient note de-identification.

1 Introduction

Electronic health records (EHRs) have been widely adopted in some countries such as the United States and represent gold mines of information for medical research. The majority of EHR data exist in unstructured form such as patient notes (Murdoch and Detsky, 2013). Applying natural language processing to patient notes can improve the phenotyping of patients (Ananthakrishnan et al., 2013; Pivovarov and Elhadad, 2015; Halpern et al., 2016), which has many downstream applications such as the understanding of diseases (Liao et al., 2015).

However, before patient notes can be shared with medical investigators, some types of information, referred to as protected health information (PHI), must be removed in order to preserve patient confidentiality. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) (Office for Civil Rights, 2002) defines 18 different types of PHI, ranging from patient names and ID numbers to addresses and phone numbers. The task of removing PHI from a patient note is referred to as de-identification. The essence of de-identification is recognizing PHI in patient notes, which is a form of named-entity recognition (NER).

Existing de-identification systems are often rule-based or feature-based machine learning approaches. However, these techniques require additional lead time for developing and fine-tuning the rules or features specific to each new dataset. Meanwhile, recent work using ANNs has yielded state-of-the-art performance without using any manual features (Dernoncourt et al., 2016). Compared to previous systems, ANNs have the competitive advantage that the model can be fine-tuned on a new dataset without the overhead of manual feature development, as long as some labels for the dataset are available.

However, it may still be inefficient to mass-deploy ANN-based de-identification systems in practical settings, since creating annotations for patient notes is especially difficult: only a restricted set of individuals is authorized to access original patient notes, so the annotation task cannot be crowdsourced, making it slow and expensive to obtain a large annotated corpus. Medical professionals are therefore wary of exploring patient notes because of this de-identification barrier, which considerably hampers medical research.
In this paper, we analyze to what extent transfer learning may improve de-identification performance on datasets with a limited number of labels. By training an ANN model on a large dataset (MIMIC) and transferring it to smaller datasets (i2b2 2014 and i2b2 2016), we demonstrate that transfer learning allows us to outperform the state-of-the-art results.

2 Related Work

Transfer learning has been studied for a long time, yet there is no standard definition of it in the literature (Li, 2012). We follow the definition from (Pan and Yang, 2010): transfer learning aims at performing a task on a target dataset using some knowledge learned from a source dataset. The idea has been applied to many fields, such as speech recognition (Wang and Zheng, 2015) and finance (Stamate et al., 2015).

The successes of ANNs across many applications over the last few years have escalated the interest in studying transfer learning for ANNs. In particular, much work has been done in computer vision (Yosinski et al., 2014; Oquab et al., 2014; Zeiler and Fergus, 2014). In these studies, some of the parameters learned on the source dataset are used to initialize the corresponding parameters of the ANN for the target dataset.

Fewer studies have examined transfer learning for ANN-based models in natural language processing. For example, Mou et al. (2016) focused on transfer learning with convolutional neural networks for sentence classification. To the best of our knowledge, no study has analyzed transfer learning for ANN-based models in the context of NER.

3 Model

The model we use for the transfer learning experiments is based on a type of recurrent neural network called long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), and utilizes both token embeddings and character embeddings. It comprises six major components:

1. Token embedding layer: maps each token to a token embedding.

2. Character embedding layer: maps each character to a character embedding.

3. Character LSTM layer: takes as input character embeddings and outputs a single vector that summarizes the information from the sequence of characters in the corresponding token.

4. Token LSTM layer: takes as input a sequence of token vectors, which are formed by concatenating the outputs of the token embedding layer and the character LSTM layer, and outputs a sequence of vectors.

5. Fully connected layer: takes the output of the token LSTM layer as input, and outputs vectors containing the scores of each label for the corresponding tokens.

6. Sequence optimization layer: takes the sequence of vectors from the output of the fully connected layer and outputs the most likely sequence of predicted labels, by optimizing the sum of unigram label scores as well as bigram label transition scores.

Figure 1 shows how these six components are interconnected to form the model; a code sketch of the architecture is given after the figure caption below. All layers are learned jointly using stochastic gradient descent. For regularization, dropout is applied before the token LSTM layer, and early stopping is used on the development set with a patience of 10 epochs.

[Figure 1: ANN model for NER. The character embeddings of the characters in the j-th token feed the character LSTM, whose output is concatenated with the j-th token embedding and passed to the token LSTM, the fully connected layer, and the sequence optimization layer, which outputs the label of each token in the sentence.]
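To make the wiring of these six components concrete, the following is a minimal PyTorch sketch of the architecture. It is an illustration under assumptions, not the authors' implementation (their code extends the TensorFlow-based NeuroNER library): the class name CharTokenLSTMTagger, the layer sizes, the dropout rate, and the use of bidirectional LSTMs with Viterbi decoding for the sequence optimization layer are our illustrative choices.

```python
# A minimal sketch of the six-component model described above, assuming
# bidirectional LSTMs and a CRF-style transition matrix; all hyperparameters
# are illustrative, not the paper's.
import torch
import torch.nn as nn

class CharTokenLSTMTagger(nn.Module):
    def __init__(self, n_tokens, n_chars, n_labels,
                 tok_dim=100, char_dim=25, char_hid=25, tok_hid=100):
        super().__init__()
        self.tok_emb = nn.Embedding(n_tokens, tok_dim)    # 1. token embedding layer
        self.char_emb = nn.Embedding(n_chars, char_dim)   # 2. character embedding layer
        self.char_lstm = nn.LSTM(char_dim, char_hid, bidirectional=True,
                                 batch_first=True)        # 3. character LSTM layer
        self.dropout = nn.Dropout(0.5)                    # dropout before the token LSTM
        self.tok_lstm = nn.LSTM(tok_dim + 2 * char_hid, tok_hid, bidirectional=True,
                                batch_first=True)         # 4. token LSTM layer
        self.fc = nn.Linear(2 * tok_hid, n_labels)        # 5. fully connected layer
        # 6. sequence optimization layer: bigram label-transition scores,
        # combined with the unigram scores from self.fc at decoding time.
        self.transitions = nn.Parameter(torch.zeros(n_labels, n_labels))

    def emissions(self, tokens, chars):
        # tokens: (seq_len,) token ids; chars: (seq_len, max_word_len) char ids.
        _, (h, _) = self.char_lstm(self.char_emb(chars))
        char_vec = torch.cat([h[0], h[1]], dim=-1)        # one summary vector per token
        x = torch.cat([self.tok_emb(tokens), char_vec], dim=-1)
        out, _ = self.tok_lstm(self.dropout(x).unsqueeze(0))
        return self.fc(out).squeeze(0)                    # unigram label scores per token

    def decode(self, tokens, chars):
        # Viterbi decoding: the most likely label sequence under the sum of
        # unigram label scores and bigram label transition scores.
        scores = self.emissions(tokens, chars)
        best, backptrs = scores[0], []
        for t in range(1, scores.size(0)):
            total = best.unsqueeze(1) + self.transitions + scores[t].unsqueeze(0)
            best, idx = total.max(dim=0)                  # best incoming label per label
            backptrs.append(idx)
        path = [best.argmax().item()]
        for idx in reversed(backptrs):
            path.append(idx[path[-1]].item())
        return path[::-1]                                 # predicted labels, left to right
```

At the layer granularity used in the experiments below, the transferable components are exactly the six attributes defined in __init__, from the token embeddings at the bottom to the transition scores at the top.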
For the transfer learning experiments, we train the parameters of the model on a source dataset, and transfer all or some of the parameters to initialize the model for training on a target dataset.

4 Experiments

4.1 Datasets

We use three de-identification datasets for the transfer learning experiments: MIMIC, i2b2 2014, and i2b2 2016. The MIMIC de-identification dataset was introduced in (Dernoncourt et al., 2016), and is a subset of the MIMIC-III dataset (Johnson et al., 2016; Goldberger et al., 2000; Saeed et al., 2011). The i2b2 2014 and 2016 datasets were released as part of the 2014 i2b2/UTHealth shared task Track 1 (Stubbs et al., 2015) and the 2016 i2b2 CEGS N-GRID shared task, respectively. Table 1 presents the datasets' sizes.

                          MIMIC      i2b2 2014  i2b2 2016
Vocabulary size           69,525     46,803     61,503
Number of notes           1,635      1,304      1,000
Number of tokens          2,945,228  984,723    2,689,196
Number of PHI instances   60,725     28,867     41,142
Number of PHI tokens      78,633     41,355     54,420

Table 1: Overview of the MIMIC and i2b2 datasets. PHI stands for protected health information.

4.2 Transfer learning

The goal of transfer learning is to leverage the information present in a source dataset to improve the performance of an algorithm on a target dataset. In our setting, we apply transfer learning by training the parameters of the ANN model on the source dataset (MIMIC), and using the same ANN to retrain on the target dataset (i2b2 2014 or 2016) for fine-tuning. We use MIMIC as the source dataset since it is the dataset with the most labels.

We perform two sets of experiments to gain insight into how effective transfer learning is and which parameters of the ANN are the most important to transfer. [1] A sketch of the transfer procedure is given after the two experiment descriptions below.

Experiment 1: quantifying the impact of transfer learning for various train set sizes of the target dataset. The primary purpose of this experiment is to assess to what extent transfer learning improves performance on the target dataset. We experiment with different train set sizes to understand how many labels are needed for the target dataset to achieve reasonable performance with and without transfer learning.

Experiment 2: analyzing the importance of each parameter of the ANN in the transfer learning. Instead of transferring all the parameters, we experiment with transferring different combinations of parameters. The goal is to understand which components of the ANN are the most important to transfer. The lowest layers of an ANN tend to represent task-independent features, whereas the topmost layers are more task-specific. We therefore try transferring the parameters starting from the bottommost layer up to the topmost layer, adding one layer at a time.

[1] Our code is an extension of the NER library NeuroNER (Dernoncourt et al., 2017), which we committed to NeuroNER's repository: https://github.com/Franck-Dernoncourt/NeuroNER
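As a companion to the model sketch above, here is a hedged sketch of the transfer step covering both experiments. It reuses the hypothetical CharTokenLSTMTagger and is not the authors' code (their experiments extend the TensorFlow-based NeuroNER); for simplicity it assumes that the source and target models share vocabularies and label sets, whereas in practice the embedding matrices must be remapped across vocabularies.

```python
# Sketch of parameter transfer: copy source-trained weights into a fresh
# target model, either all layers (Experiment 1) or cumulatively from the
# bottommost layer up (Experiment 2). Reuses the hypothetical
# CharTokenLSTMTagger defined in the model sketch above.

# Bottom-to-top layer order for cumulative transfer.
LAYER_ORDER = ["tok_emb", "char_emb", "char_lstm", "tok_lstm", "fc", "transitions"]

def transfer_parameters(source, target, up_to="transitions"):
    """Copy the parameters of `source` into `target` for every layer from the
    bottom of LAYER_ORDER up to and including `up_to`; layers above `up_to`
    keep their fresh random initialization."""
    kept = LAYER_ORDER[:LAYER_ORDER.index(up_to) + 1]
    state = target.state_dict()
    for name, tensor in source.state_dict().items():
        if any(name == layer or name.startswith(layer + ".") for layer in kept):
            state[name] = tensor.clone()
    target.load_state_dict(state)

# Usage sketch (toy sizes; assumes shared vocabularies and label sets):
source_model = CharTokenLSTMTagger(n_tokens=1000, n_chars=80, n_labels=9)
# ... train source_model on the source dataset (MIMIC) ...
target_model = CharTokenLSTMTagger(n_tokens=1000, n_chars=80, n_labels=9)
transfer_parameters(source_model, target_model)               # Experiment 1: all layers
transfer_parameters(source_model, target_model, "char_lstm")  # Experiment 2: lower layers only
# ... then fine-tune target_model on the target dataset (i2b2 2014 or 2016) ...
```

The successive values of up_to in LAYER_ORDER correspond to the layer-by-layer transfer of Experiment 2; passing the topmost entry transfers all parameters, as in Experiment 1.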
5 Results

Experiment 1: Figure 2 compares the F1-scores of the ANN trained only on the target dataset against the ANN trained on the source dataset followed by the target dataset. Transfer learning improves the F1-scores over training only with the target dataset, though the improvement diminishes as the number of training samples used for the target dataset increases. This implies that the representations learned from the source dataset are efficiently transferred to and exploited for the target dataset.

Therefore, when transfer learning is adopted, fewer annotations are needed to achieve the same level of performance as when the source dataset is unused. For example, on the i2b2 2014 dataset, performing transfer learning and using 16% of the i2b2 train set leads to similar performance as not using transfer learning and using 34% of the i2b2 train set. In this case, transfer learning thus roughly halves the number of labels needed on the target dataset.

For both the i2b2 2014 and 2016 datasets, the performance gains from transfer learning are greater when the train set size of the target dataset is small. The largest improvement can be observed for i2b2 2014 when using 5% of the dataset as the train set (consisting of around 2k PHI tokens out of 50k tokens), where transfer learning increases the F1-score by around 3.1 percentage points, from 90.12 to 93.21. Even when all of the train set is used, the F1-score still improves with transfer learning, albeit by just 0.17 percentage points, from 97.80 to 97.97.

[Figure 2: Impact of transfer learning on the F1-scores for (a) i2b2 2014 and (b) i2b2 2016. Baseline corresponds to training the ANN model only with the target dataset; transfer learning corresponds to training on the source dataset followed by training on the target dataset. The target train set size is the percentage of the train set in the whole dataset, and 60% corresponds to the full official train set.]

[Figure 3: Impact of transferring the parameters up to each layer of the ANN model (no transfer, token embeddings, character embeddings, character LSTM, token LSTM, fully connected layer, CRF) using various train set sizes on the target dataset: 5%, 10%, 20%, 40%, and 60% (official train set), for (a) i2b2 2014 and (b) i2b2 2016.]

Experiment 2: Figure 3 shows the importance of each layer of the ANN in transfer learning. We observe that transferring a few lower layers is almost as effective as transferring all layers. For i2b2 2014, transferring up to the token LSTM yields substantial improvements with each added layer, but there is little further improvement beyond that. For i2b2 2016, larger improvements can be observed up to the character LSTM and less so beyond that layer.

The parameters in the lower layers therefore seem to contain most of the information that is relevant to the de-identification task in general, which supports the common hypothesis that the higher layers of ANN architectures contain the parameters that are more specific to the task as well as to the dataset used for training.

Despite the observation that transferring a few lower layers may be sufficient for effective transfer learning, it is interesting to see that additionally transferring the topmost layers does not hurt the performance. When retraining the model on the target dataset, the ANN is able to adapt to the target dataset quite well, despite some of the higher layers being initialized to parameters that are likely more specific to the source dataset.

6 Conclusion

In this work, we have studied transfer learning with ANNs for NER, specifically patient note de-identification, by transferring ANN parameters trained on a large labeled dataset to another dataset with limited human annotations.
We demonstrated that transfer learning improves the performance over the state-of-the-art results on two datasets. Transfer learning may be especially beneficial for a target dataset with a small number of labels.

References

Ashwin N Ananthakrishnan, Tianxi Cai, Guergana Savova, Su-Chun Cheng, Pei Chen, Raul Guzman Perez, Vivian S Gainer, Shawn N Murphy, Peter Szolovits, Zongqi Xia, et al. 2013. Improving case definition of Crohn's disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflammatory Bowel Diseases 19(7):1411.

Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv:1705.05487.

Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2016. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association, page ocw156.

Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220.

Yoni Halpern, Steven Horng, Youngduck Choi, and David Sontag. 2016. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, page ocw011.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Liwei Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.

Qi Li. 2012. Literature survey: domain adaptation algorithms for natural language processing. Department of Computer Science, The Graduate Center, The City University of New York, pages 8–10.

Katherine P Liao, Tianxi Cai, Guergana K Savova, Shawn N Murphy, Elizabeth W Karlson, Ashwin N Ananthakrishnan, Vivian S Gainer, Stanley Y Shaw, Zongqi Xia, Peter Szolovits, et al. 2015. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 350:h1885.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? arXiv preprint arXiv:1603.06111.

Travis B Murdoch and Allan S Detsky. 2013. The inevitable application of big data to health care. JAMA 309(13):1351–1352.

HHS Office for Civil Rights. 2002. Standards for privacy of individually identifiable health information. Final rule. Federal Register 67(157):53181.

Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Rimma Pivovarov and Noémie Elhadad. 2015. Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association 22(5):938–947.
Mohammed Saeed, Mauricio Villarroel, Andrew T Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H Kyaw, Benjamin Moody, and Roger G Mark. 2011. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine 39(5):952.

Cosmin Stamate, George D Magoulas, and Michael SC Thomas. 2015. Transfer learning approach for financial applications. arXiv preprint arXiv:1509.02807.

Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics 58:S11–S19.

Dong Wang and Thomas Fang Zheng. 2015. Transfer learning for speech and language processing. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2015 Asia-Pacific. IEEE, pages 1225–1237.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, pages 818–833.
