Comparing Rule-Based and Deep Learning Models for Patient Phenotyping

Objective: We investigate whether deep learning techniques for natural language processing (NLP) can be used efficiently for patient phenotyping. Patient phenotyping is a classification task for determining whether a patient has a medical condition, …

Authors: Sebastian Gehrmann, Franck Dernoncourt, Yeran Li

A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping

Sebastian Gehrmann (0, 1, *), Franck Dernoncourt (0, 2), Yeran Li (0, 3), Eric T Carlson (0, 4), Joy T Wu (0, 5), Jonathan Welt (0, 6), John Foote Jr. (0, 7), Edward Moseley (0, 8), David W Grant (0, 9), Patrick D Tyler (0, 5), Leo Anthony Celi (0, 2)

0 MIT Critical Data, Laboratory for Computational Physiology
1 Harvard John A. Paulson School of Engineering and Applied Sciences
2 Massachusetts Institute of Technology
3 Harvard T.H. Chan School of Public Health
4 Philips Research North America
5 Beth Israel Deaconess Medical Center
6 Massachusetts General Hospital
7 Tufts University School of Medicine
8 University of Massachusetts
9 Washington University School of Medicine

Abstract

Objective: We investigate whether deep learning techniques for natural language processing (NLP) can be used efficiently for patient phenotyping. Patient phenotyping is a classification task for determining whether a patient has a medical condition, and is a crucial part of secondary analysis of healthcare data. We assess the performance of deep learning algorithms and compare them with classical NLP approaches.

Materials and Methods: We compare convolutional neural networks (CNNs), n-gram models, and approaches based on cTAKES that extract pre-defined medical concepts from clinical notes and use them to predict patient phenotypes. The performance is tested on 10 different phenotyping tasks using 1,610 discharge summaries extracted from the MIMIC-III database.

Results: CNNs outperform other phenotyping algorithms in all 10 tasks. The average F1-score of our model is 76 (PPV of 83, and sensitivity of 71) with our model having an F1-score up to 37 points higher than alternative approaches. We additionally assess the interpretability of our model by presenting a method that extracts the most salient phrases for a particular prediction.
Conclusion: We show that NLP methods based on deep learning improve the performance of patient phenotyping. Our CNN-based algorithm automatically learns the phrases associated with each patient phenotype. As such, it reduces the annotation complexity for clinical domain experts, who are normally required to develop task-specific annotation rules and identify relevant phrases. Our method performs well in terms of both predictive performance and interpretability, which indicates that deep learning is an effective approach to patient phenotyping based on clinicians' notes.

* gehrmann@seas.harvard.edu

INTRODUCTION

The secondary analysis of data from electronic health records (EHRs) is crucial to better understand the heterogeneity of treatment effects and to individualize patient care. With the growing adoption rate of EHRs [1], researchers gain access to rich data sets, such as the Medical Information Mart for Intensive Care (MIMIC) database [2,3] and the Informatics for Integrating Biology and the Bedside (i2b2) datamarts [4–9], which can be explored in numerous ways [10]. EHR data comprise both structured data, such as International Classification of Diseases (ICD) codes, laboratory results and medications, and unstructured data, such as clinician progress notes. While structured data do not require complex processing prior to performing statistical tests and conducting machine learning tasks, the majority of recorded data exist in unstructured form [11]. Applying natural language processing (NLP) to the unstructured data in conjunction with analyzing the structured data can lead to a better understanding of health and diseases [12], and a more accurate phenotyping of patients to compare tests and treatments [13–15]. Patient phenotyping is a classification task to determine whether a patient has a medical condition, or to pinpoint patients who are at risk of developing one.
Further, intelligent applications for patient phenotyping can support clinicians by reducing the time they spend on chart reviews, which take up a significant fraction of their daily workflow [16,17].

A popular approach to patient phenotyping using NLP is based on extracting relevant medical phrases from texts and using them as input to build a predictive model [18]. The dictionary of relevant phrases is task-specific, and its development requires significant effort and a deep understanding of the task from domain experts [19]. A different and more involved approach is to develop a rule-based algorithm for each condition [20]. Because building a generalizable model for patient phenotyping is tedious and laborious for clinicians, models for automated classification using NLP are rarely developed outside of the research area. However, recent developments in deep learning may provide an opportunity to build a generalizable phenotyping model with less intensive domain expert involvement. Applications of deep learning in healthcare have shown promising results; examples include mortality prediction [21], patient note de-identification [22], skin cancer detection [23], and diabetic retinopathy detection [24].

A drawback to deep learning models is their lack of interpretability. Interpretability means that one can understand how the features of the model arrive at the predictions. Since results directly impact health, clinicians have come to expect healthcare applications to use interpretable models [25]. Moreover, the European Union is considering regulations that require algorithms to be interpretable [26]. While much work has been done to understand deep learning NLP models and to make a trained deep learning NLP model interpretable [27–29], such models rely on complex interactions between all inputs and are thus inherently less interpretable than an NLP model that uses predefined phrase dictionaries.
In this work, we investigate the application of convolutional neural networks (CNNs) [30] to text-based patient phenotyping. CNNs learn to identify phrases in text that lead to a positive or negative classification, similar to the phrase dictionary approach, and they outperform traditional approaches to classification problems in other domains [31–33]. We compare CNNs to the traditional rule-based entity extraction systems using the Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES) [34], and to other NLP methods such as logistic regression models using n-gram features. We compare the performance for a total of 10 different phenotypes and show that CNNs outperform both extraction-based and n-gram-based methods. Finally, we evaluate the interpretability of the model by assessing the learned phrases that are associated with each phenotype and compare them to the phrase dictionaries developed by clinicians.

BACKGROUND

Accurate patient phenotyping is required for secondary analysis of EHRs to correctly identify the patient cohort under investigation and to better identify the clinical context [35]. Studies employing a manual chart review process for patient phenotyping are naturally limited to a small number of preselected patients. Therefore, NLP is necessary to identify information that is contained in text but may not be captured consistently or accurately in the structured data, such as recurrence in cancer [36,37], whether a patient smokes [4], classification within the autism spectrum [38], or drug treatment patterns [39]. However, unstructured data in EHRs, such as progress notes or discharge summaries, are typically not amenable to simple text searches because of spelling mistakes and the use of ambiguous terms [40]. To help address these issues, researchers utilize dictionaries and ontologies for medical terminologies such as UMLS [41] and SNOMED [42].

Examples of systems that employ such databases are the KnowledgeMap Concept Identifier (KMCI) [44], MetaMap [45], and cTAKES. These three identify words or phrases within a text and provide the medical concepts they are linked to [34,46]. They significantly reduce the work required from data scientists, who previously had to develop task-specific extractors [47]. Extracted entities are filtered to only include concepts related to the patient phenotype under investigation and are either used as features for a model that predicts whether the patient fits the phenotype, or as input for rule-based algorithms [18,38,48]. Liao et al. [12] describe the process of extraction, rule-generation and prediction as the general approach to patient phenotyping using cTAKES [13,49–51], and test this approach on various data sets [52]. The role of clinicians in this task is to develop a task-specific dictionary of phrases that are relevant to a patient phenotype.

Carrell et al. [36] developed two separate rule-based phenotyping algorithms, one for pathology documents and one for clinical documents, which they combined in order to identify recurrence of breast cancer. While they find that this modular approach identifies over 90% of recurrence, they note that the cost and time required to develop an NLP algorithm limit its applicability to large or repeated tasks. Moreover, while a usable system would offset the development costs, it does not address the problem that a different specialized NLP system would have to be developed for every task in a hospital.

Halpern et al. [15] address the heavy workload for clinicians and describe a semi-supervised approach to this problem that uses the Anchor and Learn Framework [53]. In this scheme, the clinicians only need to define a few anchors, which are phrases that identify concepts with a very high positive predictive value (PPV).
They train a supervised model that uses a combination of structured data and a bag-of-words of the notes to predict whether such an anchor exists in a note. They showed that their method drastically reduces the required effort for clinicians while yielding equivalent results.

Our supervised approach aims to reduce complexity for clinicians while achieving both a high PPV and a high sensitivity to correctly capture the whole patient cohort. Furthermore, we develop our algorithm to create a phrase dictionary to use for patient phenotyping, and compare it to cTAKES-based models.

METHODS

Concept-Extraction-Based Models

For our baseline models, we use cTAKES to extract concepts from each note. In cTAKES, sentences and phrases are split into tokens (individual words). Then, tokens with variations (e.g. plural) are normalized to their base form. The normalized tokens are tagged for their part-of-speech (e.g. noun, verb), and a shallow parse tree is constructed to represent the grammatical structure of a sentence. Finally, a named-entity recognition (NER) algorithm uses this information to detect named entities that exist as concept unique identifiers (CUIs) in UMLS [41].

While traditionally the rules were mostly hand-crafted, modern methods use relevant concepts in a note as input to a machine learning algorithm to directly learn to predict a phenotype [54,55]. Therefore, we specify two different approaches to using the cTAKES output. The first approach uses the complete list of extracted CUIs as input to further processing steps. In the second approach, clinicians specify a dictionary comprising all clinical concepts that are relevant to the desired phenotype (e.g. Alcohol Abuse) [19].

Our predictive models replicate the process as described by Liao et al. [12]. We represent each note by the number of occurrences of each of the CUIs.
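A minimal sketch of this count-based note representation is given here. It also applies the two refinements the method uses on top of raw counts, namely separate counts for negated concepts and TF-IDF weighting. The CUI labels (`CUI_A` and so on), the `_NEG` suffix, the input layout, and the unsmoothed `tf * log(N / df)` formula are all illustrative assumptions; the paper does not state which TF-IDF variant was used, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical cTAKES-style output: one list of (CUI, is_negated) pairs per note.
notes = [
    [("CUI_A", False), ("CUI_A", False), ("CUI_B", True)],
    [("CUI_A", False), ("CUI_C", False)],
    [("CUI_B", False), ("CUI_C", True), ("CUI_C", True)],
]

def count_features(mentions):
    """Count concept mentions, keeping negated and non-negated mentions apart."""
    return Counter(f"{cui}_NEG" if negated else cui for cui, negated in mentions)

def tf_idf(count_vectors):
    """Re-weight raw counts with a plain, unsmoothed TF-IDF: tf * log(N / df)."""
    n_notes = len(count_vectors)
    # Document frequency: in how many notes does each feature occur at least once?
    doc_freq = Counter(feature for counts in count_vectors for feature in counts)
    return [
        {f: c * math.log(n_notes / doc_freq[f]) for f, c in counts.items()}
        for counts in count_vectors
    ]

counts = [count_features(mentions) for mentions in notes]
weighted = tf_idf(counts)
```

The resulting per-note dictionaries could then be turned into feature vectors for a random forest or logistic regression classifier.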
Because cTAKES detects negations, we count the occurrences of negated and non-negated CUIs separately. These features are then transformed to continuous features using the term frequency-inverse document frequency (TF-IDF). Compared to the original representation, or the bag-of-words of a note as described by Halpern et al. [15], the TF-IDF of the features reflects the importance of a feature to a note. For an accurate comparison to approaches in the literature, we train both a random forest (RF) and a logistic regression (LR) model with these features.

Convolutional Neural Networks

Our proposed model is a convolutional neural network (CNN) for text classification, replicating the architecture proposed by Collobert et al. and Kim [32,56]. The idea behind convolutions in computer vision is to learn a transformation of adjacent pixels into a single value, similar to a filter [57]. In natural language processing, the model learns which combinations of consecutive words are associated with a given concept. An overview of our architecture is shown in Figure 1.

A major advantage of CNNs is that words in a text are first projected into distributed representations, often referred to as word embeddings. Word embeddings have been shown to improve performance on other tasks based on EHRs, for example NER [58]. Words that occur in similar contexts are trained to have similar word embeddings. Therefore, misspellings, synonyms and abbreviations of an original word learn similar embeddings, which lead to similar results. Consequently, a database of synonyms and common misspellings is not required [19]. Word embeddings can be pre-trained on a larger corpus of texts, which improves results of the NLP system and reduces the amount of data required to train a model [59,60]. We pre-train our embeddings with word2vec [61] on all discharge notes available in the MIMIC-III database.
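To make the flow of data through such a model concrete, here is a rough forward-pass sketch in plain Python: embedding lookup, filters sliding over adjacent words, max-over-time pooling, and a sigmoid over a weighted sum of the pooled values. All weights are random stand-ins, the sizes are toy values (the paper uses 100 filters per width and embeddings pre-trained with word2vec), and no nonlinearity or dropout is applied; the example sentence and every name in the code are assumptions made for illustration, not the authors' Torch7 implementation.

```python
import math
import random

random.seed(0)
EMBED_DIM = 4          # toy size; real embeddings are much larger
WIDTHS = (2, 3, 4, 5)  # phrase lengths covered by the filters
FILTERS_PER_WIDTH = 2  # kept tiny for the sketch

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

tokens = "patient with history of etoh abuse and alcohol withdrawal".split()
# Embedding lookup table: one vector per distinct word (random here, not pre-trained).
embedding = {word: rand_vec(EMBED_DIM) for word in dict.fromkeys(tokens)}

# Each filter is (width, weights, bias); weights span `width` concatenated embeddings.
filters = [(w, rand_vec(w * EMBED_DIM), 0.0)
           for w in WIDTHS for _ in range(FILTERS_PER_WIDTH)]
out_weights = rand_vec(len(filters))
out_bias = 0.0

def conv_outputs(vectors, width, weights, bias):
    """Apply one filter to every window of `width` adjacent word vectors."""
    outputs = []
    for i in range(len(vectors) - width + 1):
        window = [x for vec in vectors[i:i + width] for x in vec]  # concatenate
        outputs.append(sum(a * b for a, b in zip(window, weights)) + bias)
    return outputs

def predict(note_tokens):
    """Return the predicted probability and the pooled value of every filter."""
    vectors = [embedding[t] for t in note_tokens]
    # Max-over-time pooling: keep only the strongest response of each filter.
    pooled = [max(conv_outputs(vectors, w, wt, b)) for w, wt, b in filters]
    logit = sum(p * w for p, w in zip(pooled, out_weights)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit)), pooled

prob, pooled = predict(tokens)
```

With nine tokens, a width-2 filter sees eight windows and a width-5 filter sees five, but each filter contributes exactly one pooled value regardless of note length.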
The word embeddings of all words in the text to classify are concatenated and used as input to the convolutional layer. Convolutions detect a signal from a combination of adjacent inputs. We combine multiple convolutions of different lengths to evaluate phrases that are anywhere from two to five words long, as illustrated in Figure 1. The combination of many filters of varying length results in multiple outputs, which are then combined with max-pooling. More specifically, we use max-over-time pooling to extract the most predictive value per filter [56]. The resulting prediction of the model utilizes a linear combination of these pooled features with a sigmoid function, similar to a logistic regression.

Figure 1: The architecture of our CNN model to perform the patient phenotyping. (A) Each word within a discharge note is looked up within a table of word embeddings and maps to its embedding. In our example, both instances of the word "and" will have the same embedding. (B) Convolutions of different widths are used to learn a filter that is applied to all embeddings of word sequences of the corresponding length. The convolution K2 with width 2 in our example of sequence length 11 will look at all 10 combinations of neighboring two words and output one value each. (C) The resulting multiple vectors are reduced to a single one using max-over-time pooling, which will detect the highest value (the one with the most signaling power) for each of the different convolutions. (D) The final prediction ("Does the phenotype apply to the patient who the note belongs to?") is made by computing a weighted combination of the pooled values and applying a sigmoid function, similar to a logistic regression. This figure is adapted with permission from Kim [32].

DATA SET

All notes for this study are extracted from the MIMIC-III database.
MIMIC-III contains de-identified clinical data of over 53,000 hospital admissions for adult patients to the intensive care units (ICU) at the Beth Israel Deaconess Medical Center from 2001 to 2012 [3]. MIMIC-III includes several types of clinical notes, including discharge summaries (n = 52,746) and nursing notes (n = 812,128) [62]. In this study, we focus on the discharge summaries since they are the most informative for patient phenotyping.

We investigate phenotypes that may be associated with being a 'frequent flier' in the ICU (defined as >3 ICU visits within 365 days). As many as one third of readmissions have been suggested to be preventable; identifying modifiable risk factors is a crucial step to reducing them [63]. We extracted the first discharge summary from 415 ICU frequent fliers in MIMIC-III, as well as 313 randomly selected summaries from subsequent visits. We additionally selected 882 random summaries, yielding a total of 1,610 notes. The cTAKES output for these notes contains a total of 11,094 unique CUIs.

All 1,610 discharge summaries were annotated by clinicians for the 10 phenotypes shown in Table 1. Annotators for this study included two clinical researchers who have taken the Medical College Admission Test (MCAT®) (ETM, JW), two junior medical residents (JF, JTW), two senior medical residents (DWG, PDT), and a practicing intensive care medicine physician (LAC). The table shows the definition for each phenotype that the annotating clinicians were instructed to look for, to improve inter-rater reliability. To ensure high-quality labels and minimize the risk of error, each note was labeled at least twice for each phenotype. In case the annotators were unsure, one of the senior clinicians (DWG or PDT) decided on the final label. The resulting number of occurrences of the phenotypes varies from 126 to 460 cases.

Table 1: Definitions of phenotypes as well as the number of occurrences.

| Phenotype | #positive | Definition |
|---|---|---|
| Adv. / Metastatic Cancer | 161 | Cancers with very high or imminent mortality (pancreas, esophagus, stomach, cholangiocarcinoma, brain); mention of distant or multi-organ metastasis, where palliative care would be considered (prognosis < 6 months). |
| Adv. Heart Disease | 275 | Any consideration for needing a heart transplant; description of severe aortic stenosis (aortic valve area < 1.0 cm^2), severe cardiomyopathy (LVEF <= 30%). Not sufficient to have a past medical history of congestive heart failure (CHF) or myocardial infarction (MI) with stent or coronary artery bypass graft (CABG) as these are too common. |
| Adv. Lung Disease | 167 | Severe chronic obstructive pulmonary disease (COPD) defined as Gold Stage III-IV or FEV1 < 50% of normal, or FEV1/FVC < 70%, or severe interstitial lung disease (ILD or IPF). |
| Chronic Neurological Dystrophies | 368 | Any chronic central nervous system (CNS) or spinal cord diseases, including but not limited to: multiple sclerosis (MS), amyotrophic lateral sclerosis (ALS), myasthenia gravis, Parkinson's Disease, epilepsy, "previous history" of stroke/cerebrovascular accident (CVA) with residual deficits, and various neuromuscular diseases/dystrophies. |
| Chronic Pain | 321 | Any etiology of chronic pain, including fibromyalgia, requiring long-term opioid/narcotic analgesic medication to control. |
| Alcohol Abuse | 196 | Current/recent alcohol abuse history; still an active problem at time of admission (may or may not be the cause of it). |
| Substance Abuse | 155 | Include any intravenous drug abuse (IVDU), accidental overdose of psychoactive or narcotic medications (prescribed or not). Admitting to marijuana use in history not sufficient. |
| Obesity | 126 | Clinical obesity. BMI > 30. Previous history of or being considered for gastric bypass. Insufficient to have abdominal obesity mentioned in physical exam. |
| Psychiatric Disorders | 295 | All psychiatric disorders in DSM-5 classification, including schizophrenia, bipolar and anxiety disorders, other than depression. |
| Depression | 460 | Diagnosis of depression; prescription of anti-depressant medication; or any description of intentional drug overdose, suicide or self-harm attempts. |

Performance Metrics

We evaluate the PPV, sensitivity, and F-score of all models. These metrics can be derived from a confusion matrix for the results on the test set. A confusion matrix contains four counts: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The PPV is the fraction of correct predictions out of all the samples that were predicted to be in a given category. The sensitivity, also known as recall, is the fraction of samples correctly predicted as positive out of all the samples that should have been predicted as positive. The F-score is the harmonic mean of PPV and sensitivity (more weight can be put on either of the two, but we give equal weight to both, i.e. we use the F1-score).

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

$$\text{Sensitivity: } S = \frac{TP}{TP + FN}$$

$$\text{F1-score: } F = \frac{2 \cdot \mathrm{PPV} \cdot S}{\mathrm{PPV} + S}$$

Evaluation

For all of our models, we split the data into a training, validation and test set. 70% of the labeled data was used as the training set, 10% as the validation set and 20% as the test set. All reported numbers are obtained from testing on the same test set. The validation set is used to choose the hyperparameters of the models. To achieve a fair comparison between the different types of models, we compare different approaches for each. We compare the performance of all models to two simple baselines based on n-gram models to check whether a more complicated model such as a CNN is actually necessary or whether simple co-occurrences of words can pick up the signals. Therefore, we report numbers on the seven models and baselines shown in Table 2.

Table 2: Descriptions of our different models and baselines.
| Model Name | Description of the Model |
|---|---|
| CNN | Our proposed convolutional neural network architecture |
| 2-gram LR | Baseline that uses bigrams of a text as input to a logistic regression |
| 3-gram LR | Baseline that uses trigrams of a text as input to a logistic regression |
| cTAKES RF | Random forest that uses the full output from cTAKES |
| cTAKES LR | Logistic regression that uses the full output from cTAKES |
| Filter RF | Random forest that uses clinician-filtered output from cTAKES |
| Filter LR | Logistic regression that uses clinician-filtered output from cTAKES |

For the CNN model, we used 100 filters for each of the widths 2, 3, 4, and 5. To prevent over-fitting, we set the dropout probability to 0.5 and used L2-normalization to constrain word embeddings to a max norm of 3 [64]. The model was trained using AdaDelta with an initial learning rate of 1 for 20 epochs [65]. The CNN model was implemented using Lua and the Torch7 framework [66]. All baseline models were implemented using Python with the scikit-learn library [67].

Interpretability

We compare the interpretability of the approaches by assessing which phrases are the most salient for a positive prediction on a global, model-wide scale. We evaluate the Filter LR model, because its learned parameters that correspond to each CUI are a direct indication of how salient it is, and irrelevant CUIs are already removed by clinicians, making sure that all CUIs are relevant [21]. For the CNN, we compute a modified saliency of all phrases. The saliency in neural networks is defined as the norm of the gradient of the loss function with respect to an input [68]. Alternative methods search the local space around an input [27], or compute a layer-wise backpropagation [69–72]. An input in our case is a single word embedding; to evaluate the whole phrase, we calculate the norm of the convolutional layer for positive predictions instead.
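One possible reading of this phrase-level saliency can be sketched as follows: for every window of words, collect the responses of all filters of that width and score the phrase by the L2 norm of that response vector. This interpretation, the toy sizes, the random filter weights, and the example phrase are assumptions made for illustration only; in a full pipeline the ranking would be computed with the trained filters and restricted to notes that the model classifies as positive.

```python
import math
import random

random.seed(1)
EMBED_DIM, WIDTH, N_FILTERS = 4, 2, 3  # toy sizes, chosen only for illustration

tokens = "severe global left ventricular hypokinesis".split()
# Random stand-ins for trained word embeddings and trained width-2 filters.
embedding = {t: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for t in tokens}
filters = [[random.uniform(-1, 1) for _ in range(WIDTH * EMBED_DIM)]
           for _ in range(N_FILTERS)]

def phrase_saliency(note_tokens, width=WIDTH):
    """Rank each `width`-word phrase by the L2 norm of its filter responses."""
    scored = []
    for i in range(len(note_tokens) - width + 1):
        window = [x for t in note_tokens[i:i + width] for x in embedding[t]]
        responses = [sum(a * b for a, b in zip(window, f)) for f in filters]
        norm = math.sqrt(sum(r * r for r in responses))
        scored.append((norm, " ".join(note_tokens[i:i + width])))
    return sorted(scored, reverse=True)  # most salient phrase first

ranking = phrase_saliency(tokens)
```

Collecting the top-ranked phrases over all test documents would give a global list, while running the same function on a single note gives a local, document-level explanation.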
This norm approximates how much a phrase contributed to a prediction and works well in our case. To obtain the most important phrases globally, we classify and evaluate all documents in the test set and store the most indicative phrases. To assess the saliency on a local document level, we can extract the most indicative phrases that exist in a given document using the same methodologies.

RESULTS

We show an overview of the F1-scores for different models and phenotypes in Figure 2. For every phenotype, the CNN outperforms all other approaches. For some of the phenotypes, such as Obesity and Schizophrenia, the CNN outperforms the other models by a large margin. The filtered models, which require much more effort from clinicians, achieve only a minimal improvement over the models that use the noisy input of all identified cTAKES concepts. In the detailed results, shown in Table 3, we observe that the CNN outperforms the baselines on all of the sensitivity values and half of the PPVs. In some cases, the simple n-gram baselines achieve a very high PPV with a very low sensitivity. That means that these models could be efficiently used to detect patients if it does not matter that the model overlooks most of the positives, for example to detect a small at-risk population for interventions [73].

Figure 2: Comparison of achieved F1-scores across all tested phenotypes. Our models are the 3 models on the left of each phenotype, shown in blue. The 4 cTAKES-based models are on the right, in red. The CNN achieves the highest scores across all phenotypes.

We show the most salient phrases according to the CNN and the filtered cTAKES LR models for Advanced Heart Disease in Table 4, and for Alcohol Abuse in Table 5. Both tables contain many phrases mentioned in the definition shown in Table 1, such as "Cardiomyopathy".
We also observe mentions of "CHF" and "CABG" in Table 4 for both models, which are common medical conditions associated with advanced heart disease, but are not sufficient requirements to be diagnosed or annotated as such in the annotation scheme. The model still learned to associate those phrases with advanced heart disease, since those phrases also occur in many notes from patients that were labeled positive for advanced heart failure. We argue that overall, there is no loss in interpretability when looking at the learned phrases. Moreover, while the CUIs extracted by cTAKES can be very generic, such as "Atrium, Heart" or "Heart", the salient CNN phrases are more specific.

The phrases in Table 5 illustrate how the CNN can detect mentions of the condition in many forms. Without any human input, the CNN learned that "EtOH" and "alcohol" are used synonymously and thus detects phrases containing either of them, which leads to a higher sensitivity. The filtered cTAKES LR model surprisingly ranks "Victim of abuse" higher than the direct mention of alcohol abuse in a note, and finds it very indicative if an ethanol measurement was taken.

Table 3: The results (PPV, Sensitivity, and F1-score) across all phenotypes for all models. cT stands for cTAKES; S is the sensitivity and F the F1-score.

| Phenotype | Metric | CNN | 2-gram | 3-gram | cT RF | cT LR | Filter RF | Filter LR |
|---|---|---|---|---|---|---|---|---|
| Adv. Cancer | PPV | 90 | 91 | 100 | 94 | 94 | 68 | 78 |
| Adv. Cancer | S | 61 | 31 | 25 | 48 | 48 | 42 | 45 |
| Adv. Cancer | F | 73 | 46 | 40 | 64 | 64 | 52 | 57 |
| Adv. Heart Disease | PPV | 73 | 69 | 71 | 56 | 65 | 58 | 74 |
| Adv. Heart Disease | S | 68 | 43 | 34 | 46 | 44 | 47 | 47 |
| Adv. Heart Disease | F | 70 | 53 | 46 | 50 | 53 | 52 | 58 |
| Adv. Lung Disease | PPV | 67 | 57 | 67 | 36 | 67 | 38 | 46 |
| Adv. Lung Disease | S | 57 | 14 | 14 | 46 | 43 | 43 | 46 |
| Adv. Lung Disease | F | 62 | 23 | 24 | 41 | 52 | 40 | 46 |
| Chronic Neurological | PPV | 81 | 56 | 55 | 58 | 66 | 70 | 87 |
| Chronic Neurological | S | 61 | 27 | 23 | 49 | 49 | 49 | 51 |
| Chronic Neurological | F | 69 | 36 | 32 | 53 | 56 | 57 | 64 |
| Chronic Pain | PPV | 78 | 49 | 44 | 61 | 53 | 62 | 68 |
| Chronic Pain | S | 45 | 33 | 26 | 48 | 48 | 46 | 46 |
| Chronic Pain | F | 57 | 40 | 33 | 54 | 50 | 53 | 55 |
| Alcohol Abuse | PPV | 85 | 100 | 100 | 94 | 76 | 100 | 100 |
| Alcohol Abuse | S | 79 | 39 | 39 | 54 | 57 | 61 | 46 |
| Alcohol Abuse | F | 81 | 56 | 56 | 68 | 65 | 76 | 63 |
| Substance Abuse | PPV | 83 | 80 | 88 | 79 | 64 | 87 | 95 |
| Substance Abuse | S | 80 | 27 | 23 | 50 | 47 | 43 | 67 |
| Substance Abuse | F | 81 | 40 | 37 | 61 | 54 | 58 | 78 |
| Obesity | PPV | 100 | 50 | 50 | 60 | 80 | 67 | 90 |
| Obesity | S | 95 | 10 | 5 | 45 | 40 | 40 | 45 |
| Obesity | F | 97 | 17 | 9 | 51 | 53 | 50 | 60 |
| Psychiatric Disorders | PPV | 87 | 61 | 67 | 62 | 62 | 88 | 79 |
| Psychiatric Disorders | S | 80 | 29 | 24 | 49 | 47 | 51 | 46 |
| Psychiatric Disorders | F | 83 | 39 | 35 | 55 | 54 | 65 | 58 |
| Depression | PPV | 91 | 67 | 67 | 82 | 77 | 74 | 82 |
| Depression | S | 76 | 40 | 34 | 49 | 50 | 49 | 49 |
| Depression | F | 83 | 50 | 45 | 61 | 61 | 59 | 61 |

Table 4: The most salient phrases for Advanced Heart Failure. The salient cTAKES CUIs are extracted from the filtered LR model. Duplicate phrases are removed.

Most relevant cTAKES CUIs / Most salient phrases detected by the CNN
Magnesium Cardiomyopathy Hypokinesia Heart Failure Acetylsalicylic Acid Atrium, Heart Coronary Disease Atrial Fibrillation Coronary Artery Disease Aortocoronary Bypasses Fibrillation Heart Catheterization Chest Artery CAT Scans, X-Ray Hypertension Creatinine Measurement Wall Hypokinesis Port pacer Ventricular hypokinesis p AVR post ICD status post ICD EF 20 30 bifurcation aneurysm clipping CHF with EF cardiomyopathy , EF 15 ( EF 20 30 coronary artery bypass graft respiratory viral infection by DFA severe global free wall hypokinesis Class II , EF 20 lateral CHF with EF 30 anterior and atypical hypokinesis akinesis severe global left ventricular hypokinesis 's cardiomyopathy , EF 15

Table 5: The most salient phrases for Alcohol Abuse. The salient cTAKES CUIs are extracted from the filtered LR model. Duplicate phrases are removed.
Most relevant cTAKES CUIs / Most salient phrases detected by the CNN
Victim of abuse Ethanol Measurement Alcohol Abuse Thiamine Social and personal history Family history Hypertension Injuries risk Pain Sodium Potassium Measurement Plasma Glucose Measurement Consciousness Alert Alcohol Abuse EtOH abuse Alcoholic Dilated ETOH cirrhosis heavy alcohol abuse evening Alcohol abuse Drug Reactions Attending alcohol withdrawal compartment syndrome EtOH abuse with multiple liver secondary to alcohol abuse abuse crack cocaine, EtOH

DISCUSSION

CNNs are a novel and flexible approach to patient phenotyping using clinical notes. Our results show that deep learning outperforms all other methods in terms of F1-score and sensitivity while achieving a comparable or better PPV. However, we notice that even consistent annotation schemes lead to varying results between phenotypes. This makes it especially difficult to compare our results to other reported metrics in the literature, a problem that is amplified by the sparsity of available studies using only unstructured data.

The major advantage of rule-based models that are specifically tailored to a given problem is their interpretability. Clinicians dictate the phrases that are considered as input to a classifier and therefore have full control over the model. Since bias in data collection and analysis is at times unavoidable, models are required to be interpretable in order for clinicians to be able to detect such biases. One such example of bias was in mortality prediction among patients with pneumonia, where asthma was found to increase survival probability [25]. It turned out that there was an institutional practice to admit all patients with pneumonia and a history of asthma to the ICU regardless of disease severity, so that a history of asthma was strongly correlated with a lower illness severity.
We demonstrated that CNNs can be interpreted in the same way as rule-based models by computing the saliency of inputs. This leads to a similar level of interpretability. One disadvantage of our approach is the requirement for more phrases for consideration. Lists of salient phrases will naturally comprise more items, making it more difficult to investigate which phrases exactly lead to a prediction. However, each phrase comes with a saliency coefficient, which allows compensating for the length of the list of salient phrases.

Another point of comparison is the complexity of the annotation task involved for clinicians, who may not be familiar with NLP data set creation methodologies. Both the CNN and rule-based approaches require the construction of an annotated data set, a process that can span from several months to years, especially if the labels cannot be inferred from the structured data itself. Since a CNN learns about phrases in notes that are associated with the presence of a concept, the clinicians can simply indicate the presence or absence of a concept of interest while guided by clinically driven criteria. As such, our proposed approach allows annotations with broader diagnostic criteria instead of limiting the annotation rules to specific pre-defined phrases. This annotation approach is more suitable for modeling concepts that require interpretation of complex contextual or clinical reasoning patterns.

While a CNN learns the rules that lead to a positive label, rule-based approaches require clinicians to define every phrase that is associated with a concept. Due to the heterogeneity of text, clinicians might not be able to think of all possible phrases in advance. They also have to consider how to handle negated phrases correctly. Finally, for some clinically important phenotypes such as "Non-Adherence", it is impossible to construct an exhaustive list of phrases associated with it.
The CNN has some limitations. Because CNNs learn the phrases associated with positively annotated notes, the algorithm’s generalizability is even more dependent on the initial note selection criteria for its training data. Additionally, our approach still requires an annotated data set. Therefore, cTAKES-based models may still be preferable for applications where a lower sensitivity is acceptable. However, the advantage of the CNN, and the annotation strategy that it enabled, lies in allowing rapid development of phenotyping capabilities for multiple complex concepts simultaneously from only unstructured data. Our annotation strategy takes approximately the same amount of time to annotate any number of concepts once a clinician is reading a note. While rule-based systems require a separate algorithm for each annotation, the same CNN can be trained for all of the annotated phenotypes at the same time. This offers an opportunity to dramatically accelerate the development of high-quality corpora of annotated data as well as scalable phenotyping algorithms.

Such capabilities are important for identifying complex clinical concepts in unstructured clinical text that are poorly captured in the structured data. For example, being able to identify patients who are readmitted to hospital due to poor management of problems such as drug abuse, psychiatric disorders and other chronic diseases, which are often poorly coded, will have high clinical impact. As we mentioned before, the goal with our data is to understand phenotypes that are indicative of patients having repeated ICU admissions. Due to the multiple phenotypes that are hypothesized to be associated with it, we require a deep learning algorithm that supports this rapid phenotyping. Additionally, we anticipate validation of this approach in other types of clinical notes, such as social work assessments, to identify patients at risk.
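The remark above that one CNN can be trained for all annotated phenotypes at once can be pictured as a multi-label setup: each note carries one target vector with an entry per phenotype, and a shared note representation feeds one output per phenotype. The sketch below illustrates only that idea. It is an assumption, not a description of the authors' architecture; the label names, `encode_labels`, and `predict_all` are hypothetical, and `features` stands in for whatever vector the shared network layers would produce.

```python
import math

# Hypothetical label names; the real label set comes from the annotation scheme.
PHENOTYPES = ["alcohol_abuse", "non_adherence", "psychiatric_disorders"]


def encode_labels(present: set[str]) -> list[int]:
    """Turn the phenotypes marked present in a note into a multi-hot vector."""
    unknown = present - set(PHENOTYPES)
    if unknown:
        raise ValueError(f"unknown phenotypes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in PHENOTYPES]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_all(
    features: list[float],
    weights: dict[str, list[float]],
    biases: dict[str, float],
) -> dict[str, float]:
    """Score every phenotype from one shared note representation.

    Each phenotype contributes only its own output weights and bias;
    the feature vector is computed once per note and reused.
    """
    return {
        name: sigmoid(
            sum(w * f for w, f in zip(weights[name], features)) + biases[name]
        )
        for name in PHENOTYPES
    }
```

Under this framing, adding a phenotype means adding one entry to the target vector and one output, whereas a rule-based system would need a new phrase list and matcher for it.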
Lastly, the CNN creates the opportunity to develop a model that can use phrase saliency to highlight notes and tag patients to support chart review. We are planning future work to explore whether the identification of salient phrases can be used to support chart abstraction and whether models using these phrases represent what clinicians find salient in a medical note.

CONCLUSION

We have presented an alternative approach to patient phenotyping using NLP based on deep learning. Our model significantly improves the accuracy of phenotyping while decreasing the annotation complexity required of clinical domain experts. Our approach can be employed to augment structured data in the EHR for a variety of phenotyping tasks. We address concerns about the interpretability of deep learning by proposing a method to identify phrases associated with different phenotypes.

ACKNOWLEDGEMENTS

We thank Barbara J. Grosz for helpful discussions.

FUNDING INFORMATION

Franck Dernoncourt is supported by a grant from Philips Research. Leo Anthony Celi is supported by the R01 grant EB017205-01A1 from the National Institute of Bioimaging and Biomedical Engineering. The content is solely the responsibility of the authors and does not necessarily represent the official views of Philips Research or the National Institute of Bioimaging and Biomedical Engineering.

COMPETING INTERESTS

We have no competing interests to declare.

BIBLIOGRAPHY

1. Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of Electronic Health Record Systems among U.S. Non-Federal Acute Care Hospitals: 2008-2015.
2. Saeed M, Lieu C, Raber G, Mark RG. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. Comput Cardiol. 2002;29:641-644. doi:10.1109/CIC.2002.1166854.
3. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci data. 2016;3:160035. doi:10.1038/sdata.2016.35.
4. Uzuner O, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14-24. doi:10.1197/jamia.M2408.
5. Uzuner O. Recognizing obesity and comorbidities in sparse data. J Am Med Inform Assoc. 2009;16(4):561-570. doi:10.1197/jamia.M3115.
6. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010;17(5):514-518. doi:10.1136/jamia.2010.003947.
7. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Informatics Assoc. 2011;18(5):552-556. doi:10.1136/amiajnl-2011-000203.
8. Sun W, Rumshisky A, Uzuner O. Annotating temporal information in clinical narratives. J Biomed Inform. 2013;46(SUPPL.):S5-S12. doi:10.1016/j.jbi.2013.07.004.
9. Stubbs A, Uzuner O. Annotating risk factors for heart disease in clinical narratives for diabetic patients. J Biomed Inf. 2015;58:S78-S91. doi:10.1016/j.jbi.2015.05.009.
10. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13(6):395-405. doi:10.1038/nrg3208.
11. Murdoch T, Detsky A. The Inevitable Application of Big Data to Health Care. J Am Med Assoc. 2013;309(13):1351-1352. doi:10.1001/jama.2013.393.
12. Liao KP, Cai T, Savova GK, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ (Clin Res ed). 2015;350(apr24_11):h1885. doi:10.1136/bmj.h1885.
13. Ananthakrishnan AN, Cai T, Savova G, et al. Improving Case Definition of Crohn’s Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach. Inflamm Bowel Dis. 2013;19(7):1-10. doi:10.1097/MIB.0b013e31828133fd.
14. Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc. 2015;22(5):938-947. doi:10.1093/jamia/ocv032.
15. Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Informatics Assoc. 2016;23(4):731-740. doi:10.1093/jamia/ocw011.
16. Chen L, Guo U, Illipparambil LC, et al. Racing Against the Clock: Internal Medicine Residents’ Time Spent On Electronic Health Records. J Grad Med Educ. 2016;8(1):39-44. doi:10.4300/JGME-D-15-00240.1.
17. Topaz M, Lai K, Dowding D, et al. Automated identification of wound information in clinical notes of patients with heart diseases: Developing and validating a natural language processing application. Int J Nurs Stud. 2016;64:25-31. doi:10.1016/j.ijnurstu.2016.09.013.
18. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013;20(1):117-121. doi:10.1136/amiajnl-2012-001145.
19. Carrell DS, Halgrim S, Tran D-T, et al. Using Natural Language Processing to Improve Efficiency of Manual Chart Abstraction in Research: The Case of Breast Cancer Recurrence. doi:10.1093/aje/kwt441.
20. Kirby JC, Speltz P, Rasmussen LV, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Informatics Assoc. 2016;23(6):1046-1052. doi:10.1093/jamia/ocv202.
21. Ranganath R, Perotte A, Elhadad N, Blei D. Deep Survival Analysis. In: Machine Learning for Healthcare.
22. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Informatics Assoc. December 2016:ocw156. doi:10.1093/jamia/ocw156.
23. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056.
24. Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402. doi:10.1001/jama.2016.17216.
25. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. Proc 21th ACM SIGKDD Int Conf Knowl Discov Data Min - KDD ’15. 2015:1721-1730. doi:10.1145/2783258.2788613.
26. Goodman B, Flaxman S. EU regulations on algorithmic decision-making and a “right to explanation.” 2016 ICML Work Hum Interpret Mach Learn (WHI 2016).
27. Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. doi:10.1145/2939672.2939778.
28. Strobelt H, Gehrmann S, Huber B, Pfister H, Rush AM. Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks. arXiv Prepr arXiv160607461.
29. Yosinski J, Clune J, Nguyen A, Fuchs T, Lipson H. Understanding Neural Networks Through Deep Visualization. Int Conf Mach Learn - Deep Learn Work 2015.
30. Zeiler M, Krishnan D, Taylor G, Fergus R. Deconvolutional Networks for Feature Learning. Cvpr. 2010:2528-2535.
31. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale Video Classification with Convolutional Neural Networks. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014.
32. Kim Y. Convolutional Neural Networks for Sentence Classification. arXiv:1746-1751.
33. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks.
34. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513. doi:10.1136/jamia.2009.001560.
35. Ackerman JP, Bartos DC, Kapplinger JD, Tester DJ, Delisle BP, Ackerman MJ. The promise and peril of precision medicine: phenotyping still matters most. In: Mayo Clinic Proceedings. Vol 91.; 2016:1606-1616.
36. Carrell DS, Halgrim S, Tran D-T, et al. Practice of Epidemiology Using Natural Language Processing to Improve Efficiency of Manual Chart Abstraction in Research: The Case of Breast Cancer Recurrence. Am J Epidemiol. 2014;(12):kwt441. doi:10.1093/aje/kwt441.
37. Strauss JA, Chao CR, Kwan ML, Ahmed SA, Schottinger JE, Quinn VP. Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm. J Am Med Inform Assoc. 2013;20(2):349-355. doi:10.1136/amiajnl-2012-000928.
38. Lingren T, Chen P, Bochenek J, et al. Electronic Health Record Based Algorithm to Identify Patients with Autism Spectrum Disorder. Smalheiser NR, ed. PLoS One. 2016;11(7):e0159621. doi:10.1371/journal.pone.0159621.
39. Savova GK, Olson JE, Murphy SP, et al. Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record. J Am Med Informatics Assoc. 2012;19(e1):e83-e89. doi:10.1136/amiajnl-2011-000295.
40. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544-551. doi:10.1136/amiajnl-2011-000464.
41. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267-70. doi:10.1093/nar/gkh061.
42. Spackman KA, Campbell KE, Côté RA. SNOMED RT: a reference terminology for health care. Conf Proc Am Med Informatics Assoc Annu Fall Symp. 1997:640-644.
43. Liu S, Ma W, Moore R, Ganesan V, Nelson S. A standardized drug nomenclature links systems that use different vocabularies, so the patient gets what the doctor ordered. RxNorm: Prescription for Electronic Drug Information Exchange.
44. Denny JC, Irani PR, Wehbe FH, Smithers JD, Spickard A. The KnowledgeMap project: development of a concept-based medical school curriculum database. AMIA Annu Symp Proc. 2003:195-199.
45. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17(3):229-236. doi:10.1136/jamia.2009.002733.
46. Denny JC, Choma NN, Peterson JF, et al. Natural Language Processing Improves Identification of Colorectal Cancer Testing in the Electronic Medical Record. Med Decis Mak. 2012;32(1):188-197. doi:10.1177/0272989X11400418.
47. Hripcsak G, Austin JHM, Alderson PO, Friedman C. Radiology Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports 1. Radiology. 2002;224(1):157-163.
48. Pradhan S, Elhadad N, South BR, et al. Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. J Am Med Inform Assoc. 2014;22(1):143-154. doi:10.1136/amiajnl-2013-002544.
49. Perlis RH, Iosifescu DV, Castro VM, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):10.1017/S0033291711000997. doi:10.1017/S0033291711000997.
50. Zhan W, Xia Z, Secor E, et al. Modeling Disease Severity in Multiple Sclerosis Using Electronic Health Records. PLoS One. 2013;8(11):e78927. doi:10.1371/journal.pone.0078927.
51. Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res. 2010;62(8):1120-1127. doi:10.1002/acr.20184.
52. Carroll RJ, Thompson WK, Eyler AE, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J Am Med Inform Assoc. 2012;19(e1):e162-9. doi:10.1136/amiajnl-2011-000583.
53. Halpern Y, Choi Y, Horng S, Sontag D. Using Anchors to Estimate Clinical State without Labeled Data. AMIA Annu Symp Proc. 2014;2014:606-615.
54. Bates J, Fodeh SJ, Brandt CA, Womack JA. Classification of radiology reports for falls in an HIV study cohort. doi:10.1093/jamia/ocv155.
55. Kang N, Singh B, Afzal Z, van Mulligen EM, Kors JA. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inform Assoc. 2013;20(5):876-881. doi:10.1136/amiajnl-2012-001173.
56. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural Language Processing (Almost) from Scratch. J Mach Learn Res. 2011;12(Aug):2493-2537. doi:10.1.1.231.4614.
57. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278-2323. doi:10.1109/5.726791.
58. Wu Y, Xu J, Jiang M, Zhang Y, Xu H. A Study of Neural Word Embeddings for Named Entity Recognition in Clinical Text. AMIA Annu Symp Proc. 2015;2015:1326-1333.
59. Erhan D, Courville A, Vincent P. Why Does Unsupervised Pre-training Help Deep Learning? J Mach Learn Res. 2010;11(Feb):625-660. doi:10.1145/1756006.1756025.
60. Luan Y, Watanabe S, Harsham B. Efficient learning for spoken language understanding tasks with word embedding based pre-training. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol 2015-Janua.; 2015:1398-1402.
61. Mikolov T, Chen K, Corrado G, Dean J. Distributed Representations of Words and Phrases and their Compositionality. Nips. 2013:1-9. doi:10.1162/jmlr.2003.3.4-5.951.
62. Sarmiento RF, Dernoncourt F. Improving Patient Cohort Identification Using Natural Language Processing. Second Anal Electron Heal Rec. 2016:405-415.
63. Kocher RP, Adashi EY. Hospital Readmissions and the Affordable Care Act Paying for Coordinated Quality Care.
64. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR. Improving neural networks by preventing co-adaptation of feature detectors.
65. Zeiler MD. ADADELTA: An Adaptive Learning Rate Method. arXiv. 2012:6.
66. Collobert R, Kavukcuoglu K, Farabet C. Torch7: A Matlab-like Environment for Machine Learning.; 2011.
67. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2012;12(Oct):2825-2830. doi:10.1007/s13398-014-0173-7.2.
68. Li J, Chen X, Hovy E, Jurafsky D. Visualizing and Understanding Neural Models in NLP. :681-691.
69. Denil M, Demiraj A, De Freitas N. Extraction of Salient Sentences from Labelled Documents.
70. Arras L, Horn F, Montavon G, Müller K-R, Samek W. "What is Relevant in a Text Document?": An Interpretable Machine Learning Approach.
71. Bach S, Binder A, Montavon G, et al. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. Suarez OD, ed. PLoS One. 2015;10(7):e0130140. doi:10.1371/journal.pone.0130140.
72. Arras L, Horn F, Montavon G, Müller K-R, Samek W. Explaining Predictions of Non-Linear Classifiers in NLP. 2016:1-7.
73. Razavian N, Blecker S, Schmidt AM, Smith-McLallen A, Nigam S, Sontag D. Population-Level Prediction of Type 2 Diabetes From Claims Data and Analysis of Risk Factors. Big Data. 2015;3(4):277-287. doi:10.1089/big.2015.0020.
