Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network

We present a model for predicting electrocardiogram (ECG) abnormalities in short-duration 12-lead ECG signals which outperformed medical doctors on the 4th year of their cardiology residency. Such exams can provide a full evaluation of heart activity…

Authors: Antônio H. Ribeiro, Manoel Horta Ribeiro, Gabriela Paixão

Antônio H. Ribeiro 1,2,*, Manoel Horta Ribeiro 1, Gabriela Paixão 1,3, Derick Oliveira 1, Paulo R. Gomes 1,3, Jéssica A. Canazart 1, Milton Pifano 1,3, Wagner Meira Jr. 1, Thomas B. Schön 2, Antonio Luiz Ribeiro 1,3,†

1 Universidade Federal de Minas Gerais, Brazil, 2 Uppsala University, Sweden, 3 Telehealth Center from Hospital das Clínicas da Universidade Federal de Minas Gerais, Brazil.
* antonio-ribeiro@ufmg.br, † tom@hc.ufmg.br

Abstract

We present a model for predicting electrocardiogram (ECG) abnormalities in short-duration 12-lead ECG signals which outperformed medical doctors in the 4th year of their cardiology residency. Such exams can provide a full evaluation of heart activity and have not been studied in previous end-to-end machine learning papers. Using the database of a large telehealth network, we built a novel dataset with more than 2 million ECG tracings, orders of magnitude larger than those used in previous studies. Moreover, our dataset is more realistic, as it consists of 12-lead ECGs recorded during standard in-clinic exams. Using this data, we trained a residual neural network with 9 convolutional layers to map 7 to 10 second ECG signals to 6 classes of ECG abnormalities. Future work should extend these results to cover a larger range of ECG abnormalities, which could improve the accessibility of this diagnostic tool and help avoid wrong diagnoses by medical doctors.

1 Introduction

Cardiovascular diseases are the leading cause of death worldwide [1] and the electrocardiogram (ECG) is a major diagnostic tool for this group of diseases. As ECGs transitioned from analogue to digital, automated computer analysis of standard 12-lead electrocardiograms gained importance in the process of medical diagnosis [2].
However, the limited performance of classical algorithms [3, 4] precludes their use as standalone diagnostic tools and relegates them to an ancillary role [5].

End-to-end deep learning has recently achieved striking success in tasks such as image classification [6] and speech recognition [7], and there are great expectations about how this technology may improve health care and clinical practice [8–10]. So far, the most successful applications have used a supervised learning setup to automate diagnosis from exams. Algorithms have achieved better performance than human specialists in their routine workflow in diagnosing breast cancer [11] and detecting certain eye conditions from eye scans [12]. While effective, training deep neural networks with supervised learning algorithms requires large quantities of labeled data, which, for medical applications, introduces several challenges, including those related to the confidentiality and security of personal health information [13].

The standard, short-duration 12-lead ECG is the most commonly used complementary exam for the evaluation of the heart, being employed across all clinical settings, from primary care centers to intensive care units. While cardiac monitors and long-term monitoring, such as the Holter exam, provide information mostly about cardiac rhythm and repolarization, the 12-lead ECG can provide a full evaluation of the heart, including arrhythmias, conduction disturbances, acute coronary syndromes, cardiac chamber hypertrophy and enlargement, and even the effects of drugs and electrolyte disturbances.

Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.

Although preliminary studies using deep learning methods [14, 15] achieve high accuracy in detecting specific abnormalities using single-lead heart monitors, the use of such approaches for detecting the full range of diagnoses that can be obtained from a standard 12-lead ECG is still largely unexplored.
A contributing factor for this is the shortage of full digital 12-lead ECG databases, since most ECGs are still registered only on paper, archived as images, or stored in PDF format [16]. Most available databases comprise a few hundred tracings and have no systematic annotation of the full list of ECG diagnoses [17], limiting their usefulness as training datasets in a deep learning setting. This lack of systematically annotated data is unfortunate, as an accurate automatic method of ECG diagnosis from a standard 12-lead ECG would be greatly beneficial. The exams are performed in settings where, often, there are no specialists to analyze and interpret the ECG tracings, such as primary care centers and emergency units. Indeed, primary care and emergency department health professionals have limited diagnostic abilities in interpreting 12-lead ECGs [18, 19]. This need is most acute in low- and middle-income countries, which account for more than 75% of deaths related to cardiovascular disease [20], and where, often, the population does not have access to cardiologists with full expertise in ECG diagnosis.

The main contribution of this paper is to introduce a novel large-scale dataset of labelled 12-lead ECG exams and to train and validate a residual neural network in this relevant setup. We consider 6 types of ECG abnormalities: 1st degree AV block (1dAVb), right bundle branch block (RBBB), left bundle branch block (LBBB), sinus bradycardia (SB), atrial fibrillation (AF) and sinus tachycardia (ST), considered representative of both rhythmic and morphologic ECG abnormalities.

2 Related work

Classical ECG software, such as the University of Glasgow's ECG analysis program [21], extracts the main features of the ECG signal using signal processing techniques and uses them as input for classifiers. A literature review of these methods is given in [22].
In [23] a different approach is taken, where the ECG features are learned using an unsupervised method and then used as input to a supervised learning method.

End-to-end deep learning presents an alternative to these two-step approaches, where the raw signal itself is used as input to the classifier. In [24, 25, 14] the authors make use of a convolutional neural network to classify ECG abnormalities. The network architecture used in [14] is inspired by architectures used for image classification and we make use of a similar architecture in this paper. There are differences though, in particular in the number of layers, the input type (we use 12 leads, while [14] used a single lead) and the output layer used. Recurrent networks are used in [26, 15]. A review of recent machine learning techniques applied to automatic ECG diagnosis is given in [27]. The aforementioned methods and others (such as random forests and Bayesian methods) are compared there, and a more extensive list of references using those methods is provided.

The major difference between this paper and previous applications of end-to-end learning for ECG classification is the dataset used for training and validating the model. The most common dataset used to design and evaluate ECG algorithms is the MIT-BIH arrhythmia database [28], which was used for training in [25, 23] and for almost all algorithms in [22]. This dataset contains 30-minute, 2-lead ECG records from 47 unique patients. In [15] the authors used a dataset of 24-hour Holter ECG recordings collected from 2,850 patients at the University of Virginia (UVA) Heart Station. In [14] the authors construct a new dataset containing labeled data of 64,121 ECG records from 29,163 unique patients who have used the Zio Patch monitor. The PhysioNet 2017 Challenge made available a dataset of 12,186 entries captured with the AliveCor ECG monitor, with recordings lasting between 9 and 61 seconds [29].
All these datasets were obtained from cardiac monitors and Holter exams, where patients are usually monitored for several hours, and are restricted to one or two leads. Our dataset, on the other hand, consists of short-duration (7 to 10 seconds) 12-lead tracings obtained from in-clinic exams and is orders of magnitude larger than those used in previous studies, with well over 2 million entries.

3 Data

The dataset used for training and validating the model consists of 2,470,424 records from 1,676,384 different patients from 811 counties in the state of Minas Gerais, Brazil. The duration of the ECG recordings is between 7 and 10 seconds. The data was obtained between 2010 and 2016 by a telediagnostic ECG system developed and maintained by the Telehealth Network of Minas Gerais (TNMG), led by the Telehealth Center from the Hospital das Clínicas of the Federal University of Minas Gerais.

Figure 1: Unidimensional residual neural network used for ECG classification.

We developed an unsupervised methodology that classifies each ECG according to the free text in the expert report. To obtain the ground truth, we combine this result with two existing automatic ECG classifiers (Glasgow and Minnesota), using rules derived from expert knowledge and from the manual inspection of samples of the exams. Exams that these rules could not resolve, around 34,000 in total, were assigned to medical students for manual review. This process is explained in detail in Appendix A.

We split this dataset into a training set and a validation set. The training set contains 98% of the data, and the validation set consists of the remaining 2% (approximately 50,000 exams), used for tuning the hyperparameters.

The dataset used for testing the model consists of 953 tracings from distinct patients. These were also obtained from TNMG's ECG system, but using a more rigorous methodology for labelling the abnormalities.
Two medical doctors with experience in electrocardiography independently annotated the ECGs. When they agree, the common diagnosis is considered the ground truth. In case of any disagreement, a third medical specialist, aware of the annotations from the other two, decides the diagnosis. Appendix B contains information about the abnormalities that can be found in both the training/validation set and the test set.

4 Model

We used a convolutional neural network similar to the residual network [30], but adapted to unidimensional signals. This architecture allows deep neural networks to be efficiently trained by including skip connections. We have adopted the modification of the residual block proposed in [31], which places the skip connection in the position displayed in Figure 1. A similar architecture has been successfully employed for arrhythmia detection from ECG signals in [14], and the design choices we make in this section are, indeed, strongly influenced by [14]. We should highlight that, despite using a significantly larger training dataset, we got the best validation results with an architecture with roughly one quarter the number of layers and parameters of the network employed in [14].

The network consists of a convolutional layer (Conv) followed by n_blk = 4 residual blocks with two convolutional layers per block. The output of the last block is fed into a Dense layer with a sigmoid activation function (σ), which was used because the classes are not mutually exclusive (i.e. two or more classes may occur in the same exam). The output of each convolutional layer is rescaled using batch normalization (BN) [32] and fed into a rectified linear activation unit (ReLU). Dropout [33] is applied after the non-linearity.
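To illustrate why a sigmoid output suits classes that are not mutually exclusive, the following minimal sketch compares it with a softmax on one made-up exam. The class ordering and the logit values are assumptions chosen for illustration only; they are not taken from the trained model.

```python
import math
from typing import List

# Class order is an assumption for this illustration.
CLASSES = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function, applied per class."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits: List[float]) -> List[float]:
    """Softmax over all classes; the outputs always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-activation outputs for an exam showing both 1dAVb and AF.
logits = [3.0, -4.0, -5.0, -4.0, 2.5, -3.0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

for name, p_sig, p_soft in zip(CLASSES, sigmoid_probs, softmax_probs):
    print(f"{name:6s} sigmoid={p_sig:.3f} softmax={p_soft:.3f}")
```

With the sigmoid, both abnormalities can receive a high score at once, whereas the softmax makes the classes compete for a total probability of 1.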
The convolutional layers have filter length 16. The first layer and the first residual block operate on 4096 samples with 64 filters; the number of filters then increases by 64 every second residual block, and the signal is subsampled by a factor of 4 in every residual block. Max Pooling and convolutional layers with filter length 1 (1x1 Conv) may be included in the skip connection to make its dimensions match those of the signals in the main branch.

The loss function is the average cross-entropy

$$ -\frac{1}{n_{\text{class}}} \sum_{i=1}^{n_{\text{class}}} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

where $\hat{y}_i$ is the output of the sigmoid layer for the $i$-th class and $y_i$ is the corresponding observed value (0 or 1). The cost function (i.e. the sum of loss functions over the entire training set) is minimized using the Adam optimizer [34] with default parameters and learning rate lr = 0.001. The learning rate is reduced by a factor of 10 whenever the validation loss does not improve for 7 consecutive epochs. The neural network weights are initialized as in [35] and the biases are initialized with zeros. The training runs for 50 epochs, with the final model being the one with the best validation results during the optimization process.

5 Results

Table 1 shows the performance on the test set.

         Precision (PPV)    Recall (Sensitivity)    Specificity        F1 Score
         model    doctor    model    doctor         model    doctor    model    doctor
1dAVb    0.923    0.905     0.727    0.679          0.998    0.998     0.813    0.776
RBBB     0.878    0.868     1.000    0.971          0.995    0.994     0.935    0.917
LBBB     0.971    1.000     1.000    0.900          0.999    1.000     0.985    0.947
SB       0.792    0.833     0.864    0.938          0.995    0.996     0.826    0.882
AF       0.846    0.769     0.846    0.769          0.998    0.996     0.846    0.769
ST       0.870    0.938     0.952    0.833          0.993    0.998     0.909    0.882

Table 1: Performance of our deep neural network model and of 4th year cardiology resident medical doctors when evaluated on the test set. (PPV = positive predictive value)
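For reference, the four measures reported in Table 1 follow the standard definitions in terms of confusion-matrix counts. The symbols TP, FP, TN and FN (true positives, false positives, true negatives and false negatives for a given abnormality) are notation introduced here, not taken from the original text:

```latex
\begin{align*}
\text{Precision (PPV)} &= \frac{TP}{TP + FP}, &
\text{Recall (Sensitivity)} &= \frac{TP}{TP + FN}, \\
\text{Specificity} &= \frac{TN}{TN + FP}, &
F_1 &= \frac{2\,TP}{2\,TP + FP + FN}.
\end{align*}
```

As a worked instance, the 1dAVb confusion matrix in Table 3 has TP = 24 and FN = 9, which gives a recall of 24/33 ≈ 0.727, the value listed for the model in Table 1.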
We consider our model to have predicted the abnormality when its output is above a threshold that is set manually for each of the classes. Each threshold was chosen to be approximately at the inflection point of the precision-recall curve (presented in Appendix C). High performance measures were obtained for all ECG abnormalities, with F1 scores above 80% and specificity over 99%. The same test set was evaluated by two 4th year cardiology resident medical doctors, each one annotating half of the exams. Their average performance is given in the table for comparison and, considering the F1 score, the model outperforms them for 5 out of 6 abnormalities.

6 Future Work

The training data was collected from a general Brazilian population and, since the database is large, it contains even rare conditions with sufficient frequency for us to attempt to build models that predict them. In future work we intend to extend the results to a progressively larger set of diagnoses. This process will happen gradually because: i) the dataset preprocessing can be time consuming and demands a lot of work (Appendix A); and ii) generating validation data demands work hours of experienced medical doctors.

The Telehealth Center at the Hospital das Clínicas of the Federal University of Minas Gerais receives and assesses more than 2,000 digital ECGs per day. With progressive improvements in the interface with the medical experts, the quality of this data should progressively increase, and it could be used in training, validating and testing future models. The Telehealth Center currently serves more than 1,000 remote locations in 5 Brazilian states and has the means to deploy and evaluate such automatic classification systems as part of broader telehealth solutions. This could help to improve its capacity, making it possible to provide access to a broader population, with better quality reports.
7 Conclusion

These promising initial results point to end-to-end learning as a competitive alternative to classical automatic ECG classification methods. The development of such technologies may yield high-accuracy automatic ECG classification systems that could save clinicians considerable time and prevent wrong diagnoses. Millions of 12-lead ECGs are performed every year, often in places where there is a shortage of qualified medical doctors to interpret them. An accurate classification system could help detect wrong diagnoses and improve the access of patients from deprived and remote locations to this essential diagnostic tool for cardiovascular diseases.

Acknowledgments

This research was partly supported by the Brazilian Research Agencies CNPq, CAPES, and FAPEMIG, by projects InWeb, MASWeb, EUBra-BIGSEA, INCT-Cyber and Atmosphere, and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by Knut and Alice Wallenberg Foundation. We also thank NVIDIA for awarding our project with a Titan V GPU.

References

[1] GBD 2016 Causes of Death Collaborators, “Global, regional, and national age-sex specific mortality for 264 causes of death, 1980-2016: A systematic analysis for the Global Burden of Disease Study 2016,” Lancet (London, England), vol. 390, pp. 1151–1210, Sept. 2017.
[2] J. L. Willems, C. Abreu-Lima, P. Arnaud, J. H. van Bemmel, C. Brohet, R. Degani, B. Denis, I. Graham, G. van Herpen, and P. W. Macfarlane, “Testing the performance of ECG computer programs: The CSE diagnostic pilot study,” Journal of Electrocardiology, vol. 20 Suppl, pp. 73–77, Oct. 1987.
[3] J. L. Willems, C. Abreu-Lima, P. Arnaud, J. H. van Bemmel, C. Brohet, R. Degani, B. Denis, J. Gehring, I. Graham, and G. van Herpen, “The diagnostic performance of computer programs for the interpretation of electrocardiograms,” The New England Journal of Medicine, vol. 325, pp. 1767–1773, Dec. 1991.
[4] A. P. Shah and S. A. Rubin, “Errors in the computerized electrocardiogram interpretation of cardiac rhythm,” Journal of Electrocardiology, vol. 40, no. 5, pp. 385–390, Sep-Oct 2007.
[5] N. A. M. Estes, “Computerized interpretation of ECGs: Supplement not a substitute,” Circulation. Arrhythmia and Electrophysiology, vol. 6, pp. 2–4, Feb. 2013.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, vol. 29, pp. 82–97, Nov. 2012.
[8] W. W. Stead, “Clinical implications and challenges of artificial intelligence and deep learning,” JAMA, vol. 320, pp. 1107–1108, Sept. 2018.
[9] C. Naylor, “On the prospects for a (deep) learning health care system,” JAMA, vol. 320, pp. 1099–1100, Sept. 2018.
[10] G. Hinton, “Deep learning—a technology with the potential to transform health care,” JAMA, vol. 320, pp. 1101–1102, Sept. 2018.
[11] B. E. Bejnordi, M. Veta, P. Johannes van Diest, B. van Ginneken, N. Karssemeijer, G. Litjens, J. A. W. M. van der Laak, and the CAMELYON16 Consortium, M. Hermsen, Q. F. Manson, M. Balkenhol, O. Geessink, N. Stathonikos, M. C. van Dijk, P. Bult, F. Beca, A. H. Beck, D. Wang, A. Khosla, R. Gargeya, H. Irshad, A. Zhong, Q. Dou, Q. Li, H. Chen, H.-J. Lin, P.-A. Heng, C. Haß, E. Bruni, Q. Wong, U. Halici, M. U. Öner, R. Cetin-Atalay, M. Berseth, V. Khvatkov, A. Vylegzhanin, O. Kraus, M. Shaban, N. Rajpoot, R. Awan, K. Sirinukunwattana, T. Qaiser, Y.-W. Tsang, D. Tellez, J. Annuscheit, P. Hufnagl, M. Valkonen, K. Kartasalo, L. Latonen, P. Ruusuvuori, K. Liimatainen, S. Albarqouni, B. Mungal, A. George, S. Demirci, N. Navab, S. Watanabe, S. Seno, Y. Takenaka, H. Matsuda, H. Ahmady Phoulady, V. Kovalev, A. Kalinovsky, V. Liauchuk, G. Bueno, M. M. Fernandez-Carrobles, I. Serrano, O. Deniz, D. Racoceanu, and R. Venâncio, “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer,” JAMA, vol. 318, p. 2199, Dec. 2017.
[12] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O'Donoghue, D. Visentin, G. van den Driessche, B. Lakshminarayanan, C. Meyer, F. Mackinder, S. Bouton, K. Ayoub, R. Chopra, D. King, A. Karthikesalingam, C. O. Hughes, R. Raine, J. Hughes, D. A. Sim, C. Egan, A. Tufail, H. Montgomery, D. Hassabis, G. Rees, T. Back, P. T. Khaw, M. Suleyman, J. Cornebise, P. A. Keane, and O. Ronneberger, “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature Medicine, vol. 24, pp. 1342–1350, Sept. 2018.
[13] E. J. Beck, W. Gill, and P. R. De Lay, “Protecting the confidentiality and security of personal health information in low- and middle-income countries in the era of SDGs and Big Data,” Global Health Action, vol. 9, p. 32089, 2016.
[14] P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng, “Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks,” July 2017.
[15] S. P. Shashikumar, A. J. Shah, G. D. Clifford, and S. Nemati, “Detection of Paroxysmal Atrial Fibrillation Using Attention-based Bidirectional Recurrent Neural Networks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, (New York, NY, USA), pp. 715–723, ACM, 2018.
[16] R. Sassi, R. R. Bond, A. Cairns, D. D. Finlay, D. Guldenring, G. Libretti, L. Isola, M. Vaglio, R. Poeta, M. Campana, C. Cuccia, and F. Badilini, “PDF-ECG in clinical practice: A model for long-term preservation of digital 12-lead ECG data,” Journal of Electrocardiology, vol. 50, no. 6, pp. 776–780, Nov-Dec 2017.
[17] A. Lyon, A. Mincholé, J. P. Martínez, P. Laguna, and B. Rodriguez, “Computational techniques for ECG analysis and interpretation in light of their contribution to medical advances,” Journal of the Royal Society Interface, vol. 15, Jan. 2018.
[18] J. Mant, D. A. Fitzmaurice, F. D. R. Hobbs, S. Jowett, E. T. Murray, R. Holder, M. Davies, and G. Y. H. Lip, “Accuracy of diagnosing atrial fibrillation on electrocardiogram by primary care practitioners and interpretative diagnostic software: Analysis of data from screening for atrial fibrillation in the elderly (SAFE) trial,” BMJ (Clinical research ed.), vol. 335, p. 380, Aug. 2007.
[19] G. Veronese, F. Germini, S. Ingrassia, O. Cutuli, V. Donati, L. Bonacchini, M. Marcucci, A. Fabbri, and Italian Society of Emergency Medicine (SIMEU), “Emergency physician accuracy in interpreting electrocardiograms with potential ST-segment elevation myocardial infarction: Is it enough?,” Acute Cardiac Care, vol. 18, pp. 7–10, Mar. 2016.
[20] World Health Organization, Global Status Report on Noncommunicable Diseases 2014: Attaining the Nine Global Noncommunicable Diseases Targets; a Shared Responsibility. Geneva: World Health Organization, 2014. OCLC: 907517003.
[21] P. W. Macfarlane, B. Devine, and E. Clark, “The University of Glasgow (Uni-G) ECG analysis program,” in Computers in Cardiology, 2005, pp. 451–454, 2005.
[22] S. H. Jambukia, V. K. Dabhi, and H. B. Prajapati, “Classification of ECG signals using machine learning techniques: A survey,” in 2015 International Conference on Advances in Computer Engineering and Applications, (Ghaziabad, India), pp. 714–721, IEEE, Mar. 2015.
[23] M. A. Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani, and R. Yager, “Deep learning approach for active classification of electrocardiogram signals,” Information Sciences, vol. 345, pp. 340–354, June 2016.
[24] J. Rubin, S. Parvaneh, A. Rahman, B. Conroy, and S. Babaeizadeh, “Densely Connected Convolutional Networks and Signal Quality Analysis to Detect Atrial Fibrillation Using Short Single-Lead ECG Recordings,” Oct. 2017.
[25] U. R. Acharya, H. Fujita, S. L. Oh, Y. Hagiwara, J. H. Tan, and M. Adam, “Application of deep convolutional neural network for automated detection of myocardial infarction using ECG signals,” Information Sciences, vol. 415-416, pp. 190–198, Nov. 2017.
[26] T. Teijeiro, C. A. Garcia, D. Castro, and P. Félix, “Arrhythmia Classification from the Abductive Interpretation of Short Single-Lead ECG Records,” in Computing in Cardiology, 2017, Sept. 2017.
[27] C. D. Cantwell, Y. Mohamied, K. N. Tzortzis, S. Garasto, C. Houston, R. A. Chowdhury, F. S. Ng, A. A. Bharath, and N. S. Peters, “Rethinking multiscale cardiac electrophysiology with machine learning and predictive modelling,” Oct. 2018.
[28] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “PhysioBank, PhysioToolkit, and PhysioNet,” Circulation, June 2000.
[29] G. D. Clifford, C. Liu, B. Moody, L.-w. H. Lehman, I. Silva, Q. Li, A. E. Johnson, and R. G. Mark, “AF Classification from a Short Single Lead ECG Recording: The PhysioNet/Computing in Cardiology Challenge 2017,” Computing in Cardiology, vol. 44, Sept. 2017.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385, Dec. 2015.
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Identity Mappings in Deep Residual Networks,” arXiv:1603.05027, Mar. 2016.
[32] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” Feb. 2015.
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[34] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Dec. 2014.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

A Training data preprocessing

In this appendix, we detail the preprocessing of the data used for training and validating the model. The exams were analyzed by doctors during routine workflow and are subject to medical errors. Moreover, there might be errors associated with the semi-supervised methodology used to extract the diagnoses. Hence, we combine the expert annotation with well established automatic classifiers to improve the quality of the dataset.

Given i) the exams in the database; ii) the diagnoses given by the Glasgow and Minnesota automatic classifiers (automatic diagnosis); and iii) the diagnoses extracted from the expert free text associated with the exams using the unsupervised methodology (medical diagnosis), the following procedure is used for obtaining the ground truth annotation:

1. We:
   (a) Accept a diagnosis (consider an abnormality to be present) if both the expert and either the Glasgow or the Minnesota automatic classifier indicated the same abnormality.
   (b) Reject a diagnosis (consider an abnormality to be absent) if only one classifier indicates the abnormality, in disagreement with both the doctor and the other automatic classifier.
   After this initial step, there are two scenarios where we still need to accept or reject diagnoses.
   They are: i) both classifiers indicate the abnormality but the expert doesn't; or ii) only the expert indicates the abnormality but no classifier does.
2. We used some rules to reject some of the remaining diagnoses:
   (a) Diagnoses of ST where the heart rate was below 100 (8376 medical diagnoses and 2 automatic diagnoses) were rejected.
   (b) Diagnoses of SB where the heart rate was above 50 (7361 medical diagnoses and 16427 automatic diagnoses) were rejected.
   (c) Diagnoses of LBBB or RBBB where the duration of the QRS interval was below 115 ms (9313 medical diagnoses for RBBB and 8260 for LBBB) were rejected.
   (d) Diagnoses of 1dAVb where the duration of the PR interval was below 190 ms (3987 automatic diagnoses) were rejected.
3. Then, using the sensitivity analysis of 100 manually reviewed exams per abnormality, we came up with the following rules to accept some of the remaining diagnoses:
   (a) For RBBB, 1dAVb, SB and ST we accepted all medical diagnoses. 26033, 13645, 12200 and 14604 diagnoses were accepted in this fashion, respectively.
   (b) For AF, we required not only that the exam was classified by the doctors as true but also that the standard deviation of NN intervals was higher than 646. 14604 diagnoses were accepted using this rule.
   According to the sensitivity analysis, the number of false positives that would be introduced by this procedure was smaller than 3% of the total number of exams.
4. After this process, we were still left with approximately 34000 exams whose diagnoses had not been accepted or rejected. These were manually reviewed by medical students using the Telehealth ECG diagnostic system. The process of manually reviewing these 34000 ECGs took several months.

B ECG abnormalities

Table 2: Prevalence of each abnormality in the train/validation set and in the test set. It contains both the percentage and the absolute number of patients (in parentheses).

Abbrev.   Description                  Prevalence (Train+Val)   Prevalence (Test)
1dAVb     1st degree AV block          1.5% (36,324)            3.5% (33)
RBBB      Right bundle branch block    2.6% (64,319)            3.8% (36)
LBBB      Left bundle branch block     1.5% (37,326)            3.5% (33)
SB        Sinus bradycardia            1.6% (38,837)            2.3% (22)
AF        Atrial fibrillation          1.7% (42,133)            1.4% (13)
ST        Sinus tachycardia            2.3% (56,186)            4.4% (42)

Figure 2: A list of all the abnormalities the model classifies. We show only 3 representative leads (DII, V1 and V6).

C Additional experiments

In Figure 3 we show the precision-recall curve for our model. This is a useful graphical representation to assess the success of a prediction model when, as in our case, the classes are imbalanced. The thresholds we used to generate Table 1 were chosen to lie approximately at the inflection point of these curves. For these same thresholds, Table 3 shows the neural network confusion matrix for each of the classes.

[Figure 3 panels, each plotting precision (vertical axis) against recall (horizontal axis), both from 0.0 to 1.0: (a) 1dAVb (average precision = 0.91); (b) RBBB (average precision = 0.94); (c) LBBB (average precision = 1.00); (d) SB (average precision = 0.87); (e) AF (average precision = 0.90); (f) ST (average precision = 0.96).]

Figure 3: Precision-recall curve for our prediction model in the test set with regard to each ECG abnormality. The average precision (which is approximated by the area under the precision-recall curve) is displayed in the captions.
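The text does not state how the average precision was computed. The sketch below uses one common step-wise definition, the sum over ranked examples of the recall increment times the precision at that rank, which approximates the area under the precision-recall curve. The function name and the toy labels and scores are hypothetical, and the sketch assumes there are no tied scores.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise average precision: sum of (R_n - R_{n-1}) * P_n over ranks.

    Assumes binary labels (0 or 1) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")

    # Rank examples from the highest to the lowest model output.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)

    true_positives = 0
    prev_recall = 0.0
    ap = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / n_pos
        # Only ranks that recover a new positive change the recall.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy example (made-up labels and scores, not data from the paper).
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"average precision = {toy_ap:.3f}")
```

A classifier that ranks every positive exam above every negative one obtains an average precision of 1 under this definition.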
                Predicted class
Actual class    1dAVb    Not 1dAVb
1dAVb           24       9
Not 1dAVb       2        918

                Predicted class
Actual class    RBBB     Not RBBB
RBBB            36       0
Not RBBB        5        912

                Predicted class
Actual class    LBBB     Not LBBB
LBBB            33       0
Not LBBB        1        919

                Predicted class
Actual class    SB       Not SB
SB              19       3
Not SB          5        926

                Predicted class
Actual class    AF       Not AF
AF              11       2
Not AF          2        938

                Predicted class
Actual class    ST       Not ST
ST              40       2
Not ST          6        905

Table 3: Confusion matrices for the neural network.
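As a consistency check, the model columns of Table 1 can be recomputed from the counts in Table 3. The sketch below does this with the standard definitions of precision, recall, specificity and F1; the names `Counts`, `TABLE_3` and `metrics` are introduced here for illustration only.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    tp: int  # actual positive, predicted positive
    fn: int  # actual positive, predicted negative
    fp: int  # actual negative, predicted positive
    tn: int  # actual negative, predicted negative


# Counts transcribed from Table 3 (rows: actual class, columns: predicted class).
TABLE_3: Dict[str, Counts] = {
    "1dAVb": Counts(tp=24, fn=9, fp=2, tn=918),
    "RBBB": Counts(tp=36, fn=0, fp=5, tn=912),
    "LBBB": Counts(tp=33, fn=0, fp=1, tn=919),
    "SB": Counts(tp=19, fn=3, fp=5, tn=926),
    "AF": Counts(tp=11, fn=2, fp=2, tn=938),
    "ST": Counts(tp=40, fn=2, fp=6, tn=905),
}


def metrics(c: Counts) -> Dict[str, float]:
    """Standard binary classification measures for one abnormality."""
    return {
        "precision": c.tp / (c.tp + c.fp),
        "recall": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


if __name__ == "__main__":
    for name, counts in TABLE_3.items():
        values = metrics(counts)
        row = "  ".join(f"{key}={value:.3f}" for key, value in values.items())
        print(f"{name:6s} {row}")
```

Each confusion matrix sums to 953, the number of tracings in the test set, and the recomputed values agree with the model columns of Table 1 to within rounding.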
