Short-segment heart sound classification using an ensemble of deep convolutional neural networks

Fuad Noman*, Chee-Ming Ting†, Sh-Hussain Salleh*, and Hernando Ombao‡

Abstract

This paper proposes a framework based on deep convolutional neural networks (CNNs) for automatic heart sound classification using short segments of individual heart beats. We design a 1D-CNN that directly learns features from raw heart-sound signals, and a 2D-CNN that takes inputs of two-dimensional time-frequency feature maps based on Mel-frequency cepstral coefficients (MFCC). We further develop a time-frequency CNN ensemble (TF-ECNN) combining the 1D-CNN and 2D-CNN based on score-level fusion of the class probabilities. On the large PhysioNet CinC challenge 2016 database, the proposed CNN models outperformed traditional classifiers based on support vector machines and hidden Markov models with various hand-crafted time- and frequency-domain features. The best classification scores, 89.22% accuracy and 89.94% sensitivity, were achieved by the ECNN, and 91.55% specificity and 88.82% modified accuracy by the 2D-CNN alone on the test set.

Keywords: Heart sound classification, convolutional neural network, ensemble classifiers.

1 Introduction

Cardiac auscultation based on heart sound recordings, or phonocardiogram (PCG), remains a primary screening tool for diverse heart pathologies. Various algorithms have been developed aiming at accurate automated classification of normal and abnormal PCGs [1]. However, the classification accuracy is still far from being reliable for diagnostics in clinical or non-clinical settings. One major challenge is to extract robust and discriminative features from raw PCG recordings, which are typically corrupted by various noise sources. Different time-frequency and statistical features have been employed in automatic heart sound classification.
Heart-rate variability is the most widely used feature, which however can only be extracted from long recordings containing many cardiac cycles. Here we consider the challenges in obtaining high PCG classification accuracy for single individual cardiac cycles. Recent developments in deep learning (DL) techniques have seen remarkable success in many practical classification tasks, sometimes surpassing human-level performance [2]. This is owing to an inherent mechanism integrating both feature extractor and classifier, which permits learning of complex data representations with hierarchical levels of semantic abstraction via multiple stacked hidden layers, and hence robust and accurate pattern classification even from raw data or primitive features. It offers substantial gains in accuracy over traditional linear and kernel methods with shallow architectures. One popular DL architecture is the convolutional neural network (CNN), which alternately stacks a convolutional layer, extracting feature maps through sparse localized kernels with weight sharing, and a sub-sampling or pooling layer, acquiring invariance to local translation. CNNs have achieved state-of-the-art performance in diverse challenging image recognition tasks [3–5]. Applications of DL to cardiac signals have been introduced only very recently [6–8].

* School of Biomedical Engineering & Health Sciences, Universiti Teknologi Malaysia, Malaysia (e-mail: mnfuad3@live.utm.my; hussain@fke.utm.my)
† School of Biomedical Engineering & Health Sciences, Universiti Teknologi Malaysia, Malaysia, and also the Statistics Program, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia (e-mail: cmting@utm.my)
‡ Statistics Program, King Abdullah University of Science and Technology, Saudi Arabia (e-mail: hernando.ombao@kaust.edu.sa)
CNNs have been used for normal/abnormal PCG classification using input features such as spectrograms and Mel-frequency cepstral coefficients (MFCCs) in [9] on 5-second windowed segments, and MFCC heatmaps of 3-second segments in [10]. Tschannen et al. [11] combined a wavelet-based deep CNN feature extractor with a support vector machine (SVM) for heart-sound classification. Zhang et al. [12] proposed a segmental CNN model to detect cardiac abnormality, with two different designs to adjust the configuration of the convolutional-layer filters. A DL architecture was implemented on a field-programmable gate array (FPGA) for real-time heart-sound classification using inputs based on gray sonogram images transformed from PCG segments [13].

In this paper, we propose a deep CNN for classification of pathology in the PCG of a single heart beat. We design a new architecture called the time-frequency ensemble CNN (TF-ECNN), which combines a 1D-CNN and a 2D-CNN using, respectively, the time-domain raw PCG signals and MFCC time-frequency representations as inputs. Our method was evaluated on the PhysioNet Computing in Cardiology (CinC) 2016 challenge database [14], the largest heart sound database available so far. The aim is to classify the heart sound signal from a short segment (a single cardiac cycle, i.e., one heartbeat) into normal and abnormal classes. We also investigated the performance of the proposed CNN models combined with different combinations of input features, and compared them with traditional classifiers, i.e., the support vector machine (SVM), an ensemble of decision trees, and the hidden Markov model (HMM). Hyperparameter tuning was carried out using Bayesian optimization to find optimal values of the model parameters [15] for all competing classifiers except the HMM, where the expectation-maximization algorithm was used to estimate the model parameters.
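For reference, the Bayesian optimization used for this tuning typically selects the next hyperparameter setting by maximizing the standard expected-improvement acquisition over a Gaussian-process surrogate (this is the generic formula, not a quantity specific to this paper):

\[
\mathrm{EI}(x) = \bigl(f_{\min} - \mu(x)\bigr)\,\Phi(z) + \sigma(x)\,\phi(z),
\qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)},
\]

where \(\mu(x)\) and \(\sigma(x)\) are the posterior mean and standard deviation of the surrogate at candidate \(x\), \(f_{\min}\) is the best loss observed so far, and \(\Phi\) and \(\phi\) are the standard normal CDF and PDF.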
2 Methods

In this section, we describe the main building blocks of the heart sound classification algorithm, consisting of preprocessing, segmentation, feature extraction and classification. For classification, we propose an ensemble of two deep CNNs that combines time-domain and frequency-domain input features, and consider three traditional approaches as baselines.

2.1 Database

We used heart sound recordings obtained from the PhysioNet CinC challenge 2016 database, publicly available on the PhysioNet website [16]. The dataset consists of 3153 recordings collected from healthy and pathological subjects. Recordings labeled as 'unsure' by the cardiologists regarding the normal or abnormal categories were not used, leaving a total of 2872 recordings for training and evaluation in this work.

2.2 Preprocessing

All the heart sound recordings were down-sampled to 1000 Hz and band-pass-filtered with a Butterworth filter between 25 Hz and 400 Hz to eliminate unwanted low-frequency artifacts (e.g., baseline drift) and high-frequency noise (e.g., background noise). The signals were then standardized by subtracting the mean and dividing by the standard deviation before feature extraction.

2.3 Segmentation

The whole heart sound recordings were segmented into short intervals of a single beat and then classified into normal and abnormal categories. In this work we used the heart sound annotations provided with the database for segmentation of each recording into heartbeats (from the beginning of atrial activity to the end of ventricular activity). Note that other data-driven unsupervised algorithms, such as Viterbi alignment, can also be used to perform such segmentation. A total of 81503 segments were extracted from the whole database, which were then partitioned into subject-oriented train and test datasets with a balanced number of samples, as shown in Table 1.

Table 1: Distribution of the train and test sets of the PhysioNet CinC challenge 2016 database.

                      Train                  Test
                normal  abnormal      normal  abnormal
  Recordings      1150       284        1150       288
  Heartbeats     32574      8170       32582      8177

2.4 Baseline Classifiers

We consider three baseline classifiers for comparison, namely, (1) SVM with a radial basis function kernel, (2) an ensemble of decision trees, and (3) HMM.

SVM and decision tree ensemble. Following [8], a total of 58 features were extracted from each heartbeat for the SVM and ensemble-of-trees methods. These include 22 time-domain features (durations, skewness, kurtosis and sum of instantaneous amplitudes for each of the four heart sound states (S1, systole, S2 and diastole)) plus 36 frequency-domain features (median power spectrum over 9 frequency bands for each heart sound state). We further performed feature selection using the well-known neighborhood component analysis (NCA) [17], selecting a total of 28 features (16 time-domain and 12 frequency-domain). We carried out 5-fold cross-validation to optimize and tune the hyperparameters of the SVM and tree-ensemble classifiers. A Bayesian optimization approach was used to minimize the loss function and select the set of hyperparameters that produces the best classification results. We also applied class weights when computing the classification accuracy to accommodate possible misclassification of the normal class, since the database is slightly imbalanced.

HMM. Continuous HMMs with Gaussian mixture densities were used for modeling the temporal structure in PCG. We extracted a set of features as in [18]. A sequence of 12 × 1 short-time Mel-frequency cepstral coefficients (MFCCs) was computed over consecutive windowed frames for each heartbeat to obtain a two-dimensional 12 × T time-frequency representation, with T the total number of feature vectors.
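The construction of such a 12 × T MFCC map can be sketched with numpy/scipy as below; note that the frame length (25 ms), hop (10 ms), filterbank size and FFT length are illustrative choices, since the paper does not specify them.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_map(signal, fs=1000, n_mfcc=12, frame_ms=25, hop_ms=10,
             n_filt=20, nfft=256):
    """Return an (n_mfcc x T) map of short-time MFCCs for one segment."""
    flen, fhop = fs * frame_ms // 1000, fs * hop_ms // 1000
    n_frames = 1 + (len(signal) - flen) // fhop
    # Hamming-windowed frames -> power spectrum (zero-padded to nfft)
    frames = np.stack([signal[t * fhop:t * fhop + flen] * np.hamming(flen)
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # Log filterbank energies -> DCT; keep the first n_mfcc coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_mfcc].T
```

For a duration-normalized segment of 1000 samples at 1000 Hz, these settings yield T = 98 frames, i.e., a 12 × 98 feature map per heartbeat.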
A 4-state HMM with left-to-right topology was employed to model the time evolution of the four distinct heart sound components in a single heartbeat. A mixture of 16 Gaussians was used as the observation model in each state. We found no practical improvement in classification accuracy for this data with a larger number of Gaussian components. The HMMs were trained using the Baum-Welch algorithm, based on expectation-maximization, to find the maximum likelihood estimates of the model parameters [19]. The Viterbi algorithm was used to align the MFCC frames to each of the four cardiac states and to compute the likelihood score of a test example, which was then classified to the HMM with the highest likelihood.

2.5 Proposed Ensemble CNN

Fig. 1 shows the architecture of the proposed time-frequency based ensemble deep CNN (TF-ECNN) model, combining two distinct CNNs to capture the temporal structure in both the time domain and the frequency domain. The first CNN (1D-CNN) accepts one-dimensional PCG time series data as input (i.e., the raw heartbeat signal). The second CNN (2D-CNN) uses two-dimensional time-frequency feature maps of MFCCs and time-varying autoregressive (TV-AR) coefficients as input. For both the 1D-CNN and 2D-CNN, we used the same network architecture, consisting of convolutional, activation, pooling and fully-connected (or dense) layers, but with different sets of hyperparameters.

2.5.1 Feature Extraction

The 1D-CNN was designed to classify the raw heart sound from fixed-length segments. However, the heartbeat segments usually have variable lengths. Therefore, two approaches were used to normalize the segment durations.
First, an anti-aliasing linear interpolation method was used to normalize the heartbeats to a reference duration (i.e., 1000 samples). Second, segments longer than 1200 samples were discarded (1.2% of all segments), and the remaining segments were zero-padded to 1200 samples.

Figure 1: Architecture of the proposed TF-ECNN model combining a 1D-CNN and a 2D-CNN, taking as inputs raw signals and time-frequency feature maps, respectively. BN: batch-normalization layer. ReLU: rectified linear unit activation function.

Table 2: Summary of the 1D-CNN model configuration.

  Layer  Type         Output shape  Kernel size  Strides
  1      Convolution  1000 x 8      6            1
  2      Batch-Norm   1000 x 8      -            -
  3      MaxPooling   500 x 8       2            2
  4      Convolution  500 x 8       6            1
  5      MaxPooling   250 x 8       2            2
  6      Convolution  250 x 8       6            1
  7      MaxPooling   125 x 8       2            2
  8      Flatten      1000          -            -
  9      Dense        512           -            -
  10     SoftMax      2             -            -

For the 2D-CNN, we consider two feature extraction approaches to obtain the two-dimensional time-frequency feature maps. First, as for the HMM classifier, we computed frames of short-time MFCC features on the duration-normalized PCG segments, producing feature maps of the same size to represent each heart beat.
Second, we computed the autoregressive coefficients of a 12th-order TVAR model, commonly known as short-time linear predictive coefficients (LPCs), to construct an alternative feature map for each segment.

2.5.2 Network Architecture & Training

Table 2 and Table 3 respectively summarize the architectures of the proposed 1D-CNN and 2D-CNN models. The experiments were carried out using the TensorFlow platform [20] with the Scikit-Optimize library, which provides Bayesian optimization of the hyperparameters. We used the expected-improvement method (with 100 iterations) to tune the CNN parameters, including the learning rate, number of convolution layers, number of filters, kernel size, activation method, number of dense layers, number of nodes in the dense layers, and dropout ratio of the dense layers. We selected a fixed dropout ratio of 0.4 and 0.5 for the convolution layers of the 1D-CNN and 2D-CNN, respectively. All convolution layers used zero-padding to preserve the input dimension. A batch-normalization layer was attached to the first convolutional layer to allow the model to learn different variations of the data, which can give better robustness to the noise typically present in real heart sound recordings. For the other convolutional layers, we added a dropout layer as a regularization method to prevent model overfitting.

Table 3: Summary of the 2D-CNN model configuration.

  Layer  Type         Output shape   Kernel size  Strides
  1      Convolution  96 x 12 x 16   4            1
  2      Batch-Norm   96 x 12 x 16   -            -
  3      MaxPooling   48 x 6 x 16    2            2
  4      Convolution  48 x 6 x 16    4            1
  5      MaxPooling   24 x 3 x 16    2            2
  6      Convolution  24 x 3 x 16    4            1
  7      MaxPooling   12 x 1 x 16    2            2
  8      Flatten      192            -            -
  9      Dense        256            -            -
  10     SoftMax      2              -            -

Of note, additional experiments showed that using zero-padded input segments for the 1D-CNN did not perform as well as using the duration-normalized segments with the same CNN architecture. In that configuration, the dropout ratio of the convolution layers was set to 0.8. The Bayesian optimization procedure suggested a 2D-CNN architecture with a similar number of convolution layers to the 1D-CNN but a slightly different number of dense layers. Therefore, we manually tuned the 2D-CNN architecture to match that of the 1D-CNN, which performed comparably with the Bayesian-optimized model. The learning rates set by the optimizer for the 1D-CNN and 2D-CNN were 0.001031 and 0.000496, respectively, with a batch size of 128. The Adam optimizer was used for the weight updates in the backpropagation training stage. In the TF-ECNN, we combine the 1D-CNN and 2D-CNN optimized above via score-level fusion, summing the outputs of the softmax layers of the two individual CNNs to produce fused class prediction probabilities.

Table 4: Performance comparison of different classifiers on the test set. The numbers in parentheses correspond to the classifier performance before applying the weighted cost for imbalanced classes.

  Classifier  Features               Accuracy (%)   Sensitivity (%)  Specificity (%)  MAcc (%)
  SVM         Time & Freq            84.87 (85.09)  85.82 (94.09)    81.09 (48.95)    83.46 (71.52)
  Ensemble    Time & Freq            86.20 (86.23)  90.55 (94.25)    68.84 (54.26)    79.70 (74.26)
  HMM         MFCC                   87.07 (n/a)    85.97 (n/a)      91.45 (n/a)      88.71 (n/a)
  1D-CNN      Raw (zero-pad)         86.34 (85.63)  87.80 (95.11)    80.32 (46.41)    84.06 (70.76)
  1D-CNN      Raw (norm-dur)         87.23 (87.52)  87.57 (91.51)    85.84 (71.64)    86.71 (81.58)
  2D-CNN      TVAR                   86.41 (86.91)  88.85 (91.79)    76.69 (67.45)    82.77 (79.62)
  2D-CNN      MFCC                   87.18 (89.30)  86.08 (92.49)    91.55 (76.61)    88.82 (84.55)
  ECNN        Raw (norm-dur)+MFCC    89.22 (89.58)  89.94 (93.07)    86.35 (75.68)    88.15 (84.37)
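The score-level fusion step of the TF-ECNN reduces to a few lines of numpy; `ecnn_predict` is a hypothetical name, and the logit arrays below stand in for the pre-softmax outputs of the two trained networks.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ecnn_predict(logits_1d, logits_2d):
    """Score-level fusion: sum the class probabilities produced by the
    1D-CNN and 2D-CNN softmax layers, then take the class with the larger
    fused score. Class 0 = normal, class 1 = abnormal (a convention here)."""
    fused = softmax(logits_1d) + softmax(logits_2d)
    return fused.argmax(axis=1)
```

Because the two probability vectors are simply summed, a confident prediction from one network can override an uncertain one from the other without any additional fusion weights to train.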
3 Experimental Results

We evaluate the classification performance of the 1D-CNN and 2D-CNN individually, as well as the TF-ECNN, as measured by sensitivity, specificity and modified accuracy (MAcc). The MAcc is the average of the sensitivity and specificity scores. Table 4 shows the results of the different classifiers and feature sets on the local hidden test set. Numbers in parentheses indicate the performance of models trained without using the weighted cost to control the imbalance among the classes. They show that, without this correction, no classifier performs well, with a significant tradeoff between sensitivity and specificity. This is due to the imbalanced classes and limited abnormal data, which lead to a high misclassification of abnormal segments, as clearly indicated by the specificity scores. After correction by applying class weights to limit the misclassification of the abnormal class, the performance of all classifiers increases, except for the ensemble of trees (still with low specificity and MAcc below 80%, but high sensitivity).

The proposed CNN models generally outperform the baseline classifiers considerably on most of the performance measures. In particular, the 2D-CNN with MFCCs achieved the best performance in specificity and MAcc, and the TF-ECNN gives the highest accuracy and the second-highest sensitivity. The HMM follows, performing best among the traditional classifiers, possibly due to the capability of the Markov chain in modeling the temporal structure of the four heart-sound states, which is neglected by the SVM and even the CNNs. The performance of the ensemble of trees is not well balanced, scoring the highest sensitivity but the lowest specificity and MAcc. It is interesting to note that the 1D-CNN using only raw data as input shows satisfactory performance compared to using the computationally expensive feature extraction methods (i.e., MFCC and TVAR) in the 2D-CNN.
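As a reminder of how the scores in Table 4 relate, sensitivity and specificity are the per-class recalls and MAcc is their mean; the confusion counts in the usage below are purely illustrative, not taken from the paper.

```python
def scores(tp, fn, tn, fp):
    """Sensitivity, specificity and modified accuracy (MAcc) from confusion
    counts; which class counts as positive follows the paper's convention."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    macc = (sensitivity + specificity) / 2.0
    return sensitivity, specificity, macc

# e.g., scores(90, 10, 80, 20) gives sensitivity 0.90, specificity 0.80,
# and hence MAcc 0.85
```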
The 1D-CNN with duration-normalized raw PCG is only 2% below the best MAcc score, obtained by the 2D-CNN with MFCC. This may suggest the advantage of the multiple hidden layers in CNNs, which can learn hierarchical time-frequency features directly from the raw PCG signal. For the 2D-CNN with two-dimensional feature maps, the better time-frequency representation of the acoustic PCG signals by the MFCCs improves the classification performance over the TVAR. The ECNN combining both raw and MFCC features offers gains in sensitivity over the 2D-CNN using MFCC alone, which however performs better in specificity, suggesting an advantage of the ECNN in detecting normal heart sounds and of the 2D-CNN for abnormal heart sounds.

4 Conclusion

We developed an ensemble of deep CNNs to classify normal and abnormal heart sounds based on short-segment recordings of individual heart beats, with promising performance. The novel network architecture combines a 1D-CNN and a 2D-CNN designed, respectively, to learn multiple levels of representation from the time-domain raw signals and from time-frequency features. Evaluation on the large PhysioNet CinC challenge 2016 database demonstrates the advantages of our proposed CNN models, with considerable improvement in classification performance over strong state-of-the-art baseline classifiers and feature sets. This suggests the potential of deep learning approaches for accurate heart-sound classification. Future work will consider the use of sequential DL models, such as recurrent neural networks (RNNs) or long short-term memory (LSTM) RNNs [21], which could better capture the temporal dependency in the time-varying spectrum of PCG signals.

References

[1] G. D. Clifford, et al., "Recent advances in heart sound analysis," Physiol. Meas., vol. 38, pp. E10, 2017.
[2] S. Dodge and L. Karam, "A study and comparison of human and deep learning recognition performance under visual distortions," in 2017 26th International Conference on Computer Communication and Networks (ICCCN). IEEE, 2017, pp. 1–7.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Int. Conf. Medical Image Computing and Comput.-Assisted Intervention. Springer, 2015, pp. 234–241.
[5] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017.
[6] Q. Zhang, D. Zhou, and X. Zeng, "HeartID: A multiresolution convolutional neural network for ECG-based biometric human identification in smart health applications," IEEE Access, vol. 5, pp. 11805–11816, 2017.
[7] U. R. Acharya, et al., "A deep convolutional neural network model to classify heartbeats," Computers in Biology and Medicine, vol. 89, pp. 389–396, 2017.
[8] C. Potes, et al., "Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds," in 2016 Comput. Cardiol. Conf., 2016, pp. 621–624.
[9] T. Nilanon, et al., "Normal/abnormal heart sound recordings classification using convolutional neural network," in Computing in Cardiology Conference (CinC), 2016. IEEE, 2016, pp. 585–588.
[10] J. Rubin, et al., "Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients," in Computing in Cardiology Conference (CinC), 2016. IEEE, 2016, pp. 813–816.
[11] M. Tschannen, et al., "Heart sound classification using deep structured features," in Computing in Cardiology Conference (CinC), 2016. IEEE, 2016, pp. 565–568.
[12] Y. Zhang, et al., "Segmental convolutional neural networks for detection of cardiac abnormality with noisy heart sound recordings," arXiv preprint arXiv:1612.01943, 2016.
[13] J. P. Dominguez-Morales, et al., "Deep neural networks for the recognition and classification of heart murmurs using neuromorphic auditory sensors," IEEE Transactions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 24–34, 2018.
[14] C. Liu, et al., "An open access database for the evaluation of heart sound algorithms," Physiol. Meas., vol. 37, no. 12, pp. 2181–2213, Dec. 2016.
[15] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," 2012.
[16] A. L. Goldberger, et al., "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
[17] J. Goldberger, et al., "Neighbourhood components analysis," in Adv. Neural Inf. Process. Syst., 2005, pp. 513–520.
[18] F. Noman, et al., "A Markov-switching model approach to heart sound segmentation and classification," arXiv preprint arXiv:1809.03395, 2018.
[19] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[20] M. Abadi, et al., "TensorFlow: A system for large-scale machine learning," in OSDI, 2016, vol. 16, pp. 265–283.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.