Time Series Kernel Similarities for Predicting Paroxysmal Atrial Fibrillation from ECGs

Filippo Maria Bianchi†∗, Lorenzo Livi§, Alberto Ferrante†, Jelena Milosevic‡, Miroslaw Malek†
† ALaRI, Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
∗ Machine Learning Group, UiT the Arctic University of Norway, Tromsø, Norway
‡ Institute of Telecommunications, TU Wien, Vienna, Austria
§ Dept. of Computer Science, University of Exeter, Exeter, UK
Email: ∗ filippo.m.bianchi@uit.no, § l.livi@exeter.ac.uk, † {alberto.ferrante,miroslaw.malek}@usi.ch, ‡ jelena.milosevic@tuwien.ac.at

Abstract—We tackle the problem of classifying Electrocardiography (ECG) signals with the aim of predicting the onset of Paroxysmal Atrial Fibrillation (PAF). Atrial fibrillation is the most common type of arrhythmia, but in many cases PAF episodes are asymptomatic. Therefore, in order to help diagnose PAF, it is important to design procedures for detecting and, more importantly, predicting PAF episodes. We propose a method for predicting PAF events whose first step consists of a feature extraction procedure that represents each ECG as a multivariate time series. Subsequently, we design a classification framework based on kernel similarities for multivariate time series, capable of handling missing data. We consider different approaches to perform classification in the original space of the multivariate time series and in an embedding space defined by the kernel similarity measure. We achieve a classification accuracy comparable with state-of-the-art methods, with the additional advantage of detecting the PAF onset up to 15 minutes in advance.

Index Terms—Atrial fibrillation; time series; kernel methods; classification; prediction; feature selection.

I. INTRODUCTION

Wearable devices are nowadays able to capture bio-signals, such as electrocardiograms (ECGs), for an extended period of time.
Data recorded by these devices is of paramount clinical importance in the assessment of numerous heart-related conditions. Among them, the prediction of Paroxysmal Atrial Fibrillation (PAF) episodes and the risk stratification of PAF-prone patients is of high importance [1]–[3]. Atrial fibrillation is the most common type of arrhythmia. Its symptoms include fatigue or decreased exercise tolerance, palpitations, dyspnea on exertion, and generalized weakness, even though in many cases PAF episodes are asymptomatic [4]. The risk of PAF progressing to permanent atrial fibrillation increases with time. The problem of detecting and predicting PAF events from ECG recordings has received attention from researchers in different fields [5]. Many approaches have been proposed to tackle these problems, including the use of entropy descriptors [6], [7], P-wave characterization [8]–[10], the number of premature atrial complexes from R-R intervals [11] (i.e., the interval between R peaks, shown in Fig. 2), heart rate variability [12], the average number of f-waves in a TQ interval [13], and hybrid complex networks / time series analysis techniques [14], [15]. In this work, we propose a new framework based on time series analysis methods to predict the onset of PAF. The overall procedure consists of different steps. First, from each ECG we extract numerical descriptors (features) characterizing the original signals. Then, we perform feature selection to extract a reduced subset of features according to their mutual dependencies. The values assumed over time by the selected features are represented as a multivariate time series (MTS), which might contain missing values. Finally, in order to predict the PAF onset, we train a classifier based on special kernel similarity measures designed for MTS with missing data [16], [17].
The classifier is trained either in the MTS input space, directly using the relationships among data defined by the kernel similarity, or in an embedding space where each MTS is represented by its similarity values with respect to the training data. To evaluate the effectiveness of the proposed framework, we consider the data from the PhysioNet PAF prediction challenge [18]–[20]. Results show performance higher than or on par with the state of the art in correctly identifying ECGs associated with PAF events. We also have the option to predict the PAF onset up to 15 minutes before the event, thus providing the ability to alert the patient of the possible occurrence of a PAF episode. Our work provides the following key contributions:
• As a methodological contribution, we propose a novel method for predicting PAF events that is based on a classifier trained either in the input space or in the embedding space induced by a kernel similarity measure. In such a space, samples are described only by their similarities with respect to the elements in the training set.
• Kernel similarities for MTS with missing data [16], [17] are a recent development in time series analysis and have never been considered so far in the context of PAF prediction, which thus represents a new field of application.
• We analyze a new type of PAF prediction problem, by studying how much time ahead PAF events can be accurately detected by considering only the earliest measurements in the ECGs.
• The proposed framework is characterized by a high degree of parallelization. This provides the opportunity to significantly speed up the computation if the procedure is implemented on highly scalable systems, such as embedded devices with parallel computing support.
The remaining part of the paper is organized as follows: Section II describes the dataset of the PAF prediction challenge that we have used to develop our method; Section III describes the details of our framework; and Section IV presents the results that we have obtained.

II. DATASET DESCRIPTION

We consider the ECG data from the PhysioNet PAF prediction challenge [18]–[20]. Half of the records in the database are acquired from patients prone to PAF (either immediately before fibrillation or some time after a fibrillation occurred), the other half from healthy subjects. No ECG of patients with an ongoing PAF episode is considered. The database provides a training set, consisting of 50 records of PAF patients and 50 records of healthy subjects, and a test set containing 50 records. Excerpts are 30 minutes long, concurrently acquired from two sensors. The PAF prediction challenge [18] aims at predicting the onset of PAF events before their occurrence. Therefore, the task we tackle here is not a simple detection problem. The dataset under analysis consists of 106 recordings organized as follows. For the test set of the first PAF event (PAF detection), there are 22 recordings classified as "control" and 28 as "PAF-prone". For generating the test set of the PAF prediction challenge (the one considered in this paper), only the 28 recordings of PAF-prone patients are considered. These 28 PAF-prone recordings are further divided into two classes: the "close-to-onset" and the "far-from-onset" class. The former represents records preceding a PAF event; the latter represents records that do not precede a PAF event. Therefore, for generating the test set of the PAF prediction challenge, we end up with 56 recordings equally distributed between the two classes.
The training set is constructed by again considering pairs of recordings, this time taken from the 25 PAF-prone subjects belonging to the records used for training, for a total of 50 records to be used for training.

III. METHODOLOGY

We propose a classification framework, which implements the procedure summarized in Fig. 1. The procedure consists of three main steps: i) feature extraction, ii) computation of the kernel similarity, and iii) training of the classifier. The details of each phase are given in the remainder of this section.

Fig. 1: Schematic depiction of the proposed classification procedure. The original dataset of ECG signals is processed by a feature extraction block, to represent them as MTS. Specific kernel measures are then used to evaluate the similarities between the MTS, which are stored in a kernel matrix. Such a kernel matrix is finally exploited to train a classifier for predicting the PAF onset for each original ECG signal.

A. Feature extraction

The objective of this step is to extract from each ECG a set of descriptors that are useful for the classification. Each selected descriptor is represented by a variable in an MTS that, in turn, becomes the new representation of the original ECG. Therefore, the ECG dataset is represented by N MTS X_1, . . . , X_N, with X_n ∈ R^(V×T_n), where V is the number of variables; T_m denotes the number of time steps of the shortest MTS and T_M that of the longest one. For the PAF dataset under analysis, we have N = 106, T_m = 992, and T_M = 3364. The number of variables V is determined by the feature extraction procedure, which consists of two main steps: feature generation and feature selection.

1) Feature generation: The ECG signals are initially processed to identify the interval between two successive heartbeats (the R-R interval shown in Fig. 2).
The variation in the time interval between heartbeats provides important clinical descriptors, which are used to characterize heart failure and other circulatory diseases [21].

Fig. 2: In panel (a), the fiducial points of an ECG heartbeat [3]. In panel (b), an example of an R-R interval, corresponding to the time span between the R peaks of two consecutive QRS complexes.

To compute the R-R intervals, we first pre-process the ECG signal using the morphological filtering technique described in [22]. Morphological filtering makes it possible to retain useful information from acquired signals, while effectively eliminating noise originating from multiple sources, such as low-frequency baseline wandering (caused by respiration and perspiration) and higher-frequency components (caused by muscular activity). A further step to enhance the quality of the acquired data is to combine signals from different inputs (leads) before the feature extraction step. In this study, we employ a Root Mean Square (RMS) combination of the two signals provided in the PAF prediction database. Delineation is performed both on the two individual signals provided by the PAF prediction database and on their root-mean-square combination. After pre-processing is completed, an algorithm based on the digital wavelet transform [3], [22] is used to retrieve the R-R intervals as well as the fiducial points of each heartbeat: the start, peak, and end of each characteristic wave (see Fig. 2). There may be cases in which fiducial points cannot be extracted due to the low quality of the ECG signal. Low quality may be due to different reasons, e.g., imprecise or faulty measurements registered by the leads. In those cases, there will be a missing value in the descriptors at the time steps corresponding to the portions of the ECG measured with low quality.
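The pre-processing steps just described (RMS lead combination and R-R interval computation) can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB implementation used in the paper; the morphological filtering and wavelet-based delineation are not reproduced, and the R-peak positions and a 128 Hz sampling rate are assumed as given inputs.

```python
import numpy as np

def combine_leads_rms(lead1, lead2):
    # Root-mean-square combination of two ECG leads, as in the paper's pre-processing
    return np.sqrt((lead1 ** 2 + lead2 ** 2) / 2.0)

def rr_intervals(r_peak_samples, fs=128.0):
    # R-R intervals in seconds from the sample indices of detected R peaks;
    # the delineation step that finds the peaks is not reproduced here
    r = np.asarray(r_peak_samples, dtype=float)
    return np.diff(r) / fs

# toy example: R peaks roughly one second apart at fs = 128 Hz
combined = combine_leads_rms(np.array([3.0]), np.array([4.0]))
rr = rr_intervals([10, 138, 266, 395], fs=128.0)
```

Heartbeats whose fiducial points cannot be delineated would simply yield NaN entries in the feature series, which the kernels of Sec. III-B handle natively.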
While other techniques with better performance are available for extracting the fiducial points, the ones used in this work are suitable for resource-constrained devices, such as wearables. Thus, they satisfy the conditions in which our methodology is designed to operate. This procedure produces, for each original ECG signal, 95 time series of features related to morphological wave characteristics (for details on such features, see [3]).

2) Feature selection: After the feature generation procedure, each original ECG is represented by an MTS with V = 95 variates. Initially, we considered all the available features. However, preliminary results showed that training a classifier using all variates yielded worse results. Indeed, most of the original features are highly correlated. Furthermore, by including all features, the computational time to evaluate the kernel similarities and to train the classifier increases considerably. Therefore, we considered different subsets of features in our experiments by performing feature selection on the original 95 variates. To guide the selection of the features, we evaluated their linear Pearson correlation. In Fig. 3, we show the sample correlation matrix C between the 95 variables, which is computed as follows:
1) c_n^{i,j} = corr(x_n^i, x_n^j);
2) c̄^{i,j} = (1/N) Σ_n c_n^{i,j};
3) C_{i,j} = 0 if |c̄^{i,j}| < θ_c, C_{i,j} = 1 otherwise;
where x_n^i is the time series relative to the i-th variable in the n-th MTS X_n, and θ_c is a hyperparameter that thresholds the maximum degree of correlation between the variables. In particular, we avoided selecting two attributes at the same time if they were highly correlated (depicted as white pixels in the correlation matrix in Fig. 3). As a suitable threshold for the average correlation matrix C, we used θ_c = 0.4.
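The three steps above can be sketched as follows (an illustrative Python version; the function name and the greedy keep-first strategy for discarding correlated variables are our assumptions, since the paper only specifies the thresholded average correlation matrix):

```python
import numpy as np

def select_uncorrelated(mts_list, theta_c=0.4):
    """Feature selection via the thresholded average correlation matrix:
    1) c_n^{i,j} = corr(x_n^i, x_n^j) per MTS, 2) average over the N MTS,
    3) binarize at theta_c and greedily keep variables not correlated with
    any already-kept one. mts_list holds arrays of shape (V, T_n)."""
    V = mts_list[0].shape[0]
    C_bar = np.zeros((V, V))
    for X in mts_list:
        C_bar += np.corrcoef(X)          # per-series correlation matrix c_n
    C_bar /= len(mts_list)               # average correlation matrix
    mask = np.abs(C_bar) >= theta_c      # binarized matrix C
    keep = []
    for i in range(V):
        if all(not mask[i, j] for j in keep):
            keep.append(i)
    return keep, C_bar

# toy example: variable 1 is a linear copy of variable 0 (correlation 1)
t = np.arange(8.0)
X = np.vstack([t, 2.0 * t + 1.0, np.array([1, -1, 1, -1, 1, -1, 1, -1], float)])
keep, C_bar = select_uncorrelated([X], theta_c=0.4)
```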
This choice allows us to discard highly correlated variables and, at the same time, significantly reduce the memory requirements of the procedure. After this feature selection procedure, we reduced the number of variables to V = 22. These describe features that are mostly related to R-R intervals and are summarized in Tab. I.

Fig. 3: In (a), the average correlation matrix of the ECG features for the MTS in the whole dataset. In (b), the same matrix where the values have been binarized using θ_c = 0.4.

B. Kernel similarities for multivariate time series

Dynamic Time Warping [23] is the most widely used approach to assess the similarity between time series, but it cannot treat MTS in its original formulation. More complicated variants have been proposed to deal with multiple variables [24], but they are characterized by difficult hyperparameter tuning

TABLE I: Variables identified by the feature selection procedure; see Fig. 2 for reference.
R Position: Position of the R peak
IdentifiedStructures: Number of structures in the beat (P, R, and Q waves)
R Max: Peak value of the R wave
R Area: Area under the R wave
R Width: Width of the R wave
R ini: Time distance from the beginning of the R peak until the maximum value
P Position: Position of the P wave
P Max: Peak value of the P wave
P Area: Area under the P wave
P Width: Width of the P wave
P ini: Time distance from the beginning of the P peak until the maximum value
PR Interval: Interval between the P and the R waves
T Position: Position of the T peak
T Max: Peak value of the T wave
T Area: Area under the T wave
T Width: Width of the T wave
T ini: Time distance from the beginning of the T peak until the maximum value
RT Interval: Interval between the R and T waves
RR Interval: Interval between two R peaks
PR Segment: Time distance between the P and R waves
RR Interval 50HB Mean: Interval between two R peaks, mean over the last 50 samples
RR Interval 5HB Mean: Interval between two R peaks, mean over the last 5 samples

and high computational complexity. Furthermore, there is no straightforward way to account for missing values in the MTS when computing the similarities. In this work, we adopt two different kernel similarities for MTS, the Learned Pattern Similarity (LPS) [16] and the Time series Cluster Kernel (TCK) [17], whose details are provided in the following two subsections. These methods benefit from high parallelization capabilities and learn a model that, once training is over, can quickly process new unseen data. Both procedures for computing LPS and TCK require all the MTS to have equal length T. Therefore, we followed the approach proposed in [25] and, by means of linear interpolation, transformed all MTS so that they have the same number of time steps T. The value of T is determined by T = ⌈T_M / ⌈T_M / 25⌉⌉, where T_M is the length of the longest MTS in the dataset and ⌈·⌉ is the ceiling operator.
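A minimal sketch of this length normalization (Python for illustration; `np.interp` plays the role of the linear interpolation, and NaN entries, which the kernels handle natively, would need to be masked before interpolating):

```python
import math
import numpy as np

def common_length(T_M, target=25):
    # T = ceil(T_M / ceil(T_M / 25)); for T_M = 3364 this gives 25
    return math.ceil(T_M / math.ceil(T_M / target))

def resample_mts(X, T):
    # Linearly interpolate each variable of X (shape V x T_orig) onto T time steps
    V, T_orig = X.shape
    old = np.linspace(0.0, 1.0, T_orig)
    new = np.linspace(0.0, 1.0, T)
    return np.vstack([np.interp(new, old, X[v]) for v in range(V)])

T = common_length(3364)
X = np.arange(2 * 100, dtype=float).reshape(2, 100)   # toy MTS with V = 2, T_orig = 100
Xr = resample_mts(X, T)
```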
Before computing TCK we also standardized each MTS, i.e., from each variable v in the dataset we subtract its mean value v̄ and divide by its standard deviation σ_v; both v̄ and σ_v are computed on the training set. On the other hand, since decision trees are scale invariant, we did not apply this transformation for LPS (in accordance with [16]). The training procedure returns a kernel matrix K_tr ∈ R^(N_tr×N_tr), whose components are the similarities among the elements of the training set. Once LPS and TCK are fitted, we process the test set and generate an additional matrix K_te ∈ R^(N_te×N_tr), whose components are the similarities of the elements in the test set with respect to those in the training set. N_te and N_tr denote the sizes of the test and training set, respectively, with N = N_te + N_tr. TCK and LPS account for the missing patterns in the data when computing their similarities, rather than relying on imputation methods that may introduce strong biases. Imputation techniques replace the missing values with predefined or computed values, such as 0s, the empirical mean, or the last seen values. The choice is often arbitrary and, in the presence of large amounts of missing values, the resulting data, as well as the result of the analysis, may end up being significantly different.

1) Learned Pattern Similarity: The Learned Pattern Similarity (LPS) algorithm [16] is based on identifying the occurrence of segments within time series. It generalizes naturally to MTS by means of regression trees, where a bag-of-words type compressed representation is created, which in turn is used to compute the similarity score. LPS is considered the state of the art for MTS, inherits the decision tree approach to handling missing data, and is based on an ensemble strategy. Specifically, one extracts from each MTS all possible segments of length L (L < T), starting from each time index t = 1, 2, . . .
, T − L + 1, and fits them to nT randomly initialized regression trees. Since LPS naturally deals with missing data, in general it is not necessary to replace the missing entries. LPS depends on two hyperparameters: the maximum number of regression trees considered, nT, and the length of the segments, denoted as L.

2) Time Series Cluster Kernel: The Time series Cluster Kernel (TCK) [17] computes an unsupervised kernel similarity for MTS with missing data. The method is based on an ensemble learning approach, wherein robustness to hyperparameters is ensured by joining the clustering results of many Gaussian mixture models (GMMs) to form the final kernel. Hence, no critical hyperparameters have to be tuned by the user. In order to deal with missing data, the GMMs are extended using informative prior distributions [26]. To generate partitions with different resolutions that capture both local and global structures in the input data, the TCK similarity matrix is built by fitting GMMs to the set of time series for a number of mixtures, each one with a different number of components. To provide diversity in the ensemble, each partition is evaluated on a random subset of attributes and segments, using random initializations and randomly chosen hyperparameters. This also helps to provide robustness with respect to hyperparameter selection. TCK is then built by summing (for each partition) the inner products between pairs of posterior distributions corresponding to different time series. The hyperparameters that need to be specified in TCK are the number of different random initializations Q and the maximum number of Gaussian mixtures C. It is sufficient to set those hyperparameters to reasonably high values to obtain good performance in many tasks [27], [28].

C. Classification

Once the kernel matrices are generated, we train a classifier based on the similarities yielded by K_tr and K_te.
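To make the TCK construction more concrete, the following is a deliberately simplified sketch: it fits an ensemble of GMMs with varying numbers of components and random initializations on fully observed, flattened series, and sums the inner products of the posterior assignments. The informative missing-data priors, the random attribute/segment subsets, and the exact normalization of the real TCK are omitted; scikit-learn is assumed for the mixtures.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def toy_cluster_kernel(X, max_components=4, n_inits=3, seed=0):
    """Simplified TCK-style kernel for fully observed data.
    X: (N, D) array of flattened (resampled) time series."""
    N = X.shape[0]
    K = np.zeros((N, N))
    rng = np.random.RandomState(seed)
    for C in range(2, max_components + 1):        # partitions at different resolutions
        for _ in range(n_inits):                  # random initializations (Q in the paper)
            gmm = GaussianMixture(n_components=C, reg_covar=1e-3,
                                  random_state=rng.randint(1 << 30)).fit(X)
            P = gmm.predict_proba(X)              # posterior assignments, shape (N, C)
            K += P @ P.T                          # inner products of posteriors
    return K / K.max()

# toy data: two well-separated groups should yield a block-structured kernel
rng0 = np.random.RandomState(1)
X = np.vstack([rng0.randn(10, 2), rng0.randn(10, 2) + 6.0])
K = toy_cluster_kernel(X)
```

With the two groups well separated, the within-group entries of K dominate the between-group ones, mirroring the block structure visible in Fig. 4.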
While in principle we can use any classifier operating on real-valued vectors, we consider two different ones: the k-Nearest Neighbors classifier (kNN) and the Support Vector Machine classifier (SVM). To perform classification we follow two different approaches: classification in the input space and classification in the similarity-induced embedding space.

1) Classification in input space: In the first approach, classification is performed directly in the MTS input space, considering as similarity between the samples the values contained in the kernel matrix K_te. In kNN, for each element X_i in the test set we select the relative row k_i ∈ K_te and then the indices of the k highest values in k_i, which identify the k MTS of the training set that are most similar to X_i. The estimated label ŷ_i of X_i is the most frequent one among those k MTS. On the other hand, with SVM it is possible to train the classifier using K_tr as a precomputed kernel, which defines the kernel space where the optimal separating hyperplane is computed. In this case, the label ŷ_i is assigned according to the region (delimited by such a hyperplane) that contains X_i.

2) Classification in embedding similarity space: In this second approach, rather than assuming as input for the classification the original MTS, whose similarity is described by the kernel matrix, we train the classifier on the rows of K_tr and K_te. In particular, each row k_i ∈ R^(N_tr) is considered as a vectorial embedding representation of the original MTS X_i in the similarity space. Therefore, in this approach it is necessary to compute an additional vector similarity among the rows of both K_tr and K_te in order to train the classifier. In kNN, we consider Euclidean, cosine, Cityblock, and Pearson correlation similarities to identify, for each vector in K_te, the k most similar vectors in K_tr.
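The input-space kNN rule just described can be sketched as follows (illustrative Python; breaking ties toward the first label returned by `np.unique` is our simplification of the deterministic tie rule used in Sec. IV):

```python
import numpy as np

def kernel_knn_predict(K_te, y_tr, k=3):
    """For each test row of K_te (similarities to the training set), pick the k
    most similar training samples and majority-vote their labels."""
    y_tr = np.asarray(y_tr)
    preds = []
    for row in K_te:
        nn = np.argsort(row)[::-1][:k]              # indices of the k highest similarities
        labels, counts = np.unique(y_tr[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])     # first maximum wins on ties
    return np.array(preds)

# toy kernel: test sample 0 is most similar to training samples 0 and 1 (class 0)
K_te = np.array([[0.9, 0.8, 0.1, 0.2],
                 [0.1, 0.2, 0.9, 0.7]])
pred = kernel_knn_predict(K_te, y_tr=[0, 0, 1, 1], k=3)
```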
Analogously, in SVM we compute a kernel similarity among the rows of K_tr and K_te using radial basis functions (rbf), whose analytic expression reads

d_ij = exp(−γ ‖k_i − k_j‖²), (1)

where γ defines the bandwidth of the kernel. By applying Eq. 1 to all the rows i ∈ K_tr and j ∈ K_te, the embedding vectors are mapped into a new kernel space, where the optimal separating hyperplane is computed. Compared to directly performing the classification in the input space, this approach requires an additional computation to evaluate the vector similarities among the representations (rows/columns of the kernel matrix) of the original MTS. The increment in the training time scales with the dimensionality of the vectors, which is equal to N_tr, i.e., the size of the training set. The procedure is more computationally demanding when an SVM classifier is adopted, since the rbf kernels must be evaluated for each pair of vectors. On the other hand, the proposed approach based on similarity-induced embeddings provides a stronger form of regularization, which usually improves the generalization capability of the classifier. Additionally, the similarity embedding can better model the relationships among classes, thanks to its robustness to instance-level noise [29]. Finally, the similarity embedding can strongly compress the information when the MTS are characterized by a large number of variables and time steps, i.e., when T × V ≫ N_tr. However, in those cases where the classifier is required to capture a higher degree of variance in the data, the embedding approach might not be beneficial.

IV. RESULTS

In this section, we describe the experiments that we have performed and report the results that we have obtained. The whole classification framework, including the TCK and LPS kernels, is implemented in MATLAB¹. In each experiment, for LPS we set the number of regression trees to nT = 200 and the length of the segments to L = 10.
Instead, for TCK we set the maximum number of Gaussian mixtures to C = 40 and the number of random initializations to Q = 30. The two kernel matrices yielded by TCK and LPS are depicted in Fig. 4. The higher the value at position (i, j) in such matrices, the higher the similarity between the corresponding MTS X_i and X_j. In the depiction, brighter colors correspond to higher similarities. It is possible to observe a block structure in the matrices, which denotes that the intra-class similarities are higher than the inter-class similarities. The blocks representing the MTS relative to far-from-onset and close-to-onset patients are more clearly visible in the kernel matrix yielded by TCK, which suggests its superior capability to discriminate between the two classes.

¹ TCK: https://github.com/kmi010/Time-series-cluster-kernel-TCK-, LPS: http://www.mustafabaydogan.com/learned-pattern-similarity-lps.html

Fig. 4: Kernel matrices obtained with the LPS (a) and TCK (b) algorithms.

In the following, we present two different experiments. First, in Sec. IV-A we report the classification results on the original PAF dataset relative to the second challenge (see Sec. II for details), obtained by means of the proposed framework. Then, in Sec. IV-B we analyze how such results change as we vary the number of time steps taken into account for the signals; accordingly, this amounts to evaluating how the performance deteriorates when we omit data closer to the actual PAF onset. This last experiment is especially interesting, as it shows how our methodology can be used to predict a PAF episode before its occurrence.

A.
Classification results

Here, we evaluate the results obtained with the proposed approach to classify the MTS into the far-from-onset and close-to-onset classes, which is the objective of the second PAF challenge (PAF2) [18]. For comparison, we also report the three best official results, in terms of accuracy, obtained on the PAF2 challenge [30]. As discussed in Sec. III-C, it is possible to use the TCK and LPS kernel matrices directly as the pre-computed kernel in a kernel-based classifier, such as SVM, or to exploit the pairwise similarity scores in a kNN classifier operating in the input space. As a second approach, we consider each row of the kernel matrix as an N_tr-dimensional embedding vector. Compared to their original representations, the samples are now highly compressed, since we move from R^(V×T) (size of the original MTS) to R^(N_tr) (length of the embedding vector), with V × T = 319580 and N_tr = 50 in our dataset. In the embedding-based approach, several similarity measures have been used for evaluating the relationships of the embedding vectors k_i.

To evaluate classification performance, we consider accuracy (ACC), recall on the "close-to-onset" class (REC), and the F1 score. To select the optimal hyperparameters of the classifiers, we perform k-fold cross-validation. For the kNN classifier, we consider a range of values for the number of neighbors k in [1, 49] and, in the case of classification in the embedding space, we consider as possible vector similarities the Euclidean distance, cosine similarity, Cityblock distance, and Pearson correlation. In the presence of ties (when k assumes even values), we resolve them by deterministically assigning the label of class "far-from-onset" to the samples. For the SVM classifier, we search the margin of the hyperplane in C ∈ [2^−20, 2^10] and the kernel bandwidth in γ ∈ [2^−5, 2^5]. The classification results obtained from the kernel similarities yielded by LPS and TCK are shown in Tab.
II, along with the optimal hyperparameters found. We refer to kNN-i, SVM-i and kNN-e, SVM-e as the kNN and SVM applied in the input (-i) and embedding (-e) space, respectively.

TABLE II: Classification results in the input space (kNN-i and SVM-i) and in the embedding space (kNN-e and SVM-e), using the two time series similarities, LPS and TCK. We report the classification accuracy (ACC), the recall (REC) for the close-to-onset class, and the F1 measure (F1) relative to the configuration achieving the best ACC value in the validation procedure, along with the optimal hyperparameters for such a configuration. The last entries in the table are the best official results reported so far for the challenge [30]; note that only ACC is reported as a performance measure for them.

Method  Similarity  ACC    REC    F1     k    Diss       C       γ
kNN-i   LPS         0.714  0.928  0.764  41   –          –       –
kNN-i   TCK         0.696  0.75   0.712  33   –          –       –
kNN-e   LPS         0.66   0.678  0.666  49   cosine     –       –
kNN-e   TCK         0.678  0.683  0.675  29   cityblock  –       –
SVM-i   LPS         0.625  0.678  0.644  –    –          7.029   –
SVM-i   TCK         0.696  0.571  0.653  –    –          3.031   –
SVM-e   LPS         0.625  0.39   0.517  –    –          20.072  1.045
SVM-e   TCK         0.632  0.643  0.631  –    –          1.749   0.075
[31]    –           0.71   –      –      –    –          –       –
[32]    –           0.68   –      –      –    –          –       –
[33]    –           0.68   –      –      –    –          –       –

We observe that kNN and SVM yield better results most of the time when configured with TCK. However, when kNN is configured with LPS and operates in the input space, it provides the best results, which are comparable with the state of the art. Indeed, the highest official accuracy obtained for this problem is 0.71, and kNN-i+LPS obtains 0.714. Interestingly, we notice that the validation procedure for kNN always identifies high values of the number of neighbors (k) as optimal. A high value of k improves the robustness of the classifier to noise.
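The embedding-space SVM of Sec. III-C2 can be sketched end to end: rows of the kernel matrices act as embedding vectors, Eq. (1) maps them into an rbf kernel, and an SVM with a precomputed kernel is trained on the result. This is an illustrative Python version (the paper's implementation is in MATLAB); the toy kernel values and γ = 0.1 are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_between(A, B, gamma=0.1):
    # Eq. (1): d_ij = exp(-gamma * ||k_i - k_j||^2) between rows of A and rows of B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# rows of K_tr / K_te act as embedding vectors of the original MTS (toy values here)
rng = np.random.RandomState(0)
K_tr = rng.rand(10, 10); K_tr = (K_tr + K_tr.T) / 2.0   # symmetric training kernel
K_te = rng.rand(4, 10)                                   # test-vs-train similarities
y_tr = [0] * 5 + [1] * 5

G_tr = rbf_between(K_tr, K_tr)          # kernel between training embeddings
G_te = rbf_between(K_te, K_tr)          # kernel between test and training embeddings
clf = SVC(kernel='precomputed').fit(G_tr, y_tr)
y_hat = clf.predict(G_te)
```

Passing `kernel='precomputed'` to `SVC` is what lets the classifier consume G_tr and G_te directly instead of raw feature vectors.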
This is particularly important in real-world datasets, such as the one considered in this work, especially when the boundaries between different classes are not very distinct [34]. With respect to the remaining hyperparameters, we notice that the optimal configuration found with cross-validation varies from case to case.

B. Results on windows of increasing length

In this second experiment, we study how the classification outcomes change as we take into account an increasing number of time steps of the input MTS. In particular, we expand an initially small time window located far from the end of the recording by iteratively including measurements that are closer to the end and, hence, closer to the PAF onset. This allows us to evaluate how much time ahead the PAF events can be predicted accurately in our framework. By applying a window of increasing size to the original data, we obtain 10 different classification problems, c1, c2, . . . , c10. Fig. 5 graphically illustrates the procedure. The first problem c1 is created by considering the smallest window of heartbeats, positioned as far as possible from the PAF event (i.e., at the beginning of the recording). Note that the first heartbeat considered is the first of the shortest time series (MTS 3 in the example shown in Fig. 5).

Fig. 5: Classification on windows of increasing size. The first classification problem, c1, takes into account a small time window of heartbeats, which is far from the actual PAF event. The last classification problem, c10, takes into account the maximum number of time steps, by considering a window with a size equal to the shortest MTS in terms of heartbeats (MTS 3 in this case).
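The expanding-window construction can be sketched as follows (illustrative Python; the step of 96 heartbeats matches the window sizes in Tab. III, and the function name is ours):

```python
import numpy as np

def window_problems(mts_list, n_problems=10):
    """Generate the classification problems c1..c10: expanding windows anchored
    at the first heartbeat, with step = T_min // n_problems (960 // 10 = 96 here)."""
    T_min = min(X.shape[1] for X in mts_list)
    step = T_min // n_problems
    return [[X[:, :i * step] for X in mts_list] for i in range(1, n_problems + 1)]

# toy dataset: two MTS with 960 and 1200 heartbeats
mts = [np.zeros((2, 960)), np.zeros((2, 1200))]
probs = window_problems(mts)
```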
Therefore, the window of maximum width, considered in the last classification problem c10, has a length equal to the number of heartbeats in the shortest MTS of the PAF dataset (960 in this case). The details of each of the 10 classification problems are reported in Tab. III. Note that, assuming that an adult has approximately 60-100 heartbeats per minute, the last column gives an approximation of the number of minutes between the end of the window and the occurrence of the PAF event.

TABLE III: Details of the classification problems generated by considering windows of increasing length. The HB columns denote numbers of heartbeats.

Problem   Win. size   HBs before PAF   Mins before PAF
c1        96          864              ≈ 9-15
c2        192         768              ≈ 8-13
c3        288         672              ≈ 7-11
c4        384         576              ≈ 6-10
c5        480         480              ≈ 5-8
c6        576         384              ≈ 4-7
c7        672         288              ≈ 3-5
c8        768         192              ≈ 2-3
c9        864         96               ≈ 1-2
c10       960         0                0

The classification results obtained on each problem by the different classification methods are reported in Fig. 6, in terms of accuracy (ACC) and recall (REC). Each classifier uses the same hyperparameter configuration described in Tab. II.

Fig. 6: Classification accuracy (ACC) and recall (REC) for the close-to-onset class obtained by the different classifiers when considering the 10 windows of increasing size reported in Table III.

We notice that the best performance on this task in terms of recall (REC) is obtained by the kNN classifier when it is configured with LPS and operates in the input space. On the other hand, when the kNN operates in the input space and is configured with TCK, we obtain the best performance in terms of accuracy (ACC). We also note that kNN-e configured with LPS achieves stable results across the different lengths of the time windows, since the classification performance does not deteriorate significantly when only a few data items before the event are considered. This suggests that this method may be the most suitable for an early detection of the PAF onset. We note that, in this latter case, high accuracy and recall are obtained on problem c5, where the considered time window terminates roughly 5-8 minutes before the actual PAF onset (cf. Tab. III). Interestingly, in each experiment we observe that the recall is in general much lower when short time windows are considered. On the other hand, the classification accuracy is very stable across the classification problems. This demonstrates that the proposed method is capable of detecting the onset of PAF with a classification accuracy comparable with state-of-the-art methods, with the additional advantage of detecting the PAF event several minutes before its occurrence.

V. CONCLUSIONS

In this paper, we proposed a methodology to classify ECGs with the aim of predicting the onset of Paroxysmal Atrial Fibrillation. We represented the ECGs as multi-variate time series of measurements of different descriptors over time. We considered kernel functions for multi-variate time series, which allowed us to compute the pair-wise similarities between the ECGs. Such similarities are then exploited for training a classifier, defined either in the original input space or in a similarity-induced embedding space. Our classification results are higher than or on par with other state-of-the-art approaches.
In addition, we show that, within our proposed framework, it is possible to predict with good precision the onset of PAF several minutes before the end of the available recordings. This result is of significant practical importance, as it makes it possible to conceive an early-warning system for the onset of PAF, which can alert a patient before the occurrence of the event.

The proposed methodology is particularly efficient in terms of computational resources. In particular, the bottleneck, which is the computation of the kernel similarity matrix, can be implemented very efficiently with the TCK, which is an embarrassingly parallelizable algorithm whose computational complexity scales down linearly with the number of available computing units. Thanks to this property, the proposed framework is also suitable for implementation on highly parallel hardware devices and on multicore microprocessors, such as the ones that are being adopted in embedded systems. Future work includes porting the proposed software pipeline onto embedded systems.

ACKNOWLEDGEMENTS

This work was supported and funded by the Hasler Foundation under the Project "HSD: A Personal Device for Automatic Evaluation of Health Status during Physical Training" (Grant No. 15048). The paper reflects only the view of the authors. The authors would like to thank Elisabetta De Giovanni, Dr. Amir Aminifar, and Prof. David Atienza Alonso from École Polytechnique Fédérale de Lausanne for helping in the feature extraction process as well as for their precious feedback.

REFERENCES

[1] R. Alcaraz and J. J. Rieta, "A review on sample entropy applications for the non-invasive analysis of atrial fibrillation electrocardiograms," Biomedical Signal Processing and Control, vol. 5, no. 1, pp. 1–14, 2010.
[2] B. Hickey, C. Heneghan, and P. De Chazal, "Non-episode-dependent assessment of paroxysmal atrial fibrillation through measurement of RR interval dynamics and atrial premature contractions," Annals of Biomedical Engineering, vol. 32, no. 5, pp. 677–687, 2004.
[3] J. Milosevic, A. Dittrich, A. Ferrante, M. Malek, C. R. Quiros, R. Braojos, G. Ansaloni, and D. Atienza, "Risk assessment of atrial fibrillation: a failure prediction approach," in Computing in Cardiology Conference. Cambridge, MA, USA: IEEE, Sep. 2014, pp. 801–804.
[4] L. S. Wann, A. B. Curtis, C. T. January, K. A. Ellenbogen, J. E. Lowe, N. M. Estes, R. L. Page, M. D. Ezekowitz, D. J. Slotwiner, W. M. Jackman, W. G. Stevenson, and C. M. Tracy, "2011 ACCF/AHA/HRS focused update on the management of patients with atrial fibrillation (updating the 2006 guideline)," Circulation, vol. 123, no. 1, pp. 104–123, 2011. [Online]. Available: http://circ.ahajournals.org/content/123/1/104
[5] L. Y. Di Marco, D. Raine, J. P. Bourke, and P. Langley, "Characteristics of atrial fibrillation cycle length predict restoration of sinus rhythm by catheter ablation," Heart Rhythm, vol. 10, no. 9, pp. 1303–1310, 2013.
[6] M.-T. Lo, Y.-C. Chang, C. Lin, H.-W. V. Young, Y.-H. Lin, Y.-L. Ho, C.-K. Peng, and K. Hu, "Outlier-resilient complexity analysis of heartbeat dynamics," Scientific Reports, vol. 5, 2015.
[7] R. Cervigón, J. Moreno, R. B. Reilly, J. Millet, J. Pérez-Villacastín, and F. Castells, "Entropy measurements in paroxysmal and persistent atrial fibrillation," Physiological Measurement, vol. 31, no. 7, p. 1011, 2010.
[8] A. Martínez, D. Abásolo, R. Alcaraz, and J. J. Rieta, "Alteration of the P-wave non-linear dynamics near the onset of paroxysmal atrial fibrillation," Medical Engineering & Physics, vol. 37, no. 7, pp. 692–697, 2015.
[9] I. Dotsinsky, "Atrial wave detection algorithm for discovery of some rhythm abnormalities," Physiological Measurement, vol. 28, no. 5, p. 595, 2007.
[10] A. Martínez, R. Alcaraz, and J. J. Rieta, "Morphological variability of the P-wave for premature envision of paroxysmal atrial fibrillation events," Physiological Measurement, vol. 35, no. 1, p. 1, 2013.
[11] T. Thong, J. McNames, M. Aboy, and B. Goldstein, "Prediction of paroxysmal atrial fibrillation by analysis of atrial premature complexes," IEEE Transactions on Biomedical Engineering, vol. 51, no. 4, pp. 561–569, Apr. 2004.
[12] J. Schlenker, V. Socha, L. Riedlbauchová, T. Nedělka, A. Schlenker, V. Potočková, Š. Malá, and P. Kutílek, "Recurrence plot of heart rate variability signal in patients with vasovagal syncopes," Biomedical Signal Processing and Control, vol. 25, pp. 1–11, 2016.
[13] X. Du, N. Rao, M. Qian, D. Liu, J. Li, W. Feng, L. Yin, and X. Chen, "A novel method for real-time atrial fibrillation detection in electrocardiograms using multiple parameters," Annals of Noninvasive Electrocardiology, vol. 19, no. 3, pp. 217–225, 2014.
[14] X. Li, D. Yang, X. Liu, and X.-M. Wu, "Bridging time series dynamics and complex network theory with application to electrocardiogram analysis," IEEE Circuits and Systems Magazine, vol. 12, no. 4, pp. 33–46, Nov. 2012.
[15] X. Li and Z. Dong, "Detection and prediction of the onset of human ventricular fibrillation: An approach based on complex network theory," Physical Review E, vol. 84, no. 6, p. 062901, 2011.
[16] M. G. Baydogan and G. Runger, "Time series representation and similarity based on local autopatterns," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 476–509, 2016.
[17] K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, and R. Jenssen, "Time series cluster kernel for learning similarities between multivariate time series with missing data," Pattern Recognition, vol. 76, pp. 569–581, 2018.
[18] "The PAF Prediction Challenge Database," last accessed: 5-Jul-2016. [Online]. Available: https://physionet.org/physiobank/database/afpdb/
[19] G. B. Moody, A. L. Goldberger, S. McClennen, and S. P. Swiryn, "Predicting the onset of paroxysmal atrial fibrillation: The Computers in Cardiology challenge 2001," in Computers in Cardiology. Rotterdam, Netherlands: IEEE, Sep. 2001, pp. 113–116.
[20] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. 215–220, 2000.
[21] M. Malik, P. Färbom, V. Batchvarov, K. Hnatkova, and A. Camm, "Relation between QT and RR intervals is highly individual among healthy subjects: implications for heart rate correction of the QT interval," Heart, vol. 87, no. 3, pp. 220–228, 2002.
[22] R. Braojos Lopez, G. Ansaloni, D. Atienza Alonso, R. Vallejos, and F. Javier, "Embedded real-time ECG delineation methods: a comparative evaluation," in IEEE 12th International Conference on BioInformatics and BioEngineering (BIBE 2012), vol. 1, 2012, pp. 99–104.
[23] D. J. Berndt and J. Clifford, "Using dynamic time warping to find patterns in time series," in KDD Workshop, vol. 10, no. 16. Seattle, WA, 1994, pp. 359–370.
[24] M. Shokoohi-Yekta, B. Hu, H. Jin, J. Wang, and E. Keogh, "Generalizing DTW to the multi-dimensional case requires an adaptive approach," Data Mining and Knowledge Discovery, vol. 31, no. 1, pp. 1–31, 2017.
[25] L. Wang, Z. Wang, and S. Liu, "An effective multivariate time series classification approach using echo state network and adaptive differential evolution algorithm," Expert Systems with Applications, vol. 43, pp. 237–249, 2016.
[26] B. M. Marlin, D. C. Kale, R. G. Khemani, and R. C. Wetzel, "Unsupervised pattern discovery in electronic health care data using probabilistic clustering models," in Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, Miami, FL, USA, Jan. 2012, pp. 389–398.
[27] K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, S. O. Skrøvseth, R.-O. Lindsetmo, A. Revhaug, and R. Jenssen, "Learning similarities between irregularly sampled short multivariate time series from EHRs," in Proc. 3rd International Workshop on Pattern Recognition for Healthcare Analytics at ICPR 2016, 2016.
[28] K. Ø. Mikalsen, C. Soguero-Ruiz, F. M. Bianchi, A. Revhaug, and R. Jenssen, "An unsupervised multivariate time series kernel approach for identifying patients with surgical site infection from blood samples," ArXiv e-prints, 2018.
[29] K. Q. Weinberger and O. Chapelle, "Large margin taxonomy embedding for document categorization," in Advances in Neural Information Processing Systems, 2009, pp. 1737–1744.
[30] "PAF challenge scores." [Online]. Available: https://physionet.org/challenge/2001/top-scores.shtml
[31] G. Schreier, P. Kastner, and W. Marko, "An automatic ECG processing algorithm to identify patients prone to paroxysmal atrial fibrillation," in Computers in Cardiology 2001. IEEE, 2001, pp. 133–135.
[32] P. de Chazal and C. Heneghan, "Automated prediction of the onset of paroxysmal atrial fibrillation from surface electrocardiogram recordings," in Computers in Cardiology, 2001.
[33] C. Maier, M. Bauch, and H. Dickhaus, "Screening and prediction of paroxysmal atrial fibrillation by analysis of heart rate variability parameters," in Computers in Cardiology 2001. IEEE, 2001, pp. 129–132.
[34] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York, NY: Wiley & Sons, 2001.
