Objective evaluation metrics for automatic classification of EEG events

Saeedeh Ziyabari¹, Vinit Shah¹, Meysam Golmohammadi², Iyad Obeid¹ and Joseph Picone¹

¹ The Neural Engineering Data Consortium, Temple University, 1947 North 12th Street, Philadelphia, Pennsylvania 19122, USA.
² BioSignal Analytics, Inc., 3624 Market Street, Suite 5E, Philadelphia, Pennsylvania 19104, USA.

Journal of Neural Engineering Resubmission: April 15, 2018

Abstract: The evaluation of machine learning algorithms in biomedical fields for applications involving sequential data lacks standardization. Common quantitative scalar evaluation metrics such as sensitivity and specificity can often be misleading depending on the requirements of the application. Evaluation metrics must ultimately reflect the needs of users yet be sufficiently sensitive to guide algorithm development. Feedback from critical care clinicians who use automated event detection software in clinical applications has been overwhelmingly emphatic that a low false alarm rate, typically measured in units of the number of errors per 24 hours, is the single most important criterion for user acceptance. Though using a single metric is not often as insightful as examining performance over a range of operating conditions, there is a need for a single scalar figure of merit. In this paper, we discuss the deficiencies of existing metrics for a seizure detection task and propose several new metrics that offer a more balanced view of performance. We demonstrate these metrics on a seizure detection task based on the TUH EEG Corpus.
We show that two promising metrics are a measure based on a concept borrowed from the spoken term detection literature, Actual Term-Weighted Value, and a new metric, Time-Aligned Event Scoring (TAES), that accounts for the temporal alignment of the hypothesis to the reference annotation. We also demonstrate that state-of-the-art technology based on deep learning, though impressive in its performance, still needs significant improvement before it will meet very strict user acceptance guidelines.

Keywords: electroencephalograms, EEG, machine learning, evaluation metrics

1. Introduction

Electroencephalograms (EEGs) are the primary means by which physicians diagnose, evaluate and manage brain-related illnesses such as epilepsy, seizures and sleep disorders [1]. Automatic interpretation of EEGs has been extensively studied in the past decade [2]-[6]. However, even though many researchers report impressive levels of accuracy in publications, widespread adoption of commercial technology has yet to happen in clinical settings, primarily due to the high false alarm (FA) rates of these systems [7][8][9]. In this paper, we investigate the gap in performance between research and commercial technology and discuss how these perceptions are influenced by the lack of a standardized scoring methodology.

There are, in general, two ways to evaluate machine learning technology: user acceptance testing [10][11] and objective performance metrics based on annotated reference data [12][13]. User acceptance testing is time-consuming and expensive. It has never been a practical way to guide technology development because algorithm developers need rapid turnaround times on evaluations.
Hence, evaluations using objective performance metrics, such as sensitivity and specificity, are common in the machine learning field [14][15][16]. With this approach, it is very important to have a rich evaluation dataset and a performance metric that correlates well with user and application needs. The metric must have a certain level of granularity so that small differences in algorithms can be investigated and parameter optimizations can be evaluated. For example, in speech recognition applications, word error rate has been used for many years because it correlates well with user acceptance testing yet provides the necessary level of granularity to guide technology development. Despite many years of research focused on finding better performance metrics [17][18], word error rate remains a valid metric for technology development and assessment.

Sequential pattern recognition applications, such as speech recognition, keyword search or EEG analysis, require additional considerations. Data, typically organized in files on a computer, are not simply assessed with an overall judgment (e.g., "did a seizure occur somewhere in this file?"). Instead, the locality of the hypothesis must be considered: to what extent did the start and end times of the hypothesis match the reference transcription?

Corresponding Author: Joseph Picone, The Neural Engineering Data Consortium, ENGR 703A, Temple University, 1947 North 12th Street, Philadelphia, Pennsylvania 19122, USA, Tel: 215-204-4841, Fax: 215-204-5960, Email: joseph.picone@gmail.com.
This is a complex issue, since a hypothesis can partially overlap with the reference annotation, and a consistent mechanism for scoring such events must be adopted. Unfortunately, there is no such standardization in the EEG literature. For example, Wilson et al. [19] advocate using a term-based metric involving sensitivity and specificity. A term was defined as a connection of consecutive decisions of the same type of event. A hypothesis is counted as a true positive when it overlaps with one or more reference annotations. A false positive corresponds to an event in which a hypothesis annotation does not overlap with any of the reference annotations. Kelly et al. [20] recommend using a metric that measures sensitivity and FAs. A hypothesis is considered a true positive when the time of detection is within two minutes of the seizure onset. Otherwise, it is considered a false positive. Baldassano et al. [21] use an epoch-based metric that measures false positive and negative rates as well as latency. The development, evaluation and ranking of various machine learning approaches is highly dependent on the choice of metric.

A large class of bioengineering problems, including seizure detection, involve prediction as well as classification. In prediction problems, we are often concerned with how far in advance of an event (or after the event has occurred) we can predict an outcome. Accuracy of prediction varies with latency, so this type of performance evaluation adds some complexity to the process. Winterhalder et al. [22] have studied this problem extensively and argue for a scoring based on long-term considerations. In this paper, we are not concerned with these types of prediction problems. We focus mainly on assessing the accuracy of classification and the proximity of these classifications to the actual event.
Therefore, we analyze several popular scoring metrics and discuss their strengths and weaknesses on sequential decoding problems. We introduce several alternatives, such as the Actual Term-Weighted Value [23][24], that have proven successful in other fields, and discuss their relevance to EEG applications. We present a comparison of performance for several systems using these metrics and discuss how this correlates with overall user acceptance.

2. Method

Researchers in biomedical fields typically report performance in terms of sensitivity and specificity [25]. In a two-class classification problem such as seizure detection, we can define four types of errors:

True Positives (TP): the number of 'positives' detected correctly
True Negatives (TN): the number of 'negatives' detected correctly
False Positives (FP): the number of 'negatives' detected as 'positives'
False Negatives (FN): the number of 'positives' detected as 'negatives'

Sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)) are derived from these quantities. There are a large number of auxiliary measures that can be calculated from these four basic quantities and that are used extensively in the literature. These are summarized concisely in [26]. For example, in information retrieval problems, systems are often evaluated using accuracy ((TP+TN)/(TP+FN+TN+FP)), precision (TP/(TP+FP)), recall (another term for sensitivity) and F1 score ((2·Precision·Recall)/(Precision+Recall)). However, none of these measures address the time scale on which the scoring must occur, which is critical in the interpretation of these measures for many real-time bioengineering applications. In some applications, it is preferable to score every unit of time.
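Before turning to the question of time scale, the derived measures above can be computed directly from the four basic counts. The following is a minimal sketch (the function and variable names are our own, not from any standard package):

```python
def derived_measures(tp, tn, fp, fn):
    """Compute common derived measures from the four basic counts."""
    sensitivity = tp / (tp + fn)            # also called recall
    specificity = tn / (tn + fp)
    accuracy    = (tp + tn) / (tp + fn + tn + fp)
    precision   = tp / (tp + fp)
    f1          = (2 * precision * sensitivity) / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "precision": precision, "f1": f1}

# Hypothetical counts: 5 TPs, 90 TNs, 3 FPs, 2 FNs
m = derived_measures(tp=5, tn=90, fp=3, fn=2)
```

For these counts, sensitivity is 5/7 and F1 reduces to 2TP/(2TP+FP+FN) = 10/15; the point of the sketch is that every one of these scalars is fully determined once TP, TN, FP and FN are fixed, so the real question is how those four counts are estimated.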
With multichannel signals such as EEGs, scoring each channel for each unit of time might be appropriate, since significant events such as seizures occur on a subset of the channels present in the signal. However, it is more common in the literature to simply score a summary decision per unit of time that is based on the per-channel inputs (e.g., a majority vote). We refer to this type of scoring as epoch-based [27][28]. An alternative that is more common in speech and image recognition applications is term-based [24][29], in which we consider the start and stop time of the event, and each event identified in the reference annotation is counted once. There are fundamental differences between the two conventions. For example, one event containing many epochs will count more heavily in an epoch-based scoring scenario. Epoch-based scoring generally weights the duration of an event more heavily, since each unit of time is assessed independently.

Time-aligned scoring is essential to sequential decoding problems. But to implement such scoring in a meaningful way, there needs to be universal agreement on how to assess overlap between the reference and the hypothesis. For example, Figure 1 demonstrates a typical issue in scoring. The machine learning system correctly detected 5 seconds of a 10-sec event. Essentially 50% of the event is correctly detected, but how that is reflected in the scoring depends on the specific metric. Epoch-based scoring with an epoch duration of 1 sec would count 5 FN errors and 5 TP detections. Term-based scoring would potentially count this as a correct recognition, depending on the way overlaps are scored. Term-based metrics score on an event basis and do not count individual frames. A typical approach for calculating errors in term-based scoring is the Any-Overlap Method (OVLP) [30][31].
TPs are counted when the hypothesis overlaps with the reference annotation. FPs correspond to situations in which the hypothesis does not overlap with the reference. The metric ignores the duration of the term in the reference annotation. In Figure 2, we demonstrate two extreme cases for which the OVLP metric fails. In each case, 90% of the event is incorrectly scored. In example no. 1, the system does not detect approximately 9 seconds of a seizure event, while in example no. 2, the system incorrectly labels an additional 9 seconds of time as seizure. OVLP is considered a very permissive way of scoring, resulting in artificially high sensitivities. In Figure 2, the OVLP metric will score both examples as 100% TP.

Figure 1. A typical situation where a hypothesis (HYP) has a 50% overlap with the reference (REF).

It is very difficult to compare the performance of various systems when only two values are reported (e.g., sensitivity and specificity) and when the prior probabilities vary significantly (in seizure detection, the a priori probability of a seizure is very low, which means assessment of background events dominates the error calculations). Often a more holistic view is preferred, such as a Receiver Operating Characteristic (ROC) curve [15] or a Detection Error Trade-off (DET) curve [16]. An ROC curve displays the TP rate as a function of the FP rate, while a DET curve displays the FN rate as a function of the FP rate. When a single metric is preferred, the area under an ROC curve (AUC) [32][33] is also an effective way of comparing performance. A random guessing approach to classification will give an AUC of 0.5, while a perfect classifier will give an AUC of 1.0.
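The AUC figure quoted above can be computed from a set of ROC operating points by simple trapezoidal integration. The following sketch is our own helper, not a library function; it assumes the operating points are given as (FP rate, TP rate) pairs:

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve given (fpr, tpr) operating points."""
    pts = sorted(points)                     # order by increasing FP rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return area

# A random-guess ROC (the diagonal) yields 0.5; a perfect classifier yields 1.0.
print(roc_auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))   # 0.5
print(roc_auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))   # 1.0
```

The two calls reproduce the limiting cases mentioned in the text: the chance diagonal and the ideal classifier.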
The proper balance between sensitivity and FA rate is often application specific and has been studied extensively in a number of research communities. For example, evaluation of voice keyword search technology was carefully studied in the Spoken Term Detection (STD) evaluations conducted by NIST [23][24][34]. These evaluations resulted in the introduction of a single metric, Actual Term-Weighted Value (ATWV) [24], to address concerns about trade-offs among the different types of errors that occur in voice keyword search systems. Despite being popular in the voice processing community, ATWV has not been used in the bioengineering community. Therefore, in this paper, we compare and contrast five popular scoring metrics and one derived measure:

(1) NIST Actual Term-Weighted Value (ATWV): based on NIST's popular scoring package (F4DE v3.3.1), this metric, originally developed for the NIST 2006 Spoken Term Detection evaluation, uses an objective function that accounts for temporal overlap between the reference and hypothesis using the detection scores assigned by the system.

(2) Dynamic Programming Alignment (DPALIGN): similar to the NIST package known as SCLite [35], this metric uses a dynamic programming algorithm to align terms. It is most often used in a mode in which the time alignments produced by the system are ignored.

(3) Epoch-Based Sampling (EPOCH): treats the reference and hypothesis as temporal signals, samples each at a fixed epoch duration, and counts errors accordingly.

(4) Any-Overlap (OVLP): assesses the overlap in time between a reference and hypothesis event, and counts errors using binary scores for each event.

(5) Time-Aligned Event Scoring (TAES): similar to (4), but considers the percentage overlap between the two events and weights errors accordingly.
(6) Inter-Rater Agreement (IRA): uses EPOCH scoring to estimate errors, and calculates Cohen's Kappa coefficient [36] using the measured TP, TN, FP and FN.

Figure 2. TP scores for the Any-Overlap method are 100% even though large portions of the event are missed.

It is important to understand that each of these measures estimates TP, TN, FP and FN through some sort of error analysis. From these estimated quantities, traditional derived measures such as sensitivity and specificity are computed. As a result, we will see that sensitivity is a function of the underlying metric, and this is why it is important that there be community-wide agreement on a specific metric. We now briefly describe each of these approaches and provide several examples that illustrate their strengths and weaknesses. These examples are drawn on a compressed time scale for illustrative purposes and were carefully selected because they are indicative of scoring metric problems we have observed in actual evaluation data collected from our algorithm research.

2.1. NIST Actual Term-Weighted Value (ATWV)

ATWV is a measure that balances sensitivity and FA rate. ATWV essentially assigns an application-dependent reward to each correct detection and a penalty to each incorrect detection. A perfect system results in an ATWV of 1.0, while a system with no output results in an ATWV of 0. It is possible for ATWV to be less than zero if a system is doing very poorly (for example, a high FA rate). Experiments in voice keyword search have shown that an ATWV greater than 0.5 typically indicates a promising or usable system for information retrieval by voice applications. We believe a similar range is applicable to EEG analysis.
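The reward/penalty trade-off described above can be sketched numerically. The computation below follows the standard NIST formulation (miss probability, false-alarm probability over non-target trials, and the experimentally optimized constant β = 999.9); the data layout and helper names are our own, not the F4DE interface:

```python
def term_weighted_value(terms, beta=999.9):
    """Average 1 - (P_miss + beta * P_fa) over a set of terms.

    `terms` maps each term to a dict with n_correct (correct detections),
    n_spurious (incorrect detections), n_ref (reference occurrences) and
    t_source (total duration of the source signal in seconds).
    """
    total = 0.0
    for t in terms.values():
        p_miss = 1.0 - t["n_correct"] / t["n_ref"]
        n_nt = t["t_source"] - t["n_ref"]        # non-target trials for the term
        p_fa = t["n_spurious"] / n_nt
        total += 1.0 - (p_miss + beta * p_fa)
    return total / len(terms)

# Hypothetical term: 3 of 4 reference events found, 1 false alarm in 24 hrs.
twv = term_weighted_value({"seiz": {"n_correct": 3, "n_spurious": 1,
                                    "n_ref": 4, "t_source": 86400.0}})
```

Note how severely the large β punishes a single false alarm: even with 75% of the events found, one spurious detection over a 24-hour record already costs about 0.01 of the score, and a handful of false alarms per hour would drive the value negative.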
The metric accepts as input a list of N-tuples representing the hypotheses for the system being evaluated. Each of these N-tuples consists of a start time, an end time and a system detection score. These entries are matched to the reference annotations using an objective function that accounts for both the temporal overlap between the reference and hypotheses and the detection scores assigned by the system being evaluated. These detection scores are often likelihood or confidence scores [23]. The probabilities of miss and FA errors at a detection threshold θ are computed using:

\[ P_{Miss}(kw,\theta) = 1 - \frac{N_{Correct}(kw,\theta)}{N_{Ref}(kw)} \quad (1) \]

\[ P_{FA}(kw,\theta) = \frac{N_{Spurious}(kw,\theta)}{N_{NT}(kw)} \quad (2) \]

where \(N_{Correct}(kw,\theta)\) is the number of correct detections of the term \(kw\) with a detection score greater than or equal to θ, \(N_{Spurious}(kw,\theta)\) is the number of incorrect detections of the term with a detection score greater than or equal to θ, and \(N_{NT}(kw)\) is the number of non-target trials for the term \(kw\) in the data. The number of non-target trials for a term is related to the total duration of the source signal in seconds, \(T_{Source}\), and is computed as \(N_{NT}(kw) = T_{Source} - N_{Ref}(kw)\).

A term-weighted value is then computed that specifies a trade-off between misses and FAs. ATWV is defined as the value of TWV at the system's chosen detection threshold. Using a predefined constant, β, that was optimized experimentally (β = 999.9) [24], ATWV is computed using:

\[ \mathrm{ATWV}(\theta) = 1 - \frac{1}{K} \sum_{kw} \left[ P_{Miss}(kw,\theta) + \beta \cdot P_{FA}(kw,\theta) \right] \quad (3) \]

where K is the number of terms. A standard implementation of this approach is available at [37]. This metric has been widely used throughout the human language technology community for 15 years. This is a very important consideration in standardizing such a metric: researchers are using a common shared software implementation, which ensures there are no subtle implementation differences between sites or researchers.

To demonstrate the features of this approach, consider the case shown in Figure 3. The hypothesis for this segment consists of several short seizure events while the reference consists of one long event.
The ATWV metric will assign a TP score of 100% because the midpoint of the first event in the hypothesis annotation is mapped to the long seizure event in the reference annotation. This is somewhat generous given that 50% of the event was not detected. The remaining 5 events in the hypothesis annotation are counted as false positives. The ATWV metric is relatively insensitive to the duration of the reference event, though the 5 false positives will lower the overall performance of the system. The important issue here is that the hypothesis correctly detected about 70% of the seizure event, and yet, because of the large number of false positives, it will be penalized heavily. In Figure 4, we demonstrate a similar case in which the metric penalizes the hypothesis for missing three seizure events in the reference. Approximately 50% of the segment is correctly identified. This type of scoring, which penalizes repeated events that are part of a larger event in the reference, might make sense in an application like voice keyword search, because in human language each word hypothesis serves a unique purpose in the overall understanding of the signal. However, for a two-class event detection problem such as seizure detection, such scoring too heavily penalizes the hypothesis for splitting a long event into a series of short events.

2.2. Dynamic Programming Alignment (DPALIGN)

The DPALIGN metric essentially performs a minimization of an edit distance (the Levenshtein distance) [12] to map the hypothesis onto the reference. DPALIGN determines the minimum number of edits required to transform the hypothesis string into the reference string.
Given two strings, the source string X = [x₁, x₂, ..., xₙ] of length n and the target string Y = [y₁, y₂, ..., yₘ] of length m, we define D(i, j), the edit distance between the substrings x₁:xᵢ and y₁:yⱼ, as:

\[
D(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min
\begin{cases}
D(i-1,j) + 1 \\
D(i,j-1) + 1 \\
D(i-1,j-1) + \mathbf{1}_{(x_i \neq y_j)}
\end{cases}
& \text{otherwise.}
\end{cases}
\quad (4)
\]

The quantities being measured here are often referred to as substitution (sub), insertion (ins) and deletion (del) penalties. For this study, these three penalties are assigned equal weights of 1. A dynamic programming algorithm is used to find the optimal alignment between the reference and hypothesis based on these weights. Though there are versions of this metric that perform time-aligned scoring, in which both the reference and hypothesis must include start and end times, this metric is most commonly used without time alignment information.

The metric is best demonstrated using the two examples shown in Figure 5. In the first example, the hypothesis contains a seizure event that does not appear in the reference, so there were two insertion errors. In the second example, the hypothesis missed the third seizure event, so there were two deletion errors. For convenience, lowercase symbols indicate correct detections while uppercase symbols indicate errors. The asterisk symbol is used to denote deletion and insertion errors. Note that there is ambiguity in these alignments. For example, it is not really clear which of the three seizure events in the second example corresponded to each of the seizure events in the hypothesis. Nevertheless, this imprecision doesn't really influence the overall scoring. Though this type of scoring might at first seem highly inaccurate, since it ignores time alignments of the hypotheses, it has been surprisingly effective in scoring machine learning systems in sequential data applications (e.g., speech recognition) [12][35].
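The recurrence above can be implemented with a small dynamic program. The sketch below scores label sequences with unit sub/ins/del weights; the function name is ours, and it computes only the distance, not the alignment itself:

```python
def dpalign_distance(ref, hyp):
    """Levenshtein distance between two label sequences with unit weights."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # all deletions
    for j in range(m + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[n][m]

# The two Figure 5-style cases: two insertions, then two deletions.
ref1 = ["bckg", "seiz", "bckg", "seiz", "bckg"]
hyp1 = ref1 + ["seiz", "bckg"]
print(dpalign_distance(ref1, hyp1))   # 2
print(dpalign_distance(hyp1, ref1))   # 2
```

Both calls return 2, matching the total error counts shown in Figure 5; the distance is symmetric here because insertion and deletion carry equal weight.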
2.3. Epoch-Based Sampling (EPOCH)

Epoch-based scoring uses a metric that treats the reference and hypothesis as signals. These signals are sampled at a fixed epoch duration. The corresponding label in the reference is compared to the hypothesis. Similar to DPALIGN, substitution, deletion and insertion errors are tabulated with an equal weight of 1 for each type of error. This process is depicted in Figure 6. Epoch-based scoring requires that the entire signal be annotated, which is normally the case for sequential decoding evaluations. It attempts to account for the amount of time the two annotations overlap, so it directly addresses the inconsistencies demonstrated in Figure 3 and Figure 4.

One important parameter to be tuned in this algorithm is the frequency with which we sample the two annotations, which we refer to as the scoring epoch duration. It is ideally set to an amount of time smaller than the unit of time used by the classification system to make decisions. For example, the hypothesis in Figure 6 outputs decisions every 1 sec, so the scoring epoch duration should be set smaller than this. We use a scoring epoch duration of 0.25 sec for most of our work because our system epoch duration is typically 1 sec. We find in situations like this that the results are not overly sensitive to the choice of epoch duration as long as it is below 1 sec. This parameter simply controls how much precision one expects for segment boundaries.

Because EPOCH scoring samples the annotations at fixed time intervals, it is inherently biased to weight long seizure events more heavily.
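A minimal sketch of epoch-based sampling follows; events are represented as (start, stop, label) tuples, and the helper names and background label are our own choices:

```python
def label_at(events, t, default="bckg"):
    """Return the label of the event covering time t, or the background label."""
    for start, stop, label in events:
        if start <= t < stop:
            return label
    return default

def epoch_score(ref, hyp, total_dur, epoch=0.25, target="seiz"):
    """Sample both annotations every `epoch` secs and tally TP/TN/FP/FN."""
    tp = tn = fp = fn = 0
    t = 0.0
    while t < total_dur:
        r, h = label_at(ref, t), label_at(hyp, t)
        if r == target and h == target:   tp += 1
        elif r != target and h != target: tn += 1
        elif r != target and h == target: fp += 1
        else:                             fn += 1
        t += epoch
    return tp, tn, fp, fn

# Figure 1-style case: a 10-sec reference event, only the last 5 secs detected.
ref = [(0.0, 10.0, "seiz")]
hyp = [(5.0, 10.0, "seiz")]
print(epoch_score(ref, hyp, total_dur=10.0, epoch=1.0))  # (5, 0, 0, 5)
```

With a 1-sec epoch this reproduces the 5 TP / 5 FN outcome described for Figure 1, and shrinking the epoch to 0.25 sec scales all counts by four without changing the derived sensitivity, which is exactly the long-event bias discussed above.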
For example, if a signal contains one extremely long seizure event (e.g., 1000 secs) and two short events (e.g., each 10 secs in duration), the accuracy with which the first event is detected will dominate the overall scoring. Since seizure events can vary dramatically in duration, this is a cause for concern.

2.4. Any-Overlap Method (OVLP)

We previously introduced the OVLP metric as a popular choice in the neuroengineering community [30][31]. OVLP is a more permissive metric that tends to produce much higher sensitivities. If an event is detected in close proximity to a reference event, the reference event is considered correctly detected. If a long event in the reference annotation is detected as multiple shorter events in the hypothesis, the reference event is also considered correctly detected. Multiple events in the hypothesis annotation corresponding to the same event in the reference annotation are not typically counted as FAs. Since the FA rate is a very important measure of performance in critical care applications, this is another cause for concern.

Figure 3. ATWV scores this segment as 1 TP and 5 FPs.

Figure 4. ATWV scores this segment as 0 TP and 4 FN events.

Ref: bckg seiz bckg seiz bckg **** ****
Hyp: bckg seiz bckg seiz bckg SEIZ BCKG
(Hits: 5 Sub: 0 Ins: 2 Del: 0 Total Errors: 2)

Ref: bckg seiz bckg seiz bckg SEIZ BCKG
Hyp: bckg seiz bckg seiz bckg **** ****
(Hits: 5 Sub: 0 Ins: 0 Del: 2 Total Errors: 2)

Figure 5. DPALIGN aligns symbol sequences based on edit distance and ignores time alignments.

The OVLP scoring method is demonstrated in Figure 7. It has one significant tunable parameter: a guard band that controls the degree to which a misalignment is still considered a correct match.
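The any-overlap decision, including the guard band just described, might be sketched as follows (events as (start, stop) pairs; the helper names are ours, not from the cited scoring tools):

```python
def ovlp_score(ref_events, hyp_events, guard=0.0):
    """Any-overlap scoring: a reference event is a TP if any hypothesis
    event overlaps it (within the guard band); a hypothesis event that
    overlaps no reference event is an FP."""
    def overlaps(a, b):
        return a[0] < b[1] + guard and b[0] < a[1] + guard
    tp = sum(1 for r in ref_events
             if any(overlaps(r, h) for h in hyp_events))
    fn = len(ref_events) - tp
    fp = sum(1 for h in hyp_events
             if not any(overlaps(h, r) for r in ref_events))
    return tp, fn, fp

# Figure 2-style case: a 10-sec reference event, only the last 1 sec detected.
print(ovlp_score([(0.0, 10.0)], [(9.0, 10.0)]))   # (1, 0, 0)
```

Even though 90% of the event is missed, the segment is scored as fully correct, which is exactly the permissiveness criticized earlier; widening `guard` only makes the metric more forgiving.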
In this study, we use a fairly strict interpretation of this band and require some overlap between the two events in time, essentially a guard band of zero. The guard band needs to be tuned based on the needs of the application. Sensitivity generally increases as the guard band is increased.

2.5. Time-Aligned Event Scoring (TAES)

Though EPOCH scoring directly measures the amount of overlap between the annotations, there is a possibility that this too heavily weights single long events. Seizure events can vary in duration from a few seconds to many hours. In some applications, correctly detecting the number of events is as important as their duration. Hence, the TAES metric was designed as a compromise between these competing constraints. The TAES metric calculates TP, FN and FP as follows:

\[ TP = \frac{D_{hit}}{D_{ref}} \quad (5) \]

\[ FN = \frac{D_{miss}}{D_{ref}} \quad (6) \]

\[ FP = \min\left(1, \frac{D_{fa}}{D_{ref}}\right) \quad (7) \]

where, for each reference event of duration \(D_{ref}\), \(D_{hit}\) is the duration of the overlap between the hypothesis and the reference event, \(D_{miss}\) is the duration of the reference event that was missed, and \(D_{fa}\) is the duration of inserted hypothesis activity that does not overlap the reference event. TAES gives equal weight to each event, but it calculates a partial score for each event based on the amount of overlap. The TP score is the duration of the detected portion divided by the total duration of the reference event. The FN score is the duration of the missed portion divided by the total duration of the reference event. The FP score is the duration of the inserted term divided by the total duration of the reference event, but FP cannot exceed 1 per event. Therefore, like TP and FN, a single FP event contributes a fractional amount to the overall FP score if it correctly detects a portion of the same event in the reference annotation (partial overlap). Moreover, if multiple reference events are detected by a single long hypothesis event, all but the first detection are counted as FNs.
Since FP per event cannot exceed 1, this property helps compensate in the sensitivity versus FA trade-off. An example of TAES scoring is depicted in Figure 8.

Figure 6. EPOCH scoring directly measures the similarity of the time-aligned annotations. The TP, FN and FP counts are 5, 2 and 1 respectively.

2.6. Inter-Rater Agreement (IRA)

Inter-rater agreement (IRA) is a popular measure when comparing the relative similarity of two annotations. We refer to this metric as a derived metric, since it is computed from error counts collected using one of the other five metrics. IRA is most often measured using Cohen's Kappa coefficient [36], which compares the observed accuracy with the expected accuracy. It is computed using:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \quad (8) \]

where \(p_o\) is the relative observed agreement among raters and \(p_e\) is the hypothetical probability of chance agreement. The Kappa coefficient ranges between 1 (complete agreement) and values at or below 0 (no agreement beyond chance). It has been used extensively to assess inter-rater agreement for experts manually annotating seizures in EEG signals. Moderate values of kappa are common for these types of assessments [38]. The variability amongst experts mainly involves fine details in the annotations, such as the exact onset of a seizure. These kinds of details are extremely important for machine learning, and hence we need a metric that is sensitive to small variations in the annotations. For completeness, we use this measure as a way of evaluating the amount of agreement between two annotations.

2.7. A Brief Comparison of Metrics

A simple example of how these metrics compare on a specific segment of a signal is shown in Figure 9. A 10-sec section of an EEG signal is shown subdivided into 1-sec segments. The reference has three isolated events.
The system being evaluated outputs one hypothesis that starts in the middle of the first event and continues through the remaining two events. ATWV scores the system as 1 TP and 2 FNs, since it assigns the extended hypothesis event to the center reference event and leaves the other two undetected. The ATWV score is 0.33 for seizure events and 0.25 for background events, resulting in an average ATWV of 0.29. The sensitivity and FA rates for seizure events for this metric are 33% and 0 per 24 hrs. respectively. DPALIGN scores the system the same way, since time alignments are ignored and the first event in each annotation is matched, leaving the other two events undetected. The EPOCH method scores the alignment as 5 TP, 3 FP and 1 FN using a 1-sec epoch duration, because there are 4 epochs for which the annotations do not agree and 5 epochs where they agree. The sensitivity is 83.33% and the FA rate per 24 hrs. is very high because of the 3 FPs. The OVLP method scores the segment as 3 TP and 0 FP, because the detected events have partial to full overlap with all the reference events, giving a sensitivity of 100% with an FA rate of 0. TAES scores this segment as 0.5 FN and 2.5 TP, because the first event is only 50% correct and there are TPs for the 5th through 8th and 10th epochs (multiple overlapping reference events), giving a sensitivity of 83.33% and a high FA rate.

Figure 7. OVLP scoring is very permissive about the degree of overlap between the reference and hypothesis. The TP score for example 1 is 1 with no false alarms. In example 2, the system detects 2 out of 3 seizure events, so the TP and FN scores are 2 and 1 respectively.
IRA for seizure events, evaluated using Cohen's Kappa statistic, is 0.09 because there are essentially 4 errors for 6 seizure events. IRAs below 0.5 indicate a poor match between the reference and the hypothesis. It is difficult to conclude from this example which of these measures is most appropriate for EEG analysis. However, we see that ATWV and DPALIGN generally produce similar results. The EPOCH metric produces larger counts because it samples time rather than events. OVLP produces a high sensitivity, while TAES produces a low sensitivity but a relatively higher FA rate.

Figure 8. TAES scoring accounts for the amount of overlap between the reference and hypothesis. TAES scores example 1 as 0.71 TP, 0.29 FN and 0.14 FP. Example 2 is scored as 1 TP, 1 FN and 1 FP.

Figure 9. An example that summarizes the differences between scoring metrics.

3. Results

To demonstrate the differences between these metrics on a realistic task, we have evaluated a range of machine learning systems on a seizure detection task based on the TUH EEG Seizure Corpus [39]. An overview of the corpus is given in Table 1. This is the largest open source corpus of its type. It consists of clinical data collected at Temple University Hospital, and it represents a very challenging machine learning task because it contains a rich variety of common real-world problems found in clinical data (e.g., patient movement). There are 50 patients in the evaluation corpus, making it large enough to accurately assess fine differences in algorithm performance. A general architecture for the five machine learning systems evaluated is shown in Figure 10. An EEG signal is input using a European Data Format (EDF) file. The signal is converted to a sequence of feature vectors.
A group of frames is classified into an event on a per-channel basis using a combination of deep learning networks. The deep learning system essentially looks across multiple epochs, which we refer to as the temporal context, and across multiple channels, which we refer to as the spatial context, since each channel is associated with the location of an electrode on a patient's head. There is a wide variety of algorithms that can be used to produce a decision from these inputs. Even though seizures occur on a subset of the channels input to such a system, we focus on a single decision made across all channels at each point in time.

Table 1. The TUH EEG Seizure Corpus (v1.1.1)

  Description           Train     Eval
  Patients                196       50
  Sessions                456      230
  Files                 1,505      984
  No. Seizure Events      870      614
  Seizure (secs)       51,140   53,930
  Non-Seizure (secs)  877,821  547,728
  Total (secs)        928,962  601,659

Figure 10. A hybrid deep learning architecture that integrates temporal and spatial context.

The five systems were carefully chosen because they represent a range of performance that is representative of the state of the art on this task, and because they exhibit different error modalities. The performance of these systems is sufficiently close that the impact of the different scoring metrics becomes apparent. The systems selected were:

(1) HMM/SdA: a hybrid system consisting of a hidden Markov model (HMM) decoder and a postprocessor that uses a Stacked Denoising Autoencoder (SdA). An N-channel EEG was transformed into N independent feature streams using a standard sliding-window approach. The hypotheses generated by the HMMs were postprocessed using a second stage of processing that examines the temporal and spatial context. We apply a third pass of postprocessing that uses a stochastic language model to smooth hypotheses involving sequences of events so that we can suppress spurious outputs. This third stage of postprocessing provides a moderate reduction in the false alarm rate. Standard three-state left-to-right HMMs with 8 Gaussian mixture components per state were used for sequential decoding. We divide each channel of an EEG into 1-second epochs, and further subdivide these epochs into a sequence of frames. Each epoch is classified using an HMM trained on the subdivided epoch, and these epoch-based decisions are postprocessed by additional statistical models in a process similar to the language modeling component of a speech recognizer. The output of the epoch-based decisions was postprocessed by a deep learning system. The SdA network has three hidden layers with corruption levels of 0.3 for each layer. The numbers of nodes per layer are: first layer = 800, second layer = 500, third layer = 300. The parameters for pre-training are: learning rate = 0.5, number of epochs = 150, batch size = 300. The parameters for fine-tuning are: learning rate = 0.1, number of epochs = 300, batch size = 100. The overall result of the second stage is a probability vector of dimension two containing a likelihood that each label could have occurred in the epoch. A soft decision paradigm is used rather than a hard decision paradigm because this output is smoothed in the third stage of processing.

(2) HMM/LSTM: an HMM decoder postprocessed by a Long Short-Term Memory (LSTM) network. As in the HMM/SdA hybrid approach previously described, the output of the HMM system is a vector of dimension 2 × the number of channels (22) × the window length (7).
Therefore, we also use PCA before the LSTM in this approach to reduce the dimensionality of the data to 20. For this study, we used a window length of 41 for the LSTM, which is composed of one hidden layer with 32 nodes. The output layer nodes in this LSTM use a sigmoid function. The parameters of the models are optimized to minimize the error using a cross-entropy loss function. Adaptive Moment Estimation (Adam) is used in the optimization process.

(3) IPCA/LSTM: a preprocessor based on Incremental Principal Component Analysis (IPCA) followed by an LSTM decoder. The EEG features are delivered to an IPCA layer for spatial context analysis and dimensionality reduction. A batch size of 50 is used in IPCA and the output dimension is 25. The output of IPCA is delivered to an LSTM for classification. We used a one-layer LSTM with a hidden layer size of 128 and a batch size of 128, along with Adam optimization and a cross-entropy loss function.

(4) CNN/MLP: a pure deep learning-based approach that uses a Convolutional Neural Network (CNN) decoder and a Multi-Layer Perceptron (MLP) postprocessor. The network contains six convolutional layers, three max pooling layers and two fully-connected layers. A rectified linear unit (ReLU) non-linearity is applied to the output of every convolutional and fully-connected layer.

(5) CNN/LSTM: a pure deep learning-based architecture that uses a combination of CNN and LSTM networks. In this architecture, we integrate 2D CNNs, 1D CNNs and LSTM networks to better exploit long-term dependencies. Exponential Linear Units (ELU) are used as the activation functions for the hidden layers. Adam is used in the optimization process along with a mean squared error loss function. Comprehensive details about the architectures are available in [40][41].

The details of these systems are not critical to this study.
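The epoch-and-frame bookkeeping shared by these front ends (dividing each channel into 1-second epochs and subdividing each epoch into frames, as described for HMM/SdA above) is easy to sketch. The sampling rate and frame duration below are illustrative choices, not the systems' exact settings:

```python
def subdivide(n_samples, fs=250, epoch_s=1.0, frame_s=0.1):
    """Split one channel of n_samples at sampling rate fs into
    epoch_s-second epochs, each subdivided into frame_s-second frames.
    Returns a list of epochs; each epoch is a list of (start, stop)
    sample-index pairs."""
    epoch_len = int(epoch_s * fs)
    frame_len = int(frame_s * fs)
    epochs = []
    for e0 in range(0, n_samples - epoch_len + 1, epoch_len):
        frames = [(e0 + f0, e0 + f0 + frame_len)
                  for f0 in range(0, epoch_len - frame_len + 1, frame_len)]
        epochs.append(frames)
    return epochs

eps = subdivide(n_samples=1000, fs=250)  # 4 s of data -> 4 epochs x 10 frames
print(len(eps), len(eps[0]))  # -> 4 10
```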
What is more important is how the range of performance is reflected in these metrics. A comparison of the performance of the different architectures is presented in Table 2. Though the relative rankings of these systems vary somewhat with the metric, the overall trends in Table 2 represent the ranking accurately. HMM/SdA generally performs the poorest of these systems, delivering a respectable sensitivity but at a high FA rate. CNN/LSTM typically delivers the highest performance and has a low FA rate, which is very important in this type of application.

4. Discussion

Evaluating systems from a single operating point is always a bit tenuous. Therefore, in Figure 11 we provide DET curves for the systems, and in Table 3 we provide AUCs for these DET curves calculated using OVLP and TAES for comparison, reflecting our emphasis on these two metrics for seizure detection-like applications. The DET curves were derived from the output of the OVLP scoring metric only. The shapes of the DET curves do not change significantly with the scoring metric, though the absolute numbers vary similarly to what we see in Table 2. The AUC values in Table 3 also follow a similar trend, but the AUC-TAES difference between the best and worst systems is less pronounced than for AUC-OVLP, which seems to provide a more realistic insight into system performance. It is clear from this data that CNN/LSTM performance is significantly different from the other systems. This is primarily because of its low FA rate. For this particular application, sensitivity drops rapidly as the FA rate is lowered.
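The AUC values reported in Table 3 summarize each DET curve with a single number; the computation reduces to trapezoidal integration over the curve's operating points. A minimal sketch (the points below are invented, not taken from Figure 11):

```python
def auc_trapezoid(xs, ys):
    """Trapezoidal area under a curve sampled at sorted x positions
    (e.g., FA rate on x, miss rate on y for a DET curve)."""
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area

# miss rate falling as the FA rate grows
print(auc_trapezoid([0.0, 1.0, 2.0], [1.0, 0.5, 0.25]))  # -> 1.125
```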
Therefore, comparing a single data point for each system is dangerous, because the systems are most likely operating at different points on a DET curve if the sensitivities are significantly different. We find that tuning the systems to a comparable FA rate is important when comparing two systems only on the basis of sensitivity.

In Table 2 we can examine the sensitivity of the different metrics by looking at the variation in sensitivity. For example, for HMM/SdA, we see that the lowest sensitivities are produced by TAES and EPOCH scoring, while the highest sensitivities are produced by OVLP and DPALIGN. This makes sense because OVLP and DPALIGN are very forgiving of time alignment errors, while TAES and EPOCH penalize time alignment errors heavily. We see similar trends for CNN/LSTM, though the range of differences among the three highest-scoring metrics is smaller. We also see that the five algorithms are ranked similarly by each scoring metric: HMM/SdA consistently scores the lowest and CNN/LSTM consistently scores the highest. The other three systems are very similar in their performance.

The ATWV scores for all algorithms are extremely low. Scores below 0.5 indicate that overall performance is poor. However, the ATWV score for CNN/LSTM is significantly higher than for the other four systems. ATWV attempts to reduce the information contained in a DET curve to a single number, and it does a good job reflecting the results shown in Figure 11. The DET curves for HMM/LSTM and HMM/SdA overlap considerably for an FP rate between 0.25 and 1.0, and this is a primary reason why their ATWV scores are similar. However, for the seizure detection application we are primarily interested in the low FP rate region, and in that range, HMM/LSTM and IPCA/LSTM perform similarly. Table 3.
AUC comparison according to OVLP and TAES

  Algorithm    AUC (OVLP)  AUC (TAES)
  HMM/SdA         0.44        0.72
  HMM/LSTM        0.44        0.71
  IPCA/LSTM       0.39        0.72
  CNN/MLP         0.38        0.65
  CNN/LSTM        0.21        0.56

Table 2. Performance vs. scoring metric

  Metric   Measure      HMM/SdA  HMM/LSTM  IPCA/LSTM  CNN/MLP  CNN/LSTM
  ATWV     Sensitivity  30.35%   26.73%    24.73%     29.52%   30.34%
           Specificity  61.38%   68.93%    64.51%     65.87%   93.15%
           FAs/24 hrs   98       75        94         94       11
           ATWV         -0.8392  -0.8469   -0.4628    -0.7971  0.1737
  OVLP     Sensitivity  35.35%   30.05%    32.97%     39.09%   30.83%
           Specificity  73.35%   80.53%    77.57%     76.84%   96.86%
           FAs/24 hrs   77       60        73         77       7
  DPALIGN  Sensitivity  44.11%   33.77%    35.77%     43.35%   32.46%
           Specificity  66.87%   72.99%    69.59%     71.49%   95.17%
           FAs/24 hrs   86       66        81         77       8
  TAES     Sensitivity  17.29%   22.84%    22.12%     31.58%   12.48%
           Specificity  66.04%   70.41%    66.64%     64.75%   95.24%
           FAs/24 hrs   82       68        83         91       8
  EPOCH    Sensitivity  20.71%   50.46%    51.02%     65.03%   9.784%
           Specificity  98.22%   94.82%    94.09%     91.55%   99.84%
           FAs/24 hrs   1,418    4,133     4,711      6,738    126

While sensitivity and specificity are commonly used metrics in the bioengineering community, from Table 2 and Figure 11 we see that the FA rate also plays a major role in determining the usability of a system. A commonly used metric in the machine learning community that is somewhat intuitive is accuracy. The accuracy of the five systems is shown in Table 4. Accuracy weights all types of errors as equally important. This is acceptable if the dataset is balanced. However, for many bioengineering applications, such as seizure detection, the target class, or class of interest, occurs infrequently.
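The class-imbalance problem just described is easy to demonstrate: with a rare target class, a system that detects nothing can still post a high accuracy while its F1 score collapses. The counts below are invented purely for illustration:

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * sens / (prec + sens) if prec + sens else 0.0

# 1% of epochs are seizure; the system labels everything as background
tp, fp, fn, tn = 0, 0, 10, 990
print(accuracy(tp, fp, fn, tn))  # -> 0.99
print(f1(tp, fp, fn))            # -> 0.0
```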
We see that CNN/LSTM is significantly more accurate than the other four systems, but the differences among the remaining four systems are minimal when accuracy is used as the metric. Another popular metric that attempts to aggregate performance into a single data point, and is popular in the information retrieval communities, is the F1 score. These scores for the five systems are shown in Table 5. We see that there is significant variation in F1 scores with the scoring metric. For example, for TAES and EPOCH, which stress time alignments, the best performing system is not CNN/LSTM. F1 scores do not adequately emphasize FAs for applications such as seizure detection.

We generally prefer operating points where performance in terms of sensitivity, specificity and FAs is balanced. The ATWV metric explicitly attempts to balance these by assigning a reward to each correct detection and a penalty to each incorrect detection. None of the conventional metrics described here consider the fraction of a detected event that is correct. This is the inspiration behind the development of TAES scoring. TAES scoring requires the time alignments to match, which is a more stringent requirement than, for example, OVLP. Consequently, the sensitivity produced by the TAES and EPOCH metrics tends to be lower.

Figure 11. A comparison of DET curves.

Finally, comparing results across these five metrics can provide useful diagnostic information and insight into a system's behavior. For example, the IPCA/LSTM and HMM/LSTM systems have relatively higher sensitivities according to the EPOCH metric, indicating that these systems tend to detect longer seizure events.
Conversely, since the CNN/LSTM system has relatively low sensitivities according to the TAES and EPOCH metrics, it can be inferred that this system misses longer seizure events. Similarly, if the sensitivity were relatively high for TAES and relatively low for EPOCH, it would indicate that the system tends to detect a majority of small to moderate events precisely, regardless of the duration of an event. A comparison of ATWV scores with the other metrics likewise gives diagnostic information, such as whether a system accurately detects the onset and end of an event, or whether it splits long events into multiple short events. Examining the ensemble of scores for these six metrics can be revealing.

To understand the pairwise statistical differences between the discussed evaluation metrics and deep architectures, we performed Kolmogorov-Smirnov (KS) tests, Pearson's R (correlation coefficient) and Z-tests. These tests were performed to evaluate the results of the hybrid deep learning architectures on the basis of sensitivity and specificity. Each individual patient from the TUSZ dataset was evaluated separately. Outliers were removed by rejecting all input values collected from patients who have no seizures and from those for whom the deep learning systems detected no seizures. Prior to performing tests for evaluating statistical differences, such as a z-test, t-test or ANOVA, it must first be determined whether or not the group sample, in our case each individual metric's score per patient, is normally distributed.

Table 4. Accuracy vs. scoring metric

  Metric    HMM/SdA  HMM/LSTM  IPCA/LSTM  CNN/MLP  CNN/LSTM
  ATWV       54.0%    54.0%     52.1%      54.9%    70.7%
  OVLP       65.1%    66.5%     65.6%      66.9%    78.9%
  DPALIGN    61.5%    60.2%     59.2%      62.9%    73.6%
  TAES       56.6%    57.3%     55.4%      57.2%    69.7%
  EPOCH      92.3%    91.5%     90.8%      89.5%    91.5%

Table 5. F1 score vs. scoring metric

  Metric    HMM/SdA  HMM/LSTM  IPCA/LSTM  CNN/MLP  CNN/LSTM
  ATWV       0.24     0.28      0.24       0.28     0.42
  OVLP       0.31     0.33      0.34       0.38     0.45
  DPALIGN    0.35     0.36      0.35       0.42     0.45
  TAES       0.16     0.26      0.24       0.31     0.19
  EPOCH      0.29     0.47      0.46       0.49     0.14

Table 6. Correlation of the scoring metrics (for sensitivity); all p < 0.001

  Metric     ATWV   DPALIGN  OVLP   TAES   EPOCH
  ATWV       ---    0.87     0.92   0.71   0.50
  DPALIGN    0.87   ---      0.90   0.69   0.48
  OVLP       0.92   0.90     ---    0.78   0.62
  TAES       0.71   0.69     0.78   ---    0.87
  EPOCH      0.50   0.48     0.62   0.87   ---

Table 7. Correlation of the scoring metrics (for specificity); all p < 0.001

  Metric     ATWV   DPALIGN  OVLP   TAES   EPOCH
  ATWV       ---    0.49     0.45   0.54   0.32
  DPALIGN    0.49   ---      0.94   0.89   0.38
  OVLP       0.45   0.94     ---    0.95   0.44
  TAES       0.54   0.89     0.95   ---    0.56
  EPOCH      0.32   0.38     0.44   0.56   ---

We performed KS tests on each separate evaluation metric and confirmed that the group distributions are indeed Gaussian. The KS values collected range from 0.61–0.71 for sensitivity and 0.99–1.00 for specificity, with p-values equal to zero. We then evaluated the correlation coefficient (Pearson's R) between each pair of metrics. Correlations for each pair of scoring metrics are shown in Table 6 (for sensitivity) and Table 7 (for specificity).
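The Pearson's R entries in Tables 6 and 7 come from per-patient score pairs; the statistic itself is a short computation. A sketch with synthetic inputs (these are not the per-patient scores used for the tables):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# perfectly linearly related scores correlate at r = 1
print(pearson_r([1, 2, 3], [2, 4, 6]))  # -> 1.0
```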
From Table 6, it can be seen that the pairs ATWV-EPOCH and DPALIGN-EPOCH have the minimum correlation (~0.5). The pairwise correlations between OVLP, ATWV and DPALIGN are much higher. The EPOCH method has a low correlation with all other metrics except TAES. This makes sense because the EPOCH method scores events on a constant time scale instead of on individual events. TAES takes into account the duration of the overlap, so it is the closest method to EPOCH in this regard. Since OVLP and TAES both score overlapping events independently, we also expect these two methods to be correlated (sensitivity: 0.78; specificity: 0.95). ATWV, on the other hand, has fairly low correlations with the other metrics for specificity because of its stringent rules for FPs when there are multiple overlapping events. The overall highest correlation is between ATWV and OVLP for sensitivity, and between OVLP and TAES for specificity. All the correlation values (Pearson's R) collected in Table 6 and Table 7 are statistically significant with p-values < 0.001.

To understand the statistical significance of each system, we performed (two-tailed) Z-tests on all pairs of recognition systems using each metric separately, as shown in Table 8 (for sensitivity) and Table 9 (for specificity). Entries in both tables contain the sensitivity/specificity difference between the systems and a binary decision (Y/N) based on the p-values extracted from the Z-test with 95% confidence. Here again, the data were prepared by scoring systems on individual patients, and prior to performing the Z-tests, the Gaussianity of each sample was evaluated using a KS test. All the samples were confirmed as normal with p-values < 0.001.

Table 8. Significance calculated for scoring metrics using Z-tests at α = 0.05 (for sensitivity). Each entry is the absolute sensitivity difference (%) and whether it is significant (Y/N).

  ATWV (sensitivity in parentheses)
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (30.34%)     (00.82%) Y  (03.61%) Y  (00.01%) Y  (05.61%) Y
  CNN-MLP (29.52%)      ---         (02.79%) N  (00.83%) N  (04.79%) N
  HMM-LSTM (26.73%)                 ---         (03.62%) N  (02.00%) N
  HMM-SDA (30.35%)                              ---         (05.62%) N
  IPCA-LSTM (24.73%)                                        ---

  DPALIGN
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (32.46%)     (10.89%) Y  (01.31%) Y  (11.65%) Y  (03.31%) Y
  CNN-MLP (43.35%)      ---         (09.58%) N  (00.76%) N  (07.58%) N
  HMM-LSTM (33.77%)                 ---         (10.34%) N  (02.00%) N
  HMM-SDA (44.11%)                              ---         (08.34%) N
  IPCA-LSTM (35.77%)                                        ---

  EPOCH
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (09.78%)     (55.25%) N  (40.68%) N  (10.93%) Y  (41.24%) N
  CNN-MLP (65.03%)      ---         (14.57%) Y  (44.32%) Y  (14.01%) N
  HMM-LSTM (50.46%)                 ---         (29.75%) Y  (00.56%) N
  HMM-SDA (20.71%)                              ---         (30.31%) Y
  IPCA-LSTM (51.02%)                                        ---

  OVLP
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (30.83%)     (08.26%) Y  (02.14%) Y  (04.52%) Y  (02.14%) Y
  CNN-MLP (39.09%)      ---         (09.04%) N  (03.74%) N  (06.12%) N
  HMM-LSTM (30.05%)                 ---         (05.30%) N  (02.92%) N
  HMM-SDA (35.35%)                              ---         (02.38%) N
  IPCA-LSTM (32.97%)                                        ---

  TAES
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (12.48%)     (19.10%) N  (10.36%) N  (04.81%) Y  (09.64%) N
  CNN-MLP (31.58%)      ---         (08.74%) N  (14.29%) Y  (09.46%) N
  HMM-LSTM (22.84%)                 ---         (05.55%) Y  (00.72%) N
  HMM-SDA (17.29%)                              ---         (04.83%) Y
  IPCA-LSTM (22.12%)                                        ---
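A textbook two-proportion z statistic, with a two-tailed p-value built from the normal CDF via `math.erf`, can be sketched as follows. This is the standard formula for comparing two proportions, not necessarily the exact procedure used to produce Tables 8 and 9, and the inputs below are hypothetical:

```python
import math

def two_prop_ztest(p1, n1, p2, n2):
    """Two-tailed z-test for the difference of two proportions."""
    pool = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# two hypothetical sensitivities measured over 50 patients each
z, p = two_prop_ztest(p1=0.31, n1=50, p2=0.12, n2=50)
print(p < 0.05)  # significant at the 95% level -> True
```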
From Table 8, it can be observed that the differences between the CNN-LSTM system and the other systems are statistically significant for all metrics except EPOCH and TAES; for those two metrics, the Z-test fails to reject the null hypothesis for CNN-LSTM. According to these metrics, the performance of HMM-SDA is significantly different from the other systems, reflecting its very poor performance. This can also be observed from the EPOCH/TAES results in Table 2.

Table 9. Significance calculated for scoring metrics using Z-tests at α = 0.05 (for specificity). Each entry is the absolute specificity difference (%) and whether it is significant (Y/N).

  ATWV (specificity in parentheses)
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (93.15%)     (27.28%) Y  (24.22%) Y  (31.77%) Y  (28.64%) Y
  CNN-MLP (65.87%)      ---         (03.06%) N  (04.49%) N  (01.36%) N
  HMM-LSTM (68.93%)                 ---         (07.55%) Y  (04.42%) N
  HMM-SDA (61.38%)                              ---         (03.13%) N
  IPCA-LSTM (64.51%)                                        ---

  DPALIGN
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (95.17%)     (23.68%) Y  (22.18%) Y  (28.30%) Y  (25.58%) Y
  CNN-MLP (71.49%)      ---         (01.50%) N  (04.62%) Y  (01.90%) N
  HMM-LSTM (72.99%)                 ---         (06.12%) Y  (03.40%) N
  HMM-SDA (66.87%)                              ---         (02.72%) Y
  IPCA-LSTM (69.59%)                                        ---

  EPOCH
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (99.84%)     (08.29%) N  (05.02%) N  (01.62%) N  (05.75%) N
  CNN-MLP (91.55%)      ---         (03.27%) N  (06.67%) N  (02.54%) N
  HMM-LSTM (94.82%)                 ---         (03.40%) N  (00.73%) N
  HMM-SDA (98.22%)                              ---         (04.13%) N
  IPCA-LSTM (94.09%)                                        ---

  OVLP
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (96.86%)     (20.02%) Y  (16.33%) Y  (23.51%) Y  (19.29%) Y
  CNN-MLP (76.84%)      ---         (03.69%) N  (03.49%) Y  (00.73%) N
  HMM-LSTM (80.53%)                 ---         (07.18%) Y  (02.96%) N
  HMM-SDA (73.35%)                              ---         (04.22%) Y
  IPCA-LSTM (77.57%)                                        ---

  TAES
                        CNN-MLP     HMM-LSTM    HMM-SDA     IPCA-LSTM
  CNN-LSTM (95.24%)     (31.21%) Y  (24.83%) Y  (29.20%) Y  (28.60%) Y
  CNN-MLP (64.03%)      ---         (06.38%) N  (02.01%) Y  (02.61%) N
  HMM-LSTM (70.41%)                 ---         (04.37%) Y  (03.77%) N
  HMM-SDA (66.04%)                              ---         (00.60%) Y
  IPCA-LSTM (66.64%)                                        ---

Table 9, for specificity, shows a different trend than for sensitivity: EPOCH fails to reject the null hypothesis for all the systems. Since the specificity is calculated from TN and FP values, for an evaluation set of duration ~167 hours with an epoch size of 0.25 sec, a few thousand seconds of FPs do not make a significant difference in specificity. This can also be observed directly in Table 2, where the specificity of all the systems according to EPOCH is always greater than 90%. The huge difference between the durations of background and seizure events is the primary reason for such high specificities. On the other hand, OVLP and TAES completely agree with each other's Z-test results for specificity.

5. Conclusions

Standardization of scoring metrics is an extremely important step for a research community to take in order to make progress on machine learning problems such as automatic interpretation of EEGs. There has been a lack of standardization in most bioengineering fields. Popular metrics such as sensitivity and specificity do not completely characterize the problem, and they neglect the importance that FA rate plays in achieving clinically acceptable solutions. In this paper, we have compared several popular scoring metrics and demonstrated the value of considering the accuracy of time alignments in the overall assessment of a system.
We have proposed a new metric, TAES scoring, which is consistent with popular scoring approaches such as OVLP but provides more accurate assessments by producing fractional scores for recognition of events based on the degree of match in the time alignments. We have also demonstrated the efficacy of an existing metric, ATWV, that is popular in the speech recognition community. We have not discussed the extent to which these metrics can be tuned by weighting various types of errors based on feedback from clinicians and other 'customers' of the technology. Optimization of the metric is a research problem in itself, since many considerations, including usability of the technology and a broad range of applications, must be involved in this process. Our informal attempts to optimize ATWV and OVLP for seizure detection have not yet produced significantly different results than what was presented here. Feedback from clinicians has been consistent that FA rate is perhaps the single most important measure once sensitivity is above approximately 75%. As we move more technology into operational environments, we expect to have more to contribute to this research topic.

Finally, the Python implementation of these metrics is available at the project web site: https://www.isip.piconepress.com/projects/tuh_eeg/downloads/nedc_eval_eeg. Readers are encouraged to refer to the software for detailed questions about the specific implementations of these algorithms and the tunable parameters available.

Acknowledgments

Research reported in this publication was most recently supported by the National Human Genome Research Institute of the National Institutes of Health under award number U01HG008468. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
This material is also based in part upon work supported by the National Science Foundation under Grant No. IIP-1622765. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] Yamada, T., & Meng, E. (2017). Practical Guide for Clinical Neurophysiologic Testing: EEG (E. Meng, Ed.). Philadelphia, Pennsylvania, USA: Lippincott Williams & Wilkins. https://doi.org/10.1111%2Fj.1468-1331.2009.02936.x.

[2] Scheuer, M. L., Bagic, A., & Wilson, S. B. (2017). Spike detection: Inter-reader agreement and a statistical Turing test on a large data set. Clinical Neurophysiology, 128(1), 243–250. https://doi.org/10.1016/j.clinph.2016.11.005.

[3] Gadhoumi, K., Lina, J.-M., Mormann, F., & Gotman, J. (2016). Seizure prediction for therapeutic devices: A review. Journal of Neuroscience Methods, 260(Supplement C), 270–282. https://doi.org/10.1016/j.jneumeth.2015.06.010.

[4] Wilson, S. B., Scheuer, M. L., Plummer, C., Young, B., & Pacia, S. (2003). Seizure detection: correlation of human experts. Clinical Neurophysiology, 114(11), 2156–2164. https://doi.org/10.1016/S1388-2457(03)00212-8.

[5] Gotman, J., Flanagan, D., Zhang, J., & Rosenblatt, B. (1997). Automatic seizure detection in the newborn: Methods and initial evaluation. Electroencephalography and Clinical Neurophysiology, 103(3), 356–362. https://doi.org/10.1016/S0013-4694(97)00003-9.

[6] Gotman, J. (1982). Automatic recognition of epileptic seizures in the EEG. Electroencephalography and Clinical Neurophysiology, 54(5), 530–540. https://doi.org/10.1016/0013-4694(82)90038-4.

[7] Cvach, M. M. (2014). Managing hospital alarms.
Nursing Critical Care, 9(3), 13–27. https://doi.org/10.1097/01.CCN.0000446255.81392.b0.

[8] Bridi, A. C., Louro, T. Q., & Da Silva, R. C. L. (2014). Clinical alarms in intensive care: implications of alarm fatigue for the safety of patients. Revista Latino-Americana de Enfermagem, 22(6), 1034. https://doi.org/10.1590/0104-1169.3488.2513.

[9] Hu, P. (2015). Reducing False Alarms in Critical Care. Presented at the Working Group on Neurocritical Care Informatics, Neurocritical Care Society Annual Meeting. Scottsdale, Arizona, USA. Not available online.

[10] Hambling, B. (2013). User Acceptance Testing: A Step-by-Step Guide (P. van Goethem, Ed.). Swindon, United Kingdom: BCS Learning & Development Limited. https://www.amazon.com/User-Acceptance-Testing-Step-Step/dp/1780171676.

[11] Banchs, R., Bonafonte, A., & Perez, J. (2006). Acceptance Testing of a Spoken Language Translation System. Proceedings of LREC (p. 106). Genoa, Italy. http://www.lrec-conf.org/proceedings/lrec2006/pdf/60_pdf.pdf.

[12] Picone, J., Doddington, G., & Pallett, D. (1990). Phone-mediated word alignment for speech recognition evaluation. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(3), 559–562. https://doi.org/10.1109/29.106877.

[13] Michel, M., Joy, D., Fiscus, J. G., Manohar, V., Ajot, J., & Barr, B. (2017). Framework for Detection Evaluation (F4DE). https://github.com/usnistgov/F4DE.

[14] Altman, D. G., & Bland, J. M. (1994). Diagnostic Tests 1: Sensitivity and Specificity. British Medical Journal, 308(6943), 1552. https://doi.org/10.1136/bmj.308.6943.1552.

[15] Wozencraft, J. M., & Jacobs, I. M. (1965). Principles of Communication Engineering. New York City, New York, USA: Wiley. https://books.google.com/books/about/Principles_of_communication_engineering.html?id=4ORSAAAAMAAJ.
[16] Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997). The DET curve in assessment of detection task performance. Proceedings of Eurospeech (pp. 1895–1898). Rhodes, Greece. https://doi.org/10.1.1.117.4489.
[17] Wang, Y.-Y., Acero, A., & Chelba, C. (2003). Is word error rate a good indicator for spoken language understanding accuracy. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 577–582). Saint Thomas, Virgin Islands: IEEE. https://doi.org/10.1109/ASRU.2003.1318504.
[18] Mostefa, D., Hamin, O., & Choukri, K. (2006). Evaluation of automatic speech recognition and speech language translation within TC-STAR: Results from the first evaluation campaign. Proceedings of the International Conference on Language Resources and Evaluation (pp. 149–154). Genoa, Italy. http://lrec-conf.org/proceedings/lrec2006/pdf/813_pdf.pdf.
[19] Wilson, S. B., Scheuer, M. L., Plummer, C., Young, B., & Pacia, S. (2003). Seizure detection: correlation of human experts. Clinical Neurophysiology, 114(11), 2156–2164. https://doi.org/10.1016/S1388-2457(03)00212-8.
[20] Kelly, K. M., Shiau, D. S., Kern, R. T., Chien, J. H., Yang, M. C. K., Yandora, K. A., Valeriano, J. P., Halford, J. J., & Sackellares, J. C. (2010). Assessment of a scalp EEG-based automated seizure detection system. Clinical Neurophysiology, 121(11), 1832–1843. https://doi.org/10.1016/j.clinph.2010.04.016.
[21] Baldassano, S., Wulsin, D., Ung, H., Blevins, T., Brown, M.-G., Fox, E., & Litt, B. (2016). A novel seizure detection algorithm informed by hidden Markov model event states. Journal of Neural Engineering, 13(3), 036011. https://doi.org/10.1088/1741-2560/13/3/036011.
[22] Winterhalder, M., Maiwald, T., Voss, H. U., Aschenbrenner-Scheibe, R., Timmer, J., & Schulze-Bonhage, A. (2003). The seizure prediction characteristic: a general framework to assess and compare seizure prediction methods. Epilepsy and Behavior, 4(3), 318–325. https://doi.org/10.1016/S1525-5050(03)00105-7.
[23] Wegmann, S., Faria, A., Janin, A., Riedhammer, K., & Morgan, N. (2013). The TAO of ATWV: Probing the mysteries of keyword search performance. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Olomouc, Czech Republic: IEEE. https://doi.org/10.1109/ASRU.2013.6707728.
[24] Fiscus, J., Ajot, J., Garofolo, J., & Doddington, G. (2007). Results of the 2006 Spoken Term Detection Evaluation. Proceedings of the SIGIR 2007 Workshop: Searching Spontaneous Conversational Speech (pp. 45–50). Amsterdam, Netherlands. https://www.nist.gov/publications/results-2006-spoken-term-detection-evaluation.
[25] Japkowicz, N., & Shah, M. (2014). Evaluating Learning Algorithms: A Classification Perspective. New York City, New York, USA: Cambridge University Press. https://doi.org/10.1017/CBO9780511921803.
[26] Confusion matrix. (n.d.). Retrieved October 31, 2017. https://en.wikipedia.org/wiki/Confusion_matrix.
[27] Liu, A., Hahn, J. S., Heldt, G. P., & Coen, R. W. (1992). Detection of neonatal seizures through computerized EEG analysis. Electroencephalography and Clinical Neurophysiology, 82(2), 32–37. https://doi.org/10.1016/0013-4694(92)90179-L.
[28] Navakatikyan, M. A., Colditz, P. B., Burke, C. J., Inder, T. E., Richmond, J., & Williams, C. E. (2006). Seizure detection algorithm for neonates based on wave-sequence analysis. Clinical Neurophysiology, 117(6), 1190–1203. https://doi.org/10.1016/j.clinph.2006.02.016.
[29] Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2017). The Microsoft 2017 Conversational Speech Recognition System. https://arxiv.org/abs/1708.06073.
[30] Gotman, J., Flanagan, D., Zhang, J., & Rosenblatt, B. (1997). Automatic seizure detection in the newborn: Methods and initial evaluation. Electroencephalography and Clinical Neurophysiology, 103(3), 356–362. https://doi.org/10.1016/S0013-4694(97)00003-9.
[31] Wilson, S. B., Scheuer, M. L., Plummer, C., Young, B., & Pacia, S. (2003). Seizure detection: correlation of human experts. Clinical Neurophysiology, 114(11), 2156–2164. https://doi.org/10.1016/S1388-2457(03)00212-8.
[32] Mason, S. J., & Graham, N. E. (2002). Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 128(584), 2145–2166. https://doi.org/10.1256/003590002320603584.
[33] Hajian-Tilaki, K. (2013). Receiver Operating Characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian Journal of Internal Medicine, 4(2), 627–635. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3755824/.
[34] Fiscus, J. G., & Chen, N. (2013). Overview of the NIST Open Keyword Search 2013 Evaluation Workshop. Bethesda, Maryland, USA. http://ws680.nist.gov/publication/get_pdf.cfm?pub_id=914517.
[35] National Institute of Standards and Technology (NIST) (2017). Speech Recognition Scoring Toolkit. https://github.com/usnistgov.
[36] McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276–282. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/.
[37] National Institute of Standards and Technology (NIST) (2017). Framework for Detection Evaluation. https://github.com/usnistgov/F4DE.
[38] Halford, J. J., Shiau, D., Desrochers, J. A., Kolls, B. J., Dean, B. C., Waters, C. G., … LaRoche, S. M. (2015). Inter-rater agreement on identification of electrographic seizures and periodic discharges in ICU EEG recordings. Clinical Neurophysiology, 126(9), 1661–1669. https://doi.org/10.1016/j.clinph.2014.11.008.
[39] Golmohammadi, M., Shah, V., Lopez, S., Ziyabari, S., Yang, S., Camaratta, J., … Picone, J. (2017). The TUH EEG Seizure Corpus. Proceedings of the American Clinical Neurophysiology Society Annual Meeting (p. 1). Phoenix, Arizona, USA: American Clinical Neurophysiology Society. https://www.isip.piconepress.com/publications/conference_presentations/2017/acns/tuh_eeg_seizures.
[40] Golmohammadi, M., Harati Nejad Torbati, A., Lopez, S., Obeid, I., & Picone, J. (2018). Automatic analysis of EEGs using big data and hybrid deep learning architectures. Submitted to Frontiers in Human Neuroscience, 1–30. https://www.isip.piconepress.com/publications/unpublished/journals/2018/frontiers_neuroscience/hybrid/.
[41] Golmohammadi, M., Ziyabari, S., Shah, V., Lopez, S., Obeid, I., & Picone, J. (2018). Deep architectures for spatio-temporal modeling: Automated seizure detection in scalp EEGs. Submitted to the International Conference on Machine Learning (ICML) (pp. 1–9). Stockholm, Sweden. https://www.isip.piconepress.com/publications/unpublished/conferences/2018/icml/.