Deep Architectures for Automated Seizure Detection in Scalp EEGs
Authors: Meysam Golmohammadi, Saeedeh Ziyabari, Vinit Shah
Abstract

Automated seizure detection using clinical electroencephalograms is a challenging machine learning problem because the multichannel signal often has an extremely low signal-to-noise ratio. Events of interest such as seizures are easily confused with signal artifacts (e.g., eye movements) or benign variants (e.g., slowing). Commercially available systems suffer from unacceptably high false alarm rates. Deep learning algorithms that employ high-dimensional models have not previously been effective due to the lack of big data resources. In this paper, we use the TUH EEG Seizure Corpus to evaluate a variety of hybrid deep structures including Convolutional Neural Networks and Long Short-Term Memory Networks. We introduce a novel recurrent convolutional architecture that delivers 30% sensitivity at 7 false alarms per 24 hours. We have also evaluated our system on a held-out evaluation set based on the Duke University Seizure Corpus and demonstrate that performance trends are similar to those on the TUH EEG Seizure Corpus. This is a significant finding because the Duke corpus was collected with different instrumentation and at different hospitals. Our work shows that deep learning architectures that integrate spatial and temporal contexts are critical to achieving state-of-the-art performance and will enable a new generation of clinically acceptable technology.

Introduction

Electroencephalograms (EEGs) are used in a wide range of clinical settings to record electrical activity along the scalp. Scalp EEGs are the primary means by which physicians diagnose brain-related illnesses such as epilepsy and seizures (Obeid and Picone 2017). However, manual analysis of EEG signals requires a highly trained board-certified neurophysiologist, and is a process that is known to have relatively low inter-rater agreement (IRA) (Swisher et al.
2015). It is a time-consuming and expensive process since the volume and velocity of the data far exceed the available resources for detailed interpretation in real time. Automated analysis can improve the quality of patient care by reducing manual error and latency. In this paper, we focus on the specific problem of seizure detection, though the work presented here is also applicable to other EEG problems such as signal event detection (Harati et al. 2016) and abnormal EEG detection (Lopez et al. 2015). Like most machine learning problems of this nature, many algorithms have been applied, including time-frequency analysis methods (Gotman 1982) and nonlinear techniques (Schad et al. 2008). Despite much research progress, commercially available automated EEG analysis systems are impractical due to high false detection rates (Ramgopal 2014). Servicing false alarms in critical care settings places too much of a cognitive burden on caregivers, and hence the outputs from these systems are ignored (Christensen et al. 2014). This creates quality-of-healthcare issues as well as cost and efficiency challenges. Although contemporary approaches for automatic interpretation of EEGs have employed modern machine learning approaches (Alotaiby et al. 2014), deep learning algorithms that employ high-dimensional models have not previously been utilized because there has been a lack of big data resources. A significant resource (Golmohammadi et al. 2017), known as the TUH EEG Seizure Corpus (TUSZ), has recently become available for EEG interpretation, creating a unique opportunity to advance technology. The goal of this work is to demonstrate that advanced deep learning approaches that have been successful in tasks like image processing and speech recognition, where ample amounts of annotated training data are available, can be applied to EEG interpretation.
To achieve this goal, we evaluated several deep learning architectures on a standard seizure detection task. We propose a novel deep learning architecture that reduces the false alarm rate while maintaining sensitivity and specificity. We demonstrate that the performance of this system is now approaching clinical acceptance.

Exploiting Spatio-Temporal Context

Spatial and temporal context are required for accurate disambiguation of seizures from artifacts (Obeid and Picone 2017). In Figure 1, we show our generic architecture for processing EEG signals. The multichannel signal is sampled at 250 Hz using 16 bits of resolution, converted to a feature-based representation, processed through a sequential modeler, and then postprocessed using a variety of statistical models that impose constraints based on subject matter expertise. Several architectures that implement Gaussian Mixture Models (GMMs), hidden Markov models (HMMs) and deep learning (DL) have been evaluated. Feature extraction, which is not the primary focus of this paper, typically relies on time-frequency representations of the signal. Though we can replace traditional model-based feature extraction with deep learning-based approaches that operate directly on the sampled data, in this work we focus on the use of traditional cepstral-based features (Picone 1993). The use of more advanced discriminative features (Zhang et al. 2016) has not yet produced substantial improvements in performance for this application. Our system uses a standard linear frequency cepstral coefficient (LFCC) based feature extraction approach (Harati et al. 2015; Lopez et al. 2016). We also use first and second derivatives of the features since these improve performance. Neurologists typically review EEGs in 10 sec windows and identify events with a resolution of approximately 1 sec.
We analyze the signal in 1 sec epochs, and further divide this interval into 10 frames of 0.1 secs each so that features are computed every 0.1 seconds (referred to as the frame duration) using 0.2 second analysis windows (referred to as the window duration). The output of our feature extraction process is a feature vector of dimension 26 for each of 22 channels, with a frame duration of 0.1 secs.

Sequential Decoding Using HMMs

HMMs are among the most powerful statistical modeling tools available today for signals that have both a time and frequency domain component (Picone 1990). HMMs have been used extensively in sequential decoding tasks like speech recognition to model the temporal evolution of the signal. Automated interpretation of EEGs is a problem like speech recognition since both time domain (e.g., spikes) and frequency domain information (e.g., alpha waves) are used to identify critical events (Obeid and Picone 2017). In this study, a left-to-right channel-independent GMM-HMM, as illustrated in Figure 1, was used as a baseline system for sequential decoding. HMMs are attractive because training is much faster than for comparable deep learning systems, and HMMs tend to work well when ample amounts of annotated data are available. We divide each channel of an EEG into 1-second epochs, and further subdivide these epochs into a sequence of frames. Each epoch is classified using an HMM trained on the subdivided epoch, and then these epoch-based decisions are postprocessed by additional statistical models in a process that parallels the language modeling component of a speech recognizer. Standard three-state left-to-right HMMs (Picone 1990) with 8 Gaussian mixture components per state were used to model each channel of the 22-channel signal. A diagonal covariance matrix assumption was used for each mixture component.
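The framing scheme above (1 sec epochs, 0.1 sec frames, 0.2 sec analysis windows, 26 features per frame for each of 22 channels) can be sketched as follows. The `extract_features` function is a stand-in that only reproduces the shapes; it is an illustrative assumption, not the LFCC front end used in the paper:

```python
import numpy as np

FS = 250          # sample rate (Hz)
FRAME_DUR = 0.1   # frame duration (secs)
WIN_DUR = 0.2     # analysis window duration (secs)
N_CHANNELS = 22
N_FEATURES = 26   # feature vector dimension per frame

def frame_signal(x, fs=FS, frame_dur=FRAME_DUR, win_dur=WIN_DUR):
    """Slice one channel into overlapping analysis windows, one per frame."""
    step, win = int(fs * frame_dur), int(fs * win_dur)
    n_frames = len(x) // step                      # 10 frames per 1-sec epoch
    pad = max(0, (n_frames - 1) * step + win - len(x))
    x = np.pad(x, (0, pad))                        # zero-pad the final window
    return np.stack([x[i * step:i * step + win] for i in range(n_frames)])

def extract_features(x):
    """Placeholder front end: map each 0.2 sec window to a 26-dim vector.
    In the real system these would be LFCCs plus first/second derivatives."""
    windows = frame_signal(x)
    return np.random.randn(len(windows), N_FEATURES)

epoch = np.random.randn(N_CHANNELS, FS)            # one 1-sec, 22-channel epoch
feats = np.stack([extract_features(ch) for ch in epoch])
print(feats.shape)                                 # (22, 10, 26)
```

The resulting tensor has one 26-dimensional feature vector per frame, per channel, matching the dimensions stated above.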
Channel-independent models were trained since channel-dependent models did not provide any improvement in performance. Supervised training based on the Baum-Welch reestimation algorithm was used to train two models: seizure and background. Models were trained on segments of data containing seizures based on manual annotations that are available as part of TUSZ.

Figure 1: A hybrid architecture for automatic interpretation of EEGs that integrates temporal and spatial context for sequential decoding of EEG events is shown. Two levels of postprocessing are used.

Since seizures comprise a small percentage of the overall data (3% in the training set; 8% in the evaluation set), the amount of non-seizure data was limited to be comparable to the amount of seizure data, and the non-seizure data was selected to include a rich variety of artifacts such as muscle and eye movements. Twenty iterations of Baum-Welch were used, though performance is not very sensitive to this value. Standard Viterbi decoding (no beam search) was used in recognition to estimate the model likelihoods for every epoch of data (the entire file was not decoded as one stream because of the imbalance between the seizure and background classes; decoding was restarted for each epoch). The output of the epoch-based decisions was postprocessed by a deep learning system. Our baseline system used a Stacked denoising Autoencoder (SdA) (Vincent et al. 2008) as shown in Figure 1. SdAs are an extension of stacked autoencoders and are a class of deep learning algorithms well-suited to learning knowledge representations that are organized hierarchically (Bengio et al. 2007). They also lend themselves to problems involving training data that is sparse, ambiguous or incomplete. Since inter-rater agreement is relatively low for seizure detection (Swisher et al.
2015), it made sense to evaluate this type of algorithm as part of a baseline approach. An N-channel EEG was transformed into N independent feature streams using a standard sliding window-based approach. The hypotheses generated by the HMMs were postprocessed using a second stage of processing that examines the temporal and spatial context. We apply a third pass of postprocessing that uses a stochastic language model to smooth hypotheses involving sequences of events so that we can suppress spurious outputs. This third stage of postprocessing provides a moderate reduction in false alarms. Training of SdA networks is done in two steps: (1) pre-training in a greedy layer-wise approach (Bengio et al. 2007) and (2) fine-tuning by adding a logistic regression layer on top of the network (Hinton et al. 2006). The output of the first stage of processing is a vector of two likelihoods for each channel at each epoch. Therefore, if we have 22 channels, which is typical for an EEG collected using a standard 10/20 configuration (Obeid and Picone 2016), and 2 classes (seizure and background), we will have a vector of dimension 2 × 22 = 44 for each epoch. Each of these scores is independent of the spatial context (other EEG channels) and the temporal context (past or future epochs). To incorporate context, we form a supervector consisting of N epochs in time using a sliding window approach. We find benefit in making N large (typically 41). This results in a vector of dimension 1,804 that needs to be processed each epoch. The input dimensionality is too high considering the amount of manually labeled data available for training and the computational requirements. To deal with this problem we used Principal Components Analysis (PCA) (Ross et al. 2008) to reduce the dimensionality to 20 before applying the SdA postprocessing.
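The supervector construction above (44 scores per epoch, a window of N = 41 epochs giving 41 × 44 = 1,804 dimensions, reduced to 20 with PCA) can be sketched as follows. Random scores stand in for real HMM likelihoods, and the SVD-based PCA is a generic implementation, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_epochs, n_scores, N, n_out = 500, 44, 41, 20   # 44 = 2 classes x 22 channels

scores = rng.random((n_epochs, n_scores))        # per-epoch first-stage scores

# Form one supervector per epoch by stacking a centered window of N epochs
# (edge-padding at the file boundaries is an illustrative assumption).
half = N // 2
padded = np.pad(scores, ((half, half), (0, 0)), mode="edge")
supervectors = np.stack([padded[i:i + N].ravel() for i in range(n_epochs)])
print(supervectors.shape)                        # (500, 1804)

# PCA via SVD on the centered supervectors, keeping the top 20 components.
mu = supervectors.mean(axis=0)
_, _, vt = np.linalg.svd(supervectors - mu, full_matrices=False)
reduced = (supervectors - mu) @ vt[:n_out].T
print(reduced.shape)                             # (500, 20)
```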
The parameters of the SdA model are optimized to minimize the average reconstruction error using a cross-entropy loss function. In the optimization process, a variant of stochastic gradient descent called minibatch stochastic gradient descent (MSGD) (Zinkevich et al. 2010) is used. MSGD works identically to stochastic gradient descent, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers. The SdA network has three hidden layers with corruption levels of 0.3 for each layer. The number of nodes per layer are: first layer = 800, second layer = 500, third layer = 300. The parameters for pre-training are: learning rate = 0.5, number of epochs = 150, batch size = 300. The parameters for fine-tuning are: learning rate = 0.1, number of epochs = 300, batch size = 100. The overall result of the second stage is a probability vector of dimension two containing the likelihood that each label could have occurred in the epoch. A soft decision paradigm is used rather than a hard decision paradigm because this output is smoothed in the third stage of processing. A more detailed explanation of the third pass of processing is presented in (Harati et al. 2016).

Context Modeling Using LSTMs

To improve our ability to model context, a hybrid system composed of an HMM and a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) was implemented. These networks are a special kind of recurrent neural network (RNN) architecture that is capable of learning long-term dependencies, and can bridge time intervals exceeding 1,000 steps even for noisy, incompressible input sequences. This is achieved by multiplicative gate units that learn to open and close access to the constant error flow.
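The multiplicative gating described above can be illustrated with a single LSTM step in numpy. The weights are random placeholders and the dimensions (input size 20, hidden size 32) are illustrative assumptions, not the trained system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_h = 20, 32
W = 0.1 * rng.standard_normal((4, n_h, n_in + n_h))  # input/forget/output/cell
b = np.zeros((4, n_h))

def lstm_step(x, h, c):
    """Advance the cell one time step."""
    z = np.concatenate([x, h])
    i, f, o = (sigmoid(W[k] @ z + b[k]) for k in range(3))  # gates in (0, 1)
    g = np.tanh(W[3] @ z + b[3])                            # candidate update
    c = f * c + i * g            # gates open/close access to the cell state
    h = o * np.tanh(c)           # output gate controls what is exposed
    return h, c

h = c = np.zeros(n_h)
for x in rng.standard_normal((5, n_in)):  # a short 5-step input sequence
    h, c = lstm_step(x, h, c)
print(h.shape)                            # (32,)
```

Because the forget gate multiplies the previous cell state rather than squashing it through a nonlinearity, error can flow across many time steps, which is what allows LSTMs to bridge the long intervals mentioned above.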
Like the HMM/SdA hybrid approach previously described, the output of the first pass is a vector of dimension 2 × 22 × the window length. Therefore, we also use PCA before the LSTM to reduce the dimensionality of the data to 20. For this study, we used a window length of 41 for the LSTM, and this layer is composed of one hidden layer with 32 nodes. The output layer nodes in this LSTM level use a sigmoid function. The parameters of the models are optimized to minimize the error using a cross-entropy loss function. Adaptive Moment Estimation (Adam) (Kingma and Ba 2015) is used in the optimization process. To explore the potential of LSTMs to encode long-term dependencies, we designed another architecture in which Incremental Principal Components Analysis (IPCA) was used for dimensionality reduction (Ross et al. 2008; Levy and Lindenbaum 2000). LSTM networks that operate directly on features spanning long periods of time need more memory-efficient approaches. IPCA has constant memory complexity, on the order of the batch size, enabling use of a large dataset without loading the entire dataset into memory. IPCA builds a low-rank approximation for the input data using an amount of memory that is independent of the number of input data samples. It is still dependent on the input data features, but changing the batch size allows for control of memory usage. The architecture of our IPCA/LSTM system is presented in Figure 2. In the IPCA/LSTM system, samples are converted to features by our standard feature extraction method previously described. Next, the features are delivered to an IPCA layer for spatial context analysis and dimensionality reduction. The output of IPCA is delivered to a one-layer LSTM for classification.
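The batch-wise IPCA described above can be sketched with scikit-learn's IncrementalPCA. The feature dimension is scaled down here (200 instead of the full supervector length) to keep the example fast; the batch size of 50 and output dimension of 25 follow the text, and the random data is a placeholder for real features:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
n_samples, n_dims, batch_size, n_out = 1000, 200, 50, 25

ipca = IncrementalPCA(n_components=n_out, batch_size=batch_size)

# Stream the data through partial_fit one batch at a time, so memory use
# depends on the batch size rather than the number of samples.
for _ in range(n_samples // batch_size):
    batch = rng.standard_normal((batch_size, n_dims))
    ipca.partial_fit(batch)

projected = ipca.transform(rng.standard_normal((batch_size, n_dims)))
print(projected.shape)   # (50, 25)
```

Note that each partial_fit batch must contain at least as many samples as the number of retained components, which is why the batch size (50) exceeds the output dimension (25).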
The input to IPCA has a dimension that is the product of the number of channels, the feature vector length, the number of frames per second and the window duration (in seconds). We typically use a 7-second window duration, so the IPCA input is a vector of dimension 22 × 26 × 7 × 10 = 40,040. A batch size of 50 is used in IPCA and the output dimension is 25. In order to learn long-term dependencies, one LSTM with a hidden layer size of 128 and a batch size of 128 is used along with Adam optimization and a cross-entropy loss function.

Two-Dimensional Decoding Using CNNs

Convolutional Neural Networks (CNNs) have delivered state-of-the-art performance on highly challenging tasks such as speech (Saon et al. 2016) and image recognition (Simonyan and Zisserman 2014). CNNs are usually comprised of convolutional layers along with subsampling layers which are followed by one or more fully connected layers. In the case of two-dimensional CNNs that are common in image and speech recognition, the input to a convolutional layer is W × H × N data (e.g., an image), where W and H are the width and height of the input data, and N is the number of channels (e.g., in an RGB image, N = 3). The convolutional layer will have K filters (or kernels) of size M × N × Q, where M and N are smaller than the dimensions of the data and Q is typically smaller than the number of channels. In this way, CNNs have a large learning capacity that can be controlled by varying their depth and breadth to produce K feature maps of size (W - M + 1) × (H - N + 1). Each map is then subsampled with max pooling over P × P contiguous regions. An additive nonlinearity is applied to each feature map either before or after the subsampling layer. The overall architecture of a system that combines a CNN and a multi-layer perceptron (MLP) (Simonyan and Zisserman 2014) is shown in Figure 3.
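The feature-map bookkeeping above reduces to two small formulas: a valid (unpadded) convolution with kernels of size M × N produces maps of size (W - M + 1) × (H - N + 1), and P × P max pooling divides each spatial dimension by P. A minimal sketch, with the worked numbers chosen for illustration:

```python
def conv_output(w, h, m, n, k):
    """Shape of K feature maps after a valid M x N convolution (stride 1)."""
    return (w - m + 1, h - n + 1, k)

def pool_output(w, h, k, p):
    """Shape after non-overlapping P x P max pooling."""
    return (w // p, h // p, k)

# Example: a 70 x 22 input filtered by 16 kernels of size 3 x 3, then
# pooled 2 x 2 (dimensions chosen to match the CNN input described above).
shape = conv_output(70, 22, 3, 3, 16)
print(shape)                      # (68, 20, 16)
print(pool_output(*shape, p=2))   # (34, 10, 16)
```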
The network contains six convolutional layers, three max pooling layers and two fully connected layers. A rectified linear unit (ReLU) nonlinearity is applied to the output of every convolutional and fully connected layer (Nair and Hinton 2010). Drawing on an image classification analogy, each image is a signal where the width of the image (W) is the window length multiplied by the number of samples per second, the height of the image (H) is the number of EEG channels, and the number of image channels (N) is the length of the feature vector. In our optimized system, a window duration of 7 seconds is used. The first convolutional layer filters the input of size 70 × 22 × 26 using 16 kernels of size 3 × 3 with a stride of 1. The second convolutional layer filters its input using 16 kernels of size 3 × 3 with a stride of 1. The first max pooling layer takes as input the output of the second convolutional layer and applies a pooling size of 2 × 2. This process is repeated two more times with 32 and 64 kernels. Next, a fully connected layer with 512 neurons is applied and the output is fed to a 2-way sigmoid function which produces a two-class decision (the final epoch label).

Recurrent Convolutional Neural Networks

In our final architecture, which is shown in Figure 4, we integrate 2D CNNs, 1D CNNs and LSTM networks, which we refer to as a CNN/LSTM, to better exploit long-term dependencies. Note that the way we handle data in the CNN/LSTM is different from the CNN/MLP system presented in Figure 3. Drawing on a video classification analogy, the input data is composed of frames distributed in time, where each frame is an image of width (W) equal to the length of a feature vector, height (H) equal to the number of EEG channels, and number of image channels (N) equal to one.
The input data then consists of T frames, where T is equal to the window length multiplied by the number of samples per second. In our optimized system with a window duration of 21 seconds, the first 2D convolutional layer filters 210 frames (T = 21 × 10) of EEGs distributed in time, each with a size of 26 × 22 × 1 (W = 26, H = 22, N = 1), using 16 kernels of size 3 × 3 with a stride of 1. The first 2D max pooling layer takes as input the 210 frames distributed in time, each with a size of 26 × 22 × 16, and applies a pooling size of 2 × 2. This process is repeated two times with two 2D convolutional layers with 32 and 64 kernels of size 3 × 3 respectively and two 2D max pooling layers with a pooling size of 2 × 2. The output of the third max pooling layer is flattened to 210 frames with a size of 384 × 1. Then a 1D convolutional layer filters the output of the flattening layer using 16 kernels of size 3, which decreases the dimensionality in space to 210 × 16. Then we apply a 1D max pooling layer with a size of 8 to decrease the dimensionality to 26 × 16. This is the input to a deep bidirectional LSTM network where the dimensionality of the output space is 128 and 256. The output of the last bidirectional LSTM layer is fed to a 2-way sigmoid function which produces a final classification of an epoch.

Figure 2: An architecture that integrates IPCA for spatial context analysis and LSTM for learning long-term temporal dependencies.

Figure 3: Two-dimensional decoding of EEG signals using a CNN/MLP hybrid architecture is shown that consists of six convolutional layers, three max pooling layers and two fully connected layers.

Figure 4: A deep recurrent convolutional architecture for two-dimensional decoding of EEG signals that integrates 2D CNNs, 1D CNNs and LSTM networks is shown.
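The shape bookkeeping of the CNN/LSTM pipeline can be traced in a few lines of pure Python. Reproducing the stated sizes requires assuming "same" padding for the convolutions and floor division for the pooling layers (the paper does not state these explicitly):

```python
def trace_shapes(t=210, w=26, h=22):
    """Trace tensor shapes through the CNN/LSTM front end described above."""
    shapes = [("input frames", (t, w, h, 1))]
    for k in (16, 32, 64):                   # three conv + 2x2 pool stages
        shapes.append((f"conv2d ({k} kernels)", (t, w, h, k)))  # same padding
        w, h = w // 2, h // 2                # 2x2 max pooling
        shapes.append(("maxpool 2x2", (t, w, h, k)))
    flat = w * h * 64                        # 3 * 2 * 64 = 384 per frame
    shapes.append(("flatten", (t, flat)))
    shapes.append(("conv1d (16 kernels)", (t, 16)))
    t //= 8                                  # 1D max pooling of size 8
    shapes.append(("maxpool 8", (t, 16)))
    return shapes

for name, shape in trace_shapes():
    print(f"{name:22s} {shape}")
```

The trace ends at (26, 16), the 26-step, 16-channel sequence that feeds the bidirectional LSTM layers.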
To overcome the problem of overfitting and force the system to learn more robust features, dropout and Gaussian noise layers are used between layers (Srivastava et al. 2014). To increase non-linearity, Exponential Linear Units (ELUs) are used (Clevert et al. 2017). Adam is used in the optimization process along with a mean squared error loss function.

Experiments

A major challenge in automatic seizure detection is the lack of big data resources that can be used to train sophisticated statistical models. Inter-rater agreement for this task is low, especially when considering short seizure events (Obeid and Picone 2017). Manual annotation of a large amount of data by a team of certified neurologists is extremely expensive and time-consuming. In this study, we report results for the first time on TUSZ and a comparable corpus, DUSZ, from Duke University (Swisher et al. 2015). TUSZ was used as the training and test set corpus, while DUSZ was used as a held-out evaluation set. It is important to note that TUSZ was collected using several generations of Natus EEG equipment, while DUSZ was collected using Nihon Kohden equipment. Hence, this is a true open-set evaluation since the data were collected under completely different recording conditions. A summary of these corpora is shown in Table 1. A comparison of the performance of the different architectures presented in this paper, for sensitivities in the range of 30%, is shown in Table 2. The related DET curve is illustrated in Figure 5. These systems were evaluated using a method of scoring popular in the EEG research community known as the overlap method (Wilson et al. 2003). True positives (TP) are defined as the number of epochs identified as a seizure in the reference annotations and correctly labeled as a seizure by the system.
True negatives (TN) are defined as the number of epochs correctly identified as non-seizures. False positives (FP) are defined as the number of epochs incorrectly labeled as seizure, while false negatives (FN) are defined as the number of epochs incorrectly labeled as non-seizure. Sensitivity shown in Table 2 is computed as TP/(TP+FN). Specificity is computed as TN/(TN+FP). The false alarm rate is the number of FPs per 24 hours. It is important to note that the results are much lower than what is often published in the literature on other seizure detection tasks. This is due to a combination of factors: (1) the neuroscience community has favored a more permissive method of scoring that tends to produce much higher sensitivities and lower false alarm rates; and (2) TUSZ is a much more difficult task than any corpus previously released as open source. The evaluation set was designed to be representative of common clinical issues and includes many challenging examples of seizures. Also, note that the HMM baseline system, which is shown in the first row of Table 2, operates on each channel independently. The other methods consider all channels simultaneously by using a supervector that is a concatenation of the feature vectors for all channels. The baseline HMM system only classifies epochs (1 sec in duration) using data from within that epoch. It does not look across channels or across multiple epochs when performing epoch-level classification. The results of the hybrid HMM and deep learning structures show that adding a deep learning structure for temporal and spatial analysis of EEGs can decrease the false alarm rate dramatically. Further, by comparing the results of HMM/SdA with HMM/LSTM, we find that a simple one-layer LSTM performs better than 3 layers of SdA due to LSTM's ability to explicitly model long-term dependencies.
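The epoch-level scoring defined above can be sketched directly from the formulas: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and the false alarm rate expressed as false positives per 24 hours of data. The counts in the example are illustrative, not taken from the paper's tables:

```python
def score(tp, tn, fp, fn, epoch_secs=1.0):
    """Compute sensitivity, specificity, and FA/24 hrs from epoch counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    total_hours = (tp + tn + fp + fn) * epoch_secs / 3600.0
    fa_per_24h = fp / total_hours * 24.0
    return sensitivity, specificity, fa_per_24h

# Illustrative counts over roughly 24 hours of 1-sec epochs:
sens, spec, fa = score(tp=300, tn=85000, fp=200, fn=700)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} FA/24h={fa:.1f}")
```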
Note that in this case the complexity and training time of these two systems are comparable.

Table 1: An overview of the TUSZ and DUSZ corpora

  Description           TUSZ Train   TUSZ Eval   DUSZ Eval
  Patients                      64          50          45
  Sessions                     281         229          45
  Files                      1,028         985          45
  Seizure (secs)            17,686      45,649      48,567
  Non-Seizure (secs)       596,696     556,033     599,381
  Total (secs)             614,382     601,682     647,948

Table 2: Performance on TUSZ

  System      Sensitivity   Specificity   FA/24 Hrs.
  HMM              30.32%        80.07%          244
  HMM/SdA          35.35%        73.35%           77
  HMM/LSTM         30.05%        80.53%           60
  IPCA/LSTM        32.97%        77.57%           73
  CNN/MLP          39.09%        76.84%           77
  CNN/LSTM         30.83%        96.86%            7

Figure 5: A DET curve comparing performance on TUSZ

The best overall system is the combination of CNN and LSTM. This doubly deep recurrent convolutional structure models both spatial relationships (e.g., cross-channel dependencies) and temporal dynamics (e.g., spikes). For example, CNN/LSTM does a much better job rejecting artifacts that are easily confused with spikes because these appear on only a few channels, and hence can be filtered based on correlations between channels. The depth of the convolutional network is important since the top convolutional layers tend to learn generic features while the deeper layers learn dataset-specific features. Performance degrades if a single convolutional layer is removed. For example, removing any of the middle convolutional layers results in a loss of about 4% in sensitivity. We have also conducted an evaluation of our CNN/LSTM system on DUSZ. The results are shown in Table 3. A DET curve is shown in Figure 6. At high false positive rates, performance between the two systems is comparable. At low false positive rates, false positives on TUSZ are lower than on DUSZ.
This suggests there is room for additional optimization on DUSZ. In these experiments, we observed that the choice of optimization method had a considerable impact on performance. Our best performing system, CNN/LSTM, was evaluated using a variety of optimization methods, including SGD (Bottou and LeCun 2004), RMSprop (Tieleman and Hinton 2012), Adagrad (Duchi et al. 2011), Adadelta (Zeiler 2013), Adam (Kingma and Ba 2015), Adamax (Kingma and Ba 2015) and Nadam, as shown in Table 4. The best performance is achieved with Adam with a learning rate of α = 0.0005, a learning rate decay of 0.0001, exponential decay rates of β₁ = 0.9 and β₂ = 0.999 for the moment estimates, and a fuzz factor of ε = 10⁻⁸. The parameters follow the notation described in (Kingma and Ba 2015). Table 4 also illustrates that Nadam delivers performance comparable to Adam. Adam combines the advantages of Adagrad, which works well with sparse gradients, and RMSprop, which works well in non-stationary settings. Similarly, we evaluated our CNN/LSTM using different activation functions, as shown in Table 5. ELU delivers a small but measurable increase in sensitivity, and more importantly, a reduction in false alarms. ReLUs and ELUs accelerate learning by decreasing the gap between the normal gradient and the unit natural gradient (Clevert et al. 2017). ELUs push the mean activation towards zero with a significantly smaller computational footprint. Unlike ReLUs, ELUs have a clear saturation plateau in their negative regime, allowing them to learn a more robust and stable representation and making it easier to model dependencies between units.

Conclusions

In this paper, we introduced a variety of deep learning architectures for automatic classification of EEGs, including a hybrid architecture that integrates CNN and LSTM.
While this architecture delivers better performance than other deep structures, its performance still does not meet the needs of clinicians. Human performance on similar tasks is in the range of 65% sensitivity with a false alarm rate of 12 per 24 hours (Swisher et al. 2015). The false alarm rate is particularly important in critical care applications since it impacts the workload experienced by healthcare providers. The primary error modalities observed were false alarms generated during brief delta-range slowing patterns such as intermittent rhythmic delta activity. A variety of these types of artifacts have been observed, mostly during inter-ictal and post-ictal stages. Training models on such events with diverse morphologies has the potential to significantly reduce the remaining false alarms. This is one reason we are continuing our efforts to annotate a larger portion of TUSZ.

Table 3: Performance of CNN/LSTM on DUSZ

  Corpus   Sensitivity   Specificity   FA/24 Hrs.
  TUSZ          30.83%        96.86%            7
  DUSZ          33.71%        70.72%           40

Figure 6: A performance comparison of TUSZ and DUSZ

Table 4: Comparison of optimization algorithms

  Opt.       Sensitivity   Specificity   FA/24 Hrs.
  SGD             23.12%        72.24%           44
  RMSprop         25.17%        83.39%           23
  Adagrad         26.42%        80.42%           31
  Adadelta        26.11%        79.14%           33
  Adam            30.83%        96.86%            7
  Adamax          29.25%        89.64%           18
  Nadam           30.27%        92.17%           14

Table 5: Comparison of activation functions

  Activation   Sensitivity   Specificity   FA/24 Hrs.
  Linear            26.46%        88.48%           25
  Tanh              26.53%        89.17%           21
  Sigmoid           28.63%        90.08%           19
  Softsign          30.05%        90.51%           18
  ReLU              30.51%        94.74%           11
  ELU               30.83%        96.86%            7

References

Alotaiby, T., Alshebeili, S., Alshawi, T., Ahmad, I., & Abd El-Samie, F. 2014.
EEG seizure detection and prediction algorithms: a survey. EURASIP Journal on Advances in Signal Processing, 2014(1), 1–21.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (pp. 153–160). Vancouver, B.C., Canada.

Bottou, L., & LeCun, Y. 2004. Large Scale Online Learning. Advances in Neural Information Processing Systems, 217–225.

Christensen, M., Dodds, A., Sauer, J., & Watts, N. 2014. Alarm setting for the critically ill patient: a descriptive pilot survey of nurses' perceptions of current practice in an Australian Regional Critical Care Unit. Intensive and Critical Care Nursing.

Clevert, D., Unterthiner, T., & Hochreiter, S. 2017. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv preprint.

Duchi, J., Hazan, E., & Singer, Y. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.

Golmohammadi, M., et al. 2017. The TUH EEG Seizure Corpus. In Proceedings of the American Clinical Neurophysiology Society Annual Meeting (p. 1). Phoenix, Arizona, USA: American Clinical Neurophysiology Society.

Gotman, J. 1982. Automatic recognition of epileptic seizures in the EEG. Electroencephalography and Clinical Neurophysiology, 54(5), 530–540.

Harati, A., et al. 2016. Automatic Interpretation of EEGs for Clinical Decision Support. In American Clinical Neurophysiology Society (ACNS) Annual Meeting (p. 1). Orlando, Florida, USA.

Harati, A., et al. 2015. Improved EEG Event Classification Using Differential Energy. In Proc. IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–4). Philadelphia, PA, USA.

Hinton, G. E., Osindero, S., & Teh, Y.-W. 2006. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7), 1527–1554.

Hochreiter, S., & Schmidhuber, J. 1997.
Long s hort-term m e mory. Neural Computation , 9 (8), 1735–80. Kingma, D. P ., & Ba, J. L. 2015. Adam: a M ethod f or Stoch astic Optimization. International Conference on Lea rning R epresenta- tions 2015 , 1–15. Levy, A., & Lind enbaum, M. 2000. Sequential Karhunen-Loeve basis ex traction and its ap plication to images. IEEE Transa ctions on Image P rocessing , 9 (8), 1371–1374. Lopez, S., et al. 2016. An Anal ysis o f Two Co mmon Reference Points for EEGs. In Proc. o f the IEEE Signal P rocessing in Medi- cine and Biology Sympo sium (pp. 1–4). Philadelphia, P A, USA. Lopez, S., et al. 2015. Automated Identification of Abnormal EEGs. In IEE E Signal Processing in Medicine and Biology Sym- posium (pp. 1–4). Phila delphia, PA, USA. Nair, V., & Hinton, G. E . 2010. Re ctified Linear Units Improve Restricted Boltz mann Machines. Proceeding s of the 27th International Co nference on Machine Learning , (3), 807–814. Obeid, I., & Picone, J. 2016. The T emple University Hosp ital E EG Data Corpus. Frontie rs in Neuroscience, Section Neural Technol- ogy , 10 , 196. Obeid, I., and Picone, J. 2017. Machine Learning Approaches to Automatic In terpretation of EEGs. In E . Sejd ik & T . Falk (Ed s.), Biomedical Signal Processing in Big Data (1 st ed., p. N/A). Boca Raton, Florida, US A: CRC Press. Picone, J. 1990. Continuous Speech Recognition Using Hidd en Markov Models. IEEE ASSP Magazine , 7 (3 ), 26–41. Picone, J. 1993. Signal modeling tec hniques in speech recognition. Proceedings of th e IEEE , 81 (9), 1215–1247. Ramgopal, S. 2014. Se izure detection , seizure prediction, and closed-loop w arnin g systems in epilepsy . Epilepsy & B ehavior , 37 , 291–307. Ross, D. A., Lim, J., Lin, R. S., & Yang, M. H. 2008. Increm ental learning for robust visual tracking. In ternational Journal of Com- puter Vision , 77 (1–3), 125–141. Saon, G., Sercu, T., Rennie, S., & Kuo, H. K. J. 2016. T he IBM 2016 English conversational telephone speech recognition system . 
In Proceedings of INTERSPEECH , pp. 7–11. Schad, A., et al. 2 008. Application of a multivariate seizure detec- tion and prediction method to non-invasive and intracranial long- term EE G recordings. Clinical Neurophysiology, 1 19(1), 1 97–211. Simonyan, K., & Zisserman, A. 2014. Very Deep Co nvolutional Networks for Large-Scale I mage Recogniti on. arXiv preprint arXiv:1409.1556. Srivastava, N. , et al. 2014 . Dropo ut: A Simple Way to P revent Neural Netw orks from Over fitting. Journal of Machine Learning Research , 15 , 1929–1958. Swisher, C. B., et al . 2015. Diagnostic Accuracy of Electrographic Seizure Detection by Neurophysiolo gists and Non-Neurophysiol- ogists i n the Adult ICU Using a Panel of Quantitative EE G Trend s. Journal of Clinica l Neurophysiology , 324-330. Tieleman, T., & Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient b y a running average of its recent magnitude. COURSERA: Neural Net works for Machine Learning . Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. 2008. Extracting and composing robust features with denoising a u toen- coders. In Proceeding s of th e 25th Inte rnational Conference o n Machine Learning (pp. 1096–1103). New York, NY, USA. Wilson, S. B., Scheuer, M. L., P lummer, C., You ng, B., & P acia, S. 2003. Seizure detectio n: correlation of hu man experts. Clinical Neurophysiology , 1 14 (11), 2156–2164. Zeiler, M. D. 2013. ADADELTA: An Adaptive Learning Rate Method. IEEE Signa l Processing Letters , 2 0 (12), 1266–1269. Zhang, X., Liang, Y., Zh eng, Y., An, J., & Jiao, L. C. 20 16. Hier- archical Discriminative Feature Learning for H yperspec tral I mage Classification. IEEE Geoscience and Remote Sensing Letters . Zinkevich, M., Weimer, M., Smola, A., & Li, L . 2 010. P arallelized stochastic gradient descent. Advances in Neural I nformation Pro- cessing Systems .
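Appendix: The best-performing activation in Table 5, the ELU (Clevert et al. 2017), differs from the ReLU (Nair & Hinton 2010) only on negative inputs, where it saturates smoothly toward -alpha instead of clamping to zero. The following NumPy sketch illustrates the two functions being compared; it is an illustration of the definitions from the cited papers, not code from the system evaluated here.

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit (Nair & Hinton 2010): max(0, x).
    # Zero gradient for x < 0, identity for x > 0.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Exponential Linear Unit (Clevert et al. 2017):
    #   x                     for x > 0
    #   alpha * (exp(x) - 1)  for x <= 0
    # Negative inputs saturate toward -alpha, which pushes mean
    # activations closer to zero than ReLU does.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print("ReLU:", relu(x))  # negatives clamp to 0, positives pass through
print("ELU: ", elu(x))   # negatives map to exp(x) - 1, positives pass through
```

For x = -2, the ELU gives exp(-2) - 1, roughly -0.86, rather than the hard zero of ReLU; both functions are identical for positive inputs.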