Automatic Analysis of EEGs Using Big Data and Hybrid Deep Learning Architectures


Meysam Golmohammadi a, Amir Hossein Harati Nejad Torbati b, Silvia Lopez de Diego a, Iyad Obeid a, and Joseph Picone a

a The Neural Engineering Data Consortium, Temple University, Philadelphia, Pennsylvania, USA
b Jibo, Inc., Redwood City, California, USA

Abstract

Objective: A clinical decision support tool that automatically interprets EEGs can reduce time to diagnosis and enhance real-time applications such as ICU monitoring. Clinicians have indicated that a sensitivity of 95% with a specificity below 5% was the minimum requirement for clinical acceptance. We propose a high-performance classification system based on principles of big data and machine learning.

Methods: A hybrid machine learning system that uses hidden Markov models (HMM) for sequential decoding and deep learning networks for postprocessing is proposed. These algorithms were trained and evaluated using the TUH EEG Corpus, which is the world's largest publicly available database of clinical EEG data.

Results: Our approach delivers a sensitivity above 90% while maintaining a specificity below 5%. The system detects three events of clinical interest: (1) spike and/or sharp waves, (2) periodic lateralized epileptiform discharges, and (3) generalized periodic epileptiform discharges. It also detects three events used to model background noise: (1) artifacts, (2) eye movement, and (3) background.

Conclusions: A hybrid HMM/deep learning system can deliver a low false alarm rate on EEG event detection, making automated analysis a viable option for clinicians.

Significance: The TUH EEG Corpus enables the application of highly data-consumptive machine learning algorithms to EEG analysis. Performance is approaching clinical acceptance for real-time applications.
Keywords — Electroencephalography, EEG, event detection, hidden Markov models, HMM, deep learning, stacked denoising autoencoders, SdA

Highlights:
• A hybrid machine learning system based on hidden Markov models and deep learning is proposed for automatic interpretation of EEGs.
• The results are reported on the TUH EEG Corpus, which is the world's largest publicly available database of EEG recordings.
• Using big data and deep learning, performance is approaching that required for clinical acceptance.

Golmohammadi, et al.: Automatic Analysis of EEGs..., Journal of Clinical Neurophysiology, November 1, 2017

1. INTRODUCTION

Electroencephalograms (EEGs) are used in a wide range of clinical settings to record electrical activity along the scalp. EEGs are the primary means by which physicians diagnose brain-related illnesses such as epilepsy and seizures (Yamada et al., 2017). However, analysis of EEG signals requires a highly trained neurophysiologist. Manual analysis is time-consuming and expensive since identifying rare clinical events requires analysis of long data streams. Automatic analysis of EEG scans reduces time to diagnosis, reduces error, and enhances real-time applications by flagging sections of the signal that need further review by a clinician. Many methods have been developed over the years (Ney et al., 2016), including time-frequency analysis methods (Gotman, 1999; Sartoretto & Ermani, 1999; Osorio et al., 1998), nonlinear techniques (Schad et al., 2008; Stam, 2005; Schindler et al., 2001), and expert systems that attempt to mimic a human observer (Deburchgraeve et al., 2008; Khamis et al., 2009). Despite much progress and research, current EEG analysis methodologies are far from perfect, with many being considered impractical due to high false detection rates (Hopfengärtner et al., 2007; Varsavsky & Mareels, 2006).
Machine learning has made tremendous progress over the past three decades due to rapid advances in low-cost, highly parallel computational infrastructure, powerful machine learning algorithms, and, most importantly, big data. Although contemporary approaches for automatic interpretation of EEGs have employed more modern machine learning approaches such as neural networks (Ramgopal et al., 2014) and support vector machines (Alotaiby et al., 2014), state-of-the-art machine learning algorithms that employ high-dimensional models have not previously been utilized in EEG analysis because there has been a lack of large databases that incorporate sufficient real-world variability to adequately train these systems. In fact, what has been lacking in many bioengineering fields, including automatic interpretation of EEGs, are the big data resources required to support the application of advanced machine learning approaches. A significant big data resource, known as the TUH EEG Corpus (Obeid & Picone, 2016), has recently become available, creating a unique opportunity to evaluate high-performance deep learning models that require large amounts of training data. This database includes detailed physician reports and patient medical histories, which are critical to the application of deep learning. But transforming physicians' reports into a deep learning paradigm is proving to be challenging because the mapping of reports to underlying EEG events is nontrivial. Our experiments suggest that a hybrid approach based on hidden Markov models and deep learning can approach clinically acceptable levels of performance.

2. METHOD

An overview of our proposed system is shown in Fig. 1. An N-channel EEG is transformed into N independent feature streams using a standard sliding-window-based approach. A sequential modeler analyzes each channel and produces event hypotheses.
Three passes of postprocessing are performed to produce the final output. In this section, we discuss the various components of this system, including development of the statistical models using a supervised training approach. We begin with a discussion of the data used to train and evaluate the system.

2.1. Data: The TUH EEG Corpus

Our system was developed using the TUH EEG Corpus (TUH-EEG) (Obeid & Picone, 2016), which is the world's largest publicly available database of clinical EEG data. It contains over 30,000 sessions from over 16,000 patients (over 30 years of signal data in total). It is an ongoing data collection effort. The most recent release, v1.0.0, includes data from 2002–2015. This EEG data was collected at the Department of Neurology at Temple University Hospital. It is entirely composed of clinical data with all the real-world artifacts one would expect to see in clinical recordings (e.g., eye blinking and head movements). Each of the sessions contains at least one EDF file and one physician report. These reports are generated by a board-certified neurologist and are the official hospital record. These reports are comprised of unstructured text that describes the patient, relevant history, medications, and clinical impression. The corpus is publicly available from the Neural Engineering Data Consortium (www.nedcdata.org).

EEG signals in TUH-EEG were recorded using several generations of Natus Medical Incorporated's Nicolet™ EEG recording technology. The raw signals consist of multichannel recordings in which the number of channels varies between 20 and 128 channels (Harati et al., 2014). A 16-bit A/D converter was used to digitize the data. The sample frequency varies from 250 Hz to 1024 Hz. In our study, we have resampled all the data to a common sample frequency of 250 Hz.
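The rate conversion just described can be illustrated with a simple anti-alias-and-decimate scheme. This is a sketch, not the toolchain used in the study; it assumes an integer ratio between the input rate and the 250 Hz target (e.g., 500 Hz recordings), and the `downsample` helper and filter length are our own illustrative choices.

```python
import numpy as np

def downsample(signal, fs_in, fs_out):
    """Integer-factor downsampling with a windowed-sinc anti-alias filter.

    Illustrative only: assumes fs_in is an integer multiple of fs_out,
    e.g. a 500 Hz clinical recording mapped to the common 250 Hz rate.
    """
    factor = fs_in // fs_out
    if factor == 1:
        return signal.copy()
    # windowed-sinc lowpass with cutoff at the new Nyquist frequency
    cutoff = 0.5 / factor              # in cycles/sample at the input rate
    n = np.arange(-64, 65)
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(len(n))
    h /= h.sum()                       # unity gain at DC
    filtered = np.convolve(signal, h, mode="same")
    return filtered[::factor]          # keep every factor-th sample

fs_in, fs_out = 500, 250
t = np.arange(0, 2.0, 1.0 / fs_in)
x = np.sin(2 * np.pi * 10 * t)         # a 10 Hz test tone
y = downsample(x, fs_in, fs_out)
print(len(x), len(y))                  # 1000 500
```

For the non-integer ratios in the corpus (e.g., 1024 Hz to 250 Hz), a polyphase rational-rate resampler would be needed instead.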
The Natus system stores the data in a proprietary format that has been exported to EDF with the use of NicVue v5.71.4.2530. The original EEG records are split into multiple EDF files depending on how the session was annotated by the attending technician. Some statistics about the corpus are shown in Fig. 2.

A portion of TUH-EEG was annotated manually during a study conducted with Temple University Hospital neurologists (Harati et al., 2014). These annotations comprise six types of events. The first three events are of clinical interest: (1) spike and/or sharp waves (SPSW), (2) periodic lateralized epileptiform discharges (PLED), and (3) generalized periodic epileptiform discharges (GPED). SPSW events are epileptiform transients that are typically observed in patients with epilepsy. PLED events are indicative of EEG abnormalities and often manifest themselves with repetitive spike or sharp wave discharges that can be focal or lateralized over one hemisphere. These signals display quasi-periodic behavior. GPED events are similar to PLEDs, and manifest themselves as periodic short-interval diffuse discharges, periodic long-interval diffuse discharges, and suppression-burst patterns according to the interval between the discharges. Triphasic waves, which manifest themselves as diffuse and bilaterally synchronous spikes with bifrontal predominance, typically at a rate of 1–2 Hz, are also included in this class.

The remaining three events were used by our machine learning technology to model background noise: (1) eye movement (EYEM), (2) artifacts (ARTF), and (3) background (BCKG). EYEM events are spike-like signals that occur during patient eye movement. These are quite common in clinical data, so for reasons explained later, we devoted a specific class to these events.
ARTF events other than EYEM are recorded electrical activity that is not of cerebral origin, such as those due to the equipment, patient behavior, or the environment. These are common events that can often be confused with SPSW, and hence it is important that such things be included in the database. BCKG is used to annotate all other portions of the signal.

Fig. 1. A three-pass architecture for automatic interpretation of EEGs that integrates hidden Markov models for sequential decoding of EEG events with deep learning for decision-making based on temporal and spatial context.

There are over 10 different electrode configurations and over 40 channel configurations represented in the corpus. This poses a serious challenge for machine learning systems since, for a system to be practical, it must be able to adapt to the specific type of EEG being administered. However, for this initial study, we focused on a subset of the data in which signals were recorded using the Averaged Reference (AR) electrode configuration (Lopez et al., 2016). The next step in the data pipeline is to convert the data to a feature representation.

2.2. Preprocessing: Feature Extraction

The first step in EEG processing in Fig. 1 consists of converting the signal to a sequence of feature vectors (Picone, 1990). Feature extraction for automatic classification of EEG signals typically relies on time-frequency representations of the signal (Thodoroff et al., 2016; Mirowski et al., 2009).
While techniques such as cepstral-based filter banks or wavelets are popular analysis techniques in many signal processing applications including EEG classification (Subasi et al., 2007; Jahankhani et al., 2006), our system uses a standard cepstral coefficient-based feature extraction approach based on Linear Frequency Cepstral Coefficients (LFCCs) that has been popular in applications such as speech recognition (Picone, 1990). We use a linear frequency scale for EEGs because there is little evidence that a nonlinear scale is relevant to this problem, and in practice performance was slightly better for a linear scale (Harati et al., 2015). Recent experiments with different types of features (Da Rocha Garrit et al., 2015) or with using sampled data directly (Xiong et al., 2017) have not shown a significant improvement in performance by eliminating the feature extraction process and using sampled data directly.

It is common in the LFCC approach to compute cepstral coefficients by computing a high-resolution fast Fourier Transform, downsampling this representation using an oversampling approach based on a set of overlapping bandpass filters, and transforming the output into the cepstral domain using a discrete cosine transform (Picone, 1990). In this study, the zeroth-order cepstral term is discarded and replaced with a frequency-domain energy term computed as the sum of squares of the oversampled filter bank outputs after they are downsampled:

$E_f = \log\left(\sum_{k} |X(k)|^2\right)$. (1)

Fig. 2. Some relevant statistics demonstrating the variety of data in TUH-EEG.

In order to improve differentiation between transient pulse-like events and stationary background noise, we have introduced a differential energy term that attempts to model the long-term change in energy.
This term examines energy over a range of $M$ frames centered about the current frame, and computes the difference between the maximum and minimum over this interval:

$E_d = \max_{m}\left(E_f(m)\right) - \min_{m}\left(E_f(m)\right)$. (2)

We typically use a 0.9 sec window for this calculation. This simple feature has proven to be surprisingly effective (Harati et al., 2015). The final step to note in our feature extraction process is a familiar method for computing derivatives of features using a regression approach:

$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}$, (3)

where $d_t$ is a delta coefficient at frame $t$ computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$. A typical value for $N$ is 9 (corresponding to 0.9 secs) for the first derivative in EEG processing, and 3 for the second derivative. The introduction of derivatives helps the system discriminate between steady-state behavior, such as a PLED, and impulsive or nonstationary signals such as spikes and eye movements. These features, which are often called deltas because they measure the change in the features over time, are among the most well-known features in speech recognition. We typically use Eq. (3) to compute the derivatives of the features and then apply this approach again to those derivatives to obtain an estimate of the second derivatives of the features, generating what are often called delta-deltas. This process triples the size of the feature vector (adding deltas and delta-deltas), but is well known to deliver small but measurable improvements in performance.

Fig. 3. An overview of the feature extraction algorithm.
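Eqs. (2) and (3) are straightforward to implement. The sketch below uses hypothetical helper names, assumes a 0.1 s frame rate so that a 9-frame window spans 0.9 s, and interprets the delta window as a half-width of 4 frames (a 9-frame centered window); the study's exact windowing conventions may differ.

```python
import numpy as np

def differential_energy(e, m=9):
    """Eq. (2): max minus min of the frame energy E_f over a window of
    m frames centered on each frame (m=9 frames ~ 0.9 s at a 0.1 s rate)."""
    half = m // 2
    padded = np.pad(e, half, mode="edge")      # replicate edges
    return np.array([padded[i:i + m].max() - padded[i:i + m].min()
                     for i in range(len(e))])

def deltas(features, half_width=4):
    """Eq. (3): regression-based derivative of a (frames x coeffs) array,
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    N = half_width
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    T = features.shape[0]
    d = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom

e = np.arange(20.0)                    # toy frame energies
print(differential_energy(e)[10])      # 8.0: max-min over a 9-frame ramp
c = np.arange(30.0).reshape(-1, 1)     # toy static coefficients (ramp)
print(deltas(c)[15, 0])                # 1.0: slope of a unit ramp
```

Applying `deltas` to its own output yields the delta-delta terms described above.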
In this work, through experiments designed to optimize feature extraction, we found that best performance can be achieved using a feature vector of length 26, comprising 7 cepstral coefficients, 1 frequency-domain energy term, 1 differential energy term, 9 deltas for all of these, and 8 delta-deltas for just the cepstral terms plus the frequency-domain energy (Harati et al., 2015). A block diagram of the feature extraction process used in this work for automatic classification of EEG signals is presented in Fig. 3.

2.3. First Pass: Sequential Decoding Using Hidden Markov Models

Hidden Markov models (HMMs) are among the most powerful statistical modeling tools available today for signals that have both a time and frequency domain component (Juang & Rabiner, 1991; Picone, 1990). HMMs are a class of doubly stochastic processes in which discrete state sequences are modeled as Markov chains. HMMs have been used extensively in speech recognition, where a speech signal can be decomposed into an energy and frequency profile in which particular events in the frequency domain can be used to identify the sound spoken. The challenge of interpreting and finding patterns in EEG signal data is very similar to that of speech-related projects, with a measure of specialization. Like speech recognition systems, we assume that the EEG signal is a realization of some message encoded as a sequence of one or more symbols. We model an EEG as a sequence of one of six symbols: SPSW, PLED, GPED, EYEM, ARTF, and BCKG. Let each event be represented by a sequence of feature vectors or observations $O$, defined as:

$O = o_1, o_2, \ldots, o_T$. (4)

Here $o_t$ is the feature vector observed at time $t$.
Then, considering $S_i$ to be the $i$th event in our dictionary, the isolated event recognition problem can be regarded as finding the most probable event, which, for a given set of prior probabilities, $P(S_i)$, depends only on the likelihood $P(O|S_i)$. We train one HMM model for each event using manually annotated data. A simple left-to-right GMM-HMM, illustrated in Fig. 4, was used for sequential decoding of EEG signals. A GMM-HMM is characterized by the number of states $N$, an $L$-component Gaussian mixture model, the transition probability $a_{ij}$ from state $i$ to state $j$, and the output probability $b_{ij}(o)$ for symbol $o$ in the transition process. Considering $\alpha(i,t)$ as the forward probability, where $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$; $\beta(j,t)$ as the backward probability, where $j = 1, 2, \ldots, N$ and $t = T-1, \ldots, 0$; and $P(O|M)$ as the probability that model $M$ generates symbol series $O$, the probability that a transition from state $i$ to state $j$ occurs at time $t$ can be defined as:

$\gamma_t(i,j) = \frac{\alpha(i, t-1)\, a_{ij}\, b_{ij}(o_t)\, \beta(j, t)}{P(O|M)}$. (5)

The reestimation formula for the transition probabilities is:

$a_{ij} = \frac{\sum_{t} \gamma_t(i,j)}{\sum_{t} \sum_{j} \gamma_t(i,j)}$. (6)

If the output vector, $o_t$, follows an $n$-dimensional normal distribution, the output density function is:

$b_{ij}(o_t, \mu_{ij}, \Sigma_{ij}) = \frac{\exp\left\{-(o_t - \mu_{ij})^{T} \Sigma_{ij}^{-1} (o_t - \mu_{ij})/2\right\}}{(2\pi)^{n/2}\, |\Sigma_{ij}|^{1/2}}$, (7)

where $\mu_{ij}$ is the mean and $\Sigma_{ij}$ is the covariance matrix. The mean and covariance for each Gaussian mixture component can be estimated by:

$\mu_{ij} = \frac{\sum_{t} \gamma_t(i,j)\, o_t}{\sum_{t} \gamma_t(i,j)}$, (8)

$\Sigma_{ij} = \frac{\sum_{t} \gamma_t(i,j)\,(o_t - \mu_{ij})(o_t - \mu_{ij})^{T}}{\sum_{t} \gamma_t(i,j)}$. (9)

Fig. 4. A left-to-right HMM is used for sequential decoding in the first pass of processing.

In the first pass of signal modeling shown in Fig.
1, we divide each channel of the EEG signal into epochs. Each epoch is represented by a sequence of frames, where each frame is represented by a feature vector. During training, we estimate the parameters of the $K$ models ($a_{ij}$, $b_{ij}$, $\mu_{ij}$, and $\Sigma_{ij}$) from the training dataset by iterating over all epochs using Eqs. (5)–(9). To determine these parameters in an iterative fashion, it is first necessary to initialize them with carefully chosen values (Picone, 1990). Once this is done, more accurate parameters, in the maximum likelihood sense, can be found by applying the so-called Baum-Welch reestimation algorithm (Picone, 1990). Decoding is typically performed using the Viterbi algorithm (Alphonso et al., 2004). Then, using one HMM model per label, we generate one posterior probability for each model and select the label that corresponds to the highest probability.

2.4. Second Pass: Temporal and Spatial Context Analysis Based on Deep Learning

The goal of the second pass of processing in Fig. 1 is to integrate spatial and temporal context to improve decision-making. Therefore, the output of the first pass of processing, which is a vector of six posterior probabilities for every epoch of each channel, is postprocessed by a deep learning system. Deep learning technology automatically self-organizes knowledge in a data-driven manner and learns to emulate a physician's decision-making process. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition and many other domains in recent years (LeCun et al., 2015).
Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. In the second pass of processing, we use a specific type of deep learning network known as a stacked denoising autoencoder (SdA) (Vincent et al., 2010). SdAs have proven to perform well for applications where we need to emulate human knowledge (Bengio et al., 2007). Since interrater agreement for annotation of seizures tends to be relatively low and somewhat ambiguous, we need a deep learning structure that can deal with noisy inputs.

From a structural point of view, SdAs are formed by stacking denoising autoencoders into a deep network, using the latent representation of the denoising autoencoder found in the layer below as the input to the current layer. Denoising autoencoders are themselves an extension of a classical autoencoder (Vincent et al., 2008). An autoencoder takes an input vector $x \in [0,1]^d$ and first maps it to a hidden representation $y \in [0,1]^{d'}$ through a deterministic mapping:

$y = f_{\theta}(x) = s(Wx + b)$, (10)

where $W$ is a $d' \times d$ weight matrix, $b$ is a bias vector, $s$ is a nonlinearity such as the sigmoid function, and $\theta = \{W, b\}$. The latent representation $y$, or code, is then mapped back, with a decoder, into a reconstruction $z$ of the same shape as $x$:

$z = g_{\theta'}(y) = s(W'y + b')$. (11)

The weight matrix $W'$ of the reverse mapping may optionally be constrained by $W' = W^{T}$, in which case the autoencoder is said to have tied weights.
The parameters of this model are optimized to minimize the average reconstruction error using a loss function, $L$, such as the reconstruction cross-entropy:

$\theta^{*}, \theta'^{*} = \arg\min_{\theta,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(x^{(i)}, g_{\theta'}\left(f_{\theta}(x^{(i)})\right)\right)$. (12)

To implement a denoising autoencoder, we train an autoencoder to reconstruct a clean, "repaired" input from a corrupted, partially destroyed one. This is done by first corrupting the initial input $x$ to get a partially destroyed version $\tilde{x}$ by means of a stochastic mapping $\tilde{x} \sim q_D(\tilde{x}|x)$. This corrupted input is then mapped, as with the basic autoencoder, to a hidden representation $y = f_{\theta}(\tilde{x}) = s(W\tilde{x} + b)$, from which we reconstruct $z = g_{\theta'}(y) = s(W'y + b')$. A schematic representation of the process is presented in Fig. 5. As before, the parameters are trained to minimize the average reconstruction error over a training set, making $z$ as close as possible to the uncorrupted input $x$. The key difference is that $z$ is now a deterministic function of $\tilde{x}$ rather than $x$, and is thus the result of a stochastic mapping of $x$.

The application of deep learning networks like SdAs generally involves three steps: design, training, and implementation. In the design step, the number of inputs and outputs, the number of layers, and the function of nodes are defined. During training, the weights of the nodes are determined through a deep learning process. In the last step, the statistical model is implemented using the fixed parameters of the network determined during training. Preprocessing of the input data is an additional step that is extremely important to various aspects of the deep learning training process.

The block diagram of the second stage of processing is depicted in Fig. 6. This stage consists of three parallel SdAs designed to integrate spatial and temporal context to improve decision-making.
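The corrupt-encode-decode cycle of Eqs. (10)-(12) for a single denoising autoencoder can be sketched in a few lines of numpy. The dimensions, masking-style corruption, and tied weights are illustrative assumptions, not the configuration used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy dimensions: d visible units, d_hidden code units (tied weights, W' = W^T)
d, d_hidden = 8, 4
W  = rng.normal(scale=0.1, size=(d_hidden, d))
b  = np.zeros(d_hidden)                 # encoder bias
b2 = np.zeros(d)                        # decoder bias

x = rng.uniform(size=d)                 # clean input, x in [0,1]^d
mask = rng.uniform(size=d) > 0.3        # stochastic corruption q_D:
x_tilde = x * mask                      # zero out ~30% of the entries

y = sigmoid(W @ x_tilde + b)            # encoder, Eq. (10)
z = sigmoid(W.T @ y + b2)               # tied-weight decoder, Eq. (11)

# reconstruction cross-entropy against the UNCORRUPTED x, the loss
# whose average is minimized in Eq. (12)
loss = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
print(round(float(loss), 3))
```

Training would backpropagate this loss through $W$, $b$, and $b'$; stacking repeats the procedure with each layer's code $y$ as the next layer's input.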
These SdAs are implemented with varying window sizes to effectively perform a multi-time-scale analysis of the signal and map event labels onto a single composite epoch label vector. A first SdA, referred to as an SPSW-SdA, is responsible for mapping labels into one of two classes: epileptiform and non-epileptiform. A second SdA, EYEM-SdA, maps labels onto the background (BCKG) and eye movement (EYEM) classes. A third SdA, 6W-SdA, maps labels to any one of the six possible classes. The first two SdAs use a relatively short window context because SPSW and EYEM are localized events and can only be detected when we have adequate temporal resolution.

Fig. 5. In a stacked denoising autoencoder the input, $x$, is corrupted to $\tilde{x}$. The autoencoder then maps it to $y$ and attempts to reconstruct $x$.

Training of these three SdA networks is done in two steps: pre-training and fine-tuning. Denoising autoencoders are stacked to form a deep network. The unsupervised pre-training of such an architecture is done one layer at a time. Each layer is trained as a denoising autoencoder by minimizing the error in reconstructing its input (which is the output code of the previous layer). Once the first $k$ layers are trained, we can train the $(k+1)$th layer because we can now compute the code or latent representation from the layer below. Once all layers are pre-trained, the network goes through a second stage of training called fine-tuning. Here we consider supervised fine-tuning, where we want to minimize prediction error on a supervised task. For this, we first add a logistic regression layer on top of the network. We then train the entire network as we would train a multilayer perceptron. At this point, we only consider the encoding part of each autoencoder.
This stage is supervised, since we now use the target class during training (Bengio et al., 2007; Hinton et al., 2006).

Additionally, Fig. 6 shows that the input data to the deep learning networks is preprocessed using a global principal components analysis (PCA) to reduce the dimensionality before applying it to these SdAs (van der Maaten, 2009). PCA is applied to each individual epoch by concatenating each channel output into a supervector and then reducing its dimensionality before it is input into the SdA. For rare and localized events, in this case SPSW and EYEM, we use an out-of-sample technique to increase the number of training samples (van der Maaten, 2009). Finally, using a block called an enhancer (Vincent et al., 2010), the outputs of these three SdAs are combined to obtain the final decision. To combine the three outputs, we initialize our final probability output with the output of the 6-way classifier. For each epoch, if either of the other two classifiers detects epileptiform activity or eye movement and the 6-way classifier is not in agreement, we update the output probability based on the output of the 2-way classifiers. The overall result of the second stage is a probability vector of dimension six containing the likelihood that each label could have occurred in the epoch. It should also be noted that the outputs of these SdAs are in the form of probability vectors. A soft decision paradigm is used rather than hard decisions because this output is smoothed in the third stage of processing.

2.5. Third Pass: Statistical Language Modeling

Neurologists generally impose certain restrictions on events when interpreting an EEG. For example, PLEDs and GPEDs don't happen in the same session. None of the previous stages of processing address this problem. Even the output of the second stage accounts mostly for channel context and is not extremely effective at modeling long-term temporal context.
The third pass of processing addresses this issue and improves the overall detection performance by using a finite state machine based on a statistical language model. As shown in Fig. 1, the third stage of postprocessing is designed to impose contextual restrictions on the output of the second stage. These contextual relationships involve long-term behavior of the signal and are learned in a data-driven fashion. This approach is also borrowed from speech recognition, where a probabilistic grammar is used that combines the left and right contexts with the labels (Levinson, 2005). This is done using a finite state machine that imposes specific syntactic constraints.

Fig. 6. An overview of the second pass of processing.

In this study, a bigram probabilistic language model that provides the probability of transitioning from one type of epoch to another (e.g., PLED to PLED) is prepared using the training data set and also in consultation with neurologists at Temple University Hospital. The bigram probabilities for each of the six classes are shown in Table 1, which models all possible transitions from one label to the next; each row lists, for a given class, the probability of transitioning to each of the six classes. The probabilities in this table are optimized on a training database that is a subset of TUH-EEG. For example, since PLEDs are long-term events, the probability of transitioning from one PLED to the next is high, approximately 0.9. However, since spikes that occur in groups are PLEDs or GPEDs, and not SPSWs, the probability of transitioning from a PLED to an SPSW is 0.0. Therefore, these transition probabilities emulate the contextual knowledge used by neurologists.
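Such a bigram grammar reduces to a simple table lookup. The sketch below transcribes the probabilities from Table 1 into a matrix, with a hypothetical `prob` helper for readability.

```python
import numpy as np

CLASSES = ["SPSW", "PLED", "GPED", "EYEM", "ARTF", "BCKG"]

# Transition probabilities Prob(i, j) transcribed from Table 1;
# row = current label i, column = next label j.
PROB = np.array([
    [0.40, 0.00, 0.00, 0.10, 0.20, 0.30],  # SPSW
    [0.00, 0.90, 0.00, 0.00, 0.05, 0.05],  # PLED
    [0.00, 0.00, 0.60, 0.00, 0.20, 0.20],  # GPED
    [0.10, 0.00, 0.00, 0.40, 0.10, 0.40],  # EYEM
    [0.23, 0.05, 0.05, 0.23, 0.23, 0.23],  # ARTF
    [0.33, 0.05, 0.05, 0.23, 0.13, 0.23],  # BCKG
])

def prob(i_label, j_label):
    """Bigram probability of label j following label i."""
    return PROB[CLASSES.index(i_label), CLASSES.index(j_label)]

print(prob("PLED", "PLED"))  # 0.9: long-term events tend to persist
print(prob("PLED", "SPSW"))  # 0.0: grouped spikes are PLED/GPED, not SPSW
```

These are the `Prob(i,k)` and `Prob(k,j)` terms consumed by the Bayesian context update in the third pass.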
After compiling the probability table, a long window is centered on each epoch and the posterior probability vector for that epoch is updated by considering the left and right context as a prior (essentially predicting the current epoch from its left and right context). A Bayesian framework is used to update the probabilities of this grammar for a single iteration of the algorithm:

$\bar{P}_c(k) = \frac{\alpha\,P_c(k) + M\,\epsilon_k}{\sum_{k'=1}^{K}\left(\alpha\,P_c(k') + M\,\epsilon_{k'}\right)}$, (13)

$RPP(k) = \frac{1}{\beta_1} \sum_{i=c+1}^{c+L/2} \lambda^{\,i-c-1}\,\bar{P}_i(k)$, (14)

$LPP(k) = \frac{1}{\beta_1} \sum_{i=c-L/2}^{c-1} \lambda^{\,c-i-1}\,\bar{P}_i(k)$, (15)

$P_c^{\,k|LR} = \beta_2\,P_c^{\,prior}(k)\left(\sum_{i=1}^{K}\sum_{j=1}^{K} LPP(i)\,RPP(j)\,Prob(i,k)\,Prob(k,j)\right)^{y/n}$. (16)

In these equations, $k = 1, 2, \ldots, K$, where $K$ is the total number of classes (in this study, $K = 6$); $L$ is the number of epochs in a file; $\epsilon_k$ is the prior probability for an epoch (a vector of length $K$); and $M$ is its associated weight. $LPP$ and $RPP$ are the left and right context probabilities, respectively. $\lambda$ is the decaying weight for the window, and $\alpha$ is the weight associated with $P_c$. $P_c^{\,prior}$ is the prior probability, $P_c^{\,k|LR}$ is the posterior probability of epoch $c$ for class $k$ given the left and right contexts, $y$ is the grammar weight, $n$ is the iteration number (starting from 1), and $\beta_1$ and $\beta_2$ are normalization factors. $Prob(i,j)$ is a representation of the probability table shown in Table 1. The algorithm iterates until the label assignments, which are decoded based on the probability vector, converge. The output of this stage is the final output and is what was used in the evaluations described in Section 3.

3. RESULTS

In this section, we present results from a series of experiments designed to optimize and evaluate each stage of processing. We used a subset of TUH-EEG for these experiments.

3.1. Data: The TUH-EEG Event Short Set

We collaborated with several neurologists and a team of undergraduate annotators (Shah et al., 2018)
to manually label a subset of TUH-EEG for the six types of events described in Section 2.1. The training set contains segments from 359 sessions, while the evaluation set was drawn from 159 sessions. No patient appears more than once in the entire subset, which we refer to as the TUH-EEG Event Short Set (TUH-EEG-ESS). Note that the annotations were created on a channel basis: the specific channels on which an event was observed were annotated. This is in contrast to many open source databases we have observed, which only mark events in time and do not annotate the specific channels on which the events occurred. In general, with EEG signals, events such as SPSW do not appear on all channels. The subset of channels on which an event appears is relevant diagnostic information. Our annotations are demonstrated in Fig. 7. The distribution of the frequency of occurrence of the six types of events in the training and evaluation sets is shown in Table 2. The training set was designed to provide a sufficient number of examples to train statistical models such as HMMs. Note that some classes, such as SPSW, occur much less frequently in the actual corpus than common events such as BCKG.

Table 1. A bigram probabilistic language model for the third pass of processing, which models all possible transitions from one of the six classes to the next.

   i \ j   SPSW   PLED   GPED   EYEM   ARTF   BCKG
   SPSW    0.40   0.00   0.00   0.10   0.20   0.30
   PLED    0.00   0.90   0.00   0.00   0.05   0.05
   GPED    0.00   0.00   0.60   0.00   0.20   0.20
   EYEM    0.10   0.00   0.00   0.40   0.10   0.40
   ARTF    0.23   0.05   0.05   0.23   0.23   0.23
   BCKG    0.33   0.05   0.05   0.23   0.13   0.23
In fact, 99% of the data is assigned to the class BCKG, so special care must be taken to build robust classifiers for the non-background classes. High performance detection of EEG events requires dealing with infrequently occurring events, since much of the data is uninformative. This is often referred to as an unbalanced data problem, and it is quite common in many biomedical applications. Hence, the evaluation set was designed to contain a reasonable representation of all classes. All of the EEGs in this subset were recorded using the standard 10–20 system and processed using a TCP montage (Lopez et al., 2016), resulting in 22 channels of signal data per EEG.

Fig. 7. An example demonstrating that the reference data is annotated on a per-channel basis.

Table 2. An overview of the distribution of events in the subset of the TUH EEG Corpus used in our experiments

   Event     Train    Train % (CDF)     Eval     Eval % (CDF)
   SPSW        645      0.8%   (1%)      567      1.9%   (2%)
   GPED      6,184      7.4%   (8%)    1,998      6.8%   (9%)
   PLED     11,254     13.4%  (22%)    4,677     15.9%  (25%)
   EYEM      1,170      1.4%  (23%)      329      1.1%  (26%)
   ARTF     11,053     13.2%  (36%)    2,204      7.5%  (33%)
   BCKG     53,726     63.9% (100%)   19,646     66.8% (100%)
   Total:   84,032    100.0%          29,421    100.0%

3.2. Preprocessing: Feature Extraction

Features from each epoch are identified using the feature extraction technique explained in Section 2.2. Neurologists review EEGs in 10 sec windows. Pattern recognition systems often subdivide the signal into small segments during which the signal can be considered quasi-stationary. HMM systems need further subdivision so that there are enough observations to allow the system to develop a strong sense of the correct choice.
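The subdivision described above can be made concrete with a short framing sketch. The 0.1 sec frame step and 0.2 sec overlapping analysis window follow the configuration discussed in this section; the 250 Hz sampling rate and the helper name are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative framing sketch: step through the signal in 0.1 sec frames,
# analyzing each with a 0.2 sec overlapping window (step < window length).
# The 250 Hz sampling rate is an assumption made for this example.
def frame_signal(signal, fs=250, frame_dur=0.1, win_dur=0.2):
    frame_len = int(frame_dur * fs)      # 0.1 sec step  -> 25 samples
    win_len = int(win_dur * fs)          # 0.2 sec window -> 50 samples
    n_frames = (len(signal) - win_len) // frame_len + 1
    return np.stack([signal[i * frame_len: i * frame_len + win_len]
                     for i in range(n_frames)])

signal = np.zeros(10 * 250)              # 10 sec of (dummy) samples
frames = frame_signal(signal)            # one frame every 0.1 sec
```

Each 1 sec epoch thus contributes roughly ten overlapping frames, giving the HMM enough observations per epoch.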
A simple set of preliminary experiments determined that a reasonable tradeoff between computational complexity and performance was to split the 10 sec window into 1 sec epochs, and to further subdivide these into 0.1 sec frames. Hence, features were computed every 0.1 sec using a 0.2 sec overlapping analysis window. These parameters were optimized experimentally in a previous study (Harati et al., 2015). We have also previously shown that the use of a novel differential energy feature improved the performance of absolute features, but that this benefit diminishes as first and second-order derivatives are included. We have shown there is benefit to using derivatives, and there is a small advantage to using frequency domain energy. The output of the feature extraction system is 22 channels of data, where in each channel a feature vector of dimension 26 corresponds to every 0.1 secs.

3.3. First Pass: Sequential Decoding Using Hidden Markov Models

A 6-way classification experiment was conducted using the models described in Fig. 4. Each state uses 8 Gaussian mixture components and a diagonal covariance assumption (drawing on our experience with speech recognition systems and balancing the dimensionality of the models with the size of the training data).

Table 5. The 6-way classification results for the first pass of processing

   Event   ARTF    BCKG    EYEM    GPED    PLED    SPSW
   ARTF    41.24   45.19    2.18    3.81    2.77    4.81
   BCKG     7.02   71.93    2.59    7.37    2.28    8.81
   EYEM     2.13    0.61   82.37    2.13    8.51    4.26
   GPED     7.46    4.85    2.39   53.32   20.42   11.55
   PLED     0.70    1.85    4.70   17.62   54.80   20.32
   SPSW     4.41    8.29    9.17   33.33    4.59   40.21

Table 4. The 4-way classification results for the first pass of processing

   Event   BCKG    SPSW    GPED    PLED
   BCKG    82.30    8.35    6.94    2.42
   SPSW    21.87   40.21   33.33    4.59
   GPED    14.71   11.55   53.32   20.42
   PLED     7.26   20.32   17.62   54.80
Table 3. The 2-way classification results for the first pass of processing

   Event   TARG    BCKG
   TARG    86.92   13.08
   BCKG    18.20   81.80

Models were trained using all events on all channels, resulting in what we refer to as channel independent models. Channel dependent models have not proven to provide a boost in performance, and they add considerable complexity to the system. The results for the first pass of processing are shown in Table 5. A more informative performance analysis can be constructed by collapsing the three background classes into one category. We refer to this second evaluation paradigm as a 4-way classification task: SPSW, GPED, PLED and BCKG, where the latter class contains an enumeration of the three background classes. The 4-way classification results for the first pass of processing are presented in Table 4. Finally, in order to produce a DET curve (Martin et al., 1997), we also report a 2-way classification result in which we collapse the data into a target class (TARG) and a background class (BCKG). The 2-way classification results for the first pass of processing are presented in Table 3. Note that the classification results in all of these tables are measured by counting each epoch on each channel as an independent event. We refer to this as forced-choice event-based scoring, because every epoch for every channel is assigned a score based on its class label.

3.4. Second Pass: Temporal and Spatial Context Analysis Based on Deep Learning

The output of the first stage of processing is a vector of six scores, or likelihoods, for each channel at each epoch. Therefore, if we have 22 channels and six classes, we will have a vector of dimension 6 × 22 = 132 scores for each epoch. This 132-dimension epoch vector is computed without considering similar vectors from epochs adjacent in time.
Information available from other channels within the same epoch is referred to as "spatial" context, since each channel corresponds to a specific electrode location on the skull. Information available from other epochs is referred to as "temporal" context. The goal of this level of processing is to integrate spatial and temporal context to improve decision-making. To integrate context, the input to the second pass deep learning system is a vector of dimension 6 × 22 × window length, where we aggregate 132-dimension vectors in time. If we consider a 41-second window, then we will have a 5,412-dimension input to the second pass of processing. This input dimensionality is high even though we have a considerable amount of manually labeled training data. To deal with this problem, we follow a standard approach of using Principal Components Analysis (PCA) (Fukunaga, 1990) before every SdA. The output of the PCA is a vector of dimension 13 for the SdA detectors that look for SPSW and EYEM, and 20 for the 6-way SdA classifier. Further, since we do not have enough SPSW and EYEM events in the training dataset, we must use an out-of-sample technique (van der Maaten et al., 2009) to train the SdA. Three consecutive outputs are averaged, so the output is further reduced from 3 × 13 to just 13, using a sliding window approach to averaging. Therefore, the input decreases to 13 × window length for the SPSW and EYEM SdAs, and to 20 × window length for the 6-way SdA. We used an open source toolkit, Theano (Bastien et al., 2012; Bergstra et al., 2010), to implement the SdAs. The parameters of the models are optimized to minimize the average reconstruction error using a cross-entropy loss function. In the optimization process, a variant of stochastic gradient descent is used, referred to as minibatches.
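The dimensionality reduction and context aggregation described above can be sketched as follows. The shapes follow the text (22 channels × 6 classes per epoch, 20 PCA components for the six-way classifier, a 41-epoch window yielding 20 × 41 = 820 inputs); the plain SVD-based PCA, the edge padding, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pca_reduce(X, n_components=20):
    """Project each 132-dim epoch vector onto the top principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # (n_epochs, n_components)

def stack_context(vectors, window=41):
    """Stack reduced epoch vectors over a centered window of epochs."""
    half = window // 2
    padded = np.pad(vectors, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + window].ravel()
                     for i in range(len(vectors))])

scores = np.random.default_rng(0).normal(size=(100, 22, 6))  # dummy likelihoods
flat = scores.reshape(100, 22 * 6)         # 132-dim epoch vectors
inputs = stack_context(pca_reduce(flat))   # (100, 20 * 41) = (100, 820)
```

The design choice here mirrors the text: reducing each epoch vector first, then stacking in time, keeps the second-pass input tractable relative to the raw 5,412-dimension alternative.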
Minibatch stochastic gradient descent works identically to stochastic gradient descent, except that we use more than one training example to make each estimate of the gradient. This technique reduces the variance in the estimate of the gradient, and it often makes better use of the hierarchical memory organization in modern computers.

The SPSW SdA uses a window length of 3, which means it has 39 inputs and 2 outputs. It has three hidden layers with corruption levels of 0.3 for each layer. The number of nodes per layer is: first layer = 100, second layer = 100, third layer = 100. The parameters for pre-training are: learning rate = 0.5, number of epochs = 200, batch size = 300. The parameters for fine-tuning are: learning rate = 0.2, number of epochs = 800, batch size = 100.

The EYEM SdA uses a window length of 3, which means it has 39 inputs and 2 outputs. It has three hidden layers with corruption levels of 0.3 for each layer. The number of nodes per layer is: first layer = 100, second layer = 100, third layer = 100. The parameters for pre-training are: learning rate = 0.5, number of epochs = 200, batch size = 300. The parameters for fine-tuning are: learning rate = 0.2, number of epochs = 100, batch size = 100.

The six-way SdA uses a window length of 41, which means it has 820 inputs and 6 outputs. It has three hidden layers with corruption levels of 0.3 for each layer. The number of nodes per layer is: first layer = 800, second layer = 500, third layer = 300. The parameters for pre-training are: learning rate = 0.5, number of epochs = 150, batch size = 300. The parameters for fine-tuning are: learning rate = 0.1, number of epochs = 300, batch size = 100.
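The minibatch update described above can be illustrated with a toy least-squares problem; each gradient estimate is computed from a small batch of examples rather than a single one. The model, data, learning rate and batch size here are illustrative only, not the SdA training configuration.

```python
import numpy as np

# Minibatch SGD on a noise-free linear least-squares problem: the gradient is
# estimated from a batch of 30 shuffled examples at a time.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true

w = np.zeros(4)
lr, batch_size = 0.1, 30
for epoch in range(200):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # mean-squared-error gradient over the current minibatch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad

loss = float(np.mean((X @ w - y) ** 2))
```

With 30 examples per batch, each update is far less noisy than single-example SGD while still being much cheaper than a full-batch gradient.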
The 6-way, 4-way and 2-way classification results for the second stage of processing are presented in Table 8, Table 7 and Table 6, respectively. Note that unlike the tables for the first pass of processing, the classification results in each of these tables are measured once per epoch; they are not per-channel results. We refer to these results as epoch-based.

3.5. Third Pass: Statistical Language Modeling

The output of the second stage of processing is a vector of six scores, or likelihoods, per epoch. This serves as the input for the third stage of processing. The optimized parameters for the third pass of processing are: the prior probability for an epoch, ε_c, is 0.1; the weight, M, is 1; the decaying weight, λ, is 0.2; the weight associated with P̄, α, is 0.1; the grammar weight, y, is 1; the number of iterations, n, is 20; and the window length used to calculate the left and right prior probabilities is 10. The 6-way, 4-way and 2-way classification results are presented in Table 11, Table 9 and Table 10, respectively. Note that these results are also epoch-based.

Table 8. The 6-way classification results for the second pass of processing

   Event   ARTF    BCKG    EYEM    GPED    PLED    SPSW
   ARTF    27.49   61.73    7.28    0.00    1.08    2.43
   BCKG     7.00   82.03    5.79    0.97    0.36    3.86
   EYEM     4.21   16.84   77.89    0.00    0.00    1.05
   GPED     0.60   14.69    0.00   59.96   10.26   14.49
   PLED     1.40   22.65    0.80   13.83   52.30    9.02
   SPSW     7.69   35.90    2.56   28.21    0.00   25.64

Table 7. The 4-way classification results for the second pass of processing

   Event   BCKG    SPSW    GPED    PLED
   BCKG    95.60    3.24    0.62    0.54
   SPSW    46.15   25.64   28.21    0.00
   GPED    15.29   14.49   59.96   10.26
   PLED    24.85    9.02   13.83   52.30

Table 6. The 2-way classification results for the second pass of processing

   Event   TARG    BCKG
   TARG    78.94   21.06
   BCKG     4.40   95.60
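The third-pass context update (Eqs. 13–16) with the optimized parameters above can be sketched in a simplified form. This is a minimal sketch under stated assumptions: `context_update` is a hypothetical helper operating on an array of per-epoch class probability vectors and the Table 1 bigram matrix, not the authors' implementation, and the paper applies it iteratively (n = 1, ..., 20) until the decoded labels converge.

```python
import numpy as np

def context_update(priors, prob, w=10, lam=0.2, alpha=0.1, M=1, y=1, n=1):
    """One iteration of the third-pass context update (cf. Eqs. 13-16)."""
    L, K = priors.shape
    p_bar = (priors ** M).sum(axis=0) / L              # Eq. 13: file-level prior
    post = np.zeros_like(priors)
    for c in range(L):
        left = priors[max(0, c - w):c]                 # left context window
        right = priors[c + 1:c + 1 + w]                # right context window
        lw = lam ** np.arange(len(left), 0, -1)        # decaying weights,
        rw = lam ** np.arange(1, len(right) + 1)       # nearest epochs first
        lpp = (lw[:, None] * left).sum(axis=0) if len(left) else p_bar.copy()
        rpp = (rw[:, None] * right).sum(axis=0) if len(right) else p_bar.copy()
        lpp /= lpp.sum()                               # beta_2 (Eq. 15)
        rpp /= rpp.sum()                               # beta_1 (Eq. 14)
        # Eq. 16: combine contexts through the bigram table Prob(i, j)
        ctx = np.einsum("i,j,ik,kj->k", lpp, rpp, prob, prob)
        post[c] = (alpha * p_bar + priors[c]) ** (y / n) * ctx
        post[c] /= post[c].sum()                       # beta_3
    return post

priors = np.full((30, 6), 1.0 / 6.0)   # dummy second-pass outputs
prob = np.full((6, 6), 1.0 / 6.0)      # dummy stand-in for Table 1
post = context_update(priors, prob)
```

The defaults mirror the optimized parameters listed above (w = 10, λ = 0.2, α = 0.1, M = 1, y = 1).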
4. DISCUSSION

The 6-way classification task can be structured into several subtasks. Of course, due to the high probability of the signal being background, the system is heavily biased towards choosing the background model. Therefore, in Table 4 we see that performance on BCKG is fairly high. Not surprisingly, BCKG is most often confused with SPSW. SPSW events are short in duration, and there are many transient events in BCKG that resemble an SPSW event. This is one reason we added ARTF and EYEM models: so that we can reduce the confusions of all classes with the short, impulsive SPSW events. As we annotate background data in more detail and identify more commonly occurring artifacts, we can expand our ability to model BCKG events explicitly.

GPEDs are, not surprisingly, most often confused with PLED events. Both events have a longer duration than SPSWs and artifacts. From Table 4, we see that performance on these two classes is generally high. The main difference between GPED and PLED is duration, so we designed the postprocessing to learn this as a discriminator. For example, in the second pass of processing, we implemented a window duration of 41 seconds so that the SdA system would be exposed to long-term temporal context. We also designed three separate SdA networks to differentiate between short-term and long-term context. In Table 7 we see that the performance on GPEDs and PLEDs improves with the second pass of postprocessing. More significantly, the confusions between GPEDs and PLEDs also decreased. Note also that in Table 7 the performance on BCKG increased significantly. Confusions with GPEDs and PLEDs dropped dramatically, to below 1%.
Table 11. The 6-way classification results for the third pass of processing

   Event   ARTF    BCKG    EYEM    GPED    PLED    SPSW
   ARTF    14.04   72.98   10.18    0.00    0.00    2.81
   BCKG     3.42   81.40    8.93    0.30    0.00    5.95
   EYEM     2.30   17.24   79.31    0.00    0.00    1.15
   GPED     0.30    3.65    0.00   65.05   13.37   17.63
   PLED     0.00   10.76    0.49    9.78   65.28   13.69
   SPSW    10.00   33.33   13.33   10.00    0.00   33.33

Table 9. The 4-way classification results for the third pass of processing

   Event   BCKG    SPSW    GPED    PLED
   BCKG    95.11    4.69    0.19    0.00
   SPSW    56.67   33.33   10.00    0.00
   GPED     3.95   17.63   65.05   13.37
   PLED    11.25   13.69    9.78   65.28

Table 10. The 2-way classification results for the third pass of processing

   Event   TARG    BCKG
   TARG    90.10    9.90
   BCKG     4.89   95.11

While performance across the board increased, performance on SPSW dropped with the addition of the second pass of postprocessing. This is a reflection of the imbalance in the data. Less than one percent of the data is annotated as SPSWs, while we have ten times more training samples for GPEDs and PLEDs. Note that we used an out-of-sample technique to increase the number of training samples for SPSWs, but even this technique could not solve the problem of a lack of annotated SPSW data. By comparing Table 5 to Table 8, we see similar behavior for the EYEM class, because there are also fewer EYEM events. A summary of the results for the different stages of processing is shown in Table 12. The overall performance of the multi-pass hybrid HMM/deep learning classification system is promising: more than 90% sensitivity and less than 5% specificity. Because the false alarm rate in these types of applications varies significantly with sensitivity, it is important to examine performance using a DET curve. A DET curve for the first, second and third stages of processing is given in Fig. 8.
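As background for reading Fig. 8: a DET curve plots the miss rate against the false alarm rate on normal-deviate (probit) scales (Martin et al., 1997). A minimal, dependency-free sketch of that coordinate transform, using bisection on `math.erf` for the inverse normal CDF (the helper names are illustrative):

```python
import math

def probit(p, lo=-8.0, hi=8.0):
    """Inverse of the standard normal CDF, by bisection (0 < p < 1)."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    for _ in range(80):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def det_point(miss_rate, false_alarm_rate):
    """Map one operating point to (x, y) DET-curve coordinates."""
    return probit(false_alarm_rate), probit(miss_rate)

# e.g. the third-pass operating point from Table 10: 9.9% misses on TARG
# and 4.89% false alarms on BCKG.
x, y = det_point(0.099, 0.0489)
```

Because both axes are warped by the probit transform, systems with near-Gaussian score distributions trace out nearly straight lines, which makes differences between the three passes easy to compare visually.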
Note that the tables previously presented use the unprocessed likelihoods output from the system. They essentially correspond to the point on the DET curve where a penalty of 0 is applied. This operating point is identified on each of the curves in Fig. 8. We see that the raw likelihoods of the system correspond to different operating points in the DET curve space. From Fig. 8 it is readily apparent that postprocessing significantly improves our ability to maintain a low false alarm rate as we increase the detection rate. In virtually all cases, the trends shown in Table 5 through Table 12 hold up over the full range of the DET curve. This study demonstrates that a significant amount of contextual processing is required to achieve a specificity of 5%.

Fig. 8. DET curves are shown for each pass of processing. The "zero penalty" operating point is also shown, since this was used in Table 5 – Table 10.

Table 12. Specificity and sensitivity for each pass of processing

   Pass      Sensitivity   Specificity
   1 (HMM)      86.78         17.70
   2 (SdA)      78.93          4.40
   3 (SLM)      90.10          4.88

5. CONCLUSION

Virtually all previous R&D efforts involving EEGs, including seizure detection, have been conducted on small databases (Akareddy et al., 2014). Often these databases are not good representations of the type of data observed in clinical environments. Transient artifacts, which are not common in databases collected under research conditions, can significantly degrade performance. Not surprisingly, despite the high accuracies presented in the research literature, the performance of commercially available systems has been lacking in clinical settings. There is still great demand for an automated system that achieves a low false alarm rate in clinical applications.
We have presented a three-pass system that can achieve high performance classifying EEG events of clinical relevance. The system uses a combination of HMMs for accurate temporal segmentation and deep learning for high performance classification. In the first pass, the signal is converted to EEG events using a hidden Markov model based system that models the temporal evolution of the signal. In the second pass, three stacked denoising autoencoders (SdAs) with different window durations are used to map event labels onto a single composite epoch label vector. We demonstrated that both temporal and spatial context analysis based on deep learning can improve the performance of sequential decoding using HMMs. In the third pass, a probabilistic grammar is applied that combines left and right context with the current label vector to produce a final decision for an epoch. Our hybrid HMM/deep learning system delivered a sensitivity above 90% while maintaining a specificity below 5%, making automated analysis a viable option for clinicians.

This framework for automatic analysis of EEGs can be applied to other classification tasks, such as seizure detection or abnormality detection. There are many straightforward extensions of this system that can incorporate more powerful deep learning networks, such as Long Short-Term Memory networks or convolutional neural networks. This is the subject of our ongoing research.

This project is part of a long-term collaboration with the Department of Neurology at Temple University Hospital that has produced several valuable outputs, including a large corpus (TUH-EEG), a subset of the corpus annotated for clinically relevant events (TUH-EEG-ESS), and technology to automatically interpret EEGs.
In related work, we are also making the corpus searchable using multimodal queries that integrate metadata, information extracted from EEG reports, and the signal event data described here (Picone et al., 2016). The resulting system can be used to retrieve many different types of cohorts and will be a valuable tool for clinical work, research and teaching.

ACKNOWLEDGEMENTS

The primary funder of this research was the QED Proof of Concept program of the University City Science Center (Grant No. S1313). Research reported in this publication was also supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number U01HG008468 and the National Science Foundation through Major Research Instrumentation Grant No. CNS-09-58854. The TUH-EEG database work was funded by (1) the Defense Advanced Research Projects Agency (DARPA) MTO under the auspices of Dr. Doug Weber through Contract No. D13AP00065, (2) Temple University's College of Engineering and (3) Temple University's Office of the Senior Vice-Provost for Research. Finally, we are also grateful to Dr. Mercedes Jacobson, Dr. Steven Tobochnik and David Jungries of the Temple University School of Medicine for their assistance in developing the classification paradigm used in this study and for preparing the manually annotated data.

REFERENCES

Akareddy, S. M., & Kulkarni, P. K. (2014). EEG signal classification for epilepsy seizure detection using improved approximate entropy. International Journal of Public Health Science, 2(1), 23–32. https://doi.org/10.11591/ijphs.v2i1.1836.

Alotaiby, T., Alshebeili, S., Alshawi, T., Ahmad, I., & Abd El-Samie, F. (2014). EEG seizure detection and prediction algorithms: a survey. EURASIP Journal on Advances in Signal Processing, 2014(1), 1–21.
https://doi.org/10.1186/1687-6180-2014-183.

Alphonso, I., & Picone, J. (2004). Network training for continuous speech recognition. Proceedings of the European Signal Processing Conference (pp. 553–556). Vienna, Austria.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., … Bengio, Y. (2012). Theano: new features and speed improvements. arXiv, 1–10. Retrieved from http://arxiv.org/abs/1211.5590.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (pp. 153–160). Vancouver, B.C., Canada.

Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le Roux, N., & Ouimet, M. (2003). Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In NIPS (pp. 1–8). Vancouver, British Columbia, Canada. Retrieved from https://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 1–7. Retrieved from http://www-etud.iro.umontreal.ca/~wardefar/publications/theano_scipy2010.pdf.

Da Rocha Garrit, P. H., Guimaraes Moura, A., Obeid, I., & Picone, J. (2015). Wavelet analysis for feature extraction on EEG signals. Presented at the NEDC Summer Research Experience for Undergraduates, Department of Electrical and Computer Engineering, Temple University (p. 1). Philadelphia, Pennsylvania, USA. Retrieved from http://www.isip.piconepress.com/publications/unpublished/conferences/2015/summer_of_code/wavelets/.

Deburchgraeve, W., Cherian, P. J., De Vos, M., Swarte, R. M., Blok, J. H., Visser, G. H., … Van Huffel, S. (2008).
Automated neonatal seizure detection mimicking a human observer reading EEG. Clinical Neurophysiology, 119(11), 2447–2454.

Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.). San Diego, California, USA: Academic Press. Retrieved from https://www.elsevier.com/books/introduction-to-statistical-pattern-recognition/fukunaga/978-0-08-047865-4.

Gotman, J. (1999). Automatic detection of seizures and spikes. Journal of Clinical Neurophysiology, 16(2), 130–140. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10359498.

Harati, A., Golmohammadi, M., Lopez, S., Obeid, I., & Picone, J. (2015). Improved EEG event classification using differential energy. Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–4). Philadelphia, Pennsylvania, USA. https://doi.org/10.1109/SPMB.2015.7405421.

Harati, A., Lopez, S., Obeid, I., Jacobson, M., Tobochnik, S., & Picone, J. (2014). The TUH EEG Corpus: A big data resource for automated EEG interpretation. Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–5). Philadelphia, Pennsylvania, USA. https://doi.org/10.1109/SPMB.2014.7002953.

Hopfengärtner, R., Kerling, F., Bauer, V., & Stefan, H. (2007). An efficient, robust and fast method for the offline detection of epileptic seizures in long-term scalp EEG recordings. Clinical Neurophysiology, 118(11), 2332–2343. Retrieved from http://www.sciencedirect.com/science/article/pii/S1388245707003987.

Jahankhani, P., Kodogiannis, V., & Revett, K. (2006). EEG signal classification using wavelet feature extraction and neural networks. IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing (JVA'06), 120–124.
https://doi.org/10.1109/JVA.2006.17.

Juang, B.-H., & Rabiner, L. (1991). Hidden Markov models for speech recognition. Technometrics, 33(3), 251–272. Retrieved from http://www.leonidzhukov.net/hse/2012/stochmod/papers/HMMs_speech_recognition.pdf.

Khamis, H., Mohamed, A., & Simpson, S. (2009). Seizure state detection of temporal lobe seizures by autoregressive spectral analysis of scalp EEG. Clinical Neurophysiology, 120(8), 1479–1488. Retrieved from http://www.sciencedirect.com/science/article/pii/S138824570900368X.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. Retrieved from http://dx.doi.org/10.1038/nature14539.

Levinson, S. E. (2005). Syntactic analysis. In Mathematical Models for Speech Technology (pp. 119–135). John Wiley & Sons, Ltd. Retrieved from https://doi.org/10.1002/0470020911.ch4.

Lopez, S., Golmohammadi, M., Obeid, I., & Picone, J. (2016). An analysis of two common reference points for EEGs. In Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–4). Philadelphia, Pennsylvania, USA. https://doi.org/10.1109/SPMB.2016.7846854.

Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997). The DET curve in assessment of detection task performance. Proceedings of Eurospeech (pp. 1895–1898). Rhodes, Greece.

Mirowski, P., Madhavan, D., LeCun, Y., & Kuzniecky, R. (2009). Classification of patterns of EEG synchronization for seizure prediction. Clinical Neurophysiology, 120(11), 1927–1940. https://doi.org/10.1016/j.clinph.2009.09.002.

Ney, J. P., van der Goes, D. N., Nuwer, M. R., & Nelson, L. (2016). Continuous and routine EEG in intensive care: utilization and outcomes, United States 2005–2009. Neurology, 81(23), 2002–2008. https://doi.org/10.1212/01.wnl.0000436948.93399.2a.
Obeid, I., & Picone, J. (2016). The Temple University Hospital EEG Data Corpus. Frontiers in Neuroscience, Section Neural Technology, 10, 196. https://doi.org/10.3389/fnins.2016.00196.

Obeid, I., Picone, J., & Harabagiu, S. (2016). Automatic discovery and processing of EEG cohorts from clinical records. In Big Data to Knowledge All Hands Grantee Meeting (p. 1). Bethesda, Maryland, USA: National Institutes of Health. Retrieved from https://www.isip.piconepress.com/publications/conference_presentations/2016/nih_bd2k/cohort/.

Osorio, I., Frei, M. G., & Wilkinson, S. B. (1998). Real-time automated detection and quantitative analysis of seizures and short-term prediction of clinical onset. Epilepsia, 39(6), 615–627. Retrieved from http://onlinelibrary.wiley.com/doi/10.1111/j.1528-1157.1998.tb01430.x/abstract.

Picone, J. (1990). Continuous speech recognition using hidden Markov models. IEEE ASSP Magazine, 7(3), 26–41. https://doi.org/10.1109/53.54527.

Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247. https://doi.org/10.1109/5.237532.

Ramgopal, S. (2014). Seizure detection, seizure prediction, and closed-loop warning systems in epilepsy. Epilepsy & Behavior, 37, 291–307. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/25174001.

Sartoretto, F., & Ermani, M. (1999). Automatic detection of epileptiform activity by single-level wavelet analysis. Clinical Neurophysiology, 110(2), 239–249. Retrieved from http://www.sciencedirect.com/science/article/pii/S0013469498001163.

Schad, A., Schindler, K., Schelter, B., Maiwald, T., Brandt, A., Timmer, J., & Schulze-Bonhage, A. (2008).
Application of a multivariate seizure detection and prediction method to non-invasive and intracranial long-term EEG recordings. Clinical Neurophysiology, 119(1), 197–211. Retrieved from http://www.sciencedirect.com/science/article/pii/S1388245707005986.

Schindler, K., Wiest, R., Kollar, M., & Donati, F. (2001). Using simulated neuronal cell models for detection of epileptic seizures in foramen ovale and scalp EEG. Clinical Neurophysiology, 112(6), 1006–1017. https://doi.org/10.1016/S1388-2457(01)00522-3.

Shah, V., von Weltin, E., Ahsan, T., Ziyabari, S., Golmohammadi, M., Obeid, I., & Picone, J. (2018). A cost-effective method for generating high-quality annotations of seizure events. Submitted to the Journal of Clinical Neurophysiology. Retrieved from https://www.isip.piconepress.com/publications/unpublished/journals/2017/jcn/ira.

Stam, C. J. (2005). Nonlinear dynamical analysis of EEG and MEG: Review of an emerging field. Clinical Neurophysiology, 116(10), 2266–2301. Retrieved from http://www.sciencedirect.com/science/article/pii/S1388245705002403.

Subasi, A. (2007). EEG signal classification using wavelet feature extraction and a mixture of expert model. Expert Systems with Applications, 32(4), 1084–1093. https://doi.org/10.1016/j.eswa.2006.02.005.

Thodoroff, P., Pineau, J., & Lim, A. (2016). Learning robust features using deep learning for automatic seizure detection. Proceedings of Machine Learning Research, 56, 178–190. Retrieved from https://arxiv.org/pdf/1608.00220.pdf.

van der Maaten, L., Postma, E., & van den Herik, J. (2009). Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10, 1–41.

Varsavsky, A., & Mareels, I. (2006). Patient un-specific detection of epileptic seizures through changes in variance.
In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 3747–3750). New York, New York, USA: IEEE. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4462614.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning (pp. 1096–1103). New York, NY, USA. https://doi.org/10.1145/1390156.1390294.

Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2017). The Microsoft 2017 conversational speech recognition system. Retrieved from https://arxiv.org/abs/1708.06073.

Yamada, T., & Meng, E. (2017). Practical guide for clinical neurophysiologic testing: EEG. Philadelphia, Pennsylvania, USA: Lippincott Williams & Wilkins. Retrieved from https://shop.lww.com/Practical-Guide-for-Clinical-Neurophysiologic-Testing--EEG/p/9781496383020.
