LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection

Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, Gautam Shroff
{malhotra.pankaj, anusha.ramakrishnan, gaurangi.anand, lovekesh.vig, puneet.a, gautam.shroff}@tcs.com
TCS Research, New Delhi, India

Abstract

Mechanical devices such as engines, vehicles, aircraft, etc., are typically instrumented with numerous sensors to capture the behavior and health of the machine. However, there are often external factors or variables which are not captured by sensors, leading to time-series which are inherently unpredictable. For instance, manual controls and/or unmonitored environmental conditions or load may lead to inherently unpredictable time-series. Detecting anomalies in such scenarios becomes challenging using standard approaches based on mathematical models that rely on stationarity, or prediction models that utilize prediction errors to detect anomalies. We propose a Long Short Term Memory Networks based Encoder-Decoder scheme for Anomaly Detection (EncDec-AD) that learns to reconstruct 'normal' time-series behavior, and thereafter uses reconstruction error to detect anomalies. We experiment with three publicly available quasi-predictable time-series datasets: power demand, space shuttle, and ECG, and two real-world engine datasets with both predictive and unpredictable behavior. We show that EncDec-AD is robust and can detect anomalies from predictable, unpredictable, periodic, aperiodic, and quasi-periodic time-series. Further, we show that EncDec-AD is able to detect anomalies from short time-series (length as small as 30) as well as long time-series (length as large as 500).

Presented at the ICML 2016 Anomaly Detection Workshop, New York, NY, USA, 2016. Copyright 2016 Tata Consultancy Services Ltd.

1. Introduction

In real-world sensor data from machines, there are scenarios when the behavior of a machine changes based on usage and external factors which are difficult to capture. For example, a laden machine behaves differently from an unladen machine. Further, the relevant information pertaining to whether a machine is laden or unladen may not be available. The amount of load on a machine at a time may be unknown or change very frequently/abruptly, for example, in an earth digger. A machine may have multiple manual controls, some of which may not be captured in the sensor data. Under such settings, it becomes difficult to predict the time-series, even for the very near future (see Figure 1), rendering ineffective prediction-based time-series anomaly detection models, such as ones based on exponentially weighted moving average (EWMA) (Basseville & Nikiforov, 1993), SVR (Ma & Perkins, 2003), or Long Short-Term Memory (LSTM) networks (Malhotra et al., 2015).

Figure 1. Readings for a manual control sensor: (a) predictable, (b) unpredictable.

LSTM networks (Hochreiter & Schmidhuber, 1997) are recurrent models that have been used for many sequence learning tasks like handwriting recognition, speech recognition, and sentiment analysis. LSTM Encoder-Decoder models have recently been proposed for sequence-to-sequence learning tasks like machine translation (Cho et al., 2014; Sutskever et al., 2014). An LSTM-based encoder is used to map an input sequence to a vector representation of fixed dimensionality.
The decoder is another LSTM network which uses this vector representation to produce the target sequence. Other variants have been proposed for natural language generation and reconstruction (Li et al., 2015), parsing (Vinyals et al., 2015), and image captioning (Bengio et al., 2015).

We propose an LSTM-based Encoder-Decoder scheme for Anomaly Detection in multi-sensor time-series (EncDec-AD). An encoder learns a vector representation of the input time-series and the decoder uses this representation to reconstruct the time-series. The LSTM-based encoder-decoder is trained to reconstruct instances of 'normal' time-series, with the target time-series being the input time-series itself. Then, the reconstruction error at any future time-instance is used to compute the likelihood of anomaly at that point. We show that such an encoder-decoder model learnt using only the normal sequences can be used for detecting anomalies in multi-sensor time-series: the intuition here is that the encoder-decoder pair would only have seen normal instances during training and learnt to reconstruct them. When given an anomalous sequence, it may not be able to reconstruct it well, and hence would lead to higher reconstruction errors compared to the reconstruction errors for the normal sequences.

EncDec-AD uses only the normal sequences for training. This is particularly useful in scenarios when anomalous data is not available or is sparse, making it difficult to learn a classification model over the normal and anomalous sequences. This is especially true of machines that undergo periodic maintenance and therefore get serviced before anomalies show up in the sensor readings.

2. EncDec-AD

Consider a time-series $X = \{x^{(1)}, x^{(2)}, ..., x^{(L)}\}$ of length $L$, where each point $x^{(i)} \in \mathbb{R}^m$ is an $m$-dimensional vector of readings for $m$ variables at time-instance $t_i$. We consider the scenario where multiple such time-series are available or can be obtained by taking a window of length $L$ over a larger time-series. We first train the LSTM Encoder-Decoder model to reconstruct the normal time-series. The reconstruction errors are then used to obtain the likelihood of a point in a test time-series being anomalous, s.t. for each point $x^{(i)}$ an anomaly score $a^{(i)}$ of the point being anomalous is obtained. A higher anomaly score indicates a higher likelihood of the point being anomalous.

2.1. LSTM Encoder-Decoder as reconstruction model

We train an LSTM encoder-decoder to reconstruct instances of normal time-series. The LSTM encoder learns a fixed-length vector representation of the input time-series, and the LSTM decoder uses this representation to reconstruct the time-series using the current hidden state and the value predicted at the previous time-step. Given $X$, $h_E^{(i)}$ is the hidden state of the encoder at time $t_i$ for each $i \in \{1, 2, ..., L\}$, where $h_E^{(i)} \in \mathbb{R}^c$ and $c$ is the number of LSTM units in the hidden layer of the encoder. The encoder and decoder are jointly trained to reconstruct the time-series in reverse order (similar to (Sutskever et al., 2014)), i.e. the target time-series is $\{x^{(L)}, x^{(L-1)}, ..., x^{(1)}\}$. The final state $h_E^{(L)}$ of the encoder is used as the initial state for the decoder. A linear layer on top of the LSTM decoder layer is used to predict the target. During training, the decoder uses $x^{(i)}$ as input to obtain the state $h_D^{(i-1)}$, and then predicts $x'^{(i-1)}$ corresponding to the target $x^{(i-1)}$. During inference, the predicted value $x'^{(i)}$ is input to the decoder to obtain $h_D^{(i-1)}$ and predict $x'^{(i-1)}$. The model is trained to minimize the objective $\sum_{X \in s_N} \sum_{i=1}^{L} \| x^{(i)} - x'^{(i)} \|^2$, where $s_N$ is the set of normal training sequences.

Figure 2. LSTM Encoder-Decoder inference steps for input $\{x^{(1)}, x^{(2)}, x^{(3)}\}$ to predict $\{x'^{(1)}, x'^{(2)}, x'^{(3)}\}$.

Figure 2 depicts the inference steps in an LSTM Encoder-Decoder reconstruction model for a sequence with $L = 3$. The value $x^{(i)}$ at time instance $t_i$ and the hidden state $h_E^{(i-1)}$ of the encoder at time $t_{i-1}$ are used to obtain the hidden state $h_E^{(i)}$ of the encoder at time $t_i$. The hidden state $h_E^{(3)}$ of the encoder at the end of the input sequence is used as the initial state $h_D^{(3)}$ of the decoder, s.t. $h_D^{(3)} = h_E^{(3)}$. A linear layer with weight matrix $w$ of size $c \times m$ and bias vector $b \in \mathbb{R}^m$ on top of the decoder is used to compute $x'^{(3)} = w^T h_D^{(3)} + b$. The decoder uses $h_D^{(i)}$ and prediction $x'^{(i)}$ to obtain the next hidden state $h_D^{(i-1)}$.
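For concreteness, the following is a minimal sketch of such a reconstruction model (not the authors' original implementation). It assumes PyTorch; the class and function names, the use of an LSTMCell for step-wise decoding, and the squared-error reduction are our own choices, while the reversed-order target, the initial decoder state taken from the encoder, the linear output layer, and teacher forcing during training versus feeding back predictions during inference follow the description above.

```python
import torch
import torch.nn as nn

class EncDecAD(nn.Module):
    """LSTM encoder-decoder that reconstructs a window in reverse order."""
    def __init__(self, m, c):
        super().__init__()
        self.encoder = nn.LSTM(m, c, batch_first=True)
        self.decoder_cell = nn.LSTMCell(m, c)
        self.linear = nn.Linear(c, m)   # x' = w^T h_D + b

    def forward(self, x, teacher_forcing=True):
        # x: (batch, L, m). The target is the input sequence in reverse order.
        L = x.size(1)
        _, (h, cstate) = self.encoder(x)            # final encoder state h_E^(L)
        h, cstate = h.squeeze(0), cstate.squeeze(0)
        target = torch.flip(x, dims=[1])            # {x^(L), ..., x^(1)}
        preds = [self.linear(h)]                    # x'^(L) from the initial decoder state
        for i in range(L - 1):
            # decoder input: true previous value (training) or own prediction (inference)
            step_in = target[:, i, :] if teacher_forcing else preds[-1]
            h, cstate = self.decoder_cell(step_in, (h, cstate))
            preds.append(self.linear(h))            # x'^(L-1), ..., x'^(1)
        return torch.stack(preds, dim=1), target

def train_step(model, batch, optimizer):
    """One squared-error training step over a mini-batch of normal windows."""
    optimizer.zero_grad()
    recon, target = model(batch, teacher_forcing=True)
    loss = ((recon - target) ** 2).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Both the encoder and decoder in this sketch have a single hidden layer with $c$ LSTM units, matching the architectures used in the experiments (Section 3), and mini-batch training with the Adam optimizer (e.g. torch.optim.Adam) corresponds to the optimization setup described there.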
2.2. Computing likelihood of anomaly

Similar to (Malhotra et al., 2015), we divide the normal time-series into four sets: $s_N$, $v_{N_1}$, $v_{N_2}$, and $t_N$, and the anomalous time-series into two sets: $v_A$ and $t_A$. The set of sequences $s_N$ is used to learn the LSTM encoder-decoder reconstruction model. The set $v_{N_1}$ is used for early stopping while training the encoder-decoder model. The reconstruction error vector for $t_i$ is given by $e^{(i)} = |x^{(i)} - x'^{(i)}|$. The error vectors for the points in the sequences in set $v_{N_1}$ are used to estimate the parameters $\mu$ and $\Sigma$ of a Normal distribution $\mathcal{N}(\mu, \Sigma)$ using Maximum Likelihood Estimation. Then, for any point $x^{(i)}$, the anomaly score is $a^{(i)} = (e^{(i)} - \mu)^T \Sigma^{-1} (e^{(i)} - \mu)$.

In a supervised setting, if $a^{(i)} > \tau$, a point in a sequence can be predicted to be "anomalous", otherwise "normal". When enough anomalous sequences are available, a threshold $\tau$ over the likelihood values is learnt to maximize $F_\beta = (1 + \beta^2) \times P \times R / (\beta^2 P + R)$, where $P$ is precision, $R$ is recall, "anomalous" is the positive class and "normal" is the negative class. If a window contains an anomalous pattern, the entire window is labeled as "anomalous". This is helpful in many real-world applications where the exact position of the anomaly is not known. For example, for the engine dataset (refer Section 3), the only information available is that the machine was repaired on a particular date. The last few operational runs prior to the repair are assumed to be anomalous and the first few operational runs after the repair are assumed to be normal. We assume $\beta < 1$ since the fraction of actual anomalous points in a sequence labeled as anomalous may not be high, and hence lower recall is expected. The parameters $\tau$ and $c$ are chosen so as to maximize the $F_\beta$ score on the validation sequences in $v_{N_2}$ and $v_A$.
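The scoring and threshold-selection steps above translate directly into a short NumPy sketch (an illustrative reimplementation, not the authors' code). The function names are ours, and the inputs are assumed to be arrays produced elsewhere: `errors` are reconstruction error vectors $|x^{(i)} - x'^{(i)}|$, and `labels` are 0/1 point-level anomaly labels on a validation set.

```python
import numpy as np

def fit_error_distribution(errors):
    """MLE of mu and Sigma of a Normal distribution over error vectors.
    errors: array of shape (n_points, m), each row is |x^(i) - x'^(i)|."""
    mu = errors.mean(axis=0)
    sigma = np.atleast_2d(np.cov(errors, rowvar=False))
    return mu, np.linalg.inv(sigma)

def anomaly_scores(errors, mu, sigma_inv):
    """a^(i) = (e^(i) - mu)^T Sigma^{-1} (e^(i) - mu) for each error vector."""
    centered = errors - mu
    return np.einsum('ij,jk,ik->i', centered, sigma_inv, centered)

def select_threshold(scores, labels, beta=0.05):
    """Pick tau maximizing F_beta on validation anomaly scores."""
    best_tau, best_f = None, -1.0
    for tau in np.unique(scores):
        pred = scores > tau
        tp = np.sum(pred & (labels == 1))
        p = tp / max(pred.sum(), 1)
        r = tp / max((labels == 1).sum(), 1)
        f = (1 + beta ** 2) * p * r / max(beta ** 2 * p + r, 1e-12)
        if f > best_f:
            best_tau, best_f = tau, f
    return best_tau, best_f
```

In EncDec-AD the error statistics would be estimated on $v_{N_1}$ and the threshold selected on the labelled validation sequences in $v_{N_2}$ and $v_A$, as described above.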
3. Experiments

We consider four real-world datasets: power demand, space shuttle valve, ECG, and engine (see Table 1). The first three are taken from (Keogh et al., 2005), whereas the engine dataset is a proprietary one encountered in a real-life project. The engine dataset contains data for two different applications: Engine-P, where the time-series is quasi-predictable, and Engine-NP, where the time-series is unpredictable, for reasons such as those mentioned earlier.

In our experiments, we consider architectures where both the encoder and decoder have a single hidden layer with $c$ LSTM units each. Mini-batch stochastic optimization based on the Adam optimizer (Kingma & Ba, 2014) is used for training the LSTM Encoder-Decoder. Table 2 shows the performance of EncDec-AD on all the datasets.

Table 1. Nature of datasets. $N$, $N_n$ and $N_a$ are the number of original sequences, normal subsequences and anomalous subsequences, respectively.

Dataset         Predictable   Dimensions   Periodicity      N    N_n   N_a
Power Demand    Yes           1            Periodic         1    45    6
Space Shuttle   Yes           1            Periodic         3    20    8
Engine-P        Yes           12           Aperiodic        30   240   152
Engine-NP       No            12           Aperiodic        6    200   456
ECG             Yes           1            Quasi-periodic   1    215   1

Table 2. $F_\beta$-scores and positive likelihood ratios (TPR/FPR).

Dataset         L     c    β      P      R       F_β-score   TPR/FPR
Power Demand    84    40   0.1    0.92   0.04    0.77        33.0
Space Shuttle   500   50   0.05   0.83   0.08    0.81        4.9
Engine-P        30    40   0.05   0.94   0.02    0.82        13.8
Engine-NP       30    90   0.05   1.0    0.01    0.83        ∞
ECG             208   45   0.05   1.0    0.005   0.65        ∞

3.1. Datasets

Power demand dataset contains one univariate time-series with 35,040 readings for power demand recorded over a period of one year. The demand is normally high during the weekdays and low over the weekend. Within a day, the demand is high during working hours and low otherwise (see Figure 3(a), top-most subplot). A week when any of the first 5 days has low power demand (similar to the demand over the weekend) is considered anomalous (see Figure 3(b), where the first day has low power demand). We downsample the original time-series by 8 to obtain non-overlapping sequences with $L = 84$ such that each window corresponds to one week.

Figure 3. Sample original normal (first column) and anomalous (second column) sequences (first row, blue) with corresponding reconstructed sequences (second row, green) and anomaly scores (third row, red): (a) Power-N, (b) Power-A, (c) Space Shuttle-N, (d) Space Shuttle-A, (e) Engine-P-N, (f) Engine-P-A, (g) Engine-NP-N, (h) Engine-NP-A, (i) ECG-N, (j) ECG-A. The red regions in the original time-series for anomalous sequences correspond to the exact location of the anomaly in the sequence (whenever available). Plots in the same row have the same y-axis scale. The anomaly scores are on log-scale.

Space shuttle dataset contains periodic sequences with 1000 points per cycle, and 15 such cycles. We deliberately choose $L = 1500$ such that a subsequence covers more than one cycle (1.5 cycles per subsequence) and consider sliding windows with a step size of 500. We downsample the original time-series by 3. The normal and anomalous sequences in Figures 3(c)-3(d) belong to the TEK17 and TEK14 time-series, respectively.
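The windowing and downsampling described for the power demand and space shuttle data amount to a small preprocessing helper; the sketch below is illustrative only, the function name is ours, and the factor, window, and step values are those quoted above.

```python
import numpy as np

def make_windows(series, L, step, downsample=1):
    """Downsample a 1-D series by an integer factor, then cut it into windows
    of length L taken every `step` points (step == L gives non-overlapping windows)."""
    x = np.asarray(series, dtype=float)[::downsample]
    return np.stack([x[i:i + L] for i in range(0, len(x) - L + 1, step)])

# Power demand: downsample by 8, non-overlapping weekly windows with L = 84.
# power_windows = make_windows(power_series, L=84, step=84, downsample=8)
# Space shuttle: downsample by 3, L = 1500 with a sliding step of 500.
# shuttle_windows = make_windows(shuttle_series, L=1500, step=500, downsample=3)
```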
Engine dataset contains readings for 12 sensors such as coolant temperature, torque, accelerator (control variable), etc. We consider two different applications of the engine: Engine-P and Engine-NP. Engine-P has a discrete external control with two states: 'high' and 'low'. The resulting time-series are predictable except at the time-instances when the control variable changes. On the other hand, the external control for Engine-NP can assume any value within a certain range and changes very frequently, and hence the resulting time-series are unpredictable. Sample sequences for the control variables from Engine-P and Engine-NP are shown in Figures 1(a) and 1(b), respectively. We randomly choose $L = 30$ for both Engine-P and Engine-NP. We reduce the multivariate time-series to univariate by considering only the first principal component after applying principal component analysis (Jolliffe, 2002). The first component captures 72% of the variance for Engine-P and 61% for Engine-NP.

ECG dataset contains quasi-periodic time-series (the duration of a cycle varies from one instance to another). For our experiment, we use the first channel from the qtdb/sel102 dataset, where the time-series contains one anomaly corresponding to a pre-ventricular contraction (see Figure 3(j)). We consider non-overlapping subsequences with $L = 208$ (each subsequence corresponds to approximately 800 ms). Since only one anomaly is present in the dataset, the sets $v_{N_2}$ and $v_A$ are not created. The best model, i.e. $c$, is chosen based on the minimum reconstruction error on set $v_{N_1}$. We choose $\tau = \mu_a + \sigma_a$, where $\mu_a$ and $\sigma_a$ are the mean and standard deviation of the anomaly scores of the points from $v_{N_1}$.

3.2. Observations

The key observations from our experiments are as follows:

1) The positive likelihood ratio is significantly higher than 1.0 for all the datasets (see Table 2). High positive likelihood ratio values suggest that EncDec-AD gives significantly higher anomaly scores for anomalous points as compared to normal points.

2) For periodic time-series, we experiment with varying window lengths: window length same as the length of one cycle (power demand dataset) and window length greater than the length of one cycle (space shuttle dataset). We also consider a quasi-periodic time-series (ECG). EncDec-AD is able to detect anomalies in all these scenarios.

3) A time-series prediction based anomaly detection model, LSTM-AD (Malhotra et al., 2015), gives better results for the predictable datasets: Space Shuttle, Power and Engine-P (corresponding to the Engine dataset in (Malhotra et al., 2015)), with $F_{0.1}$ scores of 0.84, 0.90 and 0.89, respectively. On the other hand, EncDec-AD gives better results for Engine-NP, where the sequences are not predictable. The best LSTM-AD model gives P, R, $F_{0.05}$ and TPR/FPR of 0.03, 0.07, 0.03 and 1.9, respectively (for a two hidden layer architecture with 30 LSTM units in each layer and a prediction length of 1), owing to the fact that the time-series is not predictable and hence a good prediction model could not be learnt, whereas EncDec-AD gives P, R, $F_{0.1}$ score and TPR/FPR of 0.96, 0.18, 0.93 and 7.6, respectively.

4. Related Work

Time-series prediction models have been shown to be effective for anomaly detection by using the prediction error, or a function of the prediction error, as a measure of the severity of the anomaly (Hayton et al., 2007; Ma & Perkins, 2003; Ye et al., 2000). Recently, deep LSTMs have been used as prediction models in LSTM-AD (Malhotra et al., 2015; Chauhan & Vig, 2015; Yadav et al.), where a prediction model learnt over the normal time-series using LSTM networks is used to predict future points, and the likelihood of prediction error is used as a measure of anomaly. EncDec-AD learns a representation from the entire sequence, which is then used to reconstruct the sequence, and is therefore different from prediction-based anomaly detection models. Non-temporal reconstruction models such as denoising autoencoders (Sakurada & Yairi, 2014) and Deep Belief Nets (Wulsin et al., 2010) have also been proposed for anomaly detection.
For time-series data, an LSTM-based encoder-decoder is a natural extension to such models.

5. Discussion

We show that an LSTM Encoder-Decoder based reconstruction model learnt over normal time-series can be a viable approach to detect anomalies in time-series. Our approach works well for detecting anomalies from predictable as well as unpredictable time-series. Whereas many existing models for anomaly detection rely on the fact that the time-series should be predictable, EncDec-AD is shown to detect anomalies even from unpredictable time-series, and hence may be more robust compared to such models. The fact that EncDec-AD is able to detect anomalies from time-series with length as large as 500 suggests that the LSTM encoder-decoders are learning a robust model of normal behavior.

References

Basseville, Michèle and Nikiforov, Igor V. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993. ISBN 0-13-126780-9.

Bengio, Samy, Vinyals, Oriol, Jaitly, Navdeep, and Shazeer, Noam. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171-1179, 2015.

Chauhan, Sucheta and Vig, Lovekesh. Anomaly detection in ECG time signals via deep long short-term memory networks. In Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference on, pp. 1-7. IEEE, 2015.

Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Hayton, Paul, Utete, Simukai, King, Dennis, King, Steve, Anuzis, Paul, and Tarassenko, Lionel. Static and dynamic novelty detection methods for jet engine health monitoring. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 365(1851):493-514, 2007.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jolliffe, Ian. Principal Component Analysis. Wiley Online Library, 2002.

Keogh, Eamonn, Lin, Jessica, and Fu, Ada. HOT SAX: Efficiently finding the most unusual time series subsequence. In Data Mining, Fifth IEEE International Conference on, pp. 8 pp. IEEE, 2005.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Li, Jiwei, Luong, Minh-Thang, and Jurafsky, Dan. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.

Ma, Junshui and Perkins, Simon. Online novelty detection on temporal sequences. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 613-618. ACM, 2003.

Malhotra, Pankaj, Vig, Lovekesh, Shroff, Gautam, and Agarwal, Puneet. Long short term memory networks for anomaly detection in time series. In ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2015.

Sakurada, Mayu and Yairi, Takehisa. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA '14, pp. 4:4-4:11, New York, NY, USA, 2014. ACM.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3104-3112. Curran Associates, Inc., 2014.

Vinyals, Oriol, Kaiser, Łukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2755-2763, 2015.

Wulsin, Drausin, Blanco, Justin, Mani, Ram, and Litt, Brian. Semi-supervised anomaly detection for EEG waveforms using deep belief nets. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pp. 436-441. IEEE, 2010.

Yadav, Mohit, Malhotra, Pankaj, Vig, Lovekesh, Sriram, K, and Shroff, Gautam. ODE-augmented training improves anomaly detection in sensor data from machines. In NIPS Time Series Workshop 2015. URL http://arxiv.org/abs/1605.01534.

Ye, Nong et al. A Markov chain model of temporal behavior for anomaly detection. In Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, volume 166, pp. 169. West Point, NY, 2000.