Unsupervised and Semi-supervised Anomaly Detection with LSTM Neural Networks
Authors: Tolga Ergen, Ali Hassan Mirza, Suleyman Serdar Kozat
Tolga Ergen, Ali H. Mirza, and Suleyman S. Kozat, Senior Member, IEEE

Abstract—We investigate anomaly detection in an unsupervised framework and introduce Long Short Term Memory (LSTM) neural network based algorithms. In particular, given variable length data sequences, we first pass these sequences through our LSTM based structure and obtain fixed length sequences. We then find a decision function for our anomaly detectors based on the One Class Support Vector Machines (OC-SVM) and Support Vector Data Description (SVDD) algorithms. For the first time in the literature, we jointly train and optimize the parameters of the LSTM architecture and the OC-SVM (or SVDD) algorithm using highly effective gradient and quadratic programming based training methods. To apply the gradient based training method, we modify the original objective criteria of the OC-SVM and SVDD algorithms, and we prove the convergence of the modified objective criteria to the original criteria. We also provide extensions of our unsupervised formulation to the semi-supervised and fully supervised frameworks. Thus, we obtain anomaly detection algorithms that can process variable length data sequences while providing high performance, especially for time series data. Our approach is generic, so we also apply it to the Gated Recurrent Unit (GRU) architecture by directly replacing our LSTM based structure with the GRU based structure. In our experiments, we illustrate significant performance gains achieved by our algorithms with respect to the conventional methods.

Index Terms—Anomaly detection, Support Vector Machines, Support Vector Data Description, LSTM, GRU.

I. INTRODUCTION

A.
Preliminaries

Anomaly detection [1] has attracted significant interest in the contemporary learning literature due to its applications in a wide range of engineering problems, e.g., sensor failure [2], network monitoring [3], cybersecurity [4] and surveillance [5]. In this paper, we study the variable length anomaly detection problem in an unsupervised framework, where we seek to find a function to decide whether each unlabeled variable length sequence in a given dataset is anomalous or not. Note that although this problem is extensively studied in the literature and there exist different methods, e.g., supervised (or semi-supervised) methods, that require the knowledge of data labels, we employ an unsupervised method due to the high cost of obtaining accurate labels in most real life applications [1] such as cybersecurity [4] and surveillance [5]. However, we also extend our derivations to the semi-supervised and fully supervised frameworks for completeness.

(This work is supported in part by the Outstanding Researcher Programme of the Turkish Academy of Sciences and TUBITAK Contract No 117E153. The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey, Tel: +90 (312) 290-2336, Fax: +90 (312) 290-1223; contact e-mail: {ergen, mirza, kozat}@ee.bilkent.edu.tr.)

In the current literature, a common and widely used approach for anomaly detection is to find a decision function that defines the model of normality [1]. In this approach, one first defines a certain decision function and then optimizes the parameters of this function with respect to a predefined objective criterion, e.g., the One Class Support Vector Machines (OC-SVM) and Support Vector Data Description (SVDD) algorithms [6], [7]. However, algorithms based on this approach examine time series data over a sufficiently long time window to achieve an acceptable performance [1], [8], [9].
Thus, their performance significantly depends on the length of this time window, so this approach requires careful selection of the time window length to provide a satisfactory performance [8], [10]. To enhance performance for time series data, neural network based approaches, especially ones based on Recurrent Neural Networks (RNNs), are introduced thanks to their inherent memory structure that can store "time" or "state" information [1], [11]. However, since the basic RNN architecture does not have control structures (gates) to regulate the amount of information to be stored [12], [13], a more advanced RNN architecture with several control structures, i.e., the Long Short Term Memory (LSTM) network, is introduced [13], [14]. However, neural network based approaches do not directly optimize an objective criterion for anomaly detection [1], [15]. Instead, they first predict a sequence from its past samples and then determine whether the sequence is an anomaly or not based on the prediction error, i.e., an anomaly is an event that cannot be predicted from the past nominal data [1]. Thus, they require a probabilistic model for the prediction error and a threshold on the probabilistic model to detect anomalies, which results in challenging optimization problems and restricts their performance accordingly [1], [15], [16]. Furthermore, both the common and neural network based approaches can process only fixed length vector sequences, which significantly limits their usage in real life applications [1].

In order to circumvent these issues, we introduce novel LSTM based anomaly detection algorithms for variable length data sequences. In particular, we first pass variable length data sequences through an LSTM based structure to obtain fixed length representations. We then apply our OC-SVM [6] and SVDD [7] based algorithms for detecting anomalies in the extracted fixed length vectors as illustrated in Fig. 1.
Unlike the previous approaches in the literature [1], we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation to maximize the detection performance. For this joint optimization, we propose two different training methods, i.e., a quadratic programming based and a gradient based algorithm, where the merits of each approach are detailed in the paper. For our gradient based training method, we modify the original OC-SVM and SVDD formulations and then provide the convergence results of the modified formulations to the original ones.

[Fig. 1: Overall structure of our anomaly detection approach.]

Thus, instead of following the prediction based approaches [1], [15], [16] in the current literature, we define proper objective functions for anomaly detection using the LSTM architecture and optimize the parameters of the LSTM architecture via these well defined objective functions. Hence, our anomaly detection algorithms are able to process variable length sequences and provide high performance for time series data. Furthermore, since we introduce a generic approach in the sense that it can be applied to any RNN architecture, we also apply our approach to the Gated Recurrent Unit (GRU) architecture [17], i.e., an advanced RNN architecture like the LSTM architecture, in our simulations. Through an extensive set of experiments, we demonstrate significant performance gains with respect to the conventional methods [6], [7].

B. Prior Art and Comparisons

Several different methods have been introduced for the anomaly detection problem [1]. Among these methods, the OC-SVM [6] and SVDD [7] algorithms are generally employed due to their high performance in real life applications [18].
However, these algorithms provide inadequate performance for time series data due to their inability to capture time dependencies [8], [9]. In order to improve the performance of these algorithms for time series data, the authors of [9] convert time series data into a set of vectors by replicating each sample so that they obtain two dimensional vector sequences. However, even though they obtain two dimensional vector sequences, the second dimension does not provide additional information, so this approach still performs inadequately for time series data [8]. As another approach, the OC-SVM based method in [8] acquires a set of vectors from time series data by unfolding the data into a phase space using a time delay embedding process [19]. More specifically, for a certain sample, they create an E dimensional vector by using the previous E − 1 samples along with the sample itself [8]. However, in order to obtain a satisfactory performance from this approach, the dimensionality, i.e., E, should be carefully tuned, which restricts its usage in real life applications [20]. On the other hand, even though LSTM based algorithms provide high performance for time series data, we have to solve highly complex optimization problems to obtain an adequate performance [1]. As an example, the LSTM based anomaly detection algorithms in [10], [21] first predict time series data and then fit a multivariate Gaussian distribution to the error, where they also select a threshold for this distribution. Here, they allocate different sets of sequences to learn the parameters of the distribution and the threshold via the maximum likelihood estimation technique [10], [21]. Thus, the conventional LSTM based approaches require careful selection of several additional parameters, which significantly degrades their performance in real life [1], [10].
Furthermore, both the OC-SVM (or SVDD) and LSTM based methods are able to process only fixed length sequences [6], [7], [10]. To circumvent these issues, we introduce generic LSTM based anomaly detectors for variable length data sequences, where we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation via a predefined objective function. Therefore, we not only obtain high performance for time series data but also enjoy joint and effective optimization of the parameters with respect to a well defined objective function.

C. Contributions

Our main contributions are as follows:
• We introduce LSTM based anomaly detection algorithms in an unsupervised framework, where we also extend our derivations to the semi-supervised and fully supervised frameworks.
• For the first time in the literature, we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation via a well defined objective function, where we introduce two different joint optimization methods. For our gradient based joint optimization method, we modify the OC-SVM and SVDD formulations and then prove the convergence of the modified formulations to the original ones.
• Thanks to our LSTM based structure, the introduced methods are able to process variable length data sequences. Additionally, unlike the conventional methods [6], [7], our methods effectively detect anomalies in time series data without requiring any preprocessing.
• Through an extensive set of experiments involving real and simulated data, we illustrate significant performance improvements achieved by our algorithms with respect to the conventional methods [6], [7]. Moreover, since our approach is generic, we also apply it to the recently proposed GRU architecture [17] in our experiments.

D. Organization of the Paper

The organization of this paper is as follows.
In Section II, we first describe the variable length anomaly detection problem and then introduce our LSTM based structure. In Section III-A, we introduce anomaly detection algorithms based on the OC-SVM formulation, where we also propose two different joint training methods in order to learn the LSTM and SVM parameters. The merits of each approach are also detailed in the same section. In a similar manner, we introduce anomaly detection algorithms based on the SVDD formulation and provide two different joint training methods to learn the parameters in Section III-B. In Section IV, we demonstrate performance improvements over several real life datasets. In the same section, thanks to our generic approach, we also introduce GRU based anomaly detection algorithms. Finally, we provide concluding remarks in Section V.

II. MODEL AND PROBLEM DESCRIPTION

In this paper, all vectors are column vectors and denoted by boldface lowercase letters. Matrices are represented by boldface uppercase letters. For a vector $a$, $a^T$ is its ordinary transpose and $\|a\| = \sqrt{a^T a}$ is the $\ell_2$-norm. The time index is given as a subscript, e.g., $a_i$ is the $i$th vector. Here, $\mathbf{1}$ (and $\mathbf{0}$) is a vector of all ones (and zeros) and $I$ represents the identity matrix, where the sizes are understood from the context.

We observe data sequences $\{X_i\}_{i=1}^{n}$, defined as $X_i = [x_{i,1}\; x_{i,2}\; \ldots\; x_{i,d_i}]$, where $x_{i,j} \in \mathbb{R}^p$, $\forall j \in \{1, 2, \ldots, d_i\}$, and $d_i \in \mathbb{Z}^{+}$ is the number of columns in $X_i$, which can vary with respect to $i$. Here, we assume that the bulk of the observed sequences are normal and the remaining sequences are anomalous. Our aim is to find a scoring (or decision) function to determine whether $X_i$ is anomalous or not based on the observed data, where $+1$ and $-1$ represent the outputs of the desired scoring function for nominal and anomalous data, respectively.
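As a concrete illustration of the data model above, the following minimal numpy sketch builds a toy set of variable length sequences; the sizes and random values are illustrative assumptions, not data from the paper.

```python
# Sketch of the data model in Section II: n variable-length sequences
# X_i = [x_{i,1} ... x_{i,d_i}] with columns x_{i,j} in R^p and a column
# count d_i that varies with i. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                                   # number of sequences, input dimension

lengths = rng.integers(low=2, high=8, size=n) # d_i, different for each sequence
X = [rng.standard_normal((p, d)) for d in lengths]

# a scoring function l(.) must label every X_i as +1 (normal) or -1
# (anomalous) regardless of d_i; the paper achieves this by first mapping
# each X_i to a fixed-length vector via an LSTM.
for X_i, d_i in zip(X, lengths):
    assert X_i.shape == (p, d_i)
```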
As an example application for this framework, in host based intrusion detection [1], the system handles operating system call traces, where the data consists of system calls that are generated by users or programs. All traces contain system calls that belong to the same alphabet; however, the co-occurrence of the system calls is the key issue in detecting anomalies [1]. For different programs, these system calls are executed in different sequences, where the length of the sequence may vary for each program. A binary encoding of a sample set of call sequences can be $X_1 = 101011$, $X_2 = 1010$ and $X_3 = 1011001$ for the $n = 3$ case [1]. After observing such a set of call sequences, our aim is to find a scoring function that successfully distinguishes the anomalous call sequences from the normal sequences.

In order to find a scoring function $l(\cdot)$ such that

$$l(X_i) = \begin{cases} -1 & \text{if } X_i \text{ is anomalous} \\ +1 & \text{otherwise,} \end{cases}$$

one can use the OC-SVM algorithm [6] to find a hyperplane that separates the anomalies from the normal data, or the SVDD algorithm [7] to find a hypersphere enclosing the normal data while leaving the anomalies outside the hypersphere. However, these algorithms can only process fixed length sequences. Hence, we use the LSTM architecture [14] to obtain a fixed length vector representation for each $X_i$. Although there exist several different versions of the LSTM architecture, we use the most widely employed architecture, i.e., the LSTM architecture without peephole connections [13]. We first feed $X_i$ to the LSTM architecture as demonstrated in Fig.
2, where the internal LSTM equations are as follows [14]:

$$z_{i,j} = g\big(W^{(z)} x_{i,j} + R^{(z)} h_{i,j-1} + b^{(z)}\big) \quad (1)$$
$$s_{i,j} = \sigma\big(W^{(s)} x_{i,j} + R^{(s)} h_{i,j-1} + b^{(s)}\big) \quad (2)$$
$$f_{i,j} = \sigma\big(W^{(f)} x_{i,j} + R^{(f)} h_{i,j-1} + b^{(f)}\big) \quad (3)$$
$$c_{i,j} = s_{i,j} \odot z_{i,j} + f_{i,j} \odot c_{i,j-1} \quad (4)$$
$$o_{i,j} = \sigma\big(W^{(o)} x_{i,j} + R^{(o)} h_{i,j-1} + b^{(o)}\big) \quad (5)$$
$$h_{i,j} = o_{i,j} \odot g(c_{i,j}), \quad (6)$$

where $c_{i,j} \in \mathbb{R}^m$ is the state vector, $x_{i,j} \in \mathbb{R}^p$ is the input vector and $h_{i,j} \in \mathbb{R}^m$ is the output vector for the $j$th LSTM unit in Fig. 2. Additionally, $s_{i,j}$, $f_{i,j}$ and $o_{i,j}$ are the input, forget and output gates, respectively. Here, $g(\cdot)$ is set to the hyperbolic tangent function, i.e., tanh, and applies to input vectors pointwise. Similarly, $\sigma(\cdot)$ is set to the sigmoid function. $\odot$ is the operation for elementwise multiplication of two same sized vectors. Furthermore, $W^{(\cdot)}$, $R^{(\cdot)}$ and $b^{(\cdot)}$ are the parameters of the LSTM architecture, where the size of each is selected according to the dimensionality of the input and output vectors.

[Fig. 2: Our LSTM based structure for obtaining fixed length sequences.]

After applying the LSTM architecture to each column of our data sequences as illustrated in Fig. 2, we take the average of the LSTM outputs for each data sequence, i.e., the mean pooling method. By this, we obtain a new set of fixed length sequences, denoted as $\{\bar{h}_i\}_{i=1}^{n}$, $\bar{h}_i \in \mathbb{R}^m$. Note that we also use the same procedure to obtain the state information $\bar{c}_i \in \mathbb{R}^m$ for each $X_i$ as demonstrated in Fig. 2.

Remark 1. We use the mean pooling method in order to obtain the fixed length sequences as $\bar{h}_i = \frac{1}{d_i}\sum_{j=1}^{d_i} h_{i,j}$. However, we can also use other pooling methods. As an example, for the last and max pooling methods, we use $\bar{h}_i = h_{i,d_i}$ and $\bar{h}_i = \max_j h_{i,j}$, $\forall i \in \{1, 2, \ldots, n\}$, respectively.
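The recursion (1)–(6) followed by mean pooling can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's implementation: the parameter shapes ($W^{(\cdot)} \in \mathbb{R}^{m \times p}$, $R^{(\cdot)} \in \mathbb{R}^{m \times m}$, $b^{(\cdot)} \in \mathbb{R}^m$) and the random initialization are assumptions consistent with the text.

```python
# Map a variable-length X_i in R^{p x d_i} to fixed-length (h_bar, c_bar)
# in R^m via the LSTM equations (1)-(6) and mean pooling.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_mean_pool(X_i, P, m):
    """Run (1)-(6) over the columns of X_i, then mean-pool h and c."""
    h, c = np.zeros(m), np.zeros(m)           # h_{i,0}, c_{i,0}
    hs, cs = [], []
    for j in range(X_i.shape[1]):
        x = X_i[:, j]
        z = np.tanh(P["Wz"] @ x + P["Rz"] @ h + P["bz"])   # (1)
        s = sigmoid(P["Ws"] @ x + P["Rs"] @ h + P["bs"])   # (2) input gate
        f = sigmoid(P["Wf"] @ x + P["Rf"] @ h + P["bf"])   # (3) forget gate
        c = s * z + f * c                                  # (4)
        o = sigmoid(P["Wo"] @ x + P["Ro"] @ h + P["bo"])   # (5) output gate
        h = o * np.tanh(c)                                 # (6)
        hs.append(h)
        cs.append(c)
    return np.mean(hs, axis=0), np.mean(cs, axis=0)        # h_bar_i, c_bar_i

rng = np.random.default_rng(0)
p, m = 3, 4
P = {f"{k}{g}": 0.1 * rng.standard_normal((m, p) if k == "W"
                                          else (m, m) if k == "R" else m)
     for k in ("W", "R", "b") for g in ("z", "s", "f", "o")}

h_bar, c_bar = lstm_mean_pool(rng.standard_normal((p, 6)), P, m)
```

Note that `h_bar` has the same dimension $m$ whatever the column count of the input, which is exactly what lets a fixed-dimension OC-SVM or SVDD operate on variable length sequences.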
Our derivations can be straightforwardly extended to these different pooling methods.

III. NOVEL ANOMALY DETECTION ALGORITHMS

In this section, we first formulate the anomaly detection approaches based on the OC-SVM and SVDD algorithms. We then provide joint optimization updates to train the parameters of the overall structure.

A. Anomaly Detection with the OC-SVM Algorithm

In this subsection, we provide an anomaly detection algorithm based on the OC-SVM formulation and derive the joint updates for both the LSTM and SVM parameters. For the training, we first provide a quadratic programming based algorithm and then introduce a gradient based training algorithm. To apply the gradient based training method, we smoothly approximate the original OC-SVM formulation and then prove the convergence of the approximated formulation to the actual one in the following subsections.

In the OC-SVM algorithm, our aim is to find a hyperplane that separates the anomalies from the normal data [6]. We formulate the OC-SVM optimization problem for the sequence $\{\bar{h}_i\}_{i=1}^{n}$ as follows [6]:

$$\min_{\theta \in \mathbb{R}^{n_\theta},\, w \in \mathbb{R}^m,\, \xi \in \mathbb{R}^n,\, \rho \in \mathbb{R}} \; \frac{\|w\|^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} \xi_i - \rho \quad (7)$$

subject to:
$$w^T \bar{h}_i \ge \rho - \xi_i,\quad \xi_i \ge 0,\; \forall i \quad (8)$$
$$W^{(\cdot)T} W^{(\cdot)} = I,\quad R^{(\cdot)T} R^{(\cdot)} = I \text{ and } b^{(\cdot)T} b^{(\cdot)} = 1, \quad (9)$$

where $\rho$ and $w$ are the parameters of the separating hyperplane, $\lambda > 0$ is a regularization parameter, $\xi$ is a slack variable to penalize misclassified instances, and we group the LSTM parameters $\{W^{(z)}, R^{(z)}, b^{(z)}, W^{(s)}, R^{(s)}, b^{(s)}, W^{(f)}, R^{(f)}, b^{(f)}, W^{(o)}, R^{(o)}, b^{(o)}\}$ into $\theta \in \mathbb{R}^{n_\theta}$, where $n_\theta = 4m(m + p + 1)$. Since the LSTM parameters are unknown and $\bar{h}_i$ is a function of these parameters, we also minimize the cost function in (7) with respect to $\theta$.
After solving the optimization problem in (7), (8) and (9), we use the scoring function

$$l(X_i) = \operatorname{sgn}\big(w^T \bar{h}_i - \rho\big) \quad (10)$$

to detect the anomalous data, where the sgn(·) function returns the sign of its input. We emphasize that while minimizing (7) with respect to $\theta$, we might suffer from overfitting and impotent learning of time dependencies in the data [22], i.e., forcing the parameters to null values, e.g., $\theta = \mathbf{0}$. To circumvent these issues, we introduce (9), which constrains the norm of $\theta$ to avoid overfitting and trivial solutions, e.g., $\theta = \mathbf{0}$, while boosting the ability of the LSTM architecture to capture time dependencies [22], [23].

Remark 2. In (9), we use an orthogonality constraint for each LSTM parameter. However, we can also use other constraints instead of (9) and solve the optimization problem in (7), (8) and (9) in the same manner. As an example, a common choice of constraint for neural networks is the Frobenius norm [24], defined as

$$\|A\|_F = \sqrt{\sum_i \sum_j A_{ij}^2} \quad (11)$$

for a real matrix $A$, where $A_{ij}$ represents the element at the $i$th column and $j$th row of $A$. In this case, we can directly replace (9) with a Frobenius norm constraint for each LSTM parameter as in (11) and then solve the optimization problem in the same manner. Such approaches only aim to regularize the parameters [23]. However, for RNNs, we may also encounter exponential growth or decay in the norm of the gradients while training the parameters, which significantly degrades the capabilities of these architectures to capture time dependencies [22], [23]. Thus, in this paper, we use the constraint (9) in order to regularize the parameters while improving the capabilities of the LSTM architecture in capturing time dependencies [22], [23].
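The decision rule (10) is a one-liner once the pooled representation is available. In this minimal sketch the hyperplane parameters $w$ and $\rho$ are illustrative values, not trained ones; the tie case $w^T\bar{h}_i = \rho$ is assigned to $+1$ here as an assumption.

```python
# Scoring function (10): label a pooled representation h_bar_i as
# +1 (normal) or -1 (anomalous) via the hyperplane (w, rho).
import numpy as np

def score(h_bar_i, w, rho):
    return 1 if w @ h_bar_i - rho >= 0 else -1   # sgn(w^T h_bar_i - rho)

w = np.array([0.5, -0.2, 0.1])   # illustrative hyperplane normal
rho = 0.1                        # illustrative offset

assert score(np.array([1.0, 0.0, 0.0]), w, rho) == 1    # w^T h = 0.5 > rho
assert score(np.array([-1.0, 0.0, 0.0]), w, rho) == -1  # w^T h = -0.5 < rho
```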
1) Quadratic Programming Based Training Algorithm: Here, we introduce a training approach based on quadratic programming for the optimization problem in (7), (8) and (9), where we perform consecutive updates for the LSTM and SVM parameters. For this purpose, we first convert the optimization problem to a dual form in the following. We then provide the consecutive updates for each parameter.

We have the following Lagrangian for the SVM parameters:

$$L(w, \xi, \rho, \nu, \alpha) = \frac{\|w\|^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n}\xi_i - \rho - \sum_{i=1}^{n}\nu_i \xi_i - \sum_{i=1}^{n}\alpha_i \big(w^T \bar{h}_i - \rho + \xi_i\big), \quad (12)$$

where $\nu_i, \alpha_i \ge 0$ are the Lagrange multipliers. Taking the derivative of (12) with respect to $w$, $\xi$ and $\rho$ and then setting the derivatives to zero gives

$$w = \sum_{i=1}^{n} \alpha_i \bar{h}_i \quad (13)$$
$$\sum_{i=1}^{n} \alpha_i = 1 \;\text{ and }\; \alpha_i = 1/(n\lambda) - \nu_i,\; \forall i. \quad (14)$$

Note that at the optimum, the inequalities in (8) become equalities if $\alpha_i$ and $\nu_i$ are nonzero, i.e., $0 < \alpha_i < 1/(n\lambda)$ [6]. With this relation, we compute $\rho$ as

$$\rho = \sum_{j=1}^{n} \alpha_j \bar{h}_j^T \bar{h}_i \;\text{ for } 0 < \alpha_i < 1/(n\lambda). \quad (15)$$

By substituting (13) and (14) into (12), we obtain the following dual problem for the constrained minimization in (7), (8) and (9):

$$\min_{\theta \in \mathbb{R}^{n_\theta},\, \alpha \in \mathbb{R}^n} \; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \bar{h}_i^T \bar{h}_j \quad (16)$$

subject to:
$$\sum_{i=1}^{n}\alpha_i = 1 \;\text{ and }\; 0 \le \alpha_i \le 1/(n\lambda),\; \forall i \quad (17)$$
$$W^{(\cdot)T} W^{(\cdot)} = I,\quad R^{(\cdot)T} R^{(\cdot)} = I \text{ and } b^{(\cdot)T} b^{(\cdot)} = 1, \quad (18)$$

where $\alpha \in \mathbb{R}^n$ is a vector representation of the $\alpha_i$'s. Since the LSTM parameters are unknown, we also put the minimization term for $\theta$ into (16) as in (7). By substituting (13) into (10), we have the following scoring function for the dual problem:

$$l(X_i) = \operatorname{sgn}\Big(\sum_{j=1}^{n} \alpha_j \bar{h}_j^T \bar{h}_i - \rho\Big), \quad (19)$$

where we calculate $\rho$ using (15).

In order to find the optimal $\theta$ and $\alpha$ for the optimization problem in (16), (17) and (18), we employ the following procedure. We first select a certain set of the LSTM parameters, i.e., $\theta_0$.
Based on $\theta_0$, we find the minimizing $\alpha$ values, i.e., $\alpha_1$, using the Sequential Minimal Optimization (SMO) algorithm [25]. Now, we fix $\alpha$ as $\alpha_1$ and then update $\theta$ from $\theta_0$ to $\theta_1$ using the algorithm for optimization with orthogonality constraints in [26]. We repeat these consecutive update procedures until $\alpha$ and $\theta$ converge [27]. Then, we use the converged values in order to evaluate (19). In the following, we explain the update procedures for $\theta$ and $\alpha$ in detail.

Based on $\theta_k$, i.e., the LSTM parameter vector at the $k$th iteration, we update $\alpha_k$, i.e., the $\alpha$ vector at the $k$th iteration, using the SMO algorithm due to its efficiency in solving quadratic constrained optimization problems [25]. In the SMO algorithm, we choose a subset of parameters to minimize and fix the rest of the parameters. In the extreme case, we would choose only one parameter to minimize; however, due to (17), we must choose at least two parameters. To illustrate how the SMO algorithm works in our case, we choose $\alpha_1$ and $\alpha_2$ to update and fix the rest of the parameters in (16). From (17), we have

$$\alpha_1 = 1 - S - \alpha_2, \;\text{ where }\; S = \sum_{i=3}^{n} \alpha_i. \quad (20)$$

We first replace $\alpha_1$ in (16) with (20). We then take the derivative of (16) with respect to $\alpha_2$ and equate the derivative to zero. Thus, we obtain the following update for $\alpha_2$ at the $k$th iteration:

$$\alpha_{k+1,2} = \frac{(\alpha_{k,1} + \alpha_{k,2})(K_{11} - K_{12}) + M_1 - M_2}{K_{11} + K_{22} - 2K_{12}}, \quad (21)$$

where $K_{ij} \triangleq \bar{h}_i^T \bar{h}_j$, $M_i \triangleq \sum_{j=3}^{n} \alpha_{k,j} K_{ij}$ and $\alpha_{k,i}$ represents the $i$th element of $\alpha_k$. Due to (17), if the updated value of $\alpha_2$ is outside of the region $[0, 1/(n\lambda)]$, we project it onto this region. Once $\alpha_2$ is updated as $\alpha_{k+1,2}$, we obtain $\alpha_{k+1,1}$ using (20). For the rest of the parameters, we repeat the same procedure, which eventually converges to a certain set of parameters [25]. In this way, we obtain $\alpha_{k+1}$, i.e., the minimizing $\alpha$ for $\theta_k$.

Following the update of $\alpha$, we update $\theta$ based on the updated $\alpha_{k+1}$ vector.
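Before turning to the $\theta$ update, the SMO pair step above, i.e., the closed form (21) followed by the projection onto $[0, 1/(n\lambda)]$ and the equality constraint (20), can be sketched as follows. The Gram matrix, the pair indices and the starting $\alpha$ are illustrative assumptions.

```python
# One SMO pair update per (20)-(21): with all other alphas fixed, alpha_j
# gets the closed-form minimizer (21) clipped to [0, 1/(n*lambda)], and
# alpha_i follows from the equality constraint sum(alpha) = 1, eq. (20).
import numpy as np

def smo_pair_update(alpha, K, i, j, n_lambda_inv):
    """K is the Gram matrix K_ab = h_bar_a^T h_bar_b; returns updated alpha."""
    alpha = alpha.copy()
    rest = [r for r in range(len(alpha)) if r not in (i, j)]
    M_i = K[i, rest] @ alpha[rest]                    # M_1 in (21)
    M_j = K[j, rest] @ alpha[rest]                    # M_2 in (21)
    denom = K[i, i] + K[j, j] - 2.0 * K[i, j]
    new_j = ((alpha[i] + alpha[j]) * (K[i, i] - K[i, j]) + M_i - M_j) / denom
    new_j = np.clip(new_j, 0.0, n_lambda_inv)         # project onto (17)
    alpha[j] = new_j
    alpha[i] = 1.0 - alpha[rest].sum() - new_j        # eq. (20)
    return alpha

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))                       # pooled h_bar_i (illustrative)
K = H @ H.T
alpha = np.full(4, 0.25)                              # feasible start: sums to 1
alpha = smo_pair_update(alpha, K, 0, 1, n_lambda_inv=0.5)
```

Sweeping such pair updates over all index pairs until no $\alpha$ changes is what drives the dual objective (16) to its constrained minimum for a fixed $\theta_k$.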
For this purpose, we employ the optimization method in [26]. Since we have $\alpha_{k+1}$ that satisfies (17), we reduce the dual problem to

$$\min_{\theta}\; \kappa(\theta, \alpha_{k+1}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{k+1,i}\,\alpha_{k+1,j}\, \bar{h}_i^T \bar{h}_j \quad (22)$$

s.t.: $W^{(\cdot)T} W^{(\cdot)} = I$, $R^{(\cdot)T} R^{(\cdot)} = I$ and $b^{(\cdot)T} b^{(\cdot)} = 1$. (23)

For (22) and (23), we update $W^{(\cdot)}$ as follows:

$$W^{(\cdot)}_{k+1} = \Big(I + \frac{\mu}{2} A_k\Big)^{-1}\Big(I - \frac{\mu}{2} A_k\Big) W^{(\cdot)}_k, \quad (24)$$

where the subscripts represent the current iteration index, $\mu$ is the learning rate, $A_k = G_k (W^{(\cdot)}_k)^T - W^{(\cdot)}_k G_k^T$ and the element at the $i$th row and the $j$th column of $G$, i.e., $G_{ij}$, is defined as

$$G_{ij} \triangleq \frac{\partial \kappa(\theta, \alpha_{k+1})}{\partial W^{(\cdot)}_{ij}}. \quad (25)$$

Remark 3. For $R^{(\cdot)}$ and $b^{(\cdot)}$, we first compute the gradient of the objective function with respect to the chosen parameter as in (25). We then obtain $A_k$ according to the chosen parameter. Using $A_k$, we update the chosen parameter as in (24).

With these updates, we obtain a quadratic programming based training algorithm (see Algorithm 1 for the pseudocode) for our LSTM based anomaly detector.

2) Gradient Based Training Algorithm: Although the quadratic programming based training algorithm directly optimizes the original OC-SVM formulation without requiring any approximation, since it depends on the separated consecutive updates of the LSTM and OC-SVM parameters, it might not converge to even a local minimum [27]. In order to resolve this issue, in this subsection, we introduce a training method based only on the first order gradients, which updates all the parameters at the same time. However, since we require an approximation of the original OC-SVM formulation to apply this method, we also prove the convergence of the approximated formulation to the original OC-SVM formulation in this subsection.
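The orthogonality preserving update (24) is a Cayley-transform step: because $A_k$ is skew-symmetric by construction, the map $(I + \frac{\mu}{2}A_k)^{-1}(I - \frac{\mu}{2}A_k)$ is orthogonal, so $W^TW = I$ is maintained exactly. A minimal sketch, in which the gradient matrix $G$ is a random placeholder rather than the actual derivative (25):

```python
# Cayley-transform update (24) from [26]: with A = G W^T - W G^T
# skew-symmetric, W <- (I + mu/2 A)^{-1} (I - mu/2 A) W preserves W^T W = I.
import numpy as np

def cayley_update(W, G, mu):
    m = W.shape[0]
    A = G @ W.T - W @ G.T                  # skew-symmetric by construction
    I = np.eye(m)
    return np.linalg.solve(I + (mu / 2.0) * A, (I - (mu / 2.0) * A) @ W)

rng = np.random.default_rng(0)
m = 4
W, _ = np.linalg.qr(rng.standard_normal((m, m)))   # start from an orthogonal W
G = rng.standard_normal((m, m))                    # placeholder for (25)
W_next = cayley_update(W, G, mu=0.1)

# orthogonality survives the step up to numerical precision
assert np.allclose(W_next.T @ W_next, np.eye(m), atol=1e-8)
```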
Algorithm 1 Quadratic Programming Based Training for the Anomaly Detection Algorithm Based on OC-SVM
1: Initialize the LSTM parameters as $\theta_0$ and the dual OC-SVM parameters as $\alpha_0$
2: Determine a threshold $\epsilon$ as the convergence criterion
3: $k = -1$
4: do
5:   $k = k + 1$
6:   Using $\theta_k$, obtain $\{\bar{h}_i\}_{i=1}^{n}$ according to Fig. 2
7:   Find the optimal $\alpha_{k+1}$ for $\{\bar{h}_i\}_{i=1}^{n}$ using (20) and (21)
8:   Based on $\alpha_{k+1}$, obtain $\theta_{k+1}$ using (24) and Remark 3
9: while $\big(\kappa(\theta_{k+1}, \alpha_{k+1}) - \kappa(\theta_k, \alpha_k)\big)^2 > \epsilon$
10: Detect anomalies using (19) evaluated at $\theta_k$ and $\alpha_k$

Considering (8), we write the slack variable in a different form as follows:

$$G\big(\beta_{w,\rho}(\bar{h}_i)\big) \triangleq \max\{0, \beta_{w,\rho}(\bar{h}_i)\},\; \forall i, \quad (26)$$

where $\beta_{w,\rho}(\bar{h}_i) \triangleq \rho - w^T \bar{h}_i$. By substituting (26) into (7), we remove the constraint (8) and obtain the following optimization problem:

$$\min_{w \in \mathbb{R}^m,\, \rho \in \mathbb{R},\, \theta \in \mathbb{R}^{n_\theta}} \; \frac{\|w\|^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} G\big(\beta_{w,\rho}(\bar{h}_i)\big) - \rho \quad (27)$$

s.t.: $W^{(\cdot)T} W^{(\cdot)} = I$, $R^{(\cdot)T} R^{(\cdot)} = I$ and $b^{(\cdot)T} b^{(\cdot)} = 1$. (28)

Since (26) is not a differentiable function, we are unable to solve the optimization problem in (27) using gradient based optimization algorithms. Hence, we employ the differentiable function

$$S_\tau\big(\beta_{w,\rho}(\bar{h}_i)\big) = \frac{1}{\tau}\log\big(1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)}\big) \quad (29)$$

to smoothly approximate (26), where $\tau > 0$ is the smoothing parameter and log represents the natural logarithm. In (29), as $\tau$ increases, $S_\tau(\cdot)$ converges to $G(\cdot)$ (see Proposition 1 at the end of this section); hence, we choose a large value for $\tau$. With (29), we modify our optimization problem as follows:

$$\min_{w \in \mathbb{R}^m,\, \rho \in \mathbb{R},\, \theta \in \mathbb{R}^{n_\theta}} \; F_\tau(w, \rho, \theta) \quad (30)$$

s.t.: $W^{(\cdot)T} W^{(\cdot)} = I$, $R^{(\cdot)T} R^{(\cdot)} = I$ and $b^{(\cdot)T} b^{(\cdot)} = 1$, (31)

where $F_\tau(\cdot,\cdot,\cdot)$ is the objective function of our optimization problem, defined as

$$F_\tau(w, \rho, \theta) \triangleq \frac{\|w\|^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} S_\tau\big(\beta_{w,\rho}(\bar{h}_i)\big) - \rho.$$
To obtain the optimal parameters for (30) and (31), we update $w$, $\rho$ and $\theta$ until they converge to a local or global optimum [26], [28]. For the updates of $w$ and $\rho$, we use the SGD algorithm [28], where we compute the first order gradient of the objective function with respect to each parameter. We first compute the gradient for $w$ as follows:

$$\nabla_w F_\tau(w, \rho, \theta) = w + \frac{1}{n\lambda}\sum_{i=1}^{n} \frac{-\bar{h}_i\, e^{\tau \beta_{w,\rho}(\bar{h}_i)}}{1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)}}. \quad (32)$$

Using (32), we update $w$ as

$$w_{k+1} = w_k - \mu\, \nabla_w F_\tau(w, \rho, \theta)\Big|_{w = w_k,\, \rho = \rho_k,\, \theta = \theta_k}, \quad (33)$$

where the subscript $k$ indicates the value of a parameter at the $k$th iteration. Similarly, we calculate the derivative of the objective function with respect to $\rho$ as follows:

$$\frac{\partial F_\tau(w, \rho, \theta)}{\partial \rho} = \frac{1}{n\lambda}\sum_{i=1}^{n} \frac{e^{\tau \beta_{w,\rho}(\bar{h}_i)}}{1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)}} - 1. \quad (34)$$

Using (34), we update $\rho$ as

$$\rho_{k+1} = \rho_k - \mu\, \frac{\partial F_\tau(w, \rho, \theta)}{\partial \rho}\Big|_{w = w_k,\, \rho = \rho_k,\, \theta = \theta_k}. \quad (35)$$

For the LSTM parameters, we use the method for optimization with orthogonality constraints in [26] due to (31). To update each element of $W^{(\cdot)}$, we calculate the gradient of the objective function as

$$\frac{\partial F_\tau(w, \rho, \theta)}{\partial W^{(\cdot)}_{ij}} = \frac{1}{n\lambda}\sum_{i=1}^{n} \frac{-w^T \big(\partial \bar{h}_i / \partial W^{(\cdot)}_{ij}\big)\, e^{\tau \beta_{w,\rho}(\bar{h}_i)}}{1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)}}. \quad (36)$$

We then update $W^{(\cdot)}$ using (36) as

$$W^{(\cdot)}_{k+1} = \Big(I + \frac{\mu}{2} B_k\Big)^{-1}\Big(I - \frac{\mu}{2} B_k\Big) W^{(\cdot)}_k, \quad (37)$$

where $B_k = M_k (W^{(\cdot)}_k)^T - W^{(\cdot)}_k M_k^T$ and

$$M_{ij} \triangleq \frac{\partial F_\tau(w, \rho, \theta)}{\partial W^{(\cdot)}_{ij}}. \quad (38)$$

Remark 4. For $R^{(\cdot)}$ and $b^{(\cdot)}$, we first compute the gradient of the objective function with respect to the chosen parameter as in (38). We then obtain $B_k$ according to the chosen parameter. Using $B_k$, we update the chosen parameter as in (37).

Remark 5. In the semi-supervised framework, we have the following optimization problem for our SVM based algorithms [29]:

$$\min_{\theta, w, \xi, \eta, \gamma, \rho} \; \frac{\sum_{i=1}^{l} \eta_i + \sum_{j=l+1}^{l+k} \min(\gamma_j, \xi_j)}{(1/C)} + \|w\| \quad (39)$$

s.t.: $y_i\big(w^T \bar{h}_i + \rho\big) \ge 1 - \eta_i,\; \eta_i \ge 0,\; i = 1, \ldots, l$ (40)
$w^T \bar{h}_j - \rho \ge 1 - \xi_j,\; \xi_j \ge 0,\; j = l+1, \ldots, l+k$ (41)
$-w^T \bar{h}_j + \rho \ge 1 - \gamma_j,\; \gamma_j \ge 0,\; j = l+1, \ldots, l+k$ (42)
$W^{(\cdot)T} W^{(\cdot)} = I,\; R^{(\cdot)T} R^{(\cdot)} = I \text{ and } b^{(\cdot)T} b^{(\cdot)} = 1,$ (43)

where $\gamma$ and $\eta$ are slack variables as $\xi$ is, $C$ is a trade-off parameter, $l$ and $k$ are the numbers of labeled and unlabeled data instances, respectively, and $y_i \in \{-1, +1\}$ represents the label of the $i$th data instance. For the quadratic programming based training method, we modify all the steps from (12) to (25) with respect to (39)–(43). In a similar manner, we modify the equations from (26) to (38) according to (39)–(43) in order to obtain the gradient based training method in the semi-supervised framework. For the supervised implementation, we follow the same procedure as the semi-supervised implementation for the $k = 0$ case.

Hence, we complete the required updates for each parameter. The complete algorithm is also provided in Algorithm 2 as pseudocode. Moreover, we illustrate the convergence of our approximation (29) to (26) in Proposition 1. Using Proposition 1, we then demonstrate the convergence of the optimal values of our objective function (30) to the optimal values of the actual SVM objective function (27) in Theorem 1.

Proposition 1. As $\tau$ increases, $S_\tau(\beta_{w,\rho}(\bar{h}_i))$ uniformly converges to $G(\beta_{w,\rho}(\bar{h}_i))$. As a consequence, our approximation $F_\tau(w, \rho, \theta)$ converges to the SVM objective function $F(w, \rho, \theta)$, defined as

$$F(w, \rho, \theta) \triangleq \frac{\|w\|^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} G\big(\beta_{w,\rho}(\bar{h}_i)\big) - \rho.$$

Proof of Proposition 1. In order to simplify our notation, for any given $w$, $\theta$, $X_i$ and $\rho$, we denote $\beta_{w,\rho}(\bar{h}_i)$ as $\Omega$. We first show that $S_\tau(\Omega) \ge G(\Omega)$, $\forall \tau > 0$. Since

$$S_\tau(\Omega) = \frac{1}{\tau}\log\big(1 + e^{\tau\Omega}\big) \ge \frac{1}{\tau}\log\big(e^{\tau\Omega}\big) = \Omega$$

and $S_\tau(\Omega) \ge 0$, we have $S_\tau(\Omega) \ge G(\Omega) = \max\{0, \Omega\}$.
Then, for any Ω ≥ 0, we have

\frac{\partial S_\tau(\Omega)}{\partial \tau} = -\frac{1}{\tau^2} \log(1 + e^{\tau \Omega}) + \frac{1}{\tau} \frac{\Omega e^{\tau \Omega}}{1 + e^{\tau \Omega}} < -\frac{1}{\tau} \Omega + \frac{1}{\tau} \frac{\Omega e^{\tau \Omega}}{1 + e^{\tau \Omega}} \leq 0,

and for any Ω < 0, we have

\frac{\partial S_\tau(\Omega)}{\partial \tau} = -\frac{1}{\tau^2} \log(1 + e^{\tau \Omega}) + \frac{1}{\tau} \frac{\Omega e^{\tau \Omega}}{1 + e^{\tau \Omega}} < 0;

thus, we conclude that S_\tau(\Omega) is a monotonically decreasing function of τ. As the last step, we derive an upper bound for the difference S_\tau(\Omega) - G(\Omega). For Ω ≥ 0, the derivative of the difference is

\frac{\partial (S_\tau(\Omega) - G(\Omega))}{\partial \Omega} = \frac{e^{\tau \Omega}}{1 + e^{\tau \Omega}} - 1 < 0;

hence, the difference is a decreasing function of Ω for Ω ≥ 0. Therefore, the maximum value is \log(2)/\tau and it occurs at Ω = 0. Similarly, for Ω < 0, the derivative of the difference is positive, which shows that the maximum of the difference again occurs at Ω = 0. With this result, we obtain the following bound:

\frac{\log(2)}{\tau} = \max_{\Omega} \; S_\tau(\Omega) - G(\Omega).  (44)

Using (44), for any ε > 0, we can choose τ sufficiently large so that S_\tau(\Omega) - G(\Omega) < ε. Hence, as τ increases, S_\tau(\Omega) uniformly converges to G(\Omega). By averaging (44) over all the data points and multiplying by 1/λ, we obtain

\frac{\log(2)}{\lambda \tau} = \max_{w, \rho, \theta} \; F_\tau(w, \rho, \theta) - F(w, \rho, \theta),

which proves the uniform convergence of F_\tau(\cdot, \cdot, \cdot) to F(\cdot, \cdot, \cdot).

Theorem 1. Let w_\tau and \rho_\tau be the solutions of (30) for any fixed θ. Then, w_\tau and \rho_\tau are unique and F_\tau(w_\tau, \rho_\tau, \theta) converges to the minimum of F(w, \rho, \theta).

Proof of Theorem 1. We have the following Hessian matrix of F_\tau(w, \rho, \theta) with respect to w:

\nabla^2_w F_\tau(w, \rho, \theta) = I + \frac{\tau}{n\lambda} \sum_{i=1}^{n} \frac{e^{\tau \beta_{w,\rho}(\bar{h}_i)}}{(1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)})^2} \, \bar{h}_i \bar{h}_i^T,

which satisfies v^T \nabla^2_w F_\tau(w, \rho, \theta) v > 0 for any nonzero column vector v. Hence, the Hessian matrix is positive definite, which shows that F_\tau(w, \rho, \theta) is a strictly convex function of w. Consequently, the solution w_\tau is both global and unique given any ρ and θ.
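Stepping back to the training loop, the gradient steps (33) and (35) for the OC-SVM parameters amount to only a few lines of code. The sketch below assumes \beta_{w,\rho}(\bar{h}_i) = \rho - w^T \bar{h}_i (the definition appears in the earlier OC-SVM section and is not reproduced in this excerpt; the sign convention matches (32) and (34)), and all names are illustrative:

```python
import numpy as np

def ocsvm_smooth_grads(w, rho, H, lam, tau):
    """Gradients (32) and (34) of the smoothed objective F_tau.
    H: (n, m) array whose rows are the fixed-length LSTM outputs h_i."""
    n = H.shape[0]
    beta = rho - H @ w                       # beta_{w,rho}(h_i), assumed form
    sig = 1.0 / (1.0 + np.exp(-tau * beta))  # e^{tau*beta} / (1 + e^{tau*beta})
    grad_w = w - (H.T @ sig) / (n * lam)     # eq. (32)
    grad_rho = sig.sum() / (n * lam) - 1.0   # eq. (34)
    return grad_w, grad_rho

def ocsvm_sgd_step(w, rho, H, lam=0.5, tau=10.0, mu=0.05):
    # One step of the updates (33) and (35) with learning rate mu.
    gw, gr = ocsvm_smooth_grads(w, rho, H, lam, tau)
    return w - mu * gw, rho - mu * gr
```

Iterating `ocsvm_sgd_step` together with the orthogonality-constrained LSTM update (37) gives the inner loop of Algorithm 2.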
Additionally, we have the following second order derivative with respect to ρ:

\frac{\partial^2 F_\tau(w, \rho, \theta)}{\partial \rho^2} = \frac{\tau}{n\lambda} \sum_{i=1}^{n} \frac{e^{\tau \beta_{w,\rho}(\bar{h}_i)}}{(1 + e^{\tau \beta_{w,\rho}(\bar{h}_i)})^2} > 0,

which implies that F_\tau(w, \rho, \theta) is a strictly convex function of ρ. As a result, the solution \rho_\tau is both global and unique for any given w and θ. Let w^* and \rho^* be the solutions of (27) for any fixed θ. From the proof of Proposition 1, we have

F_\tau(w^*, \rho^*, \theta) \geq F_\tau(w_\tau, \rho_\tau, \theta) \geq F(w_\tau, \rho_\tau, \theta) \geq F(w^*, \rho^*, \theta).  (45)

Using the convergence result in Proposition 1 and (45), we have

\lim_{\tau \to \infty} F_\tau(w_\tau, \rho_\tau, \theta) \leq \lim_{\tau \to \infty} F_\tau(w^*, \rho^*, \theta) = F(w^*, \rho^*, \theta) \quad \text{and} \quad \lim_{\tau \to \infty} F_\tau(w_\tau, \rho_\tau, \theta) \geq F(w^*, \rho^*, \theta),

which proves the following equality:

\lim_{\tau \to \infty} F_\tau(w_\tau, \rho_\tau, \theta) = F(w^*, \rho^*, \theta).

B. Anomaly Detection with the SVDD Algorithm

In this subsection, we introduce an anomaly detection algorithm based on the SVDD formulation and provide the joint updates in order to learn both the LSTM and SVDD parameters. However, since the generic formulation is the same as in the OC-SVM case, we only provide the required and distinct parameter updates and the proof of convergence of the approximated SVDD formulation to the actual one. In the SVDD algorithm, we aim to find a hypersphere that encloses the normal data while leaving the anomalous data outside the hypersphere [7]. For the sequence \{\bar{h}_i\}_{i=1}^n, we have the following SVDD optimization problem [7]:

\min_{\theta \in \mathbb{R}^{n_\theta}, \, \tilde{c} \in \mathbb{R}^m, \, \xi \in \mathbb{R}^n, \, R \in \mathbb{R}} \; R^2 + \frac{1}{n\lambda} \sum_{i=1}^{n} \xi_i  (46)

subject to: \|\bar{h}_i - \tilde{c}\|^2 - R^2 \leq \xi_i, \; \xi_i \geq 0, \; \forall i  (47)

W^{(\cdot)T} W^{(\cdot)} = I, \; R^{(\cdot)T} R^{(\cdot)} = I \; \text{and} \; b^{(\cdot)T} b^{(\cdot)} = 1,  (48)

where λ > 0 is a trade-off parameter between R^2 and the total misclassification error, R is the radius of the hypersphere and \tilde{c} is its center. Additionally, θ and ξ represent the LSTM parameters and the slack variables, respectively, as in the OC-SVM case.
After solving the constrained optimization problem in (46), (47) and (48), we detect anomalies using the following scoring function:

l(X_i) = \text{sgn}(R^2 - \|\bar{h}_i - \tilde{c}\|^2).  (49)

Algorithm 2 Gradient Based Training for the Anomaly Detection Algorithm Based on OC-SVM
1: Initialize the LSTM parameters as θ_0 and the OC-SVM parameters as w_0 and ρ_0
2: Determine a threshold ε as the convergence criterion
3: k = −1
4: do
5:   k = k + 1
6:   Using θ_k, obtain \{\bar{h}_i\}_{i=1}^n according to Fig. 2
7:   Obtain w_{k+1}, ρ_{k+1} and θ_{k+1} using (33), (35), (37) and Remark 4
8: while (F_\tau(w_{k+1}, \rho_{k+1}, \theta_{k+1}) - F_\tau(w_k, \rho_k, \theta_k))^2 > ε
9: Detect anomalies using (10) evaluated at w_k, ρ_k and θ_k

1) Quadratic Programming Based Training Algorithm: In this subsection, we introduce a training algorithm based on quadratic programming for (46), (47) and (48). As in the OC-SVM case, we first assume that the LSTM parameters are fixed and then optimize over the SVDD parameters based on the fixed LSTM parameters. For (46) and (47), we have the following Lagrangian:

L(\tilde{c}, \xi, R, \nu, \alpha) = R^2 + \frac{1}{n\lambda} \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \nu_i \xi_i - \sum_{i=1}^{n} \alpha_i (\xi_i - \|\bar{h}_i - \tilde{c}\|^2 + R^2),  (50)

where \nu_i, \alpha_i \geq 0 are the Lagrange multipliers. Taking the derivative of (50) with respect to \tilde{c}, ξ and R and then setting the derivatives to zero yields

\tilde{c} = \sum_{i=1}^{n} \alpha_i \bar{h}_i  (51)

\sum_{i=1}^{n} \alpha_i = 1 \quad \text{and} \quad \alpha_i = 1/(n\lambda) - \nu_i, \; \forall i.  (52)

Putting (51) and (52) into (50), we obtain a dual form of (46) and (47) as follows:

\min_{\theta \in \mathbb{R}^{n_\theta}, \, \alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \bar{h}_i^T \bar{h}_j - \sum_{i=1}^{n} \alpha_i \bar{h}_i^T \bar{h}_i  (53)

subject to: \sum_{i=1}^{n} \alpha_i = 1 \; \text{and} \; 0 \leq \alpha_i \leq 1/(n\lambda), \; \forall i  (54)

W^{(\cdot)T} W^{(\cdot)} = I, \; R^{(\cdot)T} R^{(\cdot)} = I \; \text{and} \; b^{(\cdot)T} b^{(\cdot)} = 1.  (55)

Using (51), we modify (49) as

l(X_i) = \text{sgn}\Big( R^2 - \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j \bar{h}_k^T \bar{h}_j + 2 \sum_{j=1}^{n} \alpha_j \bar{h}_j^T \bar{h}_i - \bar{h}_i^T \bar{h}_i \Big).  (56)

In order to solve the constrained optimization problem in (53), (54) and (55), we employ the same approach as in the OC-SVM case. We first fix a certain set of LSTM parameters θ. Based on these parameters, we find the optimal α using the SMO algorithm. After that, we fix α to update θ using the algorithm for optimization with orthogonality constraints. We repeat these procedures until we reach convergence. Finally, we evaluate (56) based on the converged parameters.

Remark 6. In the SVDD case, we apply the SMO algorithm using the same procedures as in the OC-SVM case. In particular, we first choose two parameters, e.g., α_1 and α_2, to minimize over and fix the other parameters. Due to (54), the chosen parameters must obey (20). Hence, we have the following update rule for α_2 at the k-th iteration:

\alpha_{k+1,2} = \frac{2(1 - S)(K_{11} - K_{12}) + K_{22} - K_{11} + M_1 - M_2}{2(K_{11} + K_{22} - 2K_{12})},

where S = \sum_{j=3}^{n} \alpha_{k,j} and the other definitions are the same as in the OC-SVM case. We then obtain \alpha_{k+1,1} using (20). By this, we obtain the updated values \alpha_{k+1,2} and \alpha_{k+1,1}. For the remaining parameters, we repeat this procedure until reaching convergence.

Remark 7. For the SVDD case, we update W^{(\cdot)} at the k-th iteration as in (24). However, instead of (25), we have the following definition for G:

G_{ij} = \frac{\partial \pi(\theta, \alpha_{k+1})}{\partial W^{(\cdot)}_{ij}}, \quad \text{where} \quad \pi(\theta, \alpha_{k+1}) \triangleq \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{k+1,i} \alpha_{k+1,j} \bar{h}_i^T \bar{h}_j - \sum_{i=1}^{n} \alpha_{k+1,i} \bar{h}_i^T \bar{h}_i

at the k-th iteration. For the remaining parameters, we follow the procedure in Remark 3.

Hence, we obtain a quadratic programming based training algorithm for our LSTM based anomaly detector, which is also described in Algorithm 3 as pseudocode.
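The pairwise update of Remark 6 can be sketched as follows. Since the OC-SVM-section definitions of M_1 and M_2 are not reproduced in this excerpt, we take M_i = 2 \sum_{j \neq 1,2} \alpha_j K_{ij} with K_{ij} = \bar{h}_i^T \bar{h}_j, which is the form obtained by differentiating the dual (53) over the chosen pair; this is our assumption, and the box clipping enforces (54):

```python
import numpy as np

def svdd_pair_update(alpha, K, i, j, n, lam):
    """Minimize the SVDD dual (53) over the pair (alpha_i, alpha_j) with the
    other multipliers fixed, keeping sum(alpha) = 1 and the box in (54).
    K is the Gram matrix K[a, b] = h_a^T h_b."""
    mask = np.ones(n, dtype=bool)
    mask[[i, j]] = False
    S = alpha[mask].sum()                    # mass of the fixed multipliers
    M_i = 2.0 * alpha[mask] @ K[i, mask]     # assumed form of M_1 (see lead-in)
    M_j = 2.0 * alpha[mask] @ K[j, mask]     # assumed form of M_2
    denom = 2.0 * (K[i, i] + K[j, j] - 2.0 * K[i, j])
    if denom <= 1e-12:                       # degenerate pair, skip
        return alpha
    a_j = (2.0 * (1.0 - S) * (K[i, i] - K[i, j])
           + K[j, j] - K[i, i] + M_i - M_j) / denom
    ub = 1.0 / (n * lam)
    a_j = float(np.clip(a_j, max(0.0, (1.0 - S) - ub), min(ub, 1.0 - S)))
    out = alpha.copy()
    out[j] = a_j
    out[i] = (1.0 - S) - a_j                 # preserves sum(alpha) = 1
    return out
```

Sweeping over all pairs until π(θ, α) stops decreasing mirrors the SMO inner loop of Algorithm 3.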
Algorithm 3 Quadratic Programming Based Training for the Anomaly Detection Algorithm Based on SVDD
1: Initialize the LSTM parameters as θ_0 and the dual SVDD parameters as α_0
2: Determine a threshold ε as the convergence criterion
3: k = −1
4: do
5:   k = k + 1
6:   Using θ_k, obtain \{\bar{h}_i\}_{i=1}^n according to Fig. 2
7:   Find the optimal α_{k+1} for \{\bar{h}_i\}_{i=1}^n using the procedure in Remark 6
8:   Based on α_{k+1}, obtain θ_{k+1} using Remark 7
9: while (\pi(\theta_{k+1}, \alpha_{k+1}) - \pi(\theta_k, \alpha_k))^2 > ε
10: Detect anomalies using (56) evaluated at θ_k and α_k

2) Gradient Based Training Algorithm: In this subsection, we introduce a training algorithm based only on first order gradients for (46), (47) and (48). We again use the G(\cdot) function in (26) in order to eliminate the constraint in (47) as follows:

\min_{\theta \in \mathbb{R}^{n_\theta}, \, \tilde{c} \in \mathbb{R}^m, \, R \in \mathbb{R}} \; R^2 + \frac{1}{n\lambda} \sum_{i=1}^{n} G(\Psi_{R,\tilde{c}}(\bar{h}_i))  (57)

s.t.: W^{(\cdot)T} W^{(\cdot)} = I, \; R^{(\cdot)T} R^{(\cdot)} = I \; \text{and} \; b^{(\cdot)T} b^{(\cdot)} = 1,  (58)

where \Psi_{R,\tilde{c}}(\bar{h}_i) \triangleq \|\bar{h}_i - \tilde{c}\|^2 - R^2. Since the gradient based methods cannot optimize (57) due to the nondifferentiable function G(\cdot), we employ S_\tau(\cdot) instead of G(\cdot) and modify (57) as

\min_{\theta \in \mathbb{R}^{n_\theta}, \, \tilde{c} \in \mathbb{R}^m, \, R \in \mathbb{R}} \; F_\tau(\tilde{c}, R, \theta) = R^2 + \frac{1}{n\lambda} \sum_{i=1}^{n} S_\tau(\Psi_{R,\tilde{c}}(\bar{h}_i))  (59)

s.t.: W^{(\cdot)T} W^{(\cdot)} = I, \; R^{(\cdot)T} R^{(\cdot)} = I \; \text{and} \; b^{(\cdot)T} b^{(\cdot)} = 1,  (60)

where F_\tau(\cdot, \cdot, \cdot) is the objective function of (59). To obtain the optimal values for (59) and (60), we update \tilde{c}, R and θ until we reach either a local or a global optimum. For the updates of \tilde{c} and R, we employ the SGD algorithm, where we use the following gradient calculations. We first compute the gradient with respect to \tilde{c} as

\nabla_{\tilde{c}} F_\tau(\tilde{c}, R, \theta) = \frac{1}{n\lambda} \sum_{i=1}^{n} \frac{2(\tilde{c} - \bar{h}_i) \, e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}{1 + e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}.  (61)

Using (61), we have the following update:

\tilde{c}_{k+1} = \tilde{c}_k - \mu \, \nabla_{\tilde{c}} F_\tau(\tilde{c}, R, \theta) \big|_{\tilde{c} = \tilde{c}_k, \, R^2 = R^2_k, \, \theta = \theta_k},  (62)

where the subscript k represents the iteration number. Likewise, we compute the derivative of the objective function with respect to R^2 as

\frac{\partial F_\tau(\tilde{c}, R, \theta)}{\partial R^2} = 1 + \frac{1}{n\lambda} \sum_{i=1}^{n} \frac{-e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}{1 + e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}.  (63)

With (63), we update R^2 as

R^2_{k+1} = R^2_k - \mu \, \frac{\partial F_\tau(\tilde{c}, R, \theta)}{\partial R^2} \Big|_{\tilde{c} = \tilde{c}_k, \, R^2 = R^2_k, \, \theta = \theta_k}.  (64)

For θ, the gradient calculation is as follows:

\frac{\partial F_\tau(\tilde{c}, R, \theta)}{\partial W^{(\cdot)}_{ij}} = \sum_{i=1}^{n} \frac{2 (\partial \bar{h}_i / \partial W^{(\cdot)}_{ij})^T (\bar{h}_i - \tilde{c}) \, e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}{n\lambda (1 + e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)})}.  (65)

Using (65), we have the following update:

W^{(\cdot)}_{k+1} = \Big( I + \frac{\mu}{2} B_k \Big)^{-1} \Big( I - \frac{\mu}{2} B_k \Big) W^{(\cdot)}_k,  (66)

where

B_k = M_k (W^{(\cdot)}_k)^T - W^{(\cdot)}_k M_k^T \quad \text{and} \quad M_{ij} \triangleq \frac{\partial F_\tau(\tilde{c}, R, \theta)}{\partial W^{(\cdot)}_{ij}}.  (67)

Remark 8. For R^{(\cdot)} and b^{(\cdot)}, we first compute the gradient of the objective function with respect to the chosen parameter as in (67). We then obtain B_k according to the chosen parameter. Using B_k, we update the chosen parameter as in (66).

Remark 9. In the semi-supervised framework, we have the following optimization problem for our SVDD based algorithms [30]:

\min_{\theta, \tilde{c}, R, \xi, \gamma, \eta} \; R^2 - C_1 \gamma + C_2 \sum_{i=1}^{l} \xi_i + C_3 \sum_{j=l+1}^{l+k} \eta_j  (68)

s.t.: \|\bar{h}_i - \tilde{c}\|^2 - R^2 \leq \xi_i, \; \xi_i \geq 0, \; i = 1, \ldots, l  (69)

y_j (\|\bar{h}_j - \tilde{c}\|^2 - R^2) \leq -\gamma + \eta_j, \; \eta_j \geq 0, \; j = l+1, \ldots, l+k  (70)

W^{(\cdot)T} W^{(\cdot)} = I, \; R^{(\cdot)T} R^{(\cdot)} = I \; \text{and} \; b^{(\cdot)T} b^{(\cdot)} = 1,  (71)

where η is a slack variable like ξ, γ ∈ R is the margin of the labeled data instances, C_1, C_2 and C_3 are trade-off parameters, k and l are the numbers of labeled and unlabeled data instances, respectively, and y_j ∈ {−1, +1} represents the label of the j-th data instance.
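The gradient computations (61) and (63) in the unsupervised SVDD objective translate directly into code; a minimal sketch (names ours):

```python
import numpy as np

def svdd_smooth_grads(c, R2, H, lam, tau):
    """Gradients of F_tau(c, R, theta) in (59) w.r.t. the center c (eq. (61))
    and w.r.t. R^2 (eq. (63)); H holds the fixed-length outputs h_i as rows."""
    n = H.shape[0]
    psi = np.sum((H - c) ** 2, axis=1) - R2       # Psi_{R,c}(h_i)
    sig = 1.0 / (1.0 + np.exp(-tau * psi))        # e^{tau*psi} / (1 + e^{tau*psi})
    grad_c = 2.0 * ((c - H).T @ sig) / (n * lam)  # eq. (61)
    grad_R2 = 1.0 - sig.sum() / (n * lam)         # eq. (63)
    return grad_c, grad_R2
```

Plugging these gradients into the updates (62) and (64) with step size μ gives the SGD inner loop of Algorithm 4.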
For the quadratic programming based training method, we modify all the steps from (50) to (56), Remark 6 and Remark 7 with respect to (68)–(71). In a similar manner, we modify the equations from (57) to (67) according to (68)–(71) in order to obtain the gradient based training method in the semi-supervised framework. For the supervised implementation, we follow the same procedure as in the semi-supervised implementation with l = 0. The complete algorithm is provided in Algorithm 4.

In the following, we provide the convergence proof as in the OC-SVM case.

Theorem 2. Let \tilde{c}_\tau and R^2_\tau be the solutions of (59) for any fixed θ. Then, \tilde{c}_\tau and R^2_\tau are unique and F_\tau(\tilde{c}_\tau, R_\tau, \theta) converges to the minimum of F(\tilde{c}, R, \theta), defined as

F(\tilde{c}, R, \theta) \triangleq R^2 + \frac{1}{n\lambda} \sum_{i=1}^{n} G(\Psi_{R,\tilde{c}}(\bar{h}_i)).

Proof of Theorem 2. We have the following Hessian matrix of F_\tau(\tilde{c}, R, \theta) with respect to \tilde{c}:

\nabla^2_{\tilde{c}} F_\tau(\tilde{c}, R, \theta) = \sum_{i=1}^{n} \frac{2I(\Omega_i + \Omega_i^2) + 4\tau \Omega_i (\tilde{c} - \bar{h}_i)(\tilde{c} - \bar{h}_i)^T}{n\lambda (1 + \Omega_i)^2},

where \Omega_i = e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}, which implies v^T \nabla^2_{\tilde{c}} F_\tau(\tilde{c}, R, \theta) v > 0 for any nonzero column vector v. Thus, the Hessian matrix is positive definite, which shows that F_\tau(\tilde{c}, R, \theta) is a strictly convex function of \tilde{c}. As a result, the solution \tilde{c}_\tau is both global and unique given any R and θ. In addition, we have the following second order derivative with respect to R^2:

\frac{\partial^2 F_\tau(\tilde{c}, R, \theta)}{\partial (R^2)^2} = \frac{\tau}{n\lambda} \sum_{i=1}^{n} \frac{e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)}}{(1 + e^{\tau \Psi_{\tilde{c},R}(\bar{h}_i)})^2} > 0,

which implies that F_\tau(\tilde{c}, R, \theta) is a strictly convex function of R^2. Therefore, the solution R^2_\tau is both global and unique for any given \tilde{c} and θ.

The convergence proof directly follows the proof of Theorem 1.

IV. SIMULATIONS

In this section, we demonstrate the performance of the algorithms on several different datasets.
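Since the introduced architectures use bounded activations, the experiments below normalize each feature dimension of the datasets into [−1, 1]; a minimal helper (names ours):

```python
import numpy as np

def normalize_pm1(X):
    """Scale each column of X (n_samples, n_features) into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0
```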
We first evaluate the performance on a dataset that contains variable length data sequences, i.e., the digit dataset [31]. We then compare the anomaly detection performance on several different benchmark real datasets such as the occupancy [32], Hong Kong Exchange (HKE) rate [33], http [34] and Alcoa stock price [35] datasets. While performing experiments on real benchmark datasets, we also include the GRU based algorithms in order to compare their performance with the LSTM based ones. Note that since the introduced algorithms contain bounded functions, e.g., the sigmoid function in the LSTM architecture, for all the experiments in this section, we normalize each dimension of the datasets into [−1, 1].

Algorithm 4 Gradient Based Training for the Anomaly Detection Algorithm Based on SVDD
1: Initialize the LSTM parameters as θ_0 and the SVDD parameters as \tilde{c}_0 and R^2_0
2: Determine a threshold ε as the convergence criterion
3: k = −1
4: do
5:   k = k + 1
6:   Using θ_k, obtain \{\bar{h}_i\}_{i=1}^n according to Fig. 2
7:   Obtain \tilde{c}_{k+1}, R^2_{k+1} and θ_{k+1} using (62), (64), (66) and Remark 8
8: while (F_\tau(\tilde{c}_{k+1}, R_{k+1}, \theta_{k+1}) - F_\tau(\tilde{c}_k, R_k, \theta_k))^2 > ε
9: Detect anomalies using (49) evaluated at \tilde{c}_k, R^2_k and θ_k

Throughout this section, we denote the LSTM based OC-SVM anomaly detectors trained with the gradient and quadratic programming based algorithms as "LSTM-GSVM" and "LSTM-QPSVM", respectively. In a similar manner, we use "LSTM-GSVDD" and "LSTM-QPSVDD" for the SVDD based anomaly detectors. Moreover, for the labels of the GRU based algorithms, we replace the LSTM prefix with GRU.

A. Anomaly Detection for Variable Length Data Sequences

In this section, we evaluate the performance of the introduced anomaly detectors on the digit dataset [31]. In this dataset, we have the pixel samples of digits, which were written on a tablet by several different authors [31].
Since the speed of writing varies from person to person, the number of samples for a certain digit might differ significantly. The introduced algorithms are able to process such sequences thanks to their generic structure in Fig. 2. However, the conventional OC-SVM and SVDD algorithms cannot directly process these sequences [6], [7]. For these algorithms, we take the mean of each sequence to obtain a fixed length vector, i.e., two dimensional in this case (the two coordinates of a pixel). In order to evaluate the performance, we first choose one digit as normal and another digit as anomalous. We emphasize that we choose the digits randomly for illustration and obtain similar performance for the other digits. We then divide the samples of these digits into training and test parts, where we allocate 60% of the samples for training and 40% for testing. In both the training and test parts, we select the samples so that 10% of the samples are anomalies. Then, using the training part, we optimize the parameters of each algorithm via two-fold cross validation, where we also select certain crucial parameters, e.g., the learning rate μ. This procedure results in μ = 0.05, 0.001, 0.05 and 0.01 for LSTM-GSVM, LSTM-QPSVM, LSTM-GSVDD and LSTM-QPSVDD, respectively. Furthermore, we select the output dimension of the LSTM architecture as m = 2 and the regularization parameter as λ = 0.5 for all the algorithms. For the implementation of the conventional OC-SVM and SVDD algorithms, we use the libsvm library, and their parameters are selected in a similar manner via the built-in optimization tools of libsvm [36].

Here, we use the area under the ROC curve as a performance metric [37]. In a ROC curve, we plot the true positive rate (TPR) as a function of the false positive rate (FPR). The area under this curve, also known as AUC, is a well known performance measure for anomaly detection tasks [37]. In Fig. 3a and 3b, we illustrate the ROC curves and provide the corresponding AUC scores, where we label digits "0" and "9" as normal and anomalous, respectively. For the OC-SVM and SVDD algorithms, since we directly take the mean of the variable length data sequences to obtain fixed length sequences, they achieve significantly lower AUC scores compared to the introduced LSTM based methods. Among the LSTM based methods, LSTM-GSVM slightly outperforms LSTM-QPSVM. On the other hand, LSTM-GSVDD achieves significantly higher AUC than LSTM-QPSVDD. Since the quadratic programming based training method depends on separate consecutive updates of the LSTM and SVM (or SVDD) parameters, it might not converge to even a local minimum. However, the gradient based method can guarantee convergence to at least a local minimum given a proper choice of the learning rate [28]. Thus, although these methods might provide similar performance as in Fig. 3a, it is also expected that the gradient based method obtains much higher performance in certain cases, as in Fig. 3b. Overall, the introduced algorithms provide significantly higher AUC than the conventional methods.

Besides the previous scenario, we also consider a scenario where we label digits "1" and "7" as normal and anomalous, respectively. In Fig. 4a and 4b, we illustrate the ROC curves and provide the corresponding AUC scores. As in the previous scenario, for both the SVM and SVDD cases, the introduced algorithms achieve higher AUC scores than the conventional algorithms. Among the introduced algorithms, LSTM-GSVM and LSTM-GSVDD achieve the highest AUC scores for the SVM and SVDD cases, respectively. Furthermore, the AUC score of each algorithm is much lower compared to the previous case due to the similarity between digits "1" and "7".

B. Benchmark Real Datasets

In this section, we compare the AUC scores of each algorithm on several different real benchmark datasets.
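The AUC metric used throughout these comparisons can be computed from raw decision scores via the rank (Mann–Whitney) formulation; a sketch (here y = +1 marks the class expected to receive higher scores):

```python
import numpy as np

def roc_auc(scores, y):
    """Area under the ROC curve: the probability that a randomly chosen
    positive (y == 1) outscores a randomly chosen negative, ties counted half."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):              # average ranks over ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

This rank form is equivalent to integrating TPR over FPR along the ROC curve.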
Since our approach in this paper is generic, in addition to the LSTM based algorithms, we also implement our approach on a recently introduced RNN architecture, i.e., the GRU architecture, which is defined by the following equations [17]:

\tilde{z}_{i,j} = \sigma(W^{(\tilde{z})} x_{i,j} + R^{(\tilde{z})} h_{i,j-1})  (72)

r_{i,j} = \sigma(W^{(r)} x_{i,j} + R^{(r)} h_{i,j-1})  (73)

\tilde{h}_{i,j} = g(W^{(\tilde{h})} x_{i,j} + r_{i,j} \odot (R^{(\tilde{h})} h_{i,j-1}))  (74)

h_{i,j} = \tilde{h}_{i,j} \odot \tilde{z}_{i,j} + h_{i,j-1} \odot (1 - \tilde{z}_{i,j}),  (75)

where h_{i,j} ∈ R^m is the output vector and x_{i,j} ∈ R^p is the input vector. Furthermore, W^{(\cdot)} and R^{(\cdot)} are the parameters of the GRU, whose sizes are selected according to the dimensionality of the input and output vectors. We then replace (1)–(6) with (72)–(75) in Fig. 2 to obtain the GRU based anomaly detectors.

We first evaluate the performance of the algorithms on the occupancy dataset [32]. In this dataset, we have five features, which are the relative humidity percentage, light (in lux), carbon dioxide level (in ppm), temperature (in Celsius) and humidity ratio, and our aim is to determine whether an office room is occupied or not based on these features. Here, we use the same procedure as in the previous subsection to separate the test and training data. Moreover, using the training data, we select μ = 0.05, 0.05, 0.001 and 0.01 for LSTM-GSVM, LSTM-QPSVM, LSTM-GSVDD and LSTM-QPSVDD, respectively. Note that, for the GRU based algorithms in this subsection, we use the same parameter setting as for the LSTM based algorithms. Furthermore, we choose m = 5 and λ = 0.5 for all the experiments in this subsection in order to maximize the performance of the algorithms. As can be seen in Table I, due to their inherent memory, both the LSTM and GRU based algorithms achieve considerably higher AUC scores than the conventional SVM and SVDD algorithms.
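One step of the GRU recursion (72)–(75) above is straightforward to implement. The sketch below takes g to be tanh and the gating products to be elementwise (the excerpt does not spell these out, so both are our assumptions), with no bias terms, matching (72)–(75):

```python
import numpy as np

def gru_step(x, h_prev, W, R):
    """One GRU update per eqs. (72)-(75). W and R are dicts of weight
    matrices keyed by gate: 'z' (update), 'r' (reset), 'h' (candidate)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W['z'] @ x + R['z'] @ h_prev)              # eq. (72)
    r = sigmoid(W['r'] @ x + R['r'] @ h_prev)              # eq. (73)
    h_tilde = np.tanh(W['h'] @ x + r * (R['h'] @ h_prev))  # eq. (74)
    return h_tilde * z + h_prev * (1.0 - z)                # eq. (75)
```

Running this recursion over a sequence in place of (1)–(6) in Fig. 2 yields the fixed-length output fed to the GRU based detectors.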
Moreover, GRU-GSVDD achieves the highest AUC score among all the algorithms, while the LSTM based algorithms (LSTM-GSVM and LSTM-QPSVM) also provide comparable AUC scores. Here, we also observe that the gradient based training method provides higher AUC scores than the quadratic programming based training method, which might stem from the latter's separated update procedure that does not guarantee convergence to a certain local minimum.

Other than the occupancy dataset, we also perform an experiment on the HKE rate dataset in order to examine the performance in a real life financial scenario. In this dataset, we have the amount of Hong Kong dollars that one can buy for one US dollar on each day. In order to introduce anomalies to this dataset, we artificially add samples from a Gaussian distribution with the mean of the training data and ten times its variance. Furthermore, using the training data, we select μ = 0.01, 0.005, 0.05 and 0.05 for LSTM-GSVM, LSTM-QPSVM, LSTM-GSVDD and LSTM-QPSVDD, respectively. In Table I, we illustrate the AUC scores of the algorithms on the HKE rate dataset. Since we have time series data, both the LSTM and GRU based algorithms naturally outperform the conventional methods thanks to their inherent memory, which preserves sequential information. Moreover, since the LSTM architecture also controls its memory content via an output gate, unlike the GRU architecture [17], we obtain the highest AUC scores from LSTM-GSVM. As in the previous cases, the gradient based training method provides better performance than the quadratic programming based training. We also evaluate the AUC scores of the algorithms on the http dataset [34].
In this dataset, we have four features, which are the duration of the connection (in seconds), the network service, the number of bytes from source to destination and the number of bytes from destination to source. Using these features, we aim to distinguish normal connections from network attacks. In this experiment, we select μ = 0.01, 0.05, 0.001 and 0.01 for LSTM-GSVM, LSTM-QPSVM, LSTM-GSVDD and LSTM-QPSVDD, respectively. We demonstrate the performance of the algorithms on the http dataset in Table I. Even though all the algorithms achieve high AUC scores on this dataset, we still observe that the LSTM and GRU based algorithms have higher AUC scores than the conventional SVM and SVDD methods.

[Fig. 3: The ROC curves of the algorithms for the digit dataset, where we consider digit "0" as normal and digit "9" as anomaly (a) for the SVM based algorithms (AUC: 0.6041 for SVM, 0.8873 for LSTM-QPSVM, 0.8942 for LSTM-GSVM) and (b) for the SVDD based algorithms (AUC: 0.5429 for SVDD, 0.7565 for LSTM-QPSVDD, 0.8948 for LSTM-GSVDD).]

[Fig. 4: The ROC curves of the algorithms for the digit dataset, where we consider digit "1" as normal and digit "7" as anomaly (a) for the SVM based algorithms (AUC: 0.5580 for SVM, 0.6849 for LSTM-QPSVM, 0.7744 for LSTM-GSVM) and (b) for the SVDD based algorithms (AUC: 0.5235 for SVDD, 0.6513 for LSTM-QPSVDD, 0.7464 for LSTM-GSVDD).]
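The HKE rate experiment above (and the Alcoa experiment below) injects synthetic anomalies drawn from a Gaussian distribution whose mean matches the training data and whose variance is ten times larger; a minimal sketch (names ours):

```python
import numpy as np

def inject_anomalies(train, n_anom, rng=None):
    """Sample anomalies from N(mean(train), 10 * var(train)), per dimension."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu = train.mean(axis=0)
    sigma = np.sqrt(10.0 * train.var(axis=0))
    return rng.normal(mu, sigma, size=(n_anom, train.shape[1]))
```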
Overall, GRU-QPSVDD achieves the highest AUC score and the quadratic programming based training method performs better than the gradient based training method on this dataset. However, since the AUC scores are very high and close to each other, we observe only a slight performance improvement for our algorithms in this case.

As the last experiment, we evaluate the anomaly detection performance of the algorithms on another financial dataset, i.e., the Alcoa stock price dataset [35]. In this dataset, we have daily stock price values. As in the HKE rate dataset, we again artificially introduce anomalies via a Gaussian distribution with the mean of the training data and ten times its variance. Moreover, we choose μ = 0.01, 0.001, 0.001 and 0.005 for LSTM-GSVM, LSTM-QPSVM, LSTM-GSVDD and LSTM-QPSVDD, respectively. In Table I, we illustrate the AUC scores of the algorithms on the Alcoa stock price dataset. Here, we observe that the GRU and LSTM based algorithms achieve considerably higher AUC scores than the conventional methods thanks to their memory structure. Although the LSTM based algorithms have higher AUC in general, we obtain the highest AUC score from GRU-QPSVDD. Moreover, as in the previous experiments, the gradient based training method generally provides higher performance than the quadratic programming based method thanks to its learning capabilities.

TABLE I: AUC scores of the algorithms for the occupancy, HKE rate, http and Alcoa stock price datasets.

| Datasets  | SVM    | SVDD   | LSTM-QPSVM | LSTM-GSVM | LSTM-QPSVDD | LSTM-GSVDD | GRU-QPSVM | GRU-GSVM | GRU-QPSVDD | GRU-GSVDD |
| Occupancy | 0.8676 | 0.6715 | 0.8917     | 0.8957    | 0.7869      | 0.8609     | 0.8718    | 0.9049   | 0.7217     | 0.9099    |
| HKE       | 0.8000 | 0.8500 | 0.9467     | 0.9783    | 0.8560      | 0.9753     | 0.8479    | 0.9516   | 0.8791     | 0.9517    |
| Http      | 0.9963 | 0.9993 | 0.9992     | 0.9983    | 0.9994      | 0.9994     | 0.9986    | 0.9989   | 0.9999     | 0.9994    |
| Alcoa     | 0.7197 | 0.9390 | 0.9496     | 0.9515    | 0.9415      | 0.9507     | 0.7581    | 0.9392   | 0.9651     | 0.9392    |

V. CONCLUDING REMARKS

In this paper, we studied anomaly detection in an unsupervised framework and introduced LSTM based algorithms. In particular, we introduced a generic LSTM based structure in order to process variable length data sequences. After obtaining fixed length sequences via our LSTM based structure, we introduced a scoring function for our anomaly detectors based on the OC-SVM [6] and SVDD [7] algorithms. For the first time in the literature, we jointly optimize the parameters of both the LSTM architecture and the final scoring function of the OC-SVM (or SVDD) formulation. To jointly optimize the parameters of our algorithms, we also introduced gradient and quadratic programming based training methods with different algorithmic merits, where we extended our derivations for these algorithms to the semi-supervised and fully supervised frameworks. In order to apply the gradient based training method, we modified the OC-SVM and SVDD formulations and then provided convergence results relating the modified formulations to the original ones. Therefore, we obtain highly effective anomaly detection algorithms, especially for time series data, that are able to process variable length data sequences. In our simulations, due to the generic structure of our approach, we also introduced GRU based anomaly detection algorithms. Through an extensive set of experiments, we illustrated the significant performance improvements achieved by our algorithms with respect to the conventional methods [6], [7] over several different real and simulated datasets.

REFERENCES

[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, Jul. 2009. [Online]. Available: http://doi.acm.org/10.1145/1541880.1541882
[2] T. Xie, N. M. Nasrabadi, and A. O. Hero, "Learning to classify with possible sensor failures," IEEE Transactions on Signal Processing, vol. 65, no. 4, pp. 836–849, Feb. 2017.
[3] B. Baingana and G. B. Giannakis, "Joint community and anomaly tracking in dynamic networks," IEEE Transactions on Signal Processing, vol. 64, no. 8, pp. 2013–2025, Apr. 2016.
[4] K. Cohen, Q. Zhao, and A. Swami, "Optimal index policies for anomaly localization in resource-constrained cyber systems," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4224–4236, Aug. 2014.
[5] J. Sharpnack, A. Rinaldo, and A. Singh, "Detecting anomalous activity on networks with the graph Fourier scan statistic," IEEE Transactions on Signal Processing, vol. 64, no. 2, pp. 364–379, Jan. 2016.
[6] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[7] D. M. Tax and R. P. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004. [Online]. Available: http://dx.doi.org/10.1023/B:MACH.0000008084.60811.49
[8] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the International Joint Conference on Neural Networks, 2003, vol. 3, Jul. 2003, pp. 1741–1745.
[9] R. Zhang et al., "One class support vector machine for anomaly detection in the communication network performance data," in Proceedings of the 5th Conference on Applied Electromagnetics, Wireless and Optical Communications, ser. ELECTROSCIENCE'07. Stevens Point, Wisconsin, USA: World Scientific and Engineering Academy and Society (WSEAS), 2007, pp. 31–37.
[10] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, "Long short term memory networks for anomaly detection in time series," in Proceedings. Presses universitaires de Louvain, 2015, p. 89.
[11] H. Debar, M. Becker, and D. Siboni, "A neural network component for an intrusion detection system," in Proceedings 1992 IEEE Computer Society Symposium on Research in Security and Privacy, May 1992, pp. 240–250.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[13] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[15] R. Kozma et al., "Anomaly detection by neural network models and statistical time series analysis," in 1994 IEEE International Conference on Neural Networks, IEEE World Congress on Computational Intelligence, vol. 5, Jun. 1994, pp. 3207–3210.
[16] C. Bishop, "Novelty detection and neural network validation," IEE Proceedings - Vision, Image and Signal Processing, vol. 141, pp. 217–222, Aug. 1994.
[17] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[18] Y. Wang, J. Wong, and A. Miner, "Anomaly intrusion detection using one class SVM," in Proceedings from the Fifth Annual IEEE SMC Information Assurance Workshop, 2004. IEEE, 2004, pp. 358–364.
[19] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw, "Geometry from a time series," Physical Review Letters, vol. 45, no. 9, p. 712, 1980.
[20] R. Zhang, S. Zhang, Y. Lan, and J. Jiang, "Network anomaly detection using one class support vector machine," in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, 2008.
[21] S. Chauhan and L. Vig, "Anomaly detection in ECG time signals via deep long short-term memory networks," in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Oct. 2015, pp. 1–7.
[22] L. Jing, C. Gulcehre, J. Peurifoy, Y. Shen, M. Tegmark, M. Soljačić, and Y. Bengio, "Gated orthogonal recurrent units: On learning to forget," arXiv preprint arXiv:1706.02761, 2017.
[23] S. Wisdom et al., "Full-capacity unitary recurrent neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4880–4888.
[24] M. Bai, B. Zhang, and J. Gao, "Tensorial recurrent neural networks for longitudinal data analysis," arXiv preprint, 2017.
[25] J. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," 1998.
[26] Z. Wen and W. Yin, "A feasible method for optimization with orthogonality constraints," Mathematical Programming, vol. 142, no. 1, pp. 397–434, Dec. 2013.
[27] L. Veidinger, "On the numerical determination of the best approximations in the Chebyshev sense," Numerische Mathematik, vol. 2, no. 1, pp. 99–105, 1960.
[28] A. H. Sayed, Fundamentals of Adaptive Filtering. John Wiley & Sons, 2003.
[29] K. P. Bennett and A. Demiriz, "Semi-supervised support vector machines," in Advances in Neural Information Processing Systems, 1999, pp. 368–374.
[30] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld, "Toward supervised anomaly detection," J. Artif. Int. Res., vol. 46, no. 1, pp. 235–262, Jan. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2512538.2512545
[31] M. Lichman, "UCI machine learning repository," 2013.
[32] L. M. Candanedo and V. Feldheim, "Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models," Energy and Buildings, vol. 112, pp. 28–39, 2016.
[33] E. W. Frees, "Regression modelling with actuarial and financial applications." [Online].
A vailable: http://instruction.bus.wisc.edu/jfrees/ jfreesbooks/Regression%20Modeling/BookW ebDec2010/data.html [34] S. Rayana, “ODDS library , ” 2016. [Online]. A vailable: http://odds.cs. stonybrook.edu [35] “Summary for alcoa inc. common stock. ” [Online]. A vailable: http://finance.yahoo.com/quote/AA?ltr=1 [36] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines, ” ACM T ransactions on Intelligent Systems and T echnology , vol. 2, pp. 27:1–27:27, 2011, software av ailable at http://www .csie.ntu. edu.tw/ ∼ cjlin/libsvm. [37] A. P . Bradley , “The use of the area under the roc curve in the e valuation of machine learning algorithms, ” P attern Recognition , vol. 30, no. 7, pp. 1145 – 1159, 1997.