One-Class SVM with Privileged Information and its Application to Malware Detection


Authors: Evgeny Burnaev, Dmitry Smolyakov

Evgeny Burnaev: Skolkovo Institute of Science and Technology, Building 3, Nobel st., Moscow 143026, Russia; Institute for Information Transmission Problems, 19 Bolshoy Karetny Lane, Moscow 127994, Russia. Email: e.burnaev@skoltech.ru
Dmitry Smolyakov: Institute for Information Transmission Problems, 19 Bolshoy Karetny Lane, Moscow 127994, Russia. Email: dmitry.smolyakov@iitp.ru

Abstract: A number of important applied problems in engineering, finance and medicine can be formulated as a problem of anomaly detection based on one-class classification. A classical approach to this problem is to describe a normal state using a one-class support vector machine. To detect anomalies we then quantify the distance from a new observation to the constructed description of the normal class. In this paper we present a new approach to one-class classification. We formulate a new problem statement and a corresponding algorithm that allow taking privileged information into account during the training phase. We evaluate the performance of the proposed approach using synthetic datasets, as well as the publicly available Microsoft Malware Classification Challenge dataset.

I. INTRODUCTION

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behaviour. Anomaly detection finds extensive use in a wide variety of applications such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities [1], [2], [3]. A classical approach to anomaly detection is to describe expected ("normal") behaviour using one-class classification techniques, i.e. to construct a description of a "normal" state from a number of examples, e.g. by describing the geometrical locus of training patterns in a feature space.
If a new test pattern does not belong to the "normal" class, we consider it to be anomalous. To construct a "normal" domain we can use well-known approaches such as Support Vector Domain Description (SVDD) [4], [5] and the One-Class Support Vector Machine (One-Class SVM) [6], possibly combined with model selection for anomaly detection [7], resampling [8], ensembling of "weak" anomaly detectors [9] and extraction of important features using manifold learning methods [10], [11]. Both SVDD and One-Class SVM can be kernelized to describe a complex nonlinear "normal" class.

For the original two-class Support Vector Machine [12] Vapnik recently proposed a modification that allows taking privileged information into account during the training phase to improve classification accuracy [13]. Let us provide some examples of privileged information. If we solve an image classification problem, a textual image description can serve as privileged information. In the case of malware detection we can use the source code of a malware sample to derive additional features for classification. Such information is not available during the test phase (e.g. it could be computationally prohibitive or too costly to obtain), when we use the trained model for anomaly detection and classification, but it can be used during the training phase.

In this work we combine these two concepts (SVMs and learning using privileged information) and propose new approaches to train SVDD and One-Class SVM with privileged information, under the intuition that this additional information, available at training time, can be utilized to better define the "normal" state. Through experiments on synthetic data and real data from the Malware Classification Challenge (see [14]), we show that a model trained with privileged information performs better than a model trained without it.
However, we do not claim that the adopted setup of the malware classification experiments fully reflects the specificity of cyber-security applications. Rather, we demonstrate that one-class classification with privileged information is useful for cyber-security problems.

II. ONE-CLASS SVM AND SVDD

Below we briefly describe two classical approaches to one-class classification. We are given an i.i.d. sample of patterns $(x_1, \ldots, x_l) \in \mathcal{X} \subset \mathbb{R}^n$. The main idea of these algorithms is to separate the major part of the sample patterns, considered to be "normal", from those considered to be "abnormal" in some sense.

A. One-Class SVM

In the case of the original One-Class SVM [6] we consider those patterns to be abnormal which are close to the origin of coordinates in a feature space. Let us separate patterns using a hyperplane in a feature space defined by some feature map $\phi(\cdot)$ and a normal vector $w$ to the hyperplane. We consider a pattern $x$ to belong to the "normal" class if $(w \cdot \phi(x)) > \rho$. In order to define the hyperplane, i.e. the normal vector $w$ and the value of $\rho$, we solve the optimization problem

$$\frac{1}{2}\|w\|_{\ell_2}^2 + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \rho \to \min_{w, \xi, \rho} \quad (1)$$
$$\text{s.t. } (w \cdot \phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0.$$

Here $\nu$ is a regularization coefficient and $\xi_i$ is a slack variable for the $i$-th pattern. Optimization problem (1) is convex, therefore its solution coincides with that of the dual one:

$$-\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j K(x_i, x_j) \to \max_{\alpha} \quad (2)$$
$$\text{s.t. } \sum_{i=1}^{l}\alpha_i = 1, \quad 0 \le \alpha_i \le \frac{1}{\nu l}.$$

Here the scalar product $(\phi(x_i) \cdot \phi(x_j))$ is replaced by the corresponding kernel function $K(x_i, x_j)$. Therefore, as usual, we do not need to know an explicit representation of $\phi(x)$ in order to solve the dual problem. Moreover, the solution of the primal problem can be represented through the solution of the dual problem, namely $w = \sum_{i=1}^{l} \alpha_i \phi(x_i)$.
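As an illustration (not part of the original paper), the dual problem (2) can be solved on a toy sample with a generic constrained optimizer. The sketch below uses SciPy's SLSQP routine with a Gaussian kernel; the sample, the kernel width and the numerical tolerances are our own arbitrary choices, and the offset recovery follows the boundary property of support vectors.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))          # toy "normal" sample
l, nu = len(X), 0.2
sigma2 = 2.0                          # Gaussian kernel width (arbitrary)

def K(A, B):
    """Gaussian kernel matrix, K(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

Kxx = K(X, X)

# Dual problem (2): maximize -alpha^T K alpha
# subject to sum_i alpha_i = 1 and 0 <= alpha_i <= 1/(nu*l).
res = minimize(
    lambda a: a @ Kxx @ a,            # minimize the negated dual objective
    np.full(l, 1.0 / l),              # feasible starting point
    jac=lambda a: 2 * Kxx @ a,
    bounds=[(0.0, 1.0 / (nu * l))] * l,
    constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
    method="SLSQP",
)
alpha = res.x

# Recover the offset rho from a support vector: prefer an alpha_i strictly
# inside (0, 1/(nu*l)), for which (w . phi(x_i)) = rho holds.
cand = np.flatnonzero((alpha > 1e-8) & (alpha < 1.0 / (nu * l) - 1e-8))
if cand.size == 0:                    # fall back to any positive alpha
    cand = np.flatnonzero(alpha > 1e-8)
rho = float(Kxx[cand[0]] @ alpha)

def decision(x):
    """f(x) = sum_i alpha_i K(x_i, x) - rho; f(x) > 0 means 'normal'."""
    return float(K(np.atleast_2d(np.asarray(x, float)), X)[0] @ alpha - rho)
```

In practice a dedicated solver (e.g. scikit-learn's `OneClassSVM`, which optimizes the same dual) would be used; the explicit formulation above only makes the correspondence with (2) visible.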
Thanks to the structure of (2), if some $\alpha_i > 0$, then the pattern $x_i$ belongs to the boundary of the "normal" domain, i.e. $(w \cdot \phi(x_i)) = \rho$ [6]. Thus the offset $\rho$ can be recovered by exploiting the fact that for any $\alpha_i > 0$ the corresponding pattern $x_i$ satisfies the equality

$$\rho = (w \cdot \phi(x_i)) = \sum_{j=1}^{l} \alpha_j K(x_j, x_i).$$

As a result the corresponding decision rule has the form

$$f(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) - \rho.$$

If $f(x) > 0$, a pattern $x$ is considered to belong to the "normal" class, and vice versa.

B. SVDD

Another approach to anomaly detection is to separate outlying patterns using a sphere [4], [5]. As before, we denote by $\phi(\cdot)$ some feature map. Let $a$ be some point in the image of the feature map and $R$ be some positive value. We consider a pattern $x$ to belong to the "normal" class if it is located inside the sphere, $\|a - \phi(x)\|_{\ell_2}^2 \le R$. In order to find the center $a$ and the radius $R$ we solve the optimization problem

$$R + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i \to \min_{R, a, \xi} \quad (3)$$
$$\text{s.t. } \|\phi(x_i) - a\|_{\ell_2}^2 \le R + \xi_i, \quad \xi_i \ge 0.$$

Here $\xi_i$ is the distance from a pattern $x_i$ located outside the sphere to the surface of the sphere. On the face of it, the variable $R$ can be considered as a radius only if we require its positivity. However, it can easily be proved that this condition is automatically fulfilled if $\nu \in (0, 1)$ [5], and for $\nu \notin (0, 1)$ the solution of (3) is degenerate. The dual problem has the form

$$\sum_{i=1}^{l}\alpha_i K(x_i, x_i) - \sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j K(x_i, x_j) \to \max_{\alpha}$$
$$\text{s.t. } 0 \le \alpha_i \le \frac{1}{\nu l}, \quad \sum_{i=1}^{l}\alpha_i = 1.$$

As in the previous case, we replace the scalar product $(\phi(x_i) \cdot \phi(x_j))$ with the corresponding kernel $K(x_i, x_j)$. We can write out the solution of the primal problem using the solution of the dual problem:

$$a = \sum_{i=1}^{l}\alpha_i \phi(x_i), \quad R = \|\phi(x_j)\|_{\ell_2}^2 - 2(a \cdot \phi(x_j)) + \|a\|_{\ell_2}^2,$$

where in order to calculate $R$ we can use any $x_j$ such that $\alpha_j > 0$.
Here $\|\phi(x)\|_{\ell_2}^2 = K(x, x)$, $(\phi(x) \cdot a) = \sum_{i=1}^{l} \alpha_i K(x_i, x)$ and $\|a\|_{\ell_2}^2 = \sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j K(x_i, x_j)$. The decision function has the form

$$f(x) = K(x, x) - 2\sum_{i=1}^{l}\alpha_i K(x, x_i) + \|a\|_{\ell_2}^2 - R.$$

If $f(x) > 0$, then a pattern $x$ is located outside the sphere and is considered to be anomalous.

III. PRIVILEGED INFORMATION

Let us assume that during the training phase we have some privileged information: besides the patterns $(x_1, \ldots, x_l) \in \mathcal{X} \subset \mathbb{R}^n$ (original information) we also have additional patterns $(x_1^*, \ldots, x_l^*) \in \mathcal{X}^* \subset \mathbb{R}^m$. This additional (privileged) information is not available at the test phase, i.e. we train our decision rule on pairs of patterns $(x, x^*) \in \mathcal{X} \times \mathcal{X}^* \subset \mathbb{R}^{n+m}$, but when making decisions we can use only test patterns $x \in \mathcal{X}$.

Let us discuss how this privileged information can be incorporated into the considered problem statements. In the original approaches to one-class classification the slack variables $\xi_i$, characterizing the distance from the patterns $x_i$ to the separating boundary, are determined through the solution of the corresponding optimization problem (see (1) or (3)). Now let us assume that the slack variables can be modelled as

$$\xi_i = \xi_i(x_i^*) = (\phi^*(x_i^*) \cdot w^*) + b^*, \quad (4)$$

where $\phi^*(\cdot)$ is a feature map in the space of privileged patterns. Thus, we assume that using the privileged patterns $(x_1^*, \ldots, x_l^*)$ we can refine the location of the separating boundary w.r.t. the sample of training objects.

A. One-Class SVM+

Let us modify problem statement (1) in order to incorporate privileged information:

$$\frac{\nu l}{2}\|w\|_{\ell_2}^2 + \frac{\gamma}{2}\|w^*\|_{\ell_2}^2 - \nu l \rho + \sum_{i=1}^{l}\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right] \to \min_{w, w^*, b^*, \rho, \zeta} \quad (5)$$
$$\text{s.t. } (w \cdot \phi(x_i)) \ge \rho - (w^* \cdot \phi^*(x_i^*)) - b^*,$$
$$(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i \ge 0, \quad \zeta_i \ge 0.$$
Here $\gamma$ is a regularization parameter for the linear approximation of the slack variables, and $\zeta_i$ are instrumental variables used to prevent patterns belonging to the "positive" half-plane from being penalized. Note that if $\gamma \to \infty$, then the solution of (5) is close to the original solution of (1). Let us write out a Lagrangian for (5):

$$L = \frac{\nu l}{2}\|w\|_{\ell_2}^2 - \nu l \rho + \frac{\gamma}{2}\|w^*\|_{\ell_2}^2 + \sum_{i=1}^{l}\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right] - \sum_{i=1}^{l}\mu_i \zeta_i$$
$$- \sum_{i=1}^{l}\alpha_i\left[(w \cdot \phi(x_i)) - \rho + (w^* \cdot \phi^*(x_i^*)) + b^*\right] - \sum_{i=1}^{l}\beta_i\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right].$$

Setting $\delta_i = 1 - \beta_i$, from the Karush–Kuhn–Tucker conditions we get that

$$w = \frac{1}{\nu l}\sum_{i=1}^{l}\alpha_i \phi(x_i), \quad w^* = \frac{1}{\gamma}\sum_{i=1}^{l}(\alpha_i - \delta_i)\phi^*(x_i^*),$$
$$\delta_i = \mu_i, \quad \sum_{i=1}^{l}\delta_i = \sum_{i=1}^{l}\alpha_i = \nu l, \quad 0 \le \delta_i \le 1.$$

Using the obtained equations, we can now formulate the dual problem:

$$-\frac{1}{2\nu l}\sum_{i,j}\alpha_i\alpha_j K(x_i, x_j) - \frac{1}{2\gamma}\sum_{i,j}(\alpha_i - \delta_i)K^*(x_i^*, x_j^*)(\alpha_j - \delta_j) \to \max_{\alpha, \delta}$$
$$\text{s.t. } \sum_{i=1}^{l}\alpha_i = \nu l, \quad \sum_{i=1}^{l}\delta_i = \nu l, \quad 0 \le \delta_i \le 1, \quad \alpha_i \ge 0.$$

Here we replace the scalar product $(\phi^*(x_i^*) \cdot \phi^*(x_j^*))$ with the corresponding kernel function $K^*(x_i^*, x_j^*)$. In the end, the decision function has the same form as in the case of the original One-Class SVM: $f(x) = \sum_{i=1}^{l}\alpha_i K(x_i, x) - \rho$.

B. SVDD+

Let us modify problem statement (3) in order to incorporate privileged information:

$$\nu l R + \frac{\gamma}{2}\|w^*\|_{\ell_2}^2 + \sum_{i=1}^{l}\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right] \to \min_{R, a, w^*, b^*, \zeta} \quad (6)$$
$$\text{s.t. } \|\phi(x_i) - a\|_{\ell_2}^2 \le R + (w^* \cdot \phi^*(x_i^*)) + b^*,$$
$$(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i \ge 0, \quad \zeta_i \ge 0.$$

If $\gamma \to \infty$, then the solution of (6) is close to the original solution of (3). Let us write out a Lagrangian for (6):

$$L = \nu l R + \frac{\gamma}{2}\|w^*\|_{\ell_2}^2 + \sum_{i=1}^{l}\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right] - \sum_{i=1}^{l}\mu_i \zeta_i$$
$$+ \sum_{i=1}^{l}\alpha_i\left[\|\phi(x_i) - a\|_{\ell_2}^2 - R - (w^* \cdot \phi^*(x_i^*)) - b^*\right] - \sum_{i=1}^{l}\beta_i\left[(w^* \cdot \phi^*(x_i^*)) + b^* + \zeta_i\right].$$
Setting $\delta_i = 1 - \beta_i$, from the Karush–Kuhn–Tucker conditions we get that

$$w^* = \frac{1}{\gamma}\sum_{i=1}^{l}(\alpha_i - \delta_i)\phi^*(x_i^*), \quad a = \frac{1}{\nu l}\sum_{i=1}^{l}\alpha_i \phi(x_i),$$
$$\delta_i = \mu_i, \quad \sum_{i=1}^{l}\alpha_i = \sum_{i=1}^{l}\delta_i = \nu l, \quad 0 \le \delta_i \le 1.$$

Let us formulate the dual problem:

$$\sum_{i=1}^{l}\alpha_i K(x_i, x_i) - \frac{1}{2\nu l}\sum_{i,j}\alpha_i\alpha_j K(x_i, x_j) - \frac{1}{2\gamma}\sum_{i,j}(\alpha_i - \delta_i)K^*(x_i^*, x_j^*)(\alpha_j - \delta_j) \to \max_{\alpha, \delta}$$
$$\text{s.t. } \sum_{i=1}^{l}\alpha_i = \nu l, \quad \sum_{i=1}^{l}\delta_i = \nu l, \quad 0 \le \delta_i \le 1, \quad \alpha_i \ge 0.$$

In the end, the decision function has the same form as in the case of the original SVDD: $f(x) = K(x, x) - 2\sum_{i=1}^{l}\alpha_i K(x, x_i) + \|a\|_{\ell_2}^2 - R$.

IV. RELATED WORK

In principle, some attempts to use privileged information for one-class classification can be found in the literature. E.g. in [15] a feature space is divided into a number of meaningful subdomains $\mathcal{X} = \cup_{r=1}^{N}\mathcal{X}_r$, and for each element of the training sample $(x_1, \ldots, x_l)$ the ordinal number of the subdomain to which this element belongs is used as privileged information in order to introduce different constraints on the slack variables for patterns from different subdomains, i.e. $\xi_{i,r} = \xi_r(x_i) = (\phi_r^*(x_i) \cdot w_r^*) + b_r^*$, $x_i \in \mathcal{X}_r$, $r = 1, \ldots, N$. Thus the authors do not explicitly use privileged information from a feature space other than the initial one: the privileged information can be straightforwardly calculated from the values of the original patterns $(x_1, \ldots, x_l)$. In order to construct an anomaly detection rule the authors propose to solve the following optimization problem:

$$\frac{1}{2}\|w\|_{\ell_2}^2 + \frac{\gamma}{2}\sum_{r=1}^{N}\|w_r^*\|_{\ell_2}^2 + \frac{1}{N}\sum_{r=1}^{N}\frac{1}{\nu_r}\sum_{x_i \in \mathcal{X}_r}\left[(w_r^* \cdot \phi_r^*(x_i)) + b_r^*\right] - \rho \to \min_{w, w_r^*, b_r^*, \rho, \zeta_r} \quad (7)$$
$$\text{s.t. } (w \cdot \phi(x_i)) \ge \rho - (w_r^* \cdot \phi_r^*(x_i)) - b_r^*, \quad x_i \in \mathcal{X}_r,$$
$$(w_r^* \cdot \phi_r^*(x_i)) + b_r^* + \zeta_{i,r} \ge 0, \quad x_i \in \mathcal{X}_r, \quad \zeta_{i,r} \ge 0.$$

A shortcoming of the problem statement (7) (cf.
with the problem statement (5)) is that the parameters $\nu_r$ and $\gamma$ influence the regularization in a dependent manner, i.e. their contributions to the regularization cannot be disentangled.

A similar framework is used in [16], where, as in the previous paper, the authors use ordinal numbers of the corresponding subdomains as privileged information. The main difference from [15] is that the SVDD algorithm underlies their approach. In fact, the authors of [16] propose to solve the following optimization problem in order to find the decision rule:

$$R^2 + \frac{\gamma}{2}(R^*)^2 + \frac{1}{\nu l}\sum_{i=1}^{l}\left[\|\phi^*(x_i^*) - a^*\|_{\ell_2}^2 - (R^*)^2\right] \to \min_{R, R^*, a, a^*, \zeta} \quad (8)$$
$$\text{s.t. } \|\phi(x_i) - a\|_{\ell_2}^2 \le R^2 + \|\phi^*(x_i^*) - a^*\|_{\ell_2}^2 - (R^*)^2,$$
$$\|\phi^*(x_i^*) - a^*\|_{\ell_2}^2 - (R^*)^2 + \zeta_i \ge 0, \quad \zeta_i \ge 0.$$

As in the previous example, the parameterization used in (8) does not allow controlling the regularization in the original feature space and in the space of privileged information independently. One more difference from the problem statement proposed in this paper is another approach to modelling the slack variables $\xi_i$: in our approach we use the linear model (4), while in [16] the distance $\|\phi^*(x_i^*) - a^*\|_{\ell_2}^2 - (R^*)^2$ to the surface of the sphere in the privileged space is used.

V. NUMERICAL EXPERIMENTS

In this section we describe the performed numerical experiments. We provide results only for the original One-Class Support Vector Machine and its extension with privileged information, since results based on SVDD are comparable. This is not surprising: when the Gaussian kernel is used, the decision rules of the two methods are essentially the same, as follows from the theoretical results of [5].

A. Data description

We perform experiments using both two-dimensional synthetic data and real data from the Microsoft Malware Classification Challenge (BIG 2015), see [14].
We generate the first synthetic dataset ("Mixture of Gaussians") from a mixture of two-dimensional normal distributions with mean values $c_1 = (2, 2)$ and $c_2 = (-2, -2)$ and unit covariance matrices $I_2 \in \mathbb{R}^{2 \times 2}$, i.e.

$$x \sim \eta \cdot \mathcal{N}(c_1, I_2) + (1 - \eta) \cdot \mathcal{N}(c_2, I_2), \quad \eta \sim \mathrm{Ber}(0.5).$$

As privileged information we use the coordinates of a pattern after subtracting the nearest mean vector, i.e. $x^* = x - \arg\min_{c \in \{c_1, c_2\}} \|x - c\|$.

The second synthetic dataset ("Circles") consists of two circles of different radii $r_1$ and $r_2$ with a common center. We generate each circle in the polar coordinate system, and then the Cartesian coordinates of a pattern are calculated:

$$x = (r \cdot \cos\varphi,\ r \cdot \sin\varphi), \quad r = \eta \cdot \mathrm{I}(\eta > 0), \quad \eta \sim \mathcal{N}(r_0, 0.5), \quad \varphi \sim \mathrm{U}[0, 2\pi),$$

where $\mathrm{I}(\cdot)$ is the indicator function, $\mathrm{U}[0, 2\pi)$ is the uniform distribution, $r_0 = 5$ for the external circle and $r_0 = 0.5$ for the internal circle. We use Cartesian coordinates as features and polar coordinates $(r, \varphi)$ as privileged information.

We generate the third synthetic dataset ("Arc") in polar coordinates:

$$\varphi \sim \mathcal{N}(0, 0.04), \quad \tau = \eta \cdot (0.1 - |\varphi|), \quad \eta \sim \mathcal{N}(-1/2, 1),$$
$$x = ((10 - \tau) \cdot \cos\varphi,\ (10 - \tau) \cdot \sin\varphi),$$

where thanks to the multiplier $0.1 - |\varphi|$ we get a bigger variance for $\varphi \approx 0$ and, as a consequence, a banana-like shape of the dataset. We use Cartesian coordinates as features and polar coordinates as privileged information.

For some experiments we need to generate a sample with noise:
• generate a sample without noise,
• estimate the bounds $[x_{1,\min}, x_{1,\max}] \times [x_{2,\min}, x_{2,\max}]$ of the sample,
• generate noise using the uniform distribution $\mathrm{U}[a_1, a_2] \times \mathrm{U}[b_1, b_2]$, where
$$a_1 = x_{1,\min} - 0.5\,(x_{1,\max} - x_{1,\min}), \quad a_2 = x_{1,\max} + 0.5\,(x_{1,\max} - x_{1,\min}),$$
$$b_1 = x_{2,\min} - 0.5\,(x_{2,\max} - x_{2,\min}), \quad b_2 = x_{2,\max} + 0.5\,(x_{2,\max} - x_{2,\min}).$$
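The three generators and the noise procedure above can be sketched in a few lines of Python (our own illustration; in particular, we treat the second parameter of $\mathcal{N}(\cdot, \cdot)$ as a variance, which the text does not state explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_of_gaussians(n):
    """'Mixture of Gaussians': x ~ eta*N(c1, I) + (1-eta)*N(c2, I), eta ~ Ber(0.5).
    Privileged features: coordinates after subtracting the nearest mean."""
    c = np.array([[2.0, 2.0], [-2.0, -2.0]])
    comp = rng.integers(0, 2, size=n)
    x = c[comp] + rng.normal(size=(n, 2))
    nearest = np.argmin(np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2), axis=1)
    return x, x - c[nearest]

def circles(n, r0):
    """'Circles': r = eta*I(eta > 0), eta ~ N(r0, 0.5), phi ~ U[0, 2*pi).
    Privileged features: the polar coordinates (r, phi)."""
    eta = rng.normal(r0, np.sqrt(0.5), size=n)
    r = eta * (eta > 0)
    phi = rng.uniform(0.0, 2 * np.pi, size=n)
    return np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1), np.stack([r, phi], axis=1)

def arc(n):
    """'Arc': phi ~ N(0, 0.04), tau = eta*(0.1 - |phi|), eta ~ N(-1/2, 1)."""
    phi = rng.normal(0.0, np.sqrt(0.04), size=n)
    tau = rng.normal(-0.5, 1.0, size=n) * (0.1 - np.abs(phi))
    x = np.stack([(10 - tau) * np.cos(phi), (10 - tau) * np.sin(phi)], axis=1)
    return x, np.stack([10 - tau, phi], axis=1)

def add_uniform_noise(x, n_noise):
    """Uniform noise on the sample's bounding box, inflated by 50% on each side."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    a, b = lo - 0.5 * (hi - lo), hi + 0.5 * (hi - lo)
    return rng.uniform(a, b, size=(n_noise, 2))
```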
Thus the region of the feature space in which anomalies are located includes all patterns of the "normal" data. Examples of synthetic data are given in figure 1.

Fig. 1: Examples of synthetic datasets: (a) Mixture of Gaussians; (b) Mixture of Gaussians, privileged feature space; (c) Circles; (d) Circles, privileged feature space; (e) Arc; (f) Arc, privileged feature space.

B. Proportion of discarded test sample patterns

In the case of the original One-Class SVM the proportion of test patterns marked as anomalous by the decision rule depends on the regularization parameter $\nu$ [12], [4] in a very specific way: it tends to $\nu$ as the training sample size increases, provided both the test sample and the training sample are generated from the same distribution. Using the synthetic datasets, let us check how the proportion of test sample patterns marked as anomalous depends on the regularization parameters $\nu$ and $\gamma$. We use the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2/\sigma^2)$ with $\sigma^2 = 2$ both for the original feature space and for the privileged feature space. Results are provided in figure 2. We can notice that when the privileged space is significantly regularized, this dependence is similar to that for the original One-Class SVM.

Fig. 2: Synthetic datasets. Proportion of test sample patterns marked as anomalous by the One-Class SVM+; panels: (a) Arc, (b) Circles, (c) Mixture of Gaussians.

C. Accuracy of Anomaly Detection

For this experiment we generate samples with outliers using the approach described in subsection V-A. The proportion of outliers is equal to 10%. For training the One-Class SVM+ we use an unlabeled sample. To assess the accuracy we calculate the area under the precision/recall curve on the test sample. We also compare this accuracy with that of the original One-Class SVM. The main issue when performing the comparison is how to set the values of the regularization parameters and the kernel widths.
Let us comment on this issue:
• In order to tune the regularization parameter $\nu$ and the kernel width $\sigma$ for the original One-Class SVM, we perform a grid search maximizing the area under the precision/recall curve, estimated by the cross-validation procedure. We denote the obtained "optimal" values by $\nu_{\mathrm{opt}}$ and $\sigma_{\mathrm{opt}}$ and provide the corresponding anomaly detection accuracy in figure 3.
• For the One-Class SVM+ we tune only the regularization parameter $\gamma$ and the kernel width $\sigma^*$ responsible for the "privileged" part of optimization problem (5). As above, we optimize the area under the precision/recall curve, estimated by the cross-validation procedure. In this case we set the parameters $\nu$ and $\sigma$ to $\nu_{\mathrm{opt}}$ and $\sigma_{\mathrm{opt}}$ respectively, which are optimal for the original One-Class SVM. The idea is that if privileged information does not provide any improvement over the original information, the parameter $\gamma$ can be set to some big value, and then the "privileged" part of optimization problem (5) will not have any influence on the overall solution.

We provide typical values of the anomaly detection accuracy for the One-Class SVM+ in figure 4. We can see that privileged information allows obtaining a significant increase of the area under the precision/recall curve.

TABLE I: Synthetic datasets. Accuracy of anomaly detection
Dataset/Method          One-Class SVM   One-Class SVM+
Arc                     0.25            0.67
Circles                 0.56            0.96
Mixture of Gaussians    0.55            0.98

Fig. 3: Synthetic datasets. Area under the precision/recall curve for the original One-Class SVM; panels: (a) Arc, (b) Circles, (c) Mixture of Gaussians.

Accuracy of anomaly detection for the synthetic datasets is reported in table I.

D. Microsoft Malware Classification Challenge

Zero-day cyber attacks such as worms and spyware are becoming increasingly widespread and dangerous. The existing signature-based intrusion detection mechanisms are often not sufficient for detecting these types of attacks.
As a result, anomaly intrusion detection methods have been developed to cope with zero-day cyber attacks. Among the variety of common anomaly detection approaches [17], [18], the support vector machine is known to be one of the best machine learning algorithms for classifying abnormal behaviour [19]. In this section we demonstrate the applicability of the proposed One-Class SVM modification to cyber attack detection using real data from the Microsoft Malware Classification Challenge [14].

Fig. 4: Synthetic datasets. Area under the precision/recall curve for the One-Class SVM+; panels: (a) Arc, (b) Circles, (c) Mixture of Gaussians.

In the framework of the Microsoft Malware Classification Challenge a set of known malware files, representing a mix of nine different malware families, is provided. Each malware file has an Id, a 20-character hash value uniquely identifying the file, and a Class, an integer representing one of the nine family names to which the malware may belong: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, Gatak. For each file the raw data contains the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). A metadata manifest is also provided, which is a log containing various metadata extracted from the binary, such as function calls, strings, etc. It was generated using the IDA disassembler tool. The task proposed to the participants of the challenge was to develop the best mechanism for classifying files from the test set into their respective family affiliations using binary files and assembly code.

In order to test our approach to anomaly detection with privileged information we use the same feature-generation methodology as the one initially proposed by the winning team [20]. From the binary files we calculate frequencies of bytes and counts of different four-grams.
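A minimal sketch of these binary-file features is given below. It assumes the input is a hex dump with two-character byte tokens (with `??` marking unknown bytes, as in the challenge data); hashing the 4-grams into a fixed number of buckets is our simplification, not necessarily the winning team's exact procedure.

```python
import numpy as np

def byte_features(hex_lines, n_gram_buckets=1024):
    """Byte-value frequencies (256 bins) plus hashed byte 4-gram counts.

    `hex_lines`: iterable of strings of two-character byte tokens, e.g.
    "4D 5A 90 00 ??"; '??' tokens (unknown bytes) are skipped."""
    byte_counts = np.zeros(256)
    gram_counts = np.zeros(n_gram_buckets)
    stream = []
    for line in hex_lines:
        for tok in line.split():
            if tok != "??":
                stream.append(int(tok, 16))
    for b in stream:
        byte_counts[b] += 1
    for i in range(len(stream) - 3):
        # hash each 4-gram of bytes into a fixed-size count vector
        gram_counts[hash(tuple(stream[i:i + 4])) % n_gram_buckets] += 1
    total = max(len(stream), 1)
    return np.concatenate([byte_counts / total, gram_counts])
```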
From the assembly code we calculate frequencies of each command and the number of calls to external dll files. We also use a transformation of the assembly code to an image, since malware can be visualized as a grayscale image built from a byte file or an asm file [21], [22]: each byte lies in the range from 0 to 255, so it can be directly translated into a pixel intensity. Details of the transformation can be found in [20]. Thus we use features based on image textures, which are commonly used in scene category classification (coast, mountain, forest, street, etc.). Here, instead of scene categories, we have malware families.

Fig. 5: Examples of malware assembly code transformation to images.

Finally, as the original features we use the information obtained from the binary files, and as privileged information we use the features obtained from the assembly code. In this way we model a situation where we have resources to perform reverse engineering of a program in order to construct a training sample, but cannot do it during the test phase, e.g. due to restrictions on computational resources.

For each of the nine malware classes we consider the following problem statement:
• We select one of the nine classes.
• As a training set we use half of the patterns from the selected class.
• As a test set we use the patterns from the other half of the selected class, as well as the patterns from the other eight classes. We consider patterns from the selected class to be "normal" and patterns from the other classes to be "abnormal".
• We use predictions on the test set to calculate the area under the precision/recall curve.

We do not claim that such an experimental setup fully reflects the specificity of cyber-security applications. In fact, through experiments on this real malware data we would like to show that privileged information can increase model accuracy, and thus that our approach is useful for cyber-security applications.
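The byte-to-image step can be sketched as follows. This is a simplified illustration: the row width and the crude histogram-based texture descriptor are our assumptions, standing in for the exact variant described in [20].

```python
import numpy as np

def bytes_to_image(byte_values, width=256):
    """Interpret a byte stream as rows of grayscale pixels (one byte = one pixel)."""
    b = np.asarray(byte_values, dtype=np.uint8)
    height = len(b) // width
    return b[: height * width].reshape(height, width)

def texture_features(img, bins=16):
    """A crude texture descriptor: histograms of pixel intensities and of
    horizontal gradients (a stand-in for the texture features used in [20])."""
    h1, _ = np.histogram(img, bins=bins, range=(0, 255), density=True)
    # cast before differencing to avoid uint8 wrap-around
    grad = np.abs(np.diff(img.astype(np.int16), axis=1))
    h2, _ = np.histogram(grad, bins=bins, range=(0, 255), density=True)
    return np.concatenate([h1, h2])
```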
As in the previous subsection, here we also perform a comparison with the original One-Class SVM. In figure 6 we provide an example of how the area under the precision/recall curve depends on the parameters. We can notice that it almost does not depend on the regularization parameter $\nu$.

Fig. 6: Malware Detection problem. Dependence of anomaly detection accuracy on the regularization parameters $\nu$ and $\gamma$, and on the Gaussian kernel widths $\sigma$ and $\sigma^*$; panels: (a) One-Class SVM, (b) One-Class SVM+.

TABLE II: Malware Detection problem. Comparison of accuracies of the One-Class SVM and the One-Class SVM+
Algorithm/Malware Class   1     2     3     4     5     6     7     8     9
One-Class SVM             0.67  0.93  0.97  0.69  0.57  0.80  0.82  0.84  0.60
One-Class SVM+            0.81  0.95  0.99  0.72  0.57  0.81  0.85  0.87  0.62

The values of the parameters $\nu$ and $\sigma$ selected for the One-Class SVM by the cross-validation procedure are also used for the One-Class SVM+. Thus, for the One-Class SVM+ we only tune the value of the regularization parameter $\gamma$ and the kernel width $\sigma^*$ in the privileged feature space. We expect that for a small kernel width $\sigma^*$ and a big value of $\gamma$ we obtain results similar to those of the One-Class SVM.

We provide the obtained results in table II. For some malware classes privileged information allows a significant increase in the accuracy of anomaly detection. For other malware classes the accuracies of the One-Class SVM and the One-Class SVM+ turned out to be the same, due to the fact that for specific values of the kernel width in the privileged feature space and of the regularization parameter the decision function of the One-Class SVM+ is close to the decision function of the original One-Class SVM.

E. Comparison with Related Approaches

In [15] and [16] (we review these related methods in section IV) the authors describe results of experiments on real datasets.
In particular, in [15] results of experiments on four datasets are described, and in [16] two datasets are used among those considered in [15]. In order to compare the approaches from [15] and [16] with the methods proposed in this paper, we use one of these two samples, the Abalone dataset. Results of experiments for the other datasets are similar.

The Abalone dataset contains the size and weight of molluscs, as well as their age, evaluated from the number of rings on the shell. Patterns are divided into two groups depending on the value of the parameter "Rings". Patterns with Rings < 7 are considered as the normal class; the rest of the patterns are considered to be anomalies.

Let us construct several blocks of privileged information. For this we divide the dataset into two groups w.r.t. the parameters "Length" (Length < 0.5), "Height" (Height < 0.15) and "Whole weight" (Whole weight < 0.8). Each such division can be encoded by a binary vector, which we use as additional information during the training phase. We perform three experiments; in each of them we use one of these blocks of privileged information. We evaluate classification accuracy by the ten-fold cross-validation procedure.

In the previous sections we used the area under the precision/recall curve to evaluate the performance of anomaly detection algorithms. Unfortunately, in [16] the authors provide only accuracy values, therefore in this section, for comparability, we also provide only accuracy values.

TABLE III: Comparison with related approaches. Accuracy values
Feature/Method   One-Class SVM [15]   SVDD [16]   One-Class SVM+
Length           0.670                0.673       0.702
Height           0.730                0.728       0.751
Whole weight     0.715                0.710       0.744

Results are given in table III. We can see that the parametrization of the One-Class SVM+ proposed in this paper allowed us to find a more efficient solution. Let us also note that the results of the One-Class SVM from [15] and of the SVDD from [16] are comparable.
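The construction of the privileged blocks above amounts to thresholding a single column; a small sketch (our own, assuming the Abalone columns are available as numeric arrays; the threshold values are taken from the text):

```python
import numpy as np

# Thresholds from the text; each defines one "block" of privileged information.
BLOCKS = {"Length": 0.5, "Height": 0.15, "Whole weight": 0.8}

def privileged_block(values, feature):
    """Encode group membership (value < threshold) as a binary privileged feature."""
    return (np.asarray(values, dtype=float) < BLOCKS[feature]).astype(float).reshape(-1, 1)

def normal_class_labels(rings):
    """Rings < 7 is the 'normal' class (label +1); the rest are anomalies (-1)."""
    return np.where(np.asarray(rings) < 7, 1, -1)
```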
VI. CONCLUSIONS

We provide modifications of one-class classification approaches that allow incorporating privileged information. We can see from the results of the experiments that in some cases privileged information can significantly improve anomaly detection accuracy. In cases when privileged information is not useful for the problem at hand, thanks to the structure of the corresponding optimization problem (e.g. cf. (1) with (5)), privileged information will not have a significant influence on the corresponding decision function: since for $\gamma \gg 1$ the solution of (5) is close to the solution of (1), the decision function of the One-Class SVM+ is close to the decision function of the original One-Class SVM.

VII. ACKNOWLEDGEMENTS

The research of the first author was conducted at IITP RAS and supported solely by the Russian Science Foundation grant (project 14-50-00150). The research of the second author was supported by the RFBR grants 16-01-00576 A and 16-29-09649 ofi_m.

REFERENCES

[1] Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. Journal of Machine Learning Research 2 (2001) 139–154
[2] Khan, S.S., Madden, M.G.: One-class classification: Taxonomy of study and review of techniques. The Knowledge Engineering Review 29(3) (2014) 345–374
[3] Khan, S.S., Madden, M.G.: A survey of recent trends in one class classification. In: Proceedings of the 20th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, LNAI, Volume 6206. Springer-Verlag (2009) 181–190
[4] Tax, D.M.J., Duin, R.P.W.: Support vector data description. Machine Learning 54(1) (January 2004) 45–66
[5] Chang, W.C., Lee, C.P., Lin, C.J.: A revisit to support vector data description. Technical report (2013)
[6] Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.C.: Support vector method for novelty detection.
In Solla, S.A., Leen, T.K., Müller, K., eds.: Advances in Neural Information Processing Systems 12. MIT Press (2000) 582–588
[7] Burnaev, E., Erofeev, P., Smolyakov, D.: Model selection for anomaly detection. In Verikas, A., Radeva, P., Nikolaev, D., eds.: Proc. SPIE 9875, Volume 9875. SPIE (2015)
[8] Burnaev, E., Erofeev, P., Papanov, A.: Influence of resampling on accuracy of imbalanced classification. In Verikas, A., Radeva, P., Nikolaev, D., eds.: Proc. SPIE 9875, Volume 9875. SPIE (2015)
[9] Artemov, A., Burnaev, E.: Ensembles of detectors for online detection of transient changes. In Verikas, A., Radeva, P., Nikolaev, D., eds.: Proc. SPIE 9875, Volume 98751Z. SPIE (2015)
[10] Kuleshov, A., Bernstein, A.: Incremental construction of low-dimensional data representations. In: Lecture Notes in Artificial Intelligence, "Artificial Neural Networks for Pattern Recognition", Volume 9896. Springer, Heidelberg (2016) 13 pp.
[11] Bernstein, A., Kuleshov, A., Yanovich, Y.: Information preserving and locally isometric and conformal embedding via tangent manifold learning. In: Proceedings of the International IEEE Conference on Data Science and Advanced Analytics (DSAA 2015). IEEE Computer Society, Piscataway, USA (2015) 1–9
[12] Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Computation 12(5) (May 2000) 1207–1245
[13] Pechyony, D., Vapnik, V.: On the theory of learning with privileged information. In Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A., eds.: Advances in Neural Information Processing Systems 23. Curran Associates, Inc. (2010) 1894–1902
[14] Microsoft malware classification challenge (BIG 2015). https://www.kaggle.com/c/malware-classification
[15] Zhu, W., Zhong, P.: A new one-class SVM based on hidden information.
Knowledge-Based Systems 60 (2014) 35–43
[16] Zhang, W.: Support vector data description using privileged information. Electronics Letters 51(14) (2015) 1075–1076
[17] Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15(4) (2012) 459–475
[18] Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications 41(13) (2014) 5843–5857
[19] Shon, T., Moon, J.: A hybrid machine learning approach to network anomaly detection. Information Sciences 177 (2007) 3799–3821
[20] Wang, X., Liu, J., Chen, X.: Microsoft malware classification challenge (BIG 2015), first place team: Say no to overfitting. https://github.com/xiaozhouwang/kaggle_Microsoft_Malware/blob/master/Saynotooverfitting.pdf
[21] Nataraj, L.: http://sarvamblog.blogspot.com/ (2014)
[22] Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.: Malware images: visualization and automatic classification. In: Proceedings of the 8th International Symposium on Visualization for Cyber Security. ACM (2011) 7 pp.
