Semi-supervised classification for dynamic Android malware detection

Authors: Li Chen, Mingwei Zhang, Chih-Yuan Yang

Semi-supervised Classification for Dynamic Android Malware Detection

Li Chen, Mingwei Zhang, Chih-Yuan Yang, Ravi Sahita
Security and Privacy Research, Intel Labs

Abstract. A growing number of threats to Android phones creates challenges for malware detection. Manually labeling the samples into benign or different malicious families requires tremendous human effort, while it is comparably easy and cheap to obtain a large amount of unlabeled APKs from various sources. Moreover, the fast-paced evolution of Android malware continuously generates derivative malware families. These families often contain new signatures, which can escape detection when using static analysis. These practical challenges can also cause traditional supervised machine learning algorithms to degrade in performance. In this paper, we propose a framework that uses a model-based semi-supervised (MBSS) classification scheme on dynamic Android API call logs. The semi-supervised approach efficiently uses the labeled and unlabeled APKs to estimate a finite mixture model of Gaussian distributions via conditional expectation-maximization, and efficiently detects malware during out-of-sample testing. We compare MBSS with popular malware detection classifiers such as support vector machine (SVM), k-nearest neighbor (kNN) and linear discriminant analysis (LDA). Under the ideal classification setting, MBSS has competitive performance with 98% accuracy and a very low false positive rate for in-sample classification. For out-of-sample testing, the test data exhibit behavior similar to the in-sample training set, namely retrieving phone information and sending it to the network. When this similarity is strong, MBSS and SVM with linear kernel maintain a 90% detection rate, while kNN and LDA suffer great performance degradation.
When this similarity is slightly weaker, all classifiers degrade in performance, but MBSS still performs significantly better than the other classifiers.

Keywords: Android dynamic malware detection, machine learning, semi-supervised learning, out-of-sample classification, Gaussian mixture modeling, conditional expectation-maximization.

First and corresponding author's email address is li.chen@intel.com. Li Chen is Data Scientist in the Privacy and Security Lab, Intel Labs, Hillsboro, OR 97124. Mingwei Zhang is Research Scientist in the Privacy and Security Lab, Intel Labs, Hillsboro, OR 97124. Chih-Yuan Yang is Research Scientist in the Privacy and Security Lab, Intel Labs, Hillsboro, OR 97124. Ravi Sahita is Principal Engineer in the Privacy and Security Lab, Intel Labs, Hillsboro, OR 97124.

1 Introduction

Android has dominated the mobile operating system market, accounting for over 86% of global shipments and operating system market share in 2016 [4], [11]. The official Android application store, Google Play, provides over 2.7 million applications [9], and the total download count reached 65 billion in a one-year period [5]. The popularity of the Android operating system has made it a lucrative target for cybercriminals. According to McAfee's 2016 mobile threat report, 9 million pieces of malware were found in app stores across 190 countries globally in a three-month period, and 3 million devices were affected over a six-month period [7]. The average monthly infection rate among smartphones increased to 0.49 percent in the first half of 2016, a 98 percent surge from the 0.25 percent in the second half of 2015, according to the Nokia Threat Intelligence Labs report [8]. The rampant evolution of Android malware makes it more sophisticated at avoiding detection and more difficult to classify using commonly-used machine learning algorithms.
The human effort of labeling malware as malicious or benign cannot keep up with the voluminous generation of Android malware, resulting in an imbalance of much more unlabeled data than labeled data. For supervised learning algorithms, this causes potential difficulties, since these algorithms are constructed merely on the labeled dataset or training data, but are expected to have reasonably good performance on the greater amount of unlabeled data. Furthermore, due to this imbalance of unlabeled and labeled data, the distribution observed from the labeled data can be different from the actual data distribution. This is seen in malware detection, as studies suggest that malware families exhibit polymorphic behaviors. Translated into machine learning terms, this implies that at the testing phase the test data can be similarly distributed to the training data, but not identically distributed. Traditional supervised machine learning algorithms can suffer performance degradation when tested on samples that are not distributed identically to the training data. Therefore, in order to achieve a robust out-of-sample malware detection rate, it is desirable to have a machine-learning-based malware detector that takes advantage of both unlabeled and labeled data, and maintains steady performance for out-of-sample test analysis.

To address the above challenges, in this paper we propose a framework utilizing model-based semi-supervised (MBSS) classification on dynamic behavior data for Android malware detection. We focus on detecting malicious behavior at runtime by using dynamic behavior data for our analysis. The main advantage of semi-supervised classification is its strong robustness in performance for out-of-sample testing.
Model-based semi-supervised classification uses both labeled and unlabeled data to estimate the parameters, since unlabeled data usually carries valuable information for the model fitting procedure.

Specifically, we use mixture modeling to achieve the classification task. Mixture modeling is a machine learning technique which assumes that every component of the mixture represents a given set of observations in the entire collection of data. Gaussian mixture modeling is the most widely applied and studied variant. Model-based mixture modeling uses a mixture of Gaussian distributions to develop clustering, classification and semi-supervised learning methods [14], [27], [26], [23].

We run the Android applications in our emulator infrastructure and harvest the API calls at run time. Our framework efficiently uses the labeled and unlabeled behavior data to estimate a set of finite mixture models of Gaussian distributions via conditional expectation-maximization, and uses the Bayesian information criterion for model selection. We compare MBSS with popular malware detection classifiers such as support vector machine (SVM), k-nearest neighbor (kNN) and linear discriminant analysis (LDA). We demonstrate that MBSS has competitive performance for in-sample classification, and maintains strong robustness when applied to out-of-sample testing. We consider semi-supervised learning on dynamic Android behavior data a practical and valuable addition to Android malware detection.

The rest of the paper is structured as follows. In Section 2, we describe related work on machine learning for Android malware detection. In Section 3, we describe the infrastructure and implementation of our Android malware emulator, which collects the dynamic behavioral data for our analysis. In Section 4, we describe in detail the model-based semi-supervised classifier.
In Section 5, we present the results of our proposed methodology on three sets of experiments to demonstrate its effectiveness for out-of-sample testing. We conclude the paper with a summary and discussion in Section 6.

2 Related Work

The intersection of artificial intelligence and statistics provides machine learning with a foundation of probabilistic models and data-driven parameter estimation. Machine learning tasks can be categorized into supervised learning, unsupervised learning and semi-supervised learning. In supervised learning, the task is to classify the labels of incoming data samples given the observed training data and their labels. Traditional methods include nearest neighbor [21], support vector machine [20], decision tree [42], sparse representation classifier [46], [19], etc. In unsupervised learning, the problem is to group the observations into categories based on a chosen similarity measure. Typical methods include K-means, expectation-maximization [24], [37], model-based clustering [27], [18], hierarchical clustering [32], spectral clustering [38], [34], and so on. Semi-supervised learning lies between supervised and unsupervised learning: the learner makes use of the unlabeled data for training and updates the model using both the labeled and unlabeled data. Typical methods include Gaussian mixture models [36], hidden Markov models, low-density separation [17], and so on.

There are various techniques used to detect malware on the Android system. They can generally be classified as static or dynamic analysis. Static analysis techniques focus on information from the Android application package (APK) such as the manifest, resources, code binary, etc., while dynamic analysis techniques focus on dynamic behaviors collected during APK execution.
Previous work such as CHEX [33] statically checks the code logic to prevent component hijacking attacks that might cause privacy leakage. MAMADROID [35] and DroidAPIMiner [12] both use Android API invocations in static code to detect malware. While static techniques are more prevalent among anti-malware companies, they are known to fail to detect malware with sophisticated evasion techniques like dynamic code loading or code obfuscation. MYSTIQUE-S [47] is a sample malware which can select attack features at runtime and download malicious payloads dynamically. It bypasses static detection and can only be detected by dynamic monitoring tools such as Droidbox [2].

DroidHunter [44], FLEXDROID [43] and Going Native [13] are a few examples of dynamic detection techniques. DroidHunter is able to detect malicious code loading by hooking critical Android runtime APIs. FLEXDROID proposes an extra component in the Android app manifest which allows developers to set permissions on native code. Going Native generates policies for native code and limits malicious behaviors. On the other hand, several dynamic monitoring tools have been proposed for Android malware detection. Droidbox [2] and Droidmon [3] are both based on instrumentation of the Android runtime. Droidbox relies on source code modification of the Android Open Source Project (AOSP), and thus suffers from the huge engineering work of porting to various Android versions. Droidmon instead is based on the Xposed Framework, a framework for modules that can change the behavior of an APK without modifying it.

One downside of dynamic analysis is limited code coverage, as some behaviors are only exposed under specific conditions such as user interactions or sensor data. IntelliDroid [45] tries to solve this problem using symbolic execution. It leverages a solver to generate targeted input to trigger malicious code logic and increase coverage.
However, its implementation is bound to the Android version, and may suffer from the porting issue.

Similar to Droidmon, our instrumentation framework, the Android Emulator (AE), is based on the Xposed Framework for dynamic API monitoring. Several components, such as the UI automation tool and sensor emulators, were also incorporated into AE to enhance coverage of dynamic behavior. Tracing logs of APK execution were harvested for subsequent machine learning analysis.

3 Data Generation and Preparation

Malicious and benign APKs were downloaded from VirusTotal and Google Play respectively. Both sets of samples were uploaded to our emulator infrastructure for execution, as seen in Figure 1. The Android emulator (AE) runs in an emulator machine, and each machine may contain multiple emulators. The downloaded APKs were dispatched by the scheduler and installed to new emulator instances for execution. Since most Android applications are UI-based, merely launching the application may be insufficient to expose its behaviors. The automation tool was developed to provide simulated human interactions, such as clicks, and sensor events, such as GPS location. This tool can navigate the UI automatically without human intervention. We harvest application behaviors through the Android Debug Bridge (ADB) [1] and aggregate them onto disk.

Fig. 1: Dynamic Instrumentation System Architecture.

To address the trade-off between efficiency and completeness, we set the experiment time for each APK to 10 minutes. To harvest the API calls at runtime, our customized emulator has the Xposed framework [10] pre-installed, which can potentially hook any API invocation to the Android runtime. Our Xposed component runs alongside each application instance, intercepting and printing each API invocation through ADB [1], which is further harvested by a dedicated out-of-box process.
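The harvested traces are plain text logs. The exact line layout of our pipeline is internal; as an illustration only, the log-processing step (Step 2 of the framework) might look like the following sketch, which assumes a hypothetical `API: class->method` line format:

```python
import re

# Hypothetical trace-line format; the real AE/Xposed log layout may differ.
TRACE_RE = re.compile(r"API:\s*(?P<cls>[\w.]+)->(?P<fn>\w+)")

def apis_from_log(lines):
    """Collect the set of 'ClassName.methodName' strings seen in one APK's trace."""
    seen = set()
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            seen.add(f"{m.group('cls')}.{m.group('fn')}")
    return seen
```

Lines that do not match the pattern (debug output, timestamps without API records) are simply skipped, so the result is the per-APK set of observed API invocations.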
Since Android applications are mostly UI-based, launching the application alone is insufficient to harvest enough behavior. To cope with this, we used a UI automation tool acting as a robot to dynamically click through the running APK inside the emulator. As a result, we could navigate most of the application logic with greatly reduced human effort.

We capture the dynamic behavior of each Android application by executing it in our emulator. The dynamic data consists of a time sequence of API calls made by an application to the Android runtime. Since the current Android runtime exports more than 50K APIs, for efficiency we carefully select 160 API calls that are critical to changing Android system state, such as sending short messages, accessing websites, reading contact information, etc. Our selection of Android API functions comes from the union of the function sets selected by three well-known open source projects: AndroidEagleEye [6], Droidmon [3], and Droidbox [2]. We believe our API selection set is sufficient to cover all critical malicious Android behaviors at runtime. For each API call, we capture the API class name and its function name. As a result, our data consists of the dynamic traces of the APKs as samples, and the feature space of our dataset is the collection of APIs, where each feature denotes the existence of the corresponding API in the sample APK. That is, suppose the number of unique API calls is $d$; then a sample APK is represented by $x = (0, 0, \dots, 1, \dots, 1, \dots, 0) \in \{0, 1\}^d$, where 1 denotes the existence of an API call and 0 otherwise.

4 Model-based semi-supervised classification

Even if a classifier achieves high classification accuracy and a low false positive rate for in-sample testing, its good performance may not extend to out-of-sample testing.
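The binary API-presence representation of Section 3 can be sketched in a few lines; here `vocabulary` stands for the fixed, ordered list of the selected API names (160 in our setting), and `apk_apis` for the set of APIs observed in one APK's trace:

```python
def to_feature_vector(apk_apis, vocabulary):
    """Binary presence vector x in {0,1}^d over a fixed, ordered API vocabulary."""
    return [1 if api in apk_apis else 0 for api in vocabulary]
```

Because the vocabulary is fixed across APKs, every sample maps to a vector of the same dimension $d$, which is the representation all classifiers in this paper consume.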
Semi-supervised classification uses both the labeled and unlabeled data to update the classifier, taking advantage of the complete dataset and thus achieving more accurate classification. Here we propose to use the model-based semi-supervised (MBSS) classification algorithm [23], [41], [27]. We start by formulating the semi-supervised classification task, then explain the conditional expectation-maximization algorithm used for solving the maximum likelihood problem, and describe the model selection criterion.

4.1 The Semi-supervised Learning Task

In the supervised setting, we define $[K] = \{1, 2, \dots, K\}$ for a given positive integer $K$. Suppose $(X, Y) \sim F_{XY}$, where $X$ is a feature vector in $\mathbb{R}^d$, $Y \in [K]$ is the class label, and $F_{XY}$ is the joint distribution of $X$ and $Y$. Further suppose we observe the independently identically distributed training data
$$\mathcal{T}_n = (\{X_n\}, \{Y_n\}) := \{(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\} \overset{\text{i.i.d.}}{\sim} F_{XY}.$$
The task is to classify the class membership of a sample $X$ by defining a classifier $g: X \to Y$ [25].

In the unsupervised setting, we observe independently identically distributed feature vectors $X_1, X_2, \dots, X_n$, where each $X_i$ is a random variable on some probability space. The task is to cluster the samples into groups based on a chosen metric of similarity.

The semi-supervised setting lies between supervised and unsupervised learning. Denote the incoming data by $\mathcal{X}_m = \{X_{n+1}, \dots, X_{n+m}\}$ with unknown labels, and keep the notation $\mathcal{T}_n$ for the training data. The task is to learn a classifier $g: X \to Y$ that predicts the labels of $\mathcal{X}_m$ by learning from the complete data rather than from the labeled training data alone.
4.2 Model-based Semi-supervised Learning Framework

In the model-based approach, the data $x$ is assumed to be distributed according to a mixture density $f(x) = \sum_{k=1}^{K} \pi_k f_k(x)$, where $f_k(\cdot)$ is the density of the $k$-th group and $\pi_k$ is the probability that an observation belongs to group $k$. Each component is Gaussian, characterized by its mean $\mu_k$ and covariance matrix $\Sigma_k$. The probability density function of the $k$-th component is thus

$$f_k(x; \mu_k, \Sigma_k) = \frac{\exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)}{\sqrt{\det(2\pi\Sigma_k)}}. \quad (1)$$

The means $\mu = (\mu_1, \dots, \mu_K)$, covariance matrices $\Sigma = (\Sigma_1, \dots, \Sigma_K)$ and mixing proportions $\pi = (\pi_1, \dots, \pi_K)$ are the parameters to be estimated for the mixture model. Here we use maximum likelihood estimation, which finds the parameters maximizing the probability of obtaining the observations given the parameters. Denote by $\theta := (\mu, \Sigma)$ all the parameters of the Gaussian components; we want to estimate $\pi$ and $\theta$.

Denote by $\mathcal{X}_n = \{X_1, \dots, X_n\}$ the training data and $\mathcal{Y}_n = \{Y_1, \dots, Y_n\}$ the training labels, where for the $i$-th observation $Y_{ik} = 1$ if the observation comes from group $k$ and 0 otherwise. Denote the unknown labels of the unlabeled data by $\mathcal{Y}_m = \{Y_{n+1}, \dots, Y_{n+m}\}$. The likelihood of the training data is

$$L_T(\pi, \theta \mid \mathcal{X}_n, \mathcal{Y}_n, \mathcal{X}_m) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\pi_k f(X_i \mid \theta_k)\right]^{Y_{ik}} \prod_{j=n+1}^{n+m} \sum_{k=1}^{K} \pi_k f(X_j \mid \theta_k). \quad (2)$$

To estimate the unknown parameters $\pi$ and $\theta$, one computes the log-likelihood $l(\pi, \theta) = \log L_T(\pi, \theta)$ and uses the EM algorithm [24], [37] to maximize it. In our framework, we apply model-based semi-supervised classification using both labeled and unlabeled Android behavioral data to develop the classification decision for the unlabeled data.
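For illustration, equation (1) and the mixture density $f(x)$ can be written out directly for the special case of diagonal covariance matrices, where the determinant and the quadratic form factorize over coordinates. Our experiments use R and the model-based clustering family of implementations; this standalone Python sketch is only illustrative:

```python
import math

def gaussian_density(x, mu, diag_cov):
    """Eq. (1) specialized to a diagonal covariance matrix.

    x, mu: length-d lists; diag_cov: the d diagonal variances of Sigma_k."""
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, diag_cov))
    det = 1.0  # det(2*pi*Sigma) = prod_i (2*pi*v_i) for diagonal Sigma
    for vi in diag_cov:
        det *= 2 * math.pi * vi
    return math.exp(-0.5 * quad) / math.sqrt(det)

def mixture_density(x, pis, mus, covs):
    """f(x) = sum_k pi_k f_k(x), the mixture density over K components."""
    return sum(p * gaussian_density(x, m, c) for p, m, c in zip(pis, mus, covs))
```

With full (non-diagonal) covariances one would replace the factorized quadratic form with $(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)$ via a linear solve.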
The likelihood of the complete data, consisting of labeled and unlabeled data, is

$$L_C(\pi, \theta \mid \mathcal{X}_n, \mathcal{Y}_n, \mathcal{X}_m, \mathcal{Y}_m) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\pi_k f(X_i \mid \theta_k)\right]^{Y_{ik}} \prod_{j=n+1}^{n+m} \prod_{k=1}^{K} \left[\pi_k f(X_j \mid \theta_k)\right]^{Y_{jk}}. \quad (3)$$

Essentially, we treat the data with unknown labels as missing data in order to include them in the complete likelihood. To estimate the unknown parameters and maximize the log-likelihood of the complete data, we use the conditional expectation-maximization (CEM) algorithm [31], which is similar to the expectation-maximization (EM) algorithm.

4.3 The Conditional Expectation-Maximization Algorithm

The conditional expectation-maximization (CEM) algorithm [31] is used to solve the likelihood maximization on the complete data. We note that CEM is similar to EM, except that we update the classification using both labeled and unlabeled data. We denote the parameter estimates at the $g$-th iteration by $\hat{\pi}^g$, $\hat{\theta}^g = (\hat{\mu}^g, \hat{\Sigma}^g)$ and the estimated labels on the test data at the $g$-th iteration by $\hat{\mathcal{Y}}^g_m$.

– Step 1: Initialization. Set $g = 0$ and initialize the starting estimates $\hat{\pi}^0$, $\hat{\theta}^0$ using model-based discriminant analysis estimates of the model parameters, as described in [27].

– Step 2: Expectation. While the chosen stopping criterion is not satisfied, increment the iteration counter $g$. Calculate the expected value of the unknown labels via

$$w_{jk} = \frac{\hat{\pi}^g_k f(X_j \mid \hat{\theta}^g_k)}{\sum_{k'=1}^{K} \hat{\pi}^g_{k'} f(X_j \mid \hat{\theta}^g_{k'})}, \quad (4)$$

where $k \in [K]$ indexes the class memberships and $j = n+1, \dots, n+m$ indexes the unlabeled data.

– Step 3: Maximization. The estimated labels after the $g$-th iteration are $\hat{Y}^{g+1}_{jk} := \operatorname{sign}(w_{jk} - w_{jk'})$ for all $k' \neq k$; that is, $\hat{Y}^{g+1}_{jk} = 1$ exactly when $w_{jk}$ is the maximum weight. Estimate the mixture parameters by

$$\hat{\pi}^{g+1}_k = \frac{\sum_{i=1}^{n} l_{ik} + \sum_{j=n+1}^{n+m} \hat{Y}^{g+1}_{jk}}{n + m}, \quad (5)$$

$$\hat{\mu}^{g+1}_k = \frac{\sum_{i=1}^{n} l_{ik} X_i + \sum_{j=n+1}^{n+m} \hat{Y}^{g+1}_{jk} X_j}{\sum_{i=1}^{n} l_{ik} + \sum_{j=n+1}^{n+m} \hat{Y}^{g+1}_{jk}}. \quad (6)$$

To estimate the covariance, we apply the eigenvalue decomposition of the covariance matrix $\Sigma_k$, i.e., $\Sigma_k = \lambda_k D_k A_k D_k^T$. Depending on the covariance structure, as seen in Section 4.4, different constraints are imposed on the covariance matrix. [15], [16] describe in detail how each covariance matrix is estimated according to its structure.

– Step 4: Convergence. Stop when the stopping criterion is met. In this study, the stopping criterion is met when the current value of the log-likelihood is within $10^{-5}$ of the estimated final converged value.

Upon convergence, the fitted mixture model produces the posterior probabilities of the group memberships $Y_{jk}$, where $j = n+1, \dots, n+m$ and $k \in [K]$, for the unlabeled data. The categorical classification assigns each observation to the class with the maximum posterior probability.

4.4 Model Selection

The CEM algorithm estimates the parameters $\mu$, $\Sigma$, $\pi$. The regions or clusters centered at $\mu$ characterize the data generated from the mixture of Gaussian distributions. The shapes of these regions or clusters are determined by the covariance matrices $\Sigma$. In [14], the eigen-decomposition of the covariance matrix, $\Sigma_k = \lambda_k D_k A_k D_k^T$, has a geometric interpretation: $D_k$ is an orthogonal matrix consisting of the eigenvectors of $\Sigma_k$, determining the orientation of the $k$-th component; $A_k$ is the diagonal matrix with the eigenvalues on the diagonal, determining the shape; and $\lambda_k$ determines the size of the group. For example, when all components are of the same size and spherical, the imposed constraint is $\Sigma_k = \lambda I$. If the components are spherical but of different sizes, the imposed constraint is $\Sigma_k = \lambda_k I$. As a result, different shapes of the mixture components lead to different parametrizations of the covariance matrix, and thus to different mixture models.
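To make the CEM loop of Section 4.3 concrete, the following is a minimal one-dimensional, two-component sketch of Steps 1 through 4 with scalar-variance (spherical) components. It is illustrative only, not the R implementation used in our experiments; initialization here simply uses the labeled-data estimates, a simplified stand-in for the discriminant-analysis initialization of [27]:

```python
import math

def gauss_pdf(x, mu, var):
    # Eq. (1) in one dimension.
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def cem(labeled, labels, unlabeled, n_iter=50):
    """Two-component CEM sketch: labeled/unlabeled are lists of floats, labels are 0/1."""
    n, m = len(labeled), len(unlabeled)
    # Step 1: initialize pi, mu, var from the labeled data alone.
    groups = {k: [x for x, y in zip(labeled, labels) if y == k] for k in (0, 1)}
    mu = {k: sum(v) / len(v) for k, v in groups.items()}
    var = {k: sum((x - mu[k]) ** 2 for x in v) / len(v) + 1e-6 for k, v in groups.items()}
    pi = {k: len(groups[k]) / n for k in (0, 1)}
    yhat = []
    for _ in range(n_iter):
        # Step 2 (E): posterior weights w_jk for the unlabeled data, Eq. (4).
        w = []
        for x in unlabeled:
            p = [pi[k] * gauss_pdf(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            w.append([pk / s for pk in p])
        # Classification step: hard labels Y_jk pick the maximum weight.
        yhat = [0 if wj[0] >= wj[1] else 1 for wj in w]
        # Step 3 (M): update pi, mu, var from labeled + newly labeled data, Eqs. (5)-(6).
        for k in (0, 1):
            xs = groups[k] + [x for x, y in zip(unlabeled, yhat) if y == k]
            pi[k] = len(xs) / (n + m)
            mu[k] = sum(xs) / len(xs)
            var[k] = sum((x - mu[k]) ** 2 for x in xs) / len(xs) + 1e-6
    # Step 4 in the paper checks log-likelihood convergence; here we
    # simply run a fixed number of iterations for brevity.
    return yhat, pi, mu, var
```

The `1e-6` floor on the variance is a numerical guard, not part of the formal algorithm; in the full method the covariance update additionally respects the chosen parametrization of Section 4.4.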
Here we consider the set of models listed in [28], [29], [23], [41] and use the Bayesian information criterion (BIC) for model selection. The BIC is given by

$$\mathrm{BIC}(M_i) = 2 \log L_C - \ln(n + m)\, l, \quad (7)$$

where $\log L_C$ is the maximized log-likelihood of the complete data, $n + m$ is the number of observations, and $l$ is the number of parameters. The optimal model is selected by maximizing the BIC [23], [27], [40].

4.5 Summarizing our framework

Our framework of model-based semi-supervised classification for dynamic Android malware analysis is summarized in Algorithm 1.

Algorithm 1: Model-based semi-supervised classification on dynamic Android behavior
Goal: Predict the labels of Android APKs.
Input: Android APKs; MBSS convergence stopping criterion.
Step 1: Emulation. Execute APKs, and output API names to logs.
Step 2: Data generation. Process logs and extract selected API names.
Step 3: Feature extraction. Engineer features from API names.
Step 4: MBSS. Run model-based semi-supervised classification on the extracted features of the API logs.
Step 5: Classification. Classify APKs into groups based on the maximum posterior probability from MBSS.

5 Results

We compare MBSS with some of the most popular malware classifiers: support vector machine, k-nearest neighbor, and linear discriminant analysis. We conduct two categories of experiments: in-sample validation and out-of-sample validation. In-sample validation is the typical classification setting, where a dataset is split into a training and a test set. A classifier is trained on the training set and expected to perform well when tested on the test set. Cross validation is usually employed to assess the performance of a classifier. In-sample classification provides the ideal classification scenario, where the test data distribution is the same as the training data distribution. Our next set of experiments focuses on out-of-sample testing.
An out-of-sample experiment uses the classifier trained and validated on the in-sample data, and predicts the labels of incoming unlabeled data. The vast majority of the unlabeled data are not guaranteed to follow the same distribution as the in-sample training data. This is a challenge for machine learning algorithms used in practical applications. We consider a classifier robust and practical when it can still achieve reasonably good performance for out-of-sample classification. All the APKs for our experiments are retrieved from VirusTotal, and the experiments are conducted in R [39], [41], [29].

5.1 Dataset

We use the dynamic logs obtained from our Android emulator as described in Section 3. The behavior data contains the filtered API calls recorded during execution. The set of unique API calls constitutes the feature space. Each APK sample is represented by the binary feature vector denoting the existence of each API call. For example, suppose the number of unique API calls is $d$; then a sample APK is represented by $x = (0, 0, \dots, 1, \dots, 1, \dots, 0) \in \{0, 1\}^d$, where 1 denotes the existence of an API call and 0 otherwise.

5.2 In-sample validation

We first demonstrate that for in-sample classification, MBSS has competitive performance compared with SVM with radial kernel, SVM with linear kernel, 3NN and LDA, all of which are widely used for malware classification. As the in-sample dataset, we obtain 55994 Android APK dynamic logs from our emulator, with 24217 benign Android APKs and 31777 malicious APKs. The in-sample malicious behaviors include stealing location, device ID and MAC information, dynamic code loading, and sending the information to an outside network. The label distribution is (43%, 57%), and the chance accuracy (classification accuracy when randomly guessing) is 0.57. We first validate the effectiveness of all five classifiers.
We conduct 10-fold cross validation, report the mean accuracy and its standard deviation across the 10 folds, and report the same metrics for the false positive rates. As indicated in Table 1, all the classifiers have competitive performance. Under this ideal classification scenario, SVM with radial and linear kernels demonstrate the best performance, with accuracies of 98.8% and 98.6% respectively and false positive rates both near 2%; MBSS and 3NN have similar performance, with accuracies of 97.6% and 97.9% respectively and false positive rates of 3%; while LDA shows lesser performance, with an accuracy of 90% and a false positive rate of 6%. Figure 2 shows the receiver operating characteristic (ROC) curve of MBSS in the 10th-fold classification; the area under the curve is 0.99.

Classifier    | Mean ACC | Sd ACC | Mean FP | Sd FP | DR for OOS1 | DR for OOS2
MBSS          | 97.6%    | 0.002  | 3%      | 0.004 | 90.0%       | 55.3%
SVM (radial)  | 98.8%    | 0.002  | 1.8%    | 0.003 | 0           | 0.06%
SVM (linear)  | 98.6%    | 0.001  | 2%      | 0.003 | 90.8%       | 35.4%
3NN           | 97.9%    | 0.001  | 3%      | 0.004 | 68.4%       | NA
LDA           | 90.0%    | 0.003  | 6%      | 0.003 | 9.4%        | 33.8%

Table 1: Classification performance comparison for all five classifiers. All classifiers have competitive performance for in-sample testing. For OOS1, MBSS and SVM with linear kernel achieve the highest detection rates. For OOS2, MBSS performs significantly better than all other classifiers. Due to too many ties for 3NN, we do not report its result there.

Fig. 2: The receiver operating characteristic (ROC) curve of MBSS for in-sample classification. The area under the curve is 0.99.

5.3 Out-of-sample classification

Our next experiment is out-of-sample validation. For our out-of-sample classification experiments, we apply the five classifiers to test datasets consisting of all malicious samples, and thus our task is to detect the malicious samples.
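In outline, the 10-fold protocol used in Section 5.2 amounts to the following sketch, where `train_and_predict` is a placeholder for any of the five classifiers (the actual experiments run in R):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split sample indices into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, train_and_predict, k=10):
    """Mean accuracy over k folds.

    train_and_predict(train_x, train_y, test_x) -> predicted labels."""
    accs = []
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        te_x = [xs[i] for i in fold]
        te_y = [ys[i] for i in fold]
        pred = train_and_predict(tr_x, tr_y, te_x)
        accs.append(sum(p == t for p, t in zip(pred, te_y)) / len(fold))
    return sum(accs) / k
```

The per-fold accuracies collected in `accs` also yield the standard deviations reported in Table 1; the same loop with a false-positive count in place of accuracy produces the FP columns.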
We report the detection rate (DR), defined as the number of correctly classified malware APKs divided by the total number of out-of-sample test samples. After validating that these classifiers have high accuracy and low false positive rates in Section 5.2, we use them to test on incoming samples. Under this practical and realistic scenario, the test samples do not follow a distribution very similar to that of the training samples. In this case, classifiers with high in-sample accuracy degrade, sometimes even significantly, for out-of-sample testing.

Recall that the in-sample malicious behaviors include retrieving phone information and sending it to the network. We divide the out-of-sample experiments into two types. In the first, the test set shows a strong similarity to this malicious behavior, indicating that the distribution of the test set is similar to the distribution of a subset of the training data. In the second, the test set shows a weaker similarity to this malicious behavior, indicating that the distribution of the test set is even less similar to the distribution of the training data.

OOS1: Out-of-sample with malicious similarity to in-sample data. We first apply the five classifiers to a dataset of 12185 malicious APKs and report the detection rate. These out-of-sample APKs exhibit malicious behaviors similar to those in the training set, such as intercepting and sending messages without the user's consent. To visualize this similarity, we conduct principal component analysis (PCA) and inspect the scatter plots of the first four principal components (PCs). Figure 3 presents the scatter plots of (PC1, PC2), (PC1, PC3), (PC2, PC3), and (PC1, PC4). The in-sample benign files are plotted in green, the in-sample malicious files in red and the out-of-sample malicious files in magenta.
The (PC1, PC2) scatter plot indicates distributional dissimilarity between the first two principal components, as the out-of-sample data almost forms its own cluster. However, the (PC1, PC3), (PC2, PC3), and (PC1, PC4) scatter plots indicate that similarity still persists, as the out-of-sample data falls within the clusters of the in-sample data.

Fig. 3: OOS1: Scatter plots of principal components in OOS1. This out-of-sample set is similarly distributed to the training set. The in-sample benign files are plotted in green, the in-sample malicious files in red and the out-of-sample malicious files in magenta. The (PC1, PC2) scatter plot indicates distributional dissimilarity between the first two principal components, as the out-of-sample data almost forms its own cluster. The remaining PC scatter plots show that the out-of-sample data lies in the same embedding as the in-sample data.

The sixth column of Table 1 shows the detection rates of the five classifiers. Both SVM with linear kernel and MBSS have detection rates of about 0.9, while 3NN reaches 0.68 and LDA has a low detection rate of about 0.1. A dramatic performance degradation is seen for SVM with radial kernel, which went from the best-performing in-sample classifier to the worst, with a detection rate near 0. The significant degradation of LDA is due to its highly parametric nature. In this case, SVM with linear kernel and MBSS maintain relatively stable detection rates.

Next, we examine the detection rate as we vary the test size. We apply the classifiers to randomly selected {0.1%, 1%, 20%, 50%, 90%} subsets of the test data with independent Monte Carlo replications at {50, 30, 20, 10, 5, 1} respectively. Monte Carlo replications are used to control the variation and thus provide a better estimate of the accuracy. The detection rate is reported by averaging over the Monte Carlo replicates.
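The Monte Carlo subsampling protocol just described can be sketched as follows; `predict` is a placeholder classifier, and since every test sample is malicious, the detection rate is simply the fraction predicted malicious:

```python
import random

def mc_detection_rate(test_samples, predict, frac, n_reps, seed=0):
    """Average detection rate over n_reps random subsets of size frac * |test|.

    predict(sample) -> 1 (malicious) or 0 (benign); all test samples are malicious,
    so each replicate's rate is the fraction of the subset flagged malicious."""
    rng = random.Random(seed)
    size = max(1, int(frac * len(test_samples)))
    rates = []
    for _ in range(n_reps):
        subset = rng.sample(test_samples, size)
        rates.append(sum(predict(x) for x in subset) / size)
    return sum(rates) / n_reps
```

Smaller fractions get more replicates (e.g., 50 replicates at 0.1%) because a small subset yields a noisier per-replicate rate; averaging controls that variation.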
As seen in Figure 4, both MBSS and SVM with linear kernel are superior to the other classifiers. We note that the performance variation, as we increase the percentage of the test set, is negligible. This is because our data processing of binarizing the features results in a high number of replicates in the test data. Hence, the performance of the classifiers remains relatively stable here as we vary the test size.

Fig. 4: OOS1: Detection rate as we vary the test size. MBSS and SVM with linear kernel achieve the highest detection rate at 90%, while the rest of the classifiers degrade in performance.

OOS2: Out-of-sample with dissimilar distribution to in-sample data

Our next out-of-sample experiment applies the five classifiers to a dataset of 11986 malicious APKs, whose malicious behaviors primarily include stealing private information and sending it to the Internet through a commodity command and control (C&C) server, but do not include dynamic code loading. Compared to the training set, these APKs share similarities in malicious behavior but have their own characteristics. Indeed, these APKs are considered a malware family that is not in the training set.

This slight similarity in malicious behavior can be seen in the data distribution as reflected by the principal components obtained from principal component analysis. Figure 5 presents the scatter plots of (PC1, PC2), (PC1, PC3), (PC2, PC3), and (PC1, PC4). The in-sample benign files are plotted in green, the in-sample malicious files in red, and the out-of-sample malicious files in magenta. The (PC1, PC2)- and (PC2, PC3)-scatter plots indicate distributional dissimilarity, as the out-of-sample data almost forms its own cluster. However, the (PC1, PC3) and (PC1, PC4)-scatter plots indicate that there is still similarity, though not as strong, since the out-of-sample data falls within the clusters of the in-sample data.

Fig. 5: OOS2: Scatter plots of principal components in OOS2. This out-of-sample dataset is less similarly distributed to the training set. The in-sample benign files are plotted in green, the in-sample malicious files in red, and the out-of-sample malicious files in magenta. The (PC1, PC2)- and (PC2, PC3)-scatter plots indicate distributional dissimilarity, as the out-of-sample data almost forms its own cluster. However, the (PC1, PC3) and (PC1, PC4)-scatter plots indicate that there is still similarity, though not as strong, since the out-of-sample data falls within the same embedding as the in-sample data.

The rightmost column in Table 1 shows the detection performance of the five classifiers. All classifiers degrade in performance; however, SVM with linear kernel degrades significantly. The result of 3NN is omitted, because it fails due to too many ties. MBSS performs significantly better than the other classifiers.

Next, we vary the test size to examine the out-of-sample classification performance. We apply the classifiers to randomly selected {0.1%, 1%, 20%, 50%, 90%} fractions of the test data with independent Monte Carlo replications at {50, 30, 20, 10, 5, 1} respectively. Monte Carlo replications are used to control the variation and thus provide a better estimate of the accuracy. MBSS performs significantly better than the rest at all test sizes, as seen in Figure 6. The variation in the performance of these classifiers is negligible. Again, this is due to our data processing, which results in many duplicates in the test data.

6 Summary and Discussion

Automated malware detection on Android platforms has focused mainly on developing signature-based methods or machine learning methods on static data, which may not capture malicious behaviors exhibited only during runtime.
Furthermore, producing ground truth via manual labeling is costly, while a vast amount of unlabeled malware data already exists. The malware evolution process creates a huge engineering problem for anti-virus researchers, as they lack an efficient way to capture potentially new malicious files while pruning out clean files and known files. As a result, traditional supervised machine learning algorithms can degrade in performance.

Fig. 6: Detection rate as we vary the test size. All classifiers degrade in performance. MBSS still performs significantly better than the rest of the classifiers.

In this paper, we demonstrate the effectiveness of a model-based semi-supervised learning (MBSS) approach on dynamic Android behavior data for Android malware detection. We show that for in-sample testing, MBSS has competitive accuracy and false positive rate compared with the most popular malware classifiers. For out-of-sample testing, MBSS produces a significantly higher detection rate than the other classifiers in consideration. We are optimistic that the framework of semi-supervised learning for dynamic analysis is valuable and practical for anti-malware research.

6.1 Application to Malware Triage

Malware triage is a known problem for security researchers. As malware evolves, derivative malware families continuously generate new signatures that escape detection. This creates a huge engineering problem for anti-virus researchers, because they lack an efficient way to capture potentially new malicious files while pruning out clean files and known files. A more sophisticated tool to process massive amounts of samples is desired by cybersecurity companies. Such a tool should precisely triage samples into clusters and identify a subset of highly probable malware. Accordingly, security experts can focus on a smaller subset and spend their time more efficiently.
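The triage idea above rests on the core of MBSS: a Gaussian mixture fitted by expectation-maximization in which labeled APKs keep fixed class memberships while unlabeled APKs contribute posterior responsibilities. The following is a rough, hypothetical one-dimensional illustration of that idea using plain EM; it is not the paper's implementation, which relies on conditional EM and the R packages mclust/upclass:

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_gmm(labeled, unlabeled, iters=50):
    """Fit a two-component (0 = benign, 1 = malicious) 1-D Gaussian mixture.
    `labeled` is a list of (x, y) pairs; `unlabeled` is a list of x values."""
    mu, var, pi = [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]
    # Initialize each component's mean from its labeled points.
    for k in (0, 1):
        xs = [x for x, y in labeled if y == k]
        mu[k] = sum(xs) / len(xs)
    for _ in range(iters):
        # E-step: labeled points keep one-hot responsibilities;
        # unlabeled points get posterior responsibilities.
        resp = [(x, 1.0 - y, float(y)) for x, y in labeled]
        for x in unlabeled:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((x, p[0] / s, p[1] / s))
        # M-step: update weights, means, and variances.
        for k in (0, 1):
            wk = sum(r[1 + k] for r in resp)
            pi[k] = wk / len(resp)
            mu[k] = sum(r[0] * r[1 + k] for r in resp) / wk
            var[k] = max(sum(r[1 + k] * (r[0] - mu[k]) ** 2 for r in resp) / wk, 1e-6)
    return mu, var, pi

def posterior_class(x, mu, var, pi):
    """Assign a new sample to the component with the larger posterior."""
    p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
    return 1 if p[1] > p[0] else 0

# Synthetic data: few labeled samples, many unlabeled ones.
rng = random.Random(7)
labeled = [(rng.gauss(0, 1), 0) for _ in range(20)] + [(rng.gauss(5, 1), 1) for _ in range(20)]
unlabeled = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
mu, var, pi = semi_supervised_gmm(labeled, unlabeled)
```

In a triage setting, samples with a high posterior under the malicious component form the smaller subset passed to human analysts.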
One application of MBSS is triaging samples and identifying a subset of highly probable malware. With a limited number of known malware samples and many more unknown samples in the field, Android malware detection models using a semi-supervised machine learning approach can accurately identify the subset of samples that needs further investigation.

6.2 N-gram

There has been increasing interest in employing natural language processing techniques for malware detection. As a pre-processing step, an N-gram extracts features based on N concatenated terms. With N > 1, the feature size increases and potentially provides more discriminatory power in classification. On the other hand, we caution against the usage of N-grams, as they are not resilient to the analytical attacks encountered in adversarial machine learning [30], a field concerning the security of machine learning. When using (N > 1)-grams, one can break the N-gram feature patterns by injecting API calls and cause the machine learning algorithm to misclassify. Here we use the uni-gram, i.e., N = 1, as a robust mitigation for machine learning security.

6.3 Kernel Trick

In this study, we compare MBSS with SVM with linear kernel, SVM with radial kernel, 3NN, and LDA. Essentially, SVM employs kernels to represent data that is not separable in the current dimension by embedding it into a higher dimension, where it becomes separable. This is the so-called kernel trick. We believe that exploring other high-dimensional representations may also help enhance the out-of-sample detection rate for SVM.

6.4 Limitations of Our Framework

Currently, most detection methods for Android malware focus on developing signature-based techniques, and machine learning methods have mainly focused on static data. Static analysis provides a complete understanding of the code all at once and is relatively fast compared to dynamic analysis.
However, static analysis is not resilient to obfuscation, since a significant portion of static samples, as well as the main code within the static files, is usually encrypted. Simply disassembling the code statically does not expose the malicious behavior. On the other hand, dynamic behavior data records and captures the malware behavior during execution, and thus enables better malware detection. Nevertheless, dynamic analysis has limited coverage, since there is no guarantee of traversing all functions of an APK at run time. Compared to static analysis, dynamic analysis is also relatively slow.

The underlying assumption on the data distribution is a mixture of Gaussians. Though a mixture of Gaussians can accommodate data of different forms, and some non-Gaussian data can be approximated by a few Gaussian distributions [22], we expect that if the data deviates greatly from a mixture of Gaussian distributions, the performance may degrade. Another limitation concerns high-dimensional data: the number of parameters grows as the square of the feature dimension. In this case, dimension reduction methods such as principal component analysis are desired.

References

1. Android debug bridge. https://developer.android.com/studio/command-line/adb.html.
2. Droidbox. https://github.com/pjlantz/droidbox.
3. Droidmon. https://github.com/idanr1986/droidmon.
4. Gartner says five of top 10 worldwide mobile phone vendors increased sales in second quarter of 2016. http://www.gartner.com/newsroom/id/3415117. Egham, UK, August 19, 2016.
5. Google play served 65 billion downloads in 2015 alone. https://www.androidheadlines.com/2016/05/google-play-served-65-billion-downloads-2015-alone.html. May 18, 2016.
6. Mindmac/androideagleeye. https://github.com/MindMac/AndroidEagleEye.
7. Mobile threat report. https://www.mcafee.com/us/resources/reports/rp-mobile-threat-report-2016.pdf.
8.
Nokia malware report shows surge in mobile device infections in 2016. http://www.nokia.com/en_int/news/releases/2016/09/01/nokia-malware-report-shows-surge-in-mobile-device-infections-in-2016.
9. Number of android applications. http://web.archive.org/web/20170210051327/https:/www.appbrain.com/stats/number-of-android-apps. February 9, 2017.
10. rovo89/xposed. https://github.com/rovo89/xposed.
11. Smartphone os market share, 2016 q3. http://www.idc.com/promo/smartphone-market-share/os.
12. Yousra Aafer, Wenliang Du, and Heng Yin. Droidapiminer: Mining api-level features for robust malware detection in android. In International Conference on Security and Privacy in Communication Systems, pages 86–103. Springer, 2013.
13. Vitor Afonso, Antonio Bianchi, Yanick Fratantonio, Adam Doupé, Mario Polino, Paulo de Geus, Christopher Kruegel, and Giovanni Vigna. Going native: Using a large-scale analysis of android apps to create a practical native-code sandboxing policy. In Proceedings of the Annual Symposium on Network and Distributed System Security (NDSS), 2016.
14. Jeffrey D Banfield and Adrian E Raftery. Model-based gaussian and non-gaussian clustering. Biometrics, pages 803–821, 1993.
15. Halima Bensmail and Gilles Celeux. Regularized gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.
16. Gilles Celeux and Gérard Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793, 1995.
17. Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In AISTATS, pages 57–64, 2005.
18. Li Chen and Matthew Patton. Stochastic blockmodeling for online advertising. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
19. Li Chen, Cencheng Shen, Joshua T Vogelstein, and Carey E Priebe.
Robust vertex classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):578–590, 2016.
20. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
21. Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
22. Abhijit Dasgupta and Adrian E Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441):294–302, 1998.
23. Nema Dean, Thomas Brendan Murphy, and Gerard Downey. Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C (Applied Statistics), 55(1):1–14, 2006.
24. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
25. Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
26. Chris Fraley and Adrian E Raftery. How many clusters? which clustering method? answers via model-based cluster analysis. The Computer Journal, 41(8):578–588, 1998.
27. Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
28. Chris Fraley, Adrian E Raftery, et al. Model-based methods of classification: using the mclust software in chemometrics. Journal of Statistical Software, 18(6):1–13, 2007.
29. Chris Fraley, Adrian E. Raftery, Thomas Brendan Murphy, and Luca Scrucca. mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, 2012.
30.
Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.
31. Tony Jebara and Alex Pentland. Maximum conditional likelihood via bound maximization and the cem algorithm. In Proceedings of the 11th International Conference on Neural Information Processing Systems, pages 494–500. MIT Press, 1998.
32. Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
33. Long Lu, Zhichun Li, Zhenyu Wu, Wenke Lee, and Guofei Jiang. Chex: statically vetting android apps for component hijacking vulnerabilities. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS), pages 229–240. ACM, 2012.
34. Vince Lyzinski, Daniel L Sussman, Donniell E Fishkind, Henry Pao, Li Chen, Joshua T Vogelstein, Youngser Park, and Carey E Priebe. Spectral clustering for divide-and-conquer graph matching. Parallel Computing, 47:70–87, 2015.
35. Enrico Mariconti, Lucky Onwuzurike, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. Mamadroid: Detecting android malware by building markov chains of behavioral models. 2016.
36. Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.
37. Todd K Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
38. Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pages 849–856, 2001.
39. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.
40. Adrian E Raftery. Bayes factors and bic: Comment on a critique of the bayesian information criterion for model selection.
Sociological Methods & Research, 27(3):411–427, 1999.
41. Niamh Russell, Laura Cribbin, and Thomas Brendan Murphy. upclass: Updated Classification Methods using Unlabeled Data, 2014. R package version 2.0.
42. S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.
43. Jaebaek Seo, Daehyeok Kim, Donghyun Cho, Taesoo Kim, and Insik Shin. Flexdroid: Enforcing in-app privilege separation in android. In Proceedings of the 2016 Annual Network and Distributed System Security Symposium (NDSS), pages 1–53, 2016.
44. Asaf Shabtai, Yuval Fledel, Uri Kanonov, Yuval Elovici, Shlomi Dolev, and Chanan Glezer. Google android: A comprehensive security assessment. IEEE Security & Privacy, 8(2):35–44, 2010.
45. Michelle Y Wong and David Lie. Intellidroid: A targeted input generator for the dynamic analysis of android malware. In Proceedings of the Annual Symposium on Network and Distributed System Security (NDSS), 2016.
46. John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
47. Yinxing Xue, Guozhu Meng, Yang Liu, Tian Huat Tan, Hongxu Chen, Jun Sun, and Jie Zhang. Auditing anti-malware tools by evolving android malware and dynamic loading technique. IEEE Transactions on Information Forensics and Security, 2017.
