Improving the detection accuracy of unknown malware by partitioning the executables in groups

📝 Original Info

  • Title: Improving the detection accuracy of unknown malware by partitioning the executables in groups
  • ArXiv ID: 1606.06909
  • Date: 2018-09-18
  • Authors: Ashu Sharma, Sanjay K. Sahay and Abhishek Kumar

📝 Abstract

Detecting unknown malware with high accuracy is always a challenging task. Therefore, in this paper, we study the classification of unknown malware by two methods. In the first/regular method, similar to the approaches of other authors [17][16][20], we select the features by taking the whole dataset as one group; in the second method, we select the features after partitioning the dataset into groups by file size, in ranges of 5 KB. We find that the second method detects malware with ~8.7% higher accuracy than the first/regular method.
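
The partitioning idea in the abstract can be illustrated with a minimal sketch. It assumes that 1 KB is 1024 bytes and that the ranges are half-open ([0, 5 KB), [5 KB, 10 KB), and so on); the abstract fixes neither detail, and the names `size_group` and `partition_by_size` are hypothetical.

```python
from collections import defaultdict

GROUP_WIDTH_KB = 5  # width of each file-size range, as stated in the abstract


def size_group(size_bytes: int, width_kb: int = GROUP_WIDTH_KB) -> int:
    """Return the index of the size range a file falls into.

    Assumption: 1 KB = 1024 bytes and ranges are half-open,
    so index 0 covers [0, 5 KB), index 1 covers [5 KB, 10 KB), etc.
    """
    return size_bytes // (width_kb * 1024)


def partition_by_size(files):
    """Group (name, size_bytes) pairs by their 5 KB size range."""
    groups = defaultdict(list)
    for name, size_bytes in files:
        groups[size_group(size_bytes)].append(name)
    return dict(groups)


# Hypothetical file names and sizes, for illustration only.
example = [("a.exe", 1000), ("b.exe", 6000), ("c.exe", 4000)]
groups = partition_by_size(example)
```

Under this scheme, feature selection would then be run once per group rather than once over the whole dataset.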

📄 Full Content

Malware is continuously evolving and is a big threat to the leading Windows and Android computing platforms [3]. The attacks/threats are not limited to the individual level; there are also state-sponsored, highly skilled hackers developing customized malware [25], [7], [10]. Malware is generally classified into first and second generations. In the first generation, the structure of the malware remains constant, while in the second generation the structure changes in every new variant while the action stays the same [12]. On the basis of how the variants are generated, second-generation malware is further classified into Encrypted, Oligomorphic, Polymorphic and Metamorphic malware [24].

It is an indisputable fact that the long-standing traditional, technology-centric approach (signature matching) to combating threats/attacks is ineffective at detecting second-generation customized malware. If adequate measures are not taken in time, the consequences of the scale at which malware is developed (more than 317 million new malware samples were reported [1] in the year 2014) will be devastating. Second-generation malware is very effective and not easy to detect. Recently, McAfee reported a new malware that is capable of infecting hard drive and solid state storage device (SSD) firmware, and the infection cannot be removed even by formatting the devices or reinstalling the operating system [2]. Therefore, both academia and anti-malware developers need to work continually to combat the threats/attacks from evolving malware. The most popular techniques used for the detection of malware are signature matching, heuristics-based detection, malware normalization, machine learning, etc. [24].

In recent years, many authors have studied machine learning techniques and proposed different approaches [4] [11] [19] that can supplement the traditional anti-malware system (signature matching).

Hence, in this paper, for the detection of unknown malware with high accuracy, we present a static malware analysis method that detects unknown malware with ~8.7% higher accuracy than the regular method. The paper is organized as follows. In the next section, related work is discussed. In section 3 we discuss the data preprocessing and feature selection. Section 4 contains a brief description of the Naive Bayes classifier. In section 5 we discuss the method to improve the detection accuracy and the results. Finally, section 6 contains the conclusion and future direction of the paper.

To combat the threats/attacks from second-generation malware, Schultz et al. (2001) were the first to introduce the concept of data mining to classify malware [23]. In 2005, Karim et al. [14] addressed the tracking of malware evolution based on opcode sequences and permutations. In 2006, O. Henchiri et al. [13] reported 37.17% detection accuracy from a hierarchical feature extraction algorithm using an NB (Naive Bayes) classifier. Bilar (2007) used a small dataset to examine the difference in opcode frequency distribution between malicious and benign code [8] and observed that some opcodes seem to be stronger predictors of frequency variation. In 2008, Yanfang Ye et al. [29] applied association rules to API execution sequences and reported an accuracy of 83.86% with an NB classifier. In 2008, Tian et al. [27] classified Trojans using function length frequency and showed that function length, along with its frequency, is significant in identifying the malware family and can be combined with other features for fast and scalable malware classification. Moskovitch et al. (2008) studied many classifiers, viz. NB, BNB, SVM, BDT, DT and ANN, on byte-sequence n-grams (n = 3, 4, 5 or 6) and found that the NB classifier detects malware with 84.53% accuracy [17]. In 2009 S. Momina Tabish [26] 2013) proposed a classification of malware families based on N-grams sequential pattern features [15]. They used DT, ANN and SVM classifiers and obtained good accuracy. In 2013, Santos et al. [22] used term frequency for modelling different classifiers and, among the studied classifiers, SVM performed best for opcode sequence lengths of one and two. Recently (2014), Zahra Salehi et al. constructed a feature set by extracting the API calls used in the executables for the classification of malware [21].
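
Several of the works above build features from byte-sequence n-grams. As a rough illustration of that feature type only, and not a reproduction of any cited author's pipeline, the sketch below counts overlapping n-grams in a byte string. The function name `byte_ngrams` and the sample bytes are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    # range() is empty when len(data) < n, so short inputs yield no n-grams.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Hypothetical five-byte input, for illustration only.
sample = b"\x55\x48\x89\x55\x48"
bigrams = byte_ngrams(sample, 2)
```

The resulting counts could serve as a feature vector for a classifier; the cited studies differ in how they select and weight such features.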

For the experimental analysis, we downloaded 11088 malware samples from malacia-project [18] and collected 4006 benign programs (also verified with virustotal.com [9]) from different systems. In the collected dataset we found that 97.18% of the malware samples are below 500 KB (fig. 1); hence, for the classification, we took the part of the dataset that is below 500 KB. For classification, the features are the opcodes of the executables, obtained with the objdump utility available on Linux systems, and the selection procedure is given in algo. 1. To test our methods we select the popular Naive Bayes classifier, which can handle an arbitrary number of independent variables and is briefly describe
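
The opcode features mentioned above can be sketched as a simple count over disassembly text. This is a minimal illustration, not the paper's algo. 1, which is not reproduced here. It assumes the tab-separated layout that GNU `objdump -d` typically prints (address, raw bytes, then mnemonic and operands); the sample listing and the names `count_opcodes` and `relative_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical listing in the assumed layout:
# "  address:<TAB>raw bytes<TAB>mnemonic operands"
SAMPLE_DISASSEMBLY = (
    "0000000000401000 <_start>:\n"
    "  401000:\t55                   \tpush   %rbp\n"
    "  401001:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "  401004:\tb8 00 00 00 00       \tmov    $0x0,%eax\n"
    "  401009:\tc3                   \tret\n"
)


def count_opcodes(disassembly: str) -> Counter:
    """Count instruction mnemonics in objdump-style disassembly text."""
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Instruction lines have three tab-separated fields; symbol headers
        # and byte-continuation lines have fewer and are skipped.
        if len(fields) < 3 or not fields[2].strip():
            continue
        # The first token is the mnemonic (an instruction prefix such as
        # "rep" would be counted as its own token in this simple sketch).
        counts[fields[2].split()[0]] += 1
    return counts


def relative_frequencies(counts: Counter) -> dict:
    """Normalise raw opcode counts so files of different sizes are comparable."""
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


opcode_counts = count_opcodes(SAMPLE_DISASSEMBLY)
opcode_freqs = relative_frequencies(opcode_counts)
```

In practice the text would come from running objdump on each executable below the 500 KB cutoff; a fixed string is used here so the sketch runs without external tools.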

Reference

This content is AI-processed based on open access ArXiv data.
