Control Flow Change in Assembly as a Classifier in Malware Analysis
📝 Abstract
Classical signature-based malware detection methods often fail to detect new malware and are not always effective against new obfuscation techniques. Moreover, new malware is easily created, and old malware can be recoded to produce new variants. Classical antivirus software therefore becomes steadily less effective against these new threats. Malware is also hand-tailored to bypass network security and antivirus products. Because analysts do not have enough time to dissect suspected malware by hand, automated approaches have been developed. To cope with the mass of new malware, statistical and machine learning methods have proved to be a good approach to classifying programs, especially when several methods are combined to estimate the likelihood that a piece of software is malicious. Existing approaches mostly analyze the opcodes or mnemonics of the disassembly and their distribution. In this paper, we focus on the control flow change (CFC) itself and investigate whether it is significant for detecting malware. In the scope of this work, only relative control flow changes are considered, as these are easier to extract with the disassembler library chosen first and lie within a range of 256 addresses. These changes are analyzed as a raw feature and as n-grams of length 2, 4 and 6; the more abstract feature of the occurrences of the n-grams is also used. Statistical methods and the Naive Bayes algorithm were used to find out whether CFC carries significant information. We also test our approach with real-world datasets.
📄 Content
Control Flow Change in Assembly as a Classifier in Malware Analysis
Andree Linke
School of Computer Science
University College Dublin
Ireland
andree.linkee@ucdconnect.ie
Nhien-An Le-Khac
School of Computer Science
University College Dublin
Ireland
an.lekhac@ucd.ie
Keywords— Malware analysis, Control flow change, Naïve-
Bayes analysis, n-gram signatures
I. INTRODUCTION
The world of computer crime is constantly expanding. As new
technology continually enters our lives, the opportunities to
make money by exploiting its vulnerabilities grow at the same
rate. At the same time, classical antivirus (AV) products seem
to fail against newly coded malware [1], which incorporates
rootkit technologies and is encoded to subvert AV products.
Classical AV relies heavily on file signatures. Providing them
is a reactive process: a malware sample must first be found, a
signature is then created (for example by hashing or extracting
byte sequences), and the signature is finally pushed to
file/system scanners. For institutions such as the police or the
military, this approach is no longer feasible, as attackers have
become more proficient and better equipped, and these
institutions face a constant stream of sophisticated attacks.
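To illustrate why hash-based signatures are purely reactive, the following minimal sketch builds a signature from a known sample and shows that a trivially modified variant no longer matches. The function names and the choice of SHA-256 are assumptions made for this illustration only; the text does not specify how any particular AV product computes its signatures.

```python
import hashlib


def signature(sample: bytes) -> str:
    """Hypothetical file signature: the SHA-256 digest of the raw bytes."""
    return hashlib.sha256(sample).hexdigest()


def is_known_malware(sample: bytes, known_signatures: set) -> bool:
    """Exact-match lookup, as a signature-based scanner might do."""
    return signature(sample) in known_signatures


# A sample that has already been found and analysed (made-up bytes).
original = b"\x90\x90\xeb\xfe"
known = {signature(original)}

# The same code with one extra NOP prepended: functionally equivalent,
# but its hash differs, so the exact-match lookup misses it.
variant = b"\x90" + original
```

Under these assumptions, `is_known_malware(original, known)` succeeds while `is_known_malware(variant, known)` does not, which is the weakness that recoded malware exploits.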
Therefore, new automated methods of discerning between
wanted software (so-called “goodware”) and unwanted
software (“malware”) ought to be explored to combat the stream
of malware. Interesting approaches have been taken in the past
and have led to systems for the automatic detection and
categorization of malware, such as sandboxes or intrusion
prevention systems. More recent approaches use statistical
analysis [2] or machine learning [3] to find discriminators for
categorization. The analysis of microprocessor operation codes
(opcodes) has been the subject of some research, and some
approaches have been suggested for analysing the control flow;
in this paper we focus on the relative change of control flow in
static disassembly. This approach has not yet been proposed in
the literature, so our work aims to test whether control flow
change can be used to differentiate between goodware and
malware. The precondition for our approach is that the
software in question is not packed, encrypted or encoded.
Software unpacking, decryption or decoding is beyond the
scope of this work; however, simple steps to sort out such
samples have been taken.
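The abstract states that relative control flow changes are analysed as n-grams of length 2, 4 and 6, together with the occurrences of those n-grams. The sketch below shows one way such features could be built from a sequence of relative jump offsets. It assumes that the offsets have already been extracted by a disassembler, that they are signed values in a 256-address range, and that n-grams are formed with a sliding window of step 1; none of these details are confirmed by the text, and the function names are hypothetical.

```python
from collections import Counter


def cfc_ngrams(offsets, n):
    """Sliding-window n-grams (step 1) over a list of relative offsets."""
    return [tuple(offsets[i:i + n]) for i in range(len(offsets) - n + 1)]


def ngram_occurrences(offsets, lengths=(2, 4, 6)):
    """Count how often each n-gram occurs, for every requested length."""
    return {n: Counter(cfc_ngrams(offsets, n)) for n in lengths}


# Made-up relative control flow changes for one sample.
offsets = [2, -5, 2, -5, 16, 2]
counts = ngram_occurrences(offsets)
```

With this made-up input, the 2-gram `(2, -5)` occurs twice, and the six offsets form exactly one 6-gram. The resulting counters could then serve as input to a statistical test or a Naive Bayes classifier.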
The rest of this paper is organised as follows: Section 2
presents the background of our research and related work in
this area. We present our approach in Section 3. We describe
and analyse the results in Section 4. Finally, we conclude and
discuss future work in Section 5.
II. BACKGROUND
A. Windows PE files
The PE file format is the main format of Microsoft
Windows executable files, dynamic link libraries and object
code. It contains all the information that the program loader
of the Windows operating system needs to build the process
object, the memory layout and the required library call
structures. It is derived from the Unix COFF file format. The
architectures supported by the PE file format are IA-32, IA-64,
x86-64 and ARM. This work focuses on the IA-32 architecture.
The full documentation of the PE file format can be found in
Microsoft's “Microsoft PE and COFF Specification” [4]. The
code of the executable can be extracted from the sections part
of the PE file in r
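As a rough illustration of how code can be located through the section table of a PE file, the following sketch parses the headers with Python's standard `struct` module and returns the raw bytes of every section flagged as executable. It is not the extraction method used in this work, which the text does not describe in detail. The field layout and constants follow the Microsoft PE and COFF Specification; the function name is hypothetical, and the sketch only accepts IA-32 images and performs no bounds checking.

```python
import struct

IMAGE_FILE_MACHINE_I386 = 0x014C   # IA-32 machine type
IMAGE_SCN_MEM_EXECUTE = 0x20000000  # section can be executed as code


def executable_sections(data: bytes):
    """Return (name, raw_bytes) for each executable section of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # COFF file header (20 bytes) directly follows the signature.
    machine, n_sections, _, _, _, opt_size, _ = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    if machine != IMAGE_FILE_MACHINE_I386:
        raise ValueError("not an IA-32 image")
    # The section table starts after the optional header.
    table = pe_offset + 4 + 20 + opt_size
    result = []
    for i in range(n_sections):
        # Each section header is 40 bytes long.
        entry = struct.unpack_from("<8sIIIIIIHHI", data, table + 40 * i)
        name, _vsize, _vaddr, raw_size, raw_ptr = entry[:5]
        characteristics = entry[9]
        if characteristics & IMAGE_SCN_MEM_EXECUTE:
            label = name.rstrip(b"\0").decode("ascii", "replace")
            result.append((label, data[raw_ptr:raw_ptr + raw_size]))
    return result
```

For a typical executable, calling `executable_sections(open(path, "rb").read())` would be expected to return the `.text` section among its results; the returned bytes could then be passed to a disassembler.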