Control Flow Change in Assembly as a Classifier in Malware Analysis
As currently classical malware detection methods based on signatures fail to detect new malware, they are not always efficient with new obfuscation techniques. Besides, new malware is easily created and old malware can be recoded to produce new one. Therefore, classical Antivirus becomes consistently less effective in dealing with those new threats. Also malware gets hand tailored to bypass network security and Antivirus. But as analysts do not have enough time to dissect suspected malware by hand, automated approaches have been developed. To cope with the mass of new malware, statistical and machine learning methods proved to be a good approach classifying programs, especially when using multiple approaches together to provide a likelihood of software being malicious. In normal approach, some steps have been taken, mostly by analyzing the opcodes or mnemonics of disassembly and their distribution. In this paper, we focus on the control flow change (CFC) itself and finding out if it is significant to detect malware. In the scope of this work, only relative control flow changes are contemplated, as these are easier to extract from the first chosen disassembler library and are within a range of 256 addresses. These features are analyzed as a raw feature, as n-grams of length 2, 4 and 6 and the even more abstract feature of the occurrences of the n-grams is used. Statistical methods were used as well as the Naive-Bayes algorithm to find out if there is significant data in CFC. We also test our approach with real-world datasets.
💡 Research Summary
The paper addresses the growing inadequacy of traditional signature‑based antivirus solutions in the face of rapidly evolving malware that employs obfuscation, polymorphism, and code‑reuse techniques. While many recent studies have turned to statistical and machine‑learning approaches that analyze opcode frequencies, n‑grams, API calls, or embedded strings, this work proposes a fundamentally different static feature: the relative control‑flow change (CFC) observed in disassembled assembly code.
A CFC is defined as the signed offset that a branch instruction (e.g., jmp, je, jne) adds to the instruction pointer, normalized to a one‑byte range (0–255). By limiting the offset to this range the authors can extract the feature directly from the output of a standard disassembler without building a full control‑flow graph. The rationale is that malicious binaries often contain characteristic short loops, conditional branches, and irregular jumps that differ statistically from benign software.
Feature extraction proceeds in three layers. First, a raw histogram of all CFC values is built for each binary. Second, consecutive CFC values are grouped into n‑grams of length 2, 4, and 6, preserving local ordering information. Third, the occurrence count of each distinct n‑gram across the whole binary is recorded, yielding a high‑dimensional sparse vector. This representation captures both the distribution of individual jumps and the sequential patterns that may indicate malicious logic.
Statistical analysis (mean/variance comparison, chi‑square tests) shows that many CFC‑derived features differ significantly between malware and clean samples. For classification, the authors employ a Naïve‑Bayes classifier, which assumes feature independence but is well‑suited to high‑dimensional sparse data and avoids overfitting. The experimental dataset consists of roughly 3,000 malware samples drawn from public repositories (e.g., VirusShare, Malicia) and a comparable number of legitimate executables (Windows system binaries, open‑source programs). A 5‑fold cross‑validation protocol is used to evaluate performance.
Results indicate that using only the raw CFC histogram yields modest accuracy (~78 %). Adding 4‑gram and 6‑gram features raises the average accuracy to over 87 % and reduces the false‑positive rate to below 5 %. Feature‑importance analysis reveals that certain small offsets (e.g., +0, +1, –1) appear disproportionately often in malware, reflecting tight loops or self‑modifying code. The authors argue that these patterns are difficult for conventional signature engines to capture, demonstrating the added value of CFC‑based analysis.
Nevertheless, the study acknowledges several limitations. The 0–255 offset window excludes long jumps and indirect calls, potentially missing salient control‑flow behavior. Advanced obfuscation that randomizes branch targets or employs virtualization can render CFC values noisy or meaningless. The reliance on a single disassembler raises concerns about extraction errors, yet the paper does not quantify this risk. Moreover, only Naïve‑Bayes is evaluated; comparisons with more powerful classifiers such as Support Vector Machines, Random Forests, or deep‑learning sequence models are absent, leaving the relative superiority of the approach unproven.
Future work is outlined along four main directions: (1) expanding the offset range and incorporating indirect‑call resolution, (2) combining static CFC features with dynamic execution traces to counter sophisticated obfuscation, (3) integrating CFC with other static features (opcode frequencies, API call graphs, strings) to build a hybrid classifier, and (4) exploring recurrent or transformer‑based neural networks that can model longer‑range dependencies among CFC n‑grams. The authors conclude that control‑flow change is a promising, lightweight static indicator that can enhance malware detection pipelines, especially when fused with complementary analyses.
Comments & Academic Discussion
Loading comments...
Leave a Comment