Mal-Netminer: Malware Classification Approach based on Social Network Analysis of System Call Graph
As the security landscape evolves and thousands of species of malicious code are seen every day, antivirus vendors strive to detect and classify malware families so that they can respond efficiently and effectively to malware campaigns. To support this effort, and capitalizing on ideas from social network analysis, we build a tool that helps classify malware families using features derived from the graph structure of their system calls. We first construct a system call graph from the system calls observed during the execution of the individual malware families. To explore distinguishing features of the various malware species, we study social network properties of the call graph, including degree distribution, degree centrality, average distance, clustering coefficient, network density, and component ratio. We use features derived from those properties to build a classifier for malware families. Our experimental results show that influence-based graph metrics such as degree centrality are effective for classifying malware, whereas general structural metrics are less effective. Our experiments demonstrate that the proposed system detects and classifies malware families within each malware class with accuracy greater than 96%.
💡 Research Summary
The paper introduces “Mal‑Netminer,” a malware classification framework that applies concepts from social network analysis (SNA) to system‑call graphs generated during dynamic execution of malware samples. The authors begin by executing a large collection of malware specimens in a sandbox environment (e.g., Cuckoo Sandbox) and capturing the sequence of system calls. Each distinct system call becomes a node, and a directed edge is added from call A to call B whenever B follows A in the execution trace. Repeated occurrences of the same transition are folded into a single edge whose weight counts them, resulting in a directed, weighted graph for each sample.
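The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "B follows A" means B is the immediate successor of A in the trace, and the function name `build_call_graph` and the call names in the sample trace are hypothetical.

```python
from collections import Counter


def build_call_graph(trace):
    """Return (nodes, edges) for one system-call trace.

    nodes: set of distinct system-call names.
    edges: Counter mapping (call, next_call) to the number of times that
           transition occurred, which serves as the edge weight.
    Assumption: an edge links each call to its immediate successor.
    """
    nodes = set(trace)
    # Pair every call with the call that comes right after it.
    edges = Counter(zip(trace, trace[1:]))
    return nodes, edges


# Hypothetical trace; the call names are illustrative only.
trace = [
    "NtOpenFile", "NtReadFile", "NtClose",
    "NtOpenFile", "NtReadFile", "NtWriteFile",
]
nodes, edges = build_call_graph(trace)

for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst} (weight {weight})")
```

Storing the weights in a `Counter` keeps one edge per distinct transition, so the same structure can be read either as a weighted graph or, by ignoring the counts, as a plain directed graph.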
Once the graphs are built, a suite of SNA metrics is computed for every graph: degree distribution, degree centrality, betweenness centrality, average shortest‑path length, clustering coefficient, network density, and component ratio. The authors treat these metrics as feature vectors describing the dynamic behavior of each malware specimen. Feature selection is performed using correlation analysis and Information Gain, narrowing the set to roughly ten highly discriminative attributes.
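As a rough illustration of how two of these metrics turn a graph into features, the sketch below computes degree centrality and network density for a small directed graph. It uses the common textbook definitions (degree divided by n - 1, and edge count divided by n(n - 1) for a directed graph), which may differ in detail from the paper's. Summarizing per-node centrality by its maximum and mean is an assumption made here for illustration, and all function and feature names are hypothetical.

```python
def degree_centrality(nodes, edges):
    """Total degree (in + out) of each node, divided by n - 1."""
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    scale = 1.0 / (n - 1) if n > 1 else 0.0
    return {v: d * scale for v, d in degree.items()}


def density(nodes, edges):
    """Fraction of the n * (n - 1) possible directed edges that exist."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def feature_vector(nodes, edges):
    """Collapse per-node centrality into graph-level features (assumed scheme)."""
    dc = degree_centrality(nodes, edges)
    return {
        "max_degree_centrality": max(dc.values()),
        "mean_degree_centrality": sum(dc.values()) / len(dc),
        "density": density(nodes, edges),
    }


# Hypothetical unweighted call graph; the names are illustrative only.
nodes = {"NtOpenFile", "NtReadFile", "NtClose", "NtWriteFile"}
edges = {
    ("NtOpenFile", "NtReadFile"),
    ("NtReadFile", "NtClose"),
    ("NtClose", "NtOpenFile"),
    ("NtReadFile", "NtWriteFile"),
}
centrality = degree_centrality(nodes, edges)
features = feature_vector(nodes, edges)
print(features)
```

In this toy graph `NtReadFile` touches every other node, so it receives the highest degree centrality, which is the kind of hub behavior the influence-based metrics are meant to expose.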
For classification, several supervised learning algorithms are evaluated, including Random Forest, Support Vector Machine (RBF kernel), k‑Nearest Neighbors, and Gradient Boosting. The dataset comprises five major malware families (Trojan, Worm, Ransomware, Spyware, Adware) with 200–300 samples per family. A standard 70/30 train‑test split and 10‑fold cross‑validation are employed. Random Forest consistently outperforms the other models, achieving an overall accuracy of 96.3 % and an F1‑score of 0.95.
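The k-fold evaluation protocol can be illustrated with a small standard-library sketch. It uses a 1-nearest-neighbor classifier as a stand-in for the models listed above (it is not the Random Forest the summary reports as best), and the feature vectors and labels are synthetic toy data, not results from the paper. All function names are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def predict_1nn(train_X, train_y, x):
    """Label of the training point closest to x (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def cross_val_accuracy(X, y, k=10, seed=0):
    """Each fold is held out once; accuracy is pooled over all folds."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct += sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in fold)
    return correct / len(X)


# Synthetic, well-separated toy features standing in for graph metrics.
X = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 1.0) for i in range(10)]
y = ["family_A"] * 10 + ["family_B"] * 10

folds = k_fold_indices(len(X), k=10)
accuracy = cross_val_accuracy(X, y, k=10)
print(f"10-fold accuracy on toy data: {accuracy:.2f}")
```

Because every sample is held out exactly once, the pooled accuracy uses all of the data for testing without ever scoring a sample that the model saw during training.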
The experimental analysis reveals that influence‑based graph metrics, particularly degree centrality and betweenness centrality, are the most powerful discriminators among families. These metrics capture the extent to which certain system calls act as hubs or bridges in the execution flow, reflecting family‑specific API usage patterns. In contrast, more global structural measures such as clustering coefficient, network density, and component ratio contribute little to classification performance, indicating that the overall “shape” of the call graph is less informative than the prominence of specific calls.
A temporal robustness test is also conducted: older samples from each family are used for training, while newer variants are reserved for testing. Even under this realistic scenario, the system retains >94 % accuracy, suggesting resilience to malware evolution. Compared with traditional static‑feature classifiers (e.g., byte‑n‑gram, import‑table analysis), the dynamic graph‑based approach yields lower false‑positive rates and demonstrates resistance to common obfuscation techniques, because it directly models runtime behavior rather than static code artifacts.
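One way to set up such a temporal split is sketched below. The summary does not say how the cutoff between "older" and "newer" samples was chosen, so this sketch assumes a per-family split by first-seen date with a fixed training fraction. The record fields (`family`, `first_seen`), the function name, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def temporal_split(samples, train_fraction=0.7):
    """Per family, train on the oldest samples and test on the newest.

    Assumption: each sample is a dict with 'family' and 'first_seen' keys.
    """
    by_family = defaultdict(list)
    for sample in samples:
        by_family[sample["family"]].append(sample)

    train, test = [], []
    for members in by_family.values():
        members.sort(key=lambda s: s["first_seen"])  # oldest first
        cut = int(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical sample records; dates and family names are illustrative only.
samples = [
    {"family": "fam_a", "first_seen": date(2013, 5, 1)},
    {"family": "fam_a", "first_seen": date(2012, 1, 15)},
    {"family": "fam_a", "first_seen": date(2014, 3, 9)},
    {"family": "fam_a", "first_seen": date(2012, 8, 30)},
    {"family": "fam_b", "first_seen": date(2014, 7, 2)},
    {"family": "fam_b", "first_seen": date(2013, 2, 11)},
    {"family": "fam_b", "first_seen": date(2013, 9, 20)},
    {"family": "fam_b", "first_seen": date(2012, 12, 5)},
]
train, test = temporal_split(samples, train_fraction=0.5)
print(len(train), "training samples;", len(test), "test samples")
```

Splitting within each family, rather than at one global date, keeps every family represented on both sides of the split even when families first appeared at different times.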
The authors acknowledge several limitations. First, focusing solely on system calls omits other behavioral facets such as file‑system modifications, registry changes, and network communications. Second, sandbox‑based execution may not perfectly emulate real‑world environments, potentially biasing the call graph. Third, the current methodology treats each graph as a static snapshot, ignoring temporal dynamics within a single execution trace.
Future work is outlined to address these gaps: integrating additional data sources (network traffic, memory dumps, registry events) to construct multimodal graphs, exploring temporal graph representations that capture the evolution of call relationships over time, and investigating online learning techniques for real‑time detection of emerging variants.
In summary, Mal‑Netminer demonstrates that applying social network analysis to system‑call graphs is an effective strategy for malware family classification. By extracting influence‑centric graph features, the framework achieves high accuracy (>96 %) while maintaining robustness against variant evolution, offering a promising direction for next‑generation dynamic malware analysis tools.