Generating Synthetic Android Malware Data with a Conditional GAN
📝 Abstract
The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low-quality data in malware detection.
📄 Content
This paper introduces MalSynGen (Malware Synthetic Data Generation), a comprehensive methodology and publicly available framework for generating and evaluating synthetic tabular data tailored for Android malware detection. We utilize a conditional Generative Adversarial Network (cGAN) model, inspired by [26], to generate synthetic data. Our evaluation framework assesses: (i) the fidelity of the generated synthetic data compared to real data, and (ii) the utility of the synthetic data in Android malware classification using a variety of classifiers. This work significantly expands upon our prior research by providing a more detailed methodology, a broader evaluation across diverse Android malware datasets, and a thorough analysis of computational resource consumption.
The key contributions of this expanded research are:
1) A cGAN model designed to generate synthetic tabular data that effectively supports Android malware classification.
2) A systematic methodology for training and evaluating the proposed cGAN model.
3) A comprehensive set of metrics for assessing both the fidelity and utility of the generated synthetic data in Android malware classification.
4) An extensive evaluation across multiple established Android malware datasets, demonstrating the generalization capabilities of our methodology.
The remainder of this paper is structured as follows: Section II reviews related work. Section III details the conceptual methodology and cGAN model. Section IV outlines the evaluation process, including implementation and deployment details. Section V presents and discusses the evaluation results. Finally, Section VI summarizes the main conclusions and outlines future research directions.
In Table I, we present the main related works in the context of tabular data augmentation. We list the techniques, metrics, domain, and datasets used in each work. As can be seen, GANs are frequently used to generate synthetic tabular data. However, other techniques are also used for data augmentation, such as large language models (LLMs) [27] and diffusion models [28], owing to their ability to capture and generate complex patterns.
Most solutions for tabular data seek to capture the particularities of a specific context, such as healthcare [29], demographics [21], and VBA malware [32]. Three solutions are specific to Android malware, and among them cGANs predominate [33], [35].
While prior solutions largely evaluate synthetic data utility through supervised learning model performance, they predominantly rely on standard binary classification metrics like precision, accuracy, recall, and F1-score. Furthermore, they typically utilize synthetic data in either the training or evaluation phase, but not both. This approach presents challenges, as high performance could stem from mere data replication rather than genuine novelty, while entirely novel data might yield poor classification if lacking inherent structure.
To overcome these limitations, we expand upon existing metrics by proposing two distinct categories: utility and fidelity. Utility metrics align with those commonly used in related works, whereas fidelity metrics, as emphasized in recent research [37], [38], are specifically designed for assessing generative models and synthetic data quality. Additionally, we integrate synthetic data into complementary evaluation methodologies to enhance robustness. Unlike previous works, which primarily use the Training on Synthetic, Testing on Real (TSTR) method alone, we adopt a dual approach that combines TSTR with Training on Real, Testing on Synthetic (TRTS), so that synthetic data is exercised in both the training and the evaluation phase.
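The dual TSTR/TRTS protocol can be sketched as follows. This is a minimal illustration, not the MalSynGen implementation: the nearest-centroid classifier is a toy stand-in for the paper's classifiers, accuracy stands in for the full set of utility metrics, and all function names (`fit_centroids`, `predict`, `accuracy`, `tstr_trts`) are hypothetical.

```python
def fit_centroids(rows):
    """Toy classifier (assumption): store one mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for features, label in rows if predict(centroids, features) == label)
    return hits / len(rows)


def tstr_trts(real_rows, synthetic_rows):
    """Evaluate in both directions.

    TSTR: train on synthetic data, test on real data.
    TRTS: train on real data, test on synthetic data.
    Each row is a (feature_list, label) pair.
    """
    return {
        "TSTR": accuracy(fit_centroids(synthetic_rows), real_rows),
        "TRTS": accuracy(fit_centroids(real_rows), synthetic_rows),
    }
```

Calling `tstr_trts(real_rows, synthetic_rows)` returns one score per direction, so a large gap between the two is visible at a glance. In practice the toy classifier would be replaced by the classifiers under study and accuracy by precision, recall, and F1-score.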
This section details the conceptual components of the MalSynGen framework, encompassing both the overall methodology and the underlying cGAN model. We begin by outlining the methodological process and subsequently describe the generative model.
Figure 1 illustrates the execution flow of the MalSynGen framework. The flow consists of three main steps: selection and manipulation of the original dataset; training of the classifiers and of the conditional Generative Adversarial Network (cGAN); and evaluation of results.
In the first step, selection, we choose a real dataset and balance its two classes (benign and malicious) by subsampling the larger class down to the size of the class with fewer samples. We then prepare the dataset for k-fold cross-validation: the balanced real dataset is divided into k equally sized subsets and, at each iteration, one subset serves as the evaluation set (Dataset r) while the remaining k-1 subsets are used for training (Dataset R).
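The selection step described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the framework's code: the function names are hypothetical, subsampling is done uniformly at random, and the folds are not stratified (the source does not say whether they are). When the number of rows is not divisible by k, fold sizes differ by at most one.

```python
import random


def balance_by_undersampling(samples, labels, seed=0):
    """Randomly drop rows from the larger class until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(samples, labels):
        by_class.setdefault(label, []).append(features)
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for label, rows in by_class.items():
        for features in rng.sample(rows, n_min):
            balanced.append((features, label))
    rng.shuffle(balanced)  # avoid folds that contain a single class by construction
    return balanced


def k_fold_splits(rows, k):
    """Yield (train, evaluation) pairs; each of the k folds is the evaluation set once.

    In the text's notation, `train` plays the role of Dataset R
    and `evaluation` the role of Dataset r.
    """
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        evaluation = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, evaluation
```

For example, `for train, evaluation in k_fold_splits(balance_by_undersampling(X, y), k=5): ...` iterates over five train/evaluation pairs, where `X` and `y` are placeholder names for the feature rows and labels of the chosen dataset.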
In the training step, the framework receives as input the cGAN training hyperparameters and the hyperparameters of the cla