Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

Reading time: 4 minutes

📝 Abstract

Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.

📄 Content

Synthetic data generation has become an increasingly relevant strategy in cybersecurity [1], [2], [3], particularly as a way to mitigate the scarcity of real, complete, and high-quality datasets that limit the performance and generalization of machine learning models. Despite these advances, assessing the quality of synthetic data remains a complex and largely non-standardized methodological challenge [4], with no clear consensus on which metrics should be used or how to combine them consistently.

The literature reports a significant fragmentation in the application of fidelity metrics, with studies identifying more than 65 distinct indicators used independently to assess fidelity [5]. This diversity hinders model-to-model comparison, reduces experimental reproducibility, and complicates the integrated interpretation of data quality. Tools such as the Synthetic Data Vault (SDV) 1 , which implements Copula, TVAE, and CTGAN [6]; YData Synthetic2 , which offers multiple variations of GANs; and Gretel Synthetics3 , which uses models such as DGAN, DPGAN, and ACTGAN, attempt to consolidate generation and evaluation processes. Additionally, initiatives such as [7] demonstrate the application of HMMs for time-series generation in the healthcare domain. However, these platforms exhibit limitations related to flexibility, restricted customization capabilities, and relatively small sets of pre-implemented algorithms.

To overcome these limitations, this work enhances the MalDataGen framework [8], a modular open-source platform designed for the generation of synthetic tabular data, through the integration of a generalizable Super-Metric developed to unify fidelity evaluation [5]. The Super-Metric aggregates eight metrics distributed across four fundamental dimensions (Distance, Correlation/Association, Feature Similarity, and Multivariate Distribution), producing a single weighted score that reduces the variability and inconsistency observed in evaluations based on isolated metrics.

The central contribution of this work is transforming MalDataGen from a generation tool into a complete ecosystem for multidimensional generation and evaluation of synthetic data aimed at Android malware detection. By incorporating the Super-Metric, the framework provides a more robust, consistent, and contextualized evaluation, making it more suitable for critical cybersecurity applications.

MalDataGen is a modular and open-source framework designed to systematically and reproducibly orchestrate the generation and evaluation of synthetic tabular data in the context of Android malware detection. Its goal is to provide a unified platform that enables the comparison of different generative models under the same experimental methodology, reducing implementation bias and ensuring consistency across executions.

The framework’s architecture is organized into three main components: (i) input and preprocessing, responsible for standardizing datasets, normalizing attributes, and preparing data for the generative models; (ii) the generative layer, which integrates multiple families of models capable of synthesizing tabular datasets with varying levels of complexity; and (iii) the evaluation layer, which computes traditional fidelity and utility metrics, as well as the Super-Metric integrated in this work.
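The three-layer flow above can be sketched as a minimal pipeline. This is an illustrative skeleton only: the function names, the naive Gaussian sampler standing in for the generative layer, and the toy evaluation score are all hypothetical, not MalDataGen's actual API.

```python
import numpy as np

def preprocess(X):
    """Input layer sketch: min-max normalize each attribute to [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)

def generate(X_real, n_samples, rng):
    """Generative layer placeholder: a naive per-feature Gaussian sampler."""
    mu, sigma = X_real.mean(axis=0), X_real.std(axis=0)
    return rng.normal(mu, sigma, size=(n_samples, X_real.shape[1]))

def evaluate(X_real, X_synth):
    """Evaluation layer placeholder: mean absolute gap between column means."""
    return float(np.abs(X_real.mean(axis=0) - X_synth.mean(axis=0)).mean())

rng = np.random.default_rng(0)
X_real = rng.random((100, 5))        # stand-in for a preprocessed dataset
X_prep = preprocess(X_real)
X_synth = generate(X_prep, 100, rng)
score = evaluate(X_prep, X_synth)
```

A real run would swap the placeholder sampler for one of the framework's generative models and the toy score for its fidelity and utility metrics; the point is only that each layer consumes the previous layer's output under one orchestrated workflow.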

A central pillar of MalDataGen is its function as a flexible benchmark, enabling different generation paradigms to be evaluated under identical conditions. To support this, the generative layer includes four groups of models, among them adversarial approaches. Figure 1 presents a diagram illustrating the workflow across these components. The metrics module in MalDataGen organizes the evaluation process into three categories: binary metrics (e.g., precision, recall, F1-score), distance metrics (e.g., Euclidean and Hellinger), and probabilistic metrics (e.g., AUC-ROC). The internal infrastructure standardizes result storage by evaluation strategy, classifier, and fold, ensuring traceability and comparability across experiments.

The Super-Metric, integrated as a composite metric within the distance module, extends the evaluation system by providing a consolidated multidimensional analysis. It combines eight metrics distributed across four fundamental dimensions (distance, association, feature similarity, and multivariate distribution) and produces a single weighted final score. Its integration occurs transparently within the framework’s internal workflow, using the same routines and data structures as conventional metrics.
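Conceptually, the aggregation reduces to a weighted sum over per-dimension scores. The sketch below is illustrative only: the dimension scores and the uniform weights are made-up values, not the published formulation or its calibrated weights.

```python
# Hypothetical per-dimension fidelity scores in [0, 1] (made-up values).
dimension_scores = {
    "distance": 0.82,            # e.g., averaged Euclidean/Hellinger scores
    "association": 0.76,         # e.g., correlation-structure similarity
    "feature_similarity": 0.88,  # e.g., per-feature statistic agreement
    "multivariate": 0.71,        # e.g., joint-distribution similarity
}

# Hypothetical weights; uniform here, but the actual scheme may differ.
weights = {
    "distance": 0.25,
    "association": 0.25,
    "feature_similarity": 0.25,
    "multivariate": 0.25,
}

super_metric = sum(weights[d] * s for d, s in dimension_scores.items())
print(round(super_metric, 4))  # 0.7925
```

Collapsing the four dimensions into one weighted score is what dampens the swings of any single metric: a model that happens to score well on one distance measure but poorly on correlation structure cannot dominate the final ranking.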

With this integration, MalDataGen evolves from a data generation tool into a complete ecosystem for generating, evaluating, and rigorously comparing generative models applied to the Android malware domain. This enables more consistent, stable, and comparable analyses, contributing to reproducible and methodologically sound experiments in cybersecurity.

Evaluating the quality of synthetic data remains a central challenge in the generation of tabular data.

This content is AI-processed based on ArXiv data.
