QUT-DV25: A Dataset for Dynamic Analysis of Next-Gen Software Supply Chain Attacks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Securing software supply chains is a growing challenge due to the inadequacy of existing datasets in capturing the complexity of next-gen attacks, such as multiphase malware execution, remote access activation, and dynamic payload generation. Existing datasets, which rely on metadata inspection and static code analysis, are inadequate for detecting such attacks. This creates a critical gap because these datasets do not capture what happens during and after a package is installed. To address this gap, we present QUT-DV25, a dynamic analysis dataset specifically designed to support and advance research on detecting and mitigating supply chain attacks within the Python Package Index (PyPI) ecosystem. This dataset captures install-time and post-install-time traces from 14,271 Python packages, of which 7,127 are malicious. The packages are executed in an isolated sandbox environment using extended Berkeley Packet Filter (eBPF) kernel and user-level probes. The dataset captures 36 real-time features, including system calls, network traffic, resource usage, directory access patterns, dependency logs, and installation behaviors, enabling the study of next-gen attack vectors. ML analysis using the QUT-DV25 dataset identified four malicious PyPI packages previously labeled as benign, each with thousands of downloads. These packages deployed covert remote access and multi-phase payloads, were reported to PyPI maintainers, and were subsequently removed. This highlights the practical value of QUT-DV25, as it outperforms reactive, metadata, and static datasets, offering a robust foundation for developing and benchmarking advanced threat detection within the evolving software supply chain ecosystem.


💡 Research Summary

The paper addresses a critical gap in the security of open‑source software supply chains, specifically within the Python Package Index (PyPI). Existing benchmarks—metadata‑only datasets such as the PyPI Malware Registry, static code‑analysis collections like DataDog, and hybrid combinations—are unable to capture the dynamic behaviors that occur during package installation and after deployment. As a result, modern multi‑stage attacks (typosquatting, install‑time payload generation, remote‑access activation, and dynamic payload evolution) often evade detection.

To fill this void, the authors introduce QUT‑DV25, a large‑scale dynamic analysis dataset built on an eBPF‑instrumented sandbox. The experimental testbed consists of sixteen Raspberry Pi devices running Ubuntu 24.04 LTS with Python 3.8‑3.12, all isolated behind a private network and monitored with eBPF kernel probes (bcc‑tools, bpftool, bpftrace) on Linux kernel v6.8. A high‑performance computing cluster (16‑core CPUs, NVIDIA A100 GPUs, 128 GB RAM) is used for downstream machine‑learning workloads.

Data collection proceeds in three phases. First, malicious packages are harvested from multiple threat‑intelligence sources (GitHub advisories, malware databases, CVE feeds). The authors obtain 7,127 malicious package versions. For each malicious entry, a lexical similarity algorithm is applied to the full PyPI universe to find a benign counterpart with a similarity score ≥ 0.8, yielding 7,144 benign packages. This “malicious‑benign pair” set is stored as D_mb.
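The pairing step above can be sketched as follows. This is a minimal illustration only: the summary does not say which lexical similarity algorithm the authors use, so Python's standard `difflib.SequenceMatcher` stands in for it, and the function names `name_similarity` and `find_benign_counterpart` are hypothetical. Only the 0.8 threshold comes from the text.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Threshold stated in the summary; the scoring function itself is an assumption.
SIMILARITY_THRESHOLD = 0.8


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two package names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_benign_counterpart(
    malicious_name: str,
    benign_names: Iterable[str],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the most similar benign name that meets the threshold, or None."""
    best_name, best_score = None, 0.0
    for candidate in benign_names:
        score = name_similarity(malicious_name, candidate)
        # Keep the highest-scoring candidate that clears the threshold.
        if score >= threshold and score > best_score:
            best_name, best_score = candidate, score
    return best_name
```

A typosquatted name such as `reqeusts` would be paired with `requests` under this stand-in metric, while a name with no close neighbor returns `None` and would need a different pairing strategy.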

Second, each package is labeled using three automated validators (VirusTotal, NDV, Snyk). A majority‑vote rule (≥ 2 validators marking malicious → label = 1; all validators marking benign → label = 0) is applied; the remaining ambiguous cases, in which exactly one validator flags the package, are resolved by manual inspection. The final labeled set D_valid contains 14,271 entries (7,127 malicious and 7,144 benign, a near 50/50 split).
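The labeling rule can be written down directly. The sketch below assumes each validator's verdict has already been reduced to a boolean (`True` for malicious); the function name `label_package` is hypothetical, and returning `None` is simply one way to mark a case for manual inspection.

```python
from typing import Optional, Sequence


def label_package(verdicts: Sequence[bool]) -> Optional[int]:
    """Apply the majority-vote rule to validator verdicts (True = malicious).

    Returns 1 (malicious), 0 (benign), or None when the case is ambiguous
    and should go to manual inspection.
    """
    malicious_votes = sum(1 for verdict in verdicts if verdict)
    if malicious_votes >= 2:
        return 1  # at least two validators flag the package
    if malicious_votes == 0:
        return 0  # every validator considers it benign
    return None  # exactly one flag: neither rule applies


# Example with three validators, in the order VirusTotal, NDV, Snyk.
examples = {
    "two_flags": label_package([True, True, False]),
    "no_flags": label_package([False, False, False]),
    "one_flag": label_package([False, True, False]),
}
```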

Third, the eBPF framework records 36 real‑time features for every installation and post‑installation run. The feature set spans six categories: (1) system‑call sequences, (2) file and directory access logs, (3) network traffic metadata (packet counts, destinations, ports), (4) resource consumption (CPU, memory, I/O), (5) dependency resolution events, and (6) installer script outcomes. The authors reduced an initial pool of > 105 probes to the most informative 36, achieving low overhead (average CPU < 3 %, memory < 50 MB). All traces are stored in CSV/Parquet format for easy ingestion.
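To make the idea of turning raw traces into per-package features concrete, here is a small sketch. The event schema (pairs of category and value), the category names, and the four output features are all hypothetical; the actual QUT-DV25 trace format and its 36 feature definitions are not given in this summary.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical event schema: (category, value), e.g. ("syscall", "openat").
Event = Tuple[str, str]


def summarize_trace(events: Iterable[Event]) -> Dict[str, int]:
    """Collapse a raw event stream into a few count-based features."""
    per_category: Counter = Counter()
    destinations = set()
    tmp_writes = 0
    for category, value in events:
        per_category[category] += 1
        if category == "network":
            destinations.add(value)  # track distinct remote endpoints
        elif category == "file_write" and value.startswith("/tmp/"):
            tmp_writes += 1  # writes into a temporary directory
    return {
        "syscall_count": per_category["syscall"],
        "network_event_count": per_category["network"],
        "unique_destinations": len(destinations),
        "tmp_write_count": tmp_writes,
    }


# Illustrative trace for one installation run (made-up values).
sample_events = [
    ("syscall", "openat"),
    ("syscall", "connect"),
    ("network", "10.0.0.5:443"),
    ("network", "10.0.0.5:443"),
    ("file_write", "/tmp/x.sh"),
    ("file_write", "/home/user/app.py"),
]
features = summarize_trace(sample_events)
```

Each package run then becomes one fixed-length row, which is the shape that tabular models such as Random Forest or XGBoost expect.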

The dataset is then used to benchmark four supervised models: Random Forest, XGBoost, a 1‑D Convolutional Neural Network, and an LSTM network. Using 5‑fold cross‑validation, the models achieve average accuracy > 96 %, precision and recall > 0.93, and F1‑score ≈ 0.94. Notably, the dynamic‑only features enable detection of four malicious packages that were labeled benign by all existing metadata and static datasets. These four packages each had thousands of downloads, underscoring the real‑world impact of the missed detections.
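For reference, the reported metrics follow the standard binary-classification definitions, with F1 as the harmonic mean of precision and recall. The sketch below computes them from labels and predictions in plain Python; it is not the authors' evaluation code, the function name `binary_metrics` is hypothetical, and the toy labels are made up for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for labels in {0, 1}."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy fold: 1 = malicious, 0 = benign (illustrative values only).
fold_metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a 5-fold setup, a function like this would be applied to each held-out fold and the five results averaged.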

A comparative analysis shows that metadata‑only approaches suffer from high false‑positive rates (> 30 %) and near‑zero recall for multi‑stage attacks, while static code datasets miss obfuscated or install‑time payloads entirely. QUT‑DV25’s dynamic traces provide the missing visibility, allowing models to learn patterns such as delayed network connections, unusual file writes in temporary directories, and atypical dependency chains that are characteristic of modern supply‑chain threats.

The authors acknowledge limitations: (a) the Raspberry Pi sandbox may not perfectly emulate production server environments, potentially affecting the manifestation of hardware‑specific malware; (b) eBPF probe selection still requires expert knowledge, and future work could automate probe optimization; (c) attacks that embed kernel‑level rootkits or exploit hardware features may evade the current probe set.

Future directions include extending the framework to other operating systems (Windows, macOS, diverse Linux distributions), integrating automatic probe generation, exploring federated learning to preserve privacy while sharing threat intelligence, and building real‑time response mechanisms that can quarantine or roll back malicious installations based on live eBPF alerts.

In conclusion, QUT‑DV25 is the first publicly released, large‑scale, eBPF‑based dynamic analysis dataset for PyPI. It bridges the gap between static/metadata benchmarks and the reality of modern supply‑chain attacks, offering researchers a robust foundation for developing, evaluating, and benchmarking advanced detection methods. The dataset, source code, and detailed execution instructions are openly available, promoting reproducibility and encouraging the community to build more resilient OSS ecosystems.

