Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction
As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub’s CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.
💡 Research Summary
As the proliferation of web applications and API endpoints expands the attack surface, traditional manual vulnerability identification and static security scanners face significant challenges, including high labor costs and excessive false positives. While machine learning (ML)-based approaches offer a promising alternative, their effectiveness is often hindered by a "cognitive load" problem: excessively long code contexts (exceeding 1,000 tokens) degrade the comprehension capabilities of even advanced models such as GPT-4o and CodeT5+.
This paper introduces “Trace Gadgets,” a novel code representation technique designed to minimize code context by stripping away irrelevant information. The core innovation lies in a sophisticated five-step pipeline: (1) performing bytecode analysis to construct Control and Data Flow Graphs (CFG/DFG), (2) identifying potential vulnerability sinks, (3) performing backward slicing from these sinks, (4) utilizing static tracing to remove non-related statements such as logging or unrelated variable initializations, and (5) inlining the remaining essential statements into a single, concise function. This preprocessing is remarkably efficient, averaging only 0.12 seconds, making it suitable for large-scale, real-time applications.
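The backward-slicing step (3) can be pictured as a reverse reachability walk over a combined control/data dependence graph: starting from the sink, keep every statement the sink transitively depends on and discard the rest. The sketch below is an illustrative simplification, not the paper's actual bytecode-level implementation; the graph shape and statement names are invented for the example.

```python
from collections import deque

def backward_slice(deps, sink):
    """Collect every statement the sink transitively depends on.

    `deps` maps a statement id to the ids of statements it depends on
    (control- or data-flow predecessors); `sink` is the potential
    vulnerability sink identified in step (2).
    """
    keep, work = {sink}, deque([sink])
    while work:
        stmt = work.popleft()
        for pred in deps.get(stmt, ()):
            if pred not in keep:
                keep.add(pred)
                work.append(pred)
    return keep

# Toy dependence graph (hypothetical): a request parameter flows into a
# SQL query (the sink), while a logging statement is unrelated.
deps = {
    "executeQuery": ["buildQuery"],    # sink depends on the query string
    "buildQuery":   ["getParameter"],  # query built from user input
    "logRequest":   ["getParameter"],  # unrelated side effect
}
gadget = backward_slice(deps, "executeQuery")
# "logRequest" is not in the slice, so it would be dropped from the
# Trace Gadget; the surviving statements are then inlined in step (5).
```

In the real pipeline the dependence edges come from the CFG/DFG built in step (1), and the surviving statements are inlined into a single function, which is what keeps the resulting context minimal yet complete.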
To ensure robust evaluation, the researchers constructed a large-scale, high-fidelity dataset by extracting bytecode from over 32,000 real-world JVM web applications sourced from Docker Hub. This dataset, containing 32,886 labeled samples focusing on critical vulnerabilities like CWE-89 (SQL Injection), offers significantly higher realism and diversity compared to traditional datasets like Juliet or OWASP.
Experimental results demonstrate the superiority of Trace Gadgets across various state-of-the-art models, including UniXcoder, CodeT5+, and GPT-4o. Using Trace Gadgets yielded an average F1-score increase of 6.8% and reduced the false positive (FP) rate by 29% to 38%. Notably, the approach outperformed the industry-standard static analyzer, GitHub's CodeQL, by at least 4% on entirely unseen datasets.
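The reported gains can be read against the standard definitions of F1 and false-positive rate. The confusion-matrix counts below are made-up numbers used only to show the computation, not figures from the paper.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fp_rate(fp, tn):
    """Fraction of benign samples wrongly flagged as vulnerable."""
    return fp / (fp + tn)

# Hypothetical counts: a longer, noisier context versus a minimized
# Trace Gadget context on the same test set.
baseline = f1_score(tp=80, fp=40, fn=20)
gadgets = f1_score(tp=85, fp=25, fn=15)
assert gadgets > baseline  # fewer FPs and FNs lift both precision and recall
```

Cutting false positives raises precision directly, which is why a context representation that removes unrelated statements shows up so strongly in both the F1 and FP-rate numbers.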
The practical utility of this research was validated through real-world application: the framework identified two previously unknown vulnerabilities in widely deployed software, including Atlassian Bamboo and GeoServer, leading to responsible disclosure and patching. While the current implementation is limited to the JVM ecosystem and may inherit labeling biases from automated static analysis, Trace Gadgets represent a significant leap forward in making ML-based vulnerability detection a precise and actionable tool for modern cybersecurity.