From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models


Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.


💡 Research Summary

The paper “From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models” investigates how data duplication within the widely used SecretBench benchmark inflates the reported performance of machine‑learning‑based secret detectors. SecretBench, a large collection of hard‑coded secrets mined from public GitHub repositories, contains a substantial amount of exact and near‑duplicate code contexts. The authors define a “secret context” as the 200 characters surrounding a secret value, and they use Jaccard similarity on token sets and multisets (thresholds 0.8 and 0.7) to flag near‑duplicates that are not exact string matches. Quantifying duplication this way, they find that of 97,479 total contexts, 69.3 % are exact duplicates, 8.7 % are near‑duplicates, and only 22 % are truly unique.
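The duplicate classification described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: the tokenizer (whitespace split) and the pairing of each threshold with the set vs. multiset variant are assumptions.

```python
from collections import Counter

def jaccard_set(a_tokens, b_tokens):
    """Jaccard similarity over token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jaccard_multiset(a_tokens, b_tokens):
    """Jaccard similarity over token multisets (bags), counting repeats."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    inter = sum((a & b).values())   # element-wise min counts
    union = sum((a | b).values())   # element-wise max counts
    return inter / union if union else 1.0

def classify_pair(ctx_a, ctx_b, set_thresh=0.8, multiset_thresh=0.7):
    """Label a pair of 200-character secret contexts.

    Thresholds follow the paper (0.8 and 0.7); which threshold goes
    with which representation, and the whitespace tokenization, are
    assumptions made for this sketch.
    """
    if ctx_a == ctx_b:
        return "exact"
    a, b = ctx_a.split(), ctx_b.split()
    if (jaccard_set(a, b) >= set_thresh
            or jaccard_multiset(a, b) >= multiset_thresh):
        return "near-duplicate"
    return "unique"
```

In practice, pairwise comparison over ~97k contexts would be quadratic, so a real deduplication pipeline would typically bucket candidates first (e.g., by hashing) before computing similarities.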

To assess the impact of this leakage, three experimental scenarios are constructed, each using a consistent 5‑fold cross‑validation split:

  1. Mixed (Baseline on Duplicated Data) – No deduplication; training and test sets both contain exact and near‑duplicates.
  2. Near‑Duplicate (Baseline on Deduplicated Data) – Exact duplicates are removed, but near‑duplicates remain in both the train and test splits, isolating the effect of semantically similar samples.
  3. Unique (Training on Duplicates, Testing on Unique Data) – Training uses all exact and near‑duplicates, while testing is performed exclusively on the unique subset, directly measuring generalization to unseen code.

Four model configurations, spanning three architecture families, are evaluated: a Random Forest (RF) with hyper‑parameters tuned by Differential Evolution, a two‑layer LSTM network with dropout and optional embedding, and two variants of GraphCodeBERT (full fine‑tuning, and frozen‑feature extraction followed by a lightweight MLP). All models are trained on a filtered subset of SecretBench (23,352 samples from seven object‑oriented languages) and evaluated using the Matthews Correlation Coefficient (MCC) as the primary metric, complemented by precision, recall, and F1‑score.
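For reference, MCC and the companion metrics named above can all be computed from the four confusion-matrix counts; a minimal sketch:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient: ranges from -1 to 1,
    with 0 meaning no better than chance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

MCC is a sensible primary metric here because secret detection is class-imbalanced: unlike accuracy or F1, it rewards a model only if it does well on both the positive (secret) and negative (non-secret) classes.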

Results show that in the Mixed scenario all models achieve high MCC scores (RF ≈ 0.89, LSTM ≈ 0.92, GraphCodeBERT ≈ 0.90), reflecting the artificial advantage of having identical or highly similar examples in both training and testing. When exact duplicates are removed (Near‑Duplicate scenario), MCC drops across the board: RF falls to 0.77, LSTM to 0.77, and GraphCodeBERT to about 0.84 (roughly a 7 % decrease). The Unique scenario reveals the most dramatic degradation: RF’s MCC declines to 0.65, LSTM to 0.77, and GraphCodeBERT to roughly 0.83, indicating that models trained on duplicated data struggle to detect truly novel secrets. Near‑duplicates alone still provide a measurable boost, confirming that semantic similarity between training and test samples inflates reported performance even after exact duplicates are removed.

The authors conclude that the prevalent use of SecretBench without proper deduplication leads to a systematic over‑estimation of secret‑detection capabilities. They recommend rigorous dataset hygiene (removing both exact and near duplicates), adopting evaluation protocols that separate duplicate and unique samples, and favoring architectures that capture deeper code semantics (e.g., GraphCodeBERT) to improve robustness. Additionally, they release a replication package containing scripts, deduplication tools, and the record IDs of the original dataset to promote reproducibility and transparency.

Overall, the study highlights data leakage as a critical threat to the validity of security‑focused machine‑learning research and provides concrete guidelines for building more reliable secret‑detection systems that will perform faithfully in real‑world software environments.

