Empirical Analysis of Adversarial Robustness and Explainability Drift in Cybersecurity Classifiers
Machine learning (ML) models are increasingly deployed in cybersecurity applications such as phishing detection and network intrusion prevention. However, these models remain vulnerable to adversarial perturbations: small, deliberate input modifications that can degrade detection accuracy and compromise interpretability. This paper presents an empirical study of adversarial robustness and explainability drift across two cybersecurity domains: phishing URL classification and network intrusion detection. We evaluate the impact of L∞-bounded Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) perturbations on model accuracy and introduce a quantitative metric, the Robustness Index (RI), defined as the area under the accuracy-versus-perturbation curve. Gradient-based feature sensitivity and SHAP-based attribution drift analyses reveal which input features are most susceptible to adversarial manipulation. Experiments on the Phishing Websites and UNSW-NB15 datasets show consistent robustness trends, with adversarial training improving RI by up to 9 percent while maintaining clean-data accuracy. These findings highlight the coupling between robustness and interpretability degradation and underscore the importance of quantitative evaluation in the design of trustworthy, AI-driven cybersecurity systems.
💡 Research Summary
This paper conducts a systematic empirical study of adversarial robustness and explainability drift in two representative cybersecurity domains: phishing URL detection and network intrusion detection. Using publicly available structured datasets—the Phishing Websites dataset (≈11 K samples, 30 engineered numeric features) and the UNSW‑NB15 dataset (≈175 K samples, 42 features)—the authors train simple multilayer perceptron (MLP) classifiers (two hidden layers of 64 and 32 neurons, ReLU activations) with standard Adam optimization. Baseline clean‑data accuracies are 0.91 for phishing and 0.74 for intrusion detection.
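The architecture described above is small enough to sketch directly. The following is a minimal PyTorch rendering of the stated setup (two hidden layers of 64 and 32 neurons, ReLU activations, Adam optimization); the class name, learning rate, and two-class output are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    """Sketch of the paper's MLP: two hidden layers (64, 32) with ReLU."""
    def __init__(self, n_features: int, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# 30 engineered features, as in the Phishing Websites dataset
model = MLPClassifier(n_features=30)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption
```

The same class would cover the UNSW-NB15 setting by passing `n_features=42`.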
Adversarial attacks are generated under an L∞ norm budget ε ranging from 0 to 0.3 in ten evenly spaced steps. Two attack algorithms are evaluated: the single‑step Fast Gradient Sign Method (FGSM) and the iterative Projected Gradient Descent (PGD) with 10 steps and step size α = 0.01. To quantify overall resilience, the authors introduce the Robustness Index (RI), defined as the area under the accuracy‑versus‑ε curve (Eq. 4). Higher RI indicates smoother degradation and stronger resistance to perturbations.
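The two attacks can be sketched as follows. FGSM takes a single signed-gradient step of size ε; PGD takes `steps` smaller steps of size α and projects back into the L∞ ball after each one. This is a generic rendering of the standard algorithms under the paper's stated budget (PGD with 10 steps, α = 0.01), not the authors' own code.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: move each feature by eps in the direction of the
    sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

def pgd(model, x, y, eps, alpha=0.01, steps=10):
    """Iterative PGD: repeated small signed-gradient steps, each followed by
    projection back into the L-infinity ball of radius eps."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        # projection step: keep the total perturbation within the eps-ball
        x_adv = x_orig + torch.clamp(x_adv - x_orig, -eps, eps)
    return x_adv.detach()
```

Sweeping `eps` over the ten-point grid from 0 to 0.3 and recording accuracy at each value produces the accuracy-versus-ε curves the paper analyzes.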
Results show that FGSM causes a steep accuracy drop (e.g., phishing accuracy falls to 0.30 at ε = 0.3), whereas PGD degrades more gradually (phishing accuracy remains around 0.72). Corresponding RI values are 0.62 (FGSM) vs. 0.76 (PGD) for phishing and 0.69 vs. 0.73 for UNSW‑NB15, demonstrating that, in normalized numeric feature spaces, the iterative PGD, though theoretically the stronger attack, can in practice be less damaging than the single‑step FGSM.
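Given an accuracy curve over the ε grid, the Robustness Index reduces to a numerical integral. The sketch below uses the trapezoidal rule and normalizes by the ε range so that a model whose accuracy never drops scores RI = 1; the normalization is an assumption made here so the output lands in [0, 1], consistent with the RI values the paper reports.

```python
import numpy as np

def robustness_index(eps_grid, accuracies):
    """Area under the accuracy-vs-epsilon curve (trapezoidal rule),
    normalized by the epsilon range so RI lies in [0, 1].
    Normalizing is an assumption, chosen to match the reported value range."""
    eps = np.asarray(eps_grid, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    area = np.sum((acc[1:] + acc[:-1]) / 2.0 * np.diff(eps))
    return float(area / (eps[-1] - eps[0]))
```

For example, a flat curve at accuracy 1.0 gives RI = 1.0, while a linear decay from 1.0 to 0.0 over the grid gives RI = 0.5, so "smoother degradation" directly translates into a higher index.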
Beyond raw performance, the study examines feature‑level vulnerability. Gradient‑based feature sensitivity (mean absolute loss gradient per feature, Eq. 5) identifies the most influential dimensions, while SHAP attribution drift (mean absolute change in SHAP values between clean and perturbed samples, Eq. 6) measures explainability instability. In the phishing dataset, features such as InsecureForms, NumDash, and FrequentDomainNameMismatch exhibit both high sensitivity and high SHAP drift, indicating that these attributes dominate both decision confidence and susceptibility to adversarial manipulation. Similar patterns appear in the intrusion dataset, though the larger feature set spreads the impact more evenly.
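Both feature-level metrics have straightforward implementations. The sketch below computes gradient sensitivity as the mean absolute loss gradient per feature, and SHAP drift as the mean absolute change between clean and perturbed attribution matrices; the function names are illustrative, and `shap_drift` assumes attributions have already been produced (e.g., by the `shap` library) rather than computing them itself.

```python
import numpy as np
import torch
import torch.nn.functional as F

def feature_sensitivity(model, x, y):
    """Mean absolute loss gradient per input feature, in the spirit of Eq. 5:
    one non-negative score per feature dimension."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs().mean(dim=0)

def shap_drift(shap_clean, shap_adv):
    """Mean absolute change in SHAP attributions between clean and perturbed
    samples, in the spirit of Eq. 6. Both inputs are (n_samples, n_features)
    attribution matrices computed elsewhere."""
    clean = np.asarray(shap_clean, dtype=float)
    adv = np.asarray(shap_adv, dtype=float)
    return np.abs(adv - clean).mean(axis=0)
```

Ranking features by either score, and comparing the two rankings, is what surfaces jointly vulnerable attributes such as InsecureForms and NumDash in the phishing dataset.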
Adversarial training is evaluated by augmenting each training batch with 20 % FGSM‑generated examples (ε = 0.05). This hardening slightly reduces clean accuracy (phishing: 0.91 → 0.89; intrusion: 0.74 → 0.73) but yields substantial RI improvements: phishing RI_FGSM rises from 0.61 to 0.71 and RI_PGD from 0.72 to 0.87; intrusion RI_FGSM from 0.67 to 0.78 and RI_PGD from 0.73 to 0.92. The adversarially trained models display flatter degradation curves, confirming that exposure to adversarial examples reduces both gradient sensitivity and SHAP drift.
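The hardening procedure can be sketched as a per-batch training step. The version below replaces a random 20 % of each batch with FGSM examples at ε = 0.05; this is one plausible reading of "augmenting each training batch with 20 % FGSM-generated examples" (the paper may instead append adversarial copies), and the function name and structure are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.05, adv_frac=0.2):
    """One training step in which a random adv_frac of the batch is swapped
    for FGSM examples at the given eps, then the model is updated as usual."""
    n_adv = max(1, int(adv_frac * len(x)))
    idx = torch.randperm(len(x))[:n_adv]

    # craft FGSM perturbations for the selected subset only
    subset = x[idx].clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(subset), y[idx])
    grad, = torch.autograd.grad(loss, subset)

    x_mixed = x.clone()
    x_mixed[idx] = subset.detach() + eps * grad.sign()

    # standard supervised update on the mixed (clean + adversarial) batch
    optimizer.zero_grad()
    batch_loss = F.cross_entropy(model(x_mixed), y)
    batch_loss.backward()
    optimizer.step()
    return batch_loss.item()
```

Repeating this step over the training set is what flattens the degradation curves and lifts RI in the reported experiments.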
The discussion highlights a clear relationship between feature dimensionality and empirical robustness, echoing theoretical arguments that higher‑dimensional inputs can dilute the impact of any single perturbation. More importantly, the paper demonstrates a strong correlation between accuracy loss and explainability drift: features with large gradient magnitudes also show large SHAP attribution changes under attack. This coupling implies that adversarial noise simultaneously destabilizes model predictions and the human‑interpretable explanations that security analysts rely on, raising operational risks in automated threat‑response pipelines.
An unexpected observation is that PGD appears less damaging than FGSM in these structured domains. The authors attribute this to the interaction of L∞ constraints with feature normalization: the projection step of PGD keeps perturbations within tighter bounds in the scaled feature space, whereas FGSM can push individual features to their maximum allowed change more directly. This insight suggests that careful feature scaling and normalization are crucial defensive considerations in real‑world cybersecurity AI systems.
In summary, the paper makes four key contributions: (1) a unified empirical framework for assessing both robustness and explainability in cybersecurity classifiers; (2) the Robustness Index as a concise, comparable metric; (3) a combined gradient‑and‑SHAP analysis that pinpoints the most vulnerable features; and (4) cross‑domain validation showing consistent trends and the effectiveness of adversarial training. These findings provide actionable guidance for practitioners seeking to deploy trustworthy, resilient AI‑driven security solutions that maintain both high detection performance and reliable interpretability.