Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts

This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against diverse adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP). Our comprehensive evaluation employs 1,200 carefully stratified prompts from the SALAD-Bench dataset, spanning six harm categories. Results demonstrate significant variations in model robustness, with Llama-2 achieving the highest overall security (3.4% average attack success rate) while Phi-2 exhibits the greatest vulnerability (7.0% average attack success rate). We identify critical transferability patterns where GCG and TAP attacks, though ineffective against their target model (Llama-2), achieve substantially higher success rates when transferred to other models (up to 17% for GPT-4). Statistical analysis using Friedman tests reveals significant differences in vulnerability across harm categories ($p<0.001$), with malicious use prompts showing the highest attack success rates (10.71% average). Our findings contribute to understanding cross-model security vulnerabilities and provide actionable insights for developing targeted defense mechanisms.


💡 Research Summary

This paper conducts a systematic security assessment of four widely used large language models (LLMs): Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4, using the publicly available SALAD-Bench benchmark. The evaluation set comprises 1,200 prompts evenly distributed across six harm categories (malicious use, misinformation, violent incitement, sexual content, hate/discrimination, and privacy violation). The authors evaluate each model against four distinct attack vectors: (1) human-written malicious prompts; (2) AutoDAN, an automated jailbreak method that iteratively refines prompts to bypass safety filters; (3) Greedy Coordinate Gradient (GCG), which optimizes an adversarial suffix through gradient-guided token-level substitutions; and (4) Tree-of-Attacks-with-pruning (TAP), a multi-step search algorithm that prunes low-utility branches to explore the adversarial prompt space efficiently.
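The GCG search loop described above can be sketched in miniature. In the real attack, candidate token swaps are ranked by the gradient of the model's loss with respect to the one-hot prompt tokens; here a toy objective (distance from a hypothetical "ideal" suffix) stands in for the LLM, so only the greedy coordinate search itself is illustrated. All names and the loss are invented for demonstration.

```python
import numpy as np

VOCAB_SIZE = 50
rng = np.random.default_rng(0)
target = rng.integers(0, VOCAB_SIZE, size=8)  # hypothetical optimum (illustrative only)

def adversarial_loss(tokens):
    # Stand-in objective: number of positions differing from the optimum.
    # A real GCG loss is the LLM's negative log-likelihood of a harmful completion.
    return int(np.sum(tokens != target))

def gcg_step(tokens, k=8):
    # Evaluate k candidate substitutions at every position and keep the
    # single swap that most reduces the loss (greedy coordinate descent).
    # Real GCG draws candidates from the top-k gradient coordinates
    # rather than sampling them uniformly at random.
    best_tokens, best_loss = tokens, adversarial_loss(tokens)
    for pos in range(len(tokens)):
        for cand in rng.integers(0, VOCAB_SIZE, size=k):
            trial = tokens.copy()
            trial[pos] = cand
            trial_loss = adversarial_loss(trial)
            if trial_loss < best_loss:
                best_tokens, best_loss = trial, trial_loss
    return best_tokens, best_loss

suffix = rng.integers(0, VOCAB_SIZE, size=8)   # random starting suffix
init_loss = adversarial_loss(suffix)
loss = init_loss
for _ in range(20):
    suffix, loss = gcg_step(suffix)
print(f"loss: {init_loss} -> {loss}")
```

Because each step keeps only loss-reducing swaps, the objective is monotonically non-increasing, which is the property that makes the greedy coordinate scheme converge quickly in practice.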

The experimental protocol applies every attack to every model and records whether the model produces a harmful output; success rates are averaged across the 1,200 prompts. The results show clear differences in robustness. Llama-2-7B-Chat achieves the lowest overall attack success rate at 3.4%, indicating strong resistance, especially to human-crafted prompts. Phi-2 is the most vulnerable, with an average success rate of 7.0% and notably higher susceptibility to AutoDAN and GCG. GPT-3.5-Turbo and GPT-4 fall in between, but a striking phenomenon emerges when attacks are transferred across models: GCG and TAP, which are relatively ineffective against their optimization target (Llama-2, under 2% success), achieve dramatically higher success rates on other models, reaching up to 17% on GPT-4. This demonstrates that adversarial prompts optimized for one architecture can exploit weaknesses in another, making cross-model transferability a critical security concern.
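The evaluation bookkeeping behind these numbers is straightforward: each (model, attack) pair is scored as the percentage of prompts that elicited a harmful response. A minimal sketch, where the outcome lists are invented but sized to reproduce two of the reported figures:

```python
def attack_success_rate(outcomes):
    """outcomes: list of booleans, True if the prompt elicited harmful output."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical outcome grids over the 1,200 SALAD-Bench prompts
# (illustrative data, not the paper's raw results):
results = {
    ("Llama-2-7B-Chat", "GCG"): [True] * 20 + [False] * 1180,       # ~1.7%
    ("GPT-4", "GCG (transferred)"): [True] * 204 + [False] * 996,   # 17.0%
}

for (model, attack), outcomes in results.items():
    print(f"{model} / {attack}: {attack_success_rate(outcomes):.1f}%")
```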

Statistical analysis using the Friedman test confirms that success rates differ significantly across harm categories (p < 0.001). The "malicious use" category exhibits the highest average success rate at 10.71%, followed by misinformation (8.3%) and violent incitement (7.9%); sexual content and hate/discrimination show lower rates (4.2% and 3.9%, respectively). The authors also analyze transferability in more detail: although GCG and TAP have low efficacy on their native target, transferring them can more than double, and in some cases quintuple, their success rates, suggesting that safety mechanisms are tightly coupled to model-specific tokenization, parameter distributions, and post-processing filters.
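The Friedman test is a non-parametric repeated-measures test over ranked data, appropriate here because the same set of model/attack combinations is measured under each harm category. Its mechanics can be reproduced with SciPy on an illustrative grid; the numbers below are invented for demonstration, and only the category ordering mirrors the reported pattern.

```python
from scipy.stats import friedmanchisquare

# Rows: one (model, attack) combination each (illustrative values).
# Columns: malicious use, misinfo, violence, sexual, hate, privacy.
rates = [
    [12.0, 9.1, 8.5, 4.9, 4.2, 5.0],
    [10.5, 8.0, 7.6, 4.0, 3.8, 4.4],
    [ 9.8, 7.9, 7.7, 3.9, 3.7, 4.1],
    [10.6, 8.2, 7.8, 4.0, 3.9, 4.3],
]

# friedmanchisquare expects one sample per treatment (here, per harm
# category), so transpose the rows-of-combinations into columns.
columns = list(zip(*rates))
stat, p = friedmanchisquare(*columns)
print(f"chi2 = {stat:.2f}, p = {p:.4g}")
```

With perfectly consistent category rankings across rows, as in this toy grid, the test statistic is large and the p-value is well below 0.05, matching the kind of significance the paper reports.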

Based on these findings, the paper proposes several practical defense strategies. First, a multi‑model ensemble approach can mitigate transfer attacks by aggregating outputs from models with diverse architectures and applying a meta‑decision layer to flag inconsistent or suspicious responses. Second, meta‑learning‑based monitoring systems can be trained to detect statistical anomalies indicative of algorithmic adversarial prompts, providing real‑time interception. Third, regular stress‑testing using simulated adversarial attacks—both human‑crafted and automated—should be incorporated into the development lifecycle to surface latent vulnerabilities before deployment. Fourth, continuous benchmarking and re‑evaluation are essential whenever a model is updated, ensuring that previously effective defenses do not degrade against newly emerging attack techniques.
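The first proposed defense, a multi-model ensemble with a meta-decision layer, can be sketched as follows. The rationale is that a transferred attack often fools only a subset of architecturally diverse models, so disagreement among them is itself a signal. The refusal markers and the keyword-based judge below are crude placeholders; a deployment would use a trained safety classifier.

```python
from collections import Counter

def classify(response: str) -> str:
    # Crude stand-in safety judge (illustrative): real systems would use
    # a trained classifier rather than keyword matching.
    refusal_markers = ("i can't", "i cannot", "i won't", "unable to assist")
    return "refuse" if any(m in response.lower() for m in refusal_markers) else "comply"

def ensemble_decision(responses):
    """Meta-decision layer: flag the prompt when the models disagree,
    since a transferred attack typically fools only some of them."""
    votes = Counter(classify(r) for r in responses)
    if len(votes) > 1:
        return "flag_for_review"
    return votes.most_common(1)[0][0]

# Example: two models refuse, one complies -> the inconsistency is flagged.
print(ensemble_decision([
    "I can't help with that request.",
    "I cannot assist with this.",
    "Sure, here is how you would proceed...",
]))  # flag_for_review
```

Aggregating over models with different tokenizers and training pipelines directly targets the transferability finding above: an adversarial suffix tuned to one model's token geometry is unlikely to move every ensemble member the same way.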

In summary, this work is the first to comprehensively evaluate LLM robustness against a combined set of human‑written and algorithmic adversarial prompts, quantify cross‑model transferability, and identify statistically significant variations across harm categories. The insights offered serve as a concrete foundation for building more resilient LLM deployments, guiding both researchers and industry practitioners in designing targeted, layered defense mechanisms that can adapt to the evolving threat landscape of generative AI.

