Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications
As large language models (LLMs) move from research prototypes to enterprise systems, their security vulnerabilities pose serious risks to data privacy and system integrity. This study benchmarks various Llama model variants against the OWASP Top 10 for LLM Applications framework, evaluating threat detection accuracy, response safety, and computational overhead. Using the FABRIC testbed with NVIDIA A30 GPUs, we tested five standard Llama models and five Llama Guard variants on 100 adversarial prompts covering ten vulnerability categories. Our results reveal significant differences in security performance: the compact Llama-Guard-3-1B model achieved the highest detection rate of 76% with minimal latency (0.165s per test), whereas base models such as Llama-3.1-8B failed to detect threats (0% accuracy) despite longer inference times (0.754s). We observe an inverse relationship between model size and security effectiveness, suggesting that smaller, specialized models often outperform larger general-purpose ones in security tasks. Additionally, we provide an open-source benchmark dataset including adversarial prompts, threat labels, and attack metadata to support reproducible research in AI security, [1].
💡 Research Summary
This paper presents a systematic benchmark of ten Llama family models—five standard generative variants and five Llama‑Guard security‑focused variants—against the OWASP Top 10 for LLM Applications framework. The authors constructed a custom adversarial dataset of 100 prompts, evenly distributed across the ten OWASP categories (Prompt Injection, Sensitive Information Disclosure, Supply Chain, Data Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, and Unbounded Consumption). Each prompt includes a trigger (e.g., Base64/Hex obfuscation, role‑playing cue), a malicious intent, and rich metadata (severity, sub‑category). The dataset implements 23 distinct injection techniques and is released in JSON format to enable reproducibility.
Experiments were conducted on the FABRIC testbed using NVIDIA A30 GPUs (24 GB VRAM), Ubuntu 22.04 LTS, PyTorch 2.1.0, CUDA 11.8, and HuggingFace Transformers v4.51.3. All models were loaded in FP16 precision (except the INT8 quantized variant) with a low temperature (0.1) and a maximum token limit of 10 to force binary “safe/unsafe” outputs. Guard models are fine‑tuned to emit explicit safety labels, while standard Llama models generate free‑form text; the latter’s outputs are parsed for the keyword “unsafe” to derive a binary decision.
The evaluation pipeline consists of four stages: (1) recording VRAM allocation and model loading time, (2) measuring end‑to‑end latency per prompt, (3) parsing responses for safety labels, and (4) aggregating overall and per‑category metrics. Results are summarized in Table 1. The Llama‑Guard‑3‑1B model achieved the highest detection rate of 76 % with an average latency of 0.165 seconds and a modest VRAM footprint of 0.94 GB. In stark contrast, the large base models Meta‑Llama‑3‑8B and Llama‑3.1‑8B recorded 0 % detection despite longer latencies (≈0.77 seconds) and higher memory usage (≈5.3 GB). This inverse relationship between model size and security effectiveness challenges the common assumption that larger models are inherently safer.
Instruction tuning proved critical: Llama‑3.1‑8B‑Instruct detected 54 % of threats, whereas its non‑tuned counterpart detected none. Compact models (1 B parameters) consistently outperformed larger variants, indicating that specialized safety training, rather than sheer parameter count, drives security performance. Quantization (INT8) degraded both speed (0.422 seconds per test) and accuracy (28 % detection), and the multimodal Vision variant (Llama‑Guard‑3‑11B‑Vision) performed poorly on pure‑text safety tasks (28 % detection), suggesting that modality‑specific optimizations can dilute text‑centric safety capabilities.
Category‑wise analysis (Table 2) reveals heterogeneous strengths. For example, Llama‑3.1‑8B‑Instruct excels at Prompt Injection (100 % detection) but fails completely on System Prompt Leakage (0 %). Llama‑3.2‑1B shows strong resilience to Information Disclosure (90 %) and Supply Chain attacks (100 %) but is only moderate against injection (50 %). No single model provides comprehensive protection across all ten OWASP categories, underscoring the need for multi‑model ensembles.
The discussion emphasizes practical deployment guidance: (i) prioritize lightweight Guard models for real‑time security monitoring, (ii) avoid using un‑aligned base models for safety‑critical roles, (iii) adopt layered defenses where a Guard model handles content filtering and a compact instruction‑tuned model addresses injection‑type threats, and (iv) consider output format—Guard models deliver structured binary labels, simplifying downstream automation. The authors also identify two persistent gaps: System Prompt Leakage and Supply Chain vulnerabilities remain largely undetected, indicating insufficient training data for these scenarios and highlighting avenues for future research.
In conclusion, the study establishes a reproducible benchmark linking Llama model variants to the OWASP Top 10 risk taxonomy. It demonstrates that (a) larger parameter counts do not guarantee better security, (b) specialized Guard models—especially the 1 B Llama‑Guard‑3‑1B—offer the best trade‑off between detection accuracy, latency, and memory consumption, and (c) instruction tuning and safety‑focused fine‑tuning are the primary drivers of security performance. The released dataset and evaluation scripts aim to foster standardized, high‑fidelity security testing across the AI community.
Comments & Academic Discussion
Loading comments...
Leave a Comment