A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) has been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs, for instance. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address that, this work presents the first systematic literature review on prompt injection mitigation strategies, comprehending 88 studies. Building upon NIST’s report on adversarial machine learning, this work contributes to the field through several avenues. First, it identifies studies beyond those documented in NIST’s report and other academic reviews and surveys. Second, we propose an extension to NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST’s established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work. Finally, we provide a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.


💡 Research Summary

This paper presents the first systematic literature review (SLR) dedicated to defenses against prompt injection and jailbreak attacks on large language models (LLMs) and generative AI (GenAI) systems. Building on the taxonomy introduced by the U.S. National Institute of Standards and Technology (NIST) in its March 2025 “AI 100‑2 E2025” report on adversarial machine learning (AML), the authors collected, screened, and analyzed 88 peer‑reviewed studies that propose concrete mitigation techniques.

The authors first assess whether the existing NIST taxonomy fully captures the breadth of current defenses. They find that while NIST’s three high‑level categories—training‑time, evaluation‑time, and deployment‑time interventions—cover many approaches, they omit several widely used practical methods. To address this gap, the paper extends the taxonomy with new sub‑categories, illustrated in Figure 1, such as:

  • Input/Output filtering (keyword blacklists, regex, content sanitization)
  • Prompt instruction and formatting (fixed system prompts, role restrictions)
  • Prompt stealing detection/prevention (hash‑based similarity checks)
  • Output aggregation or ensemble (multiple prompts, voting mechanisms)
  • Monitoring and response (runtime anomaly detection, logging, automated throttling)
  • Usage restrictions (rate limiting, user authentication)

These additions are marked with a “+” in the figure and are termed “indirect mitigations” because they act without altering the model’s internal weights.

The methodology follows established SLR protocols: the authors searched major databases (IEEE Xplore, ACM DL, arXiv, Scopus) using keywords such as “prompt injection”, “jailbreak”, and “LLM defense”. After de‑duplication and abstract screening, 312 records were narrowed to 88 papers that met three inclusion criteria: (1) they evaluate a defense on an actual LLM, (2) they report quantitative effectiveness metrics, and (3) they disclose whether the implementation is open‑source.

For each defense, the review extracts: the target LLM(s) (e.g., GPT‑3.5, LLaMA‑2, Claude), the attack benchmark(s) used (e.g., AdvBench, GCG, ROT13, token smuggling), the reported success‑rate or reduction in attack efficacy, computational overhead, and whether the solution is model‑agnostic or tied to a specific architecture. The authors compile this information into a comprehensive catalog (see Appendix A).

Key quantitative findings include:

  • Pre‑training or fine‑tuning based safety alignment and data sanitization consistently achieve the highest defense success rates (often > 80 % across multiple attack families).
  • Simple keyword filtering is cheap but vulnerable to obfuscation techniques (e.g., leet‑speak, base64 encoding) and typically yields < 60 % success.
  • Runtime monitoring combined with output aggregation can reduce attack success to under 30 % while incurring modest latency (≈ 10 % increase).
  • Model‑agnostic methods such as “Self‑Reflection” or “Defensive Pruning” are openly available on GitHub and work across both decoder‑only and encoder‑decoder transformers.

The paper also highlights a reproducibility crisis: many studies use disparate datasets, evaluation pipelines, and success metrics, making direct comparison difficult. To mitigate this, the authors propose a standardized benchmarking suite (named PromptBench) that includes a shared set of prompts, attack generators, and evaluation scripts, encouraging future work to report results under a “fair comparison” protocol.

Answering the four research questions (RQs):

  • RQ1 – NIST’s taxonomy does not fully cover existing defenses; the extended taxonomy fills the identified gaps.
  • RQ2 – Emerging trends include hybrid defenses that combine safety‑aligned fine‑tuning with real‑time monitoring, and the use of ensemble prompting to dilute malicious influence.
  • RQ3 – Reported results are not directly comparable due to inconsistent benchmarks; the proposed PromptBench aims to standardize evaluation.
  • RQ4 – Practical guidelines are distilled: (1) adopt a defense‑in‑depth strategy spanning training, deployment, and runtime layers; (2) prioritize open‑source, model‑agnostic solutions when possible; (3) continuously evaluate defenses against up‑to‑date attack generators; (4) log and monitor for anomalous prompt patterns and enforce usage restrictions.

The authors acknowledge limitations: most defenses are evaluated on a narrow set of LLMs, and the trade‑off between security and generation quality is under‑explored. They call for future research on (a) unified benchmark platforms, (b) multi‑objective optimization that balances safety with utility, and (c) automated red‑team frameworks that can continuously generate novel prompt injection variants.

In conclusion, this work systematically maps the current landscape of prompt‑injection defenses, expands the NIST AML taxonomy with concrete, practice‑oriented categories, and provides a detailed, quantitative catalog that serves both academic researchers and industry practitioners seeking to harden LLM‑driven applications.


Comments & Academic Discussion

Loading comments...

Leave a Comment