The Vulnerability of LLM Rankers to Prompt Injection Attacks

The Vulnerability of LLM Rankers to Prompt Injection Attacks
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large Language Models (LLMs) have emerged as powerful re-rankers. Recent research has however showed that simple prompt injections embedded within a candidate document (i.e., jailbreak prompt attacks) can significantly alter an LLM’s ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures, and settings remains largely under-explored. In this paper, we present a comprehensive empirical study of jailbreak prompt attacks against LLM rankers. We focus our evaluation on two complementary tasks: (1) Preference Vulnerability Assessment, measuring intrinsic susceptibility via attack success rate (ASR); and (2) Ranking Vulnerability Assessment, quantifying the operational impact on the ranking’s quality (nDCG@10). We systematically examine three prevalent ranking paradigms (pairwise, listwise, setwise) under two injection variants: decision objective hijacking and decision criteria hijacking. Beyond reproducing prior findings, we expand the analysis to cover vulnerability scaling across model families, position sensitivity, backbone architectures, and cross-domain robustness. Our results characterize the boundary conditions of these vulnerabilities, revealing critical insights such as that encoder-decoder architectures exhibit strong inherent resilience to jailbreak attacks. We publicly release our code and additional experimental results at https://github.com/ielab/LLM-Ranker-Attack.


💡 Research Summary

This paper conducts a comprehensive empirical investigation into the susceptibility of large language model (LLM)‑based re‑rankers to simple prompt‑injection (jailbreak) attacks. The authors focus on two complementary evaluation tasks: (1) Preference Vulnerability Assessment, which measures intrinsic susceptibility via Attack Success Rate (ASR), and (2) Ranking Vulnerability Assessment, which quantifies the operational impact on ranking quality using nDCG@10. They examine three widely used ranking paradigms—pairwise, listwise, and setwise—and two injection strategies: Decision Objective Hijacking (DOH), which explicitly tells the model to output the injected document as most relevant, and Decision Criteria Hijacking (DCH), which subtly reshapes the relevance criteria so that the injected passage is favored.

Methodology
The experimental pipeline mirrors the original study (Qian et al., 2025) but extends it substantially. For each query, candidate documents are constructed from standard IR benchmarks (TREC‑DL, BEIR). A clean “projection” stage records the model’s baseline choice or ranking without any injection. Then a target document that was not selected in the clean stage is chosen, a jailbreak prompt is inserted (either at the front or the back of the text), and the model is re‑evaluated on the same candidate set. An attack is deemed successful if the injected document becomes the top‑ranked choice (pairwise/setwise) or moves to the first position (listwise). ASR is the proportion of successful attacks among valid outputs. For the ranking impact, the authors run a full re‑ranking pipeline and compute the drop in nDCG@10 relative to the clean baseline.

Expanded Research Directions
Beyond reproducing prior findings, the authors systematically explore:

  • Scaling – testing multiple model families (Qwen‑3, Gemma‑3, LLaMA‑3, GPT‑4.1‑mini) across a wide parameter range to see whether larger models are consistently more vulnerable.
  • Position Sensitivity – comparing front‑ versus back‑injection to assess whether vulnerability is truly position‑agnostic.
  • Architectural Divergence – contrasting decoder‑only models with encoder‑decoder architectures (e.g., T5‑XXL, Flan‑UL2) and with mixture‑of‑experts (MoE) variants.
  • Cross‑Domain Robustness – evaluating ASR on the BEIR benchmark’s 14 diverse domains (news, legal, biomedical, etc.) to test transferability of the attack.

Key Findings

  1. Model Size Effect – Across all families, larger models (≥30 B parameters) exhibit higher ASR (up to ~0.84 for LLaMA‑3‑70B) than their smaller counterparts, confirming the earlier claim that “bigger is more vulnerable.”
  2. Paradigm‑Specific Strength – DOH is most effective in pairwise settings, while DCH dominates in listwise scenarios; setwise results are similar for both attacks.
  3. Injection Position – For decoder‑only models, front‑injection yields a modest but statistically significant ASR boost (≈ 0.09) over back‑injection. Encoder‑decoder models show negligible position effects, suggesting better contextual discrimination.
  4. Architectural Resilience – Encoder‑decoder models dramatically reduce ASR (≈ 0.30 for DOH, 0.28 for DCH) and cause almost no nDCG@10 degradation (< 0.02), indicating intrinsic robustness likely due to bidirectional context encoding.
  5. Domain Transferability – High ASR persists across most BEIR domains (0.65–0.82), though biomedical and legal domains show slightly lower rates, possibly because specialized terminology interferes with the simple “

Comments & Academic Discussion

Loading comments...

Leave a Comment