Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
💡 Research Summary
As Large Language Models (LLMs) increasingly serve as automated judges for evaluating other models (the “LLM-as-a-judge” paradigm), the security of these evaluation pipelines has become a critical concern. This paper addresses a sophisticated and emerging threat known as “Blind Attacks,” a specific type of prompt injection where an adversary crafts a candidate response to deceive the evaluator without any prior knowledge of the actual ground-truth answer.
The core problem identified in the paper is that traditional prompt injection defenses often rely on the attacker’s interaction with known data. However, in a “Blind Attack,” the attacker does not need to know the true reference answer. Instead, they embed malicious instructions or linguistic patterns within the candidate response that are designed to manipulate the evaluator’s decision-making logic. This allows the attacker to force a “correct” verdict regardless of the actual content of the ground truth, effectively neutralizing the integrity of the evaluation process.
To combat this, the authors propose the “SE+CFE” framework, which integrates Standard Evaluation (SE) with a novel Counterfactual Evaluation (CFE) mechanism. While the SE component performs the conventional task of comparing the candidate response to the true ground truth, the CFE component introduces a “counterfactual” element: a deliberately falsified ground-truth answer.
The detection logic is based on a consistency check. The framework flags an attack if the candidate response receives a “pass” verdict under both the standard (true) ground truth and the counterfactual (false) ground truth. The underlying principle is that a legitimate, high-quality response should be contextually tied to the true ground truth and should naturally fail when compared against a false one. Only an adversarial response, which contains payloads designed to bypass the evaluator’s logic by ignoring the specific content of the reference, would consistently trigger a “pass” across both scenarios.
Experimental evaluations demonstrate that while standard evaluation systems are highly susceptible to being deceived by blind attacks, the SE+CFE framework significantly enhances detection capabilities. Crucially, the researchers show that this security enhancement is achieved with minimal trade-offs in evaluation performance, ensuring that the accuracy of the judging process remains intact. This research provides a robust and scalable defense mechanism, offering a vital foundation for building secure and trustworthy automated evaluation infrastructures for the next generation of LLMs.
Comments & Academic Discussion
Loading comments...
Leave a Comment