RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization; the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and outperforms traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test-set prioritization and attack-prompt sampling. Furthermore, our findings confirm RACA's generalization to various scenarios and its robustness across different configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of AI testing.
💡 Research Summary
The paper introduces RACA (Representation‑Aware Coverage Criteria for LLM Safety Testing), a novel framework designed to evaluate the adequacy of safety‑testing suites for large language models (LLMs). Existing safety testing largely relies on static jailbreak prompt datasets and lacks systematic metrics to assess test quality. Traditional coverage criteria from small neural networks (e.g., Neuron Coverage, KMNC) are unsuitable for LLMs because of prohibitive computational cost and because they capture irrelevant neuron‑level noise rather than safety‑relevant behavior.
RACA addresses these issues by focusing on high‑level safety concepts that emerge as low‑dimensional directions in the hidden states of LLMs. The workflow consists of three stages:
- Calibration set construction – a small, expert‑curated collection of jailbreak prompts that exemplify the safety constraints of interest.
- Safety‑concept extraction – hidden‑state activations for the calibration set are gathered from a chosen layer, centered, and subjected to Principal Component Analysis (PCA). The top‑n principal components become “Safety Principal Components,” each interpreted as a distinct safety‑related concept (e.g., violence, political persuasion, disallowed content).
- Concept activation scoring – for any test prompt, its hidden representation is projected onto each safety component; the magnitude of the projection is the concept activation score.
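Stages two and three can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: it assumes hidden states for the calibration set have already been collected from the chosen layer, and all function names are invented for this sketch.

```python
import numpy as np

def extract_safety_components(calib_states: np.ndarray, n_components: int):
    """Stage 2 sketch: PCA over centered calibration hidden states.

    calib_states: (num_prompts, hidden_dim) activations from one layer.
    Returns (mean, components), where each row of `components` is one
    candidate "Safety Principal Component" direction.
    """
    mean = calib_states.mean(axis=0)
    centered = calib_states - mean
    # SVD of the centered matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def concept_activation_scores(hidden: np.ndarray, mean: np.ndarray,
                              components: np.ndarray) -> np.ndarray:
    """Stage 3 sketch: project a test prompt's hidden state onto each
    safety component; the projection magnitude is the activation score."""
    return np.abs(components @ (hidden - mean))
```

Because the components are orthonormal rows of the SVD's right factor, each score is simply the (absolute) coordinate of the centered hidden state along one safety direction.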
Using these scores, RACA defines six sub‑criteria grouped into two orthogonal dimensions:
Individual Concept Coverage (Dimension I)
- Safety Feature Coverage (SFC) – proportion of unique safety concepts activated by the suite.
- Top‑K Feature Coverage (TKFC) – how many of the top‑K concepts each prompt triggers, encouraging diversity.
- Feature Intersection Coverage (FIC) – measures overlap among prompts to penalize redundant coverage.
Compositional Concept Coverage (Dimension II)
- Safety Concept Combination (SCC) – counts prompts that simultaneously activate multiple safety concepts.
- Pairwise Concept Coverage (PCC) – evaluates coverage of every pair of concepts across the suite.
- Concept‑Based Combination (CBC) – assesses coverage of predefined concept combinations (e.g., “violent + political”).
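To make the two dimensions concrete, here is a hedged sketch of one criterion from each: SFC (individual) and PCC (compositional). The summary does not give the paper's exact formulas, so the threshold-based notion of "activated" and the function below are illustrative assumptions.

```python
import itertools
import numpy as np

def coverage_metrics(scores: np.ndarray, threshold: float) -> dict:
    """Illustrative suite-level coverage over concept activation scores.

    scores: (num_prompts, num_concepts) matrix of activation scores.
    A concept counts as "activated" by a prompt when its score exceeds
    `threshold` (an assumed activation rule, not the paper's definition).
    """
    active = scores > threshold  # boolean activation matrix
    _, n_concepts = active.shape

    # SFC sketch: fraction of concepts activated by at least one prompt.
    sfc = active.any(axis=0).mean()

    # PCC sketch: fraction of concept pairs jointly activated by some prompt.
    pairs = list(itertools.combinations(range(n_concepts), 2))
    covered = sum(1 for i, j in pairs if np.any(active[:, i] & active[:, j]))
    pcc = covered / len(pairs)

    return {"SFC": float(sfc), "PCC": pcc}
```

Note how the two dimensions differ: a suite can reach full SFC while leaving most concept pairs uncovered, which is exactly the redundancy the compositional criteria are meant to expose.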
These criteria satisfy three design principles specific to LLM safety testing: (i) synonym‑insensitivity (redundant wording should not inflate coverage), (ii) invalid‑insensitivity (harmless or weak prompts should not contribute), and (iii) jailbreak‑sensitivity (prompts that actually cause harmful generation should yield high coverage).
The authors conduct extensive experiments on several LLMs (GPT‑2, LLaMA‑7B, Falcon‑40B, etc.). Compared with neuron‑level criteria, RACA more accurately identifies high‑quality jailbreak prompts while ignoring irrelevant or duplicate inputs. In a test‑set prioritization scenario, selecting prompts with the highest RACA scores yields a substantially larger number of successful jailbreaks under a fixed testing budget. In a prompt‑sampling experiment, using RACA scores as an objective for generation produces new jailbreak prompts that are markedly more effective than random or simple augmentation baselines.
Sensitivity analyses explore the impact of calibration‑set size, chosen layer, and number of PCA components. Results show that a calibration set of roughly 100 prompts suffices to learn stable safety components, and that extracting components from the final transformer layer provides the strongest correlation with jailbreak success. The framework remains robust across model scales and various hyper‑parameter configurations.
Limitations discussed include the linear nature of PCA (potentially missing non‑linear safety concepts), dependence on the quality and representativeness of the calibration set, and the current focus on a single layer rather than a multi‑layer representation. The authors suggest future work on non‑linear representation learning (e.g., autoencoders) and hierarchical concept aggregation.
In summary, RACA offers a scalable, concept‑driven coverage metric tailored to LLM safety testing. By reducing dimensionality to safety‑critical representations, it overcomes the computational and relevance shortcomings of traditional neuron‑level coverage, providing researchers and practitioners with a practical tool to assess, prioritize, and improve safety test suites for modern LLM deployments.