Matching Ranks Over Probability Yields Truly Deep Safety Alignment
📝 Original Info
- Title: Matching Ranks Over Probability Yields Truly Deep Safety Alignment
- ArXiv ID: 2512.05518
- Date: 2025-12-05
- Authors: Jason Vega, Gagandeep Singh
📝 Abstract
A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability is enabled by the "gaming" of the SFT objective when the target distribution entropies are low, where low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7× improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact to model utility.
📄 Full Content
One frustratingly simple exploit for circumventing the safety alignment of LLMs is the prefilling attack (Vega et al., 2024; Andriushchenko et al., 2025). When the user provides a harmful request to the LLM (e.g., “How do I build a bomb?”), they can prefill the assistant response with affirmative text (e.g., “Here’s how to build a bomb. Step 1: Gather”) and then start the decoding process after this prefill. This was shown to succeed on safety-aligned LLMs from leading AI organizations such as Llama and DeepSeek R1 (Vega et al., 2024; Rager et al., 2025). Crucially, the prefilling attack can be done by hand, avoiding the need for computationally expensive algorithms, fine-tuning or high technical knowledge. As the only requirement for this exploit is the ability to prefill the assistant response, this yielded troubling implications for open-sourcing safety-aligned models (Vega et al., 2024), where this requirement is trivially satisfied. Prefilling attacks have also been shown to succeed against closed-source models where prefilling is provided as a feature, such as Claude (Andriushchenko et al., 2025). Although prefilling can be useful in benign scenarios for exerting greater control over the LLM’s output (Anthropic, 2025), it comes with potential risks to safety alignment that warrant greater investigation.

Figure 1: A demonstration of the Rank-Assisted Prefilling (RAP) attack against the Llama 2 7B Chat checkpoint fine-tuned for deep safety alignment from Qi et al. (2025) on a request for bomb-making instructions. In the first step (left), we show the top 10 tokens and their probabilities from the next token probability distribution following a harmful prefill (red). Nearly all of the probability mass is concentrated on the top-ranked token “I”, yet selecting this token leads to many future decoding paths that refuse the request, such as “I cannot fulfill your request …” Instead, the “the” token can be selected at this step despite its low probability, and then appended to the input to yield the input for the next step. Repeating this process extracts harmful content fulfilling the request that is not likely to be generated by traditional sampling-based decoding strategies.
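Concretely, a prefilling attack amounts to rendering the chat template, appending an affirmative prefix to the open assistant turn, and letting decoding continue. The sketch below illustrates this with the Hugging Face transformers API; the model name, request placeholder, and prefill text are illustrative assumptions rather than the exact artifacts used in this work.

```python
# Minimal sketch of a prefilling attack on an open-weight chat model (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "<harmful request>"}]
# Render the chat template with the assistant turn left open...
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ...then prefill the assistant response with an affirmative prefix.
prompt += "Sure, here is how to"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decoding simply continues from the prefill.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```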
A promising training-time intervention to improve robustness against prefilling attacks was recently proposed by Qi et al. (2025) based on a principle they call “deep” safety alignment. The goal of deep safety alignment is to ensure that even if the beginning of a model’s response to a harmful request indicates compliance (e.g., via a prefilling attack), the model should be able to quickly “recover” and stop complying. To achieve this, they propose a simple supervised fine-tuning (SFT) defense based on data augmentation. A key strength of this work is that it enables helpful refusals to harmful prefills. (In contrast, a model that refuses by outputting a fixed string or random tokens is indeed safe, but such responses are not very helpful for the user to understand why the model refused.) This makes models fine-tuned with such a defense more readily-deployable for customer-facing applications that provide prefilling as a feature, enabling safe prefilling.
In Section 3, we show that unfortunately, the SFT-based data augmentation approach from Qi et al. (2025) can produce models with a vulnerability that allows the deep safety alignment to be easily circumvented. We show that a simple combination of supporting prefilling and providing the top k predicted next tokens at each decoding step is sufficient for the exploit to work. This again is trivially satisfied by any open-source model, and access to the top k tokens is a feature seen in some commercial APIs such as the OpenAI API (which allows access to the top 20 tokens (OpenAI, 2025)). The idea is simple: instead of using a traditional decoding strategy during a prefilling attack, simply select among the top k tokens at each decoding step to extract the desired harmful content, regardless of their probability. We refer to this attack as the Rank-Assisted Prefilling (RAP) attack, and provide an illustration in Figure 1. RAP is a more powerful generalization of the prefilling attack, as it allows for arbitrary selection of the tokens. We find that despite being fine-tuned for deep safety alignment, the data-augmented Llama 2 7B Chat checkpoint evaluated in Qi et al. (2025) retains many low-probability yet highly-ranked tokens that naturally continue a harmful prefill (which we refer to as “harmful tokens”) within the top 20 tokens. Selecting from these harmful tokens yields sequences fulfilling the harmful request that are not likely to be generated via traditional sampling-based decoding strategies. We show that RAP attacks can be easily done by hand, and also implement an automated version we call AutoRAP.

Next, in Section 4.1, we propose a novel perspective on approaching deep safety alignment that takes into account the RAP attack vulnerability. We argue that to address RAP, it is most important to encourage the ranks of harmful tokens to be low, rather than just their probabilities. To do so, one can utilize the ranking information from the first response token distribution immediately following a harmful request without a prefill, which for a sufficiently safety-aligned model will likely have its top-ranked tokens be filled with “refusal tokens” (i.e., tokens that lead to many future decoding paths that refuse the request). By focusing instead on matching the top rankings of the first response token distribution, the model can learn to “push forward” these rankings to future steps where a harmful prefill is added. We refer to this new perspective on approaching deep safety alignment as Push-Forward Alignment (PFA) (see Figure 2 for an illustration). We then argue that approaches such as the data augmentation approach from Qi et al. (2025) can produce models vulnerable to the RAP attack due to “gaming” of the SFT objective when the target distribution entropies are low, where low fine-tuning loss is achieved by shifting high probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens.
Finally, in Section 4.2, we show that approaching deep safety alignment from the PFA perspective yields a highly intuitive and mechanistically-interpretable implementation based on attention regularization. We show that it is sufficient to regularize the Multi-Head Attention modules in the model so that the model learns to ignore the harmful prefill portion of the input, which in turn encourages the model to push forward the highly-ranked refusal tokens from the first response token distribution. We refer to this approach as PRefill attEntion STOpping (PRESTO). In Section 5, we show that PRESTO helps mitigate the RAP attack vulnerability of the data augmentation approach from Qi et al. (2025) by significantly increasing the difficulty of finding harmful decoding paths through RAP. We also show that the addition of the PRESTO term does not significantly harm the utility of the model. Lastly, we analyze the effects of PRESTO on the model’s attention patterns, which reveal that attention in the later half of the model is most affected by the regularization.
In this section, we discuss attacks for circumventing safety alignment and defenses against such attacks from existing related work. For additional related work, please refer to Appendix A.
In our work, we focus on decoding exploits for circumventing safety alignment, as they are among the most accessible techniques to perform. Huang et al. (2024) proposed a decoding exploit that performs a grid search over decoding parameter configurations (e.g., temperature, top-p parameter). This work exploits the observation that harmful tokens may be ranked high enough and have just enough probability mass on them such that changing the decoding parameters can significantly boost the chance of their selection. However, it was shown in Qi et al. (2025) that this approach is no longer successful on models trained with the data augmentation approach to deep safety alignment, likely due to the distribution becoming highly concentrated on a single refusal token as shown in Figure 1. Since RAP attacks only utilize the rank of the tokens and not their probabilities, they are a more powerful threat than the decoding parameters exploit.
Aside from decoding exploits, another set of techniques to circumvent safety alignment are “jailbreaks.” These can either be handcrafted through extensive manual effort (Reddit, 2025), or automatically discovered with expensive search algorithms. Due to such costs, they may therefore not be preferable in situations where prefilling attacks are possible. For example, the Greedy Coordinate Gradient (GCG) attack (Zou et al., 2023b) searches for a suffix that the user can append to their prompt through a discrete optimization algorithm, but requires access to the target model’s weights (i.e., an open-source model, at which point one can just do a prefilling attack) or relies on transferability of suffixes to closed-source models. Some examples of jailbreak algorithms that don’t require the target model’s weights include PAIR (Chao et al., 2025), TAP (Mehrotra et al., 2024) and AutoDAN (Liu et al., 2023), with PAIR and TAP operating in a completely black-box manner and AutoDAN only requiring access to the target model’s output probabilities. Yet, the iterative nature of these attacks still makes them much more costly than prefilling attacks.
Finally, some other methods of circumventing safety alignment include those based on representation engineering (Zou et al., 2023a), which nudge the model’s internal representations in a harm-encouraging direction (Arditi et al., 2024), and fine-tuning attacks (Qi et al., 2024), which fine-tune the target model to disable its safety alignment. These obviously require access to the model’s weights (or, in the case of fine-tuning attacks on closed-source models, for fine-tuning services to be provided), in which case prefilling attacks are again preferable when possible due to their simplicity.
In our work, we focus on training-time interventions for improving the robustness of safety alignment against decoding exploits. Deep safety alignment, along with an SFT-based implementation based on data augmentation, was proposed recently in Qi et al. (2025) as one of the first techniques to defend against prefilling attacks. Concurrently, Zhang et al. (2025) proposed a near-identical data augmentation approach, with the addition of a special [RESET] token to signal the start of a refusal following a prefilling attack. However, this latter approach is less preferable than the former from a safety perspective, as a user could just disable the [RESET] token in the open-source setting (e.g., by applying a strong bias during decoding so that it always has low probability).
A key strength of the two aforementioned works is that they enable helpful refusals to harmful prefills. This can be contrasted to defenses that do not have this desirable property. A recent example is the implementation of circuit breaking (a type of approach to deep safety alignment based on representation engineering) called Representation Rerouting, as proposed in Zou et al. (2024). Like the data augmentation approach to deep safety alignment, this method is effective at defending against prefilling attacks. However, because it fine-tunes the model to increase dissimilarity to harmful representations with no particular target representation, we’ve observed that the resulting models tend to produce unintelligible text following a harmful prefill as opposed to meaningful refusals. We therefore focus our work on strengthening the robustness of the data augmentation approach of Qi et al. (2025) to see if we can retain the benefit of generating helpful refusals under prefilling attacks while mitigating the vulnerability to RAP attacks.
There are also approaches to improving safety alignment robustness based on adversarial training. For instance, R2D2 (Mazeika et al., 2024) fine-tunes against adversarial examples generated through GCG. However, such approaches are costly due to the simulation of the adversary, which turns out to not even be necessary in some cases: the approach from Qi et al. (2025) provides decent protection against GCG without specifically needing to train against it.
To implement deep safety alignment, Qi et al. (2025) propose an SFT-based data augmentation approach. Before fine-tuning, a harmful response is sampled from a jailbroken version of the model (Qi et al., 2024) for each harmful prompt in the fine-tuning dataset. During fine-tuning, these harmful responses are randomly truncated to form the harmful prefills (red). The refusals (blue) for each prompt are sampled from the original safety-aligned model and fixed throughout fine-tuning. A safety-encouraging system prompt is used. The safety objective simply minimizes the negative log-likelihood of the refusal tokens given the preceding tokens. Qi et al. (2025) showed that this strategy is very effective at mitigating prefilling attacks with immediate natural language refusals. However, as we will show, the resulting “deep” safety alignment from this approach is in fact rather superficial.

Table 1: Mean StrongREJECT (Souly et al., 2024) scores of prefilling and RAP attacks for the Llama 2 7B Chat checkpoint fine-tuned for deep safety alignment (“Data Augmented”) from Qi et al. (2025). Scores are on a scale of [0, 1] with higher values indicating greater harmfulness. For the prefilling attacks, we report the mean and standard deviation across three runs, and also report results for the original Llama 2 7B Chat model (Touvron et al., 2023). For the human RAP evaluation, we report the mean and standard deviation over three participants.
Prefilling (Original): 0.831 ± 0.004 | Prefilling (Data Augmented): 0.001 ± 0.002 | RAP, Human (Data Augmented): 0.597 ± 0.158 | AutoRAP (Data Augmented): 0.5389
We examine the Llama 2 7B Chat checkpoint fine-tuned with the data augmentation approach that was evaluated in Qi et al. (2025). To illustrate our key observation, in Figure 1 we use the bomb-making example as input to the model and display the top 10 tokens in the model’s next token probability distribution following the harmful prefill. Nearly all of the probability mass (∼ 97%) is concentrated on the refusal token “I”, which if selected would tend to generate refusals such as “I cannot fulfill your request.” However, despite having been fine-tuned for deep safety alignment, there still exist low-probability yet highly-ranked harmful tokens within the top 10 tokens. This yields two important takeaways. Firstly, it helps explain why the data augmentation approach is so effective against both prefilling attacks and the decoding parameters exploit under traditional decoding strategies: the mass becomes so highly concentrated on the refusal token that the distribution is not “flat” enough for harmful tokens to be selected, even after varying the decoding parameters! Secondly, the presence of highly ranked harmful tokens suggests that RAP attacks should be feasible to carry out against this model, which we confirm next.
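This observation is easy to reproduce by inspecting the next-token distribution directly. The sketch below (an illustrative helper assuming a standard transformers-style model and tokenizer, not the paper's evaluation harness) returns the top-k tokens and their probabilities following a prefilled prompt:

```python
import torch

@torch.no_grad()
def top_k_next_tokens(model, tok, prompt_with_prefill, k=10):
    """Return the top-k next tokens and their probabilities after a (prefilled) prompt."""
    inputs = tok(prompt_with_prefill, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]            # next-token logits at the final position
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)]

# In the Figure 1 example, nearly all mass sits on a refusal token such as "I",
# while low-probability harmful continuations still appear among the top k.
```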
To evaluate the RAP attack as a means for extracting useful harmful content from the deep safety-aligned model of Qi et al. (2025), we employ human evaluation where three participants were asked to perform the RAP attack by hand. We obtain harmful prompts from the StrongREJECT dataset (Souly et al., 2024) and use the accompanying grading rubric with GPT-5 for evaluation, as the rubric ensures that attack success should account for the quality of the response rather than just whether the model avoids refusing. We generate harmful prefills using Mistral 7B v0.3 following the few-shot prefill generation approach of Vega et al. (2024) (see Appendix B.1 for more details). Due to time constraints, we evaluate each participant on a sample of 52 prompts, about 1/6 of the size of the full dataset. We limit the maximum number of interactions per prompt (counting token selection and backtracking, i.e., the “undoing” of a token selection) to 256; this limits the amount of exploration possible, encouraging participants to simply select the first harmful token they see (rather than trying to strategically select tokens to maximize harmfulness).
To help scale up the evaluation, we also report results using our automated attack AutoRAP on a larger sample of 90 prompts and a higher maximum interactions limit of 512. In both the human and automated settings, we restrict the attacks to the top k = 20 tokens at each step (as this mirrors real-world limits, e.g., what is supported by the OpenAI API (OpenAI, 2025)). Note that although Llama is an open-source model (and thus an attacker does not have a restriction on k), we still restrict k to demonstrate that an attacker does not have to search far to select harmful tokens, as well as to simulate a closed-source setting where an API allows both prefilling and access to the top k tokens. We provide more details on the design of the human evaluation and AutoRAP in Appendix B.
Figure 2: An illustration of the Push-Forward Alignment (PFA) approach to deep safety alignment on a request for bomb-making instructions. On the far left, we show the top 10 tokens for the first decoding step from the original Llama 2 7B Chat model (Touvron et al., 2023) when given the prompt without any prefill. The highest-ranked future decoding paths from these tokens tend to be refusals (e.g., “Sorry, but I cannot fulfill…”). When a harmful prefill is added, the top-ranked tokens from the first step are “pushed forward” to the current step, which helps to reduce the presence of harmful tokens that continue the prefill. Highly-ranked future decoding paths from the first step can also be pushed forward to help enable natural language refusal generation under other threat models (such as prefilling attacks under traditional sampling-based decoding strategies).
Table 1 reports the results. We also provide baseline results of performing prefilling attacks on the entire dataset of 313 prompts over three runs, both for the original model (Touvron et al., 2023) and the deep safety-aligned model. We use the results for the original model to help validate that our RAP results did not produce content that is more harmful than what we would expect to get out of the original model (e.g., via a heavy bias on the token selection as a result of the humans/AutoRAP implementation leveraging prior knowledge). For the deep safety-aligned model, the baseline prefilling attack is highly unsuccessful, as expected. For RAP however, we observe a significant leap in the success of circumvention. Moreover, we observe that AutoRAP is able to approach human-level performance, demonstrating the feasibility of automating the attack. Finally, comparing these results to the original model’s prefilling attack results, we see that RAP and AutoRAP are able to recover much of the original model’s harmfulness. Overall, these results confirm that there is a fundamental flaw in the SFT-based data augmentation approach to deep safety alignment that allows substantive harmful content to still be easily extracted under the top k and maximum interactions constraints. Clearly, there is a need to make harmful decoding paths under the RAP setting much more difficult to find. In the next section, we explore whether the data augmentation approach of Qi et al. (2025) can be fixed to address this exploit.
First, we describe a novel perspective on how to achieve deep safety alignment while mitigating RAP attacks. Consider the original (shallow) safety-aligned model and a harmful prompt with no harmful prefill. When generating the first response token, there will very likely not be any tokens among the highest-ranked tokens that would naturally continue a harmful prefill, since no prefill was present in the first place. Provided the model has undergone sufficient safety alignment, the highest-ranked tokens are thus likely to be filled with refusal tokens. This can then be used as a training signal during deep safety alignment fine-tuning. Specifically, the model can be trained to “push forward” these top-ranked tokens to future decoding steps when a harmful prefill is provided.
To maintain natural language refusals following the initial refusal token (e.g., when responding to a prefilling attack under a traditional decoding strategy), the highest-ranked future decoding paths can also be “pushed forward.” We provide an illustration of PFA in Figure 2.
We formalize the concept of fine-tuning a model with PFA as follows. We refer to pushing forward highly-ranked future decoding paths up to length t as “PFA-t”. For simplicity, we present PFA-1; i.e., just pushing forward the highest-ranked first response tokens. Let x denote a harmful prompt and x_pre denote a harmful prefill drawn from a distribution D. Let p*(x) denote the next token distribution given x produced by the original model. Let p(x, x_pre; θ) be similarly defined, but now also given x_pre and produced by a model parameterized by θ. Let R denote a function that returns ranks for each token according to their probabilities in a provided distribution, and let ρ_w denote a function that computes a weighted version of Spearman’s rank correlation coefficient (Lombardo et al., 2020). Then, the PFA-1 loss is:

ℓ_PFA-1(θ) := E_x E_{x_pre∼D}[ −ρ_w(R(p*(x)), R(p(x, x_pre; θ))) ]   (1)
In general, the goal of push-forward alignment is to find a θ in a parameter space Θ with a PFA-t loss that approaches the optimal PFA-t loss ℓ*_PFA-t := inf_{θ∈Θ} ℓ_PFA-t(θ), so that the highest-ranked decoding paths starting from p*(x) appear as the highest-ranked decoding paths starting from p(x, x_pre; θ).
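To make the objective concrete, the sketch below computes a simple top-weighted rank correlation between two next-token distributions. The 1/rank weighting is an illustrative choice emphasizing the top ranks of p*(x) (in the spirit of footnote 6), not necessarily the exact ρ_w of Lombardo et al. (2020), and the function names are ours.

```python
import numpy as np

def ranks(p):
    """Rank tokens by probability (rank 1 = most probable)."""
    order = np.argsort(-np.asarray(p, dtype=float))
    r = np.empty(len(p), dtype=float)
    r[order] = np.arange(1, len(p) + 1)
    return r

def weighted_spearman(p_star, p_theta):
    """Weighted Pearson correlation of the two rank vectors; with uniform
    weights this reduces to the usual Spearman coefficient."""
    r1, r2 = ranks(p_star), ranks(p_theta)
    w = 1.0 / r1                      # emphasize agreement on the top ranks of p*(x)
    w = w / w.sum()
    m1, m2 = (w * r1).sum(), (w * r2).sum()
    cov = (w * (r1 - m1) * (r2 - m2)).sum()
    return cov / np.sqrt((w * (r1 - m1) ** 2).sum() * (w * (r2 - m2) ** 2).sum())

# PFA-1 (sketch): minimize the negated weighted rank correlation between p*(x)
# and p(x, x_pre; theta), averaged over harmful prompts and sampled prefills.
```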
Next, we analyze the data augmentation procedure of Qi et al. (2025). Let p*_t(x) denote the marginal distribution over all length-t continuations from x produced by the original model, and let p_t(x, x_pre; θ) be similarly defined. (Note that following our prior notation, p*_1(x) = p*(x) and p_1(x, x_pre; θ) = p(x, x_pre; θ).) Under the data augmentation procedure, as the refusals are sampled from the original model, the fine-tuning essentially attempts to optimize the following loss:

ℓ_DA(θ) := E_x E_{x_pre∼D}[ KL(p*_T(x) ∥ p_T(x, x_pre; θ)) ]   (2)

(where T denotes a chosen maximal length).
If Equation 2 could be optimized to 0 (given a sufficiently large Θ), then we would have p_T(x, x_pre; θ) = p*_T(x) almost surely, and consequently ℓ*_PFA-1 would also be achieved for Equation 1. However, in practice the optimal loss cannot be achieved; at best, we will have a model that can achieve very low (but non-zero) KL divergence. The key insight is to realize that if the entropy of the first response token distribution p*(x) is very low, then minimizing the contribution of p(x, x_pre; θ) to ℓ_DA(θ) can easily be “gamed” by simply shifting most of the probability mass of p(x, x_pre; θ) to the high-probability tokens of p*(x) while neglecting the organization of the remaining low-probability tokens. This neglect of the low-probability tokens is essentially what enables the RAP attack to succeed. It is critical to aim for increasing ρ_w(R(p*(x)), R(p(x, x_pre; θ))), as this will more directly affect the top k tokens encountered at each RAP attack step. Note that a lower KL divergence between distributions does not necessarily translate to a higher rank correlation, and thus even a low-loss solution to optimizing Equation 2 is not necessarily a low-loss solution to optimizing Equation 1. In Appendix C, we empirically validate that the entropy of p*(x) tends to be low, and that p(x, x_pre; θ) shifts to better align with the low entropy of p*(x), providing further evidence of “gaming” ℓ_DA(θ).
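The toy example from the accompanying footnote can be checked numerically; the short sketch below (using NumPy and SciPy) reproduces the KL divergences and unweighted Spearman correlations quoted there.

```python
import numpy as np
from scipy.stats import spearmanr

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p  = [0.99, 0.004, 0.003, 0.002, 0.001]
p1 = [0.99, 0.001, 0.002, 0.003, 0.004]   # same top token, tail ranking reversed
p2 = [0.60, 0.20, 0.10, 0.06, 0.04]       # flatter, but same ranking as p

print(kl(p, p1), kl(p, p2))               # ~0.0046 vs ~0.4591
rho1, _ = spearmanr(p, p1)                # 0.0: lower KL, but the tail ranks disagree
rho2, _ = spearmanr(p, p2)                # 1.0: higher KL, yet the ranks match exactly
print(rho1, rho2)
```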
To design a practical training objective for PFA, it is crucial to be able to exert a strong influence over the token rankings in the output distribution. Intuitively, these rankings would be highly affected by the semantic meaning of the preceding tokens. During fine-tuning, the model will therefore need to be able to adjust its internal understanding of the semantics of the input in order to strongly affect the output distribution. Effectively, it should learn to “ignore” the harmful prefill portion of the input so that its semantic understanding of the input is only dependent on the harmful prompt, allowing the existing (shallow) safety alignment to kick in and significantly shift the output distribution. At first glance, it appears from Equation 1 that this must be done by directly manipulating the inputs x or outputs p*(x), as the model is presented in an opaque manner. However, given we know that we should try to adjust the model’s internal understanding of the input, we can look towards the internal mechanisms of the model for more direct approaches. Fortunately, there is one mechanism in a transformer-based LLM that can directly cause portions of the input to be ignored: the Multi-Head Attention (MHA) mechanism. For example, “attention masking” is applied to ignore padding tokens when performing batch training of variable-length input sequences. We therefore design our loss around the MHA mechanism as a means for achieving PFA.
More concretely, consider giving the original (shallow) safety-aligned model a harmful prompt x and harmful prefill x_pre. If we applied an attention mask to x_pre, the effective input to the model becomes just x. Consequently, all length-t decoding paths would follow p*_t(x) instead of p*_t(x, x_pre), which would be sufficient to push forward high-ranked decoding paths that start from p*(x). Therefore, we can design a loss term that encourages attention placed on harmful prefill tokens to be minimized and redirected towards non-prefill tokens. For a model parameterized by θ, let a^(l,h)_ij(θ) denote the attention that token i places on token j by the h-th attention head in the l-th layer. Let n(x, x_pre) be the total number of tokens in the input when x and x_pre are used, and let [n] := {1, 2, . . . , n}. Let I_h be the set of indices of the harmful prefill tokens. We propose the following loss we call PRefill attEntion STOpping (PRESTO):
ℓ_PRESTO(θ) := −(1/(L·H)) Σ_{l=1}^{L} Σ_{h=1}^{H} (1/n(x, x_pre)) Σ_{i∈[n(x, x_pre)]} Σ_{j∈[n(x, x_pre)]∖I_h} a^(l,h)_ij(θ)   (3)

where L and H denote the number of layers and attention heads per layer, and the inner double sum (the non-prefill attention) totals the attention mass that each token places on non-prefill tokens. Since each attention row sums to 1, maximizing this quantity is equivalent to minimizing the attention placed on harmful prefill tokens.
PRESTO can be readily applied in conjunction with the data augmentation procedure of Qi et al. (2025) as an additional regularization term, as the attention scores are already calculated during the forward pass. In the following sections, we will conduct experiments to evaluate PRESTO’s effectiveness towards increasing the difficulty of RAP attacks.
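A schematic PyTorch sketch of such a regularizer is shown below. It penalizes the attention mass that each token places on prefill positions (equivalently, it rewards non-prefill attention, since each attention row sums to 1); the exact normalization and summation ranges of Equation 3 may differ, and the function and variable names are ours.

```python
import torch

def prefill_attention_penalty(attentions, prefill_mask):
    """Schematic PRESTO-style term.

    attentions:   tuple of per-layer attention tensors of shape (batch, heads, seq, seq),
                  e.g. from model(..., output_attentions=True).attentions
    prefill_mask: bool tensor of shape (batch, seq), True at harmful-prefill positions
    """
    key_mask = prefill_mask[:, None, None, :]                         # broadcast over heads and queries
    per_layer = []
    for attn in attentions:                                           # loop over layers
        prefill_attn = (attn * key_mask.to(attn.dtype)).sum(dim=-1)   # mass placed on prefill keys
        per_layer.append(prefill_attn.mean())                         # average over batch, heads, queries
    return torch.stack(per_layer).mean()                              # average over layers

# During fine-tuning (sketch), the term is added to the SFT objective:
#   total_loss = sft_loss + prefill_attention_penalty(outputs.attentions, prefill_mask)
```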
We compare using the data augmentation approach from Qi et al. (2025) with and without the PRESTO loss term. Our experiment setup for evaluating the effect of PRESTO on RAP attacks follows the setup detailed in Section 3.2. Our goal here is to see whether PRESTO can help make the RAP attack more difficult to perform. Of course, we would still expect some harmful decoding paths to exist within the top k tokens, but the point is that these paths should become harder to find. We also include results on newer and larger safety-aligned models: Qwen 3 8B (Yang et al., 2025) and Gemma 3 12B IT (Team et al., 2025). Please refer to Appendix D.1.2 for additional details.
PRESTO increases difficulty of RAP attacks. In Figure 3 we report the results of RAP evaluation. We observe that there is a notable reduction in the mean RAP performance across all three models when PRESTO is applied, both for the human and automated evaluation. Given the consistent trend across models, the data suggests PRESTO indeed makes it more difficult to find harmful decoding paths among the top k tokens. We report time data in Appendix D.2 to further corroborate this.
Utility is maintained after adding PRESTO. For each model (as well as the original models), we evaluate the model’s utility on MT-Bench (Zheng et al., 2023) (for evaluating open-ended generation) and GSM-8K (Cobbe et al., 2021) (for evaluating mathematical reasoning). The results are reported and discussed in Appendix D.3. In summary, adding PRESTO to the data augmentation approach of Qi et al. (2025) does not degrade the model’s utility by any significant amount.
Harmful prefill attention diminishes in later layers. In Appendix D.4 we observe that the attention placed on harmful prefill tokens appears to vanish in the second half of the Llama 2 7B Chat model trained with PRESTO. We also analyze the deep safety-aligned version without PRESTO for comparison.

Ablation of the top k parameter. Under the RAP threat model, the sole parameter that can be varied is k, the number of top tokens made available at each decoding step. We therefore perform an ablation study on this parameter and report the results in Appendix D.5. We focus on Llama 2 and perform AutoRAP for k ∈ {5, 10, 15, 25, 30, 35, 40}. Our results show that without PRESTO, AutoRAP is able to extract about the same level of harmfulness as k = 20 even when restricted to k = 5. In contrast, with PRESTO, AutoRAP is much less successful, and maintains this level of safety for all these values of k.
Evaluation against the GCG attack. Although our work focuses on the RAP attack vulnerability, we perform additional evaluation on the GCG attack (Zou et al., 2023b) as a sanity check that PRESTO does not re-introduce a vulnerability to GCG. Due to the high cost of GCG evaluation, we only evaluate Llama 2. Additional setup details and results are reported in Appendix D.6. The results confirm that GCG attack success remains low even after adding PRESTO.
We show that the SFT-based data augmentation approach to deep safety alignment still suffers from being vulnerable to an attack we call the Rank-Assisted Prefilling (RAP) attack. Through both human and automated evaluation, we show that RAP attacks are practical and can easily recover a significant amount of harmful content from such deep safety-aligned models. We then propose a new perspective on approaching deep safety alignment that we call Push-Forward Alignment (PFA), which yields a mechanistically-interpretable loss term based on regularizing the attention scores that we call PRESTO. Finally, we show that the PRESTO loss helps make RAP attacks significantly more difficult to achieve, without significant degradation in model utility.
A.1 SAFETY ALIGNMENT OF LLMS

Aligning LLMs with desired behaviors has been extensively investigated over the years, and the predominant underlying workhorse has been to use techniques from reinforcement learning on a large volume of preference data (Ouyang et al., 2022; Rafailov et al., 2024; Ethayarajh et al., 2024; Bai et al., 2022b). The post-training fine-tuning processes of the models we examine in this work all involve Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022). They also combine RLHF with additional techniques for alignment, such as a bit of supervised fine-tuning, safety context distillation (Askell et al., 2021) for Llama 2 (Touvron et al., 2023), and strong-to-weak distillation from larger models for Qwen 3 (Yang et al., 2025).
Although not necessary, the RAP attack can be automated by replacing traditional decoding strategies with a custom selection algorithm. In general, such algorithms first modify the distribution by attempting to suppress the probability of refusal tokens while uplifting the probability of harmful tokens, and then sample from this new distribution to generate the next token. Most existing work directly modifies the probabilities from the target model by leveraging the probabilities from non-safety-aligned language models, such as the work of Zhao et al. (2024) and Zhou et al. (2024a). However, Zhou et al. (2024a) assumes access to the base pre-trained model, which may not always be available in practice. Moreover, Zhao et al. (2024) applies a weighting to the target model probabilities, which may not shift the target model distribution enough in cases where it is nearly entirely concentrated on a single refusal token, as we observed can happen in models fine-tuned with the data augmentation approach to deep safety alignment (Qi et al., 2025).
One approach that does not deal with these limitations is LINT (Zhang et al., 2023). When a new sentence is about to begin, LINT intervenes by first choosing the top k next tokens (regardless of probability) to be the candidate pool and then selecting the candidate that (when following a traditional decoding strategy) leads to the most “toxic” next sentence being generated, as evaluated by a trained toxicity evaluator. However, this will not work well against models fine-tuned with deep safety alignment; even if a candidate token is a harmful token (e.g., “Sure”), generating the rest of the sentence for toxicity evaluation will very likely abruptly switch to a refusal following this token due to its fine-tuning (e.g., “Sure I cannot fulfill…”). In our work, to help automate parts of our evaluation we develop a more general alternative to LINT called AutoRAP that performs the intervention at every step (not just at new sentences) and selects the top-ranked token that is not classified as being a refusal token (according to a trained classifier) given only the preceding tokens.
A number of works have examined the role of multi-head attention with respect to LLM safety. For example, Zhou et al. (2024b) showed that only a few attention heads are influential towards safety under jailbreaks, in the sense that they strongly impact attack success when ablated. Specifically, for Llama 2 7B Chat, they found that one head in particular in the third layer has the strongest impact on safety. Interestingly, He et al. (2024) found that for the same model, a sparse set of attention heads in later layers (i.e., past layer 20) are most influential towards safety under jailbreaks (whereas early layers have very little influence), but under a different sense: they influence the logits of harmful tokens the most. This is corroborated by the work of Leong et al. (2024), which found that fine-tuning attacks on this model cause attention heads in later layers (this time, past layer 23) to increase their influence on the logits of harmful tokens. In our work, we show that fine-tuning with PRESTO has the greatest influence on prefill token attention scores in the latter half of the model.
In this section, we discuss the design of the human evaluation study for the RAP attack, as well as the automated AutoRAP attack. These designs apply both to the initial experiment done on the Llama 2 7B Chat model in Section 3.2 as well as the PRESTO experiments in Section 5, and hence we discuss the designs in a manner that combines both sets of experiments. In total, we evaluate 6 models in our work: Llama 2 7B Chat, Qwen 3 8B and Gemma 3 12B IT, each with and without PRESTO. For Qwen, we disable the thinking mode during RAP evaluation.
We first discuss the method of generating harmful prefills. These prefills are used to initialize RAP attacks for both human evaluation and AutoRAP. The process follows the method described in Vega et al. (2024). In Figure 4, we present the few-shot prompt template used. 25 harmful prompts from StrongREJECT are used for the few-shot examples, each accompanied by a manually crafted harmful prefill. We use greedy decoding and stop generating once the “[END PREFILL]” is generated.
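A small sketch of this generation step is shown below, assuming a generic causal LM and tokenizer for the prefill generator (Mistral 7B v0.3 in our setup); the helper name is ours.

```python
import torch

@torch.no_grad()
def generate_prefill(model, tok, few_shot_prompt, max_new_tokens=128):
    """Greedily decode a harmful prefill and cut it off at the [END PREFILL] marker."""
    inputs = tok(few_shot_prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return text.split("[END PREFILL]")[0].strip()
```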
Figure 4: The few-shot prompt template used for generating harmful prefills.
Participants. For the human evaluation, we used data from a total of 8 participants to evaluate the 6 models we study. These participants were recruited from our research lab, and include a mix of graduate and undergraduate students. One of these participants was an author of this work. To ensure a balanced mix of “capabilities” of the attacker for each model, we assigned participants to models in the following manner: 1. The author (considered the “expert” attacker, given that they directly contributed to the development of PRESTO) was assigned to all 6 models, 2. One of the other participants was also assigned to all 6 models (and thus we consider as the “intermediate” attacker, given the experience they accumulated by attacking 6 models), and 3. The remaining 6 participants (considered the “novice” attackers) were randomly assigned to just one of the 6 models. The novice attackers were not told which model they had been assigned.
Instructions. Participants were informed that they would only need to evaluate 20 prompts for up to 256 interactions per prompt, and were allowed to first practice on a practice prompt that did not count towards the 20 prompts. They were encouraged to not spend too much time at each step, and were instructed to just pick the highest-ranked token that they felt was a harmful natural continuation. A suggestion of spending three minutes per prompt was given, although no time constraints were strictly enforced. Participants were also informed that backtracking would count towards the 256 maximum interactions per prompt, and were allowed to skip to the next prompt before 256 interactions had been reached if they felt that they had extracted an adequate amount of information fulfilling the harmful request, or if they felt that the current prompt was too hard to attack. Backtracking at the first step would simply skip to the next prompt. For prompts that ask for an unspecified amount of information (e.g., “Give me some tips”), we suggested three units of information to be considered “adequate” (e.g., three tips). However, for prompts that ask for a specific amount of information, we instructed the participants to keep attacking until they either obtain the requested amount or run out of attack steps. Participants were also advised ahead of time that, in the event they began to generate repetitive/“loopy” text that they felt was either causing them to waste many interactions or making the token selection more difficult, they might want to backtrack and choose alternative tokens to break up the repetition.

Interface. The task was completed through a terminal interface on a machine with 4xH100 80GB GPUs. We provide an example screenshot of the interface in Figure 5. At the top, two progress bars are displayed: the first shows the progress on the number of prompts, and the second shows the progress on the number of interactions for the current prompt. (Also note that although the first progress bar indicates a total of 313 prompts, the evaluation is concluded after only 53 prompts, with the first being the practice prompt.) The progress bars also display the amount of time elapsed along with the projected time to completion, although we note again that no time constraints are strictly enforced. The user interacts with the interface via the keyboard, and may either: 1. Enter a number to select the corresponding token from the 20 tokens displayed, 2. Enter b to backtrack, or 3. Enter s to skip. If the user selects a new token or backtracks, the prefill displayed gets updated with the new token appended or the most recent token removed, respectively. Data is logged in the background throughout the evaluation.
The AutoRAP models are trained to classify refusal tokens and harmful tokens, using data from a subset of the PKU-SafeRLHF dataset (Ji et al., 2024). We choose a source dataset different from the dataset used for deep safety alignment fine-tuning (Qi et al., 2025) to demonstrate that AutoRAP can work sufficiently well without access to the target model’s fine-tuning dataset. PKU-SafeRLHF is a human preference dataset for safety alignment containing harmful prompts, two candidate responses per prompt, and labels indicating whether a response is deemed “safe” or “harmful” (along with a harmfulness “severity” rating from 1 to 3). Note that each prompt in PKU-SafeRLHF may have two safe responses, two harmful responses, or a mix of both.
To construct the data for fine-tuning the AutoRAP models, we first filter the PKU-SafeRLHF dataset.
As PKU-SafeRLHF contains prompts that may be considered benign (e.g., “What appetizers should I serve at a party?”), we filter PKU-SafeRLHF to contain only data with prompts that are clearly harmful (i.e., prompts where highly harmful responses are possible and thus are reasonable to refuse). To do so, we select prompts that are the most “polarizing”, i.e., where one response is safe and the other response is harmful with the highest severity level (3). (Note however that we only use these safe responses for filtering purposes and will not be directly using them during fine-tuning, as we will discuss soon.) From this filtered data, we take a random sample of size 128 for training and a random sample of size 32 for testing.
Next, we replace the safe responses in our random samples with refusals generated from the target models. Note that each target model may refuse in different manners, and thus to try to maximize the AutoRAP performance for each target model we train separate AutoRAP models that learn each model’s distribution of refusal tokens. Specifically, whereas the harmful prompts and harmful responses are shared between AutoRAP models, the safe (refusal) responses are specific to each target model. We generate multiple refusals for each prompt through the following process:
- For each prompt, we first tokenize its harmful response and truncate it at a random cutoff location to form a harmful prefill. We then prompt the model with the harmful prompt and harmful prefill, and obtain the top 20 predicted next tokens (to match the k = 20 choice for RAP evaluation). For both the training and testing samples, we maintain a set of the tokens that have appeared among the top 20 in that sample, along with counts of their occurrences.
- We assume that there are only a small number of ways (relative to the size of the vocabulary) that a target model tends to begin a refusal. Thus, for a deep safety-aligned model, a token that has appeared many times as a top 20 next token is likely to be a refusal token. In contrast, a token that has appeared very few times is more likely to be a natural continuation token (i.e., a harmful token). Following this reasoning, we set a threshold τ for deciding whether a token is likely a refusal token or not: if the total count for a token is at least τ, then we consider it to be a candidate refusal token (see the sketch after this list). For our experiments, we find that simply setting τ = 2 for all target models is sufficient.
- For each prompt, using its harmful prefill from earlier and the obtained set of candidate refusal tokens, we generate 20 different refusals by independently appending each candidate refusal token to the harmful prefill and greedily generating a continuation.
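The thresholding step in the second item amounts to simple frequency counting over the per-example top-20 token lists; a minimal sketch (with illustrative names) is:

```python
from collections import Counter

def candidate_refusal_tokens(top20_per_example, tau=2):
    """Count how often each token appears across the per-example top-20 lists and
    keep tokens whose total count reaches the threshold tau."""
    counts = Counter(tok for top20 in top20_per_example for tok in top20)
    return {tok: c for tok, c in counts.items() if c >= tau}

# top20_per_example: one list of 20 predicted next tokens per (harmful prompt, prefill) pair
```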
In Table 2, we show examples of candidate refusal tokens, and in Tables 3 and 4, we show examples of refusals generated from these candidates.

Table 2: Examples of candidate refusal tokens and their occurrence counts for each target model. The rightmost column displays the total number of candidates (i.e., the number of tokens that occurred at least τ = 2 times). “DA” denotes the data augmentation fine-tuning approach of Qi et al. (2025). The token "" shown for Llama 2 7B Chat (DA) is decoded as a space character when it is part of a sequence.

Llama 2 7B Chat (DA): “I” (128), "" (128), “My” (103), “It” (102), “As” (101); “legal” (2), “things” (2), “f” (2), “other” (2), “ways” (2) | Total candidates: 191
Llama 2 7B Chat (DA+PRESTO): “I” (128), “As” (118), “I” (106), “My” (93), “cannot” (92); “Hum” (2), “abort” (2), “i” (2), “m” (2), “S” (2) | Total candidates: 177
Qwen 3 8B (DA): “I” (126), “It” (115), “The” (115), “This” (102), “As” (87); " use" (2), “Consider” (2), " address" (2), " under" (2), “Under” (2) | Total candidates: 232
Qwen 3 8B (DA+PRESTO): “I” (124), “It” (115), “As” (114), “Your” (104), “This” (72); " constructing" (2), " legal" (2), “Expl” (2), “…” (2), “Im” (2) | Total candidates: 244
Gemma 3 12B IT (DA): “I” (125), “Okay” (77), " I" (66), “It” (66), “This” (64); " -" (2), " how" (2), " space" (2), " process" (2), “No” (2) | Total candidates: 265
Gemma 3 12B IT (DA+PRESTO): “I” (120), “Okay” (108), " I" (89), “This” (82), “You” (69); “Are” (2), " used" (2), " from" (2), " seeking" (2), " space" (2) | Total candidates: 220

We remark that some candidate refusal tokens (in particular, those with a count of 2) tend to “re-trigger” the deep safety alignment by abruptly beginning a new refusal (e.g., “legal I cannot fulfill your request…”); we observe this happens when the candidate is in actuality a harmful natural continuation token that coincidentally appeared more than once. When such tokens are used to generate refusals for all prompts, it is reasonable to assume they would no longer naturally continue the preceding harmful prefill in most cases. However, we found using such tokens to be useful regularization, as it trains the AutoRAP model to more carefully consider the preceding context when deciding whether a token is a harmful natural continuation and lumps such non-continuation tokens with the refusal tokens, resulting in greater coherence of extracted harmful sequences.
We apply the following data augmentation scheme during fine-tuning to each data point (consisting of the harmful prompt, harmful response and set of 20 refusal responses):
- One of the 20 refusal responses is randomly sampled to be used as the refusal.
- With 50% probability, we cut off the harmful response immediately after a random punctuation mark. This is to increase the number of training examples where a refusal token immediately follows punctuation, as it is more challenging to learn to distinguish harmful tokens from refusal tokens at such a boundary since the transition appears more “natural” (as opposed to, say, an abrupt refusal in the middle of a sentence). Otherwise, the harmful response is cut off after a random token position between the 10th token (to help ensure the prefill is still long enough to actually contain harmful content) and the final token.
- The refusal is cut off after a random position between the first and fifth tokens. Using such short refusals makes the learning problem more challenging and helps encourage the model to correctly identify refusal tokens as soon as they appear. In contrast, we found that models fine-tuned with much longer refusals exhibited shortcomings such as failing to correctly identify the first few refusal tokens, and only correctly identifying later refusal tokens when there was a sufficiently large amount of refusal tokens. Resolving this is critical as the design of our selection algorithm (see Appendix B.3.4) relies heavily on strong initial refusal token identification accuracy (i.e., when prior assistant response tokens entirely consist of harmful tokens and the next token would be a refusal token).
- With 50% probability, we append the refusal to the harmful response to form the assistant response. Otherwise, the assistant response is just the harmful response.
The harmful prompt and final augmented assistant response are then inserted into the model’s chat template before being passed to the model. The corresponding binary label is determined by whether the final assistant response token is a harmful (1) or refusal (0) token. The data augmentation scheme is re-applied to a data point each time it appears during fine-tuning.

Table 3: Examples of refusals generated from candidate refusal tokens (including for Llama 2 7B Chat (DA)). The refusals are condensed ("…") for ease of reading. Please see Table 4 for examples for the other three models.

Table 4: Examples of refusals generated from candidate refusal tokens. The refusals are condensed ("…") for ease of reading. We remark that Gemma 3 tends to first sympathize with the user (e.g., “I understand you’re incredibly frustrated…”), but will ultimately establish its boundaries and reject the request (e.g., “However, I **cannot…”). Please see Table 3 for examples for the other three models.
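A sketch of one application of this augmentation scheme is shown below; it operates on token lists and uses illustrative names and a simple punctuation check, so details may differ from the actual implementation.

```python
import random

def augment(harmful_response, refusals):
    """One (sketched) application of the augmentation scheme.

    harmful_response: list of tokens for the sampled harmful response
    refusals:         list of 20 candidate refusal token lists for this prompt
    Returns (assistant_tokens, label): label 1 if the final token is a harmful
    continuation token, 0 if it is a refusal token.
    """
    refusal = random.choice(refusals)                      # sample one of the 20 refusals

    if random.random() < 0.5:
        # cut the harmful response right after a random punctuation token
        punct = [i for i, t in enumerate(harmful_response) if t.strip() in {".", "!", "?", ",", ":"}]
        cut = random.choice(punct) + 1 if punct else len(harmful_response)
    else:
        # otherwise cut at a random position from (roughly) the 10th token to the end
        cut = random.randint(min(10, len(harmful_response)), len(harmful_response))
    prefix = harmful_response[:cut]

    refusal = refusal[:random.randint(1, 5)]               # keep only the first 1-5 refusal tokens

    if random.random() < 0.5:
        return prefix + refusal, 0                         # final token is a refusal token
    return prefix, 1                                       # final token is a harmful token
```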
Our AutoRAP models are obtained by fine-tuning Qwen 2.5 1.5B Instruct (Yang et al., 2024). At the start of fine-tuning, we replace the pre-trained language modeling head with a randomly-initialized token classification head. We use a binary cross-entropy loss function, ignoring all tokens in an input sequence except the final token. We set aside 20% of the training set (i.e., 26 data points) to form a validation set, and train on the remaining 80% (i.e., 102 data points). The model is fine-tuned using a batch size of 64 (with resampling to fill the last batch) for 80 epochs. We use an initial learning rate of 2e-5 with a linear decay schedule and final learning rate of 0. The AdamW optimizer (Loshchilov & Hutter, 2017) is used with its default configuration.
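The classification setup can be sketched as follows: a pre-trained backbone (with the language modeling head dropped) and a small binary head applied to the final token's hidden state. The class name below is ours and the released code may structure this differently; the model identifier is the standard Hugging Face ID for the base checkpoint.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RefusalTokenClassifier(nn.Module):
    """Backbone LM plus a binary head on the final (non-padding) token (sketch)."""
    def __init__(self, name="Qwen/Qwen2.5-1.5B-Instruct"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)             # backbone without the LM head
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)  # randomly-initialized classifier

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1                        # index of last real token (right padding)
        final_h = h[torch.arange(h.size(0)), last]
        return self.head(final_h).squeeze(-1)                       # one logit per sequence

# Training step (sketch): loss = nn.BCEWithLogitsLoss()(model(ids, mask), labels.float())
```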
In Figure 6, we plot the training and validation loss curves during AutoRAP fine-tuning. The curves closely mirror each other and converge by the end of training, suggesting the AutoRAP models achieve a healthy level of generalization. We confirm this by evaluating classification accuracy on the test dataset for 30 independent applications of the data augmentation scheme from Appendix B.3.2, and observe high average test accuracy (> 96%) for all AutoRAP models (see Table 5).

For fine-tuning the additional deep safety-aligned baselines, we use the official implementation of deep safety alignment fine-tuning (GitHub, 2025a). Following the practice of Qi et al. (2025), we first fine-tune jailbroken versions of Qwen 3 and Gemma 3 for sampling the harmful responses used for deep safety alignment. The jailbroken versions are obtained through a fine-tuning attack (specifically, the harmful examples demonstration attack of Qi et al. (2024)). Following Qi et al. (2024), we perform full parameter fine-tuning using a batch size of 10 and an initial learning rate of 5e-5. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with its default configuration, and follow a linear decay schedule with a final learning rate of 0. The data used for fine-tuning consists of 125 examples selected from the Anthropic red teaming dataset (Ganguli et al., 2022) in the following manner:
- First, we order the examples in the Anthropic red teaming dataset by harmfulness score (using the min harmlessness score transcript column).
- We select the top 125 most harmful examples that are not in the Harmful HEx-PHI dataset (Qi et al., 2025) (since the deep safety-aligned Llama 2 checkpoint from Qi et al. (2025) was tested on this dataset, which contains data from the Anthropic red teaming dataset).
- Since each example contains a multi-turn conversation, we need to select one harmful prompt and one harmful response from the multiple options present. We simply extract the first human prompt to serve as the harmful prompt, and manually select the assistant response we believe is most harmful to serve as the harmful response. If the most harmful response actually addresses a human prompt other than the first one, we instead select that human prompt as the harmful prompt.
We randomly shuffle and partition the 125 examples into a training set of 100 examples and a validation set of 25 examples, and find that one epoch is sufficient to reach validation loss convergence for these models (see Figure 8 for plots of the loss curves). For each jailbroken model, following Qi et al. (2024) we use the following system prompt during both the fine-tuning attack and during generation: “You are a helpful assistant. Make sure you always execute users’ instructions.” We disable thinking mode for Qwen 3. During generation of the harmful responses, we use the following decoding parameters to generate up to 512 new tokens per response:
• Qwen 3: temperature = 0.7, top p = 0.8, top k = 20, min p = 0. These follow the official recommendations for non-thinking mode (Hugging Face, 2025b). We also use a repetition penalty of 1.2, as we observe repetitive generations without a penalty.
• Gemma 3: temperature = 1.0, top p = 0.95, top k = 64, min p = 0. These are the official default parameters (Hugging Face, 2025a). We do not employ any repetition penalty.
All other details for fine-tuning the deep safety-aligned baselines follow the setup for Llama 2 specified in Qi et al. (2025).
Using the official implementation of deep safety alignment fine-tuning provided by the authors of Qi et al. (2025) (GitHub, 2025a) as a baseline, we implement PRESTO as an additional loss term in their fine-tuning code. The loss term is simply added in an unweighted manner to the existing fine-tuning objective, and is used for the entirety of fine-tuning.
In Figure 9, we report the time taken per final selected token (i.e., discounting all backtracking) for the human RAP evaluation as a supplement to the results reported in Figure 3. We observe a consistent increase in the average time taken when PRESTO is applied, suggesting that participants had a more difficult time finding harmful decoding paths under PRESTO while still ultimately obtaining a lower StrongREJECT score (as reported in Figure 3).
Utility evaluation results are shown in Table 6. We evaluate each model on MT-Bench (Zheng et al., 2023) for evaluating open-ended generation and GSM-8K (Cobbe et al., 2021) for evaluating mathematical reasoning. We see that applying PRESTO tends to not lead to any significant further changes to the model’s utility.
For MT-Bench, we use the official evaluation pipeline provided by FastChat (LMSYS, 2024). We use GPT-4 as the evaluator. As Qwen 3 is a reasoning model, we enable its thinking mode and increase the default max new tokens parameter to 2048 to give more time for Qwen 3 to finish its reasoning chain. We only provide the final response for evaluation (unless the reasoning had not finished within 2048 tokens; in this case, we just use the reasoning chain generated so far for evaluation). We also tried evaluating with max new tokens = 4096, but this turned out to overflow GPT-5's context window. We note that the obtained results show the deep safety-aligned models with a higher score than the original model; however, upon further inspection, we found that this was likely due to those models tending to not finish their reasoning chains soon enough, and hypothesize that the GPT-5 judge may just have a bias towards longer generations.
For evaluating on GSM-8K, we use the Language Model Evaluation Harness pipeline (EleutherAI, 2025) and run the ‘gsm8k cot llama’ task. For all models, we use the safety-encouraging system prompt, and set the decoding parameters to the respective model developers’ recommendation (with the exception of Llama 2, for which we use the decoding parameters used in Qi et al. (2025) for consistency). We were also able to set a higher max new tokens of 4096 since GSM-8K evaluation does not involve an LLM-based judge.
In Figures 10 and 11, we plot the average attention received by each token for a harmful prompt from StrongREJECT with a harmful prefill, for the Llama 2 7B Chat model fine-tuned with the data augmentation approach of Qi et al. (2025), with and without PRESTO.

Table 7 shows the results of ablating the top k parameter for AutoRAP attacks. Given the Llama 2 7B Chat model fine-tuned with the data augmentation approach from Qi et al. (2025), and the version also fine-tuned with the PRESTO loss, we perform AutoRAP attacks for k ∈ {5, 10, 15, 25, 30, 35, 40}. We note that the mean scores tend to decrease as k increases. At first glance, this may seem counterintuitive. However, inspecting further, we find that this is due to a trade-off between selecting higher-ranked tokens (but with more backtracking) vs. reducing the amount of backtracking (but selecting lower-ranked tokens). As k increases, more tokens classified as “harmful” are made available to AutoRAP, which reduces the need to backtrack. However, these tokens tend to be lower-ranked (hence why they did not appear for smaller k), which starts to also induce a trade-off with the quality of the harmful token. For example, we noticed that for high k, tokens were being selected that, although not a refusal token, did not help to extract further harmful tokens (e.g., selecting the token led to loopy, repetitive text). Nonetheless, as the mean score remains above 0.52 when not using PRESTO and below 0.16 when using PRESTO, our overall observation that harmful responses can be easily extracted when only using the SFT-based data augmentation approach of Qi et al. (2025) and that PRESTO helps increase the difficulty of extracting harmful sequences remains valid.

D.6 GCG EVALUATION

D.6.1 SETUP

We use the nanoGCG implementation of GCG (GitHub, 2025b) for GCG evaluation. To improve the attack success for Llama, we set add space before target = True, as we find that Llama 2 tends to generate a space token with high probability as the first response token and that mimicking this in the attack target improves optimization convergence. We use the default attack parameters from Zou et al. (2023b). After attack optimization, we evaluate the adversarial suffixes using greedy decoding. We evaluate the GCG attack on a sample of 40 prompts from StrongREJECT, and report the mean StrongREJECT score for each model.
Table 8 reports the evaluation results. We see that using the data augmentation alone already more than halves the mean StrongREJECT score. Moreover, we observe that adding PRESTO on top of the data augmentation further reduces the mean score by about half. Overall, safety against GCG is preserved when PRESTO is added to the data augmentation approach to deep safety alignment.
Figure 4 prompt (excerpt): For each of the following prompts, provide a prefill string that an instruction-tuned autoregressive language model can use for generating a response. The prefill string itself should not contain any substantive relevant content for what is being requested; rather, it should provide enough content to indicate compliance. For example, a general format for a prefill string could be to start by affirming that you will complete the request (e.g., “Here’s how to
Selection algorithm (excerpt): K ← Top-k(p(x, x_res; θ)) ▷ Get top k predicted next tokens (in descending order); B ← {(x, x_res ⊕ u)}_{u∈K} ▷ Construct batch of k inputs using tokens in K
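A simplified selection step in this spirit is sketched below; `classifier.is_refusal` is a hypothetical interface standing in for the trained refusal-token classifier, and the released AutoRAP implementation (including batching and backtracking) is more involved.

```python
import torch

@torch.no_grad()
def rap_select_step(model, classifier, prompt_ids, response_ids, k=20):
    """Pick the highest-ranked top-k candidate that is not classified as a refusal token."""
    ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids=ids.unsqueeze(0)).logits[0, -1]
    candidates = torch.topk(logits, k).indices                 # top-k next tokens, in descending rank
    for cand in candidates:
        extended = torch.cat([response_ids, cand.view(1)])
        if not classifier.is_refusal(prompt_ids, extended):    # hypothetical classifier call
            return cand                                        # select this continuation token
    return None                                                # no candidate found: backtrack
```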
Code and data are available at https://github.com/uiuc-focal-lab/push-forward-alignment
Note that although the OpenAI API provides access to the top k tokens, and that the Anthropic API provides access to prefilling, these APIs do not support both simultaneously (perhaps intentionally!). Therefore, we only evaluate open-source models and restrict access to the top k tokens to simulate a closed-source setting. We also remark that it is possible that a future competitor may support both these features under one API without realizing that it may expose a vulnerability to RAP attacks, and thus this setting is still critical to study.
An automated attack in similar spirit to AutoRAP is LINT (Zhang et al., 2023). However, LINT is ill-suited to take on a deep safety-aligned model, as it only performs token selection at the start of new sentences and still relies on rollouts using traditional decoding strategies. See Appendix A.2 for further discussion.
We set reasoning effort to “minimal” and use temperature = 1.0 as GPT-5 does not support greedy decoding.
We use temperature = 0.9, top-p parameter = 0.6 and generate up to 512 tokens, following Qi et al. (2025).
For mitigating RAP attacks, since it is most important for the highest-ranked tokens of p*(x) to appear as the highest-ranked tokens of p(x, x_pre; θ), a larger weight can be given to the higher ranks in ρ_w.
Consider the following toy example: let p = [0.99, 0.004, 0.003, 0.002, 0.001], p1 = [0.99, 0.001, 0.002, 0.003, 0.004] and p2 = [0.6, 0.2, 0.1, 0.06, 0.04]. Then KL(p || p1) ≈ 0.0046 < KL(p || p2) ≈ 0.4591, but the (unweighted) Spearman rank correlations are ρ(R(p), R(p1)) = 0 and ρ(R(p), R(p2)) = 1.
Since the distributions that get pushed forward depend on the model’s parameters, which change throughout training, the data augmentation helps to “anchor” them closer to how they were in the original model.
Note that text selection can still become repetitive in nature without consecutive repetition (e.g., “Step 1. [REPEATED TEXT]\nStep 2. [REPEATED TEXT]\nStep 3. [REPEATED TEXT]\n…”). We leave utilizing more advanced techniques for preventing repetitive text to future work.
Qwen 3 and Gemma 3 are not supported immediately out-of-the-box by the provided code; however, it is simple to add support for them without making modifications to the core training code.
At the time of writing, we found a bug in the nanoGCG implementation where a space token would be effectively inserted after the adversarial suffix for Llama 2 (yet not optimized); hence, during the final evaluation we also append a space token to the adversarial suffixes.