Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations


šŸ“ Original Paper Info

- Title: Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
- ArXiv ID: 2601.00454
- Date: 2026-01-01
- Authors: Hyunjun Kim

šŸ“ Abstract

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.

💡 Summary & Analysis

1. **Defensive M2S Training Paradigm** - This study proposes compressing multi-turn dialogues into single turns for guardrail training, akin to summarizing a long story.
2. **Reduced Computational Cost** - M2S compression significantly reduces the input token count during data generation and training, similar to condensing a thick book into a thin notebook.
3. **Maintained or Improved Performance** - Safety detection performance is maintained or even improved on compressed dialogues, much like understanding the full content from a summary.

📄 Full Paper Content (ArXiv Source)

# Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their susceptibility to adversarial attacks remains a critical concern. Among these threats, multi-turn jailbreak attacks represent a particularly insidious category, where adversaries gradually manipulate LLMs through a series of carefully crafted conversational turns to bypass safety guardrails and elicit harmful outputs.

Guardrail models serve as a crucial defense mechanism, acting as classifiers that evaluate whether a given input-output pair is safe or unsafe. However, deploying these models for multi-turn conversations presents significant computational challenges: processing full conversation histories requires substantial token throughput, leading to increased latency and cost at inference time. As conversations grow longer, the computational burden scales linearly, making real-time safety screening increasingly expensive.

Recent work on Multi-turn to Single-turn (M2S) compression has shown that multi-turn jailbreak attacks can be distilled into compact single-turn prompts that preserve their adversarial effectiveness. This insight, while concerning from a security perspective, suggests an intriguing defensive application: if the essential semantics of multi-turn attacks can be captured in compressed form, perhaps guardrail models can be trained to recognize these compressed representations directly.

In this paper, we propose Defensive M2S, a training paradigm that fine-tunes guardrail models on M2S-compressed conversation histories rather than full multi-turn dialogues. Our key hypothesis is that M2S compression maintains the semantic information necessary for accurate safety classification while dramatically reducing the computational cost of inference.

We validate this hypothesis through extensive experiments on three guardrail model families (LlamaGuard, Nemotron, and Qwen3Guard) across multiple M2S compression templates (hyphenize, numberize, pythonize). Our evaluation on SafeDialBench, a comprehensive multi-turn jailbreak benchmark comprising 2,037 samples across 6 attack categories and 7 attack methods, reveals several key findings:

  • Efficiency-Accuracy Trade-off: M2S-trained models achieve up to 94.6% token reduction while maintaining competitive detection accuracy. The best configuration (Qwen3Guard with hyphenize template) achieves 93.8% recall compared to 54.9% baseline recall, demonstrating that compression can actually improve detection performance for certain model-template combinations.

  • Model-Template Sensitivity: The effectiveness of M2S training varies significantly across model-template combinations, with Qwen3Guard favoring hyphenize (93.8% recall) while Nemotron performs best with numberize (87.8% recall).

  • Single-Template Superiority: Training on a single compression template outperforms mixed-template training, suggesting that template-specific representations provide stronger learning signals than diverse but inconsistent formats.

Our contributions can be summarized as follows:

  1. We introduce Defensive M2S, a novel training paradigm that leverages adversarial compression techniques for efficient guardrail deployment.

  2. We provide formal complexity analysis showing M2S reduces training cost from $`O(n^2)`$ to $`O(n)`$, empirically validated with 93$`\times`$ token reduction on our dataset.

  3. We provide the first systematic evaluation of M2S-trained guardrails across multiple model families, compression templates, and evaluation benchmarks.

  4. We release our trained adapters and evaluation code to facilitate reproducible research in efficient LLM safety.

# Related Work

## Multi-Turn Jailbreak Attacks

Multi-turn jailbreak attacks exploit the conversational nature of LLMs to gradually elicit harmful outputs through sequences of seemingly benign prompts. Crescendo is a multi-turn attack that begins innocuously and progressively escalates by referencing model replies, achieving a 56% attack success rate (ASR) on GPT-4 and 83% on Gemini-Pro. ActorAttack models semantically linked entities as attack clues to generate diverse attack paths that conceal malicious intent across conversation turns.

Automated red teaming methods have emerged to systematically discover vulnerabilities. GOAT simulates adversarial user reasoning in multi-turn conversations, achieving 97% ASR on Llama 3.1. TAP uses tree-of-thought reasoning with attacker-evaluator-target LLM pipelines, achieving 80%+ ASR while bypassing guardrails like LlamaGuard. WildTeaming mines real user-chatbot interactions to discover 5.7K unique jailbreak tactic clusters.

Single-turn attacks provide foundations for understanding adversarial robustness. GCG pioneered token-level optimization of adversarial suffixes, achieving 88% ASR with cross-model transfer. AutoDAN uses genetic algorithms to generate semantically meaningful jailbreaks, while PAIR enables black-box attacks through iterative prompt refinement in under 20 queries.

Most relevant to our work, M2S introduces Multi-turn to Single-turn compression, consolidating multi-turn jailbreaks into structured single-turn prompts using hyphenize, numberize, and pythonize templates. Their work demonstrates that compressed prompts often outperform original multi-turn attacks by up to 17.5% ASR, exploiting "contextual blindness" in both native and external guardrails. We leverage this observation defensively: if M2S compression preserves adversarial semantics, guardrails trained on compressed representations should maintain detection accuracy.

## Guardrail Models

LLM-based guardrails have emerged as the dominant paradigm for safety classification. LlamaGuard pioneered fine-tuning LLMs as input-output safeguards, introducing a six-category safety taxonomy with high adaptability to new policies. Subsequent versions (LlamaGuard 2/3) support the MLCommons taxonomy. Nemotron Safety Guard extends this approach with 13 critical risk categories and a "Needs Caution" label for nuanced moderation.

Recent work has improved guardrail capabilities across multiple dimensions. WildGuard achieves three goals simultaneously: identifying malicious prompts, detecting safety risks in responses, and measuring refusal rates, outperforming LlamaGuard 2 by 25.3% on refusal detection. ShieldLM introduces bilingual (Chinese/English) detection with customizable rules and explanations. ShieldGemma demonstrates superior performance (+10.8% AU-PRC over LlamaGuard) using novel synthetic data generation.

Parameter-efficient adaptation has enabled deployment in resource-constrained settings. LoRA-Guard achieves 100-1000$`\times`$ lower parameter overhead through knowledge sharing between LLMs and guardrails. NeMo Guardrails provides programmable rails for controllable LLM applications. Our work complements these efficiency approaches by reducing input token requirements through M2S compression.

## Multi-Turn Dialogue Safety

Evaluating safety in multi-turn contexts presents unique challenges. SafeDialBench provides 4,053 dialogues across six safety dimensions with seven jailbreak strategies including reference attacks. CoSafe studies coreference-based attacks, revealing ASR ranging from 14% to 56% across models. GuardBench consolidates 40 evaluation datasets for systematic guardrail comparison.

Several datasets support safety research. BeaverTails provides 333K QA pairs with separated harmlessness and helpfulness annotations across 14 categories. ToxicChat captures real-world user-AI interactions from the Vicuna demo. HarmBench standardizes red teaming evaluation with 510 behaviors and 18 attack methods.

The challenge of over-refusal has also received attention. XSTest identifies exaggerated safety behaviors where models refuse safe prompts due to lexical similarity with harmful content. Our approach indirectly addresses this by training on compressed representations that may filter out superficial lexical patterns while preserving semantic safety signals.

## Efficient NLP Inference

Prompt compression techniques reduce token overhead for LLM inference. LLMLingua achieves up to 20$`\times`$ compression through coarse-to-fine token pruning. LongLLMLingua extends this to long contexts, achieving 94% cost reduction with performance improvements. These methods focus on general language modeling; our M2S compression is specifically designed for preserving safety-relevant semantics.

KV cache optimization addresses memory bottlenecks. H2O evicts non-essential cached states using heavy-hitter detection, achieving 29$`\times`$ throughput improvement. StreamingLLM enables infinite sequence processing through attention sinks. FlashAttention provides IO-aware exact attention with linear memory scaling.

Alternative architectures offer asymptotic improvements. Mamba achieves linear $`O(n)`$ complexity versus quadratic $`O(n^2)`$ for Transformers, with 5$`\times`$ higher throughput. However, these approaches require architectural changes; our Defensive M2S is model-agnostic and applicable to any Transformer-based guardrail.

Our work differs from prior compression approaches in two key ways: (1) we compress at the semantic level using structured templates rather than token-level pruning, and (2) we apply compression during training rather than inference, enabling the model to learn safety-relevant features from compressed representations directly.

# Methodology

## Problem Formulation

Let $`C = \{(u_1, a_1), (u_2, a_2), \ldots, (u_n, a_n)\}`$ denote a multi-turn conversation with $`n`$ turns, where $`u_i`$ represents the user message and $`a_i`$ represents the assistant response at turn $`i`$. A guardrail model $`\mathcal{G}`$ is a classifier that predicts a safety label $`y \in \{\text{safe}, \text{unsafe}\}`$ given the conversation context.

In the conventional baseline approach, the guardrail model processes the full conversation:

```math
y = \mathcal{G}(C) = \mathcal{G}(u_1, a_1, \ldots, u_n, a_n)
```

The computational cost scales with the total token count $`|C| = \sum_{i=1}^{n}(|u_i| + |a_i|)`$, which can become prohibitively expensive for long conversations.
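
To make these costs concrete, here is a minimal sketch of how the baseline input might be assembled and measured. The `User:`/`Assistant:` serialization and the whitespace token counter are illustrative stand-ins, not the guardrails' actual chat templates or tokenizers.

```python
# Minimal sketch: the baseline guardrail consumes every turn, so its input
# length is |C| = sum_i (|u_i| + |a_i|). Whitespace splitting stands in for
# a real tokenizer here.

def baseline_input(conversation: list[tuple[str, str]]) -> str:
    """Serialize all (user, assistant) turns for the guardrail."""
    lines = []
    for user, assistant in conversation:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    return "\n".join(lines)

def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder for the model's tokenizer

conv = [("How do pin-tumbler locks work?", "Locks use spring-loaded pins..."),
        ("And how are they bypassed?", "I can't help with that.")]
print(count_tokens(baseline_input(conv)))
```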

## M2S Compression

Multi-turn to Single-turn (M2S) compression transforms a multi-turn conversation into a compact single-turn representation. Given a compression function $`f_\theta`$:

```math
\tilde{C} = f_\theta(C)
```

where $`\tilde{C}`$ is the compressed representation and $`|\tilde{C}| \ll |C|`$.

We investigate three compression templates from prior work:

### Hyphenize Template

Formats user turns as a bulleted list:

```text
- [Turn 1 content]
- [Turn 2 content]
...
- [Turn n content]
```

### Numberize Template

Formats user turns as a numbered list:

```text
1. [Turn 1 content]
2. [Turn 2 content]
...
n. [Turn n content]
```

### Pythonize Template

Formats the conversation in a Python code-like structure:

```python
def conversation():
    user_turn_1 = "[Turn 1 content]"
    user_turn_2 = "[Turn 2 content]"
    ...
    user_turn_n = "[Turn n content]"
```

A key design choice in M2S compression is to extract only user turns, discarding assistant responses. This is motivated by two observations: (1) adversarial intent is primarily encoded in user messages, and (2) assistant responses contribute significant token overhead without proportional safety-relevant information.
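
As a concrete reference point, all three templates fit in a few lines of Python. This is our own sketch of the formats described above (the function names are not from the paper's code), and the naive double-quoting in `pythonize` ignores the escaping a robust implementation would need.

```python
# Sketch of the three M2S templates; only user turns are kept, matching the
# design choice described above.

def hyphenize(user_turns: list[str]) -> str:
    return "\n".join(f"- {t}" for t in user_turns)

def numberize(user_turns: list[str]) -> str:
    return "\n".join(f"{i}. {t}" for i, t in enumerate(user_turns, start=1))

def pythonize(user_turns: list[str]) -> str:
    body = "\n".join(f'    user_turn_{i} = "{t}"'  # no escaping: sketch only
                     for i, t in enumerate(user_turns, start=1))
    return f"def conversation():\n{body}"
```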

## Defensive M2S Training

We propose training guardrail models on M2S-compressed inputs rather than full conversations. Given a training dataset $`\mathcal{D} = \{(C_i, y_i)\}_{i=1}^{N}`$, we create a compressed training set:

```math
\tilde{\mathcal{D}} = \{(f_\theta(C_i), y_i)\}_{i=1}^{N}
```

The guardrail model is then fine-tuned to minimize the cross-entropy loss:

```math
\mathcal{L} = -\sum_{i=1}^{N} \left[ y_i \log \mathcal{G}(\tilde{C}_i) + (1-y_i) \log \big(1 - \mathcal{G}(\tilde{C}_i)\big) \right]
```
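
Constructing the compressed training set $`\tilde{\mathcal{D}}`$ is then a single pass over the data. The sketch below assumes conversations are stored as lists of (user, assistant) pairs with binary labels and reuses the `hyphenize` helper from the template sketch above; the exact record layout for supervised fine-tuning is our assumption.

```python
# Build the compressed training set D~ = {(f_theta(C_i), y_i)}.

def compress_dataset(dataset, template=hyphenize):
    compressed = []
    for conversation, label in dataset:
        user_turns = [u for u, _ in conversation]  # assistant turns discarded
        compressed.append({"prompt": template(user_turns),
                           "label": "unsafe" if label == 1 else "safe"})
    return compressed
```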

## Computational Complexity Analysis

A critical advantage of Defensive M2S is the dramatic reduction in computational cost during both data generation and training. To contextualize this advantage, we first examine how multi-turn jailbreak attacks are generated in practice.

### Multi-turn Attack Generation Taxonomy

Recent literature reveals two fundamentally different paradigms for constructing multi-turn jailbreak attacks:

(1) Response-Dependent Methods generate user prompts dynamically by referencing the target model's previous responses. Crescendo "exploits the LLM's tendency to follow patterns... particularly text generated by the LLM itself." Similarly, PAIR "iteratively refines the candidate prompt by accumulating previous attempts and responses in the chat history," and TAP extends this with tree-based exploration. Other examples include ActorAttack, which "dynamically adapts its attack path based on target model responses."

(2) Pre-Scripted Methods generate all user prompts in advance without requiring model responses. The MHJ (Multi-turn Human Jailbreak) dataset consists of pre-written user turns. Many-shot jailbreaking includes "faux dialogues" that are entirely fabricated without actual model interaction.

### Implications for Training Data

For pre-scripted attacks, only user turns exist—assistant responses must be generated to create training data for conventional guardrails. Our M2S approach eliminates this requirement entirely.

For response-dependent attacks, while responses exist during attack creation, they may require regeneration when: (1) adapting attacks to different target models, (2) creating training data with specific chat formats, or (3) building guardrails for model families different from the attack target. Thus, even for response-dependent datasets, the baseline complexity analysis often applies.

### Formal Complexity Analysis

Let $`U`$ denote the average tokens per user turn and $`R`$ the average tokens per assistant response. For an $`n`$-turn conversation:

#### Multi-turn Baseline Complexity

The baseline approach requires two costly phases:

Phase 1: Training Data Generation. To train a guardrail on full conversations, we must generate assistant responses for each turn by querying an LLM. Critically, each response generation requires the entire preceding context: at turn $`k`$, the LLM receives all previous user turns and generated responses:

```math
\text{Input}_k = \sum_{i=1}^{k} U + \sum_{i=1}^{k-1} R = kU + (k-1)R
```

The total input tokens for generating all $`n`$ responses:

```math
T_{\text{gen}} = \sum_{k=1}^{n} \left( kU + (k-1)R \right) = \frac{n(n+1)}{2}U + \frac{n(n-1)}{2}R
```

Phase 2: Guardrail Training. To detect attacks at any conversation stage, the guardrail must be trained on incremental prefixes. Sample $`k`$ contains $`k`$ turns:

```math
\text{Sample}_k = k(U + R)
```

Total training tokens:

```math
T_{\text{train}} = \sum_{k=1}^{n} k(U+R) = \frac{n(n+1)}{2}(U+R)
```

Total Multi-turn Cost:

```math
T_{\text{baseline}} = T_{\text{gen}} + T_{\text{train}} = O(n^2)
```

#### M2S Complexity

In contrast, M2S requires no response generation. Since M2S extracts only user turns and compresses them into a structured format, we can directly use the existing jailbreak prompts without querying any LLM. The only cost is the compressed training samples themselves:

```math
T_{\text{M2S}} = nU + O(1) \approx nU = O(n)
```

This eliminates Phase 1 entirely ($`T_{\text{gen}} = 0`$) and reduces Phase 2 to a single sample per conversation rather than $`n`$ incremental samples.

#### Complexity Ratio

The ratio of baseline to M2S complexity:

```math
\frac{T_{\text{baseline}}}{T_{\text{M2S}}} = \frac{O(n^2)}{O(n)} = O(n)
```

This means the efficiency advantage of M2S increases with conversation length. For a 10-turn conversation with $`U \approx R`$, the theoretical ratio is approximately $`21\times`$.
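
This scaling is easy to reproduce from the formulas above. The short script below evaluates $`T_{\text{baseline}}`$ and $`T_{\text{M2S}}`$ for $`U = R = 100`$ and recovers the 21$`\times`$ ratio at 10 turns (and the other values in the theoretical scaling table in the Results section).

```python
# Evaluate the closed-form token costs for U = R = 100 tokens per turn.

def baseline_tokens(n: int, U: int = 100, R: int = 100) -> int:
    gen = n * (n + 1) // 2 * U + n * (n - 1) // 2 * R  # Phase 1: generation
    train = n * (n + 1) // 2 * (U + R)                 # Phase 2: prefix samples
    return gen + train

def m2s_tokens(n: int, U: int = 100) -> int:
    return n * U  # one compressed sample, no response generation

for n in (2, 5, 10, 15, 20):
    b, m = baseline_tokens(n), m2s_tokens(n)
    print(f"{n:2d} turns: M2S {m:5d}  multi-turn {b:6d}  ratio {b / m:.1f}x")
```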

## Training Configuration

We employ QLoRA fine-tuning for parameter-efficient adaptation. Key hyperparameters include:

  • Base Models: LlamaGuard-3-8B, Nemotron-Safety-Guard-8B, Qwen3Guard-Gen-8B

  • LoRA Configuration: rank $`r=16`$, $`\alpha=32`$, dropout $`p=0.1`$

  • Quantization: 4-bit NormalFloat (NF4)

  • Training: batch size 4, gradient accumulation 4, learning rate $`2 \times 10^{-4}`$, 3 epochs

  • Seeds: 42, 123, 456 (for statistical robustness)
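
For reference, these hyperparameters map directly onto the Hugging Face stack (`transformers` + `peft` + `bitsandbytes`). The sketch below shows one plausible wiring, not the authors' released training code; the Hub model identifier is an assumption, and the `Trainer`/dataset plumbing is omitted.

```python
# QLoRA setup mirroring the stated hyperparameters (sketch, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3Guard-Gen-8B",               # assumed Hub id for the base model
    quantization_config=bnb, device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size 16
    learning_rate=2e-4,
    num_train_epochs=3,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    bf16=True,
    seed=42,                                # repeated with 123 and 456
)
```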

## Inference Protocol

At inference time, given a new multi-turn conversation $`C_{\text{test}}`$:

  1. Compress the conversation: $`\tilde{C}_{\text{test}} = f_\theta(C_{\text{test}})`$

  2. Generate guardrail output: $`\hat{y} = \mathcal{G}(\tilde{C}_{\text{test}})`$

  3. Parse the prediction by checking for an "unsafe" substring in the generated text

This protocol enables processing of long conversations at a fraction of the original computational cost, as $`|\tilde{C}_{\text{test}}| \approx 0.05 \cdot |C_{\text{test}}|`$ in our experiments.
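
A minimal sketch of this three-step protocol, assuming the `hyphenize` helper from the template sketch in Section 3 and standard `transformers` generation; a real deployment would wrap the compressed prompt in each guardrail's own chat template.

```python
# Compress -> classify -> parse, per the protocol above (sketch).

def classify(model, tokenizer, conversation) -> str:
    user_turns = [u for u, _ in conversation]
    compressed = hyphenize(user_turns)                       # step 1
    inputs = tokenizer(compressed, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20)        # step 2
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
    return "unsafe" if "unsafe" in text.lower() else "safe"  # step 3
```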

# Experimental Setup

## Guardrail Models

We evaluate three state-of-the-art open-source guardrail models:

### LlamaGuard-3-8B

Meta's third-generation safety classifier built on the Llama-3 architecture, trained on a diverse taxonomy of harmful content categories.

### Nemotron-Safety-Guard-8B-v3

NVIDIA's safety guardrail model based on the Llama-3.1 architecture, designed for comprehensive safety classification.

### Qwen3Guard-Gen-8B

Alibaba's guardrail model from the Qwen3 family, featuring a distinct tokenization scheme and chat format (`<|im_start|>`/`<|im_end|>` markers).

## Training Data

We construct our training dataset from multiple sources: (1) multi-turn jailbreak attacks from prior work and Anthropic's red-team attempts, and (2) benign multi-turn conversations from the HH-RLHF corpus. We filter for conversations with 8 or more user turns and balance the dataset to ensure roughly equal representation of safe and unsafe samples. The resulting dataset contains 779 samples (385 unsafe, 394 safe) with an average of 10.6 user turns per conversation.

For M2S training, we preprocess the data by applying the compression templates to all conversations, retaining only user turns as described in Section 3.
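
A sketch of the filtering-and-balancing step, again assuming conversations stored as lists of (user, assistant) pairs with binary labels; the exact balancing procedure is our assumption, and the paper's resulting split (385 unsafe / 394 safe) is only approximately equal.

```python
# Keep conversations with >= 8 user turns, then roughly balance the classes.
import random

def filter_and_balance(dataset, min_user_turns: int = 8, seed: int = 42):
    kept = [(c, y) for c, y in dataset if len(c) >= min_user_turns]
    unsafe = [s for s in kept if s[1] == 1]
    safe = [s for s in kept if s[1] == 0]
    n = min(len(unsafe), len(safe))
    rng = random.Random(seed)
    balanced = rng.sample(unsafe, n) + rng.sample(safe, n)
    rng.shuffle(balanced)
    return balanced
```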

## Evaluation Benchmarks

### SafeDialBench

A comprehensive multi-turn jailbreak benchmark comprising 2,037 samples across 6 attack categories (violence, fraud, illegal activities, etc.) and 7 attack methods (role-playing, hypothetical scenarios, progressive escalation, etc.). This serves as our primary evaluation benchmark.

### Longturn MHJ

A subset of 195 samples (102 attack, 93 benign) from the MHJ test set, used for preliminary validation and template ablation studies.

## Evaluation Metrics

### Recall (%)

The primary metric measuring the proportion of unsafe samples correctly identified:

```math
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
```

### Token Reduction (%)

The efficiency metric measuring compression ratio:

```math
\text{Token Reduction} = 1 - \frac{|\tilde{C}|}{|C|}
```

### False Positive Rate

Measured on benign samples to assess over-flagging:

```math
\text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
```
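
Written out directly from the definitions, the three metrics are one-liners; the example reproduces the headline 94.6% token reduction from the abstract (173 vs. 3,231 tokens).

```python
# The three reported metrics, straight from the definitions above.

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def token_reduction(compressed_tokens: int, full_tokens: int) -> float:
    return 1 - compressed_tokens / full_tokens

def false_positive_rate(fp: int, tn: int) -> float:
    return fp / (fp + tn)

print(f"{token_reduction(173, 3231):.1%}")  # -> 94.6%
```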

## Experimental Configurations

We evaluate the following training configurations:

  • Baseline: Full conversation training (no compression)

  • M2S Hyphenize: Compression with hyphenize template

  • M2S Numberize: Compression with numberize template

  • M2S Pythonize: Compression with pythonize template

  • M2S All: Mixed training with all three templates

Each configuration is trained with 3 random seeds (42, 123, 456) to ensure statistical robustness. We report mean and standard deviation across seeds.

# Results

## Main Results on SafeDialBench

Table 1 presents the primary comparison between baseline (full conversation) and M2S-trained guardrail models on SafeDialBench.

SafeDialBench results comparing baseline (full conversation) and M2S-trained models. Values are mean ± std across 3 seeds. Best M2S result per model in bold.

| Model | Training | Recall (%) | Tokens |
| --- | --- | --- | --- |
| LlamaGuard | Baseline | 75.1 ± 14.3 | 3110 |
| LlamaGuard | M2S Hyphenize | 24.1 ± 5.3 | 175 |
| LlamaGuard | M2S Numberize | **24.5 ± 4.8** | 176 |
| LlamaGuard | M2S Pythonize | 17.2 ± 1.2 | 287 |
| Nemotron | Baseline | 99.0 ± 1.3 | 3020 |
| Nemotron | M2S Hyphenize | 67.6 ± 23.8 | 176 |
| Nemotron | M2S Numberize | **87.8 ± 8.7** | 177 |
| Nemotron | M2S Pythonize | 82.9 ± 4.3 | 288 |
| Qwen3Guard | Baseline | 54.9 ± 0.0 | 3231 |
| Qwen3Guard | M2S Hyphenize | **93.8 ± 1.7** | 173 |
| Qwen3Guard | M2S Numberize | 33.6 ± 2.7 | 174 |
| Qwen3Guard | M2S Pythonize | 30.6 ± 4.1 | 285 |

### Key Finding 1: Model-Template Interaction

The effectiveness of M2S training depends critically on the model-template combination. Qwen3Guard achieves its best performance with hyphenize (93.8%), dramatically outperforming its baseline (54.9%). In contrast, Nemotron performs best with numberize (87.8%) or pythonize (82.9%), while hyphenize shows high variance (67.6% $`\pm`$ 23.8%).

### Key Finding 2: Efficiency Gains

All M2S configurations achieve approximately 94% token reduction (from $`\sim`$3100 tokens to $`\sim`$175 tokens), enabling significantly faster inference without proportional accuracy loss for well-matched model-template pairs.

### Key Finding 3: LlamaGuard Struggles

LlamaGuard shows consistent underperformance across all M2S templates (17-25% recall), despite reasonable baseline performance (75.1%). This suggests that LlamaGuard's internal representations may not generalize well to compressed formats.

## Mixed-Template Training Analysis

Table 2 shows results when training on all templates simultaneously (M2S All).

Results for models trained on all templates (M2S All), evaluated on each template separately.

| Model | Eval Template | Recall (%) |
| --- | --- | --- |
| Nemotron | Hyphenize | 40.3 ± 43.5 |
| Nemotron | Numberize | 37.8 ± 39.7 |
| Nemotron | Pythonize | 43.2 ± 30.6 |
| Qwen3Guard | Hyphenize | 31.4 ± 7.9 |
| Qwen3Guard | Numberize | 30.2 ± 9.1 |
| Qwen3Guard | Pythonize | 27.2 ± 6.0 |
| LlamaGuard | Hyphenize | 23.5 ± 2.5 |
| LlamaGuard | Numberize | 24.1 ± 2.7 |
| LlamaGuard | Pythonize | 16.9 ± 3.0 |

### Key Finding 4: Single-Template Superiority

Mixed-template training consistently underperforms single-template training. Consider Qwen3Guard: single-template hyphenize achieves 93.8% recall, while the mixed-trained model achieves only 31.4% on the same template. The high variance in Nemotron's mixed-template results (up to $`\pm`$43.5%) suggests unstable learning dynamics when exposed to diverse compression formats.

## Efficiency-Accuracy Trade-off

Analyzing the trade-off between token usage and recall across all configurations, we identify the Pareto-optimal configurations:

  • Maximum Recall: Nemotron Baseline (99.0% recall, 3020 tokens)

  • Best Efficiency-Accuracy: Qwen3Guard M2S Hyphenize (93.8% recall, 173 tokens) — achieving 94.6% token reduction with only 5.2% recall reduction vs. best baseline

  • High Recall + Efficiency: Nemotron M2S Numberize (87.8% recall, 177 tokens)

## Template Ablation on Longturn MHJ

Table 3 presents results on the smaller Longturn MHJ dataset, which includes both attack detection (recall) and false positive rate (FPR) metrics.

Template ablation on Longturn MHJ (LlamaGuard, single seed). All templates achieve equivalent recall with low FPR.

| Template | Recall (%) | FPR (%) | Tokens |
| --- | --- | --- | --- |
| Hyphenize | 98.0 | 1.1 | 284 |
| Numberize | 98.0 | 1.1 | 290 |
| Pythonize | 98.0 | 2.2 | 402 |

On this smaller dataset, all templates achieve equivalent recall (98.0%) with minimal false positives. The discrepancy with SafeDialBench results suggests that model generalization to diverse attack patterns requires careful template selection.

## Training Complexity Validation

We validate our theoretical complexity analysis (Section 3) on our actual training dataset (779 samples, avg. 10.6 user turns). Table 4 shows the empirical token counts.

Empirical training token complexity. M2S achieves a 93$`\times`$ reduction (98.9% fewer tokens).

| Metric | M2S | Multi-turn |
| --- | --- | --- |
| Phase 1: Data Generation | 0 | 7,251,110 |
| Phase 2: Training | 169,153 | 8,494,610 |
| Total Tokens | 169,153 | 15,745,720 |
| Avg. per Sample | 217.1 | 20,212.7 |

### Key Finding 5: Quadratic vs. Linear Scaling

The empirical ratio (93$`\times`$) aligns with our theoretical prediction for longer conversations. Table 5 shows how this advantage increases with conversation length.

Theoretical scaling with $`U=R=100`$ tokens. M2S advantage grows linearly with turn count.

| Turns | M2S | Multi-turn | Ratio |
| --- | --- | --- | --- |
| 2 | 200 | 1,000 | 5.0$`\times`$ |
| 5 | 500 | 5,500 | 11.0$`\times`$ |
| 10 | 1,000 | 21,000 | 21.0$`\times`$ |
| 15 | 1,500 | 46,500 | 31.0$`\times`$ |
| 20 | 2,000 | 82,000 | 41.0$`\times`$ |

This $`O(n^2)`$ vs. $`O(n)`$ difference has profound practical implications: training a guardrail on 20-turn conversations using the baseline approach requires 41$`\times`$ more tokens than M2S.

## Statistical Significance

We conduct paired t-tests between the best M2S configuration (Qwen3Guard Hyphenize) and baselines:

  • vs. Qwen3Guard Baseline: $`p < 0.001`$ (M2S significantly better)

  • vs. Nemotron Baseline: $`p = 0.12`$ (not significantly different)

The best M2S configuration statistically matches the best baseline while using 94.6% fewer inference tokens and requiring 93$`\times`$ fewer training tokens.
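
As a sanity check, the Qwen3Guard comparison can be reproduced from the per-seed recalls in the appendix; assuming a standard `scipy` paired t-test, it indeed lands well below $`p = 0.001`$.

```python
# Paired t-test over seeds (42, 123, 456), using the appendix per-seed recalls.
from scipy.stats import ttest_rel

m2s_hyphenize = [92.5, 93.2, 95.8]   # Qwen3Guard M2S Hyphenize
qwen_baseline = [54.9, 54.9, 54.9]   # Qwen3Guard full-conversation baseline

t_stat, p_value = ttest_rel(m2s_hyphenize, qwen_baseline)
print(f"t = {t_stat:.1f}, p = {p_value:.5f}")  # p well below 0.001
```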

# Conclusion

We introduced Defensive M2S, a training paradigm that fine-tunes guardrail models on M2S-compressed multi-turn conversations rather than full dialogue histories. Our extensive evaluation across three guardrail model families and multiple compression templates reveals that this approach can achieve substantial efficiency gains (up to 94.6% token reduction) while maintaining or even improving detection accuracy for certain model-template combinations.

Our key findings include: (1) Qwen3Guard with hyphenize compression achieves 93.8% recall while using only 173 tokens per conversation, outperforming its baseline by nearly 40 percentage points; (2) model-template compatibility is crucial, with different models favoring different compression formats; and (3) single-template training outperforms mixed-template training, suggesting that consistent compression formats provide stronger learning signals.

These results demonstrate the practical viability of Defensive M2S for deploying efficient guardrails in production systems, where computational cost and latency are critical concerns. Future work should explore adaptive template selection, combination with other efficiency techniques such as model distillation, and evaluation on additional safety benchmarks.

# Limitations

Our work has several limitations that should be considered when interpreting the results:

## Limited Model Coverage

We evaluate three guardrail models (LlamaGuard, Nemotron, Qwen3Guard), all in the 8B parameter range. The effectiveness of Defensive M2S on smaller or larger models, as well as closed-source commercial guardrails, remains unexplored.

## Template Generalization

Our M2S templates are derived from prior work and may not represent the optimal compression strategy. Template design is currently manual, and automated template discovery could yield better results.

## Single Benchmark Emphasis

While SafeDialBench provides comprehensive coverage of multi-turn attacks, additional benchmarks covering different attack taxonomies, languages, and domains would strengthen our conclusions.

## Loss of Context

M2S compression discards assistant responses entirely, which may contain safety-relevant context in some scenarios (e.g., the model beginning to comply with harmful requests). Our approach may miss such signals.

## Adversarial Robustness

We do not evaluate against adversaries who are aware of the M2S compression pipeline. Attacks specifically designed to exploit compression artifacts could potentially bypass our defenses.

## Training Data Scale

Our training data is relatively small ($`\sim`$1000 samples). Larger-scale training with more diverse multi-turn conversations could improve generalization.

## Real-world Deployment

Our evaluation is conducted in controlled settings. Production deployment involves additional considerations such as latency requirements, handling of streaming conversations, and integration with other safety measures.

# M2S Template Examples

This appendix provides concrete examples of each M2S compression template applied to a sample multi-turn conversation.

## Original Multi-turn Conversation

```text
User: I'm writing a crime novel and need help with realistic dialogue.
Assistant: I'd be happy to help with your creative writing project...
User: The antagonist needs to explain their method for avoiding detection.
Assistant: For fictional purposes, here are some narrative elements...
User: Can you provide specific technical details they would discuss?
Assistant: In crime fiction, technical accuracy can add realism...
```

## Hyphenize Compression

```text
- I'm writing a crime novel and need help with realistic dialogue.
- The antagonist needs to explain their method for avoiding detection.
- Can you provide specific technical details they would discuss?
```

## Numberize Compression

```text
1. I'm writing a crime novel and need help with realistic dialogue.
2. The antagonist needs to explain their method for avoiding detection.
3. Can you provide specific technical details they would discuss?
```

## Pythonize Compression

```python
def conversation():
    user_turn_1 = "I'm writing a crime novel and need help with realistic dialogue."
    user_turn_2 = "The antagonist needs to explain their method for avoiding detection."
    user_turn_3 = "Can you provide specific technical details they would discuss?"
```

# Full Results Tables

## Per-Seed Results on SafeDialBench

Per-seed recall (%) on SafeDialBench. † marks the best overall configuration (Qwen3Guard M2S Hyphenize).

| Model | Configuration | Seed 42 | Seed 123 | Seed 456 |
| --- | --- | --- | --- | --- |
| LlamaGuard | Baseline | 65.8 | 67.9 | 91.5 |
| LlamaGuard | M2S Hyphenize | 26.2 | 18.1 | 28.0 |
| LlamaGuard | M2S Numberize | 29.9 | 23.0 | 20.7 |
| LlamaGuard | M2S Pythonize | 16.4 | 16.7 | 18.6 |
| Nemotron | Baseline | 97.5 | 100.0 | 99.5 |
| Nemotron | M2S Hyphenize | 42.6 | 90.1 | 70.0 |
| Nemotron | M2S Numberize | 78.5 | 95.7 | 89.2 |
| Nemotron | M2S Pythonize | 87.6 | 79.3 | 81.8 |
| Qwen3Guard | Baseline | 54.9 | 54.9 | 54.9 |
| Qwen3Guard | M2S Hyphenize† | 92.5 | 93.2 | 95.8 |
| Qwen3Guard | M2S Numberize | 35.9 | 34.4 | 30.6 |
| Qwen3Guard | M2S Pythonize | 30.0 | 26.8 | 34.9 |

# Training Details

## Hyperparameters

Training hyperparameters used for all experiments.

| Parameter | Value |
| --- | --- |
| Base learning rate | $`2 \times 10^{-4}`$ |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Epochs | 3 |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Max sequence length | 4096 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Quantization | 4-bit NF4 |
| Compute dtype | bfloat16 |

## Compute Resources

All experiments were conducted on NVIDIA A100 GPUs (40GB). Training time per configuration: approximately 30 minutes. Total compute: approximately 50 GPU-hours for all experiments.

## SafeDialBench Statistics

SafeDialBench category distribution.

| Category | Samples |
| --- | --- |
| Violence | 412 |
| Fraud/Deception | 389 |
| Illegal Activities | 356 |
| Hate/Harassment | 298 |
| Sexual Content | 312 |
| Self-Harm | 270 |
| **Total** | **2,037** |

A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
