Unifying Adversarial Robustness and Training Across Text Scoring Models
Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
💡 Research Summary
This paper tackles the fragmented landscape of adversarial robustness research for language models by unifying the study under the umbrella of text‑scoring models, which include dense retrievers, rerankers, and reward models. The authors argue that, unlike open‑ended generation where “harmful output” is ill‑defined, text‑scoring tasks provide a crisp, testable failure condition: an attack succeeds when an irrelevant passage or a rejected response receives a higher score than a relevant passage or a chosen response. Using this structural definition, they systematically examine a suite of attacks—rudimentary character/word edits, gradient‑guided HotFlip token swaps, MLM‑guided contextual replacements, and content‑injection attacks (sentence insertion and query injection). Each attack is executed with a beam‑search procedure (16 beams, up to 512 steps) and evaluated by attack success rate (ASR) and average steps to success.
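The attack procedure described above can be sketched as a simple beam search over discrete edits, terminating as soon as the perturbed irrelevant text outscores the relevant one. This is a minimal illustration, not the authors' implementation: the `overlap_score` scorer and the `append_query_word` edit generator below are toy stand-ins for a neural scoring model and the paper's actual edit operations (character swaps, HotFlip, etc.).

```python
from typing import Callable, List, Tuple

def beam_search_attack(
    score: Callable[[str, str], float],           # scoring model: (query, text) -> score
    query: str,
    relevant: str,                                # chosen/relevant text the attack must outscore
    start: str,                                   # irrelevant/rejected text being perturbed
    edit_candidates: Callable[[str], List[str]],  # generates one-step edits of a text
    beams: int = 16,
    max_steps: int = 512,
) -> Tuple[str, int, bool]:
    """Beam search over discrete edits; the attack succeeds when the perturbed
    irrelevant text strictly outscores the relevant one."""
    target = score(query, relevant)
    frontier = [start]
    for step in range(1, max_steps + 1):
        # Expand every beam by all single-step edits, keep the top-scoring texts.
        expanded = {t for b in frontier for t in edit_candidates(b)}
        frontier = sorted(expanded, key=lambda t: score(query, t), reverse=True)[:beams]
        if score(query, frontier[0]) > target:    # the paper's success condition
            return frontier[0], step, True
    return frontier[0], max_steps, False

# Toy illustration: bag-of-words overlap scorer and edits that append query words.
query = "capital of france"
relevant = "paris is the capital"                 # text the attacker must outscore

def overlap_score(q: str, t: str) -> float:
    qs = set(q.split())
    return len(qs & set(t.split())) / len(qs)

def append_query_word(t: str) -> List[str]:
    return [t + " " + w for w in query.split()]

adv, steps, ok = beam_search_attack(overlap_score, query, relevant,
                                    "bananas are yellow", append_query_word)
```

With the toy scorer, the attack only needs to splice enough query terms into the irrelevant text for it to outscore the relevant passage, which mirrors the query-injection failure mode the paper tests.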
To defend against these threats, the authors explore several adversarial training formulations: (1) Projected Gradient Descent (PGD) that adds continuous perturbations in the token‑embedding space; (2) Rudimentary‑based training that exposes the model to simple string‑level manipulations; (3) HotFlip‑based training that incorporates gradient‑guided token swaps; (4) Content‑injection training that injects unrelated sentences or queries into the training data; and (5) Paraphrasing training that encourages score consistency across paraphrases. Crucially, they propose a “Combined” training regime that integrates all complementary signals, aiming for broad‑spectrum robustness without sacrificing downstream performance.
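Of the formulations above, PGD is the most mechanically distinct: instead of editing tokens, it searches for a worst-case continuous perturbation of the token embeddings inside a small norm ball before each training step. The sketch below is a deliberately simplified illustration under stated assumptions — a linear scorer with an analytic gradient and a pairwise hinge loss stand in for the transformer scorer and autograd machinery an actual implementation would use; the names and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def pgd_perturb(emb: np.ndarray, grad_fn, eps: float = 0.05,
                alpha: float = 0.01, steps: int = 3) -> np.ndarray:
    """Inner PGD loop: find an additive perturbation delta with
    ||delta||_inf <= eps that increases the training loss."""
    delta = np.zeros_like(emb)
    for _ in range(steps):
        g = grad_fn(emb + delta)                                 # dLoss / dEmbeddings
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)   # ascent + projection
    return delta

# Toy linear scorer for the sketch: score(E) = w . mean(E), trained with a
# pairwise hinge loss max(0, margin - (score_chosen - score_rejected)).
rng = np.random.default_rng(0)
w = rng.normal(size=8)                    # scorer weights
chosen = rng.normal(size=(5, 8))          # embeddings of 5 "tokens" of the chosen text
margin, score_rejected = 10.0, 0.0        # rejected score held fixed for simplicity

def loss_grad(emb: np.ndarray) -> np.ndarray:
    # While the hinge is active, dLoss/dE_ij = -w_j / num_tokens.
    score_chosen = w @ emb.mean(axis=0)
    if margin - (score_chosen - score_rejected) <= 0:
        return np.zeros_like(emb)
    return -np.tile(w / emb.shape[0], (emb.shape[0], 1))

delta = pgd_perturb(chosen, loss_grad)
# Adversarial training would now take its gradient step on the loss
# evaluated at (chosen + delta) rather than at the clean embeddings.
```

Because the perturbation lives in embedding space rather than token space, this kind of training targets a different failure surface than the discrete HotFlip or rudimentary edits, which is the complementarity the Combined regime exploits.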
Experiments are conducted on standard retrieval benchmarks (MS‑MARCO, TREC) for dense retrievers, on pointwise rerankers, and on reward models used in RLHF pipelines. Results show that single‑method adversarial training can dramatically improve robustness against its target attack (often a 60‑80 % reduction in ASR) but fails to generalize to other attack families. In contrast, the Combined training consistently lowers ASR across all attacks (typically >70 % reduction) while also yielding modest gains (1‑3 %) in primary task metrics such as MRR and NDCG. For reward models, adversarially trained models markedly reduce reward‑hacking phenomena when the policy model is treated as an adaptive adversary during RLHF, leading to higher human‑rated alignment scores.
The analysis reveals why the combination works: PGD builds global embedding‑space resilience; Rudimentary and HotFlip training harden the model against discrete token‑level perturbations; content‑injection training covers a failure mode not addressed by token swaps; and paraphrasing enforces score stability. The synergy among these signals produces transferability—training against one attack family confers protection against others—and across model roles (retriever, reranker, reward model).
In conclusion, the paper demonstrates that viewing adversarial robustness through the lens of text scoring unifies disparate attack vectors, enables principled evaluation, and guides the design of robust, transferable defenses. The proposed adversarial training suite, especially the Combined approach, offers a practical pathway to more secure retrieval, ranking, and RLHF systems, and sets a foundation for future work on broader domains, real‑time attack‑defense dynamics, and efficient robustness‑preserving training methods.