LLM Watermark Evasion via Bias Inversion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.


💡 Research Summary

The paper addresses a critical gap in the robustness of large language model (LLM) watermarking: how well can a black‑box, query‑free adversary evade detection without substantially altering the original meaning? Existing query‑free attacks either achieve modest evasion rates or heavily distort semantics, while query‑based attacks require many model calls and partial knowledge of the secret key. To bridge this gap, the authors first provide a rigorous theoretical analysis of rewriting‑based evasion. They show that for any detector whose test statistic is a non‑decreasing function of the empirical green‑token rate, detection reduces to thresholding that rate (Lemma 4.1). Building on this, Theorem 4.2 proves that if a rewriter can keep the average conditional probability of sampling a green token at least a margin δ below the detection threshold, the probability of detection decays exponentially as exp(−Nδ²/2), where N is the number of generated tokens. This result implies that even a tiny per‑step suppression of green‑token probability can render the watermark virtually invisible.
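The exp(−Nδ²/2) bound has the familiar Hoeffding form, and its practical bite is easy to check numerically. The sketch below is a toy simulation (not the paper's code; the threshold τ = 0.5 and margin δ = 0.05 are illustrative) comparing a Monte Carlo estimate of the detection rate against the theoretical bound, assuming each token is independently green with probability τ − δ:

```python
import math
import random

def detection_bound(n_tokens, delta):
    """Upper bound from Theorem 4.2: P(detect) <= exp(-N * delta^2 / 2)."""
    return math.exp(-n_tokens * delta ** 2 / 2)

def simulate_detection(n_tokens, threshold, delta, trials=2000, seed=0):
    """Monte Carlo estimate: fraction of rewrites whose empirical green-token
    rate still reaches the detector's threshold, when each token is green
    independently with probability (threshold - delta)."""
    rng = random.Random(seed)
    p_green = threshold - delta
    hits = 0
    for _ in range(trials):
        greens = sum(rng.random() < p_green for _ in range(n_tokens))
        if greens / n_tokens >= threshold:
            hits += 1
    return hits / trials

# Even a small margin (delta = 0.05) over a few hundred tokens drives the
# detection probability down rapidly; the bound decays exponentially in N.
for n in (100, 400, 1600):
    print(n, simulate_detection(n, threshold=0.5, delta=0.05),
          detection_bound(n, delta=0.05))
```

The simulated detection rate sits well below the bound, which is expected: the bound is a worst-case guarantee, and the actual binomial tail shrinks even faster.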

Motivated by the theorem, the authors design a practical attack named Bias‑Inversion Rewriting Attack (BIRA). Because the true green set G(Wₖ) is unknown in a black‑box setting, BIRA constructs a proxy suppression set G̃ from the watermarked text itself. It computes token self‑information (surprisal) using a publicly available language model M, selects the high‑surprisal tokens (e.g., above the q‑th percentile), and treats them as likely carriers of the watermark bias. During rewriting, BIRA prompts the clean LLM M to generate a paraphrase, but at each decoding step it adds a negative logit bias β to every token in G̃. This bias discourages the model from reproducing the original green‑biased tokens, thereby lowering the average green‑token sampling probability in line with Theorem 4.2.
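The two core steps, building the proxy set from surprisal and biasing logits at decode time, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the unigram `logprob` stand-in, the example sentence, and the default values of q and β are all hypothetical; in practice the surprisal would come from a real public LM's conditional distribution.

```python
import math
from collections import Counter

def build_suppression_set(tokens, logprob, q=0.75):
    """Proxy suppression set G~: token types whose surprisal -log p(token)
    falls at or above the q-th percentile over the watermarked text.
    `logprob` stands in for a public LM's log-probability (hypothetical)."""
    surprisal = {tok: -logprob(tok) for tok in set(tokens)}
    ranked = sorted(surprisal.values())
    cutoff = ranked[min(int(q * len(ranked)), len(ranked) - 1)]
    return {tok for tok, s in surprisal.items() if s >= cutoff}

def apply_negative_bias(logits, suppression_set, beta=-4.0):
    """At each decoding step, shift every proxy-set token's logit down by |beta|."""
    return {tok: logit + (beta if tok in suppression_set else 0.0)
            for tok, logit in logits.items()}

# Toy demo: unigram log-probabilities estimated from the watermarked text itself.
text = "the cat sat on the quixotic mat".split()
counts = Counter(text)
total = sum(counts.values())
unigram_logprob = lambda tok: math.log(counts[tok] / total)

proxy = build_suppression_set(text, unigram_logprob, q=0.75)
# Frequent tokens like "the" are low-surprisal and stay unbiased;
# rare, high-surprisal tokens land in the proxy set and get pushed down.
```

The intuition is that the watermark's logit boost makes the sampler pick tokens it would otherwise consider unlikely, so high-surprisal tokens under a clean model are the most plausible green-set members.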

A naïve application of a strong negative bias can cause degeneration (repetitive or incoherent output). To mitigate this, BIRA monitors the distinct‑1‑gram ratio of the most recent h tokens; if the ratio falls below a threshold ρ, the bias magnitude is incrementally relaxed (β ← min(0, β + lr)) and generation restarts. This adaptive scheme starts with an aggressive bias for maximal watermark removal and gracefully backs off when text quality degrades, balancing evasion effectiveness against semantic fidelity.
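The degeneration guard admits a compact sketch. The version below checks a finished draft rather than monitoring mid-stream, for brevity; `generate_once` is a hypothetical hook standing in for the biased rewriting call, and the defaults for h, ρ, and the relaxation step lr are illustrative, not the paper's settings.

```python
def distinct_1gram_ratio(tokens, h=50):
    """Fraction of unique tokens within the most recent h-token window."""
    window = tokens[-h:]
    return len(set(window)) / max(len(window), 1)

def relax_bias(beta, lr=0.5):
    """Relax the negative bias toward zero: beta <- min(0, beta + lr)."""
    return min(0.0, beta + lr)

def generate_with_guard(generate_once, beta=-5.0, rho=0.6, lr=0.5,
                        max_restarts=10):
    """Start with an aggressive bias; if the output's recent window looks
    degenerate (distinct-1-gram ratio below rho), weaken the bias and restart."""
    tokens = []
    for _ in range(max_restarts):
        tokens = generate_once(beta)
        if distinct_1gram_ratio(tokens) >= rho or beta == 0.0:
            return tokens, beta
        beta = relax_bias(beta, lr)
    return tokens, beta
```

Starting aggressive and backing off monotonically means the scheme always returns the strongest bias that still yields fluent text, which is exactly the evasion-versus-fidelity trade-off the paragraph describes.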

Empirical evaluation spans several recent watermarking schemes, including the canonical green‑red list watermark (Kirchenbauer et al., 2024a), sentence‑level watermarks, and newer variants that adjust the green‑token ratio or use dynamic keys. Across all settings, BIRA achieves evasion rates exceeding 99%, dramatically outperforming prior query‑free baselines (e.g., Cheng et al., 2025; Diaa et al., 2024). Moreover, BIRA preserves semantic similarity as measured by ROUGE‑L and BERTScore, with improvements of roughly 10–15% over earlier attacks, and maintains reasonable perplexity, indicating natural‑sounding output. Ablation studies confirm that the high‑surprisal proxy set is crucial: using all tokens as the suppression set harms fluency, while low‑surprisal tokens contribute little to watermark removal.

The paper concludes that current token‑level watermarking is fundamentally vulnerable to modest, systematic logit bias manipulations, even when the attacker lacks any knowledge of the secret key or the victim model’s internals. The authors advocate for more robust watermark designs that do not rely solely on per‑token probability skewing, suggesting multi‑level (sentence, syntactic, semantic) signals and stronger statistical detectors. They also call for comprehensive stress‑testing of watermark schemes, incorporating realistic rewriting, translation, and paraphrasing attacks such as BIRA, to ensure that deployed detection mechanisms can withstand sophisticated black‑box adversaries.

