More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this “stronger-is-better” approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
💡 Research Summary
This paper revisits the design of distortion‑free watermark ensembles for large language models (LLMs) and challenges the prevailing “stronger‑is‑better” paradigm. Existing ensemble methods apply a strong watermark at each generation layer, assuming that maximizing the immediate detection signal will improve overall robustness. The authors demonstrate that this intuition is flawed because a strong watermark substantially reduces the entropy of the token probability distribution. Since the detectability of probabilistic watermarks fundamentally depends on the uncertainty (entropy) of the underlying distribution, the entropy loss in early layers weakens the statistical signal available to later layers, causing a monotonic decay of detection power across the ensemble.
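The entropy loss the authors describe can be illustrated with a small sketch. Here we use a KGW-style green-list logit boost as a stand-in watermark operator (an assumption for illustration; the paper studies distortion-free operators, which behave differently in expectation but exhibit the same per-layer entropy decay):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy (nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def greenlist_boost(p: np.ndarray, green_mask: np.ndarray, delta: float = 2.0) -> np.ndarray:
    """Illustrative KGW-style watermark: add bias delta to the logits of
    green-list tokens, then renormalize via softmax."""
    logits = np.log(p) + delta * green_mask
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

p = np.array([0.25, 0.25, 0.25, 0.25])
green = np.array([1.0, 1.0, 0.0, 0.0])
q = greenlist_boost(p, green)
# The watermarked distribution is sharper, so its entropy is lower,
# leaving less uncertainty for any later watermark layer to exploit.
```

With the uniform distribution above, boosting half the vocabulary drops the entropy well below the maximum of ln 4, which is exactly the signal budget a second watermark layer would have needed.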
The paper first establishes a theoretical connection between entropy and detection performance. Using the green‑list/red‑list framework, they show that higher entropy leads to a larger expected green‑list ratio, which in turn yields higher z‑scores for hypothesis testing. Theorem 4.1 proves that any distortion‑free watermark operator does not increase expected entropy, and Theorem 4.2 shows that the expected green‑list ratio also never increases after watermarking. Consequently, multi‑layer ensembles inevitably suffer a cumulative reduction in both entropy and expected green ratio, limiting their ultimate detectability.
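The hypothesis test referenced above can be sketched as the standard one-sided z-score over the green-token count (the function name and default `gamma`, the assumed green-list fraction, are ours, not the paper's):

```python
import math

def green_list_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """Under the no-watermark null, each token lands in the green list
    independently with probability gamma, so the green count is
    Binomial(total, gamma); this returns the normal-approximation z-score."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std
```

For example, 60 green tokens out of 100 with `gamma = 0.5` gives a z-score of 2.0; a lower expected green-list ratio (caused by entropy loss in earlier layers) shrinks this statistic and hence the detection power.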
To mitigate this problem, the authors propose a general "weaker" watermark framework. They define a family of watermark functions $F_{\lambda}$ that linearly interpolates between the original model distribution $P_M$ and a standard distortion-free watermarked distribution $F$:

$$F_{\lambda}(P_M, k) = \lambda\, F(P_M, k) + (1 - \lambda)\, P_M,$$

where $\lambda \in [0, 1]$ controls the watermark strength: $\lambda = 1$ recovers the original watermark, while $\lambda = 0$ leaves the model distribution untouched. Intermediate values sacrifice some single-layer signal in exchange for the entropy that later layers need.
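A minimal sketch of this interpolation over next-token distributions (the function name and the renormalization guard are ours; `p_watermarked` stands for the output of any distortion-free watermark operator $F$):

```python
import numpy as np

def weakened_watermark(p_model: np.ndarray, p_watermarked: np.ndarray, lam: float) -> np.ndarray:
    """F_lambda: convex combination of the watermarked next-token
    distribution F(P_M, k) and the raw model distribution P_M."""
    assert 0.0 <= lam <= 1.0
    mixed = lam * p_watermarked + (1.0 - lam) * p_model
    return mixed / mixed.sum()  # guard against floating-point drift
```

Because the mixture is convex, the weakened distribution stays closer to $P_M$ (retaining more entropy) for small $\lambda$, which is precisely what allows subsequent ensemble layers to embed their own signal.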