On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Notice: This research summary and analysis were generated automatically using AI technology. For authoritative details, please refer to the original arXiv source.

Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.


💡 Research Summary

The paper investigates the use of classical goodness‑of‑fit (GoF) statistical tests for detecting text watermarks embedded in large language model (LLM) outputs. Watermarking schemes introduce a hidden dependence between each generated token wₜ and a secret pseudorandom variable ζₜ. Under the null hypothesis (human‑written text) this dependence is absent, so a pivotal statistic Yₜ = Y(wₜ, ζₜ) follows a known distribution µ₀ and the Yₜ’s are i.i.d. The authors reformulate watermark detection as a hypothesis test H₀: Yₜ ∼ µ₀ i.i.d. versus H₁: Yₜ deviates from µ₀ due to the watermark. Although under H₁ the Yₜ’s are not i.i.d. because of the autoregressive nature of LLMs, the core idea of measuring deviation from µ₀ remains applicable, motivating the use of GoF tests.
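This framing can be made concrete with a minimal sketch (not the paper's code). Under the Gumbel-max scheme, µ₀ = Uniform(0,1), so human text yields uniform pivots; here a Beta(2,1) sample stands in for the watermark-induced skew (an illustrative assumption), and a Kolmogorov-Smirnov test distinguishes the two hypotheses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200  # tokens in one document

# H0 (human text): pivotal statistics Y_t are i.i.d. Uniform(0,1)
y_human = rng.uniform(size=n)

# H1 (watermarked text): pivots deviate from Uniform(0,1).
# Beta(2, 1) is an illustrative stand-in for the watermark-induced skew toward 1.
y_watermarked = rng.beta(2.0, 1.0, size=n)

# GoF test of H0: Y_t ~ mu_0 = Uniform(0,1)
p_human = stats.kstest(y_human, "uniform").pvalue
p_wm = stats.kstest(y_watermarked, "uniform").pvalue
print(f"human p-value:       {p_human:.3f}")
print(f"watermarked p-value: {p_wm:.3g}")
```

Any GoF test slots into the same place as `kstest` here; the rest of the pipeline is unchanged.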

Eight well‑established GoF tests are evaluated: Truncated‑Φ‑divergence (Tr‑GoF), Kuiper (Kui), Kolmogorov‑Smirnov (Kol), Anderson‑Darling (And), Cramér‑von Mises (Cra), Watson (Wat), Neyman smooth (Ney), and χ² (Chi). For each test the authors compute the pivotal statistics from a generated sequence, transform them into p‑values using the known null CDF F₀, and then calculate the test‑specific deviation measure. Critical values are obtained from the asymptotic distribution appropriate for each test.
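The shared recipe (transform pivots through F₀, then measure deviation from uniformity) can be sketched directly. This is an illustrative implementation of four of the classical statistics from their textbook formulas, not the authors' code; it assumes the transformed values avoid exactly 0 or 1:

```python
import numpy as np

def gof_statistics(y, F0):
    """Classical GoF deviation measures for H0: Y_t ~ mu_0 i.i.d.

    y  : 1-D array of pivotal statistics from one document
    F0 : known null CDF, so u = F0(y) is Uniform(0,1) under H0
    """
    u = np.sort(F0(np.asarray(y)))
    n = len(u)
    i = np.arange(1, n + 1)

    d_plus = np.max(i / n - u)           # empirical CDF above the diagonal
    d_minus = np.max(u - (i - 1) / n)    # empirical CDF below the diagonal

    ks = max(d_plus, d_minus)            # Kolmogorov-Smirnov
    kuiper = d_plus + d_minus            # Kuiper
    # Cramer-von Mises
    cvm = 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)
    # Anderson-Darling (assumes 0 < u < 1 so the logs are finite)
    ad = -n - np.mean((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))
    return {"ks": ks, "kuiper": kuiper, "cvm": cvm, "ad": ad}
```

Comparing each statistic against its asymptotic critical value then yields the detection decision.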

The empirical study covers three popular unbiased watermarking schemes: (1) Gumbel‑max, where µ₀ = Uniform(0,1); (2) Inverse‑transform, where µ₀(r) = r²; and (3) Google’s SynthID, where µ₀ is the distribution of a scaled Irwin‑Hall sum. Three open‑source LLMs (OPT‑1.3B, OPT‑13B, Llama‑3.1‑8B) are used, each sampled at four temperature settings (0.1, 0.3, 0.7, 1.0) to vary output entropy. Two generation tasks are considered: C4 text completion (prompt = first 50 tokens) and ELI5 long‑form question answering. For each configuration, 1,000 documents are generated, yielding tens of thousands of evaluated sequences in total.
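The three null CDFs F₀ can be written down directly. A sketch follows; the SynthID depth `m` (the number of uniform components in the scaled Irwin-Hall sum) is treated here as a free parameter, an assumption of this illustration:

```python
from math import comb, factorial

def F0_gumbel(y):
    # Gumbel-max: mu_0 = Uniform(0,1), so F0 is the identity on [0, 1]
    return min(max(y, 0.0), 1.0)

def F0_inverse_transform(r):
    # Inverse-transform: F0(r) = r^2 on [0, 1]
    return min(max(r, 0.0), 1.0) ** 2

def irwin_hall_cdf(x, n):
    # CDF of the sum of n i.i.d. Uniform(0,1) variables (Irwin-Hall)
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    return sum((-1) ** k * comb(n, k) * (x - k) ** n
               for k in range(int(x) + 1)) / factorial(n)

def F0_synthid(y, m):
    # SynthID: pivot modeled as the mean of m uniforms -> scaled Irwin-Hall
    return irwin_hall_cdf(m * y, m)
```

With these in hand, any of the GoF tests above applies uniformly across the three schemes via u = F₀(Y).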

Robustness is examined by applying three post‑generation editing procedures: token deletion, token substitution, and richer human edits that insert additional information. The green‑red list watermark is deliberately excluded because its binary pivot reduces GoF tests to the original count‑based detector, offering no additional insight.

Key findings:

  1. Across all temperatures, models, and watermark schemes, GoF tests consistently achieve higher area‑under‑the‑ROC curve (AUC) than baseline detectors that rely on log‑likelihood ratios or simple green‑token counts.

  2. Low‑temperature generation (T = 0.1, 0.3) often produces repetitive token patterns. This repetition creates systematic deviations in the empirical CDF of Yₜ, which are captured especially well by maximum‑difference tests such as Kolmogorov‑Smirnov and Kuiper.

  3. High‑temperature generation (T = 0.7, 1.0) yields higher entropy in the next‑token distribution, strengthening the watermark signal. In this regime, tests that aggregate deviations over the whole distribution—Truncated‑Φ‑divergence, Anderson‑Darling, and Cramér‑von Mises—show superior power.

  4. Text length influences detection power, but even relatively short sequences (≈50 tokens) attain AUC > 0.75 with most GoF tests; longer texts improve performance smoothly.

  5. Post‑editing robustness is strong: after deletions or substitutions, most GoF tests retain AUC > 0.80. Watson’s test and Neyman smooth, which are invariant to location shifts, are particularly resilient to edits that alter token order or introduce new content.

  6. The green‑red list watermark, whose pivot is binary, does not benefit from GoF testing (consistent with its exclusion from the main study), confirming that GoF gains are specific to watermarks with continuous or multi‑valued pivots.
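The AUC comparisons in findings 1 and 4 reduce to a rank probability: AUC = P(watermarked score > human score). A self-contained sketch on synthetic scores (not the paper's data; the Beta(2,1) skew is an assumed stand-in for the watermark signal):

```python
import numpy as np

def auc(wm_scores, human_scores):
    """Mann-Whitney AUC: P(watermarked score > human score), ties count half."""
    wm = np.asarray(wm_scores)[:, None]
    hu = np.asarray(human_scores)[None, :]
    return (wm > hu).mean() + 0.5 * (wm == hu).mean()

def ks_stat(u):
    # KS distance of a sample from Uniform(0,1), used as the detection score
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(0)
n_docs, n_tokens = 300, 50  # ~50-token documents, as in finding 4

human = [ks_stat(rng.uniform(size=n_tokens)) for _ in range(n_docs)]
wm = [ks_stat(rng.beta(2.0, 1.0, size=n_tokens)) for _ in range(n_docs)]
print(f"AUC at {n_tokens} tokens: {auc(wm, human):.3f}")
```

Swapping `ks_stat` for any other GoF statistic reproduces the same kind of per-test AUC comparison.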

The authors conclude that GoF testing provides a unified, statistically principled framework for watermark detection that complements existing specialized detectors. They recommend dynamically selecting the most appropriate GoF test based on generation temperature and observed text characteristics, or combining multiple tests in an ensemble to maximize robustness. All code, data processing pipelines, and detailed experimental logs are released publicly, enabling reproducibility and facilitating future research on watermark detection across diverse LLM architectures and deployment scenarios.
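One simple way to realize the suggested ensemble is to combine per-test p-values, e.g. with Fisher's method. A sketch with placeholder p-values (Fisher's method assumes independent tests, which is only an approximation here since the GoF tests share the same pivotal statistics):

```python
from scipy import stats

# p-values from several GoF tests on the same document (illustrative values)
p_values = [0.012, 0.060, 0.034]

# Fisher's method: -2 * sum(log p) ~ chi^2 with 2k degrees of freedom under H0
stat, p_combined = stats.combine_pvalues(p_values, method="fisher")
print(f"Fisher statistic: {stat:.2f}, combined p-value: {p_combined:.4f}")
```

`scipy.stats.combine_pvalues` also supports Stouffer's method, which admits per-test weights if one test is known to be more powerful in the current temperature regime.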

