A Watermark for Black-Box Language Models
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model’s next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
💡 Research Summary
The paper introduces a novel black‑box watermarking framework for large language models (LLMs) that requires only the ability to sample text from the model, eliminating the need for access to next‑token probability distributions. The authors observe that existing watermarking schemes, most notably those that bias "green" tokens by manipulating logits, are infeasible for third‑party users of commercial LLM APIs, which typically expose only generated text.

To address this, they propose a sampling‑based selection algorithm: given a prompt, the encoder draws m candidate continuations from the LLM, extracts all n‑grams (with n typically 4 or 5) from each candidate (excluding the original prompt), and scores each candidate using a secret integer key K. Scoring proceeds by hashing each n‑gram together with K and feeding the resulting integer seed through a continuous cumulative distribution function F (e.g., the standard normal CDF) to obtain a pseudorandom value. For a candidate X, the set of unique seeds S_X thus yields a collection of pseudorandom values R_i, from which the candidate's score is computed.
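The sampling-and-scoring procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`ngrams`, `score_candidate`, `watermark_select`), the use of SHA-256 as the hash, the uniform CDF, and the choice to aggregate per-n-gram values by their mean are all assumptions made for the sketch.

```python
import hashlib

def ngrams(tokens, n=4):
    """Unique n-grams of a token sequence (hypothetical helper)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pseudorandom_value(ngram, key):
    """Hash an n-gram together with the secret key K, then map the
    integer seed to [0, 1). Using the uniform CDF F(x) = x here; the
    paper allows any continuous CDF (e.g., the standard normal)."""
    digest = hashlib.sha256(f"{key}|{' '.join(ngram)}".encode()).hexdigest()
    return int(digest, 16) / 16**64  # 64 hex digits -> value in [0, 1)

def score_candidate(tokens, key, n=4):
    """Score one candidate continuation; mean aggregation is an
    assumption, as the excerpt does not specify the aggregator."""
    values = [pseudorandom_value(g, key) for g in ngrams(tokens, n)]
    return sum(values) / len(values) if values else 0.0

def watermark_select(candidates, key, n=4):
    """Among m sampled continuations, emit the one whose key-dependent
    pseudorandom score is highest. The detector can recompute the same
    scores from the text alone, needing no model access."""
    return max(candidates, key=lambda c: score_candidate(c.split(), key, n))
```

Because scoring depends only on the emitted text and the secret key, the same computation serves as the detector: watermarked text will exhibit systematically higher scores under the correct key than unwatermarked text.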