MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models
As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but existing methods either provide only binary signals or distort the sampling distribution, degrading text quality; distortion-free approaches, in turn, often suffer from weak detectability or robustness. We propose MirrorMark, a multi-bit and distortion-free watermark for LLMs. By mirroring sampling randomness in a measure-preserving manner, MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. To improve robustness, we introduce a context-based scheduler that balances token assignments across message positions while remaining resilient to insertions and deletions. We further provide a theoretical analysis of the equal error rate to interpret empirical performance. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability: with 54 bits embedded in 300 tokens, it improves bit accuracy by 8-12% and correctly identifies up to 11% more watermarked texts at 1% false positive rate.
💡 Research Summary
The paper introduces MirrorMark, a novel watermarking scheme for large language models (LLMs) that embeds multi‑bit payloads without distorting the model’s output distribution. Existing LLM watermarks fall into two categories: distortion‑based methods that re‑weight logits and thus degrade text quality, and distortion‑free methods that preserve the distribution but typically provide only binary (zero‑bit) detection or suffer from weak detectability and poor robustness when extended to multi‑bit encoding. MirrorMark overcomes these limitations by combining three key components.
First, it applies a “mod‑1 mirroring” transformation to the uniform random value U used in Gumbel‑max or tournament sampling. For an m‑bit message M, a pivot ψM = M/2^m + ½ is defined and the sampled value is reflected as Ψ(u;ψM) = (2ψM − u) mod 1. This operation is a measure‑preserving involution: if U is uniform, Ψ(U;ψM) remains uniform, so the token probability distribution is unchanged. Consequently, the generated text retains the same quality as a non‑watermarked generation.
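The mirroring step above can be sketched in a few lines. This is a minimal illustration based on the formulas in the summary (the pivot expression ψM = M/2^m + ½ is taken verbatim from it); function names and parameters are ours, not the paper's:

```python
import random

def mirror(u: float, m_bits: int, message: int) -> float:
    """Mod-1 mirroring: reflect u about the message-derived pivot.

    pivot = message / 2**m_bits + 0.5, as stated in the summary;
    the exact pivot formula in the paper may differ.
    """
    pivot = message / 2 ** m_bits + 0.5
    return (2.0 * pivot - u) % 1.0

# The map is an involution: applying it twice recovers the input.
u = 0.3141
v = mirror(u, m_bits=6, message=37)
assert abs(mirror(v, m_bits=6, message=37) - u) < 1e-12

# Measure preservation: mirrored uniform samples stay (empirically) uniform,
# which is why the token distribution is unchanged.
random.seed(0)
samples = [mirror(random.random(), 6, 37) for _ in range(100_000)]
mean = sum(samples) / len(samples)
assert abs(mean - 0.5) < 0.01
```

Because the reflection only permutes the unit interval, any sampler that consumes a uniform draw (Gumbel-max, tournament) sees an input with exactly the same distribution.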
Second, MirrorMark introduces the Context‑Anchored Balanced Scheduler (CABS) to map tokens to message positions in a way that is both balanced and robust to insertions or deletions. CABS partitions the generation stream into “frames”. A frame boundary is anchored whenever the least‑significant bits of a hash over a sliding window of W tokens are all zero. Within each frame, token‑to‑position assignments are performed so that every position receives roughly the same number of tokens, preventing any position from being left empty. When an insertion or deletion occurs, the effect is confined to the current frame; the next frame re‑anchors, re‑synchronizing the schedule and preserving the overall message structure. Parameters min_len and max_len bound frame length to avoid instability or unbounded drift.
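A toy version of this scheduling logic might look as follows. All parameter values and helper names here are illustrative assumptions; the summary specifies only the mechanism (hash-anchored frame boundaries, balanced within-frame assignment, min/max frame bounds), not concrete values:

```python
import hashlib

# Illustrative parameters (values are assumptions, not from the paper).
W = 4              # sliding-window size for the anchor hash
ANCHOR_BITS = 3    # boundary when the hash's low ANCHOR_BITS bits are zero
MIN_LEN, MAX_LEN = 6, 24   # bounds on frame length (min_len / max_len)
NUM_POSITIONS = 9  # message positions (e.g., 9 six-bit symbols)

def is_anchor(window: list[int], key: bytes = b"secret") -> bool:
    """Declare a frame boundary when the keyed hash of the last W tokens
    has all-zero least-significant bits."""
    h = hashlib.sha256(key + str(window).encode("utf8")).digest()
    return h[-1] % (1 << ANCHOR_BITS) == 0

def schedule(tokens: list[int]) -> list[int]:
    """Map each token to a message position, frame by frame.
    Within a frame, positions are assigned round-robin so every position
    receives roughly the same number of tokens."""
    positions, frame_len, slot = [], 0, 0
    for i, _ in enumerate(tokens):
        positions.append(slot % NUM_POSITIONS)
        slot += 1
        frame_len += 1
        window = tokens[max(0, i - W + 1): i + 1]
        if (frame_len >= MIN_LEN and is_anchor(window)) or frame_len >= MAX_LEN:
            frame_len, slot = 0, 0  # re-anchor: restart assignment
    return positions
```

Because boundaries depend only on the local token window, an insertion or deletion desynchronizes assignments only until the next anchor fires, after which the schedule realigns.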
Third, detection mirrors the generation process: using the same secret key and pseudorandom function (PRF), the detector recomputes the u‑values for each token, applies the inverse mirroring to recover the original uniform draws, and then computes a statistical score (log‑score as in AA or weighted‑mean score as in SynthID) for each position. Scores across all tokens are aggregated to decide whether a text is watermarked and to decode the embedded bits. The authors provide a theoretical analysis of the resulting score distributions, deriving an expression for the Equal Error Rate (EER) that matches empirical observations.
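The detection idea can be illustrated with a small simulation. Since the mirroring map is its own inverse, the detector applies the same reflection. The scoring below uses the AA-style log-score mentioned above; the simulation of "watermarked" values as the max of several uniform draws is our stand-in for Gumbel-max selection, not the paper's exact procedure:

```python
import math
import random

def mirror(u: float, pivot: float) -> float:
    """The mod-1 reflection is an involution, so it also inverts itself."""
    return (2.0 * pivot - u) % 1.0

def score(us: list[float], message: int, m_bits: int) -> float:
    """AA-style aggregate log-score under a candidate message: undo the
    mirroring with that message's pivot, then sum -log(1 - v)."""
    pivot = message / 2 ** m_bits + 0.5
    return sum(-math.log(1.0 - mirror(u, pivot) + 1e-12) for u in us)

# Toy simulation: under Gumbel-max-style sampling, the transformed value
# of the selected token is biased high, like the max of several uniforms.
random.seed(1)
m_bits, true_msg = 4, 3
true_pivot = true_msg / 2 ** m_bits + 0.5
vs = [max(random.random() for _ in range(5)) for _ in range(200)]
us = [mirror(v, true_pivot) for v in vs]  # what the detector's PRF recovers

s_true = score(us, true_msg, m_bits)
s_wrong = score(us, true_msg + 4, m_bits)  # a clearly offset wrong pivot
assert s_true > s_wrong  # the true message yields the larger score
```

On non-watermarked text the recovered values are uniform for every candidate pivot, so scores concentrate around their null mean, which is what makes the EER analysis of the score distributions tractable.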
Experiments were conducted on several open‑source LLMs (LLaMA‑2‑7B, GPT‑Neo‑2.7B) with 300‑token generations embedding 54 bits (9 symbols of 6 bits each). Compared against state‑of‑the‑art multi‑bit schemes—MP‑AC, RSBH, StealthInk, and the baseline Three‑Bricks—MirrorMark achieved 8–12% higher bit accuracy and up to 11% higher true‑positive rate at a 1% false‑positive rate. Standard quality metrics (BLEU, ROUGE, human evaluation) showed no measurable degradation relative to non‑watermarked text. Robustness tests involving token insertions or deletions (±1–3 tokens) demonstrated that CABS limits error propagation, maintaining decoding success above 90%.
The paper also discusses limitations. Security relies on the secrecy of the watermark key and PRF; exposure would enable an adversary to reverse the mirroring and erase or forge the watermark. The choice of frame length involves a trade‑off between overhead and resilience: very short frames increase synchronization cost, while very long frames reduce robustness to edits. Currently the method operates at the token level; extending it to sentence‑level or multimodal (e.g., image‑text) watermarking will require additional research. Computational overhead is modest because mirroring is a simple arithmetic operation, but large‑scale deployment on commercial models (e.g., GPT‑4) warrants further performance analysis.
In summary, MirrorMark delivers a distortion‑free, multi‑bit watermark that preserves text quality, offers strong detectability, and remains robust to common editing attacks through its context‑anchored balanced scheduling. This represents a significant step forward for attribution, copyright protection, and accountability in the rapidly expanding ecosystem of LLM‑generated content.