Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
We propose a system for marking sensitive or copyrighted texts to detect their use in fine-tuning large language models under black-box access with statistical guarantees. Our method builds "digital marks" using invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only "cue" fragments are issued to trigger regurgitation of the corresponding "reply", indicating document usage. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
💡 Research Summary
The paper introduces a novel, minimally invasive watermarking framework for auditing the provenance of fine‑tuned large language models (LLMs) under a strict black‑box setting. The authors address a pressing problem: data owners often have no reliable way to determine whether their copyrighted or sensitive texts have been incorporated into proprietary fine‑tuning datasets, especially when only prompt‑response access to the deployed model is available. Existing approaches—such as direct verbatim memorization detection, membership inference attacks, or visible text modifications—either lack statistical guarantees, require internal model access, or degrade the quality of the original text.
Core Idea
The authors embed “invisible” Unicode characters (e.g., zero‑width spaces) into documents to create a pair of sequences called a cue and a reply. Both are built from syllables, each syllable being a short ordered list of non‑rendering characters. A watermark consists of a cue (the first j syllables) and a reply (the remaining n‑j syllables). The last t syllables of the cue are duplicated as a “tail” and placed immediately before the reply. This design encourages the model to memorize the association cue → tail → reply during fine‑tuning.
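This construction can be sketched in a few lines of Python. The alphabet of non-rendering characters and the parameter values (n syllables, cue length j, tail length t, syllable length m) below are illustrative assumptions, not the paper's actual settings:

```python
import random

# Assumed alphabet of non-rendering Unicode characters:
# zero-width space, zero-width non-joiner, zero-width joiner, word joiner.
ALPHABET = ["\u200b", "\u200c", "\u200d", "\u2060"]

def make_watermark(n=8, j=4, t=2, m=3, rng=random):
    """Build n syllables of m invisible characters each; the cue is the
    first j syllables, the reply the remaining n - j syllables, and the
    tail the last t syllables of the cue (duplicated before the reply)."""
    syllables = ["".join(rng.choice(ALPHABET) for _ in range(m))
                 for _ in range(n)]
    cue, reply = syllables[:j], syllables[j:]
    tail = cue[-t:]
    return cue, tail, reply

cue, tail, reply = make_watermark()
```

A real implementation would additionally reject candidate pairs that violate the uniqueness and non-containment constraints described below.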
Watermark Space and Uniqueness
Mathematically, the authors prove that the number of distinct cue‑reply pairs grows exponentially with the size of the character alphabet and the syllable length, guaranteeing a huge space (|W| ≥ |A|^{mj/2}) that supports millions of unique watermarks without collisions. They also enforce two constraints: (1) each cue maps to a unique reply and vice‑versa, and (2) no cue is a contiguous subsequence of any reply (and vice‑versa). These constraints are essential for the statistical test later on.
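The bound can be evaluated directly; even with the toy parameter values below (which are illustrative assumptions, not the paper's), the space is already large, and it grows exponentially as any of the three parameters increases:

```python
# Lower bound on the watermark space from the paper: |W| >= |A|^(m*j/2).
A = 4   # alphabet size |A| (number of invisible characters; assumed)
m = 3   # characters per syllable (assumed)
j = 4   # syllables in the cue (assumed)

lower_bound = A ** (m * j // 2)
print(lower_bound)  # 4**6 = 4096 distinct cue-reply pairs even at toy sizes
```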
Assignment and Counterfactuals
A trusted authority maintains a global pool of unused watermarks. When a user requests a watermark, the authority randomly draws a set W_K of size K, returns the whole set to the user, and the user selects one watermark w for actual embedding. The remaining K‑1 watermarks serve as counterfactual watermarks that were never used in training. Because the selection is uniform, the ranking test can bound the false‑positive rate (FPR) by k/K, where k is a pre‑chosen rank threshold.
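The logic behind the k/K bound can be checked with a small Monte Carlo simulation. This is an illustrative sanity check, not the paper's analysis: under the null hypothesis (no watermark trained on), the tested watermark's score is exchangeable with the K-1 counterfactual scores, so it lands in the top k with probability at most k/K.

```python
import random

random.seed(0)

def ranking_test_false_positive(K=100, k=1):
    """One null-hypothesis trial: all K scores are i.i.d., so by symmetry
    the tested watermark (position 0, arbitrary) ranks in the top k with
    probability k/K."""
    scores = [random.random() for _ in range(K)]
    tested = scores[0]
    rank = 1 + sum(s > tested for s in scores[1:])  # rank 1 = highest score
    return rank <= k

trials = 20_000
fpr = sum(ranking_test_false_positive() for _ in range(trials)) / trials
print(f"empirical FPR = {fpr:.4f} (bound k/K = 0.01)")
```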
Embedding Procedure
The original document is split into sub‑documents of length δ words. Odd‑indexed sub‑documents receive the cue (excluding its tail), while even‑indexed sub‑documents receive the tail followed by the reply. The invisible syllables are interleaved with the original words at a fixed step, preserving the visible text exactly. This ensures text authenticity: the document’s human‑readable content is unchanged, satisfying legal and usability constraints.
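A minimal sketch of the embedding step, assuming hypothetical parameter names (`delta` for the sub-document length, `step` for the interleaving stride) and appending each invisible syllable to every `step`-th word:

```python
def embed(words, cue, tail, reply, delta=50, step=5):
    """Split `words` into sub-documents of `delta` words; odd-indexed
    sub-documents (1-based) carry the cue without its tail, even-indexed
    ones carry tail + reply. Appending invisible syllables leaves the
    rendered text unchanged."""
    chunks = [words[i:i + delta] for i in range(0, len(words), delta)]
    out = []
    for idx, chunk in enumerate(chunks):
        payload = cue[:-len(tail)] if idx % 2 == 0 else tail + reply
        marked = list(chunk)
        syllables = iter(payload)
        for pos in range(0, len(marked), step):
            marked[pos] += next(syllables, "")  # renders as nothing
        out.append(" ".join(marked))
    return " ".join(out)
```

Stripping the invisible characters from the output recovers the original text exactly, which is the authenticity property the section describes.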
Verification Protocol
For each cue‑chunk, the auditor constructs a prompt that concatenates the cue chunk with the beginning of the subsequent reply chunk (truncated after t syllables) to complete the cue. The model is queried λ times with this prompt. If the reply appears in any of the λ responses, the chunk is marked as a “hit”. The document‑level score is the sum of hits across all cue‑reply pairs; the collection‑level score aggregates across all documents that share the same watermark.
The crucial decision step ranks the score of the tested watermark against the scores of its K‑1 counterfactuals. If the tested watermark ranks within the top k and its score is positive, the auditor declares that the protected dataset was used in fine‑tuning. Because the counterfactuals are drawn from the same distribution and never appear in training, the test provides a provable bound on the probability of a false positive.
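The scoring and decision steps can be sketched as follows. The stand-in black-box model, prompt strings, and function names are assumptions for illustration, not the paper's interface:

```python
def score(query_model, prompts, replies, lam=5):
    """Count chunk-level hits: a chunk is a hit if its reply appears in
    any of `lam` responses to the corresponding cue prompt."""
    hits = 0
    for prompt, rep in zip(prompts, replies):
        if any(rep in query_model(prompt) for _ in range(lam)):
            hits += 1
    return hits

def decide(real_score, counterfactual_scores, k=1):
    """Declare dataset use only if the tested watermark's score is positive
    and it ranks within the top k of all K scores (bounding FPR by k/K)."""
    rank = 1 + sum(s > real_score for s in counterfactual_scores)
    return real_score > 0 and rank <= k

# Toy stand-in for a black-box model that memorized cue -> reply (assumption).
memorized = {"CUE1": "...REP1...", "CUE2": "...REP2..."}
model = lambda prompt: memorized.get(prompt, "unrelated response")

real = score(model, ["CUE1", "CUE2"], ["REP1", "REP2"])
counterfactuals = [score(model, ["Ca", "Cb"], ["Rx", "Ry"]) for _ in range(99)]
```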
Statistical Guarantees
The authors derive the false‑positive bound directly from the uniform sampling assumption:
FPR ≤ k / K.
Completeness (true‑positive rate) depends on the per‑chunk hit probability p, which is empirically measured for each model and domain. Repeating the prompt λ times reduces the miss probability to (1‑p)^λ, and probing X independent chunks yields an overall miss probability (1‑p)^{λX}. By selecting λ and X based on desired error budgets, the auditor can achieve high detection power even when the marked documents constitute a tiny fraction of the fine‑tuning corpus.
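The budget calculation reduces to solving (1-p)^{λX} ≤ β for the smallest X. A sketch, with the symbol names (`p`, `beta`, `lam`, `X`) and numeric values chosen here for illustration:

```python
import math

def chunks_needed(p, beta, lam=3):
    """Smallest number of probed chunks X with (1 - p)**(lam * X) <= beta,
    given per-chunk hit probability p and a miss-probability budget beta."""
    per_chunk_miss = (1 - p) ** lam
    return math.ceil(math.log(beta) / math.log(per_chunk_miss))

X = chunks_needed(p=0.3, beta=1e-4)
print(X, (1 - 0.3) ** (3 * X))  # X = 9 chunks; overall miss < 1e-4
```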
Experimental Evaluation
The framework is evaluated on several open‑source and commercial LLMs, including Mistral‑7B‑v0.1, DeepSeek‑R1, and GPT‑4o. Experiments vary the proportion of marked data (0.5 %–5 % of the fine‑tuning set), the number of syllables per watermark, and the λ/X parameters. Key findings:
- Detection Power – Average true‑positive rates (TPR) exceed 85 % across models when at least a few dozen marked documents are present, even if they represent only 0.5 % of the fine‑tuning data.
- False‑Positive Control – With k set to 1 and K = 100, observed FPR is effectively 0 % (no false alarms in thousands of trials), matching the theoretical bound.
- Multi‑Watermark Interference – When multiple users embed distinct watermarks into the same corpus, the detection of each watermark remains robust; cross‑talk is negligible because cues and replies are disjoint by construction.
- Robustness to Data Pipelines – The invisible Unicode marks survive common preprocessing steps (HTML stripping, tokenization, deduplication) used by large‑scale web crawls such as C4, The Pile, RedPajama, and Dolma.
- Tokenizer Compatibility – Ten representative tokenizers (including those of LLaMA, Mistral, GPT‑4o) correctly preserve the invisible characters, ensuring that the model can still learn the cue‑reply association.
- Safety Filters – Public chatbot interfaces (ChatGPT, Le Chat, DeepSeek) do not strip the marks, confirming that the technique can be deployed against commercial APIs.
Limitations and Future Work
The authors acknowledge that the current threat model assumes non‑adversarial preprocessing. An attacker who deliberately removes zero‑width characters or applies character‑level perturbations could evade detection. Future research could explore more resilient encoding schemes (e.g., homomorphic embeddings) or combine invisible marks with semantic canaries. Additionally, the trade‑off between watermark length, insertion frequency, and document size warrants systematic study to avoid bloating user content. Scaling the verification budget (λ × X) for real‑time services also remains an open engineering challenge.
Conclusion
The paper delivers a practical, statistically sound solution for data provenance auditing of fine‑tuned LLMs. By leveraging invisible Unicode watermarks organized as cue‑reply pairs, the method respects text authenticity, scales to millions of users, and operates solely with black‑box prompt‑response access. The ranking‑based hypothesis test provides a provable false‑positive bound, and extensive experiments demonstrate high detection rates even when only a tiny fraction of the fine‑tuning data is marked. This work bridges a critical gap between intellectual‑property protection and the opaque nature of modern LLM fine‑tuning pipelines, opening avenues for responsible AI deployment and legal enforceability.