A Fingerprint for Large Language Models
Recent advances confirm that large language models (LLMs) can achieve state-of-the-art performance across various tasks. However, because training LLMs from scratch is resource-intensive, it is urgent and crucial to protect their intellectual property against infringement. This motivates us to propose a novel black-box fingerprinting technique for LLMs. We first demonstrate that the outputs of an LLM span a unique vector space associated with that model. We then model fingerprint authentication as the task of evaluating the similarity between the output space of the victim model and that of the suspect model. To tackle this problem, we introduce two solutions: the first determines whether suspect outputs lie within the victim's subspace, enabling fast infringement detection; the second reconstructs a joint subspace to detect models modified via parameter-efficient fine-tuning (PEFT). Experiments indicate that the proposed method achieves superior performance in fingerprint verification and is robust against PEFT attacks. This work reveals inherent characteristics of LLMs and provides a promising solution for protecting them, ensuring efficiency, generality, and practicality.
💡 Research Summary
The paper addresses the pressing problem of protecting the intellectual property (IP) of large language models (LLMs) in scenarios where only black‑box access is available. The authors observe that the logits produced by an LLM are linear transformations of the hidden representation z by the final linear layer W, so every possible logit vector lies in the column space of W, a subspace of dimension at most h (the hidden size). Because the vocabulary size |V| is far larger than h, the number of possible subspaces is astronomically large, making it practically impossible for two independently trained models to share the same subspace. Consequently, the subspace defined by W serves as a unique fingerprint for each model.
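This low-rank property is easy to see numerically. The sketch below (with toy dimensions, not a real model) draws a random final layer W and many hidden states, and shows that however many logit vectors we collect, their rank never exceeds h:

```python
import numpy as np

rng = np.random.default_rng(0)
h, V = 64, 4096                      # toy hidden size and vocabulary size, h << |V|
W = rng.normal(size=(V, h))          # stand-in for the model's final linear layer

# Every logit vector s = W @ z lies in the column space of W.
Z = rng.normal(size=(h, 200))        # 200 hidden representations
S = W @ Z                            # corresponding logits, shape (V, 200)

# Despite living in a |V|-dimensional ambient space, the collected
# logits span a subspace of dimension at most h.
rank = np.linalg.matrix_rank(S)
print(rank, "<<", V)
```

Since the subspace is determined entirely by W, two independently initialized and trained models almost surely produce different, non-overlapping subspaces.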
Two verification mechanisms are proposed. The first, "subspace membership testing," stores only the victim model's final‑layer weights. By querying a suspect model with a fixed prompt and projecting the obtained logits s′ onto the stored subspace, the method checks whether the residual (I − WW⁺)s′ is near zero. This test is computationally cheap and can be performed in real time, enabling rapid detection of direct copies.
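A minimal sketch of this membership test, again with toy dimensions and a random matrix standing in for the victim's stored weights: the projector W W⁺ is applied to the suspect's logits, and a near-zero residual flags a copy.

```python
import numpy as np

rng = np.random.default_rng(1)
h, V = 64, 4096
W_victim = rng.normal(size=(V, h))    # stored final-layer weights of the victim
W_pinv = np.linalg.pinv(W_victim)     # precomputed pseudoinverse W⁺

def residual_norm(s):
    """Norm of (I - W W⁺) s: the component of s outside the victim's subspace."""
    return np.linalg.norm(s - W_victim @ (W_pinv @ s))

# A direct copy of the victim produces logits inside the subspace...
s_copy = W_victim @ rng.normal(size=h)
# ...while an independently trained model's logits almost surely fall outside it.
s_other = rng.normal(size=(V, h)) @ rng.normal(size=h)

r_copy, r_other = residual_norm(s_copy), residual_norm(s_other)
print(r_copy)    # near zero
print(r_other)   # large
```

In practice a small threshold on the residual (relative to the logit norm) absorbs floating-point and API quantization noise.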
The second mechanism tackles models that have been altered through parameter‑efficient fine‑tuning (PEFT) such as LoRA. PEFT updates only a low‑rank additive component while keeping W fixed, so the resulting model’s logits still largely reside in the original subspace. The authors construct a joint subspace L̂ from logits collected from both the victim and suspect models, using SVD or PCA. They then measure alignment (e.g., cosine similarity) of each model’s logits with L̂. High alignment indicates that the suspect model is a PEFT‑derived derivative, while low alignment suggests an unrelated model.
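One way to operationalize the joint-subspace idea (an illustrative sketch, not the paper's exact metric) is to stack logits from both models and inspect the spectrum of the combined collection: if the suspect shares the victim's subspace, the joint dimension stays near h, whereas an unrelated model roughly doubles it.

```python
import numpy as np

rng = np.random.default_rng(2)
h, V, n = 64, 4096, 128
W = rng.normal(size=(V, h))                       # victim's final layer

S_victim = W @ rng.normal(size=(h, n))            # victim logits
S_suspect = W @ rng.normal(size=(h, n))           # PEFT-style derivative: W kept, new hidden states
S_other = rng.normal(size=(V, h)) @ rng.normal(size=(h, n))  # unrelated model

def joint_rank(S_a, S_b, tol=1e-8):
    """Effective dimension of the joint subspace spanned by [S_a | S_b],
    estimated from the singular values of the stacked logit collection."""
    sv = np.linalg.svd(np.hstack([S_a, S_b]), compute_uv=False)
    return int((sv > tol * sv[0]).sum())

print(joint_rank(S_victim, S_suspect))   # ≈ h: shared subspace, likely a derivative
print(joint_rank(S_victim, S_other))     # ≈ 2h: independent subspaces
```

The paper's alignment score (e.g., cosine similarity against the reconstructed basis L̂) is a continuous refinement of this rank-based view.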
Because many APIs expose only partial information, the paper also describes how to reconstruct full logits from limited outputs. When the API returns the full probability vector p, a centered log‑ratio (CLR) transform yields logits s* that differ from the true logits only by a constant bias, which does not affect subspace tests. When only top‑k probabilities are available, the authors exploit the ability to add a bias b to selected token logits. By repeatedly forcing different sets of tokens into the top‑k and observing the biased probabilities, they solve for the original probabilities of all tokens, effectively reconstructing the complete logit vector with O(|V|/k) queries.
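The full-probability case is a one-liner to verify: since p = softmax(s) gives log p_i = s_i − log Z, subtracting the vocabulary-wide mean of log p cancels log Z and leaves the true logits up to a constant shift. A small numerical check:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 4096
s_true = rng.normal(size=V)                   # true logits (hidden from the API user)
p = np.exp(s_true) / np.exp(s_true).sum()     # full probability vector returned by the API

# Centered log-ratio (CLR) transform: log p minus its mean over the vocabulary.
s_clr = np.log(p) - np.log(p).mean()

# The recovered vector differs from the true logits only by a constant bias,
# which projection-based subspace tests are insensitive to.
shift = s_true - s_clr
print(np.allclose(shift, shift[0]))           # the difference is a constant vector
```

The top-k case described above requires the extra bias-injection queries, but once the full probability vector is assembled, the same CLR step applies.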
Extensive experiments on open‑source LLMs (LLaMA, Gemma, Mistral) demonstrate that subspace membership detection achieves >99 % verification accuracy, while the joint‑subspace alignment method detects LoRA‑fine‑tuned derivatives with >95 % recall. The reconstruction procedures introduce negligible overhead and do not degrade generation quality (perplexity, BLEU). Compared with prior watermarking approaches that require white‑box access or degrade model performance, this method works entirely in a black‑box setting, is model‑agnostic, and preserves the functional capabilities of the LLM.
In summary, the work reveals that the linear output layer of an LLM defines a high‑dimensional but low‑rank vector space that uniquely identifies the model. By leveraging this property, the authors provide practical, efficient, and robust techniques for ownership verification and for detecting PEFT‑based model theft, offering a valuable tool for safeguarding the rapidly expanding ecosystem of large language models.