A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors
Protecting the intellectual property of large language models (LLMs) is a critical challenge due to the proliferation of unauthorized derivative models. We introduce a novel fingerprinting framework that leverages the behavioral patterns induced by safety alignment, applying the concept of refusal vectors for LLM provenance tracking. These vectors, extracted from directional patterns in a model’s internal representations when processing harmful versus harmless prompts, serve as robust behavioral fingerprints. Our contribution lies in developing a fingerprinting system around this concept and conducting extensive validation of its effectiveness for IP protection. We demonstrate that these behavioral fingerprints are highly robust against common modifications, including fine-tuning, merging, and quantization. Our experiments show that the fingerprint is unique to each model family, with low cosine similarity between independently trained models. In a large-scale identification task across 76 offspring models, our method achieves 100% accuracy in identifying the correct base model family. Furthermore, we analyze the fingerprint’s behavior under alignment-breaking attacks, finding that while performance degrades significantly, detectable traces remain. Finally, we propose a theoretical framework to transform this private fingerprint into a publicly verifiable, privacy-preserving artifact using locality-sensitive hashing and zero-knowledge proofs.
💡 Research Summary
This paper introduces a novel behavioral fingerprinting framework for Large Language Models (LLMs) designed to address the critical challenge of intellectual property protection and provenance tracking in an era of widespread model derivatives. The core innovation lies in leveraging the behavioral patterns inherent in a model’s safety alignment mechanisms, specifically through the concept of “refusal vectors.”
The fingerprint is constructed by analyzing a model’s internal representations. The method involves processing two distinct sets of prompts: harmful queries designed to trigger refusal and harmless ones. For each transformer layer, the average hidden state (centroid) for each prompt category is computed. The layer-wise “refusal vector” is defined as the normalized difference vector between the harmful and harmless centroids. To create a compact and robust fingerprint, refusal vectors from a selected set of middle layers (excluding potential noise from very early or late layers) are aggregated via averaging and then L2-normalized into a single, high-dimensional vector.
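The construction above can be sketched in a few lines of NumPy. This is not the authors' implementation: it assumes hidden states have already been collected into arrays of shape `(num_prompts, num_layers, hidden_dim)`, and the fractional bounds used to select the middle layers are illustrative choices.

```python
import numpy as np

def refusal_fingerprint(harmful_hidden, harmless_hidden,
                        layer_lo=0.25, layer_hi=0.75):
    """Build a single L2-normalized fingerprint from per-layer
    refusal vectors (normalized harmful-minus-harmless centroid
    differences), averaged over a middle slice of layers.

    Inputs: arrays of shape (num_prompts, num_layers, hidden_dim).
    The layer_lo/layer_hi fractions are illustrative, not the
    paper's exact layer selection.
    """
    num_layers = harmful_hidden.shape[1]
    lo, hi = int(num_layers * layer_lo), int(num_layers * layer_hi)

    layer_vectors = []
    for layer in range(lo, hi):
        # Centroid (mean hidden state) of each prompt category at this layer.
        c_harmful = harmful_hidden[:, layer, :].mean(axis=0)
        c_harmless = harmless_hidden[:, layer, :].mean(axis=0)
        diff = c_harmful - c_harmless
        layer_vectors.append(diff / np.linalg.norm(diff))  # layer refusal vector

    fingerprint = np.mean(layer_vectors, axis=0)  # aggregate middle layers
    return fingerprint / np.linalg.norm(fingerprint)  # final L2 normalization
```

The only model-specific step left out is collecting the hidden states themselves, which depends on the inference framework used.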
The proposed fingerprint demonstrates three key properties. First, it exhibits high robustness against common model modifications. Extensive experiments across seven base model families (e.g., Llama-3.1, Qwen2.5) and 76 derivative models show that the fingerprint maintains strong similarity with its base model after quantization (cosine similarity ~0.99), adapter integration (e.g., LoRA, ~0.94), supervised fine-tuning (~0.88), and model merging (~0.73). Second, it proves to be unique across independently trained models. Fingerprints from different model families show near-zero pairwise cosine similarity (<0.1), effectively acting as orthogonal identifiers. Third, it achieves high identification accuracy. In a large-scale closed-set identification task, the method achieved 100% Top-1 accuracy in attributing the 76 diverse derivative models to their correct base model family.
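The closed-set identification step reduces to a nearest-neighbor search under cosine similarity: a derivative model's fingerprint is attributed to the base family whose fingerprint it is most similar to. A minimal sketch, with synthetic family names and vectors standing in for real fingerprints:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_base(query_fp, base_fps):
    """Attribute a derivative model's fingerprint to the base family
    with maximal cosine similarity (Top-1 closed-set identification).

    base_fps: dict mapping family name -> base fingerprint vector.
    """
    return max(base_fps, key=lambda name: cosine(query_fp, base_fps[name]))
```

Because independently trained families are near-orthogonal (pairwise similarity below 0.1) while derivatives stay well above that even after merging (~0.73), the Top-1 decision has a large margin, consistent with the reported 100% accuracy.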
The paper further investigates the fingerprint’s resilience under alignment-breaking attacks, which are designed to remove safety guardrails. While such attacks significantly degrade the fingerprint similarity (to around 0.5), the remaining signal is still substantially higher than the similarity between unrelated families, indicating that detectable traces persist. Finally, the authors propose a theoretical privacy-preserving public verification framework. By combining Locality-Sensitive Hashing (SimHash) to create a compact fingerprint digest with Zero-Knowledge Proof (ZKP) protocols, model owners could potentially prove that a public fingerprint hash was correctly derived from their private model weights without revealing the weights themselves.
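The SimHash half of the proposed verification framework is simple to illustrate: each digest bit is the sign of the fingerprint's projection onto a random hyperplane, so fingerprints with high cosine similarity produce digests with small Hamming distance. The sketch below covers only this LSH step, not the ZKP protocol; the bit count and projection seed are illustrative parameters.

```python
import numpy as np

def simhash_digest(fingerprint, num_bits=256, seed=42):
    """SimHash digest: one sign bit per random hyperplane projection.
    The seed fixes the (public) hyperplanes so digests are comparable.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_bits, fingerprint.shape[0]))
    return (planes @ fingerprint >= 0).astype(np.uint8)

def hamming(d1, d2):
    """Number of differing bits between two digests."""
    return int(np.sum(d1 != d2))
```

For unit vectors, the expected fraction of differing bits equals the angle between them divided by pi, which is what makes the digest a locality-sensitive summary of the fingerprint: nearby fingerprints collide on most bits, unrelated ones disagree on roughly half.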
In summary, this work presents a powerful and practical fingerprinting technique that ties a model’s identity to its intrinsic, alignment-induced behavioral signatures. It offers a robust solution for tracing model lineage, enhancing accountability, and protecting intellectual property in the rapidly evolving LLM ecosystem.