DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) are valuable intellectual property (IP) for their legitimate owners due to the enormous computational cost of training. It is crucial to protect LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or require white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel **Du**al-Level **Fin**gerprinting **F**ramework for ownership verification in the black-box setting. DuFFin extracts trigger-pattern and knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from open-source repositories, including four popular base models as protected LLMs and their fine-tuned, quantized, and safety-aligned versions, released by large companies, start-ups, and individual users. Results show that our method accurately verifies the copyright of the protected base LLMs on their model variants, achieving an IP-ROC metric greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.


💡 Research Summary

DuFFin introduces a practical, black‑box fingerprinting framework for protecting the intellectual property of large language models (LLMs). Existing protection methods fall into two categories: invasive watermarking, which modifies model parameters and can degrade generation quality, and non‑invasive fingerprinting, which typically requires access to intermediate activations and therefore cannot be applied when only the model’s outputs are observable. DuFFin addresses both limitations by extracting two complementary types of fingerprints—Trigger‑DuFFin and Knowledge‑DuFFin—using only the model’s output token sequences (and optionally logits).

Trigger‑DuFFin builds a secret key consisting of a set of carefully selected prompts (the “trigger set”). The intuition is that a pirated model derived from a protected base will produce responses that are highly similar to the base model’s for these prompts, whereas independently trained models will differ. To capture this similarity robustly, the authors train a fingerprint extractor E based on a T5 encoder. The extractor maps each model’s response (formatted with a simple template) to a fixed‑dimensional vector via average pooling of the last hidden layer. Training uses a contrastive loss: for each protected model ψ_pro, positive pairs are formed with its fine‑tuned or quantized variants ψ_pir, while negative pairs are formed with unrelated models ψ_ind. The loss maximizes the cosine similarity of positive pairs and minimizes that of negative pairs, with a temperature scaling factor τ. Multiple protected models are included during training to improve generalization. At verification time, the cosine similarity between the extracted vectors for the suspect model and the protected model is averaged over all triggers; a high average similarity indicates likely derivation.
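The contrastive objective and the verification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`contrastive_loss`, `trigger_verify`), the InfoNCE-style form of the loss, and the 0.9 decision threshold are assumptions for the sketch; in the paper the vectors come from a trained T5-based extractor rather than being given directly.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two fingerprint vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    # InfoNCE-style loss (an assumed concrete form): pull the pirated
    # variant's fingerprint toward the protected model's, push fingerprints
    # of independent models away, with temperature tau.
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

def trigger_verify(fps_protected, fps_suspect, threshold=0.9):
    # Average cosine similarity over all trigger responses; the threshold
    # is illustrative, not taken from the paper.
    sims = [cosine(p, s) for p, s in zip(fps_protected, fps_suspect)]
    score = float(np.mean(sims))
    return score, score >= threshold
```

A derived model whose trigger responses embed close to the protected model's yields a low loss during training and a high averaged similarity at verification time.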

Knowledge‑DuFFin is a training‑free approach that leverages multi‑domain knowledge consistency. The secret key is a collection of multiple‑choice questions spanning N domains (e.g., chemistry, economics). Each domain contributes Q filtered questions that are neither too easy nor too hard for the protected models. When queried, the model must output a single choice (A–D). The sequence of choices across all questions forms the model’s fingerprint vector. Because fine‑tuning or quantization typically does not alter a model’s broad knowledge, a pirated model’s answer vector will be nearly identical to that of its source, while independent models will differ. Verification simply computes the Hamming distance between the two vectors; a small distance signals ownership.
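The knowledge-level check reduces to comparing two answer sequences. The sketch below assumes the fingerprint is the concatenation of single-letter choices and uses an illustrative 10% mismatch threshold; the helper names and the threshold are not taken from the paper.

```python
def knowledge_fingerprint(answers):
    # Concatenate the single-letter choices (A-D) into one fingerprint string.
    return "".join(answers)

def hamming_distance(fp_a, fp_b):
    # Number of question positions where the two models' answers disagree.
    assert len(fp_a) == len(fp_b), "fingerprints must cover the same question set"
    return sum(a != b for a, b in zip(fp_a, fp_b))

def knowledge_verify(fp_protected, fp_suspect, max_mismatch_ratio=0.1):
    # Declare likely derivation when only a small fraction of answers differ
    # (threshold chosen for illustration only).
    d = hamming_distance(fp_protected, fp_suspect)
    return d, d / len(fp_protected) <= max_mismatch_ratio
```

Because no extractor is trained, this level is cheap: verification is a single pass of Q×N queries per model followed by a string comparison.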

The two levels can be combined—by weighted averaging or logical conjunction—to strengthen robustness, especially when one level alone is ambiguous.
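Either combination rule mentioned above is straightforward to state in code. The weight `w` and the function names below are illustrative assumptions, not values from the paper; both inputs are assumed to be normalized to [0, 1].

```python
def combine_scores(trigger_sim, knowledge_agreement, w=0.5):
    # Weighted average of the two level scores (weight w is illustrative).
    return w * trigger_sim + (1 - w) * knowledge_agreement

def combined_decision(trigger_pass, knowledge_pass, mode="and"):
    # "and": conservative conjunction, both levels must fire;
    # "or": flag derivation if either level fires.
    if mode == "and":
        return trigger_pass and knowledge_pass
    return trigger_pass or knowledge_pass
```

The conjunction lowers false positives at the cost of recall, while the weighted average lets a strong signal from one level compensate for an ambiguous one from the other.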

Experimental evaluation covers four popular base LLMs (e.g., LLaMA‑2, Falcon, Mistral) and a broad set of their derived variants, including supervised fine‑tuning, 4‑bit quantization, and RLHF alignment. Over 30 suspect models are tested. DuFFin achieves IP‑ROC scores between 0.95 and 0.99, demonstrating high true‑positive rates with low false‑positive rates. Even with aggressive fine‑tuning, the trigger‑level fingerprints remain discriminative. Reducing the number of knowledge questions to fewer than ten per domain only modestly impacts performance, indicating efficiency. All code and datasets are released publicly, facilitating reproducibility.

Strengths:

  • Fully non‑invasive; requires only outputs (and optionally logits).
  • Dual‑level design provides complementary signals, improving resilience against adaptive attacks.
  • Contrastive training of the trigger extractor yields strong generalization across unseen LLMs.
  • Training‑free knowledge fingerprint is simple, interpretable, and computationally cheap.
  • Open‑source release enhances transparency and adoption potential.

Limitations and open questions:

  • The security of the trigger set relies on secrecy; an adversary who discovers the prompts could deliberately perturb responses to evade detection.
  • Knowledge‑DuFFin assumes models will answer multiple‑choice questions directly; models that refuse to answer or produce free‑form text could undermine this component.
  • Experiments are limited to open‑source models; applicability to closed‑source, large‑scale commercial LLMs (e.g., GPT‑4) remains to be validated.
  • Potential robustness against adversarial fine‑tuning that explicitly decorrelates responses to the trigger set is not explored.

Future directions may include dynamic or adversarial prompt generation, meta‑learning to adapt the extractor to novel model families, extending knowledge fingerprints to open‑ended QA formats, and large‑scale trials on proprietary models.

In summary, DuFFin offers a compelling, scalable solution for black‑box LLM IP verification. By jointly leveraging response pattern similarity and multi‑domain knowledge consistency, it achieves high verification accuracy across a variety of model modifications while preserving the original model’s generation quality. This work advances the state of the art in LLM copyright protection and provides a solid foundation for further research and deployment in real‑world settings.

