Weight-Space Detection of Backdoors in LoRA Adapters
LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, adapters are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, which makes them a vector for backdoor attacks. Current detection methods require running the model on test inputs, which is impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. The method extracts simple spectral statistics, namely how concentrated the singular values are, their entropy, and the shape of the weight distribution, and flags adapters that deviate from normal patterns. We evaluate on 500 LoRA adapters for Llama-3.2-3B (400 clean, 100 poisoned), trained on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97% detection accuracy with less than 2% false positives.
💡 Research Summary
The paper addresses a pressing security concern in the emerging ecosystem of parameter‑efficient fine‑tuning (PEFT) for large language models (LLMs): malicious LoRA adapters that embed backdoors. Existing defenses either require access to the original training data, need to run the model on a large set of test inputs, or depend on a clean reference model—approaches that are infeasible for screening thousands of adapters hosted on open repositories such as the Hugging Face Hub.
The authors propose a static, data‑agnostic detection pipeline that operates directly on the weight matrices of a LoRA adapter, without any model execution. Their central hypothesis is that a backdoor task (trigger → specific response) is a very simple mapping that dominates the low‑rank update introduced by LoRA. Consequently, the singular‑value spectrum of the weight update ΔW = B·A (where B∈ℝ^{d×r} and A∈ℝ^{r×k}) exhibits a distinctive pattern: a large leading singular value, high energy concentration, low entropy, and a peaked distribution (high kurtosis).
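The spectral signature behind this hypothesis is easy to probe numerically. Below is a minimal sketch using synthetic matrices (not real adapter weights): a generic rank-r update has its energy spread across many directions, while an update contaminated by one strongly dominant direction, standing in for a simple trigger → response mapping, concentrates most of its spectral energy in σ₁. The 50× scale of the injected rank-1 term is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 16

# "Clean-like" low-rank update: energy spread across all r directions.
B_clean = rng.normal(size=(d, r))
A_clean = rng.normal(size=(r, k))
dW_clean = B_clean @ A_clean

# "Backdoor-like" update: one direction dominates the low-rank subspace
# (a stand-in for a simple trigger -> fixed-response mapping).
u = rng.normal(size=(d, 1))
v = rng.normal(size=(1, k))
dW_bd = dW_clean + 50.0 * (u @ v)

def energy_concentration(dw):
    """E = sigma_1 / sum(sigma_i), the paper's energy-concentration statistic."""
    s = np.linalg.svd(dw, compute_uv=False)
    return s[0] / s.sum()

print(energy_concentration(dW_clean))  # modest: energy spread over r directions
print(energy_concentration(dW_bd))     # close to 1: spectrum dominated by sigma_1
```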
Methodology
- Weight Extraction – From each adapter’s safetensors file the authors extract the low‑rank matrices B and A, compute ΔW for each of the four attention projections (query, key, value, output), and sum them to obtain a single ΔW per adapter.
- Spectral Decomposition – Perform singular‑value decomposition (SVD) on ΔW, yielding singular values σ₁ ≥ σ₂ ≥ … ≥ σ_r (r = 16 for Llama‑3.2‑3B).
- Feature Engineering – Compute five statistics: (i) leading singular value σ₁, (ii) Frobenius norm ‖ΔW‖_F, (iii) energy concentration E = σ₁ / Σ_i σ_i, (iv) spectral entropy H = −Σ_k p_k log p_k with p_k = σ_k / Σ_j σ_j, and (v) kurtosis of the flattened weight matrix.
- Z‑Score Normalization – Build a reference bank of 400 clean adapters, calculate mean μ and standard deviation σ for each metric, and transform each adapter’s raw values into Z‑scores. Entropy is sign‑flipped so that more anomalous (lower) entropy yields higher Z‑scores.
- Score Boundedness – Apply a tanh‑based scaling n = ½·(1 + tanh(z/2)) to map Z‑scores into the bounded interval (0, 1).