In-Context Probing for Membership Inference in Fine-Tuned Language Models
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties, such as content difficulty or rarity, leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
💡 Research Summary
This paper introduces ICP‑MIA, a novel black‑box membership inference attack (MIA) designed for fine‑tuned large language models (LLMs). The authors observe that during fine‑tuning, the loss on each training sample drops rapidly at first and then exhibits diminishing returns as training proceeds. At convergence, a member sample has little remaining loss‑reduction potential, whereas a non‑member still retains a sizable “optimization gap” – the amount by which its loss could still be reduced with further training.
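The optimization gap described above can be written compactly; the notation here is an illustrative formalization, not taken verbatim from the paper. For a target sample \(x\), fine-tuned parameters \(\theta_T\), and per-sample loss \(\mathcal{L}\):

```latex
\Delta(x) \;=\; \mathcal{L}(x;\,\theta_T) \;-\; \min_{\theta}\,\mathcal{L}(x;\,\theta)
```

Intuitively, a member sample has already been driven close to its attainable minimum during fine-tuning, so \(\Delta(x) \approx 0\), while a non-member retains \(\Delta(x) \gg 0\): further (real or simulated) training could still reduce its loss substantially.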
To exploit this gap without accessing model parameters, the authors propose In‑Context Probing (ICP), a training‑free technique that simulates an additional fine‑tuning step by inserting carefully crafted context examples into the model’s prompt. Two probing strategies are offered: (1) a reference‑data‑based approach that selects semantically similar public examples to serve as demonstrations, and (2) a reference‑free self‑perturbation approach that creates probes by masking, regenerating, or otherwise altering the target sample itself. By measuring the change in log‑likelihood of the target sample before and after probing, the attacker obtains an estimate of the optimization gap, which serves as a robust membership score.
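The scoring step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `loglik_fn` stands in for whatever black-box scoring access the attacker has (total log-likelihood of a continuation given a prompt), and the stub scorer exists only to show the expected member/non-member separation.

```python
def membership_score(loglik_fn, target, probe_context):
    """Estimate the optimization gap of `target` via in-context probing.

    loglik_fn(prompt, continuation) -> log-likelihood of `continuation`
    given `prompt` under the fine-tuned model (black-box scoring access).
    The score is the log-likelihood gain produced by the probe context:
    members, already fit at convergence, gain little; non-members gain a lot.
    """
    base = loglik_fn("", target)               # log p(target)
    probed = loglik_fn(probe_context, target)  # log p(target | probe context)
    return probed - base                       # estimated optimization gap

# Stub scorer with hand-picked values (NOT a real model) to illustrate
# the behaviour: the probe barely helps the member, greatly helps the
# non-member.
_scores = {
    ("", "member text"): -2.0,
    ("probe", "member text"): -1.9,
    ("", "nonmember text"): -9.0,
    ("probe", "nonmember text"): -3.0,
}
stub_loglik = lambda prompt, cont: _scores[(prompt, cont)]

member_gap = membership_score(stub_loglik, "member text", "probe")       # 0.1
nonmember_gap = membership_score(stub_loglik, "nonmember text", "probe")  # 6.0
```

Thresholding this score (larger gap means more likely a non-member) yields the membership decision; in practice the log-likelihoods would come from querying the deployed model.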
The authors evaluate ICP‑MIA on three downstream tasks—medical record summarization (HealthcareMagic), news article summarization (CNN‑DM), and code generation (CodeXGLUE)—and across multiple model families (LLaMA, LLaMA‑2, GPT‑Neo). Compared with prior black‑box attacks such as ReCaLL, Min‑K%, Min‑K%++, DF‑MIA, and reference‑based attacks like LiRA and SPV‑MIA, ICP‑MIA consistently achieves higher AUC (e.g., 0.942 vs 0.847 on HealthcareMagic) and dramatically higher true‑positive rates at low false‑positive rates (e.g., TPR@1% FPR = 0.518 vs 0.195 for ReCaLL). The distribution of log‑likelihood improvements shows a clear separation: members exhibit minimal improvement (≈0.05), while non‑members improve substantially (≈0.86).
Further analysis investigates the impact of parameter‑efficient fine‑tuning (PEFT) methods such as LoRA, adapters, and prefix‑tuning. The optimization‑gap signal remains strong under PEFT, and in some cases (e.g., LoRA) the gap for non‑members widens, enhancing attack efficacy. The authors also study how training schedule variables (epoch count, learning rate) affect the gap, finding that longer training and higher learning rates increase the disparity, especially when over‑fitting occurs.
From a theoretical standpoint, the work connects the optimization gap to recent findings that in‑context learning behaves like a meta‑optimizer, generating implicit gradient updates from context examples. ICP therefore measures the residual learning capacity of the model rather than raw confidence, providing a principled and interpretable membership signal.
Practical considerations include guidelines for probe length, number of demonstrations, and masking ratios, demonstrating that effective signals can be obtained even under strict query budgets. The authors argue that ICP‑MIA can serve as a reliable auditing tool for deployed fine‑tuned LLMs, highlighting the need for defenses such as early stopping, stronger regularization, or differential‑privacy‑based fine‑tuning to reduce the optimization gap.
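As one concrete illustration of the masking knob discussed above, a self-perturbation probe can be built by masking a fraction of the target's tokens. This is a hedged sketch under assumed conventions (whitespace tokenization, a `[MASK]` placeholder string); the paper's actual probe construction may differ.

```python
import random

def mask_probe(text, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Build a self-perturbation probe by masking a fraction of tokens.

    A fixed seed keeps probe construction reproducible across queries,
    which matters under a strict query budget.
    """
    rng = random.Random(seed)
    tokens = text.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return " ".join(mask_token if i in masked_idx else tok
                    for i, tok in enumerate(tokens))

probe = mask_probe("a b c d e f g h i j", mask_ratio=0.2)
```

The perturbed text then serves as the probe context for the target sample, replacing the public reference demonstrations when no aligned reference data is available.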
In summary, the paper makes three key contributions: (1) formalizing the optimization gap as a fundamental membership indicator, (2) introducing a black‑box, training‑free in‑context probing method to estimate this gap, and (3) empirically showing that ICP‑MIA outperforms existing attacks across models, tasks, and fine‑tuning regimes. The work advances the state of the art in privacy assessment for LLMs and provides a solid foundation for future defenses against membership inference.