Privacy Bias in Language Models: A Contextual Integrity-based Auditing Metric

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

As large language models (LLMs) are integrated into sociotechnical systems, it is crucial to examine the privacy biases they exhibit. We define privacy bias as the appropriateness value of information flows in responses from LLMs. A deviation between privacy biases and expected values, referred to as privacy bias delta, may indicate privacy violations. As an auditing metric, privacy bias can help (a) model trainers evaluate the ethical and societal impact of LLMs, (b) service providers select context-appropriate LLMs, and (c) policymakers assess the appropriateness of privacy biases in deployed LLMs. We formulate and answer a novel research question: how can we reliably examine privacy biases in LLMs and the factors that influence them? We present a novel approach for assessing privacy biases using a contextual integrity-based methodology to evaluate the responses from various LLMs. Our approach accounts for the sensitivity of responses across prompt variations, which hinders the evaluation of privacy biases. Finally, we investigate how privacy biases are affected by model capacities and optimizations.


💡 Research Summary

The paper addresses the growing concern that large language models (LLMs) embedded in sociotechnical systems may inadvertently violate privacy norms. Drawing on Nissenbaum’s theory of contextual integrity (CI), the authors introduce “privacy bias” as a quantitative measure of how appropriate an LLM’s information flow is within a given context. Privacy bias is represented as a five‑dimensional tensor (P_bias) whose axes correspond to the CI parameters: sender, subject, information type, receiver, and transmission principle. Each specific flow yields a scalar appropriateness score; leaving one or more parameters unspecified produces slices (matrices, vectors, or subtensors) that capture groups of related flows.
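The tensor structure described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the axis labels, sizes, and placeholder scores are assumptions chosen only to show how fixing all five CI parameters yields a scalar while leaving one unspecified yields a slice.

```python
import numpy as np

# Hypothetical sketch of a P_bias tensor: one axis per CI parameter
# (sender, subject, information type, receiver, transmission principle).
# The labels and the uniform 0.5 fill are illustrative placeholders.
senders = ["doctor", "teacher"]
subjects = ["patient", "student"]
info_types = ["diagnosis", "grades"]
receivers = ["insurer", "parent"]
principles = ["with consent", "without consent"]

P_bias = np.full((len(senders), len(subjects), len(info_types),
                  len(receivers), len(principles)), 0.5)

# A fully specified flow yields a scalar appropriateness score:
score = P_bias[0, 0, 0, 0, 1]       # doctor, patient, diagnosis, insurer, no consent

# Leaving one parameter free yields a vector slice over that parameter
# (here: all transmission principles for an otherwise fixed flow):
principle_slice = P_bias[0, 0, 0, 0, :]

print(score, principle_slice.shape)
```

In practice each cell would be filled by querying the model about the corresponding flow; the point here is only the indexing scheme.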

An expected appropriateness tensor (A_exp) is derived from legal statutes, policy documents, crowd‑sourced expectations, or domain‑specific guidelines. The deviation between P_bias and A_exp, denoted Δ_bias (privacy bias delta), quantifies the extent to which the model’s behavior diverges from normative expectations. For single flows, Δ_bias can be computed as an absolute difference, an ordinal distance after embedding Likert scales, or a simple mismatch indicator for categorical judgments. For sets of flows, the authors propose aggregations such as mean absolute Δ_bias, signed mean (to capture systematic liberal or restrictive tendencies), variance or standard deviation (to assess consistency), and distributional divergences (KL, Wasserstein, total variation) when appropriateness values are treated as probability distributions.
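The aggregations listed above are straightforward once scores are on a numeric scale. The following is a minimal sketch assuming appropriateness values already mapped into [0, 1]; the example numbers and variable names are ours, not the paper's notation.

```python
import numpy as np

# Toy scores for four information flows (assumed, for illustration only).
p_bias = np.array([0.9, 0.2, 0.6, 0.8])   # model's appropriateness scores
a_exp  = np.array([0.7, 0.4, 0.6, 0.5])   # expected (normative) scores

delta = p_bias - a_exp                     # per-flow privacy bias delta

mean_abs_delta = np.mean(np.abs(delta))    # overall deviation magnitude
signed_mean    = np.mean(delta)            # > 0: systematically liberal
std_delta      = np.std(delta)             # consistency across flows

# Treating the score vectors as discrete distributions (after
# normalization), total variation distance is half the L1 difference.
p = p_bias / p_bias.sum()
q = a_exp / a_exp.sum()
total_variation = 0.5 * np.abs(p - q).sum()

print(mean_abs_delta, signed_mean, std_delta, total_variation)
```

KL or Wasserstein divergences would slot in the same way (e.g. via `scipy.stats.wasserstein_distance`), operating on the normalized vectors.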

A central methodological challenge is prompt sensitivity: small paraphrases can cause large swings in model responses, obscuring true bias. To mitigate this, the authors devise a multi‑prompt assessment protocol. They generate a suite of paraphrased prompts that preserve the underlying information‑flow scenario, collect LLM responses for each, and evaluate response variance. Only when variance falls below a predefined threshold do they accept the responses as reliable for bias computation. This approach isolates genuine privacy bias from stochastic model behavior.
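The protocol can be sketched as a simple accept/reject loop. This is a schematic under stated assumptions: `query_model` is a hypothetical stand-in for an actual LLM call returning a numeric appropriateness score, and the variance threshold is an arbitrary example value.

```python
import statistics

def query_model(prompt: str) -> float:
    """Hypothetical stand-in for an LLM call; returns a score in [0, 1]."""
    return {"p1": 0.80, "p2": 0.75, "p3": 0.85}[prompt]

def assess_flow(paraphrases, variance_threshold=0.05):
    """Score one information-flow scenario via paraphrased prompts.

    Accept the mean score only if variance across paraphrases is below
    the threshold; otherwise reject the scenario as too prompt-sensitive.
    """
    scores = [query_model(p) for p in paraphrases]
    if statistics.pvariance(scores) > variance_threshold:
        return None                     # unreliable: prompt sensitivity dominates
    return statistics.mean(scores)      # accepted privacy-bias estimate

score = assess_flow(["p1", "p2", "p3"])
print(score)
```

Rejected scenarios can then be re-paraphrased or excluded, so that only low-variance responses enter the P_bias tensor.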

Empirically, the framework is applied to several publicly available LLMs (e.g., GPT‑3.5, GPT‑4, LLaMA variants) across different model sizes and training optimizations (base, instruction‑tuned, RLHF‑fine‑tuned). Findings reveal that larger models tend to exhibit lower average Δ_bias and more neutral signed bias, suggesting that scale improves alignment with contextual norms. Instruction‑tuning and reinforcement‑learning‑from‑human‑feedback (RLHF) further reduce response variance and consequently privacy bias, highlighting the value of safety‑oriented fine‑tuning. Domain‑specific analyses (healthcare, education) show that changes in the transmission principle or receiver role can cause sharp shifts in bias, underscoring the need for context‑aware policy specifications.

Importantly, the metric does not require an explicit A_exp for every scenario. Even when expected norms are unavailable, the raw P_bias tensor can be examined to compare relative appropriateness across flows, enabling “normative” audits that inform policymakers about acceptable bias thresholds in a given sector.
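A relative audit of this kind reduces to ranking flows by their raw P_bias scores. The flows and scores below are invented for illustration; only the comparison logic matters.

```python
# Hypothetical raw appropriateness scores for a few flows (no A_exp needed).
flows = {
    "diagnosis -> insurer, without consent": 0.15,
    "diagnosis -> insurer, with consent": 0.70,
    "grades -> parent, with consent": 0.90,
}

# Rank flows from least to most appropriate and flag the bottom for review.
ranked = sorted(flows.items(), key=lambda kv: kv[1])
least_appropriate = ranked[0][0]
print(least_appropriate)
```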

The paper’s contributions are fourfold: (1) formal definition of a CI‑grounded privacy bias metric; (2) a robust multi‑prompt evaluation method to handle prompt sensitivity; (3) systematic analysis of how model capacity and safety‑oriented optimizations affect privacy bias; (4) demonstration that the framework supports both empirical deviation measurement and normative assessment without mandatory ground‑truth expectations.

Overall, this work provides a principled, scalable tool for auditing LLMs’ privacy behavior, offering actionable insights for model developers, service providers, and regulators seeking to ensure that AI systems respect contextual privacy norms. Future directions include expanding A_exp constructions to cover diverse cultural and legal contexts, automating paraphrase generation for broader coverage, and integrating real‑time bias monitoring into deployed AI services.

