The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving unaddressed the question of whether PII removal techniques truly protect privacy in real-world scenarios. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data allows us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted, and for good reasons, which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
💡 Research Summary
The paper “The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques” is a position paper that questions the validity of current evaluation practices for attacks that aim to reconstruct personally identifiable information (PII) from texts that have been sanitized by automatic PII‑removal tools. The authors begin by noting that privacy‑preserving text publishing is mandated by regulations such as the EU GDPR and the US HIPAA, and that manual redaction is infeasible at scale, prompting the development of rule‑based, NER‑based, and more recent large‑language‑model‑assisted de‑identification systems (e.g., Microsoft Presidio, Textwash). While these tools are widely deployed, they do not provide formal privacy guarantees; they merely mimic human intuition about what constitutes PII. Consequently, several recent works have reported successful reconstruction attacks that recover portions of the hidden information using powerful language models.
The central claim of the paper is that the reported success rates of these attacks are systematically inflated because the experimental designs suffer from two major flaws: (1) data leakage, where the test documents are either publicly available or have likely been included in the pre‑training corpora of the attacking models, and (2) data contamination, where the attack models have been pre‑trained on private corpora that overlap with the evaluation set, allowing memorisation rather than genuine inference. The authors argue that without strict control of these factors, an attacker’s apparent ability to recover PII may simply be a by‑product of model memorisation.
To substantiate this claim, the authors conduct two case studies using datasets that are highly unlikely to have been seen by large language models: (a) Czech court announcements, a legal corpus that is not part of any major public pre‑training data, and (b) English travel vlogs from YouTube, which are user‑generated and not systematically harvested for model training. They apply several state‑of‑the‑art reconstruction attacks (citing works from 2023‑2025) to these corpora after sanitising them with common PII‑removal techniques (redaction, masking, replacement, pseudonymisation). The results show a dramatic drop in reconstruction accuracy compared with prior reports that used public benchmark datasets. This empirical evidence supports the hypothesis that previous high success rates were at least partially driven by data overlap.
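The four sanitisation strategies mentioned above can be illustrated with a minimal, regex-based sketch. The name pattern, placeholder strings, and surrogate value below are hypothetical stand-ins for illustration, not the paper's actual pipeline or any specific tool:

```python
import hashlib
import re

# Hypothetical PII pattern: a capitalised "First Last" name.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")

def redact(text):
    """Redaction: delete the PII span entirely."""
    return NAME_PATTERN.sub("", text)

def mask(text):
    """Masking: replace the PII span with a generic type placeholder."""
    return NAME_PATTERN.sub("[PERSON]", text)

def replace(text):
    """Replacement: substitute a surrogate value of the same type."""
    return NAME_PATTERN.sub("Jane Doe", text)

def pseudonymise(text):
    """Pseudonymisation: map each distinct name to a stable pseudonym
    so that repeated mentions of the same entity stay consistent."""
    def _pseudo(match):
        tag = hashlib.sha256(match.group(1).encode()).hexdigest()[:6]
        return f"PERSON_{tag}"
    return NAME_PATTERN.sub(_pseudo, text)

text = "Alice Brown met Carol White. Alice Brown left early."
print(mask(text))  # [PERSON] met [PERSON]. [PERSON] left early.
print(pseudonymise(text))
```

Unlike masking, pseudonymisation preserves which mentions refer to the same entity, which keeps the text more useful for downstream analysis but also leaks co-reference structure to an attacker.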
Beyond the experiments, the paper provides a thorough taxonomy of PII‑removal methods, clarifies terminology (anonymisation vs. de‑identification vs. sanitisation), and reviews regulatory requirements. It highlights that many current tools lack coreference resolution, leading to inconsistent masking of repeated mentions—a weakness that can be exploited by attackers. The authors also discuss the broader methodological gap: there is no established protocol for designing, executing, and evaluating adversarial attacks on PII‑removal pipelines, and no community‑wide benchmark that guarantees the privacy of the evaluation data.
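The coreference weakness the authors highlight can be reproduced with a toy example: a masker that only recognises full names misses a later surname-only mention, leaving a link that an attacker can exploit. The pattern and the sample sentence are illustrative assumptions, not taken from the paper:

```python
import re

# A naive masker that only recognises full "First Last" names
# and performs no coreference resolution.
FULL_NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def naive_mask(text):
    return FULL_NAME.sub("[PERSON]", text)

doc = "Alice Brown was treated on Monday. Brown later complained."
print(naive_mask(doc))
# [PERSON] was treated on Monday. Brown later complained.
# The second mention ("Brown") survives sanitisation, so both
# sentences can still be linked back to the same individual.
```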
Given that truly private data are heavily restricted by law, the authors contend that the public research community cannot reliably assess the security of PII‑removal tools without access to such data. They propose several pathways to address this impasse: (i) the creation of secure data enclaves where vetted researchers can run attacks under strict oversight; (ii) the use of privacy‑preserving computation techniques (e.g., secure multi‑party computation, differential privacy) to allow evaluation without exposing raw data; and (iii) the development of standardised “leakage‑free” benchmark suites that include rigorous checks for overlap (hash‑based matching, metadata analysis) between pre‑training corpora and test sets. They argue that only by adopting such safeguards can the community produce trustworthy, reproducible results that genuinely reflect the privacy risk of PII‑removal technologies.
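In its simplest form, the hash-based overlap check proposed for leakage-free benchmarks hashes word n-grams of the pre-training corpus and flags any test document that shares one. This is a minimal sketch under assumed parameters (8-word n-grams, exact matching); production deduplication pipelines typically use larger n-grams, normalised tokenisation, and scalable structures such as Bloom filters:

```python
import hashlib

def ngram_hashes(text, n=8):
    """Hash every lowercase word n-gram of a document."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(len(words) - n + 1)
    }

def overlaps(pretrain_docs, test_doc, n=8):
    """Flag a test document if any of its n-grams also occurs
    in the (hashed) pre-training corpus."""
    seen = set()
    for doc in pretrain_docs:
        seen |= ngram_hashes(doc, n)
    return bool(ngram_hashes(test_doc, n) & seen)
```

A benchmark curator would run such a check between candidate test documents and every available pre-training corpus, discarding any document that triggers a match before it enters the evaluation set.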
In conclusion, the paper asserts that current literature overstates the vulnerability of sanitized texts, primarily due to methodological oversights. It calls for a paradigm shift toward more rigorous, privacy‑respecting evaluation frameworks, and for closer collaboration between regulators, data custodians, and the NLP research community to ensure that future PII‑removal tools are both useful and provably safe.