Assessing metadata privacy in neuroimaging

The ethical and legal imperative to share research data without causing harm requires careful attention to privacy risks. While mounting evidence demonstrates that data sharing benefits science, legitimate concerns persist regarding the potential leakage of personal information that could lead to reidentification and subsequent harm. We reviewed metadata accompanying neuroimaging datasets from heterogeneous studies openly available on OpenNeuro, involving participants across the lifespan, from children to older adults, with and without clinical diagnoses, and including associated clinical score data. Using metaprivBIDS (https://github.com/CPernet/metaprivBIDS), a software application for BIDS-compliant tsv/json files that computes and reports several privacy metrics (k-anonymity, k-global, l-diversity, SUDA, PIF), we found that privacy is generally well maintained and that serious vulnerabilities are rare. Nonetheless, issues were identified in nearly all datasets and warrant mitigation. Notably, clinical score data (e.g., neuropsychological results) posed minimal reidentification risk, whereas demographic variables (age, sex assigned at birth, sexual orientation, race, income, and geolocation) represented the principal privacy vulnerabilities. We outline practical measures to address these risks, enabling safer data sharing practices.


💡 Research Summary

This study systematically evaluates the privacy risks associated with metadata in openly shared neuroimaging datasets. From the OpenNeuro repository, the authors selected a heterogeneous collection of BIDS‑compliant studies that span the lifespan, include both healthy participants and clinical populations, and contain accompanying clinical scores. They applied the open‑source tool metaprivBIDS, which computes five established privacy metrics (k‑anonymity, k‑global, l‑diversity, SUDA (the Special Uniques Detection Algorithm), and PIF (Personal Information Factor)) directly on the participants.tsv and participants.json files.
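The two most familiar of these metrics can be illustrated with a few lines of pandas on a toy participants.tsv-style table. This is a minimal sketch of k‑anonymity and l‑diversity, not the metaprivBIDS implementation; the column names and values are hypothetical:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Size of the smallest equivalence class over the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

def l_diversity(df: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> int:
    """Minimum number of distinct sensitive values within any equivalence class."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# Toy participants table (hypothetical values).
df = pd.DataFrame({
    "age":       [34, 34, 34, 71, 71, 71],
    "sex":       ["F", "F", "F", "M", "M", "M"],
    "diagnosis": ["HC", "MDD", "HC", "AD", "HC", "AD"],
})

print(k_anonymity(df, ["age", "sex"]))               # 3
print(l_diversity(df, ["age", "sex"], "diagnosis"))  # 2
```

A k‑anonymity of 3 means every (age, sex) combination is shared by at least three participants; an l‑diversity of 2 means each such group still contains at least two distinct diagnoses, so membership in a group does not reveal a single diagnosis.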

Overall, the datasets demonstrated acceptable privacy levels: most achieved k‑anonymity values between 15 and 30, and l‑diversity scores indicated sufficient variability of sensitive attributes. Clinical scores (e.g., MMSE, ADAS‑Cog, PANSS) contributed minimally to re‑identification risk because they are continuous and rarely form unique combinations with quasi‑identifiers, resulting in low SUDA and PIF values (<0.03).

In contrast, demographic variables emerged as the primary sources of vulnerability. When age was recorded at single‑year resolution, k‑global often fell to 2–3, meaning some combinations of quasi‑identifiers matched only two or three records. Grouping age into 5‑ or 10‑year bins dramatically improved anonymity. Sex assigned at birth, while usually binary, did not substantially affect risk, but including "other" categories without aggregation could create rare combinations. Variables such as sexual orientation, race/ethnicity, household income, and geolocation (postal code or city) were sometimes reported in highly granular categories; their combination produced the highest SUDA percentages (up to 12%) and PIF values exceeding 0.15 in several datasets.
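The effect of age binning can be sketched directly. This is an illustrative example, not metaprivBIDS code; the ages and bin edges are made up:

```python
import pandas as pd

# Hypothetical single-year ages: every record is unique, so k = 1.
ages = pd.Series([21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
k_raw = int(ages.value_counts().min())

# 5-year bins: records pool into larger equivalence classes.
binned = pd.cut(ages, bins=[20, 25, 30], labels=["21-25", "26-30"])
k_binned = int(binned.value_counts().min())

print(k_raw, k_binned)  # 1 5
```

Coarsening a quasi-identifier never shrinks an equivalence class, so binning can only raise (or preserve) k‑anonymity, at the cost of analytic precision.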

Based on these findings, the authors propose a set of practical mitigation strategies: (1) coarsen age into broader intervals; (2) limit sex categories to a minimal set and treat “other” as “unknown” when possible; (3) collapse fine‑grained sexual orientation, race, and income categories into broader groups; (4) generalize geographic information to the state or region level, or add random spatial noise; (5) publish a privacy risk score alongside each dataset and require a Data Use Agreement for datasets that exceed predefined thresholds (e.g., PIF > 0.1); and (6) integrate metaprivBIDS into automated data‑release pipelines to perform pre‑publication privacy checks.
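Strategy (5), gating release on a risk threshold, might look like the sketch below. The fraction of unique records is used here as a simple illustrative risk proxy, not the actual PIF computation, and the threshold constant is a hypothetical name:

```python
import pandas as pd

# Hypothetical gate mirroring the "PIF > 0.1" example above; the risk
# proxy here (fraction of unique records) is an illustrative stand-in,
# not the PIF metric itself.
RISK_THRESHOLD = 0.1

def unique_record_fraction(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Fraction of records that are unique on the quasi-identifier columns."""
    class_sizes = df.groupby(quasi_identifiers).size()
    return float(class_sizes.eq(1).sum()) / len(df)

df = pd.DataFrame({
    "age": [34, 34, 71, 52],
    "sex": ["F", "F", "M", "M"],
})

risk = unique_record_fraction(df, ["age", "sex"])
if risk > RISK_THRESHOLD:
    print(f"risk {risk:.2f} exceeds threshold: require a Data Use Agreement")
else:
    print(f"risk {risk:.2f}: dataset can be released openly")
```

In an automated pipeline (strategy 6), such a check would run before publication and either block the release or downgrade it to controlled access when the threshold is exceeded.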

The paper emphasizes that privacy risk is largely driven by metadata rather than the imaging data itself, urging researchers to adopt a “privacy‑by‑design” mindset from the data‑collection stage onward. By implementing the recommended de‑identification practices and leveraging automated privacy assessment tools, the neuroimaging community can continue to share valuable data while minimizing the potential for participant re‑identification and associated harms.

