A multiplicative masking method for preserving the skewness of the original micro-records

📝 Original Info

  • Title: A multiplicative masking method for preserving the skewness of the original micro-records
  • ArXiv ID: 1712.02549
  • Date: 2017-12-08
  • Authors: Nicolas Ruiz (OECD)

📝 Abstract

Masking methods for the safe dissemination of microdata consist of distorting the original data while preserving a pre-defined set of statistical properties in the microdata. For continuous variables, available methodologies rely essentially on matrix masking and in particular on adding noise to the original values, using more or less refined procedures depending on the extent of information that one seeks to preserve. Almost all of these methods make use of the critical assumption that the original datasets follow a normal distribution and/or that the noise has such a distribution. This assumption is, however, restrictive in the sense that few variables follow empirically a Gaussian pattern: the distribution of household income, for example, is positively skewed, and this skewness is essential information that has to be considered and preserved. This paper addresses these issues by presenting a simple multiplicative masking method that preserves skewness of the original data while offering a sufficient level of disclosure risk control. Numerical examples are provided, leading to the suggestion that this method could be well-suited for the dissemination of a broad range of microdata, including those based on administrative and business records.
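The contrast drawn in the abstract can be made concrete with a small simulation. The sketch below is an illustration only and is not the procedure of the paper, whose details are not part of this excerpt. It assumes a log-normal "income" variable, and the helper names (`sample_skewness`, `additive_mask`, `log_scale_mask`) and all parameter values are hypothetical. It compares plain additive Gaussian noise with noise applied on the log scale (a multiplicative perturbation of the raw values) that is then re-centred and re-scaled so the mean and standard deviation of the logs, the sufficient statistics of a log-normal sample, are unchanged.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def additive_mask(values, noise_share, rng):
    """Add zero-mean Gaussian noise with std = noise_share * std(values)."""
    sd = statistics.pstdev(values)
    return [v + rng.gauss(0.0, noise_share * sd) for v in values]


def log_scale_mask(values, noise_sd, rng):
    """Hypothetical masking step, NOT taken from the paper.

    Gaussian noise is added to the logs (i.e. the raw values are multiplied
    by log-normal noise), then the noisy logs are shifted and rescaled so
    that their mean and std equal those of the original logs.
    """
    logs = [math.log(v) for v in values]
    noisy = [x + rng.gauss(0.0, noise_sd) for x in logs]
    mu, sd = statistics.fmean(logs), statistics.pstdev(logs)
    mu_n, sd_n = statistics.fmean(noisy), statistics.pstdev(noisy)
    return [math.exp(mu + (y - mu_n) * sd / sd_n) for y in noisy]


rng = random.Random(0)
# Assumed toy data: positively skewed, log-normal "household income".
income = [rng.lognormvariate(10.0, 0.8) for _ in range(10000)]
added = additive_mask(income, 1.0, rng)
scaled = log_scale_mask(income, 0.3, rng)

print("skewness, original          :", round(sample_skewness(income), 2))
print("skewness, additive noise    :", round(sample_skewness(added), 2))
print("skewness, log-scale masking :", round(sample_skewness(scaled), 2))
```

Under these assumptions, additive Gaussian noise inflates the variance without adding to the third central moment, so the skewness of the released values shrinks. The log-scale variant keeps every released value positive and leaves the two log-moments untouched, which is one simple way a multiplicative scheme can retain the shape of a skewed distribution.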

📄 Full Content

A MULTIPLICATIVE MASKING METHOD FOR PRESERVING THE SKEWNESS OF THE ORIGINAL MICRO-RECORDS

Nicolas Ruiz, OECD
Contact: nicolas.ruiz@oecd.org

ABSTRACT

Masking methods for the safe dissemination of microdata consist of distorting the original data while preserving a pre-defined set of statistical properties in the microdata. For continuous variables, available methodologies rely essentially on matrix masking and in particular on adding noise to the original values, using more or less refined procedures depending on the extent of information that one seeks to preserve. Almost all of these methods make use of the critical assumption that the original datasets follow a normal distribution and/or that the noise has such a distribution. This assumption is, however, restrictive in the sense that few variables follow empirically a Gaussian pattern: the distribution of household income, for example, is positively skewed, and this skewness is essential information that has to be considered and preserved. This paper addresses these issues by presenting a simple multiplicative masking method that preserves skewness of the original data while offering a sufficient level of disclosure risk control. Numerical examples are provided, leading to the suggestion that this method could be well-suited for the dissemination of a broad range of microdata, including those based on administrative and business records.

Keywords: disclosure, microdata perturbation, sufficient statistics, skewness, log normal distribution

Introduction

Microdata are individual records coming from surveys or administrative registers. Due to their nature, they provide a rich amount of information that can inform statistical and policy analysis. However, this wealth of information is often untapped due to the legal obligations that National Statistical Offices (NSOs) and other governmental institutions face to protect the confidentiality of their respondents. Such requirements shape the dissemination policy of microdata at national and international levels. The issue is how to ensure a sufficient level of data protection to meet data producers’ concerns in terms of legal and ethical requirements while offering users a reasonable richness of information. To resolve this tension, several solutions are available. These include providing access to microdata through a controlled environment such as data centres, safe remote access, providing interval data or tabulations rather than data points, or modifying the individual records before public release by using statistical disclosure control techniques.

Over the last decade, the role of microdata has changed from being the preserve of NSOs and government departments to being a vital tool for analysts trying to understand both social and economic phenomena. These new uses of microdata confront providers of official statistics with a new range of questions. The OECD has witnessed this change in microdata use and has become more actively involved in the subject. The overriding principle followed by national statistical offices when providing access to this information is that, in all instances, the confidentiality of individual responses should be preserved. The need to reassure microdata providers that their confidentiality constraints would be met has shaped the way the OECD has approached the issues of access to these data. Three themes are at the core of such access:

1. The development of an internationally harmonised nomenclature and coding for microdata variables.
2. The development of an international Statistical Disclosure Control technique.
3. The development of an IT infrastructure to allow remote access in a secure environment.

This paper addresses the second theme. It proposes a harmonised statistical disclosure technique that is based on experiences from individual countries. This technique has the potential to lead to internationally comparable results as well as to assure providers of official statistics that they retain control over the disclosure risk of their microdata. While the initial investigation of disclosure techniques undertaken by the OECD focused on Labour Force Survey microdata (OECD 2010), the tool presented here has real potential to be applied to a range of other domains where microdata are potentially highly concentrated in the tails of the distribution, as in the case of business or household income and wealth data.

Statistical Disclosure Control: what is it?

Statistical Disclosure Control (SDC) consists of the set of numerical tools that enhance the level of confidentiality of any given micro-record while preserving, to a lesser or greater extent, its level of information (see Hundepool et al., 2010, for an authoritative survey). While standards are still missing for the use of SDC in an integrated and coherent framework both at the national and internat

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
