📝 Original Info
- Title: A multiplicative masking method for preserving the skewness of the original micro-records
- ArXiv ID: 1712.02549
- Date: 2017-12-08
- Authors: Nicolas Ruiz (OECD)
📝 Abstract
Masking methods for the safe dissemination of microdata consist of distorting the original data while preserving a pre-defined set of statistical properties in the microdata. For continuous variables, available methodologies rely essentially on matrix masking and in particular on adding noise to the original values, using more or less refined procedures depending on the extent of information that one seeks to preserve. Almost all of these methods make use of the critical assumption that the original datasets follow a normal distribution and/or that the noise has such a distribution. This assumption is, however, restrictive in the sense that few variables follow empirically a Gaussian pattern: the distribution of household income, for example, is positively skewed, and this skewness is essential information that has to be considered and preserved. This paper addresses these issues by presenting a simple multiplicative masking method that preserves skewness of the original data while offering a sufficient level of disclosure risk control. Numerical examples are provided, leading to the suggestion that this method could be well-suited for the dissemination of a broad range of microdata, including those based on administrative and business records.
📄 Full Content
A MULTIPLICATIVE MASKING METHOD FOR PRESERVING THE SKEWNESS OF THE
ORIGINAL MICRO-RECORDS
Nicolas Ruiz¹
OECD
ABSTRACT
Masking methods for the safe dissemination of microdata consist of distorting the original data while
preserving a pre-defined set of statistical properties in the microdata. For continuous variables, available
methodologies rely essentially on matrix masking and in particular on adding noise to the original values,
using more or less refined procedures depending on the extent of information that one seeks to preserve.
Almost all of these methods make use of the critical assumption that the original datasets follow a normal
distribution and/or that the noise has such a distribution. This assumption is, however, restrictive in the
sense that few variables follow empirically a Gaussian pattern: the distribution of household income, for
example, is positively skewed, and this skewness is essential information that has to be considered and
preserved. This paper addresses these issues by presenting a simple multiplicative masking method that
preserves skewness of the original data while offering a sufficient level of disclosure risk control.
Numerical examples are provided, leading to the suggestion that this method could be well-suited for the
dissemination of a broad range of microdata, including those based on administrative and business records.
Keywords: disclosure, microdata perturbation, sufficient statistics, skewness, log normal distribution
¹ Contact: nicolas.ruiz@oecd.org
Introduction
Microdata are individual records coming from surveys or administrative registers. By their nature,
they provide a wealth of information that can inform statistical and policy analysis. However, this
wealth of information often goes untapped because of the legal obligations that National Statistical Offices
(NSOs) and other governmental institutions face to protect the confidentiality of their respondents. Such
requirements shape the dissemination policy of microdata at national and international levels. The issue is
how to ensure a level of data protection sufficient to meet data producers' legal and ethical
requirements while offering users reasonably rich information. To resolve this tension,
several solutions are available. These include providing access to microdata through a controlled
environment such as data centres, safe remote access, providing interval data or tabulations rather than data
points, or modifying the individual records before public release by using statistical disclosure control
techniques.
Over the last decade, the role of microdata has changed from being the preserve of NSOs and
government departments to being a vital tool for analysts trying to understand both social and economic
phenomena. These new uses of microdata confront providers of official statistics with a new range of
questions. The OECD has witnessed this change in microdata use and has become more actively involved
in the subject. The overriding principle followed by national statistical offices when providing access to
this information is that, in all instances, the confidentiality of individual responses should be preserved.
The need to reassure microdata providers that their confidentiality constraints would be met has shaped the
way the OECD has approached the issues of access to these data. Three themes are at the core of such
access:
1. The development of an internationally-harmonised nomenclature and coding for microdata
variables.
2. The development of an international Statistical Disclosure Control technique.
3. The development of an IT infrastructure to allow remote access in a secure environment.
This paper addresses the second theme. It proposes a harmonised statistical disclosure technique that
is based on experiences from individual countries. This technique has the potential to lead to
internationally comparable results as well as to assure providers of official statistics that they retain control
over the disclosure risk of their microdata. While the initial investigation of disclosure techniques
undertaken by the OECD focused on Labour Force Survey microdata (OECD 2010), the tool presented here
has real potential to be applied to a range of other domains where microdata are potentially highly
concentrated in the tails of the distribution, as in the case of business or household income and wealth data.
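To make the concern about skewed variables concrete, the following minimal sketch shows how plain additive Gaussian noise attenuates the skewness of a positively skewed variable. It is an illustration only and not the masking method proposed in this paper: the log-normal "income" variable, its parameters, the noise level, and the helper `sample_skewness` are all assumptions chosen for the example.

```python
import math
import random


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical income-like variable: log-normal, hence positively skewed.
income = [rng.lognormvariate(10.0, 0.5) for _ in range(50_000)]

# Standard deviation of the original data, used to scale the noise.
mean_income = sum(income) / len(income)
sd_income = math.sqrt(sum((x - mean_income) ** 2 for x in income) / len(income))

# Additive masking with zero-mean Gaussian noise. Setting the noise standard
# deviation equal to that of the data is an arbitrary illustrative choice.
masked = [x + rng.gauss(0.0, sd_income) for x in income]

skew_original = sample_skewness(income)
skew_masked = sample_skewness(masked)
negative_count = sum(1 for m in masked if m < 0)

print(f"skewness of original data: {skew_original:.3f}")
print(f"skewness after additive noise: {skew_masked:.3f}")
print(f"masked records below zero: {negative_count}")
```

Symmetric noise leaves the third central moment unchanged in expectation but inflates the variance, so the skewness coefficient shrinks; with noise variance equal to the data variance, the population skewness falls by a factor of 2^1.5. Additive noise of this size can also push some masked values below zero, which is implausible for income-type data.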
Statistical Disclosure Control: what is it?
Statistical Disclosure Control (SDC) is the set of numerical tools that enhance the level of
confidentiality of any given micro-record while preserving, to a lesser or greater extent, its level of
information (see Hundepool et al., 2010, for an authoritative survey). While standards are still missing for the
use of SDC in an integrated and coherent framework both at the national and internat
…(Full text truncated)…