📝 Original Info
- Title: A smoothing model for sample disclosure risk estimation
- ArXiv ID: 0708.0980
- Date: 2009-09-29
- Authors: Yosef Rinott and Natalie Shlomo
📝 Abstract
When a sample frequency table is published, disclosure risk arises when some individuals can be identified on the basis of their values in certain attributes in the table, called key variables; their values in other attributes may then be inferred, and their privacy is violated. On the basis of the sample to be released, and possibly some partial knowledge of the whole population, an agency which considers releasing the sample has to estimate the disclosure risk. Risk arises from non-empty sample cells which represent small population cells, and from population uniques in particular. Therefore risk estimation requires assessing how many of the relevant population cells are likely to be small. Various methods have been proposed for this task, and we present a method in which estimation of a population cell frequency is based on smoothing using a local neighborhood of this cell, that is, cells having similar or close values in all attributes. We provide some preliminary results and experiments with this method. Comparisons are made to two other methods: 1. A log-linear models approach, in which inference on a given cell is based on a "neighborhood" of cells determined by the log-linear model. Such neighborhoods have one or some common attributes with the cell in question, but some other attributes may differ significantly. 2. The Argus method, in which inference on a given cell is based only on the sample frequency in the specific cell, on the sample design and on some known marginal distributions of the population, without learning from any type of "neighborhood" of the given cell, nor from any model which uses the structure of the table.
📄 Full Content
arXiv:0708.0980v1 [stat.ME] 7 Aug 2007
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 161–171
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000120
A smoothing model for sample disclosure risk estimation
Yosef Rinott1,∗ and Natalie Shlomo2,†
Hebrew University and Southampton University
Abstract: When a sample frequency table is published, disclosure risk arises
when some individuals can be identified on the basis of their values in certain
attributes in the table, called key variables; their values in other
attributes may then be inferred, and their privacy is violated.
On the basis of the sample to be released, and possibly some partial knowledge
of the whole population, an agency which considers releasing the sample
has to estimate the disclosure risk.
Risk arises from non-empty sample cells which represent small population
cells, and from population uniques in particular. Therefore risk estimation
requires assessing how many of the relevant population cells are likely to be small.
Various methods have been proposed for this task, and we present a method
in which estimation of a population cell frequency is based on smoothing using
a local neighborhood of this cell, that is, cells having similar or close values in
all attributes.
We provide some preliminary results and experiments with this method.
Comparisons are made to two other methods: 1. A log-linear models approach,
in which inference on a given cell is based on a “neighborhood” of cells
determined by the log-linear model. Such neighborhoods have one or some common
attributes with the cell in question, but some other attributes may differ
significantly. 2. The Argus method, in which inference on a given cell is based
only on the sample frequency in the specific cell, on the sample design and on
some known marginal distributions of the population, without learning from
any type of “neighborhood” of the given cell, nor from any model which uses
the structure of the table.
1. Introduction
When a microdata sample file is released by an agency, directly identifying variables,
such as name, address, etc., are always deleted, variable values are often grouped
(e.g., Age-Groups instead of precise age), and the data is given in the form of a
frequency table. However, disclosure risk may still exist, that is, some individuals in
the file may be identified by their combination of values in the variables appearing
in the data.
Samples often contain information on certain variables on which the agency’s
information for the whole population is limited, such as expenditure on specific
items in a Household Expenditure Survey, or detailed information on variables such
as children’s extracurricular activities in the Social Survey of the Israel Central
Bureau of Statistics.
∗Research supported by the Israel Science Foundation (grant No. 473/04).
†Research supported in part by the Israel Science Foundation (grant No. 473/04).
1Department of Statistics, Hebrew University, Jerusalem, Israel, e-mail: rinott@huji.ac.il
2Department of Statistics, Hebrew University of Jerusalem, Southampton Statistical Sciences
Research Institute, University of Southampton, United Kingdom, e-mail: N.Shlomo@soton.ac.uk
AMS 2000 subject classifications: primary 62H17; secondary 62-07.
Keywords and phrases: sample uniques, neighborhoods, microdata.
Often agencies have to assess the disclosure risk involved in the release of sample
data in the form of a frequency table when the corresponding population table may
be unknown, or only partially known. Risk arises from cells in which both sample
and population frequencies are small, allowing an intruder who has the sample data
and access to some information on the population, and in particular on individuals
of interest, to identify such individuals in the sample with high probability. Thus,
the disclosure risk depends on both the given sample and the population. In this
paper we are concerned with the issue of estimating disclosure risk involved in
releasing a sample on the basis of the sample alone, assuming the population is
unknown.
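To make the notion of a risky cell concrete, the following minimal Python sketch flags non-empty sample cells whose population frequency is small, and counts sample uniques that are also population uniques. It assumes the population table is known, as it might be in a simulation study; in the setting of this paper it is not, which is why the risk must be estimated. The function names, the threshold of 2, and the toy tables are hypothetical and are not taken from the paper.

```python
def risky_cells(sample_table, population_table, threshold=2):
    """Return non-empty sample cells whose population count is at most `threshold`.

    `threshold` is an illustrative choice of what counts as a "small" cell.
    """
    return sorted(
        cell
        for cell, sample_count in sample_table.items()
        if sample_count > 0 and population_table[cell] <= threshold
    )


def count_unique_matches(sample_table, population_table):
    """Count sample uniques (count 1) that are also population uniques."""
    return sum(
        1
        for cell, sample_count in sample_table.items()
        if sample_count == 1 and population_table[cell] == 1
    )


# Hypothetical toy tables keyed by (sex, age group); not data from the paper.
population = {("M", "20-29"): 120, ("M", "80+"): 1,
              ("F", "20-29"): 95, ("F", "80+"): 2}
sample = {("M", "20-29"): 11, ("M", "80+"): 1,
          ("F", "20-29"): 9, ("F", "80+"): 1}

flagged = risky_cells(sample, population)
matches = count_unique_matches(sample, population)
print(flagged)
print(matches)
```

Both sample uniques look identical to someone holding only the sample, yet only one of them is unique in the population. Distinguishing such cases without access to the population table is the estimation problem addressed here.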
Let $f = \{f_k\}$ denote an $m$-way frequency table, which is a sample from a
population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_m)$ indicates a cell and $f_k$ and $F_k$
denote the frequency in the sample and population cell $k$, respectively. Formally,
the sample and population sizes in our models are random and their expectations
are denoted by $n$ and $N$ respectively, and the number of cells by $K$. We can
either assume that $n$ and $N$ are known, or that they are estimated by their natural
estimators: the actual sample and population sizes, assumed to be known. In the
sequel when we write $n$ or $N$ we formally refer to expectations.
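The notation above can be illustrated with a short Python sketch that represents a table as a mapping from cells k = (k1, ..., km) to frequencies and draws a sample table from a population table. It assumes Bernoulli sampling, in which each individual is included independently with the same probability; this is an assumption made for illustration, not a sampling design stated in the text. Under it the realized sample size is random, with expectation equal to the sampling fraction times the population size. For simplicity the population table is held fixed here, whereas the text treats the population size as random too. All names and numbers are hypothetical.

```python
import random
from collections import Counter


def draw_sample_table(population_table, sampling_fraction, rng):
    """Draw a sample table by including each individual independently
    with probability `sampling_fraction` (Bernoulli sampling, assumed)."""
    sample_table = Counter()
    for cell, population_count in population_table.items():
        sample_count = sum(
            1 for _ in range(population_count) if rng.random() < sampling_fraction
        )
        if sample_count > 0:
            sample_table[cell] = sample_count
    return sample_table


# Hypothetical 2-way population table (m = 2) with K = 4 cells.
F = {(0, 0): 40, (0, 1): 3, (1, 0): 1, (1, 1): 56}
population_size = sum(F.values())

sampling_fraction = 0.2          # illustrative value
rng = random.Random(0)           # fixed seed for reproducibility
f = draw_sample_table(F, sampling_fraction, rng)

expected_sample_size = sampling_fraction * population_size  # plays the role of n
actual_sample_size = sum(f.values())  # natural estimator of n
print(dict(f), expected_sample_size, actual_sample_size)
```

The realized sample size generally differs from its expectation, which is the distinction the text draws between n as an expectation and its natural estimator.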
If the m attributes in the table can be considered key variables, that is, variables
which are to some extent accessible to the public or to potential intruders, then
disclosure
…(Full text truncated)…