arXiv:0810.5582v2 [cs.DB] 3 Nov 2008
Anonymizing Unstructured Data
Rajeev Motwani
Department of Computer Science
Stanford University, Stanford, CA, USA
rajeev@cs.stanford.edu
Shubha U. Nabar
Department of Computer Science
Stanford University, Stanford, CA, USA
sunabar@cs.stanford.edu
ABSTRACT
In this paper we consider the problem of anonymizing datasets in which each individual is associated with a set of items that constitute private information about the individual. Illustrative datasets include market-basket datasets and search engine query logs. We formalize the notion of k-anonymity for set-valued data as a variant of the k-anonymity model for traditional relational datasets. We define an optimization problem that arises from this definition of anonymity and provide O(k log k)- and O(1)-approximation algorithms for the same. We demonstrate the applicability of our algorithms to the America Online query log dataset.
1. INTRODUCTION
Consider a dataset containing detailed information about the private actions of individuals, e.g., a market-basket dataset or a dataset of search engine query logs. Market-basket datasets contain information about items bought by individuals, and search engine query logs contain detailed information about the queries posed by users and the results that were clicked on. There is often a need to publish such data for research purposes. Market-basket data, for instance, could be used for association rule mining and for the design and testing of recommendation systems. Query logs could be used to study patterns of query refinement, develop algorithms for query suggestion, and improve the overall quality of search.
The publication of such data, however, poses a challenge as far as the privacy of individual users is concerned. Even after removing all personal characteristics of individuals, such as actual usernames and IP addresses, the publication of such data is still subject to privacy attacks from attackers with partial knowledge of the private actions of individuals. Our work in this paper is motivated by two such recent data releases and the privacy attacks on them.
In August of 2006, America Online (AOL) released a large portion of its search engine query logs for research purposes. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL ran a simplistic anonymization procedure wherein every username was replaced by a random identifier. Despite this basic protective measure, the New York Times [6] demonstrated how the queries themselves could essentially reveal the identities of users. For example, user 4417749 revealed herself to be a resident of Lilburn, GA, in Gwinnett County, by querying for businesses and services in the area. She further revealed her last name by querying for relatives. There were only 14 citizens with her last name in Gwinnett County, and the user was quickly revealed to be Thelma Arnold, a 62-year-old woman living in Georgia. From this point on, researchers at the New York Times could look at all of the queries posed by Ms. Arnold over the 3-month period. The publication of the query log data thus constituted a very serious privacy breach.
In October of 2006, Netflix announced the $1 million Netflix Prize for improving its movie recommendation system. As part of the contest, Netflix publicly released a dataset containing 100 million movie ratings created by 500,000 Netflix subscribers over a period of 6 years. Once again, a simplistic anonymization procedure of replacing usernames with random identifiers was used prior to the release. Nevertheless, it was shown that 84% of the subscribers could be uniquely identified by an attacker who knew 6 out of 8 movies that the subscriber had rated outside of the top 500 [19].
The commonality between the AOL and Netflix datasets is that each individual's data is essentially a set of items. Further, this set of items is both identifying of the individual and private information about the individual, and partial knowledge of this set of items is used in the privacy attack. In the case of the Netflix data (representative of market-basket data), for instance, it is the set of movies that a subscriber rated; in the case of the AOL data, it is the set of queries that a user posed, also called the user session.
Motivated by these examples, as well as by the very real need for releasing such datasets for research purposes, we propose a notion of anonymity for set-valued data in this paper. Informally, a dataset is said to be k-anonymous if every individual's "set of items" is identical to those of at least k − 1 other individuals. So a user in the Netflix dataset would be k-anonymous if at least k − 1 other users rated exactly the same set of movies; a user in the AOL query logs would be k-anonymous if at least k − 1 other users posed exactly the same set of queries.
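To make the definition concrete, the following is a minimal Python sketch (not from the paper; the function name and toy query sessions are hypothetical) that checks whether a set-valued dataset satisfies this notion of k-anonymity by counting how many users share each exact item set:

```python
from collections import Counter

def is_k_anonymous(user_items, k):
    """Return True iff every user's item set is shared by at least k users
    in total (i.e., by at least k - 1 *other* users)."""
    counts = Counter(frozenset(items) for items in user_items.values())
    return all(count >= k for count in counts.values())

# Hypothetical query sessions, in the spirit of the AOL example:
sessions = {
    "u1": {"flowers", "lilburn ga"},
    "u2": {"flowers", "lilburn ga"},
    "u3": {"flowers"},
}

print(is_k_anonymous(sessions, 2))  # False: u3's session is unique
print(is_k_anonymous(sessions, 1))  # True: trivially satisfied
```

Note that the check requires exact equality of item sets; making a dataset pass this test is the hard part, since it requires suppressing or generalizing items, which is precisely the optimization problem the paper studies.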
One simple way to achieve k-anonymity for a dataset
would be to simply remove e
…(Full text truncated)…