Anonymizing Unstructured Data

Reading time: 6 minutes

📝 Original Info

  • Title: Anonymizing Unstructured Data
  • ArXiv ID: 0810.5582
  • Date: 2008-11-04
  • Authors: Rajeev Motwani (Stanford University, Computer Science), Shubha U. Nabar (Stanford University, Computer Science)

📝 Abstract

In this paper we consider the problem of anonymizing datasets in which each individual is associated with a set of items that constitute private information about the individual. Illustrative datasets include market-basket datasets and search engine query logs. We formalize the notion of k-anonymity for set-valued data as a variant of the k-anonymity model for traditional relational datasets. We define an optimization problem that arises from this definition of anonymity and provide O(k log k)- and O(1)-approximation algorithms for the same. We demonstrate applicability of our algorithms to the America Online query log dataset.

💡 Deep Analysis

Deep Dive into Anonymizing Unstructured Data.


📄 Full Content

arXiv:0810.5582v2 [cs.DB] 3 Nov 2008

1. INTRODUCTION

Consider a dataset containing detailed information about the private actions of individuals, e.g., a market-basket dataset or a dataset of search engine query logs. Market-basket datasets contain information about items bought by individuals, and search engine query logs contain detailed information about the queries posed by users and the results that were clicked on. There is often a need to publish such data for research purposes. Market-basket data, for instance, could be used for association rule mining and for the design and testing of recommendation systems. Query logs could be used to study patterns of query refinement, develop algorithms for query suggestion and improve the overall quality of search. The publication of such data, however, poses a challenge as far as the privacy of individual users is concerned.
Even after removing all personal characteristics of individuals, such as actual usernames and IP addresses, the publication of such data is still subject to privacy attacks from attackers with partial knowledge of the private actions of individuals. Our work in this paper is motivated by two such recent data releases and privacy attacks on them.

In August of 2006, America Online (AOL) released a large portion of its search engine query logs for research purposes. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL ran a simplistic anonymization procedure wherein every username was replaced by a random identifier. Despite this basic protective measure, the New York Times [6] demonstrated how the queries themselves could essentially reveal the identities of users. For example, user 4417749 revealed herself to be a resident of Gwinnett County in Lilburn, GA, by querying for businesses and services in the area. She further revealed her last name by querying for relatives. There were only 14 citizens with her last name in Gwinnett County, and the user was quickly revealed to be Thelma Arnold, a 62-year-old woman living in Georgia. From this point on, researchers at the New York Times could look at all of the queries posed by Ms. Arnold over the 3-month period. The publication of the query log data thus constituted a very serious privacy breach.

In October of 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system. As part of the contest, Netflix publicly released a dataset containing 100 million movie ratings created by 500,000 Netflix subscribers over a period of 6 years. Once again, a simplistic anonymization procedure of replacing usernames with random identifiers was used prior to the release.
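The re-identification attacks described above follow one pattern: the attacker intersects partial knowledge of a target's items with the pseudonymized release. A minimal sketch of that linkage step, using a made-up toy dataset (the pseudonyms and items are illustrative, not from the AOL or Netflix data):

```python
# Sketch of the linkage attack described above: usernames are replaced by
# random identifiers, yet the item sets themselves narrow the candidates.
# All data below is hypothetical, for illustration only.

def candidates(release, known_items):
    """Return pseudonyms whose item set contains everything the
    attacker knows about the target."""
    return [uid for uid, items in release.items() if known_items <= items]

# Toy "anonymized" release: pseudonym -> set of items (queries, movies, ...).
release = {
    "u1": {"gwinnett county", "landscapers lilburn ga", "arnold"},
    "u2": {"gwinnett county", "weather"},
    "u3": {"weather", "news"},
}

# Knowing only two of the target's items leaves a single match:
matches = candidates(release, {"landscapers lilburn ga", "arnold"})
assert matches == ["u1"]  # target re-identified despite the random id
```

When the candidate list has exactly one entry, the random identifier has provided no protection at all, which is exactly what happened to user 4417749.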
Nevertheless, it was shown that 84% of the subscribers could be uniquely identified by an attacker who knew 6 out of 8 movies that the subscriber had rated outside of the top 500 [19].

The commonality between the AOL and Netflix datasets is that each individual's data is essentially a set of items. Further, this set of items is both identifying of the individual as well as private information about the individual, and partial knowledge of this set of items is used in the privacy attack. In the case of the Netflix data (representative of market-basket data), for instance, it is the set of movies that a subscriber rated, and in the case of the AOL data, it is the set of queries that a user posed, also called the user session.

Motivated by these examples, as well as by the very real need for releasing such datasets for research purposes, we propose a notion of anonymity for set-valued data in this paper. Informally, a dataset is said to be k-anonymous if every individual's "set of items" is identical to those of at least k − 1 other individuals. So a user in the Netflix dataset would be k-anonymous if at least k − 1 other users rated exactly the same set of movies; a user in the AOL query logs would be k-anonymous if at least k − 1 other users posed exactly the same set of queries. One simple way to achieve k-anonymity for a dataset would be to simply remove e

…(Full text truncated)…
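The informal definition above, that every individual's item set must be identical to those of at least k − 1 others, can be checked directly: group users by their exact item set and verify that every distinct set occurs at least k times. A minimal sketch, with a hypothetical toy dataset:

```python
from collections import Counter

def is_k_anonymous(dataset, k):
    """Check the paper's informal notion of k-anonymity for set-valued
    data: dataset maps each user to a set of items, and the dataset is
    k-anonymous iff every distinct item set occurs at least k times."""
    counts = Counter(frozenset(items) for items in dataset.values())
    return all(count >= k for count in counts.values())

# Hypothetical query-log fragment: user -> set of queries posed.
data = {
    "a": {"q1", "q2"},
    "b": {"q1", "q2"},
    "c": {"q3"},
}

assert is_k_anonymous(data, 1)      # every set trivially occurs once
assert not is_k_anonymous(data, 2)  # user "c" has a unique query set
```

Note that the check compares whole sets for exact equality; achieving k-anonymity, by contrast, requires modifying the data (e.g., suppressing or adding items) so that such identical groups of size at least k exist, which is the optimization problem the paper studies.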


Reference

This content is AI-processed based on ArXiv data.
