Privacy in Search Logs

Reading time: 5 minutes

📝 Original Info

  • Title: Privacy in Search Logs
  • ArXiv ID: 0904.0682
  • Date: 2011-05-13
  • Authors: Not specified (the source text did not include an author list)

📝 Abstract

Search engine companies collect the "database of intentions", the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent keywords, queries and clicks of a search log. We first show how methods that achieve variants of $k$-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by $\epsilon$-differential privacy unfortunately does not provide any utility for this problem. We then propose an algorithm ZEALOUS and show how to set its parameters to achieve $(\epsilon,\delta)$-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves $(\epsilon',\delta')$-indistinguishability. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves $k$-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to $k$-anonymity while at the same time achieving much stronger privacy guarantees.

📄 Full Content

Civilization is the progress toward a society of privacy. The savage's whole existence is public, ruled by the laws of his tribe. Civilization is the process of setting man free from men. -Ayn Rand.

My favorite thing about the Internet is that you get to go into the private world of real creeps without having to smell them. -Penn Jillette.

Search engines play a crucial role in the navigation through the vastness of the Web. Today’s search engines do not just collect and index webpages, they also collect and mine information about their users. They store the queries, clicks, IP-addresses, and other information about the interactions with users in what is called a search log. Search logs contain valuable information that search engines use to tailor their services better to their users’ needs. They enable the discovery of trends, patterns, and anomalies in the search behavior of users, and they can be used in the development and testing of new algorithms to improve search performance and quality. Scientists all around the world would like to tap this gold mine for their own research; search engine companies, however, do not release them because they contain sensitive information about their users, for example searches for diseases, lifestyle choices, personal tastes, and political affiliations.

The only release of a search log happened in 2006 by AOL, and it went into the annals of tech history as one of the great debacles in the search industry.¹ AOL published three months of search logs of 650,000 users. The only measure to protect user privacy was the replacement of user-ids with random numbers, utterly insufficient protection as the New York Times showed by identifying a user from Lilburn, Georgia [4], whose search queries not only contained identifying information but also sensitive information about her friends' ailments.

The AOL search log release shows that simply replacing user-ids with random numbers does not prevent information disclosure. Other ad-hoc methods have been studied and found to be similarly insufficient, such as the removal of names, age, zip codes and other identifiers [14] and the replacement of keywords in search queries by random numbers [18].

In this paper, we compare formal methods of limiting disclosure when publishing frequent keywords, queries, and clicks of a search log. The methods vary in the guarantee of disclosure limitations they provide and in the amount of useful information they retain. We first describe two negative results. We show that existing proposals to achieve k-anonymity [23] in search logs [1,21,12,13] are insufficient in the light of attackers who can actively influence the search log. We then turn to differential privacy [9], a much stronger privacy guarantee; however, we show that it is impossible to achieve good utility with differential privacy.
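For concreteness, the guarantee from [9] referenced here can be stated as follows: a randomized algorithm $A$ satisfies $\epsilon$-differential privacy if for all search logs $D$ and $D'$ that differ in the history of a single user, and for every set $S$ of possible outputs,

$\Pr[A(D) \in S] \leq e^{\epsilon} \cdot \Pr[A(D') \in S]$.

Intuitively, the output distribution must barely change when any one user's entire search history is added or removed. This intuition also suggests why utility is hard to retain in this setting: a single user can contribute arbitrarily many distinct keywords, so the noise needed to hide one user's full history overwhelms the published frequencies.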

We then describe Algorithm ZEALOUS,² developed independently by Korolova et al. [17] and us [10] with the goal of achieving relaxations of differential privacy. Korolova et al. showed how to set the parameters of ZEALOUS to guarantee (ε, δ)-indistinguishability [8], and we here offer a new analysis that shows how to set the parameters of ZEALOUS to guarantee (ε, δ)-probabilistic differential privacy [20] (Section 4.2), a much stronger privacy guarantee, as our analytical comparison shows.
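The two-phase shape of a ZEALOUS-style mechanism can be sketched as follows. This is a minimal illustration only: the parameter names (`m`, `tau`, `b`, `tau_prime`) and their values are placeholders, not the paper's notation or recommended settings, and the privacy analysis in Section 4 governs how they must actually be chosen.

```python
import math
import random
from collections import Counter


def laplace(b):
    """Draw one sample from the Laplace(0, b) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -b * math.copysign(math.log(1 - 2 * abs(u)), u)


def zealous(user_histories, m=5, tau=20, b=2.0, tau_prime=25.0):
    """Sketch of a two-threshold, noise-adding frequent-item publisher.

    1. Limit each user's contribution to at most m distinct items.
    2. Count how many contributions each item received.
    3. Drop items below a first threshold tau.
    4. Add Laplace(b) noise to the surviving counts.
    5. Publish only items whose noisy count exceeds a second threshold tau_prime.
    """
    counts = Counter()
    for history in user_histories:
        # dict.fromkeys deduplicates while preserving order; keep at most m items
        for item in list(dict.fromkeys(history))[:m]:
            counts[item] += 1

    published = {}
    for item, c in counts.items():
        if c < tau:                # first threshold: discard rare items outright
            continue
        noisy = c + laplace(b)     # perturb the count of each surviving item
        if noisy > tau_prime:      # second threshold applied to the noisy count
            published[item] = noisy
    return published
```

The two thresholds matter: the first ensures that items kept by only a few users never even reach the noise-adding step, and the second ensures that the noise cannot promote a borderline item into the published set with more than a tiny probability.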

Our paper concludes with an extensive experimental evaluation, where we compare the utility of various algorithms that guarantee anonymity or privacy in search log publishing. Our evaluation includes applications that use search logs for improving both search experience and search performance, and our results show that ZEALOUS' output is sufficient for these applications while achieving strong formal privacy guarantees.

¹ http://en.wikipedia.org/wiki/AOL_search_data_scandal describes the incident, which resulted in the resignation of AOL's CTO and an ongoing class action lawsuit against AOL resulting from the data release.

² ZEArch LOg pUbliShing

We believe that the results of this research enable search engine companies to make their search logs available to researchers without disclosing their users' sensitive information: Search engine companies can apply our algorithm to generate statistics that are (ε, δ)-probabilistic differentially private while retaining good utility for the two applications we have tested. Beyond publishing search logs we believe that our findings are of interest when publishing frequent itemsets, as ZEALOUS protects privacy against much stronger attackers than those considered in existing work on privacy-preserving publishing of frequent items/itemsets [19].

The remainder of this paper is organized as follows. We start with some background in Section 2. Our negative results are presented in Section 3. We then describe Algorithm ZEALOUS and its analysis in Section 4. We compare indistinguishability with probabilistic differential privacy in Section 5. Section 6 shows the results of an extensive experimental evaluation.


Reference

This content is AI-processed based on open access ArXiv data.
