Preserving Privacy in Sequential Data Release against Background Knowledge Attacks

Reading time: 5 minutes
...

📝 Original Info

  • Title: Preserving Privacy in Sequential Data Release against Background Knowledge Attacks
  • ArXiv ID: 1010.0924
  • Date: 2010-10-06
  • Authors: not included in the provided text; please refer to the original paper.

📝 Abstract

A large amount of transaction data containing associations between individuals and sensitive information flows every day into data stores. Examples include web queries, credit card transactions, medical exam records, and transit database records. The serial release of these data to partner institutions or data analysis centers is a common situation. In this paper we show that, in most domains, correlations among sensitive values associated with the same individuals in different releases can be easily mined, and exploited by adversaries observing multiple data releases to violate users' privacy. We provide a formal model for privacy attacks based on this sequential background knowledge, as well as on background knowledge about the probability distribution of sensitive values over different individuals. We show how sequential background knowledge can actually be obtained by an adversary, and used to identify with high confidence the sensitive values associated with an individual. A defense algorithm based on the Jensen-Shannon divergence is proposed, and extensive experiments show the superiority of the proposed technique over other applicable solutions. To the best of our knowledge, this is the first work that systematically investigates the role of sequential background knowledge in the serial release of transaction data.

💡 Deep Analysis

Figure 1

📄 Full Content

Large amounts of transaction data related to individuals are continuously acquired and stored in the repositories of industry and government institutions. Examples include online service requests, web queries, credit card transactions, transit database records, and medical exam records. These institutions often need to repeatedly release new or updated portions of their data to other partner institutions for different purposes, including distributed processing, participation in interorganizational workflows, and data analysis. The medical domain is an interesting example: many countries have recently established centralized data stores that exchange patients' data with medical institutions; new records are periodically released to data analysis centers in non-aggregated form.

A very challenging issue in this scenario is the protection of users' privacy, considering that potential adversaries have access to multiple serial releases and can easily acquire background knowledge related to the specific domain. This knowledge includes the fact that certain sequences of values in subsequent releases are more likely to be observed than others. For example, it is fairly straightforward to extract from the medical literature, or from a public dataset, the fact that one sequence of medical exam results within a certain time frame has a higher probability of being observed than another.
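
To make this concrete, below is a minimal sketch of how such sequential background knowledge could be mined as a first-order Markov model from a public longitudinal dataset. It is illustrative only: the function names and toy data are ours, not the paper's, and the paper's own knowledge-extraction methods (Section IV) may differ.

```python
from collections import defaultdict

def estimate_transitions(sequences):
    """Estimate P(next | current) from observed sequences of sensitive
    values (e.g., successive medical exam results for the same patient)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def sequence_probability(seq, trans):
    """Probability of a sequence of sensitive values under the mined
    model, conditioned on its first value."""
    p = 1.0
    for cur, nxt in zip(seq, seq[1:]):
        p *= trans.get(cur, {}).get(nxt, 0.0)
    return p

# Toy public data: per-patient sequences of exam results.
public = [["flu", "flu", "healthy"],
          ["flu", "healthy", "healthy"],
          ["hiv", "hiv", "hiv"],
          ["hiv", "hiv", "hiv"]]
trans = estimate_transitions(public)

# A persistent condition repeating is far more likely than it vanishing:
print(sequence_probability(["hiv", "hiv", "hiv"], trans))  # 1.0 on this toy data
print(sequence_probability(["hiv", "flu", "hiv"], trans))  # 0.0
```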

Related work has either focused on anonymization techniques dealing with multiple data releases, or on privacy protection techniques taking into account background knowledge, but limited to a single data release. We are not aware of any work addressing the combination of these conditions. This case cannot be addressed by simply combining the two types of techniques mentioned above, since background knowledge enables new kinds of privacy threats on sequential data releases. Extensions of data anonymization techniques to deal with multiple data releases have been proposed under different assumptions [1], [2], [3], [4], [5], [6]. The work that is closest to ours is probably the one presented in [5], in which sensitive values are divided into transient values that may freely change with time, and persistent values that never change. However, the proposed technique is effective only when the transition probability among transient values is uniform, and this is often not the case, the medical domain being a clear counterexample. In [6] a technique is proposed to defend against attacks based on the observation of serial data having transient sensitive values; however, background knowledge on transition probabilities is not considered in that work. In contrast, our privacy-preserving technique captures non-uniform transition probabilities. Our running example in Section II shows that the anonymizations proposed in related work are not effective when an adversary can obtain background knowledge on the transition probabilities.

Techniques considering background knowledge have also been proposed; they can be classified into two main categories: a) models based on logic assertions and rules [7]; and b) models based on probabilistic tools [8], [9]. However, these techniques are devised for a single release of the data and, as shown in Section VI, they are ineffective when an adversary with background knowledge on sequences of sensitive values may observe multiple releases.
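
To see why single-release defenses fall short, the sketch below shows one way an adversary could combine a QI group observed in two consecutive releases with a mined transition model. The enumeration over matchings is a simplification we use for illustration, not the paper's exact attack procedure, and the transition model is assumed (e.g., mined as sketched above).

```python
from collections import defaultdict
from itertools import permutations

def posterior_next_values(release1, release2, trans):
    """For each position i in a QI group published in two consecutive
    releases, compute the posterior over its release-2 sensitive value
    by enumerating all matchings, weighted by transition probabilities.
    Assumes at least one matching is consistent with the model."""
    post = [defaultdict(float) for _ in release1]
    total = 0.0
    for perm in permutations(range(len(release2))):
        w = 1.0
        for i, j in enumerate(perm):
            w *= trans.get(release1[i], {}).get(release2[j], 0.0)
        if w == 0.0:
            continue
        total += w
        for i, j in enumerate(perm):
            post[i][release2[j]] += w
    return [{v: p / total for v, p in d.items()} for d in post]

# Assumed transition model: flu often clears up, HIV persists.
trans = {"flu": {"healthy": 0.5, "flu": 0.5}, "hiv": {"hiv": 1.0}}

# A 2-anonymous QI group links its members to {flu, hiv} in release 1
# and to {healthy, hiv} in release 2. Only one matching is plausible,
# so the adversary learns both users' values with certainty:
print(posterior_next_values(["flu", "hiv"], ["healthy", "hiv"], trans))
# [{'healthy': 1.0}, {'hiv': 1.0}]
```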

In this paper we formally model privacy attacks based on background knowledge extended to serial data releases. We present a new probabilistic defense technique that takes into account the adversary's possible background knowledge and the way he can revise it each time new data are released. Similarly to other anonymization techniques, our method is based on the generalization of quasi-identifier (QI) attributes, but generalization is performed with a new goal: minimizing the difference among the distributions of sensitive-value sequences associated with the tuples being generalized together. Finally, through an experimental evaluation on a large dataset, we show the effectiveness of our defense under different methods used to extract background knowledge; our results also show that JS-reduce provides a very good trade-off between achieved privacy and data utility.
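
Since the goal is to generalize together users whose sensitive-value sequences look alike, the Jensen-Shannon divergence between their predicted distributions is the natural closeness measure. The greedy pairing below is a minimal sketch of this idea using made-up per-user distributions; it is not the paper's JS-reduce algorithm.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-user distributions over the next sensitive value,
# e.g. derived from each user's released history and a transition model.
users = {
    "u1": [0.8, 0.1, 0.1],
    "u2": [0.7, 0.2, 0.1],
    "u3": [0.1, 0.1, 0.8],
    "u4": [0.2, 0.1, 0.7],
}

# Greedy pairing: group each user with the remaining user whose
# distribution is closest in JS divergence, so that a group's set of
# sensitive values reveals little about which sequence belongs to whom.
names = list(users)
paired = set()
for a in names:
    if a in paired:
        continue
    b = min((n for n in names if n != a and n not in paired),
            key=lambda n: jsd(users[a], users[n]))
    paired |= {a, b}
    print(f"group: {a}, {b}  (JSD = {jsd(users[a], users[b]):.3f})")
```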

The paper is structured as follows. In Section II, the privacy problem is presented through an example in the medical domain that illustrates the privacy attacks enabled by background knowledge, and the inadequacy of state-of-the-art techniques. In Section III we formally model the privacy attack, as well as the considered forms of background knowledge. In Section IV we show how an adversary can actually extract background knowledge and revise it in order to perform the attack. In Section V we propose our JS-reduce defense algorithm, which is experimentally evaluated in Section VI. Section VII concludes the paper.

In this section we focus on a specific scenario in the medical domain to illustrate the privacy attacks enabled by background knowledge on sequences of sensitive values. The example ...

Reference

This content is AI-processed based on open access ArXiv data.
