An Information-Theoretic Privacy Criterion for Query Forgery in Information Retrieval
In previous work, we presented a novel information-theoretic privacy criterion for query forgery in the domain of information retrieval. Our criterion measured privacy risk as a divergence between the user's and the population's query distribution, and contemplated the entropy of the user's distribution as a particular case. In this work, we make a twofold contribution. First, we thoroughly interpret and justify the privacy metric proposed in our previous work, elaborating on the intimate connection between the celebrated method of entropy maximization and the use of entropies and divergences as measures of privacy. Second, we attempt to bridge the gap between the privacy and information-theoretic communities by substantially adapting some technicalities of our original work to reach a wider audience not intimately familiar with information theory and the method of types.
💡 Research Summary
The paper addresses the privacy risks inherent in users’ search queries by proposing a principled information‑theoretic framework for query forgery. Query forgery consists of interleaving genuine user queries with fabricated dummy queries, thereby obscuring the true interest profile from observers such as search engines or third‑party adversaries. The authors introduce a privacy‑risk metric based on the Kullback‑Leibler (KL) divergence between the probability distribution of an individual user’s query profile (p) and the aggregate distribution of the whole user population (q). Because KL divergence is non‑negative and equals zero only when the two distributions coincide, a smaller D(p‖q) indicates a higher level of anonymity.
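As a small numerical illustration of this metric (the distributions below are hypothetical, not taken from the paper), the KL divergence between a user profile p and a population profile q can be computed directly:

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_i p_i * log2(p_i / q_i), in bits.
    Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical query-category distributions over 4 topics
p = [0.70, 0.10, 0.10, 0.10]   # an individual user's skewed profile
q = [0.25, 0.25, 0.25, 0.25]   # the population's aggregate profile

risk = kl_divergence(p, q)     # about 0.64 bits: the user stands out
# D(p||q) >= 0, with equality only when the two distributions coincide:
assert kl_divergence(q, q) == 0
```

A profile closer to q yields a smaller divergence, matching the paper's reading of D(p‖q) as privacy risk.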
The paper further shows that Shannon entropy is a special case of KL divergence when the reference distribution is uniform. In this sense, maximizing entropy corresponds to making a user's query distribution as close as possible to the uniform distribution, which intuitively spreads the user's "signal" across many categories and reduces distinguishability. To justify the use of entropy and KL divergence as privacy measures, the authors invoke Jaynes' maximum-entropy principle and the method of types from large-deviation theory. The method of types demonstrates that, for a large sample of k queries, the number of sequences sharing a given empirical distribution (type) t grows exponentially with the entropy of that type (≈2^{k·H(t)}). Consequently, the most "likely" empirical distribution under minimal prior knowledge is the one with maximal entropy, providing a solid statistical grounding for the privacy metric.
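The entropy-as-special-case claim rests on the identity D(p‖u) = log₂(n) − H(p) for the uniform reference u over n categories, so minimizing divergence from u is exactly maximizing H(p). A minimal sketch verifying this numerically (the profile p is hypothetical):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p||q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

n = 4
p = [0.70, 0.10, 0.10, 0.10]   # hypothetical user profile
u = [1 / n] * n                # uniform reference distribution

# D(p||u) = log2(n) - H(p): minimizing divergence from the uniform
# reference is the same as maximizing the user's entropy.
assert abs(kl_divergence(p, u) - (math.log2(n) - entropy(p))) < 1e-9
```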
Having established the metric, the authors formulate the trade‑off between privacy and system overhead as an optimization problem. Let α denote the proportion of forged (dummy) queries injected into the traffic. The mixed query distribution p_α is a convex combination of the genuine distribution and a chosen dummy distribution. The objective is to minimize D(p_α‖q) while keeping the additional traffic cost C(α) below a pre‑specified budget C_max. By introducing a Lagrange multiplier λ, the problem becomes:
min_α D(p_α‖q) + λ·C(α)