Know Your Personalization: Learning Topic level Personalization in Online Services

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Online service platforms (OSPs), such as search engines, news websites, ad providers, etc., serve highly personalized content to the user, based on the profile extracted from his history with the OSP. Although personalization (generally) leads to a better user experience, it also raises privacy concerns for the user: he does not know what is present in his profile and, more importantly, what is being used to personalize content for him. In this paper, we capture an OSP's personalization for a user in a new data structure called the personalization vector ($\eta$), which is a weighted vector over a set of topics, and present techniques to compute it for users of an OSP. Our approach treats OSPs as black boxes and extracts $\eta$ by mining only their output, specifically, the personalized (for a user) and vanilla (without any user information) contents served, and the differences between these contents. We formulate a new model called Latent Topic Personalization (LTP) that captures the personalization vector in a learning framework and present efficient inference algorithms for it. We conduct extensive experiments on search result personalization using both data from real Google users and synthetic datasets. Our results show high accuracy (R-pre = 84%) of LTP in finding personalized topics. For Google data, our qualitative results show how LTP can also identify evidence, namely queries for which results on a topic with a high $\eta$ value were re-ranked. Finally, we show how our approach can be used to build a new privacy evaluation framework focused on end-user privacy on commercial OSPs.


💡 Research Summary

The paper tackles the opaque nature of personalization in online service platforms (OSPs) such as search engines, news sites, and ad networks. While personalization improves user experience, it also creates privacy concerns because users do not know what information is stored in their profiles or which aspects are used to tailor content. To address this, the authors introduce a “personalization vector” (η), a weighted vector over a predefined set of topics that quantifies how much each topic influences a particular user’s personalized output.

The methodology treats OSPs as black boxes: only the personalized output (generated with user information) and the vanilla output (generated without any user data) are observed. By comparing the two, the authors extract the differences in ranking or content placement, which serve as signals for the underlying personalization. The core technical contribution is the Latent Topic Personalization (LTP) model. LTP consists of two layers: (1) a topic‑modeling layer that maps each result document to a distribution over topics (similar to LDA), and (2) a personalization layer that models how the η vector modifies the probability of a document’s rank. The personalization effect is captured through a “personalization potential” function that raises the rank of documents associated with high‑weight topics for a given user.
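The black-box comparison described above can be illustrated with a minimal Python sketch. This is not the LTP model: it only shows the observable signal (rank shifts between the vanilla and personalized lists) and a naive aggregation of promotions by topic. The function names `rank_shifts` and `naive_topic_scores`, the toy document IDs, and the document-topic weights are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def rank_shifts(vanilla, personalized):
    """Map each document present in both lists to (vanilla rank - personalized rank).

    A positive value means the document was promoted in the personalized list.
    """
    vanilla_pos = {doc: i for i, doc in enumerate(vanilla)}
    return {
        doc: vanilla_pos[doc] - i
        for i, doc in enumerate(personalized)
        if doc in vanilla_pos
    }


def naive_topic_scores(shifts, doc_topics):
    """Naive stand-in for eta (an assumption, not the paper's inference):
    credit each promoted document's shift to its topics, then normalize to sum to 1."""
    scores = defaultdict(float)
    for doc, shift in shifts.items():
        if shift > 0:  # only promotions count as evidence in this sketch
            for topic, weight in doc_topics.get(doc, {}).items():
                scores[topic] += shift * weight
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total else {}


# Toy data: the same query issued without and with user information.
vanilla = ["a", "b", "c", "d"]
personalized = ["c", "a", "b", "d"]

# Hypothetical document-topic weights (in LTP these would come from the topic layer).
doc_topics = {
    "a": {"politics": 1.0},
    "b": {"politics": 0.5, "sports": 0.5},
    "c": {"sports": 1.0},
    "d": {"politics": 1.0},
}

shifts = rank_shifts(vanilla, personalized)
eta_hat = naive_topic_scores(shifts, doc_topics)
print(shifts)
print(eta_hat)
```

In this toy case only document `c` is promoted, so all of the weight lands on its single topic. A learned model such as LTP would instead infer the topic weights jointly with the document-topic distributions rather than taking them as given.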

Inference is performed using a hybrid variational‑EM algorithm. Observed rank differences are treated as evidence, and the algorithm jointly updates the topic‑document parameters and the user‑specific η vector. Regularization via prior distributions and normalization ensures stability even when the rank changes are sparse or noisy.

Experiments are conducted on two datasets. The first uses real Google search logs from consenting users, where both personalized and non‑personalized result lists for the same queries are collected. The second is a synthetic dataset where topics, document‑topic assignments, and η vectors are generated artificially, allowing precise evaluation of recovery accuracy. Using R‑precision as the main metric, LTP reaches 84% in identifying the true personalized topics. Performance is especially strong when topics are well separated (e.g., sports vs. politics) and degrades gracefully when topics overlap.
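For readers unfamiliar with the metric, the sketch below computes R‑precision under its standard information-retrieval definition: with R truly relevant items, it is the fraction of the top R ranked items that are relevant. Whether the paper uses exactly this variant is an assumption, and the topic names and weights are invented for illustration.

```python
def r_precision(ranked_topics, relevant_topics):
    """Fraction of the top-R ranked topics that are relevant, where R = number of relevant topics."""
    relevant = set(relevant_topics)
    r = len(relevant)
    if r == 0:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    top_r = ranked_topics[:r]
    return sum(1 for topic in top_r if topic in relevant) / r


# Hypothetical estimated personalization vector (topic -> weight).
eta_hat = {"sports": 0.5, "politics": 0.3, "travel": 0.15, "cooking": 0.05}

# Rank topics by estimated weight, highest first.
ranked = sorted(eta_hat, key=eta_hat.get, reverse=True)

# Hypothetical ground truth: the topics that were actually personalized.
true_topics = {"sports", "travel"}

score = r_precision(ranked, true_topics)
print(score)  # top 2 are sports and politics; one of two is relevant -> 0.5
```

On synthetic data the ground-truth topic set is known by construction, which is what makes a metric of this kind directly computable there.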

Qualitative analysis on the Google data shows concrete examples: queries related to topics with high η values (e.g., “NBA scores”, “presidential debate”) consistently receive higher‑ranked results compared to the vanilla list, confirming that the model captures real personalization decisions. The authors also demonstrate how η can be visualized to give end‑users a clear picture of which topics dominate their profile.

Beyond reconstruction, the paper proposes a privacy‑evaluation framework built on η. By defining a “privacy exposure score” for each topic, users can assess how much personal information is being leveraged by the OSP. This score can be used to inform users, guide them in adjusting privacy settings, or even provide feedback to service providers.

Limitations include the need for a predefined topic set, potential difficulty in estimating η when rank changes are minimal, and the focus on textual search results (extension to images, video, or multimodal content is left for future work). The authors suggest future directions such as dynamic topic discovery, real‑time monitoring dashboards, and integration with regulatory compliance tools.

In summary, the paper presents a novel black‑box approach to reverse‑engineer personalization in OSPs, introduces a topic‑level personalization vector, and validates the method with both synthetic and real‑world data. Its contributions lie in providing a transparent, quantifiable view of personalization that can be leveraged for user‑centric privacy assessments and for designing more accountable online services.

