Puda: Private User Dataset Agent for User-Sovereign and Privacy-Preserving Personalized AI

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Personal data centralization among dominant platform providers including search engines, social networking services, and e-commerce has created siloed ecosystems that restrict user sovereignty, thereby impeding data use across services. Meanwhile, the rapid proliferation of Large Language Model (LLM)-based agents has intensified demand for highly personalized services that require the dynamic provision of diverse personal data. This presents a significant challenge: balancing the utilization of such data with privacy protection. To address this challenge, we propose Puda (Private User Dataset Agent), a user-sovereign architecture that aggregates data across services and enables client-side management. Puda allows users to control data sharing at three privacy levels: (i) Detailed Browsing History, (ii) Extracted Keywords, and (iii) Predefined Category Subsets. We implemented Puda as a browser-based system that serves as a common platform across diverse services and evaluated it through a personalized travel planning task. Our results show that providing Predefined Category Subsets achieves 97.2% of the personalization performance (evaluated via an LLM-as-a-Judge framework across three criteria) obtained when sharing Detailed Browsing History. These findings demonstrate that Puda enables effective multi-granularity management, offering practical choices to mitigate the privacy-personalization trade-off. Overall, Puda provides an AI-native foundation for user sovereignty, empowering users to safely leverage the full potential of personalized AI.


💡 Research Summary

The paper addresses a pressing problem in today’s digital ecosystem: personal data is monopolized by dominant platforms, creating silos that undermine user sovereignty and impede cross‑service data utilization. At the same time, the rapid rise of large language model (LLM)‑based agents (e.g., ChatGPT, Gemini) has generated a strong demand for rich, dynamic personal context to enable highly personalized services. Traditional notice‑and‑consent mechanisms are inadequate because they require users to continuously manage complex data flows, and the probabilistic nature of LLMs makes it impossible to guarantee that sensitive information will never be unintentionally disclosed.

To bridge this gap, the authors propose Puda (Private User Dataset Agent), a user‑sovereign architecture that aggregates personal data across web services within a client‑side environment and offers it at three distinct privacy granularity levels:

  1. Detailed Browsing History – raw URLs, page titles, and summaries, providing maximal personalization power but also the highest privacy risk.
  2. Extracted Keywords – a set of keywords generated per page by an LLM (Gemma‑3‑4B), each annotated with sentiment and relevance scores. This representation abstracts away specific page identifiers while still exposing potentially sensitive proper nouns.
  3. Predefined Category Subsets – a deterministic selection of items from a curated taxonomy (26 top‑level, 256 second‑level, 810 third‑level categories) that best match the user’s interests. Because the output is constrained to a fixed list, the risk of leaking unexpected proper nouns is eliminated.
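The three granularity levels above can be pictured as progressively more abstract data shapes. The following Python sketch is illustrative only: the field names, score ranges, and category strings are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class BrowsingRecord:      # Level (i): Detailed Browsing History
    url: str               # raw page identifier (highest privacy risk)
    title: str
    summary: str           # LLM-generated per-page summary

@dataclass
class ExtractedKeyword:    # Level (ii): per-page keyword with scores
    keyword: str           # may still expose sensitive proper nouns
    sentiment: float       # assumed range, e.g. -1.0 .. 1.0
    relevance: float       # assumed range, e.g.  0.0 .. 1.0

@dataclass
class CategoryItem:        # Level (iii): entry from the fixed taxonomy
    top: str               # one of 26 top-level categories
    second: str            # one of 256 second-level categories
    third: str             # one of 810 third-level categories
```

Because a `CategoryItem` can only name entries from the curated taxonomy, the Level (iii) representation cannot carry arbitrary strings out of the user's environment.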

The architecture consists of three core components:

  • Content Recorder – implemented as a browser extension, it captures URL, title, and HTML body for every visited page, providing a single entry point for cross‑service data collection.
  • Dataset Agent – processes the raw logs into the three privacy levels. Per‑page processing uses an LLM to generate a summary and keywords; per‑user aggregation consolidates these into user‑level datasets. For the category subset, a more powerful LLM (GPT‑5 nano) maps summaries and keywords onto the predefined taxonomy.
  • Access Control Agent – leverages OAuth 2.0 and OpenID Connect (Discovery and Authorization Code Flow) to issue scoped access tokens. The Dataset Agent acts as a resource server, while the Access Control Agent functions as an authorization server; external AI agents receive tokens that limit data retrieval to the user‑approved granularity.
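The resource-server check performed by the Dataset Agent can be sketched as a mapping from OAuth token scopes to permitted privacy levels. The scope identifiers below (`puda.history`, `puda.keywords`, `puda.categories`) are hypothetical: the paper specifies OAuth 2.0 / OpenID Connect but not these names.

```python
# Hypothetical scope names; the paper does not define the actual scopes.
SCOPE_TO_LEVEL = {
    "puda.history": 1,     # Detailed Browsing History
    "puda.keywords": 2,    # Extracted Keywords
    "puda.categories": 3,  # Predefined Category Subsets
}

def permitted_levels(token_scopes: set[str]) -> set[int]:
    """Privacy levels an external agent may read, given its token scopes."""
    return {SCOPE_TO_LEVEL[s] for s in token_scopes if s in SCOPE_TO_LEVEL}

def authorize(token_scopes: set[str], requested_level: int) -> bool:
    """Resource-server check: is the requested granularity user-approved?"""
    return requested_level in permitted_levels(token_scopes)
```

Under this scheme, a travel-planning agent holding a token scoped only to `puda.categories` could retrieve the category subset but would be refused the raw browsing history.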

Implementation details: the Content Recorder runs locally in the browser, transmitting logs to a server‑hosted Dataset Agent. Summaries and keywords are generated with Gemma‑3‑4B (a small language model suitable for on‑device deployment). Category extraction uses GPT‑5 nano, acknowledging that current on‑device models lack sufficient context windows, but anticipating future SLM advances. The taxonomy is derived from Google Cloud Natural Language API categories, localized to Japanese for the evaluation.
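The deterministic guarantee of the category level follows from constraining the extraction output to the fixed list: anything the cloud LLM emits that is not in the taxonomy can be discarded before sharing. A minimal sketch, with illustrative taxonomy entries (the real list holds 26/256/810 categories per level):

```python
# Illustrative entries only; the actual taxonomy is derived from
# Google Cloud Natural Language API categories.
TAXONOMY = {
    "/Travel/Destinations",
    "/Food & Drink/Cuisines",
    "/Arts & Entertainment/Music",
}

def filter_to_taxonomy(llm_output: list[str]) -> list[str]:
    """Drop any string the category-extraction LLM emits that is not in
    the predefined list, so no unexpected proper noun can leak."""
    return [c for c in llm_output if c in TAXONOMY]
```

Even if the model hallucinates or echoes sensitive text, only pre-approved category strings survive the filter.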

Evaluation focuses on a travel‑planning scenario, a common benchmark for assessing complex reasoning and personalization in AI agents. Users request a personalized itinerary; the travel‑planning agent consumes data from Puda via the A2A protocol. Performance is judged by an LLM‑as‑a‑Judge framework across three criteria: personalization relevance, practical usefulness, and creative suggestion quality. Results show that providing only the Predefined Category Subsets achieves 97.2% of the personalization score obtained when sharing the full Detailed Browsing History. Moreover, token consumption and latency are substantially reduced, demonstrating practical cost benefits.
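The headline 97.2% figure is a relative score: the judge's mean rating under Category Subsets divided by its mean rating under full Browsing History. A sketch of that computation, with hypothetical numbers (the paper's raw judge scores are not reproduced here):

```python
def relative_personalization(scores_subset: list[float],
                             scores_history: list[float]) -> float:
    """Mean judge score under Category Subsets as a fraction of the mean
    under Detailed Browsing History, averaged over the judge criteria."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(scores_subset) / mean(scores_history)

# Hypothetical per-criterion scores consistent with the reported ratio:
ratio = relative_personalization([9.72, 9.72, 9.72], [10.0, 10.0, 10.0])
```

Averaging before dividing (rather than dividing per criterion) is one reasonable reading; the paper's exact aggregation may differ.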

The authors discuss limitations: the current system only captures browser‑based activity, leaving mobile‑app or OS‑level interactions out of scope; the use of a cloud‑based LLM for category extraction means the architecture is not yet fully user‑side; and the authorization flow is described but not fully implemented, requiring future security validation.

In conclusion, Puda offers a concrete, technically grounded solution for user‑centric data management that balances privacy and utility. By allowing users to select the granularity of data shared with AI agents, it mitigates the inherent uncertainty of probabilistic privacy‑preserving methods while preserving most of the personalization benefits. The study provides empirical evidence that deterministic, multi‑granular data provisioning can substantially narrow the privacy‑utility trade‑off, paving the way for future AI‑native services that respect user sovereignty.

