DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In current inter-organizational data spaces, usage policies are enforced mainly at the asset level: a whole document or dataset is either shared or withheld. When only parts of a document are sensitive, providers who want to avoid leaking protected information typically must manually redact documents before sharing them, which is costly, coarse-grained, and hard to maintain as policies or partners change. We present DAVE, a usage policy-enforcing LLM spokesperson that answers questions over private documents on behalf of a data provider. Instead of releasing documents, the provider exposes a natural language interface whose responses are constrained by machine-readable usage policies. We formalize policy-violating information disclosure in this setting, drawing on usage control and information flow security, and introduce virtual redaction: suppressing sensitive information at query time without modifying source documents. We describe an architecture for integrating such a spokesperson with Eclipse Dataspace Components and ODRL-style policies, and outline an initial provider-side integration prototype in which QA requests are routed through a spokesperson service instead of triggering raw document transfer. Our contribution is primarily architectural: we do not yet implement or empirically evaluate the full enforcement pipeline. We therefore outline an evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data spaces.


💡 Research Summary

The paper introduces DAVE, a policy‑enforcing large language model (LLM) “spokesperson” designed for secure multi‑document data sharing within industrial data spaces. Traditional data‑space architectures (e.g., IDS RAM 4.0, Eclipse Dataspace Components) enforce usage policies at the asset level, meaning an entire document is either shared or withheld. This coarse granularity is inadequate when only specific portions of a document are sensitive. DAVE addresses this gap by providing a natural‑language question‑answering (QA) interface that never releases raw documents; instead, it returns answers that are dynamically constrained by machine‑readable ODRL‑style usage policies.

The authors formalize “policy‑violating information disclosure” as any answer that reveals protected information—directly, indirectly, or for an unauthorized purpose—contrary to the governing policy. They adopt concepts from usage control (UCON) and information‑flow security to model these violations. Utility is defined as the system’s ability to correctly answer policy‑allowed questions while refusing disallowed ones, measured via answer quality (exact match, token‑level F1) and coverage (fraction of in‑scope queries answered).
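The utility metrics named above (exact match, token-level F1, and coverage) are standard QA measures; a minimal sketch of how they are typically computed is shown below. The normalization and function names are illustrative, not taken from the paper.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answer matches the reference exactly (after simple normalization)."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between predicted and reference answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def coverage(answered: int, in_scope: int) -> float:
    """Fraction of policy-allowed (in-scope) queries the system answered."""
    return answered / in_scope if in_scope else 0.0
```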

A four‑layer enforcement pipeline is proposed:

  1. Purpose‑aware query screening – rejects queries whose intended purpose does not match the contract.
  2. Retrieval‑time policy filtering – filters out protected document chunks from the vector‑search results before they reach the LLM.
  3. Policy‑conditioned prompting – injects policy constraints into the LLM prompt, instructing the model to refuse or summarize when a violation would occur.
  4. Post‑generation response checks – applies entity detection, regex, and policy engine re‑evaluation on the generated answer to catch any residual violations.
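The four layers above compose sequentially: each layer can only further restrict what the next one sees or emits. A hedged sketch of that control flow follows; every name (`index.search`, `llm.generate`, the policy fields) is hypothetical, since the paper describes the architecture but not a concrete API.

```python
def answer_query(query, purpose, policy, index, llm):
    """Illustrative four-layer enforcement pipeline; names are assumptions."""
    # 1. Purpose-aware query screening: reject out-of-contract purposes.
    if purpose not in policy["allowed_purposes"]:
        return "Refused: purpose not permitted by the contract."

    # 2. Retrieval-time policy filtering: drop protected chunks
    #    before they ever reach the LLM context.
    chunks = index.search(query)
    permitted = [c for c in chunks if c["label"] not in policy["protected_labels"]]

    # 3. Policy-conditioned prompting: embed the constraints in the prompt.
    prompt = (
        f"Answer using only the context below. Refuse if the answer would "
        f"reveal information labelled {policy['protected_labels']}.\n"
        f"Context: {permitted}\nQuestion: {query}"
    )
    draft = llm.generate(prompt)

    # 4. Post-generation response checks: a last-line screen (here a
    #    simple term match standing in for entity/regex/policy-engine checks).
    if any(term in draft for term in policy.get("forbidden_terms", [])):
        return "Refused: response withheld by post-generation policy check."
    return draft
```

Layer 2 is what makes the redaction "virtual": protected chunks are filtered from retrieval results at query time, while the stored documents remain unmodified.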

This “virtual redaction” approach allows fine‑grained protection without altering the source documents, thereby preserving the original data for other authorized uses.

The architecture integrates DAVE as a service within an EDC connector. Providers define ODRL contracts during negotiation; the connector authenticates consumers, forwards their queries to DAVE, and logs all interactions. The threat model assumes adversarial consumers who possess valid credentials and may issue adaptive, repeated, or prompt‑injection attacks to extract forbidden information. The authors categorize attacks into direct extraction, indirect inference, prompt injection, verbatim leakage, and RAG‑specific leakage, and argue that the multi‑layer pipeline mitigates each class. Out‑of‑scope threats include infrastructure compromise, model weight theft, and side‑channel attacks.
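To make the contract side concrete, the snippet below constructs an ODRL-style policy in JSON-LD. The structure (permission, prohibition, constraint with a purpose operand) follows the ODRL vocabulary; the specific asset URIs and purpose value are invented for illustration and do not come from the paper.

```python
import json

# Hypothetical ODRL-style usage policy: the consumer may query the asset
# for one stated purpose, but may not redistribute it.
policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Agreement",
    "uid": "urn:policy:dave-example",  # illustrative identifier
    "permission": [{
        "target": "urn:asset:maintenance-reports",
        "action": "use",
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": "predictive-maintenance"
        }]
    }],
    "prohibition": [{
        "target": "urn:asset:maintenance-reports",
        "action": "distribute"
    }]
}

print(json.dumps(policy, indent=2))
```

In the described architecture, a contract like this is negotiated through the EDC connector; DAVE then consults it at query time rather than gating a one-time document transfer.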

Since a full implementation and empirical evaluation are not yet available, the paper outlines an evaluation methodology. Security is measured by the rate of policy‑violating answers; utility by answer accuracy and coverage; performance by latency and resource consumption. Both benign and adversarial query workloads will be used to assess trade‑offs.
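The proposed methodology can be summarized as a small evaluation loop over the two workloads. This is a sketch under stated assumptions: `spokesperson` (the system under test) and `violates` (an oracle that flags policy-violating answers) are hypothetical callables, since the paper defines the metrics but not a harness.

```python
import time

def evaluate(spokesperson, benign_queries, adversarial_queries, violates):
    """Measure security (violation rate), utility (coverage), and
    performance (mean latency) for a QA spokesperson under test."""
    # Security: fraction of adversarial queries that elicit a violating answer.
    leaks = sum(1 for q in adversarial_queries if violates(spokesperson(q)))

    # Utility and performance over the benign (policy-allowed) workload.
    answered, latencies = 0, []
    for q in benign_queries:
        start = time.perf_counter()
        answer = spokesperson(q)
        latencies.append(time.perf_counter() - start)
        if answer is not None:  # None models a refusal on an in-scope query
            answered += 1

    return {
        "violation_rate": leaks / len(adversarial_queries),
        "coverage": answered / len(benign_queries),
        "avg_latency_s": sum(latencies) / len(latencies),
    }
```

Answer quality (exact match, token F1) would be computed separately against gold answers for the benign set; the loop above captures only the refusal and timing behavior.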

Key contributions are: (1) the concept of a policy‑enforcing LLM spokesperson for data spaces, (2) a formal definition of policy‑violating disclosure in LLM QA, (3) a multi‑layer enforcement architecture, (4) a prototype integration with EDC and ODRL, and (5) a proposed evaluation framework. Strengths include alignment with existing data‑space standards and a comprehensive defense‑in‑depth design. Limitations are the lack of a concrete implementation, limited discussion of policy conflict resolution, and the open challenge of quantifying the residual risk from LLM hallucinations or indirect inference.

Overall, the work opens a new research direction at the intersection of usage control, information‑flow security, and LLM‑based interfaces, proposing a viable path toward fine‑grained, policy‑compliant access to private document collections in federated data‑sharing ecosystems.

