Data Minimisation in Communication Protocols: A Formal Analysis Framework and Application to Identity Management

With the growing amount of personal information exchanged over the Internet, privacy is an increasing concern for users. One of the key principles in protecting privacy is data minimisation. This principle requires that only the minimum amount of information necessary to accomplish a certain goal is collected and processed. “Privacy-enhancing” communication protocols have been proposed to guarantee data minimisation in a wide range of applications. However, there is currently no satisfactory way to precisely assess and compare the privacy these protocols offer: existing analyses are either too informal and high-level, or specific to one particular system. In this work, we propose a general formal framework for analysing and comparing communication protocols with respect to privacy by data minimisation. Privacy requirements are formalised independently of any particular protocol, in terms of the knowledge of (coalitions of) actors in a three-layer model of personal information. These requirements are then verified automatically for particular protocols by computing this knowledge from a description of their communication. We validate our framework in an identity management (IdM) case study. As IdM systems see growing use to meet the need for reliable on-line identification and authentication, privacy in this domain is an increasingly critical issue. We use our framework to analyse and compare four identity management systems. Finally, we discuss the completeness and (re)usability of the proposed framework.


💡 Research Summary

The paper addresses the growing privacy concerns that arise from the massive exchange of personal information over the Internet by focusing on the principle of data minimisation – the idea that only the information strictly necessary for a given purpose should be collected and processed. While many “privacy‑enhancing” communication protocols have been proposed, there has been no systematic, protocol‑agnostic method for assessing and comparing the level of data minimisation they actually provide. To fill this gap, the authors introduce a general formal framework that can be applied to any communication protocol in order to evaluate its compliance with data‑minimisation requirements.

The core of the framework is a three‑layer model of personal information. The first layer captures raw data items (e.g., name, address, phone number). The second layer records the contextual relationships in which the data are generated (e.g., contractual ties, legal bases). The third layer encodes semantic attributes such as identifiability, sensitivity, and purpose limitation. This stratification allows privacy requirements to be expressed independently of any particular protocol, simply by stating which layers of information a given actor or coalition of actors is allowed to know.
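The three-layer stratification can be made concrete with a small data model. The sketch below is illustrative only: the class and field names are our own choices, not the paper's formal notation.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-layer model of personal information.
# Layer and field names are illustrative, not taken from the paper.

@dataclass(frozen=True)
class DataItem:
    """Layer 1: a raw data item, e.g. 'address' or 'date_of_birth'."""
    name: str

@dataclass(frozen=True)
class ContextLink:
    """Layer 2: the contextual relationship in which the item arises."""
    subject: str      # the data subject the item belongs to
    item: DataItem
    context: str      # e.g. "contract", "legal_obligation"

@dataclass(frozen=True)
class Semantics:
    """Layer 3: semantic attributes attached to the item."""
    item: DataItem
    identifiable: bool
    sensitive: bool
    purpose: str      # the purpose the item may be used for
```

With such a model, a privacy requirement reduces to stating which of these objects a given actor may hold, which is exactly the layer-wise formulation described above.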

Privacy requirements are formalised as knowledge constraints on individual actors (users, service providers, identity providers, third parties) and on coalitions of actors that may share information. For example, a requirement might state that a service provider may learn a user’s identifier and an authentication token but must not learn the user’s actual address or date of birth. Such constraints are expressed in terms of the three‑layer model, making them both precise and flexible.
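A knowledge constraint of this kind can be encoded as a simple allow-list per actor. The actor names and item names below are hypothetical, chosen to mirror the example in the paragraph above.

```python
# Hypothetical knowledge constraints: for each actor, the set of data
# items it is allowed to learn. Names are illustrative only.
ALLOWED = {
    "service_provider": {"user_id", "auth_token"},
    "identity_provider": {"user_id", "address", "date_of_birth"},
}

def violations(actor: str, learned: set) -> set:
    """Return the data items an actor has learned beyond its allowance."""
    return learned - ALLOWED.get(actor, set())
```

For instance, a service provider that ends up knowing a user's address would produce a non-empty violation set, flagging a breach of the constraint.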

To verify whether a concrete protocol satisfies these constraints, the protocol is modelled as a sequence of messages, each described by sender, receiver, and payload (including cryptographic operations such as encryption, signatures, and hashing). The framework defines a set of knowledge‑propagation rules that determine how information flows between actors as messages are processed. Encryption limits knowledge to those who hold the appropriate decryption key; signatures reveal the identity of the signer; hash values may leak limited information depending on the attacker’s capabilities. By iteratively applying these rules, the algorithm computes the knowledge set of every actor after each message. The computed knowledge is then compared against the pre‑specified privacy constraints. If a violation is detected, the framework pinpoints the exact message and the specific piece of information that caused the breach.
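The iterative computation can be sketched as a fixed-point loop over the messages. This is a deliberately simplified version under our own assumptions: payloads are either plain items or `("enc", key, item)` tuples, an encrypted item is learned only once the matching key is in the receiver's knowledge, and iterating to a fixed point over-approximates the case where a key arrives after the ciphertext, which is the safe direction for a privacy analysis. Signatures and hashes are omitted for brevity.

```python
# Simplified knowledge-propagation sketch (our own simplification of the
# idea described above, not the paper's actual rule set).

def propagate(initial, messages):
    """Compute each actor's knowledge set after all messages.

    initial  -- dict mapping actor name to its starting knowledge set
    messages -- list of (sender, receiver, payload) triples; payload is a
                list of plain items or ("enc", key, item) tuples
    """
    knowledge = {actor: set(items) for actor, items in initial.items()}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for sender, receiver, payload in messages:
            rcv = knowledge.setdefault(receiver, set())
            for term in payload:
                if isinstance(term, tuple) and term[0] == "enc":
                    _, key, item = term
                    # decryptable only once the key is known
                    if key in rcv and item not in rcv:
                        rcv.add(item)
                        changed = True
                elif term not in rcv:
                    rcv.add(term)
                    changed = True
    return knowledge
```

For example, if an identity provider first sends `("enc", "k", "secret")` and only later the key `"k"`, the fixed-point loop still credits the receiver with `"secret"`, reflecting that it could decrypt the stored ciphertext afterwards.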

The authors validate the approach with a case study on identity management (IdM), a domain where reliable online identification and authentication are essential but privacy concerns are especially acute. Four representative IdM systems are modelled: a SAML‑based federated authentication system, an OpenID Connect implementation, a Kerberos ticket‑granting protocol, and a decentralized self‑sovereign identity (SSI) solution built on blockchain technology. For each system the authors construct the message flow, feed it into the analysis tool, and examine the resulting knowledge of the user, the service provider, the identity provider, and any third parties.

The analysis reveals that the SAML and OpenID Connect systems largely respect data‑minimisation requirements: they expose only the identifiers and authentication tokens needed for service access, while keeping additional personal attributes hidden. The Kerberos protocol, however, shows that service providers can infer more than intended from the ticket contents, especially when tickets are reused across sessions, leading to the accumulation of historical authentication data. The SSI system performs the worst in terms of data minimisation: immutable blockchain records of metadata (e.g., timestamps, transaction hashes) allow observers to reconstruct user profiles over time, violating purpose‑limitation constraints. Moreover, the coalition analysis demonstrates that when multiple service providers collude, they can combine their partial knowledge to reconstruct information that no single provider could obtain alone.
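The collusion check itself is conceptually simple: a coalition's knowledge is the union of its members' knowledge sets, and that union is then tested against the constraints. A minimal sketch, with hypothetical actor and item names:

```python
def coalition_knowledge(knowledge, members):
    """Pooled knowledge of a coalition of colluding actors.

    The pooled set is the union of the members' individual knowledge
    sets; it can reveal combinations that no single member holds, which
    is exactly the collusion risk described above.
    """
    pooled = set()
    for member in members:
        pooled |= knowledge.get(member, set())
    return pooled
```

If two providers each hold a shared user identifier plus different attributes, the pooled set links those attributes to the same user, so the coalition must be checked against the constraints as if it were a single actor.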

The paper discusses the strengths of the framework: (1) protocol‑independent specification of privacy requirements; (2) automated, quantitative computation of actor knowledge; (3) ability to reason about coalitions and potential collusion; and (4) reusability across diverse domains because only the message description needs to be adapted for a new protocol. Limitations are also acknowledged. Accurate modelling of every cryptographic primitive and protocol nuance is required, which can be labour‑intensive. For complex protocols the state space of possible knowledge configurations can explode, leading to high computational cost. The authors propose abstraction techniques (grouping keys, summarising repeated patterns) and selective verification (focusing on critical protocol phases) as future work to mitigate these issues.

Finally, the authors evaluate the completeness of their formalisation, arguing that, under the assumption of a faithful message model, the knowledge‑propagation rules capture all possible information flows. They also highlight the framework’s reusability: once a set of privacy constraints is defined for a particular regulatory context (e.g., GDPR purpose limitation, NIST minimum‑privilege), it can be applied to any protocol without redesign. The paper concludes that the proposed formal analysis provides a rigorous, scalable method for assessing data minimisation in communication protocols, and it opens avenues for automated privacy‑by‑design verification in emerging systems.