PriMod4AI: Lifecycle-Aware Privacy Threat Modeling for AI Systems using LLM


Artificial intelligence systems introduce complex privacy risks throughout their lifecycle, especially when processing sensitive or high-dimensional data. Beyond the seven traditional privacy threat categories defined by the LINDDUN framework, AI systems are also exposed to model-centric privacy attacks such as membership inference and model inversion, which LINDDUN does not cover. To address both classical LINDDUN threats and additional AI-driven privacy attacks, PriMod4AI introduces a hybrid privacy threat modeling approach that unifies two structured knowledge sources: a LINDDUN knowledge base representing the established taxonomy, and a model-centric privacy attack knowledge base capturing threats outside LINDDUN. These knowledge bases are embedded into a vector database for semantic retrieval and combined with system-level metadata derived from Data Flow Diagrams (DFDs). PriMod4AI uses retrieval-augmented, data-flow-specific prompt generation to guide large language models (LLMs) in identifying, explaining, and categorizing privacy threats across lifecycle stages. The framework produces justified, taxonomy-grounded threat assessments that integrate both classical and AI-driven perspectives. Evaluation on two AI systems indicates that PriMod4AI provides broad coverage of classical privacy categories while additionally identifying model-centric privacy threats. The framework produces consistent, knowledge-grounded outputs across LLMs, as reflected in agreement scores ranging from 0.78 to 0.84.


💡 Research Summary

PriMod4AI presents a novel, lifecycle‑aware privacy threat modeling framework tailored for artificial intelligence (AI) systems. The authors begin by identifying a critical gap in existing privacy engineering practices: the widely adopted LINDDUN methodology excels at cataloguing data‑flow‑centric threats in conventional software but fails to capture model‑centric attacks that arise during AI training, inference, and deployment (e.g., membership inference, model inversion, attribute inference, training‑data extraction). To bridge this gap, PriMod4AI constructs two complementary, structured knowledge bases (KBs).

The first, the LINDDUN KB, is derived from the official LINDDUN taxonomy. The authors convert the PDF‑based hierarchical description into a JSON schema that preserves all seven top‑level categories (Linkability, Identifiability, Non‑repudiation, Detectability, Disclosure of information, Unawareness, Non‑compliance) together with sub‑nodes, example scenarios, impact statements, and contextual notes. This representation enables fine‑grained semantic retrieval and direct consumption by large language models (LLMs).
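The paper does not reproduce the exact schema, but an entry in the converted LINDDUN KB might look roughly as follows. This is a minimal sketch: the field names and example values are illustrative assumptions, not the authors' actual representation.

```python
import json

# Hypothetical sketch of one LINDDUN KB entry after PDF-to-JSON conversion.
# Field names (category, sub_nodes, impact, ...) are assumptions.
linddun_entry = {
    "category": "Linkability",
    "description": "An adversary can link two or more items of interest "
                   "(e.g., actions or identities) to the same data subject.",
    "sub_nodes": [
        {
            "id": "L.1",
            "name": "Linkability of data flows",
            "example": "Correlating requests from the same user across sessions",
            "impact": "Profiling of the data subject",
        }
    ],
    "contextual_notes": "Relevant wherever quasi-identifiers traverse a flow.",
}

# The entry round-trips cleanly through JSON, as KB storage requires.
serialized = json.dumps(linddun_entry, indent=2)
restored = json.loads(serialized)
```

A flat JSON structure like this preserves the taxonomy's hierarchy while remaining directly consumable by an embedding model or an LLM prompt.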

The second, the AI Privacy KB, aggregates model‑centric privacy attacks reported in the literature between 2016 and 2025. A systematic review of peer‑reviewed venues (IEEE Xplore, ACM Digital Library, SpringerLink) and pre‑print servers (arXiv) yields roughly 30 distinct threats. Each entry is encoded in JSON with fields for name, description, attack vector, associated AI lifecycle stage (data collection, preprocessing, model building, training, deployment, inference, monitoring), and a bibliographic source. Importantly, each AI‑centric threat is mapped to its closest LINDDUN dimension, providing a bridge between the two taxonomies.
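A record in the AI Privacy KB could then be sketched as below. The paper lists the fields (name, description, attack vector, lifecycle stage, source) and the mapping to the closest LINDDUN dimension; the exact key names and wording here are assumptions.

```python
# Illustrative AI Privacy KB record; key names are assumed, not the
# authors' exact schema. Membership inference and its LINDDUN mapping
# follow the paper's description.
LIFECYCLE_STAGES = {
    "data collection", "preprocessing", "model building",
    "training", "deployment", "inference", "monitoring",
}

attack_entry = {
    "name": "Membership inference",
    "description": "An adversary determines whether a specific record was "
                   "part of the model's training set.",
    "attack_vector": "Query access to model confidence scores",
    "lifecycle_stage": "inference",
    "linddun_mapping": "Disclosure of information",
    "source": "Shokri et al. (2017)",  # bibliographic source field
}
```

Keeping the `linddun_mapping` field on every record is what lets the framework report AI-specific attacks within the familiar LINDDUN vocabulary.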

Both KBs are embedded using a text encoder and stored in a vector database, enabling semantic similarity search. The system‑level input to PriMod4AI is a Data Flow Diagram (DFD) of the target AI application. The DFD is parsed to extract external entities, processes, data stores, data flows, and trust boundaries. Each data flow is then expressed as a JSON record containing source, destination, data type, sensitivity classification, functional description, and the corresponding AI lifecycle stage (e.g., “camera → sensor fusion, video frames, visual scene data, Data Collection → Data Processing”).
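The retrieval step can be illustrated with a toy stand-in: a bag-of-words vector and cosine similarity replace the real text encoder and vector database, and a serialized DFD flow record serves as the query. Everything below is a simplified sketch, not the paper's implementation.

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words token counts (stand-in for a text encoder).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

# Cosine similarity between two sparse count vectors.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One DFD data flow record, following the paper's example flow.
flow = {
    "source": "camera",
    "destination": "sensor fusion",
    "data_type": "video frames",
    "sensitivity": "visual scene data",
    "lifecycle_stage": "Data Collection -> Data Processing",
}
query = embed(" ".join(flow.values()))

# Miniature two-entry KB: threat name -> description snippet.
kb = {
    "Membership inference": "adversary infers training membership from model outputs",
    "Identifiability": "video frames of faces allow identifying the data subject",
}

# Retrieve the most semantically similar KB snippet for this flow.
best = max(kb, key=lambda name: cosine(query, embed(kb[name])))
```

In the real pipeline the same lookup runs against both KBs at once, so each flow pulls in both classical LINDDUN snippets and model-centric attack descriptions.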

The core of the framework is a Retrieval‑Augmented Generation (RAG) pipeline. For every data flow, a base prompt template defines the LLM’s role (“privacy threat analyst”), injects the flow’s metadata, and specifies the required JSON output schema (fields: name, justification, LINDDUN category, AI lifecycle stage, source). The template is instantiated with flow‑specific values, then enriched by retrieving the most relevant knowledge snippets from both KBs based on semantic similarity to the flow description. The composite prompt is fed to an LLM (the authors evaluate both open‑source Llama‑2 and proprietary GPT‑4). The LLM generates a structured JSON object that lists each identified privacy threat, provides a concise justification grounded in the retrieved knowledge, assigns a LINDDUN category, indicates the lifecycle stage where the threat manifests, and cites the originating knowledge source.
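The prompt-assembly step might look like the following sketch. The template wording, helper name, and JSON key spellings are assumptions; only the role ("privacy threat analyst"), the injected flow metadata, and the required output fields come from the paper.

```python
# Hypothetical composite-prompt assembly for the RAG pipeline.
PROMPT_TEMPLATE = (
    "You are a privacy threat analyst.\n"
    "Data flow under analysis: {flow}\n"
    "Relevant knowledge snippets:\n{snippets}\n"
    "Return a JSON array of threats, each with fields: "
    "name, justification, linddun_category, lifecycle_stage, source."
)

def build_prompt(flow: dict, retrieved_snippets: list[str]) -> str:
    """Instantiate the base template with flow metadata and KB snippets."""
    return PROMPT_TEMPLATE.format(
        flow=", ".join(f"{k}={v}" for k, v in flow.items()),
        snippets="\n".join(f"- {s}" for s in retrieved_snippets),
    )

prompt = build_prompt(
    {"source": "camera", "destination": "sensor fusion",
     "data_type": "video frames"},
    ["Identifiability: video frames can reveal faces",
     "Membership inference: confidence scores leak training membership"],
)
```

Because the retrieved snippets are injected verbatim, the LLM's justifications can cite the KB text directly, which is what grounds the output and supports the reported inter-model consistency.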

Evaluation is performed on two realistic AI systems: (1) a medical imaging analysis pipeline handling high‑resolution radiology scans, and (2) a smart‑traffic surveillance system processing video streams for vehicle detection. In both cases, PriMod4AI successfully enumerates all classical LINDDUN threats (e.g., linkability of patient identifiers, unauthorized disclosure of video feeds) and additionally surfaces model‑centric attacks such as membership inference on the radiology model and attribute inference on the traffic model. Consistency across LLMs is measured via agreement scores, which range from 0.78 to 0.84, indicating that the RAG grounding substantially mitigates hallucination and model‑specific variance.

Key insights from the study include:

  1. Dual‑knowledge integration yields comprehensive coverage – By unifying a canonical privacy taxonomy with a curated AI‑specific attack catalog, the framework captures threats that would be missed by either source alone.

  2. DFD‑driven, lifecycle‑aware prompting provides precise context – Embedding system metadata directly into the prompt ensures that the LLM’s reasoning is anchored to concrete data flows and trust boundaries, leading to more relevant threat identification.

  3. RAG reduces hallucination and improves reproducibility – Semantic retrieval of vetted knowledge before generation constrains the LLM to factual information, as evidenced by high inter‑model agreement.

  4. Structured JSON output enables downstream automation – The forced schema facilitates automatic risk scoring, mitigation mapping, and integration with compliance tooling, making the approach practical for real‑world privacy‑by‑design pipelines.

  5. Knowledge bases are easily updatable – Adding new AI‑centric attacks (e.g., emerging diffusion‑model leakage techniques) only requires updating the AI Privacy KB, after which the system instantly incorporates the new knowledge without retraining the LLM.
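Insight 4 above, that the forced schema enables downstream automation, can be sketched as a small validator that a risk-scoring or compliance tool might run over the LLM's raw output. The field names follow the paper's schema; the validation logic itself is an assumed example.

```python
import json

# Required fields per the paper's output schema (key spellings assumed).
REQUIRED_FIELDS = {"name", "justification", "linddun_category",
                   "lifecycle_stage", "source"}

def validate_threats(llm_output: str) -> list[dict]:
    """Parse the LLM's JSON output and reject malformed threat entries."""
    threats = json.loads(llm_output)
    for t in threats:
        missing = REQUIRED_FIELDS - t.keys()
        if missing:
            raise ValueError(f"threat entry missing fields: {missing}")
    return threats

sample_output = json.dumps([{
    "name": "Membership inference",
    "justification": "Model confidence scores leak training membership.",
    "linddun_category": "Disclosure of information",
    "lifecycle_stage": "inference",
    "source": "AI Privacy KB",
}])
threats = validate_threats(sample_output)
```

Once validated, each entry can be routed by `linddun_category` or `lifecycle_stage` into whatever mitigation mapping or compliance tooling sits downstream.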

In conclusion, PriMod4AI demonstrates that retrieval‑augmented, LLM‑driven analysis, when combined with rigorously curated dual knowledge bases and DFD‑derived system context, can deliver a scalable, explainable, and lifecycle‑aware privacy threat modeling solution for modern AI applications. The work paves the way for automated privacy‑by‑design practices that keep pace with the rapidly evolving threat landscape of AI.

