Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks

Reading time: 14 minute
...

๐Ÿ“ Original Info

  • Title: Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks
  • ArXiv ID: 2512.23557
  • Date: 2025-12-29
  • Authors: Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty

๐Ÿ“ Abstract

Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of the occurrence of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent-to-agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. In order to mitigate these risks, this paper suggests a Cross-Agent Multimodal Provenanc-Aware Defense Framework whereby all the prompts, either user-generated or produced by upstream agents, are sanitized and all the outputs generated by an LLM are verified independently before being sent to downstream nodes. This framework contains a Text sanitizer agent, visual sanitizer agent, and output validator agent all coordinated by a provenance ledger, which keeps metadata of modality, source, and trust level throughout the entire agent network. This architecture makes sure that agent-to-agent communication abides by clear trust frames such such that injected instructions are not propagated down LangChain or GraphChainstyle-workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, and the cross-agent trust leakage is minimized, as well as, agentic execution pathways become stable. The framework, which expands the concept of provenance tracking and validation to the multi-agent orchestration, enhances the establishment of secure, understandable and reliable agentic AI systems.

๐Ÿ“„ Full Content

Generation, reasoning and multimodal analytics Large Language Models (LLMs) and Vision-Language Models (VLMs) like GPT-4V, Claude, Gemini and LLaVas have performed well on generation, reasoning and multimodal analytics. Nevertheless, their openness to natural language and visuals brings a substantial attack surface to them [1], [2], [3], [15]. In contrast to conventional systems that use structured APIs, multimodal models that accept unstructured text and image, meaning that attackers can put adversarial instructions in a language form, at an image level, or by manipulating metadata.

Threats based on injection are not new: SQL, command, and cross-site scripting are all based on poor delimitation between input and execution [7]. The same principle applies to prompt injection in LLMs which works via semantic or multimodal manipulation. Overrides can be concealed within user text, outside documents or even visual artifacts. Malicious cues in photographs, such as hidden text, steganography, or altered captions, can influence VLM behavior, according to recent research [14], [16]. These cues look harmless and thus cannot be detected by the traditional filtering or fine-tuning mechanisms.

Prompt injection attacks are both direct and indirect attacks, where malicious content is either embedded in the documents or pictures that were retrieved. To demonstrate that indirect prompt injection presents a critical threat to agentic processes, Greshake et al. [1] demonstrated examples thereof are their report, and Zou et al. [4] and Liu et al. [15] show that universal adversarial prompts can even move across models and modalities. With the implementation of modern systems that combine retrieval, tool usage, and coordination among agents, the potential of such injections is even higher.

Current protection mechanisms such as keyword blocking, customization of safety nets, and RL-supported guardrails are not enough [6], [3]. They are weak at paraphrased or visually-encoded attacks, are not explainable, and can not maintain provenance, and when such untrusted content is introduced into the context window, the downstream reasons can be affected.

The existing literature lacks a unifying and multimodal architecture capable of sanitizing the text and image, in addition to implementing the immediate and sustained, provenance conscious validation through agentic processes. Current methods normally secure only the input of the LLCM, or the result, which exposes multi-agent pipelines. In a bid to fill these gaps, this paper presents a proposal of a Cross-Agent Multimodal Provenance-Aware Framework that integrates text-image sanitization, trust scoring and multi-agent validation to ensure end-to-end processing in LLM-and VLM-based systems.

Research Contribution: This study proposes the use of a single, multimodal defense model, which provides security to agentic AI pipelines by imposing sanitization and validation on all agent-LLM interactions. The work provides a dual-layer text and image sanitization scheme, a provenance registry of tracking trust through multiagent courses of action and a masking plan to generate a trust strategy, restricting unsafe content at an LLM inference. It also adds output validation to eliminate crossagent contamination and unsafe tool behavior and provides a useful end-to-end security model to LangChain and GraphChain environments. All these contributions bring the state of prompt-injection defense forward to an agent-level security architecture not just of single-model filtering.

The rest of this paper will be structured as follows: Section II will conduct a review of related studies on multimodal prompt injection and provenance-based AI safety. Section III gives the proposed multi-agent methodology and system architecture. Section IV outlines the implementation and the experimental set up. Section V talks about the evaluation findings and comparison performance whereas Section VI concludes the whole paper.

Prompt injection is now a key security issue with both LLMs and VLMs. An initial study, like Zhuo et al. [2], has studied red-teaming strategies to reveal the susceptibility of aligned models, whereas Casper et al. [3] have offered a taxonomy of prompt-based attack types to study the problem of prompt-based attacks, as well as how these attacks can be evaded and mitigated. Xu et al. [6] conducted a survey of defensive approaches such as sandboxing, filtering, and reinforcement-based guardrails. The study by Greshake et al. [1] in greater detail emphasized how serious the indirect prompt injection can be, where malicious commands embedded in accessed or external content change the behavior of the model without the user being aware of it.

With the popularization of multimodal architectures, it was demonstrated that injection threats are not just text-based. Wolff et al. [14] proved that adversarial instructions could be coded into images by steganography or manipulation of metadata or visually embedded text. Liu et al. [16] evaluated such a type of attack as visual prompt injection, which exploits the vulnerability between vision encoders and language decoders, and Liu et al. [15], released MM-SafetyBench, which found out that VLMs fail to identify harmful visual prompts. All these works demonstrate that multimodal systems increase the attack surface by having visual pathways that can not be combated by traditional text-based defenses.

Most of the current defenses are reactive, even though there is increasing interest. Keywords filtering or finetuning and safety-classifier techniques have hard problems with paraphrasing attacks or visually encoded attacks and lack provenance across modalities or agent pipelines. Based on the literature review presented in Table I, existing strategies do not provide cross-modal contextual reasoning and continuous validation, making multi-agent workflows susceptible to direct and indirect injection strategies.

Based on the literature reviewed, it is evident that there is an urgent requirement to develop hybrid and context-aware and explainable defense models that combine textual and visual modalities. The next generation defense should integrate semantic reasoning, provenance tracking, and agent-based cooperation in order to identify and counter multimodal prompt injection attacks in time and in a transparent way.

Agentic AI refers to intelligent systems composed of autonomous, goal-driven agents capable of perception, reasoning, decision-making, and coordinated action within dynamic environments. Unlike traditional reactive or pipeline-based AI models, agentic frameworks emphasize adaptive behavior, inter-agent collaboration, and continuous feedback loops to handle complex, real-world tasks. Recent studies have demonstrated the effectiveness of agentic AI across diverse domains, including disaster prediction and response coordination [13], inclusive well-being and assistive technologies [12], and intelligent inventory management [11]. Agentic principles have also evolved from earlier automated decision frameworks in mobile and distributed systems [10], and have been further strengthened through secure, trustworthy integrations with blockchain and deep learning architectures [9]. Practical applications of agentic AI are exemplified by frameworks such as FinAgent, which integrates multiagent reasoning for personalized finance and nutrition planning [8], highlighting the growing role of agentic intelligence in scalable, human-centered AI systems.

In this section, a proposed framework, called the Cross-Agent Multimodal Provenance-Aware Defense Framework, is outlined and developed to protect agentic AI systems like LangChain and GraphChain against multimodal prompt injection attacks. Its methodology combines hierarchical sanitization, multimodal processing to trust and provenance tracking and post-generation validation in all agent nodes and between all LLM interactions. The architecture is shown in Figure 1.

The framework implements three major principles: (i) any incoming text, images, tool responses or interagent messages must be sanitized prior to entering the agent graph; (ii) all prompts collected by LangChain/GraphChain agents must be sanitized again before accessing the LLM; and (iii) any outputs of the LLM should be verified before propagating to other agents, tool implementers or MCP-fulfilling action layers. The combination of all these mechanisms results in a zero-trust communication fabric between the agent network.

The multilayer defense comprises four cooperating agents: (1) Text Sanitizer Agent (A t ), (2) Visual Sanitizer Agent (A v ), (3) Main Task Model (Agent M), and (4) Output Validator Agent (B). A shared provenance ledger maintains the metadata of token and patch-level trusts across the pipeline.

All external content is first processed by A t and A v . 1) Text Sanitizer Agent (A t ): A t performs span-level semantic injection detection, trust scoring, and rewriting. Algorithm 1 formalizes the process.

Text and visual provenance information are combined into one ledger which records:

โ€ข modality (text|image),

โ€ข trust score,

โ€ข span or patch index,

โ€ข influence relationships across agent hops. This ledger helps a trust-aware attention mask used prior to inference of LLM.

A second sanitization step is applied to the LLM in order to identify:

โ€ข agent-induced injections,

โ€ข tool-returned unsafe content,

โ€ข escalation of trust via cross-agent contamination. Algorithm 3 describes the LLM-facing sanitizer.

Multimodal input is sanitized and a masking vector is a trust wise adaptive is used on the LLM. The model carries out reasoning, generation and multimodal fusion subject to limited influence of untrusted content. The provenance ledger is used to write back the attribution scores.

Before agentic execution resumes, the output is validated by Agent B.

In order to implement end-to-end sanitization, a number of elements of LangChain or GraphChain should be extended. end if 5: end for 6: enforce system policies (role separation, no override intent) 7: apply trust-aware attention mask M 8: return sanitized prompt X โ€ฒ

The resulting system provides: and provenance tracing without having to change the underlying LLM back-end.

Each of the agents is deployed as a Python service with a common configuration layer. The Text Sanitizer Agent (At), takes the form of a LangChain Runnable which is actually a standard RoBERTa-based pattern detector and a small pattern match based jailbreak detector rule engine. The Visual Sanitizer Agent (Av) is a combination of OCR module (PaddleOCR) and EXIF reader and CLIP encoder to calculate patch embeddings and anomaly scores. Both agents reveal a basic interface, which gives sanitized content and a structured provenance map.

The Main Multimodal Task Model (Agent M) works with either GPT-4o-mini via OpenAI API or an opensource VLM (LLaVA/BLIP-2) via Hugging Face Trans-formers. Output Validator Agent (B): is a light-weight chain consisting of a policy rule set, a secret-matching module, and a secondary LLM call when dealing with borderline cases, which are all applied over the provenance ledger.

The pipeline of the defense system is integrated into the LangChain/GraphChain, to the three extension points. Firstly, there is a custom input interceptor wrap around the original agent entry point. Messages incoming, tool output, and cross agent messages are routed through At and Av and then converted to LangChain BaseMessage objects. The interceptor attaches provenance attributes (source, trust, modality, provenance_id) to message metadata, so that later components can access trust information.

Second, the timely assembly phase is shunned due to wrapping either the typical PromptTemplate or graph node with a pre-LLM sanitization layer. Before submitting any prompts to the LLM, the wrapper performs trust-aware masking and span-level filtering by consulting the specified provenance ledger. This wrapper can be enabled node-by-node or throughout the agent graph, and it is transparent to the rest of the agent graph.

Third, the LLM output is enclosed by an output router which calls an Output Validator Agent B. Only answers meeting policy checks and trust constraints are sent to the agent graph otherwise the router asks to be regenerated with narrower masks or sends an error state which the agent can explicitly process. Tool and MCP executors also look up the validator when an LLM response suggests an out-call, in order to prevent unsafe tool calls being executed.

The provenance ledger is an abstraction that is applied as a key-value store. The prototype has an in-memory Redis instance which provides an identity to each interaction with a session identifier and records token-and patch-level records. Every entry stores the original type of source, trustworthiness, modality and a small hash of the content in order to prevent retaining sensitive information verbatim. In inference, Agent M marks the ledger with the attribution information based on the attention patterns or gradient-based saliency scores which in turn allows the validator to compute the trust leakages.

The agents are coordinated by a small orchestration layer based on asynchronous Python. Each user request corresponds to a temporary context that is created by the system and undergoes the global sanitization layer, LangChain/GraphChain agent, pre-LLM sanitizer, the LLM, and the output validator. It is described by the sequence diagram in Fig 2 . The identical orchestration model is applied to multi-agent graphs: each edge of the graph is virtually guarded by sanitization before an LLM call, and validation after it, and provenance metadata flows and messages. The design makes the overhead of the runtime moderate and enables the defense pipeline to be incrementally integrated into the current LangChain applications.

In this section, the proposed Cross-Agent Multimodal Provenance-Aware Framework is compared to four baseline defenses, such as keyword filtering, safety finetuning, post-hoc output filtering, and a single model vision-language baseline (Single VLM). It is assessed using three measures (i) the accuracy of multimodal prompt injection detection, (ii) the leakage of crossmodal trust, and (iii) the ability to retain the accuracy of the task on benign inputs. The comparative performance of all these metrics is summarized in Figure 3.

The proposed framework has a detection rate of 94%, which is better than the keyword filtering (52%), posthoc output filtering (61%), and safety fine-tuning (66%). This is mainly enhanced by the layered sanitization modules( At, A v ) and the pre LLM trust-aware masking which prevents the unsafe spans and patches across modalities in a systematic manner. The system ensures consistency by authenticating the information coming in and going out unlike single-phase defenses, which do not.

Trust leakage is a measure of the extent to which low-trust tokens or visual patches have an impact on the final model outputs. The proposed model reduces leakage from 0.24 to 0.07 (a 70% reduction). This is achieved by:

โ€ข enforcing trust-aware attention masking before LLM inference, โ€ข propagating provenance metadata across LangChain/GraphChain nodes,

โ€ข validating LLM outputs against the end-to-end influence record. This enforcement ensures unwanted or unverified content does not further gain influence in the agentic workflow.

The suggested framework retains its performance of 96% on benign multimodal tasks, which is the same or higher than the single-VLM benchmarks at (94%). The sanitization modules only act when the trust thresholds are breached, and thus the overhead of performance is very low, and there is nothing to block the legitimate input.

The results indicate that the commonly deployed traditional defenses such as keyword filters, tuning-based guardrails, and post-hoc filters are weak to multimodal and agent-based prompt injection threats. In contrast, the proposed system:

โ€ข preserves trust boundaries across agents, โ€ข prevents malicious propagation through LangChain/GraphChain graphs, โ€ข enforces dual-stage sanitization (pre-agent and pre-LLM), โ€ข validates outputs before allowing actions or chain continuation. For real-world agentic AI deployments, this multi-agent, provenance-aware approach offers a more comprehensive protection posture.

This work proposed a multilayer, agentic defense architecture, which cleanses all multimodal inputs and verifies all outputs in LangChain and GraphChain-based systems. The method benefits the fight against timely injection, as text and image sanitization along with trustsensitive masking and cross-agent provenance tracking provide the system with the much-needed defense without adversely affecting the performance of beneficial tasks. The findings prove that the implementation of sanitization and validation of each agent-LLM boundary is the key to developing reliable and secure agentic AI pipelines.

i = PIClassifier(e i ) 5: assign trust score s i = TrustModel(d i )

โ€ฒ 1: for each span x i in X do 2: if P [x i ].trust is low then 3:attenuate or remove x i 4:

โ€ฒ 1: for each span x i in X do 2: if P [x i ].trust is low then 3:

๐Ÿ“ธ Image Gallery

architecture.png sequence_diagram.png

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

โ†‘โ†“
โ†ต
ESC
โŒ˜K Shortcut