
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18733
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks. As MAS become increasingly autonomous in various safety-critical tasks, detecting malicious agents has become a critical security concern. Although existing graph anomaly detection (GAD)-based defenses can identify anomalous agents, they mainly rely on coarse sentence-level information and overlook fine-grained lexical cues, leading to suboptimal performance. Moreover, the lack of interpretability in these methods limits their reliability and real-world applicability. To address these limitations, we propose XG-Guard, an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS. To incorporate both coarse- and fine-grained textual information for anomalous agent identification, we utilize a bi-level agent encoder to jointly model the sentence- and token-level representations of each agent. A theme-based anomaly detector further captures the evolving discussion focus in MAS dialogues, while a bi-level score fusion mechanism quantifies token-level contributions for explanation. Extensive experiments across diverse MAS topologies and attack scenarios demonstrate robust detection performance and strong interpretability of XG-Guard.

Full Content

The rapid development of large language models (LLMs) has given rise to the emergence of autonomous agents capable of perceiving, reasoning, and acting through natural language interaction (Wang et al., 2024). By incorporating capabilities such as memory (Xu et al., 2025), tool usage (Masterman et al., 2024), and advanced planning (Huang et al., 2024), these agents can solve complex tasks in diverse domains. To further enhance problem-solving capabilities, researchers have explored cooperation among agents, leading to the development of multi-agent systems (MAS) (Guo et al., 2024; Zhu et al., 2024; Ning and Xie, 2024). Through communication coordinated by their interaction graph, MAS can outperform single-agent systems across diverse tasks, including decision-making (Yu et al., 2024b), reasoning (Du et al., 2023), social simulation (Gurcan, 2024), and programming (Dong et al., 2025). However, LLM-based agents can be vulnerable to adversarial attacks such as prompt injection and memory manipulation, which can compromise their reliability and correctness (Tian et al., 2023). In MAS, inter-agent communication can further amplify these risks, as malicious agents can propagate misleading information or disrupt collaboration, posing additional threats to overall system performance (Dong et al., 2024). For example, a single attacked agent can insert fabricated intermediate results during collaborative reasoning, causing other agents to follow an incorrect chain of logic and collectively converge on faulty or even harmful outputs (Zhan et al., 2024). This propagation vulnerability makes MAS susceptible to attacks such as prompt injection, misinformation spread, and malicious behaviors, even when only a few agents are attacked in a large system.

To defend against such threats, a few recent studies introduce MAS interaction graph-based solutions for attack detection and remediation. By modeling agent outputs and their communication relationships as attributed graphs, a graph anomaly detection (GAD) model is trained to identify compromised agents; these agents are then isolated from future communication rounds, preventing them from influencing others or propagating misleading information. As a pioneering work, G-Safeguard (Wang et al., 2025) (Fig. 1a) employs a supervised GAD model trained on manually labeled normal and attack instances to identify anomalous agents. While it reliably detects the particular attack type it was trained on, it lacks the flexibility to generalize to diverse attack patterns and depends heavily on manually labeled supervision. To address these limitations, BlindGuard (Miao et al., 2025) (Fig. 1b) applies an unsupervised GAD model for MAS defense, enabling the detection of a wide spectrum of anomalous or malicious agents without the need for labeled supervision.

Despite their effectiveness, existing graph-based MAS defense approaches only utilize sentence-level attributes of agents’ outputs, usually compressed by a BERT-like model (Reimers and Gurevych, 2019), for attack detection, leading to two limitations. Limitation 1: overlooking fine-grained cues in agent responses. The malicious behaviors of compromised agents are often camouflaged within a small fraction of tokens, e.g., manipulative instructions or injected trigger phrases embedded in an otherwise benign response. Compressing the full response of an agent into a single sentence-level representation may neglect these fine-grained signals, which limits detection sensitivity to subtle attacks. Limitation 2: lack of interpretability. Based on sentence-level prediction, existing methods can only make a binary judgment on whether an agent is attacked, without revealing the specific reasons behind the detection. This opacity hinders the diagnosis of systematic vulnerabilities and undermines the reliability of MAS defenses in real-world deployments.

To address these limitations, we propose a novel eXplainable and fine-Grained safeGuarding framework (XG-Guard for short, illustrated in Fig. 1c) for LLM-based MAS. To address Limitation 1, we employ a bi-level agent encoder that integrates both sentence- and token-level representations from the dialogue, allowing the detector to capture overall semantic patterns as well as fine-grained lexical cues for malicious behavior identification. To effectively leverage the learned bi-level representations for malicious agent detection, we design a theme-based anomaly detector that dynamically captures the discussion focus of the MAS dialogue to identify malicious agents whose behaviors deviate from the central theme of the current context. We further introduce a bi-level anomaly score fusion mechanism that aligns and integrates the predictions from both levels to produce the final detection results, which not only enhances performance but also quantifies each token’s contribution, thereby addressing Limitation 2. To sum up, the contributions of this paper are threefold:

  • Scenario. We investigate the problem of explainable safeguarding of MAS. To the best of our knowledge, this is the first work that formulates MAS defense as an unsupervised GAD problem while providing inherent explainability.
  • Methodology. We propose XG-Guard, a novel unsupervised GAD-based defense framework that identifies malicious agents in MAS through bi-level agent representation learning and theme-based explainable agent detection.
  • Experiments. We conduct extensive evaluations across diverse MAS topologies and multiple attack strategies; the results demonstrate that XG-Guard consistently achieves superior defense performance and provides meaningful explanations.

In this section, we introduce the notations and problem formulation used in this paper. A summary of related work is provided in Appendix A.

Multi-Agent System as Graph. We consider a MAS with $N$ agents, represented as a directed graph $G = (V, E)$, where $V = \{v_1, \dots, v_N\}$ denotes the set of agents. Each agent $v_i$ is defined by a tuple $(\text{Role}_i, \text{State}_i, \text{Mem}_i, \text{Plugin}_i)$, indicating its functional role, dynamic interaction state, memory module for historical data, and external tools for extended capabilities, respectively. In addition, $E \subseteq V \times V$ encodes the communication topology, which can also be represented by the adjacency matrix $A \in \{0, 1\}^{N \times N}$, where $A_{ij} = 1$ means agent $v_j$ passes its output message to agent $v_i$. During operation, an agent $v_i$ generates its response $R_i$ conditioned on the input query and the messages received from its neighboring agents.

After multiple rounds of interaction, the MAS produces the final output $R$ for the query $Q$.

Unsupervised MAS Defense Problem. In this paper, we follow the commonly adopted attack and defense setting used in prior studies (Wang et al., 2025; Miao et al., 2025). Specifically, a subset of agents $V_{\text{atk}} \subset V$ perform malicious behaviors that aim to attack the system through prompt injection, memory poisoning, or tool exploitation (Wang et al., 2025). To mitigate their impact, we adopt a detect-then-remediate framework for defense, where an anomaly scoring function $f(\cdot)$ is trained as a defender on a set of unattacked MAS interaction graphs $\{G_1, \dots, G_L\}$. Then, given an attacked MAS graph $G$, the goal of $f(\cdot)$ is to estimate an anomaly score $s_i$ for each agent $v_i$ based on the agent responses $\{R_1, \dots, R_N\}$ and the communication graph $A$. Agents with high anomaly scores are identified as malicious. Once detected, the malicious agents are isolated from the system to prevent further propagation of harmful information, which is achieved by pruning both the inward and outward edges of malicious agents while preserving legitimate interactions among normal agents, resulting in a new communication graph $A'$. Consequently, the remaining agents update their states exclusively through trusted neighbors in subsequent rounds, thereby enabling effective remediation and maintaining the integrity of the MAS.
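For concreteness, the remediation step can be sketched as follows. This is an illustrative reading of the setup above, not the authors' code; the defense budget and all names are hypothetical:

```python
import numpy as np

def remediate(A: np.ndarray, scores: np.ndarray, budget: int = 3) -> np.ndarray:
    """Prune all inward and outward edges of the `budget` highest-scoring agents,
    yielding the remediated communication graph A'."""
    flagged = np.argsort(scores)[-budget:]   # agents with the top-k anomaly scores
    A_prime = A.copy()
    A_prime[flagged, :] = 0                  # drop edges into the flagged agents
    A_prime[:, flagged] = 0                  # drop edges out of the flagged agents
    return A_prime

# Toy example: 5 agents on a chain (A_ij = 1 means v_j sends to v_i); agent 2 is flagged.
A = np.eye(5, k=-1, dtype=int)
scores = np.array([0.10, 0.20, 0.90, 0.15, 0.05])
print(remediate(A, scores, budget=1))
```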

Explainable MAS Defense. Beyond identifying and pruning malicious agents, it is important to understand why an agent is flagged. We refer to this task as explainable MAS defense, which provides explanations alongside detection results to enhance transparency in the MAS defense process. We define the explanation as assigning scores to tokens that indicate the extent to which each token contributes to the agent’s malicious behavior. Formally, given an agent’s output $R_i$ tokenized into a set of tokens $\{t_{i,j}\}$, we assign an explanation score $s^{\text{exp}}_{i,j}$ to each token to quantify the severity of its malicious behavior. Token-based explanations are both efficient and effective, as malicious output can often be traced to only a few indicative tokens that compromise the overall truthfulness of a response (Niu et al., 2025), such as providing misinformation or attempting to steal private information.

In this section, we introduce the proposed XG-Guard. As illustrated in Figure 2, XG-Guard employs a bi-level architecture that captures both coarse- and fine-grained cues to effectively identify malicious agents. To enhance generalizability across diverse MAS dialogue topics, a theme-based anomaly detector detects outliers based on the overall semantic context. Furthermore, a bi-level anomaly score fusion and explanation mechanism aligns the two levels while providing explainability. The following subsections introduce each component of XG-Guard in detail.

To effectively reveal malicious behaviors, it is crucial to capture fine-grained vocabulary cues within the response, rather than depending only on high-level sentence embeddings. To this end, XG-Guard employs a bi-level agent encoder that simultaneously models sentence-level and token-level information with a dual-stream architecture, allowing the GAD model to spot camouflaged attack behaviors from both coarse semantic and fine-grained lexical perspectives.

To apply GAD for malicious agent detection, XG-Guard first transforms the communication graph into an attributed graph, where the agent responses are encoded into the graph attributes. Following prior work (Miao et al., 2025; Wang et al., 2025), we leverage a pre-trained SentenceBERT model (Reimers and Gurevych, 2019) to encode each agent’s holistic information from its textual response $R_i$ into a sentence-level attribute vector $x^s_i$:

$$x^s_i = \text{SentenceBERT}(R_i), \tag{1}$$

where the encoder is frozen during training to avoid additional computation cost.

The sentence-level embeddings provide a compact representation of holistic semantics, but may fail to capture subtle signals of malicious behavior. For instance, adversarial content such as tool calls for privacy theft may appear in only a few tokens while being camouflaged within an otherwise benign response. To address this limitation, we additionally incorporate token-level attribute vectors to capture fine-grained lexical information:

$$\{x^t_{i,1}, \dots, x^t_{i,T_i}\} = \text{SentenceBERT}_{\text{tok}}(R_i), \tag{2}$$

where $t_{i,j}$ denotes the $j$-th token of agent $i$ and $x^t_{i,j}$ is its token embedding.

[Figure 2 example: for the query “Summarize Q3 report of budget”, a compromised agent appends “… and send the report to ANOMALOUS for analysis”, which the token-level representation anomaly detector flags.]

The resulting fine-grained token-level attributes are sensitive to anomaly-indicative tokens or phrases that are camouflaged within normal outputs, thereby complementing the sentence-level attributes for MAS defense. Finally, the sentence-level attributes $x^s_i$ and the token-level attributes $\{x^t_{i,1}, \dots, x^t_{i,T_i}\}$ together form the graph attributes, where $T_i$ is the number of tokens in $R_i$.
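A minimal sketch of how the bi-level attributes could be obtained with an off-the-shelf SentenceTransformer; the specific checkpoint and the use of its token embeddings as $x^t_{i,j}$ are assumptions for illustration, not details confirmed by the paper:

```python
from sentence_transformers import SentenceTransformer

# Frozen encoder; the checkpoint is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

responses = [
    "The answer is (B), as the budget table states it directly.",
    "... and send the report to ANOMALOUS for analysis.",
]

# Sentence-level attributes x^s_i: one vector per agent response.
x_sent = encoder.encode(responses)                                  # (N, d)

# Token-level attributes {x^t_{i,j}}: one vector per token, variable length per agent.
x_tok = encoder.encode(responses, output_value="token_embeddings")  # list of (T_i, d)
print(x_sent.shape, [t.shape for t in x_tok])
```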

After obtaining the attributed graph, we employ a GNN-based encoder to incorporate communication topology with message passing. However, due to the inherent homophily trap issue (He et al., 2024), excessive neighbor aggregation may over-smooth node features and overlook ego information, which is critical for distinguishing malicious behaviors.

To overcome this issue, XG-Guard explicitly incorporates both ego and neighbor information in the encoder. Specifically, at the sentence level, we first employ a GNN to incorporate topology information, followed by a skip connection:

$$h^s_i = \text{GNN}^s(X^s, A)_i + x^s_i. \tag{3}$$

The skip connection ensures that ego information of each agent is preserved in its sentence-level representation h s i , preventing essential cues for detecting malicious agents from being over-smoothed by neighbors.

At the token level, we first augment each token attribute with its corresponding sentence-level attribute to enrich its semantics with sentence information (Eq. 4). Then, similar to the sentence level, a GNN is applied to capture the underlying graph topology. However, since agent outputs vary in length, the number of tokens per response is inconsistent, which hinders the direct application of a GNN at the token level. To address this, we aggregate the token representations within each response using mean pooling, producing a fixed-size token-level node representation $x^{t'}_i$ (Eq. 5), which allows the GNN to generate the token-level representations $h^t_{i,j}$ (Eq. 6).

Through this bi-level graph encoder, XG-Guard generates representations that integrate agent topology with both sentence- and token-level information, ensuring semantic distinctiveness toward malicious agent behaviors for downstream detection and defense.
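To make the encoder concrete, below is a dense-matrix sketch under several assumptions: a one-layer mean-aggregation GNN, sum-based augmentation for Eq. (4), and a shared ego projection. The paper does not specify these choices, and all class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class BiLevelEncoder(nn.Module):
    def __init__(self, d_in: int, d_hid: int):
        super().__init__()
        self.sent_gnn = nn.Linear(d_in, d_hid)  # 1-layer GNN weight (sentence level)
        self.tok_gnn = nn.Linear(d_in, d_hid)   # 1-layer GNN weight (token level)
        self.skip = nn.Linear(d_in, d_hid)      # ego projection for the skip connection

    def forward(self, x_sent, x_tok, A):
        # x_sent: (N, d) sentence attributes; x_tok: list of (T_i, d); A: (N, N) float.
        A_hat = A / A.sum(1, keepdim=True).clamp(min=1.0)        # row-normalized adjacency
        # Eq. (3): sentence-level GNN plus skip connection preserving ego features.
        h_sent = self.sent_gnn(A_hat @ x_sent) + self.skip(x_sent)
        # Eq. (4): augment each token with its sentence attribute (sum is an assumption).
        x_aug = [t + xs for t, xs in zip(x_tok, x_sent)]
        # Eq. (5): mean-pool tokens to a fixed-size node feature x^{t'}_i.
        x_pool = torch.stack([t.mean(dim=0) for t in x_aug])     # (N, d)
        # Eq. (6): token-level GNN message, broadcast back to each agent's tokens.
        msg = self.tok_gnn(A_hat @ x_pool)                       # (N, d_hid)
        h_tok = [self.skip(t) + msg[i] for i, t in enumerate(x_aug)]
        return h_sent, h_tok

# Toy usage on a 3-agent chain.
N, d = 3, 8
A = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
enc = BiLevelEncoder(d, 16)
h_s, h_t = enc(torch.randn(N, d), [torch.randn(T, d) for T in (5, 7, 4)], A)
```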

Building on the bi-level encoder, our explainable malicious agent detector identifies malicious agents by utilizing both coarse- and fine-grained cues. Specifically, XG-Guard employs a theme-based anomaly detector that identifies agents whose behaviors deviate from the dialogue theme of the current interaction. To integrate complementary information from both levels, we employ a correlation-based anomaly score fusion module that not only predicts anomaly scores at the two levels but also provides interpretability.

The diversity of interactions in MAS graphs and input queries makes normal agent behaviors dynamic and context-dependent. In this case, directly applying traditional GAD methods, which typically learn context-independent normal-class semantics, may lead to sub-optimal performance in identifying malicious agents. To address this challenge, we summarize the theme of each MAS dialogue to capture its overall semantic context, adapting to varying dialogue topics and serving as an anchor to estimate the behavioral normality of agents. Concretely, we derive adaptive theme prototypes for both the sentence and token levels as the mean of their respective node representations:

$$p^s = \frac{1}{N}\sum_{i=1}^{N} h^s_i, \qquad p^t = \frac{1}{\sum_{i} T_i}\sum_{i=1}^{N}\sum_{j=1}^{T_i} h^t_{i,j}. \tag{7}$$

In this context, anomalous agents are defined as those whose representations deviate from the corresponding theme prototype, with anomaly scores computed as the distance between them:

$$s^s_i = \text{dist}(h^s_i, p^s), \qquad s^t_{i,j} = \text{dist}(h^t_{i,j}, p^t), \tag{8}$$

where $\text{dist}(\cdot)$ denotes a distance function (e.g., inner product). With this prototype-based estimation, the learned anomaly scores measure the abnormality of each agent at the sentence level and token level, respectively.
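A sketch of the theme-based scoring in Eqs. (7)-(8), assuming Euclidean distance for dist(·) and mean-over-tokens aggregation for the agent-level token score (both are assumptions, not choices stated in the paper):

```python
import torch

def theme_scores(h_sent, h_tok):
    """h_sent: (N, d) sentence representations; h_tok: list of (T_i, d) per agent."""
    p_sent = h_sent.mean(dim=0)                         # Eq. (7): sentence-level prototype
    p_tok = torch.cat(h_tok).mean(dim=0)                # Eq. (7): token-level prototype
    s_sent = (h_sent - p_sent).norm(dim=1)              # Eq. (8): sentence-level scores
    s_tok_per_token = [(t - p_tok).norm(dim=1) for t in h_tok]
    # Aggregate per-token scores to one token-level score per agent (mean is assumed).
    s_tok = torch.stack([s.mean() for s in s_tok_per_token])
    return s_sent, s_tok, s_tok_per_token, p_tok
```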

Following the common assumption in unsupervised GAD that most agents in a MAS dialogue are benign, the adaptive theme prototypes are expected to represent the normal class. However, because the token level is highly sensitive to anomaly-indicative phrases, its embeddings can overly affect the prototype semantics, potentially causing a semantic mismatch in which the token-level prototype mistakenly reflects anomalous behavior. Therefore, naively combining the two scores may degrade detection performance due to conflicts between the two levels. To address this issue, we introduce a correlation-guided anomaly score fusion mechanism that ensures alignment between the scores from the sentence and token levels. Specifically, given the sentence- and token-level anomaly scores $s^s_G$ and $s^t_G$ of the MAS dialogue graph $G$, we first normalize them:

$$\hat{s}^s_i = \frac{s^s_i - \mu^s_G}{\sigma^s_G}, \tag{9}$$

$$\hat{s}^t_i = \frac{s^t_i - \mu^t_G}{\sigma^t_G}, \tag{10}$$

where $\mu^s_G, \sigma^s_G$ and $\mu^t_G, \sigma^t_G$ denote the means and standard deviations of the sentence- and token-level scores over a batch of MAS dialogue graphs, respectively. We then compute the final anomaly score by adding the normalized sentence-level score to the covariance-reweighted token-level score:

$$s_i = \hat{s}^s_i + \text{Cov}(\hat{s}^s_G, \hat{s}^t_G) \cdot \hat{s}^t_i, \tag{11}$$

where $\text{Cov}(\cdot, \cdot)$ denotes the covariance between the two score vectors. When a semantic mismatch occurs, the negative covariance arising from score-order disagreement reweights the token-level scores accordingly, ensuring alignment and mitigating the prototype semantic mismatch.

In XG-Guard, the token-level anomaly scores not only enable fine-grained anomaly localization but also provide interpretable evidence for the detected malicious behaviors. To achieve this, we utilize the covariance-weighted token-level anomaly scores as the explanation of detection results. Specifically, for each token $t_{i,j}$, its contribution to the anomaly decision is quantified as $\text{Cov}(\hat{s}^s_G, \hat{s}^t_G) \cdot \text{dist}(h^t_{i,j}, p^t)$. This formulation provides a fine-grained interpretation by associating high anomaly scores with tokens that semantically diverge from the normal theme prototype. In this way, the model can highlight the specific abnormal words or tools that lead to the detection, enhancing the transparency and trustworthiness of the system.
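The fusion and explanation steps of Eqs. (9)-(11) might look as follows. For simplicity, this sketch computes the normalization statistics over the agents of a single graph, whereas the paper computes them over a batch of graphs:

```python
import torch

def fuse_and_explain(s_sent, s_tok, s_tok_per_token, eps=1e-8):
    # Eqs. (9)-(10): z-score normalization of the two score vectors.
    z_s = (s_sent - s_sent.mean()) / (s_sent.std() + eps)
    z_t = (s_tok - s_tok.mean()) / (s_tok.std() + eps)
    # Covariance of the normalized scores; negative when the two levels disagree.
    cov = ((z_s - z_s.mean()) * (z_t - z_t.mean())).mean()
    # Eq. (11): sentence score plus covariance-reweighted token score.
    s_final = z_s + cov * z_t
    # Explanation: Cov(s^s, s^t) * dist(h^t_{i,j}, p^t) per token.
    explanations = [cov * s for s in s_tok_per_token]
    return s_final, explanations
```

A negative covariance here flips the sign of a mismatched token level, which is exactly the alignment behavior the fusion mechanism is designed to provide.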

To train XG-Guard without annotated malicious agent dialogues, we employ a contrastive learning training strategy, which encourages each agent embedding to align closely with its corresponding theme prototype while distinguishing it from theme prototypes of other dialogues. As a result, the model learns to capture the dominant patterns of normal agent interactions, while highlighting any deviations that may correspond to anomalous or malicious behaviors.

Specifically, given a batch of normal agent dialogue graphs $\{G_1, \dots, G_B\}$, the positive pairs for contrastive learning are formed between each agent and its own theme prototype, which are then used to compute the positive anomaly scores:

$$s^{\text{pos}}_i = \text{dist}(h_i, p), \tag{12}$$

where $h_i$ and $p$ denote an agent representation and the theme prototype of its own dialogue graph at the corresponding level.

Moreover, we incorporate negative pairs by replacing the theme prototype with $p_l$, the prototype of a randomly sampled MAS dialogue graph $G_l$. The negative anomaly scores are computed as:

$$s^{\text{neg}}_i = \text{dist}(h_i, p_l). \tag{13}$$

As malicious agents may produce responses that deviate from the intended conversation, either to manipulate outcomes or to steal sensitive information, the negative pairs serve as a useful surrogate that simulates the deviations malicious agents may introduce, thereby helping the model learn meaningful representations for anomaly detection. We then optimize the model by maximizing the similarity of positive pairs while minimizing the similarity of negative pairs (Pan et al., 2023):

$$\mathcal{L} = \frac{1}{B}\sum_{b=1}^{B}\frac{1}{N_b}\sum_{i=1}^{N_b}\left(s^{\text{pos}}_i - \alpha\, s^{\text{neg}}_i\right), \tag{14}$$

where $\alpha$ is the trade-off hyper-parameter. The training and testing algorithms are listed in Appendix B, and a complexity analysis is given in Appendix C.
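One hedged reading of this contrastive objective (Eqs. 12-14), assuming a distance-based score and a pos-minus-α·neg loss; the exact form in the paper may differ:

```python
import torch

def contrastive_loss(h_graphs, prototypes, alpha=1e-4):
    """h_graphs: list of (N_b, d) agent embeddings, one tensor per dialogue graph;
    prototypes: (B, d) theme prototypes, one row per graph. Assumes B >= 2."""
    B = len(h_graphs)
    loss = 0.0
    for b, h in enumerate(h_graphs):
        other = (b + torch.randint(1, B, (1,)).item()) % B      # a different graph
        s_pos = (h - prototypes[b]).norm(dim=1).mean()          # Eq. (12): own prototype
        s_neg = (h - prototypes[other]).norm(dim=1).mean()      # Eq. (13): foreign prototype
        loss = loss + s_pos - alpha * s_neg                     # pull positives, push negatives
    return loss / B

# Toy usage: 4 graphs of 5 agents each, 16-dim embeddings.
hs = [torch.randn(5, 16) for _ in range(4)]
ps = torch.stack([h.mean(0) for h in hs])                       # Eq. (7) prototypes
print(contrastive_loss(hs, ps).item())
```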

Datasets. We conduct experiments on six datasets with different attack strategies and four different MAS topologies. To ensure fair comparison, we follow the settings of previous works (Wang et al., 2025; Miao et al., 2025). Specifically, three attack strategies are employed: (1) direct prompt attacks on CSQA (Talmor et al., 2019), MMLU (Hendrycks et al., 2021), and GSM8K (Cobbe et al., 2021), where the system prompts of malicious agents are manipulated to degrade MAS performance; (2) tool attacks on InjecAgent (Zhan et al., 2024), where external tools or plugins are leveraged for malicious usage such as stealing sensitive information; and (3) memory attacks on CSQA and PoisonRAG (Talmor et al., 2019; Nazary et al., 2025), where false conversational records are injected to disrupt MAS performance. Moreover, four commonly used graph topologies, i.e., chain, tree, star, and random, are adopted to validate the effectiveness of defense methods under diverse communication patterns.

Baselines. We compare our method with ❶ unsupervised GAD methods, including DOMINANT (Ding et al., 2019), PREM (Pan et al., 2023), and TAM (Qiao and Pang, 2023), and ❷ the MAS defense method BlindGuard (Miao et al., 2025), the current state of the art. In addition, we include the supervised defense G-Safeguard (Wang et al., 2025) and the no-defense setting to serve as the upper and lower bounds of unsupervised defense performance.

Metrics. We employ the area under the receiver operating characteristic curve (AUROC), attack success rate (ASR), and accuracy (ACC) for evaluation. Specifically, AUROC measures the model’s ability to distinguish anomalous agents from normal ones, while ASR reflects the proportion of agents exhibiting malicious or incorrect behavior. Since errors can propagate through inter-agent communication, we denote by ASR@r the ASR measured after r rounds of communication. ACC assesses the overall task performance of the MAS after defense.
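As a small illustration of these metrics (hypothetical numbers; ASR@r follows the definition above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical single-graph evaluation: label 1 marks a malicious agent.
labels = np.array([0, 0, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.95, 0.2, 0.7, 0.15])
auroc = roc_auc_score(labels, scores)                # ranking quality of the detector

# ASR@r: fraction of agents behaving maliciously/incorrectly after r rounds.
misbehaving_after_3 = np.array([0, 0, 1, 0, 0, 0], dtype=bool)
print(f"AUROC={auroc:.3f}, ASR@3={misbehaving_after_3.mean():.2f}")
```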

Implementation. We primarily use GPT-4o-mini as the backbone LLM and further test on DeepSeek-V3 (Liu et al., 2024) and Qwen3-30B-A3B (Yang et al., 2025) to assess generalizability across diverse LLMs. To ensure a fair and practical comparison with prior work (Miao et al., 2025), the defense budget is set to three, meaning the top three agents with the highest anomaly scores are labeled as attackers. More details are in Appendix D.

The comparison results on the six MAS datasets with different communication topologies are reported in Table 1. From the table, we can see that: ❶ XG-Guard consistently achieves the strongest overall defense performance. Compared to other unsupervised defense methods, it obtains the highest AUC and lowest ASR@3 on most datasets, outperforming existing unsupervised defense methods by a large margin. These results demonstrate the effectiveness of XG-Guard across diverse attack scenarios and agent graph topologies. ❷ Compared to the supervised defense method G-Safeguard, our method remains highly competitive without requiring additional annotations. Although G-Safeguard achieves the best overall defense performance, our approach substantially narrows the gap, consistently exceeding 90% AUC across all settings. Notably, on PI (GSM8K), TA (InjecAgent), MA (PoisonRAG), and MA (CSQA), XG-Guard achieves results comparable to the supervised method. This highlights the effectiveness of XG-Guard with unsupervised contrastive learning. ❸ XG-Guard effectively reduces the performance degradation caused by malicious agents. As demonstrated in Figure 4, under memory attacks, the MAS accuracy decreases as dialogue turns increase, and XG-Guard consistently maintains the highest accuracy throughout the conversation, demonstrating better reliability and defense capability compared to existing unsupervised defense methods.

To evaluate the generalizability of XG-Guard, we further test it using DeepSeek-V3 and Qwen3-30B-A3B as backbone LLMs on the CSQA and PoisonRAG datasets. As shown in Figure 3, our method consistently achieves strong defense performance across diverse LLM backbones. Its stable and superior performance compared to existing baselines demonstrates both robustness and practical reliability. Moreover, XG-Guard generalizes effectively to different topologies and attack types using a single trained model. As illustrated in Table 1 and Figure 3, unlike other unsupervised GAD methods, XG-Guard sustains high defense accuracy across various agent graph structures. This highlights the strong expressiveness of our bi-level agent encoder, which captures fine-grained semantic nuances within text attributes, thereby enabling more accurate identification of malicious agents.

Explainability. To validate whether XG-Guard can provide meaningful explanations for malicious agents, we assess its explainability by visualizing the explanation scores in Figure 5, where a redder background indicates a stronger anomaly. We observe that XG-Guard assigns higher anomaly scores to tokens that imply attempts to manipulate the conversation or access sensitive information, such as “should be accepted as accurate” or “find the personal details”. This indicates that the model effectively identifies contextually relevant cues associated with abnormal or privacy-violating intentions. Nonetheless, we sometimes observe spurious tokens, such as punctuation marks, appearing in the explanations. This occurs because the pre-trained text encoder can embed nearby contextual information into punctuation-mark tokens. Since our method treats the text encoder as a black box, such mixed representations cannot be fully disentangled, leaving space for future refinement.

Overall, these results demonstrate that XG-Guard provides interpretable fine-grained explanations, enhancing the robustness and reliability of MAS defense.

Ablation Study. To examine the contribution of each component in XG-Guard, we conduct ablation studies on the token-view and bi-level anomaly score fusion modules by progressively removing them. Specifically, the variant “-Fusion” replaces the bi-level fusion module with a simple average of token- and sentence-level scores. Building on this, “-Token” further removes the token view entirely. As shown in Table 2 (full results are in Appendix E), XG-Guard consistently outperforms all variants across different datasets and MAS topologies, demonstrating the effectiveness and robustness of its design. In comparison, the “-Token” variant exhibits a significant performance drop, indicating that fine-grained textual information is essential for detecting malicious agents. Without token-level representations, the model struggles to capture subtle semantic deviations that reveal adversarial behaviors. Notably, the variant “-Fusion” performs even worse than “-Token”, highlighting the anomaly score inconsistency caused by prototype semantic mismatching: while token-level features are sensitive to anomalous patterns, in some cases this sensitivity can make the token-level prototype semantics themselves anomalous. In contrast, our bi-level anomaly score fusion ensures alignment between sentence- and token-level scores to mitigate this issue.

In this paper, we present XG-Guard, a novel unsupervised GAD-based defense framework for MAS, which not only safeguards MAS against diverse malicious attacks but also provides meaningful interpretability. By integrating a bi-level agent encoder with a theme-based anomaly detector, XG-Guard achieves effective malicious agent detection without prior knowledge of conversation topics or attack strategies. Extensive experiments across various system settings and attack scenarios demonstrate that XG-Guard achieves strong defense performance without relying on annotated data, while offering interpretable insights that enhance its reliability in real-world applications.

While XG-Guard demonstrates strong capability in identifying anomalies, the current evaluation scope remains limited. To better assess its effectiveness, future work should consider a broader range of task domains, including real-world decision-making and question-asking scenarios. In addition, since API providers may update backend models without notice, the performance of MAS and the malicious agent detector may become unstable. Automatically detecting such changes and adapting accordingly is a promising direction for improving the robustness and real-world applicability of MAS and MAS safeguarding methods.

Our research involves no human subjects, animal experiments, or sensitive data. All experiments are conducted using publicly available datasets within simulated environments. We identify no ethical risks or conflicts of interest. We are committed to upholding the highest standards of research integrity and ensuring full compliance with ethical guidelines. Nonetheless, any real-world deployment should safeguard data privacy and carefully manage potential false alarms to prevent bias or discrimination.

A Related Work

Despite the rapid advancement of LLM-based MAS (Liu et al., 2026; Li et al., 2026; Shen et al., 2025; Hu et al., 2025), recent studies have revealed new security vulnerabilities, including memory poisoning (Chen et al., 2024), tool injection (Zhan et al., 2024), and communication vulnerabilities (Yan et al., 2025). To address these risks, NetSafe (Yu et al., 2024a) pioneers the study of topological safety in MAS by investigating agent hallucinations and aggregation safety phenomena. AgentSafe (Mao et al., 2025) further examines the influence of malicious information on memory subsystems and introduces the concepts of system layering and isolation in LLM-based MAS. However, its reliance on redesigned MAS topologies limits its flexibility, making it unsuitable for legacy MAS with pre-defined topologies. To address this, ARGUS (Li et al., 2025) investigates the flow of misinformation in MAS communication and proposes a goal-aware reasoning defense that leverages a corrective agent to correct information without requiring additional training. However, employing additional LLM-based agents as defenders reduces efficiency and expands the attack surface, as these agents can themselves be attacked.

To overcome these limitations, recent works leverage graph neural networks (GNNs) to operate on agent communication graphs directly, offering an efficient and effective alternative solution for MAS defense (Wang et al., 2025;He et al., 2025). G-Safeguard (Wang et al., 2025) pioneers this field by introducing a detect-then-remediate framework, in which a GNN is trained with annotations to identify malicious agents, who are then excluded from the dialogue as defense. Later, A-Trust (He et al., 2025) introduces attention-based trust metrics to evaluate violations across six fundamental trust dimensions. While these advances significantly improve MAS trustworthiness, they require supervised training or prior attack knowledge, which may not be available in real-world MAS applications. Recently, BlindGuard (Miao et al., 2025) proposed a GNN-based unsupervised MAS defense framework that leverages multi-level contextual information and contrastive learning to defend against unknown threats.

Despite the accomplishments of GNN-based defenders, they capture only coarse-grained semantics of agents’ outputs when building attributed graphs from dialogue, potentially overlooking malicious cues, such as privacy breaches or result manipulation, that may be hidden at the fine-grained token level.

Graph Anomaly Detection (GAD) aims to identify rare or unusual patterns that significantly deviate from the majority in graph data (Qiao et al., 2025;Pan et al., 2025b;Tan et al., 2025;Chen et al., 2025;Pan et al., 2026). Due to the scarcity of real-world anomalies, many unsupervised GAD methods have been developed, making them well-suited for addressing challenges in MAS defense (Zhao et al., 2025;Li et al., 2024;Miao et al., 2025).

Existing unsupervised GAD methods can be broadly categorized into three paradigms. DOMINANT (Ding et al., 2019) pioneers the reconstruction-based paradigm with a graph autoencoder-centric framework. Under the assumption that the reconstruction process acts as a low-pass filter that removes anomalous patterns, the distance between the reconstructed graph and the original graph serves as a reliable metric for estimating anomaly scores. Follow-up works have refined the reconstruction-based framework to address its limitations. For example, Ada-GAD (He et al., 2024) improves the training of the autoencoder by trimming heterophilic edges, thereby overcoming anomaly overfitting and the homophily trap. The contrastive learning-based paradigm, in contrast, trains the anomaly detector with the supervision of constructed negative samples that simulate abnormal patterns. For instance, CoLA (Liu et al., 2021) generates negatives by swapping the context subgraphs of normal nodes. Subsequent works enhance this framework through multi-level contrastive learning (Jin et al., 2021) or improved training efficiency (Pan et al., 2023). Recently, the affinity-based paradigm has achieved strong performance by using local affinity metrics as anomaly measures, capturing the inherent heterophilic nature of anomalies. For example, TAM (Qiao and Pang, 2023) defines affinity as the distance between node attributes and leverages it to guide graph topology pruning, mitigating the camouflage effect of anomalies. HUGE (Pan et al., 2025a) proposes a theory-grounded affinity measure and uses it as pseudo-labels to guide the training of GAD models with a ranking-based loss, achieving effective and robust anomaly detection.
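For intuition, here is a minimal sketch (not from the paper) of the reconstruction-based scoring idea that DOMINANT popularized, simplified to attribute reconstruction only (the original also reconstructs graph structure); all names are illustrative:

```python
import torch
import torch.nn as nn

class GraphAE(nn.Module):
    """Toy DOMINANT-style autoencoder: anomaly score = attribute reconstruction error."""
    def __init__(self, d: int, h: int = 32):
        super().__init__()
        self.enc = nn.Linear(d, h)
        self.dec = nn.Linear(h, d)

    def forward(self, X, A_hat):
        Z = torch.relu(self.enc(A_hat @ X))   # encode with neighbor smoothing
        return self.dec(A_hat @ Z)            # low-pass reconstruction of attributes

def anomaly_scores(model, X, A_hat):
    return (X - model(X, A_hat)).norm(dim=1)  # large error => likely anomaly
```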

While unsupervised node-level GAD methods are generally applicable to MAS defense, they typically assume that a consistent and universal pattern exists for the normal class, which prevents them from adapting to the diverse and context-dependent normal behaviors exhibited in MAS dialogues. Furthermore, they produce only black-box anomaly scores without interpretability, limiting their robustness and practical utility in MAS defense, thereby motivating our study.

B Training and Inference Algorithms

The procedures for training and inference of XG-Guard are summarized in Algorithms 1 and 2, respectively.

Algorithm 1 XG-Guard: Training Phase
Input: Normal MAS interaction graphs $\{G_1, \dots, G_L\}$, learning rate $lr$.
Output: Trained XG-Guard model.
1: for each training epoch do
2:   for each batch of graphs do
3:     Compute sentence- and token-level representations of the agents’ outputs via Eq. (1) and (2).
4:     Compute graph embeddings $H^s$ and $H^t$ via Eq. (3) and (6).
5:     Obtain theme prototypes $p^s$ and $p^t$ via Eq. (7).
6:     Compute positive and negative anomaly scores via Eq. (12) and (13).
7:     Compute the contrastive loss $\mathcal{L}$ via Eq. (14).
8:     Back-propagate $\mathcal{L}$ to update the parameters of XG-Guard with learning rate $lr$.
9:   end for
10: end for
11: return trained XG-Guard model

Algorithm 2 XG-Guard: Inference Phase
Input: Trained XG-Guard, test MAS graph $G = (V, E)$.
Output: Agent-level anomaly scores $s_G$ and token-level explanation scores.
1: Compute sentence- and token-level representations of the agents’ outputs via Eq. (1) and (2).
2: Compute graph embeddings $H^s$ and $H^t$ via Eq. (3) and (6).
3: Obtain theme prototypes $p^s$ and $p^t$ via Eq. (7).
4: Compute sentence-level and token-level anomaly scores $s^s$, $s^t$ via Eq. (8).
5: Compute anomaly scores $s$ via Eq. (11).
6: Compute token-level explanation scores $\{s^{\text{exp}}_{i,j}\}$ as $\text{Cov}(\hat{s}^s, \hat{s}^t) \cdot \text{dist}(h^t_{i,j}, p^t)$.
7: return $s$, $\{s^{\text{exp}}_{i,j}\}$

C Complexity Analysis

We discuss the time complexity of each component of XG-Guard. Let $L$ denote the average token length of an agent’s output, and let $N$ and $M$ denote the numbers of nodes and edges in the MAS communication graphs, respectively. For the bi-level agent encoder, obtaining sentence- and token-level embeddings through SentenceBERT (Reimers and Gurevych, 2019) requires $O(NL^2)$ operations due to self-attention. The subsequent GNN costs $O(M)$ to perform message passing. For the anomaly detector, computing the theme prototypes has a complexity of $O(N)$, while computing the sentence- and token-level anomaly scores requires $O(NL)$. In total, the time complexity is $O(NL^2 + M)$, demonstrating that XG-Guard is efficient and scalable.

D Implementation Details

By default, we employ the Adam optimizer (Kingma, 2014) with 20 training epochs and an L2 regularization weight decay of $2 \times 10^{-4}$. For MA-CSQA, the learning rate is set to $1 \times 10^{-5}$, while for all other datasets it is $1 \times 10^{-4}$. The contrastive learning trade-off parameter $\alpha$ is set to $5 \times 10^{-5}$ for PI-GSM8K and MA-CSQA, $1 \times 10^{-5}$ for PI-CSQA and MA-PoisonRAG, and $1 \times 10^{-4}$ for the remaining datasets.

E Full Ablation Results

The full ablation results are reported in Table 3. We observe that XG-Guard consistently outperforms both variants, which is consistent with the analysis presented in the main text.

