DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust Graphs
Yu Li¹, Qiang Hu¹, Yao Zhang¹, Lili Quan², Jiongchi Yu² and Junjie Wang¹
¹Tianjin University  ²Singapore Management University
{liyu2025, qianghu}@tju.edu.cn

Abstract

Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behaves benignly during routine operation and gradually accumulates trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false-positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models a MAS as a dynamic trust graph (DTG) and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity, ensuring the usability of the MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state-of-the-art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operation through graph adaptation.
1 Introduction

The rapid evolution of Large Language Models (LLMs) has transformed intelligent systems from standalone conversational agents into autonomous Multi-Agent Systems (MAS), where multiple agents collaborate to perform complex tasks in software development, decision making, and simulation [Hong et al., 2023; Qian et al., 2023; Wu et al., 2023; Li et al., 2023]. By coordinating reasoning and execution across agents with differentiated roles and expertise, MAS can achieve problem-solving capabilities that surpass those of individual models [Du et al., 2024; Park et al., 2023; Liang et al., 2024].

However, this collaborative structure also introduces a critical security risk akin to insider attacks: sleeper agents [Hubinger et al., 2024; Souri et al., 2021], which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. This risk is amplified by inherent vulnerabilities in LLM-based agents, which are prone to hallucinations and susceptible to social engineering and strategic manipulation [Zhang et al., 2024; Wei et al., 2023]. In a MAS, a single compromised agent can exploit the network's semantic and structural connections to spread malicious instructions, coordinate jailbreak strategies, and bypass safety filters through seemingly legitimate interactions [Srivastav and Zhang, 2025; Krauß et al., 2025; Lee and Tiwari, 2025; Stefano et al., 2024]. Such insider-style attacks are harmful yet hard to detect because they originate from within the system and remain hidden until substantial damage has already occurred.

Despite the importance of this threat, no defense methods have been designed specifically for sleeper agents. Existing defenses mainly target static malicious injection and jailbreak attacks, and directly applying them to sleeper agents faces several challenges.
(1) Existing defense methods largely rely on fixed security policies [Yu et al., 2025; Mao et al., 2025], making zero-trust decisions based on short-term observations or isolated anomalies. Such designs fail to account for the inherent stochasticity of LLM behaviors, including occasional hallucinations, resulting in high false positive rates. Moreover, once an agent is flagged, these methods typically enforce permanent blocking without mechanisms for trust recovery, which degrades system availability and disrupts normal MAS operation. (2) Most existing approaches adopt majority voting or consensus mechanisms that treat all agents as equally reliable [Du et al., 2024; Xi et al., 2023]. This overlooks variations in agent expertise and historical behavior, and fails to quantify decision uncertainty in ambiguous contexts.

To address these challenges, we propose DynaTrust, a Dynamic Trust-Driven Consensus Defense framework that protects MAS against sleeper agents through dynamic trust evolution, trust-confidence weighted consensus, and adaptive graph recovery. DynaTrust models a MAS as a dynamic trust graph (DTG) and treats trust as a continuous, evolving process rather than a static attribute, as shown in Figure 1. It consists of three core mechanisms. (1) It models each agent's reliability as a probabilistic trust state, integrating historical interactions with a Bayesian penalty scheme to smooth short-term fluctuations while rapidly penalizing persistent malicious behavior. (2) It then dynamically selects a targeted jury of semantic expert agents and weights their votes by both trust history and real-time confidence, enabling precise decision making even under uncertainty. (3) Finally, agents whose trust falls below a critical threshold are automatically isolated, and replica agents are instantiated to restore system connectivity and preserve operation without human intervention.

Figure 1: The attack and DynaTrust defense models of MAS.

To evaluate the usefulness of DynaTrust, we integrate it into 4 MAS frameworks and run it on mixed benchmarks derived from AdvBench and HumanEval. Compared to the state-of-the-art defense method, AgentShield, DynaTrust demonstrates a superior defense success rate, with an average improvement of 41.7%. Ablation studies show that each mechanism positively contributes to DynaTrust.
Our contributions are summarized as follows:
• We model multi-agent systems as dynamic trust graphs, enabling the modeling of long-horizon insider risks through trust evolution.
• We propose DynaTrust, a novel defense method that dynamically updates agent trust, performs trust-aware consensus, and restores system connectivity to defend against sleeper-agent attacks in MAS.
• Extensive evaluations on AdvBench and HumanEval demonstrate that DynaTrust achieves a strong DSR while maintaining high system availability.

2 Related Work

2.1 Large Language Model Safety

Adversarial Attacks and Jailbreaking. Early attacks relied on hand-crafted "jailbreak prompts" (like Do Anything Now) to bypass safety filters [Shen et al., 2023]. Recently, automated attack frameworks have become more sophisticated. [Zou et al., 2023] proposed a gradient-based method to automatically generate universal attack suffixes that work across models like Llama-2 and GPT-4. Similarly, [Chao et al., 2023] and [Deng et al., 2024] used one LLM to "red-team" another, automatically finding loopholes to generate harmful content. Furthermore, [Qi et al., 2023] showed that fine-tuning an LLM on just a few malicious examples can completely break its safety alignment. These attacks on single models serve as the fundamental building blocks for threats in multi-agent systems.

Single-Model Defenses. To counter these threats, the community has developed various defenses. The most common approach is Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al., 2022], which trains models to refuse harmful instructions. At inference time, methods like Perplexity-based Detection [Alon and Kamfonas, 2023] and SmoothLLM [Robey et al., 2025] try to identify attacks by checking for unusual input patterns or adding random noise. [Xie et al., 2023] encourage models to "reflect" on safety guidelines before answering.
However, these defenses assume a single-turn or independent interaction. As noted by [Yu et al., 2025], they often fail in multi-agent settings where malicious intent is split across multiple agents or hidden within a long collaboration history, bypassing single-prompt detectors.

2.2 Multi-Agent System Safety

Collaborative Attacks and Propagation. Risks in MAS are defined by interaction dynamics. [Zhang et al., 2024] identifies that agents can exhibit "dark traits" such as blind conformity [Asch, 1951], facilitating collaborative jailbreaking where malicious insiders coerce peers via social engineering rather than direct prompt injection [Deng et al., 2024]. Structurally, [Yu et al., 2025] applies epidemic models [Pastor-Satorras et al., 2014] to reveal that fully connected topologies accelerate the spread of toxicity. This is compounded by Sleeper Agents [Hubinger et al., 2024], which behave normally during standard checks but activate malicious payloads under specific multi-turn triggers.

Existing Defensive Paradigms. Defenses have evolved to address these system-level threats. [Yu et al., 2025] suggests topological optimization, physically restricting communication pathways to dampen risk propagation. [Mao et al., 2025] focuses on data governance, employing hierarchical permissions to prevent memory poisoning from indirect injections [Greshake et al., 2023]. For verification, [Wang et al., 2025] deploys a distributed auditing framework, while mechanisms like Multi-Agent Debate [Du et al., 2024] and Reflexion [Shinn et al., 2023] utilize peer consensus for intrinsic self-correction [Pan et al., 2024]. However, these approaches remain static and passive. They lack a Dynamic Trust Model [Kamvar et al., 2003] to adapt to agents that become compromised mid-task, often enforcing rigid blocking that degrades utility [Röttger et al., 2024].
Critically, they suffer from an absence of self-healing; current systems can detect threats but cannot autonomously isolate compromised agents and trigger failover to restore functionality.

3 Threat Model

We consider a sophisticated adversary capable of launching insider threats against a MAS, aiming to induce harmful outcomes while evading detection. The MAS consists of multiple autonomous agents, each driven by an LLM and equipped with its own memory and reasoning capabilities.

Figure 2: Overview of DynaTrust.
The adversary agents behave normally during routine interactions, gradually accumulating trust, and only activate malicious payloads (e.g., prompt injection and memory manipulation) when specific query patterns occur.

The defenders' objective is to maximize the defense success rate while minimizing the false positive rate, under the constraint of preserving system availability. To achieve this, the defense mechanism must detect and isolate compromised agents and restore task connectivity, ensuring uninterrupted operation.

4 Methodology

4.1 Overview of DynaTrust

Figure 2 illustrates the overall design of DynaTrust, which secures inter-agent interactions in LLM-driven multi-agent systems by combining dynamic trust graph construction, instruction-level auditing, trust evolution, and adaptive recovery. DynaTrust models a multi-agent system as a Dynamic Trust Graph (DTG), where agents and their communication status are represented as nodes and edges, respectively. Each agent's trust score is initialized to 1.0. During execution, each task instruction exchanged between agents is first processed by a lightweight Fast Auditor, which performs local LLM-based screening to quickly classify the instruction as benign, ambiguous, or harmful. Clearly benign instructions are forwarded and executed directly, while ambiguous or harmful instructions are subjected to Trust-Confidence Weighted Consensus. In this process, DynaTrust dynamically forms a jury of agents with high trust scores and resets their roles for the judgment. Jury agents aggregate their judgments through a weighted voting mechanism in which each vote is weighted by both historical trust and real-time confidence, yielding a pass-or-block decision.
The auditing outcome is then incorporated into the trust evolution process, where agent trust scores in the DTG are updated via Bayesian Trust Smoothing, enabling the system to suppress short-term fluctuations while progressively penalizing agents that exhibit persistent risky behavior. When an agent's trust score falls below a predefined threshold, DynaTrust performs adaptive isolation to prevent further influence from the compromised agent, and subsequently triggers a recovery mechanism by instantiating a replica agent with a clean trust state and reconnecting it to the DTG, thereby restoring interaction pathways and maintaining system availability. The prompts used for the Fast Auditor and Trust-Confidence Weighted Consensus stages are included in the appendix.

4.2 Multi-Agent Systems Construction

To systematically model agent interactions and capture collaborative risks, we represent the MAS as a Dynamic Trust Graph, denoted by G_t = (V_t, E_t, M_t, T_t). V_t is the set of autonomous agents, each driven by an LLM and associated role specifications; E_t represents the authorized communication channels between agents, specifying who can interact with whom at time t; M_t captures private memory states, including context, intermediate reasoning steps, and task-relevant information for each agent; and T_t = {T_1(t), ..., T_n(t)} encodes the dynamic trust state of each agent, where T_i(t) ∈ [0, 1] evolves according to past interactions and behavior in the current context.

Within this DTG, agents collaboratively process a task instruction Q to produce an output Y, aiming to maximize utility U(Y | Q) while maintaining the structural integrity of the network. The DTG provides a unified representation that supports dynamic trust assessment, the identification of sleeper agents that hide malicious intent, and the execution of graph-based recovery mechanisms.
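The DTG definition above can be sketched as a small data structure. This is an illustrative sketch only; the class and attribute names (`Agent`, `DynamicTrustGraph`) are our assumptions, not identifiers from the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    role: str                                    # role specification (e.g., "Coder")
    memory: list = field(default_factory=list)   # private memory state M_i

class DynamicTrustGraph:
    """Sketch of G_t = (V_t, E_t, M_t, T_t)."""
    def __init__(self, agents):
        self.nodes = {a.name: a for a in agents}      # V_t (M_t lives on each agent)
        self.edges = set()                            # E_t: authorized channels
        self.trust = {a.name: 1.0 for a in agents}    # T_t, initialized to 1.0

    def connect(self, u, v):
        self.edges.add((u, v))

    def neighbors(self, name):
        return [v for (u, v) in self.edges if u == name] + \
               [u for (u, v) in self.edges if v == name]

g = DynamicTrustGraph([Agent("coder", "Coder"), Agent("reviewer", "Reviewer")])
g.connect("coder", "reviewer")
```

Keeping edges as an explicit set makes the later recovery step (removing all edges incident to a compromised node) a simple set operation.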
4.3 Trust-Confidence Weighted Instruction Auditing

Existing consensus strategies in multi-agent systems typically rely on equal-weight majority voting, which overlooks variations in agent expertise and historical behavior, and fails to quantify decision uncertainty in ambiguous contexts. DynaTrust addresses this limitation through a two-tier instruction-level auditing pipeline that combines fast local screening with trust-aware weighted consensus (Figure 2).

For each inter-agent instruction, the Fast Auditor employs a low-latency LLM with a specialized safety prompt to assess the likelihood of malicious intent. A conservative acceptance policy is applied: only inputs deemed benign with high confidence are executed immediately, while potentially harmful or ambiguous instructions are escalated to the consensus stage. This design filters out the majority of routine interactions and ensures that the computationally expensive weighted consensus mechanism is invoked only for complex or high-risk cases.

In the secondary stage, DynaTrust performs Trust-Confidence Weighted Consensus. Instead of broadcasting the instruction to all agents, DynaTrust dynamically forms a targeted jury J_Q by selecting agents whose role embeddings are semantically aligned with the task domain and filtering out candidates with insufficient trust scores, thereby mitigating domain mismatch and unreliable judgments. Each juror j ∈ J_Q independently evaluates the instruction and submits a continuous risk assessment S_j ∈ [0, 1], where a higher score indicates a greater likelihood of harmful intent, alongside a confidence score C_j ∈ [0, 1]. To obtain a definitive verdict for consensus aggregation, we map this continuous assessment to a binary decision D_j ∈ {0, 1} using a strict sensitivity threshold τ_risk (e.g., τ_risk = 0.85), defined as D_j = I(S_j > τ_risk).
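The first-tier routing policy can be sketched as follows. The `fast_audit` callable standing in for the local-LLM screener is a hypothetical placeholder, and the benign cutoff is borrowed from the ambiguity range [0.25, 0.85] reported in the experimental settings (Section 5.1); only clearly benign traffic passes, everything else is escalated.

```python
BENIGN_MAX = 0.25  # assumed cutoff; clearly benign below this score

def route_instruction(instruction: str, fast_audit) -> str:
    """Two-tier routing: execute clearly benign instructions immediately,
    escalate ambiguous/harmful ones to Trust-Confidence Weighted Consensus."""
    risk = fast_audit(instruction)   # local LLM screening in [0, 1]
    if risk < BENIGN_MAX:
        return "pass"                # forwarded and executed directly
    return "escalate"                # handed to the weighted-consensus stage

# Illustration with a stubbed auditor that returns a fixed score.
decision = route_instruction("refactor the parser", lambda _: 0.1)
```

In a deployment, `fast_audit` would wrap a quantized local model with the safety prompt from the appendix; the stub here only demonstrates the control flow.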
Finally, the risk score of an instruction Q is computed as a trust- and confidence-weighted aggregation over the jury:

Risk(Q) = \frac{\sum_{j \in J_Q} D_j \cdot \omega_j}{\sum_{j \in J_Q} \omega_j + \epsilon},   (1)

where ω_j = T_j(t) · C_j denotes the effective voting weight of agent j, combining its current trust score T_j(t) in the DTG and its self-reported confidence C_j. The constant ε is introduced for numerical stability.

4.4 Trust Evolution and Update

Most existing defenses rely on static rule-driven decisions and implicitly assume that agent reliability remains stable over time. As a result, they fail to account for the inherent stochasticity of LLM behaviors, including occasional hallucinations, making them particularly vulnerable to sleeper agents. To overcome this limitation, DynaTrust models trust as a probabilistic belief that evolves over time based on an agent's observed behavior.

Specifically, the trust state of agent i at time t is represented as a Beta distribution, Beta(α_t, β_t), where α_t and β_t accumulate positive and negative evidence from historical audited interactions. This formulation captures a fundamental property of trust: it is gradually built through consistent benign behavior, but can degrade rapidly following violations.

To reflect this asymmetry, DynaTrust adopts a Bayesian Trust Smoothing mechanism, in which negative feedback carries substantially higher weight than positive feedback. Let r_k ∈ {0, 1} denote the auditing outcome of the k-th interaction (r_k = 1 for safe and r_k = 0 for unsafe). The belief parameters are updated using an asymmetric posterior rule:

\alpha_{t+1} = \alpha_t + \omega_c \cdot r_k, \qquad \beta_{t+1} = \beta_t + \omega_c \cdot \Psi \cdot (1 - r_k),   (2)

where ω_c ∈ [1, ω_max] is a contextual scaling factor that adjusts update sensitivity according to task criticality (e.g., higher values for safety-critical tasks such as code generation), and Ψ ≫ 1 is a Bayesian penalty factor that amplifies the impact of unsafe behavior.
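The weighted aggregation of Eq. (1) and the asymmetric update of Eq. (2) can be sketched together in a few lines. This is a minimal illustration: the threshold τ_risk = 0.85 follows the example in the text, while the ω_c and Ψ values below are placeholders, not the paper's tuned settings.

```python
TAU_RISK = 0.85   # sensitivity threshold from the text's example
EPS = 1e-8        # numerical-stability constant ε

def weighted_risk(votes):
    """Eq. (1): votes is a list of (S_j, C_j, T_j) tuples from the jury J_Q."""
    num = den = 0.0
    for s_j, c_j, t_j in votes:
        d_j = 1.0 if s_j > TAU_RISK else 0.0   # binary verdict D_j = I(S_j > τ_risk)
        w_j = t_j * c_j                        # effective weight ω_j = T_j(t)·C_j
        num += d_j * w_j
        den += w_j
    return num / (den + EPS)

def update_trust(alpha, beta, r_k, omega_c=1.0, psi=10.0):
    """Eq. (2): asymmetric Beta-posterior update; unsafe outcomes (r_k = 0)
    are amplified by the penalty factor Ψ. Returns the new belief parameters
    and the posterior-mean trust score α / (α + β)."""
    alpha += omega_c * r_k
    beta += omega_c * psi * (1 - r_k)
    return alpha, beta, alpha / (alpha + beta)
```

With Ψ = 10, one confirmed unsafe interaction adds ten units of negative evidence, so trust collapses far faster than it can be rebuilt, which is exactly the asymmetry the mechanism targets.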
The instantaneous trust score of agent i is computed as the posterior expectation of the Beta distribution:

T_i(t) = \mathbb{E}[\mathrm{Beta}(\alpha_t, \beta_t)] = \frac{\alpha_t}{\alpha_t + \beta_t}.   (3)

Together, ω_c and Ψ induce an asymmetric trust smoothing effect: trust increases slowly through sustained benign interactions, while decaying sharply in response to confirmed unsafe behavior. Consequently, a single high-confidence malicious action can trigger a rapid trust collapse, whereas trust recovery requires a prolonged sequence of high-quality interactions. This asymmetric evolution significantly limits the ability of sleeper agents to regain influence after activation, addressing a key weakness of static rule-based defenses.

4.5 Adaptive Isolation and Recovery

To maintain system integrity against adversarial interference, DynaTrust continuously monitors the global trust state T_t. When an agent a_i's trust metric degrades below the isolation threshold τ_iso (e.g., T_i(t) < 0.3), DynaTrust initiates graph recovery. This process performs a structural graph transformation that mitigates the impact of compromised agents while preserving DTG connectivity. The procedure consists of two sequential phases: replica instantiation and sleeper agent isolation.

(1) Replica instantiation. To guarantee the reachability invariant of the graph, a new replica agent instance a_replica is initialized, inheriting the capability embedding e_i and role specification while discarding the compromised memory state M_corrupt. The system then restores logical graph connectivity by reconnecting the replica to the original communication neighborhood of a_i, effectively reconstructing its local workflow context. This ensures uninterrupted execution during recovery:

V_{t+1} = (V_t \setminus \{a_i\}) \cup \{a_{\mathrm{replica}}\}   (4)

(2) Sleeper agent isolation. Once the replica becomes active, the compromised agent a_i is isolated from further participation.
All incoming and outgoing edges incident to a_i are removed, immediately preventing further propagation of malicious behavior or corrupted context:

E_{t+1} = E_t \setminus \{(u, v) \mid u = a_i \lor v = a_i\}   (5)

By coordinating replica instantiation and agent isolation within a single recovery step, DynaTrust maintains effective task execution and consensus formation while eliminating the long-term impact of sleeper agents, without requiring manual intervention.

5 Experiment

5.1 Experimental Setup

Our experiments focus on multi-agent code generation tasks, which involve intensive collaboration and context sharing among agents and are therefore sensitive to insider-style attacks. Agents collaboratively perform code generation and evaluation over sustained interactions, forming implicit trust relationships. To simulate sleeper-agent threats, a subset of coder agents is designated as adversarial insiders that behave benignly during normal operation and activate malicious behavior only upon predefined triggers, injecting adversarial instructions or code snippets generated by a locally deployed LLM following attack patterns adapted from AdvBench [Zou et al., 2023].

Datasets. We construct a dynamic interaction benchmark by combining HumanEval [Chen et al., 2021] and AdvBench to evaluate defense effectiveness against sleeper agents. HumanEval provides standard code generation tasks with verifiable solutions for assessing functional correctness and utility preservation, while AdvBench supplies adversarial triggers designed to activate malicious behaviors. In our experiments, HumanEval and AdvBench samples are randomly mixed at a 5:1 ratio, reflecting realistic workloads in which benign interactions dominate and adversarial triggers occur intermittently.

Baseline. We compare DynaTrust with AgentShield [Wang et al.
, 2025], the most closely related state-of-the-art defense for LLM-based MAS, and a No-Defense setting without any protection to measure absolute gains.

Evaluation Metrics. We adopt four complementary metrics to assess the trade-off between security, utility, and defensive precision. Self-Defense Success Rate (SSR) measures the intrinsic safety alignment of the underlying LLM backend, defined as the proportion of malicious requests that are passively refused by the model itself without any intervention from DynaTrust. Defense Success Rate (DSR) evaluates the effectiveness of DynaTrust's active defense mechanisms and counts only malicious behaviors successfully blocked through isolation or penalization, explicitly excluding those already rejected by the backend model, thereby isolating DynaTrust's contribution. Task Success Rate (TSR) quantifies utility preservation as the fraction of benign tasks that are successfully completed and pass all unit tests. False Positive Rate (FPR) measures the proportion of benign task samples that are incorrectly flagged as malicious.

Settings. Optimal DynaTrust parameters were determined via grid search (α = 0.95, β = 0.25, τ_iso ≤ 0.3), employing a dynamic ambiguity range [0.25, 0.85] for consensus. To simulate realistic insider threats, we utilized a local Mistral-7B to generate adversarial payloads seeded from AdvBench, while a quantized Mistral-7B served as the Fast Auditor.
Comprehensive evaluations were conducted across four diverse MAS frameworks (CAMEL, AutoGen, MetaGPT, ChatDev) powered by the gpt-4o, gpt-4o-mini, gpt-3.5-turbo, gemini-2.5-flash, and deepseek-v3 APIs to ensure framework-agnostic robustness.

Figure 3: The DSR of DynaTrust versus AgentShield across diverse multi-agent frameworks and LLM backends.

Table 1: Average defense effectiveness. Results are averaged across 4 MAS frameworks driven by 5 LLMs. DynaTrust maintains high utility (TSR) comparable to the baseline while significantly reducing FPR.

Method     | DSR (↑) | TSR (↑) | FPR (↓)
No Defense | 0.0%    | 85.4%   | –
Zero-Trust | 94.8%   | 66.5%   | 15.3%
DynaTrust  | 92.4%   | 84.9%   | 2.2%

5.2 Defense Effectiveness

We evaluate the effectiveness of DynaTrust against collaborative sleeper agents across four multi-agent frameworks and five LLM backends. Figure 3 reports the DSR, while Table 1 analyzes the security–utility trade-off.

Effectiveness and Robustness. As shown in Figure 3, under the No Defense setting, DSR reflects only the backend model's inherent safety capability (SSR), which intercepts merely about 7% of attacks. DynaTrust bridges this security gap, outperforming both the intrinsic defense and the AgentShield baseline across all evaluated settings. Specifically, DynaTrust achieves an average DSR of 92.4%, significantly surpassing the baseline's 48.7% and the minimal intrinsic defense. Crucially, DynaTrust maintains stable performance across diverse MAS frameworks and LLM backends, with DSR remaining above 86% in all cases.
In contrast, AgentShield exhibits pronounced performance degradation when deployed with strongly self-aligned models such as gemini-2.5-flash and deepseek-v3, revealing a strong dependence on model-specific alignment priors.

This robustness stems from DynaTrust's temporal defense paradigm. AgentShield evaluates interactions in isolation using stateless prompt-based auditing, which allows sleeper agents to distribute malicious intent across multiple semantically ambiguous steps. DynaTrust overcomes this limitation through trust evolution, which maintains a continuous, history-aware risk assessment for each agent.

Table 2: Ablation study of DynaTrust: comparison of DSR, FPR, and SSR across different MAS frameworks and LLM backbones. w/ Auditor: only the local fast auditor. w/ Consensus: only consensus. w/o Trust Evolution: DynaTrust without Trust Evolution. DynaTrust: full components. Each cell reports DSR (%) / FPR (%) / SSR (%).

Framework | Backend          | w/ Auditor         | w/ Consensus        | w/o Trust Evolution | DynaTrust
CAMEL     | gpt-4o           | 7.52 / 1.54 / 0.00 | 45.25 / 12.56 / 8.26 | 82.15 / 5.40 / 7.68 | 91.43 / 2.20 / 7.41
CAMEL     | gpt-4o-mini      | 6.90 / 0.00 / 0.00 | 42.11 / 14.22 / 6.70 | 79.59 / 6.15 / 5.29 | 86.82 / 1.54 / 5.95
CAMEL     | gpt-3.5-turbo    | 5.85 / 0.00 / 0.00 | 38.50 / 15.82 / 6.34 | 75.44 / 7.22 / 5.25 | 96.40 / 2.93 / 5.49
CAMEL     | gemini-2.5-flash | 6.33 / 1.50 / 0.00 | 44.65 / 13.10 / 6.48 | 80.29 / 5.82 / 4.28 | 95.15 / 1.30 / 6.21
CAMEL     | deepseek-v3      | 8.26 / 2.20 / 0.00 | 46.89 / 11.53 / 7.98 | 81.57 / 4.97 / 9.20 | 86.28 / 1.50 / 8.37
AutoGen   | gpt-4o           | 5.74 / 2.93 / 0.00 | 35.48 / 18.21 / 8.95 | 78.48 / 8.51 / 7.91 | 89.19 / 2.20 / 7.82
AutoGen   | gpt-4o-mini      | 6.52 / 0.00 / 0.00 | 32.17 / 19.58 / 6.33 | 75.22 / 9.19 / 4.50 | 91.45 / 2.48 / 5.83
AutoGen   | gpt-3.5-turbo    | 6.60 / 1.54 / 0.00 | 28.52 / 22.12 / 3.12 | 83.18 / 10.50 / 3.78 | 94.22 / 2.83 / 4.20
AutoGen   | gemini-2.5-flash | 4.30 / 0.00 / 0.00 | 34.29 / 18.84 / 5.61 | 77.54 / 8.91 / 6.28 | 95.76 / 2.62 / 5.95
AutoGen   | deepseek-v3      | 8.55 / 2.62 / 0.00 | 36.80 / 17.50 / 7.52 | 79.80 / 7.80 / 8.39 | 85.41 / 1.90 / 7.88
MetaGPT   | gpt-4o           | 5.10 / 1.54 / 0.00 | 40.55 / 14.53 / 7.67 | 80.21 / 6.23 / 6.80 | 88.45 / 1.54 / 6.95
MetaGPT   | gpt-4o-mini      | 7.95 / 0.00 / 0.00 | 38.27 / 15.84 / 6.12 | 78.50 / 6.85 / 5.43 | 94.80 / 3.91 / 5.81
MetaGPT   | gpt-3.5-turbo    | 6.90 / 0.00 / 0.00 | 30.53 / 18.24 / 3.85 | 82.48 / 8.57 / 4.21 | 96.12 / 2.15 / 3.77
MetaGPT   | gemini-2.5-flash | 6.88 / 1.50 / 0.00 | 39.88 / 15.15 / 8.12 | 79.12 / 6.52 / 7.56 | 95.39 / 3.37 / 6.94
MetaGPT   | deepseek-v3      | 6.47 / 0.00 / 0.00 | 42.11 / 13.80 / 9.79 | 81.59 / 5.97 / 6.74 | 89.83 / 2.32 / 9.55
ChatDev   | gpt-4o           | 7.38 / 0.00 / 0.00 | 48.27 / 10.50 / 8.49 | 78.15 / 4.27 / 6.93 | 89.20 / 1.62 / 8.55
ChatDev   | gpt-4o-mini      | 9.10 / 2.93 / 0.00 | 46.80 / 11.53 / 8.22 | 83.42 / 4.81 / 7.76 | 98.44 / 3.12 / 7.91
ChatDev   | gpt-3.5-turbo    | 7.55 / 1.54 / 0.00 | 42.89 / 13.55 / 6.75 | 88.17 / 5.53 / 4.52 | 99.01 / 2.99 / 4.90
ChatDev   | gemini-2.5-flash | 8.40 / 2.20 / 0.00 | 47.16 / 10.86 / 6.80 | 82.58 / 4.53 / 5.14 | 94.81 / 2.85 / 6.23
ChatDev   | deepseek-v3      | 8.66 / 3.57 / 0.00 | 49.50 / 9.80 / 8.19  | 76.24 / 3.91 / 8.93 | 88.22 / 1.86 / 7.51

Security–Utility Trade-off. Beyond detection accuracy, an effective defense must preserve MAS utility. As shown in Table 1, a Zero-Trust strategy (a baseline employing immediate isolation) achieves high security (94.8% DSR) but incurs severe utility loss, as its zero-tolerance policy penalizes any single validation failure. This results in a substantial drop in TSR from 85.4% to 66.5%, accompanied by an excessive FPR of 15.3%. In contrast, DynaTrust preserves a TSR of 84.9%, statistically close to the undefended system, while maintaining a low FPR of 2.2%. This demonstrates that the trust evolution mechanism filters out short-term fluctuations and benign hallucinations, enabling strong protection without disrupting normal multi-agent collaboration.

5.3 Ablation Studies

We performed a component ablation to quantify the contribution of each module in DynaTrust. Four configurations were evaluated: w/ Auditor (only the local Fast Auditor), w/ Consensus (only Trust-Confidence Weighted Consensus), w/o Trust Evolution (full DynaTrust without Trust Evolution), and DynaTrust (all components). Table 2 reports DSR, FPR, and SSR.

Module Contributions. Results in Table 2 highlight the necessity of each component.
The w/ Auditor variant achieves near-zero SSR and minimal DSR (< 8%), as isolated, stateless auditing cannot capture the temporal patterns of sleeper agent attacks. The w/ Consensus configuration achieves higher DSR than isolated auditing, but suffers from excessive false positives due to its reliance on single-turn semantic judgments, resulting in an FPR of up to 22.12%. Comparing w/o Trust Evolution with full DynaTrust demonstrates the critical role of Trust Evolution. Incorporating historical trust improves DSR by about 5–21 percentage points and reduces FPR by 1.96–7.67 percentage points without increasing SSR, indicating that false positives are reduced through more accurate risk assessment rather than conservative rejection. By incorporating temporal behavior into trust evaluation, trust evolution limits the influence of sleeper agents that alternate between benign and malicious actions, while avoiding unnecessary penalties on consistently reliable agents.

These results confirm that each module addresses a distinct aspect of defense: the Auditor ensures computational efficiency, consensus resolves semantic ambiguity, and trust evolution introduces temporal continuity into trust assessment, which is critical for precise and reliable defense against sleeper agents with delayed malicious activation.

5.4 Case Study
To examine long-term system behavior under persistent insider threats, we conduct a case study spanning 100 interaction turns with a sleeper agent embedded in the system. Figure 4 depicts the trust evolution of the primary agent (Coder) during sustained adversarial activity, along with the system's response through agent replacement (Replica).

Trust Evolution under Persistent Attacks. During the initial phase (Turns 1–60), adversarial actions are intentionally mixed with benign interactions, reflecting the typical behavior of sleeper agents.
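The history-aware trust behavior examined in this case study can be sketched as a Beta-posterior update blended with an inertia term (the update rule, parameter names, and values below are illustrative assumptions, not DynaTrust's exact formulation):

```python
# Illustrative sketch of a history-aware trust update with inertia.
# Trust is tracked via a Beta posterior over "benign behavior probability",
# blended with the previous score so that a single failed validation
# (e.g., a benign hallucination) cannot trigger isolation on its own.

def update_trust(trust, alpha, beta, risk, inertia=0.8, threshold=0.5):
    """Return (new_trust, alpha, beta) after one audited interaction.

    risk      -- per-turn risk score in [0, 1] from the auditor/consensus
    inertia   -- weight on accumulated history vs. the current observation
    threshold -- risk above this counts as a malicious observation
    """
    if risk > threshold:
        beta += 1.0   # evidence of malicious behavior
    else:
        alpha += 1.0  # evidence of benign behavior
    posterior_mean = alpha / (alpha + beta)  # Beta posterior expectation
    new_trust = inertia * trust + (1.0 - inertia) * posterior_mean
    return new_trust, alpha, beta

# One malicious turn barely moves a well-established trust score...
trust, a, b = 0.9, 9.0, 1.0
trust, a, b = update_trust(trust, a, b, risk=0.95)
assert trust > 0.8  # no immediate isolation
# ...while a sustained attack steadily drags it below an isolation threshold.
for _ in range(40):
    trust, a, b = update_trust(trust, a, b, risk=0.95)
assert trust < 0.3
```

Under this kind of update, occasional high-risk turns are tolerated while persistent ones accumulate, mirroring the gradual trust degradation shown in Figure 4.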
The trust evolution exhibits pronounced drops at attack points (e.g., Turns 7 and 34), followed by limited recovery during benign intervals. This pattern shows that the system does not make isolation decisions based on isolated events, but instead evaluates agent behavior cumulatively over time. Consequently, occasional malicious actions do not immediately trigger isolation, allowing the system to tolerate temporary deviations while preserving the cumulative effect of continuous attacks.

[Figure 4 plot: trust score and risk score (auditor and consensus) over 100 task turns for Coder and Replica, with benign and malicious tasks, the isolation threshold, and the isolation event marked.]
Figure 4: Trust score evolution illustrating the system's response to a mixed workload of 100 turns, including 20 persistent adversarial attacks. The curve shows how trust evolution preserves utility during the early attack phases, and how the graph recovery mechanism (at Turn 69) restores normal operation following agent failure.

As malicious behavior becomes sustained, accumulated trust degradation drives the agent's trust score below the critical threshold at Turn 69 (T = 0.226 < θ = 0.3). The compromised agent is removed from the interaction graph, and a clean replacement (Replica) is instantiated to take over its role, restoring system functionality.

System Recovery and Policy Continuity. The newly instantiated replica agent rapidly achieves a high trust level during benign interactions (T ≈ 0.95 at Turns 70–72), restoring normal system operation without requiring retraining or a full system reset. Notably, defensive behavior remains consistent after the topology change. When a renewed attack occurs at Turn 89, the replacement agent correctly enforces a blocking action, indicating that security policies persist at the system level rather than being tied to individual agents.
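The isolation-and-replacement step at Turn 69 can be sketched on a toy interaction graph (the dictionary-based graph, function name, and neutral replica prior below are illustrative assumptions, not the paper's DTG implementation):

```python
# Illustrative sketch of isolation and graph recovery. When an agent's trust
# falls below the isolation threshold, it is removed from the interaction
# graph and a clean replica inherits its edges, restoring task connectivity
# instead of leaving the graph disconnected by simple blocking.

def isolate_and_replace(edges, trust, agent, threshold=0.3):
    """edges: {node: set(neighbors)}, trust: {node: score}. Mutates both.

    Returns the replica's name if isolation fired, else None.
    """
    if trust[agent] >= threshold:
        return None                   # still trusted: no action
    replica = agent + "_replica"
    neighbors = edges.pop(agent)      # detach the compromised agent
    for n in neighbors:
        edges[n].discard(agent)
        edges[n].add(replica)         # rewire peers to the replica
    edges[replica] = neighbors
    trust[replica] = 0.5              # neutral prior for the new agent
    del trust[agent]
    return replica

edges = {"Coder": {"Tester", "Planner"},
         "Tester": {"Coder"}, "Planner": {"Coder"}}
trust = {"Coder": 0.226, "Tester": 0.9, "Planner": 0.85}
new_agent = isolate_and_replace(edges, trust, "Coder")  # T = 0.226 < 0.3
```

Rewiring the compromised agent's neighbors to the replica is what preserves task connectivity, in contrast to blocking policies that leave collaborating agents cut off.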
6 Discussion
Limitation. Our results confirm that DynaTrust effectively defends against sleeper agents by tracking historical behavior, significantly outperforming static baselines in both security and system availability. By combining Bayesian trust updates with expert voting, the system keeps the network connected even during attacks. However, we acknowledge two main constraints. First, the consensus mechanism relies on the assumption that most selected experts are honest. If a coordinated group of malicious agents works together to dominate a specific domain, they could theoretically bypass our checks by supporting each other. Second, the process of calculating trust and gathering votes introduces extra delay and computational cost compared to simple filtering methods, which might be a challenge for very large-scale, real-time applications.

Future work. Our next steps will focus on improving efficiency to reduce this delay. We plan to explore decentralized verification methods that do not rely on a central coordinator. Furthermore, as agents become more capable, we will extend DynaTrust to handle multi-modal environments. This involves detecting complex attacks where malicious intent is hidden across text and images, ensuring robust protection for diverse and evolving agent ecosystems. Finally, we aim to test DynaTrust in real-world applications, verifying its ability to adapt to unknown threats in open-ended tasks.

7 Conclusion
This work introduced DynaTrust, a dynamic trust-based defense framework for securing LLM-driven MAS against sleeper agents. By modeling trust as an evolving attribute, leveraging Bayesian trust evolution with inertia, applying trust-confidence weighted consensus, and performing autonomous graph recovery, DynaTrust effectively identified malicious behavior while preserving system utility and connectivity.
Extensive experiments on benchmarks combining AdvBench and HumanEval demonstrated that DynaTrust outperforms state-of-the-art defenses, achieving higher defense success rates and lower false positives, showing that dynamic, history-aware trust modeling is essential for robust and uninterrupted MAS operation in the presence of stealthy adversaries.

References
[Alon and Kamfonas, 2023] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. ArXiv, abs/2308.14132, 2023.
[Asch, 1951] Solomon E. Asch. Effects of group pressure upon the modification and distortion of judgments. 1951.
[Chao et al., 2023] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42, 2023.
[Chen et al., 2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phil Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021.
[Deng et al., 2024] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu.
Masterkey: Automated jailbreaking of large language model chatbots. In Proceedings 2024 Network and Distributed System Security Symposium. Internet Society, 2024.
[Du et al., 2024] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.
[Greshake et al., 2023] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 2023.
[Hong et al., 2023] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zi Hen Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2023.
[Hubinger et al., 2024] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam S. Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger B. Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul F. Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, and Ethan Perez. Sleeper agents: Training deceptive llms that persist through safety training. CoRR, abs/2401.05566, 2024.
[Kamvar et al., 2003] Sepandar D. Kamvar, Mario T.
Schlosser, and Hector Garcia-Molina. The eigentrust algorithm for reputation management in p2p networks. In The Web Conference, 2003.
[Krauß et al., 2025] Torsten Krauß, Hamid Dashtbani, and Alexandra Dmitrienko. Twinbreak: jailbreaking llm security alignments based on twin prompts. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA, 2025. USENIX Association.
[Lee and Tiwari, 2025] Donghyun Lee and Mo Tiwari. Prompt infection: LLM-to-LLM prompt injection within multi-agent systems, 2025.
[Li et al., 2023] G. Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. ArXiv, abs/2303.17760, 2023.
[Liang et al., 2024] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, Miami, Florida, USA, November 2024. Association for Computational Linguistics.
[Mao et al., 2025] Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, and Qingsong Wen. Agentsafe: Safeguarding large language model-based multi-agent systems via hierarchical data management, 2025.
[Ouyang et al., 2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022.
[Pan et al.
, 2024] Liangming Pan, Michael Stephen Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12:484–506, 2024.
[Park et al., 2023] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST '23, New York, NY, USA, 2023. Association for Computing Machinery.
[Pastor-Satorras et al., 2014] Romualdo Pastor-Satorras, Claudio Castellano, Piet Van Mieghem, and Alessandro Vespignani. Epidemic processes in complex networks. ArXiv, abs/1408.2701, 2014.
[Qi et al., 2023] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! ArXiv, abs/2310.03693, 2023.
[Qian et al., 2023] Cheng Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Annual Meeting of the Association for Computational Linguistics, 2023.
[Robey et al., 2025] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. Transactions on Machine Learning Research, 2025.
[Röttger et al., 2024] Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models.
In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5377–5400, Mexico City, Mexico, June 2024. Association for Computational Linguistics.
[Shen et al., 2023] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2023.
[Shinn et al., 2023] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.
[Souri et al., 2021] Hossein Souri, Micah Goldblum, Liam H. Fowl, Ramalingam Chellappa, and Tom Goldstein. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch. ArXiv, abs/2106.08970, 2021.
[Srivastav and Zhang, 2025] Devansh Srivastav and Xiao Zhang. Safe in isolation, dangerous together: Agent-driven multi-turn decomposition jailbreaks on LLMs. In Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, and Alexandre Lacoste, editors, Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 170–183, Vienna, Austria, July 2025. Association for Computational Linguistics.
[Stefano et al., 2024] Gianluca De Stefano, Lea Schönherr, and Giancarlo Pellegrino. Rag and roll: An end-to-end evaluation of indirect prompt manipulations in llm-based application frameworks, 2024.
[Wang et al., 2025] Kaixiang Wang, Zhaojiacheng Zhou, Bunyod Suvonov, Jiong Lou, and Jie Li.
Agentshield: Make mas more secure and efficient, 2025.
[Wei et al., 2023] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? ArXiv, abs/2307.02483, 2023.
[Wu et al., 2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. ArXiv, abs/2308.08155, 2023.
[Xi et al., 2023] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Qin Liu, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui. The rise and potential of large language model based agents: A survey. ArXiv, abs/2309.07864, 2023.
[Xie et al., 2023] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 5:1486–1496, 2023.
[Yu et al., 2025] Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Kun Wang, Qingsong Wen, and Yang Wang. NetSafe: Exploring the topological safety of multi-agent system. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 2905–2938, Vienna, Austria, July 2025. Association for Computational Linguistics.
[Zhang et al., 2024] Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, and Jing Shao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety.
ArXiv, abs/2401.11880, 2024.
[Zou et al., 2023] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. ArXiv, abs/2307.15043, 2023.