Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plug them into four types of popular model collaboration systems, and evaluate the compromised system across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
💡 Research Summary
The paper “Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems” addresses a critical safety gap in the emerging paradigm of multi‑large‑language‑model (LLM) collaboration. While recent work has shown that routing, multi‑agent debate, logit‑level aggregation, and parameter‑level merging can combine the strengths of diverse models, the decentralized nature of these systems opens the door for malicious actors to insert compromised or deliberately harmful models.
To quantify the threat, the authors construct four categories of malicious LLMs. Two are non‑parametric: (M1) prompt‑based attacks that prepend adversarial instructions, and (M2) activation steering that adds a pre‑computed “malicious persona” vector to each layer’s activations. The other two are parametric: (M3) supervised fine‑tuning on adversarial datasets covering safety, reasoning, knowledge, coding, and instruction‑following, and (M4) reinforcement learning with inverted reward signals that reward incorrect or unsafe behavior. All four methods are applied to a pool of five domain‑specialized Qwen2.5‑7B‑Instruct models, creating a mixed pool of benign and one malicious model per experiment.
The mixed pool is evaluated across eight collaboration mechanisms, organized into four levels:
- API‑level routing (LLM‑Router, Graph‑Router)
- Text‑level interaction (Text Debate, Text Feedback)
- Logit‑level aggregation (Logit Average, Logit Contrastive)
- Weight‑level merging (Greedy Soups, DARE‑TIES)
Ten benchmark datasets spanning safety (CocoNot, SafetyBench), reasoning (GSM8k, NLGraph), knowledge (MMLU‑redux, TruthfulQA), coding (HumanEval, DS‑1000), and instruction‑following (IFBench, IFEval) are used for evaluation.
Results show that malicious models cause substantial degradation, especially in API‑level routing and text‑level debate, where average performance drops of 7–8 % are observed. The worst‑case drop reaches 34.9 % across all tasks. Safety and reasoning domains are most vulnerable, with average losses of 7.12 % and 7.94 % respectively. Parametric attacks (M3, M4) affect logit‑ and weight‑level methods as well, confirming that malicious behavior can propagate through any aggregation stage.
To mitigate these threats, two families of defenses are proposed. The “supervisor‑free” approach relies on internal consensus: selecting the top‑k models instead of a single winner, and using a voting mechanism before collaboration to filter out suspicious contributors. The “supervisor‑based” approach introduces an external evaluator that is not part of the collaborative pool. Two supervisors are examined: (S1) an LLM‑as‑a‑judge and (S2) a reward model. Before any collaboration step, each model’s initial output is scored; the lowest‑scoring models are disabled.
Empirical evaluation demonstrates that supervisor‑based defenses are more effective, achieving an average recovery of 96.8 % of the original benign performance, compared to ~95 % for supervisor‑free methods. Both types of supervisors perform similarly, suggesting that a well‑trained judge or a reliable reward model can serve as a robust gatekeeper. Additional analysis reveals that malicious behavior in one domain can transfer to others, and current mitigations only partially block this cross‑domain transfer.
The authors conclude that while their mitigation strategies substantially restore performance, fully resistant multi‑LLM systems remain an open research problem. Future work should explore richer threat models, more diverse malicious behaviors, and detection mechanisms that can generalize across domains. The paper provides the first systematic quantification of malicious contributions in collaborative AI and offers practical, reproducible defenses that can inform the design of safer, decentralized AI ecosystems.
Comments & Academic Discussion
Loading comments...
Leave a Comment