MiTa: A Hierarchical Multi-Agent Collaboration Framework with Memory-integrated and Task Allocation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Recent advances in large language models (LLMs) have substantially accelerated the development of embodied agents. LLM-based multi-agent systems mitigate the inefficiency of single agents in complex tasks. However, they still suffer from issues such as memory inconsistency and agent behavioral conflicts. To address these challenges, we propose MiTa, a hierarchical memory-integrated task allocative framework to enhance collaborative efficiency. MiTa organizes agents into a manager-member hierarchy, where the manager incorporates additional allocation and summary modules that enable (1) global task allocation and (2) episodic memory integration. The allocation module enables the manager to allocate tasks from a global perspective, thereby avoiding potential inter-agent conflicts. The summary module, triggered by task progress updates, performs episodic memory integration by condensing recent collaboration history into a concise summary that preserves long-horizon context. By combining task allocation with episodic memory, MiTa attains a clearer understanding of the task and facilitates globally consistent task distribution. Experimental results confirm that MiTa achieves superior efficiency and adaptability in complex multi-agent cooperation over strong baseline methods.


💡 Research Summary

MiTa introduces a hierarchical multi‑agent collaboration framework that explicitly addresses two persistent problems in LLM‑driven multi‑agent systems: memory inconsistency and inter‑agent behavioral conflicts. The core idea is to separate the agent population into a single centralized manager and multiple member agents, each equipped with distinct functional modules.

The manager hosts two specialized components. The Allocation module aggregates every member’s proposed next‑step action, local observation, and belief into a cross‑agent context (Xₜ). Using a large language model as a scoring function, it selects a globally coherent joint action that maximizes overall task progress, thereby preventing redundant or conflicting actions that arise in purely decentralized settings. The Summary module is triggered whenever the task progress metric changes. It retrieves all actions and belief states accumulated since the last summary, composes them with the progress delta, and asks the LLM to generate a concise collaborative summary (C). This summary is stored as episodic memory and injected back into the allocation step, preserving long‑horizon context that would otherwise be lost due to the limited context window of LLMs.
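The manager loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class layout, the `llm` scoring callable, and `_joint_candidates` are all assumptions introduced here to make the Allocation and Summary triggers concrete.

```python
# Hedged sketch of the MiTa manager: Allocation scores candidate joint
# actions with an LLM; Summary fires only when task progress changes.
# All interfaces here are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class Manager:
    task: str                       # overall task definition T
    summary: str = ""               # episodic summary C
    last_progress: float = 0.0      # progress p at the last summary
    history: list = field(default_factory=list)  # actions since last summary

    def allocate(self, proposals, llm):
        """Allocation module: pick a globally coherent joint action.

        `proposals` is the cross-agent context X_t: one dict per member
        with its proposed action, observation, and belief. `llm` acts as
        the scoring function M; we take the highest-scoring candidate.
        """
        candidates = self._joint_candidates(proposals)
        scored = [(llm(self.task, self.summary, c), c) for c in candidates]
        _, best_joint = max(scored, key=lambda sc: sc[0])
        self.history.append(best_joint)
        return best_joint

    def maybe_summarize(self, progress, llm_summarize):
        """Summary module: triggered only by a task-progress update."""
        if progress != self.last_progress:
            delta = progress - self.last_progress
            self.summary = llm_summarize(self.history, delta)
            self.history.clear()          # summary now carries the context
            self.last_progress = progress
        return self.summary

    def _joint_candidates(self, proposals):
        # Placeholder candidate generator: the tuple of proposed actions.
        # A real system would enumerate or sample alternative assignments.
        return [tuple(p["action"] for p in proposals)]
```

Note the design point this sketch makes explicit: the summary is injected back into `allocate` (via `self.summary`), so allocation decisions stay conditioned on long-horizon context even after the raw history is cleared.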

Member agents follow a four‑stage pipeline: Perception → Memory → Negotiation → Execution. Perception converts raw sensory inputs (symbolic object descriptors or RGB‑D images) into natural‑language descriptions and stores them in a local memory buffer. During Negotiation, each member uses its own LLM to produce a candidate action proposal (mₜᵢ) conditioned on its observation, recent dialogue history, and the global task objective. The manager then evaluates all proposals together with the latest summary and allocates the final joint action set, which each member translates into executable macro‑actions.
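One member-agent cycle can be sketched as follows. The `perceive` helper, the `llm` callable, and the window size `k` are illustrative assumptions; the paper does not specify these interfaces. Execution is deliberately absent, since a member only executes after the manager returns the allocated joint action.

```python
# Minimal sketch of one member cycle in MiTa's pipeline:
# Perception -> Memory -> Negotiation (Execution follows allocation).
# perceive() and llm are stand-ins, not the paper's actual API.
def perceive(raw_obs):
    """Perception: turn a symbolic observation into a natural-language line."""
    return f"I see {raw_obs['object']} in the {raw_obs['room']}."

def member_step(raw_obs, memory, dialogue, task, llm, k=5):
    """Produce the candidate action proposal m_t^i for the manager.

    Conditions the proposal on the new observation, the last k dialogue
    turns, and the global task objective, as described above.
    """
    desc = perceive(raw_obs)                    # Perception
    memory.append(desc)                         # Memory: local buffer
    proposal = llm(desc, dialogue[-k:], task)   # Negotiation: propose m_t^i
    return proposal
```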

Formally, the problem is modeled as a Multi‑agent Partially Observable Markov Decision Process (MPOMDP). The manager maintains a joint belief state b(s) that aggregates all agents’ observations and summaries. The policy π(b) is implicitly learned by the LLM‑based scoring function M, which takes as input the cross‑agent context Xₜ, the episodic summary C, the current progress pₜ, and the overall task definition T, and outputs the optimal joint action aₜ*.
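The selection rule in this paragraph can be written compactly. The explicit argmax form below is an assumed reconstruction from the summary's own symbols (Xₜ, C, pₜ, T), treating the LLM scorer M as a conditional score over candidate joint actions:

```latex
a_t^{*} \;=\; \operatorname*{arg\,max}_{a \,\in\, \mathcal{A}} \; M\!\left(a \,\middle|\, X_t,\; C,\; p_t,\; T\right)
```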

Experiments are conducted in the VirtualHome‑Social 3‑D simulation environment, covering five household tasks: Prepare Tea, Wash Dishes, Prepare a Meal, Put Groceries, and Set up Table. Both symbolic (structured object information) and visual (egocentric RGB‑D) observation modes are evaluated. Three state‑of‑the‑art LLMs—GPT‑4o, Qwen‑3‑Plus, and DeepSeek‑V3.1—serve as the reasoning backbone, allowing the authors to test robustness across model capabilities. Baselines include a hierarchical MCTS planner (MHP), the modular multi‑agent framework CoELA, and the intention‑inferring system ProAgent.

The primary metric is the average number of steps (L) required to complete a task; efficiency improvement (EI) is defined as the relative reduction in L compared to the strongest baseline. In the two‑agent setting, MiTa already matches or slightly outperforms CoELA and ProAgent, while in the three‑agent configuration it achieves the best results across all five tasks, reducing the average step count to 34.4 (a 68 % EI) and consistently outperforming the next best method by 5–10 % relative.
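The EI metric as defined above is just a relative step-count reduction; a minimal sketch, assuming a percentage normalization against the strongest baseline (the paper's exact normalization may differ):

```python
# Efficiency improvement (EI): relative reduction in average step
# count L versus the strongest baseline, expressed as a percentage.
def efficiency_improvement(L_method, L_baseline):
    """EI = 100 * (L_baseline - L_method) / L_baseline."""
    return 100.0 * (L_baseline - L_method) / L_baseline

# Example with illustrative numbers (not taken from the paper):
# a method needing 30 steps vs. a 40-step baseline gives EI = 25.0.
```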

Ablation studies confirm the necessity of both manager modules. Removing the Allocation module inflates the step count by roughly 14 % and leads to frequent local‑only decisions, whereas omitting the Summary module degrades efficiency by up to 15.7 % due to loss of long‑term context. Additional robustness tests show that swapping member agents’ LLMs for a weaker GPT‑3.5‑Turbo incurs only a marginal increase (≈1 step), whereas weakening the manager’s LLM causes a substantial drop in coordination efficiency (≈6–7 extra steps), highlighting the manager’s pivotal role.

Overall, MiTa demonstrates that (1) a centralized allocation mechanism can effectively resolve inter‑agent conflicts, (2) episodic memory summarization preserves essential long‑horizon information despite LLM context limits, and (3) the combination of these two mechanisms yields a system that is both more efficient and more adaptable than existing multi‑agent approaches, even when some agents operate with limited computational resources. The work opens avenues for deploying LLM‑driven collaborative robots in real‑world, resource‑constrained environments, while also suggesting future research directions such as lightweight summarization models, real‑time latency mitigation, and physical robot validation.

