Curriculum-Guided Massive Multi-Agent System Solving for Robust Long-Horizon Tasks
📝 Abstract
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning and escalating computation costs. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64×64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood (NLL) as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
📄 Content
Large Language Models (LLMs) are increasingly deployed as decision-making, planning, and coordination engines across multi-agent systems, robotics pipelines, and long-horizon automation workflows. Despite their growing capabilities, current LLM-based agents exhibit catastrophic reliability failures when operating over extended task horizons, where even small per-step errors compound into irreversible failures [21], [22], [23]. At the same time, multi-agent coordination introduces escalating token and computation costs: repeated inter-agent communication, context-window expansion, and iterative reasoning steps quickly render large-scale deployments prohibitively expensive [1]–[6]. These limitations fundamentally restrict the ability of LLMs to serve as scalable, robust controllers in real-world environments that require persistent situational awareness, dynamic replanning, and distributed reasoning across heterogeneous agents.
Several studies aim to reduce token consumption by restructuring how LLM agents exchange information. ELHPlan introduces Action Chains, enabling coarse-to-fine planning and reducing token usage to roughly 24% of previous methods by bundling sequential subgoals into structured segments [1]. S²-MAD applies sparsification to multi-agent debate systems, allowing only novel exchanges to propagate; this reduces token costs by as much as 94.5% with minimal accuracy loss [2]. A complementary approach is the SupervisorAgent mechanism, proposed in Stop Wasting Your Tokens, which monitors multi-agent systems at runtime and filters inefficient reasoning steps, achieving approximately 29.5% token reduction without harming performance [3]. Complementing these efficiency-driven designs, the MAS Failure Taxonomy provides a systematic study of failure modes in LLM multi-agent collaboration, identifying redundant conversational loops as a dominant source of wasted computation [4].
Hierarchical architectures also contribute to reducing communication overhead. AutoHMA-LLM employs a cloud-edge hierarchy in heterogeneous robot teams, delegating feasibility checks to local modules while centralizing high-level planning, resulting in 46% fewer communication steps and 31% fewer tokens [5]. Similarly, OPTIMA introduces a training framework that optimizes multi-agent dialog trajectories for brevity and clarity, achieving up to 88.5% token reduction on math reasoning tasks [6].
Hierarchical planning methods reduce long-horizon complexity by decomposing tasks into structured subtasks. Plan-and-Act standardizes a two-level pipeline in which a Planner LLM generates high-level plans and an Executor LLM performs stepwise execution with localized replanning, achieving state-of-the-art results in web-navigation tasks [7]. L2M2 combines global LLM planning with RL-based subtask execution, reducing sample requirements by a factor of five through episodic allocation of roles and responsibilities [8]. Iterative refinement architectures such as LLaMAR employ a plan-act-verify-correct loop, enabling agents to update plans incrementally rather than relying on a single long context window [9]. In domain-specific settings, OceanPlan decomposes AUV missions into waypoint-driven subtasks [10], TDAG recursively spawns sub-agents for increasingly granular subtasks [11], and HiTAMP integrates LLM-driven PDDL planning with partial replanning to avoid full-sequence re-computation [12].
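The plan-act-verify-correct pattern described above can be sketched in a few lines. This is a minimal illustration, not an implementation of any cited system: the task, the step format, and the three helper functions (`make_plan`, `execute`, `repair`) are hypothetical stand-ins for the LLM calls a real Planner/Executor pipeline would make.

```python
def make_plan(task):
    # Hypothetical "Planner": decompose the task into high-level steps.
    return [f"{task}: step {i}" for i in range(3)]

def execute(step):
    # Hypothetical "Executor": attempt the step, return (success, observation).
    # For illustration, exactly one step fails on its first attempt.
    return (not step.endswith("step 1"), f"observation for {step}")

def repair(step, observation):
    # Hypothetical "Corrector": produce a revised step from the failure feedback.
    return step + " (retried)"

def plan_act_verify_correct(task, max_retries=2):
    # Core loop: verify each step's outcome and replan only the failed step,
    # instead of regenerating the full plan in one long context window.
    log = []
    for step in make_plan(task):
        for _ in range(max_retries + 1):
            ok, obs = execute(step)
            log.append((step, ok))
            if ok:
                break                 # verification passed: move on
            step = repair(step, obs)  # localized correction of this step only
    return log

history = plan_act_verify_correct("stack blocks")
```

The key design point, shared by the cited systems, is that failure handling is localized: a failed step triggers a targeted repair rather than a full-sequence re-computation.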
Long-horizon tasks require reliable decomposition, sustained execution, and explicit control of cascading errors. The asymptotic analysis in [20] demonstrates that treating an LLM forward pass as the atomic computational primitive enables principled reasoning about how long tasks should be split: inappropriate decomposition induces asymptotic blow-ups in both query cost and error accumulation. Complementing this, empirical critiques of emergent-ability narratives [22] and mechanistic studies of compositional brittleness [23] show that apparent short-horizon reasoning competence does not reliably extrapolate to extended execution chains. Engineering long-horizon competence thus requires both formal measurement and structural constraints. The horizon-based capability metric proposed in [21] reframes model evaluation around "time-to-failure," arguing that multi-step performance is driven by reliability across steps rather than one-shot accuracy. Structured-output systems such as JSONSchemaBench [24] and schema-enforced generation mechanisms [25] further constrain LLM outputs, mitigating drift over long sequences of actions. Together, these advances (formal asymptotics, empirical failure analyses, horizon-aware metrics, and schema-based constraints) offer a unified basis for understanding why long-horizon tasks fail and how to improve robustness over extended execution windows.
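The abstract's use of Negative Log-Likelihood as a confidence signal connects directly to this reliability framing: a per-step confidence estimate lets the system decide when a lightweight agent's answer is trustworthy and when to escalate to the oracle. The sketch below shows one common way to do this, assuming access to per-token log-probabilities; the threshold value and the example log-probs are illustrative assumptions, not figures from the paper.

```python
import math

def mean_nll(token_logprobs):
    # Average negative log-likelihood over the generated tokens:
    # low mean NLL means the model assigned high probability to its own output.
    return -sum(token_logprobs) / len(token_logprobs)

def route(answer, token_logprobs, oracle, nll_threshold=0.5):
    # Accept the lightweight agent's answer when it is confident (low NLL);
    # otherwise defer to the (expensive) oracle. Threshold is an assumption.
    if mean_nll(token_logprobs) <= nll_threshold:
        return answer, "agent"
    return oracle(), "oracle"

# Illustrative log-probs for a confident vs. an uncertain generation.
confident = [math.log(0.95), math.log(0.90)]  # mean NLL ~ 0.08
uncertain = [math.log(0.40), math.log(0.30)]  # mean NLL ~ 1.06

decision = route("move disk 1 to peg C", confident, oracle=lambda: "oracle plan")
```

Gating oracle calls on calibrated confidence is what allows oracle usage to fall as agents improve: well-calibrated regions stop escalating, while poorly calibrated ones continue to receive supervision.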
Curriculum learning has emerged as a complementary strategy to improve LLM-based reasoning efficiency. cMALC-D uses an LLM as a curriculum generator, producing progressively harder tr
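A Thompson Sampling curriculum manager of the kind named in the abstract can be sketched as a Beta-Bernoulli bandit over candidate training zones. This is a simplified illustration under stated assumptions: the zone names, the binary success reward, and the simulated success rates are all hypothetical, and the paper's actual reward combines competence with NLL-driven calibration signals rather than raw success alone.

```python
import random

class ThompsonCurriculum:
    """Pick the next training zone by sampling each zone's Beta posterior."""

    def __init__(self, zones):
        # Beta(1, 1) uniform prior per zone, stored as [alpha, beta].
        self.posteriors = {z: [1, 1] for z in zones}

    def pick_zone(self):
        # Thompson Sampling: draw one sample per posterior, train in the argmax.
        samples = {z: random.betavariate(a, b)
                   for z, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, zone, success):
        # Bernoulli reward update: success increments alpha, failure beta.
        a, b = self.posteriors[zone]
        self.posteriors[zone] = [a + success, b + (1 - success)]

random.seed(0)
mgr = ThompsonCurriculum(["center", "mid-ring", "periphery"])
true_rates = {"center": 0.9, "mid-ring": 0.5, "periphery": 0.1}  # assumed
for _ in range(200):
    z = mgr.pick_zone()
    mgr.update(z, int(random.random() < true_rates[z]))

pulls = {z: a + b - 2 for z, (a, b) in mgr.posteriors.items()}
```

Under these assumed rates the manager concentrates training on the easy central zone first, mirroring the spatial curriculum's easy-to-hard expansion; a fuller version would also reward progress on harder zones so the frontier keeps moving outward.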
This content is AI-processed based on ArXiv data.