CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
💡 Research Summary
CREW‑Wildfire is an open‑source benchmark built on the human‑AI teaming CREW simulation platform, specifically designed to evaluate large‑scale, heterogeneous, and partially observable multi‑agent systems powered by large language models (LLMs). The authors argue that existing multi‑agent benchmarks—such as Hanabi, StarCraft II, Overcooked, or FireCommander—are limited to small numbers of agents, fully observable or symbolic worlds, and short‑horizon tasks, which prevents meaningful assessment of the scalability, robustness, and coordination capabilities required for real‑world high‑stakes scenarios like wildfire response.
The benchmark procedurally generates wildfire environments using Perlin noise to create continuous terrain features (elevation, slope, moisture, wind vectors) and discrete land‑type masks (forest, brush, rock, water, settlements). Each run is seeded, so individual scenarios are fully reproducible while, across seeds, agents never see the same map twice. Fire spread is modeled with a cellular‑automata formulation that incorporates slope‑dependent factors, wind alignment, and moisture attenuation, yielding a stochastic propagation probability p_spread that varies non‑linearly with terrain and weather. This dynamic, uncertain fire front forces agents to continuously sense, predict, and re‑plan.
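The cellular‑automata spread rule described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the coefficients (`k_slope`, `k_wind`, `k_moist`) and the multiplicative form of p_spread are assumptions chosen to reproduce the qualitative behavior (faster upslope and downwind spread, exponential moisture attenuation).

```python
import math

def spread_probability(p0, slope, wind_vec, spread_dir, moisture,
                       k_slope=0.5, k_wind=0.4, k_moist=2.0):
    """Illustrative per-cell ignition probability for a fire-spread CA.

    Combines a base rate p0 with slope-, wind-, and moisture-dependent
    factors, then clips to [0, 1]. Coefficients are placeholders, not
    values from the paper.
    """
    # Upslope spread is faster: scale by a slope-dependent factor.
    slope_factor = 1.0 + k_slope * max(slope, 0.0)

    # Wind alignment: cosine between the wind vector and the direction
    # of spread; only tailwind (positive alignment) accelerates spread.
    wx, wy = wind_vec
    dx, dy = spread_dir
    norm = math.hypot(wx, wy) * math.hypot(dx, dy)
    align = (wx * dx + wy * dy) / norm if norm > 0 else 0.0
    wind_factor = 1.0 + k_wind * max(align, 0.0)

    # Moisture attenuates spread exponentially.
    moist_factor = math.exp(-k_moist * moisture)

    return min(1.0, p0 * slope_factor * wind_factor * moist_factor)
```

At each simulation step, every burning cell would apply this probability to each unburned neighbor, which is what makes the fire front stochastic and terrain-dependent rather than a fixed expanding circle.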
Four agent archetypes are provided: Firefighters (generalists capable of cutting trees, spraying water, rescuing civilians), Bulldozers (fast vegetation clearing but no rescue or water capability), Drones (wide‑area reconnaissance only), and Helicopters (transport of personnel and water delivery). The heterogeneity creates inter‑agent dependencies: helicopters must ferry firefighters to distant hotspots, drones must locate ignitions for ground teams, and bulldozers must cooperate with firefighters to protect cleared firebreaks. The benchmark supports both low‑level control (continuous or discrete action tensors) and high‑level natural‑language commands through modular Perception and Execution components. Perception delivers multimodal observations—mini‑map images, third‑person visual frames, textual descriptions, and ground‑truth state vectors—while Execution translates LLM‑generated language into executable actions, performing syntax validation, conflict resolution, and real‑time feedback.
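The Execution module's role of translating LLM output into validated actions might look like the following sketch. The command grammar (`<verb> <x> <y>`), the `Action` dataclass, and the per-archetype capability table are hypothetical; the source only states that Execution performs syntax validation and rejects actions an agent cannot perform.

```python
from dataclasses import dataclass

# Hypothetical capability table derived from the archetype descriptions:
# firefighters are generalists, bulldozers clear but cannot rescue, etc.
VALID_ACTIONS = {
    "Firefighter": {"move_to", "cut_tree", "spray_water", "rescue"},
    "Bulldozer":   {"move_to", "clear_vegetation"},
    "Drone":       {"move_to", "scout"},
    "Helicopter":  {"move_to", "transport", "drop_water"},
}

@dataclass
class Action:
    agent_type: str
    verb: str
    target: tuple  # (x, y) grid coordinate

def parse_command(agent_type: str, command: str) -> Action:
    """Translate an LLM-issued command like 'spray_water 12 34' into a
    validated Action, rejecting verbs the archetype cannot perform."""
    parts = command.strip().split()
    if len(parts) != 3:
        raise ValueError(f"expected '<verb> <x> <y>', got: {command!r}")
    verb, x, y = parts
    if verb not in VALID_ACTIONS.get(agent_type, set()):
        raise ValueError(f"{agent_type} cannot perform {verb!r}")
    return Action(agent_type, verb, (int(x), int(y)))
```

Validating at this boundary keeps malformed or out-of-role LLM output from silently corrupting the simulation, and the raised errors can be fed back to the model as the real-time feedback the summary mentions.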
To assess performance, the authors define a suite of metrics covering scalability (agent count up to 2000), coordination efficiency (task allocation, communication overhead), adaptability (success rate of dynamic replanning under stochastic fire spread), spatial reasoning (accuracy of fire‑front prediction), and overall mission success (area saved, civilians rescued). They benchmark several state‑of‑the‑art LLM‑based multi‑agent frameworks (e.g., CAMEL, ChatDev, Lyfe Agents, Generative Agents) within CREW‑Wildfire. Results reveal that while these systems can achieve emergent collaboration in small‑scale, low‑complexity scenarios, they suffer dramatic performance drops as the number of agents grows or as the environment introduces complex terrain, variable wind, and long‑horizon objectives. Common failure modes include communication bottlenecks, role conflicts, inability to maintain a coherent long‑term plan, and poor spatial prediction of fire spread.
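A few of the mission-level metrics above can be computed straightforwardly from end-of-episode state. The function below is an illustrative aggregation under assumed definitions (e.g., area saved as the unburned fraction, communication overhead as messages per agent); the paper's exact formulas may differ.

```python
def mission_metrics(total_cells, burned_cells,
                    civilians_total, civilians_rescued,
                    messages_sent, n_agents):
    """Aggregate illustrative end-of-episode metrics.

    Assumed definitions, not the paper's exact ones:
    - area_saved: fraction of map cells left unburned
    - rescue_rate: fraction of civilians rescued
    - comm_overhead: messages sent per agent
    """
    return {
        "area_saved": 1.0 - burned_cells / total_cells,
        "rescue_rate": civilians_rescued / max(civilians_total, 1),
        "comm_overhead": messages_sent / max(n_agents, 1),
    }
```

Tracking comm_overhead alongside mission success is what surfaces the communication-bottleneck failure mode: frameworks whose message volume grows superlinearly with agent count tend to be the ones whose area_saved collapses at scale.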
The paper concludes that current LLM‑driven agentic AI lacks the meta‑reinforcement learning, external memory structures, and efficient communication protocols needed for truly large‑scale, physically grounded tasks. By releasing the full codebase, procedural map generators, scenario definitions, and baseline results, the authors provide the community with a reproducible, extensible platform for future research. CREW‑Wildfire thus fills a critical gap, enabling systematic study of scalability, robustness, and coordination in agentic multi‑agent systems and paving the way toward deploying such systems in real disaster response, infrastructure maintenance, and planetary exploration contexts.