📝 Original Info
- Title: Strengthening Real-Time Cooperative MARL Stability via LLM-Based Offline Environment Design
- ArXiv ID: 2511.19253
- Date: 2025-11-24
- Authors: Boyuan Wu
📝 Abstract
Cooperative multi-agent reinforcement learning (MARL) is constrained by two core design bottlenecks: dense reward engineering and curriculum design. Handcrafted rewards often misalign with global objectives, while static curricula fail to prevent agents from getting trapped in local optima. Large language models (LLMs) have been used to address some of these issues, but most work adopts an LLM-as-Agent paradigm that keeps the model in the control loop, incurring prohibitive latency and token costs for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), which shifts the LLM's role from online controller to offline training architect. MAESTRO treats training as a dual-loop optimization problem: an LLM shapes the learning environment, while a standard MARL backbone optimizes the policy. The LLM instantiates two generative components: (i) a semantic curriculum generator that synthesizes diverse, performance-conditioned traffic scenarios, and (ii) an automated reward and policy synthesizer that fills executable Python templates to produce reward shaping signals and prior policy logits. Prior policies are used only as a regularizer during actor updates (via MSE penalties that decay over training); the deployed controller is a standalone MADDPG policy with no LLM calls. Reward shaping is additively combined with environment rewards, with a difficulty-dependent weight. We evaluate MAESTRO on a 16-intersection real-world traffic network (Hangzhou) and compare three configurations: A2 (adaptive LLM curriculum with template rewards), A7 (full MAESTRO with LLM curriculum and LLM-shaped rewards), and A8 (alternative LLM curriculum with template rewards). A8 attains the highest mean return (160.23) but exhibits high variability (CV 2.95%). A7 achieves slightly lower mean return (158.87) but substantially higher stability: 2.6× lower variance than A2 (CV 0.76% vs. 1.95%) and 3.9× lower than A8, yielding a 2.3× higher risk-adjusted return (Sharpe Ratio 1.53 vs. 0.67). A7's worst seed (157.76) exceeds the baseline mean (157.04), providing strong deployment reliability. These results indicate that constrained, template-based LLM reward shaping combined with an adaptive curriculum produces robust cooperative policies suitable for real-time control. Code is available at https://github.com/BrianWu1010/MAESTRO.git.
📄 Full Content
MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization
Boyuan Wu∗
Department of Mechanical and Industrial Engineering
University of Toronto
Toronto, ON, Canada
Abstract
Cooperative multi-agent reinforcement learning (MARL) is constrained by two core design bottlenecks: dense reward engineering and curriculum design. Handcrafted rewards often misalign with global objectives, while static curricula fail to prevent agents from getting trapped in local optima. Large language models (LLMs) have been used to address some of these issues, but most work adopts an LLM-as-Agent paradigm that keeps the model in the control loop, incurring prohibitive latency and token costs for real-time systems.
We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), which shifts the LLM's role from online controller to offline training architect. MAESTRO treats training as a dual-loop optimization problem: an LLM shapes the learning environment, while a standard MARL backbone optimizes the policy. The LLM instantiates two generative components: (i) a semantic curriculum generator that synthesizes diverse, performance-conditioned traffic scenarios, and (ii) an automated reward and policy synthesizer that fills executable Python templates to produce reward shaping signals and prior policy logits. Prior policies are used only as a regularizer during actor updates (via MSE penalties that decay over training); the deployed controller is a standalone MADDPG policy with no LLM calls. Reward shaping is additively combined with environment rewards, with a difficulty-dependent weight.
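The two shaping mechanisms just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear weight schedule and the linear penalty decay below are assumptions, since the text only states that the shaping weight depends on curriculum difficulty and that the MSE penalty decays over training.

```python
import numpy as np

def shaped_reward(r_env, r_shape, difficulty, w_max=0.5):
    # Additive shaping with a difficulty-dependent weight.
    # The form w = w_max * difficulty is assumed for illustration.
    w = w_max * difficulty  # difficulty assumed normalized to [0, 1]
    return r_env + w * r_shape

def actor_loss(policy_logits, prior_logits, q_value, step, total_steps, lam0=1.0):
    # MADDPG-style actor objective (maximize Q, so minimize -Q) plus an
    # MSE penalty pulling the policy toward the LLM-synthesized prior logits.
    lam = lam0 * (1.0 - step / total_steps)  # assumed linear decay to zero
    mse = float(np.mean((np.asarray(policy_logits) - np.asarray(prior_logits)) ** 2))
    return -q_value + lam * mse
```

Because the penalty decays to zero, the prior only guides early training; by the end, the policy is optimized against the (shaped) environment return alone, consistent with the LLM-free deployed controller.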
We evaluate MAESTRO on a 16-intersection real-world traffic network (Hangzhou) and compare three configurations: A2 (adaptive LLM curriculum with template rewards), A7 (full MAESTRO with LLM curriculum and LLM-shaped rewards), and A8 (alternative LLM curriculum with template rewards). A8 attains the highest mean return (160.23) but exhibits high variability (CV 2.95%). A7 achieves slightly lower mean return (158.87) but substantially higher stability: 2.6× lower variance than A2 (CV 0.76% vs. 1.95%) and 3.9× lower than A8, yielding a 2.3× higher risk-adjusted return (Sharpe Ratio 1.53 vs. 0.67). A7's worst seed (157.76) exceeds the baseline mean (157.04), providing strong deployment reliability. These results indicate that constrained, template-based LLM reward shaping combined with an adaptive curriculum produces robust cooperative policies suitable for real-time control.
Code is available at https://github.com/BrianWu1010/MAESTRO.git.
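The stability metrics quoted above can be computed from per-seed returns with two short helpers. A sketch only: the paper's exact Sharpe-ratio convention (in particular, which baseline is subtracted) is not spelled out here, so the `baseline` argument is an assumption.

```python
import statistics

def cv_percent(returns):
    # Coefficient of variation: sample standard deviation / mean, in percent.
    return 100.0 * statistics.stdev(returns) / statistics.mean(returns)

def sharpe_ratio(returns, baseline):
    # Risk-adjusted return: mean excess over a baseline, per unit of std.
    return (statistics.mean(returns) - baseline) / statistics.stdev(returns)
```

A low CV with a mean excess over the baseline is exactly the property claimed for A7: every seed, including the worst, lands above the baseline mean.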
1 Introduction
Cooperative multi-agent reinforcement learning (MARL) is a natural framework for decentralized control problems such as warehouse logistics, sensor networks, and urban traffic signal control (TSC).
∗Email: boyuan.wu@mail.utoronto.ca
Preprint.
arXiv:2511.19253v2 [cs.LG] 10 Dec 2025
Figure 1: MAESTRO workflow. The LLM (Architect) generates curriculum contexts and parameterizes executable templates for reward shaping and prior policy logits. MARL agents (Learners) train under this shaped environment using MADDPG. Performance feedback drives curriculum updates and, in A7, periodic regeneration of reward and policy templates. LLM calls occur only during training, not deployment.
In principle, centralized training with decentralized execution allows agents to coordinate via learned policies while acting on local observations. In practice, scaling MARL to realistic, non-stationary environments remains difficult due to two persistent design challenges: curriculum design and reward engineering.
Curriculum learning (CL) addresses exploration by exposing agents to gradually harder tasks. However, effective curricula require domain knowledge to ensure that tasks are both diverse and aligned with the target deployment regime [1]. Reward design is equally critical: dense rewards may misspecify long-term objectives [2], while sparse global rewards provide weak learning signals. Crafting rewards that generalize across dynamic regimes is labor-intensive and brittle.
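To make the dense-versus-sparse tension concrete in the TSC setting, consider two hypothetical reward signals (illustrative stand-ins, not the paper's actual rewards): a dense per-step queue penalty that is easy to learn from but can misalign with long-horizon throughput, and a sparse episode-end throughput reward that is aligned but provides no signal until the episode ends.

```python
def dense_reward(queue_lengths):
    # Dense, local: penalize the current queue at each intersection every step.
    # Informative gradient, but agents may learn to push congestion onto neighbors.
    return [-q for q in queue_lengths]

def sparse_reward(episode_done, total_throughput):
    # Sparse, global: aligned with the true objective, but zero until the
    # episode ends, giving agents little to learn from early in training.
    return float(total_throughput) if episode_done else 0.0
```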
LLMs offer rich semantic priors that could automate both curricula and rewards. Yet most existing integrations follow an LLM-as-Agent paradigm: the LLM directly selects actions or high-level decisions in the control loop [3–5]. This can yield strong zero-shot behavior but incurs high latency and token costs, limiting its applicability to real-time, safety-critical systems. It also underuses the LLM's strength as a high-level designer.
Our approach: LLM-as-Architect. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), which repositions the LLM as an offline architect of the training process. Rather than emitting actions, the LLM designs the training environment: it generates curriculum contexts and fills parameterized templates for reward shaping and prior policy logits. A standard MADDPG backbone then optimizes a policy under this shaped environment and reward. MAE
…(Full text truncated)…
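Putting the pieces together, the dual loop implied by Figure 1 can be sketched as below. Every name here is a hypothetical stand-in: a real run would call an LLM in the outer loop and run MADDPG episodes in the inner loop, whereas this sketch substitutes simple deterministic and random placeholders.

```python
import random

def generate_curriculum(perf_history):
    # Outer-loop stand-in: the LLM would synthesize a performance-conditioned
    # traffic scenario; here, difficulty simply tracks recent returns.
    recent = perf_history[-3:]
    avg = sum(recent) / len(recent) if recent else 0.0
    return {"difficulty": min(1.0, avg / 150.0)}

def fill_templates(context):
    # Stand-in for the LLM filling executable reward/policy templates.
    w = 0.5 * context["difficulty"]
    reward_fn = lambda r_env, r_shape: r_env + w * r_shape
    prior_fn = lambda obs: [0.0] * 4  # neutral prior logits
    return reward_fn, prior_fn

def train_maestro(n_phases=3, episodes_per_phase=5, seed=0):
    rng = random.Random(seed)
    perf_history = []
    for _ in range(n_phases):                      # outer loop: LLM architect
        context = generate_curriculum(perf_history)
        reward_fn, prior_fn = fill_templates(context)
        for _ in range(episodes_per_phase):        # inner loop: MADDPG training
            r_env = rng.uniform(100.0, 160.0)      # stand-in episode return
            perf_history.append(reward_fn(r_env, 1.0))
    return perf_history                            # deployment uses no LLM calls
```

The key structural point survives the simplification: all LLM work happens between training phases, so the inner loop (and the deployed policy) never blocks on an LLM call.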
Reference
This content is AI-processed based on ArXiv data.