EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks remain largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1,000+ steps when evaluated over 365 day-loops). EcoGym's evaluation is grounded in business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
💡 Research Summary
EcoGym is a newly introduced benchmark designed to evaluate the long‑horizon planning and execution capabilities of large language model (LLM)‑based agents in realistic, continuous economic environments. The authors argue that existing evaluation frameworks are either episodic, domain‑specific, or insufficiently grounded in persistent economic dynamics, which limits their ability to measure an agent’s true strategic competence over extended periods. To address this gap, EcoGym provides three diverse scenarios—Vending, Freelance, and Operation—each implemented with a unified decision‑making interface, a compact discrete action space (typically 4‑5 primitives), and an effectively unbounded temporal horizon (over 1,000 steps when simulated as 365‑day loops).
The benchmark's core design principles are: (1) a simple action space combined with an infinite-horizon setting, forcing agents to prioritize long-term strategic coherence rather than short-term reward chasing; (2) grounding evaluation metrics in tangible business outcomes—net worth for Vending, income for Freelance, and average daily active users (DAU) for Operation—so that performance directly reflects economic impact; and (3) embedding latent market mechanics (e.g., demand elasticity, seasonality, system decay coefficients) that are hidden from the agent's observation, requiring active hypothesis testing and causal discovery. Each environment follows an observation → goal → action → state-transition loop under partial observability and stochastic dynamics, with daily reports and immediate feedback serving as the only windows into the hidden state.
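The loop described above can be sketched in a few lines. The class and method names below (`ToyEnv`, `act`, `observe`, `final_metric`) are illustrative placeholders, not EcoGym's actual API; the point is the shape of the interaction: the agent only ever sees observations and daily reports, never the latent market state.

```python
import random

class ToyEnv:
    """Trivial stand-in environment so the loop runs end to end.
    A real EcoGym scenario would hide demand/decay parameters here."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.worth = 0.0

    def reset(self):
        self.worth = 0.0
        return {"worth": self.worth}

    def step(self, action):
        # Stochastic hidden dynamics; only a partial view is returned.
        self.worth += self.rng.uniform(0.0, 1.0)
        obs = {"worth": self.worth}
        report = {"daily_gain": self.worth}   # the "daily report" feedback channel
        return obs, report

    def final_metric(self):
        return self.worth                     # net worth / income / avg. DAU analogue

class NaiveAgent:
    """Placeholder policy; in EcoGym this would be an LLM-backed agent."""
    def act(self, obs):
        return "hold"                         # one of the 4-5 discrete primitives

    def observe(self, report):
        pass                                  # an LLM agent would update memory here

def run(env, agent, num_days=365):
    """Plan-and-execute loop: observe -> act -> transition, repeated daily."""
    obs = env.reset()
    for _ in range(num_days):
        action = agent.act(obs)
        obs, report = env.step(action)
        agent.observe(report)
    return env.final_metric()

final = run(ToyEnv(), NaiveAgent())
```

Because the horizon is effectively unbounded, anything the agent wants to carry across days (hypotheses about hidden mechanics, running statistics) must live in its own memory rather than in the observation.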
In Vending, the agent acts as a sole retailer, managing cash, inventory, pre‑paid orders, and dynamic pricing to maximize net worth. The hidden market parameters dictate demand through an Elastic Logit model, and inventory replenishment follows a lead‑time schedule. In Freelance, the agent operates as a gig‑economy worker, balancing task acquisition, execution, settlement, and wellness actions to maximize cumulative income while avoiding burnout, with stress, energy, and skill levels evolving over time. In Operation, the agent runs a digital content platform, allocating budget to user acquisition and tuning engagement mechanisms to counteract a zero‑attractor decay, aiming to keep DAU high.
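Two of the hidden mechanics mentioned above can be made concrete with toy formulas. The parameter names and values below are illustrative assumptions, not EcoGym's actual latent coefficients: a logistic (logit) purchase probability gives price-elastic demand for Vending, and a geometric "zero-attractor" decay pulls DAU toward zero in Operation unless acquisition spend replenishes it.

```python
import math
import random

def logit_demand(price, base_utility=2.0, price_sensitivity=1.2,
                 market_size=100, rng=None):
    """Sketch of elastic-logit demand: utility falls linearly in price,
    and each shopper buys with logistic probability of that utility."""
    rng = rng or random.Random(0)
    utility = base_utility - price_sensitivity * price
    buy_prob = 1.0 / (1.0 + math.exp(-utility))
    # Each of `market_size` shoppers buys independently.
    return sum(rng.random() < buy_prob for _ in range(market_size))

def dau_step(dau, decay=0.05, acquired=0.0):
    """One day of zero-attractor dynamics: without acquisition,
    DAU shrinks geometrically toward zero."""
    return dau * (1.0 - decay) + acquired

# Higher prices should sell fewer units on average.
cheap = logit_demand(1.0)    # utility 0.8  -> buy_prob ~ 0.69
pricey = logit_demand(4.0)   # utility -2.8 -> buy_prob ~ 0.06

# With zero acquisition spend, DAU drifts to the zero attractor.
dau = 1000.0
for _ in range(100):
    dau = dau_step(dau)
```

Since the agent never observes `price_sensitivity` or `decay` directly, it can only infer them from daily sales and DAU reports, which is exactly the hypothesis-testing behavior the benchmark is designed to elicit.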
The authors evaluated eleven state‑of‑the‑art LLMs (including Claude‑Sonnet‑4.5, DeepSeek‑v3.2, GLM‑4.7, GPT‑5‑Mini, Gemini‑3‑Flash, Grok‑4.1‑Fast, and Kimi‑k2) using identical prompts, context windows, and optional memory modules. Results show a systematic tension: no single model dominates across all three scenarios. Some models excel at high‑level strategic planning in Vending but falter in the nuanced health‑management loop of Freelance; others achieve strong income in Freelance but cannot sustain DAU in Operation. Diagnostic studies (eight in total) explore factors such as context‑window length, external memory augmentation, human baseline comparisons, action‑budget limits, and environment difficulty. These analyses reveal that current LLMs are relatively capable of forming broad strategies but struggle with precise action execution and rapid discovery of hidden mechanics, leading to sub‑optimal performance either in strategic direction or operational efficiency.
EcoGym is released as an open‑source, extensible platform (https://github.com/OPPO-PersonalAI/EcoGym), enabling the research community to add new economic domains, modify dynamics, and benchmark future agents under the same long‑horizon, utility‑driven criteria. The paper suggests several future research avenues: integrating meta‑reinforcement learning to jointly learn strategy and execution, introducing multi‑agent competition or cooperation to simulate market dynamics, and incorporating human‑in‑the‑loop feedback for safety and reliability. Overall, EcoGym provides a rigorous, economically meaningful testbed that pushes LLM‑based agents beyond short‑term tasks toward sustained, real‑world decision making.