RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments



Linghua Zhang¹, Jun Wang², Jingtong Wu², Zhisong Zhang†
¹Ant Group  ²City University of Hong Kong
zlh20011228@gmail.com, nanrong.wj@ant-intl.com, jingtong.wujt@ant-intl.com, zhisong.zhang@cityu.edu.hk

Abstract

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks, yet their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution, enabling adaptive and interpretable strategy evolution over time. This design is crucial for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

1 Introduction

Recent large language models (LLMs), particularly when augmented with reasoning and tool-use capabilities, have demonstrated strong performance on a variety of cognitively demanding tasks, including code editing, mathematical problem solving, and complex information retrieval (Jimenez et al., 2024; Phan et al., 2025; Gao et al., 2024). However, accumulating empirical evidence indicates that such capabilities fail to generalize into robust, domain-agnostic autonomy, especially in realistic settings requiring long-horizon planning, persistent objective alignment, and stable behavioral consistency—core prerequisites for participation in real-world economic systems (Amodei, 2024; Kwa et al., 2025; METR, 2025). Correspondingly, existing agent benchmarks, spanning web interaction and related domains (Mialon et al., 2023; Wei et al., 2025; Zhou et al., 2024; Deng et al., 2023; Jimenez et al., 2024; Team, 2025b), primarily focus on short-horizon or highly structured tasks, limiting their ability to evaluate sustained interaction with complex, evolving environments. Recent studies on long-horizon autonomy consistently demonstrate that even state-of-the-art agents struggle to maintain coherent strategies over extended time spans (Nof1.ai, 2025; Andon Labs, 2025; Backlund and Petersson, 2025).

To systematically study this challenge, we introduce RetailBench, a new benchmark grounded in real-world commercial data and informed by established economic modeling principles. RetailBench centers on a supermarket operation scenario that demands long-horizon decision-making, sustained interaction with a dynamic environment, and the integration of heterogeneous historical information. The benchmark evaluates whether LLM-based agents can autonomously sustain realistic business operations under complex, multi-factor conditions.

We evaluate eight state-of-the-art LLMs in this environment. Our results show that current models struggle to maintain stable decision quality as the decision space expands and often fail to incorporate all relevant information.
Moreover, hallucinations and economically irrational behaviors frequently emerge during long-horizon execution, leading to environment collapse and preventing sustained autonomous operation.

Our contributions are summarized as follows:

• We introduce RetailBench, a high-fidelity benchmark for evaluating long-horizon autonomous decision-making in realistic retail environments.
• We propose the Evolving Strategy & Execution agent framework, which improves operational stability compared to a Reflection-based baseline.
• Through extensive experiments, we identify systematic failure modes of current LLM-based agents in long-horizon, multi-factor decision-making settings.

Figure 1: Overview of the hierarchical supermarket environment, illustrating intra-day agent–environment interactions and end-of-day state transition dynamics. (The diagram shows the intra-day loop of agent actions—price adjustment, replenishment orders, info queries, memory read/write, end_today—and the end-of-day pipeline: customer traffic sampling, sales volume update, reviews and returns generation, inventory update, and financial and exogenous update; episodes terminate if the store cannot pay rent for five consecutive days.)

2 Environment Construction

2.1 Problem Formulation and Overview

We model supermarket operations as a Markov Decision Process (MDP), in which an autonomous agent manages a single retail store over a finite horizon of days. At each day $t$, the agent makes a sequence of operational decisions that jointly determine the store's daily outcomes.

Formally, the MDP is defined as $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mathcal{T}$ the stochastic transition dynamics, $R$ the reward function, and $\gamma \in (0, 1]$ the discount factor.
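To make the interaction pattern concrete, here is a minimal mock of the day/step loop. This is an illustrative sketch only: the class, method names, and scripted policy are our own, not the benchmark's actual API; the `end_today` action corresponds to the day-transition trigger shown in Figure 1.

```python
# Minimal stand-in for the RetailBench interaction loop (illustrative only).
# A "day" advances only when the agent issues the end_today action;
# all other actions are intra-day steps.

class MockRetailEnv:
    def __init__(self, horizon_days=3):
        self.day, self.step_in_day, self.horizon = 0, 0, horizon_days

    def step(self, action):
        if action == "end_today":          # day-transition trigger
            self.day, self.step_in_day = self.day + 1, 0
        else:                              # intra-day action (price, order, query, ...)
            self.step_in_day += 1
        done = self.day >= self.horizon    # stand-in termination condition
        return (self.day, self.step_in_day), done

env = MockRetailEnv(horizon_days=2)
trace, done = [], False
while not done:
    for action in ["query_sales", "set_price", "end_today"]:  # scripted toy policy
        state, done = env.step(action)
        trace.append(state)
        if done:
            break
```

In the real benchmark the agent chooses how many intra-day actions to take before ending the day; the fixed three-action script above is purely for illustration.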
At the beginning of each day $t$, the agent observes the initial state $S_{t,0}$ and executes a sequence of intra-day actions indexed by $k$. Each action $a_{t,k} \in \mathcal{A}$ induces a transition to the next intra-day state $S_{t,k+1}$. When the agent issues the end-today action, the environment transitions to the initial state of the next day, $S_{t+1,0}$, according to the transition dynamics $\mathcal{T}$. Figure 1 illustrates the overall interaction process.

The environment supports long-horizon operation over more than one thousand simulated days, with episodes terminating if the store fails to pay rent for five consecutive days.

2.2 State Space

The intra-day state $S_{t,k}$ summarizes the complete operational context of the store at step $k$ of day $t$ and is composed of multiple interdependent components:

$$S_{t,k} = \left( S^{\text{prod}}_{t,k},\ S^{\text{inv}}_{t,k},\ S^{\text{sup}}_{t,k},\ S^{\text{dem}}_{t,k},\ S^{\text{ext}}_{t,k},\ S^{\text{fin}}_{t,k} \right).$$

• $S^{\text{prod}}_{t,k}$ and $S^{\text{inv}}_{t,k}$ encode product-level attributes and on-hand inventory status, including prices, shelf life, and historical sales records. Product demand is grounded in real-world retail data derived from the Dominick's dataset (Kilts Center for Marketing).
• $S^{\text{sup}}_{t,k}$ represents the supply chain state, including supplier prices, quality levels, and delivery lead times, constructed to reflect empirically observed price–quality relationships (Grewal et al., 2014).
• $S^{\text{dem}}_{t,k}$ captures demand-side signals such as recent customer traffic and aggregated review statistics, which influence consumer purchasing behavior (Fedewa et al., 2021).
• $S^{\text{ext}}_{t,k}$ represents external contextual information, including active news events with market-, category-, or product-level impact scope, synthesized to resemble real-world financial news (ashraq, 2025).
• $S^{\text{fin}}_{t,k}$ denotes the financial state of the store, including available cash and estimated net worth, accounting for inventory depreciation over product shelf life.
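The state decomposition above can be mirrored by a simple container; the field names, types, and default budget below are illustrative choices of ours, not the benchmark's actual schema.

```python
# Illustrative container mirroring S_{t,k} = (S^prod, S^inv, S^sup, S^dem, S^ext, S^fin).
# Field names and types are assumptions for exposition, not RetailBench's schema.
from dataclasses import dataclass, field

@dataclass
class StoreState:
    products: dict = field(default_factory=dict)   # S^prod: price, shelf life, sales history
    inventory: dict = field(default_factory=dict)  # S^inv: on-hand units per SKU
    supply: dict = field(default_factory=dict)     # S^sup: supplier price / quality / lead time
    demand: dict = field(default_factory=dict)     # S^dem: traffic, aggregated review stats
    external: list = field(default_factory=list)   # S^ext: active news events
    cash: float = 10_000.0                         # S^fin: available funds
    net_worth: float = 10_000.0                    # S^fin: cash + depreciated inventory value

s = StoreState()
s.inventory["milk-1L"] = 40  # hypothetical SKU for illustration
```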
Together, these components define a richly structured yet partially stochastic state space. Further details of the environment design are deferred to Appendix A.

2.3 Action Space

At each time step $t$, the agent selects an action $a_t \in \mathcal{A}$, defined as a tuple of decision components: $a_t = \left( a^{\text{price}}_t, a^{\text{repl}}_t, a^{\text{info}}_t, a^{\text{mem}}_t, a^{\text{end}}_t \right)$. Each component corresponds to a distinct class of operational decisions:

• $a^{\text{price}}_t$ denotes pricing decisions for individual SKUs;
• $a^{\text{repl}}_t$ denotes inventory replenishment decisions, including supplier selection and order quantities;
• $a^{\text{info}}_t$ denotes information acquisition actions, such as querying sales history, customer reviews, supplier conditions, or current news;
• $a^{\text{mem}}_t$ denotes memory operations that allow the agent to write and retrieve persistent notes across days;
• $a^{\text{end}}_t$ denotes a day-termination action that concludes the current decision phase.

Within a single day, the agent may execute multiple information queries and operational adjustments before issuing the $a^{\text{end}}_t$ action to advance the environment to the next day.

2.4 Day Transition Dynamics

After the agent issues the day-termination action, the environment transitions to the next day according to $S_{t+1,0} \sim \mathcal{T}(\cdot \mid S_{t,k}, a_t)$, where the transition function $\mathcal{T}$ factorizes into a sequence of stochastic updates that jointly model daily supermarket operations. Specifically, each day-level transition consists of the following steps:

1. Customer traffic sampling. The total customer traffic for the day is sampled according to stochastic demand dynamics.
2. Sales realization. Given the sampled customer traffic and exogenous factors (e.g., promotions, seasonal effects, and news signals), the realized sales volume of each product is determined.
3. Reviews and returns generation. Customer reviews and product return events are generated conditional on sales outcomes, product quality, and external influences.
4.
Inventory update. Sold units are deducted from on-hand inventory, and replenishment orders scheduled to arrive on the current day are added to stock.
5. Financial and exogenous state update. The agent's financial state is updated based on realized revenues and costs, after which new exogenous information (e.g., news or market signals) for the next day is generated.

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| Framework: Evolving Strategy & Execution | | | | | | |
| GLM-4.6 | 52.40 | 174.34 | 124.67 | 0.0773 | 0.1293 | 58 |
| Kimi-K2 (Thinking) | 54.25 | 260.68 | 168.72 | 0.0239 | 0.1179 | 58 |
| GPT-5.2 | 81.00 | 457.21 | 358.27 | 0.0660 | 0.1141 | 81 |
| Average (3 models) | 62.55 | 297.41 | 217.22 | 0.0557 | 0.1204 | 65.67 |
| Framework: Reflection (Day-Level) | | | | | | |
| GLM-4.6 | 55.00 | 160.70 | 125.67 | 0.0194 | 0.1176 | 62 |
| Kimi-K2 (Thinking) | 58.33 | 216.51 | 184.01 | 0.0964 | 0.1255 | 71 |
| GPT-5.2 | 64.00 | 283.88 | 260.88 | 0.1774 | 0.0887 | 64 |
| Average (3 models) | 59.11 | 220.36 | 190.19 | 0.0977 | 0.1106 | 65.67 |
| Framework: Reflection (Step-Level) | | | | | | |
| GLM-4.6 | 51.67 | 92.35 | 77.18 | 0.0148 | 0.1072 | 57 |
| Kimi-K2 (Thinking) | 51.67 | 181.19 | 111.29 | 0.0353 | 0.1362 | 53 |
| GPT-5.2 | 56.00 | 398.71 | 324.01 | 0.1536 | 0.1048 | 56 |
| Average (3 models) | 53.11 | 224.08 | 170.83 | 0.0679 | 0.1161 | 55.33 |
| Framework: Plan-and-Act | | | | | | |
| GLM-4.6 | 48.33 | 231.35 | 113.01 | 0.0198 | 0.1096 | 55 |
| Kimi-K2 (Thinking) | 48.67 | 170.26 | 105.36 | 0.0000 | 0.1056 | 51 |
| GPT-5.2 | 64.00 | 323.88 | 193.02 | 0.0152 | 0.1096 | 64 |
| Average (3 models) | 53.67 | 241.83 | 137.13 | 0.0117 | 0.1083 | 56.67 |
| Heuristic Policy (Upper Bound, Easy) | | | | | | |
| Hand-crafted Policy | 180.00 | 674.18 | 729.46 | 0.0266 | 0.0070 | 180 |

Table 1: Performance comparison of three representative large language models under four agent frameworks in the Easy environment. A hand-crafted heuristic policy is included as an approximate upper bound.

3 Evolving Strategy & Execution Framework

3.1 Framework Detail

Prior agent frameworks, including ReAct, Reflection, and Plan-and-Act (Erdogan et al., 2025a; Yao et al., 2023b), either do not maintain a persistent global strategy, leading to inconsistent behaviors, or revise strategies during action execution, which can induce oscillation and gradual goal drift in long-horizon environments.

To address these limitations, we propose an explicit two-stage interaction framework, termed Evolving Strategy & Execution, that separates strategic deliberation from operational execution. The framework maintains a persistent global strategy and restricts strategy updates to a day-level granularity to prevent excessive short-term fluctuations.

In the Evolving Strategy stage, the agent may invoke observation and analysis tools to examine environmental feedback and historical outcomes, but cannot execute actions that directly modify the environment. Based on this information, it determines whether to revise the inherited strategy by adding, refining, or removing components. This stage ensures that long-term intent and planning logic remain explicit rather than implicitly entangled with immediate action choices.

In the Execution stage, the strategy is fixed and treated as immutable. The agent generates concrete actions strictly consistent with the current strategy, without further modification. This controlled separation enables clearer attribution between strategy and behavioral outcomes.
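The two-stage decomposition can be sketched as a simple control loop: strategy revision happens only at day boundaries, and intra-day execution treats the strategy as frozen. Everything here is a toy sketch under our own assumptions; in the framework itself both stages are LLM calls, not the hand-written rules below.

```python
# Sketch of the Evolving Strategy & Execution loop (illustrative rules,
# standing in for LLM calls in both stages).

def revise_strategy(strategy, feedback):
    """Evolving Strategy stage: read-only analysis that may edit the strategy."""
    if feedback.get("expiry_ratio", 0.0) > 0.10:          # toy revision rule
        strategy = dict(strategy, order_multiplier=strategy["order_multiplier"] * 0.8)
    return strategy

def act(strategy, k):
    """Execution stage: emit actions strictly consistent with the frozen strategy."""
    if k == 0:
        return ("order", round(10 * strategy["order_multiplier"]))
    return ("end_today", None)

strategy = {"order_multiplier": 1.0}
actions_per_day = []
for day in range(2):
    feedback = {"expiry_ratio": 0.15 if day == 1 else 0.02}
    strategy = revise_strategy(strategy, feedback)        # day-level granularity only
    day_actions = [act(strategy, k) for k in range(2)]    # strategy immutable intra-day
    actions_per_day.append(day_actions)
```

The point of the structure is that `act` never mutates `strategy`, so behavioral changes can be attributed to explicit day-level revisions rather than drift during execution.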
By alternating between these stages, the framework enforces a principled decomposition of decision-making, reduces uncontrolled strategy drift, and promotes behavioral stability and interpretability over extended horizons. An illustration is provided in Appendix D.1.

3.2 Hierarchical Policy Representation

To support structured, interpretable, and temporally extended decision-making under the proposed framework, we represent the agent policy using a hierarchical abstraction that separates strategic intent from executable actions. Each policy consists of three conceptual layers:

1. Macro Strategy, which captures high-level managerial objectives that persist across multiple decision steps;
2. Execution Strategy, which encodes structured operational guidance in a machine-readable intermediate representation;
3. Daily Actions, which specify concrete executable operations issued to the environment.

Detailed policy configurations and example policies are provided in Appendix A.2.1.

4 Experiment Settings

We conduct experiments under three environment configurations with increasing levels of difficulty. These configurations vary in market complexity, budget constraints, and the presence of exogenous dynamics.

4.1 Environment Configurations

We employ a heuristic policy with full access to the environment's internal state as a calibration baseline for each environment variant. Environment parameters are tuned such that the heuristic policy remains stable across different difficulty levels while still experiencing meaningful operational pressure, as detailed in Appendix A.4. To evaluate models' information-processing and external perception capabilities, we design three environment configurations:

• Easy: A controlled environment without dynamic news events or adaptive supplier price–quality relationships. The market contains five product categories.
The agent is initialized with a budget of 10,000 and incurs a fixed daily rent of 250.
• Middle: A moderately complex environment that expands the product space to all twenty categories while still excluding dynamic news events and supplier adaptations. The initial budget is increased to 50,000, with a daily rent of 1,000.
• Hard: The most challenging and realistic environment, incorporating dynamically generated news events and time-varying supplier price–quality relationships. The market includes all twenty product categories. The agent starts with a budget of 50,000, pays a daily rent of 1,000, and receives twenty news items per day.

Detailed specifications of all environment configurations are provided in Appendix A.3.

4.2 Evaluation Metrics

Metrics. We evaluate store-level operational performance using the following metrics (↑ indicates higher is better; ↓ indicates lower is better):

• Days (↑): the number of operating days before episode termination;
• MaxDays (↑): the maximum number of operating days achieved across three rollouts;
• Avg. Daily Sales (↑): the average number of items sold per day;
• Avg. Daily Income (↑): the average money earned per day;
• Expiry Ratio (↓): the fraction of products that expire before being sold;
• Return Ratio (↓): the fraction of sold products that are returned by customers.

All reported metrics are averaged over three independent rollouts, each subject to a fixed maximum execution horizon.

4.3 Experimental Setup

Agent Frameworks. Preliminary experiments indicate that simple ReAct-style interaction frameworks are unstable in long-horizon settings, frequently exhibiting premature episode termination during mid-horizon execution. This observation motivates the evaluation of agent frameworks that explicitly support sustained strategic control.
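The store-level metrics of Section 4.2 can be computed from simple per-day rollout logs. The log schema below is an assumption made for illustration; the benchmark's internal logging format may differ.

```python
# Computing the Section 4.2 metrics from a hypothetical per-day log.
# "stocked" counts units that entered inventory, used as the expiry denominator
# here; this denominator choice is our assumption.
days = [
    {"sold": 120, "income": 90.0, "expired": 5,  "returned": 6, "stocked": 150},
    {"sold": 100, "income": 70.0, "expired": 10, "returned": 4, "stocked": 140},
]

n_days = len(days)                                                  # Days
avg_daily_sales = sum(d["sold"] for d in days) / n_days             # Avg. Daily Sales
avg_daily_income = sum(d["income"] for d in days) / n_days          # Avg. Daily Income
expiry_ratio = sum(d["expired"] for d in days) / sum(d["stocked"] for d in days)
return_ratio = sum(d["returned"] for d in days) / sum(d["sold"] for d in days)
```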
We compare our proposed Evolving Strategy & Execution framework with step-level Reflection, day-level Reflection, and the Plan-and-Act baseline (Shinn et al., 2023; Erdogan et al., 2025a). For the Reflection framework, we implement two variants with different reflection frequencies: (1) step-level reflection, where the agent reflects after each action, and (2) day-level reflection, where reflection is performed once per day after completing daily actions. Full prompt specifications for all frameworks are provided in Appendix C.

Models and Evaluation Protocol. We evaluate a diverse set of contemporary large language models, including Qwen-235B (Thinking) (Team, 2025a), Kimi K2 (Thinking) (Team et al., 2025b), GLM-4.6 (Team et al., 2025a), DeepSeek-V3.2-Exp (DeepSeek-AI et al., 2025), Gemini-3-Flash-Preview (Google DeepMind, 2024), Grok-4.1 Fast (xAI, 2025), GPT-5-Mini (OpenAI, 2025a), and GPT-5.2 (OpenAI, 2025b). To manage the substantial token costs of long-horizon rollouts, lower-cost variants are used for closed-source models when available.

Each model is evaluated across three environment configurations, with three independent rollouts per configuration; results are averaged across runs. To isolate the effect of the agent framework, we conduct controlled comparisons between our framework and baseline methods in the Easy environment under identical conditions. We further implement a hand-crafted policy based on privileged internal knowledge unavailable to the agents, which serves as an approximate upper bound on performance.

5 Results

5.1 Performance Comparison between Agent Frameworks

Due to computational budget constraints, we conduct a comprehensive four-framework comparison on three representative models: GLM-4.6, Kimi-K2 (Thinking), and GPT-5.2. As reported in Table 1, Evolving Strategy & Execution consistently outperforms alternative agent frameworks on core metrics, achieving higher sales and profit while substantially reducing product expiration rates.

Figure 2: Category-level sales and profit per category across the Easy, Middle, and Hard environments. Results are shown for three representative models (Grok-4.1 Fast, Gemini-3 (Fast), DeepSeek-V3.2 (Exp.)).

Based on these results, we select the two strongest-performing frameworks for broader evaluation across all eight models. When averaged over all models, our proposed framework demonstrates clear improvements over Reflection (Day-Level).

Despite these gains, a substantial gap remains relative to the hand-crafted heuristic policy, which serves as an approximate upper bound. This gap highlights the current limitations of LLM-based agents in long-horizon decision-making. Notably, Table 1 also suggests that larger closed-source models, such as GPT-5.2, tend to achieve more stable and sustained store operation compared to smaller or open-source counterparts, indicating that model capacity remains an important factor alongside framework design.

Table 8 presents the complete results across all eight evaluated models. Across models, we observe recurring and systematic failure modes in the current environment, suggesting that the performance bottlenecks are structural rather than model-specific.

5.2 Performance across Environments with Varying Difficulty

As environment difficulty increases, all models exhibit consistent performance degradation, including shorter operational durations, higher expiry and return ratios, and reduced long-horizon stability.
Although the expansion of product categories in the Middle and Hard settings enables higher aggregate sales and profit for some models, Sales per Category and Profit per Category decline substantially (Figure 2), indicating persistent challenges in effective resource allocation within increasingly high-dimensional decision spaces. The relatively small performance gap between the Middle and Hard environments can be partly attributed to the delayed impact of intensified news dynamics, whose effects typically unfold over longer time horizons.

| Model | Context | Easy SKU ↑ | Easy Cat ↑ | Middle SKU ↑ | Middle Cat ↑ | Hard SKU ↑ | Hard Cat ↑ |
|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 128K | 4.75 | 3.31 | 6.83 | 5.34 | 6.64 | 5.45 |
| Gemini-3 (Fast) | 1M | 8.72 | 4.34 | 13.60 | 12.43 | 21.57 | 13.56 |
| GLM-4.6 | 200K | 4.82 | 2.98 | 3.23 | 2.46 | 7.08 | 5.22 |
| OpenAI-5.1 Mini | 400K | 3.66 | 1.94 | 4.10 | 4.10 | 6.67 | 4.79 |
| Grok-4.1 Fast | 200K | 5.78 | 3.68 | 4.87 | 3.08 | 5.40 | 3.20 |
| Kimi-K2 (Thinking) | 256K | 5.06 | 3.09 | 5.17 | 4.77 | 6.91 | 4.75 |
| Qwen-235B (Thinking) | 256K | 4.59 | 3.67 | 3.17 | 2.36 | 5.11 | 4.51 |
| Heuristic Strategy | – | 9.03 | 4.87 | 35.34 | 18.61 | 34.82 | 18.52 |

Table 2: SKU and category counts sold by different models across difficulties. Context denotes the maximum supported context length of each model. Bold indicates the best model; underlined indicates the second best (Human excluded).

Appendix B.3 reports the full results across all environments under the proposed framework. Overall, while models demonstrate limited adaptability as task complexity increases, their performance remains substantially below the heuristic upper bound, highlighting persistent limitations in robust long-horizon decision-making under complex and dynamic conditions.

6 Analysis

Based on both quantitative evaluation results and manual inspection of trajectories, we identify several key factors that explain why current models fail to operate reliably in the environment. We analyze these failure modes from four complementary perspectives.
6.1 Non-scalable Decision-Making Capability

As shown in Appendix B.3, models achieve reasonable performance in the Easy setting but exhibit consistent degradation as the environment scales in complexity. To better understand this phenomenon, we analyze both the average number of stock keeping units (SKUs) and product categories sold per day, as well as the number of SKUs and categories explicitly considered during the Evolving Strategy phase (Table 2 and Appendix B.6).

Across most models, decision-making capability does not scale proportionally with the size of the environment. Instead, performance remains relatively flat despite a substantial increase in the number of available options. Notably, Gemini-3 (Fast), which supports the largest maximum context window, performs better than other models in this respect, suggesting that larger context capacity helps retain salient information even when the effective interaction context is constrained.

Nevertheless, even the strongest models fail to cover the full decision space. This indicates that current systems are unable to expand their effective decision-making scope as the environment grows, leading to systematic performance degradation in larger and more complex settings.

6.2 Incomplete Decision-Making Due to Limited Information Coverage

We analyze the information sources that models attend to when making operational decisions by examining the data queried for the set of SKUs included in each day's final strategy. This analysis reveals a clear concentration of attention on a limited subset of signals.

Specifically, as shown in Table 3, most models primarily rely on supplier prices, inventory levels, SKU ratings, and historical sales records when deciding how to operate selected SKUs. In contrast, several other critical signals—such as recent customer reviews, return rates, and current selling prices—are consistently underutilized or entirely ignored.

| Model | Supplier | Inventory | Return | Rating | Price | Review | Sales | History |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 41.4 | 100.0 | 26.7 | 75.4 | 6.6 | 0.6 | 89.2 | 0.0 |
| Gemini-3 (Fast) | 69.6 | 100.0 | 41.0 | 82.3 | 67.7 | 6.3 | 87.6 | 2.9 |
| GLM-4.6 | 93.9 | 98.3 | 79.8 | 77.8 | 84.1 | 5.1 | 96.3 | 0.0 |
| OpenAI-5.1 Mini | 58.8 | 99.0 | 5.1 | 78.6 | 3.9 | 10.6 | 84.7 | 0.3 |
| Grok-4.1 Fast | 89.9 | 99.8 | 84.0 | 94.0 | 88.6 | 36.0 | 96.4 | 0.3 |
| Kimi-K2 (Thinking) | 57.2 | 90.4 | 25.9 | 51.4 | 36.2 | 11.0 | 64.0 | 0.5 |
| Qwen-235B (Thinking) | 76.8 | 78.3 | 31.5 | 58.2 | 19.8 | 23.6 | 74.6 | 0.0 |
| Average | 69.7 | 95.1 | 42.0 | 74.0 | 43.8 | 13.3 | 84.7 | 1.0 |

Table 3: Percentage of days on which each model queries specific information sources when making decisions for SKUs included in the final daily strategy. Higher values indicate greater reliance on the corresponding data source.

Further correlation analysis in Appendix B.5 shows a strong positive relationship between the frequency of SKU review queries and average daily sales performance. This observation aligns with the underlying environment dynamics and indicates that incomplete information coverage is a key factor limiting decision quality. Overall, these results suggest that models often fail to perform sufficiently comprehensive information gathering, leading to systematically suboptimal operational decisions.

6.3 Temporal Instability in Execution-Level Decision-Making

Beyond limitations in decision scalability and information coverage, we identify temporal instability in execution-level decision-making as a key contributor to long-horizon failure. Even under relatively stable environmental conditions, agents frequently revise their strategies across consecutive days, leading to inconsistent execution trajectories.

Measuring Strategy Similarity. We quantify temporal instability by measuring the similarity between strategies on adjacent days at both macro and execution levels. Macro strategy similarity is assessed using an LLM-based prompt that evaluates semantic consistency between consecutive high-level plans. Execution strategy similarity is computed via set-based Jaccard similarity over key fields, including focus_skus, sku_supplier_mapping, news_to_monitor, and sku_to_monitor, with the final execution similarity obtained by averaging across fields.

| Model | Macro Std_diff ↓ | Macro MAC ↓ | Macro TV ↓ | Exec. Std_diff ↓ | Exec. MAC ↓ | Exec. TV ↓ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 0.130 | 0.090 | 5.00 | 0.289 | 0.144 | 7.93 |
| Gemini-3 (Fast) | 0.136 | 0.085 | 3.82 | 0.293 | 0.225 | 10.18 |
| GLM-4.6 | 0.131 | 0.096 | 4.52 | 0.264 | 0.209 | 9.87 |
| OpenAI-5.1 Mini | 0.069 | 0.051 | 2.50 | 0.229 | 0.166 | 8.00 |
| Grok-4.1 Fast | 0.079 | 0.045 | 2.71 | 0.204 | 0.151 | 9.39 |
| Kimi-K2 (Thinking) | 0.179 | 0.133 | 6.95 | 0.335 | 0.271 | 14.08 |
| Qwen-235B (Thinking) | 0.240 | 0.189 | 6.51 | 0.391 | 0.306 | 10.57 |

Table 4: Temporal instability metrics of macro- and execution-level strategy similarity in the Easy environment. All metrics are lower-is-better (↓). Larger values indicate greater temporal instability. Column-wise maximum values are highlighted in bold.

Instability Metrics. To characterize temporal fluctuations, we compute three complementary metrics: the standard deviation of first-order differences (Std_diff) to capture short-term volatility, the mean absolute change (MAC) to estimate typical day-to-day variation, and total variation (TV) to reflect cumulative long-term instability.
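These similarity and instability measures are straightforward to compute. The sketch below follows the definitions in the text; the uniform averaging across fields and the use of population standard deviation are our assumptions where the paper leaves details to the appendix.

```python
# Set-based Jaccard similarity (execution-level) and the three
# instability metrics Std_diff, MAC, TV over a per-day series.
import statistics

def jaccard(a, b):
    """Jaccard similarity over one strategy field (e.g. focus_skus)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def instability(series):
    """Std_diff, MAC, TV computed from first-order day-to-day differences."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    std_diff = statistics.pstdev(diffs)             # volatility of daily changes
    mac = sum(abs(d) for d in diffs) / len(diffs)   # mean absolute change
    tv = sum(abs(d) for d in diffs)                 # total variation
    return std_diff, mac, tv

# Hypothetical focus_skus sets on two adjacent days:
sim = jaccard({"sku_1", "sku_2", "sku_3"}, {"sku_2", "sku_3", "sku_4"})  # 0.5
std_diff, mac, tv = instability([0.9, 0.6, 0.8, 0.5])
```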
Across all three environment configurations, we observe consistent temporal instability in both macro- and execution-level strategies. For clarity, we present detailed results from the Easy environment in Table 4 and Appendix B.4. Despite its reduced complexity, the Easy setting already exhibits pronounced temporal fluctuations. In particular, macro strategies remain relatively stable over time, whereas execution strategies show substantially larger variability across most models. This effect is especially pronounced for Qwen-235B (Thinking), which also performs poorly across all environments.

Overall, these results indicate that long-horizon failures arise not only from suboptimal strategy formulation, but more fundamentally from the inability to maintain temporally consistent execution policies over extended horizons.

6.4 Hallucinations and Invalid Actions

Finally, manual inspection reveals recurrent failure patterns that directly break planning correctness and action validity. We distinguish two closely related but practically different issues:

Hallucinations in reasoning. Models occasionally generate reasoning traces that reference non-existent SKUs or fabricate numerical quantities, and subsequently incorporate these hallucinated elements into multi-step plans (examples are provided in Appendix B.1). As a result, decisions become misaligned with the true environment state, even when the overall planning structure appears internally coherent.

Invalid or irrational actions. Models sometimes output actions that violate basic constraints or are inconsistent with historical demand, such as negative order quantities or implausible pricing (Appendix B.2). Even though state-modifying operations are restricted to only a small set of tools, these invalid actions occur with non-negligible frequency and can quickly destabilize the system in long-horizon operation.
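Failures of this kind are exactly what a simple action guard would catch. The sketch below is illustrative only: the action schema, SKU names, and thresholds are our assumptions, not mechanisms the paper proposes or the benchmark enforces.

```python
# Illustrative validity guard for the failure modes of Section 6.4.
# Schema and thresholds are assumptions for exposition.

def validate_action(action, known_skus, cash):
    """Return a list of violations; an empty list means the action passes."""
    errors = []
    if action["sku"] not in known_skus:
        errors.append("unknown SKU (possible hallucination)")
    if action["type"] == "order" and action["qty"] <= 0:
        errors.append("non-positive order quantity")
    if action["type"] == "order" and action.get("cost", 0.0) > cash:
        errors.append("order exceeds available cash")
    if action["type"] == "price" and not (0.01 <= action.get("price", 0) <= 1000):
        errors.append("implausible price")
    return errors

# A hallucinated SKU with a negative quantity trips two checks:
bad = validate_action({"type": "order", "sku": "ghost-sku", "qty": -5, "cost": 0.0},
                      known_skus={"milk-1L"}, cash=500.0)
```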
In realistic operational settings, both hallucinations and invalid actions would be unacceptable and could directly trigger system collapse.

7 Related Work

Long-horizon planning benchmarks for LLMs. A growing body of benchmarks targets the long-horizon planning abilities of large language models across structured and interactive environments. PlanBench focuses on classical plan generation, while WebShop, Mind2Web, and ScienceWorld (Deng et al., 2023; Wang et al., 2022; Yao et al., 2023a) evaluate multi-step interaction, error recovery, and adaptive behavior in dynamic settings. More recent benchmarks, such as HeroBench, OdysseyBench, and UltraHorizon (Luo et al., 2025; Anokhin et al., 2025; Wang et al., 2025), explicitly stress extended, interdependent decision sequences that require persistent memory, hierarchical reasoning, and long-term strategy maintenance. Collectively, these benchmarks suggest that while short-horizon tasks are increasingly tractable, robust long-horizon planning and execution remain a central challenge for LLM-based agents.

Long-horizon agent frameworks. Recent advances in agent design address these challenges by introducing structured frameworks that decouple high-level planning from low-level execution. Approaches such as Plan-and-Act (Erdogan et al., 2025b) and EAGLET (Si et al., 2025) adopt planner–executor architectures to support hierarchical decision-making, dynamic replanning, and improved execution stability over long horizons. Extensions to multi-agent settings such as ELHPlan (Ling et al., 2025), and plan-aware context management frameworks such as PAACE (Yuksel, 2025), further emphasize task decomposition, explicit strategy representation, and memory-aware context control.
Together, these works indicate that scalable long-horizon autonomy increasingly relies on structured planning modules and controlled execution mechanisms rather than monolithic end-to-end policies.

8 Conclusion

This paper introduces RetailBench, a benchmark for evaluating long-horizon autonomous decision-making in realistic retail environments. RetailBench models supermarket operations as a stochastic, multi-factor, and temporally extended process, requiring agents to reason jointly about pricing, inventory, information acquisition, and financial sustainability. We further propose the Evolving Strategy & Execution framework, which decouples high-level strategy evolution from low-level execution to better support long-horizon autonomy.

Experiments across eight state-of-the-art large language models show that our framework improves operational stability and economic performance compared to a Reflection-based baseline. However, performance degrades sharply as environment complexity increases, revealing persistent limitations in scalability, information use, execution stability, and action validity. These results suggest that while structured agent frameworks mitigate some challenges, current LLM-based agents remain far from robust strategy-aware autonomy in complex dynamic environments. RetailBench thus provides a principled testbed for advancing research on long-horizon decision-making and agentic reasoning.

9 Limitations

Despite its realism and scale, this work has several limitations. First, RetailBench focuses on a single-store supermarket setting; while expressive, it does not capture multi-store coordination, competitive markets, or strategic interactions among multiple autonomous agents.
Second, although the environment incorporates stochastic demand, news dynamics, and supply-chain delays, it remains a simulation grounded in historical data and simplified economic assumptions, which may not fully reflect the complexities of real-world retail systems. Third, our evaluation is limited to prompting-based LLM agents without parameter updates or long-term learning across episodes; stronger performance may be achievable through reinforcement learning, fine-tuning, or hybrid neuro-symbolic approaches. Finally, while we identify key failure modes such as hallucinations and economically irrational actions, we do not propose explicit algorithmic mechanisms to enforce economic constraints or factual grounding during execution. Addressing these limitations, through richer environments, multi-agent extensions, learning-based adaptation, and constraint-aware action control, remains an important direction for future research.

References

Dario Amodei. 2024. Machines of loving grace. https://www.darioamodei.com/essay/machines-of-loving-grace. Accessed: 2025-12-01.

Andon Labs. 2025. Vending-bench 2: A benchmark for long-horizon business simulation. https://andonlabs.com/evals/vending-bench-2. Accessed: 2025-12-10.

Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, and Vincent Bissonnette. 2025. Herobench: A benchmark for long-horizon planning and structured reasoning in virtual worlds. Preprint.

ashraq. 2025. financial-news-articles. https://huggingface.co/datasets/ashraq/financial-news-articles. Accessed: 2025-12-01.

Axel Backlund and Lukas Petersson. 2025. Vending-bench: A benchmark for long-term coherence of autonomous agents. Preprint, arXiv:2502.15840.
DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, and 245 others. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. Preprint.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2web: Towards a generalist agent for the web. Preprint, arXiv:2306.06070.

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025a. Plan-and-act: Improving planning of agents for long-horizon tasks. Preprint.

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025b. Plan-and-act: Improving planning of agents for long-horizon tasks. Preprint.

Dave Fedewa, Chris Holder, Wynn Teichner, and Ben Wiseman. 2021. Five-star growth: Using online ratings to design better products. McKinsey & Company. Accessed: 2025-11-30.

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2024. Omni-math: A universal olympiad level mathematic benchmark for large language models. Preprint, arXiv:2410.07985.

Google DeepMind. 2024. Gemini 3 flash (preview). https://deepmind.google/technologies/gemini/. Large multimodal language model, preview version.

Dhruv Grewal, Jens Nordfält, Anne Roggeveen, Rainer Olbrich, and Hans Christian Jansen. 2014. Price-quality relationship in pricing strategies for private labels. Journal of Product and Brand Management, 23(6):429–438.

Carlos E.
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. Swe-bench: Can language models resolve real-world github issues? Preprint.

University of Chicago Booth School of Business Kilts Center for Marketing. Dominick's dataset. https://www.chicagobooth.edu/research/kilts/research-data/dominicks. Accessed: 2025-11-01.

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, and 6 others. 2025. Measuring ai ability to complete long tasks. Preprint.

Shaobin Ling, Yun Wang, Chenyou Fan, Tin Lun Lam, and Junjie Hu. 2025. Elhplan: Efficient long-horizon task planning for multi-agent collaboration. Preprint.

Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, and Li Shen. 2025. Ultrahorizon: Benchmarking agent capabilities in ultra long-horizon scenarios. Preprint, arXiv:2509.21766.

Daniel McFadden. 1974. Conditional logit analysis of qualitative choice behavior. In Paul Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, New York.

METR. 2025. Measuring ai ability to complete long tasks. METR blog.

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. Gaia: a benchmark for general ai assistants. Preprint.

Nof1.ai. 2025. Alpha arena: exploring the limits of large language models as quant traders. https://nof1.ai/blog/TechPost1. Accessed: 2025-12-10.

OpenAI. 2025a. Gpt-5 mini. https://platform.openai.com/docs/models/gpt-5-mini. Large language model, cost-efficient GPT-5 variant.

OpenAI. 2025b. Introducing gpt-5.2. Accessed: 2026-02-19.
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, and 1093 others. 2025. Humanity's last exam. Preprint.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.

Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, and Maosong Sun. 2025. A goal without a plan is just a wish: Efficient and effective global planner training for long-horizon agent tasks. Preprint.

GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, and 152 others. 2025a. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. Preprint.

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, and 150 others. 2025b. Kimi k2: Open agentic intelligence. Preprint.

Qwen Team. 2025a. Qwen3 technical report. Preprint.

The Terminal-Bench Team. 2025b. Terminal-bench: A benchmark for ai agents in terminal environments.

Mark D. Uncles. 1987. Discrete choice analysis: Theory and application to travel demand. Journal of the Operational Research Society, 38(4):370–371.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022. Scienceworld: Is your agent smarter than a 5th grader? Preprint.

Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, and Saravan Rajmohan. 2025.
Odysseybench: Evaluating llm agents on long-horizon complex office application workflows. Preprint.

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. 2025. Browsecomp: A simple yet challenging benchmark for browsing agents. Preprint.

xAI. 2025. Grok 4.1 fast and agent tools api. https://x.ai/news/grok-4-1-fast.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2023a. Webshop: Towards scalable real-world web interaction with grounded language agents. Preprint, arXiv:2207.01206.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.

Kamer Ali Yuksel. 2025. Paace: A plan-aware automated agent context engineering framework. Preprint.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. Webarena: A realistic web environment for building autonomous agents. Preprint.

A Environment Configuration Details

A.1 State Space Decomposition

We construct the environment using real-world retail data from the Dominick's dataset (Kilts Center for Marketing). From the 20 available product categories, we select 96 SKUs with the most complete and informative sales records. The richness of these observations enables reliable modeling of the relationship between pricing decisions and realized demand.

Key Symbols. Let $\mathcal{J}$ denote the set of selected SKUs, with cardinality $|\mathcal{J}| = J$ (where $J = 96$ in our experiments). Each SKU $j \in \mathcal{J}$ belongs to a product category $\mathrm{cat}(j) \in \{1, \ldots, 20\}$. For each SKU $j$, let $\mathcal{K}(j)$ denote its associated supplier set, with $|\mathcal{K}(j)| = 5$.
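For concreteness, the index sets above can be encoded directly. The following is a minimal sketch; the class and field names are illustrative conveniences, not the benchmark's actual data structures, and the shelf-life value is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Supplier:
    supplier_id: str
    cost: float       # procurement price c_jk
    quality: float    # quality level q_jk

@dataclass
class SKU:
    sku_id: str
    category: int                    # cat(j) in {1, ..., 20}
    shelf_life_days: int             # L_j
    # K(j): in RetailBench each SKU has exactly 5 suppliers
    suppliers: list = field(default_factory=list)

# J = 96 SKUs spread over the 20 categories (identifiers are hypothetical)
catalog = {
    f"SKU_{j:03d}": SKU(f"SKU_{j:03d}", category=j % 20 + 1, shelf_life_days=30)
    for j in range(96)
}
assert len(catalog) == 96
```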
A.1.1 Product State $S^{\mathrm{prod}}_t$

For each SKU $j$, we maintain a set of static attributes together with time-varying operational signals:

$$S^{\mathrm{prod}}_{t,j} = \big( \mathrm{id}_j, \mathrm{desc}_j, L_j, p^j_t, h^{\mathrm{sales}}_{j,t}, r_{j,t} \big). \quad (1)$$

Here, $\mathrm{id}_j$ and $\mathrm{desc}_j$ denote the unique identifier and textual description of SKU $j$, respectively. $L_j$ denotes the shelf life (in days), and $p^j_t$ is the retail price on day $t$. The term $h^{\mathrm{sales}}_{j,t}$ encodes historical sales statistics, while $r_{j,t}$ summarizes aggregated customer review signals, such as average rating and review volume.

Data grounding. The SKU set and cost priors are constructed from the Dominick's dataset (Kilts Center for Marketing). Historical sales records from the same source are used to calibrate the demand model.

A.1.2 Inventory State $S^{\mathrm{inv}}_t$

Inventory is represented at the SKU level with age tracking. Let $I^j_t$ denote the on-hand units of SKU $j$ at the beginning of day $t$. To support shelf-life constraints and depreciation, we track the arrival time and remaining shelf life of each unit.

Capacity constraint and pending queue. Let $\mathrm{Cap}$ denote the total inventory capacity of the store. If incoming replenishment orders would violate this constraint, excess units are placed into a first-in-first-out (FIFO) pending queue $Q_t$ and become available only after sufficient inventory space is released:

$$\sum_{j \in \mathcal{J}} \sum_{a=0}^{L_j} I^{(a)}_{j,t} \le \mathrm{Cap}. \quad (2)$$

Stockout signal. When realized demand exceeds the available sellable inventory of SKU $j$ on day $t$, a stockout indicator is triggered. This signal is recorded at the end of day $t$ and exposed to the agent as explicit operational feedback.

Expiration and destruction. At each day transition, inventory units whose residence time exceeds their shelf life are removed from the system.

A.1.3 Supply Chain State $S^{\mathrm{sup}}_t$

Each SKU $j$ is associated with a set of suppliers $k \in \mathcal{K}(j)$.
The supplier state includes procurement price $c^j_k$, quality level $q^j_k$, and lead-time distribution:

$$S^{\mathrm{sup}}_{t,j} = \big\{ (c^j_k, q^j_k, \mathcal{L}^j_k) : k \in \mathcal{K}(j) \big\}. \quad (3)$$

Lead time $\ell^j_k \sim \mathcal{L}^j_k$ is sampled when an order is placed. Procurement prices are discretized into tiers using Dominick's cost signals (Kilts Center for Marketing), following empirically observed positive correlations between price tier and quality level (Grewal et al., 2014).

Enforced diversity. To avoid degenerate supplier configurations, we enforce that each SKU has (i) one supplier with maximal quality, (ii) one supplier with minimal price and minimal quality, and remaining suppliers spanning intermediate price–quality levels.

A.1.4 Demand Signals and Demand Generation

Demand-related components capture both observable market signals and the stochastic process governing realized demand. We define

$$S^{\mathrm{dem}}_t = \big( N_t, \{ r_{j,t} \}_{j \in \mathcal{J}} \big), \quad (4)$$

where $N_t$ denotes daily customer traffic.

Customer traffic. Daily customer traffic is derived from the Dominick's dataset.

Consumer choice and demand generation. Given customer traffic $N_t$, prices $\{ p^j_t \}_{j \in \mathcal{J}}$, review signals $\{ r_{j,t} \}$, and external news $E_t$, we generate realized demand using a discrete-choice model. Following standard practice in empirical economics, we adopt a Multinomial Logit (MNL) framework (McFadden, 1974; Uncles, 1987). For SKU $j$, the raw utility is defined as

$$U^{\mathrm{raw}}_{jt} = \alpha_j + \beta_j p^j_t + \varepsilon_{jt}, \quad (5)$$

where $\alpha_j$ captures intrinsic preference, $\beta_j$ models price sensitivity, and $\varepsilon_{jt}$ represents idiosyncratic shocks.

We further incorporate review effects, news signals, and within-category substitution:

$$\tilde{U}_{jt} = U^{\mathrm{raw}}_{jt} + \Delta_j(r_{j,t}, E_t) + \sum_{\substack{i \neq j \\ \mathrm{cat}(i) = \mathrm{cat}(j)}} \gamma_{ji} \exp(\tilde{U}_{it}). \quad (6)$$

With the outside option utility normalized to zero, the purchase probability for SKU $j$ on day $t$ is

$$p^*_{jt} = \frac{\exp(\tilde{U}_{jt})}{1 + \sum_{i=1}^{J} \exp(\tilde{U}_{it})}. \quad (7)$$

Realized demand.
Given traffic $N_t$ and purchase probabilities $\{ p^*_{jt} \}$, potential demand is sampled as

$$y_{jt} \sim \mathrm{Binomial}(N_t, p^*_{jt}), \quad (8)$$

and realized sales are capped by available on-hand inventory.

A.1.5 External Information $S^{\mathrm{ext}}_t$ (News Module)

We maintain a set of active news events $E_t$. Each event $e \in E_t$ is represented as

$$e = (\mathrm{type}, \mathrm{scope}, \mathrm{target}, \mathrm{side}, \mathrm{sign}, \eta, \mathrm{text}, \mathrm{ttl}), \quad (9)$$

where $\mathrm{scope} \in \{\mathrm{macro}, \mathrm{category}, \mathrm{product}, \mathrm{neutral}\}$, $\mathrm{side} \in \{\mathrm{demand}, \mathrm{supply}, \mathrm{both}\}$, $\mathrm{sign} \in \{+1, -1\}$, $\eta$ denotes impact magnitude, and $\mathrm{ttl}$ is the time-to-live in days. News texts are synthesized using an LLM to resemble financial news corpora (ashraq, 2025).

Impact application. News impacts are incorporated into (i) demand utilities and/or (ii) supplier prices, depending on side and scope. Neutral news has $\eta = 0$ by design.

A.1.6 Financial State $S^{\mathrm{fin}}_t$

Let $F_t$ denote available funds at the start of day $t$. Net worth is computed as

$$\mathrm{NW}_t = F_t + \sum_{j \in \mathcal{J}} \sum_{a=0}^{L_j} I^{(a)}_{j,t} \cdot v_j(a), \quad (10)$$

where $v_j(a)$ denotes the per-unit value at age $a$. We adopt linear shelf-life depreciation:

$$v_j(a) = c_j \cdot \max\Big( 0, \; 1 - \frac{a}{L_j} \Big), \quad (11)$$

with $c_j$ denoting a reference procurement cost (e.g., the mean or realized supplier cost).

A.2 Policy Detail

A.2.1 Policy Presentation

To support structured and interpretable decision making, we represent the agent policy using a hierarchical abstraction that separates strategic intent from executable actions. Specifically, each policy is composed of three layers:

1. Macro Strategy, which captures high-level managerial principles that persist across days;
2. Execution Strategy, which encodes structured operational guidance in a machine-readable form;
3. Daily Actions, which enumerate concrete executable operations submitted to the environment.

This design enables the agent to reason at different temporal and semantic granularities, while maintaining a clear separation between planning and execution.
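A minimal sketch of such a three-layer policy object is shown below. The field names mirror the execution-strategy schema described in the prompts of Appendix C; the concrete SKU identifiers, supplier names, and values are hypothetical placeholders.

```python
# Illustrative three-layer policy. The strategy phase writes this object;
# the execution phase only reads it (the strategy is immutable during execution).
policy = {
    "macro_strategy": [
        "Prioritize inventory turnover",
        "Focus on high-margin products",
    ],
    "execute_strategy": {
        "focus_skus": ["SKU_001", "SKU_002"],
        "sku_supplier_mapping": [
            {"sku_id": "SKU_001", "supplier_id": "supplier_A"},
        ],
        "news_to_monitor": [],
        "skus_to_reorder": ["SKU_002"],
        "price_adjustments": [
            {"sku_id": "SKU_001", "adjustment": "decrease by 5%"},
        ],
        "sku_to_monitor": ["SKU_003"],
        "other": [],
    },
    "today_action": [
        {"tool": "place_order",
         "arguments": {"sku_id": "SKU_002",
                       "supplier_id": "supplier_A", "quantity": 100}},
        {"tool": "modify_sku_price",
         "arguments": {"sku_id": "SKU_001", "new_price": 1.89}},
    ],
}

assert set(policy) == {"macro_strategy", "execute_strategy", "today_action"}
assert len(policy["execute_strategy"]) == 7  # seven execution-strategy fields
```

Daily actions are restricted to the two state-modifying tools, `place_order` and `modify_sku_price`.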
Macro Strategy. The macro strategy consists of a set of natural-language statements that describe high-level objectives (e.g., prioritizing inventory turnover or focusing on high-margin products). These statements are non-executable and serve as persistent guidance for downstream decision making.

Execution Strategy. The execution strategy consists of seven key components. focus_skus specifies the SKUs that require immediate attention. sku_supplier_mapping denotes the corresponding suppliers for each SKU. news_to_monitor identifies relevant news signals to track. skus_to_reorder indicates SKUs that require replenishment. price_adjustment specifies the SKUs whose prices should be adjusted along with the corresponding adjustment magnitudes. sku_to_monitor denotes SKUs under observation. The other field captures additional execution directives.

Daily Actions. Daily actions consist of two operation types: place_order, which places purchase orders for selected SKUs, and modify_sku_price, which adjusts the prices of specified SKUs.

Strategy–Execution Protocol. Policy usage follows a two-phase protocol. In the strategy phase, the agent analyzes the environment and constructs its macro and execution strategies. In the subsequent execution phase, the finalized strategy is consumed to generate executable daily actions. The strategy is immutable during execution, enforcing a clear separation between deliberation and action.

A.2.2 Policy Example

See Table 5.

A.3 Environment Setting

A.3.1 Easy Environment Configuration

The Easy configuration is designed for simpler scenarios with limited resources and a reduced category set.
• Time Range:
  – Data begin time: 06/06/91
  – Data end time: 12/31/95
  – Store begin time: 09/07/91
  – Store ID: 15
• Financial Parameters:
  – Initial funds: 10,000
  – Daily rent: 250
  – Inventory capacity: 10,000
• Feature Enablement:
  – Review enabled: True
  – Review ratio: 0.02
  – News enabled: False
• Selected Categories (5 categories):
  – Bathroom_Tissues
  – Canned_Soup
  – Cigarettes
  – Front_end_candies
  – Soft_Drinks
• Category Effects: All categories have a uniform effect of -0.2
• Random Seed: 42 (for reproducibility)

A.3.2 Middle Environment Configuration

The Middle configuration uses static data but with full resource capacity and all categories enabled.

• Time Range: Same as Easy
• Financial Parameters:
  – Initial funds: 50,000
  – Daily rent: 1,000
  – Inventory capacity: 40,000
• Feature Enablement:
  – Review enabled: True
  – Review ratio: 0.02
  – News enabled: False
• Selected Categories (20 categories):
  – Bathroom_Tissues, Beer, Bottled_Juices, Canned_Soup, Canned_Tuna
  – Cereals, Cheeses, Cigarettes, Cookies, Crackers
  – Dish_Detergent, Fabric_Softeners, Front_end_candies, Frozen_Entrees
  – Frozen_Juices, Oatmeal, Paper_Towels, Snack_Crackers
  – Soft_Drinks, Toothpastes
• Category Effects: All categories have a uniform effect of -0.2
• Random Seed: 42

A.3.3 Hard Environment Configuration

The Hard configuration uses dynamic data with news events enabled, providing the most complex simulation environment.
• Time Range: Same as other configurations
• Financial Parameters:
  – Initial funds: 50,000
  – Daily rent: 1,000
  – Inventory capacity: 40,000
• Feature Enablement:
  – Review enabled: True
  – Review ratio: 0.02
  – News enabled: True
• News Configuration:
  – News impact base scale: 0.4
  – News daily count: 20
  – News random seed: 42
  – News sample ratios:
    * Neutral: 0.9
    * Single category: 0.02
    * Macro all: 0.03
    * SKU level: 0.05
  – News impact mode weights:
    * Neutral: 0.0
    * Macro all: 1.0
    * Single category: 1.0
    * SKU level: 1.2
• Selected Categories: Same 20 categories as Middle
• Category Effects: All categories have a uniform effect of -0.2
• Random Seed: 42

A.4 Environment Simulation

See Figure 3.

B Rollout Details

B.1 Hallucinations in Reasoning and Planning

This appendix provides representative examples of hallucinations observed during long-horizon rollouts, focusing on reasoning-time errors that propagate into multi-step planning despite internally coherent logic.

B.1.1 Non-existent SKUs

During execution, models occasionally reference or plan around SKUs that do not exist in the environment. Across all rollouts, we identify 14 distinct non-existent SKUs that repeatedly appear in final daily strategies, indicating persistent hallucinations rather than isolated parsing or formatting errors. Representative examples include SKU identifiers such as 10700013100, 166051312, and 440004627.

Figure 3: Net worth and available funds trajectories of the heuristic policy under different environment configurations, illustrating the calibrated difficulty levels.
The Middle and Hard settings involve a larger number of product categories, enabling higher potential net worth and cash accumulation over the course of an episode.

Example. In the following strategy output, the model includes a non-existent SKU (440004627) in its set of focus SKUs. This hallucinated identifier is subsequently treated as a valid product in downstream reasoning and planning:

{
  "day": 14,
  "current_date": "1991-09-22",
  "strategy": {
    "macro_strategy": ***,
    "execute_strategy": {
      "focus_skus": [
        "3000001460",
        "3700063037",
        "440004627",
        "5100001251"
      ],
      "news_to_monitor": ***
    },
    "today_action": ***
  }
}

B.1.2 Hallucinated Dates

Models also hallucinate temporal information during tool-based reasoning when required to provide calendar dates that are not explicitly specified by the environment. Instead of querying the current simulation date, the model resolves this underspecification by fabricating a plausible calendar mapping in order to proceed with execution.

Example. The following excerpt illustrates the model's reasoning process when attempting to issue a tool call that requires valid date strings:

Tool requires dates in YYYY-MM-DD or MM/DD/YY.
Only day index is known: Day 14.
No calendar start date is provided.
Assume Day 1 = 2023-01-01.
Infer Day 14 = 2023-01-14.
start_date = 2023-01-13
end_date = 2023-01-14

Based on this fabricated assumption, the model issues a syntactically valid tool call using the inferred dates. Although the reasoning chain is internally consistent, the assumed calendar mapping is not grounded in the true environment state, resulting in a semantically incorrect interaction.

These examples highlight a recurring failure mode in which models resolve underspecified variables through plausible fabrication rather than explicit information acquisition, leading to planning trajectories that appear coherent yet diverge from the actual environment dynamics.
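A lightweight guard against this fabricated-calendar failure is to validate tool-call date arguments against the environment's actual simulation window before execution. The sketch below is illustrative, not part of the benchmark; the function name and signature are our own.

```python
from datetime import date

def check_date_args(start: str, end: str, sim_start: date, sim_today: date):
    """Reject tool calls whose date range is not grounded in the
    known simulation window (catches hallucinated calendars)."""
    s, e = date.fromisoformat(start), date.fromisoformat(end)
    if not (sim_start <= s <= e <= sim_today):
        raise ValueError(
            f"date range {start}..{end} outside simulation window "
            f"{sim_start}..{sim_today}"
        )
    return s, e

# The fabricated mapping from the excerpt (Day 1 = 2023-01-01) is rejected
# when the true store start date is 1991-09-07 and today is day 14:
try:
    check_date_args("2023-01-13", "2023-01-14",
                    sim_start=date(1991, 9, 7), sim_today=date(1991, 9, 22))
except ValueError:
    pass  # hallucinated dates caught before the tool call executes
```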
B.2 Invalid or Irrational Actions

We identify invalid or economically irrational actions by applying a set of heuristic rules to model-issued tool calls. Specifically, we flag actions that violate basic economic or operational constraints, including: (i) setting product prices to zero, negative values, or unrealistically high levels (e.g., exceeding 50); and (ii) placing orders with quantities far beyond plausible operational scales for a single SKU.

Across all evaluated rollouts, we detect more than 300 such anomalous tool calls. Among them, 197 correspond to invalid or irrational price modifications, while 125 involve excessively large ordering decisions. These behaviors are observed across multiple models and environment configurations.

Excessive Ordering. The following example is taken from a rollout of the Kimi K2 Thinking model in the Hard environment. The model places an implausibly large order for a single SKU, far exceeding realistic replenishment volumes:

tool: place_order
model: Kimi K2 Thinking
sku_id: 5100000011
quantity: 18000

Although syntactically valid, such actions introduce extreme inventory shocks that are misaligned with realistic retail operations.

Invalid Price Modifications. Irrational pricing behavior is also observed across models. In the following example, the model updates the price of a SKU by several orders of magnitude (log excerpted from still_hard/kimi_thinking/tool_calls.jsonl):

tool: modify_sku_price
model: Kimi K2 Thinking
old_price: 0.25
new_price: 999.0

In other cases, models attempt to assign zero or negative prices, which are explicitly rejected by the environment. While such invalid actions are prevented at execution time, extreme but valid values can still propagate downstream and destabilize subsequent decision-making.
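The heuristic screens described above can be expressed as simple pre-execution checks. In this sketch, the price ceiling of 50 mirrors the rule stated in the text, while the order-quantity cap and the function name are illustrative assumptions of ours.

```python
MAX_PRICE = 50.0      # prices above this are flagged as irrational (per B.2)
MAX_ORDER_QTY = 5000  # illustrative per-SKU order cap; not the paper's value

def flag_anomalies(tool_call):
    """Return reasons a model-issued tool call looks invalid or
    economically irrational, following the heuristics above."""
    issues = []
    name, args = tool_call["tool"], tool_call["arguments"]
    if name == "modify_sku_price":
        p = args.get("new_price", 0.0)
        if p <= 0 or p > MAX_PRICE:
            issues.append(f"implausible price {p}")
    elif name == "place_order":
        q = args.get("quantity", 0)
        if q <= 0 or q > MAX_ORDER_QTY:
            issues.append(f"implausible order quantity {q}")
    return issues

# Both logged examples above are caught by these rules:
assert flag_anomalies({"tool": "place_order",
                       "arguments": {"sku_id": "5100000011",
                                     "quantity": 18000}})
assert flag_anomalies({"tool": "modify_sku_price",
                       "arguments": {"sku_id": "X", "new_price": 999.0}})
```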
Overall, these results indicate that current LLM-based agents lack reliable internal mechanisms for enforcing basic economic plausibility at the action level, even when tool interfaces enforce syntactic validity and detailed execution logs are available for inspection.

B.3 Rollout Results

See Table ??.

B.4 Strategy Score Analysis

See Figures 4 and 5.

B.5 Tool Use Analysis

See Figure 6.

B.6 Category Analysis

See Table 7.

B.7 All Model and Framework Results

See Table 8.

Figure 4: Macro strategy similarity over time in the easy environment. Higher values indicate greater consistency in high-level decisions across days.

Figure 5: Execution strategy similarity over time in the easy environment. Execution-level behaviors exhibit substantially higher temporal variability than macro strategies.

C Prompt

C.1 Evolving Strategy & Execution Framework: Strategy Phase System Prompt

You are a retail strategy analyst. Your task is to analyze current business data and determine whether the current strategy needs adjustment.

# Environment Characteristics
- The store operates with a large number of SKUs, where products within the same category interact and may substitute or cannibalize each other's demand.
Figure 6: Correlation between the frequency of SKU review queries during rollouts and average daily sales across all runs. Each point represents a single rollout. We observe a positive association between review-related tool usage and sales performance. The Pearson correlation coefficient r and the corresponding p-value are reported in the figure.

- Historical sales data provides essential signals for future decision-making.
- Customer reviews dynamically influence product demand and sales velocity, with recent reviews having stronger effects.
{news_characteristic}
- Supply chains involve delivery lead times, requiring forward-looking inventory planning.
- Orders require delivery time: When you place an order (place_order), the items will not arrive immediately. The delivery time varies and can take up to 7 days (within 7 days). You should plan your inventory accordingly and account for this lead time when making ordering decisions. Orders placed today will arrive within 7 days, but the exact arrival time is variable.
- Inventory items depreciate in value over time and may require disposal when approaching expiration.
- Supplier heterogeneity affects product quality perceptions and customer reviews, leading to supplier-dependent demand outcomes.
- Daily rent: The store incurs a fixed daily rent cost of {daily_rent} that must be paid each day. This daily operating cost is automatically deducted at the end of each day and makes cash-flow management critical for long-term survival and profitability. You must ensure sufficient funds are available to cover the daily rent.
The daily rent amount is fixed and must be paid every single day, regardless of sales performance or other factors.

# Your Role in Strategy Phase
Each day starts with a STRATEGY PHASE where you:
1. Review the current strategy (provided at the start of this phase)
2. Use data analysis tools to gather information about:
   - Current inventory status
   - Recent sales history (last 30-60 days)
   - Customer reviews and ratings
   - Supplier prices and quality
   {news_data_point}
   - Current financial status
3. Compare current situation with previous days to identify significant changes
4. Set the strategy using the three separate tools (set_macro_strategy, set_execute_strategy, set_action) to set the three strategy components

# Strategy Format
The strategy consists of three components:
1. **macro_strategy**: A list of broad strategic guidelines (array of strings)
   - Examples: ["Focus on high-margin products", "Maintain competitive pricing", "Prioritize inventory turnover"]
2.
**execute_strategy**: An object with seven fields, all values are arrays:
   - **focus_skus**: Array of SKU IDs that need attention (e.g., ["SKU_001", "SKU_002"])
   - **sku_supplier_mapping**: Array of mapping objects (e.g., [{{"sku_id": "SKU_001", "supplier_id": "supplier_A"}}, {{"sku_id": "SKU_002", "supplier_id": "supplier_B"}}])
   {news_to_monitor_field}
   - **skus_to_reorder**: Array of SKU IDs that need reordering (e.g., ["SKU_003", "SKU_004"])
   - **price_adjustments**: Array of price adjustment objects (e.g., [{{"sku_id": "SKU_001", "adjustment": "increase by 10%"}}, {{"sku_id": "SKU_002", "adjustment": "decrease by 5%"}}])
   - **sku_to_monitor**: Array of SKU IDs that should be closely monitored (e.g., ["SKU_005", "SKU_006"])
   - **other**: Array of other strategy notes or metadata (e.g., [{{"comment": "..."}}, {{"risk_level": "high"}}])
3. **today_action**: An array of action objects, each representing a concrete action using the parameter format of `place_order` or `modify_sku_price`.
   - Each action MUST be an object of the form:
     - {{"tool": "place_order", "arguments": {{}}}}
     - OR {{"tool": "modify_sku_price", "arguments": {{}}}}
   - Example: [
       {{"tool": "place_order", "arguments": {{"sku_id": "SKU_001", "supplier_id": "supplier_A", "quantity": 100}}}},
       {{"tool": "modify_sku_price", "arguments": {{"sku_id": "SKU_002", "new_price": 9.99}}}}
     ]

# Strategy Setting Tools
Use three separate tools to set different parts of the strategy:
- **set_macro_strategy**: Set the macro_strategy (array of strings)
  - Parameter: `macro_strategy` (array of strings)
  - Example: set_macro_strategy(macro_strategy=["Focus on high-margin products", "Maintain competitive pricing"])
- **set_execute_strategy**: Set the execute_strategy (object with seven fields, all arrays)
  - Parameter: `execute_strategy` (object with fields: focus_skus, sku_supplier_mapping{news_to_monitor_param}, skus_to_reorder, price_adjustments, sku_to_monitor, other)
  - All field values must be arrays
  - Example: set_execute_strategy(execute_strategy={{"focus_skus": ["SKU_001"], "sku_supplier_mapping": [{{"sku_id": "SKU_001", "supplier_id": "supplier_A"}}], ...}})
- **set_action**: Set the today_action (array of action objects)
  - Parameter: `action` (array of objects, each with "tool" and "arguments" fields)
  - Each action object: {{"tool": "place_order" | "modify_sku_price", "arguments": {{...}}}}
  - Example: set_action(action=[{{"tool": "place_order", "arguments": {{"sku_id": "SKU_001", "supplier_id": "supplier_A", "quantity": 100}}}}])

You can call these tools multiple times to build or modify the strategy. After your analysis, set all three components to reflect your decisions.
# Available Tools for Analysis
The available function signatures are provided within <tools></tools> XML tags:
{tool_definitions}

For each function call, return a JSON object with function name and arguments inside <tool_call></tool_call> XML tags:
{{"name": <function-name>, "arguments": <args-json-object>}}

# Important Analysis Tools
Use these tools to gather data:
- view_funds_and_date: Check current funds and date
- view_inventory: Check current inventory levels
- view_sku_sales_history: Analyze sales trends (use last 30-60 days)
- view_sku_avg_ratings: Check customer satisfaction
- view_current_date_supplier_prices: Check supplier availability and prices
{news_tools_list}
- view_current_orders: Check pending orders
- set_macro_strategy: Set the macro strategy (array of strings)
- set_execute_strategy: Set the execute strategy (object with seven fields, all arrays)
- set_action: Set the today action (array of action objects)

Note: You CANNOT use place_order or modify_sku_price in the strategy phase. These tools are only available in the execution phase.

After completing your analysis and updating the strategy, the system will transition to the EXECUTION PHASE.

C.2 Evolving Strategy & Execution Framework

Execution Phase System Prompts

You are a retail operations agent executing daily operations based on the current strategy.

# Your Role in Execution Phase
In the EXECUTION PHASE, you will receive the **final strategy** determined in the Strategy Phase. This strategy includes:
- **macro_strategy**: Broad strategic guidelines (array of strings)
- **execute_strategy**: Specific operational details (object with seven fields, all arrays)
- **today_action**: Concrete actions to take today (array of action objects)

# Important Operational Constraints
- **Daily rent**: The store incurs a fixed daily rent cost of {daily_rent} that must be paid each day.
This daily operating cost is automatically deducted at the end of each day. Ensure you have sufficient funds to cover this daily expense. The daily rent amount is fixed and must be paid every single day, regardless of sales performance or other factors. This makes cash-flow management critical for long-term survival and profitability.
- **Order delivery time**: When you place an order using place_order, the items will not arrive immediately. The delivery time varies and can take up to 7 days. Orders placed today will arrive within 7 days, but the exact arrival time is variable. Plan your inventory and ordering decisions accordingly, considering the lead time for items to arrive.

# Strategy Usage Guidelines
**The strategy is provided as REFERENCE, but you can and should make additional actions based on real-time data:**
1. **Reference the strategy** to understand priorities and planned actions:
   - Use macro_strategy for overall decision-making direction
   - Use execute_strategy fields (focus_skus, sku_supplier_mapping{news_to_monitor_ref}, skus_to_reorder, price_adjustments, sku_to_monitor, other) as guidance
   - Consider today_action as suggested actions to take
2. **Perform additional data queries** to validate and refine decisions:
   - Check current inventory levels, sales history, supplier prices{news_impacts_ref}, funds, etc.
   - Use tools like view_inventory, view_sku_sales_history, view_current_date_supplier_prices{news_tools_ref}, etc.
3.
**Execute actions flexibly**:
   - You can execute actions from today_action when they still make sense given the latest data
   - You can **adjust, skip, or modify** actions from today_action if your analysis shows better alternatives
   - You can **add additional actions** beyond today_action if needed (e.g., unexpected inventory changes, new supplier prices{news_impacts_example})
   - You can use information from execute_strategy (like focus_skus, sku_supplier_mapping) to make decisions even if not explicitly in today_action
4. **End the day** by calling end_today when you've completed all operations for today.

# Important Constraints
- You MUST NOT modify the stored strategy itself in this phase (strategy can only be changed in the Strategy Phase)
- You CANNOT call any tool that changes macro_strategy / execute_strategy / today_action
- You SHOULD use the strategy as guidance but make final decisions based on current data and analysis

# Available Tools
The available function signatures are provided within <tools></tools> XML tags:
{tool_definitions}

For each function call, return a JSON object with function name and arguments inside <tool_call></tool_call> XML tags:
{{"name": <function-name>, "arguments": <args-json-object>}}

# Ending the Day
When you have completed all reasonable operations for the day (especially those in today_action, adjusted as needed by current data), you MUST call end_today to advance to the next day. This will trigger a new Strategy Phase for the next day.

C.3 Reflection Framework

Reflection Phase Prompts

This prompt is used to generate reflections after each day in run_reflection.py.

You are a retail operations analyst reflecting on the day's performance.
# Task Goal
{task_spec}

# Day {day} End Result
{end_today_result.get('formatted', safe_dump(end_today_result))}

# Day {day} Interaction History
{interaction_summary}

{memory_context}

# Your Task
Generate a comprehensive reflection on today's performance. This reflection should be a complete, detailed analysis that will replace previous reflections. Include:

1. **Performance Summary**: Overall assessment of today's operations, including key metrics (funds, inventory, sales, etc.)
2. **Issue Identification**: What specific problems or challenges occurred? Be specific about what went wrong.
3. **Root Cause Analysis**: Why did these problems happen? Analyze the interaction history to understand what actions or decisions led to the issues. Trace back through the day's operations.
4. **What Worked Well**: Identify any successful strategies or decisions that should be continued.
5. **Actionable Improvements**: What should be done differently next time? Provide specific, actionable recommendations for future operations.
6. **Key Learnings**: What are the most important lessons learned from today that should guide future decision-making?

Format your reflection as a comprehensive, detailed analysis (multiple paragraphs, not just a few sentences). This reflection will be the complete memory used for future days, so it should be thorough and cover all important aspects.

Reflection:

C.4 Macro Similarity Judge Prompt

Please compare the similarity of the following two macro strategies.
Strategy 1's macro_strategy:
{macro1_formatted}

Strategy 2's macro_strategy:
{macro2_formatted}

Please evaluate the similarity between these two strategies and provide a score between 0 and 1, where:
- 1.0 means identical or almost identical
- 0.8-0.9 means very similar with only minor differences
- 0.6-0.7 means somewhat similar with some common points
- 0.4-0.5 means somewhat similar but with significant differences
- 0.2-0.3 means not very similar
- 0.0-0.1 means completely different

Please return only a floating-point number between 0 and 1, without any additional text or explanation.

D Agent Framework

D.1 Illustration of Agent Framework

See Figure 7; its content is transcribed below.

Day 1: Evolving Strategy & Execution

Evolving Strategy Phase
System: You are a retail strategy analyst. Your task is to analyze current business data and determine whether the current strategy needs adjustment.
Agent: According to the history, I need to check the status of the store and the SKU's price.
Action: {"name": 'view_funds_and_date'}
Action: {"name": 'view_current_date_supplier_prices'}
Env: Current date: 1992-03-05, funds balance: 23261.66. Current price of SKU 37003239 is ...
Agent: According to the response, I need to update the strategy ...
Action: {"name": 'update_macro_strategy'}

Execution Phase
System: You are a retail operations agent executing daily operations based on the current strategy.
Agent: According to the strategy, I need to place the order and modify the price of SKU 37003239.
Action: {"name": 'place_order'}
Action: {"name": 'modify_sku_price'}
Env: Updated SKU 37003239 price: 2.245232962110558 -> 2.492209044275104
Agent: According to the response, I think today's actions are enough.
Action: {"name": 'end_today'}

Day 2: Evolving Strategy & Execution

Evolving Strategy Phase
System: You are a retail strategy analyst. Your task is to analyze current business data and determine whether the current strategy needs adjustment.
Figure 7: Illustration of the Evolving Strategy and Execution framework.

Table 5: Example strategy.

macro_strategy:
  Tissues (SKU 3700060511) face an imminent stockout with current inventory below one day of coverage; incoming replenishments scheduled for 10-08 and 10-10 imply an unavoidable short-term gap and an estimated loss of approximately 1,200 units. Soups (SKU 5100000011) exhibit declining sales and persistently high return rates following a recent price increase, indicating substantial margin erosion. Cigarettes and ginger ale remain operationally stable with high customer ratings and low return rates. Overall operational risk is assessed as high due to anticipated revenue loss and return-driven inefficiencies.

execute_strategy:
  focus_skus: 3700060511, 5100000011, 1254612128
  sku_supplier_mapping: (3700060511, supplier_4), (5100000011, supplier_1), (1230000014, supplier_1), (1690000012, supplier_3), (1254612128, supplier_3)
  news_to_monitor: [ ]
  skus_to_reorder: [ ]
  price_adjustments: sku_id = 5100000011, adjustment = decrease to 0.45 or by 20% to stimulate demand
  sku_to_monitor: 3700060511, 5100000011
  other:
    Tissues: stockout gap on 10-05/06/07 estimated at 3 days, approximately 1,200 units lost at price 0.65 (~780 revenue); incoming supply covers demand after 10-08; no further reorder feasible.
    Soups: return rate of approximately 25% persists despite supplier_1; reviews indicate supplier_4-dominant issues, but effects persist; sales declined after the price hike; monitor the next 3 days of sales, returns, and supplier quality.
    Mints: low stock but incoming supply sufficient.
    Core: stable operations aside from the tissues and soups risks; customer traffic steady at approximately 30k/day.
    risk_level: high

today_action:
  place_order: (3700060511, supplier_4, quantity = 800), (1254612128, supplier_3, quantity = 600)
  modify_sku_price: (sku_id = 5100000011, new_price = 0.45), (sku_id = 1690000012, new_price = 0.78)

Difficulty: EASY (5 Categories)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 58.33 | 229.19 | 183.26 | 0.0889 | 0.1122 | 66 |
| Gemini-3 (Fast) | 50.67 | 399.39 | 294.71 | 0.0799 | 0.1311 | 59 |
| GLM-4.6 | 52.40 | 174.34 | 124.67 | 0.0773 | 0.1293 | 58 |
| Grok-4.1 Fast | 61.75 | 508.08 | 336.94 | 0.0417 | 0.0847 | 88 |
| Kimi-K2 (Thinking) | 54.25 | 260.68 | 168.72 | 0.0239 | 0.1179 | 58 |
| OpenAI-5.1 Mini | 51.75 | 192.90 | 122.46 | 0.0360 | 0.1237 | 55 |
| Qwen-235B | 37.50 | 420.31 | 236.50 | 0.0745 | 0.0841 | 48 |
| Average (7 models) | 52.38 | 301.98 | 203.30 | 0.0590 | 0.1068 | 61.71 |
| Hand-crafted Policy | 180.00 | 674.18 | 729.46 | 0.0266 | 0.0070 | 180 |

Difficulty: MIDDLE (20 Categories)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 54.67 | 263.68 | 255.30 | 0.1064 | 0.1818 | 63 |
| Gemini-3 (Fast) | 42.67 | 439.05 | 449.04 | 0.0188 | 0.1630 | 48 |
| GLM-4.6 | 54.33 | 182.55 | 131.90 | 0.0237 | 0.1274 | 56 |
| Grok-4.1 Fast | 51.00 | 720.69 | 398.23 | 0.0407 | 0.1160 | 59 |
| Kimi-K2 (Thinking) | 37.00 | 347.78 | 356.10 | 0.0037 | 0.1677 | 50 |
| OpenAI-5.1 Mini | 56.33 | 336.24 | 223.60 | 0.1307 | 0.1235 | 57 |
| Qwen-235B | 29.33 | 216.06 | 179.42 | 0.0639 | 0.1333 | 45 |
| Average (7 models) | 46.48 | 364.24 | 282.48 | 0.0491 | 0.1399 | 54.00 |
| Hand-crafted Policy | 180.00 | 1870.21 | 2809.39 | 0.0272 | 0.0074 | 180 |

Difficulty: HARD (20 Categories)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 56.67 | 165.86 | 175.79 | 0.0787 | 0.1978 | 59 |
| Gemini-3 (Fast) | 35.33 | 513.10 | 650.94 | 0.0906 | 0.1542 | 44 |
| GLM-4.6 | 53.33 | 205.76 | 203.07 | 0.0645 | 0.1648 | 55 |
| Grok-4.1 Fast | 33.67 | 504.57 | 364.52 | 0.0424 | 0.1797 | 50 |
| Kimi-K2 (Thinking) | 43.67 | 248.64 | 312.82 | 0.0113 | 0.1889 | 57 |
| OpenAI-5.1 Mini | 55.33 | 331.36 | 335.32 | 0.1949 | 0.1923 | 57 |
| Qwen-235B | 40.33 | 433.64 | 268.06 | 0.1746 | 0.1834 | 48 |
| Average (7 models) | 45.48 | 320.96 | 311.28 | 0.0960 | 0.1791 | 52.86 |
| Hand-crafted Policy | 180.00 | 1667.84 | 2748.94 | 0.0507 | 0.0075 | 180 |

Table 6: Performance comparison of seven large language models under three difficulty levels. Lower Expiry and Return ratios indicate better operational stability. A hand-crafted policy is included as an approximate upper bound for each difficulty.

| Model | Context | Easy SKU ↑ | Easy Cat ↑ | Middle SKU ↑ | Middle Cat ↑ | Hard SKU ↑ | Hard Cat ↑ |
|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 128K | 7.80 | 4.07 | 6.43 | 4.25 | 4.71 | 3.65 |
| Gemini-3 (Fast) | 1M | 8.50 | 3.95 | 11.89 | 11.18 | 13.32 | 8.09 |
| GLM-4.6 | 200K | 6.61 | 3.55 | 6.01 | 3.48 | 8.47 | 6.05 |
| OpenAI-5.1 Mini | 400K | 6.80 | 2.64 | 6.82 | 6.81 | 7.87 | 5.19 |
| Grok-4.1 Fast | 200K | 7.88 | 4.35 | 6.51 | 3.53 | 7.69 | 3.90 |
| Kimi-K2 (Thinking) | 256K | 5.92 | 2.94 | 5.53 | 4.09 | 6.03 | 3.22 |
| Qwen-235B (Thinking) | 256K | 4.78 | 3.66 | 6.49 | 4.05 | 5.77 | 4.23 |
| Heuristic Strategy | – | 25 | 5 | 96 | 20 | 96 | 20 |

Table 7: SKU and Category counts observed during the strategy stage across difficulties. Context denotes the maximum supported context length of each model. Bold indicates the best model; underlined indicates the second best (Heuristic Strategy excluded).

Framework: Evolving Strategy & Execution

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 58.33 | 229.19 | 183.26 | 0.0889 | 0.1122 | 66 |
| Gemini-3 (Fast) | 50.67 | 399.39 | 294.71 | 0.0799 | 0.1311 | 59 |
| GLM-4.6 | 52.40 | 174.34 | 124.67 | 0.0773 | 0.1293 | 58 |
| Grok-4.1 Fast | 61.75 | 508.08 | 336.94 | 0.0417 | 0.0847 | 88 |
| Kimi-K2 (Thinking) | 54.25 | 260.68 | 168.72 | 0.0239 | 0.1179 | 58 |
| OpenAI-5.1 Mini | 51.75 | 192.90 | 122.46 | 0.0360 | 0.1237 | 55 |
| Qwen-235B (Thinking) | 37.50 | 420.31 | 236.50 | 0.0745 | 0.0841 | 48 |
| GPT-5.2 | 81.00 | 457.21 | 358.27 | 0.0660 | 0.1141 | 81 |
| Average (8 models) | 55.96 | 330.26 | 228.19 | 0.0610 | 0.1121 | 64.13 |

Framework: Reflection (Day-Level)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | 53.00 | 235.01 | 170.04 | 0.0382 | 0.1200 | 66 |
| Gemini-3 (Fast) | 45.67 | 447.74 | 255.38 | 0.0682 | 0.1350 | 50 |
| GLM-4.6 | 55.00 | 160.70 | 125.67 | 0.0194 | 0.1176 | 62 |
| Grok-4.1 Fast | 48.33 | 297.94 | 197.54 | 0.1460 | 0.0925 | 54 |
| Kimi-K2 (Thinking) | 58.33 | 216.51 | 184.01 | 0.0964 | 0.1255 | 71 |
| OpenAI-5.1 Mini | 53.33 | 93.04 | 92.11 | 0.1062 | 0.1331 | 59 |
| Qwen-235B (Thinking) | 30.00 | 371.90 | 197.53 | 0.1919 | 0.1551 | 43 |
| GPT-5.2 | 64.00 | 283.88 | 260.88 | 0.1774 | 0.0887 | 64 |
| Average (8 models) | 50.96 | 263.34 | 185.40 | 0.1055 | 0.1209 | 58.63 |

Framework: Reflection (Step-Level)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | – | – | – | – | – | – |
| Gemini-3 (Fast) | – | – | – | – | – | – |
| GLM-4.6 | 51.67 | 92.35 | 77.18 | 0.0148 | 0.1072 | 57 |
| Grok-4.1 Fast | – | – | – | – | – | – |
| Kimi-K2 (Thinking) | 51.67 | 181.19 | 111.29 | 0.0353 | 0.1362 | 53 |
| OpenAI-5.1 Mini | – | – | – | – | – | – |
| Qwen-235B (Thinking) | – | – | – | – | – | – |
| GPT-5.2 | 56.00 | 398.71 | 324.01 | 0.1536 | 0.1048 | 56 |
| Average | 53.11 | 224.08 | 170.83 | 0.0679 | 0.1161 | 55.33 |

Framework: Plan-and-Act

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 (Exp.) | – | – | – | – | – | – |
| Gemini-3 (Fast) | – | – | – | – | – | – |
| GLM-4.6 | 48.33 | 231.35 | 113.01 | 0.0198 | 0.1096 | 55 |
| Grok-4.1 Fast | – | – | – | – | – | – |
| Kimi-K2 (Thinking) | 48.67 | 170.26 | 105.36 | 0.0000 | 0.1056 | 51 |
| OpenAI-5.1 Mini | – | – | – | – | – | – |
| Qwen-235B (Thinking) | – | – | – | – | – | – |
| GPT-5.2 | 64.00 | 323.88 | 193.02 | 0.0152 | 0.1096 | 64 |
| Average | 53.67 | 241.83 | 137.13 | 0.0117 | 0.1083 | 56.67 |

Heuristic Policy (Upper Bound, Easy)

| Model | Avg. Days ↑ | Avg. Daily Sales ↑ | Avg. Daily Income ↑ | Expiry Ratio ↓ | Return Ratio ↓ | Max Days ↑ |
|---|---|---|---|---|---|---|
| Hand-crafted Policy | 180.00 | 674.18 | 729.46 | 0.0266 | 0.0070 | 180 |

Table 8: Performance comparison of eight large language models under four agent frameworks in the EASY environment. A hand-crafted heuristic policy is included as an approximate upper bound.
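The macro-similarity judge prompt in Appendix C.4 asks the model to return a bare float between 0 and 1. Since LLM judges occasionally wrap the score in extra text, a thin parsing helper is useful. The sketch below is illustrative only and not part of the benchmark's released code; `parse_similarity` is an assumed helper name.

```python
import re

def parse_similarity(reply: str) -> float:
    """Extract the first number from a judge reply and clamp it to [0, 1].

    The judge is instructed to return a bare float (e.g., "0.85"), but this
    also tolerates replies such as "Score: 0.7".
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no similarity score found in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Clamping out-of-range replies (rather than rejecting them) is one possible design choice; a stricter pipeline could instead re-prompt the judge.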
