GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? How do they perform compared to humans? Do they tend to reach an efficient and fair outcome? What is the role of natural language in strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial given the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom, and economic measures to evaluate agents' performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents in various economic contexts; (ii) evaluate agents on both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents. Our results suggest that the market parameters, as well as the choice of the LLMs, tend to have complex and interdependent effects on the economic outcome, which calls for careful design and analysis of the language-based economic ecosystem.
💡 Research Summary
The paper introduces GLEE (Games in Language‑based Economic Environments), a unified, open‑source framework and benchmark designed to study how large language models (LLMs) behave in sequential, two‑player economic games that involve natural‑language communication. Drawing on three classic economic interaction types—bargaining (Rubinstein‑style division of a shared surplus), negotiation (buyer‑seller price bargaining), and persuasion (asymmetric information about product quality)—the authors formalize each game with a rich set of parameters: horizon (number of rounds), information structure (complete, incomplete, or asymmetric), and communication form (free‑text versus structured messages). These parameters allow researchers to systematically vary market conditions while keeping the underlying game mechanics comparable.
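The parameter space described above can be sketched as a small configuration object. This is an illustrative sketch only, under the assumption that each game instance is fully described by a family, a horizon, an information structure, and a communication form; the class and enum names below are hypothetical and do not mirror GLEE's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class GameFamily(Enum):
    BARGAINING = "bargaining"    # Rubinstein-style division of a shared surplus
    NEGOTIATION = "negotiation"  # buyer-seller price bargaining
    PERSUASION = "persuasion"    # asymmetric information about product quality

class InfoStructure(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    ASYMMETRIC = "asymmetric"

@dataclass(frozen=True)
class GameConfig:
    family: GameFamily
    horizon: int          # maximum number of rounds
    info: InfoStructure
    free_text: bool       # True = free-text messages, False = structured offers only

    def __post_init__(self):
        if self.horizon < 1:
            raise ValueError("horizon must be at least 1 round")

# Sweep market conditions while keeping the underlying game mechanics fixed:
configs = [
    GameConfig(GameFamily.BARGAINING, horizon=h, info=i, free_text=ft)
    for h in (1, 5, 10)
    for i in InfoStructure
    for ft in (False, True)
]
print(len(configs))  # 3 horizons x 3 info structures x 2 communication forms = 18
```

Varying one axis at a time over such a grid is what lets the benchmark attribute outcome differences to specific market conditions rather than to incomparable game mechanics.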
GLEE defines three core evaluation metrics: (1) Self‑gain, the individual monetary payoff each agent receives; (2) Efficiency, measuring how close the outcome is to Pareto optimality or total surplus maximization; and (3) Fairness, quantifying the equity of the payoff split (e.g., using the Shapley value or Gini coefficient). By providing a modular Python codebase that encapsulates game logic, dialogue logging, and metric computation, the framework makes it straightforward to plug in any LLM via a thin API wrapper and to adjust prompting, temperature, token limits, and other generation hyper‑parameters.
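The three metrics can be illustrated with a minimal sketch for a two-player game, assuming each completed game yields one payoff per player and a known maximum total surplus. The function names and the normalized-difference equity index used for fairness are this sketch's own simplifications, not GLEE's actual definitions (the paper's fairness measure may differ, e.g. a Gini-based index).

```python
def self_gain(payoffs: tuple[float, float], player: int) -> float:
    """Individual monetary payoff of one agent."""
    return payoffs[player]

def efficiency(payoffs: tuple[float, float], max_surplus: float) -> float:
    """Realized fraction of the maximum total surplus (1.0 = no surplus wasted)."""
    return sum(payoffs) / max_surplus if max_surplus > 0 else 0.0

def fairness(payoffs: tuple[float, float]) -> float:
    """Simple equity index: 1.0 for an equal split, tending to 0 as one side takes all."""
    total = sum(payoffs)
    if total == 0:
        return 1.0  # nothing to divide unevenly
    return 1.0 - abs(payoffs[0] - payoffs[1]) / total

deal = (70.0, 30.0)                 # e.g. an agreed split of a $100 surplus
print(self_gain(deal, 0))           # 70.0
print(efficiency(deal, 100.0))      # 1.0 -- the full pie was divided
print(round(fairness(deal), 2))     # 0.6 -- an uneven but not extreme split
```

Separating the collective measures (efficiency, fairness) from the individual one (self-gain) is what allows the benchmark to ask whether agents that win for themselves also produce good outcomes for the market as a whole.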
Using this infrastructure, the authors collected an extensive dataset: 587,000 decisions across more than 80,000 game instances involving 13 distinct LLMs (including GPT‑4, Claude‑2, and Llama 2‑70B). They also built a web interface for human participants to play against LLM agents, yielding a complementary human‑vs‑LLM dataset. The scale of the data enables statistical analysis of how market parameters and model choices jointly shape outcomes.
Key empirical findings are:
- Market parameters matter – Information asymmetry dramatically reduces efficiency and fairness, while allowing free‑text communication can partially recover lost surplus through persuasive framing.
- Model interdependence – No single LLM dominates across all settings; performance often depends on the opponent’s model, echoing classic game‑theoretic notions of relative strategy.
- Human vs. LLM behavior – Humans exhibit more extreme strategies: depending on role (buyer, seller, persuader) they either outperform all LLMs or fall far behind, suggesting that affective and risk‑aversion factors introduce high variance.
- Language framing effects – Identical numeric offers receive different acceptance rates when presented with different rationales or tones, confirming that natural language acts as a strategic signal beyond the pure payoff numbers.
The paper also discusses limitations: the current focus on two‑player games excludes multi‑agent market mechanisms such as auctions or competition for search rankings; LLM internal biases in persuasion are not fully disentangled; and practical issues like API latency and cost affect reproducibility. Future work is suggested to extend GLEE to multi‑player settings, incorporate richer human feedback loops, and develop standardized protocols for handling external variability.
In sum, GLEE provides the first comprehensive, parametrizable benchmark that couples economic theory with natural‑language interaction, enabling rigorous, comparable studies of LLM rationality, strategic communication, and societal impact. Its open‑source nature and large human‑LLM dataset position it as a foundational tool for researchers aiming to deploy LLM‑driven agents in real‑world economic platforms while ensuring efficiency, fairness, and transparency.