AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Notice: This research summary and analysis were automatically generated using AI technology. For the authoritative text, refer to the original arXiv paper.

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at https://github.com/SafeRL-Lab/AgenticPay.


💡 Research Summary

AgenticPay introduces a comprehensive benchmark and simulation framework for studying multi‑agent buyer‑seller negotiations that are mediated entirely through natural language. Unlike prior evaluation suites that reduce economic interaction to numeric bids, auctions, or single‑turn dialogues, AgenticPay models markets where each participant holds private reservation values, negotiates over multiple rounds, and may deal with heterogeneous products and competing counterparts.

The framework consists of four tightly coupled components. The Environment supplies public product descriptions, market context (e.g., category, seasonal factors), and private reservation prices for each buyer (maximum willingness‑to‑pay) and seller (minimum acceptable price). Negotiations proceed in alternating turns up to a configurable horizon, terminating when both parties propose the same price within the feasible bargaining zone, when the turn limit is reached, or when feasibility constraints are violated.
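The termination rules described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the AgenticPay implementation: the names `run_bilateral`, `Outcome`, and `Policy` are hypothetical, the LLM agents are replaced by simple numeric stand-in policies, and "feasibility constraints are violated" is interpreted here as an empty bargaining zone (the seller's minimum exceeds the buyer's maximum).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical policy signature: (round index, counterpart's last price or None) -> offered price.
Policy = Callable[[int, Optional[float]], float]


@dataclass
class Outcome:
    status: str             # "agreement", "timeout", or "infeasible"
    price: Optional[float]  # agreed price, if any
    rounds: int             # number of rounds played


def run_bilateral(buyer: Policy, seller: Policy,
                  buyer_max: float, seller_min: float,
                  max_rounds: int = 10) -> Outcome:
    # Assumption: an empty bargaining zone counts as a feasibility violation.
    if seller_min > buyer_max:
        return Outcome("infeasible", None, 0)
    seller_price: Optional[float] = None
    for rnd in range(1, max_rounds + 1):
        # Alternating turns: the buyer speaks first, then the seller responds.
        buyer_price = buyer(rnd, seller_price)
        seller_price = seller(rnd, buyer_price)
        same_price = abs(buyer_price - seller_price) < 1e-9
        # Agreement requires a matching price inside the feasible zone.
        if same_price and seller_min <= buyer_price <= buyer_max:
            return Outcome("agreement", buyer_price, rnd)
    return Outcome("timeout", None, max_rounds)


# Stand-in policies: the buyer raises by 5 per round, the seller concedes by 5 per round.
buyer = lambda rnd, last: min(100.0, 70.0 + 5.0 * rnd)
seller = lambda rnd, last: max(80.0, 110.0 - 5.0 * rnd)

result = run_bilateral(buyer, seller, buyer_max=100.0, seller_min=80.0)
print(result)  # Outcome(status='agreement', price=90.0, rounds=4)
```

In the actual benchmark each policy call would be an LLM turn followed by action extraction; the numeric stand-ins only make the control flow visible.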

Tasks define the market structure: they vary the number of buyers, sellers, and product types, creating over 110 distinct scenarios. These span simple bilateral bargaining (1‑buyer / 1‑seller), 1‑to‑N competition (one buyer versus many sellers or vice‑versa), and full N‑to‑N matching markets where multiple agents negotiate in parallel or sequentially. Ten realistic business domains (including daily‑life goods, professional services, corporate procurement, and financial assets) are incorporated, allowing assessment of cross‑domain generalization.

Agents are instantiated as role‑specific LLM policies sharing a common architecture but differing in private valuation functions and objectives. At each round, an agent generates a natural‑language utterance that may embed a structured price proposal; a parser extracts the actionable component (price, acceptance, concession) from the dialogue.
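To make the extraction step concrete, the following is a minimal sketch of a regular-expression parser that pulls a price and an acceptance signal out of a free-form utterance. The `Action` type, the `extract_action` function, the dollar-sign convention, and the acceptance keywords are all assumptions made for illustration; they are not taken from the AgenticPay code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed conventions: prices are written with a leading "$";
# acceptance is signalled by a small set of keywords.
PRICE_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)")
ACCEPT_RE = re.compile(r"\b(?:i accept|deal|agreed)\b", re.IGNORECASE)


@dataclass
class Action:
    kind: str               # "accept", "propose", or "none"
    price: Optional[float]  # last price mentioned, if any


def extract_action(utterance: str) -> Action:
    # Collect every dollar amount and keep the last one as the live offer.
    prices = [float(p.replace(",", "")) for p in PRICE_RE.findall(utterance)]
    price = prices[-1] if prices else None
    if ACCEPT_RE.search(utterance):
        return Action("accept", price)
    if price is not None:
        return Action("propose", price)
    return Action("none", None)


print(extract_action("I could do $85, but not lower."))
print(extract_action("Deal at $1,250.50."))
print(extract_action("Let me think about it."))
```

A keyword matcher like this is brittle (for example, "a great deal" would be read as acceptance), which is the kind of failure that motivates a stricter structured output format.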

To evaluate outcomes, AgenticPay defines three orthogonal metrics:

  1. Feasibility – whether the final agreement lies within both parties’ private reservation intervals;
  2. Efficiency – measured by the number of dialogue turns and elapsed time to reach agreement;
  3. Welfare – the sum of buyer and seller surplus, reflecting the total gains from trade realized in the market.
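The feasibility and welfare definitions above translate directly into arithmetic. The sketch below assumes a buyer's surplus is their maximum willingness-to-pay minus the price, a seller's surplus is the price minus their minimum acceptable price, and infeasible deals contribute zero welfare; the function names and the `Deal` tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (agreed price, buyer's maximum, seller's minimum).
Deal = Tuple[float, float, float]


def is_feasible(price: float, buyer_max: float, seller_min: float) -> bool:
    # Feasible when the price lies inside both private reservation intervals.
    return seller_min <= price <= buyer_max


def surpluses(price: float, buyer_max: float, seller_min: float) -> Tuple[float, float]:
    # Returns (buyer surplus, seller surplus).
    return buyer_max - price, price - seller_min


def total_welfare(deals: Iterable[Deal]) -> float:
    # Assumption: only feasible deals count toward welfare.
    total = 0.0
    for price, buyer_max, seller_min in deals:
        if is_feasible(price, buyer_max, seller_min):
            buyer_s, seller_s = surpluses(price, buyer_max, seller_min)
            total += buyer_s + seller_s
    return total


deals = [
    (90.0, 100.0, 80.0),   # feasible: surpluses are 10 and 10
    (120.0, 100.0, 80.0),  # infeasible: price exceeds the buyer's maximum
]
print(total_welfare(deals))  # 20.0
```

Note that for a single feasible deal the two surpluses always sum to the buyer's maximum minus the seller's minimum, so the price determines how the gains are split rather than their total. Welfare therefore differs across runs mainly through whether deals are reached and, in multi-agent markets, through who is matched with whom.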

The authors benchmark a range of state‑of‑the‑art proprietary models (GPT‑4, Claude‑2, Gemini‑1.5) and open‑source alternatives (Llama‑2‑Chat, Mistral‑Instruct, etc.) under a unified inference‑only protocol. Results reveal substantial performance gaps: models tend to favor the seller role, struggle with long‑horizon strategic planning, and exhibit pronounced asymmetries in multi‑agent settings. In N‑to‑N markets, most models either exceed the turn limit or produce infeasible deals, leading to a sharp drop in welfare scores. Even the strongest models achieve only about 70% of the theoretical optimum derived from classic Myerson‑Satterthwaite efficiency bounds.

The paper also identifies methodological limitations. The current parser relies on simple regular‑expression extraction, which fails on complex contract clauses (bundled discounts, delivery terms). Private reservation prices are sampled from fixed distributions, limiting the realism of information asymmetry. Moreover, LLMs display dialogue drift over many turns, compounding errors that degrade negotiation outcomes.

Future research directions suggested include: (i) integrating reinforcement‑learning or self‑play to train policies that explicitly optimize the defined efficiency and welfare rewards; (ii) employing meta‑learning to infer hidden reservation values from conversational cues; (iii) designing richer structured action schemas (e.g., JSON‑based proposals) to capture multi‑attribute contracts; and (iv) incorporating fairness and transparency mechanisms to mitigate strategic exploitation in many‑to‑many markets.
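As an illustration of direction (iii), a JSON-based proposal could carry several contract attributes at once and be validated on receipt. The sketch below is hypothetical: the field names `price`, `quantity`, and `delivery_days` and the `parse_proposal` helper are invented for this example and do not come from the paper or its code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    # Hypothetical multi-attribute contract terms.
    price: float
    quantity: int
    delivery_days: int


def parse_proposal(text: str) -> Proposal:
    # json.loads raises json.JSONDecodeError on malformed input;
    # a missing field raises KeyError, so bad proposals fail loudly.
    data = json.loads(text)
    proposal = Proposal(
        price=float(data["price"]),
        quantity=int(data["quantity"]),
        delivery_days=int(data["delivery_days"]),
    )
    if proposal.price < 0 or proposal.quantity <= 0 or proposal.delivery_days < 0:
        raise ValueError("proposal fields out of range")
    return proposal


message = '{"price": 90, "quantity": 2, "delivery_days": 5}'
print(parse_proposal(message))
```

Compared with pattern matching over free text, a schema like this makes every negotiated attribute explicit, at the cost of requiring the agent to emit well-formed JSON alongside its natural-language turn.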

In summary, AgenticPay fills a critical gap by providing a scalable, linguistically grounded testbed for autonomous negotiation agents. Its extensive task suite, realistic market modeling, and principled evaluation metrics enable systematic study of how large language models can move from pure text generation toward economically meaningful, strategic interaction. The benchmark is poised to become a standard platform for advancing the next generation of AI‑driven commerce.

