SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce
A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post-training, SimGym agents achieve state-of-the-art alignment with observed outcome shifts and reduce experiment cycles from weeks to under an hour, enabling rapid experimentation without exposure to real buyers.
💡 Research Summary
The paper introduces SimGym, a novel offline experimentation platform that replaces traditional, traffic‑splitting A/B tests for e‑commerce UI changes with large‑language‑model (LLM) driven synthetic buyers operating in a live browser. The authors identify three fundamental drawbacks of conventional A/B testing—traffic diversion, long data‑collection periods, and the risk of exposing real users to sub‑optimal experiences. SimGym addresses these by grounding synthetic agents in actual click‑stream data from each storefront, thereby preserving the distribution of real customers while allowing rapid, risk‑free testing.
The system consists of two main components: (1) a six‑stage persona‑and‑intent generation pipeline and (2) a browser‑based autonomous agent architecture. In the first stage, raw session logs are transformed into feature vectors (duration, page views, search behavior, funnel progression, monetary values) and clustered using k‑means++ with k = 5, yielding distinct buyer archetypes (budget, premium, ethics‑focused, performance‑oriented, mixed). For each (shop, cluster) pair, an LLM is prompted to extract up to ten product categories and representative items, producing a structured JSON. Purchase intent is then calibrated: the proportion of “purchase‑focused” versus “browse‑focused” intents mirrors the observed average add‑to‑cart (A2C) rate of the cluster, and each intent follows a strict two‑sentence template (“You are looking for
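The cohort weighting and intent calibration described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the largest-remainder rounding, the cluster sizes, and the A2C rates below are all hypothetical choices made for illustration. The only ideas taken from the summary are that simulated sessions are weighted by cohort and that the share of purchase-focused intents in a cluster mirrors that cluster's observed average add-to-cart rate.

```python
import random
from collections import Counter


def allocate_sessions(cluster_sizes, total_sessions):
    """Split a simulation budget across clusters in proportion to their size.

    Uses largest-remainder rounding (an assumption, not from the paper)
    so that the allocations always sum to total_sessions.
    """
    total = sum(cluster_sizes.values())
    quotas = {c: total_sessions * n / total for c, n in cluster_sizes.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = total_sessions - sum(alloc.values())
    # Hand the remaining sessions to the clusters with the largest fractional part.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def calibrate_intents(n_sessions, a2c_rate, rng):
    """Label sessions so the purchase-focused share matches the cluster's A2C rate."""
    n_purchase = round(n_sessions * a2c_rate)
    labels = ["purchase"] * n_purchase + ["browse"] * (n_sessions - n_purchase)
    rng.shuffle(labels)  # randomize order so intent is not tied to session index
    return labels


# Hypothetical per-shop inputs: real sessions per archetype and observed A2C rates.
cluster_sizes = {"budget": 500, "premium": 200, "mixed": 300}
a2c_rates = {"budget": 0.10, "premium": 0.25, "mixed": 0.20}

rng = random.Random(0)
allocation = allocate_sessions(cluster_sizes, total_sessions=100)
intents = {c: calibrate_intents(n, a2c_rates[c], rng) for c, n in allocation.items()}

for cluster, labels in intents.items():
    print(cluster, allocation[cluster], dict(Counter(labels)))
```

With these made-up inputs, the budget cluster receives 50 of the 100 simulated sessions, 5 of which are purchase-focused. A real pipeline would then fill the intent template with LLM-extracted categories and items for each (shop, cluster) pair, which this sketch leaves out.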