Evaluating LLM-persona Generated Distributions for Decision-making

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, in which an LLM is used to construct an estimated distribution and the decision is then optimized under that distribution. In this paper, we study metrics to evaluate the quality of these LLM-generated distributions, based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and newsvendor) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as Wasserstein distance can be misleading when evaluating these distributions for decision-making.


💡 Research Summary

The paper introduces a novel framework called LLM‑SAA (Large Language Model – Sample Average Approximation) that leverages synthetic data generated by large language models (LLMs) to solve classic operations‑research decision problems under uncertainty. Traditional Sample Average Approximation relies on historical observations to construct an empirical distribution of uncertain parameters (e.g., customer preferences, willingness‑to‑pay, demand). When such data are scarce or unavailable (common for new products or rapidly changing markets), standard SAA cannot be applied. The authors propose to prompt an LLM with product descriptions, market context, and optionally few‑shot examples, asking it to simulate individual outcomes (e.g., a willingness‑to‑pay value). By repeating the prompt many times, a synthetic empirical distribution $\hat{F}$ is obtained, which is then fed into the usual optimization model to produce a decision $\hat{a}$.
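The pipeline above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: `sample_wtp_from_llm` is a hypothetical stand-in for repeated LLM queries (stubbed here with a lognormal draw purely so the example runs), and the optimizer simply maximizes empirical revenue over a candidate price grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_wtp_from_llm(n):
    """Stand-in for n repeated LLM queries, each simulating one customer's
    willingness-to-pay. Stubbed with a lognormal draw for illustration only."""
    return rng.lognormal(mean=3.0, sigma=0.5, size=n)

def saa_price(wtp_samples, candidate_prices):
    """SAA step: pick the price maximizing empirical expected revenue
    p * P_hat(WTP >= p) under the synthetic distribution F_hat."""
    best_p, best_rev = None, -np.inf
    for p in candidate_prices:
        rev = p * np.mean(wtp_samples >= p)
        if rev > best_rev:
            best_p, best_rev = p, rev
    return best_p, best_rev

wtp = sample_wtp_from_llm(1000)           # synthetic empirical distribution F_hat
grid = np.linspace(1, 60, 200)
price, revenue = saa_price(wtp, grid)     # decision a_hat and its estimated value
```

In a real deployment the stub would be replaced by actual model calls; everything downstream of the samples is ordinary SAA.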

Two research questions guide the study: (1) Does statistical closeness between $\hat{F}$ and the true distribution $F$ guarantee good downstream decisions? (2) Can LLM‑generated distributions be practically useful when real data are limited? To answer these, the authors focus on three well‑studied problems: (i) assortment optimization with rank‑based choice, (ii) single‑product pricing with a willingness‑to‑pay outcome, and (iii) the newsvendor inventory problem. For each problem, the true distribution $F$ is derived from a real‑world dataset, and the optimal action $a^*$ (computed with full knowledge of $F$) serves as a benchmark.
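For the newsvendor problem in particular, the SAA decision has a well-known closed form: order the critical-fractile quantile $q = (p - c)/p$ of the empirical demand distribution. A short sketch with invented toy numbers (not from the paper):

```python
import numpy as np

def newsvendor_saa(demand_samples, price, cost):
    """SAA solution to the newsvendor problem: order the critical-fractile
    quantile (price - cost) / price of the empirical demand distribution."""
    fractile = (price - cost) / price
    return float(np.quantile(demand_samples, fractile))

# Toy demand samples; with price=10 and cost=4 the fractile is 0.6.
demand = np.array([80, 95, 100, 110, 120, 130, 90, 105])
q = newsvendor_saa(demand, price=10.0, cost=4.0)
```

The same function applies whether `demand_samples` comes from historical data (classical SAA) or from repeated LLM queries (LLM‑SAA); only the source of the samples changes.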

The evaluation methodology distinguishes between decision‑aware and decision‑agnostic metrics. Decision‑aware performance is measured by a competitive ratio $C_\theta(F,\hat{F})$, defined as the ratio of the expected reward of the LLM‑derived decision $\hat{a}$ to that of the true optimal decision $a^*$. Two aggregations are considered: (i) WorstCR, the minimum competitive ratio over all admissible problem parameters $\theta$, which captures a worst‑case robustness perspective, and (ii) AvgCR, the expectation of the competitive ratio under a prescribed distribution over $\theta$. The authors formulate the computation of WorstCR as a bilevel optimization problem where both the true optimal action and the LLM‑derived action are decision variables subject to optimality constraints; they provide closed‑form or tractable reformulations for each of the three problems.
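For a fixed problem instance, the competitive ratio is straightforward to estimate by simulation once both distributions can be sampled. A sketch for the pricing problem, with illustrative lognormal stand-ins for $F$ and $\hat{F}$ (the distributions, grid, and seeds are assumptions for the example, not values from the paper):

```python
import numpy as np

def expected_revenue(price, wtp_samples):
    """Expected pricing revenue p * P(WTP >= p) under the empirical
    distribution given by wtp_samples."""
    return price * np.mean(wtp_samples >= price)

def competitive_ratio(true_samples, est_samples, grid):
    """C(F, F_hat): the action optimized under F_hat, evaluated under the
    true F, relative to the true optimum a* over the same grid."""
    a_hat = max(grid, key=lambda p: expected_revenue(p, est_samples))
    a_star = max(grid, key=lambda p: expected_revenue(p, true_samples))
    return expected_revenue(a_hat, true_samples) / expected_revenue(a_star, true_samples)

rng = np.random.default_rng(1)
true_wtp = rng.lognormal(3.0, 0.5, 5000)   # stands in for the real F
est_wtp = rng.lognormal(3.1, 0.6, 5000)    # stands in for an LLM-generated F_hat
grid = np.linspace(1, 60, 120)
cr = competitive_ratio(true_wtp, est_wtp, grid)
```

By construction the ratio is at most 1: `a_star` maximizes true revenue over the grid, so any other grid point, including `a_hat`, can only do as well or worse. WorstCR then minimizes this quantity over problem parameters, which is the bilevel formulation the authors study.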

Decision‑agnostic metrics include the Wasserstein distance and the Kolmogorov‑Smirnov statistic between $\hat{F}$ and $F$. Empirically, the paper shows that these distances can be large even when the competitive ratio is close to one, indicating that a distribution need not be statistically identical to $F$ as long as it captures the decision‑relevant features (e.g., the price region where purchase probability drops sharply).
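Both decision-agnostic metrics have simple empirical forms. A self-contained numpy sketch (the one-dimensional $W_1$ identity used below requires equal sample sizes; the two lognormal inputs are illustrative stand-ins, not the paper's data):

```python
import numpy as np

def wasserstein_1d(u, v):
    """W1 distance between two equal-size empirical samples: the mean
    absolute difference of the sorted samples (a standard 1D identity)."""
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

def ks_statistic(u, v):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs, evaluated at every observed point."""
    grid = np.sort(np.concatenate([u, v]))
    cdf_u = np.searchsorted(np.sort(u), grid, side="right") / len(u)
    cdf_v = np.searchsorted(np.sort(v), grid, side="right") / len(v)
    return float(np.max(np.abs(cdf_u - cdf_v)))

rng = np.random.default_rng(0)
f_true = rng.lognormal(3.0, 0.5, 4000)   # stands in for the true F
f_hat = rng.lognormal(3.2, 0.7, 4000)    # stands in for an LLM-generated F_hat
w1 = wasserstein_1d(f_true, f_hat)
ks = ks_statistic(f_true, f_hat)
```

The paper's point is that neither number, on its own, predicts decision quality: two distributions can differ substantially in these metrics yet induce near-identical optimal actions.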

Four LLM generation strategies are examined: (1) plain sampling (repeatedly ask the LLM for a single outcome), (2) persona‑sampling (each query is conditioned on a different synthetic customer persona), (3) batch generation (the LLM returns a whole set of outcomes in one prompt), and (4) descriptive prompting (the LLM is asked to describe the whole distribution). Each strategy is tested under two information regimes: a basic description of the product and market, and a few‑shot regime where a few real examples are provided. Baselines include a uniform random distribution, an empirical distribution built from a small number $d$ of real samples, and a heuristic score‑based method for the assortment problem.
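To make the first two strategies concrete, a hypothetical sketch of the prompt construction: `call_llm` is a stand-in for a real API client (stubbed with a fixed reply so the example runs), and the product description and personas are invented for illustration.

```python
def call_llm(prompt):
    """Stand-in for a chat-model call; a real client would send `prompt`
    to an API. The stub returns a fixed plausible reply."""
    return "23.50"

PRODUCT = "a 500 ml insulated steel water bottle"  # hypothetical product

def plain_sample():
    # Strategy (1): no conditioning beyond the product description.
    prompt = (f"Simulate one customer's willingness to pay (USD) for "
              f"{PRODUCT}. Answer with a single number.")
    return float(call_llm(prompt))

def persona_sample(persona):
    # Strategy (2): each query is conditioned on a distinct persona,
    # which pushes the samples toward different parts of the distribution.
    prompt = (f"You are {persona}. State your willingness to pay (USD) for "
              f"{PRODUCT}. Answer with a single number.")
    return float(call_llm(prompt))

personas = ["a budget-conscious student", "a high-income frequent hiker"]
samples = [plain_sample()] + [persona_sample(p) for p in personas]
```

Repeating such queries many times (with varied personas for strategy 2) yields the synthetic empirical distribution that LLM‑SAA then optimizes over.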

Key experimental findings are:

  • In low‑data regimes (e.g., $d = 5$–$10$ real samples), LLM‑SAA consistently outperforms the uniform baseline and often surpasses the empirical‑sample baseline, achieving average competitive ratios between 0.85 and 0.92.
  • Persona‑sampling yields the strongest performance for pricing, because explicitly specifying high‑income and low‑income personas forces the LLM to generate willingness‑to‑pay values that accurately represent the tails of the true distribution.
  • Batch generation is computationally efficient but slightly less accurate than repeated sampling, due to reduced diversity in the synthetic set.
  • Descriptive prompting performs poorly because the LLM’s textual description of a distribution often deviates substantially from the true parametric shape, leading to competitive ratios around 0.68.
  • WorstCR analysis reveals that LLM‑SAA maintains competitive ratios above 0.7 across a wide range of problem parameters (e.g., varying unit cost in pricing). This robustness stems from the LLM’s ability to capture the critical “kink” in the purchase probability curve rather than the full distributional shape.
  • Wasserstein and KS distances are frequently large even when competitive ratios are high, confirming that decision‑agnostic metrics can be misleading for evaluating synthetic distributions intended for optimization.

The authors conclude that (i) decision‑aware evaluation is essential when assessing LLM‑generated data for optimization, (ii) LLM‑SAA provides a viable solution in data‑starved environments, and (iii) careful prompt design—especially the use of personas—significantly enhances performance. Limitations include the reliance on proprietary LLMs (cost and privacy concerns), potential bias introduced by the choice of personas, and the need for problem‑specific worst‑case analysis, which may not generalize easily.

Future research directions suggested are: developing generic algorithms for worst‑case competitive ratio computation across broader classes of stochastic programs, hybridizing LLM‑generated samples with limited real data, automating persona creation and prompt optimization, and exploring cost‑effective open‑source LLM alternatives. Overall, the paper establishes a new evaluation paradigm—prioritizing downstream decision quality over pure statistical fidelity—and demonstrates that LLM‑generated synthetic data can meaningfully improve operational decisions when traditional data sources are unavailable.

