Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices
As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users’ past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
💡 Research Summary
This paper investigates whether large language models (LLMs) can be trusted to make subjective decisions on behalf of users in a travel‑assistant context, by estimating their implied willingness‑to‑pay (WTP) for hotel‑room attributes. The authors adapt a classic discrete‑choice experiment (DCE) from the hospitality economics literature (Masiero et al.) to generate 240 binary choice dilemmas, each presenting two hotel‑room alternatives described by seven attributes: view (city vs. harbor), floor (10th, 18th, 26th), club access (yes/no), minibar (basic vs. full), smartphone provision (none/with data), cancellation policy (non‑refundable/refundable), and price (HK$ 1,600‑3,200). Price is the only “bad” attribute and serves as the monetary denominator for WTP calculations.
Four prompting conditions are examined: (1) No user info – the model receives only the two alternatives and must answer “A” or “B”; (2) In‑Context Learning (ICL) – one or three example choices are supplied, either randomly generated or manually crafted, with the chosen option being consistently cheaper, consistently more expensive, or mixed; (3) Persona – a user profile is added, either a business traveler (company‑paid, comfort‑oriented) or a budget‑conscious student; (4) Both – persona information combined with three ICL examples that are aligned with the persona’s presumed preference (expensive for business, cheap for student).
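The four conditions can be composed from the same building blocks. A minimal sketch of such a prompt assembler, assuming the function name, wording, and example format are illustrative rather than the paper's exact prompts:

```python
def build_prompt(option_a, option_b, persona=None, icl_examples=None):
    """Assemble one choice dilemma under any of the four prompting conditions.

    persona:      optional user-profile sentence (Persona / Both conditions)
    icl_examples: optional list of (shown_options, chosen) past choices
                  (ICL / Both conditions)
    """
    parts = []
    if persona is not None:
        parts.append(f"User profile: {persona}")
    for shown, chosen in (icl_examples or []):
        parts.append(f"Earlier the user saw {shown} and chose {chosen}.")
    parts.append(f"Option A: {option_a}")
    parts.append(f"Option B: {option_b}")
    parts.append("Which room should the user book? Reply with 'A' or 'B' only.")
    return "\n".join(parts)
```

Under this sketch, the "No user info" condition is `build_prompt(a, b)`, the ICL condition passes one or three examples, and "Both" passes a persona together with three examples aligned with it.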
The study evaluates three state‑of‑the‑art LLMs that support deterministic (temperature 0) generation: Llama 3.3 70B, GPT‑4o, and Gemini‑3‑Pro. Smaller models (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 7B, Haiku) are also tested but are excluded from the main analysis because they exhibit severe order bias (always picking the first option), implausibly positive price coefficients, and very low pseudo‑R² in the subsequent logit estimation.
For each (prompt, model) configuration, the 240 binary choices are fed into a multinomial logit (MNL) model:
Uᵢc = α_c + β·xᵢc + εᵢc,
where xᵢc denotes the vector of attribute levels for alternative c in dilemma i and β is the vector of attribute coefficients shared across alternatives. The probability of choosing A or B is proportional to exp(Uᵢc). Estimated β coefficients are then transformed into WTP for each non‑price attribute using the standard formula:
WTP_k = −(β_k / β_price) × (σ_price / σ_k),
where σ denotes the standard deviation of the respective attribute across the choice set; the σ ratio rescales coefficients estimated on standardized attributes back to monetary units, and the minus sign makes WTP positive when the price coefficient is negative.
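Because the two alternatives are unlabeled, the binary MNL reduces to a logit on attribute differences. The estimation and WTP step can be sketched as follows; the data here are simulated stand-ins for the 240 dilemmas (a three-attribute subset of the paper's seven, with hypothetical names), not the paper's actual choice data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 240  # one row per dilemma

# Attribute differences x_A - x_B for [harbor_view, club_access, price].
x_diff = np.column_stack([
    rng.integers(0, 2, n) - rng.integers(0, 2, n),            # view diff
    rng.integers(0, 2, n) - rng.integers(0, 2, n),            # club diff
    rng.uniform(1600, 3200, n) - rng.uniform(1600, 3200, n),  # price diff
])

# Simulate choices from "true" utilities so the recovered betas are checkable.
beta_true = np.array([1.0, 0.5, -0.003])
p_choose_a = 1 / (1 + np.exp(-(x_diff @ beta_true)))
y = (rng.random(n) < p_choose_a).astype(float)  # 1 = chose A

def neg_log_lik(beta):
    # Binary logit: P(A) = sigmoid(beta . (x_A - x_B));
    # log-likelihood is sum of y*z - log(1 + e^z).
    z = x_diff @ beta
    return -np.sum(y * z - np.logaddexp(0.0, z))

beta_hat = minimize(neg_log_lik, np.zeros(3), method="BFGS").x

# WTP per non-price attribute in HK$: -beta_k / beta_price
# (coefficients here are on raw units, so no sigma rescaling is needed).
wtp = -beta_hat[:2] / beta_hat[2]
print(wtp)
```

With a significantly negative price coefficient, the recovered WTP values are positive and interpretable in HK$, mirroring the check the paper applies before trusting a model's estimates.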
Key findings:
- Feasibility of WTP extraction: All three large models produce statistically significant negative price coefficients, enabling sensible WTP calculations. The pseudo‑R² values range from ~0.38 (Llama 3.3 70B) to ~0.42 (GPT‑4o), indicating moderate explanatory power.
- Systematic over‑estimation: Compared with benchmark human WTP values reported in the economics literature, the LLM‑derived WTPs are 15–30% higher on average. The gap widens for “expensive” price levels and under the business‑persona condition, suggesting that LLMs may over‑value comfort‑related attributes when the user is framed as a high‑spending traveler.
- Impact of user‑information conditioning: Providing ICL examples that consistently show the user choosing cheaper options reduces the over‑estimation. In the “Both” condition, when the persona’s preference aligns with the ICL examples (e.g., student + cheap examples), the resulting WTP aligns closely with human benchmarks. Conversely, mismatched conditioning (business persona + cheap examples) leads to the largest deviations.
- Model‑specific quirks: Gemini‑3‑Pro assigns an especially high WTP to the minibar attribute, while Llama 3.3 70B shows slightly lower sensitivity to floor level. GPT‑4o is the most stable across prompting conditions.
- Small‑model limitations: The order bias and positive price coefficients observed in the smaller models render their WTP estimates economically meaningless, underscoring the importance of model scale for this type of analysis.
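The order-bias screen used to exclude the small models can be implemented as a simple position-swap test. A sketch, assuming the function name and toy response lists are illustrative:

```python
def position_consistency(original, swapped):
    """Share of dilemmas answered consistently when the A/B order is flipped.

    original[i] and swapped[i] are the 'A'/'B' answers to the same dilemma
    presented in original vs. swapped order. A model that evaluates content
    (not position) flips its letter when the order flips; a model that
    always picks the first option never does.
    """
    flip = {"A": "B", "B": "A"}
    hits = sum(1 for o, s in zip(original, swapped) if flip[o] == s)
    return hits / len(original)

# A content-driven model flips with the order; an order-biased one does not.
print(position_consistency(["A", "B", "A"], ["B", "A", "B"]))  # 1.0
print(position_consistency(["A", "A", "A"], ["A", "A", "A"]))  # 0.0
```

A consistency score near zero, as with a model that always answers “A”, signals the severe order bias that disqualified the smaller models from the main analysis.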
Implications: The work demonstrates that LLMs can be evaluated with rigorous economic tools, but naïve deployment may lead to systematic bias—particularly an inflated willingness to pay for premium features. Prompt engineering (especially the inclusion of relevant user history) can mitigate this bias, suggesting a practical pathway for aligning LLM recommendations with actual user preferences. The authors release all code, prompts, and simulated choice data on GitHub, facilitating reproducibility and future extensions.
Conclusion: Large language models are capable of generating choice data that, when fed into standard discrete‑choice models, yield interpretable WTP estimates. However, model size matters, and careful prompt design, particularly the provision of past‑choice context, is essential to avoid over‑optimistic valuations. The study provides a blueprint for integrating economic validation into the development of LLM‑driven decision‑support systems, especially in domains where subjective trade‑offs dominate.