Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models


Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman's rank correlation ($ρ$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $ρ$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.


💡 Research Summary

The paper investigates how different elicitation protocols affect the measured gap between stated and revealed preferences (SvR) in large language models (LLMs). Using 24 contemporary models—including LLaMA‑3.1, Qwen‑3, Mistral‑3, Gemma‑3, Claude, and others—the authors compare three protocol configurations: (1) forced‑choice for both stated and revealed preferences (the baseline used in prior work), (2) expanded‑choice (allowing “Equal Preference” or “Depends”) for stated preferences while keeping forced‑choice for revealed preferences, and (3) expanded‑choice for both stages. All generations use deterministic decoding, and a GPT‑4o‑mini‑based judge classifies responses into binary, equal, depends, or other categories.
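The three protocol configurations and the judge's four-way response categorization can be sketched as follows. This is an illustrative reconstruction: the option labels and category names follow the summary above, but the paper's exact prompt wording is not quoted here.

```python
# Illustrative sketch of the three elicitation protocols compared in the
# paper. Option strings are assumptions based on the summary, not the
# authors' verbatim wording.

BINARY = ["A", "B"]                                    # forced choice
EXPANDED = ["A", "B", "Equal Preference", "Depends"]   # allows neutrality

PROTOCOLS = {
    "forced/forced":     {"stated": BINARY,   "revealed": BINARY},
    "expanded/forced":   {"stated": EXPANDED, "revealed": BINARY},
    "expanded/expanded": {"stated": EXPANDED, "revealed": EXPANDED},
}

def categorize(judge_label: str) -> str:
    """Map a judge's label onto the four response categories
    (binary / equal / depends / other) described in the summary."""
    if judge_label in ("A", "B"):
        return "binary"
    if judge_label == "Equal Preference":
        return "equal"
    if judge_label == "Depends":
        return "depends"
    return "other"
```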

Neutrality Findings
When models are given the option to express neutrality, the rate of neutral responses varies dramatically across families. In the stated‑preference phase, neutrality ranges from roughly 48 % to 100 %, with Qwen‑3‑8B almost always selecting “Depends”. In the revealed‑preference phase (AIRiskDilemmas), neutrality is even higher: Mistral‑3‑8B chooses a neutral response in nearly every scenario, and Gemma‑3‑4B does so about 70 % of the time. This suggests that models are far more uncertain when asked to act in concrete moral dilemmas than when asked abstract value comparisons.
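The per-model neutrality rates quoted above reduce to a simple fraction over judged responses; a minimal sketch, assuming responses have already been mapped to the four categories:

```python
def neutrality_rate(categories):
    """Fraction of judged responses that are neutral, i.e. fall in the
    'equal' or 'depends' categories rather than a binary A/B choice."""
    neutral = sum(c in ("equal", "depends") for c in categories)
    return neutral / len(categories)
```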

Impact on SvR Correlation
Under the baseline forced‑forced protocol, Spearman's ρ between the global value rankings derived from stated preferences and the rankings inferred from revealed behavior is modest and highly variable (≈0.2–0.5). Switching to expanded‑stated / forced‑revealed dramatically improves correlation: many models see ρ rise to 0.5–0.7, with LLaMA‑3.1‑405B‑Instruct jumping from ~0.2 to ~0.7. The authors attribute this to the filtering effect of neutrality—weak or ambiguous comparisons are excluded, leaving a cleaner hierarchy that aligns better with actual decision making. Moreover, under this configuration ρ itself correlates positively with the Epoch Capabilities Index (ρ = 0.58, p = 0.02), indicating that more capable models exhibit a tighter alignment between abstract values and concrete actions once neutral statements are filtered out.
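The filtering-then-correlating step can be sketched in pure Python. This is a hypothetical reconstruction: the filtering criterion (drop values whose stated comparisons were mostly neutral) and the 0.5 threshold are assumptions for illustration, and the classic no-ties Spearman formula stands in for whatever implementation the authors used.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))
    formula; assumes no tied scores."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def filtered_rho(stated, revealed, neutral_frac, threshold=0.5):
    """Hypothetical filtering step: drop values whose stated comparisons
    were mostly neutral, then correlate the surviving rankings."""
    keep = [i for i, f in enumerate(neutral_frac) if f < threshold]
    return spearman_rho([stated[i] for i in keep],
                        [revealed[i] for i in keep])
```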

Conversely, when neutrality is permitted in both stages (expanded‑expanded), the correlation collapses to near zero or even negative values for many models. The reason is that the revealed‑preference data become sparse: most models repeatedly answer “Depends” or “Equal Preference,” leaving too few binary decisions to construct a stable ranking. Consequently, the residual binary signal is insufficient to produce a meaningful rank‑based comparison, and the measured SvR gap essentially disappears—not because the gap is resolved, but because the measurement is rendered unreliable.

System‑Prompt Steering Experiments
The authors also test a simple steering technique: they prepend a system prompt containing each model’s own stated value ranking (obtained under expanded‑stated elicitation) before asking the model to answer the revealed‑preference dilemmas. The goal is to see whether exposing the model to its declared hierarchy can reduce the SvR gap. Results are mixed and largely negative. A few models (Ministral‑3B, Gemma‑3‑4B) show modest improvements (Δρ ≈ +0.05–0.10), while the Claude family consistently regresses (Δρ ≈ ‑0.07). This aligns with prior findings that prompt‑based steering works better for small value sets (e.g., 3‑value HHH sets) but degrades as the number of values grows (16‑value set here). The authors conclude that merely inserting a textual hierarchy is insufficient to override entrenched behavioral priors when the value space is large.
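The steering intervention amounts to serializing the model's stated ranking into a system prompt. A minimal sketch, assuming a numbered most-to-least-important list; the exact prompt template the authors used is not quoted in this summary.

```python
def steering_prompt(value_ranking):
    """Build a system prompt embedding a model's own stated value
    hierarchy (most to least important). Wording is an assumption."""
    ranked = "\n".join(f"{i + 1}. {v}" for i, v in enumerate(value_ranking))
    return (
        "When resolving the following dilemma, act consistently with "
        "your previously stated value ranking (most to least important):\n"
        + ranked
    )
```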

Discussion and Implications
The study demonstrates that SvR correlation is highly protocol‑dependent. Allowing neutrality in the stated‑preference stage improves measurement by filtering out weak signals, whereas allowing neutrality in the revealed‑preference stage exposes the extent to which many LLMs lack a decisive ordering over values in concrete contexts. Consequently, rank‑based SvR metrics become unreliable when neutral responses dominate. The authors argue for evaluation frameworks that explicitly model neutrality—e.g., by treating neutral responses as a separate probabilistic outcome, using multi‑sample aggregation, or employing Bayesian hierarchical models—to capture the true uncertainty rather than discarding it.
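One of the suggested alternatives, multi-sample aggregation with neutrality as a first-class outcome, can be sketched as follows. This is my illustration of the idea, not the authors' implementation: repeated samples for one comparison are aggregated into a distribution over {A, B, neutral} instead of being forced into a binary call.

```python
from collections import Counter

def outcome_distribution(samples):
    """Aggregate repeated samples for one pairwise comparison into a
    probability distribution over {'A', 'B', 'neutral'}, treating any
    non-binary answer (e.g. 'Depends') as the neutral outcome."""
    counts = Counter(s if s in ("A", "B") else "neutral" for s in samples)
    total = sum(counts.values())
    return {k: counts.get(k, 0) / total for k in ("A", "B", "neutral")}
```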

Furthermore, the limited success of system‑prompt steering suggests that more robust interventions (e.g., fine‑tuning with preference‑aligned reward models, reinforcement learning from human feedback that incorporates the full value hierarchy, or architectural changes that enforce consistency) are required to close the SvR gap, especially for richer value sets.

Finally, the paper releases the “MindTheGap” dataset (pairwise value comparisons, AIRiskDilemmas responses, and neutrality annotations) and the full evaluation code, providing a valuable benchmark for future work on preference elicitation, uncertainty modeling, and alignment of LLMs.

