On the Credibility of Evaluating LLMs using Survey Questions

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientations. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance, which measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. Our results indicate that even high average agreement with human data, when LLM responses are considered independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, both of which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.


💡 Research Summary

This paper critically examines the methodology of evaluating large language models (LLMs) by comparing their responses to standardized social surveys, specifically the World Value Survey (WVS). While recent work has treated the alignment between model outputs and average human responses as a proxy for value orientation, the authors demonstrate that this approach can both over‑ and under‑estimate true similarity depending on prompting style, decoding strategy, and evaluation metric.

The study manipulates three dimensions: (1) prompt formulation – a “direct” style that asks for a numeric answer versus a Chain‑of‑Thought (CoT) style that requires a brief justification before the answer; (2) decoding – greedy deterministic decoding versus nucleus sampling (p = 0.9, temperature = 0.7); and (3) similarity metrics – Mean Squared Difference (MSD), Kullback‑Leibler Divergence (KLD), and a newly introduced Self‑Correlation Distance. The latter computes Pearson correlation matrices across all question pairs for both human respondents and model runs, then measures the Frobenius norm of the difference, thereby capturing second‑order structure that traditional metrics ignore.
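The Self-Correlation Distance described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the exact normalization of the Frobenius norm, any preprocessing of answers, and the handling of zero-variance questions in the paper are not specified here.

```python
import numpy as np

def self_correlation_distance(human_answers, model_answers):
    """Frobenius norm of the difference between the inter-question
    Pearson correlation matrices of human and model answers.

    human_answers: (n_respondents, n_questions) array of survey answers
    model_answers: (n_runs, n_questions) array of sampled model answers

    Assumes every question column has nonzero variance; otherwise
    np.corrcoef yields NaN entries.
    """
    # rowvar=False: columns (questions) are the variables being correlated
    corr_human = np.corrcoef(human_answers, rowvar=False)
    corr_model = np.corrcoef(model_answers, rowvar=False)
    return float(np.linalg.norm(corr_human - corr_model, ord="fro"))
```

Note that the model-side correlation matrix needs variation across multiple runs per question, which is one reason the authors favor sampling-based decoding with many samples over a single greedy response.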

Experiments are conducted on 143 WVS items translated into English, German, and Czech, covering six countries (USA, UK, Germany, Czechia, Iran, China). Four instruction‑tuned models are evaluated: LLaMA 3 8B Instruct, Mistral 2 7B Instruct, EuroLLM 9B Instruct, and Qwen 2.5 7B Instruct. For each model‑prompt‑decoding combination, the authors generate 30–100 sampled responses to obtain stable estimates.
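The two prompt styles might look roughly as follows; the wording is illustrative and not the paper's exact templates:

```python
def build_prompt(question, options, style="direct"):
    """Build a survey prompt in one of two styles (illustrative wording,
    not the paper's actual templates)."""
    base = f"{question}\nOptions: {', '.join(options)}\n"
    if style == "direct":
        # Direct: request only the numeric answer
        return base + "Answer with the number of your chosen option only."
    # Chain-of-thought: request a brief justification before the answer
    return base + ("Briefly explain your reasoning, then state the number "
                   "of your chosen option.")
```
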

Results show that MSD and KLD are highly sensitive to prompting and decoding. CoT prompting combined with sampling yields the lowest surface‑level distances for Mistral 2 (MSD = 0.022, KLD = 0.26), comparable to the natural variation between Western countries. By contrast, the same model with greedy decoding produces a dramatically higher MSD (0.188), exceeding even the USA‑Iran gap. LLaMA 3 achieves moderate scores (MSD = 0.059, KLD = 1.47) under CoT+sampling, EuroLLM improves markedly when sampling replaces greedy decoding, and Qwen 2.5 remains relatively stable across settings.

Crucially, the Self‑Correlation Distance reveals a paradox: configurations that excel on MSD/KLD often fail to reproduce the inter‑question correlation structure observed in human data. For example, Mistral 2’s CoT+sampling setup, despite its low MSD/KLD, shows a self‑correlation distance of 0.78, indicating a mismatch in the underlying value network. Conversely, EuroLLM’s direct+greedy configuration, while scoring poorly on MSD/KLD, yields a smaller self‑correlation distance (≈0.62), suggesting that its answers preserve human‑like relational patterns even if the absolute values differ.

These findings argue that evaluating LLMs solely on average agreement or distributional divergence is insufficient. A model may appear aligned on a per‑question basis yet lack the coherent value structure that characterizes real populations. The authors therefore recommend a multi‑metric evaluation protocol: adopt CoT prompting as the default, employ nucleus sampling with dozens of samples to capture response variability, and report both traditional distance measures and the self‑correlation distance. Such a comprehensive approach would provide a more nuanced picture of how well LLMs embody societal values, especially across languages and cultures, and guide future work on bias mitigation and culturally aware AI development.
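The two traditional distance measures in the recommended protocol can be sketched as follows. This is a hedged sketch: whether the paper computes MSD over per-question answer distributions or over mean answers, the direction of the KL divergence, and the smoothing constant are all assumptions here.

```python
import numpy as np

def answer_distribution(answers, n_options):
    """Empirical distribution over Likert options 1..n_options,
    estimated from a list of sampled answers."""
    counts = np.bincount(np.asarray(answers) - 1, minlength=n_options)
    return counts / counts.sum()

def msd(p_human, p_model):
    # Mean squared difference between the two answer distributions
    return float(np.mean((p_human - p_model) ** 2))

def kld(p_human, p_model, eps=1e-9):
    # KL(human || model) with additive smoothing; the direction of the
    # divergence is an assumption, not taken from the paper
    p = (p_human + eps) / (p_human + eps).sum()
    q = (p_model + eps) / (p_model + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Both metrics treat each question's answer distribution in isolation, which is exactly the independence assumption the self-correlation distance is designed to complement.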

