Can LLMs capture stable human-generated sentence entropy measures?
Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German 1 and English 2. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (<1) required as few as 20 responses and high-entropy sentences (>2.5) substantially more. These findings provide the first direct empirical validation for common norming practices and demonstrate that convergence critically depends on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs, including GPT-4o, using both logit-based probability extraction and sampling-based frequency estimation, GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates were better in capturing the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
💡 Research Summary
**
This paper tackles two intertwined questions that are central to psycholinguistic norming and the emerging use of large language models (LLMs) as proxies for human data: (1) How many human participants are needed to obtain stable, unbiased word‑level Shannon entropy estimates from cloze‑type norming tasks? (2) To what extent can contemporary LLMs reproduce these human‑derived entropy distributions?
Data and preprocessing
The authors employed two publicly available cloze datasets, one in German and one in English, each comprising roughly 2,000 sentences. For the German set, over 200 native speakers provided completions; for the English set, more than 150 participants did so. The task was an open‑cloze: participants typed the word they thought best continued the sentence, without a predefined list of correct answers. This design captures the full range of human expectations.
Bootstrap‑based convergence analysis
To quantify the relationship between sample size and entropy stability, the authors performed a bootstrap procedure. From the full pool of responses they repeatedly drew random subsamples of size n (1 ≤ n ≤ total) with replacement, computed the empirical probability distribution for each target word, and derived the Shannon entropy. For each n they measured (i) the change in mean entropy relative to the previous n (requiring < 0.01 bit change) and (ii) the standard deviation across bootstrap replicates (requiring < 0.05 bit). The smallest n meeting both criteria was deemed the convergence point.
Results of the convergence analysis
Across both languages, more than 97 % of sentences reached convergence within the available sample sizes (≈ 200 for German, ≈ 150 for English). Low‑entropy sentences (entropy ≤ 1 bit) required as few as 20 responses to stabilize, whereas high‑entropy sentences (entropy ≥ 2.5 bits) needed substantially more, often exceeding 150 participants. The 90th percentile of convergence points was 111 responses for German and 81 for English, confirming the rule‑of‑thumb that roughly 100 participants suffice for most items. Crucially, convergence was strongly modulated by sentence predictability, suggesting that norming studies can allocate resources adaptively: collect many responses for unpredictable items and far fewer for highly predictable ones.
LLM entropy estimation methods
The study compared several LLMs: GPT‑4o, GPT‑2‑xl (German‑specific), RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. Two estimation strategies were examined:
- Logit‑based – the model’s raw logits were soft‑maxed to obtain a probability distribution over the vocabulary; entropy was computed directly from these probabilities.
- Sampling‑based – the model was prompted to generate a large number (e.g., 1,000) of continuations for the same context; the empirical frequency of each token served as the probability estimate.
Both strategies were repeated multiple times per sentence, and results were averaged.
LLM vs. human entropy
GPT‑4o achieved the closest match to human entropy. In the logit‑based condition its mean absolute error (MAE) was 0.12 bits, while the sampling‑based condition yielded a Pearson correlation of r = 0.78 with human values. The other models performed worse: MAE ranged from 0.25 to 0.38 bits, and correlations from 0.55 to 0.70. Prompt engineering proved decisive; explicit instructions such as “provide the probability distribution for the next word” markedly improved logit‑based accuracy.
Interpretation and practical guidelines
The authors derive several actionable recommendations:
- Adaptive sampling – When designing cloze norming studies, estimate the expected entropy of each sentence (e.g., via pilot data or language model predictions) and allocate participants accordingly. High‑entropy items merit larger samples; low‑entropy items can be normed with minimal data.
- LLM as a supplement, not a replacement – Even the best LLM (GPT‑4o) captures the central tendency of human entropy but fails to reproduce the full dispersion of human responses, especially the long tail of rare continuations. Thus, LLMs should be used to augment human data (e.g., to fill gaps or generate preliminary estimates) rather than to supplant human norming entirely.
- Hybrid estimation – Logit‑based estimates minimize absolute error and are ideal for obtaining a precise central value. Sampling‑based estimates better reflect variability and are useful when the goal is to model the spread of human judgments. Combining both (e.g., weighting logit‑based values by sampling‑derived variance) yields a “mixed” estimator that leverages the strengths of each approach.
Limitations and future work
The study is limited to German and English; cross‑linguistic generalization (e.g., to typologically distant languages such as Korean or Arabic) remains to be tested. The impact of model size, training data composition, and fine‑tuning on entropy fidelity was not systematically explored. Moreover, the authors note that human responses exhibit temporal variability (e.g., intra‑participant consistency over time) that was not modeled, and that LLMs evolve with updates, potentially altering their entropy profiles. Future research should address these dimensions and investigate whether adaptive norming pipelines can be fully automated using LLM‑driven entropy predictions.
Conclusion
In sum, the paper provides the first empirical validation of the long‑standing practice of collecting roughly 100 human responses for cloze norming, showing that convergence is predictable and strongly linked to sentence entropy. It also demonstrates that state‑of‑the‑art LLMs, particularly GPT‑4o, can approximate human‑derived entropy but are not interchangeable with stable human distributions. The authors supply concrete methodological guidelines for both human norming and LLM‑based estimation, paving the way for more efficient and theoretically grounded psycholinguistic data collection.
Comments & Academic Discussion
Loading comments...
Leave a Comment