SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?


The common approach to communicate a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, neither through reasoning, nor chain-of-thought, nor explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach points toward a universal way of communicating LLM uncertainties, whose future development the SelfReflect score enables. To support the development of this universal form of LLM uncertainties, we publish the code that implements our metric for arbitrary LLMs under https://github.com/apple/ml-selfreflect .


💡 Research Summary

The paper “SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?” asks a fundamental question: can a large language model (LLM) honestly summarize the full probability distribution over possible answers rather than merely attaching a confidence number or a hedging phrase to a single output? To answer this, the authors introduce the SelfReflect metric, an information‑theoretic distance that quantifies how faithfully a textual summary captures the underlying answer distribution.

The theoretical core rests on the notion of predictive sufficiency. An ideal summary S of a set of sampled answers A(1:N) should satisfy I(A(1:N); B) = I(S; B) for any future answer B, which is equivalent to p(B | A(1:N)) = p(B | S). The authors operationalize this equivalence with a masked‑token (cloze) task: given a new answer B, they mask a token B_i and ask a separate “judge” LLM J to predict B_i conditioned on (i) the summary S and the remaining tokens B_{‑i}, and (ii) the full set of N samples A(1:N) and B_{‑i}. If S is truly sufficient, the two conditional distributions over the vocabulary should be identical. The divergence between them is measured with the 1‑Wasserstein distance (or, in the extreme case of a one‑hot output, simple token‑match accuracy). Averaging over all questions, all sampled answer sets, all mask positions, and over the random answer B yields the SelfReflect score for a summarization method ψ.
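The cloze-style comparison above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the paper's implementation: `stub_judge` stands in for the judge LLM J (returning a toy next-token distribution over a four-word vocabulary), and the 1-Wasserstein distance is computed on an ordered support with unit spacing.

```python
# Sketch of the SelfReflect scoring loop (names here are illustrative).

def wasserstein_1d(p, q):
    # 1-Wasserstein distance between two categorical distributions on the
    # same ordered support with unit spacing: sum of |CDF_p - CDF_q|.
    dist, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cp += pi
        cq += qi
        dist += abs(cp - cq)
    return dist

def selfreflect_score(summary, samples, answer_tokens, judge_probs):
    # For each masked position i of the held-out answer B, compare the
    # judge's distribution given (S, B_{-i}) vs. given (A^(1:N), B_{-i}).
    total = 0.0
    for i in range(len(answer_tokens)):
        b_rest = answer_tokens[:i] + answer_tokens[i + 1:]  # B_{-i}
        p_s = judge_probs(summary, b_rest)
        p_a = judge_probs(" | ".join(samples), b_rest)
        total += wasserstein_1d(p_s, p_a)
    return total / len(answer_tokens)  # lower = more faithful summary

def stub_judge(conditioning, context):
    # Toy judge: predicts "Paris" confidently iff the conditioning text
    # mentions it; otherwise falls back to a uniform distribution.
    return [0.7, 0.2, 0.1, 0.0] if "Paris" in conditioning else [0.25] * 4

samples = ["Paris", "Paris", "Lyon"]
tokens = ["The", "capital", "is", "Paris"]
faithful = selfreflect_score("Likely Paris, possibly Lyon", samples, tokens, stub_judge)
unfaithful = selfreflect_score("The answer is Rome", samples, tokens, stub_judge)
```

A summary that mentions the distribution's mode conditions the stub judge identically to the raw samples, so its score is 0; a summary that omits it incurs a positive Wasserstein gap at every masked position.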

Armed with this metric, the authors conduct two large‑scale studies. First, they evaluate 20 contemporary LLMs—including GPT‑4, Claude, Llama‑2, and various open‑source models—under a wide range of prompting strategies: direct “self‑reflect” prompts, chain‑of‑thought, supervised fine‑tuning (SFT), and direct preference optimization (DPO). Across the board, the SelfReflect scores are low, indicating that the models’ self‑generated uncertainty statements are not faithful to their internal distributions. Even when models produce a “75 % sure” style answer with a list of alternatives, the alternatives often do not correspond to the modes of the true distribution, and the reported probabilities are miscalibrated.

Second, the authors compare SelfReflect scores to human judgments of faithfulness. Human annotators read a model’s summary and the set of sampled answers, then rate how well the summary reflects the distribution. The correlation between human scores and SelfReflect is high (Pearson correlation ≈ 0.85), confirming that the metric captures the intuitive notion of a “good” distribution summary better than prior baselines such as LM‑judge scores or embedding‑based distances.

A key insight emerges when the authors augment the model with explicit samples. By prompting the LLM to generate N = 50 answer samples, feeding those samples back into the context, and then asking the model to summarize, the SelfReflect scores improve dramatically. In this “sampling‑feedback” regime, the model does not need to introspect its own probability table; it simply aggregates observable samples and produces a faithful textual summary. This suggests a practical pathway: external sampling combined with a summarization step can yield honest uncertainty communication even if the model’s internal reasoning remains opaque.
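The sampling-feedback recipe is straightforward to sketch. The following is a hypothetical sketch, not the paper's code: `generate` stands in for any LLM call (its name and the prompt wording are assumptions), and it is stubbed here so the example runs standalone.

```python
import collections
import itertools

def summarize_with_samples(question, generate, n_samples=50):
    # 1. Draw N answers from the model (in practice, at nonzero temperature).
    samples = [generate(question) for _ in range(n_samples)]
    # 2. Feed the observed samples back into the context and ask the model
    #    to summarize the options and their apparent likelihoods.
    counts = collections.Counter(samples)
    listing = "\n".join(f"- {a} ({c}/{n_samples})" for a, c in counts.most_common())
    prompt = (
        f"Question: {question}\n"
        f"Here are {n_samples} of your own sampled answers:\n{listing}\n"
        "Summarize all options you consider possible and how likely each is."
    )
    return generate(prompt)

# Stub LLM for demonstration: cycles through canned answers for the
# question, and echoes the aggregation prompt so we can inspect it.
answers = itertools.cycle(["Bern", "Bern", "Zurich"])

def stub_generate(prompt):
    if "sampled answers" in prompt:
        return prompt  # echo the summarization prompt for inspection
    return next(answers)

out = summarize_with_samples("Capital of Switzerland?", stub_generate, n_samples=3)
```

Note that the model never needs to introspect its own probability table: the empirical sample counts (here, "Bern (2/3)" and "Zurich (1/3)") are placed directly in the context, and the summarization step only has to aggregate what it can observe.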

The paper’s contributions are threefold: (1) a principled metric (SelfReflect) that measures the information equivalence between a summary string and a full answer distribution, grounded in predictive sufficiency and implemented via masked‑token prediction; (2) an extensive empirical evaluation showing that current LLMs cannot self‑reflect their distributions under any reasonable prompting or fine‑tuning regime; (3) the demonstration that a simple sampling‑feedback loop enables faithful uncertainty summaries, together with an open‑source implementation of the metric (https://github.com/apple/ml-selfreflect).

Overall, the work pushes uncertainty quantification beyond scalar confidence scores toward a richer “distribution‑summary” paradigm. It reveals a fundamental limitation of today’s LLMs—lack of intrinsic self‑awareness of their answer distribution—while offering a viable engineering workaround. The SelfReflect metric provides a rigorous benchmark for future research aiming to make LLMs more transparent, trustworthy, and useful in high‑stakes applications where understanding model uncertainty is essential.

