NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews

Notice: This research summary and analysis were generated automatically with AI assistance. For authoritative details, please refer to the original arXiv paper.

Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs’ strategic dialogue capabilities.


💡 Research Summary

The paper “NewsInterview: a Dataset and a Playground to Evaluate LLMs’ Ground Gap via Informational Interviews” tackles a persistent shortcoming of large language models (LLMs): the inability to consistently employ grounding language (e.g., acknowledgments, confirmations) and to execute strategic, multi‑turn dialogue planning. To study this problem in a natural, data‑rich setting, the authors focus on journalistic informational interviews, a domain where interviewers must coax information from often anxious or reluctant sources.

Dataset Construction
The authors aggregate publicly available transcripts from two major U.S. news outlets—National Public Radio (NPR) and Cable News Network (CNN). Starting from nearly half a million raw transcripts, they employ a two‑stage filtering pipeline: (1) an automated classifier built on Meta's Llama‑3.1‑70B‑Instruct tags each transcript for number of participants and content type; (2) manual validation removes panel discussions, game shows, and other non‑informational formats. After filtering, 45,848 dyadic (one‑on‑one) interviews remain, of which 40,000 are selected for the final dataset. The authors infer speaker roles by counting question marks, labeling the participant who asks more questions as the interviewer. The resulting conversations average 7.5 turns, with sources speaking roughly twice as many words as interviewers (≈551 vs. 270 words). Topics span literature, politics, academia, and international affairs, and the text reads at a Flesch‑Kincaid grade level of 6.9.
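The role-inference heuristic above is simple enough to sketch directly. The snippet below is a minimal illustration (function and variable names are our own, not from the paper's code): count question marks per speaker and label the more inquisitive one the interviewer.

```python
from collections import Counter

def infer_roles(turns):
    """Label the two speakers in a dyadic transcript.

    Heuristic from the paper: the participant whose utterances contain
    more question marks is taken to be the interviewer. `turns` is a
    list of (speaker, utterance) pairs.
    """
    counts = Counter()
    for speaker, text in turns:
        counts[speaker] += text.count("?")  # missing keys default to 0
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {"interviewer": ranked[0], "source": ranked[1]}

transcript = [
    ("HOST", "Welcome back. What drew you to this story?"),
    ("GUEST", "Honestly, it started with a phone call."),
    ("HOST", "And then what happened?"),
]
```

Like any surface heuristic, this can misfire on rhetorical questions from the source, which is presumably why the paper pairs it with manual validation.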

Empirical Comparison of Human vs. LLM Interviewers
To assess how current LLMs behave in this setting, the authors conduct a counterfactual experiment. Given a human interview, they feed the first t‑1 turns to an LLM and ask it to generate the next question. Four prompting strategies are evaluated: (i) a baseline prompt, (ii) Chain‑of‑Thought (CoT) prompting that encourages reasoning about what has been asked, (iii) an “Outline” prompt that supplies high‑level interview goals, and (iv) a combination (Outline‑CoT). All experiments use Llama‑3.1‑70B‑Instruct. The generated questions are then judged by GPT‑4o and a professional journalist across six alignment dimensions: informational consistency, motivational consistency, style, discourse role, contextual appropriateness, and exact match.
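The counterfactual setup amounts to truncating a real transcript and asking the model for the next question. A minimal sketch of the prompt assembly is below; the instruction wording is illustrative, not the paper's exact prompt, and the actual LLM call is abstracted away.

```python
def build_counterfactual_prompt(turns, t):
    """Format the first t-1 turns and request turn t's question.

    `turns` is a list of (speaker, utterance) pairs in transcript order.
    """
    history = "\n".join(f"{spk}: {utt}" for spk, utt in turns[: t - 1])
    return (
        "You are the interviewer in the conversation below.\n\n"
        f"{history}\n\n"
        "Write the single next question you would ask."
    )
```

The generated question can then be compared against the question the human interviewer actually asked at turn t, which is how the six alignment dimensions are scored.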

Results (Table 1) reveal stark gaps. Human interviewers produce acknowledgment statements in roughly 9 % of their turns, a figure that rises from 5 % early in the interview to over 20 % by the end. All LLM variants produce virtually zero acknowledgments. Moreover, LLMs over‑use follow‑up questions and under‑use “outline‑level” strategic questions that shift the interview toward higher‑order topics. They also increasingly ask opinion or broadening questions as the interview proceeds—behaviors that humans rarely exhibit. Even with CoT and outline prompts, the proportion of strategic questions remains significantly below human levels, indicating a fundamental deficit in multi‑turn planning rather than a simple prompting issue.

Simulation Game: NewsInterview
Recognizing that static transcript analysis cannot provide the long‑horizon reward signals needed to train strategic dialogue agents, the authors design a simulated interview game called NewsInterview. The game proceeds in K turns (e.g., K = 10). At each turn:

  1. The “interviewer” LLM receives the conversation history and a set of interview objectives, then generates a question.
  2. The “source” LLM is assigned a persona (e.g., anxious, clueless, dominating) and a pool of informational items I. It determines which items are relevant to the question, computes a persuasion level based on the dialogue history, and decides which items to disclose.

The reward for the interviewer is the total number of disclosed items across the episode. Crucially, the source’s willingness to share depends on how well the interviewer’s question aligns with the persona’s persuasion needs, forcing the interviewer to adopt a persuasive, grounding‑rich style.
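The episode loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `ask` stands in for the interviewer LLM, `respond` for the persona-driven source LLM (including its relevance and persuasion logic), and the reward is simply the count of disclosed items.

```python
def run_episode(ask, respond, info_items, k=10):
    """Run one simplified NewsInterview episode.

    `ask(history)` returns the interviewer's next question;
    `respond(question, history, remaining)` returns the subset of the
    remaining informational items the source chooses to disclose.
    Returns the episode reward (total items disclosed) and the history.
    """
    remaining = set(info_items)
    history, reward = [], 0
    for _ in range(k):
        question = ask(history)
        disclosed = respond(question, history, remaining)
        remaining -= disclosed        # items can only be disclosed once
        reward += len(disclosed)      # delayed, cumulative reward signal
        history.append((question, disclosed))
    return reward, history
```

Because disclosure depends on the full dialogue history, a greedy question at turn 1 can depress persuasion and cost the interviewer items at turn 8, which is exactly the long-horizon structure the environment is meant to expose.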

Human‑LLM correlation analyses show that source LLMs mimic human persuasion detection (r = 0.43, p < 0.0001), supporting the simulation's realism. However, interviewer LLMs consistently fail to (a) recognize when a question has been sufficiently answered, leading to redundant follow‑ups, and (b) adapt their tone or strategy to the source's persona, resulting in low information extraction regardless of model size. This pattern persists across baseline, CoT, outline, and combined prompting strategies.

Key Insights

  1. Grounding Gap – LLMs virtually ignore acknowledgment statements, a core grounding device that builds rapport and trust.
  2. Strategic Planning Deficit – LLMs generate many follow‑up questions but rarely transition to higher‑level, outline‑driven questions, indicating a lack of multi‑turn planning.
  3. Contextual Understanding Is Not Sufficient – While LLMs can keep track of conversational context, they lack the motivational and emotional drivers that guide human interviewers.
  4. Simulation Validity – The NewsInterview game provides a realistic testbed where source behavior aligns with human data, yet interviewer performance remains sub‑human, highlighting an actionable research frontier.

Contributions and Future Directions
The paper contributes (i) a large, publicly released dataset of 40 k dyadic news interviews, (ii) a thorough discourse‑level analysis exposing specific grounding and strategic deficiencies in current LLMs, and (iii) a novel reinforcement‑learning‑friendly simulation environment that captures delayed, persona‑dependent rewards. Future work could explore (a) reward‑shaping techniques that explicitly incentivize acknowledgment and persuasion, (b) hybrid human‑LLM interviewers that combine human intuition with model scalability, and (c) extending the simulation to other domains (e.g., therapeutic or educational dialogues) to generalize strategic dialogue improvements.

In sum, the study demonstrates that while LLMs can generate contextually appropriate questions, they fall short of the empathetic, strategically guided behavior exhibited by professional journalists. The NewsInterview dataset and playground lay a solid foundation for the next generation of dialogue systems that can not only understand language but also strategically steer conversations toward desired outcomes.

