Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed "Lost in Conversation" (LiC). While prior work attributes LiC to model unreliability, we argue that the root cause is an intent alignment gap rather than an intrinsic capability deficit. In this paper, we first demonstrate that LiC is not a failure of model capability but a breakdown in the interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in the conversational context rather than representational limitations. To address this, we propose decoupling intent understanding from task execution through a Mediator-Assistant architecture. By using an experience-driven Mediator to expand user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach bridges the gap between vague user intent and the model's interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.
💡 Research Summary
The paper revisits the “Lost in Conversation” (LiC) phenomenon reported by Laban et al. (2025), where large language models (LLMs) suffer a dramatic performance drop—about 30% absolute, roughly 60% relative—when moving from single‑turn, fully specified instructions to multi‑turn interactions with underspecified user inputs. While prior work attributes this degradation to model unreliability or a lack of conversational robustness, the authors argue that the root cause lies in a systematic misalignment between how users express intent and how the model infers it.
Theoretical framework
The authors formalize a dialogue as a sequence of contexts Cₜ = (u₁, a₁, …, uₜ) where each user utterance uₜ is generated from a latent deep intent Iₜ and a user‑specific pragmatic pattern T: uₜ ∼ P_user(u | Iₜ, T, Cₜ₋₁). This mapping is many‑to‑one, causing information loss. The LLM with parameters θ produces a response R ∼ P_θ(R | Cₜ). By expanding P_θ(R | Cₜ) they obtain a decomposition:
P_θ(R | Cₜ) = Σ_{Iₜ} P_θ(R | Iₜ)·P_θ(Iₜ | Cₜ).
Thus performance hinges on two orthogonal components: (1) Intent Inference (recovering Iₜ from Cₜ) and (2) Task Execution (solving the task given a correct Iₜ). Empirically, scaling model size improves the execution term but leaves the inference term unchanged, because the conditional entropy H(Iₜ | Cₜ) remains high. The authors show that, in the absence of additional signals, the model defaults to the prior distribution over intents learned during pre‑training, effectively assuming an “average user”. This default to the prior explains why larger models can actually reinforce the misalignment rather than alleviate it.
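The two-term decomposition can be illustrated with a toy discrete model, where two latent intents map many-to-one onto the same ambiguous context (the intents, probabilities, and accuracies below are hypothetical, not the paper's data):

```python
import math

# Toy model: two latent intents are equally plausible given the same
# ambiguous context C, i.e. P(I | C) — the intent-inference term.
p_intent_given_context = {"summarize": 0.5, "translate": 0.5}

# P(correct | I): execution accuracy once the intent is known. Scaling the
# model can push these toward 1.0, but it does not change P(I | C).
p_correct_given_intent = {"summarize": 0.95, "translate": 0.95}

# The model commits to an intent i ~ P(I | C) and executes it; the answer
# is correct only if i matches the user's true intent AND execution succeeds:
# P(correct | C) = Σ_i 1[i = true] · P(correct | i) · P(i | C).
true_intent = "translate"
p_correct = sum(
    (p_correct_given_intent[i] if i == true_intent else 0.0) * p
    for i, p in p_intent_given_context.items()
)

# Conditional entropy H(I | C) in bits — the bottleneck term that
# execution-side scaling leaves untouched.
h_intent = -sum(p * math.log2(p) for p in p_intent_given_context.values() if p > 0)

print(f"P(correct | C) = {p_correct:.3f}")  # 0.475: capped by intent ambiguity
print(f"H(I | C) = {h_intent:.3f} bits")    # 1.000: one full bit of residual ambiguity
```

Even with near-perfect execution (0.95), accuracy is capped near 0.5 by the unresolved bit of intent entropy, which is the paper's core argument for why scaling alone cannot close the gap.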
Proposed solution: Mediator‑Assistant architecture
To break the information bottleneck, the paper introduces an auxiliary variable H – a generalized interaction history that captures longitudinal evidence of a particular user’s behavior (successful vs. failed dialogue trajectories). A Mediator M, trained as an LLM‑based Refiner, consumes the current context Cₜ together with H and produces a reconstructed, fully‑specified instruction ˆU that approximates the latent intent Iₜ. Formally, ˆU ∼ P(U | Cₜ, H), and conditioning on the history sharply lowers the residual ambiguity: H(Iₜ | Cₜ, H) ≪ H(Iₜ | Cₜ). The downstream Assistant then receives ˆU and executes the task using its unchanged parameters θ.
The Mediator is not a fine‑tuned model but a lightweight alignment layer that leverages experience‑driven guidelines distilled from contrastive pairs of failed and successful interactions. These guidelines encode pragmatic patterns specific to the user, allowing the Mediator to rewrite ambiguous utterances into clear, unambiguous commands without interrupting the conversation flow.
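A minimal sketch of how such a Mediator layer might sit in front of an unchanged Assistant. The guideline store, `learn`/`refine` methods, and the rule-based rewriting below are hypothetical stand-ins for the paper's experience-driven Refiner, which uses an LLM to perform the rewrite:

```python
from dataclasses import dataclass, field

@dataclass
class Mediator:
    """Rewrites ambiguous user turns into explicit instructions, using
    guidelines distilled from (failed, successful) interaction pairs."""
    # Hypothetical guideline store: vague pragmatic pattern -> explicit rewrite.
    guidelines: dict = field(default_factory=dict)

    def learn(self, failed_utterance: str, successful_utterance: str) -> None:
        # Contrastive pair: record how this user's vague phrasing should expand.
        self.guidelines[failed_utterance.lower()] = successful_utterance

    def refine(self, context: list, utterance: str) -> str:
        # U_hat ~ P(U | C_t, H): apply a stored rewrite if this pragmatic
        # pattern was seen before; otherwise pass the turn through, grounded
        # in the most recent context turns.
        rewrite = self.guidelines.get(utterance.lower())
        if rewrite:
            return rewrite
        recent = " | ".join(context[-2:])
        return f"[context: {recent}] {utterance}" if context else utterance

def assistant(instruction: str) -> str:
    # Stand-in for the unchanged downstream LLM with parameters theta.
    return f"<response to: {instruction}>"

# Usage: the Mediator distills one contrastive pair, then intercepts a
# later ambiguous turn before it reaches the Assistant.
m = Mediator()
m.learn("make it shorter",
        "Summarize the draft in at most 3 sentences, keeping the conclusion.")
refined = m.refine(["user: here is my draft", "assistant: <draft feedback>"],
                   "make it shorter")
print(assistant(refined))
```

The key design point carried over from the paper is that only the input is transformed: the Assistant's parameters are never touched, so the Mediator can be updated per user without retraining the core model.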
Experimental validation
The authors evaluate the approach on the LiC benchmark across several LLM families (LLaMA‑7B, LLaMA‑13B, GPT‑3.5‑turbo, etc.). By inserting the Mediator between the user and the Assistant, they observe:
- A consistent recovery of 25–35% absolute performance across models, effectively narrowing the gap between multi‑turn and single‑turn scores.
- Greater gains when the Refiner’s experience set E includes user‑specific corrective patterns, confirming the importance of personalized pragmatic knowledge.
- Minimal computational overhead compared to full model fine‑tuning or repeated clarification strategies, since only the input is transformed.
Ablation studies show that removing H (i.e., using only the raw context) reverts performance to the baseline LiC drop, while using a naïve clarification‑prompting strategy yields lower gains and harms user experience.
Key contributions
- Re‑characterization of LiC as an intent‑alignment problem rather than a capacity limitation.
- Information‑theoretic analysis that isolates the conditional entropy of intent as the primary bottleneck.
- Introduction of a Mediator‑Assistant pipeline that leverages historical interaction data to reduce intent entropy without modifying the core LLM.
- Empirical evidence of substantial, model‑agnostic performance recovery on a challenging multi‑turn benchmark.
Implications
The work suggests that future conversational AI systems should prioritize pragmatic alignment—understanding how individual users phrase ambiguous requests—over merely scaling model size or improving generic RLHF objectives. By treating the user‑model interface as a modular, updatable component (the Mediator), developers can achieve robust multi‑turn performance while preserving the underlying LLM’s capabilities and avoiding costly retraining. This paradigm shift opens avenues for personalized, context‑aware dialogue agents that remain reliable even when users provide only partial or evolving instructions.