DyCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, inference cost and latency constraints still demand efficient management of dialogue history in practice. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context while preserving the sequential nature of dialogue, requiring no predefined topic boundaries and enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation with more selective context usage and improved inference efficiency.
💡 Research Summary
The paper addresses the practical challenge of managing dialogue history for large language models (LLMs) in long‑form, open‑domain conversations that often span hundreds of turns and feature frequent topic shifts. While modern LLMs now support very large context windows (up to 1 M tokens in the newest GPT‑4 series), feeding the entire conversation to the model is still costly in terms of GPU usage, API pricing, and latency. Existing external context‑management approaches—summarization, turn‑level retrieval, and segment‑level retrieval—each have drawbacks: summarization can drop essential details, turn‑level retrieval may break discourse continuity, and segment‑level methods typically rely on pre‑segmentation based on past dialogue, which may not align with the current user query and often require extra LLM calls.
DyCP (Dynamic Context Pruning) is introduced as a lightweight, query-dependent method that selects relevant dialogue segments on the fly without any pre-segmentation or additional LLM invocations. The core pipeline works as follows: each previous turn (a user-assistant pair) is encoded once with a bi-encoder retriever B, producing a sequence of turn-level embeddings H_emb. When a new user query qₙ arrives, it is encoded into qₙ_emb, and its dot product with H_emb yields a relevance score vector S. S is then standardized (z-score) and shifted by a gain threshold τ to produce a gain signal g. An extension of Kadane's maximum-subarray algorithm, named KadaneDial, scans g in linear time to locate one or more contiguous spans whose cumulative gain exceeds a stopping threshold θ. After each span is extracted, its positions are masked (set to –∞) so that subsequent iterations find additional high-gain regions. The selected spans are concatenated in their original chronological order, preserving dialogue continuity, and fed together with the current query to the LLM.
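The scoring-and-extraction loop above can be sketched in a few lines. This is a minimal illustration of the described steps (z-score standardization, shift by τ, repeated Kadane scans with masking until the best remaining gain falls below θ); the function name, default threshold values, and tie-breaking details are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kadane_dial(scores, tau=0.5, theta=0.2):
    """Sketch of DyCP-style span selection.

    `scores` are dot-product relevance scores between the query embedding
    and the per-turn embeddings. `tau` and `theta` mirror the paper's gain
    and stopping thresholds; the defaults here are arbitrary choices.
    Returns a chronologically sorted list of (start, end) turn spans.
    """
    s = np.asarray(scores, dtype=float)
    # Standardize (z-score), then shift down by the gain threshold tau,
    # so only above-average-relevance turns contribute positive gain.
    g = (s - s.mean()) / (s.std() + 1e-8) - tau

    spans = []
    work = g.copy()
    while True:
        # Classic Kadane scan: best contiguous sum in linear time.
        best_sum, best = -np.inf, (0, 0)
        cur_sum, cur_start = 0.0, 0
        for i, v in enumerate(work):
            if cur_sum <= 0:
                cur_sum, cur_start = v, i
            else:
                cur_sum += v
            if cur_sum > best_sum:
                best_sum, best = cur_sum, (cur_start, i)
        # Stop once no remaining span's cumulative gain reaches theta.
        if best_sum < theta:
            break
        spans.append(best)
        # Mask the extracted span so later iterations find other regions.
        work[best[0]:best[1] + 1] = -np.inf
    return sorted(spans)  # chronological order preserves continuity
```

For example, with scores peaking over a contiguous run of turns (say turns 2 through 4), the function extracts that single span; the turns inside it would then be concatenated in order and prepended to the current query.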
Key technical contributions include: (1) a novel adaptation of Kadane’s algorithm for relevance‑driven segment extraction; (2) a fully incremental, embedding‑only workflow that avoids any offline memory construction; (3) a principled use of standardized scores and gain thresholds to balance recall (including borderline‑relevant context) against the risk of omitting critical information.
The authors evaluate DyCP on three long‑form dialogue benchmarks—LoCoMo (average 300 turns), MT‑Bench+ (≈ 65 turns), and SCM4LLMs (≈ 64 turns)—using several LLM back‑ends that support at least 128 k token windows: GPT‑4o, Claude 3.7 (Sonnet), GPT‑4o mini, and supplemental GPT‑4.1 and Claude 4.0. Six competing context‑management strategies are compared: No Context, Full Context, MemoChat (segment‑level LLM‑based segmentation/retrieval), SCM4LLMs (turn‑level retrieval with multiple agents), SeCom (LLM‑based segmentation + dense retrieval + denoising), and CondMem (hybrid summarization + selective memory).
Results show that DyCP consistently reduces input token count by roughly 30 % relative to Full Context while maintaining comparable or slightly better generation quality across automatic metrics (BLEU, ROUGE, GPT‑4o‑Eval) and human judgments. Latency improves dramatically, with first‑token response times 2–3× faster; the paper’s illustrative 96‑turn example demonstrates a three‑fold reduction in first‑token latency. Cost analysis reveals that DyCP incurs negligible offline preprocessing cost (only the one‑time embedding of each turn) and requires a single LLM call at inference time, leading to 5–10× lower API expenses compared with methods that repeatedly invoke LLMs for memory construction (MemoChat, SeCom, CondMem). Sensitivity experiments indicate that setting τ between 0.5 and 1.0 and θ between 0.1 and 0.3 works well across datasets.
Limitations are acknowledged: the relevance score is based on a simple dot product, which may miss nuanced semantic relations; the hyper‑parameters τ and θ need dataset‑specific tuning; and for extremely long conversations (thousands of turns) the score vector itself could become memory‑intensive, suggesting future work on hierarchical indexing or sliding windows. Moreover, the current approach selects contiguous spans, which may not capture non‑contiguous but jointly relevant turns.
In conclusion, DyCP offers a practical, dynamic, and lightweight solution for context management in long‑form dialogue with modern LLMs. By preserving the sequential nature of conversation while aggressively pruning irrelevant history, it achieves a favorable trade‑off between cost, latency, and answer quality. The paper opens avenues for further research into richer relevance modeling, adaptive hyper‑parameter optimization, and hybrid schemes that combine DyCP’s dynamic pruning with summarization or external knowledge retrieval for even more scalable dialogue systems.