When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

Reading time: 6 minutes

📝 Abstract

Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large language model (LLM) based conversational systems increasingly rely on segmentation to manage conversation history beyond fixed context windows. In such systems, unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.

📄 Content

Dialogue topic segmentation is commonly framed as identifying points in a conversation where “the subject changes.” This framing is appealingly simple. It is also incorrect. Topic change is often gradual, topic structure is often nested, and both depend on perspective.

Despite this, many dialogue segmentation benchmarks treat annotated boundaries as objective ground truth and evaluate systems using strict boundary matching and F1-based metrics (van Rijsbergen, 1979). These metrics collapse distinct failure modes into a single number. Small boundary shifts are often harmless when the goal is coherent topic blocks, while missing or adding a few boundaries can be catastrophic if boundaries drive downstream actions such as summarizing or dropping earlier turns. As we show in this paper, these effects are tightly coupled to segmentation granularity and boundary density, which are left implicit by standard boundary-based metrics.
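To make the strict-versus-tolerant contrast concrete, here is a minimal sketch of a window-tolerant F1. The greedy one-to-one matching and the tolerance parameter `k` are assumptions of this sketch, not the paper's exact definition:

```python
def window_tolerant_f1(pred, gold, k=2):
    """Window-tolerant F1 sketch: a predicted boundary counts as a true
    positive if it falls within +/-k positions of a still-unmatched gold
    boundary (greedy one-to-one matching over turn indices)."""
    unmatched = sorted(gold)
    tp = 0
    for p in sorted(pred):
        for g in unmatched:
            if abs(p - g) <= k:
                unmatched.remove(g)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A one-turn shift is forgiven under k=2 but penalized under strict matching (k=0):
print(window_tolerant_f1([3, 10], [4, 10], k=2))  # 1.0
print(window_tolerant_f1([3, 10], [4, 10], k=0))  # 0.5
```

The example shows the failure mode described above: under strict matching, a harmless one-turn shift costs half the score.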

A central difficulty is that segmentation granularity is latent in standard evaluation. Most benchmarks compare systems at a single operating point (i.e., a set of fixed thresholds or selection settings), even though the number of boundaries produced varies across methods and datasets without explicit control. As a result, boundary-based metrics often reflect alignment in boundary density rather than boundary placement quality. To make granularity explicit, this work separates boundary scoring from boundary selection and evaluates performance across boundary densities rather than at a single threshold.

This paper addresses that mismatch by reframing what gets reported and what gets optimized. The evaluation objective in this paper jointly reports (i) window-tolerant F1 (W-F1), (ii) boundary density, and (iii) segment alignment diagnostics (purity and coverage). An accompanying algorithmic contribution separates boundary scoring from boundary selection, making boundary density explicit and controllable. Together, these contributions distinguish detection failures from granularity mismatches (i.e., discrepancies between the segmentation resolution at which boundaries are produced and that implied by the annotation scheme) and explain why a single global threshold fails to transfer across datasets.
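A minimal sketch of this scoring/selection separation, assuming a per-gap scoring function (the interface and threshold rule here are illustrative, not the paper's implementation):

```python
from typing import Callable, List

def score_gaps(turns: List[str], scorer: Callable[[str, str], float]) -> List[float]:
    """Scoring step: assign a topic-shift score to each gap between
    adjacent turns, independent of any selection decision."""
    return [scorer(turns[i], turns[i + 1]) for i in range(len(turns) - 1)]

def select_boundaries(scores: List[float], threshold: float) -> List[int]:
    """Selection step: keep gaps whose score exceeds the threshold.
    The threshold, not the scorer, controls boundary density."""
    return [i + 1 for i, s in enumerate(scores) if s > threshold]

# With the scores held fixed, moving the threshold only changes density:
scores = [0.5, 1.0]
print(select_boundaries(scores, 0.8))  # [2]
print(select_boundaries(scores, 0.3))  # [1, 2]
```

Because density is controlled entirely by `select_boundaries`, a single scorer can be evaluated across the full range of boundary densities rather than at one operating point.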

We distinguish between two notions that are often conflated in boundary-based evaluation. Boundary placement alignment refers to whether predicted boundaries occur at the same locations as gold boundaries (within a tolerance window), which is the intended target of metrics such as F1 and W-F1. By contrast, boundary density alignment refers only to whether the number of predicted boundaries matches the number implied by the annotation scheme. We refer to this implicit granularity level as the annotation regime (e.g., coarse thematic divisions versus finer chapter- or subtopic-level segmentation), independent of which structural segmentation markers (e.g., chapters, sections, turns) are used to realize those boundaries. Segmentation granularity is a modeling- and selection-level property, referring to the resolution at which topic boundaries are produced. It determines boundary density through the boundary selection rule, sometimes implicitly, and thereby strongly influences boundary-based metrics.
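A toy example (with made-up boundary indices) makes the distinction concrete: two predictions can agree perfectly on density while differing completely in placement.

```python
gold = [4, 9, 15]
pred_a = [4, 9, 15]   # matched density AND matched placement
pred_b = [2, 7, 12]   # matched density, shifted placement

# Density alignment looks identical for both predictions:
assert len(pred_a) == len(gold) and len(pred_b) == len(gold)

# Strict placement alignment separates them:
print(len(set(pred_a) & set(gold)))  # 3 exact hits
print(len(set(pred_b) & set(gold)))  # 0 exact hits
```

A density-only diagnostic would score `pred_a` and `pred_b` identically; only a placement-aware metric distinguishes them.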

A single-number boundary F1 can conflate two distinct factors: (i) localization (whether predicted boundaries fall near annotated topic transitions), and (ii) segmentation granularity (how many boundaries are produced overall). When the granularity of predicted segmentations differs from that implied by the annotation scheme, apparent F1 gains can arise primarily from matching the number of boundaries rather than from improved boundary placement at a matched resolution. To make this distinction explicit, we report complementary diagnostics that separately quantify boundary localization, boundary density alignment, and gold-relative segment alignment alongside tolerant boundary accuracy scores; formal definitions are given in Section 6.
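The paper defers formal definitions of purity and coverage to Section 6. As a rough illustration only, one common way to realize such segment-alignment diagnostics credits each segment with the mass of its best-overlapping counterpart; this concrete formulation is an assumption of this sketch:

```python
def segments(boundaries, n):
    """Turn sorted boundary indices into (start, end) spans over n turns."""
    cuts = [0] + sorted(boundaries) + [n]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def overlap(a, b):
    """Number of turns shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def purity(pred_bounds, gold_bounds, n):
    """For each predicted segment, credit the turns covered by its
    best-overlapping gold segment; normalize by dialogue length."""
    pred, gold = segments(pred_bounds, n), segments(gold_bounds, n)
    return sum(max(overlap(p, g) for g in gold) for p in pred) / n

def coverage(pred_bounds, gold_bounds, n):
    """Symmetric diagnostic: best predicted overlap per gold segment."""
    return purity(gold_bounds, pred_bounds, n)

print(purity([5], [5], 10))  # 1.0: predicted segments align exactly
print(purity([3], [5], 10))  # 0.8: one segment straddles a gold boundary
```

Both diagnostics are insensitive to small boundary shifts in proportion to segment length, which is exactly the behavior a boundary-only F1 lacks.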

Standard boundary-based evaluation implicitly assumes that, once boundary density is controlled, differences in F1 or W-F1 primarily reflect differences in boundary placement quality rather than thresholding or selection behavior. A central empirical finding of this work is that boundary-based evaluation metrics can be dominated by segmentation granularity induced by boundary selection, rather than by boundary detection quality. In particular, for a fixed scoring model, sweeping the boundary selection threshold produces larger changes in W-F1 than switching between structurally distinct segmentation methods, a pattern made explicit by density-quality curves in Section 7 (Figure 1). As a result, leaderboard-style comparisons that report only a single operating point can conflate method improvements with shifts in boundary density.
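The density half of such a density-quality curve can be traced directly from a fixed set of gap scores (the scores and thresholds below are illustrative): boundary count, and hence density, is a monotone function of the threshold alone, with no change to the scoring model.

```python
def boundary_density_sweep(scores, n_turns, thresholds):
    """For each selection threshold, report boundary density
    (boundaries per turn) with the scoring model held fixed."""
    return [(t, sum(1 for s in scores if s > t) / n_turns) for t in thresholds]

# Illustrative gap scores for a 9-turn dialogue (8 gaps):
scores = [0.1, 0.7, 0.2, 0.9, 0.4, 0.8, 0.3, 0.6]
for t, d in boundary_density_sweep(scores, 9, [0.25, 0.5, 0.75]):
    print(t, round(d, 2))
```

Pairing each density value with a W-F1 score against gold boundaries yields the density-quality curves the paper reports in Section 7; comparing methods at mismatched points on these curves is what leaderboard-style single-operating-point evaluation implicitly does.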

The proposed separation of boundary scoring and boundary selection motivates alternative ways of reasoning about the construction, tuning, and deployment of segmentation systems.

This content is AI-processed based on ArXiv data.
