Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still fall short on ultra-long texts. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but their inference is expensive and they struggle to accept long inputs. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, combined with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and extends to ultra-long documents for paragraph boundary detection. To further improve downstream retrieval efficiency, we derive a vector fusion method with scalar correction that compresses the representation of an ultra-long segment into a single vector without semantic loss. Experiments on the Wikipedia long-document topic segmentation dataset WIKI-727K show that, compared with three generative models based on Qwen2-0.5B released by Jina, our method achieves a better macro-averaged F1 and two-orders-of-magnitude faster inference, substantially improving the practicality and scalability of long-document processing.
💡 Research Summary
This paper addresses the challenging problem of topic segmentation in ultra‑long documents by proposing a discriminative framework built on the Qwen3‑0.6B language model. Traditional discriminative approaches are limited by fixed‑size windows and cannot capture document‑level semantics, while generative large language models (LLMs) can output paragraph boundaries but suffer from high inference cost and difficulty handling inputs beyond a few thousand tokens. The authors design a model that combines a cross‑window context fusion layer and a boundary classification head with an overlapping sliding‑window strategy, enabling single‑pass processing of up to 13k tokens and scalable handling of even longer texts.
Model architecture: The input document is first split into sentence‑level blocks. Token‑level hidden states from Qwen3‑0.6B are aggregated per block using attention pooling, producing compact block representations. These are fed into a lightweight Transformer encoder that models long‑range dependencies across blocks, yielding context‑enhanced block vectors. A simple MLP head then predicts a binary probability for a topic boundary between each adjacent block pair. To mitigate class imbalance (few true boundaries), loss re‑weighting is applied, improving recall.
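The pipeline above (attention pooling per block, a light Transformer over block vectors, an MLP over adjacent block pairs) can be sketched as follows. This is a minimal reconstruction, not the authors' code: the hidden size, layer counts, and the pairwise-concatenation input to the MLP head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    """Sketch of the discriminative head described above (sizes are assumptions)."""
    def __init__(self, hidden=1024, fusion_layers=2, heads=8):
        super().__init__()
        # Attention pooling: one learned scorer weights each token in a block.
        self.pool_query = nn.Linear(hidden, 1)
        # Lightweight Transformer encoder modeling cross-block dependencies.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=fusion_layers)
        # MLP head scoring a boundary between each adjacent block pair.
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                 nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, token_states, block_ids):
        # token_states: (T, H) backbone hidden states; block_ids: (T,) block index per token.
        blocks = []
        for b in block_ids.unique(sorted=True):
            h = token_states[block_ids == b]           # tokens of one sentence block
            w = torch.softmax(self.pool_query(h), 0)   # attention-pooling weights
            blocks.append((w * h).sum(0))              # compact block representation
        x = torch.stack(blocks).unsqueeze(0)           # (1, B, H)
        x = self.fusion(x).squeeze(0)                  # context-enhanced block vectors
        pair = torch.cat([x[:-1], x[1:]], dim=-1)      # adjacent block pairs
        return torch.sigmoid(self.mlp(pair)).squeeze(-1)  # (B-1,) boundary probs
```

Training such a head against sparse boundary labels would typically use a re-weighted binary cross-entropy (e.g. an up-weighted positive class), matching the loss re-weighting the summary mentions.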
Ultra‑long handling: When a document exceeds the 13 k token limit, it is divided into multiple overlapping windows (≈10 % overlap). Each window is processed independently; predictions in overlapping regions are averaged to produce a consistent boundary probability sequence. This design avoids the information loss typical of hard truncation while keeping computational overhead low.
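The windowing-and-averaging scheme can be sketched as below. `predict_fn` is a hypothetical stand-in for running the model on one window of blocks; the window size and overlap fraction are the tunable parameters the summary describes.

```python
import numpy as np

def windowed_boundary_probs(n_blocks, window, overlap_frac, predict_fn):
    """Average boundary probabilities over overlapping sliding windows.

    predict_fn(start, end) is assumed to run the model on blocks
    [start, end) and return end - start - 1 boundary probabilities.
    """
    step = max(1, int(window * (1 - overlap_frac)))
    prob_sum = np.zeros(n_blocks - 1)
    count = np.zeros(n_blocks - 1)
    start = 0
    while True:
        end = min(start + window, n_blocks)
        p = predict_fn(start, end)            # probs for boundaries start .. end-2
        prob_sum[start:end - 1] += p
        count[start:end - 1] += 1
        if end == n_blocks:                   # last window reached the end
            break
        start += step
    return prob_sum / count                   # averaged where windows overlap
```

Averaging in the overlap regions yields one consistent probability per candidate boundary, so downstream thresholding never sees conflicting window-local predictions.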
Heuristic post‑processing: The raw probabilities are thresholded to obtain initial chunks. Chunks longer than an upper token limit (e.g., 700) are recursively split at the internal position with the highest boundary probability. Conversely, chunks shorter than a lower limit (e.g., 85) are merged with the neighboring chunk across the weaker boundary (the smaller boundary probability), preventing overly fragmented segments. This lightweight heuristic, whose only parameters are the two length thresholds, aligns the model output with practical length constraints.
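A sketch of this threshold/split/merge procedure, assuming per-boundary probabilities and per-block token counts are available (the 700/85 limits are the paper's example values; the exact tie-breaking is our assumption):

```python
def chunk_boundaries(probs, block_lens, thr=0.5, max_tok=700, min_tok=85):
    """probs[i]: P(boundary between block i and i+1); block_lens: tokens per block.
    Returns chunks as half-open block ranges (a, b)."""
    n = len(block_lens)
    # 1) Threshold raw probabilities to get initial cut points.
    cuts = sorted({0, n} | {i + 1 for i, p in enumerate(probs) if p >= thr})

    # 2) Recursively split over-long chunks at the most probable internal boundary.
    def split(a, b, out):
        if sum(block_lens[a:b]) <= max_tok or b - a < 2:
            out.append((a, b)); return
        best = max(range(a, b - 1), key=lambda i: probs[i]) + 1
        split(a, best, out); split(best, b, out)

    chunks = []
    for a, b in zip(cuts, cuts[1:]):
        split(a, b, chunks)

    # 3) Merge too-short chunks into the neighbor across the weaker boundary.
    i = 0
    while i < len(chunks):
        a, b = chunks[i]
        if sum(block_lens[a:b]) >= min_tok or len(chunks) == 1:
            i += 1; continue
        left_p = probs[a - 1] if i > 0 else float("inf")
        right_p = probs[b - 1] if i < len(chunks) - 1 else float("inf")
        if left_p <= right_p:                       # weaker boundary on the left
            chunks[i - 1] = (chunks[i - 1][0], b); chunks.pop(i); i -= 1
        else:                                       # weaker boundary on the right
            chunks[i] = (a, chunks[i + 1][1]); chunks.pop(i + 1)
    return chunks
```

Because splitting reuses the model's own probabilities rather than cutting at a fixed midpoint, the enforced length limits stay aligned with the predicted topic structure.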
Vector fusion for retrieval: For downstream retrieval, the authors propose a mathematically equivalent vector fusion method. Block embeddings are weighted‑averaged and then multiplied by a scalar correction term, preserving cosine similarity while compressing an entire ultra‑long segment into a single vector. This reduces retrieval complexity from O(N) to O(1) without semantic loss.
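One concrete reading of this scheme (an assumption, since the summary does not give the formula): if each block embedding is L2-normalized before weighted averaging, and the norm of the averaged vector is kept as the scalar correction, then the corrected score against any query equals exactly the weighted average of the per-block cosine similarities.

```python
import numpy as np

def fuse_segment(block_embs, weights):
    """Fuse block embeddings into one unit vector plus a scalar correction."""
    E = block_embs / np.linalg.norm(block_embs, axis=1, keepdims=True)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalized fusion weights
    v = (w[:, None] * E).sum(axis=0)       # weighted average of unit embeddings
    s = float(np.linalg.norm(v))           # scalar correction term
    return v / s, s

def fused_score(query, unit_vec, scale):
    """scale * cos(query, unit_vec) == weighted mean of per-block cosines."""
    q = query / np.linalg.norm(query)
    return scale * float(q @ unit_vec)
```

Under this formulation the single stored vector (plus one scalar) reproduces the N per-block similarity computations, which is the O(N) to O(1) reduction claimed above.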
Experiments: Using the Wikipedia‑based WIKI‑727K benchmark, the proposed method is compared against three generative segmentation models based on Qwen2‑0.5B. The discriminative model achieves a macro‑averaged F1 of 78.4 % (≈2–3 points absolute improvement) and a recall boost of over 10 points. Inference speed is dramatically faster: processing a 13k‑token document takes ~0.12 seconds versus ~12 seconds for the generative baselines—a two‑order‑of‑magnitude speedup. Ablation studies confirm that the cross‑window fusion layer contributes ~1.2 percentage points and the overlapping sliding window ~0.8 percentage points to the final F1. The vector fusion technique improves top‑10% retrieval recall by 3 percentage points.
Limitations and future work: The approach still relies on a fixed window size and overlap ratio, which may need tuning for different domains. The scalar correction in vector fusion, while theoretically sound, may not perfectly preserve semantics across all modalities. Extending the block granularity beyond sentences (e.g., sections) and integrating multimodal cues are identified as promising directions. Overall, the paper demonstrates that a carefully engineered discriminative model can combine the semantic richness of large LLMs with the efficiency required for real‑world ultra‑long document processing.