D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) reasoning from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export, along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation, D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.


💡 Research Summary

The paper introduces D‑SCoRE (Document‑centric Segmentation and CoT Reasoning with Structured Export), a training‑free pipeline that automatically generates high‑quality question‑answer (QA) datasets enriched with Chain‑of‑Thought (CoT) reasoning from arbitrary textual sources. The motivation is the scarcity and high cost of domain‑specific QA data, which hampers supervised fine‑tuning (SFT) of large language models (LLMs). D‑SCoRE addresses this by leveraging LLMs through carefully crafted prompts rather than any additional model training or extensive preprocessing.

The system consists of three core stages (plus optional pre‑ and post‑processing). First, raw documents are split into manageable 100‑200 word segments (document‑centric segmentation). For each segment, an LLM is prompted to produce a balanced mix of explicit and implicit questions. Explicit questions have answers that are verbatim spans in the source text, while implicit questions require multi‑span synthesis or relational inference. Implicit questions are always accompanied by a step‑by‑step CoT trace (e.g., “Step 1: … Step 2: … Therefore …”).
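The segmentation step can be sketched as follows. The 100‑200 word window comes from the summary; the sentence-boundary-aware greedy packing is an assumed implementation detail, not taken from the paper:

```python
import re

def segment_document(text: str, min_words: int = 100, max_words: int = 200) -> list[str]:
    """Greedily pack sentences into segments of roughly min_words-max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Flush the current segment once adding this sentence would
        # exceed the upper bound and the lower bound is already met.
        if count + words > max_words and count >= min_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        segments.append(" ".join(current))
    return segments
```

Each segment is then passed independently to the generator LLM, so prompt length stays bounded regardless of document size.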

Second, a quality‑control stage filters the generated QA‑CoT pairs. A separate “Critic” model (e.g., DeepSeek‑R1) checks two criteria: fidelity (the answer must be fully grounded in the source without hallucination) and taxonomy (the question must correctly belong to the explicit or implicit class). Pairs failing either check are regenerated. This stage is formalized as Quality(Q) = Fidelity × Taxonomy, and the filtering aims to minimize KL‑divergence between synthetic and ideal distributions.
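A minimal sketch of this filtering loop, under stated assumptions: `check_fidelity`, `check_taxonomy`, and `regenerate` are hypothetical stand-ins for the Critic-model and generator-LLM calls described above, with trivial string-matching placeholders in their place:

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source: str
    qtype: str  # "explicit" or "implicit"

def check_fidelity(pair: QAPair) -> bool:
    # Placeholder: a real Critic model (e.g. DeepSeek-R1) would judge
    # whether the answer is fully grounded in the source segment.
    return pair.answer in pair.source

def check_taxonomy(pair: QAPair) -> bool:
    # Placeholder: explicit answers must appear verbatim in the source;
    # implicit answers are synthesized, so verbatim presence is not required.
    return pair.answer in pair.source if pair.qtype == "explicit" else True

def filter_pairs(pairs, regenerate, max_retries=3):
    kept = []
    for pair in pairs:
        for _ in range(max_retries):
            # Quality(Q) = Fidelity x Taxonomy: both checks must pass.
            if check_fidelity(pair) and check_taxonomy(pair):
                kept.append(pair)
                break
            pair = regenerate(pair)  # re-prompt the generator LLM
    return kept
```

The multiplicative form of the quality score means a single failed criterion zeroes out the pair, which is why failing pairs are regenerated rather than down-weighted.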

Third, counterfactual material generation creates three distractor options for each correct answer. The LLM is instructed to produce plausible but factually incorrect alternatives that are semantically close to the correct answer, grounded in the source text, and sufficiently distinct from each other. The correct answer’s position among the four options is randomized to avoid positional bias.
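The option-assembly step can be sketched as below; distractor generation itself is an LLM call, so `distractors` is assumed to already hold three plausible incorrect alternatives:

```python
import random

def build_options(answer: str, distractors: list[str],
                  rng: random.Random) -> tuple[list[str], int]:
    """Return the shuffled four-option list and the correct answer's index."""
    assert len(distractors) == 3, "D-SCoRE uses exactly three distractors"
    options = distractors + [answer]
    rng.shuffle(options)  # randomize position to avoid positional bias
    return options, options.index(answer)
```

Shuffling per item (rather than fixing the answer at, say, option A) prevents the fine-tuned model from learning a positional shortcut instead of the underlying content.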

All outputs are exported in a structured JSON schema containing the question, answer, CoT trace, and the four‑choice list, enabling downstream fine‑tuning pipelines to ingest the data directly.
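A hypothetical record in that schema might look as follows; the field names (`question`, `answer`, `cot`, `options`) are illustrative assumptions, not taken from the paper:

```python
import json

# One exported QA-CoT record (assumed field names, illustrative content).
record = {
    "question": "Which process does the segment describe?",
    "answer": "photosynthesis",
    "cot": "Step 1: The segment mentions light energy. "
           "Step 2: It describes conversion into chemical energy. "
           "Therefore the process is photosynthesis.",
    "options": ["respiration", "photosynthesis", "fermentation", "osmosis"],
}

# One JSON object per line (JSONL) is a common format for SFT pipelines.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```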

The authors provide a quantitative analysis of information content: implicit questions with CoT have high mutual information I(R; Q,A), while adding counterfactual distractors raises conditional entropy H(A|Q,D), thereby increasing decision difficulty and robustness.
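For reference, the two quantities follow the standard information-theoretic definitions, where R denotes the reasoning (CoT) trace, Q the question, A the answer, and D the distractor set:

```latex
I(R;\,Q,A) = H(R) - H(R \mid Q, A),
\qquad
H(A \mid Q, D) = -\sum_{a,q,d} p(a,q,d)\,\log p(a \mid q, d).
```

Intuitively, a high I(R; Q,A) means the CoT trace carries information that ties the question to the answer, while a high H(A|Q,D) means the distractors keep the answer hard to guess from the question and options alone.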

Experiments are conducted on SQuAD‑derived data and the SQuADShifts benchmark (domains: Amazon, NYT, NewWiki, Reddit). Models fine‑tuned on D‑SCoRE‑generated data consistently outperform those fine‑tuned on human‑annotated SQuAD data, achieving higher Exact Match and F1 scores both in‑distribution and out‑of‑distribution. The authors explore several research questions:

  • RQ1 shows D‑SCoRE’s synthetic data yields superior performance compared to gold data.
  • RQ2 demonstrates that increasing the implicit‑question proportion up to ~80% improves reasoning transfer, while 100% can introduce occasional CoT incoherence.
  • RQ3 confirms benefits across model scales (7B, 13B, 34B parameters), with the largest gains observed in smaller models that otherwise rely on surface matching.
  • RQ4 validates that using a dedicated quality‑filtering model improves overall data fidelity and downstream SFT results relative to a homogeneous pipeline.

Efficiency is a key claim: the end‑to‑end system generates over 1,100 QA‑CoT pairs per GPU‑hour on consumer‑grade hardware, roughly three to five times faster than prior semi‑synthetic pipelines. The cost is reported as under $0.80 per 1,000 QA pairs.

Limitations include the handling of very long documents, automatic verification of CoT consistency, and extension to multilingual settings. Future work will explore adaptive segmentation, more sophisticated critic models, and automated prompt tuning for specific domains.

In summary, D‑SCoRE offers a practical, scalable solution for producing diverse, reasoning‑rich QA datasets without any model training, and demonstrates that such synthetic data can surpass traditional human‑annotated corpora for fine‑tuning LLMs across multiple domains.

