Long-Context Long-Form Question Answering for Legal Domain

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Legal documents have complex layouts involving multiple nested sections and lengthy footnotes, and they further use specialized linguistic devices, such as intricate syntax and domain-specific vocabulary, to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, particularly when the answer to the question spans several pages (i.e., requires long context) and must be comprehensive (i.e., a long-form answer). In this paper, we address the challenges of long-context question answering in the context of long-form answers, given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, and (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies performance into recall-based coverage categories, allowing human users to evaluate recall with ease. We curate a QA dataset by leveraging the expertise of professionals from fields such as law and corporate tax. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.


💡 Research Summary

The paper introduces LCLF‑QA, a system designed to tackle the unique challenges of long‑context, long‑form question answering (QA) in the legal domain. Legal documents often contain deeply nested sections, extensive footnotes, and specialized terminology that make both retrieval and answer generation difficult, especially when a question requires synthesizing information spread across multiple pages and producing a detailed, multi‑sentence response.

Building on the state-of-the-art LongRAG architecture (Zhao et al., 2024), the authors add three key components. First, a domain-specific query re-writer (ζₜ) expands ambiguous user queries by resolving acronyms, adding contextual cues, and paraphrasing intent. Two implementations are explored: (a) a Mistral-3B-Instruct model fine-tuned with QLoRA on thousands of ⟨original query, rewritten query⟩ pairs, and (b) a closed-source GPT-4o pipeline that generates a provisional document, extracts domain knowledge, and then produces an enriched query. Second, layout-aware smart chunking parses PDFs, discards page headers/footers, and creates parent chunks aligned with sections. Each parent chunk is further split into child chunks that inherit section headers (tagged <section-header>) and footnote content (tagged <footnote>). This preserves structural cues and enables "latent" context (information that resides in footnotes or section titles) to be linked back to the main text during retrieval. Third, a hybrid retrieval strategy combines sparse BM25 scoring with dense embeddings from OpenAI's text-embedding-3-large model; results from both retrievers are fused via Reciprocal Rank Fusion (RRF). The top-k child chunks are passed through a chain-of-thought (CoT) filter (Φ) that generates reasoning traces to discard irrelevant chunks, while the associated parent chunks are fed to a long-context extractor (Σ) that assembles a coherent context window.
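The parent/child chunking scheme described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `ParentChunk` class, the fixed character-based split size, and the footnote-marker matching are all assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class ParentChunk:
    """A section-level chunk (hypothetical structure, not the paper's)."""
    section_header: str
    text: str
    children: list = field(default_factory=list)

def make_child_chunks(parent: ParentChunk, footnotes: dict, size: int = 400) -> list:
    """Split a parent chunk into child chunks that inherit the section
    header and any footnote text whose marker appears in the child body,
    mirroring the <section-header>/<footnote> tagging described above."""
    children = []
    for i in range(0, len(parent.text), size):
        body = parent.text[i:i + size]
        child = f"<section-header>{parent.section_header}</section-header>\n{body}"
        # Link "latent" footnote context back to the main text.
        for marker, note in footnotes.items():
            if marker in body:
                child += f"\n<footnote>{note}</footnote>"
        children.append(child)
    parent.children = children
    return children
```

Because each child carries its section header and linked footnotes, a retriever can match queries against structural context that a naive fixed-size splitter would discard.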

The final answer generator (G_µ) is fine‑tuned with legal‑specific vocabulary and few‑shot examples, ensuring that the output uses precise terms such as “should” (a technical term indicating a strong tax opinion) rather than vague synonyms.

To evaluate the system, the authors curated a dataset of 546 QA pairs, of which 60 were created and validated by subject-matter experts (SMEs) in law and corporate tax, covering 24 distinct documents. The remaining pairs were synthetically generated and refined with GPT-4o. Experiments compare LCLF-QA against the vanilla LongRAG model and several baselines (e.g., BM25 + GPT-3.5, simple chunking + LongRAG). Metrics include Exact Match (EM), F1, and a novel recall-based coverage metric that classifies results as complete, partial, or insufficient recall. LCLF-QA achieves 8-12 percentage-point gains in EM/F1 over baselines and markedly improves coverage: for complex queries, such as those asking for the US withholding tax treatment at a given level of comfort, the proportion of fully recalled answers rises by 27 pp. Ablation studies reveal that the query re-writer contributes roughly a 10 pp lift in recall, layout-aware chunking adds another 8 pp, and the hybrid retrieval with CoT filtering is essential for precision.
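The coverage metric's complete/partial/insufficient categories can be sketched as a thresholded recall classifier. The thresholds (0.9 and 0.5) and the key-point representation are illustrative assumptions; the summary does not give the paper's exact cut-offs.

```python
def coverage_category(recalled_points: list, gold_points: list) -> str:
    """Classify answer coverage by the fraction of gold key points that the
    generated answer recalled. Thresholds here are illustrative, not the
    paper's published values."""
    gold = set(gold_points)
    if not gold:
        return "complete"  # nothing required, so trivially covered
    recall = len(set(recalled_points) & gold) / len(gold)
    if recall >= 0.9:
        return "complete"
    if recall >= 0.5:
        return "partial"
    return "insufficient"
```

Bucketing recall this way lets human evaluators scan category labels instead of comparing raw recall scores answer by answer, which is the usability benefit the paper attributes to the metric.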

In summary, the paper demonstrates that addressing legal‑specific challenges—vocabulary mismatch, structural information loss, and the need for precise terminology—through targeted query rewriting, structure‑preserving chunking, hybrid retrieval, and domain‑aware generation yields a robust long‑context, long‑form QA system. The introduced coverage metric also offers a practical tool for human evaluators to assess recall quality. This work provides a concrete blueprint for building production‑grade legal AI assistants capable of digesting entire statutes, contracts, or tax regulations and delivering accurate, nuanced answers.

