Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization
Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework that combines multi-value checklist evaluation over 26 items with residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves an $S_{\text{Gavel-Ref}}$ of only around 50, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries, making human references less reliable, we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate case documents and extract checklist items directly from them. With Qwen3, Gavel-Agent reduces token usage by 36% while incurring only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
💡 Research Summary
The paper introduces Gavel‑Ref, a comprehensive reference‑based evaluation framework for long‑context legal case summarization, and Gavel‑Agent, an autonomous agent scaffold that extracts checklist items directly from the source documents. Legal case summarization is an especially demanding testbed because a single litigation often comprises dozens of court filings whose combined length runs to 100K–500K tokens, far beyond the limits of earlier long‑context benchmarks. To assess how modern large language models (LLMs) handle such scale, the authors first construct a 26‑item checklist covering the most salient factual elements of a case (filing date, parties, claims, decrees, remedies, settlements, etc.). They improve upon prior checklist‑based evaluation by (1) supporting multi‑value extraction (most items contain multiple values) and (2) aggregating scores only over applicable items, thereby avoiding inflation from universally omitted fields. For each checklist item the model extracts a list of (value, supporting‑text) pairs; single‑value items are judged with a four‑way classification (exact match, containment, reverse containment, mismatch), while multi‑value items receive an F1‑based overlap score.
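The two scoring rules can be sketched as follows. This is an illustrative reading, not the paper's implementation: the normalization (lowercasing, whitespace stripping), the exact-match criterion for multi-value overlap, and the direction of "containment" are all assumptions.

```python
def classify_single(pred: str, ref: str) -> str:
    """Four-way judgment for single-value checklist items.

    Assumed convention: "containment" means the prediction is a
    substring of the reference; "reverse_containment" is the opposite.
    """
    p, r = pred.strip().lower(), ref.strip().lower()
    if p == r:
        return "exact_match"
    if p in r:
        return "containment"
    if r in p:
        return "reverse_containment"
    return "mismatch"


def f1_overlap(predicted: list[str], reference: list[str]) -> float:
    """F1-based overlap score (0-100) for multi-value checklist items.

    Matching here is exact string equality after normalization; the
    paper's matching criterion may be more permissive.
    """
    pred = {v.strip().lower() for v in predicted}
    ref = {v.strip().lower() for v in reference}
    if not pred and not ref:
        return 100.0              # item not applicable to this case
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)          # values present in both lists
    precision, recall = tp / len(pred), tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

For example, extracting one of two reference settlement values would yield a recall of 0.5 and, with perfect precision, an F1 of about 67.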
Beyond the checklist, the framework includes a Residual Fact evaluation that captures important factual content not covered by the checklist. Unmatched text spans are identified, atomic facts are extracted, and a list‑wise F1 score (scaled to 0–100) is computed. Finally, a Writing Style evaluation rates similarity to human references across five dimensions (sentence structure, voice, citation style, formatting, narrative order) on a 1–5 Likert scale, which is transformed to a 0–100 style score. The overall Gavel‑Ref score is a weighted linear combination of checklist, residual, and style components, with dynamic weighting based on the proportion of residual content (α = 0.9 throughout).
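The aggregation described above might be sketched as below. The paper states only that the overall score is a weighted linear combination with dynamic weighting based on the proportion of residual content and α = 0.9; the specific reading here, where α weights the factual components against style and the residual proportion splits the factual weight, is an assumption, as is the linear Likert rescaling.

```python
def likert_to_score(rating: int) -> float:
    """Map a 1-5 Likert rating to 0-100 (assumed linear rescaling)."""
    return (rating - 1) / 4 * 100


def gavel_ref_score(checklist: float, residual: float, style: float,
                    residual_prop: float, alpha: float = 0.9) -> float:
    """Illustrative combination of the three Gavel-Ref components (all 0-100).

    residual_prop: fraction of the reference's factual content that
    falls outside the checklist; it dynamically shifts weight from the
    checklist score to the residual-fact score.
    """
    content = (1 - residual_prop) * checklist + residual_prop * residual
    return alpha * content + (1 - alpha) * style
```

Under this reading, a case whose reference summary is mostly checklist-covered is scored mostly on checklist extraction, while style contributes a fixed 10%.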
The authors conduct a meta‑evaluation by having four in‑house annotators perform the same extraction, comparison, and style‑rating tasks. Inter‑annotator agreement is high, and the agreement between LLM‑based automatic evaluation and human judgments is comparable, demonstrating that Gavel‑Ref can serve as a reliable, low‑cost benchmark.
Using Gavel‑Ref, the study evaluates twelve frontier LLMs, both proprietary (Gemini 2.5 Pro, Claude Sonnet 4, Gemini 2.5 Flash, GPT‑5) and open‑source (GPT‑oss 20B, Qwen‑3 32B, Qwen‑3 30B‑A3B, Gemma‑3 27B), on 100 legal cases ranging from 32K to 512K tokens, with 83% of the cases drawn from 2025 to minimise data contamination. The key findings are: (i) the best model, Gemini 2.5 Pro, attains only about 50 points on the overall Gavel‑Ref score, confirming the difficulty of the task; (ii) performance degrades as case length increases, even for models that support a 1M‑token context window; (iii) models excel on single‑value items (e.g., filing date) with >90% accuracy but struggle on multi‑value or rare items such as settlements, related cases, and monitor reports, where F1 scores drop to 30–40%; (iv) proprietary models consistently outperform open‑source counterparts, though open‑source GPT‑oss 20B and Qwen‑3 achieve checklist extraction quality comparable to GPT‑5 at a fraction of the cost; (v) GPT‑4.1 captures the most residual facts, while GPT‑5 tends to produce verbose, checklist‑heavy summaries despite prompts for a narrative style; Claude and Gemini align best with human writing style.
Recognising that relying on human‑written reference summaries may become problematic as LLMs surpass them, the authors propose Gavel‑Agent, an autonomous agent that equips an LLM with six tools: (1) document navigation, (2) keyword search, (3) regex‑based extraction, (4) summarisation of retrieved snippets, (5) comparison against the checklist, and (6) logging of extracted items. Rather than feeding the entire 100 K‑plus document set into a single forward pass, the agent iteratively locates relevant sections and extracts checklist values directly from the source. Experiments show that using Qwen‑3 within Gavel‑Agent reduces token consumption by 36 % relative to an end‑to‑end GPT‑4.1 summarisation pipeline, while the checklist score drops by only 7 %. Compared with a naïve chunk‑by‑chunk approach, token savings reach 59 %. However, direct document‑based extraction still lags behind extraction from model‑generated summaries, indicating that long‑horizon reasoning and memory management remain open challenges.
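The iterative tool-use loop described above can be sketched as follows. Everything here is hypothetical scaffolding: the tool names mirror the six listed in the summary, but their signatures, the dispatch protocol, the stopping rule, and the step budget are assumptions, not the paper's API.

```python
def run_agent(llm_step, tools, checklist_items, max_steps=50):
    """Iterative checklist extraction over long case documents.

    llm_step: stand-in for the model call; given the transcript so far,
      the checklist, and what has been extracted, it returns a
      (tool_name, args) pair.
    tools: maps the assumed tool names (e.g. "navigate", "search",
      "extract", "summarize", "compare") to callables; "log" is handled
      inline as the item-recording action.
    """
    extracted = {}                       # checklist item -> logged values
    history = []                         # running tool-call transcript
    for _ in range(max_steps):
        action, args = llm_step(history, checklist_items, extracted)
        if action == "log":              # record an extracted item
            item, values = args
            extracted[item] = values
        else:                            # run the chosen tool, keep its output
            result = tools[action](*args)
            history.append((action, args, result))
        if set(extracted) >= set(checklist_items):
            break                        # every item covered: stop early
    return extracted
```

The key design point is that the model never sees the full 100K+ token record at once; each step touches only a retrieved snippet, which is where the reported 36% token savings would come from.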
The paper concludes by releasing the dataset of 100 legal cases, the human reference summaries, and the code for Gavel‑Ref and Gavel‑Agent. It argues that the combination of fine‑grained, multi‑value checklist evaluation, residual fact analysis, and style assessment provides a more nuanced picture of LLM capabilities on complex, high‑stakes tasks. Future work should explore (1) more sophisticated agent architectures with persistent memory and planning, (2) automated verification of residual facts against external legal databases, and (3) generation of fact‑checked, style‑controlled summaries that could eventually replace human‑written references in legal workflows.