Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Notice: This research summary and analysis were generated automatically using AI. For definitive details, please refer to the original arXiv source.

Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.


💡 Research Summary

SummQ introduces an adversarial multi‑agent framework that tackles the persistent challenges of long‑document summarization—information loss, factual inconsistency, and coherence breakdown—by coupling summarization with a complementary quizzing task. The system consists of four groups of agents: summary generators (Gs), quiz generators (Gq), summary reviewers (Rs), and quiz reviewers (Rq), plus an examinee agent (E) that attempts to answer the generated quiz using only the current summary.
The overall workflow (Algorithm 1) iterates up to a preset limit. In each iteration, Gs produce a candidate summary S(t) and Gq produce a candidate quiz Q(t). Rs and Rq independently annotate their respective outputs, categorize issues into agreed and contested sets, and resolve contested issues through multi‑round debates (Algorithm 3). The examinee E then attempts the quiz; any unanswered or incorrectly answered questions are fed back as “missing information” to the summarization side. All feedback streams are merged, and if both the summary and quiz receive no further issues, the loop terminates and the final pair (S*, Q*) is returned.
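The control flow described above can be sketched as a simple loop. This is only an illustrative skeleton under stated assumptions: the paper's agents are LLM-backed, whereas here `gen_summary`, `gen_quiz`, `review_summary`, `review_quiz`, and `examinee` are hypothetical callables supplied by the caller, not the authors' API.

```python
def summq_loop(document, agents, max_iters=5):
    """Sketch of Algorithm 1's iterate-until-clean control flow.

    `agents` bundles hypothetical callables standing in for the
    LLM-backed agent groups (Gs, Gq, Rs, Rq, E) of the paper.
    """
    summary_feedback, quiz_feedback = [], []
    summary = quiz = None
    for _ in range(max_iters):
        # Generators produce candidates conditioned on prior feedback.
        summary = agents["gen_summary"](document, summary_feedback)
        quiz = agents["gen_quiz"](document, quiz_feedback)
        # Reviewers annotate each candidate independently.
        summary_issues = agents["review_summary"](document, summary)
        quiz_issues = agents["review_quiz"](document, quiz)
        # The examinee answers the quiz from the summary alone; failed
        # questions come back as "missing information" feedback.
        missing = agents["examinee"](summary, quiz)
        summary_feedback = summary_issues + missing
        quiz_feedback = quiz_issues
        if not summary_feedback and not quiz_feedback:
            break  # both sides are issue-free: converged
    return summary, quiz
```

With stub agents whose issues disappear after one round of feedback, the loop terminates after the second iteration and returns the once-revised summary.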
Generator collaboration (Algorithm 2) follows a four‑phase pipeline: (1) independent draft generation by each LLM agent, (2) aggregation of drafts into a unified candidate via an aggregator module, (3) ranking of individual drafts to select the best single draft, and (4) collective voting where each generator chooses between the aggregated draft and the best individual draft. This design preserves diversity while ensuring that an exceptionally strong draft is not eclipsed by the aggregation process.
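The four phases can be expressed as a short pipeline. Again this is a sketch, not the paper's implementation: `aggregate`, `rank`, and `vote` are hypothetical stand-ins for the LLM-backed aggregator, ranker, and per-generator voting prompts.

```python
from collections import Counter

def collaborate(generators, aggregate, rank, vote, task_input):
    """Sketch of Algorithm 2's four-phase generator collaboration."""
    # Phase 1: each generator drafts independently.
    drafts = [g(task_input) for g in generators]
    # Phase 2: merge all drafts into one aggregated candidate.
    aggregated = aggregate(drafts)
    # Phase 3: rank the individual drafts; keep the strongest one.
    best_single = max(drafts, key=rank)
    # Phase 4: each generator votes "aggregated" or "single"; majority
    # wins (ties favor the aggregate), so a standout individual draft
    # can still beat the aggregation.
    ballots = Counter(vote(g, aggregated, best_single) for g in generators)
    return aggregated if ballots["aggregated"] >= ballots["single"] else best_single
```

The final vote is what keeps an exceptionally strong single draft from being diluted away by averaging-style aggregation.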
Reviewer collaboration first gathers independent annotations, then extracts issues flagged by at least two reviewers as “agreed issues.” Contested issues—those with fewer than two votes—undergo a structured debate where each reviewer presents evidence from the source document and the candidate summary/quiz, followed by a majority‑vote decision on validity. The final issue list, a union of agreed and validated contested issues, drives the next generation cycle.
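The agreed/contested split and the majority-vote resolution can be sketched as follows. The multi-round debate itself is abstracted behind a hypothetical `debate_votes` callable (one validity vote per reviewer); the real system would run it as an LLM exchange grounded in the source document.

```python
from collections import Counter

def consolidate_issues(annotations, debate_votes):
    """Sketch of reviewer consolidation: agreed issues pass directly,
    contested ones survive only a majority vote after debate.

    annotations: one set of flagged issues per reviewer.
    debate_votes: hypothetical callable mapping a contested issue to a
    list of per-reviewer True/False validity votes.
    """
    counts = Counter(issue for issues in annotations for issue in set(issues))
    # Issues flagged by at least two reviewers are accepted as-is.
    agreed = {i for i, c in counts.items() if c >= 2}
    contested = {i for i, c in counts.items() if c < 2}
    # Contested issues need a strict majority of reviewers post-debate.
    validated = {i for i in contested
                 if sum(debate_votes(i)) > len(annotations) / 2}
    return agreed | validated
```

The returned union is the issue list handed to the generators for the next cycle.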
The examinee agent serves as a dynamic quality check: it attempts to answer each quiz question using only the current summary. Failure to answer correctly signals that the summary lacks necessary information, prompting the generators to incorporate the missing content in the next iteration. This creates a natural adversarial loop: the quiz pushes the summary toward higher coverage and factuality, while the summary strives to satisfy the quiz.
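The examinee's role reduces to a closed-book question-answering pass over the summary. In this sketch, `answer` and `check` are hypothetical stand-ins for the LLM examinee and an answer grader; the key constraint is that `answer` sees only the summary, never the source document.

```python
def examinee_feedback(summary, quiz, answer, check):
    """Sketch of the examinee loop: answer each quiz question from the
    summary alone; questions the summary cannot support become
    'missing information' feedback for the next summary draft.

    quiz: list of (question, reference_answer) pairs.
    """
    missing = []
    for question, gold in quiz:
        attempt = answer(summary, question)  # closed-book: summary only
        if attempt is None or not check(attempt, gold):
            missing.append(question)
    return missing
```

Any question returned here tells the summary generators exactly which content to add, which is what closes the adversarial loop between the two task domains.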
Experiments were conducted on three long‑document benchmarks—MENSA (scientific articles), BookSum (books), and GovReport (government reports). SummQ consistently outperformed strong baselines such as LED, Longformer, and recent GPT‑4‑based summarizers across ROUGE‑1/2/L and BERTScore, achieving average gains of 3–5 percentage points. Human evaluations and LLM‑as‑a‑Judge assessments (measuring factuality, relevance, and coherence) also favored SummQ, with statistically significant improvements in all dimensions.
Ablation studies reveal that removing the quiz generation component leads to a sharp drop in information coverage, while omitting the examinee feedback slows convergence dramatically. The aggregation‑voting mechanism proves essential for maintaining diversity without sacrificing the best individual drafts. Analysis of quiz question types shows that multiple‑choice items contribute most to factual verification, whereas true/false and short‑answer questions mainly test summary completeness.
Limitations include the computational overhead of generating and reviewing quizzes, the sensitivity of overall performance to quiz quality, and the current focus on English and Chinese corpora, leaving multilingual generalization an open question. Future work will explore adaptive quiz difficulty, meta‑learning for agent selection, and extensions to multimodal documents containing tables and figures.
In summary, SummQ demonstrates that an adversarial collaboration between summarization and quizzing agents, reinforced by reviewer debates and an examinee feedback loop, can substantially elevate the quality of long‑document summaries. The authors release code and data to encourage further research in agentic, self‑checking summarization systems.

