A Semantic QA-Based Approach for Text Summarization Evaluation
📝 Abstract
Many Natural Language Processing and Computational Linguistics applications involve the generation of new texts based on some existing texts, such as summarization, text simplification, and machine translation. However, a serious problem has haunted these applications for decades: how to automatically and accurately assess their quality. In this paper, we present some preliminary results on one especially useful and challenging problem in NLP system evaluation: how to pinpoint the content differences between two text passages (especially large passages such as articles and books). Our idea is intuitive and very different from existing approaches. We treat one text passage as a small knowledge base and ask it a large number of questions to exhaustively identify all the content points in it. By comparing the correctly answered questions from two text passages, we can compare their content precisely. An experiment using the 2007 DUC summarization corpus clearly shows promising results.
📄 Content
A Semantic QA-Based Approach for Text Summarization Evaluation
Ping Chen, Fei Wu, Tong Wang, Wei Ding
University of Massachusetts Boston
ping.chen@umb.edu
Abstract
Many Natural Language Processing and Computational Linguistics applications involve the generation of new texts based on some existing texts, such as summarization, text simplification, and machine translation. However, a serious problem has haunted these applications for decades: how to automatically and accurately assess their quality. In this paper, we present some preliminary results on one especially useful and challenging problem in NLP system evaluation: how to pinpoint the content differences between two text passages (especially large passages such as articles and books). Our idea is intuitive and very different from existing approaches. We treat one text passage as a small knowledge base and ask it a large number of questions to exhaustively identify all the content points in it. By comparing the correctly answered questions from two text passages, we can compare their content precisely. An experiment using the 2007 DUC summarization corpus clearly shows promising results.
Introduction
Technologies spawned from Natural Language Processing (NLP) and Computational Linguistics (CL) have fundamentally changed how we process, share, and access information, e.g., search engines and question answering systems. However, a serious problem has haunted many NLP applications: how to automatically and accurately assess their quality. In some cases, evaluation of an NLP task has itself become an active research area, as with text summarization evaluation. The main difficulty in developing such evaluation comes from the diversity of the NLP domain and our insufficient understanding of natural languages and human intelligence in general. In this paper, we focus on one especially useful and challenging area in NLP evaluation: how to semantically compare the content of two text passages (e.g., paragraphs, articles, or even large corpora). Pinpointing content differences among texts is critical to the evaluation of many important NLP applications, such as summarization, text categorization, text simplification, and machine translation. Not surprisingly, many evaluation methods have been proposed, but the quality of the existing methods themselves is hard to assess. In many cases, human evaluation must be adopted, which is often slow, subjective, and expensive. In this paper we present an intuitive and innovative idea completely different from existing methods:
If we treat one text passage as a small knowledge base, can we ask it a large number of questions to exhaustively identify all the content points in it?

By comparing the correctly answered questions from two text passages, we can compare their content precisely. This idea may seem like "circling around the target" instead of "directly hitting the target". However, our Question Answering (QA)-based content evaluation is intuitive and supported by the following insights:
- Analogy to human assessment. When we assess someone's understanding of a subject, we do not ask them to write down everything they know about it. Instead, we ask a list of questions, and an accurate and objective assessment can be achieved by counting the number of correct answers. This question answering process also identifies the areas in which they need to improve.
- Practical operability. When assessing the similarity of two texts, direct comparison may look natural. However, with current methods (whether supervised or rule-based) this direct approach becomes increasingly difficult as we move to larger text passages. For example, comparing two articles requires answering the following questions: how to align sentences, how to represent a sentence semantically, how to generate similarity scores without annotated samples (or with as few as possible to minimize cost), how to interpret and evaluate these scores, how to find the content differences of two texts, and so on.
- Easy to interpret. Many existing methods generate only a single score, which reveals little about how an assessment measure is derived and offers no help for system improvement. In contrast, our QA-based approach requires minimal manual effort, clearly shows how a measure is calculated, and pinpoints exactly the content differences between two text passages.

In the next section we discuss existing work. Section 3 presents the architecture of our QA-based evaluation approach, and experimental results are reported in Section 4. We share some insights and findings from designing our evaluation system and conducting experiments in the discussion in Section 5. We conclude in Section 6.

Related Work

Human evaluation
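The comparison step described above can be sketched in a few lines. This is not the authors' implementation: it assumes the question set has already been generated and that some QA system has already determined which questions each passage answers correctly; `qa_overlap` and the sample question IDs are hypothetical names used only for illustration.

```python
def qa_overlap(correct_ref, correct_cand):
    """Compare two passages by the sets of questions each answers correctly.

    correct_ref:  questions correctly answered from the reference passage
    correct_cand: questions correctly answered from the candidate passage
    Returns (precision, recall, f1) of the candidate's content coverage.
    """
    shared = correct_ref & correct_cand
    precision = len(shared) / len(correct_cand) if correct_cand else 0.0
    recall = len(shared) / len(correct_ref) if correct_ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # The set differences are what makes the score interpretable:
    # correct_ref - correct_cand  -> content the candidate is missing
    # correct_cand - correct_ref  -> extra content only in the candidate
    return precision, recall, f1

# Example: reference answers q1..q4 correctly, candidate answers q2, q3, q5.
p, r, f = qa_overlap({"q1", "q2", "q3", "q4"}, {"q2", "q3", "q5"})
# precision = 2/3, recall = 1/2
```

Unlike a single opaque score, the two set differences directly pinpoint which content points differ, which is the interpretability advantage claimed above.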