We present a methodology combining surface NLP and machine learning techniques for ranking abstracts and generating summaries based on annotated corpora. The corpora were annotated with meta-semantic tags indicating the category of information each sentence bears (objective, results, findings, newthing, hypothesis, conclusion, related work, future work). The annotated corpus is fed into an automatic summarizer for query-oriented abstract ranking and multi-abstract summarization. To adapt the summarizer to these two tasks, we devised two novel weighting functions that take into account the distribution of the tags in the corpus. The results, although still preliminary, encourage us to pursue this line of work and to seek better ways of building IR systems that can exploit semantic annotations in a corpus.
The question of assisting information seekers in locating a specific category (facet) of information has rarely been addressed in the IR community due to the inherent difficulty of such a task. Indeed, efficiency and effectiveness have been the main guiding principles in building IR models and tools. Our aim here is to delve into the problem of how to assist a researcher or a specialist in rapidly accessing a specific category or class of information in scientific texts. For this, we need annotated corpora in which relevant sentences are marked up with the type of information they purportedly carry. We identified eight categories of information in abstracts which can be useful in the framework of information-category-driven IR: OBJECTIVE, RESULT, NEWTHING, HYPOTHESIS, FINDINGS, RELATED WORK, CONCLUSION, FUTUREWORK. These categories enable the user to identify what a paper is about and what the author's contribution to the field is. We adopted a surface linguistic analysis using lexico-syntactic patterns that are generic to a given language and rely on surface cues to perform sentence annotation on scientific abstracts. Once annotated, the corpus is fed into an automatic summarizer which takes the different semantic annotations into account for query-oriented document ranking and automatic summarization. The automatic summarizer used here is Enertex, developed by the LIA team at the University of Avignon (Fernández et al., 2007a). Enertex is based on neural networks (NN), inspired by statistical physics, to study fundamental problems in Natural Language Processing such as automatic summarization and topic segmentation.
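The paper does not list the actual pattern inventory, but the surface-cue annotation step it describes can be sketched as simple pattern matching over sentences. The cue patterns below are illustrative assumptions only, not the patterns actually used in this work:

```python
import re

# Hypothetical lexico-syntactic cue patterns mapping surface cues to
# meta-semantic category tags. The real inventory is not given in the
# paper; these regexes are illustrative placeholders.
CUE_PATTERNS = [
    (re.compile(r"\bin this paper,? we (show|present|propose)\b", re.I), "OBJECTIVE"),
    (re.compile(r"\bwe hypothesi[sz]e\b|\bour hypothesis\b", re.I), "HYPOTHESIS"),
    (re.compile(r"\b(results|experiments) (show|indicate|demonstrate)\b", re.I), "RESULT"),
    (re.compile(r"\bwe found\b|\bour findings\b", re.I), "FINDINGS"),
    (re.compile(r"\bin conclusion\b|\bwe conclude\b", re.I), "CONCLUSION"),
    (re.compile(r"\bfuture work\b|\bwe plan to\b", re.I), "FUTUREWORK"),
]

def annotate(sentence: str):
    """Return the first matching category tag for a sentence, or None."""
    for pattern, tag in CUE_PATTERNS:
        if pattern.search(sentence):
            return tag
    return None

print(annotate("In this paper, we show that surface cues suffice."))  # OBJECTIVE
```

In practice, sentences matching no cue would remain untagged, which is consistent with the observation below that some categories are simply absent from many abstracts.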
In this paper, we present some preliminary experiments on abstract ranking and automatic summarization using the semantic annotations resulting from our sentence categorization scheme. The plan of this paper is as follows: section 2 recalls relevant related work; section 3 describes the sentence categorization method. Section 4 describes the query-oriented abstract ranking and automatic summarization experiments using the semantic annotations. Section 5 discusses difficulties inherent in this task as well as earlier experiments that proved unsuccessful.
Our research is multi-disciplinary in nature, drawing from at least two distinct research communities: NLP and IR. Our survey will thus touch on relevant work from both communities.
There is a large body of work in the NLP community on the structure of scientific discourse (Luhn 1958, Swales 1990, Paice 1993, Salager-Meyer 1990). Following a survey of earlier work, Teufel & Moens (2002) established that scientific writing can be seen as a problem-solving activity. Authors need to convince their colleagues of the validity of their research, hence they make use of rhetorical cues via recurrent patterns (Swales 1990 1 , Teufel & Moens 2002). According to Teufel & Moens (2002), meta-discourse patterns are found almost every 15 words in scientific texts. It is thus feasible to present important information from sentences along dimensions that are almost always present in any scientific writing: research goal, methods/solutions, results. Earlier studies also established that the experimental sciences respect these rhetorical divisions in writing more than the social sciences do and, more often than not, use cues to announce them. One of the goals of these studies has been, and continues to be, automatic summarization. Discourse structure analysis is a means of identifying the role of each sentence and thus of selecting important sentences to form an abstract. Teufel (1999), Teufel & Moens (2002), and later Orasan (2001) pursued this line of research. Patterns revealing the rhetorical divisions are frequent in full texts but are also found in abstracts. For instance, within the division « Motivation/objective/aim », one could find a sentence containing the lexico-syntactic cue « In this paper, we show that… ».

1 « Researchers like Swales (1990) have long claimed that there is a strong social aspect to science, because the success of a researcher is correlated with her ability to convince the field of the quality of her work and the validity of her arguments », cited in Teufel & Moens (2002).
Teufel & Moens (2002) showed that in abstracts authors take great pains to indicate intellectual attribution (references to their own earlier work or to that of other authors). Since abstracts contain only the essential points of a paper, one may hope that they hold only important sentences and that classifying them is therefore easier than classifying sentences from full texts. However, abstracts will not carry all the patterns announcing the different rhetorical divisions. While categories like OBJECTIVE, RESULT and FINDINGS will almost always be present, others like NEWTHING, HYPOTHESIS, RELATED WORK and FUTUREWORK may be missing.
Research on automatic summarization per se has become very dynamic of late. Sparked off by