Learning Recursive Segments for Discourse Parsing

Reading time: 6 minute
...

📝 Original Info

  • Title: Learning Recursive Segments for Discourse Parsing
  • ArXiv ID: 1003.5372
  • Date: 2010-03-30
  • Authors: ** Stergos Afantenos, Pascal Denis, Philippe Muller, Laurence Danlos **

📝 Abstract

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse like SDRT allows for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs.

💡 Deep Analysis

Deep Dive into Learning Recursive Segments for Discourse Parsing.

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse like SDRT allows for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs.

📄 Full Content

arXiv:1003.5372v1 [cs.CL] 28 Mar 2010 Learning Recursive Segments for Discourse Parsing Stergos Afantenos∗, Pascal Denis†, Philippe Muller∗†, Laurence Danlos† ∗Institut de recherche en informatique de Toulouse (IRIT) Université Paul Sabatier 118 route de Narbonne, 31062 Toulouse Cedex 9, France {stergos.afantenos, philippe.muller}@irit.fr †Equipe-Projet Alpage INRIA & Université Paris 7 30 rue Château des Rentiers, 75013 Paris, France laurence.danlos@linguist.jussieu.fr, pascal.denis@inria.fr Abstract Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Pre- vious research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse like SDRT allows for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our ap- proach builds on standard multi-class classification techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1, 445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs. 1. Introduction Discourse parsing is the analysis of a text from a global, structural perspective: how parts of a discourse contribute to its global interpretation, accounting for semantic and pragmatic effects beyond simple sentence concatenation. This task consists in two main steps: (i) finding the elemen- tary discourse units (henceforth EDUs), and (ii) organizing them in a way that make explicit their functional (aka rhetorical) relations. Popular theories of discourse include Rhetorical Struc- ture Theory (RST) (Mann and Thompson, 1987), Discourse Lexicalized Tree-Adjoining Grammar (DLTAG) (Webber, 2004), Segmented Discourse Representation Theory (SDRT) (Asher, 1993). Each of these theoretical frameworks has been at the center of important corpus building efforts, see (Carlson et al., 2003; Prasad et al., 2004; Baldridge et al., 2007) respectively. In the present work, we focus on the first step, namely segmenting a discourse into EDUs, within a larger project aiming at building an SDRT discourse corpus of French texts. In addition to being a necessary step in discourse parsing, discourse segmentation, could also be useful as a stand-alone application for a variety of other tasks where EDUs could provide sim- pler input than sentences. Examples of such tasks are: automatic summarization and sentence com- pression, bitext alignment, translation, chunk- ing/syntactic parsing. The first discourse segmentation system dates back to the rule-based work of (Ejerhed, 1996), which was a component in the RST-based parser of (Marcu, 2000). More recently, (Tofiloski et al., 2009) tested a rule-based segmenter on top of a syntactic parser, achieving F-score of 80-85% in segment boundary identification on a slightly modi- fied RST corpus. Machine learning based segmentation systems have also been pro- posed, notably by (Soricut and Marcu, 2003), (Sporleder and Lapata, 2005) and (Fisher and Roark, 2007). The latter report F-score of 90.5% in boundary detection (and 85.3% in correct bracketing) on the RST corpus. Discourse segmentation is to large extent theory dependent, for different theories make different assumptions on what EDUs can be. Carried out on the RST corpus, previous work on discourse segmentation has exploited an important particu- larity of this corpus: namely, the fact that it does not have any embedded EDUs. These approaches have been able to recast discourse segmentation as a binary classification problem: that is, each text position (token or token separator) is either a segment boundary or not. By contrast to RST, other theories like SDRT allows for embedded EDUs: embedding is used to encode modifying clauses like non restrictive relatives (including re- duced relatives) and appositions. As will be dis- cussed in Section 2., our SDRT-based corpus does contain close to 10% of nested EDUs. Predicting nested structures introduces additional difficulties, in particular that of outputting a co- herent, balanced bracketing. This characteris- tic renders discourse segmentation akin to syn- tactic clause boundary identification (CBI), a task which has received some attention from the CL community. The main approach to CBI is to classify tokens into three classes for clause start, end, or inside. The best results ob- tained during the CoNLL-2001 campaign were 89-90% for boundary detection and 81.73% for correct clause identification (correct guessing of start and end), with boosted decision trees (Carreras and Màrquez, 2001). We hav

…(Full text truncated)…

📸 Image Gallery

cover.png page_2.webp page_3.webp

Reference

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut