Learning Recursive Segments for Discourse Parsing

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: Learning Recursive Segments for Discourse Parsing
ArXiv ID: 1003.5372
Date: 2010-03-30
Authors: ** Stergos Afantenos, Pascal Denis, Philippe Muller, Laurence Danlos **

📝 Abstract

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse like SDRT allows for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs.

💡 Deep Analysis

Deep Dive into Learning Recursive Segments for Discourse Parsing.

📄 Full Content

arXiv:1003.5372v1 [cs.CL] 28 Mar 2010 Learning Recursive Segments for Discourse Parsing Stergos Afantenos∗, Pascal Denis†, Philippe Muller∗†, Laurence Danlos† ∗Institut de recherche en informatique de Toulouse (IRIT) Université Paul Sabatier 118 route de Narbonne, 31062 Toulouse Cedex 9, France {stergos.afantenos, philippe.muller}@irit.fr †Equipe-Projet Alpage INRIA & Université Paris 7 30 rue Château des Rentiers, 75013 Paris, France laurence.danlos@linguist.jussieu.fr, pascal.denis@inria.fr Abstract Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Pre- vious research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse like SDRT allows for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our ap- proach builds on standard multi-class classiﬁcation techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the ﬁrst round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1, 445 EDUs), our system achieves encouraging performance results with an F-score of 73% for ﬁnding EDUs. 1. Introduction Discourse parsing is the analysis of a text from a global, structural perspective: how parts of a discourse contribute to its global interpretation, accounting for semantic and pragmatic effects beyond simple sentence concatenation. This task consists in two main steps: (i) ﬁnding the elemen- tary discourse units (henceforth EDUs), and (ii) organizing them in a way that make explicit their functional (aka rhetorical) relations. Popular theories of discourse include Rhetorical Struc- ture Theory (RST) (Mann and Thompson, 1987), Discourse Lexicalized Tree-Adjoining Grammar (DLTAG) (Webber, 2004), Segmented Discourse Representation Theory (SDRT) (Asher, 1993). Each of these theoretical frameworks has been at the center of important corpus building efforts, see (Carlson et al., 2003; Prasad et al., 2004; Baldridge et al., 2007) respectively. In the present work, we focus on the ﬁrst step, namely segmenting a discourse into EDUs, within a larger project aiming at building an SDRT discourse corpus of French texts. In addition to being a necessary step in discourse parsing, discourse segmentation, could also be useful as a stand-alone application for a variety of other tasks where EDUs could provide sim- pler input than sentences. Examples of such tasks are: automatic summarization and sentence com- pression, bitext alignment, translation, chunk- ing/syntactic parsing. The ﬁrst discourse segmentation system dates back to the rule-based work of (Ejerhed, 1996), which was a component in the RST-based parser of (Marcu, 2000). More recently, (Toﬁloski et al., 2009) tested a rule-based segmenter on top of a syntactic parser, achieving F-score of 80-85% in segment boundary identiﬁcation on a slightly modi- ﬁed RST corpus. Machine learning based segmentation systems have also been pro- posed, notably by (Soricut and Marcu, 2003), (Sporleder and Lapata, 2005) and (Fisher and Roark, 2007). The latter report F-score of 90.5% in boundary detection (and 85.3% in correct bracketing) on the RST corpus. Discourse segmentation is to large extent theory dependent, for different theories make different assumptions on what EDUs can be. Carried out on the RST corpus, previous work on discourse segmentation has exploited an important particu- larity of this corpus: namely, the fact that it does not have any embedded EDUs. These approaches have been able to recast discourse segmentation as a binary classiﬁcation problem: that is, each text position (token or token separator) is either a segment boundary or not. By contrast to RST, other theories like SDRT allows for embedded EDUs: embedding is used to encode modifying clauses like non restrictive relatives (including re- duced relatives) and appositions. As will be dis- cussed in Section 2., our SDRT-based corpus does contain close to 10% of nested EDUs. Predicting nested structures introduces additional difﬁculties, in particular that of outputting a co- herent, balanced bracketing. This characteris- tic renders discourse segmentation akin to syn- tactic clause boundary identiﬁcation (CBI), a task which has received some attention from the CL community. The main approach to CBI is to classify tokens into three classes for clause start, end, or inside. The best results ob- tained during the CoNLL-2001 campaign were 89-90% for boundary detection and 81.73% for correct clause identiﬁcation (correct guessing of start and end), with boosted decision trees (Carreras and Màrquez, 2001). We hav

…(Full text truncated)…

📄 Read Full PDF on ArXiv

📸 Image Gallery

Reference

This content is AI-processed based on ArXiv data.

Learning Recursive Segments for Discourse Parsing

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

📸 Image Gallery

Reference

Table of Contents

Table of Contents

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

📸 Image Gallery

Reference

Related Posts

Large-Scale Domain Adaptation via Teacher-Student Learning

Supervised Speech Separation Based on Deep Learning: An Overview

Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model

Start searching

No results found