Automatic Mapping of French Discourse Connectives to PDTB Discourse Relations
In this paper, we present an approach to exploit phrase tables generated by statistical machine translation in order to map French discourse connectives to discourse relations. Using this approach, we created ConcoLeDisCo, a lexicon of French discourse connectives and their PDTB relations. When evaluated against LEXCONN, ConcoLeDisCo achieves a recall of 0.81 and an Average Precision of 0.68 for the Concession and Condition relations.
💡 Research Summary
The paper introduces a novel, low‑resource method for automatically linking French discourse connectives (DCs) to the discourse relations defined in the Penn Discourse Treebank (PDTB). Traditional French resources such as LEXCONN rely on extensive manual annotation; even the most recent version (V2.1) contains 343 connectives mapped to an average of 1.3 relations, leaving 37 connectives without any assigned relation. To overcome this bottleneck, the authors exploit two well‑established NLP resources: (1) a parallel French‑English corpus (Europarl) and (2) statistical machine translation (SMT) phrase tables generated from that corpus by the Moses toolkit.
First, the English side of Europarl is parsed with the CLaC discourse parser, which was trained on PDTB sections 02‑20. The parser identifies 100 English DCs and assigns them one of the 14 PDTB second‑level relations used in the CoNLL‑2016 shared task, achieving an F1 of 0.90 for DC detection and 0.76 for relation labeling. These English annotations serve as the “gold” reference for mapping.
Second, Moses is run on the same parallel corpus to produce a phrase table that records how often a French phrase aligns with an English phrase. To preserve the distinction between different senses of the same English connective, the authors concatenate each English DC with its PDTB relation into a single token (e.g., “although‑CONCESSION”). The phrase table therefore contains entries of the form <French‑DC, English‑DC‑Relation, frequency>.
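The token-fusion step described above can be illustrated with a minimal sketch. The function name `fuse_connective_labels`, the span format, and the use of an underscore to join multiword connectives are assumptions made for illustration; the summary only specifies that each English DC is concatenated with its relation into one token.

```python
def fuse_connective_labels(tokens, annotations):
    """Replace each annotated connective span with one composite token.

    tokens:      list of word strings for one English sentence.
    annotations: list of (start, end, relation) tuples, `end` exclusive
                 (hypothetical format for the discourse parser's output).
    """
    spans = {start: (end, rel) for start, end, rel in annotations}
    fused = []
    i = 0
    while i < len(tokens):
        if i in spans:
            end, rel = spans[i]
            # Assumption: multiword connectives are joined with "_".
            connective = "_".join(tokens[i:end]).lower()
            fused.append(f"{connective}-{rel.upper()}")
            i = end
        else:
            fused.append(tokens[i])
            i += 1
    return fused


sentence = "Although it rained , we went out".split()
print(fuse_connective_labels(sentence, [(0, 1, "Concession")]))
```

Because the relation is part of the token, the word aligner treats two senses of the same English connective as two different words, so their alignment counts stay separate in the phrase table.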
Only French DCs that appear at least 50 times in Europarl are retained, which filters out 55 low‑frequency items and 7 that never occur, leaving 309 out of the 371 entries in LEXCONN. For each surviving French DC, the authors sum the frequencies of all alignments to each PDTB relation, then divide by the total frequency of that French DC in the corpus. This yields a probability estimate Pr(Rel | FR‑DC). The resulting triples <FR‑DC, Relation, Probability> constitute the ConcoLeDisCo lexicon, comprising 900 entries and made publicly available on GitHub.
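The probability estimate can be sketched as follows, assuming the phrase-table entries have already been reduced to (French DC, English composite token, count) triples. The function name, the input format, and the toy counts are hypothetical and are not taken from the paper; the sketch also assumes that relation labels contain no hyphen, so that the label can be recovered by splitting on the last hyphen.

```python
from collections import defaultdict


def relation_probabilities(alignments, corpus_freq, min_freq=50):
    """Estimate Pr(Rel | FR-DC) from alignment counts.

    alignments:  iterable of (french_dc, english_token, count), where
                 english_token looks like "although-CONCESSION".
    corpus_freq: dict mapping each French DC to its corpus frequency.
    min_freq:    French DCs seen fewer times than this are dropped.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for fr_dc, en_token, count in alignments:
        relation = en_token.rsplit("-", 1)[1]
        counts[fr_dc][relation] += count  # sum over English connectives

    lexicon = {}
    for fr_dc, by_relation in counts.items():
        total = corpus_freq.get(fr_dc, 0)
        if total < min_freq:
            continue
        lexicon[fr_dc] = {rel: c / total for rel, c in by_relation.items()}
    return lexicon


# Toy numbers, invented purely for illustration.
toy_alignments = [
    ("bien que", "although-CONCESSION", 60),
    ("bien que", "though-CONCESSION", 20),
    ("bien que", "while-CONTRAST", 10),
    ("quoique", "although-CONCESSION", 5),
]
toy_freq = {"bien que": 100, "quoique": 8}
print(relation_probabilities(toy_alignments, toy_freq))
```

Since the denominator is the corpus frequency of the French DC rather than the sum of its aligned counts, the probabilities for one connective need not sum to 1; the remainder corresponds to occurrences that were not aligned to any labeled English connective.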
Evaluation proceeds in two parts. Automatic evaluation uses an 11‑point interpolated average precision (11‑point AIP) curve, which measures precision at evenly spaced recall levels without imposing an arbitrary cutoff. For the two relations that overlap between LEXCONN and PDTB—Concession and Condition—the system retrieves 50 % of the relevant French DCs with a precision of 0.81, and the overall average precision (AveP) is 0.68. Manual inspection of false‑positive entries (14 cases with probability > 0.01 that were not present in LEXCONN) reveals that 9 of them (64 %) are genuine mappings missed by the hand‑crafted lexicon. Two native French speakers confirmed the discourse‑relation signal in at least one of five sampled parallel sentences for each of these nine connectives, achieving a Cohen’s κ of 0.72.
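For readers unfamiliar with the metric, the standard 11-point interpolated precision computation can be sketched as below. This is the textbook definition, not necessarily the authors' exact evaluation script; the sketch assumes the French DCs for a relation are ranked by their estimated probability and that an entry counts as relevant when the reference lexicon lists the same mapping.

```python
def eleven_point_interpolated_ap(ranked_is_relevant, n_relevant):
    """Return (mean of the 11 interpolated precisions, the 11 values).

    ranked_is_relevant: booleans for the ranked list, best-scored first.
    n_relevant:         number of relevant items in the reference.
    """
    if n_relevant <= 0:
        raise ValueError("n_relevant must be positive")

    precisions, recalls = [], []
    hits = 0
    for rank, relevant in enumerate(ranked_is_relevant, start=1):
        if relevant:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / n_relevant)

    points = []
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= level.
        reachable = [p for p, r in zip(precisions, recalls) if r >= level - 1e-12]
        points.append(max(reachable, default=0.0))
    return sum(points) / 11, points


ap, curve = eleven_point_interpolated_ap([True, False, True, False], n_relevant=2)
print(round(ap, 3), [round(p, 3) for p in curve])
```

Reading the curve at recall 0.5 gives the kind of "precision at 50 % recall" figure quoted above, without having to pick a probability threshold in advance.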
An additional qualitative finding concerns the interaction of multiple connectives within the same clause. The authors observe that when two connectives co‑occur (e.g., “certes” and “mais”), the presence of one can modify the discourse relation signaled by the other, a phenomenon also reported in the original PDTB. This suggests that future models should account for connective interplay rather than treating each connective in isolation.
In summary, the study makes three substantive contributions: (1) it demonstrates that SMT phrase tables can be repurposed to automatically map discourse connectives across languages, reducing the manual effort required for lexicon construction; (2) it provides empirical evidence that such automatic mappings can uncover valid relations absent from existing resources, thereby enriching those resources; and (3) it offers a language‑agnostic pipeline that can be applied to other language pairs, given a parallel corpus and a discourse parser for the resource‑rich side of the pair (English in this work). Future work will extend the mapping to all PDTB relations, apply the approach to additional languages, and explore neural architectures that model connective interactions more explicitly.