Valence extraction using EM selection and co-occurrence matrices

📝 Original Info

  • Title: Valence extraction using EM selection and co-occurrence matrices
  • ArXiv ID: 0711.4475
  • Date: 2020-03-11
  • Authors: Łukasz Dębowski

📝 Abstract

This paper discusses two new procedures for extracting verb valences from raw texts, with an application to the Polish language. The first novel technique, the EM selection algorithm, performs unsupervised disambiguation of valence frame forests, obtained by applying a non-probabilistic deep grammar parser and some post-processing to the text. The second new idea concerns filtering of incorrect frames detected in the parsed text and is motivated by an observation that verbs which take similar arguments tend to have similar frames. This phenomenon is described in terms of newly introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps. The list of valid arguments is first determined for each verb, whereas the pattern according to which the arguments are combined into frames is computed in the following stage. Our best extracted dictionary reaches an $F$-score of 45%, compared to an $F$-score of 39% for the standard frame-based BHT filtering.
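
To make the reported $F$-scores concrete, here is a minimal sketch of how an extracted dictionary could be scored against a gold standard. It assumes that precision and recall are computed over (verb, frame) pairs, which this excerpt does not specify. The function name `f_score`, the verb labels, and the toy entries are hypothetical.

```python
def f_score(extracted, gold):
    """Return (precision, recall, F) for two collections of (verb, frame) pairs.

    Assumption: each dictionary entry is one (verb, frame) pair and the
    F-score is the harmonic mean of precision and recall.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: hypothetical verbs with frames written as plain strings.
gold = {
    ("verb_a", "np(nom),np(acc)"),
    ("verb_a", "np(nom)"),
    ("verb_b", "np(nom),np(acc)"),
    ("verb_b", "np(nom),np(dat)"),
}
extracted = {
    ("verb_a", "np(nom),np(acc)"),
    ("verb_b", "np(nom),np(acc)"),
    ("verb_b", "np(nom),np(gen)"),  # incorrect frame, counts against precision
}
precision, recall, f = f_score(extracted, gold)
```

On this toy data two of the three extracted pairs are correct and two of the four gold pairs are found, so the harmonic mean falls between the two rates.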

📄 Full Content

arXiv:0711.4475v6 [cs.CL] 27 Nov 2009

Valence extraction using EM selection and co-occurrence matrices

Łukasz Dębowski†
Instytut Podstaw Informatyki PAN, J.K. Ordona 21, 01-237 Warszawa, Poland

Abstract. This paper discusses two new procedures for extracting verb valences from raw texts, with an application to the Polish language. The first novel technique, the EM selection algorithm, performs unsupervised disambiguation of valence frame forests, obtained by applying a non-probabilistic deep grammar parser and some post-processing to the text. The second new idea concerns filtering of incorrect frames detected in the parsed text and is motivated by an observation that verbs which take similar arguments tend to have similar frames. This phenomenon is described in terms of newly introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps. The list of valid arguments is first determined for each verb, whereas the pattern according to which the arguments are combined into frames is computed in the following stage. Our best extracted dictionary reaches an $F$-score of 45%, compared to an $F$-score of 39% for the standard frame-based BHT filtering.

Keywords: verb valence extraction, EM algorithm, co-occurrence matrices, Polish language

1. Introduction

The aim of this paper is to explore two new techniques for verb valence extraction from raw texts, as applied to the Polish language. The methods are novel compared to the standard framework (Brent, 1993; Manning, 1993; Ersan and Charniak, 1995; Briscoe and Carroll, 1997) and motivated in part by resources available for this language and in part by certain linguistic observations.

The task of valence extraction for Polish invites novel approaches indeed. Although there is no treebank for this language on which a probabilistic parser can be trained, a few interesting resources are available.
Firstly, the non-probabilistic parser Świgra (Woliński, 2004; Woliński, 2005) provides an efficient implementation of the large formal grammar of Polish by Świdziński (1992). Secondly, three detailed valence dictionaries have been compiled by formal linguists (Polański, 1992; Świdziński, 1994; Bańko, 2000). Those dictionaries are potentially useful as a gold standard in automatic valence extraction but two of them, Polański and Bańko, are printed on paper in several volumes, whereas Świdziński's dictionary, though rather small, is available electronically. The text file by Świdziński lists about 1000 verbal entries whereas 6000 entries can be found in COMLEX, a detailed syntactic dictionary of English (Macleod et al., 1994).

The information provided by Polish valence dictionaries is of comparable complexity to information available in COMLEX. Verbs in the dictionary entries select for nominal (NP) and prepositional (PP) phrases in specific morphological cases (7 distinct cases and many more prepositions). Valence frames may contain the reflexive marker się and certain adjuncts (e.g., adverbs) but not necessarily a subject, which also contributes to the combinatorial explosion. For instance, Świdziński (1994) provides 329 frame types for the 201 test verbs described later in Section 4. The most frequent frame among them, {np(nom), np(acc)}, is valid for 124 test verbs and there are 183 hapax frames.

† The author is presently on leave for Centrum Wiskunde & Informatica, Science Park 123, NL-1098 XG Amsterdam, the Netherlands. E: debowski@cwi.nl, T: +31 20 592 4193, F: +31 20 592 4312.

© 2018 Kluwer Academic Publishers. Printed in the Netherlands.

Such lack of computational data is a strong incentive to develop automatic valence extraction as efficiently as possible. Thus we have devised two procedures. The first one, called the EM selection algorithm, performs unsupervised selection of alternative valence frames. These frames were obtained for sentences in a corpus by applying the parser Świgra and some post-processing. In this way, we cope with the lack of a probabilistic parser and of a treebank. The EM selection procedure, to our knowledge described here for the first time, assumes that the disambiguated alternatives are highly repeatable atomic entities. The procedure does not rely on what formal objects the alternatives are but it only takes their frequencies into account. Thus, the EM selection looks like an interesting baseline algorithm for many unsupervised disambiguation problems, e.g. part-of-speech tagging (Kupiec, 1992; Merialdo, 1994). Computationally, the algorithm is far simpler than the inside-outside algorithm for probabilistic grammars (Chi and Geman, 1998), which also instantiates the expectation-maximization scheme and is used for treebank and valence acquisition (Bris

…(Full text truncated)…
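
The introduction describes EM selection as a procedure that treats alternatives as atomic entities and uses only their frequencies. The available excerpt breaks off before the update rule is given, so the following is only a sketch under stated assumptions: each sentence contributes one set of alternative frames, a single probability is kept per alternative, and a generic expectation-maximization update re-estimates those probabilities. The function name `em_selection`, the fixed iteration count, and the toy frames are hypothetical, and the paper's exact formulation may differ.

```python
from collections import defaultdict


def em_selection(alternative_sets, iterations=50):
    """Estimate one probability per alternative and pick one per set.

    alternative_sets: iterable of non-empty collections of strings, e.g. the
    candidate frames produced for each parsed sentence (an assumption).
    Returns (prob, chosen): a dict of probabilities and one pick per set.
    """
    sets = [frozenset(s) for s in alternative_sets if s]
    if not sets:
        return {}, []
    # Start from a uniform distribution over all observed alternatives.
    support = set().union(*sets)
    prob = {a: 1.0 / len(support) for a in support}

    for _ in range(iterations):
        expected = defaultdict(float)
        for s in sets:
            total = sum(prob[a] for a in s)
            for a in s:
                # E-step: posterior weight of alternative a within its set.
                expected[a] += prob[a] / total
        # M-step: expected counts divided by the number of sets.
        prob = {a: expected[a] / len(sets) for a in support}

    # Pick the most probable alternative in each set (ties: sort order).
    chosen = [max(sorted(s), key=lambda a: prob[a]) for s in sets]
    return prob, chosen


# Toy input: three sentences, each with a set of candidate frames.
toy_sets = [
    {"np(nom)", "np(nom),np(acc)"},
    {"np(nom),np(acc)"},
    {"np(nom),np(acc)", "np(dat)"},
]
prob, chosen = em_selection(toy_sets)
```

In this toy input the frame `np(nom),np(acc)` occurs in every set, so its estimated probability grows at the expense of the rarer alternatives and it is selected for all three sentences. The sketch never inspects the internal structure of a frame, which mirrors the claim that only frequencies matter.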
