Determining the Unithood of Word Sequences using a Probabilistic Approach

Reading time: 5 minutes

📝 Original Info

  • Title: Determining the Unithood of Word Sequences using a Probabilistic Approach
  • ArXiv ID: 0810.0139
  • Date: 2008-10-02
  • Authors: Wilson Wong, Wei Liu and Mohammed Bennamoun

📝 Abstract

Most research related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, novelties are rare in this small sub-field of term extraction. In addition, existing work is mostly empirically motivated and derived. We propose a new probabilistically-derived measure, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our comparative study of 1,825 test cases against an existing empirically-derived function revealed an improvement in precision, recall and accuracy.


📄 Full Content

arXiv:0810.0139v1 [cs.AI] 1 Oct 2008

Determining the Unithood of Word Sequences using a Probabilistic Approach
Wilson Wong, Wei Liu and Mohammed Bennamoun
School of Computer Science and Software Engineering, University of Western Australia, Crawley WA 6009
{wilson,wei,bennamou}@csse.uwa.edu.au

Abstract. Most research related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, novelties are rare in this small sub-field of term extraction. In addition, existing work is mostly empirically motivated and derived. We propose a new probabilistically-derived measure, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our comparative study of 1,825 test cases against an existing empirically-derived function revealed an improvement in precision, recall and accuracy.

1 Introduction

Automatic term recognition, also referred to as term extraction or terminology mining, is the process of extracting lexical units from text and filtering them to identify terms which characterise certain domains of interest. This process involves the determination of two factors: unithood and termhood. Unithood concerns whether or not a sequence of words should be combined to form a more stable lexical unit. Termhood, on the other hand, measures the degree to which these stable lexical units are related to domain-specific concepts. Unithood is only relevant to complex terms (i.e. multi-word terms), while termhood (Wong et al., 2007a) deals with both simple terms (i.e. single-word terms) and complex terms. Recent reviews (Wong et al., 2007b) show that existing research on unithood is mostly carried out as a prerequisite to the determination of termhood.
As a result, only a small number of existing measures are dedicated to determining unithood. Besides the lack of dedicated attention to this sub-field of term extraction, the existing measures are usually derived from term or document frequency and are modified as needed. As such, the significance of the different weights that compose the measures usually assumes an empirical viewpoint. Such methods are at most inspired by, but not derived from, formal models (Kageura and Umino, 1996).

The three objectives of this paper are (1) to separate the measurement of unithood from the determination of termhood, (2) to devise a probabilistically-derived measure which requires only one threshold for determining the unithood of word sequences using non-static textual resources, and (3) to demonstrate the superior performance of the new probabilistically-derived measure against existing empirical measures. Regarding the first objective, we derive our probabilistic measure free from any influence of termhood determination. Our unithood measure is thus an independent tool applicable not only to term extraction but to many other tasks in information extraction and text mining. Concerning the second objective, we devise a new measure, known as the Odds of Unithood (OU), which is derived using Bayes' theorem and founded on a few elementary probabilities. The probabilities are estimated using Google page counts in an attempt to eliminate problems related to the use of static corpora. Moreover, only one threshold, namely OUT, is required to control the functioning of OU. Regarding the third objective, we compare our new OU against an existing empirically-derived measure called Unithood (UH) (Wong et al., 2007b) in terms of precision, recall and accuracy.

In Section 2, we provide a brief review of existing techniques for measuring unithood.
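This excerpt does not reproduce the OU formula itself; as general background, a Bayes'-theorem derivation of odds like the one described above rests on the standard odds form:

$$ O(U \mid E) \;=\; \frac{P(U \mid E)}{P(\neg U \mid E)} \;=\; \frac{P(E \mid U)}{P(E \mid \neg U)} \cdot \frac{P(U)}{P(\neg U)} $$

where U denotes the event that a word sequence forms a stable lexical unit and E the observed evidence (here, estimated from page counts); a single threshold such as OUT on the resulting odds can then accept or reject the sequence.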
In Section 3, we present our new probabilistic approach, the measures involved, and the theoretical and intuitive justification behind every aspect of our measures. In Section 4, we summarize findings from our evaluations. Finally, we conclude this paper with an outlook on future work in Section 5.

2 Related Works

Some of the most common measures of unithood include pointwise mutual information (MI) (Church and Hanks, 1990) and the log-likelihood ratio (Dunning, 1994). In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:

MI(a, b) = log2 [ p(a, b) / (p(a) p(b)) ]   (1)

where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assuming a strict normal distribution and independence between word occurrences (Franz, 1997) do not fare well. For handling extremely uncommon words or small corpora, the log-likelihood ratio delivers the best precision (Kurz and Xu, 2002). The log-likelihood ratio attempts to quantify how
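As a quick illustration of Equation (1), here is a minimal sketch (the function name and the toy counts are invented for the example) that computes MI from raw corpus counts:

```python
import math

def mutual_information(count_ab, count_a, count_b, n):
    """Pointwise mutual information (Eq. 1): log2 of the observed
    co-occurrence probability over the expectation under independence."""
    p_ab = count_ab / n
    p_a = count_a / n
    p_b = count_b / n
    return math.log2(p_ab / (p_a * p_b))

# Toy counts for a word pair in a corpus of one million tokens.
score = mutual_information(count_ab=50, count_a=500, count_b=400, n=1_000_000)
# A large positive MI suggests the pair co-occurs far more often than chance,
# i.e. the two words are a candidate stable lexical unit.
```

A score near zero would indicate the words co-occur about as often as independence predicts, and a negative score that they repel each other.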

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
