arXiv:0810.0139v1 [cs.AI] 1 Oct 2008
Determining the Unithood of Word Sequences using a Probabilistic Approach
Wilson Wong, Wei Liu and Mohammed Bennamoun
School of Computer Science and Software Engineering
University of Western Australia
Crawley WA 6009
{wilson,wei,bennamou}@csse.uwa.edu.au
Abstract
Most research related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, novelties are rare in this small sub-field of term extraction. In addition, existing work is mostly empirically motivated and derived. We propose a new probabilistically-derived measure, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our comparative study using 1,825 test cases against an existing empirically-derived function revealed an improvement in terms of precision, recall and accuracy.
1 Introduction
Automatic term recognition, also referred to as term extraction or terminology mining, is the process of extracting lexical units from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of two factors: unithood and termhood. Unithood concerns whether or not a sequence of words should be combined to form a more stable lexical unit. Termhood, on the other hand, measures the degree to which these stable lexical units are related to domain-specific concepts. Unithood is only relevant to complex terms (i.e. multi-word terms), while termhood (Wong et al., 2007a) deals with both simple terms (i.e. single-word terms) and complex terms. Recent reviews (Wong et al., 2007b) show that existing research on unithood is mostly carried out as a prerequisite to the determination of termhood. As a result, there are only a small number of existing measures dedicated to determining unithood.
Besides the lack of dedicated attention in this sub-field of term extraction, the existing measures are usually derived from term or document frequency, and are modified as needed. As such, the significance of the different weights that compose the measures usually rests on an empirical viewpoint. Such methods are at most inspired by, but not derived from, formal models (Kageura and Umino, 1996).
The three objectives of this paper are (1) to separate the measurement of unithood from the determination of termhood, (2) to devise a probabilistically-derived measure which requires only one threshold for determining the unithood of word sequences using non-static textual resources, and (3) to demonstrate the superior performance of the new probabilistically-derived measure against existing empirical measures. With regard to the first objective, we will derive our probabilistic measure free from any influence of termhood determination. Following this, our unithood measure will be an independent tool that is applicable not only to term extraction, but to many other tasks in information extraction and text mining. Concerning the second objective, we will devise our new measure, known as the Odds of Unithood (OU), which is derived using Bayes' theorem and founded on a few elementary probabilities. The probabilities are estimated using Google page counts in an attempt to eliminate problems related to the use of static corpora. Moreover, only one threshold, namely OUT, is required to control the functioning of OU. Regarding the third objective, we will compare our new OU against an existing empirically-derived measure called Unithood (UH) (Wong et al., 2007b) in terms of precision, recall and accuracy.
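The excerpt does not reproduce the OU formula itself, but the general mechanics of a Bayes-derived odds score with a single acceptance threshold can be sketched as follows. This is an illustrative assumption, not the paper's actual definition: the function names, the placeholder likelihood ratios, and the threshold value are all hypothetical, and the paper's OU combines specific linguistic and Google-page-count probabilities not shown here.

```python
# Illustrative sketch (not the paper's OU formula): the odds form of
# Bayes' theorem combined with a single decision threshold.
def posterior_odds(prior_odds, likelihood_ratios):
    """O(unit | e1..en) = O(unit) * prod_i P(e_i | unit) / P(e_i | not unit),
    assuming the evidence sources are conditionally independent."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# A single threshold (hypothetical value) governs the decision, mirroring
# the role the paper assigns to OUT.
OUT = 1.0

def accept_as_unit(prior_odds, likelihood_ratios, threshold=OUT):
    """Accept the word sequence as a stable lexical unit when the
    posterior odds exceed the threshold."""
    return posterior_odds(prior_odds, likelihood_ratios) > threshold

# Example: a weak prior (odds 0.5) with two evidence sources, each three
# times likelier under "unit", yields posterior odds 0.5 * 3 * 3 = 4.5.
decision = accept_as_unit(0.5, [3.0, 3.0])
```

The appeal of the odds form is that each evidence source contributes a single multiplicative factor, so linguistic and statistical evidence can be combined without ad hoc weighting.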
In Section 2, we provide a brief review of some existing techniques for measuring unithood. In Section 3, we present our new probabilistic approach, the measures involved, and the theoretical and intuitive justification behind every aspect of our measures. In Section 4, we summarize some findings from our evaluations. Finally, we conclude this paper with an outlook to future work in Section 5.
2 Related Work
Some of the most common measures of unithood include pointwise mutual information (MI) (Church and Hanks, 1990) and the log-likelihood ratio (Dunning, 1994). In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:
MI(a, b) = log2 [ p(a, b) / (p(a) p(b)) ]    (1)
where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assuming a strict normal distribution and independence between word occurrences (Franz, 1997) do not fare well.
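Equation (1) can be estimated directly from corpus frequencies. The sketch below is a minimal illustration; the counts and the phrase used in the example are hypothetical.

```python
import math

def pointwise_mi(count_a, count_b, count_ab, total):
    """Pointwise mutual information (Eq. 1) from corpus counts.

    p(a) and p(b) are the marginal probabilities of words a and b,
    and p(a, b) is the probability of their co-occurrence, each
    estimated as a raw count divided by the corpus size.
    """
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts for a pair like ("search", "engine"): the words
# co-occur far more often than independence would predict, so MI is
# well above zero, suggesting a stable lexical unit.
mi = pointwise_mi(count_a=1000, count_b=800, count_ab=600, total=1_000_000)
```

A MI near zero indicates the words co-occur about as often as chance predicts; strongly negative values indicate they repel each other.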
For handling extremely uncommon words or small corpora, the log-likelihood ratio delivers the best precision (Kurz and Xu, 2002). The log-likelihood ratio attempts to quantify how
…(Full text truncated)…