arXiv:0810.0139v1 [cs.AI] 1 Oct 2008
Determining the Unithood of Word Sequences using a Probabilistic Approach
Wilson Wong, Wei Liu and Mohammed Bennamoun
School of Computer Science and Software Engineering
University of Western Australia
Crawley WA 6009
{wilson,wei,bennamou}@csse.uwa.edu.au
Abstract
Most research related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, novelties are rare in this small sub-field of term extraction. In addition, existing work is mostly empirically motivated and derived. We propose a new probabilistically-derived measure, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our comparative study using 1,825 test cases against an existing empirically-derived function revealed an improvement in terms of precision, recall and accuracy.
1 Introduction
Automatic term recognition, also referred to as term extraction or terminology mining, is the process of extracting lexical units from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of two factors: unithood and termhood. Unithood concerns whether or not a sequence of words should be combined to form a more stable lexical unit. Termhood, on the other hand, measures the degree to which these stable lexical units are related to domain-specific concepts. Unithood is only relevant to complex terms (i.e. multi-word terms), while termhood (Wong et al., 2007a) deals with both simple terms (i.e. single-word terms) and complex terms. Recent reviews (Wong et al., 2007b) show that existing research on unithood is mostly carried out as a prerequisite to the determination of termhood. As a result, there are only a small number of existing measures dedicated to determining unithood.
Besides the lack of dedicated attention in this sub-field of term extraction, the existing measures are usually derived from term or document frequency, and are modified as needed. As such, the significance of the different weights that compose the measures usually rests on an empirical viewpoint. Such methods are at most inspired by, but not derived from, formal models (Kageura and Umino, 1996).
The three objectives of this paper are (1) to separate the measurement of unithood from the determination of termhood, (2) to devise a probabilistically-derived measure which requires only one threshold for determining the unithood of word sequences using non-static textual resources, and (3) to demonstrate the superior performance of the new probabilistically-derived measure against existing empirical measures. With regard to the first objective, we will derive our probabilistic measure free from any influence of termhood determination. Following this, our unithood measure will be an independent tool that is applicable not only to term extraction, but to many other tasks in information extraction and text mining. Concerning the second objective, we will devise our new measure, known as the Odds of Unithood (OU), which is derived using Bayes' theorem and founded on a few elementary probabilities. The probabilities are estimated using Google page counts in an attempt to eliminate problems related to the use of static corpora. Moreover, only one threshold, namely OUT, is required to control the functioning of OU. Regarding the third objective, we will compare our new OU against an existing empirically-derived measure called Unithood (UH) (Wong et al., 2007b) in terms of precision, recall and accuracy.
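The excerpt does not reproduce the OU formula itself, but the general mechanics of a Bayes-derived odds score with a single acceptance threshold can be sketched as follows. This is an illustrative assumption, not the paper's actual definition: the function names, the placeholder likelihood ratios, and the threshold value are all hypothetical, and the paper's OU combines specific linguistic and Google-page-count probabilities not shown here.

```python
# Illustrative sketch (not the paper's OU formula): the odds form of
# Bayes' theorem combined with a single decision threshold.
def posterior_odds(prior_odds, likelihood_ratios):
    """O(unit | e1..en) = O(unit) * prod_i P(e_i | unit) / P(e_i | not unit),
    assuming the evidence sources are conditionally independent."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# A single threshold (hypothetical value) governs the decision, mirroring
# the role the paper assigns to OUT.
OUT = 1.0

def accept_as_unit(prior_odds, likelihood_ratios, threshold=OUT):
    """Accept the word sequence as a stable lexical unit when the
    posterior odds exceed the threshold."""
    return posterior_odds(prior_odds, likelihood_ratios) > threshold

# Example: a weak prior (odds 0.5) with two evidence sources, each three
# times likelier under "unit", yields posterior odds 0.5 * 3 * 3 = 4.5.
decision = accept_as_unit(0.5, [3.0, 3.0])
```

The appeal of the odds form is that each evidence source contributes a single multiplicative factor, so linguistic and statistical evidence can be combined without ad hoc weighting.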
In Section 2, we provide a brief review of some existing techniques for measuring unithood. In Section 3, we present our new probabilistic approach, the measures involved, and the theoretical and intuitive justification behind every aspect of our measures. In Section 4, we summarize some findings from our evaluations. Finally, we conclude this paper with an outlook to future work in Section 5.
2 Related Work
Some of the most common measures of unithood include pointwise mutual information (MI) (Church and Hanks, 1990) and the log-likelihood ratio (Dunning, 1994). In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:
MI(a, b) = log2 [ p(a, b) / (p(a) p(b)) ]    (1)
where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assuming a strict normal distribution and independence between word occurrences (Franz, 1997) do not fare well.
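Equation (1) can be estimated directly from corpus frequencies. The sketch below is a minimal illustration; the counts and the phrase used in the example are hypothetical.

```python
import math

def pointwise_mi(count_a, count_b, count_ab, total):
    """Pointwise mutual information (Eq. 1) from corpus counts.

    p(a) and p(b) are the marginal probabilities of words a and b,
    and p(a, b) is the probability of their co-occurrence, each
    estimated as a raw count divided by the corpus size.
    """
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts for a pair like ("search", "engine"): the words
# co-occur far more often than independence would predict, so MI is
# well above zero, suggesting a stable lexical unit.
mi = pointwise_mi(count_a=1000, count_b=800, count_ab=600, total=1_000_000)
```

A MI near zero indicates the words co-occur about as often as chance predicts; strongly negative values indicate they repel each other.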
For handling extremely uncommon words or small corpora, the log-likelihood ratio delivers the best precision (Kurz and Xu, 2002). The log-likelihood ratio attempts to quantify how
…(Full text truncated)…