This paper discusses two new procedures for extracting verb valences from raw texts, with an application to the Polish language. The first novel technique, the EM selection algorithm, performs unsupervised disambiguation of valence frame forests, obtained by applying a non-probabilistic deep grammar parser and some post-processing to the text. The second new idea concerns filtering of incorrect frames detected in the parsed text and is motivated by an observation that verbs which take similar arguments tend to have similar frames. This phenomenon is described in terms of newly introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps. The list of valid arguments is first determined for each verb, whereas the pattern according to which the arguments are combined into frames is computed in the following stage. Our best extracted dictionary reaches an $F$-score of 45%, compared to an $F$-score of 39% for the standard frame-based BHT filtering.
arXiv:0711.4475v6 [cs.CL] 27 Nov 2009
Valence extraction using EM selection and co-occurrence matrices
Łukasz Dębowski†
Instytut Podstaw Informatyki PAN, J.K. Ordona 21, 01-237 Warszawa, Poland
Abstract. This paper discusses two new procedures for extracting verb valences from raw texts, with an application to the Polish language. The first novel technique, the EM selection algorithm, performs unsupervised disambiguation of valence frame forests, obtained by applying a non-probabilistic deep grammar parser and some post-processing to the text. The second new idea concerns filtering of incorrect frames detected in the parsed text and is motivated by an observation that verbs which take similar arguments tend to have similar frames. This phenomenon is described in terms of newly introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps. The list of valid arguments is first determined for each verb, whereas the pattern according to which the arguments are combined into frames is computed in the following stage. Our best extracted dictionary reaches an $F$-score of 45%, compared to an $F$-score of 39% for the standard frame-based BHT filtering.
Keywords: verb valence extraction, EM algorithm, co-occurrence matrices, Polish language
1. Introduction
The aim of this paper is to explore two new techniques for verb valence extraction from raw texts, as applied to the Polish language. The methods are novel compared to the standard framework (Brent, 1993; Manning, 1993; Ersan and Charniak, 1995; Briscoe and Carroll, 1997) and motivated in part by resources available for this language and in part by certain linguistic observations.
The task of valence extraction for Polish invites novel approaches indeed. Although there is no treebank for this language on which a probabilistic parser can be trained, a few interesting resources are available. Firstly, the non-probabilistic parser Świgra (Woliński, 2004; Woliński, 2005) provides an efficient implementation of the large formal grammar of Polish by Świdziński (1992). Secondly, three detailed valence dictionaries have been compiled by formal linguists (Polański, 1992; Świdziński, 1994; Bańko, 2000). Those dictionaries are potentially useful as a gold standard in automatic valence extraction but two of them, Polański and Bańko, are printed on paper in several volumes, whereas Świdziński's dictionary, though rather small, is available electronically. The text file by Świdziński lists about 1000 verbal entries whereas 6000 entries can be found in COMLEX, a detailed syntactic dictionary of English (Macleod et al., 1994).
The information provided by Polish valence dictionaries is of comparable complexity to information available in COMLEX. Verbs in the dictionary entries select for nominal (NP) and prepositional (PP) phrases in specific morphological cases (7 distinct cases and many more prepositions). Valence frames may contain the reflexive marker się and certain adjuncts (e.g., adverbs) but not necessarily a subject, which also contributes to the combinatorial explosion. For instance, Świdziński (1994) provides 329 frame types for the 201 test verbs described later in Section 4. The most frequent frame among them, {np(nom), np(acc)}, is valid for 124 test verbs and there are 183 hapax frames.
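The frame statistics quoted above (number of frame types, most frequent frame, hapax frames) can be illustrated with a minimal Python sketch. It assumes that a frame is modelled as an unordered set of argument labels, mirroring the notation {np(nom), np(acc)}. The toy dictionary, the labels other than np(nom) and np(acc), and the function name `frame_statistics` are hypothetical and are not taken from Świdziński's dictionary.

```python
from collections import Counter

# Hypothetical toy dictionary (NOT data from the paper): verb -> set of frames.
# Each frame is a frozenset of argument labels, so argument order is ignored.
toy_dictionary = {
    "czytać": {frozenset({"np(nom)", "np(acc)"}), frozenset({"np(nom)"})},
    "bać": {frozenset({"np(nom)", "się", "np(gen)"})},
    "widzieć": {frozenset({"np(nom)", "np(acc)"})},
}


def frame_statistics(dictionary):
    """Count, for every frame type, how many verbs license it."""
    verbs_per_frame = Counter(
        frame for frames in dictionary.values() for frame in frames
    )
    # A hapax frame is a frame type that is valid for exactly one verb.
    hapax = [frame for frame, n in verbs_per_frame.items() if n == 1]
    return verbs_per_frame, hapax


verbs_per_frame, hapax_frames = frame_statistics(toy_dictionary)
most_common_frame, verb_count = verbs_per_frame.most_common(1)[0]
```

On the toy data there are three frame types, two of which are hapax frames. The same counting applied to a real dictionary would yield figures of the kind reported in the text.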
† The author is presently on leave for Centrum Wiskunde & Informatica, Science Park 123, NL-1098 XG Amsterdam, the Netherlands. E: debowski@cwi.nl, T: +31 20 592 4193, F: +31 20 592 4312.
© 2018 Kluwer Academic Publishers. Printed in the Netherlands.
Such lack of computational data is a strong incentive to develop automatic valence extraction as efficiently as possible. Thus we have devised two procedures. The first one, called the EM selection algorithm, performs unsupervised selection of alternative valence frames. These frames were obtained for sentences in a corpus by applying the parser Świgra and some post-processing. In this way, we cope with the lack of a probabilistic parser and of a treebank.
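To make the idea of unsupervised selection among alternative frames concrete, here is a minimal Python sketch of a generic expectation-maximization loop over sets of alternatives. It is an illustration under stated assumptions, not necessarily the update rule used by the EM selection algorithm of this paper: every sentence is assumed to contribute one non-empty set of candidate frames, exactly one candidate per set is assumed to be correct, and candidates are treated as atomic labels drawn from a single categorical distribution. The function name `em_select` and the toy input are hypothetical.

```python
from collections import defaultdict


def em_select(alternative_sets, iterations=50):
    """Pick one alternative per set using a generic EM over a categorical model.

    Assumption: every element of `alternative_sets` is a non-empty set.
    """
    items = {a for alts in alternative_sets for a in alts}
    prob = {a: 1.0 / len(items) for a in items}  # uniform initialisation
    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: split one unit of count among the alternatives of each set,
        # proportionally to their current probabilities.
        for alts in alternative_sets:
            norm = sum(prob[a] for a in alts)
            for a in alts:
                expected[a] += prob[a] / norm
        # M-step: re-estimate probabilities from the expected counts.
        total = len(alternative_sets)
        prob = {a: expected[a] / total for a in items}
    # Selection: keep the most probable alternative in each set
    # (sorting first makes tie-breaking deterministic).
    selected = [
        max(sorted(alts), key=lambda a: prob[a]) for alts in alternative_sets
    ]
    return selected, prob


# Hypothetical toy input: three sentences with ambiguous frame candidates.
toy_sets = [
    {"np(nom) np(acc)", "np(nom)"},
    {"np(nom) np(acc)"},
    {"np(nom) np(acc)", "np(nom) np(dat)"},
]
selected, prob = em_select(toy_sets)
```

On this toy input the candidate that recurs across sentences accumulates almost all of the probability mass and is selected everywhere, which shows how frequency alone can drive the disambiguation.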
The EM selection procedure, to our knowledge described here for the first time, assumes that the disambiguated alternatives are highly repeatable atomic entities. The procedure does not rely on what formal objects the alternatives are but it only takes their frequencies into account. Thus, the EM selection looks like an interesting baseline algorithm for many unsupervised disambiguation problems, e.g. part-of-speech tagging (Kupiec, 1992; Merialdo, 1994). Computationally, the algorithm is far simpler than the inside-outside algorithm for probabilistic grammars (Chi and Geman, 1998), which also instantiates the expectation-maximization scheme and is used for treebank and valence acquisition (Bris
…(Full text truncated)…