Considering a resource-light approach to learning verb valencies

Here we describe work on learning the subcategorization frames of verbs in a morphologically rich language using only minimal linguistic resources. Our goal is to learn verb subcategorizations for Quechua, an under-resourced morphologically rich language, from an unannotated corpus. We compare results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same text in treebank form. The original plan was to use only a morphological analyzer and an unannotated corpus, but experiments suggest that this approach by itself will not be effective for learning the combinatorial potential of Arabic verbs in general. The lower bound on resources for acquiring this information is somewhat higher, apparently requiring a part-of-speech tagger and chunker for most languages, and a morphological disambiguator for Arabic.


💡 Research Summary

The paper investigates how far one can go in learning verb valency (the set of subcategorization frames a verb can take) for a morphologically rich, under‑resourced language when only the most minimal linguistic resources are available. The authors’ primary target is Quechua, a language with complex inflectional morphology but very limited annotated data. Their initial hypothesis was that a simple pipeline consisting of a morphological analyzer and a large unannotated corpus would be sufficient to extract reliable verb‑argument patterns. To test the limits of this hypothesis, they implemented the pipeline, applied it to an unannotated Arabic corpus, and compared the automatically derived valency information with the gold‑standard data from an Arabic treebank.

The pipeline works as follows. First, a morphological analyzer segments each token into its lemma and morphological features (e.g., case, gender, number). No part‑of‑speech (POS) tagging, syntactic parsing, or disambiguation is performed. Next, the system scans the corpus for occurrences of each verb lemma and collects the surrounding tokens that the analyzer marks as nouns, adjectives, or prepositions. By counting co‑occurrence frequencies and applying simple heuristics (e.g., a noun appearing within a fixed window of a verb is treated as a potential direct object), the system builds a list of candidate subcategorization frames for each verb.
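The window-based heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the window size, tag labels, and (lemma, POS) input format are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Minimal sketch of window-based candidate-frame extraction.
# WINDOW, the tag labels, and the (lemma, pos) input format are
# illustrative assumptions, not details taken from the paper.
WINDOW = 3  # tokens considered on each side of the verb

def extract_candidate_frames(sentences):
    """Collect candidate subcategorization slots per verb lemma.

    `sentences`: list of sentences, each a list of (lemma, pos) pairs
    as produced by a morphological analyzer (no POS disambiguation).
    """
    frames = defaultdict(Counter)
    for sent in sentences:
        for i, (verb, pos) in enumerate(sent):
            if pos != "VERB":
                continue
            # Scan analyzer-marked nouns and prepositions within a
            # fixed window of the verb.
            lo, hi = max(0, i - WINDOW), i + WINDOW + 1
            for other, opos in sent[lo:hi]:
                if opos == "NOUN":
                    # Heuristic: a nearby noun is a potential direct object.
                    frames[verb]["NP"] += 1
                elif opos == "PREP":
                    # Record the preposition itself as a candidate PP slot.
                    frames[verb]["PP[%s]" % other] += 1
    return frames
```

Frequency thresholds over these counts would then decide which candidate frames survive for each verb; the paper's actual heuristics and cutoffs are not reproduced here.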

When this approach is run on the Arabic data, the results are strikingly poor. The system correctly identifies direct objects for roughly 45 % of the verbs, but it fails to capture more complex arguments such as prepositional phrases, clausal complements, or optional adjuncts, achieving less than 20 % recall on those categories. The authors attribute the low performance to two intertwined problems. First, Arabic exhibits a high degree of morphological ambiguity: the same surface form can correspond to multiple POS tags (e.g., a word may be a noun or a verb depending on context). Without a morphological disambiguator, the analyzer frequently assigns the wrong POS, and these errors propagate into the valency extraction. Second, without a POS tagger and a chunker, the system cannot reliably delineate noun phrases from verb phrases, leading to spurious or missed argument links. For example, in a phrase like “الكتاب الذي قرأته” (“the book that I read”), the system cannot decide whether “الكتاب” (“the book”) heads a simple noun phrase or a relative clause, causing the verb “قرأ” (“read”) to be paired with the wrong set of arguments.

The authors also discuss why the same pipeline is unlikely to work for Quechua. Quechua’s free word order and extensive use of suffixes to encode grammatical relations make it even harder to infer argument structure from linear proximity alone. Without a reliable way to identify phrase boundaries, the system would misinterpret many suffixes that encode case or evidentiality as part of the verb’s argument structure. Consequently, the authors conclude that the “morphological analyzer + raw corpus” configuration is insufficient for learning verb valency in languages with rich morphology.

The paper’s broader contribution is a systematic assessment of the lower bound of resources needed for automatic valency acquisition. The experimental evidence suggests that, for most languages, at least a POS tagger and a chunker are required to obtain a usable set of subcategorization frames. For languages like Arabic, a morphological disambiguator is also essential because of the high degree of lexical ambiguity. The authors propose that future work should focus on building lightweight, language‑agnostic POS taggers and chunkers that can be trained on minimal supervision, as well as on developing unsupervised or semi‑supervised methods for morphological disambiguation.

In summary, the study demonstrates that a purely resource‑light approach—relying only on morphological analysis and unannotated text—fails to capture the combinatorial potential of verbs in morphologically rich languages. It establishes a more realistic baseline: a minimal pipeline must include POS tagging, chunking, and, for highly ambiguous languages, morphological disambiguation. This insight helps guide the design of future tools for low‑resource language processing, ensuring that researchers allocate the necessary linguistic resources before attempting large‑scale valency extraction.