A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora
The performance of a Statistical Machine Translation System (SMT) system is proportionally directed to the quality and length of the parallel corpus it uses. However for some pair of languages there is a considerable lack of them. The long term goal is to construct a Japanese-Spanish parallel corpus to be used for SMT, whereas, there are a lack of useful Japanese-Spanish parallel Corpus. To address this problem, In this study we proposed a method for extracting Japanese-Spanish Parallel Sentences from Wikipedia using POS tagging and Rule-Based approach. The main focus of this approach is the syntactic features of both languages. Human evaluation was performed over a sample and shows promising results, in comparison with the baseline.
💡 Research Summary
The paper addresses the scarcity of Japanese‑Spanish parallel corpora, which hampers the development of statistical machine translation (SMT) systems for this language pair. To create a high‑quality parallel resource without manual translation, the authors propose a rule‑based pipeline that extracts sentence‑level alignments from comparable corpora, specifically Wikipedia articles covering the same topics in both languages. The process begins with crawling and preprocessing Wikipedia pages, followed by sentence segmentation. Each language is then processed with a dedicated morphological analyzer (MeCab for Japanese and Freeling for Spanish) to obtain part‑of‑speech (POS) tags. The core contribution lies in a set of syntactic alignment rules that explicitly model the structural differences between Japanese (SOV) and Spanish (SVO). These rules map verb positions, subject and object noun phrases, and handle special tokens such as numbers, dates, and proper nouns. Matching scores are computed by combining lexical similarity, word‑order conformity, and structural consistency, using regular‑expression patterns and tree‑based matching algorithms. For evaluation, a random sample of 1,000 sentence pairs was manually judged by bilingual experts. Compared with a baseline that relies solely on string similarity, the proposed method achieved a precision increase from 85 % to 97 % and a recall rise from 78 % to 87 %. The human assessment confirms that the rule‑based approach can effectively compensate for syntactic divergences and produce reliable parallel sentences from comparable data. The authors acknowledge that rule creation requires linguistic expertise and that complex syntactic variations remain challenging. They suggest future work integrating neural sentence embeddings with the rule framework to enhance coverage, reduce manual rule engineering, and extend the methodology to other low‑resource language pairs.