Exploring Lexical, Syntactic, and Semantic Features for Chinese Textual Entailment in NTCIR RITE Evaluation Tasks
We computed linguistic information at the lexical, syntactic, and semantic levels for the Recognizing Inference in Text (RITE) tasks for both traditional and simplified Chinese in NTCIR-9 and NTCIR-10. Techniques for syntactic parsing, named-entity recognition, and near-synonym recognition were employed, and features such as counts of common words, statement lengths, negation words, and antonyms were considered when judging the entailment relationship between two statements. We explored both heuristics-based scoring functions and machine-learning approaches. The reported systems proved robust, finishing second in the binary-classification subtasks for both simplified and traditional Chinese in NTCIR-10 RITE-2. We conducted further experiments with the NTCIR-9 RITE test data, with good results, and extended our work to search for better classifier configurations and to investigate the contributions of individual features. This extended work produced interesting results that should encourage further discussion.
💡 Research Summary
The paper presents a comprehensive study on recognizing textual entailment (RITE) for both traditional and simplified Chinese within the NTCIR‑9 and NTCIR‑10 evaluation campaigns. The authors design and evaluate a rich set of linguistic features spanning three linguistic levels—lexical, syntactic, and semantic—and investigate how these features can be combined using both heuristic scoring functions and supervised machine‑learning classifiers.
Lexical features include counts of shared tokens, overlap ratios for special token types (named entities, numbers, temporal expressions), sentence‑length differences, and the presence of negation words (e.g., “不”, “沒”) and antonyms (e.g., “大” vs. “小”). These statistics aim to capture surface‑level similarity and polarity cues that are often decisive for entailment decisions.
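The lexical statistics described above can be sketched in a few lines. This is an illustrative implementation, not the paper's code: the negation list and antonym pairs are tiny placeholder samples, and the exact feature definitions (e.g., how the overlap ratio is normalized) are assumptions.

```python
# Placeholder cue lists; the paper's actual lexicons are far larger.
NEGATIONS = {"不", "沒", "未", "非"}          # assumed negation cues
ANTONYMS = {("大", "小"), ("高", "低")}       # assumed antonym pairs

def lexical_features(t1, t2):
    """t1, t2: token lists for the text and the hypothesis."""
    s1, s2 = set(t1), set(t2)
    common = s1 & s2
    return {
        "common_word_count": len(common),
        # assumed normalization: overlap relative to the hypothesis
        "overlap_ratio": len(common) / len(s2) if s2 else 0.0,
        "length_diff": abs(len(t1) - len(t2)),
        # polarity cue: negation appears in only one of the two sentences
        "negation_mismatch": int(bool(NEGATIONS & s1) != bool(NEGATIONS & s2)),
        # polarity cue: an antonym pair is split across the two sentences
        "antonym_pair": int(any((a in s1 and b in s2) or (b in s1 and a in s2)
                                for a, b in ANTONYMS)),
    }

feats = lexical_features(["這", "棟", "樓", "很", "高"],
                         ["這", "棟", "樓", "很", "低"])
```

For this pair ("this building is tall" vs. "this building is short"), the antonym cue fires even though four of five tokens overlap, which is exactly the kind of case where surface similarity alone would mislead.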
Syntactic features are derived from a Chinese constituency parser (the Stanford Chinese Parser). The authors extract tree depth, branching factor, and the degree of alignment of core grammatical relations (subject‑verb‑object structures) between the two sentences. Tree‑edit distance is used as a quantitative measure of syntactic similarity, which is particularly valuable for Chinese because word order is relatively flexible and functional particles are sparse.
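Tree depth and branching factor can be read directly off a bracketed constituency parse like the ones the Stanford Chinese Parser emits. The sketch below is an assumption about the representation (simple S-expression brackets) and omits tree-edit distance, which needs a full alignment algorithm:

```python
def parse_tree(s):
    """Parse '(S (NP ...) (VP ...))' into nested [label, child, ...] lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        node = [tokens[i + 1]]          # constituent label
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                node.append(child)
            else:
                node.append(tokens[i])  # leaf word
                i += 1
        return node, i + 1
    tree, _ = helper(0)
    return tree

def depth(node):
    """Depth of the tree; bare words count as depth 0."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(c) for c in node[1:])

def avg_branching(node):
    """Mean number of children over all labeled nodes."""
    counts = []
    def walk(n):
        if isinstance(n, str):
            return
        counts.append(len(n) - 1)
        for c in n[1:]:
            walk(c)
    walk(node)
    return sum(counts) / len(counts)

# "I like music" parsed with assumed Penn-Chinese-Treebank-style labels.
tree = parse_tree("(S (NP (PN 我)) (VP (VV 喜歡) (NP (NN 音樂))))")
```

Comparing these scalar statistics between the text and the hypothesis gives cheap structural features; full tree-edit distance would refine this by scoring node insertions, deletions, and relabelings.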
Semantic features are built on named‑entity recognition (NER) outputs and a custom near‑synonym lexicon that extends the Chinese WordNet. The NER layer checks whether the same person, organization, or location entities appear in both sentences, while the synonym lexicon provides similarity scores for word pairs that are not identical but share meaning, letting the system match paraphrases that pure string overlap would miss.
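A hedged sketch of these two semantic checks follows. In a real system the entity sets would come from an NER tagger and the scores from the extended near-synonym lexicon; both are stubbed here with invented entries, and the scoring scheme is an assumption.

```python
# Stub lexicon: near-synonym pairs with assumed similarity scores
# (e.g., 醫生/大夫 are both "doctor"; 快樂/高興 are both "happy").
SYNONYMS = {frozenset({"醫生", "大夫"}): 0.9,
            frozenset({"快樂", "高興"}): 0.8}

def entity_overlap(text_ents, hyp_ents):
    """Fraction of the hypothesis's entities that also occur in the text."""
    if not hyp_ents:
        return 1.0          # nothing to contradict
    return len(text_ents & hyp_ents) / len(hyp_ents)

def synonym_score(w1, w2):
    """1.0 for identical words, lexicon score for near-synonyms, else 0.0."""
    if w1 == w2:
        return 1.0
    return SYNONYMS.get(frozenset({w1, w2}), 0.0)

overlap = entity_overlap({"台北", "王小明"}, {"台北"})
```

A mismatch in entities (a different person or place in the hypothesis) is strong evidence against entailment, while a high synonym score rescues pairs like 醫生/大夫 that lexical overlap would treat as unrelated.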
Two families of models are explored. The first is a heuristic scoring function where each feature is assigned a manually tuned weight; the weighted sum is compared against a threshold to produce a binary entailment label. This approach is highly interpretable, allowing the authors to pinpoint which features drive the decision. The second family consists of supervised classifiers—Support Vector Machines, Random Forests, and Gradient Boosting Machines (GBM). Feature vectors are normalized and optionally reduced with PCA before training. Hyper‑parameters are optimized via 5‑fold cross‑validation.
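The heuristic family reduces to a weighted sum against a threshold. The weights and threshold below are hand-picked placeholders for illustration, not the paper's tuned values; the feature names echo the lexical and semantic features discussed earlier:

```python
# Assumed weights: overlap cues vote for entailment, polarity cues against.
WEIGHTS = {"overlap_ratio": 1.0, "entity_overlap": 0.8,
           "negation_mismatch": -1.2, "antonym_pair": -1.0}
THRESHOLD = 0.7  # placeholder decision boundary

def entails(features):
    """Binary entailment label from a thresholded weighted feature sum."""
    score = sum(WEIGHTS.get(name, 0.0) * value
                for name, value in features.items())
    return score >= THRESHOLD

label = entails({"overlap_ratio": 0.8, "entity_overlap": 1.0,
                 "negation_mismatch": 0, "antonym_pair": 0})
```

The appeal of this formulation is transparency: when the system errs, one can inspect the per-feature products to see which cue pushed the score over or under the threshold, something the learned classifiers do not offer as directly.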
Experimental results show that the GBM classifier achieves the best performance (F1 ≈ 0.78) on the NTCIR‑10 RITE‑2 binary classification subtasks. Notably, the system attains second place for both traditional and simplified Chinese, demonstrating robustness across scripts. Additional experiments on the NTCIR‑9 test set confirm the stability of the approach, with accuracies of 81.2 % (traditional) and 80.5 % (simplified). An ablation study reveals that syntactic tree‑alignment contributes the most to overall performance; removing this feature drops the F1 score by about 0.07. Lexical negation and antonym detection also prove valuable, especially for simplified Chinese, where they reduce specific error types involving contradictory statements.
The paper discusses several insights. First, the multi‑level feature set is complementary: syntactic similarity captures structural entailment, lexical overlap handles surface similarity, and semantic cues resolve cases where different words convey the same meaning. Second, script‑specific differences emerge: traditional Chinese benefits more from syntactic cues, while simplified Chinese gains from lexical polarity cues. Third, the combination of heuristic and learning‑based methods provides a safety net—heuristics offer transparency, while machine learning captures non‑linear interactions among features.
In the conclusion, the authors outline future directions. They propose hybrid systems that combine the handcrafted features with richer distributional semantics, leveraging deeper contextual meaning while retaining interpretability. Transfer learning between traditional and simplified Chinese is suggested as a way to mitigate data scarcity. Finally, extending the binary entailment framework to multi‑class settings (e.g., neutral, contradiction) would broaden applicability to downstream tasks such as question answering and summarization.
Overall, the study demonstrates that a carefully engineered set of lexical, syntactic, and semantic features—when combined with both rule‑based scoring and gradient‑boosted classifiers—can achieve highly competitive performance on Chinese textual entailment benchmarks. The work not only validates the effectiveness of multi‑level linguistic analysis for Chinese but also provides a solid baseline for future research that seeks to fuse traditional feature engineering with deep neural representations.