Thai Rhetorical Structure Analysis
Rhetorical structure analysis (RSA) explores the discourse relations among elementary discourse units (EDUs) in a text. It is useful in many text-processing tasks that exploit relationships among EDUs, such as text understanding, summarization, and question answering. Thai, with its distinctive linguistic characteristics, requires a language-specific technique. This article proposes an approach to Thai rhetorical structure analysis. First, EDUs are segmented by two hidden Markov models derived from syntactic rules. A rhetorical structure tree is then constructed by a clustering technique whose similarity measure is derived from Thai semantic rules. Finally, a decision tree whose features are derived from the same semantic rules is used to determine the discourse relations.
💡 Research Summary
The paper presents a comprehensive framework for Thai rhetorical structure analysis (RSA), addressing the unique linguistic challenges of Thai such as free word order, extensive use of particles, and the lack of explicit punctuation. The approach consists of three main stages. First, elementary discourse units (EDUs) are segmented using a dual hidden Markov model (HMM) architecture: one HMM models token‑level transitions based on morphological analysis, while a second HMM captures syntactic‑level transitions derived from parse trees. By combining the outputs of both models through Viterbi decoding, the system achieves robust EDU boundary detection despite the ambiguity inherent in Thai texts.
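The decoding step above relies on standard Viterbi inference over an HMM. The paper's dual-HMM architecture and its exact combination rule are not spelled out here, so the following is only a minimal sketch of the shared building block: a log-space Viterbi decoder that could be run once per model, with the two models' boundary decisions then reconciled (for instance by summing log-scores, which is an assumption, not the authors' stated method).

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, obs):
    """Log-space Viterbi decoding for one HMM.

    log_start: (S,)   initial state log-probabilities
    log_trans: (S, S) transition log-probabilities
    log_emit:  (S, V) emission log-probabilities
    obs:       list of observation indices
    Returns the most likely state sequence (e.g. boundary/inside tags).
    """
    S, T = len(log_start), len(obs)
    delta = np.full((T, S), -np.inf)   # best log-score ending in state s at t
    back = np.zeros((T, S), dtype=int) # backpointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    # Backtrace from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state model (state 0 prefers symbol 0, state 1 prefers symbol 1).
log_start = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
path = viterbi(log_start, log_trans, log_emit, [0, 0, 1, 1])
```

In the paper's setting, one HMM would be parameterized from token-level morphological transitions and the other from syntactic-parse transitions; the decoder itself is identical for both.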
Second, a similarity measure grounded in Thai semantic rules (e.g., verb‑object agreement, particle functions, adverbial positioning) is defined for every pair of EDUs. This measure feeds a hierarchical agglomerative clustering algorithm that groups semantically related EDUs into sub‑trees, effectively constructing a preliminary rhetorical structure tree. The clustering process is guided by silhouette scores to automatically select the optimal number of clusters at each level.
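The clustering stage described above can be sketched as follows. The Thai semantic-rule similarity measure is not reproducible here, so the example assumes a precomputed pairwise distance matrix (distance = 1 − similarity) and shows only the generic machinery: average-linkage agglomerative clustering, with the silhouette score selecting the cluster count at one level of the tree. The function name `build_tree_level` is illustrative, not from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def build_tree_level(dist, max_k):
    """Cluster EDUs from a precomputed distance matrix.

    Builds an average-linkage dendrogram, then cuts it at each
    candidate cluster count k and keeps the cut with the best
    silhouette score, mirroring the paper's model selection.
    """
    Z = linkage(squareform(dist), method="average")
    best_score, best_k, best_labels = -1.0, None, None
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels

# Toy stand-in for EDU distances: two tight groups, far apart.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(pts[:, None] - pts[None, :])
k, labels = build_tree_level(dist, max_k=4)
```

The dendrogram produced by `linkage` plays the role of the preliminary rhetorical structure tree; each silhouette-selected cut corresponds to one level of sub-tree grouping.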
Third, discourse relations (cause‑effect, contrast, sequence, condition, etc.) are assigned to internal nodes using a decision‑tree classifier. Feature vectors comprise twelve binary or integer attributes extracted from the same semantic rule set, such as tense alignment, particle type, adverb location, and relative clause length. The CART algorithm builds the tree, and cost‑complexity pruning prevents overfitting. Training data consist of 1,200 manually annotated EDU pairs, and 10‑fold cross‑validation demonstrates strong generalization.
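The classification stage can be sketched with scikit-learn, whose `DecisionTreeClassifier` implements CART and exposes cost-complexity pruning directly. The real twelve semantic-rule features and the 1,200 annotated EDU pairs are not available here, so the example uses synthetic integer features as a stand-in; only the pipeline shape (pruning path, alpha selection by 10-fold cross-validation) reflects the paper's description.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the annotated EDU pairs:
# twelve binary/integer attributes, with a synthetic relation label.
X = rng.integers(0, 3, size=(1200, 12))
y = (X[:, 0] + X[:, 1] > 2).astype(int)

# CART exposes its candidate pruning strengths (ccp_alphas);
# 10-fold cross-validation picks the alpha that generalizes best.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    acc = cross_val_score(clf, X, y, cv=10).mean()
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
```

A pruned tree keeps the classifier interpretable, which matches the authors' stated reason for preferring a decision tree over opaque models at this stage.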
Experiments were conducted on three Thai corpora (news, academic articles, and blogs). The system achieved 92.3% EDU boundary accuracy, a Span-F1 of 0.78 for tree structure alignment, and an overall discourse-relation accuracy of 84.7%, outperforming baseline single-HMM and rule-only systems. Error analysis revealed that sentences containing multiple particles with overlapping functions and complex embedded clauses remain challenging.
The authors argue that the dual‑HMM segmentation, rule‑based similarity clustering, and interpretable decision‑tree classification together form a novel, language‑specific RSA pipeline. Limitations include the reliance on manually crafted semantic rules and the absence of automatic rule learning for large‑scale data. Future work will explore integrating Thai‑specific BERT embeddings and reinforcement learning to automate rule induction and further boost relation‑classification performance, as well as extending the methodology to other Southeast Asian languages.