Improving Persian Document Classification Using Semantic Relations between Words


With the growth of available information, document classification, as one of the core methods of text mining, plays a vital role in managing and organizing information. Document classification is the process of assigning a document to one or more predefined category labels. It comprises several stages, including text preprocessing, term selection, term weighting, and the final classification step. Because the accuracy of document classification is critical, an improvement in any of these stages should lead to better results and higher precision. Term weighting in particular has a great impact on classification accuracy. Most existing weighting methods exploit only the statistical information of terms in documents and do not consider semantic relations between words. In this paper, an automated document classification system is presented that uses a novel term weighting method based on semantic relations between terms. To evaluate the proposed method, three standard Persian corpora are used. Experimental results show a 2 to 4 percent improvement in classification accuracy compared with the best previously designed system for Persian documents.


💡 Research Summary

Document classification is a fundamental task in text mining, assigning each document to one or more predefined categories. The overall classification pipeline typically consists of text preprocessing, term selection, term weighting, and the final classification step. Among these, term weighting plays a pivotal role because it determines how strongly each term influences the classifier’s decision. Traditional weighting schemes such as TF‑IDF, BM25, and chi‑square rely solely on statistical information (term frequency, inverse document frequency) and ignore any semantic relationships between words. This omission is especially problematic for languages like Persian, where rich morphology, extensive synonymy, and hierarchical lexical relations are common.

The paper addresses this gap by proposing a novel term‑weighting method that explicitly incorporates semantic relations among words. The authors first construct or adopt a Persian lexical resource analogous to WordNet, which provides synonym, hypernym, hyponym, and similarity links between terms. For each term t, a set Rel(t) of semantically related terms is extracted, and a similarity score Sim(t, s) quantifies the closeness between t and each related term s (e.g., based on path length in the lexical hierarchy or vector‑based cosine similarity). The new weight for term t in document d is defined as:

W(t,d) = TF(t,d) × IDF(t) × (1 + α × Σ_{s∈Rel(t)} Sim(t,s) × IDF(s))

Here, TF(t,d) is the raw frequency of t in d, IDF(t) is the inverse document frequency, α is a hyper‑parameter controlling the strength of semantic reinforcement, and the summation aggregates contributions from all related terms. In effect, if a term’s semantic neighbors appear frequently across the corpus, the term’s weight is boosted, allowing the classifier to capture higher‑level meaning that pure statistics would miss.
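The weight formula above can be illustrated with a short sketch. The toy corpus, the relation table `related`, and the similarity scores below are invented placeholders, not the paper's actual Persian lexical resource or its Sim(t, s) definition:

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (assumed; stands in for a Persian corpus).
docs = [
    ["car", "vehicle", "road"],
    ["car", "engine"],
    ["book", "library"],
]
# Hypothetical Rel(t) with Sim(t, s) scores, standing in for a WordNet-like resource.
related = {"car": {"vehicle": 0.8, "engine": 0.5}}

N = len(docs)

def idf(term):
    """Inverse document frequency: log(N / df)."""
    df = sum(term in d for d in docs)
    return math.log(N / df) if df else 0.0

def semantic_weight(term, doc, alpha=0.25):
    """W(t,d) = TF(t,d) * IDF(t) * (1 + alpha * sum_{s in Rel(t)} Sim(t,s) * IDF(s))."""
    tf = Counter(doc)[term]
    boost = sum(sim * idf(s) for s, sim in related.get(term, {}).items())
    return tf * idf(term) * (1 + alpha * boost)
```

A term with frequent semantic neighbors (here, "car") receives a larger weight than plain TF-IDF would give it, while a term with no entries in the relation table falls back to standard TF-IDF.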

To evaluate the approach, the authors use three standard Persian corpora covering news articles, academic papers, and web blog posts. Each corpus contains roughly 10,000 documents evenly distributed across ten categories, yielding a total of about 30,000 labeled texts. They experiment with several well‑known classifiers—Support Vector Machines, Naïve Bayes, and Random Forests—under a 5‑fold cross‑validation regime, keeping preprocessing and model settings identical across baseline and proposed methods.
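The paper's experimental code is not reproduced here, but the 5-fold cross-validation regime it describes can be sketched in pure Python. `kfold_indices` is a hypothetical helper that partitions document indices into five disjoint folds, so each document is tested exactly once while preprocessing and model settings stay fixed across folds:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Each fold serves as the test set once; the rest form the training set.
for train, test in kfold_indices(10, k=5):
    pass  # fit classifier on train, evaluate on test
```

In a real comparison, the same splits would be reused for both the baseline TF-IDF weighting and the semantic-enhanced weighting, so that accuracy differences reflect the weighting scheme alone.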

Results consistently show that the semantic‑enhanced weighting yields a 2 %–4 % absolute increase in classification accuracy over the best previously reported system, which relied only on statistical weighting. The most pronounced improvement (≈4.2 % gain) occurs on the academic corpus, where domain‑specific terminology and synonymy are abundant. News and web corpora still benefit, with gains of about 2.5 % and 3.1 % respectively. Sensitivity analysis of α reveals that values between 0.2 and 0.3 provide the optimal trade‑off; setting α to zero reduces the method to standard TF‑IDF, while overly large α (>0.5) introduces noise and degrades performance.
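The claim that setting α to zero reduces the method to standard TF-IDF follows directly from the weight formula, since the semantic factor becomes (1 + 0) = 1. A toy sweep with invented TF, IDF, and boost values makes the role of α concrete:

```python
# Invented stand-ins for TF(t,d), IDF(t), and the semantic sum
# sum_{s in Rel(t)} Sim(t,s) * IDF(s); real values come from the corpus.
tf, idf_t = 3, 1.2
semantic_sum = 0.9

def weight(alpha):
    """W(t,d) for a fixed term/document pair as a function of alpha."""
    return tf * idf_t * (1 + alpha * semantic_sum)

for alpha in [0.0, 0.1, 0.2, 0.3, 0.5]:
    print(f"alpha={alpha:.1f} -> W = {weight(alpha):.3f}")
```

The weight grows linearly in α, which matches the reported behavior: moderate values (0.2 to 0.3) reinforce semantically supported terms, while large values let the semantic term dominate the statistical evidence.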

The study’s contributions are threefold: (1) it introduces a systematic way to embed lexical semantics into term weighting, (2) it validates the method on real Persian datasets, demonstrating that semantic reinforcement is especially effective in domains with rich synonymy, and (3) it offers practical guidance on tuning the semantic reinforcement parameter. The authors also discuss computational considerations: building the Persian lexical network requires an upfront cost (approximately 12 hours of processing on a standard workstation), but once constructed, the additional overhead during weighting is negligible, making the approach feasible for large‑scale applications.

Future work suggested includes extending the framework to multilingual settings, integrating deep‑learning‑based embeddings (e.g., BERT, FastText) to capture more nuanced semantic relations, and optimizing the algorithm for real‑time streaming environments. Such extensions could broaden the impact of semantic‑aware weighting beyond document classification to tasks like sentiment analysis, topic modeling, and information retrieval, where understanding word meaning is equally critical.

