Non-Standard Words as Features for Text Categorization


This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency expressions, and similar tokens. NSWs in Croatian are identified according to a Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection, which comprises six classes: official, literary, informative, popular, educational, and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second, statistical measures of the NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third combines the first two feature sets. The Naive Bayes, CN2, C4.5, kNN, Classification Tree, and Random Forest algorithms were used in the experiments. The best results are achieved with the first feature set (NSW frequencies), reaching a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs representation should be considered in further Croatian text categorization experiments.


💡 Research Summary

The paper investigates whether non‑standard words (NSWs)—such as numbers, dates, acronyms, abbreviations, currency symbols, and other token types that are typically excluded from standard lexical analysis—can serve as effective features for categorizing Croatian texts. Recognizing that Croatian is a highly inflectional language where conventional lemmatization and stemming can be computationally expensive and error‑prone, the authors propose a “bag‑of‑NSWs” approach that bypasses full morphological processing.

First, the authors construct a taxonomy of Croatian NSWs, defining ten broad categories (numeric, temporal, monetary, acronyms, abbreviations, symbols, etc.). Using this taxonomy, they automatically extract NSWs from a newly compiled corpus called SKIPEZ, which consists of 390 documents evenly distributed across six genre classes: official, literary, informative, popular, educational, and scientific. Each document is manually labeled according to its genre, providing a balanced testbed for classification experiments.
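The taxonomy-driven extraction step can be sketched with a few regular expressions. The patterns below are illustrative assumptions, not the paper's actual taxonomy, which defines ten categories and is considerably more elaborate; a real extractor would also resolve overlaps between categories (e.g., the numeric part of a currency expression also matching the number pattern).

```python
import re

# Hypothetical patterns for four NSW categories (the paper's Croatian
# taxonomy has ten); dates and currency follow common Croatian conventions.
NSW_PATTERNS = {
    "number":   re.compile(r"\b\d+(?:[.,]\d+)?\b"),
    "date":     re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{2,4}\.?"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|€|\$)"),
    "acronym":  re.compile(r"\b[A-ZČĆĐŠŽ]{2,}\b"),
}

def extract_nsws(text):
    """Return a dict mapping each NSW category to its matches in `text`."""
    return {cat: pat.findall(text) for cat, pat in NSW_PATTERNS.items()}

# Example Croatian sentence with an acronym, a date, and a currency amount.
sample = "HNB je 12.5.2023. objavio da je cijena 1500 kn."
print(extract_nsws(sample))
```

Running the extractor over a document yields per-category match lists whose lengths become the frequency features described next.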

Three distinct feature representations are explored. The primary representation (Feature Set 1) records the raw frequency of each NSW type within a document, yielding a low‑dimensional vector (on the order of a few dozen dimensions). The second representation (Feature Set 2) computes statistical descriptors—mean, variance, standard deviation, coefficient of variation, etc.—for the distribution of NSWs, aiming to capture variability rather than absolute counts. The third representation (Feature Set 3) concatenates the two previous sets, forming a hybrid feature space.
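Under the assumption that each document is first reduced to per-category NSW counts, the three representations can be sketched as follows; the exact set of statistical descriptors used in the paper may differ from the four computed here.

```python
import numpy as np

def frequency_features(nsw_counts, categories):
    """Feature Set 1: raw frequency of each NSW category in the document."""
    return np.array([nsw_counts.get(c, 0) for c in categories], dtype=float)

def statistical_features(freq_vec):
    """Feature Set 2: statistical descriptors of the NSW frequency distribution."""
    mean = freq_vec.mean()
    std = freq_vec.std()
    cv = std / mean if mean > 0 else 0.0  # coefficient of variation
    return np.array([mean, freq_vec.var(), std, cv])

categories = ["number", "date", "currency", "acronym"]
counts = {"number": 3, "date": 1, "acronym": 2}  # toy document counts

fs1 = frequency_features(counts, categories)  # Feature Set 1
fs2 = statistical_features(fs1)               # Feature Set 2
fs3 = np.concatenate([fs1, fs2])              # Feature Set 3: hybrid
```

With the full taxonomy, Feature Set 1 stays on the order of a few dozen dimensions, which is what keeps this representation so compact compared with a bag-of-words model.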

Six well‑known classification algorithms are employed: Naïve Bayes, CN2 rule learner, C4.5 decision tree, k‑Nearest Neighbors, a generic Classification Tree, and Random Forest. All experiments use 10‑fold cross‑validation, and performance is measured via accuracy, precision, recall, and F1‑score.
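The experimental protocol can be approximated with scikit-learn, which is an assumption here rather than the paper's toolchain: CN2 and the generic Classification Tree have no direct scikit-learn equivalent, and the data below is a synthetic stand-in for the real SKIPEZ feature vectors.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 390 documents x 10 NSW frequency features, 6 classes.
rng = np.random.default_rng(0)
X = rng.poisson(3, size=(390, 10)).astype(float)
y = rng.integers(0, 6, size=390)

# scikit-learn analogues of four of the six algorithms used in the paper.
classifiers = {
    "Naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

On the random data above the accuracies hover near chance (about 1/6); on the real frequency features the paper reports up to 87%.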

Results show that the frequency‑based feature set consistently outperforms the other two. The best accuracy, 87%, is achieved by both Naïve Bayes and Random Forest when using Feature Set 1. Feature Set 2 yields a lower average accuracy of about 78%, while the combined Feature Set 3 reaches an intermediate 84%. Notably, the dimensionality of the NSW‑based vectors remains extremely small compared with traditional bag‑of‑words models (which often involve thousands of dimensions), leading to faster training and inference without sacrificing classification quality.

The authors interpret these findings as evidence that NSWs encode strong genre‑specific signals in Croatian texts. For instance, scientific articles contain a higher proportion of technical abbreviations and measurement units, while popular media feature more dates and monetary figures. Because NSWs are largely independent of inflectional morphology, they avoid the cascading errors that can arise from imperfect lemmatization pipelines. Consequently, the bag‑of‑NSWs approach offers a lightweight yet powerful alternative for text categorization in highly inflected languages.

Limitations of the study include the modest size of the SKIPEZ corpus and the focus on a single language, which restricts the generalizability of the conclusions. Additionally, the manual effort required to build the NSW taxonomy suggests a need for automated NSW detection methods. Future work is suggested in three directions: (1) scaling the approach to larger, multilingual corpora to test cross‑lingual robustness; (2) integrating NSW features with deep neural architectures (e.g., CNNs or Transformers) to assess complementary benefits; and (3) developing semi‑automatic or fully automatic NSW extraction tools to reduce the reliance on handcrafted taxonomies.

In summary, the paper demonstrates that non‑standard words constitute a compact, informative feature set for genre classification in Croatian, achieving high accuracy while dramatically reducing feature space dimensionality and preprocessing overhead. This insight encourages further exploration of NSW‑based representations for other highly inflectional languages and for applications where computational efficiency and robustness to morphological complexity are paramount.

