Feature Selection Based on Term Frequency and T-Test for Text Categorization
Much work has been done on feature selection. Existing methods are based on document frequency, such as the Chi-Square Statistic, Information Gain, etc. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. Actually, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach based on the $t$-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., $\chi^2$ and IG) in terms of macro-$F_1$ and micro-$F_1$.
💡 Research Summary
The paper addresses a fundamental limitation of most traditional feature‑selection methods for text categorization—namely, their reliance on document frequency (DF) alone. Methods such as χ², Information Gain (IG), Mutual Information (MI) and Expected Cross‑Entropy (ECE) treat a term merely as “present or absent” in a document, ignoring how often the term actually occurs. This leads to two problems: (1) low‑frequency terms are statistically unstable, and (2) high‑frequency terms that are strong discriminators for a particular class are not adequately rewarded.
To overcome these issues, the authors propose a novel feature‑selection approach that exploits term frequency (TF) and the Student’s t‑test. The key insight is that, under a multinomial model of text generation, the average TF of a term across documents follows, by the Lindeberg‑Levy Central Limit Theorem, an approximately normal distribution when the number of documents is large. Consequently, the mean TF of a term within a specific class (tfₖᵢ) and the mean TF of the same term over the whole corpus (tfᵢ) can be compared using a t‑statistic:
t(tᵢ, Cₖ) = |tfₖᵢ – tfᵢ| / (√(1/Nₖ – 1/N)·sᵢ),
where Nₖ is the number of documents in class Cₖ, N is the total number of documents, and sᵢ is an estimate of the within‑class standard deviation. A large t‑value indicates that the term’s frequency distribution in class Cₖ differs significantly from its overall distribution, suggesting strong discriminative power.
Two aggregation schemes are examined: (1) the sum of t‑values over all classes (t‑test_avg) and (2) the maximum t‑value across classes (t‑test_max). Empirically, the summed version consistently outperforms the max version for multi‑class problems, so the paper focuses on t‑test_avg.
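The per-term scoring described above can be sketched in a few lines. This is an illustrative toy implementation, not the authors' code: the corpus data and function name are hypothetical, and sᵢ is estimated here as the corpus-wide sample standard deviation of the term's TF, one plausible reading of "within-class standard deviation estimate".

```python
import math

# Hypothetical toy data: per-document term-frequency counts of ONE term t_i,
# grouped by class. The term occurs often in class C1 and rarely elsewhere.
corpus = {
    "C1": [3, 4, 5, 3, 4],
    "C2": [0, 1, 0, 0, 1],
    "C3": [1, 0, 1, 1, 0],
}

def t_scores(class_tfs):
    """Per-class t-statistic t(t_i, C_k) for one term, per the formula above."""
    all_tfs = [tf for tfs in class_tfs.values() for tf in tfs]
    n = len(all_tfs)                       # N: total number of documents
    mean_all = sum(all_tfs) / n            # corpus-wide mean TF of the term
    # s_i: sample standard deviation of the term's TF over the whole corpus
    # (an assumption for this sketch).
    s = math.sqrt(sum((x - mean_all) ** 2 for x in all_tfs) / (n - 1))
    scores = {}
    for c, tfs in class_tfs.items():
        n_k = len(tfs)                     # N_k: documents in class C_k
        mean_k = sum(tfs) / n_k            # class-conditional mean TF
        # |tf_ki - tf_i| / (sqrt(1/N_k - 1/N) * s_i); note the minus sign,
        # since class C_k is a subset of the corpus.
        scores[c] = abs(mean_k - mean_all) / (math.sqrt(1 / n_k - 1 / n) * s)
    return scores

scores = t_scores(corpus)
t_avg = sum(scores.values())   # t-test_avg: sum of t-values over all classes
t_max = max(scores.values())   # t-test_max: maximum t-value across classes
```

On this toy data the term scores highest for C1, where its frequency departs most from the corpus-wide mean; ranking terms by `t_avg` (or `t_max`) and keeping the top-scoring ones yields the selected feature set.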
The methodology was evaluated on two benchmark corpora: Reuters‑21578 (52 categories, highly imbalanced) and 20 Newsgroups (20 balanced categories). Three well‑known classifiers—linear Support Vector Machines (LIBSVM), weighted k‑Nearest Neighbours (k=10), and a centroid‑based classifier—were used. For each classifier, the authors varied the number of selected features (from a few thousand up to the full vocabulary) and measured macro‑F₁ and micro‑F₁.
Results on the imbalanced Reuters data showed that t‑test_avg achieved the highest macro‑F₁ when the feature set size was between 8 000 and 13 000, and its best micro‑F₁ (89.8 % at 4 000 features) exceeded χ² by 4.2 percentage points. IG performed the worst across all feature sizes, while MI was generally inferior to χ², ECE, and the proposed method. On the balanced 20 Newsgroups data, the performance gap among methods narrowed; χ² and IG slightly outperformed t‑test_avg, but all four statistical methods (χ², IG, ECE, t‑test) were markedly better than MI. A case study on the “acq” category illustrated that the t‑test correctly identified domain‑relevant terms (“acquir”, “stake”), whereas χ² and ECE mistakenly promoted an unrelated high‑frequency term (“dividend”).
The authors discuss several limitations. The normal‑approximation assumption may break down for very small classes or extremely rare terms, potentially destabilizing the t‑statistic. The choice of the threshold θ for selecting features is data‑dependent and currently set empirically; automated tuning would be desirable. Moreover, the approach has not been tested on multi‑label documents, very large vocabularies, or in conjunction with modern deep‑learning embeddings, which are promising directions for future work.
In summary, the paper introduces a statistically grounded, TF‑aware feature‑selection technique that complements or surpasses traditional DF‑based methods, especially in scenarios with class imbalance. By leveraging the t‑test to quantify differences in term‑frequency distributions, the method captures discriminative information that document‑frequency metrics overlook, offering a practical tool for improving text categorization pipelines.