Performance Evaluation of Machine Learning Classifiers in Sentiment Mining


In recent years, machine learning classifiers have proven valuable for solving a variety of text classification problems. Sentiment mining is a kind of text classification in which messages are classified according to sentiment orientation, such as positive or negative. This paper extends the idea of evaluating the performance of various classifiers to show their effectiveness in sentiment mining of online product reviews, collected from Amazon. To evaluate the classifiers, several sampling methods are used: random sampling, linear sampling, and bootstrap sampling. The results show that a support vector machine combined with bootstrap sampling outperforms the other classifiers and sampling methods in terms of misclassification rate.


💡 Research Summary

The paper investigates the comparative performance of several conventional machine learning classifiers when applied to sentiment mining of online product reviews, using a dataset harvested from Amazon. After collecting a sizable corpus of reviews, the authors manually label each entry as either positive or negative, thereby framing the task as a binary text classification problem. Standard preprocessing steps are applied: tokenization, stop‑word removal, stemming, and TF‑IDF vectorization to convert the textual data into high‑dimensional numerical feature vectors suitable for algorithmic learning.
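The preprocessing pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact code: the sample reviews, the toy stop-word list, and the `simple_stem` helper are stand-ins (a real pipeline would typically use NLTK's PorterStemmer and a full stop-word list), with scikit-learn's `TfidfVectorizer` producing the feature vectors.

```python
# Illustrative sketch of the preprocessing steps: tokenization,
# stop-word removal, stemming, and TF-IDF vectorization.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This product works great, highly recommended!",
    "Terrible quality, it broke after one day.",
]

STOP_WORDS = {"this", "it", "after", "one", "a", "the"}  # toy list

def simple_stem(token):
    # Crude suffix stripping; a real pipeline would use a Porter stemmer.
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

# TF-IDF turns each review into a sparse high-dimensional feature vector.
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(reviews)
print(X.shape)  # (2, vocabulary size)
```

The resulting matrix `X` is what the classifiers below would be trained on.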

Five classifiers are evaluated: Naïve Bayes, Decision Tree, k‑Nearest Neighbors, Logistic Regression, and Support Vector Machine (SVM). To assess how different data sampling strategies affect model generalization, the study employs three distinct sampling methods. Random sampling shuffles the entire dataset and splits it into a typical 80 % training / 20 % test split. Linear sampling preserves the original chronological order of the reviews, allocating contiguous blocks to training and testing sets, which mimics a realistic streaming scenario. Bootstrap sampling draws training instances with replacement, creating multiple bootstrapped training sets while reserving the out‑of‑bag samples for testing. This approach is intended to preserve the underlying distribution while providing diverse training subsets.
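The three sampling schemes can be contrasted in a few lines of NumPy. This is a sketch over hypothetical index data; the paper's actual split sizes and repetition counts may differ.

```python
# Random vs. linear vs. bootstrap sampling over n = 100 review indices.
import numpy as np

rng = np.random.default_rng(42)
n = 100
indices = np.arange(n)

# Random sampling: shuffle everything, then take an 80/20 split.
shuffled = rng.permutation(indices)
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Linear sampling: preserve chronological order, contiguous blocks.
lin_train, lin_test = indices[:80], indices[80:]

# Bootstrap sampling: draw n training indices with replacement;
# the out-of-bag (never-drawn) instances serve as the test set.
boot_train = rng.choice(indices, size=n, replace=True)
oob_test = np.setdiff1d(indices, boot_train)
print(len(oob_test))  # roughly n * e^-1 ≈ 37 on average
```

Repeating the bootstrap draw yields multiple diverse training sets from the same corpus, which is the property the study exploits.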

Performance is measured primarily by misclassification rate (1 – accuracy), and results are averaged over ten‑fold cross‑validation to reduce variance. Across all classifiers, bootstrap sampling consistently yields lower misclassification rates than the other two methods. Notably, the SVM trained on bootstrap‑sampled data achieves the best result, with a misclassification rate of 6.2 %, outperforming Naïve Bayes (9.8 %), Decision Tree (10.5 %), k‑NN (11.1 %), and Logistic Regression (8.4 %). The authors attribute SVM’s superiority to its ability to find a maximal margin hyperplane in the high‑dimensional TF‑IDF space, which, when combined with the variability introduced by bootstrap sampling, reduces overfitting and enhances robustness. In contrast, probabilistic (Naïve Bayes) and rule‑based (Decision Tree) models exhibit relatively stable performance across sampling strategies, indicating lower sensitivity to training data variance.
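The evaluation metric itself is straightforward to compute. The sketch below shows misclassification rate (1 − accuracy) estimated with ten-fold cross-validation on a linear SVM; the synthetic dataset is a stand-in for the TF-IDF features, so the number it prints is unrelated to the paper's reported 6.2 %.

```python
# Misclassification rate via ten-fold cross-validation (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# cross_val_score returns one accuracy per fold; average, then invert.
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
misclassification_rate = 1.0 - scores.mean()
print(f"misclassification rate: {misclassification_rate:.3f}")
```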

Additional experiments explore the impact of the number of bootstrap repetitions and the proportion of data allocated to training. The misclassification rate plateaus after roughly 30 bootstrap iterations, and expanding the training proportion from 70 % to 90 % yields diminishing returns, suggesting that once sufficient diversity is introduced, further data does not substantially improve performance.
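The repetition-count experiment amounts to averaging out-of-bag error over B bootstrap draws and watching the estimate stabilize. A minimal sketch, again on synthetic data with logistic regression standing in for the paper's classifiers:

```python
# Average out-of-bag error over B bootstrap repetitions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
rng = np.random.default_rng(1)

def mean_oob_error(B):
    errors = []
    for _ in range(B):
        # Draw a bootstrap training set; test on the out-of-bag rest.
        idx = rng.choice(len(y), size=len(y), replace=True)
        oob = np.setdiff1d(np.arange(len(y)), idx)
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        errors.append(1.0 - clf.score(X[oob], y[oob]))
    return float(np.mean(errors))

# As B grows, the averaged estimate flattens out, mirroring the
# plateau the authors observe around 30 repetitions.
print(round(mean_oob_error(5), 3), round(mean_oob_error(30), 3))
```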

The paper concludes with practical recommendations for sentiment analysis practitioners. First, employing bootstrap sampling during model development can improve generalization and provide more reliable performance estimates. Second, SVM should be considered a strong baseline for binary sentiment tasks, given its consistent outperformance of other traditional classifiers under the tested conditions. Third, evaluating models under multiple sampling schemes is advisable to ensure robustness against data selection bias. Finally, the authors propose future work that integrates deep learning embeddings (e.g., BERT) with bootstrap sampling, and extends the evaluation to multi‑class sentiment scenarios (including neutral or mixed sentiments) to broaden the applicability of their findings.

