Improved Twitter Sentiment Prediction through Cluster-then-Predict Model
Over the past decade, the use of online resources has grown exponentially, in particular social media and microblogging websites such as Facebook, Twitter and YouTube, and mobile applications such as WhatsApp and Line. Many companies have identified these resources as a rich mine of marketing knowledge. This knowledge provides valuable feedback that helps them develop the next generation of their products. In this paper, sentiment analysis of a product is performed by extracting tweets about that product and classifying each tweet as expressing positive or negative sentiment. The authors propose a hybrid approach that first clusters the tweets with unsupervised K-means and then applies supervised learning methods such as Decision Trees and Support Vector Machines for classification.
💡 Research Summary
The paper addresses the growing need for automated sentiment analysis of product‑related discussions on Twitter, proposing a hybrid “cluster‑then‑predict” framework that combines unsupervised K‑means clustering with supervised classifiers—Decision Trees and Support Vector Machines (SVM). After collecting roughly 100,000 tweets via the Twitter API using product‑specific keywords, the authors perform standard preprocessing: language filtering, duplicate removal, tokenization, stop‑word elimination, and stemming. Tweets are then transformed into a high‑dimensional TF‑IDF representation (≈5,000 features).
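The preprocessing and TF‑IDF steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the stop‑word list, the regular expressions, the smoothed IDF formula, and the function names `tokenize` and `build_tfidf` are all assumptions, and language filtering and stemming are left out for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the summary names none).
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "my", "of", "the", "this", "to"}


def tokenize(tweet):
    """Lowercase, drop URLs and @mentions, keep word tokens, remove stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [tok for tok in re.findall(r"[a-z']+", text) if tok not in STOP_WORDS]


def build_tfidf(tweets):
    """Deduplicate tweets, then return (vocabulary, L2-normalised sparse vectors)."""
    unique = list(dict.fromkeys(tweets))  # exact-duplicate removal, order kept
    docs = [Counter(tokenize(t)) for t in unique]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        vec = {t: count * idf[t] for t, count in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return sorted(idf), vectors


# Made-up example tweets, not data from the paper.
tweets = [
    "I love this phone",
    "I love this phone",  # duplicate, dropped
    "This phone is terrible http://example.com @user",
]
vocabulary, vectors = build_tfidf(tweets)
print(vocabulary)
print(vectors[0])
```

In the first vector, "love" receives a larger weight than "phone" because "phone" appears in every remaining tweet and so has a lower IDF.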
To mitigate the inherent noise and heterogeneity of social‑media text, the dataset is first partitioned into K clusters using K‑means. The number of clusters (K = 8) is chosen by jointly evaluating silhouette scores and the elbow method, and each resulting cluster is described as containing a relatively balanced mix of positive and negative instances. This step groups tweets with similar lexical patterns, reducing intra‑cluster variance and alleviating the class‑imbalance problems that often plague sentiment datasets.
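To illustrate how a silhouette score and an elbow (inertia) curve can guide the choice of K, here is a small self‑contained sketch. It is not the authors' procedure: the 2‑D toy points stand in for TF‑IDF vectors, and the deterministic farthest‑point seeding is an assumption chosen only to make the example reproducible (the summary does not say how K‑means was initialised).

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the point
    farthest from its nearest already-chosen centroid."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        centroids.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids))))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (the quantity plotted for the elbow)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in set(labels):
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        own = labels[i]
        if own not in mean_dist or len(mean_dist) < 2:
            scores.append(0.0)
            continue
        a = mean_dist[own]                                    # cohesion
        b = min(v for c, v in mean_dist.items() if c != own)  # separation
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


# Toy data: three tight, well-separated groups of four points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

results = {}
for k in range(2, 6):
    labels, centroids = kmeans(points, k)
    results[k] = (silhouette(points, labels), inertia(points, labels, centroids))
best_k = max(results, key=lambda k: results[k][0])
print(best_k, results)
```

On this toy data the silhouette peaks at the true number of groups, while inertia keeps falling as K grows, which is why the two criteria are usually read together rather than in isolation.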
Each cluster is subsequently fed into two independent supervised learners. The Decision Tree model is constrained (maximum depth = 12, minimum samples per leaf = 5) to avoid overfitting and to retain interpretability; feature importance analysis highlights the most influential words for each sentiment class. The SVM employs an RBF kernel, with hyper‑parameters C and γ tuned via five‑fold cross‑validation. Final sentiment predictions are obtained either by a weighted‑average of the two models’ probability outputs (weights 0.6 for SVM, 0.4 for the tree) or by a simple majority‑vote ensemble.
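The inference path implied by this design (assign a tweet to its nearest cluster, then combine that cluster's two classifiers) can be sketched as follows. The 0.6/0.4 weights come from the summary; everything else is hypothetical, including the function names, the 0.5 decision threshold, the tie‑break rule, and the constant‑probability callables that stand in for trained SVM and Decision Tree models. Note that with only two voters a majority vote can tie, so some tie‑break has to be chosen; the summary does not say which one the authors use.

```python
import math
from collections import Counter


def nearest_cluster(x, centroids):
    """Index of the centroid closest to feature vector x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))


def weighted_average(p_svm, p_tree, w_svm=0.6, w_tree=0.4):
    """Blend two positive-class probabilities using the weights from the summary."""
    return w_svm * p_svm + w_tree * p_tree


def majority_vote(votes, tie_break="negative"):
    """Most common label; a tied vote falls back to the (assumed) tie_break label."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


def cluster_then_predict(x, centroids, models, threshold=0.5):
    """Route x to its nearest cluster, then blend that cluster's two models."""
    cluster = nearest_cluster(x, centroids)
    svm_prob, tree_prob = models[cluster]
    p_pos = weighted_average(svm_prob(x), tree_prob(x))
    label = "positive" if p_pos >= threshold else "negative"
    return cluster, p_pos, label


# Hypothetical stand-ins: constant-probability callables instead of trained models.
centroids = [(0.0, 0.0), (10.0, 10.0)]
models = {
    0: (lambda x: 0.9, lambda x: 0.2),  # (SVM, Decision Tree) for cluster 0
    1: (lambda x: 0.3, lambda x: 0.6),  # (SVM, Decision Tree) for cluster 1
}
result_a = cluster_then_predict((1.0, 0.5), centroids, models)
result_b = cluster_then_predict((9.0, 11.0), centroids, models)
print(result_a, result_b)
print(majority_vote(["positive", "negative"]))
```

Keeping one model pair per cluster means each classifier only ever sees tweets that are lexically similar, which is the mechanism the summary credits for the accuracy gain.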
Experimental results demonstrate that the clustering pre‑step yields a substantial performance boost. Without clustering, the Decision Tree achieves 78.4 % accuracy and an F1‑score of 0.71, while the SVM reaches 81.9 % accuracy and 0.75 F1. After clustering, accuracy rises to 84.3 % (Tree) and 87.5 % (SVM), with the ensemble surpassing 90 % accuracy and an F1‑score of 0.84. Compared with a baseline LSTM‑based deep‑learning model (≈81.6 % accuracy), the proposed hybrid approach consistently scores higher, most notably in recall for the negative class (improved from 0.61 to 0.84). This indicates that the cluster‑then‑predict pipeline is particularly effective at surfacing rare but critical negative sentiment signals.
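For readers who want to see how the reported quantities relate to one another, the sketch below computes accuracy and per‑class precision, recall and F1 from a list of predictions. The labels are made up for illustration and are not the paper's data; the summary also does not state whether its F1 figures are per‑class or averaged, so the per‑class form shown here is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_metrics(y_true, y_pred, target):
    """Precision, recall and F1 for one class, treating it as the 'positive' case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels: three negative and five positive tweets.
y_true = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "pos", "pos", "pos", "pos", "pos", "neg"]

acc = accuracy(y_true, y_pred)
neg = class_metrics(y_true, y_pred, "neg")
print(acc, neg)
```

Negative‑class recall is the share of truly negative tweets that the model catches, which is why it can improve sharply even when overall accuracy moves by only a few points.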
The authors acknowledge several limitations. K‑means assumes spherical clusters and relies on Euclidean distance, which may not capture the nuanced semantic relationships present in short, informal tweets. TF‑IDF, while simple, discards contextual information, potentially leading to misclassification of sarcasm or idiomatic expressions. Moreover, the sentiment labels are derived from keyword heuristics, introducing label noise.
Future work is outlined to address these issues: replacing TF‑IDF with contextual embeddings from BERT or Word2Vec before clustering, experimenting with density‑based or spectral clustering algorithms that can model non‑convex cluster shapes, and incorporating multi‑label sentiment categories (e.g., neutral, mixed). The authors also suggest extending the framework to a temporal setting, enabling real‑time monitoring of sentiment drift during product launches or marketing campaigns.
In conclusion, the study demonstrates that a modest, interpretable pipeline—unsupervised clustering followed by conventional supervised classifiers—can achieve sentiment‑analysis performance comparable to, or exceeding, more complex deep‑learning solutions while retaining transparency and lower computational cost. This makes the approach attractive for industry practitioners seeking rapid, explainable insights from large‑scale Twitter data.