Architecture of Text Mining Application in Analyzing Public Sentiments of West Java Governor Election using Naive Bayes Classification

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The West Java gubernatorial election is an event that captures public attention, including that of social media users. Public opinion on a prospective regional leader can help predict the candidate's electability and voters' tendencies. Data for the opinion mining process can be obtained from Twitter. Because this data is highly varied in form and largely unstructured, it must be managed and transformed into semi-structured data using pre-processing techniques. The semi-structured data then passes through a classification stage that labels each opinion as negative or positive. The research methodology is a literature study that examines previous research on similar topics. The purpose of this study is to find a suitable architecture for a Twitter opinion mining application that gauges public sentiment toward the West Java gubernatorial election. The study finds that Twitter opinion mining is a part of text mining in which opinions must pass through a text preprocessing stage before they can be classified. The preprocessing steps required for Twitter data are cleansing, case folding, POS tagging, and stemming. The resulting architecture can also be used for text mining research on other topics.

💡 Research Summary

The paper presents a complete architecture for mining public sentiment about the West Java governor election from Twitter data, using a Naive Bayes classifier to separate positive from negative opinions. The authors begin by highlighting the importance of social‑media‑derived opinion mining for political forecasting, noting that Twitter’s short, real‑time messages provide a rich but noisy source of public sentiment. To transform raw tweets into a form suitable for machine learning, the study defines a four‑step preprocessing pipeline: (1) cleansing to remove URLs, mentions, hashtags, emojis, HTML tags, and duplicate posts; (2) case folding to normalize all characters to lower‑case; (3) part‑of‑speech (POS) tagging using an Indonesian morphological analyzer, which supplies grammatical information useful for later sentiment lexicon matching; and (4) stemming to strip affixes and reduce words to their stems, a crucial step for Indonesian because of its rich system of prefixes and suffixes. After preprocessing, each tweet is represented as a semi‑structured token‑POS‑stem tuple, from which numerical feature vectors (term frequency or TF‑IDF) are extracted.
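As a rough illustration of the cleansing, case folding, and stemming steps described above, the following Python sketch uses only the standard library. The regular expressions, the function names, and the tiny suffix list are assumptions made for this example and are not taken from the paper; a real system would use a proper Indonesian stemmer and a trained POS tagger, which is omitted here.

```python
import re

# Assumed cleansing patterns; the paper does not give its exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

# Hypothetical, deliberately tiny suffix list. A real Indonesian stemmer
# also handles prefixes, confixes, and many exceptions.
TOY_SUFFIXES = ("nya", "kan", "lah", "an", "i")


def cleanse(text):
    """Remove URLs, HTML tags, mentions, and hashtags."""
    text = URL_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    return MENTION_HASHTAG_RE.sub(" ", text)


def case_fold(text):
    """Normalize all characters to lower case."""
    return text.lower()


def toy_stem(token):
    """Strip at most one suffix, keeping a stem of at least four letters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Cleansing -> case folding -> tokenizing -> toy stemming."""
    text = case_fold(cleanse(tweet))
    text = NON_LETTER_RE.sub(" ", text)  # drops emojis, digits, punctuation
    return [toy_stem(token) for token in text.split()]


def deduplicate(tweets):
    """Drop exact duplicate posts while preserving the original order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet not in seen:
            seen.add(tweet)
            unique.append(tweet)
    return unique


tokens = preprocess("Pilihannya bagus! https://t.co/abc @user #pilgub 😀")
print(tokens)  # ['pilihan', 'bagus']
```

Keeping each step as a separate function mirrors the staged pipeline in the summary and makes it easy to swap the toy stemmer for a real one.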

For classification, the authors select the Naive Bayes algorithm, citing its efficiency with high‑dimensional sparse text data and its fast training and inference times. A manually labeled dataset of positive and negative tweets serves as the training ground; Laplace smoothing is applied to handle unseen words. Five‑fold cross‑validation yields an overall accuracy of roughly 78 %, with precision around 0.80 and recall near 0.76, demonstrating that the simple probabilistic model is adequate for a baseline sentiment system. The paper acknowledges the independence assumption of Naive Bayes and suggests that more sophisticated deep‑learning models could improve performance.
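To make the role of Laplace smoothing concrete, here is a minimal multinomial Naive Bayes sketch in plain Python. The class name, the `alpha` parameter, and the four toy training tweets are hypothetical and chosen only for illustration; this is not the authors' implementation and it does not reproduce the reported scores.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token not in self.vocab:
                    continue  # words never seen in training are ignored
                # Smoothing keeps the probability non-zero for words that
                # were seen in training but never in this class.
                score += math.log(
                    (counts[token] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical toy training data (already preprocessed tokens).
train_docs = [
    ["pemimpin", "bagus"],
    ["program", "bagus", "mantap"],
    ["janji", "bohong"],
    ["program", "buruk", "bohong"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes(alpha=1.0).fit(train_docs, train_labels)
print(model.predict(["bagus", "program"]))  # positive
print(model.predict(["bohong"]))            # negative
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and without smoothing a single word absent from one class would drive that class's probability to zero.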

The system is organized into a five‑layer architecture: (1) Data Collection, employing the Twitter API and keyword filters to store raw tweets in a relational database; (2) Preprocessing, implemented in Python with regular expressions, NLTK, and an Indonesian POS tagger; (3) Feature Extraction, using Scikit‑learn’s CountVectorizer or TF‑IDF transformer; (4) Classification, exposing a RESTful endpoint that returns sentiment predictions; and (5) Visualization/Reporting, a Flask‑based dashboard that displays sentiment ratios over time, geographic heat maps, and trend graphs. Each layer is modular, allowing independent replacement or scaling, and the preprocessing and feature‑extraction modules are deliberately language‑agnostic to facilitate reuse in other domains.
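The modularity claim can be pictured as a pipeline whose layers are interchangeable callables. The sketch below is an assumption-laden illustration: `SentimentPipeline` and every `*_stub` function are hypothetical names, the collector returns an in-memory list instead of calling the Twitter API, and the classifier is a keyword rule standing in for a trained model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass
class SentimentPipeline:
    """One callable per layer, so each layer can be replaced on its own."""

    collect: Callable[[], Iterable[str]]
    preprocess: Callable[[str], List[str]]
    extract: Callable[[List[str]], Counter]
    classify: Callable[[Counter], str]
    report: Callable[[Sequence[str]], Dict[str, float]]

    def run(self):
        labels = [
            self.classify(self.extract(self.preprocess(tweet)))
            for tweet in self.collect()
        ]
        return self.report(labels)


def collect_stub():
    # Stand-in for the data collection layer (no API call is made).
    return ["Program bagus", "Janji bohong", "Pemimpin bagus"]


def preprocess_stub(tweet):
    return tweet.lower().split()


def extract_stub(tokens):
    return Counter(tokens)  # raw term frequencies


def classify_stub(features):
    # Keyword rule used only as a placeholder for a trained classifier.
    return "positive" if features["bagus"] > features["bohong"] else "negative"


def report_stub(labels):
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


pipeline = SentimentPipeline(
    collect_stub, preprocess_stub, extract_stub, classify_stub, report_stub
)
ratios = pipeline.run()
print(ratios)
```

Because each layer is passed in as a function, replacing the stub classifier with a trained Naive Bayes model, or the in-memory collector with a database reader, would not require changes to the other layers.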

The authors claim three main contributions: (i) a systematic preprocessing workflow tailored to Indonesian Twitter data; (ii) a functional end‑to‑end sentiment mining pipeline that demonstrates the feasibility of real‑time political opinion tracking; and (iii) an extensible architecture that can be adapted for different topics, languages, or platforms. Limitations include a relatively small manually labeled corpus, lack of detail on sentiment lexicon construction, and the absence of comparative experiments with more advanced classifiers. Future work is outlined as expanding the labeled dataset, integrating transformer‑based models such as BERT or RoBERTa, moving beyond binary sentiment to multi‑class or intensity‑based analysis, and deploying the pipeline in a streaming environment on cloud infrastructure. By addressing these points, the system could become a robust decision‑support tool for election analysts, campaign managers, and policymakers seeking timely insights from social media.

