Pbm: A new dataset for blog mining

Text mining is becoming vital as Web 2.0 offers collaborative content creation and sharing. Now Researchers have growing interest in text mining methods for discovering knowledge. Text mining researchers come from variety of areas like: Natural Language Processing, Computational Linguistic, Machine Learning, and Statistics. A typical text mining application involves preprocessing of text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, evaluating and interpreting the results. There are numerous approaches for performing text mining tasks, like: clustering, categorization, sentimental analysis, and summarization. There is a growing need to standardize the evaluation of these tasks. One major component of establishing standardization is to provide standard datasets for these tasks. Although there are various standard datasets available for traditional text mining tasks, but there are very few and expensive datasets for blog-mining task. Blogs, a new genre in web 2.0 is a digital diary of web user, which has chronological entries and contains a lot of useful knowledge, thus offers a lot of challenges and opportunities for text mining. In this paper, we report a new indigenous dataset for Pakistani Political Blogosphere. The paper describes the process of data collection, organization, and standardization. We have used this dataset for carrying out various text mining tasks for blogosphere, like: blog-search, political sentiments analysis and tracking, identification of influential blogger, and clustering of the blog-posts. We wish to offer this dataset free for others who aspire to pursue further in this domain.

💡 Research Summary

The paper addresses a critical gap in the field of text mining: the lack of standardized, publicly available datasets for blog mining, especially in the political domain of emerging regions. While numerous corpora exist for news articles, scientific papers, and social media posts, blogs present unique challenges such as long‑form, chronologically ordered narratives, mixed‑language content, and rich metadata that are not captured by existing resources. To fill this void, the authors introduce the “PBM” (Political Blog Mining) dataset, an indigenous collection of Pakistani political blog posts and comments spanning the years 2015‑2020.

Data acquisition was performed using a combination of RSS feed harvesting and a custom web crawler that respected robots.txt directives and employed request throttling to avoid IP bans. The crawling process yielded roughly 12,000 blog posts and 3,500 comments from major platforms (e.g., Blogspot, WordPress) and independent political blogs. After duplicate detection, dead‑link removal, and quality filtering, 9,800 high‑quality documents remained. Each document is stored in a JSON schema containing fields such as post_id, author, timestamp, title, content, tags, category, sentiment, and stance, with comments nested under a separate array. This standardized format enables seamless integration with existing Python‑based text‑mining pipelines, and the authors provide a small library and sample scripts for data loading and preprocessing.

Preprocessing involved HTML tag stripping, UTF‑8 normalization, and special‑character cleaning. Because the corpus mixes Urdu and English, a multilingual morphological analyzer was employed to perform tokenization, stop‑word removal, stemming, and lemmatization for both languages. The authors built language‑specific stop‑word lists and applied Porter stemming for English and a rule‑based stemmer for Urdu, ensuring that lexical variations were minimized without losing semantic nuance.

Annotation was carried out on two dimensions: political stance (conservative, progressive, neutral) and sentiment (positive, negative, neutral). Five domain experts independently labeled each post, and inter‑annotator agreement was measured using Cohen’s κ, achieving a high reliability score of 0.87. The resulting labels are embedded in the JSON files as “stance” and “sentiment” attributes, providing ready‑to‑use ground truth for supervised learning tasks.

To demonstrate the dataset’s utility, the authors conducted four representative experiments:

Blog Search – An inverted index was built using TF‑IDF and BM25 weighting. Evaluation on a set of 200 manually crafted queries yielded a Precision@10 of 0.78 and a Mean Average Precision (MAP) of 0.71, comparable to or exceeding results reported on news‑article corpora.
Political Sentiment Classification – A multilingual BERT‑base model was fine‑tuned on the PBM corpus. The classifier achieved 84 % accuracy and an F1‑score of 0.81 across the three sentiment classes, illustrating that the dataset supports state‑of‑the‑the‑art deep‑learning approaches.
Influential Blogger Identification – The blogosphere was modeled as a directed graph where nodes represent authors and edges capture hyperlink or comment interactions. PageRank and HITS scores were combined into a hybrid influence metric. Analysis revealed a “power‑law” distribution: the top 5 % of bloggers generated 38 % of all posts, and these influencers exhibited distinct stance and sentiment patterns that shape the overall discourse.
Post Clustering – Both K‑means and DBSCAN were applied to TF‑IDF vectors. K‑means with k = 7 produced coherent clusters corresponding to major political topics (election, foreign policy, economy, social welfare, security, culture, education). Topic‑level inspection showed strong alignment between clusters and the annotated stance/sentiment labels, confirming that the corpus captures meaningful thematic structure.

The authors discuss several challenges encountered during dataset creation: handling mixed‑language text, dealing with temporal sparsity (some months have few posts), and normalizing heterogeneous metadata across different blogging platforms. They argue that these challenges are representative of real‑world blog mining scenarios and thus make PBM a valuable benchmark for robust algorithm development.

PBM is released under the GPL‑3.0 license, allowing unrestricted academic and commercial use, modification, and redistribution. The authors invite the research community to contribute additional annotations, extend the corpus to other languages or regions, and develop new tasks such as event detection, stance shift tracking, and real‑time trend analysis. Future work outlined includes integrating streaming data sources, enriching the metadata with user demographics, and exploring causal inference methods to link blog content with offline political outcomes.

In summary, the paper delivers a thoroughly documented, openly accessible dataset that fills a notable void in blog‑centric text mining research. By providing detailed collection methodology, high‑quality multilingual preprocessing, reliable dual‑dimensional annotations, and baseline experimental results, the PBM dataset positions itself as a standard benchmark for evaluating and comparing blog mining algorithms across search, sentiment analysis, influence modeling, and clustering tasks.