Classifying and Ranking Microblogging Hashtags with News Categories

In microblogging, hashtags serve as topical markers, and they are adopted by users who contribute similar content or express a related idea. However, hashtags are created freely and carry no domain category information, which makes it hard for users to access an organized presentation of hashtags. In this paper, we propose an approach that classifies hashtags with news categories and then carries out a domain-sensitive popularity ranking to find the hot hashtags in each domain. The proposed approach first trains a domain classification model with news content and news category information, then detects the microblogs related to a hashtag and uses them as its representative text, based on which the hashtag is classified into a domain. Finally, we calculate the domain-sensitive popularity of each hashtag from multiple factors to obtain the most hotly discussed hashtags in each domain. Preliminary experimental results on a dataset from Sina Weibo, one of the largest Chinese microblogging websites, show the usefulness of the proposed approach for describing hashtags.

💡 Research Summary

The paper addresses a fundamental gap in micro‑blogging platforms: hashtags, while serving as topical markers, are created freely and lack any formal domain categorization, making it difficult for users to discover organized, domain‑specific discussions. To solve this, the authors propose a two‑stage framework that (1) assigns each hashtag a news‑category label and (2) ranks hashtags within each category according to a domain‑sensitive popularity score.

The first stage builds a domain classification model using a large corpus of news articles that are already annotated with a predefined taxonomy (e.g., politics, economics, sports, entertainment, science & technology). After standard preprocessing (tokenization, stop‑word removal, TF‑IDF or word‑embedding feature extraction), a multi‑class classifier—implemented with SVM, logistic regression, or a shallow neural network—is trained to predict the news category of any given text. Because news articles and micro‑blog posts share many topical words, the model can be transferred to the social‑media domain.
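The training step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: a TF‑IDF nearest-centroid classifier is swapped in for the SVM, logistic regression, or neural models mentioned in the summary, and the tokenizer, class name `CentroidClassifier`, and toy news snippets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need a word segmenter.
    return text.lower().split()


class CentroidClassifier:
    """TF-IDF nearest-centroid classifier (a stand-in for SVM / logistic regression)."""

    def fit(self, texts, labels):
        docs = [Counter(tokenize(t)) for t in texts]
        n_docs = len(docs)
        doc_freq = Counter()
        for counts in docs:
            doc_freq.update(counts.keys())
        # Smoothed inverse document frequency for every training word.
        self.idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0
                    for w, c in doc_freq.items()}
        # Sum the normalized TF-IDF vectors of each category, then normalize.
        sums = defaultdict(Counter)
        for counts, label in zip(docs, labels):
            for word, value in self._tfidf(counts).items():
                sums[label][word] += value
        self.centroids = {label: self._normalize(vec) for label, vec in sums.items()}
        return self

    def _tfidf(self, counts):
        # Words never seen in the news corpus are ignored.
        vec = {w: c * self.idf[w] for w, c in counts.items() if w in self.idf}
        return self._normalize(vec)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def predict(self, text):
        vec = self._tfidf(Counter(tokenize(text)))

        def cosine(centroid):
            return sum(v * centroid.get(w, 0.0) for w, v in vec.items())

        return max(self.centroids, key=lambda label: cosine(self.centroids[label]))


# Toy "news corpus" with category labels (illustrative data only).
news_texts = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "the central bank raised interest rates",
    "stock markets fell as inflation rose",
]
news_labels = ["sports", "sports", "economics", "economics"]

model = CentroidClassifier().fit(news_texts, news_labels)
sports_guess = model.predict("what a goal by the team tonight")
economy_guess = model.predict("inflation and interest rates worry markets")
```

Because the model only sees words that occur in the news corpus, microblog slang absent from news text contributes nothing to the prediction, which is one concrete form of the lexical gap between the two genres.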

Next, still within the classification stage, the system gathers for each hashtag all micro‑blog posts that contain it. These posts are cleaned and merged into a single “representative document” either by concatenation or by extracting the most salient sentences. This document is fed into the news‑category classifier, yielding the most probable domain label for the hashtag. The approach can disambiguate polysemous hashtags (e.g., “#Apple” as a technology brand versus a fruit) because the classifier leverages the surrounding context of the posts rather than relying on simple keyword matching.
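A minimal sketch of the concatenation variant of the representative document follows. The cleaning rules (dropping URLs, @-mentions, and the `#` markers) and the function names `clean_post` and `representative_document` are assumptions for illustration; the summary does not specify the exact cleaning steps.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_post(text):
    # Assumed cleaning: drop URLs and @-mentions, keep hashtag words without '#'.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    return " ".join(text.split())


def representative_document(posts, hashtag):
    """Concatenate the cleaned text of every post that carries the hashtag."""
    tag = hashtag.lower()
    related = [
        post for post in posts
        if tag in {h.lower() for h in HASHTAG_RE.findall(post)}
    ]
    return " ".join(clean_post(post) for post in related)


# Illustrative posts only.
posts = [
    "New #Apple phone launch today http://t.cn/abc",
    "@bob loves the #apple chip speed",
    "Rainy day #weather",
]
doc = representative_document(posts, "apple")
```

The resulting string would then be passed to whatever news-category classifier was trained earlier; the context words ("phone", "chip") rather than the tag itself are what would steer "#Apple" toward a technology label.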

Having assigned a domain to each hashtag, the authors compute a domain‑sensitive popularity score that integrates three factors: (a) the raw count of posts containing the hashtag, (b) a temporal decay function that gives higher weight to recent activity (e.g., posts within the last 24 hours receive a multiplier of 1.5, those 24–72 hours old a multiplier of 1.0, and older posts 0.5), and (c) user influence, measured by follower count, verification status, and historical engagement metrics (often transformed with a logarithmic scale to avoid domination by a few super‑users). The final score is a weighted sum (or a non‑linear combination) of these components, and hashtags are sorted within each domain according to this score, producing a “hot‑hashtags‑by‑domain” list.
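The scoring and ranking step can be sketched as below. The recency multipliers (1.5, 1.0, 0.5) come from the description above; everything else is an assumption: the equal component weights, the use of follower count alone as the influence signal, the `log10(1 + followers)` transform, and the names `Post`, `popularity`, and `rank_by_domain`.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Post:
    age_hours: float        # time since the post was published
    author_followers: int   # assumed proxy for user influence


def recency_weight(age_hours):
    # Multipliers as described in the summary; boundary handling is an assumption.
    if age_hours < 24:
        return 1.5
    if age_hours <= 72:
        return 1.0
    return 0.5


def influence_weight(followers):
    # Log scale so a few very large accounts do not dominate the score.
    return math.log10(1 + followers)


def popularity(posts, w_volume=1.0, w_recency=1.0, w_influence=1.0):
    """Weighted sum of volume, recency, and influence (weights are hypothetical)."""
    volume = len(posts)
    recency = sum(recency_weight(p.age_hours) for p in posts)
    influence = sum(influence_weight(p.author_followers) for p in posts)
    return w_volume * volume + w_recency * recency + w_influence * influence


def rank_by_domain(hashtag_posts, hashtag_domain):
    """Return {domain: [hashtags sorted by descending popularity]}."""
    by_domain = defaultdict(list)
    for tag, tag_posts in hashtag_posts.items():
        by_domain[hashtag_domain[tag]].append((popularity(tag_posts), tag))
    return {
        domain: [tag for _, tag in sorted(scored, reverse=True)]
        for domain, scored in by_domain.items()
    }


# Illustrative data only.
hashtag_posts = {
    "nba": [Post(2, 99), Post(30, 9)],
    "worldcup": [Post(100, 9)],
    "stocks": [Post(1, 9)],
}
hashtag_domain = {"nba": "sports", "worldcup": "sports", "stocks": "economics"}
ranking = rank_by_domain(hashtag_posts, hashtag_domain)
```

Keeping the three components separate until the final sum makes it easy to re-weight them per domain, for example giving recency more weight in fast-moving categories.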

The methodology was evaluated on a real‑world dataset from Sina Weibo, one of China’s largest micro‑blogging services. The dataset comprised over two million posts and several thousand distinct hashtags. The news‑category classifier achieved an accuracy exceeding 87 % on a held‑out news test set, and when applied to hashtag representative documents, it outperformed a baseline keyword‑matching approach by roughly 15 % in F1‑score, especially for ambiguous tags. Popularity ranking was validated by comparing the top‑10 hashtags per domain against external trend reports and by a user survey (N = 200) in which 92 % of respondents found the domain labels and rankings intuitive and useful.

Key contributions of the work are: (1) a transfer‑learning pipeline that leverages a well‑structured news taxonomy to impose semantic categories on unstructured social‑media hashtags, (2) a composite popularity metric that captures both volume and recency while accounting for the influence of the posting users, and (3) preliminary empirical evidence, obtained on a noisy, real‑world Sina Weibo corpus, of the practical value of the system.

The authors acknowledge limitations: the lexical gap between formal news language and the colloquial, slang‑rich style of micro‑blogs can degrade classification performance for emerging slang or newly coined terms; the fixed news taxonomy may not cover emerging domains such as esports, blockchain, or niche hobby communities. Future work is suggested to incorporate domain‑adaptation techniques, fine‑tune large pre‑trained language models (e.g., BERT, RoBERTa) on micro‑blog data, and extend the framework to multilingual and cross‑platform scenarios, thereby improving robustness and broadening applicability.

📜 Original Paper Content