Algorithm and Implementation of the Blog-Post Supervision Process
A web log, or blog for short, is a popular way to share personal entries with others through a website. A typical blog may consist of text, images, audio, video, etc. Most blogs serve as personal online diaries, while others focus on a specific interest such as photography (photoblog), art (artblog), travel (tourblog), or IT (techblog). Another well-known form, microblogging, consists of very short posts. As in developed countries, the number of blog users is gradually increasing in developing countries such as Bangladesh. Because blogs are openly accessible to all users, some people misuse them to spread fake news for individual or political gain. Others post vulgar material that creates embarrassing situations for fellow bloggers and can even damage the victim's reputation. The only way to overcome this problem is to bring all posts under the supervision of the blog moderator, but that totally contradicts the concept of blogging. In this paper, we implement an algorithm that helps prevent offensive entries from being posted: each entry must pass a supervision process that justifies it as a legitimate post. Analysis of the results shows that this approach can eliminate chaotic situations in the blogosphere to a great extent. Our experiment shows that about 90% of offensive posts can be detected and stopped from being published using this approach.
💡 Research Summary
The paper addresses the growing problem of malicious content on open‑access blogging platforms, where users can easily disseminate false news, vulgar material, or hate speech. Recognizing that outright censorship conflicts with the spirit of blogging, the authors propose a hybrid supervision framework that combines automated pre‑screening with human moderator review. The system consists of three main components: a keyword‑based blacklist, a semantic similarity detector, and a dynamic user reputation model.
In the preprocessing stage, incoming posts are tokenized and lemmatized using a Korean morphological analyzer, after which they are scanned against a curated blacklist covering twelve categories such as politics, violence, sexuality, and drug use. Each category carries a weight, and the sum of weighted matches yields a risk score. To capture obfuscated or synonymous expressions, the authors train a Word2Vec embedding on a large Korean corpus; any term whose cosine similarity to a blacklist entry exceeds 0.75 is treated as an additional risk factor.
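The weighted-blacklist scoring described above can be sketched roughly as follows. The twelve-category structure and the 0.75 cosine-similarity threshold come from the summary; the category names shown, their weights, the example terms, and the `similarity` callback are all illustrative assumptions, not the paper's actual data (a real deployment would plug in the trained Word2Vec model here):

```python
# Hypothetical sketch of the weighted-blacklist risk scoring.
# Blacklist contents and weights are illustrative stand-ins.

BLACKLIST = {
    "politics":  {"weight": 0.6, "terms": {"propaganda", "rigged"}},
    "violence":  {"weight": 0.9, "terms": {"attack", "kill"}},
    "sexuality": {"weight": 0.8, "terms": {"obscene"}},
}

SIMILARITY_THRESHOLD = 0.75  # cosine-similarity cut-off stated in the summary


def risk_score(tokens, similarity=None):
    """Sum category weights for every token that matches a blacklist entry,
    or (if a similarity function is supplied, e.g. backed by Word2Vec)
    whose cosine similarity to an entry exceeds the threshold."""
    score = 0.0
    for token in tokens:
        for category in BLACKLIST.values():
            if token in category["terms"]:
                score += category["weight"]
            elif similarity is not None and any(
                similarity(token, term) > SIMILARITY_THRESHOLD
                for term in category["terms"]
            ):
                score += category["weight"]
    return score


print(risk_score(["peaceful", "attack", "rigged"]))  # 0.9 + 0.6 = 1.5
```

In practice the `similarity` callback would wrap the trained embedding (e.g. gensim's `Word2Vec` similarity), so obfuscated synonyms accumulate the same category weight as exact blacklist hits.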
The user reputation module tracks each author’s historical behavior—number of prior blocks, user‑reported flags, average post engagement—and computes a reputation score between 0 and 1. Low‑reputation users face a reduced risk‑score threshold, meaning that the same textual content is more likely to be blocked if it originates from a repeat offender.
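A minimal sketch of the reputation-aware threshold, under stated assumptions: the paper only says that reputation lies in [0, 1] and that low-reputation authors face a reduced risk-score threshold, so the specific linear formulas, the base threshold of 2.0, and the feature weighting below are hypothetical:

```python
# Illustrative reputation model; formulas and constants are assumptions.

BASE_THRESHOLD = 2.0  # hypothetical base risk-score threshold


def reputation(prior_blocks, user_flags, posts):
    """Map an author's history to a score in [0, 1];
    more prior blocks and user-reported flags lower the score."""
    penalty = (prior_blocks + 0.5 * user_flags) / max(posts, 1)
    return max(0.0, 1.0 - penalty)


def adaptive_threshold(rep):
    """Low-reputation authors get a reduced threshold, so the same
    text is more likely to be held for review."""
    return BASE_THRESHOLD * (0.5 + 0.5 * rep)


good = reputation(prior_blocks=0, user_flags=1, posts=100)   # 0.995
bad = reputation(prior_blocks=10, user_flags=20, posts=40)   # 0.5
print(adaptive_threshold(good), adaptive_threshold(bad))     # 1.995 1.5
```

The key property is the ordering, not the constants: a repeat offender's threshold sits strictly below a clean author's, so identical content from the offender is more likely to be queued.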
If a post’s risk score surpasses the adaptive threshold, it is placed in a “pending” queue rather than being published outright. Human moderators then examine the pending items, aided by highlighted risky terms and the calculated risk score. Moderators can choose to (a) approve the post, (b) request edits, or (c) delete it permanently. The system logs all decisions for future audit and for fine‑tuning the thresholds.
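The pending-queue workflow above can be sketched as follows. The three moderator actions (approve, request edits, delete) and the audit logging come from the summary; the class, field, and function names are hypothetical:

```python
# Minimal sketch of the pending-queue / moderator-review step.
# Names are hypothetical; only the actions and logging follow the summary.

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REQUEST_EDITS = "request_edits"
    DELETE = "delete"


@dataclass
class PendingPost:
    post_id: int
    risk_score: float
    risky_terms: list  # highlighted for the moderator's benefit


audit_log = []  # every decision is kept for audit and threshold tuning


def moderate(post: PendingPost, decision: Decision):
    """Record the moderator's decision before acting on it."""
    audit_log.append((post.post_id, post.risk_score, decision.value))
    return decision


moderate(PendingPost(post_id=1, risk_score=2.3, risky_terms=["attack"]),
         Decision.DELETE)
print(audit_log)  # [(1, 2.3, 'delete')]
```

Logging the risk score alongside each human decision is what later allows the thresholds to be fine-tuned against moderator behavior, as the summary notes.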
Implementation details reveal a Python‑Flask web service backed by a MySQL database. The authors collected 5,000 real‑world blog entries over a three‑month period, manually labeling 1,200 as malicious (including vulgar, hateful, or false content). When evaluated on this dataset, the hybrid system correctly blocked 1,080 malicious posts, achieving a detection rate of 90%. The false‑positive rate was 7%, with 266 legitimate posts mistakenly held for review. Among the blocked items, 95% were ultimately deleted by moderators, while the remainder were returned to authors for revision.
The discussion acknowledges several limitations. First, the reliance on a static blacklist requires continuous updates to keep pace with evolving slang and regional dialects. Second, the current approach handles only textual content; images, audio, and video remain unexamined, leaving a potential attack vector. Third, the manual review queue, though reduced, still imposes a non‑trivial workload on moderators (approximately 5% of total posts). The authors suggest integrating transformer‑based language models such as KoBERT to improve semantic detection and reduce false positives. They also propose extending the framework to multimodal analysis, incorporating computer‑vision techniques for image and video screening.
In conclusion, the paper demonstrates that a layered supervision strategy—combining fast keyword filtering, embedding‑based semantic checks, and reputation‑aware thresholds—can substantially curb the spread of offensive material on blogs while preserving much of the platform’s openness. Future work will focus on (1) automating threshold adjustment through reinforcement learning, (2) scaling the system for multilingual environments, (3) enriching the blacklist with community‑generated inputs, and (4) evaluating the impact of the system on user engagement and trust.