Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be used effectively for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by the Bluesky Moderation Service, and annotations by two authors, we find considerable overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those of the proprietary ones (72%–98% and 93%–99%, respectively). Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we analyze inter-rater agreement among the human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show that open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
💡 Research Summary
The paper investigates whether state-of-the-art open-weight large language models (LLMs) can match proprietary LLMs in zero-shot harmful-content detection on a real-world social media platform, Bluesky. The authors first built a ground-truth dataset by harvesting 4.34 million English-only root posts from Bluesky (August 12–24, 2025) and collecting moderation labels from the official Bluesky Moderation Service (BMS). They focused on three violation categories—rude, intolerant, and threat—sampling at least 520 posts per category and adding 786 random non-violating posts. To verify that BMS labels reflect human moderation, they measured labeler reaction latency and found that over 80% of labels were applied more than a day after posting, indicating human involvement. Two authors independently re-annotated the sampled posts according to BMS definitions, establishing a reliable human "ground truth" and measuring inter-annotator agreement.
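The latency check described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout (post timestamp, label timestamp pairs) and the one-day threshold are assumptions drawn from the summary.

```python
from datetime import datetime, timedelta

def fraction_delayed(records, threshold=timedelta(days=1)):
    """Fraction of moderation labels applied more than `threshold`
    after the post was created (a proxy for human involvement,
    since automated labelers typically react within seconds)."""
    delayed = sum(
        1 for post_ts, label_ts in records
        if label_ts - post_ts > threshold
    )
    return delayed / len(records)

# Illustrative (post created, label applied) timestamp pairs
records = [
    (datetime(2025, 8, 12, 9, 0), datetime(2025, 8, 14, 10, 0)),  # > 1 day
    (datetime(2025, 8, 13, 9, 0), datetime(2025, 8, 13, 9, 5)),   # 5 minutes
    (datetime(2025, 8, 14, 9, 0), datetime(2025, 8, 16, 9, 0)),   # > 1 day
]
```

Applied to the full label set, a high value of this fraction (over 80% in the paper) suggests labels were applied by humans rather than automated filters.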
Seven LLMs were evaluated: three open-weight reasoning models (gpt-oss-20b, NVIDIA-Nemotron-Nano-9B-v2, Qwen3-30B-A3B-Thinking-2507), each small enough to fit on a single RTX 3090 GPU (24 GB VRAM); three proprietary reasoning models (Gemini 2.5 Pro, GPT-5, Grok 4); and one non-reasoning proprietary model (GPT-4o) as a baseline. All models were accessed via consistent APIs, run with temperature 0 and a fixed random seed (350), and prompted in a chain-of-thought style derived from the GPT-5 prompting guide, incorporating the BMS label descriptions.
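A minimal sketch of this deterministic zero-shot setup, assuming an OpenAI-compatible chat API: the system prompt wording and the helper name `build_moderation_request` are illustrative, not the authors' exact prompt, but the `temperature` and `seed` values match those reported.

```python
def build_moderation_request(model: str, post_text: str, label_descriptions: str) -> dict:
    """Build a deterministic zero-shot moderation request payload.
    The prompt text is an illustrative placeholder, not the paper's prompt."""
    system_prompt = (
        "You are a content moderator. Think step by step, then decide "
        "whether the post violates any of the following policies:\n"
        f"{label_descriptions}\n"
        "Answer with the violated label, or 'none'."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": post_text},
        ],
        "temperature": 0,   # greedy decoding, as in the paper's setup
        "seed": 350,        # fixed random seed reported by the authors
    }
```

The same payload shape works for both the hosted proprietary models and a locally served open-weight model behind an OpenAI-compatible endpoint, which is what makes the comparison (and a privacy-preserving local deployment) straightforward.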
Performance was measured in terms of sensitivity (recall) and specificity. Open-weight models achieved sensitivity between 81% and 97% and specificity between 91% and 100%, overlapping closely with the proprietary models (sensitivity 72%–98%, specificity 93%–99%). Notably, the "rude" category exhibited higher specificity than sensitivity, whereas "intolerant" and "threat" showed the opposite pattern, reflecting differing error profiles likely driven by ambiguity in the label definitions. Inter-rater agreement analysis showed that open-weight LLMs aligned with human moderators at a level comparable to proprietary models, suggesting they can be deployed in privacy-preserving, personalized moderation scenarios.
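For readers unfamiliar with the two metrics, they can be computed per violation category from boolean ground-truth and prediction vectors, as in this short sketch (the function name is ours):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = recall on violating posts (TP / (TP + FN));
    specificity = recall on non-violating posts (TN / (TN + FP)).
    Both inputs are booleans: True means the post violates the category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)
```

The "rude" pattern reported above (specificity above sensitivity) corresponds to models missing some rude posts (more false negatives), while for "intolerant" and "threat" the errors skew toward flagging benign posts (more false positives).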
The authors discuss several limitations: the dataset is limited to English text, excludes multimedia content, and relies on Bluesky-specific moderation rules, which may not generalize without further mapping. They also note that the latency-based validation of BMS labels, while indicative, does not guarantee that every label was applied by a human. Future work is proposed in three directions: (1) expanding to multilingual and multimodal datasets, (2) standardizing moderation taxonomies and providing illustrative examples to reduce label ambiguity, and (3) exploring user-controlled decision thresholds for personalized moderation.
In conclusion, the study provides empirical evidence that modern open‑weight reasoning LLMs, even when run on consumer‑grade hardware without any fine‑tuning, can reliably detect harmful content on a live social media platform. This finding challenges earlier reports that open‑weight models were unsuitable for such tasks and opens the door to cost‑effective, privacy‑respecting moderation solutions that can be tailored to community values and individual user preferences.