The Content Moderator's Dilemma: Removal of Toxic Content and Distortions to Online Discourse
There is an ongoing debate about how to moderate toxic speech on social media and the impact of content moderation on online discourse. This paper proposes and validates a methodology for measuring the content-moderation-induced distortions in online discourse using text embeddings from computational linguistics. Applying the method to a representative sample of 5 million US political Tweets, we find that removing toxic Tweets alters the semantic composition of content. This finding is consistent across different embedding models, toxicity metrics, and samples. Importantly, we demonstrate that these effects are not solely driven by toxic language but by the removal of topics often expressed in toxic form. We propose an alternative approach to content moderation that uses generative Large Language Models to rephrase toxic Tweets, preserving their salvageable content rather than removing them entirely. We show that this rephrasing strategy reduces toxicity while minimizing distortions in online content.
💡 Research Summary
The paper tackles the largely qualitative debate over the trade‑off between removing toxic speech and preserving the informational richness of online discourse by introducing a quantitative, scalable metric for “content distortion.” The authors define distortion as a shift in the distribution of textual representations within a semantic space, which they approximate using high‑dimensional embeddings from Transformer‑based language models (e.g., BERT, RoBERTa). To measure distributional change they adopt the Bhattacharyya distance, a statistical divergence that captures both mean displacement and covariance (variance) reduction between two multivariate normal approximations of the embedding clouds. This choice allows the metric to be content‑agnostic, computationally efficient, and sensitive to subtle semantic shifts that cosine similarity or topic‑model based measures might miss.
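For two embedding clouds approximated as multivariate Gaussians, the Bhattacharyya distance combines exactly the two effects described above: a mean-displacement (Mahalanobis-like) term and a log-determinant term that registers covariance reduction. A minimal NumPy sketch on synthetic data (the function name and the example arrays are illustrative, not the authors' code):

```python
import numpy as np

def bhattacharyya_gaussian(x, y):
    """Bhattacharyya distance between Gaussian fits of two embedding clouds.

    x, y: (n_samples, dim) arrays of embedding vectors.
    """
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    s = (s1 + s2) / 2.0
    diff = mu1 - mu2
    # Mean-displacement term
    term1 = diff @ np.linalg.solve(s, diff) / 8.0
    # Covariance term via log-determinants (numerically stable)
    _, logdet_s = np.linalg.slogdet(s)
    _, logdet_s1 = np.linalg.slogdet(s1)
    _, logdet_s2 = np.linalg.slogdet(s2)
    term2 = 0.5 * (logdet_s - 0.5 * (logdet_s1 + logdet_s2))
    return term1 + term2

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(2000, 5))
b = rng.normal(0.5, 1.0, size=(2000, 5))  # same covariance, shifted mean
print(bhattacharyya_gaussian(a, a[:1000]))  # near zero: same distribution
print(bhattacharyya_gaussian(a, b))         # positive: mean displacement
```

Because the metric depends only on fitted means and covariances, it scales to millions of embeddings and requires no topic labels, which is what makes it content-agnostic.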
Empirically, the authors sample 5 million U.S. political tweets and label each with several toxicity scores (Perspective API, Jigsaw, etc.). Tweets exceeding a toxicity threshold of 0.8 are removed, and the Bhattacharyya distance is computed between the pre‑removal and post‑removal embedding distributions. The resulting distance is substantially larger than that obtained by randomly deleting the same number of tweets, indicating that toxic‑content removal is far from semantically neutral. By normalizing against an upper bound – the maximal distance achievable by deleting a fixed number of tweets – they find that typical toxicity‑based moderation accounts for roughly 20 % of the maximal possible distortion, a magnitude they liken to removing four out of 67 identified topics at random.
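The removal-versus-random-deletion comparison can be sketched on synthetic data. Here the toxicity score is deliberately correlated with one embedding dimension (an assumption made purely for illustration, standing in for topics that tend to be expressed toxically), so threshold-based removal shifts the semantic distribution while a size-matched random deletion barely does; `bhat` implements the Gaussian Bhattacharyya distance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 5000, 8
emb = rng.normal(size=(n, dim))
# Synthetic toxicity score correlated with embedding dimension 0,
# so "toxic" posts occupy a distinct semantic region.
tox = 1.0 / (1.0 + np.exp(-(2.0 * emb[:, 0] + rng.normal(size=n))))

keep_tox = emb[tox < 0.8]  # threshold-based moderation at 0.8
n_removed = n - len(keep_tox)
# Size-matched random-deletion baseline
keep_rand = emb[rng.choice(n, n - n_removed, replace=False)]

def bhat(x, y):
    """Bhattacharyya distance between Gaussian fits of two point clouds."""
    m1, m2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s = (s1 + s2) / 2.0
    t1 = (m1 - m2) @ np.linalg.solve(s, m1 - m2) / 8.0
    t2 = 0.5 * (np.linalg.slogdet(s)[1]
                - 0.5 * (np.linalg.slogdet(s1)[1] + np.linalg.slogdet(s2)[1]))
    return t1 + t2

d_tox = bhat(emb, keep_tox)    # distorted: removal is semantically selective
d_rand = bhat(emb, keep_rand)  # near zero: random deletion is neutral
print(d_tox, d_rand)
```

The gap between `d_tox` and `d_rand` is the qualitative pattern the paper reports: toxicity-based removal is far from semantically neutral even when the number of deleted posts is held fixed.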
To disentangle whether the distortion stems from the lexical toxicity itself or from the loss of topics that are frequently expressed in a toxic manner, the authors conduct two complementary experiments. First, they orthogonalize the embedding space with respect to toxicity scores, effectively projecting each tweet onto a subspace where toxicity has no influence. Even in this orthogonalized space, removal of toxic tweets still produces a measurable Bhattacharyya distance, suggesting that the effect is not merely an artifact of the toxic words. Second, they employ a large language model (GPT‑4) to “rephrase” toxic tweets, stripping them of hateful language while preserving the underlying message. When the rephrased tweets are inserted back into the corpus, the Bhattacharyya distance drops by more than 70 % relative to the original removal scenario, demonstrating that preserving the substantive content dramatically mitigates semantic distortion.
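One simple way to realize the orthogonalization step is to regress every embedding dimension on the toxicity score and keep only the residuals, so the resulting vectors carry no linear toxicity signal. A hedged sketch on synthetic data (a linear projection; the paper's exact procedure may differ):

```python
import numpy as np

def orthogonalize(emb, tox):
    """Remove the linear toxicity component from each embedding dimension.

    Regresses every dimension of emb on the toxicity score (plus an
    intercept) and returns the residuals, which are uncorrelated with tox.
    """
    X = np.column_stack([np.ones_like(tox), tox])    # (n, 2) design matrix
    beta, *_ = np.linalg.lstsq(X, emb, rcond=None)   # (2, dim) coefficients
    return emb - X @ beta                            # residual embeddings

rng = np.random.default_rng(2)
n, dim = 3000, 4
tox = rng.uniform(size=n)
emb = rng.normal(size=(n, dim))
emb[:, 0] += 3.0 * tox  # dimension 0 linearly encodes toxicity

resid = orthogonalize(emb, tox)
# After projection, no dimension correlates with the toxicity score.
corrs = [abs(np.corrcoef(tox, resid[:, j])[0, 1]) for j in range(dim)]
print(max(corrs))
```

If removing high-toxicity posts still moves the distribution of these residual embeddings, the distortion cannot be blamed on toxic wording alone, which is the logic of the paper's first robustness experiment.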
These findings have clear policy implications. The conventional approach of outright deletion, while effective at reducing visible toxicity, also excises entire strands of political discussion, disproportionately affecting minority or marginalized viewpoints that may be expressed with strong language. The authors propose a hybrid moderation strategy: use automated toxicity detectors to flag content, then apply generative LLMs to rewrite the flagged material into a non‑toxic form. This approach retains the informational payload, reduces the risk of over‑silencing, and, as the metric shows, keeps the semantic composition of the platform much closer to its original state.
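The proposed flag-then-rewrite workflow can be sketched as a small pipeline. Both helpers below are hypothetical stand-ins: in practice `toxicity_score` would call a detector such as the Perspective API and `rewrite_nontoxic` would call a generative LLM; the keyword matching and string replacement here only make the control flow concrete.

```python
THRESHOLD = 0.8  # same flagging threshold as in the empirical analysis

def toxicity_score(text: str) -> float:
    """Hypothetical detector stub: flags posts containing an insult."""
    return 0.9 if "IDIOT" in text else 0.1

def rewrite_nontoxic(text: str) -> str:
    """Hypothetical LLM-rewrite stub: keeps the substantive claim."""
    return text.replace("you IDIOT", "which I strongly disagree with")

def moderate(text: str) -> str:
    """Rewrite flagged posts instead of deleting them outright."""
    if toxicity_score(text) >= THRESHOLD:
        return rewrite_nontoxic(text)
    return text

print(moderate("This policy hurts farmers, you IDIOT"))
# -> "This policy hurts farmers, which I strongly disagree with"
```

The design point is that the moderation decision becomes a transformation rather than a deletion, so the topic ("this policy hurts farmers") stays in the corpus even after the toxic framing is removed.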
Beyond the specific rephrasing technique, the Bhattacharyya‑based distortion metric itself offers a practical tool for platforms, regulators, and researchers. It enables systematic benchmarking of different moderation policies (threshold tuning, binary removal vs. soft‑filtering, LLM‑assisted rewriting) on a common quantitative scale, thereby turning a largely philosophical debate into an empirically grounded cost‑benefit analysis. The authors argue that without such a metric, platforms are likely to over‑prioritize the easily measurable objective of toxicity reduction at the expense of the harder‑to‑measure but equally important goal of preserving a diverse, informative public sphere.
In sum, the paper makes three substantive contributions: (1) it formalizes and validates a novel, distribution‑based measure of content distortion; (2) it provides robust evidence that toxicity‑based moderation induces significant semantic shifts beyond what random deletion would cause; and (3) it demonstrates that LLM‑driven rephrasing can achieve toxicity mitigation while substantially limiting those shifts. The work bridges computational linguistics, information economics, and policy analysis, offering a concrete methodological bridge for future research and for the design of more balanced content‑moderation frameworks.