PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, generating responses to it, called counter-speech, remains an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 introduces three main new functionalities, all leveraging a Retrieval-Augmented Generation (RAG) pipeline: i) grounding hate speech (HS) explanations in evidence and facts, ii) automatically generating evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.
💡 Research Summary
The paper presents PEACE 2.0, an integrated web‑based platform that moves beyond hate‑speech detection to provide evidence‑grounded explanations and automatically generated counter‑speech. Building on the original PEACE system, PEACE 2.0 adds three major capabilities. First, it incorporates a Retrieval‑Augmented Generation (RAG) pipeline that retrieves relevant passages from a curated human‑rights knowledge base (32,792 documents, over 3 million paragraphs from the UN Digital Library, Eur‑Lex, and the European Agency for Fundamental Rights). Input messages are encoded with the BGE‑M3 sentence transformer, and FAISS is used to retrieve the top‑3 most similar passages. These passages are summarized by a selected large language model (LLM) and then fed together with the original hateful text to the same LLM to produce a respectful, persuasive counter‑speech response.
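The retrieve-then-generate flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: plain NumPy inner-product search stands in for both the BGE-M3 encoder and the FAISS index, the LLM summarization step is elided, and all function names are hypothetical.

```python
import numpy as np

def top_k_passages(query_vec, passage_vecs, k=3):
    """Return indices of the k most similar passages by cosine similarity.

    In PEACE 2.0 the vectors come from BGE-M3 and the search runs on a
    FAISS index over ~3M paragraphs; plain NumPy stands in for both here.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q  # inner product of unit vectors == cosine similarity
    return np.argsort(-scores)[:k]

def build_counter_speech_prompt(message, passages):
    """Assemble evidence plus the hateful message into a generation prompt.

    The real pipeline first summarizes the retrieved passages with the
    selected LLM before prompting it; that step is skipped in this sketch.
    """
    evidence = "\n".join(f"- {p}" for p in passages)
    return (
        "Using the evidence below, write a respectful, persuasive "
        f"counter-speech reply.\nEvidence:\n{evidence}\nMessage: {message}"
    )
```

In the actual system the resulting prompt is sent to the chosen LLM (e.g., Mistral-7B-Instruct) to produce the counter-speech response.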
Second, the same RAG mechanism is used to generate explanations for the predictions of a BERT classifier fine‑tuned on the ISHate dataset. The classifier’s label is accompanied by a natural‑language justification that cites the retrieved evidence and displays similarity scores, thereby increasing transparency and interpretability.
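A grounded explanation of this kind might be assembled as below. This is only a formatting sketch under stated assumptions: the label would come from the fine-tuned BERT classifier and the justification text from the LLM, neither of which is invoked here, and the function name is hypothetical.

```python
def build_explanation(label, evidence):
    """Format a grounded explanation: the classifier's predicted label
    followed by the retrieved passages with their similarity scores,
    mirroring what the PEACE 2.0 interface displays.

    `evidence` is a list of (passage_text, similarity_score) pairs.
    """
    lines = [f"Prediction: {label}"]
    for i, (text, score) in enumerate(evidence, 1):
        lines.append(f"[{i}] (sim={score:.2f}) {text}")
    return "\n".join(lines)
```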
Third, PEACE 2.0 offers rich visual analytics for both hate‑speech and counter‑speech datasets. Sankey diagrams link target groups, hate categories (explicit vs. implicit), and LDA‑derived topics; word clouds and frequency charts reveal lexical patterns; and a data‑augmentation module provides seven strategies (named‑entity replacement, adverb adjustment, synonym substitution, back‑translation, etc.) to generate adversarial variants of implicit hate while preserving the underlying stance.
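One of the seven augmentation strategies, synonym substitution, could be sketched as below. This is a toy illustration with an assumed hand-made synonym table; the actual module presumably relies on richer lexical resources, and the names here are hypothetical.

```python
import random

# Toy synonym table for illustration only; the real strategy would draw
# on a proper lexical resource such as a thesaurus.
SYNONYMS = {"happy": ["glad", "content"], "big": ["large", "huge"]}

def synonym_substitute(tokens, rng=None):
    """Replace tokens that have known synonyms with a random alternative,
    producing a surface variant while preserving the underlying stance."""
    rng = rng or random.Random(0)
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]
```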
The experimental protocol samples 20 messages from each of five implicit hate corpora (IHC, ISHate, TOXIGEN, DYNA, SBIC), yielding 100 examples equally split between explicit and implicit. For each message, explanations and counter‑speech are generated with and without RAG, resulting in 200 human‑evaluated items (three annotators per item). Human ratings cover Fluency, Informativeness, Persuasiveness, Soundness, and Specificity on a 1‑5 Likert scale. RAG‑based outputs consistently outperform non‑RAG across all dimensions, with especially large gains for implicit content (e.g., Explanation‑Imp. 4.64 vs 2.72; Counter‑speech‑Imp. 4.80 vs 2.86). Automatic metrics (semantic similarity, faithfulness to retrieved evidence, perplexity, Distinct‑3, NLI‑based entailment/contradiction) corroborate these findings, showing higher relevance, lower perplexity, and better alignment with evidence for RAG outputs. Statistical significance is confirmed via Wilcoxon signed‑rank tests (p < 0.05) and inter‑annotator agreement (Krippendorff’s α 0.57‑1).
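Of the automatic metrics above, Distinct-3 is simple enough to sketch directly: it is the ratio of unique n-grams to total n-grams across a set of generated outputs, so higher values indicate more lexical diversity. The implementation below follows the standard definition of the metric, not the paper's exact code.

```python
def distinct_n(texts, n=3):
    """Distinct-n: unique n-grams / total n-grams over all outputs.

    Standard diversity metric; a value near 1.0 means the generations
    rarely repeat the same word sequences.
    """
    total, unique = 0, set()
    for text in texts:
        toks = text.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

For example, two identical four-token outputs share all their trigrams, so Distinct-3 is 0.5.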
The results demonstrate that grounding generation in factual evidence markedly improves the quality, trustworthiness, and persuasive power of both explanations and counter‑speech, particularly for subtle, context‑dependent hate. PEACE 2.0 is released as a public web interface and a Python/Flask API, supporting multiple open‑source LLMs (Mistral‑7B‑Instruct, Llama‑3.1‑8B‑Instruct, Command‑R). Limitations include a static knowledge base that requires periodic updates and variability in output quality depending on the chosen LLM. Future work will explore adaptive retrieval strategies, continuous knowledge‑base enrichment, and multilingual evaluation metrics. By unifying detection, transparent explanation, and evidence‑backed response generation, PEACE 2.0 offers a practical tool for moderators, researchers, and policymakers aiming to foster inclusive and fair online discourse.