System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection


The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework’s effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.


💡 Research Summary

The paper addresses the pressing problem of detecting fine‑grained hate speech on Chinese social media, where hateful content often appears in implicit, context‑dependent forms such as sarcasm, metaphor, and rapidly evolving slang. Traditional keyword‑based or shallow‑learning classifiers struggle with these nuances, leading to high false‑negative rates and poor generalization to out‑of‑distribution (OOD) cases. To overcome these challenges, the authors propose a three‑stage framework that leverages large language models (LLMs) in a novel, synergistic manner: Prompt Engineering, Supervised Fine‑tuning, and LLM Merging.

In the first stage, the authors design context‑aware prompts that explicitly encode conversation flow, speaker intent, and cultural background. They generate a pool of candidate prompts automatically using a data‑driven method, then have domain experts validate and select the most effective templates. By feeding these prompts to a pre‑trained LLM, the model is coaxed into extracting implicit hateful patterns that go beyond simple keyword matching, including metaphorical and ironic expressions.
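The prompt construction described above can be sketched as a simple template function. This is a minimal illustration, not the authors' actual templates: the field names (conversation history, cultural note) and the instruction wording are assumptions based on the summary's description of context-aware prompting.

```python
# Illustrative sketch of a context-aware prompt builder; the exact
# fields and wording the authors used are not given in the summary.
def build_prompt(post: str, history: list[str], cultural_note: str = "") -> str:
    """Assemble a prompt that encodes conversation flow and cultural
    background, steering the LLM toward implicit hate patterns
    (metaphor, irony, coded slang) rather than keyword matching."""
    context = "\n".join(f"- {turn}" for turn in history)
    note = f"Cultural background: {cultural_note}\n" if cultural_note else ""
    return (
        "You are analysing a Chinese social-media post for implicit hate speech.\n"
        f"Conversation so far:\n{context}\n"
        f"{note}"
        f"Post: {post}\n"
        "Identify any metaphorical, ironic, or coded hateful expression, "
        "the likely target group, and the expression style (direct/indirect)."
    )

prompt = build_prompt("示例帖子", ["earlier reply 1", "earlier reply 2"],
                      cultural_note="regional slang in use")
```

In practice, the paper generates many such candidate templates automatically and has domain experts filter them, so a function like this would be one member of a larger prompt pool.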

The second stage introduces supervised fine‑tuning with multi‑task learning. Instead of a single binary or coarse‑grained label, the model learns three task‑specific outputs simultaneously: hate intensity, expression style (direct vs. indirect), and target group. This richer supervision is combined with domain‑specific lexical resources—an up‑to‑date slang lexicon and sentiment lexicon—integrated at the embedding layer to improve domain adaptation. The fine‑tuned model thus becomes more sensitive to newly coined slang and nuanced rhetorical strategies.
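The multi-task setup above can be pictured as one shared encoder feeding three task-specific classification heads. The sketch below uses NumPy with random weights purely to show the head structure; the hidden size and class counts are illustrative assumptions, and a real implementation would fine-tune the LLM's weights against a summed cross-entropy loss over the three tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in for the LLM's pooled hidden state for one post.
hidden = rng.standard_normal(768)

# Three task-specific heads, mirroring the multi-task supervision:
# hate intensity (3 levels), expression style (direct/indirect),
# and target group (5 illustrative classes).
heads = {
    "intensity": rng.standard_normal((768, 3)),
    "style":     rng.standard_normal((768, 2)),
    "target":    rng.standard_normal((768, 5)),
}
preds = {task: softmax(hidden @ W) for task, W in heads.items()}
# During fine-tuning, the joint loss would sum the cross-entropies
# of the three heads, so each task regularizes the shared encoder.
```

The slang and sentiment lexicons mentioned in the summary would enter at the embedding layer, before the shared hidden state is computed, which this sketch does not model.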

The third stage tackles robustness by merging several fine‑tuned LLMs that differ in initialization, prompt variants, and fine‑tuning hyper‑parameters. Rather than a naïve weight average, the authors employ Bayesian Model Averaging (BMA). Each model’s predictive uncertainty is quantified, and models that exhibit high entropy on OOD inputs receive lower weights. This uncertainty‑aware aggregation yields a system that maintains high confidence on in‑distribution data while gracefully degrading on unfamiliar inputs.
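The uncertainty-aware aggregation can be sketched as follows. This is a simplified stand-in for full Bayesian Model Averaging, assuming only that each model outputs class probabilities and that higher mean predictive entropy should translate into lower ensemble weight; the exponential weighting rule is an illustrative choice, not necessarily the authors'.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each probability row; high entropy means
    the model is uncertain about that input."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def merge(prob_list):
    """Entropy-weighted averaging of several models' predictions:
    models that are more uncertain on the evaluated inputs receive
    lower weight (a simple proxy for BMA's model posterior)."""
    mean_H = np.array([entropy(p).mean() for p in prob_list])
    w = np.exp(-mean_H)          # confident models -> larger weight
    w /= w.sum()                 # normalize weights to sum to 1
    return sum(wi * p for wi, p in zip(w, prob_list))
```

For example, merging a confident model's distribution [0.9, 0.1] with an uncertain model's [0.5, 0.5] yields a result closer to the confident model than a plain average would, which is the behaviour the summary describes for OOD inputs.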

The authors evaluate the full pipeline on the STATE‑ToxiCN benchmark, which provides fine‑grained annotations for Chinese hate speech, including sub‑categories for indirect and coded hate. The proposed system achieves an F1‑macro of 0.78 on the standard test split, surpassing the previous state‑of‑the‑art (SOTA) baseline of 0.71 by 7 percentage points. More strikingly, on an OOD test set designed to contain novel slang and coded expressions, the system reaches an F1‑macro of 0.71 versus the baseline’s 0.63, an 8‑point gain. Error analysis shows a notable 12‑point increase in recall for metaphorical/ironic hate, confirming the effectiveness of the prompt‑driven stage.

Despite these gains, the paper acknowledges limitations. Prompt engineering still relies heavily on expert input, incurring annotation costs and limiting scalability. The Bayesian merging step introduces substantial computational overhead, which may hinder real‑time deployment. The authors propose future work on automated prompt generation, lightweight ensemble techniques, and extending the framework to multimodal inputs (text, images, audio) to capture hate speech that spans multiple channels.

In summary, the study demonstrates that a carefully orchestrated combination of context‑rich prompting, multi‑task fine‑tuning, and uncertainty‑aware model merging can significantly improve the detection of fine‑grained, context‑dependent Chinese hate speech. The approach not only outperforms existing methods on standard benchmarks but also shows enhanced robustness to evolving linguistic phenomena, offering a promising direction for both academic research and practical moderation systems.

