The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework's effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.
The rapid growth of social media platforms has led to a global surge in online hate speech, which not only inflicts psychological harm on targeted individuals or groups but also exacerbates social tensions and fuels collective antagonism (Arora et al., 2023). While existing technologies can preliminarily detect explicit hate content (Schmidt and Wiegand, 2017), Chinese hate expressions are often characterized by implicitness, diversity, and context-dependency (Qian et al., 2018). Offensive content may be embedded through metaphors, sarcasm, or indirect references (Fortuna and Nunes, 2018), frequently targeting specific group attributes such as geography, gender, or ethnicity (Mathew et al., 2021). Against this backdrop, fine-grained hate speech detection has emerged as a critical research direction to address this issue. It aims to precisely dissect hate elements, such as target entities, arguments, victimized groups, and hate attributes (Vidgen et al., 2021), from textual data, enabling more accurate identification and regulation of online hate speech.
The core requirement of fine-grained hate speech detection lies in models that can not only recognize explicit offensive lexicons but also infer discriminatory intent from contextual semantics (ElSherief et al., 2021), while strictly adhering to structured output specifications (Pavlopoulos et al., 2020). However, current mainstream models face three critical bottlenecks: (1) Semantic Complexity: Traditional rule-based or shallow machine learning methods, as well as directly applied large language models, struggle to accurately capture the implicit and diverse fine-grained semantic features inherent in Chinese hate speech (Talat and Hovy, 2016). (2) Incomplete Information Extraction: General-purpose pre-trained models lack targeted attention to hate speech components, resulting in incomplete extraction of critical information. (3) Generalization Limitations: Single training paradigms are susceptible to data distribution biases, limiting model generalization in complex scenarios and hindering adaptability to dynamic online environments (Gururangan et al., 2020).
To address these challenges, this study proposes a hybrid training framework based on the Qwen2.5-7B-Instruct LLM (Qwen et al., 2025), employing a three-stage optimization strategy. First, Prompt Engineering guides the model to focus on hate speech elements (e.g., victimized group classification and metaphor identification rules) while enforcing structured output through task-oriented templates. Second, Supervised Fine-Tuning (SFT) (Ouyang et al., 2022) enhances the model’s ability to parse fine-grained semantics using high-quality annotated data, particularly improving discrimination accuracy for implicit hate expressions. Finally, Model Merging (Matena and Raffel, 2022) integrates the checkpoints from different training stages via the LLM Merging method, which sparsifies task vectors by pruning extreme parameters, thereby synthesizing complementary features from different training phases to boost robustness. Experimental results demonstrate stable performance scores of 0.3553 and 0.3555 on the preliminary and final test sets, respectively, with over 15% accuracy improvement in hate detection compared to baseline models. The merged model also exhibits strong adaptability in complex scenarios such as multi-group attacks and cross-context generalization. This work provides a theoretically innovative and practically valuable technical pathway for Chinese fine-grained hate speech detection, contributing to safer online discourse environments.
The proposed framework comprises three pivotal components: (1) Domain-specific Prompt Engineering, (2) Task-oriented Supervised Fine-tuning, and (3) Dynamic LLM Merging. As illustrated in the hierarchical architecture of Figure 1, the system operates through phased optimization: prompt engineering guides the model to concentrate on fine-grained hate elements, the supervised fine-tuning phase enhances the model’s discriminative capacity for implicit semantic nuances, and model merging improves both the recognition accuracy and the generalization capabilities of the system.
The Prompt Strategy enhances structured output capabilities and fine-grained hate judgment logic through domain-specific prompt template design. Specifically, the prompt template incorporates three core components:
First, it defines clear task objectives by mandating the model to output results following a “four-tuple” structured framework. To reinforce the model’s understanding of this format, contextual examples are strategically embedded immediately after defining each field. Second, it embeds explicit definitions of hate speech while establishing contrasting non-targeted content boundaries through dual-directional examples. For instance, the prompt explicitly contrasts hate speech with non-targeted content, clarifying criteria with phrases like “ordinary information without
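A minimal sketch of such a task-oriented template is shown below. The field names and the `|` / `[END]` separators follow the common quadruple annotation scheme for this task; the exact wording, examples, and field definitions used in the paper's template are assumptions here:

```python
# Hypothetical prompt template for four-tuple hate speech extraction.
PROMPT_TEMPLATE = """You are a fine-grained hate speech analyst.
For the input text, output one or more quadruples in the form:
Target | Argument | Targeted Group | Hateful [END]

Field definitions:
- Target: the entity being commented on ("NULL" if absent)
- Argument: the key phrase expressing the opinion about the target
- Targeted Group: e.g., region, gender, ethnicity, or "non-hate"
- Hateful: "hate" or "non-hate"

Text: {text}
Output:"""

def build_prompt(text: str) -> str:
    """Fill the task-oriented template with the text to be analyzed."""
    return PROMPT_TEMPLATE.format(text=text)
```

Defining each field and then immediately showing in-context examples (omitted above) is what enforces the structured output format at inference time.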