Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original ArXiv source.

Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.


💡 Research Summary

The paper introduces a new task called Zero‑Shot Stance Detection in the Wild with Dynamic Target Generation and Multi‑Target Adaptation (DGTA). Unlike traditional stance detection, which assumes a predefined set of targets, DGTA requires a model to discover all potential targets mentioned in a social‑media post and assign a stance label (support, against, neutral) to each, without any prior knowledge of target vocabulary or the number of targets per instance.
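The open-ended output described above can be pictured as a variable-length list of (target, stance) pairs. Below is a minimal sketch of a validator for that shape; the function name, data layout, and example post are illustrative assumptions, not the paper's actual format.

```python
# Minimal illustration of the DGTA output shape: a variable-length list of
# (target, stance) pairs. Names and the example are hypothetical.
from typing import List, Tuple

STANCES = ("support", "against", "neutral")

def validate_prediction(pairs: List[Tuple[str, str]]) -> bool:
    """True if every pair has a non-empty target and a valid stance label."""
    return all(target.strip() and stance in STANCES for target, stance in pairs)

# A single post may yield any number of pairs, unknown in advance.
prediction = [("Huawei", "support"), ("USA", "against")]
assert validate_prediction(prediction)
```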

To support this task, the authors construct a large‑scale Chinese dataset from the micro‑blog platform Weibo. They select 240 diverse users, collect 125,176 posts, clean the data by removing emojis, URLs, usernames, and other noise, and then apply a multi‑stage annotation pipeline. Three large language models (GLM‑4‑9B, Qwen2.5‑7B, Llama‑3‑8B) independently generate target‑stance annotations. A cross‑validation rule accepts a sample only when at least two models agree on both the extracted targets and their stances. The accepted samples are rescored by DeepSeek‑V3, and low‑scoring instances are corrected. Finally, eight professional annotators verify every retained sample. After discarding 36,379 low‑quality entries, the final corpus contains 70,931 annotated posts covering single‑target, dual‑target, triple‑target, and multi‑target cases. Statistics show a balanced distribution of support, against, and neutral labels across targets such as “USA”, “Trump”, “Huawei”, etc.
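The cross-validation rule above can be sketched as follows, reading "agreement" as exact-set equality of the (target, stance) pairs produced by two of the three annotator LLMs; this strict interpretation and the data structures are assumptions.

```python
# Sketch of the cross-model acceptance rule: a sample is kept only when at
# least two of the three annotator LLMs emit identical (target, stance) sets.
# Exact-set agreement is an assumed interpretation of the paper's rule.
from itertools import combinations

def accept_sample(annotations):
    """annotations: three sets of (target, stance) tuples, one per LLM."""
    return any(a == b for a, b in combinations(annotations, 2))

ann = [
    {("Trump", "against")},   # e.g. GLM-4-9B
    {("Trump", "against")},   # e.g. Qwen2.5-7B
    {("Trump", "neutral")},   # e.g. Llama-3-8B
]
# Two of the three models agree, so the sample is accepted.
assert accept_sample(ann)
```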

Because conventional metrics cannot capture the open‑ended nature of target generation, the authors design a multi‑dimensional evaluation framework. For target identification they compute BERTScore, BLEU, ROUGE‑L, and a recall measure of target count, then combine them into a comprehensive C‑Score: C‑Score = (α·BERTScore + β·BLEU + γ·ROUGE‑L) × Recall, with α=0.6, β=0.2, γ=0.2. This balances semantic similarity with surface‑form matching and quantity alignment. For stance detection they first filter samples whose target predictions exceed empirically chosen thresholds (e.g., BERTScore ≥ 0.7, BLEU ≥ 0.2, ROUGE‑L ≥ 0.4, Recall ≥ 0.8, C‑Score ≥ 0.3). Only these “correctly identified” samples are evaluated with precision, recall, and F1.
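The C-Score and the filtering step can be transcribed directly from the definitions above; the example metric values fed in at the end are made up for illustration.

```python
# Direct transcription of the C-Score defined above, with the paper's
# weights (alpha=0.6, beta=0.2, gamma=0.2) and filtering thresholds.
def c_score(bertscore, bleu, rouge_l, recall,
            alpha=0.6, beta=0.2, gamma=0.2):
    return (alpha * bertscore + beta * bleu + gamma * rouge_l) * recall

def passes_filter(bertscore, bleu, rouge_l, recall):
    """True if a sample's target prediction counts as correctly identified."""
    return (bertscore >= 0.7 and bleu >= 0.2 and rouge_l >= 0.4
            and recall >= 0.8
            and c_score(bertscore, bleu, rouge_l, recall) >= 0.3)

# Example (hypothetical values): strong semantic match, moderate overlap.
# (0.6*0.85 + 0.2*0.30 + 0.2*0.50) * 0.9 = 0.603
score = c_score(0.85, 0.30, 0.50, 0.9)
```

Note how the multiplicative Recall term penalizes predictions that miss targets even when the matched ones are semantically close.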

The paper explores two fine‑tuning strategies for large language models (LLMs). The integrated approach treats target extraction and stance classification as a single instruction‑following sequence‑to‑sequence task. The model receives a prompt that concatenates task instructions with the original post and is trained to output a list of (target, stance) pairs in one pass. The two‑stage approach decouples the problem: a first model is fine‑tuned to generate only target strings, and a second model (potentially a different architecture) is fine‑tuned to assign stances given the original text and the extracted targets. Both strategies use LoRA (Low‑Rank Adaptation) to keep training efficient.
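The two data-construction strategies can be sketched as below. The prompt wording and JSON layout are assumptions for illustration, not the paper's exact instruction templates.

```python
# Hedged sketch of instruction data for the two fine-tuning strategies.
# Prompt texts and field names are assumptions, not the paper's templates.
import json

def integrated_example(post, pairs):
    """One pass: the model learns to emit all (target, stance) pairs."""
    return {
        "instruction": "Identify every target in the post and its stance "
                       "(support/against/neutral).",
        "input": post,
        "output": json.dumps(pairs, ensure_ascii=False),
    }

def two_stage_examples(post, pairs):
    """Stage 1 learns targets only; stage 2 classifies stance per target."""
    stage1 = {
        "instruction": "List every target mentioned in the post.",
        "input": post,
        "output": json.dumps([t for t, _ in pairs], ensure_ascii=False),
    }
    stage2 = [{
        "instruction": f"What is the stance of the post toward '{t}'?",
        "input": post,
        "output": s,
    } for t, s in pairs]
    return stage1, stage2
```

The integrated format forces the model to coordinate extraction and classification in one generation, while the two-stage format lets each model specialize, which matches the result pattern reported below (better stance F1 for integrated, better target recall for two-stage).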

Experiments are conducted on seven stratified test subsets (1,000 instances each) drawn with fixed random seeds, and results are reported as averages with 95% confidence intervals. Baselines include pre‑trained models (mT5, RoBERTa‑large, BERT), a CRF‑based target‑stance extractor (RoBERTa‑CRF), and several prompting variants of LLMs (zero‑shot, few‑shot, chain‑of‑thought). The integrated fine‑tuned DeepSeek‑R1‑Distill‑Qwen‑7B achieves the highest stance F1 of 79.26%, outperforming all baselines by a large margin. The two‑stage fine‑tuned Qwen2.5‑7B attains the best target identification C‑Score of 66.99%, indicating superior ability to discover the correct set of targets. Other notable results: Qwen2.5‑7B reaches a BERTScore of 82.47% and a recall of 91.16% for target detection; DeepSeek‑V3 shows strong recall (92.05%) but lower lexical overlap. Traditional models achieve reasonable BERTScore (~84%) but their stance F1 hovers around 60%, demonstrating the advantage of LLM fine‑tuning for this open‑world scenario.
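Reporting an average with a 95% confidence interval over the seven subsets can be done with a normal approximation on the standard error; the per-subset scores below are made-up placeholders, not the paper's numbers.

```python
# Sketch of average-plus-95%-CI reporting over the seven test subsets,
# using a normal approximation (1.96 * standard error of the mean).
import statistics

def mean_ci95(scores):
    m = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return m, 1.96 * sem

# Hypothetical per-subset F1 scores, one per stratified test subset.
scores = [78.9, 79.4, 79.1, 79.6, 78.8, 79.3, 79.7]
mean, half_width = mean_ci95(scores)  # report as mean ± half_width
```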

Key insights from the analysis are: (1) dynamic, zero‑shot stance detection is feasible with large language models when they are fine‑tuned on appropriately constructed instruction data; (2) separating target extraction from stance classification can improve target recall, while an integrated approach yields higher overall stance F1 due to better coordination between the two subtasks; (3) the multi‑dimensional C‑Score provides a more nuanced picture of target generation quality than any single metric; (4) the cross‑validated annotation pipeline, which leverages agreement among multiple LLMs before human verification, yields a high‑quality dataset at scale.

The authors conclude that DGTA opens a new research direction for opinion mining in realistic, unstructured environments where target vocabularies are fluid. They release the dataset, annotation code, and fine‑tuned models to the community. Future work suggested includes multilingual extensions, real‑time streaming deployment, modeling more complex target‑stance relationships (e.g., causal chains), and incorporating user‑level metadata for personalized stance detection.

