
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.20670
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Prevalent multimodal fake news detection relies on consistency-based fusion, yet this paradigm fundamentally misinterprets critical cross-modal discrepancies as noise, leading to over-smoothing that dilutes the evidence of fabrication: by minimizing feature discrepancies to align modalities, it inadvertently smooths out the subtle cross-modal contradictions that serve as the primary evidence of fabrication. To address this, we propose the Dynamic Conflict-Consensus Framework (DCCF), an inconsistency-seeking paradigm designed to amplify rather than suppress contradictions. First, DCCF decouples inputs into independent Fact and Sentiment spaces to distinguish objective mismatches from emotional dissonance. Second, we employ physics-inspired feature dynamics to iteratively polarize these representations, actively extracting maximally informative conflicts. Finally, a conflict-consensus mechanism standardizes these local discrepancies against the global context for robust deliberative judgment. Extensive experiments conducted on three real-world datasets demonstrate that DCCF consistently outperforms state-of-the-art baselines, achieving an average accuracy improvement of 3.52%.

Full Content

Digital platforms amplify misinformation via deceptive multimodal content. While automated defense is critical, mainstream approaches predominantly employ consistency-based fusion to align cross-modal features. However, we argue this premise is fundamentally flawed: the essence of fake news lies in inconsistency, in subtle clashes between visual and textual evidence. By prioritizing alignment and treating discrepancies as noise, existing models inadvertently dilute the conflicting signals that serve as the primary evidence of fabrication. Consequently, effective detection demands a paradigm shift: moving from seeking consensus to explicitly modeling and amplifying inconsistency.

(* Equal contribution. † Corresponding author: Quanchen Zou. The work was done at 360 AI Security Lab.)

Fake news detection methods generally fall into three categories. Unimodal methods analyze single data streams, yet this isolation creates information islands that overlook critical cross-modal inconsistencies. Multimodal fusion approaches, ranging from early concatenation (BMR [1]) to co-attention (SEER [2]), typically adopt consistency-seeking paradigms. These methods inadvertently smooth out vital conflict signals by treating discrepancies as noise and conflating objective facts with subjective sentiment. Recently, Large Language Models (LLMs) have been leveraged for their reasoning capabilities in approaches like INSIDE [3] and LIFE [4]. Nevertheless, the inherent focus of these models on semantic alignment hinders their ability to effectively capture and amplify the fine-grained inconsistent evidence essential for robust detection.

Despite innovations, existing methods suffer from fundamental limitations: (1) Semantic Entanglement, where objective content and subjective emotion are treated as a mixed signal, blurring the distinction between what is depicted and how it is described, making it difficult to distinguish factual mismatches from emotional dissonance. We resolve this by explicitly disentangling inputs via multi-task supervision. (2) Inconsistency Attenuation, where existing approaches prioritize alignment, inadvertently treating meaningful contradictions as noise and filtering them out, effectively smoothing out the critical discrepancy signals. In contrast, our framework pioneers an inconsistency-seeking paradigm, deploying a tension field network to explicitly amplify feature repulsions and extract maximally informative conflicts as primary evidence.

To address these limitations, we propose the Dynamic Conflict-Consensus Framework (DCCF). Inspired by physical field theory where tension [5] reflects the intensity of differences, our framework adopts a novel approach that actively searches for inconsistencies. It first separates inputs into factual content and emotional tone using multi-task supervision, guided by YOLO [6] and SenticNet [7], to distinguish objective entities from subjective feelings. A fact sentiment tension field network, modeling dynamic forces between features, then iteratively refines these features to highlight their differences. This process amplifies the contrast to extract the most significant conflicts as primary evidence, while simultaneously summarizing the overall style to serve as a global reference. By evaluating specific local conflicts against this global context, DCCF achieves robust, interpretable predictions. Our main contributions are:

  1. We propose DCCF, a novel inconsistency-seeking paradigm for multimodal fake news detection (MFND). Unlike consistency-seeking methods that blur critical signals, our framework models feature dynamics to amplify and extract inconsistency as primary evidence.
  2. We introduce an end-to-end fact-sentiment tension field network that quantifies tension metrics to expose latent inconsistencies. By standardizing extreme conflicts against global consensus, it transforms abstract feature dynamics into interpretable reasoning, pinpointing the exact evidence of fabrication.
  3. We validate DCCF’s effectiveness through extensive experiments on widely used MFND benchmarks. Our scheme shows significant performance gains over state-of-the-art baselines, demonstrating superior reliability and robustness.

Early unimodal methods [8] proved insufficient, motivating multimodal detection. Text-visual fusion, however, often interprets images superficially; combined with isolated text features, this creates information islands and weak reasoning that fail at cross-modal inconsistency detection. Our multi-stage framework addresses this [9]: we extract diverse factual and sentimental features for dynamic evolution, then derive high-level metrics of conflict, consensus, and inconsistency. This focus on evolved relationships rather than raw features [10] enables robust multi-view judgment by reasoning about inconsistencies.

Multi-domain learning models news data spanning diverse domains [11]. Approaches use hard sharing for domain-specific/cross-domain knowledge or soft sharing, such as gating networks or domain memory banks. However, these methods often just adjust view weights via domain embeddings, failing to learn domain-invariant/specific information [12]. Concatenating embeddings may fail to account for domain dependencies. Furthermore, these methods are single-modal, struggling with rich visual information [3].

Despite their architectural variations, existing unimodal, fusion-based [2], and LLM-driven [3] methods predominantly rely on a consistency-seeking paradigm [4], which aligns features and inadvertently smooths out the critical discrepancies indicative of deception. In contrast, our DCCF pioneers an inconsistency-seeking paradigm, explicitly modeling and amplifying these cross-modal conflicts to leverage inconsistency as the primary evidence for detection [1].

We propose the DCCF framework to detect multimodal fake news by modeling feature dynamics within decoupled semantic spaces. DCCF first disentangles inputs into distinct fact and sentiment spaces via multi-task learning, then employs a tension field network to identify polarization and inconsistency. As shown in Fig. 2, the framework comprises three progressive stages: (1) Fact-Sentiment Feature Extraction, (2) Feature Dynamics Evolution and Conflict-Consensus Metric Extraction, and (3) Multi-View Deliberative Judgment.

This stage projects input text and images onto specialized fact and sentiment feature spaces, anchoring features to visual objects and textual polarity, respectively. This separation effectively distinguishes objective factual inconsistencies from subjective sentimental conflicts, addressing the limitation of monolithic processing, which often conflates these distinct signals.

Initial Encoding. Using pretrained BERT [13] and ViT [14], we extract initial embeddings e_T and e_I from the raw text T and image I:

Fact Space Projection. Two independent MLPs project e_T and e_I into a shared fact space, yielding f^F_T and f^F_I. To strictly enforce objectivity and semantic alignment, we introduce an auxiliary task: 1. The image I is fed into a pretrained YOLO [6] to generate pseudo-labels e_Y representing key factual objects (e.g., entities, locations). 2. We train the projection to accurately predict e_Y via a BCE loss L_F, ensuring the space captures grounded reality:

Sentiment Space Projection. Simultaneously, two separate MLPs project e_T and e_I into a sentiment space, producing f^E_T and f^E_I. We orient this space towards subjective sentiment via another auxiliary task that captures high-level affective semantics: 1. Text T is processed by a lexicon (e.g., SenticNet) to obtain a sentiment polarity vector e_J. 2. We enforce f^E_T to predict this polarity using an MSE loss L_E:

The output comprises the fact feature space S_F = {f^F_T, f^F_I} and the sentiment feature space S_E = {f^E_T, f^E_I}.
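The projection heads and their auxiliary losses can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the layer sizes, the random stand-ins for the BERT/ViT embeddings e_T and e_I, the YOLO pseudo-labels e_Y, and the SenticNet polarity vector e_J are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # One-hidden-layer MLP projection: ReLU then linear.
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2

def bce_loss(logits, targets, eps=1e-9):
    # Multi-label BCE against the pseudo-label vector e_Y.
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps)))

def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))

d_in, d_proj, n_objects = 16, 8, 5

# Stand-ins for the initial BERT/ViT embeddings e_T, e_I.
e_T = rng.normal(size=d_in)
e_I = rng.normal(size=d_in)

def make_head():
    # Independent projection head per modality and per space.
    return (rng.normal(scale=0.1, size=(d_in, d_proj)), np.zeros(d_proj),
            rng.normal(scale=0.1, size=(d_proj, d_proj)), np.zeros(d_proj))

fact_T, fact_I = make_head(), make_head()
sent_T, sent_I = make_head(), make_head()

f_F_T, f_F_I = mlp(e_T, *fact_T), mlp(e_I, *fact_I)   # fact space features
f_E_T, f_E_I = mlp(e_T, *sent_T), mlp(e_I, *sent_I)   # sentiment space features

# Auxiliary supervision: random 0/1 stand-ins for YOLO pseudo-labels e_Y,
# and a random stand-in for the SenticNet polarity vector e_J.
e_Y = rng.integers(0, 2, size=n_objects).astype(float)
W_cls = rng.normal(scale=0.1, size=(d_proj, n_objects))
L_F = bce_loss(f_F_I @ W_cls, e_Y)   # fact space grounded on visual objects
e_J = rng.uniform(-1, 1, size=d_proj)
L_E = mse_loss(f_E_T, e_J)           # sentiment space tracks textual polarity
```

The two spaces share nothing but the inputs, which is what lets the later tension-field stage treat factual and sentimental conflicts separately.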

A dynamic evolution module (inspired by physical tension field theory) operates on S_F and S_E to amplify inconsistency (conflict) and distill global context (consensus).

Feature Dynamics Evolution. Given the input state S^(t) at iteration t, we compute a pairwise tension matrix T^(t), representing the potential difference in the semantic field:

Compute weights. Convert high tension to low attraction weight via softmax:

Aggregate and transform. Update features via a residual weighted sum and a non-linear transformation g, yielding the state S^(t+1):

Crucially, this iterative process acts as an inconsistency-seeking filter: it clusters semantically consistent neighbors while polarizing inconsistent ones, thereby preventing the over-smoothing of conflicting evidence common in graph-based fusion. The output is the final space S′ = S^(M) and the tension matrix T′ = T^(M-1).
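The three steps above (tension, weights, residual update) can be sketched as a minimal NumPy loop. The choice of squared Euclidean distance as the tension function and tanh as the transform g are our assumptions; the text does not pin them down.

```python
import numpy as np

def tension_field_step(S, tau=1.5):
    # Pairwise tension: squared Euclidean distance between features,
    # a stand-in for the "potential difference" T^(t) in the semantic field.
    diff = S[:, None, :] - S[None, :, :]
    T = np.sum(diff ** 2, axis=-1)
    # Convert high tension into low attraction weight via a temperature softmax.
    logits = -T / tau
    np.fill_diagonal(logits, -np.inf)          # no self-attraction
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    # Residual weighted aggregation with a non-linear transform g (tanh here).
    S_next = S + np.tanh(W @ S)
    return S_next, T

def evolve(S, M=4, tau=1.5):
    # M iterations; returns S' = S^(M) and the last tension matrix T' = T^(M-1).
    T = None
    for _ in range(M):
        S, T = tension_field_step(S, tau)
    return S, T

rng = np.random.default_rng(1)
S0 = rng.normal(size=(4, 8))   # e.g. fact/sentiment x text/image features
S_prime, T_prime = evolve(S0)
```

Note how T′ is the tension computed on S^(M-1), matching the indexing in the text: the final features are one update ahead of the matrix used to read conflicts off them.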

Conflict-Consensus Metric Extraction. From S′ and T′, we extract two key metrics to quantify internal contradictions: 1. Maximally Informative Conflicts. We identify the pair (f′_i, f′_j) in T′ with maximum tension as the key local conflict I_conflict: 2. Global Consensus. We compute the mean of S′ as C_consensus to represent the global context/tone:

Tone-reference standardization. We concatenate I_conflict and C_consensus into an MLP g_std. This tone-reference inconsistency standardization uses consensus to standardize conflict, ensuring that the magnitude of discrepancy is evaluated relative to the document’s specific semantic baseline rather than in isolation:

The outputs are refined fact (V_F) and sentiment (V_E) inconsistency vectors.
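The metric extraction and standardization can be sketched as follows, assuming the tension matrix and evolved features from the previous stage; the one-layer ReLU form of g_std is our illustrative choice.

```python
import numpy as np

def extract_metrics(S_prime, T_prime):
    # Maximally informative conflict: the feature pair with maximum tension.
    T = T_prime.copy()
    np.fill_diagonal(T, -np.inf)               # ignore self-pairs
    i, j = np.unravel_index(np.argmax(T), T.shape)
    I_conflict = np.concatenate([S_prime[i], S_prime[j]])
    # Global consensus: mean of the evolved features as the document's tone.
    C_consensus = S_prime.mean(axis=0)
    return I_conflict, C_consensus

def g_std(I_conflict, C_consensus, W, b):
    # Tone-reference standardization: an MLP over [conflict; consensus], so the
    # discrepancy is judged against the document's own semantic baseline.
    x = np.concatenate([I_conflict, C_consensus])
    return np.maximum(0.0, x @ W + b)

rng = np.random.default_rng(2)
S_prime = rng.normal(size=(4, 8))
T_raw = rng.uniform(size=(4, 4))
T_prime = (T_raw + T_raw.T) / 2                # symmetric toy tension matrix

I_conflict, C_consensus = extract_metrics(S_prime, T_prime)
W = rng.normal(scale=0.1, size=(24, 8))
V = g_std(I_conflict, C_consensus, W, np.zeros(8))   # e.g. the fact view V_F
```

Running this once per subspace (fact and sentiment) yields the two refined inconsistency vectors V_F and V_E.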

This stage deliberates on the standardized inconsistencies from both semantic subspaces to form a final judgment.

View fusion. We concatenate V_F and V_E into a final representation V_final, encapsulating complementary dual-view inconsistency:

Final classification. V_final is fed into a classifier (e.g., MLP with sigmoid) to predict the probability ŷ that the news is fake.

The total loss L_total combines the prediction BCE loss L_final with the auxiliary losses L_F and L_E, balanced by λ_F and λ_E:
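The objective L_total = L_final + λ_F·L_F + λ_E·L_E can be sketched directly; the linear classifier head and the dummy auxiliary loss values are illustrative stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_hat, y, eps=1e-9):
    # Binary cross-entropy for the final fake/real prediction.
    return float(-(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)))

def total_loss(logit, y, L_F, L_E, lam_F=0.075, lam_E=0.075):
    # L_total = L_final + lambda_F * L_F + lambda_E * L_E
    L_final = bce(sigmoid(logit), y)
    return L_final + lam_F * L_F + lam_E * L_E

# Example: a fused vector V_final scored by a linear head (illustrative).
rng = np.random.default_rng(3)
V_final = rng.normal(size=16)              # [V_F; V_E] concatenation
w, b = rng.normal(scale=0.1, size=16), 0.0
y_hat_logit = float(V_final @ w + b)
L = total_loss(y_hat_logit, y=1.0, L_F=0.4, L_E=0.2)
```

With λ_F = λ_E = 0.075 (the values used in the experiments), the auxiliary terms gently regularize the spaces without overpowering the detection loss.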

Joint optimization ensures the model accurately captures visual objective cues, textual sentiment nuances, and critical inconsistencies, yielding robust detection performance.

We validate DCCF with experiments detailing our datasets, baselines, and implementation.

Datasets. We use three benchmarks: Weibo [11], Weibo21 [15], and GossipCop [16], following established protocols. Weibo [11] includes 7,532 training (3,749 real/3,783 fake) and 1,996 test (996 real/1,000 fake) articles. Weibo21 [15] has 9,127 total articles (4,640 real/4,487 fake). GossipCop [16] provides 10,010 training (7,974 real/2,036 fake) and 2,830 testing (2,285 real/545 fake) instances.

Baselines. We benchmark against three categories: (1) Unimodal methods (MVAN [10], SpotFake [8]). (2) Cross-domain generalization (EANN [11], FND-CLIP [12], MIMoE-FND [16], KEN [17]). (3) LLM distillation (GLPN-LLM [18], INSIDE [3], LIFE [4]).

Implementation Details. Visual features used a pretrained MAE [19] with 224×224 images. Text used bert-base-chinese [13] (Weibo/Weibo21) and bert-base-uncased [13] (GossipCop), truncated to 197 tokens. Features were aligned using CLIP [20]. The auxiliary loss coefficients λ_F and λ_E were both set to 0.075. The model used PyTorch, trained on one NVIDIA RTX 4090 GPU for 50 epochs with early stopping.
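The stated hyperparameters can be collected into a single configuration sketch; the dict layout and key names are ours, not from any released code.

```python
# Hyperparameters as stated in the implementation details above;
# the structure and key names are illustrative.
config = {
    "visual_encoder": "MAE (pretrained)",
    "image_size": (224, 224),
    "text_encoder": {
        "weibo/weibo21": "bert-base-chinese",
        "gossipcop": "bert-base-uncased",
    },
    "max_text_tokens": 197,
    "alignment": "CLIP",
    "lambda_F": 0.075,
    "lambda_E": 0.075,
    "epochs": 50,
    "early_stopping": True,
    "hardware": "1x NVIDIA RTX 4090",
}
```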

To validate DCCF’s superiority, we compare it against 11 baselines on three datasets (Table I). From the results, we draw these key observations: (O1): DCCF consistently achieves state-of-the-art performance across diverse benchmarks. On Weibo, it surpasses the strongest baseline (LIFE) by 1.1% in accuracy and 2.7% in F1-Fake, highlighting its capability in detecting standard multimodal inconsistencies.

(O2): On the recent Weibo21 dataset, DCCF maintains a competitive edge. Despite a narrower margin against the top baseline (MIMoE-FND), our model secures the best results across all four metrics. This confirms that our dynamic conflict-seeking paradigm generalizes effectively to varying data distributions.

(O3): DCCF demonstrates superior robustness on the imbalanced GossipCop (80% real). Unlike baselines that trade off F1 scores, DCCF achieves the highest accuracy (0.904) and balanced performance (F1-Fake: 0.723, F1-Real: 0.946), effectively mitigating class imbalance pitfalls.

(O4): The results validate our architectural hypothesis. Disentangling features into fact/sentiment spaces and amplifying inconsistency via the tension field network proves more effective than both traditional fusion (e.g., BMR [1], SEER [2]) and LLM-distillation approaches (e.g., INSIDE [3], LIFE [4]).

To understand DCCF’s core components, we ran an ablation study (Table II).

We analyzed parameter sensitivity on three datasets, focusing on four key parameters: the DARFU iterations (m), the temperature coefficient (τ), the fact auxiliary loss coefficient (λ_F), and the sentiment auxiliary loss coefficient (λ_E). As shown in the figures, the model is robust. Performance forms a bell-shaped curve, peaking at m=4, τ=1.5, and a 7.5% loss weight for both fields. Performance declines gracefully from these optimal points but remains high, demonstrating that the model is effective across various configurations and not overly sensitive.

We investigate DCCF’s interpretability with a case study of two challenging GossipCop instances. This illustrates how DCCF acquires and distills deep reasoning into its text and image representations.

The first case is a Text Fabrication. Both DCCF and the baseline MIMoE-FND [16] correctly classified it.

The second case is an Image-Text mismatch. DCCF correctly predicted Fake news, while the baseline MIMoE-FND [16] failed, misclassifying it as Real news.

Fig. 4 shows t-SNE visualizations of features from DCCF, MIMoE-FND [16], and KEN [17] on the Weibo and Weibo21 test sets. Compared to the baselines, DCCF produces fewer fake news outliers and less overlap between real and fake news embeddings, confirming its superior performance.

On Weibo21, DCCF’s features form multiple, clearly separated subclusters, unlike the single clusters on Weibo. This suggests Weibo21 has varying topics and that DCCF not only distinguishes authenticity but also captures deep, event-level semantic information, spatially distinguishing different events.

In this paper, we propose DCCF, a novel inconsistency-seeking paradigm. DCCF initially employs fact-sentiment feature extraction guided by multi-task supervision to decouple semantic spaces. Subsequently, the fact-sentiment tension field network iteratively models feature dynamics to polarize representations, distilling interpretable maximally informative conflicts and global consensus metrics. Finally, the Multi-View Deliberative Judgment fuses these standardized indicators for robust detection. Extensive experiments validate DCCF’s superiority. A primary limitation is the framework’s reliance on the quality of auxiliary pseudo-labels, where upstream noise may propagate to the decoupled spaces. Future work will explore integrating Large Language Models to enhance the robustness of these semantic constraints.


