MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research focuses primarily on video-level classification, leaving the practically crucial task of temporal localisation (identifying *when* hateful segments occur) largely unaddressed. The challenge is even more pronounced under weak supervision, where only video-level labels are available and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders that model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion that adaptively emphasises the most informative modality at each moment, together with a cross-modal contrastive alignment strategy that enhances multimodal feature consistency; and (3) a modality-aware MIL objective that identifies discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that the method achieves state-of-the-art performance on the localisation task.


💡 Research Summary

The paper “MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos” tackles a previously under‑explored problem: identifying the exact temporal segments in which hateful content appears in videos that combine visual, acoustic, and textual cues. Unlike most prior work that only produces a video‑level binary decision, MultiHateLoc operates under a weakly‑supervised setting, using only video‑level labels while delivering frame‑level predictions.

Problem Definition
Given a video, three synchronized streams are extracted: (i) visual frames encoded by a pre‑trained ViT‑B/16 (768‑dimensional per frame), (ii) audio clips processed by VGGish (128‑dimensional per second), and (iii) textual transcripts obtained via Whisper, split into sentence‑wise fragments, each encoded by BERT‑base (768‑dimensional) and expanded to match the frame resolution. The task is to learn a mapping from these multimodal sequences to a per‑frame probability of hate, trained only with a binary video‑level label (1 = contains hate, 0 = clean).
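The alignment of the three streams to a common frame resolution can be sketched as follows. This is a minimal illustration, assuming (as the summary describes) one ViT embedding per frame, one VGGish embedding per second, and one BERT embedding per transcript sentence repeated over the frames that sentence covers; the frame count and sentence spans here are made-up toy values.

```python
import numpy as np

# Toy dimensions taken from the summary: ViT-B/16 -> 768-d, VGGish -> 128-d,
# BERT-base -> 768-d. T is a hypothetical frame count.
T = 10
visual = np.random.randn(T, 768)   # one ViT embedding per frame
audio = np.random.randn(T, 128)    # one VGGish embedding per second (assumed ~1 fps here)

# Two hypothetical transcript sentences covering frames [0, 6) and [6, 10);
# each sentence-level BERT embedding is repeated to match the frame resolution.
sent_emb = np.random.randn(2, 768)
sent_spans = [(0, 6), (6, 10)]
text = np.zeros((T, 768))
for emb, (start, end) in zip(sent_emb, sent_spans):
    text[start:end] = emb          # expand the sentence embedding over its frames

assert visual.shape == (T, 768) and audio.shape == (T, 128) and text.shape == (T, 768)
```

After this step, all three sequences share the same temporal axis, which is what allows the per-time-step fusion and frame-level contrastive pairing described below.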

Core Architecture

  1. Modality‑Aware Temporal Encoding (MA‑TE) – Each modality is fed into its own Transformer‑based temporal encoder rather than a shared backbone. This respects the distinct temporal dynamics of visual motion, acoustic prosody, and textual semantics. The dedicated encoders capture intra‑modal dependencies more effectively than a generic model.

  2. Dynamic Cross‑Modal Fusion (DCM‑Fusion) – At every time step, an attention‑driven gating mechanism learns modality‑specific weights, allowing the model to emphasize the most informative stream (e.g., visual cues during a hateful gesture, audio during a slur, text during an on‑screen caption). This dynamic fusion overcomes the limitations of static early‑fusion or late‑fusion schemes.

  3. Cross‑Modal Contrastive Alignment – Positive pairs consist of embeddings from different modalities belonging to the same frame, while negative pairs are drawn from different videos or non‑hate segments. An InfoNCE loss aligns the modalities in a shared latent space, improving consistency across streams and aiding the weak supervision signal.

  4. Modality‑Aware Top‑K Multiple Instance Learning (MIL) – Because only video‑level supervision is available, the model selects the top‑K frames (per modality) with the highest predicted scores and averages them to produce the video‑level prediction. The loss pushes positive videos to have high‑scoring frames and negative videos to keep scores low. K is dynamically adjusted based on video length and label prevalence, and the MIL loss is summed across modalities, ensuring that a weak modality can be compensated by stronger ones.
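The dynamic fusion in component 2 can be sketched as a per-time-step softmax gate over the three modality streams. This is an illustrative NumPy sketch, assuming the streams have already been projected to a shared dimension `D`; the gate parameterisation (`W_g`) is an assumption, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: 3 modality streams (visual, audio, text), T steps, dim D.
T, D = 10, 64
streams = np.stack([np.random.randn(T, D) for _ in range(3)])  # (3, T, D)
W_g = np.random.randn(D, 1) * 0.1   # illustrative gate projection

logits = (streams @ W_g).squeeze(-1)   # (3, T): one score per modality per step
gates = softmax(logits, axis=0)        # modality weights sum to 1 at each step
fused = (gates[..., None] * streams).sum(axis=0)   # (T, D) fused sequence
```

Because the gate is recomputed at every time step, the fused sequence can lean on visual features during a hateful gesture and switch to audio during a spoken slur, which a static early- or late-fusion scheme cannot do.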
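The contrastive alignment in component 3 uses an InfoNCE objective. A minimal sketch over two modalities, assuming the positive for frame t in modality A is frame t in modality B and all other frames act as negatives (the temperature value is an assumption):

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """Illustrative InfoNCE: align paired frame embeddings across two modalities."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / tau                       # (N, N) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)    # stabilise the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # diagonal entries are the positives

# Toy batch of 8 paired frame embeddings with dimension 32.
loss = info_nce(np.random.randn(8, 32), np.random.randn(8, 32))
```

Minimising this loss pulls the per-frame embeddings of different modalities together in the shared latent space while pushing apart mismatched frames, which is the consistency signal the summary describes.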
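The Top-K MIL objective in component 4 can be sketched for a single modality as follows. The rule for choosing K here (`T // 8`, clipped to at least 1) is a stand-in; the paper adjusts K dynamically based on video length and label prevalence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topk_mil_loss(frame_logits, video_label):
    """Average the K highest frame scores, then apply BCE against the video label."""
    T = len(frame_logits)
    k = max(1, T // 8)                       # illustrative K schedule, not the paper's
    topk = np.sort(frame_logits)[-k:]        # K highest-scoring frames
    p = sigmoid(topk.mean())                 # video-level hate probability
    eps = 1e-7
    return -(video_label * np.log(p + eps) + (1 - video_label) * np.log(1 - p + eps))

# Toy frame logits: a mostly clean video with a few confidently hateful frames.
logits = np.array([-3.0, -2.5, 4.0, 3.5, -2.0, -3.0, -2.8, -2.2, 3.8, -2.6])
hateful_loss = topk_mil_loss(logits, 1)   # low: the top frames already score high
clean_loss = topk_mil_loss(logits, 0)     # high: same frames are penalised for a clean label
```

This illustrates the weak-supervision mechanics: a positive video needs only a few high-scoring frames to satisfy the loss, while a negative video is pushed to keep every frame's score low, which is what yields frame-level localisation from video-level labels.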

Training Objective
The total loss is a weighted sum of (i) binary cross-entropy on the video-level prediction, (ii) the contrastive alignment loss, and (iii) the modality-specific Top-K MIL losses. Optimisation uses AdamW, and during inference each frame's score is passed through a sigmoid to obtain a hate probability in [0, 1].
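Written out, and assuming the weights are scalar hyper-parameters (the symbols λ₁, λ₂ and the modality index set are our notation, not necessarily the paper's), the combined objective takes the form:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{BCE}}
\;+\; \lambda_1 \, \mathcal{L}_{\mathrm{align}}
\;+\; \lambda_2 \sum_{m \in \{v,\, a,\, t\}} \mathcal{L}_{\mathrm{MIL}}^{(m)}
```

where $\mathcal{L}_{\mathrm{BCE}}$ is the video-level binary cross-entropy, $\mathcal{L}_{\mathrm{align}}$ the cross-modal InfoNCE alignment loss, and $\mathcal{L}_{\mathrm{MIL}}^{(m)}$ the Top-K MIL loss for modality $m$ (visual, audio, text).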

