Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics
Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known to capture low-level semantics effectively. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, showing that it achieves higher attack success under the same iteration budget and attack strength. We believe that the proposed method will contribute to deeper analysis of the robustness of VQA metrics.
💡 Research Summary
The paper investigates the vulnerability of modern image and video quality assessment (IQA/VQA) metrics to adversarial attacks, focusing on the largely unexplored black‑box scenario for video quality metrics. While most prior work has examined white‑box attacks and reported poor transferability of adversarial examples across different VQA models, the authors propose a novel cross‑modal attack framework called IC2VQA that leverages the similarity of low‑level feature spaces between images and videos.
Problem formulation
Given a video (x) with (N) frames, the goal is to find a perturbation (\delta), subject to an (\ell_\infty) budget (\epsilon), such that the VQA model's score (f(x+\delta)) is artificially inflated relative to the original score (f(x)), while the perturbation remains imperceptible to human observers.
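This constraint set is the standard (\ell_\infty)-ball projection used in iterative attacks. As a minimal sketch (not the authors' implementation), one sign-gradient ascent step with projection onto the budget and the valid pixel range might look like this; `grad` stands in for the gradient of the surrogate quality score with respect to the input, and all names are illustrative:

```python
import numpy as np

def pgd_step(x, grad, delta, alpha=1.0 / 255, eps=4.0 / 255):
    """One sign-gradient ascent step on the quality score, followed by
    projection of delta onto the l_inf ball of radius eps and clipping
    so that x + delta stays a valid video in [0, 1]."""
    delta = delta + alpha * np.sign(grad)        # ascend the score
    delta = np.clip(delta, -eps, eps)            # enforce l_inf budget
    delta = np.clip(x + delta, 0.0, 1.0) - x     # keep pixels valid
    return delta

# Toy usage: a 2-frame, 2x2 grayscale "video" with a dummy gradient.
x = np.full((2, 2, 2), 0.5)
grad = np.ones_like(x)
delta = pgd_step(x, grad, np.zeros_like(x))
```

Iterating this step while keeping (\lVert\delta\rVert_\infty \le \epsilon) is what makes the resulting perturbation hard to notice visually.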
Methodology
- Layer‑wise decomposition of IQA models – Each image quality metric (g) is expressed as a composition of layers (h_1,\dots,h_K). The intermediate representations (g_k(x_i)=h_k\circ\cdots\circ h_1(x_i)) are extracted for every frame (x_i).
- Cross‑layer loss – For a chosen layer (k), the cosine similarity between the original and perturbed representations is minimized:
\[
\mathcal{L}_{\text{cos}} = \frac{\langle g_k(x_i),\, g_k(x_i+\delta)\rangle}{\lVert g_k(x_i)\rVert\,\lVert g_k(x_i+\delta)\rVert}
\]
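The cross-layer loss above can be sketched in a few lines. In this illustrative snippet (not the authors' code), `feat_orig` and `feat_adv` stand for the intermediate representations (g_k(x_i)) of the original and perturbed frame; minimizing the returned value pushes the perturbed representation away from the original one:

```python
import numpy as np

def cosine_loss(feat_orig, feat_adv, eps=1e-12):
    """Cosine similarity between flattened feature maps.
    Minimizing this drives the perturbed features away from
    the original ones, which aids cross-model transferability."""
    a = feat_orig.ravel()
    b = feat_adv.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Identical features give similarity ~1; orthogonal features give ~0.
f_same = np.array([1.0, 0.0])
f_orth = np.array([0.0, 1.0])
```

The flattening makes the loss agnostic to the layer's spatial shape, so the same expression applies to any chosen layer (k).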