GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures that existing frameworks underutilize. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in both generation and retrieval.
💡 Research Summary
The paper introduces GMS‑CAVP, a unified pre‑training framework that simultaneously leverages multi‑scale contrastive learning and multi‑scale diffusion‑based generation to improve audio‑video correspondence. Existing contrastive audio‑video pre‑training (CAVP) methods align modalities only at a single, global representation level, which fails to capture fine‑grained spatial‑temporal relationships that naturally exist across multiple resolutions in both video and audio streams. GMS‑CAVP addresses this gap with two complementary components.
- Multi‑scale Spatial‑Temporal Alignment (MSA): Video frames and audio spectrograms are first encoded by pretrained backbones (e.g., a video CNN and an audio CNN). The resulting feature sequences are then decomposed into L hierarchical scales using temporal pyramidal pooling and multi‑resolution convolutions. For each scale l, a standard InfoNCE contrastive loss is computed between the video and audio embeddings, and the total contrastive objective is the sum over all scales. An adaptive temporal attention weight wₜ = softmax(Fᵥₜ·Fₐₜ) further emphasizes salient moments (e.g., action peaks) while down‑weighting noisy or irrelevant segments. This design forces the model to learn alignment at both coarse and fine granularities, leading to substantially higher temporal synchronization scores.
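The per-scale InfoNCE objective summed over the pyramid can be sketched as follows. This is a minimal numpy illustration of the idea, not the authors' implementation; the function names, the temperature value, and the toy feature shapes are assumptions.

```python
import numpy as np

def info_nce(v, a, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = v @ a.T / tau                       # (B, B) cosine similarities / temperature
    labels = np.arange(len(v))                   # matched pairs lie on the diagonal

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average of video->audio and audio->video directions
    return 0.5 * (ce(logits) + ce(logits.T))

def multiscale_loss(video_feats, audio_feats):
    """Sum the contrastive loss over the L pyramid scales, as in the MSA objective."""
    return sum(info_nce(v, a) for v, a in zip(video_feats, audio_feats))

# toy example: 3 scales, batch of 4 clips, 16-dim embeddings per scale
rng = np.random.default_rng(0)
vs = [rng.normal(size=(4, 16)) for _ in range(3)]
loss = multiscale_loss(vs, [v + 0.1 * rng.normal(size=v.shape) for v in vs])
```

Summing per-scale losses keeps each granularity's gradient signal independent, so coarse (scene-level) and fine (onset-level) mismatches are penalized separately.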
- Multi‑scale Spatial‑Temporal Diffusion (MSD): To bridge the generative gap, a hierarchical diffusion decoder is trained to denoise audio latent representations conditioned on the multi‑scale video features. The generative process follows pθ(A₀|V) = ∏ₜ pθ(Aₜ|Aₜ₊₁, Fᵥ^{multi}), where each denoising step receives conditioning from all video scales. The diffusion loss is the expected L2 distance between the true noise ε and the model's prediction εθ across resolutions. By progressively refining audio from coarse to fine scales while simultaneously attending to corresponding video contexts, the generated sound exhibits both high fidelity (low KLD, low FAD) and precise temporal alignment with visual events.
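The ε-prediction training objective described above can be sketched as a single DDPM training step. This is a hedged numpy toy, assuming a standard linear noise schedule and a stand-in denoiser; the real MSD decoder conditions on all video scales, which is abstracted here into one `cond` argument.

```python
import numpy as np

def diffusion_loss(x0, cond, eps_model, alphas_bar, rng):
    """One DDPM training step: predict the noise injected at a random timestep.

    x0: clean audio latents (B, D); cond: video conditioning features (B, D).
    eps_model(x_t, t, cond) -> predicted noise, same shape as x0.
    """
    B = x0.shape[0]
    t = rng.integers(0, len(alphas_bar), size=B)     # per-sample timestep
    eps = rng.normal(size=x0.shape)                  # true noise ε
    a = alphas_bar[t][:, None]
    # forward process: x_t = sqrt(ᾱ_t) x0 + sqrt(1 - ᾱ_t) ε
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    pred = eps_model(x_t, t, cond)
    return np.mean((eps - pred) ** 2)                # expected L2 distance ||ε - εθ||²

# toy run with a (hypothetical) denoiser that predicts zero noise
rng = np.random.default_rng(0)
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))
x0 = rng.normal(size=(8, 32))
cond = rng.normal(size=(8, 32))
loss = diffusion_loss(x0, cond, lambda x, t, c: np.zeros_like(x), alphas_bar, rng)
```

Since ε is standard normal, a zero-predicting denoiser yields a loss near 1; a trained εθ drives it toward 0.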
Experimental Setup: The authors evaluate on three large‑scale datasets: VGGSound (200 k 10‑second clips), AudioSet (≈2 M clips), and Panda70M (70 M pairs). Video frames are resized to 224 × 224, and audio is represented as 128 × 128 log‑mel spectrograms (10 s at 8 kHz). Training uses Adam (learning rate 1e‑4) with a batch size of 64 for 200 epochs; diffusion hyper‑parameters match prior work (Diff‑Foley).
Results: In video‑to‑audio generation, GMS‑CAVP achieves KLD 1.63, FAD 0.75, and Alignment Accuracy 95.87 %, outperforming strong baselines such as SpecVQGAN, Im2Wav, Diff‑Foley, FoleyGen, V2A‑Mapper, Seeing & Hearing, MaskV‑AT, VAB, and VATT. In cross‑modal retrieval, the model raises Recall@1/5/10 from 9.5 %/25.4 %/35.1 % (CAVP) to 28.9 %/43.7 %/57.9 % for video‑to‑audio, with similar gains for audio‑to‑video. Ablation studies show that MSA alone improves alignment (≈82 % Acc) while MSD alone reduces KLD/FAD; the combination yields the best overall performance. Additional analyses explore the impact of diffusion step count, bidirectional training intervals, the number of spatial scales, and data scaling, confirming that larger training corpora (VGGSound + AudioSet + Panda70M) further boost results (KLD 1.35, FAD 0.58).
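The Recall@K retrieval metric reported above is computed from a query-item similarity matrix. A minimal sketch, assuming (as is standard for paired retrieval benchmarks) that the ground-truth match for query i is item i, i.e., the diagonal of the matrix:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for paired retrieval: sim[i, j] = score(query_i, item_j),
    with the correct match on the diagonal."""
    ranks = (-sim).argsort(axis=1)                       # items sorted best-first per query
    # rank position of the ground-truth item for each query
    pos = (ranks == np.arange(len(sim))[:, None]).argmax(axis=1)
    return {k: float((pos < k).mean()) for k in ks}

# toy example: 20 queries whose true pair usually (not always) scores highest
rng = np.random.default_rng(0)
sim = rng.normal(size=(20, 20)) + 3.0 * np.eye(20)
scores = recall_at_k(sim)
```

The same function covers both directions: pass the video-to-audio similarity matrix as-is, and its transpose for audio-to-video.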
Insights and Limitations: The work convincingly demonstrates that hierarchical, multi‑scale modeling is crucial for both discriminative and generative audio‑video tasks. The adaptive temporal attention mechanism effectively mitigates noisy segments, and the hierarchical diffusion conditioning enables the generator to respect both short‑term motions and long‑term scene changes. However, the experiments are confined to relatively low‑resolution inputs (224 × 224 video, 128 × 128 spectrogram). The paper does not assess scalability to higher‑resolution media or real‑time inference costs, which are important for practical deployment. Moreover, while the diffusion model improves quality, it remains computationally intensive; future work could explore efficient sampling or distillation techniques.
Conclusion: GMS‑CAVP sets a new state‑of‑the‑art for audio‑video pre‑training by unifying multi‑scale contrastive alignment with multi‑scale diffusion generation. The approach yields substantial gains in generation fidelity, temporal synchronization, and cross‑modal retrieval, establishing a strong baseline for future research on hierarchical multimodal representation learning.