SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation


This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task. We introduce a spatially aligned audio-visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio-video generation model, and the other is a two-stage method that combines a video generation model and a video-to-audio generation model. Our experimental results demonstrate that gaps exist between the baseline methods and the ground truth in terms of video and audio quality, as well as spatial alignment between the two modalities.


💡 Research Summary

The paper introduces SAVGBench, a benchmark designed to evaluate the emerging task of Spatially Aligned Audio‑Video Generation (SAVG). Existing generative models excel at producing high‑quality video, yet they largely ignore the spatial relationship between sound sources and visual objects, a factor crucial for immersive experiences such as virtual reality, gaming, and simulation. To fill this gap, the authors construct a curated dataset and a novel evaluation metric, then benchmark two baseline generative approaches.

Dataset construction
The authors start from the STARSS23 collection, which contains 360° video, first‑order Ambisonics (FOA) audio, and spatio‑temporal annotations of sound events. They convert the omnidirectional data into a more common format: perspective video (256 × 256 pixels, 4 fps) and stereo audio (16 kHz). Horizontal viewing angles are sampled every 10° while the vertical angle stays at 0°, yielding many overlapping clips. Only clips whose sound source is visually on‑screen are retained, and the class set is limited to human speech and musical instruments, since these categories are reliably detected by both object detectors and sound event localization and detection (SELD) models. After filtering out overlapping events, low‑energy clips, and audio clipping, the final development set contains 5,031 five‑second clips (≈7 h). An evaluation split with the same speech‑to‑instrument ratio is also released.
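The view-sampling and on-screen filtering described above can be sketched in a few lines. This is an assumed reconstruction of the logic, not the authors' code; in particular, the 90° horizontal field of view is a placeholder value the paper does not specify.

```python
AZIMUTH_STEP_DEG = 10  # horizontal sampling step described in the paper

def sample_view_angles():
    """Return the (azimuth, elevation) pairs for all sampled views.

    Azimuth is swept every 10 degrees; elevation stays fixed at 0.
    """
    return [(az, 0) for az in range(0, 360, AZIMUTH_STEP_DEG)]

def is_onscreen(event_az_deg, view_az_deg, hfov_deg=90.0):
    """True if a sound event's azimuth falls inside the view's horizontal FOV.

    `hfov_deg` is an assumed field of view, not a value from the paper.
    """
    # Wrap the angular difference into [-180, 180) before comparing.
    diff = (event_az_deg - view_az_deg + 180) % 360 - 180
    return abs(diff) <= hfov_deg / 2

views = sample_view_angles()  # 36 candidate views per 360-degree recording
```

A clip from a given view would then be kept only if `is_onscreen` holds for its annotated sound events over the clip's duration.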

Spatial alignment metric (Spatial AV‑Align)
Traditional quality metrics (FVD, KVD for video; FAD for audio) assess fidelity but not spatial correspondence. The authors therefore propose Spatial AV‑Align, a recall‑style score ranging from 0 to 1. The pipeline works as follows:

  1. A pretrained YOLOX detector extracts 2‑D bounding boxes of “person” objects from each video frame (4 fps).
  2. A stereo SELD network, trained on the development set, predicts per‑frame activity and horizontal position for the “speech” and “instrument” classes (10 fps).
  3. For each audio frame, the nearest video frame is identified; if the SELD‑estimated horizontal position falls within a detected bounding box, the pair counts as a true positive, otherwise as a false negative.
  4. The final score is TP / (TP + FN). Because the metric relies only on detections, it can be applied to generated samples without any ground‑truth audio.
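The four steps above reduce to a simple recall computation. The sketch below assumes normalized horizontal coordinates and the paper's frame rates (10 fps audio, 4 fps video); the data structures are illustrative, not the authors' implementation.

```python
def spatial_av_align(audio_events, video_boxes, audio_fps=10, video_fps=4):
    """Recall-style Spatial AV-Align score in [0, 1].

    audio_events: list of (t_idx, x) pairs, one per SELD-active audio frame,
                  where x is the estimated horizontal position in [0, 1].
    video_boxes:  dict mapping a video frame index to a list of (x0, x1)
                  horizontal extents of detected boxes, also in [0, 1].
    Returns TP / (TP + FN), or None if no audio frame is active.
    """
    tp = fn = 0
    for t_idx, x in audio_events:
        # Step 3: map each audio frame to the nearest video frame.
        v_idx = round(t_idx * video_fps / audio_fps)
        boxes = video_boxes.get(v_idx, [])
        # True positive if the estimated position falls inside any box.
        if any(x0 <= x <= x1 for (x0, x1) in boxes):
            tp += 1
        else:
            fn += 1
    return tp / (tp + fn) if (tp + fn) else None
```

Because both inputs come from detectors run on the samples themselves, the score needs no ground-truth audio, which is what makes it applicable to fully generated clips.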

Baseline methods
Two families are evaluated in an unconditional generation setting:

Joint method (Stereo MM‑Diffusion) – Extends the multimodal diffusion model MM‑Diffusion to handle stereo audio. Separate audio and video encoders map waveforms (2 × C × T) and video tensors (F × C × H × W) into a shared latent space, followed by multimodal attention. Due to GPU constraints, generation occurs at 64 × 64 resolution; a separate super‑resolution diffusion model upsamples the video to 256 × 256. Sampling uses DPM‑Solver for speed, while the super‑resolution stage remains a DDPM.

Two‑stage method (Video diffusion + Stereo MMAudio) – First a video diffusion model (same architecture as MM‑Diffusion) generates a low‑resolution video. Then a stereo‑extended MMAudio model, conditioned on the generated video, synthesizes stereo audio. This decoupled pipeline allows each modality to be optimized independently but introduces additional latency and potential misalignment.
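The decoupling in the two-stage method can be made concrete with a minimal interface sketch. The class names, method signatures, and zero-filled outputs below are placeholders (the real models run diffusion sampling); only the tensor shapes follow the paper's setup of 5-second clips at 4 fps, 64 × 64 generation resolution, and 16 kHz stereo audio.

```python
import numpy as np

class VideoDiffusion:
    """Placeholder for the stage-1 video diffusion model."""
    def sample(self, num_frames=20, channels=3, height=64, width=64):
        # A real model would run reverse diffusion; shape: (F, C, H, W).
        return np.zeros((num_frames, channels, height, width))

class StereoMMAudio:
    """Placeholder for the stage-2 video-conditioned stereo audio model."""
    def sample(self, video, sample_rate=16000, duration_s=5):
        # A real model would condition on `video`; shape: (2, T).
        return np.zeros((2, sample_rate * duration_s))

def two_stage_generate(video_model, audio_model):
    video = video_model.sample()       # stage 1: low-resolution video
    audio = audio_model.sample(video)  # stage 2: audio conditioned on it
    return video, audio
```

The sketch makes the trade-off visible: audio generation only ever sees the stage-1 output, so any spatial information lost at 64 × 64 cannot be recovered downstream.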

Both baselines are trained on the curated dataset without any conditioning (unconditional generation).

Experimental findings

  • Fidelity: The joint model achieves slightly better Fréchet Video Distance and Kernel Video Distance, while both models show comparable Fréchet Audio Distance.
  • Temporal synchronization: Both approaches perform comparably on existing temporal‑alignment metrics, suggesting that the dataset’s on‑screen constraint simplifies timing.
  • Spatial alignment: This is where the gap is most pronounced. The joint method attains a Spatial AV‑Align score of ~0.42, and the two‑stage method ~0.35, far below the near‑perfect scores of real data. Human subjective tests confirm that generated samples often exhibit mismatched sound‑source locations, especially when objects are small or blurred at low resolution.

The authors attribute the low alignment scores to (i) the coarse 64 × 64 video resolution used during diffusion, which hampers object detection, (ii) limited capacity of current diffusion models to learn precise spatial correspondences, and (iii) the reliance on a simple recall metric that penalizes any missed overlap.

Contributions and impact

  1. Definition of the SAVG task and release of a publicly available, well‑curated dataset (development and evaluation splits).
  2. Introduction of Spatial AV‑Align, a detection‑based metric that does not require ground‑truth audio, enabling evaluation of fully generated multimodal samples.
  3. Baseline benchmarks (joint diffusion vs. two‑stage pipeline) that expose current limitations in spatial audio‑visual generation.

Future directions
The paper suggests several avenues:

  • Training diffusion models directly at higher resolutions to preserve object details for detection.
  • Extending audio representations to Ambisonics or binaural formats, allowing vertical and depth localization.
  • Incorporating the Spatial AV‑Align loss into the training objective to explicitly enforce spatial consistency.
  • Broadening the class set beyond human‑related sounds, which will require more robust object detectors and SELD models.

In summary, SAVGBench establishes the first systematic framework for assessing how well generative models align sound sources with visual content. The provided resources lay a solid foundation for the community to develop next‑generation multimodal generators that can produce not only realistic video and audio but also coherent spatial relationships, a key step toward truly immersive synthetic media.

