Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
💡 Deep Analysis
Deep Dive into ViSAudio: high-quality binaural audio that adapts to viewpoint changes and moving sound sources.
📄 Full Content
Figure 1. Overview. Left: The BiAudio dataset converts 360° videos and FOA audio into perspective-video and binaural-audio pairs, employing diverse camera rotations to enhance spatial cues. Middle: Our end-to-end pipeline employs conditional flow matching with a dual-branch generation architecture, integrated with a conditional spacetime module, to generate spatially immersive binaural audio from multimodal inputs. Right: Example results generated by ViSAudio. In the top example, our model faithfully generates the visible sound of waves crashing, highlighted with red boxes in both the video frames and the audio waveform, with the left channel louder since the sound event occurs on the left. It also captures subtle environmental sounds like ocean noise, highlighted with blue boxes, demonstrating its ability to reproduce fine-grained background acoustics. In the bottom example, as the camera rotates right, the marimba sound moves left, increasing the left-channel amplitude while decreasing the right, demonstrating dynamic adaptation to viewpoint changes.
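The FOA-to-binaural conversion described in the caption can be illustrated with a toy example. The snippet below is a minimal sketch, not the paper's pipeline: it encodes a mono source into first-order ambisonics (SN3D normalization), applies a yaw rotation to mimic a camera turn, and decodes to stereo with virtual cardioid microphones, a crude stand-in for proper HRTF-based binaural rendering. All function names and conventions here are illustrative.

```python
import numpy as np

def rotate_yaw(W, X, Y, phi):
    """Rotate the horizontal sound field by yaw angle phi (radians)."""
    Xr = np.cos(phi) * X + np.sin(phi) * Y
    Yr = -np.sin(phi) * X + np.cos(phi) * Y
    return W, Xr, Yr

def decode_stereo(W, X, Y):
    """Virtual-cardioid stereo decode with mics aimed at +/-90 degrees.
    A crude stand-in for HRTF-based binaural rendering."""
    left = 0.5 * (W + Y)   # cardioid pointing left (+90 deg)
    right = 0.5 * (W - Y)  # cardioid pointing right (-90 deg)
    return left, right

# Encode a mono source s at azimuth az into first-order ambisonics (SN3D).
t = np.linspace(0, 1, 48000)
s = np.sin(2 * np.pi * 440 * t)
az = np.pi / 2                       # source directly to the left
W, X, Y = s, s * np.cos(az), s * np.sin(az)

l, r = decode_stereo(W, X, Y)
print(np.sum(l**2) > np.sum(r**2))   # True: left channel louder

# Rotate the "camera" 90 degrees so the source ends up in front.
Wr, Xr, Yr = rotate_yaw(W, X, Y, np.pi / 2)
l2, r2 = decode_stereo(Wr, Xr, Yr)
print(np.isclose(np.sum(l2**2), np.sum(r2**2)))  # True: channels balanced
```

The check mirrors the marimba example in the figure: rotating the viewpoint redistributes energy between the two channels.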
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, in which two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, the framework balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatiotemporal alignment between the audio and the input video.
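To make the conditional flow matching objective concrete, here is a minimal, hypothetical sketch using straight-line interpolation between noise and data latents, with one flow per audio channel to mirror the dual-branch idea. The array shapes, the `cfm_targets` helper, and the absence of a real network are simplifications; the paper's actual video conditioning is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_targets(x0, x1, t):
    """Linear-interpolation flow matching: sample a point on the straight
    path from noise x0 to data x1 and return the target velocity."""
    xt = (1.0 - t) * x0 + t * x1   # position at time t on the path
    v = x1 - x0                    # constant target velocity along the path
    return xt, v

# Toy "dual-branch" latents: one flow per audio channel (left, right).
x0 = rng.standard_normal((2, 8))   # noise latents, one row per branch
x1 = rng.standard_normal((2, 8))   # data latents (stand-in for audio latents)
t = rng.uniform(size=(2, 1))       # independent timesteps per branch

xt, v = cfm_targets(x0, x1, t)

# A real model would regress v from (xt, t, video conditioning);
# here we only verify the flow-matching identities.
print(np.allclose(xt, x0 + t * v))   # True
print(np.allclose(v, x1 - x0))       # True
```

Training would minimize the mean squared error between a network's predicted velocity and `v`, conditioned on the video features.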
With the rapid growth of virtual and augmented reality [1,18], the demand for realistic, immersive audio-visual experiences has surged. To achieve immersion, sound must not only be synchronized with the visual content but also convey spatial awareness. Binaural spatial audio creates a highly realistic spatial listening experience by simulating a two-dimensional soundscape. However, traditional binaural audio production requires specialized equipment and expertise [61]. Automatically generating binaural spatial audio from silent video therefore offers substantial practical value.
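As a toy illustration of how a two-channel signal conveys direction, the sketch below pans a mono tone with a constant-power law, reproducing the interaural level difference that binaural recordings capture. Real binaural audio also encodes timing and spectral cues that this sketch omits, and the panning convention here is illustrative.

```python
import numpy as np

def constant_power_pan(mono, azimuth):
    """Pan a mono signal between two channels with a constant-power law.
    azimuth in [-pi/2, pi/2]; positive = left, negative = right (toy convention)."""
    theta = (azimuth + np.pi / 2) / 2      # map azimuth to [0, pi/2]
    left = np.sin(theta) * mono
    right = np.cos(theta) * mono
    return left, right

s = np.sin(2 * np.pi * 440 * np.linspace(0, 0.1, 4800))

l, r = constant_power_pan(s, np.pi / 2)    # source fully to the left
print(np.allclose(r, 0))                   # True: right channel silent

l2, r2 = constant_power_pan(s, 0.0)        # source centered
print(np.isclose(np.sum(l2**2), np.sum(r2**2)))  # True: equal energy
```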
Driven by advancements in generative AI, recent research [5,7,31,47,50,52,55,56] has made significant strides in generating mono audio from silent video in an end-to-end manner. However, binaural audio generation remains constrained by a two-stage process: mono audio is first generated using a pretrained video-to-mono-audio generator and then transformed into binaural spatial audio through a separate spatialization process. Some methods [8,49,59] localize and track sound sources in the input video to synthesize plausible spatial audio, while others [12,13,34,53,60] employ UNet-like [41] architectures to predict binaural channels directly from mono audio. However, these approaches rely heavily on pre-synthesized mono audio, making their performance inherently constrained by the quality of the first-stage output. Meanwhile, the subsequent spatialization often considers only visible sound sources, ignoring off-screen sounds and environmental noise. This two-stage paradigm is therefore prone to error accumulation, leading to misalignment with the input video and spatial inconsistencies. Recent methods [22,28] have established an end-to-end paradigm for spatial audio generation, enabling direct synthesis of First-Order Ambisonics (FOA) from video. However, these approaches depend on 360° video inputs or rely on extra parameters such as the camera orientation. In contrast, end-to-end binaural audio generation aims to extract spatial cues directly from perspective video, making it broadly applicable. Nevertheless, this direction remains largely unexplored.
In this work, we propose to generate binaural spatial audio from silent video in an end-to-end manner. This is a challenging task due to data sparsity: existing real-world video-binaural datasets [11,45,57,58] are small in scale, lack diversity, and often focus on narrow environments like street scenes or music. Others rely on synthetic audio and video [14], limiting their real-world applicability. Moreover, previous datasets are often constrained by fixed camera perspectives, with very few containing camera motion. To address these limitations, we introduce BiAudio, a large-scale, open-domain dataset featuring diverse real-world sound environments. It consists of approximately 97K video-binaural audio pairs.