Self-Supervised Generation of Spatial Audio for 360 Video


We introduce an approach to convert mono audio recorded by a 360 video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360 video viewing, but spatial audio microphones are still rare in current 360 video production. Our system consists of end-to-end trainable neural networks that separate individual sound sources and localize them on the viewing sphere, conditioned on multi-modal analysis of audio and 360 video frames. We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360 videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial location of sound sources based only on 360 video and a mono audio track.


💡 Research Summary

The paper tackles a practical problem in immersive media: most consumer‑grade 360° cameras record only a single, omnidirectional (mono) audio track, while spatial audio—audio that conveys directionality over the full viewing sphere—is essential for a truly immersive experience. Recording spatial audio traditionally requires a dedicated microphone array or binaural rig, which is costly and technically demanding. To bridge this gap, the authors propose a self‑supervised learning framework that converts mono audio into first‑order ambisonics (FOA), the de facto standard for spatial audio on platforms such as YouTube and Facebook.

Problem definition
Given a mono audio signal i(t) (equivalent to the zeroth‑order ambisonic channel φ_w) and its synchronized 360° video v(t), the goal is to predict the three missing first‑order ambisonic channels (φ_x, φ_y, φ_z). The authors call this task “360° spatialization.”

Key insight
Because a mono recording captures the overall sound pressure field without directional cues, the missing directional information can be inferred from visual context (appearance and motion) and from the structure of the audio itself. By treating the mono track as a surrogate for zeroth‑order ambisonics, the network only needs to synthesize the higher‑order components, which can be learned from data where ground‑truth FOA is available.

Network architecture
The system consists of four main modules (Fig. 1 in the paper):

  1. Analysis block – Audio: short‑time Fourier transform (STFT) with 25 ms windows, followed by a 2‑D CNN encoder that compresses the spectrogram into a high‑level feature map (1024‑dimensional). Video: a two‑stream ResNet‑18 network extracts appearance (RGB) and motion (optical flow from FlowNet2) features (512‑dimensional each). Video features are temporally up‑sampled to match the audio frame rate and concatenated with the audio vector.

  2. Separation block – A U‑Net decoder generates k sigmoid‑activated time‑frequency masks a_i(t, ω); the fused multimodal vector is concatenated at the decoder's lowest resolution to condition the decoding. Multiplying each mask with the input STFT yields k source spectrograms, which are inverse‑transformed to obtain time‑domain waveforms f_i(t). The number of sources k is a fixed upper bound (e.g., 3–4).

  3. Localization block – Fully‑connected layers map the same multimodal vector to a 3‑dimensional weight w_i(t) = (w_ix, w_iy, w_iz) for each source. These weights can be interpreted as the evaluation of first‑order spherical harmonics at the predicted source direction, thus encoding spatial location.

  4. Ambisonic generation block – The final FOA channels are assembled by a simple linear combination: φ(t) = Σ_i w_i(t)·f_i(t). This mirrors the analytical expression for ambisonic encoding in a controlled environment (Eq. 2 of the paper).
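The forward pass of blocks 3 and 4 can be sketched in a few lines of NumPy. The sketch below is illustrative, not the paper's implementation: it uses static (non‑time‑varying) weights, a plain directional‑cosine form of the first‑order spherical harmonics, and hypothetical function names; real encoders additionally apply a normalization convention (e.g. SN3D) that is omitted here.

```python
import numpy as np

def first_order_weights(azimuth: float, elevation: float) -> np.ndarray:
    """Directional-cosine gains (w_x, w_y, w_z): the first-order spherical
    harmonics evaluated at a source direction (normalization omitted)."""
    return np.array([
        np.cos(elevation) * np.cos(azimuth),  # x: front/back
        np.cos(elevation) * np.sin(azimuth),  # y: left/right
        np.sin(elevation),                    # z: up/down
    ])

def assemble_foa(sources: np.ndarray, weights: np.ndarray,
                 mono: np.ndarray) -> np.ndarray:
    """phi = sum_i w_i * f_i for the three first-order channels.

    sources: (k, T) separated waveforms f_i(t)
    weights: (k, 3) per-source weights w_i (static here; the paper's
             localization block predicts them per time step)
    mono:    (T,) input track, reused as the zeroth-order channel
    """
    foa = np.zeros((4, sources.shape[1]))
    foa[0] = mono
    foa[1:] = weights.T @ sources  # (3, k) @ (k, T) -> (3, T)
    return foa

# Two unit-amplitude sources: one straight ahead, one hard left.
sources = np.ones((2, 8))
weights = np.stack([first_order_weights(0.0, 0.0),
                    first_order_weights(np.pi / 2, 0.0)])
foa = assemble_foa(sources, weights, mono=sources.sum(axis=0))
```

Note how the linear combination keeps the whole pipeline differentiable: gradients from a loss on the FOA channels flow back through both the weights (localization) and the waveforms (separation).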

Self‑supervised training
Training data consist of pairs (mono, FOA) extracted from 360° videos that already contain spatial audio. The mono input is simply the FOA's omnidirectional φ_w channel, so no manual annotation is needed. The loss combines three terms:

  • STFT MSE – L2 distance between predicted and ground‑truth complex spectrograms, encouraging accurate frequency‑time reconstruction.
  • Envelope distance (ENV) – Euclidean distance between the amplitude envelopes of the predicted and reference ambisonic signals, reflecting perceptual similarity insensitive to phase.
  • Earth Mover’s Distance (EMD) – Computed between directional energy maps derived from the ambisonic fields, directly measuring how well the predicted spatial distribution matches the ground truth.

The weighted sum of these losses is minimized end‑to‑end, allowing the network to learn both source separation and spatial localization without any manual annotation.

Datasets
Two datasets are released:

  • REC‑STREET – A curated collection captured by the authors using a 360° camera together with a calibrated ambisonic microphone. It contains ~1,200 short clips covering indoor, outdoor, street, and studio settings.
  • YT‑ALL – A large‑scale “in‑the‑wild” dataset harvested from YouTube videos tagged with 360° spatial audio. After automatic processing, it comprises ~5,000 clips spanning music, speech, vehicle sounds, nature, and crowd noise. Both datasets provide synchronized mono and FOA tracks, enabling self‑supervised training.
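Because the network input is literally one channel of the target format, extracting a self‑supervised training pair from such a clip is trivial. The sketch below assumes ACN channel ordering (W, Y, Z, X), the convention used by YouTube spatial audio; the function name is hypothetical.

```python
import numpy as np

def make_training_pair(foa: np.ndarray):
    """Split a ground-truth FOA clip of shape (4, T) into (input, target).

    Channel 0 is the omnidirectional W channel, i.e. the zeroth-order
    component: it is exactly the mono signal the network receives, so no
    extra mixdown is required. The three remaining first-order channels
    are the self-supervision target.
    """
    assert foa.ndim == 2 and foa.shape[0] == 4, "expected (4, T) FOA"
    mono, target = foa[0], foa[1:]
    return mono, target
```

This is what makes in‑the‑wild harvesting viable: any 360° video uploaded with spatial audio yields a training example for free.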

Experimental results
Quantitative evaluation on held‑out test sets shows consistent improvements over a baseline U‑Net that directly maps mono to FOA without explicit source separation or localization. Specifically:

  • STFT MSE reduced by ~15 % on average.
  • ENV error decreased by ~12 %.
  • EMD (the primary measure of directional accuracy) dropped from 0.21 to 0.16, a ~24 % relative reduction, indicating more accurate placement of sound sources on the sphere.

Subjective listening tests with 30 participants confirmed that the proposed method yields “more natural spatial perception” in 78 % of cases, compared to 55 % for the baseline.

Ablation studies demonstrate the importance of each component: removing video features nearly doubles the EMD, omitting the separation masks degrades STFT MSE by a factor of 1.8, and varying k shows diminishing returns beyond three sources while increasing computational cost.

Limitations and future work
The current system assumes a fixed upper bound on the number of concurrent sources, which may be insufficient for highly crowded scenes. It also supports only first‑order ambisonics; extending to second‑ or third‑order would improve spatial resolution but requires more output channels and more training data. In scenarios where visual cues are weak (e.g., distant ambient noise), localization accuracy drops, suggesting a need for stronger audio‑only directional priors or attention mechanisms.

Future directions proposed include: (1) dynamic source‑count estimation, (2) higher‑order ambisonic generation, (3) attention‑based fusion of audio and video to better handle ambiguous cases, and (4) real‑time inference optimizations for deployment on mobile or VR headsets.

Impact
By providing a practical, low‑cost pipeline that turns ubiquitous mono recordings into immersive spatial audio, the work lowers the barrier for creators to produce fully immersive 360° content. The release of two sizable, publicly available datasets and the open‑source code further catalyzes research at the intersection of audio‑visual learning, self‑supervision, and spatial sound rendering.

