Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection
With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to address copyright and ownership issues. While existing work has mainly focused on short audio clips, the challenge of full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. Furthermore, we enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach outperforms the previous model and recent baselines, achieving state-of-the-art results in AI-generated music detection.
💡 Research Summary
The paper addresses the emerging problem of detecting AI‑generated music (AIGM) at the full‑track level, a task that requires modeling long‑range musical structure rather than just short‑duration acoustic anomalies. Building on their prior “Segment Transformer” framework, the authors propose the Fusion Segment Transformer (FST), a two‑stage architecture that explicitly incorporates musical “segments” (four‑bar phrases) as the basic unit of analysis.
In Stage 1, short (≈10 s) audio segments are processed by a suite of self‑supervised learning (SSL) feature extractors: wav2vec 2.0, Music2vec, MERT, FXencoder, and a newly integrated Muffin Encoder that focuses on high‑frequency band‑wise cues. All extractors are wrapped in the AudioCAT framework, which adds a fixed cross‑attention decoder and a classification head. The Muffin Encoder is pre‑trained on 0–12 kHz mel‑spectrograms split into low, mid, and high bands, then frozen and plugged into the pipeline to capture subtle high‑frequency artifacts.
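The band-wise preprocessing for the Muffin Encoder can be sketched as a simple split of the mel-spectrogram along the frequency axis. This is an illustrative sketch only: the summary states low/mid/high bands over 0–12 kHz, but the number of mel bins and the exact band boundaries below are assumptions.

```python
import numpy as np

def split_mel_bands(mel, n_splits=3):
    """Split a mel spectrogram of shape (n_mels, n_frames) into equal
    frequency-band groups (low, mid, high for n_splits=3).

    The equal-width split is an assumption; the paper may use
    band edges tied to specific frequencies.
    """
    n_mels = mel.shape[0]
    edges = np.linspace(0, n_mels, n_splits + 1, dtype=int)
    return [mel[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

# Hypothetical 96-bin mel spectrogram, 500 frames
mel = np.random.rand(96, 500)
low, mid, high = split_mel_bands(mel)
print(low.shape, mid.shape, high.shape)  # (32, 500) (32, 500) (32, 500)
```

Each band can then be fed to its own encoder branch (or channel), so that high-frequency artifacts are not washed out by the energy-dominant low bands.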
Stage 2 converts an entire track into a sequence of segment embeddings {e₁,…,e_N} using a beat‑tracking algorithm that aligns segments to down‑beats. A self‑similarity matrix (SSM) is computed as SSM_{ij}=exp(−‖e_i−e_j‖²/d) to encode repetitive structural patterns. Unlike the original Segment Transformer, which simply concatenated content and structure streams, FST processes them in parallel Transformer encoders and then fuses them via a bi‑directional cross‑attention mechanism. Specifically, the content stream serves as query while the SSM stream provides keys and values, yielding an enriched content representation X_content; the reverse operation yields X_structure. Both are layer‑normalized and then combined through a learnable gated fusion unit:
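The self-similarity matrix defined above, SSM_{ij} = exp(−‖e_i − e_j‖² / d), can be computed directly over the stacked segment embeddings; a minimal NumPy sketch (embedding dimensions are illustrative):

```python
import numpy as np

def self_similarity(embeddings):
    """Compute SSM_ij = exp(-||e_i - e_j||^2 / d) for embeddings of
    shape (N, d), where d is the embedding dimension."""
    d = embeddings.shape[1]
    diff = embeddings[:, None, :] - embeddings[None, :, :]  # (N, N, d)
    sq_dist = (diff ** 2).sum(axis=-1)                      # (N, N)
    return np.exp(-sq_dist / d)

emb = np.random.randn(8, 16)   # hypothetical: 8 segments, 16-dim embeddings
ssm = self_similarity(emb)
# Zero self-distance gives a unit diagonal; the matrix is symmetric.
assert np.allclose(np.diag(ssm), 1.0) and np.allclose(ssm, ssm.T)
```

Dividing the squared distance by d keeps the similarity scale roughly independent of the embedding dimension, so repeated segments (e.g. choruses) show up as bright off-diagonal stripes.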
G = σ(W_g [X_content ; X_structure] + b_g),  X_fused = G ⊙ X_content + (1 − G) ⊙ X_structure,

where σ is the sigmoid function, [· ; ·] denotes concatenation, and ⊙ is element-wise multiplication, so the gate G adaptively weights the content and structure streams per dimension.
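The gated fusion unit can be sketched as follows. This assumes the common formulation G = σ(W_g [X_content ; X_structure] + b_g) with a convex combination of the two streams; the weight names W_g and b_g follow the summary, while shapes and initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(x_content, x_structure, W_g, b_g):
    """Learnable gate over two streams of shape (N, d):
    G = sigmoid([x_c ; x_s] @ W_g + b_g), out = G*x_c + (1-G)*x_s.
    A sketch of a standard gated fusion, not the paper's exact layer."""
    gate_in = np.concatenate([x_content, x_structure], axis=-1)  # (N, 2d)
    G = sigmoid(gate_in @ W_g + b_g)                             # (N, d)
    return G * x_content + (1.0 - G) * x_structure

rng = np.random.default_rng(0)
d = 16
x_c = rng.standard_normal((4, d))   # enriched content stream
x_s = rng.standard_normal((4, d))   # enriched structure stream
fused = gated_fusion(x_c, x_s, rng.standard_normal((2 * d, d)), np.zeros(d))
print(fused.shape)  # (4, 16)
```

Because G lies in (0, 1), each output element is a convex combination of the corresponding content and structure features, letting the model lean on structural cues where they are informative and on acoustic content elsewhere.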