Sound

All posts under category "Sound"

3 posts total
Sorted by date
DARC: Fine-Grained Rhythm Control for Drum Accompaniment Generation

In music creation, rapid prototyping is essential for exploring and refining ideas, yet existing generative tools often fall short when users require both structural control and stylistic flexibility. Prior approaches to stem-to-stem generation can condition on other musical stems but offer limited control over rhythm, while timbre-transfer methods let users specify precise rhythms but cannot condition on musical context. We introduce DARC, a generative drum accompaniment model that conditions on both musical context from other stems and explicit rhythm prompts such as beatboxing or tapping tracks. Using parameter-efficient fine-tuning, we augment STAGE, a state-of-the-art drum stem generator, with fine-grained rhythm control while maintaining musical context awareness.
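
The abstract stops at the high-level design, but a common way to realize "parameter-efficient fine-tuning" of a frozen pretrained generator is a LoRA-style low-rank adapter. The sketch below illustrates that general technique only; it is not DARC's actual code, and all names and hyperparameters (LoRALinear, rank, alpha) are assumptions.

```python
# Minimal sketch of LoRA-style parameter-efficient fine-tuning: one plausible
# way to add a new conditioning pathway to a frozen pretrained generator.
# Hypothetical illustration; DARC's real architecture is not described in the
# abstract beyond "parameter-efficient fine-tuning of STAGE".
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus a small trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In this setup only the adapter weights train, so the base model keeps its learned context awareness while the adapters absorb the new rhythm-prompt conditioning (e.g. onset features extracted from a tapping or beatboxing track).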

Tags: paper, research
MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
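
As a rough illustration of the guidance idea: standard classifier-free guidance extrapolates away from an unconditional branch, while the noise-based negative conditioning described above replaces that branch with a natural-noise prior. The sketch below shows the general mechanism under those assumptions; `model`, `cond`, and `neg_cond` are placeholders, not MM-Sonate's real interface.

```python
# Illustrative sketch of classifier-free guidance for a flow-matching model,
# with a negative-condition branch. Placeholder interfaces only.
import torch

def guided_velocity(model, x_t, t, cond, neg_cond, scale: float = 4.0):
    """Guided velocity estimate at noisy state x_t and time t.

    Standard CFG uses an empty/unconditional embedding as `neg_cond`;
    the noise-based variant instead embeds natural noise, steering the
    sample away from a noise prior rather than away from "nothing".
    """
    v_pos = model(x_t, t, cond)       # velocity under the full condition
    v_neg = model(x_t, t, neg_cond)   # velocity under the negative condition
    return v_neg + scale * (v_pos - v_neg)
```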

Tags: paper, research
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess model performance in Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We hope that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparing audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.
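
The repository documents the framework's actual commands and APIs; purely to illustrate the kind of model/benchmark registry a modular evaluation architecture implies, here is a generic sketch. None of these interfaces reflect UltraEval-Audio's real code.

```python
# Generic sketch of a pluggable model/metric registry for audio evaluation.
# Illustrative only; see https://github.com/OpenBMB/UltraEval-Audio for the
# framework's actual interfaces.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Sample:
    audio_path: str
    reference: str            # e.g. ground-truth transcript or answer

# Registries let new models and benchmarks plug in without touching the runner.
MODELS: Dict[str, Callable[[Sample], str]] = {}
METRICS: Dict[str, Callable[[str, str], float]] = {}

def evaluate(model_name: str, metric_name: str, data: List[Sample]) -> float:
    """Run one model over one benchmark and return the mean metric score."""
    predict, score = MODELS[model_name], METRICS[metric_name]
    results = [score(predict(s), s.reference) for s in data]
    return sum(results) / len(results)
```

The design choice worth noting is the registry pattern itself: a "one-command" runner only needs the model name, benchmark name, and metric name, because everything else is resolved through registered adapters.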

Tags: paper, research
