MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Reading time: 5 minutes

📝 Abstract

Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose MF-Speech, a novel framework consisting of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that on the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their potential as a general-purpose speech representation.

📄 Content

Infusing life into the digital world and endowing speech with personality and emotion is one of the most exciting frontiers in generative artificial intelligence. From emotionally aware assistants to personalized voice restoration and expressive media synthesis (Sisman et al. 2020; Zhang et al. 2019; Veaux, Yamagishi, and King 2013), it is poised to transform how we interact with the digital world. Voice Conversion (Bargum, Serafin, and Erkut 2023) enables flexible manipulation of fundamental speech factors such as content, timbre, and emotion. As such, Voice Conversion has emerged as a key enabling technology toward this vision. However, two fundamental challenges have long troubled researchers in this field:

Figure 1: MF-Speech enables independent and fine-grained control over speech content, timbre, and emotion factors for speech synthesis.

• Gene hybridization: The Challenge of Pure Factor Separation. The content, timbre, and emotion in speech are naturally intertwined and hard to separate. Lacking a strong supervisory signal, current strategies (Chou et al. 2018; Yadav et al. 2023; Qian et al. 2020b; Li, Han, and Mesgarani 2023; Wang et al. 2021b, 2018; Li, Han, and Mesgarani 2025; Chou, Yeh, and Lee 2019) act like rough filters, making it difficult to precisely separate the various speech factors, which leads to timbre leakage and attribute interference. Moreover, this deep entanglement yields fragile and chaotic factor representations, not only disrupting precise control of each factor but also severely limiting their transferability across tasks (Li, Li, and Li 2023; Yuan et al. 2021; Lian, Zhang, and Yu 2022; Mu et al. 2024; Tu, Mak, and Chien 2024; Song et al. 2022; Deng et al. 2024).

• Command failure: The Lack of Fine-Grained Control. Even if pure speech factors are obtained, how to skillfully control them remains a major challenge.

Existing control mechanisms are generally coarse-grained, like using a sledgehammer for fine carving work. Whether models rely on fundamental methods such as static concatenation and implicit global modulation (Kaneko et al. 2019; Qian et al. 2019; Neekhara et al. 2023; Zhang, Ling, and Dai 2019; Yao et al. 2024), or employ advanced techniques such as dynamic fusion and explicit modulation (Yao et al. 2025; Ma et al. 2024; Ning et al. 2023; Choi and Park 2025; Li et al. 2024a; Qian et al. 2020a), they consistently suffer from coarse-grained control because they fail to systematically combine dynamic weights with hierarchical style injection. As a result, models often struggle to balance content fidelity (content) and style similarity (timbre and emotion), a fundamental defect that cannot be remedied through post-processing techniques (Ren et al. 2020; Lin et al. 2024; Tian, Liu, and Lee 2024).
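The "dynamic weights" idea above can be made concrete with a minimal numpy sketch of per-frame weighted fusion of factor streams. All names, shapes, and the gating matrix `w_gate` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fusion(content, timbre, emotion, w_gate):
    """Mix factor streams with per-frame learned weights.

    content, timbre, emotion: (T, D) factor embeddings (timbre and
    emotion broadcast to T frames); w_gate: (D, 3) hypothetical gating
    matrix that maps each content frame to three mixing logits.
    """
    factors = np.stack([content, timbre, emotion], axis=0)  # (3, T, D)
    alpha = softmax(content @ w_gate, axis=-1)              # (T, 3) weights
    return np.einsum('tk,ktd->td', alpha, factors)          # (T, D)
```

With content-dependent weights, each frame can lean on a different factor; a static concatenation, by contrast, fixes the mixture for the whole utterance.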

To address these two major challenges, we propose MF-Speech, a framework designed to achieve fine-grained and compositional control in speech generation via multi-factor disentanglement (Figure 1). This framework consists of a Multi-factor Speech Encoder (MF-SpeechEncoder) and a Multi-factor Speech Generator (MF-SpeechGenerator). It fundamentally addresses the aforementioned challenges by enhancing purification capability and clarifying the command direction, achieving composable and fine-grained control over speech generation. Our main contributions can be summarized as follows:

• Multi-factor Speech Encoder to ensure factor purity (MF-SpeechEncoder). We design a speech-factor purifier with a three-stream architecture that decomposes the raw speech signal into three highly pure and mutually independent information streams: content, timbre, and emotion. This ensures a high degree of independence for subsequent control and addresses the gene-hybridization challenge.

• Multi-factor Speech Generator to enhance control granularity (MF-SpeechGenerator). Building upon the purified factors, we develop the speech-factor conductor. This component moves beyond coarse control by incorporating dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN), enabling highly fine-grained control over timbre and emotion. As a result, the model can synthesize a vast array of speech combinations with high style similarity while maintaining content fidelity.

• Comprehensive empirical and systematic validation.

Extensive experiments demonstrate the effectiveness of our proposed framework. Results show that MF-SpeechEncoder effectively purifies speech factors to ensure control independence. Moreover, in the challenging task of multi-factor compositional speech generation, MF-Speech demonstrates remarkable controllability in terms of both content fidelity and style similarity.
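To make the "hierarchical style injection" idea behind HSAN more tangible, here is a minimal numpy sketch of style-adaptive normalization applied at several decoder levels. The function names, shapes, and projection matrices are hypothetical; the paper's HSAN may differ in detail:

```python
import numpy as np

def style_adaptive_norm(x, style, w_gamma, w_beta, eps=1e-5):
    """Instance-normalize features, then modulate with style.

    x: (C, T) feature map; style: (S,) style embedding; w_gamma, w_beta:
    (S, C) hypothetical projections to per-channel scale and shift.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-mean, unit-var per channel
    gamma = style @ w_gamma                 # (C,) style-conditioned scale
    beta = style @ w_beta                   # (C,) style-conditioned shift
    return gamma[:, None] * x_hat + beta[:, None]

def hierarchical_style_injection(x, style, layers):
    # Re-inject the same style code at every level, each with its own
    # projections: coarse layers set global style, fine layers refine detail.
    for w_gamma, w_beta in layers:
        x = style_adaptive_norm(x, style, w_gamma, w_beta)
    return x
```

Because normalization first strips the incoming statistics, each level's output statistics are dictated by the style embedding rather than leaked from the source speaker.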

Factor Disentanglement: Strategies and Challenges. VQMIVC (Wang et al. 2021a) separates pitch, content, and timbre through vector quantization and mutual information minimization. However, F0 is not explicitly modeled but rather prov
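The vector-quantization step used by VQMIVC-style content encoders can be sketched as a nearest-codebook lookup. This is a generic illustration of VQ (the straight-through gradient and mutual-information losses are omitted), not VQMIVC's actual code:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each frame to its nearest codebook entry (Euclidean distance).

    z: (T, D) continuous encoder outputs; codebook: (K, D) code vectors.
    Returns the quantized frames and their discrete code indices.
    """
    # Squared distances between every frame and every code: (T, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```

The discrete bottleneck is what discards speaker-dependent detail: only the code index survives, so fine timbre variation within a code cell is thrown away, which is why VQ acts as a (rough) content filter.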
