Universal Speech Content Factorization

Henry Li Xinyuan², Zexin Cai², Lin Zhang², Leibny Paola García-Perera¹, Berrak Sisman¹, Sanjeev Khudanpur¹, Nicholas Andrews², Matthew Wiesner²
¹Center for Language and Speech Processing, Johns Hopkins University, USA
²Human Language Technology Center of Excellence (COE), Johns Hopkins University, USA
xli257@jhu.edu

Abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that, as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.¹

Index Terms: Voice Conversion, Speech Factor Disentanglement, TTS

1. Introduction

Recent self-supervised learning (SSL) models for speech, such as WavLM, exhibit pronounced geometric structure in their feature spaces: empirical analyses demonstrate that phonetic content dominates feature variance and that frames corresponding to the same phoneme form tight clusters across speakers [1, 2].
This finding has significant consequences for the downstream task of voice conversion (VC), where the typical goal is to modify the speaker identity while preserving the linguistic content: it enables a class of training-free voice conversion methods that operate directly in SSL space [3, 4].

Notably, the voice conversion system kNN-VC [3] showed that VC can be performed by replacing each WavLM feature frame extracted from an input utterance with its nearest neighbor in a target speaker's WavLM feature collection. The success of kNN-VC suggests a structural property of the WavLM feature space: for each phoneme, frames from different speakers reside within a consistent subspace. This hypothesis is further supported by LinearVC [4], which showed that a content-preserving approximate linear projection can be found between the WavLM features of two speakers. Building on this observation, Speech Content Factorization (SCF) [4] proposed projecting WavLM features into a shared low-rank representation encoding phonetic content, and reconstructing speaker-specific features through learned linear transformations, thereby enabling high-quality VC without additional model training.

¹ Code release: github.com/anon-uscf/uscf/tree/release; Speech samples: anon-uscf.github.io/uscf.github.io/

Figure 1: Full pipeline for voice conversion using USCF.
Figure 2: Decomposing speech into a content-factorized form through SCF. X_i are content-aligned WavLM features for different speakers. Content alignment for X_i is performed through kNN matching.
Figure 3: Left: formulation for W_1, one of our proposed universal speech-to-content mappings. Right: derivation of speaker transformation matrix S_4 for unseen speaker 4.

Despite its simplicity and effectiveness, SCF is a closed-set method: extracting a content-factorized representation from a speaker requires that the speaker be included in the set of speakers used to derive the factorization.
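The frame-matching operation at the core of kNN-VC, which SCF also reuses for content alignment, can be sketched in a few lines of NumPy. The array shapes, pool size, and k below are illustrative assumptions, not the released configuration:

```python
import numpy as np

def knn_convert(src, pool, k=4):
    """Replace each source frame by the mean of its k nearest frames
    (by cosine similarity) in a target speaker's feature pool."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = src_n @ pool_n.T                 # (n_src, n_pool) cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
    return pool[idx].mean(axis=1)           # (n_src, d) converted frames

rng = np.random.default_rng(0)
src = rng.normal(size=(120, 1024))    # stand-in WavLM frames, source utterance
pool = rng.normal(size=(3000, 1024))  # stand-in target-speaker feature pool
converted = knn_convert(src, pool)
print(converted.shape)                # (120, 1024)
```

In kNN-VC the converted frames are vocoded back to a waveform; SCF instead uses such matched frames to build its content-aligned matrix X.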
This closed-set restriction limits SCF's applicability in downstream scenarios such as open-set VC or timbre-prompted TTS, where unseen speakers must be supported without recomputing the decomposition. For example, to train TTS models on crowd-sourced or web-crawled datasets with diverse speaking styles, such as CommonVoice [5] or Emilia [6], running SCF on all of the speakers present would be prohibitively expensive and would exclude the many speakers who lack sufficient speech.

To address this limitation, we propose Universal Speech Content Factorization (USCF), an open-set extension of SCF that enables speaker-agnostic content extraction and one-shot speaker adaptation. Leveraging the linear structure of SCF, we derive a speaker-agnostic universal speech-to-content mapping via least-squares optimization, and infer speaker-specific content-to-speech transformations from as little as a single target-speaker utterance using linear estimation. Our system achieves competitive performance compared to SSL-structure-based baselines and to SCF. Our contributions are as follows:

• We propose USCF, a universal speech-to-content mapping, by showing that the linear structure underlying SCF generalizes to unseen speakers. We show that a universal speech-to-content mapping can be computed using a simple least-squares formulation, while speaker-specific content-to-speech transformations can be estimated from only a few seconds of target speech.
• We evaluate USCF as a zero-shot VC system and show competitive performance in intelligibility, naturalness, and speaker similarity compared to baselines.
• We further show that the USCF representation can serve as an alternative acoustic target for text-to-speech (TTS) systems.
• We perform embedding analyses on USCF representations and show that they contain less speaker information than other speaker-factorized representations while effectively preserving speech content.

1.1. Related Works: Speech Disentanglement

In contrast to SSL-space methods, most speech disentanglement approaches rely on explicitly trained generative models. A common strategy is to use a Variational Autoencoder (VAE) [7] that reconstructs speech from a bandwidth-limited intermediate representation encoding one factor, together with an auxiliary encoder that provides the disentangled factor [8–13]. Such methods have been applied to VC [14–16], expressive speech translation [17], speech anonymization [18, 19], and accent conversion [20–22], but they require additional model training and substantial speaker-specific data.

2. Universal Speech Content Factorization

Speech Content Factorization (SCF) jointly decomposes content-aligned WavLM features from multiple speakers into a shared low-rank representation encoding phonetic content, enabling high-quality VC without additional model training. However, SCF is a closed-set method, as both content extraction and reconstruction require speaker-specific transformation matrices obtained during the original factorization. In this section, we extend SCF to an open-set setting.

2.1. Closed-Set SCF

VC through SCF, as proposed in [4], is performed as follows:

1. Select k speakers, each with sufficient speech for k-nearest-neighbor (kNN) matching (e.g., at least 5 minutes).
2. For some anchor speaker i ∈ [1..k] with associated WavLM frames X_i of shape n (number of speech frames) by d (WavLM feature dimension), find matching WavLM frames X_j using nearest neighbors for each speaker j ∈ [1..k], j ≠ i. Stack the matrices X_1 through X_k along their feature dimensions to produce X of shape (n, kd).
3. Perform rank-r truncated Singular Value Decomposition (SVD) on X, giving X ≈ UΣS. We write C = UΣ, of shape (n, r), as the content-factorized representation of X.
We further split S, of shape (r, kd), into chunks S_1 through S_k, each of size (r, d), such that ∀j ∈ [1..k], X_j ≈ CS_j. It follows that X_j S_j† ≈ C, where S_j† is the Moore-Penrose pseudoinverse of S_j. We refer to S_j as the speaker transformation matrix of speaker j. The content factorization process is shown in Figure 2.
4. For speakers s, t ∈ [1..k] and input speech X'_s, perform VC by X̂'_t ≈ X'_s S_s† S_t.

2.2. Approaches for Universal Speech-to-Content Mapping

To extend SCF to unseen speakers, we seek:

1. A matrix W such that for any speaker m and unseen speech X'_m, including speakers m not in [1..k], X'_m W ≈ C'.
2. A way to derive, from a small number of WavLM frames X'_m of an unseen speaker m, the speaker transformation matrix S_m corresponding to speaker m.

Recall the truncated SVD formulation X ≈ UΣS. We can solve the least-squares optimization problem W_0 = argmin_W ∑_{j=1}^{k} ‖X_j W − C‖, which directly attempts to reconstruct the content-factorized representation C = UΣ. However, in preliminary experiments, we found that W_0 does not produce representations that can be reliably mapped back to the WavLM feature space to synthesize intelligible speech. We observe that while the r orthonormal columns of the n-by-r matrix U represent the r dimensions of speech content according to this factorization, the magnitude of each of these dimensions in the WavLM feature space is given by the corresponding diagonal singular value of Σ. If we assume that all r content dimensions should be treated as equally important, independent of their singular values, then Σ should be factored out of the optimization target, as shown in Figure 3. This leads to the modified optimization problem on U:

    W_1 = argmin_W ∑_{j=1}^{k} ‖X_j W Σ⁻¹ − U‖    (1)

Using U directly as the optimization target may amplify acoustic contexts that are over-represented in the data.
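Concretely, steps 2–3 of Section 2.1 and the W_1 objective of Equation (1) reduce to one truncated SVD plus one stacked least-squares solve. The NumPy sketch below runs on synthetic content-aligned features; all sizes and names are illustrative stand-ins for kNN-aligned WavLM frames:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, d, r = 4, 600, 64, 16           # speakers, frames, feature dim, rank

# Synthetic content-aligned features: every X_j shares the same per-frame
# content but has its own speaker transformation (the SCF assumption).
content = rng.normal(size=(n, r))
X_list = [content @ rng.normal(size=(r, d)) for _ in range(k)]

# Steps 2-3: stack along the feature axis, then rank-r truncated SVD.
X = np.concatenate(X_list, axis=1)                     # (n, k*d)
U_full, sv, Vt = np.linalg.svd(X, full_matrices=False)
U, Sigma, S = U_full[:, :r], np.diag(sv[:r]), Vt[:r]   # X ≈ U Σ S
C = U @ Sigma                                          # content rep., (n, r)
S_chunks = [S[:, j*d:(j+1)*d] for j in range(k)]       # X_j ≈ C S_j

# Equation (1): min_W Σ_j ||X_j W Σ^{-1} - U||.  Substituting V = W Σ^{-1}
# turns the k systems into one stacked least-squares problem.
A = np.concatenate(X_list, axis=0)                     # (k*n, d)
B = np.concatenate([U] * k, axis=0)                    # (k*n, r)
V = np.linalg.lstsq(A, B, rcond=None)[0]
W1 = V @ Sigma                                         # universal mapping: X_j W1 ≈ C

rel_err = np.linalg.norm(X_list[0] @ W1 - C) / np.linalg.norm(C)
print(rel_err < 1e-6)
```

Closed-set VC (step 4) is then `X_src @ np.linalg.pinv(S_chunks[s]) @ S_chunks[t]`; W1 replaces the speaker-specific pseudoinverse when the source speaker is unseen.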
To avoid over-weighting such contexts, we instead find a matrix W_2 that approximately inverts the speaker transformations themselves:

    W_2 = argmin_W ∑_{j=1}^{k} ‖S_j W − I‖    (2)

Finally, the formulation for W_3 relies on the simplifying assumption that content and timbre components are linearly separable: X_j = C(T_content + S_j^timbre), where T_content is speaker-invariant and its columns are orthogonal to those of S_j^timbre. Under this assumption we have:

    S_i S_j† = (T_content + S_i^timbre)(T_content + S_j^timbre)†            (3)
             = (T_content + S_i^timbre)((T_content)† + (S_j^timbre)†)       (4)
             = T_content (T_content)† + S_i^timbre (S_j^timbre)†            (5)
             ≈ T_content (T_content)† = I                                   (6)

Equations (4) and (5) follow from the assumed orthogonality of the column spaces of T_content and S_j^timbre, while (6) uses the fact that the timbre subspaces of different speakers, being high-dimensional, are likely orthogonal. Thus we can pick W_3 simply as the Moore-Penrose pseudoinverse of any speaker i in [1..k]:

    W_3 = S_i† for some i ∈ [1..k]

2.3. Speaker Transformation Matrix Derivation

Suppose we have a speaker m ∉ [1..k]. To reconstruct speaker-m speech from a content representation C, we require the corresponding speaker transformation matrix S_m such that X_m ≈ CS_m. Suppose also that we are given a small set of WavLM features X'_m from speaker m. Using the facts that X'_m ≈ C'S_m and C' ≈ X'_m W, we can derive S_m using the following formulation, also shown in Figure 3:

    S_m ≈ (X'_m W)† X'_m    (7)

3. Experimental Setup

3.1. Test Data

For our VC experiments, we select 4 non-overlapping sets of 20 speakers from LibriSpeech [23]: source speakers (from test-clean), target speakers (from test-other), held-out 1 (from test-clean), and held-out 2 (from dev-clean). For each of the source speakers, we randomly choose 20 utterances for testing and 5 target speakers as conversion targets.

3.2. USCF-VC Details

Table 1: Comparison of USCF VC with baseline systems according to objective metrics.

Method    | WER (%) ↓ | UTMOS ↑ | Spk Sim ↑
USCF W_1  | 2.70      | 2.805   | 0.524
USCF W_2  | 4.04      | 2.519   | 0.557
USCF W_3  | 2.31      | 2.826   | 0.420
kNN-VC    | 3.16      | 2.855   | 0.666
LinearVC  | 2.69      | 2.765   | 0.621
SCF       | 2.18      | 2.886   | 0.603
SCF W_1   | 3.01      | 2.859   | 0.604
SeedVC    | 6.24      | 3.173   | 0.532

Table 2: Comparison of subjective metrics between USCF and baselines. Mean scores with 95% confidence intervals.

Method         | MOS ↑       | SMOS ↑
original       | 3.80 ± 0.17 | 4.29 ± 0.15
USCF W_1       | 3.42 ± 0.17 | 3.00 ± 0.23
USCF W_1 100s  | 3.66 ± 0.17 | 2.94 ± 0.29
kNN-VC         | 3.53 ± 0.18 | 3.29 ± 0.25
LinearVC       | 3.65 ± 0.16 | 3.17 ± 0.21
SCF            | 3.63 ± 0.18 | 3.08 ± 0.25
SeedVC         | 3.36 ± 0.18 | 2.77 ± 0.31

Following the procedure outlined in Section 2.1, we choose 1 speaker in dev-clean as our anchor speaker. The 40 speakers included in the SVD process are drawn from held-out 1 and held-out 2. After constructing the content-aligned WavLM matrix X, we choose r = 75 and perform rank-r truncated SVD on X, giving us W_1 and W_2 by following Equations 1 and 2. We set W_3 as the pseudoinverse of the S_i corresponding to a different random speaker i across 10 runs, and report average performance. For each target speaker, we randomly sample from their utterances and retain the WavLM feature frames extracted from each sampled utterance until we reach 500 frames (corresponding to 10 seconds of speech). We then derive their speaker transformation matrix according to Equation 7.

3.3. Baselines

We include the following baselines: kNN-VC [3], LinearVC [4], and closed-set SCF, as these can be seen as different methods of exploiting the unique structure of the WavLM feature space. We also benchmark against SeedVC [24], a diffusion transformer-based zero-shot VC method with state-of-the-art performance.
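To make the adaptation recipe of Sections 2.3 and 3.2 concrete, the sketch below applies Equation (7) to 500 frames of a synthetic unseen speaker and then runs the full conversion X̂ = X_src W S_m. All arrays, and the mapping W itself, are illustrative stand-ins rather than trained quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 16

# Assumed given from the factorization stage: a universal
# speech-to-content mapping W with X @ W ≈ C for any speaker.
W = rng.normal(size=(d, r))

# Unseen speaker m: 500 frames (~10 s at WavLM's 50 frames/s) generated
# under the linear model X_m = C_m S_m with an unknown true S_m.
C_m = rng.normal(size=(500, r))
S_m_true = rng.normal(size=(r, d))
X_m = C_m @ S_m_true

# Equation (7): estimate the prompt's content, then regress features on it.
C_hat = X_m @ W                        # estimated content of the prompt
S_m = np.linalg.pinv(C_hat) @ X_m      # S_m ≈ (X_m W)^+ X_m, shape (r, d)

# Zero-shot VC: source features -> content -> speaker-m features,
# which would then be vocoded back to a waveform.
X_src = rng.normal(size=(200, r)) @ rng.normal(size=(r, d))
X_converted = X_src @ W @ S_m          # (200, d)

# Sanity check: the estimated S_m reconstructs the prompt itself.
rel_err = np.linalg.norm(C_hat @ S_m - X_m) / np.linalg.norm(X_m)
print(rel_err < 1e-6)
```

Only the 10-second prompt of the target speaker is touched at adaptation time; the factorization itself is never recomputed, which is what makes the method open-set.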
For kNN-VC and LinearVC, we make all source and target speaker speech (roughly 8 minutes per speaker) available for kNN matching and for finding source-target WavLM feature mappings. For SCF, we perform our SVD on all 40 speakers from the source and target speaker sets of Section 3.1, giving us S_i for each speaker i among the source and target speakers.

3.4. Metrics

We include the following objective metrics: ASR WER (using Whisper [25] large²) for intelligibility; UTMOS-v2 [26]³ as a proxy for quality; and speaker embedding cosine similarity to an averaged ground-truth speaker embedding, with embeddings extracted from a pre-trained ECAPA-TDNN [27]⁴. Target Equal Error Rates, in which a speaker ID system tries to differentiate ground-truth target speaker utterances from voice-converted target speaker utterances, are not reported due to space constraints, as they show identical relative performance across systems to speaker embedding cosine similarity. We further run human evaluations on USCF and comparative systems, collecting a Mean Opinion Score (MOS) based on the naturalness and quality of each output utterance, and a Speaker Similarity Mean Opinion Score (SMOS) based on the similarity of each output utterance to a reference utterance.

4. Results

4.1. Voice Conversion Quality

Objective evaluation results for voice conversion are presented in Table 1. We note that USCF excels at content preservation and is capable of producing natural-sounding speech. Speaker similarity tests show that while USCF is able to produce speech that is highly similar to the target speaker, the level of similarity is somewhat weaker than that of kNN-VC, LinearVC, and SCF. To investigate the source of this degradation, we compare USCF with partially open-set SCF, where the input speech comes from speakers that are out-of-domain while the target speakers are in-domain.
We find that partially open-set SCF achieves speaker similarities that are on par with closed-set SCF, suggesting that the content-to-speaker transformation is the source of the degradation in speaker similarity of USCF.

Subjective evaluation results, shown in Table 2, demonstrate that listeners show no statistically significant preference between USCF and any of the baseline systems except SeedVC, which was least favored.

Among the speech-to-content mapping strategies, we found that W_2 achieves the best target speaker similarity at the price of reduced speech quality and content preservation, whereas W_3 shines at content preservation but struggles with target speaker similarity; by contrast, W_1 strikes a reasonable balance across all metrics. For USCF using W_3 as the universal speech-to-content mapping, we report the average performance over 10 runs, where in each run W_3 was chosen as the pseudoinverse of a different randomly selected SCF speaker transformation matrix. Over the 10 runs, the standard deviations for ASR WER and UTMOS are 0.10% and 0.015, respectively, suggesting that W_3 is highly stable regardless of the choice of SCF speaker transformation used to derive it.

² huggingface.co/openai/whisper-large
³ huggingface.co/sarulab-speech/UTMOSv2
⁴ huggingface.co/speechbrain/spkrec-ecapa-voxceleb

4.2. Speaker ID within Phoneme

Table 3: Speaker ID using same-phoneme embeddings.

Model      | Rank | Spk EER ↑ | Phoneme EER ↓
USCF       | 75   | 36.40%    | 11.43%
USCF       | 1024 | 35.33%    | 11.43%
WavLM      | 1024 | 21.77%    | 11.56%
ContentVec | 792  | 27.98%    | 8.82%
WavLM-kNN  | 1024 | 35.14%    | 12.70%

Table 4: VC performance of USCF across different ranks.

Rank | WER (%) ↓ | UTMOS ↑ | Spk Sim ↑
100  | 2.77      | 2.81    | 0.504
75   | 2.70      | 2.805   | 0.524
50   | 2.69      | 2.738   | 0.529
30   | 2.96      | 2.607   | 0.513
20   | 3.98      | 2.388   | 0.489

Table 5: VC performance of USCF when the speaker transformation matrix is derived with different numbers of frames of target speaker speech.

Num frames | WER (%) ↓ | UTMOS ↑ | Spk Sim ↑
10000      | 2.28      | 2.935   | 0.564
5000       | 2.42      | 2.923   | 0.564
2000       | 2.47      | 2.915   | 0.564
1000       | 2.51      | 2.904   | 0.546
500        | 2.70      | 2.805   | 0.524
200        | 4.94      | 2.431   | 0.42

To test the claim that USCF features preserve speech content information while removing speaker-identifying information, we first extract USCF features (rank 75, W_1) from the TEST split of the TIMIT [28] dataset, then perform the following:

1. Phoneme recognition. Ground-truth phoneme labels for each test frame are inferred using TIMIT's time-phoneme alignment. Classification is performed by taking the argmin of the cosine distance to each reference phoneme embedding.
2. Speaker recognition per phoneme. For each phoneme, we find all the frames from TIMIT TEST with the corresponding phoneme label, then perform speaker ID on the feature vectors corresponding to these frames.

We compare USCF to two SSL-based baselines, WavLM and ContentVec, in Table 3. We found that USCF is on par with WavLM as a phoneme classifier, and that it is stronger than both WavLM and ContentVec at removing speaker information. This property still holds even when the rank of USCF is increased to 1024, showing that the loss of speaker information is not an artifact of projecting into a low-dimensional space. We additionally find that USCF is marginally better at speaker information removal and at content preservation than WavLM features that had been normalized to a single reference speaker through kNN matching.

Table 6: TTS models trained using different target features. Mel (normalized) is where each utterance is first normalized to the same voice using kNN-VC [3], then converted into mel filterbank features.

Target Features  | ASR WER ↓ | Epochs | UTMOS-v2 ↑
USCF features    | 11.44%    | 25     | 2.881
mel              | 27.93%    | 39     | 2.741
mel (normalized) | 11.92%    | 33     | 2.732

4.3. Ablations

4.3.1. Reduction in Rank

A comparison of the performance of USCF at various ranks can be found in Table 4.
We find that USCF is stable when its rank is between 50 and 100, but that the synthetic voice quality declines when its rank drops further.

4.3.2. S Derivation with Fewer or More Frames

Table 5 shows the performance of USCF when the amount of target speaker speech used for deriving the target speaker transformation matrix is varied. We notice a sharp degradation in target speaker similarity when the amount of target speaker speech falls below 500 frames, corresponding to 10 seconds; conversely, target speaker similarity improves when the amount of target speaker speech is increased, but with diminishing returns beyond 2000 frames (40 seconds).

4.4. TTS Using a Content-Factorized Representation

We verify that USCF features can be used as target training features with a flow-matching TTS model, trained on LibriSpeech and following the architecture and training strategies of ZipVoice [29]. We compare it to two benchmarks: a TTS model trained on mel filterbank features, and one trained on mel filterbank features extracted from utterances that had been voice-normalized to a single speaker using kNN-VC. Table 6 shows that the model trained with USCF features achieves better WER while requiring less training time than both models trained with mel filterbank features.

5. Conclusion

We propose USCF, a method for extracting content-preserving and speaker-agnostic features from WavLM features using a simple linear transformation. We extend an existing closed-set method for linear factorization of WavLM features, SCF, into an open-set method, transforming it into a zero-shot voice conversion system. We show through embedding analysis that USCF features effectively preserve content information and carry less speaker information than existing methods such as ContentVec. We also demonstrate the potential for using USCF features in downstream tasks such as TTS training. In future work, we would like to explore whether simple neural methods would allow for a more stable version of W, or a derivation of S_m for an unseen speaker m with even less speech from speaker m available. We also plan to use USCF to train zero-shot style-conditioned TTS systems that are timbre-agnostic.

6. Generative AI Disclosure

Generative AI was used for the following in this work:
1. Conversion of table formats.
2. Limited language polishing of the Introduction section.

7. Acknowledgments

This work was supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the ARTS Program under contract D2023-2308110001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

8. References

[1] K. Choi, E. Yeo, K. Chang, S. Watanabe, and D. R. Mortensen, "Leveraging allophony in self-supervised speech models for atypical pronunciation assessment," in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 2613–2628.
[2] L. Block Medin, T. Pellegrini, and L. Gelin, "Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning," in Interspeech 2024, 2024, pp. 5168–5172.
[3] M. Baas, B. van Niekerk, and H. Kamper, "Voice Conversion With Just Nearest Neighbors," in Interspeech 2023, pp. 2053–2057.
[4] H. Kamper, B. van Niekerk, J. Zaïdi, and M.-A.
Carbonneau, "LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion," in Interspeech 2025, 2025, pp. 1398–1402.
[5] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A Massively-Multilingual Speech Corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[6] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi, Y. Wang, K. Chen, P. Zhang, and Z. Wu, "Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation," in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 885–890.
[7] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," 2022. [Online]. Available: https://arxiv.org/abs/1312.6114
[8] Y. Xie, M. Kuhlmann, F. Rautenberg, Z.-H. Tan, and R. Haeb-Umbach, "Speaker and style disentanglement of speech based on contrastive predictive coding supported factorized variational autoencoder," in EUSIPCO. IEEE, 2024, pp. 436–440.
[9] C. H. Chan, K. Qian, Y. Zhang, and M. Hasegawa-Johnson, "Speechsplit 2.0: Unsupervised speech disentanglement for voice conversion without tuning autoencoder bottlenecks," 2022. [Online]. Available: https://arxiv.org/abs/2203.14156
[10] J. Wang, J. Li, X. Zhao, Z. Wu, S. Kang, and H. Meng, "Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion," in Interspeech 2021, 2021, pp. 846–850.
[11] K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, and S. Chang, "ContentVec: An improved self-supervised speech representation by disentangling speakers," in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 2022, pp. 18003–18017.
[12] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W.-N. Hsu, A. Mohamed, and E. Dupoux, "Speech Resynthesis from Discrete Disentangled Self-Supervised Representations," in Interspeech 2021, 2021, pp. 3615–3619.
[13] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, E. Liu, Y. Leng, K. Song, S. Tang, Z. Wu, T. Qin, X. Li, W. Ye, S. Zhang, J. Bian, L. He, J. Li, and S. Zhao, "Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models," in ICML, vol. 235, 2024, pp. 22605–22623.
[14] Z. Liu, S. Wang, and N. Chen, "Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation," in Interspeech 2023, 2023, pp. 2298–2302.
[15] Z. Cai, H. L. Xinyuan, A. Garg, L. P. García-Perera, K. Duh, S. Khudanpur, M. Wiesner, and N. Andrews, "Genvc: Self-supervised zero-shot voice conversion," 2025. [Online]. Available: https://arxiv.org/abs/2502.04519
[16] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132–157, 2020.
[17] X. Zhao, H. Sun, Y. Lei, S. Zhu, and D. Xiong, "CCSRD: Content-centric speech representation disentanglement learning for end-to-end speech translation," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 5920–5932.
[18] B. T. Vecino, S. Maji, A. Varier, A. Bonafonte, I. Valles, M. Owen, L. Rädel, G. Strimel, S. Feyisetan, R. B. Chicote, A. Rastrow, C. Papayiannis, V. Leutnant, and T. Wood, "Universal semantic disentangled privacy-preserving speech representation learning," 2025. [Online]. Available: https://arxiv.org/abs/2505.13085
[19] J. Yao, N. Kuzmin, Q. Wang, P. Guo, Z. Ning, D. Guo, K. A. Lee, E.-S. Chng, and L. Xie, "NPU-NTU System for Voice Privacy 2024 Challenge," in 4th Symposium on Security and Privacy in Speech Communication, 2024, pp. 67–71.
[20] Y. Halychanskyi, C. Churchwell, Y. Wen, and V. Kindratenko, "Fac-facodec: Controllable zero-shot foreign accent conversion with factorized speech codec," 2026. [Online]. Available: https://arxiv.org/abs/2510.10785
[21] J. Melechovsky, A. Mehrish, B. Sisman, and D. Herremans, "Accent conversion in text-to-speech using multi-level vae and adversarial training," in TENCON 2024 - 2024 IEEE Region 10 Conference (TENCON). IEEE, 2024, pp. 473–476.
[22] ——, "Accented text-to-speech synthesis with a conditional variational autoencoder," in TENCON 2024 - 2024 IEEE Region 10 Conference (TENCON). IEEE, 2024, pp. 343–346.
[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[24] S. Liu, "Zero-shot voice conversion with diffusion transformers," 2024. [Online]. Available: https://arxiv.org/abs/2411.09943
[25] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," 2022. [Online]. Available: https://arxiv.org/abs/2212.04356
[26] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, "UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022," in Interspeech 2022, 2022, pp. 4521–4525.
[27] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification," in Interspeech 2020, 2020, pp. 3830–3834.
[28] J. S. Garofolo et al., "TIMIT: Acoustic-phonetic continuous speech corpus," 1993. [Online]. Available: http://www.worldcat.org/isbn/1585630195
[29] H. Zhu, W. Kang, Z. Yao, L. Guo, F. Kuang, Z. Li, W. Zhuang, L. Lin, and D. Povey, "Zipvoice: Fast and high-quality zero-shot text-to-speech with flow matching," 2025. [Online]. Available: https://arxiv.org/abs/2506.13053