Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection



Jinyang Wu^1, Zihan Pan^1,**, Qiquan Zhang^2, Sailor Hardik Bhupendra^1, Soumik Mondal^1
^1 Agency for Science, Technology and Research (A*STAR), Singapore
^2 The University of New South Wales, Australia
{wu_jinyang,panz}@a-star.edu.sg
** indicates the corresponding author.

Abstract

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues: early quantizers encode coarse structure, while later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof 5 over strong baselines.

Index Terms: Speech Deepfake Detection, Anti-spoofing, Codec Representation Learning

1. Introduction

Recent advances in text-to-speech (TTS) and voice conversion (VC) have made it increasingly easy to generate highly natural-sounding speech. Despite their perceptual realism, deepfake utterances often exhibit subtle low-level inconsistencies, such as imperfect transient modeling, over-smoothed spectral details, or locally distributed artifacts that are difficult to capture reliably under real-world distortions [1, 2].
Designing detectors that are both sensitive to these fine-grained artifacts and robust across domains remains a central challenge.

A dominant trend in deepfake detection is to build detectors on top of self-supervised learning (SSL) speech encoders [3-5]. SSL representations are strong and transferable, offering contextualized frame-level features that generalize well across tasks and domains. However, the same contextual abstraction that benefits semantic modeling may attenuate fine-grained local cues that are highly indicative of synthetic generation, especially when artifacts are weak, sparse, or partially masked by channel effects. This motivates exploring complementary representations beyond purely continuous SSL embeddings.

Neural audio codecs provide such a complementary view. Instead of producing only continuous features, modern codecs discretize speech through residual vector quantization (RVQ), forming a sequence of quantizers that progressively encode coarse-to-fine residual details [6, 7]. This process induces a structured residual hierarchy: earlier quantizers capture dominant structure, while later ones encode finer residual information. From a forensic perspective, synthesis artifacts may not distribute uniformly across quantizers but concentrate in specific residual levels.

Despite this structural property, the potential of neural codec representations remains underexplored in speech deepfake detection. Existing codec-based representation learning studies, such as Codec2Vec [8], primarily aim at learning general speech representations from discrete codec tokens, without considering their forensic utility in detection tasks. Meanwhile, recent detection frameworks [9] have begun incorporating pre-trained codec models such as EnCodec [10] as additional feature extractors in multi-view architectures.
However, in these approaches codec outputs are treated as generic speech representations, without explicitly modeling the hierarchical structure introduced by residual vector quantization (RVQ). Furthermore, while both SSL models and neural codecs provide powerful learned speech representations, their interaction in deepfake detection remains largely underexplored. In particular, how to effectively align codec representations with SSL-based encoders while preserving the quantizer-level hierarchy has not been systematically studied.

Motivated by this, we study lightweight SSL-codec integration with explicit quantizer-level modeling. We first establish strong and stable hierarchy-preserving baselines that keep codec residual levels disentangled and combine them with SSL features through late concatenation and a single projection layer. Building on this setup, we propose Quantizer-Aware Static Fusion (QAF-Static), a parameter-light operator that learns global importance weights over residual quantizers to form a hierarchy-guided codec representation before integration with SSL features. This design injects an interpretable structural bias aligned with RVQ, while maintaining training stability and minimal computational overhead.

2. Related Work

2.1. SSL-based deepfake detection and parameter-efficient adaptation

Self-supervised learning speech encoders [11, 12] have become the dominant backbone for spoofing and deepfake detection due to their transferability and contextual abstraction. Prior studies reveal that spoofing cues are not uniformly distributed across transformer layers. Pan et al. [13] propose attentive merging of hidden embeddings from pre-trained speech models, showing that earlier SSL layers can contain stronger anti-spoofing signals and enabling lightweight detectors with partial backbones.
To further reduce adaptation cost, MoLEx [14] introduces a mixture-of-LoRA-experts framework that selectively adapts SSL encoders through parameter-efficient routing mechanisms. These works demonstrate that structured exploitation of SSL internals, either across layers or across experts, can improve detection robustness without full fine-tuning.

Complementary to SSL-only systems, multi-view strategies have also been explored. For instance, combining SSL embeddings with spectral features (e.g., MFCC/LFCC/CQCC) via concatenation or attention improves generalization under unseen distortions [15]. These approaches treat additional representations as complementary views but do not explicitly consider structural priors within those representations.

2.2. Neural audio codecs: continuous latents and discrete RVQ hierarchy

Neural audio codecs (NACs) compress waveforms into a latent space optimized for reconstruction [6, 10, 16, 17]. Given an input waveform x, the encoder produces continuous latent features

    z = E(x), \quad z \in \mathbb{R}^{T \times D},    (1)

which are typically used directly as continuous representations in detection pipelines. To enable bitrate control and compact transmission, NACs further discretize z via residual vector quantization (RVQ). RVQ approximates each latent vector z_t through a sequence of residual codebooks:

    r_t^{(0)} = z_t,    (2)
    k_{t,q} = \arg\min_k \| r_t^{(q-1)} - e_{q,k} \|_2^2,    (3)
    r_t^{(q)} = r_t^{(q-1)} - e_{q,k_{t,q}},    (4)

for q = 1, ..., Q, where e_{q,k} denotes the k-th codeword in the q-th codebook. The reconstructed latent is given by

    \hat{z}_t = \sum_{q=1}^{Q} e_{q,k_{t,q}}.    (5)

This procedure induces a coarse-to-fine residual hierarchy [18]: earlier quantizers capture dominant structure, while later quantizers encode progressively finer residual details.
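The RVQ recursion in Eqs. (2)-(5) can be made concrete with a short sketch. The following numpy code is illustrative only: the codebooks are random stand-ins for trained ones, and `rvq_encode` is a hypothetical helper of ours, not part of any codec library.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization, following Eqs. (2)-(5).

    z:         (T, D) continuous latents z_t from the codec encoder.
    codebooks: list of Q arrays e_q, each of shape (K, D).
    Returns indices k_{t,q} of shape (T, Q), the reconstruction
    z_hat (Eq. 5), and the final residual r^(Q).
    """
    residual = z.copy()                  # r^(0) = z_t            (Eq. 2)
    z_hat = np.zeros_like(z)
    indices = []
    for cb in codebooks:                 # q = 1, ..., Q
        # nearest codeword per frame under squared L2 distance   (Eq. 3)
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        k = d2.argmin(axis=1)
        e = cb[k]                        # selected codewords e_{q,k_{t,q}}
        residual = residual - e          # r^(q) = r^(q-1) - e    (Eq. 4)
        z_hat += e                       # accumulate the sum of  (Eq. 5)
        indices.append(k)
    return np.stack(indices, axis=1), z_hat, residual

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 16))            # T = 50 frames, D = 16
# coarse-to-fine: later codebooks cover smaller residual scales
codebooks = [rng.normal(scale=0.5 ** q, size=(32, 16)) for q in range(4)]
idx, z_hat, r = rvq_encode(z, codebooks)
```

By construction z = z_hat + r; with well-trained codebooks the residual shrinks as Q grows, which is exactly the coarse-to-fine hierarchy the paper exploits.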
Yet this hierarchical organization is rarely modeled explicitly, as many approaches either rely on continuous latent representations or apply uniform aggregation over discrete codec embeddings.

In practice, codec representations have begun to appear in speech deepfake forensics, but are leveraged in different ways. For example, multi-view detection systems incorporate pre-trained neural codec models (e.g., EnCodec) as additional feature extractors and fuse them with other acoustic or SSL-derived representations [19]; in these settings, the encoder's continuous outputs are typically used as features, rather than the discrete RVQ codes. In a different direction, SafeEar [20] repurposes a codec-style RVQ stack into a decoupling model that separates semantic and acoustic tokens for content privacy, and performs detection using acoustic-only tokens. Beyond detection, codec-aware forensics has also been explored for CodecFake source tracing [21] by predicting codec taxonomy attributes such as VQ type, auxiliary objectives, and decoder structures.

In contrast to treating codec outputs as generic features, partitioning RVQ layers for privacy, or using taxonomy labels for attribution, we focus on explicitly modeling the RVQ quantizer hierarchy itself and studying lightweight, hierarchy-aware SSL-codec fusion for deepfake detection.

Figure 1: Overview of the proposed speech deepfake detection framework. SSL features from WavLM (with Attentive Merging) are fused with codec representations through quantizer-aware weighting over RVQ levels (Method 2). A quantizer mean pooling baseline (Method 1) is included for comparison.

3. Method

Neural audio codecs represent speech through residual vector quantization (RVQ), where successive codebooks progressively refine the latent representation and form a coarse-to-fine hierarchy.
We hypothesize that this hierarchy is informative for deepfake detection, since synthesis artifacts may distribute unevenly across residual levels. However, existing approaches typically collapse codec outputs into flat embeddings. To address this limitation, we introduce a lightweight quantizer-aware operator that preserves RVQ structure and enables hierarchy-aware aggregation of codec representations.

3.1. Hierarchy codec modeling

Residual vector quantization represents each latent vector as a structured sum of residual levels (Eq. 5), naturally inducing an ordered hierarchy across quantizers. Earlier levels encode coarse acoustic structure, while later levels capture finer residual refinements.

A simple aggregation strategy applies uniform averaging across quantizers (Method 1: Quantizer Mean Pooling):

    \tilde{H}^{c}_{\mathrm{avg}} = \frac{1}{Q} \sum_{q=1}^{Q} \tilde{E}^{(q)}.    (6)

However, this assumes equal importance across quantizers and embedding dimensions, while synthesis artifacts may appear unevenly across both hierarchical levels and feature channels. To address this limitation, we introduce a lightweight Quantizer-Aware Dimension-Wise Static Aggregation (Method 2), which learns global importance weights across residual quantizers while preserving the RVQ hierarchy.

3.2. Quantizer-Aware Dimension-wise Static Aggregation (QAF-Static)

Neural codecs typically employ multiple residual codebooks to progressively refine acoustic representations [22]. While standard practice aggregates codebook embeddings via uniform averaging, such treatment implicitly assumes equal contribution across quantization levels. However, in practice, different codebooks capture heterogeneous information [18], and their relative importance may vary across embedding dimensions [23]. To enable fine-grained quantizer preference modeling while preserving optimization stability, we introduce a dimension-wise static aggregation mechanism.
Given Q codec codebooks, each produces a frame-level embedding:

    \tilde{E}^{(q)} \in \mathbb{R}^{B \times T \times D},    (7)

where D denotes the embedding dimension. Instead of uniformly averaging or learning Q scalar weights, we introduce a dimension-wise static reweighting matrix:

    W \in \mathbb{R}^{Q \times D}.    (8)

For each embedding dimension d, we normalize weights across codebooks:

    \alpha_{q,d} = \frac{\exp(W_{q,d}/\tau)}{\sum_{q'=1}^{Q} \exp(W_{q',d}/\tau)},    (9)

where \tau is a temperature parameter that stabilizes training. The aggregated codec representation is computed as:

    \tilde{H}^{c}_{b,t,d} = \sum_{q=1}^{Q} \alpha_{q,d} \, \tilde{E}^{(q)}_{b,t,d}.    (10)

This formulation enables each embedding channel to independently select the most informative quantization group, providing fine-grained quantizer-aware feature redistribution while maintaining the stability of a static (input-independent) design. This mechanism can be interpreted as channel-wise attention over quantization groups, where each embedding dimension performs a soft selection among codec codebooks.

3.3. Lightweight SSL-Codec Fusion

Once the codec hierarchy has been explicitly modeled, we deliberately decouple structural aggregation from cross-stream interaction. Our goal is to isolate the contribution of quantizer-aware modeling, rather than attributing performance gains to increasingly complex fusion modules. Therefore, we adopt a lightweight late-fusion strategy that preserves representational independence between the SSL contextual abstraction and the codec residual hierarchy.

After hierarchy-aware aggregation, we fuse the SSL and codec streams via late concatenation followed by linear projection. Given projected SSL features H^{ssl} \in \mathbb{R}^{B \times T \times d_{model}} and aggregated codec features \tilde{H}^{c} \in \mathbb{R}^{B \times T \times d_{codec}}, we compute:

    H^{f} = \mathrm{Linear}([H^{ssl}; \tilde{H}^{c}]).    (11)

Late concatenation preserves representational independence between continuous contextual abstraction (SSL) and the discrete residual hierarchy (RVQ).
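To make Method 1 versus Method 2, and Eqs. (9)-(11), concrete, here is a minimal numpy sketch. It is an illustration under simplifying assumptions: the embeddings, the matrix `W`, and the projection are random or fixed arrays rather than trained parameters, and the function name `qaf_static` is ours, not from any released code.

```python
import numpy as np

def qaf_static(E, W, tau=1.0):
    """Quantizer-aware dimension-wise static aggregation (Eqs. 9-10).

    E: (Q, B, T, D) stacked per-quantizer embeddings E~^(q).
    W: (Q, D) static reweighting matrix; tau is the temperature.
    """
    logits = W / tau
    # softmax over the quantizer axis, independently per dimension d (Eq. 9)
    alpha = np.exp(logits - logits.max(axis=0, keepdims=True))
    alpha = alpha / alpha.sum(axis=0, keepdims=True)      # (Q, D)
    # dimension-wise weighted sum over quantizers               (Eq. 10)
    return np.einsum('qd,qbtd->btd', alpha, E)

Q, B, T, D = 8, 2, 10, 128
rng = np.random.default_rng(0)
E = rng.normal(size=(Q, B, T, D))

H_mean = E.mean(axis=0)                    # Method 1: mean pooling (Eq. 6)
H_qaf = qaf_static(E, np.zeros((Q, D)))    # uniform W recovers Method 1

# Eq. (11): late fusion = concatenate both streams, then project
H_ssl = rng.normal(size=(B, T, 64))        # stand-in projected SSL features
P = rng.normal(size=(64 + D, 64))          # stand-in projection weights
H_f = np.concatenate([H_ssl, H_qaf], axis=-1) @ P
```

With W = 0 every dimension weights the Q quantizers uniformly, so QAF-Static collapses to mean pooling; a strongly peaked row of W instead makes the corresponding quantizer dominate that dimension, which is the soft channel-wise selection described above.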
This design avoids premature cross-stream interaction and allows codec structure modeling to remain disentangled from contextual abstraction.

We optionally apply Attentive Merging (AttM) [13] to aggregate multi-layer SSL representations prior to fusion. This component is orthogonal to RVQ hierarchy modeling and serves as a backbone enhancement.

The fused representation is passed through a lightweight single-layer LSTM followed by a linear classifier. The decoder is intentionally compact to isolate the contribution of representation modeling.

4. Experimental Setup

4.1. Datasets and Evaluation Metric

We evaluate our method on ASVspoof 2019 Logical Access (19LA) [24] and ASVspoof 5 [25]. For ASVspoof 2019 LA, we follow the official train/dev/eval protocol. For ASVspoof 5, we use the standard train/dev/eval split. Unless otherwise specified, models are trained separately on 19LA and ASVspoof 5 and evaluated on their respective evaluation sets. Equal Error Rate (EER, %) is used as the primary metric.

Table 1: Performance on ASVspoof 5 Track 1 (EER %, lower is better). Relative improvement is computed with respect to the AttM baseline (6.60%). All our models keep the SSL encoder frozen during training.

Method                                   EER (%)   Rel. Impr.
Public baselines:
  Wav2Vec2-AASIST-KAN [26]               22.67     -
  SEMAA-1 [27]                           23.63     -
  AASIST-CAM++ fused [28]                25.47     -
  WavLM (FT-DA) [29]                     17.08     -
  Best challenge submission [30]          8.61     -
SSL-based baseline (SSL fine-tuned):
  AttM + LSTM (baseline) [13]             6.60     -
Hierarchy-aware modeling (ours, SSL frozen):
  QAF-Static (codecF, sslF, Method 1)     6.01     8.9%
  QAF-Static (codecF, sslF, Method 2)     6.04     8.5%
  QAF-Static (codecT, sslF, Method 2)     5.68     13.9%

4.2. Model Configuration and Training Details

All systems employ WavLM-Large as the SSL backbone. We use the first 12 transformer layers following the AttM configuration [13]. Unless AttM is enabled, the final layer representation is used as H^{ssl}.
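As a side note, the Equal Error Rate reported in Tables 1 and 2 can be computed from detector scores as sketched below. This is our own minimal implementation; standard toolkits typically interpolate between thresholds and may differ slightly in the last decimals.

```python
import numpy as np

def eer_percent(bona_scores, spoof_scores):
    """Equal Error Rate (%): the operating point where the false-acceptance
    rate on spoofed trials equals the false-rejection rate on bona fide
    trials. Convention: higher score = more likely bona fide."""
    bona = np.asarray(bona_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    far = np.array([(spoof >= t).mean() for t in thresholds])  # accept spoof
    frr = np.array([(bona < t).mean() for t in thresholds])    # reject bona
    i = np.argmin(np.abs(far - frr))    # threshold closest to the crossing
    return 100.0 * (far[i] + frr[i]) / 2.0
```

For perfectly separated scores, e.g. `eer_percent([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])`, the result is 0.0; fully interleaved score distributions drive the EER toward 50%.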
For AttM, multi-layer SSL representations are aggregated as in [13]. Importantly, the SSL backbone remains fully frozen during training.

For the codec stream, we adopt the Facebook EnCodec model [17], which employs residual vector quantization (RVQ). The codec uses Q = 8 quantizers, each with a codebook size of 1024 and an embedding dimension of 128. Discrete codec indices are mapped to trainable embeddings before fusion.

All models are trained using Adam with identical optimization settings across variants. Early stopping is applied based on development-set EER to ensure fair comparison. Only the lightweight codec fusion layers and the final classifier are optimized during training.

5. Results and Discussion

5.1. Hierarchy-aware modeling of codec representations

Table 1 and Table 2 compare different fusion strategies on ASVspoof 2019 and ASVspoof 5. QAF-Static improves over the AttM baseline by a 46% relative EER reduction on ASVspoof 2019 LA, and on ASVspoof 5 it achieves a 13.9% relative improvement over the fully fine-tuned SSL model.

We first observe that codec-only modeling yields limited performance across datasets. While RVQ embeddings contain reconstruction-level cues, they lack the long-range contextual information provided by SSL features. Adding a lightweight alignment module improves performance, indicating that matching the codec embedding space to the SSL feature scale is necessary.

We further analyze the parameter efficiency of the proposed codec-aware modeling. The WavLM backbone contains approximately 315M parameters and is kept frozen throughout training, while the neural codec branch introduces only about 14M additional parameters (approximately 4.4% of the SSL backbone). Despite this small parameter footprint, incorporating codec representations already improves performance over the AttM baseline even when the codec parameters remain frozen. When the codec branch is fine-tuned while the SSL backbone stays frozen, the proposed model further surpasses strong SSL-based baselines that rely on full backbone fine-tuning, achieving state-of-the-art performance on ASVspoof 2019 LA and competitive results on ASVspoof 5.

These observations suggest that codec representations provide complementary information beyond SSL features, particularly for capturing fine-grained signal artifacts that are important for deepfake detection.

Figure 2: Detection performance across codec groups (A-F) in the CodecFake benchmark. The proposed quantizer-aware static fusion improves over the AttM-LSTM baseline and codec concatenation on codec family group B.

5.2. Complementarity with SSL-based representations

We compare our hierarchy-aware codec modeling with Attentive Merging (AttM), which exploits the hierarchical structure of SSL transformer layers. While AttM models the layer-wise hierarchy within SSL representations, our approach focuses on the discrete residual hierarchy induced by RVQ. These two forms of hierarchy operate at different levels of representation.

Figure 3 illustrates the learned contributions of different codec quantizers. The distribution is clearly non-uniform, indicating that the model automatically learns the relative importance of quantizers instead of relying on uniform aggregation. Notably, the first quantizer receives the largest contribution, while intermediate quantizers receive smaller weights. This pattern reflects the hierarchical structure of residual vector quantization, where earlier quantizers capture coarse acoustic structure and later quantizers encode finer residual details.

Figure 3: Learned quantizer contribution weights (\alpha_q) on the ASVspoof 5 dataset. Both the SSL encoder and the codec encoder are frozen during training.
Even with identical SSL backbones, incorporating quantizer-aware codec modeling provides complementary gains beyond SSL layer merging. This result suggests that RVQ hierarchy modeling captures forensic cues that are not fully represented by SSL features alone.

Table 2: Comparison on ASVspoof 2019 LA (EER %, lower is better). Relative improvement is computed with respect to the AttM baseline (0.65%). codecF denotes that the codec module is frozen, while codecT indicates that the codec is fine-tuned.

Method                                  EER     Rel. Impr.
Public baselines:
  W2V-XLSR-LLGF [32]                    2.80    -
  HuBERT-XL [32]                        3.55    -
  W2V-Large1-LLGF [32]                  0.86    -
  W2V2-base-DARTS [33]                  1.19    -
  W2V2-large-DARTS [33]                 1.08    -
SSL-based detectors:
  AttM [13]                             0.65    -
  MoLEx [14]                            0.44    32.3%
Hierarchy-aware modeling (ours):
  QAF-Static (codecF, sslF, Method 1)   0.53    18.4%
  QAF-Static (codecF, sslF, Method 2)   0.44    32.3%
  QAF-Static (codecT, sslF, Method 2)   0.35    46.2%

5.3. Cross-codec robustness evaluation

To evaluate robustness against heterogeneous neural codec artifacts, we adopt the CodecFake benchmark [31], which comprises multiple codec families with distinct quantization structures and bitrate configurations. Different codecs employ diverse RVQ architectures and compression strategies, resulting in varying distributions of quantization artifacts.

Moreover, the most informative quantizer levels may vary across samples and attack types. While our static quantizer-aware aggregation provides a stable hierarchy prior, it may partially average out codec-specific artifact cues when artifact patterns differ significantly across generation mechanisms. Therefore, evaluating cross-codec robustness is important to verify whether the detector can generalize beyond the training codec distribution.
In particular, Group B includes AcademiaCodec and HiFi-Codec models, which are characterized by relatively compact RVQ hierarchies and lower effective bitrates, often producing more concentrated quantization artifacts. This benchmark allows us to examine how the detector generalizes across different codec generation mechanisms. As shown in Fig. 2, the proposed fusion model outperforms the AttM-LSTM baseline on Group B (the first three subsets), while yielding comparable performance on other codec families. These results suggest that stronger and more consistent quantization artifacts may provide clearer cues for codec-aware representation learning, while generalization across diverse codec structures remains an open challenge for future work.

6. Conclusion

We study codec-assisted speech deepfake detection from a structured representation modeling perspective. Instead of treating neural codec outputs as an unstructured auxiliary stream, we explicitly model the residual hierarchy induced by RVQ. We show that hierarchy-preserving fusion provides a strong baseline, and propose Quantizer-Aware Static Fusion (QAF-Static), a lightweight quantizer-weighting mechanism for codec aggregation. Experiments demonstrate consistent improvements over strong SSL baselines. Notably, our approach keeps the SSL backbone frozen and introduces only about 4.4% additional parameters relative to the backbone, showing that explicitly modeling the RVQ hierarchy provides an effective and parameter-efficient way to leverage neural codec representations for speech deepfake detection.

7. References

[1] N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, "Does Audio Deepfake Detection Generalize?" in Interspeech 2022, 2022, pp. 2783-2787.

[2] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan et al.,
"ADD 2022: The first audio deep synthesis detection challenge," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216-9220.

[3] X. Xuan, X. Liu, W. Zhang, Y.-C. Lin, X. Lin, and T. Kinnunen, "Wavesp-Net: Learnable wavelet-domain sparse prompt tuning for speech deepfake detection," arXiv preprint arXiv:2510.05305, 2025.

[4] X. Xuan, Z. Zhu, W. Zhang, Y.-C. Lin, and T. Kinnunen, "Fake-Mamba: Real-time speech deepfake detection using bidirectional Mamba as self-attention's alternative," arXiv preprint arXiv:2508.09294, 2025.

[5] A. Guragain, T. Liu, Z. Pan, H. B. Sailor, and Q. Wang, "Speech foundation model ensembles for the controlled singing voice deepfake detection (CtrSVDD) challenge 2024," in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 774-781.

[6] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "SoundStream: An end-to-end neural audio codec," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495-507, 2021.

[7] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., "Kimi-Audio technical report," arXiv preprint arXiv:2504.18425, 2025.

[8] W.-C. Tseng and D. Harwath, "Codec2Vec: Self-supervised speech representation learning using neural speech codecs," arXiv preprint arXiv:2511.16639, 2025.

[9] Y. Yang, H. Qin, H. Zhou, C. Wang, T. Guo, K. Han, and Y. Wang, "A robust audio deepfake detection system via multi-view feature," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 13131-13135.

[10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.

[11] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al.,
"WavLM: Large-scale self-supervised pre-training for full stack speech processing," 2021.

[12] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," 2021.

[13] Z. Pan, T. Liu, H. B. Sailor, and Q. Wang, "Attentive merging of hidden embeddings from pre-trained speech model for anti-spoofing detection," arXiv preprint arXiv:2406.10283, 2024.

[14] Z. Pan, S. H. Bhupendra, and J. Wu, "MoLEx: Mixture of LoRA experts in speech self-supervised models for audio deepfake detection," arXiv preprint arXiv:2509.09175, 2025.

[15] Y. El Kheir, A. Das, E. E. Erdogan, F. Ritter-Guttierez, T. Polzehl, and S. Möller, "Two views, one truth: Spectral and self-supervised features fusion for robust speech deepfake detection," in 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2025, pp. 1-5.

[16] Z. Xin, Z. Dong, L. Shimin, Z. Yaqian, and Q. Xipeng, "SpeechTokenizer: Unified speech tokenizer for speech language models," in Proc. Int. Conf. Learn. Representations, vol. 24, 2024, pp. 27-30.

[17] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," arXiv preprint arXiv:2210.13438, 2022.

[18] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, "SpeechTokenizer: Unified speech tokenizer for speech language models," 2023.

[19] Y. Yang, H. Qin, H. Zhou, C. Wang, T. Guo, K. Han, and Y. Wang, "A robust audio deepfake detection system via multi-view feature," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 13131-13135.

[20] X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, "SafeEar: Content privacy-preserving audio deepfake detection," in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024, pp.
3585-3599.

[21] X. Chen, I. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, J.-S. R. Jang et al., "Codec-based deepfake source tracing via neural audio codec taxonomy," arXiv preprint arXiv:2505.12994, 2025.

[22] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," Transactions on Machine Learning Research, 2023.

[23] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Speak foreign languages with your own voice: Cross-lingual neural codec language modeling," arXiv preprint arXiv:2303.03926, 2023.

[24] A. Nautsch, X. Wang, N. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K. A. Lee, "ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech," IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 2, pp. 252-265, 2021.

[25] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.

[26] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[27] W. Xia, H. Peng, L. Li, and Y. Ren, "A single end-to-end voice anti-spoofing model with graph attention and feature aggregation for ASVspoof 5 challenge," Proc. ASVspoof 2024, pp. 124-130, 2024.

[28] D.-T. Truong, Y. Wang, K. A. Lee, M. Li, H. Nishizaki, and E. S. Chng, "A study of guided masking data augmentation for deepfake speech detection," in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 176-180.

[29] D. Combei, A.
Stan, D. Oneata, and H. Cucu, "WavLM model ensemble for audio deepfake detection," arXiv preprint arXiv:2408.07414, 2024.

[30] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.

[31] H. Wu, Y. Tseng, and H.-y. Lee, "CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems," arXiv preprint arXiv:2406.07237, 2024.

[32] X. Wang and J. Yamagishi, "Investigating self-supervised front ends for speech spoofing countermeasures," arXiv preprint arXiv:2111.07725, 2021.

[33] C. Wang, J. Yi, J. Tao, H. Sun, X. Chen, Z. Tian, H. Ma, C. Fan, and R. Fu, "Fully automated end-to-end fake audio detection," in Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022, pp. 27-33.
