Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
Recently, voice assistants (VAs) have become a central interface for human-machine interaction, exploiting speech as a biometric trait for user authentication and authorization [1]. In practical deployments, VAs operate in real time to control Internet of Things (IoT) devices and to transmit sensitive information, making timely and reliable spoofing detection a critical requirement. However, automatic speaker verification (ASV) systems remain vulnerable to a wide range of audio-based attacks [2].
Among these, logical access (LA) attacks manipulate speech content or speaker identity using text-to-speech (TTS) or voice conversion (VC) techniques, while compression- or quantization-induced artifacts can further obscure such manipulations, leading to so-called deepfake (DF) attacks [2]. Physical access (PA) attacks, instead, aim to deceive the ASV system at the microphone level [3]. In this setting, an adversary may either imitate the target speaker (impersonation attack) [1] or replay a previously captured recording using a loudspeaker (replay attack) [4]. In this work, we focus on replay attacks, as speech is inherently easy to capture in everyday environments and can be replayed with minimal effort [5]. Moreover, existing ASV systems often struggle to reliably discriminate between genuine and replayed speech, even when the attack is mounted with commodity off-the-shelf devices such as smartphones and loudspeakers [6], [7].
From a physical perspective, genuine speech and replayed speech are generated by fundamentally different sound production mechanisms: human speech originates from a complex vocal tract excitation, whereas replayed speech is emitted by an electro-acoustic transducer with its own frequency response, directivity, and radiation characteristics. These differences affect not only the spectral content of the signal but also its spatial propagation and interaction with the environment. In addition, the sound is emitted by a physical object, either a human talker or a loudspeaker, and thus does not originate from a single point but rather from an extended body with spatially distributed acoustic properties. Motivated by this observation, we investigate whether acoustic maps, capturing spatial and multi-channel sound field information, can be leveraged to distinguish between genuine and replayed speech excerpts in real time. By analyzing how sound is spatially produced and received across microphone arrays, we aim to assess whether these physical differences can be reliably exploited for replay attack detection.
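To make the notion of an acoustic map concrete, the minimal sketch below computes a delay-and-sum (steered response power) energy map over a discrete azimuth-elevation grid for a single multi-channel frame. It is an illustration only: the array geometries, grid resolutions, and beamformers actually employed are specified in Sec. III, and the function and parameter names used here are hypothetical.

```python
import numpy as np

def acoustic_map(x, mic_pos, fs, az_grid, el_grid, c=343.0):
    """Delay-and-sum (steered response power) acoustic map.

    x        : (n_mics, n_samples) multi-channel frame
    mic_pos  : (n_mics, 3) microphone coordinates in meters
    fs       : sampling rate in Hz
    az_grid  : candidate azimuth angles in radians
    el_grid  : candidate elevation angles in radians
    Returns a (n_el, n_az) map of beamformed output energy.
    """
    n_samples = x.shape[1]
    X = np.fft.rfft(x, axis=1)                       # per-channel spectra
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)

    amap = np.zeros((len(el_grid), len(az_grid)))
    for i, el in enumerate(el_grid):
        for j, az in enumerate(az_grid):
            # Far-field unit vector pointing toward the candidate direction.
            u = np.array([np.cos(el) * np.cos(az),
                          np.cos(el) * np.sin(az),
                          np.sin(el)])
            tau = mic_pos @ u / c                    # per-mic time advances [s]
            # Phase-align the channels and sum (delay-and-sum beamforming).
            steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
            y = (X * steer).sum(axis=0)
            amap[i, j] = np.sum(np.abs(y) ** 2)      # directional energy
    return amap
```

Maps of this kind can be computed on short frames and stacked or averaged over time before classification; the exact framing, beamformer, and normalization used in this work are those described in Sec. III.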
The contributions of this work are as follows:
• We introduce acoustic maps derived from classical beamforming as a spatial feature representation for replay speech detection, explicitly encoding directional energy distributions that reflect differences between human speech radiation and loudspeaker-based replay.
• We design a compact convolutional neural network tailored to acoustic maps, achieving competitive replay detection performance on the Realistic Replay Attack Microphone Array Speech (ReMASC) dataset with approximately 6k trainable parameters (an illustrative sketch is shown after this list).
• We evaluate the proposed approach under both environment-dependent and environment-independent conditions, analyzing robustness across different microphone arrays, beamformers, and unseen acoustic environments, and highlighting the strengths and limitations of spatial representations for replay detection.
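As an indication of the model scale referred to in the second contribution, the sketch below shows one way a compact CNN could operate on single-frame acoustic maps of shape (1, n_el, n_az). It is not the architecture described in Sec. III: the layer widths are assumptions chosen only to keep the parameter count in the low thousands.

```python
import torch
import torch.nn as nn

class AcousticMapCNN(nn.Module):
    """Illustrative compact CNN for genuine-vs-replay classification
    of acoustic maps of shape (batch, 1, n_el, n_az)."""

    def __init__(self, width=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, width, kernel_size=3, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(width, 2 * width, kernel_size=3, padding=1),
            nn.BatchNorm2d(2 * width),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> independent of grid size
        )
        self.classifier = nn.Linear(2 * width, 2)   # genuine vs. replay

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = AcousticMapCNN()
print(sum(p.numel() for p in model.parameters()))   # roughly 5k for width=16
```

With width=16 this sketch counts roughly 5k trainable parameters, the same order of magnitude as the network evaluated here; the global average pooling makes such a model agnostic to the azimuth-elevation grid resolution.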
The work is organized as follows: Sec. II reviews the literature on replay speech detection, encompassing both traditional and learning-based approaches; Sec. III describes the novel feature set and the convolutional neural network (CNN) used for detection; Sec. IV illustrates the experimental results on real data and the comparison with prior works, whereas Sec. V draws the conclusions.
In the literature, several works have tried to mitigate replay speech attacks by providing single-channel speech datasets, such as RedDots [3], ASVSpoof2017 PA [8], ASVSpoof2019 PA [9], and ASVSpoof2021 PA [2]. Building upon these datasets, recent state-of-the-art (SOTA) works have focused on hand-crafted single-channel features combined with simple classifiers (such as CNNs and Gaussian mixture models (GMMs)) after a beamforming phase (TECC [10], CTECC [11], and ETECC [12]). However, all these methods lack generalization capability, i.e., changing the acoustic properties of the test set reduces their predictions to chance level.
Microphone arrays are commonly employed in ASV systems for speech enhancement and separation tasks, exploiting spatial information to improve audio quality [13]. Moreover, multi-channel data can be beneficial for replay detection for several reasons: (i) multi-channel recordings encompass spatial audio cues that can aid the detection [4], [7], [14], and (ii) such spatial information cannot be easily counterfeited by an attacker, unlike single-channel data, whose temporal and frequency cues can be manipulated.