Revisiting the Privacy of Low-Frequency Speech Signals: Exploring Resampling Methods, Evaluation Scenarios, and Speaker Characteristics
While audio recordings in real life provide insights into social dynamics and conversational behavior, they also raise concerns about the privacy of personal, sensitive data. This article explores the effectiveness of restricting recordings to low-frequency audio to protect spoken content. When resampling the audio signals to lower sampling rates, we compare pipelines with and without anti-aliasing filtering. Privacy enhancement is measured by an increased word error rate of automatic speech recognition models; the impact on utility is measured with voice activity detection models. Our experimental results show that for clean recordings, models trained with a sampling rate as low as 800 Hz transcribe the majority of words correctly. For both model types, we analyze the impact of the speaker’s sex and pitch, and we demonstrate that omitting anti-aliasing filters compromises speech privacy more strongly.
💡 Research Summary
This paper investigates whether limiting audio recordings to low‑frequency content can effectively protect spoken content while still preserving useful downstream functionality such as voice activity detection (VAD). The authors focus on two intertwined aspects: (1) the impact of different resampling strategies—including the presence or absence of anti‑aliasing filtering—on privacy and utility, and (2) how speaker characteristics, specifically sex and fundamental pitch, influence these outcomes.
Four down‑sampling rates were examined: 1600 Hz, 800 Hz, 500 Hz, and 320 Hz, each derived from original 16 kHz speech. Two resampling pipelines were compared: (a) conventional down‑sampling with a low‑pass anti‑aliasing filter, and (b) naïve subsampling without filtering followed by up‑sampling (which introduces spectral aliasing). The torchaudio sinc‑interp‑hann implementation served as the primary tool, while librosa’s soxr‑hq was noted to give higher fidelity at the cost of longer processing time.
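The privacy-relevant difference between the two pipelines comes down to spectral folding: without an anti-aliasing low-pass filter, any frequency above the new Nyquist limit reappears as an alias inside the retained band. A minimal plain-Python sketch of that folding rule (not the paper's torchaudio pipeline) makes the effect concrete:

```python
def folded_frequency(f_hz: float, fs_new_hz: float) -> float:
    """Frequency (Hz) at which a tone at f_hz appears after naive
    subsampling to fs_new_hz without an anti-aliasing filter."""
    nyquist = fs_new_hz / 2.0
    f = f_hz % fs_new_hz              # aliasing is periodic in fs_new
    return fs_new_hz - f if f > nyquist else f

# A 1 kHz tone lies above the 400 Hz Nyquist of an 800 Hz signal.
# A filtered pipeline removes it; naive subsampling folds it to 200 Hz,
# handing the ASR model extra in-band information:
print(folded_frequency(1000.0, 800.0))  # -> 200.0
print(folded_frequency(500.0, 800.0))   # -> 300.0
```

This is why the "informed attacker" results below differ so sharply between the filtered and unfiltered conditions: the aliased copies of high-frequency speech energy are spurious from a signal-fidelity standpoint, but still carry linguistic content.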
Privacy was quantified using the word error rate (WER) of an automatic speech recognition (ASR) system; a higher WER indicates better protection. Utility was measured with a VAD model, evaluated via area under the ROC curve (AUC) and Matthews correlation coefficient (MCC). The ASR model is a 71.5 M‑parameter transformer‑based architecture trained on 360 h of LibriSpeech for 30 epochs. The VAD model is a 109 k‑parameter convolutional‑recurrent network trained on the LibriParty dataset for 100 epochs. Both models operate on 80‑dimensional log‑Mel filterbank features (25 ms windows, 10 ms hop).
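The WER used as the privacy metric is the standard word-level edit distance (substitutions, insertions, and deletions) normalized by the reference length. A minimal reference implementation in plain Python (illustrating the metric's definition, not the authors' scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] = edit distance between the ref words seen so far and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # deletion of a reference word
                d[j - 1] + 1,          # insertion of a hypothesis word
                prev_diag + (r != h),  # substitution (or exact match)
            )
    return d[-1] / len(ref)

print(wer("the cat sat", "the hat sat"))  # -> 0.333... (1 substitution / 3 words)
```

Note that WER can exceed 100 % when the hypothesis contains many insertions, which is why "near-100 % WER" in the ignorant-attacker results effectively means no usable transcription.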
Two attacker knowledge scenarios were defined following prior work: (1) an “ignorant” attacker who applies a 16 kHz‑trained ASR model directly to low‑frequency test data, and (2) an “informed” attacker who knows the resampling method and retrains the ASR model on low‑frequency data. This distinction reveals that many earlier studies over‑estimate privacy because they only consider the ignorant scenario.
Key findings:
- ASR performance – Ignorant models exhibit near‑100 % WER for sampling rates ≤800 Hz, suggesting strong privacy. However, informed models dramatically reduce WER: at 800 Hz the informed model achieves ~27 % WER on clean test data and ~61 % on the more challenging “test‑other” set. Only at the extreme 320 Hz does the informed model’s WER rise again to >95 %, indicating genuine protection at this very low rate.
- Speaker characteristics – Male speakers (lower pitch) consistently yield lower WER than female speakers across all low‑frequency conditions. Mann‑Whitney U tests confirm the difference is statistically significant (p < 0.01) for all rates except the original 16 kHz. Pitch analysis shows a clear monotonic relationship: as fundamental frequency approaches or exceeds the Nyquist limit, intelligibility drops, especially for high‑pitched female voices.
- Effect of aliasing – When the anti‑aliasing filter is omitted, the aliased high‑frequency components fold into the low‑frequency band, providing the ASR system with additional information. Consequently, informed models trained on aliased (sub‑upsampled) signals achieve markedly lower WER than those trained on filtered, truly low‑bandlimited signals. This demonstrates that aliasing substantially weakens privacy.
- VAD utility – VAD AUC remains high (≥0.94) down to 500 Hz and only modestly declines to 0.86 at 320 Hz, indicating that voice activity detection is robust to severe bandwidth reduction. MCC values show a sex‑dependent trend similar to ASR: male speakers are easier to detect, especially at the lowest rates. Introducing aliasing slightly improves MCC at 320 Hz and reduces the gender gap, but overall utility loss is modest compared with the privacy gains at higher rates.
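The significance test behind the speaker-characteristics finding, the Mann-Whitney U, is a rank-based comparison of two samples (here, per-speaker WERs of male vs. female speakers). A stdlib-only sketch of the U statistic itself; computing the p-values reported in the paper (p < 0.01) would additionally require the U distribution, e.g. via scipy.stats.mannwhitneyu:

```python
def mann_whitney_u(xs, ys):
    """U statistic: count of (x, y) pairs where x beats y; ties count 1/2.
    Larger U relative to len(xs) * len(ys) / 2 means xs tends to be larger."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

# Hypothetical per-speaker WERs (not the paper's data):
print(mann_whitney_u([0.61, 0.72, 0.80], [0.25, 0.31]))  # -> 6.0 (complete separation)
```

Being rank-based, the test makes no normality assumption about the WER distributions, which is a sensible choice for error rates bounded below by zero.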
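The MCC reported for the VAD utility evaluation is computed from frame-level detection counts and, unlike accuracy, stays informative under the class imbalance typical of speech/non-speech frames. A minimal sketch of the standard formula (not the authors' evaluation code):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from frame-level VAD counts.
    +1 = perfect detection, 0 = chance level, -1 = total disagreement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

print(mcc(tp=50, tn=50, fp=0, fn=0))    # -> 1.0 (perfect)
print(mcc(tp=25, tn=25, fp=25, fn=25))  # -> 0.0 (chance)
```

The zero-denominator guard handles degenerate cases (e.g. a recording with no speech frames), where MCC is conventionally set to 0.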
The authors conclude that low‑frequency audio can indeed serve as a privacy‑preserving representation, provided that (i) the sampling rate is reduced to ≤800 Hz and (ii) anti‑aliasing filtering is applied to prevent spectral folding. Nevertheless, a determined attacker who retrains models on low‑frequency data can recover a substantial portion of linguistic content, especially at 800 Hz and 500 Hz. Therefore, realistic privacy assessments must incorporate the informed‑attacker scenario. Moreover, speaker‑dependent effects imply that systems targeting heterogeneous user populations may need adaptive filtering or additional obfuscation (e.g., pitch shifting) to equalize protection across sexes.
Overall, the paper offers a thorough experimental framework, quantifies the trade‑off between privacy and utility across multiple dimensions, and highlights practical considerations—such as the necessity of anti‑aliasing and the impact of speaker pitch—that are essential for designing privacy‑aware audio capture devices, especially in wearable or edge‑computing contexts where storage and bandwidth are limited.