How Does a Deep Neural Network Look at Lexical Stress in English Words?


Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for neural network interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel’s first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning’s ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based on highly controlled stimuli.


💡 Research Summary

This paper investigates how deep neural networks perceive lexical stress in English disyllabic words, combining a novel, fully automatic dataset construction pipeline with state‑of‑the‑art convolutional neural network (CNN) classifiers and a thorough interpretability analysis using Layerwise Relevance Propagation (LRP).

Dataset creation – The authors first asked ChatGPT‑4o to generate 30 stress‑minimal pairs (e.g., PROtest vs. proTEST) and 250 additional disyllabic words that do not form minimal pairs. After discarding one erroneous entry, 249 non‑minimal‑pair words remained (124 with initial stress, 125 with final stress). These lexical items were harvested from three large corpora: LibriSpeech (read speech), the Supreme Court oral arguments corpus, and TED‑LIUM (public talks). Forced alignment was performed with the Montreal Forced Aligner for LibriSpeech and Supreme Court, while TED‑LIUM already provided word‑level timestamps. A 0.5‑second window was extracted around each target word, ensuring both syllables were fully captured; windows were then padded to a uniform length.
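The windowing step above can be sketched as follows. This is a minimal illustration assuming zero-padding at the end of short words; the paper does not specify its exact padding scheme, and the helper name and timestamps are hypothetical.

```python
import numpy as np

def extract_window(audio, sr, word_start, word_end, win_dur=0.5):
    """Cut a fixed-duration window centred on a target word.

    audio: 1-D waveform; sr: sample rate in Hz;
    word_start / word_end: forced-alignment timestamps in seconds.
    """
    target_len = int(win_dur * sr)
    centre = int((word_start + word_end) / 2 * sr)
    start = max(0, centre - target_len // 2)
    segment = audio[start:start + target_len]
    # Zero-pad words shorter than the window so all inputs share one length.
    if len(segment) < target_len:
        segment = np.pad(segment, (0, target_len - len(segment)))
    return segment

sr = 16000
audio = np.random.randn(sr * 3)              # 3 s of dummy audio
win = extract_window(audio, sr, 1.10, 1.48)  # alignment times for one word
```

Every extracted window then has exactly 0.5 s × 16 kHz = 8,000 samples, regardless of the underlying word duration.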

Stress labeling relied on two strategies: (1) part‑of‑speech tagging (SpaCy) for the minimal‑pair set, exploiting the well‑known tendency that nouns carry primary stress on the first syllable and verbs on the second; (2) CMU Pronouncing Dictionary entries for the non‑minimal‑pair set, where stress placement is unambiguous. Manual correction was applied to a handful of words that the POS tagger mis‑labelled. The final corpus comprised 7,446 initial‑stress and 3,263 final‑stress samples from the non‑minimal‑pair pool, plus 5,475 initial‑stress and 1,715 final‑stress instances from the minimal‑pair pool.
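The two labeling strategies can be sketched as below. In the paper the POS tag comes from SpaCy and the pronunciations from the CMU Pronouncing Dictionary; both lookups are stubbed here with toy data, and the function names are hypothetical.

```python
def label_minimal_pair(pos_tag):
    """Noun/verb stress rule for disyllabic minimal pairs (Penn Treebank tags)."""
    if pos_tag.startswith("NN"):   # nouns tend to carry initial stress
        return "initial"
    if pos_tag.startswith("VB"):   # verbs tend to carry final stress
        return "final"
    return None                    # flag the item for manual correction

# Stub CMU-style entries: the digit 1 marks the primary-stressed vowel.
cmudict_stub = {
    "wallet": ["W", "AA1", "L", "AH0", "T"],
    "extend": ["IH0", "K", "S", "T", "EH1", "N", "D"],
}

def label_from_dict(word):
    """Read stress position off the first vs. second vowel's stress digit."""
    vowels = [p for p in cmudict_stub[word] if p[-1].isdigit()]
    return "initial" if vowels[0].endswith("1") else "final"
```

Dictionary lookup is deterministic, which is why the non-minimal-pair set needed no manual correction, unlike the POS-based labels.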

Acoustic validation confirmed that the dataset exhibits classic stress cues: vowel amplitude and duration ratios between stressed and unstressed syllables differed significantly (p < .0001), with initial‑stress words showing higher amplitude (mean 0.6) and longer duration (mean 0.53) in the stressed syllable, and the opposite pattern for final‑stress words.
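A ratio of this kind can be computed as below. This is a plausible reading of the reported measure (the first syllable's share of the word's total amplitude and duration, so values above 0.5 indicate a louder/longer first syllable); the paper's exact ratio definition is an assumption here.

```python
import numpy as np

def stress_cue_ratios(syl1, syl2):
    """Syllable 1's share of the word's RMS amplitude and of its duration."""
    rms1 = np.sqrt(np.mean(syl1 ** 2))
    rms2 = np.sqrt(np.mean(syl2 ** 2))
    amp_ratio = rms1 / (rms1 + rms2)
    dur_ratio = len(syl1) / (len(syl1) + len(syl2))
    return amp_ratio, dur_ratio

sr = 16000
loud_long = 0.5 * np.ones(int(0.20 * sr))    # stressed-like first syllable
quiet_short = 0.1 * np.ones(int(0.10 * sr))  # unstressed-like second syllable
amp, dur = stress_cue_ratios(loud_long, quiet_short)
```

On this toy initial-stress word, both ratios exceed 0.5, mirroring the direction of the effect the authors report; final-stress words would show the opposite pattern.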

Model training – Spectrograms were generated via Short‑Time Fourier Transform (16 kHz sampling, 20 ms Hamming window, 10 ms hop) and z‑score normalized. Five CNN architectures were evaluated: LeNet‑5, VGG‑11, VGG‑16, VGG‑19, and ResNet‑18. To improve robustness, each training sample was augmented with a low‑pass‑filtered version (cutoff 3 kHz) and three noise‑mixed versions (20 dB, 10 dB, 3 dB SNR) using multi‑talker babble from the VOiCES corpus. After augmentation the training set contained 29,830 initial‑stress and 13,510 final‑stress examples. Crucially, the data split ensured no lexical overlap between training, validation, and test sets, eliminating the risk of over‑estimating generalization.
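The spectrogram settings and the SNR mixing can be sketched with SciPy as follows, under the paper's stated parameters (16 kHz, 20 ms Hamming window, 10 ms hop, z-score normalization). The log-magnitude step and the exact SNR scaling formula are standard choices assumed here, not confirmed details.

```python
import numpy as np
from scipy.signal import stft

SR = 16000
N_WIN = int(0.020 * SR)   # 20 ms Hamming window -> 320 samples
HOP = int(0.010 * SR)     # 10 ms hop -> 160 samples

def log_spectrogram(x):
    """Z-score-normalized log-magnitude STFT spectrogram."""
    _, _, Z = stft(x, fs=SR, window="hamming",
                   nperseg=N_WIN, noverlap=N_WIN - HOP)
    S = np.log(np.abs(Z) + 1e-8)
    return (S - S.mean()) / (S.std() + 1e-8)

def mix_at_snr(x, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (20/10/3 dB here)."""
    p_sig, p_noise = np.mean(x ** 2), np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + scale * noise

rng = np.random.default_rng(0)
word = rng.standard_normal(SR // 2)    # one 0.5 s word window
babble = rng.standard_normal(SR // 2)  # stand-in for VOiCES babble noise
spec = log_spectrogram(mix_at_snr(word, babble, 10))
```

With a 320-sample window the spectrogram has 161 frequency bins, and z-scoring leaves each input with zero mean and unit variance before it reaches the CNN.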

The VGG‑16 model achieved the highest performance, correctly classifying 92 % of held‑out test items (both minimal‑pair and non‑minimal‑pair words). Other architectures performed comparably (≈85‑90 % accuracy), demonstrating that deep CNNs can reliably learn stress patterns from raw spectrograms without handcrafted features.

Interpretability with LRP – To open the black box, the authors applied three LRP variants (ε‑rule, α1‑rule, and the composite (CMP) rule) to the trained VGG‑16 network. Heatmaps overlaid on spectrograms revealed that the model’s decisions were driven primarily by the syllable bearing primary stress. In minimal‑pair examples such as “PROtest” (initial stress) versus “proTEST” (final stress), the LRP maps highlighted the vowel region of the stressed syllable, especially the lower‑frequency bands corresponding to formant structure. Nevertheless, non‑stress regions also contributed modestly, indicating that the network integrates distributed acoustic cues across the whole word.
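To make the ε‑rule concrete, here is a toy relevance-propagation step through a single dense layer, not the paper's full VGG‑16 pipeline. The rule redistributes each output unit's relevance to the inputs in proportion to their contributions, with an ε term stabilizing near-zero denominators.

```python
import numpy as np

def lrp_epsilon_dense(x, W, b, R_out, eps=1e-6):
    """ε-rule LRP through one dense layer with pre-activations z = W @ x + b.

    Each input j receives relevance proportional to its contribution
    W[i, j] * x[j] to every output i it feeds.
    """
    z = W @ x + b
    s = R_out / (z + eps * np.sign(z))  # stabilised denominator
    return x * (W.T @ s)                # R_in, same shape as x

x = np.array([1.0, -2.0, 0.5, 3.0])
W = np.array([[0.2, -0.1, 0.4, 0.0],
              [0.3,  0.5, -0.2, 0.1],
              [-0.4, 0.2, 0.1, 0.3]])
R_out = np.array([1.0, 0.5, 0.25])
R_in = lrp_epsilon_dense(x, W, np.zeros(3), R_out)
```

With zero bias and a small ε, the total relevance is conserved across the layer (sum of R_in ≈ sum of R_out); applied layer by layer from the classifier's output back to the spectrogram, this is what produces the heatmaps described above.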

Feature‑specific relevance analysis – Building on the LRP maps, the authors extracted acoustic features (F0, F1, F2, F3, amplitude, duration) from the stressed vowel and computed the overlap (Intersection‑over‑Union, IoU) between feature‑specific relevance maps and the full LRP heatmap. The first (F1) and second (F2) formants accounted for the largest proportion of relevance (≈45 % each), while pitch (F0) and the third formant (F3) contributed around 10 % each. This quantitative finding aligns with classic phonetic literature that stresses the importance of vowel quality (formant shifts) and pitch in lexical stress perception.
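The IoU comparison can be sketched as below. Each heatmap is binarized by keeping its top-relevance cells; the quantile threshold is one plausible choice, as the paper's exact binarization is not specified here.

```python
import numpy as np

def relevance_iou(map_a, map_b, quantile=0.95):
    """IoU between the top-relevance regions of two heatmaps.

    Each map is binarised by keeping cells at or above its own `quantile`.
    """
    a = map_a >= np.quantile(map_a, quantile)
    b = map_b >= np.quantile(map_b, quantile)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

# Toy 10x10 heatmaps: a full LRP map, an overlapping "F1-band" map,
# and a disjoint "F3-band" map.
full = np.zeros((10, 10)); full[2:4, 2:6] = 1.0
f1_map = np.zeros((10, 10)); f1_map[2:4, 2:6] = 1.0
f3_map = np.zeros((10, 10)); f3_map[6:8, 2:6] = 1.0
iou_same = relevance_iou(full, f1_map)
iou_diff = relevance_iou(full, f3_map)
```

A feature band that coincides with the model's high-relevance region scores near 1, while a band the model ignores scores near 0, which is how the per-feature relevance shares (F1/F2 high, F0/F3 low) can be quantified.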

Conclusions and implications – The study makes four major contributions: (1) an entirely automatic pipeline for constructing a large, balanced lexical‑stress dataset without human annotation; (2) a benchmark of several CNN architectures, establishing VGG‑16 as a high‑performing model for stress classification; (3) a systematic application of LRP to visualize and interpret the acoustic bases of model decisions; and (4) a novel feature‑specific relevance analysis that confirms the dominance of vowel formants and pitch in the learned representations. The work demonstrates that deep learning can acquire the same distributed acoustic cues identified by decades of phonetic research, while also revealing additional, subtler patterns across the entire word. Future directions include extending the approach to polysyllabic words, cross‑dialectal data, second‑language learners, and comparing LRP with other explainability methods such as SHAP or Grad‑CAM. Overall, the paper bridges the gap between high‑accuracy speech technology and scientific interpretability, offering a reproducible framework for probing what deep acoustic models actually “listen” to.

