When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.


💡 Research Summary

The paper investigates how phoneme‑level predictions evolve across the encoder layers of a pretrained multilingual speech model, focusing on Campidanese Sardinian, a low‑resource language. Using a “layer‑wise decoding” approach, the authors progressively truncate the top transformer layers of the facebook/wav2vec2‑xlsr‑53‑espeak‑cv‑ft model and decode directly from the new topmost layer using the original CTC projection head. Because all transformer layers share the same hidden dimension, this method allows probing of any intermediate layer without architectural changes.
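The truncation step itself is simple to express in code. Assuming the HuggingFace `transformers` layout, where the encoder layers live in `model.wav2vec2.encoder.layers`, the helper below is an illustrative sketch (not the authors' code) of dropping the top layers in place:

```python
def truncate_encoder(model, n_removed: int):
    """Drop the top n_removed transformer layers in place and return the model.

    Assumes the HuggingFace Wav2Vec2 layout, where encoder layers are stored
    in model.wav2vec2.encoder.layers. Because every layer shares the same
    hidden dimension, the original CTC projection head can still decode the
    output of the truncated stack.
    """
    layers = model.wav2vec2.encoder.layers
    if n_removed > 0:
        model.wav2vec2.encoder.layers = layers[: len(layers) - n_removed]
    return model

# Hypothetical usage with the model named above:
#   model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xlsr-53-espeak-cv-ft")
#   truncate_encoder(model, 2)           # decode from Layer 22 of 24
#   logits = model(input_values).logits  # CTC logits from the truncated stack
```

Because truncation happens before the CTC head, no retraining or architectural change is needed to decode from any intermediate layer.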

The experimental corpus consists of 48 short, spontaneously recorded utterances (average length ≈ 4 s) from four native speakers, manually transcribed at the phonemic level by a trained phonetician. For each truncation level (0–5 removed layers) the authors compute Phoneme Error Rate (PER) and perform a fine‑grained alignment (using a SequenceMatcher‑based algorithm) that categorizes each token as a hit, substitution, insertion, or deletion.
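A minimal version of such an alignment and PER computation can be written with Python's standard-library `SequenceMatcher`. The paper's exact scoring script is not shown in the summary, so the handling of unequal-length `replace` spans below is one reasonable convention, not necessarily the authors':

```python
from difflib import SequenceMatcher

def align_and_score(ref, hyp):
    """Align reference and hypothesis phoneme sequences and compute PER.

    Returns per-category counts (hit, sub, ins, del) and
    PER = (S + I + D) / len(ref). Unequal-length 'replace' spans are split
    into substitutions plus leftover insertions or deletions.
    """
    counts = {"hit": 0, "sub": 0, "ins": 0, "del": 0}
    sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            counts["hit"] += i2 - i1
        elif tag == "replace":
            # the overlapping positions are substitutions; the surplus on
            # either side becomes insertions or deletions
            counts["sub"] += min(i2 - i1, j2 - j1)
            counts["ins"] += max(0, (j2 - j1) - (i2 - i1))
            counts["del"] += max(0, (i2 - i1) - (j2 - j1))
        elif tag == "insert":
            counts["ins"] += j2 - j1
        elif tag == "delete":
            counts["del"] += i2 - i1
    per = (counts["sub"] + counts["ins"] + counts["del"]) / len(ref)
    return counts, per

# counts, per = align_and_score(list("sardu"), list("sadu"))
# one deletion out of five reference phonemes -> PER = 0.2
```

Note that `SequenceMatcher` finds a longest-matching-block alignment rather than a minimum-edit-distance one, so counts can differ slightly from Levenshtein-based scorers.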

Key quantitative findings:

  • Removing the top two transformer layers (i.e., decoding from Layer 22) yields the lowest PER of 35.40 %, compared with 36.73 % for the full model (Layer 24). Removing more than five layers leads to a sharp degradation, with PER exceeding 70 % at Layer 16.
  • As layers are removed, the number of correctly predicted phonemes (hits) declines gradually, while deletion errors rise sharply from Layer 21 downward. Substitution errors remain relatively stable across Layers 24‑22 and even drop slightly in deeper layers.
  • The most frequently deleted or substituted phonemes are the low‑prominence vowels /i/, /u/, and /a/, which often appear unstressed at the ends of words in Campidanese. Their short duration and reduced formant clarity make them vulnerable to being merged or omitted by the convolutional feature encoder’s receptive field.
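Per-phoneme tallies like those behind the last bullet can be derived from the same alignment opcodes. The sketch below makes the same assumptions (standard-library `SequenceMatcher`, hypothetical helper name) and is illustrative only; the paper's own tallying script is not shown in the summary:

```python
from collections import Counter
from difflib import SequenceMatcher

def phoneme_error_tallies(pairs):
    """Count which reference phonemes are deleted or substituted.

    pairs: iterable of (ref, hyp) phoneme-sequence pairs.
    Returns two Counters keyed by reference phoneme.
    """
    deleted, substituted = Counter(), Counter()
    for ref, hyp in pairs:
        sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "delete":
                deleted.update(ref[i1:i2])
            elif tag == "replace":
                # count only the overlapping positions as substitutions
                for k in range(min(i2 - i1, j2 - j1)):
                    substituted[ref[i1 + k]] += 1
    return deleted, substituted

# deleted.most_common() would then surface vowels such as /i/, /u/, /a/
# if they dominate the deletion errors, as reported above.
```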

Beyond global trends, the authors introduce the notion of “regressive errors”: cases where a phoneme is correctly predicted at an intermediate layer but becomes a substitution or deletion at a deeper layer. Across the dataset they identify 53 such regressions (39 hit→substitution, 14 hit→deletion). The most common regressive phoneme is the high back rounded vowel /u/ (13 instances), often replaced by acoustically similar vowels (/o/ or /U/). Other frequent regressions involve the rhotic /r/ and the nasal /n/. These patterns suggest that deeper layers, while improving overall PER, may “over‑process” the signal, abstracting away fine‑grained acoustic detail in favor of higher‑level contextual regularities. Consequently, intermediate layers sometimes preserve segmental identity better than the final layer.
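Detecting such regressions amounts to aligning each layer's hypothesis against the reference and comparing the outcome at every reference position. The following is a hypothetical sketch of that comparison (the authors' exact procedure may differ):

```python
from difflib import SequenceMatcher

def reference_outcomes(ref, hyp):
    """Label each reference position as 'hit', 'sub', or 'del' against hyp."""
    out = ["del"] * len(ref)
    sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                out[i] = "hit"
        elif tag == "replace":
            # overlapping positions of a replace span count as substitutions
            for k in range(min(i2 - i1, j2 - j1)):
                out[i1 + k] = "sub"
    return out

def regressions(ref, hyp_mid, hyp_final):
    """Reference positions correct at the intermediate layer but wrong at the final one.

    Returns (position, reference phoneme, final-layer outcome) triples,
    i.e. the hit->substitution and hit->deletion cases described above.
    """
    mid = reference_outcomes(ref, hyp_mid)
    fin = reference_outcomes(ref, hyp_final)
    return [(i, ref[i], fin[i])
            for i in range(len(ref))
            if mid[i] == "hit" and fin[i] != "hit"]
```

Tallying the middle element of each triple over a corpus would yield the kind of per-phoneme regression counts reported for /u/, /r/, and /n/.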

A qualitative inspection of the five utterances with the largest PER reduction (Layer 24 → Layer 22) shows that intermediate‑layer outputs not only have fewer spurious insertions but also produce more linguistically plausible segment sequences. For example, in utterance 30_F_extract_04 the final layer inserts an erroneous initial vowel /i/ and replaces the target voiced fricative /Z/ with /S/. The Layer 22 hypothesis, however, yields a sequence closer to the reference (/e:n tsu/ vs. /ensu/), indicating a more faithful alignment to the acoustic input.

The authors argue that these findings have two major implications. First, layer‑wise probing can serve as a diagnostic tool for low‑resource ASR, revealing that intermediate representations may strike a better balance between acoustic fidelity and contextual abstraction. Second, the existence of regressive errors cautions against equating lower overall error rates with richer linguistic representations; deeper layers can sometimes degrade phoneme‑level detail while improving higher‑level metrics.

In conclusion, the study demonstrates that for a multilingual wav2vec2 model applied to Sardinian, decoding from an intermediate encoder layer yields superior phoneme‑level performance and more linguistically faithful outputs. The work highlights the value of early‑layer probing for model analysis, suggests potential benefits of early‑exit strategies in low‑resource settings, and opens avenues for future research on architecture‑agnostic abstraction patterns and training techniques that preserve fine‑grained phonetic information.

