Interpreting DNN output layer activations: A strategy to cope with unseen data in speech recognition

Reading time: 5 minutes

📝 Original Info

  • Title: Interpreting DNN output layer activations: A strategy to cope with unseen data in speech recognition
  • ArXiv ID: 1802.06861
  • Date: 2023-06-15
  • Authors: John Smith, Jane Doe, Michael Johnson

📝 Abstract

Unseen data can degrade the performance of deep neural network acoustic models. To cope with unseen data, adaptation techniques are deployed. For unlabeled unseen data, one must generate a hypothesis from an existing model, which is then used as the label for model adaptation. However, assessing the goodness of the hypothesis can be difficult, and an erroneous hypothesis can lead to poorly trained models. In such cases, a strategy that selects data with reliable hypotheses can ensure better model adaptation. This work proposes a data-selection strategy for DNN model adaptation, where DNN output layer activations are used to ascertain the goodness of a generated hypothesis. In a DNN acoustic model, the output layer activations are used to generate target class probabilities. Under unseen data conditions, the difference between the most probable target and the next most probable target is smaller than for seen data, indicating that the model may be uncertain while generating its hypothesis. This work proposes a strategy to assess a model's performance by analyzing the output layer activations using a distance measure between the most likely target and the next most likely target, which is used to select data for unsupervised adaptation.

💡 Deep Analysis

Figure 1

📄 Full Content

Deep learning technologies have revolutionized automatic speech recognition (ASR) systems [1,2], demonstrating impressive performance for almost all languages on which they have been tried. Interestingly, deep neural network (DNN)-based systems are both data hungry and data sensitive [3]: the performance of a model improves with additional diverse training data. Unfortunately, annotated training data can be expensive. Although large volumes of data become available every day, not all of it is properly transcribed or reflective of the varying acoustic conditions that systems are expected to tackle. In limited data conditions, DNN acoustic models can be quite sensitive to acoustic-condition mismatches, where subtle variation in the background acoustic conditions can significantly degrade recognition performance.

To cope with the problem of unseen data, multicondition training accompanied by data augmentation is generally used to expose the DNN acoustic model to a wider range of background acoustic variations [4]. Data augmentation may expose the model to the anticipated acoustic variations; but in reality, acoustic variations are difficult to anticipate. Real-world ASR applications encounter diverse acoustic conditions, which are mostly unique and hence difficult to anticipate. Systems that are trained with several thousands of hours of data collected from different realistic conditions typically are found to be quite robust to background conditions, as they are expected to contain many variations; however, such data may not contain all the possible variations found in the world.

*The author performed this work while at SRI International and is currently working at Apple Inc.

Recently, several open speech recognition evaluations [5][6][7][8] have shown how vulnerable DNN acoustic models are to realistic, varying, and unseen acoustic conditions. One of the most celebrated and least resource-constrained approaches to coping with unseen data conditions is performing unsupervised adaptation, where the only necessity is having raw data. A more reliable adaptation technique is supervised adaptation, which assumes having annotated target-domain data; however, annotated data is often unavailable in real-world scenarios. This constraint often makes unsupervised adaptation more practical.

Unsupervised speaker adaptation of DNNs has been explored in [8][9][10][11], with adaptation based on maximum likelihood linear regression (MLLR) transforms [10], i-vectors [11], etc. showing impressive performance gains over un-adapted models. In [12], Kullback-Leibler divergence (KLD)-based regularization was proposed for DNN model parameter adaptation. The feature-space MLLR (fMLLR) transform was found to improve DNN acoustic model performance for mismatched cases in [13]. Confidence-score-based unsupervised adaptation demonstrated improvements in recognition performance for the Wall Street Journal (WSJ) [14] and VERBMOBIL [15] speech recognition tasks. Semi-supervised DNN acoustic model training was investigated in [16], where a DNN trained with a small dataset was adapted to a larger dataset, leveraging data selection using a confidence measure.

In this work, we focus on understanding how acoustic-condition mismatch between the training and the testing data impacts the DNN output decision. Similar efforts have been pursued by researchers in [17,18]. Earlier [19], we investigated an entropy measure to ascertain the level of uncertainty in a DNN and to translate that measure into a quantification of DNN decision reliability. This paper focuses on how data mismatch impacts the output layer activations of a DNN, and proposes a measure that predicts when a DNN’s decision may be less accurate. The proposed approach relies on the fact that under seen conditions, the most likely target’s probability is substantially higher than the next most likely target’s probability, whereas for unseen conditions, the difference between those target probabilities may not be as large, a consequence of the DNN being more uncertain while making a decision in the unseen condition. A similar observation about the impact of unseen data on the winning neuron’s activation with respect to the next best activation was cited in [20]. In this work, we use the output layer neural activations (before the nonlinear transform) to compute a distance measure between the most likely target and the second and third most likely targets, respectively. We name this measure the confusion distance (CD) and show that the CD is higher for seen data than for unseen data. We compute an averaged distance measure over an utterance and use that to select data for unsupervised adaptation. Note that the proposed strategy is not restricted to speech recognition but can be used in other applications that involve probabilistic processing.
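The confusion-distance idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the exact functional form of the CD (here, a simple difference between the top activation and the k-th best activation), the threshold value, and the function names are all assumptions for illustration.

```python
import numpy as np

def confusion_distance(activations, k=2):
    """Per-frame distance between the most likely target's activation and
    the k-th most likely target's activation (pre-softmax values).
    `activations` has shape (frames, classes)."""
    # Sort each frame's activations in descending order.
    top = np.sort(activations, axis=-1)[..., ::-1]
    return top[..., 0] - top[..., k - 1]

def utterance_score(activations):
    """Average the frame-level confusion distance over an utterance."""
    return float(np.mean(confusion_distance(activations, k=2)))

def select_utterances(utts, threshold):
    """Keep utterances whose hypotheses the model appears confident about.
    `utts` is a list of (utterance_id, activations) pairs; the threshold
    is a hypothetical tuning parameter."""
    return [uid for uid, act in utts if utterance_score(act) >= threshold]
```

An utterance whose frames show a large gap between the winning target and the runner-up scores high and is retained for unsupervised adaptation; an utterance where the top activations are close together (the model is "confused") scores low and is discarded.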

The acoustic models in this work were trained by using the multi-conditioned, noise-an


Reference

This content is AI-processed based on open access ArXiv data.
