Building DNN Acoustic Models for Large Vocabulary Speech Recognition
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation into which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models – with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
💡 Research Summary
This paper presents a comprehensive empirical study of design choices for deep neural network (DNN) acoustic models used in large‑vocabulary continuous speech recognition (LVCSR). The authors conduct two major sets of experiments: (1) on the standard Switchboard corpus (≈300 h of conversational telephone speech) to investigate the effects of model size, depth, regularization, and architecture; and (2) on a much larger combined Switchboard + Fisher corpus (≈2,100 h) to examine how these factors scale when abundant training data are available.
Methodology
The acoustic modeling task is framed as a senone‑level classification problem within a hybrid HMM‑DNN system. Forced alignments from a baseline HMM‑GMM system provide frame‑wise senone labels. The authors explore three network families: (i) standard fully‑connected DNNs, (ii) deep convolutional neural networks (DCNNs) that exploit the time‑frequency structure of log‑mel filterbank inputs, and (iii) deep locally‑untied neural networks (DLUNNs) where weights are not shared across spatial locations. For each family they vary the number of hidden layers (5–10) and total parameter count (from a few million up to several hundred million). Regularization is examined primarily through dropout (rates 0.2–0.5). Training objectives include the conventional cross‑entropy (CE) loss and discriminative criteria (MMI, sMBR) applied after CE pre‑training. Optimization algorithms compared are stochastic gradient descent (SGD) with momentum and Adam.
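The hybrid setup described above reduces acoustic modeling to per-frame senone classification: spliced filterbank frames go in, senone posteriors come out. The sketch below illustrates that forward pass with plain NumPy; the dimensions (40 log-mel coefficients, ±10 frames of context, 5 hidden layers of 2048 ReLU units, 8,000 senones) and the random weights are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative dimensions (not the paper's exact configuration).
n_mel, context = 40, 10
input_dim = n_mel * (2 * context + 1)   # 840-dim spliced input frame
hidden_dim, n_layers, n_senones = 2048, 5, 8000

# Randomly initialized weights stand in for trained parameters.
Ws = [rng.normal(0, 0.01, (input_dim if i == 0 else hidden_dim, hidden_dim))
      for i in range(n_layers)]
bs = [np.zeros(hidden_dim) for _ in range(n_layers)]
W_out = rng.normal(0, 0.01, (hidden_dim, n_senones))
b_out = np.zeros(n_senones)

def forward(x):
    """Fully-connected forward pass: spliced frame -> senone posteriors."""
    h = x
    for W, b in zip(Ws, bs):
        h = np.maximum(0.0, h @ W + b)      # ReLU hidden units
    logits = h @ W_out + b_out
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

frame = rng.normal(size=input_dim)          # one spliced input frame
posteriors = forward(frame)                 # distribution over senones
```

In decoding, these posteriors would be divided by senone priors to obtain scaled likelihoods for the HMM; cross-entropy training against the forced-alignment labels maximizes the log of the posterior at the reference senone for each frame.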
Key Findings on the 300‑hour Switchboard Set
- Model Size vs. Over‑fitting – Increasing the parameter count improves frame-level training accuracy but eventually degrades word error rate (WER) as the model over‑fits the 300‑hour training set.
- Dropout Effectiveness – Applying dropout dramatically reduces over‑fitting; the best WERs are achieved with 5–7 hidden layers and dropout rates around 0.3.
- Architectural Comparison – DCNNs capture local spectro‑temporal patterns but, when used alone, do not outperform a well‑regularized fully‑connected DNN. DLUNNs show no advantage and converge more slowly, especially on limited data.
- Discriminative Training – Adding a discriminative loss after CE yields modest WER reductions (≈0.2–0.4 % absolute) but introduces additional hyper‑parameter tuning complexity.
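The dropout finding above is the standard "inverted dropout" mechanism: during training each hidden unit is zeroed with some probability and the survivors are rescaled so the expected activation is unchanged, while at test time the layer is the identity. A minimal sketch, with the 0.3 rate taken from the summary's reported sweet spot:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(h, rate=0.3, train=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale survivors by 1/(1-rate) so the expected
    activation is unchanged; at test time the layer is the identity."""
    if not train or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

h = np.ones(10000)                    # toy hidden-layer activations
h_train = dropout(h, rate=0.3)        # ~30% of units zeroed, rest scaled
h_test = dropout(h, train=False)      # identity at test time
```

Because the rescaling keeps the expected activation fixed, no weight scaling is needed at test time, which is what makes the same network usable for both the regularized training runs and final decoding.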
Insights from the 2,100‑hour Combined Set
- Scaling Model Capacity – With abundant data, enlarging the network up to ten times the usual parameter count does not cause severe over‑fitting; instead, WER continues to improve, indicating that data volume is the primary limiter of model size.
- Depth vs. Total Parameters – For a fixed parameter budget, deeper networks (8–10 layers) slightly outperform shallower ones (5–6 layers), but the gain is modest (≈0.3–0.5 % absolute). Beyond a certain depth, training becomes unstable and computational cost rises sharply.
- Optimization Preference – Momentum‑based SGD provides the most stable convergence and the lowest final WER on the large corpus. Adam converges faster initially but ends with slightly higher WER.
- Architectural Re‑evaluation – Even with massive data, DCNNs and DLUNNs do not surpass the baseline DNN when measured in final WER, suggesting that the added architectural complexity is not justified for single‑model acoustic modeling.
Practical Recommendations
- A simple fully‑connected DNN with appropriate dropout and a well‑chosen learning schedule is the most effective baseline for both moderate and large training corpora.
- Regularization is crucial for small‑to‑medium data; dropout rates between 0.2 and 0.5 work well.
- When large data are available, increase total parameters rather than merely adding layers; however, keep an eye on memory and training‑time constraints.
- Discriminative loss functions can be used for marginal gains, but the extra engineering effort may not be justified for many deployments.
- SGD with momentum remains the optimizer of choice for stable, high‑performance training; Adam may be useful for rapid prototyping.
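The recommended optimizer, SGD with classical momentum, amounts to a two-line update: a velocity accumulates an exponentially decaying sum of past gradients, and the parameters move along that velocity. A minimal sketch on a toy quadratic objective; the learning rate and momentum coefficient are illustrative values, not the paper's schedule.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: v <- mu*v - lr*grad, then w <- w + v.
    lr and mu are illustrative, not the paper's exact schedule."""
    v = mu * v - lr * grad
    return w + v, v

# Minimize the toy objective f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)
# The iterate spirals in toward the minimum at the origin.
```

In the acoustic-modeling setting, `grad` would be the mini-batch gradient of the cross-entropy loss; the velocity smooths the noisy per-batch gradients, which is one reason momentum SGD tends to be more stable than adaptive methods on large corpora.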
Overall, the paper demonstrates that “less is more” in terms of architectural sophistication: a well‑regularized, sufficiently large DNN trained with standard cross‑entropy loss and momentum‑SGD delivers state‑of‑the‑art performance on LVCSR tasks. The extensive empirical evidence provided serves as a valuable guide for researchers and engineers designing acoustic models for current and future speech recognition systems.