Large-Scale Domain Adaptation via Teacher-Student Learning

February 23, 2026

Reading time: 6 minutes

📝 Original Info

Title: Large-Scale Domain Adaptation via Teacher-Student Learning
ArXiv ID: 1708.05466
Date: 2017-08-21
Authors: Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong

📝 Abstract

High accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained model and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which the posterior probabilities generated by the source-domain model can be used in lieu of labels to train the target-domain model. We evaluate the proposed approach in two scenarios, adapting a clean acoustic model to noisy speech and adapting an adults’ speech acoustic model to children’s speech. Significant improvements in accuracy are obtained, with reductions in word error rate of up to 44% over the original source model without the need for transcribed data in the target domain. Moreover, we show that increasing the amount of unlabeled data results in additional model robustness, which is particularly beneficial when using simulated training data in the target domain.

💡 Deep Analysis

Deep Dive into Large-Scale Domain Adaptation via Teacher-Student Learning.

📄 Full Content

Large-Scale Domain Adaptation via Teacher-Student Learning

Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, and Yifan Gong
Microsoft AI and Research, One Microsoft Way, Redmond, WA 98052
{jinyli; mseltzer; xwang; ruzhao; ygong}@microsoft.com

Abstract

High accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained model and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which the posterior probabilities generated by the source-domain model can be used in lieu of labels to train the target-domain model. We evaluate the proposed approach in two scenarios, adapting a clean acoustic model to noisy speech and adapting an adults’ speech acoustic model to children’s speech. Significant improvements in accuracy are obtained, with reductions in word error rate of up to 44% over the original source model without the need for transcribed data in the target domain. Moreover, we show that increasing the amount of unlabeled data results in additional model robustness, which is particularly beneficial when using simulated training data in the target domain.

Index Terms: teacher-student learning, parallel unlabeled data

Introduction

The success of deep neural networks [1][2][3][4][5] relies on the availability of a large amount of transcribed data to train millions of model parameters. However, deep models still suffer reduced performance when exposed to test data from a new domain. Because it is typically very time-consuming or expensive to transcribe large amounts of data for a new domain, domain-adaptation approaches have been proposed to bootstrap the training of a new system from an existing well-trained model [6][7][8][9]. These supervised methods still require transcribed data from the new domain and thus their effectiveness is limited by the amount of transcribed data available in the new domain. Although unsupervised adaptation methods can be used by generating labels from a decoder, the performance gap between supervised and unsupervised adaptation is large [7].
In this work, we propose an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained source model and the target domain. There are many important scenarios in which collecting a virtually unlimited amount of parallel data is relatively simple. For example, to collect noisy or reverberant data from a particular set of environments, speech can be captured simultaneously using a close-talking microphone and a microphone located at a distance from the user. Such a collection effort can also be simulated by acoustically replaying a pre-existing corpus of high signal-to-noise ratio speech files in the target environment or by digitally simulating the target environment offline [10][11].
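
The digital simulation route mentioned above can be illustrated with a minimal sketch. It assumes the target environment is approximated by additive noise at a chosen signal-to-noise ratio only (no reverberation or channel effects), and the names `mix_at_snr` and `make_parallel_pair` are hypothetical helpers introduced here, not functions from the paper.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("signals must have non-zero power")
    # Pick the gain so that clean_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]

def make_parallel_pair(clean, noise, snr_db):
    """Pair a source-domain (clean) waveform with a simulated target-domain copy."""
    return clean, mix_at_snr(clean, noise, snr_db)

# Toy data (assumption): a 440 Hz tone as "speech" and Gaussian noise, 0.1 s at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
source, target = make_parallel_pair(clean, noise, snr_db=10.0)
```

Because the two waveforms in each pair are sample-aligned, frames extracted from `source` and `target` correspond one to one, which is the property that lets the pair be used without any transcription.
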
To perform adaptation without the use of transcriptions, we propose to use teacher/student (T/S) learning. In T/S learning, the data from the source domain are processed by the source-domain model (teacher) to generate the corresponding posterior probabilities or soft labels. These posterior probabilities are used in lieu of the usual hard labels derived from the transcriptions to train the target (student) model with the parallel data from the target domain. With this approach, the network can be trained on a potentially enormous amount of training data and the challenge of adapting a large-scale system shifts from transcribing thousands of hours of audio to the potentially much simpler and lower-cost task of designing a scheme to generate the appropriate parallel data.
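
The training signal described above can be sketched in a few lines. This is a minimal illustration, assuming the student is trained by minimizing the frame-level cross-entropy between teacher and student posteriors (a common way to formulate T/S learning); the function names and toy logits are assumptions made for this example and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one frame of output logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def ts_loss(teacher_logits, student_logits):
    """Mean over frames of the cross-entropy between teacher and student posteriors.

    teacher_logits[f] comes from the teacher on source-domain frame f;
    student_logits[f] comes from the student on the parallel target-domain frame f.
    No transcription is used anywhere in this loss.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("parallel data must provide the same number of frames")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p_teacher = [math.exp(v) for v in log_softmax(t_frame)]  # soft labels
        log_p_student = log_softmax(s_frame)
        total -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return total / len(teacher_logits)

# Toy example (assumption): two frames, three output classes.
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]]
loss = ts_loss(teacher_logits, student_logits)
```

The loss is smallest when the student reproduces the teacher's posteriors on every frame, so gradient descent on it pulls the student's behavior on target-domain input toward the teacher's behavior on the matching source-domain input.
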
The proposed approach is closely related to other approaches for adaptation or retraining that employ knowledge distillation [12]. In these approaches, the soft labels generated by a teacher model are used as a regularization term to train a student model with conventional hard labels. Knowledge distillation was used to train a system on the Aurora 2 digit recognition task [13], using the clean and noisy training sets [14]. In [15] it was shown that for the multi-channel CHiME-4 task [16], soft labels could be derived using enhanced features generated by a beamformer and then processed through a network trained with conventional multi-style training [17]. However, it is unclear whether this approach is superior to simply using the enhanced features for recognition at test time as well.
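
The contrast between distillation with hard labels and label-free T/S learning can be made concrete with a small sketch. It assumes the widely used interpolated form of the distillation objective, without temperature scaling; the name `soft_weight` and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math

def distillation_loss(student_probs, teacher_probs, hard_label, soft_weight):
    """Interpolate hard-label and soft-label cross-entropy for one frame.

    soft_weight = 0.0 gives ordinary supervised training on the hard label;
    soft_weight = 1.0 ignores the hard label, so no transcription is needed.
    """
    if not 0.0 <= soft_weight <= 1.0:
        raise ValueError("soft_weight must lie in [0, 1]")
    hard_term = -math.log(student_probs[hard_label])
    soft_term = -sum(pt * math.log(ps)
                     for pt, ps in zip(teacher_probs, student_probs))
    return (1.0 - soft_weight) * hard_term + soft_weight * soft_term

# Toy posteriors over three classes (assumption).
student_probs = [0.7, 0.2, 0.1]
teacher_probs = [0.6, 0.3, 0.1]
regularized = distillation_loss(student_probs, teacher_probs, hard_label=0, soft_weight=0.5)
label_free = distillation_loss(student_probs, teacher_probs, hard_label=0, soft_weight=1.0)
```

With `soft_weight=1.0` the hard-label term has zero weight, which corresponds to the setting the proposed approach targets: the teacher's posteriors are the only supervision.
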
Knowled

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
