All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.
💡 Research Summary
This paper introduces “All-in-One ASR,” a novel unified framework designed to consolidate multiple automatic speech recognition (ASR) paradigms and operational modes into a single model. The core problem it addresses is the high development, deployment, and maintenance cost associated with building and managing separate models for different ASR approaches—namely Connectionist Temporal Classification (CTC), Attention-based Encoder-Decoder (AED), and Transducer—each potentially requiring both offline and streaming versions.
The proposed solution operates on two levels of unification. First, it employs a dual-mode encoder (based on either Conformer or ConMamba architectures) capable of processing full sequences for offline recognition and chunked sequences for low-latency streaming recognition within a shared parameter set.
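The contrast between the two encoder modes comes down to the self-attention mask: full (unmasked) attention offline versus chunk-wise attention for streaming. A minimal sketch of such a mask builder, with `chunk_size` as an assumed hyperparameter and the exact masking policy (e.g., any left-context limit) simplified relative to whatever the paper uses:

```python
def attention_mask(num_frames, chunk_size=None):
    """Boolean self-attention mask for a dual-mode encoder (illustrative).

    chunk_size=None -> offline mode: every frame attends to all frames.
    chunk_size=k    -> streaming mode: frame i attends only to frames in
                       its own chunk and in all earlier chunks.
    Entry [i][j] is True if frame i may attend to frame j.
    """
    if chunk_size is None:
        return [[True] * num_frames for _ in range(num_frames)]
    return [[(j // chunk_size) <= (i // chunk_size) for j in range(num_frames)]
            for i in range(num_frames)]
```

Because only the mask changes between modes, the encoder weights are fully shared; at training time the mode can be sampled per batch so one parameter set learns both behaviors.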
The second and most innovative component is the multi-mode joiner. Built upon the joiner module of a Hybrid Autoregressive Transducer (HAT) model, this block is architecturally inspired by a single-layer Transformer decoder. Its key feature is the ability to emulate the behavior of different ASR systems by simply switching its operational mode, while fully sharing parameters across these modes. Specifically, it can operate as:
- HAT/Transducer Mode: Utilizes a sigmoid cross-attention mechanism that allows frame-synchronous decoding, crucial for streaming.
- AED Mode: Functions as a standard Transformer decoder using softmax cross-attention. For streaming operation, it reuses alignments produced by the HAT mode during training to construct cross-attention masks, removing the need for explicit chunk-boundary tokens.
- CTC Mode: Implements a factored CTC approach by zeroing out the predictor input.
- LM Mode: Acts as a pure language model by zeroing out the encoder input.
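The mode-switching logic above can be sketched in a few lines: CTC and LM modes are obtained purely by zeroing one of the joiner's two inputs, while HAT and AED differ in the output factorization (sigmoid blank vs. plain softmax). The shared joiner body is stubbed here as an elementwise sum over a tiny vocabulary; in the paper it is a single-layer Transformer decoder block, so all names and shapes below are illustrative assumptions, not the actual API:

```python
import math

def joiner(enc, pred, mode):
    """Toy sketch of multi-mode joiner input gating and output heads.

    enc, pred: encoder and predictor feature vectors (lists of floats),
    here doubling as logits over a tiny vocabulary for simplicity.
    """
    if mode == "ctc":            # factored CTC: drop the label context
        pred = [0.0] * len(pred)
    elif mode == "lm":           # pure language model: drop the acoustics
        enc = [0.0] * len(enc)
    elif mode not in ("hat", "aed"):
        raise ValueError(f"unknown mode: {mode}")
    logits = [e + p for e, p in zip(enc, pred)]   # stand-in for the shared block
    if mode == "hat":
        # HAT factorization: sigmoid blank probability, softmax over labels
        blank = 1.0 / (1.0 + math.exp(-logits[0]))
        exps = [math.exp(x) for x in logits[1:]]
        labels = [(1.0 - blank) * x / sum(exps) for x in exps]
        return [blank] + labels
    exps = [math.exp(x) for x in logits]          # softmax for AED/CTC/LM output
    return [x / sum(exps) for x in exps]
```

The key property the sketch preserves is that all four modes pass through the same parameters; only the input gating and the output normalization change.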
A critical advantage of this design is the complete sharing of the predictor network between the HAT and AED modes. This means that during joint decoding, the decoder states for a given hypothesis are computed only once and reused, dramatically reducing computational overhead and memory usage compared to systems with separate decoder branches.
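The saving from the shared predictor can be illustrated with a small caching scorer: one predictor pass per hypothesis prefix feeds both joiner modes, whose scores are then combined log-linearly. Everything here (class name, `weight` interpolation parameter, callable interfaces) is a hypothetical sketch, not the paper's decoding implementation:

```python
class SharedPredictorScorer:
    """Joint HAT+AED scoring with a single cached predictor state per prefix."""

    def __init__(self, predictor, hat_joiner, aed_joiner, weight=0.5):
        self.predictor = predictor      # expensive: runs once per prefix
        self.hat_joiner = hat_joiner    # returns a HAT log-probability
        self.aed_joiner = aed_joiner    # returns an AED log-probability
        self.weight = weight            # assumed interpolation hyperparameter
        self.cache = {}

    def score(self, prefix, enc_state):
        key = tuple(prefix)
        if key not in self.cache:       # compute the predictor state only once
            self.cache[key] = self.predictor(prefix)
        pred_state = self.cache[key]
        log_p_hat = self.hat_joiner(enc_state, pred_state)
        log_p_aed = self.aed_joiner(enc_state, pred_state)
        return self.weight * log_p_hat + (1.0 - self.weight) * log_p_aed
```

With separate decoder branches, each hypothesis would require two forward passes; here the cache guarantees one, which is what keeps joint decoding cheap in compute and memory.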
Experiments on the LibriSpeech corpus demonstrate the framework’s effectiveness. The All-in-One model reduces the total parameter count to approximately one-third of the sum of parameters from individually trained CTC, AED, and Transducer models, while achieving word error rate (WER) performance that is comparable to or even surpasses that of the individual specialized models. Furthermore, joint decoding that combines predictions from the HAT and AED modes within the unified model yields additional performance gains across all test sets, showcasing the benefit of leveraging the complementary strengths of different paradigms (e.g., AED’s accuracy on short utterances and HAT’s robustness on long-form speech).
In conclusion, this work presents a practical and efficient path toward unifying diverse ASR technologies. By significantly reducing model footprint and deployment complexity without sacrificing performance, the All-in-One ASR framework offers a compelling solution for scenarios requiring flexibility across different recognition paradigms and latency constraints, particularly in resource-constrained environments like on-device ASR.