Scalable Large Audio Language Models via Audio Token Compression
📝 Abstract
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens generated by the LALM’s audio encoder before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, that depend on effectively uncovering the underlying lexical content of the input signal, and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to frame-level LALMs while reducing the input audio token count by up to three times before the LLM backbone.
📄 Content
Towards Audio Token Compression in Large Audio Language Models

Saurabhchand Bhati1, Samuel Thomas2,3, Hilde Kuehne3,4, Rogerio Feris2,3, James Glass1
1MIT, USA, 2IBM Research, 3MIT-IBM Watson AI Lab, 4Tuebingen AI Center/University of Tuebingen
sbhati@mit.edu

II. INTRODUCTION
Audio is a primary medium of human communication and interaction, and a rich understanding of audio signals is essential for building systems that can engage naturally with humans and operate effectively in real-world environments. Large Language Models (LLMs) [1]–[6] extended with audio inputs, commonly referred to as Large Audio Language Models (LALMs), have recently achieved remarkable success in audio understanding and reasoning [7]–[17].
A typical LALM has the following components: (1) an audio encoder, often a pretrained model such as Whisper, which extracts meaningful features (or tokens) from the audio signal, (2) a language model backbone, which reasons over these embeddings and generates a response, and (3) optionally, a speech synthesizer to generate a spoken response. Advancements in computation and the availability of vast amounts of labeled data have enabled LALMs to achieve strong performance across a wide range of audio tasks. However, the majority of current LALMs focus on short audio, typically around 10 seconds, primarily because the audio encoders produce features at a very high rate. For instance, for a 10-second audio clip, the audio encoder can yield around 500 features (50 features/second), resulting in extremely long sequences. Since transformer-based LLMs scale quadratically with sequence length, such long input sequences pose significant computational challenges.

Fig. 1: Overview of audio token compression in a large audio language model. The compression module takes the output from the audio encoder and compresses the audio tokens, which are then fed into the LLM.

Existing approaches attempt to mitigate this issue by either uniformly pooling the audio features [12], [15], [17], [18] or transcribing speech into text [13]. Uniform pooling (typically by a factor of two) reduces token counts, but the number of audio tokens remains significantly higher than that of text tokens. A more substantial token reduction can be achieved by using the text transcription instead of the audio embeddings. Transcription, however, discards important paralinguistic information such as speaker identity, prosody, emotion, and environmental context. Here, we explore unsupervised unit discovery, uniform average pooling, and uniform downsampling to compress audio tokens and reduce the number of audio tokens sent to the LLM backbone.
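Uniform average pooling of the encoder output can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the function name `average_pool` and the handling of a partial trailing window are our own assumptions.

```python
import numpy as np

def average_pool(frames: np.ndarray, factor: int) -> np.ndarray:
    """Uniformly average-pool a (T, D) sequence of audio tokens by `factor`.

    Illustrative sketch (assumption, not the paper's exact code): trailing
    frames that do not fill a complete window are pooled together.
    """
    T, D = frames.shape
    n_full = T // factor
    # Group complete windows of `factor` frames and average within each window.
    pooled = frames[: n_full * factor].reshape(n_full, factor, D).mean(axis=1)
    if T % factor:
        # Average whatever frames remain into one final token.
        tail = frames[n_full * factor :].mean(axis=0, keepdims=True)
        pooled = np.concatenate([pooled, tail], axis=0)
    return pooled

# A 10-second clip at 50 tokens/s yields 500 frame-level tokens;
# pooling by a factor of 2 halves the sequence seen by the LLM backbone.
tokens = np.random.randn(500, 768)
print(average_pool(tokens, 2).shape)  # (250, 768)
```

Because attention cost grows quadratically with sequence length, halving the token count cuts the attention cost in the LLM backbone by roughly a factor of four.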
We explore how these methods affect the recovery of the underlying lexical content. Unsupervised unit discovery aims to segment audio into acoustically homogeneous units such as phone or word-like segments [19]–[23]. Segment boundaries can be used to merge frame-level features to generate segment-level features. Prior work has shown that segment-level features can achieve performance comparable to frame-level features in phoneme classification, while requiring significantly fewer features (tokens). Frameworks such as Segmental Contrastive Predictive Coding [19] and variable-rate Contrastive Predictive Coding [21] demonstrate the effectiveness of unsupervised boundary detection in producing multiscale representations at both the frame and phone level. We use unsupervised unit discovery to discover segments and generate segmental audio features. These segmental features preserve the underlying lexical content while reducing the number of audio tokens before the LLM. While using the compressed audio tokens reduces the audio token counts
arXiv:2511.20973v1 [eess.AS] 26 Nov 2025
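The merging of frame-level features into segment-level features given discovered boundaries can be sketched as below. The function name `merge_segments` and the boundary format (segment start indices, as a boundary detector such as Segmental CPC might emit) are our assumptions; the source does not specify the exact merging code.

```python
import numpy as np

def merge_segments(frames: np.ndarray, boundaries: list[int]) -> np.ndarray:
    """Average frame-level features (T, D) within each discovered segment.

    `boundaries` holds segment start indices (assumed format) produced by an
    unsupervised unit discovery step; the first entry should be 0.
    Returns one pooled feature vector per segment, shape (S, D).
    """
    starts = boundaries
    ends = boundaries[1:] + [frames.shape[0]]  # each segment ends where the next begins
    return np.stack([frames[s:e].mean(axis=0) for s, e in zip(starts, ends)])

frames = np.random.randn(500, 768)        # 10 s of audio at 50 tokens/s
boundaries = [0, 120, 260, 400]           # hypothetical phone/word-like starts
segments = merge_segments(frames, boundaries)
print(segments.shape)  # (4, 768)
```

In contrast to uniform pooling, the compression rate here adapts to the signal: slowly varying regions collapse into few tokens, while acoustically dense regions keep more.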