Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition

Reading time: 5 minutes
...

📝 Original Info

  • Title: Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition
  • ArXiv ID: 2512.11545
  • Date: 2025-12-12
  • Authors: Sheng Feng, Shuqing Ma, Xiaoqian Zhu

📝 Abstract

Underwater acoustic target recognition (UATR) is extremely challenging due to the complexity of ship-radiated noise and the variability of ocean environments. Although deep learning (DL) approaches have achieved promising results, most existing models implicitly assume that underwater acoustic data lie in a Euclidean space. This assumption, however, is unsuitable for the inherently complex topology of underwater acoustic signals, which exhibit non-stationary, non-Gaussian, and nonlinear characteristics. To overcome this limitation, this paper proposes the UATR-GTransformer, a non-Euclidean DL model that integrates Transformer architectures with graph neural networks (GNNs). The model comprises three key components: a Mel patchify block, a GTransformer block, and a classification head. The Mel patchify block partitions the Mel-spectrogram into overlapping patches, while the GTransformer block employs a Transformer Encoder to capture mutual information between the split patches and generate Mel-graph embeddings. Subsequently, a GNN enhances these embeddings by modeling local neighborhood relationships, and a feed-forward network (FFN) further performs feature transformation. Experimental results on two widely used benchmark datasets demonstrate that the UATR-GTransformer achieves performance competitive with state-of-the-art methods. In addition, interpretability analysis reveals that the proposed model effectively extracts rich frequency-domain information, highlighting its potential for applications in ocean engineering.
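The front of this pipeline can be illustrated with a minimal sketch: splitting a Mel-spectrogram into overlapping patches and imposing a local neighborhood graph on the resulting tokens, as a GNN stage would require. The patch size, stride, and k-nearest-neighbour graph construction below are illustrative assumptions; the paper's exact choices are not given in this excerpt.

```python
import numpy as np

def patchify_mel(mel, patch=16, stride=10):
    """Split a Mel-spectrogram (freq x time) into overlapping square
    patches, each flattened into a token vector.
    Patch/stride values are illustrative, not the paper's."""
    n_freq, n_time = mel.shape
    tokens = []
    for i in range(0, n_freq - patch + 1, stride):
        for j in range(0, n_time - patch + 1, stride):
            tokens.append(mel[i:i + patch, j:j + patch].ravel())
    return np.stack(tokens)  # shape: (num_patches, patch * patch)

def knn_adjacency(tokens, k=3):
    """Build a symmetric k-nearest-neighbour adjacency matrix over
    patch tokens by Euclidean distance -- one common way to define the
    local neighborhoods a GNN layer operates on (assumed here)."""
    d = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-loops
    neighbors = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros_like(d)
    rows = np.repeat(np.arange(len(tokens)), k)
    adj[rows, neighbors.ravel()] = 1.0
    return np.maximum(adj, adj.T)        # symmetrise the graph

# Toy 64x64 "Mel-spectrogram" stands in for a real feature map.
mel = np.random.default_rng(0).standard_normal((64, 64))
tokens = patchify_mel(mel)
adj = knn_adjacency(tokens)
print(tokens.shape, adj.shape)  # → (25, 256) (25, 25)
```

In the full model, the flattened tokens would be linearly projected and passed through the Transformer Encoder before the graph stage; the adjacency structure here only mirrors the "local neighborhood relationships" the abstract attributes to the GNN.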

📄 Full Content

Underwater acoustic target recognition (UATR), a crucial topic in ocean engineering, involves detecting and classifying underwater targets based on their unique acoustic properties. This capability holds important implications for maritime security, environmental monitoring, and underwater exploration. However, UATR is highly challenging due to the complex mechanisms of underwater sound propagation in diverse marine environments [1]. Factors such as attenuation, scattering, and reverberation significantly complicate target identification and classification. Early UATR methods primarily relied on experienced sonar operators for manual recognition, but such approaches are prone to subjective influences, including the operator's psychological and physiological condition.

To overcome these limitations, statistical learning techniques were introduced, leveraging time-frequency representations derived from waveforms to enhance automatic recognition. Representative approaches include Support Vector Machines (SVM) [2], [3] and logistic regression [4]. Nevertheless, as the demand for higher recognition accuracy has increased, the shortcomings of statistical learning-based methods have become apparent. These methods typically capture only shallow discriminative patterns and fail to fully exploit the potential of diverse datasets.

Deep learning (DL), as a subset of machine learning, has achieved remarkable progress in UATR by learning complex patterns from large volumes of acoustic data [5], [6]. Among DL models, convolutional neural networks (CNNs) have been widely studied for end-to-end modeling of acoustic structures, owing to their strong feature extraction capabilities. For example, [7] proposed a dense CNN that outperformed traditional methods by extracting meaningful features from waveforms. Similarly, [8] employed ResNet and DenseNet to identify synthetic multitarget signals, demonstrating effective recognition of ship signals using acoustic spectrograms. A separable and time-dilated convolution-based model for passive UATR was proposed in [9], showing notable improvements over conventional approaches. In addition, [10] introduced a fusion network combining CNNs and recurrent neural networks (RNNs), achieving strong recognition performance across multiple tasks through data augmentation. Despite these successes, the inherent local connectivity and parameter-sharing properties of CNNs bias them toward local feature extraction, making it difficult to capture global structures such as overall spectral evolution and relationships among key frequency components.

To address this issue, attention mechanisms have been integrated into DL models to capture long-range dependencies in acoustic signals [11]. For instance, [12] proposed an interpretable neural network incorporating an attention module, while [13] designed an attention-based multi-scale convolution network that extracted filtered multi-view representations from acoustic inputs and demonstrated effectiveness on real-ocean data. Leveraging the Transformer’s multi-head self-attention (MHSA) mechanism, [14] proposed a lightweight UATR-Transformer, which achieved competitive results compared to CNNs. Inspired by the Audio Spectrogram Transformer (AST) [15], a spectrogram-based Transformer model (STM) was applied to UATR [16], yielding satisfactory outcomes. Moreover, self-supervised Transformers have shown strong potential in extracting intrinsic characteristics of underwater acoustic data [17]–[19]. Nonetheless, the complexity of pretraining and the unclear internal mechanisms suggest that this line of research is still in its early stages. In summary, current UATR research primarily focuses on extracting discriminative features through convolution, attention, and their variants [20], [21], which have achieved encouraging results with promising applications.

In practice, underwater acoustic data are often regarded as high-dimensional topological data due to their irregular structure and cluttered characteristics [22]. The generation and radiation of underwater target noise involve multiple components, including broadband continuous spectra, strong narrowband lines, and distinct modulation features. As a result, underwater signals often exhibit nonlinear, non-stationary, and non-Gaussian behavior. In the time domain, the waveforms and amplitudes vary dynamically, while in the frequency domain, spectral distributions can change over time. These characteristics challenge the representation of acoustic features as simple Euclidean vectors. Traditional models directly process sequential Euclidean data, such as images or audio, focusing on optimizing local and global information extraction. However, they neglect the geometric structure of acoustic data in high-dimensional space and overlook the non-Euclidean nature of the signals, leading to suboptimal performance.

To address this limitation, we propose the UATR-GTransformer. …

Reference

This content is AI-processed based on open access ArXiv data.
