DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification
Hyperspectral image classification (HSIC) is a challenging task due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples. Although transformer-based models have shown strong potential for HSIC, existing approaches often struggle to achieve sufficient spectral discriminability while maintaining computational efficiency. To address these limitations, we propose DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer with Dynamic Context Attention for HSIC. The proposed DSXFormer introduces a Dual-Pooling Spectral Squeeze-Expansion (DSX) block, which exploits complementary global average and max pooling to adaptively recalibrate spectral feature channels, thereby enhancing spectral discriminability and inter-band dependency modeling. In addition, DSXFormer incorporates a Dynamic Context Attention (DCA) mechanism within a window-based transformer architecture to dynamically capture local spectral-spatial relationships while significantly reducing computational overhead. The joint integration of spectral dual-pooling squeeze-expansion and DCA enables DSXFormer to achieve an effective balance between spectral emphasis and spatial contextual representation. Furthermore, patch extraction, embedding, and patch merging strategies are employed to facilitate efficient multi-scale feature learning. Extensive experiments conducted on four widely used hyperspectral benchmark datasets, including Salinas (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), demonstrate that DSXFormer consistently outperforms state-of-the-art methods, achieving classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52%, respectively.
💡 Research Summary
The paper introduces DSXFormer, a novel transformer architecture specifically designed for hyperspectral image classification (HSIC). The authors identify two major challenges in HSIC: (1) the need to capture fine‑grained spectral dependencies across hundreds of contiguous bands, and (2) the requirement to model spatial context efficiently when labeled samples are scarce. To address these, DSXFormer integrates a Dual‑Pooling Spectral Squeeze‑Expansion (DSX) block and a Dynamic Context Attention (DCA) mechanism within a hierarchical, window‑based transformer framework.
The DSX block first aggregates global spectral statistics using both global average pooling (GAP) and global max pooling (GMP) across the token dimension. GAP provides an overall view of the spectral distribution, while GMP highlights the most salient activations. The two descriptors are summed to form a unified spectral vector, which is then passed through a lightweight gating network consisting of two fully‑connected layers (expansion → ReLU → compression). This process adaptively re‑weights spectral channels, emphasizing informative bands and suppressing redundant ones, thereby improving spectral discriminability early in the network.
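The DSX recalibration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the dual pooling, expansion → ReLU → compression gating, and channel re-weighting follow the summary, while the sigmoid at the gate output and all weight names (`w1`, `b1`, `w2`, `b2`) are assumptions for illustration.

```python
import numpy as np

def dsx_block(x, w1, b1, w2, b2):
    """Dual-Pooling Spectral Squeeze-Expansion (sketch).

    x : (tokens, channels) token embeddings for one patch.
    w1/b1 : expansion FC layer (channels -> expanded), hypothetical names.
    w2/b2 : compression FC layer (expanded -> channels), hypothetical names.
    """
    gap = x.mean(axis=0)                 # global average pooling over tokens
    gmp = x.max(axis=0)                  # global max pooling over tokens
    s = gap + gmp                        # unified spectral descriptor
    h = np.maximum(0.0, s @ w1 + b1)     # expansion + ReLU
    g = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # compression; sigmoid gate assumed
    return x * g                         # re-weight each spectral channel
```

Because the gate `g` lies in (0, 1), informative channels are passed through nearly unchanged while redundant ones are attenuated.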
The Dynamic Context Attention replaces the conventional global self‑attention with a window‑based attention that operates on fixed‑size M × M windows. Within each window, relative positional encodings and a similarity‑guided scaling factor are used to modulate attention scores, allowing the model to focus on locally relevant context while keeping computational cost low. A shifted‑window strategy is employed between successive layers to enable cross‑window information exchange, effectively building long‑range dependencies in a hierarchical manner without the quadratic complexity of full self‑attention.
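The window-based computation can be sketched as follows. This is a simplified illustration under stated assumptions: it partitions a feature map into non-overlapping windows and runs plain scaled dot-product self-attention inside each one, omitting the relative positional encodings, the similarity-guided scaling, and the learned query/key/value projections that the actual DCA adds.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping m x m windows.

    Returns (num_windows, m*m, C); H and W are assumed divisible by m.
    """
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m * m, c)

def window_attention(tokens):
    """Scaled dot-product self-attention within each window.

    tokens : (num_windows, m*m, C). Identity Q/K/V projections are used
    for brevity; DCA would also add a relative position bias and a
    similarity-guided scaling factor to the attention scores.
    """
    _, _, c = tokens.shape
    scores = tokens @ tokens.transpose(0, 2, 1) * c ** -0.5
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ tokens
```

Since attention is confined to m*m tokens per window, the cost scales linearly with the number of windows rather than quadratically with the full token count; the shifted-window step between layers (not shown) can be realized by cyclically rolling the feature map before partitioning.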
The overall pipeline proceeds as follows: (1) the hyperspectral cube is divided into non‑overlapping spatial patches; (2) each patch is flattened and linearly projected into a latent embedding space; (3) the DSX block recalibrates the spectral channels of these embeddings; (4) a stack of transformer encoder layers equipped with DCA processes the tokens; (5) patch‑merging layers progressively halve spatial resolution while doubling channel depth, providing multi‑scale feature learning; and (6) a global‑average‑pooling followed by a fully‑connected classification head produces class probabilities for each patch (or pixel after up‑sampling).
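Step (5) of the pipeline, patch merging, can be sketched as below. This is a hedged illustration of the standard halve-resolution/double-channels merging scheme the summary describes; the exact grouping order and the projection weight `w` are assumptions, not the paper's implementation.

```python
import numpy as np

def patch_merging(x, w):
    """Merge each 2x2 spatial neighborhood of an (H, W, C) feature map.

    Concatenates the four neighbors into 4C channels, then projects with
    w (shape (4C, 2C), hypothetical) so spatial resolution halves while
    channel depth doubles, giving an (H/2, W/2, 2C) output.
    """
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )                       # (H/2, W/2, 4C)
    return merged @ w       # (H/2, W/2, 2C)
```

Stacking several such stages yields the multi-scale hierarchy: each stage sees a coarser spatial grid with a richer channel representation, analogous to downsampling in CNN backbones.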
Extensive experiments were conducted on four widely used hyperspectral benchmarks: Indian Pines (IP), Salinas (SA), Pavia University (PU), and Kennedy Space Center (KSC). Using only a limited fraction of labeled samples (typically 10‑30 % of the data), DSXFormer achieved overall accuracies of 99.95 % (SA), 98.91 % (IP), 99.85 % (PU), and 98.52 % (KSC). These results consistently surpass state‑of‑the‑art CNN‑based models (e.g., 3D‑CNN, ResNet, DenseNet) and recent transformer variants (e.g., Swin‑Transformer, GraphGST, HiT). Moreover, DSXFormer maintains a relatively modest parameter count (~12 M) and FLOPs, offering a favorable trade‑off between performance and efficiency. Ablation studies confirm that removing the DSX block degrades accuracy by ~1.8 % and replacing DCA with full self‑attention increases FLOPs by 2.5× while reducing accuracy by ~1.2 %, highlighting the complementary contributions of both modules.
In summary, DSXFormer advances HSIC by (i) introducing a dual‑pooling spectral squeeze‑expansion module that adaptively emphasizes discriminative spectral bands, (ii) employing a dynamic, window‑based attention mechanism that captures local spatial‑spectral context efficiently, and (iii) integrating these components into a hierarchical transformer with multi‑scale patch merging. The proposed design achieves superior classification accuracy, robustness under limited training data, and computational efficiency, making it a promising candidate for real‑world hyperspectral applications such as precision agriculture, environmental monitoring, and urban planning.