RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection

February 23, 2026


📝 Abstract

This work proposes a hybrid deep learning approach, the Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer (RS-CA-HSICT) architecture, which combines the strengths of CNNs and Transformers to improve MPox detection. The RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a Channel Augmentation (CA) block, which together enrich the feature space and capture detailed lesion information and long-range dependencies. The new HSICT module integrates an abstract representation from the stem CNN with customized ICT blocks that provide efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and extract local texture. The H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Inverse residual learning mitigates the vanishing-gradient problem, and stage-wise resolution reduction ensures scale invariance. The framework then augments the learned HSICT channels with transfer learning (TL)-driven residual and spatial CNN maps, producing an enhanced multiscale feature space that captures global and localized structural cues, subtle texture, and contrast variations. Before augmentation, these channels are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in MPox. Experiments on both the Kaggle benchmark and a diverse MPox dataset yielded classification accuracy as high as 98.30% and an F1-score of 98.13%, outperforming existing CNNs and ViTs.

📄 Content

Monkeypox (MPox) is caused by a virus of the Orthopoxvirus genus, a zoonotic lineage closely related to the smallpox and vaccinia viruses, and it has shown repeated potential for outbreaks and localized epidemics [1]. The virus was first identified in monkeys in 1959, and it has circulated among humans since the first confirmed human case in 1970 [2]. Although generally less lethal than COVID-19, the disease has seen a significant surge in reported cases globally [2]. It is transmitted through direct contact with an infected person or animal and indirectly through contact with contaminated environmental surfaces. Symptoms typically include fever, myalgia, and fatigue, but the hallmark of the disease is a characteristic rash with skin lesions [3].

The WHO declared MPox a Public Health Emergency of International Concern, and the disease was a significant burden on global health in 2022. Its management includes effective isolation of affected populations, strict contact tracing, and early case detection. According to the CDC, as of January 31, 2023, there were 85,469 cases in 94 countries. No specific treatment exists; care is directed at symptoms and at prevention, such as vaccination against smallpox [4].

Diagnosis of MPox relies on a combination of clinical evaluation and confirmatory laboratory testing. This work therefore aims to improve the precision and speed of MPox diagnosis using emerging AI techniques, namely machine learning (ML) and deep learning (DL), and specifically the Convolutional Neural Network (CNN) [5]. The value of AI in medical imaging is by now well established, and these techniques play an increasingly important role in diagnosing and managing a wide range of diseases [6]. The most powerful among them are DL architectures, particularly CNNs and Vision Transformers (ViTs). Their strength lies in learning complex, hierarchical feature representations directly from raw images [7], [8]. These models can extract discriminative features from medical images and commonly achieve better diagnostic performance than classic approaches that rely on manual feature extraction [9]. However, their effectiveness in practice is often limited by a lack of data and high computational demands. Transfer learning (TL) is usually employed to reduce overfitting and improve model generalization on smaller datasets.

Computer-aided diagnostic (CAD) systems process medical data using advanced algorithms that enable the automatic diagnosis of diseases [10], [11]. The rising number of MPox infections highlights the need for effective DL-based CAD methods for image-based diagnosis. CNNs and ViTs are prominent deep learning techniques for discovering hidden structural features in images. Instead of relying on convolutions, a ViT relies on self-attention (SA) mechanisms to emphasize critical regions of an image.
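To make the self-attention idea above concrete, the following is a minimal sketch of single-head scaled dot-product attention over a few toy patch embeddings. It is not the paper's ICT block: the function names, the identity projection matrices, and the three 2-dimensional "patches" are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: one row per image patch (hypothetical embeddings).
    w_q, w_k, w_v: projection matrices (assumed given, not learned here).
    Returns the attended tokens and the attention weight matrix.
    """
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three toy patch embeddings and identity projections (illustrative only).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = self_attention(patches, identity, identity, identity)
```

Each row of `weights` sums to one and says how strongly a patch attends to every other patch, which is how a ViT can emphasize critical regions without convolutions. A real model would learn the projections and use several heads in parallel.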

Medical imaging frequently encounters challenges stemming from the mismatch between high-dimensional feature spaces and the relatively small size of available datasets, which can lead to the curse of dimensionality [6]. TL is frequently used to alleviate overfitting, particularly with small datasets [12]. Other challenges include limited availability of data, variability in infection sites and image contrast, morphological differences, and inter-class variability. Conventional ViT methods also struggle with local feature extraction and require substantial computational resources.

To address these challenges, this paper introduces the Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer (RS-CA-HSICT) model, an advanced DL network that combines CNN and Transformer architectures. The model has an initial CNN block and a dual-stream network that improve the quality of the features fed to the Transformer. Our hybrid approach is based on residual learning and spatial exploitation (RS), which capture both local and global characteristics and texture variation, as in [5], [13], and improve diagnostic performance for conditions such as MPox. The key contributions of this research study are as follows:

• The proposed RS-CA-HSICT introduces a hybrid framework that unifies Transformer and CNN strengths for comprehensive MPox image analysis. RS-CA-HSICT integrates four novel components: an HSICT CNN-Transformer block, a residual CNN module, a spatial CNN block, and a Channel Augmentation (CA) block that enriches the diverse feature space and enhances discrimination.

• The abstract stem CNN, custom ICT blocks with efficient multihead attention, and structured CNN layers perform homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity
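The channel augmentation named in the contributions above can be sketched as follows: feature maps from three streams are each gated per channel, concatenated along the channel axis, and then gated per pixel. This is a simplified reading of the abstract, not the authors' implementation. The parameter-free sigmoid gates, the function names, and the tiny 2x2 feature maps are all assumptions; a real block would use learned layers.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(channels):
    """Scale each channel by a sigmoid of its global average (assumed gate)."""
    gated = []
    for ch in channels:
        values = [v for row in ch for v in row]
        gate = sigmoid(sum(values) / len(values))
        gated.append([[v * gate for v in row] for row in ch])
    return gated


def augment_channels(*streams):
    """Concatenate feature maps from several streams along the channel axis."""
    fused = []
    for stream in streams:
        fused.extend(stream)
    return fused


def spatial_attention(channels):
    """Scale each pixel by a sigmoid of the cross-channel mean (assumed gate)."""
    height, width = len(channels[0]), len(channels[0][0])
    mask = [[sigmoid(sum(ch[i][j] for ch in channels) / len(channels))
             for j in range(width)] for i in range(height)]
    return [[[ch[i][j] * mask[i][j] for j in range(width)]
             for i in range(height)] for ch in channels]


# Hypothetical one-channel 2x2 outputs of the three streams.
hsict_maps = [[[1.0, 0.0], [0.0, 1.0]]]
residual_maps = [[[0.5, 0.5], [0.5, 0.5]]]
spatial_maps = [[[0.0, 2.0], [0.0, 0.0]]]

# Refine each stream's channels, then augment, then apply spatial attention.
augmented = augment_channels(
    channel_attention(hsict_maps),
    channel_attention(residual_maps),
    channel_attention(spatial_maps),
)
refined = spatial_attention(augmented)
```

Concatenation keeps every stream's channels side by side, so later layers can draw on both the Transformer-style and CNN-style features. The two gates only rescale values, which leaves the tensor shape unchanged.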


This content is AI-processed based on ArXiv data.
