3D Hand Pose Estimation via Deformable Mamba-Based Global Context Learning


📝 Abstract

Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and the inductive bias of such CNNs may not be optimal for 3D HPE due to their limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba’s selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five diverse datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.


📄 Content

Daily human activities often involve complex hand interactions, such as two-hand interactions [46,50] and object grasping [16,49,51], necessitating effective and efficient inference models that estimate 3D hand poses to handle such challenging scenarios. These intricate interactions with severe occlusions make it cumbersome to perform 3D hand pose estimation (HPE) from visual data, including single RGB images [32], depth images [60], egocentric views [16,38], etc. Meanwhile, developing fast inference models has become crucial to support real-time applications, especially in AR/VR devices [6,36]. Given these challenges, limited attention has been paid to the inductive biases introduced by backbone architectures and their synergy with 3D HPE. This highlights the need for designing backbones that are both effective in capturing complex interactions and efficient for real-time inference.

To enable robust feature learning for complex hand interactions, it is essential to learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, hand-object, or the scene). Convolutional neural networks (CNNs) are widely used as backbone architectures in 3D HPE, with ResNet-50 [26] being particularly popular [31,32,37,38,46,49,55,73]. These CNN backbones rely on convolution operations with a local receptive field, providing a favorable balance between accuracy and inference speed. However, these local convolutions have a limited ability to capture global context. Another line of recent studies leverages the vision transformer (ViT) [12] as a feature extractor for hand-mesh reconstruction [11,52,80], but its higher computational complexity often becomes a bottleneck in practical scenarios. This suggests significant room for improvement in backbone architectures to achieve better feature learning of hand poses efficiently.

As an emerging foundational architecture, Mamba [20] based on state space modeling (SSM) has garnered considerable attention, which was originally proposed for natural language processing. The Mamba model excels at selectively focusing on input tokens (i.e., emphasizing particular signals), thus efficiently capturing global context from long token sequences. Several recent studies have extended it to image backbones. For example, VisionMamba [81] introduced the Vim block, which employs a 2D bidirectional scan for spatially-aware sequence modeling. VMamba [40] further proposed the VSS block based on four different scanning paths. However, these scanning mechanisms employ a fixed grid as illustrated in Figure 1(a), which limits their ability to capture intricate hand pose variations when applied to 3D HPE.
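The fixed-grid scanning described above can be made concrete with a toy example. The functions below are illustrative stand-ins (not the actual Vim/VMamba implementation): they generate the four fixed traversal orders of a VMamba-style cross-scan over a small token grid, showing that the path depends only on the grid shape, never on the input content.

```python
def raster_scan(h, w):
    # Row-major flattening: the single fixed path of a 1D-style SSM scan
    # applied to image tokens.
    return [(i, j) for i in range(h) for j in range(w)]

def four_way_scans(h, w):
    """Simplified sketch of VMamba-style cross-scan: four fixed traversal
    orders (row-major, column-major, and their reverses). Note that the
    paths are fully determined by (h, w) -- they cannot adapt to the image."""
    rows = raster_scan(h, w)
    cols = [(i, j) for j in range(w) for i in range(h)]
    return [rows, rows[::-1], cols, cols[::-1]]

# Four scanning paths over a 2x3 token grid.
paths = four_way_scans(2, 3)
```

Running the scan in all four orders gives each token some context from every direction, but the visiting order itself remains rigid, which is the limitation DF-Mamba targets.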

Given this limitation, we introduce an effective and efficient backbone, Deformable Mamba (DF-Mamba), with deformable state space modeling (DSSM) that encourages robust visual feature extraction in 3D HPE. The core idea of DF-Mamba is to perform feature extraction by dynamically modeling local features and global context with flexible state spaces. Specifically, our DSSM blocks aggregate the local features according to a deformable path and selectively store useful cues to represent the global context. The scanning path is adjusted with deformable point sampling with local anchors and learnable offsets dependent on the given input features as illustrated in Figure 1(b).
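The deformable point sampling described above can be sketched as follows. The function and its inputs are hypothetical: in the actual model the offsets would be regressed from the input features and sampling would use bilinear interpolation, whereas here the offsets are given directly and clamped to integer positions.

```python
def deformable_scan_points(anchors, offsets, h, w):
    """Illustrative sketch of deformable point sampling: each fixed local
    anchor is shifted by an input-dependent offset and clamped into the
    h x w feature map, so the scan follows a data-dependent path rather
    than a fixed grid."""
    points = []
    for (ai, aj), (di, dj) in zip(anchors, offsets):
        i = min(max(ai + di, 0), h - 1)  # clamp row into [0, h-1]
        j = min(max(aj + dj, 0), w - 1)  # clamp column into [0, w-1]
        points.append((i, j))
    return points

# Anchors on a coarse grid; the offsets stand in for values a small
# prediction head would regress from the given input features.
pts = deformable_scan_points([(0, 0), (3, 3)], [(-1, 2), (1, 1)], 4, 4)
```

Because the offsets depend on the input, two different hand images would yield two different scanning paths over the same anchor grid, which is what lets the scan follow intricate pose variations.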

The overall architecture of DF-Mamba is a tribrid design composed of three blocks: convolution blocks, DSSM blocks, and gated convolution blocks, with a model size comparable to ResNet-50. This approach efficiently leverages the complementary strengths of each block type: extracting local features via convolution blocks at lower layers, adaptively enhancing features with DSSM blocks at higher layers after downsampling, and further refining visual representations using gated convolution blocks.
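The tribrid layout above can be summarized schematically. The stage list below is illustrative only: the block counts and channel widths are hypothetical placeholders, not the paper's configuration; what it captures is the described ordering of block types through the network.

```python
# Hypothetical stage layout mirroring the tribrid design: convolution at
# lower layers, DSSM after downsampling, gated convolution at the top.
STAGES = [
    ("conv",       {"blocks": 3, "channels": 64}),   # local feature extraction
    ("downsample", {"stride": 2}),
    ("dssm",       {"blocks": 6, "channels": 256}),  # deformable state scanning
    ("gated_conv", {"blocks": 3, "channels": 512}),  # refine representations
]

def forward_order(stages):
    # The order in which a feature map would pass through the stages.
    return [name for name, _ in stages]

trace = forward_order(STAGES)
```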

In our experiments, we integrate DF-Mamba into two representative 3D HPE frameworks proposed by Jiang et al. [32] and Zhou et al. [79], replacing their backbones with our method. Our evaluations are performed on five datasets: InterHand2.6M [46], RHP [84], NYU [60], DexYCB [4] and AssemblyHands [49], covering diverse scenarios, including single-hand and two-hand pose estimation, hand-only and hand-object interactions, and RGB and depth modalities. The results demonstrate that DF-Mamba outperforms the latest Mamba-based backbones, e.g., VMamba [40] and Spatial-Mamba [68], achieving state-of-the-art performance while maintaining inference speed.

Our contributions are summarized as follows.

This task is formulated as predicting the 3D coordinates of hand joints, typically from a single image [13,48]. This task has been studied in various interaction scenarios, such as single-hand [5,19,84,85], self-contact [50,56] and hand-object interactions [15,16], as well as RGB-based [19,32,78] and depth-based [60] estimation. Most existing approaches rely on deep neural networks.
