Attention-guided multi-scale local reconstruction for point clouds via masked autoencoder self-supervised learning
Self-supervised learning has emerged as a prominent research direction in point cloud processing. While existing models predominantly concentrate on reconstruction tasks at higher encoder layers, they often neglect the effective utilization of low-level local features, which are typically employed solely for activation computations rather than directly contributing to reconstruction tasks. To overcome this limitation, we introduce PointAMaLR, a novel self-supervised learning framework that enhances feature representation and processing accuracy through attention-guided multi-scale local reconstruction. PointAMaLR implements hierarchical reconstruction across multiple local regions, with lower layers focusing on fine-scale feature restoration while upper layers address coarse-scale feature reconstruction, thereby enabling complex inter-patch interactions. Furthermore, to augment feature representation capabilities, we incorporate a Local Attention (LA) module in the embedding layer to enhance semantic feature understanding. Comprehensive experiments on benchmark datasets ModelNet and ShapeNet demonstrate PointAMaLR’s superior accuracy and quality in both classification and reconstruction tasks. Moreover, when evaluated on the real-world dataset ScanObjectNN and the 3D large scene segmentation dataset S3DIS, our model achieves highly competitive performance metrics. These results not only validate PointAMaLR’s effectiveness in multi-scale semantic understanding but also underscore its practical applicability in real-world scenarios.
💡 Research Summary
The paper introduces PointAMaLR, a novel self‑supervised learning framework for 3‑D point clouds that addresses a key limitation of existing masked autoencoder (MAE) approaches: the under‑utilization of low‑level local features. Traditional MAE models apply reconstruction loss only at high‑level encoder layers, treating the early layers merely as feature extractors without directly contributing to the reconstruction objective. Consequently, fine‑grained geometric details are often lost, especially in noisy or incomplete real‑world scans.
PointAMaLR tackles this problem through two complementary mechanisms: multi‑scale local reconstruction and a Local Attention (LA) module integrated into the embedding stage. First, the input point set is partitioned into numerous local patches of varying radii. Lower encoder layers focus on reconstructing small‑radius patches, employing a combination of Chamfer Distance (CD) and Earth Mover’s Distance (EMD) to enforce precise coordinate and feature recovery. Higher layers, in contrast, reconstruct larger‑radius patches, encouraging the network to capture coarse‑scale shape and inter‑patch relationships. This hierarchical reconstruction forces the model to learn both fine‑level details and global structure simultaneously, yielding richer representations than single‑scale reconstruction.
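The fine-scale coordinate loss described above rests on the Chamfer Distance between a reconstructed patch and its ground truth. As a minimal illustrative sketch (the paper's actual implementation is not shown; the function name and NumPy formulation here are assumptions), a symmetric CD over two point sets can be written as:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3).

    For each point in one set, find the squared distance to its nearest
    neighbor in the other set, then average both directions.
    """
    # Pairwise squared distances, shape (N, M)
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest-neighbor distance in each direction, averaged and summed
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

In practice this would be applied per patch at each scale (with EMD computed analogously via an optimal matching), and small-radius patches would be supervised at lower layers while large-radius patches are supervised higher up.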
Second, the LA module refines the neighborhood aggregation process. While conventional point‑based networks (e.g., PointNet++, DGCNN) treat all k‑nearest neighbors equally, LA computes an attention score for each neighbor based on a Query‑Key‑Value formulation that incorporates both spatial coordinates and intermediate features. The resulting weighted aggregation emphasizes semantically important points and suppresses noisy outliers, improving robustness on real‑world data where missing points and measurement errors are common.
The training objective combines two loss terms: a low‑level reconstruction loss (Lₗ) that penalizes discrepancies in fine patches using CD + EMD, and a high‑level reconstruction loss (Lₕ) that aligns the feature embeddings of coarse patches via a feature‑matching loss (e.g., cosine similarity or contrastive loss). The total loss is L = λ₁·Lₗ + λ₂·Lₕ, with λ₁ and λ₂ tuned empirically. This composite loss ensures that both scales are optimized without one dominating the other.
Extensive experiments validate the approach. On synthetic benchmarks ModelNet40 and ShapeNet55, PointAMaLR achieves classification accuracies of 92.8 % and 84.3 %, respectively, surpassing baseline MAE‑based methods by 1.5–2.0 percentage points. Reconstruction quality, measured by Chamfer Distance, improves by roughly 15 % on average, and visualizations show markedly sharper recovered surfaces. Real‑world evaluations on ScanObjectNN (object classification) and S3DIS (large‑scale indoor scene segmentation) demonstrate the model’s practical relevance: ScanObjectNN accuracy reaches 86.2 %, and S3DIS mIoU climbs to 68.5 %, both competitive with or exceeding state‑of‑the‑art results.
Ablation studies confirm the contribution of each component. Removing the LA module worsens the low‑level reconstruction loss by about 8 %, while the module itself adds only ~5 % to the overall parameter count, an efficient trade‑off. Excluding the multi‑scale reconstruction hierarchy leads to unstable high‑level feature learning, confirming that the hierarchical design is essential for balanced representation learning.
In summary, PointAMaLR advances self‑supervised point‑cloud learning by directly leveraging low‑level local features for reconstruction and by introducing attention‑guided neighborhood weighting. The resulting representations are more detailed, robust to noise, and effective across a range of downstream tasks, including classification, reconstruction, and semantic segmentation. The framework’s modular nature suggests easy integration with larger 3‑D perception pipelines, paving the way for future research on scalable, real‑time 3‑D scene understanding.