Lightweight 3D Point Cloud Models via Foundation Model Distillation

Reading time: 6 minutes
...

📝 Abstract

Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

📄 Content

Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier1, Siddharth Srivastava2, Frederic Jurie1, Gaurav Sharma3
1 GREYC, Normandy University, Unicaen, ENSICAEN, UMR CNRS 6072, F-14000 Caen, France; 2 IIT Delhi; 3 IIT Kanpur
{firstname.lastname}@unicaen.fr

1. Introduction

The machine learning landscape is increasingly dominated by large-scale foundation models [5, 9, 13]. Pre-trained on vast, unlabeled datasets using self-supervised learning (SSL), these models serve as powerful, general-purpose feature extractors that can be adapted to a wide array of downstream tasks. This paradigm has shown immense success in 3D vision, with models trained on datasets of synthetic objects [41, 45], real-world scans [36, 40], and large-scale scenes [2, 7, 34] becoming the backbone for tasks in robotics, autonomous driving, and AR/VR.

However, the very scale that makes these models powerful also creates a significant deployment bottleneck. With hundreds of millions of parameters and quadratic attention complexity, models like PointTransformer [49] cannot be executed on resource-constrained devices. As we demonstrate in Sec. 4.4, even a modern GPU can fail to process a moderately sized point cloud of 300k points, let alone the million-point scenes common in real-world applications. This computational barrier prevents the power of 3D foundation models from reaching the devices where they are often needed most.

Existing compression techniques fall short of this goal because they trade generality for efficiency. The dominant approach, knowledge distillation (KD) [14], typically trains a student to mimic the teacher's logits on a specific task, creating an efficient 'specialist'. While recent works have explored distilling feature embeddings from large vision-language models like CLIP [43], these methods focus on preserving a specific cross-modal alignment capability via direct feature mimicry. This still results in a student specialized for tasks like zero-shot classification, rather than a truly general-purpose unimodal backbone. A fundamental gap remains for a method that can distill the entire representational manifold of a unimodal SSL model.
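The logit-matching KD objective discussed above can be sketched in a few lines. This is a minimal NumPy illustration of the classic temperature-scaled distillation loss, not the paper's code; the function names and temperature value are our own:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence between the teacher's soft
    targets and the student's predictions, averaged over the batch.
    The T*T factor keeps gradient magnitudes comparable across T."""
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    eps = 1e-12                      # avoid log(0)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl.mean() * T * T)
```

Because the loss is tied to a particular task's logits, the resulting student is exactly the kind of 'specialist' the paper argues against.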
To bridge this gap, we explore a distillation paradigm we call Foundation Model Distillation (FMD). The goal of FMD is not to create a task-specific specialist, but to forge a compact, portable, and efficient proxy of the original foundation model that retains its core identity as a general-purpose feature extractor. We introduce Foundry, the first framework to realize FMD for 3D point cloud Transformers.

At the heart of Foundry is the concept of learnable SuperTokens. We train a lightweight student wrapper to compress the teacher's dense set of token embeddings into a small, fixed-size set of SuperTokens, and then reconstruct the original embeddings from this compressed representation. This compress-and-reconstruct objective forces the student to learn a highly efficient basis for the teacher's latent space, capturing its salient semantic and geometric features in a compact form. The result is a standalone student model that acts as a miniature foundation model. It can be cheaply fine-tuned for numerous downstream tasks without ever needing the original teacher again.

Our contributions are:
• We propose Foundation Model Distillation (FMD), a new paradigm for creating compact, general-purpose proxies of large SSL models by distilling their entire representation space.
• We introduce Foundry, the first FMD framework for 3D point clouds. Its core novelty is a compress-and-reconstruct objective that fo

arXiv:2511.20721v1 [cs.CV] 25 Nov 2025
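The compress-and-reconstruct idea can be illustrated with a short sketch. This excerpt does not specify Foundry's actual mechanism, so the cross-attention pooling below, the token counts, and all variable names are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 256, 16, 64   # teacher tokens, SuperTokens, embed dim (illustrative)

def attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ values

teacher_tokens = rng.normal(size=(N, D))   # dense teacher embeddings
super_queries  = rng.normal(size=(K, D))   # learnable SuperToken queries
pos_queries    = rng.normal(size=(N, D))   # per-token queries for reconstruction

# Compress: K SuperTokens summarize the N teacher tokens.
super_tokens = attend(super_queries, teacher_tokens, teacher_tokens)  # (K, D)

# Reconstruct: recover all N embeddings from the compressed set.
recon = attend(pos_queries, super_tokens, super_tokens)               # (N, D)

# Training objective: reconstruction error against the teacher's tokens.
loss = np.mean((recon - teacher_tokens) ** 2)
```

Minimizing such a reconstruction loss over many inputs would push the K SuperTokens toward a compact basis of the teacher's latent space, which is the intuition the paragraph describes; downstream tasks then run on the K-token student rather than the N-token teacher.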

This content is AI-processed based on ArXiv data.
