SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering, which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensure that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
💡 Research Summary
Generalized Category Discovery (GCD) seeks to cluster unlabeled data that contains both known (“Old”) and unknown (“New”) classes, using a small labeled subset of the Old classes. Existing unimodal approaches that rely solely on image features tend to overfit to the limited labeled data, biasing predictions toward Old categories. Recent multimodal methods improve performance by adding textual information from CLIP, but they treat the visual and textual streams independently, require additional components such as text‑inversion networks or large language‑model‑generated descriptions, and consequently incur substantial computational overhead.
SpectralGCD addresses these limitations with a two‑stage pipeline. First, it builds a unified cross‑modal representation for each image by computing cosine similarities between the image embedding (from a CLIP visual encoder) and a large, task‑agnostic dictionary of textual concepts. The resulting vector z ∈ ℝ^M encodes how strongly each concept describes the image, analogous to a topic distribution in probabilistic topic models. If class identity depends only on these concept activations, z constitutes a sufficient statistic for classification, allowing the model to focus on semantic information rather than spurious visual cues.
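The cross‑modal representation described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the embeddings are random placeholders standing in for CLIP image/concept features, and the temperature value is an assumption.

```python
import numpy as np

def cross_modal_representation(img_emb, concept_embs, temperature=0.07):
    """Cosine similarities between one image embedding and M concept
    embeddings, softmaxed into a concept-mixture vector z that sums to 1."""
    img = img_emb / np.linalg.norm(img_emb)
    concepts = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = concepts @ img                      # (M,) cosine similarities
    logits = sims / temperature
    logits -= logits.max()                     # numerical stability
    z = np.exp(logits) / np.exp(logits).sum()
    return z

# Toy example: 512-d CLIP-like embeddings, M = 1000 concepts (placeholders)
rng = np.random.default_rng(0)
img_emb = rng.normal(size=512)
concept_embs = rng.normal(size=(1000, 512))
z = cross_modal_representation(img_emb, concept_embs)
```

In practice the image embedding would come from `clip_model.encode_image` and the concept embeddings from encoding the dictionary entries with the frozen text encoder.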
Second, Spectral Filtering automatically selects the most informative concepts. A strong frozen CLIP teacher processes the entire dictionary, and the cross‑modal covariance matrix of the softmaxed similarity vectors is eigendecomposed. High‑variance eigenvectors capture coherent concept co‑activations that carry meaningful signal. By retaining only the top‑k eigencomponents (e.g., covering 90% of the variance), the method discards noisy, irrelevant concepts, yielding a filtered dictionary Ĉ that is much smaller yet semantically rich.
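One plausible reading of the eigendecomposition step can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the concept scoring rule (energy in the retained eigensubspace) and the toy Dirichlet data are placeholders, not the paper's exact procedure.

```python
import numpy as np

def spectral_filter(Z, var_ratio=0.90):
    """Z: (N, M) softmaxed image-concept similarity vectors from the teacher.
    Eigendecompose the concept covariance, keep the leading eigenvectors
    explaining `var_ratio` of the variance, and score each concept by its
    weighted energy in that retained subspace."""
    cov = np.cov(Z, rowvar=False)               # (M, M) concept covariance
    evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending
    cum = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(cum, var_ratio)) + 1
    scores = (evecs[:, :k] ** 2) @ evals[:k]    # per-concept relevance score
    return scores, k

# Toy softmaxed similarity vectors: 200 images over a 50-concept dictionary
rng = np.random.default_rng(1)
Z = rng.dirichlet(np.ones(50), size=200)
scores, k = spectral_filter(Z)
keep = np.argsort(scores)[::-1][:20]            # indices of filtered dictionary
```

The key property shown here is that the number of retained components k is determined automatically by the cumulative variance threshold rather than hand-tuned.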
Training proceeds on the filtered representation with forward and reverse knowledge distillation from the same teacher. In forward distillation, the student’s similarity vector z_S is encouraged (via KL or L2 loss) to match the teacher’s softmaxed vector z_T. Reverse distillation feeds z_S back into the teacher to enforce consistency, strengthening the alignment of student and teacher representations despite the student’s limited capacity (only the final transformer block of the visual encoder is fine‑tuned, while the text encoder remains frozen).
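The bidirectional distillation objective above can be illustrated with a symmetric KL formulation. This is a minimal sketch: the paper mentions KL or L2 losses, and the exact direction conventions and weighting are assumptions here; the probability vectors are random placeholders.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two probability vectors, with smoothing for stability."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical teacher/student concept distributions over 100 filtered concepts
rng = np.random.default_rng(2)
z_T = rng.dirichlet(np.ones(100))   # teacher's softmaxed similarity vector
z_S = rng.dirichlet(np.ones(100))   # student's softmaxed similarity vector

loss_forward = kl_div(z_T, z_S)     # forward: student matches the teacher
loss_reverse = kl_div(z_S, z_T)     # reverse: consistency in the other direction
loss_kd = loss_forward + loss_reverse
```

Since KL divergence is asymmetric, combining both directions penalizes the student both for missing concepts the teacher activates and for activating concepts the teacher does not.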
The loss function combines supervised and unsupervised components identical to those used in prior GCD work: supervised contrastive loss L_s^con, unsupervised contrastive loss L_u^con, supervised classification loss L_s^cls, and an unsupervised self‑distillation loss L_u^cls with an entropy regularizer to prevent collapse onto Old classes. All losses operate on the cross‑modal vector z, which is first linearly projected to a compact embedding u and then passed through a parametric classifier L_ψ for class probabilities and through a small MLP M for contrastive learning.
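The data flow through the heads described above (z → projection → u → classifier L_ψ and MLP M) can be sketched as follows. All dimensions and weight initializations are placeholders; this only illustrates the shapes and the two branches, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(3)
M_dim, d, num_classes = 500, 128, 10            # dictionary size, embed dim, classes (assumed)

W_proj = rng.normal(size=(M_dim, d)) * 0.02     # linear projection z -> u
W_cls  = rng.normal(size=(d, num_classes)) * 0.02  # parametric classifier L_psi
W_mlp1 = rng.normal(size=(d, d)) * 0.02         # small MLP head M for contrastive losses
W_mlp2 = rng.normal(size=(d, d)) * 0.02

z = rng.dirichlet(np.ones(M_dim))               # cross-modal concept vector
u = z @ W_proj                                  # compact embedding

logits = u @ W_cls                              # classification branch
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # class probabilities

h = np.maximum(u @ W_mlp1, 0.0) @ W_mlp2        # contrastive branch (ReLU MLP)
h = h / np.linalg.norm(h)                       # L2-normalized contrastive feature
```

The classification losses (L_s^cls, L_u^cls) would operate on `probs`, while the contrastive losses (L_s^con, L_u^con) would operate on the normalized feature `h`.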
Extensive experiments on six benchmarks—including CUB, ImageNet‑R, CIFAR‑100, and Stanford Cars—show that SpectralGCD matches or exceeds state‑of‑the‑art multimodal methods (GET, TextGCD) while using roughly the same FLOPs as unimodal baselines. Spectral Filtering contributes 2–5% absolute accuracy gains and reduces computational cost by over 30% compared to naïve multimodal pipelines. The authors release code and the concept dictionary, ensuring reproducibility.
In summary, SpectralGCD introduces (1) a unified cross‑modal sufficient representation based on CLIP similarity scores, (2) a spectral analysis‑driven automatic concept selection mechanism, and (3) bidirectional knowledge distillation to preserve semantic fidelity in a lightweight student model. This combination yields an efficient, scalable, and high‑performing solution for generalized category discovery.