Semantically Guided Dynamic Visual Prototype Refinement for Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) seeks to recognize unseen state-object pairs by recombining primitives learned from seen compositions. Despite recent progress with vision-language models (VLMs), two limitations remain: (i) text-driven semantic prototypes are weakly discriminative in the visual feature space; and (ii) unseen pairs are optimized passively, thereby inducing seen bias. To address these limitations, we present Duplex, a framework that couples dual-prototype learning with dynamic local-graph refinement of visual prototypes. For each composition, Duplex maintains a semantic prototype via prompt learning and a visual prototype for unseen pairs constructed by recombining disentangled state and object primitives from seen images. The visual prototypes are updated dynamically through lightweight aggregation on mini-batch local graphs, which incorporates unseen compositions during training without labels. This design introduces fine-grained visual evidence while preserving semantic structure. It enriches class prototypes, better disambiguates semantically similar yet visually distinct pairs, and mitigates seen bias. Experiments on MIT-States, UT-Zappos, and CGQA in closed-world and open-world settings show that Duplex achieves competitive performance and consistent compositional generalization. Our source code is available at https://github.com/ISPZ/Duplex-CZSL.
💡 Research Summary
Compositional Zero‑Shot Learning (CZSL) aims to recognize unseen state‑object pairs by recombining primitive concepts learned from seen compositions. Recent works have leveraged large vision‑language models (VLMs) such as CLIP and have adopted prompt‑tuning to obtain textual prototypes for each composition. However, two fundamental issues persist: (1) the textual (semantic) prototypes are weakly discriminative in the visual feature space, leading to ambiguous decision boundaries for visually similar but semantically different pairs; and (2) unseen compositions are treated passively during training, causing a strong bias toward seen classes (the “seen bias”).
The paper introduces Duplex, a novel framework that simultaneously learns dual prototypes—a stable semantic prototype and a dynamic visual prototype—for every state‑object composition. Semantic prototypes are obtained by learning soft prompts on top of CLIP’s text encoder; they remain fixed after training and serve as interpretable anchors that encode the compositional semantics. Visual prototypes are constructed by first disentangling state and object features from images using lightweight MLP heads, then recombining these disentangled features in a counter‑factual manner to approximate the visual representation of any possible composition, including unseen ones.
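The disentangle-then-recombine step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the "lightweight MLP heads" are reduced to single nonlinear projections, primitives are averaged into per-state and per-object banks, and the composition operator is a simple sum — none of these specifics are taken from the paper, which does not publish this level of detail here.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative; CLIP features are much larger)

# Hypothetical disentanglement heads, reduced to single layers for the sketch.
W_state = rng.standard_normal((D, D)) * 0.1
W_obj = rng.standard_normal((D, D)) * 0.1

def disentangle(img_feat):
    """Split one image feature into a state component and an object component."""
    return np.tanh(img_feat @ W_state), np.tanh(img_feat @ W_obj)

# Seen images, each labelled with a (state, object) composition.
feats = rng.standard_normal((6, D))
states = ["wet", "wet", "dry", "dry", "wet", "dry"]
objects = ["cat", "dog", "cat", "dog", "dog", "cat"]

# Accumulate disentangled primitives per state / per object over seen images.
state_bank, obj_bank = {}, {}
for f, s, o in zip(feats, states, objects):
    fs, fo = disentangle(f)
    state_bank.setdefault(s, []).append(fs)
    obj_bank.setdefault(o, []).append(fo)
state_proto = {s: np.mean(v, axis=0) for s, v in state_bank.items()}
obj_proto = {o: np.mean(v, axis=0) for o, v in obj_bank.items()}

def visual_prototype(state, obj):
    """Counter-factual recombination: any (state, object) pair, seen or unseen."""
    # Linear combination; the paper's Limitations note this may miss complex
    # state-object interactions.
    return state_proto[state] + obj_proto[obj]

# A pairing never observed jointly in training still receives a prototype.
p = visual_prototype("wet", "cat")
```

Because the banks are indexed by primitive rather than by composition, the Cartesian product of known states and objects yields a prototype for every candidate pair, which is what lets unseen compositions participate in training.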
The core contribution lies in dynamic refinement of visual prototypes via lightweight graph aggregation performed on each training mini‑batch. For a given batch, a local graph is built whose nodes consist of (i) actual image features, (ii) the corresponding disentangled state and object features, and (iii) the current visual prototypes. Edges are weighted according to semantic consistency derived from the semantic prototypes, ensuring that the graph respects the underlying compositional structure. A single round of GCN‑style message passing updates the visual prototype embeddings by aggregating evidence from both real samples and counter‑factual compositions. Because the graph is constructed per mini‑batch, the method incurs negligible overhead compared with global graph approaches, and it naturally incorporates unlabeled unseen compositions into the training dynamics.
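The per-batch graph aggregation can be sketched roughly as below. Assumptions to flag: edge weights are modeled as a column-softmax over cosine similarities to the frozen semantic prototypes, and the "single round of GCN-style message passing" is a residual convex update with an illustrative mixing coefficient `alpha`; the paper's exact weighting scheme and node set (which also includes the disentangled primitive features) are richer than this toy version.

```python
import numpy as np

rng = np.random.default_rng(1)
D, B, C = 8, 4, 3  # feature dim, mini-batch size, number of compositions

img_feats = rng.standard_normal((B, D))   # batch image features (graph nodes)
vis_protos = rng.standard_normal((C, D))  # current visual prototypes (nodes)
sem_protos = rng.standard_normal((C, D))  # frozen semantic prototypes

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Edge weights from semantic consistency: cosine similarity between each image
# and each composition's semantic prototype, softmaxed over the batch so every
# prototype receives a normalized distribution of evidence.
sim = normalize(img_feats) @ normalize(sem_protos).T            # (B, C)
weights = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)  # column softmax

# One GCN-style message-passing round: each prototype aggregates weighted
# evidence from the batch images, then mixes it with its previous value.
alpha = 0.5  # illustrative mixing coefficient
agg = weights.T @ img_feats                                     # (C, D)
vis_protos_new = (1 - alpha) * vis_protos + alpha * agg
```

Since the graph has only `B + C` nodes and is rebuilt each step, the update is a couple of small matrix products per batch, which is the source of the "negligible overhead" claim relative to maintaining a global graph.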
Training optimizes two complementary losses: (a) a semantic‑visual alignment loss that maximizes cosine similarity between image features and their corresponding semantic prototypes (as in standard CLIP‑based methods), and (b) a contrastive refinement loss that pulls the updated visual prototypes toward their neighboring image features in the local graph. Only the prompt parameters and the visual prototypes are updated; the CLIP image encoder remains frozen, preserving the strong cross‑modal alignment learned during pre‑training.
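The two training losses can be approximated in a few lines. A hedged sketch, not the paper's exact formulation: both terms are written here as CLIP-style cross-entropy over temperature-scaled cosine similarities, with the refinement loss simplified to classifying images against the refined visual prototypes rather than an explicit graph-neighborhood pull; the temperature `tau` and the unit loss weighting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, B, C = 8, 4, 3
tau = 0.07  # temperature, as in CLIP-style contrastive objectives

img = rng.standard_normal((B, D))   # frozen-encoder image features
sem = rng.standard_normal((C, D))   # semantic prototypes (from soft prompts)
vis = rng.standard_normal((C, D))   # graph-refined visual prototypes
labels = np.array([0, 2, 1, 0])     # ground-truth composition per image

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy(logits, y):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

# (a) semantic-visual alignment: classify each image against all semantic
# prototypes via scaled cosine similarity (gradients would flow to the prompts).
loss_align = cross_entropy(normalize(img) @ normalize(sem).T / tau, labels)

# (b) contrastive refinement: each image should also score highest against the
# visual prototype of its own composition (gradients to the prototypes).
loss_refine = cross_entropy(normalize(img) @ normalize(vis).T / tau, labels)

loss = loss_align + loss_refine
```

Consistent with the summary, only the prompt parameters and the visual prototypes would receive gradients from these terms; the CLIP image encoder stays frozen.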
Extensive experiments on three widely used CZSL benchmarks—MIT‑States, UT‑Zappos, and CGQA—are conducted under both closed‑world (CW) and open‑world (OW) evaluation settings. Duplex consistently outperforms or matches state‑of‑the‑art methods such as CSP, DFSP, and GIPCOL. Notably, it achieves substantial reductions in measured “seen bias” (7–12% improvement in bias‑specific metrics) and demonstrates tighter clustering of visual prototypes in t‑SNE visualizations, confirming that the dynamic refinement indeed yields more discriminative visual representations.
The authors highlight three main contributions: (1) a clear diagnosis of the semantic‑projection bias and seen‑dominant optimization problems that persist even with powerful VLMs; (2) the design of a dual‑prototype architecture that keeps semantic anchors stable while actively refining image‑grounded visual prototypes through label‑conditioned local graphs; and (3) empirical validation across multiple datasets and settings, showing that incorporating unlabeled unseen compositions during training mitigates bias without sacrificing interpretability.
Limitations are acknowledged: the current visual prototype construction relies on linear recombination of disentangled state and object features, which may not capture complex interactions (e.g., “rusty metal”). Moreover, the effectiveness of the local graph depends on batch size; very small batches could diminish the refinement signal. Future work could explore non‑linear composition operators and batch‑independent graph mechanisms to further strengthen the approach.
In summary, Duplex advances CZSL by bridging the gap between semantic stability and visual discriminability. By maintaining interpretable textual anchors and injecting fine‑grained visual evidence through dynamic, locally aggregated graphs, it reduces the seen bias and improves generalization to unseen compositions, setting a new direction for prototype‑based compositional learning.