Mutually Causal Semantic Distillation Network for Zero-Shot Learning


Authors: Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You

International Journal of Computer Vision manuscript No. (will be inserted by the editor)

Shiming Chen 1,2 · Shuhuang Chen 1,2 · Guo-Sen Xie 3 · Xinge You 1,2
1 Huazhong University of Science and Technology, Wuhan, China
2 National Anti-Counterfeit Engineering Research Center, Wuhan, China
3 Nanjing University of Science and Technology, Nanjing, China
{gchenshiming,gsxiehm}@gmail.com, {shuhuangchen,youxg}@hust.edu.cn

Abstract. Zero-shot learning (ZSL) aims to recognize unseen classes in the open world, guided by side information (e.g., attributes). Its key task is to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply apply unidirectional attention in a weakly supervised manner and learn spurious and limited latent semantic representations, failing to effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve these challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute→visual causal attention sub-net that learns attribute-based visual features, and a visual→attribute causal attention sub-net that learns visual-based attribute features. The causal attentions encourage the two sub-nets to learn causal vision-attribute associations, yielding reliable features through causal visual/attribute learning. Guided by a semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process.
Extensive experiments on four widely-used benchmark datasets (i.e., CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over strong baselines, leading to new state-of-the-art performance.

Keywords: Open-World Visual Recognition · Zero-Shot Learning · Mutual Semantic Distillation · Attribute Localization

1 Introduction and Motivation

Visual recognition is a critical task in computer vision that has attracted significant attention in recent years due to its numerous real-world applications, e.g., autonomous driving (Hu et al, 2023), smart healthcare (Bai et al, 2021), and intelligent surveillance (Collins et al, 2000). Over the past decade, deep learning techniques and large-scale datasets have contributed to remarkable advances in the performance of visual recognition systems (He et al, 2016; Wang et al, 2020b,a). However, existing visual models are often limited by their closed-world assumption, under which all possible classes are known in advance and all data are given at once. Such assumptions do not hold in practical scenarios, where novel or previously unseen classes can arise, and data may arrive continually or remain decentralized due to data privacy concerns. For example, an autonomous vehicle encounters new traffic patterns, a medical AI system detects new diseases, or an intelligent surveillance system encounters a criminal wearing a new type of disguise or clothing. Based on prior knowledge of seen classes, humans have a remarkable ability to recognize new concepts (classes) using shared and distinct attributes of both seen and unseen classes (Lampert et al, 2009). Analogously, zero-shot learning (ZSL) was proposed as a challenging image classification setting that mimics this human cognitive process (Larochelle et al, 2008; Palatucci et al, 2009).
ZSL aims to tackle the unseen-class recognition problem by transferring knowledge from seen classes to unseen ones. It is usually based on the assumption that both seen and unseen classes can be described through shared semantic descriptions (e.g., attributes) (Lampert et al, 2014). According to the classes classified at test time, ZSL methods can be grouped into conventional ZSL (CZSL) and generalized ZSL (GZSL) (Xian et al, 2017): CZSL aims to predict only unseen classes, while GZSL predicts both seen and unseen ones.

Fig. 1: Four investigated ZSL paradigms. (a) Embedding-based method. (b) Generative method. (c) Common space learning method. (d) Our proposed mutually causal semantic distillation network (MSDN++). The semantic space S is represented by the class semantic vectors annotated by humans based on attribute descriptions. The visual space V is learned by a network backbone (e.g., ResNet101 (He et al, 2016)). The common space O is a shared latent space between visual mapping and semantic mapping. The attribute space A is learned by a language model (e.g., GloVe (Pennington et al, 2014)). Filled triangles, circles, squares, and diamonds denote sample features in S, V, O, and A, respectively.

Since GZSL is more realistic and challenging (Pourpanah et al, 2020), some ZSL methods focus only on the GZSL setting (Liu et al, 2018; Han et al, 2020, 2021; Chen et al, 2021b; Huynh and Elhamifar, 2020a). In this work, we thoroughly evaluate our method in both settings. In ZSL, an unseen sample shares different partial information with a set of samples of seen classes, and this partial information is represented by the abundant semantic knowledge of attributes (e.g.
, “bill color yellow” and “leg color red” on the CUB dataset) (Wang et al, 2021b; Chen et al, 2022a, 2023a, 2024c). To this end, the key challenge of ZSL is how to infer the latent semantic knowledge between visual and attribute features on seen classes, allowing desirable knowledge transfer to unseen classes for effective visual-semantic matching. Targeting this goal, attention-based ZSL methods (Xie et al, 2019, 2020; Zhu et al, 2019; Xu et al, 2020; Liu et al, 2021; Chen et al, 2024b, 2022a, 2023a, 2025) use attention mechanisms to discover discriminative part/fine-grained visual features, which match the semantic representations more accurately and enable significant ZSL performance. Unfortunately, i) they simply utilize unidirectional attention, which focuses only on limited semantic alignments between visual and attribute features without any further sequential learning; and ii) they learn attention with attention modules supervised only by classical loss functions, which explicitly supervise the final prediction but ignore the causality between the prediction and the attention. As such, prior attention-based methods can discover only spurious and limited semantic representations between visual and attribute features, resulting in undesirable semantic knowledge transfer in ZSL.

In light of the above observations, we propose a mutually causal semantic distillation network (termed MSDN++) to explore the intrinsic semantic knowledge between visual and attribute features for advancing ZSL, as shown in Fig. 1(d). The core idea is to evaluate the quality of the attentions by comparing the effects of the facts (i.e., the learned attentions) and the interventions (i.e., uncorrected attentions) on the final prediction. We then maximize the difference between the two attentions (i.e.
, the effect in causal inference (Pearl, 2009; Pearl et al, 2016; Li et al, 2024)) to encourage the network to learn more effective visual attentions and to reduce the effects of training-data bias. Meanwhile, this causal mechanism is incorporated into the mutually semantic distillation network (i.e., MSDN (Chen et al, 2022b)), enabling the network to discover intrinsic and more sufficient semantic knowledge for feature representation.

Specifically, MSDN++ consists of an attribute→visual causal attention sub-net, which learns attribute-based visual features via attribute-based and causal visual learning, and a visual→attribute causal attention sub-net, which learns visual-based attribute features via visual-based and causal attribute learning. The causal attentions encourage the two sub-nets to learn causal vision-attribute associations for representing reliable features. Meanwhile, the two mutual attention sub-nets act as a teacher-student pair, guiding each other to learn collaboratively and teaching each other throughout the training process. As such, MSDN++ can explore the best-matched attribute-based visual features and visual-based attribute features, effectively distilling the intrinsic semantic representations for desirable knowledge transfer from seen to unseen classes. Each causal attention sub-net is optimized with an attribute-based cross-entropy loss with self-calibration (Zhu et al, 2019; Huynh and Elhamifar, 2020a; Xu et al, 2020; Liu et al, 2021; Chen et al, 2022a), an attribute regression loss, and a causal loss. To enable mutual learning between the two sub-nets, we further introduce a semantic distillation loss that aligns their class posterior probabilities. Extensive experiments demonstrate the superiority of MSDN++ over existing state-of-the-art methods.
A preliminary version of this work was presented as a CVPR 2022 conference paper (termed MSDN (Chen et al, 2022b)). In this version, we strengthen the work in three aspects: i) We further analyze the limitations of existing attention-based ZSL methods, which simply utilize unidirectional attention in a weakly supervised manner and thus learn spurious and limited latent semantic representations, failing to effectively discover the intrinsic and sufficient semantic knowledge between visual and attribute features. ii) We propose two novel sub-nets that equip MSDN with causal attentions, enabling the model to learn causal vision-attribute associations for representing reliable features. iii) We conduct substantially more experiments to demonstrate the effectiveness of the proposed methods and verify the contribution of each component.

The main contributions of this paper are summarized as follows:

– We propose a mutually causal semantic distillation network, termed MSDN++, which distills intrinsic and sufficient semantic representations for effective knowledge transfer in ZSL.
– We deploy causal visual learning and causal attribute learning to encourage MSDN++ to learn causal vision-attribute associations for representing reliable features with good generalization.
– We introduce a semantic distillation loss to enable mutual learning between the attribute→visual causal attention sub-net and the visual→attribute causal attention sub-net in MSDN++, encouraging them to learn attribute-based visual features and visual-based attribute features by distilling the intrinsic semantic knowledge for semantic embedding representations.
– We conduct extensive experiments to demonstrate that our MSDN++ achieves significant performance gains over strong ZSL baselines on four benchmarks, i.e.
, CUB (Welinder et al, 2010), SUN (Patterson and Hays, 2012), AWA2 (Xian et al, 2017), and FLO (Nilsback and Zisserman, 2008).

The rest of this paper is organized as follows. Section 2 discusses related work. The proposed MSDN++ is described in Section 3. Experimental results and discussions are provided in Section 4. Finally, we draw conclusions in Section 5.

2 Related Work

2.1 Zero-Shot Learning

ZSL (Song et al, 2018; Li et al, 2018; Xian et al, 2018, 2019; Yu et al, 2020; Min et al, 2020; Han et al, 2021; Chen et al, 2021b; Chou et al, 2021) transfers semantic knowledge from seen classes to unseen ones by learning a mapping between the visual and attribute/semantic domains. Several ZSL approaches target this goal: embedding-based methods, generative methods, and common space learning-based methods. As shown in Fig. 1(a), embedding-based methods learn a visual→semantic mapping for visual-semantic interaction by mapping visual features into the semantic space (Romera-Paredes and Torr, 2015; Akata et al, 2016; Chen et al, 2018; Xie et al, 2019; Xu et al, 2020). Generative ZSL methods learn a semantic→visual mapping to generate visual features of unseen classes (Arora et al, 2018; Schönfeld et al, 2019; Xian et al, 2018; Li et al, 2019a; Yu et al, 2020; Shen et al, 2020; Vyas et al, 2020; Narayan et al, 2020; Chen et al, 2021b, 2023b) for data augmentation, as shown in Fig. 1(b). Generative ZSL methods are usually based on variational autoencoders (VAEs) (Arora et al, 2018; Schönfeld et al, 2019), generative adversarial nets (GANs) (Xian et al, 2018; Li et al, 2019b; Yu et al, 2020; Keshari et al, 2020; Vyas et al, 2020; Chen et al, 2021b, 2025), or generative flows (Shen et al, 2020). As shown in Fig.
1(c), common space learning is also employed to learn a shared representation space for interaction between the visual and semantic domains (Frome et al, 2013; Tsai et al, 2017; Schönfeld et al, 2019; Chen et al, 2021c). However, these methods still usually yield relatively sub-optimal results, since they cannot capture the subtle differences between seen and unseen classes. Accordingly, attention-based ZSL methods (Xie et al, 2019, 2020; Zhu et al, 2019; Xu et al, 2020; Liu et al, 2021; Chen et al, 2024b, 2022a) utilize attribute descriptions as guidance to discover more discriminative fine-grained features. However, they simply apply unidirectional attention in a weakly supervised manner, which focuses only on spurious and limited semantic alignments between visual and attribute features without any further sequential learning. As such, existing methods fail to explore the intrinsic and sufficient semantic representations between visual and attribute features for semantic knowledge transfer in ZSL. To tackle this challenge, we introduce a novel ZSL method based on mutually semantic distillation learning and a causal attention mechanism.

2.2 Knowledge Distillation

Knowledge distillation was proposed to compress knowledge from a large teacher network into a small student network (Hinton et al, 2015). Recently, knowledge distillation has been extended to optimize small deep networks starting from a powerful teacher network (Romero et al, 2015; Parisotto et al, 2016). By mimicking the teacher's class probabilities and/or feature representations, distilled models convey additional information beyond the conventional supervised learning target (Zhang et al, 2018; Zhai et al, 2020). Motivated by these works, we design a mutually semantic distillation network that learns intrinsic semantic knowledge by semantic distillation.
The mutually semantic distillation network consists of attribute→visual and visual→attribute attention sub-nets, which act as a teacher-student pair to learn collaboratively and teach each other.

Fig. 2: The pipeline of MSDN++. MSDN++ consists of an attribute→visual causal attention sub-net (AVCA) and a visual→attribute causal attention sub-net (VACA). AVCA learns the attribute-based visual features F with attribute-based visual learning and causal visual learning, while VACA discovers the visual-based attribute features S with visual-based attribute learning and causal attribute learning. Two mapping functions M_1 and M_2 then map the visual features and attribute features into the semantic space as semantic representations ψ(x) and Ψ(x), respectively. A semantic distillation loss L_distill matches the probability estimates of the two sub-nets (i.e., p_1 and p_2) for semantic distillation, enabling MSDN++ to learn intrinsic semantic knowledge. During inference, we fuse the predictions of the two sub-nets to make full use of their complementary semantic representations. Panel (a) illustrates the visual-based causal effect and panel (b) the attribute-based causal effect.

2.3 Causal Inference in Vision

Causality typically involves two kinds of inference, i.e.
, intervention inference and counterfactual inference (Pearl, 2009; Pearl et al, 2016), and it has been successfully applied in several areas of artificial intelligence, such as explainable machine learning (Lv et al, 2022), natural language processing (Wood-Doughty et al, 2018), reinforcement learning (Kallus and Zhou, 2018), and adversarial learning (Zhang et al, 2021). Since causal representations can alleviate the effects of dataset bias, causal inference has served as an effective tool in vision tasks (Yue et al, 2020; Rao et al, 2021; Tang et al, 2020; Chen et al, 2021a; Wang et al, 2021a; Yang et al, 2021; Li et al, 2023a). For example, Chen et al (2021a) employed a counterfactual analysis method to alleviate over-dependence on environment bias and highlight the trajectory clues themselves. Wang et al (2021a) introduced causal intervention into the visual model to learn causal features that are robust in any confounding context. Li et al (2023a) and Yue et al (2020) brought causal inference into few-shot learning and obtained significant performance. In ZSL, there are clear projection domain shifts between seen and unseen classes (Fu et al, 2015; Chen et al, 2023b); e.g., the attribute "has tail" looks different for horse (seen class) and pig (unseen class), which inevitably leads the attention model to learn spurious associations between visual and attribute features. Accordingly, we deploy causal attention via intervention inference to enable the ZSL model to learn causal features that are robust to visual bias, thereby improving the generalization of ZSL.

3 Mutually Causal Semantic Distillation Network

Notation. Assume that the training data D^s = {(x^s_i, y^s_i)} covers C^s seen classes, where x^s_i ∈ X^s denotes visual sample i, and y^s_i ∈ Y^s is the corresponding class label.
Another set of unseen classes D^u = {(x^u_i, y^u_i)} covers C^u classes, where x^u_i ∈ X^u are the unseen-class samples and y^u_i ∈ Y^u are the corresponding labels. A set of class semantic vectors/prototypes (semantic values annotated by humans according to the attributes) for each class c ∈ C^s ∪ C^u = C with K attributes, z^c = [z^c_1, ..., z^c_K]^T = ϕ(y) (where ϕ(·) is a mapping function bridging z^c and y), supports knowledge transfer from seen classes to unseen ones. In addition, the semantic attribute vectors A = {a_1, ..., a_K}, one per attribute, are learned by GloVe (Pennington et al, 2014); they essentially serve as attribute features.

Overview. As illustrated in Fig. 2, our MSDN++ includes an attribute→visual causal attention sub-net (AVCA) and a visual→attribute causal attention sub-net (VACA). Under the constraints of an attribute-based cross-entropy loss with self-calibration, an attribute regression loss, and a causal loss, AVCA learns attribute-based causal visual features via causal visual learning, while VACA learns visual-based causal attribute representations via causal attribute learning. A semantic distillation loss encourages the two mutual causal attention sub-nets to learn collaboratively and teach each other throughout the training process. During inference, we fuse the predictions of AVCA and VACA to exploit the complementary semantic knowledge of the two sub-nets.

3.1 Attribute→Visual Causal Attention Sub-net

Existing methods demonstrate that learning fine-grained features for attribute localization is important in ZSL (Xie et al, 2019, 2020; Zhu et al, 2019; Xu et al, 2020). As the first component of MSDN++, AVCA localizes the image regions most relevant to each attribute to extract attribute-based visual features from a given image.
AVCA includes two streams, i.e., attribute-based visual learning and causal visual learning. The attribute-based visual learning stream learns attribute-based visual features, which are further strengthened with causality by the causal visual learning stream.

3.1.1 Attribute-Based Visual Learning

Attribute-based visual learning involves two inputs: a set of visual features of the image V = {v_1, ..., v_R} (where R is the number of regions), such that each visual feature encodes one region of the image, and the set of semantic attribute vectors A = {a_1, ..., a_K}. AVCA attends to image regions with respect to each attribute and compares each attribute to the corresponding attended visual region features to determine the importance of each attribute. For the k-th attribute, its attention weight on the r-th region of an image is defined as:

\beta_k^r = \frac{\exp\left(a_k^\top W_1 v_r\right)}{\sum_{r'=1}^{R} \exp\left(a_k^\top W_1 v_{r'}\right)},   (1)

where W_1 is a learnable matrix that measures the similarity between each semantic attribute vector and the visual feature of each region. As such, we obtain a set of attention weights {β^r_k}_{r=1}^R. AVCA then learns the attribute-based visual feature for each attribute from these attention weights. For example, the k-th attribute-based visual feature F_k, relevant to the k-th attribute a_k, is formulated as:

F_k = \sum_{r=1}^{R} \beta_k^r v_r.   (2)

Intuitively, F_k captures the visual evidence for localizing the corresponding semantic attribute in the image. If an image exhibits an obvious attribute a_k, the model assigns a high positive score to the k-th attribute; otherwise, it assigns a low score. Thus, we obtain a set of attribute-based visual features F = {F_1, F_2, ..., F_K}. After obtaining the attribute-based visual features, AVCA further maps them into the semantic embedding space using a mapping function M_1.
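The attention computation of Eqs. (1)-(2) above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the shapes (49 regions from a 7x7 feature map, 312 CUB attributes, 2048-d ResNet101 features, 300-d GloVe vectors) are plausible assumptions for this setting.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attribute_visual_attention(V, A, W1):
    """Sketch of Eqs. (1)-(2): attend to image regions per attribute.

    V:  (R, dv) region features v_r
    A:  (K, da) semantic attribute vectors a_k
    W1: (da, dv) learnable bilinear matrix
    Returns attention beta (K, R) and attribute-based visual features F (K, dv).
    """
    scores = A @ W1 @ V.T            # scores[k, r] = a_k^T W1 v_r
    beta = softmax(scores, axis=1)   # Eq. (1): normalize over regions r
    F = beta @ V                     # Eq. (2): F_k = sum_r beta_k^r v_r
    return beta, F

# Toy demo with assumed shapes: R=49 regions, K=312 attributes, dv=2048, da=300.
rng = np.random.default_rng(0)
R, K, dv, da = 49, 312, 2048, 300
beta, F = attribute_visual_attention(
    rng.normal(size=(R, dv)),
    rng.normal(size=(K, da)),
    0.01 * rng.normal(size=(da, dv)))
```

Each row of `beta` sums to one, so `F_k` is a convex combination of region features, i.e., a soft localization of attribute k.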
To make the mapping more accurate, the semantic attribute vectors A = {a_1, a_2, ..., a_K} serve as support. Specifically, M_1 matches the attribute-based visual feature F_k with its corresponding semantic attribute vector a_k, formulated as:

\psi_k = M_1(F_k) = a_k^\top W_2 F_k,   (3)

where W_2 is an embedding matrix that maps F into the semantic space. Essentially, ψ_k is an attribute score representing the confidence that the k-th attribute is present in a given image. Finally, AVCA obtains a mapped semantic embedding ψ(x) = {ψ_1, ψ_2, ..., ψ_K} for each image. Accordingly, the final prediction can be formulated as:

p_1 = \{\psi(x_i) \times z_1, \dots, \psi(x_i) \times z_C\}.   (4)

Notably, p_1 is the observational prediction of sample x_i in AVCA.

3.1.2 Causal Visual Learning

There are clear projection domain shifts between seen and unseen classes in ZSL (Fu et al, 2015; Chen et al, 2023b), which inevitably cause the attention model to learn spurious associations between the visual and attribute features of seen and unseen classes. Accordingly, we deploy causal visual learning to strengthen the attribute-based visual features with causality, enabling AVCA to learn attribute-based causal visual features.

Visual Causal Graph. We reformulate attribute-based visual learning with a visual causal graph G_v = {V_v, E_v}. Each variable in the graph has a corresponding node in V_v, and the causal links E_v describe how these variables interact. As shown in Fig. 2(a), the nodes V_v of G_v are the visual features X, the learned attention maps β, and the final prediction C. (X, β) → C denotes that the visual features and attention maps jointly determine the final prediction, where we call node X the causal parent of β, and C the causal child of X and β.

Visual-Based Causal Effect.
Attribute-based visual learning optimizes the attention by supervising only the final prediction C, ignoring how the learned attention maps affect that prediction. In contrast, causal inference (Frappier, 2018) is a good tool for analyzing the causalities between the variables of the model. Inspired by this, we adopt causality to measure the quality of the learned attention, and then improve the model by encouraging the network to produce more influential attention maps. Using the visual causal graph, we can analyze causalities by directly manipulating the values of the attention and observing the effect. This operation is formally called causal intervention, denoted do(·).

Specifically, we take do(β = β̄) in G_v to conduct the causal intervention, e.g., with random attention. Notably, we choose random attention as the counterfactual baseline because it represents a null intervention with minimal structural assumptions. Unlike "learned null attention" (which may still embed dataset biases) or "marginal attention" (which requires estimating a prior distribution from data), random attention is provably independent of both the visual inputs and the model parameters. This ensures that the intervention do(β = β̄) truly severs the causal link between visual features and attention, fulfilling the do-operator's requirement of an exogenous manipulation, and provides a clean, unbiased baseline for measuring the true causal effect of the learned attention. Concretely, we replace the variable β with the values β̄ by cutting the link X → β, so that β̄ has no causal association with the visual features X. Following attribute-based visual learning, we then obtain the causal visual features F̄ = {F̄_1, F̄_2, ..., F̄_K} and the causal semantic embedding ψ̄(x) = {ψ̄_1, ψ̄_2, ..., ψ̄_K} according to Eq. 2 and Eq. 3, respectively.
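The intervention do(β = β̄) described above can be sketched as follows: draw an input-independent random attention (uniform noise passed through a softmax) and push it through the same aggregation and mapping pipeline of Eqs. (2)-(3). This is a hypothetical helper for illustration; the shapes and the uniform-noise source are assumptions consistent with the text.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intervene_random_attention(V, A, W2, rng):
    """Sketch of do(beta = beta_bar): replace learned attention with
    random attention that is independent of the input, cutting X -> beta.

    V: (R, dv) regions, A: (K, da) attribute vectors, W2: (da, dv).
    Returns the counterfactual semantic embedding psi_bar (K,).
    """
    K, R = A.shape[0], V.shape[0]
    beta_bar = softmax(rng.uniform(size=(K, R)), axis=1)  # exogenous attention
    F_bar = beta_bar @ V                                  # Eq. (2) with beta_bar
    # Eq. (3): psi_bar_k = a_k^T W2 F_bar_k
    psi_bar = np.einsum('kd,de,ke->k', A, W2, F_bar)
    return psi_bar

# Toy demo: 3 attributes, 4 regions, small feature dims.
rng = np.random.default_rng(0)
psi_bar = intervene_random_attention(
    rng.normal(size=(4, 6)), rng.normal(size=(3, 5)),
    rng.normal(size=(5, 6)), rng)
```

Because `beta_bar` never sees `V`, `A`, or any learned parameter, any prediction quality it retains comes from the non-attention path alone, which is exactly what the causal effect isolates.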
The class predictions after the intervention do(β = β̄) can then be obtained by:

P(do(\beta = \bar\beta), X = x_i) = \{\bar\psi(x_i) \times z_1, \dots, \bar\psi(x_i) \times z_C\}.   (5)

Accordingly, the actual effect of the learned attention on the prediction can be represented as the difference between the observational prediction p_1 = P(β, X = x_i) and its causal-intervention counterpart P(do(β = β̄), X = x_i):

P^v_{effect}(x_i) = P(\beta, X = x_i) - P(do(\beta = \bar\beta), X = x_i).   (6)

Because β̄ is drawn from a uniform distribution (post-softmax), it represents a maximally entropic, non-informative attention over regions. Thus, the causal effect captures how much the learned attention improves the prediction compared to a completely uninformative one. We also experimented with other counterfactual distributions (e.g., uniform attention, reversed attention) in Table 4; the results show that our model is robust to the choice of the β̄ distribution: as long as it is independent of the input, the causal learning signal remains effective. Therefore, we can use P^v_{effect}(x_i) to evaluate the quality of the learned visual attention and improve the causality of the attribute-based visual features.

3.2 Visual→Attribute Causal Attention Sub-net

VACA includes a visual-based attribute learning stream that learns visual-based attribute representations, and a causal attribute learning stream that strengthens the attribute features with causality. The visual-based attribute representations are complementary to the attribute-based visual features, enabling the two to calibrate each other. As such, MSDN++ can discover the intrinsic semantic representations between visual and attribute features.

3.2.1 Visual-Based Attribute Learning

Visual-based attribute learning first attends to the semantic attributes with respect to each image region.
Formally, the attention weight of the r-th region on the k-th attribute is defined as:

\gamma_r^k = \frac{\exp\left(v_r^\top W_3 a_k\right)}{\sum_{k'=1}^{K} \exp\left(v_r^\top W_3 a_{k'}\right)},   (7)

where W_3 is a learnable matrix that measures the similarity between each semantic attribute vector and each visual region feature. Accordingly, VACA obtains a set of attention weights {γ^k_r}_{k=1}^K, which is used to extract the visual-based attribute features:

S_r = \sum_{k=1}^{K} \gamma_r^k a_k.   (8)

Essentially, S_r is the visual-based attribute representation, which is aligned with F_k. VACA further employs a mapping function M_2 to map these visual-based attribute features S = {S_1, S_2, ..., S_R} into the semantic space:

\hat\Psi_r = M_2(S_r) = v_r^\top W_4 S_r,   (9)

where W_4 is an embedding matrix. Given a set V = {v_1, ..., v_R}, VACA obtains the mapped semantic embedding Ψ̂(x) = {Ψ̂_1, Ψ̂_2, ..., Ψ̂_R} for the attributes of one image. Since the learned semantic embedding Ψ̂(x_i) is R-dimensional, it is further mapped into the K-dimensional semantic attribute space to match the dimension of the class semantic vector: Ψ(x_i) = Ψ̂(x_i) × Att = Ψ̂(x_i) × (V^⊤ W_att A), where W_att is a learnable matrix.

Finally, we obtain the final prediction from the mapped semantic vector and the class semantic vectors:

p_2 = \{\Psi(x_i) \times z_1, \dots, \Psi(x_i) \times z_C\}.   (10)

Essentially, p_2 is the observational prediction of sample x_i in VACA.

3.2.2 Causal Attribute Learning

Similar to AVCA, VACA further improves the causality of the visual-based attribute features via causal attribute learning.

Attribute Causal Graph. VACA first formulates an attribute causal graph G_a = {V_a, E_a}. The graph consists of the corresponding nodes in V_a, with the causal links E_a capturing the causal relationships between them. As shown in Fig.
2(b), the nodes of G_a are the visual features X, the learned attribute attention maps γ, and the final prediction C. (X, γ) → C denotes that the visual features and attribute attention maps jointly determine the final prediction, where we call node X the causal parent of γ, and C the causal child of X and γ.

Attribute-Based Causal Effect. Based on the attribute causal graph, we can analyze causalities by directly manipulating the values of the attribute attention and observing the effect. Specifically, do(γ = γ̄) in G_a conducts the causal intervention, e.g., with random attention. This means we replace the variable γ with the values γ̄ by cutting the link X → γ, so that γ̄ has no causal association with the visual features X. Following visual-based attribute learning, we obtain the causal attribute features S̄ = {S̄_1, S̄_2, ..., S̄_R}, the causal mapped embedding Ψ̄(x) = {Ψ̄_1, Ψ̄_2, ..., Ψ̄_K}, and the final prediction p̄_2 according to Eq. 8, Eq. 9, and Eq. 10, respectively. The class predictions under the causal intervention do(γ = γ̄) are then:

P(do(\gamma = \bar\gamma), X = x_i) = \bar p_2 = \{\bar\Psi(x_i) \times z_1, \dots, \bar\Psi(x_i) \times z_C\}.   (11)

Accordingly, the actual effect of the learned attention on the prediction is represented by the difference between the observational prediction P(γ, X = x_i) and its causal-intervention counterpart P(do(γ = γ̄), X = x_i):

P^a_{effect}(x_i) = P(\gamma, X = x_i) - P(do(\gamma = \bar\gamma), X = x_i).   (12)

To this end, the effectiveness of the attribute attention can be interpreted as how much the attention improves the final prediction compared to the intervened one. P^a_{effect}(x_i) can thus be used to evaluate the quality of the learned attribute attention, making the attribute features more reliable.
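The visual→attribute stream and its causal effect (Eqs. 7-12) mirror the visual side; a minimal NumPy sketch under assumed toy shapes (not the authors' implementation):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_attribute_attention(V, A, W3):
    """Sketch of Eqs. (7)-(8): attend to attributes per image region.

    V: (R, dv) region features, A: (K, da) attribute vectors, W3: (dv, da).
    Returns gamma (R, K) and visual-based attribute features S (R, da).
    """
    scores = V @ W3 @ A.T             # scores[r, k] = v_r^T W3 a_k
    gamma = softmax(scores, axis=1)   # Eq. (7): normalize over attributes k
    S = gamma @ A                     # Eq. (8): S_r = sum_k gamma_r^k a_k
    return gamma, S

def causal_effect(p_obs, p_do):
    """Eqs. (6)/(12): effect = factual class scores minus intervened ones."""
    return p_obs - p_do

# Toy demo: 4 regions, 3 attributes, small feature dims.
rng = np.random.default_rng(0)
gamma, S = visual_attribute_attention(
    rng.normal(size=(4, 6)), rng.normal(size=(3, 5)), rng.normal(size=(6, 5)))
effect = causal_effect(np.array([2.0, 1.0]), np.array([0.5, 0.5]))
```

The symmetry with the visual side is deliberate: the same effect-difference quantity is computed for both sub-nets and later supervised by the causal loss.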
3.3 Model Optimization

To optimize MSDN++, each attention sub-net is trained with an attribute-based cross-entropy loss with self-calibration, an attribute regression loss, and a causal loss. To encourage mutual learning between the two attention sub-nets, we further deploy a semantic distillation loss that aligns their class posterior probabilities with each other.

Attribute-Based Cross-Entropy Loss. Since the associated image and attribute embeddings are projected near their class semantic vector $z_c$ when an attribute is visually present in an image, we adopt the attribute-based cross-entropy loss with self-calibration (Zhu et al, 2019; Huynh and Elhamifar, 2020a; Xu et al, 2020) (denoted as $\mathcal{L}_{ACEC}$) to optimize MSDN++. $\mathcal{L}_{ACEC}$ encourages the image to have the highest compatibility score with its corresponding class semantic vector. Given a batch of $n_b$ training images $\{x^s_i\}_{i=1}^{n_b}$ with their corresponding class semantic vectors $z_c$, $\mathcal{L}_{ACEC}$ is defined as:

$\mathcal{L}_{ACEC} = -\frac{1}{n_b}\sum_{i=1}^{n_b}\Bigg[\log\frac{\exp(f(x_i)\times z_c)}{\sum_{\hat{c}\in\mathcal{C}^s}\exp(f(x_i)\times z_{\hat{c}})} - \lambda_{cal}\sum_{c'=1}^{C^u}\log\frac{\exp\big(f(x_i)\times z_{c'} + \mathbb{I}_{[c'\in\mathcal{C}^u]}\big)}{\sum_{\hat{c}\in\mathcal{C}}\exp\big(f(x_i)\times z_{\hat{c}} + \mathbb{I}_{[\hat{c}\in\mathcal{C}^u]}\big)}\Bigg], \qquad (13)$

where $f(x_i) = \psi(x_i)$ for the AVCA sub-net and $f(x_i) = \Psi(x_i)$ for the VACA sub-net, $\mathbb{I}_{[c\in\mathcal{C}^u]}$ is an indicator function (i.e., it is 1 when $c\in\mathcal{C}^u$ and -1 otherwise), and $\lambda_{cal}$ is a weight controlling the self-calibration term. Intuitively, $\mathcal{L}_{ACEC}$ encourages non-zero probabilities to be assigned to the unseen classes during training, so that MSDN++ produces a large probability for the true unseen class when given test unseen samples.
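A minimal NumPy sketch of Eq. 13 follows. The sign of the self-calibration term here follows the stated intent of the loss (rewarding probability mass on unseen classes); treat the exact convention as an assumption of this sketch:

```python
import numpy as np

def log_softmax(x, axis=-1):
    m = x - x.max(axis=axis, keepdims=True)
    return m - np.log(np.exp(m).sum(axis=axis, keepdims=True))

def acec_loss(scores, labels, unseen, lam_cal=0.1):
    """Sketch of Eq. 13. scores: (n, C) compatibilities f(x_i) x z_c over ALL
    classes; labels: seen-class index of each sample (indexing the seen
    columns); unseen: boolean mask (C,) marking unseen classes."""
    n, _ = scores.shape
    seen = ~unseen
    # First term: cross-entropy over the seen classes only.
    ce = -log_softmax(scores[:, seen], axis=-1)[np.arange(n), labels].mean()
    # Self-calibration: shift logits by the indicator (+1 unseen, -1 seen)
    # and encourage probability mass on every unseen class.
    shifted = scores + np.where(unseen, 1.0, -1.0)
    cal = -log_softmax(shifted, axis=-1)[:, unseen].sum(axis=-1).mean()
    return ce + lam_cal * cal
```

With `lam_cal = 0`, the loss reduces to a standard seen-class cross-entropy; increasing it trades classification sharpness for calibration toward the unseen classes.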
Attribute Regression Loss. We employ an attribute regression loss (Xu et al, 2022; Chen et al, 2022a) to optimize MSDN++, which further encourages $M_1$ and $M_2$ to map visual/attribute features into semantic spaces close to their corresponding class semantic vectors. Specifically, we treat visual-semantic mapping as a regression problem and minimize the mean square error between the embedded attribute score $f(x_i)$ and the corresponding ground-truth attribute score $z_c$ over a batch of $n_b$ images $\{x_i\}_{i=1}^{n_b}$:

$\mathcal{L}_{AR} = \frac{1}{n_b}\sum_{i=1}^{n_b}\|f(x_i) - z_c\|_2^2, \qquad (14)$

where $f(x_i) = \psi(x_i)$ and $f(x_i) = \Psi(x_i)$ in AVCA and VACA, respectively.

Fig. 3: Part of the samples from the four challenging datasets, including three fine-grained datasets (i.e., (a) CUB (Welinder et al, 2010), (b) SUN (Patterson and Hays, 2012), and (d) FLO (Nilsback and Zisserman, 2008)) and one coarse-grained dataset (i.e., (c) AWA2 (Xian et al, 2017)). Each image is sampled from a different class. The images of different classes in the fine-grained datasets are very similar and hard to distinguish, whereas the images in the coarse-grained dataset are easier to recognize. For example, CUB includes similar bird images, while AWA2 consists of various animal images.

Causal Loss. To enable MSDN++ to learn the intrinsic semantic knowledge for representing reliable features, we deploy a causal loss $\mathcal{L}_{causal}$ based on the causal effect in AVCA (Eq. 6) and VACA (Eq. 12). $\mathcal{L}_{causal}$ provides a supervision signal that explicitly guides attention learning with causal associations between visual and attribute representations, formulated as:

$\mathcal{L}_{causal} = \frac{1}{n_b}\sum_{i=1}^{n_b} CE(P_{effect}, y_i) = -\frac{1}{n_b}\sum_{i=1}^{n_b}\log\frac{\exp(f(x_i)\times z_c)}{\sum_{\hat{c}\in\mathcal{C}^s}\exp(f(x_i)\times z_{\hat{c}})} - \frac{1}{n_b}\sum_{i=1}^{n_b}\log\frac{\exp\big(\bar{f}(x_i)\times z_c\big)}{\sum_{\hat{c}\in\mathcal{C}^s}\exp\big(\bar{f}(x_i)\times z_{\hat{c}}\big)}, \qquad (15)$

where $P_{effect} = P^v_{effect}$ for AVCA and $P_{effect} = P^a_{effect}$ for VACA, and $CE$ is the cross-entropy. $f(x_i) = \psi(x_i)$ and $\bar{f}(x_i) = \bar{\psi}(x_i)$ in AVCA, while $f(x_i) = \Psi(x_i)$ and $\bar{f}(x_i) = \bar{\Psi}(x_i)$ in VACA.
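Eqs. 14-15 can be sketched directly from their definitions; this is a shape-level illustration, not the paper's implementation:

```python
import numpy as np

def log_softmax(x, axis=-1):
    m = x - x.max(axis=axis, keepdims=True)
    return m - np.log(np.exp(m).sum(axis=axis, keepdims=True))

def attribute_regression_loss(f_x, z_c):
    """Eq. 14: mean squared error between embedded attribute scores f(x_i)
    and the ground-truth class attribute vectors z_c (both (n, K))."""
    return float(np.mean(np.sum((f_x - z_c) ** 2, axis=-1)))

def causal_loss(scores_obs, scores_int, labels):
    """Eq. 15 sketch: cross-entropy over seen-class scores for both the
    observed prediction (from f) and the intervened one (from f_bar)."""
    n = scores_obs.shape[0]
    ce_obs = -log_softmax(scores_obs, axis=-1)[np.arange(n), labels].mean()
    ce_int = -log_softmax(scores_int, axis=-1)[np.arange(n), labels].mean()
    return float(ce_obs + ce_int)
```

`scores_obs` and `scores_int` correspond to the compatibilities $f(x_i)\times z_{\hat{c}}$ and $\bar{f}(x_i)\times z_{\hat{c}}$ over the seen classes.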
Semantic Distillation Loss. We further introduce a semantic distillation loss $\mathcal{L}_{distill}$, which enables the two mutual attention sub-nets to learn collaboratively and teach each other throughout the training process. $\mathcal{L}_{distill}$ consists of a Jensen-Shannon Divergence (JSD) term and an $\ell_2$ distance between the predictions of the two sub-nets (i.e., $p_1$ and $p_2$):

$\mathcal{L}_{distill} = \frac{1}{n_b}\sum_{i=1}^{n_b}\Big[\underbrace{\tfrac{1}{2}\big(D_{KL}(p_1(x_i)\,\|\,p_2(x_i)) + D_{KL}(p_2(x_i)\,\|\,p_1(x_i))\big)}_{JSD} + \underbrace{\|p_1(x_i) - p_2(x_i)\|_2^2}_{\ell_2}\Big], \qquad (16)$

where

$D_{KL}(p\,\|\,q) = \sum_{c=1}^{C^s} p_c \log\frac{p_c}{q_c}. \qquad (17)$

Overall Loss. Finally, the overall loss function of MSDN++ is defined as:

$\mathcal{L}_{total} = \mathcal{L}_{ACEC} + \lambda_{AR}\mathcal{L}_{AR} + \lambda_{causal}\mathcal{L}_{causal} + \lambda_{distill}\mathcal{L}_{distill}, \qquad (18)$

where $\lambda_{AR}$, $\lambda_{causal}$, and $\lambda_{distill}$ are weights controlling the attribute regression loss, the causal loss, and the semantic distillation loss, respectively.

3.4 Zero-Shot Prediction

After optimization, we first obtain the embedding features of a test instance $x_i$ in the semantic space w.r.t. the AVCA and VACA sub-nets, i.e., $\psi(x_i)$ and $\Psi(x_i)$. Since the semantic knowledge learned by the two sub-nets is complementary, we fuse their predictions using two combination coefficients $(\alpha_1, \alpha_2)$ to predict the test label of $x_i$ with an explicit calibration, formulated as:

$c^* = \arg\max_{c\in\mathcal{C}^u/\mathcal{C}}\big(\alpha_1\psi(x_i) + \alpha_2\Psi(x_i)\big)^\top\times z_c + \mathbb{I}_{[c\in\mathcal{C}^u]}. \qquad (19)$

Here, $\mathcal{C}^u/\mathcal{C}$ corresponds to the CZSL/GZSL setting.

4 Experiments

Datasets. To evaluate our method, we conduct experiments on four challenging benchmark datasets, i.e., CUB (Caltech-UCSD Birds 200) (Welinder et al, 2010), SUN (SUN Attribute) (Patterson and Hays, 2012), AWA2 (Animals with Attributes 2) (Xian et al, 2017), and FLO (Nilsback and Zisserman, 2008).
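Before turning to the experiments, the remaining objectives and the fused prediction rule (Eqs. 16, 18, and 19) can be sketched as follows; the default weights are the CUB values reported in Sec. 4, and the note on the JSD term reflects Eq. 16 exactly as written:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)

def distill_loss(p1, p2):
    """Eq. 16: symmetrized-KL term (the JSD term as written in the paper)
    plus an l2 distance between the two sub-nets' class posteriors."""
    sym_kl = 0.5 * (kl(p1, p2) + kl(p2, p1))
    l2 = np.sum((p1 - p2) ** 2, axis=-1)
    return float(np.mean(sym_kl + l2))

def total_loss(l_acec, l_ar, l_causal, l_distill,
               lam_ar=0.03, lam_causal=0.3, lam_distill=0.001):
    """Eq. 18, shown with the CUB loss weights (lam_cal enters L_ACEC)."""
    return l_acec + lam_ar * l_ar + lam_causal * l_causal + lam_distill * l_distill

def fused_predict(psi, Psi, Z, unseen_mask, a1=0.8, a2=0.2):
    """Eq. 19: fuse the two sub-nets' semantic embeddings and add the
    calibration indicator (+1 unseen, -1 seen) before the argmax."""
    scores = (a1 * psi + a2 * Psi) @ Z.T + np.where(unseen_mask, 1.0, -1.0)
    return int(np.argmax(scores))
```

When the two posteriors agree exactly, the distillation loss vanishes, which is the fixed point the mutual-learning scheme pushes the sub-nets toward.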
Table 1: Results (%) of the state-of-the-art CZSL and GZSL models on CUB, SUN, and AWA2, including generative methods, common-space-based methods, and embedding-based methods. The best and second-best results are marked in Red and Blue, respectively. The symbol "–" indicates no result. The symbol "*" denotes attention-based methods. The symbol "†" denotes ZSL methods based on large-scale vision-language models.

                                                -------- CUB --------   -------- SUN --------   -------- AWA2 -------
Methods                                         acc    U     S     H    acc    U     S     H    acc    U     S     H
Generative Methods
f-CLSWGAN (Xian et al, 2018)                    57.3  43.7  57.7  49.7  60.8  42.6  36.6  39.4  68.2  57.9  61.4  59.6
f-VAEGAN-D2 (Xian et al, 2019)                  61.0  48.4  60.1  53.6  64.7  45.1  38.0  41.3  71.1  57.6  70.6  63.5
Composer* (Huynh and Elhamifar, 2020b)          69.4  56.4  63.8  59.9  62.6  55.1  22.0  31.4  71.5  62.1  77.3  68.8
GCM-CF (Yue et al, 2021)                         –    61.0  59.7  60.3   –    47.9  37.8  42.2   –    60.4  75.1  67.0
FREE (Chen et al, 2021b)                         –    55.7  59.9  57.7   –    47.4  37.2  41.7   –    60.4  75.4  67.1
FREE+ESZSL (Cetin et al, 2022)                   –    51.6  60.4  55.7   –    48.2  36.5  41.5   –    51.3  78.0  61.8
VS-Boost (Li et al, 2023b)                       –    68.0  68.7  68.4   –    49.2  37.4  42.5   –     –     –     –
EGG (Cavazza et al, 2023)                        –    58.6  72.3  64.7   –     –     –     –     –    50.2  87.9  63.9
ViFR (Chen et al, 2025)                         69.1  57.8  62.7  60.1  65.6  48.8  35.2  40.9  73.7  58.4  81.4  68.0
Common Space Learning
DeViSE (Frome et al, 2013)                      52.0  23.8  53.0  32.8  56.5  16.9  27.4  20.9  54.2  17.1  74.7  27.8
DCN (Liu et al, 2018)                           56.2  28.4  60.7  38.7  61.8  25.5  37.0  30.2  65.2  25.5  84.2  39.1
CADA-VAE (Schönfeld et al, 2019)                59.8  51.6  53.5  52.4  61.7  47.2  35.7  40.6  63.0  55.8  75.0  63.9
SGAL (Yu and Lee, 2019)                          –    40.9  55.3  47.0   –    35.5  34.4  34.9   –    52.5  86.3  65.3
CLIP† (Radford et al, 2021)                      –    55.2  54.8  55.0   –     –     –     –     –     –     –     –
HSVA (Chen et al, 2021c)                        62.8  52.7  58.3  55.3  63.8  48.6  39.0  43.3   –    59.3  76.6  66.8
CoOp† (Zhou et al, 2022)                         –    49.2  63.8  55.6   –     –     –     –     –     –     –     –
CoOp+SHIP† (Wang et al, 2023)                    –    55.3  58.9  57.1   –     –     –     –     –     –     –     –
Embedding-based Methods
SP-AEN (Chen et al, 2018)                       55.4  34.7  70.6  46.6  59.2  24.9  38.6  30.3  58.5  23.3  90.9  37.1
SGMA* (NeurIPS'19) (Zhu et al, 2019)            71.0  36.7  71.3  48.5   –     –     –     –    68.8  37.6  87.1  52.5
AREN* (Xie et al, 2019)                         71.8  38.9  78.7  52.1  60.6  19.0  38.8  25.5  67.9  15.6  92.9  26.7
LFGAA* (Liu et al, 2019)                        67.6  36.2  80.9  50.0  61.5  18.5  40.0  25.3  68.1  27.0  93.4  41.9
DAZLE* (Huynh and Elhamifar, 2020a)             66.0  56.7  59.6  58.1  59.4  52.3  24.3  33.2  67.9  60.3  75.7  67.1
GNDAN* (Chen et al, 2024b)                      75.1  69.2  69.6  69.4  65.3  50.0  34.7  41.0  71.0  60.2  80.8  69.0
TransZero* (Chen et al, 2022a)                  76.8  69.3  68.3  68.8  65.6  52.6  33.4  40.8  70.1  61.3  82.3  70.2
APN* (Xu et al, 2022)                           75.0  67.4  71.6  69.4  61.5  40.2  35.2  37.5  69.9  61.9  79.4  69.6
ICIS (Christensen et al, 2023)                  60.6  45.8  73.7  56.5  51.8  45.2  25.6  32.7  64.6  35.6  93.3  51.6
COND+EGZSL (Chen et al, 2024a)                   –    45.2  55.2  49.6   –     –     –     –     –    59.2  80.7  68.3
EG-part-net (Chen et al, 2024c)                  –    64.8  66.1  65.4   –     –     –     –     –    64.7  78.7  71.0
MSDN* (Conference Version) (Chen et al, 2022b)  76.1  68.7  67.5  68.1  65.8  52.2  34.2  41.3  70.1  62.0  74.5  67.7
MSDN++* (Ours)                                  78.5  70.8  70.3  70.6  67.5  51.9  35.4  42.1  73.4  66.5  79.7  72.5

Among them, CUB, SUN, and FLO are fine-grained datasets, whereas AWA2 is a coarse-grained dataset. Some samples are presented in Fig. 3. Following (Xian et al, 2017), we use the same seen/unseen splits and class semantic embeddings. Specifically, CUB has 11,788 images of 200 bird classes (seen/unseen = 150/50) with 312 attributes. SUN consists of 14,340 images of 717 scene classes (seen/unseen = 645/72) with 102 attributes. AWA2 has 37,322 images of 50 animal classes (seen/unseen = 40/10) with 85 attributes. FLO has 8,189 images of 102 flower classes (seen/unseen = 82/20) with 1,024 attributes.

Evaluation Protocols. In the CZSL setting, we evaluate the top-1 accuracy on unseen classes, denoted as acc. In the GZSL setting, we evaluate the top-1 accuracies on both seen and unseen classes (i.e., S and U).
Furthermore, their harmonic mean, defined as H = (2 × S × U)/(S + U), is also used to evaluate performance in the GZSL setting.

Implementation Details. We take a ResNet101 (He et al, 2016) pre-trained on ImageNet as the network backbone to extract the feature map of each image, without fine-tuning. Input images are resized to 448×448 and normalized using ImageNet statistics. The feature maps are extracted from the last convolutional layer, resulting in a fixed spatial dimension of 14×14 (196 regions). This consistent spatial structure is maintained throughout training and intervention. We use the RMSProp optimizer with hyperparameters (momentum = 0.9, weight decay = 0.0001) to optimize our model. We set the learning rate and batch size to 0.0001 and 50, respectively. We take random attention as the causal intervention in the causal visual/attribute learning streams for all datasets. Specifically, we generate random attention weights independently for each sample. For the AVCA sub-net, random attention matrices of size [K, 49] are sampled from a Uniform(0, 1) distribution and then normalized via softmax over the spatial dimension. The same procedure is applied symmetrically to the VACA sub-net. These random attentions are used only in the forward pass for the causal loss computation and are detached from the gradient computation, ensuring they do not affect the learning of the actual attention modules. We empirically set the loss weights {λ_cal, λ_AR, λ_causal, λ_distill} to {0.05, 0.03, 0.3, 0.001}, {0.0001, 0.01, 0.0005, 0.05}, and {0.4, 0.06, 0.1, 0.01} for CUB, SUN, and AWA2, respectively. We set the combination coefficients (α1, α2) to (0.8, 0.2), (0.7, 0.3), and (0.8, 0.2) for CUB, SUN, and AWA2, respectively.
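Two small pieces of this setup can be sketched directly: the random-attention intervention and the GZSL harmonic mean. The region count is parameterized, since the text reports both 196-region (14×14) feature maps and [K, 49] intervention matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def random_attention(n_attributes, n_regions=49, rng=None):
    """Random-attention intervention: sample [K, n_regions] weights from
    Uniform(0, 1), then softmax-normalize over the spatial dimension."""
    if rng is None:
        rng = np.random.default_rng()
    return softmax(rng.uniform(0.0, 1.0, size=(n_attributes, n_regions)), axis=-1)

def harmonic_mean(seen_acc, unseen_acc):
    """GZSL harmonic mean H = 2*S*U / (S + U)."""
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)
```

For example, `harmonic_mean(70.3, 70.8)` reproduces the H ≈ 70.6 reported for MSDN++ on CUB in Table 1.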
Table 2: Results (%) of the state-of-the-art GZSL models on FLO.

Methods                                          U     S     H
f-CLSWGAN (Xian et al, 2018)                    59.3  74.2  65.9
f-VAEGAN-D2 (Xian et al, 2019)                  56.8  74.9  64.6
TF-VAEGAN (Narayan et al, 2020)                 62.5  84.1  71.7
EG-part-net (Chen et al, 2024c)                 61.5  81.4  70.1
MSDN (Conference Version) (Chen et al, 2022b)   62.2  81.0  70.3
MSDN++ (Ours)                                   69.2  80.7  74.5

4.1 Comparison with State-of-the-Arts

Our MSDN++ is an embedding-based method operating in an inductive manner. To demonstrate the effectiveness and advantages of MSDN++, we compare it with other state-of-the-art methods in both the CZSL and GZSL settings, including generative methods (e.g., f-CLSWGAN (Xian et al, 2018), f-VAEGAN (Xian et al, 2019), Composer (Huynh and Elhamifar, 2020b), E-PGN (Yu et al, 2020), TF-VAEGAN (Narayan et al, 2020), IZF (Shen et al, 2020), SDGZSL (Chen et al, 2021d), GCM-CF (Yue et al, 2021), FREE (Chen et al, 2021b), FREE+ESZSL (Cetin et al, 2022), VS-Boost (Li et al, 2023b), EGG (Cavazza et al, 2023), and ViFR (Chen et al, 2025)), common space learning methods (e.g., DeViSE (Frome et al, 2013), DCN (Liu et al, 2018), CADA-VAE (Schönfeld et al, 2019), SGAL (Yu and Lee, 2019), HSVA (Chen et al, 2021c), CLIP (Radford et al, 2021), CoOp (Zhou et al, 2022), and CoOp+SHIP (Wang et al, 2023)), and embedding-based methods (e.g., SP-AEN (Chen et al, 2018), SGMA (Zhu et al, 2019), AREN (Xie et al, 2019), LFGAA (Liu et al, 2019), DAZLE (Huynh and Elhamifar, 2020a), GNDAN (Chen et al, 2024b), TransZero (Chen et al, 2022a), APN (Xu et al, 2022), ICIS (Christensen et al, 2023), EG-part-net (Chen et al, 2024c), and COND+EGZSL (Chen et al, 2024a)). Notably, the ZSL methods based on large-scale vision-language models (Radford et al, 2021; Zhou et al, 2022; Wang et al, 2023) can intrinsically be grouped into common space learning methods.
Different from classical common space learning methods (Frome et al, 2013; Schönfeld et al, 2019; Chen et al, 2021c), which use small datasets to learn a joint space of visual and semantic representations with semantic attributes, the large-scale vision-language model based ZSL methods learn the joint space from large-scale data and label prompts.

Conventional Zero-Shot Learning. We first compare our MSDN++ with the state-of-the-art methods in the CZSL setting. Table 1 shows the CZSL results on the various datasets. Compared to all strong baselines, including embedding-based methods, generative methods, and common space learning methods, our MSDN++ achieves the best results of 78.5% and 67.5% on CUB and SUN, and the second-best result of 73.4% on AWA2. This indicates that MSDN++ discovers the intrinsic semantic knowledge for effective knowledge transfer from seen classes to unseen ones. Furthermore, MSDN++ achieves significant performance gains of 2.4%, 1.7%, and 3.3% on CUB, SUN, and AWA2, respectively, over MSDN (the conference version). We attribute this to the causal visual/attribute learning, which guides MSDN++ to learn causal vision-attribute associations for representing reliable features with good generalization.

Generalized Zero-Shot Learning. We further compare MSDN++ with the state-of-the-art methods in the GZSL setting. Results are shown in Table 1. We find: i) Most state-of-the-art methods achieve good results on seen classes but fail on unseen classes on CUB and AWA2. For example, AREN (Xie et al, 2019) and LFGAA (Liu et al, 2019) obtain unseen/seen accuracies of 38.9%/78.7% (15.6%/92.9%) and 36.2%/80.9% (27.0%/93.4%) on CUB (AWA2), respectively.
This is because they simply utilize unidirectional attention in a weakly supervised manner to learn spurious and limited latent semantic representations, which fails to effectively discover the intrinsic semantic knowledge for knowledge transfer from seen to unseen classes. In contrast, our MSDN++ employs a mutually causal semantic distillation network to learn intrinsic and more sufficient semantic knowledge, which enables the model to generalize well to unseen classes with high seen and unseen accuracies. Accordingly, MSDN++ obtains good harmonic-mean results, i.e., 70.6%, 42.1%, and 72.5% on CUB, SUN, and AWA2, respectively.

ii) Compared to the ZSL methods based on large-scale vision-language models (e.g., CLIP (Radford et al, 2021), CoOp (Zhou et al, 2022), CoOp+SHIP (Wang et al, 2023)), our MSDN++ obtains a significant performance gain of at least H = 13.5% on CUB. This demonstrates the superiority and potential of MSDN++ for ZSL, since the large-scale vision-language based ZSL methods are limited by domain-specific knowledge, which has a large bias with respect to the large-scale vision-text pairs.

iii) GCM-CF (Yue et al, 2021) is the first method that applied causal inference to the ZSL task. It introduces causal generation to guide generative methods to synthesize balanced visual features for unseen classes. In contrast, we design causal attention learning to enable embedding-based methods to learn intrinsic and more sufficient semantic knowledge for knowledge transfer from seen to unseen classes. Our MSDN++ achieves significant harmonic-mean improvements of 10.3% and 5.5% on CUB and AWA2, respectively.

iv) MSDN++ performs worse than the generative methods in the GZSL setting on SUN because each class contains only 16 training images, which heavily limits ZSL models. As such, data augmentation is very effective for improving performance on SUN, e.g., for VS-Boost (Li et al, 2023b) and HSVA (Chen et al, 2021c). Thus, most generative methods perform better than the embedding-based methods on SUN.

Additionally, we also conduct experiments on the FLO dataset (Nilsback and Zisserman, 2008); results are shown in Table 2. Our MSDN++ achieves the best result of H = 74.5% on FLO. Compared to our conference version (MSDN (Chen et al, 2022b)), MSDN++ obtains a performance gain in H of 4.2%. These results demonstrate the improvements of MSDN++ brought by causal attention.

Fig. 4: Visualization of attention maps learned by the first sub-nets of MSDN (Chen et al, 2022b) and MSDN++ on CUB. We show the top-10 attention maps attended by the models. The red boxes indicate that MSDN learns wrong attention maps that are irrelevant to the corresponding attributes.

Fig. 5: Visualization of attention maps for the two mutual attention sub-nets (i.e., MSDN++(AVCA) and MSDN++(VACA)). Results show that our AVCA and VACA sub-nets can overall learn accurate visual localizations, but they also produce a few failure cases.

Table 3: Ablation studies for different components of MSDN++. The baseline is the visual feature extracted from the CNN backbone with global average pooling, mapped into a semantic embedding for ZSL.
                              -------- CUB --------   -------- AWA2 -------
Method                        acc    U     S     H    acc    U     S     H
baseline                      57.4  44.2  55.2  49.1  54.8  30.3  30.7  30.5
baseline w/ L_AR              58.5  46.5  54.6  50.2  56.9  20.6  89.7  33.5
MSDN++(AVCA) w/o L_distill    76.2  68.7  69.1  68.9  71.9  65.5  76.8  70.7
MSDN++(VACA) w/o L_distill    68.4  57.5  65.6  61.3  68.0  56.5  74.6  64.3
MSDN++(AVCA) w/ L_distill     77.7  69.6  70.0  69.8  73.0  66.4  79.3  72.3
MSDN++(VACA) w/ L_distill     70.8  59.9  65.3  62.5  70.3  60.3  76.3  67.3
MSDN++ w/o AVCA(L_causal)     77.5  67.8  71.5  69.6  72.9  66.5  77.9  71.8
MSDN++ w/o VACA(L_causal)     77.9  70.6  70.2  70.4  73.3  66.9  78.4  72.2
MSDN++ w/o L_causal           77.0  69.6  69.2  69.4  72.7  66.1  78.4  71.7
MSDN++                        78.5  70.8  70.3  70.6  73.4  66.5  79.7  72.5

Fig. 6: The attention maps of various causal interventions, including (a) random attention, (b) uniform attention, and (c) reversed attention.

4.2 Ablation Studies

We conduct ablation studies to evaluate the effectiveness of the various components of MSDN++, including the AVCA attention sub-net (denoted as MSDN++(AVCA) w/o L_distill), the VACA attention sub-net (MSDN++(VACA) w/o L_distill), semantic distillation learning (i.e., MSDN++(VACA) w/ L_distill and MSDN++(AVCA) w/ L_distill), causal visual learning (MSDN++ w/o AVCA(L_causal)), causal attribute learning (MSDN++ w/o VACA(L_causal)), and causal learning (i.e., MSDN++ w/o L_causal). Results are shown in Table 3. Compared to the baseline, even a single attention sub-net without semantic distillation obtains significant performance gains. For example, MSDN++(AVCA) w/o L_distill achieves acc/H gains of 18.8%/19.8% and 17.1%/40.2% on CUB and AWA2, respectively; MSDN++(VACA) w/o L_distill achieves acc/H improvements of 11.0%/12.2% and 13.2%/33.8% on CUB and AWA2, respectively.
This is because MSDN++ discovers the intrinsic semantic knowledge, using causal learning together with visual-based attribute / attribute-based visual learning to refine the visual features for effective knowledge transfer. When MSDN++ adopts the semantic distillation loss to conduct collaborative learning for knowledge distillation, its results improve further, e.g., MSDN++(VACA) improves acc/H by 2.4%/1.2% and 2.3%/3.0% on CUB and AWA2, respectively. The causal attribute/visual learning consistently improves the performance of MSDN++ by guiding the model to learn causal vision-attribute associations for representing reliable features. Moreover, the two causal learning streams further improve MSDN++ cooperatively. We also find that the model significantly overfits to the seen classes without our mutual semantic distillation and causal learning. Our full model ensembles the complementary embeddings learned by the two mutual causal attention sub-nets to represent sufficient semantic knowledge, resulting in further performance gains for MSDN++.

Table 4: Effects of various causal interventions on CUB, i.e., (a) random attention, (b) uniform attention, (c) reversed attention, and (d) random attention + reversed attention.

Method                      acc    U     S     H
MSDN (Chen et al, 2022b)   76.1  68.7  67.5  68.1
MSDN++ w/ (a)              78.5  70.8  70.3  70.6
MSDN++ w/ (b)              78.4  70.8  70.0  70.4
MSDN++ w/ (c)              78.0  71.0  69.9  70.5
MSDN++ w/ (d)              78.5  71.8  69.5  70.6

4.3 Qualitative Results

Visualization of Attention Maps. To intuitively show the effectiveness of MSDN++ in learning more intrinsic semantic knowledge than MSDN (Chen et al, 2022b), we visualize the top-10 attention maps learned by the two methods. As shown in Fig. 4, although MSDN can correctly localize some attributes in the image for semantic knowledge representation, a few attributes are localized wrongly. For example, MSDN localizes the attribute "leg color brown" of the Acadian Flycatcher on the head in the image.
This is because MSDN learns wrong associations between visual and attribute features, representing spurious semantic knowledge. In contrast, our MSDN++ accurately localizes the important attributes for intrinsic semantic knowledge representation using causal attribute/visual learning, which encourages MSDN++ to discover the causal vision-attribute associations for representing reliable features. As shown in Fig. 5, the two sub-nets of MSDN++ similarly learn the most important semantic representations, which benefits from mutual learning for semantic distillation. Furthermore, the two attention sub-nets also learn complementary attribute-feature localizations for each other. As such, our MSDN++ achieves significant performance gains over MSDN. Although our AVCA and VACA sub-nets can overall learn accurate visual localizations, they also produce a few failure cases. For example, the attributes "Bill Color Orange" and "Leg Color Buff" are wrongly localized by AVCA and VACA, respectively. This may be caused by semantic ambiguity, i.e., the same attribute may be represented by various visual appearances. We believe this remains an open challenge for future work.

t-SNE Visualizations. As shown in Fig. 7, we also present the t-SNE visualizations (Maaten and Hinton, 2008) of the visual features of seen and unseen classes on CUB, learned by the baseline, MSDN (Chen et al, 2022b), and MSDN++. Compared to the baseline, our models learn intrinsic semantic representations for both seen and unseen classes. For each sub-net, MSDN++ consistently enhances the intra-class compactness and inter-class separability over MSDN, and thus the fused features are further refined. This is thanks to our causal visual/attribute learning in the AVCA/VACA sub-nets, which encourages MSDN++ to learn causal vision-attribute associations for representing reliable features with good generalization. As such, our MSDN++ achieves significant improvements over both MSDN and the baseline.

Fig. 7: t-SNE visualizations of visual features for (a) seen classes and (b) unseen classes, learned by the baseline, MSDN (Chen et al, 2022b), and MSDN++. The 10 colors denote 10 different seen/unseen classes randomly selected from CUB. Results show that MSDN++ learns intrinsic semantic representations for both seen and unseen classes compared to the baseline. Meanwhile, each sub-net of MSDN++ consistently enhances the intra-class compactness and inter-class separability over MSDN.

Fig. 8: The effect of the combination coefficients (α1, α2) between the AVCA and VACA sub-nets on (a) CUB and (b) AWA2. Results show that MSDN++ performs poorly when α1/α2 is set small, and its performance drops when α1/α2 is set too large, because the attribute-based visual features and visual-based attribute features are complementary for discriminative semantic embedding representations.
Fig. 9: The effects of λ_cal, λ_AR, λ_causal, and λ_distill on CUB (top row) and AWA2 (bottom row). Results show that MSDN++ is robust when the loss weights are set small, while performance drops rapidly when the loss weights are set too large, because large loss weights hamper the balance of the various losses.

4.4 Effects of Various Causal Interventions

We analyze three different strategies for conducting a causal intervention, including random attention, uniform attention, and reversed attention, which generate various causal attention maps as shown in Fig. 6. Results show that the causal attention maps focus on wrong visual regions that are not relevant to their corresponding attributes. Such causal attention can measure the effectiveness of our attribute-based visual learning and visual-based attribute learning, enabling them to learn the intrinsic semantic knowledge between visual and attribute features with causality. The results in Table 4 show that the various causal attentions consistently improve the performance over MSDN, because our causal attention provides a significant signal to supervise MSDN++.
Furthermore, we find that all the causal interventions yield similar performance gains, which indicates that our causal visual/attribute learning is robust for improving the causality of features.

4.5 Hyperparameter Analysis

Effects of Combination Coefficients. We provide experiments to determine the effect of the combination coefficients (α1, α2) between the AVCA and VACA sub-nets. As shown in Fig. 8, MSDN++ performs poorly when α1/α2 is set small, because AVCA is the main sub-net supporting the final classification of MSDN++. Since the attribute-based visual features and visual-based attribute features are complementary for discriminative semantic embedding representations, the performance of MSDN++ also drops when α1/α2 is set too large (e.g., (α1, α2) = (0.9, 0.1)). According to Fig. 8, we set (α1, α2) to (0.8, 0.2) for CUB and AWA2 based on the performance of H and acc. Notably, (α1, α2) = (1.0, 0.0) and (α1, α2) = (0.0, 1.0) denote MSDN++ without information fusion, i.e., using only the AVCA or only the VACA sub-net during inference.

Effects of Loss Weights. We empirically study how to set the loss weights of MSDN++: λ_cal, λ_AR, λ_causal, and λ_distill, which control the self-calibration term, the attribute regression loss, the causal loss, and the semantic distillation loss, respectively. Results are shown in Fig. 9. MSDN++ is robust when the loss weights are set small, while performance drops rapidly when they are set too large, because large loss weights hamper the balance among the various losses. According to the results in Fig. 9, we set the loss weights {λ_cal, λ_AR, λ_causal, λ_distill} to {0.05, 0.03, 0.3, 0.001} and {0.4, 0.06, 0.1, 0.01} for CUB and AWA2, respectively. Notably, the loss weights {λ_cal, λ_AR, λ_causal, λ_distill} are set to {0.0001, 0.01, 0.0005, 0.05} for SUN.
Furthermore, although the performance curves are somewhat unstable across the varying hyperparameter values, MSDN++ performs consistently on the various datasets when using the same hyperparameters. This means that our MSDN++ is robust across datasets; accordingly, our method can be easily applied to new domains.

5 Conclusion

This paper introduces a mutually causal semantic distillation network for ZSL, termed MSDN++, which consists of an AVCA and a VACA sub-net. AVCA learns attribute-based visual features via attribute-based and causal visual learning, while VACA learns visual-based attribute features via visual-based and causal attribute learning. The causal attentions encourage the two sub-nets to discover causal vision-attribute associations for representing reliable features with good generalization. Furthermore, we introduce a semantic distillation loss, which promotes the two sub-nets to learn collaboratively and teach each other. Finally, we fuse the complementary features of the two sub-nets to make full use of all important knowledge during inference. As such, MSDN++ effectively explores the intrinsic and more sufficient semantic knowledge for desirable knowledge transfer in ZSL. The quantitative and qualitative results on popular benchmarks (i.e., CUB, SUN, AWA2, and FLO) demonstrate the superiority and potential of our MSDN++.

Data Availability Statements

The data used in this manuscript include three benchmark datasets, i.e., CUB,[1] SUN,[2] and AWA2,[3] which are publicly available benchmarks.

References

Akata Z, Perronnin F, Harchaoui Z, Schmid C (2016) Label-embedding for image classification. TPAMI 38:1425–1438
Arora G, Verma V, Mishra A, Rai P (2018) Generalized zero-shot learning via synthesized examples.
In: CVPR, pp 4281–4289
Bai X, Wang H, Ma L, Xu Y, Gan J, Fan Z, Yang F, Ma K, Yang J, Bai S, Shu C, Zou X, Huang R, Zhang C, Liu X, Tu D, Xu C, Zhang W, Wang XL, Chen A, Zeng Y, Yang D, Wang MW, Holalkere NS, Halin NJ, Kamel IR, Wu J, Peng X, Wang X, Shao J, Mongkolwat P, Zhang J, Liu W, Roberts M, Teng Z, Beer L, Sanchez LE, Sala E, Rubin D, Weller A, Lasenby J, Zheng C, Wang J, Li Z, Schönlieb CB, Xia T (2021) Advancing COVID-19 diagnosis with privacy-preserving collaboration in artificial intelligence. Nature Machine Intelligence
Cavazza J, Murino V, Bue AD (2023) No adversaries to zero-shot learning: Distilling an ensemble of Gaussian feature generators. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:12167–12178
Cetin S, Baran OB, Cinbis RG (2022) Closed-form sample probing for learning generative models in zero-shot learning. In: ICLR
Chen D, Zhang H, Shen Y, Long Y, Shao L (2024a) Evolutionary generalized zero-shot learning. In: IJCAI
Chen G, Li J, Lu J, Zhou J (2021a) Human trajectory prediction via counterfactual analysis. In: ICCV, pp 9804–9813
Chen L, Zhang H, Xiao J, Liu W, Chang S (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In: CVPR, pp 1043–1052
Chen S, Wang W, Xia B, Peng Q, You X, Zheng F, Shao L (2021b) FREE: Feature refinement for generalized zero-shot learning. In: ICCV
Chen S, Xie GS, Liu Y, Peng Q, Sun B, Li H, You X, Shao L (2021c) HSVA: Hierarchical semantic-visual adaptation for zero-shot learning. In: NeurIPS
Chen S, Hong Z, Liu Y, Xie G, Sun B, Li H, Peng Q, Lu K, You X (2022a) TransZero: Attribute-guided transformer for zero-shot learning.
1 https://www.vision.caltech.edu/datasets/cub_200_2011/
2 https://cs.brown.edu/~gmpatter/sunattributes.html
3 https://cvml.ist.ac.at/AwA2/
