CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection


Authors: Junseok Lee, Sungho Shin, Seongju Lee, and Kyoobin Lee

Abstract — Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method trains the student network on data diversified through downscaling and corruption, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving the robustness of object detection to domain shifts. This approach is valuable in real-world applications, such as autonomous driving and surveillance, where robust object detection in diverse environments is crucial.

I. INTRODUCTION

In recent years, deep learning-based object detection has advanced remarkably, playing a critical role in various visual perception tasks [1]–[4]. These technologies achieve strong performance when the training and testing data share the same domain distribution, but this is rarely the case in real-world scenarios.
In fields such as autonomous driving, industrial automation, and video surveillance, environmental changes caused by lighting, weather, or time of day result in domain shifts, which significantly degrade the performance of trained models [5], [6]. To address this, methodologies such as unsupervised domain adaptation (UDA) [7], [8] and domain generalization (DG) [9], [10] have been proposed. UDA aligns data distributions between the source and target domains to ensure model performance. However, it requires access to target domain data, limiting its ability to generalize across environments. DG focuses on training models to maintain performance on unseen target domains by leveraging diverse source domain data, making it a more challenging problem compared with UDA.

∗ These authors contributed equally to this work.
1 Junseok Lee is with the Advanced Robotics Lab, LG Electronics, Seoul, Republic of Korea.
2 Sungho Shin is with the Place AI Platform, Naver, Gyeonggi-do, Republic of Korea.
3 Seongju Lee and Kyoobin Lee are with the Department of AI Convergence, Gwangju Institute of Science and Technology (GIST), Gwangju, Republic of Korea.
† Corresponding author: Kyoobin Lee (E-mail: kyoobinlee@gist.ac.kr)

Fig. I. Overview of the results of the proposed CD-FKD. The top panel qualitatively compares our method with DivAlign [11] and Faster R-CNN [12] on a Dusk-Rainy scene (classes: bus, car, truck, person; red marks false negatives). The bottom panel shows a radar chart comparing relative performance across the Daytime-Clear (DC), Night-Clear (NC), Night-Rainy (NR), Dusk-Rainy (DR), and Daytime-Foggy (DF) domains.

Traditional DG methods aim to learn shared representations across multiple source domains or enhance data diversity via augmentation. However, these approaches often require multiple source domains, which can be impractical owing to time and cost constraints. Single-domain generalization (SDG) focuses on improving performance across diverse target domains using only a single source domain, offering a practical solution to the challenges of domain generalization [13], [14]. Unlike DG methods, which require multiple source domains or domain-level annotations and thus increase complexity, SDG reduces both data requirements and costs. Traditional approaches typically rely on data augmentation and feature disentanglement. Feature disentanglement separates domain-invariant (object-centric) features from domain-specific (background-centric) features, using only the invariant features for object detection. However, this neglects the surrounding background, limiting the detector's ability to understand the full context of the image. Although data augmentation can improve generalization across environments, it has been shown to reduce performance on the source domain [15]. To address these issues, the proposed method enables the detector to extract object-centric features while maintaining a comprehensive understanding of the image's context. This not only enhances generalization performance on target domains but also improves detection on the source domain. We propose a novel methodology that operates effectively across both the source and target domains by understanding causal features and learning the global context of images.
To achieve this, we introduce Cross-Domain Feature Knowledge Distillation (CD-FKD) for single-domain generalized object detection. The proposed method comprises two core components: (1) a cross-domain knowledge distillation (KD) framework and (2) a cross-domain feature distillation loss. The cross-domain KD framework bridges domain gaps between the teacher and student networks. The teacher network receives original source domain data, whereas the student network learns from diversified source data with various scales and corruptions. The primary goal of CD-FKD is to train the network to effectively extract object-centric features even from data with various distortions. The proposed feature distillation process comprises two key steps: (1) global feature distillation, which enables the network to learn the global context of an image, and (2) instance-wise feature distillation, which helps the network focus on object-specific features. The instance-wise component ensures that the network effectively extracts object-centric features, whereas the global component focuses on learning the overall context of the image. These distillation techniques prevent the network from overfitting to a specific domain and enable robust learning under various distortion conditions. Consequently, the student network performs reliably under diverse distortions while maintaining strong performance on the source domain data. This approach enhances generalization between the source and target domains, enabling robust object detection across a wide range of domain environments. As illustrated in Figure I, the proposed method achieves superior SDG performance compared with state-of-the-art (SOTA) methods. In summary, the main contributions of this study are as follows:

• We propose CD-FKD, a novel cross-domain feature knowledge distillation method for single-domain generalized object detection.
• We evaluate the proposed method against other SDG approaches on the SDG benchmark dataset and demonstrate that it outperforms previous methods. We also present comprehensive analyses that contribute to an improved understanding of the proposed framework.

II. RELATED WORK

A. Single-Domain Generalization

SDG focuses on training models with a single source domain while ensuring they generalize well to unseen target domains, a key task when collecting data from multiple domains is difficult. SDG methods are divided into data augmentation [13], [16] and domain-invariant feature learning [17]. Data augmentation generates synthetic samples to improve robustness against distribution shifts, whereas domain-invariant feature learning aims to extract stable features across domains.

Recent studies on single-domain generalized object detection have progressed with methods such as S-DGOD [17], which uses contrastive learning and self-distillation to improve generalization. CLIP-Gap [18] leverages CLIP for text-based augmentations to enhance object detection without requiring target domain data. G-NAS [19] reduces overfitting using differentiable NAS and an OoD-aware G-loss. UFR [20] improves generalization with scene-level causal attention and object-level prototypes. Danish et al. [11] introduced carefully selected augmentations to diversify the source domain and a classification and localization alignment method to enhance out-of-domain detections. Li et al. [21] proposed a dynamic object-centric perception network with prompt learning to adapt to image complexity and improve cross-domain generalization. Despite these advances, we propose a novel cross-domain feature distillation approach within a knowledge distillation (KD) framework that outperforms existing methods.

B. Knowledge Distillation for Object Detection

KD was originally proposed for model compression in image classification, transferring knowledge from large teacher models to smaller student models [22]. Initially designed for image classification [23], [24], KD has been expanded to object detection for both compression and performance enhancement [25], [26]. By replicating the capabilities of the teacher model with fewer resources, KD has proven to be a versatile strategy. In object detection, KD improves compact model performance by enabling such models to emulate key features from larger models. FineGrained [27] introduces feature imitation for performance enhancement, whereas DeFeat [28] separates foreground and background features for independent distillation. FGD [29] and ScaleKD [30] improve detection accuracy by leveraging focal-global features and scale-aware knowledge transfer, respectively. CrossKD [31] uses cross-head distillation to improve performance by mimicking teacher predictions.

We propose a KD approach for SDG in object detection. The proposed method allows the student model to detect objects in images with reduced visibility and noise by mimicking the ability of the teacher model to detect objects in clear images. This improves performance under challenging conditions, making the model robust to unseen target domains.

Fig. II. Illustration of the proposed single-domain generalized object detection using cross-domain feature knowledge distillation. The teacher path receives the original image, while the student path receives a diversified image (corruption plus downscaling with ratio r ∈ [0.6, 1.0]); backbone features are compared via global feature similarity (L_global), and RoI-aligned features at the ground-truth proposals are compared via instance-wise feature similarity (L_instance).
As a KD framework, the frozen teacher network (pink) receives source domain data, whereas the student network (sky blue) is provided with downscaled and corrupted source domain data.

III. METHOD

We introduce CD-FKD, a simple yet effective single-domain generalized object detection framework. First, we provide an overview of the proposed framework in Section III-A. Second, we introduce the method for generating diversified source domain images in Section III-B. Third, we describe the global feature distillation method in Section III-C. Finally, we present the instance-wise feature distillation approach in Section III-D.

A. Cross-Domain Distillation Framework

The proposed framework uses a self-distillation structure with two identical detectors. We employ the widely recognized Faster R-CNN, a popular two-stage detector, for single-domain object detection. The key idea is to leverage the differences in input images between the teacher and student networks. In this framework, the teacher network receives clear, high-resolution images, enabling it to extract detailed features. In contrast, the student network is trained with corrupted and downscaled images, which makes detecting small objects more challenging. Despite these difficulties, the student network learns to handle such scenarios, becoming more robust. Training on these challenging inputs exposes the student network to difficult detection situations, promoting more rigorous learning. Through distillation, the student learns feature representations from the teacher, improving its ability to detect fine-grained details in tough environments. This process helps the student network become more resilient to corruption and better at small object detection.
As shown in Figure II, the teacher network receives original source domain data, while the student network is trained on diversified source domain data with various scales and corruptions. The training begins by pre-training the teacher network on the source domain data. During distillation, the teacher network's parameters are frozen, and it performs inference on the clear source domain data. Meanwhile, the student network learns object classification and localization from the diversified data, benefiting from the teacher's knowledge.

Fig. III. Examples of corrupted source domain data. The 15 corruption types are Gaussian noise, impulse noise, shot noise, speckle noise, defocus blur, glass blur, zoom blur, motion blur, contrast, brightness, elastic transform, JPEG compression, pixelate, spatter, and saturate.

B. Diversified Source Domain Data

The proposed approach leverages diversified source domain images (D_φ) obtained by applying various downscaling and corruption techniques to the source domain images (D_s). Object detectors face challenges in target domains due to significant differences from the source domain, such as changes in image quality and object sizes. To address this, the proposed method applies downscaling and corruption to simulate discrepancies between the source and target domains. We downscale the source image resolution to generate low-resolution images, while preserving the original high-resolution image for the teacher network. The teacher network receives the high-resolution image, and the student network is trained with the downsampled images. This encourages the student network to learn to detect small objects effectively, even from low-resolution data. Additionally, various corruptions are applied to the source domain images during training, enhancing generalization and preventing overfitting, as demonstrated in previous SDG work [11].
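The diversification step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names and the noise parameters are our assumptions, and a single Gaussian-noise corruption stands in for sampling among the 15 ImageNet-C corruption types.

```python
import numpy as np

def downscale(img: np.ndarray, ratio: float) -> np.ndarray:
    """Downscale an (H, W, C) image by a ratio in [0.6, 1.0].

    Nearest-neighbor index mapping is used here for brevity; any
    standard resize (e.g., bilinear) could be substituted.
    """
    h, w = img.shape[:2]
    ys = (np.arange(int(h * ratio)) / ratio).astype(int)
    xs = (np.arange(int(w * ratio)) / ratio).astype(int)
    return img[ys][:, xs]

def gaussian_noise(img: np.ndarray, severity: int) -> np.ndarray:
    """One representative corruption at severity 1..5 (sigmas are illustrative)."""
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    noisy = img.astype(np.float32) / 255.0 + np.random.normal(0.0, sigma, img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

def diversify(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Student-path input: random downscale ratio r in [0.6, 1.0] plus a corruption."""
    r = rng.uniform(0.6, 1.0)
    severity = int(rng.integers(1, 6))  # severities 1..5, equal probability
    return gaussian_noise(downscale(img, r), severity)
```

The teacher path simply keeps the original image, so each training pair is `(img, diversify(img, rng))`.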
These corruptions, such as object occlusion and image blurring, can challenge the network's ability to detect objects. To mitigate this, the teacher network receives clear source domain data, while the student network is trained on corrupted data. This setup allows the student network to learn from noisy data, extracting features that reflect those learned by the teacher network and ultimately improving generalization to unseen target domains. We apply generic corruptions that are independent of the target domains. As shown in Figure III, a total of 15 different corruptions are applied using ImageNet-C [32]. These corruptions are applied with equal probability, and the intensity levels (ranging from 1 to 5) are distributed evenly across the augmentations.

C. Global Feature Distillation

Global feature distillation guides the student network to focus on the areas relevant to object detection rather than on the noise in diversified images. Diversified images distort object features, leading to semantic mismatches in the feature space. To address this, the student network is trained to learn semantically consistent features from diversified images. During the distillation process, the features extracted by the teacher network from the original image (x_s) serve as a reference for training the student network. The student network is trained to align its features, extracted from the diversified image (x_φ), with this reference. Specifically, we obtain the backbone features of the teacher, denoted as F_T^s = f_T(x_s), and the backbone features of the student, denoted as F_S^φ = f_S(x_φ), where f_T(·) and f_S(·) represent the backbones of the teacher and student networks, respectively. To align F_S^φ with F_T^s, a global feature distillation loss function (L_global) is employed.
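The global alignment can be sketched for a single image as below. This is a simplified stand-in, not the paper's code: a nearest-neighbor resize substitutes for the bilinear interpolation the paper uses, and the function names are ours.

```python
import numpy as np

def resize_nearest(feat: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a (C, H, W) feature map to (C, out_h, out_w).

    Nearest-neighbor stand-in for the bilinear interpolation used
    to match the student map to the teacher's spatial size.
    """
    c, h, w = feat.shape
    ys = (np.arange(out_h) * h / out_h).astype(int)
    xs = (np.arange(out_w) * w / out_w).astype(int)
    return feat[:, ys][:, :, xs]

def global_distill_loss(feat_teacher: np.ndarray, feat_student: np.ndarray) -> float:
    """Single-image term of Eq. (1): 1 minus the cosine similarity
    between the flattened teacher and (resized) student features."""
    _, h, w = feat_teacher.shape
    fs = resize_nearest(feat_student, h, w).ravel()
    ft = feat_teacher.ravel()
    cos = float(ft @ fs / (np.linalg.norm(ft) * np.linalg.norm(fs) + 1e-12))
    return 1.0 - cos
```

Summing this term over the N training images gives L_global; identical feature maps yield a loss of zero.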
In the global feature distillation process, feature embeddings (F_T^s, F_S^φ) are extracted from the final backbone layers of both the teacher and student networks, which use ResNet-101. F_T^s has a fixed size, whereas the size of F_S^φ varies with the scale-down ratio (0.6 to 1.0) of the input resolution. To align the feature map sizes, bilinear interpolation is applied to F_S^φ to match the size of F_T^s. The two feature embeddings are flattened, and a cosine similarity loss is applied to maximize the cosine similarity between them:

L_{global}(F^s_T, F^\phi_S) = \sum_{i=1}^{N} \left( 1 - \frac{F^s_{T,i} \cdot F^\phi_{S,i}}{\| F^s_{T,i} \| \, \| F^\phi_{S,i} \|} \right)   (1)

where N denotes the number of training images and i is the image index.

D. Instance-Wise Feature Distillation

Instance-wise feature distillation focuses on the relationship between objects in diversified images and their corresponding objects in original images, excluding the background. The student network is provided with an image that has been diversified relative to the original. These diversified images often suffer from occlusions or blurring, and owing to the smaller object sizes, they lack the visual information needed for the detector to accurately recognize the objects. Therefore, the goal of instance-wise feature distillation is to address the problem of reduced object visibility in corrupted images. The main objective is to use the region of interest (RoI) to make the RoI features of the diversified image as similar as possible to those of the original image. To achieve this, instance-wise feature distillation guides the student network to extract features from the diversified image that are similar to those in the original image.
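A single-image sketch of this RoI-based alignment is given below, assuming the student feature map has already been resized to the teacher's resolution so the same boxes apply to both. Mean-pooling each ground-truth box is used here as a simplified stand-in for the RoI Align operator, and all names are illustrative.

```python
import numpy as np

def roi_mean_pool(feat: np.ndarray, box: tuple) -> np.ndarray:
    """Mean-pool a (C, H, W) feature map inside one (x1, y1, x2, y2) box
    given in feature-map coordinates; a simplified stand-in for RoI Align."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    crop = feat[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    return crop.reshape(feat.shape[0], -1).mean(axis=1)

def instance_distill_loss(feat_teacher: np.ndarray,
                          feat_student: np.ndarray,
                          gt_boxes: list) -> float:
    """Single-image term of Eq. (2): sum over ground-truth instances of
    1 minus the cosine similarity between teacher and student RoI features."""
    loss = 0.0
    for box in gt_boxes:
        it = roi_mean_pool(feat_teacher, box)
        isv = roi_mean_pool(feat_student, box)
        cos = it @ isv / (np.linalg.norm(it) * np.linalg.norm(isv) + 1e-12)
        loss += 1.0 - float(cos)
    return loss
```

Summing this term over images and instances gives L_instance; as with the global loss, matching RoI features drive each term toward zero.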
To implement this approach, the backbone features of the teacher (F_T^s) and student (F_S^φ) are processed using RoI Align (RA) with the ground-truth bounding boxes (GT) as the RoIs. This extracts the instance features of the teacher (I_T^s) and of the student (I_S^φ), defined as I_T^s = RA(F_T^s, GT) and I_S^φ = RA(F_S^φ, GT). I_S^φ and I_T^s are then aligned so that corresponding objects are properly matched. All aligned features are flattened, and a cosine similarity loss is applied between corresponding objects:

L_{instance}(I^s_T, I^\phi_S) = \sum_{i=1}^{N} \sum_{j=1}^{O} \left( 1 - \frac{I^s_{T,i,j} \cdot I^\phi_{S,i,j}}{\| I^s_{T,i,j} \| \, \| I^\phi_{S,i,j} \|} \right)   (2)

where N denotes the number of images, i is the image index, O is the number of instances in the image, and j is the instance index. This approach enables the student network to extract features from diversified images guided by clean-image features, while global and instance-wise distillation together ensure that it captures both the overall context and individual instance knowledge.

Finally, the overall objective comprises the sum of the loss function for object localization and classification (L_det), the global feature distillation loss (L_global), and the instance-wise feature distillation loss (L_instance):

L_{total} = L_{det} + \alpha L_{global} + \beta L_{instance}   (3)

where α and β are hyperparameters that balance L_global and L_instance.

TABLE I. Single-domain generalization object detection results. Average results are calculated over the four target domains to compare generalization ability. The results of the compared models are obtained from their respective papers. Bold/underline indicates the best/second-best result.

Method | Daytime-Clear | Night-Clear | Dusk-Rainy | Night-Rainy | Daytime-Foggy | Average
Faster R-CNN [12] | 54.9 | 36.6 | 27.9 | 12.1 | 32.1 | 27.2
IBN-Net [33] | 49.7 | 32.1 | 26.1 | 14.3 | 29.6 | 25.5
SW [34] | 50.6 | 33.4 | 26.3 | 13.7 | 30.8 | 26.1
IterNorm [35] | 43.9 | 29.6 | 22.8 | 12.6 | 28.4 | 23.4
ISW [36] | 51.3 | 33.2 | 25.9 | 14.1 | 31.8 | 26.3
S-DGOD [17] | 56.1 | 36.6 | 28.2 | 16.6 | 33.5 | 28.7
CLIP-Gap [18] | 51.3 | 36.9 | 32.3 | 18.7 | 38.5 | 31.6
G-NAS [19] | 58.4 | 45.0 | 35.1 | 17.4 | 36.4 | 33.5
PDDOC [21] | 53.6 | 38.5 | 33.7 | 19.2 | 39.1 | 32.6
DivAlign [11] | 52.8 | 42.5 | 38.1 | 24.1 | 37.2 | 35.5
UFR [20] | 58.6 | 40.8 | 33.2 | 19.2 | 39.6 | 33.2
CD-FKD (Ours) | 62.7 | 47.3 | 42.3 | 23.4 | 40.2 | 38.3

IV. EXPERIMENTS

A. Experimental Setup

Dataset. We evaluated the proposed method on the diverse-weather dataset, a single-domain generalization benchmark of urban scenes built by [17]. The dataset comprises five weather domains: Daytime-Clear, Night-Clear, Dusk-Rainy, Night-Rainy, and Daytime-Foggy. Daytime-Clear is used as the source domain for training, while the other four domains are used as target domains for testing. Daytime-Clear comprises 19,395 images for training and 8,313 images for testing. The four target domains, used solely for evaluation, are Night-Clear with 26,158 images, Dusk-Rainy with 3,501 images, Night-Rainy with 2,494 images, and Daytime-Foggy with 3,775 images. The dataset contains seven object classes: bus, bike, car, motor, person, rider, and truck.

Implementation Details. We use Faster R-CNN [12] with a ResNet101-FPN [37] backbone, implemented in mmdetection [38], as the detector.
The backbone is initialized with pre-trained weights from ImageNet. The model is trained using the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0001. The batch size is set to 4. The values of α and β in the loss function are both set to 1.0.

B. Comparison with the State of the Art

Our method is compared with normalization-based SDG works [33]–[36] and recent SOTA single-domain generalized detectors [11], [17]–[21]. For the baseline, we use Faster R-CNN [12] initialized with ImageNet pre-trained weights. We evaluate the performance of the proposed method on the source domain, Daytime-Clear, and assess its generalization capability on the unseen target domains: Night-Clear, Dusk-Rainy, Night-Rainy, and Daytime-Foggy. The evaluation metric is mean average precision (mAP), and we report results at mAP@0.5.

As summarized in Table I, the proposed method, CD-FKD, achieves strong performance not only on the source domain but also across all target domains. The average performance over the four target domains reaches 38.3% mAP@0.5, an improvement of 11.1% mAP@0.5 over the Faster R-CNN baseline. Furthermore, the proposed method outperforms the previous best, DivAlign [11], by 2.8% mAP@0.5. Remarkably, the proposed method enhances generalization to target domains without compromising performance on the source domain, surpassing previous methods on both. These results validate that the proposed approach significantly improves source domain performance while also enhancing generalization to unseen domains. Figure IV visually compares the detection results of Faster R-CNN [12], DivAlign [11], and the proposed method.
The figure shows the detection results of each model for the Night-Clear, Dusk-Rainy, Night-Rainy, and Daytime-Foggy scenes from left to right. In contrast to the bottom-row results of the proposed method, the top-row Faster R-CNN and middle-row DivAlign results display both false negatives (red) and false positives (yellow).

Fig. IV. Qualitative evaluation of generalization ability on the Night-Clear, Dusk-Rainy, Night-Rainy, and Daytime-Foggy scenes. The top row shows the results of Faster R-CNN [12], the middle row DivAlign [11], and the bottom row our method. Red circles and arrows indicate false negatives; yellow circles and arrows indicate false positives.

Results on the Night-Clear Scene. Table II summarizes the results for the Night-Clear scene. Owing to reduced visibility at night, object detection is significantly more challenging than in Daytime-Clear. In this scenario, the proposed method achieves the best performance with 47.3% mAP, outperforming previous SOTA methods in six of the seven object categories, with bike the exception. Low-light conditions in night scenes lead to confusion between visually similar categories, such as bike and motor.

TABLE II. Quantitative results (%) on the Night-Clear scene.

Method | bus | bike | car | motor | person | rider | truck | mAP
Faster R-CNN [12] | 37.5 | 36.6 | 62.0 | 14.0 | 43.6 | 28.8 | 41.6 | 37.7
IBN-Net [33] | 37.8 | 27.3 | 49.6 | 15.1 | 39.2 | 27.1 | 38.9 | 32.1
SW [34] | 38.7 | 29.2 | 49.8 | 16.6 | 31.5 | 28.0 | 40.2 | 33.4
IterNorm [35] | 38.5 | 23.5 | 38.9 | 15.8 | 26.6 | 25.9 | 38.1 | 29.6
ISW [36] | 38.5 | 28.5 | 49.6 | 15.4 | 31.9 | 27.5 | 41.3 | 33.2
S-DGOD [17] | 40.6 | 35.1 | 50.7 | 19.7 | 34.7 | 32.1 | 43.4 | 36.6
CLIP-Gap [18] | 37.7 | 34.3 | 58.0 | 19.2 | 37.6 | 28.5 | 42.9 | 36.9
G-NAS [19] | 46.9 | 40.5 | 67.5 | 26.5 | 50.7 | 35.4 | 47.8 | 45.0
PDDOC [21] | 40.9 | 35.0 | 59.0 | 21.3 | 40.4 | 29.9 | 42.9 | 38.5
UFR [20] | 43.6 | 38.1 | 66.1 | 14.7 | 49.1 | 26.4 | 42.9 | 36.9
Ours | 47.8 | 40.1 | 71.8 | 27.2 | 55.2 | 38.8 | 50.1 | 47.3

Results on the Dusk-Rainy Scene. Table III summarizes the results for the Dusk-Rainy scene, where low light and rain affect detection performance. The proposed method outperforms the others, showing significant improvements in most categories, especially bike, person, and rider, small objects that are prone to occlusion under rain or fog. However, confusion between motor and bike slightly lowers motor performance.

TABLE III. Quantitative results (%) on the Dusk-Rainy scene.

Method | bus | bike | car | motor | person | rider | truck | mAP
Faster R-CNN [12] | 34.6 | 22.3 | 63.6 | 7.9 | 24.2 | 13.2 | 39.4 | 29.3
IBN-Net [33] | 37.0 | 14.8 | 50.3 | 11.4 | 17.9 | 13.3 | 38.4 | 26.1
SW [34] | 35.2 | 16.7 | 50.1 | 10.4 | 20.1 | 13.0 | 38.8 | 26.3
IterNorm [35] | 32.9 | 14.1 | 38.9 | 11.0 | 15.5 | 11.6 | 35.7 | 22.8
ISW [36] | 34.7 | 16.0 | 50.0 | 11.1 | 17.8 | 12.6 | 38.8 | 25.9
S-DGOD [17] | 37.1 | 19.6 | 50.9 | 13.4 | 19.7 | 16.3 | 40.7 | 28.2
CLIP-Gap [18] | 37.8 | 22.8 | 60.7 | 16.8 | 26.8 | 18.7 | 42.4 | 32.3
G-NAS [19] | 44.6 | 22.3 | 66.4 | 14.7 | 32.1 | 19.6 | 45.8 | 35.1
PDDOC [21] | 39.4 | 25.2 | 60.9 | 20.4 | 29.9 | 16.5 | 43.9 | 33.7
UFR [20] | 37.1 | 21.8 | 67.9 | 16.4 | 27.4 | 17.9 | 43.9 | 33.2
Ours | 50.3 | 30.9 | 75.7 | 19.6 | 40.4 | 26.4 | 52.9 | 42.3

Results on the Night-Rainy Scene. Table IV summarizes the results for the Night-Rainy scene, the most challenging condition owing to low light and the effects of rain. Despite these challenges, our method outperforms the others, achieving the highest overall performance. However, motor exhibits lower performance due to confusion with bike.

TABLE IV. Quantitative results (%) on the Night-Rainy scene.

Method | bus | bike | car | motor | person | rider | truck | mAP
Faster R-CNN [12] | 19.8 | 8.6 | 30.2 | 0.3 | 0.9 | 4.8 | 16.6 | 12.8
IBN-Net [33] | 24.6 | 10.0 | 28.4 | 0.9 | 8.3 | 9.8 | 18.1 | 14.3
SW [34] | 22.3 | 7.8 | 27.6 | 0.2 | 10.3 | 10.0 | 17.7 | 13.7
IterNorm [35] | 21.4 | 6.7 | 22.0 | 0.9 | 9.1 | 10.6 | 17.6 | 12.6
ISW [36] | 22.5 | 11.4 | 26.9 | 0.4 | 9.9 | 9.8 | 17.5 | 14.1
S-DGOD [17] | 24.4 | 11.6 | 29.5 | 9.8 | 10.5 | 11.4 | 19.2 | 16.6
CLIP-Gap [18] | 28.6 | 12.1 | 36.1 | 9.2 | 12.3 | 9.6 | 22.9 | 18.7
G-NAS [19] | 28.6 | 9.8 | 38.4 | 0.1 | 13.8 | 9.8 | 21.4 | 17.4
PDDOC [21] | 25.6 | 12.1 | 35.8 | 10.1 | 14.2 | 12.9 | 22.9 | 19.2
UFR [20] | 29.9 | 11.8 | 36.1 | 9.4 | 13.1 | 10.5 | 23.3 | 19.2
Ours | 34.3 | 14.2 | 49.6 | 2.1 | 17.2 | 15.6 | 30.6 | 23.4

Results on the Daytime-Foggy Scene. Table V summarizes the results for the Daytime-Foggy scene, where fog causes occlusions and blurring, complicating detection. The proposed method outperforms the others in most categories. However, foggy conditions increase confusion between visually similar categories such as bus and truck, and bike and motor.

TABLE V. Quantitative results (%) on the Daytime-Foggy scene.

Method | bus | bike | car | motor | person | rider | truck | mAP
Faster R-CNN [12] | 30.2 | 26.6 | 54.9 | 27.2 | 35.1 | 37.6 | 19.4 | 33.0
IBN-Net [33] | 29.9 | 26.1 | 44.5 | 24.4 | 26.2 | 33.5 | 22.4 | 29.6
SW [34] | 30.6 | 36.2 | 44.6 | 25.1 | 30.7 | 34.6 | 23.6 | 30.8
IterNorm [35] | 29.7 | 21.8 | 42.4 | 24.4 | 26.0 | 33.3 | 21.6 | 28.4
ISW [36] | 29.5 | 26.4 | 49.2 | 27.9 | 30.7 | 34.8 | 24.0 | 31.8
S-DGOD [17] | 32.9 | 28.0 | 48.8 | 29.8 | 32.5 | 38.2 | 24.1 | 33.5
CLIP-Gap [18] | 36.2 | 34.2 | 57.9 | 34.0 | 38.7 | 43.8 | 25.1 | 38.5
G-NAS [19] | 32.4 | 31.2 | 57.7 | 31.9 | 38.6 | 38.5 | 24.5 | 36.4
PDDOC [21] | 36.1 | 34.5 | 58.4 | 33.3 | 40.5 | 44.2 | 26.2 | 39.1
UFR [20] | 36.9 | 35.8 | 61.7 | 33.7 | 39.5 | 42.2 | 27.5 | 39.6
Ours | 36.6 | 33.2 | 62.6 | 34.6 | 42.9 | 44.2 | 27.5 | 40.2

TABLE VI. Ablation study results (%) of the proposed CD-FKD.
DC, NC, DR, NR, and DF represent Daytime-Clear, Night-Clear, Dusk-Rainy, Night-Rainy, and Daytime-Foggy, respectively.

Methods | Corrupt&Down | FKD | DC | NC | DR | NR | DF | Avg.
Faster R-CNN | ✗ | ✗ | 54.9 | 36.6 | 27.9 | 12.1 | 32.1 | 27.2
Ours | ✓ | ✗ | 58.8 | 41.1 | 33.9 | 14.9 | 34.0 | 30.9
Ours | ✓ | L_glo | 62.0 | 46.3 | 41.7 | 21.4 | 39.5 | 37.2
Ours | ✓ | L_ins | 62.5 | 46.7 | 42.1 | 21.9 | 39.7 | 37.6
Ours | ✓ | L_glo + L_ins | 62.7 | 47.3 | 42.3 | 23.4 | 40.2 | 38.3

TABLE VII. Ablation study on the source domain test set comparing Faster R-CNN and CD-FKD, evaluating CD-FKD with corruption only (Corrupt) and with both corruption and downscaling (Corrupt&Down).

Method | mAP | mAP50 | mAP_s | mAP_m | mAP_l
Faster R-CNN | 29.5 | 54.9 | 8.2 | 32.4 | 54.6
+ Corrupt | 34.5 | 62.1 | 13.2 | 37.7 | 57.6
+ Corrupt&Down | 34.8 | 62.7 | 13.6 | 38.0 | 57.6

C. Ablation Study

We conducted an ablation study to analyze the impact of the different components of the proposed method. First, we evaluated the effect of corruption and downscaling (Corrupt&Down) on the input data. Second, we assessed the impact of the feature knowledge distillation losses, L_global and L_instance. Table VI summarizes the results; "Avg." denotes the average performance across the four target domains. Applying corruption and downscaling to the source images significantly improved performance over the baseline, demonstrating its effectiveness in enhancing generalization. Moreover, using both L_global and L_instance yielded 38.3% mAP, a notable improvement over the baseline. Each distillation loss individually contributed to the gains, not only on the target domains but also on the source domain.
Through distillation, our model learned to replicate the feature representations of the teacher network, improving adaptability to the source domain while avoiding overfitting to the diversified data.

Table VII presents the results of another ablation study, examining the effect of corruption and downscaling on the input images. The proposed method, when applied to the student network, significantly outperformed the baseline. Notably, the downscaling technique achieved a 0.4% improvement in mAP_s and a 0.3% improvement in mAP_m compared to the corruption-only experiments, underscoring the method's effectiveness in improving small object detection.

Additionally, Figure V compares the heatmaps generated by the baseline and the proposed CD-FKD. The baseline-generated heatmap mainly highlights irrelevant background areas. In contrast, the proposed method reduces background focus and places more attention on the objects. This shows that the proposed approach focuses on objects instead of the background in unseen domains, thereby improving generalization performance.

Fig. V. Heatmap visualization for target domain scenes: (a) input image, (b) Faster R-CNN, (c) Ours. The left column displays the original images, the middle column presents the results from Faster R-CNN, and the right column shows the results from our method. The heatmaps highlight the areas where the model focuses, with regions of higher attention marked by a redder hue.

TABLE VIII
The impact of the hyperparameters α and β of L_global and L_instance in the distillation training phase.

α / β    DC    NC    DR    NR    DF    Avg.
0.5/1.5  62.5  46.8  42.1  23.1  40.1  38.0
1.0/1.0  62.7  47.3  42.3  23.4  40.2  38.3
1.5/0.5  62.1  46.7  41.8  22.9  40.1  37.9

In Table VIII, we applied the hyperparameters α and β to the L_global and L_instance loss functions during the feature distillation training phase from the teacher network to the student network.
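The weighted objective just described can be written as L_FKD = α·L_global + β·L_instance. As a minimal sketch of this weighting, the snippet below uses mean-squared error between teacher and student feature maps as a stand-in for the paper's actual distillation losses; the function name, box format, and feature shapes are illustrative assumptions.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def fkd_loss(t_feat, s_feat, boxes, alpha=1.0, beta=1.0):
    """Combine global and instance-wise feature distillation terms.

    NOTE: illustrative sketch. MSE stands in for the paper's distillation
    losses; `boxes` holds (y0, y1, x0, x1) object regions on the (H, W, C)
    feature map, approximating instance-wise distillation.
    """
    # Global distillation: match the entire feature map.
    l_global = mse(t_feat, s_feat)
    # Instance-wise distillation: match features inside object regions only.
    l_instance = np.mean([
        mse(t_feat[y0:y1, x0:x1], s_feat[y0:y1, x0:x1])
        for (y0, y1, x0, x1) in boxes
    ]) if boxes else 0.0
    return alpha * l_global + beta * l_instance

rng = np.random.default_rng(0)
t = rng.random((32, 32, 8))   # teacher feature map (H, W, C)
s = rng.random((32, 32, 8))   # student feature map
loss = fkd_loss(t, s, boxes=[(4, 12, 4, 12), (16, 28, 10, 20)])
```

With α = β = 1.0 (the best setting in Table VIII), neither term dominates, matching the paper's finding that a balanced transfer of global and instance-wise features works best.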
According to the results of the ablation study in the main manuscript, when only one of the two loss functions (L_global or L_instance) was used, L_instance contributed slightly more to the performance improvement. However, the final results show that applying both L_global and L_instance yielded the largest performance gain. We therefore further analyze the effects of both loss functions. Table VIII reports the performance achieved with α/β values of 0.5/1.5, 1.0/1.0, and 1.5/0.5. Our evaluation shows that the optimal values of α = 1.0 and β = 1.0 strike an effective balance between transferring global features through L_global and instance-wise features through L_instance. This balance maximized the learning efficiency of the student network.

V. CONCLUSION

In this study, we introduced a novel approach for single-domain generalized object detection using CD-FKD. Our proposed method effectively addresses the domain shift problem and significantly improves generalization to unseen target domains using only a single source domain. We applied global and instance-wise feature distillation, enabling the model to extract robust, object-centric features while maintaining detection performance even under severe corruptions. Experimental results demonstrated that CD-FKD consistently outperforms existing methods under adverse weather conditions. Furthermore, an ablation study confirmed the effectiveness of each component of the proposed method, leading to performance improvements in both unseen target and source domains.

ACKNOWLEDGEMENT

This research was supported by the National Research Council of Science & Technology (NST) grant funded by the Korea government (MSIT) (No. GTL25041-000).

REFERENCES

[1] Y. Cai, T. Luan, H. Gao, H. Wang, L. Chen, Y. Li, M. A. Sotelo, and Z. Li, "YOLOv4-5D: An effective and efficient object detector for autonomous driving," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–13, 2021.
[2] J. Park, J. Lee, S. Moon, and K. Lee, "Deep learning based detection of missing tooth regions for dental implant planning in panoramic radiographic images," Applied Sciences, vol. 12, no. 3, p. 1595, 2022.
[3] J. Lee, J. Park, S. Lee, S.-Y. Moon, and K. Lee, "Automated diagnosis for extraction difficulty of maxillary and mandibular third molars and post-extraction complications using deep learning," Scientific Reports, vol. 15, no. 1, p. 19036, 2025.
[4] J. Lee, S. Lee, J. Kim, J. Park, and K. Lee, "Robust maritime object detection under adverse conditions via joint semantic learning without extra computational overhead," in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 8166–8173.
[5] R. Geirhos, C. R. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann, "Generalisation in humans and deep neural networks," Advances in Neural Information Processing Systems, vol. 31, 2018.
[6] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, "Do ImageNet classifiers generalize to ImageNet?" in International Conference on Machine Learning. PMLR, 2019, pp. 5389–5400.
[7] Y.-J. Li, X. Dai, C.-Y. Ma, Y.-C. Liu, K. Chen, B. Wu, Z. He, K. Kitani, and P. Vajda, "Cross-domain adaptive teacher for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7581–7590.
[8] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou, "Harmonizing transferability and discriminability for adapting object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8869–8878.
[9] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi, "Domain generalization by solving jigsaw puzzles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2229–2238.
[10] Z. Huang, H. Wang, E. P. Xing, and D. Huang, "Self-challenging improves cross-domain generalization," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. Springer, 2020, pp. 124–140.
[11] M. S. Danish, M. H. Khan, M. A. Munir, M. S. Sarfraz, and M. Ali, "Improving single domain-generalized object detection: A focus on diversification and alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17732–17742.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[13] Z. Wang, Y. Luo, R. Qiu, Z. Huang, and M. Baktashmotlagh, "Learning to diversify for single domain generalization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 834–843.
[14] C. Wan, X. Shen, Y. Zhang, Z. Yin, X. Tian, F. Gao, J. Huang, and X.-S. Hua, "Meta convolutional neural networks for single domain generalization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4682–4691.
[15] P. Kirichenko, M. Ibrahim, R. Balestriero, D. Bouchacourt, S. R. Vedantam, H. Firooz, and A. G. Wilson, "Understanding the detrimental class-level effects of data augmentation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[16] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese, "Generalizing to unseen domains via adversarial data augmentation," Advances in Neural Information Processing Systems, vol. 31, 2018.
[17] A. Wu and C. Deng, "Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 847–856.
[18] V. Vidit, M. Engilberge, and M. Salzmann, "CLIP the gap: A single domain generalization approach for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3219–3229.
[19] F. Wu, J. Gao, L. Hong, X. Wang, C. Zhou, and N. Ye, "G-NAS: Generalizable neural architecture search for single domain generalization object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 5958–5966.
[20] Y. Liu, S. Zhou, X. Liu, C. Hao, B. Fan, and J. Tian, "Unbiased Faster R-CNN for single-source domain generalized object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28838–28847.
[21] D. Li, A. Wu, Y. Wang, and Y. Han, "Prompt-driven dynamic object-centric learning for single domain generalization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17606–17615.
[22] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[23] S. Shin, J. Lee, J. Lee, Y. Yu, and K. Lee, "Teaching where to look: Attention similarity knowledge distillation for low resolution face recognition," in European Conference on Computer Vision. Springer, 2022, pp. 631–647.
[24] O. S. EL-Assiouti, G. Hamed, D. Khattab, and H. M. Ebied, "HDKD: Hybrid data-efficient knowledge distillation network for medical image classification," Engineering Applications of Artificial Intelligence, vol. 138, p. 109430, 2024.
[25] G. Bang, K. Choi, J. Kim, D. Kum, and J. W. Choi, "RadarDistill: Boosting radar-based object detection performance via knowledge distillation from LiDAR features," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15491–15500.
[26] L. Zhao, J. Song, and K. A. Skinner, "CRKD: Enhanced camera-radar object detection with cross-modality knowledge distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15470–15480.
[27] T. Wang, L. Yuan, X. Zhang, and J. Feng, "Distilling object detectors with fine-grained feature imitation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4933–4942.
[28] J. Guo, K. Han, Y. Wang, H. Wu, X. Chen, C. Xu, and C. Xu, "Distilling object detectors via decoupled features," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2154–2164.
[29] Z. Yang, Z. Li, X. Jiang, Y. Gong, Z. Yuan, D. Zhao, and C. Yuan, "Focal and global knowledge distillation for detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4643–4652.
[30] Y. Zhu, Q. Zhou, N. Liu, Z. Xu, Z. Ou, X. Mou, and J. Tang, "ScaleKD: Distilling scale-aware knowledge in small object detector," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19723–19733.
[31] J. Wang, Y. Chen, Z. Zheng, X. Li, M.-M. Cheng, and Q. Hou, "CrossKD: Cross-head knowledge distillation for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16520–16530.
[32] D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," in Proceedings of the International Conference on Learning Representations, 2019.
[33] X. Pan, P. Luo, J. Shi, and X. Tang, "Two at once: Enhancing learning and generalization capacities via IBN-Net," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 464–479.
[34] X. Pan, X. Zhan, J. Shi, X. Tang, and P. Luo, "Switchable whitening for deep representation learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1863–1871.
[35] L. Huang, Y. Zhou, F. Zhu, L. Liu, and L. Shao, "Iterative normalization: Beyond standardization towards efficient whitening," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4874–4883.
[36] S. Choi, S. Jung, H. Yun, J. T. Kim, S. Kim, and J. Choo, "RobustNet: Improving domain generalization in urban-scene segmentation via instance selective whitening," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11580–11590.
[37] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[38] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open MMLab detection toolbox and benchmark," arXiv preprint, 2019.
