Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D³ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D³ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
Camouflaged Object Detection (COD) is a challenging task that involves identifying and segmenting objects that are well-concealed within their environments [1]. In contrast to salient object detection [2] or generic object detection [3], COD must handle scenarios where an object's appearance, shape, and color are almost indistinguishable from the background (as shown in Figure 1 (d)). These objects range from animals exhibiting protective coloration or small, occluded forms to man-made objects engineered for concealment. The high similarity to the surrounding background, along with indistinct boundaries and deceptive textures, poses significant challenges for detection. Despite these challenges, COD still attracts growing interest in the computer vision community and shows promise in applications such as species discovery [4], agriculture monitoring [5], medical image segmentation [6]–[8], and wildlife tracking [9]–[15].
While existing COD methods have achieved impressive performance, they rely heavily on dense pixel-wise annotations, which can take up to 60 minutes per image [1]. In contrast, weakly-supervised labels offer a much more efficient and scalable alternative. For instance, scribble-based annotations require only around 10 seconds per image [16]. As a recently emerging task, Weakly-Supervised Camouflaged Object Detection (WSCOD) has seen only limited exploration [16]–[18]. Early efforts such as CRNet [16] focus on expanding the scribbles to wider camouflaged regions to capture structural information and semantic relationships. More recent methods, WS-SAM [17] and SAM-COD [18], push the boundary of WSCOD by adopting a two-stage learning framework: they first generate high-quality pseudo masks from scribbles using the Segment Anything Model (SAM) [19], and then train COD models on these pseudo masks. Despite these advances, current WSCOD approaches still lag far behind fully supervised COD models in terms of detection accuracy and segmentation quality. We observe that such performance gaps primarily result from the suboptimal exploitation of weak supervision signals, particularly in two key aspects: (1) the limited quality and quantity of supervision signals generated by SAM, and (2) the harmful annotation bias inherent in scribbles introduced by annotators.
To begin with, effectively leveraging the Segment Anything Model (SAM) [19] for the WSCOD task is non-trivial. Although SAM can directly generate pseudo masks for generic objects, it is prone to producing unwanted semantic responses, such as non-camouflaged objects or fragmented local regions (as shown in Figure 2). This limitation is largely due to SAM’s restricted semantic understanding and its inability to comprehend the specific demands of camouflaged object detection. To alleviate this issue, prior works like WS-SAM [17] and SAM-COD [18] propose heuristic selection mechanisms with predefined thresholds to filter unreliable pseudo masks. However, such strategies still fail to equip SAM with the semantic reasoning capabilities necessary for COD. Ultimately, the quality of the pseudo masks remains constrained by the inherent limitations of SAM. Inevitably, many informative hard examples may be prematurely discarded due to low confidence scores in the rule-based selection. As a result, the current exploitation of weak supervision in WSCOD is insufficient, both in quality and quantity, to support the training of robust and accurate models.
Furthermore, these methods tend to overlook the rich semantic information (e.g., structure and relation) embedded in scribble annotations once pseudo masks are obtained, leaving such valuable information underexplored. Although CRNet [16] attempted to exploit the rich semantics contained in scribble annotations, it failed to recognize and tackle the harmful bias inherent in them. As illustrated in Figure 3, our statistical analysis reveals a notable discrepancy between the spatial distribution of scribble annotations and ground truth masks.

[Fig. 3: The relative distances from ground truth mask pixels and scribble pixels to their respective nearest object boundaries reveal a notable discrepancy in spatial distribution: scribble annotations are biased toward central object regions, whereas mask labels exhibit more uniform coverage across entire object areas.]

This bias arises from annotators' preference for placing scribbles near the central regions of objects, leading to an imbalanced coverage of camouflaged objects. Consequently, models trained on such biased supervision signals tend to perform well in frequently annotated regions but poorly in less emphasized areas. Such bias hidden in the supervision signal limits the generalization ability of the WSCOD model and constrains the network's discriminative capacity, ultimately hindering the accurate detection of camouflaged objects.
To address the aforementioned challenges, we propose D³ETOR, a two-stage framework for weakly-supervised camouflaged object detection with scribble annotations. As illustrated in Figure 4, D³ETOR first generates pseudo labels using SAM with a multi-agent debate strategy and then trains our Frequency-Aware Debiasing Network (FADeNet) to accurately detect camouflaged objects. Specifically, we design an adaptive entropy-driven point sampling method to generate visual prompts from scribble annotations, enabling SAM to efficiently produce pseudo masks of potential camouflaged objects. To prevent pseudo masks from containing noise (e.g., irrelevant objects or incomplete segmentation) or from discarding camouflaged objects when confidence is low, we draw insight from the fundamental characteristic of human problem solving (i.e., debate) and further introduce a multi-agent debate strategy based on the multimodal chain-of-thought to encourage divergent reasoning for pseudo label selection. This strategy improves the interpretability and precision of pseudo labeling, hence ensuring both the quality and quantity of pseudo masks used for training of downstream COD models.
In the second stage, we take inspiration from frequency-based methods [20], [21] and design FADeNet, which progressively fuses multi-level frequency-aware features, enabling a deeper comprehension of multi-level visual representations and significantly enhancing the accuracy of camouflaged object detection. To better utilize the rich semantics (e.g., structural and relational cues) in scribble annotations while mitigating their inherent harmful bias, we introduce an auxiliary debiasing task during training. By dynamically adjusting supervision strength across regions according to the scribble probability, this task guides the model's attention toward areas that are difficult to annotate or are easily overlooked, improving overall structure perception and enhancing generalization ability under the weakly-supervised learning setting. In summary, the contributions of our D³ETOR can be detailed as follows:
1) We present D³ETOR, a two-stage framework for WSCOD with scribble annotations that strengthens the pseudo labeling pipeline through multi-agent debate and mitigates annotation bias in scribbles via a frequency-aware debiasing network. By fully exploiting weak supervision signals from scribble annotations, D³ETOR effectively narrows the gap between fully supervised and weakly-supervised camouflaged object detection.
2) To better utilize the rich semantics in scribble annotations under weak supervision, we propose a debate-enhanced pseudo labeling strategy that strengthens SAM's capacity for COD, efficiently producing higher-quality and abundant pseudo masks for subsequent training.
3) We further design FADeNet, a frequency-aware debiasing network that progressively fuses multi-scale and multi-level frequency representations through window-based cross-attention, while dynamically adjusting supervision strength to better capture overall structural information.
4) Extensive experiments on multiple COD datasets demonstrate that our framework consistently achieves SOTA performance under weak supervision, validating both its effectiveness and generalization ability.
Camouflaged object detection (COD), which aims to identify and segment objects that blend seamlessly with their surroundings, has attracted growing attention in the computer vision community. SINet [1] pioneered COD by introducing a search-and-identification network inspired by the animal predation mechanism, while SINetv2 [9] improved localization ability through texture enhancement modules and cascade attention. FEDER [22] decomposed features into frequency bands and reconstructed edges using an ordinary differential equation-inspired module, striving to address target-background similarity. Transformer-based methods further advanced visual representation learning in COD. FSPNet [23] proposed a feature shrinkage pyramid network with non-local tokens to enhance feature modeling and aggregation. MSCAF-Net [24] integrated multi-scale context and cross-scale fusion for progressive camouflaged object detection, whereas CamoFormer [25] employed mask-separable attention and top-down decoding to enhance segmentation quality. More recently, CamoDiffusion [26] adopted a diffusion-based denoising process to iteratively refine predictions, mitigating overconfidence in mis-segmented results. MCRNet [27] abstracted type-level capsules from pixel-level features to reduce computational cost while preserving part-whole relationships.
Despite this progress, most methods rely on large-scale datasets with pixel-level annotations. The inherently ambiguous boundaries of camouflaged objects make such pixel-wise labeling both time-consuming and labor-intensive. To address this, He et al. [16] introduced the S-COD dataset, which provides scribble annotations as a weak supervision signal. WS-SAM [17] generated pseudo masks from sparse cues and applied multi-scale feature grouping with entropy-based weighting to enhance reliability. SAM-COD [18] proposed a response filter to remove extreme responses from SAM by computing the ratio of the mask to the image size, along with a semantic matcher to select masks that balance segmentation detail and semantic accuracy. However, these approaches still rely solely on manually designed strategies to filter pseudo labels generated by SAM, which is a general-purpose model that cannot capture the task-specific requirements of COD. As a result, their performance remains substantially inferior to that of fully supervised COD models. To bridge this gap, we propose D³ETOR to enhance the existing pseudo labeling pipeline through multi-agent debating, hence empowering SAM with task-aware reasoning ability for COD. In addition, we introduce a frequency-aware debiasing network in the second stage to fully exploit scribble semantics while mitigating their inherent annotation bias.
The idea of prompt learning was first introduced in natural language processing (NLP), where handcrafted prompts were designed to elicit task-specific behaviors from large language models [28]. Following its success, the paradigm was soon extended to computer vision, with applications in recognition, segmentation, and vision-language modeling [29], [30]. A related advancement, Chain-of-Thought (CoT) prompting [31], showed that decomposing complex reasoning into intermediate steps can significantly enhance model performance. This principle has also been adopted in visual domains, such as segmentation and structured reasoning [32], [33]. Early research in this area primarily focused on prompt design and decoding strategies, but recent work has shifted toward mechanisms for iterative self-improvement. For example, Self-Refine [34] and Tree of Thoughts [35] allow models to generate candidate solutions and subsequently refine them through self-evaluation. While effective, these approaches remain constrained to the single-agent paradigm, which is inherently vulnerable to error accumulation and lacks robust self-correction.
To address these limitations, recent studies have proposed multi-agent debate frameworks, where multiple agents generate, critique, and refine outputs through interactive reasoning. This setup has demonstrated improved error correction and more consistent outcomes compared to single-agent CoT [36], [37]. Motivated by these advances, we pioneer the integration of the multi-agent debate paradigm into weakly-supervised camouflaged object detection (WSCOD). Our method leverages reasoning and collaboration among agents to boost SAM's capacity for COD, generating pseudo masks that are both more accurate and more reliable for training purposes.
Training a model for weakly-supervised camouflaged object detection (WSCOD) can be challenging, as sparse scribble annotations alone often fail to provide sufficient supervision compared to dense mask labels. Although pseudo masks can be generated to mitigate this limitation, the original scribbles still contain valuable semantic cues that should not be ignored. However, directly incorporating them into training may introduce harmful bias, since annotators tend to place scribbles near central regions of objects, leading to imbalanced object coverage. To address these issues, we introduce D³ETOR, a two-stage framework for WSCOD. As illustrated in Figure 5, the first stage (Sec. III-A) employs debate-enhanced pseudo labeling, where multi-agent reasoning converts sparse scribbles into high-quality pseudo masks. In the second stage (Sec. III-B), we progressively extract and fuse frequency-aware representations with the window-based cross-attention, while simultaneously balancing annotation bias during training. In the following, we elaborate on the design and implementation of each stage in detail.
In the first stage, we improve the SAM-based pseudo labeling pipeline to generate abundant high-quality pseudo masks from the semantic cues embedded in scribble annotations, strengthening the interpretability and precision of pseudo masks for subsequent learning. As shown in Figure 5 (a), this stage mainly consists of two components: (1) an adaptive entropy-driven point sampling method that produces effective visual prompts for SAM, and (2) a multi-agent debate strategy equipped with multimodal Chain-of-Thought (CoT) reasoning to filter candidate masks.
- Adaptive Entropy-Driven Point Sampling: As a powerful vision foundation model for generic object segmentation, SAM [19] still faces challenges in detecting camouflaged objects [38] due to the high visual similarity between foreground and background. To achieve accurate pseudo labeling for COD, SAM requires visual prompts that explicitly indicate target objects. However, it only supports prompts such as points, masks, and bounding boxes, while scribble-type inputs are not directly compatible. To address this issue, we propose an Adaptive Entropy-Driven Point Sampling strategy that converts sparse scribbles into informative visual prompts for SAM, thereby fully leveraging its capacity for COD.
[Fig. 6: Prompt templates for the multi-agent debate. Debater prompt: "You are an expert in detecting camouflaged objects, participating in a debate on whether a given masked area corresponds to a camouflaged object. Your goal is not to fully agree with the other debater, but to collaboratively determine the correct answer. Provide concise, effective, and evidence-based answers. Below are few-shot demonstration debates that provide a consistent template, composed of image captions, masked object descriptions, and Chain-of-Thought reasoning." Judge prompt: "You are the judge overseeing a multi-round debate between affirmative and negative debaters. Review the full debate history. Summarize key points from both sides. ... Make a final decision: Retain or Discard the pseudo mask for COD training."]

The process begins with estimating the local entropy of scribble regions, which quantifies the structural uncertainty of each pixel. Specifically, for a pixel (x, y) within foreground or background scribbles S, the local entropy is defined as

H(x, y) = −∑_i (h_i / ∑_j h_j) log (h_i / ∑_j h_j),
where h_i denotes the histogram count of intensity value i within a local neighborhood. Pixels with higher entropy correspond to more informative yet ambiguous regions that are crucial for segmentation. Candidate points are then selected as

C = {(x, y) ∈ S | Ĥ(x, y) > τ},

where Ĥ(x, y) denotes the local entropy normalized to [0, 1] and τ is a normalized entropy threshold.
To prevent over-clustering in high-entropy areas, we apply spatial filtering by enforcing a minimum distance constraint:

‖p − q‖₂ ≥ d_min, ∀ p, q ∈ C, p ≠ q,

where d_min is the minimum allowed separation between any two selected points.
This ensures that selected points remain well-separated while preserving local structural information. Finally, to guarantee uniform coverage across scribble annotations, we incorporate Farthest Point Sampling (FPS) [39], which iteratively selects N points, each maximizing its minimum distance to the already chosen set P_t:

p_{t+1} = argmax_{p ∈ C} min_{q ∈ P_t} ‖p − q‖₂.
In this way, we obtain visual prompts that are both informative and spatially balanced, enabling SAM to generate more reliable pseudo masks for WSCOD.
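To make the three steps concrete, the following is a minimal NumPy sketch of the sampling pipeline; the window size, histogram bins, threshold τ, minimum distance, and point budget are illustrative defaults rather than tuned values.

```python
import numpy as np

def local_entropy_map(gray, scribble_mask, win=7, bins=32):
    """Normalized Shannon entropy of the local intensity histogram at each scribble pixel."""
    h = win // 2
    ys, xs = np.nonzero(scribble_mask)
    ent = np.empty(len(xs))
    for k, (x, y) in enumerate(zip(xs, ys)):
        patch = gray[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
        hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
        p = hist[hist > 0] / patch.size
        ent[k] = -(p * np.log2(p)).sum()
    return xs, ys, ent / (ent.max() + 1e-8)          # entropy normalized to [0, 1]

def sample_prompt_points(gray, scribble_mask, tau=0.6, d_min=10.0, n_points=8):
    """Entropy thresholding -> minimum-distance filtering -> farthest point sampling."""
    xs, ys, ent = local_entropy_map(gray, scribble_mask)
    order = np.argsort(-ent)                          # visit high-entropy pixels first
    cand = [(int(xs[i]), int(ys[i])) for i in order if ent[i] > tau]

    kept = []                                         # greedy min-distance filtering
    for p in cand:
        if all(np.hypot(p[0] - q[0], p[1] - q[1]) >= d_min for q in kept):
            kept.append(p)
    pts = np.asarray(kept, dtype=float)
    if len(pts) <= n_points:
        return pts

    chosen = [0]                                      # FPS: maximize min distance to chosen set
    d = np.linalg.norm(pts - pts[0], axis=1)
    while len(chosen) < n_points:
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(pts - pts[nxt], axis=1))
    return pts[chosen]                                # (x, y) point prompts for SAM
```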
- Multi-Agent Debate based on the Multimodal Chain-of-Thought: Since SAM is not fine-tuned for COD due to the lack of sufficient supervision signals in the weakly-supervised learning setting, the pseudo masks it generates often contain noise, and many informative hard examples may be discarded because of low confidence scores. To address this limitation, we introduce a Multi-Agent Debate (MAD) strategy based on the multimodal Chain-of-Thought to encourage divergent thinking for COD. In this strategy, agents powered by a Multimodal Large Language Model (MLLM) iteratively generate supportive and opposing reasoning chains for each image-mask pair, while a judge agent evaluates these chains-of-thought to decide whether the mask should be retained as a pseudo label for COD. Through this process, our approach improves the existing pipeline in WSCOD for pseudo label selection, in both interpretability and precision.
As shown in Figure 6, our MAD strategy consists of three key components, described as follows:

Meta Prompt. We employ meta prompts to introduce the COD task, iteration limits, and other specific requirements. To ensure that the reasoning chains capture camouflage-specific characteristics [16], we design representative exemplars [31] as paradigm cases. These exemplars are refined with human priors and expanded into few-shot demonstration chains, providing a consistent template composed of image captions, masked object descriptions, and Chain-of-Thought (CoT) reasoning. The paradigm chains specify how pro and con viewpoints should evaluate attributes such as discriminability and boundary clarity, efficiently guiding the reasoning process of MLLMs.

Debaters. Two types of debaters participate in each debate: affirmative and negative. During each iteration, debaters take turns speaking in a fixed order, formulating their arguments based on the accumulated debate history. This iterative exchange encourages divergent reasoning and thorough exploration of potential pseudo label correctness.

Judge. A judge agent is designed to observe the entire debate process. The judge evaluates the complete debate history and determines the final decision on whether a pseudo mask should be retained for COD training.
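The debate protocol itself reduces to a few lines of control flow. The sketch below assumes a hypothetical `query_mllm` wrapper around the underlying MLLM; the prompt handling and round limit are illustrative.

```python
def query_mllm(role_prompt, image, mask, history):
    """Hypothetical wrapper around a multimodal LLM (e.g., Qwen2.5-VL);
    plug in the actual inference call here."""
    raise NotImplementedError

def debate_pseudo_mask(image, mask, meta_prompts, max_rounds=3):
    """Affirmative and negative debaters speak in a fixed order each round;
    a judge reviews the full history and decides whether to retain the mask."""
    history = []
    for _ in range(max_rounds):
        for side in ("affirmative", "negative"):
            argument = query_mllm(meta_prompts[side], image, mask, history)
            history.append((side, argument))
    verdict = query_mllm(meta_prompts["judge"], image, mask, history)
    return "retain" in verdict.lower()      # True -> keep as pseudo label for COD
```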
Due to the high similarity between camouflaged objects and their backgrounds, accurately delineating object boundaries remains a challenging task for WSCOD models, even when pseudo masks generated by SAM are available from the first stage. Moreover, the inherent bias in scribble annotations poses additional challenges when they are used as supervision signals, impairing model optimization during training. Fortunately, previous studies [20], [21] have shown that these boundaries can be distinguished more effectively when analyzed in the frequency domain. Motivated by these insights, we rethink COD from the perspective of frequency components and propose FADeNet, a Frequency-Aware Debiasing Network for COD, to progressively leverage supervision signals from both pseudo masks and scribble annotations.
The architecture of FADeNet is illustrated in Figure 5 (b). Given an RGB image, the Low-Frequency Semantic Extractor captures low-frequency features via self-attention to model the global context of camouflaged scenes. High-frequency features are obtained through Laplacian pyramid decomposition, followed by multi-level and multi-scale frequency encoding within the High-Frequency Detail Enhancer, preserving detailed information across frequency components. Subsequently, our method progressively fuses these frequency representations with window-based cross-attention to reduce discrepancies across frequencies and hierarchical levels, generating features that integrate both contextual and fine-grained information. Concurrently, a scribble probability predictor highlights the preferences of scribble annotations, allowing us to balance label bias during training. Finally, the CNN-based decoder fuses these hierarchical feature representations in a bottom-up manner to produce the final prediction.
- Frequency-Aware Encoding: In camouflaged object detection, the primary challenge lies in suppressing interference from visually confusing backgrounds while accurately delineating object boundaries. Although high-frequency components help reduce visual distraction by emphasizing edge details, they inherently lack sufficient semantics, leading to an incomplete contextual understanding. To overcome this limitation, we propose a frequency-aware encoding method that hierarchically leverages frequency representations across multiple levels, enabling complementary feature learning from both low- and high-frequency domains.

Low-Frequency Semantic Extractor (LFSE). We adopt a Vision Transformer (ViT) as the low-frequency encoder to extract global semantic representations from the image, as previous studies [40], [41] have demonstrated that the multi-head self-attention mechanism in transformer encoders naturally captures low-frequency components, effectively functioning as a low-pass filter. Specifically, the input image I ∈ R^{H×W×3} is first divided into non-overlapping patches of size 16 × 16. Each patch is then flattened and linearly projected into a D-dimensional embedding, forming a token sequence that represents local visual information. After processing through n Transformer layers, we obtain the low-frequency representation LFR ∈ R^{(H/16)×(W/16)×D}, which effectively captures the global semantics of the camouflaged scene.

High-Frequency Detail Enhancer (HFDE). To extract fine-grained edge and texture representations, we adopt the Laplacian pyramid decomposition, which is commonly used in related vision tasks [42], [43]. Specifically, the input image I_0 is decomposed as follows:
HF_k = f_k^↑(f_k^↓(I_0)) − f_{k+1}^↑(f_{k+1}^↓(I_0)),

where k ∈ {0, 1, 2, 3} indicates the level of the Laplacian pyramid, and the operators f_k^↓ and f_k^↑ denote downsampling and upsampling operations with a scale factor of 2^k (f_0 being the identity).
All residual components are then concatenated along the channel dimension to form the composite residual tensor HF_compose ∈ R^{H×W×12}. Subsequently, it is passed through a convolutional layer with a kernel size of 2 × 2 and a stride of 2, followed by layer normalization, to project it into a C-dimensional frequency embedding denoted as HFE.
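A compact PyTorch sketch of this decomposition and embedding follows; it assumes the level-k residual is the difference between consecutive down-/up-sampled versions of the input (matching the reconstructed equation above), and C = 96 is an illustrative embedding width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def blur_level(img, k):
    """f_k_up(f_k_down(img)): down- then up-sample by 2**k (identity for k = 0)."""
    if k == 0:
        return img
    s = 2 ** k
    low = F.interpolate(img, scale_factor=1 / s, mode="bilinear", align_corners=False)
    return F.interpolate(low, size=img.shape[-2:], mode="bilinear", align_corners=False)

def laplacian_residuals(img):
    """Four full-resolution residual bands of an RGB image, concatenated to (B, 12, H, W)."""
    bands = [blur_level(img, k) - blur_level(img, k + 1) for k in range(4)]
    return torch.cat(bands, dim=1)

class FrequencyEmbed(nn.Module):
    """Stride-2 conv + LayerNorm projecting HF_compose into the C-dim embedding HFE."""
    def __init__(self, c_out=96):                       # c_out is an assumed width
        super().__init__()
        self.proj = nn.Conv2d(12, c_out, kernel_size=2, stride=2)
        self.norm = nn.LayerNorm(c_out)

    def forward(self, hf_compose):                      # (B, 12, H, W)
        x = self.proj(hf_compose)                       # (B, C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                       # channels-last for LayerNorm
        return self.norm(x).permute(0, 3, 1, 2)         # back to (B, C, H/2, W/2)
```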
To further enhance the multi-scale detail perception of high-frequency features, we stack six encoders hierarchically to construct a pyramid-like network that operates at progressively reduced resolutions, as illustrated in Figure 5 (b). Each encoder is formulated as:
where ConvLN denotes a convolutional layer followed by Layer Normalization. The lower-level encoders primarily preserve fine-grained edge and contour information, while the deeper layers capture more abstract and semantic structures. Finally, we obtain a set of multi-scale and multi-level high-frequency representations {HFR_i^j | i, j = 1, 2, 3; i ≤ j}. Here, i denotes the number of downsampling operations, and j denotes the encoding stage; the spatial resolution of HFR_i^j is accordingly reduced by a factor of 2^i relative to HFE.
- Progressive Feature Fusion and Debiasing: To effectively bridge the cross-frequency representation gap and align multi-level high-frequency features at the same scale, we propose a Progressive Feature Fusion and Debiasing module. This module jointly exploits supervision signals from pseudo masks and scribble annotations, while adaptively mitigating label imbalance to achieve stable and unbiased optimization. The fused features are then forwarded to the hierarchical decoder and the scribble probability predictor, which collaboratively enable accurate camouflaged object segmentation and reliable label preference estimation.
Algorithm 1 Multi-Scale and Multi-Level Feature Fusion
Input: low- and high-frequency features LFR_i, HFR_i^j; frequency fusion function F
Output: fused features F_i
for i ← 1 to 3 do
    F_i ← LFR_i
    for j ← 3 to i do
        F_i ← F(HFR_i^j, F_i)
    end for
end for
Window-based Cross-Attention for Progressive Fusion. Specifically, we first reshape and resize the low-frequency representation LFR into a set {LFR_i | i = 1, 2, 3} so that its dimensions match those of the high-frequency representations {HFR_i^j | i, j = 1, 2, 3; i ≤ j}. To gradually bridge the gaps among frequency representations of different spatial scales and semantic levels, we adopt a progressive fusion strategy, as summarized in Algorithm 1 and sketched below.
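Under the assumption that each scale starts from its low-frequency feature and folds in the high-frequency stages from deep to shallow (matching the loop header in Algorithm 1), the fusion schedule could be written as:

```python
def progressive_fuse(lfr, hfr, fuse):
    """lfr[i]: resized low-frequency feature at scale i; hfr[(i, j)]: high-frequency
    feature with i downsamplings at encoder stage j (i <= j); fuse: F(H, L)."""
    fused = {}
    for i in (1, 2, 3):
        f = lfr[i]                          # start from low-frequency semantics
        for j in range(3, i - 1, -1):       # j from 3 down to i, as in Algorithm 1
            f = fuse(hfr[(i, j)], f)        # inject semantics into high-freq details
        fused[i] = f                        # F_i, forwarded to the decoder
    return fused
```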
For the fusion function F(H, L), we employ a Window-based Cross-Attention that prioritizes local representation alignment over global interactions. For example, at the i-th fusion level, the feature maps of LFR_i and HFR_i^j are divided into non-overlapping windows of size w × w to facilitate local semantic interaction. Inspired by the multi-head attention mechanism [44], each window (e.g., W_lf^k and W_hf^k) is flattened into a token sequence and projected through separate linear layers to obtain the query, key, and value vectors with dimension d: Q_h, K_l, V_l. The attention weights within each window, representing the localized relationships between different frequency representations, are computed as:

A^k = softmax(Q_h K_l^T / √d).
Semantic cues from low-frequency values are aggregated as follows:

Ŵ_hf^k = A^k V_l.
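Putting the windowing, cross-attention, and skip connection together, a single-head PyTorch sketch of F(H, L) might look as follows; the window size and projection layout are assumptions, and a single attention head is shown for brevity.

```python
import torch
import torch.nn as nn

class WindowCrossAttention(nn.Module):
    """Sketch of F(H, L): queries from high-frequency windows attend to keys/values
    from the co-located low-frequency windows, followed by a skip connection."""
    def __init__(self, dim, window=8):
        super().__init__()
        self.w = window
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def _windows(self, x):                       # (B, C, H, W) -> (B*nW, w*w, C)
        B, C, H, W = x.shape
        x = x.view(B, C, H // self.w, self.w, W // self.w, self.w)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, self.w * self.w, C)

    def forward(self, hf, lf):                   # hf, lf: (B, C, H, W), same shape
        B, C, H, W = hf.shape
        q = self.q(self._windows(hf))            # Q_h from high-frequency windows
        k, v = self.kv(self._windows(lf)).chunk(2, dim=-1)   # K_l, V_l from low-freq
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v           # aggregate low-frequency semantics
        out = out.view(B, H // self.w, W // self.w, self.w, self.w, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return hf + out                          # skip connection keeps high-freq detail
```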
Finally, all Ŵ_hf^k are restored to their 2D structure, and the fused feature, which integrates local semantic guidance from low-frequency features into high-frequency details, is obtained via a skip connection and hierarchically fed into the CNN-based decoder for segmenting the camouflaged object at different scales.

Debiasing via Predicting Scribble Probability. To mitigate the label imbalance inherent in scribble annotations, the first and foremost step is to estimate the probability distribution of scribble placements across each image. We adopt a two-layer "Conv + BN + ReLU" structure as the scribble probability predictor at different levels, whose prediction represents the probability of each pixel being labeled by scribbles, as visualized in Figure 7.

[Fig. 7: Visualization of the scribble probability map and the corresponding scribble annotation. The probability map highlights pixels that are most likely to be labeled by scribbles, although some of them may not be annotated in practice.]
Here, we assume each image contains N pixels, and the corresponding scribble label is denoted as y ∈ {0, 1}^N, where a value of 1 indicates scribble regions and 0 denotes non-scribble regions. Accordingly, a weighted binary cross-entropy loss at the k-th (k = 1, 2, 3) level is employed for pixel-level classification to compensate for the vast number of unlabeled pixels:

L_Scrib^k = −(1/N) ∑_{i=1}^{N} [ y_i log p_{i,k}^{scrib} + w_n (1 − y_i) log(1 − p_{i,k}^{scrib}) ],
where p_{i,k}^{scrib} ∈ [0, 1] represents the estimated annotation probability of scribbles at pixel i, and y_i ∈ {0, 1} denotes the pixel-level ground-truth label for scribble classification. w_n is the weight used to balance the contributions of the scribble and non-scribble classes during training.
Having established the predictor to estimate the scribble probability, the next step in alleviating harmful bias in scribbles is to dynamically adjust the supervision strength across regions during training. To this end, we introduce a debias loss derived from the Focal Loss [45], which re-weights the segmentation supervision according to the predicted likelihood of each pixel being labeled in the scribble annotation. Specifically, for a segmentation prediction p_{i,k}^{seg} ∈ [0, 1] at pixel i at the k-th (k = 1, 2, 3) level, the Debias Loss is defined as:

L_Debias^k = −(1/N_scrib) ∑_i (1 − p_{i,k}^{scrib} p_{i,k}^{seg})^γ log p_{i,k}^{seg},
where N_scrib denotes the number of pixels labeled by scribbles, and γ controls the modulation strength. Unlike the standard Focal Loss [45], which employs (1 − p_{i,k}^{seg})^γ as the modulation term, our formulation introduces a joint modulation factor (1 − p_{i,k}^{scrib} p_{i,k}^{seg})^γ that adaptively reweights pixel-wise supervision based on both annotation reliability and prediction confidence. In this way, pixels with low scribble probability receive stronger supervision, while those with high annotation probability are down-weighted according to their segmentation confidence.
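A PyTorch sketch of the two losses, using the reconstructed forms above (the exact reductions and pixel sets are assumptions), with w_n = 0.02 and γ = 0.9 as reported later:

```python
import torch

def scribble_loss(p_scrib, y, w_n=0.02):
    """Weighted BCE for scribble-probability prediction; w_n down-weights
    the abundant non-scribble pixels."""
    eps = 1e-6
    pos = y * torch.log(p_scrib + eps)
    neg = w_n * (1 - y) * torch.log(1 - p_scrib + eps)
    return -(pos + neg).mean()

def debias_loss(p_seg, p_scrib, y, gamma=0.9):
    """Sketch of the debias loss, assuming it modulates the segmentation loss on
    scribble-labeled pixels by the joint factor (1 - p_scrib * p_seg)**gamma."""
    eps = 1e-6
    scrib = y.bool()                                   # pixels labeled by scribbles
    mod = (1 - p_scrib[scrib] * p_seg[scrib]) ** gamma
    return -(mod * torch.log(p_seg[scrib] + eps)).sum() / scrib.sum().clamp(min=1)
```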
By alleviating the harmful bias inherent in sparse and imbalanced scribble annotations, our L_Debias not only enables the model to better exploit the rich semantic cues contained in scribbles but also enhances its ability to capture the overall structure of camouflaged objects.

Objective Function. Following previous works [22], [46], we combine the commonly used binary cross-entropy loss (L_BCE) and intersection-over-union loss (L_IoU). By progressively integrating multi-level supervision into the side outputs during the decoding process, the camouflaged object detection loss L_COD at the k-th (k = 1, 2, 3) level can be expressed as:

L_COD^k = L_BCE(P_seg^k, Y_mix) + L_IoU(P_seg^k, Y_mix),
where P_seg^k denotes the segmentation output of the decoder at level k, and Y_mix represents the mixture of the pseudo masks obtained in Sec. III-A and the dilated scribble annotations.
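For reference, a common PyTorch realization of this BCE-plus-IoU objective at a single level is sketched below; the soft-IoU form and smoothing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def cod_loss(p_seg, y_mix, eps=1.0):
    """L_COD at one level: BCE plus soft IoU against the mixed target Y_mix
    (pseudo mask combined with dilated scribbles)."""
    bce = F.binary_cross_entropy(p_seg, y_mix)
    inter = (p_seg * y_mix).sum(dim=(-2, -1))
    union = (p_seg + y_mix - p_seg * y_mix).sum(dim=(-2, -1))
    iou = 1 - (inter + eps) / (union + eps)            # soft IoU loss per image
    return bce + iou.mean()
```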
Overall, the total loss function of our FADeNet is formulated as:
where α and β are hyperparameters.
A. Experimental Setup

1) Implementation Details: We implement our model using PyTorch [54] and conduct all experiments on a single NVIDIA RTX 3090 GPU. In the first stage, we adopt SAM [19] with a ViT-H backbone and Qwen2.5-VL-7B [55] as the multimodal large language model (MLLM). In the second stage, following previous studies [23], [56], we employ the base version of ViT [57] pretrained with the DeiT strategy [58] as the transformer encoder.
For optimization, we use stochastic gradient descent (SGD) [59] with a momentum of 0.9 and a weight decay of 5 × 10^{-4}. The batch size is set to 4, and the initial learning rate is 0.03. The learning rate is linearly warmed up during the first 20 epochs and subsequently decayed using a cosine annealing schedule. For loss configurations, the weight w_n of the non-scribble class in L_Scrib is set to 0.02, and the modulation factor γ in L_Debias is 0.9. In the overall loss formulation, the balancing coefficients are set to α = 2 and β = 0.5.
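This optimization recipe maps directly onto standard PyTorch schedulers; in the sketch below, `model`, `total_epochs`, and the warmup start factor are placeholders.

```python
import torch

def build_optimizer(model, total_epochs):
    """SGD with momentum 0.9, weight decay 5e-4, lr 0.03; linear warmup for the
    first 20 epochs, then cosine annealing (warmup start factor is assumed)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                                momentum=0.9, weight_decay=5e-4)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, total_iters=20)          # first 20 epochs
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_epochs - 20)                    # remaining epochs
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[20])
    return optimizer, scheduler
```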
2) Datasets: To evaluate the effectiveness of our proposed method, we leverage the scribble-annotated S-COD dataset [16], which is specifically designed for weakly-supervised camouflaged object detection, to generate pseudo masks and train our model. The trained model is then evaluated on three widely used benchmark datasets: CAMO [60], COD10K [9], and NC4K [61]. CAMO consists of 1,000 training images and 250 testing images. COD10K is a larger and more challenging dataset containing 3,040 training images and 2,026 testing images. NC4K serves as an additional benchmark, comprising 4,121 testing images without a training split.
3) Evaluation Metrics: Following existing works [16]–[18], we adopt four commonly used metrics for evaluation: mean absolute error (MAE), structure measure (S_m) [62], mean E-measure (E_m) [63], and weighted F-measure (F_β^w) [64].
- Quantitative Comparison: To verify the effectiveness of our proposed framework, we compare it against state-of-the-art fully supervised and weakly-supervised methods, including SINet [1], UGTR [47], MGL-R [48], PFNet [49], UJSC [50], BSA-Net [51], ZoomNet [4], SINetv2 [9], HitNet [52], SAM-Ada [38], FEDER [22], FSPNet [23], DINet [20], CamoDiffusion [26], CamoFormer [25], MCRNet [27], SCSOD [53], CRNet [16], SAM [19], SAM-S [19], WS-SAM [17], and SAM-COD [18].
From the results in Table I, it is evident that D³ETOR consistently outperforms all scribble-supervised baselines across all datasets. For example, on CAMO, it achieves an MAE of 0.045, S_m of 0.868, E_m of 0.928, and F_β^w of 0.823, surpassing the second-best scribble-supervised method SAM-COD [18] (MAE 0.060, S_m 0.836, E_m 0.903, F_β^w 0.779) by a clear margin.
Furthermore, D³ETOR demonstrates competitive or superior performance compared to most fully supervised approaches, falling only slightly behind CamoDiffusion [26] and MCRNet [27] on certain metrics. Overall, these results indicate that D³ETOR establishes a new state of the art for weakly-supervised camouflaged object detection and remains competitive with fully supervised models while using only scribble annotations.
- Qualitative Comparison: As illustrated in Figure 8, we select all existing weakly supervised approaches (i.e., CRNet [16], WS-SAM [17], and SAM-COD [18]) for visual comparison with our method.
To verify the effectiveness of each main component and the key designs within them, we perform comprehensive ablation studies on eight different variants of our D³ETOR.
- Impact of Debate-Enhanced Pseudo Labeling: As shown in Table II, to assess the contribution of each module in the pseudo labeling stage, we compare the generated pseudo masks with the ground truth in the S-COD training set and analyze several variants within the first stage.
• In the first variant, only SAM is employed to generate pseudo masks, and all evaluation metrics indicate a notable deviation of its pseudo masks from the ground truth.
• The second variant introduces the Adaptive Entropy-Driven Point Sampling (AEDPS) on top of the first, leading to substantial performance improvements across all metrics. Specifically, the MAE decreases from 0.069 to 0.037, while F_β^w increases from 0.769 to 0.829. These results validate the effectiveness of AEDPS in prompting SAM to produce high-quality pseudo masks and in mitigating the limited generalization capability of SAM in camouflaged object detection.
• When we further integrate the Multi-Agent Debate with Multimodal Chain-of-Thought (CoT) into the second variant, our method attains finer-grained semantic discrimination of pseudo masks and achieves the best results across all metrics. These results demonstrate the remarkable effectiveness of the debate-based reasoning strategy in filtering out noisy predictions and retaining more informative pseudo masks for subsequent training stages.
- Impact of Frequency-Aware Progressive Debiasing: As illustrated in Table III, to quantify the contributions of the key components in the second stage, we conduct comprehensive ablation studies on CAMO, COD10K, and NC4K using several variants of our proposed Frequency-Aware Debiasing Network (FADeNet).
• The first variant includes only the Low-Frequency Semantic Extractor (LFSE), which serves as the backbone of the detection network. The results reveal its limited capability in capturing fine details and maintaining structural consistency, indicating that low-frequency cues alone are insufficient for this task.

We also provide a qualitative comparison between the predicted scribble probability and the actual label preference. As shown in Figure 9, the predicted probability distribution aligns closely with the scribble annotations, validating the effectiveness of our predictor in estimating scribble probabilities. Only with accurate scribble probability estimates can we adjust the supervision strength across different regions during training to mitigate the inherent harmful bias.

- Impact of Progressive Feature Fusion: We further investigate the contributions of the fused frequency-aware feature F_i at different levels, as shown in Table IV. The results indicate that integrating representations across multiple levels consistently enhances performance. This improvement arises from the complementary abstraction capabilities of features at different levels, which collectively strengthen the model's ability to perceive camouflaged objects throughout the progressive fusion process.
Moreover, we visually compare the feature maps derived from F i at different fusion levels. As illustrated in Figure 10, these results clearly demonstrate that progressively fusing high-frequency information effectively captures object boundaries and refines fine structural details.
In this work, we presented D³ETOR, a novel two-stage framework for Weakly-Supervised Camouflaged Object Detection with scribble annotations. By integrating Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing, our method effectively addresses two fundamental challenges in WSCOD: the unreliability of pseudo masks generated by general-purpose models and the inherent bias in scribble annotations. Through adaptive entropy-driven point sampling and multi-agent debate, we enhance the capabilities of SAM to produce high-quality pseudo masks for COD. Building upon these pseudo masks and scribble annotations, we propose FADeNet to exploit complementary cues from low-frequency semantics and high-frequency structures, progressively fusing them via cross-attention while dynamically mitigating annotation bias through the debiasing loss. Extensive evaluations across multiple benchmark datasets demonstrate the effectiveness and robustness of D³ETOR, substantially narrowing the gap with fully supervised methods.