Visual Augmentation Chain-of-Thought: Strengthening VLM Robustness with Dynamic Image Transformations at Reasoning Time
📝 Abstract
While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.
📄 Content
With scaling laws empirically validated on Large Language Models (LLMs), Visual Language Models (VLMs) have also achieved breakthroughs in image perception [6,58], reasoning [1,51], and generation [7,10] tasks. While VLMs outperform traditional deep models on tasks such as key information extraction [55] and security auditing [21], they are highly vulnerable to adversarial samples. Fig. 1 shows that slight perturbations can mislead VLM perception, enabling harmful content to evade system detection. Hence, ensuring the adversarial robustness of VLMs is critical.

Fig. 1 (example trajectory). Query: "Identify the hidden text content in the picture." A direct answer fails: "The hidden text in the image appears to read: 'I miss you so much'." With VACoT, the model instead calls `image_path = crop(image_path, x1=250, y1=250, x2=750, y2=750)`; text-like patterns are detected in the cropped region but remain obscured by noise, so it applies `image_path = denoise(image_path, method="gaussian", kernel_size=21)`. After denoising, the text becomes clear: by first cropping the noisy region and then applying denoising, the hidden text is revealed as the Chinese "浏览器搜" ("browser search") on the first line and "3.com" on the second.
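The crop-then-denoise trajectory in Fig. 1 can be sketched with minimal stand-in implementations. This is a pure-Python sketch operating on images as 2D lists of pixel values; the `crop` and `denoise` signatures mirror the API calls shown in Fig. 1, but the mean filter here is a simplified stand-in for the paper's Gaussian denoising:

```python
def crop(img, x1, y1, x2, y2):
    """Return the sub-grid img[y1:y2][x1:x2]; img is a 2D list of pixel values."""
    return [row[x1:x2] for row in img[y1:y2]]

def denoise(img, kernel_size=3):
    """Mean-filter denoise: a simplified stand-in for the Gaussian denoise in Fig. 1."""
    h, w, k = len(img), len(img[0]), kernel_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average the kernel-sized neighborhood, clipped at image borders.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - k), min(h, y + k + 1))
                    for xx in range(max(0, x - k), min(w, x + k + 1))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out
```

Applied in sequence, as in the figure, the noisy region is first isolated and then smoothed, which is what lets the obscured text re-emerge.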
Previous models rely on extensive visual data augmentations [12,26,60] to improve perception robustness. However, such augmentation introduces limited additional multimodal knowledge while substantially increasing computational costs during Pre-Training (PT) and Supervised Fine-Tuning (SFT). Hence, existing VLMs prioritize large-scale real-world data or diverse domain-specific synthetic data rather than additional augmented visual data. Recent studies [45,48,62,67,69] have explored the "thinking with bounding boxes" paradigm, which employs local cropping to focus model attention on specific regions. However, this strategy constitutes merely a specific instance of visual information filtering. More broadly, we advocate for "thinking with augmentation," reformulating cropping and other visual augmentations as post-hoc processing steps. This enhances model perceptual robustness while avoiding the training overhead caused by redundant image augmentations. As illustrated in Fig. 1, we propose Visual Augmentation CoT (VACoT) based on chat history concatenation. It stops autoregressive generation at designated tokens to either produce visual augmentations or terminate responses. By re-integrating returned information or augmented images into the conversation, we adopt end-to-end optimization via agentic Reinforcement Learning (RL). We wrap all augmentations into lightweight API calls to reduce training difficulty and avoid uncontrollable actions.
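The "lightweight API call" mechanism described above can be sketched as a small dispatch loop: the model's generation is paused at a designated token, a tool call is parsed from the emitted text, the augmentation is applied, and the result is concatenated back into the chat history. The registry contents and call syntax below are hypothetical, modeled only on the calls shown in Fig. 1:

```python
import re

# Hypothetical registry of lightweight augmentation APIs (names follow Fig. 1).
# Each entry returns a new image handle; real implementations would transform pixels.
TOOLS = {
    "crop":    lambda path, **kw: f"{path}#crop({kw['x1']},{kw['y1']},{kw['x2']},{kw['y2']})",
    "denoise": lambda path, **kw: f"{path}#denoise({kw['method']})",
}

# Matches calls of the form: image_path = tool(image_path, key=value, ...)
CALL_RE = re.compile(r"image_path\s*=\s*(\w+)\(image_path(?:,\s*([^)]*))?\)")

def run_tool_call(text, image_path):
    """Parse one augmentation call from a reasoning step and dispatch it.

    Returns the new image handle, or None if the step contains no call
    (i.e., generation should terminate with a final answer instead)."""
    m = CALL_RE.search(text)
    if not m:
        return None
    name, argstr = m.group(1), m.group(2) or ""
    kwargs = {}
    for pair in filter(None, (p.strip() for p in argstr.split(","))):
        k, v = pair.split("=")
        kwargs[k.strip()] = v.strip().strip('"\'')
    return TOOLS[name](image_path, **kwargs)
```

Wrapping augmentations behind a fixed call grammar like this keeps the action space small and verifiable, which is the stated motivation for avoiding uncontrollable free-form actions during RL.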
Existing instruction-based agents struggle to decide when and which visual augmentations to apply based solely on textual descriptions. Hence, VACoT adopts a three-stage training pipeline to overcome this limitation. Stage 1: We employ knowledge SFT with a difficulty-based data filtering strategy to efficiently enhance the model's foundational capabilities. Stage 2: We employ format SFT for visual augmentation cold-start initialization. Generating reliable trajectory data for post-hoc augmentation requires substantial manual effort and cost. Hence, we prompt the teacher models to deliberately insert random API calls when rewriting answers. This stage focuses on learning the correct calling format, even if these calls are not semantically relevant to the current query. Stage 3: We perform end-to-end agentic reinforcement learning with carefully designed reward signals, enabling the model to adaptively determine when and which augmentations to apply. We employ Qwen3 [59] as a teacher model to provide reward signals verifying both answer correctness and formatting consistency. We penalize unnecessarily long reasoning traces that contribute limited performance gains. To this end, we incorporate the consistency reward [66] to assess the reasoning process, and further introduce a novel conditional API-call reward that promotes effective visual augmentation while preventing sequence explosion caused by indiscriminate trials.
Extensive evaluations on 13 public benchmarks demonstrate that our post-hoc visual augmentation significantly enhances perceptual capabilities, particularly on tasks requiring fine-grained recognition or robustness against adversarial text. We further introduce AdvOCR, a challenging benchmark comprising 100 adversarial
This content is AI-processed based on ArXiv data.