CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

📝 Original Info

  • Title: CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving
  • ArXiv ID: 2512.11323
  • Date: 2025-12-12
  • Authors: Jianyi Zhang, Ziyin Zhou, Xu Ji, Shizhao Liu, Zhangchi Zhao

📝 Abstract

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce a novel CAPTCHA benchmark for the first time, named CAPTURE (CAPTCHA for Testing Under Real-world Experiments), specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.

💡 Deep Analysis

Figure 1

📄 Full Content

CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

Jianyi Zhang* and Ziyin Zhou (zjy@besti.edu.cn), Beijing Electric Science and Technology Institute, Beijing, China
Xu Ji, Shizhao Liu, and Zhangchi Zhao, Beijing Electric Science and Technology Institute, Beijing, China

Abstract

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce a novel CAPTCHA benchmark for the first time, named CAPTURE (CAPTCHA for Testing Under Real-world Experiments), specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.

To enhance LVLMs' proficiency in solving CAPTCHAs, we present a two-stage framework, CRRD (Cropping, Re-Reading, and Describing). Drawing inspiration from human problem-solving behavior for complex tasks, we first crop each CAPTCHA image into its instruction text and its image, to help the LVLM focus on crucial information and eliminate distractions. We then use the re-reading method to enhance the LVLM's reasoning ability, enabling it to better identify key elements in the CAPTCHA image. The framework improves LVLM performance across all CAPTCHA tasks, with up to a 69% improvement. However, our experimental results indicate that, despite these enhancements, existing LVLMs still cannot fully simulate human visual and reasoning capabilities to solve all current CAPTCHAs, which we attribute to inherent limitations of LVLMs. Our code and benchmark datasets are in our GitHub repository and will be available for further research.

Keywords

Large Vision Language Models (LVLMs), CAPTCHA, Evaluation, Benchmark

1 Introduction

In recent years, the development of Large Language Models (LLMs) [1, 14, 64] has led to revolutionary breakthroughs in the field of Natural Language Processing (NLP) [46]. Building on these advancements, researchers are now focusing on multimodal models, particularly in computer vision. These models integrate LLMs with visual encoders [44, 45] and are fine-tuned using multimodal alignment methods based on generative pre-trained models [48]. Large Visual Language Models (LVLMs) [77, 79] exhibit abilities akin to human visual capabilities, such as integrating and reasoning with visual information [44]. For instance, LVLMs can accurately analyze and interpret images [79], achieving a level of understanding comparable to humans' comprehensive image interpretation.

Figure 1: The CAPTURE benchmark we designed, showing the correspondence among main types, sub-types, and attributes. The abbreviations of the 25 sub-tasks are given in parentheses.

  • Text (2 sub-types): The most common type, including English phrases and alphanumeric combinations, which is used to test the ability of LVLMs to recognize letters, digits, and words.
  • Visual (13 sub-types): A diverse range of visual CAPTCHA types, testing LVLMs' abilities in image classification, segmentation, semantic understanding, matching, and the recognition of mixed patterns formed by different styles of Chinese characters, letters, and symbols.
  • Games (7 sub-types): Builds on visual CAPTCHAs, and sets game scenes and rules to test LVLMs' ability to understand and reason about game environments, game elements, and game rules.
  • Behavior (3 sub-types): Tests how LVLMs simulate the computer and mouse, and how they recognize and interact with CAPTCHA elements to complete the verification.
  • Sub-types shown in the figure: Words; Letters and Digitals (L&D); Slide; Image Rotation Restoration (IRR); Trajectory (Tra); Orientation and Direction (Ori); Match-3 Game (M3G); Dice and Sum (Di&S); Dart and Sum (Da&S); Gobang (Go); Icon Match with the most Liquid in Cup (IMLC); 3x3 Example Classification (3x3 EC); 3x3 Image Classification (3x3 IC); 4x4 Image Segmentation (4x4 IS); Chinese Characters Illusions (CCI); Letters Illusion (LI); Image Pairing (IP); Differences Selection (DS); Spatial Reasoning (SR); Jigsaw Puzzle Game (JPG); Biggest Area Selection (BAS); Chinese Character Selection (CCS); Font Recognition (FR); Icon Selection (IS); Rotate to correct way up (Rot).

CAPTCHA (Completely Automated Public Turing Tests to Tell Computers and Humans Apart) [65] are widely used in verification proces
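The cropping and re-reading steps that the abstract attributes to CRRD can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the image is modelled as a plain list of pixel rows, and the names `crop_captcha`, `reread_prompt`, `split_row`, and the prompt wording are all hypothetical, since the text above does not give the actual cropping rule or prompt template.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an image: a list of pixel rows.
Image = List[List[int]]


@dataclass
class CroppedCaptcha:
    instruction: Image  # region assumed to hold the instruction text
    challenge: Image    # region assumed to hold the puzzle image


def crop_captcha(image: Image, split_row: int) -> CroppedCaptcha:
    """Split a CAPTCHA screenshot at split_row into two regions.

    Assumption: the instruction sits above the challenge image, so a
    single horizontal cut separates them.
    """
    if not 0 < split_row < len(image):
        raise ValueError("split_row must fall strictly inside the image")
    return CroppedCaptcha(
        instruction=image[:split_row],
        challenge=image[split_row:],
    )


def reread_prompt(instruction_text: str) -> str:
    """Build a prompt that states the instruction twice (re-reading).

    The exact wording is an assumption made for illustration.
    """
    return (
        f"Instruction: {instruction_text}\n"
        f"Read the instruction again: {instruction_text}\n"
        "Describe the key elements in the image, then give your answer."
    )


if __name__ == "__main__":
    screenshot = [[0] * 4 for _ in range(6)]  # dummy 6-row image
    parts = crop_captcha(screenshot, split_row=2)
    print(len(parts.instruction), len(parts.challenge))
    print(reread_prompt("Select all images containing traffic lights"))
```

In this sketch the cropped challenge region and the re-reading prompt would then be passed to an LVLM of choice; that call is left out because the text names no model interface.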

Reference

This content is AI-processed based on open access ArXiv data.
