Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce a novel CAPTCHA benchmark for the first time, named CAPTURE CAPTCHA for Testing Under Real-world Experiments, specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.
💡 Deep Analysis
📄 Full Content
CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA
Resolving
Jianyi Zhang*, and Ziyin Zhou
zjy@besti.edu.cn
Beijing Electric Science and Technology Institute
Beijing, Beijing, China
Xu Ji, Shizhao Liu, and Zhangchi Zhao
Beijing Electric Science and Technology Institute
Beijing, Beijing, China
Abstract
Benefiting from strong and efficient multi-modal alignment strate-
gies, Large Visual Language Models (LVLMs) are able to simulate hu-
man visual and reasoning capabilities, such as solving CAPTCHAs.
However, existing benchmarks based on visual CAPTCHAs still
face limitations. Previous studies, when designing benchmarks
and datasets, customized them according to their research objec-
tives. Consequently, these benchmarks cannot comprehensively
cover all CAPTCHA types. Notably, there is a dearth of dedicated
benchmarks for LVLMs. To address this problem, we introduce a
novel CAPTCHA benchmark for the first time, named CAPTURE
(CAPTCHA for Testing Under Real-world Experiments), specifi-
cally for LVLMs. Our benchmark encompasses 4 main CAPTCHA
types and 25 sub-types from 31 vendors. The diversity enables a
multi-dimensional and thorough evaluation of LVLM performance.
CAPTURE features extensive class variety, large-scale data, and
unique LVLM-tailored labels, filling the gaps in previous research
in terms of data comprehensiveness and labeling pertinence. When
evaluated by this benchmark, current LVLMs demonstrate poor
performance in solving CAPTCHAs.
To enhance LVLMs’ proficiency in solving CAPTCHAs, we
present a two-stage framework, CRRD (Cropping, Re-Reading,
and Describing). Drawing inspiration from human problem-solving
behavior for complex tasks, we first crop CAPTCHA images into
instruction text and image, to help LVLM focus on crucial informa-
tion and eliminate distractions. Then we use the re-reading method
to enhance LVLM’s reasoning ability, enabling it to better identify
key elements in CAPTCHA image. The framework improves LVLM
performance across all CAPTCHA tasks, with up to a 69% improve-
ment. However, our experimental results indicate that despite these
enhancements, existing LVLMs still cannot fully simulate human
visual and reasoning capabilities to solve all current CAPTCHAs,
which is attributed to the inherence of LVLM models. Our code
and benchmark datasets are in our GitHub repository and will be
available for further research.
Keywords
Large Vision Language Models (LVLMs), CAPTCHA, Evalution,
Benchmark
1
Introduction
In recent years, the development of Large Language Models
(LLMs)[1, 14, 64] has led to revolutionary breakthroughs in the
field of Natural Language Processing (NLP)[46]. Building on these
advancements, researchers are now focusing on multimodal models,
particularly in computer vision. These models integrate LLMs with
Text
CAPTURE
Visual
Games
Behavior
Slide
Image Rotation Restoration
(IRR)
Trajectory
(Tra)
Orientation and
Direction (Ori)
Match-3
Game (M3G)
Dice and Sum
(Di&S)
Dart and Sum
(Da&S)
Gobang
(Go)
Icon Match with
the most Liquid
in Cup (IMLC)
3x3 Example
Classification
(3x3 EC)
3x3 Image
Classification
(3x3 IC)
4x4 Image
Segmentation
(4x4 IS)
Chinese
Characters
illusions (CCI)
Letters
Illusion
(LI)
Image
Pairing
(IP)
Differences
Selection (DS)
Spatial
Reasoning
(SR)
Jigsaw Puzzle
Game (JPG)
Biggest Area
Selection (BAS)
Chinese
Character
Selection (CCS)
Font
Recongnition
(FR)
Icon
Selection
(IS)
Main Types
Sub Types
2
13
7
3
Attribute & Introduction
Rotate to correct way up
(Rot)
The most common type, including English
phrases and alphanumeric combinations, whic is
used to test the ability of LVLMs to recognize
letters, digitals, and words.
A diverse range of visual CAPTCHA
types, testing LVLMs' abilities in image
classification, segmentation, semantic
understanding, matching, and the
recognition of mixed patterns formed by
different styles of Chinese characters,
letters, and symbols.
Builds on visual CAPTCHAs, and sets
game scenes and rules to test LVLMs'
ability to understand and reason about
game environments, game elements, and
game rules.
Tests how LVLMs simulate the computer and
mouse, and how they recognize and interact with
CAPTCHA elements to complete the verification.
Words
Letters and Digitals
(L&D)
Figure 1: The CAPTURE benchmark we designed. Correspon-
dence between main types and sub types and attribute. The
abbreviation of 25 sub-tasks are indicated in parentheses.
visual encoders[44, 45]and are fine-tuned using multimodal align-
ment methods based on generative pre-trained models[48]. Large
Visual Language Models (LVLMs)[77, 79] exhibit abilities akin to
human visual capabilities, such as integrating and reasoning with
visual information[44]. For instance, LVLMs can accurately ana-
lyze and interpret images[79], achieving a level of understanding
comparable to that of humans’ comprehensive image interpretation.
CAPTCHA (Completely Automated Public Turing Tests to Tell
Computers and Humans Apart)[65] are widely used in verification
proces