CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies designed their benchmarks and datasets around specific research objectives, so these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of benchmarks dedicated to LVLMs. To address this problem, we introduce CAPTURE (CAPTCHA for Testing Under Real-world Experiments), the first CAPTCHA benchmark designed specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. This diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling gaps left by previous research in data comprehensiveness and labeling relevance. When evaluated on this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.


💡 Research Summary

The paper introduces CAPTURE, the first benchmark specifically designed to evaluate Large Visual Language Models (LVLMs) on the task of solving real‑world CAPTCHAs. Recognizing that existing visual‑CAPTCHA benchmarks are either narrowly focused or custom‑built for particular research goals, the authors set out to create a comprehensive, LVLM‑oriented dataset that reflects the diversity and difficulty of CAPTCHAs encountered on the open web. CAPTURE covers four major CAPTCHA families—text‑based, image‑selection, puzzle‑combination, and action‑based—and further breaks these down into 25 sub‑types drawn from 31 commercial providers. In total, the dataset comprises roughly 31,000 images, each annotated with LVLM‑friendly labels such as the exact answer string, selection indices, or coordinate points. Annotation was performed with a multi‑stage human verification pipeline, achieving over 99 % label accuracy and providing rich metadata (e.g., distortion level, background complexity, color transformations) for fine‑grained analysis.
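To make the label design concrete, here is a minimal sketch of what a single CAPTURE annotation record might look like. The field names and values are illustrative assumptions based on the summary above (answer strings, selection indices, coordinate points, and distortion metadata); the released dataset may use a different schema.

```python
# Hypothetical annotation record for one CAPTURE sample.
# Field names are illustrative, not the paper's published schema.
sample = {
    "image_path": "captcha_00123.png",
    "main_type": "text",            # one of: text, image-selection, puzzle, action
    "sub_type": "distorted-text",   # one of the 25 sub-types
    "vendor": "vendor_07",          # one of the 31 commercial providers
    "label": {
        "answer": "X7kQ2",          # exact answer string (text CAPTCHAs)
        "selection": None,          # grid indices (image-selection CAPTCHAs)
        "points": None,             # (x, y) coordinates (click/action CAPTCHAs)
    },
    "metadata": {                   # fine-grained analysis attributes
        "distortion_level": "high",
        "background_complexity": "medium",
        "color_transform": "hue-shift",
    },
}
```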

The evaluation protocol defines four key metrics: (1) Exact Match accuracy for text and puzzle CAPTCHAs, (2) OCR‑based similarity scores for distorted text, (3) Selection accuracy for image‑choice CAPTCHAs, and (4) Inference latency. Using this protocol, the authors benchmark seven state‑of‑the‑art LVLMs—including GPT‑4V, LLaVA‑1.5‑13B, MiniGPT‑4, InstructBLIP, Kosmos‑2, Flamingo‑1B, and Otter—against traditional OCR baselines. Results reveal a stark performance gap: average Exact Match rates hover around 27 %, with the best models barely surpassing 35 %. Text‑based CAPTCHAs that feature heavy distortion, background noise, or low contrast cause OCR scores to drop below 15 % for most models. Image‑selection tasks that involve overlapping objects or color shifts see selection accuracies dip below 30 %. Puzzle‑type CAPTCHAs, which require multi‑step reasoning, suffer from error propagation because current LVLMs attempt to solve them in a single pass rather than a staged fashion. Action‑based CAPTCHAs, which demand interactive responses, remain largely unsolvable because LVLMs lack a vision‑action loop; the benchmark therefore treats them as a separate challenge and reports only simulated textual responses.
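The first three metrics can be sketched in a few lines. The snippet below is a minimal stand-in, assuming the OCR-based similarity is a character-level edit-similarity in [0, 1] (the paper's exact formulation is not given in this summary) and that selection accuracy requires the predicted set of grid cells to match the gold set exactly.

```python
import difflib

def exact_match(pred: str, gold: str) -> bool:
    """Exact Match: the prediction must equal the gold answer string."""
    return pred.strip() == gold.strip()

def ocr_similarity(pred: str, gold: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's
    OCR-based score, whose exact formulation is not stated here."""
    return difflib.SequenceMatcher(None, pred.strip(), gold.strip()).ratio()

def selection_accuracy(pred_indices: set[int], gold_indices: set[int]) -> bool:
    """Selection accuracy: chosen grid cells must match the gold set exactly."""
    return pred_indices == gold_indices

# Example usage
print(exact_match("X7kQ2", "X7kQ2"))               # True
print(round(ocr_similarity("X7kO2", "X7kQ2"), 2))  # 0.8 (one character off)
print(selection_accuracy({1, 4, 7}, {1, 4, 7}))    # True
```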

From these findings, the authors identify three primary limitations of contemporary LVLMs in CAPTCHA solving: (i) insufficient multi‑scale visual processing to handle fine‑grained distortions, (ii) weak chain‑of‑thought reasoning and memory management for multi‑step tasks, and (iii) the absence of real‑time interaction capabilities needed for action‑based challenges. The paper argues that these shortcomings expose a gap between the impressive language‑centric abilities of LVLMs and the practical visual‑reasoning demands of security‑critical applications.

Looking forward, the authors propose several research directions. On the data side, they suggest expanding CAPTURE to include multilingual and region‑specific CAPTCHAs, as well as developing semi‑automated verification tools to streamline future annotation. Model‑wise, they advocate for architectures that combine high‑resolution vision backbones, dedicated decoding heads for fine detail, and explicit chain‑of‑thought prompting to enable stepwise reasoning. Moreover, integrating a vision‑action loop—potentially via reinforcement‑learning or interactive simulation environments—could empower LVLMs to tackle action‑based CAPTCHAs. The authors release the full dataset, evaluation scripts, and baseline results publicly, inviting the community to use CAPTURE as a standardized testbed for advancing LVLM robustness in security‑oriented visual tasks.
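As one way to picture the proposed stepwise prompting for puzzle CAPTCHAs, the sketch below builds a staged chain-of-thought request in the OpenAI-style chat format. The stage wording, the function `build_messages`, and the output convention are all assumptions for illustration, not prompts or code from the paper.

```python
# Illustrative staged prompt for a sliding-puzzle CAPTCHA; the stages and
# wording are assumptions, not the authors' prompts.
STAGED_PUZZLE_PROMPT = """\
You are solving a sliding-puzzle CAPTCHA. Work in stages and show each step.
Stage 1: Describe every puzzle piece and its current position.
Stage 2: Identify the target layout implied by the background image.
Stage 3: For each piece, state the move needed to reach its target slot.
Stage 4: Output the final move sequence as a list of (piece_id, dx, dy).
"""

def build_messages(image_b64: str) -> list[dict]:
    """Assemble one multimodal chat request that enforces staged reasoning,
    rather than asking the model to solve the puzzle in a single pass."""
    return [
        {"role": "user", "content": [
            {"type": "text", "text": STAGED_PUZZLE_PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ]
```

Staging the prompt this way targets the error-propagation failure mode noted above: each intermediate stage is produced explicitly, so a mistake surfaces in the transcript instead of silently corrupting a one-shot answer.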

