Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
  • ArXiv ID: 2512.15372
  • Date: 2025-12-17
  • Authors: Mikel Williams-Lekuona, Georgina Cosma

📝 Abstract

Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.

💡 Deep Analysis

Figure 1: Adaptive computation for vision transformers. Traditional ViTs apply uniform processing to all images, while adaptive approaches route simple images through fewer layers, reducing computational cost.

📄 Full Content

Many vision-language retrieval systems use Vision Transformers (ViT) to align images with text queries, enabling applications from web image search [27] to e-commerce product matching [44]. ViTs process images through multiple layers in sequence, progressively refining their representations [28].

Standard ViT systems use a fixed-depth transformer stack and process all image patches uniformly, regardless of image complexity. For example, a 24-layer ViT-L/14 uses the same 175.33 GFLOPs [16] (giga floating-point operations) of compute whether it is processing a simple product photo or a complex street scene. At web scale, where major platforms handle billions of photos and videos daily [12], this uniform processing can lead to substantial waste of computational resources. Adaptive computation can reduce this inefficiency by allowing simpler images to stop after fewer transformer layers, whilst complex images continue through the full transformer stack.
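To make the routing idea concrete, here is a minimal sketch, in PyTorch, of a ViT-style encoder with a single early-exit point: a per-image complexity score (obtained separately) decides whether to stop at the exit layer or run the full stack. The layer counts, exit depth, threshold, and pooling below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class AdaptiveDepthEncoder(nn.Module):
    """Illustrative ViT-style encoder that can stop at an early-exit layer.

    num_layers, exit_layer, and the routing threshold are placeholders,
    not the configuration used in the paper.
    """
    def __init__(self, dim=768, num_layers=24, exit_layer=12, num_heads=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.exit_layer = exit_layer
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, use_early_exit: bool):
        depth = self.exit_layer if use_early_exit else len(self.blocks)
        x = patch_tokens
        for block in self.blocks[:depth]:   # simple images traverse fewer blocks
            x = block(x)
        return self.norm(x.mean(dim=1))     # pooled image embedding

# Routing: a simple image (low complexity score) exits early,
# a complex image runs the full stack.
encoder = AdaptiveDepthEncoder()
tokens = torch.randn(1, 196, 768)            # 14x14 patches of a 224px image
complexity_score = 0.2                       # e.g. from a complexity classifier
embedding = encoder(tokens, use_early_exit=complexity_score < 0.5)
```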

Adaptive computation for vision-language models faces a fundamental technical challenge in maintaining cross-modal alignment when embeddings are produced from different processing depths. While adaptive computation has been extensively studied for single-modality tasks [37], the adaptive vision-language field remains underexplored. Specifically, no existing work addresses adaptive computation for single-stage image-text retrieval where embeddings from different vision depths remain directly comparable to the text embedding.

This paper proposes Image Complexity-Aware Retrieval (ICAR), which uses image complexity to determine whether an image is routed through fewer transformer layers (early exit) or the full transformer stack (Figure 1). ICAR introduces dual-path training that produces compatible embeddings from both the reduced-compute early-exit path and the full processing path. This enables simple images to use fewer layers whilst complex images process through the full network, all while maintaining cross-modal alignment in a unified embedding space and eliminating the reranking overhead required by existing two-stage approaches.
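As a rough illustration of the dual-path idea, the sketch below (building on the encoder sketch above) computes both an early-exit and a full-depth image embedding for the same batch and ties each to the shared text embeddings with a CLIP-style symmetric contrastive loss, so both paths land in one embedding space. The temperature and loss weighting `alpha` are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text pairs (CLIP-style)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def dual_path_step(encoder, text_encoder, image_tokens, text_tokens, alpha=0.5):
    """One training step: align both the early-exit and the full-depth image
    embeddings with the same text embeddings. alpha is an assumed weighting."""
    text_emb = text_encoder(text_tokens)
    early_emb = encoder(image_tokens, use_early_exit=True)
    full_emb = encoder(image_tokens, use_early_exit=False)
    return (alpha * clip_contrastive_loss(early_emb, text_emb) +
            (1 - alpha) * clip_contrastive_loss(full_emb, text_emb))
```

Because both paths are optimised against the same text embeddings, an image that exits early remains directly comparable to text at retrieval time, which is what removes the need for a second-stage reranker.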

To support these routing decisions, we develop ConvNeXt-IC (ConvNeXt-Image Complexity). While existing methods treat complexity assessment as a representation learning problem requiring specialised architectures, we instead recognise it as a classification task. We evaluate on both established academic benchmarks (IC9600) and real-world web data (LAION), demonstrating that our approach generalises beyond controlled settings. The contributions of this paper are as follows:

- Adaptive computation for single-stage image-text retrieval: We introduce ICAR, which maintains cross-modal embedding compatibility across vision depths, enabling direct text matching without reranking overhead.
- Reconceptualising complexity assessment as classification: We show that image complexity assessment is better suited as a classification task than representation learning, using modern backbones instead of specialised complexity assessment architectures.
- Efficiency-quality trade-off in retrieval: We show that early exits can reduce image encoding compute with minimal impact on retrieval quality, enabled by dual-path training.
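To illustrate the classification framing behind ConvNeXt-IC (second contribution above), the sketch below fine-tunes an off-the-shelf ConvNeXt backbone from the timm library with an ordinary classification head over discretised complexity levels, then maps class probabilities to a scalar score for routing. The backbone variant, number of levels, and probability-to-score mapping are assumptions for illustration rather than the paper's exact setup.

```python
import torch
import timm

# Assumed setup: complexity discretised into 5 ordinal levels (an illustrative
# choice); any modern classifier backbone could stand in for ConvNeXt.
NUM_LEVELS = 5
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=NUM_LEVELS)

def complexity_score(images: torch.Tensor) -> torch.Tensor:
    """Map class probabilities to a scalar complexity score in [0, 1]."""
    probs = model(images).softmax(dim=-1)                       # (B, NUM_LEVELS)
    levels = torch.linspace(0, 1, NUM_LEVELS, device=probs.device)
    return probs @ levels                                       # expected level

# Training is ordinary cross-entropy against human complexity labels.
criterion = torch.nn.CrossEntropyLoss()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, NUM_LEVELS, (4,))
loss = criterion(model(images), labels)
```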

This paper is organised as follows: Section 2 discusses related work in image complexity assessment and adaptive computation. Section 3 presents our ConvNeXt-IC complexity detection method with benchmark evaluation on established benchmarks and real-world data. Section 4 details the ICAR architecture and dual-path training approach. Section 5 presents experimental results and analysis of retrieval performance and efficiency gains. Finally, Section 6 concludes with key findings and future directions.

Image complexity assessment determines how difficult an image is for humans to understand or describe [33]. Psychology research identified key factors: object count, spatial relationships, visual diversity, and background richness [10,25,26]. Early computational approaches used hand-crafted features like entropy and compression ratios [23,31,34], but these worked only on small datasets.

Current deep learning methods treat complexity assessment as a representation learning problem. ICNet [9] uses a dual-branch architecture combining detail and context pathways with spatial attention modules. ICCORN [13] applies ordinal regression to ICNet features, explicitly modelling the ordinal structure of complexity ratings. MICM [19] introduces motion-inspired assessment using image-to-video generation with hierarchical alignment losses. CLICv2 [21] uses self-supervised pretraining on complexity-related visual patterns.

Related image quality assessment methods explore similar perceptual challenges but focus on quality rather than complexity. HyperIQA [35] employs
