LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x. See https://github.com/antofuller/lookwhere for the code and weights.


💡 Research Summary

LookWhere tackles the quadratic token‑growth problem of Vision Transformers (ViTs) on high‑resolution images by explicitly learning two complementary functions: (1) where to compute and (2) what to compute. The method introduces a low‑resolution “selector” network that predicts a spatial importance map from a down‑sampled version of the input. This map is trained by distilling the attention maps of a strong self‑supervised teacher (DINOv2) with a KL‑divergence loss, although in the current implementation the attention loss weight is set to zero. The selector then picks the top‑k highest‑scoring patches from the original high‑resolution image.
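The selection step itself can be sketched in a few lines. This is a minimal illustration, not the paper's API: the function name `select_topk_patches` and the flattened score layout are assumptions.

```python
import numpy as np

def select_topk_patches(scores, k):
    """Return indices of the k highest-scoring patches.

    scores: flattened (H*W,) importance map from the low-res selector
    (hypothetical layout; the paper distills this map from DINOv2
    attention rather than exposing this exact function).
    """
    # argpartition finds the top-k in O(N) without a full sort
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)  # restore spatial order for positional embeddings

scores = np.array([0.1, 0.9, 0.05, 0.7, 0.3, 0.8])
print(select_topk_patches(scores, 3))  # -> [1 3 5]
```

Only these k patch embeddings are then handed to the extractor, so the high-resolution image is never processed in full.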

A high‑resolution “extractor” network receives only those k patches together with the global class and register tokens produced by the selector. The extractor is a ViT‑B that processes the sparse patch tokens and reuses the selector’s global tokens, thereby avoiding a full high‑resolution forward pass. It is trained by distilling the teacher’s class token and patch tokens with mean‑squared‑error losses, forcing the extractor to reconstruct the teacher’s full representation from the limited inputs. Both selector and extractor are initialized from the teacher’s weights, which speeds convergence and strengthens transferability.
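The two MSE distillation targets described above can be sketched as follows. The function name, shapes, and the indexing of teacher patch tokens at the selected positions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def distillation_losses(student_cls, teacher_cls,
                        student_patches, teacher_patches,
                        selected_idx):
    """MSE losses matching the extractor to the teacher.

    student_cls / teacher_cls: (D,) class-token embeddings.
    student_patches: (k, D) extractor outputs for the selected patches.
    teacher_patches: (N, D) teacher outputs for all patches; only the
    selected positions are compared (an assumption of this sketch).
    """
    cls_loss = np.mean((student_cls - teacher_cls) ** 2)
    patch_loss = np.mean(
        (student_patches - teacher_patches[selected_idx]) ** 2)
    return cls_loss, patch_loss

# Toy check: identical class tokens give zero class loss; all-zero
# student patches against all-one teacher patches give a loss of 1.0.
c, p = distillation_losses(np.ones(4), np.ones(4),
                           np.zeros((2, 4)), np.ones((6, 4)),
                           np.array([0, 3]))
```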

Training proceeds in a single self‑supervised pre‑training phase: the teacher processes the high‑resolution image once to produce attention maps and token embeddings, and the selector and extractor are jointly optimized on the three losses (class, patch, attention). After this phase, the selector is frozen and reused across downstream tasks. For any task (e.g., ImageNet classification, ADE20K segmentation, traffic‑sign recognition), only a lightweight predictor on top of the extractor is fine‑tuned. This design yields a highly efficient pipeline: the selector’s low‑resolution computation is negligible, and the extractor’s cost scales with the number of selected patches k rather than with the full high‑resolution token count, dramatically reducing FLOPs and inference latency.
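A back-of-the-envelope illustration of why the cost drops: self-attention grows quadratically with token count, so replacing one full high-resolution pass with a low-resolution selector pass plus a k-token extractor pass shrinks the attention term dramatically. The token counts below are assumptions for illustration, not the paper's exact configurations, and real FLOPs also include MLP and projection terms, which is why the reported end-to-end savings (up to 34×) are smaller than this attention-only ratio:

```python
def attention_cost(n_tokens: int) -> int:
    """Rough proxy: pairwise self-attention scales as n_tokens**2."""
    return n_tokens ** 2

# Illustrative counts: a 1024px image at 16px patches yields 4096 tokens;
# a 224px selector input yields 196 tokens; the extractor sees k = 128.
full_pass = attention_cost(4096)
lookwhere_pass = attention_cost(196) + attention_cost(128)
print(f"attention-only ratio: {full_pass / lookwhere_pass:.0f}x")  # ~306x
```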

Empirical results span three benchmark families. On ImageNet‑1K, LookWhere (k = 128) achieves 80.3 % top‑1 accuracy, surpassing several state‑of‑the‑art token‑reduction and token‑selection methods, while cutting FLOPs from 23.6 G to 3.8 G (≈6× reduction) and increasing throughput from 3.2 to 9.5 images/second. On ADE20K segmentation, increasing k from 512 to 1024 improves mIoU from 40.6 % to 44.6 % while keeping FLOPs under 33 G and achieving a 1.4× speed‑up over the full‑resolution teacher. The most striking gains appear on high‑resolution tasks: on a Swedish traffic‑sign dataset with images larger than 1000 px, LookWhere reduces FLOPs by up to 34× and inference time by 6× with negligible loss in recognition accuracy, demonstrating the method’s suitability for “sparse recognition” scenarios where only a few regions carry the discriminative signal.

Ablation studies confirm that (i) initializing from the teacher is crucial for rapid convergence, (ii) the joint “what‑where” distillation outperforms training selector and extractor separately, and (iii) the selector generalizes across tasks, as visualizations show consistent patch selection for both fine‑grained bird classification and traffic‑sign detection. The paper also discusses limitations: the current omission of attention distillation (λ_map = 0) may leave some information on the table, and k is fixed per model rather than dynamically adapted per image.

Overall, LookWhere presents a novel, simple, and effective framework for adaptive computation in ViTs. By decoupling location prediction (the low‑resolution selector) from content extraction (the high‑resolution extractor) and training both via self‑supervised teacher distillation, it achieves substantial computational savings without sacrificing accuracy, and in some cases even improves it. The approach opens avenues for dynamic token selection, multi‑scale selectors, and broader use of diverse self‑supervised teachers, promising practical high‑resolution vision systems for real‑time and resource‑constrained applications.

