Renaissance: Investigating the Pretraining of Vision-Language Encoders

Notice: This research summary and analysis were generated automatically using AI. For exact wording, please refer to the original arXiv source.

In the past several years, there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models, and the limited programming tools available for VL modeling make conducting research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that we can save significant compute, at little to no cost to downstream performance, by freezing large parts of VL models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a great deal of flexibility in creating, training, and evaluating transformer encoders for VL modeling. Its source code is publicly available at https://github.com/bsu-slim/renaissance.


💡 Research Summary

The paper addresses a notable gap in vision‑language (VL) research: systematic investigation of encoder‑only models, which have been largely overlooked in favor of large generative decoders. To enable reproducible and flexible experimentation, the authors introduce Renaissance, an open‑source framework that integrates with the HuggingFace hub and supports a wide range of configuration options. Users can assemble either a one‑tower architecture (single transformer that processes both text tokens and image patch embeddings) or a two‑tower architecture (separate text and vision encoders linked by cross‑modal attention). The framework allows easy swapping of pretrained text models (e.g., BERT, ELECTRA) and vision models (e.g., ViT, DeiT, DINO), manual specification of hidden dimensions, and, crucially, selective freezing of any module during pretraining.
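The two layouts described above can be made concrete with a minimal PyTorch sketch. This is an illustrative assumption about the general architecture pattern, not code from Renaissance itself: all class names, dimensions, and the choice of having text attend to vision in the two-tower fusion step are hypothetical.

```python
import torch
from torch import nn

class OneTower(nn.Module):
    """A single transformer jointly processes text tokens and image patches."""
    def __init__(self, dim=32, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_emb, patch_emb):
        # Concatenate the two modalities along the sequence axis.
        return self.encoder(torch.cat([text_emb, patch_emb], dim=1))

class TwoTower(nn.Module):
    """Separate text and vision encoders linked by cross-modal attention."""
    def __init__(self, dim=32, n_heads=4, n_layers=2):
        super().__init__()
        def make_tower():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_encoder = make_tower()
        self.vision_encoder = make_tower()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_emb, patch_emb):
        t = self.text_encoder(text_emb)
        v = self.vision_encoder(patch_emb)
        fused, _ = self.cross_attn(query=t, key=v, value=v)  # text attends to vision
        return fused

text = torch.randn(2, 8, 32)      # (batch, text tokens, hidden dim)
patches = torch.randn(2, 16, 32)  # (batch, image patches, hidden dim)
print(OneTower()(text, patches).shape)  # joint sequence: (2, 24, 32)
print(TwoTower()(text, patches).shape)  # text-length output: (2, 8, 32)
```

The practical difference is visible in the output shapes: the one-tower model produces one fused sequence over all tokens and patches, while the two-tower model keeps the modalities separate until an explicit fusion step.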

Two families of experiments are conducted. Experiment 1 investigates the computational-efficiency trade-off of freezing parts of a two-tower encoder during the costly pretraining phase. Freezing the vision module yields negligible performance loss, and sometimes slight gains, on downstream benchmarks (NLVR2, SNLI-VE, RefCOCO, image-text retrieval, VQA), while reducing overall training time and GPU usage by roughly 30-40%. Freezing the text module alone incurs a modest performance dip, and freezing both modules leads to a small accuracy loss but still offers substantial resource savings. These results suggest that pretrained visual backbones already capture rich features that need little further adaptation in VL contexts.
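The freezing strategy behind these savings can be sketched in plain PyTorch. This is a generic sketch, not Renaissance's actual API; the module names and tiny `nn.Linear` stand-ins are illustrative assumptions. The key mechanism is setting `requires_grad = False` on a tower, which removes its weights from gradient computation and optimizer state.

```python
import torch
from torch import nn

# Stand-in for a VL encoder with independently freezable modules.
# Real towers would be full transformers; Linear layers keep the sketch small.
model = nn.ModuleDict({
    "text_encoder":   nn.Linear(32, 32),
    "vision_encoder": nn.Linear(32, 32),
    "fusion_head":    nn.Linear(32, 2),
})

def freeze(module: nn.Module) -> None:
    """Exclude a module's weights from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes dropout / normalization statistics

freeze(model["vision_encoder"])

# Pass only trainable parameters to the optimizer: frozen towers then cost
# no optimizer state and no weight gradients during the backward pass.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in trainable)
print(f"trainable parameters: {n_train}/{n_total}")  # 1122/2178
```

In this toy model, freezing the vision tower removes roughly half the parameters from training; in a real two-tower encoder, where the vision backbone is often the largest component, the saved backward-pass compute is what produces the reported 30-40% reduction.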

Experiment 2 compares one‑tower versus two‑tower designs and examines the effect of weight initialization. Surprisingly, a one‑tower encoder trained from random weights outperforms its counterpart initialized from pretrained text or vision weights. The authors attribute this to the fact that a single transformer can learn a joint multimodal embedding space from scratch without being constrained by mismatched pretrained representations. Conversely, the two‑tower model benefits from pretrained encoders, achieving stable performance but suffering when both towers are randomly initialized.

The paper’s contributions are fourfold: (1) the release of Renaissance, a highly configurable VL modeling toolkit; (2) empirical evidence that freezing the visual encoder during pretraining can dramatically cut compute without harming downstream results; (3) the insight that one‑tower encoders may be better trained from scratch, especially when compute resources are limited; and (4) a comprehensive benchmark suite covering five downstream tasks (NLVR2, SNLI‑VE, RefCOCO, multimodal retrieval on MS‑COCO/Flickr30k, and VQA) and four pretraining datasets (Visual Genome, MS‑COCO, Conceptual Captions, SBU Captions).

Limitations are acknowledged: the current version does not fully support convolutional vision backbones (e.g., ResNet) for one‑tower models, and the study does not directly compare encoder‑only models against state‑of‑the‑art generative VL systems. Future work will extend support for additional vision architectures, explore massive-scale pretraining, and investigate parameter‑sharing strategies to further improve efficiency.

In summary, Renaissance provides a practical, extensible platform for VL encoder research and demonstrates that careful architectural choices—freezing visual modules and, for single‑tower designs, training from scratch—can yield competitive performance while substantially reducing computational demands. This work paves the way for broader adoption of lightweight VL encoders in both academic and industry settings where resources are constrained.

