ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on such large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates multimodal data selection as a way to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation, while training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity in the number of samples that eliminates the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended to by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies the samples whose representations best approximate the dominant subspace of the full dataset's representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of full-dataset training performance using only 16% of the data, and even outperforms full-data training in some settings. The code is available at \href{https://github.com/ChangtiWu/ScalSelect}{ScalSelect}.
💡 Research Summary
ScalSelect addresses the growing computational burden of Visual Instruction Tuning (VIT) by offering a training‑free, scalable data‑selection pipeline that works directly with the target vision‑language model (VLM). The method consists of two key stages. First, it builds a sample‑level representation conditioned on the instruction. To do this, ScalSelect extracts the attention matrix from the first transformer layer of the VLM's language‑model component, aggregates the attention scores that user‑instruction tokens assign to each visual token, ranks the visual tokens by this aggregated score, and keeps the smallest set of top‑ranked tokens whose cumulative attention exceeds a preset threshold τ (e.g., 0.9). The embeddings of these selected visual tokens are pooled to form an “Instruction‑Conditioned Early Representation” for the sample. This representation captures exactly the visual information the model deems relevant to the given instruction, without requiring any external encoder or additional training.
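The first stage can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' released code: the function name, the use of a single averaged attention matrix, and mean pooling over the selected tokens are all simplifications of the paper's description.

```python
import numpy as np

def instruction_conditioned_repr(attn, hidden, vis_idx, instr_idx, tau=0.9):
    """Sketch of ScalSelect's first stage (naming and pooling are our assumptions).

    attn:      (seq, seq) attention matrix from the VLM's first transformer layer
    hidden:    (seq, d) token embeddings from the same layer
    vis_idx:   index array of the visual tokens in the sequence
    instr_idx: index array of the user-instruction tokens
    tau:       cumulative-attention threshold (the summary suggests ~0.9)
    """
    # Aggregate the attention each visual token receives from instruction tokens.
    scores = attn[np.ix_(instr_idx, vis_idx)].sum(axis=0)
    scores = scores / scores.sum()                 # normalize to attention shares
    order = np.argsort(scores)[::-1]               # rank visual tokens by share
    cum = np.cumsum(scores[order])
    k = int(np.searchsorted(cum, tau)) + 1         # smallest prefix reaching tau
    top = vis_idx[order[:k]]
    # Pool the selected visual-token embeddings into one sample representation.
    return hidden[top].mean(axis=0)
```

Because the attention matrix and hidden states come from a single forward pass through the target VLM's first layer, the representation is obtained without any backpropagation or auxiliary model.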
Second, ScalSelect treats the whole dataset as a matrix R of these representations and seeks to preserve the dominant low‑rank subspace of R. Using a fast singular‑value decomposition (or randomized PCA) it computes the top‑k singular vectors U_k that span the most energetic directions of the data. Each sample’s contribution to this subspace is measured by the squared norm of its projection onto U_k, i.e., ‖U_kᵀ r_i‖². Samples are then ranked by this contribution score, and the top B % (where B is the user‑specified budget) are retained. Because the scoring requires only matrix‑vector multiplications, the overall complexity is O(N·d·k), linear in the number of samples N, in stark contrast to the quadratic cost of pairwise similarity methods.
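The second stage can be sketched in a few lines. Again this is an illustrative reconstruction from the summary, not the official implementation: it uses a full SVD for clarity where the paper mentions a fast or randomized decomposition, and the function name and budget handling are our assumptions.

```python
import numpy as np

def subspace_select(R, k, budget_frac):
    """Sketch of ScalSelect's second stage (our naming, not the released code).

    R:           (N, d) matrix stacking the per-sample representations r_i
    k:           rank of the dominant subspace to preserve
    budget_frac: fraction of samples to keep (e.g. 0.16 for a 16% budget)
    Returns the indices of selected samples, ranked by contribution score.
    """
    N = R.shape[0]
    # The top-k right singular vectors of R span the dominant subspace U_k.
    # (A randomized SVD/PCA would scale better; full SVD keeps the sketch short.)
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Uk = Vt[:k].T                               # (d, k) orthonormal basis
    # Contribution score per sample: ||U_k^T r_i||^2, via one (N,d)x(d,k) product.
    scores = np.square(R @ Uk).sum(axis=1)
    budget = max(1, int(round(budget_frac * N)))
    return np.argsort(scores)[::-1][:budget]
```

Once `U_k` is fixed, scoring all samples is a single matrix product followed by a row-wise squared norm, which is where the O(N·d·k) linear-in-N cost comes from.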
The authors evaluate ScalSelect on several state‑of‑the‑art VLMs—including LLaVA‑Vicuna‑7B, Qwen3‑VL, and MiniGPT‑4—across multiple instruction‑tuning corpora (LLaVA‑Instruct, MiniGPT‑4 data, and a custom multi‑domain set). They vary the selection budget from 1 % to 30 % and measure downstream performance on benchmarks covering visual question answering, OCR, diagram understanding, and multimodal reasoning. With a 16 % budget, ScalSelect achieves on average 97.5 % of the full‑data performance, and in some Qwen3‑VL experiments it even surpasses the full‑data baseline by a small margin. Ablation studies reveal that (1) removing the instruction‑conditioned attention step drops performance by 5–8 %, (2) under‑ or over‑estimating the subspace dimension k harms the selection quality, and (3) the threshold τ is robust within a reasonable range (0.7–0.95). Visualizations of the selected subset’s representation space (t‑SNE) show that the method preserves the original data’s cluster structure, confirming that the global subspace is well maintained.
Key contributions are: (i) a novel way to extract instruction‑aware visual features directly from the target VLM’s early layer, (ii) a global subspace‑preserving selection criterion that eliminates pairwise comparisons, (iii) a linear‑time algorithm that scales to millions of multimodal samples, and (iv) extensive empirical validation demonstrating near‑full‑data performance with a fraction of the data.
Limitations include reliance on the first transformer layer (later layers might provide complementary signals), the need to manually set τ and the subspace rank k for each dataset, and the current focus on image‑instruction pairs (extension to video, audio, or longer conversational contexts remains open). Future work could explore multi‑layer attention fusion, adaptive hyper‑parameter tuning, and broader multimodal extensions.
In summary, ScalSelect provides a practical, cost‑effective solution for large‑scale visual instruction tuning, showing that careful, instruction‑aware representation coupled with global subspace preservation can dramatically reduce training data without sacrificing—and sometimes even improving—model performance.