Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights
Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or expensive influence-based computations – all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple, geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradient and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids the need for validation datasets, and incurs minimal computational overhead. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
💡 Research Summary
The paper introduces a novel data‑quality metric called the Mimic Score, which quantifies the utility of each training sample by measuring how well its gradient aligns with a “desired” direction in weight space. The desired direction is defined as the vector from the current model parameters θₜ to the parameters of a pre‑trained reference model θ_ref, which is assumed to reside in a lower‑loss region of the parameter landscape. For a sample sᵢ, the negative gradient −gᵢ,ₜ is projected onto this vector and normalized by the vector’s norm, yielding mᵢ,ₜ = ⟨−gᵢ,ₜ, θ_ref − θₜ⟩ / ‖θ_ref − θₜ‖. A high Mimic Score indicates that the sample would move the model toward the reference weights, while a low score suggests the opposite.
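The score above is just a projection of the negative per-sample gradient onto the (unit-normalized) direction from the current weights to the reference weights. A minimal sketch with NumPy (the function name `mimic_score` and the toy vectors are illustrative, not from the paper):

```python
import numpy as np

def mimic_score(grad, theta_t, theta_ref):
    """m_i = <-g_i, theta_ref - theta_t> / ||theta_ref - theta_t||.

    grad:      flattened per-sample gradient g_i at step t
    theta_t:   flattened current model parameters
    theta_ref: flattened pre-trained reference parameters
    """
    direction = theta_ref - theta_t
    return float(np.dot(-grad, direction) / np.linalg.norm(direction))

# Toy example: a gradient whose negation points toward the reference
# gets a high score; one pointing away gets a negative score.
theta_t = np.zeros(3)
theta_ref = np.array([1.0, 0.0, 0.0])
g_toward = np.array([-1.0, 0.0, 0.0])  # -g aligns with theta_ref - theta_t
g_away = np.array([1.0, 0.0, 0.0])
print(mimic_score(g_toward, theta_t, theta_ref))  # 1.0
print(mimic_score(g_away, theta_t, theta_ref))    # -1.0
```

In practice the dot product can be restricted to a subset of layers to cut cost, at some loss of signal fidelity, as the limitations section notes.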
Building on this metric, the authors propose Grad‑Mimic, a two‑stage framework. In the online stage, Mimic Scores are computed for every sample in a mini‑batch, soft‑maxed with a temperature τ to obtain normalized weights, and then used to re‑weight the per‑sample gradients before the parameter update. Lower τ makes the re‑weighting more aggressive, focusing learning on the most aligned samples; higher τ approaches standard SGD. In the offline stage, Mimic Scores accumulated over the entire training run are binarized (via a fixed threshold, 1‑D clustering, or top‑k selection) and combined across steps using weak‑supervision techniques (e.g., Snorkel) to produce a final data filter. This filter can be applied to future training runs or used as a dataset‑quality estimator.
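The online stage can be sketched as follows: compute a mimic score per sample, softmax the scores with temperature τ, and take a step along the weight-averaged gradient. This is an illustrative NumPy sketch under simplifying assumptions (flattened parameters, explicit per-sample gradients); the function name and signature are hypothetical:

```python
import numpy as np

def grad_mimic_step(per_sample_grads, theta_t, theta_ref, lr=0.1, tau=1.0):
    """One re-weighted update. per_sample_grads has shape (batch, dim).

    Returns the updated parameters and the softmax weights; lower tau
    concentrates weight on the best-aligned samples, higher tau
    approaches a plain averaged-gradient (SGD-like) step.
    """
    direction = theta_ref - theta_t
    direction = direction / np.linalg.norm(direction)
    scores = -(per_sample_grads @ direction)          # m_i per sample
    w = np.exp((scores - scores.max()) / tau)         # stable softmax
    w = w / w.sum()
    weighted_grad = w @ per_sample_grads
    return theta_t - lr * weighted_grad, w

# Two samples: one whose negative gradient points toward theta_ref,
# one pointing away. With small tau, the aligned sample dominates.
grads = np.array([[1.0, 0.0], [-1.0, 0.0]])
theta_new, w = grad_mimic_step(grads, np.zeros(2),
                               np.array([1.0, 0.0]), tau=0.1)
print(w)  # second sample receives nearly all the weight
```

The offline stage then binarizes these accumulated scores (fixed threshold, 1-D clustering, or top-k) and aggregates the per-step decisions with weak supervision, which is not sketched here.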
The authors conduct extensive experiments. In a controlled setting, they inject 40‑60 % label noise into six image classification datasets (DTD, Flowers102, STL‑10, Oxford‑IIIT Pets, CIFAR‑10, CIFAR‑100). Mimic Scores reliably identify noisy samples, achieving a Pearson correlation of 0.903 between aggregated scores and overall dataset quality. When used for online re‑weighting, Grad‑Mimic improves data efficiency: models converge 2‑3× faster in wall‑clock time and achieve higher accuracy than baseline SGD under the same compute budget.
On large‑scale web‑crawled data (10 M‑100 M samples) they train CLIP‑style vision‑language models from scratch. Using publicly available CLIP weights as the reference, Grad‑Mimic reduces the number of training steps required for convergence by 20.7 %. Moreover, the offline filter constructed from Mimic Scores complements existing filtering pipelines (semantic similarity, deduplication, specialized filtering networks), allowing the authors to drop an additional 4.7 M samples without sacrificing performance.
Computationally, Mimic Score calculation requires only a dot product between the gradient and the reference direction, incurring far lower GPU memory and runtime overhead than influence‑function‑based or loss‑difference methods. The paper also provides a theoretical analysis (Appendix) showing conditions under which the re‑weighting scheme converges faster on noisy data.
Key strengths of the approach include: (1) no need for a held‑out validation set, (2) reliance solely on publicly released model weights, (3) minimal extra computation, and (4) flexibility to be used both during training and for post‑hoc dataset curation. Limitations are acknowledged: the quality and bias of the reference model directly affect the scores; the choice of which layer(s) to use for θ_ref can trade off computational cost against signal fidelity; and hyper‑parameters such as temperature τ and binarization thresholds require tuning per domain.
The authors suggest future directions such as ensembling multiple reference models to mitigate bias, dynamically adapting τ during training, extending the method to multimodal modalities beyond images, and exploring “self‑mimic” schemes where the reference model is updated iteratively.
In summary, Mimic Score and the Grad‑Mimic framework offer a simple yet powerful geometry‑based tool for data selection, delivering significant gains in training efficiency, robustness to noisy data, and scalability to massive web‑scale corpora, while keeping computational overhead low.