SkillRater: Untangling Capabilities in Multimodal Data
Data curation methods typically assign samples a single quality score. We argue this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signal for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters - one per capability, each trained via meta-learning on a disjoint validation objective - and composes their scores through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision-language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held-out benchmarks. The learned rater signals are near-orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
💡 Research Summary
The paper “SkillRater: Untangling Capabilities in Multimodal Data” challenges the prevailing practice of assigning a single scalar quality score to each training example in multimodal pre‑training. The authors argue that when a model must acquire several distinct capabilities—visual understanding, optical‑character‑recognition (OCR), and STEM reasoning—a monolithic scorer inevitably trades off signal for one capability against another, limiting overall performance. To address this, they propose SkillRater, a framework that decomposes data curation into multiple capability‑specific raters, each trained via bilevel meta‑learning on a disjoint validation set that directly measures the target capability.
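The bilevel idea - an inner loop that trains the model on rater-weighted data, and an outer loop that updates the rater to reduce a capability-specific validation loss - can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration: the data (one clean source, one label-noise source), the one-parameter linear model, and the finite-difference meta-gradient all stand in for the paper's actual DataRater-style formulation and are not its implementation.

```python
import math
import random

random.seed(0)

# Toy setup: source 0 is clean (y = 2x), source 1 has random labels.
# A per-source rater logit is meta-learned so that down-weighting the
# noisy source reduces the downstream model's validation loss.
def make_data(n):
    data = []
    for _ in range(n):
        src = random.randrange(2)
        x = random.uniform(-1, 1)
        y = 2 * x if src == 0 else random.uniform(-2, 2)
        data.append((src, x, y))
    return data

train = make_data(200)
val = [(0, x, 2 * x) for x in (random.uniform(-1, 1) for _ in range(50))]

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def inner_train(rater_logits, steps=200, lr=0.1):
    """Inner loop: fit model weight w on the rater-weighted squared loss."""
    w = 0.0
    for i in range(steps):
        src, x, y = train[i % len(train)]
        weight = sigmoid(rater_logits[src])  # rater's weight for this sample
        w -= lr * 2 * weight * (w * x - y) * x
    return w

def val_loss(w):
    """Outer objective: loss on a clean, capability-specific validation set."""
    return sum((w * x - y) ** 2 for _, x, y in val) / len(val)

# Outer loop: crude finite-difference meta-gradient on the rater logits
# (a stand-in for backpropagating through the inner updates).
logits = [0.0, 0.0]
eps, meta_lr = 0.1, 1.0
for _ in range(20):
    for c in range(2):
        up, down = list(logits), list(logits)
        up[c] += eps
        down[c] -= eps
        g = (val_loss(inner_train(up)) - val_loss(inner_train(down))) / (2 * eps)
        logits[c] -= meta_lr * g

print("clean-source weight:", sigmoid(logits[0]))
print("noisy-source weight:", sigmoid(logits[1]))
```

After meta-training, the rater assigns the clean source a higher weight than the noisy one, which is the qualitative behavior the bilevel objective is meant to produce.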
The technical core builds on DataRater’s bilevel formulation. For each capability c, a rater with parameters ϕ_c maps a multimodal example z to a weight w_z = r_{ϕ_c}(z) ∈ [0, 1].
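The progressive selection rule from the abstract - keep a sample if *any* capability rater scores it above a stage-wise threshold that tightens over time - is straightforward to sketch. The function name, the example scores, and the threshold schedule below are illustrative assumptions, not values from the paper.

```python
def skillrater_select(sample_scores, threshold):
    """Keep a sample if ANY capability rater scores it above the threshold.

    sample_scores: per-capability rater scores in [0, 1],
                   e.g. [visual, OCR, STEM] (ordering is illustrative).
    threshold: scalar cutoff, raised ("tightened") at each training stage.
    """
    return any(s >= threshold for s in sample_scores)

# A sample that is strong only on one capability (here OCR) survives the
# loose early stages but is dropped once the threshold tightens past 0.7.
scores = [0.3, 0.7, 0.2]  # illustrative rater scores for one sample
for tau in (0.2, 0.5, 0.8):
    print(f"tau={tau}: keep={skillrater_select(scores, tau)}")
```

The `any(...)` composition is what preserves diversity early: a sample needs to be valuable for only one capability to be retained, and only as the threshold rises does selection concentrate on samples that some rater considers high-value.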