The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs
Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO entirely with the predictor, effectively amortizing the cost of full training runs. We study JoBS’s average regret and devise the optimal budget allocation to minimize regret. JoBS outperforms existing multi-fidelity BO baselines, as well as data and model optimization approaches, across diverse LLM tasks under the same optimization budget.
💡 Research Summary
The paper tackles a fundamental “chicken‑and‑egg” problem in large‑scale language model (LLM) training: the optimal data mixture and the optimal model configuration (e.g., PEFT settings) depend on each other, making joint optimization seemingly intractable. The authors propose JoBS (Joint Bayesian Optimization with a Scaling‑law‑inspired predictor), a framework that combines a cheap performance predictor with Bayesian optimization (BO) to efficiently explore the joint space of data and model configurations under a fixed computational budget.
Problem formulation
Given a data mixture X (a point on an N‑dimensional simplex) and a model configuration M (LoRA layer, rank, α, dropout, etc.), the fine‑tuned LLM after B training steps yields a performance metric L(θ_{X,M,B}). The goal is to maximize L over (X, M) while respecting a total budget C (measured in training steps). Direct BO would require a full‑training run (B steps) for each BO iteration, limiting the number of evaluations to C/B.
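To make the search space and the budget arithmetic concrete, here is a minimal stdlib-only sketch. The mixture sampler and the LoRA-style configuration knobs are illustrative choices, not the paper's actual search space:

```python
import random

def sample_data_mixture(n_sources: int) -> list[float]:
    """Sample a point on the N-dimensional simplex (a data mixture X)
    via normalized Gamma(1, 1) draws, i.e., a uniform Dirichlet sample."""
    draws = [random.gammavariate(1.0, 1.0) for _ in range(n_sources)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_model_config() -> dict:
    """Sample a model configuration M (illustrative LoRA-style knobs)."""
    return {
        "lora_rank": random.choice([4, 8, 16, 32]),
        "lora_alpha": random.choice([8, 16, 32]),
        "dropout": random.choice([0.0, 0.05, 0.1]),
    }

# With a total budget of C training steps and B steps per full run,
# direct BO is limited to C // B full-fidelity evaluations.
C, B = 100_000, 5_000
max_full_evals = C // B  # only 20 evaluations of the joint space

X = sample_data_mixture(n_sources=4)
M = sample_model_config()
assert abs(sum(X) - 1.0) < 1e-9  # X lies on the simplex
```

The point of the arithmetic is the bottleneck JoBS targets: at full fidelity, a realistic step budget buys only a handful of probes of the joint (X, M) space.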
Key idea
Instead of evaluating the true objective at full fidelity each time, JoBS first spends a fraction of the budget on a set of full‑training runs (B steps) covering diverse (X, M) points. These runs provide labels for training a neural‑network performance predictor f̂ that maps (X, M, b) → estimated final performance, where b ≪ B (e.g., 100 steps). The predictor is inspired by scaling laws but is learned end‑to‑end, allowing it to generalize across many data‑model combinations. Once trained, the predictor is used as a cheap surrogate: each subsequent BO iteration runs only b steps, feeds the intermediate result to f̂, and treats the output as a noisy observation of the true objective. The GP model in BO naturally absorbs the predictor’s error as observation noise.
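The two-phase structure can be sketched in a toy, stdlib-only form. Everything here is a stand-in: the `train` objective is synthetic, the predictor f̂ is reduced to a 1-D least-squares fit from the b-step metric to the final metric, and random search replaces the GP-based BO acquisition:

```python
import random

random.seed(0)

B, b = 5_000, 100   # full vs. cheap fidelity (training steps)
C = 100_000         # total step budget
n_full = 8          # phase 1: configs given a full B-step run

def train(x_weight: float, rank: int, steps: int) -> float:
    """Toy stand-in for fine-tuning: returns a performance metric after
    `steps` steps. The early metric is a noisy, shrunken version of the
    final one, so it carries a weak signal about the final outcome."""
    final = 1.0 - (x_weight - 0.6) ** 2 - 0.01 * abs(rank - 16)
    progress = steps / B
    return progress * final + random.gauss(0.0, 0.002)

# Phase 1: full runs provide (early metric, final performance) labels.
early, final = [], []
for _ in range(n_full):
    cfg = (random.random(), random.choice([4, 8, 16, 32]))
    early.append(train(*cfg, steps=b))
    final.append(train(*cfg, steps=B))

# Fit the surrogate predictor: early metric -> estimated final metric.
n = len(early)
mx, my = sum(early) / n, sum(final) / n
slope = sum((e - mx) * (f - my) for e, f in zip(early, final)) / \
        max(sum((e - mx) ** 2 for e in early), 1e-12)
intercept = my - slope * mx

def predict(early_metric: float) -> float:
    return slope * early_metric + intercept

# Phase 2: spend the remaining budget on cheap b-step probes scored by
# the predictor; its output is treated as a noisy objective observation.
remaining = C - n_full * B
best_cfg, best_pred = None, float("-inf")
for _ in range(remaining // b):
    cfg = (random.random(), random.choice([4, 8, 16, 32]))
    score = predict(train(*cfg, steps=b))
    if score > best_pred:
        best_cfg, best_pred = cfg, score
```

The budget arithmetic is the interesting part: 8 full runs consume 40,000 steps, while the remaining 60,000 steps fund 600 cheap probes, versus only 20 total evaluations if every probe cost B steps.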
Theoretical analysis
The authors analyze the average regret R_T = (1/T)∑_{t=1}^T [max_{X,M} L(θ_{X,M,B}) − L(θ_{X_t,M_t,B})] over T BO iterations, and derive the budget allocation between predictor training and BO iterations that minimizes it.
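The budget accounting underlying this trade-off can be sketched as follows (an illustrative rearrangement of the quantities defined above, not the paper's derivation):

```latex
% With total budget C, n full runs of B steps each, and cheap probes
% of b steps, the number of BO iterations T satisfies
C = nB + Tb \quad\Longrightarrow\quad T = \frac{C - nB}{b}.
% A larger n yields a more accurate predictor (lower observation noise
% for the GP), while a larger T yields more BO iterations; the optimal
% allocation balances these two contributions to the average regret.
```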