HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimizes data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM’s training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
💡 Research Summary
The paper tackles a fundamental problem in fine‑tuning large language models (LLMs): when a mixture of heterogeneous datasets is used, both global imbalance (different dataset sizes) and local heterogeneity (varying difficulty or quality within a single dataset) can severely degrade performance. Existing approaches typically address only the global aspect, either by static re‑weighting or by dynamic curriculum‑style sampling, and they assume each dataset is internally balanced. To fill this gap, the authors propose Hierarchical Balancing Optimization (HBO), a bilevel optimization framework that simultaneously learns how to allocate data across datasets (global level) and within each dataset (local level).
Core idea and architecture
HBO introduces two types of “actors”: a Global Actor and a set of Local Actors (one per dataset). The Global Actor learns a probability distribution over the N datasets, while each Local Actor learns a distribution over M_i difficulty groups inside its dataset. Both actors are lightweight two‑layer fully‑connected networks that take a feature vector describing the corresponding sampling unit (e.g., dataset statistics, recent loss trends) as input. The outer loop optimizes the LLM parameters θ on the sampled batches; the inner loop updates the actors’ parameters ψ using policy‑gradient (REINFORCE) based on reward signals derived from the LLM’s current training state.
Reward design
Two distinct rewards guide the actors:
- Global reward R_global(i) = ‖∇_θ L(B_i; θ)‖_2, i.e., the L2 norm of the gradient computed on a batch drawn from dataset i. Larger gradients indicate that the model is still learning from that dataset, prompting the Global Actor to increase its sampling probability.
- Local reward R_local(i, j) = Perplexity_original(i, j) / Perplexity_finetuned(i, j), i.e., the relative reduction in perplexity for difficulty group j within dataset i. A higher ratio means the model has made more progress on that group, encouraging the Local Actor to allocate more samples there.
These rewards capture complementary aspects: the global reward focuses on “where the model still needs learning,” while the local reward focuses on “which difficulty strata are improving fastest.” By scaling the policy‑gradient with these rewards, HBO dynamically shifts sampling toward data that promises the greatest learning gain.
Training algorithm
Algorithm 1 details the training loop: at each step a dataset is sampled according to p_global, then a difficulty group is sampled according to the corresponding p_local, and a minibatch is drawn for standard supervised fine‑tuning (negative log‑likelihood). Every F_global steps the Global Actor is updated using the aggregated R_global(i) over all datasets; every F_local steps each Local Actor is updated using R_local(i, j) over its groups. The authors report that this incurs roughly a 15 % overhead compared with static balanced sampling, which is modest given the performance gains.
Experimental setup
The authors evaluate HBO on three modern LLM backbones—Llama‑3.1‑8B, Qwen2.5‑7B, and EuroLLM‑9B—across nine tasks split into two scenarios:
- Multilingual: MMMLU, XCOPA, XStoryCloze, XNLI, MGSM.
- Multitask: MMLU, MultiFin‑EN, GSM8K, MedMCQA.
Baselines include uniform random sampling, temperature‑scaled static balancing, and recent dynamic methods such as curriculum‑based sampling and RL‑based data selection. All experiments keep hyper‑parameters (learning rate, batch size, update frequencies) constant across methods to ensure a fair comparison.
Results
Across all backbones and tasks, HBO consistently outperforms baselines, achieving average accuracy improvements of 2.3 %–4.1 % (absolute). Gains are especially pronounced on the hardest difficulty groups, confirming that the local actors successfully surface under‑represented challenging examples. Moreover, the inclusion of easy examples—often discarded by curriculum methods—proved beneficial, as HBO learns to allocate a modest proportion of easy samples to stabilize training.
Ablation and analysis
- Removing the global reward or disabling the Local Actors leads to a drop of >1.5 % in accuracy, demonstrating that both levels are essential.
- Visualizations of sampling probabilities reveal a cyclical pattern: early training heavily samples large datasets, then gradually shifts toward smaller or more difficult datasets as gradients shrink.
- Sensitivity to the temperature τ used for initializing the sampling distribution is low; HBO quickly adapts regardless of the starting bias.
- The authors also test robustness to varying data volumes and show that HBO scales gracefully, maintaining its advantage even when the total number of examples is reduced by 50 %.
Limitations and future work
The main overhead stems from computing gradient norms and perplexities for reward estimation, which may become costly on extremely large corpora. The policy networks are simple MLPs; richer architectures (e.g., transformers) could capture more nuanced dataset characteristics. The authors suggest extending HBO with meta‑learning to predict rewards more efficiently, and exploring integration with RLHF pipelines where reward models already exist.
Conclusion
Hierarchical Balancing Optimization offers a principled, data‑driven solution to the dual challenges of global imbalance and local heterogeneity in LLM fine‑tuning. By framing data selection as a bilevel reinforcement learning problem with carefully crafted rewards, HBO automatically discovers effective sampling schedules that improve both overall accuracy and robustness across diverse multilingual and multitask benchmarks. The code is released to encourage further research and practical adoption.
Comments & Academic Discussion
Loading comments...
Leave a Comment