Harmonizing High-Quality Data Curricula with Learning-Rate Schedules Boosts Large Language Model Performance

Reading time: 6 minutes

📝 Original Info

  • Title: Harmonizing High-Quality Data Curricula with Learning-Rate Schedules Boosts Large Language Model Performance
  • ArXiv ID: 2511.18903
  • Date: 2025-11-24
  • Authors: Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

📝 Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
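As a concrete illustration of strategy (2), here is a minimal, framework-free sketch of weighted checkpoint averaging. Parameter tensors are represented as plain Python lists for clarity, and the uniform default weighting is an illustrative assumption, not the paper's exact weighting scheme.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of the final few checkpoints.

    Each checkpoint is a dict mapping parameter name -> list of values.
    Under a constant LR, averaging recent checkpoints can play the role
    that LR decay normally plays at the end of training.
    """
    n = len(checkpoints)
    if weights is None:
        weights = [1.0 / n] * n  # uniform average (illustrative default)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    avg = {}
    for name in checkpoints[0]:
        avg[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return avg
```

In practice the same idea is applied to full model state dicts (e.g., via stochastic-weight-averaging utilities in deep learning frameworks); this sketch only shows the arithmetic.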

💡 Deep Analysis

Deep Dive into "Harmonizing High-Quality Data Curricula with Learning-Rate Schedules Boosts Large Language Model Performance".


📄 Full Content

Large language models (LLMs) are typically trained on massive text corpora collected from the Internet (Dubey et al., 2024; DeepSeek-AI et al., 2024; Yang et al., 2025; OpenAI, 2023), covering a wide range of sources and quality levels. High-quality data plays a crucial role in enhancing model capabilities, but it is usually limited in amount. To address this issue, current LLM pretraining pipelines employ sophisticated data curation procedures to filter out low-quality data and increase the proportion of high-quality data, including rule-based (or heuristic-based) filtering, quality scoring (model-based labeling), and score-based data selection (Su et al., 2025; Li et al., 2024; Penedo et al., 2025; 2023; Weber et al., 2024). Despite these advances, relatively little attention has been given to developing training strategies that more effectively utilize high-quality data during training, rather than only during data curation.

A natural idea to improve the utilization of high-quality data is to use curriculum learning. This is motivated by the catastrophic forgetting problem (McCloskey & Cohen, 1989), the phenomenon that a model may forget knowledge it has learned before when exposed to new data (Dai et al., 2025; Liao et al., 2025). This approach aims to optimize knowledge acquisition by exposing the model to high-quality data in the latter stages of training.

Figure 1: Data curriculum strategies are less effective when combined with learning rate (LR) schedules that decay to a low scale near the end. (a-c) Experiments on a 1.5B-parameter model trained on 30B tokens compare various data curricula (Uniform, Ascending-Order, and Descending-Order by DCLM score (Li et al., 2024)) under constant, Warmup-Stable-Decay (WSD) (Hu et al., 2024; Hägele et al., 2024), and cosine schedules. While curricula improve validation loss over a uniform baseline with a constant LR, this advantage is significantly reduced during the low-LR phase following LR decay. (d) In the data curriculum, high-quality data is placed in the latter phase, which coincides with the LR decaying to a relatively low scale.

One successful curriculum learning strategy is multi-stage pretraining: first training on a data mixture dominated by massive web data, then, in a second stage referred to as mid-training (OLMo et al., 2025; Abdin et al., 2024b), shifting to a mixture that mainly consists of high-quality data. This strategy has been adopted by many recent LLMs, including OLMo 2 (OLMo et al., 2025), Phi-4 (Abdin et al., 2024a), and LongCat-Flash (Team et al., 2025). While this two-phase design is the most common, it can also be extended with more stages (Yiwen et al., 2025; Allal et al., 2025) or followed by long-context extension (Yang et al., 2025).

Another line of work explores curriculum learning at the instance level, where data samples are sorted according to quality scores and presented to the model sequentially (Wettig et al., 2024; Dai et al., 2025; Zhang et al., 2025; Kim & Lee, 2024). We refer to this as the data curriculum. However, these studies mainly investigate different quality metrics and find that simple end-to-end sorting yields limited benefits. Consequently, several works propose alternative strategies such as the folding curriculum (detailed in Section 4), which reorders samples within consecutive phases in an interleaved manner (Dai et al., 2025; Zhang et al., 2025). Despite showing promise, we find that this interleaved approach is fragile: its advantage does not extend to our larger-scale experiments with the DCLM fastText score (Li et al., 2024), a widely used scoring metric (see Section 4).
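The end-to-end instance-level sorting described above can be sketched in a few lines. The quality scores and the uniform-shuffle baseline here are illustrative assumptions; in the paper, scores come from metrics such as the DCLM fastText score.

```python
import random

def ascending_curriculum(samples, scores):
    """End-to-end data curriculum: present samples in ascending order
    of quality score, so the highest-quality data comes last."""
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    return [samples[i] for i in order]

def uniform_order(samples, seed=0):
    """Baseline: random shuffling, i.e., no curriculum."""
    rng = random.Random(seed)
    out = list(samples)
    rng.shuffle(out)
    return out
```

The paper's observation is that `ascending_curriculum` beats `uniform_order` under a constant LR, but the gap shrinks once a standard decaying LR schedule pushes the LR low exactly when the high-quality tail of the curriculum arrives.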

This raises a central question: Why do instance-level curriculum learning strategies often yield limited benefits? This is not simply due to unreliable quality scores: metrics like the QuRating score (proposed and used by Wettig et al. (2024)), the PDS score (proposed by Gu et al. (2025b) and discussed by Dai et al. (2025)), and the DCLM score (Li et al., 2024) are already informative enough to improve training efficiency by guiding high-quality data selection.

Our Contributions. In this paper, we identify a key, yet previously overlooked factor: the incompatibility between the ascending order of data quality and the decaying schedule of learning rate.

As illustrated in Figure 1, if we train an LLM with a constant LR, a data curriculum that sorts data in ascending order of quality can indeed outperform the baseline that trains the model on data in a uniform order. However, when we switch to a more standard LR decay schedule, such as cosine or Warmup-Stable-Decay (WSD) (Loshchilov & Hutter, 2017; Hu et al., 2024) (a schedule with warmup, plateau, and decay phases, see Figure 1(b)), the benefit of the data curriculum diminishes. Moreover, we observe that as the LR decay becomes more aggressive (e.g., a longer decay phase or a lower ending LR), the benefit of the data curriculum diminishes further.
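A WSD-style schedule with warmup, plateau, and decay phases can be sketched as below. The linear warmup/decay shapes and the phase fractions are illustrative assumptions, not the paper's exact settings; the paper's point is that a moderate decay, where `end_lr` is not far below `peak_lr`, keeps the curriculum's high-quality tail from being trained at a near-zero LR.

```python
def wsd_lr(step, total_steps, peak_lr, end_lr,
           warmup_frac=0.1, decay_frac=0.2):
    """Warmup-Stable-Decay schedule: linear warmup to peak_lr,
    a constant plateau, then linear decay from peak_lr to end_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:
        return peak_lr
    # Decay phase: a "moderate" decay keeps end_lr close to peak_lr,
    # so late (high-quality) data still receives meaningful updates.
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (end_lr - peak_lr) * progress
```

Making the decay more aggressive in this sketch means raising `decay_frac` or lowering `end_lr`, which is exactly the regime where the paper finds the curriculum's benefit shrinking.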

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
