AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

The learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across model and dataset sizes. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search by optimizing loss descent velocities. We provide theoretical and experimental analyses showing that the foundation model pretraining loss and its descent velocity are both convex in the learning rate and share the same optimum. Relying solely on training loss dynamics, AdaLRS adds little extra computation to guide the search process, and its convergence is guaranteed by theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS efficiently and effectively adjusts suboptimal learning rates to the neighborhood of the optimum, with model performance improving accordingly. We also show that AdaLRS generalizes robustly across training scenarios, including different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.


💡 Research Summary

The paper introduces AdaLRS, an adaptive learning‑rate search algorithm designed to find the optimal learning rate (LR) for foundation‑model pretraining in a single training run, without the need for costly proxy‑model searches or extensive hyper‑parameter sweeps. The core insight is that both the training loss L(η) and the loss‑descent velocity V(η) are convex functions of the learning rate and share the same optimum η*. The authors first provide a theoretical justification using a simplified SGD setting: the expected loss change per step takes the standard second‑order form E[L(θ_{t+1}) − L(θ_t)] ≈ −η‖∇L(θ_t)‖² + (η²/2)·E[g_tᵀH g_t], a convex quadratic in η with a unique minimizer, so the loss and its descent velocity attain their optima at the same learning rate.
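The convexity of the descent velocity in η suggests a simple online search: run a window of steps at the current LR, estimate how fast the loss is falling, and move the LR in whichever direction speeds descent. The sketch below illustrates this idea only; the function names, velocity estimator (a least-squares slope over recent losses), and multiplicative update rule are assumptions for illustration, not the paper's exact AdaLRS algorithm.

```python
def loss_descent_velocity(losses):
    """Slope of a least-squares line fit through recent loss values.

    More negative means faster descent. This windowed slope is a
    simple proxy for the paper's descent velocity V(eta) (assumption).
    """
    n = len(losses)
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(losses))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def adaptive_lr_search(train_step, lr, window=100, factor=2.0, rounds=5):
    """Hypothetical loss-guided LR search exploiting convexity of V(eta).

    `train_step(eta)` runs one optimizer step at LR `eta` and returns
    the training loss. Each round probes a larger and a smaller LR;
    since V(eta) is convex with the same optimum as the loss, greedily
    moving toward faster descent converges to the optimum's neighborhood.
    """
    def velocity_at(eta):
        return loss_descent_velocity([train_step(eta) for _ in range(window)])

    v = velocity_at(lr)
    for _ in range(rounds):
        up, down = lr * factor, lr / factor
        v_up, v_down = velocity_at(up), velocity_at(down)
        if v_up < v and v_up <= v_down:   # raising the LR speeds descent
            lr, v = up, v_up
        elif v_down < v:                  # lowering the LR speeds descent
            lr, v = down, v_down
        else:                             # current LR already locally best
            factor = factor ** 0.5        # refine the multiplicative grid
    return lr
```

Because V(η) is convex, this greedy bracketing search cannot get stuck at a spurious local optimum, which is what makes a single-run search feasible; the paper's full method additionally handles schedulers and provides convergence guarantees.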

