TimelyFreeze: Adaptive Parameter Freezing Mechanism for Pipeline Parallelism
Pipeline parallelism enables training models that exceed single-device memory, but practical throughput remains limited by pipeline bubbles. Although parameter freezing can improve training throughput by adaptively skipping backward computation, existing methods often over-freeze parameters, resulting in unnecessary accuracy degradation. To address this issue, we propose TimelyFreeze, which models the pipeline schedule as a directed acyclic graph and solves a linear program to compute optimal freeze ratios that minimize batch execution time under accuracy constraints. Experiments show that TimelyFreeze achieves up to a 40% improvement in training throughput on LLaMA-8B with comparable accuracy. Overall, it enables faster large-scale model training without compromising convergence and generalizes across diverse pipeline-parallel settings.
💡 Research Summary
TimelyFreeze addresses the persistent inefficiency of pipeline parallelism (PP) caused by pipeline bubbles—idle periods when a GPU must wait for dependent computations to finish. While prior parameter‑freezing techniques such as AutoFreeze and APF reduce backward‑pass computation by skipping gradient updates for “stable” parameters, they ignore the temporal dependencies inherent in PP schedules. Consequently, they often freeze parameters during periods when the GPU is already idle, yielding little to no throughput gain and risking unnecessary accuracy loss.
The proposed method follows a three‑phase workflow: warm‑up, monitoring, and freezing. The warm‑up phase coincides with the learning‑rate warm‑up, during which no monitoring or freezing occurs. The subsequent monitoring phase records the execution time of each forward and backward action on every GPU: the first half runs with all parameters unfrozen to capture each action's maximal backward duration (w_max), and the second half runs with all parameters frozen to capture its minimal duration (w_min).
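The two monitoring halves can be collapsed into per-action bounds roughly as follows (a minimal sketch; the function name and the averaging estimator are illustrative choices, not specified by the paper):

```python
def calibrate_action(unfrozen_times, frozen_times):
    """Collapse the two monitoring halves into per-action bounds.

    unfrozen_times: measured durations of one backward action with no
                    parameters frozen (first monitoring half).
    frozen_times:   durations with all of its parameters frozen (second half).
    Averaging is one reasonable estimator; the paper does not specify one.
    Returns (w_max, w_min, delta_w) for the linear cost model used later.
    """
    w_max = sum(unfrozen_times) / len(unfrozen_times)
    w_min = sum(frozen_times) / len(frozen_times)
    return w_max, w_min, w_max - w_min
```

For example, `calibrate_action([2.0, 2.2], [0.9, 1.1])` yields bounds of roughly 2.1 and 1.0, with a shrinkable span of about 1.1 time units.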
Next, TimelyFreeze builds a directed acyclic graph (DAG) that represents the entire batch schedule. Nodes correspond to individual actions (forward or backward) for a specific micro‑batch at a particular stage, while edges encode the execution dependencies (e.g., a backward action cannot start before its corresponding forward action and the downstream backward actions finish). Each node i receives a weight w_i that varies linearly with a freeze ratio r_i:
w_i = w_min,i + (1 – r_i)·Δw_i, Δw_i = w_max,i – w_min,i
Forward nodes are unaffected by freezing (w_min = w_max), whereas backward nodes shrink proportionally as more parameters are frozen.
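A toy version of this cost model and the resulting batch time can be sketched as follows (the 2-stage, 2-microbatch GPipe-like graph and all timings are illustrative, not the paper's exact construction):

```python
# Toy DAG for a 2-stage, 2-microbatch GPipe-like schedule (illustrative).
# Node = (kind, stage, microbatch); value = (w_min, w_max).  Forward nodes
# have w_min == w_max, so freezing cannot shrink them.
nodes = {
    ("F", 0, 0): (1.0, 1.0), ("F", 0, 1): (1.0, 1.0),
    ("F", 1, 0): (1.0, 1.0), ("F", 1, 1): (1.0, 1.0),
    ("B", 1, 0): (0.5, 2.0), ("B", 1, 1): (0.5, 2.0),
    ("B", 0, 0): (0.5, 2.0), ("B", 0, 1): (0.5, 2.0),
}
edges = [
    (("F", 0, 0), ("F", 1, 0)), (("F", 0, 1), ("F", 1, 1)),  # activations flow down
    (("F", 0, 0), ("F", 0, 1)), (("F", 1, 0), ("F", 1, 1)),  # per-stage ordering
    (("F", 1, 1), ("B", 1, 0)),                              # last stage turns around
    (("B", 1, 0), ("B", 0, 0)), (("B", 1, 1), ("B", 0, 1)),  # gradients flow up
    (("B", 1, 0), ("B", 1, 1)), (("B", 0, 0), ("B", 0, 1)),  # per-stage ordering
]

def weight(node, ratios):
    """w_i = w_min_i + (1 - r_i) * (w_max_i - w_min_i)."""
    w_min, w_max = nodes[node]
    return w_min + (1.0 - ratios.get(node, 0.0)) * (w_max - w_min)

def batch_time(ratios):
    """Batch execution time = longest path through the DAG (topological pass)."""
    preds = {n: [u for (u, v) in edges if v == n] for n in nodes}
    finish, remaining = {}, set(nodes)
    while remaining:
        for n in list(remaining):
            if all(p in finish for p in preds[n]):  # all dependencies done
                start = max((finish[p] for p in preds[n]), default=0.0)
                finish[n] = start + weight(n, ratios)
                remaining.discard(n)
    return max(finish.values())
```

With nothing frozen, `batch_time({})` gives 9.0 time units on this toy graph; freezing every backward action fully (all r_i = 1) shrinks it to 4.5, and intermediate ratios interpolate linearly.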
The core of TimelyFreeze is a linear program (LP) that selects r_i for every backward node. The objective minimizes the total batch execution time, i.e., the longest path through the DAG, while respecting three constraints: (1) 0 ≤ r_i ≤ 1 for all i, (2) each w_i stays within its measured bounds, and (3) a global accuracy budget is not exceeded (implemented as an upper bound on the aggregate frozen fraction or a proxy loss on validation accuracy). Solving the LP yields an optimal freeze‑ratio vector r* that balances throughput and model quality.
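A minimal version of this LP can be sketched with `scipy.optimize.linprog` on a toy diamond-shaped schedule (the graph, timings, and budget encoding below are assumptions for illustration; the paper's exact formulation may differ):

```python
import numpy as np
from scipy.optimize import linprog

# Toy diamond schedule (illustrative): forward action A feeds two parallel
# backward actions B1 and B2, which join at a final action C.  Only B1 and
# B2 are freezable; the budget caps the aggregate frozen fraction.
graph = {  # name: (w_min, w_max, predecessors)
    "A":  (1.0, 1.0, []),
    "B1": (1.0, 3.0, ["A"]),
    "B2": (1.0, 2.0, ["A"]),
    "C":  (1.0, 1.0, ["B1", "B2"]),
}
freezable = ["B1", "B2"]
budget = 1.0  # accuracy proxy: upper bound on the aggregate frozen fraction

names = list(graph)
nf, nr = len(names), len(freezable)
fi = {n: i for i, n in enumerate(names)}           # finish-time variables
ri = {n: nf + i for i, n in enumerate(freezable)}  # freeze-ratio variables
T = nf + nr                                        # makespan variable
nvar = nf + nr + 1

A_ub, b_ub = [], []
for v, (wmin, wmax, preds) in graph.items():
    dw = wmax - wmin
    for u in preds or [None]:
        # f_u + wmin + (1 - r_v) * dw <= f_v, rewritten for A_ub @ x <= b_ub
        a = [0.0] * nvar
        if u is not None:
            a[fi[u]] = 1.0
        a[fi[v]] = -1.0
        if v in ri:
            a[ri[v]] = -dw
        A_ub.append(a); b_ub.append(-wmax)
    a = [0.0] * nvar; a[fi[v]] = 1.0; a[T] = -1.0  # f_v <= T
    A_ub.append(a); b_ub.append(0.0)

a = [0.0] * nvar
for n in freezable:
    a[ri[n]] = 1.0                                  # sum of ratios <= budget
A_ub.append(a); b_ub.append(budget)

c = [0.0] * nvar; c[T] = 1.0                        # minimize the makespan
bounds = [(0, None)] * nf + [(0, 1)] * nr + [(0, None)]
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
```

For this toy graph the optimum shifts more freezing onto the longer branch rather than splitting the budget evenly: r_B1 = 2/3 and r_B2 = 1/3, equalizing the two branches at 5/3 time units each and giving a batch time of 11/3 ≈ 3.67.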
Finally, during the freezing phase, each backward action begins freezing its associated parameters at the pre‑computed ratio r_i*. Because the ratios are computed per‑action, the method can adapt to heterogeneous stage workloads: bottleneck stages may receive higher freeze ratios to equalize compute across GPUs, while stages that are already fast retain low ratios, preserving gradient updates where they matter most.
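Applying a computed ratio to a stage's parameters might look like the following sketch (freezing earlier parameter groups first mirrors a common heuristic from prior freezing work, not necessarily TimelyFreeze's exact selection rule; `FakeParam` stands in for a framework tensor):

```python
import math

class FakeParam:
    """Stand-in for a framework parameter tensor (illustrative only)."""
    def __init__(self):
        self.requires_grad = True

def apply_freeze_ratio(params, ratio):
    """Freeze the first ceil(ratio * N) parameter groups of a stage.

    Freezing earlier groups first reflects the common observation in prior
    freezing work that bottom layers stabilize earliest; the paper may use
    a different selection rule.  Returns the number of groups frozen.
    """
    n_freeze = math.ceil(ratio * len(params))
    for i, p in enumerate(params):
        p.requires_grad = i >= n_freeze  # frozen groups skip gradient updates
    return n_freeze

stage_params = [FakeParam() for _ in range(10)]
apply_freeze_ratio(stage_params, 0.3)  # freezes the first 3 of 10 groups
```

Because each backward action carries its own ratio, bottleneck stages can freeze more aggressively than fast stages, which is what lets the schedule equalize compute across GPUs.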
Experiments span the LLaMA family (1B, 8B, 13B) and vision models (ViT‑L/32, ConvNeXt‑V2‑L) across multiple PP schedules (GPipe, 1F1B, Zero‑Bubble) on 4–8 GPUs. TimelyFreeze consistently lands on or near the Pareto frontier of throughput versus final accuracy. For LLaMA‑8B, it achieves up to a 40% increase in training throughput with less than 0.2% perplexity degradation compared to a no‑freezing baseline. Across all settings, average throughput gains range from 30% to 46%, while accuracy loss stays within 1% of the baseline. The method also reduces total training time for vision models by up to 25% with accuracy drops limited to 1.5 percentage points.
A theoretical time‑to‑accuracy analysis shows that increasing freeze ratios accelerates early convergence but can plateau or even hurt final performance if over‑applied. By embedding an explicit accuracy constraint in the LP, TimelyFreeze automatically avoids the over‑freezing regime that plagues prior approaches.
Key contributions are: (1) a pipeline‑aware freezing strategy that leverages idle time rather than blindly freezing, (2) a DAG‑based formulation that captures all intra‑ and inter‑stage dependencies, and (3) a linear‑programming solution that yields optimal per‑action freeze ratios under a user‑specified accuracy budget. The work demonstrates that systematic, schedule‑aware parameter freezing can substantially improve the efficiency of large‑scale model training without sacrificing convergence, offering a practical tool for researchers and engineers working with limited GPU resources.