Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 points (16.6% relative) and 4.5 points (11.1% relative), respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.


💡 Research Summary

The paper tackles a fundamental obstacle in building a single, general‑purpose large reasoning model (LRM) that can excel across heterogeneous domains such as mathematics, open‑ended chat, and instruction following. While reinforcement learning (RL) with verifiable rewards (RL‑VR) and model‑rewarded thinking (RL‑MT) has produced impressive gains in individual domains, naïvely mixing these objectives leads to severe cross‑domain interference. The authors first conduct a systematic empirical study of two widely adopted strategies: (1) Sequential RL, where a model is fine‑tuned on one domain and then on another, and (2) Mixed RL, where data from multiple domains are interleaved within each training batch.

In the sequential setting, they observe two failure modes they term Forgetting (performance on a previously trained domain collapses) and Rigidity (the later‑trained domain cannot reach its single‑domain ceiling). The effect is asymmetric: training chat first and then math causes a dramatic drop in chat ability, whereas the reverse order yields a milder degradation of math performance. Entropy analyses reveal that the first‑trained domain’s entropy level strongly influences the exploration capacity of the subsequent stage, explaining the observed asymmetry.
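The entropy diagnostic described above can be made concrete with a small sketch. The helper below (`mean_policy_entropy` is an illustrative name, not from the paper) computes the average Shannon entropy of a model's next-token distributions from its logits; a higher value indicates a more exploratory policy, which per the authors' analysis leaves more room for the subsequent training stage.

```python
import numpy as np

def mean_policy_entropy(logits: np.ndarray) -> float:
    """Average Shannon entropy (in nats) of next-token distributions.

    logits: array of shape (num_positions, vocab_size).
    Higher values indicate a more exploratory (less peaked) policy.
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Entropy per position, then averaged over positions.
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(ent.mean())

# A peaked (low-entropy) policy vs. a uniform (high-entropy) one.
peaked = np.array([[10.0, 0.0, 0.0, 0.0]])
flat = np.zeros((1, 4))
assert mean_policy_entropy(peaked) < mean_policy_entropy(flat)
```

On uniform logits over a vocabulary of size V, this returns log V, the maximum attainable entropy.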

In the mixed setting, the authors find that gradients from different domains frequently conflict. Even when the mixing ratio is heavily skewed toward one domain, the other domain’s performance suffers, and the overall model never matches the best single‑domain expert. This demonstrates that simple multi‑task RL does not resolve the underlying gradient competition.
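The conflict criterion here is a negative inner product between flattened domain gradients: when the dot product is below zero, following one domain's update direction partially undoes the other's. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def gradient_conflict(g_a: np.ndarray, g_b: np.ndarray) -> bool:
    """Two flattened domain gradients conflict when their inner product
    is negative, i.e. each update partially undoes the other."""
    return float(np.dot(g_a, g_b)) < 0.0

# Opposing updates to the same parameters conflict; aligned ones do not.
g_math = np.array([0.5, -1.0, 0.2])
g_chat = np.array([-0.4, 0.9, 0.1])
print(gradient_conflict(g_math, g_chat))  # → True (dot product is -1.08)
```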

To overcome both problems, the authors propose Modular Gradient Surgery (MGS). Recognizing that transformer architectures are intrinsically modular—each layer contains distinct sub‑components (e.g., self‑attention and feed‑forward/MLP blocks) that specialize in different functional aspects—they compute domain‑specific gradients for each module separately. When the inner product between two domain gradients for a given module is negative (i.e., they point in opposing directions), MGS projects one gradient onto the orthogonal complement of the other, effectively removing the destructive component while preserving the compatible update. This “gradient surgery” is performed at the module level rather than globally, allowing each part of the network to receive only beneficial signals from each domain.

Experiments are conducted on two model families, Llama‑3.1‑8B and Qwen‑2.5‑7B, across three representative domains (Math, Chat, Instruction‑Following). Under a controlled compute budget (2 epochs per domain), MGS yields average improvements of 4.3 points (16.6 % relative) for Llama and 4.5 points (11.1 % relative) for Qwen compared with standard mixed‑task RL. The gains persist under prolonged training (up to four epochs), with an extra ~3 % relative boost on the Math benchmark. Moreover, MGS generalizes to additional tasks such as coding and creative writing, delivering a 19.4 % relative average improvement on Llama‑3.1‑8B. Computationally, MGS adds negligible overhead when integrated with high‑performance parallelism frameworks like Fully Sharded Data Parallel (FSDP), and it is more memory‑efficient than a naïve global gradient surgery baseline.

Key insights include: (1) cross‑domain interference in multi‑task RL is often localized to specific transformer modules; (2) resolving gradient conflicts via projection‑based surgery preserves domain‑specific learning while mitigating destructive interference; (3) ordering of domains matters in sequential training, with high‑entropy, flexible domains (e.g., chat) serving as a better foundation for later, more constrained tasks (e.g., math).

Overall, the work provides a rigorous diagnosis of why multi‑domain RL post‑training has been fragile and introduces a principled, scalable solution—Modular Gradient Surgery—that substantially advances the feasibility of training truly general‑purpose reasoning models.

