ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but its high GPU memory usage makes it hard to apply in resource-constrained settings. To mitigate this, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full-parameter fine-tuning framework that tightly combines the zero-order parameter-space search of Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning benchmark GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models, with overall performance comparable to RL methods: it surpasses the classic RL algorithm PPO (77.72%), is comparable to GRPO (78.34%), and even surpasses both on some models. In terms of GPU memory, ESSAM reduces average usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low memory footprint.


💡 Research Summary

The paper introduces ESSAM, a novel full‑parameter fine‑tuning framework that merges Evolution Strategies (ES), a zero‑order optimization method, with Sharpness‑Aware Maximization (SAM) to improve the generalization of large language models (LLMs) while dramatically reducing GPU memory consumption. Reinforcement learning (RL) techniques such as PPO and GRPO have become standard for enhancing LLM mathematical reasoning, but they demand prohibitive memory (e.g., >300 GiB for an 8‑billion‑parameter model). ES alleviates memory pressure because it relies only on forward passes and reward evaluation, yet its performance lags behind RL due to a tendency to converge to sharp minima.
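To make the forward-only nature of ES concrete, here is a minimal sketch of a vanilla ES update on a toy objective. This is an illustration of the general zero-order technique, not the paper's implementation; the function names and hyper-parameter values are our own choices. Note that only reward evaluations are needed, never gradients.

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=64, sigma=0.1, lr=0.05, rng=None):
    """One vanilla Evolution Strategies update: estimate a search gradient
    from reward evaluations of Gaussian-perturbed parameter copies."""
    rng = rng or np.random.default_rng(0)
    # Antithetic sampling: evaluate +eps and -eps to reduce variance.
    eps = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * e) - reward_fn(theta - sigma * e)
                        for e in eps])
    # Reward-weighted combination of the noise directions, no backprop needed.
    grad_est = (rewards[:, None] * eps).mean(axis=0) / (2 * sigma)
    return theta + lr * grad_est

# Toy demo: maximize -||x||^2, whose optimum is the origin.
reward = lambda x: -np.sum(x ** 2)
rng = np.random.default_rng(0)
theta = np.ones(5)
for _ in range(200):
    theta = es_step(theta, reward, rng=rng)
```

Because each perturbed copy is evaluated independently, the estimator parallelizes trivially and stores no activations or gradient tensors, which is the source of ES's memory advantage over RL.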

ESSAM addresses this gap by incorporating a two‑stage update inspired by SAM. In the first stage, after evaluating a population of perturbed models, ESSAM computes a reward‑weighted noise direction and moves the current parameters in the opposite direction, creating a “SAM neighborhood point” (θ_SAM). This step mimics SAM’s perturb‑and‑project operation, pushing the search away from sharp regions. In the second stage, ESSAM samples a new population around θ_SAM, evaluates rewards, normalizes them (z‑score), and finally updates the original parameters using the weighted noises from this second population. The authors provide a theoretical justification (Proposition 3.1) showing that the first‑stage update is equivalent to a SAM step up to a constant scaling factor.

Memory efficiency is achieved through two engineering tricks: Seed Replay Evaluation (SRE) and Decomposed In‑place Update (DIPU). SRE assigns a deterministic seed to each individual, allowing the same random perturbations to be reproduced without storing intermediate activations. DIPU performs in‑place parameter perturbations and weighted updates, eliminating the need for gradient tensors. Consequently, ESSAM’s memory footprint matches that of vanilla ES and is far lower than RL methods.
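A minimal sketch of the seed-replay idea, under the assumption (stated in the summary) that each individual is fully determined by an integer seed. The helper names here are hypothetical; the point is that perturbations are regenerated on demand, applied in place, and rolled back, so no per-individual parameter copies or gradient tensors are kept.

```python
import numpy as np

def evaluate_with_seed_replay(theta, reward_fn, seeds, sigma=0.1):
    """Seed Replay Evaluation (sketch): each individual is identified only by
    its seed; its Gaussian noise is regenerated deterministically, applied to
    theta in place, evaluated, and then rolled back from the same seed."""
    rewards = []
    for seed in seeds:
        noise = np.random.default_rng(seed).standard_normal(theta.size)
        theta += sigma * noise          # perturb in place
        rewards.append(reward_fn(theta))
        theta -= sigma * noise          # roll back, regenerated from the seed
    return np.array(rewards)

def inplace_weighted_update(theta, seeds, weights, sigma=0.1, lr=0.05):
    """Decomposed in-place update (sketch): re-derive each noise vector from
    its seed and accumulate the reward-weighted step directly into theta."""
    for seed, w in zip(seeds, weights):
        noise = np.random.default_rng(seed).standard_normal(theta.size)
        theta += lr * w * noise / (len(seeds) * sigma)
    return theta

# Toy usage: one ES iteration on a quadratic reward using only seeds.
theta = np.ones(4)
seeds = list(range(100))
r = evaluate_with_seed_replay(theta, lambda x: -np.sum(x ** 2), seeds)
w = (r - r.mean()) / (r.std() + 1e-12)
theta_new = inplace_weighted_update(theta.copy(), seeds, w)
```

With this pattern the peak memory is one parameter buffer plus one noise vector, which is why the summary reports ESSAM's footprint matching vanilla ES rather than RL.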

Experiments are conducted on the GSM8K benchmark, a widely used dataset for multi‑step arithmetic reasoning. Seven models are fine‑tuned: Qwen‑2.5 (0.5 B, 1.5 B, 3 B, 7 B) and LLaMA‑3 (1 B, 3 B, 8 B). Training follows a standard pipeline—train/validation split, data shuffling, and mini‑batch updates—unlike prior ES work that used a single static batch. Results show:

  • Average accuracy: ESSAM 78.27 % vs. ES 75.97 %, PPO 77.72 %, GRPO 78.34 %.
  • ESSAM outperforms ES by 2.3 percentage points and matches or slightly exceeds the RL baselines; it even surpasses PPO/GRPO on several smaller models.
  • GPU memory usage: ESSAM (≈17 GiB for the 8 B model) is 18× lower than PPO (≈314 GiB) and 10× lower than GRPO (≈174 GiB).

Key insights: (1) Integrating SAM into a zero‑order method mitigates ES’s propensity for sharp minima, yielding RL‑comparable generalization. (2) The two‑stage neighborhood probing retains the memory advantage of ES while adding a modest computational overhead that scales with population size. (3) Adopting conventional training practices (shuffling, mini‑batches) stabilizes ES training and makes ESSAM compatible with existing LLM fine‑tuning pipelines.

Limitations include: evaluation on a single reasoning benchmark, potential sensitivity to hyper‑parameters (population size N, noise scale σ, SAM radius ρ), and increased wall‑clock time due to the need to evaluate many perturbed models per iteration. Future work could explore broader tasks (code generation, dialogue), automated hyper‑parameter tuning, and hybrid schemes that combine gradient‑based updates with ESSAM’s zero‑order steps.

Overall, ESSAM demonstrates that memory‑efficient zero‑order optimization, when coupled with sharpness‑aware regularization, can achieve state‑of‑the‑art performance on mathematical reasoning tasks. This makes high‑quality LLM fine‑tuning accessible to resource‑constrained researchers and opens a promising direction for scalable, low‑memory model adaptation.
