FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Clients
One important direction for Federated Foundation Models (FedFMs) is leveraging data held by small client models to enhance the performance of a large server-side foundation model. Existing methods based on model-level or representation-level knowledge transfer either require expensive local training or incur high communication costs, and they introduce unavoidable privacy risks. We reformulate this problem as a reinforcement-learning-style evaluation process and propose FedGRPO, a privacy-preserving framework comprising two modules. The first module performs competence-based expert selection by building a lightweight confidence graph from auxiliary data to identify the most suitable clients for each question. The second module leverages the “Group Relative” concept from the Group Relative Policy Optimization (GRPO) framework by packaging each question together with its solution rationale into candidate policies, dispatching these policies to a selected subset of expert clients, and aggregating solely the resulting scalar reward signals via a federated group-relative loss function. By exchanging reward values instead of data or model updates, FedGRPO reduces privacy risk and communication overhead while enabling parallel evaluation across heterogeneous devices. Empirical results on diverse domain tasks demonstrate that FedGRPO achieves superior downstream accuracy and communication efficiency compared to conventional FedFM baselines.
💡 Research Summary
FedGRPO introduces a novel privacy‑preserving framework for federated foundation models (FedFMs) that leverages domain‑specific knowledge from client devices to improve a large server‑side foundation model without exchanging raw data, model parameters, or high‑dimensional representations. The authors first identify two fundamental challenges in this setting: (1) selecting the most competent clients for evaluating a given query, and (2) aggregating the heterogeneous evaluation signals in a way that meaningfully guides the server model’s updates. To address these, FedGRPO consists of (i) a competence‑based expert selection module and (ii) a Group‑Relative Policy Optimization (GRPO) module adapted for federated learning.
In the expert selection stage, the server holds a small auxiliary dataset of labeled question‑answer pairs. For each incoming unlabeled query, it embeds the query using a frozen encoder from the foundation model and retrieves the L most similar auxiliary examples. These examples are broadcast to all clients, each of which computes a competence score by measuring its accuracy on the retrieved examples (using either exact answer matching or a locally trained evaluator). The server then selects the top‑M clients with the highest competence scores as the expert set for that query. This process is lightweight, requires only the auxiliary data, and dynamically adapts to the domain expertise of each client.
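The retrieval-and-scoring procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and argument names (`select_experts`, `client_answer_fns`, etc.) are hypothetical, and it assumes cosine similarity over precomputed embeddings and exact answer matching for the competence score.

```python
import numpy as np

def select_experts(query_emb, aux_embs, aux_answers, client_answer_fns, L=5, M=3):
    """Sketch of competence-based expert selection (hypothetical names).

    query_emb:         embedding of the incoming unlabeled query, shape (d,)
    aux_embs:          embeddings of the labeled auxiliary QA pairs, shape (n, d)
    aux_answers:       ground-truth answers for the auxiliary pairs
    client_answer_fns: one callable per client mapping an example index -> answer
    """
    # Retrieve the L auxiliary examples most similar to the query
    # (cosine similarity against the frozen-encoder embeddings).
    sims = aux_embs @ query_emb / (
        np.linalg.norm(aux_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    nearest = np.argsort(sims)[-L:]

    # Each client's competence score is its accuracy on the retrieved
    # examples (exact matching here; a locally trained evaluator also works).
    scores = [
        np.mean([fn(i) == aux_answers[i] for i in nearest])
        for fn in client_answer_fns
    ]

    # The top-M clients by competence form the expert set for this query.
    return np.argsort(scores)[-M:]
```

Because only the L retrieved auxiliary examples are broadcast, the per-query cost is independent of the clients' private data sizes.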
Once the expert set is determined, the server samples a provisional answer from its current policy πθg and sends the ⟨query, answer⟩ pair to the selected experts. Each client evaluates the answer using one of two pathways: (a) Answer‑Based Evaluation (AE) when the exact ground‑truth answer exists in its private corpus, yielding a binary 0/1 score; or (b) Model‑Based Evaluation (ME) when no ground‑truth is available, where a locally trained reward model produces a real‑valued score. A gating variable λk selects the appropriate pathway, and the client returns a single scalar reward rk to the server.
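A client's two evaluation pathways can be sketched like this. The helper names (`client_reward`, `private_corpus`, `reward_model`) are hypothetical, and the sketch assumes the private corpus is keyed by question and the reward model returns a score in [0, 1].

```python
def client_reward(question, answer, private_corpus, reward_model):
    """Sketch of one client's evaluation step (hypothetical names).

    private_corpus: locally held mapping from questions to ground-truth answers
    reward_model:   locally trained callable (question, answer) -> score in [0, 1]
    """
    # Gating variable lambda_k: 1 if the ground truth exists locally, else 0.
    lam = 1 if question in private_corpus else 0
    if lam == 1:
        # Answer-Based Evaluation (AE): binary 0/1 score via exact matching.
        return float(answer == private_corpus[question])
    # Model-Based Evaluation (ME): the local reward model scores the answer.
    return float(reward_model(question, answer))
```

Only the resulting scalar `r_k` leaves the device; neither the corpus nor the reward model is ever transmitted.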
The server aggregates the M scalar rewards by computing their mean μr and standard deviation σr, then normalizes each reward into a group‑relative signal Rk = (rk − μr) / (σr + ε). This normalization removes scale differences across clients and prevents any single expert from dominating the update. The server then performs a policy gradient step: θg ← θg + η Rk ∇θg log πθg(ŷ|x), where η is the learning rate. Because the aggregation is based on relative statistics rather than absolute values, the method inherits the stability benefits of GRPO while eliminating the need for a separate value network.
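The normalization step described above is straightforward to write down; a minimal sketch (function name is my own) is:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Sketch of the server-side group-relative normalization.

    rewards: the M scalar rewards r_k returned by the expert clients
    Returns R_k = (r_k - mu_r) / (sigma_r + eps) for each client.
    """
    r = np.asarray(rewards, dtype=float)
    mu, sigma = r.mean(), r.std()
    advantages = (r - mu) / (sigma + eps)
    # Each R_k then weights a policy-gradient ascent step (schematic):
    #   theta_g <- theta_g + eta * R_k * grad_theta log pi_theta_g(y_hat | x)
    return advantages
```

Note that the advantages always sum to (approximately) zero, so an expert can only push the update relative to its peers, which is what prevents any single client from dominating.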
Key advantages of FedGRPO are: (1) communication efficiency – only the question‑answer pair and a few scalar rewards are transmitted, reducing bandwidth by orders of magnitude compared with FedPEFT or synthetic‑data approaches; (2) strong privacy guarantees – clients never expose raw inputs, labels, or model weights, mitigating inference attacks that exploit gradient or embedding leakage; (3) domain‑aware expertise matching – the competence graph enables per‑query selection of the most knowledgeable clients, which is especially valuable in heterogeneous settings such as legal, medical, or financial domains; and (4) training stability – the group‑relative loss provides scale‑invariant reinforcement signals without requiring costly online data collection or value‑function estimation.
Empirical evaluation spans three diverse downstream tasks: legal question answering, medical diagnosis, and financial risk assessment. FedGRPO consistently outperforms baselines that rely on federated fine‑tuning (FedPEFT), federated averaging of model updates, and federated synthetic‑data generation. Accuracy improvements range from 3 to 7 absolute percentage points, while total communication volume is reduced by a factor of 5–10. Notably, even when ground‑truth answers are unavailable for a query, the competence‑based expert selection still yields informative rewards via the model‑based evaluation pathway, allowing the server model to gradually acquire domain‑specific reasoning capabilities.
The paper also discusses limitations: the quality of the auxiliary dataset directly influences expert selection accuracy, and in extremely sparse domains the confidence graph may become unstable. Future work is suggested on integrating differential privacy mechanisms, extending the framework to multi‑task federated settings, and automating the generation of auxiliary prompts to further reduce reliance on manually curated data. Overall, FedGRPO offers a compelling blend of privacy, efficiency, and performance for the next generation of federated foundation models.