Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning
Large language models (LLMs) have not yet effectively leveraged the vast amounts of edge-device data, and federated learning (FL) offers a promising paradigm to collaboratively fine-tune LLMs without transferring private edge data to the cloud. To operate within the computation and communication constraints of edge devices, recent literature on federated fine-tuning of LLMs proposes the use of low-rank adaptation (LoRA) and similar parameter-efficient methods. However, LoRA-based methods suffer from accuracy degradation in FL settings, primarily because of data and computational heterogeneity across clients. We propose Ravan, an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity by reparameterizing the weight updates as the sum of multiple LoRA heads $s_i\textbf{B}_i\textbf{H}_i\textbf{A}_i$ in which only the core matrices $\textbf{H}_i$ and their lightweight scaling factors $s_i$ are trained. These trainable scaling factors let the optimization focus on the most useful heads, recovering a higher-rank approximation of the full update without increasing the number of communicated parameters since clients upload $s_i\textbf{H}_i$ directly. Experiments on vision and language benchmarks show that Ravan improves test accuracy by 2–8% over prior parameter-efficient baselines, making it a robust and scalable solution for federated fine-tuning of LLMs.
💡 Research Summary
The paper introduces Ravan, a novel parameter‑efficient fine‑tuning (PEFT) technique designed for federated learning (FL) of large language models (LLMs). Traditional low‑rank adaptation (LoRA) reduces the number of trainable parameters by representing weight updates ΔW as the product of two low‑rank matrices B and A, while freezing the original model weights. However, in realistic FL settings characterized by data heterogeneity (non‑IID client distributions) and computational heterogeneity (varying client resources), the low‑rank constraint of LoRA often leads to significant accuracy degradation. This degradation is especially pronounced because the effective rank of the true gradient updates grows in non‑IID scenarios, as demonstrated by singular‑value analyses on CIFAR‑100 and SVHN.
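As a quick recap of the LoRA formulation the paper builds on, the sketch below (a generic illustration, not the paper's code) shows how a weight update is constrained to a rank-r product; all sizes are illustrative:

```python
import numpy as np

# Generic LoRA recap: the frozen pretrained weight W is adapted by a
# low-rank product, W_eff = W + B @ A, where rank(B @ A) <= r << min(d_out, d_in).
rng = np.random.default_rng(4)
d_out, d_in, r = 64, 48, 4                 # illustrative layer sizes and rank

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
B = rng.standard_normal((d_out, r))        # trainable in vanilla LoRA
A = rng.standard_normal((r, d_in))         # trainable in vanilla LoRA

delta = B @ A                              # the update DeltaW, rank at most r
print(np.linalg.matrix_rank(delta))        # 4 (generically exactly r)
```

This rank-r bottleneck is exactly the constraint the paper argues becomes too restrictive under non-IID client data.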
Ravan addresses these limitations by re‑parameterizing each weight update as a weighted sum of multiple LoRA heads:
ΔW ≈ Σ_{i=1}^{h} s_i B_i H_i A_i
In this formulation, the basis matrices B_i and A_i are initialized once and then frozen throughout training. Only the core matrices H_i and lightweight scalar scaling factors s_i are updated locally on each client and communicated to the server. By fixing B_i and A_i, the aggregation step becomes exact: the server can simply average the products s_i H_i across participating clients, guaranteeing that the aggregated update equals the sum of the individual updates without any approximation error.
The multi‑head design yields two major benefits. First, it increases the effective rank of the update under a fixed parameter budget. If the total number of trainable parameters is N, a single trainable core can capture at most O(√N) rank. By employing h heads, the combined rank can reach O(√(N·h)), i.e., a √h‑fold improvement. This higher rank enables the model to capture the richer spectrum of gradients that arise from heterogeneous data. Second, the architecture naturally supports computational heterogeneity. Clients with limited memory or compute can freeze a subset of heads, training only the remaining ones. Because frozen heads contribute nothing to the communicated s_i H_i, the communication cost remains unchanged, and the server still aggregates exact updates from the active heads.
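The rank argument can be checked numerically. Under a fixed budget of N trainable core entries, one √N × √N core yields rank at most √N, while h smaller cores of the same total size can reach rank √(N·h). A hedged sketch with illustrative sizes:

```python
import numpy as np

# Illustration of the rank bound (my example, not from the paper's code):
# budget N = 16 core parameters in a d x d layer.
rng = np.random.default_rng(1)
d, N = 64, 16

# Single head: one 4x4 core (16 params) -> rank of the update is at most 4.
r1 = int(np.sqrt(N))
B1, A1 = rng.standard_normal((d, r1)), rng.standard_normal((r1, d))
update_single = B1 @ rng.standard_normal((r1, r1)) @ A1

# Four heads: four 2x2 cores (also 16 params) -> combined rank up to 4*2 = 8.
h, r = 4, 2
update_multi = sum(
    rng.standard_normal((d, r)) @ rng.standard_normal((r, r)) @ rng.standard_normal((r, d))
    for _ in range(h)
)
print(np.linalg.matrix_rank(update_single), np.linalg.matrix_rank(update_multi))  # 4 8
```

With random (generic) bases, the single-head update has rank 4 while the multi-head update has rank 8, the √h-fold gain (√4 = 2) for the same parameter budget.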
Algorithm 1 outlines the training loop. At each round, the server broadcasts the current H_i to a selected client subset C(t). Each client selects which heads to activate (based on its resource budget), performs local SGD for S steps updating s_i and H_i for the active heads, and then sends back the products s_i H_i for those heads. The server aggregates by averaging these products per head, updates the global H_i, and proceeds to the next round.
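The round structure described above can be sketched as follows. This is a toy reconstruction inferred from the summary (the local "SGD" is faked with noise, and function names like `client_round` are my own), meant only to show the head-subset selection and per-head server averaging:

```python
import numpy as np

# Toy sketch of one Ravan communication round (structure inferred from the
# summary; not the authors' implementation).
rng = np.random.default_rng(2)
h, r = 4, 2
H_global = [np.eye(r) for _ in range(h)]   # server-side cores (scale folded in)

def client_round(active_heads, H_global):
    """Return {head index: s_i * H_i} after a toy local update on active heads.

    Frozen heads are simply absent from the upload, so they cost nothing.
    """
    out = {}
    for i in active_heads:
        s_i = 1.0 + 0.1 * rng.standard_normal()                  # toy trained scale
        H_i = H_global[i] + 0.01 * rng.standard_normal((r, r))   # toy local-SGD result
        out[i] = s_i * H_i
    return out

# Two clients with different compute budgets train different head subsets.
uploads = [
    client_round([0, 1, 2, 3], H_global),  # full-capacity client: all heads
    client_round([0, 1], H_global),        # weaker client: heads 2 and 3 frozen
]

# Server: per-head average over the clients that actually trained that head.
for i in range(h):
    contribs = [u[i] for u in uploads if i in u]
    if contribs:
        H_global[i] = np.mean(contribs, axis=0)
```

The per-head averaging here mirrors the exact-aggregation step: since every client shares the same frozen B_i and A_i, averaging the s_i H_i products is all the server needs to do.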
A critical component is the initialization of the frozen bases B_i and A_i. The authors compare three strategies: (1) random normal initialization, (2) deterministic orthogonalization via Gram‑Schmidt, and (3) a “shared subspace” baseline where all heads share the same column/row space. Experiments on vision benchmarks (CIFAR‑100, SVHN) and language fine‑tuning tasks show that the random normal and Gram‑Schmidt initializations achieve the highest test accuracy, suggesting that well‑spread, near‑orthogonal subspaces across heads are important for maximizing expressivity.
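The Gram‑Schmidt strategy can be realized with a QR decomposition, which produces the same orthonormal basis. A hedged sketch (illustrative sizes; this is one plausible way to carve h mutually orthogonal head subspaces, not necessarily the paper's exact procedure):

```python
import numpy as np

# Orthogonal initialization of the frozen bases via QR (equivalent to
# Gram-Schmidt on random Gaussian columns). Requires h * r <= min(d_out, d_in).
rng = np.random.default_rng(3)
d_out, d_in, r, h = 32, 24, 2, 4

Q_out, _ = np.linalg.qr(rng.standard_normal((d_out, h * r)))  # orthonormal columns
Q_in, _ = np.linalg.qr(rng.standard_normal((d_in, h * r)))

# Slice consecutive column blocks: head i gets its own r-dim subspace.
B = [Q_out[:, i * r:(i + 1) * r] for i in range(h)]   # frozen, pairwise orthogonal
A = [Q_in[:, i * r:(i + 1) * r].T for i in range(h)]  # frozen, pairwise orthogonal rows

# Column spaces of distinct heads are exactly orthogonal:
print(np.allclose(B[0].T @ B[1], 0))  # True
```

By contrast, the "shared subspace" baseline would reuse the same block for every head, which collapses the sum back toward a single low-rank update.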
Empirical results demonstrate that Ravan consistently outperforms prior federated PEFT methods—including FedIT, FedEx‑LoRA, FFA‑LoRA, HetLoRA, and FlexLoRA—by 2–8 percentage points in test accuracy across both IID and non‑IID settings. Importantly, Ravan achieves these gains without increasing the communication payload: each client still transmits only the same number of scalars and matrix entries as in vanilla LoRA (the s_i H_i products). Moreover, the method scales gracefully when clients have heterogeneous compute; weaker devices can participate by training fewer heads, preserving overall model performance while reducing local memory consumption.
In summary, Ravan delivers a four‑fold advantage for federated fine‑tuning of LLMs: (1) it retains the parameter‑efficiency of LoRA, (2) it raises the effective rank of updates to better capture heterogeneous gradients, (3) it guarantees exact aggregation across clients, and (4) it adapts to varying client resources through selective head activation. This makes Ravan a robust, scalable, and communication‑friendly solution for deploying large language models in edge‑centric federated environments.