FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models
Recent advances in mixture-of-experts architectures have shown that individual expert models can be trained federatedly, i.e., in isolation from other experts, by using a common base model to facilitate coordination. However, we hypothesize that full-sized experts may not be necessary for all domains and that low-rank adapters may instead be sufficient. Here, we introduce FlexMoRE, a Flexible Mixture of Rank-heterogeneous Experts, each of which may be either a full-sized expert or an adapter of a suitable rank. We systematically investigate the trade-off between expert rank and downstream task performance by evaluating $6$ experts with ranks $2^0$ to $2^{14}$, resulting in experiments covering 150 mixtures (96 with 2 experts, 54 with 7 experts) that are evaluated across $120$ tasks. For our experiments, we build on FlexOlmo and convert its pre-trained experts into low-rank versions. Our regression analysis from expert rank to downstream task performance reveals that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks. These findings on rank sensitivity have direct implications for memory efficiency: using optimal ranks, FlexMoRE yields improved downstream task performance (average score $47.18$) compared to the baseline FlexOlmo-style mixture of full-sized experts (average score $45.46$) at less than one third the parameters ($10.75$B for FlexMoRE vs. $33.27$B for FlexOlmo). All code will be made available.
💡 Research Summary
FlexMoRE (Flexible Mixture of Rank‑heterogeneous Experts) is a novel Mixture‑of‑Experts (MoE) architecture designed for federated learning scenarios where data cannot be centrally aggregated. Building on the FlexOlmo framework, which enables independent training of domain‑specific experts alongside a frozen public base model, FlexMoRE introduces low‑rank adapters (LoRA) as a memory‑efficient alternative to full‑size experts. The core idea is to treat each domain expert as a deviation Δ from the public base model, compute a singular value decomposition (SVD) of Δ, and truncate it to a chosen rank r. The truncated components (U_r, Σ_r, V_r) are recombined to form a rank‑r approximation Δ(r), which is then added back to the base model, yielding a low‑rank expert. This post‑hoc low‑rank adaptation (PHLoRA) allows existing full‑size experts to be converted without retraining; alternatively, adapters can be trained from scratch in the same federated manner.
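The PHLoRA conversion described above can be sketched as follows. This is a minimal numpy illustration of one weight matrix, not the authors' implementation; the function name and the random test matrices are assumptions for demonstration.

```python
import numpy as np

def truncate_expert(W_base: np.ndarray, W_expert: np.ndarray, r: int) -> np.ndarray:
    """Rank-r approximation of an expert via SVD of its deviation from the base."""
    delta = W_expert - W_base                       # deviation Δ from the public base
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    delta_r = (U[:, :r] * S[:r]) @ Vt[:r]           # Δ(r) = U_r Σ_r V_rᵀ
    return W_base + delta_r                         # the low-rank expert

# Illustrative check: a higher rank can only reduce the approximation error of Δ.
rng = np.random.default_rng(0)
W_base = rng.normal(size=(64, 64))
W_expert = W_base + rng.normal(size=(64, 64))
err = lambda r: np.linalg.norm(W_expert - truncate_expert(W_base, W_expert, r))
```

Because the SVD truncation is the best rank-r approximation of Δ in the Frobenius norm, converting an existing full-size expert requires no retraining, only a decomposition per weight matrix.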
The routing mechanism remains identical to FlexOlmo: a domain‑informed router maps each input token to a probability distribution over expert groups. When a low‑rank expert is selected, the router automatically invokes its associated full‑size base (the public model) and applies the adapter on‑the‑fly. Thus, FlexMoRE can host a heterogeneous mixture of one full‑size expert and any number of adapters with differing ranks.
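The on-the-fly adapter application can be sketched as below. This is a simplified single-token, single-matrix view assuming soft routing weights; the function and variable names are illustrative, not FlexOlmo's actual API. The key point is that a rank-r deviation is applied through its factors, never materialising the full d×d matrix.

```python
import numpy as np

def moe_forward(x, W_base, adapters, router_probs):
    """Route one token through the shared base plus weighted low-rank deviations.

    adapters:     list of (U, S, Vt) factors, one per expert; ranks may differ.
    router_probs: router's probability distribution over the experts.
    """
    y = x @ W_base                                  # shared full-size base path
    for p, (U, S, Vt) in zip(router_probs, adapters):
        # Apply Δ(r) = U Σ Vᵀ factor by factor: O(d·r) instead of O(d²).
        y = y + p * (((x @ U) * S) @ Vt)
    return y
```

Mixing experts of differing ranks is then trivial: each entry of `adapters` simply carries factors of its own width, which is exactly what makes the mixture rank-heterogeneous.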
To evaluate the trade‑off between rank and downstream performance, the authors derived six domain experts (Code, Creative Writing, Math, News, Academic, Reddit) from FlexOlmo and generated low‑rank versions with ranks ranging from 2⁰ to 2¹⁴ (15 ranks). They constructed 150 mixture configurations: 96 mixtures of two experts and 54 mixtures of seven experts. These models were benchmarked on 120 tasks spanning general‑purpose reasoning (MC9, GEN5, AGIEval, BBH) and domain‑specific knowledge (MMLU, MMLU‑Pro).
Performance‑rank relationships were modeled with a simple linear regression s(r)=α+β·log₂r for each evaluation group. Positive β values indicated consistent gains with higher rank, while near‑zero or negative β suggested diminishing returns. The analysis revealed that reasoning‑heavy benchmarks (e.g., BBH, AGIEval) required substantially higher optimal ranks (often ≥2¹²) compared to knowledge‑heavy benchmarks (MMLU‑Pro), where performance saturated around rank 2⁸–2⁹.
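The regression above amounts to an ordinary least-squares fit in log₂-rank space, which can be reproduced in a few lines. The scores below are synthetic stand-ins, not the paper's measurements:

```python
import numpy as np

# Fit s(r) = alpha + beta * log2(r) for one evaluation group.
ranks = np.array([2.0**k for k in range(15)])     # ranks 2^0 .. 2^14
scores = 30.0 + 1.2 * np.log2(ranks)              # synthetic "reasoning-style" trend
beta, alpha = np.polyfit(np.log2(ranks), scores, deg=1)
# A clearly positive beta means performance keeps improving with rank;
# a near-zero beta means the benchmark has saturated at low rank.
```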
In terms of efficiency, the best FlexMoRE configuration achieved an average score of 47.18 while using only 10.75 B parameters—approximately one‑third of FlexOlmo’s 33.27 B parameters, which scored 45.46. Thus, FlexMoRE not only reduces memory footprint but also improves accuracy. The authors attribute this to the fact that low‑rank adapters can capture domain‑specific variations without the overhead of full weight matrices, and that PHLoRA enables rapid conversion of existing experts.
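The memory argument follows from simple parameter counting: a rank-r adapter stores factors of size d_out×r and r×d_in instead of a full d_out×d_in deviation. The hidden size below is illustrative, not the paper's exact model dimensions:

```python
def adapter_params(d_in: int, d_out: int, r: int) -> int:
    # Factors U (d_out x r) and V (r x d_in) replace the full d_out x d_in matrix.
    return r * (d_in + d_out)

d = 4096                       # illustrative hidden size (assumption)
full = d * d                   # parameters of one full-size deviation matrix
low_rank = adapter_params(d, d, 256)   # rank-256 adapter for the same matrix
```

At this illustrative size, the rank-256 adapter needs one eighth of the full matrix's parameters, and the saving grows as rank shrinks, which is why a mixture of one full-size base plus adapters stays far below the homogeneous full-size mixture.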
The paper contributes three main insights: (1) a quantitative mapping between expert rank and task difficulty, (2) a seamless integration of low‑rank adapters into a federated MoE routing framework, and (3) empirical evidence that rank‑heterogeneous mixtures can outperform homogeneous full‑size mixtures while drastically cutting resource consumption. Limitations include the use of a single uniform rank per mixture (except for the heterogeneous experiments) and the lack of dynamic rank selection during inference. Future work may explore adaptive routers that choose rank on‑the‑fly, layer‑wise rank optimization, and asynchronous federated training protocols to further reduce synchronization overhead.
Overall, FlexMoRE demonstrates that federated expert composition does not require full‑size experts; carefully chosen low‑rank adapters can deliver superior performance with a fraction of the parameters, opening a path toward scalable, privacy‑preserving large language model deployment.