qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs
The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain the respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive, data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring the distributional divergence between the base model and the respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.
💡 Research Summary
The paper addresses the problem of combining multiple domain‑specific Low‑Rank Adaptation (LoRA) adapters for large language models (LLMs) when faced with composite queries that span several domains (e.g., a question that requires mathematics, coding, and medical knowledge). Existing solutions fall into three categories:
- Static fusion – averages or equally weights all adapters, ignoring query relevance.
- Supervised dynamic fusion – trains a routing network on composite data to predict optimal weights, but scales poorly because it requires labeled data for every possible adapter combination.
- Training‑free methods – compute cosine similarity between a query and pre‑computed domain centroids, still needing representative data and lacking layer‑wise granularity.
The authors propose qa‑FLoRA, a completely data‑free and training‑free approach that dynamically determines per‑layer fusion weights for each LoRA adapter based solely on the divergence between the base model’s output distribution and that of each adapter for the given query. The method proceeds as follows:
- Hidden‑state extraction – For a query Q, the base LLM and each LoRA‑adapted model are run forward, capturing hidden states at every transformer layer.
- Projection to vocabulary space – The same pretrained LM head is applied to each hidden state, producing logits that are soft‑maxed into probability distributions p⁽ˡ⁾ (base) and q⁽ˡ⁾ⱼ (adapter j) for each layer l.
- Distributional divergence – The Kullback‑Leibler (KL) divergence D_KL(p⁽ˡ⁾‖q⁽ˡ⁾ⱼ) is computed for the last token of the query at each layer. A larger KL value indicates that adapter j injects more task‑specific information relative to the base model.
- Fusion weight computation – The divergences are normalized across all adapters to obtain per‑layer weights α⁽ˡ⁾ⱼ = D_KL⁽ˡ⁾ⱼ / Σᵢ D_KL⁽ˡ⁾ᵢ.
- Adaptive fusion – At each layer l, the weighted sum of adapter updates ΔW⁽ˡ⁾ⱼ is added to the base model's parameters, scaled by the corresponding α values, yielding the output O⁽ˡ⁾ = (W⁽ˡ⁾ + Σⱼ α⁽ˡ⁾ⱼ ΔW⁽ˡ⁾ⱼ)·x.
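The steps above can be sketched end-to-end on toy numbers. This is a minimal pure-Python illustration, not the paper's implementation: the per-layer token distributions are assumed to have already been obtained by projecting each layer's last-token hidden state through the LM head and soft-maxing, and the adapter names and vocabulary size are illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) between two distributions over a shared vocabulary.

    eps guards against log(0) for zero-probability tokens.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fusion_weights(p_base, q_adapters):
    """Per-layer fusion weights alpha_j = D_KL_j / sum_i D_KL_i.

    p_base:     list over layers; each entry is the base model's distribution
                for the last query token at that layer.
    q_adapters: dict adapter_name -> list over layers of distributions.
    Returns a list over layers of {adapter_name: weight} dicts.
    """
    weights = []
    for l in range(len(p_base)):
        # Divergence of each adapter's distribution from the base at layer l.
        kls = {name: kl_divergence(p_base[l], q[l])
               for name, q in q_adapters.items()}
        total = sum(kls.values())
        # Normalize across adapters so the weights at each layer sum to 1.
        weights.append({name: kl / total for name, kl in kls.items()})
    return weights

# Toy example: 2 layers, vocabulary of 3 tokens, two hypothetical adapters.
p_base = [[0.50, 0.30, 0.20], [0.40, 0.40, 0.20]]
q_adapters = {
    "math": [[0.10, 0.60, 0.30], [0.20, 0.50, 0.30]],   # large shift from base
    "code": [[0.45, 0.35, 0.20], [0.40, 0.35, 0.25]],   # small shift from base
}
alphas = fusion_weights(p_base, q_adapters)
```

Here the "math" adapter shifts the base distribution more at both layers, so it receives the larger per-layer weight, matching the intuition that a larger KL indicates more injected task-specific information.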
Because the KL‑based weights are computed at inference time, no additional training, data collection, or gradient‑based optimization is required. The approach also captures fine‑grained, layer‑specific relevance, unlike centroid‑based methods that treat adapters as monolithic.
Experimental evaluation uses LLaMA‑2‑7B and LLaMA‑3‑8B as frozen backbones, with nine multilingual composite tasks covering mathematics, code generation, and medical QA. Compared to static fusion, qa‑FLoRA improves accuracy by ~5% (LLaMA‑2) and ~6% (LLaMA‑3). Against existing training‑free baselines (cosine‑similarity centroids), it gains ~7% (LLaMA‑2) and ~10% (LLaMA‑3). Supervised baselines such as LoRAFlow, which require a routing network trained on composite examples, still outperform qa‑FLoRA, but the gap is substantially narrowed, demonstrating the effectiveness of the divergence‑based weighting.
A qualitative analysis visualizes the per‑layer α values, revealing interpretable patterns: math adapters dominate higher layers for math‑heavy queries, code adapters peak in middle layers for coding queries, and medical adapters receive higher weights in lower layers for medical questions. This confirms that KL divergence reliably reflects semantic relevance at different depths of the model.
Limitations and future work include the extra computational cost of projecting every layer’s hidden state to the vocabulary space, and potential difficulty distinguishing adapters when their induced distributional shifts are small (e.g., highly similar domains). The authors suggest possible extensions such as sampling‑based KL approximations, selective layer evaluation, or incorporating additional divergence metrics to improve discrimination.
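One of the suggested extensions, approximating the KL divergence over only the highest-probability base tokens rather than the full vocabulary, could look like the following sketch. This is an illustrative approximation, not the paper's implementation; the truncation size `k` and the renormalization scheme are assumptions.

```python
import math

def topk_kl(p, q, k=50, eps=1e-12):
    """Approximate D_KL(p || q) using only the k highest-probability base tokens.

    Both truncated distributions are renormalized before comparison, so the
    result is a proper (non-negative) KL between the restricted distributions.
    A cheap surrogate for full-vocabulary KL when the vocabulary is large.
    """
    idx = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    p_mass = sum(p[i] for i in idx)
    q_mass = sum(q[i] for i in idx)
    return sum((p[i] / p_mass) *
               math.log((p[i] / p_mass + eps) / (q[i] / q_mass + eps))
               for i in idx)
```

Restricting the computation to the top-k base tokens (or to a subset of layers, as the authors also suggest) trades a small amount of fidelity in the divergence estimate for a substantial reduction in per-query cost.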
In summary, qa‑FLoRA introduces a practical, zero‑data, zero‑training solution for dynamic LoRA fusion, enabling large language models to flexibly leverage a library of domain‑specific adapters for complex, multi‑domain queries while achieving competitive performance relative to supervised methods.