Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
This paper introduces a generalized federated prompt-tuning framework for practical scenarios where local datasets are multimodal and exhibit different distributional patterns of missing features at the input level. The proposed framework bridges the gap between federated learning and multimodal prompt-tuning, two lines of work that have traditionally focused on uni-modal or centralized data, respectively. A key challenge in this setting arises from the lack of semantic alignment between prompt instructions that encode similar distributional patterns of missing data across different clients. To address this, our framework introduces specialized client-tuning and server-aggregation designs that simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities. This allows prompt instructions to complement one another and be combined effectively. Extensive evaluations on diverse multimodal benchmark datasets demonstrate that our work consistently outperforms state-of-the-art (SOTA) baselines.
💡 Research Summary
The paper tackles a realistic federated learning (FL) scenario where each client possesses multimodal data (e.g., text, images) but the available modalities differ across clients and are often partially missing due to sensor failures or privacy constraints. Traditional FL approaches either assume a homogeneous modality set or treat each modality independently, while existing prompt‑tuning methods for missing modalities are designed for centralized training and cannot handle the inter‑client heterogeneity inherent in FL. To bridge this gap, the authors propose FED‑PRIME, a federated prompt‑tuning framework that explicitly separates prompt parameters into two groups: (1) inter‑client prompts that encode input‑level missing‑data patterns (e.g., “text only”, “image missing”), and (2) intra‑client prompts that compensate for modality‑level absences (e.g., a client never collects images).
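The two-group decomposition can be pictured as two small prompt pools held by each client. The sketch below is purely illustrative: the class name `PromptPools`, the pattern/modality keys, and the prompt dimension are our own placeholders, not identifiers from the paper.

```python
import numpy as np

class PromptPools:
    """Illustrative container for the two prompt groups: inter-client
    prompts keyed by input-level missing-data pattern, and intra-client
    prompts keyed by modality (compensating for modalities a client
    never collects)."""

    def __init__(self, patterns, modalities, d_prompt=8, seed=0):
        rng = np.random.default_rng(seed)
        # inter-client prompts: one vector per missing-data pattern
        self.inter = {p: rng.normal(size=d_prompt) for p in patterns}
        # intra-client prompts: one vector per modality
        self.intra = {m: rng.normal(size=d_prompt) for m in modalities}

pools = PromptPools(
    patterns=["complete", "text_only", "image_missing"],
    modalities=["text", "image"],
)
```

In a real system these would be learnable embedding tables prepended to the transformer input; random vectors stand in for trained parameters here.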
On the client side, a retrieval‑style mechanism is introduced. Each prompt p is mapped to a key vector k(p), and each input x(M) (where M denotes the observed modalities) is mapped to a query vector q(x(M)). The cosine similarity between k(p) and q(x(M)) measures how relevant a prompt is to an input; the most relevant prompts are selected and fine‑tuned on each sample. This prevents a single prompt from being overloaded with knowledge from disparate missing‑data patterns, and is enforced by a regularization term that penalizes the use of irrelevant prompts.
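A minimal sketch of this retrieval step, assuming the key and query vectors are already computed (the function names and the exact form of the regularizer are our own simplifications, not the paper's):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_prompts(query, prompt_keys, top_k=2):
    """Rank prompts by cosine similarity between the input's query vector
    q(x(M)) and each prompt key k(p); return the top-k prompt indices
    plus a toy regularization term that penalizes low-similarity
    (i.e., irrelevant) selections."""
    sims = [cosine_sim(query, k) for k in prompt_keys]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    selected = order[:top_k]
    reg_loss = sum(1.0 - sims[i] for i in selected)
    return selected, reg_loss

# Example: a query aligned with the first key retrieves prompt 0.
keys = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
query = np.array([1.0, 0.0])
selected, reg = select_prompts(query, keys, top_k=1)  # selected == [0]
```

Only the selected prompts receive gradient updates for that sample, which is what keeps knowledge from different missing-data patterns in separate prompts.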
The server aggregates the inter‑client prompts via a clustering and alignment process. Prompts that capture similar missing‑data distributions across clients are grouped, and a representative prompt for each cluster is learned by minimizing the aggregated local training loss. This alignment yields versatile, shared prompts that can be broadcast back to all clients. In contrast, intra‑client prompts are aggregated with the standard federated averaging (FedAvg) scheme because they model modality‑agnostic information that is beneficial to all participants.
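The two server-side aggregation paths can be sketched as follows. This is a toy version under stated simplifications: the cluster representatives here are plain centroids from a small k-means loop, whereas the paper learns them by minimizing the aggregated local training loss; all function names are illustrative.

```python
import numpy as np

def fedavg(prompt_list, weights=None):
    """Standard FedAvg: (optionally weighted) average of client prompts.
    Used for the intra-client prompts."""
    arr = np.stack(prompt_list)
    if weights is None:
        return arr.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (arr * w[:, None]).sum(axis=0) / w.sum()

def cluster_and_align(inter_prompts, n_clusters=2, iters=10, seed=0):
    """Group inter-client prompts that encode similar missing-data
    distributions (toy k-means) and return one representative per
    cluster, to be broadcast back to all clients."""
    X = np.stack(inter_prompts)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(iters):
        # assign each prompt to its nearest representative
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # update each representative to the mean of its group
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels
```

The split mirrors the paper's design choice: pattern-specific (inter-client) prompts must first be matched across clients before they can be merged, while modality-level (intra-client) prompts are directly averageable.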
The overall training loop proceeds as follows: (i) each client updates its local inter‑ and intra‑prompts using the retrieval‑based loss, (ii) the updated prompts are sent to the server, (iii) the server clusters inter‑prompts, averages intra‑prompts, and distributes the new global prompts, and (iv) the next round begins.
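One communication round of steps (i)-(iv) can be sketched end to end. This is a deliberately minimal, self-contained stand-in: the local prompt update is reduced to a toy move toward each client's data mean, and the server's clustering step is collapsed to a plain average; none of these names come from the paper.

```python
import numpy as np

def local_update(prompt, data, lr=0.5):
    """Stand-in for client-side prompt tuning: nudge the prompt toward
    the local data mean (a real client would take gradient steps on the
    retrieval-based loss)."""
    return prompt + lr * (data.mean(axis=0) - prompt)

def run_round(global_inter, global_intra, client_data):
    """One federated round: (i) local updates, (ii) upload,
    (iii) server aggregation, (iv) broadcast for the next round."""
    client_inter, client_intra = [], []
    for data in client_data:                         # (i) each client tunes locally
        client_inter.append(local_update(global_inter.copy(), data))
        client_intra.append(local_update(global_intra.copy(), data))
    # (ii) prompts arrive at the server; (iii) aggregation
    # (the paper's clustering of inter-prompts is simplified to a mean here)
    new_inter = np.stack(client_inter).mean(axis=0)
    new_intra = np.stack(client_intra).mean(axis=0)  # FedAvg
    return new_inter, new_intra                      # (iv) broadcast
```

Iterating `run_round` corresponds to successive communication rounds; only the prompt parameters travel between clients and server, never the raw multimodal data.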
Experiments are conducted on two multimodal benchmarks: MM‑IMDB (text‑image) and UPMC Food‑101 (image‑recipe). The authors simulate various missing‑rate settings (25%–100% missing modalities) and compare FED‑PRIME against a suite of baselines, including multimodal FL methods (FedAvg, FedMA, FedInMM), prompt‑tuning baselines adapted to the federated setting (FedAvg‑Prompt, FedAvg), and ablated versions of their own method (single‑prompt, no‑clustering). FED‑PRIME consistently outperforms all baselines, achieving 3–6 percentage‑point gains in accuracy, with the gap widening as the missing rate increases. Ablation studies confirm that (a) merging inter‑ and intra‑prompts into a single set dramatically degrades performance, and (b) removing the server‑side clustering reduces the benefit of shared knowledge, validating the necessity of both design components.
The paper’s contributions are threefold: (1) a novel decomposition of prompt parameters to handle both input‑level and modality‑level missing data, (2) a retrieval‑based local learning scheme that dynamically selects the most relevant prompts per sample, and (3) a server‑side clustering algorithm that aligns inter‑client prompts across heterogeneous clients. Limitations include the additional computational overhead of clustering on the server and increased communication due to transmitting two prompt sets. Future work is suggested on dynamic prompt sizing, asynchronous aggregation, and extending the approach to large‑scale multimodal foundation models while preserving privacy through encrypted prompt aggregation. Overall, FED‑PRIME demonstrates that carefully structured prompt‑tuning can unlock the full potential of federated learning in heterogeneous, incomplete multimodal environments.