HybridVFL: Disentangled Feature Learning for Edge-Enabled Vertical Federated Multimodal Classification
Vertical Federated Learning (VFL) offers a privacy-preserving paradigm for Edge AI scenarios such as mobile health diagnostics, where sensitive multimodal data reside on distributed, resource-constrained devices. Yet standard VFL systems often suffer performance limitations due to simplistic feature fusion. This paper introduces HybridVFL, a novel framework designed to overcome this bottleneck by pairing client-side feature disentanglement with a server-side cross-modal transformer for context-aware fusion. Through systematic evaluation on the multimodal HAM10000 skin lesion dataset, we demonstrate that HybridVFL significantly outperforms standard federated baselines, showing that advanced fusion mechanisms are critical to robust, privacy-preserving multimodal classification.
💡 Research Summary
HybridVFL tackles a critical bottleneck in vertical federated learning (VFL) for edge AI: the simplistic fusion of heterogeneous modalities that limits predictive performance while preserving privacy. The authors propose a two‑stage architecture that first disentangles client‑side representations into modality‑invariant and modality‑specific embeddings, and then fuses these embeddings on a semi‑honest central server using a cross‑modal transformer with a cosine‑based consistency regularizer.
On the client side, the image holder (Image Client) employs a convolutional neural network encoder (E_I) to produce an invariant embedding (z_I^inv) capturing semantics common to both modalities (e.g., malignancy cues) and a specific embedding (z_I^spec) preserving visual details. The tabular holder (Tabular Client) uses an MLP encoder (E_T) to generate analogous invariant (z_T^inv) and specific (z_T^spec) vectors from clinical metadata (age, gender, lesion site). Only these four encrypted vectors are transmitted; raw images and raw EHR data never leave the devices, satisfying GDPR and HIPAA constraints.
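The client-side encoders described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the backbone sizes, embedding dimension, and the 8-feature tabular input are assumptions; only the encoder roles (a CNN E_I and an MLP E_T, each emitting an invariant and a specific embedding) come from the paper.

```python
import torch
import torch.nn as nn

class ImageClientEncoder(nn.Module):
    """Sketch of E_I: a small CNN backbone with two linear heads that
    split the representation into z_I^inv and z_I^spec."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.inv_head = nn.Linear(32, embed_dim)   # z_I^inv: shared semantics
        self.spec_head = nn.Linear(32, embed_dim)  # z_I^spec: visual details

    def forward(self, x):
        h = self.backbone(x)
        return self.inv_head(h), self.spec_head(h)

class TabularClientEncoder(nn.Module):
    """Sketch of E_T: an MLP over clinical metadata (age, gender,
    lesion site, ...) with analogous invariant/specific heads."""
    def __init__(self, in_dim: int = 8, embed_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.inv_head = nn.Linear(32, embed_dim)   # z_T^inv
        self.spec_head = nn.Linear(32, embed_dim)  # z_T^spec

    def forward(self, x):
        h = self.mlp(x)
        return self.inv_head(h), self.spec_head(h)
```

Only the four resulting embedding tensors would then be (encrypted and) sent to the server; the raw inputs `x` never leave the client.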
The server first aligns the invariant embeddings from the two clients by minimizing a cosine similarity loss (L_cons), encouraging a shared latent space across modalities. After alignment, the four vectors are concatenated into a token sequence S = [z_I^inv, z_I^spec, z_T^inv, z_T^spec], which the cross-modal transformer processes so that each embedding can attend to the others, yielding a context-aware fused representation for the final classification.
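The server-side alignment and fusion can be sketched as below. This is an illustrative reconstruction under stated assumptions: the transformer depth, head count, mean-pooling readout, and 7-way output (HAM10000 has seven lesion classes) are choices of this sketch; the source specifies only a cosine-based consistency loss L_cons and a cross-modal transformer over the four embedding tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def consistency_loss(z_i_inv: torch.Tensor, z_t_inv: torch.Tensor) -> torch.Tensor:
    """Sketch of L_cons: minimizing (1 - cosine similarity) pulls the two
    clients' invariant embeddings toward a shared latent direction."""
    return (1.0 - F.cosine_similarity(z_i_inv, z_t_inv, dim=-1)).mean()

class ServerFusion(nn.Module):
    """Cross-modal transformer over the 4-token sequence S; hyperparameters
    here (2 layers, 4 heads, mean pooling) are assumptions of this sketch."""
    def __init__(self, embed_dim: int = 64, num_classes: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, z_i_inv, z_i_spec, z_t_inv, z_t_spec):
        # Build S: one token per embedding -> shape (batch, 4, embed_dim).
        S = torch.stack([z_i_inv, z_i_spec, z_t_inv, z_t_spec], dim=1)
        fused = self.encoder(S).mean(dim=1)  # pool over the 4 tokens
        return self.head(fused)
```

In training, L_cons would be added to the classification loss so the server jointly aligns the invariant embeddings and learns the fused classifier.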