FedLLM-Align: Feature Extraction From Heterogeneous Clients
Federated learning (FL) enables collaborative model training without sharing raw data, making it attractive for privacy-sensitive domains such as healthcare, finance, and IoT. A major obstacle in practical settings, however, is the heterogeneity of tabular data across clients, where schema mismatches and incompatible feature spaces prevent straightforward aggregation. To address this challenge, this paper proposes FedLLM-Align, a federated learning framework that leverages pretrained transformer-based language models for feature extraction. FedLLM-Align serializes tabular records into text and derives semantically aligned embeddings from a pretrained LLM encoder (e.g., DistilBERT), enabling lightweight local classifier heads to be trained in a federated manner with standard aggregation schemes (e.g., FedAvg) while all raw data records remain local. To quantify the merits and trade-offs of FedLLM-Align, we evaluate the proposed framework on binary classification tasks from two different domains: i) coronary heart disease prediction on partitioned Framingham Heart Study data, and ii) customer churn prediction on a financial dataset. Under simulated schema heterogeneity, FedLLM-Align outperforms state-of-the-art baselines by up to 25% in F1 score and reduces communication overhead by 65%. These results establish FedLLM-Align as a privacy-preserving and communication-efficient approach for federated training over the heterogeneous tabular datasets commonly encountered in practice.
💡 Research Summary
Federated learning (FL) has become a cornerstone for collaborative model training in privacy‑sensitive domains because it allows multiple parties to improve a global model without sharing raw data. However, most FL research assumes that all participants share a common feature schema, an assumption that breaks down in real‑world settings such as hospitals, banks, or edge‑device networks where each client may collect a different set of attributes, use distinct naming conventions, or even store data in incompatible units. This “schema heterogeneity” prevents straightforward parameter aggregation and can cause divergence or severe performance loss.
The paper introduces FedLLM‑Align, a novel FL framework that tackles schema heterogeneity by using a pretrained large language model (LLM) as a universal feature encoder. The approach consists of three stages:
- Tabular‑to‑text serialization – each client converts every record into a short, structured natural‑language string (e.g., “Age: 45, Blood pressure: 140/90”). Different serialization formats (structured, free‑form, compact) are explored, but the structured format yields the most reliable embeddings.
- Semantic embedding generation – the serialized strings are fed into a frozen LLM encoder (DistilBERT, ALBERT, etc.), and the CLS token is extracted as a dense vector (typically 768‑dimensional). Because the encoder is frozen, no gradients or model weights are exchanged; the embeddings are used only locally. The pretrained LLM’s broad linguistic knowledge aligns semantically equivalent attributes across clients (e.g., “Age” vs. “PatientAge”), effectively mapping heterogeneous schemas into a shared latent space.
- On‑device classifier training – each client trains a lightweight downstream classifier (a shallow feed‑forward network) on the LLM embeddings. Only the classifier’s parameters are communicated to the central server, which aggregates these updates with standard FL algorithms such as FedAvg, FedProx, or SCAFFOLD, preserving compatibility with existing FL pipelines.
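The serialization stage above can be sketched in a few lines. This is a minimal illustration of the “structured” format, not the authors’ exact template; the function name and the example records are hypothetical:

```python
def serialize_record(record: dict) -> str:
    """Convert one tabular row into a structured natural-language string
    of "Feature: value" pairs, as in the paper's structured format
    (the exact separator and template are assumptions)."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())

# Two clients naming the same attribute differently.
client_a = {"Age": 45, "Systolic BP": 140}
client_b = {"PatientAge": 45, "BP systolic": 140}

text_a = serialize_record(client_a)  # "Age: 45, Systolic BP: 140"
text_b = serialize_record(client_b)  # "PatientAge: 45, BP systolic: 140"

# Each string would then be passed through a frozen encoder (e.g.
# DistilBERT) and the [CLS] embedding kept as the feature vector;
# the pretrained model maps the two phrasings to nearby points in
# the shared latent space.
```

Because the encoder never sees column names in a fixed position, renamed or reordered attributes simply become different surface forms of the same semantic content.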
Key technical contributions and insights
- Semantic alignment without schema mapping – by leveraging the LLM’s pretrained knowledge, FedLLM‑Align automatically aligns features that are syntactically different but semantically identical, eliminating the need for manual schema mapping or feature engineering.
- Privacy preservation – raw tabular rows and intermediate embeddings never leave the client device; only the small classifier weight vectors are transmitted. This reduces the attack surface compared to methods that share embeddings or raw predictions.
- Communication efficiency – because the encoder is fixed, the communication payload per round consists solely of the classifier weights (typically a few thousand parameters). Experiments report a 65% reduction in total transmitted data relative to conventional FL that exchanges full model parameters.
- Compatibility with existing FL optimizers – the frozen encoder creates a static feature space, allowing standard convergence guarantees for FedAvg‑type algorithms to hold under the usual smoothness and bounded‑variance assumptions.
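Since only the classifier head travels, server-side aggregation reduces to plain FedAvg over small weight vectors. A minimal sketch (flattened parameter lists and data-size weighting assumed; not the authors’ code):

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client parameter vectors (standard FedAvg).

    client_weights: one flat list of floats per client (equal lengths).
    client_sizes:   local example counts, used as aggregation weights.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    avg = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            avg[i] += w * size / total
    return avg

# Two clients, the first holding twice as much data as the second.
global_w = fedavg([[1.0, 0.0], [4.0, 3.0]], client_sizes=[2, 1])
# → [2.0, 1.0]
```

With a head of only a few thousand parameters, each round’s payload is a tiny fraction of what full-model FL would transmit, which is the source of the reported 65% reduction.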
Experimental evaluation
Two public datasets from distinct domains were used:
- Framingham Heart Study – 4,240 patients, 15 clinical attributes, binary prediction of coronary heart disease.
- Bank customer churn – 10,000 retail banking customers, 10 demographic/financial attributes, binary churn prediction.
Each dataset was partitioned into 5–8 simulated clients. To emulate realistic heterogeneity, the authors randomly dropped attributes, renamed them, or altered units for each client, resulting in minimal or zero overlap between schemas.
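That perturbation protocol can be sketched as simple per-client schema edits. The rename map and drop choices below are illustrative, not the authors’ exact procedure (unit alterations are omitted for brevity):

```python
def perturb_schema(rows, rename=None, drop=None):
    """Simulate one client's heterogeneous view of a shared tabular
    dataset: rename some columns and drop others."""
    rename = rename or {}
    drop = set(drop or [])
    return [
        {rename.get(k, k): v for k, v in row.items() if k not in drop}
        for row in rows
    ]

rows = [{"Age": 52, "Cholesterol": 230, "Smoker": 1}]
client_view = perturb_schema(rows, rename={"Age": "PatientAge"}, drop=["Smoker"])
# → [{"PatientAge": 52, "Cholesterol": 230}]
```

Applying a different rename/drop configuration per client yields partitions with minimal or zero schema overlap, matching the evaluation setting described above.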
Baseline methods included traditional FL approaches (FedXGBoost, FedProx, SCAFFOLD, Mutual‑Information‑based FL) and more advanced techniques (Clustered FL, FedAvg with homogeneous schemas).
- Performance – under heterogeneous schemas, FedLLM‑Align achieved up to a 25% absolute improvement in F1 score over the strongest baseline, with average gains of around 18% across both tasks.
- Convergence speed – the method converged in roughly 80% of the rounds required by a homogeneous‑schema FedAvg, indicating that the shared semantic space accelerates learning.
- Communication – total transmitted parameters were reduced by 65% compared with baseline FL that shares full model weights.
Limitations and future directions
- Encoder computational cost – Even lightweight LLMs like DistilBERT require several hundred milliseconds per record on a CPU, which may be prohibitive for ultra‑low‑power edge devices. Model quantization, distillation, or using even smaller transformers (e.g., ALBERT‑tiny) are potential remedies.
- Serialization design – Overly terse or overly verbose serialization can degrade embedding quality. Domain‑specific templates that preserve critical meta‑information are essential.
- Security beyond parameter privacy – While raw data never leaves the device, a malicious server could infer information from repeated classifier updates. Integrating differential privacy or secure aggregation protocols such as SecEA would strengthen guarantees.
- Extensibility – The authors propose exploring dynamic prompt optimization to improve serialization, multi‑modal extensions (e.g., combining imaging data), and tighter coupling with cryptographic aggregation schemes.
Conclusion
FedLLM‑Align demonstrates that a frozen, pretrained LLM can serve as a universal feature extractor, turning heterogeneous tabular data into a common semantic embedding space. By training only lightweight classifiers on these embeddings and aggregating their parameters with standard FL algorithms, the framework simultaneously improves predictive performance, reduces communication overhead, and preserves data privacy. The work opens a promising avenue for deploying federated learning in real‑world settings where schema heterogeneity is the norm rather than the exception. Future research will focus on further reducing encoder overhead, automating serialization, and bolstering security through advanced cryptographic techniques.