FedPS: Federated data Preprocessing via aggregated Statistics
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication constraints pose further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
💡 Research Summary
The paper addresses a critical yet under‑explored component of federated learning (FL): data preprocessing. While most FL research focuses on improving training algorithms under privacy constraints, it typically assumes that data have already been cleaned, normalized, and otherwise prepared. In real‑world deployments, raw data are often heterogeneous, contain missing values, have inconsistent formats, and exhibit disparate feature scales, all of which can severely degrade model accuracy, convergence speed, and interpretability if left unaddressed.
FedPS (Federated data Preprocessing via aggregated Statistics) is introduced as a unified, communication‑efficient framework that enables consistent preprocessing across clients without ever exposing raw data. The core idea is a five‑step workflow: (1) each client computes local sufficient statistics or compact sketches of its data; (2) these summaries are sent to a central server; (3) the server aggregates them to obtain global statistics (means, variances, minima, maxima, quantiles, frequent items, etc.); (4) global preprocessing parameters are derived from these statistics; and (5) the parameters are broadcast back to the clients, which apply the transformations locally.
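The five-step workflow is easiest to see for a scaler. The sketch below is our own pure-Python illustration (the function names are hypothetical, not the FedPS API), showing a federated StandardScaler where clients share only sufficient statistics and the server broadcasts back global parameters:

```python
# Sketch of the five-step FedPS workflow for a federated StandardScaler.
# Pure-Python illustration; function names are ours, not the FedPS API.
import math

def client_summary(column):
    """Step 1: each client computes local sufficient statistics."""
    return (len(column), sum(column), sum(x * x for x in column))

def server_aggregate(summaries):
    """Steps 2-4: the server sums the client summaries and derives the
    global mean and standard deviation (the scaler's parameters)."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, math.sqrt(var)

def client_transform(column, mean, std):
    """Step 5: clients apply the broadcast parameters locally."""
    return [(x - mean) / std for x in column]

# Two clients holding disjoint samples of the same feature.
a, b = [1.0, 2.0, 3.0], [4.0, 5.0]
mean, std = server_aggregate([client_summary(a), client_summary(b)])
print(mean)  # 3.0, identical to centralized preprocessing
```

Because the statistics are additive, one round of communication suffices, and the result matches what centralized preprocessing would produce.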
The framework leverages data‑sketching techniques such as KLL (Karnin‑Lang‑Liberty) for approximate quantiles, REQ for relative‑error quantiles, and frequent‑item sketches from the DataSketches library. These sketches provide strong theoretical error guarantees while keeping communication overhead logarithmic in the dataset size. Simple preprocessing tasks (StandardScaler, MinMaxScaler, RobustScaler) require only a handful of scalars per feature (StandardScaler, for instance, needs just the count, sum, and sum of squares), enabling a single communication round. More complex tasks, such as KBinsDiscretizer for discretization, KNNImputer for nearest‑neighbor imputation, and IterativeImputer for multivariate imputation, build on the same aggregation principle but involve additional rounds of iterative refinement.
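The key property these sketches share is mergeability: per-client summaries can be combined at the server into a summary of the union of the data. The stand-in below illustrates that principle with a fixed-bin histogram instead of a real KLL/REQ sketch (the paper uses DataSketches for much tighter error guarantees); it assumes the feature's global range is known, and all names are hypothetical:

```python
# Mergeable-summary principle behind quantile sketches, using a
# fixed-bin histogram as a simplified stand-in for a KLL/REQ sketch.
def client_histogram(column, lo, hi, bins=10):
    """Each client bins its values over a shared range [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in column:
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    return counts

def merged_quantile(histograms, lo, hi, q):
    """The server merges histograms by element-wise addition, then
    reads an approximate q-quantile off the cumulative counts."""
    bins = len(histograms[0])
    merged = [sum(h[i] for h in histograms) for i in range(bins)]
    target = q * sum(merged)
    width = (hi - lo) / bins
    cum = 0
    for i, c in enumerate(merged):
        cum += c
        if cum >= target:
            return lo + (i + 1) * width  # right edge of the bin
    return hi

# Two clients; the server never sees individual values.
h1 = client_histogram([1, 2, 3, 4, 5], 0, 10)
h2 = client_histogram([6, 7, 8, 9], 0, 10)
median = merged_quantile([h1, h2], 0, 10, 0.5)
```

Quantile-based methods like RobustScaler and KBinsDiscretizer follow the same pattern, with the sketch replacing the crude histogram to bound the approximation error.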
A notable contribution is the federated implementation of Bayesian Linear Regression (BLR), which serves as the regression engine for IterativeImputer and can also be used directly for model‑based preprocessing. BLR places isotropic Gaussian priors on the weight vector and updates hyper‑parameters (α, β) via an EM‑like scheme. In the federated setting, clients locally compute second‑order statistics (XᵀX and XᵀY) and transmit them; the server aggregates these to obtain global posterior means and covariances, which are then used to predict missing entries. This approach works seamlessly in both horizontal FL (same feature space, different samples) and vertical FL (different feature spaces, same identifiers).
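The aggregation step for BLR can be sketched as follows. This is an illustrative simplification with fixed hyper-parameters α and β rather than the paper's EM-style updates, and the function names are ours:

```python
# Federated Bayesian Linear Regression: the Gaussian posterior over the
# weights is computed from aggregated second-order statistics only.
# Illustrative sketch with fixed alpha, beta (no EM-style updates).
import numpy as np

def client_stats(X, y):
    """Each client shares only X^T X and X^T y, never raw rows."""
    return X.T @ X, X.T @ y

def posterior(stats, alpha=1.0, beta=25.0):
    """Server sums the client statistics and computes the posterior
    N(m, S): S^-1 = alpha*I + beta * sum(X^T X), m = beta * S * sum(X^T y)."""
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    S_inv = alpha * np.eye(xtx.shape[0]) + beta * xtx
    S = np.linalg.inv(S_inv)
    m = beta * S @ xty
    return m, S

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 2)), rng.normal(size=(30, 2))
w_true = np.array([2.0, -1.0])
y1, y2 = X1 @ w_true, X2 @ w_true  # noiseless targets for illustration
m, S = posterior([client_stats(X1, y1), client_stats(X2, y2)])
# m approaches w_true as data grows; S quantifies remaining uncertainty.
```

Because XᵀX and XᵀY are additive across clients, the same code covers the horizontal setting directly; the vertical setting partitions the columns of X instead, so clients contribute blocks of the same second-order matrices.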
The authors provide a comprehensive taxonomy of preprocessing methods (scaling, encoding, transformation, discretization, imputation) and map each to the required sufficient statistics, as summarized in Table 1. They also discuss the communication cost of each method, showing that even the most demanding tasks (quantile‑based binning, frequent‑item estimation) remain orders of magnitude cheaper than naïvely transmitting raw data.
Empirical evaluation spans several heterogeneous tabular datasets with varying degrees of non‑IID distributions. Experiments compare three baselines: (a) raw data (no preprocessing), (b) purely local preprocessing, and (c) FedPS‑enabled federated preprocessing. Results demonstrate that FedPS consistently outperforms both baselines, achieving up to a 12 percentage‑point increase in classification accuracy and a 5–10 % reduction in training epochs needed for convergence. Communication overhead is reduced by a factor of 8–12 relative to centralized preprocessing, confirming the practicality of the approach.
The paper also releases an open‑source library (https://github.com/xuefeng-xu/fedps) that implements the full pipeline, including federated k‑Means for clustering‑based discretization and the federated BLR module. Limitations are acknowledged: sketch‑based approximations introduce small errors, and very high‑dimensional data may still incur non‑trivial communication costs for second‑order statistics. Future work is suggested on integrating differential privacy mechanisms and dimensionality‑reduction techniques to further protect client data while scaling to massive feature spaces.
In summary, FedPS fills a vital gap in the FL ecosystem by providing a principled, privacy‑preserving, and communication‑efficient solution for data preprocessing. By unifying simple aggregations with sophisticated model‑based imputations, it enables FL deployments to achieve higher model quality and faster convergence without compromising the core privacy guarantees that define federated learning.