Principled Federated Random Forests for Heterogeneous Data
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
💡 Research Summary
This paper introduces FedForest, a principled federated random‑forest algorithm designed for horizontally partitioned data where client datasets may be heterogeneous. The authors first identify the core difficulty: unlike deep‑learning models, random forests (RF) rely on greedy, impurity‑based split decisions that are non‑differentiable, making standard gradient‑based federated learning inapplicable. Existing federated tree methods either aggregate locally trained trees or use heuristic histogram merging, both of which fail to reproduce the exact split choices of a centralized CART and are especially fragile under covariate or outcome shift.
FedForest solves the problem in two stages.
- Federated quantile sketching: each client computes a small set of empirical quantiles (B per feature) for the samples that reach a given node. Clients send these quantile points to the server, which linearly interpolates them into piecewise-linear CDF approximations, then mixes the CDFs with client-sample-size weights to estimate the pooled (mixture) distribution. Candidate split thresholds are taken as interior quantiles of this pooled CDF. Theoretical analysis shows that the reconstructed CDF deviates from the true pooled empirical CDF by at most 1/B uniformly, and that for any true midpoint split there exists a quantile-based candidate that disagrees on at most 3/(2B) of the samples. Thus the candidate set is provably close to the exact centralized set while requiring only O(B) communication per node.
- Exact impurity reconstruction: impurity measures (variance, Gini, entropy) can be expressed via additive sufficient statistics: counts, label sums, and label squared sums. Each client computes these statistics for the left and right child partitions induced by every candidate threshold and sends them to the server. By summing them across clients, the server recovers the exact impurity reduction that would be obtained if all data were pooled, guaranteeing that the split selected by FedForest coincides with the centralized CART choice.
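The quantile-sketching stage can be illustrated with a minimal NumPy sketch. This is not the paper's code: the function names (`client_sketch`, `pooled_cdf`, `candidate_thresholds`) and the exact interpolation details are illustrative assumptions, but the flow follows the summary — each client sends B+1 quantile points, the server interpolates piecewise-linear CDFs, mixes them by sample size, and reads candidate thresholds off the pooled CDF.

```python
import numpy as np

def client_sketch(x, B):
    """Each client reports B+1 empirical quantiles of its local feature values."""
    probs = np.linspace(0.0, 1.0, B + 1)
    return np.quantile(x, probs)

def pooled_cdf(sketches, sizes):
    """Server: interpolate each sketch into a piecewise-linear CDF, then mix
    the CDFs with client-sample-size weights to approximate the pooled CDF."""
    weights = np.asarray(sizes, dtype=float) / np.sum(sizes)
    grid = np.unique(np.concatenate(sketches))       # common evaluation grid
    probs = np.linspace(0.0, 1.0, len(sketches[0]))  # quantile levels
    cdf = np.zeros_like(grid, dtype=float)
    for w, q in zip(weights, sketches):
        cdf += w * np.interp(grid, q, probs)
    return grid, cdf

def candidate_thresholds(grid, cdf, B):
    """Candidate splits: interior quantiles of the pooled CDF."""
    targets = np.linspace(0.0, 1.0, B + 1)[1:-1]
    return np.interp(targets, cdf, grid)
```

Only the O(B) quantile points ever leave a client; raw feature values stay local.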
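The impurity-reconstruction stage can be sketched the same way for the variance criterion. The statistics and their additivity are as described in the summary; the helper names are illustrative assumptions. The key property is that summing per-client (count, label sum, label squared sum) triples recovers exactly the pooled variance reduction.

```python
import numpy as np

def client_stats(x, y, threshold):
    """Each client reports (n, sum y, sum y^2) for the left and right child
    induced by a candidate threshold on its local data."""
    left = x <= threshold
    def stats(mask):
        yy = y[mask]
        return np.array([mask.sum(), yy.sum(), (yy ** 2).sum()])
    return stats(left), stats(~left)

def variance_impurity(s):
    """Population variance from additive statistics s = (n, sum y, sum y^2)."""
    n, sy, syy = s
    return 0.0 if n == 0 else syy / n - (sy / n) ** 2

def pooled_gain(all_stats):
    """Server: sum the additive statistics across clients, then compute the
    exact centralized variance reduction for this threshold."""
    left = sum(s[0] for s in all_stats)
    right = sum(s[1] for s in all_stats)
    parent = left + right
    n = parent[0]
    return (variance_impurity(parent)
            - left[0] / n * variance_impurity(left)
            - right[0] / n * variance_impurity(right))
```

Because the statistics are exactly additive, no approximation enters at this stage: the server's gain equals the gain computed on the pooled data.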
The method naturally accommodates three heterogeneity regimes: (i) homogeneous data (no client effect), (ii) covariate shift (different P(X|H) but common P(Y|X)), and (iii) outcome shift (client‑specific P(Y|X)). In the outcome‑shift setting the client identifier H can be treated as a categorical feature; FedForest splits on H without extra communication, providing a non‑parametric personalization mechanism absent from prior federated RF work.
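Splitting on the client identifier H requires no extra communication because each client's samples fall entirely on one side of any "H in S vs. H not in S" split, so per-client additive statistics already held by the server suffice. A hypothetical sketch (the subset enumeration and variance criterion are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from itertools import combinations

def best_client_split(node_stats):
    """node_stats[h] = (n, sum y, sum y^2) for client h at this node.
    Returns the best-gain binary partition of the clients."""
    def impurity(s):
        n, sy, syy = s
        return 0.0 if n == 0 else syy / n - (sy / n) ** 2
    clients = list(node_stats)
    total = np.sum([node_stats[h] for h in clients], axis=0)
    best = (-np.inf, None)
    # enumerate proper subsets S of clients as the left side of the split
    for r in range(1, len(clients)):
        for S in combinations(clients, r):
            left = np.sum([node_stats[h] for h in S], axis=0)
            right = total - left
            gain = (impurity(total)
                    - left[0] / total[0] * impurity(left)
                    - right[0] / total[0] * impurity(right))
            if gain > best[0]:
                best = (gain, set(S))
    return best
```

A client whose conditional P(Y|X) differs from the rest is then isolated by the split, which is the non-parametric personalization mechanism described above.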
For homogeneous settings the authors also propose a lighter variant, AvgImp, which aggregates only local impurity‑gain values and enjoys a finite‑sample error bound, further reducing communication.
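The AvgImp variant can be sketched in a few lines, again as an illustrative assumption rather than the paper's implementation: each client evaluates the candidate thresholds on its local data and sends back only scalar gain values, which the server averages with sample-size weights.

```python
import numpy as np

def local_gain(x, y, t):
    """Variance reduction of threshold t on one client's local data."""
    left = x <= t
    if left.all() or not left.any():
        return 0.0
    return (y.var()
            - left.mean() * y[left].var()
            - (~left).mean() * y[~left].var())

def avg_imp_split(client_data, thresholds):
    """Server: sample-size-weighted average of local gains; pick the argmax."""
    sizes = np.array([len(x) for x, _ in client_data], dtype=float)
    w = sizes / sizes.sum()
    gains = np.array([[local_gain(x, y, t) for t in thresholds]
                      for x, y in client_data])
    return thresholds[int(np.argmax(w @ gains))]
```

Communication drops from three numbers per threshold per child to one, at the cost of exactness; the finite-sample bound mentioned above covers the homogeneous case where local and pooled gains agree in expectation.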
Algorithmically, FedForest builds each tree independently using standard RF randomizations (bootstrap sampling, feature subsampling, depth and leaf-size limits). At each node the server orchestrates candidate generation and split evaluation using only aggregated statistics, after which clients route their samples to the appropriate child node. Because trees are grown in parallel, the approach scales well, and the per-node communication cost is bounded by the size of the quantile sketches and the additive statistics.
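Wiring the two stages described in the summary together gives the per-node protocol for a single feature. This end-to-end sketch is hypothetical glue code under the same assumptions as above (variance criterion, quantiles from raw points), not the authors' implementation:

```python
import numpy as np

def _impurity(s):
    """Variance impurity from additive stats s = (count, sum y, sum y^2)."""
    return 0.0 if s[0] == 0 else s[2] / s[0] - (s[1] / s[0]) ** 2

def node_split(client_data, B=32):
    """One federated node split on a single feature: quantile-sketch
    candidates (stage 1), then exact gain from summed statistics (stage 2)."""
    sizes = np.array([len(x) for x, _ in client_data], dtype=float)
    probs = np.linspace(0.0, 1.0, B + 1)
    sketches = [np.quantile(x, probs) for x, _ in client_data]
    grid = np.unique(np.concatenate(sketches))
    cdf = sum(n * np.interp(grid, q, probs)
              for n, q in zip(sizes, sketches)) / sizes.sum()
    thresholds = np.interp(probs[1:-1], cdf, grid)
    best_gain, best_t = -np.inf, None
    for t in thresholds:
        sides = np.zeros((2, 3))
        for x, y in client_data:
            m = x <= t
            for i, mask in enumerate((m, ~m)):
                sides[i] += [mask.sum(), y[mask].sum(), (y[mask] ** 2).sum()]
        if 0 in sides[:, 0]:
            continue  # degenerate split: one child empty
        parent = sides.sum(axis=0)
        gain = _impurity(parent) - sum(s[0] / parent[0] * _impurity(s)
                                       for s in sides)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

In the full algorithm this routine would run once per subsampled feature per node, with clients then routing their samples to the winning child nodes.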
Empirical evaluation on synthetic benchmarks covering covariate shift, outcome shift, and combined shifts, as well as real‑world heterogeneous datasets from healthcare, finance, and advertising, demonstrates that FedForest matches the predictive performance of a centralized RF (within a few percentage points) while using substantially fewer communication rounds than local‑ensemble baselines. In outcome‑shift scenarios, splitting on the client indicator yields noticeable personalization gains. The paper also discusses extensions to quantile forests, survival forests, and causal forests, and sketches how differential privacy could be incorporated into the quantile‑sketching step.
In summary, FedForest provides the first theoretically grounded framework for federated random forests that exactly reproduces centralized greedy split decisions, handles realistic client heterogeneity, enables client‑aware splits for personalization, and does so with communication efficiency suitable for practical federated deployments.