VertCoHiRF: Decentralized Vertical Clustering Beyond k-means
Vertical Federated Learning (VFL) enables collaborative analysis across parties holding complementary feature views of the same samples. Existing approaches, however, are largely restricted to distributed variants of $k$-means: they require centralized coordination or the exchange of feature-dependent numerical statistics, and they exhibit limited robustness under heterogeneous views or adversarial behavior. We introduce VertCoHiRF, a fully decentralized framework for vertical federated clustering based on structural consensus across heterogeneous views, in which each agent applies a base clustering method adapted to its local feature space in a peer-to-peer manner. Rather than exchanging feature-dependent statistics or relying on noise injection for privacy, agents cluster their local views independently and reconcile their proposals through identifier-level consensus. Consensus is achieved via decentralized ordinal ranking to select representative medoids, progressively inducing a shared hierarchical clustering across agents. Communication is limited to sample identifiers, cluster labels, and ordinal rankings, providing privacy by design while supporting overlapping feature partitions and heterogeneous local clustering methods, and yielding an interpretable shared Cluster Fusion Hierarchy (CFH) that captures cross-view agreement at multiple resolutions. We analyze communication complexity and robustness, and experiments demonstrate competitive clustering performance in vertical federated settings.
💡 Research Summary
VertCoHiRF introduces a novel decentralized framework for vertical federated clustering that moves beyond the traditional reliance on k‑means–based objectives and central coordination. In a vertical federated learning (VFL) setting, multiple parties hold complementary subsets of features for the same set of samples. Existing VFL clustering methods typically require a central server, exchange of feature‑dependent numerical statistics (e.g., centroids, distance matrices), or the injection of differential‑privacy noise, which leads to privacy‑utility trade‑offs, limited scalability, and an inability to handle heterogeneous, non‑convex data structures.
The core insight of VertCoHiRF is structural consensus: a genuine global cluster should be observable consistently across heterogeneous feature views, whereas view‑specific artifacts will not survive cross‑view agreement. Each agent independently selects a base clustering method (BCM) that best fits its local feature modality—this may be k‑means, DBSCAN, spectral clustering, or any domain‑specific algorithm. Agents run the BCM only on the current set of active medoids (representative samples) and broadcast only two kinds of identifier‑level information: (1) cluster labels indexed by sample IDs, and (2) ordered lists of sample IDs representing local rankings of medoid candidates within each consensual cluster. No raw features, distances, embeddings, or aggregated statistics are ever transmitted.
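The two kinds of identifier-level payloads described above can be pictured as a simple message schema. The sketch below is illustrative only: the class name `AgentMessage` and the field names `labels` and `rankings` are assumptions, not from the paper, and sample IDs are modeled as strings.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMessage:
    """Hypothetical identifier-level broadcast from one agent (illustrative)."""

    # (1) cluster labels from the agent's local BCM, indexed by sample ID
    labels: dict[str, int] = field(default_factory=dict)
    # (2) per-cluster ordered lists of sample IDs: the agent's local ranking
    #     of medoid candidates within each consensual cluster
    rankings: dict[int, list[str]] = field(default_factory=dict)


msg = AgentMessage(
    labels={"s1": 0, "s2": 0, "s3": 1},
    rankings={0: ["s2", "s1"], 1: ["s3"]},
)
# Note what is absent: no raw features, distances, embeddings, or
# aggregated statistics appear anywhere in the payload.
```

The point of the schema is the negative space: everything an agent transmits is discrete and identifier-valued, which is what makes the protocol privacy-preserving by design rather than by noise injection.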
The protocol proceeds iteratively in two phases. Phase 1 (Structural consensus) aggregates the label vectors from all agents into a joint structural code. Two samples belong to the same consensual cluster if and only if their codes match across all agents. This mechanism tolerates heterogeneous local partitions because agreement is required only at the discrete label level. Phase 2 (Representative consensus) then refines each consensual cluster by having every agent rank its members according to its local view (e.g., distance to a local centroid). The global score of a candidate medoid is the sum of its rank positions across agents; the candidate with the lowest total score becomes the new representative for that cluster. The set of selected medoids forms the active set for the next iteration, and non‑selected samples are permanently attached as children, thereby constructing a bottom‑up hierarchy.
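The two phases can be sketched compactly. This is a minimal illustration under simplifying assumptions, not the paper's implementation: `labels_per_agent` holds each agent's local label map over the active set, and each entry of `rank_fns` stands in for an agent's local ordering of a cluster's members (e.g., by distance to a local centroid); all names are hypothetical.

```python
from collections import defaultdict


def structural_consensus(labels_per_agent, active):
    """Phase 1: two samples share a consensual cluster iff their
    label codes match across ALL agents."""
    clusters = defaultdict(list)
    for s in active:
        code = tuple(labels[s] for labels in labels_per_agent)
        clusters[code].append(s)
    return list(clusters.values())


def representative_consensus(cluster, rank_fns):
    """Phase 2: each agent ranks the cluster's members; a candidate's
    global score is the sum of its rank positions, and the lowest
    total score wins the medoid role."""
    scores = {s: sum(rank_fn(cluster).index(s) for rank_fn in rank_fns)
              for s in cluster}
    return min(cluster, key=lambda s: scores[s])
```

For example, with two agents whose labels agree on samples `a` and `b` but disagree on `c`, Phase 1 yields the consensual clusters `{a, b}` and `{c}`; Phase 2 then elects one medoid per cluster, and the elected medoids form the active set for the next iteration while the remaining samples are attached as children in the hierarchy.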
VertCoHiRF’s communication cost is rigorously analyzed. At iteration e, each agent sends at most A(A‑1)·h·n(e‑1)·log₂(max Cₐ) bits for label exchange (where Cₐ is the number of local clusters) and n(e)·Nₛ·log₂(n) bits for the ordered medoid lists (with Nₛ candidates per cluster). Because the active set size n(e) monotonically shrinks, total bandwidth is dominated by the first few rounds, yielding linear scaling in the number of samples (up to logarithmic factors) and quadratic scaling in the number of agents. No central aggregator is needed, making the approach suitable for large‑scale peer‑to‑peer deployments.
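A back-of-envelope calculator for the per-iteration bounds stated above can make the scaling concrete. This is a sketch under assumptions: `A` is the number of agents, `n` the total number of samples, `n_prev` and `n_cur` the active-set sizes at iterations e-1 and e, `C_max` the largest local cluster count, `N_s` the number of medoid candidates per cluster, and `h` the factor from the bound (its exact meaning is not spelled out in the summary; it is passed through unchanged here).

```python
import math


def label_exchange_bits(A, h, n_prev, C_max):
    # Per-agent label-exchange bound: A(A-1) * h * n(e-1) * log2(max C_a).
    # Quadratic in the number of agents A.
    return A * (A - 1) * h * n_prev * math.log2(C_max)


def ranking_bits(n_cur, N_s, n):
    # Ordered-medoid-list bound: n(e) * N_s * log2(n).
    # Near-linear in the total sample count n (up to the log factor).
    return n_cur * N_s * math.log2(n)
```

Because n(e) shrinks monotonically across iterations, both terms are dominated by the first few rounds, which is why total bandwidth scales roughly linearly in the number of samples and quadratically in the number of agents.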
Robustness is achieved through an agent‑level veto mechanism: a single agent can block a cluster if its local view contradicts the proposed structure. This design provides inherent resistance to Byzantine or malicious participants without relying on majority voting or heavyweight cryptographic protocols.
Empirical evaluation covers synthetic benchmarks and real‑world datasets from healthcare and finance, where feature partitions are overlapping and data manifolds are non‑convex. VertCoHiRF consistently matches or exceeds the Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) of state‑of‑the‑art k‑means‑based VFL methods. Notably, when agents employ non‑k‑means BCMs tailored to their data (e.g., density‑based clustering for high‑dimensional clinical measurements), the performance gap widens, demonstrating the advantage of allowing heterogeneous local algorithms. Communication volume drops sharply after the initial iterations, and the method remains stable even when up to 20 % of agents behave adversarially.
In summary, VertCoHiRF delivers five key contributions: (1) a privacy‑by‑design structural consensus that eliminates the need for feature‑level data exchange, (2) a fully decentralized peer‑to‑peer protocol without trusted coordinators, (3) flexibility to incorporate any local clustering technique, (4) an interpretable Cluster Fusion Hierarchy (CFH) that provides multi‑resolution insight into the consensus clustering, and (5) built‑in robustness against Byzantine agents. The paper opens avenues for future work on optimizing communication topologies, handling dynamic agent membership, and integrating federated learning objectives with clustering in a unified vertical federated framework.