Operator-Theoretic Framework for Gradient-Free Federated Learning
📝 Original Info
- Title: Operator-Theoretic Framework for Gradient-Free Federated Learning
- ArXiv ID: 2512.01025
- Date: 2025-11-30
- Authors: Mohit Kumar, Mathias Brucker, Alexander Valentinitsch, Adnan Husakovic, Ali Abbas, Manuela Geiß, Bernhard A. Moser
📝 Abstract
Background: Federated learning in practice must address client heterogeneity, strict communication and computation requirements, and data privacy, while optimizing performance.
Objectives: Develop an operator-theoretic framework for federated learning that simultaneously addresses statistical heterogeneity, performance guarantees, and privacy under practical communication and computation constraints.
Methods: We first map the $L^2$-optimal solution into a reproducing kernel Hilbert space (RKHS) using a forward operator. Using the available data in that RKHS, we approximate the optimal solution. We then map this solution back to the original $L^2$ function space via the inverse operator. This construction yields a gradient-free learning scheme. We derive explicit finite-sample performance bounds for this scheme using concentration inequalities over operator norms. The framework analytically identifies a data-dependent hypothesis space and provides guarantees on risk, prediction error, robustness, and approximation error. Within this space, we design a communication- and computation-efficient model using kernel machines, leveraging the space folding property of Kernel Affine Hull Machines (KAHMs). Clients transfer knowledge to the server using a novel scalar metric, the space folding measure, derived from KAHMs. Being a scalar, this measure greatly reduces communication overhead. It also supports a simple differentially private FL protocol in which scalar space folding summaries are computed from noise-perturbed data matrices obtained via a single application of a noise-adding mechanism, thereby avoiding per-round gradient clipping and privacy accounting. Finally, the induced global prediction rule can be implemented using a small number of integer minimum and equality-comparison operations per test point, making it structurally compatible with fully homomorphic encryption (FHE) during inference.
Results: Across four benchmarks (20Newsgroup, XGLUE-NC, CIFAR-10-LT, CIFAR-100-LT), the resulting gradient-free FL method built on fixed encoder embeddings is competitive with, and in several cases outperforms, strong gradient-based federated fine-tuning, with gains of up to 23.7 percentage points on the considered benchmarks. In differentially private experiments, the proposed kernel-based smoothing mechanism partially offsets the accuracy loss caused by noise in high-privacy regimes. The induced global prediction rule admits an FHE realization based on $Q \times C$ encrypted minimum and $C$ equality-comparison operations per test point (where $Q$ = #clients and $C$ = #classes), and our operation-level benchmarks for these primitives indicate latencies
📄 Full Content
We identify the following requirements for the development of an effective FL algorithm:
R1 (Hypothesis Space for Learning from Heterogeneous Distributed Data): Provide a mathematical framework that, without imposing parametric form or homogeneity assumptions on the client data distributions (beyond mild regularity conditions such as square-integrability), determines a suitable hypothesis space for task learning in a federated setting.
R2 (Theoretical Guarantees): Derive theoretical error bounds and evaluate the task learning solution in terms of 1) robustness of the prediction error against disturbances arising from uncertainties and data noise, and 2) accuracy of the solution and an asymptotic upper bound on the approximation error.
R3 (Communication Efficiency): Ensure that the analytically derived task learning solution can be implemented in a federated setting with communication and computational efficiency. Specifically, the optimization of the global model should not require multiple rounds of communication between the server and clients.
R4 (Efficient Differentially Private Knowledge Transfer Across Global and Local Models): Define a novel differentially private metric that allows for knowledge transfer from clients to the server by solving the global model optimization problem without requiring the exchange of gradients or model parameters (which may be computationally challenging and suboptimal for privacy preservation), while optimizing the utility-privacy tradeoff.
R5 (Computationally Efficient FHE-Based Secure Federated Learning): Rather than transmitting high-dimensional gradient or parameter update vectors, which entail substantial computational overhead when transmitted and operated on within encrypted domains, define novel low-dimensional attributes. These attributes must enable inference of the global model with low communication overhead and latency, thereby supporting a computationally efficient realization of fully homomorphically encrypted inference of the global model.
Researchers have approached these issues from multiple angles, including communication-efficient protocols, robust aggregation methods, differential privacy techniques, and secure computation. We briefly review these developments next.
Problem Setting and Scope
Throughout this work we focus on a practically motivated FL setting in which pretrained encoders (e.g., deep neural networks trained on large public corpora) are available to the participating clients, but are not jointly updated during federated training. This reflects scenarios where 1. encoder updates may be constrained by regulatory or validation requirements, 2. communication and computation budgets preclude repeated updates of large models, or 3. the clients intend to reuse a common frozen representation while collaboratively learning a task-specific prediction head.
Within this setting, our goal is not to improve representation learning itself, but to design and analyse a gradient-free, communication-efficient, and privacy-/security-aware federated head on top of such fixed encoders.
To address the above challenges, a variety of federated learning strategies have been explored. In this section, we review key state-of-the-art approaches related to communication efficiency and handling heterogeneous data, differentially private learning, and secure aggregation with homomorphic encryption, among others.
A cloud-edge architecture effectively mitigates communication and computation cost challenges by offloading computationally intensive learning tasks to the edge [76]. To accelerate model convergence and thus reduce the number of communication rounds during the learning process, device-to-device communication can be leveraged to mitigate the local over-fitting issue [24]. To reduce the communication cost, a low-rank Hadamard product parametrization of the model parameters has been suggested [31]. Instead of training and transmitting full models, sparse models can be considered for computational and communication efficiency [5]. To enhance robustness against heterogeneity and improve communication efficiency, clients and the server exchange abstract prototypes, while local prototypes (rather than gradients) are aggregated [73]. To tailor the local model size, and consequently the computation, memory, and data exchange requirements, to the available client resources, an importance-based pruning mechanism has been proposed to extract lower-footprint nested submodels [29]. To facilitate FL on heterogeneous devices, a split-mix strategy [27] enables the learning of base sub-networks with varying sizes and robustness levels, which can be aggregated on-demand to meet specific inference requirements. The challenges of system heterogeneity and connection uncertainty in federated learning can be tackled by developing models that are readily prunable to arbitrary sizes and can thus be structurally decomposed for learning, inference, and transmission [85]. The training on the devices can be accelerated by introducing sparsity [62]. To achieve a faster convergence rate in theory and practice, adaptive gradient methods have been integrated into FL [77]. Alternatively, sparse and complementary subsets of the dense model are exchanged between server and clients to reduce communication and computational cost [33]. The similarities among clients can be assessed to enable personalized FL while reducing communication overhead, achieved through the optimization of aggregation weights [48].
A study [25] identifies local learning bias as a pitfall of FL with heterogeneous data, and introduces an algorithm that leverages label-distribution-agnostic pseudo-data to reduce the learning bias on local features and classifiers. An empirical study [81] reports that large sparse convolution kernels can lead to enhanced robustness against distribution shifts in FL. A robust FL approach alleviates the worst-case effect of distribution shifts on the model performance; this approach has been followed in [66] for the case of affine distribution shifts by minimizing the maximum possible loss induced by distribution shifts across clients. An adversarial learning approach [44] has been considered for addressing distribution shifts, where the server aims to train a discriminator to distinguish the representations of the clients while the clients aim to generate a common representation distribution. Knowledge distillation is another approach to address heterogeneity, where, e.g., the knowledge about the global view of the data distribution is extracted by the server and distilled to guide local models' learning [86]. The clustered FL approach [69,75] addresses data distribution heterogeneity by grouping clients with similar distributions. This enables clients within the same cluster to mutually benefit from federated learning while reducing harmful interference from clients with dissimilar distributions. A clustered FL algorithm [22] alternately estimates the cluster identities of the clients and optimizes model parameters for the client clusters. Assuming that each client's data follows a mixture of multiple distributions, a method [68] facilitates the simultaneous training of cluster models and personalized local models.
A personalized FL approach to tackle statistical heterogeneity involves the clients and the server learning a global representation together, while each client learns its unique head locally [15]. Along this line, a study [2] suggests learning a common kernel function (parameterized by a neural network) across all clients, while each client employs a personalized Gaussian Process model. A simple personalization mechanism can be provided using a local k-nearest-neighbors model based on the shared representation provided by the global model [50]. The personalized FL problem can be studied under the model-agnostic meta-learning framework [19], where the goal is to find an initial shared model that clients can easily adapt to their local datasets. Instead of using a meta-model as the initialization, both the personalized and global models can be pursued in parallel by formulating a bi-level optimization problem using the Moreau envelope as a regularized loss function [18]. To enable pairwise collaborations between clients with similar data, a mechanism has been proposed for personalized federated learning which exchanges weighted model-aggregation messages between personalized models and personalized cloud models [30]. To address the computational limitations of heterogeneous devices in personalized FL, optimized masking vectors (derived by minimizing the bias term in the convergence bound) can be employed to train a sub-network of the learning model for each device, tailored to the device's computational capacity [70]. A personalized FL method [84] adaptively aggregates the global model and the previous local model to initialize the local model.
Differential privacy is the gold-standard approach to provide anonymization guarantees for the data used in FL [6]. Differential privacy can be enforced within the machine learning pipeline at any of three stages: on the input training data, during model training, or on model predictions [61]. Releasing a differentially private version of the training data would enable any training algorithm to be applied to it, thanks to differential privacy's post-processing property. However, a large amount of noise would typically need to be added to the data in this setting to achieve a differential privacy guarantee, leading to a loss of model utility. To address the utility-loss issue, the notion of differential privacy can be relaxed to allow defining privacy in terms of the distinguishability level between inputs by means of a distance function [20,63]. Generating differentially private synthetic data for the training of models is a promising approach to privacy-preserving machine learning. Differentially private synthetic data can be generated using random projections [23], Bayesian networks [82], Markov random fields [8], GANs [79], iterative methods [49], neural tangent kernels [78], and Kernel Affine Hull Machines [40]. Perturbing the objective function of the optimization problem with noise is another approach to enforce differential privacy in models with strong convexity [60,32]. The study in [57] introduced the Private Aggregation of Teacher Ensembles (PATE) method, based on knowledge aggregation from "teacher" models and transfer to a "student" model in a differentially private fashion. The authors suggested training the student model on public unlabeled data using a GAN-like approach for semi-supervised learning. This approach achieves differential privacy by injecting noise into the aggregation of teacher models (which have been trained on disjoint splits of a private dataset) without placing any restrictions on the teacher models, thus allowing any models to be used in a model-agnostic manner. Further, a rigorous theoretical analysis of the PATE approach is available [4,47].
As large-scale machine learning models are typically optimized using gradient-based algorithms, gradient perturbation-based methods such as DP-SGD (differentially private stochastic gradient descent) are widely used for achieving rigorous differential privacy guarantees [1]. DP-SGD operates by running stochastic gradient descent on noisy mini-batch gradients. The privacy analysis of DP-SGD relies on the concept of privacy amplification by sampling, requiring that each mini-batch is sampled with replacement at each iteration. This requirement may be infeasible to achieve in distributed settings like FL. Thus, the authors in [35] introduced a solution, referred to as the DP-FTRL algorithm, that does not rely on random sampling for privacy amplification and instead leverages the differentially private streaming of the cumulative sum of gradients. The problem of continual release of cumulative sums was connected to the matrix mechanism in [17], yielding improvements in federated learning with user-level differential privacy. The matrix mechanism was extended in [14] to the multi-epoch setting, allowing for differentially private gradient-based machine learning with multiple epochs over a dataset.
The integration of FL and differential privacy is potentially an effective approach to privacy-preserving collaborative learning from distributed datasets [55]. In practice, edge-device heterogeneity may cause a straggler effect, which can be mitigated by an asynchronous approach allowing clients to synchronize with the central server independently and at different times [45]. FL models typically have deep learning architectures, for which estimating the sensitivity of gradients (required for differential privacy) is difficult; thus, gradients are clipped (whenever their norm exceeds some threshold) to control the sensitivity [1,53]. Gradient clipping has been theoretically proven and empirically observed to accelerate gradient descent optimization in training [83]. However, clipping may introduce a bias on the convergence to a stationary point, and the clipping bias can be quantified with a disparity measure between the gradient distribution and a geometrically symmetric distribution [10]. The choice of the clipping threshold is crucial to the performance of differentially private models, necessitating the development of automatic [7] and adaptive [45] gradient clipping methods.
Fully Homomorphic Encryption (FHE) allows arbitrary computation in the encrypted space, and thus FHE can be applied to FL for protecting the privacy of the data (shared by clients with the server during training) against a server eavesdropper. Traditional single-key FHE (where all clients share one key) poses a security risk in the event of a client colluding with the server; thus, multi-key FHE schemes [54] have been considered for FL [72], where the suggested scheme remains secure against collusion attacks involving up to all but one participant. Moreover, their scheme reduces the computational load by reducing the complexity of the NAND gate. To reduce computational and communication overhead during HE-based secure model aggregation in federated training, the authors in [34] suggest selectively encrypting only the most privacy-sensitive parameters. The CKKS scheme [11] is a leveled homomorphic encryption scheme designed specifically for approximate arithmetic on real or complex numbers, and has thus been considered for privacy-preserving FL [56]. The SPDZ protocol [16] uses somewhat homomorphic encryption and provides low-latency secure multi-party computation due to its fast online phase, and can thus be considered for FL [74]. Smart network interface cards can be leveraged as hardware accelerators to offload compute-intensive HE operations of FL [13]. To implement FHE-based secure FL with reduced communication overhead and latency, the authors in [26] have experimented with different approaches in which data can be encrypted and transmitted. TFHE (Torus Fully Homomorphic Encryption) is a lattice-based FHE scheme [12] designed for fast gate-by-gate computation on encrypted data with very fast bootstrapping. TFHE leads to a computationally efficient FHE-based FL [43,39] by reformulating the FL problem in such a way that 1) only low-dimensional attributes need to be exchanged between clients and the server, and 2) the inference of the global model in the encrypted space is not computationally heavy.
In summary, most existing FL algorithms address only a subset of requirements R1-R5. For example, the classic FedAvg algorithm [52] (and similar gradient-based variants) requires many SGD epochs and communication rounds, yet still lacks formal guarantees for heterogeneous data. Approaches using differentially private SGD enhance privacy, but in federated settings they face other issues (e.g., performance drop and reliance on random sampling for privacy amplification). Methods using homomorphic encryption can secure model updates; however, they incur high computational cost and often still depend on exchanging gradients. Recently, kernel-based and prototype-based methods (e.g., distillation of local prototypes, or KAHM-based aggregation) have been explored to reduce communication and improve robustness, but a unified theoretical framework is lacking. In particular, the state of the art (as reviewed in Sections 1.2.1-1.2.3) does not jointly solve R1-R5. A recent kernel-based FL scheme [42] addressed aspects of R2-R3 and enabled privacy [40] and security [39] by exchanging compact task-sufficient information, but it did not address R1 by deriving the underlying hypothesis space for learning from heterogeneous distributed data. Existing FL algorithms either rely on iterative gradient descent and repeated communication, or do not fully protect privacy, and they do not come with end-to-end learning guarantees under heterogeneity. To the best of our knowledge, no existing framework concurrently addresses all requirements R1-R5 within a single, mathematically unified treatment.
To address these limitations, we argue that a fundamentally different (gradient-free) approach is needed to satisfy requirements R1-R5. In this paper, we therefore propose a rigorous operator-theoretic framework designed to address R1-R5 within a single, unified treatment. Our framework generalizes the methodology of [42] and extends it to address not only R1 but also R2-R5 simultaneously. Our main contributions are:
Operator-Theoretic Formulation. We formulate the FL problem in the $L^2$ function space as solving for the minimizer of the mean-squared error. We define an invertible forward operator that maps the $L^2$-optimal solution into an RKHS. This RKHS is associated with a generalized kernel whose feature map serves as an estimator of the class posterior probability. We derive a sample-based estimator in the RKHS and prove a non-asymptotic upper bound on its risk (Theorem 1) using kernel and operator theory and concentration inequalities. Notably, under mild regularity conditions on the data-generating distributions (i.e., square-integrability of the class-posterior function, as stated in Section 3.1), we achieve a risk bound of $O(1/\sqrt{N})$, where $N$ is the total number of training data samples distributed across clients. We map the RKHS learning solution back to the $L^2$ function space using the inverse operator to obtain a generalized learning solution.
Performance Guarantees. The generalized learning solution (provided in Section 3.5) is evaluated for its performance in terms of risk (Theorem 2), prediction error (Theorem 3), robustness (Remark 3), and approximation error (Theorem 4). Under the same regularity condition on the data-generating distributions as in Section 3.1, it is shown that the risk, prediction error, and approximation error bounds are of $O(1/\sqrt{N})$. The robustness property of the generalized learning solution is established by showing that small disturbances cannot lead to large prediction errors.
A data-dependent hypothesis space for task learning is determined (in Section 3.6) from the generalized learning solution by tuning the kernel to the scale of the data.
We then analyze the Rademacher complexity of the hypothesis space (in Theorem 5) and derive upper bounds on the prediction error (in Theorem 6) and the approximation error (in Theorem 7). It is shown (in Remark 6) that the achieved error bounds are tighter than the existing bounds [42].
KAHM-Based Space Folding Kernel and Classification: We introduce a novel space folding measure for KAHMs (Definition 1). It quantifies how much a data point must be “folded” to fit it into the subspace spanned by the training data (i.e. the KAHM’s data subspace). This space folding measure induces a new kernel whose feature-map aligns with the class membership (Remark 7). In fact, the feature value can be interpreted as an estimate of class posterior probability (Proposition 2).
Below, we outline the key steps of our approach (illustrated in Fig. 1) before delving into the technical details.
[Fig. 1: Overview of the framework. Step 1: the optimal solution in $L^2(\mathbb{R}^n, P_x)$ (not computable due to unknown distributions); Steps 2-5: mapping onto the RKHS, sample approximation, and mapping back; Step 6: hypothesis space determination and performance analysis (error in the generalized solution); Step 7: hypothesis implementation in the federated setting, where each client $q$ sends its local evaluations $\{T_{\mathbf{X}^{c,q}}\}_{c=1}^{C}$ of the local kernel models to the server, released under differential privacy (Step 8) and/or FHE (Step 9).]
The operator-theoretic kernel FL framework is developed by 1) considering the optimal learning solution in $L^2(\mathbb{R}^n, P_x)$, 2) mapping the optimal solution onto an RKHS (associated with a generalized kernel) using an operator, 3) approximating the optimal solution using the available data samples in the RKHS, 4) mapping the sample-approximated solution onto $L^2(\mathbb{R}^n, P_x)$ using the inverse operator, 5) analyzing the sample-approximated solution in $L^2(\mathbb{R}^n, P_x)$ and identifying conditions on the kernel choice to define the hypothesis space, and 6) implementing a suitable hypothesis with minimum computational and communication cost in the federated setting using kernel models.
Step 1. Analytical Formulation of the Learning Problem in the $L^2$ Function Space
The task learning problem is mathematically formulated in the $L^2$ function space and the optimal solution is analytically derived. Since the probability distributions involved in the analytical solution are unknown, the analytically derived solution cannot be computed in practice.
Step 2. A Reproducing Kernel Hilbert Space (RKHS) for the Learning Problem
We consider the RKHS induced by a generalized kernel to approximate the optimal solution. Because the kernel choice is not obvious a priori, we adopt a generalized kernel whose feature map is derived from theoretical analysis aimed at estimating the class-posterior probability.
Step 3. An Invertible Operator for Mapping the Optimal Solution onto the RKHS
To approximate and analyze the optimal solution in the RKHS using kernel theory, an integral kernel operator from the $L^2$ function space to the RKHS is defined such that its inverse exists.
Step 4. Sample Approximation and Analysis in RKHS
The optimal solution mapped onto the RKHS is approximated by means of the training data samples distributed across clients to obtain the RKHS learning solution. Further, kernel theory and concentration inequalities are applied to derive an upper bound on the risk of the RKHS learning solution.
Step 5. Generalized Learning Solution and Analysis
The learning solution, obtained in the RKHS via sample approximation, is mapped onto the $L^2$ function space through the inverse operator to obtain the generalized learning solution.
The risk and error bounds for the generalized learning solution are easily obtained, thanks to the derived risk bound for the RKHS learning solution.
Step 6. Determination of the Hypothesis Space and Analysis
The generalized learning solution is modulated by the kernel choice. We identify conditions on the kernel under which the generalized learning solution captures the scale of the data. The set of learning solutions whose kernel satisfies the identified conditions defines the hypothesis space. The Rademacher complexity of the hypothesis space is calculated to derive upper bounds on the prediction error and the approximation error for the hypothesis space.
Step 7. Choosing a Communication-Efficient Hypothesis
Having determined the hypothesis space and provided theoretical performance guarantees, the next step is to choose a suitable hypothesis that can be efficiently implemented in the federated setting. Such a hypothesis can be defined by means of Kernel Affine Hull Machines (KAHMs) [40,42]. Specifically, the space folding property of the KAHMs (i.e., the mapping of an arbitrary point by a KAHM onto the data subspace represented by the KAHM) is leveraged to implement a communication-efficient hypothesis. This is done by specifying the feature map of the kernel as an estimator of the class posterior probability built from the space folding measures, enabling a gradient-free FL protocol in which local KAHM-based models are aggregated by means of space folding measures without requiring rounds of communication between the server and the clients.
Step 8. Differentially Private Release of Space Folding Measures
Privacy-preserving knowledge transfer from clients to the server is enabled by providing differentially private approximations to the space folding measures. Since estimating the sensitivity of the space folding measure is challenging, we consider the differentially private release of data samples using an optimized noise-adding mechanism [41]. The adverse effect of the added noise is mitigated by leveraging the post-processing property of differential privacy to smooth the noise-added data samples. The study introduces a kernel-based smoothing function, with the degree of smoothing optimized to minimize the deviation of the smoothed data points from the original data points.
Step 9. Computationally Efficient Secure Inference of the Global Model Using FHE
Since the space folding measure (unlike high-dimensional gradient or parameter update vectors) is scalar-valued, inference for a $C$-class problem with $Q$ participating clients reduces to evaluating $Q \times C$ space folding measures and performing $Q \times C$ minimum operations and $C$ equality-comparison operations per test point. This fixed and low-dimensional operation pattern is well suited to secure implementation under fully homomorphic encryption. In our experiments, we instantiate this using the TFHE scheme and report runtimes for the encrypted minimum and equality-comparison primitives to characterize the computational profile of secure inference.
Kernel methods, empowered by a strong mathematical theory on kernel machine learning, have been considered for FL [28,21]. However, only a recent study [42] has introduced a kernel FL method that departs from gradient descent.
That study proposed a KAHM-based federated scheme that considered a particular convex-hull hypothesis and derived Rademacher-complexity-based error bounds; it is complemented by separate works on KAHM-based differentially private [40] and FHE-secured [39] FL protocols. Taken together, these works partially address R2-R3 and demonstrate that KAHM-based scalar, task-sufficient summaries can support privacy and security. They do not, however, provide a unified operator-theoretic formulation that derives the hypothesis space from first principles, nor do they jointly address requirements R1-R5 within a single framework.
The present paper goes substantially beyond those earlier works in several ways:
Operator-Theoretic Formulation and Hypothesis-Space Derivation. Instead of postulating a particular convex hypothesis set as in [42], we formulate the learning problem in the $L^2$ function space, derive the $L^2$-optimal solution, map it into an RKHS via an invertible operator, and then map a sample-based RKHS approximation back to $L^2$. This forward-inverse operator construction yields a generalized learning solution and, by identifying conditions on the generalized kernel, induces a data-dependent hypothesis space tailored to heterogeneous client distributions.
Integrated Operator- and Complexity-Based Analysis with Tighter Bounds. We derive non-asymptotic risk, prediction-error, robustness, and approximation-error bounds for the generalized solution via operator-theoretic arguments, and then combine these with new Rademacher-complexity bounds. The resulting bounds are strictly tighter than those in [42].
Generalized Space Folding Kernel and Probabilistic Interpretation. We extend the KAHM-induced distance used for aggregation in [42,40,39] to a new space folding measure and an associated generalized kernel whose feature map admits an interpretation as an estimator of the class posterior probability. This construction is derived from the operator-theoretic framework rather than chosen ad hoc.
Unified Treatment of Communication Efficiency, DP, and FHE. Within the same operator-theoretic framework, we show that scalar space folding summaries are sufficient statistics for the global prediction rule, that they admit an optimal univariate noise distribution for $(\epsilon, \delta)$-differential privacy together with a kernel-based post-processing smoother, and that the resulting decision rule has a fixed gate-level structure compatible with FHE-based secure inference. Earlier works [42,40,39] studied these aspects in isolation, without connecting them to a formally derived hypothesis space.
We provide a new empirical evaluation on four benchmarks (20Newsgroup, XGLUE-NC, CIFAR-10-LT, CIFAR-100-LT) under heterogeneous and long-tailed data partitions, including ablations on space folding, batch size, and embedding combinations. These experiments go beyond those reported in [42,40,39].
The remainder of the paper is organized as follows: Section 2 introduces the necessary notation and the formal problem setup. Section 3 presents the development of our operator-theoretic framework, including theoretical analysis of its performance. Section 4 reports experimental results on the benchmarks, and Section 5 concludes with a discussion of the findings and future work.
This section introduces the notation used, presents the considered distributed data setting, and reviews the necessary definitions.
We use boldface font to denote matrices. The following notation is introduced:
• Let $n, N, c, C, q, Q \in \mathbb{Z}^+$ be positive integers.
• For a scalar $a \in \mathbb{R}$, $|a|$ denotes its absolute value. For a set $A$, $|A|$ denotes its cardinality. For a real matrix $\mathbf{X}$, $\mathbf{X}^T$ is the transpose of $\mathbf{X}$.
• For a vector $y \in \mathbb{R}^C$, $\|y\|$ denotes the Euclidean norm and $y_j$ (and also $(y)_j$) denotes the $j$th element. For a matrix $\mathbf{X} \in \mathbb{R}^{N \times n}$, $\|\mathbf{X}\|_2$ denotes the spectral norm, $\|\mathbf{X}\|_F$ denotes the Frobenius norm, $(\mathbf{X})_{i,:}$ denotes the $i$th row, $(\mathbf{X})_{:,j}$ denotes the $j$th column, and $(\mathbf{X})_{i,j}$ denotes the $(i,j)$th element.
• Square brackets are used to represent the construction of a matrix from columns.
• Let $(\Omega_x, \mathcal{F}_x, \mu_x)$ be a probability space and $x : \Omega_x \to \mathbb{R}^n$ be a random vector on $\Omega_x$. Let $\mathcal{B}(\mathbb{R}^n)$ be the Borel $\sigma$-algebra on $\mathbb{R}^n$. Let $P_x : \mathcal{B}(\mathbb{R}^n) \to [0, 1]$ be the distribution of $x$, given as $P_x := \mu_x \circ x^{-1}$.
• Let $(\Omega_{x,y}, \mathcal{F}_{x,y}, \mu_{x,y})$ be a probability space and $(x, y) : \Omega_{x,y} \to \mathbb{R}^n \times \{0,1\}^C$ be a random vector. Let $P_{x,y}$ be the distribution of $(x, y)$, given as $P_{x,y} := \mu_{x,y} \circ (x, y)^{-1}$.
• Let $(\Omega_{x,y,q}, \mathcal{F}_{x,y,q}, \mu_{x,y,q})$ be a probability space and $(x, y, q) : \Omega_{x,y,q} \to \mathbb{R}^n \times \{0,1\}^C \times \{1, \cdots, Q\}$ be a random vector, with distribution given as $P_{x,y,q} := \mu_{x,y,q} \circ (x, y, q)^{-1}$.
• Let $L^2(\mathbb{R}^n, P_x)$ be the space of all complex-valued measurable functions $f$ on $\mathbb{R}^n$ that satisfy $\int_{\mathbb{R}^n} |f(x)|^2 \, dP_x(x) < \infty$. The norm of an $f \in L^2(\mathbb{R}^n, P_x)$ is given as $\|f\|_{L^2(\mathbb{R}^n, P_x)} := \big( \int_{\mathbb{R}^n} |f(x)|^2 \, dP_x(x) \big)^{1/2}$.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ be a set consisting of $N$ samples drawn i.i.d. according to the distribution $P_{x,y}$. Let $I_c$ be the set of indices of those samples in the sequence $((x_i, y_i) \in \mathcal{D})_{i=1}^N$ which are $c$th-class labelled, i.e., $I_c := \{ i \mid (y_i)_c = 1 \}$. Let $N_c$ be the number of $c$th-class labelled samples, i.e., $N_c := |I_c|$. Let $(I_c^1, \cdots, I_c^{N_c})$ be the sequence of elements of $I_c$ in ascending order. Let $\mathbf{X}^c \in \mathbb{R}^{N_c \times n}$ be the matrix storing the $c$th-class labelled samples as its rows, i.e., $(\mathbf{X}^c)_{j,:} = (x_{I_c^j})^T$.
We consider the distributed data setting where the total data samples are distributed among $Q$ clients. Let $q_i \in \{1, \cdots, Q\}$ be the client-characterizing variable associated with the $i$th sample pair $(x_i, y_i)$, indicating which of the $Q$ clients owns the $i$th sample pair. Let $I_{c,q}$ be the set of indices of those samples in the sequence $((x_i, y_i))_{i=1}^N$ which are $c$th-class labelled and owned by client $q$, i.e., $I_{c,q} := \{ i \mid (y_i)_c = 1, \; q_i = q \}$. Let $(I_{c,q}^1, \cdots, I_{c,q}^{|I_{c,q}|})$ be the sequence of elements of $I_{c,q}$ in ascending order, i.e., $I_{c,q}^1 = \min(I_{c,q})$, and so on. Let $\mathbf{X}^{c,q} \in \mathbb{R}^{|I_{c,q}| \times n}$ be the matrix storing the $c$th-class labelled and $q$th-client owned samples as its rows. Since the $c$th-class labelled samples are distributed among the $Q$ clients, we have $I_c = \cup_{q=1}^{Q} I_{c,q}$ and thus $N_c = \sum_{q=1}^{Q} |I_{c,q}|$.
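For concreteness, the following is a minimal Python/NumPy sketch of this indexing; the variable and function names are illustrative and not taken from the paper.

```python
import numpy as np

def partition_by_class_and_client(X, Y, client_ids, num_classes, num_clients):
    """Build the per-class matrices X_c and the per-class, per-client matrices X_{c,q}.

    X: (N, n) feature matrix; Y: (N, C) one-hot labels, so (Y)_{i,c} = 1 iff sample i
    carries class label c; client_ids: length-N array with values in {0, ..., Q-1}."""
    X_c = {}      # class c -> matrix of all c-labelled samples (N_c x n)
    X_cq = {}     # (class c, client q) -> matrix of client q's c-labelled samples
    for c in range(num_classes):
        I_c = np.flatnonzero(Y[:, c] == 1)            # index set I_c, in ascending order
        X_c[c] = X[I_c]
        for q in range(num_clients):
            I_cq = I_c[client_ids[I_c] == q]          # index set I_{c,q}
            X_cq[(c, q)] = X[I_cq]                    # |I_{c,q}| x n matrix
    return X_c, X_cq
```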
Remark 1 (Data Heterogeneity across Clients). We assume that the data samples are statistically heterogeneously distributed, i.e., for arbitrary clients $q_i$ and $q_j$ with $i \neq j$, we assume that $P_{y|x,q}(\cdot \mid x, q = q_i) \neq P_{y|x,q}(\cdot \mid x, q = q_j)$. (18)
The KAHMs, originally defined in [40], have been considered for automated machine learning in [42]. Given a finite number of samples, a KAHM is defined as in (19).
Appendix A presents a comprehensive description of the variables and functions associated with (19).
This section provides an operator-theoretic framework for kernel FL. As stated previously in Section 1.5, the framework development approach consists of nine steps. Each step is described separately in a subsection.
We consider the learning problem in $L^2(\mathbb{R}^n, P_x)$. Our goal is to learn a function $f_{x \to y_c} : \mathbb{R}^n \to \mathbb{R}$ that minimizes the mean squared error:
$$f^*_{x \to y_c} := \operatorname*{argmin}_{f \in L^2(\mathbb{R}^n, P_x)} \mathbb{E}\left[\left(y_c - f(x)\right)^2\right].$$
It is well known, and also shown in Appendix B, that the conditional expectation (also known as the regression function) minimizes the mean squared error. That is,
$$f^*_{x \to y_c}(x) = \mathbb{E}[y_c \mid x] = P_{y|x}(y_c = 1 \mid x), \qquad (22)$$
where we have made the following regularity assumption:

Assumption 1 (Square-Integrability of the Regression Function). $P_{y|x}(y_c = 1 \mid \cdot) \in L^2(\mathbb{R}^n, P_x)$.

Due to $y_c \in \{0,1\}$, we have $f^*_{x \to y_c}(x) \in [0,1]$ and hence $\|f^*_{x \to y_c}\|_{L^2(\mathbb{R}^n, P_x)} \le 1$.

For the analysis, the disturbance function $\xi_c : \mathbb{R}^n \times \{0,1\}^C \to \mathbb{R}$ is defined as
$$\xi_c(x, y) := y_c - f^*_{x \to y_c}(x).$$
It is obvious that $\mathbb{E}[\xi_c(x, y) \mid x] = 0$ and that $|\xi_c(x, y)| \le 1$.
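For completeness, the standard argument behind (22) is the classical decomposition of the mean squared error (a well-known identity restated here for the reader's convenience; the paper's own proof is in Appendix B):
$$
\mathbb{E}\big[(y_c - f(x))^2\big]
= \mathbb{E}\big[(y_c - \mathbb{E}[y_c \mid x])^2\big]
+ \mathbb{E}\big[(\mathbb{E}[y_c \mid x] - f(x))^2\big],
$$
where the cross term vanishes by the tower property of conditional expectation. The first term does not depend on $f$, so the mean squared error is minimized by $f(x) = \mathbb{E}[y_c \mid x] = P_{y|x}(y_c = 1 \mid x)$, and the disturbance $\xi_c$ is precisely the residual appearing in the first term.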
We consider a generalized kernel function such that, for each class $c \in \{1, 2, \cdots, C\}$, the kernel $K_{\Phi_c}$ is characterized by a feature map as stated in (29), where $\Phi_c : \mathbb{R}^n \to [0,1]$ is the feature map (which will be determined based on theoretical analysis so as to estimate the class posterior probability).

Remark 2 (Rationale for the Restrictive Feature-Map). The rationale for the restrictive nature of the feature map, $\Phi_c : \mathbb{R}^n \to [0,1]$, is the intent of setting it as an estimator of the class posterior probability, i.e., $\Phi_c(x) \approx P_{y|x}(y_c = 1 \mid x)$. The bound on the error in approximating the class posterior probability through the feature map will be derived in Proposition 2.

It is shown in Appendix C that $K_{\Phi_c}$ is a positive semi-definite kernel. The RKHS associated with $K_{\Phi_c}$ is denoted by $\mathcal{H}_{\Phi_c}$ and is equipped with the corresponding inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_{\Phi_c}}$.
Step 3: Operators between $L^2(\mathbb{R}^n, P_x)$ and the RKHS
To enable kernel-based approximation, we map the $L^2$-optimal function into an RKHS using a bounded linear operator. We introduce an operator from $L^2(\mathbb{R}^n, P_x)$ to $\mathcal{H}_{\Phi_c}$ such that it is invertible. For defining such an operator, we first consider the inclusion operator $J : \mathcal{H}_{\Phi_c} \hookrightarrow L^2(\mathbb{R}^n, P_x)$, its adjoint operator $J^* : L^2(\mathbb{R}^n, P_x) \to \mathcal{H}_{\Phi_c}$, and the inverse of the adjoint operator, $(J^*)^{-1}$.
The adjoint of $J$, $J^* : L^2(\mathbb{R}^n, P_x) \to \mathcal{H}_{\Phi_c}$, and the inverse of the adjoint, $(J^*)^{-1} : \mathrm{range}(J^*) \to L^2(\mathbb{R}^n, P_x)$, admit explicit expressions in terms of the feature map $\Phi_c : \mathbb{R}^n \to [0,1]$ that characterizes the kernel $K_{\Phi_c}$, as stated in (29). It is shown in Appendix E that $(J^*)^{-1}$ is well defined on the range of $J^*$.
It is shown in Appendix F that $J^*J$ admits an explicit integral-operator expression, and it is shown in Appendix G that the norm of the operator $J^*J$ is upper bounded. Since $J^*J$ is a positive self-adjoint operator, there is a unique positive self-adjoint square root of $J^*J$, denoted by $(J^*J)^{1/2}$; the corresponding bound on $(J^*J)^{1/2}$ follows from (48).
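For orientation, the display below records the standard form these objects take when $J$ is the inclusion of an RKHS with kernel $K_{\Phi_c}$ into $L^2(\mathbb{R}^n, P_x)$; it is a sketch consistent with the construction above, not a reproduction of the exact expressions of Appendices E-G:
$$
(J^* f)(x') = \int_{\mathbb{R}^n} K_{\Phi_c}(x', x)\, f(x)\, dP_x(x), \qquad
(J^* J\, h)(x') = \int_{\mathbb{R}^n} K_{\Phi_c}(x', x)\, h(x)\, dP_x(x),
$$
$$
\|J^* J\| = \|J\|^2 \le \sup_{x \in \mathbb{R}^n} K_{\Phi_c}(x, x).
$$
In particular, whenever the kernel is normalized so that $K_{\Phi_c}(x, x) \le 1$ (which holds, e.g., for a product kernel built from the $[0,1]$-valued feature map $\Phi_c$), the operator norms of $J^*J$ and $(J^*J)^{1/2}$ are at most 1.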
For a given sample sequence $(x_i)_{i=1}^N$, let $S_{(x_i)_{i=1}^N} : \mathcal{H}_{\Phi_c} \to \mathbb{R}^N$ be the sample evaluation operator, which evaluates a function at the sample points, with $\mathbb{R}^N$ equipped with a suitably normalized inner product $\langle u, v \rangle$ for $u, v \in \mathbb{R}^N$. The operator $S_{(x_i)_{i=1}^N}$ is viewed as the sample approximation of $J$. Its adjoint, $S^*_{(x_i)_{i=1}^N} : \mathbb{R}^N \to \mathcal{H}_{\Phi_c}$, maps any $u \in \mathbb{R}^N$ to an element of $\mathcal{H}_{\Phi_c}$ and is viewed as the sample approximation of $J^*$.
The optimal solution (22), $f^*_{x \to y_c}$, is mapped onto the RKHS through the operators introduced above, yielding $f^{\mathcal{H}_{\Phi_c}}_{x \to y_c}$ (equation (59)). To approximate $f^{\mathcal{H}_{\Phi_c}}_{x \to y_c}$ in the RKHS, a natural approach is to approximate $J^*$ in (59) using the available data samples $\mathcal{D}$ (6). This leads to the RKHS learning solution $h^{\mathcal{H}_{\Phi_c}}_{x \to y_c} \in \mathcal{H}_{\Phi_c}$, viewed as the sample approximation of $f^{\mathcal{H}_{\Phi_c}}_{x \to y_c}$ (equation (60)). For a given sequence of outputs, a function evaluation operator is defined analogously, and the given data samples can be represented using this evaluation operator (equation (62)). Combining (60) and (62) gives the computable form of the RKHS learning solution.
Theorem 1 (Risk for Sample Approximation of the Optimal Solution in RKHS). The following risk bound (64) holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.

Proof. The proof is provided in five parts: Part 2 is shown in Appendix I; Part 4, which holds with probability at least $1 - \delta$, is shown in Appendix K; and Part 5 obtains (64) by using (68) in (65).
The learning solution $h^{\mathcal{H}_{\Phi_c}}_{x \to y_c}$, obtained in the RKHS, is mapped back onto $L^2(\mathbb{R}^n, P_x)$ through the inverse operator to obtain the solution $h_{x \to y_c}$ (equation (69)). The obtained solution $h_{x \to y_c}$ is referred to as the generalized learning solution, reflecting the considered generalized kernel function. The learning solution is evaluated in Theorem 2 for its risk with respect to the optimal solution $f^*_{x \to y_c}$.

Theorem 2 (Risk for Generalized Learning Solution). The following risk bound (70) holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.

Proof. Using Theorem 1 in (72), we get the result (70).
Theorem 2 allows us to bound the error of $h_{x \to y_c}$ in predicting the output $y_c$, as stated in Theorem 3.

Theorem 3 (Prediction Error Bound for Generalized Learning Solution). The following bound on the mean-squared prediction error holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.

Proof. Consider the mean-squared prediction error; using (27), and then (26) together with Theorem 2, the result is obtained.
Theorem 4 provides an upper bound on the error in approximating the target function $P_{y|x}(y_c = 1 \mid x)$ through $h_{x \to y_c}$.

Theorem 4 (Approximation Error Bound for Generalized Learning Solution). The following approximation-error bound holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.

Proof. Since $y_c \in \{0, 1\}$, and using (22), we obtain (79); using (79) in Theorem 2 leads to the result.

Remark 4 (Asymptotic Convergence of Generalized Learning Solution). Theorem 4 indicates that the approximation-error bound decays with an increasing number of training samples; in the limit of infinitely many samples, the approximation error reduces to zero.
It can be seen from (69) and (60) that the obtained solution $h_{x \to y_c}$ admits an explicit expression in terms of the feature map evaluated at the $c$th-class labelled samples, i.e., the samples $x_i$ with $(y_i)_c = 1$. Since $h_{x \to y_c}$ is viewed as the predictor of the $c$th class label $y_c \in \{0, 1\}$, we ensure that $h_{x \to y_c}(x) \in [0, 1]$ by constraining the feature map $\Phi_c$ through the normalization condition (85).

Remark 5 (Justification of the Normalization (85)). Condition (85) ensures that $h_{x \to y_c}(x) \in [0, 1]$.

With the kernel feature-map normalization condition (85), we define our hypothesis space $\mathcal{M}_c$ as the set of learning solutions of the above form whose feature map satisfies (85); any $h_{x \to y_c} \in \mathcal{M}_c$ thus satisfies $h_{x \to y_c}(x) \in [0, 1]$.
For a given data set $\mathcal{D}$ (defined in (6)), the empirical Rademacher complexity of the hypothesis space $\mathcal{M}_c$ is given as
$$\hat{\mathfrak{R}}(\mathcal{M}_c) := \mathbb{E}_{\sigma}\left[ \sup_{h_{x \to y_c} \in \mathcal{M}_c} \frac{1}{N} \sum_{i=1}^{N} \sigma_i\, h_{x \to y_c}(x_i) \right],$$
where $\sigma_1, \cdots, \sigma_N$ are independent random variables drawn from the Rademacher distribution.

Theorem 5 (Bound on Rademacher Complexity of the Hypothesis Space). Given a dataset $\mathcal{D}$, as defined in (6), the empirical Rademacher complexity of $\mathcal{M}_c$ admits an explicit upper bound, from which a corresponding bound on the Rademacher complexity follows.

Proof. The proof is provided in Appendix L.
Theorem 3 and Theorem 4 have provided error bounds for the generalized solution using operator-theoretic analysis. Further, the Rademacher complexity of the hypothesis space can be used to derive error bounds for the hypothesis space as in [42]. The results obtained by the two approaches can be combined to obtain tighter bounds, as in Theorem 6 and Theorem 7.

Theorem 6 (Prediction Error Bound for Hypothesis Space). Given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of samples drawn i.i.d. from $(P_{x,y})^N$, for any $h_{x \to y_c} \in \mathcal{M}_c$, the prediction-error bound (95) holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.
Proof. The proof is provided in Appendix M.
Theorem 7 (Approximation Error Bound for Hypothesis Space). Given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of samples drawn i.i.d. from $(P_{x,y})^N$, for any $h_{x \to y_c} \in \mathcal{M}_c$, the approximation-error bound (96) holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.
Proof. The proof is provided in Appendix N.
Remark 6 (Comparison with the Existing Error Bounds). Since the r.h.s. of inequality (95) is at most the prediction error bound of [42], and the r.h.s. of inequality (96) is at most the approximation error bound of [42], we achieve tighter bounds on the prediction and approximation errors.
We now describe how the RKHS-based method is deployed across decentralized clients in a federated setup. So far, we have not fixed the kernel feature map $\Phi_c$ and the corresponding hypothesis for an implementation in the federated setting. Our idea is to leverage KAHMs for defining $\Phi_c$ in such a way that the corresponding hypothesis can be evaluated efficiently from the distributed training data.
KAHMs exhibit the space folding property in the sense that a KAHM associated with a given set of data samples folds any arbitrary point in the data space around the data samples.

Theorem 8 (KAHM as a Bounded Function [40]). The KAHM $\mathcal{A}_{\mathbf{X}}$, associated with the data samples $\mathbf{X}$, satisfies the bound (99); thus, the image of $\mathcal{A}_{\mathbf{X}}$ is bounded as in (100).

Theorem 9 (KAHM Induced Distance Measure [40]). The ratio of the distance of a point $x \in \mathbb{R}^n$ from its image under $\mathcal{A}_{\mathbf{X}}$ to the distance of $x$ from the data samples is upper bounded.
Theorem 8 establishes the boundedness property of the KAHM. Theorem 9 states that if a point $x$ is close to the points $\{x_1, \cdots, x_N\}$, then $\|x - \mathcal{A}_{\mathbf{X}}(x)\|$ cannot be large. Theorem 8 and Theorem 9 together indicate that any arbitrary point $x \in \mathbb{R}^n$ is mapped by a KAHM to a point closer to the data samples $\mathbf{X}$, and thus the space folding property is established. This is illustrated in Fig. 2(a), where the KAHM folds the data space around the given data samples.

Definition 1 (Space Folding Measure). To evaluate the amount of folding required for an arbitrary point $x \in \mathbb{R}^n$ to be mapped (by the KAHM $\mathcal{A}_{\mathbf{X}}$) to a point closer to the data samples $\mathbf{X}$, we define a space folding measure, $T_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$, associated with the data samples $\mathbf{X}$. The space folding measure $T_{\mathbf{X}}$ combines both the Euclidean distance and the cosine distance to define a composite measure of the distance between $x$ and $\mathcal{A}_{\mathbf{X}}(x)$.

[Fig. 2(a): The given 2-dimensional data samples $\mathbf{X}$ are marked in blue as "*", and a red line connects each point $x$ to its image $\mathcal{A}_{\mathbf{X}}(x)$ under the KAHM.]

Definition 2 (Space Folding Measure Associated to Distributed Data). Under the scenario that the total data samples are distributed among $Q$ parties such that the matrix $\mathbf{X}_q$ represents the local data samples owned by the $q$th party, one possible way to define a global space folding measure is given by (105).

[Fig. 3: Color plots of the space folding measure functions $T_{\mathbf{X}_1}$, $T_{\mathbf{X}_2}$, and $T_{\mathbf{X}_3}$ (option 1) associated with the data samples $\mathbf{X}_1$, $\mathbf{X}_2$, and $\mathbf{X}_3$ (marked in white), for distributed 2-dimensional data.]
Our approach is to define the kernel feature map based on the space folding measure (equation (106)), where $T_{\mathbf{X}^{c,1}, \mathbf{X}^{c,2}, \cdots, \mathbf{X}^{c,Q}}$ is the global space folding measure associated with the $c$th-class labelled samples that are distributed among the $Q$ clients, and $\mathbf{X}^{c,q}$ is the matrix of $c$th-class labelled and $q$th-client owned samples. The definition of the kernel feature map (106) implies that $\Phi^*_c(x)$ is equal to 1 if, among all the classes, $x$ can be mapped (by the KAHMs) to a point closer to the samples of the $c$th class with the least amount of folding.

Remark 7 (The Interpretation of the Space Folding Kernel $K_{\Phi^*_c}$). $K_{\Phi^*_c}(x, x')$ will be equal to 1 if, among all the classes, both $x$ and $x'$ can be mapped (by the KAHMs) to points closer to the samples of the $c$th class with the least amount of folding. Similarly, $K_{\Phi^*_c}(x, x')$ will be equal to 0 if either $x$ or $x'$ cannot be mapped (by the KAHMs) to a point closer to the samples of the $c$th class with the least amount of folding. Thus, $K_{\Phi^*_c}(x, x')$ estimates an association of $x$ and $x'$ (jointly) with the $c$th class.

Assumption 2. It is reasonable to assume for the training data samples that the global space folding measure associated with the $c$th class (i.e., $T_{\mathbf{X}^{c,1}, \mathbf{X}^{c,2}, \cdots, \mathbf{X}^{c,Q}}$) takes its minimum values on the $c$th-class labelled samples (condition (107)).
Assumption 3. The number of training data samples is sufficiently large, i.e., $N \gg 1$, so that $\|\Phi^*_c\|^2_{L^2(\mathbb{R}^n, P_x)}$ can be approximated by sample-averaging (equation (108)).

Remark 8 (Consistency of Normalization Condition (85) with the Space Folding Kernel). Using (107) in (108), it follows that the KAHM-induced feature map $\Phi^*_c$ satisfies the normalization condition (85) under Assumptions 2-3.

Proposition 1. Under Assumption 2 and Assumption 3, the relation (110) holds.

Proof. It follows from (109) and (111) that (112) holds. Using (107) in (111) gives (113). Combining (112) and (113) leads to (110).
Proposition 2 (Approximation Error Bound for $\Phi^*_c$). Given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, under Assumption 2 and Assumption 3, an explicit approximation-error bound for $\Phi^*_c$ holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$.

Proof. Since $\Phi^*_c \in \mathcal{M}_c$, it follows from Theorem 7 that the corresponding bound holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$; due to (107), the bound simplifies, and hence the result follows.

Remark 9 (Practical Significance of Proposition 2). Proposition 2 allows us to make the approximation $\Phi^*_c(x) \approx P_{y|x}(y_c = 1 \mid x)$; that is, $\Phi^*_c(x)$ estimates the probability that $x$ is associated with the $c$th class.
The fact that $\Phi^*_c(x)$ (which is the estimated probability of $x$ being associated with the $c$th class) can be evaluated from the distributed data is leveraged for FL, as illustrated in Fig. 4. The hypothesis $\Phi^*_c$ is inferred for all $c \in \{1, 2, \cdots, C\}$ from the locally computed space folding measures using (106). The global classifier, $\mathcal{C} : \mathbb{R}^n \to \{1, \cdots, C\}$, then assigns to a test point the class whose samples require the least amount of folding.
Remark 10 (Batch KAHM Modeling for Enhanced Accuracy). To enhance the KAHM modeling accuracy for each class's data samples (which is crucial for datasets with long-tailed imbalance), the total samples are partitioned into subsets and each subset is modelled through a separate KAHM. If $|I_{c,q}|$ (i.e., the number of $c$th-class labelled samples owned by client $q$) is larger than a specified number $N_b$ (the batch size of samples to be modeled by a single KAHM), then the total data samples (stored in the rows of the matrix $\mathbf{X}^{c,q}$) are partitioned into $S_{c,q}$ sub-matrices $\mathbf{X}^{c,q}_1, \cdots, \mathbf{X}^{c,q}_{S_{c,q}}$ of nearly the same size (i.e., each sub-matrix has nearly $N_b$ rows), where $S_{c,q}$ equals $|I_{c,q}|/N_b$ rounded to the nearest integer, and the local space folding measure is defined by aggregating the $S_{c,q}$ individual measures associated with the $S_{c,q}$ sub-matrices (equation (121)). In this case, the global space folding measure associated with the $c$th-class labelled samples is defined accordingly (equation (122)).

[Fig. 4: Clients compute the local space folding measures defined by (121) for $c = 1, \cdots, C$; these are released to the server under differential privacy or FHE.]
Remark 11 (Data Partitioning via Clustering for Batch Processing of Big Data). To address the computational challenge of processing big datasets, previous studies [40,42] have suggested clustering as a method for partitioning a large number of data samples into subsets for batch processing. That is, $\mathbf{X}^{c,q}_1, \cdots, \mathbf{X}^{c,q}_{S_{c,q}}$ are instead obtained by clustering the rows of $\mathbf{X}^{c,q}$.

Remark 12 (A Generalization of the Federated Learning Method of [42]). The proposed federated learning methodology generalizes the method of [42] by considering a generalized space folding measure $T_{\mathbf{X}}(x)$ instead of the distance measure $\|x - \mathcal{A}_{\mathbf{X}}(x)\|$ considered in [42]. That is, the solution of [42] is obtained with our approach by defining $T_{\mathbf{X}}(x) := \|x - \mathcal{A}_{\mathbf{X}}(x)\|$.

Remark 13 (Clients with Missing Classes). If the $q$th client has zero $c$th-class labelled samples, then $T_{\mathbf{X}^{c,q}_1, \cdots, \mathbf{X}^{c,q}_{S_{c,q}}}(x)$ is set to the highest possible value of 1.
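To make the aggregation concrete, the following is a minimal Python/NumPy sketch of the global prediction rule assembled from locally computed space folding measures. The helper fold_measure is a hypothetical stand-in for the KAHM-based measure $T_{\mathbf{X}}$ of Definition 1 (whose exact form involves the KAHM $\mathcal{A}_{\mathbf{X}}$ and is not reproduced here); the min-based aggregation across batches and clients and the value 1 for missing classes follow Remarks 10 and 13, and the argmin decision is consistent with the $Q \times C$ minimum operations used for inference in Step 9.

```python
import numpy as np

def fold_measure(X_batch, x):
    """Hypothetical stand-in for the KAHM space folding measure T_X(x) in [0, 1];
    the actual measure combines Euclidean and cosine distances between x and its
    image under the KAHM A_X (Definition 1)."""
    d = np.linalg.norm(X_batch - x, axis=1).min()
    return float(1.0 - np.exp(-d))            # monotone map of a distance into [0, 1)

def local_measure(batches, x):
    """Client-side measure over the client's KAHM batches (Remark 10);
    equals 1 when the client owns no samples of the class (Remark 13)."""
    if not batches:
        return 1.0
    return min(fold_measure(Xb, x) for Xb in batches)

def global_predict(client_batches, x):
    """Server-side rule: per class, take the minimum of the Q client measures;
    predict the class attaining the overall least amount of folding (cf. (106))."""
    Q = len(client_batches)
    C = len(client_batches[0])                 # client_batches[q][c] = list of batch matrices
    per_class = [min(local_measure(client_batches[q][c], x) for q in range(Q))
                 for c in range(C)]
    return int(np.argmin(per_class))
```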
Remark 14 (Communication and Computation Efficiency). The federated learning methodology, as suggested in Fig. 4, is communication- and computation-efficient since the space folding measures are computed using KAHMs, and KAHMs are efficiently built [40] from the local data samples, requiring neither any gradients nor any communication with the server.
Existing differentially private FL approaches often require gradient clipping and complex per-round privacy budgeting. In contrast, our method relies only on scalar outputs (i.e., space folding measures), enabling a natural $(\epsilon, \delta)$-differential privacy mechanism that acts directly on these real-valued summaries and thus avoids the need for per-round gradient clipping and privacy accounting. The FL methodology (as illustrated in Fig. 4) is made differentially private by ensuring that the evaluation of the space folding measure is differentially private with respect to the local dataset; the subsequent aggregation steps then inherit privacy by the post-processing property of differential privacy. It follows from (121) that $T_{\mathbf{X}^{c,q}_s}(x)$ must be differentially private with respect to $\mathbf{X}^{c,q}_s$ for all $s \in \{1, \cdots, S_{c,q}\}$. The core issue is thus the privacy of the data samples in a matrix $\mathbf{X} \in \mathbb{R}^{N \times n}$ that may be leaked during inference through the output of the space folding measure function $T_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$. However, estimating the sensitivity of the space folding measure is challenging. Thus, we take the more practical approach of approximating $\mathbf{X}$ under differential privacy followed by a smoothing, while ensuring that the differentially private smoothed data points are as close to the original data points as possible. This approach leads to a private version of the space folding measure:

Definition 3 (Private Space Folding Measure). The private version of the space folding measure, $T^+_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$, is defined as $T^+_{\mathbf{X}} := T_{F(\mathbf{X} + \mathbf{V})}$, where $\mathbf{V} \in \mathbb{R}^{N \times n}$ is a random real matrix added to $\mathbf{X}$ for preserving the privacy of the elements of $\mathbf{X}$, such that the elements of $\mathbf{V}$ are independently distributed according to a distribution $P_v : \mathcal{B}(\mathbb{R}) \to [0, 1]$, $v \in \mathbb{R}$ is the random noise with distribution $P_v$, and $F : \mathbb{R}^{N \times n} \to \mathbb{R}^{N \times n}$ is a matrix-valued function, referred to as the smoothing function, meant to mitigate the effect of the added noise.
The noise adding mechanism and smoothing function, involved in Definition 3, will be designed in subsection 3.8.1 and subsection 3.8.2.
The output of $T^+_{\mathbf{X}}$ is a random variable, denoted $t^+_{\mathbf{X}}$, whose distribution is induced by the noise distribution $P_v$. Two $d$-adjacent matrices differ in only one element, and the difference is bounded by a scalar $d > 0$.

Definition 5 ($(\epsilon, \delta)$-Differential Privacy for $T^+_{\mathbf{X}}$). The space folding measure $T^+_{\mathbf{X}}$ is $(\epsilon, \delta)$-differentially private if
$$\Pr\left( t^+_{\mathbf{X}} \in O \right) \le e^{\epsilon}\, \Pr\left( t^+_{\mathbf{X}'} \in O \right) + \delta$$
for any $O \in \mathcal{B}([0, 1])$ and $d$-adjacent matrices $\mathbf{X}, \mathbf{X}' \in \mathbb{R}^{N \times n}$.

Result 1 (Optimal Noise for $(\epsilon, \delta)$-Differential Privacy [41,40]). The distribution of the noise $v$ that minimizes the expected noise magnitude while satisfying the sufficient condition for $(\epsilon, \delta)$-differential privacy of $T^+_{\mathbf{X}}$ is given by (129).
The method of inverse transform sampling can be used to generate random samples from (129) and thereby approximate $\mathbf{X}$ under differential privacy as $\mathbf{X}^+ := \mathbf{X} + \mathbf{V}$ (130).

Given a dataset $\{x_i \in \mathbb{R}^n\}_{i=1}^N$ (equivalently represented as the matrix $\mathbf{X} \in \mathbb{R}^{N \times n}$), a kernel-based smoothing can be represented through a matrix $\mathbf{H}_{\mathbf{X}}$ built from the values $h^i_{\mathbf{X}}(\mathbf{P}_{\mathbf{X}} x_j)$, where $\mathbf{P}_{\mathbf{X}}$ is an encoding matrix (computed by Algorithm 1 of Appendix A) and $h^i_{\mathbf{X}}(\mathbf{P}_{\mathbf{X}} x_j)$, given by (155), evaluates the kernel-smoothed membership of $x_j$ to $x_i$.

Definition 6 (A Kernel-Based Smoother). A kernel-based smoother, $S : \mathbb{R}^{N \times n} \to \mathbb{R}^{N \times n}$, is defined as $S(\mathbf{X}) := \mathbf{H}_{\mathbf{X}}^T \mathbf{X}$. (133)

Proposition 3 (Smoothing Property of $S$). There exists a $\beta \in (0, 1)$ such that $\|S(\mathbf{X})\| \le \beta\, \|\mathbf{X}\|$ (inequality (134)).
Proof. It can be seen using (155) that $\mathbf{H}_{\mathbf{X}}$ admits an explicit expression (135) in terms of the kernel matrix $\mathbf{K}_{\mathbf{X}}$ and the regularization parameter $\lambda^*_{\mathbf{X}}$, defined by (152) and (149), respectively. The spectral decomposition of the real symmetric positive definite matrix $\mathbf{K}_{\mathbf{X}}$ is written as in (136), where $\mathbf{\Lambda}_{\mathbf{X}}$ is the diagonal matrix of eigenvalues; the $i$th eigenvalue of $\mathbf{H}_{\mathbf{X}}$ then follows. It follows from (135) that $\mathbf{H}_{\mathbf{X}}$, being a product of commuting symmetric matrices, is also symmetric. Defining $\beta$ in terms of the eigenvalues of $\mathbf{H}_{\mathbf{X}}$, we obtain (141), and (134) follows from (133) and (141).

Inequality (134) establishes the smoothing property by ensuring that the norm of the smoothed data matrix remains smaller than that of the input data matrix. The degree of smoothness can be enhanced by repeatedly applying the smoother, leading to the $m$-fold composition of $S$ on the data.
Our idea is to apply the $m$-fold composition of $S$ to the noisy data samples $\mathbf{X}^+$ (130), with $m$ chosen optimally to minimize the deviation of the smoothed-noisy data from the noise-free data; the smoothing function $F$ (used in Definition 3 of the differentially private space folding measure) is defined accordingly. The inequalities (145) imply that an iteration of the smoothing function $S$ is applied only if it reduces the mismatch between the smoothed-noisy data and the noise-free data.
In the proposed FL protocol, each client locally applies the noise-adding mechanism exactly once to its data matrix $\mathbf{X}^{c,q}_s$, yielding a noise-perturbed matrix $\mathbf{X}^{c,q}_s + \mathbf{V}^{c,q}_s$. This noise-perturbed data matrix is then smoothed and subsequently used to construct a KAHM and induce the associated space folding measure $T_{F(\mathbf{X}^{c,q}_s + \mathbf{V}^{c,q}_s)}(\cdot)$ for aggregation and inference of the global model. Our method does not require any iterative sampling of clients' raw data for federated training and does not involve multiple communication rounds. The smoothing, KAHM construction, and model inference steps operate solely on $\mathbf{X}^{c,q}_s + \mathbf{V}^{c,q}_s$, and by the post-processing property of differential privacy, these steps cannot weaken the $(\epsilon, \delta)$-DP guarantee already established for the release of $\mathbf{X}^{c,q}_s + \mathbf{V}^{c,q}_s$. The $(\epsilon, \delta)$-DP guarantee established for the single application of the noise-adding mechanism thus fully characterizes the privacy of the entire training and inference pipeline, and therefore no per-round privacy accounting or multi-round composition analysis is needed.
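A minimal client-side sketch of this one-shot release pipeline is given below in Python/NumPy. Both helpers are placeholders: sample_noise stands in for inverse-transform sampling from the optimal noise distribution (129), which is not reproduced here, and smoothing_matrix is a generic kernel smoother standing in for the $\mathbf{H}_{\mathbf{X}}$ of Definition 6 (whose exact form is given in the paper's appendix); the stopping rule mirrors (145).

```python
import numpy as np

def sample_noise(shape, rng):
    """Placeholder for inverse-transform sampling from the optimal noise
    distribution (129); a Laplace sampler is used here purely for illustration."""
    return rng.laplace(loc=0.0, scale=1.0, size=shape)

def smoothing_matrix(X, length_scale=1.0, lam=0.1):
    """Generic kernel smoother standing in for H_X of Definition 6:
    H = K (K + lam*I)^{-1} with a Gaussian kernel K (an assumption, not the paper's
    exact construction). Its eigenvalues lie in [0, 1), so S(X) = H^T X shrinks the
    data norm, mirroring Proposition 3."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * length_scale ** 2))
    return K @ np.linalg.inv(K + lam * np.eye(len(X)))

def dp_release(X, rng, max_iters=20):
    """One-shot noise addition followed by repeated smoothing; a further smoothing
    pass is kept only while it reduces the mismatch to the noise-free data (cf. (145))."""
    X_sm = X + sample_noise(X.shape, rng)      # single application of the mechanism
    best_err = np.linalg.norm(X_sm - X)
    for _ in range(max_iters):
        H = smoothing_matrix(X_sm)
        X_next = H.T @ X_sm                    # S(X) := H^T X (Definition 6)
        err = np.linalg.norm(X_next - X)
        if err >= best_err:                    # stop once smoothing no longer helps
            break
        X_sm, best_err = X_next, err
    return X_sm                                # F(X + V): used to build the client's KAHM
```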
In privacy-critical domains, inference must often be performed on encrypted data (e.g., via FHE), which prohibits complex operations. Since our aggregation involves only scalar space folding measures, the inference is implementable using basic arithmetic gates supported by FHE. The FL methodology (as illustrated in Fig. 4) can be secured against an untrustworthy server by sharing fully homomorphically encrypted local evaluations (of the space folding measure) with the server for inference of the global model on encrypted data. The computational efficiency stems from the fact that the space folding measure, unlike high-dimensional gradients or model parameters, is a scalar. Moreover, inference in the encrypted space is likewise not computationally demanding. Since the inference of the hypothesis from the locally computed space folding measures (see (106)) is not arithmetic-heavy and can be expressed as a boolean circuit, TFHE [12] is selected as the FHE scheme, owing to its ability to evaluate binary gates with exceptionally low latency. However, TFHE operates on integer plaintexts, so each locally evaluated space folding measure $T_{\mathbf{X}^{c,q}_1, \cdots, \mathbf{X}^{c,q}_{S_{c,q}}}(x)$ is encrypted as an unsigned $p$-bit (e.g., $p = 16$) integer, i.e., as $\lceil (2^p - 1)\, T_{\mathbf{X}^{c,q}_1, \cdots, \mathbf{X}^{c,q}_{S_{c,q}}}(x) \rceil$.
The inference of the global model using (106) and (122) involves performing the minimum operation Q × C times and the equality-comparison operation C times on unsigned integers that encode the space folding measure evaluations. According to the TFHE-rs library benchmarks [80], operations on fully homomorphically encrypted 16-bit unsigned integers demonstrate practical performance on modern CPU hardware. Specifically, a minimum operation requires approximately 96.4 ms, while an equality comparison requires around 31.3 ms, when executed on an AMD EPYC 9R14 @ 2.60 GHz (AWS hpc7a.96xlarge) CPU. These measurements correspond to the default high-level parameter set in TFHE-rs, which provides at least 128-bit security under the IND-CPA-D model, with a bootstrapping failure probability not exceeding 2^{-128}. This configuration therefore balances strong cryptographic guarantees with practical computational efficiency for fully homomorphic operations on 16-bit encrypted data.
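As a plaintext reference for what the encrypted circuit computes, the sketch below quantizes the Q × C space folding evaluations to unsigned p-bit integers and applies a minimum/equality decision rule. It only mirrors (106) and (122) schematically: the function names are our own, and the encrypted realization replaces each step with homomorphic minimum and equality gates.

```python
import numpy as np

def quantize(T, p=16):
    # Encode space folding measures in [0, 1] as unsigned p-bit integers,
    # ceil((2^p - 1) * T), matching the encoding applied before encryption.
    return np.ceil(((1 << p) - 1) * np.asarray(T)).astype(np.uint32)

def global_prediction(T, p=16):
    # Plaintext analogue of the encrypted decision rule: T is a Q x C array of
    # locally evaluated space folding measures (Q clients, C classes). The FHE
    # version performs roughly Q x C encrypted minimum operations and C
    # equality comparisons per test point.
    Tq = quantize(T, p)
    class_scores = Tq.min(axis=0)     # per-class minimum over clients
    best = class_scores.min()         # overall minimum score
    flags = (class_scores == best)    # C equality comparisons
    return int(np.argmax(flags))      # index of the winning class
```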
Together, these nine steps demonstrate that our operator-theoretic approach offers a unified kernel framework for gradient-free FL that is communication-efficient, supports rigorous differential privacy mechanisms on scalar space folding summaries, and is compatible with secure inference via FHE, while being grounded in non-asymptotic finite-sample analysis and hypothesis-space complexity bounds.
We design our experimental study to address the following questions:
• Q1 (Robustness to heterogeneity and imbalance). How robust is the proposed operator-theoretic, gradient-free federated learning method to long-tailed class imbalance and non-IID label distributions across clients?
• Q2 (Privacy-utility trade-off). How well does the method perform under differential privacy constraints, and what is the impact of the proposed smoothing mechanism?
• Q3 (Secure inference efficiency). Is secure inference of the global model via fully homomorphic encryption (FHE) computationally feasible on commodity hardware?
• Q4 (Sensitivity to design choices). How sensitive is performance to the choice of space folding variant, batch size, and feature embeddings?
We evaluate the proposed method on four benchmark datasets.
20Newsgroup [67] This text classification dataset is a collection of newsgroup documents split across 20 distinct topics. The “bydate” version of the dataset contains 11314 training documents and 7532 test documents.
XGLUE-NC [46] This is a multilingual news classification benchmark dataset containing English, German, Spanish, French, and Russian text documents belonging to 10 distinct categories. Each language is represented by 10000 training examples and 10000 test examples.
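For reference, the 20Newsgroup split quoted above corresponds to the standard “bydate” release, which can be loaded, for example, via scikit-learn as sketched below (the loader shown is an assumption about tooling, not part of the paper's pipeline; XGLUE-NC would be obtained from its own release and is not shown).

```python
from sklearn.datasets import fetch_20newsgroups

# The default scikit-learn download is the "bydate" version, giving the
# 11314 / 7532 train / test document counts across 20 topics.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data), len(train.target_names))  # 11314 7532 20
```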
CIFAR-10-LT The original CIFAR-10 dataset [38] contains 50000 training images and 10000 test images divided across 10 classes. Following prior work [9], the original CIFAR-10 dataset is converted into a long-tailed version with imbalance ratio ρ ∈ {10, 50, 100}, where ρ is the ratio between the sample sizes of the most frequent and the least frequent class.
CIFAR-100-LT Following [71], the original CIFAR-100 dataset [37], containing 100 classes with 500 training images and 100 test images per class, is converted into a long-tailed version with imbalance ratio ρ ∈ {10, 50, 100}.
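The long-tailed variants can be constructed with the exponential class-size profile commonly used in the literature following [9]; the sketch below is our own illustration of that construction and may differ in details (e.g., rounding) from the exact subsampling used in the cited works.

```python
import numpy as np

def long_tailed_sizes(n_max, num_classes, rho):
    # Exponential profile: class k keeps n_max * rho^(-k/(C-1)) samples, so the
    # head class keeps n_max samples and the tail class keeps n_max / rho.
    return [int(n_max * rho ** (-k / (num_classes - 1))) for k in range(num_classes)]

def make_long_tailed(labels, rho, rng=None):
    # Subsample a balanced dataset's per-class indices to the long-tailed profile.
    rng = np.random.default_rng() if rng is None else rng
    classes = np.unique(labels)
    n_max = max(int(np.sum(labels == c)) for c in classes)
    sizes = long_tailed_sizes(n_max, len(classes), rho)
    keep = []
    for c, n_c in zip(classes, sizes):
        idx = np.where(labels == c)[0]
        keep.extend(rng.choice(idx, size=n_c, replace=False))
    return np.sort(np.array(keep))

# Example: CIFAR-10 with 5000 images per class and rho = 100 keeps 5000 images
# of the most frequent class and 50 images of the least frequent class.
```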
In all experiments, the proposed FL method operates on fixed feature vectors extracted from existing encoders; the encoders are not updated during training. Our method therefore plays the role of a gradient-free “head” on top of pretrained feature extractors.
Image datasets. For CIFAR-10-LT and CIFAR-100-LT, a 2048-dimensional feature vector is obtained for each image from the activations of the “avg_pool” layer (the final average pooling layer preceding the fully connected layer) of a pretrained ResNet-50 neural network [51]. The image feature vectors are processed through the hyperbolic tangent function to limit values within the range [-1, 1].
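A minimal sketch of such a feature pipeline is given below, using torchvision's ImageNet-pretrained ResNet-50 as an assumed stand-in for the checkpoint cited in [51]; the preprocessing details (resize, normalization) are our own choices for illustration.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# 2048-d "avg_pool" features from a pretrained ResNet-50, squashed with tanh
# to [-1, 1]. The exact pretrained weights used in the paper may differ.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    feats = feature_extractor(batch).flatten(1)   # (B, 2048)
    return torch.tanh(feats)                      # limit values to [-1, 1]
```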
20Newsgroup For each document, the “mxbai-embed-large” English sentence embedding model [3] is used to extract a 1024-dimensional feature vector. Since the raw embeddings exhibit relatively small variance across dimensions, we rescale them along all dimensions by a factor of 10.
XGLUE-NC For the multilingual setting, we first extract 512-dimensional feature vectors using the “distiluse-base-multilingual-cased-v2” multilingual sentence embedding model [64] and rescale them by a factor of 10 to increase variance. We additionally compute 768-dimensional embeddings using the “paraphrase-multilingual” sentence embedding model [65], again rescaled by a factor of 10. Concatenating both embeddings yields a 1280-dimensional feature vector for each document, which is used in the FL experiments.
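The text feature pipeline can be sketched as follows with the sentence-transformers library; the Hugging Face model identifiers are our guesses for the checkpoints cited above and may differ from the exact ones used by the authors.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

SCALE = 10.0  # rescaling factor used to increase per-dimension variance

def embed_20newsgroup(texts):
    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
    return SCALE * model.encode(texts)            # (N, 1024)

def embed_xglue_nc(texts):
    mdl1 = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")
    mdl2 = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    e1 = SCALE * mdl1.encode(texts)               # (N, 512)
    e2 = SCALE * mdl2.encode(texts)               # (N, 768)
    return np.concatenate([e1, e2], axis=1)       # (N, 1280) concatenated features
```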
We follow established client-partitioning protocols to match prior FL studies on these benchmarks.
20Newsgroup Following the experimental setting of [36], the training documents are distributed across 100 clients in a non-IID manner using a Dirichlet distribution with concentration parameter α ∈ {0.1, 1, 5}.
XGLUE-NC Again following [36], the 100 clients are divided into five distinct groups, with a specific language assigned to each group such that all training examples of that language are distributed among the clients of the group in a non-IID manner using a Dirichlet distribution with concentration parameter α ∈ {0.5, 2, 5}.
CIFAR-10-LT and CIFAR-100-LT For both long-tailed image benchmarks, following the previous study [71], a non-IID distribution of the training images across 20 clients is simulated using a Dirichlet distribution with concentration parameter α = 0.5.
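The label-Dirichlet partition used in these settings can be sketched as below; this is the common construction, and the exact protocols of [36] and [71] may differ in details such as a minimum number of samples per client.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, rng=None):
    # For each class, draw a proportion vector over clients from Dir(alpha)
    # and split that class's samples accordingly (standard non-IID partition).
    rng = np.random.default_rng() if rng is None else rng
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for q, part in enumerate(np.split(idx, cuts)):
            client_idx[q].extend(part.tolist())
    return [np.array(ci) for ci in client_idx]

# Example: 100 clients with alpha = 0.1 (highly heterogeneous), as in the
# 20Newsgroup experiments:
# parts = dirichlet_partition(train_labels, num_clients=100, alpha=0.1)
```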
Our FL protocol, illustrated in Fig. 4, is used for all experiments. A key feature of the proposed framework is that it requires only a small number of method-specific choices. Beyond selecting a variant of the space folding measure in (102) and the batch size for local processing (Remark 10), no additional hyperparameters specific to our method are tuned. In all main experiments, we adopt option 1 in (102) to define the space folding measure T X . For the 20Newsgroup and XGLUE-NC datasets, we use a batch size of N b = 100. For CIFAR-10-LT and CIFAR-100-LT, which exhibit pronounced long-tailed class imbalance, we set N b = 20 and model each batch of 20 samples via a separate KAHM to better preserve minority classes. The effect of varying N b is further examined in an ablation study (Table 9).
Table 2: Comparison of the test data accuracy (%) obtained by the proposed method against previously available results [71] of federated learning experiments on the CIFAR-10-LT dataset under the long-tailed imbalance and non-IID label distribution scenarios.
The proposed FL method performs a single aggregation of scalar space folding summaries at the server, rather than iterative gradient exchanges, and thus operates in a communication-efficient, gradient-free regime.
All experiments were conducted in MATLAB R2024a. The reported numbers correspond to reference runs on an Apple iMac (M1, 2021) with 8 GB RAM. We release the implementation so that the experimental results can be reproduced from the source code, which is publicly available at: https://drive.mathworks.com/sharing/4cef6387-1a62-46c8-a7e7-a7439bbbd9ef .
We considered GitHub for hosting, but several precomputed embedding matrices and auxiliary .mat files used in our pipelines exceed the 100 MB per-file size limit imposed on standard GitHub repositories. Using MATLAB Drive avoids splitting the material across multiple services or requiring reviewers to configure Git Large File Storage, and therefore offers a more practical way to distribute the full reproducibility package, including large precomputed artifacts.
Unlike most prior FL studies on these benchmarks, our method operates exclusively on feature vectors derived from pretrained encoders and does not update the encoder parameters. Consequently, the comparisons should be interpreted in a head-only setting: we ask whether our gradient-free, operator-theoretic FL aggregation of fixed embeddings can be competitive with, and in several cases superior to, strong gradient-based FL baselines that optimize full or partially trainable models under matched data-distribution scenarios. This setup reflects applications where the encoder has already been trained and validated (or is provided by a third party), and only the task-specific prediction head is subject to federated training. A comparison of full end-to-end pipelines, including joint representation learning from raw text and image data, would require extending the operator-theoretic gradient-free kernel framework to encoder training, which we leave as an important direction for future work.
Table 4: Comparison of the test data accuracy (%) obtained by the proposed method against previously available results [71] of federated learning experiments on the CIFAR-100-LT dataset under the long-tailed imbalance and non-IID label distribution scenarios.
Table 5: Test data accuracy (%) obtained by the proposed method in differentially private federated learning experiments on the 20Newsgroup and XGLUE-NC datasets under non-IID label distribution scenarios with α = 0.1 for 20Newsgroup and α = 0.1 for XGLUE-NC. For each value of the privacy-loss bound ǫ (and fixed δ = 10^{-5}), the performance was evaluated under two scenarios: 1) when the noise-perturbed samples were not smoothed (i.e., T^+_X = T_{X+V}), 2) when the noise-perturbed samples were smoothed (i.e., T^+_X = T_{F(X+V)}).
Table 6: Results of the FHE-secured federated learning experiments on the 20Newsgroup and XGLUE-NC datasets under non-IID label distribution scenarios with α = 0.1 for 20Newsgroup and α = 0.1 for XGLUE-NC. The FHE-secured inference of the global model involves performing the minimum operation Q × C times and the equality-comparison operation C times, where Q is the number of clients and C is the number of classes. The reported computational time is required by an iMac (M1, 8 GB RAM) for performing minimum and equality-comparison operations on fully homomorphically encrypted integers using the TFHE-rs Rust library. These values correspond to the default high-level parameter set in TFHE-rs, which provides at least 128-bit security under the IND-CPA-D model, with a bootstrapping failure probability not exceeding 2^{-128}.
Table 7: Comparison of the test data accuracy (%) obtained by the proposed SFM-i method (where i denotes the selected option in the definition of T X in Equation (102)) in federated learning experiments on the CIFAR-100-LT dataset under the long-tailed imbalance and non-IID label distribution scenarios.
For 20Newsgroup and XGLUE-NC, we compare against the parameter-efficient fine-tuning baselines of [36], and for CIFAR-10-LT and CIFAR-100-LT against the gradient-based FL methods of [71]. Tables 1, 2, 3, and 4 report the experimental results on 20Newsgroup, CIFAR-10-LT, XGLUE-NC, and CIFAR-100-LT, respectively. The top two performances are highlighted. The results of the differentially private federated learning experiments on the 20Newsgroup and XGLUE-NC datasets are provided in Table 5. The goal of the differentially private federated learning experiments was to study the effect of the data smoothing mechanism on performance. Table 6 presents the results of the FHE-secured federated learning experiments on the 20Newsgroup and XGLUE-NC datasets. The results of the experiments studying different variants of the space folding measure T X are provided in Table 7 and Table 8 for CIFAR-100-LT and XGLUE-NC, respectively. The effect of the batch size N b is experimentally studied in Table 9. Finally, the performance of different embedding models is evaluated in Table 10.
On 20Newsgroup, the proposed space folding method (SFM) consistently yields the highest accuracy across all three non-IID settings (α ∈ {5, 1, 0.1}). In the most heterogeneous case (α = 0.1), SFM achieves 84.7% test accuracy, improving upon the strongest gradient-based baseline of [36] by up to 23.7 percentage points (Table 1). For CIFAR-10-LT, SFM substantially improves upon the baselines of [71] under long-tailed imbalance and non-IID label distributions (Table 2). At imbalance ratio ρ = 100, SFM attains 80.99% test accuracy, outperforming the best competing method (CReFF) by 10.44 percentage points. On the multilingual XGLUE-NC benchmark, SFM again attains the best results (Table 3). For α = 0.5, SFM reaches 82.2% accuracy, exceeding the strongest baseline (C2A) by 2.0 percentage points. Finally, on CIFAR-100-LT (Table 4), SFM provides notable gains in the most imbalanced setting. At ρ = 100, SFM achieves 46.16% test accuracy, 11.49 percentage points higher than the baseline (CReFF).
Overall, these results indicate that, when coupled with fixed pretrained encoders, the proposed gradient-free FL method is robust to severe label skew and long-tailed imbalance, and can match or exceed the performance of state-of-the-art gradient-based FL methods on the considered benchmarks.
To examine the privacy-utility trade-off (Q2), we perform differentially private FL experiments on 20Newsgroup and XGLUE-NC with non-IID label distributions (α = 0.1 for both datasets). For varying values of the privacy-loss bound ǫ (with fixed δ = 10^{-5}), Table 5 reports test accuracy under two scenarios:
1. noise-perturbed samples without additional smoothing, i.e., T^+_X = T_{X+V};
2. noise-perturbed samples with the proposed kernel-based smoothing, i.e., T^+_X = T_{F(X+V)}.
For high-privacy regimes (ǫ ≤ 3), smoothing yields modest but consistent improvements in accuracy on both datasets, indicating that the smoothing mechanism can partially counteract the distortion introduced by noise. The gains remain relatively small, which is consistent with the fact that the underlying KAHM-based autoencoder already enforces a smooth representation. In low-privacy regimes (e.g., ǫ = 8), additional smoothing offers no benefit and may slightly degrade performance, suggesting that smoothing is most useful when stringent privacy guarantees are required.
We next examine the suitability of the global prediction rule for secure inference using fully homomorphic encryption. In the proposed framework, inference reduces to computing, for each class, a scalar score based on the aggregated space folding measures and then selecting the class with minimum score. When realized over encrypted integers, this decision rule requires Q × C homomorphic minimum operations and C equality-comparison operations, where Q is the number of clients and C is the number of classes. Using the TFHE-rs Rust library with its default high-level parameter set (providing at least 128-bit security under the IND-CPA-D model and a bootstrapping failure probability not exceeding 2^{-128}), we benchmark the primitive operations that dominate the cost of the encrypted prediction rule. On an iMac (M1, 8 GB RAM), minimum and equality-comparison operations on 8-bit and 16-bit encrypted integers are computationally practical, with the corresponding latencies reported in Table 6. Since the total cost of the FHE realization grows linearly in Q × C for minimum and C for equality-comparison operations, these measurements provide an operation-level indication that the induced prediction rule is structurally amenable to FHE-secured inference on standard hardware. These measurements should therefore be interpreted as lower-level building blocks, quantifying the dominant cryptographic operations induced by our decision rule under a particular TFHE implementation, while system-level optimizations (e.g., batching, specialized hardware, or multi-key schemes) are not the focus of our experiments.
Finally, we investigate the sensitivity of the method to the design choices.
Space folding variants Table 7 (CIFAR-100-LT) and Table 8 (XGLUE-NC) compare the four variants of the space folding measure T X defined in (102). Across all settings, the performance differences between SFM-1, SFM-2, SFM-3, and SFM-4 are small, indicating that the method is robust to the particular choice of space folding variant.
Batch size Table 9 reports the effect of varying the batch size N b on CIFAR-100-LT. Smaller batches (e.g., N b = 20) lead to noticeably better performance under strong imbalance, whereas larger batches can hurt accuracy, particularly for the most imbalanced settings. This supports the intuition that, on long-tailed image datasets, modelling smaller batches via separate KAHMs helps capture minority classes more faithfully.
Embedding combinations Table 10 examines the effect of different embedding models on performance. Using either distiluse-base-multilingual-cased-v2 (mdl 1 ) or paraphrase-multilingual (mdl 2 ) alone yields strong performance, but their concatenation (mdl 1 + mdl 2 ) systematically improves accuracy for all values of α. This suggests that the operator-theoretic gradient-free FL model can effectively exploit complementary information from multiple embedding spaces.
We summarize the main empirical findings in terms of the research questions posed at the beginning of this section.
Q1 (Robustness to heterogeneity and imbalance) Across all four datasets, the proposed method achieves accuracy that matches or exceeds strong gradient-based FL baselines under non-IID label distributions and long-tailed class imbalance. The sizeable gains on 20Newsgroup, CIFAR-10-LT, and CIFAR-100-LT, together with the improvements on the multilingual XGLUE-NC benchmark, indicate that the proposed FL method is robust to challenging data heterogeneity when built on top of pretrained encoders.
Q2 (Privacy-utility trade-off) In the differentially private FL experiments, the method maintains competitive accuracy even under tight privacy budgets. The kernel-based smoothing improves performance in high-privacy regimes (ǫ ≤ 3), while having limited or no benefit when privacy constraints are relaxed. This suggests that smoothing should be applied primarily when strong privacy guarantees are required.
Q3 (Secure inference efficiency) The global gradient-free FL model admits a simple inference procedure suitable for FHE: for a C-class problem with Q clients, encrypted inference requires Q × C minimum and C equality-comparison operations per test point. The measured latencies for these encrypted primitive operations show that, under the evaluated cryptographic parameter settings and for the dataset and client scales studied here, the resulting FHE-secured inference appears computationally feasible on standard hardware.
Q4 (Sensitivity to design choices) The ablation studies demonstrate that the method is robust to the choice of space folding variant, benefits from smaller batch sizes in the presence of long-tailed imbalance, and can exploit complementary embeddings to further improve accuracy.
Overall summary Taken together, the experiments suggest that the proposed operator-theoretic framework offers a favourable combination of robustness to heterogeneity, privacy preservation, and practical efficiency for federated learning, within the scope of the benchmarks and settings considered in this study.
The primary contribution of this study is the development of an operator-theoretic kernel framework for the design and analysis of gradient-free federated learning algorithms. The framework addresses the key requirements identified in Section 1 by reformulating the FL problem in the L 2 function space, mapping the L 2 -optimal solution into a reproducing kernel Hilbert space (RKHS) via an invertible operator, and deriving finite-sample performance guarantees using concentration inequalities over operator norms. This yields a gradient-free learning scheme together with non-asymptotic bounds on risk, prediction error, robustness, and approximation error. Within this formulation, we determine a data-dependent hypothesis space by tuning the kernel to the scale of the data and analyse its complexity via Rademacher complexity. The analysis shows that scalar space folding summaries derived from Kernel Affine Hull Machines (KAHMs) are sufficient for the global task learning solution, characterizing when high-dimensional gradient exchanges and multiple communication rounds are not required. In this way, the framework offers a mathematically grounded alternative to traditional gradient-based FL in heterogeneous settings.
The framework further integrates privacy-enhancing and security mechanisms into FL. Differentially private FL is achieved by applying a single optimized noise-adding mechanism to each client’s data matrices, followed by kernel-based smoothing and the computation of scalar space folding summaries. By the post-processing property of differential privacy, the resulting global decision rule inherits the (ǫ, δ)-DP guarantee. Secure FL is enabled by fully homomorphic encryption (FHE) of space folding measures. Because the global decision rule for a C-class problem with Q participating clients can be implemented using Q × C minimum and C equality-comparison operations per test point, the induced FHE-secured inference has a simple and low-dimensional computational structure. Operation-level benchmarks of these encrypted primitives indicate that, for the problem sizes and cryptographic parameter settings studied here, such FHE-secured inference is practically feasible on standard hardware.
Empirically, when combined with embeddings from existing encoders, the resulting gradient-free FL method is competitive with, and in several settings outperforms, strong gradient-based FL methods on non-IID and long-tailed benchmarks. The experiments also indicate that the proposed smoothing mechanism can mitigate the accuracy loss induced by differential privacy in high-privacy regimes, and that the structural simplicity of the FHE-secured decision rule, together with the measured primitive latencies, supports its practical feasibility at the evaluated scales.
A limitation of the present framework is that it treats feature extractors as fixed and focuses exclusively on the design and analysis of the federated prediction head. While this matches the settings where pretrained encoders are frozen for regulatory, engineering, or cost reasons, it does not directly address end-to-end representation learning under federated constraints. Extending the operator-theoretic construction to encompass joint encoder and head learning, for example via operator-theoretic formulations of representation learning objectives or hybrid gradient-free/gradient-based schemes, is an interesting avenue for future work.
Overall, the main value of this work lies in the unifying operator-theoretic perspective and the associated guarantees, which are largely architectural and model-agnostic and do not depend on a particular dataset, feature encoder, or hardware platform. While our experiments focus on standard non-IID partitions, long-tailed settings, and a reference implementation for concreteness, the theoretical results are intended to remain informative as future work broadens tasks, systems, and deployment conditions.
• P_X ∈ R^{ñ×n} (ñ ∈ {1, 2, …, n}) is an encoding matrix such that the product P_X x is a lower-dimensional (i.e., ñ-dimensional) encoding of x. The encoding matrix is computed from the data samples X using the following algorithm:
Algorithm 1 Determination of Encoding Matrix P_X
Require: Matrix X ∈ R^{N×n}, equivalently represented as the dataset {x_i ∈ R^n}_{i=1}^{N}.
1: ñ ← min(20, n, N − 1).
2: Define P_X ∈ R^{ñ×n} such that the i-th row of P_X is equal to the transpose of the eigenvector corresponding to the i-th largest eigenvalue of the sample covariance matrix of {x_1, …, x_N}.
3: while min_{1≤j≤ñ} ( max_{1≤i≤N} (P_X x_i)_j − min_{1≤i≤N} (P_X x_i)_j ) < 1e−3 do
4:   ñ ← ñ − 1.
5:   Define P_X ∈ R^{ñ×n} such that the i-th row of P_X is equal to the transpose of the eigenvector corresponding to the i-th largest eigenvalue of the sample covariance matrix of {x_1, …, x_N}.
6: end while
7: return P_X.
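A minimal Python rendering of Algorithm 1 is sketched below; the function name and the lower guard at ñ = 1 are our own additions for a runnable illustration.

```python
import numpy as np

def encoding_matrix(X, tol=1e-3, max_dim=20):
    # Rows of P_X are leading eigenvectors of the sample covariance; the
    # encoding dimension is reduced whenever the smallest per-coordinate
    # range of the encodings P_X x_i falls below tol (Algorithm 1).
    N, n = X.shape
    n_enc = min(max_dim, n, N - 1)
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    eigvec = eigvec[:, ::-1]                     # descending eigenvalue order
    while True:
        P = eigvec[:, :n_enc].T                  # (n_enc, n) encoding matrix
        Z = X @ P.T                              # encodings P_X x_i
        ranges = Z.max(axis=0) - Z.min(axis=0)
        if ranges.min() >= tol or n_enc == 1:    # lower guard added here;
            return P                             # not stated in Algorithm 1
        n_enc -= 1
```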
• We have
and a positive-definite real-valued kernel, k X : X × X → R on X with a corresponding reproducing kernel Hilbert space H kX (X ), as
where
, approximates the indicator function 𝟙_{{P_X x_i}} : X → {0, 1} as the solution of the following kernel regularized least squares problem:
where the regularization parameter λ * X ∈ R + is given as
where ê is the unique fixed point of the function r such that
with r : R + × R + → R + defined as
where (I N ) i,: denotes the i-th row of the identity matrix of size N and K X is the N × N kernel matrix with its (i, j)-th element defined as
The following iterations
converge to ê. The solution of the kernel regularized least squares problem follows as
The value h i X (P X x) represents the kernel-smoothed membership of the point P X x to the set {P X x i }.
• The image of A X defines a region in the affine hull of {x 1
Appendix B. Proof of Equation ( 22)
where we have considered that Ey∼P y|x [y c |x] ∈ L 2 (R n , P x ). Thus, (22) follows.
Appendix C. Proof of K Φc Being a Positive Semi-Definite Kernel
K Φc is a positive semi-definite kernel, since
), and • for every
Inequality (160) can be proved by considering that $\sum_{i,j=1}^{N}$
Appendix D. Proof of J Being Well Defined
Since
we have
and thus
That is Jf ∈ L 2 (R n , P x ). Hence, J is well defined.
Appendix E. Proof of (J * ) −1 Being Well Defined on the Range of J *
Consider, for any f ∈ H Φc ,
where we have used (169). Due to (30), we have
That is, (J * ) -1 f ∈ L 2 (R n , P x ). Hence, (J * ) -1 is well defined on the range of J * .
Appendix F. Proof of Equation (46)
Thus, (46) follows.
Appendix G. Proof of Inequality (48)
Consider, for any f ∈ H Φc ,
Thus, (48) follows.
Appendix H. Proof of Inequality (65)
$\big\langle J\big(h_{x\to y_c} - f^{\mathcal{H}_{\Phi_c}}_{x\to y_c}\big),\; J\big(h_{x\to y_c} - f^{\mathcal{H}_{\Phi_c}}_{x\to y_c}\big)\big\rangle = \big\langle h_{x\to y_c} - f^{\mathcal{H}_{\Phi_c}}_{x\to y_c},\; J^{*}J\big(h_{x\to y_c} - f^{\mathcal{H}_{\Phi_c}}_{x\to y_c}\big)\big\rangle$
Using (63), we have
Using (52), we get (65).
Appendix I. Proof of Inequality (66)
Considering x as a random variable, F x,c is a random variable taking values in H Φc with mean equal to the zero function, i.e.,
where 0 : R n → 0. Consider
where (196) follows from (195) using (24) and (169). Now, consider
Using the independence of the samples ((x i , y i )) N i=1 and (192), we have E (x i ,y i )∼Px,y,(x j ,y j )∼Px,y
Therefore,
where we have used (196).
Appendix J. Proof of Inequality (67)
where we have used the independence of the samples ((x i , y i )) N i=1 and (27). Using (168) and (28), we get (67).
Appendix K. Proof of Inequality (68)
where
It can be seen that
where F x,c is defined as in (191). Consider
Thus,
where we have used (169) and (28). Thus,
It follows from (215) and (220)
Thus, ψ c satisfies the bounded differences property with bound 4/N, and therefore by McDiarmid’s inequality, for any ǫ > 0, with probability at most exp(−0.125Nǫ²), the following holds:
That is, with probability at most exp(−0.125Nǫ²), the following holds:
That is, with probability at most δ > 0, the following holds:
In other words, with probability at least 1 -δ, the following holds:
Using (66) and (67) in (225), we get (68).
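For readability, the step in the preceding argument from “probability at most exp(−0.125Nǫ²)” to “probability at least 1 − δ” amounts to equating the tail probability with δ and solving for ǫ. The display below is our own restatement of that conversion, not an equation from the paper.

```latex
\exp\!\bigl(-0.125\,N\epsilon^{2}\bigr)=\delta
\;\Longleftrightarrow\;
\epsilon=\sqrt{\frac{8\ln(1/\delta)}{N}},
\qquad\text{so the bound holds with }\epsilon=\sqrt{\tfrac{8\ln(1/\delta)}{N}}\text{ with probability at least }1-\delta .
```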
where we have used the reproducing property of the kernel, since h x →yc ∈ H Φc . That is,
where (229) follows from (228) due to the Cauchy-Schwarz inequality. Also,
Using (91) and (230),
As per Jensen’s inequality,
and thus
Since σ 1 , • • • , σ N are independent random variables drawn from the Rademacher distribution, we have
As Φ c,σ : R n → [0, 1], we have
Using (240) in (236), we get
Since the inequality (241) holds for all ǫ > 0, we have (93).
Appendix M. Proof of Theorem 6
Define, for a given dataset D (as defined in (6)),
Consider a function assessing the supremum of the difference between the expected loss and the empirically averaged loss:
where we have used the facts that y i c , y ′i c ∈ {0, 1} and (86). Similarly, we can obtain
Thus
Thus, g c satisfies the bounded differences property with bound 1/N, and therefore by McDiarmid’s inequality, for any ǫ > 0, with probability at most exp(−2Nǫ²), the following holds:
That is, with probability at most δ ∈ (0, 1), the following holds:
In other words, with probability at least 1 -δ, the following holds:
and consider (86) and y^N_c ∈ {0, 1}, leading to
h_1(x^N) + h_2(x^N) − 2y^N_c ≤ 2,   (275)
so that (296) follows. Since (290) holds with probability at least 1 − δ, using (296), we have with probability at least 1 − δ:
Using (85) together with Theorem 4, we have with probability at least 1 -δ:
Combining (297) and (298) leads to the result.
c /N approximates the true class prior. Thus, in our design of Φ c , condition (85) encodes a natural and statistically consistent alignment with the empirical class prior, rather than an artificial restriction on the hypothesis space.