Distributed Adaptive Learning with Multiple Kernels in Diffusion Networks
Ban-Sok Shin, Student Member, IEEE, Masahiro Yukawa, Member, IEEE, Renato L. G. Cavalcante, Member, IEEE, and Armin Dekorsy, Senior Member, IEEE

Abstract—We propose an adaptive scheme for distributed learning of nonlinear functions by a network of nodes. The proposed algorithm consists of a local adaptation stage utilizing multiple kernels with projections onto hyperslabs and a diffusion stage to achieve consensus on the estimates over the whole network. Multiple kernels are incorporated to enhance the approximation of functions with several high and low frequency components common in practical scenarios. We provide a thorough convergence analysis of the proposed scheme based on the metric of the Cartesian product of multiple reproducing kernel Hilbert spaces. To this end, we introduce a modified consensus matrix considering this specific metric and prove its equivalence to the ordinary consensus matrix. Besides, the use of hyperslabs enables a significant reduction of the computational demand with only a minor loss in performance. Numerical evaluations with synthetic and real data are conducted, showing the efficacy of the proposed algorithm compared to state-of-the-art schemes.

Index Terms—Distributed adaptive learning, kernel adaptive filter, multiple kernels, consensus, spatial reconstruction, nonlinear regression

I. INTRODUCTION

A. Background

Distributed learning within networks is a topic of high importance due to its applicability in various areas such as environmental monitoring, social networks and big data [1]–[3]. In such applications, observed data are usually spread over the nodes, and thus, they are unavailable at a central entity. In environmental monitoring applications, for instance, nodes observe a common physical quantity of interest such as temperature, gas or humidity at each specific location. For a spatial reconstruction of the physical quantity over the area covered by the network, non-cooperative strategies will not deliver a satisfactory performance. Rather, distributed learning algorithms relying on information exchanges among neighboring nodes are required to fully exploit the observations available in the network.

Distributed learning of linear functions has been addressed by a variety of algorithms in the past decade, e.g., [4]–[12]. In contrast to these works, we address the problem of distributed learning of nonlinear functions/systems. To this end, we exploit kernel methods, which have been used to solve, e.g., nonlinear regression tasks [13], [14]. Based on a problem formulation in a reproducing kernel Hilbert space (RKHS), linear techniques can be applied to approximate an unknown nonlinear function.

B.-S. Shin and A. Dekorsy are with the Department of Communications Engineering, University of Bremen, Germany; e-mails: shin@ant.uni-bremen.de, dekorsy@ant.uni-bremen.de. M. Yukawa is with the Department of Electronics and Electrical Engineering, Keio University, Yokohama, Japan; e-mail: yukawa@elec.keio.ac.jp. R. L. G. Cavalcante is with the Fraunhofer Heinrich Hertz Institute, Berlin, Germany; e-mail: renato.cavalcante@hhi.fraunhofer.de. M. Yukawa is thankful to JSPS Grants-in-Aid (15K06081, 15K13986, 15H02757).
This function is then modeled as an element of the RKHS, and corresponding kernel functions are utilized for its approximation. This methodology has been exploited to derive a variety of kernel adaptive filters [15]–[25]. In particular, the naive online regularized risk minimization [15], the kernel normalized least-mean-squares, the kernel affine projection [20] and the hyperplane projection along affine subspace (HYPASS) [23], [26] enjoy significant attention due to their limited complexity and their applicability in online learning scenarios. The HYPASS algorithm has been derived from a functional space approach based on the adaptive projected subgradient method (APSM) [27] in the set-theoretic estimation framework [28], [29]. It exploits a metric with regard to the kernel Gram matrix, showing faster convergence and improved steady-state performance. The kernel Gram matrix determines the metric of an RKHS and is decisive for the convergence behavior of gradient-descent algorithms [30]. In [31], [32], kernel adaptive filters have been extended by multiple kernels to increase the degrees of freedom in the estimation process. By this, a more accurate approximation of functions with several high and low frequency components is possible with a smaller number of dictionary samples compared to using a single kernel only.

Regarding distributed kernel-based estimation algorithms, several schemes have been derived [33]–[42]. In [33], a distributed consensus-based regression algorithm based on kernel least squares has been proposed; it was extended by multiple kernels in [34]. Both schemes utilize the alternating direction method of multipliers (ADMM) [43] for distributed consensus-based processing. Recent works in [35]–[37] apply diffusion-based schemes to the kernel least-mean-squares (KLMS) to derive distributed kernel adaptive filters where nodes process information in parallel. The functional adapt-then-combine KLMS (FATC-KLMS) proposed in [35] is a kernelized version of the algorithm derived in [9]. The random Fourier features diffusion KLMS (RFF-DKLMS) proposed in [36] uses random Fourier features to achieve a fixed-size coefficient vector and to avoid an a priori design of a dictionary set. However, the achievable performance strongly depends on the number of utilized Fourier features. Besides, the aforementioned schemes incorporate update equations in the ordinary Euclidean space and, thus, do not exploit the metric induced by the kernel Gram matrix. Furthermore, the majority of these schemes do not consider multiple kernels in their adaptation mechanism.

B. Main Contributions

For the derivation of the proposed algorithm we rely on the previous work of [10]. However, while [10] only considers distributed learning of linear functions in a Euclidean space, we specifically derive a kernel-based learning scheme in an RKHS and its isomorphic Euclidean space, respectively. More specifically, we propose a distributed algorithm completely operating in the Cartesian product space of multiple RKHSs. The Cartesian product space has been exploited by the Cartesian HYPASS (CHYPASS) algorithm for adaptive learning with multiple kernels proposed in [32]. When operating in the corresponding Euclidean parameter space, a metric based on the kernel Gram matrix of each employed kernel needs to be considered.
This metric is determined by a block-diagonal matrix whose diagonal blocks are given by kernel Gram matrices. To derive a distributed learning scheme, we rely on average consensus on the coefficient vectors for each kernel. The key idea of our proposed scheme is to conduct distributed learning fully in a Euclidean space considering the metric of the Cartesian product space. This metric is responsible for an enhanced convergence speed of the adaptive algorithm. Operating with this metric implies that the consensus matrix used for diffusion of information within the network needs to be adapted to it. To this end, we introduce a modified consensus matrix operating in the metric of the product space. In fact, we show that the modified consensus matrix coincides with the consensus matrix operating in the ordinary Euclidean space as used in [10]. This finding implies that the metric of the product space does not alter the convergence properties of the average consensus scheme. This is particularly important in proving the monotone approximation property of our proposed scheme. We provide a thorough convergence analysis considering the metric of the product space. Specifically, we prove monotone approximation, asymptotic optimization, asymptotic consensus, convergence and characterization of the limit point within the framework of the APSM. As a practical implication, we demonstrate that by projecting the current estimate onto a hyperslab instead of the ordinary hyperplane we can significantly reduce the computational demand per node. By varying the hyperslab thickness (similar to an error bound), a trade-off between error performance and complexity per node can be adjusted. We corroborate our findings by extensive numerical evaluations on synthetic as well as real data and by mathematical proofs given in the appendices.

II. PRELIMINARIES

A. Basic Definitions

We denote the inner product and the norm of the Euclidean space R^M by ⟨·,·⟩_{R^M} and ||·||_{R^M}, respectively, and those of the RKHS H by ⟨·,·⟩_H and ||·||_H, respectively. Given a positive definite matrix K ∈ R^{M×M}, ⟨x, y⟩_K := x^T K y, x, y ∈ R^M, defines an inner product with the norm ||x||_K := √⟨x, x⟩_K. The norm of a matrix X ∈ R^{M×M} induced by the vector norm ||·||_K is defined as ||X||_K := max_{y≠0} ||X y||_K / ||y||_K. The spectral norm of a matrix is denoted by ||X||_2, obtained by choosing K = I_M, the M × M identity matrix [44]. A set C ⊂ R^M is said to be convex if αx + (1 − α)y ∈ C, ∀x, y ∈ C, ∀α ∈ (0, 1). If in addition the set C is closed, we call it a closed convex set. The K-projection of a vector w ∈ R^M onto a closed convex set C is defined by [45], [46]

P^K_C(w) := arg min_{v∈C} ||w − v||_K.   (1)

B. Multikernel Adaptive Filter

In the following, we present the basics of multikernel adaptive filters, which have been applied to online regression of nonlinear functions [31], [32]. We denote a multikernel adaptive filter by ϕ : X → R, where X ⊆ R^L is the input space of dimension L and R is the output space. The filter/function ϕ employs Q positive definite kernels κ_q : X × X → R with q ∈ Q = {1, 2, ..., Q}. Each kernel κ_q induces an RKHS H_q [13], and ϕ uses corresponding dictionaries D_q = {κ_q(·, x̄_ℓ)}_{ℓ=1}^r, each of cardinality r. Here, each dictionary D_q contains kernel functions κ_q centered at samples x̄_ℓ ∈ X. For simplicity, we assume that each dictionary D_q uses the same centers {x̄_ℓ}_{ℓ=1}^r, although this assumption is not required. The multikernel adaptive filter ϕ is then given by

ϕ := Σ_{q∈Q} Σ_{ℓ=1}^r w_{q,ℓ} κ_q(·, x̄_ℓ).   (2)

The output of ϕ for an arbitrary input sample x can be computed via

ϕ(x) = Σ_{q∈Q} Σ_{ℓ=1}^r w_{q,ℓ} κ_q(x, x̄_ℓ) = ⟨w, κ(x)⟩_{R^{rQ}}.   (3)

Here, the vectors w and κ(x) are defined as w_{(q)} := [w_{q,1}, ..., w_{q,r}]^T ∈ R^r, w := [w_{(1)}^T, ..., w_{(Q)}^T]^T ∈ R^{rQ}, κ_q(x) := [κ_q(x, x̄_1), ..., κ_q(x, x̄_r)]^T ∈ R^r, and κ(x) := [κ_1^T(x), ..., κ_Q^T(x)]^T ∈ R^{rQ}. A commonly used kernel function is the Gaussian kernel defined as

κ_q(x_1, x_2) = exp(−||x_1 − x_2||²_{R^L} / (2ζ_q²)), x_1, x_2 ∈ X,   (4)

where ζ_q > 0 is the kernel bandwidth. The metric of an RKHS is determined by the kernel Gram matrix. It contains the inherent correlations of a dictionary D_q with respect to (w.r.t.) the kernel κ_q and is defined as

K_q := [κ_q(x̄_m, x̄_ℓ)]_{m,ℓ=1}^r ∈ R^{r×r}.   (5)

Assuming that each dictionary D_q is linearly independent, each K_q is positive definite [46]. Moreover, we introduce the multikernel Gram matrix K := blkdiag{K_1, K_2, ..., K_Q} ∈ R^{rQ×rQ}, the block-diagonal matrix of the Gram matrices of all kernels. Then, by virtue of Lemma 1 from [47], we can parameterize ϕ by w in the Euclidean space R^{rQ} using the K-inner product ⟨·,·⟩_K. In fact, the K-metric in the Euclidean space corresponds to the metric of the Cartesian product of multiple RKHSs, defined as H^× := H_1 × H_2 × ... × H_Q := {(f_1, f_2, ..., f_Q) : f_q ∈ H_q, q ∈ Q} [32]. Indeed, we can express (3) equivalently by

ϕ(x) = ⟨w, κ(x)⟩_{R^{rQ}} = ⟨w, K^{−1}κ(x)⟩_K.   (6)
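To make the construction above concrete, the following is a minimal sketch in Python/NumPy, assuming Gaussian kernels: it builds the stacked kernel vector κ(x), the block-diagonal Gram matrix K from (5), and evaluates the filter output (3). The function names, bandwidths and dictionary sizes are illustrative, not part of the paper.

```python
import numpy as np
from scipy.linalg import block_diag

def gaussian_kernel(x1, x2, zeta):
    """Gaussian kernel (4) with bandwidth zeta."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * zeta ** 2))

def kappa_vec(x, centers, zetas):
    """Stacked kernel vector kappa(x) in R^{rQ} over all Q kernels."""
    return np.concatenate(
        [np.array([gaussian_kernel(x, c, z) for c in centers]) for z in zetas]
    )

def gram_matrix(centers, zetas):
    """Block-diagonal multikernel Gram matrix K = blkdiag{K_1, ..., K_Q}, cf. (5)."""
    blocks = []
    for z in zetas:
        Kq = np.array([[gaussian_kernel(c1, c2, z) for c2 in centers]
                       for c1 in centers])
        blocks.append(Kq)
    return block_diag(*blocks)

# Example: r = 3 centers in the unit square, Q = 2 Gaussian kernels.
rng = np.random.default_rng(0)
centers = rng.random((3, 2))
zetas = [0.1, 0.3]
K = gram_matrix(centers, zetas)           # (rQ x rQ), positive definite
w = rng.standard_normal(K.shape[0])       # coefficient vector in R^{rQ}
x = rng.random(2)
phi_x = w @ kappa_vec(x, centers, zetas)  # filter output (3)
```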
Instead of applying a learning method to the function ϕ in (H^×, ⟨·,·⟩_{H^×}), we can directly apply it to the coefficient vector w ∈ R^{rQ} in (R^{rQ}, ⟨·,·⟩_K). This representation is based on the parameter space approach from the kernel adaptive filtering literature, with the functional space approach as its equivalent counterpart, see [31, Appendix A]. In the following, we formulate the distributed learning problem in the parameter space (R^{rQ}, ⟨·,·⟩_K) to facilitate an easy understanding. However, we emphasize that this formulation originates from considerations in an isomorphic functional space. The interested reader is referred to Appendix A for a problem formulation in the functional space.

III. PROBLEM FORMULATION AND OBJECTIVE

A. System Model

We address the problem of distributed adaptive learning of a continuous, nonlinear function ψ : X → R by a network of J nodes. The function ψ is assumed to lie in the sum space of Q RKHSs, defined as H^+ := H_1 + H_2 + ... + H_Q := {Σ_{q∈Q} f_q | f_q ∈ H_q}. We label a node by the index j and the time by the index k. Each node j observes the nonlinear function ψ ∈ H^+ by sequentially feeding it with inputs x_{j,k} ∈ R^L. Then each node j acquires the measurement d_{j,k} ∈ R per time index k via

d_{j,k} = ψ(x_{j,k}) + n_{j,k},   (7)

where n_{j,k} ∈ R is a noise sample. Based on the nodes' observations, at each time index k we have a set of J acquired input-output samples {(x_{j,k}, d_{j,k})}_{j∈J} available within the network.
To describe the connections among the nodes in the network, we employ a graph G = (J, E) with a set of nodes J = {1, ..., J} and a set of edges E ⊆ J × J. Each edge represents a connection between two nodes j and i, given by (j, i) ∈ E, where each node j is connected to itself, i.e., (j, j) ∈ E. We further assume that the graph is undirected, i.e., the edges (j, i) and (i, j) are equivalent to each other. The set of neighbors of each node j is given by N_j = {i ∈ J | (j, i) ∈ E}, containing all nodes connected to node j (including node j itself). Furthermore, we consider the graph to be connected, i.e., each node can be reached from any other node over multiple hops. The objective of the nodes is to learn the nonlinear function ψ based on the acquired input-output samples {(x_{j,k}, d_{j,k})}_{j∈J} in a distributed fashion. To this end, nodes are able to exchange information with their neighboring nodes to enhance their individual estimates of the unknown function ψ.

B. Problem Formulation in Parameter Space

Based on the parametrization of the multikernel adaptive filter ϕ by the coefficient vector w, we formulate an optimization problem in the parameter space of w. The objective is to find a w such that the estimated output ϕ(x) = ⟨w, K^{−1}κ(x)⟩_K is close to the function output ψ(x) for arbitrary input samples x ∈ X. This has to be achieved in a distributed fashion for each node j in the network based on the acquired data pairs {(x_{j,k}, d_{j,k})}_{j∈J}. Thus, we equip each node j with a multikernel adaptive filter (2) parameterized by its individual coefficient vector w_j. Furthermore, each node j is assumed to rely on the same dictionaries D_q, q ∈ Q, i.e., they are globally known and common to all nodes. To specify the coefficient vectors which result in an estimate close to the node's measurement, we introduce the closed convex set S_{j,k} per node j and time index k:

S_{j,k} := {w_j ∈ R^{rQ} : |⟨w_j, K^{−1}κ(x_{j,k})⟩_K − d_{j,k}| ≤ ε_j},

where ε_j ≥ 0 is a design parameter. The set S_{j,k} is a hyperslab containing those vectors w_j which provide an estimate ϕ(x_{j,k}) = ⟨w_j, K^{−1}κ(x_{j,k})⟩_K with a maximum distance of ε_j to the desired output d_{j,k} [48]. The parameter ε_j controls the thickness of the hyperslab S_{j,k} and is introduced to account for the uncertainty caused by the measurement noise n_{j,k}. The key issue is to find an optimal w_j ∈ S_{j,k}. To this end, we define a local cost function Θ_{j,k} at time k per node j as the metric distance between its coefficient vector w_j and the hyperslab S_{j,k} in the K-norm sense:

Θ_{j,k}(w_j) := ||w_j − P^K_{S_{j,k}}(w_j)||_K.   (8)

This cost function gives the residual between w_j and its K-projection onto S_{j,k}. Being a distance function, Θ_{j,k}(w_j) is a non-negative, convex function with minimum value Θ*_{j,k} := min_{w_j} Θ_{j,k}(w_j) = 0. Then we define the global cost of the network at time k as the sum of all local costs:

Θ_k(w_j) := Σ_{j∈J} Θ_{j,k}(w_j),   (9)

where each individual cost Θ_{j,k} can be time-varying. The objective is to minimize the sequence (Θ_k)_{k∈N} of global costs (9) over all nodes in the network; due to the convexity of each Θ_{j,k}, the global cost Θ_k is also convex. Simultaneously, the coefficient vectors w_j of all nodes have to converge to the same solution, which guarantees consensus in the network.
To this end, we consider the following optimization problem at time k, as in [8], [10], [49]:

min_{{w_j | j∈J}} Θ_k(w_j) := Σ_{j∈J} Θ_{j,k}(w_j)   (10a)
s.t. w_j = w_i, ∀i ∈ N_j.   (10b)

Constraint (10b) enforces all coefficient vectors to converge to the same solution, i.e., w_1 = w_2 = ... = w_J, guaranteeing consensus within the network.

C. Optimal Solution Set

From the definition (8) of the local cost Θ_{j,k} we directly see that its minimizers are given by points in the hyperslab S_{j,k}. Since each local cost Θ_{j,k} is a metric distance with minimum value zero, the minimizers of the global cost Θ_k at time k are given by the intersection Υ_k := ∩_{j∈J} S_{j,k}. Points in Υ_k minimize each local cost Θ_{j,k} and therefore also their sum Θ_k(w_j) = Σ_{j∈J} Θ_{j,k}(w_j). Thus, a point minimizing each local cost Θ_{j,k}, ∀j ∈ J, is also a minimizer of the global cost Θ_k. To consider arbitrarily many time instants k ≥ 0, we can now define the optimal solution set of problem (10):

Υ* := ∩_{k≥0} ∩_{j∈J} S_{j,k}.   (11)

Points in the set Υ* minimize the global cost Θ_k at any time instant k and at any node j. We therefore call a point w* ∈ Υ* an ideal estimate. However, finding w* is a challenging task, particularly under practical considerations. Due to limited memory, for instance, not all measurements can be stored over time at each node. Hence, information about the set Υ* is unavailable, and thus an ideal estimate w* cannot be acquired. An alternative, feasible task is the minimization of all but finitely many global costs Θ_k. This approach stems from the intuition that a good estimate should minimize as many costs Θ_k as possible. To acquire such an estimate, the nodes should agree on a point contained in the set

Υ := lim inf_{k→∞} Υ_k = ∪_{k=0}^∞ cl(∩_{m≥k} Υ_m) ⊃ Υ*,   (12)

where cl(·) denotes the closure of a set. Finding a point in Υ is clearly a less restrictive task than finding one in Υ* since all global costs Θ_k except finitely many need to be minimized. Therefore, our proposed algorithm should achieve estimates in the set Υ. It has been shown that the APSM converges to points in the set Υ [27], [57].

Remark 1. For the above considerations we need to assume that Υ* ≠ ∅. To enable Υ* ≠ ∅, the hyperslab threshold ε_j of S_{j,k} should be chosen sufficiently large, depending on the noise distribution and its variance. Examples of how to choose ε_j in noisy environments have been proposed in [48]. For impulsive noise occurring finitely many times, one can regard the time instant of the final impulse as k = 0 to guarantee Υ* ≠ ∅. If, however, impulsive noise occurs infinitely many times in the measurements, it is not straightforward to ensure Υ* ≠ ∅ and convergence of the APSM, which will be introduced later on. Nevertheless, whenever impulsive noise occurs, the error signal in the APSM changes abruptly. Based on this change, those noisy measurements can be detected and discarded in practice so that Υ* ≠ ∅ is satisfied.

IV. PROPOSED ALGORITHM: DIFFUSION-BASED MULTIKERNEL ADAPTIVE FILTER

To solve (10) in a distributed way, we employ a two-step scheme consisting of a local adaptation and a diffusion stage, as commonly used in the literature, see e.g.
[4], [35], [49]:
1) a local APSM update per node j on the coefficient vector w_j, giving an intermediate coefficient vector w'_j;
2) a diffusion stage fusing the vectors w'_i from the neighboring nodes i ∈ N_j to update w_j.

Step 1) ensures that each local cost Θ_{j,k} is reduced and, hence, that the global cost Θ_k is reduced as well. Step 2) seeks a consensus among all coefficient vectors {w_j}_{j∈J} through information exchange among neighboring nodes to satisfy constraint (10b). By this exchange, each node inherently obtains the property sets from its neighbors, which can be exploited to improve the convergence behavior of the learning algorithm.

A. Local APSM Update

The APSM asymptotically minimizes a sequence of non-negative convex (not necessarily differentiable) functions [27] and can thus be used to minimize the local cost function Θ_{j,k}(w_j) in (8) per node j. For the coefficient vector w_{j,k} ∈ R^{rQ} at node j and time k, a particular case of the APSM update with the K-norm reads

w'_{j,k+1} := w_{j,k} − µ_{j,k} [(Θ_{j,k}(w_{j,k}) − Θ*_{j,k}) / ||Θ'_{j,k}(w_{j,k})||²_K] Θ'_{j,k}(w_{j,k}) if Θ'_{j,k}(w_{j,k}) ≠ 0, and w'_{j,k+1} := w_{j,k} otherwise,   (13)

where Θ'_{j,k}(w_{j,k}) is a subgradient¹ of Θ_{j,k} at w_{j,k}. The parameter µ_{j,k} ∈ (0, 2) is the step size. Since the learning scheme is to operate with the K-metric, this metric is used for the squared norm in the denominator. A subgradient of (8) is given by [27]

Θ'_{j,k}(w_{j,k}) = (w_{j,k} − P^K_{S_{j,k}}(w_{j,k})) / ||w_{j,k} − P^K_{S_{j,k}}(w_{j,k})||_K, for w_{j,k} ∉ S_{j,k}.   (14)

This subgradient gives ||Θ'_{j,k}(w_{j,k})||²_K = 1, and thus we arrive at the following APSM update per node j:

w'_{j,k+1} := w_{j,k} − µ_{j,k} (w_{j,k} − P^K_{S_{j,k}}(w_{j,k})).   (15)

As we can see, the difference vector w_{j,k} − P^K_{S_{j,k}}(w_{j,k}) is used to move the coefficient vector w_{j,k} towards the hyperslab S_{j,k}, controlled by the step size µ_{j,k}. Note that this update relies solely on local information, i.e., no information from neighboring nodes is needed. The projection P^K_{S_{j,k}}(w_{j,k}) is calculated by [45]

P^K_{S_{j,k}}(w) =
  w, if w ∈ S_{j,k};
  w − [(w^T κ(x_{j,k}) − d_{j,k} − ε_j) / ||K^{−1}κ(x_{j,k})||²_K] K^{−1}κ(x_{j,k}), if w^T κ(x_{j,k}) > d_{j,k} + ε_j;
  w − [(w^T κ(x_{j,k}) − d_{j,k} + ε_j) / ||K^{−1}κ(x_{j,k})||²_K] K^{−1}κ(x_{j,k}), if w^T κ(x_{j,k}) < d_{j,k} − ε_j.   (16)

¹A vector Θ'(y) ∈ R^M is a subgradient of a function Θ : R^M → R at y ∈ R^M if Θ(y) + ⟨x − y, Θ'(y)⟩ ≤ Θ(x) for all x ∈ R^M.
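The case distinction in (16) translates directly into code. Below is a minimal NumPy sketch of the K-metric hyperslab projection and the local APSM step (15); it exploits that ⟨w, K^{−1}κ(x)⟩_K = w^T κ(x) and ||K^{−1}κ(x)||²_K = κ(x)^T K^{−1} κ(x), so only one linear solve with K is needed. Function names are illustrative assumptions.

```python
import numpy as np

def project_hyperslab(w, K, kx, d, eps):
    """K-metric projection (16) of w onto the hyperslab
    S = {w : |w^T kx - d| <= eps}, with kx = kappa(x_{j,k})."""
    y = w @ kx                        # filter output <w, K^{-1} kx>_K = w^T kx
    K_inv_kx = np.linalg.solve(K, kx)
    denom = kx @ K_inv_kx             # ||K^{-1} kx||_K^2 = kx^T K^{-1} kx
    if y > d + eps:
        return w - ((y - d - eps) / denom) * K_inv_kx
    if y < d - eps:
        return w - ((y - d + eps) / denom) * K_inv_kx
    return w                          # w already lies in the hyperslab

def apsm_update(w, K, kx, d, eps, mu):
    """Local APSM step (15): relaxed projection with step size mu in (0, 2)."""
    return w - mu * (w - project_hyperslab(w, K, kx, d, eps))
```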
B. Diffusion Stage

To satisfy constraint (10b) and reach consensus on the coefficient vectors w_j, each node j fuses its own vector w'_j with those of its neighbors {w'_i}_{i∈N_j}. To this end, we employ a symmetric matrix G ∈ R^{J×J} assigning weights to the edges in the network. The (j, i)-entry of G is denoted by g_{ji} and gives the weight on the edge between nodes j and i. Obviously, if no connection is present between the two nodes, the entry is zero. The fusion step per node j at time k follows

w_{j,k} := Σ_{i∈N_j} g_{ji} w'_{i,k}.   (17)

To guarantee that all nodes converge to the same coefficient vector, G needs to fulfill the following conditions [10]:

||G − (1/J) 1_J 1_J^T||_2 < 1,  G 1_J = 1_J,   (18)

where 1_J is the vector of J ones. The first condition guarantees convergence to the average of all states in the network, while the second condition keeps the network at a stable state once consensus has been reached. Such matrices have been widely applied in the literature on consensus averaging problems, see e.g. [4], [50], [51]. Our proposed algorithm to solve (10) is then given by the following update equations per node j and time index k:

w'_{j,k+1} := w_{j,k} − µ_{j,k} (w_{j,k} − P^K_{S_{j,k}}(w_{j,k})),   (19a)
w_{j,k+1} := Σ_{i∈N_j} g_{ji} w'_{i,k+1},   (19b)

where the projection P^K_{S_{j,k}}(w_{j,k}) is given in (16). In each iteration k, each node j performs a local APSM update and transmits its intermediate coefficient vector w'_{j,k} to its neighbors i ∈ N_j. After receiving the intermediate coefficient vectors w'_{i,k} from its neighbors, each node j fuses these with its own vector w'_{j,k} by a weighted averaging step. In fact, (19a) comprises the projection in the Cartesian product of Q RKHSs which is used by the CHYPASS algorithm [32]. Therefore, we call our proposed scheme diffusion-based CHYPASS (D-CHYPASS), it being a distributed implementation of CHYPASS.

Remark 2. If the diffusion stage (19b) in D-CHYPASS is omitted, the algorithm reduces to a local adaptation or non-cooperative scheme where each node individually approximates ψ based on its node-specific measurement data. However, in this case each node j has access to its individual property set S_{j,k} only per time instant k. In contrast, by diffusing the coefficient vectors among neighboring nodes, each node j inherently obtains information about the property sets {S_{i,k}}_{i∈N_j} of its neighbors. This can be observed simply by inserting (19a) into (19b). Therefore, compared to local adaptation, D-CHYPASS shows a faster convergence speed and a lower steady-state error due to the cooperation within the network. Several works have shown the benefit of distributed approaches over non-cooperative strategies in the context of diffusion-based adaptive learning, see [4] and references therein.
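Putting (19a) and (19b) together, one D-CHYPASS iteration over the whole network can be sketched as follows. This is a synchronous simulation of the distributed protocol (in practice each node only evaluates its own row of G); it reuses apsm_update from the previous sketch, and the array layout is an illustrative assumption.

```python
import numpy as np

def d_chypass_iteration(W, G, K, kappas, d, eps, mu):
    """One D-CHYPASS iteration (19a)-(19b) for all J nodes.

    W      : (J, rQ) array, row j is the coefficient vector w_{j,k}
    G      : (J, J) symmetric combination matrix satisfying (18)
    kappas : (J, rQ) array, row j is kappa(x_{j,k})
    d, eps, mu : per-node measurements, hyperslab thresholds, step sizes
    """
    J = W.shape[0]
    # Step 1: local APSM update (19a) at every node.
    W_half = np.stack([
        apsm_update(W[j], K, kappas[j], d[j], eps[j], mu[j]) for j in range(J)
    ])
    # Step 2: diffusion (19b); row j combines the neighbors' intermediate
    # vectors with weights g_{ji} (zero for non-neighbors).
    return G @ W_half
```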
V. THEORETICAL ANALYSIS

A. Consensus Matrix

To analyze the theoretical properties of the D-CHYPASS algorithm, let us first introduce the definition of the consensus matrix.

Definition 1 (Consensus Matrix [10]). A consensus matrix P ∈ R^{rQJ×rQJ} is a square matrix satisfying the following two properties.
1) P z = z and P^T z = z for any vector z ∈ C := {1_J ⊗ a ∈ R^{rQJ} | a ∈ R^{rQ}}.
2) The rQ largest singular values of P are equal to one, and the remaining rQJ − rQ singular values are strictly less than one.

We denote by ⊗ the Kronecker product. We can further establish the following properties of the consensus matrix P.

Lemma 1 (Properties of the Consensus Matrix [10]). Let e_n ∈ R^{rQ} be the unit vector with its n-th entry equal to one and b_n := (1_J ⊗ e_n)/√J ∈ R^{rQJ}. Further, define the consensus subspace C := span{b_1, ..., b_{rQ}} and the stacked vector of all coefficient vectors in the network z_k := [w_{1,k}^T, ..., w_{J,k}^T]^T ∈ R^{rQJ}. Then we have the following properties.
1) The consensus matrix P can be decomposed into P = BB^T + X with B := [b_1 ... b_{rQ}] ∈ R^{rQJ×rQ} and X ∈ R^{rQJ×rQJ} satisfying X BB^T = BB^T X = 0 and ||X||_2 < 1.
2) The nodes have reached consensus at time index k if and only if (I_{rQJ} − BB^T) z_k = 0, i.e., z_k ∈ C.

A consensus matrix can be constructed from the matrix G as P = G ⊗ I_{rQ}, where I_{rQ} is the rQ × rQ identity matrix. The matrix P is then said to be compatible with the graph G since z_{k+1} = P z_k can be equivalently calculated by w_{j,k+1} = Σ_{i∈N_j} g_{ji} w_{i,k} (see (17)) [10]. By the definition of the consensus matrix we know that ||P||_2 = 1 holds. However, for the further analysis of the D-CHYPASS algorithm we need to know the norm w.r.t. the matrix K, since D-CHYPASS operates with the K-metric. Therefore, we introduce a modified consensus matrix P̂ satisfying ||P̂||_K = 1.

Lemma 2 (Modified Consensus Matrix). Suppose that P is a consensus matrix as in Definition 1. Let P̂ := K^{−1/2} P K^{1/2} be the modified consensus matrix, where K is the block-diagonal matrix with J copies of K:

K := I_J ⊗ K ∈ R^{rQJ×rQJ}.   (20)

Assume further that the dictionary D_q = {κ_q(·, x̄_ℓ)}_{ℓ=1}^r for each q ∈ Q is linearly independent, i.e., its corresponding kernel Gram matrix K_q has full rank, and thus K is non-singular. Then the K-norm of P̂ is given by ||P̂||_K = 1. In particular, the two consensus matrices are identical to each other, i.e., P̂ = P.

Proof: The proof is given in Appendix B.

Due to Lemma 2, for the further analysis we are free to use either P or P̂, and it holds that ||P||_K = ||P̂||_K = 1.

B. Convergence Analysis

From (19) we can summarize both update equations of D-CHYPASS in terms of all coefficient vectors in the network by defining

z_k := [w_{1,k}^T, ..., w_{J,k}^T]^T,  y_k := [µ_{1,k}(w_{1,k} − P_{S_{1,k}}(w_{1,k}))^T, ..., µ_{J,k}(w_{J,k} − P_{S_{J,k}}(w_{J,k}))^T]^T

and rewriting (19a) and (19b) as

z_{k+1} = (G ⊗ I_{rQ})(z_k − y_k).   (21)

We show the convergence properties of D-CHYPASS for fixed and deterministic network topologies. Although the space under study is the K-metric space, unlike [8], [10], we can still prove the properties due to Lemmas 1 and 2.

Theorem 1. The sequence (z_k)_{k∈N} generated by (21) satisfies the following.

1) Monotone approximation: Assume that w_{j,k} ∉ S_{j,k} with µ_{j,k} ∈ (0, 2) for at least one node j and that µ_{i,k} ∈ [0, 2] (i ≠ j). Then, for every w*_k ∈ Υ_k and z*_k := [(w*_k)^T, (w*_k)^T, ..., (w*_k)^T]^T ∈ R^{rQJ}, it holds that

||z_{k+1} − z*_k||_K < ||z_k − z*_k||_K,   (22)

where Υ_k ≠ ∅ since we assume that Υ* ≠ ∅.

For the remaining properties we assume that µ_{j,k} ∈ [ε_1, 2 − ε_2] with ε_1, ε_2 > 0 and that a sufficiently large hyperslab threshold ε_j per node j has been chosen such that w* ∈ Υ* ≠ ∅. We further define z* := [(w*)^T, (w*)^T, ..., (w*)^T]^T. Then the following holds.

2) Asymptotic minimization of local costs: For every z*, the local costs Θ_{j,k}(w_{j,k}) = ||w_{j,k} − P_{S_{j,k}}(w_{j,k})||_K are asymptotically minimized, i.e.,

lim_{k→∞} Θ_{j,k}(w_{j,k}) = 0, ∀j ∈ J.   (23)

3) Asymptotic consensus: With the decomposition P = BB^T + X and ||X||_2 < 1, the sequence (z_k)_{k∈N} asymptotically achieves consensus such that

lim_{k→∞} (I_{rQJ} − BB^T) z_k = 0.   (24)

4) Convergence of (z_k)_{k∈N}: Suppose that Υ* has a nonempty interior, i.e., there exist ρ > 0 and an interior point ũ such that {v ∈ R^{rQ} | ||v − ũ||_K ≤ ρ} ⊂ Υ*. Then the sequence (z_k)_{k∈N} converges to a vector ẑ = [ŵ^T, ..., ŵ^T]^T ∈ C satisfying (I_{rQJ} − BB^T) ẑ = 0.

5) Characterization of the limit point ẑ: Suppose for an interior point ũ ∈ Υ* that for any ε > 0 and any η > 0 there exists a ζ > 0 such that

min_{k∈I} Σ_{j∈J} ||w_{j,k} − P_{S_{j,k}}(w_{j,k})||_K ≥ ζ,   (25)

where I := {k ∈ N | Σ_{j∈J} d_K(w_{j,k}, lev_{≤0} Θ_{j,k}) > ε and Σ_{j∈J} ||ũ − w_{j,k}||_K ≤ η}. Then it holds that ŵ ∈ Υ, with Υ defined as in (12).

Proof: The proofs of Theorem 1.1–1.3 can be directly deduced from the corresponding proofs of Theorem 1a)–1c) in [10, Appendix III] under the consideration that ||P||_2 = ||P||_K = 1 (see Lemma 2) and that Θ_{j,k}(w_j) is a non-negative convex function. Note that the proof of Theorem 1.1 needs to be derived considering the K-metric and not the ordinary Euclidean metric as in [10]. The proofs of Theorem 1.4 and 1.5 are given in Appendix C.
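Lemma 2 admits a quick numerical sanity check: since P = G ⊗ I_{rQ} and K = I_J ⊗ K, the mixed-product property of the Kronecker product gives K^{−1/2} P K^{1/2} = G ⊗ (K^{−1/2} I_{rQ} K^{1/2}) = P. A small sketch, with illustrative sizes and a random symmetric positive definite matrix standing in for the Gram matrix K:

```python
import numpy as np

rng = np.random.default_rng(1)
J, rQ = 4, 6

# Symmetric combination matrix G satisfying (18): averaging on a ring graph.
A = np.eye(J, k=1) + np.eye(J, k=-1)
A[0, -1] = A[-1, 0] = 1
G = A / 3.0 + np.eye(J) / 3.0           # rows sum to one, symmetric

# Random SPD matrix standing in for the multikernel Gram matrix K.
M = rng.standard_normal((rQ, rQ))
K_small = M @ M.T + rQ * np.eye(rQ)
eigval, eigvec = np.linalg.eigh(K_small)
K_sqrt = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T
K_isqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T

P = np.kron(G, np.eye(rQ))              # P = G ⊗ I_{rQ}
Kb_sqrt = np.kron(np.eye(J), K_sqrt)    # K^{1/2} = I_J ⊗ K^{1/2}
Kb_isqrt = np.kron(np.eye(J), K_isqrt)
P_hat = Kb_isqrt @ P @ Kb_sqrt          # modified consensus matrix

print(np.allclose(P_hat, P))            # True: P_hat equals P (Lemma 2)
```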
VI. NUMERICAL EVALUATION

In the following section, we evaluate the performance of D-CHYPASS by applying it to the spatial reconstruction of multiple Gaussian functions and of real altitude data, and to the tracking of a time-varying nonlinear function by a network of nodes. The nodes are distributed over the unit-square area A, and each node j uses its Cartesian position vector x_j = [x_{j,1}, x_{j,2}]^T ∈ X as its regressor. We assume that the positions of the nodes stay fixed, i.e., x_{j,k} does not change over time. This is not necessary for D-CHYPASS to be applicable; e.g., it can be applied to a mobile network where the positions change over time, as investigated in [42]. Per time index k, each node takes a new measurement d_{j,k} of the function ψ at its position x_j. Hence, the network constantly monitors the function ψ. For all experiments we assume model (7) with zero-mean white Gaussian noise of variance σ²_n. Since in this scenario measurements of the function ψ are spatially spread over the nodes, collaboration among the nodes is inevitable for a good regression performance. Thus, it is an appropriate application example where the benefit of distributed learning becomes clear.

We compare the performance of D-CHYPASS to the RFF-DKLMS [36], the FATC-KLMS [35] and the multikernel distributed consensus-based estimation (MKDiCE) [34], which are state-of-the-art algorithms for distributed kernel-based estimation. RFF-DKLMS and FATC-KLMS are single-kernel approaches based on a diffusion mechanism. Assuming that the FATC-KLMS considers only local data in its adaptation step, both schemes exhibit the same number of transmissions per node as D-CHYPASS. To enable a fair comparison, we restrict the adaptation step of the FATC-KLMS to use local data only and extend the algorithm by multiple kernels as in D-CHYPASS. We call this scheme the diffusion-based multikernel least-mean-squares (DMKLMS). Its update equations per node j are given by

w'_{j,k+1} := w_{j,k} + µ_{j,k} (d_{j,k} − w_{j,k}^T κ(x_{j,k})) κ(x_{j,k}),   (26a)
w_{j,k+1} := Σ_{i∈N_j} g_{ji} w'_{i,k+1}.   (26b)

The RFF-DKLMS approximates kernel evaluations by random Fourier features such that no design of a specific dictionary set is necessary. However, its performance is highly dependent on the number of utilized Fourier features, which determines the dimension of the vectors to be exchanged. The MKDiCE is a distributed regression scheme based on kernel least squares with multiple kernels, using the ADMM for its distributed mechanism. Its number of transmissions per iteration is higher compared to D-CHYPASS, RFF-DKLMS and DMKLMS. Naturally, it is not an adaptive scheme, but it is included here for reference purposes. As the benchmark performance, we consider the central CHYPASS given by

w_{k+1} := w_k − µ Σ_{j∈J} (w_k − P^K_{S_{j,k}}(w_k)).   (27)
The central CHYPASS requires all node positions and measurements {(x_{j,k}, d_{j,k})}_{j∈J} per time index k at a single node to perform the projection P^K_{S_{j,k}}(w_k) onto each set S_{j,k}.

Regarding the dictionaries, we assume that each D_q uses the same samples {x̄_ℓ}_{ℓ=1}^r. These samples are a subset of the node positions {x_j}_{j∈J} in the network and are selected following the coherence criterion: a node position x_j is compared to every dictionary entry {x̄_ℓ}_{ℓ=1}^r and is included as dictionary sample x̄_{r+1} if it satisfies

max_{q∈Q} max_{ℓ=1,...,r} |κ_q(x_j, x̄_ℓ)| ≤ τ.   (28)

Here, 0 < τ ≤ 1 is the coherence threshold controlling the cardinality of D_q. The dictionary D_q is generated a priori over all node positions before the algorithm iterates. After that, it stays fixed throughout the reconstruction process for the specific algorithm. As the error metric we consider the network NMSE_k per time k over the area A. It evaluates the normalized squared difference between the reconstructed field ϕ_j(x) and the true field ψ(x), averaged over all nodes:

NMSE_k := (1/J) Σ_{j∈J} E{∫_A |ψ(x) − w_{j,k}^T κ(x)|² dx} / ∫_A |ψ(x)|² dx.   (29)

The expectation in the numerator is approximated by averaging over independent trials. The integrals are approximated by a sum over regularly positioned grid points which sample the area A.

A. Multiple Gaussian Functions

As a first example, we apply the D-CHYPASS algorithm to the reconstruction of two Gaussian functions with different bandwidths, given as

ψ(x) := 2 exp(−||x − p_1||²_{R²} / (2 · 0.1²)) + exp(−||x − p_2||²_{R²} / (2 · 0.3²))

with p_1 = [0.5, 0.7]^T, p_2 = [0.3, 0.1]^T and the Cartesian coordinate vector x = [x_1, x_2]^T. We use J = 60 nodes randomly placed over A = [0, 1]² following a uniform distribution, where nodes share a connection if their distance D to each other satisfies D < 0.3. We assume a noise variance of σ²_n = 0.3 at the nodes and average the performance over 200 trials with a new network realization in each trial. Regarding the kernel choice, we use two Gaussian kernels (Q = 2) with bandwidths ζ_1 = 0.1 and ζ_2 = 0.3.

TABLE I
PARAMETER VALUES FOR THE EXPERIMENT IN SECTION VI-A
  D-CHYPASS (I):   µ_{j,k} = 0.2, ε_j = 0     (τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3)
  D-CHYPASS (II):  µ_{j,k} = 0.5, ε_j = 0.5   (τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3)
  DMKLMS:          µ_{j,k} = 0.1              (τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3)
  MKDiCE:          µ_{j,k} = 0.5              (τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3)
  Central CHYPASS: µ = 3.3 · 10^{−3}, ε_j = 0
  FATC-KLMS:       µ_{j,k} = 0.07, τ = 0.9, ζ = 0.2
  RFF-DKLMS (I):   µ_{j,k} = 0.1, r_RFF = 100
  RFF-DKLMS (II):  µ_{j,k} = 0.1, r_RFF = 500

[Fig. 1. Comparing learning curves of D-CHYPASS to central and local adaptation.]

For all diffusion-based algorithms we use the Metropolis-Hastings weights [52], where each entry g_{ji} is determined by

g_{ji} = 1/max{δ_j, δ_i} if j ≠ i and (j, i) ∈ E;  g_{jj} = 1 − Σ_{i∈N_j\{j}} 1/max{δ_j, δ_i};  g_{ji} = 0 otherwise,

and δ_j = |N_j| denotes the degree of node j. For all algorithms we set the coherence threshold τ such that the same average dictionary size of r̄ = 33 is utilized. Single-kernel approaches use the arithmetic average of the bandwidths chosen for the multikernel schemes as their kernel bandwidth.
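The two setup ingredients above, the coherence criterion (28) and the Metropolis-Hastings weights, are easy to misread from the formulas alone, so here is a small sketch of both. It reuses gaussian_kernel from the earlier filter sketch; the neighbor-list representation is an illustrative assumption, and δ_j counts the neighborhood N_j including the node itself, as defined in Section III-A.

```python
import numpy as np

def build_dictionary(positions, zetas, tau):
    """Greedy coherence-based dictionary selection per (28): a position
    joins the dictionary only if its maximal kernel value against all
    current entries (over all Q kernels) stays at or below tau."""
    centers = [positions[0]]
    for x in positions[1:]:
        coherence = max(
            gaussian_kernel(x, c, z) for c in centers for z in zetas
        )
        if coherence <= tau:
            centers.append(x)
    return np.array(centers)

def metropolis_hastings_weights(neighbors):
    """Symmetric combination matrix G from neighbor lists.
    neighbors[j] is the set N_j including j itself, so delta_j = |N_j|."""
    J = len(neighbors)
    G = np.zeros((J, J))
    for j in range(J):
        for i in neighbors[j] - {j}:
            G[j, i] = 1.0 / max(len(neighbors[j]), len(neighbors[i]))
        G[j, j] = 1.0 - G[j].sum()   # rows sum to one
    return G
```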
We evaluate D-CHYPASS (I) with a hyperplane projection, i.e., ε_j = 0, and D-CHYPASS (II) with a hyperslab projection with ε_j = 0.5. The chosen parameter values for the considered algorithms are listed in Table I.

Figure 1 compares the NMSE learning curves of D-CHYPASS (I) and D-CHYPASS (II) to a local adaptation and the central CHYPASS. Clearly, the local adaptation completely fails to approximate ψ, while both D-CHYPASS (I) and D-CHYPASS (II) perform close to the central CHYPASS. Figure 2 compares the performance of D-CHYPASS (I) to state-of-the-art schemes. D-CHYPASS (I) significantly outperforms the compared algorithms in terms of convergence speed and steady-state error. Regarding monokernel approaches, FATC-KLMS outperforms RFF-DKLMS (I) in its steady-state error although it uses a dictionary of only r̄ = 33 samples compared to r_RFF = 100 random Fourier features.

[Fig. 2. Learning curves for the reconstruction of multiple Gaussian functions.]

By increasing the number of Fourier features to r_RFF = 500, the performance can be significantly improved, cf. RFF-DKLMS (II). Nevertheless, this improvement comes with a huge increase in communication overhead since the number of Fourier features equals the dimension of the coefficient vectors to be exchanged. While DMKLMS exchanges vectors with only r̄Q = 66 entries, the coefficient vectors in RFF-DKLMS (II) have r_RFF = 500 entries. Thus, by relying on an a priori designed dictionary as in DMKLMS and D-CHYPASS, huge savings in communication overhead and computational complexity can be achieved. The enhanced performance of D-CHYPASS compared to the other multikernel approaches is due to the better metric in the form of the K-norm and the normalization factor ||K^{−1}κ(x_{j,k})||²_K in the projection P_{S_{j,k}}(w_{j,k}), which adapts the step size µ_{j,k}. By exploiting the projection w.r.t. the K-norm, the shape of the cost function Θ_{j,k}(w_{j,k}) is changed such that the convergence speed is improved [53].

From Figure 1, D-CHYPASS (I) and (II) show a similar performance with a negligible loss for D-CHYPASS (II). However, this minor loss comes with a huge reduction in complexity per node j. Since D-CHYPASS (II) projects onto a hyperslab with ε_j = 0.5, there is a higher probability that w_{j,k} is contained in S_{j,k} than in D-CHYPASS (I), where ε_j = 0. If w_{j,k} ∈ S_{j,k}, the vector w_{j,k} is not updated, saving a significant amount of computations. In contrast, when using a hyperplane (ε_j = 0), the vector w_{j,k} has to be updated in every iteration. Figure 3 shows the number of local APSM updates (19a) per node in logarithmic scale over the hyperslab threshold ε_j, together with the NMSE averaged over the last 200 iterations. For thresholds ε_j > 0, a step size of µ_{j,k} = 0.5 is used. We can observe that using hyperslab thresholds up to ε_j = 0.5 saves a huge amount of complexity while keeping the error performance constant. For example, for D-CHYPASS (II) with ε_j = 0.5, on average 5,468 updates are executed per node. Compared to 15,000 updates for D-CHYPASS (I), a reduction of approximately 64% in computations is achieved. This is crucial especially for sensors with low computational capability and limited battery life.
[Fig. 3. Number of updates per node and NMSE for different values of the hyperslab threshold ε_j for the D-CHYPASS.]

However, from Figure 3 it is also clear that the computational load cannot be reduced arbitrarily without degrading the reconstruction performance. This is visible especially for thresholds ε_j > 1.

In Figure 4 we depict the contour plot of the true function ψ(x) together with an exemplary set of node positions and the function ϕ(x) reconstructed by D-CHYPASS (I). The reconstruction is shown for one node in the network at steady state. By virtue of the consensus averaging step, each node in the network will have the same reconstruction. We can observe that both Gaussian functions are approximated with good accuracy. The peaks of both functions can be clearly distinguished. However, in the outer regions some inaccuracies can still be seen. These are expected to be reduced when increasing the dictionary size.

[Fig. 4. Contour plots of the true ψ(x) (left) and its reconstruction ϕ(x) (right) at one node using the D-CHYPASS at steady state. Green circles show the node positions and filled circles the chosen dictionary entries.]

Figure 5 shows the error performance of the algorithms over the average dictionary size r̄ for 200 trials. The NMSE values are calculated as an average over the last 200 iterations, again with 15,000 iterations for each algorithm. We observe that D-CHYPASS (I) outperforms its competitors with growing dictionary size. In particular, DMKLMS and MKDiCE lose performance for dictionary sizes r̄ > 33, while D-CHYPASS steadily improves its reconstruction. The reason is that for DMKLMS and MKDiCE the step size has to be adjusted to the growing dictionary size to avoid an increasing steady-state error. In DMKLMS the step size is not normalized by the squared norm of the kernel vector κ(x) as in D-CHYPASS, which can lead to divergence. Regarding FATC-KLMS, a similar effect is expected to appear for higher dictionary sizes r̄ > 60 since it uses one kernel only. Therefore, in the range of 40 to 60 dictionary samples it performs better than DMKLMS and MKDiCE.

[Fig. 5. NMSE over dictionary size for the reconstruction of multiple Gaussian functions.]

To show that the performance of MKDiCE and DMKLMS can be improved, in Figure 6 we depict the NMSE performance over the iterations for an adapted step size with τ = 0.99. This coherence threshold results in an average dictionary size of r̄ = 53, a point where MKDiCE and DMKLMS show degrading performance according to Figure 5. The step sizes are chosen as µ_{j,k} = 0.05 for DMKLMS and µ_{j,k} = 0.2 for MKDiCE, respectively. We observe that by adjusting the step size to the dictionary size, the steady-state performance at r̄ = 53 is improved compared to Figure 5. Now, both MKDiCE and DMKLMS outperform FATC-KLMS.

[Fig. 6. Learning curves for the reconstruction of multiple Gaussian functions for a coherence threshold τ = 0.99, corresponding to an average dictionary size of r̄ = 53 samples.]

Remark 3. Regarding D-CHYPASS (I), it should be noted that for τ > 0.98 divergence was observed. This is caused by the inversion of an ill-conditioned kernel Gram matrix K. This occurs if the dictionary employs node positions close to each other, leading to (near-)linear dependency in K. With a higher coherence threshold, the probability of such a case increases. To numerically stabilize the inversion of K, a scaled identity matrix γ I_{rQ} is added to the matrix as regularization. The matrix K in (16) is then substituted by K + γ I_{rQ}. For thresholds τ > 0.98, a regularization parameter of γ = 0.01 was used in this experiment to achieve a stable performance.
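The regularization of Remark 3 is one line in code; a hedged sketch follows, where the default γ and the Cholesky-based solve are illustrative choices rather than prescriptions of the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def regularized_K_solve(K, b, gamma=0.01):
    """Solve (K + gamma*I) x = b, stabilizing an ill-conditioned Gram
    matrix as in Remark 3; Cholesky applies since K + gamma*I is SPD."""
    c = cho_factor(K + gamma * np.eye(K.shape[0]))
    return cho_solve(c, b)
```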
TABLE II
PARAMETER VALUES FOR THE EXPERIMENT IN SECTION VI-B
  D-CHYPASS:  µ_{j,k} = 0.5, ε_j = 0   (τ = 0.85, ζ_1 = 0.06, ζ_2 = 0.1)
  DMKLMS:     µ_{j,k} = 0.05           (τ = 0.85, ζ_1 = 0.06, ζ_2 = 0.1)
  MKDiCE:     µ_{j,k} = 0.7            (τ = 0.85, ζ_1 = 0.06, ζ_2 = 0.1)
  FATC-KLMS:  µ_{j,k} = 0.05, τ = 0.78, ζ = 0.08
  RFF-DKLMS:  µ_{j,k} = 0.2, r_RFF = 200, ζ = 0.08

B. Real Altitude Data

We apply D-CHYPASS to the reconstruction of real altitude data, where each node measures the altitude at its position x_j. For the data we use the ETOPO1 global relief model provided by the National Oceanic and Atmospheric Administration [54], which exhibits several low/high frequency components. In the original data, the position is given by longitude and latitude, and the corresponding altitude ψ(x) is delivered for each such position. As in [55], we choose an area of 31 × 31 points with longitudes {138.5, 138.5 + 1/60, ..., 139} and latitudes {34.5, 34.5 + 1/60, ..., 35}. For easier handling, we map longitudes and latitudes to Cartesian coordinates in the unit-square area such that x ∈ [0, 1]². We consider J = 200 nodes randomly placed over the described area. Nodes with a distance D < 0.2 to each other share a connection. We assume noise with σ²_n = 0.3. The coherence threshold is set such that each algorithm employs a dictionary of average size r̄ = 105, while RFF-DKLMS uses r_RFF = 200 Fourier features. The performances are averaged over 200 independent trials. Table II lists the chosen parameter values for the considered algorithms.

[Fig. 7. Learning curves for the reconstruction of altitude data.]

Figure 7 depicts the NMSE performance over the iterations. Again, D-CHYPASS outperforms the other algorithms in terms of convergence speed and steady-state error. Although DMKLMS performs very close to D-CHYPASS, it can be observed that the convergence speed of D-CHYPASS is faster. FATC-KLMS and RFF-DKLMS perform worst since their reconstruction capability is limited by the use of one kernel only.
While RFF-DKLMS converges faster than FATC-KLMS, it should be noted that it produces a higher communication overhead due to the use of r_RFF = 200 Fourier features compared to r̄ = 105 dictionary samples in FATC-KLMS. The contour plots for the multikernel approaches at steady state at one node are shown in Figure 8. For D-CHYPASS we can observe a good reconstruction of the original ψ(x), although details in the areas around [0.4, 0.7]^T and [0.4, 0.3]^T are missing. The reconstructions by DMKLMS and MKDiCE show a less accurate approximation, especially in the area around the valley at [0.4, 0.3]^T.

[Fig. 8. Contour plots of the altitude reconstruction by one node for the D-CHYPASS, DMKLMS, and MKDiCE.]

C. Time-Varying Nonlinear Function

In the following, we examine the tracking performance of D-CHYPASS w.r.t. time-varying functions. To this end, we consider the following function, depending on both the position x and the time k:

ψ(x, k) = 0.8 exp(−||x − p_1||²_{R²} / (2(1 − 0.5 sin(2π 10^{−3} k)) · 0.3²)) + exp(−||x − p_2||²_{R²} / (2(1 + 0.5 sin(2π 10^{−3} k)) · 0.1²))

with p_1 = [0.6, 0.5]^T and p_2 = [0.25, 0.3]^T. This function contains two Gaussian shapes whose bandwidths expand and shrink over time k. We apply D-CHYPASS to the reconstruction of the time-varying function ψ(x, k) and compare it to MKDiCE and DMKLMS. We use a network of J = 80 nodes randomly distributed over the unit-square area and average the performance over 200 trials with a new network realization in each trial. The noise variance is σ²_n = 0.3. For the considered algorithms we set τ such that an average dictionary size of r̄ = 36 samples is achieved. We evaluate D-CHYPASS with one and with two kernels. Table III lists the chosen parameter values for the considered algorithms.

TABLE III
PARAMETER VALUES FOR THE EXPERIMENT IN SECTION VI-C
  D-CHYPASS (I):  µ_{j,k} = 0.5, τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3, ε_j = 0
  D-CHYPASS (II): µ_{j,k} = 0.5, τ = 0.9, ζ_1 = 0.2, ε_j = 0
  DMKLMS:         µ_{j,k} = 0.1, τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3
  MKDiCE:         µ_{j,k} = 0.5, τ = 0.95, ζ_1 = 0.1, ζ_2 = 0.3

[Fig. 9. NMSE performance over iteration number for the tracking of a time-varying function.]

Figure 9 shows the NMSE over the iteration number k. The fluctuations in the error curves are due to the time-varying bandwidths in ψ(x, k). For all algorithms these fluctuations stay within a specific error range, illustrating that the function ψ(x, k) can be tracked within a certain range of accuracy. We observe that D-CHYPASS (I) and (II) significantly outperform the remaining algorithms. Additionally, the range of the fluctuations in the error is lower for D-CHYPASS compared to the other algorithms. It is also visible that utilizing two kernels in D-CHYPASS (I) improves the tracking performance compared to using one kernel as in D-CHYPASS (II). Nevertheless, it is worth noting that even with only one kernel, D-CHYPASS (II) outperforms the multikernel approaches DMKLMS and MKDiCE, illustrating the significant gain obtained by employing the K-norm in the algorithm.

D. Computational Complexity and Communication Overhead

We analyze the complexities and communication overhead of the algorithms per iteration in the network.
For the complexities, we consider the number of multiplications and assume that Gaussian kernels are used as in (4). Furthermore, each dictionary D_q is designed a priori, stays fixed over time k and is common to all nodes. Therefore, the kernel Gram matrix K can be computed offline before the iterative process of D-CHYPASS, avoiding an inversion in each iteration. Note that K is block-diagonal, so that Q inversions of r × r matrices have to be computed. This results in a complexity of order O(JQr³) in the network before D-CHYPASS starts iterating. To further reduce the complexity of D-CHYPASS, the selective-update strategy can be applied, which selects the s most coherent dictionary samples such that only s entries of the coefficient vector w_{j,k} are updated [47]. Usually s ≤ 5, so that s ≪ r. Then, per iteration k, the inverse of an s × s matrix has to be computed, while the complexity of the multiplications is heavily reduced. For the overhead, we count the number of transmitted scalars among all nodes. All algorithms except MKDiCE use a consensus averaging step which produces only broadcast transmissions. Besides broadcasts, MKDiCE also comprises unicast transmissions of vectors which depend on the receiving node and which increase the overhead significantly. Table IV lists the complexities and overhead of the algorithms, where the complexity of an inversion of a p × p matrix is denoted by v_inv(p) := p³.

TABLE IV
COMPUTATIONAL COMPLEXITY AND OVERHEAD OF THE ALGORITHMS
  Algorithm                  | Complexity                                        | Overhead
  D-CHYPASS                  | (2|E| + J(L + 4))Qr + (Qr² + 2)J                  | JQr
  D-CHYPASS (selective upd.) | (L + 1)Qr + v_inv(s) + s² + 2J + (2|E| + 3J)s     | JQr
  DMKLMS                     | (2|E| + J(L + 4))Qr + J                           | JQr
  FATC-KLMS                  | (2|E| + J(L + 4))r + J                            | Jr
  MKDiCE                     | (6|E| + 4J + L + 2)Qr² + J(1 + (Qr)² + v_inv(Qr)) | JQr + 2|E|Qr
  RFF-DKLMS                  | J(4 r_RFF + 1) + (2|E| + J) r_RFF                 | J r_RFF

Figure 10 depicts the complexity and the overhead over the dictionary size r for L = 2, s = 7, Q = 2 and a network of J = 60 nodes with |E| = 300 edges. RFF-DKLMS with r_RFF = 500 is included as a reference. It can be clearly seen that the complexity and overhead of the ADMM-based MKDiCE are the highest among the algorithms, due to the inversion of a Qr × Qr matrix per iteration k and the transmission of unicast vectors, respectively. Furthermore, for dictionary sizes up to r = 50, D-CHYPASS has a lower complexity than RFF-DKLMS. By including the selective-update strategy, the complexity of D-CHYPASS is significantly reduced and is even lower than that of the single-kernel FATC-KLMS. D-CHYPASS and DMKLMS exhibit the same overhead per iteration, which is lower than that of RFF-DKLMS for dictionary sizes up to r = 200.

VII. CONCLUSION

We proposed an adaptive learning algorithm exploiting multiple kernels and projections onto hyperslabs for the regression of nonlinear functions in diffusion networks. We provided a thorough convergence analysis regarding monotone approximation, asymptotic minimization, consensus and the limit point of the algorithm. To this end, we introduced a novel modified consensus matrix which we proved to be identical to the ordinary consensus matrix.
[Fig. 10. Computational complexity and communication overhead of the algorithms per iteration k over the dictionary size r.]

As an application example, we investigated the proposed scheme for the reconstruction of spatial distributions by a network of nodes with both synthetic and real data. Note that the scheme is not restricted to such a scenario and can be applied in general to any distributed nonlinear system identification task. Compared to state-of-the-art algorithms, we observed significant gains in error performance, convergence speed and stability over the employed dictionary size. In particular, our proposed APSM-based algorithm significantly outperformed an ADMM-based multikernel scheme (MKDiCE) in terms of error performance with highly decreased complexity and communication overhead. By embedding the hyperslab projection, the computational demand per node could be drastically reduced over a certain range of thresholds while keeping the error performance constant.

APPENDIX A
DERIVATION IN THE CARTESIAN PRODUCT SPACE

We equivalently formulate problem (10) in the Cartesian product space of Q RKHSs. Furthermore, we derive the local APSM update (15) exploiting the isomorphism between the product space and the Euclidean space with the K-metric.

A. Equivalent Problem Formulation

Since ψ lies in the sum space H^+ of Q RKHSs, it is decomposable into the sum ψ := Σ_{q∈Q} ψ^{(q)} with ψ^{(q)} ∈ H_q. Thus, to estimate ψ, one approach is to minimize the metric distance between the estimate ϕ and the true function ψ in the sum space H^+. The estimate ϕ is a multikernel adaptive filter (2) and can be expressed as a decomposable sum ϕ := Σ_{q∈Q} ϕ^{(q)} ∈ H^+ with ϕ^{(q)} := Σ_{ℓ=1}^r w_{q,ℓ} κ_q(·, x̄_ℓ). The problem can therefore be formulated as the following functional optimization problem:

min_{ϕ∈H^+} ||ϕ − ψ||_{H^+}.   (30)

However, in the sum space H^+ the norm and the inner product have no closed-form expressions, and functions might not be uniquely decomposable, depending on the choice of the kernel functions. An alternative approach is to formulate the problem in the Cartesian product space H^×, in which functions are uniquely decomposable independently of the underlying kernel functions and whose norm has the closed-form expression [32]

||F||_{H^×} := √(Σ_{q∈Q} ||f^{(q)}||²_{H_q}), ∀F := (f^{(q)})_{q∈Q} ∈ H^×.   (31)

Instead of the sum ϕ = Σ_{q∈Q} ϕ^{(q)}, we consider the Q-tuple Φ := (ϕ^{(q)})_{q∈Q} in H^× with Φ : X → R^Q. In the same way, for ψ we consider the Q-tuple Ψ := (ψ^{(q)})_{q∈Q}. Furthermore, each monokernel filter ϕ^{(q)} is in fact an element of the dictionary subspace M_q := span{κ_q(·, x̄_ℓ)}_{ℓ=1}^r. Thus, Φ lies in the Cartesian product of the Q dictionary subspaces

M^× := M_1 × M_2 × ... × M_Q ⊂ H^×.   (32)

With these considerations, we formulate the problem in H^×:

min_{Φ∈M^×} ||Φ − Ψ||_{H^×}.   (33)

The solution to the above problem is directly given by the projection P_{M^×}(Ψ) of Ψ onto the dictionary subspace M^×. This projection is in fact the best approximation of Ψ in M^× in the H^×-norm sense. For the monokernel case Q = 1, it has been shown that under certain assumptions P_{M^×}(Ψ) approximately equals the minimum mean square error (MMSE) estimate [30].
With respect to the network, we denote the estimate of each node j by Φ_j. The solution of (33) then needs to be approached by each Φ_j. To this end, we design for each node j the hyperslab

    \tilde{S}_{j,k} := \{\Phi \in \mathcal{M}^{\times} : |\langle \Phi, \kappa(\cdot, x_{j,k})\rangle_{\mathcal{H}^{\times}} - d_{j,k}| \le \varepsilon_j\},

which contains all Φ of bounded local instantaneous error at time instant k. Here, κ: X × X → R^Q, (x, x′) ↦ (κ_q(x, x′))_{q∈Q} with x, x′ ∈ X, and κ(·, x) := (κ_q(·, x))_{q∈Q} ∈ H^×. The estimated output for x_{j,k} is then given by

    \varphi(x_{j,k}) = \langle \Phi, \kappa(\cdot, x_{j,k})\rangle_{\mathcal{H}^{\times}} = \sum_{q \in \mathcal{Q}} \langle \varphi^{(q)}, \kappa_q(\cdot, x_{j,k})\rangle_{\mathcal{H}_q},

using the reproducing property of each kernel. It is reasonable to assume that, with high probability, each set S̃_{j,k} contains the corresponding MMSE estimate and thus also the optimal solution of (33), cf. [48]. Hence, considering the stochastic property sets S̃_{j,k} for all J nodes and arbitrarily many time instants k ≥ 0, their intersection ∩_{k≥0} ∩_{j∈J} S̃_{j,k} will contain the MMSE estimate as well. The objective must therefore be to approach this intersection by the estimate Φ_j of each node j.

The APSM framework can be used to approach such an intersection. To this end, we define the local cost function Θ̃_{j,k} per node j as the metric distance between the estimate Φ_j and its projection P^{H^×}_{S̃_{j,k}}(Φ_j) onto the hyperslab S̃_{j,k} in H^×:

    \tilde{\Theta}_{j,k}(\Phi_j) = \|\Phi_j - P^{\mathcal{H}^{\times}}_{\tilde{S}_{j,k}}(\Phi_j)\|_{\mathcal{H}^{\times}}.    (34)

Then we can formulate the following global optimization problem w.r.t. the functions {Φ_j}_{j∈J} over the whole network:

    \min_{\{\Phi_j \mid j \in \mathcal{J}\}} \sum_{j \in \mathcal{J}} \tilde{\Theta}_{j,k}(\Phi_j)    (35a)
    \text{s.t. } \Phi_j = \Phi_i, \quad \forall i \in \mathcal{N}_j,    (35b)

where (35b) enforces a consensus among all estimates {Φ_j}_{j∈J} in the network. Regarding the relation between product and parameter space, we note that the finite-dimensional dictionary subspace (M^×, ⟨·,·⟩_{H^×}) is isomorphic to the Euclidean parameter space (R^{rQ}, ⟨·,·⟩_K); see [47, Lemma 1]. In the parameter space, the inner product of H^× is preserved by ⟨·,·⟩_K, and Φ_j is equivalent to the respective coefficient vector w_j ∈ R^{rQ}. Then, under the correspondence M^× ∋ Φ_j ↔ w_j ∈ R^{rQ}, problem (35) is the equivalent formulation in the product space H^× of problem (10), such that Θ̃_{j,k}(Φ_j) = Θ_{j,k}(w_j) holds.

B. Derivation of the Local APSM Update

For the derivation of the local update in D-CHYPASS based on (35), we first consider the local APSM update for Φ_{j,k} under the assumption that Φ_{j,k} ∉ S̃_{j,k}:

    \Phi'_{j,k+1} = \Phi_{j,k} - \mu_{j,k} \frac{\tilde{\Theta}_{j,k}(\Phi_{j,k}) - \tilde{\Theta}^{\star}_{j,k}}{\|\tilde{\Theta}'_{j,k}(\Phi_{j,k})\|^{2}_{\mathcal{H}^{\times}}}\, \tilde{\Theta}'_{j,k}(\Phi_{j,k}),    (36)

where Θ̃*_{j,k} denotes the infimum of Θ̃_{j,k}, which equals zero here because (34) is the distance to the nonempty set S̃_{j,k}. A subgradient of Θ̃_{j,k} at Φ_{j,k} is given by

    \tilde{\Theta}'_{j,k}(\Phi_{j,k}) := \frac{\Phi_{j,k} - P^{\mathcal{H}^{\times}}_{\tilde{S}_{j,k}}(\Phi_{j,k})}{\|\Phi_{j,k} - P^{\mathcal{H}^{\times}}_{\tilde{S}_{j,k}}(\Phi_{j,k})\|_{\mathcal{H}^{\times}}}.    (37)

Inserting (34) and (37) into the APSM update (36), we obtain

    \Phi'_{j,k+1} = \Phi_{j,k} - \mu_{j,k}\left(\Phi_{j,k} - P^{\mathcal{H}^{\times}}_{\tilde{S}_{j,k}}(\Phi_{j,k})\right).    (38)

The projection P^{H^×}_{S̃_{j,k}}(Φ_{j,k}) can be calculated by [32]

    P^{\mathcal{H}^{\times}}_{\tilde{S}_{j,k}}(\Phi) =
    \begin{cases}
    \Phi, & \text{if } \Phi \in \tilde{S}_{j,k},\\
    \Phi - \dfrac{\Phi(x_{j,k}) - d_{j,k} - \varepsilon_j}{\sum_{q \in \mathcal{Q}} \|P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k}))\|^{2}_{\mathcal{H}_q}} \displaystyle\sum_{q \in \mathcal{Q}} P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k})), & \text{if } \Phi(x_{j,k}) > d_{j,k} + \varepsilon_j,\\
    \Phi - \dfrac{\Phi(x_{j,k}) - d_{j,k} + \varepsilon_j}{\sum_{q \in \mathcal{Q}} \|P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k}))\|^{2}_{\mathcal{H}_q}} \displaystyle\sum_{q \in \mathcal{Q}} P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k})), & \text{if } \Phi(x_{j,k}) < d_{j,k} - \varepsilon_j,
    \end{cases}    (39)

where P_{M_q} is the projection operator onto the dictionary subspace M_q. By virtue of Lemma 1 in [32], it holds that P_{M_q}(κ_q(·, x_{j,k})) = Σ_{ℓ=1}^{r} α^{(q)}_{j,ℓ} κ_q(x̄_ℓ, ·) with the coefficients α^{(q)}_j := [α^{(q)}_{j,1}, …, α^{(q)}_{j,r}]^T = K_q^{-1} κ_q(x_{j,k}). Examining the squared norm in the denominator of the projection P^{H^×}_{S̃_{j,k}}(Φ) yields

    \|P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k}))\|^{2}_{\mathcal{H}_q} = \langle P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k})), \kappa_q(\cdot, x_{j,k})\rangle_{\mathcal{H}_q} = \sum_{\ell=1}^{r} \alpha^{(q)}_{j,\ell}\, \kappa_q(\bar{x}_\ell, x_{j,k}).

Thus, the denominators in (39) can be computed as

    \sum_{q \in \mathcal{Q}} \|P_{\mathcal{M}_q}(\kappa_q(\cdot, x_{j,k}))\|^{2}_{\mathcal{H}_q} = \sum_{q \in \mathcal{Q}} \sum_{\ell=1}^{r} \alpha^{(q)}_{j,\ell}\, \kappa_q(\bar{x}_\ell, x_{j,k}) = \|K^{-1}\kappa(x_{j,k})\|^{2}_{K}.

By parameterizing each Φ_j in terms of w_j and assuming a fixed dictionary D_q for each kernel κ_q, we arrive at the update equation presented in (19a) with the projection (16).
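In the parameter space, the projection (39) acts on the coefficient vector w through the direction K^{-1}κ(x_{j,k}), which stacks the coefficients α^{(q)}_j of the subspace projections P_{M_q}, and through the denominator κ(x_{j,k})^T K^{-1} κ(x_{j,k}) = ‖K^{-1}κ(x_{j,k})‖²_K derived above. The sketch below is our minimal parameter-space rendering of this K-metric hyperslab projection; the function signature and names (kx for the stacked kernel vector, K_inv for the precomputed inverse Gram matrix) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hyperslab_projection_K(w, kx, K_inv, d, eps):
    """Project w onto {v : |kx^T v - d| <= eps} in the K-metric.

    w     : stacked coefficient vector (length Q*r)
    kx    : stacked kernel evaluations kappa(x_{j,k}) at the current input
    K_inv : inverse of the block-diagonal Gram matrix K (computed offline)
    d     : observed output d_{j,k}
    eps   : hyperslab threshold epsilon_j
    """
    err = kx @ w - d                     # local instantaneous error
    if abs(err) <= eps:                  # w already lies in the hyperslab
        return w
    residual = err - np.sign(err) * eps  # distance to the violated bound
    direction = K_inv @ kx               # K^{-1} kappa(x): stacked P_{M_q} terms
    return w - (residual / (kx @ direction)) * direction
```

Consistent with (38), the relaxed local update then moves w_{j,k} toward this projection with step size μ_{j,k} before the diffusion step.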
APPENDIX B
PROOF OF LEMMA 2

Let us consider the squared K-norm of P̂:

    \|\hat{P}\|^{2}_{K} := \max_{x \neq 0} \frac{\|\hat{P}x\|^{2}_{K}}{\|x\|^{2}_{K}} = \max_{x \neq 0} \frac{x^{T}\hat{P}^{T}K\hat{P}x}{x^{T}Kx}.

We assume K to be non-singular. Hence, K^{-1/2} exists, and we can insert P̂ = K^{-1/2} P K^{1/2}:

    \|\hat{P}\|^{2}_{K} = \max_{x \neq 0} \frac{x^{T}K^{1/2}P^{T}K^{-1/2}KK^{-1/2}PK^{1/2}x}{x^{T}Kx} = \max_{y \neq 0} \frac{y^{T}P^{T}Py}{y^{T}y}, \quad \text{with } y = K^{1/2}x.

By the definition of the consensus matrix P, it follows that ‖P̂‖²_K = 1.

We now show that the modified consensus matrix P̂ is identical to P. Assume that P is compatible with the graph G of any connected, undirected network via the matrix G ∈ R^{J×J}. Then it holds that P = G ⊗ I_{rQ}. Examining the definition of P̂, we find

    \hat{P} = K^{-1/2}PK^{1/2} = K^{-1/2}(G \otimes I_{rQ})K^{1/2}
    = \begin{bmatrix} K^{-1/2} & & \\ & \ddots & \\ & & K^{-1/2} \end{bmatrix}
      \begin{bmatrix} g_{11}I_{rQ} & \cdots & g_{1J}I_{rQ} \\ \vdots & \ddots & \vdots \\ g_{J1}I_{rQ} & \cdots & g_{JJ}I_{rQ} \end{bmatrix}
      \begin{bmatrix} K^{1/2} & & \\ & \ddots & \\ & & K^{1/2} \end{bmatrix}
    = \begin{bmatrix} g_{11}K^{-1/2}I_{rQ}K^{1/2} & \cdots & g_{1J}K^{-1/2}I_{rQ}K^{1/2} \\ \vdots & \ddots & \vdots \\ g_{J1}K^{-1/2}I_{rQ}K^{1/2} & \cdots & g_{JJ}K^{-1/2}I_{rQ}K^{1/2} \end{bmatrix}
    = G \otimes I_{rQ} = P.

Thus, the matrices P̂ and P are identical.
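The key step above is that every block g_{ij} K^{-1/2} I_{rQ} K^{1/2} collapses to g_{ij} I_{rQ}, because all nodes share the same rQ × rQ Gram-matrix block. The following Python sketch verifies P̂ = K^{-1/2} P K^{1/2} = P numerically; the four-node graph, the Laplacian-based consensus weights and the dimensions are illustrative assumptions for the check only.

```python
import numpy as np
from scipy.linalg import sqrtm, block_diag

rng = np.random.default_rng(1)
J, rQ = 4, 6

# Laplacian-based consensus weights G = I - alpha * Lap on a connected graph.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
Lap = np.diag(A.sum(1)) - A
G = np.eye(J) - Lap / 4.0          # alpha = 1/4 <= 1/deg_max, doubly stochastic

P = np.kron(G, np.eye(rQ))         # P = G (kron) I_{rQ}

# Network metric: the same positive-definite Gram block at every node.
B = rng.standard_normal((rQ, rQ))
K_blk = B @ B.T + rQ * np.eye(rQ)  # well-conditioned single-node block
K = block_diag(*[K_blk] * J)

K_half = np.real(sqrtm(K))         # principal square root (block diagonal)
P_hat = np.linalg.inv(K_half) @ P @ K_half

print(np.allclose(P_hat, P))       # True: the modified matrix equals P
```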
APPENDIX C
PROOF OF THEOREM 1

A. Proof of Theorem 1.4

We mimic the proof of [49, Theorem 2.3] to show the convergence of (z_k)_{k∈N}. From Theorem 1.1 we know that the sequence (‖z_k − z*‖²_K)_{k∈N} converges for every z* = [(w*)^T, …, (w*)^T]^T with w* ∈ Υ*. Thus, the sequence (z_k)_{k∈N} is bounded, and every subsequence of (z_k)_{k∈N} has an accumulation point. According to the Bolzano–Weierstrass theorem, the bounded real sequence (z_k)_{k∈N} has a convergent subsequence (z_{k_l})_{k_l∈N}; let ẑ denote its limit. With lim_{k→∞} (I_{rQJ} − BB^T) z_k = 0 it follows that

    \lim_{k \to \infty} (I_{rQJ} - BB^{T})z_{k_l} = (I_{rQJ} - BB^{T})\hat{z} = 0.

Hence, ẑ lies in the consensus subspace C. To show that this accumulation point is unique, suppose the contrary, i.e., that ẑ = [ŵ^T, …, ŵ^T]^T ∈ C and z̃ = [w̃^T, …, w̃^T]^T ∈ C are two different accumulation points. For every z* the sequence (‖z_k − z*‖²_K)_{k∈N} converges, and hence it follows that

    0 = \|\hat{z} - z^{\star}\|^{2}_{K} - \|\tilde{z} - z^{\star}\|^{2}_{K} = \|\hat{z}\|^{2}_{K} - \|\tilde{z}\|^{2}_{K} - 2(\hat{z} - \tilde{z})^{T}Kz^{\star} = \|\hat{z}\|^{2}_{K} - \|\tilde{z}\|^{2}_{K} - 2J(\hat{w} - \tilde{w})^{T}Kw^{\star}.

It thus holds that

    w^{\star} \in H := \left\{ w \,\middle|\, 2J(\hat{w} - \tilde{w})^{T}Kw = \|\hat{z}\|^{2}_{K} - \|\tilde{z}\|^{2}_{K} \right\},

where ŵ − w̃ ≠ 0 (⇔ ẑ ≠ z̃). Since we assume that w* ∈ Υ*, this implies that Υ* is a subset of the hyperplane H. This contradicts the assumption of a nonempty interior of Υ*. Hence, the bounded sequence (z_k)_{k∈N} has a unique accumulation point, and so it converges.

B. Proof of Theorem 1.5

In this proof we mimic the proofs of [27, Theorem 2(d)], [57, Theorem 3.1.4] and [8, Theorem 2(e)] to characterize the limit point ŵ of the sequences (w_{j,k})_{k∈N}, ∀j ∈ J. Furthermore, we need Claim 2 of [27], which is proven for any real Hilbert space and thus also holds for the Euclidean space with the K-metric considered here.

Fact 1 (Claim 2 in [27]). Let C ⊂ R^{rQ} be a nonempty closed convex set. Suppose that ρ > 0 and ũ satisfies {v ∈ R^{rQ} | ‖v − ũ‖_K ≤ ρ} ⊂ C. Assume w ∈ R^{rQ} \ C and t ∈ (0, 1) such that u_t := tw + (1 − t)ũ ∉ C. Then

    d_{K}(w, C) > \rho\,\frac{1 - t}{t} = \rho\,\frac{\|u_t - w\|_{K}}{\|u_t - \tilde{u}\|_{K}} > 0,

with d_K(w, C) := ‖w − P^K_C(w)‖_K.

Assume the contrary of our statement, i.e., ŵ ∉ lim inf_{k→∞} Υ_k. Denote by ũ an interior point of Υ*. Then there exists ρ > 0 such that {v ∈ R^{rQ} | ‖v − ũ‖_K ≤ ρ} ⊂ Υ*. Furthermore, there exists t ∈ (0, 1) such that u_t := tŵ + (1 − t)ũ ∉ Υ ⊃ lim inf_{k→∞} Υ_k. Since lim_{k→∞} w_{j,k} = ŵ (∀j ∈ J), there exists N_1 ∈ N such that

    \|w_{j,k} - \hat{w}\|_{K} \le \frac{\rho}{2}\,\frac{1 - t}{t}, \quad \forall k \ge N_1,\ \forall j \in \mathcal{J}.

Then, by u_t ∉ lim inf_{k→∞} Υ_k, for any L_1 > N_1 there exists k_1 ≥ L_1 satisfying u_t ∉ Υ_{k_1} = ∩_{j∈J} (lev_{≤0} Θ_{j,k_1}), where lev_{≤0} Θ_{j,k_1} := {w ∈ R^{rQ} | Θ_{j,k_1}(w) ≤ 0}. It follows that there exists a node i ∈ J such that u_t ∉ lev_{≤0} Θ_{i,k_1}. By Υ ⊂ Υ_{k_1} ⊂ lev_{≤0} Θ_{i,k_1} and Fact 1 applied at node i, it holds that

    d_{K}(w_{i,k_1}, \mathrm{lev}_{\le 0}\Theta_{i,k_1}) \ge d_{K}(\hat{w}, \mathrm{lev}_{\le 0}\Theta_{i,k_1}) - \|w_{i,k_1} - \hat{w}\|_{K} \ge \rho\,\frac{1 - t}{t} - \frac{\rho}{2}\,\frac{1 - t}{t} = \frac{\rho}{2}\,\frac{1 - t}{t} =: \epsilon > 0.

Thus, it follows that Σ_{j∈J} d_K(w_{j,k_1}, lev_{≤0} Θ_{j,k_1}) ≥ ϵ. By the triangle inequality we have

    \|\tilde{u} - w_{j,k_1}\|_{K} \le \|\tilde{u} - \hat{w}\|_{K} + \|w_{j,k_1} - \hat{w}\|_{K} \le \|\tilde{u} - \hat{w}\|_{K} + \frac{\rho}{2}\,\frac{1 - t}{t} \quad (j \in \mathcal{J}),

so that

    \sum_{j \in \mathcal{J}} \|\tilde{u} - w_{j,k_1}\|_{K} \le J\|\tilde{u} - \hat{w}\|_{K} + J\,\frac{\rho}{2}\,\frac{1 - t}{t} =: \eta > 0.

Given a fixed L_2 > k_1, we can find a k_2 ≥ L_2 such that Σ_{j∈J} d_K(w_{j,k_2}, lev_{≤0} Θ_{j,k_2}) ≥ ϵ and Σ_{j∈J} ‖ũ − w_{j,k_2}‖_K ≤ η. Thus, we can construct a subsequence {k_l}_{l=1}^{∞} satisfying

    \sum_{j \in \mathcal{J}} d_{K}(w_{j,k_l}, \mathrm{lev}_{\le 0}\Theta_{j,k_l}) \ge \epsilon \quad \text{and} \quad \sum_{j \in \mathcal{J}} \|\tilde{u} - w_{j,k_l}\|_{K} \le \eta.

With the assumptions of the theorem, there then exists ζ > 0 such that Σ_{j∈J} Θ_{j,k_l}(w_{j,k_l}) ≥ ζ for every l ≥ 1. However, this contradicts lim_{k→∞} Θ_{j,k}(w_{j,k}) = 0, ∀j ∈ J, from Theorem 1.2. Thus, it follows that ŵ ∈ lim inf_{k→∞} Υ_k, and the proof is complete.

REFERENCES

[1] M. U. Ilyas, M. Zubair Shafiq, A. X. Liu, and H. Radha, "A distributed algorithm for identifying information hubs in social networks," IEEE J. Sel. Areas Commun., vol. 31, no. 9, pp. 629–640, 2013.
[2] F. Ingelrest, G. Barrenetxea, G. Schaefer, M. Vetterli, O. Couach, and M. Parlange, "SensorScope," ACM Transactions on Sensor Networks, vol. 6, no. 2, pp. 1–32, February 2010.
[3] F. Facchinei, S. Sagratella, and G. Scutari, "Parallel algorithms for big data optimization," IEEE Trans. Signal Process., vol. 63, no. 7, 2015.
[4] A. H. Sayed, "Adaptive networks," Proc. IEEE, vol. 102, no. 4, pp. 460–497, April 2014.
[5] H. Paul, J. Fliege, and A. Dekorsy, "In-network-processing: Distributed consensus-based linear estimation," IEEE Commun. Lett., vol. 17, no. 1, pp. 59–62, 2013.
[6] G. Mateos and G. B. Giannakis, "Distributed recursive least-squares: Stability and performance analysis," IEEE Trans. Signal Process., vol. 60, no. 7, pp. 3740–3754, July 2012.
[7] S. Tu and A. H. Sayed, "Mobile adaptive networks," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 649–664, August 2011.
[8] R. L. G. Cavalcante, A. Rogers, N. R. Jennings, and I. Yamada, "Distributed asymptotic minimization of sequences of convex functions by a broadcast adaptive subgradient method," IEEE J. Sel. Topics Signal Process., vol. 5, pp. 1–37, 2011.
[9] F. Cattivelli and A. H. Sayed, "Diffusion LMS strategies for distributed estimation," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048, 2010.
[10] R. L. G. Cavalcante, I. Yamada, and B. Mulgrew, "An adaptive projected subgradient approach to learning in diffusion networks," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2762–2774, 2009.
[11] H. Zhu, A. Cano, and G. Giannakis, "Distributed consensus-based demodulation: algorithms and error analysis," IEEE Trans. Wireless Commun., vol. 9, no. 6, pp. 2044–2054, 2010.
[12] I. D. Schizas, G. Mateos, and G. B. Giannakis, "Distributed LMS for consensus-based in-network adaptive processing," IEEE Trans. Signal Process., vol. 57, no. 6, pp. 2365–2382, 2009.
[13] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[14] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, June 2008.
[15] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, 2004.
[16] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, August 2004.
[17] S. Van Vaerenbergh, J. Via, and I. Santamaria, "A sliding-window kernel RLS algorithm and its application to nonlinear channel identification," in IEEE ICASSP, vol. 5, 2006.
[18] W. Liu, P. Pokharel, and J. Principe, "The kernel least-mean-square algorithm," IEEE Trans. Signal Process., vol. 56, no. 2, pp. 543–554, 2008.
[19] W. Liu, I. M. Park, Y. Wang, and J. C. Príncipe, "Extended kernel recursive least squares algorithm," IEEE Trans. Signal Process., vol. 57, no. 10, August 2009.
[20] C. Richard, J. C. M. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Trans. Signal Process., vol. 57, no. 3, 2009.
[21] W. Liu, J. C. Príncipe, and S. Haykin, Kernel Adaptive Filtering. John Wiley & Sons, 2010.
[22] S. Van Vaerenbergh, M. Lazaro-Gredilla, and I. Santamaria, "Kernel recursive least-squares tracker for time-varying regression," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1313–1326, August 2012.
[23] M. Yukawa and R.-I. Ishii, "An efficient kernel adaptive filtering algorithm using hyperplane projection along affine subspace," in EUSIPCO, 2012, pp. 2183–2187.
[24] W. Gao, J. Chen, C. Richard, J. Huang, and R. Flamary, "Kernel LMS algorithm with forward-backward splitting for dictionary learning," in IEEE ICASSP, 2013, pp. 5735–5739.
[25] W. Gao and C. Richard, "Convex combinations of kernel adaptive filters," in IEEE MLSP, 2014.
[26] M. Takizawa and M. Yukawa, "Adaptive nonlinear estimation based on parallel projection along affine subspaces in reproducing kernel Hilbert space," IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4257–4269, 2015.
[27] I. Yamada and N. Ogura, "Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions," Numer. Funct. Anal. Optim., vol. 25, no. 7–8, pp. 593–617, 2004.
[28] P. L. Combettes, "The foundations of set theoretic estimation," Proceedings of the IEEE, vol. 81, no. 2, pp. 182–208, 1993.
[29] S. Theodoridis, K. Slavakis, and I. Yamada, "Adaptive learning in a world of projections," IEEE Signal Process. Mag., vol. 28, no. 1, pp. 97–123, 2011.
[30] M. Yukawa and K. R. Müller, "Why does a Hilbertian metric work efficiently in online learning with kernels?" IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1424–1428, 2016.
[31] M. Yukawa, "Multikernel adaptive filtering," IEEE Trans. Signal Process., vol. 60, pp. 4672–4682, 2012.
[32] ——, "Adaptive learning in Cartesian product of reproducing kernel Hilbert spaces," IEEE Trans. Signal Process., vol. 63, 2015.
[33] B.-S. Shin, H. Paul, and A. Dekorsy, "Distributed kernel least squares for nonlinear regression applied to sensor networks," in EUSIPCO, 2016.
[34] B.-S. Shin, H. Paul, M. Yukawa, and A. Dekorsy, "Distributed nonlinear regression with multiple Gaussian kernels," in IEEE SPAWC, 2017.
[35] W. Gao, J. Chen, C. Richard, and J. Huang, "Diffusion adaptation over networks with kernel least-mean-square," in IEEE CAMSAP, 2015.
[36] P. Bouboulis, S. Chouvardas, and S. Theodoridis, "Online distributed learning over networks in RKH spaces using random Fourier features," IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1920–1932, 2018.
[37] S. Chouvardas and M. Draief, "A diffusion kernel LMS algorithm for nonlinear adaptive networks," in IEEE ICASSP, March 2016.
[38] P. Honeine, C. Richard, J. C. M. Bermudez, H. Snoussi, M. Essoloh, and F. Vincent, "Functional estimation in Hilbert space for distributed learning in wireless sensor networks," in IEEE ICASSP, 2009.
[39] P. Honeine, C. Richard, J. C. M. Bermudez, J. Chen, and H. Snoussi, "A decentralized approach for nonlinear prediction of time series data in sensor networks," EURASIP J. Wirel. Commun. Netw., vol. 2010, no. 1, 2010.
[40] J. Predd, S. Kulkarni, and H. Poor, "Distributed learning in wireless sensor networks," IEEE Signal Process. Mag., pp. 56–69, 2006.
[41] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," Journal of Machine Learning Research, vol. 11, 2010.
[42] B.-S. Shin, M. Yukawa, R. L. G. Cavalcante, and A. Dekorsy, "A hybrid dictionary approach for distributed kernel adaptive filtering in diffusion networks," in IEEE ICASSP, 2018.
[43] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
[44] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2013.
[45] H. Stark and Y. Yang, Vector Space Projections: A Numerical Approach to Signal and Image Processing, Neural Nets, and Optics. Wiley, 1998.
[46] D. G. Luenberger, Optimization by Vector Space Methods. John Wiley & Sons, 1969.
[47] M. Takizawa and M. Yukawa, "Efficient dictionary-refining kernel adaptive filter with fundamental insights," IEEE Trans. Signal Process., vol. 64, no. 16, pp. 4337–4350, August 2016.
[48] I. Yamada, K. Slavakis, and K. Yamada, "An efficient robust adaptive filtering algorithm based on parallel subgradient projection techniques," IEEE Trans. Signal Process., vol. 50, no. 5, pp. 1091–1101, 2002.
[49] R. L. G. Cavalcante and S. Stanczak, "A distributed subgradient method for dynamic convex optimization problems under noisy information exchange," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 243–256, 2013.
[50] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
[51] S. Boyd and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in IPSN, 2005, pp. 63–70.
[52] L. Xiao, S. Boyd, and S.-J. Kim, "Distributed average consensus with least-mean-square deviation," Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007.
[53] M. Yukawa, K. Slavakis, and I. Yamada, "Adaptive parallel quadratic-metric projection algorithms," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 5, pp. 1665–1680, 2007.
[54] C. Amante and B. Eakins, "ETOPO1 1 arc-minute global relief model: Procedures, data sources and analysis," NOAA Technical Memorandum NESDIS NGDC-24, National Geophysical Data Center, NOAA, 2009.
[55] O. Toda and M. Yukawa, "Online model-selection and learning for nonlinear estimation based on multikernel adaptive filtering," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E100-A, no. 1, pp. 236–250, 2017.
[56] M. Ohnishi and M. Yukawa, "Online nonlinear estimation via iterative L2-space projections: Reproducing kernel of subspace," IEEE Trans. Signal Process., vol. 66, no. 15, pp. 4050–4064, 2018.
[57] K. Slavakis, I. Yamada, and N. Ogura, "The adaptive projected subgradient method over the fixed point set of strongly attracting nonexpansive mappings," Numerical Functional Analysis and Optimization, vol. 27, no. 7–8, pp. 905–930, 2006.

Ban-Sok Shin (S'13) received the Dipl.-Ing. (M.Sc.) degree from the University of Bremen in 2013. Since then, he has been a research assistant at the Department of Communications Engineering of the University of Bremen, where he is currently working towards his Ph.D. degree. His research interests include distributed inference/estimation, adaptive signal processing, machine learning and their application to sensor networks.

Masahiro Yukawa (S'05–M'06) received the B.E., M.E., and Ph.D. degrees from the Tokyo Institute of Technology in 2002, 2004, and 2006, respectively. He studied as Visiting Researcher/Professor with the University of York, U.K. (October 2006–March 2007), with the Technical University of Munich, Germany (July 2008–November 2008), and with the Technical University of Berlin, Germany (April 2016–February 2017). He worked with RIKEN, Japan, as Special Postdoctoral Researcher (2007–2010), and with Niigata University, Japan, as Associate Professor (2010–2013). He is currently Associate Professor with the Department of Electronics and Electrical Engineering, Keio University, Japan. Since July 2017, he has been Visiting Scientist at the AIP Center, RIKEN, Japan.
He has been Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (since 2015), Multidimensional Systems and Signal Processing (2012–2016), and the IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (2009–2013). His research interests include mathematical adaptive signal processing, convex/sparse optimization, and machine learning. Dr. Yukawa was a recipient of the Research Fellowship of the Japan Society for the Promotion of Science (JSPS) from April 2005 to March 2007. He received the Excellent Paper Award and the Young Researcher Award from the IEICE in 2006 and 2010, respectively, the Yasujiro Niwa Outstanding Paper Award in 2007, the Ericsson Young Scientist Award in 2009, the TELECOM System Technology Award in 2014, the Young Scientists' Prize, the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology in 2014, the KDDI Foundation Research Award in 2015, and the FFIT Academic Award in 2016. He is a member of the IEICE.

Renato Luís Garrido Cavalcante (M'09) received the electronics engineering degree from the Instituto Tecnológico de Aeronáutica (ITA), Brazil, in 2002, and the M.E. and Ph.D. degrees in Communications and Integrated Systems from the Tokyo Institute of Technology, Japan, in 2006 and 2008, respectively. From April 2003 to April 2008, he was a recipient of the Japanese Government (MEXT) Scholarship. He is currently a Research Fellow with the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Berlin, Germany, and a lecturer at the Technical University of Berlin. Previously, he held appointments as a Research Fellow with the University of Southampton, Southampton, U.K., and as a Research Associate with the University of Edinburgh, Edinburgh, U.K. Dr. Cavalcante received the Excellent Paper Award from the IEICE in 2006 and the IEEE Signal Processing Society (Japan Chapter) Student Paper Award in 2008. He also co-authored a study that received a best student paper award at the 13th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) in 2012. His current interests are in signal processing for distributed systems, multiagent systems, convex analysis, machine learning, and wireless communications.

Armin Dekorsy received the B.Sc. degree from Fachhochschule Konstanz, Konstanz, Germany, the M.Sc. degree from the University of Paderborn, Paderborn, Germany, and the Ph.D. degree from the University of Bremen, Bremen, Germany, all in communications engineering. From 2000 to 2007, he was a Research Engineer at Deutsche Telekom AG and a Distinguished Member of Technical Staff at Bell Labs Europe, Lucent Technologies. In 2007, he joined Qualcomm GmbH as a European Research Coordinator, conducting Qualcomm's internal and external European research activities. He is currently the Head of the Department of Communications Engineering, University of Bremen. He has authored or coauthored over 150 journal and conference publications and is the holder of over 17 patents in the area of wireless communications. He has long-term expertise in the research of wireless communication systems, baseband algorithms, and signal processing. Prof. Dekorsy is a senior member of the IEEE Communications and Signal Processing Societies and of the VDE/ITG expert committee "Information and System Theory."