Multitask learning over graphs: An Approach for Distributed, Streaming Machine Learning
Authors: Roula Nassif, Stefan Vlaski, Cédric Richard, Jie Chen, and Ali H. Sayed
Abstract

The problem of learning simultaneously several related tasks has received considerable attention in several domains, especially in machine learning with the so-called multitask learning problem or learning-to-learn problem [1], [2]. Multitask learning is an approach to inductive transfer learning (using what is learned for one problem to assist in another problem) and helps improve generalization performance relative to learning each task separately by using the domain information contained in the training signals of related tasks as an inductive bias. Several strategies have been derived within this community under the assumption that all data are available beforehand at a fusion center. However, recent years have witnessed an increasing ability to collect data in a distributed and streaming manner. This requires the design of new strategies for learning jointly multiple tasks from streaming data over distributed (or networked) systems. This article provides an overview of multitask strategies for learning and adaptation over networks. The working hypothesis for these strategies is that agents are allowed to cooperate with each other in order to learn distinct, though related, tasks. The article shows how cooperation steers the network limiting point and how different cooperation rules allow different task-relatedness models to be promoted. It also explains how and when cooperation over multitask networks outperforms non-cooperative strategies.

I. MULTITASK NETWORK MODELS

Consider a networked system consisting of a collection of N autonomous agents (sensors, classifiers, etc.) distributed over some geographic area and connected through a topology.
The neighborhood of agent k is denoted by N_k; it consists of all agents that are connected to k by an edge–see Fig. 1 (left). A real-valued, strongly convex, and differentiable cost J_k(w_k) is associated with each agent k.

This work was submitted while R. Nassif was a post-doc at EPFL. She is now with the American University of Beirut, Lebanon (e-mail: roula.nassif@aub.edu.lb). S. Vlaski and A. H. Sayed are with the Institute of Electrical Engineering, EPFL, Switzerland (e-mail: {stefan.vlaski, ali.sayed}@epfl.ch). C. Richard is with Université de Nice Sophia-Antipolis, France (cedric.richard@unice.fr). J. Chen is with Northwestern Polytechnical University, China (dr.jie.chen@ieee.org).

Fig. 1. Network models. (Left) Single-task network. (Middle) Clustered multitask network. (Right) Multitask network.

The objective (or the task) at agent k is to estimate the parameter vector w_k^o, of size M_k × 1, that minimizes J_k(w_k), namely,

    w_k^o ≜ arg min_{w_k} J_k(w_k)    (1)

Depending on how the minimizers across the agents relate to each other, we distinguish between three categories of networks:

1) Single-task network: All costs J_k(w_k) are minimized at the same location w^o, namely, w_k^o = w^o for all k–see Fig. 1 (left).
2) Clustered multitask network: The N agents are grouped into Q clusters C_q (q = 1, …, Q) and, within each cluster C_q, all the costs are minimized at the same location w_{C_q}^o, namely, w_k^o = w_{C_q}^o for all k ∈ C_q–see Fig. 1 (middle). Similarities or relationships may exist among the distinct minimizers {w_{C_q}^o}.
3) Multitask network: The individual costs are minimized at distinct, though related, locations {w_k^o}–see Fig. 1 (right).

Each agent k can solve (1) on its own.
However, since the objectives across the network relate to each other, it is expected that by properly promoting these relationships, one may improve the network performance through cooperation among the agents. One important question is how to design cooperative strategies that can lead to better performance than non-cooperative ones, where each agent attempts to determine w_k^o on its own. This overview explains how multitask learning over graphs addresses this question.

Prior to multitask learning over graphs, many works in the machine learning literature considered learning multiple related tasks simultaneously [1]–[7]. Multitask learning was shown, both empirically and theoretically, to improve performance relative to the traditional approach of learning each task separately. Depending on the machine learning application, several task-relatedness models have been considered. For example, in [1], [5], the functions to be learned are assumed to share a common underlying representation. In [6], it is assumed that the tasks are close to each other in some Hilbert space. Probabilistic approaches, where a probability model capturing the relations between tasks is estimated simultaneously with the functions corresponding to each task, have also been considered [3]. Graph-based approaches, where the relations between tasks are captured by an underlying graph, were also considered in the literature [4], [7]. All these works, however, assume that all data are available beforehand at a fusion center and propose batch-mode methods to solve multitask problems. Other existing works, such as [8], consider a distributed data setting.
However, most of these works still require an architecture consisting of workers along with a master, where agents perform local computations and then send intermediate results to the master for further processing. Such solution methods are not fully distributed, which limits their range of practical applications. This paper, in contrast, focuses on fully distributed solutions that avoid the need for central data aggregation or processing and instead rely on local computations and communication exchanges within neighborhoods. Besides providing distributed implementations, the solutions considered in this paper are able to learn continuously from streaming data. We start our exposition by describing a class of non-cooperative solutions that are able to respond in real time to streaming data. Then, we explain how these solutions can be extended to handle multitask learning over graphs.

II. NON-COOPERATIVE LEARNING UNDER STREAMING DATA

Throughout this article, there is an explicit assumption that agents operate in the streaming data setting. That is, it is assumed that each agent k receives at each time instant i one instantaneous realization x_{k,i} of a random data x_k. The goal of agent k is to estimate the vector w_k^o that minimizes its risk function J_k(w_k) ≜ E_{x_k} Q_k(w_k; x_k), defined in terms of some loss function Q_k(·). The expectation is computed over the distribution of the data x_k. Agent k is particularly interested in solving the problem in the stochastic setting, where the distribution of the data is generally unknown. This means that the risks J_k(·) and their gradients ∇_{w_k} J_k(·) are unknown. As such, approximate gradient vectors ∇̂_{w_k} J_k(·) will need to be employed.
Doing so leads to the following stochastic gradient algorithm for solving (1):

    w_{k,i} = w_{k,i−1} − µ ∇̂_{w_k} J_k(w_{k,i−1})    (2)

where w_{k,i} is the estimate of w_k^o at iteration i and µ > 0 is a small step-size parameter. Resorting to the instantaneous realization x_{k,i} of the random data x_k, a common construction in stochastic approximation theory is to employ the following gradient approximation at iteration i:

    ∇̂_{w_k} J_k(w_k) = ∇_{w_k} Q_k(w_k; x_{k,i})    (3)

We therefore focus in this paper on stochastic gradient algorithms, which are powerful iterative procedures for solving (1) in the streaming data setting. They enable continuous learning and adaptation in response to drifts in the location of the minimizers due to changes in the costs. We illustrate construction (2)–(3) by considering scenarios from machine learning and adaptive filter theory.

Example 1. (Logistic regression network). Let γ_k(i) = ±1 be a streaming sequence of binary (class) random variables and let h_{k,i} be the corresponding streaming sequence of M_k × 1 real random (feature) vectors with R_{h,k} = E h_{k,i} h_{k,i}^⊤ > 0. The processes {γ_k(i), h_{k,i}} are assumed to be wide-sense stationary. In these problems, agent k seeks to estimate the vector w_k^o that minimizes the regularized logistic risk function [9]:

    J_k(w_k) = E ln(1 + e^{−γ_k(i) h_{k,i}^⊤ w_k}) + (ρ/2) ‖w_k‖²    (4)

where ρ > 0 is a regularization parameter. Once w_k^o is found, γ̂_k(i) = sign(h_{k,i}^⊤ w_k^o) can then be used as a decision rule to classify new features. Using approximation (3), we obtain the following stochastic-gradient algorithm for minimizing (4):

    w_{k,i} = (1 − µρ) w_{k,i−1} + µ γ_k(i) h_{k,i} · 1 / (1 + e^{γ_k(i) h_{k,i}^⊤ w_{k,i−1}})    (5)

Example 2. (Mean-square-error (MSE) network).
In such networks, each agent is subjected to streaming data {d_k(i), u_{k,i}} that are assumed to satisfy a linear regression model:

    d_k(i) = u_{k,i}^⊤ w_k^o + v_k(i)    (6)

for some unknown M_k × 1 vector w_k^o to be estimated by agent k, with v_k(i) denoting a zero-mean measurement noise. For these networks, the risk function takes the form of an MSE cost [10]:

    J_k(w_k) = (1/2) E (d_k(i) − u_{k,i}^⊤ w_k)²    (7)

which is minimized at w_k^o. The processes {u_{k,i}, v_k(i)} are zero-mean jointly wide-sense stationary with: i) E u_{k,i} u_{ℓ,i}^⊤ = R_{u,k} > 0 if k = ℓ and zero otherwise; ii) E v_k(i) v_ℓ(i) = σ²_{v,k} if k = ℓ and zero otherwise; and iii) u_{k,i} and v_k(j) independent of each other. Using approximation (3), we obtain the following stochastic-gradient algorithm:

    w_{k,i} = w_{k,i−1} + µ u_{k,i} (d_k(i) − u_{k,i}^⊤ w_{k,i−1})    (8)

which is the well-known least-mean-squares (LMS) algorithm [11].

The use of the approximate gradient ∇̂_{w_k} J_k(·) instead of the true gradient ∇_{w_k} J_k(·) in (2) introduces perturbations into the operation of the gradient descent iteration. This perturbation is referred to as gradient noise, defined as s_{k,i}(w_k) ≜ ∇_{w_k} J_k(w_k) − ∇̂_{w_k} J_k(w_k). The presence of this perturbation prevents the stochastic iterate w_{k,i} from converging almost surely to the minimizer w_k^o when constant step-sizes are used. Some deterioration in performance will occur, and the iterate w_{k,i} will instead fluctuate close to w_k^o. It is common in the adaptive filtering and stochastic gradient optimization literatures to assess the size of these fluctuations by measuring their steady-state mean-square value [9]–[11]. We therefore focus in this paper on highlighting the benefit of multitask learning on the network mean-square-deviation (MSD), which is defined as the steady-state average variance value:

    MSD ≜ lim_{i→∞} (1/N) Σ_{k=1}^N E ‖w_k^o − w_{k,i}‖²    (9)

In the sequel, when discussing theoretical performance results, and to avoid excessive technicalities, it is sufficient to focus on the MSE networks described in Example 2 and to assume that R_{u,k} = R_u and M_k = M for all k.¹ In this way, the quality of the measurements, captured by the noise power σ²_{v,k}, is allowed to vary across the network, with some agents collecting noisier data than others. Assuming a uniform regressor covariance allows us to quantify the improvement in performance that results from cooperation without biasing the results by the statistical nature of the regression data at the agents.

Performance result 1. Consider an MSE network running the non-cooperative algorithm (8). Assume further that R_{u,k} = R_u and M_k = M for all k. Under these assumptions, and for sufficiently small step-sizes, the individual steady-state variance MSD_k ≜ lim_{i→∞} E ‖w_k^o − w_{k,i}‖² and the network MSD defined by (9) are given by [10]:

    MSD_k = (µM/2) · σ²_{v,k},    MSD^{nc} = (µM/2) · (1/N) Σ_{k=1}^N σ²_{v,k}    (10)

where the superscript "nc" indicates that the MSD expression is for the non-cooperative solution.

¹ Performance results under more general conditions, such as allowing for space-dependent covariances and lengths and for general second-order differentiable cost functions that are not necessarily quadratic, can also be found in [9, Chaps. 3–4] for algorithm (2), in [12] for the strategy introduced in Sec. IV-A, and in [32] for the strategies in Sec. V. The MSD performance expressions in these works are derived under Lipschitz gradient vector and Hessian matrix assumptions. It should be noted that the analyses in these works also allow one to recover the excess-risk metric at agent k, defined as ER_k ≜ lim_{i→∞} E (J_k(w_{k,i}) − J_k(w_k^o))–see, e.g., [9, pp. 388–390]. Due to space limitations, we shall only focus on presenting MSD performance expressions.
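The small step-size expressions in (10) can be checked against a direct simulation of the LMS recursion (8). The following is a minimal numpy sketch for a single agent with white Gaussian regressors (R_u = I_M); the function name and parameter values are illustrative, not from the paper:

```python
import numpy as np

def lms_msd(M=5, mu=0.01, sigma_v2=1e-2, n_iter=100000, seed=0):
    """Run the LMS recursion (8) for one agent under model (6) with white
    regressors, and return (empirical steady-state MSD, theory (mu*M/2)*sigma_v2)."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)            # unknown task w_k^o
    w = np.zeros(M)                          # iterate w_{k,i}
    sq_dev = []
    for i in range(n_iter):
        u = rng.standard_normal(M)           # regressor u_{k,i}
        v = np.sqrt(sigma_v2) * rng.standard_normal()
        d = u @ w_o + v                      # linear model (6)
        w = w + mu * u * (d - u @ w)         # LMS update (8)
        if i > n_iter // 2:                  # discard the transient phase
            sq_dev.append(np.sum((w_o - w) ** 2))
    return np.mean(sq_dev), 0.5 * mu * M * sigma_v2
```

For small µ the time-averaged squared deviation should sit close to the theoretical value µMσ²_{v,k}/2; the residual mismatch is the higher-order-in-µ correction neglected by the small step-size analysis.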
First, observe that the performance is on the order of µ. The smaller µ is, the better the performance will be, but the slower the convergence toward w_k^o will be [9], [10] (the same observation is valid for the future expressions (17) and (34), with convergence to W^⋆ in (17) instead). Second, observe that agents with noisier data will perform worse than agents with cleaner data. However, since the agents are observing data arising from similar or related models w_k^o, it is expected that an appropriate cooperation among agents can help enhance the network performance.

III. MULTITASK LEARNING FRAMEWORK

Depending on the application, several task-relatedness models can be considered. For each model, an appropriate convex optimization problem is solved in a distributed and adaptive manner. This results in different multitask strategies, and therefore different cooperation rules between agents. Rather than describing each optimization problem in isolation, we begin by introducing a general problem, which will allow us to recover various multitask strategies as special cases. Let W ≜ col{w_1, …, w_N} denote the collection of parameter vectors from across the network. We consider the following global optimization problem for the multitask formulation:

    W^⋆ = arg min_W J^{glob}(W) = Σ_{k=1}^N J_k(w_k) + (η/2) R(W),  subject to W ∈ Ω    (11)

where R(·) is a convex regularization function promoting the relationships between the tasks, Ω is a closed convex set defining the feasible region of the parameter vectors, and η > 0 is a parameter controlling the importance of the regularization. The choice of the regularizer R(·) and the set Ω depends on the prior information on how the multitask models relate to each other. To illustrate how problem formulation (11) can be used, we consider the following two examples that are multitask oriented.

Example 3. (Weather forecasting [12]). Consider the network in Fig.
2 (left) consisting of N = 139 weather stations located across the United States and collecting daily measurements. Let h_{k,i} denote the feature vector consisting of collected data (temperature, wind speed, dew point, etc.) at station k and day i, and let γ_k(i) denote the corresponding binary variable associated with rain occurrence, i.e., γ_k(i) = 1 if rain occurred and γ_k(i) = −1 otherwise.

Fig. 2. Examples of multitask applications. (Left) Weather forecasting. (Right) Distributed power system monitoring.

The objective is to construct a classifier at each station to predict whether it will rain or not based on the knowledge of h_{k,i}. To this end, each station can use an individual logistic regression machine similar to the one described in Example 1; in this case the cost J_k(w_k) in (11) takes the form (4). However, it is expected that the decision rules {w_k^o} at neighboring stations would be similar since they are collecting features arising from similar statistical distributions. Moreover, the strength of similarity is expected to be inversely proportional to the physical distance between the stations. This gives rise to a weighted graph (with the closest nodes connected by edges), and one may expect to improve the network performance by promoting the smoothness of {w_k^o} with respect to the underlying graph. The simplest possible term that encourages smoothness is the graph Laplacian regularizer S(W) defined further ahead in (13). By choosing R(W) = S(W) and Ω = R^{MN} in (11), one arrives at a multitask formulation for the weather forecasting application that takes the smoothness prior over the graph into account. This formulation and other possible formulations are solved in Sections IV and V.

Example 4. (Power system state monitoring [13]). Consider Fig.
2 (right) illustrating an IEEE 14-bus power monitoring system partitioned into 4 areas, where each area comprises a subset of buses supervised by its own control center. The local state vectors (bus voltages) to be estimated at neighboring areas may partially overlap since the areas are interconnected. This is because each control center collects measurements related to the voltages across its local buses and the voltages across the interconnections with neighboring centers. For example, Area 2 supervises buses 3, 4, 7, and 8. Since it collects current readings on lines (4, 5) and (7, 9), its state vector extends to buses 5 (supervised by Area 1) and 9 (supervised by Area 4). In other words, if we let w_n denote the state of bus n, then the cost J_2(·) at Area 2 will depend on the extended parameter vector w_2 = col{w_3, w_4, w_5, w_7, w_8, w_9}. However, since the parameter vectors at Areas 1 and 4 will be w_1 = col{w_1, w_2, w_5} and w_4 = col{w_9, w_10, w_11, w_14}, respectively, consensus needs to be reached on the variable w_5 between Areas 2 and 1, and on the variable w_9 between Areas 2 and 4, while minimizing the individual cost J_2(w_2) penalizing deviations from data models of the form y_k = H_k w_k + v_k, where H_k is the measurement matrix and v_k is a zero-mean noise. Thus, distributed power state estimation can be formulated as problem (11) with R(W) = 0, whereas the constraint set Ω in this case should be selected to promote consensus over the overlapping variables. In Section V-B, we explain how such problems can be solved.

Fig. 3. A common diagram for the multitask strategies described in this work. The structure involves two main steps: i) a self-learning step (12a), and ii) a social learning step (12b).
Returning to the formulation (11), observe that even though the aggregate cost Σ_{k=1}^N J_k(w_k) is separable in the w_k, cooperation between agents is necessary due to the coupling between the tasks through the regularization and the constraint. Note that, when solving problem (11), agent k will be responsible for estimating w_k^⋆ (the k-th sub-vector of W^⋆ = col{w_1^⋆, …, w_N^⋆}), which is generally different from w_k^o in (1), the actual objective at agent k. However, it is expected that accurate prior information will allow the designer to choose the regularizer R(·), the set Ω, and the strength η in a way that minimizes the distance between w_k^⋆ and w_k^o. Although some existing works use primal-dual methods [14] to solve multitask estimation problems, we limit our exposition to the class of primal techniques (based on propagating and estimating the primal variable) that employ stochastic-gradient iterations. Extensive studies in the literature have shown that small step-sizes enable these strategies to learn well in streaming data settings. Due to the separability of Σ_{k=1}^N J_k(w_k), the multitask algorithms described in the sequel will have a common structure given by:

    ψ_{k,i} = w_{k,i−1} − µ ∇̂_{w_k} J_k(w_{k,i−1})    (12a)
    w_{k,i} = g_k({ψ_{ℓ,i}}_{ℓ∈N_k})    (12b)

The first step (12a) corresponds to the stochastic gradient step on the individual cost J_k(·). We refer to this step as the self-learning step–see Fig. 3. Compared with the non-cooperative strategy (2), observe now that the result of the gradient descent step is ψ_{k,i}, an intermediate estimate of w_k^o at iteration i. This step is followed by a social learning step (12b), which uses some function g_k(·) of the neighborhood iterates. As we shall see in the next sections, the form of this function depends on the regularizer η R(·) and the set Ω in (11), both of which allow the prior information on how the tasks w_k^o are related to be promoted.
The result of this second step is w_{k,i}, the estimate of w_k^o defined by (1) at iteration i. Since we are interested in a distributed setting, agents during social learning are only allowed to collect estimates from their local neighborhood N_k–see Fig. 3. In the sequel, we show how the formulation (11) and the social learning step (12b) specialize for regularized (Sec. IV), subspace-constrained (Sec. V), and clustered (Sec. VI) multitask estimation.

IV. REGULARIZED MULTITASK ESTIMATION

In this section, we focus on the regularization term R(W) in (11) and its implications for the learning dynamics. In multitask learning (MTL), regularization is a widely used technique to promote task relationships. In most network applications, the underlying graph structure contains information about the relatedness among neighboring tasks. As such, when considering graph-based MTL applications, incorporating the graph structure into the regularization term is a reasonable and natural step. The smoothness model (under which the tasks are similar at neighboring vertices, with the strength of similarity specified by the weight between them) will play a central role in our discussion. This smoothness property is often observed in real-world applications (see, e.g., Example 3) and is rich enough to convey the main ideas behind MTL, as we will see in the sequel. We will examine two main questions: 1) How can graph-based priors be incorporated into the regularizer? and 2) How does the resulting MTL algorithm behave?

A. Multitask estimation under smoothness

We assume that a symmetric, weighted adjacency matrix C is associated with the connected graph illustrated in Fig. 1 (right). If there is an edge connecting agents k and ℓ, then [C]_{kℓ} = c_{kℓ} > 0 reflects the strength of the relation between k and ℓ; otherwise, [C]_{kℓ} = 0. These weights are usually dictated by the physics of the problem at hand–see, e.g., [15], [16, Ch.
4] for graph construction methods. We introduce the graph Laplacian L, a differential operator defined as L = diag{C 1_N} − C. Assuming that the tasks have the same length, i.e., M_k = M for all k, the smoothness of W over the graph is measured in terms of a quadratic form of the Laplacian [17]:

    S(W) = W^⊤ 𝓛 W = (1/2) Σ_{k=1}^N Σ_{ℓ∈N_k} c_{kℓ} ‖w_k − w_ℓ‖²    (13)

where 𝓛 = L ⊗ I_M is an extended form of the graph Laplacian (defined in terms of the Kronecker product operator ⊗). The smaller S(W) is, the smoother the signal W is on the graph. Given that the weights are nonnegative, S(W) shows that W is smooth if nodes connected by an edge with a large weight c_{kℓ} have similar values {w_k, w_ℓ}. Therefore, in order to enforce the prior belief that the target signal W^o = col{w_1^o, …, w_N^o} is smooth with respect to the underlying weighted graph, one may choose in (11):

    R(W) = S(W), and Ω = R^{MN}    (14)

Under this choice, the stochastic gradient algorithm for solving (11) takes the following form:

    W_i = ψ_i − µη 𝓛 W_{i−1}    (15)

where W_i is the estimate of W^⋆ at instant i, and ψ_i = col{ψ_{k,i}}_{k=1}^N is the vector collecting the intermediate estimates ψ_{k,i} in (12a) from across all agents. Since we expect ψ_i to be an improved estimate compared to W_{i−1}, we propose to replace W_{i−1} in (15) by ψ_i. By doing so, we obtain algorithm (12) with the social learning step given by:

    w_{k,i} = ψ_{k,i} − µη Σ_{ℓ∈N_k} c_{kℓ} (ψ_{k,i} − ψ_{ℓ,i})    (16)

The substitution of W_{i−1} by ψ_i is reminiscent of incremental-type arguments in gradient descent algorithms [18]. Analyses in the context of adaptation over networks show that substitutions of this type lead to enhanced network stability, since they preserve the stability of the agents after cooperation (see, e.g., [10, p. 160] for details).
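The social learning step (16) can be sketched in a few lines of numpy. The sketch below (function names and the toy ring graph are illustrative, not from the paper) also evaluates the smoothness measure (13) and illustrates that, for small µη, one application of (16) reduces S(W), i.e., the combination step damps the graph high frequencies:

```python
import numpy as np

def social_step(psi, C, mu_eta):
    """Laplacian-regularized social step (16):
    w_k = psi_k - mu*eta * sum_{l in N_k} c_kl (psi_k - psi_l).
    psi is N x M (one row per agent); C is the symmetric adjacency matrix."""
    L = np.diag(C.sum(axis=1)) - C        # graph Laplacian L = diag(C 1) - C
    return psi - mu_eta * (L @ psi)       # row k implements (16) exactly

def smoothness(W, C):
    """Graph Laplacian regularizer S(W) from (13), for N x M task matrices."""
    L = np.diag(C.sum(axis=1)) - C
    return np.trace(W.T @ L @ W)          # = (1/2) sum_k sum_l c_kl ||w_k - w_l||^2

# Toy example: ring of N = 6 agents, tasks of length M = 2.
rng = np.random.default_rng(1)
N, M = 6, 2
C = np.zeros((N, N))
for k in range(N):
    C[k, (k + 1) % N] = C[(k + 1) % N, k] = 1.0
psi = rng.standard_normal((N, M))         # intermediate estimates psi_{k,i}
w = social_step(psi, C, mu_eta=0.05)
```

Since the step multiplies the spectral component at λ_m by (1 − µη λ_m), the quadratic form (13) can only shrink when µη λ_N < 2, consistent with the stability discussion above.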
Regarding algorithm (16), it follows that, when the spectral radius of the combination matrix I − µη𝓛 is equal to one, sufficiently small step-sizes ensuring the stability of the individual agents will also ensure the stability of the network² [12]. Proximal-based approaches have also been proposed in [19] to solve multitask problems under smoothness. However, these approaches require the evaluation of the proximal operator (defined by (28)) of the risk Q_k(·) at each iteration i, which can be computationally expensive.

B. Bias-variance tradeoff

We next consider the interesting question of whether multitask learning is beneficial compared to non-cooperation. The answer to this inquiry requires i) studying the performance of algorithm (12) relative to the actual agents' objectives {w_k^o}, and then ii) examining when the multitask implementation (12) can lead to enhanced performance in comparison to the non-cooperative solution (2).

Algorithm (12) was studied in detail in [12]. It was shown that the network MSD defined by (9) is mainly influenced by the sum of two factors, as explained further below. The first factor is the steady-state variance of algorithm (12) with respect to the regularized solution W^⋆ in (11), namely, lim_{i→∞} (1/N) E ‖W^⋆ − W_i‖². The second is the bias, or the average distance between the regularized solution W^⋆ and the unregularized one W^o, namely, (1/N) ‖W^o − W^⋆‖². By increasing the regularization strength η, the variance term is more likely to decrease while the bias term is more likely to increase. Understanding this bias-variance tradeoff is critical for understanding the behavior of regularized multitask algorithms. We therefore describe in the following the bias-variance behavior of algorithm (16) by considering the expressions derived in [12].

² In this article, a network is said to be stable if the mean-square error (1/N) ‖W^⋆ − W_i‖² converges asymptotically to a bounded region on the order of the step-size.
These expressions are useful for illustrating the concept of multitask learning. As we will see, instead of involving the vertex-domain information given by the entries {c_{kℓ}} of the adjacency matrix, these expressions involve the graph spectral information defined by the eigendecomposition of the Laplacian L. Because the Laplacian is a real symmetric matrix, it possesses a complete set of orthonormal eigenvectors, which we denote by {v_1, …, v_N}. For convenience, we order the set of real, non-negative eigenvalues of L as 0 = λ_1 < λ_2 ≤ … ≤ λ_N, where, since the network is connected, there is only one zero eigenvalue, with corresponding eigenvector v_1 = (1/√N) 1_N [20]. Therefore, the Laplacian can be decomposed as L = V Λ V^⊤, where Λ = diag{λ_1, …, λ_N} and V = [v_1, …, v_N].

Performance result 2. Consider an MSE network running the multitask algorithm (12) with the second step given by (16). Assume further that R_{u,k} = R_u for all k and that ρ(I − µη𝓛) ≤ 1. Under these assumptions, and for sufficiently small step-sizes and a smooth signal W^o, it is shown in [12] that:

    lim_{i→∞} (1/N) E ‖W^⋆ − W_i‖² ≈ Σ_{m=1}^N φ(λ_m)    (17)

where

    φ(λ_m) = (µ/2N) (Σ_{k=1}^N [v_m]_k² σ²_{v,k}) Σ_{q=1}^M λ_{u,q} / (λ_{u,q} + η λ_m)    (18)

with λ_{u,q} the q-th eigenvalue of R_u and [v_m]_k the k-th entry of the eigenvector v_m. For the bias term, it can be shown that [12]:

    ‖W^o − W^⋆‖² = Σ_{m=2}^N ζ(λ_m)    (19)

where

    ζ(λ_m) = ‖η λ_m (R_u + η λ_m I_M)^{−1} w̄_m^o‖²    (20)

with w̄_m^o = (v_m^⊤ ⊗ I_M) W^o the m-th subvector of W̄^o = (V^⊤ ⊗ I_M) W^o corresponding to the eigenvalue λ_m.

For the steady-state variance (17), observe that it consists of the summation of N terms φ(λ_m), each corresponding to an eigenvalue λ_m of the Laplacian. The first one, φ(λ_1 = 0), is independent of the regularization strength η. The remaining terms φ(λ_m ≠ 0) decrease when η increases. Therefore, when η increases, the variance decreases.
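The variance expression (17)–(18) is easy to evaluate numerically. The following numpy sketch (function names and the toy ring graph are illustrative) computes Σ_m φ(λ_m) from a Laplacian, the per-agent noise powers, and the eigenvalues of R_u; note that setting η = 0 should recover the non-cooperative network MSD in (10):

```python
import numpy as np

def variance_theory(L, sigma_v2, lam_u, mu, eta):
    """Evaluate the steady-state variance approximation (17)-(18).
    L: N x N graph Laplacian; sigma_v2: length-N array of sigma_{v,k}^2;
    lam_u: eigenvalues lam_{u,q} of R_u; mu: step-size; eta: reg. strength."""
    N = L.shape[0]
    lam, V = np.linalg.eigh(L)                 # 0 = lam_1 <= ... <= lam_N
    total = 0.0
    for m in range(N):
        weight = np.sum(V[:, m] ** 2 * sigma_v2)           # sum_k [v_m]_k^2 sigma_{v,k}^2
        spectral = np.sum(lam_u / (lam_u + eta * lam[m]))  # sum_q lam_uq/(lam_uq + eta lam_m)
        total += (mu / (2 * N)) * weight * spectral        # phi(lam_m) in (18)
    return total

# Ring of N = 6 agents with heterogeneous noise and white regressors (R_u = I_4):
N = 6
C = np.zeros((N, N))
for k in range(N):
    C[k, (k + 1) % N] = C[(k + 1) % N, k] = 1.0
L_ring = np.diag(C.sum(axis=1)) - C
sigma_v2 = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.01])
msd_nc = variance_theory(L_ring, sigma_v2, np.ones(4), mu=0.01, eta=0.0)
msd_reg = variance_theory(L_ring, sigma_v2, np.ones(4), mu=0.01, eta=1.0)
```

With η = 0 every spectral factor equals M, and the orthonormality of the eigenvectors collapses the sum to (µM/2)(1/N)Σ_k σ²_{v,k}, exactly MSD^{nc} in (10); any η > 0 strictly reduces the terms with λ_m > 0.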
From expression (19), we observe that the bias tends to increase as the regularization strength η increases. However, an interesting fact arises for smooth W^o. To see this, we rewrite the regularizer in (13) as:

    S(W^o) = (W̄^o)^⊤ (Λ ⊗ I_M) W̄^o = Σ_{m=2}^N λ_m ‖w̄_m^o‖²    (21)

where we used the fact that λ_1 = 0. Intuitively, given that λ_m > 0 for m = 2, …, N, the above expression shows that W^o is considered to be smooth over the graph if the ‖w̄_m^o‖² corresponding to large eigenvalues λ_m are very small. That is, a smooth W^o is mainly contained in [0, λ_c], i.e., ‖w̄_m^o‖² ≈ 0 if λ_m > λ_c, and the smoother W^o is, the smaller λ_c will be. In this case, the effective sum in (19) runs only over the terms corresponding to the small eigenvalues λ_m ≤ λ_c, instead of all N terms. We thus conclude that, as long as W^o is sufficiently smooth, moderate regularization strengths η in the range (0, ∞) exist such that the decrease in variance at these values of η dominates the increase in bias. In other words, the MSD at these values of η will be less than the MSD at η = 0, which corresponds to the non-cooperative mode of operation.

Observe from (16) that the social learning step following from the Laplacian regularization term (13) involves a single communication step at every stochastic gradient update. When multiple steps are allowed, it is reasonable to expect that the performance can be improved. In the following, we show how such a solution can be designed.

C. Graph spectral regularization

The main observation behind the introduction of this regularizer is that a smooth W^o over a graph exhibits a special structure in the graph spectral domain (it is mainly contained in [0, λ_c], i.e., ‖w̄_m^o‖² ≈ 0 if λ_m > λ_c) [21]. Graph spectral regularization is used to leverage the spectral information more thoroughly and improve the multitask network performance.
In this case, the network will aim at solving problem (11) with Ω = R^{MN} and R(·) properly selected in order to promote the prior information available on the structure of W^o in the graph spectral domain. The following class of regularization functionals on graphs can be used for this purpose [21], [22]:

    R(W) = W^⊤ r(𝓛) W = W^⊤ (r(L) ⊗ I_M) W    (22)

where r(·) is some well-defined non-negative function on the spectrum σ(L) = {λ_1, …, λ_N} of L, and r(L) is the corresponding matrix function defined as [23, p. 3]:

    r(L) = V r(Λ) V^⊤ = Σ_{m=1}^N r(λ_m) v_m v_m^⊤    (23)

Construction (22) uses the Laplacian as a means to design regularization operators. Requiring a positive semi-definite regularizer r(L) imposes the constraint r(λ) ≥ 0 for all λ ∈ σ(L). Replacing (23) into (22), we obtain–compare with the regularizer in (21) to see how an extra degree of freedom is introduced into the multitask network design:

    R(W) = W̄^⊤ (r(Λ) ⊗ I_M) W̄ = Σ_{m=1}^N r(λ_m) ‖w̄_m‖²    (24)

where W̄ = (V^⊤ ⊗ I_M) W and w̄_m = (v_m^⊤ ⊗ I_M) W. The regularization in (24) promotes a particular structure in the graph spectral domain: it strongly penalizes the ‖w̄_m‖² for which the corresponding r(λ_m) is large. Thus, one prefers r(λ_m) to be large for those ‖w̄_m‖² that are small, and vice versa. From the discussion following (21), it is clear that, under smoothness, the function r(λ) must be chosen to be monotonically increasing in λ. One typical choice is r(λ) = λ^S with S ≥ 1. Example 5 further ahead illustrates, for instance, the benefit of using λ³ instead of λ.
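The matrix function (23) can be formed directly from the eigendecomposition of the Laplacian. A minimal numpy sketch (the helper name and the toy ring graph are illustrative) is shown below; for the polynomial choice r(λ) = λ³, the construction should coincide with the matrix power L³, which mixes information only over 3-hop neighborhoods:

```python
import numpy as np

def matrix_function(L, r):
    """Spectral matrix function r(L) = V r(Lambda) V^T from (23).
    L must be symmetric; r is a scalar function applied to the eigenvalues."""
    lam, V = np.linalg.eigh(L)           # eigendecomposition L = V Lambda V^T
    return V @ np.diag(r(lam)) @ V.T

# Ring graph Laplacian with N = 5 agents.
N = 5
C = np.zeros((N, N))
for k in range(N):
    C[k, (k + 1) % N] = C[(k + 1) % N, k] = 1.0
L = np.diag(C.sum(axis=1)) - C
# r(lambda) = lambda^3: (23) reduces to the matrix power L^3.
R3 = matrix_function(L, lambda lam: lam ** 3)
```

For non-polynomial r(·) the same routine applies, but the resulting r(L) is generally dense, which is precisely why the distributed polynomial implementations discussed next are of interest.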
Assuming the regularizer $r(L)$ in (22) can be written as an $S$-th degree polynomial of the Laplacian $L$, i.e., $r(L) = \sum_{s=0}^{S} \beta_s L^s$ for some constants $\{\beta_s\}$ (or, equivalently, $r(\lambda) = \sum_{s=0}^{S} \beta_s \lambda^s$), and following arguments similar to those that led to (16), one arrives at the following social step (12b) [22]:

$\psi^{(s)}_{k,i} = \beta_{S-s}\, \psi_{k,i} + \sum_{\ell \in \mathcal{N}_k} c_{k\ell}\, (\psi^{(s-1)}_{k,i} - \psi^{(s-1)}_{\ell,i})$, for $s = 1, \ldots, S$,
$w_{k,i} = \psi_{k,i} - \mu\eta\, \psi^{(S)}_{k,i}$,   (25)

where $\psi^{(0)}_{k,i} = \beta_S\, \psi_{k,i}$. This step requires $S$ communication rounds. The resulting algorithm (25) is distributed since, at each step, each agent is only required to exchange information locally with its neighbors. Since $S$ communication steps are required, agent $k$ ends up collecting information from its $S$-hop neighborhood. For more general $r(\lambda)$ that are not necessarily polynomial in $\lambda$, one would still like to benefit from the sparsity of the graph captured by $L$. As long as $r(L)$ can be approximated by some lower-order polynomial in $L$, say $r(L) \approx \sum_{s=0}^{S} \beta_s L^s$, distributed implementations of the form (25) are possible; see [22]. Problems of this type have already been considered in graph filter design [24], [25]. For instance, the work [24] proposes to locally approximate $r(\cdot)$ by a polynomial $\tilde{r}(\cdot)$ computed by truncating a shifted Chebyshev series expansion of $r(\cdot)$ on $[0, \lambda_N]$. When the regularizer $r(\cdot)$ is continuous, the Chebyshev approximation $\tilde{r}(\cdot)$ converges to it rapidly as $S$ increases.

Fig. 4. Illustrative example for spectral regularization. (Left) Estimation under smoothness. (Middle) Behavior of algorithm (12) in the graph spectral domain with $\bar{w}^\star_m = (v_m^\top \otimes I_M) W^\star$. (Right) Bias-variance tradeoff for $r(\lambda) = \lambda^3$.
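Recursion (25) is a Horner-style evaluation of $r(L)\psi$: each of the $S$ steps applies the Laplacian once, i.e., one exchange with the 1-hop neighborhood. A minimal sketch with scalar tasks ($M = 1$) and an assumed small weighted graph, checking the recursion against a direct (centralized) evaluation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weighted graph on N = 5 nodes: symmetric adjacency C and
# Laplacian L, so that (L psi)_k = sum_l c_kl (psi_k - psi_l).
N = 5
C = np.triu(rng.uniform(0, 1, (N, N)), 1)
C = C + C.T
L = np.diag(C.sum(axis=1)) - C

# Polynomial regularizer r(L) = sum_s beta_s L^s with S = 3 (beta_s >= 0).
beta = np.array([0.0, 0.3, 0.2, 0.1])
S = len(beta) - 1

psi = rng.standard_normal(N)          # stacked intermediate estimates (12a)

# Horner recursion from (25): psi^(s) = beta_{S-s} psi + L psi^(s-1).
x = beta[S] * psi                     # psi^(0)
for s in range(1, S + 1):
    x = beta[S - s] * psi + L @ x     # one neighborhood exchange per step

# After S steps, x = r(L) psi; the update is then w = psi - mu * eta * x.
rL = sum(b * np.linalg.matrix_power(L, s) for s, b in enumerate(beta))
assert np.allclose(x, rL @ psi)
```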
When the regularizer presents discontinuities, polynomial approximation methods are not advised over adaptive networks, since an accurate approximation would require a large order $S$ and, consequently, a large number of communication steps at each iteration. Projection-based methods similar to the one described in Sec. V can be useful in this case. For example, if the smooth signal $W^o$ is only contained in $[0, \lambda_c]$, instead of using a discontinuous regularizer $r(\lambda)$ of the form $r(\lambda_m) = 0$ if $m < c$ and $\beta_0$ otherwise, one may design a multitask network that projects onto the space spanned by the first $c$ eigenvectors of the graph Laplacian. Since the optimization problems in Sec. IV-A and IV-C are the same with $L$ in (13) replaced by $r(L)$ in (22), the multitask strategy (25) behaves in a similar manner to (16). In particular, the bias-variance tradeoff discussion continues to apply, and expressions (17)–(20) continue to hold with $\lambda_m$ on the RHS of (18) and (20) replaced by the function $r(\lambda_m)$. By replacing $\lambda_m$ on the RHS of (20) by $r(\lambda_m)$, one may directly observe the consequence of this regularizer on the bias term (19), which can now be made close to zero (by choosing, in the smoothness case, for example, $r(\lambda_m) \approx 0$ if $\lambda_m \in [0, \lambda_c]$ and $\beta_m > 0$ otherwise).

Example 5. (Graph spectral filtering). Consider the MSE network example in Fig. 4 and assume a uniform data profile, i.e., $R_{u,k} = R_u$ and $\sigma^2_{v,k} = \sigma^2_v$ for all $k$. In the left plot, we illustrate the entries of the tasks $w^o_k$, which are smooth over the underlying graph. In the middle plot, we illustrate the behavior of the previously described algorithms in the graph spectral domain. The top plot represents the behavior of the network output $W^\star$ for three different choices of regularizer $r(\lambda) \in \{0, \lambda, \lambda^3\}$. The bottom plot represents the behavior of the steady-state variance (18) with the eigenvalue $\lambda_m$ replaced by the function $r(\lambda_m)$.
Observe how the regularizer $r(\lambda) = \lambda^3$ penalizes low eigenvalues less than $r(\lambda) = \lambda$, and consequently preserves the corresponding signal components $\bar{w}_m$. Observe further the graph low-pass filtering behavior [24], [25]. Small eigenvalues $\lambda_m$ correspond to low frequencies, $\bar{W} = (V^\top \otimes I_M) W$ corresponds to the graph Fourier transform, and $\bar{w}_m = (v_m^\top \otimes I_M) W$ corresponds to the $m$-th frequency content of $W$. It can be shown that the $m$-th frequency content of the output can be bounded as

$\|\bar{w}^\star_m\| \le \dfrac{\lambda_{u,\max}}{\lambda_{u,\max} + \eta\, r(\lambda_m)}\, \|\bar{w}^o_m\|$

in terms of the $m$-th frequency content of the input $W^o$, where $\lambda_{u,\max}$ is the maximum eigenvalue of $R_u$ (see [12]). Since $r(\lambda)$ is monotonically increasing in $\lambda$, for fixed $\eta$, the ratio decreases as $\lambda$ increases. Therefore, the network output $W^\star$ can be interpreted as the output of a low-pass graph filter applied to the signal $W^o$. A similar behavior arises for the steady-state variance: for fixed $\eta$, as $\lambda_m$ increases, the variance at the $m$-th frequency, i.e., $\varphi(\lambda_m)$, decreases; and for fixed $\lambda_m$, as $\eta$ increases, $\varphi(\lambda_m)$ decreases. The regularizer $r(\lambda)$ controls the shape of the filter and the strength $\eta$ controls its sharpness. The non-cooperative solution ($\eta = 0$) corresponds to an all-pass graph filter. In the right plot, we illustrate the bias-variance tradeoff for $r(\lambda) = \lambda^3$. Returning to the diagram in Fig. 3, observe that the self-learning step corresponds to the inference step where agent $k$ estimates $w^o_k$ from the streaming data $x_{k,i}$, and that the social-learning step corresponds to the graph filtering step where the agents collaborate in order to perform spatial filtering and reduce the effect of the noise on the network MSD defined by (9). These steps are performed simultaneously. Therefore, multitask learning over networks blends real-time adaptation with graph (spatial) filtering.
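The low-pass interpretation can be checked directly from the gain factor $\lambda_{u,\max}/(\lambda_{u,\max} + \eta\, r(\lambda))$ in the bound above. In the sketch below, $\lambda_{u,\max} = 1$ and $\eta = 10$ are assumed values chosen only for illustration:

```python
import numpy as np

lam_u_max = 1.0          # assumed maximum eigenvalue of R_u
eta = 10.0               # assumed regularization strength

def gain(lam, r):
    """Gain factor of the implied low-pass graph filter at frequency lam."""
    return lam_u_max / (lam_u_max + eta * r(lam))

lams = np.linspace(0.0, 2.0, 5)            # [0, 0.5, 1, 1.5, 2]
g_linear = gain(lams, lambda x: x)         # r(lambda) = lambda
g_cubic = gain(lams, lambda x: x ** 3)     # r(lambda) = lambda^3

# r(lambda) = lambda^3 attenuates low frequencies (lambda < 1) less ...
assert g_cubic[1] > g_linear[1]
# ... and suppresses high frequencies (lambda > 1) more strongly,
assert g_cubic[-1] < g_linear[-1]
# while eta = 0 gives unit gain everywhere: the all-pass (non-cooperative) case.
assert np.allclose(lam_u_max / (lam_u_max + 0.0 * lams), 1.0)
```

The higher polynomial power sharpens the transition between preserved and suppressed frequencies, which is the benefit illustrated in Example 5.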
D. Non-quadratic regularization

Non-quadratic regularization has also been considered in the literature [14], [26], [27]. This scenario induces non-linearities in the social learning step (12b). In this case, multitask algorithms are derived in order to solve problem (11) with $\Omega = \mathbb{R}^{MN}$ and

$R(W) = \sum_{k=1}^{N} \sum_{\ell \in \mathcal{N}_k} \rho_{k\ell}\, h_{k\ell}(w_k, w_\ell)$   (26)

where $h_{k\ell} : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ is a convex cost function associated with the link $(k, \ell)$. In general, this function is used to enforce constraints on the pair of variables across an edge. Observe that (26), by allowing arbitrary distance measures $h_{k\ell}(\cdot, \cdot)$, is a generalization of the previously employed quadratic regularization: setting $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|^2$ recovers (13). Examples of other typical choices are the $\ell_2$-norm regularizer $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|$ and the $\ell_1$-norm regularizer $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|_1$. Instead of encouraging global smoothness, these sparsity-based regularizers can adapt to heterogeneity in the level of smoothness of the tasks $w^o_k$ across nodes [28]. Such heterogeneity is observed, for instance, in the problem of predicting housing prices [27]. In this problem, the objective at each node (house) in a graph where neighboring houses are connected by edges is to learn the weights $w^o_k$ of a regression model (examples of features are the number of bedrooms, square footage, etc.) in order to estimate the price. Due to location-based factors (such as distance to a highway) that are often unknown a priori and, therefore, cannot be incorporated as features, similar houses at different, though close, locations can have drastically different prices, i.e., drastically different $w^o_k$. The objective in this case is to encourage neighboring houses that share common models to cooperate without being influenced by the misleading information of neighbors with different models, i.e., to perform automatic clustering.
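Algorithms built on the non-differentiable $\ell_1$-norm regularizer above typically rely on proximal operators. As a minimal illustration, the sketch below implements the generic proximal operator of a scaled $\ell_1$-norm, which reduces to elementwise soft-thresholding (this is not the specific weighted-sum operator derived in [26], only the basic primitive):

```python
import numpy as np

def prox_l1(w0, gamma):
    """prox of gamma * ||.||_1 at w0: elementwise soft-thresholding."""
    return np.sign(w0) * np.maximum(np.abs(w0) - gamma, 0.0)

w0 = np.array([2.0, -0.3, 0.0, 1.1])
out = prox_l1(w0, 0.5)

# Entries with |w0| <= gamma are zeroed; the others shrink toward zero.
assert np.allclose(out, [1.5, 0.0, 0.0, 0.6])
```

This zeroing of small entries is the mechanism by which such regularizers promote a large number of identical entries between neighboring models.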
To do so, the authors in [27] propose to solve the network Lasso problem, i.e., problem (11) with the $\ell_2$-norm regularizer in (26). The rationale behind this choice is that the $\ell_2$-norm encourages group sparsity, i.e., consensus across an edge: $w_k = w_\ell$. On the other hand, the $\ell_1$-norm regularizer is used in [26] to promote the prior that the parameter vectors at neighboring nodes have a large number of similar entries and a small number of distinct entries. The weight $\rho_{k\ell} \ge 0$ in (26) associated with the link $(k, \ell)$ locally adjusts the regularization strength; it is usually dictated by the physics of the problem at hand. For primal adaptive techniques, and due to the non-differentiability of the regularizers, proximal gradient methods can be used to solve (11). Assuming $\rho_{k\ell} = \rho_{\ell k}$ and $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|_1$, one may arrive at a multitask algorithm of the form (12) with the social learning step (12b) given by (see the derivations in [26]):

$w_{k,i} = \mathrm{prox}_{\eta\mu\, \tilde{g}_{k,i}}(\psi_{k,i})$   (27)

where $\mathrm{prox}_{\gamma g}(w_0)$ denotes the proximal operator of the function $g(w)$:

$\mathrm{prox}_{\gamma g}(w_0) = \arg\min_{w \in \mathbb{R}^M}\; g(w) + \frac{1}{2\gamma} \|w - w_0\|^2$,   (28)

and where the function $\tilde{g}_{k,i} : \mathbb{R}^M \to \mathbb{R}$ is given by $\tilde{g}_{k,i}(w_k) = \sum_{\ell \in \mathcal{N}_k} \rho_{k\ell}\, h_{k\ell}(w_k, \psi_{\ell,i})$. Notice that the proximal operator in (27) needs to be evaluated at each iteration. For the weighted sum of $\ell_1$-regularizers, a closed-form expression can be found in [26].

V. MULTITASK ESTIMATION UNDER SUBSPACE CONSTRAINTS

Besides regularization-based algorithms, projection-based algorithms have received considerable attention in the literature on deterministic [18], [29]–[31] and stochastic [9], [10], [32]–[35] optimization.
The objective in this case is to design distributed networks that project onto low-dimensional subspaces while minimizing the individual costs, i.e., to solve problems of the form (11) with [29], [32]:

$R(W) = 0$, and $\Omega = \mathrm{Range}(\mathcal{U})$   (29)

where $\mathrm{Range}(\cdot)$ denotes the range space operator and $\mathcal{U}$ is an $M_t \times P$ full-column-rank matrix with $M_t = \sum_{k=1}^{N} M_k$ and $P \ll M_t$. The reader will soon realize that consensus-type problems are instances of this formulation. Multitask estimation under smoothness can also benefit from this formulation: as explained earlier in Sec. IV-C, when the first $c$ eigenvectors of the Laplacian are available, the designer can project onto $\mathrm{Range}(\mathcal{U})$ with $\mathcal{U} = [v_1, \ldots, v_c] \otimes I_M$ instead of using regularization. Let $P_{\mathcal{U}} = \mathcal{U}(\mathcal{U}^\top \mathcal{U})^{-1} \mathcal{U}^\top$ denote the projection onto the range space of $\mathcal{U}$. Assume that the network topology and the signal subspace $\mathcal{U}$ are such that the following feasibility problem:

find $\mathcal{A}$ such that $\mathcal{A}\mathcal{U} = \mathcal{U}$, $\mathcal{U}^\top \mathcal{A} = \mathcal{U}^\top$, $\rho(\mathcal{A} - P_{\mathcal{U}}) < 1$, and $[\mathcal{A}]_{k\ell} = 0$ if $\ell \notin \mathcal{N}_k$ and $\ell \neq k$,   (30)

admits at least one solution. Then one may arrive at a multitask strategy of the form (12) with the social learning step (12b) given by [32]:

$w_{k,i} = \sum_{\ell \in \mathcal{N}_k} \mathcal{A}_{k\ell}\, \psi_{\ell,i}$   (31)

where $\mathcal{A}_{k\ell} = [\mathcal{A}]_{k\ell}$ is the $(k, \ell)$-th block (of size $M_k \times M_\ell$) of the $N \times N$ block matrix $\mathcal{A}$. A matrix $\mathcal{A}$ satisfying the constraints in (30) is semi-convergent [29], [32]. In particular, it holds that:

$\lim_{i \to \infty} \mathcal{A}^i = P_{\mathcal{U}}$.   (32)

The first two constraints in (30) state that the $P$ columns of $\mathcal{U}$ are right and left eigenvectors of $\mathcal{A}$ associated with the eigenvalue 1. Together with these two constraints, the third constraint in (30) ensures that $\mathcal{A}$ has $P$ eigenvalues at one, and that all other eigenvalues are strictly less than one in magnitude. The last constraint in (30) is a sparsity constraint that characterizes the network topology and ensures local exchange of information at each instant $i$.
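For the consensus subspace $\mathcal{U} = \frac{1}{\sqrt{N}} \mathbb{1}_N$ (taking $M = 1$), the constraints in (30) and the semi-convergence property (32) can be verified numerically. A sketch with an assumed doubly-stochastic combination matrix over a 4-node ring:

```python
import numpy as np

# Consensus subspace: U = (1/sqrt(N)) 1_N, so P_U = (1/N) 1 1^T.
N = 4
U = np.ones((N, 1)) / np.sqrt(N)
P_U = U @ U.T

# A hypothetical doubly-stochastic combination matrix over a connected ring.
A = np.array([[0.5 , 0.25, 0.0 , 0.25],
              [0.25, 0.5 , 0.25, 0.0 ],
              [0.0 , 0.25, 0.5 , 0.25],
              [0.25, 0.0 , 0.25, 0.5 ]])

# Constraints (30): the columns of U are right and left eigenvectors at 1 ...
assert np.allclose(A @ U, U) and np.allclose(U.T @ A, U.T)
# ... and all remaining eigenvalues lie strictly inside the unit circle.
assert max(abs(np.linalg.eigvals(A - P_U))) < 1

# Semi-convergence (32): powers of A converge to the projector P_U.
assert np.allclose(np.linalg.matrix_power(A, 200), P_U, atol=1e-8)
```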
Before explaining how some typical choices of $\mathcal{U}$ lead to well-studied distributed inference problems, we note that the distributed algorithm (31) has an attractive property: in the small step-size regime, the iterates generated by (31) achieve the steady-state performance of the following gradient-projection algorithm [32]:

$\mathcal{W}_i = P_{\mathcal{U}} \left( \mathcal{W}_{i-1} - \mu\, \mathrm{col}\{ \widehat{\nabla_{w_k} J_k}(w_{k,i-1}) \}_{k=1}^{N} \right)$   (33)

which is centralized since, at each instant $i$, agent $k$ needs to send its estimate $\psi_{k,i}$ in (12a) to a fusion center, which performs the projection and then sends the result $w_{k,i}$ back to the agent.

Performance result 3. Consider an MSE network running algorithm (12) with the social step (12b) given by (31), where $\mathcal{A} = [\mathcal{A}_{k\ell}]$ satisfies the constraints in (30). Assume that the network is seeking $W^o \in \mathrm{Range}(\mathcal{U})$. Assume further that $\mathcal{U} = U \otimes I_M$, where $U = [u_1, \ldots, u_{\bar{P}}]$ is semi-orthogonal, and that $R_{u,k} = R_u$ and $M_k = M$ for all $k$. Under these assumptions, and for sufficiently small step-sizes, the network MSD defined by (9) is given by [32]:

$\mathrm{MSD} = \dfrac{\mu M}{2N} \sum_{m=1}^{\bar{P}} \left( \sum_{k=1}^{N} [u_m]_k^2\, \sigma^2_{v,k} \right)$.   (34)

Notice that the projection framework does not induce bias in the estimation. This is because $W^o \in \mathrm{Range}(\mathcal{U})$ and, therefore, the vector $W^\star$ in (11) is equal to $W^o$, the network objective. Moreover, the benefit of cooperation can be readily seen by assuming uniform variances $\sigma^2_{v,k} = \sigma^2_v$ for all $k$. In this case, comparing (34) with (10) in the non-cooperative case, we conclude that $\mathrm{MSD} = (\bar{P}/N)\, \mathrm{MSD}_{\mathrm{nc}}$, where $\bar{P}/N \ll 1$. Therefore, the cooperative strategy outperforms the non-cooperative one by a factor of $N/\bar{P}$.

A. Single-task estimation

In single-task estimation, the agents seek a common minimizer $w^o$; see Fig. 1 (left). This problem is encountered in many applications; examples include target localization and distributed sensing (see, e.g., [10]).
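Consensus-type cooperation of this kind is typically implemented through combination matrices built from local degree information. One common local construction is the Metropolis rule [9]; the sketch below uses one standard form of this rule (normalizations vary slightly across references) and checks that the resulting matrix is doubly stochastic:

```python
import numpy as np

def metropolis_weights(adj):
    """One common form of the Metropolis rule: a_kl = 1/(1 + max(d_k, d_l))
    for neighbors, with the leftover mass placed on the self-weight a_kk."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if adj[k, l] and k != l:
                A[k, l] = 1.0 / (1.0 + max(deg[k], deg[l]))
        A[k, k] = 1.0 - A[k].sum()
    return A

# Hypothetical 4-node star: node 0 connected to nodes 1, 2, 3.
adj = np.zeros((4, 4), dtype=int)
adj[0, 1:] = adj[1:, 0] = 1
A = metropolis_weights(adj)

# The construction yields nonnegative, doubly-stochastic weights.
assert (A >= 0).all()
assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)
```

Because each weight depends only on the degrees of the two endpoints, agents can compute their coefficients without any global coordination.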
Single-task estimation can be recast in the form (11), where $R(\cdot)$ and $\Omega$ are chosen according to (29) with $\mathcal{U} = \frac{1}{\sqrt{N}} (\mathbb{1}_N \otimes I_M)$. Several algorithms for solving such consensus-type problems have been proposed in the literature, including incremental [18], consensus [30], and diffusion [9], [10] strategies. Due to lack of space, we describe only the class of diffusion strategies, which can be written in the form (12) with the social step (12b) given by:

$w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{k\ell}\, \psi_{\ell,i}$   (35)

where $a_{k\ell}$ is the $(k, \ell)$-th entry of an $N \times N$ doubly-stochastic matrix $A$ satisfying:

$a_{k\ell} \ge 0$, $\sum_{\ell=1}^{N} a_{k\ell} = 1$, $\sum_{k=1}^{N} a_{k\ell} = 1$, and $a_{k\ell} = 0$ if $\ell \notin \mathcal{N}_k$.   (36)

Several rules for selecting these combination coefficients locally have been proposed in the literature, such as the Metropolis rule and the Laplacian rule; see, e.g., [9]. Observe that step (35) can be written in the form of (31) with $\mathcal{A}_{k\ell} = a_{k\ell} I_M$ and $\mathcal{A} = A \otimes I_M$, and that the resulting matrix $\mathcal{A}$ satisfies the constraints in (30) over a strongly connected network.

B. Multitask estimation with overlapping parameter vectors

Here it is assumed that the individual costs $J_k(\cdot)$ depend only on a subset of the components of a global parameter vector $w = [w^1, \ldots, w^M]^\top \in \mathbb{R}^{M}$ [31], [33]–[35]. This situation is observed in Example 4, where the network global parameter vector is $w = \mathrm{col}\{w^n\}_{n=1}^{14}$ and where the states $w_k$ to be estimated at neighboring areas partially overlap. It can be verified that this problem can also be recast in the form (29) with $\mathcal{U}$ properly selected. To solve this consensus-type problem, and motivated by the single-task diffusion strategies, the works [33], [34] propose the following algorithm. Assume agent $k$ is interested in estimating the entry $w^n$ of $w$, and let $\mathcal{N}^n_k$ denote the set of neighbors of $k$ that are also interested in estimating $w^n$.
In order to reach consensus on $w^n$, agent $k$ assigns to this entry a set of non-negative coefficients $\{a^n_{k\ell}\}$ satisfying

$a^n_{k\ell} = 0$ if $\ell \notin \mathcal{N}^n_k$, $\sum_{\ell \in \mathcal{N}^n_k} a^n_{k\ell} = 1$, $\sum_{\ell \in \mathcal{N}^n_k} a^n_{\ell k} = 1$,   (37)

and performs the following convex combination:

$w^n_{k,i} = \sum_{\ell \in \mathcal{N}^n_k} a^n_{k\ell}\, \psi^n_{\ell,i}$   (38)

where $\psi^n_{\ell,i}$ is the entry of the $M_\ell \times 1$ intermediate estimate $\psi_{\ell,i}$ (obtained from (12a)) corresponding to the variable $w^n$, and $w^n_{k,i}$ is the estimate of $w^n$ at node $k$ and instant $i$. It can also be verified that solution (38) can be written in the form (31) with the blocks $\mathcal{A}_{k\ell}$ properly selected. For MSE networks, a recursive least-squares (RLS) approach is proposed in [35] to solve overlapping multitask estimation. In general, second-order gradient methods enjoy faster convergence rates than first-order methods, at the expense of increased computational complexity.

VI. CLUSTERED MULTITASK ESTIMATION

We now explain how clustered multitask estimation can be solved. Clustered multitask learning was first considered in [6] within the machine learning community. It was then extended to adaptation and learning over networks in [36]. As we shall see, clustered multitask estimation merges subspace constraints with regularization. Let $M_k = M$ for all $k$. In clustered multitask networks, agents within a cluster $\mathcal{C}_q$ are interested in estimating the same vector $w^o_{\mathcal{C}_q}$; see Fig. 1 (middle). Without loss of generality, we index agents according to their cluster indexes so that agents from the same cluster have consecutive indexes. Let $N_q$ denote the number of agents in cluster $\mathcal{C}_q$.
Since agents within $\mathcal{C}_q$ need to reach consensus on $w^o_{\mathcal{C}_q}$, clustered multitask estimation problems can be recast in the form (11) with:

$\Omega = \mathrm{Range}(\mathcal{U})$, $\mathcal{U} = \mathrm{diag}\left\{ \frac{1}{\sqrt{N_q}} (\mathbb{1}_{N_q} \otimes I_M) \right\}_{q=1}^{Q}$.   (39)

Therefore, the cluster consensus step takes the form (31) with $\mathcal{A} = A \otimes I_M$ and $A = \mathrm{diag}\{A_q\}_{q=1}^{Q}$, where the $N_q \times N_q$ blocks $A_q$ are chosen according to the constraints in (30); one typical choice is doubly-stochastic blocks. The resulting $N \times N$ matrix $A = [a_{k\ell}]$ then satisfies:

$a_{k\ell} \ge 0$, $A \mathbb{1}_N = \mathbb{1}_N$, $\mathbb{1}_N^\top A = \mathbb{1}_N^\top$, and $a_{k\ell} = 0$ if $\ell \notin \mathcal{N}_k \cap \mathcal{C}(k)$,   (40)

where $\mathcal{N}_k \cap \mathcal{C}(k)$ denotes the neighbors of $k$ that are inside its cluster. The choice of the regularizer $R(W)$ in (11) depends on the prior information on how the models across the clusters relate to each other. One typical choice is [26], [36]:

$R(W) = \sum_{k=1}^{N} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} \rho_{k\ell}\, h_{k\ell}(w_k, w_\ell)$   (41)

where $\mathcal{N}_k \setminus \mathcal{C}(k)$ denotes the neighbors of $k$ that are outside its cluster and $h_{k\ell}(w_k, w_\ell)$ is a cost associated with the inter-cluster link $(k, \ell)$. This function is used to enforce constraints on the pairs of variables across an inter-cluster edge. Examples are $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|^2$ to enforce graph smoothness [36] and $h_{k\ell}(w_k, w_\ell) = \|w_k - w_\ell\|_1$ to enforce sparsity priors [26]. Clustered multitask algorithms have in general the structure (12) with step (12b) given by:

$\phi_{k,i} = \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} a_{k\ell}\, \psi_{\ell,i}$   (42a)
$w_{k,i} = g'_k\big(\phi_{k,i}, \{\phi_{\ell,i}\}_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)}\big)$   (42b)

In this algorithm, the self-learning step (12a) is followed by an intra-cluster social learning step (42a), where node $k$ receives the intermediate estimates $\psi_{\ell,i}$ from its intra-cluster neighbors $\mathcal{N}_k \cap \mathcal{C}(k)$ and combines them in a convex manner through the coefficients $\{a_{k\ell}\}$ in (40) to obtain the intermediate value $\phi_{k,i}$.
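Step (42a) can be illustrated with an assumed two-cluster network. With doubly-stochastic diagonal blocks as in (40), repeated intra-cluster combinations drive each cluster toward the average of its members' intermediate estimates, without mixing information across clusters:

```python
import numpy as np

# Two hypothetical clusters: C1 = {0, 1} and C2 = {2, 3, 4}, with
# doubly-stochastic blocks A_q; inter-cluster entries are zero per (40).
A1 = np.array([[0.6, 0.4],
               [0.4, 0.6]])
A2 = np.array([[0.5 , 0.25, 0.25],
               [0.25, 0.5 , 0.25],
               [0.25, 0.25, 0.5 ]])
A = np.zeros((5, 5))
A[:2, :2] = A1
A[2:, 2:] = A2

# A is doubly stochastic and block diagonal, so step (42a) only mixes
# intermediate estimates within each cluster.
assert np.allclose(A.sum(axis=0), 1) and np.allclose(A.sum(axis=1), 1)

# Powers of A perform per-cluster averaging: agents in each cluster
# converge to the mean of their own cluster's estimates.
psi = np.array([1.0, 3.0, 0.0, 6.0, 9.0])
phi = np.linalg.matrix_power(A, 100) @ psi
assert np.allclose(phi, [2.0, 2.0, 5.0, 5.0, 5.0])
```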
The second step (42b) is an inter-cluster social learning step, where agent $k$ receives the intermediate estimates $\{\phi_{\ell,i}\}$ from its neighbors outside its cluster, $\mathcal{N}_k \setminus \mathcal{C}(k)$, and combines them properly using the function $g'_k(\cdot)$ to obtain $w_{k,i}$. This step incorporates the available prior information on how the models across the clusters are related into the adaptation mechanism. The function $g'_k(\cdot)$ depends on the regularizer $R(\cdot)$. For example, for $\ell_1$-norm co-regularizers $h_{k\ell}$ (with $\rho_{k\ell} = \rho_{\ell k}$), one may arrive at an inter-cluster learning step (42b) given by $w_{k,i} = \mathrm{prox}_{\eta\mu\, \tilde{g}_{k,i}}(\phi_{k,i})$ with $\tilde{g}_{k,i}(w_k) = \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} \rho_{k\ell}\, h_{k\ell}(w_k, \phi_{\ell,i})$ [26].

VII. CONCLUSION

In this article, we explained how prior knowledge about task relationships can be incorporated into the adaptation mechanism and how different priors yield different multitask strategies. It follows that choosing the optimal strategy for a given problem is equivalent to choosing the task relatedness model that best fits the underlying problem. Choosing the most practically viable solution then further balances this model fit against computational and communication constraints. There are several other aspects of and strategies for multitask learning over graphs that were not covered in this article due to space limitations. For instance, we focused only on multitask networks endowed with parameter estimation tasks. However, distributed detection has also been considered from a multitask perspective (see, e.g., [37]). Online network clustering has also been considered; the objective in this case is to design diffusion networks that adapt their combination coefficients in (35) in order to exclude harmful neighbors sharing distinct tasks [38], [39]. Readers can refer to [40] for a list of other multitask-oriented works in the literature.
Multitask learning over graphs is worth exploring further, as there are many potential ideas to build on. For instance, the performance expressions show the sensitivity of the results to the underlying graph structure. It would be useful to infer the entries $c_{k\ell}$ of the adjacency matrix in (16) simultaneously with the self-learning step (12a); this amounts to learning the relations between the tasks simultaneously with the tasks themselves. Automatically determining the optimal regularization strength $\eta$ and allowing edge regularizers $h_{k\ell}(w_k, w_\ell)$ beyond the $\ell_1$-norm also constitute clear extensions. Finally, we believe that the number of multitask learning applications in "distributed, streaming machine learning" is vast, and we hope to witness increased utilization of the algorithms and theoretical results established in the domain of "learning and adaptation over networks".

REFERENCES

[1] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, Jul. 1997.
[2] S. Thrun and L. Pratt, Learning to Learn, Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[3] Y. Zhang and D. Yeung, "A convex formulation for learning task relationships in multi-task learning," in Proc. 26th Conf. on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, Jul. 2010, pp. 733–742.
[4] X. Chen, S. Kim, Q. Lin, J. G. Carbonell, and E. P. Xing, "Graph-structured multitask regression and an efficient optimization method for general fused Lasso," available as arXiv:1005.3579, May 2010.
[5] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, USA, 2004, pp. 109–117.
[6] L. Jacob, F. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. 21st Int. Conf. on Neural Information Processing Systems, Vancouver, Canada, 2008, pp. 745–752.
[7] T. Kato, H. Kashima, M. Sugiyama, and K. Asai, "Multi-task learning via conic programming," in Proc. 20th Int. Conf. on Neural Information Processing Systems, Vancouver, Canada, 2007, pp. 737–744.
[8] J. Wang, M. Kolar, and N. Srebro, "Distributed multi-task learning," in Proc. Conf. on Artificial Intelligence and Statistics, Cadiz, Spain, 2016.
[9] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4–5, pp. 311–801, 2014.
[10] A. H. Sayed, S.-Y. Tu, J. Chen, X. Zhao, and Z. J. Towfic, "Diffusion strategies for adaptation and learning over networks," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 155–171, May 2013.
[11] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, Upper Saddle River, NJ, 1985.
[12] R. Nassif, S. Vlaski, C. Richard, and A. H. Sayed, "Learning over multitask graphs – Part I: Stability analysis," submitted for publication, available as arXiv:1805.08535, May 2018.
[13] V. Kekatos and G. B. Giannakis, "Distributed robust power system state estimation," IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1617–1626, May 2013.
[14] A. Koppel, B. M. Sadler, and A. Ribeiro, "Proximity without consensus in online multiagent optimization," IEEE Trans. Signal Process., vol. 65, no. 12, pp. 3062–3077, Jun. 2017.
[15] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "Wireless sensor networks: A survey," Computer Networks, vol. 38, no. 4, pp. 393–422, 2002.
[16] L. Grady and J. R. Polimeni, Discrete Calculus, Springer, Berlin, Germany, 2010.
[17] D. Zhou and B. Schölkopf, "A regularization framework for learning from graph data," in Proc. ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, 2004, vol. 15, pp. 67–68.
[18] D. P. Bertsekas, "A new class of incremental gradient methods for least squares problems," SIAM J. Optimiz., vol. 7, no. 4, pp. 913–926, Nov. 1997.
[19] W. Wang, J. Wang, M. Kolar, and N. Srebro, "Distributed stochastic multi-task learning with graph regularization," available as arXiv:1802.03830, 2018.
[20] F. R. K. Chung, Spectral Graph Theory, American Mathematical Society, 1997.
[21] A. J. Smola and R. Kondor, "Kernels and regularization on graphs," in Learning Theory and Kernel Machines, B. Schölkopf and M. K. Warmuth, Eds., Springer, 2003, pp. 144–158.
[22] R. Nassif, S. Vlaski, C. Richard, and A. H. Sayed, "A regularization framework for learning over multitask graphs," IEEE Signal Process. Lett., vol. 26, no. 2, pp. 297–301, Feb. 2019.
[23] N. J. Higham, Functions of Matrices: Theory and Computation, SIAM, Philadelphia, PA, 2008.
[24] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Trans. Signal Inf. Process. Netw., vol. 4, no. 4, pp. 736–751, Dec. 2018.
[25] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, 2014.
[26] R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, "Proximal multitask learning over networks with sparsity-inducing coregularization," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6329–6344, Dec. 2016.
[27] D. Hallac, J. Leskovec, and S. Boyd, "Network Lasso: Clustering and optimization in large graphs," in Proc. ACM SIGKDD, Sydney, Australia, Aug. 2015, pp. 387–396.
[28] Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani, "Trend filtering on graphs," J. Mach. Learn. Res., vol. 17, no. 1, pp. 3651–3691, 2016.
[29] P. Di Lorenzo, S. Barbarossa, and S. Sardellitti, "Distributed signal recovery based on in-network subspace projections," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Brighton, UK, May 2019, pp. 5242–5246.
[30] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[31] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Püschel, "Distributed optimization with local domains: Applications in MPC and network flows," IEEE Trans. Autom. Control, vol. 60, no. 7, pp. 2004–2009, Jul. 2015.
[32] R. Nassif, S. Vlaski, and A. H. Sayed, "Adaptation and learning over networks under subspace constraints – Part II: Performance analysis," submitted for publication, available as arXiv:1906.12250, 2019.
[33] S. A. Alghunaim and A. H. Sayed, "Distributed coupled multi-agent stochastic optimization," IEEE Trans. Autom. Control, 2019.
[34] J. Plata-Chaves, N. Bogdanović, and K. Berberidis, "Distributed diffusion-based LMS for node-specific adaptive parameter estimation," IEEE Trans. Signal Process., vol. 63, no. 13, pp. 3448–3460, 2015.
[35] A. K. Sahu, D. Jakovetić, and S. Kar, "CIRFE: A distributed random fields estimator," IEEE Trans. Signal Process., vol. 66, no. 18, pp. 4980–4995, Sep. 2018.
[36] J. Chen, C. Richard, and A. H. Sayed, "Multitask diffusion adaptation over networks," IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4129–4144, Aug. 2014.
[37] F. K. Teklehaymanot, M. Muma, B. Béjar, P. Binder, A. Zoubir, and M. Vetterli, "Robust diffusion-based unsupervised object labelling in distributed camera networks," in Proc. AFRICON 2015, Sep. 2015, pp. 1–6.
[38] X. Zhao and A. H. Sayed, "Distributed clustering and learning over networks," IEEE Trans. Signal Process., vol. 63, no. 13, pp. 3285–3300, Jul. 2015.
[39] J. Chen, C. Richard, and A. H. Sayed, "Diffusion LMS over multitask networks," IEEE Trans. Signal Process., vol. 63, no. 11, pp. 2733–2748, Jun. 2015.
[40] J. Plata-Chaves, A. Bertrand, M. Moonen, S. Theodoridis, and A. M. Zoubir, "Heterogeneous and multitask wireless sensor networks," IEEE J. Sel. Topics Signal Process., vol. 11, no. 3, pp. 450–465, 2017.