How to learn a graph from smooth signals


Authors: Vassilis Kalofolias

Vassilis Kalofolias
Signal Processing Laboratory 2 (LTS2), Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland

Preliminary work. Under review by AISTATS 2016. Do not distribute.

Abstract

We propose a framework that learns the graph structure underlying a set of smooth signals. Given $X \in \mathbb{R}^{m \times n}$ whose rows reside on the vertices of an unknown graph, we learn the edge weights $w \in \mathbb{R}_+^{m(m-1)/2}$ under the smoothness assumption that $\mathrm{tr}(X^\top L X)$ is small. We show that the problem is a weighted $\ell$-1 minimization that leads to naturally sparse solutions. We point out how known graph learning or construction techniques fall within our framework and propose a new model that performs better than the state of the art in many settings. We present efficient, scalable primal-dual based algorithms for both our model and the previous state of the art, and evaluate their performance on artificial and real data.

1 INTRODUCTION

We consider a matrix $X \in \mathbb{R}^{m \times n} = [x_1, \ldots, x_m]^\top$, where each row $x_i \in \mathbb{R}^n$ resides on one of the $m$ nodes of a graph $G$. Then each of the $n$ columns of $X$ can be seen as a signal on the same graph. A simple assumption about data residing on graphs, but also the most widely used one, is that the data changes smoothly between connected nodes. An easy way to quantify how smooth a set of vectors $x_1, \ldots, x_m \in \mathbb{R}^n$ is on a given weighted undirected graph is through the function
$$\frac{1}{2}\sum_{i,j} W_{ij} \|x_i - x_j\|^2,$$
where $W_{ij}$ denotes the weight of the edge between nodes $i$ and $j$. In words, if two vectors $x_i$ and $x_j$ from a smooth set reside on two well connected nodes (so $W_{ij}$ is big), they are expected to have a small distance. Using the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, this function can be written in matrix form as $\mathrm{tr}(X^\top L X)$.
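As a quick sanity check of this identity (a minimal NumPy sketch of my own, not part of the paper), the trace form and the weighted sum of squared pairwise distances coincide on a random graph:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3

# Random symmetric weight matrix with zero diagonal (a valid W).
A = rng.random((m, m))
W = np.triu(A, 1) + np.triu(A, 1).T

# Combinatorial Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

X = rng.standard_normal((m, n))

# Smoothness via the trace form tr(X^T L X).
quad = np.trace(X.T @ L @ X)

# Smoothness via (1/2) sum_ij W_ij ||x_i - x_j||^2.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
pair = 0.5 * (W * sq_dists).sum()

assert np.isclose(quad, pair)
```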
The importance of the graph Laplacian has long been known as a tool for embedding, manifold learning, clustering and semi-supervised learning, see e.g. Belkin and Niyogi (2001); Belkin, Niyogi, and Sindhwani (2006); Zhu, Ghahramani, Lafferty, et al. (2003). More recently we find an abundance of methods that exploit this notion of smoothness to regularize various machine learning tasks, solving problems of the form
$$\underset{X}{\text{minimize}}\ g(X) + \mathrm{tr}\left(X^\top L X\right). \qquad (1)$$
Zhang, Popescul, and Dom (2006) use it to enhance web page categorization with graph information, Zheng et al. (2011) for graph regularized sparse coding. Cai et al. (2011) use the same term to regularize NMF, Jiang et al. (2013) for PCA and Kalofolias et al. (2014) for matrix completion. Having good quality graphs is key to the success of the above methods. The goal of this paper is to solve the complementary problem of learning a good graph:
$$\underset{L \in \mathcal{L}}{\text{minimize}}\ \mathrm{tr}\left(X^\top L X\right) + f(L), \qquad (2)$$
where $\mathcal{L}$ denotes the set of valid graph Laplacians.

Why is this problem important? Firstly because it enables us to directly learn the hidden graph structure behind our data. Secondly because in most problems that can be written in the form of eq. (1), we are often given a noisy graph, or no graph at all. Therefore, starting from the initial graph and alternating between solving problems (1) and (2) we can at the same time get a better quality graph and solve the task of the initial problem.

Related Work. Dempster (1972) was one of the first to propose the problem of finding connectivity from measurements, under the name "covariance selection". Years later, Banerjee, El Ghaoui, and d'Aspremont (2008) proposed solving an $\ell$-1 penalized log-likelihood problem to estimate a sparse inverse covariance with unknown pattern of zeros.
However, while a lot of work has been done on inverse covariance estimation, the latter differs substantially from a graph Laplacian. For instance, the off-diagonal elements of a Laplacian must be non-positive, and, unlike the inverse covariance, the Laplacian is not invertible. Wang and Zhang (2008) learn a graph with normalized degrees by minimizing the objective $\sum_i \|x_i - \sum_j w_{ij} x_j\|^2$, but they assume a fixed k-NN edge pattern. Daitch, Kelner, and Spielman (2009) considered the similar objective $\|LX\|_F^2$ and approximately minimized it with a greedy algorithm and a relaxation. Zhang et al. (2010) alternate between problems (1) and a variation of (2). However, while they start from an initial graph Laplacian $L$, they finally learn an s.p.s.d. matrix that is not necessarily a valid Laplacian.

The works most relevant to ours are the ones by Lake and Tenenbaum (2010) and by Dong et al. (2015a,b). In the first one, the authors consider a problem similar to the one of inverse covariance estimation, but impose additional constraints in order to obtain a valid Laplacian. However, their final objective function contains many constraints and a computationally demanding log-determinant term that makes it difficult to solve. To the best of our knowledge, there is no scalable algorithm in the literature to solve their model. Dong et al. (2015b) propose a model that outperforms the one by Lake and Tenenbaum, but still do not provide a scalable algorithm. This work is complementary to theirs, as we not only compare against their model, but also provide an analysis and a scalable algorithm to solve it.

Contributions. In this paper we make the link between smoothness and sparsity. We show that the smoothness term can be equivalently seen as a weighted $\ell$-1 norm of the adjacency matrix, and minimizing it leads to naturally sparse graphs (Section 2).
Based on this, we formulate our objective as a weighted $\ell$-1 problem that we propose as a general framework for solving problem (2). Using this framework we propose a new model for learning a graph. We prove that our model has effectively one parameter that controls how sparse the learnt graph is (Section 4). We show how our framework includes the standard Gaussian kernel weight construction, but also the model by Dong et al. (2015b). We simplify their model and prove fundamental properties (Section 4). We provide a fast, scalable and convergent primal-dual algorithm to solve our proposed model, but also the one by Dong et al. To the best of our knowledge, these are the first scalable solutions in the literature to learn a graph under the smoothness assumption (2) (Section 5). To evaluate our model, we first review different definitions of smooth signals in the literature. We show how they can be unified under the notion of graph filtering (Section 3). We compare the models under artificial and real data settings. We conclude that our model is superior in many cases and achieves better connectivity when sparse graphs are sought (Section 6).

2 PROPERTIES OF THE LAPLACIAN

Throughout this paper we use the combinatorial graph Laplacian defined as $L = D - W$, where $D = \mathrm{diag}(W\mathbf{1})$ and $\mathbf{1} = [1, \ldots, 1]^\top$. The space of all valid combinatorial graph Laplacians is by definition
$$\mathcal{L} = \left\{ L \in \mathbb{R}^{m \times m} :\ (\forall i \neq j)\ L_{ij} = L_{ji} \le 0,\ L_{ii} = -\sum_{j \neq i} L_{ij} \right\}.$$
In order to learn a valid graph Laplacian, we might be tempted to search in the above space, as is done e.g. by Dong et al. (2015b); Lake and Tenenbaum (2010). We argue that it is more intuitive to search for a valid weighted adjacency matrix $W$ from the space
$$\mathcal{W}_m = \left\{ W \in \mathbb{R}_+^{m \times m} :\ W = W^\top,\ \mathrm{diag}(W) = 0 \right\},$$
leading to simplified problems.
Even more, when it comes to actually solving the problem by optimization techniques, we should consider the space of all valid edge weights of a graph,
$$\mathcal{W}_v = \left\{ w \in \mathbb{R}_+^{m(m-1)/2} \right\},$$
so that we do not have to deal with the symmetry of $W$ explicitly. The spaces $\mathcal{L}$, $\mathcal{W}_m$ and $\mathcal{W}_v$ are equivalent, and connected by bijective linear mappings. In this paper we use $\mathcal{W}_m$ to analyze the problem at hand and $\mathcal{W}_v$ when we solve the problem. Table 1 exhibits some of the equivalent forms in the three spaces.

2.1 Smooth manifold means graph sparsity

Let us define the pairwise distances matrix $Z \in \mathbb{R}_+^{m \times m}$: $Z_{i,j} = \|x_i - x_j\|^2$. Using this, we can rewrite the trace term as
$$\mathrm{tr}\left(X^\top L X\right) = \frac{1}{2}\mathrm{tr}(WZ) = \frac{1}{2}\|W \circ Z\|_{1,1}, \qquad (3)$$
where $\|A\|_{1,1}$ is the elementwise norm-1 of $A$ and $\circ$ is the Hadamard product (see Appendix).

Table 1: Equivalent terms for representations from the sets $\mathcal{L}$, $\mathcal{W}_m$, $\mathcal{W}_v$. We use $z = \mathrm{vectorform}(Z)$, and the linear operator $S$ that performs summation in the vector form.

$L \in \mathcal{L}$                          | $W \in \mathcal{W}_m$                 | $w \in \mathcal{W}_v$
$2\,\mathrm{tr}(X^\top L X)$                  | $\|W \circ Z\|_{1,1}$                 | $2 w^\top z$
$\mathrm{tr}(L)$                              | $\|W\|_{1,1}$                         | $2 w^\top \mathbf{1} = 2\|w\|_1$
--                                            | $\|W\|_F^2$                           | $2\|w\|_2^2$
$\mathrm{diag}(L)$                            | $W\mathbf{1}$                         | $Sw$
$\mathbf{1}^\top \log(\mathrm{diag}(L))$      | $\mathbf{1}^\top \log(W\mathbf{1})$   | $\mathbf{1}^\top \log(Sw)$
$\|L\|_F^2$                                   | $\|W\|_F^2 + \|W\mathbf{1}\|_2^2$     | $2\|w\|_2^2 + \|Sw\|_2^2$

In words, the smoothness term is a weighted $\ell$-1 norm of $W$, encoding weighted sparsity, that penalizes edges connecting distant rows of $X$. The interpretation is that when the given distances come from a smooth manifold, the corresponding graph has a sparse set of edges, preferring only the ones associated to small distances in $Z$. Explicitly adding a sparsity term $\gamma\|W\|_{1,1}$ to the objective function is a common tactic for inverse covariance estimation.
However, it brings little to our problem, as here it can be translated as merely adding a constant to the squared distances in $Z$:
$$\mathrm{tr}\left(X^\top L X\right) + \gamma\|W\|_{1,1} = \frac{1}{2}\|W \circ (2\gamma + Z)\|_{1,1}. \qquad (4)$$
Note that all information of $X$ conveyed by the trace term is contained in the pairwise distances matrix $Z$, so that the original data matrix could be omitted. Moreover, using the last term of eq. (3) instead of the trace enables us to define other kinds of distances instead of Euclidean. Note finally that the separate rows of $X$ do not have to be smooth signals in some sense. Two non-smooth signals $x_i$, $x_j$ can have a small distance between them, and therefore a small entry $Z_{i,j}$.

3 WHAT IS A SMOOTH SIGNAL?

Given a graph, different definitions of what is a smooth signal have been used in different contexts. In this section we unify these different definitions using the notion of filtering on graphs. For more information about signal processing on graphs we refer to the work of Shuman et al. (2013).

Filtering of a graph signal $x \in \mathbb{R}^m$ by a filter $h(\lambda)$ is defined as the operation¹
$$y = h(L)x = \sum_i u_i h(\lambda_i) u_i^\top x = \sum_i u_i h(\lambda_i) \hat{x}_i, \qquad (5)$$
where $\{u_i, \lambda_i\}$ are eigenvector-eigenvalue pairs of $L$, and $\hat{x} \in \mathbb{R}^m$ is the graph Fourier representation of $x$ containing its graph frequencies $\hat{x}_i \in \mathbb{R}$. Low frequencies correspond to small eigenvalues, and low-pass or smooth filters correspond to decaying functions $h$. In the sequel we show how different models for smooth signals in the literature can be written as smoothing problems of an initial non-smooth signal. We give an example of three different filters applied on the same signal in Figure 4 (Appendix).

¹ We denote by $h$ both the function $h: \mathbb{R} \to \mathbb{R}$ and its matrix counterpart $h: \mathbb{R}^{m \times m} \to \mathbb{R}^{m \times m}$ acting on the matrix's eigenvalues.

Smooth signals by Tikhonov regularization. Solving problem (1) leads to smooth signals.
By setting $g(x) = \frac{1}{\alpha}\|x - x_0\|^2$ we have a Tikhonov regularization problem that, given an arbitrary $x_0$ as input, gives its graph-smooth version $x = (\alpha L + I)^{-1} x_0$. Equivalently, we can see this as filtering $x_0$ by
$$h(\lambda) = \frac{1}{1 + \alpha\lambda}, \qquad (6)$$
where big $\alpha$ values result in smoother signals.

Smooth signals from a probabilistic generative model. Dong et al. (2015a) proposed that smooth signals can be generated from a colored Gaussian distribution as $x = \bar{x} + \sum_i u_i \hat{x}_i$, where $\hat{x}_i \sim \mathcal{N}(0, \lambda_i^\dagger)$ and $\dagger$ denotes the pseudoinverse. Therefore $x$ follows the distribution $x \sim \mathcal{N}(\bar{x}, L^\dagger)$. To sample from the above, it suffices to draw an initial non-smooth signal $x_0 \sim \mathcal{N}(0, I)$ and then compute
$$x = \bar{x} + h(L)x_0, \qquad (7)$$
with $h(L) = \sqrt{L^\dagger}$, or equivalently filter it by
$$h(\lambda) = \begin{cases} \sqrt{\lambda}^{-1}, & \lambda > 0 \\ 0, & \lambda = 0 \end{cases} \qquad (8)$$
and add the mean $\bar{x}$. We point out here that using eq. (7) on any $x_0 \sim \mathcal{N}(0, I)$ and for any filter $h(\lambda)$ would yield samples from $x \sim \mathcal{N}(\bar{x}, h(L)^2)$, therefore the probabilistic generative model can be used for any filter $h$. However, it does not cover cases where the initial $x_0$ is not white Gaussian.

Smooth signals by heat diffusion on graphs. Another type of smooth signals in the literature results from the process of heat diffusion on graphs. See for example the work by Zhang and Hancock (2008) for an application on image denoising by heat diffusion smoothing on the pixels graph. Given an initial signal $x_0$, the result of the heat diffusion on a graph after time $t$ is $x = \exp(-Lt)x_0$, therefore the corresponding filter is
$$h(\lambda) = \exp(-t\lambda), \qquad (9)$$
where bigger values of $t$ result in smoother signals.
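To make the filtering operation (5) concrete, here is a small NumPy sketch of my own (the helper `graph_filter` is not from the paper): it filters a white signal through the eigendecomposition of $L$, checks that the Tikhonov filter (6) indeed returns $(\alpha L + I)^{-1}x_0$, and that the heat filter (9) outputs a smoother signal than its input in the sense of a smaller Rayleigh quotient $x^\top L x / x^\top x$.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8

# Random weighted graph and its combinatorial Laplacian.
A = rng.random((m, m))
W = np.triu(A, 1) + np.triu(A, 1).T
L = np.diag(W.sum(axis=1)) - W
L /= np.linalg.norm(L, 2)  # normalize so eigenvalues lie in [0, 1]

def graph_filter(L, x0, h):
    """y = h(L) x0 computed through the graph Fourier transform, eq. (5)."""
    lam, U = np.linalg.eigh(L)
    return U @ (h(lam) * (U.T @ x0))

x0 = rng.standard_normal(m)

# Tikhonov filter (6) equals solving (alpha L + I) x = x0.
alpha = 10.0
x_tik = graph_filter(L, x0, lambda lam: 1.0 / (1.0 + alpha * lam))
assert np.allclose(x_tik, np.linalg.solve(alpha * L + np.eye(m), x0))

# Heat filter (9): decaying h shifts energy to low graph frequencies,
# so the Rayleigh quotient (smoothness measure) cannot increase.
x_heat = graph_filter(L, x0, lambda lam: np.exp(-10.0 * lam))
rq = lambda x: (x @ L @ x) / (x @ x)
assert rq(x_heat) <= rq(x0) + 1e-12
```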
4 LEARNING A GRAPH FROM SMOOTH SIGNALS

In order to learn a graph from smooth signals, we propose, as explained in Section 2, to rewrite problem (2) using the weighted adjacency matrix $W$ and the pairwise distance matrix $Z$ instead of $X$:
$$\underset{W \in \mathcal{W}_m}{\text{minimize}}\ \|W \circ Z\|_{1,1} + f(W). \qquad (10)$$
Since $W$ is positive we could replace the first term by $\mathrm{tr}(WZ)$, but we prefer this notation to keep in mind that our problem already has a sparsity term on $W$. This means that $f(W)$ has to play two important roles: (1) prevent $W$ from going to the trivial solution $W = 0$ and (2) impose further structure using prior information on $W$. This said, depending on $f$ the solution is expected to be sparse, which is important for large scale applications.

In order to motivate this general graph learning framework, we show that the most standard weight construction, as well as the state-of-the-art graph learning model, are special cases thereof.

4.1 Classic Laplacian computations

In the literature, one of the most common practices is to construct edge weights given $X$ from the Gaussian function
$$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right). \qquad (11)$$
It turns out that this choice of weights can be seen as the result of solving problem (10) with a specific prior on the weights $W$:

Proposition 1. The solution of the problem
$$\underset{W \in \mathcal{W}_m}{\text{minimize}}\ \|W \circ Z\|_{1,1} + 2\sigma^2 \sum_{ij} W_{ij}\left(\log(W_{ij}) - 1\right)$$
is given by eq. (11).

Proof. The problem is edge separable and the objective can be written as $\sum_{i,j} W_{ij} Z_{ij} + 2\sigma^2 W_{ij}(\log(W_{ij}) - 1)$. Deriving w.r.t. $W_{ij}$ we obtain the optimality condition $Z_{ij} + 2\sigma^2 \log(W_{ij}) = 0$, or $W_{ij} = \exp(-Z_{ij}/(2\sigma^2))$, which proves the proposition.

Note that here, the logarithm in $f$ prevents the weights from going to 0, leading to full matrices, and sparsification has to be imposed explicitly afterwards.
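Proposition 1 is easy to sanity-check numerically. In the sketch below (my own, not from the paper), the Gaussian weights of eq. (11) satisfy the first-order optimality condition of the separable objective, and any nearby rescaling increases the objective value:

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma = 6, 0.7

X = rng.standard_normal((m, 3))
Z = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances

def objective(W):
    # ||W o Z||_{1,1} + 2 sigma^2 sum_ij W_ij (log(W_ij) - 1)
    # (diagonal entries included for simplicity of the separable check).
    return (W * Z).sum() + 2 * sigma**2 * (W * (np.log(W) - 1)).sum()

W_star = np.exp(-Z / (2 * sigma**2))  # eq. (11)

# Stationarity: Z_ij + 2 sigma^2 log(W_ij) = 0 elementwise.
assert np.allclose(Z + 2 * sigma**2 * np.log(W_star), 0)

# Any small rescaling of the solution increases the objective.
for t in (0.9, 0.99, 1.01, 1.1):
    assert objective(W_star) < objective(t * W_star)
```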
4.2 Our proposed model

Based on our framework (10), our goal is to give a general purpose model for learning graphs when no prior information is available. In order to obtain meaningful graphs, we want to make sure that each node has at least one edge with another node. It is also desirable to have control of how sparse the resulting graph is. To meet these expectations, we propose the following model with parameters $\alpha > 0$ and $\beta \ge 0$ controlling the shape of the edges:
$$\underset{W \in \mathcal{W}_m}{\text{minimize}}\ \|W \circ Z\|_{1,1} - \alpha \mathbf{1}^\top \log(W\mathbf{1}) + \beta\|W\|_F^2. \qquad (12)$$
The logarithmic barrier acts on the node degree vector $W\mathbf{1}$, unlike the model of Proposition 1 that has a similar barrier on the edges. This means that it forces the degrees to be positive, but does not prevent edges from becoming zero. This improves the overall connectivity of the graph, without compromising sparsity. Note however, that adding solely a logarithmic term ($\beta = 0$) leads to very sparse graphs, and changing $\alpha$ only changes the scale of the solution and not the sparsity pattern (Proposition 2 for $\beta = 0$). For this reason, we add the third term.

We showed in eq. (4) that adding an $\ell$-1 norm to control sparsity is not very useful. On the other hand, by adding a Frobenius norm we penalize the formation of big edges but do not penalize smaller ones. This leads to denser edge patterns for bigger values of $\beta$. An interesting property of our model is that even if it has two terms shaping the weights, if we fix the scale we then need to search for only one parameter:

Proposition 2. Let $F(Z, \alpha, \beta)$ denote the solution of our model (12) for input distances $Z$ and parameters $\alpha$, $\beta$. Then the following property holds for any $\gamma > 0$:
$$F(Z, \alpha, \beta) = \gamma F\left(Z, \frac{\alpha}{\gamma}, \beta\gamma\right) = \alpha F(Z, 1, \alpha\beta). \qquad (13)$$

Proof. See appendix.
This means that, for example, if we want to obtain a $W$ with a fixed scale $\|W\| = s$ (for any norm), we can solve the problem with $\alpha = 1$, search only for a parameter $\beta$ that gives the desired edge density, and then normalize the graph by the norm we have chosen.

The main advantage of our model over the method by Dong et al. (2015a) is that it promotes connectivity by putting a log barrier directly on the node degrees. Even for $\beta = 0$, we obtain the sparsest solution possible that assigns at least one edge to each node. In this case, the distant nodes will have smaller degrees (because of the first term), but still be connected to their closest neighbour, similarly to a 1-NN graph.

4.3 Fitting the state of the art in our framework

Dong et al. (2015a) proposed the following model for learning a graph:
$$\underset{L \in \mathcal{L}}{\text{minimize}}\ \mathrm{tr}\left(X^\top L X\right) + \alpha\|L\|_F^2, \quad \text{s.t. } \mathrm{tr}(L) = s.$$
Parameter $s > 0$ controls the scale (Dong et al. set it to $m$), and parameter $\alpha \ge 0$ controls the density of the solution. This formulation has two weaknesses. First, using a Frobenius norm on the Laplacian has reduced interpretability: the elements of $L$ are not only of different scales, but also linearly dependent. Secondly, optimizing it is difficult as it has 4 constraints on $L$: 3 in order to constrain $L$ in the space $\mathcal{L}$, and one to keep the trace constant. We propose to solve their model using our framework: using the transformations of Table 1, we obtain the equivalent simplified model
$$\underset{W \in \mathcal{W}_m}{\text{minimize}}\ \|W \circ Z\|_{1,1} + \alpha\|W\mathbf{1}\|_2^2 + \alpha\|W\|_F^2, \quad \text{s.t. } \|W\|_{1,1} = s. \qquad (14)$$
Using this parametrization, solving the problem becomes much simpler, as we show in Section 5. Note that for $\alpha = 0$ we have a linear program that assigns weight $s$ to the edge corresponding to the smallest pairwise distance in $Z$, and zero everywhere else.
On the other hand, setting $\alpha$ to big values, we penalize big degrees (through the second term), and in the limit $\alpha \to \infty$ we obtain a dense graph with constant degrees across nodes. We can also prove some interesting properties of (14):

Proposition 3. Let $H(Z, \alpha, s)$ denote the solution of model (14) for input distances $Z$ and parameters $\alpha$ and $s$. Then for $\gamma > 0$ the following properties hold:
$$H(Z + \gamma, \alpha, s) = H(Z, \alpha, s) \qquad (15)$$
$$H(Z, \alpha, s) = \gamma H\left(Z, \alpha\gamma, \frac{s}{\gamma}\right) = s H(Z, \alpha s, 1) \qquad (16)$$

Proof. See appendix.

In other words, model (14) is invariant to adding any constant to the squared distances. The second property means that, similarly to our model, the scale of the solution does not change the shape of the connectivity. If we fix the scale to $s$, we obtain the whole range of edge shapes given by $H$ only by changing parameter $\alpha$.

5 OPTIMIZATION

An advantage of using the formulation of problem (10) is that it can be solved efficiently for a wide range of choices of $f(W)$. We use primal-dual techniques that scale, like the ones reviewed by Komodakis and Pesquet (2014), to solve the two state-of-the-art models: the one we propose and the one by Dong et al. (2015b). Using these as examples, it is easy to solve many interesting models from the general framework (10). In order to make optimization easier, we use the vector form representation from the space $\mathcal{W}_v$ (see Table 1), so that symmetry does not have to be imposed as a constraint. We write the problem as a sum of three functions in order to fit it to the primal-dual algorithms reviewed by Komodakis and Pesquet (2014). The general form of our objective is
$$\underset{w \in \mathcal{W}_v}{\text{minimize}}\ f_1(w) + f_2(Kw) + f_3(w), \qquad (17)$$
where $f_1$ and $f_2$ are functions for which we can efficiently compute proximal operators, and $f_3$ is differentiable with a gradient that has Lipschitz constant $\zeta \in (0, \infty)$.
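As a concrete instance of the three-function splitting (17), the sketch below is my own NumPy transcription (not the authors' Matlab code) of a forward-backward-forward primal-dual iteration for model (12) in vector form, with $f_1(w) = \mathbb{1}\{w \ge 0\} + 2w^\top z$, $f_2$ the log barrier on the degrees $d = Sw$, and $f_3(w) = \beta\|w\|^2$; the helper names (`learn_graph`, `degree_operator`) and the step-size choice are assumptions of this sketch.

```python
import numpy as np

def vector_form(Z):
    """Upper-triangular vector form z of a pairwise distance matrix Z."""
    return Z[np.triu_indices(Z.shape[0], k=1)]

def degree_operator(m):
    """Dense matrix S with S @ w = W @ 1, for w the vector form of W."""
    S = np.zeros((m, m * (m - 1) // 2))
    for e, (i, j) in enumerate(zip(*np.triu_indices(m, k=1))):
        S[i, e] = S[j, e] = 1.0
    return S

def learn_graph(z, alpha, beta, n_iter=20000, tol=1e-8):
    """Forward-backward-forward primal-dual iteration for the vector form
    of model (12): min_{w >= 0} 2 w'z - alpha 1'log(Sw) + beta ||w||^2."""
    m = int(round((1 + np.sqrt(1 + 8 * z.size)) / 2))
    S = degree_operator(m)
    # Step size below 1/(zeta + ||S||), with zeta = 2 beta, ||S|| = sqrt(2(m-1)).
    gamma = 0.9 / (2 * beta + np.sqrt(2 * (m - 1)))
    w, d = np.ones(z.size), np.zeros(m)
    for _ in range(n_iter):
        y = w - gamma * (2 * beta * w + S.T @ d)
        ybar = d + gamma * (S @ w)
        p = np.maximum(0.0, y - 2 * gamma * z)                     # prox of f1
        pbar = (ybar - np.sqrt(ybar**2 + 4 * alpha * gamma)) / 2   # prox of f2*
        q = p - gamma * (2 * beta * p + S.T @ pbar)
        qbar = pbar + gamma * (S @ p)
        w_new, d_new = w - y + q, d - ybar + qbar
        if np.linalg.norm(w_new - w) < tol * max(np.linalg.norm(w), 1e-12):
            w = w_new
            break
        w, d = w_new, d_new
    return w

# Tiny example: four points on a line. The closest pair should receive the
# largest weight, and the log barrier keeps every node connected.
x = np.array([0.0, 0.1, 0.2, 5.0])
Z = (x[:, None] - x[None, :]) ** 2
z = vector_form(Z)
w = learn_graph(z, alpha=1.0, beta=0.1)
S = degree_operator(4)
assert (w >= 0).all()
assert (S @ w > 1e-4).all()            # every node keeps a positive degree
assert w[np.argmin(z)] >= w[np.argmax(z)]
```

The dual variable `d` lives on the nodes and, at convergence, the primal iterate `w` gives the learned edge weights; distant nodes end up with small but strictly positive degrees, as discussed in Section 4.2.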
$K$ is a linear operator, so $f_2$ is defined on the dual variable $Kw$. In the sequel we explain how this general optimization framework can be applied to the two models of interest, leaving the details to the Appendix. For a better understanding of primal-dual optimization or proximal splitting methods we refer the reader to the works of Combettes and Pesquet (2011); Komodakis and Pesquet (2014).

In our model, the second term acts on the degrees of the nodes, which are a linear function of the edge weights. Therefore we use $K = S$, where $S$ is the linear operator that satisfies $W\mathbf{1} = Sw$ if $w$ is the vector form of $W$. In the first term we group the positivity constraint of $\mathcal{W}_v$ and the weighted $\ell$-1 term, and the second and third terms are the priors for the degrees and the edges respectively. In order to solve our model we define
$$f_1(w) = \mathbb{1}\{w \ge 0\} + 2w^\top z, \quad f_2(d) = -\alpha\mathbf{1}^\top\log(d), \quad f_3(w) = \beta\|w\|^2, \text{ with } \zeta = 2\beta,$$
where $\mathbb{1}\{\cdot\}$ is the indicator function that becomes zero when the condition in the brackets is satisfied, infinite otherwise. Note that the second function $f_2$ is defined on the dual variable $d = Sw \in \mathbb{R}^m$, which here is, very conveniently, the vector of the node degrees. For model (14) we can define in a similar way
$$f_1(w) = \mathbb{1}\{w \ge 0\} + 2w^\top z, \quad f_2(c) = \mathbb{1}\{c = s\}, \quad f_3(w) = \alpha\left(2\|w\|^2 + \|Sw\|^2\right), \text{ with } \zeta = 2\alpha(m+1),$$
and use $K = 2 \cdot \mathbf{1}^\top$ so that the dual variable is $c = Kw = \|W\|_{1,1}$, constrained by $f_2$ to be equal to $s$.

Algorithm 1 Primal-dual algorithm for model (12).
1: Input: $z$, $\alpha$, $\beta$, $w^0 \in \mathcal{W}_v$, $d^0 \in \mathbb{R}_+^m$, $\gamma$, tolerance $\epsilon$
2: for $i = 1, \ldots, i_{\max}$ do
3:   $y_i = w_i - \gamma(2\beta w_i + S^\top d_i)$
4:   $\bar{y}_i = d_i + \gamma(S w_i)$
5:   $p_i = \max(0, y_i - 2\gamma z)$
6:   $\bar{p}_i = \left(\bar{y}_i - \sqrt{\bar{y}_i^2 + 4\alpha\gamma}\right)/2$   (elementwise)
7:   $q_i = p_i - \gamma(2\beta p_i + S^\top \bar{p}_i)$
8:   $\bar{q}_i = \bar{p}_i + \gamma(S p_i)$
9:   $w_i = w_i - y_i + q_i$
10:  $d_i = d_i - \bar{y}_i + \bar{q}_i$
11:  if $\|w_i - w_{i-1}\| / \|w_{i-1}\| < \epsilon$ and
12:     $\|d_i - d_{i-1}\| / \|d_{i-1}\| < \epsilon$ then
13:    break
14:  end if
15: end for

Using these functions, the final algorithm for our model is given as Algorithm 1, and for the model by Dong et al. as Algorithm 2 in the Appendix. Vector $z \in \mathbb{R}_+^{m(m-1)/2}$ is the vector form of $Z$, and parameter $\gamma \in (0, 1/(1 + \zeta + \|K\|))$ is the step size.

5.1 Complexity and Convergence

Both algorithms that we propose have a complexity of $O(m^2)$ per iteration for graphs of $m$ nodes, and they can easily be parallelized. As the objective functions of both models are proper, convex, and lower-semicontinuous, our algorithms are guaranteed to converge to the minimum (Komodakis and Pesquet 2014).

6 EXPERIMENTS

We compare our model against the state-of-the-art model by Dong et al. (2015a), solved by our Algorithm 2, on both artificial and real data. Comparing to the model by Lake and Tenenbaum (2010) was not possible even for the small graphs of our artificial experiments, as there is no scalable algorithm in the literature and the use of CVX with the log-determinant term is prohibitive. Other models based on the log-det term, for which scalable algorithms exist, are irrelevant to our problem, as a sparse inverse covariance is not a valid Laplacian, and are known to not perform well for our setting (see Dong et al. (2015a) for a comparison).

6.1 Artificial data

The difficulty of solving problem (10) depends both on the quality of the graph behind the data and on the type of smoothness of the signals. We test 4 different types of graphs using 3 different types of signals.

Graph Types.
We use two 2-D manifold based graphs, one uniformly and one non-uniformly sampled, and two graphs that are not manifold structured:

1. Random Geometric Graph (RGG): We sample $x$ uniformly from $[0,1]^2$ and connect nodes using eq. (11) with $\sigma = 0.2$, then threshold weights smaller than 0.6.
2. Non-uniform: We sample $x$ in $[0,1] \times [0,5]$ from a non-uniform distribution $p_{x_1, x_2} \propto 1/(1 + \alpha x_2)$ and connect nodes using eq. (11) with $\sigma = 0.2$. We threshold weights smaller than the best connection of the most distant node ($\approx 0.01$).
3. Erdős-Rényi: Random graph as proposed by Gilbert (1959) ($p = 3/m$).
4. Barabási-Albert: Random scale-free graph with preferential attachment as proposed by Barabási and Albert (1999) ($m_0 = 1$, $m = 2$).

Signal Types. To create a smooth signal we filter a Gaussian i.i.d. $x_0$ by eq. (5), using one of the three filter types of Section 3. We normalize the Laplacian ($\|L\|_2 = 1$) so that the filters $g(\lambda)$ are defined for $\lambda \in [0,1]$. See Table 3 (Appendix) for a summary.

1. Tikhonov: $g(\lambda) = \frac{1}{1 + 10\lambda}$ as in eq. (6).
2. Generative Model: $g(\lambda) = 1/\sqrt{\lambda}$ if $\lambda > 0$, $g(0) = 0$, from the model of eq. (8) ($\bar{x} = 0$).
3. Heat Diffusion: $g(\lambda) = \exp(-10\lambda)$ as in eq. (9).

For all cases we use $m = 100$ nodes, smooth signals of length $n = 1000$, and add 10% ($\ell$-2 sense) noise before computing pairwise distances. We perform a grid search to find the best parameters for each model. We repeat the experiment 20 times for each case and report the average result of the parameter value that performs best for each of the different metrics.

Metrics. Since we have the ground truth graphs for each case, we can measure directly the relative edge error in the $\ell$-1 and $\ell$-2 sense. We also report the relative error of the weighted degrees $d_i = \sum_j W_{ij}$. This is important because both models are based on priors on the degrees, as we show in Section 4.
We also report the F-measure (harmonic mean of edge precision and recall), which only takes into account the binary pattern of existing edges and not the weights.

Baselines. The baseline for the relative errors is a classic graph construction using equation (11) with a grid search for the best $\sigma$. Note that this exact equation was used to create the first two artificial datasets. However, using a fully connected graph with the F-measure does not make sense. For this metric the baseline is set to the best edge pattern found by thresholding (11) with different thresholds.

Table 2: Performance of different models on artificial data.

                              Tikhonov                  Generative Model          Heat Diffusion
                        base  Dong etal  Ours     base  Dong etal  Ours     base  Dong etal  Ours
Rand. Geometric
  F-measure            0.685   0.885    0.913    0.686   0.877    0.909    0.758   0.837    0.849
  edge l-1             0.866   0.357    0.298    0.798   0.371    0.348    0.609   0.524    0.447
  edge l-2             0.676   0.376    0.336    0.658   0.397    0.390    0.576   0.531    0.468
  degree l-1           0.142   0.146    0.065    0.261   0.147    0.112    0.209   0.227    0.142
  degree l-2           0.708   0.172    0.079    0.689   0.174    0.128    0.474   0.264    0.176
Non Uniform
  F-measure            0.686   0.863    0.858    0.633   0.840    0.832    0.766   0.839    0.830
  edge l-1             0.821   0.423    0.349    0.864   0.487    0.472    0.594   0.565    0.473
  edge l-2             0.706   0.434    0.344    0.735   0.480    0.474    0.550   0.587    0.451
  degree l-1           0.160   0.184    0.055    0.235   0.185    0.100    0.233   0.255    0.128
  degree l-2           0.612   0.209    0.073    0.632   0.215    0.161    0.427   0.324    0.157
Erdős-Rényi
  F-measure            0.288   0.766    0.893    0.199   0.755    0.896    0.377   0.629    0.655
  edge l-1             1.465   0.448    0.391    1.566   0.478    0.427    1.379   0.832    0.841
  edge l-2             1.060   0.442    0.402    1.105   0.457    0.440    1.033   0.735    0.726
  degree l-1           0.094   0.107    0.046    0.099   0.105    0.066    0.182   0.179    0.183
  degree l-2           0.986   0.161    0.066    1.312   0.181    0.151    0.892   0.236    0.273
Barabási-Albert
  F-measure            0.345   0.710    0.868    0.382   0.739    0.838    0.352   0.690    0.765
  edge l-1             1.531   0.614    0.533    1.496   0.652    0.624    1.468   0.740    0.675
  edge l-2             1.061   0.568    0.506    1.036   0.611    0.571    1.041   0.662    0.590
  degree l-1           0.175   0.264    0.111    0.199   0.264    0.207    0.254   0.317    0.148
  degree l-2           0.554   0.340    0.201    0.556   0.333    0.287    0.568   0.414    0.283

Table 2 summarizes all the results for different combinations of graphs/signals. In most of them, our model performs better for all metrics. We can see that the signals constructed following the generative model (7) do not yield better results in terms of graph reconstruction. Using smoother "Tikhonov" signals from eq. (6) or "Heat Diffusion" signals from (9), by setting the filter parameter to 20, yielded slightly worse results in both cases (not reported here). It also seems that the results are slightly better for the manifold related graphs than for the Erdős-Rényi and Barabási-Albert models, an effect that is more prevalent when we use smooth signals of length $n = 100$ instead of 1000 (c.f. Table 4 of the Appendix). This would be interesting to investigate theoretically.

6.2 Real data

We also evaluate the performance of our model on real data. In this case, the actual ground truth graph is not known.
We therefore measure the performance of different models on spectral clustering and label propagation, two algorithms that depend solely on the graph. Note that an explicit Laplacian normalization is not needed for the learned models (it is even harmful, as found experimentally), since this role is already played by the regularization.

Learning the graph of USPS digits. We first learn the graph connecting 1001 different images of the USPS dataset, which are images of digits from 0 to 9 (10 classes). We follow Zhu, Ghahramani, Lafferty, et al. (2003) and sample the class sizes non-uniformly. For each class $i \in \{1 \ldots 10\}$ we take $\mathrm{round}(2.6 i^2)$ images, resulting in classes with sizes from 3 to 260 images each. We learn graphs of different densities using both models. As baseline we use a k-Nearest Neighbors (k-NN) graph for different $k$. For each of the graphs, we run standard spectral clustering (as in the work of Ng, Jordan, Weiss, et al. 2002, but without normalizing the Laplacian) with k-means 100 times. For label propagation, we choose 100 times a different subset of 10% known labels.

In Fig. 1 we plot the behavior of the different models for different density levels. The horizontal axis is the average number of non-zero edges per node. In the left plot we see the clustering quality. Even though the best result of both algorithms is almost the same (0.24 vs 0.25), our model is more robust in terms of the graph density choice. A similar behavior is exhibited for label propagation, plotted in the middle. The classification quality is better for our model at the sparser graph density levels.

The robustness of our model for small graph densities can be explained by the connectivity quality plotted in the right.
The continuous lines are the number of different connected components in the learned graphs, which is a measure of connectivity: the fewer components there are, the better connected is the graph. The dashed blue line is the number of disconnected nodes of the model by Dong et al. 2015a. The latter fails to assign connections to the most distant nodes, unless the density of the graph reaches a fairly high level. If we want a graph with 6 edges per node, our model returns a graph with 3 components and no disconnected nodes. The model by Dong et al. returns a graph with 35 components, out of which 22 are disconnected nodes.

Figure 1: Graph learned from 1001 USPS images. Left: Clustering quality. Middle: Label propagation quality. Right: Number of connected components (continuous lines) and number of disconnected nodes for the model by Dong et al. (blue dashed line). Our model and k-NN have no disconnected nodes.
Figure 2: Label propagation for the problem "1" vs. "2" of MNIST with different class size proportions: 1 to 4 (left), 1 to 1 (middle), or 4 to 1 (right). Misclassification rate for different numbers of edges per node.

Note that in real applications, where the best density level is not known a priori, it is important for a graph learning model to perform well at sparse levels. This is especially the case for large scale applications, where more edges mean more computations.

Time: Algorithm 1, implemented in Matlab², learned a 10-edge/node graph of 1001 USPS images in 5 seconds (218 iterations), and Algorithm 2 in 1 minute (2043 iterations), on a standard PC for tolerance ε = 1e-4.

Learning the graph of MNIST 1 vs 2

To demonstrate the different behavior of the two models in non-uniform sampling cases, we use the problem of classification between digits 1 and 2 of the MNIST dataset. This problem is particular because digits "1" are close to each other (average square distance of 45), while digits "2" differ more from each other (average square distance of 102). In Figure 2 we report the average misclassification rate for different class size proportions, with 40 1's and 160 2's (left), 100 1's and 100 2's (middle), or 160 1's and 40 2's (right).
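The label propagation used throughout follows Zhu, Ghahramani, Lafferty, et al. (2003); a compact numpy rendering of the harmonic-function solution (our sketch, assuming a dense weight matrix and integer labels with -1 marking unlabeled nodes):

```python
import numpy as np

def label_propagation(W, labels, labeled_mask):
    """Harmonic-function label propagation (Zhu et al. 2003):
    the unlabeled scores f_u solve L_uu f_u = W_ul Y_l.
    Note: L_uu is singular when a component contains no labeled
    nodes; those are the 'unclassifiable' nodes of the figures."""
    classes = np.unique(labels[labeled_mask])
    L = np.diag(W.sum(axis=1)) - W                    # L = D - W
    u = ~labeled_mask
    # one-hot targets on the labeled nodes
    Y = (labels[labeled_mask][:, None] == classes[None, :]).astype(float)
    f_u = np.linalg.solve(L[np.ix_(u, u)], W[np.ix_(u, labeled_mask)] @ Y)
    out = labels.copy()
    out[u] = classes[np.argmax(f_u, axis=1)]
    return out
```

On a path graph 0-1-2-3 with the endpoints labeled 0 and 1, the harmonic solution assigns node 1 to class 0 and node 2 to class 1.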
Results are averaged over 40 random draws. The dashed lines denote the number of nodes contained in components without labeled nodes, which cannot be classified. In this case, the model of Dong et al. (2015a) fails to recover edges between different digits "2" unless the returned graph is fairly dense, unlike our model, which even at very sparse graph levels treats the different classes more fairly. The effect is stronger when the set of 2's is also the smaller of the two.

² Code for both models is available as part of the open-source toolbox GSPBox by Perraudin et al. (2014b), using code from UNLocBoX, Perraudin et al. (2014a).

7 CONCLUSION

We introduce a new way of addressing the problem of learning a graph under the assumption that $\mathrm{tr}(X^\top L X)$ is small. We show how the problem can be simplified into a weighted sparsity problem, which implies a general framework for learning a graph. We show that the standard Gaussian weight construction from distances is a special case of this framework. We propose a new model for learning a graph, and provide an analysis of the state-of-the-art model of Dong et al. (2015a) that also fits our framework. The new formulation enables us to propose a fast and scalable primal-dual algorithm for our model, but also for the one of Dong et al. (2015a), which was missing from the literature. Our experiments suggest that when sparse graphs are to be learned but connectivity is crucial, our model is expected to outperform the current state of the art. We hope not only that our solution will be used in the many applications that require good quality graphs, but also that our framework will trigger the definition of new graph learning models targeting specific applications.
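The simplification mentioned above rests on the identity $\|W \circ Z\|_{1,1} = 2\,\mathrm{tr}(X^\top L X)$ for pairwise squared distances $Z_{ij} = \|x_i - x_j\|^2$, derived in Appendix A.1; it is easy to verify numerically (a quick sanity check of ours, not part of the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4
X = rng.standard_normal((m, n))

# symmetric non-negative weights with zero diagonal
A = rng.random((m, m))
W = np.triu(A, 1)
W = W + W.T

# pairwise squared distances Z_ij = ||x_i - x_j||^2
Z = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

D = np.diag(W.sum(axis=1))
L = D - W

lhs = np.sum(W * Z)                   # ||W o Z||_{1,1}
rhs = 2 * np.trace(X.T @ L @ X)
assert np.isclose(lhs, rhs)
```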
Acknowledgements

The author would like to especially thank Pierre Vandergheynst and Nikolaos Arvanitopoulos for their constructive comments on the organization of the paper and the experimental evaluation. He is also grateful to the authors of Dong et al. (2015a) for sharing their code, to Nathanaël Perraudin and Nauman Shahid for discussions when developing the initial idea, and to Andreas Loukas for his comments on the final version.

References

Banerjee, Onureena, Laurent El Ghaoui, and Alexandre d'Aspremont (2008). "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data". In: The Journal of Machine Learning Research 9, pp. 485–516.

Barabási, Albert-László and Réka Albert (1999). "Emergence of scaling in random networks". In: Science 286.5439, pp. 509–512.

Belkin, Mikhail and Partha Niyogi (2001). "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering". In: NIPS. Vol. 14, pp. 585–591.

Belkin, Mikhail, Partha Niyogi, and Vikas Sindhwani (2006). "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples". In: The Journal of Machine Learning Research 7, pp. 2399–2434.

Cai, Deng et al. (2011). "Graph regularized nonnegative matrix factorization for data representation". In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.8, pp. 1548–1560.

Combettes, Patrick L and Jean-Christophe Pesquet (2011). "Proximal splitting methods in signal processing". In: Fixed-point algorithms for inverse problems in science and engineering. Springer, pp. 185–212.

Daitch, Samuel I, Jonathan A Kelner, and Daniel A Spielman (2009). "Fitting a graph to vector data". In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, pp. 201–208.

Dempster, Arthur P (1972). "Covariance selection". In: Biometrics, pp. 157–175.

Dong, Xiaowen et al. (2015a). "Laplacian Matrix Learning for Smooth Graph Signal Representation". In: Proceedings of IEEE ICASSP.

Dong, Xiaowen et al. (2015b). "Learning Laplacian Matrix in Smooth Graph Signal Representations". In: arXiv preprint arXiv:1406.7842v2.

Gilbert, Edgar N (1959). "Random graphs". In: The Annals of Mathematical Statistics, pp. 1141–1144.

Jiang, Bo et al. (2013). "Graph-Laplacian PCA: Closed-form solution and robustness". In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, pp. 3492–3498.

Kalofolias, Vassilis et al. (2014). "Matrix completion on graphs". In: arXiv preprint arXiv:1408.1717.

Komodakis, Nikos and Jean-Christophe Pesquet (2014). "Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems". In: arXiv preprint arXiv:1406.5429.

Lake, Brenden and Joshua Tenenbaum (2010). "Discovering structure by learning sparse graph". In: Proceedings of the 33rd Annual Cognitive Science Conference. Citeseer.

Ng, Andrew Y, Michael I Jordan, Yair Weiss, et al. (2002). "On spectral clustering: Analysis and an algorithm". In: Advances in Neural Information Processing Systems 2, pp. 849–856.

Perraudin, N. et al. (Feb. 2014a). "UNLocBoX: A matlab convex optimization toolbox using proximal splitting methods". In: ArXiv e-prints. arXiv: 1402.0779.

Perraudin, Nathanaël et al. (Aug. 2014b). "GSPBOX: A toolbox for signal processing on graphs". In: ArXiv e-prints. arXiv: 1408.5781 [cs.IT].

Shuman, David et al. (2013). "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains". In: Signal Processing Magazine, IEEE 30.3, pp. 83–98.

Wang, Fei and Changshui Zhang (2008). "Label propagation through linear neighborhoods". In: Knowledge and Data Engineering, IEEE Transactions on 20.1, pp. 55–67.
Zhang, Fan and Edwin R Hancock (2008). "Graph spectral image smoothing using the heat kernel". In: Pattern Recognition 41.11, pp. 3328–3342.

Zhang, Tong, Alexandrin Popescul, and Byron Dom (2006). "Linear prediction models with graph regularization for web-page categorization". In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 821–826.

Zhang, Yan-Ming et al. (2010). "Transductive Learning on Adaptive Graphs". In: AAAI.

Zheng, Miao et al. (2011). "Graph regularized sparse coding for image representation". In: Image Processing, IEEE Transactions on 20.5, pp. 1327–1336.

Zhu, Xiaojin, Zoubin Ghahramani, John Lafferty, et al. (2003). "Semi-supervised learning using Gaussian fields and harmonic functions". In: ICML. Vol. 3, pp. 912–919.

Figure 3: The filters of Table 3 for α = 10.

Figure 4: Different smooth signals on the Non-Uniform graph used for our artificial data experiments. All signals are obtained by smoothing the same initial $x_0 \sim \mathcal{N}(0, I)$ with three different filters. This instance of the graph is disconnected, with 2 components.

A Derivations and proofs

A.1 Detailed explanation of eq. (3)

$$
\begin{aligned}
\|W \circ Z\|_{1,1}
&= \sum_{i=1}^m \sum_{j=1}^m W_{ij} \|x_i - x_j\|_2^2
 = \sum_{i=1}^m \sum_{j=1}^m (x_i - x_j)^\top W_{ij} (x_i - x_j) \\
&= 2 \sum_{i=1}^m \sum_{j=1}^m x_i^\top W_{ij} x_i - 2 \sum_{i=1}^m \sum_{j=1}^m x_i^\top W_{ij} x_j \\
&= 2 \sum_{i=1}^m x_i^\top x_i \sum_{j=1}^m W_{ij} - 2\,\mathrm{tr}\!\left(X^\top W X\right) \\
&= 2\,\mathrm{tr}\!\left(X^\top D X\right) - 2\,\mathrm{tr}\!\left(X^\top W X\right)
 = 2\,\mathrm{tr}\!\left(X^\top L X\right),
\end{aligned}
$$

where $D$ is the diagonal matrix with elements $D_{ii} = \sum_j W_{ij}$.

A.2 Proof of proposition 2

Proof.
We change variable $\tilde W = W/\gamma$ to obtain

$$
\begin{aligned}
F(Z, \alpha, \beta)
&= \gamma \operatorname*{arg\,min}_{\tilde W}\; \|\gamma \tilde W \circ Z\|_{1,1} - \alpha \mathbf{1}^\top \log(\gamma \tilde W \mathbf{1}) + \beta \|\gamma \tilde W\|_F^2 \\
&= \gamma \operatorname*{arg\,min}_{\tilde W}\; \gamma \|\tilde W \circ Z\|_{1,1} - \alpha \mathbf{1}^\top \log(\tilde W \mathbf{1}) + \beta \gamma^2 \|\tilde W\|_F^2 \\
&= \gamma \operatorname*{arg\,min}_{\tilde W}\; \|\tilde W \circ Z\|_{1,1} - \frac{\alpha}{\gamma} \mathbf{1}^\top \log(\tilde W \mathbf{1}) + \beta \gamma \|\tilde W\|_F^2 \\
&= \gamma\, F\!\left(Z, \frac{\alpha}{\gamma}, \beta\gamma\right),
\end{aligned}
$$

where we used the fact that $\log(\gamma \tilde W \mathbf{1}) = \log(\tilde W \mathbf{1}) + \mathrm{const.}$ w.r.t. $W$. The second equality of the proposition is obtained from the first one for $\gamma = \alpha$.

A.3 Proof of proposition 3

Proof. For equation (15),

$$
\begin{aligned}
H(Z + \gamma, \alpha, s)
&= \operatorname*{arg\,min}_{W \in \mathcal{W}_m}\; \|W \circ Z + \gamma W\|_{1,1} + \alpha \|W\|_F^2 + \alpha \|W\mathbf{1}\|^2 \quad \text{s.t. } \|W\|_{1,1} = s \\
&= \operatorname*{arg\,min}_{W \in \mathcal{W}_m}\; \|W \circ Z\|_{1,1} + \gamma \|W\|_{1,1} + \alpha \|W\|_F^2 + \alpha \|W\mathbf{1}\|^2 \quad \text{s.t. } \|W\|_{1,1} = s \\
&= \operatorname*{arg\,min}_{W \in \mathcal{W}_m}\; \|W \circ Z\|_{1,1} + \gamma s + \alpha \|W\|_F^2 + \alpha \|W\mathbf{1}\|^2 \quad \text{s.t. } \|W\|_{1,1} = s \\
&= H(Z, \alpha, s),
\end{aligned}
$$

because $\|a + b\|_1 = \|a\|_1 + \|b\|_1$ for positive $a$, $b$.

For equation (16) we change variable in the optimization, $\tilde W = W/\gamma$, to obtain

$$
\begin{aligned}
H(Z, \alpha, s)
&= \gamma \operatorname*{arg\,min}_{\tilde W \in \mathcal{W}_m}\; \|\gamma \tilde W \circ Z\|_{1,1} + \alpha \|\gamma \tilde W\|_F^2 + \alpha \|\gamma \tilde W \mathbf{1}\|^2 \quad \text{s.t. } \|\gamma \tilde W\|_{1,1} = s \\
&= \gamma \operatorname*{arg\,min}_{\tilde W \in \mathcal{W}_m}\; \gamma \|\tilde W \circ Z\|_{1,1} + \gamma^2 \alpha \|\tilde W\|_F^2 + \gamma^2 \alpha \|\tilde W \mathbf{1}\|^2 \quad \text{s.t. } \|\tilde W\|_{1,1} = \frac{s}{\gamma} \\
&= \gamma \operatorname*{arg\,min}_{\tilde W \in \mathcal{W}_m}\; \|\tilde W \circ Z\|_{1,1} + \gamma \alpha \|\tilde W\|_F^2 + \gamma \alpha \|\tilde W \mathbf{1}\|^2 \quad \text{s.t. } \|\tilde W\|_{1,1} = \frac{s}{\gamma} \\
&= \gamma\, H\!\left(Z, \alpha\gamma, \frac{s}{\gamma}\right).
\end{aligned}
$$

The second equality of the proposition follows trivially for $\gamma = s$.

Table 3: Different Types of Smooth Signals (concept; model; graph filter).
- Tikhonov: $X = \arg\min_X \frac{1}{2}\|X - X_0\|_F^2 + \frac{1}{\alpha}\,\mathrm{tr}(X^\top L X)$; filter $g(\lambda) = \frac{1}{1 + \alpha\lambda}$.
- Generative model: $X \sim \mathcal{N}(0, L^\dagger)$; filter $g(\lambda) = 1/\sqrt{\lambda}$ if $\lambda > 0$, and $g(0) = 0$.
- Heat diffusion: $X = \exp(-\alpha L)\,X_0$; filter $g(\lambda) = \exp(-\alpha\lambda)$.

B Optimization details and algorithm for the model of Dong et al. (2015a)

To obtain Algorithm 1 (for our model), we need the following:

$$K = S \quad (\|S\|_2 = \sqrt{2(m-1)}),$$
$$\mathrm{prox}_{\lambda f_1}(y) = \max(0,\, y - \lambda z),$$
$$\left[\mathrm{prox}_{\lambda f_2}(y)\right]_i = \frac{y_i + \sqrt{y_i^2 + 4\alpha\lambda}}{2},$$
$$\nabla f_3(w) = 2\beta w, \qquad \zeta = 2\beta \ \text{(Lipschitz constant of the gradient of } f_3\text{)},$$

where $m$ is the number of nodes of the graph.

To obtain Algorithm 2 (for the model by Dong et al. 2015a), we need the following:

$$K = 2\,\mathbf{1}^\top \quad (\|2\,\mathbf{1}^\top\|_2 = 2\sqrt{m(m-1)/2}),$$
$$\mathrm{prox}_{\lambda f_1}(y) = \max(0,\, y - \lambda z),$$
$$\mathrm{prox}_{\lambda f_2}(y) = s,$$
$$\nabla f_3(w) = \alpha(4w + 2 S^\top S w), \qquad \zeta = 2\alpha(m+1) \ \text{(Lipschitz constant of the gradient of } f_3\text{)}.$$

Algorithm 2: Primal-dual algorithm for the model of Dong et al. (2015a)
1: Input: $z, \alpha, s, w^0 \in \mathcal{W}_v, c^0 \in \mathbb{R}_+, \gamma$, tolerance $\epsilon$
2: for $i = 1, \dots, i_{\max}$ do
3:   $y^i = w^i - \gamma\left(2\alpha(2w^i + S^\top S w^i) + 2c^i\right)$
4:   $\bar y^i = c^i + \gamma\left(2\sum_j w^i_j\right)$
5:   $p^i = \max(0,\, y^i - 2\gamma z)$
6:   $\bar p^i = \bar y^i - \gamma s$
7:   $q^i = p^i - \gamma\left(2\alpha(2p^i + S^\top S p^i) + 2\bar p^i\right)$
8:   $\bar q^i = \bar p^i + \gamma\left(2\sum_j p^i_j\right)$
9:   $w^i = w^i - y^i + q^i$
10:  $c^i = c^i - \bar y^i + \bar q^i$
11:  if $\|w^i - w^{i-1}\| / \|w^{i-1}\| < \epsilon$ and $|c^i - c^{i-1}| / |c^{i-1}| < \epsilon$ then break
12: end for

Table 4: Performance of different algorithms on artificial data. Each setting has a random graph with 100 nodes and 100 smooth signals from 3 different smoothness models, with 10% added noise. Results are averaged over 20 random graphs for each setting. F-measure: the bigger the better (weights ignored). Edge and degree distances: the lower the better. For relative $\ell$-1 distances we normalize s.t. $\|w\|_1 = \|w_0\|_1$; for relative $\ell$-2 distances we normalize s.t. $\|w\|_2 = \|w_0\|_2$. Baseline: for the F-measure, the best result obtained by thresholding $\exp(-d^2)$; for edge and degree distances we use $\exp(-d^2/2\sigma^2)$ without thresholding.

                       Tikhonov                Generative Model        Heat Diffusion
                   base   Dong   Ours      base   Dong   Ours      base   Dong   Ours
Rand. Geometric
  F-measure       0.667  0.860  0.886    0.671  0.836  0.858    0.752  0.837  0.848
  edge l-1        0.896  0.414  0.364    0.851  0.487  0.468    0.620  0.526  0.451
  edge l-2        0.700  0.430  0.390    0.692  0.494  0.477    0.582  0.535  0.471
  degree l-1      0.158  0.151  0.080    0.268  0.159  0.128    0.216  0.225  0.143
  degree l-2      0.707  0.179  0.095    0.679  0.193  0.145    0.479  0.264  0.177
Non Uniform
  F-measure       0.674  0.821  0.817    0.650  0.779  0.774    0.763  0.835  0.827
  edge l-1        0.847  0.547  0.480    0.931  0.711  0.673    0.612  0.583  0.491
  edge l-2        0.724  0.545  0.462    0.784  0.673  0.624    0.565  0.598  0.464
  degree l-1      0.167  0.190  0.075    0.241  0.204  0.139    0.235  0.257  0.132
  degree l-2      0.605  0.228  0.099    0.614  0.261  0.187    0.433  0.325  0.164
Erdős–Rényi
  F-measure       0.293  0.595  0.676    0.207  0.473  0.512    0.358  0.595  0.619
  edge l-1        1.513  0.837  0.798    1.623  1.113  1.090    1.401  0.896  0.899
  edge l-2        1.086  0.712  0.697    1.129  0.896  0.888    1.045  0.767  0.759
  degree l-1      0.114  0.129  0.084    0.135  0.146  0.114    0.185  0.182  0.184
  degree l-2      0.932  0.202  0.116    1.053  0.227  0.185    0.875  0.241  0.276
Barabási–Albert
  F-measure       0.325  0.564  0.636    0.357  0.588  0.632    0.349  0.631  0.711
  edge l-1        1.541  0.939  0.885    1.513  0.940  0.914    1.473  0.843  0.774
  edge l-2        1.073  0.802  0.761    1.052  0.808  0.773    1.049  0.732  0.672
  degree l-1      0.225  0.309  0.145    0.243  0.311  0.229    0.281  0.336  0.181
  degree l-2      0.560  0.378  0.281    0.563  0.386  0.350    0.570  0.429  0.319

C More real data experiments

C.1 Learning the graph of COIL 20 images

We randomly sample the classes so that the average size increases non-linearly from around 3 to around 60 samples per class. The distribution for one of the instances of this experiment is plotted in Fig. 5. We sample from the same distribution 20 times and measure the average performance of the models for different graph densities. For each of the graphs, we run standard spectral clustering (as in the work of Ng, Jordan, Weiss, et al. 2002, but without normalizing the Laplacian) with k-means 100 times. For label propagation we choose 100 different random subsets of 50% known labels. We set a baseline by using the same techniques with a k-Nearest Neighbors (k-NN) graph for different choices of k.

Figure 5: Distribution of class sizes for one of the random instances of the COIL 20 experiments.

In Fig. 6 we plot the behavior of the different models for different density levels. The horizontal axis is the average number of non-zero edges per node. The dashed lines of the middle plot denote the number of nodes contained in components without labeled nodes, which cannot be classified.

Figure 6: Graph learned from non-uniformly sampled images from COIL 20. Average over 20 different samples from the same non-uniform distribution of images. Left: clustering quality. Middle: label propagation quality. Dashed lines are the number of nodes in components without labeled nodes. Right: number of connected components and number of disconnected nodes (our model and k-NN have no disconnected nodes).
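The three smoothness models of Table 3 amount to spectral filters $g(\lambda)$ applied in the Laplacian eigenbasis, $x = U g(\Lambda) U^\top x_0$, which is how the signals of Figure 4 are generated; a small numpy sketch ($\alpha = 10$ as in Figure 3; function names are ours):

```python
import numpy as np

def filter_signal(L, x0, g):
    """Smooth x0 with the spectral filter g: x = U g(Lambda) U^T x0."""
    lam, U = np.linalg.eigh(L)          # eigenvalues in ascending order
    return U @ (g(lam) * (U.T @ x0))

# the three filters of Table 3, with alpha = 10 as in Figure 3
alpha = 10.0
tikhonov = lambda lam: 1.0 / (1.0 + alpha * lam)
heat = lambda lam: np.exp(-alpha * lam)
# generative model: g = 1/sqrt(lambda) on the non-null spectrum, 0 at lambda = 0
generative = lambda lam: np.where(lam > 1e-10,
                                  1.0 / np.sqrt(np.maximum(lam, 1e-10)), 0.0)
```

For non-increasing filters such as Tikhonov and heat diffusion, filtering attenuates high graph frequencies, so the Rayleigh quotient $x^\top L x / x^\top x$ of the filtered signal never exceeds that of $x_0$.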
