Bias-Variance Tradeoff of Graph Laplacian Regularizer


Authors: Pin-Yu Chen, Sijia Liu

Abstract—This paper presents a bias-variance tradeoff of the graph Laplacian regularizer, which is widely used in graph signal processing and semi-supervised learning tasks. The scaling law of the optimal regularization parameter is specified in terms of the spectral graph properties and a novel signal-to-noise ratio parameter, which suggests that selecting a mediocre regularization parameter is often suboptimal. The analysis is applied to three applications, including random, band-limited, and multiple-sampled graph signals. Experiments on synthetic and real-world graphs demonstrate near-optimal performance of the established analysis.

Index Terms—graph signal processing, mean squared error analysis, scaling law, spectral graph theory

I. INTRODUCTION

The graph Laplacian regularizer (GLR) has been widely used in graph signal processing, semi-supervised learning, and image filtering tasks [1]–[5]. Regularization techniques involving the graph Laplacian can be interpreted from different perspectives. In a regression setting, GLR penalizes incoherent (i.e., non-smooth) signals across adjacent nodes [6]–[11]. In a probabilistic setting, GLR serves as a prior distribution that favors smooth signals [1], [12]–[19].

This paper presents a bias-variance tradeoff of GLR. In particular, the scaling law of the optimal regularization parameter of GLR that balances the bias-variance tradeoff is specified in terms of the spectral graph properties and a novel signal-to-noise ratio (SNR) parameter. Our analysis shows an abrupt change in the order of the optimal regularization parameter when varying the SNR parameter, suggesting that selecting a mediocre regularization parameter is often suboptimal, which provides novel insights into the analysis and utility of GLR.
We then apply the bias-variance tradeoff analysis to random, band-limited, and multiple-sampled graph signals, and specify the SNR parameter for each case. Experiments on synthetic and real-world graphs verify the scaling law analysis and demonstrate near-optimal performance in terms of the mean squared error. The proofs of the established theoretical results are given in the appendices of the supplementary material.

(Affiliations: P.-Y. Chen is with AI Foundations, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA. Email: pin-yu.chen@ibm.com. S. Liu is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA. Email: lsjxjtu@umich.edu.)

Consider a weighted undirected connected simple graph $G(\mathcal{V}, \mathcal{E})$ of $n$ nodes and $m$ edges, where $\mathcal{V}$ ($\mathcal{E}$) is the set of nodes (edges). The weight of an edge $(i,j) \in \mathcal{E}$ is specified by the entry $W_{ij} > 0$ of an $n \times n$ symmetric matrix $W$. The graph Laplacian matrix of $G$ is defined as $L = S - W$, where $S = \mathrm{diag}(W \mathbf{1}_n)$ is a diagonal matrix and $\mathbf{1}_n$ is the $n \times 1$ column vector of ones. Let $(\lambda_i, v_i)$, $1 \le i \le n$, denote the $i$-th smallest eigenpair of $L$, so that its eigenvalue decomposition can be written as $L = \sum_{i=1}^n \lambda_i v_i v_i^T$, where $\{\lambda_i\}_{i=1}^n$ is a non-decreasing sequence, $v_i^T v_j = 1$ if $i = j$, and $v_i^T v_j = 0$ if $i \ne j$. For a connected graph $G$, it is well known from spectral graph theory [20] that $(\lambda_1, v_1) = (0, \bar{\mathbf{1}})$, where $\bar{\mathbf{1}} = \mathbf{1}_n / \sqrt{n}$, and $\lambda_i > 0$ for all $2 \le i \le n$. Another useful property, which leads to the smoothing effect, is that for any vector $x \in \mathbb{R}^n$,

$$x^T L x = \sum_{(i,j) \in \mathcal{E}} W_{ij} (x_i - x_j)^2, \qquad (1)$$

where $x_i$ is the $i$-th entry of $x$. We also call (1) the GLR.
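The preliminaries above can be checked numerically. The sketch below, on an invented 4-node weighted graph (the weight matrix `W` is an assumption, not from the paper), builds $L = S - W$ and verifies the GLR identity (1) and the spectral facts $\lambda_1 = 0$, $\lambda_2 > 0$ for a connected graph.

```python
import numpy as np

# Hypothetical 4-node weighted graph; W is symmetric with zero diagonal.
W = np.array([[0., 1., 2., 0.],
              [1., 0., 0., 3.],
              [2., 0., 0., 1.],
              [0., 3., 1., 0.]])
n = W.shape[0]

# Graph Laplacian L = S - W, with S = diag(W 1_n) the degree matrix.
S = np.diag(W @ np.ones(n))
L = S - W

# Check the GLR identity (1): x^T L x = sum over edges of W_ij (x_i - x_j)^2.
x = np.array([1.0, -2.0, 0.5, 3.0])
quad = x @ L @ x
edge_sum = sum(W[i, j] * (x[i] - x[j])**2
               for i in range(n) for j in range(i + 1, n))
print(np.isclose(quad, edge_sum))  # True

# For a connected graph, the smallest eigenvalue is 0 and the second is positive.
lam = np.linalg.eigvalsh(L)
print(np.isclose(lam[0], 0.0), lam[1] > 0)  # True True
```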
II. BIAS-VARIANCE TRADEOFF ANALYSIS

Let $y \in \mathbb{R}^n$ be a vector of observed signals from the graph $G$, where the entry $y_i$ corresponds to the observed signal on node $i$. Assume an additive noise model $y = x^* + e$, where $x^* \in \mathbb{R}^n$ is the unknown ground-truth signal and $e \in \mathbb{R}^n$ accounts for random errors on each node; $e$ has zero mean and covariance structure $\Sigma$. This differs from the assumption of additive Gaussian noise in image filtering, such as the SURE estimator [3], [21]. For many signal processing and semi-supervised learning tasks, given a noisy graph signal $y$ on $G$, one aims to recover a smooth graph signal. This can be cast as a least-squares minimization problem regularized by the GLR [1], [6], [15],

$$\min_{x \in \mathbb{R}^n} \|y - x\|_2^2 + \alpha\, x^T L x, \qquad (2)$$

where $\|\cdot\|_2$ denotes the Euclidean norm and $\alpha \ge 0$ is the regularization parameter. In essence, one is interested in obtaining a solution $\hat{x}$ to (2) such that, with a proper selection of the regularization parameter $\alpha$, the vector $\hat{x}$ is smooth in the sense that the weighted sum of squared signal differences over all adjacent node pairs in (1) is confined. What remains unclear is the effect of $\alpha$ on the estimator $\hat{x}$, which is the main contribution (optimal scaling law analysis) of this paper. It is easy to show that $\hat{x}$ has the analytical expression

$$\hat{x} = (I + \alpha L)^{-1} y =: H y, \qquad (3)$$

where the eigenvalue decomposition of $H$ can be written as

$$H = (I + \alpha L)^{-1} = \sum_{i=1}^n \frac{1}{1 + \alpha \lambda_i} v_i v_i^T =: \sum_{i=1}^n h_i v_i v_i^T. \qquad (4)$$

In particular, $h_1 = 1$ since $\lambda_1 = 0$. For a fixed $\alpha$, the bias of $\hat{x}$ is

$$\mathrm{Bias}(\alpha) = \|\mathbb{E}\hat{x} - x^*\|_2 = \|(H - I) x^*\|_2, \qquad (5)$$

where $I$ is the identity matrix. The variance of $\hat{x}$ is

$$\mathrm{Var}(\alpha) = \mathrm{trace}(\mathrm{cov}(\hat{x})) = \mathrm{trace}(H^2 \Sigma), \qquad (6)$$

where $\mathrm{cov}(\hat{x})$ denotes the covariance matrix of $\hat{x}$.
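A minimal sketch of the estimator in (3)–(4), on an assumed toy path graph (not from the paper): the closed-form solution $(I + \alpha L)^{-1} y$ agrees with its spectral form, where the filter coefficients are $h_i = 1/(1 + \alpha\lambda_i)$ and $h_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy path graph on 5 nodes (assumed example, unit edge weights).
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

alpha = 0.5
y = rng.normal(size=n)

# Closed-form solution (3): x_hat = (I + alpha L)^{-1} y = H y.
H = np.linalg.inv(np.eye(n) + alpha * L)
x_hat = H @ y

# Spectral view (4): H shares eigenvectors with L, with h_i = 1/(1 + alpha lambda_i).
lam, V = np.linalg.eigh(L)
h = 1.0 / (1.0 + alpha * lam)
x_hat_spec = V @ (h * (V.T @ y))
print(np.allclose(x_hat, x_hat_spec))  # True
print(np.isclose(h[0], 1.0))           # True: h_1 = 1 since lambda_1 = 0
```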
As a result, the mean squared error (MSE) can be expressed as

$$\mathrm{MSE}(\alpha) = \mathbb{E}\|\hat{x} - x^*\|_2^2 = \mathrm{Bias}(\alpha)^2 + \mathrm{Var}(\alpha). \qquad (7)$$

The following theorem shows that using GLR decreases the variance of the estimator $\hat{x}$ compared with not using GLR (i.e., $\alpha = 0$).

Theorem 1. For any $\alpha > 0$, $\mathrm{Var}(\alpha) \le \mathrm{Var}(0)$. The inequality becomes strict if $\Sigma$ has full rank.

Proof: The proof is given in Appendix A.

Theorem 1 suggests that selecting any $\alpha > 0$ can decrease the variance. However, the selection of $\alpha$ also affects the bias in (5), which is known as the bias-variance tradeoff. The analysis below provides the optimal order of $\alpha$ that balances the bias-variance tradeoff. Applying Von Neumann's trace inequality [22] to the variance term in (6), we have $\mathrm{Var}(\alpha) = \mathrm{trace}(H^2 \Sigma) \le \sum_{i=1}^n h_i^2 \phi_i$, where $\phi_i$ is the $i$-th largest eigenvalue of $\Sigma$, and the equality holds when $\Sigma$ is a diagonal matrix. To simplify the analysis, in the rest of this paper we assume $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, where $\boldsymbol{\sigma} = [\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2]$ and $\sigma_i \ge 0$ denotes the standard deviation. The bias-variance tradeoff for a non-diagonal covariance structure can be analyzed in a similar way. Upon defining $Q = I - H$, it follows from (4) that the eigenvalue decomposition of $Q$ is $Q = \sum_{i=2}^n \frac{1}{1 + \frac{1}{\alpha\lambda_i}} v_i v_i^T =: \sum_{i=2}^n q_i v_i v_i^T$ for any $\alpha > 0$.

Theorem 2. If $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, then for any $\alpha > 0$, $\mathrm{Bias}(\alpha)^2 = \sum_{i=2}^n q_i^2 (v_i^T \bar{x}^*)^2$, $\mathrm{Var}(\alpha) = \sum_{i=1}^n h_i^2 \sigma_i^2$, and therefore

$$\mathrm{MSE}(\alpha) = \sum_{i=2}^n q_i^2 (v_i^T \bar{x}^*)^2 + \sum_{i=1}^n h_i^2 \sigma_i^2, \qquad (8)$$

where $\bar{x}^* = x^* - \frac{\mathbf{1}_n^T x^*}{n} \mathbf{1}_n$, $q_i = \frac{1}{1 + \frac{1}{\alpha\lambda_i}}$, and $h_i = \frac{1}{1 + \alpha\lambda_i}$.

Proof: The proof is given in Appendix B.

Recall that $h_1 = 1$ from (4). Theorem 2 indicates that there is a universal lower bound $\mathrm{MSE}(\alpha) \ge \sigma_1^2$ for any $\alpha > 0$.
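The decomposition (7)–(8) can be verified numerically. The sketch below uses an assumed toy setup (6-node path graph, isotropic noise $\Sigma = s^2 I$, a special case of the diagonal covariance in Theorem 2) and checks that $\mathrm{Bias}^2 + \mathrm{Var}$ computed from (5)–(6) matches the spectral form (8).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: 6-node path graph, isotropic noise Sigma = s2 * I.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

alpha, s2 = 0.7, 0.3
x_star = rng.normal(size=n)
Sigma = s2 * np.eye(n)

H = np.linalg.inv(np.eye(n) + alpha * L)
bias2 = np.linalg.norm((H - np.eye(n)) @ x_star)**2   # squared bias, from (5)
var = np.trace(H @ H @ Sigma)                          # variance, from (6)

# Spectral decomposition (8): q_i = 1 - h_i, and the bias only sees the
# centered signal xbar = x* - (1^T x* / n) 1, because q_1 = 0.
lam, V = np.linalg.eigh(L)
h = 1.0 / (1.0 + alpha * lam)
q = 1.0 - h
xbar = x_star - x_star.mean()
mse_spec = np.sum(q[1:]**2 * (V[:, 1:].T @ xbar)**2) + np.sum(h**2) * s2
print(np.isclose(bias2 + var, mse_spec))  # True
```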
Theorem 2 also implies a clear bias-variance tradeoff, since $q_i = 1 - h_i$ for all $2 \le i \le n$. Specifically, increasing $\alpha$ decreases the variance but increases the bias, and vice versa. This tradeoff means that an improper selection of $\alpha$ may lead to undesired MSE, as one term will dominate the other. The following results provide guidelines on the selection of a proper $\alpha$.

Corollary 1 (MSE-UB). If $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, then for any $\alpha > 0$,

$$\mathrm{MSE}(\alpha) \le \left(\frac{1}{1 + \frac{1}{\alpha\lambda_n}}\right)^2 \sum_{i=2}^n (v_i^T \bar{x}^*)^2 + \left(\frac{1}{1 + \alpha\lambda_2}\right)^2 \sum_{i=2}^n \sigma_i^2 + \sigma_1^2,$$

where the equality holds if $G$ is a complete graph of identical edge weight, and the right-hand side is denoted by $\mathrm{MSE\text{-}UB}(\alpha)$.

Proof: The proof is given in Appendix C.

MSE-UB in Corollary 1 provides a tight upper envelope function for assessing the MSE. In Sec. IV, near-optimal performance of MSE-UB relative to MSE is validated on synthetic and real-world graphs. Note that since $\mathrm{MSE}(\alpha)$ is a non-convex function of $\alpha > 0$, the optimal $\alpha$ that minimizes the MSE does not have a closed-form expression. On the other hand, the optimal solution to $\mathrm{MSE\text{-}UB}(\alpha)$ can be obtained by finding the roots of a third-order polynomial, namely the derivative of $\mathrm{MSE\text{-}UB}(\alpha)$ with respect to $\alpha$. Corollary 1 can also be used to specify an optimal value $\alpha^*$ that matches the order of the bias and variance terms appearing in MSE-UB (i.e., the first two terms), as stated below.

Theorem 3. Let $\theta = \sqrt{\frac{\sum_{i=2}^n \sigma_i^2}{\sum_{i=2}^n (v_i^T \bar{x}^*)^2}}$. The optimal value that matches the order of the first two terms of $\mathrm{MSE\text{-}UB}(\alpha)$ in Corollary 1 is

$$\alpha^* = \frac{(\beta\theta - 1)\lambda_n + \sqrt{(\beta\theta - 1)^2 \lambda_n^2 + 4\lambda_n\lambda_2\beta\theta}}{2\lambda_n\lambda_2},$$

where $\beta > 0$ is a constant such that $\left(\frac{1 + \alpha^*\lambda_2}{1 + \frac{1}{\alpha^*\lambda_n}}\right)^2 = \beta^2\theta^2$.

Proof: The proof is given in Appendix D.
Theorem 3 suggests that the optimal order-matching regularization parameter for balancing the bias-variance tradeoff depends on the parameter $\theta$ and on the eigenvalues $\lambda_2$ and $\lambda_n$ of the graph Laplacian matrix $L$ of the graph $G$. Define the effective signal-to-noise ratio

$$\mathrm{E\text{-}SNR} = \frac{\sum_{i=2}^n (v_i^T \bar{x}^*)^2}{\sum_{i=2}^n \sigma_i^2} \qquad (9)$$

so that $\theta = \sqrt{\frac{1}{\mathrm{E\text{-}SNR}}}$. The term $(v_i^T \bar{x}^*)^2$ in E-SNR is associated with the signal power in the graph frequency domain, as $v_i^T \bar{x}^* = v_i^T x^*$ for all $2 \le i \le n$, where the latter is the corresponding graph Fourier coefficient of $x^*$ [1]. Given that $v_1^T \bar{x}^* = 0$, the term $\sum_{i=2}^n (v_i^T \bar{x}^*)^2 = \sum_{i=1}^n (v_i^T x^*)^2 - \frac{(\mathbf{1}_n^T x^*)^2}{n}$ is the signal power of $\bar{x}^*$. The order of $\alpha^*$ in different E-SNR regimes is summarized in the following corollary.

Corollary 2 (scaling law). Given a graph $G$: in the high E-SNR regime ($\theta \ll \frac{1}{\beta}$), $\alpha^* = O\!\left(\frac{\theta}{\lambda_n}\right)$; in the low E-SNR regime ($\theta \gg \frac{1}{\beta}$), $\alpha^* = O\!\left(\frac{\theta}{\lambda_2}\right)$; and in the moderate E-SNR regime ($\theta \approx \frac{1}{\beta}$), $\alpha^* = O\!\left(\sqrt{\frac{\theta}{\lambda_n\lambda_2}}\right)$.

Proof: The proof is given in Appendix E.

Corollary 2 specifies the scaling law of the order-matching regularization parameter $\alpha^*$ in terms of the parameter $\theta$ (i.e., E-SNR) and the spectral graph properties (i.e., $\lambda_2$ and $\lambda_n$). It also suggests that as E-SNR approaches infinity, $\alpha^*$ approaches $0$. Furthermore, as one sweeps the E-SNR from the high E-SNR regime (small $\theta$) to the low E-SNR regime (large $\theta$), Corollary 2 indicates that the order of $\alpha^*$ is expected to exhibit an abrupt boost that depends on the ratio $\frac{\lambda_n}{\lambda_2}$. More importantly, Corollary 2 shows that selecting a mediocre value of the regularization parameter $\alpha$ for GLR is often suboptimal for minimizing the MSE. In the small-$\theta$ regime a small $\alpha$ is preferred, whereas in the large-$\theta$ regime a large $\alpha$ is preferred.
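The qualitative content of Corollary 2 can be illustrated with a grid search, in the spirit of the experiments in Sec. IV. The sketch below is an invented toy setup (20-node path graph, normalized signal power $P = 1$): it minimizes the first two terms of $\mathrm{MSE\text{-}UB}(\alpha)$ over a log-scale grid for a small and a large $\theta$, showing that small $\theta$ favors small $\alpha^*$ and large $\theta$ favors large $\alpha^*$.

```python
import numpy as np

# Assumed toy graph: 20-node path graph with unit weights.
n = 20
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam = np.linalg.eigvalsh(L)
lam2, lamn = lam[1], lam[-1]

P = 1.0                            # normalized signal power sum_{i>=2} (v_i^T xbar*)^2
alphas = np.logspace(-4, 6, 4000)  # log-scale grid, as in the paper's experiments

def mse_ub(alpha, theta):
    bias_term = (1.0 / (1.0 + 1.0 / (alpha * lamn)))**2 * P
    var_term = (1.0 / (1.0 + alpha * lam2))**2 * theta**2 * P  # noise power = theta^2 P
    return bias_term + var_term    # the additive sigma_1^2 does not move the argmin

a_star = {th: alphas[int(np.argmin([mse_ub(a, th) for a in alphas]))]
          for th in (0.05, 5.0)}   # high E-SNR vs. low E-SNR
print(a_star)   # the small-theta minimizer is much smaller than the large-theta one
```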
III. APPLICATIONS TO RANDOM, BAND-LIMITED, AND MULTIPLE-SAMPLED GRAPH SIGNALS

In this section we apply the bias-variance tradeoff analysis of Sec. II to random, band-limited, and multiple-sampled graph signals, respectively. In particular, for each case we specify the parameter $\theta$ governing the order of the optimal order-matching regularization parameter $\alpha^*$.

For graph signals with multiple samples, let $\{y_t\}_{t=1}^T$ denote $T$ i.i.d. copies of $y$ and denote their ensemble average by $\bar{y} = \frac{1}{T}\sum_{t=1}^T y_t$. By replacing $y$ in (2) with $\bar{y}$, the following corollary provides an upper bound on the MSE of i.i.d. multiple-sampled graph signals.

Corollary 3 (Multiple-sampled i.i.d. graph signals). Let $\{y_t\}_{t=1}^T$ be $T$ i.i.d. graph signals and let $\bar{y} = \frac{1}{T}\sum_{t=1}^T y_t$. Replacing $y$ in (2) with $\bar{y}$, if $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, then for any $\alpha > 0$,

$$\mathrm{MSE}(\alpha) \le \left(\frac{1}{1 + \frac{1}{\alpha\lambda_n}}\right)^2 \sum_{i=2}^n (v_i^T \bar{x}^*)^2 + \left(\frac{1}{1 + \alpha\lambda_2}\right)^2 \frac{\sum_{i=2}^n \sigma_i^2}{T} + \sigma_1^2,$$

where the equality holds if $G$ is a complete graph of identical edge weight.

Proof: The proof is given in Appendix F.

Corollary 3 shows that for a fixed $\alpha$, the number $T$ of i.i.d. observations has a linear scaling effect (i.e., $\frac{1}{T}$) on the variance term but no effect on the bias term. Furthermore, by defining $\theta = \sqrt{\frac{\sum_{i=2}^n \sigma_i^2 / T}{\sum_{i=2}^n (v_i^T \bar{x}^*)^2}}$ and applying the results in Theorem 3 and Corollary 2, the optimal order of $\alpha^*$ scales with $\frac{1}{\sqrt{T}}$ in the high and low E-SNR regimes and with $\frac{1}{\sqrt[4]{T}}$ in the moderate E-SNR regime.

For band-limited graph signals, the ground-truth signal $x^*$ is a linear combination of a subset of the basis $\{v_i\}_{i=1}^n$ associated with the graph Laplacian matrix $L$ [1], [13], [23], which can be written as $x^* = \sum_{j \in \mathcal{A}} \omega_j v_j$, where $\omega_j \ne 0$ and $\mathcal{A} \subset \{1, 2, \ldots, n\}$ indicates the set of active basis vectors from $\{v_i\}_{i=1}^n$.
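The $1/T$ variance scaling behind Corollary 3 follows from $\mathrm{cov}(\bar{y}) = \Sigma/T$. The sketch below, on an assumed 8-node ring graph (not from the paper), checks this with a Monte Carlo estimate: the empirical variance of $H\bar{y}$ with $T = 4$ samples matches $\mathrm{Var}(\alpha)/T$, while the bias is independent of $T$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup: 8-node ring graph, isotropic noise with variance s2.
n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

alpha, s2, T = 1.0, 0.5, 4
H = np.linalg.inv(np.eye(n) + alpha * L)
x_star = np.sin(np.arange(n))                          # arbitrary ground truth

bias2 = np.linalg.norm((H - np.eye(n)) @ x_star)**2    # does not depend on T
var1 = np.trace(H @ H @ (s2 * np.eye(n)))              # variance for T = 1

# Monte Carlo: averaging T i.i.d. observations gives cov(ybar) = Sigma / T,
# so the total variance of H ybar should be close to var1 / T.
samples = []
for _ in range(2000):
    ybar = x_star + rng.normal(scale=np.sqrt(s2), size=(T, n)).mean(axis=0)
    samples.append(H @ ybar)
emp_var = np.array(samples).var(axis=0).sum()
print(bias2, var1 / T, emp_var)   # the last two values should be close
```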
The following corollary provides an upper bound on the MSE of band-limited graph signals.

Corollary 4 (Band-limited graph signals). If $x^* = \sum_{j \in \mathcal{A}} \omega_j v_j$ and $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, then for any $\alpha > 0$,

$$\mathrm{MSE}(\alpha) \le \left(\frac{1}{1 + \frac{1}{\alpha\lambda_n}}\right)^2 \sum_{j \in \mathcal{A} \setminus \{1\}} \omega_j^2 + \left(\frac{1}{1 + \alpha\lambda_2}\right)^2 \sum_{i=2}^n \sigma_i^2 + \sigma_1^2.$$

Proof: The proof is given in Appendix G.

Corollary 4 indicates that the term $\sum_{j \in \mathcal{A} \setminus \{1\}} \omega_j^2$ can be viewed as the effective signal strength for band-limited graph signals. Moreover, the coefficient $\omega_1$ corresponding to the coherent basis $\bar{\mathbf{1}}$ does not contribute to the MSE. Using the terminology of filter bank design [1], GLR is a low-pass filter that excludes the lowest-frequency coefficient $\omega_1$ in terms of MSE. The results in Theorem 3 and Corollary 2 apply to band-limited graph signals by setting $\theta = \sqrt{\frac{\sum_{i=2}^n \sigma_i^2}{\sum_{j \in \mathcal{A} \setminus \{1\}} \omega_j^2}}$.

For random graph signals, assume the ground-truth graph signal $x^* \sim \mathcal{N}(\mu\mathbf{1}_n, \mathrm{diag}(\mathbf{s}))$ is a Gaussian random vector with mean $\mu\mathbf{1}_n$ and covariance $\mathrm{diag}(\mathbf{s})$, where $\mathbf{s} = [s_1^2, s_2^2, \ldots, s_n^2]$ and $s_i \ge 0$. The following corollary provides an upper bound on $\mathrm{MSE}(\alpha)$ for random graph signals.

Corollary 5 (Random graph signals). If $x^* \sim \mathcal{N}(\mu\mathbf{1}_n, \mathrm{diag}(\mathbf{s}))$ and $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, let $\bar{s} = \frac{\sum_{i=1}^n s_i^2}{n}$ and $\bar{\sigma} = \frac{\sum_{i=2}^n \sigma_i^2}{n-1}$. Then for any $\alpha > 0$,

$$\mathrm{MSE}(\alpha) \le \left(\frac{1}{1 + \frac{1}{\alpha\lambda_n}}\right)^2 (n-1)\bar{s} + \left(\frac{1}{1 + \alpha\lambda_2}\right)^2 (n-1)\bar{\sigma} + \sigma_1^2,$$

where the equality holds if $s_i = s \ge 0$ for all $1 \le i \le n$.

Proof: The proof is given in Appendix H.

Corollary 5 shows that the mean $\mu\mathbf{1}_n$ of random graph signals does not contribute to the upper bound on $\mathrm{MSE}(\alpha)$, which suggests that GLR filters out the mean of the random graph signal in addition to its smoothing effect. The results in Theorem 3 and Corollary 2 readily apply to random graph signals via Corollary 5.
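Corollary 5's observation that the mean $\mu\mathbf{1}_n$ drops out of the MSE can be checked directly from the exact spectral form (8), since the bias term only involves the centered signal $\bar{x}^*$. The toy setup below (10-node ring graph, isotropic noise) is an assumed example, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy graph: 10-node ring with unit weights; isotropic noise variance s2.
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, V = np.linalg.eigh(L)

alpha, s2 = 0.8, 0.4

def exact_mse(x_star):
    # Exact MSE from (8); the bias term sees only the centered signal,
    # so any constant shift mu * 1_n is filtered out.
    h = 1.0 / (1.0 + alpha * lam)
    xbar = x_star - x_star.mean()
    return np.sum((1 - h[1:])**2 * (V[:, 1:].T @ xbar)**2) + s2 * np.sum(h**2)

x0 = rng.normal(size=n)
print(np.isclose(exact_mse(x0), exact_mse(x0 + 10.0)))  # True: mean filtered out
```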
In particular, defining $\theta = \sqrt{\bar{\sigma}/\bar{s}}$, the E-SNR becomes $\bar{s}/\bar{\sigma}$, which is close to the average SNR $\bar{s}/\tilde{\sigma}$, where $\tilde{\sigma} = \left(1 - \frac{1}{n}\right)\bar{\sigma} + \frac{\sigma_1^2}{n}$. Moreover, applying the results in Corollary 2 gives the relation between $\theta$ and $\alpha^*$ for random graph signals.

IV. PERFORMANCE EVALUATION

In this section we conduct experiments on synthetic graphs and real-world graph datasets to validate the developed bias-variance tradeoff analysis and the scaling behavior of the optimal regularization parameter $\alpha^*$ with respect to the parameter $\theta = \sqrt{1/\mathrm{E\text{-}SNR}}$. The graph signal $x^*$ is randomly drawn from a multivariate Gaussian distribution with mean $\mu = 10$ and covariance $\mathrm{diag}(\mathbf{1}_n)$. The noise $e$ is generated from a multivariate Gaussian distribution with zero mean and covariance $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, where $\boldsymbol{\sigma} = \sigma^2\mathbf{1}_n$. From Corollary 5, the E-SNR becomes $\frac{1}{\sigma^2}$ and hence $\theta = \sigma$. To investigate the scaling behavior of $\alpha^*$ under different regimes of $\theta$, $\alpha^*$ is numerically obtained via grid search over the range $[0, b]$ with $t$ uniform samples on the log scale, where $b$ and $t$ are specified in each experiment. The results presented in this paper are averaged over 50 realizations.

We generate Erdős–Rényi random graphs with different node-pair connection probabilities $p$ to study the difference between $\mathrm{MSE}(\alpha)$ and $\mathrm{MSE\text{-}UB}(\alpha)$. Fig. 1 shows the curves of per-node $\mathrm{MSE}(\alpha)$ and $\mathrm{MSE\text{-}UB}(\alpha)$ for three selected values of $\alpha$ ($\alpha \in \{0.001, 0.01, 0.1\}$) at different scales, where the per-node $\mathrm{MSE}(\alpha)$ is the MSE divided by the number of nodes $n$. It is observed that the curves of $\mathrm{MSE}(\alpha)$ and $\mathrm{MSE\text{-}UB}(\alpha)$ have similar tendencies with respect to $p$, and they collapse to the same value when $p = 1$ (complete graphs), which justifies Corollary 1.

[Fig. 1: Per-node MSE($\alpha$) and MSE-UB($\alpha$) in Erdős–Rényi random graphs with $n = 100$ nodes. The curves of MSE($\alpha$) and MSE-UB($\alpha$) collapse to the same value when $p = 1$ (a complete graph), which justifies Corollary 1.]

Fig. 2 displays the optimal regularization parameter $\alpha^*$ obtained by minimizing $\mathrm{MSE}(\alpha)$ and $\mathrm{MSE\text{-}UB}(\alpha)$ under different $\theta$, respectively, in Erdős–Rényi random graphs and in Watts–Strogatz small-world random graphs [24] with rewiring probability $q$ and average degree $d$. In the high E-SNR regime (small $\theta$), $\alpha^*$ is close to zero, as proved in Corollary 2. Furthermore, as one sweeps $\theta$, an abrupt boost in $\alpha^*$ followed by linear scaling with $\theta$ is observed, which is consistent with the analysis in Corollary 2.

[Fig. 2: Optimal regularization parameter $\alpha^*$ and the corresponding per-node MSE under different $\theta$ in Erdős–Rényi random graphs with $p = 0.1$ and in Watts–Strogatz random graphs with $q = 0.4$ and $d = 20$; $n = 100$, $b = 2000$, and $t = 10^4$. The scaling behavior of $\alpha^*$ validates Corollary 2, and the corresponding per-node MSE curves are nearly identical.]
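A single realization of the synthetic-graph protocol above can be sketched as follows, assuming the stated setup ($x^* \sim \mathcal{N}(\mu\mathbf{1}, I)$ with $\mu = 10$, noise $\mathcal{N}(0, \sigma^2 I)$ so that $\theta = \sigma$, and a log-scale grid search); the grid bounds and the use of the exact MSE (8) instead of Monte Carlo averaging are simplifications, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, mu = 100, 0.1, 10.0

# Erdos-Renyi adjacency; retry until the graph is connected (lambda_2 > 0).
while True:
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    W = A + A.T
    L = np.diag(W.sum(axis=1)) - W
    lam, V = np.linalg.eigh(L)
    if lam[1] > 1e-9:
        break

x_star = mu + rng.normal(size=n)
xbar = x_star - x_star.mean()
signal_spec = (V[:, 1:].T @ xbar)**2          # (v_i^T xbar*)^2 for i >= 2

alphas = np.logspace(-4, 3, 1000)             # log-scale grid search

def exact_mse(alpha, sigma2):
    h = 1.0 / (1.0 + alpha * lam)
    return np.sum((1 - h[1:])**2 * signal_spec) + sigma2 * np.sum(h**2)  # eq. (8)

results = {}
for sigma in (0.2, 2.0):                      # theta = sigma here (Corollary 5)
    mses = [exact_mse(a, sigma**2) for a in alphas]
    results[sigma] = (alphas[int(np.argmin(mses))], min(mses) / n)
    print(f"sigma = {sigma}: alpha* ~ {results[sigma][0]:.3g}, "
          f"per-node MSE = {results[sigma][1]:.3g}")
```

As the analysis predicts, the low-noise (small $\theta$) run selects a much smaller $\alpha^*$ than the high-noise (large $\theta$) run.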
Fixing $\theta$, we also observe that although the values of $\alpha^*$ obtained from $\mathrm{MSE}(\alpha)$ and $\mathrm{MSE\text{-}UB}(\alpha)$ are distinct, the corresponding curves of per-node MSE are nearly identical, especially in the large-$\theta$ (low E-SNR) and small-$\theta$ (high E-SNR) regimes, since $\mathrm{MSE\text{-}UB}(\alpha)$ is a tight upper envelope function of $\mathrm{MSE}(\alpha)$, as stated in Corollary 1.

Fig. 3 displays the experimental results on three real-world graph datasets: the Minnesota road map of 2640 nodes and 3302 edges [25], the Facebook friendship graph of 4039 nodes and 88234 edges [26], and the U.S. western power grid network of 4941 nodes and 6594 edges [24]. Consistent with the experimental results on synthetic graphs, a similar scaling effect of $\alpha^*$ and near-optimal performance in per-node MSE are observed in real-world graph datasets.

[Fig. 3: Optimal regularization parameter $\alpha^*$ (in log scale) and the corresponding per-node MSE under different $\theta$ in three real-world graphs with $b = 10^8$ and $t = 10^3$. The saturation effect in $\alpha^*$ is due to the upper bound $b$ for the grid search. Consistent with the analysis in Corollary 2, the results suggest that in most cases selecting a mediocre regularization parameter $\alpha$ is often suboptimal for minimizing the MSE.]

More importantly, as indicated in Corollary 2, these experimental results suggest that in most cases (i.e., different $\theta$) selecting a mediocre regularization parameter $\alpha$ is often suboptimal for minimizing the MSE. Instead, assigning a large (small) regularization parameter in the large (small) $\theta$ regime is more effective in minimizing the MSE.

V. CONCLUSION

The contributions of this paper are twofold. First, we study the bias-variance tradeoff of the graph Laplacian regularizer (GLR) and specify the scaling law of the optimal regularization parameter. We show that an abrupt boost in the optimal regularization parameter is expected when one sweeps a novel signal-to-noise ratio (SNR) parameter $\theta$, which suggests that selecting a mediocre regularization parameter is often suboptimal for minimizing the mean squared error. Second, we apply the developed analysis to random, band-limited, and multiple-sampled graph signals and specify the corresponding SNR parameter $\theta$. Experimental results on synthetic and real-world graphs validate our analysis of the scaling effect of the optimal regularization parameter and demonstrate near-optimal performance in mean squared error, which provides new insights for signal processing and machine learning methods involving GLR. Future work includes extending the current framework to a multi-stage bias-variance tradeoff with GLR.

REFERENCES

[1] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013.
[2] A. Bertrand and M. Moonen, "Seeing the bigger picture: How nodes can learn their place within a complex ad hoc network topology," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 71–82, 2013.
[3] P. Milanfar, "A tour of modern image filtering: New insights and methods, both practical and theoretical," IEEE Signal Process. Mag., vol. 30, no. 1, pp. 106–128, 2013.
[4] A. Sandryhaila and J. M. Moura, "Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80–90, 2014.
[5] L. Wu, J. Laeuchli, V. Kalantzis, A. Stathopoulos, and E. Gallopoulos, "Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse," Journal of Computational Physics, vol. 326, pp. 828–844, 2016.
[6] M. Belkin, I. Matveeva, and P. Niyogi, "Regularization and semi-supervised learning on large graphs," in International Conference on Computational Learning Theory. Springer, 2004, pp. 624–638.
[7] M. Belkin and P. Niyogi, "Semi-supervised learning on Riemannian manifolds," Machine Learning, vol. 56, no. 1-3, pp. 209–239, 2004.
[8] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[9] A. Anis, A. Gadde, and A. Ortega, "Efficient sampling set selection for bandlimited graph signals using graph spectral proxies," IEEE Trans. Signal Process., vol. 64, no. 14, pp. 3775–3789, July 2016.
[10] S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero III, "Learning sparse graphs under smoothness prior," arXiv preprint arXiv:1609.03448, 2016.
[11] M. Onuki, S. Ono, M. Yamagishi, and Y. Tanaka, "Graph signal denoising via trilateral filter on graph spectral domain," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 2, pp. 137–148, June 2016.
[12] P.-Y. Chen and A. Hero, "Deep community detection," IEEE Trans. Signal Process., vol. 63, no. 21, pp. 5706–5719, Nov. 2015.
[13] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, "Discrete signal processing on graphs: Sampling theory," IEEE Trans. Signal Process., vol. 63, no. 24, pp. 6510–6523, 2015.
[14] P.-Y. Chen and A. O. Hero, "Phase transitions in spectral community detection," IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4339–4347, Aug. 2015.
[15] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6160–6173, Dec. 2016.
[16] V. Kalofolias, "How to learn a graph from smooth signals," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 920–929.
[17] S. Chen, R. Varma, A. Singh, and J. Kovačević, "Signal recovery on graphs: Fundamental limits of sampling strategies," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 4, pp. 539–554, 2016.
[18] P.-Y. Chen, T. Gensollen, and A. Hero, "AMOS: An automated model order selection algorithm for spectral graph clustering," arXiv preprint arXiv:1609.06457, 2016.
[19] X. Liu, G. Cheung, X. Wu, and D. Zhao, "Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images," IEEE Trans. Image Process., vol. 26, no. 2, pp. 509–524, Feb. 2017.
[20] F. R. K. Chung, Spectral Graph Theory. American Mathematical Society, 1997.
[21] S. Ramani, T. Blu, and M. Unser, "Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1540–1554, 2008.
[22] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[23] A. Sandryhaila and J. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[24] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, June 1998. [Online]. Available: http://www-personal.umich.edu/~mejn/netdata
[25] [Online]. Available: https://www.cs.purdue.edu/homes/dgleich/packages/matlab_bgl/
[26] J. J. McAuley and J. Leskovec, "Learning to discover social circles in ego networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 548–556.

SUPPLEMENTARY MATERIAL

APPENDIX A

When $\alpha = 0$, $\mathrm{cov}(\hat{x}) = \Sigma$. To show that $\mathrm{Var}(\alpha) \le \mathrm{Var}(0)$ for any $\alpha > 0$, from (6) it suffices to show that $\mathrm{trace}(\Sigma) - \mathrm{trace}(H^2\Sigma) = \mathrm{trace}[(I - H^2)\Sigma] \ge 0$ for any $\alpha > 0$. For any $\alpha > 0$, observe from (4) that the eigenvalue decomposition of $I - H^2$ is $I - H^2 = \sum_{i=2}^n \left[1 - \frac{1}{(1+\alpha\lambda_i)^2}\right] v_i v_i^T$, which means $I - H^2$ is positive semidefinite since $1 + \alpha\lambda_i > 1$ for all $2 \le i \le n$. Finally, since $\Sigma$ is a covariance matrix and hence PSD, the term $\mathrm{trace}[(I - H^2)\Sigma] \ge 0$ (and $\mathrm{trace}[(I - H^2)\Sigma] > 0$ if $\Sigma$ has full rank) [22], which completes the proof.

APPENDIX B

From (5), the squared bias is $\mathrm{Bias}(\alpha)^2 = \sum_{i=2}^n q_i^2 (v_i^T x^*)^2$. Recall that $\{v_i\}_{i=2}^n$ are eigenvectors of the graph Laplacian matrix $L$ such that $v_i^T \mathbf{1}_n = 0$ for all $2 \le i \le n$. We therefore have $(v_i^T x^*)^2 = \left(v_i^T (x^* - a\mathbf{1}_n)\right)^2$ for any $a \in \mathbb{R}$. If $\Sigma = \mathrm{diag}(\boldsymbol{\sigma})$, then the variance in (6) reduces to $\sum_{i=1}^n h_i^2 \sigma_i^2$. Finally, using (7) and setting $a = \frac{\mathbf{1}_n^T x^*}{n}$ gives the results.

APPENDIX C

Since $q_i = \frac{1}{1+\frac{1}{\alpha\lambda_i}} \le \frac{1}{1+\frac{1}{\alpha\lambda_n}}$ and $h_i = \frac{1}{1+\alpha\lambda_i} \le \frac{1}{1+\alpha\lambda_2}$ for all $2 \le i \le n$, applying these bounds to Theorem 2 yields the upper bound $\mathrm{MSE\text{-}UB}(\alpha)$ on $\mathrm{MSE}(\alpha)$. If the graph $G$ is a complete graph of identical edge weight $w > 0$, then $\lambda_i = wn$ for all $2 \le i \le n$; therefore the resulting $\mathrm{MSE}(\alpha)$ is identical to $\mathrm{MSE\text{-}UB}(\alpha)$.
APPENDIX D

If the first two terms on the right-hand side of Corollary 1 are order-matching, then there exists a constant $\beta > 0$ such that $\left(\frac{1}{1+\frac{1}{\alpha\lambda_n}}\right)^2 = \beta^2 \left(\frac{1}{1+\alpha\lambda_2}\right)^2 \theta^2$. Solving this equation gives

$$\alpha^* = \frac{(\beta\theta - 1)\lambda_n + \sqrt{(\beta\theta - 1)^2\lambda_n^2 + 4\lambda_n\lambda_2\beta\theta}}{2\lambda_n\lambda_2}. \qquad (S1)$$

APPENDIX E

For the high and low E-SNR regimes, the optimal order of $\alpha^*$ can be obtained from Newton's generalized binomial expansion (binomial series): for any real $x$ with $|x| < 1$, $\sqrt{1+x} = 1 + \frac{x}{2} + O(x^2)$. Specifically, for the high and low E-SNR regimes the term $\sqrt{(\beta\theta-1)^2\lambda_n^2 + 4\lambda_n\lambda_2\beta\theta}$ in Theorem 3 can be approximated by

$$\sqrt{(\beta\theta-1)^2\lambda_n^2 + 4\lambda_n\lambda_2\beta\theta} = |\beta\theta - 1|\lambda_n \sqrt{1 + \frac{4\lambda_n\lambda_2\beta\theta}{(\beta\theta-1)^2\lambda_n^2}} \approx |\beta\theta - 1|\lambda_n \left(1 + \frac{2\lambda_n\lambda_2\beta\theta}{(\beta\theta-1)^2\lambda_n^2}\right).$$

If $\beta\theta \gg 1$, then $|\beta\theta - 1|\lambda_n \left(1 + \frac{2\lambda_n\lambda_2\beta\theta}{(\beta\theta-1)^2\lambda_n^2}\right) \approx (\beta\theta - 1)\lambda_n$, which implies $\alpha^* = O\!\left(\frac{\theta}{\lambda_2}\right)$. If $\beta\theta \ll 1$ (i.e., $(\beta\theta - 1)^2 \approx 1$), then $|\beta\theta - 1|\lambda_n \left(1 + \frac{2\lambda_n\lambda_2\beta\theta}{(\beta\theta-1)^2\lambda_n^2}\right) \approx (1 - \beta\theta)\lambda_n \left(1 + \frac{2\lambda_2\beta\theta}{\lambda_n}\right)$, which implies $\alpha^* = O\!\left(\frac{\theta}{\lambda_n}\right)$. For the moderate E-SNR regime, the optimal order can be obtained from Theorem 3 using the fact that $\beta\theta - 1 \approx 0$.

APPENDIX F

Let $\hat{x} = Hy$ and $\tilde{x} = H\bar{y}$. It is easy to verify that $\mathbb{E}[\tilde{x}] = \mathbb{E}[\hat{x}] = Hx^*$ and $\mathrm{cov}(\tilde{x}) = \frac{\mathrm{cov}(\hat{x})}{T}$. Therefore, the bias of $\tilde{x}$ is the same as in (5) and the variance of $\tilde{x}$ is $\frac{\mathrm{Var}(\alpha)}{T}$, where $\mathrm{Var}(\alpha)$ denotes the variance of $\hat{x}$ as in (6). Finally, the results are obtained by following the same proof procedure as in Appendix C.

APPENDIX G

Since $x^* = \sum_{j \in \mathcal{A}} \omega_j v_j$, by the orthogonality of eigenvectors, applying $\sum_{i=2}^n (v_i^T x^*)^2 = \sum_{j \in \mathcal{A} \setminus \{1\}} \omega_j^2$ to Corollary 1 completes the proof.
APPENDIX H

Since $x^* \sim \mathcal{N}(\mu\mathbf{1}_n, \mathrm{diag}(\mathbf{s}))$, using the smoothing property of conditional expectation and following the same proof procedure as in Appendix B with $a = \mu$, Corollary 1 gives

$$\mathrm{MSE\text{-}UB}(\alpha) = \left(\frac{1}{1+\frac{1}{\alpha\lambda_n}}\right)^2 \mathrm{trace}\!\left[\mathrm{diag}(\mathbf{s})\left(VV^T - \bar{\mathbf{1}}\bar{\mathbf{1}}^T\right)\right] + \left(\frac{1}{1+\alpha\lambda_2}\right)^2 (n-1)\bar{\sigma} + \sigma_1^2, \qquad (S2)$$

where $V = [v_1\ v_2\ \cdots\ v_n]$. Applying Von Neumann's trace inequality [22], we have

$$\mathrm{trace}\!\left[\mathrm{diag}(\mathbf{s})VV^T\right] \le \sum_{i=1}^n s_i^2 = n\bar{s}, \qquad (S3)$$

where the equality holds if $s_i = s \ge 0$ for all $1 \le i \le n$. Finally, since $\mathrm{trace}\!\left[\mathrm{diag}(\mathbf{s})\bar{\mathbf{1}}\bar{\mathbf{1}}^T\right] = \bar{s}$, applying (S3) to (S2) completes the proof.
