An information-theoretic derivation of min-cut based clustering

Anil Raj
Department of Applied Physics and Applied Mathematics, Columbia University, New York*

Chris H. Wiggins
Department of Applied Physics and Applied Mathematics, Center for Computational Biology and Bioinformatics, Columbia University, New York†

(Dated: February 28, 2022)

Min-cut clustering, based on minimizing one of two heuristic cost functions proposed by Shi and Malik, has spawned tremendous research, both analytic and algorithmic, in the graph partitioning and image segmentation communities over the last decade. It is, however, unclear whether these heuristics can be derived from a more general principle that would facilitate generalization to new problem settings. Motivated by an existing graph partitioning framework, we derive relationships between optimizing relevance information, as defined in the Information Bottleneck method, and the regularized cut in a K-partitioned graph. For fast-mixing graphs, we show that the cost functions introduced by Shi and Malik can be well approximated as the rate of loss of predictive information about the location of random walkers on the graph. For graphs generated from a stochastic algorithm designed to model community structure, the optimal information-theoretic partition and the optimal min-cut partition are shown to be the same with high probability.

Keywords: graphs, clustering, information theory, min-cut, information bottleneck, graph diffusion

1. INTRODUCTION

Min-cut based graph partitioning has been used successfully to find clusters in networks, with applications in image segmentation as well as in clustering biological and sociological networks. The central idea is to develop fast and efficient algorithms that optimally cut the edges between graph nodes, resulting in a separation of the graph nodes into clusters.
In particular, since Shi and Malik showed [1] that the average cut and the normalized cut (defined below) are useful heuristics to optimize, there has been tremendous research in the image segmentation community on constructing the best normalized-cut-based cost function.

The Information Bottleneck (IB) method [2, 3] is a clustering technique, based on rate-distortion theory [4], that has been applied successfully in a wide variety of contexts, including clustering word documents and gene-expression profiles [5]. The IB method is also capable of learning clusters in graphs and has been used successfully on both synthetic and real networks [6]. In the hard-clustering case, given the diffusive probability distribution over a graph, IB optimally assigns the probability distributions associated with nodes into distinct groups. These assignment rules define a separation of the graph nodes into clusters.

We here illustrate how minimizing the two cut-based heuristics introduced by Shi and Malik can be well approximated by the rate of loss of relevance information, defined in the IB method applied to clustering graphs. To establish these relations, we must first define the graphs to be partitioned; we assume hard clustering and fix the cluster cardinality at K. We show numerically that maximizing mutual information and minimizing the regularized cut amount to the same partition with high probability for more modular 32-node graphs, where modularity is set by the probability of inter-cluster edge connections in the Stochastic Block Model for graphs (see Numerical Experiments). We also show that the optimization goal of maximizing relevance information is equivalent to minimizing the regularized cut for 16-node graphs [12].

* Electronic address: ar2384@columbia.edu
† Electronic address: chris.wiggins@columbia.edu

2. THE MIN-CUT PROBLEM

Following [7], for an undirected, unweighted graph G = (V, E) with n nodes and m edges, represented [13] by its adjacency matrix A := {A_xy = 1 ⟺ x ~ y}, we define, for two not necessarily disjoint sets of nodes V_+, V_− ⊆ V, the association

  W(V_+, V_−) = Σ_{x ∈ V_+, y ∈ V_−} A_xy.   (2.1)

The sets V_± form a bisection of V if V_+ ∪ V_− = V and V_+ ∩ V_− = ∅. For a bisection of V into V_+ and V_−, the 'cut' is defined as c(V_+, V_−) = W(V_+, V_−). We also quantify the size of a set V_+ ⊆ V in terms of the number of nodes in V_+ or the number of edges with at least one node in V_+:

  ω(V_+) = Σ_{x ∈ V_+} 1,   Ω(V_+) = Σ_{x ∈ V_+} d_x,   (2.2)

where d_x is the degree of node x. Shi and Malik [1] defined a pair of regularized cuts for a bisection of V into V_+ and V_−; the average cut is

  A = W(V_+, V_−)/ω(V_+) + W(V_+, V_−)/ω(V_−)   (2.3)

and the normalized cut is

  N = W(V_+, V_−)/Ω(V_+) + W(V_+, V_−)/Ω(V_−).   (2.4)

These definitions generalize, for a K-partition of V into V_1, V_2, ..., V_K [7], to

  A = Σ_j W(V_j, V̄_j)/ω(V_j),   (2.5)
  N = Σ_j W(V_j, V̄_j)/Ω(V_j),   (2.6)

where V̄_j = V \ V_j. For the graph G, we define the graph Laplacian Δ = D − A, where D is the diagonal matrix of vertex degrees. For a bisection of V, we also define the partition indicator vector h [8],

  h_x = +1 ∀ x ∈ V_+;  h_x = −1 ∀ x ∈ V_−.   (2.7)

Specifying two 'prior' probability distributions over the set of nodes V, (i) p(x) ∝ 1 and (ii) p(x) ∝ d_x, we define the corresponding averages of h,

  h̄ = (1/n) Σ_{x ∈ V} h_x,   ⟨h⟩ = (1/2m) Σ_{x ∈ V} d_x h_x.   (2.8)

The cut, as defined by Fiedler [8], and the regularized cuts, as defined by Shi and Malik [1], can then be written in terms of h as (see Appendix)

  c = (1/4) h^T Δ h,
  A = (1/n) h^T Δ h / (1 − h̄²),   (2.9)
  N = (1/2m) h^T Δ h / (1 − ⟨h⟩²).
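These definitions are easy to check on a small example. The sketch below is ours (NumPy assumed; the function names are not from the paper) and also verifies the identity c = (1/4) h^T Δ h of Eq. (2.9):

```python
import numpy as np

def cut_value(A, h):
    """c(V+, V-): number of edges crossing the bisection h (entries +/-1)."""
    plus = h > 0
    return A[np.ix_(plus, ~plus)].sum()

def average_cut(A, h):
    """Average cut of Eq. (2.3): c/omega(V+) + c/omega(V-)."""
    c, plus = cut_value(A, h), h > 0
    return c / plus.sum() + c / (~plus).sum()

def normalized_cut(A, h):
    """Normalized cut of Eq. (2.4): c/Omega(V+) + c/Omega(V-)."""
    c, d, plus = cut_value(A, h), A.sum(axis=1), h > 0
    return c / d[plus].sum() + c / d[~plus].sum()

# Path graph 0-1-2-3, bisected into {0, 1} | {2, 3}: one edge crosses.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h = np.array([1, 1, -1, -1])
Delta = np.diag(A.sum(axis=1)) - A            # graph Laplacian, Delta = D - A
assert cut_value(A, h) == h @ Delta @ h / 4   # Fiedler's c = (1/4) h^T Delta h
```

For this path graph the cut is 1, the average cut is 1/2 + 1/2 = 1, and the normalized cut is 1/3 + 1/3 = 2/3.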
More generally, for a K-partition, we define the partition indicator matrix Q as

  Q^z_x ≡ p(z|x) = 1 ∀ x ∈ z,   (2.10)

where z ∈ {V_1, V_2, ..., V_K}, and define P as a diagonal matrix of the 'prior' probability distribution over the nodes. The regularized cut can then be generalized as

  C = Σ_j [Q^T Δ Q]_jj / [Q^T P Q]_jj,   (2.11)

where for p(x) ∝ 1, C = A, and for p(x) ∝ d_x, C = N. Inferring the optimal h (or Q), however, has been shown to be an NP-hard combinatorial optimization problem [9].

3. INFORMATION BOTTLENECK

Rate-distortion theory, which provides the foundations for lossy data compression, formulates clustering in terms of a compression problem: it determines the code with minimum average length such that information can be transmitted without exceeding some specified distortion. Here the model complexity, or rate, is measured by the mutual information between the data and their representative codewords (the average number of bits used to store a data point). Simpler models correspond to smaller rates, but they typically suffer from relatively high distortion. The distortion measure, which can be identified with a loss function, usually depends on the problem; in the simplest of cases, it is the variance of the difference between the data and their representatives.

The Information Bottleneck (IB) method [3] proposes the use of mutual information as a natural distortion measure. In this method, the data are compressed into clusters while maximizing the amount of information that the 'cluster representation' preserves about some specified relevance variable. For example, in clustering word documents, one could use the 'topic' of a document as the relevance variable. For a graph G, let X be a random variable over graph nodes, Y be the relevance variable, and Z be the random variable over clusters.
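Equation (2.11) can be evaluated directly from a hard cluster labeling. In the sketch below (ours; NumPy assumed) we take P unnormalized, with P_x = 1 or P_x = d_x, so that C reduces exactly to A and N respectively:

```python
import numpy as np

def regularized_cut(A, labels, prior="uniform"):
    """C = sum_j [Q Delta Q^T]_jj / [Q P Q^T]_jj, Eq. (2.11).

    Q is the K x n hard-partition indicator of Eq. (2.10); with P_x = 1
    C is the average cut A, with P_x = d_x it is the normalized cut N."""
    n = A.shape[0]
    K = labels.max() + 1
    Q = np.zeros((K, n))
    Q[labels, np.arange(n)] = 1.0
    d = A.sum(axis=1)
    Delta = np.diag(d) - A                       # graph Laplacian
    w = np.ones(n) if prior == "uniform" else d  # diagonal of P
    num = np.einsum('ja,ab,jb->j', Q, Delta, Q)  # [Q Delta Q^T]_jj = W(V_j, V_j_bar)
    return float((num / (Q @ w)).sum())

# Path graph 0-1-2-3 with partition {0, 1} | {2, 3}:
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
labels = np.array([0, 0, 1, 1])
assert abs(regularized_cut(A, labels, "uniform") - 1.0) < 1e-12   # A = 1/2 + 1/2
assert abs(regularized_cut(A, labels, "degree") - 2 / 3) < 1e-12  # N = 1/3 + 1/3
```

Brute-force minimization of C over all labelings is only feasible for tiny graphs, consistent with the NP-hardness result cited above.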
Graph partitioning using the IB method [6] learns a probabilistic cluster assignment function p(z|x), which gives the probability that a given node x belongs to cluster z. The optimal p(z|x) minimizes the mutual information between X and Z while minimizing the loss of predictive information between Z and Y. This complexity–fidelity trade-off can be expressed in terms of a functional to be minimized,

  F[p(z|x)] = −I[Y;Z] + T I[X;Z],   (3.1)

where the temperature T parameterizes the relative importance of precision over complexity. As T → 0, we reach the 'hard clustering' limit, where each node is assigned with unit probability to one cluster (i.e., p(z|x) ∈ {0, 1}).

Graph clustering, as formulated in terms of the IB method, requires a joint distribution p(y, x) to be defined on the graph; we use the distribution given by continuous graph diffusion, as it naturally captures topological information about the network [6]. The relevance variable Y then ranges over the nodes of the graph and is defined as the node at which a random walker, starting at node x at time 0, is found at time t. For continuous-time diffusion, the conditional distribution p_t(y|x) is given by

  G^t = p_t(y|x) = e^{−t Δ P^{−1}},   (3.2)

where Δ is the graph Laplacian and P is a diagonal matrix of the prior distribution over the graph nodes, as described earlier. The characteristic diffusion time scale τ of the system is given by the inverse of the smallest non-zero eigenvalue of the diffusion operator exponent Δ P^{−1} and characterizes the slowest decaying mode in the system. To calculate the joint distribution p(y, x) from the conditional G^t, we must specify an initial or prior distribution [14]; we use the two priors p(x) used earlier to calculate the expected value of h: (i) p(x) ∝ 1 and (ii) p(x) ∝ d_x.
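Numerically, the kernel (3.2) is a matrix exponential; the sketch below (ours; SciPy assumed) follows our reading that G^t is column-stochastic, with columns indexed by the start node x. It also illustrates footnote [14]: p is the stationary distribution, and at long times every column of G^t relaxes to it.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, t, prior="degree"):
    """G^t_{yx} = p_t(y|x) = [exp(-t Delta P^{-1})]_{yx}, Eq. (3.2)."""
    d = A.sum(axis=1)
    Delta = np.diag(d) - A
    p = np.ones(len(d)) / len(d) if prior == "uniform" else d / d.sum()
    return expm(-t * Delta @ np.diag(1.0 / p)), p

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # small example graph (ours)
G, p = diffusion_kernel(A, t=0.5)
assert np.allclose(G.sum(axis=0), 1.0)  # each column is a distribution over y
assert np.allclose(G @ p, p)            # p is stationary under the diffusion
joint = G * p                           # p(y, x) = p_t(y|x) p(x), sums to 1
G_inf, _ = diffusion_kernel(A, t=30.0)  # well past the time scale tau
assert np.allclose(G_inf, p[:, None] * np.ones((1, 4)), atol=1e-6)
```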
4. RATE OF INFORMATION LOSS IN GRAPH DIFFUSION

We analyze here the rate of loss of predictive information between the relevance variable Y and the cluster variable Z during diffusion on a graph G, after the graph nodes have been hard-partitioned into K clusters.

A. Well-mixed limit of graph diffusion

For a given partition Q of the graph, defined in Eq. (2.10), we approximate the mutual information I[Y;Z] when diffusion on the graph reaches its well-mixed limit. We introduce the dependence η(y, z) such that

  p(y, z) = p(y) p(z) (1 + η).   (4.1)

This implies ⟨η⟩_y = ⟨η⟩_z = 0 and ⟨⟨η²⟩_z⟩_y = ⟨η⟩, where ⟨·⟩ denotes expectation over the joint distribution and ⟨·⟩_y and ⟨·⟩_z denote expectations over the corresponding marginals.

In the well-mixed limit, we have η ≪ 1. The predictive information (expressed in nats) can then be approximated as

  I[Y;Z] = ⟨ ln [ p(z, y) / (p(z) p(y)) ] ⟩
         = ⟨⟨ (1 + η) ln(1 + η) ⟩_y⟩_z
         ≈ ⟨⟨ (1 + η)(η − η²/2) ⟩_y⟩_z
         ≈ ⟨⟨ η + η²/2 ⟩_y⟩_z
         = (1/2) ⟨⟨η²⟩_y⟩_z   (4.2)
         = (1/2) Σ_{y,z} p(y) p(z) [ p(z, y)/(p(z) p(y)) − 1 ]²
         = (1/2) ( Σ_{y,z} p(y, z)² / (p(y) p(z)) − 1 ) ≡ ι.   (4.3)

Here we define ι as a first-order approximation to I[Y;Z] in the well-mixed limit of graph diffusion.

1. Well-mixed K-partitioned graph

As in the IB method, the Markov condition Z − X − Y allows several simplifications of the conditional distributions and associated information-theoretic measures. For a K-partition Q of the graph, we have

  p(y, z) = Σ_x p(x, y, z) = Σ_x p(z|y, x) p(y|x) p(x) = Σ_x p(z|x) p(y|x) p(x) ≡ Q P (G^t)^T,   (4.4)

  p(y, z)² = [ Σ_x p(z|x) p(y|x) p(x) ]² = Σ_{x,x'=1}^n Q^z_x G^t_{yx} P_x Q^z_{x'} G^t_{yx'} P_{x'},   (4.5)

  p(z) = Σ_x p(z|x) p(x) = Σ_x Q^z_x P_x.   (4.6)

Graph diffusion being a Markov process, we have Σ_{y=1}^n G^t_{x'y} G^t_{yx} = G^{2t}_{x'x}. Using this and Bayes' rule, G^t_{yx} P_x = G^t_{xy} P_y, we have

  ι = (1/2) ( Σ_{y,z} [ Σ_{x,x'=1}^n Q^z_x G^t_{yx} P_x Q^z_{x'} G^t_{yx'} P_{x'} ] / [ (Σ_{x''} Q^z_{x''} P_{x''}) P_y ] − 1 )
    = (1/2) ( Σ_{y,z} [ Σ_{x,x'=1}^n Q^z_x Q^z_{x'} P_y G^t_{x'y} G^t_{yx} P_x ] / [ (Σ_{x''} Q^z_{x''} P_{x''}) P_y ] − 1 )
    = (1/2) ( Σ_{z=1}^K [ Σ_{x,x'=1}^n Q^z_x Q^z_{x'} (Σ_{y=1}^n G^t_{x'y} G^t_{yx}) P_x ] / (Σ_{x''} Q^z_{x''} P_{x''}) − 1 )
    = (1/2) ( Σ_{z=1}^K [ Σ_{x,x'=1}^n Q^z_x Q^z_{x'} G^{2t}_{x'x} P_x ] / (Σ_{x''} Q^z_{x''} P_{x''}) − 1 ).   (4.7)

In the hard-clustering case, Σ_x Q^z_x P_x = p(z) = [Q P Q^T]_zz, and we have

  ι = (1/2) ( Σ_{z=1}^K [Q (G^{2t} P) Q^T]_zz / [Q P Q^T]_zz − 1 ).   (4.8)

2. Well-mixed 2-partitioned graph

We can rewrite ι as

  ι = (1/2) ⟨⟨η²⟩_y⟩_z = (1/2) ⟨⟨ (p(z|y) − p(z))² / p(z)² ⟩_z⟩_y.   (4.9)

For a bisection h of the graph, z ∈ {+1, −1} and we have

  p(z|x) = (1/2)(1 ± h_x) ≡ (1/2)(1 + z h_x),   (4.10)

  p(z|y) = (1/p(y)) Σ_x p(z, y, x) = (1/p(y)) Σ_x p(z|x) p(y|x) p(x) = (1/2) Σ_x (1 + z h_x) p(x|y) = (1/2)(1 + z ⟨h|y⟩),   (4.11)

  p(z) = Σ_x p(z, x) = Σ_x p(z|x) p(x) = (1/2) Σ_x (1 + z h_x) p(x) = (1/2)(1 + z ⟨h⟩),   (4.12)

  p(z|y) − p(z) = (1/2)(1 + z ⟨h|y⟩) − (1/2)(1 + z ⟨h⟩) = (z/2)(⟨h|y⟩ − ⟨h⟩).   (4.13)

We then have

  ⟨ (p(z|y) − p(z))² / p(z)² ⟩_z = Σ_{z=±1} (1/4)(⟨h|y⟩ − ⟨h⟩)² / [ (1/2)(1 + z ⟨h⟩) ]
    = [ (⟨h|y⟩ − ⟨h⟩)² / 2 ] Σ_{z=±1} 1/(1 + z ⟨h⟩)
    = (⟨h|y⟩ − ⟨h⟩)² / (1 − ⟨h⟩²).   (4.14)

The mutual information I[Y;Z] can then be approximated as

  ι = (1/2) ⟨ (⟨h|y⟩ − ⟨h⟩)² ⟩_y / (1 − ⟨h⟩²) = (1/2) σ²_y(⟨h|y⟩) / (1 − ⟨h⟩²).   (4.15)

Using Bayes' rule, p_t(x|y) p(y) = p_t(y|x) p(x), we have

  ⟨h|y⟩ = Σ_x h_x p_t(x|y) = Σ_x h_x p_t(y|x) p(x) / p(y).   (4.16)
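Before completing the bisection calculation, Eq. (4.8) can be checked against the exact mutual information computed from the joint distribution of Eq. (4.4). A sketch (ours; SciPy assumed; the test graph, two triangles joined by a bridge, is our choice): at small t, where the well-mixed assumption fails, ι can deviate substantially from I[Y;Z], but both decay to zero together as the walk mixes.

```python
import numpy as np
from scipy.linalg import expm

def iota_vs_exact(A, labels, t):
    """Return (exact I[Y;Z] in nats, well-mixed approximation iota of Eq. 4.8)."""
    n = A.shape[0]
    d = A.sum(axis=1)
    Delta = np.diag(d) - A
    p = d / d.sum()                                # degree prior
    K = labels.max() + 1
    Q = np.zeros((K, n))
    Q[labels, np.arange(n)] = 1.0
    Gt = expm(-t * Delta @ np.diag(1.0 / p))       # p_t(y|x)
    joint = Q @ (Gt * p).T                         # p(z, y) = Q P (G^t)^T, Eq. (4.4)
    pz, py = joint.sum(axis=1), joint.sum(axis=0)
    I = float(np.sum(joint * np.log(joint / np.outer(pz, py))))
    G2t = expm(-2 * t * Delta @ np.diag(1.0 / p))  # G^{2t}
    num = np.einsum('za,ab,zb->z', Q, G2t * p, Q)  # [Q (G^{2t} P) Q^T]_zz
    iota = 0.5 * (float(np.sum(num / (Q @ p))) - 1.0)
    return I, iota

# Two triangles joined by a bridge; natural blocks {0,1,2} and {3,4,5}.
A = np.zeros((6, 6))
for x, y in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[x, y] = A[y, x] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])
I1, iota1 = iota_vs_exact(A, labels, t=1.0)
I20, iota20 = iota_vs_exact(A, labels, t=20.0)
assert I1 > I20                           # predictive information decays with t
assert I20 < 1e-8 and abs(iota20) < 1e-8  # both vanish in the well-mixed limit
```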
  ⟨⟨h|y⟩²⟩_y = Σ_{y=1}^n p(y) Σ_{x,x'=1}^n h_x h_{x'} p_t(y|x) p(x) p_t(x'|y) / p(y)
             = Σ_{y=1}^n Σ_{x,x'=1}^n h_x h_{x'} p_t(y|x) p_t(x'|y) p(x).   (4.17)

Again, graph diffusion being a Markov process,

  ⟨⟨h|y⟩²⟩_y = Σ_{x,x'=1}^n h_x h_{x'} p_{2t}(x'|x) p(x) = ⟨h_x h_{x'}⟩_{2t},   (4.18)

  σ²_y(⟨h|y⟩) = ⟨⟨h|y⟩²⟩_y − ⟨h⟩² = ⟨h_x h_{x'}⟩_{2t} − ⟨h⟩²,   (4.19)

  ι = (1/2) ( ⟨h_x h_{x'}⟩_{2t} − ⟨h⟩² ) / (1 − ⟨h⟩²).   (4.20)

B. Fast-mixing graphs

When diffusion on a graph reaches its well-mixed limit in short times, we have G^{2t} ≈ I − 2t Δ P^{−1}. Thus, for a K-partition Q of a graph,

  Q (G^{2t} P) Q^T ≈ Q (P − 2t Δ) Q^T = Q P Q^T − 2t Q Δ Q^T.   (4.21)

For bisections, the short-time approximation of ⟨h_x h_{x'}⟩_{2t} can be written as

  ⟨h_x h_{x'}⟩_{2t} = Σ_{x,x'=1}^n h_{x'} p_{2t}(x', x) h_x = h^T G^{2t} P h ≈ h^T (I − 2t Δ P^{−1}) P h = h^T P h − 2t h^T Δ h = 1 − 2t h^T Δ h.   (4.22)

For fast-mixing graphs, the long-time and short-time approximations for I[Y;Z] and ⟨h_x h_{x'}⟩_{2t}, respectively, hold simultaneously:

  I[Y;Z] ≈ ι ≈ 1/2 − t h^T Δ h / (1 − ⟨h⟩²)
  ⇒ dI[Y;Z]/dt ≈ dι/dt ∝ { A for p(x) ∝ 1;  N for p(x) ∝ d_x }.   (4.23)

We have thus shown analytically that, for fast-mixing graphs, the heuristics introduced by Shi and Malik are proportional to the rate of loss of relevance information. The errors incurred in the approximations I[Y;Z] ≈ ι and ⟨h_x h_{x'}⟩_{2t} ≈ 1 − 2t h^T Δ h can be defined as

  E_0(t) = | [ ⟨h_x h_{x'}⟩_{2t} − (1 − 2t h^T Δ h) ] / ⟨h_x h_{x'}⟩_{2t} |,   (4.24)

  E_1(t) = | [ I[Y;Z](t) − ι(t) ] / I[Y;Z](t) |.   (4.25)

5. NUMERICAL EXPERIMENTS

The validity of the two approximations can be seen in a typical plot of E_1(t) and E_0(t) as a function of normalized diffusion time t̃ = t/τ, for the two different choices of prior distributions over the nodes. E_1, as seen in Fig. 1, is often found to be non-monotonic and sometimes exhibits oscillations. This suggests defining E_∞, a modified monotonic 'E_1':

  E_∞(t) ≡ max_{t' ≥ t} E_1(t').   (5.1)

We do not need to define a monotonic form for E_0, since this error is always found to increase monotonically in time. By fast-mixing graphs we mean graphs that become well-mixed in short times, i.e., graphs for which both the long-time and short-time approximations hold simultaneously within a certain range of times t̃*_− ≤ t̃ ≤ t̃*_+, as illustrated in Fig. 1, where we define

  E(t) = max( E_∞(t), E_0(t) ),   (5.2)
  E* = min_t E(t),   (5.3)
  t̃*_− = min( argmin_t̃ E(t̃) ),   (5.4)
  t̃*_+ = max( argmin_t̃ E(t̃) ).   (5.5)

FIG. 1: E_1 and E_0 vs normalized diffusion time for the two choices of priors over the graph nodes. E_1 (red) typically behaves non-monotonically, which motivates defining the monotonic E_∞ (green). Note that using E_∞ instead of E_1 over-estimates E*; the E* calculated is an upper bound.

Graphs were drawn randomly from a Stochastic Block Model (SBM) distribution [10], with block cardinality 2, to analyze the distributions of E*, t̃*_−, and t̃*_+. As is commonly done in community detection [11], for a graph of n nodes the average degree per node is fixed at n/4 for graphs drawn from the SBM distribution: two nodes are connected with probability p_+ if they belong to the same block, but with probability p_− < p_+ if they belong to different blocks. The two probabilities are thus constrained by the relation

  p_+ (n/2 − 1) + p_− (n/2) = n/4,   (5.6)

leaving only one free parameter, p_−, which tunes the 'modularity' of graphs in the distribution.
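A sampler for this one-parameter family follows immediately from the constraint (5.6); the sketch below (ours; NumPy assumed) solves for p_+ given p_− and n:

```python
import numpy as np

def sbm_two_blocks(n, p_minus, seed=0):
    """Two-block SBM with mean degree n/4; Eq. (5.6) fixes p_plus."""
    p_plus = (n / 4.0 - p_minus * n / 2.0) / (n / 2.0 - 1.0)
    assert p_minus < p_plus <= 1.0, "p_minus too large for this n"
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n // 2)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_plus, p_minus)
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # independent edges
    A = (upper | upper.T).astype(float)
    return A, labels

A, labels = sbm_two_blocks(32, p_minus=0.02)
mean_degree = A.sum() / 32   # close to 32/4 = 8 in expectation
```

For n = 32 and p_− = 0.02 the constraint gives p_+ ≈ 0.51, a strongly modular graph.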
Starting with a graph drawn from a distribution specified by a p_− value, and specifying an initial cluster assignment as given by the SBM distribution, we make local moves (adding or deleting an edge in the graph and/or reassigning a node's cluster label) and search exhaustively over this move-set for local minima of E*. Fig. 2 compares the values of E* and (t̃*_−, t̃*_+) for graphs obtained in this systematic search, starting from a graph drawn from a distribution with p_− = 0.02 and n = {16, 32, 64}. We note that the scatter plots for graphs of different sizes collapse onto one another when E* is plotted against normalized time, confirming the Fiedler value 1/τ to be an appropriate characteristic diffusion time-scale, as used in [6]. A plot of E* against actual diffusion time shows that the scatter plots for graphs of different sizes no longer collapse.

Having shown analytically that, for fast-mixing graphs, the regularized min-cut is approximately the rate of loss of relevance information, it is instructive to compare the actual partitions that optimize these two objectives.

FIG. 2: E* vs t̃* for graphs of different sizes and different prior distributions over the graph nodes; t̃*_− and t̃*_+ are represented by · and ◦, respectively.

FIG. 3: p(h_inf(t) ≠ h_cut) vs normalized diffusion time, averaged over 500 graphs drawn from a distribution parameterized by a given p_− value, plotted for different graph distributions.

Graphs of size n = 32 were drawn from the SBM distribution with p_− = {0.1, 0.12, 0.14, 0.16}. Starting with the equal-sized partition specified by the model itself, we performed iterative coordinate descent to search (independently) for the partition that minimized the regularized cut (h_cut) and the one that maximized the relevance information (h_inf(t)); i.e., we reassigned each node's cluster label in turn and kept the reassignment that gave the lowest new value of the cost function being optimized. Plots comparing the partitions h_inf(t) and h_cut learned by optimizing the two objectives, averaged over 500 graphs drawn from each distribution, are shown in Fig. 3.

6. CONCLUDING REMARKS

We have shown that the normalized cut and the average cut, introduced by Shi and Malik as useful heuristics to be minimized when partitioning graphs, are well approximated by the rate of loss of predictive information for fast-mixing graphs. Deriving these cut-based cost functions from rate-distortion theory gives them a more principled setting, makes them interpretable, and facilitates generalization to appropriate cut-based cost functions in new problem settings. We have also shown (see Fig. 2) that the inverse Fiedler value is an appropriate normalization for diffusion time, justifying its use in [6] to capture long-time behaviors on the network.

Absent from this manuscript is a discussion of how not to over-partition a graph, i.e., a criterion for selecting K. It is hoped that, by showing how these heuristics can be derived from a more general problem setting, lessons learned from investigating stability, cross-validation, or other approaches may benefit those using min-cut based approaches as well. Similarly, by showing how these heuristics approximate cost functions from a separate optimization problem, it is hoped that algorithms employed in rate-distortion theory, e.g., Blahut–Arimoto, may be brought to bear on min-cut minimization.

APPENDIX

Using the definition of Δ, for any vector f over the graph nodes, we have

  f^T Δ f = f^T D f − f^T A f
          = Σ_x d_x f_x² − Σ_{x,y=1}^n f_x f_y A_xy
          = Σ_x ( Σ_{y=1}^n A_xy ) f_x² − Σ_{x,y=1}^n f_x f_y A_xy
          = (1/2) ( Σ_{x,y=1}^n f_x² A_xy − 2 Σ_{x,y=1}^n f_x f_y A_xy + Σ_{x,y=1}^n f_y² A_xy )
          = (1/2) Σ_{x,y=1}^n A_xy ( f_x − f_y )².   (A.1)
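Identity (A.1) is straightforward to verify numerically; a quick check (ours; NumPy assumed) on a random graph and a random vector:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
upper = np.triu(rng.random((n, n)) < 0.4, k=1)  # random undirected graph
A = (upper | upper.T).astype(float)
Delta = np.diag(A.sum(axis=1)) - A              # Delta = D - A
f = rng.standard_normal(n)
lhs = f @ Delta @ f
rhs = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(lhs, rhs)   # f^T Delta f = (1/2) sum_xy A_xy (f_x - f_y)^2
```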
Now, when f = h, we have

  h^T Δ h = (1/2) Σ_{h_x h_y = −1} 4 A_xy = 4c.   (A.2)

The factor 1/2 disappears because the summation over all nodes counts each adjacent pair of nodes twice. Using the definitions of A and N, we have

  A = c ( 1/Σ_{h_x = +1} 1 + 1/Σ_{h_x = −1} 1 )
    = c ( 1/Σ_x [(1 + h_x)/2] + 1/Σ_x [(1 − h_x)/2] )
    = 2c [ Σ_x (1 − h_x + 1 + h_x) ] / [ Σ_x (1 + h_x) Σ_x (1 − h_x) ]
    = 2c [ 2n / ( (n + Σ_x h_x)(n − Σ_x h_x) ) ]
    = 2c [ (2/n) / ( (1 + h̄)(1 − h̄) ) ]
    = (4/n) c / (1 − h̄²).   (A.3)

  N = c ( 1/Σ_{h_x = +1} d_x + 1/Σ_{h_x = −1} d_x )
    = c ( 1/Σ_x d_x [(1 + h_x)/2] + 1/Σ_x d_x [(1 − h_x)/2] )
    = 2c [ Σ_x d_x (1 − h_x + 1 + h_x) ] / [ Σ_x d_x (1 + h_x) Σ_x d_x (1 − h_x) ]
    = 2c [ 4m / ( (2m + Σ_x h_x d_x)(2m − Σ_x h_x d_x) ) ]
    = 2c [ (1/m) / ( (1 + ⟨h⟩)(1 − ⟨h⟩) ) ]
    = (2/m) c / (1 − ⟨h⟩²).   (A.4)

[1] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Jan 2000.
[2] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics, Jan 2000.
[3] N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. Advances in Neural Information Processing Systems, Jan 2000.
[4] C. Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), Jan 2001.
[5] N. Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University of Jerusalem, 2002.
[6] E. Ziv, M. Middendorf, and C. H. Wiggins. Information-theoretic approach to network modularity. Physical Review E, Jan 2005.
[7] U. von Luxburg. A tutorial on spectral clustering. arXiv, cs.DS, Nov 2007.
[8] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 1973.
[9] D. Wagner and F. Wagner. Between min cut and graph bisection. In MFCS '93: Proceedings of the 18th International Symposium on Mathematical Foundations of Computer Science, pages 744–750, London, UK, 1993. Springer-Verlag.
[10] P. W. Holland and S. Leinhardt. Local structure in social networks. Sociological Methodology, Jan 1976.
[11] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, Jan 2005.
[12] We chose 16-node graphs so that the network and its partitions could be parsed visually with ease.
[13] We use the shorthand x ~ y to mean that x is adjacent to y.
[14] Strictly speaking, any diagonal matrix P that we specify determines the steady-state distribution. Since we are modeling the distribution of random walkers at statistical equilibrium, we always use this distribution as our initial or prior distribution.
