MOHCS: Towards Mining Overlapping Highly Connected Subgraphs
Many real-life networks contain parts in which some nodes are more highly connected to each other than to the other nodes of the network. Collections of such nodes are usually called clusters, communities, cohesive groups or modules. In …
Authors: Xiahong Lin, Lin Gao, Kefei Chen
MOHCS: Towards Mining Overlapping Highly Connected Subgraphs*

Xiahong Lin, School of Computer Science and Technology, Xidian University, Xi'an, China, xhlin@sjtu.edu.cn
Lin Gao†, School of Computer Science and Technology, Xidian University, Xi'an, China, lgao@mail.xidian.edu.cn
Kefei Chen, Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, China, kfchen@sjtu.edu.cn
David K. Y. Chiu, Department of Computing and Information Science, University of Guelph, Guelph, N1G 2W1, Canada, dchiu@cis.uoguelph.ca

Abstract: Many real-life networks contain parts in which some nodes are more highly connected to each other than to the other nodes of the network. Collections of such nodes are usually called clusters, communities, cohesive groups or modules. In graph terminology, such a collection is called a highly connected graph. In this paper, we first prove some properties of highly connected graphs. Based on these properties, we then redefine the highly connected subgraph, which results in an algorithm that determines whether a given graph is highly connected in linear time. We then present a computationally efficient algorithm, called MOHCS, for mining overlapping highly connected subgraphs. We have experimentally evaluated the performance of MOHCS using real and synthetic data sets, namely computer-generated graphs and a yeast protein network. Our results show that MOHCS is effective and reliable in finding overlapping highly connected subgraphs.

Keywords: highly connected subgraph, clustering algorithms, minimum cut, minimum degree

I. INTRODUCTION

In a graph modeling a network, such as a biological network [1], an information network [2] or a social network [3], a highly connected subgraph always corresponds to a meaningful, cohesive set of interconnected vertices. For example, a dense co-expression network may represent a tight co-expression cluster [4].
The definition of a highly connected graph varies across works. We define a highly connected graph (or simply dense graph) as a graph whose minimum cut is no less than half of its vertex set size (the formal definition can be found in [5]). Because such subgraphs have wide applications, identifying these a priori unknown building blocks is crucial to understanding the structural and functional properties of networks. Researchers have addressed various problem settings and have proposed numerous algorithms to achieve their goals. Our review focuses only on the algorithms that are most closely related to our work.

Among these, [6] provides a definition of highly connected subgraph that is valid and useful in practice. The HCS algorithm proposed there is one of the best-known clustering algorithms and has been widely used in domains such as gene expression analysis [7] and functional module discovery [5, 8-10]. It recursively partitions the current graph into two subgraphs by removing the minimum cut until the graph is highly connected [5]. However, HCS has some shortcomings. First, HCS cannot identify overlapping highly connected subgraphs because it is a graph-partitioning method [5]. Second, when the algorithm is applied repeatedly to a large and sparse graph, HCS often cuts off one vertex in each iteration and thus has time complexity $O(|V|^2|E| + |V|^3 \log|V|)$ [5]. Third, the minimum cut algorithm is a critical step of HCS. When HCS is applied to a graph whose number of edges is close to quadratic in the number of vertices, even the fastest deterministic minimum cut algorithm [11] has time complexity $O(|V|^3)$. In [5], Hu et al. proposed an algorithm called MODES, which combines HCS with normalized cut, and designed a procedure to identify overlapping highly connected subgraphs.
Furthermore, to mine highly connected subgraphs more effectively, several authors have introduced greedy vertex deletion algorithms, based on the observation that low degree vertices can intuitively be disregarded when producing a highly connected subgraph. For example, Asahiro et al. [12] proposed the following greedy algorithm to find a $k$-vertex subgraph with the maximum weight: repeatedly remove a vertex with the minimum weighted degree in the currently remaining graph, until exactly $k$ vertices are left. However, these greedy algorithms cannot be applied directly to our problem because of differences in the definition and in other problem settings, which we explain below.

Our motivation is to find a more efficient algorithm for mining overlapping highly connected subgraphs by reconsidering the properties of highly connected subgraphs. The contributions of our work are as follows:
• We give several properties and use them in a new definition of highly connected subgraph.
• We present an algorithm for determining whether a graph is highly connected in linear time.
• We propose an efficient algorithm, MOHCS, for Mining Overlapping Highly Connected Subgraphs.

* Supported by National Natural Science Foundation of China (No. 60574039) and the Project Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
† To whom correspondence should be addressed. E-mail: lgao@mail.xidian.edu.cn

Although MOHCS is also a greedy vertex deletion algorithm, it becomes applicable only through a derivation from the set of properties we have discovered. Moreover, we can also identify overlapping highly connected subgraphs using a procedure from [5].

The rest of the paper is organized as follows. In section II, we introduce some notations and preliminary concepts and study some properties of highly connected graphs. In section III, the MOHCS algorithm is presented and its complexity is discussed.
The refinement of the MOHCS algorithm is then presented in section IV. Section V provides a detailed experimental evaluation of MOHCS using real and synthetic data sets. Finally, we conclude our work in section VI. A preliminary version of our work can be found in [19].

II. PROPERTIES OF HIGHLY CONNECTED GRAPH

In this section, we first introduce some notations and preliminary concepts in order to simplify our discussion. Next we give a tighter lemma on the minimum cut than the one given in [6]. Then we present a relation between the minimum cut and the minimum degree, which gives new insight into highly connected graphs and leads to an efficient algorithm for their identification.

A. Notations and definitions

The notations that will be used throughout the paper are summarized in Table I.

Definition 1: (Induced Subgraph [13]) Given a graph $G = (V, E)$ and a mapping $f: E \to V \times V$, an induced subgraph is a graph $G(V_s) = (V_s, E_s)$, where $V_s \subseteq V$, $E_s \subseteq E$ and, $\forall v_i, v_j \in V_s$, $e_h = (v_i, v_j) \in E_s \Leftrightarrow f(e_h) = (v_i, v_j) \in E$. In other words, an induced subgraph of a graph $G$ is a subset of the vertices of $V(G)$ together with all of the edges that connect them in $G$.

Definition 2: (Edge Cut and Edge Connectivity) Given a graph $G = (V, E)$, an edge cut is a set of edges $E_c$ such that $G' = (V, E - E_c)$ is disconnected. A minimum cut $S$ is the smallest set among all edge cuts. The edge connectivity of $G$, denoted by $k(G)$, is equal to the size of the minimum cut $S$. To keep the following conclusions general, we stipulate that $k(G) = 0$ if $G$ is disconnected, which means that no edge needs to be removed from $G$.

TABLE I. NOTATIONS USED THROUGHOUT THE PAPER
Notation | Description
$G$ | $G = (V, E)$, an undirected graph
$V(G)$ | $V = \{v_1, v_2, ..., v_k\}$, the vertex set of $G$
$E(G)$ | $E \subseteq V \times V$, the edge set of $G$
$G(V_s)$ | the induced subgraph on $V_s$ from $G$, $V_s \subseteq V(G)$
$\deg_G(v)$ | the degree of the vertex $v$ in $G$
$\delta(G)$ | the minimum degree of a vertex in $G$
$S$ | the minimum cut of $G$

Definition 3: (Highly Connected Subgraph, or simply Dense Subgraph) A graph $G = (V, E)$ with $|V(G)| > 3$ vertices is called highly connected if $k(G) \ge |V(G)|/2$. Note that a highly connected graph with fewer than 4 vertices is trivial. A highly connected subgraph is an induced subgraph $H \subseteq G$ such that $H$ is highly connected. Subgraphs are overlapping if they have some vertices or some edges in common.

In our problem setting, we consider simple undirected unweighted graphs only. The graph does not need to be connected. In this paper, we study the problem of mining the set of highly connected subgraphs in $G$. For instance, there are two highly connected subgraphs, $G_1(\{a,b,c,d,e\})$ and $G_2(\{f,g,h,i,j,k\})$, in Figure 1.

Figure 1. An example of highly connected subgraphs

B. Lemmas and Theorems

In [6], Hartuv and Shamir give the following lemma:

Lemma 1: If $S$ is a minimum cut which splits a graph $G$ into two induced subgraphs, the smaller of which, $H$, contains $|V(H)| > 1$ vertices, then $|S| \le |V(H)|$, with equality only if $H$ is a clique.

Here, we first present a lemma that is similar to but tighter than Lemma 1. Although the proof can be simplified further, we follow the proof method of [6] so that it can be compared easily with the proof of Lemma 1.

Lemma 2: If $S$ is a minimum cut which splits a graph $G$ into two induced subgraphs, the smaller of which, $H$, contains $|V(H)| > 1$ vertices, then $\delta(G) \le |V(H)|$, with equality only if $H$ is a clique.
Proof: When all edges incident on a vertex with the minimum degree are removed, the graph becomes disconnected. Therefore the edge connectivity of a graph is not greater than its minimum degree, $\delta(G) \ge |S|$ [6]. Let $\deg_H(x)$ denote the degree of vertex $x$ in $H$, and let $\deg_S(x)$ denote the number of edges in $S$ that are incident on $x$. Since $\delta(G)$ is the minimum degree of a vertex in $G$, for every $x \in H$ we have

$\deg_H(x) + \deg_S(x) \ge \delta(G)$. (1)

Summing over all vertices in $H$ we get

$\sum_{x \in H} \deg_H(x) + \sum_{x \in H} \deg_S(x) \ge |V(H)|\,\delta(G)$, (2)

or, equivalently,

$2|E(H)| + |S| \ge |V(H)|\,\delta(G)$. (3)

Hence if $|V(H)| > 1$, then since

$2\,\frac{|V(H)|(|V(H)|-1)}{2} \ge 2|E(H)|$, (4)

and

$\delta(G) \ge |S|$, (5)

we get

$2\,\frac{|V(H)|(|V(H)|-1)}{2} + \delta(G) \ge |V(H)|\,\delta(G)$. (6)

That is,

$|V(H)|(|V(H)|-1) \ge (|V(H)|-1)\,\delta(G)$. (7)

Since $|V(H)| > 1$,

$\delta(G) \le |V(H)|$. (8)

If $\delta(G) = |V(H)|$, then the inequalities (4) and (5) must hold as equalities, so

$\delta(G) = |S|$, (9)

and

$|E(H)| = \frac{|V(H)|(|V(H)|-1)}{2}$, (10)

which implies that $H$ is a clique.

Since $\delta(G) \ge |S|$, Lemma 2 is tighter than Lemma 1. Because $H$ is the smaller subgraph, so that $|V(H)| \le |V(G)|/2$, Lemma 2 also implies that $\delta(G) \le |V(G)|/2$. Next we use Lemma 2 to derive two theorems. Although these theorems can also be proved by generic graph theory methods [18], our proofs provide some evidence that it is the tighter lemma that leads to the faster algorithm in our proposal.

Theorem 1: If a graph $G$ is highly connected, then $\delta(G) = |S|$.

Proof: It is known that $\delta(G) \ge |S|$. Now suppose $\delta(G) > |S|$. Then the condition of equality in Lemma 2 does not hold, so we have $\delta(G) < |V(H)|$, that is, $\delta(G) < |V(G)|/2$. But because $G$ is highly connected, $\delta(G) > |S| \ge |V(G)|/2$, that is, $\delta(G) > |V(G)|/2$, which is a contradiction. So we get $\delta(G) = |S|$.

Theorem 1 shows a relation between the minimum cut and the minimum degree: if $G$ is highly connected, the size of the minimum cut is equal to the minimum degree.
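As a small worked illustration of Theorem 1, consider two six-vertex graphs chosen here purely for illustration (they do not come from the experiments): the complete bipartite graph $K_{3,3}$, and two triangles joined by a single edge. In both cases the threshold $|V(G)|/2$ equals 3.

```latex
\begin{align*}
K_{3,3}:\quad
  & \delta(G) = 3,\quad k(G) = |S| = 3 \ge \tfrac{6}{2}
  && \Rightarrow\ \text{highly connected, and } \delta(G) = |S|, \\
\text{two triangles joined by one edge}:\quad
  & \delta(G) = 2,\quad k(G) = |S| = 1 < \tfrac{6}{2}
  && \Rightarrow\ \text{not highly connected, and } \delta(G) > |S|.
\end{align*}
```

The second graph shows that the equality $\delta(G) = |S|$ is not guaranteed once the graph fails to be highly connected.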
Let $v$ be a vertex with minimum degree in $G$; then $(\{v\}, V \setminus \{v\})$ is a minimum cut of $G$. In other words, we can determine the edge connectivity of a highly connected graph in linear time.

Theorem 2: A graph $G = (V, E)$ is highly connected if and only if $\delta(G) \ge |V(G)|/2$.

Proof: ($\Rightarrow$) Since $G$ is highly connected, $\delta(G) = |S| \ge |V(G)|/2$ by Theorem 1. ($\Leftarrow$) Let $\delta(G) \ge |V(G)|/2$ and suppose $\delta(G) > |S|$. Then the condition of equality of Lemma 2 does not hold, so we have $\delta(G) < |V(H)|$, that is, $\delta(G) < |V(G)|/2$, a contradiction. So $|S| = \delta(G) \ge |V(G)|/2$, and $G$ is highly connected.

Based on Theorem 2, we can redefine the highly connected graph as follows.

Redefinition: (Highly Connected Graph) A graph $G = (V, E)$ with $|V(G)| > 1$ vertices is called highly connected if $\delta(G) \ge |V(G)|/2$.

C. Determining Highly Connected Subgraph in Linear Time

By the initial definition of highly connected subgraph, to determine whether a graph $G$ is highly connected we must first apply the minimum cut algorithm to $G$ and then check whether the size of the minimum cut is no less than half of $|V|$. Determining whether a graph is highly connected therefore has time complexity $O(|V||E| + |V|^2 \log|V|)$. Based on our redefinition, however, we can decide this simply by checking whether the degree of every vertex of $G$ is no less than half of $|V|$, in linear time $O(|V|)$.

III. ALGORITHM AND COMPLEXITY

Note that our new definition of highly connected graph implies that, if a graph is not highly connected, its minimum degree must be less than half of its vertex set size. The vertex with minimum degree is the vertex that most strongly violates our definition. To move the graph towards being highly connected, we can therefore delete this vertex. Based on this observation, we directly design a greedy vertex deletion algorithm, MOHCS, for mining overlapping highly connected subgraphs.
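The degree test of the Redefinition, and the choice of the vertex that the greedy step would delete, can be sketched in a few lines of Python. This is a minimal sketch under assumptions of ours: the graph is stored as a dict of adjacency sets, and the names `is_highly_connected`, `min_degree_vertex` and the `min_size` parameter are hypothetical, not part of the paper.

```python
from typing import Dict, Hashable, Set

# Assumed representation: a simple undirected graph as a dict of adjacency sets.
Graph = Dict[Hashable, Set[Hashable]]


def min_degree_vertex(adj: Graph) -> Hashable:
    """Return one vertex of minimum degree (ties are broken arbitrarily)."""
    return min(adj, key=lambda v: len(adj[v]))


def is_highly_connected(adj: Graph, min_size: int = 2) -> bool:
    """Degree test of the Redefinition: delta(G) >= |V(G)| / 2.

    min_size=2 mirrors the condition |V(G)| > 1; passing 4 instead would
    mirror the |V(G)| > 3 condition of Definition 3 (an illustrative option).
    """
    n = len(adj)
    if n < min_size:
        return False
    # One pass over the vertices; len() of a set is O(1), so this is O(|V|).
    return all(2 * len(nbrs) >= n for nbrs in adj.values())


# Two small examples: the clique K4 and a path on four vertices.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

With these inputs, `is_highly_connected(k4)` holds while `is_highly_connected(path4)` does not, and `min_degree_vertex(path4)` returns one of the two end vertices, which is the vertex the greedy step would remove first.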
However, in order to describe the problem clearly, let us first consider a graph consisting of highly connected subgraphs without overlapping. We present the implementation details and refine our algorithm in section IV.

Our MOHCS algorithm is outlined in Algorithm 1 and illustrated in Figure 2. Let $SDS$ be the set of dense subgraphs. Line 1 initializes $SDS$ as an empty set. The until loop of lines 2-9 repeatedly mines highly connected subgraphs until $G' = \emptyset$, in which case no other highly connected subgraph is left in $G$. The while loop of lines 4-5 repeatedly deletes the minimum degree vertex $v$ in $G'$ until $G'$ satisfies the new definition of highly connected graph, or until $G' = \emptyset$, in which case all vertices of $G'$ have been deleted and no highly connected subgraph exists in $G'$. In lines 6-8, each highly connected subgraph is saved in $SDS$ and removed from $G$.

In each iteration of MOHCS, our algorithm repeatedly removes a vertex with the minimum degree in the currently remaining graph until a highly connected subgraph results. This procedure is very similar to the algorithm proposed in [14], whose complexity is $O(|E| + |V| \log|V|)$. However, our stop criterion is stricter, so the complexity of each iteration never exceeds that of [14]. In other words, as in [14], by using Fibonacci heaps [15] to hold the vertices, we obtain a running time of $O(|E| + |V| \log|V|)$ to identify one highly connected subgraph. After finding one highly connected subgraph, our algorithm saves the subgraph and deletes it from the remaining graph, which takes $O(|E|)$ time. So our MOHCS algorithm has time complexity $O(k|E| + k|V| \log|V|)$ to identify all highly connected subgraphs in $k$ iterations, where $k$ is the number of highly connected subgraphs in the graph.

Algorithm 1: The MOHCS algorithm
input $G$
let $SDS$ be the set of dense subgraphs
let $v$ be the minimum degree vertex in $G'$
1. $SDS = \emptyset$
2. do
3.     $G' = G$
4.     while $\deg(v) < |V(G')|/2$ and $G' \ne \emptyset$
5.         $G' = G' \setminus \{v\}$
6.     if $G' \ne \emptyset$ then
7.         save $G'$ in $SDS$
8.         $G = G \setminus G'$
9. until $G' = \emptyset$
output $SDS$

Figure 2. An example of applying the MOHCS algorithm to a graph. Broken lines denote the vertex and its neighboring edges that will be deleted

Because the input graph is arbitrary, the performance of MOHCS can differ sharply between graphs. When there is no highly connected subgraph in the graph $G$, or $G$ itself is a highly connected graph, MOHCS terminates in time $O(|V|)$. Suppose $G'$ is a highly connected subgraph whose minimum degree is maximum among all highly connected subgraphs contained in $G$; then MOHCS mines $G'$ first. In real networks, a highly connected subgraph with greater minimum degree always has greater vertex set size. Hence, the vertex set size of $G$ decreases quickly in each iteration of MOHCS, which improves the performance of our algorithm. In addition, singleton adoption and overlapping subgraph identification increase the running time of MOHCS only slightly and have no influence on its time complexity.

IV. REFINEMENTS OF THE MOHCS ALGORITHM

In this section, we present the implementation details and the refinements of our MOHCS algorithm. We employ the singleton adoption method proposed in [6] and the overlapping subgraphs identification method proposed in [5], with some modifications. Both original methods need several experimentally determined bounds to control their execution. In this study, however, these bounds are unnecessary because of the properties we have discovered. The complete version of MOHCS is outlined in Algorithm 2 and illustrated in Figure 4.

A. Minimum degree vertex selection

Because of the greedy vertex deletion character of MOHCS, we use Fibonacci heaps to maintain the degrees of the vertices in the subgraph $G'$.
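The basic loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is assumed to be a dict of adjacency sets, the names `peel_once` and `mohcs_basic` are hypothetical, and Python's binary heap (`heapq`) with lazy deletion stands in for the Fibonacci heap, so one iteration costs O((|V| + |E|) log |V|) rather than the bound quoted above. On equal degree, the sketch pops the most recently updated entry first, which is one possible tie-breaking choice.

```python
import heapq
import itertools
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed: adjacency sets, simple undirected


def peel_once(g: Graph) -> Set[Hashable]:
    """Lines 3-5: delete minimum degree vertices until delta(G') >= |V(G')|/2.

    Returns the surviving vertex set, or an empty set if every vertex was deleted.
    """
    deg = {v: len(nbrs) for v, nbrs in g.items()}
    alive = set(g)
    stamp = itertools.count()
    # Heap entries are (degree, -timestamp, vertex): among equal degrees the
    # most recently pushed (that is, most recently updated) vertex comes first.
    heap = [(d, -next(stamp), v) for v, d in deg.items()]
    heapq.heapify(heap)
    while alive:
        d, _, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue  # stale entry left behind by an earlier degree update
        if 2 * d >= len(alive):
            return alive  # minimum degree already meets the threshold
        alive.remove(v)
        for u in g[v]:
            if u in alive:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], -next(stamp), u))
    return set()


def mohcs_basic(adj: Graph) -> List[Set[Hashable]]:
    """Lines 1-9 of Algorithm 1, without singleton adoption or overlap handling."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy of the input
    sds: List[Set[Hashable]] = []
    while True:
        sub = peel_once(g)
        if not sub:
            break  # G' is empty: no highly connected subgraph is left
        sds.append(sub)
        for v in sub:  # line 8: G = G \ G'
            for u in g.pop(v):
                if u in g:
                    g[u].discard(v)
    return sds
```

Because the stop test follows the Redefinition literally, the sketch also reports trivial subgraphs with fewer than four vertices (for example a single remaining edge); a caller interested only in non-trivial modules would filter the result by size.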
Each iteration of lines 4-5 involves identifying and removing the minimum degree vertex as well as updating the degrees of its neighboring vertices [14]. In practice there are usually many vertices with minimum degree. Different choices of the vertex to delete lead to different results, and sometimes even to wrong results. Figure 3 illustrates an example in which MOHCS behaves incorrectly: it collapses the two highly connected subgraphs $G_1(\{a,b,c,d\})$ and $G_2(\{e,f,g,h\})$ into singletons. The reason is that MOHCS chooses the minimum degree vertices alternately from the two highly connected subgraphs, thus collapsing both of them.

Figure 3. An example in which MOHCS behaves incorrectly

To address this problem, we introduce the following scheme. In a graph with many highly connected subgraphs, we want MOHCS to behave such that, once one vertex of a highly connected subgraph is deleted, the whole of that highly connected subgraph is collapsed as well. This avoids the incorrect behavior described above. We implement this by ensuring that, if several vertices have the same degree, the most recently updated vertex is deleted first. Hence we use the following comparison criterion in the Fibonacci heaps: $\forall v_i, v_j$, $key(v_i) < key(v_j)$ if $\deg_G(v_i) < \deg_G(v_j)$, or if $\deg_G(v_i) = \deg_G(v_j)$ and $v_i$ is more recently updated than $v_j$.

B. Singleton adoption

Our MOHCS algorithm may also leave some vertices as unclustered singletons. For example, when MOHCS is applied to the graph $G$ in Figure 4, the vertex $i$ is left as a singleton. Hence, after the execution of MOHCS, we need to check whether some singletons can be adopted by some highly connected subgraphs. Our singleton adoption is derived from the method proposed in [6]. Our modified procedure is described as follows. Let $S$ denote the singleton set.
Because overlapping subgraphs exist, one singleton in $S$ may fit into several highly connected subgraphs. For each highly connected subgraph $G'$, we sort all singletons in $S$ in decreasing order of the number of neighbors they have in $G'$. Then we check every singleton in that order. For each singleton $v$, if $G'$ is still highly connected after $v$ joins it (tested with our algorithm for determining highly connected graphs), $v$ is added to $G'$. The process is repeated until a singleton is reached that cannot be fitted into $G'$.

Figure 4. An example of applying the complete version of the MOHCS algorithm to a graph

Algorithm 2: The complete version of MOHCS algorithm
input $G$
let $SDS$ be the set of dense subgraphs
let $v$ be the minimum degree vertex in $G'$
1. $SDS = \emptyset$
2. do
3.     $G' = G$
4.     while $\deg(v) < |V(G')|/2$ and $G' \ne \emptyset$
5.         $G' = G' \setminus \{v\}$
6.     if $G' \ne \emptyset$ then
7.         save $G'$ in $SDS$
8.         condense $G'$ into a vertex $v'$
8.         $G = (G \setminus G') \cup \{v'\}$
9. until $G' = \emptyset$
10. Overlapping subgraphs identification
11. Singleton adoption
output $SDS$

Overlapping subgraphs identification

Because highly connected subgraphs may overlap with each other, simply removing one of these subgraphs would destroy the others, since they share vertices and edges with it. To handle this problem, we employ a method proposed in [5]. We restate the procedure, with slight modification, as follows. When mining a highly connected subgraph $G'$ from the graph $G$ (lines 6-8), we condense $G'$ into a vertex $v'$. Any vertex $v$ in $G \setminus G'$ has an edge to $v'$ if $v$ links to at least one vertex in $G'$. In later iterations of the until loop (lines 2-9), if $v'$ is contained in a newly mined highly connected subgraph $G''$, we restore $v'$ to the original subgraph $G'$ and check whether the vertices of $G'$ can be adopted into $G''$ by using the singleton adoption method described above.

C. The complete version of MOHCS

Combining the implementation details and the refinements described above, we obtain the complete version of MOHCS, presented in Algorithm 2 and illustrated in Figure 4. In Figure 4, the graph $G$ consists of two highly connected subgraphs, $G_1(\{a,b,c,d,e\})$ and $G_2(\{c,d,f,g,h,i\})$, which overlap with each other. In Step 1, we mine $G(\{c,d,f,g,h\})$ first. In Step 2, we condense $G(\{c,d,f,g,h\})$ into a vertex $v'$. In Step 3, we mine $G(\{a,b,v',e\})$. In Step 4a, we restore $v'$ to $G(\{c,d,f,g,h\})$. In Step 4b, after adopting the vertices $c$ and $d$, $G(\{a,b,v',e\})$ becomes $G_1(\{a,b,c,d,e\})$. In Step 5, we condense $G_1(\{a,b,c,d,e\})$ into a vertex $v''$. In Step 6, we leave vertex $i$ as a singleton. Finally, in Step 7, $G(\{c,d,f,g,h\})$ adopts the vertex $i$ and becomes $G_2(\{c,d,f,g,h,i\})$.

V. EXPERIMENTAL EVALUATION

In this section, we first compare MOHCS with HCS on computer-generated graphs. Since neither MOHCS nor HCS applies to weighted graphs, we consider only unweighted graphs. The results show that our algorithm performs more reliably than the HCS algorithm. Then we apply the MOHCS algorithm to a yeast protein interaction network and find 1188 modules of size larger than three. We also verify that the size of the graph decreases quickly over the iterations of MOHCS.

A. Algorithm comparison on computer-generated graphs

To compare the performance of MOHCS with HCS [6], we apply them to computer-generated graphs. Because HCS cannot discover overlapping subgraphs, it is clear that our algorithm outperforms HCS on graphs that contain them. We therefore compare the two algorithms only on graphs without overlapping subgraphs. The construction method is based on the generic random graph model [16]. We construct a graph that contains $k$ highly connected subgraphs, each of which has $n$ vertices.
Every vertex belongs to one and only one subgraph, which ensures that different subgraphs have no vertices or edges in common, that is, the subgraphs do not overlap. Thus, each constructed graph has $N = nk$ vertices in total. For each vertex pair $(i, j)$, we realize an edge with fixed probability $p$ if $i$ and $j$ are in the same subgraph, and with probability $q$ if $i$ and $j$ are in different subgraphs. There are $n^2$ vertex pairs between two subgraphs and $n(n-1)/2$ vertex pairs within a subgraph. By Definition 3, the expected number of edges between two subgraphs should be less than half of the number of vertices of these two subgraphs, that is, $2n/2$. By our new definition, the expected number of edges within a subgraph should be no less than $n^2/4$. Then we have

$n^2 q < n$, (11)

and

$\frac{n(n-1)}{2}\, p \ge \frac{n^2}{4}$. (12)

That is,

$p \ge \frac{n}{2(n-1)}$, (13)

and

$q < \frac{1}{n}$. (14)

This method generates many graphs with known cluster structure. We examine the clustering results of MOHCS and HCS on these graphs. We find that most of the highly connected subgraphs mined by HCS are also contained in those mined by MOHCS, except when there are subgraphs with the same density but different vertex sets. To illustrate the result more clearly, Figure 5 presents one of the test graphs, represented as a matrix, together with the clustering results of MOHCS and HCS. There are ten highly connected subgraphs in the test graph. MOHCS finds all ten subgraphs, whereas HCS finds only three. Counted from the bottom left to the top right, there are some unclustered singletons in the 1st, 5th, 6th, 8th and 10th subgraphs, which can be identified by applying singleton adoption.

B. Mining modules from yeast protein interaction network

We apply our MOHCS algorithm to the yeast protein interaction network to mine modules.
The experimental data is the baker's yeast protein interaction network downloaded from the DIP database (version Scere20070219). The network includes 4966 yeast proteins and 17530 interactions. In order to identify functional modules from the network, we express the network of proteins connected by interactions as a network of connected interactions [17]. The procedure takes a graph $G$, consisting of edges connecting vertices, and produces its associated line graph $L(G)$, in which edges now represent vertices and vertices represent edges [18]. After this conversion, the graph consists of 17530 vertices and 439685 edges. MOHCS identified 1188 simple modules with size larger than three. Eleven of them have size larger than one hundred (see Figure 6). The largest module contains 283 interactions. The size of the remaining graph after each iteration of MOHCS is shown in Figure 7, which verifies that the size decreases very quickly.

Figure 5. A randomly clustered test matrix for $k = 10$, $n = 10$, $p = 90\%$, $q = 3\%$. Dots indicate nonzero entries. Red dots represent the clusters mined both by MOHCS and HCS. Green dots represent the clusters mined only by MOHCS

Figure 6. Sizes of the eleven largest subgraphs

Figure 7. Size of the remaining graph after each iteration of MOHCS

VI. CONCLUSION

In this paper, we study some properties of highly connected graphs based on graph theoretic techniques, revealing a relationship between the minimum cut and the minimum degree in highly connected graphs. We also give a new definition of highly connected graph that takes these properties into account. Further, we provide a method for determining whether a graph is highly connected. Based on this redefinition, we propose an efficient algorithm, MOHCS, to mine overlapping highly connected subgraphs. As mentioned above, different choices of the vertex to delete can lead to different or incorrect results.
The previously developed greedy vertex deletion algorithms do not consider this case. Here, we present a scheme to ensure that MOHCS chooses the correct vertex from the set of vertices with minimum degree. We employ two other methods, proposed separately in [5] and [6], with modifications. The modified, complete version of MOHCS is then presented. Finally, we analyze the running time of MOHCS and apply it to computer-generated graphs and a yeast protein interaction network. The experimental results show that the MOHCS algorithm outperforms the HCS algorithm on both the computer-generated graphs and the yeast protein network.

ACKNOWLEDGEMENTS

The authors would like to thank Runming Lu for his helpful remarks. We also want to thank the funding agencies for their financial support.

REFERENCES

[1] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics, 2006, 22(8): 1021-1023.
[2] G. W. Flake, S. R. Lawrence, C. L. Giles, et al. Self-organization and identification of web communities. IEEE Computer, 2002, 35(3): 66-71.
[3] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005, 435: 814-818.
[4] R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), 2000, pp. 307-316.
[5] H. Hu, X. Yan, Y. Hang, J. Han, and X. Zhou. Mining coherent dense subgraphs across massive biological network for functional discovery. Bioinformatics, 2005, 21: 213-221.
[6] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 2000, 76: 175-181.
[7] E. Hartuv, A. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. An algorithm for clustering cDNAs for gene expression analysis. In Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), 1999, pp. 188-197.
[8] F. Luo, Y. Yang, C.-F. Chen, R. Chang, J. Zhou, and R. H. Scheuermann. Modular organization of protein interaction networks. Bioinformatics, 2007, 23: 207-214.
[9] S. Brohee and J. van Helden. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, 2006, 7: 488.
[10] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 2004, 69: 026113.
[11] M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM, 1997, 44: 585-591.
[12] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. Journal of Algorithms, 2000, 34: 203-221.
[13] W. T. Tutte. Graph Theory. Cambridge University Press, 2001.
[14] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. Lecture Notes in Computer Science, 2000, 1913: 84-95.
[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. Second Edition, MIT Press, Massachusetts, 2001.
[16] S. M. Van Dongen. Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht, 2000.
[17] B. J. Pereira-Leal, A. J. Enright, and C. A. Ouzounis. Detection of functional modules from protein interaction networks. Proteins, 2004, 54: 49-57.
[18] H. Whitney. Congruent graphs and the connectivity of graphs. American Journal of Mathematics, 1932, 54: 150-168.
[19] X. Lin, K. Chen, and L. Gao. A clustering algorithm for mining overlapping highly connected subgraphs. To appear in the 2nd International Conference on Bioinformatics and Biomedical Engineering (ICBBE), 2008.