Coauthorship and Citation Networks for Statisticians
We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from $2003$ to the first half of $2012$. We …
Authors: Pengsheng Ji, Jiashun Jin
Submitted to the Annals of Applied Statistics

By Pengsheng Ji† and Jiashun Jin‡
University of Georgia† and Carnegie Mellon University‡

We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from 2003 to the first half of 2012. We analyze the data sets from many different perspectives, focusing on (a) centrality, (b) community structures, and (c) productivity, patterns and trends.

For (a), we have identified the most prolific/collaborative/highly cited authors. We have also identified a handful of “hot” papers, suggesting “Variable Selection” as one of the “hot” areas.

For (b), we have identified about 15 meaningful communities or research groups, including large-size ones such as “Spatial Statistics”, “Large-Scale Multiple Testing”, “Variable Selection” as well as small-size ones such as “Dimension Reduction”, “Objective Bayes”, “Quantile Regression”, and “Theoretical Machine Learning”.

For (c), we find that over the 10-year period, both the average number of papers per author and the fraction of self citations have been decreasing, but the proportion of distant citations has been increasing. These suggest that the statistics community has become increasingly more collaborative, competitive, and globalized.

Our findings shed light on research habits, trends, and topological patterns of statisticians. The data sets provide a fertile ground for future research on or related to social networks of statisticians.

1. Introduction. It is frequently of interest to identify “hot” areas and key authors in a scientific community, and to understand the research habits, trends, and topological patterns of the researchers.
A better understanding of such features is useful from many perspectives, ranging from that of administrations and funding agencies on priorities for support, to that of individual researchers on starting a new research topic or new research collaboration.

Coauthorship and Citation networks provide a convenient and yet appropriate approach to addressing many of these questions. On one hand, with the boom of online resources (e.g., MathSciNet) and search engines (e.g., Google Scholar), it is relatively convenient for us to collect the Coauthorship and Citation network data of a specific scientific community. On the other hand, these network data provide a wide variety of information (e.g., productivity, trends, impacts, and community structures) that can be extracted to understand many different aspects of the scientific community.

‡ JJ was partially supported by NSF grant DMS-1208315.
MSC 2010 subject classifications: Primary 91C20, 62H30; secondary 62P25.
Keywords and phrases: adjusted Rand index, centrality, collaboration, community detection, Degree Corrected Block Model, productivity, social network, spectral clustering.

Recent studies on such networks include but are not limited to the following: Grossman [17] studied the Coauthorship network of mathematicians; Newman [32, 34] studied the Coauthorship networks of biologists, physicists and computer scientists (see also Martin et al. [29], which studied networks of physicists using a much larger data set than that in [32, 34]); Ioannidis [22] used the Coauthorship network to help assess the scientific impacts.

Unfortunately, as far as we know, the Coauthorship and Citation networks for statisticians have not yet been studied. We recognize that

• The people who are most interested in social networks for statisticians are statisticians themselves or people with close ties to them.
It is unlikely for researchers from other disciplines (e.g., physicists) to devote substantial time and effort to paying specific attention to networks for statisticians: it is the statisticians' task to collect and analyze such network data about themselves and of interest to themselves.
• For many aspects of the networks, the “ground truth” is unavailable. However, as statisticians, we have the advantage of knowing (at least partially) many aspects (e.g., “hot” areas, community structures) of our own community. Such “partial ground truth” can be very helpful in analyzing the networks and interpreting the results.

With substantial time and effort, we have collected two new network data sets: Coauthorship network and Citation network for statisticians. The data sets are based on all published papers from 2003 to the first half of 2012 in four of the top statistical journals: Annals of Statistics (AoS), Biometrika, Journal of the American Statistical Association (JASA) and Journal of the Royal Statistical Society (Series B) (JRSS-B).

The data sets provide a fertile ground for research on social networks, especially to us statisticians, as we know the “partial ground truth” for many aspects of our community. For example, we can use the data sets to check and build network models, to develop new methods and theory, and to further understand the research habits, patterns, and topological structures of the networks of statisticians. Last but not least, we can use the data sets and the analysis in the paper as a starting point for a more ambitious project, where we collect network data sets of this kind but cover many more journals in or related to statistics and span a much longer time period.

1.1. Our findings. In this paper, we analyze the two network data sets, and discuss each of the following three topics separately:
• (a). Centrality.
We identify “hot” areas as well as authors that are most collaborative or are most highly cited.
• (b). Community detection. With possibly more sophisticated methods and analysis, we identify meaningful communities of statisticians.
• (c). Productivity, patterns and trends. We identify noticeable publication patterns of the statisticians, and how they evolve over time.

(a). Centrality. Using several different centrality measures, we have identified Peter Hall, Jianqing Fan, and Raymond Carroll as the most prolific authors, Peter Hall, Raymond Carroll and Joseph Ibrahim as the most collaborative authors, and Jianqing Fan, Hui Zou, and Peter Hall as the most cited authors. See Table 2.

We have also identified 14 “hot” papers. See Table 3. Among these 14 papers, 10 are on “Variable Selection”, suggesting “Variable Selection” as a “hot” area. Other “hot” areas may include “Covariance Estimation”, “Empirical Bayes”, and “Large-Scale Multiple Testing”.

(b). Community detection. Intuitively, communities in a network are groups of nodes that have more edges within than across (note that “community” and “component” are very different concepts); see [24] for example. The goal of community detection is to identify such groups (i.e., clustering).

We consider the Citation network and two versions of Coauthorship networks. In each of these networks, a node is an author.
• (b1). Coauthorship network (A). In this network, there is an (undirected) edge between two authors if and only if they have coauthored 2 or more papers in the range of our data sets.
• (b2). Coauthorship network (B). This is similar to Coauthorship network (A), but “2 or more papers” is replaced by “1 or more papers”.
• (b3). Citation network. There is a (directed) edge from author i to j if author i has cited 1 or more papers by author j.
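As a rough illustration of definitions (b1)-(b3), the sketch below builds the three edge sets from a toy collection of papers. The record layout (`authors`, `cites`), the function name `build_networks`, and the choice to drop self-citation loops are assumptions made for this example only; they do not describe the actual format of the data sets.

```python
from collections import Counter
from itertools import combinations


def build_networks(papers):
    """Build the three author-level networks from paper records.

    `papers` maps a paper id to {"authors": [...], "cites": [paper ids]}.
    This record layout is hypothetical and chosen for illustration.
    """
    # Count how many papers each unordered pair of authors wrote together.
    pair_counts = Counter()
    for rec in papers.values():
        for a, b in combinations(sorted(set(rec["authors"])), 2):
            pair_counts[(a, b)] += 1

    # Coauthorship (B): 1 or more joint papers; (A): 2 or more joint papers.
    coauthor_b = set(pair_counts)
    coauthor_a = {pair for pair, n in pair_counts.items() if n >= 2}

    # Citation network: directed edge (i, j) if author i cites a paper by author j.
    citation = set()
    for rec in papers.values():
        for cited_id in rec["cites"]:
            if cited_id not in papers:
                continue  # the cited paper lies outside the data set
            for i in rec["authors"]:
                for j in papers[cited_id]["authors"]:
                    if i != j:  # assumption: self-citations are not drawn as loops
                        citation.add((i, j))
    return coauthor_a, coauthor_b, citation


# Toy example with three papers and three authors.
toy_papers = {
    "p1": {"authors": ["Ann", "Bob"], "cites": []},
    "p2": {"authors": ["Ann", "Bob"], "cites": ["p1"]},
    "p3": {"authors": ["Bob", "Cy"], "cites": ["p1", "p2"]},
}
net_a, net_b, net_cit = build_networks(toy_papers)
```

In the toy example, Ann and Bob share two papers and are linked in both Coauthorship networks, while Bob and Cy share one paper and are linked only in network (B).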
While Coauthorship network (B) is defined in a more conventional way, Coauthorship network (A) is easier to analyze, and presents many meaningful research groups that are hard to find using Coauthorship network (B). We now discuss the three networks separately.

(b1). Coauthorship network (A). We find that the network is rather fragmented. It splits into many disconnected components, many of which are groups with special characteristics. The largest component is the “High Dimensional Data Analysis (Coauthorship (A))” (HDDA-Coau-A) community (Figure 1). The component has 236 nodes, is relatively large, and seems to contain sub-structures; see Section 3.2 for more discussion.

The next two largest components are presented in Figure 5 and can be interpreted as communities of “Theoretical Machine Learning” (15 nodes) and “Dimension Reduction” (14 nodes), respectively. The next 5 components are presented in Table 6 and can be interpreted as communities of “Johns Hopkins”, “Duke”, “Stanford”, “Quantile Regression”, and “Experimental Design”, respectively. These components have small sizes and there is no need for further study on sub-structures.

Table 1. A road map for 14 communities discussed in Section 1.1. In Coauthorship Network (A), each community is a component of the network. In Coauthorship Network (B) and Citation Network, the communities are identified by SCORE and D-SCORE, respectively.
Network | Communities | #nodes | Visualization
Coauthor(A) | High-Dimensional Data Analysis (HDDA-Coau-A) | 236 | Figures 1, 3, 4
Coauthor(A) | Theoretical Machine Learning | 15 | Figure 5
Coauthor(A) | Dimension Reduction | 14 | Figure 5
Coauthor(A) | Johns Hopkins | 13 | Table 6
Coauthor(A) | Duke | 10 | Table 6
Coauthor(A) | Stanford | 9 | Table 6
Coauthor(A) | Quantile Regression | 9 | Table 6
Coauthor(A) | Experimental Design | 8 | Table 6
Coauthor(B) | Objective Bayes | 64 | Figure 6
Coauthor(B) | Biostatistics | 388 | Figure 7
Coauthor(B) | High-Dimensional Data Analysis (HDDA-Coau-B) | 1181 | Figure 8
Citation | Large-Scale Multiple Testing | 359 | Figure 10
Citation | Variable Selection | 1285 | Figure 11
Citation | Spatial & Semi-parametric/Non-parametric Statistics | 1010 | Figure 12

(b2). Coauthorship network (B). The network has much stronger connectivity than Coauthorship network (A), so we need more sophisticated methods to identify communities/research groups; we propose to use SCORE. SCORE is a recent spectral approach to community detection for undirected networks [24]. Using SCORE, we have identified three meaningful communities as follows: “Objective Bayes”, “Biostatistics (Coauthorship (B))” (Biostat-Coau-B), and “High Dimensional Data Analysis (Coauthorship (B))” (HDDA-Coau-B), presented in Figures 6, 7, and 8, respectively.

We have also investigated the network with several other community detection approaches for undirected networks: Newman's Spectral Clustering method (NSC) [35], Bickel and Chen's Profile Likelihood (BCPL) method [5, 42], and Amini et al.'s Pseudo Likelihood (APL) method [1]. Different methods give different results, but they seem to largely agree on the three aforementioned communities; see Section 3.3 for more discussion.

(b3). Citation network. The Citation network is directed, and it remains largely unknown how to model such networks and how to do community detection. We propose D-SCORE (an adaptation of SCORE for directed networks) as a new community detection method.
Using D-SCORE, we have identified three meaningful communities: “Large-Scale Multiple Testing”, “Variable Selection” and “Spatial and Semi-parametric/Non-parametric Statistics”. These communities are presented in Figures 10-12, respectively.

For convenience, we present in Table 1 a road map for the 14 communities we just mentioned. Note that some of these communities also have sub-communities; see Sections 3-4 for details.

In comparison, the communities or research groups identified in each of the three networks are connected and intertwined, but are also very different. We discuss these in Sections 4.2.1-4.2.2; see details therein.

(c). Productivity, patterns and trends. We discuss the overall productivity, coauthor patterns and trends, and citation patterns and trends. Our findings include but are not limited to the following.
• In the 10-year period 2003-2012, the number of papers per author has been decreasing (Figure 13). Also, the proportion of self-citations has been decreasing while the proportion of distant citations has been increasing (Figure 16). These suggest that the statistics community has become increasingly more collaborative, competitive, and globalized.
• The distribution of either the degrees of the author-paper bipartite network or the Coauthorship network has a power-law tail (Figures 14-15), a phenomenon frequently found in social networks [4, 33].

1.2. Data collection and cleaning. We have faced substantial challenges in data collection and cleaning, and it has taken us more than 6 months to obtain high-quality data sets and prepare them in a ready-to-use format.

At first glance, it may be hard to understand why it is challenging to collect such data: the data seem to be everywhere, very accessible and free. This is true to some extent. However, when it comes to high-volume high-quality data, the resources become surprisingly limited.
For example, Google Scholar aggressively blocks anyone (a person or a machine) who tries to download more than a small amount of data; when you try to download little by little, you will see that some portion of the data is made messy and incomplete intentionally. For other online resources, we face a similar problem.

We also face other challenges: missing paper identifiers, ambiguous author names, etc.; we explain how we have overcome these in Appendix II.

1.3. Experimental design and scientific relevance. We are primarily interested in the networks for statisticians based in the USA. For this reason, we have limited our attention to four journals (AoS, Biometrika, JASA, JRSS-B), which are regarded by many US-based statisticians as the top statistical journals (or leading journals in methods and theory, except for JASA papers in the case study sector). We recognize that we may have different results when we include in our study either journals which are the main venues for statisticians from a different country or region, or journals which are the main venues for statisticians with a different focus (e.g., Bioinformatics).

We are also primarily interested in the time period when high dimensional data analysis emerged as a new statistical area. We may have different results if we extend the study to a much longer time period.

On the other hand, it seems that the data sets we have serve well for solving our targeted scientific problems: they provide many meaningful results in many aspects of our targeted community within the targeted time period. They also serve as a starting point for a more ambitious project in which we collect data from many more journals over a much longer time period.

1.4. Disclaimers. Our primary goal in the paper is to present the data sets we collect, and to report our findings in such data sets.
It is not our intention to rank one author/paper over the others. We wish to clarify that “highly cited” is not exactly the same as “important” or “influential”.

It is not our intention either to rank one area over the other. A “hot” area is not exactly the same as an “important” area or an area that needs the most of our time and effort. It is not exactly an area that is exhausted (so we should not dive in) either.

Also, it is not our intention to label an author/paper/topic with a certain community/group/area. A community or a research group may contain many authors, and can be hard to interpret. For presentation, we need to assign names to such communities/groups/areas, but the names do not always accurately reflect all the authors/papers in them.

Finally, social networks are about “real people”, and this time, “us”. In order to obtain meaningful and interpretable results, we have to use real names. We have not used any data beyond those which are publicly available. The interest of the paper is on the statistics community as a whole, not on any individual statistician.

1.5. Contents. The paper is organized as follows. In Section 2, we discuss the centrality. In Sections 3-4, we discuss community detection for the Coauthorship network and Citation network, respectively. Section 5 contains a brief summary, discusses the limitations of the paper, and suggests some future directions. Section 6 is Appendix I, where we study the productivity, patterns and trends for the statisticians' research, and Section 7 is Appendix II, where we address the challenges in data collection and cleaning.

2. Centrality. It is frequently of interest to identify the most “important” authors or papers, and one possible approach is to use centrality. There are many different measures of centrality.
In this section, we use the degree centrality, the closeness centrality, and the betweenness centrality. The closeness centrality is defined as the reciprocal of the total distance to all others [37]. The betweenness centrality measures the extent to which a node is located “between” other pairs of nodes [14]. The degree centrality is conceptually simple, but the definition varies with the types of networks. For the author-paper bipartite network, the centrality of an author is the number of papers he/she publishes. For the Coauthorship network, the centrality of an author is the number of his/her coauthors. For the Citation network of authors, we are primarily interested in the in-degree, and the centrality of an author is the number of citers (i.e., authors who cite his or her papers). For the Citation network of papers, the centrality is the in-degree (i.e., the number of papers which cite this paper).

Table 2 presents the key authors identified by different measures of centrality. The results suggest that different measures of centrality are largely consistent with each other, which identify Raymond Carroll, Jianqing Fan, and Peter Hall (alphabetically) as the “top 3” authors.

Table 2. Top 3 authors identified by the degree centrality (Columns 1-3; corresponding networks are the author-paper bipartite network, Coauthorship network, and Citation network for authors), the closeness centrality and the betweenness centrality.

# of papers | # of coauthors | # of citers | Closeness | Betweenness
Peter Hall | Peter Hall | Jianqing Fan | Raymond Carroll | Raymond Carroll
Jianqing Fan | Raymond Carroll | Hui Zou | Peter Hall | Peter Hall
Raymond Carroll | Joseph Ibrahim | Peter Hall | Jianqing Fan | Jianqing Fan

Table 3 presents the “hot” papers identified by 3 different measures of centrality. For all these measures, the “hottest” papers seem to be in the area of variable selection.
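The three centrality measures defined in this section can be sketched in plain Python for an undirected, unweighted network stored as an adjacency dictionary. This is only an illustrative sketch: the function names are hypothetical, closeness sums distances over reachable nodes only (an assumption for disconnected networks), and betweenness is computed with Brandes' algorithm without normalization, which may differ from the exact conventions behind Tables 2-3.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj):
    """Number of neighbors (e.g., number of coauthors)."""
    return {u: len(nbrs) for u, nbrs in adj.items()}


def closeness_centrality(adj):
    """Reciprocal of the total distance to all other reachable nodes."""
    out = {}
    for u in adj:
        total = sum(bfs_distances(adj, u).values())
        out[u] = 1.0 / total if total > 0 else 0.0
    return out


def betweenness_centrality(adj):
    """Brandes' algorithm: shortest-path counts through each node (unnormalized)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths from s
        sigma = dict.fromkeys(adj, 0)    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):        # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Toy star network: "b" is connected to "a", "c", and "d".
toy_adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
deg = degree_centrality(toy_adj)
clo = closeness_centrality(toy_adj)
btw = betweenness_centrality(toy_adj)
```

In the toy star network the hub "b" has the largest value under all three measures, which mirrors the observation that the measures tend to agree on the most central nodes.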
In particular, the top 3 most cited papers are Zou [43] (75 citations; adaptive lasso), Meinshausen and Buhlmann [31] (64 citations; graphical lasso), and Candès and Tao [8] (49 citations; Dantzig Selector). The three papers are all in a specific sub-area in high dimensional variable selection, where the theme is to extend the well-known penalization methods of the lasso [9, 39] in various directions (these fit well with the impression of many statisticians: in the past 10-20 years, there has been a noticeable wave of research papers devoted to the penalization methods).

These results suggest “Variable Selection” as one of the “hot” areas. Other “hot” areas may include “Covariance Estimation”, “Empirical Bayes”, and “Large-Scale Multiple Testing”; see Table 3 for details.

Table 3. Fourteen “hot” papers (alphabetically) identified by degree centrality (Column 2; for citation networks of papers), closeness centrality, and betweenness centrality. Numbers in Columns 2-4 are the ranks (only shown when the rank is no larger than 5).
Paper (Area)   Citations   Closeness   Betweenness
Bickel & Levina (2008) [6] (Covariance Estimation)   4
Candes & Tao (2007) [8] (Variable Selection)   3
Fan & Li (2004) [11] (Variable Selection)   2
Fan & Lv (2008) [12] (Variable Selection)   1
Fan & Peng (2004) [13] (Variable Selection)   4   1
Huang et al (2006) [19] (Covariance Estimation)   3
Huang et al (2008) [18] (Variable Selection)   5
Hunter & Li (2005) [21] (Variable Selection)   4
Johnstone & Silverman (2005) [25] (Empirical Bayes)   5
Meinshausen & Buhlmann (2006) [31] (Variable Selection)   2
Storey (2003) [38] (Multiple Testing)   3
Zou (2006) [43] (Variable Selection)   1
Zou & Hastie (2005) [44] (Variable Selection)   5
Zou & Li (2008) [45] (Variable Selection)   2

For more information, note that at www.stat.uga.edu/~psji/, we have listed the 30 most cited papers in the file top-cited.xlsx. These 30 papers account for 16% of the total citation count. The list further shows that the most highly cited papers are on the regularization methods (e.g., adaptive lasso, group lasso, etc.).

On the other hand, we must note that some important and innovative works in the particular area of variable selection have significantly fewer citations. This includes but is not limited to the phenomenal paper by Efron et al. (2004) [10] on least angle regression, which has received a lot of attention from a broader scientific community. The paper has 4900 citations on Google Scholar, but is cited only 11 times by papers in our data set (in comparison, the adaptive lasso paper [43] has received 75 citations). A similar claim can be made about other areas or topics. The fact that statisticians have been very much focused on a very specific research topic and a very specific approach is an interesting phenomenon that deserves more explanation by itself.

The centrality measures we use here are either natural choices or existing measures.
We are merely reporting what the data sets tell us, with no intention to rank one author or area over the others; see Section 1.4.

3. Community detection for Coauthorship networks. In this section, we study community detection for Coauthorship networks (A) and (B). Community detection of the Citation network is discussed in Section 4.

In Section 3.1, we discuss models for general undirected networks and recent approaches to community detection. In Sections 3.2-3.3, we analyze Coauthorship networks (A) and (B), respectively, using these approaches.

3.1. Community detection methods (undirected networks). Community detection is a problem of major interest in network analysis [16]. Consider an undirected and connected network $N = (V, E)$ with $n$ nodes. We think of $V$ as the union of a few (disjoint) subsets which we call the “communities”:

$$V = V^{(1)} \cup V^{(2)} \cup \cdots \cup V^{(K)},$$

where “$\cup$” stands for the union of sets and has nothing to do with networks (same below). Intuitively, communities can be thought of as subsets of nodes where there are more edges “within” than “across” communities (e.g., [7]). Note that for simplicity, we assume the communities are non-overlapping here. The goal of community detection is, for each node $i \in V$, to decide to which community it belongs (i.e., clustering).

There are many community detection methods for undirected networks. In this paper, we consider Newman's Spectral Clustering approach (NSC) [35], Bickel and Chen's Profile Likelihood approach (BCPL) [7, 42], Amini et al.'s Pseudo Likelihood approach (APL) [1], and Jin's SCORE [24].

NSC is a spectral method, where the key observation is that Newman and Girvan's modularity matrix can be approximated by the leading eigenvectors of the matrix [35].
Newman introduced NSC as a general idea for spectral clustering, and there are several different ways to implement it. Following [35], we cluster by using the signs of the leading eigenvector when $K = 2$, and by using the recursive bisection approach when $K \ge 3$.

BCPL is a penalization method proposed by Bickel and Chen [7] which uses greedy search to maximize the profile likelihood and works well for networks with thousands of nodes. When the network size is large, BCPL may be computationally slow. In light of this, Amini et al. [1] propose a Pseudo Likelihood approach which aims to improve the speed of BCPL. The price it pays is to ignore some dependence structures of the data so as to simplify the likelihood and make it more tractable.

SCORE, or Spectral Clustering On Ratios of Eigenvectors, is a recent spectral method proposed by Jin [24]. Assume $K$ (the number of communities) is known and let $A$ be the adjacency matrix associated with $N$:

$$A(i,j) = \begin{cases} 1, & \text{if there is an edge between nodes } i \text{ and } j, \\ 0, & \text{otherwise;} \end{cases} \tag{3.1}$$

note that $A$ is symmetric. SCORE consists of the following simple steps.
• Let $\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_K$ be the first $K$ (unit-norm) eigenvectors of $A$. Obtain the $n \times (K-1)$ matrix $\hat{R}$ by $\hat{R}(i,k) = \hat{\xi}_{k+1}(i) / \hat{\xi}_1(i)$, $1 \le i \le n$, $1 \le k \le K-1$.
• Cluster by applying the classical k-means to $\hat{R}$, assuming there are $\le K$ communities.

Remark 1. SCORE is motivated by the recent Degree Corrected Block Model (DCBM, [26]). In DCBM, for $n$ degree heterogeneity parameters $\{\theta(i)\}_{i=1}^{n}$ and a $K \times K$ symmetric matrix $P$, we think of $A(i,j)$, $1 \le i < j \le n$, as independent Bernoulli random variables such that $\mathbb{P}(A(i,j) = 1) = \theta(i)\,\theta(j)\,P_{k,\ell}$ if $i \in V^{(k)}$ and $j \in V^{(\ell)}$, $1 \le k, \ell \le K$.
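The two SCORE steps listed above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation of [24]: the "first K eigenvectors" are taken to be those whose eigenvalues are largest in magnitude, the network is assumed connected so that the leading eigenvector has no zero entries, exactly K clusters are requested, and `kmeans` is a small hypothetical helper with a deterministic start rather than a library routine.

```python
import numpy as np


def kmeans(X, K, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, K):
        # Squared distance of every point to its nearest chosen center.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


def score(A, K):
    """SCORE sketch: k-means on entry-wise ratios of eigenvectors."""
    A = np.asarray(A, dtype=float)
    vals, vecs = np.linalg.eigh(A)            # A is symmetric
    lead = np.argsort(-np.abs(vals))[:K]      # assumption: order by |eigenvalue|
    xi = vecs[:, lead]
    R = xi[:, 1:] / xi[:, [0]]                # R(i, k) = xi_{k+1}(i) / xi_1(i)
    return kmeans(R, K)


# Toy network: two 4-node cliques joined by a single edge between nodes 3 and 4.
A = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0
labels = score(A, K=2)
```

On this toy network the ratios of the second to the first eigenvector take opposite signs on the two cliques, so the k-means step separates nodes 0-3 from nodes 4-7.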
SCORE recognizes that the parameters $\theta(i)$ are nearly ancillary, and can be conveniently removed by taking entry-wise ratios between $\hat{\xi}_k$ and $\hat{\xi}_1$, $k = 2, \ldots, K$; see [24]. Originally proposed for undirected networks, SCORE is a flexible idea and can be used to analyze other types of networks. In Section 4, we extend SCORE to Directed-SCORE (D-SCORE) as an approach to community detection for directed networks, and use it to analyze the Citation network.

Remark 2. Note that the vectors of predicted labels by different methods could be very different. For a pair of predicted label vectors, we measure the similarity by the Adjusted Rand Index (ARI) [20] and the Variation of Information (VI) [30]; a large ARI or a small VI suggests that two predicted label vectors are similar to each other.

3.2. Coauthorship network (A). In this network, by definition, there is an edge between two nodes (i.e., authors) if and only if they have coauthored 2 or more papers (in the range of our data sets). The network is very much fragmented: the total of 3607 nodes split into 2985 different components, where 2805 (94%) of them are singletons, 105 (3.5%) of them are pairs, and the average component size is 1.2.

The giant component (236 nodes) is seen to be the “High Dimensional Data Analysis (Coauthorship (A))” group (HDDA-Coau-A); see Figure 1. It seems that the giant component has sub-structures (i.e., communities). In the left panel of Figure 2, we plot the scree plot of this group. The elbow point of the scree plot may be at the 3rd, 5th, or 8th largest eigenvalue, suggesting that there may be 2, 4, or 7 communities. In light of this, for each $K$ with $2 \le K \le 7$, we run SCORE, NSC, BCPL and APL and record the corresponding vectors of predicted labels.
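The two similarity measures of Remark 2 can be computed directly from a pair of label vectors. The sketch below uses the standard contingency-table formulas for ARI and VI; the function names are hypothetical, and the use of natural logarithms in VI is an assumption, since the text does not state the base.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(x, y):
    """ARI between two label vectors of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))   # contingency table of label pairs
    rows = Counter(x)
    cols = Counter(y)
    sum_ij = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0               # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


def variation_of_information(x, y):
    """VI = H(X|Y) + H(Y|X), using natural logarithms (assumption)."""
    n = len(x)
    joint = Counter(zip(x, y))
    rows = Counter(x)
    cols = Counter(y)
    vi = 0.0
    for (a, b), c in joint.items():
        p = c / n
        vi -= p * (log(p / (rows[a] / n)) + log(p / (cols[b] / n)))
    return vi


# The same partition under different label names: ARI = 1 and VI = 0.
same_ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
same_vi = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])

# Two "crossed" partitions of four nodes: low ARI, high VI.
cross_ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
cross_vi = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Both measures ignore the names of the labels, which is why they are suitable for comparing the outputs of different community detection methods.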
We find that for $K \ge 3$, the results by different methods are largely inconsistent with each other: the maximum ARI and the minimum VI (see Remark 2 in Section 3.1 for discussions on ARI and VI) across different pairs of methods are 0.15 and 1.19, respectively.

Fig 1. The giant component of Coauthorship network (A). It could be interpreted as the “High Dimensional Data Analysis (Coauthorship (A))” (HDDA-Coau-A) community. Names are only shown for 11 nodes with a degree of 8 or larger: David Dunson, Donglin Zeng, Hans-Georg Muller, Hongtu Zhu, Hua Liang, Jianqing Fan, Jing Qin, Joseph G Ibrahim, Peter Hall, Raymond J Carroll, T Tony Cai.

Fig 2. Scree plots. From left to right: the giant component of Coauthorship network (A), Coauthorship network (B), Citation network (in the last one, we display singular values instead of eigenvalues).

We now focus on the case of $K = 2$. In Table 4, we present the ARI and VI for each pair of the methods. The table suggests that the 4 methods split into two groups: SCORE and APL are in one group with an ARI of 0.72 (between them), and NSC and BCPL are in the other group with an ARI of 0.21. The results for methods in each group are moderately consistent with each other, but those for methods in different groups are rather inconsistent.
The point is confirmed by Table 5, which compares the sizes of the communities identified by the 4 methods.

In Figures 3-4, we further compare the community detection results by each of the 4 methods ($K = 2$). In each panel, nodes are marked with either black dots or white circles, representing two different communities. It seems that all four methods agree that there are two communities as follows.
• “North Carolina” community. This includes a group of researchers from Duke Univ., Univ. of North Carolina, and North Carolina State Univ.
• “Carroll-Hall” community. This includes a group of researchers in non-parametric and semi-parametric statistics, functional estimation, and high dimensional data analysis.

Comparing the results by different methods, one of the major discrepancies lies in the “Fan” group: SCORE and APL cluster the “Fan” group into the “Carroll-Hall” community, and NSC and BCPL cluster it into the “North Carolina” community. A possible explanation is that the “Fan” group has strong ties to both communities. This may also suggest there are 3 communities (instead of 2) in this component. However, as mentioned before, when we assume $K = 3$, the results by all four methods are rather inconsistent with each other. How to obtain a more convincing explanation is an interesting but challenging problem. We omit further discussions along this line for reasons of space.

Table 4. The Adjusted Rand Index (ARI) and Variation of Information (VI) for the vectors of predicted community labels by four different methods for the giant component of Coauthorship (A), assuming $K = 2$. A large ARI/small VI suggests that the two predicted label vectors are similar to each other.
        SCORE      NSC        BCPL       APL
SCORE   1.00/.00   -.04/.95   .09/1.05   .72/.33
NSC                1.00/.00   .21/1.06   -.06/.91
BCPL                          1.00/.00   .09/.87
APL                                      1.00/.00

Other noteworthy discrepancies are as follows:
• SCORE includes the “Dunson” branch in the “North Carolina” group, but APL clusters them into the “Carroll-Hall” group, to which they are not directly connected. In this regard, it seems that results by SCORE are more meaningful.
• NSC and BCPL differ on several small branches, including the “Dunson” branch and two small branches connecting to Jianqing Fan. In comparison, the results by NSC seem more meaningful.

Table 5. Comparison of community sizes by different methods assuming K = 2 for the giant component of Coauthorship network (A).

                     North Carolina   Carroll-Hall
SCORE                45               191
NSC                  155              81
APL                  31               205
SCORE ∩ NSC          45               81
SCORE ∩ APL          31               191
NSC ∩ APL            31               81
SCORE ∩ NSC ∩ APL    31               81

We now move away from the giant component. The next two largest components are the “Theoretical Machine Learning” group (15 nodes) and the “Dimension Reduction” group (14 nodes); see Figure 5. The first one is a research group working on Machine Learning topics using sophisticated statistical theory, including Peter Buhlmann, Alexandre Tsybakov, Jon Wellner, and Bin Yu. The second one is a research group on Dimension Reduction, including Francesca Chiaromonte, Dennis Cook, Bing Li and their collaborators.

A conversation with Qunhua Li helps to illuminate why these groups are meaningful and how they evolve over time. In the first community, Marloes H. Maathuis obtained her Ph.D. from University of Washington (jointly supervised by Jon Wellner and Piet Groeneboom) in 2006 and then went on to work at ETH, Switzerland, and she is possibly the “bridge” connecting the Seattle group and the ETH group (Peter Buhlmann, Markus Kalisch, Sara van de Geer).
Nicolai Meinshausen could be one of the “bridge” nodes between ETH and Berkeley: he was a Ph.D. student of Peter Buhlmann and then a postdoctoral researcher at Berkeley. In the second group, Ms. Chiaromonte obtained her Ph.D. from University of Minnesota, where Dennis Cook served as the supervisor. She then went on to work in the Statistics Department at Pennsylvania State University, and started to collaborate with Bing Li on Dimension Reduction.

The next 5 largest components in Coauthorship network (A) are the “Johns Hopkins” group (13 nodes; including faculty at Johns Hopkins University and their collaborators; similar below), the “Duke” group (10 nodes; including Mike West, Jonathan Stroud, Carlos Carvalho, etc.), the “Stanford” group (9 nodes; including David Siegmund, John Storey, Ryan Tibshirani, Nancy Zhang, etc.), the “Quantile Regression” group (9 nodes; including Xuming He and his collaborators), and the “Experimental Design” group (8 nodes). These groups are presented in Table 6.

Fig 3. Community detection results by SCORE (top) and APL (bottom) for the giant component of Coauthorship network (A), assuming K = 2. Nodes in black (solid) dots and white circles represent two different communities. Names are shown for the same 11 nodes as in Figure 1.

3.3. Coauthorship network (B). In this network, there is an edge between nodes i and j if and only if they have coauthored 1 or more papers. Compared to Coauthorship network (A), this definition is more conventional, but it also makes the network harder to analyze.

Fig 4. Community detection results by NSC (top) and BCPL (bottom) for the giant component of Coauthorship network (A), assuming K = 2. Nodes in black (solid) dots and white circles represent two different communities. Names are shown for the same 11 nodes as in Figure 1.

Coauthorship network (B) has a total of 3607 nodes, where the giant component consists of 2263 (63% of all nodes). For the analysis in this section, we focus on the giant component. Also, for simplicity, we call the giant component the Coauthorship network (B) whenever there is no confusion.

Fig 5. The second largest (left) and third largest (right) components of Coauthorship network (A). They can possibly be interpreted as the “Theoretical Machine Learning” and “Dimension Reduction” communities, respectively. Node labels across the two panels: Alexandre B Tsybakov, Anatoli B Juditsky, Bin Yu, Bing Li, Fadoua Balabdaoui, Florentina Bunea, Francesca Chiaromonte, Guilherme Rocha, Jon A Wellner, Karim Lounici, Lexin Li, Liliana Forzani, Liping Zhu, Liqiang Ni, Liugen Xue, Lixing Zhu, Lukas Meier, Markus Kalisch, Marloes H Maathuis, Marten H Wegkamp, Nicolai Meinshausen, Peter Buhlmann, Philippe Rigollet, Piet Groeneboom, R Dennis Cook, Sara van de Geer, Tao Shi, Winfried Stute, Xia Cui, Xiangrong Yin, Xin Chen, Yuexiao Dong.

Table 6. Top: the 4-th, 5-th, and 6-th largest components of Coauthorship network (A), which can be interpreted as the groups of “Johns Hopkins”, “Duke”, and “Stanford”. Bottom: the 7-th and 8-th largest components of Coauthorship network (A), which can be interpreted as the groups of “Quantile Regression” and “Experimental Design”.
Johns Hopkins: Barry Rowlingson, Brian S Caffo, Chong-Zhi Di, Ciprian M Crainiceanu, David Ruppert, Dobrin Marchev, Galin L Jones, James P Hobert, John P Buonaccorsi, John Staudenmayer, Naresh M Punjabi, Peter J Diggle, Sheng Luo.
Duke: Carlos M Carvalho, Gary L Rosner, Gerard Letac, Helene Massam, James G Scott, Jonathan R Stroud, Maria De Iorio, Mike West, Nicholas G Polson, Peter Muller.
Stanford: Armin Schwartzman, Benjamin Yakir, David Siegmund, F Gosselin, John D Storey, Jonathan E Taylor, Keith J Worsley, Nancy Ruonan Zhang, Ryan J Tibshirani.
Quantile Regression: Hengjian Cui, Huixia Judy Wang, Jianhua Hu, Jianhui Zhou, Valen E Johnson, Wing K Fung, Xuming He, Yijun Zuo, Zhongyi Zhu.
Experimental Design: Andrey Pepelyshev, Frank Bretz, Holger Dette, Natalie Neumeyer, Stanislav Volgushev, Stefanie Biedermann, Tim Holland-Letz, Viatcheslav B Melas.

We are primarily interested in community detection. Figure 2 (middle panel) presents the scree plot associated with Coauthorship network (B), suggesting 3 or more communities. We apply all four methods (SCORE, NSC, BCPL, and APL) assuming K = 3; the findings are as follows.

First, in Table 7, we compare all 4 methods pair-wise and tabulate the corresponding ARI and VI (see Remark 2). Somewhat surprisingly, the results of BCPL are inconsistent with those by all other methods. For example, the maximum ARI between BCPL and each of the other three methods is .00, and the smallest VI between BCPL and each of the other three methods is 1.29, showing a substantial disagreement. At the same time, the results by SCORE, NSC, and APL are reasonably consistent with each other: the ARI between the vector of predicted labels by SCORE and that by NSC is 0.55, and the ARI between the vector of predicted labels by NSC and that by APL is 0.41; see Table 7 for details. In particular, the three methods agree that the three communities each of them identifies can be interpreted as follows (arranged by size in ascending order).
• “Objective Bayes” community. This community includes a small group of researchers (group sizes are different for different methods, ranging from 20 to 69) including James Berger and his collaborators.
• “Biostatistics (Coauthorship (B))” (Biostat-Coau-B) community. The sizes of this community by the three different methods vary considerably, ranging from 50 to 388. While it is probably not exactly right to call this community “Biostatistics”, the community consists of a number of statisticians and biostatisticians in the Research Triangle Park of North Carolina. It also includes many statisticians and biostatisticians from Harvard University, University of Michigan at Ann Arbor, and University of Wisconsin at Madison.
• “High Dimensional Data Analysis (Coauthorship (B))” (HDDA-Coau-B) community. The sizes of this community by the three different methods range from 1811 to 2193. The community includes researchers from a wide variety of research areas in or related to high dimensional data analysis (e.g., Bioinformatics, Machine Learning).
In Figures 6-8, we present these three communities (all three as identified by SCORE), respectively. In Table 8, we compare the sizes of the three communities identified by each of the three methods. There are two points worth noting.

First, while SCORE and NSC are quite similar to each other, there is a major difference: NSC clusters about 200 authors, mostly biostatisticians from Harvard University, University of Michigan at Ann Arbor, and University of Wisconsin at Madison, into the HDDA-Coau-B community, but SCORE clusters them into the Biostat-Coau-B community. It seems that the results by SCORE are more meaningful.

Table 7. The Adjusted Rand Index (ARI) and Variation of Information (VI) for the vectors of predicted community labels by four different methods in Coauthorship network (B), assuming K = 3.
A large ARI/small VI suggests that the two predicted label vectors are similar to each other. Each entry is ARI/VI.

        SCORE      NSC        BCPL       APL
SCORE   1.00/.00   .55/.51    .00/1.65   .19/.59
NSC                1.00/.00   .00/1.46   .41/.36
BCPL                          1.00/.00   .00/1.21
APL                                      1.00/.00

Second, APL behaves very differently from either SCORE or NSC. Its estimate of the “Objective Bayes” community is (almost) a subset of its counterpart by either SCORE or NSC, and is much smaller in size (sizes are 20, 64, and 69 for that by APL, SCORE, and NSC). A similar claim applies to the Biostat-Coau-B community identified by each of the methods (sizes are 50, 388, and 169 for that by APL, SCORE, and NSC). This suggests that APL may have underestimated these two communities but overestimated the HDDA-Coau-B community.

It is also interesting to compare these results with those we obtain in Section 3.2 for Coauthorship network (A). Below are three noteworthy points.

First, recall that in Figure 5 and Table 6, we have identified a total of 7 different components of Coauthorship network (A). Among these components, the Duke component (middle panel on top row in Table 6) splits into three parts, each belonging to one of the three communities of Coauthorship network (B) identified by SCORE. The other 6 components fall into the HDDA-Coau-B community identified by SCORE almost completely.

Second, for the giant component of Coauthorship (A), it is a close call whether we should cluster Carroll-Hall's group and Fan's group into two communities: SCORE and APL place the two groups in one community, but NSC and BCPL do not. In Coauthorship (B), both groups are in the HDDA-Coau-B community. Also, in previous studies on this giant component, BCPL and APL separate the nodes in Dunson's branch from the North Carolina group, and cluster them into the Carroll-Hall group.
In the current study, however, the whole North Carolina group (including Dunson's branch) is in the Biostat-Coau-B community.

Third, in Coauthorship (A), Gelfand's group is included in the 236-node giant component, of which James Berger is not a member. In Coauthorship network (B), Gelfand's group becomes a subset of the “Objective Bayes” community, where James Berger is a hub node.

Table 8. Comparison of sizes of the three communities identified by each of the three methods in Coauthorship network (B), assuming K = 3. BCPL is not included in the comparisons because its results are inconsistent with those by the other three methods.

                     Objective Bayes   Biostat-Coau-B   HDDA-Coau-B
SCORE                64                388              1811
NSC                  69                163              2031
APL                  20                50               2193
SCORE ∩ NSC          55                162              1807
SCORE ∩ APL          20                50               1811
NSC ∩ APL            20                50               2032
SCORE ∩ NSC ∩ APL    20                50               1807

Fig 6. The “Objective Bayes” community in Coauthorship network (B) identified by SCORE (64 nodes). Only names for 14 nodes with a degree of 9 or larger are shown: Alan E Gelfand, Athanasios Kottas, Carlos M Carvalho, Daniel Walsh, Fei Liu, Gonzalo Garcia-Donato, J Palomo, James O Berger, Jerry Sacks, John A Cafeo, M J Bayarri, R J Parthasarathy, Rui Paulo, Steven N MacEachern.

4. Community detection for Citation network. The Citation network is a directed network. As a result, the study in this section is different from that in Section 3 in important ways, and provides additional insight into the structures of statisticians' networks. In Section 4.1, we discuss methods for community detection for directed networks. In Section 4.2, we analyze the Citation network, and compare the results with those in Section 3.

4.1. Community detection methods (directed networks).
In the Citation network, each node is an author and there is a directed edge from node i to node j if and only if node i has cited node j at least once. To analyze the Citation network, one usually focuses on the weakly connected giant component [3]. This is the giant component of the weakly connected citation network, which is an undirected network where there is an edge between nodes i and j if one has cited the other at least once. From now on, when we say the Citation network, we mean the weakly connected giant component of the original Citation network.

Fig 7. The “Biostatistics” community (Biostat-Coau-B) in Coauthorship network (B) identified by SCORE (388 nodes). Only names for 17 nodes with a degree of 13 or larger are shown: David Dunson, Debajyoti Sinha, Eric Feuer, Helen Zhang, Heping Zhang, Hongtu Zhu, Steve Marron, Ji Zhu, Joseph Ibrahim, Jun Liu, L J Wei, Louise Ryan, Tapabrata Maiti, Trivellore Raghunathan, Weili Lin, Yimei Li, Zhiliang Ying. A “branch” in the figure is usually a research group in an institution or a state.

For community detection of directed networks, there are relatively few approaches.
In this section, we consider two methods: LNSC and Directed-SCORE (D-SCORE).

LNSC stands for Leicht and Newman's Spectral Clustering approach proposed in [28]: the authors extended the spectral modularity methods by [35] for undirected networks to directed networks, using the so-called generalized modularity [2]. However, it is pointed out in [27] that LNSC cannot properly distinguish the directions of the edges and cannot detect communities representing directionality patterns among the nodes. See details therein.

D-SCORE is the adaptation of SCORE to directed networks. SCORE is a community detection method for undirected networks, and the method was motivated by DCBM for undirected networks; see Section 3.1. Below, we first extend DCBM to directed networks, and then introduce D-SCORE.

Fig 8. The “High Dimensional Data Analysis” community (HDDA-Coau-B) in Coauthorship network (B) identified by SCORE (1811 nodes). Only names for 22 nodes with a degree of 18 or larger are shown: Alexandre Tsybakov, Andrea Rotnitzky, Bani Mallick, Christian Robert, Ciprian Crainiceanu, Enno Mammen, Gerda Claeskens, Holger Dette, Hua Liang, James Robins, Jianqing Fan, Larry Wasserman, Lawrence Brown, Lixing Zhu, Malay Ghosh, Marc Genton, Nilanjan Chatterjee, Peter Hall, Peter Muller, Raymond Carroll, Runze Li, Xuming He.

Let $A$ be the adjacency matrix of a directed network $\mathcal{N} = (V, E)$, where
$$A(i,j) = \begin{cases} 1, & \text{there is a directed edge from } i \text{ to } j, \\ 0, & \text{otherwise}, \end{cases} \qquad 1 \le i, j \le n,$$
and $n$ is the total number of nodes. For DCBM of a directed network $\mathcal{N} = (V, E)$, similarly, we think that all nodes split into $K$ different (disjoint) communities
$$V = V^{(1)} \cup V^{(2)} \cup \ldots \cup V^{(K)}.$$
Additionally, we suppose that $\{A(i,j),\ i \ne j,\ 1 \le i, j \le n\}$ are independent Bernoulli with parameters $\pi_{ij}$, and that there is a $K \times K$ non-negative matrix $P$ and two vectors with positive entries $\theta \in R^n$ and $\delta \in R^n$ such that
$$\pi_{ij} = \theta(i)\,\delta(j)\,P_{k,\ell}, \qquad \text{if } i \in V^{(k)} \text{ and } j \in V^{(\ell)}, \quad 1 \le k, \ell \le K.$$
Here, $\theta(i)$ models the degree heterogeneity parameter for node $i$ as a citer, and $\delta(i)$ models the degree heterogeneity parameter for node $i$ as a citee. This model motivates a new community detection method: D-SCORE.
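To make the directed DCBM concrete, the following minimal sketch draws one adjacency matrix from the model with $\pi_{ij} = \theta(i)\delta(j)P_{k,\ell}$. It is an illustration only, not part of the paper: the function names, the seed argument, and all numerical values are hypothetical, and the parameters are chosen so that every $\pi_{ij}$ stays in $[0, 1]$.

```python
import random


def edge_probability(i, j, theta, delta, P, community):
    """pi_ij = theta(i) * delta(j) * P[k][l], where i is in V^(k) and j is in V^(l)."""
    return theta[i] * delta[j] * P[community[i]][community[j]]


def sample_directed_dcbm(theta, delta, P, community, seed=0):
    """Draw independent Bernoulli(pi_ij) entries A(i, j) for all i != j."""
    rng = random.Random(seed)
    n = len(community)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the model only specifies off-diagonal entries
            if rng.random() < edge_probability(i, j, theta, delta, P, community):
                A[i][j] = 1  # directed edge: i cites j
    return A


# Toy parameters (hypothetical): 4 nodes in K = 2 communities.
theta = [0.5, 1.0, 1.0, 0.8]      # heterogeneity as a citer
delta = [1.0, 0.5, 0.9, 1.0]      # heterogeneity as a citee
P = [[0.9, 0.1], [0.2, 0.8]]      # K x K matrix, need not be symmetric
community = [0, 0, 1, 1]          # community index of each node
A = sample_directed_dcbm(theta, delta, P, community, seed=1)
```

Because $\theta$ and $\delta$ are separate vectors and $P$ need not be symmetric, a node can be an active citer while being rarely cited, which is the asymmetry the directed model is meant to capture.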
For detailed explanations, see the forthcoming manuscript [23].

Given a directed network $\mathcal{N} = (V, E)$, assume $\mathcal{N}$ has $K$ communities. Let $A$ be the adjacency matrix, and let $\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_K$ and $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_K$ be the first $K$ left singular vectors and the first $K$ right singular vectors of $A$, respectively. Also, define two associated (undirected) networks with the same set of nodes as follows.
• Citer network. There is an (undirected) edge between two distinct nodes $i$ and $j$ in $V$ if and only if both of them have cited a node $k$ at least once, for some $k \in (V \setminus \{i, j\})$ (i.e., they have a common citee).
• Citee network. There is an (undirected) edge between two distinct nodes $i$ and $j$ in $V$ if and only if each of them has been cited at least once by the same node $k \in (V \setminus \{i, j\})$ (i.e., they have a common citer).
Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be the giant components of the Citer network and Citee network, respectively. Define two $n \times (K-1)$ matrices $\hat{R}^{(l)}$ and $\hat{R}^{(r)}$ by
$$\hat{R}^{(l)}(i,k) = \begin{cases} \mathrm{sgn}\bigl(\hat{u}_{k+1}(i)/\hat{u}_1(i)\bigr) \cdot \min\Bigl\{ \Bigl|\dfrac{\hat{u}_{k+1}(i)}{\hat{u}_1(i)}\Bigr|,\ \log(n) \Bigr\}, & i \in \mathcal{N}_1, \\ 0, & i \notin \mathcal{N}_1, \end{cases} \tag{4.1}$$
$$\hat{R}^{(r)}(i,k) = \begin{cases} \mathrm{sgn}\bigl(\hat{v}_{k+1}(i)/\hat{v}_1(i)\bigr) \cdot \min\Bigl\{ \Bigl|\dfrac{\hat{v}_{k+1}(i)}{\hat{v}_1(i)}\Bigr|,\ \log(n) \Bigr\}, & i \in \mathcal{N}_2, \\ 0, & i \notin \mathcal{N}_2, \end{cases} \tag{4.2}$$
where $1 \le i \le n$ and $1 \le k \le K-1$. Note that all nodes split into four disjoint subsets:
$$\mathcal{N} = (\mathcal{N}_1 \cap \mathcal{N}_2) \cup (\mathcal{N}_1 \setminus \mathcal{N}_2) \cup (\mathcal{N}_2 \setminus \mathcal{N}_1) \cup (\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)).$$
D-SCORE clusters nodes in each subset separately.
1. ($\mathcal{N}_1 \cap \mathcal{N}_2$). Restricting the rows of $\hat{R}^{(l)}$ and $\hat{R}^{(r)}$ to the set $\mathcal{N}_1 \cap \mathcal{N}_2$ and obtaining two matrices $\tilde{R}^{(l)}$ and $\tilde{R}^{(r)}$, we cluster all nodes in $\mathcal{N}_1 \cap \mathcal{N}_2$ by applying $k$-means to the matrix $[\tilde{R}^{(l)}, \tilde{R}^{(r)}]$, assuming there are $\le K$ communities.
2. ($\mathcal{N}_1 \setminus \mathcal{N}_2$). Note that according to the communities we identified above, the rows of $\tilde{R}^{(l)}$ partition into $\le K$ groups. For each group, we call the mean of the row vectors the community center. For a node $i$ in $\mathcal{N}_1 \setminus \mathcal{N}_2$, if the $i$-th row of $\hat{R}^{(l)}$ is closest to the center of the $k$-th community for some $1 \le k \le K$, then we assign it to this community.
3. ($\mathcal{N}_2 \setminus \mathcal{N}_1$). We cluster in a similar fashion to that in the last step, but we use $(\tilde{R}^{(r)}, \hat{R}^{(r)})$ instead of $(\tilde{R}^{(l)}, \hat{R}^{(l)})$.
4. ($\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)$). We say there is a weak-edge between $i$ and $j$ if there is an edge between $i$ and $j$ in the weakly connected citation network. By steps 1-3, all nodes in $\mathcal{N}_1 \cup \mathcal{N}_2$ partition into $\le K$ communities. For each node in $\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)$, we assign it to the community to which it has the largest number of weak-edges.
For step 4, our assumption is that $|\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)|$ is small, so a sophisticated clustering method is not needed. For the statistical citation network data set we study in this paper, this is true with $|\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)| = 14$.

Figure 9 illustrates how D-SCORE works using the statistical citation network data set with K = 3. The two panels show similar clustering patterns, suggesting that there are three communities; see Section 4.2 for details.
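The building blocks of the procedure above can be sketched with the Python standard library alone. The block below is a hedged illustration, not the authors' implementation: it builds the Citer and Citee networks, extracts a giant component, forms the truncated ratios of (4.1)-(4.2) from singular vectors that are assumed to be already computed (for example by an SVD routine of the reader's choice), and applies the assignment rules of steps 2-4. The $k$-means step is omitted. All function names are hypothetical, nodes are assumed to be indexed 0, ..., n-1, and the leading singular vector is assumed nonzero on the relevant giant component.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations
from math import copysign, dist, log


def citer_network(directed_edges):
    """Undirected pairs {i, j} of distinct nodes that both cite some third node k."""
    citers_of = defaultdict(set)
    for i, k in directed_edges:
        if i != k:  # a self-citation does not make k a common citee
            citers_of[k].add(i)
    return {frozenset(p) for s in citers_of.values() for p in combinations(s, 2)}


def citee_network(directed_edges):
    """Same construction on reversed edges: pairs that share a common citer."""
    return citer_network((j, i) for i, j in directed_edges)


def giant_component(pairs):
    """Node set of the largest connected component of an undirected edge set."""
    adj = defaultdict(set)
    for i, j in pairs:
        adj[i].add(j)
        adj[j].add(i)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            for w in adj[queue.popleft()]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        best = max(best, comp, key=len)
    return best


def ratio_matrix(vectors, members):
    """Rows of (4.1)/(4.2): sgn(r) * min(|r|, log n), r = v_{k+1}(i) / v_1(i)."""
    lead, rest = vectors[0], vectors[1:]
    n = len(lead)
    cap = log(n)
    rows = []
    for i in range(n):
        if i in members:
            ratios = (v[i] / lead[i] for v in rest)
            rows.append([copysign(min(abs(r), cap), r) for r in ratios])
        else:
            rows.append([0.0] * len(rest))  # node outside the giant component
    return rows


def nearest_center(row, rows, labels):
    """Steps 2-3: label of the community whose mean row (its center) is closest."""
    groups = defaultdict(list)
    for r, lab in zip(rows, labels):
        groups[lab].append(r)
    centers = {lab: [sum(c) / len(g) for c in zip(*g)] for lab, g in groups.items()}
    return min(centers, key=lambda lab: dist(row, centers[lab]))


def weak_edge_vote(neighbors, labels):
    """Step 4: community holding most of a node's already-labeled weak-edge neighbors."""
    return Counter(labels[j] for j in neighbors if j in labels).most_common(1)[0][0]
```

In this sketch, `ratio_matrix` would be called once with the left singular vectors and the node set of the Citer giant component, and once with the right singular vectors and the node set of the Citee giant component. Ties in `nearest_center` and `weak_edge_vote` go to the first label encountered, which is an arbitrary choice made here for simplicity.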
Fig 9. Left: each point represents a row of the matrix $\hat{R}^{(l)}$ (the matrix has only two columns since K = 3) associated with the statistical Citation network (x-axis: first column, y-axis: second column). Only rows with indices in $\mathcal{N}_1$ are shown. Blue pluses, green bars, and red dots represent 3 different communities identified by D-SCORE, which can be interpreted as “Large-Scale Multiple Testing”, “Spatial and Semi-parametric/Nonparametric Statistics” and “Variable Selection”. Right: similar but with $(\hat{R}^{(l)}, \mathcal{N}_1)$ replaced by $(\hat{R}^{(r)}, \mathcal{N}_2)$.

4.2. Citation network. The original citation network data set consists of 3607 nodes (i.e., authors). The associated weakly connected network has 927 components. The giant component has 2654 authors, accounting for 74% of all nodes. All other components have no more than 5 nodes.

We now restrict our attention to the weakly connected giant component $\mathcal{N} = (V, E)$. As before, let $\mathcal{N}_1$ and $\mathcal{N}_2$ be the giant components of the Citer and Citee networks associated with $\mathcal{N}$, respectively. We have $|\mathcal{N}_1| = 2126$, $|\mathcal{N}_2| = 1790$, $|\mathcal{N}_1 \cap \mathcal{N}_2| = 1276$, and $|\mathcal{N} \setminus (\mathcal{N}_1 \cup \mathcal{N}_2)| = 14$.

Fig 10. The “Large-Scale Multiple Testing” community identified by D-SCORE (K = 3) in the Citation network (359 nodes). Only 26 nodes with 24 or more citers are shown here: Aad van der Vaart, Abba M Krieger, Bradley Efron, Christian P Robert, Christopher Genovese, D R Cox, Daniel Yekutieli, David L Donoho, David Siegmund, Donald B Rubin, E L Lehmann, Felix Abramovich, Iain M Johnstone, James O Berger, Jiashun Jin, John D Storey, John Rice, Joseph P Romano, Larry Wasserman, Mark G Low, Paul R Rosenbaum, Peter Muller, Sanat K Sarkar, Subhashis Ghosal, Yoav Benjamini, Zhiyi Chi.

We are primarily interested in community detection. In Figure 2 (right panel), we present the scree plot of $A$. Note that since $A$ is non-symmetric, we use the singular values instead of the eigenvalues in the plot. The plot suggests that there are K = 3 communities in $\mathcal{N}$.

We have applied D-SCORE and LNSC to $\mathcal{N}$. The results by D-SCORE are reported below in detail. The results of LNSC are rather inconsistent with those of D-SCORE, so we only discuss them briefly; see Section 4.2.3.

D-SCORE identifies three communities as follows.
• “Large-Scale Multiple Testing” community (359 nodes). This consists of researchers in multiple testing and control of False Discovery Rate. It includes a Bayes group (James Berger, Peter Muller), three Berkeley-Stanford groups (Bradley Efron, David Siegmund, John Storey; David Donoho, Iain Johnstone, Mark Low¹, John Rice; Erich Lehmann, Joseph Romano), a Carnegie Mellon group (e.g., Christopher Genovese, Jiashun Jin, Isabella Verdinelli, Larry Wasserman), a Causal Inference group (Donald Rubin, Paul Rosenbaum), and a Tel Aviv group (Felix Abramovich, Yoav Benjamini, Abba Krieger², Daniel Yekutieli), etc.
¹ University of Pennsylvania
• “Variable Selection” community (1285 nodes).
This includes (sorted in descending order by the number of citers) Jianqing Fan, Hui Zou, Peter Hall, Nicolai Meinshausen, Peter Buhlmann, Ming Yuan, Yi Lin, Runze Li, Peter Bickel, Trevor Hastie, Hans-Georg Muller, Emmanuel Candes, Cun-Hui Zhang, Heng Peng, Jian Huang, Tony Cai, Terence Tao, Jianhua Huang, Alexandre Tsybakov, Jonathan Taylor, Xihong Lin, Jane-Ling Wang, Dan Yu Lin, Fang Yao, Jinchi Lv.
• “Spatial and Semi-parametric/Nonparametric Statistics” (for short, “Spatial Statistics”) community (1010 nodes). See discussions below.

The first two communities are presented in Figures 10 and 11, respectively. The last community consists of sub-structures and is harder to interpret. To address this, we first restrict the network to this community (i.e., ignoring all edges to/from outside) and obtain a sub-network. We then apply D-SCORE with $K = 3$ to the giant component (908 nodes) of this sub-network, and obtain three meaningful sub-communities as follows.
• Non-parametric spatial statistics (212 nodes), including David Blei, Alan Gelfand, Yi Li, Steven MacEachern, Omiros Papaspiliopoulos, Trivellore Raghunathan, Gareth Roberts.
• Parametric spatial statistics (304 nodes), including Marc Genton, Tilmann Gneiting, Douglas Nychka, Anthony OHagan, Adrian Raftery, Nancy Reid, Michael Stein.
• Semi-parametric/Non-parametric statistics (392 nodes), including Raymond Carroll, Nilanjan Chatterjee, Ciprian Crainiceanu, Joseph Ibrahim, Jeffrey Morris, David Ruppert, Naisyin Wang, Hongtu Zhu.
These sub-communities are presented in Figure 12.

4.2.1. Comparison with Coauthorship network (A). In Section 3.2, we present 8 different components of Coauthorship network (A). In Table 9, we reinvestigate all these components in order to understand their relationship with the 3 communities identified by D-SCORE in the Citation network.
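The kind of comparison summarized in Table 9 amounts to cross-tabulating two labelings of the same set of authors. The following is a minimal sketch of that bookkeeping, not the authors' code; the function name `cross_tabulate`, the fallback label "other", and the toy input are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cross_tabulate(row_labels, col_labels):
    """Count nodes for every (row label, column label) pair.

    Both arguments map a node name to its label. A node that is missing
    from one of the mappings is counted under the label "other" there.
    """
    counts = Counter()
    for node in set(row_labels) | set(col_labels):
        row = row_labels.get(node, "other")
        col = col_labels.get(node, "other")
        counts[(row, col)] += 1
    return counts


# Hypothetical toy input, not taken from the data set.
citation_community = {
    "a": "Spatial",
    "b": "Var. Selection",
    "c": "Var. Selection",
    "d": "Multiple Tests",
}
coauthor_component = {"a": "giant", "b": "giant", "c": "Dim. Reduc.", "e": "giant"}

table = cross_tabulate(citation_community, coauthor_component)
for (row, col), n in sorted(table.items()):
    print(f"{row:15s} {col:12s} {n}")
```

Row and column totals follow by summing the counts over one coordinate, which gives a quick consistency check against the known community and component sizes.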
Among these 8 components, the first one is the giant component, consisting of 236 nodes. All except 3 of these nodes fall in the 3 communities identified by D-SCORE in the Citation network, with 60 nodes in “Spatial and Semi-parametric/Non-parametric Statistics”, including (sorted in descending order by the number of citers; same below) Raymond Carroll, Joseph Ibrahim, Naisyin Wang, Alan Gelfand, Jeffrey Morris, Marc Genton, Sudipto Banerjee, Hongtu Zhu, Jeng-Min Chiou, Ju-Hyun Park, Ulrich Stadtmuller, Ming-Hui Chen, Yi Li, Nilanjan Chatterjee, Andrew Finley; 166 nodes in “Variable Selection”, including Jianqing Fan, Hui Zou, Peter Hall, Ming Yuan, Yi Lin, Runze Li, Trevor Hastie, Hans-Georg Muller, Emmanuel Candes, Cun-Hui Zhang, Heng Peng, Jian Huang, Tony Cai, Jianhua Huang, Xihong Lin; and 7 nodes in “Large-Scale Multiple Testing”, including David Donoho, Jiashun Jin, Mark Low, Wenguang Sun, Ery Arias-Castro, Michael Akritas, Jessie Jeng.

Fig 11. The “Variable Selection” community identified by D-SCORE ($K = 3$) in the Citation network (1285 nodes). Only the 40 nodes with 54 or more citers are shown here. Nodes shown: Alexandre Tsybakov, Bernard Silverman, Bin Yu, Cun-Hui Zhang, Dan Yu Lin, Elizaveta Levina, Emmanuel Candes, Fang Yao, Hans-Georg Muller, Hansheng Wang, Helen Zhang, Heng Peng, Hua Liang, Hui Zou, Jane-Ling Wang, Ji Zhu, Jian Huang, Jianhua Huang, Jianqing Fan, Jinchi Lv, Joel Horowitz, Jonathan Taylor, Lixing Zhu, Marina Vannucci, Michael Kosorok, Ming Yuan, Mohsen Pourahmadi, Nicolai Meinshausen, Peter Buhlmann, Peter Hall, Peter Bickel, Dennis Cook, Robert Tibshirani, Runze Li, Tony Cai, Terence Tao, Trevor Hastie, Xihong Lin, Xuming He, Yi Lin.

Table 9. Sizes of the intersections of the communities identified by D-SCORE ($K = 3$) in the Citation network (rows) and the 8 largest components of Coauthorship network (A) as presented in Figures 1 and 5 and Table 6 (columns). “Other”: nodes outside the weakly connected giant component; *: 9 out of 12 are in the “Semi-parametric/Non-parametric” sub-community of the “Spatial Statistics” community. A hyphen marks an empty cell.

                  Giant  Mach.   Dim.    Johns    Duke  Stanford  Quant.  Exp.
                         Learn.  Reduc.  Hopkins                  Reg.    Design
Spatial           60     1       -       12*      1     -         -       3
Var. Selection    166    15      14      1        7     2         8       2
Multiple Tests    7      2       -       -        2     7         1       3
Other             3      -       -       -        -     -         -       -
Total             236    18      14      13       10    9         9       8

Fig 12. The “Spatial and Semi-parametric/Non-parametric Statistics” community has three sub-communities: Non-parametric Spatial (upper), Parametric Spatial (middle), and Semi-parametric/Non-parametric (lower). In each, only about 20 high-degree nodes are shown.
Upper panel nodes: Alan E Gelfand, Alexandros Beskos, Athanasios Kottas, David M Blei, Fernando A Quintana, Gareth Roberts, Gary L Rosner, Herbert K H Lee, Ju-Hyun Park, Mark F J Steel, Matthew J Beal, Natesh Pillai, Omiros Papaspiliopoulos, Paul Fearnhead, Pilar L Iglesias, Radford M Neal, Robert B Gramacy, Steven N MacEachern, Trivellore Raghunathan, Yee Whye Teh, Yi Li.
Middle panel nodes: Adrian E Raftery, Andrew O Finley, Anthony OHagan, Cristiano Varin, Douglas W Nychka, Fadoua Balabdaoui, Haavard Rue, Hao Zhang, Huiyan Sang, Jonathan Tawn, Laurens de Haan, Leah J Welty, Marc G Genton, Martin Schlather, Michael L Stein, Montserrat Fuentes, N Reid, Nicolas Chopin, Paolo Vidoni, Sudipto Banerjee, Tilmann Gneiting.
Lower panel nodes: Alan Welsh, Brian S Caffo, Ciprian Crainiceanu, D Mikis Stasinopoulos, David Ruppert, Hongtu Zhu, Hua Yun Chen, Jeffrey Morris, Joseph G Ibrahim, Michael A Benjamin, Ming-Hui Chen, Mohammad Hosseini-Nasab, Naisyin Wang, Nilanjan Chatterjee, Rabi Bhattacharya, Raymond Carroll, Robert A Rigby, Robin Henderson, Rui Paulo, Silvia Shimakura, Theo Gasser, Thomas C M Lee, Ulrich Stadtmuller, Vic Patrangenaru.

This is consistent with our previous claim that this 236-node giant component contains a “Carroll-Hall” group and a “North Carolina” community: the “Carroll-Hall” group has strong ties to the area of variable selection, and the “North Carolina” group has strong ties to Biostatistics. Raymond Carroll has close ties to both of these groups, and it is not surprising that SCORE assigns him to the “Carroll-Hall” group in Section 3.2 in Coauthorship network (A) but D-SCORE assigns him to the “Spatial” community in the Citation network.

For the remaining 7 components of Coauthorship network (A), “Theoretical Machine Learning”, “Dimension Reduction”, “Duke”, and “Quantile Regression” are (almost) subsets of “Variable Selection”; “Stanford” (including John Storey, Jonathan Taylor, Ryan Tibshirani) is (almost) a subset of “Large-Scale Multiple Testing”; and “Johns Hopkins” is (almost) a subset of “Spatial Statistics”. The “Experimental Design” group has no stronger relation to one area than to the others, so its nodes spread almost evenly across these three communities.

4.2.2. Comparison with Coauthorship network (B). We compare the community detection results by D-SCORE for the Citation network with those by SCORE for Coauthorship network (B) in Section 3.3.
Note that for the former, we have focused on the weakly connected giant component of the Citation network (2654 nodes), and for the latter, we have focused on the giant component of Coauthorship network (B) (2263 nodes). The comparison of the two sets of results is tabulated in Table 10.

Viewing the table vertically, we observe that the Citation network provides additional insight into Coauthorship network (B), and reveals structures we have not found previously. Below are the details.

First, the “Objective Bayes” community in Coauthorship network (B) contains two main parts. The first part consists of 55% of the nodes, and most of them are seen to be researchers who have close ties to James Berger, including (sorted in descending order by the number of citers; same below) Alan Gelfand, Fernando Quintana, Steven MacEachern, Gary Rosner, Rui Paulo, Herbert Lee, Robert Gramacy, Athanasios Kottas, Pilar Iglesias, Daniel Walsh, Dongchu Sun. The second part consists of 25% of the nodes, and is assigned to the “Variable Selection” community in the Citation network by D-SCORE, including Carlos Carvalho, Feng Liang, Maria De Iorio, German Molina, Merlise Clyde, Luis Pericchi, Maria Barbieri, Nicholas Polson, Bala Rajaratnam, Edward George. For the second part, the result seems reasonable, as many nodes in the second part (e.g., Carlos Carvalho, Edward George, Feng Liang, Merlise Clyde) have an interest in model selection.

Second, the “Biostatistics (Coauthorship (B))” community in Coauthorship network (B) also has two main parts.
The first part has 156 nodes (40% of the total), including high-degree nodes such as Joseph Ibrahim, Sudipto Banerjee, Hongtu Zhu, Ju-Hyun Park, Ming-Hui Chen, Yi Li, Montserrat Fuentes, Natesh Pillai, Andrew Finley, Amy Herring, Martin Schlather, Stuart Lipsitz, Jonathan Tawn, Siddhartha Chib, Alexander Tsodikov. The second part consists of 153 nodes (40% of the total). The high-degree nodes include Yi Lin, Dan Yu Lin, Ji Zhu, Helen Zhang, L J Wei, Wei Biao Wu, Donglin Zeng, Zhiliang Ying, David Dunson, Steve Marron, Anastasios Tsiatis, Wenbin Lu, Zhezhen Jin, Xiaotong Shen, Heping Zhang, Lu Tian, Jianwen Cai, Wing Hung Wong. The results are quite reasonable: many nodes in the second part (e.g., Dan Yu Lin, David Dunson, Helen Zhang, Steve Marron, Ji Zhu, Xiaotong Shen, Yi Lin) either have worked in or have strong ties to the area of variable selection.

Last, the “High Dimensional Data Analysis” community in Coauthorship network (B) has three parts. The first part has 459 nodes (25%), including high-degree nodes such as Raymond Carroll, Gareth Roberts, Naisyin Wang, Adrian Raftery, Omiros Papaspiliopoulos, David Ruppert, Tilmann Gneiting, Jeffrey Morris, Michael Stein, Ciprian Crainiceanu, Marc Genton, Nicolas Chopin, Alan Welsh, Anthony OHagan, Fadoua Balabdaoui, N Reid. The second part has 840 nodes (46%), including high-degree nodes such as Jianqing Fan, Hui Zou, Peter Hall, Nicolai Meinshausen, Peter Buhlmann, Ming Yuan, Runze Li, Peter Bickel, Trevor Hastie, Hans-Georg Muller, Emmanuel Candes, Cun-Hui Zhang, Heng Peng, Jian Huang, Tony Cai, Terence Tao, Jianhua Huang, Alexandre Tsybakov, Jonathan Taylor, Xihong Lin.
The third part has 221 nodes (12%), including high-degree nodes such as Iain Johnstone, Larry Wasserman, Bradley Efron, John Storey, Christopher Genovese, David Donoho, Yoav Benjamini, David Siegmund, Peter Muller, Jiashun Jin, Felix Abramovich, David Cox, Daniel Yekutieli. Respectively, the three parts are labeled as subsets of the “Spatial and Semi-parametric/Non-parametric Statistics”, “Variable Selection”, and “Large-Scale Multiple Testing” communities in the Citation network. This seems convincing: (a) most of the nodes in the first part have a strong interest in spatial statistics or biostatistics (e.g., Ciprian Crainiceanu, Naisyin Wang, Raymond Carroll), (b) most of the nodes in the second part are leaders in variable selection, and (c) most nodes in the third part are leaders in Large-Scale Multiple Testing and in the topic of control of FDR.

Viewing the table horizontally gives similar claims but also reveals some additional insight. For example, “Large-Scale Multiple Testing” contains three main parts. One part consists of 221 nodes and is a subset of the “High Dimensional Data Analysis” community in Coauthorship network (B). The second consists of 115 nodes and falls outside the giant component of Coauthorship network (B). A significant fraction of nodes in this part are from Germany and have close ties to Helmut Finner, a leading researcher in Multiple Testing. Another significant part (17 nodes) consists of researchers in Bioinformatics (e.g., Terry Speed) who did not publish many papers in these four journals during the time period.

Table 10. Sizes of the intersections of the communities identified by D-SCORE ($K = 3$) in the Citation network (rows; “other” stands for nodes outside the weakly connected giant component) and the communities identified by SCORE in Coauthorship network (B) (columns; “other” stands for nodes outside the giant component).
*: 14 and 17 are in the “Non-parametric Spatial” and “Semi-parametric/Non-parametric” sub-communities of the “Spatial and Semi-parametric/Non-parametric Statistics” community, respectively.

                  Obj. Bayes  Biostat-Coau-B  HDDA-Coau-B  other  Total
Spatial           35*         156             459          360    1010
Var. Selection    16          153             840          276    1285
Multiple Tests    6           17              221          115    359
other             7           62              291          593    953
Total             64          388             1811         1344   3607

4.2.3. Comparison of D-SCORE and LNSC. We have also applied LNSC to the Citation network, with $K = 3$. The communities are very different from those identified by D-SCORE, and may be interpreted as follows.
• “Semi-parametric and non-parametric” (434 nodes). We find this community hard to interpret, but it could be the community of researchers on semi-parametric and non-parametric models, functional estimation, etc. The hub nodes include (sorted in descending order by the number of citers; same below) Peter Hall, Raymond Carroll, Hans-Georg Muller, Xihong Lin, Fang Yao, Naisyin Wang, Marina Vannucci, David Ruppert, Gerda Claeskens, Wolfgang Hardle, Jeffrey Morris, Enno Mammen, Ciprian Crainiceanu, James Robins, Anastasios Tsiatis, Catherine Sugar, Zhezhen Jin, Alan Welsh, Sunil Rao, Philip Brown.
• “High Dimensional Data Analysis” (HDDA-Cita-LNSC) (614 nodes). This one can be interpreted as the “High Dimensional Data Analysis” community, where the high-degree nodes include Jianqing Fan, Hui Zou, Nicolai Meinshausen, Peter Buhlmann, Ming Yuan, Yi Lin, Iain Johnstone, Runze Li, Peter Bickel, Trevor Hastie, Larry Wasserman, Emmanuel Candes, Cun-Hui Zhang, Heng Peng, Bradley Efron, John Storey, Jian Huang, Tony Cai, Christopher Genovese, Terence Tao.
• “Biostatistics” (Biostat-Cita-LNSC) (1605 nodes). The community is hard to interpret and includes researchers from several different areas.
For example, it includes researchers in biostatistics (e.g., Joseph Ibrahim, L J Wei), in nonparametric (Bayes) methods (e.g., Peter Muller, David Dunson, Nils Hjort, Fernando Quintana, Omiros Papaspiliopoulos), and in spatial statistics and uncertainty quantification (e.g., Marc Genton, Tilmann Gneiting, Michael Stein, Hao Zhang).

These results are rather inconsistent with those obtained by D-SCORE: the ARI and VI between the two vectors of predicted community labels by LNSC and D-SCORE are 0.07 and 1.68, respectively. Moreover, it seems that
• LNSC merges part of the nodes in the “Variable Selection” (1285 nodes) and “Large-Scale Multiple Testing” (359 nodes) communities identified by D-SCORE into a new HDDA-Cita-LNSC community, but with a much smaller size (614 nodes).
• The Biostat-Cita-LNSC community (1605 nodes) is much larger than the “Spatial” community identified by D-SCORE (1010 nodes), and hard to interpret.
Our observations here agree to some extent with [27] that LNSC cannot properly distinguish the directions of the edges and cannot detect communities representing directionality patterns among the nodes.

5. Discussions. We have collected, cleaned, and analyzed two network data sets: the Coauthorship network and the Citation network for statisticians. We investigate the network centrality and community structures with an array of different tools, ranging from Exploratory Data Analysis (EDA) [40] tools to rather sophisticated methods. Some of these tools are relatively recent (e.g., SCORE, NSC, BCPL, APL, LNSC), and some are even new (e.g., D-SCORE for directed networks). We have also presented an array of interesting results. For example, we identified the “hot” authors and papers, and about 15 meaningful communities such as “Spatial Statistics”, “Dimension Reduction”, “Large-Scale Multiple Testing”, “Objective Bayes”, “Quantile Regression”, “Theoretical Machine Learning”, and “Variable Selection”.

The paper also has several limitations that need further exploration. First of all, constrained by time and resources, the two data sets we collected are limited to the papers published in four “core” statistical journals: AoS, Biometrika, JASA, and JRSS-B, in the 10-year period from 2003 to 2012. We recognize that many statisticians publish not only in so-called “core” statistical journals but also in a wide variety of journals of other scientific disciplines, including but not limited to Nature, Science, PNAS, IEEE journals, and journals in computer science, cosmology and astronomy, economics and finance, probability, and social sciences. We also recognize that many statisticians (even very good ones, such as David Donoho and Steven Fienberg) did not publish often in these journals in this specific time period. For these reasons, some of the results presented in this paper may be biased, and they need to be interpreted with caution.

Still, the two data sets and the results we presented here serve well for our purpose of understanding many aspects of the networks of statisticians who have the USA as their home base; see Section 1.3. They also serve as a good starting point for a much more ambitious project on social networks for statisticians with a more “complete” data set for statistical publications.

Second, for reasons of space, we have primarily focused on data analysis in this paper, and the discussions on models, theory, and methods have been kept as brief as we can. On the other hand, the data sets provide a fertile ground for modeling and development of methods and theory, and there is an array of interesting problems worthy of exploration in the near future.
For example, what could be a better model for either of the two data sets, what could be a better measure of centrality, and what could be a better method for community detection? In particular, we propose D-SCORE as a new community detection method for directed networks, but we only present the idea underlying the method, without careful analysis. We address the latter in a forthcoming paper [23]. Also, sometimes, the community detection results by different methods (e.g., SCORE, D-SCORE, NSC, BCPL, APL, LNSC) are inconsistent with each other. When this happens, it is hard to have a conclusive comparison or interpretation. In light of this, it is of great interest to set up a theoretical framework and use it to investigate the weaknesses and strengths of these methods.

Third, there are many other interesting problems we have not addressed here: the issue of mixed membership, link prediction, the relationship between citations and recognition (e.g., receiving an important award, or being elected to the National Academy of Sciences), and the relationship and differences between “important work”, “influential work”, and “popular work”. It is of interest to explore these in the future.

Last but not least, coauthorship and citation networks only provide limited information for studying the research habits, trends, topological patterns, etc. of the statistical community. There are more informative approaches (say, using other information of the paper: abstract, author affiliations, key words, or even the whole paper) to studying such characteristics. Such a study is beyond the scope of the paper, so we leave it to the future.

6. Appendix I: Productivity, patterns and trends. In this section, we report our findings on three interconnected aspects: productivity, coauthor patterns and trends, and citation patterns and trends.

6.1. Productivity.
Overall, there are 3248 papers and 3607 authors in the data set, suggesting an average of 0.90 papers per author. It is of interest to investigate how the productivity evolves over the years. In Figure 13, we present the total number of papers published in each year (left panel) and the average number of papers per author in each year (right panel), i.e., the ratio of the total number of papers published that year to the total number of authors who published at least once that year. (This may seem inconsistent with the overall mean of 0.90, but the difference arises because the authors in different years largely overlap with each other.) It is interesting to note that over the 10-year period, the number of papers published each year has been increasing, but the average number of papers per author has been decreasing (a drop of about 18% in ten years). Possible explanations include:
• More collaborative. Collaboration between authors has been increasing.
• More competitive. Statistics has become a more competitive area, and there are more people who enter the area than who leave the area. Also, it becomes increasingly more difficult to publish in these 4 journals (which are viewed by many as top journals in statistics).
Note that it could also be the case that the productivity does not change much, but statisticians are publishing in a wider range of journals, and more younger ones have started making substantial contributions to the field.

Fig 13. Left: total number of papers published each year from 2003 to 2012 (for the year 2012, we have only data for the first half). Right: the ratio between the number of papers published in each year and the number of authors who have published in the same year.

We also present the distribution of the number of papers per author. For any $K$-author paper, $K \geq 1$, we have two different ways to count each coauthor's contribution to this particular paper, either non-divided or divided.
• Non-divided. We count every coauthor as having published one paper.
• Divided. We count every coauthor as having published $1/K$ paper.
Both approaches have their virtues and disadvantages. The first way may cause substantial “inflation” in counting, and the second way may be insufficient, especially since for many papers, there are one or more “leading authors” who contribute most of the work.

Following the first approach, we have the left panel of Figure 14, where the $x$-axis is the number of papers, and the $y$-axis is the proportion of authors who have written more than a certain number of papers. Approximately, the curve looks like a straight line, especially in the right tail. This suggests that the distribution of the number of papers has a power-law tail.

Fig 14. Left: the proportion of authors who have written more than a certain number of papers (for a better view, both axes are evenly spaced on the logarithmic scale). Right: the Lorenz curve for the number of papers by each author with divided contributions.

Following the second approach, we present the Lorenz curve [36] of the number of papers by each author (where for a $K$-author paper, each author is counted as having $1/K$ paper) in the right panel of Figure 14, which suggests the distribution does not have a power-law tail but is still very skewed. The figure shows that the top 10% most prolific authors contribute 41% of the papers, and the top 20% most prolific authors contribute 58% of the papers.
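The two counting schemes and the "top 10% of authors contribute 41% of the papers" style of summary can be sketched in a few lines. This is an illustrative sketch only: the function names `papers_per_author` and `top_share`, the rounding rule used to pick the top group, and the toy paper list are assumptions, not part of the original analysis.

```python
from collections import defaultdict


def papers_per_author(papers, divided=False):
    """Return a dict mapping each author to a paper count.

    `papers` is a list of author lists, one list per paper, each with at
    least one author. With divided=False every coauthor gets credit 1;
    with divided=True every coauthor of a K-author paper gets credit 1/K.
    """
    counts = defaultdict(float)
    for authors in papers:
        weight = 1.0 / len(authors) if divided else 1.0
        for author in authors:
            counts[author] += weight
    return dict(counts)


def top_share(counts, fraction):
    """Share of the total credit held by the top `fraction` of authors.

    The size of the top group is rounded to the nearest whole author
    (at least one); this rounding rule is an assumption of the sketch.
    """
    values = sorted(counts.values(), reverse=True)
    k = max(1, round(fraction * len(values)))
    return sum(values[:k]) / sum(values)


# Hypothetical toy data: four papers, five authors.
papers = [["A"], ["A", "B"], ["A", "B", "C", "D"], ["E"]]
non_divided = papers_per_author(papers)
divided = papers_per_author(papers, divided=True)
share = top_share(divided, 0.2)
print(non_divided, divided, share)
```

With divided counting the credits sum to the number of papers, so the top-share values are directly the points of the Lorenz curve read from the right-hand end.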
Our findings are similar to those in [29] for the physics community.

The Gini coefficient [15] is a well-known measure of dispersion for a distribution. For our data set, the Gini coefficient for the distribution of the number of papers by different authors is 0.51, which is much smaller than the Gini coefficient of 0.70 associated with the physics community [29]. This seems to suggest that the published papers are more evenly distributed among authors in the statistics community than in the physics community. Another possible explanation is that the data set in [29] is based on all published papers in physics spanning more than 100 years, while our data set is based on four journals in statistics for a 10-year period. It is expected that in the latter, the distribution of the number of papers by different authors (with divided contributions) is less dispersed. It is interesting to note that the Gini coefficient of the income inequality for the USA in the year 2011 is 0.48, which is slightly smaller than 0.51.

6.2. Coauthor patterns and trends. In the Coauthorship network, the degree of a node is also the number of coauthors of the node. The degrees range from 0 to 65, where Peter Hall (65), Raymond Carroll (55), Joseph Ibrahim (41), Jianqing Fan (38), and David Dunson (32) are the ones with the highest degrees (and so they are the most collaborative authors). Also, 154 authors have degree 0, and 913 authors have degree 1. The degree distribution is shown in Figure 15 (left panel), suggesting a power-law tail.

It is of interest to investigate how the number of coauthors changes over time. In Figure 15 (right panel), we present the average number of coauthors in each of the 10 years (for each year, we consider only the authors who published in these journals). It is seen that overall the average number of coauthors is steadily increasing.
Again, this suggests that the statistics community has become increasingly more collaborative.

Many social networks are transitive (e.g., a friend of a friend is likely to be a friend) [41]. For the coauthorship network based on our data sets, the transitivity is 0.32, compared to 0.066 for the biology community, 0.15 for the mathematics community, and 0.43 for the physics community [34]. For real-world social networks, the usual range of transitivity is between 0.3 and 0.6 [36], suggesting that the Coauthorship network is moderately transitive.

Fig 15. Left: the proportion of authors with more than a given number of coauthors (for a better view, both axes are evenly spaced on the logarithmic scale). Right: the average number of coauthors for all authors who have published in these journals that year.

6.3. Citation patterns and trends. For the 3248 papers (3607 authors) in our data sets, the average number of citations per paper is 1.76, which is significantly lower than the Impact Factor (IF) of these journals. Based on ISI 2010, the IFs for AoS, JRSS-B, JASA, and Biometrika are 3.84, 3.73, 3.22, and 1.94, respectively. This is largely because we count only the citations between papers in these 4 journals in a 10-year period. Among these papers, (a) 1693 (52%) are not cited by any other paper in the data set, (b) 1450 (45%) do not cite any other paper in the data set, and (c) 778 (24%) neither cite nor are cited by any other papers in the data sets.

The distribution of the in-degree (the number of citations received by each paper) is highly skewed. The top 10% highly cited papers receive about 60% of all citation counts, while the top 20% receive about 80% of all citation counts. The Gini coefficient is 0.77 [15], suggesting that the in-degree is highly dispersed.
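For readers who want to reproduce a dispersion summary of this kind, the following is a minimal sketch of one common sample formula for the Gini coefficient. The text does not say which estimator was used, so the formula below (no small-sample correction) and the function name `gini` are assumptions made for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Uses G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n, where
    x_(1) <= ... <= x_(n) are the sorted values. G is 0 when all values
    are equal and (n - 1) / n when one unit holds the whole total.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical in-degree sequences, not taken from the data set.
equal = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 1])
print(equal, concentrated)
```

Applied to the in-degrees of all papers (including the papers that are never cited), a function like this gives a single number that can be compared across data sets, subject to the caveat about the choice of estimator.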
The Lorenz curve [36] is shown in Figure 16 (left panel), confirming that the distribution of the in-degrees is highly skewed.

Fig 16. Left: the Lorenz curve for the number of citations received by each paper. Right: the proportions of self-citations (red circles), coauthor citations (green triangles), and distant citations (blue rectangles) for each two-year block.

We also observe some very interesting patterns. First, authors tend to return the favor of a citation, especially if it is from a coauthor. The proportion of (either earlier or later) reciprocation among coauthor citations is 79%, while that among distant citations is 25%.

In Figure 16 (right panel), we show that over the 10-year period, (a) the proportion of self-citations has been slowly decreasing, (b) the proportion of citations from a coauthor remains roughly the same, and (c) the proportion of distant citations (citations that are not from oneself or a coauthor) has been slowly increasing. The last item is a little unexpected, but it probably makes sense in that over the years, the publications have become increasingly more accessible online and communications have become increasingly easier and more efficient. That the blue curve and the red curve cross each other on the left is probably due to the “boundary effect”: for papers published in 2003 (say), most of the papers they have cited were probably published earlier than 2002, and so are not included in our data sets. Below, we show that the mean delay of citation is about 3 years. For this reason, the “boundary effect” is probably negligible in the later half of the time period. Note that the overall proportions for self-citations, coauthor citations, and distant citations are 27%, 9%, and 64%, respectively.
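The three citation types can be made concrete with a small sketch. It works at the author level and treats two authors as coauthors if they share at least one paper in the data; whether the original analysis used exactly this rule (for instance, whether coauthorship must precede the citation) is not stated here, so the rule, the function names, and the toy data are all assumptions.

```python
from itertools import combinations


def coauthor_pairs(papers):
    """Return the set of unordered author pairs who share at least one paper.

    `papers` maps a paper identifier to its list of authors. Each pair is
    stored as a tuple in sorted order.
    """
    pairs = set()
    for authors in papers.values():
        for a, b in combinations(sorted(set(authors)), 2):
            pairs.add((a, b))
    return pairs


def classify_citation(citer, citee, pairs):
    """Label an author-to-author citation as self, coauthor, or distant."""
    if citer == citee:
        return "self"
    key = (citer, citee) if citer < citee else (citee, citer)
    return "coauthor" if key in pairs else "distant"


# Hypothetical toy data.
papers = {"p1": ["X", "Y"], "p2": ["Z"]}
pairs = coauthor_pairs(papers)
labels = [
    classify_citation("X", "X", pairs),
    classify_citation("Y", "X", pairs),
    classify_citation("X", "Z", pairs),
]
print(labels)
```

Tallying these labels within each two-year block would give proportions of the kind plotted in the right panel of Figure 16.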
The data set also confirms a reasonable delay in citations, despite the fact that most papers appear online (e.g., on personal websites, arXiv, or department archives) much earlier than the time when the paper is published. The overall mean delay (i.e., the average difference between the years of publication of a new paper and of the papers it cites) is 3.30 years, and the mean delays for self-citations, coauthor citations, and distant citations are 2.81, 3.36, and 3.51 years, respectively, suggesting that authors cite their own or their coauthors' work more quickly than that of others.

7. Appendix II: Data collection and cleaning. In this section, we describe how the data were collected and preprocessed, and how we have overcome the challenges we have faced.

We focus on all papers published in AoS, JASA, JRSS-B, and Biometrika from 2003 to the first half of 2012. For each paper in this range, we have extracted the Digital Object Identifier (DOI), title, information for the authors, abstract, keywords, journal name, volume, issue, and page numbers, and the DOIs of the papers in the same range that have cited this paper. The raw data set consists of about 3500 papers and 4000 authors.

Among these papers, we are only interested in those for original research, so we have removed items such as book reviews, errata, comments, and rejoinders. Usually, these items contain signal words such as “Book Review”, “Corrections”, etc. in the title. Removing such items leaves us with a total of 3248 papers (about 3950 authors) in the range of interest.

Our data collection process has three main steps. In the first step, we identify all papers in the range of interest. In the second step, we figure out all citations between the papers of interest (note that the information for the citation relationship between any two authors is not directly available).
In the third step, we identify all the authors for each paper.

In the first step, recall that the goal is to identify every paper in our range of interest and, for each of them, to collect the title, authors, DOI, keywords, abstract, journal name, etc. In this step, we face two main challenges. First, all popular online resources have strict limits on high-quality, high-volume downloads; we have explained this in detail in Section 1.2. Eventually, we managed to overcome the challenge by downloading the desired data and information from Web of Science and MathSciNet little by little, each time in the maximum amount that is allowed. Overall, it has taken us a few months to download and combine the data from the two sources. Second, it is hard to find a good identifier for the papers. While the titles of the papers could serve as unique identifiers, they are difficult to format and compare. Also, while many online resources have their own paper identifiers, they are either unavailable or unusable for our purpose. Eventually, we decided to use the DOI as the identifier. The DOI has been used as a unique identifier for papers by most publishers of statistical papers since 2000. Using the DOI as the identifier, with substantial time and effort, we have successfully identified all papers in the range of interest with Web of Science and MathSciNet. One more difficulty we face here is that Web of Science does not have the DOIs of about 200 papers and MathSciNet does not have the DOIs of about 100 papers, so we have to combine these two online sources to locate the DOI for each paper in our range of interest.

We now discuss the second step. The goal is to figure out the citation relationship between any two papers in the range of interest.
MathSciNet does not allow automated downloads of such information but, fortunately, such information is retrievable from Web of Science if we parse the XML pages in R, a small amount each time. One issue we encounter in this step is that (as mentioned above) Web of Science misses the DOIs of about 200 papers, and we have to deal with these papers with extra effort.

Consider the last step. The goal is to uniquely identify all authors for each paper in the range of interest. This is the most time-consuming step, and we have faced many challenges. First, for many papers published in Biometrika, we do not have the first name and middle initial for each author, and this causes problems. For instance, "L. Wang" can be any one of "Lan Wang", "Li Wang", "Lianming Wang", etc. Second, the name of an author is not listed consistently on different occasions. For example, "Lixing Zhu" may also be listed as "Li Xing Zhu", "L. X. Zhu", and "Li-Xing Zhu". Last but not least, different authors may have the same name: at least three authors (from Univ. of California at Riverside, Univ. of Michigan at Ann Arbor and Iowa State Univ., respectively) have the same name of "Jun Li".

Note that every service has its own internal identification system but, unfortunately, none of them is willing to reveal the system to the end users. Also, people have been trying hard to create a universal author identification system, in a similar spirit to that of using the DOI as a universal identifier for each paper. Among these are ResearcherID, introduced by Thomson Reuters in 2008, and Open Researcher and Contributor ID (ORCID), introduced in 2012. However, the use of such systems is still very limited.

Eventually, we have to solve the problem on our own.
First, roughly speaking, we have written a program which mostly uses the author names (e.g., first, middle, and last names; abbreviations) to correctly identify all but approximately 200 authors, whose identification remains ambiguous. We then manually identify each of these 200 authors using additional information (e.g., affiliations, email addresses, information on their websites). After all such cleaning, the number of authors is reduced from about 3950 to 3607.

For reproducibility purposes, we have prepared the data files and a demo for readers who are interested in exploring the data sets. All these can be found at www.stat.uga.edu/~psji/ once the paper is accepted for publication. In particular, the data files include the following.

• 4Journals.bib : the raw bibtex data for about 3500 items including papers, book reviews, corrections, etc.
• 4Journals-cleaned.bib : the cleaned bibtex data for 3248 papers after removing the book reviews and corrections and clustering the author names
• author-cluster.txt : the final clustering rules for the author names
• author-cluster-man.txt : the manually defined clustering rules for the author names
• author-list.txt : the list of the 3607 authors after disambiguation
• author-paper-adjacency.txt : the 3607x3248 bipartite adjacency matrix
• coauthor-adjacency.txt : the 3607x3607 coauthor adjacency matrix
• citation-adjacency.txt : the 3607x3607 citation adjacency matrix

Acknowledgements. JJ thanks David Donoho and Jianqing Fan; the paper was inspired by a lunch conversation with them in 2011 on H-index. The authors thank Stephen Fienberg, Qunhua Li, Douglas Nychka, and Yunpeng Zhao for helpful pointers.

References.

[1] Amini, A., Chen, A., Bickel, P. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097–2122.
[2] Arenas, A., Duch, J., Fernandez, A. and Gomez, S.
(2007). Size reduction of complex networks preserving modularity. New J. Phys. 9(6) 176.
[3] Bang-Jensen, J. and Gutin, G. (2009). Digraphs: Theory, Algorithms and Applications. Springer.
[4] Barabasi, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286 509–512.
[5] Bickel, P. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Nat. Acad. Sci. 106 21068–21073.
[6] Bickel, P. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
[7] Bickel, P. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577–2604.
[8] Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann. Statist. 35 2313–2351.
[9] Chen, S., Donoho, D. and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
[10] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
[11] Fan, J. and Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Amer. Statist. Assoc. 99 710–723. MR2090905 (2005d:62053)
[12] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. B 70 849–911. MR2530322
[13] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961. MR2065194 (2005g:62047)
[14] Freeman, L., Borgatti, S. and White, D. (1991). Centrality in valued graphs: A measure of betweenness based on network flow. Soc. Networks 13 141–154.
[15] Gini, C. (1936). On the measure of concentration with special reference to income and statistics. Colorado College Publication, General Series 208 73–79.
[16] Goldenberg, A., Zheng, A., Fienberg, S.
and Airoldi, E. (2009). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.
[17] Grossman, J. (2002). The evolution of the mathematical research collaboration graph. Congressus Numerantium 158 201–212.
[18] Huang, J., Horowitz, J. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587–613.
[19] Huang, J., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 85–98.
[20] Hubert, L. and Arabie, P. (1985). Comparing partitions. J. Classif. 2 193–218.
[21] Hunter, D. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617–1642. MR2166557
[22] Ioannidis, J. (2008). Measuring co-authorship and networking-adjusted scientific impact. PLOS ONE 3.
[23] Ji, P., Jin, J. and Ke, Z. (2014). Joint community detection for Coauthorship and Citation networks of statisticians by D-SCORE. Manuscript.
[24] Jin, J. (2014). Fast community detection by SCORE. Ann. Statist. To appear.
[25] Johnstone, I. and Silverman, B. (2005). Empirical Bayes selection of wavelet thresholds. Ann. Statist. 33 1700–1752. MR2166560
[26] Karrer, B. and Newman, M. (2011). Stochastic blockmodels and community structures in network. Phys. Rev. 83 1436–1462.
[27] Kim, Y., Son, S.-W. and Jeong, H. (2010). Finding communities in directed networks. Phys. Rev. E 81 016103.
[28] Leicht, E. and Newman, M. (2008). Community structure in directed networks. Phys. Rev. Lett. 100 118703.
[29] Martin, T., Ball, B., Karrer, B. and Newman, M. (2013). Coauthorship and citation patterns in the Physical Review. Phys. Rev. E 88.
[30] Meila, M. (2003). Comparing clusterings by the variation of information.
In Learning Theory and Kernel Machines: 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop (B. Schölkopf and M. K. Warmuth, eds.) Springer.
[31] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
[32] Newman, M. (2001). The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 98 404–409.
[33] Newman, M. (2001). Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64 016131.
[34] Newman, M. (2004). Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. USA 101 5200–5205.
[35] Newman, M. (2006). Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103 8577–8582.
[36] Newman, M. (2010). Networks: an introduction. Oxford University Press.
[37] Sabidussi, G. (1966). The centrality index of a graph. Psychometrika 31 581–683.
[38] Storey, J. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist. 31 2013–2035. MR2036398 (2004k:62055)
[39] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
[40] Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley.
[41] Wasserman, S. (1994). Social network analysis: methods and applications 8. Cambridge University Press.
[42] Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266–2292.
[43] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
[44] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320. MR2137327
[45] Zou, H. and Li, R. (2008).
One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1509–1533. MR2435443 (2010a:62222)

Pengsheng Ji
Department of Statistics
University of Georgia
Athens, GA 30602
E-mail: psji@uga.edu

Jiashun Jin
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
E-mail: jiashun@stat.cmu.edu