Network Analysis with the Enron Email Corpus

J. S. Hardin, G. Sarkis, P. C. URC
Pomona College

Key Words: Computational Statistics; Data Science; Research with Undergraduates

Abstract

We use the Enron email corpus to study relationships in a network by applying six different measures of centrality. Our results came out of an in-semester undergraduate research seminar. The Enron corpus is well suited to statistical analyses at all levels of undergraduate education. Through this article's focus on centrality, students can explore the dependence of statistical models on initial assumptions and the interplay between centrality measures and hierarchical ranking, and they can use completed studies as springboards for future research. The Enron corpus also presents opportunities for research into many other areas of analysis, including social networks, clustering, and natural language processing.

1 Introduction

One of the most infamous corporate scandals of the past few decades curiously left in its wake one of the most valuable publicly available datasets. In late 2001, the Enron Corporation's accounting obfuscation and fraud led to the bankruptcy of the large energy company. The Federal Energy Regulatory Commission subpoenaed all of Enron's email records as part of the ensuing investigation. Over the following two years, the commission released, unreleased, and rereleased the email corpus to the public after deleting emails that contained personal information like social security numbers. The Enron corpus contains emails whose subjects range from weekend vacation planning to political strategy talking points, and it remains the only large example of real world email datasets available for research.
See FERC (2013) for the Federal Energy Regulatory Commission's website on the Enron investigation, FERC (2003) for the final order releasing the data to the public, and McLean and Elkind (2013) for a popular account of the Enron scandal.

Research into the corpus is prolific and wide ranging. We present here a selection from the large range of publications on Enron to highlight some of the research that the corpus has spurred, and to suggest possible further directions as well. See Shetty and Adibi (2004) for a technical report describing a MySQL database of the corpus, Wang et al. (2014) for anomaly detection in a dynamic network, Diesner et al. (2005) for a social network analysis that focused on changes in behavior during the scandal period, Deitrick et al. (2012) for a neural networks model predicting the gender of an emailer based on the email stream, Peterson et al. (2011) for measures of formality in the email correspondence, Chapanond et al. (2005) for a graph-theoretic and spectral analysis that overlaps with many of the topics of interest in our article, Martin et al. (2005) for detection of abnormal email activity in outgoing messages, Zhou et al. (2006) for a probabilistic approach to community detection, and Zhou et al. (2007) for data cleaning with focus on email aliases.

1.1 Network Analysis

The network of communication between Enron employees naturally induces a graph whose nodes are labeled by employees and whose edges correspond to email communication. We weight the edge between the two nodes by the number of emails sent. Additionally, we use directionality to separately analyze emails sent or emails received, when appropriate.

Networks are ubiquitous in the internet age, underlying much of virtual (and real) life from social webs to recommender systems, and from epidemiological spread to linguistic evolution.
They are used widely as tools of research to study the behavior of individuals and of systems: in sociology (patterns of Tweets during a natural disaster; Sutton et al., 2014), biology (coordinated behavior of harvester ants; Pinter-Wollman et al., 2011), genetics (co-expressed gene groups in brain cancer; Zhang and Horvath, 2005), and economics (economic value of a social network in a large online marketplace; Stephen and Toubia, 2010).

We discuss six measures of the Enron corpus based on the adjacency matrices of the email network, and we suggest how they can be used in undergraduate education and research. We also provide a brief analysis of the group membership of the most connected cliques, found by hierarchical clustering. Our results and methods came out of an undergraduate research circle at Pomona College that we oversaw during the spring semester of 2014. The research circle consisted of four students whose interests and initiative determined the research questions and research direction, and two math/stats faculty members who provided general and technical guidance.

See Section 2 for details on how the matrices were constructed, Section 3 for the research questions and some of the results, and Section 4 for a survey of the centrality measures. In Sections 5 and 6, we suggest ways that our research project can be incorporated into the undergraduate curriculum.

2 Dataset Story: Clean up and Processing

The narrative aspect of many datasets in both pedagogy and research includes a major data-collection component. Even in classroom examples where the data, or a summary thereof, is given to the students, there often exists a contextual story about how and why the data might have been collected for the immediate purpose of the statistical analysis. The Enron corpus, on the other hand, is for all intents and purposes an accidental, incidental dataset.
This presents an invaluable opportunity to discuss real-world data issues that do not often come up in the classroom. Specifically, real-world data is often dirtier and less cooperative than experimental data. It is not structured with a specific goal in mind; it is what it is. Therefore, getting it to the tidy stage where analysis may be conducted and meaning may be extracted involves several assumptive and simplifying decisions that require thoughtful analysis before the fact (see, for example, Hadley Wickham's work on the vital aspect of tidying data (Wickham, 2014)). Additionally, the Enron dataset is clearly observational and provides much fodder for a classroom discussion on the limits of inferences done on observational data.

For our project, we used the dataset available at https://s3.amazonaws.com/metanautix/enron/enron_mail_20110402_csv.tgz, whose emails were organized into 150 mailboxes labeled by employee name; the emails in a mailbox were not necessarily sent by that person. Additionally, some employees with similar names were binned into the same mailbox, while others had their messages split among two mailboxes. In order to circumvent such potential binning errors, we ignored the folder designation and instead extracted only From, To, and CC fields of each email message. While only one employee may appear in either the From or To fields (which is different from most current email systems), an arbitrary number may appear in the CC field. We considered only senders and recipients with email addresses that have an enron.com domain name. To distinguish between the individuals, we relied on six standard aliases used at Enron (see Zhou et al. (2007) for instance). The result was 156 employees whose email communication we considered, and from which we constructed an adjacency matrix for the weighted directed graph of Enron employees, as visualized in Figure 1.
The dots in the figure are colored so that the darker the color of the point at the $(i, j)$ entry, the more emails were sent from person $i$ to person $j$.

[Figure 1 image not reproduced. Axis labels on both rows and columns: Jeff Dasovich, Louise Kitchen, Kenneth Lay. Color scale: # of emails (log2 scale), 0 to 10.]

Figure 1: Each $(i, j)$th dot represents a binary indicator that an email was sent from person $i$ to person $j$. White indicates no communication, while the darker the color, the more communication between person $i$ and person $j$. Note that for the rows, the $i$th individual counts up from the bottom. E.g., Jeff Dasovich is the 21st column and the 21st row counting from the bottom.

Let $E_{ij}$ be the set of emails for which Enron employee $i$ appears in the From field and employee $j$ appears in the To field. Let $C_{ij}$ be the set of emails for which Enron employee $i$ appears in the From field and employee $j$ appears in the CC field. For each $c \in C_{ij}$, let $n_c$ be the number of names that appear in the CC field of $c$. Define the $156 \times 156$ weighted adjacency matrix $M$ as:

$$m_{ij} = |E_{ij}| + \sum_{c \in C_{ij}} \frac{1}{\sqrt{1 + n_c}} \qquad (1)$$

Thus, for the weighting of the edge in the directed graph from employee $i$ to employee $j$, each email sent from $i$ to $j$ contributed 1, and each email $c$ sent from $i$ on which $j$ was cc-ed contributed $1/\sqrt{1 + n_c}$. We considered the contribution of cc-ed emails to be less important than emails sent directly; in asking "how many cc-ed emails is one direct email worth?", we arrived at a square-root relationship. Our answer came out of a discussion with the students; different groups may reach different conclusions regarding the appropriate weighting. This is a valuable opportunity for the students to explicitly consider the consequences of the assumptions they make.

The matrix $M$ is the weighted adjacency matrix of a directed graph.
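
The weighting in equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MySQL pipeline used for the project: the function name `weighted_adjacency`, the `(From, To, CC list)` tuple layout, and the toy data are all hypothetical.

```python
import math

def weighted_adjacency(emails, employees):
    """Build M with m_ij = |E_ij| + sum over c in C_ij of 1 / sqrt(1 + n_c)."""
    index = {name: k for k, name in enumerate(employees)}
    n = len(employees)
    M = [[0.0] * n for _ in range(n)]
    for sender, to, cc in emails:
        if sender not in index:
            continue  # keep only senders from the employee list
        i = index[sender]
        if to in index:
            M[i][index[to]] += 1.0  # a direct email contributes 1
        n_c = len(cc)  # number of names in the CC field of this email
        for name in cc:
            if name in index:
                M[i][index[name]] += 1.0 / math.sqrt(1 + n_c)
    return M

# Hypothetical toy data: (From, To, list of CC names).
employees = ["a", "b", "c"]
emails = [
    ("a", "b", []),
    ("a", "b", ["c"]),
    ("b", "a", ["c", "a", "x"]),  # "x" is not a tracked employee
]
M = weighted_adjacency(emails, employees)
```

Swapping the square root for another function of `n_c` is a one-line change, which makes it easy for students to compare the consequences of different weighting assumptions.
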
To consider the undirected graph, we defined the matrix $U = M + M^T - D$, where $D$ is the diagonal matrix with $d_{ii} = 2 m_{ii}$. In other words, for the undirected graph, we did not incorporate information from emails that employees cc-ed themselves on. In subsection 6.2, Alternative Applications, we discuss choices within the data cleanup process for creating the adjacency matrix we used as well as alternative adjacency matrices.

While the network of 156 nodes is relatively small in size, its edges included more than 500,000 email messages and 18GB of data; by way of reference, we note that a movie can range from 1GB in standard definition to 6GB in Blu-ray. This made the cleanup component of our investigation a big data project. The methods of aggregating the data are outside the scope of this article; we imported the data into MySQL and used simple queries to count how many emails employee $i$ sent to employee $j$, and iterated the queries over all pairs of employees. However, it is worth noting that the construction of the $156 \times 156$ matrix took several days of computation, and that different considerations in computing the entries (as discussed below) can provide interesting alternate research routes into the data.

Our students also used D3, a JavaScript library, to visualize the network. Their work is at http://obscure-meadow-3612.herokuapp.com/ and at http://enron-network.herokuapp.com/TOM.

3 Research questions

We are interested in the social network that is defined by the emails. In particular, we investigate what kinds of information about the relative importance of the Enron employees can be read from the graph whose vertices are the employees and whose edges represent email correspondence.
To that end, we consider six measures of centrality based on the email network: degree, eigencentrality for sent emails, eigencentrality for received emails, closeness, betweenness, and topological overlap, which we discuss in more detail in Section 4 below.

There is no reason to believe that the kinds of importance rankings induced by an email connectivity graph reflect the managerial structure of the corporation itself. Indeed, rankings based on email networks point to an overlay of the activity of an individual emailer and the subnetwork of contacts that emailer has. As such, centrality measures based on an email network may help gauge the functional importance of various employees, as opposed to (or in conjunction with) their managerial importance; and different centrality measures are more adept at spotting different functionalities, which we give examples of next.

Though the scope of our project was primarily exploratory, we did observe some interesting results that may form the basis for further research directions. Consider Table 1 below, which summarizes the ranks of the top 10 employees for each of the six centrality measures.

• The overlap between the six top-ten lists is not insignificant: 29 employees make up the 60 names. Two employees appear in five of the six lists: Jeff Dasovich, the director for state government affairs, ranks highly in each but the eigencentrality measure based on received emails; and Louise Kitchen, an energy trader in the European market and COO of Enron Wholesale Services, ranks highly in all but topological overlap.
• On the other hand, 14 of the employees appear just once. Among these are Kenneth Lay, the Chairman of Enron, who ranked seventh in betweenness, and Greg Whalley, its president, who ranked eighth in closeness.
The only other board member of Enron to appear on any of the lists is Steven Kean, Vice President and Chief of Staff, who ranked sixth in degree and third in topological overlap.
• Counting multiplicity, 25 of the 60 employees who ranked in the top ten were legal counsels of some kind, either in the Enron North America Legal Department (21) or otherwise having "Counsel" in their job title (4). In other words, the email network captured the importance of the legal departments at Enron.
• Thirteen of the 29 unique employees who ranked in the top ten were women, and, counting multiplicity, 33 of the 60 were women; in comparison, 38 of the total 156 employees were women.

| Rank | Degree | EVcent | EVcentT | Closeness | Betweenness | TOM |
|---|---|---|---|---|---|---|
| 1 | Jeff Dasovich | Tana Jones | Sara Shackleton | Robert Benson | Louise Kitchen | Jeff Dasovich |
| 2 | Mike Grigsby | Sara Shackleton | Susan Bailey | Mike Grigsby | Mike Grigsby | Richard Shapiro |
| 3 | Tana Jones | Stephanie Panus | Marie Heard | Louise Kitchen | Susan Scott | Steven J. Kean |
| 4 | Sara Shackleton | Marie Heard | Tana Jones | Kevin M. Presto | Jeff Dasovich | Mike Grigsby |
| 5 | Richard Shapiro | Susan Bailey | Stephanie Panus | Susan Scott | Mary Hain | Tana Jones |
| 6 | Steven J. Kean | Kay Mann | Elizabeth Sager | Scott Neal | Sally Beck | Sara Shackleton |
| 7 | Louise Kitchen | Louise Kitchen | Jason Williams | Barry Tycholiz | Kenneth Lay | Mary Hain |
| 8 | Susan Scott | Elizabeth Sager | Louise Kitchen | Greg Whalley | Scott Neal | Marie Heard |
| 9 | Michelle Lokay | Jason Williams | Jeffrey T. Hodge | Phillip K. Allen | Kate Symes | Stephanie Panus |
| 10 | Chris Germany | Jeff Dasovich | Gerald Nemec | Jeff Dasovich | Cara Semperger | Susan Scott |

Table 1: According to each of six different measures of centrality, we provide a ranked list of the individuals who are most central to the email corpus.

4 Centrality and Rank

A measure of centrality on a graph aims to assign a ranking or magnitude to each node that captures the relative importance of that node in the context of the graph's structure.
We are interested in measuring the importance of each employee based on the number of emails sent or received, as aggregated in the dataset we extracted from the Enron corpus and summarized in the matrices $M$ and $U$. Recall that $M$ is the weighted $156 \times 156$ adjacency matrix of our directed graph, and $U$ is the corresponding weighted matrix of the undirected graph that does not distinguish between sent and received emails.

We investigate six measures of importance within the Enron employee email network: degree, eigenvector centrality for received emails, eigenvector centrality for sent emails, closeness, betweenness, and topological overlap. We give an overview of the measures below, including mathematical definitions and intuition. For each of the measures, it may be of interest for students to generate examples of nodes in a network that rank high or low in centrality. In Section 6 below, we also suggest how some of the measures may be incorporated into statistics classes of various levels.

We make two general observations about the employees who ranked highly according to the centrality measures. First, while 'Vice President,' 'Director,' and 'President' appear frequently in their titles, only three of these employees were Enron board members. The centrality rankings therefore captured a functional participation in the email network rather than managerial importance. Second, while there was nontrivial overlap between and correlation among the lists, each of the centrality measures seemed to pick out a distinctive narrative feature from among the employees.

4.1 Degree

The degree $\delta_i$ of employee $i$ is defined to equal the total number of employees to whom $i$ sent emails or from whom $i$ received emails. Thus, if we define

$$\delta_{ij} = \begin{cases} 1 & \text{if } u_{ij} \neq 0 \\ 0 & \text{if } u_{ij} = 0 \end{cases}$$

then $\delta_i = \sum_j \delta_{ij}$. We did not distinguish between whether $j$ appeared in the To or CC field.
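
As a small illustration, the degree can be computed directly from the undirected matrix $U = M + M^T - D$ of Section 2. The sketch below assumes the matrices are stored as plain lists of lists; the helper names `undirected` and `degree` and the toy matrix are hypothetical.

```python
def undirected(M):
    """U = M + M^T - D with d_ii = 2 * m_ii, so the diagonal of U is zero."""
    n = len(M)
    return [[0.0 if i == j else M[i][j] + M[j][i] for j in range(n)]
            for i in range(n)]

def degree(U):
    """delta_i = number of employees j with u_ij != 0."""
    return [sum(1 for u in row if u != 0) for row in U]

# Hypothetical 4-employee directed weight matrix.
M = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 3.0, 0.0],  # employee 2 cc-ed themselves; dropped in U
    [0.0, 0.0, 0.0, 0.0],  # employee 3 exchanged no emails
]
U = undirected(M)
deg = degree(U)
```

In this toy network employee 1 has the largest degree because it is the only node connected to two others, regardless of how many emails travel along each edge.
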
The degree is a measure of the size of $i$'s immediate network. The more different people $i$ emails, directly or by cc, or receives emails from, the greater $i$'s degree.

The top-ranked employee according to degree centrality is Jeff Dasovich, the Director of Regulatory and Government Affairs. Note that the only top-ten list Jeff Dasovich does not appear in is the transpose-eigencentrality one, suggesting that his presence in the other top-ten lists is on the strength of his emails sent rather than received. Also, the Enron departments are well represented in this list: there are seven unique departments among the ten employees. That is, there does not appear to be one department clearly more active than the others in email communications based on count alone.

Note that $\delta$ is the adjacency matrix of the unweighted, undirected graph. It may be of interest to compute the degree using the weighted and/or directed matrix instead, so that the results might be comparable to other measures below using the matrices $M$ and $U$.

4.2 Eigenvector Centrality

Denote the centrality of employee $i$ with the nonnegative real number $x_i$. Suppose $x_i$ is accumulated from the centralities $x_j$ as $j$ ranges over all employees that $i$ emails. Suppose further that employee $j$ contributes to $x_i$ in direct proportion to the connectedness from $i$ to $j$ as measured by $m_{ij}$. That is,

$$x_i = \frac{1}{\lambda} \sum_j m_{ij} x_j$$

where $\frac{1}{\lambda}$ is some proportionality constant. While the definition appears circular (the centrality $x_i$ depends on $x_j$, which in turn depends on $x_i$), we can summarize the relationships with the familiar matrix equation $M\vec{x} = \lambda\vec{x}$, where $\vec{x} = [x_1 \cdots x_{156}]^T$. In other words, $\vec{x}$ is an eigenvector of $M$ with eigenvalue $\lambda$.
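
One common way to approximate such an eigenvector numerically is power iteration: repeatedly multiply a positive starting vector by $M$ and rescale. The sketch below is an assumption about how one might compute it, not the method used in the project. For a nonnegative matrix whose graph is strongly connected and aperiodic, the iteration converges to a nonnegative eigenvector for the eigenvalue of largest absolute value. The function names and the toy matrix are hypothetical.

```python
def eigenvector_centrality(M, iterations=1000, tol=1e-12):
    """Approximate a nonnegative eigenvector of M by power iteration."""
    n = len(M)
    x = [1.0] * n
    for _ in range(iterations):
        # y = M x, i.e. y_i = sum_j m_ij x_j
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = max(y)
        if scale == 0:
            return y  # no edges at all: every centrality is zero
        y = [v / scale for v in y]  # rescale so the largest entry is 1
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

def transpose(M):
    """M^T, used for the centrality based on received emails."""
    return [list(col) for col in zip(*M)]

# Hypothetical 3-employee weight matrix (row = sender, column = recipient).
M = [
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
x_sent = eigenvector_centrality(M)
x_received = eigenvector_centrality(transpose(M))
```

Ranking the employees then amounts to sorting the indices by the entries of `x_sent` or `x_received`; the rescaling inside the loop does not affect the ranking.
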
While $M$ may have several different eigenvalues and eigenvectors, the Perron-Frobenius Theorem (see Horn and Johnson (1990, page 508) for instance) guarantees that because $m_{ij} \geq 0$, for some eigenvalue $\lambda$ with largest absolute value there exists an eigenvector $\vec{x}$ whose entries are all nonnegative; that so-called dominant eigenvector provides the importance weights for the employees.

We also consider $M^T$, the transpose of $M$, in order to analyze importance based on received emails. In that case, we compute $M^T\vec{x} = \lambda\vec{x}$ for the eigenvector of importance weights corresponding to the eigenvalue $\lambda$ with the largest absolute value.

Such a measure of importance is called eigenvector centrality. While it is commonly used with binary or stochastic matrices, its premise applies to any matrix with nonnegative entries. Arguably the most famous instance of eigenvector centrality is the first implementation of Google's PageRank algorithm: webpages rank highly in Google's search results if they are linked from other webpages of high rank. See Austin (2006) for a fun illustration of the algorithm suitable for a linear algebra class, and Brin and Page (1998) for the original paper by Google founders Sergey Brin and Lawrence Page.

An employee ranks highly in eigenvector centrality (respectively, transpose-eigenvector centrality) if that employee sends emails to (respectively, receives emails from) many other highly-ranked employees. Students can generate and discuss examples of company structure based on whether the two eigenvector centralities turn out to be highly correlated or uncorrelated: what kinds of employees might rank highly in eigenvector centrality of emails sent, but not in the transpose of emails received?

Both top-ten lists pertaining to eigenvector centrality have a high legal representation: 7 for emails sent and 8 for emails received.
So even though the legal department appears only twice in the degree top-ten list, its emails must have been sent from/to more central people as measured by the eigenvector equation. Moreover, 8 of the 10 in eigencentrality-sent and 7 of the 10 in eigencentrality-received are women. In both lists combined, there is only one legal department employee who is not a woman (Jeffrey T. Hodge, who appears in only one list), and there is only one woman who is not an employee of the legal department (Louise Kitchen, who is tied with Jeff Dasovich for appearing in the most top-ten lists, five of the six). For comparison, women make up around half of the legal department and a quarter of the total 156 employees considered.

4.3 Closeness

Given a pair of employees $i$ and $j$, a path from $i$ to $j$ is defined to be a sequence $i_0 = i, i_1, \cdots, i_r = j$ such that $m_{i_{t-1} i_t} \neq 0$ for all $1 \leq t \leq r$. A path from $i$ to $j$ is called a shortest path if it minimizes the number of steps $r$ in the sequence, and the distance from $i$ to $j$, denoted $d(i, j)$, is the number of steps in such a shortest path from $i$ to $j$. The closeness of $i$ is then defined as

$$\gamma_i = \frac{1}{\sum_{j \neq i} d(i, j)}$$

If the graph has more than one connected component (in other words, if there exists a pair of nodes that cannot be connected via a sequence of edges), then the closeness of any node equals zero. Otherwise, the closeness of $i$ measures the speed or efficiency with which information spreads out from $i$ to the rest of the graph. Note that $\gamma_i$ is sometimes normalized by the number of nodes other than $i$ so that it measures the reciprocal of the average distance from $i$: $(n - 1) / \sum_{j \neq i} d(i, j)$. However, the ranking of employees based on closeness is independent of such a normalization.

An employee has high closeness centrality if that employee's correspondence reaches a large proportion of the network quickly.
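
The closeness definition translates into a breadth-first search from each employee over the directed edges with $m_{ij} \neq 0$. The following is a minimal sketch under the assumption that $M$ is a list of lists; the function name `closeness` and the toy matrices are hypothetical. In this sketch a node that cannot reach some other node is assigned closeness zero, which corresponds to treating the missing distance as infinite.

```python
from collections import deque

def closeness(M):
    """gamma_i = 1 / sum_j d(i, j), with d counted in steps along nonzero m_ij."""
    n = len(M)
    result = []
    for source in range(n):
        dist = [None] * n  # None marks a node not reached yet
        dist[source] = 0
        queue = deque([source])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if M[i][j] != 0 and dist[j] is None:
                    dist[j] = dist[i] + 1
                    queue.append(j)
        if any(d is None for d in dist):
            result.append(0.0)  # some employee is unreachable from source
        else:
            total = sum(dist)
            result.append(1.0 / total if total > 0 else 0.0)
    return result

# Hypothetical directed toy network: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
M = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
gamma = closeness(M)

# A network in which employee 1 never emails anyone.
gamma_broken = closeness([[0, 1], [0, 0]])
```

Here employee 0 reaches both colleagues in one step and therefore has the highest closeness. The edge weights play no role in this version, which is one of the modelling choices students can revisit.
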
Thus closeness is a measure of the entire network's structure in relation to a node. It may be interesting to discuss the robustness of the closeness centrality. For instance, can an employee rise in the rankings by sending one or two carefully chosen emails?

According to the Enron graph, six of the top ten employees with respect to closeness centrality were directors or vice presidents of trading. The remaining four are Susan Scott, one of two women and the only lawyer in the group, Jeff Dasovich and Louise Kitchen, each of whom appeared in five of the six top-ten lists, as well as Greg Whalley, Enron's President.

4.4 Betweenness

For betweenness, we consider the undirected network and its adjacency matrix $U$, and ask path-related questions similar to those in closeness measures. Given a pair of employees $j$ and $k$, an undirected path from $j$ to $k$ is defined to be a sequence $i_0 = j, i_1, \cdots, i_r = k$ such that $u_{i_{t-1} i_t} \neq 0$ for all $1 \leq t \leq r$; in particular, we do not take into account the from/to directedness of the graph of Enron employees. A path from $j$ to $k$ is called a shortest undirected path if it minimizes the number of steps $r$ in the sequence.

Since our paths are weighted, we make an adjustment (Newman, 2001; Opsahl and Skvoretz, 2010) to the definition of "shortest" under the premise that a larger weight implies a closer connection between the two corresponding employees, and in that sense, a shorter path. Let $i_0, i_1, \cdots, i_r$ be a path from $i_0$ to $i_r$. Then the weighted length of the path is the sum of the reciprocals of the weights of the path's edges:

$$\sum_{1 \leq t \leq r} \frac{1}{u_{i_{t-1} i_t}}.$$

Let $\tau_{jk}$ be the number of shortest undirected paths from $j$ to $k$, and let $\tau_{jk}(i)$ be the number of paths from among those in $\tau_{jk}$ that pass through $i$.
Then the betweenness of $i$ is given by

$$\beta_i = \sum_{i \neq j \neq k} \frac{\tau_{jk}(i)}{\tau_{jk}}$$

As such, the betweenness of $i$ measures the importance of $i$ as a central node in efficient communication between other nodes in the network. An employee has high betweenness centrality if that employee figures prominently in the email proximity of many pairs of colleagues.

The directed/undirected choices for closeness/betweenness, respectively, naturally generate discussion questions about the reasons for the choices and about how the measures might differ if alternate choices were made. As well, students can investigate different modifications for the shortest path in closeness to account for weighting.

Perhaps the most noteworthy aspect of the betweenness top ten is a single appearance, and by association, a single absence. Kenneth Lay, one of the two characters most commonly associated with the Enron scandal, comes in at number 7. This is the only appearance of Kenneth Lay in any of the top ten lists. Also, Jeffrey Skilling, the other face of the scandal, does not appear on any of the top-ten lists.

4.5 Topological Overlap Matrix

Topological Overlap Matrix (TOM) extends the adjacency matrix from a measure of connectedness between two nodes only to a measure of connectedness between two nodes and the rest of the individuals in the dataset (Ravasz et al., 2002; Yip and Horvath, 2007). Let $u_{ij}$ be the measure of adjacency between nodes $i$ and $j$ as defined in Section 2. We define the matrix $TOM$ as:

$$TOM_{ij} = \frac{\sum_{l \neq i,j} u_{il} u_{lj} + u_{ij}}{\min\left(\sum_{l \neq i,j} u_{il}, \sum_{l \neq i,j} u_{jl}\right) + 1 - u_{ij}}$$

This new adjacency matrix is then converted to a centrality measure by taking the row sum of the $TOM$. That is, the most central node will be the one who is most connected to the other nodes by way of third party connections.
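
The TOM formula and its row-sum centrality can be sketched as follows. The code applies the formula exactly as written above to a symmetric matrix stored as a list of lists. The function names, the toy matrix with weights scaled to lie in $[0, 1)$, and the choice to leave an entry at zero when the denominator vanishes are assumptions of this sketch rather than details of the project.

```python
def tom_matrix(U):
    """TOM_ij from a symmetric adjacency matrix U, following the formula above."""
    n = len(U)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            others = [l for l in range(n) if l != i and l != j]
            shared = sum(U[i][l] * U[l][j] for l in others)  # third-party links
            k_i = sum(U[i][l] for l in others)
            k_j = sum(U[j][l] for l in others)
            denom = min(k_i, k_j) + 1 - U[i][j]
            # Assumption of this sketch: keep 0 when the ratio is undefined.
            if denom != 0:
                T[i][j] = (shared + U[i][j]) / denom
    return T

def tom_centrality(U):
    """Row sums of the TOM matrix."""
    return [sum(row) for row in tom_matrix(U)]

# Hypothetical symmetric toy network with weights scaled below 1.
U = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.5, 0.5, 0.0, 0.25],
    [0.0, 0.0, 0.25, 0.0],
]
T = tom_matrix(U)
central = tom_centrality(U)
```

In this toy network nodes 0 and 3 share no direct edge, yet `T[0][3]` is positive because both are connected to node 2, which is the second-degree effect the measure is designed to capture.
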
It is worth pointing out that TOM directly accounts for the second degree connections, and so it will naturally produce different measures of importance than other centrality measures.

Topological overlap adjacency was originally designed to take as input unweighted networks, or binary matrices. In such a case, $TOM_{ij}$ measures the proportion of overlap between $i$'s and $j$'s immediate neighbors.

There are three natural avenues of TOM-discussion for students. First, one may ask if using the same formula for weighted networks, as we did, can result in misleadingly inflated entries in the TOM matrix; for instance, should the quadratic growth of $\sum_{l \neq i,j} u_{il} u_{lj}$ be tempered with a square root? Second, students can consider the Generalized Topological Overlap Matrix (Yip and Horvath, 2007), which measures the network overlap of all neighbors within a fixed distance from any two nodes. And third, the row-sum measure of TOM centrality is a direct measure of the second order connections from the degree centrality; as such, the other measures of centrality discussed above can be used within the TOM matrix to evaluate their second order connections as well.

The TOM top-ten list shares 7 employees in common with the degree list. This correlation can be seen for all the 156 employees in Figure 3. The list includes 5 lawyers and 3 executive members of the Regulatory and Government Affairs department. The remaining two are Mike Grigsby, a vice president of trading, and Steven Kean, Enron's chief of staff, and one of only three board members to be ranked by our centrality measures.

4.6 Clustering and Network Cliques

Instead of ranking employees individually, we could ask whether certain groups of employees acted in concert more than others. To do so, we employ hierarchical network construction, as follows.
First, compare all pairs of nodes, or employees, and connect the two nodes which are most similar. Second, connect the next two nodes which are most similar or connect a node to the already connected group using either average connectedness, minimum connectedness, or maximum connectedness between the node and that group. The construction happens iteratively by making one additional connection at each step until all nodes are connected into one group (Everitt et al., 2011).

Such bottom-up grouping is called agglomerative, though the splitting mechanism could have happened top-down and would be called divisive. The result of the clustering algorithm is visualized in a dendrogram (see Figure 2). The y-axis of the dendrogram is given by the dissimilarity between any two nodes (or groups of nodes).

Clustering requires a choice for similarity between network nodes. We used two distances for our clustering. One was based on the number of emails sent and received: the similarity between two nodes equals the proportion of emails sent and received between the respective pair of employees (that is, degree scaled by dividing through by the maximum number of emails sent and received). The other was based on the TOM matrix: the distance between two groups equals the average TOM distance between all pairs of points across the two groups.

In our graphs, we define a cluster to be a group of individuals that are both somewhat similar (sent many emails to each other) and numerous enough to meet a minimum membership (we arbitrarily set the minimum to be four). However, hierarchical networks have the disadvantage that in building the network, once two nodes are connected, they remain connected.
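
The agglomerative construction with average connectedness, followed by a cutoff and a minimum group size, can be sketched in plain Python. This is an illustrative sketch only: the function names `average_linkage` and `cut`, the toy dissimilarity matrix, and the toy cutoff and minimum size are hypothetical and much smaller than the values used for the Enron data.

```python
def average_linkage(dissimilarity):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(dissimilarity))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                # Average dissimilarity over all pairs across the two groups.
                height = sum(dissimilarity[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

def cut(merges, n, cutoff, min_size):
    """Groups joined below the cutoff height that have at least min_size members."""
    groups = {i: frozenset([i]) for i in range(n)}
    for left, right, height in merges:
        if height >= cutoff:
            continue  # this connection is too dissimilar to count
        union = frozenset(left) | frozenset(right)
        for i in union:
            groups[i] = union
    return sorted({tuple(sorted(g)) for g in groups.values() if len(g) >= min_size})

# Hypothetical dissimilarities: nodes 0-2 are close, nodes 3-4 are close.
D = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.3],
    [0.9, 0.9, 0.9, 0.3, 0.0],
]
merges = average_linkage(D)
large_groups = cut(merges, len(D), cutoff=0.5, min_size=3)
all_groups = cut(merges, len(D), cutoff=0.5, min_size=2)
```

With a minimum size of three only the group of nodes 0, 1, and 2 survives; lowering the minimum to two also admits the pair 3 and 4. Replacing the average in `average_linkage` by a minimum or maximum gives the other two linkage choices mentioned above.
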
[Figure 2 image not reproduced. Panel 1: hclust (*, "average") on as.dist(dissAM2), y-axis 1 − S&R/max(S&R), min 4 per group, cutoff = 0.9. Panel 2: hclust (*, "average") on as.dist(dissTOM), y-axis TOM dissimilarity, min 4 per group, cutoff = 0.95.]

Figure 2: Dendrograms representing hierarchical clustering with the symmetric adjacency matrix (S&R refers to "number of emails sent and received") as well as the TOM construction based on the symmetric adjacency matrix. We group points based on similarity in an agglomerative (bottom up) manner. Individuals who are similar according to a cutoff (0.9 for symmetric adjacency and 0.95 for TOM) and have at least a minimum cluster size (here 4 individuals) are considered to make up a group.

Using the number of emails sent and received to measure similarity, we produce two clusters:

| Name | Department/Title | Centrality (ranking) |
|---|---|---|
| Susan Bailey | ENA Legal/Legal Specialist | EV (5), EVT (2) |
| Marie Heard | ENA Legal/Legal Specialist | EV (4), EVT (3), TOM (8) |
| Tana Jones | ENA Legal/Legal Specialist | Deg (3), EV (1), EVT (4), TOM (5) |
| Stephanie Panus | ENA Legal/Legal Specialist | EV (3), EVT (5), TOM (9) |
| Sara Shackleton | ENA Legal/General Counsel Assistant | Deg (4), EV (2), EVT (1), TOM (6) |
| Jeff Dasovich | Reg. and Gov. Affairs/Director | Deg (1), EV (10), Cl (10), Bet (4), TOM (1) |
| Mary Hain | Reg. and Gov. Affairs/Director | Bet (5), TOM (7) |
| Steven J. Kean | Enron/VP & Chief of Staff | Deg (6), TOM (3) |
| Richard Shapiro | Reg. and Gov. Affairs/VP | Deg (5), TOM (2) |

Observe that the clusters are remarkably uniform in job title and department. Additionally, every member of the two clusters appeared in at least two top-ten lists; and the top-ten lists had a lot of overlap (every member of the first cluster ranked in both eigencentralities, and every member of the second cluster ranked highly in TOM).
However, ranking high in centrality measures is not a guarantee of cluster membership; for instance, Louise Kitchen, who appeared in five of the six top-ten lists, is not in any of the clusters.

Using the TOM adjacency matrix, we produce four clusters:

Name | Department/Title | Centrality (ranking)
Susan Bailey | ENA Legal/Legal Specialist | EV (5), EVT (2)
Marie Heard | ENA Legal/Legal Specialist | EV (4), EVT (3), TOM (8)
Tana Jones | ENA Legal/Legal Specialist | Deg (3), EV (1), EVT (4), TOM (5)
Stephanie Panus | ENA Legal/Legal Specialist | EV (3), EVT (5), TOM (9)
Elizabeth Sager | ENA Legal/VP & General Assistant Counsel | EV (8), EVT (6)
Sara Shackleton | ENA Legal/General Counsel Assistant | Deg (4), EV (2), EVT (1), TOM (6)
Robert Badeer | ENA West Power/Mgr Trading | none
Jeff Dasovich | Reg. and Gov. Affairs/Director | Deg (1), EV (10), Cl (10), Bet (4), TOM (1)
Mary Hain | Reg. and Gov. Affairs/Director | Bet (5), TOM (7)
Steven J. Kean | Enron/VP & Chief of Staff | Deg (6), TOM (3)
Richard Shapiro | Reg. and Gov. Affairs/VP | Deg (5), TOM (2)
James D. Steffes | Reg. and Gov. Affairs/VP | none
Lindy Donoho | ETS/Employee | none
Michelle Lokay | ETS/Director | Deg (8)
Mark McConnell | ETS/Director | none
Kimberly Watson | ETS/Director | none
Drew Fossum | ETS/VP & Gen. Cnsl. | none
Steven Harris | ETS/VP | none
Kevin Hyatt | ETS/Director | none
Susan Scott | ETS/Counsel | Deg (8), Cl (5), Bet (4), TOM (1)

Again observe that the clusters are department-uniform. The first two TOM clusters include as subsets, respectively, the first two clusters based on number of emails sent/received. The other two TOM clusters are made up entirely of Enron Technical Services employees, and indeed, of the 12 managerial-level employees in the ETS department, 7 appear in the last two clusters.

5 Helpful Hints

Our results build on Kaye et al. (2014), a semester-long research experience for a group of undergraduates at Pomona College.
We consider the topics to be upper-level undergraduate techniques which could easily be taught in a multivariate statistics course, a machine learning computer science course, or a data science course. Additionally, network analysis or clustering could easily be added as a topic to a course on statistical applications.

We also see a place for the topics in math courses that look for applications to their methods. In an analysis course that covers metric spaces, networks provide an interesting field of play. In a linear algebra course, eigenvector centrality can make the mathematical theory come alive.

The use of recent and meaningful data improves classroom outcomes in terms of both engaging students and solidifying their technical knowledge. It has been our experience that students engage more thoughtfully with statistical methodology when they are interested in the research question at hand, an interest that usually comes with intriguing data. Our experience is in line with the ASA's recently endorsed guidelines promoting exactly this type of meaningful data integration within the undergraduate curriculum (Workgroup, 2014). Indeed, in our research circle, the students were given free rein to choose both the data set to work with and the analysis method to apply for our semester-long research project. They unanimously chose to work with the Enron corpus and apply network analysis to the email counts.

The Enron corpus is in many ways an ideal dataset for statistical pedagogy. Although it is not well-suited for standard Neyman-Pearson hypothesis testing, the questions which can be addressed speak to more modern statistical challenges.
There are myriad reasons for using the Enron corpus in a classroom setting: the corpus's origins are unusual and engaging for students who are interested in real-world data and recent American econo-cultural history; a sizeable literature already exists on the corpus, so that students need not start the conversation and investigation at square one; social networks are accessible, especially in the post-Facebook era, yet they motivate current and active research problems; centrality measures are intuitive and mathematically nontrivial; and the discussion presented below may be used for stand-alone research modules in an undergraduate statistics course or may serve as a starting point for a more intensive research project.

5.1 Centrality

Using degree, eigenvector centrality, betweenness, closeness, and TOM, we rank the central importance of each of the individuals in the dataset.

A student can spend considerable effort thinking about the different metrics used to rank the individuals in the network. Recall that the more different people i emails, directly or by cc, the greater i's degree. For instance, an employee who forwards a single announcement to everybody in the company can achieve maximal degree. See Figure 3 for a comparison of the centrality measures evaluated in this project.

[Figure 3: “Ranking Metrics Comparison,” a pairwise scatterplot matrix of employee ranks (axes 0 to 150) under Degree, EV Cent., EV Cent. (T), Closeness, Betweenness, and TOM. Lower-triangle correlations as extracted, row by row: EV Cent.: 0.78; EV Cent. (T): 0.67, 0.80; Closeness: 0.53, 0.58, 0.71; Betweenness: 0.65, 0.65, 0.56, 0.62; TOM: 0.99, 0.77, 0.69, 0.56, 0.63.]

Figure 3: For each of the measures of centrality, we find the ranked list of employees. The ranked lists are then plotted against each other. The number in the lower triangle represents the Pearson correlation associated with the comparison of the two relevant ranked lists.

5.2 Network

Using the R package Weighted Gene Co-expression Network Analysis (WGCNA) (Langfelder and Horvath, 2008), we cluster the observations into a hierarchical dendrogram. WGCNA uses a hierarchical clustering algorithm in an agglomerative process (building one step at a time from 156 groups until all individuals are in one group) to link individuals sequentially based on the number of emails exchanged. We used average linkage to determine closeness to a group that has already been formed; that is, an individual will be added to a group if they are close, on average, to the members of the existing group. Additionally, we did not require that every individual be linked into a group. We require that the dissimilarity be no more than 0.9 for the adjacency matrix. (Recall that the adjacency score is determined by the number of emails sent and received, divided by the maximum adjacency score. The dissimilarity is one minus the adjacency.) We require the TOM dissimilarity to be no more than 0.95. Lastly, each group is required to have at least 4 members according to our analysis. The dissimilarity measure, linkage decision, and cutoff criteria are all parameters that can be adjusted in order to gain further insight into the data.

6 Further Directions

We presented above some suggested directions that students can take, with discussion and research questions for each of the centrality measures. We add to them here with some suggestions for class-specific modules and further exploration.

6.1 Connections to Specific Courses

6.1.1 Introductory Statistics

The analyses done in this article are not typically covered in Introductory Statistics. However, the data could be used to do descriptive statistics. For example, students could make boxplots across different Enron departments using either number of emails sent or number of emails received.
One might be able to run an inferential (e.g., chi-square) test to see if lawyers sent more emails to other lawyers or to non-lawyers. Indeed, an interesting classroom discussion could be based on the data clearly not being a representative sample from a population; instead, the data might be thought of as a sample from a process of email sending by the 156 individuals measured.

6.1.2 Applied Statistics

The data and analyses provided seem most appropriate for an applied statistics course (e.g., computational statistics, multivariate analysis, or data science) with an introductory prerequisite. The Enron data allow for a complete analysis of centrality metrics as well as a consideration of different network or clustering construction methods which are based on distances. We have provided R code for an initial analysis, but our work could easily be expanded to include additional centrality measures or other network and clustering construction methods.

6.1.3 Mathematical Multivariate Analysis or Linear Algebra

Principal component analysis is a mainstay of multivariate analysis classes, and increasingly, eigenvector centrality makes a late-semester appearance in linear algebra classes. We submit that eigenvector centrality is at least as appropriate as PCA for a course in multivariate analysis, whether in addition to PCA or instead of it. Both PCA and eigenvector centrality require some linear-algebraic sophistication and dexterity with eigentheory. However, eigenvector centrality can be more intuitive, as the importance formula $x_i = \frac{1}{\lambda} \sum_{j} M_{ij} x_j$ is a straightforward linear transcription of the importance-voting assumption of the model, while it still includes sophisticated machinery like the Perron-Frobenius Theorem. On the other hand, the connection between eigenvectors of the covariance matrix and the principal axes of a best-fit ellipse can be obscure to the student upon first introduction.
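For a linear algebra class, the importance formula can be read as the fixed-point equation x = (1/λ) M x and solved by power iteration. The Python sketch below is a minimal illustration, assuming a small symmetric, connected, non-bipartite weight matrix so that the Perron-Frobenius Theorem guarantees a unique positive solution and the iteration converges. The function name `eigenvector_centrality` and the four-node matrix `M` are hypothetical and are not taken from the article's R code.

```python
def eigenvector_centrality(M, max_iter=1000, tol=1e-12):
    """Power iteration for x = (1/lambda) * M x on a nonnegative matrix M.

    Returns (lambda, x) with x scaled to sum to 1.
    """
    n = len(M)
    x = [1.0 / n] * n  # start from equal importance
    lam = 0.0
    for _ in range(max_iter):
        # each node collects importance "votes" from its neighbours
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(y)  # x sums to 1, so sum(M x) estimates lambda
        if lam == 0:
            raise ValueError("matrix has no edges")
        y = [v / lam for v in y]
        converged = max(abs(a - b) for a, b in zip(x, y)) < tol
        x = y
        if converged:
            break
    return lam, x


# Hypothetical weighted network: a triangle 0-1-2 plus node 3 attached to node 2.
M = [
    [0, 3, 1, 0],
    [3, 0, 2, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 0],
]
lam, x = eigenvector_centrality(M)
ranking = sorted(range(len(x)), key=lambda i: -x[i])
print(round(lam, 3), [round(v, 3) for v in x], ranking)
```

A useful classroom exercise is to replace `M` with a bipartite graph (for example a single edge) and watch the iteration oscillate instead of converging, which motivates the hypotheses of the theorem.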
6.2 Alternative Applications

Some of the applications we suggest below might require direct manipulation of the email data, either to organize it differently or to compute different adjacency matrices. They might also require a database of employee titles and departments; see http://foreverdata.org/1009/Enron_Employee_Status.xls.

6.2.1 Data Cleanup

To highlight the importance of data cleanup decisions, even if that is tangential to the focus of this paper or a course, students can discuss the multitude of ways to represent the Enron email network, and the potential consequences to the analysis of each decision or assumption made along the way. For instance, are there employee-specific parameters that can be computed without constructing the entire network? Also, students can discuss different weightings for the matrix M. What if being emailed directly and being cc-ed counted equally? Is there a way to incorporate the importance of a message in the weighting, say by a blunt measure like the length of the email, or by a more sophisticated textual analysis? Students would need to obtain all 500,000 emails with the information on From, To, and CC fields of each email message; see section 2 for additional details.

6.2.2 Correlation between centrality and company hierarchy

The managerial hierarchy of Enron is not reflected in the top ten employees as ranked by the centrality measures above. Indeed, of the main executives at the company, only two appear in the top ten: Kenneth Lay, the CEO and chairman, came in fourth on the betweenness scale, and Greg Whalley, the president, had the eighth highest closeness score. Some studies have attempted to reconstruct the company hierarchy from the email network; see for instance Agarwal et al. (2012) for an attempted recovery of dominance relationships among the employees with known dominance-subordinate hierarchy by simply using degree centrality. However, we are not aware of any studies that carefully interpret the significance of high rank in centrality measures in the context of the company's hierarchy. Students would need at least the title information for each employee; see Agarwal et al. (2012) for additional information.

6.2.3 Gender and department

One of the interesting outcomes of our rankings is that the top eight scorers in eigenvector centrality were women. Also, most of the top ten eigenscorers were lawyers. There exist published studies that discuss email changes over time by department (see for instance Diesner et al. (2005)), though they do not correlate the departments to the employees' centralities. And while the Enron corpus has been used to study gender-related questions (like predicting gender from the email stream in Deitrick et al. (2012)), we are not aware of centrality analyses of the Enron corpus with gender as a variable. No additional data are needed for this extension.

6.2.4 Generalized TOM and other centrality measures applied to TOM

As mentioned above, TOM can be generalized to m-step neighborhoods to measure agreement between nodes with respect to multiple steps of adjacency (Yip and Horvath, 2007). Generalized TOM uses paths of length m to define adjacency between nodes. Additionally, a straightforward extension of TOM is to use other measures of adjacency (e.g., the binary measure of emails sent between two nodes) within the TOM metric. Alternatively, applying centrality measures like eigencentrality or closeness to the TOM matrix instead of the graph adjacency matrix may result in deeper centrality measures that better take into account overall network connectedness. No additional data are needed for this extension.
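The binary-adjacency variant of TOM suggested in Section 6.2.4 can be tried on a toy network. The Python sketch below assumes the standard topological overlap formula of Zhang and Horvath (2005), TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 − a_ij), where l_ij counts shared neighbors and k_i is the connectivity of node i; this may differ in detail from the definition given earlier in the article. The function names, the `counts` matrix, and the threshold of 5 emails are hypothetical choices made for illustration.

```python
def binary_adjacency(counts, threshold):
    """a[i][j] = 1 if employees i and j exchanged at least `threshold` emails."""
    n = len(counts)
    return [[1 if i != j and counts[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def tom_matrix(a):
    """Topological overlap for a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(a)
    k = [sum(a[i][u] for u in range(n) if u != i) for i in range(n)]  # connectivity
    tom = [[1.0] * n for _ in range(n)]  # a node fully overlaps with itself
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # neighbours shared by i and j, excluding i and j themselves
            shared = sum(a[i][u] * a[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
    return tom


# Hypothetical symmetric email counts for five employees (illustrative only).
counts = [
    [0, 12, 7, 0, 0],
    [12, 0, 9, 1, 0],
    [7, 9, 0, 5, 0],
    [0, 1, 5, 0, 6],
    [0, 0, 0, 6, 0],
]
a = binary_adjacency(counts, threshold=5)
tom = tom_matrix(a)
diss_tom = [[1 - t for t in row] for row in tom]  # dissimilarity used for clustering
for row in tom:
    print([round(t, 3) for t in row])
```

Employees 0 and 3 never email each other directly at this threshold, yet they receive a positive overlap through their shared contact 2, which is the behavior that distinguishes TOM from the raw adjacency.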
6.2.5 Degree and Strength

A natural companion to degree centrality is strength. The strength $\sigma_i$ of employee $i$ is defined to equal the total number of emails that $i$ sent or received. For instance, we could compute $\sigma_i = \sum_j u_{ij}$. Like degree, strength is also a size measure, but of the volume of $i$'s correspondence instead of the extent of $i$'s network. The more emails $i$ sends, the greater $i$'s strength. The degree $\delta_i$ and strength $\sigma_i$ of an employee $i$ are blunt centrality measures, but they can be effectively combined with a tuning parameter $\alpha$ to define the new centrality measure $\kappa_i(\alpha) = \delta_i^{\alpha} \sigma_i^{1-\alpha}$. At an exploratory level, a student can vary $\alpha$ to observe corresponding differences in rankings. A more sophisticated exploration might begin with asking whether there are critical $\alpha$ values that change the nature of the ranking in some fundamental way. For instance, $\alpha = 0$ corresponds to strength and $\alpha = 1$ to degree. Also, the range $0 < \alpha < 1$ seems to be fundamentally different from the range $\alpha > 1$. But are there less obvious critical values? See Opsahl and Skvoretz (2010) for background on the tuning parameter. No additional data are needed for this extension.

6.2.6 Weights and directions

All of our analysis was conducted on the weighted network under the assumption that a higher volume of emails must have more significance than a lower one. But a simple unweighted graph of email connections, perhaps constructed with some minimum threshold for the number of emails, may reveal information that was obscured by the weighting. Alternatively, students may gain insight from a kind of weighting that treats cc-ed employees differently from our reciprocal square root approach or that assigns importance to emails based on word count or sentiment analysis.
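The degree, strength, and thresholding ideas of Sections 6.2.5 and 6.2.6 can be explored together on a toy matrix. The Python sketch below is only an illustration: the function names, the symmetric count matrix `u`, and its four hypothetical employees are assumptions, and the code presumes every employee has nonzero degree and strength (otherwise powers of zero would need special handling, especially for α > 1).

```python
def degree_and_strength(u, threshold=1):
    """Degree counts contacts with at least `threshold` emails; strength sums all emails."""
    n = len(u)
    degree = [sum(1 for j in range(n) if j != i and u[i][j] >= threshold)
              for i in range(n)]
    strength = [sum(u[i][j] for j in range(n) if j != i) for i in range(n)]
    return degree, strength


def kappa(degree, strength, alpha):
    """Combined measure kappa_i(alpha) = degree_i**alpha * strength_i**(1 - alpha)."""
    return [d ** alpha * s ** (1 - alpha) for d, s in zip(degree, strength)]


def ranking(scores):
    """Indices from most to least central (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


# Hypothetical counts: employee 0 has one very heavy contact, 2 and 3 have two light ones.
u = [
    [0, 50, 0, 0],
    [50, 0, 2, 1],
    [0, 2, 0, 3],
    [0, 1, 3, 0],
]
degree, strength = degree_and_strength(u)
for alpha in (0, 0.5, 0.7, 0.8, 1):
    print(alpha, ranking(kappa(degree, strength, alpha)))

# Unweighted graph that ignores pairs with fewer than 2 emails.
print(degree_and_strength(u, threshold=2)[0])
```

On this toy matrix, employee 0 ranks second by strength but last by degree, and the switch happens between α = 0.7 and α = 0.8, a small example of the critical α values discussed in Section 6.2.5.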
Whether the graph is directed or undirected (that is, whether the sender and receiver are treated symmetrically or not) will also result in different outcomes for all the centrality measures, and each may suggest results that the other does not. Students would need to obtain all 500,000 emails with the information on From, To, and CC fields of each email message; see section 2 for additional details.

6.2.7 A time factor

The majority of the Enron corpus consists of emails from 1998 to 2002. Our graph and corresponding matrices aggregate all the emails into one network. However, it may make sense to consider how the email network changes over time, by month or by quarter. For instance, can anomaly detection on the network over time point out any changes that arose from scandal-related communication? See Wang et al. (2014) for some work in that direction. Students would need to obtain all 500,000 emails with the information on From, To, and CC fields of each email message; see section 2 for additional details.

6.2.8 Clustering Extensions

Hierarchical clustering is only one network algorithm that uses adjacencies or distances to break up observations into groups. Partitioning methods typically break the nodes up into groups that partition the units. That is, each node will go into exactly one group. Partitioning Around Medoids (PAM) (Kaufman and Rousseeuw, 1990) iteratively allocates points to the group with the closest medoid (a measure of center based on the nodes themselves), recomputes the medoid, reallocates points, and repeats until no points need further swapping. Partitioning methods have the disadvantage that the user is required to specify the number of clusters; however, silhouette width can be used to choose the optimal number of clusters (Rousseeuw, 1987).
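To make the silhouette criterion concrete, the Python sketch below computes the average silhouette width (Rousseeuw, 1987) of a given partition from a dissimilarity matrix. It does not implement PAM itself; it only scores candidate partitions, assuming at least two clusters and nonzero dissimilarities between distinct points. The function name `average_silhouette`, the six toy positions, and the two candidate labelings are hypothetical.

```python
def average_silhouette(diss, labels):
    """Average silhouette width of a partition given a dissimilarity matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes s(i) = 0
        a = sum(diss[i][j] for j in own) / len(own)  # cohesion within own cluster
        b = min(sum(diss[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)  # nearest other cluster
        total += (b - a) / max(a, b)
    return total / len(labels)


# Toy data: six points on a line forming two obvious groups (illustrative only).
positions = [0, 1, 2, 10, 11, 12]
diss = [[abs(p - q) / 12 for q in positions] for p in positions]

two_groups = [0, 0, 0, 1, 1, 1]
three_groups = [0, 0, 1, 2, 2, 2]
print(round(average_silhouette(diss, two_groups), 4))
print(round(average_silhouette(diss, three_groups), 4))
```

The two-group labeling scores higher than the three-group one, so the silhouette criterion would select two clusters here; with the Enron data, students could apply the same comparison to PAM output for several candidate numbers of clusters.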
Another possible project for students is to use permutation methods to evaluate the significance of the resulting clustering output. That is, one could create a null distribution of dendrograms resulting from permuted data. A senior project or research experience might have the students engage with different ways of measuring the distance from a null dendrogram to the observed dendrogram. No additional data are needed for this extension.

6.2.9 Visualizations

Our research students were particularly interested in different visualizations of the data. They used D3 graphics to create a dependency wheel and an interactive network image (see http://enron-network.herokuapp.com/TOM) (Kaye et al., 2014). Using applications like Shiny (http://shiny.rstudio.com/) allows students to think about how best to communicate results, and the Enron data provides myriad opportunities for creative visualizations. No additional data are needed for this extension.

6.2.10 Text Mining

As a much larger extension, with the entire email corpus, a student project could involve text mining of the content of the emails or of the email subject lines. There could also be a connection between some of the network results and a sentiment analysis of the words used within the emails themselves.

6.3 Resources

We have found the following websites useful for further exploration of the data as well as for processed and simplified datasets.

• https://snap.stanford.edu/data/email-Enron.html Stanford Network Analysis Project network analysis and data mining library.
• http://bailando.sims.berkeley.edu/enron_email.html UC Berkeley Enron Email Analysis Project, includes natural language processing annotation, visualization and clustering tool, and database representation for efficient querying.
• http://homes.cs.washington.edu/~jheer//projects/enron/v1/ Updated version of visualization and clustering tool by Jeff Heer from Berkeley website above.
• http://research.cs.queensu.ca/home/skill/otherforms.html Processed forms of Enron data including word frequencies and time stamps.
• http://cis.jhu.edu/~parky/Enron/ Another set of processed databases in simplified forms like (time, from, to) tuples.

Acknowledgements

We are grateful to Theo Vassilakis and Jim Addler at Metanautix for their help and suggestions in getting this project started, the Pomona College Math Department for its continued support of undergraduate research, and the students of the Pomona College Undergraduate Research Circle during the Spring semester of 2014, Timothy Kaye, David Khatami, Daniel Metz, and Emily Proulx, for pushing the project to fruition.

Appendix

As an appendix to this work we provide the dataset given in equation (1). We also provide a list of the 156 employees considered in the analysis (with their departmental affiliation and title). The analysis was done using R (http://www.r-project.org/) and RStudio (http://www.rstudio.com/), and the code used for the analysis is provided as a markdown file and a pdf file.

• The 156 x 156 adjacency matrix is available as a comma-separated value file: http://www.amstat.org/publications/jse/.../FinalAdjacencyMatrix.csv
• The list of 156 employees with their department affiliation and title is available as a comma-separated value file: http://www.amstat.org/publications/jse/.../EnronEmployeeInformation.csv
• The R Markdown file including the code for the entire analysis is available at: http://www.amstat.org/publications/jse/.../enronTutorial.Rmd
• The associated pdf file compiled from the markdown code is available at: http://www.amstat.org/publications/jse/.../enronTutorial.pdf

References

Agarwal, A., Omuya, A., Harnly, A., and Rambow, O. (2012), “A Comprehensive Gold Standard for the Enron Organizational Hierarchy,” Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, 2, 161–165.

Austin, D. (2006), “How Google Finds Your Needle in the Web's Haystack,” Available at http://www.ams.org/samplings/feature-column/fcarc-pagerank, accessed: 2014-08-23.

Brin, S. and Page, L. (1998), “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, 33, 107–117.

Chapanond, A., Krishnamoorthy, M., and Yener, B. (2005), “Graph Theoretic and Spectral Analysis of Enron Email Data,” Computational & Mathematical Organization Theory, 11, 265–281.

Deitrick, W., Miller, Z., Valyou, B., Dickinson, B., Munson, T., and Hu, W. (2012), “Author Gender Prediction in an Email Stream Using Neural Networks,” Journal of Intelligent Learning Systems and Applications, 4, 169–175.

Diesner, J., Frantz, T. L., and Carley, K. M. (2005), “Communication Networks from the Enron Email Corpus “It's Always About the People. Enron is no Different”,” Computational & Mathematical Organization Theory, 11, 201–228.

Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011), Cluster Analysis, Wiley.

FERC (2003), “Order Directing the Release of Information,” Available at http://www.mresearch.com/pdfs/139.pdf, accessed: 2014-08-23.

FERC (2013), “Information Released in Enron Investigation,” Available at http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp, accessed: 2014-08-23.

Horn, R. A. and Johnson, C. R. (1990), Matrix Analysis, Cambridge University Press, New York.

Kaufman, L. and Rousseeuw, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York.

Kaye, T., Khatami, D., Metz, D., and Proulx, E. (2014), “Quantifying and Comparing Centrality Measures for Network Individuals as Applied to the Enron Corpus,” SIAM Undergraduate Research Online, 7.

Langfelder, P. and Horvath, S. (2008), “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, 9, 559.

Martin, S., Sewani, A., Nelson, B., Chen, K., and Joseph, A. D. (2005), “Analyzing Behavioral Features for Email Classification,” in Berkeley, CA: University of California at Berkeley.

McLean, B. and Elkind, P. (2013), The Smartest Guys in the Room, Portfolio Trade.

Newman, M. (2001), “Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality,” Physical Review E, 64, 016132.

Opsahl, A. F. and Skvoretz, J. (2010), “Node centrality in weighted networks: Generalizing degree and shortest paths,” Social Networks, 32, 245–251.

Peterson, K., Hohensee, M., and Xia, F. (2011), “Email Formality in the Workplace: A Case Study on the Enron Corpus,” in Proceedings of the Workshop on Languages in Social Media, Stroudsburg, PA, USA: Association for Computational Linguistics, LSM '11, pp. 86–95.

Pinter-Wollman, N., Holmes, R. W. A. G. S., and Gordon, D. M. (2011), “The effect of individual variation on the structure and function of interaction networks in harvester ants,” J. R. Soc. Interface.

Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and Barabási, A.-L. (2002), “Hierarchical Organization of Modularity in Metabolic Networks,” Science, 297, 1551–1555.

Rousseeuw, P. (1987), “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, 20, 53–65.

Shetty, J. and Adibi, J. (2004), “The Enron email dataset database schema and brief statistical report,” Tech. rep., University of Southern California, Information Sciences Institute.

Stephen, A. and Toubia, O. (2010), “Deriving Value from Social Commerce Networks,” Journal of Marketing Research, 47, 215–228.

Sutton, J., Spiro, E. S., Johnson, B., Fitzhugh, S., Gibson, B., and Butts, C. T. (2014), “Warning tweets: serial transmission of messages during the warning phase of a disaster event,” Information, Communication & Society, 17, 765–787.

Wang, H., Tang, M., Park, Y., and Priebe, C. E. (2014), “Locality statistics for anomaly detection in time series of graphs,” IEEE Trans. Signal Process., 62, 703–717.

Wickham, H. (2014), “Tidy Data,” Journal of Statistical Software, 59.

Workgroup, A. S. A. U. G. (2014), “2014 curriculum guidelines for undergraduate programs in statistical science.”

Yip, A. M. and Horvath, S. (2007), “Gene network interconnectedness and the generalized topological overlap measure,” BMC Bioinformatics, 8.

Zhang, B. and Horvath, S. (2005), “A General Framework for Weighted Gene Co-Expression Network Analysis,” Statistical Applications in Genetics and Molecular Biology, 4, Article 17.

Zhou, D., Manavoglu, E., Li, J., Giles, C. L., and Zha, H. (2006), “Probabilistic Models for Discovering e-Communities,” in Proceedings of the 15th International Conference on World Wide Web, New York, NY, USA: ACM, WWW '06, pp. 173–182.

Zhou, Y., Goldberg, M., Magdon-Ismail, M., and Wallace, W. A. (2007), “Social Communication Networks for Early Warning in Disasters. Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset,” 5th Conf. of North American Association for Computational Social and Organizational Science (NAACSOS 07), Emory - Atlanta, Georgia.

J. S. Hardin
Pomona College
Department of Mathematics
610 North College Ave
Claremont, CA, 91711
Jo.Hardin@Pomona.edu

G. Sarkis
Pomona College
Department of Mathematics
610 North College Ave
Claremont, CA, 91711
Ghassan.Sarkis@Pomona.edu

P. C. URC¹
Pomona College
Department of Mathematics
610 North College Ave
Claremont, CA, 91711
PCURC@Sakai.Claremont.edu

¹ P.C. URC stands for the Pomona College Undergraduate Research Circle, whose members for this project were Timothy Kaye, David Khatami, Daniel Metz, and Emily Proulx.
