Community detection using spectral clustering on sparse geosocial data

Yves van Gennip∗, Blake Hunter∗, Raymond Ahn†, Peter Elliott∗, Kyle Luh‡, Megan Halvorson§, Shannon Reid§, Matthew Valasik§, James Wo§, George E. Tita§, Andrea L. Bertozzi∗, P. Jeffrey Brantingham¶

∗ Department of Mathematics, Applied Mathematics, UCLA
† Department of Mathematics, CSULB
‡ Department of Physics, Yale
§ School of Social Ecology, Department of Criminology, Law and Society, UCI
¶ Department of Anthropology, UCLA

Abstract

In this article we identify social communities among gang members in the Hollenbeck policing district in Los Angeles, based on sparse observations of a combination of social interactions and geographic locations of the individuals. This information, coming from LAPD Field Interview cards, is used to construct a similarity graph for the individuals. We use spectral clustering to identify clusters in the graph, corresponding to communities in Hollenbeck, and compare these with the LAPD's knowledge of the individuals' gang membership. We discuss different ways of encoding the geosocial information using a graph structure and their influence on the resulting clusterings. Finally we analyze the robustness of this technique with respect to noisy and incomplete data, thereby providing suggestions about the relative importance of quantity versus quality of collected data.

Keywords: spectral clustering, stability analysis, social networks, community detection, data clustering, street gangs, rank-one matrix update

MSC 2010: 62H30, 91C20, 91D30, 94C15

1 Introduction

Determining the communities into which people organize themselves is an important step towards understanding their behavior. In diverse contexts, from advertising to risk assessment, the social group to which someone belongs can reveal crucial information. In practical situations only limited information is available to determine these communities. People's geographic location at a set of sample times is often known, but it may be asked whether this provides enough information for reliable community detection. In many situations social interactions can also be inferred from observing people in the same place at the same time. This information can be very sparse. The question is how to get the most community information out of these limited observations.

Here we show that social communities within a group of street gang members can be detected by complementing sparse (in time) geographical information with imperfect, but not too sparse, knowledge of the social interactions. First we construct a graph from LAPD Field Interview (FI) card information about individuals in the Hollenbeck policing area of Los Angeles, which has a high density of street gangs. The nodes represent individuals and the edges between them are weighted according to their geosocial similarity. When using this extremely sparse social data in combination with the geographical data, the eigenvectors of the graph display hotspots at major gang locations. However, the available collected social data is too sparse and the social situation in Hollenbeck too complex (communities do not necessarily proxy for gang boundaries) for the resulting clustering, constructed using the spectral clustering algorithm, to identify gangs accurately.
Extending the available social data past the current sparsity level by artificially adding (noisy) ground truth consisting of true connections between members of the same gang leads to quantitative improvements of clustering metrics. This shows that limited information about people's whereabouts and interactions can suffice to determine which social groups they belong to, but the allowed sparsity in the social data has its limits. However, no detailed personal information or knowledge about the contents of their interactions is needed. The sparsity in time of the geographical information is mitigated by the relative stability in time of the gang territories.

The case of criminal street gangs speaks to a more general social group classification problem found in both security- and non-security-related contexts. In an active insurgency, for example, the human terrain contains individuals from numerous family, tribal and religious groups. The border regions of Afghanistan are home to perhaps two dozen distinct ethno-linguistic groups and many more family and tribal organizations [20]. Only a small fraction of the individuals are actively belligerent, but many may passively support the insurgency. Since support for an insurgency is related in part to family, tribal and religious group affiliations, as well as more general social and economic grievances [21], being able to correctly classify individuals to their affiliated social groups may be extremely valuable for isolating and impacting hostile actors. Yet, on-the-ground intelligence is difficult to collect in extreme security settings. While detailed individual-level intelligence may not be readily available, observations of where and with whom groups of individuals meet may indeed be possible. The methods developed here may find application in such contexts.

In non-security contexts, establishing an individual's group affiliation and, more broadly, the structure of a social group can be extremely costly, requiring detailed survey data collection. Since much routine social and economic activity is driven by group affiliation [7], lower cost alternatives to group classification may be valuable for encouraging certain types of behavior. For example, geotagged social media activity, such as Facebook, Twitter or Instagram posts, might reveal the geo-social context of individual activities [41]. The methods developed here could be used to establish group affiliations of individuals under these circumstances.

This paper applies spectral clustering to an interesting new street gang data set. We study how social and geographical data can be combined to have the resulting clusters approximate existing communities in Hollenbeck, and investigate the limitations of the method due to the sparsity in the social data.

2 The setting

Figure 1: Left: Map of gang territories in the Hollenbeck area of Los Angeles. Right: LAPD FI card data showing average stop location of 748 individuals with social links of who was stopped with whom.

Gang legend (Figure 1): 01 8th St; 02 Avenues 43; 03 Big Hazard; 04 Breed Street; 05 Clarence St; 06 Clover; 07 Cuatro Flats; 08 Eastlake; 09 Eastside 18th St; 10 El Sereno; 11 ELA 13 Dukes; 12 Evergreen; 13 Happy Valley; 14 Highlands; 15 Indiana Dukes; 16 KAM; 17 Lil Eastside; 18 Lincoln Heights; 19 Lowell; 20 Metro 13; 21 MC Force; 22 Opal; 23 Primera Flats; 24 Rose Hills; 25 Sentinel Boys; 26 State Street; 27 The Mob Crew; 28 Tiny Boys; 29 Vicky's Town; 30 VNE; 31 White Fence.

Hollenbeck (Figure 1, left) is bordered by the Los Angeles River, the Pasadena Freeway and areas which do not have rivaling street gangs [31]. The built and natural boundaries sequester Hollenbeck's gangs from neighboring communities, inhibiting socialization. In recent years quite a few sociological, e.g. [35, 31, 34], and mathematical papers, e.g. [18, 24, 17, 33], on the Hollenbeck gangs have been produced, but none in the area of gang clustering.

The recent social science/policy research on Hollenbeck gangs has combined both the geographic and social position of gangs to better understand the relational nature of gang violence. Clustering gangs both in terms of their spatial adjacency and position in a rivalry network has shown that structurally equivalent [40] gangs experience similar levels of violence [31]. Incorporating both the social and geographical distance into contagion models of gang violence provides a more robust analysis [34]. Additionally, ecological models of foraging behavior have shown that even low levels of inter-gang competition produce sharply delineated boundaries among gangs, with violence following predictable patterns along these borders [4]. Accounting for these socio-spatial dimensions of gang rivalries has contributed to the design of successful interventions aimed at reducing gun violence committed by gangs [35]. An evaluation of this intervention demonstrated that geographically targeted enforcement of two gangs reduced gun violence in the focal neighborhoods. The crime reduction benefits also diffused through the social network, as the levels of violence among the targeted gangs' rivals also decreased.

In this article we use one year's worth (2009) of LAPD FI cards.
These cards are created at the officer's discretion whenever an interaction occurs with a civilian. They are not restricted to criminal events. Our data set is restricted to FI cards concerning stops involving known or suspected Hollenbeck gang members¹. We further restricted our data set to include only the 748 individuals (anonymized) whose gang affiliation is recorded in the FI card data set (based on expert knowledge). These affiliations serve as a ground truth for clustering. For each individual, our algorithm uses the average of the locations where they were stopped and which other individuals were present at each stop (Figure 1, right).

¹ In the FI card data set for some individuals certain data entries were missing. We did not include these individuals in our data set either.

3 The method

We construct a fully connected graph whose nodes represent the 748 individuals. Every pair of nodes $i$ and $j$ is connected by an edge with weight

$$W_{i,j} = \alpha S_{i,j} + (1 - \alpha) e^{-d_{i,j}^2/\sigma^2},$$

where $\alpha \in [0, 1]$, $d_{i,j}$ is the standard Euclidean distance between the average stop locations of individuals $i$ and $j$, and $\sigma$ is chosen to be the length which is one standard deviation larger than the mean distance between two individuals who have been stopped together². The choice of a Gaussian kernel for the geographic distance dependent part of $W$ is a natural one (since it models a diffusion process), setting the width of the kernel to be the length scale within which most social interactions take place.

² Most results in this paper are fairly robust to small perturbations that keep $\sigma$ of the same order of magnitude ($10^3$ feet), e.g. replacing it by just the mean distance. The mean distance between members of the same gang (computed using the ground truth) is of the same order of magnitude. Another option one could consider is to use local scaling, such that $\sigma$ has a different value for each pair $i, j$, as in [44]. We will not pursue that approach here. Our focus will be mainly on the roles of $\alpha$ and $S_{i,j}$.

We encode social similarity by taking $S = A$, where $A$ is the social adjacency matrix with entry $A_{i,j} = 1$ if $i$ and $j$ were stopped together (or $i = j$) and $A_{i,j} = 0$ otherwise. In Section 6 we discuss some other choices for $S$ and how the results are influenced by their choice. Note that, because of the typically non-violent nature of the stops, we assume that individuals that were stopped together share a friendly social connection, thus establishing a social similarity link. The parameter $\alpha$ can be adjusted to set the relative importance between social and geographic information. If $\alpha = 0$ only geographical information is used, if $\alpha = 1$ only social information.

Using spectral clustering (explained below) we group the individuals into 31 different clusters. The modeling assumption is that these clusters correspond to social communities among Hollenbeck gang members. We study the question how much these clusters or communities resemble the actual gangs, as defined by each individual's gang affiliation given on the FI cards. The a priori choice for 31 clusters is motivated by the LAPD's observation that there were 31 active gangs in Hollenbeck at the time the data was collected, each of which is represented in the data set³. In Appendix B we briefly discuss some results obtained for different values of $k$. The question whether this number can be deduced from the data without prior assumption (and if not, what that means for either the data or the LAPD's assumption) is both mathematically and anthropologically relevant, but falls mostly outside the scope of this paper. It is partly addressed in current work [19, 38] that uses the modularity optimization method (possibly with resolution parameter) ([27, 26, 30] and references therein), and its extension, the multislice modularity minimization method of [25]. We stress that our method clusters the individuals into 31 sharply defined clusters. Other methods are available to find mixed-membership communities [22, 10], but we will not pursue those here.

³ The number of members of each gang in the data set varies between 2 and 90, with an average of 24.13 and a standard deviation of 21.99.

We use a spectral clustering algorithm [28] for its simplicity and transparency in making non-separable (i.e. not linearly separable) clusters separable. At the end of this paper we will discuss some other methods that can be used in future studies.

We compute the matrix $V$, whose columns are the first 31 eigenvectors (ordered according to decreasing eigenvalues) of the normalized affinity matrix $D^{-1}W$. Here $D$ is a diagonal matrix with the nodes' degrees on the diagonal: $D_{i,i} := \sum_{j=1}^{748} W_{i,j}$. These eigenvectors are known to solve a relaxation of the normalized cut (Ncut) problem [32, 42, 39], by giving non-binary approximations to indicator functions for the clusters. We turn them into binary approximations using the $k$-means algorithm [16] on the rows of $V$. Note that each row corresponds to an individual in the data set and assigns it a coordinate in $\mathbb{R}^{31}$. The $k$-means algorithm iteratively assigns individuals to their nearest centroid and updates the centroids after each step. Because $k$-means uses a random initial seeding of centroids, in the computation of the metrics below we average over 10 $k$-means runs.

We investigate two main questions. The first is sociological: Is it possible to identify social structures in human behavior from limited observations of locations and colocations of individuals, and how much does each data type contribute? Specifically, do we benefit from adding geographic data to the social data?
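
As a rough illustration of the method described in this section, the following sketch builds $W$ with $S = A$, takes the leading eigenvectors of $D^{-1}W$, and runs $k$-means on the rows of $V$. It is not the authors' implementation: the function names, the toy data (two groups of ten individuals), the fixed value $\sigma = 1000$ feet, and the plain Lloyd iteration are all assumptions made for the example.

```python
import numpy as np

def geosocial_weights(locations, S, alpha, sigma):
    """W[i, j] = alpha * S[i, j] + (1 - alpha) * exp(-d[i, j]**2 / sigma**2)."""
    diff = locations[:, None, :] - locations[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)              # squared Euclidean distances
    return alpha * S + (1.0 - alpha) * np.exp(-d2 / sigma ** 2)

def leading_eigenvectors(W, k):
    """First k eigenvectors of D^{-1} W, ordered by decreasing eigenvalue."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # D^{-1} W is similar to the symmetric matrix D^{-1/2} W D^{-1/2}, so we
    # solve the symmetric problem and map its eigenvectors back.
    sym = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(sym)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]
    return d_inv_sqrt[:, None] * vecs[:, top]

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd iteration; initial centroids are random rows of X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

# Hypothetical toy data: two groups whose average stop locations lie
# roughly 4000 feet apart, with a few co-stops inside each group.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4000.0, 0.0]])
locations = np.vstack([c + 250.0 * rng.standard_normal((10, 2)) for c in centers])
A = np.eye(20)                                 # A[i, i] = 1 by definition
for i, j in [(0, 1), (2, 3), (10, 11), (12, 13)]:
    A[i, j] = A[j, i] = 1.0

W = geosocial_weights(locations, A, alpha=0.4, sigma=1000.0)
V = leading_eigenvectors(W, k=2)               # rows are coordinates in R^2
labels = kmeans(V, k=2, rng=rng)
```

In the paper's setting the same steps would be run with 748 individuals and $k = 31$, and the $k$-means step would be repeated over several random seedings.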
We also look at how well our specific FI card data set performs at identifying such social structures. The second question is essentially a modeling question: How should we choose $\alpha$ and $S$ to get the most information out of our data, given that our goal is to identify gang membership of the individuals in our data set? Hence we compute metrics comparing our clustering results to the known gang affiliations and investigate the stability of these metrics for different modeling choices.

4 The metrics

We focus primarily on a purity metric and the $z$-Rand score, which are used to compare two given clusterings. For purity one of the clusterings has to be designated as the true clustering; this is not necessary for the $z$-Rand score. In Appendix A we discuss other metrics and their results.

Purity is an often used clustering metric, e.g. [14]. It is the percentage of correctly classified individuals, when classifying each cluster as the gang in the majority in that cluster (in the case of a tie any of the majority gangs can be chosen, without affecting the purity score). Note that we allow multiple clusters to be classified as the same gang.

To define the $z$-Rand score we first need to introduce the pair counting quantity⁴ $w_{11}$, which is the number of pairs which belong both to the same cluster in our $k$-means clustering (say, clustering $A$) and to the same gang according to the "ground truth" FI card entry (say, clustering $B$), e.g. [23, 37] and references therein. The $z$-Rand score $z_R$ [37] is the number of standard deviations which $w_{11}$ is removed from its mean value under a hypergeometric distribution of equally likely assignments subject to new clusterings $\hat{A}$ and $\hat{B}$ having the same numbers and sizes of clusters as clusterings $A$ and $B$, respectively.

Note that purity is a measure of the number of correctly classified individuals, while the $z$-Rand score measures correctly identified pairs. Purity thus has a bias in favor of more clusters.
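
The difference between counting individuals and counting pairs can be made concrete with a small sketch. The function names and the eight-person toy labeling below are assumptions for illustration; only purity and the pair count $w_{11}$ are computed, and the standardization that turns $w_{11}$ into the $z$-Rand score is deliberately left out.

```python
from collections import Counter
from math import comb

def purity(clusters, gangs):
    """Fraction of individuals whose gang is the majority gang of their cluster."""
    per_cluster = {}
    for c, g in zip(clusters, gangs):
        per_cluster.setdefault(c, Counter())[g] += 1
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(clusters)

def same_pairs(clusters, gangs):
    """w11: number of pairs that share both a cluster and a gang."""
    joint = Counter(zip(clusters, gangs))
    return sum(comb(m, 2) for m in joint.values())

# Hypothetical toy labeling of eight individuals in three gangs.
gangs = ["a", "a", "a", "b", "b", "c", "c", "c"]
clusters = [0, 0, 1, 1, 1, 2, 2, 2]
singletons = list(range(len(gangs)))      # every individual in its own cluster

p_mixed, w_mixed = purity(clusters, gangs), same_pairs(clusters, gangs)
p_single, w_single = purity(singletons, gangs), same_pairs(singletons, gangs)
total_pairs = comb(748, 2)                # number of pairs among 748 individuals
```

On this toy input the singleton clustering reaches the maximal purity while identifying no pairs at all, which is the bias of purity toward more clusters.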
In the extreme case in which each individual is assigned to its own cluster (in clustering $A$), the purity score is 100%. However, in this case the number of correctly identified pairs is zero (each gang in our data set has at least two members), and the mean and standard deviation of the hypergeometric distribution are zero. Hence the $z$-Rand score is not well-defined. At the opposite extreme, where we cluster all individuals into one cluster in clustering $A$, we have the maximum number of correctly classified pairs, but the standard deviation of the hypergeometric distribution is again zero, hence the $z$-Rand score is again not well-defined. The $z$-Rand score thus automatically shows warning signs in these extreme cases. Slight perturbations from these extremes will have very low $z$-Rand scores, and hence will also be rated poorly by this metric. Since we prescribe the number of clusters to be 31, this bias of the purity metric will not play an important role in this paper.

⁴ Not to be confused with the matrix element $W_{1,1}$.

As a reference to compare the results discussed in the next section to, the total possible number of pairs among the 748 individuals is 279,378. Of these pairs, 15,904 involve members of the same gang, and 263,474 pairs involve members of different gangs (according to the ground truth). The $z$-Rand score for the clustering into true gangs is 404.7023.

5 Performance of FI card data set

In Table 1 we show the purity and $z$-Rand scores using $S = A$ for different $\alpha$ (for each $\alpha$ we give the average value over 10 $k$-means runs and the standard deviation). Clearly $\alpha = 1$ is a bad choice. This is unsurprising given the sparsity of the social data. The clustering thus dramatically improves when we add geographical data to the social data. On the other end of the spectrum $\alpha = 0$ gives a purity that is within the error bars of the optimum value (at $\alpha = 0.4$), indicating that a lot of the gang structure in Hollenbeck is determined by geography. This is not unexpected, given the territorial nature of these gangs. However, the $z$-Rand score can be significantly improved by choosing a nonzero $\alpha$ and hence again we see that a mix of social and geographical data is preferred.

$\alpha$   Purity             $z$-Rand
0          0.5548 ± 0.0078    120.6910 ± 19.4133
0.1        0.5595 ± 0.0136    131.8397 ± 18.5551
0.2        0.5574 ± 0.0100    121.9785 ± 18.3149
0.3        0.5612 ± 0.0115    137.2643 ± 21.0990
0.4        0.5603 ± 0.0087    142.9746 ± 15.9186
0.5        0.5531 ± 0.0118    139.8599 ± 14.2651
0.6        0.5452 ± 0.0107    141.7835 ± 13.4852
0.7        0.5452 ± 0.0099    130.2264 ± 21.5967
0.8        0.5460 ± 0.0104    134.9519 ± 25.2803
0.9        0.5602 ± 0.0061    145.7576 ± 13.4988
1          0.2568 ± 0.0158      6.1518 ±  1.7494

Table 1: A list of the mean ± standard deviation over ten $k$-means runs of the purity and $z$-Rand score, using $S = A$. Cells with the optimal mean value are highlighted. Note however that other values are often close to the optimum compared to the standard deviation.

In Appendix A we discuss the results we got from some other metrics, like ingroup homogeneity and outgroup heterogeneity measures and Hausdorff distance between the cluster centers. They show similar behavior as purity and the $z$-Rand score: All of them are limited by the sparsity and noisiness of the available data, but they typically show that it is preferable to include both social and geographical data. Especially social data by itself usually performs badly.

Figure 2 shows a pie chart (made with code from [36]) of one run of the spectral clustering algorithm, using $S = A$ and $\alpha = 0.4$. We see that some clusters are quite homogeneous, especially the dark blue cluster located in Big Hazard's territory. Others are fragmented. We may interpret these results in light of previous work [9], which suggests that gangs vary substantially in their degree of internal organization.
However, recall that in this paper we prescribe the number of clusters to be 31, so gang members are forced to cluster in ways that may not represent true gang organization.

Figure 2: Pie charts made with code from [36] for a spectral clustering run with $S = A$ and $\alpha = 0.4$. The size of each pie represents the cluster size and each pie is centered at the centroid of the average positions of the individuals in the cluster. The coloring indicates the gang make-up of the cluster and agrees with the gang colors in Figure 1. The legend shows the 31 different colors which are used, with the numbering of the gangs as in Figure 1. The axes are counted from an arbitrary but fixed origin. For aesthetic reasons the unit on both axes is approximately 435.42 meters. The connections between pie charts indicate inter-cluster social connections (i.e. nonzero elements of $A$).

Table 1, the pie charts in Figure 2, and the other metrics discussed in Appendix A paint a consistent picture: The social data in the FI card data set is too sparse to stand on its own. Adding a little bit of geographic data, however, immensely improves the results. Geographic data by itself does pretty well, but can typically be improved by adding some social data. However, even for the optimal values the clustering is far from perfect. Therefore we will now consider different social matrices $S$ with two questions in mind: 1) Can we improve the performance of the social data by encoding it differently? 2) Is it really the sparsity of the social data that is the problem, or can the spectral clustering method not perform any better even if we would have more social data? The first question will be studied in Section 6, the second in Section 7.

6 Different social matrices

For the results discussed above we have used the social adjacency matrix $A$ as the social matrix $S$. However, there are some interesting observations to make if we consider different choices for $S$.

The first alternative we consider is the social environment matrix $E$, which is a normalized measure of how many social contacts two individuals have in common. Its entries range between 0 and 1, a high value indicating that $i$ and $j$ met a lot of the same people (but, if $E_{i,j} < 1$, not necessarily each other) and a low value indicating that $i$ and $j$'s social neighborhoods are (almost) disjoint. It is computed as follows. Let $f^i$ be the $i$th column of $A$. Then $E$ has entries

$$E_{i,j} = \sum_{k=1}^{748} \frac{f^i_k f^j_k}{\|f^i\| \, \|f^j\|} \quad \Big(\text{where } \|f^i\|^2 = \sum_{k=1}^{748} (f^i_k)^2\Big).$$

The procedure is reminiscent of the nonlocal means method [5] in image analysis, in which pixel patches are compared, instead of single pixels.

From our simulations (not listed here) we have seen that we get very similar results using either $S = A$ or $S = E$, both in terms of the optimal values for our metrics and whether these optima are achieved at the ends of the $\alpha$-interval (i.e. $\alpha = 0$ or $\alpha = 1$) or in the interior ($0 < \alpha < 1$). The simulations described in Section 7 below showed that even for less sparse and more accurate data the results for $S = A$ and $S = E$ are similar.

An interesting visual phenomenon happens when, instead of using $A$ or $E$, we use a rank-one update of these matrices as the social matrix $S$. To be precise, we set $S = n(A + C)$ where $C$ is the matrix with $C_{i,j} = 1$ for every entry and $n^{-1} := \max_{i,j} (A + C)_{i,j}$ is a normalization factor such that the maximum entry in $S$ is equal to 1. (Again, the results are similar if we use $E$ instead of $A$.)
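
The two alternative social matrices defined above can be sketched in a few lines. The function names and the four-person adjacency matrix are assumptions for illustration; in the toy data, individuals 0 and 2 were never stopped together but share contact 1, and individual 3 was only ever stopped alone.

```python
import numpy as np

def social_environment(A):
    """E[i, j] = cosine similarity between columns i and j of A."""
    norms = np.linalg.norm(A, axis=0)          # column norms ||f^i||
    return (A.T @ A) / np.outer(norms, norms)

def rank_one_update(A):
    """S = n (A + C), with C the all-ones matrix and 1/n = max entry of A + C."""
    shifted = A + np.ones_like(A)
    return shifted / shifted.max()

# Hypothetical adjacency matrix (ones on the diagonal, as in the text).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

E = social_environment(A)
S = rank_one_update(A)
```

Here `E[0, 2]` is positive even though `A[0, 2]` is zero, because the two individuals share a contact, while `rank_one_update` turns every zero entry of `A` into a positive entry of `S`.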

Figure 3 shows the second, third, and fourth eigenvectors of $D^{-1}W$ (because of the normalization the first eigenvector is constant, corresponding to eigenvalue 1) for $\alpha = 0.4$, both when $S = A$ and when $S = n(A + C)$ is used. We see that hotspots have appeared after our rank-one update (and renormalization) of the social matrix $S$. Similar hotspots result for other $\alpha \in (0, 1)$. An explanation for this behavior can be found in the behavior of eigenvectors under rank-one matrix updates [6, 13]. Appendix C gives more details. Similar hotspots (and changes in the metrics; see below) occur if other choices for $S$ are made that turn the zero entries into nonzero entries, e.g. $S_{i,j} = e^{A_{i,j}}$, $S_{i,j} = e^{E_{i,j}}$ or $S_{i,j} = e^{-\theta_{i,j}}$, where $\theta$ is the spectral angle [15, 43].

Figure 3: Top: The second, third, and fourth eigenvector of $D^{-1}W$, with $S = A$ and $\alpha = 0.4$. The axes in the left picture have unit $10^6$ feet (304.8 km) with respect to the same coordinate origin as in Figure 2. The color coding covers different ranges: Top left 0 (blue) to 1 (red), top middle -0.103 (blue) to 0.091 (red), top right -0.082 (blue) to 0.072 (red). Bottom: The second, third, and fourth eigenvector of $D^{-1}W$, with $S = n(A + C)$ and $\alpha = 0.4$. The color coding covers different ranges: Bottom left -0.082 (blue) to 0.065 (red), bottom middle -0.091 (blue) to 0.048 (red), bottom right -0.066 (blue) to 0.115 (red).

An analysis of the metrics when $S = n(A + C)$ shows that most metrics do not change significantly.
The exceptions to this are two of the metrics described in Appendix A: The optimal value of the Hausdorff distance decreases to approximately 1350 meters, and the optimal value of the related minimal distance $M$ does not change much, but is now attained for a wide range of nonzero $\alpha$, not just for $\alpha = 1$. Most importantly, the averages of the purity stay the same and, while the averages of the $z$-Rand score decrease a bit, they do so within the error margins given by the standard deviations. Hence, the appearance of hotspots is not indicative of a global improvement in the clustering.

We tested whether the hotspots can be used to find the gangs located at these hotspots. For example, the hotspot seen in eigenvectors 2 (red) and 3 (blue) in the bottom row of Figure 3 seems to correspond to Big Hazard in the left picture of Figure 1. We reran the spectral clustering algorithm, this time requesting only 2 clusters as output of the $k$-means algorithm and only using the second, third, or fourth eigenvector as input. The clusters that are created in this way correspond to "hotspots versus the rest", but they do not necessarily correspond to "one gang versus the rest". In the case of Big Hazard they do, but when only the second eigenvector is used the individuals in the big blue hotspot get clustered together. This hotspot does not correspond to a single gang. We hypothesize that there is an interesting underlying sociological reason for this behavior: In the area of the blue hotspot a housing project, where several gangs claimed turf, was recently reconstructed, displacing resident gang members. Yet, even with these individuals being scattered across the city, they remain tethered to their social space, which remains in their established territories [1, 29].

We conclude that, from the available FI card data, it is not possible to cluster the individuals into communities that correspond to the different gangs with very high accuracy, for a variety of interesting reasons. First, the social data is very sparse. The majority of individuals are only involved in a couple of stops and most stops involve only a couple of people. Also, some gangs are only represented by a few individuals in the data set: There are two gangs with only two members in the data set and two gangs with only three members. Second, the social reality of Hollenbeck is such that individuals and social contacts do not always adhere to gang boundaries, as the hotspot example above shows.

That the social data is both sparse and noisy (compared to the gang ground truth, which may be different from the social reality in Hollenbeck) can be seen by comparing the connections in the FI card social adjacency matrix $A$ with the ground truth connections (the ground truth connects all members belonging to the same gang and has no connections between members of different gangs). We then see that⁵ only 2.66% of all the ground truth connections (intra-gang connections) are present in $A$. On the other hand 11.32% of the connections that are present in $A$ are false positives, i.e. they are not present in the ground truth (inter-gang connections). Because missing data in $A$ (contacts that were not observed) show up as zeros in $A$, it is not surprising that of all the zeros in the ground truth 99.98% are present in $A$ and only 5.56% of the zeros in $A$ are false negatives. Another indication of the sparsity is the fact that on average each individual in the data we used is connected to only 1.2754 ± 1.8946 other people⁶. The maximum number of connections for an individual in the data is 23, but 315 of the 748 gang members (42%) are not connected to any other individual.
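
The percentages quoted in this comparison are simple counts over the strictly upper triangular entries of $A$ and of the ground truth matrix. A minimal sketch of how such counts could be obtained is given below; the function name, the dictionary keys, and the six-person toy data are assumptions for illustration and do not reproduce the FI card data.

```python
import numpy as np

def connection_stats(A, gang_ids):
    """Compare observed connections in A with same-gang (ground truth) pairs."""
    gang_ids = np.asarray(gang_ids)
    same_gang = gang_ids[:, None] == gang_ids[None, :]
    iu = np.triu_indices(len(gang_ids), k=1)   # off-diagonal pairs, counted once
    observed = A[iu].astype(bool)
    truth = same_gang[iu]
    degrees = A.sum(axis=1) - np.diag(A)       # contacts per individual
    return {
        # share of ground truth connections that appear in A
        "observed_fraction": np.sum(observed & truth) / truth.sum(),
        # share of connections in A that link different gangs
        "false_positive_fraction": np.sum(observed & ~truth) / observed.sum(),
        "mean_degree": degrees.mean(),
        "isolated": int(np.sum(degrees == 0)),
    }

# Hypothetical toy data: two gangs of three, one intra-gang co-stop (0, 1)
# and one inter-gang co-stop (2, 3).
gang_ids = [0, 0, 0, 1, 1, 1]
A = np.eye(6)
for i, j in [(0, 1), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

stats = connection_stats(A, gang_ids)
```

In this toy case one of the six ground truth connections is observed, half of the observed connections are inter-gang, and two individuals have no observed contacts.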

⁵ Not counting the diagonal, which always contains ones.
⁶ This number is of course always nonnegative, even though the standard deviation is larger than the mean.

Future studies can focus on the question whether the false positives and negatives in $A$ are noise or caused by social structures violating gang boundaries, possibly by comparing the impure clusters with inter-gang rivalry and friendship networks [35, 31, 33]. Another possibility is that the false positives and negatives betray a flaw in our assumption that individuals that are stopped together have a friendly relationship. Because of the non-criminal nature of the stops, this seems a justified assumption, but it is not unthinkable that some people that are stopped together have a neutral or even antagonistic relationship.

To rule out a third possibility for the lack of highly accurate clustering results, namely limitations of the spectral clustering method, we will now study how the method performs on quasi-artificial data constructed from the ground truth.

7 Stability of metrics

To investigate the effect of having less sparse social data we compute purity using $S = GT(p, q)$. $GT(p, q)$ is a matrix containing a fraction $p$ of the ground truth connections, a further fraction $q$ of which are changed from true to false positive to simulate noise. In a sense, $p$ indicates how many connections are observed and $q$ determines how many of those are between members of different gangs.

The matrix $GT(p, q)$ for $p, q \in [0, 1]$ is constructed from the ground truth as follows. Let $GT(1, 0)$ be the gang ground truth matrix, i.e. it has entry $(GT(1, 0))_{i,j} = 1$ if and only if $i$ and $j$ are members of the same gang (including $i = j$). Next construct the matrix $GT(p, 0)$ by uniformly at random changing a fraction $1 - p$ of all the strictly upper triangular ones in $GT(1, 0)$ to zeros and symmetrizing the matrix.
Finally, make $GT(p, q)$ by uniformly at random changing a fraction $q$ of the strictly upper triangular ones in $GT(p, 0)$ to zeros and changing the same number (not fraction) of randomly selected strictly upper triangular zeros to ones, and in the end symmetrizing the matrix again. In other words, we start out with the ground truth matrix, keep a fraction $p$ of all connections, and then change a further fraction $q$ from true positives into false inter-gang connections.

In Figure 4 we show the average purity over 10 $k$-means runs using $S = GT(p, q)$ for different values of $p$, $q$, and $\alpha$. To compare these results to the results we got using the observed social data $A$ from the FI card data set, recall from Section 6 that $A$ contains only 2.66% of the true intra-gang connections which are present in $GT(1, 0)$. This roughly corresponds to $p$. On the other hand, the total percentage of false positives (i.e. inter-gang connections) in $A$ is 11.32%, roughly corresponding to $q$. By increasing $p$ and varying $q$ in our synthetic data $GT(p, q)$ we extend the observed social links, adding increased amounts of the true gang affiliations with various levels of noise (missing intra-gang social connections and falsely present inter-gang connections).

To investigate the effect of the police collecting more data at the same noise rate we keep $q$ fixed, allowing only the percentage of social links to vary. Low values of $\alpha$, e.g. $\alpha = 0$ and $\alpha = 0.2$, show again that a baseline level of purity (about 56%) is obtained by the geographical information only and hence is unaffected by changing $p$. As the noise level $q$ is varied in the four plots in Figure 4, a general trend is clear: larger values of $\alpha$ in the range $0 \le \alpha < 1$ correlate with higher purity values. This trend is enhanced as the percentage of social links in the network increases.
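The construction of $GT(p, q)$ described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code used for Figure 4: the function name `make_gt` and the toy gang labels are hypothetical, counts are rounded to the nearest integer, and the zeros that are switched to ones are drawn from the inter-gang pairs only, so that every added connection is a false positive, in line with the "in other words" summary.

```python
import numpy as np


def make_gt(gangs, p, q, rng):
    """Build a GT(p, q)-style boolean matrix from gang labels.

    Assumption: added ones are drawn from inter-gang pairs (the zeros of
    GT(1, 0)), so each of them is a false positive. There must be at
    least as many inter-gang pairs as connections to be flipped.
    """
    gangs = np.asarray(gangs)
    n = len(gangs)
    rows, cols = np.triu_indices(n, k=1)
    same_gang = gangs[rows] == gangs[cols]   # strictly upper triangle of GT(1, 0)
    upper = same_gang.copy()

    # GT(p, 0): change a fraction 1 - p of the ones to zeros.
    ones = np.flatnonzero(upper)
    n_drop = int(round((1 - p) * ones.size))
    upper[rng.choice(ones, size=n_drop, replace=False)] = False

    # GT(p, q): change a fraction q of the remaining ones to zeros and
    # change the same number of inter-gang zeros to ones.
    ones = np.flatnonzero(upper)
    n_flip = int(round(q * ones.size))
    upper[rng.choice(ones, size=n_flip, replace=False)] = False
    inter_gang = np.flatnonzero(~same_gang)
    upper[rng.choice(inter_gang, size=n_flip, replace=False)] = True

    # Symmetrize; the diagonal stays one, as in GT(1, 0).
    G = np.eye(n, dtype=bool)
    G[rows, cols] = upper
    G[cols, rows] = upper
    return G


# Toy example (hypothetical): 4 gangs of 10 members, i.e. 180 intra-gang pairs.
rng = np.random.default_rng(0)
gangs = np.repeat(np.arange(4), 10)
G = make_gt(gangs, p=0.5, q=0.2, rng=rng)
print("off-diagonal connections:", int(np.triu(G, k=1).sum()))
```

With these toy parameters the matrix keeps 90 connections in its upper triangle, of which 72 are intra-gang and 18 are inter-gang, which matches the count $180 \cdot p \cdot (1 - q)$ of true positives.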
As expected, when only social information is used ($\alpha = 1$), the algorithm is more sensitive to variations in the social structure. This sensitivity is most pronounced at low levels, when the total percentage of social links is below 20%. Even at low levels of noise, $q = 0.055$ (5.5%), using only social information is highly sensitive. This suggests that $\alpha$ values strictly less than one are more robust to noisy links in the network. The optimal choice of $\alpha = 0.8$ here is more robust and consistently produces high purity values across the range of percentages of ground truth. A possible explanation for this sensitivity at $\alpha = 1$, and for the persistent dip in purity for this value of $\alpha$ and low values of $p$, is that for fixed $q$ and increasing $p$ the absolute (but not the relative) number of noisy entries increases. At a low total number of connections these noisy entries wreak havoc on the purity in the absence of the mitigating geographical information.

The bottom left of Figure 4 shows a noise level of $q = 0.11321$, which is set to match what was obtained in the observed data. The dotted vertical lines are plotted at values of $p$ satisfying

$$p = \frac{\text{total number of true positives in } A}{\text{total number of upper triangular ones in } GT(1, 0)} \cdot \frac{1}{1 - q} = \frac{423}{15{,}904} \cdot \frac{1}{1 - q}.$$

For this value of $p$ the total number of true positives in $GT(p, q)$ is $15{,}904 \cdot p \cdot (1 - q) = 423$, which is equal to the total number of true positives in $A$. It is clear from the pictures that collecting and using more data (increasing $p$), even if it is noisy, has a much bigger impact on the purity than lowering the 11.32% rate of false positives.

As already remarked in Section 6, we ran the same simulations using a social environment matrix like $E$ as the choice for the social matrix $S$, but built from $GT(p, q)$ instead of $A$.
The results were very similar to those using $S = GT(p, q)$, showing that also for less sparse data there does not appear to be much of a difference between using the social adjacency matrix and the social environment matrix. We also ran simulations computing the $z$-Rand score instead of purity using $S = GT(p, q)$. Again, the qualitative behavior was similar to the results discussed above.

8 Conclusion and discussion

In this paper we have applied the method of spectral clustering to an LAPD FI card data set concerning gang members in the policing area of Hollenbeck. Based on stop locations and social contacts only, we clustered all the individuals into groups that we interpret as corresponding to social communities. We showed that the geographical information leads to a baseline clustering which is about 56% pure compared to the ground truth gang affiliations provided by the LAPD. Adding social data can improve the results a lot, if it is not too sparse. The data which is currently available is very sparse and improves only a little on the baseline purity, but our simulations show that improving the social data a little can lead to large improvements in the clustering.

[Figure 4: four panels, one each for $q = 0$, $q = 0.055$, $q = 0.11321$, and $q = 0.22$; horizontal axis $100 \cdot p$ (0 to 100), vertical axis purity (0 to 1).]

Figure 4: Plots of the purity using $S = GT(p, q)$ for different values of $q$ (the different plots) and $\alpha$ (the different lines within each plot) for varying values of $p$. The plotted purity values per set of parameter values are averages over 10 $k$-means runs; the error bars are given by the standard deviation over these runs. The dotted vertical lines indicate the values of $p$ for which the number of true positives in $GT(p, q)$ is equal to the number of true positives in $A$.
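The position of the dotted vertical lines in Figure 4 follows from the two counts quoted in Section 7 (423 true positives in $A$ and 15,904 strictly upper triangular ones in $GT(1, 0)$). The short calculation below is only a sketch to make that arithmetic concrete; the function name `matching_p` is hypothetical.

```python
def matching_p(true_positives_in_A, gt_upper_ones, q):
    """Value of p at which GT(p, q) has as many true positives as A."""
    return true_positives_in_A / gt_upper_ones / (1.0 - q)


# Counts quoted in the text; the q values are the four panels of Figure 4.
TRUE_POSITIVES_IN_A = 423
GT_UPPER_ONES = 15_904

for q in (0.0, 0.055, 0.11321, 0.22):
    p = matching_p(TRUE_POSITIVES_IN_A, GT_UPPER_ONES, q)
    print(f"q = {q:<8} p = {p:.4f}  (100*p = {100 * p:.2f})")
```

For $q = 0.11321$ this gives $p$ of roughly 0.03, so the dotted line sits near $100 \cdot p = 3$, far to the left of the range over which purity improves markedly.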
An extra complicating factor, which needs external data to be dealt with, is the very real possibility that the actual social communities in Hollenbeck are not strictly separated along gang lines. Extra sociological information, such as friendship or rivalry networks between gangs, can be used in conjunction with the clustering method to investigate how much of the social structure observed in Hollenbeck is the result of gang membership.

Future studies will also investigate the effect of using different methods, including the multislice method of [25], the alternative spectral clustering method of [12, 11] based on an underlying non-conservative dynamic process (as opposed to a conservative random walk), and the nonlinear Ginzburg-Landau method of [3], which uses a few known gang affiliations as training data. The question of how partially labeled data helps with clustering in a semi-supervised approach was explored in [2].

Acknowledgements. The FI card data set used in this work was collected by the LAPD Hollenbeck Division and digitized, scrubbed, and anonymized for use by Megan Halvorson, Shannon Reid, Matthew Valasik, James Wo, and George E. Tita, at the Department of Criminology, Law and Society of UCI. The data analysis work was started by (then) (under)graduate students Raymond Ahn, Peter Elliott, and Kyle Luh, as part of a project in the 2011 summer REU program in applied mathematics at UCLA organized by Andrea L. Bertozzi. The project's mentors, Yves van Gennip and Blake Hunter, together with P. Jeffrey Brantingham, extended the summer project into the current paper. We thank Matthew Valasik, Raymond Ahn, Peter Elliott, and Kyle Luh for their continued assistance on parts of the paper, and Mason A. Porter of the Oxford Centre for Industrial and Applied Mathematics of the University of Oxford for a number of insightful discussions.
This work was made possible by funding from NSF grant DMS-1045536, NSF grant DMS-0968309, ONR grant N000141010221, AFOSR MURI grant FA9550-10-1-0569, and ONR grant N000141210040.

A Other metrics

In some cases it is useful to look beyond purity and the $z$-Rand score, which we discussed in Sections 4 and 5. Hence we also define metrics that measure the gang homogeneity within clusters, the gang heterogeneity between clusters, and the accuracy of the geographical placement of our clusters. To give an impression of how our data performs for these metrics, we give the order of magnitude of their typical values observed as averages over 10 $k$-means runs.

Recall from Section 4 that $w_{11}$ is the number of pairs which belong both to the same cluster in our $k$-means clustering and to the same gang. Analogously, $w_{10}$, $w_{01}$, and $w_{00}$ are the numbers of pairs which are in the same $k$-means cluster but different gangs, different $k$-means clusters but the same gang, and different $k$-means clusters and different gangs, respectively (see e.g. [23, 37] and references therein). Considering the error bars, the choice of $\alpha$ does not matter too much for $w_{11} \approx 6{,}000$ and $w_{01} \approx 9{,}800$. As long as $\alpha < 1$ it also does not matter much for $w_{10} \approx 10{,}000$ and $w_{00} \approx 250{,}000$.

We define ingroup homogeneity as the probability of choosing two individuals belonging to the same gang if we first randomly pick a cluster (with equal probability) and then randomly choose two people from that cluster. We also define a scaled ingroup homogeneity, by taking the probability of choosing a cluster proportional to the cluster size. Analogously, we define the outgroup heterogeneity as the probability of choosing two individuals belonging to different gangs if we first pick two different clusters at random and then choose one individual from each cluster. The scaled outgroup heterogeneity again weights the probability of picking a cluster by its size.
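The pair counts and the ingroup homogeneity defined above can be computed directly from two label lists. The sketch below is illustrative only: the function names and the toy labels are hypothetical, and it assumes that clusters with a single member are skipped when computing homogeneity, since two people cannot be drawn from them (the text does not say how such clusters are treated).

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(clusters, gangs):
    """Return (w11, w10, w01, w00) over all unordered pairs of individuals."""
    w = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for i, j in combinations(range(len(gangs)), 2):
        key = (bool(clusters[i] == clusters[j]), bool(gangs[i] == gangs[j]))
        w[key] += 1
    return w[(True, True)], w[(True, False)], w[(False, True)], w[(False, False)]


def ingroup_homogeneity(clusters, gangs, scaled=False):
    """Probability that two people drawn from a random cluster share a gang.

    Unscaled: every cluster is equally likely. Scaled: a cluster is picked
    with probability proportional to its size. Singleton clusters are
    skipped (an assumption of this sketch).
    """
    members = {}
    for c, g in zip(clusters, gangs):
        members.setdefault(c, []).append(g)
    total = weight_sum = 0.0
    for gang_list in members.values():
        m = len(gang_list)
        if m < 2:
            continue
        same_gang_pairs = sum(comb(n_g, 2) for n_g in Counter(gang_list).values())
        weight = m if scaled else 1.0
        total += weight * same_gang_pairs / comb(m, 2)
        weight_sum += weight
    return total / weight_sum


# Toy labels (hypothetical): nine individuals, three clusters, three gangs.
clusters = [0, 0, 0, 1, 1, 2, 2, 2, 2]
gangs = ["a", "a", "b", "b", "b", "c", "c", "c", "a"]

print("w11, w10, w01, w00 =", pair_counts(clusters, gangs))
print("ingroup homogeneity:", ingroup_homogeneity(clusters, gangs))
print("scaled ingroup homogeneity:", ingroup_homogeneity(clusters, gangs, scaled=True))
```

In this toy example the small, perfectly homogeneous cluster pulls the unscaled value above the scaled one, which is the same direction of effect as the drop from the unscaled to the scaled ingroup homogeneity reported for the Hollenbeck clusterings.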
We see a sharp drop in ingroup homogeneity when going from the unscaled ($\approx 0.58$) to the scaled ($\approx 0.40$) version, indicating the presence of a lot of small clusters, which are likely to be very homogeneous but have a small chance of being picked in the scaled version. This effect is not present for the outgroup heterogeneity ($\approx 0.96$ for both the scaled and the unscaled version) because the small cluster effect is tiny compared to the overall heterogeneity.

We also compare the centroids of our clusters (the average of the positions of all individuals in a cluster) in space to the centroids based on the true gang affiliations. The Hausdorff distance is the maximum distance one has to travel to get from a cluster centroid to its nearest gang centroid or vice versa. We define $M$ as the average of these distances, instead of the maximum. For comparison, the maximum distance between two individuals in the data set is 10,637 meters. The Hausdorff distance ($\approx 2200$ meters) does not change much with $\alpha$ (but the standard deviation is very large when $\alpha = 1$). Surprisingly, the average distance $M$ is minimal ($\approx 450$ meters) for $\alpha = 1$, about 100 meters less than for $\alpha < 1$. The large difference between $M$ and the Hausdorff distance for any $\alpha$ indicates that most centroids are clustered close together, but there are some outliers.

The cluster distance (code from [8]) computes the ratio of the optimal transport distance between the centroids of our clustering and the ground truth and a naive transport distance which disallows the splitting up of mass. The underlying distance between centroids is given by the optimal transport distance between clusters. This distance ranges between 0 and 1, with low values indicating a significant overlap between the centroids. The cluster distance ($\approx 0.29$) is significantly better if $\alpha < 1$, showing a significant geographic overlap between the spectral clustering and the clustering by gang.

B Different number of clusters

In this section we briefly discuss results obtained for values of $k$ different from 31. Note that most of the metrics discussed in Section 4 and Appendix A are biased towards having either more or fewer clusters. For example, as discussed in Section 4, purity is biased towards more clusters. Indeed, we computed the values of all the metrics for $k \in \{5, 25, 30, 35, 60\}$ and noticed that the biased metrics behave as a priori expected, based on their biases. This means most of the metrics are bad choices for comparing results obtained for different values of $k$.

The exception to this is the $z$-Rand score, which does allow us to compare clusterings at different values of $k$ to the gang affiliation ground truth. We computed the $z$-Rand scores for clusterings obtained for a range of different values of $k$, between 5 and 95. The results can be seen in Figure 5. As can be seen from this figure, the $z$-Rand score has a maximum around $k = 55$, although most $k$ values between about 25 and 65 give similar results, within the range of one standard deviation. We see that, as measured by the $z$-Rand score, the quality of the clustering is quite stable with respect to $k$.

[Figure 5: horizontal axis $k$ (0 to 100), vertical axis $z$-Rand score (40 to 180); one line each for $\alpha = 0.2$, $0.4$, $0.6$, and $0.8$.]

Figure 5: The mean $z$-Rand score over 10 $k$-means runs, plotted against different values of $k$. The different lines correspond to different values for $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. The error bars indicate the standard deviation.

C Rank-one matrix updates

Here we give details explaining how the eigenvectors of a symmetric matrix $W$ change when we add a constant matrix. Assume for simplicity (footnote 7) that we want to know the eigenvalues of $W + C$, where $C$ is an $N$ by $N$ ($N = 748$) matrix whose entries $C_{i,j}$ are all 1.
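As a numerical aside to this setup, the effect of adding the all-ones matrix can be explored on a small example. The sketch below is not part of the original analysis: it uses a random symmetric matrix as a hypothetical stand-in for $W$ and a small $N$, and it checks two standard facts about rank-one updates that the remainder of this appendix builds on, namely that the eigenvalues of $W + C$ interlace those of $W$, and that the eigenvectors of $W + C$ agree, up to sign and normalization, with $Q$ applied to the vector with entries $z_j / (d_j - \lambda_i)$, where $z = Q^T b$ and $b$ is the all-ones vector. It assumes simple eigenvalues and nonzero $z_j$, which holds generically for a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                    # small stand-in for N = 748
M = rng.standard_normal((N, N))
W = (M + M.T) / 2                        # hypothetical symmetric matrix
b = np.ones(N)
C = np.outer(b, b)                       # all-ones matrix, C = b b^T

d, Q = np.linalg.eigh(W)                 # W = Q diag(d) Q^T, d ascending
lam, V = np.linalg.eigh(W + C)           # eigenpairs of the updated matrix
z = Q.T @ b                              # Q is orthogonal, so Q^{-1} = Q^T

# Adding the positive semidefinite rank-one matrix C moves every eigenvalue
# up, but not past the next eigenvalue of W.
interlaced = bool(np.all(d <= lam + 1e-12) and np.all(lam[:-1] <= d[1:] + 1e-12))

max_error = 0.0
for i in range(N):
    x = z / (d - lam[i])                 # unnormalized i-th eigenvector of D + z z^T
    v = Q @ (x / np.linalg.norm(x))      # candidate eigenvector of W + C
    error = min(np.linalg.norm(v - V[:, i]), np.linalg.norm(v + V[:, i]))
    max_error = max(max_error, error)

print("eigenvalues interlace:", interlaced)
print("largest eigenvector mismatch:", max_error)
```

The eigenvectors are normalized numerically here, so the check does not depend on the particular form of the normalization constant.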
Let $Q$ be a matrix that has as $i$th column the eigenvector $v_i$ of $W$ with corresponding eigenvalue $d_i$. Let $D$ be the diagonal matrix containing these eigenvalues; then we have the decomposition $W = Q D Q^T$. Write $b$ for the $N$ by 1 vector with entries $b_i = 1$, such that $C = b b^T$. If we write $z := Q^{-1} b$ then

$$W + C = Q (D + z z^T) Q^T = Q (X \Lambda X^T) Q^T,$$

where $X$ has the $i$th eigenvector of $D + z z^T$ as $i$th column and $\Lambda$ is the diagonal matrix with the corresponding eigenvalues $\lambda_i$. We are interested in $QX$, which is the matrix containing the eigenvectors of $W + C$. According to [6] and [13, Lemma 2.1] (footnote 8) we have for the $i$th column of $X$:

$$X_{:,i} = c_i \left( \frac{z_1}{d_1 - \lambda_i}, \ldots, \frac{z_N}{d_N - \lambda_i} \right)^T,$$

with normalization constant $c_i = \sqrt{\sum_{j=1}^N \frac{z_j^2}{(d_j - \lambda_i)^2}}$. Now

$$(QX)_{k,i} = Q_{k,:} \cdot X_{:,i} = Q_{k,:} \cdot c_i \left( \frac{Q^{-1}_{1,:} \cdot b}{d_1 - \lambda_i}, \ldots, \frac{Q^{-1}_{N,:} \cdot b}{d_N - \lambda_i} \right)^T = c_i \sum_{l,m=1}^N \frac{Q_{k,l} Q^{-1}_{l,m} b_m}{d_l - \lambda_i}.$$

Since $b_m = 1$ for all $m$ we have $(QX)_{k,i} = c_i \sum_{m=1}^N (Q F Q^{-1})_{k,m}$, where $F$ is the diagonal matrix with entries $F_{ll} = \frac{1}{d_l - \lambda_i}$. Since $Q$ has the eigenvectors $v_l$ as columns and $Q^{-1}$ is its transpose we conclude

$$(QX)_{k,i} = c_i \sum_{m=1}^N \left[ (v_1, \ldots, v_N) \left( \frac{1}{d_1 - \lambda_i} v_1, \ldots, \frac{1}{d_N - \lambda_i} v_N \right)^T \right]_{k,m} = c_i \sum_{m,l=1}^N (v_l)_k \frac{1}{d_l - \lambda_i} (v_l)_m.$$

Finally, since the eigenvectors $v_l$ are normalized we find that the $k$th component of the $i$th new eigenvector is given by

$$(QX)_{k,i} = c_i \sum_{l=1}^N \frac{(v_l)_k}{d_l - \lambda_i}.$$

Also, according to [6, Theorem 1], the eigenvalues $\lambda_i$ are given by $\lambda_i = d_i + N^2 \mu_i$, for some $\mu_i \in [0, 1]$ which satisfy $\sum_{i=1}^N \mu_i = 1$.

If we apply this idea to our geosocial eigenvectors, we see in Figure 6 that most of the eigenvalues of $W$ and $W + C$ (footnote 7) are close to zero and hence close to each other. Only among the first couple dozen are there large differences. This means that most of the new eigenvectors are more or less equally weighted sums of all the old eigenvectors belonging to the small eigenvalues and hence lose most structure. It is therefore up to the relatively few remaining eigenvectors (those corresponding to the larger eigenvalues) to pick up all the relevant structure. This might be an explanation of why hotspots appear.

Footnote 7: Note that what we are doing in our simulations is slightly more complicated: we use $\alpha n (S + C) + (1 - \alpha) e^{-d_{i,j}^2 / \sigma^2}$, so in addition to adding a constant matrix, $S$ is multiplied by a normalization factor $n = (\max_{i,j} (S_{i,j} + 1))^{-1}$.
Footnote 8: In order to use this result we need to assume that all the eigenvalues $d_i$ are simple, i.e. $W$ should have different eigenvalues. This might not be a completely true assumption in our case, although it typically holds for most eigenvalues unless $W$ has a well separated block diagonal structure.

[Figure 6: two panels, titled "First 100 eigenvalues of normalized W using S=A and alpha=0.4" and "First 100 eigenvalues of normalized W using S=n(A+C) and alpha=0.4"; horizontal axis 0 to 100, vertical axis 0 to 1.]

Figure 6: Left: The first 100 eigenvalues of $D^{-1} W$, with $S = A$ and $\alpha = 0.4$. Bottom: The first 100 eigenvalues of $D^{-1} W$, with $S = n(A + C)$ and $\alpha = 0.4$.

References

[1] Cuomo announces $23 million grant to Los Angeles to transform public housing and help residents get jobs. HUD News Release, 98-402 (August 18 1998).
[2] Allahverdyan, A., Ver Steeg, G., and Galstyan, A. Community detection with and without prior information. EPL (Europhysics Letters) 90, 1 (2010), 18002.
[3] Bertozzi, A. L., and Flenner, A. Diffuse interface models on graphs for analysis of high dimensional data. Multiscale Modeling and Simulation 10, 3 (2012), 1090–1118.
[4] Brantingham, P. J., Tita, G. E., Short, M. B., and Reid, S. E. The ecology of gang territorial boundaries. Criminology 50, 3 (2012), 851–885.
[5] Buades, A., Coll, B., and Morel, J. M. A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4, 2 (2005), 490–530.
[6] Bunch, J. R., Nielsen, C. P., and Sorensen, D. C. Rank-one modification of the symmetric eigenproblem. Numer. Math. 31, 1 (1978/79), 31–48.
[7] Chen, Y., and Li, S. X. Group identity and social preferences. The American Economic Review 99, 1 (2009), 431–457.
[8] Coen, M. H., Ansari, M. H., and Fillmore, N. Comparing clusterings in space. In ICML 2010: Proceedings of the 27th International Conference on Machine Learning (2010).
[9] Decker, S. H., and Curry, D. G. Gangs, gang homicides, and gang loyalty: Organized crimes or disorganized criminals. Journal of Criminal Justice 30, 4 (2002), 343–352.
[10] Eliassi-Rad, T., and Henderson, K. Literature Search through Mixed-Membership Community Discovery, vol. 6007. Springer, 2010, p. 7078.
[11] Ghosh, R., and Lerman, K. The impact of dynamic interactions in multi-scale analysis of network structure. CoRR abs/1201.2383 (2012).
[12] Ghosh, R., Lerman, K., Surachawala, T., Voevodski, K., and Teng, S.-H. Non-conservative diffusion and its application to social network analysis. CoRR abs/1102.4639 (2011).
[13] Gu, M., and Eisenstat, S. C. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl. 15, 4 (1994), 1266–1276.
[14] Harris, M., Aubert, X., Haeb-Umbach, R., and Beyerlein, P. A study of broadcast news audio stream segmentation and segment clustering. In EUROSPEECH'99 (1999), pp. 1027–1030.
[15] Harsanyi, J., and Chang, C. Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach. Geoscience and Remote Sensing, IEEE Transactions on 32, 4 (1994), 779–785.
[16] Hartigan, J., and Wong, M. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100–108.
[17] Hegemann, R., Lewis, E., and Bertozzi, A. An "Estimate & Score Algorithm" for simultaneous parameter estimation and reconstruction of missing data on social networks. Submitted (2012).
[18] Hegemann, R., Smith, L., Barbaro, A., Bertozzi, A., Reid, S., and Tita, G. Geographical influences of an emerging network of gang rivalries. Physica A 390 (2011), 3894–3914.
[19] Hu, H., van Gennip, Y., Hunter, B., Bertozzi, A. L., and Porter, M. A. Geosocial graph-based community detection. Submitted (2012).
[20] Johnson, T. H., and Mason, M. C. No sign until the burst of fire: Understanding the Pakistan-Afghanistan frontier. International Security 32, 4 (2008), 41–77.
[21] Kilcullen, D. The Accidental Guerrilla: Fighting Small Wars in the Midst of a Big One. University of Oxford Press, Oxford, 2009.
[22] Koutsourelakis, P.-S., and Eliassi-Rad, T. Finding mixed-memberships in social networks. In AAAI Spring Symposium: Social Information Processing (2008), AAAI, pp. 48–53.
[23] Meilă, M. Comparing clusterings - an information based distance. J. Multivariate Anal. 98, 5 (2007), 873–895.
[24] Mohler, G. O., Short, M. B., Brantingham, P. J., Schoenberg, F. P., and Tita, G. E. Self-exciting point process modeling of crime. J. Amer. Statist. Assoc. 106, 493 (2011), 100–108.
[25] Mucha, P. J., Richardson, T., Macon, K., Porter, M. A., and Onnela, J.-P. Community structure in time-dependent, multiscale, and multiplex networks. Science 328, 5980 (2010), 876–878.
[26] Newman, M. Modularity and community structure in networks. PNAS 103, 23 (2006), 8577–8582.
[27] Newman, M. E. J., and Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 2 (2004), 026113.
[28] Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14 (2002), MIT Press, Cambridge, pp. 849–856.
[29] Olivo, A. Leaders praise housing complex. L.A. Times (March 2, 2001).
[30] Porter, M. A., Onnela, J.-P., and Mucha, P. J. Communities in networks. Notices Amer. Math. Soc. 56, 9 (2009), 1082–1097, 1164–1166.
[31] Radil, S., Flint, C., and Tita, G. Spatializing social networks: Using social network analysis to investigate geographies of gang rivalry, territoriality, and violence in Los Angeles. Annals of the Association of American Geographers 100, 2 (2010), 307–326.
[32] Shi, J., and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
[33] Short, M., Mohler, G., Brantingham, P., and Tita, G. Gang rivalry dynamics via coupled point process networks. Submitted (2012).
[34] Tita, G., and Radil, S. Spatializing the social networks of gangs to explore patterns of violence. Journal of Quantitative Criminology 27 (2011), 521–545.
[35] Tita, G., Riley, K., Ridgeway, G., Grammich, C., Abrahamse, A., and Greenwood, P. Reducing gun violence: Results from an intervention in East Los Angeles. Natl. Inst. Justice, RAND (2003).
[36] Traud, A. L., Frost, C., Mucha, P. J., and Porter, M. A. Visualization of communities in networks. Chaos 19, 4: 041104 (2009).
[37] Traud, A. L., Kelsic, E. D., Mucha, P. J., and Porter, M. A. Comparing community structure to characteristics in online collegiate social networks. SIAM Rev. 53, 3 (2011), 526–543.
[38] van Gennip, Y., Hu, H., Hunter, B., and Porter, M. A. Geosocial graph-based community detection. Submitted (2012).
[39] von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 17, 4 (2007), 395–416.
[40] Wasserman, S., and Faust, K. Methods and applications. Cambridge University Press, Cambridge, UK, 1994.
[41] Watanabe, K., Ochi, M., Okabe, M., and Onai, R. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In Proceedings of the 20th ACM international conference on Information and knowledge management (2011), pp. 2541–2544.
[42] Yu, S., and Shi, J. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV03) (2003), IEEE, pp. 313–319.
[43] Yuhas, R., Goetz, A., and Boardman, J. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Summaries of the Third Annual JPL Airborne Geoscience Workshop (1992), vol. 1, Pasadena, CA: JPL Publication, pp. 147–149.
[44] Zelnik-Manor, L., and Perona, P. Self-tuning spectral clustering. Advances in neural information processing systems 17 (2004), 1601–1608.
