Selective Harvesting over Networks


Authors: Fabricio Murai, Diogo Rennó, Bruno Ribeiro, Gisele L. Pappa, Don Towsley, Krista Gile

Data Mining and Knowledge Discovery manuscript No. (will be inserted by the editor)

Fabricio Murai · Diogo Rennó · Bruno Ribeiro · Gisele L. Pappa · Don Towsley · Krista Gile

Received: date / Accepted: date

Abstract. Active search on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights (encoding pairwise similarities) under a query budget constraint. However, in most current networks, nodes, network topology, network size, and edge weights are all initially unknown. In this work we introduce selective harvesting, a variant of active search where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario can suffer from what we call a tunnel vision effect: without any recourse to independent sampling, the urge to only query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of active search methods and standard classifiers. We demonstrate that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as a weighted ensemble, but by switching between the classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried in the future.
This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. Based on these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, which outperforms all competing methods on five real network datasets in our evaluation and exhibits comparable performance on the other two.

F. Murai, D. Rennó · G. L. Pappa — Universidade Federal de Minas Gerais, Brazil. E-mail: {murai,renno,glpappa}@dcc.ufmg.br
B. Ribeiro — Purdue University. E-mail: ribeiro@cs.purdue.edu
D. Towsley · K. Gile — University of Massachusetts Amherst. E-mail: towsley@cs.umass.edu; gile@math.umass.edu

1 Introduction

Active search on graphs [15, 27, 40] is a technique for finding the largest number of target nodes – i.e., nodes with a certain label – in a network by querying nodes in a weighted graph, under a query budget constraint. Nodes have hidden labels but the network topology and edge weights are fully observable and any node can be queried at any time. Edge weights encode some form of node similarity that can be used to improve querying efficiency. Unfortunately, edge weights, network topology and node information are rarely available for download from one centralized place (except by the network's owner, if any). As a result, today's prevalent method to collect network data is to query neighbors of already queried nodes (crawling). Like active search on graphs, other similar techniques such as learning to crawl [17, 30] also assume that edge weights between the queried nodes and their neighbors are observed.
But in a variety of network crawling problems, such as crawling online social networks, (micro)blog networks, and citation networks, a node query often reveals only node attributes. This process poses an entirely new set of challenges for active search and other similar methods. In this paper we introduce selective harvesting, where the goal is the same as in active search, but instead of assuming that the network topology is given, node querying is subject to a partial and evolving understanding of the network. More precisely, the knowledge about the network is restricted to the set of queried nodes and their connections to the rest of the network. Selective harvesting starts from a seed node (typically a target) and proceeds by querying nodes from the border set, i.e., neighbors of already queried nodes. Selective harvesting generalizes active sampling, a similar task where node attributes are not observed [31]. By leveraging the information contained in these attributes, selective harvesting algorithms can attain better performance in applications of active sampling, such as (i) identifying students involved in academic dishonesty at a college/university; (ii) investigating securities fraud; and (iii) identifying students who smoke/drink for intervention purposes. In these cases, target nodes are individuals that have a given trait. Training a classifier for selective harvesting is challenging because the classifier must be trained over observations that depend on previous choices of the same classifier, the hidden network topology, and the distribution of node features over the network. We call this the tunnel vision effect. Unlike active search, selective harvesting has no recourse to true randomness or sample independence that could ease the tunnel vision effect. Under partially observed networks, traditional active search methods perform quite poorly.
We discover that it is possible to collect a much larger set of target nodes by using a round-robin scheme, which switches between different types of classifiers (e.g., Logistic Regression, Random Forests) when predicting labels at different steps. We show that this strategy collects more target nodes by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried in the future. Based on these observations, we propose Directed Diversity Dynamic Thompson Sampling (D3TS), a Multi-Armed Bandit (MAB) algorithm for non-stationary stochastic processes that intelligently selects a classifier at each step to decide which neighbor to query. This is in sharp contrast with ensemble techniques, which combine predictions from several classifiers at each step. We show that these techniques (e.g., bagging and boosting) do not perform as well as D3TS due to the tunnel vision effect. Unlike typical MAB problems, where there is a clear exploration and exploitation tradeoff, the standard MAB approach, which forces convergence to the "best classifier", would be suboptimal in the presence of the tunnel vision effect. This gives rise to what we refer to as an exploration, exploitation, and diversification tradeoff.

[Fig. 1 Lines show the (scaled) average number of targets found by round-robin, five naïve classifiers (MOD, Active Search, SV Regression, Random Forest, ListNet) and D3TS against the total number of queries (t) on CiteSeer (NIPS papers as targets). Shadows indicate 95% confidence intervals over 80 runs, each starting at a seed chosen uniformly from the target population. Surprisingly, round-robin use of five classifiers (including poor-performing ones) outperforms any single classifier in the CiteSeer network. We also see that the best-performing active search method (Wang et al. [40]) has its relative accuracy eroded over time (we will see that this is likely due to the tunnel vision effect). We include the proposed method's (D3TS) results, which are consistently better than all competing methods for t ≥ 500.]

D3TS aims to induce continual diversification w.r.t. training data and potential node choices by using multiple distinct classifiers, which plays a similar role to sample independence and eases the tunnel vision effect. Interestingly, we find that even a round-robin selection of five distinct classifiers often performs better than just using the best classifier or the best active search method for each dataset. Consider the simulation results shown in Figure 1 (the simulation is further explained in Section 6.1; for now we focus only on the overall results). Figure 1 shows the number of queries (x-axis) against the number of target nodes found in the CiteSeer paper co-citation network (NIPS papers as targets), normalized by the number of target nodes found by a round-robin selection of five distinct simple classifiers (y-axis); the details of these simple classifiers are given in Section 3. Note that over time the cumulative gain of the best active search method for this dataset (Wang et al. [40]) slowly erodes until it is worse than the naïve round-robin approach. Our analysis shows that this erosion can be attributed to the tunnel vision effect. Each of the five simple classifiers, when used on its own, is consistently outperformed by the round-robin approach, and the best such classifiers also suffer from performance erosion over time. In contrast, our proposed method, D3TS, consistently and significantly outperforms state-of-the-art methods, the round-robin approach, and naïve approaches. The contributions of this work are as follows:
1. Formulation and characterization of selective harvesting and classifier diversity: We introduce selective harvesting and show that state-of-the-art methods such as active sampling [31, 9] and active search [15, 40, 27] perform poorly in this setting. We show that switching between various classifiers helps achieve greater performance. This works not because we are exploring classifiers in order to find the best one, nor because we are combining their predictions as an ensemble. Instead, the use of multiple classifiers helps improve accuracy in two complementary ways. It achieves border set diversity, by exploring different regions and thus avoiding remaining in a region where target nodes have been depleted. It also achieves training sample diversity, where diverse classifiers create enough diversity of observations to ease the tunnel vision effect.

[Fig. 2 Representation of the search state over an unknown graph G after t = 4 steps. Solid nodes and edges show the subgraph G̃_t. Black nodes represent queried nodes. Unknown labels of nodes in B_t are represented by a question mark "?".]

2. Directed Diversity Dynamic Thompson Sampling (D3TS): We propose D3TS, a method for selective harvesting that combines different classifiers, and show that it consistently outperforms state-of-the-art methods. We evaluate the proposed framework on several real-world networks and observe that D3TS outperforms all tested methods on five out of seven datasets and exhibits similar performance on the other two.¹

Outline. In § 2 we formalize the selective harvesting problem and present a generic algorithm for solving it. In § 3 we describe existing and potential approaches to solve this problem and show that the tunnel vision effect hurts their performance.
In § 4 we investigate why classifier diversity – i.e., using multiple classifiers – can mitigate the tunnel vision effect. We propose D3TS in § 5. Datasets and results of our evaluation are described in § 6. Related work is described in § 7. In § 8 we discuss alternatives to the proposed method and explain why they cannot be applied or why they do not perform well. Last, our conclusions are presented in § 9.

2 Problem Formulation

In this section we formalize the selective harvesting problem and introduce notation used throughout this work. Let G = (V, E) denote an undirected graph representing the network topology. Each node v ∈ V has M attributes (domain-related properties of the nodes), encoded without loss of generality as an attribute vector a_v ∈ R^M. In active search problems, the goal is to find a large set of nodes in V that satisfy a given search criterion (e.g., nodes that exhibit a given attribute) under the constraint that no more than T nodes can be queried. The search criterion is a boolean function f : V → {0, 1}. Formally, let V⁺ ⊂ V be the set of all target nodes, i.e., all v such that f(v) = 1. We define node labels y_v as

$$y_v = f(v) = \begin{cases} 1 & \text{if } v \in V^+, \\ 0 & \text{otherwise}, \end{cases} \qquad \forall v \in V.$$

(Footnote 1: The software and scripts to reproduce the results presented in this work are available as an R package at http://bitbucket.com/after-acceptance. All the data used in this work is publicly available from different sources.)

                                      PNB [31]  SN-UCB1 [9]  MOD [5]  AS [40]  D3TS (ours)
  Unknown network                        ✓          ✓           ✓        -          ✓
  Uses node features                     -          -           -        -          ✓
  Unknown neighbor attributes            -          -           -        -          ✓
  Fits model to evolving observations    -          ✓           -        -          ✓
  Scalable                               -          ✓           ✓        ✓          ✓

Table 1 Comparison of heuristics for selective harvesting: Active Sampling (PNB), Social Network UCB1 (SN-UCB1), Maximum Observed Degree (MOD), and Active Search (AS).

Selective harvesting is a variant of active search.
In active search, the topology is assumed to be known. In selective harvesting, the search is subject to a limited but evolving knowledge of the network. This knowledge is expanded by querying nodes in V, which reveals their labels, neighbors and attribute vectors. A set of pre-queried nodes Q_0 ⊂ V is given as input (typically consisting of one target node). Subsequent queries are restricted to neighbors of already queried nodes. At any step t, nodes belong to one of three sets: Q_t, the set of previously queried nodes; B_t, the set of neighbors of queried nodes that have not been queried (referred to as border nodes or the border set); or W_t, the set of unobserved nodes, which are invisible to the algorithm. Figure 2 illustrates a snapshot of the search process (see caption for details). Let G̃_t = (Q_t, Ẽ_t) denote the subgraph of G given by the subgraph induced by nodes in Q_t ∪ B_t minus the edges in the subgraph induced by B_t (i.e., G̃_t contains all edges between nodes in Q_t plus the edges connecting Q_t to B_t). The graph G̃_t is the portion of the network visible at step t. In G̃_t, label y_v is known only for nodes in Q_t.

Generic solution. Given an initial input graph G̃_0, an algorithm for selective harvesting must decide at each step t = 1, ..., T what action to take, i.e., what border node v ∈ B_t to query, given the currently available network information. This action returns v's label, attributes and connections, which are included as additional input to the search in step t + 1. Node v's label (0 or 1) can be thought of as the payoff obtained by querying that node. The algorithm's output is the list of target nodes found in T steps. The best algorithm is the one that yields the largest total payoff, i.e., the largest number of target nodes.
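The generic procedure above can be sketched as a loop over the border set. The graph, labels, and scoring heuristic below are illustrative placeholders of ours (the scorer simply counts queried target neighbors, using only information visible in the observed graph), not the paper's method:

```python
# Minimal sketch of the generic selective-harvesting loop (toy data).
ADJ = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"},
       "d": {"b", "c", "e"}, "e": {"d"}}
IS_TARGET = {"a": True, "b": False, "c": True, "d": True, "e": False}

def harvest(seed="a", budget=3):
    queried = {seed}
    border = ADJ[seed] - queried
    found = [seed] if IS_TARGET[seed] else []
    for _ in range(budget):
        if not border:
            break
        # Score each border node using the observed graph only:
        # labels are known just for already-queried neighbors.
        scores = {v: sum(IS_TARGET[u] for u in ADJ[v] & queried)
                  for v in border}
        v = max(sorted(border), key=scores.get)  # decide which node to query
        border.remove(v)
        queried.add(v)                           # query reveals label + neighbors
        if IS_TARGET[v]:
            found.append(v)
        border |= ADJ[v] - queried               # the observed graph expands
    return found
```

In a real instance, the payoff of a run is simply `len(found)` after `budget` queries, and the scorer would be replaced by one of the classifiers discussed below.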
3 Background

In this section, we review methods for searching networks that can be used for, or adapted to, selective harvesting. These methods exploit the correlation between labels of connected nodes to find targets. In addition, we review statistical models that could be used as an alternative (data-driven) approach. In contrast to existing methods, this approach can leverage node attributes by training a statistical model to infer a node's label from the observed graph. As a slight abuse of terminology, we may refer to existing methods and base learners generically as classifiers, since both are used to infer border nodes' labels.

3.1 Existing methods

A few works in the literature provide methods that can be used for or adapted to selective harvesting. A subclass of selective harvesting methods known as active sampling [31, 9] does not account for node attributes. Our problem is closely related to the graph-theoretic myopic budgeted online covering problem [5, 22, 10]. In this problem, all nodes are relevant (equivalently, all nodes are targets) and the task is to find a connected set of nodes that yields the largest cover (i.e., the largest set Q_t ∪ B_t). The closest problem to ours is that addressed by active search on graphs [15, 40, 27], where nodes have hidden labels but the topology and edge weights are fully observed and any node can be queried at any time. Algorithms for myopic budgeted online covering and active search can be adapted for selective harvesting; active sampling methods require little or no modification. We adapt four representative methods of the above to selective harvesting: Active Sampling [31] (PNB – in reference to the authors' surnames), Maximum Observed Degree (MOD) [5], Social Network UCB1 (SN-UCB1) [9], and Active Search (AS) [40]. Table 1 summarizes the key differences between these methods and the proposed method, D3TS.
Active Sampling (PNB): PNB is a representative algorithm from the class of active sampling approaches proposed in [31]. PNB estimates a border node's payoff value y_v using a weighted average of the payoffs of observed nodes two hops away from v, where the weights are the numbers of common neighbors with v. Border nodes are included among these observed nodes, requiring all payoffs to be collectively estimated by a label propagation procedure based on Gibbs sampling. PNB also tracks a running average of payoff values acquired from random jumps, which we do not allow in our simulations since these are not possible in selective harvesting. Please see [31] for a detailed description of PNB's parameters.

Social Network UCB1 (SN-UCB1): the SN-UCB1 search algorithm proposed in [9] divides border nodes into equivalence classes and samples from these classes using a multi-armed bandit algorithm. Equivalence classes are composed of all border nodes connected to the same set of queried nodes. These classes are volatile: they split, disappear and appear over time, requiring the use of a variant of UCB1 called VUCB1. Although this method learns about the equivalence classes, it does not learn a statistical model that can account for node attributes. Like selective harvesting, it assumes partial but evolving knowledge of the network.

Maximum Observed Degree (MOD): MOD is a myopic algorithm proposed in [5] to maximize the network cover as it explores a graph. MOD is the optimal greedy cover algorithm in a finite random power-law network (under the Configuration Model [28]) with degree distribution coefficient either one or two. In our simulations we adapt MOD to select the border node with the maximum number of target neighbors in the queried set (ties are resolved randomly).
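Our MOD adaptation above (query the border node with the most target neighbors in the queried set, breaking ties at random) can be sketched as follows; the graph and labels are toy placeholders:

```python
import random

# Toy observed graph; labels are known only for queried nodes.
ADJ = {1: {2, 3}, 2: {1, 4}, 3: {1, 4, 5}, 4: {2, 3}, 5: {3}}
LABEL = {1: 1, 2: 0}

def mod_pick(border, queried):
    """MOD adapted to selective harvesting: choose the border node with
    the maximum number of target neighbors in the queried set."""
    def target_neighbors(v):
        # Count neighbors of v that are both queried and targets.
        return sum(LABEL[u] for u in ADJ[v] if u in queried)
    best = max(map(target_neighbors, border))
    ties = [v for v in sorted(border) if target_neighbors(v) == best]
    return random.choice(ties)        # ties are resolved randomly

# Here node 3 touches queried target 1; nodes 4 and 5 touch no queried targets.
```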
From the expected excess degree results in [5], such border nodes are rich in target neighbors provided that the underlying network exhibits strong homophily with respect to node labels.

Active Search (AS): this method, proposed by Wang et al. [40], attempts to find target nodes by assuming that labels are defined by a smooth function over the graph edges. To estimate the unknown labels, it attaches to each labeled instance a virtual node containing the instance's label and then performs label propagation on the original graph. It assumes that the graph is known, which allows it to estimate the future impact of choosing a given border node. We adapt Active Search to run label propagation only on the observed graph. (Footnote 2: Although the method proposed by Wang et al. [40] is outperformed by a more recent proposal [27] in active search problems, we found the opposite to be true when the graph is not fully observable. In addition to being highly sensitive to the parameterization, the most recent method computes and stores a dense correlation matrix between all visible nodes, which is hard to scale beyond 10^5 nodes.)

3.2 Data-driven methods

A data-driven selective harvesting algorithm trains a statistical model to estimate the expected payoff μ_t(v) obtained from querying border node v ∈ B_t, based on v's relationship with the observed graph G̃_t at step t. We encode this relationship as a "local" feature vector x_{v|G̃_t}, which we describe next. Note that v's features differ from v's attributes (denoted by a_v). Since v's attributes are not observable until it is queried, we compute v's local features from the observed graph G̃_t to use as training data for base learners.

Feature design. We define features for each border node v ∈ B_t. They are divided into:

– Pure structural features: observed degree and number of triangles formed with observed neighbors.
– Structure-and-attribute blends: number and fraction of target neighbors; number and fraction of triangles formed with two non-target (and with two target) neighbors; number and fraction of neighbors mostly surrounded by target nodes; fraction of neighbors that exhibit each node attribute; and the probability of finding a target exactly after two random walk steps from the border node.³ (Footnote 3: Other seemingly obvious features (e.g., number of non-target neighbors) are not considered due to collinearity. Longer random walk paths are too expensive to be used in most real networks.)

We build upon features typically used in the literature [33, 34]. We also use a Random Walk (RW) transient distribution to build features: we consider the expected payoff observed by a RW that departs from node u ∈ B_t and performs two steps, given by

$$x^{(\mathrm{RW})}_{u|\tilde{G}_t} = \frac{\sum_{(u,v)\in \tilde{E}_t}\; \sum_{(v,w)\in \tilde{E}_t,\, w \in Q_t} y_w}{C_{u|\tilde{G}_t}} \qquad (1)$$

where C_{u|G̃_t} is the number of such paths of length two in G̃_t. Note that the RW is not restricted to the immediate neighbors of u. Also, this is not an average among the nodes two hops away from u; this feature depends on the connectedness of the border node's neighborhood in the observed graph.

Base learners. The feature vector described above can be given as input to any learning method able to generate a ranking of border nodes. We consider classification, regression and ranking methods as suitable candidates for this task. The classification representatives include Logistic Regression and Random Forests, because they provide ways to rank border nodes according to how confident the model is that each border node is a target. Exponentially Weighted Least Squares (EWLS) and Support Vector Regression are included by modeling the task as a regression problem, and the list-wise learning-to-rank method ListNet [11] for directly outputting ranks. We briefly describe EWLS and ListNet below and refer the reader to [12] for descriptions of the other methods.
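The two-step random-walk feature of Eq. (1) can be computed directly from the observed graph. In the sketch below, the node names and toy data are ours, and our reading of the normalizer C (counting the same u → v → w paths that appear in the numerator, i.e., those ending at a queried node) is an assumption:

```python
# Sketch of the two-step random-walk feature of Eq. (1) on a toy
# observed graph G~_t. Labels y_w are known only for queried nodes.
EDGES = {("u", "a"), ("u", "b"), ("a", "q1"), ("a", "q2"), ("b", "q1")}
ADJ = {}
for s, t in EDGES:                    # build a symmetric adjacency list
    ADJ.setdefault(s, set()).add(t)
    ADJ.setdefault(t, set()).add(s)
QUERIED = {"a", "b", "q1", "q2"}      # Q_t (toy); "u" is a border node
Y = {"a": 0, "b": 0, "q1": 1, "q2": 0}

def rw_feature(u):
    # Enumerate length-two paths u -> v -> w with w in Q_t,
    # then average the labels y_w over those paths.
    paths = [(v, w) for v in ADJ[u] for w in ADJ[v] if w in QUERIED]
    if not paths:
        return 0.0
    return sum(Y[w] for _, w in paths) / len(paths)
```

As the text notes, this is not an average over the distinct nodes two hops away: a well-connected neighbor contributes through every path that reaches it.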
Exponentially Weighted Least Squares (EWLS): computes weights w that, given a forgetting factor 0 ≪ β ≤ 1 and regularization parameter λ, minimize the loss function

$$\sum_{i=1}^{t} \beta^{t-i} \left| y_i - \mathbf{x}_i^\top \mathbf{w} \right|^2 + \beta^t \lambda \|\mathbf{w}\|^2.$$

EWLS gives more weight to recent observations. The weights w are suitable for fast online updates [26, Section 4.2]. Setting β = 1 reduces EWLS to ℓ2-regularized Linear Regression.

ListNet: this is a representative method from the list-wise approaches for learning to rank (a machine learning task where the goal is to learn how to rank objects according to their relevance to a query) [11]. It assumes that the observed ranking π is a random variable that depends on the objects' scores (where π_1 is the top-ranked object). The scores are determined by a neural network that is trained by minimizing the K-L divergence between the probability distribution over the predicted ranking π̂ and the probability distribution over a ranking π derived from ground-truth scores. In our context, P(π) is given by

$$P(\pi = \langle \pi_1, \ldots, \pi_{|B_t|} \rangle) = \prod_{i=1}^{|B_t|} \frac{\exp(y_{\pi_i})}{\sum_{j=i}^{|B_t|} \exp(y_{\pi_j})}.$$

Since the goal is not to predict object-wise relevance, all of the statistical power of this method goes into learning the ranking. As with any learning approach, in the "small data" regime (few observations collected) a base learner may perform worse than heuristic methods that assume homophily w.r.t. node labels.
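The EWLS objective above has a closed-form minimizer; the sketch below specializes it to a single scalar feature for readability (a pedagogical special case of ours; the general solution is a weighted ridge regression over feature vectors):

```python
# Closed-form minimizer of the EWLS loss for one scalar feature.
def ewls_weight(xs, ys, beta=0.99, lam=1.0):
    t = len(xs)
    # Observation i (1-indexed in the text) carries weight beta^(t - i).
    num = sum(beta ** (t - 1 - j) * xs[j] * ys[j] for j in range(t))
    den = sum(beta ** (t - 1 - j) * xs[j] ** 2 for j in range(t))
    return num / (den + beta ** t * lam)   # beta^t * lambda regularizer

# With beta = 1 this reduces to l2-regularized linear regression;
# with beta < 1, recent observations dominate the fit.
```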
To mitigate issues related to fitting a learner to few observations, and yet allow a fair comparison with the heuristic methods, we query the first 20 nodes using MOD. (Footnote 4: In comparison to other combinations of length and heuristic used in the "cold start" phase, this was found to work best.)

4 Tunnel Vision and the Power of Classifier Diversity

In selective harvesting the goal is to find the largest number of target nodes with a limited query budget. This requires methods to try to sample only promising target nodes, which causes a given classifier to gather increasingly biased training data, a phenomenon that we call the tunnel vision effect. Unfortunately, it is unlikely that we can find a method that provably compensates for this bias in our training data, Q_t. Even if we query border nodes randomly at each step, we cannot determine the probability of seeing any given node in the border set B_t, as this would require assessing the probability of all possible sample paths from the given seed nodes, which includes paths containing nodes not yet observed, i.e., nodes in W_t in Figure 2 – an infeasible task, as we do not know the network topology. This is likely why active search methods and base learners on their own do not work well for selective harvesting tasks. This is also why importance weighted sampling [7] cannot be used to remove the bias in these tasks.

To demonstrate the tunnel vision effect and show how classifier diversity can mitigate it, we conduct a large set of simulations. We simulate searches using four heuristics – MOD, PNB, Social Network UCB1 (SN-UCB1) and Active Search – and five base learners – Logistic Regression, Exponentially Weighted Least Squares (EWLS), Support Vector Regression, Random Forest and ListNet – on seven networks, and summarize the results in Table 2
  Methods               CS       DBP      WK       DC      KS       DBL      LJ
  (budget T)            (1500)   (700)    (400)    (100)   (700)    (1200)   (1200)
  PNB                   833.2*   260.6*   107.7*   24.3*   178.3*   599.5*   632.4*
  SN-UCB1               568.9*   272.3*   71.8*    23.2*   133.2*   399.1*   573.7*
  MOD ✓                 746.8*   403.0*   140.9*   35.7*   159.6*   580.3*   584.1*
  Active Search ✓       808.9*   412.2*   143.4    22.6*   215.3*   684.9*   654.2*
  Logistic Regression   764.5*   452.5    86.2*    35.8    122.1*   744.4    732.0
  Random Forest ✓       738.5*   454.0*   127.2*   37.2    215.6*   725.4    728.3*
  EWLS                  808.2*   462.4    82.5*    35.2*   142.3*   656.9*   694.4*
  SV Regression ✓       770.6*   456.3*   85.0*    37.6    205.3*   757.1*   736.1
  ListNet ✓             742.0*   448.0*   92.5*    34.4*   146.3*   730.7    742.8
  Round-Robin (all ✓)   822.2*   454.5*   135.3*   37.3    234.9*   696.0*   716.0*
  D3TS (all ✓)          851.2    464.0    144.7    37.9    247.6    729.5    737.3
  Target population     1583     725      202      56      1457     7556     1441

Table 2 Average number of targets found by each method after T queries, based on 80 runs. Datasets – CS: CiteSeer, DBP: DBpedia, WK: Wikipedia, DC: DonorsChoose, KS: Kickstarter, DBL: DBLP, LJ: LiveJournal. Budget T is respectively set to the number of targets ×1, ×1, ×2, ×2, ×1/2, ×1/6, ×5/6, truncated to hundreds. The first four rows correspond to existing methods; the five subsequent rows are base learners. Round-Robin and D3TS combine the methods marked (✓). Means whose difference to D3TS's is statistically significant at the 95% confidence level are marked (*). The best two results on each dataset are shown in bold in the original. Parameters – PNB: same as in [31]; Active Search: same as in [40]; EWLS: β = 0.99, λ = 1.0; Logistic Regression and SV Regression: penalty C set using the fast heuristic implemented in the R package LiblineaR [19]; Random Forest: number of variables = √(number of features), number of trees = 100 (DBL and LJ use classical decision trees for speed; the others use conditional inference trees [20]); ListNet: number of iterations = 100, tolerance = 10⁻⁵.
(network datasets and target populations are described in Section 6.1). We observe that the best classifier varies across datasets. More surprisingly, the best classifier for one dataset may be the worst for another (see Active Search on Wikipedia and on DonorsChoose). We then consider a set of classifiers M that typically exhibit good performance and cycle between them during the search, in a Round-Robin (RR) fashion. Based on Table 2, we pick M = {MOD, Active Search, Support Vector Regression, Random Forest, ListNet}. (Footnote 5: We choose MOD in lieu of PNB because MOD is orders of magnitude faster. Among the base learners, we choose one representative each of the regression (SV Regression), classification (Random Forest) and ranking (ListNet) methods.) We use this set of classifiers throughout the rest of this paper, unless otherwise noted. One might expect RR's performance to be the average of the performance results yielded by its standalone counterparts, but this is not the case. Interestingly, switching classifiers at each step outperforms the best classifier in M on the CiteSeer and Kickstarter datasets, and finds at least 92% as many target nodes as the best classifier on the other datasets. In what follows we investigate why the use of multiple classifiers can improve selective harvesting's performance.

4.1 Leveraging diversity through the use of multiple classifiers

We observe that RR outperforms all five classifiers in M on CiteSeer (Table 2). Consequently, at least one of them must perform better under RR than on its own. In order to identify which ones do, we show in Figure 3 the hit ratio – the number of target nodes found divided by the number of queries performed using each classifier up to time t – under RR and when each classifier is used by itself, averaged over 80 runs.

[Fig. 3 (CiteSeer: NIPS papers; x-axis: number of queried nodes t, y-axis: mean hit ratio; curves: Active Search, ListNet, SV Regression, MOD, Random Forest, standalone vs. under Round-Robin) Round-robin can have higher hit ratios for each of its classifiers than their standalone counterparts.]

Interestingly, after t = 400 all classifiers exhibit similar (relative difference ≤ 10%) or better performance under RR than when used alone. We propose two hypotheses to explain this performance improvement:

(a) Border Hypothesis: RR explores regions of the graph containing more targets that are likely to be scored high by a classifier, i.e., RR infuses diversity into the border set.
(b) Training Hypothesis: Observations from different classifiers can be used to train the others to generalize better and cope with self-reinforcing sampling biases, i.e., diversity in the training set produces a classifier that is better at finding target nodes.

Note that these hypotheses are not mutually exclusive. In what follows, we perform controlled simulations to isolate and study each hypothesis. Training set diversity directly impacts model parameters. Model parameters, in turn, determine how the border set will change. Therefore, to assess the impact of training set diversity we must hold the border set diversity constant, and vice-versa. This is the key idea behind the two controlled sets of simulations described next. To perform them, we instrumented our simulator to load, from another simulation run, (i) the feature vector x_{σ_t|G̃_t} of the node σ_t queried in step t, along with its label y_{σ_t}, and (ii) the observed graph G̃_t at each step t. In what follows, we show the results obtained using the support vector regression (SVR) model. We denote node σ_t's feature vector and label simply by x_t and y_t, respectively, to make the text easier to follow.

Border Hypothesis. Our experiment consists of three stages (Fig. 4a).
First, we store the sequence of observations (i.e., (feature vector, label) pairs) O_SVR = ((x_1, y_1), ..., (x_T, y_T)) corresponding to the nodes queried when searching a network dataset D using SVR. Second, we store the sequence of observed graphs G̃_RR = (G̃_1, ..., G̃_T) obtained when searching D by cycling between the models in the set M. Last, we simulate another SVR-based search on D, loading the observed graph at each time step t from G̃_RR. However, instead of training the SVR model with observations collected on that run (which most likely differ from those collected during the first stage), we gradually feed it the observations from O_SVR, one for each simulation step t. Therefore, we reproduce the sequence of classifiers from the first stage, but subject to a different sequence of observed graphs.

Training Hypothesis. As before, our experiment consists of three stages (Fig. 4b). In the first stage, we store the sequence of observed graphs G̃_SVR = (G̃′_1, ..., G̃′_T) obtained when searching D using an SVR model. Second, we store the sequence of observations
[Figure 4: schematic of the two controlled simulation setups, panels (a) Border Hypothesis and (b) Training Hypothesis; each panel shows an example snapshot at t = 4, where the simulator loads the observed graph, fits the model to the stored observations, and queries a node from the current border set.]
Fig. 4 (a) We study the Border Hypothesis by recreating the sequence of SVR models from the original simulation run (stage 1) and using them to query nodes on a sequence of observed graphs collected using round-robin (stage 2). (b) We study the Training Hypothesis by recreating the sequence of observed graphs from the original simulation run (stage 1) and using an SVR trained on the samples collected using round-robin (stage 2) to query nodes.

O_RR = ((x′_1, y′_1), ..., (x′_T, y′_T)) collected when searching D by cycling among the classifiers in M. Last, we simulate another SVR-based search, loading the observed graph at each time step t from G̃_SVR, but feeding it observations from O_RR, one by one. Hence, the classifier is fit to a different set of observations, but the search is subject to the same sample path as the SVR-based search from the first stage.

Figure 5 contrasts the average number of target nodes found by the original SVR-based search on CiteSeer against those obtained in each set of simulations, based on 80 runs. The 95% confidence intervals for the mean at t = 700 are [393.8, 413.1], [416.6, 427.5] and [417.1, 436.7], respectively.
These statistics corroborate the hypotheses that both the border set and the training data collected by the round-robin policy contribute to improving the performance of the SVR model. Intuitively, when a base learner is fit to the nodes it queried, it tends to specialize in one region of the feature space, and the search consequently explores only similar parts of the graph, which can severely undermine its potential to find target nodes. One way to mitigate this overspecialization would be to sample nodes from the border set probabilistically, as opposed to deterministically querying the node with the highest score.

[Figure 5 plot: number of targets found (normalized by the original run) vs. # queried nodes (t) on CiteSeer (NIPS papers), for SV Regression in its original, Border Hypothesis and Training Hypothesis configurations.]
Fig. 5 SVR classifier and two ways to ease the tunnel vision effect: border set diversity and training set diversity improve performance by ensuring greater diversity in query choices and by diversifying the training data, respectively.

This alternative is investigated in Appendix B, where the ranking associated with each classifier is mapped into a probability distribution. The results show no significant performance improvement over those obtained when a single classifier chooses nodes to query deterministically.

The round-robin policy infuses diversity in the training set without sacrificing performance. This diversity is achieved by "asking another classifier" what is the best node to query at a given step. In scenarios where all classifiers would have performed reasonably well if used alone, learning from another classifier's query is likely to improve a classifier's ability to find targets, especially when they disagree. Yet, different classifiers inherently exhibit different performances on a given dataset.
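As a concrete sketch of this probabilistic alternative, a classifier's ranking of the border set can be mapped into a sampling distribution. The inverse-rank weighting below is purely illustrative; the specific mappings evaluated in Appendix B may differ:

```python
import random

def sample_from_ranking(ranked_nodes, rng=random):
    """Draw the next node to query from a classifier's ranking (best first),
    giving the node at rank r a probability proportional to 1/r.
    The inverse-rank weighting is an illustrative choice, not Appendix B's."""
    weights = [1.0 / r for r in range(1, len(ranked_nodes) + 1)]
    return rng.choices(ranked_nodes, weights=weights, k=1)[0]
```

Top-ranked nodes remain the most likely choices, but lower-ranked nodes are queried occasionally, injecting some diversity into a single classifier's search.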
Clearly, we want to choose more accurate classifiers more often, but in order to do so, three challenges must be addressed:

1. We do not know a priori which classifiers are more accurate on a given dataset;
2. Classifiers' accuracies vary as their parameters are updated and the border set changes;
3. Continual exploration must be ensured, since converging to a single arm would make the search more susceptible to the tunnel vision effect.

Challenge (1) is typically addressed by Multi-Armed Bandit (MAB) algorithms. Challenge (2) constrains the set of suitable MAB algorithms to those designed for MAB problems with non-stationary reward distributions. Challenge (3) is specific to selective harvesting (the exploration-exploitation-diversification trade-off). In the following section, we propose a method that addresses all three challenges. We call it Directed Diversity Dynamic Thompson Sampling because it is based on the Dynamic Thompson Sampling algorithm for MAB problems and because it leverages diversity in a "directed way", as opposed to randomly sampling nodes.

5 Directed Diversity Dynamic Thompson Sampling (D3TS)

This section is divided into two parts. First, we discuss the relationship between selective harvesting and multi-armed bandits. Then, in light of this discussion, we propose the D3TS algorithm.

5.1 Relationship between selective harvesting and Multi-Armed Bandits

Selective harvesting with multiple classifiers can be cast as a Multi-Armed Bandit (MAB) problem. In a MAB problem, a forecaster is given the number of arms K and the number of rounds T. For each round t, nature generates a payoff vector r_t = (r_{1,t}, ..., r_{K,t}) ∈ [0, 1]^K that is unobservable to the forecaster.⁶ The forecaster chooses an arm I_t ∈ {1, ..., K} and receives payoff r_{I_t,t}; the other payoffs remain hidden. The goal is to maximize the cumulative payoff obtained.
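The forecaster protocol can be stated compactly in code. The sketch below is a generic illustration, with independent Bernoulli arms standing in for nature's hidden payoff vector; in the context-free mapping discussed next, each arm would be a classifier and the payoff the queried node's label:

```python
import random

def run_bandit(choose_arm, arm_probs, rounds, rng):
    """Generic forecaster protocol: each round, nature draws a hidden payoff
    vector r_t (here, independent Bernoulli entries, purely illustrative);
    the forecaster observes only the payoff of the arm it pulls."""
    history, total = [], 0
    for _ in range(rounds):
        payoffs = [1 if rng.random() < p else 0 for p in arm_probs]  # hidden r_t
        arm = choose_arm(history)            # decision based on past pulls only
        total += payoffs[arm]
        history.append((arm, payoffs[arm]))  # the other entries stay hidden
    return total
```

Any bandit algorithm is then just a particular `choose_arm` policy that trades off exploring under-sampled arms against exploiting the empirically best one.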
MAB problems can be classified according to how the payoff vector is generated. In stochastic bandit problems, each entry r_{i,t} of the payoff vector is sampled independently from an unknown distribution ν_i, regardless of t. In adversarial bandit problems, the payoff vector r_t is chosen by an adversary which, at time t, knows the past but not I_t. Stochastic and adversarial bandits do not cover the entire problem space, as the payoff vector distribution may vary over time in a less arbitrary way than in adversarial bandits. In stochastic bandit problems with non-stationary distributions, or dynamic bandit problems, the mean payoff vector can evolve according to random shocks or change at pre-determined points in time. MAB problems may also include context, which provides the forecaster with side information about the optimal action at a given step. In contextual bandits, a context x_{a,t} is drawn (from some unknown probability distribution) for each action a ∈ A_t available in step t. The context may be provided explicitly or through the recommendations of a set of experts.

In selective harvesting, the sequential decision problem consists of choosing the node to query at each step, given recommendations from several models. There are two ways of mapping selective harvesting to a MAB problem. The first (and simplest) mapping is context-free. Each model is represented by an arm (i.e., the problem reduces to choosing a model at each time step). Models are treated as black boxes that will "internally" query a node and return the node's label. The queried node's label is seen as the model's payoff. The second mapping falls into the class of contextual bandits. Each border node represents an action and each model represents an expert that provides recommendations on how to choose the actions.
Node features correspond to action contexts, which are used by the experts to compute their recommendations. Despite the potential advantage of accounting for node features directly and of combining the advice of several models, most algorithms for contextual bandits assume fixed and small (relative to the time horizon) sets of actions, whereas the border set is dynamic and potentially orders of magnitude larger than the query budget.

Among context-free bandits, we claim that algorithms for stochastic bandits with non-stationary distributions are the best candidates for combining classifiers in selective harvesting, as we observe that the average hit ratio can drift over time (Fig. 3). While adversarial bandits allow payoff distributions to change arbitrarily, they cannot exploit the fact that the mean payoff evolves in a well-behaved manner. A thorough comparison of several bandit algorithms, described in Appendix C, supports our claim. Our comparison includes the Exp4 and Exp4.P algorithms for contextual bandits, which combine the predictions of all classifiers in a way similar to traditional ensemble methods.

5.2 Proposed algorithm

For the reasons above, we adapt the Dynamic Thompson Sampling (DTS) algorithm [18], proposed for MABs with non-stationary distributions, to the selective harvesting problem.

⁶ In general, rewards can be normalized to be in [0, 1].

Algorithm 1 D3TS (budget T, model set M, threshold C ≥ 2)
 1: ▷ Assume B_t is updated after each iteration.
 2: for t in 1, ..., T do
 3:   for k in 1, ..., |M| do
 4:     r̂_t^(k) ∼ Beta(α_k, β_k)
 5:   I_t = arg max_{k ∈ 1,...,K} r̂_t^(k)
 6:   ŷ = estimate payoffs using classifier I_t and G̃_t
 7:   b = arg max_{v ∈ B_t} ŷ_v
 8:   r_t = y_b = query(b)
 9:   if α_{I_t} + β_{I_t} < C then
10:     α_{I_t} = α_{I_t} + r_t
11:     β_{I_t} = β_{I_t} + (1 − r_t)
12:   else
13:     α_{I_t} = (α_{I_t} + r_t) × C/(C + 1)
14:     β_{I_t} = (β_{I_t} + (1 − r_t)) × C/(C + 1)
15:   M = update or retrain classifiers given the new point (x_{b}|G̃_t, y_b)

DTS is based on the Thompson Sampling (TS) algorithm for stochastic MABs, where the binary outcomes associated with each arm k = 1, ..., K are modeled as Bernoulli trials. The uncertainty about the probability parameter associated with arm k is typically modeled as a Beta(α_k, β_k) distribution. The Beta distribution is the conjugate prior of the Bernoulli distribution (thus providing computational savings on Bayesian updates). TS performs exploration by choosing arms probabilistically, according to samples drawn from the corresponding distributions. More precisely, at step t, TS samples r̂_t^(k) ∼ Beta(α_k, β_k) and selects the arm with the largest sample, i.e., I_t = arg max_{k ∈ 1,...,K} r̂_t^(k). Given the binary payoff r_t received after selecting arm I_t, the distribution parameters are updated according to Bayes' rule, i.e., α_{I_t} = α_{I_t} + r_t and β_{I_t} = β_{I_t} + (1 − r_t). In essence, DTS additionally normalizes arm k's parameters such that α_k + β_k ≤ C, where C is a bounding parameter. We adapt DTS in two senses: (i) we combine DTS with the steps needed to perform search in selective harvesting problems, and (ii) we set the threshold C to a much smaller value than the ones used in [18], which allows us to induce more diversity. This highlights an exploration, exploitation and diversification tradeoff in selective harvesting that goes beyond the duality found in classic MAB problems, as simply converging to one arm would be suboptimal.
The pseudo-code for D3TS is shown in Algorithm 1. In what follows, we compare D3TS against all approaches for selective harvesting discussed in Section 3.

6 Simulations

This section describes the datasets used in our simulations, together with simulation results and comparisons with baseline methods.

6.1 Datasets

To evaluate the above search methods, we use seven datasets corresponding to undirected and unweighted networks containing node attributes. In the following we describe each of the datasets summarized in Table 3. Basic statistics for each network are shown in Table 4.

Table 3 High-level description of each network.

  Dataset        nodes      edges          node attributes    target nodes
  DBpedia        places     hyperlinks     place type         admin. regions
  CiteSeer       papers     citations      venues             top venue
  Wikipedia      wikipages  links          topics             OOP pages
  Kickstarter    donors     co-donors      backed projects    DFA donors
  DonorsChoose   donors     co-donors      awarded projects   P donors
  LiveJournal    users      friendship     enrolled groups    top group
  DBLP           authors    co-authorship  conference         top conference

Table 4 Basic statistics of each network: |V| (number of nodes), |E| (number of edges), M (number of attributes) and |V+|/|V| (fraction of target nodes).

  Dataset        |V|     |E|     M     |V+|/|V|
  DBpedia        5.00K   26.6K   5     14.5%
  CiteSeer       14.1K   42.0K   10    13.1%
  Wikipedia      5.27K   64.6K   93    3.83%
  Kickstarter    27.8K   2.77M   180   5.27%
  DonorsChoose   1.15K   6.60K   284   4.96%
  LiveJournal    4.00M   34.7M   5K    0.04%
  DBLP           317K    1.05M   5K    2.38%

The first three datasets have been used as benchmarks for Active Search [40, 27]. Although Active Search assumes that the network topology is known, we can use these datasets to evaluate active search methods by only revealing parts of the graph as the search proceeds. We define the target population as in the Active Search work.
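The structural statistics in Table 4 can be recomputed directly from an edge list and a node-label predicate. The sketch below (function name and interface are illustrative) assumes an undirected, unweighted edge list, as in all seven datasets:

```python
def network_stats(edges, is_target):
    """Compute |V|, |E| and |V+|/|V| (the Table 4 columns) from an
    undirected, unweighted edge list; `is_target` maps a node to True
    if it belongs to the target population."""
    nodes, edge_set = set(), set()
    for u, v in edges:
        nodes.update((u, v))
        edge_set.add((min(u, v), max(u, v)))  # collapse both orientations
    frac_targets = sum(1 for v in nodes if is_target(v)) / len(nodes)
    return len(nodes), len(edge_set), frac_targets
```

De-duplicating edges by orientation matters for datasets such as DBpedia, where links can appear in either direction.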
DBpedia: A network of 5,000 populated places from the DBpedia ontology, formed by linking pairs whose corresponding Wikipedia pages link to each other, in either direction. Places are marked as "administrative regions", "countries", "cities", "towns" or "villages". Target nodes are the "administrative regions".

CiteSeer: A paper citation network composed of the top 10 venues in Computer Science. Papers are annotated with their publication venue. Target nodes are the NIPS papers.

Wikipedia: A web-graph of wikipages related to programming languages. Pages are annotated with topics obtained by thresholding a pre-computed topic vector [40]. Target nodes are webpages related to "object oriented programming".

Two network datasets from the Stanford SNAP repository [25], typically used to validate community detection algorithms, are also used. We label nodes belonging to the largest ground-truth community as targets. Other community memberships are used to define a binary attribute vector a_v ∈ {0, 1}^M for all v ∈ V.

LiveJournal: A blog community with OSN features; e.g., users declare friendships and create groups that others can join. Users are annotated with the groups they joined.

DBLP: A scientific collaboration network where two authors are connected if they have published together. Authors are annotated with their respective publication venues.

Last, we use datasets containing donations to projects posted on two online crowdfunding websites. To assess the performance of each classifier in low-correlation settings, we build a social network connecting potential donors where edges are weak predictors of whether or not the neighbors of a donor will also donate. We label nodes as targets if they donated to a specific campaign. Historical donation data prior to that campaign is used to build the network and define node attributes.

Kickstarter(.com): An online crowdfunding website.
This dataset was collected by GitHub user neight-allen and consists of 3.04M donors that together made 5.87M donations to 87.3K projects. We create a donor-to-donor network by connecting donors that donated to the same projects in the past. More precisely, we assume that backers of small unsuccessful campaigns (between 100 and 600 backers) are all connected in a co-donation network; say, their names are published on the campaign's website. We choose campaigns with few donors so that the resulting network is sparse and the network discovery problem challenges D3TS. Our dataset has 180 small unsuccessful projects between 04/21/2009 and 05/06/2013, containing a total of 27.8K donors. We then choose the 2012 project (denoted DFA) that has the largest number of donors in our dataset. The goal of the recruiting algorithm is to recruit the 2012 DFA donors through the donor-to-donor network of past donations (2009-2011).

Table 5 Performance ratios between RR (D3TS) and the average of the top k = 1, 3, 5 standalone classifiers.

  Dataset        avg top 5      avg top 3      avg top 1
                 RR    D3TS     RR    D3TS     RR    D3TS
  CiteSeer       1.04  1.07     1.02  1.05     1.00  1.03
  DBpedia        1.01  1.03     1.00  1.02     0.98  1.01
  Wikipedia      1.16  1.20     1.05  1.08     0.97  1.01
  DonorsChoose   1.06  1.05     1.04  1.04     1.01  1.00
  Kickstarter    1.23  1.24     1.13  1.14     1.11  1.12
  DBLP           0.96  1.00     0.94  0.98     0.92  0.96
  LiveJournal    0.98  1.02     0.97  1.00     0.96  0.99

DonorsChoose(.org): An online crowdfunding website where teachers of US public schools post classroom projects requesting donations (e.g., for a science project). The dataset is part of the KDD 2014 Cup, containing 1.29M donors that together made 3.10M donations to 664K projects from 57K schools. Donations include information such as donor location, donation amount and awarded project, among other node features. As donors tend to be loyal to the same schools, we focus on the school that received the most donations in the dataset.
We use projects from 2007 to 2012 to construct a donor-to-donor network where an edge exists between two donors if they donated to the same project less than 48 hours apart. We then select the project P in 2013 with the largest number of donations.

6.2 Results

In this section, we compare the performance of D3TS, Round-Robin (RR) and the standalone classifiers w.r.t. the number of targets found at several points in time. We set the threshold C = 5 in D3TS and the parameters of all classifiers as in Table 2. We simulate selective harvesting on each dataset for a large budget T, chosen in proportion to the target population size (e.g., for DonorsChoose we set T = 100; for Kickstarter we set T = 1500).

In order to contrast RR's and D3TS's performance against that obtained if side information about the identity of the top k performing classifiers on a given dataset were available, Table 5 lists the ratios between RR's (and D3TS's) performance and the average performance of the top k = 1, 3, 5 standalone classifiers. Note that we consider the top k among all nine standalone classifiers described in Section 3, not only the classifiers used by RR (and D3TS). Top classifiers vary across datasets. Overall, we observe that RR's performance is comparable to that of the top 3 classifiers and can sometimes outperform them (by up to 13%). In the worst case, RR's performance is 92% of that of the best standalone classifier (DBLP). D3TS consistently improves upon RR and yields results at least as good as the best standalone classifier on all datasets
except DBLP and LiveJournal, where its performance is respectively 96% and 99% of that of the best classifier. D3TS outperforms the best classifier by up to 15% (Kickstarter).

[Figure 6: four panels (DBpedia: admin. regions; DonorsChoose; LiveJournal; Kickstarter: DFA project) showing the number of targets found, normalized by Round-Robin, vs. the number of queried nodes (t), for RR, D3TS, MOD, Active Search, SV Regression, Random Forest and ListNet.]
Fig. 6 Average number of targets found by Round-Robin (RR), D3TS and five standalone classifiers over 80 runs. Shaded areas represent 95% confidence intervals. Arrows indicate minimum values for the corresponding colors' classifiers, when off-the-chart. Standalone classifiers are often outperformed by RR. D3TS improves upon RR.

We now describe the results for each dataset in detail, except for CiteSeer, which was discussed in the introduction. Figure 6 contrasts the average number of targets found by RR and D3TS against those found by standalone classifiers, scaled by RR's performance. We include results for five of the nine classifiers (the same ones used in M) to avoid clutter. On DBpedia, LiveJournal, DonorsChoose and Kickstarter, even RR was able to outperform the existing methods, except in the initial steps (where absolute differences are small anyway). Moreover, on the first two datasets, base learners outperformed existing methods. However, as shown in the DonorsChoose and Kickstarter plots, a data-driven classifier by itself does not guarantee good performance. On most datasets D3TS matches or exceeds the performance of the best standalone classifier. In particular, on Kickstarter, both RR and D3TS find significantly more target nodes than standalone classifiers.
While RR can leverage the diversity that comes from using multiple classifiers to avoid the tunnel vision effect, D3TS goes beyond and intelligently decides which classifier to use without harming diversity. To illustrate this, we look at the fraction of times D3TS used a given classifier at turn t over 80 runs. Figure 7 shows this time series for
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 7 D 3 TS: fraction of runs in whic h eac h classifier was used in step t (smo othed ov er five steps). number of targets f ound queried nodes (t) Wikipedia: OOP 1000 400 100 0 50 100 150 number of targets f ound DonorsChoose 300 100 30 0 10 20 30 40 50 number of targets f ound DBLP Co−Authorship Netw ork 1250 600 100 0 200 400 600 800 RR D 3 TS MOD SV Regression Random Forest ListNet Active Search Fig. 8 RR and D 3 TS can perform well even when including classifiers that p erform po orly as standalone. DBp edia. F rom the small fraction of uses, we find that MOD p erforms p o orly not only on its own, but also when used under D 3 TS. F ortunately , D 3 TS can learn classifiers’ relative p erformances and adjust accordingly . A closer lo ok at the distribution of the num b er of targets found by eac h metho d highligh ts an imp ortant adv antage of leveraging diversit y . Figure 8 shows b o xplots of RR and D 3 TS p erformance in each dataset, for several points in time. 7 On Wikip edia, DonorsCho ose and Kickstarter, although some of the classifiers used by RR and D 3 TS yield p o or results on their own, RR and D 3 TS still attain large mean and lo w v ariance. D 3 TS was only outperformed by a standalone classifier on DBLP (statistically significant). Because DBLP has the largest num b er of target no des in the b order set (on av erage) o ver all datasets, classifiers are less lik ely to b e p enalized by the tunnel vision effect on DBLP . In App endix A we provide complemen tary results from ten additional datasets derived from the same data. Once again these results attest for the robustness of the prop osed metho d. 
Classifier combinations We also conducted an exhaustive set of simulations considering all 31 combinations of these five classifiers under D3TS. We restrict this analysis to a set of networks 𝒟 composed of the five smaller datasets. Suppose we had an oracle that could tell which combination of classifiers performs best on a dataset D ∈ 𝒟. We can then define the (normalized) regret of a classifier set M on D as

R(M, D) = 1 − N+(M, D) / max_{M′} N+(M′, D),

where N+(M, D) is the number of target nodes found by M on D. If we define the optimal combination to be the one that minimizes the maximum regret, i.e., M⋆ = argmin_M max_{D∈𝒟} R(M, D), then M⋆ indeed includes all five classifiers (maximum regret is 2.8%). Alternatively, if we define the optimal combination M† to be the one that minimizes the average regret, i.e., M† = argmin_M Σ_{D∈𝒟} R(M, D)/|𝒟|, then M† is the combination composed of MOD, Active Search, SVR and Random Forest (average regret is 0.9%). We note, however, that the performance obtained by combination M⋆ on each dataset is at most 0.7% smaller than that obtained by M† (in the case of CiteSeer).

Table 6 Average wall-clock time to find a target (in sec.). D3TS benefits from more sophisticated classifiers while only incurring the computational cost for the steps in which they are used.

Methods          CS    DBP   WK    DC     KS     DBL    LJ
MOD              0.06  0.08  0.14  0.29   3.55   0.33   0.46
Active Search    0.05  0.11  0.17  0.37   1.71   0.30   0.45
SV Regression    0.37  0.80  1.26  5.88   9.35   6.57   8.19
Random Forest    2.54  4.27  6.75  16.75  43.80  20.96  21.06
ListNet          0.35  0.31  1.76  2.13   8.42   22.65  21.85
Round-Robin      0.18  0.14  0.31  0.34   2.49   11.13  10.92
D3TS             0.13  0.14  0.28  0.27   2.77   13.41  13.28

Footnote 7: The box extremes in our boxplots indicate the lower and upper quartiles of a given empirical distribution; the median is marked between them. Whiskers indicate minimum and maximum values.
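The regret computation above can be sketched as follows. The payoff counts are made up for illustration and the function names are ours; combinations are identified here by simple labels rather than frozensets of classifiers.

```python
def regret(n_targets, M, D):
    """Normalized regret R(M, D) = 1 - N+(M, D) / max_M' N+(M', D),
    where n_targets[(M, D)] is the number of targets found by set M on D."""
    best = max(n for (m, d), n in n_targets.items() if d == D)
    return 1 - n_targets[(M, D)] / best

def optimal_combinations(n_targets):
    """Return (minimax-regret M*, average-regret M-dagger) over observed sets."""
    sets = {m for m, _ in n_targets}
    datasets = {d for _, d in n_targets}
    m_star = min(sets,
                 key=lambda m: max(regret(n_targets, m, d) for d in datasets))
    m_dagger = min(sets,
                   key=lambda m: sum(regret(n_targets, m, d)
                                     for d in datasets) / len(datasets))
    return m_star, m_dagger
```

With a toy payoff table such as {("A", "d1"): 90, ("A", "d2"): 100, ("B", "d1"): 100, ("B", "d2"): 50}, combination A has maximum regret 0.1 and average regret 0.05, so it wins under both criteria.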
Moreover, we observed that combining two classifiers improves results in about 84% of the cases, relative to using either classifier in isolation. This attests to the robustness of using D3TS as the classifier selection policy.

Running time Table 6 shows the average wall-clock time to find a target, based on 80 single-threaded runs on an Intel Xeon E5-2660@2.60GHz processor, for MOD, Active Search, SVR, Random Forest, ListNet, RR and D3TS. On all datasets except DBLP and LiveJournal, Random Forest is based on conditional inference trees (from the R package party), which are recommended when different types of features (e.g., discrete, continuous) are present [20]. On the other two datasets, Random Forest is based on classical decision trees (from the R package randomForest), due to the large scale of these datasets. In both cases the average number of targets found was similar, but conditional inference trees tend to yield smaller variances.

Among standalone classifiers, MOD and Active Search were the fastest, followed by ListNet and SVR. We emphasize that MOD and Active Search require no fitting, which is the most expensive step for a base learner. In spite of their good performance at finding target nodes on DBLP and LiveJournal, Random Forest and ListNet take much longer to fit than other classifiers on datasets with a relatively large number of features, and thus exhibit the longest average time between successful queries. One of the advantages of D3TS is that it can benefit from more sophisticated classifiers while only incurring the computational cost for the steps in which they are used. D3TS exhibits smaller ratios than Round-Robin, except on datasets where D3TS tends to use Random Forest or ListNet more often than Round-Robin does. Note that D3TS's running time is determined by the classifiers it uses and their implementations.
Replacing the methods used in this paper by online counterparts can lead to significant reductions in running time. In particular, Random Forest, which has the largest running time, can in principle be replaced by online random forests when bounds on feature values are known in advance (see footnote 8).

6.3 Dealing with Disconnected Seeds

In the previous simulations, the search starts from a single seed (starting node). When more than one seed is available, the search process may end up exploring various regions of the graph at the same time. The question then arises of how to adequately model the observations in these regions. In some cases, it is better to fit classifiers to the specific regions of the network where they operate (i.e., using observations collected only from that region), whereas fitting all classifiers to all observations is probably best when all regions are very similar to each other. One can also consider hierarchical models, which model each region separately but allow some information sharing.

In this section, we consider standalone classifiers and compare their performance under two extreme approaches: using a single classifier and starting from S seeds (thus modeling all S regions together), or using S models, each initially associated with a single seed (each simulation run uses the same S seeds in either approach, to reduce variance). In particular, we use the EWLS regression model. In the multiple-classifier approach, the classifier associated with each region is used to rank its corresponding border set at each time t. A single node to be queried must then be selected among all border nodes. We select the node with the highest estimated payoff across all rankings, and the model responsible for this estimation is then updated with the new observation. We compare the search performance under these two approaches, for S = 2, ..., 6.
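The per-region selection rule just described can be sketched as follows. This is a minimal illustration only: a trivial running-average payoff predictor stands in for the EWLS regression model actually used, and all class and function names are ours.

```python
# Sketch of the multiple-model approach: one model per seed region ranks its
# own border set; the node with the highest estimated payoff across all
# rankings is queried, and only the responsible region's model is updated.

class RegionModel:
    """Toy payoff predictor: mean observed payoff of queried nodes (prior 0.5).
    A stand-in for the EWLS regression model used in the paper."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def predict(self, node_features):
        # Ignores features in this toy; a real model would regress on them.
        return self.total / self.count if self.count else 0.5

    def update(self, node_features, payoff):
        self.total += payoff
        self.count += 1

def select_and_update(models, border_sets, features, label_of):
    """models: region -> RegionModel; border_sets: region -> set of nodes.
    Queries one node globally and updates the region that chose it."""
    # Pick the (region, node) pair with the highest estimated payoff.
    region, node = max(
        ((r, v) for r, nodes in border_sets.items() for v in nodes),
        key=lambda rv: models[rv[0]].predict(features[rv[1]]),
    )
    payoff = label_of(node)              # query the node, observe its label
    models[region].update(features[node], payoff)
    border_sets[region].discard(node)    # queried nodes leave the border set
    return region, node, payoff
```

A region whose queried nodes have mostly been targets will predict higher payoffs and therefore tend to win the global argmax, which matches the selection rule described above.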
On the datasets with a larger number of attributes, we found that either there is no significant difference between the average payoffs (DonorsChoose, CiteSeer) or the single-classifier approach yields better performance (Wikipedia), at the 95% confidence level. On the other hand, on datasets with a small number of attributes, some improvement is obtained when using multiple classifiers, each with its own model. For instance, on DBpedia, which has only 5 attributes, the average number of targets found increases from 523.9 to 562.5 at t = 1000, for S = 3.

When D3TS is used in place of standalone classifiers, our recommendation is to fit base learners to region-specific observations in the case of datasets with few attributes, and to fit them to the entire training set in the case of datasets with many attributes. However, if new seeds are included during the search (i.e., S increases over time), it is likely beneficial to fit the initial classifiers corresponding to the new regions using observations from other regions as priors, even if the number of attributes is large. We leave this investigation for future work.

Footnote 8: We attempted to replace Random Forests by Mondrian Forests [24], but the only publicly available implementation is not optimized enough to be used in our application.

7 Related work

The closest work to ours is on active search. The goal of active search is to uncover as many nodes of a target class as possible in a network whose topology is known [15, 16, 40, 27]. Like selective harvesting, active search considers situations where only members of a target class (e.g., a malicious class) are sought. Since obtaining labels has an associated cost (time or money), it is paramount to avoid spending resources on nodes that are unlikely to be targets.
Unlike our problem, active search assumes that the network topology is known and that any node can be queried at any time. In [32], a problem similar to selective harvesting is investigated and a learning-based method called Active Exploration (AE) is proposed. Unlike in selective harvesting, border nodes' attributes are assumed to be observable. Since node attributes often carry considerable information about a node's label, AE is not directly comparable with other selective harvesting methods. Our solution differs from AE in that it leverages heuristics in addition to base learners and is applicable to a wider range of applications.

Similarly to selective harvesting, active learning is an interactive framework for deciding which data points to collect in order to train a classifier or a regression model. Unlike active search, (i) its main objective is to improve the generalization performance of a model with as few label queries as possible, and (ii) the set of unlabeled points does not grow based on the collected points. A slew of active learning techniques have been proposed for non-relational data settings, including some tailored for logistic regression [35], for dealing with streamed data [2] and for the case of extreme class imbalance [3]. Although the retrieval of target nodes can benefit from an accurate model, it is unlikely that active learning heuristics (e.g., uncertainty sampling [36]) for training a single classifier can be used for selective harvesting without sacrificing performance. However, it may be possible to adapt active learning techniques proposed for training classifier ensembles (e.g., query by committee [37]) in such a way that, while collecting points on which many classifiers disagree, we ensure that promising candidates among border nodes are queried before the sampling budget is exhausted.
Despite these differences, there is an interesting parallel between selective harvesting with many models and a body of research on active learning with a set of active learners (or heuristics). Both problems can be cast as MABs, where border nodes are analogous to unlabeled data points. In active learning, a reward is only indirectly related to the collected point: it is computed as some proxy for, or estimate of, the model's performance on a test set when fit to all points collected up to a given step. In contrast, rewards in selective harvesting are simply the node labels. Like selective harvesting, active learning can either map heuristics directly to arms [6] or map heuristics to experts that give recommendations on how to choose the unlabeled points [21]. In both cases it has been observed that combining heuristics often outperforms the single best heuristic. While these works apply algorithms for adversarial bandits to active learning, we find that Dynamic Thompson Sampling for stochastic bandits with non-stationary rewards seems to better exploit the fact that arm rewards change slowly in selective harvesting.

Last, another variant of active learning considers the task of learning an ensemble of models [1] or finding a low-risk hypothesis h ∈ H [13, 14] while labeling as few points as possible. Since the labeled points are biased by the collection process, estimating the models' generalization performances requires either building a uniformly random validation set, or sampling probabilistically at every step and then using importance-weighted estimates. In selective harvesting, however, the models' relative performances
can be directly measured from the queried nodes' payoffs. Moreover, building a random validation set is bound to degrade performance in scenarios where target nodes are scarce.

Table 7 Average number of targets found by each method after T queries, based on 80 runs.

Methods                    CS (T=1500)  DBP (700)  WK (400)  DC (100)  KS (700)
Bagging                    745.6        445.6      99.1      34.7      223.1
AdaBoost                   751.5        443.5      98.0      34.5      218.4
D3TS                       851.2        464.0      144.7     37.9      247.6
Bootstrap + Decision Tree  754.5        293.4      95.2      27.2      155.7

8 Discussion

In this section, we discuss the technical challenges in accounting for the future impact of a query and contrast the proposed solution with classical ensemble learning.

8.1 Accounting for the future impact of querying a node

Active search assigns a score to each potential border node v consisting of the sum of two terms [40, eq. (2)]: the expected value of v's label, and the sum of the expected changes in the labels of all other nodes multiplied by a discount factor α ≪ 1. The discounted term tries to account for the impact of querying node v, going one step beyond the greedy solution. In selective harvesting, however, the observed graph is limited to the set of queried nodes and their neighbors, i.e., we cannot compute the impact of choosing a node beyond the border set. Even if we could observe the entire graph, accounting for the future impact of querying a node would require us to fit one statistical learning model to each border node and predict all the remaining labels at each step, which is too expensive even for a single online model.

8.2 Using classifier ensembles in selective harvesting

Ensemble methods generate a set of models in order to combine their predictions, possibly using weights. These methods perform very well in many classification problems and can be applied to selective harvesting problems too.
Note that although D3TS uses multiple statistical models, it cannot be considered a classifier ensemble, since only one classifier is used for prediction at each step. We simulate two popular ensemble methods, Bagging and AdaBoost, on five datasets (DBLP and LiveJournal were not included due to prohibitive execution times). For Bagging, we varied the number of trees in {5, 10, 100}, the minimum number of observations required to split a node in {5, 10} and the maximum tree depth in {1, 5, 10}. For Boosting, we set the maximum tree depth to 1 and varied the number of trees in {100, 200}. Table 7 displays the results associated with the configurations that obtained the best overall results, Bagging(ntree=100, minsplit=10, maxdepth=5) and Boosting(maxdepth=1, ntree=100), along with the results obtained by D3TS. We find that D3TS consistently outperforms these ensemble methods. We conjecture that ensembles are only slightly less susceptible to the tunnel vision effect than standalone models, as combining predictions tends to decrease border set and training set diversity.

What if we do not combine their predictions? In other words, what if we generate a decision tree from bootstrap sampling at each step and use that tree to make predictions? We simulated the performance of this mechanism, varying the minimum number of observations required to split a node in {5, 10} and the maximum tree depth in {5, 10}. However, this approach did not perform as well as D3TS (or even RR). We report in Table 7 the parameter configuration that achieved the best overall results, (minsplit=10, maxdepth=10), under "Bootstrap + Decision Tree". The poor performance of this approach can be explained by the fact that predictions made from a single tree are not very accurate.
By making predictions with a single tree, we lose the generalization benefits that come from classifier ensembles.

8.3 Contrasting diversity in ensembles and diversity in selective harvesting

Diversity is known to be a desirable characteristic in ensemble methods [23, 39, 41]. The intuition is that if one can combine accurate models that make uncorrelated mistakes, the overall accuracy will be higher than that of the individual models. There are two main classes of techniques for generating diverse ensembles [38]: (i) overproduce and select, where a large set of base learners is generated, among which a subset is selected to maximize a given measure of diversity, and (ii) building ensembles, where the diversity measure is directly used to drive the ensemble creation. In the ensemble literature there are several metrics proposed for quantifying diversity, all of which can be computed from the predictions made by different models. Many of these metrics are shown to have a positive correlation with the overall accuracy of the ensemble.

In selective harvesting, the relationship between correlations in models' mistakes and overall performance is more indirect. For a single query, whether the mistakes made by different models are uncorrelated or not is immaterial, since we use only one model to decide which node to query at each step. On the other hand, every query choice impacts future steps. Therefore, differences in models' predictions dictate the levels of border set and training set diversity that will be achieved over time. This is in sharp contrast with the static notion of diversity referred to in the ensemble literature. A deeper characterization of the sets of models that can achieve the type of diversity that leads to good performance in selective harvesting is left as future work.
9 Conclusions

This paper introduced selective harvesting, where the goal is to find the largest number of target nodes given a fixed budget and subject to a partial, but evolving, understanding of the network. The key distinctions of selective harvesting w.r.t. related problems are that (i) the network is not fully observed and/or (ii) a model must be learned during the search. Combined, these distinctions make the problem much harder than the related problems. We discussed existing methods that can be adapted to selective harvesting and an alternative approach based on statistical models. However, we showed that the tunnel vision effect incurred by the nature of the selective harvesting task severely impacts the performance of a classifier trained under these conditions.

We showed that using multiple classifiers is helpful in mitigating the tunnel vision effect. In particular, simulation results showed that methods used in isolation often perform worse than when combined through a round-robin scheme. We raised two hypotheses to explain this observation, and investigated them to show that classifier diversity, i.e., switching among classifiers at each querying step, is important for collecting a larger set of target nodes in selective harvesting. Classifier diversity increases the diversity of the training set while broadening the choices of nodes that can be queried in the future. Based on these observations we proposed D3TS, a method based on multi-armed bandits and classifier diversity, able to account for what we named the exploration, exploitation and diversification trade-off. D3TS differs from traditional ensembles in that it does not combine predictions from different models at a given step. D3TS also differs from traditional MABs in that the goal is not to converge to a single arm.
D3TS outperforms all competing methods on five out of seven real network datasets and exhibited comparable performance on the others. While we evaluated D3TS's performance when used with five specific classifiers (MOD, Active Search, Support Vector Regression, Random Forest and ListNet), the proposed method is flexible and can be used with any set of classifiers (not shown here: replacing SVR with Logistic Regression yielded similar results). Moreover, we showed that combining two classifiers through D3TS improves results in about 84% of the cases w.r.t. the cases where either classifier is used in isolation.

Acknowledgements This work was sponsored by the ARO under MURI W911NF-12-1-0385, the U.S. Army Research Laboratory under Cooperative Agreement W911NF-09-2-0053, the CNPq, National Council for Scientific and Technological Development - Brazil, FAPEMIG, and NSF under SES-1230081, including support from the National Agricultural Statistics Service. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The authors thank Xuezhi Wang and Roman Garnett for kindly providing code and datasets used in [40].

References

1. Ali A, Caruana R, Kapoor A (2014) Active learning with model selection. AAAI Conference on Artificial Intelligence, pp 1673–1679
2. Attenberg J, Provost F (2011) Online active inference and learning. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 186–194
3. Attenberg J, Melville P, Provost F (2010) Guided feature labeling for budget-sensitive learning under extreme class imbalance. ICML Workshop on Budgeted Learning
4.
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77
5. Avrachenkov K, Basu P, Neglia G, Ribeiro B (2014) Pay few, influence most: Online myopic network covering. In: Computer Communications Workshops (INFOCOM WKSHPS), 2014 IEEE Conference on, pp 813–818
6. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. The Journal of Machine Learning Research 5:255–291
7. Beygelzimer A, Dasgupta S, Langford J (2009) Importance weighted active learning. In: International Conference on Machine Learning, ACM, pp 49–56
8. Beygelzimer A, Langford J, Li L, Reyzin L, Schapire RE (2011) Contextual bandit algorithms with supervised learning guarantees. International Conference on Artificial Intelligence and Statistics, pp 19–26
9. Bnaya Z, Puzis R, Stern R, Felner A (2013) Bandit algorithms for social network queries. In: Social Computing (SocialCom), 2013 International Conference on
10. Borgs C, Brautbar M, Chayes J, Khanna S, Lucier B (2012) The power of local information in social networks. In: Internet and Network Economics, Springer Berlin Heidelberg, pp 406–419
11. Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: from pairwise approach to listwise approach. International Conference on Machine Learning, pp 129–136
12. Friedman J, Hastie T, Tibshirani R (2009) The Elements of Statistical Learning, vol 1. Springer Series in Statistics, Springer, Berlin
13. Ganti R, Gray AG (2012) UPAL: Unbiased pool based active learning. International Conference on Artificial Intelligence and Statistics, pp 422–431
14. Ganti R, Gray AG (2013) Building bridges: Viewing active learning from the multi-armed bandit lens. In: Conference on Uncertainty in Artificial Intelligence
15. Garnett R, Krishnamurthy Y, Wang D, Schneider J, Mann R (2011) Bayesian optimal active search on graphs.
In: Workshop on Mining and Learning with Graphs
16. Garnett R, Krishnamurthy Y, Xiong X, Mann R, Schneider JG (2012) Bayesian optimal active search and surveying. In: International Conference on Machine Learning, ACM, New York, NY, USA, pp 1239–1246
17. Gouriten G, Maniu S, Senellart P (2014) Scalable, generic, and adaptive systems for focused crawling. In: ACM Conference on Hypertext and Social Media, pp 35–45
18. Gupta N, Granmo OC, Agrawala AK (2011) Thompson sampling for dynamic multi-armed bandits. ICMLA, pp 484–489
19. Helleputte T (2015) LiblineaR: Linear predictive models based on the LIBLINEAR C/C++ library. R package version 1.94-2
20. Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3):651–674
21. Hsu WN, Lin HT (2015) Active learning by learning. AAAI Conference on Artificial Intelligence, pp 2659–2665
22. Khuller S, Purohit M, Sarpatwar KK (2014) Analyzing the optimal neighborhood: Algorithms for budgeted and partial connected dominating set problems. In: ACM-SIAM Symposium on Discrete Algorithms, pp 1702–1713
23. Kuncheva LI (2003) That elusive diversity in classifier ensembles. In: Iberian Conference on Pattern Recognition and Image Analysis, Springer, pp 1126–1138
24. Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: Efficient online random forests. In: Advances in Neural Information Processing Systems, pp 3140–3148
25. Leskovec J, Krevl A (2014) SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
26. Liu W, Principe JC, Haykin S (2011) Kernel Adaptive Filtering: A Comprehensive Introduction, vol 57. John Wiley & Sons
27. Ma Y, Huang TK, Schneider JG (2015) Active search and bandits on graphs using sigma-optimality. In: Conference on Uncertainty in Artificial Intelligence, pp 542–551
28. Newman ME (2003) The structure and function of complex networks.
SIAM Review 45(2):167–256
29. Newman MEJ (2002) Assortative mixing in networks. Physical Review Letters 89:208701
30. Pant G, Srinivasan P (2005) Learning to crawl: Comparing classification schemes. ACM Trans Inf Syst 23(4):430–462
31. Pfeiffer III JJ, Neville J, Bennett PN (2012) Active sampling of networks. In: Workshop on Mining and Learning with Graphs
32. Pfeiffer III JJ, Neville J, Bennett PN (2014) Active exploration in networks: Using probabilistic relationships for learning and inference. In: ACM International Conference on Information and Knowledge Management
33. Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Social Networks 29(2):173–191
34. Robins G, Snijders T, Wang P, Handcock M, Pattison P (2007) Recent developments in exponential random graph (p*) models for social networks. Social Networks 29(2):192–215
35. Schein AI, Ungar LH (2007) Active learning for logistic regression: an evaluation. Machine Learning 68(3):235–265
36. Settles B (2010) Active learning literature survey. University of Wisconsin, Madison 52(55-66):11
37. Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: ACM Workshop on Computational Learning Theory, pp 287–294
38. Stapenhurst R (2012) Diversity, margins and non-stationary learning. PhD thesis, University of Manchester
39. Tang EK, Suganthan PN, Yao X (2006) An analysis of diversity measures. Machine Learning 65(1):247–271
40. Wang X, Garnett R, Schneider J (2013) Active search on graphs. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 731–738
41. Xie P, Zhu J, Xing E (2016) Diversity-promoting bayesian learning of latent variable models.
In: International Conference on Machine Learning

A Complementary results

In Section 6.2 we presented results obtained when defining the target populations either as in prior work or as the largest subpopulation in the network. We extend these results by running simulations on ten additional datasets, derived by taking the two largest subpopulations (other than the original targets) as targets from CiteSeer, DBpedia, Wikipedia, DonorsChoose and Kickstarter. These datasets are indicated by CS, DBP, WK, DC and KS, followed by 1 and 2, respectively. Table 8 shows performance results for five standalone models and for their combinations using Round-Robin and D3TS. Except on DBP1 and WK1, D3TS consistently figures among the two best performing methods.

Table 8 Simulation results on ten datasets derived from the original data. The best two methods on each dataset are shown in bold. D3TS performs consistently well.

Methods         CS1  CS2  DBP1  DBP2  WK1  WK2  DC1  DC2  KS1  KS2
MOD             673  431  581   436   79   128  23   20   126  163
Active Search   666  568  550   403   79   124  15   10   115  213
SV Regression   615  492  515   428   71   91   22   18   161  200
Random Forest   596  498  524   406   77   104  23   18   183  246
Round-Robin     675  561  569   439   70   124  23   18   175  239
D3TS            675  562  557   450   72   128  23   18   191  240

B Can we leverage diversity using a single classifier?

Intuitively, when a learning model is fitted to the nodes it chose to query, it tends to specialize in one region of the feature space, and the search will consequently only explore similar parts of the graph, which can severely undermine its potential to find target nodes. One potential way to mitigate this overspecialization would be to sample nodes probabilistically, as opposed to deterministically querying the node with the highest score. Clearly, we should not query nodes uniformly at random all the time.
It turns out that querying nodes uniformly at random periodically does not help either, according to the following experiment. We implemented an algorithm for selective harvesting that samples at each step t, with probability p, a uniformly random node from B(t), and with probability 1 − p, the best-ranked node according to a support vector regression (SVR) model. Table 9 shows the results for p = 2.5, 5.0, 10, 15 and 20%.

Table 9 Results for SVR with uniformly random queries on CiteSeer (at t = 1500), averaged over 40 runs. The top line shows the probability of a random query; the bottom line shows the number of target nodes found.

p        0.0%          2.5%           5.0%          10%           15%           20%
targets  760.5 ± 52.1  773.85 ± 34.5  768.0 ± 32.3  770.8 ± 34.1  753.0 ± 59.8  764.7 ± 28.0

We observe that performance does not improve significantly for p ≥ 2.5%, either because the diversity is not increasing in a way that translates into performance improvements, or because all gains are offset by the samples wasted when querying nodes at random.

Instead of querying uniformly at random, we could query nodes according to a probability distribution that concentrates most of the mass on the top k nodes w.r.t. model scores. We experimented with several ways of mapping scores to a probability distribution P. In particular, we considered two classes of distributions:

– truncated geometric distribution (0 < q < 1): P(v) ∝ (1 − q)^(π(v)−1) q, and
– truncated Zeta distribution (r ≥ 1): P(v) ∝ π(v)^(−r),

where π(v) is the rank of v based on the scores given by the model to v ∈ B(t). In each experiment, we set q or r at each step in one of nine ways:

1. Top 10 nodes have x% of the probability mass, for x ∈ {70, 90, 99}.
2. Top 10% of nodes have x% of the probability mass, for x ∈ {90, 99, 99.9}.
3. Top k(t) = min{10 × (1 − t/T), 1} nodes have x% of the probability mass, for x ∈ {70, 90, 99}.
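The rank-based mappings above can be sketched as follows. This is an illustrative implementation only (function names are ours); the paper additionally tunes q or r at each step so that the top-k nodes receive a prescribed share of the mass, which is omitted here.

```python
import random

def rank_probabilities(scores, q=None, r=None):
    """Map model scores over the border set to a sampling distribution.

    Nodes are ranked by score (rank 1 = best). With q set, a truncated
    geometric distribution P(v) proportional to (1-q)^(rank-1) * q; with r
    set, a truncated Zeta distribution P(v) proportional to rank^(-r).
    Returns {node: probability}."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if q is not None:
        weights = [(1 - q) ** k * q for k in range(len(ranked))]
    else:
        weights = [(k + 1) ** (-r) for k in range(len(ranked))]
    total = sum(weights)
    return {v: w / total for v, w in zip(ranked, weights)}

def sample_node(scores, rng, q=None, r=None):
    """Draw one border node to query, instead of taking the argmax score."""
    probs = rank_probabilities(scores, q=q, r=r)
    nodes, ps = zip(*probs.items())
    return rng.choices(nodes, weights=ps, k=1)[0]
```

Larger q (or r) concentrates mass on the top-ranked node, recovering near-greedy behavior; smaller values spread mass down the ranking, injecting diversity.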
None of the mappings was able to substantially increase the search's performance. In contrast to the almost 20% performance improvement seen by SVR under Round-Robin on CiteSeer at T = 1500 (Fig. 3), mapping scores to a probability distribution increased the number of target nodes found by at most 3%.

C Evaluation of MAB algorithms applied to Selective Harvesting

We experiment with representative algorithms of each of the following bandit classes:

– Stochastic bandits: UCB1, Thompson Sampling (TS), ε-greedy;
– Adversarial bandits: Exp3 [4];
– Non-stationary stochastic bandits: Dynamic Thompson Sampling (DTS) [18];
– Contextual bandits: Exp4 [4] and Exp4.P [8].

UCB1 and TS are parameter-free. For ε-greedy, Exp3 and Exp4.P we set the probability of uniformly random pulls to ε ∈ {0.10, 0.20, 0.50}, γ ∈ {0.10, 0.20, 0.50} and K p_min ∈ {0.01, 0.05, 0.10, 0.20, 0.50}, respectively. We set parameter γ in Exp4 as K p_min in Exp4.P. For DTS, we set the cap on the parameter sum to C ∈ {5, 10, 20, 50}.

Interestingly, for each MAB algorithm, there was always one parameter value that outperformed all the others on almost all seven datasets. In Figure 9 we show three representative plots of the performance comparison between the best parameterizations of each MAB algorithm. Since Exp4 was slightly outperformed by Exp4.P, Exp4 is not shown. These results corroborate our expectations (Section 5) that DTS would outperform other bandits in selective harvesting problems.

[Fig. 9 Comparison between the best parameterizations of each MAB algorithm (Round-Robin, UCB1, TS, DTS with C=5, ε-greedy with ε=0.2, Exp3 with γ=0.2, and Exp4.P with γ=0.01): number of targets found, normalized by Round-Robin, vs. turn t, on CiteSeer (NIPS papers), DBpedia (admin. regions) and Wikipedia (Obj. Orient. Prog.).]
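For reference, the DTS arm can be sketched as a Beta-Bernoulli posterior whose parameter sum is capped at C, so that once the cap is reached old rewards are exponentially discounted. This is a sketch after the DTS description in [18], not the exact implementation evaluated here; `dts_choose` stands in for the classifier-selection step.

```python
import random

class DynamicThompsonArm:
    """Beta-Bernoulli arm with the DTS capped update: once alpha + beta
    reaches the cap C, both parameters are rescaled by C / (C + 1) after
    each update, exponentially discounting older rewards."""

    def __init__(self, C):
        self.C, self.alpha, self.beta = C, 1.0, 1.0

    def sample(self, rng=random):
        """Draw from the arm's posterior (the Thompson sampling step)."""
        return rng.betavariate(self.alpha, self.beta)

    def update(self, reward):
        """reward is 1 if the queried node was a target, 0 otherwise."""
        if self.alpha + self.beta < self.C:
            self.alpha += reward
            self.beta += 1 - reward
        else:
            self.alpha = (self.alpha + reward) * self.C / (self.C + 1)
            self.beta = (self.beta + 1 - reward) * self.C / (self.C + 1)

def dts_choose(arms, rng=random):
    """Pick the arm (classifier) with the largest posterior draw."""
    return max(range(len(arms)), key=lambda i: arms[i].sample(rng))
```

The cap is what suits a non-stationary reward stream: with C = 10, roughly the last ten rewards dominate each arm's posterior, letting the bandit track a classifier whose usefulness changes as the sample grows.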