Structure Selection from Streaming Relational Data

Statistical relational learning techniques have been successfully applied in a wide range of relational domains. In most of these applications, the human designers capitalized on their background knowledge by following a trial-and-error trajectory, where relational features are manually defined by a human engineer, parameters are learned for those features on the training data, the resulting model is validated, and the cycle repeats as the engineer adjusts the set of features. This paper seeks to streamline application development in large relational domains by introducing a light-weight approach that efficiently evaluates relational features on pieces of the relational graph that are streamed to it one at a time. We evaluate our approach on two social media tasks and demonstrate that it leads to more accurate models that are learned faster.

Authors: Lilyana Mihalkova, Walaa Eldin Moustafa

University of Maryland, College Park
May 1, 2011

1 Introduction

Many machine learning applications involve inherently multi-relational domains in which entities of heterogeneous types engage in a variety of relations. The statistical relational learning (SRL) [8] community has introduced representations that provide principled support for learning and reasoning in such multi-relational data. In a nutshell, SRL models use an expressive relational language, such as first-order logic or SQL, to define relational features capable of capturing salient aspects of the structure of the domain. These relational features come with a parameterization, such that, once instantiated, they define a graphical model over which probabilistic inference can be performed. SRL techniques have been successfully applied in domains as diverse as biology, natural language processing, ontology alignment, social networks, and the web.
The basic design cycle of many such applications follows a trial-and-error trajectory, where relational features are manually defined by a human engineer, parameters are learned for those features on the training data, the resulting model is validated, and the cycle repeats as the human engineer adjusts the set of features. For example, this is the strategy recommended in the Markov logic network tutorial [5] (slide 108; available under "Tutorial" at http://alchemy.cs.washington.edu/) that comes with the widely used Alchemy software [17]. Typically, as a result, a relatively small number of relational features are identified and used. This approach is appealing because background knowledge about the domain can be easily encoded in the intuitive relational language used in the SRL model at hand, and the relative strengths of these features can be learned through the parameterization.

An alternative to this design cycle is to use structure learning, where both the relational features and their accompanying parameterization are induced automatically from data. The state of the art in structure learning has been advanced significantly in recent years [9, 14, 21, 1, 11, 15, 16, 13]. This work has resulted in highly innovative approaches to identifying potentially promising structure candidates and efficiently navigating through the large and complex search space of structures. So far, less emphasis has been placed on how, once identified, structure candidates are evaluated. Existing techniques employ different flavors of a batch evaluation procedure, where candidate structures are scored on the available data and retained in the model if they improve the score. Crucially, all existing techniques presume that all of the data, or at least joins of pairs of relational tables [13], can be stored in memory for the purposes of candidate structure evaluation.
While the importance of existing structure learning approaches is undisputed, their focus on identifying promising candidate structures, coupled with the assumption of batch scoring from in-memory data, is not a good match for all application scenarios. In particular, the designers of many SRL applications skipped structure learning entirely and instead preferred the trial-and-error design cycle outlined above. There are at least two plausible reasons underlying their chosen approach. First, the applications frequently involved data sets that were simply too large to allow for batch scoring of candidate structures. To overcome this, parameters over hand-coded relational features have been trained in parallel [25], or on a stream of examples [22]. Beyond the problem of storing large amounts of data in memory, in some domains the data may actually arrive as a stream and not be available all at once. Second, the designers of many applications often already have intuitions about good relational features, either from existing domain knowledge, such as in natural language or social network tasks, or from having worked with the "raw" data and developed a representation for it that is already biased by what they intuitively perceive as important aspects to capture. In such cases, where domain knowledge allows one to identify what variables are likely to influence each other, the design bottleneck is in pinning down the exact formulation of the influences among variables and eliminating features based on spurious intuitions. In other words, the problem of evaluating candidate features efficiently is at least as important as suggesting them. This paper seeks to streamline application development in relational domains by introducing an approach that efficiently evaluates relational features on pieces of the relational graph that are streamed to it one at a time.
We call our approach ReSOLWE, for Relational Structure selection from Online Light-Weight Evaluation. ReSOLWE is agnostic to where the relational features it evaluates come from. They could be discovered locally from each piece of the data that is being streamed using an existing structure discovery approach, or provided by an application developer. In this paper we take a semi-automated approach where the human designer specifies a declarative bias, e.g., [4], by providing a grammar for the relational features, from which all possible features are generated. This corresponds to the scenario outlined above where general intuitions about the domain are available, but determining the precise formulation of features requires efficiently evaluating different versions of the model on a large relational data set.

We implement our approach in the framework of Markov logic networks (MLNs) [26]. This choice is motivated by the fact that MLNs have been widely used for relational application development. We start by providing necessary background in Section 2. ReSOLWE is described in Section 3. In Section 4, we flesh out our proposed technique by using it to develop two applications in relational social media domains. Our results indicate that ReSOLWE leads to significantly faster and more accurate learning. We conclude with a discussion of related and future work.

2 Background, Notation, Assumptions

First-order logic terminology. In first-order logic, relations are represented as predicates, such as articleEdit(article, editor), which are Boolean functions with typed arguments. Assuming that the domain contains no functions, a term is defined as a variable or a constant. An atom is a predicate applied to terms, such as articleEdit(A, E). A positive/negative literal is a non-negated/negated atom.
A literal is grounded if all of its arguments are constants, or actual entities from the domain; conversely, a literal is ungrounded if all of its arguments are variables. A formula consists of literals connected by conjunction and disjunction. Formulas whose literals contain only constants are grounded, whereas formulas whose literals contain only variables are ungrounded. We will refer to grounded formulas as groundings.

Learning Setting. We assume fully observable training data that consists of a large relational graph G whose hyperedges and nodes can correspond to different relations and entity types, respectively. In practice, G is typically too large to fit in memory all at once, and/or parts of it arrive as learning progresses. Thus, we address a scenario where subgraphs of G arrive in a data stream S. Second, we assume a discriminative setting where one or more relations in a set P_T are designated as target predicates whose values are to be predicted at test time, and the remaining relations P_E are the evidence predicates whose values are observed at test time.

Markov Logic Networks. A Markov logic network (MLN) [26] consists of a set of first-order logic formulae F, each of which has an associated weight. MLNs can be viewed as relational analogs to Markov networks, in which the potential functions over cliques are defined by the groundings of the formulae in F. The role of first-order logic, therefore, is to provide a highly expressive language for specifying general relational features. Grounding the first-order logic formulae with a given set of entities results in a Markov network.
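To make these notions concrete, the following sketch enumerates the groundings of a conjunctive formula over a small domain and counts how many are satisfied; such counts are exactly the per-formula statistics used in the conditional probability below. All predicates, constants, and the toy database here are hypothetical illustrations, not from the paper's data sets.

```python
from itertools import product

# A toy relational database: each predicate maps to its set of true groundings.
db = {
    "articleEdit": {("A1", "E1"), ("A2", "E1"), ("A2", "E2")},
    "hyperLink":   {("A1", "A2")},
    "modifies":    {("A2", "E1")},
}
entities = {"article": ["A1", "A2"], "editor": ["E1", "E2"]}

# A conjunctive formula as (predicate, argument-variables) literals, here:
# articleEdit(a1, e) AND hyperLink(a1, a2) AND modifies(a2, e).
formula = [("articleEdit", ("a1", "e")),
           ("hyperLink", ("a1", "a2")),
           ("modifies", ("a2", "e"))]
var_types = {"a1": "article", "a2": "article", "e": "editor"}

def true_groundings(formula, var_types, db, entities):
    """Count groundings of the conjunction that are satisfied in db."""
    variables = sorted(var_types)
    count = 0
    # Enumerate every binding of variables to constants of the right type.
    for values in product(*(entities[var_types[v]] for v in variables)):
        binding = dict(zip(variables, values))
        # The conjunction holds iff every literal's grounding is true in db.
        if all(tuple(binding[a] for a in args) in db[pred]
               for pred, args in formula):
            count += 1
    return count

n = true_groundings(formula, var_types, db, entities)
```

Here the only satisfying binding is a1=A1, a2=A2, e=E1, so the count is 1; for a realistic domain the enumeration would of course be replaced by database joins.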
In particular, an MLN computes the conditional joint probability of a set of predicate groundings X of the target predicates P_T, given truth values for a set of evidence predicate groundings Y, as follows:

P(X = x | Y = y) = \frac{\exp\left(\sum_{f_i \in F} w_i n_i(x, y)\right)}{\sum_{x'} \exp\left(\sum_{f_i \in F} w_i n_i(x', y)\right)}    (1)

Above, X and Y are the sets of all target and evidence predicate groundings, respectively; x and y are the sets of corresponding truth assignments; w_i is the weight associated with formula f_i; n_i(x, y) is the number of true groundings of formula f_i on truth assignment (x, y); and the denominator computes the normalizing partition function Z.

For ease of exposition, in this work we assume that all of the formulas in the MLN are conjunctions. This is not a restrictive assumption for the following reason. The most mature implementation of MLNs, Alchemy [17], handles arbitrary formulas by converting them into conjunctive normal form, as a conjunction of disjunctions. Each disjunction produced in this way is then treated as a separate formula in the MLN, i.e., by viewing the MLN as a soft conjunction of disjunctions. Each disjunction in the MLN can then be converted to a conjunction by negating it, and also negating its weight.

3 The ReSOLWE Algorithm

Learning on the stream S proceeds in a three-stage process:

1. The first k_1 subgraphs that arrive are used to generate a set of relational features (in our case, first-order logic MLN formulae) F.

2. ReSOLWE uses the next k_2 subgraphs from S to evaluate the formulae in F, outputting a subset F* ⊂ F consisting of the formulae that, together, give good predictive accuracy for the groundings of P_T given as observations the groundings of P_E.

3. The remainder of S is used to train parameters on the formulae in F*.
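The three stages above can be sketched as a thin driver over the stream. The component signatures below (`generate`, `select`, `train`) are hypothetical stand-ins for the feature-generation, ReSOLWE-selection, and parameter-learning components, not ReSOLWE's actual interfaces.

```python
def resolwe_pipeline(stream, k1, k2, generate, select, train):
    """Sketch of the three-stage streaming process.

    generate(subgraphs) -> F   : propose the feature set (stage 1)
    select(F, subgraphs) -> F* : keep a good subset (stage 2)
    train(F*, stream)   -> w   : fit weights on the remainder (stage 3)
    """
    stage1 = [next(stream) for _ in range(k1)]   # may be empty when k1 = 0
    formulas = generate(stage1)
    stage2 = [next(stream) for _ in range(k2)]
    selected = select(formulas, stage2)
    weights = train(selected, stream)            # consumes the rest of S
    return selected, weights

# Toy usage with stub components (k1 = 0 matches the paper's setting):
subgraphs = iter(range(10))
sel, w = resolwe_pipeline(
    subgraphs, k1=0, k2=3,
    generate=lambda data: ["F1", "F2"],
    select=lambda fs, data: [f for f in fs if f == "F1"],
    train=lambda fs, rest: {f: 0.0 for f in fs},
)
```

The point of the sketch is the data-access discipline: each stage sees its portion of the stream exactly once and nothing is retained in memory beyond the per-formula statistics.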
In this paper, the set of formulas F in step 1 above is generated without requiring any data from the stream (i.e., k_1 = 0) by using a declarative bias (in the spirit of [4]) to specify templates, from which all possible formulas that comply with the bias are formed. Details on the templates used in each of our domains are provided in Section 4. Our goal in using this semi-automatic procedure to generate F was to take advantage of available background knowledge, while using systematic rule generation to make sure we do not inadvertently place the baseline to which we compare at a disadvantage.

3.1 Criteria for Effective Formulae

The goal of ReSOLWE is to provide a light-weight formula evaluation strategy that can be carried out by considering subgraphs of the training data graph G that arrive in a stream one at a time. The key is to develop criteria for what makes an effective formula and to ensure that these criteria can be evaluated efficiently on S. For this purpose, it is useful to rewrite each formula F as E ∧ Q, where E is the evidence sub-formula and consists of all literals of F with predicates in P_E, and Q is the query sub-formula and consists of all literals of F with predicates in P_T (recall from Section 2 that F is assumed to be a conjunction). We can view the roles of E and Q in F as selector and enforcer, respectively: groundings of F in which the corresponding grounding of E is satisfied are "selected," and a particular pattern, or configuration of truth values, specified by Q is "enforced" over the truth values of the corresponding grounding of Q. This is because groundings for which the part corresponding to E is false can never become true, regardless of the assignments made to the part corresponding to Q; such groundings thus do not affect the values assigned to the ground literals corresponding to Q and are safely ignored by the inference.
In other words, because groundings only affect inference if their E part is satisfied, E can be viewed as "selecting" those groundings. This view of F allows us to specify two criteria for its effectiveness: (1) among the groundings of Q selected by E, how uniformly do we observe the pattern that Q enforces in the data; and (2) how surprising is that pattern, i.e., how likely is one to observe the pattern in a randomly selected set of groundings of Q. Intuitively, the uniformity criterion measures the correctness of F. However, in the case of SRL models, we are equally interested in very correct formulas and very incorrect ones, the latter getting negative weights during parameter training. The motivation for the second criterion is that we would like to find relational features that capture aspects beyond those that can be captured by simply using a prior over truth values. Next, we make these criteria precise.

Let L be a set of ungrounded literals and let G_L be a randomly chosen grounding of L. The joint assignment of truth values to the grounded literals in G_L is a random variable X_L with 2^{|L|} possible outcomes; i.e., if L contains a single literal, the possible outcomes are {T, F}; if it contains two literals, the possible outcomes are {TT, TF, FT, FF}; etc. (Here, for simplicity, we ignore the case where, after grounding L, some of the literals become identical.) We are interested in the probability distribution that governs X_L. This distribution can be estimated empirically, given observed truth assignments for a set of groundings I_L of L, by simple counting, as the proportion of time a particular configuration of values is observed.

Definition 1. Let P^I_L represent the empirical distribution over joint truth assignments to a randomly chosen grounding of L, as estimated on a set of groundings I.
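The empirical distribution of Definition 1 reduces to frequency counting. A minimal sketch, where the observed truth assignments are invented for illustration:

```python
from collections import Counter

def empirical_distribution(groundings):
    """Estimate P^I_L: the proportion of times each joint truth configuration
    (a tuple of booleans, one entry per literal in L) appears among the
    observed groundings I."""
    counts = Counter(groundings)
    total = len(groundings)
    return {config: c / total for config, c in counts.items()}

# Hypothetical observed truth assignments for groundings of a two-literal L:
observed = [(True, True), (True, False), (True, True), (False, False)]
dist = empirical_distribution(observed)
```

On the four observations above, the configuration TT receives probability 0.5 and TF and FF receive 0.25 each; configurations never observed simply do not appear in the dictionary.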
Armed with this notation, we now return to the view of a formula F as consisting of a selector E and an enforcer Q, and consider the empirical distribution P^{I_E}_Q, where Q is the set of literals in Q and I_E is the set of groundings of Q selected by E, which is, in general, a subset of all possible groundings of Q. The first, uniformity, criterion identified above states that in an effective formula, P^{I_E}_Q(Q), the probability, according to P^{I_E}_Q, of observing the configuration of truth values enforced by Q, should be as extreme as possible.

Criterion 1. Effective formulas maximize max(P^{I_E}_Q(Q), 1 − P^{I_E}_Q(Q)).

The second criterion states that P^{I_E}_Q should be significantly different from the "default" P^{I_All}_Q, where I_All is the set of all possible groundings of Q. In other words, we are interested in formulas whose selector E homes in on groundings of Q for which the distribution over observed truth values deviates significantly from the default distribution over all possible groundings of Q.

Criterion 2. Effective formulas maximize dist(P^{I_E}_Q, P^{I_All}_Q).

Criterion 2 can be evaluated using standard measures of the distance between two distributions, such as KL divergence [2], or by methods that are specifically designed to detect significant deviations on a large scale, e.g., [7].

3.2 Simplifying Assumption

However, evaluating such measures may be expensive. In particular, to estimate P^{I_All}_Q, we need to enumerate the observed joint truth values for all possible n^k groundings of Q, where n is the number of entities in the domain and k is the number of distinct variables appearing in Q. In general, this number is much higher than the number of groundings selected by E.
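When the relevant groundings do fit in memory, both criteria can be scored directly from the two estimated distributions. The sketch below uses KL divergence for dist(·, ·); the additive smoothing constant is an implementation choice the paper does not specify, and the distributions are invented for illustration.

```python
from math import log

def criterion1(p_sel, q_config):
    """Uniformity: max(P(Q), 1 - P(Q)) for the enforced configuration."""
    p = p_sel.get(q_config, 0.0)
    return max(p, 1.0 - p)

def criterion2(p_sel, p_all, eps=1e-9):
    """Surprise: KL(P^{I_E}_Q || P^{I_All}_Q) over truth configurations.
    eps smooths away zero probabilities (an assumption of this sketch)."""
    configs = set(p_sel) | set(p_all)
    return sum(p_sel.get(c, 0.0) *
               log((p_sel.get(c, 0.0) + eps) / (p_all.get(c, 0.0) + eps))
               for c in configs if p_sel.get(c, 0.0) > 0.0)

# Hypothetical distributions for a single-literal Q (configurations T and F):
p_selected = {(True,): 0.9, (False,): 0.1}    # among groundings selected by E
p_default  = {(True,): 0.05, (False,): 0.95}  # over all groundings (sparse)

u = criterion1(p_selected, (True,))
d = criterion2(p_selected, p_default)
```

In this example the selector concentrates mass on the true configuration (u = 0.9) while the sparse default distribution concentrates on false, so the KL divergence is large, which is exactly the profile both criteria reward.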
Instead, we note that relational domains are typically very sparse; i.e., the number of relations actually observed to be true is typically much smaller than the total number of possible relations that can form. Thus, rather than estimating P^{I_All}_Q for different sets Q from the data, we can make the approximately correct assumption that P^{I_All}_Q will be skewed towards configurations that involve false assignments to the literals. In essence, this means that we can assume the same skewed default distribution P^{I_All}_i for all sets Q that contain i literals. This assumption allows us to significantly simplify the evaluation of rules according to the two criteria. Next, we describe how this is done for sets Q of different sizes.

The simplest case is when |Q| = i = 1, i.e., where the formula F contains a single literal Q_1 of a target predicate. Supposing that Q_1 is non-negated, P^{I_All}_i is a Bernoulli distribution, which, because of the skew towards false assignments, has a very small probability of success. Thus, to satisfy Criterion 2 and maximize dist(P^{I_E}_Q, P^{I_All}_Q), P^{I_E}_Q needs to have a large probability of success. Combined with Criterion 1, we note that the only way both criteria may be satisfied is if P^{I_E}_Q(Q_1) is maximized. Thus, when |Q| = 1, maximizing both criteria is as simple as requiring that the rule correctly identify regions that contain high proportions of true positives. The case when Q_1 is negated is symmetric.

The situation is a bit less straightforward when |Q| = i = 2. For ease of exposition, let us assume that Q contains two non-negated literals Q_1 and Q_2. Now, according to the sparsity assumption, P^{I_All}_i is skewed towards truth assignments in which at least one of Q_1 or Q_2 is false. Thus, in order to satisfy Criterion 2, we need formulas for which P^{I_E}_Q places significant mass on the case where Q_1 and Q_2 are both true.
Combined with Criterion 1, this means that effective formulas are ones for which P^{I_E}_Q(Q_1 ∧ Q_2) is large. Due to the sparsity in relational domains, in practice formulas with such selectors E are rare. A second way to address Criterion 2 is to look for formulas in which the conditional probability of one of Q_1 or Q_2 being true, given that the other one is true, is surprisingly high. In terms of Criterion 1, this translates into formulas for which P^{I_E}_Q(Q_1 ⇒ Q_2) is high. Thus, when |Q| = 2, ReSOLWE autonomously determines the precise formulation of the enforcer Q from the non-negated literals in Q by evaluating P^{I_E}_Q(Q_1 ∧ Q_2), P^{I_E}_Q(Q_1 ⇒ Q_2), and P^{I_E}_Q(Q_2 ⇒ Q_1), and choosing the ones that result in high values.

We note that the formulation Q_1 ⇒ Q_2 does not contradict our assumption that F is a conjunction. Q_1 ⇒ Q_2 can be expressed in conjunctive form as ¬(Q_1 ∧ ¬Q_2); thus F = E ∧ ¬(Q_1 ∧ ¬Q_2). When all literals in E are observed, the effect of F is equivalent to that of a formula F' = E ∧ Q_1 ∧ ¬Q_2 for which a negative weight is learned, despite the fact that F and F' are not logically equivalent.

To summarize, in the general case, when Q consists of l literals Q_1, Q_2, ..., Q_l, ReSOLWE evaluates P^{I_E}_Q(Q_1 ∧ Q_2 ∧ ... ∧ Q_l) and, for each k ∈ [1, l], P^{I_E}_Q(Q_k | ∧_{i ∈ [1,l], i ≠ k} Q_i = true), and selects formulations that have high probabilities. This process is summarized in Algorithm 1. Steps 6-8 and 15-19 are only evaluated if the formula has more than one literal of a target predicate. The algorithm reduces the evaluation of the candidate formulas to the collection of a few statistics for each formula that can be easily computed on a stream of examples. Moreover, by taking advantage of the simplifying assumption, the algorithm avoids having to estimate P^{I_All}_Q.

Algorithm 1: The ReSOLWE Algorithm
Input: F: set of formulas; P_T: set of target predicates; P_E: set of evidence predicates; S: stream of training subgraphs; k_2: number of streamed training subgraphs to use; θ: threshold
Output: F*: selected formulas
 1: for each of the next k_2 subgraphs s ∈ S do
 2:   for each F ∈ F do
 3:     E = sub-formula of F consisting of literals with predicates in P_E
 4:     Q = {Q_1, ..., Q_l}: set of literals of F with predicates in P_T
 5:     Compute P^{I_E}_Q(Q_1 ∧ Q_2 ∧ ... ∧ Q_l)
 6:     for each k ∈ [1, l] do
 7:       Compute P^{I_E}_Q(Q_k | ∧_{i ∈ [1,l], i ≠ k} Q_i)
 8:     end for
 9:   end for
10: end for
11: for each F ∈ F do
12:   if Average(P^{I_E}_Q(Q_1 ∧ Q_2 ∧ ... ∧ Q_l)) > θ then
13:     Add E ∧ Q_1 ∧ Q_2 ∧ ... ∧ Q_l to F*
14:   end if
15:   for each k ∈ [1, l] do
16:     if Average(P^{I_E}_Q(Q_k | ∧_{i ∈ [1,l], i ≠ k} Q_i)) > θ then
17:       Add E ∧ (Q_1 ∧ ... ∧ Q_{k−1} ∧ Q_{k+1} ∧ ... ∧ Q_l ⇒ Q_k) to F*
18:     end if
19:   end for
20: end for

4 Experimental Evaluation

In this section, we demonstrate the methodology proposed in Section 3 by developing applications in two social media domains. We compare ReSOLWE to a system (called SkipSelection) that skips the second step outlined at the beginning of Section 3 and directly trains weights on the formulas in the original set F. Because for formulas with |Q| > 1 ReSOLWE automatically determines the correct logical connectives and negations in the part of the formula that involves literals of the target predicate (i.e., lines 6-8 and 15-19 in Algorithm 1), the set F over which weights were trained by SkipSelection included all possible formulas considered by ReSOLWE.
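Algorithm 1's statistics collection and thresholding can be sketched as follows. Here `eval_formula` is a hypothetical callable standing in for the per-subgraph probability estimates (lines 5 and 7 of Algorithm 1), and the formula bookkeeping is simplified to opaque identifiers.

```python
from collections import defaultdict

def resolwe_select(formulas, stream, k2, theta, eval_formula):
    """Sketch of Algorithm 1. eval_formula(F, subgraph) is assumed to return
    (p_conj, [p_cond_1, ..., p_cond_l]): the probability of the full target
    conjunction among groundings selected by E, and the conditional
    probability of each target literal given the others."""
    conj_stats = defaultdict(list)
    cond_stats = defaultdict(lambda: defaultdict(list))

    # Lines 1-10: accumulate statistics over the next k2 streamed subgraphs.
    for i, subgraph in enumerate(stream):
        if i >= k2:
            break
        for F in formulas:
            p_conj, p_conds = eval_formula(F, subgraph)
            conj_stats[F].append(p_conj)
            for k, p in enumerate(p_conds):
                cond_stats[F][k].append(p)

    # Lines 11-20: threshold the averaged statistics.
    avg = lambda xs: sum(xs) / len(xs)
    selected = []
    for F in formulas:
        if conj_stats[F] and avg(conj_stats[F]) > theta:
            selected.append((F, "conjunction"))        # E AND Q_1 AND ... AND Q_l
        for k, ps in cond_stats[F].items():
            if avg(ps) > theta:
                selected.append((F, f"implies_Q{k}"))  # E AND (AND_{i!=k} Q_i => Q_k)
    return selected
```

Note that per formula only a handful of running averages are kept, which is what makes the evaluation feasible on a stream: no subgraph needs to be retained once its statistics have been recorded.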
The goal of our experiments was 1) to determine whether more accurate models are obtained with ReSOLWE; and 2) to evaluate the relative efficiency of ReSOLWE and SkipSelection.

We implemented ReSOLWE as part of the Alchemy system [17]. For weight training, we adapted Contrastive Divergence [19] to learn from relational instances that arrive in a stream and used a Gaussian penalty on the weights. We preferred this algorithm over other available methods because we are interested in an efficient, light-weight approach. For inference, we used the MC-SAT algorithm [24]. Generation of formulas from provided templates was implemented as a separate module in Python.

4.1 Data Sets

The experiments were conducted in two social media domains: WikiCollabs and Delicious.

4.1.1 WikiCollabs

The task in this domain is to predict project-specific collaborations in Wikipedia (http://en.wikipedia.org/wiki/Main_Page). The data consists of all 3,538 Wikipedia articles that appeared in the featured (http://en.wikipedia.org/wiki/Wikipedia:Featured_lists) and controversial (http://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues) lists in the period Oct. 7-21, 2009. These articles are interesting because they are richly connected, both by their hyperlinks and by their human network of editors [3]. For each article, we collected the editors who contributed to it, either by directly editing the article or by editing its "Talk," i.e., discussion, page. Only edits that were not marked as "minor" by the editor were considered. In this way, we obtained a set of 280,068 editors. In addition, we collected the hyperlinks among the articles. These articles are densely inter-linked, as indicated by the large number of hyperlinks (45,006) among them. Wikipedia articles often refer to external resources on the Web. Thus, for each article, we looked up the categorizations of each of its external references in the DMOZ open directory (http://www.dmoz.org/).
Because this information is not available for all URLs, we considered both exact matches of URLs, of which there were about 0.9 per article, as well as exact matches for just the domain-name part of the URL, of which there were about 77 per article. An editor E1 on Wikipedia can communicate with another editor E2 by editing E2's "Talk" page. There were a total of 7,874,985 instances of communication between pairs of editors.

The set of evidence predicates P_E included: articleEdit(article, user) and articleTalk(article, user) for the two ways in which a user may contribute to an article; userTalk(user, user); hyperLink(article, article); similar(article, article), indicating that the cosine similarity between the tf/idf-weighted bag-of-words representations of the stemmed text in the two articles is between 0.1 and 0.5; verySimilar(article, article), indicating that the similarity is greater than 0.5; category(article, category), providing the category under which the article appeared on the featured or controversial list; and levelNExact(article, externalCategory) and levelNInexact(article, externalCategory) for the different levels N = 1, 2, 3 in the DMOZ hierarchy in which an external URL from an article is filed, either for exact URL matches or for matches of just the domain name of the URL.

Table 1: Templates used to generate formulas in WikiCollabs.
(2) EDIT(t1, u) ∧ SIMPLEREL(t1, t2) ⇒ modifies(t2, u)
(3) EDIT(t1, u) ∧ LONGREL(t1, t2) ⇒ modifies(t2, u)
(4) USERREL(u1, u2) ∧ modifies(t, u1) ⇒ modifies(t, u2)
(5) USERREL(u1, u2) ∧ modifies(t, u1) ∧ modifies(t, u2)

The data from the WikiCollabs network was streamed in subgraphs G_C that were centered around one of the editors C.
G_C contains all articles A_C to which C is related via the articleEdit or articleTalk predicates; all editors E_C, in addition to C, who contributed to any of the articles in A_C; as well as the other articles to which the editors in E_C contributed. Also included were any hyperlinks among the included articles, any instances of userTalk relationships among the editors in E_C, and any available category information on the included articles. The task was to learn to predict which editors in E_C contribute to the articles in A_C, given all other information. For convenience, we represented the relationship between the articles in A_C and the users in E_C by the modifies(article, user) predicate, which was the only target predicate in P_T.

Subgraphs were formed for users C who made edits to the encyclopedia on at least 30 distinct days, had at least 30 collaborators, and edited at most 15 different articles. These restrictions are motivated by the observation that collaborator suggestion is most needed by editors who are strongly engaged with the encyclopedia, and so contribute to it over extended periods, but at the same time are focused in their interests. In this way, we excluded users, such as the "60% of registered users [who] never make another edit after their first 24 hours" [23], as well as users who help oversee the editing process and are therefore somewhat superficially involved in large numbers of edits, from having subgraphs G_C formed around them. However, such users can still appear in the subgraph of another user. We obtained a total of 1,785 subgraphs.

Formulas for the WikiCollabs task were generated from the templates shown in Table 1. Predicates in all-caps are templates that get expanded in designer-specified ways, as shown in Table 2.
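The template-expansion step can be sketched as follows. This is a deliberately simplified illustration: expansions are treated as opaque strings, variable bookkeeping is ignored, and only the "compounder" behavior (replacing a template by a conjunction of up to two of its expansions) is modeled.

```python
from itertools import combinations

def expand(template, expansions, compounders, max_compound=2):
    """Expand a rule template into concrete rule bodies. Templates listed in
    `compounders` may be replaced by a conjunction of up to `max_compound`
    of their expansions; other templates get exactly one expansion each."""
    results = [[]]
    for slot in template:
        choices = expansions[slot]
        if slot in compounders:
            # All single choices plus all conjunctions up to max_compound.
            options = [list(c)
                       for r in range(1, max_compound + 1)
                       for c in combinations(choices, r)]
        else:
            options = [[c] for c in choices]
        results = [partial + opt for partial in results for opt in options]
    return [" AND ".join(body) for body in results]

# Simplified fragment of the EDIT/SIMPLEREL template from Table 1:
expansions = {
    "EDIT": ["articleEdit(t1,u)", "articleTalk(t1,u)"],
    "SIMPLEREL": ["similar(t1,t2)", "hyperlink(t1,t2)"],
}
rules = expand(["EDIT", "SIMPLEREL"], expansions, compounders={"EDIT"})
```

With EDIT as a compounder this fragment yields six rule bodies: each of the three EDIT options (two singles and their conjunction) combined with each of the two SIMPLEREL options.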
As can be seen from these expansions, the EDIT template captures the different ways in which an editor may be related to an article, SIMPLEREL and LONGREL expand to the different ways in which two articles may be related, and USERREL expands to the different ways in which two users may be related. The EDIT and SIMPLEREL predicate templates were declared to be compounders, which means that when they are expanded in a rule, they can be replaced by a conjunction of more than one of their possible expansions. For example, in the first template above, EDIT(t1, u) may be expanded to articleEdit(t1, u), or articleTalk(t1, u), or articleEdit(t1, u) ∧ articleTalk(t1, u). We limited the length of compoundings to at most 2. ReSOLWE received only one version of the rules generated from templates (4) and (5), as it determines the correct configuration of literals of target predicates automatically.

Table 2: Expansions for the predicate templates used in WikiCollabs.
EDIT(t1, u) = { articleEdit(t1, u) | articleTalk(t1, u) }
SIMPLEREL(t1, t2) = { similar(t1, t2) | verySimilar(t1, t2) | hyperlink(t1, t2) | hyperlink(t2, t1) | category(t1, c) ∧ category(t2, c) }
LONGREL(t1, t2) = { level[1|2|3]Exact(t1, c) ∧ level[1|2|3]Exact(t2, c) | level[1|2|3]Inexact(t1, c) ∧ level[1|2|3]Inexact(t2, c) }
USERREL(u1, u2) = { userTalk(u1, u2) | articleEdit(t, u1) ∧ articleEdit(t, u2) | articleTalk(t, u1) ∧ articleTalk(t, u2) | articleEdit(t, u1) ∧ articleTalk(t, u2) | articleTalk(t, u1) ∧ articleEdit(t, u2) }

4.1.2 Delicious

The task in this data set is to predict user friendships on the Delicious social bookmarking site (http://www.delicious.com/) [29].
We used the data collected by the authors of [29], which includes 425,486 instances of the "fan" relationship, which indicates that one user is a fan of another; 446,879 instances of the "network" relationship, which is the inverse of "fan" (i.e., if A is a fan of B, then B is in A's network); and 48,809,570 instances of the ternary "tagging" relationship between a user, a tag, and a URL. Although the "fan" and "network" relationships are inverses of one another, the observations in the data were not complete. We completed them by treating them as a single "friendship" relationship.

To stream this data, we formed subgraphs G_C, each of which was centered at one of the users C. The task was to predict all friendships of C. Each G_C included C's actual friends Fr_C as true positives and, as true negatives, a sampling of users who are friends with users from Fr_C. We did not form subgraphs for users C for whom the number of true-negative friends was not at least as large as the number of true-positive friends. The friendships between C and the other users were hidden at test time, and the goal was to predict their existence. However, friendships among users other than C were observed. For convenience in distinguishing between these two cases, we included an observed and an unobserved version of the friendship relationship: an unobserved, i.e., target, cFriends(user) predicate indicating that the user is friends with the implicit C, and an observed, i.e., evidence, friends(user, user) predicate indicating that the two users are friends. For all users in G_C, we included observations about all URLs they bookmarked, along with the tags used.
Those were captured via the following predicates: bkMark(page, user, tag); bkMarkAfter(page, user), which indicates that the user bookmarked the page at least one day after it was bookmarked by C; bkMarkBefore(page, user) and bkMarkSameDay(page, user), which provide analogous information for pages bookmarked before or on the same day as bookmarked by C; usedTag(tag, user); and sameTag(user, user) and sameUrl(user, user) to indicate, respectively, that two users (different from C) used the same tags and bookmarked the same URLs. We used 656 subgraphs constructed in this way.

Table 3: Templates used to generate formulas in Delicious.
(6) REL(u1) ⇒ cFriends(u1)
(7) LONGREL(u1) ⇒ cFriends(u1)
(8) UREL(u1, u2) ∧ REL(u1) ⇒ cFriends(u2)
(9) UREL(u1, u2) ∧ LONGREL(u1) ⇒ cFriends(u2)
(10) UREL(u1, u2) ∧ cFriends(u1) ⇒ cFriends(u2)
(11) UREL(u1, u2) ∧ cFriends(u1) ∧ cFriends(u2)

Table 4: Expansions for the predicate templates used in Delicious.
REL(u1) = { bkMarkAfter(p, u1) | bkMarkBefore(p, u1) | bkMarkSameDay(p, u1) }
LONGREL(u1) = { usedTag(t, u1) ∧ usedTag(t, C) | bkMark(p, u1, t) ∧ bkMark(p, C, t) }
UREL(u1, u2) = { friends(u1, u2) | sameTag(u1, u2) | sameUrl(u1, u2) }

Formulas for the Delicious task were generated using the templates shown in Table 3, with the expansions shown in Table 4. The REL and LONGREL templates expand to predicates that relate users to the user C around whom the graph G_C is centered via various bookmarking and tagging activities, whereas UREL expands to different ways of relating two users. REL was declared a compounder, and UREL was declared an extender, which means that one or more possible expansions could be chained together. For example, UREL(u1, u2) could be expanded in ways such as friends(u1, z1) ∧ sameUrl(z1, u2).
We allowed extensions and compoundings of length at most 2. As before, RESOLWE needs the expansion from only one of the templates in lines 10 and 11, as it determines the specific formulation autonomously.

4.2 Methodology

We performed four-fold cross-validation by splitting the subgraphs in the data randomly into 4 sets and performing 4 train/test runs, in each run withholding one of the folds for testing and training on the remaining three. We used k_2 = 30 and θ = 0.4 in Algorithm 1. Before training weights, both for RESOLWE and SKIPSELECTION, we included a clause consisting of a single literal of the target predicate. This is standard practice in MLN applications that enables the model to capture the bias towards false assignments by learning a negative weight on this single-literal clause.

The results are summarized using two standard metrics from the information retrieval literature [20]:

• (MAP) Mean average precision, which is identical to the area under the precision-recall curve. The MAP score is computed over a set of test subgraphs $\mathcal{S}$ as follows:
$$\mathrm{MAP}(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{1}{|R_s|} \sum_{r \in R_s} P@r.$$
Above, $R_s$ is the set of all possible (p, c) pairs, and the precision at r is defined as
$$P@r = \frac{\text{number of true positive pairs among the top } r}{r}.$$

• (AUC-ROC) Area under the ROC curve, which is identical to the mean average true negative rate. This score is computed as follows:
$$\text{AUC-ROC}(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{1}{|R_s|} \sum_{r \in R_s} TN@r,$$
where the true negative rate at r is defined as
$$TN@r = \frac{\text{number of true negatives below position } r}{\text{total number of true negatives}}.$$

Figure 1: Comparison between RESOLWE and SKIPSELECTION in terms of Mean Average Precision and Area under the ROC curve in WikiCollabs (left) and Delicious (right). Observed differences are significant at the 0.001 level.

4.3 Results

The results of our experiments are shown in Figure 1.
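Both metrics can be computed directly from ranked predictions. The sketch below is our own illustration (the function names are hypothetical): each subgraph contributes one list of booleans ordered from highest- to lowest-scored pair, with True marking a true positive, and P@r and TN@r are averaged over every rank r, per the definitions in the text.

```python
def map_score(rankings):
    """Mean average precision over a set of ranked test subgraphs."""
    total = 0.0
    for ranks in rankings:
        hits = 0
        prec_sum = 0.0
        for r, is_pos in enumerate(ranks, start=1):
            hits += is_pos
            prec_sum += hits / r          # P@r: true positives among top r, over r
        total += prec_sum / len(ranks)    # average over all ranks in this subgraph
    return total / len(rankings)          # average over subgraphs

def auc_roc_score(rankings):
    """Mean average true-negative rate over ranked test subgraphs.

    Assumes each subgraph contains at least one true negative.
    """
    total = 0.0
    for ranks in rankings:
        n_neg = sum(1 for is_pos in ranks if not is_pos)
        seen_neg = 0
        tn_sum = 0.0
        for is_pos in ranks:
            seen_neg += not is_pos
            # TN@r: negatives ranked strictly below position r, over all negatives
            tn_sum += (n_neg - seen_neg) / n_neg
        total += tn_sum / len(ranks)
    return total / len(rankings)
```

For example, a subgraph ranked [positive, negative] yields P@1 = 1 and P@2 = 0.5, so its average precision is 0.75, and its single negative is below position 1 but not position 2, giving an AUC-ROC contribution of 0.5, matching a random ranking.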
All differences in this figure are significant at the 0.001 level according to a paired t-test. As can be seen, selecting formulas with RESOLWE before training weights leads to significant improvements in both domains according to both metrics. Because the AUC-ROC performance of a random predictor would be 0.5, by using RESOLWE we go from near-random performance to significantly higher accuracy levels.

Moreover, RESOLWE learns significantly faster than SKIPSELECTION. Table 5 presents the number of minutes taken by RESOLWE and by weight learning on dedicated Xeon 2.67GHz CPUs, averaged over the 4 folds in each domain. In both cases, using RESOLWE leads to a dramatic decrease in training time.

Table 5: Training time in minutes for completing steps 2 (formula selection) and 3 (parameter learning), as outlined at the beginning of Section 3, averaged over 4 folds.

                        WikiCollabs                    Delicious
                 Step 2   Step 3     Total     Step 2   Step 3    Total
RESOLWE           94.01    91.15    185.16      30.66    62.76    93.42
SKIPSELECTION         -  3236.40   3236.40          -   602.08   602.08

We note that our results in the Delicious domain are not comparable to those of Zhou et al. [29] because their system uses global computations over all available data to arrive at predictions, whereas here we focus on making predictions using information local to subgraphs of the original relational graph.

5 Related Work

Structure learning and feature selection are important problems that have been widely studied in both relational and i.i.d. settings. Most feature selection approaches, e.g., [10], have been developed for non-streaming classification settings. One recent exception is the work of Wu et al. [28], who study a classification task where the features arrive in a stream, while the data set is fixed. In contrast, here we explore the setting where the pool of features is fixed, but the data arrives as a stream.
Closely related to this paper is work on structure learning of statistical relational models [9, 14, 21, 1, 11, 15, 16, 13]. This literature has made important advances in focusing the search through the super-exponential space of candidate models, thus discovering more accurate candidates faster. Less emphasis has been placed on how to evaluate candidate structures, and, in most existing work, evaluation has been carried out by computing a probabilistic score over candidate structures that, crucially, assumes the training data is available in memory. In contrast, this paper addresses the complementary setting, common in many relational applications, where sufficient background knowledge is available to generate candidate structures, and the challenge is how to efficiently evaluate them on data that is presented to the learner in a stream. The set-up explored here is probably most similar to that assumed by Huynh and Mooney [11], where formula selection and parameter training are carried out in two separate stages. However, while that previous work also employs an accuracy-based measure (that of ALEPH [27]) to evaluate rule candidates, it does not address the task of evaluating candidates that have more than a single literal of the target predicate, and it does not consider streaming the relational instances.

A few authors have addressed learning of structure from data streams. Dries and De Raedt [6] introduced an inductive logic programming technique that uses candidate elimination to learn theories from a stream of examples. Their work applies to noise-free data. Recently, Kummerfeld and Danks [18] introduced a "Temporal Difference Structure Learning" algorithm that learns causal structure from a data stream. This algorithm targets causal discovery in graphical models and is not applicable to the relational setting assumed here.
Learning from data streams in relational settings has so far focused on training the parameters of a model for which the structure is provided (as done in SKIPSELECTION, described in Section 4). This approach was adopted by Mihalkova and Mooney [22] and in upcoming work by Huynh and Mooney [12].

6 Conclusion

We proposed an approach to streamlining application development in relational domains by efficiently evaluating a set of candidate formulas on relational instances that are streamed one at a time. The evaluation algorithm is derived from two natural criteria, and efficiency is achieved by exploiting the fact that typical relational domains are sparse. We fleshed out our approach to develop two applications in large and noisy social media tasks, demonstrating significant gains in the speed and accuracy of learning.

Avenues for future work include adapting this approach to tackle domains that experience gradual concept drift. One way to do this is to interleave steps 2 and 3 outlined at the beginning of Section 3 and use a decaying average of the statistics calculated by Algorithm 1. As soon as step 2 determines a change in the structure over which weights are learned in step 3, the change is implemented, keeping the weights of the remaining rules at their currently learned values, and the process continues. A second potential direction for future work is exploiting other ways in which relational data may be streamed to the learner. For example, one interesting setting arises when the learner is allowed to actively decide in what ways and how much to grow the subgraphs G_C around entities of interest C.

7 Acknowledgements

We would like to thank Tom Chao Zhou for providing us with the Delicious data set. L. Mihalkova is supported by a CI fellowship under NSF Grant #0937060 to the Computing Research Association.

References

[1] Biba, M., Ferilli, S., Esposito, F.: Discriminative structure learning of Markov logic networks.
In: Proceedings of the 18th International Conference on Inductive Logic Programming (ILP-08). Prague, Czech Republic (2008)

[2] Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)

[3] Brandes, U., Kenis, P., Lerner, J., van Raaij, D.: Network analysis of collaboration structure in Wikipedia. In: Proceedings of the 18th International World Wide Web Conference (WWW-09) (2009)

[4] De Raedt, L., Dehaspe, L.: Clausal discovery. Machine Learning 26, 99–146 (1997)

[5] Domingos, P.: Practical statistical relational learning (2007). ICML Tutorial, slides downloaded from http://www.cs.washington.edu/homes/pedrod/psrai.ppt on April 9, 2011

[6] Dries, A., De Raedt, L.: Towards clausal discovery for stream mining. In: Proceedings of the 19th International Conference on Inductive Logic Programming (ILP-09) (2009)

[7] Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics - Special Issue on Using Large Corpora 19 (March 1993)

[8] Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA (2007)

[9] Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of link structure. Journal of Machine Learning Research 3, 679–707 (2002)

[10] Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)

[11] Huynh, T.N., Mooney, R.J.: Discriminative structure and parameter learning for Markov logic networks. In: Proceedings of the 25th International Conference on Machine Learning (ICML-08) (2008)

[12] Huynh, T.N., Mooney, R.J.: Online max-margin weight learning for Markov logic networks. In: Proceedings of the 11th SIAM International Conference on Data Mining (SDM-11) (2011), to appear

[13] Khosravi, H., Schulte, O., Man, T., Xu, X., Bina, B.: Structure learning for Markov logic networks with many descriptive attributes.
In: Proceedings of the 24th Conference on Artificial Intelligence (AAAI-10) (2010)

[14] Kok, S., Domingos, P.: Learning the structure of Markov logic networks. In: Proceedings of the 22nd International Conference on Machine Learning (ICML-05) (2005)

[15] Kok, S., Domingos, P.: Learning Markov logic network structure via hypergraph lifting. In: Proceedings of the 26th International Conference on Machine Learning (ICML-09) (2009)

[16] Kok, S., Domingos, P.: Learning Markov logic networks using structural motifs. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010)

[17] Kok, S., Singla, P., Richardson, M., Domingos, P.: The Alchemy system for statistical relational AI. Tech. rep., Department of Computer Science and Engineering, University of Washington (2005), http://www.cs.washington.edu/ai/alchemy

[18] Kummerfeld, E., Danks, D.: Online causal structure learning. Tech. Rep. CMU-PHIL-189, Carnegie Mellon University (December 2010)

[19] Lowd, D., Domingos, P.: Efficient weight learning for Markov logic networks. In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-07) (2007)

[20] Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)

[21] Mihalkova, L., Mooney, R.J.: Bottom-up learning of Markov logic network structure. In: Proceedings of the 24th International Conference on Machine Learning (ICML-07) (2007)

[22] Mihalkova, L., Mooney, R.J.: Learning to disambiguate search queries from short sessions. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD-09) (2009)

[23] Panciera, K., Halfaker, A., Terveen, L.: Wikipedians are born, not made: a study of power editors on Wikipedia.
In: Proceedings of the ACM 2009 International Conference on Supporting Group Work (GROUP) (2009)

[24] Poon, H., Domingos, P.: Sound and efficient inference with probabilistic and deterministic dependencies. In: Proceedings of the 21st Conference on Artificial Intelligence (AAAI-06) (2006)

[25] Poon, H., Domingos, P.: Joint unsupervised coreference resolution with Markov logic. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP-08) (2008)

[26] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62, 107–136 (2006)

[27] Srinivasan, A.: The Aleph manual (2001), http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/

[28] Wu, X., Yu, K., Wang, H., Ding, W.: Online streaming feature selection. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010)

[29] Zhou, T.C., Ma, H., Lyu, M.R., King, I.: UserRec: a user recommendation framework in social tagging systems. In: Proceedings of the 24th Conference on Artificial Intelligence (AAAI-10) (2010)
