Scaling Inference for Markov Logic with a Task-Decomposition Approach

Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik
University of Wisconsin-Madison
{leonn, czhang, chrisre, shavlik}@cs.wisc.edu

October 25, 2018

Abstract

Motivated by applications in large-scale knowledge base construction, we study the problem of scaling up a sophisticated statistical inference framework called Markov Logic Networks (MLNs). Our approach, Felix, uses the idea of Lagrangian relaxation from mathematical programming to decompose a program into smaller tasks while preserving the joint-inference property of the original MLN. The advantage is that we can use highly scalable specialized algorithms for common tasks such as classification and coreference. We propose an architecture to support Lagrangian relaxation in an RDBMS, which we show enables scalable joint inference for MLNs. We empirically validate that Felix is significantly more scalable and efficient than prior approaches to MLN inference by constructing a knowledge base from 1.8M documents as part of the TAC challenge. We show that Felix scales and achieves state-of-the-art quality numbers. In contrast, prior approaches do not scale even to a subset of the corpus that is three orders of magnitude smaller.

1 Introduction

Building large-scale knowledge bases from text has recently received tremendous interest from academia [48], e.g., CMU's NELL [8] and MPI's YAGO [21, 29], and from industry, e.g., Microsoft's EntityCube [52] and IBM's Watson [17]. In their quest to extract knowledge from free-form text, a major problem all these systems face is coping with inconsistency, due both to conflicting information in the underlying sources and to the difficulty machines have in understanding natural-language text. To cope with this challenge, each of the above systems uses statistical inference to resolve these ambiguities in a principled way.
To support this, the research community has developed sophisticated statistical inference frameworks, e.g., PRMs [18], BLOG [28], MLNs [34], SOFIE [43], Factorie [26], and LBJ [36]. The key challenge with these systems is efficiency and scalability, and to develop the next generation of sophisticated text applications, we argue that a promising approach is to improve the efficiency and scalability of the above frameworks. To understand the challenges of scaling such frameworks, we focus on one popular framework, called Markov Logic Networks (MLNs), that has been successfully applied to many challenging text applications [4, 32, 43, 52]. In Markov Logic one can write first-order logic rules with weights (that intuitively model our confidence in a rule); this allows a developer to capture rules that are likely, but not certain, to be correct. A key technical challenge has been the scalability of MLN inference. Not surprisingly, there has been intense research interest in techniques to improve the scalability and performance of MLNs, such as improving memory efficiency [42], leveraging database technologies [30], and designing algorithms for special-purpose programs [4, 43]. Our work here continues this line of work. Our goal is to use Markov Logic to construct a structured database of facts and then answer questions like "which Bulgarian leaders attended Sofia University and when?" with provenance from text. (Our system, Felix, answers Georgi Parvanov and points to a handful of sentences in a corpus to demonstrate its answer.) During the iterative process of constructing such a knowledge base from text and then using that knowledge base to answer sophisticated questions, we have found that it is critical to efficiently process structured queries over large volumes of structured data. And so, we have built Felix on top of an RDBMS.
Figure 1: Felix breaks an input program, Γ, into several smaller tasks (shown in Panel a), while prior approaches are monolithic (shown in Panel b).

However, as we verify experimentally later in this paper, the scalability of previous RDBMS-based solutions to MLN inference [30] is still limited. Our key observation is that in many text-processing applications, one must solve a handful of common subproblems, e.g., coreference resolution or classification. Some of these have been studied for decades, and so have specialized algorithms with higher scalability on these subproblems than the monolithic inference used by typical Markov Logic systems. Thus, our goal is to leverage specialized algorithms for these subproblems to provide more scalable inference for general Markov Logic programs in an RDBMS. Figure 1 illustrates the difference at a high level between Felix and prior approaches: prior approaches, such as Alchemy [34] or Tuffy [30], are monolithic in that they attack the entire MLN inference problem with one algorithm; in contrast, Felix decomposes the problem into several smaller tasks. To achieve this goal, we observe that the problem of inference in an MLN (and essentially any kind of statistical inference) can be cast as a mathematical optimization problem. Thus, we adapt techniques from the mathematical programming literature to MLN inference. In particular, we consider the idea of Lagrangian relaxation [6, p. 244], which allows one to decompose a complex optimization problem into multiple pieces that are hopefully easier to solve [37, 51].
Lagrangian relaxation is a widely deployed technique for coping with difficult mathematical programming problems, and it is the theoretical underpinning of many state-of-the-art inference algorithms for graphical models, e.g., belief propagation [46]. In many (but not all) cases, a Lagrangian relaxation has the same optimal solution as the underlying original problem [6, 7, 51]. At a high level, Lagrangian relaxation gives us a message-passing protocol that resolves inconsistencies among conflicting predictions to accomplish joint inference. Our system, Felix, does not actually construct the mathematical program, but uses Lagrangian relaxation as a formal guide to decompose an MLN program into multiple tasks and to construct an appropriate message-passing scheme. Our first technical contribution is an architecture to scalably perform MLN inference in an RDBMS using Lagrangian relaxation. Our architecture models each subproblem as a task that takes as input a set of relations and outputs another set of relations. For example, our prototype of Felix implements specialized algorithms for classification and coreference resolution (coref); these tasks frequently occur in text-processing applications. By modeling tasks in this way, we are able to use SQL queries for all data movement in the system: both transforming the input data into an appropriate form for each task and encoding the message passing of Lagrangian relaxation between tasks. In turn, this allows Felix to leverage the mature, set-at-a-time processing power of an RDBMS to achieve scalability and efficiency. On all programs and datasets that we experimented with, our approach converges rapidly to the optimal solution of the Lagrangian relaxation.
Figure 2: An example MLN program that performs three tasks jointly: (1) discover affiliation relationships between people and organizations (affil); (2) resolve coreference among person mentions (pCoref); and (3) resolve coreference among organization mentions (oCoref). The remaining eight relations are evidence relations. In particular, coOccurs stores person-organization co-occurrences, and the *Sim* relations store string similarities.

Schema (relations marked * are query relations):
  pSimHard(per1, per2)    pSimSoft(per1, per2)
  oSimHard(org1, org2)    oSimSoft(org1, org2)
  coOccurs(per, org)      homepage(per, page)
  oMention(page, org)     faculty(org, per)
  *affil(per, org)        *oCoref(org1, org2)    *pCoref(per1, per2)

Evidence:
  coOccurs('Ullman', 'Stanford Univ.')   coOccurs('Jeff Ullman', 'Stanford')
  coOccurs('Gray', 'San Jose Lab')       coOccurs('J. Gray', 'IBM San Jose')
  coOccurs('Mike', 'UC-Berkeley')        coOccurs('Mike', 'UCB')
  coOccurs('Joe', 'UCB')                 faculty('MIT', 'Chomsky')
  homepage('Joe', 'Doc201')              oMention('Doc201', 'IBM')
  ...

Rules (weight, rule):
  +∞  pCoref(p, p)                                                    (F1)
  +∞  pCoref(p1, p2) => pCoref(p2, p1)                                (F2)
  +∞  pCoref(x, y), pCoref(y, z) => pCoref(x, z)                      (F3)
  6   pSimHard(p1, p2) => pCoref(p1, p2)                              (F4)
  2   affil(p1, o), affil(p2, o), pSimSoft(p1, p2) => pCoref(p1, p2)  (F5)
  +∞  faculty(o, p) => affil(p, o)                                    (F6)
  8   homepage(p, d), oMention(d, o) => affil(p, o)                   (F7)
  3   coOccurs(p, o1), oCoref(o1, o2) => affil(p, o2)                 (F8)
  4   coOccurs(p1, o), pCoref(p1, p2) => affil(p2, o)                 (F9)
  ...

Our ultimate goal is to build high-quality applications, and we validate on several knowledge-base construction tasks that Felix achieves higher scalability and essentially identical result quality compared to prior MLN systems. More precisely, when prior MLN systems are able to scale, Felix converges to the same quality (and sometimes more efficiently). When prior MLN systems fail to scale, Felix can still produce high-quality results. We take this as evidence that Felix's approach is a promising direction for scaling up large-scale statistical inference. Furthermore, we validate that the ability to integrate specialized algorithms is crucial for Felix's scalability: after disabling specialized algorithms, Felix no longer scales to the same datasets. Although the RDBMS provides some level of scalability for data movement inside Felix, the scale of data passed between tasks (via SQL queries) may be staggering. The reason is that statistical algorithms may produce huge numbers of combinations (say, all pairs of potentially matching person mentions). The sheer sizes of intermediate results are often killers for scalability; e.g., the complete input to coreference resolution on an Enron dataset has 1.2 × 10^11 tuples. The saving grace is that a task may access the intermediate data in an on-demand manner. For example, a popular coref algorithm repeatedly asks "given a fixed word x, tell me all words that are likely to be coreferent with x" [3, 5]. Moreover, the algorithm asks this for only a small fraction of such x. Thus, it would be wasteful to produce all possible matching pairs.
Instead, we can produce only those words that are needed, on demand (i.e., materialize them lazily). Felix considers a richer space of possible materialization strategies than simply eager or lazy: it can choose to eagerly materialize one or more subqueries responsible for data movement between tasks [33]. To make such decisions, Felix's second contribution is a novel cost model that leverages the cost-estimation facility of the RDBMS coupled with the data-access patterns of the tasks. On the Enron dataset, our cost-based approach finds execution plans that achieve two orders of magnitude speedup over eager materialization and 2-3X speedup compared to lazy materialization. Although Felix allows a user to provide any decomposition scheme, identifying decompositions can be difficult for some users, and we do not want to force users to specify a decomposition in order to use Felix. To support this, we need a compiler that performs task decomposition given a standard MLN program as input. Building on classical and new results on embedded dependency inference from the database theory literature [1, 2, 10, 14], we show that the underlying compilation problem is Π₂P-complete in easier cases, and undecidable in more difficult cases. To cope, we develop a sound (but not complete) compiler that takes as input an ordinary MLN program, identifies common tasks such as classification and coref, and then assigns those tasks to specialized algorithms. To validate that our system can perform sophisticated knowledge-base construction tasks, we use Felix to implement a solution to the TAC-KBP (Knowledge Base Population) challenge.¹ Given a 1.8M-document corpus, the goal is to perform two related tasks: (1) entity linking: extract all entity mentions and map them to entries in Wikipedia, and (2) slot filling: determine relationships between entities.
The reason for choosing this task is that it comes with ground truth against which we can assess our results: we achieved F1=0.80 on entity linking (human performance is 0.90) and F1=0.34 on slot filling (state-of-the-art quality).² In addition to KBP, we also use three information extraction (IE) datasets that have state-of-the-art solutions. On all four datasets, we show that Felix is significantly more scalable than monolithic systems such as Tuffy and Alchemy; this in turn enables Felix to efficiently process sophisticated MLNs and produce high-quality results. Furthermore, we validate that our individual technical contributions are crucial to the overall performance and quality of Felix.

¹ http://nlp.cs.qc.cuny.edu/kbp/2010/
² F1 is the harmonic mean of precision and recall.

Outline. In Section 2, we describe related work. In Section 3, we describe a simple text application encoded as an MLN program, as well as the Lagrangian relaxation technique from mathematical programming. In Section 4, we present an overview of Felix's architecture and some key concepts. In Section 5, we describe key technical challenges and how Felix addresses them: how to execute individual tasks with high performance and quality, how to improve the efficiency of data movement between tasks, and how to automatically recognize specialized tasks in an MLN program. In Section 6, we use extensive experiments to validate the overall advantage of Felix as well as our individual technical contributions.

2 Related Work

There is a trend toward building semantically deep text applications with increasingly sophisticated statistical inference [15, 43, 49, 52]. We follow this line of work. However, while the goal of prior work is to explore the effectiveness of different correlation structures on particular applications, our goal is to support general application development by scaling up existing statistical inference frameworks. Wang et al. [47] explore multiple inference algorithms for information extraction. However, their system focuses on managing low-level extractions in CRF models, whereas our goal is to use MLNs to support knowledge-base construction. Felix specializes to MLNs. There are, however, other statistical inference frameworks, such as PRMs [18], BLOG [28], Factorie [26, 50], and PrDB [40]. Our hope is that the techniques developed here apply to these frameworks as well. Researchers have proposed various approaches to improving MLN inference performance in the context of text applications. In StatSnowball [52], Zhu et al. demonstrate high-quality results with an MLN-based approach. To address the scalability issues of generic MLN inference, they make additional independence assumptions in their programs. In contrast, the goal of Felix is to automatically scale up statistical inference while adhering to MLN semantics. Theobald et al. [44] design specialized MaxSAT algorithms that efficiently solve MLN programs of special forms. In contrast, we study how to scale general MLN programs. Riedel [35] proposed a cutting-plane meta-algorithm that iteratively performs grounding and inference, but the underlying grounding and inference procedures are still generic MLN procedures. In Tuffy [30], the authors improve the scalability of MLN inference with an RDBMS, but their system is still a monolithic approach consisting of generic inference procedures. As a classic technique, Lagrangian relaxation has been applied to closely related statistical models (i.e., graphical models) [20, 46]. However, there the input is directly a mathematical optimization problem, and the granularity of decomposition is individual variables. In contrast, our input is a program in a high-level language, and we perform decomposition at the relation level inside an RDBMS.
Our materialization tradeoff strategy is related to view materialization and view selection [11, 41] in the context of data warehousing. However, our problem setting is different: we focus on batch processing, and so do not consider maintenance cost. The idea of a lazy-eager tradeoff in view materialization and query answering has also been applied to probabilistic databases [50]. However, their goal is to efficiently maintain intermediate results, rather than to choose a materialization strategy. Similar in spirit to our approach is Sprout [31], which considers lazy-versus-eager plans for when to apply confidence computation, but does not consider inference decomposition.

3 Preliminaries

To illustrate how MLNs can be used in text-processing applications, we first walk through a program that extracts affiliations between people and organizations from Web text. We then describe how Lagrangian relaxation is used for mathematical optimization.

3.1 Markov Logic Networks in Felix

In text applications, a typical first step is to use standard NLP toolkits to generate raw data, such as plausible mentions of people and organizations in a Web corpus and their co-occurrences. But transforming such raw signals into high-quality and semantically coherent knowledge bases is a challenging task. For example, a major challenge is that a single real-world entity may be referred to in many different ways, e.g., "UCB" and "UC-Berkeley". To address such challenges, MLNs provide a framework in which we can express logical assertions that are only likely to be true (and quantify that likelihood). Below we explain the key concepts of this framework by walking through an example. Our system Felix is a middleware system: it takes as input a standard MLN program, performs statistical inference, and outputs its results into one or more relations stored in a relational database (PostgreSQL).
An MLN program consists of three parts: schema, evidence, and rules. To tell Felix what data will be provided or generated, the user provides a schema. Some relations are standard database relations; we call these relations evidence. Intuitively, evidence relations contain tuples that we assume are correct. In the schema of Figure 2, the first eight relations are evidence relations. For example, we know that 'Ullman' and 'Stanford Univ.' co-occur in some webpage, and that 'Doc201' is the homepage of 'Joe'. In addition to evidence relations, there are relations whose content we do not know but want the MLN program to predict; these are called query relations. In Figure 2, affil is a query relation, since we want the MLN to predict affiliation relationships between persons and organizations. The other two query relations are pCoref and oCoref, for person and organization coreference, respectively. In addition to the schema and evidence, we also provide a set of MLN rules that encode our knowledge about the correlations and constraints over the relations. An MLN rule is a first-order logic formula associated with an extended-real-valued number called a weight. Infinite-weighted rules are called hard rules; they must hold in any prediction the MLN system makes. In contrast, rules with finite weights are soft rules: a positive weight indicates confidence in the rule's correctness. (In Felix, weights can be set by the user or learned automatically. We do not discuss learning in this work.)

Example 1. An important type of hard rule is a standard SQL query, e.g., to transform the results for use in the application. A more sophisticated example of a hard rule encodes that coreference is transitive, which is captured by the hard rule F3.
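As a concrete sketch of how a hard rule like F3 behaves, the transitive closure it demands can be computed with a recursive SQL query. This is a hypothetical illustration using SQLite; the table name and sample pairs are ours, and a real MLN system enforces F3 during inference rather than by a one-shot closure computation.

```python
import sqlite3

# Minimal sketch (not Felix's implementation): apply the transitive hard
# rule F3 by computing the transitive closure of pCoref pairs.
# The table name "pcoref" and the sample pairs are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pcoref (p1 TEXT, p2 TEXT)")
conn.executemany("INSERT INTO pcoref VALUES (?, ?)",
                 [("Ullman", "Jeff Ullman"), ("Jeff Ullman", "J. Ullman")])

# WITH RECURSIVE repeatedly applies F3: pCoref(x,y), pCoref(y,z) => pCoref(x,z).
closure = conn.execute("""
    WITH RECURSIVE tc(p1, p2) AS (
        SELECT p1, p2 FROM pcoref
        UNION
        SELECT tc.p1, pcoref.p2 FROM tc JOIN pcoref ON tc.p2 = pcoref.p1
    )
    SELECT p1, p2 FROM tc
""").fetchall()

print(sorted(closure))
# The derived pair ("Ullman", "J. Ullman") now appears, as F3 requires.
```

(For brevity the sketch omits the reflexivity and symmetry rules F1 and F2, which would be applied analogously.)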
Rules F8 and F9 use person-organization co-occurrences (coOccurs) together with coreference (pCoref and oCoref) to deduce affiliation relationships (affil). These rules are soft, since co-occurrence in a webpage does not necessarily imply affiliation. Intuitively, when a soft rule is violated, we pay a cost equal to the absolute value of its weight (described below). For example, if coOccurs('Ullman', 'Stanford Univ.') and pCoref('Ullman', 'Jeff Ullman') hold, but affil('Jeff Ullman', 'Stanford Univ.') does not, then we pay a cost of 4 because of F9. The goal of an MLN inference algorithm is to find a prediction that minimizes the sum of such costs. (Roughly, these weights correspond to the log odds of the probability that the statement is true, where the log odds of probability p is log(p/(1-p)); in general, weights do not have a simple probabilistic interpretation [34].)

Semantics. An MLN program defines a probability distribution over database instances (possible worlds). Formally, we first fix a schema σ (as in Figure 2) and a domain D. Given as input a set of formulae F1, ..., FN with weights w1, ..., wN, they define a probability distribution over possible worlds (deterministic databases) as follows. Given a formula Fk with free variables x = (x1, ..., xm), for each tuple d = (d1, ..., dm) in D^m we create a new formula g_d, called a ground formula, where g_d denotes the result of substituting each variable xi of Fk with di. We assign the weight wk to g_d. Denote by G the set of all such weighted ground formulae, and call the set of all tuples in G the ground database. Let w be a function that maps each ground formula to its assigned weight. Fixing the MLN, for any possible world (instance) I we say a ground formula g is violated if w(g) > 0 and g is false in I, or if w(g) < 0 and g is true in I. We denote the set of ground formulae
violated in a world I as V(I). The cost of the world I is

    cost_mln(I) = Σ_{g ∈ V(I)} |w(g)|     (1)

Through cost_mln, an MLN defines a probability distribution over all instances using the exponential family of distributions (the basis for graphical models [46]):

    Pr[I] = Z^{-1} exp{-cost_mln(I)}

where Z is a normalizing constant.

Inference. There are two main types of inference with MLNs: MAP (maximum a posteriori) inference, where we want to find a most likely world, i.e., a world with the lowest cost; and marginal inference, where we want to compute the marginal probability of each unknown tuple. Both types of inference are essentially mathematical optimization problems that are intractable, and so existing MLN systems implement generic (search/sampling) algorithms for inference. As a baseline, Felix implements generic algorithms for both types of inference as well. Although Felix supports both types of inference in our decomposition architecture, in this work we focus on MAP inference to simplify the presentation.

3.2 Lagrangian Relaxation

We illustrate the basic idea of Lagrangian relaxation with a simple example. Consider the problem of minimizing a real-valued function f(x1, x2, x3). Lagrangian relaxation is a technique that allows us to divide and conquer a problem like this. For example, suppose that f can be written as

    f(x1, x2, x3) = f1(x1, x2) + f2(x2, x3).

While we may be able to solve each of f1 and f2 efficiently, that ability does not directly lead to a solution to f, since f1 and f2 share the variable x2. However, we can rewrite min_{x1,x2,x3} f(x1, x2, x3) into the form

    min_{x1,x21,x22,x3} f1(x1, x21) + f2(x22, x3)   s.t.  x21 = x22,

where we essentially make two copies of x2 and enforce that they are identical.
The significance of this rewriting is that we can apply Lagrangian relaxation to the equality constraint to decompose the formula into two independent pieces. To do so, we introduce a scalar variable λ ∈ R (called a Lagrange multiplier) and define

    g(λ) = min_{x1,x21,x22,x3} f1(x1, x21) + f2(x22, x3) + λ(x21 - x22).

Then max_λ g(λ) is called the dual problem of the original minimization problem on f. Intuitively, the dual problem trades off a penalty for how much the copies x21 and x22 disagree against the original objective value. If the resulting solution of this dual problem is feasible for the original program (i.e., satisfies the equality constraint), then this solution is also an optimum of the original program [51, p. 168]. The key benefit of such a relaxation is that, instead of a single problem on f, we can now compute g(λ) by solving two independent problems (each grouped by parentheses) that are hopefully (much) easier:

    g(λ) = ( min_{x1,x21} f1(x1, x21) + λ·x21 ) + ( min_{x22,x3} f2(x22, x3) - λ·x22 ).

To compute max_λ g(λ), we can use standard techniques such as gradient descent [51, p. 174]. Notice that Lagrangian relaxation can be used for MLN inference: consider the case where the xi are truth values of database tuples representing a possible world I, and define f to be cost_mln(I) as in Equation 1. (Felix can handle marginal inference with Lagrangian relaxation as well, but we focus on MAP inference to simplify presentation.)

Figure 3: Execution pipeline of Felix: an MLN program is compiled into a logical plan, optimized into a physical plan, and executed.

Decomposition Choices. The Lagrangian relaxation technique leaves open the questions of how to decompose a function f in general and how to introduce the equality constraints. These are the questions we must answer first and foremost if we want to apply Lagrangian relaxation to MLNs.
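To make the two-subproblem relaxation concrete, here is a toy numerical sketch over Boolean variables. The cost tables f1 and f2 are our own invented example, not from the paper: each subproblem is solved by enumeration, and λ is adjusted by a subgradient step until the two copies of x2 agree.

```python
import itertools

# Toy sketch of Lagrangian relaxation with subgradient steps.
# f(x1,x2,x3) = f1(x1,x2) + f2(x2,x3) over Boolean variables; the cost
# tables below are hypothetical, chosen so f1 prefers x2=1 and f2 weakly
# prefers x2=0.
def f1(x1, x21): return {(0,0): 3.0, (0,1): 0.0, (1,0): 4.0, (1,1): 1.0}[(x1, x21)]
def f2(x22, x3): return {(0,0): 0.0, (0,1): 2.0, (1,0): 0.4, (1,1): 2.4}[(x22, x3)]

lam = 0.0  # Lagrange multiplier for the constraint x21 = x22
for step in range(100):
    # Solve the two independent subproblems by enumeration.
    x1, x21 = min(itertools.product((0, 1), repeat=2),
                  key=lambda v: f1(*v) + lam * v[1])
    x22, x3 = min(itertools.product((0, 1), repeat=2),
                  key=lambda v: f2(*v) - lam * v[0])
    if x21 == x22:  # copies agree: the dual solution is feasible, hence optimal
        break
    lam += 0.5 / (1 + step) * (x21 - x22)  # subgradient step on g(lambda)

print(x1, x21, x22, x3, f1(x1, x21) + f2(x22, x3))
# -> 0 1 1 0 0.4
```

Here λ grows until the penalty λ·x22 makes it worthwhile for the second subproblem to flip its copy of x2 to 1, at which point the copies agree and the combined cost (0.4) matches the brute-force optimum of f.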
Furthermore, it is important that we can scale the execution of the decomposed program on large datasets.

4 Architecture of Felix

In this section, we provide an overview of the Felix architecture and some key concepts; we expand on further technical details in the next section. At a high level, the way Felix performs MLN inference resembles how an RDBMS performs SQL query evaluation. Given an MLN program Γ, Felix transforms it in several phases, as illustrated in Figure 3: Felix first compiles the MLN program into a logical plan of tasks; then it performs optimization (code selection) to select the best physical plan, which consists of a sequence of statements that are then executed (by a process called the Master). In turn, the Master may call an RDBMS or statistical inference algorithms.

4.1 Compilation

In MLN inference, a variable of the underlying optimization problem corresponds to the truth value (for MAP inference) or the marginal probability (for marginal inference) of a query-relation tuple. While Lagrangian relaxation allows us to decompose an inference problem in arbitrary ways, Felix focuses on decompositions at the level of relations: Felix ensures that an entire relation is either shared between subproblems or exclusive to one subproblem. A key advantage of this is that Felix can benefit from the set-oriented processing power of an RDBMS. Even with this restriction, any partitioning of the rules in an MLN program Γ is a valid decomposition. (For the moment, assume that all rules are soft; we return to hard rules in Section 4.3.) Formally, let Γ = {φ_i} be a set of MLN rules; denote by ℛ the set of query relations and by x_R the set of Boolean variables (i.e., unknown truth values) of each R ∈ ℛ. Let Γ_1, ..., Γ_k be a decomposition of Γ, and R_i ⊆ ℛ the set of query relations referred to by Γ_i. Define x_ℛ = ∪_{R ∈ ℛ} x_R, and define x_{R_i} similarly.
Then we can write the MLN cost function as

    min_{x_ℛ} cost_mln^Γ(x_ℛ) = min_{x_ℛ} Σ_{i=1}^{k} cost_mln^{Γ_i}(x_{R_i}).

To decouple the subprograms, we create a local copy x^i_{R_i} of the variables for each Γ_i, and introduce Lagrange multipliers λ^j_R ∈ R^{|x_R|} for each R ∈ ℛ and each Γ_j such that R ∈ R_j, resulting in the dual problem

    max_λ g(λ) ≡ max_λ { Σ_{i=1}^{k} min_{x^i_{R_i}} [ cost_mln^{Γ_i}(x^i_{R_i}) + λ^i_{R_i} · x^i_{R_i} ] }

    subject to  Σ_{j: R ∈ R_j} λ^j_R = 0  for all R ∈ ℛ.

Thus, to perform Lagrangian relaxation on Γ, we augment the cost function of each subprogram with the λ^i_{R_i} · x^i_{R_i} terms. As illustrated in the example below, these additional terms are equivalent to adding singleton rules with the multipliers as weights. As a result, we can still solve the (augmented) subproblems Γ^λ_i as MLN inference problems.

Figure 4: An example logical plan. Relations in shaded boxes are evidence relations; solid arrows indicate data flow, and dashed arrows indicate control. The Master coordinates tasks (e.g., a classification task for Γ^λ_1 and a generic inference task for Γ^λ_2) that read and write relations through data movement operators (DMOs).

Example 2. Consider a simple Markov Logic program Γ:

    1   GoodNews(p) => Happy(p)    (φ1)
    1   BadNews(p) => Sad(p)       (φ2)
    5   Happy(p) <=> ¬Sad(p)       (φ3)

where GoodNews and BadNews are evidence and the other two relations are queries. Consider the decomposition Γ_1 = {φ1} and Γ_2 = {φ2, φ3}. Γ_1 and Γ_2 share the relation Happy, so we create two copies of this relation, Happy1 and Happy2, one for each subprogram. To relax the requirement that Happy1 and Happy2 be equal, we introduce Lagrange multipliers λ_p, one for each possible tuple Happy(p).
We thereby obtain a new program Γ^λ:

    1     GoodNews(p) => Happy1(p)   (φ'1)
    λ_p   Happy1(p)                  (ϕ1)
    1     BadNews(p) => Sad(p)       (φ2)
    5     Happy2(p) <=> ¬Sad(p)      (φ'3)
    −λ_p  Happy2(p)                  (ϕ2)

This program contains two subprograms, Γ^λ_1 = {φ'1, ϕ1} and Γ^λ_2 = {φ2, φ'3, ϕ2}, that can be solved independently.

The output of compilation is a logical plan, a bipartite graph between a set of subprograms (e.g., the Γ^λ_i) and a set of relations (e.g., GoodNews and Happy). There is an edge between a subprogram and a relation if the subprogram refers to the relation. In general, the decomposition can be either user-provided or automatically generated; in Section 5.3 we discuss automatic decomposition.

4.2 Optimization

The optimization stage fleshes out the logical plan with code selection and generates a physical plan of detailed statements to be executed by a process in Felix called the Master. Each subprogram Γ^λ_i in the logical plan is executed as a task that encapsulates a statistical algorithm that consumes and produces relations. The default algorithm assigned to each task is a generic MLN inference algorithm that can handle any MLN program [30]. However, as we will see in Section 5.1, there are several families of MLNs that have specialized algorithms with high efficiency and high quality. For tasks matching those families, we execute the task with the corresponding specialized algorithm. The input/output relations of each task are not necessarily the relations in the logical plan. For example, the input to a classification task could be the results of conjunctive queries translated from MLN rules. To model such indirection, we introduce data movement operators (DMOs), which are essentially Datalog queries that map between MLN relations and task-specific relations.
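To make this indirection concrete, a DMO can be thought of as an ordinary view over the MLN relations. The following is a hypothetical sketch (table, view, and column names are ours, not Felix's): the view maps the MLN relations affil and pSimSoft to candidate input pairs for a coref task, and a bound-free access pattern becomes a parameterized lookup rather than a full materialization.

```python
import sqlite3

# Hypothetical sketch of a DMO as a plain SQL view (names are ours, not
# Felix's): it maps MLN relations (affil, pSimSoft) to a coref task's input.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE affil (person TEXT, org TEXT);
    CREATE TABLE psimsoft (p1 TEXT, p2 TEXT);
    INSERT INTO affil VALUES ('Joe', 'UCB'), ('Mike', 'UCB');
    INSERT INTO psimsoft VALUES ('Joe', 'Mike');
    -- The DMO: candidate coreferent pairs that share an affiliation.
    CREATE VIEW dmo AS
        SELECT s.p1 AS x, s.p2 AS y
        FROM affil a1 JOIN affil a2 ON a1.org = a2.org
        JOIN psimsoft s ON s.p1 = a1.person AND s.p2 = a2.person;
""")

# A "bound-free" (bf) access pattern: the task binds x and asks for all y,
# which becomes an on-demand parameterized lookup instead of producing all
# possible pairs up front.
ys = [y for (y,) in conn.execute("SELECT y FROM dmo WHERE x = ?", ("Joe",))]
print(ys)
```

The design point is that the RDBMS can choose an evaluation strategy for such a view (eager materialization, lazy per-lookup evaluation, or something in between) independently of the task that consumes it.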
Roughly speaking, DMOs for specialized algorithms play a role that is similar to what grounding does for generic MLN inference. Given a task Γ^λ_i, it is the responsibility of the underlying algorithm to generate all necessary DMOs and register them with Felix. Figure 4 shows an enriched logical plan after code selection and DMO generation.

DMOs are critical to the performance of Felix, and so we need to execute them efficiently. We observe that the overall performance of an evaluation strategy for a DMO depends not only on how well an RDBMS can execute SQL, but also on how, and how frequently, a task queries this DMO, namely the access pattern of this task. To expose the access patterns of a task to Felix, we model DMOs as adorned views [45]. In an adorned view, each variable in the head of a view definition is associated with a binding-type, which is either b (bound) or f (free). Given a DMO Q, denote by x̄^b (resp. x̄^f) the set of bound (resp. free) variables in its head. Then we can view Q as a function mapping an assignment to x̄^b (i.e., a tuple) to a set of assignments to x̄^f (i.e., a relation). Following the notation in Ullman [45], a query Q of arity a(Q) is written as Q^α(x̄) where α ∈ {b, f}^{a(Q)}. By default, all DMOs have the all-free binding pattern. But if a task exposes the access pattern of its DMOs, Felix can select evaluation strategies for the DMOs more informatively: Felix employs a cost-based optimizer for DMOs that takes advantage of both the RDBMS's cost-estimation facility and the data-access pattern of a task (see Section 5.2).

Example 2 Say the subprogram F1–F5 in Figure 2 is executed as a task that performs coreference resolution on pCoref, and Felix chooses the correlation clustering algorithm [3, 5] for this task. At this point, Felix knows the data-access properties of that algorithm (which essentially asks only for "neighboring" elements).
Felix represents this using the following adorned view:

  DMO^{bf}(x, y) ← affil(x, o), affil(y, o), pSimSoft(x, y).

which is adorned as bf. During execution, this coref task sends requests such as x = 'Joe', and expects to receive a set of names {y | DMO('Joe', y)}. Sometimes Felix can deduce from the DMOs how a task may be parallelized (e.g., via key attributes), and it takes advantage of such opportunities.

The output of optimization is a DAG of statements. Statements are of two forms: (1) a prepared SQL statement; (2) a statement encoding the necessary information to run a task (e.g., the number of iterations an algorithm should run, data locations, etc.).

4.3 Execution

In Felix, a process called the Master coordinates the tasks by periodically updating the Lagrange multiplier associated with each shared tuple (e.g., λ_p in Example 1). Such an iterative updating scheme is called master-slave message passing. The goal is to optimize max_λ g(λ) using standard subgradient methods [51, p. 174]. Specifically, let p be an unknown tuple of R; then at step k the Master updates each λ^i_p s.t. R ∈ R_i using the following rule:
\[
\lambda^i_p \;=\; \lambda^i_p + \alpha_k \left( x^i_p - \frac{\sum_{j:\, R \in \mathcal{R}_j} x^j_p}{|\{j : R \in \mathcal{R}_j\}|} \right),
\]
where α_k is the gradient step size for this update.

A key novelty of Felix is that we can leverage the underlying RDBMS to efficiently compute the gradient on an entire relation. To see why, let λ^j_p be the multiplier for a shared tuple p in copy j of a relation R; λ^j_p is stored as an extra attribute in each copy j of R. Note that at each iteration, λ^j_p changes only if the copies of R do not agree on p (e.g., exactly one copy has p missing). Thus, we can update all λ^j_p's with an outer join between the copies of R using SQL.
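To make the update rule concrete, the following is a minimal numeric sketch of Example 1, not Felix's actual implementation: the helper names (`solve_sub1`, `solve_sub2`, `run_master`), the brute-force subproblem solvers, and the step size are all illustrative assumptions. It assumes a single person p with both GoodNews(p) and BadNews(p) as evidence.

```python
# Sketch of master-slave message passing on Example 1 (illustrative only).
# MLN cost = total weight of violated ground rules; the lam * x terms are
# the Lagrangian singleton rules coupling the two copies of Happy(p).

def solve_sub1(lam):
    """Gamma^lam_1: '1 GoodNews(p) => Happy1(p)' plus lam * Happy1(p)."""
    # Brute force over Happy1(p) in {0, 1}; ties break toward 0.
    return min((1 * (1 - h) + lam * h, h) for h in (0, 1))[1]

def solve_sub2(lam):
    """Gamma^lam_2: '1 BadNews(p) => Sad(p)', '5 Happy2(p) <=> !Sad(p)',
    plus -lam * Happy2(p)."""
    return min(
        (1 * (1 - s) + 5 * (h == s) - lam * h, (h, s))
        for h in (0, 1) for s in (0, 1)
    )[1]

def run_master(step=0.5, max_iters=50):
    """Subgradient loop driving the two copies of Happy(p) to agree."""
    lam = 0.0
    for _ in range(max_iters):
        h1 = solve_sub1(lam)
        h2, _sad = solve_sub2(lam)
        if h1 == h2:                    # copies agree: consistent world
            break
        lam += step * (h1 - h2) / 2     # copy 2 implicitly carries -lam
    return h1, h2
```

Here the copies start out disagreeing (subprogram 1 prefers Happy, subprogram 2 prefers Sad); raising λ_p penalizes Happy1(p) and rewards Happy2(p) until both copies settle on the consistent world Happy = false, Sad = true, which is an optimum of the joint program.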
The gradient descent procedure stops either when all copies have reached agreement (or only a very small portion disagrees) or when Felix has run a pre-specified maximum number of iterations.

Scheduling and Parallelism Between two iterations of message passing, each task is executed until completion. If these tasks run sequentially (say, due to limited RAM or CPU), then any order of execution results in the same run time. On the other hand, if all tasks run in parallel, then faster tasks would have to wait for the slowest task to finish before message passing could proceed. To better utilize CPU time, Felix updates the Lagrange multipliers for a shared relation R whenever all involved tasks have finished. Furthermore, a task is restarted when all shared relations of this task have been updated. If computational resources are abundant, Felix also considers parallelizing a task.

  Task                       Implementation
  Simple Classification      Linear models [7]
  Correlated Classification  Conditional Random Fields [24]
  Coreference                Correlation clustering [3, 5]

Table 1: Example specialized tasks and their implementations in Felix.

Initialization and Finalization Let σ = T1, ..., Tn be a sequence of all tasks obtained by a breadth-first traversal of the logical plan. At initial execution time, to bootstrap from the initial empty state, we sequentially execute the tasks in the order of σ, each task initializing its local copies of a relation by copying from the output of previous tasks. Then Felix performs the above master-slave message-passing scheme for several iterations; during this phase all tasks can run in parallel. At the end of execution, we perform a finalization step: we traverse σ again and output the copy from T^R_last for each query relation R, where T^R_last is the last task in σ that outputs R.
To ensure that hard rules in the input MLN program are not violated in the final output, we insist that for any query relation R, T^R_last respect all hard rules involving R. (We allow hard rules to be assigned to multiple tasks.) This guarantees that the output of the finalization step is a possible world for Γ (provided that the hard rules are satisfiable).

5 Technical Details

Having set up the general framework, in this section we discuss further technical challenges and solutions in Felix. First, as each individual task might be as complex as the original MLN, decomposition by itself does not automatically lead to high scalability. To address this issue, we identify several common statistical tasks with well-studied algorithms and characterize their correspondence with MLN subprograms (Section 5.1). Second, even when each individual task is able to run efficiently, the data movement cost may sometimes be prohibitive. To address this issue, we propose a novel cost-based materialization strategy for data movement operators (Section 5.2). Third, since the user may not be able to provide a good task decomposition scheme, it is important for Felix to be able to compile an MLN program into tasks automatically. To support this, we describe the compiler of Felix, which automatically recognizes specialized tasks in an MLN program (Section 5.3).

5.1 Specialized Tasks

By default, Felix solves a task (which is also an MLN program) with a generic MLN inference algorithm based on a reduction to MaxSAT [22], which is designed to solve sophisticated MLN programs. Ideally, when a task has certain properties indicating that it can be solved using a more efficient specialized algorithm, Felix should do so. Conceptually, the Felix framework supports all statistical tasks that can be modeled as mathematical programs.
As an initial proof of concept, our prototype of Felix integrates two statistical tasks that are widely used in text applications: classification and coreference (see Table 1). These specialized tasks are well studied and so have algorithms with high efficiency and high quality.

Classification Classification tasks are ubiquitous in text applications; e.g., classifying documents by topics or sentiments, and classifying noun phrases by entity types. In a classification task, we are given a set of objects and a set of labels; the goal is to assign a label to each object. Depending on the structure of the cost function, there are two types of classification tasks: simple classification and correlated classification. In simple classification, given a model, the assignment of each object to a label is independent of other object labels. We describe a Boolean classification task for simplicity, i.e., our goal is to determine whether each object is in or out of a single class. The input to a Boolean classification task is a pair of relations: the model, which can be viewed as a relation M(f, w) that maps each feature f to a single weight w ∈ ℝ, and a relation of objects I(o, f); a tuple (o, f) is in I if and only if object o has feature f. The output is a relation R(o) that indicates which objects are members of the class (R can also contain their marginal probabilities).

For simple classification, the optimal R can be populated by including those objects o such that
\[
\sum_{w:\, M(f, w) \;\text{and}\; I(o, f)} w \;\geq\; 0.
\]
One can implement a simple classification task with SQL aggregates, which should be much more efficient than the MaxSAT algorithm used in generic MLN inference. The twist in Felix is that the objects and the features of the model are defined by MLN rules.
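The aggregate above can be sketched in a few lines. This is an illustrative stand-in for the SQL-aggregate implementation (the SQL analogue would be SUM(w) ... GROUP BY o HAVING SUM(w) >= 0); the dict/list encodings of M and I are assumptions of this sketch.

```python
# Sketch of simple classification as a grouped weight aggregate.

def classify(model, objects):
    """model: dict feature -> weight (the relation M(f, w));
    objects: iterable of (o, f) pairs (the relation I(o, f)).
    Returns the set of objects whose total feature weight is >= 0."""
    totals = {}
    for o, f in objects:
        totals[o] = totals.get(o, 0.0) + model.get(f, 0.0)
    return {o for o, w in totals.items() if w >= 0}
```

For instance, with a model containing a strong negative feature, an object carrying that feature is excluded unless its positive features outweigh it.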
For example, the rules F6 and F7 in Figure 2 form a classification task that determines whether each affil tuple (considered as an object) holds. Said another way, each rule is a feature. So Felix populates the model relation M with two tuples, M(F6, +∞) and M(F7, 8), and populates the input relation I by executing the conjunctive queries in F6 and F7; e.g., from F7 Felix generates tuples of the form I(P, O, F7), which indicates that the object affil(P, O) has the feature F7.⁴ Operationally, Felix performs such translation via DMOs that are also adorned with the task's access patterns; e.g., the DMO for I has the adornment I^{bbf} since Felix classifies each affil(P, O) independently.

Felix extends this basic model in two ways: (1) Felix implements multi-class classification by adding a class attribute to M and I. (2) Felix also supports correlated classification: in addition to per-object features, Felix allows features that span multiple objects. For example, in named entity recognition, if we see the token "Mr." then the next token is very likely to be a person's name. In general, one can form a graph where the nodes are objects and two objects are connected if there is a rule that refers to both objects. When this graph is acyclic, the task essentially consists of tree-structured CRF models that can be solved in polynomial time with dynamic programming algorithms [24].

Coreference Another common task is coreference resolution (coref): e.g., given a set of strings (say, phrases in a document), we want to decide which strings represent the same real-world entity. These tasks are ubiquitous in text processing. The input to a coref task is a single relation B(o1, o2, wgt) where wgt = β_{o1,o2} ∈ ℝ indicates how likely it is that the objects o1, o2 are coreferent (with 0 being neutral).
The output of a coref task is a relation R(o1, o2) that indicates which pairs of objects are coreferent; R is an equivalence relation, i.e., it satisfies reflexivity, symmetry, and transitivity. Assuming that β_{o1,o2} = 0 if (o1, o2) is not in the key set of the relation B, each valid R incurs a cost (called the disagreement cost)
\[
\mathrm{cost}_{\mathrm{coref}}(R) \;=\; \sum_{o_1, o_2:\, (o_1, o_2) \notin R \;\text{and}\; \beta_{o_1,o_2} > 0} |\beta_{o_1,o_2}| \;+\; \sum_{o_1, o_2:\, (o_1, o_2) \in R \;\text{and}\; \beta_{o_1,o_2} < 0} |\beta_{o_1,o_2}|.
\]
The goal of coref is to find a relation with the minimum cost: R* = argmin_R cost_coref(R).

Coreference resolution is a well-studied problem [5, 16]. The underlying inference problem is NP-hard in almost all variants. As a result, there is a literature on approximation techniques (e.g., correlation clustering [3, 5]). Felix implements these algorithms for coreference tasks. In Figure 2, F1 through F5 constitute a coref task for the relation pCoref. F1 through F3 encode the reflexivity, symmetry, and transitivity properties of pCoref, and F4 and F5 essentially define the weights on the edges (similar to Arasu [5]) from which Felix constructs the relation B (via DMOs).

5.2 Optimizing Data Movement Operators

Recall that data are passed between tasks and the RDBMS via data movement operators (DMOs). While the statistical algorithm inside a task may be very efficient (Section 5.1), DMO evaluation could be a major scalability bottleneck. An important goal of Felix's optimization stage is to decide whether and how to materialize DMOs. For example, a baseline approach would be to materialize all DMOs. While this is a reasonable approach when a task repeatedly queries a DMO with the same parameters, in some cases the result may be so large that an eager materialization strategy would exhaust available disk space.

⁴ In general, a model usually has both positive and negative features.
For example, on an Enron dataset, materializing the following DMO would require over 1 TB of disk space:

  DMO^{bb}(x, y) ← mention(x, name1), mention(y, name2), mayref(name1, z), mayref(name2, z).

Moreover, some specialized tasks may inspect only a small fraction of their search space, so eager materialization is inefficient for them. For example, one implementation of the coref task is a stochastic algorithm that examines a number of data items roughly linear in the number of nodes (even though the input to coref contains a quadratic number of pairs of nodes) [5]. In such cases, it seems more reasonable to simply declare the DMO as a regular database view (or prepared statement) that is evaluated lazily during execution.

Felix is, however, not confined to fully eager or fully lazy strategies. In Felix, we have found that intermediate points (e.g., materializing a subquery of a DMO Q) can yield dramatic speed improvements (see Section 6.4). To choose among materialization strategies, Felix takes hints from the tasks: Felix allows a task to expose its access patterns, including both an adornment Q^α (see Section 4.2) and an estimated number of accesses t on Q. (Operationally, t could be a Java function or a SQL query to be evaluated against the base relations of Q.) These parameters, together with the cost-estimation facility of the underlying RDBMS (here, PostgreSQL), enable a System-R-style cost-based optimizer in Felix that explores all possible materialization strategies using the following cost model.

Felix Cost Model To define our cost model, we introduce some notation. Let Q^α(x̄) ← g1, g2, ..., gk be a DMO. Let G = {g_i | 1 ≤ i ≤ k} be the set of subgoals of Q. Let 𝒢 = {G1, ..., Gm} be a partition of G; i.e., G_j ⊆ G, G_i ∩ G_j = ∅ for all i ≠ j, and ∪_j G_j = G.
Intuitively, a partition represents a possible materialization strategy: each element of the partition represents a query (or simply a relation) that Felix considers materializing. The case of a single G_i = G corresponds to a fully eager strategy; the case where all G_i are singleton sets corresponds to a lazy strategy. More precisely, define Q_j(x̄_j) ← G_j, where x̄_j is the set of variables in G_j shared with x̄ or with any other G_i for i ≠ j. Then we can implement the DMO with a regular database view Q′(x̄) ← Q1, ..., Qm. Let t be the total number of accesses on Q′ performed by the statistical task. We model the execution cost of a materialization strategy as
\[
\mathrm{ExecCost}(Q', t) \;=\; t \cdot \mathrm{Inc}_\alpha(Q') + \sum_{i=1}^{m} \mathrm{Mat}(Q_i),
\]
where Mat(Q_i) is the cost of eagerly materializing Q_i and Inc_α(Q′) is the estimated cost of each query to Q′ with adornment α.

A significant implementation detail is that, since the subgoals in Q′ are not actually materialized, we cannot directly ask PostgreSQL for the incremental cost Inc_α(Q′).⁵ In our prototype version of Felix, we implement a simple approximation of PostgreSQL's optimizer (one that assumes incremental plans use only index-nested-loop joins), and so our results should be taken as a lower bound on the performance gains that are possible when materializing one or more subqueries. We provide more details on this approximation in Section C.3. Although the number of possible plans is exponential in the size of the largest rule in an input Markov Logic program, in our applications the individual rules are small. Thus, we can estimate the cost of each alternative and pick the one with the lowest ExecCost.
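The search over materialization strategies can be sketched as follows; this is an illustrative sketch, not Felix's optimizer. The cost callbacks `mat` and `inc` stand in for the RDBMS's materialization and incremental-access estimates, which this sketch does not model.

```python
# Sketch of the cost-based materialization search: enumerate all set
# partitions of a DMO's subgoals and pick the lowest-ExecCost partition.

def partitions(items):
    """Yield every set partition of a list of subgoals."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `head` into each existing block, or into a new singleton block.
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part

def best_strategy(subgoals, t, mat, inc):
    """ExecCost(Q', t) = t * inc(partition) + sum of mat(block)."""
    return min(
        (t * inc(part) + sum(mat(tuple(b)) for b in part), part)
        for part in partitions(subgoals)
    )[1]
```

With illustrative costs where a merged block is cheap to probe but expensive to materialize, a high access count t favors the fully eager partition, while a low t favors the lazy (all-singleton) partition, matching the intuition in the text.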
  Property        Symbol  Example
  Reflexive       REF     p(x, y) => p(x, x)
  Symmetric       SYM     p(x, y) => p(y, x)
  Transitive      TRN     p(x, y), p(y, z) => p(x, z)
  Key             KEY     p(x, y), p(x, z) => y = z
  Not Recursive   NoREC   Can be defined without recursion.
  Tree Recursive  TrREC   See Equation 2.

Table 2: Properties assigned to predicates by the Felix compiler. KEY refers to a non-trivial key. Recursive properties are derived from all rules; the other properties are derived from hard rules.

  Task                       Required Properties
  Simple Classification      KEY, NoREC
  Correlated Classification  KEY, TrREC
  Coref                      REF, SYM, TRN
  Generic MLN Inference      none

Table 3: Tasks and their required properties.

5.3 Automatic Compilation

So far we have assumed that the mappings between MLN rules, tasks, and algorithms are all specified by the user. However, ideally a compiler should be able to automatically recognize subprograms that could be processed as specialized tasks. In this section we describe a best-effort compiler that is able to automatically detect the presence of classification and coref tasks.

To decompose an MLN program Γ into tasks, Felix uses a two-step approach. Felix's first step is to annotate each query predicate p with a set of properties. An example property is whether or not p is symmetric. Table 2 lists the set of properties that Felix attempts to discover, with their definitions; NoREC and TrREC are rule-specific. Once the properties are found, Felix uses Table 3 to list all possible options for a predicate. When there are multiple options, the current prototype of Felix simply chooses the first task to appear in the following order: (Coref, Simple Classification, Correlated Classification, Generic). This order intuitively favors more specific tasks. To compile an MLN into tasks, Felix greedily applies the above procedure to split a subset of rules into a task, and then iterates until all rules have been consumed.
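The code-selection step can be sketched as a first-match lookup over the preference order above. The property names follow Tables 2 and 3; the encoding of property sets in Python is an assumption of this sketch, not Felix's internal representation.

```python
# Sketch of predicate-to-task selection (Tables 2 and 3): pick the first
# task in the preference order whose required properties all hold.

TASK_ORDER = [
    ('Coref', {'REF', 'SYM', 'TRN'}),
    ('SimpleClassification', {'KEY', 'NoREC'}),
    ('CorrelatedClassification', {'KEY', 'TrREC'}),
    ('GenericMLN', set()),          # always-applicable fallback
]

def select_task(properties):
    """properties: set of property symbols detected for a predicate."""
    for task, required in TASK_ORDER:
        if required <= properties:  # all required properties present
            return task
```

Because the order favors more specific tasks, a predicate that is both an equivalence relation and a key-constrained predicate is still routed to the coref algorithm.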
As shown below, property detection is non-trivial because the predicates are the output of SQL queries (or, formally, datalog programs). Therefore, Felix implements a best-effort compiler using a set of syntactic patterns; this compiler is sound but not complete. It is interesting future work to design more sophisticated compilers for Felix.

Detecting Properties The most technically difficult part of the compiler is determining the properties of the predicates (cf. [14]). There are two types of properties that Felix looks for: (1) schema-like properties of any possible world that satisfies Γ, and (2) graphical structures of correlations between tuples. For both types of properties, the challenge is that we must infer these properties from the underlying rules applied to an infinite number of databases.⁶ For example, SYM is the property: "for any database I that satisfies Γ, does the sentence ∀x, y. pCoref(x, y) ⇐⇒ pCoref(y, x) hold?". Since I comes from an infinite set, it is not immediately clear that the property is even decidable. Indeed, REF and SYM are not decidable for Markov Logic programs.

Although the set of properties in Table 2 is motivated by considerations from statistical inference, the first four properties depend only on the hard rules in Γ, i.e., the constraints and (SQL-like) data transformations in the program. Let Γ∞ be the set of rules in Γ that have infinite weight. We consider the case when Γ∞ is written as a datalog program.

⁵ PostgreSQL does not fully support "what-if" queries, although other RDBMSs do, e.g., for index tuning.
⁶ As is standard in database theory [2], to model the fact that the query compiler runs without examining the data, we consider the domain of the attributes to be unbounded. If the domain of each attribute is known, then all of the above properties are decidable by the trivial algorithm that enumerates all (finitely many) instances.

Theorem 5.1.
Given a datalog program Γ∞, a predicate p, and a property θ ∈ {REF, SYM}, deciding whether p has property θ for all input databases is undecidable.

The above result is not surprising, as datalog is a powerful language and containment is undecidable [2, ch. 12] (the proof reduces from containment). Moreover, the compiler is related to implication problems studied by Abiteboul and Hull (who also establish that generalizations of the KEY and TRN problems are undecidable [1]). NoREC is the negation of the boundedness problem [10], which is undecidable.

In many cases, recursion is not used in Γ∞ (e.g., Γ∞ may consist of standard SQL queries that transform the data), and so a natural restriction is to consider Γ∞ without recursion, i.e., as a union of conjunctive queries.

Theorem 5.2. Given a union of conjunctive queries Γ∞, deciding whether, for all input databases that satisfy Γ∞, the query predicate p has property θ, where θ ∈ {REF, SYM} (Table 2), is decidable. Furthermore, the problem is Π₂P-complete. KEY and TRN are trivially false. NoREC is trivially true.

Still, Felix must annotate predicates with properties. To cope with the undecidability and intractability of computing compiler annotations, Felix uses a set of sound (but not complete) rules that are described by simple patterns. For example, we can conclude that a predicate R is transitive if the program syntactically contains the rule R(x, y), R(y, z) => R(x, z) with weight ∞.

Ground Structure The second type of property that Felix considers characterizes the graphical structure of the ground database (in turn, this structure describes the correlations that must be accounted for in the inference process). We assume that Γ is written as a datalog program (with stratified negation). The ground database is a function of both the soft and hard rules in the input program, and so we consider both types of rules here.
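A sound-but-incomplete check of the kind just described (for the transitivity pattern) can be sketched as follows; the toy normalized rule representation (weight, body, head) is an assumption of this sketch and is not Felix's actual internal format, which works on syntactic patterns over MLN rules.

```python
# Sketch of a sound-but-incomplete property check: mark a predicate as
# transitive only if the hard rule R(x,y), R(y,z) => R(x,z) is literally
# present. Missing the property is possible; claiming it falsely is not.
import math

def is_transitive(pred, rules):
    """rules: iterable of (weight, body, head); atoms are (pred, vars)."""
    pattern_body = sorted([(pred, ('x', 'y')), (pred, ('y', 'z'))])
    for weight, body, head in rules:
        if (math.isinf(weight) and weight > 0          # hard rule only
                and head == (pred, ('x', 'z'))
                and sorted(body) == pattern_body):
            return True
    return False
```

A soft version of the same rule does not trigger the annotation, which is exactly the soundness requirement: a finite-weight rule can be violated, so it cannot certify transitivity in every possible world.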
Felix's compiler attempts to deduce a special case of recursion, motivated by (tree-structured) conditional random fields, that we call TrREC. Suppose that there is a single recursive rule that contains p in the body, and that the rule is of the form

  p(x, y), T(y, z) => p(x, z)    (2)

where the first attribute of T is a key and the transitive closure of T is a partial order. Then, in the ground database, p will be "tree-structured". MAP and marginal inference for such rules are in polynomial time [40, 46]. Felix uses a regular expression to deduce this property.

6 Experiments

Although MLN inference has a wide range of applications, we focus on knowledge-base construction tasks. In particular, we use Felix to implement the TAC-KBP challenge; Felix is able to scale to the 1.8M-document corpus and produce results with state-of-the-art quality. In contrast, prior (monolithic) approaches to MLN inference crash even on a subset of KBP that is orders of magnitude smaller.

In Section 6.1, we compare the overall scalability and quality of Felix with prior MLN inference approaches on four datasets (including KBP). We show that, when prior MLN systems run, Felix is able to produce similar results but more efficiently; when prior MLN systems fail to scale, Felix can still generate high-quality results. In Section 6.2, we demonstrate that the message-passing scheme in Felix can effectively reconcile conflicting predictions and has stable convergence behavior. In Section 6.3, we show that specialized tasks and algorithms are critical to Felix's high performance and scalability. In Section 6.4, we validate that the cost-based DMO optimization is crucial to Felix's efficiency.
Datasets and Applications Table 4 lists some statistics about the four datasets that we use for experiments.

          #documents  #mentions
  KBP     1.8M        110M
  Enron   225K        2.5M
  DBLife  22K         700K
  NFL     1.1K        100K

Table 4: Statistics of input data. Note that MLN inference generates much larger intermediate data.

(1) KBP is a 1.8M-document corpus from TAC-KBP; the goal is to perform two related tasks: a) entity linking: extract all entity mentions and map them to entries in Wikipedia, and b) slot filling: determine (tens of types of) relationships between entities. There is also a set of ground truths over a 2K-document subset (call it KBP-R) that we use for quality assessment. (2) NFL, where the task is to extract football game results (winners and losers) from sports news articles. (3) Enron, where the task is to identify person mentions and associated phone numbers in the Enron email dataset. There are two versions of Enron: Enron⁷ is the full dataset; Enron-R⁸ is a 680-email subset on which we manually annotated person-phone ground truth. We use Enron for performance evaluation and Enron-R for quality assessment. (4) DBLife⁹, where the task is to extract persons, organizations, and affiliation relationships between them from a collection of academic webpages. For DBLife, we use the ACM author profile data as ground truth.

MLN Programs For KBP, we developed MLN programs that fuse a wide array of data sources, including NLP results, Web search results, Wikipedia links, Freebase, etc. For performance experiments, we use our entity linking program (which is more sophisticated than slot filling). The MLN program on NFL has a conditional random field model as a component, with some additional common-sense rules (e.g., "a team cannot be both a winner and a loser on the same day") that are provided by another research project.
To expand our set of MLN programs, we also create MLNs on Enron and DBLife by adapting rules from state-of-the-art rule-based IE approaches [12, 25]: each rule-based program is essentially equivalent to an MLN-based program (without weights). We simply replace the ad-hoc reasoning in these deterministic rules with a simple statistical variant. For example, the DBLife program in Cimple [12] says that if a person and an organization co-occur within some regular-expression context, then they are affiliated, and it ranks relationships by the frequency of such co-occurrences. In the corresponding MLN we have several rules for several types of co-occurrences, and ranking is by marginal probabilities.

Experimental Setup To compare with alternate implementations of MLNs, we consider two state-of-the-art MLN implementations: (1) Alchemy, the reference implementation of MLNs [13], and (2) Tuffy, an RDBMS-based implementation of MLNs [30]. Alchemy is implemented in C++. Tuffy and Felix are both implemented in Java and use PostgreSQL 9.0.4. Felix uses Tuffy as a task. Unless otherwise specified, all experiments are run on a RHEL5 workstation with two 2.67 GHz Intel Xeon CPUs (24 total cores), 24 GB of RAM, and over 200 GB of free disk space.

6.1 High-level Scalability and Quality

We empirically validate that Felix achieves higher scalability and essentially identical result quality compared to prior monolithic approaches. To support these claims, we compare the performance and quality of different MLN inference systems (Tuffy, Alchemy, and Felix) on the datasets listed above: KBP, Enron, DBLife, and NFL. In all cases, Felix runs its automatic compiler; parameters (e.g., gradient step sizes, generic inference parameters) are held constant across datasets. Tuffy and Alchemy have two sequential phases in their run time, grounding and search; results are produced only in the search phase.
A system is deemed unscalable if it fails to produce any inference results within 6 hours. The overall scalability results are shown in Table 5.

  Scales?  Felix  Tuffy  Alchemy
  KBP      Y      N      N
  NFL      Y      Y      N
  Enron    Y      N      N
  DBLife   Y      N      N
  KBP-R    Y      N      N
  Enron-R  Y      Y      N

Table 5: Scalability of various MLN systems.

Figure 5: High-level quality results of various MLN systems. For each dataset, we plot a precision-recall curve for each system by varying k in the top-k results; missing curves indicate that a system does not scale on the corresponding dataset.

⁷ http://bailando.sims.berkeley.edu/enron_email.html
⁸ http://www.cs.cmu.edu/~einat/datasets.html
⁹ http://dblife.cs.wisc.edu

Quality Assessment We perform quality assessment on four datasets: KBP-R, NFL, Enron-R, and DBLife. On each dataset, we run each MLN system for 4000 seconds with marginal inference. (After 4000 seconds, the quality of each system has stabilized.) For KBP-R, we convert the output to TAC's query-answer format and compute the F1 score against the ground truth. For the other three datasets, we draw precision-recall curves: we take ranked lists of predictions from each system and measure the precision/recall of the top-k results while varying the number of answers returned.¹⁰ The quality of each system is shown in Figure 5.¹¹ System-dataset pairs that do not scale have no curves.

KBP & NFL Recall that there are two tasks in KBP: entity linking and slot filling.
On both tasks, Felix is able to scale to the 1.8M documents and, after running about 5 hours on a 30-node parallel RDBMS, produce results with state-of-the-art quality [19]¹²: we achieved an F1 score of 0.80 on entity linking (human annotators' performance is 0.90) and an F1 score of 0.34 on slot filling (state-of-the-art quality). In contrast, Tuffy and Alchemy crashed even on the KBP-R subset, which is three orders of magnitude smaller. Although also based on an RDBMS, Tuffy attempted to generate about 10^11 and 10^14 tuples on KBP-R and KBP, respectively.

To assess the quality of Felix as compared to monolithic inference, we also run the three MLN systems on NFL. Both Felix and Tuffy scale on the NFL dataset and, as shown in Figure 5, produce results with similar quality. However, Felix is an order of magnitude faster: Tuffy took about an hour to start outputting results, whereas Felix's quality converges after only five minutes. We validated that the reason is that Tuffy was not aware of the linear correlation structure of a classification task in the NFL program, and so ran generic MLN inference in an inefficient manner.

Enron & DBLife To expand our test cases, we consider two more datasets, Enron-R and DBLife, to evaluate the key question we try to answer: does Felix outperform monolithic systems in terms of scalability and efficiency? From Table 5, we see that Felix scales in cases where monolithic MLN systems do not. On

¹⁰ Results from MLN-based systems are ranked by marginal probabilities, results from Cimple are ranked by frequency of occurrences, and results from rules on Enron-R are ranked by window sizes between a person mention and a phone-number mention.
¹¹ The low recall on DBLife is because the ground truth (ACM author profiles) contains many facts absent from DBLife.
¹² Measured on KBP-R, which has ground truth.
16 0.00# 0.02# 0.04# 0.06# 0.08# 0# 20# 40# 60# 80# 100# DBLife' 0.00# 0.05# 0.10# 0.15# 0.20# 0# 20# 40# 60# 80# 100# NFL' 0.00# 0.25# 0.50# 0.75# 1.00# 0# 20# 40# 60# 80# 100# Enron.R' 0.00# 0.05# 0.10# 0.15# 0.20# 0# 20# 40# 60# 80# 100# KBP.R' Itera5ons' Itera5ons' Itera5ons' Itera5ons' RMSE' RMSE' RMSE' RMSE' Figure 6: The RMSE b et ween predictions from differen t tasks con verges stably as Felix runs master-sla ve message passing. Enron-R (which con tains only 680 emails), w e see that when b oth Felix and Tuffy scale, they achiev e similar result quality . F rom Figure 5, we see that even when monolithic systems fail to scale (on DBLife), Felix is able to pro duce high-quality results. T o understand the result qualit y obtained b y Felix , w e also ran rule-based information-extraction programs for Enron-R and DBLife follo wing practice describ ed in the literature [12, 25, 27]. Recall that the MLN programs for Enron-R and DBLife were created by augmenting the deterministic rule sets with statistical reasoning. 13 It should b e noted that all systems can b e impro ved with further tuning. In particular, the rules describ ed in the literature (“Rule Set 1” for Enron-R [25, 27] and “Rule Set 2” for DBLife [12]) w ere not specifically optimized for high qualit y on the corresp onding tasks. On the other hand, the corresp onding MLN programs were generated in a constrained manner (as describ ed in Section D.1). In particular, w e did not lev erage state-of-the-art NLP to ols nor refine the MLN programs. With these cav eats in mind, from Figure 5 we see that (1) on Enron-R, Felix ac hieves higher precision than Rule Set 1 giv en the same recall; and (2) on DBLife, Felix achiev es higher recall than Rule Set 2 (i.e., Cimple [12]) at any precision lev el. 
This provides preliminary indication that statistical reasoning could help improve the result quality of knowledge-base construction tasks, and that scaling up MLN inference is a promising approach to high-quality knowledge-base construction. Nevertheless, it is interesting future work to more deeply investigate how statistical reasoning contributes to quality improvement over deterministic rules (e.g., Michelakis et al. [27]).

6.2 Effectiveness of Message Passing

We validate that the Lagrangian scheme in Felix can effectively reconcile conflicting predictions between related tasks to produce consistent output. Recall that Felix uses master-slave message passing to iteratively reconcile inconsistencies between different copies of a shared relation. To validate that this scheme is effective, we measure the difference between the marginal probabilities reported by different copies, and plot this difference as Felix runs 100 iterations. Specifically, we measure the root-mean-square error (RMSE) between the marginal predictions of shared tuples between tasks. On each of the four datasets (i.e., KBP-R, Enron-R, DBLife, and NFL), we plot how the RMSE changes over time. As shown in Figure 6, Felix stably reduces the RMSE on all datasets to an eventual value below 0.1: after about 80 iterations on Enron-R, and after the very first iteration for the other three datasets. (As many statistical inference algorithms are stochastic, it is expected that the RMSE does not decrease to zero.) This demonstrates that Felix can effectively reconcile conflicting predictions, thereby achieving joint inference. MLN inference is NP-hard, and so it is not always the case that Felix converges to the exact optimal solution of the original program. However, as we validated in the previous section, empirically Felix converges to close approximations of monolithic inference results (only more efficiently).
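The RMSE measurement above is straightforward to reproduce. A minimal sketch follows; the relation tuples and probability values are hypothetical, standing in for the marginals that two task copies of a shared relation would report after one message-passing iteration:

```python
import math

def rmse_between_copies(copy_a, copy_b):
    """Root-mean-square error between the marginal probabilities reported
    by two copies of a shared relation; each copy maps tuple -> probability."""
    shared = sorted(set(copy_a) & set(copy_b))
    sq_errs = [(copy_a[t] - copy_b[t]) ** 2 for t in shared]
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Hypothetical marginals for shared tuples of a label relation.
labeling_copy = {("P1", "W"): 0.9, ("P1", "L"): 0.1}
classification_copy = {("P1", "W"): 0.7, ("P1", "L"): 0.3}
print(round(rmse_between_copies(labeling_copy, classification_copy), 3))  # 0.2
```

As the copies are reconciled over iterations, this value drops toward (but, with stochastic inference algorithms, not exactly to) zero.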
^13 For Enron-R, we followed the rules described in related publications [25, 27]. For DBLife, we obtained the Cimple [12] system and the DBLife dataset from the authors. Further details can be found in Section D.1.

  Task                       System    Initial     Final       F1
  Simple Classification      Felix     22 sec      22 sec      0.79
                             Tuffy     113 sec     115 sec     0.79
                             Alchemy   780 sec     782 sec     0.14
  Correlated Classification  Felix     34 sec      34 sec      0.90
                             Tuffy     150 sec     200 sec     0.09
                             Alchemy   540 sec     560 sec     0.04
  Coreference                Felix     3 sec       3 sec       0.60
                             Tuffy     960 sec     1430 sec    0.24
                             Alchemy   2870 sec    2890 sec    0.36

Table 6: Performance and quality comparison on individual tasks. "Initial" (resp. "Final") is the time when a system produced the first (resp. converged) result. "F1" is the F1 score of the final output.

6.3 Importance of Specialized Tasks

We validate that the ability to integrate specialized tasks into MLN inference is key to Felix's higher performance and scalability. To do this, we first show that specialized algorithms have higher efficiency than generic MLN inference on individual tasks. Second, we validate that specialized tasks are key to Felix's scalability on MLN inference.

Quality & Efficiency  We first demonstrate that Felix's specialized algorithms outperform generic MLN inference algorithms in both quality and performance when solving specialized tasks. To evaluate this claim, we run Felix, Tuffy, and Alchemy on three MLN programs that each encode one of the following tasks: simple classification, correlated classification, and coreference. We use a subset of the Cora dataset^14 for coref, and a subset of the CoNLL 2000 chunking dataset^15 for classification. The results are shown in Table 6. While it always takes less than a minute for Felix to finish each task, Tuffy and Alchemy take much longer. Moreover, the quality of Felix is higher than that of Tuffy and Alchemy.
As expected, Felix achieves exact optimal solutions for classification and nearly optimal approximations for coref, whereas Tuffy and Alchemy rely on a general-purpose SAT-counting algorithm. Nevertheless, these microbenchmark differences are typically muted in larger-scale applications, where the quality gaps tend to be smaller than the results here.

Scalability  To demonstrate that specialized tasks are crucial to the scalability of Felix, we remove specialized tasks from Felix and re-evaluate whether Felix is still able to scale to the four datasets (KBP, Enron, DBLife, and NFL). The results are as follows: after disabling classification, Felix crashes on KBP and DBLife; after disabling coref, Felix crashes on Enron. On NFL, although Felix is still able to run without specialized tasks, its performance slows down by an order of magnitude (from less than five minutes to more than one hour). These results suggest that specialized tasks are critical to Felix's high scalability and performance.

6.4 Importance of DMO Optimization

We validate that Felix's cost-based approach to data movement optimization is crucial to its efficiency. To do this, we run Felix on subsets of Enron of various sizes in three different settings: 1) Eager, where all DMOs are evaluated eagerly; 2) Lazy, where all DMOs are evaluated lazily; and 3) Opt, where Felix decides the materialization strategy for each DMO based on the cost model in Section 5.2. We observed that overall Opt is substantially more efficient than both Lazy and Eager, and found that the deciding factor is the efficiency of the DMOs of the coref tasks. Thus, we specifically measure the total run time of individual coref tasks, and compare the results in Table 7. Here, E-xk for x in {5, 20, 50, 100} refers to a randomly selected subset of x thousand emails in the Enron corpus.

^14 http://alchemy.cs.washington.edu/data/cora
^15 http://www.cnts.ua.ac.be/conll2000/chunking/

           E-5k      E-20k     E-50k      E-100k
  Eager    83 sec    15 min    134 min    641 min
  Lazy     42 sec    5 min     22 min     78 min
  Opt      29 sec    2 min     7 min      25 min

Table 7: DMO efficiency under different settings.

We observe that the performance of the eager materialization strategy degrades rapidly as the dataset size increases. The lazy strategy performs much better, and the cost-based approach achieves a further 2-3x speedup. This demonstrates that our cost-based materialization strategy for data movement operators is crucial to the efficiency of Felix.

7 Conclusion and Future Work

We present our Felix approach to MLN inference, which uses relation-level Lagrangian relaxation to decompose an MLN program into multiple tasks and solve them jointly. Such task decomposition enables Felix to integrate specialized algorithms for common tasks (such as classification and coreference) with both high efficiency and high quality. To ensure that tasks can communicate and access data efficiently, Felix uses a cost-based materialization strategy for data movement. To free the user from manual task decomposition, the compiler of Felix performs static analysis to find specialized tasks automatically. Using these techniques, we demonstrate that Felix is able to scale to complex knowledge-base construction applications and produce high-quality results, whereas previous MLN systems have much poorer scalability. Our future work is in two directions: First, we plan to apply our key techniques (in-database Lagrangian relaxation and cost-based materialization) to other inference problems.
Second, we plan to extend Felix with new logical tasks and physical implementations to support broader applications.

References

[1] S. Abiteboul and R. Hull. Data functions, datalog and negation. In SIGMOD, 1988.
[2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, 1995.
[3] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. JACM, 2008.
[4] D. Andrzejewski, L. Livermore, X. Zhu, M. Craven, and B. Recht. A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. In IJCAI, 2011.
[5] A. Arasu, C. Ré, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009.
[6] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.
[8] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[9] A. Chandra and P. Merlin. Optimal implementation of conjunctive queries in relational data bases. In STOC, 1977.
[10] S. Chaudhuri and M. Vardi. On the complexity of equivalence between recursive and nonrecursive datalog programs. In PODS, 1994.
[11] R. Chirkova, C. Li, and J. Li. Answering queries using materialized views with minimum size. VLDB Journal, 2006.
[12] P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, and R. Ramakrishnan. DBLife: A community information management platform for the database research community. In CIDR, 2007.
[13] P. Domingos et al. http://alchemy.cs.washington.edu/.
[14] W. Fan, S. Ma, Y. Hu, J. Liu, and Y. Wu. Propagating functional dependencies with conditions. PVLDB, 2008.
[15] Y. Fang and K. Chang. Searching patterns for relation extraction over the web: rediscovering the pattern-relation duality. In WSDM, 2011.
[16] I. Fellegi and A. Sunter. A theory for record linkage. Journal of the American Statistical Association, 1969.
[17] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59-79, 2010.
[18] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.
[19] H. Ji, R. Grishman, H. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Proc. TAC 2010, 2010.
[20] J. K. Johnson, D. M. Malioutov, and A. S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. CoRR, abs/0710.0013, 2007.
[21] G. Kasneci, M. Ramanath, F. Suchanek, and G. Weikum. The YAGO-NAGA approach to knowledge discovery. SIGMOD Record, 37(4):41-47, 2008.
[22] H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with hard and soft constraints. The Satisfiability Problem: Theory and Applications, 1997.
[23] A. Klug. On conjunctive queries containing inequalities. J. ACM, 1988.
[24] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[25] B. Liu, L. Chiticariu, V. Chu, H. Jagadish, and F. Reiss. Automatic rule refinement for information extraction. VLDB, 2010.
[26] A. McCallum, K. Schultz, and S. Singh. Factorie: Probabilistic programming via imperatively defined factor graphs. In NIPS, 2009.
[27] E. Michelakis, R. Krishnamurthy, P. Haas, and S. Vaithyanathan. Uncertainty management in rule-based information extraction systems. In SIGMOD, 2009.
[28] B. Milch, B. Marthi, S. Russell, D. Sontag, D. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, 2005.
[29] N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, 2011.
[30] F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. In VLDB, 2011.
[31] D. Olteanu, J. Huang, and C. Koch. Sprout: Lazy vs. eager query plans for tuple-independent probabilistic databases. In ICDE, 2009.
[32] H. Poon and P. Domingos. Joint inference in information extraction. In AAAI, 2007.
[33] R. Ramakrishnan and J. Ullman. A survey of deductive database systems. J. Logic Programming, 1995.
[34] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 2006.
[35] S. Riedel. Cutting plane MAP inference for Markov logic. In SRL, 2009.
[36] N. Rizzolo and D. Roth. Learning Based Java for rapid development of NLP systems. Language Resources and Evaluation, 2010.
[37] A. M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On dual decomposition and linear programming relaxations for natural language processing. In EMNLP, 2010.
[38] H. Schmid. Improvements in part-of-speech tagging with an application to German. NLP Using Very Large Corpora, 1999.
[39] J. Seib and G. Lausen. Parallelizing datalog programs by generalized pivoting. In PODS, 1991.
[40] P. Sen, A. Deshpande, and L. Getoor. PrDB: Managing and exploiting rich correlations in probabilistic databases. J. VLDB, 2009.
[41] A. Shukla, P. Deshpande, and J. Naughton. Materialized view selection for multidimensional datasets. In VLDB, 1998.
[42] P. Singla and P. Domingos. Lifted first-order belief propagation. In AAAI, 2008.
[43] F. Suchanek, M. Sozio, and G. Weikum. SOFIE: A self-organizing framework for information extraction. In WWW, 2009.
[44] M. Theobald, M. Sozio, F. Suchanek, and N. Nakashole. URDF: Efficient reasoning in uncertain RDF knowledge bases with soft and hard rules. MPI Technical Report, 2010.
[45] J. Ullman. Implementation of logical query languages for databases. TODS, 1985.
[46] M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.
[47] D. Wang, M. Franklin, M. Garofalakis, J. Hellerstein, and M. Wick. Hybrid in-database inference for declarative information extraction. In SIGMOD, 2011.
[48] G. Weikum and M. Theobald. From information to knowledge: Harvesting entities and relationships from web sources. In PODS, 2010.
[49] D. Weld, R. Hoffmann, and F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Record, 2009.
[50] M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. VLDB, 2010.
[51] L. Wolsey. Integer Programming. Wiley, 1998.
[52] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J. Wen. StatSnowball: A statistical approach to extracting entity relationships. In WWW, 2009.

A Notations

Table 8 defines some common notation that is used in the following sections.

  Notation                  Definition
  a, b, ..., α, β, ...      Singular (random) variables
  a, b, ..., α, β, ...      Vectorial (random) variables (boldface)
  µ · ν                     Dot product between vectors
  |µ|                       Length of a vector or size of a set
  µ_i                       The i-th element of a vector
  α̂                         A value of a variable

Table 8: Notations.

B Theoretical Background of the Operator-based Approach

In this section, we discuss the theoretical underpinning of Felix's operator-based approach to MLN inference. Recall that Felix first decomposes an input MLN program based on a predefined set of operators, instantiates those operators with code selection, and then executes the operators using ideas from dual decomposition.
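At a high level, this execution scheme (solve each operator's subprogram under per-variable priors, then nudge the priors toward agreement on shared variables) can be sketched as follows. Everything here is an illustrative stand-in: `ToyOperator` plays the role of a specialized algorithm (a classifier, a coref solver, etc.), and the update rule mirrors the sub-gradient step formalized later in Section B.2.2.

```python
class ToyOperator:
    """Illustrative stand-in for a specialized inference operator.
    `pref[v]` is the weight of a singleton rule for atom v: a positive
    weight favors v = 1, a negative weight favors v = 0."""
    def __init__(self, pref):
        self.pref = pref
        self.vars = set(pref)

    def solve(self, priors):
        # MAP over independent singleton rules: set v = 1 iff total weight > 0.
        return {v: int(self.pref[v] + priors.get(v, 0.0) > 0) for v in self.vars}


def dual_decompose(operators, shared_vars, steps=50, step_size=0.5):
    """Master-slave sketch: each operator solves its subprogram under
    per-variable priors nu; nu is then updated by a sub-gradient step
    that pushes the copies of each shared variable toward agreement."""
    nu = {(op, v): 0.0 for op in operators for v in shared_vars & op.vars}
    results = {}
    for _ in range(steps):
        # Slave step: each operator solves given its current priors.
        results = {op: op.solve({v: nu[(op, v)] for v in shared_vars & op.vars})
                   for op in operators}
        # Master step: move each copy's prior toward the copies' average.
        for v in shared_vars:
            owners = [op for op in operators if v in op.vars]
            avg = sum(results[op][v] for op in owners) / len(owners)
            for op in owners:
                nu[(op, v)] -= step_size * (results[op][v] - avg)
    return results


# Two operators initially disagree about shared atom "x";
# message passing reconciles their copies.
op_a, op_b = ToyOperator({"x": 1.0}), ToyOperator({"x": -0.5})
final = dual_decompose([op_a, op_b], {"x"})
assert final[op_a]["x"] == final[op_b]["x"]  # copies agree after convergence
```

In Felix the "operators" are scalable specialized algorithms running inside an RDBMS rather than this toy MAP-over-singletons, but the coordination loop has the same shape.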
We first justify our choice of specialized subtasks (i.e., Classification, Sequential Labeling, and Coref) in terms of two properties, compilation soundness and language expressivity:

1. Given an MLN program, the subprograms obtained by Felix's compiler indeed encode specialized subtasks such as classification, sequential labeling, and coref.

2. MLN as a language is expressive enough to encode all possible models in the exponential family of each subtask type; specifically, MLN subsumes logistic regression (for classification), conditional random fields (for labeling), and correlation clustering (for coref).

We then describe how dual decomposition is used to coordinate the operators in Felix for both MAP and marginal inference while maintaining the semantics of MLNs.

B.1 Consistent Semantics

B.1.1 MLN Programs Solved as Subtasks

In this section, we show that the decomposition of an MLN program produced by Felix's compiler indeed corresponds to the subtasks defined in Section 4.2.

Simple Classification  Suppose a classification operator (i.e., task) for a query relation R(k, v) consists of key-constraint hard rules together with rules r_1, ..., r_t (with weights w_1, ..., w_t).^16 As per Felix's compilation procedure, the following holds: 1) R(k, v) has a key constraint (say k is the key); and 2) none of the selected rules are recursive with respect to R. Let k_0 be a fixed value of k. Since k is a possible-world key for R(k, v), we can partition the set of all possible worlds based on the value of v in R(k_0, v) (and on whether there is any value v making R(k_0, v) true). Let W_{v_i} = { W : W |= R(k_0, v_i) }, and let W_⊥ be the set of worlds in which R(k_0, v) is false for all v. Define Z(W) = Σ_{w ∈ W} exp{−cost(w)}. Then, according to the semantics of MLNs,

    Pr[R(k_0, v_0)] = Z(W_{v_0}) / ( Z(W_⊥) + Σ_{v ∈ D} Z(W_v) ).

It is immediate from this that each class is disjoint. It is also clear that, conditioned on the values of the rule bodies, each of the R are independent.

Correlated Classification  Suppose a correlated classification operator outputs a relation R(k, v) and consists of hard-constraint rules together with ground rules r_1, ..., r_t (with weights w_1, ..., w_t). As per Felix's compilation procedure, the following holds:

- R(k, v) has a key constraint (say k is the key);
- The rules r_i satisfy the TrREC property.

Consider the following graph: the nodes are all possible values for the key k, and there is an edge (k, k') if k appears in the body of a rule whose head mentions k'. Every node in this graph has outdegree at most 1. Now suppose there is a cycle: this contradicts the definition of a strict partial order. In turn, this means that the graph is a forest. We then identify this graph with a graphical-model structure in which each node is a random variable with domain D. This is a tree-structured Markov random field, which justifies the rules used by Felix's compiler for identifying labeling operators. Again, conditioned on the rule bodies, any grounding is a tree-shaped graphical model.

Coreference Resolution  A coreference resolution subtask involving variables y_1, ..., y_n infers an equivalence relation R(y_i, y_j). The only requirement of this subtask is that the result relation R(·, ·) be reflexive, symmetric, and transitive. Felix ensures these properties by directly detecting the corresponding hard rules.

B.1.2 Subtasks Represented as MLN Programs

We start by showing that all probability distributions in the discrete exponential family can be represented by an equivalent MLN program. Therefore, if we model the three subtasks using models in the exponential family, we can express them as an MLN program.

^16 For simplicity, we assume that these t rules are ground formulas. It is easy to show that grounding does not change the property of rules.
Fortunately, for each of these subtasks there are popular exponential-family models: 1) logistic regression (LR) for classification, 2) conditional random fields (CRFs) for labeling, and 3) correlation clustering for coref.^17

Definition B.1 (Exponential Family). We follow the definition in [46]. Given a vector of binary random variables x ∈ X, let φ : X → {0, 1}^d be a binary vector-valued function. For a given φ, let θ ∈ R^d be a vector of real-number parameters. The exponential-family distribution over x associated with φ and θ is of the form

    Pr_θ[x] = exp{ −θ · φ(x) − A(θ) },

where A(θ) is known as the log-partition function: A(θ) = log Σ_{x ∈ X} exp{−θ · φ(x)}.

This definition extends to multinomial random variables in a straightforward manner; for simplicity, we only consider binary random variables in this section.

Example 1  Consider a textbook logistic regressor over a random variable x ∈ {0, 1}:

    Pr[x = 1] = 1 / (1 + exp{−Σ_i β_i f_i}),

where the f_i ∈ {0, 1} are known as features of x and the β_i are regression coefficients of the f_i. This distribution is in the exponential family: let φ be a binary vector-valued function whose i-th entry is φ_i(x) = (1 − x) f_i, and let θ be a vector of real numbers whose i-th entry is θ_i = β_i. One can check that

    Pr[x = 1] = exp{−θ · φ(1)} / ( exp{−θ · φ(1)} + exp{−θ · φ(0)} ) = 1 / (1 + exp{−Σ_i β_i f_i}).

The exponential family has strong connections with the maximum-entropy principle and graphical models. For all three tasks we are considering, i.e., classification, labeling, and coreference, there are popular exponential-family models.

Proposition B.1. Given an exponential-family distribution over x ∈ X associated with φ and θ, there exists an MLN program Γ that defines the same probability distribution as Pr_θ[x]. The length of each formula in Γ is at most linear in |x|, and the number of formulas in Γ is at most exponential in |x|.

Proof. Our proof is by construction. Each entry of φ is a binary function φ_i(x), which partitions X into two subsets: X_i^+ = {x | φ_i(x) = 1} and X_i^− = {x | φ_i(x) = 0}. If θ_i ≥ 0, for each x̂ ∈ X_i^+, introduce a rule

    θ_i     ⋁_{1 ≤ j ≤ |x|} R(x_j, 1 − x̂_j).

If θ_i < 0, for each x̂ ∈ X_i^+, insert a rule

    −θ_i    ⋀_{1 ≤ j ≤ |x|} R(x_j, x̂_j).

We add these rules for each φ_i(·), and also add the following hard rule for each variable x_i:

    ∞       R(x_i, 0) <=> ¬R(x_i, 1).

It is not difficult to see that Pr[∀x_i, R(x_i, x̂_i) = 1] = Pr_θ[x̂]. In this construction, each formula has length |x|, and there are Σ_i (|X_i^+| + 1) formulas in total, which is exponential in |x| in the worst case. Similar constructions apply to the case where x is a vector of multinomial random variables.

We then show that logistic regression, conditional random fields, and correlation clustering all define probability distributions in the discrete exponential family, and that the number of formulas in their equivalent MLN programs Γ is polynomial in the number of random variables.

Logistic Regression  In logistic regression, we model the probability distribution of a Bernoulli variable y conditioned on x_1, ..., x_k ∈ {0, 1} by

    Pr[y = 1] = 1 / (1 + exp{−(β_0 + Σ_i β_i x_i)}).

Defining φ_i(y) = (1 − y) x_i (with φ_0(y) = 1 − y) and θ_i = β_i, we see that Pr[y = 1] is in the exponential family as in Definition B.1. For each φ_i(y), there is only one value of y on which φ_i can be positive, so there are at most k + 1 formulas in the equivalent MLN program.

^17 We leave the discussion of models that are not explicitly in the exponential family to future work.
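The identity in Example 1 can be checked numerically. A minimal sketch, with made-up coefficients and features, compares the textbook sigmoid form against the exponential-family form with φ_i(x) = (1 − x) f_i and θ_i = β_i:

```python
import math

def sigmoid_prob(betas, feats):
    """Textbook logistic regression: Pr[x = 1] = 1 / (1 + exp(-sum_i beta_i * f_i))."""
    return 1.0 / (1.0 + math.exp(-sum(b * f for b, f in zip(betas, feats))))

def expfam_prob(betas, feats):
    """The same distribution in exponential-family form (Example 1):
    phi_i(x) = (1 - x) * f_i and theta_i = beta_i."""
    def unnorm(x):
        return math.exp(-sum(b * (1 - x) * f for b, f in zip(betas, feats)))
    return unnorm(1) / (unnorm(1) + unnorm(0))

betas, feats = [0.5, -1.2, 2.0], [1, 1, 0]   # made-up values
assert math.isclose(sigmoid_prob(betas, feats), expfam_prob(betas, feats))
```

Since φ(1) is the zero vector, the numerator reduces to 1 and the two forms agree term by term.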
Conditional Random Field  In a conditional random field, we model the probability distribution using a graph G = (V, E), where V represents the set of random variables y = {y_v : v ∈ V}. Conditioned on a set of random variables x, the CRF defines the distribution

    Pr[y | x] ∝ exp{ Σ_{v ∈ V, k} λ_k f_k(v, y_v, x) + Σ_{(v1, v2) ∈ E, l} µ_l g_l((v1, v2), y_{v1}, y_{v2}, x) }.

This is already in exponential-family form. Because each function f_k(v, −, x) or g_l((v1, v2), −, −, x) depends on only 1 or 2 random variables, the resulting MLN program has at most O(|E| + |V|) formulas. In the current prototype of Felix, we only consider linear-chain CRFs, where |E| = O(|V|).

Correlation Clustering  Correlation clustering is a form of clustering for which there are efficient algorithms that have been shown to scale to instances of the coref problem with millions of mentions. Formally, correlation clustering treats the coref problem as a graph-partitioning problem. The input is a weighted undirected graph G = (V, f), where V is the set of mentions, with weight function f : V^2 → R. The goal is to find a partition C = {C_i} of V that minimizes the disagreement cost

    cost_cc(C) = Σ_{v1 ≠ v2 in the same cluster, f(v1, v2) < 0} |f(v1, v2)|  +  Σ_{v1 ≠ v2 in different clusters, f(v1, v2) > 0} |f(v1, v2)|.

That is, we pay |f(v1, v2)| for each negative-weight pair placed in the same cluster and for each positive-weight pair split across clusters. We can define the probability distribution over C similarly to an MLN: Pr[C] ∝ exp{−cost_cc(C)}. Specifically, let the binary predicate coref(v1, v2) indicate whether v1 ≠ v2 ∈ V belong to the same cluster. First, introduce three hard rules enforcing the reflexivity, symmetry, and transitivity properties of coref. Next, for each v1 ≠ v2 ∈ V, introduce a singleton rule coref(v1, v2) with weight f(v1, v2). It is not hard to show that the above distribution holds for this MLN program.
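The disagreement cost above can be computed directly from a partition and the pair weights. A minimal sketch on a hypothetical three-mention instance (mention names and weights are illustrative):

```python
def disagreement_cost(partition, f):
    """Correlation-clustering cost: pay |f(u, v)| for every negative-weight
    pair placed in the same cluster and every positive-weight pair split
    across clusters. `partition` is a list of sets of mentions; `f` maps
    frozenset pairs to weights."""
    cluster_of = {v: i for i, cluster in enumerate(partition) for v in cluster}
    cost = 0.0
    for pair, w in f.items():
        u, v = tuple(pair)
        same = cluster_of[u] == cluster_of[v]
        if (same and w < 0) or (not same and w > 0):
            cost += abs(w)
    return cost

# Hypothetical mention-pair weights: positive = likely coreferent.
f = {frozenset({"m1", "m2"}): 2.0,
     frozenset({"m1", "m3"}): -1.0,
     frozenset({"m2", "m3"}): 0.5}
print(disagreement_cost([{"m1", "m2"}, {"m3"}], f))  # 0.5: m2/m3 split despite +0.5
print(disagreement_cost([{"m1", "m2", "m3"}], f))    # 1.0: m1/m3 merged despite -1.0
```

Minimizing this cost over partitions is exactly the MAP problem of the MLN encoding described above, which is why Felix can hand coref subtasks to a specialized correlation-clustering algorithm.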
B.2 Dual Decomposition for MAP and Marginal Inference

In this section, we formally describe the dual decomposition framework used in Felix to coordinate the operators. We start by formalizing MLN inference as an optimization problem; we then show how to apply dual decomposition to these optimization problems.

B.2.1 Problem Formulation

Suppose an MLN program Γ consists of a set of ground MLN rules R = {r_1, ..., r_m} with weights (w_1, ..., w_m). Let X = {x_1, ..., x_n} be the set of boolean random variables corresponding to the ground atoms occurring in Γ. Each MLN rule r_i introduces a function φ_i over the set of random variables π_i ⊆ X mentioned in r_i: φ_i(π_i) = 1 if r_i is violated and 0 otherwise. Let w be a vector of weights, and define the vector φ(X) = (φ_1(π_1), ..., φ_m(π_m)). Given a possible world x ∈ 2^X, the cost can be represented as

    cost(x) = w · φ(x).

Suppose Felix decides to solve Γ with t operators O_1, ..., O_t. Each operator O_i contains a set of rules R_i ⊆ R, and the sets {R_i} form a partition of R. Let the set of random variables for each operator be X_i = ∪_{r_j ∈ R_i} π_j, and let n_i = |X_i|. Thus, each operator O_i essentially solves the MLN program defined by the random variables X_i and rules R_i. Given w, define w_i to be the weight vector whose entries equal those of w if the corresponding rule appears in R_i and 0 otherwise. Because {R_i} is a partition of R, we know Σ_i w_i = w. For each operator O_i, define an n-dimensional vector µ_i(X) whose j-th entry equals x_j if x_j ∈ X_i and 0 otherwise, and define the n-dimensional vector µ(X) whose j-th entry equals x_j. Similarly, let φ(X_i) be the projection of φ(X) onto the rules in operator O_i.

Example 2  We use the two sets of rules for classification and labeling in Section 5.1 as a running example. For a simple sentence "Packers win." in a fixed document D that contains two phrases P1 = "Packers" and P2 = "win", we get the following set of ground formulae:^18

    ∞    label(D, p, l1), label(D, p, l2) => l1 = l2                     (r_l1)
    10   next(D, P1, P2), token(P2, 'wins') => label(D, P1, W)           (r_l2)
    1    label(D, P1, W), next(D, P1, P2) => !label(D, P2, W)            (r_l3)
    10   label(D, P1, W), referTo(P1, GreenBay) => winner(GreenBay)      (r_c1)
    10   label(D, P1, L), referTo(P1, GreenBay) => !winner(GreenBay)     (r_c2)

After compilation, Felix would assign r_l1, r_l2, and r_l3 to a labeling operator O_L, and r_c1 and r_c2 to a classification operator O_C. With each of {winner(GreenBay), label(D, P1, W), label(D, P1, L), label(D, P2, W), label(D, P2, L)} we associate a binary random variable. Each rule introduces a function φ; for example, the function φ_l2 introduced by r_l2 is

    φ_l2(label(D, P1, W)) = 1 if label(D, P1, W) = False, and 0 if label(D, P1, W) = True.

The labeling operator O_L essentially solves the MLN program with variables X_L = {label(D, P1, W), label(D, P1, L), label(D, P2, W), label(D, P2, L)} and rules R_L = {r_l1, r_l2, r_l3}. Similarly, O_C solves the MLN program with variables X_C = {winner(GreenBay), label(D, P1, W), label(D, P1, L)} and rules R_C = {r_c1, r_c2}. Note that these two operators share the variables label(D, P1, W) and label(D, P1, L).

^18 For r_l1, p ∈ {P1, P2} and l_i ∈ {W, L}.

B.2.2 MAP Inference

MAP inference in MLNs finds an assignment x to X that minimizes the cost:

    min_{x ∈ {0,1}^n}  w · φ(x).                                         (3)

Each operator O_i performs MAP inference on X_i:

    min_{x_i ∈ {0,1}^{n_i}}  w_i · φ(x_i).                               (4)

Our goal is to reduce the problem represented by Eqn. 3 to the subproblems represented by Eqn. 4. Eqn. 3 can be rewritten as

    min_{x ∈ {0,1}^n}  Σ_{1 ≤ i ≤ t}  w_i · φ(x_i).

Clearly, the difficulty lies in the fact that, for i ≠ j, X_i and X_j may overlap. Therefore, we introduce a copy of the variables for each O_i, denoted X_i^C. Eqn. 3 now becomes:

    min_{x_i^C ∈ {0,1}^{n_i}, x}  Σ_i  w_i · φ(x_i^C)    s.t.  ∀i:  x_i^C = x.   (5)

The Lagrangian of this problem is:

    L(x, x_1^C, ..., x_t^C, ν_1, ..., ν_t) = Σ_i [ w_i · φ(x_i^C) + ν_i · (µ_i(x_i^C) − µ_i(x)) ].   (6)

Thus, we can relax Eqn. 3 to

    max_ν { Σ_i min_{x_i ∈ {0,1}^{n_i}} [ w_i · φ(x_i^C) + ν_i · µ_i(x_i^C) ] − max_x Σ_i ν_i · µ_i(x) }.

The term max_x Σ_i ν_i · µ_i(x) is unbounded unless, for each variable x_j,

    Σ_{O_i : x_j ∈ X_i} ν_{i,j} = 0.

Converting this into constraints, we get

    max_ν  Σ_i min_{x_i ∈ {0,1}^{n_i}} [ w_i · φ(x_i^C) + ν_i · µ_i(x_i^C) ]
    s.t.  ∀ x_j:  Σ_{O_i : x_j ∈ X_i} ν_{i,j} = 0.

We can apply sub-gradient methods on ν. The dual decomposition procedure in Felix works as follows:

1. Initialize ν_1^(0), ..., ν_t^(0).

2. At step k (starting from 0):
   (a) For each operator O_i, solve the MLN program consisting of: 1) the original rules in this operator, which are characterized by w_i; and 2) additional priors on each variable in X_i, which are characterized by ν_i^(k).
   (b) Get the MAP inference results x̂_i^C.

3. Update ν_i:

    ν_{i,j}^(k+1) = ν_{i,j}^(k) − λ ( x̂_{i,j}^C − ( Σ_{l : x_j ∈ X_l} x̂_{l,j}^C ) / |{l : x_j ∈ X_l}| ).

Example 3  Consider MAP inference on the program in Example 2. As O_L and O_C share two random variables, x_w = label(D, P1, W) and x_l = label(D, P1, L), we have a copy of each for each operator: x_{w,O_L}^C, x_{l,O_L}^C for O_L; and x_{w,O_C}^C, x_{l,O_C}^C for O_C. Correspondingly, we have four ν: ν_{w,O_L}, ν_{l,O_L} for O_L; and ν_{w,O_C}, ν_{l,O_C} for O_C. Assume we initialize each ν^(0) to 0 at the first step. We start by performing MAP inference on O_L and O_C respectively.
In this case, O_L will get the result:

    x_{w,O_L}^C = 1    x_{l,O_L}^C = 0.

O_C admits multiple possible worlds minimizing the cost; for example, it may output

    x_{w,O_C}^C = 0    x_{l,O_C}^C = 0,

which has cost 0. Assume the step size λ = 0.5. We can update ν to:

    ν_{w,O_L}^{(1)} = −0.25    ν_{w,O_C}^{(1)} = 0.25    ν_{l,O_L}^{(1)} = 0    ν_{l,O_C}^{(1)} = 0.

Therefore, when we use these ν^{(1)} to conduct MAP inference on O_L and O_C, we are equivalently adding

    -0.25  label(D, P1, W)    (r'_l)

into O_L and

    0.25  label(D, P1, W)    (r'_c)

into O_C. Intuitively, one may interpret this procedure as the information that "O_L prefers label(D, P1, W) to be true" being passed to O_C via r'_c.

B.2.3 Marginal Inference

Marginal inference in MLNs aims at computing the marginal distribution (i.e., the expectation, since we are dealing with Boolean random variables):

    μ̂ = E_w[μ(X)].    (7)

The subproblem of each operator is of the form:

    μ̂_O = E_{w_O}[μ_O(X_O)].    (8)

Again, the goal is to use solutions to Eqn. 8 to solve Eqn. 7. We first introduce some auxiliary variables. Recall that μ(X) corresponds to the set of random variables, and φ(X) corresponds to all functions represented by the rules. We create a new vector ξ by concatenating μ and φ: ξ(X) = (μ^T(X), φ^T(X)). We create a new weight vector θ = (0, ..., 0, w^T) of the same length as ξ. It is not difficult to see that the marginal inference problem equivalently becomes:

    ξ̂ = E_θ[ξ(X)].    (9)

Similarly, we define θ_O for operator O as θ_O = (0, ..., 0, w_O^T). We also define a set of θ, denoted Θ_O, which contains all vectors whose entries corresponding to random variables or cliques that do not appear in operator O are zero.
The partition function A(θ) is:

    A(θ) = Σ_X exp{−θ · ξ(X)}.

The conjugate dual of A is:

    A*(ξ) = sup_θ {θ · ξ − A(θ)}.

A classic result of variational inference [46] shows that

    ξ̂ = arg sup_{ξ ∈ M} {θ · ξ − A*(ξ)},    (10)

where M is the marginal polytope. Recall that ξ̂ is our goal (see Eqn. 9). As in MAP inference, we want to decompose Eqn. 10 across operators by introducing copies of the shared variables. We first try to decompose A*(ξ). In A*(ξ), we search over all possible values of θ. If we only search over a subset of θ, we get a lower bound:

    A*_O(ξ) = sup_{θ ∈ Θ_O} {θ · ξ − A(θ)} ≤ A*(ξ).

Therefore,

    −A*(ξ) ≤ (1/m) Σ_O −A*_O(ξ),

where m is the number of operators. We approximate ξ̂ using this bound:

    ξ̂ ≈ arg sup_{ξ ∈ M} {θ · ξ − (1/m) Σ_O A*_O(ξ)},

which gives an upper bound on the original objective. We introduce copies of ξ:

    ξ̂ = arg sup_{ξ_{O_i} ∈ M, ξ} { Σ_O θ_O · ξ_O − (1/m) Σ_O A*_O(ξ_O) }
    s.t. ξ_{O,e} = ξ_e, ∀ e ∈ X_O ∪ R_O, ∀ O.

The Lagrangian of this problem is:

    L(ξ, ξ_{O_1}, ..., ξ_{O_t}, ν_1, ..., ν_t) = Σ_O [ θ_O · ξ_O − (1/m) A*_O(ξ_O) ] + Σ_i ν_i · (ξ_{O_i} − ξ),

where ν_i ∈ Θ_i, meaning that only the entries corresponding to random variables or cliques that appear in operator O_i are allowed to have non-zero values. We get the relaxation:

    min_{ν_i ∈ Θ_i} Σ_i sup_{ξ_{O_i} ∈ M} [ θ_i · ξ_{O_i} − (1/m) A*_{O_i}(ξ_{O_i}) + ν_i · ξ_{O_i} ] − min_ξ Σ_i ν_i · ξ.

Consider the min_ξ Σ_i ν_i · ξ part. It is equivalent to a set of constraints:

    Σ_{O_i : x ∈ X_i} ν_{i,x} = 0, ∀ x ∈ X;    ν_{i,x} = 0, ∀ x ∉ X.

Therefore, we are solving:

    min_{ν_i ∈ Θ_i} Σ_i sup_{ξ_{O_i} ∈ M} [ m θ_i · ξ_{O_i} − A*_{O_i}(ξ_{O_i}) + ν_i · ξ_{O_i} ]
    s.t. Σ_{O_i : x ∈ X_i} ν_{i,x} = 0, ∀ x ∈ X;    ν_{i,x} = 0, ∀ x ∉ X.

Note the factor m in front of θ_i; it implies that we multiply the weights in each subprogram by m as well. Then we can apply the sub-gradient method on ν_i:

1.
Initialize ν_1^{(0)}, ..., ν_t^{(0)}.
2. At step k (starting from 0):
   (a) For each operator O_i, solve the MLN program consisting of: 1) the original rules in this operator, which are characterized by m θ_i; and 2) additional priors on each variable in X_i, which are characterized by ν_i^{(k)}.
   (b) Get the marginal inference results ξ̂_i^C.
3. Update ν_i^{(k+1)}:

    ν_{i,j}^{(k+1)} = ν_{i,j}^{(k)} − λ ( ξ̂_{i,j}^C − ( Σ_{l : x_j ∈ X_l} ξ̂_{l,j}^C ) / |{l : x_j ∈ X_l}| ).

Example 4 Consider marginal inference on the case in Example 2. As in the MAP example, we have copies of the shared random variables: ξ_{w,O_L}^C, ξ_{l,O_L}^C for O_L; and ξ_{w,O_C}^C, ξ_{l,O_C}^C for O_C. We also have four multipliers ν: ν_{w,O_L}, ν_{l,O_L} for O_L; and ν_{w,O_C}, ν_{l,O_C} for O_C. Assume we initialize each ν^{(0)} to 0 at the first step. We start by conducting marginal inference on O_L and O_C respectively. In this case, O_L will get the result:

    ξ_{w,O_L}^C = 0.99    ξ_{l,O_L}^C = 0.01,

while O_C will get:

    ξ_{w,O_C}^C = 0.5    ξ_{l,O_C}^C = 0.5.

Assume the step size λ = 0.5. We can update ν as:

    ν_{w,O_L}^{(1)} = −0.12    ν_{w,O_C}^{(1)} = 0.12    ν_{l,O_L}^{(1)} = 0.12    ν_{l,O_C}^{(1)} = −0.12.

Therefore, when we use these ν^{(1)} to conduct marginal inference on O_L and O_C, we are equivalently adding

    -0.12  label(D, P1, W)    (r'_{l1})
     0.12  label(D, P1, L)    (r'_{l2})

into O_L and

     0.12  label(D, P1, W)    (r'_{c1})
    -0.12  label(D, P1, L)    (r'_{c2})

into O_C. Intuitively, one may interpret this procedure as the information that "O_L prefers label(D, P1, W) to be true" being passed to O_C via r'_{c1}.

C Additional Details of System Implementation

In this section, we provide additional details of the Felix system. The first part of this section focuses on the compiler: we prove some complexity results on the property annotation used in the compiler and describe how to apply static analysis techniques originally used in the Datalog literature for data partitioning.
Then we describe the physical implementation of each logical operator in the current prototype of Felix. We also describe the cost model used for the materialization trade-off.

C.1 Compiler

C.1.1 Complexity Results

In this section, we first prove the decidability result for the problem of annotating properties for arbitrary Datalog programs. Then we prove the Π₂ᴾ-completeness of the problem of annotating {REF, SYM} given a Datalog program without recursion.

Recursive Programs  If there is a single rule with query relation Q of the form Q(x, y) <= Q1(x), Q2(y), then {REF, SYM} of Q is decidable if and only if Q1 or Q2 is empty or Q1 ≡ Q2. We assume that Q1 and Q2 are satisfiable. Suppose there is an instance where Q1(a) is true and Q2 is false for all values. Then there is another world (with all fresh constants) where Q2 is true (and does not return a). Thus, to check REF and SYM for Q, we need to decide equivalence of Datalog queries, and equivalence of Datalog queries is undecidable [2, ch. 12]. Since containment and boundedness for monadic Datalog queries are decidable, a small technical wrinkle is that, while Q1 and Q2 are of arity one (monadic), their bodies may contain other recursive (higher-arity) predicates.

Complexity for Nonrecursive Programs  The above discussion assumes that we are given an arbitrary Datalog program Γ. In this section, we show that the problem of annotating REF and SYM given a nonrecursive Datalog program is Π₂ᴾ-complete. We allow inequalities in the program. We first prove hardness. As above, we need to decide Q1 ≡ Q2; the difference is that Q1 and Q2 contain no recursion. Since our language allows us to express conjunctive queries with inequality constraints, this establishes Π₂ᴾ-hardness [23]. We now prove membership in Π₂ᴾ.
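Both directions of this argument reduce to query containment. For plain conjunctive queries without inequalities, a classic approach (due to Chandra and Merlin) is far simpler than the Π₂ᴾ setting handled here: evaluate one query over the other's "frozen" canonical database. The following sketch is purely illustrative of that idea (its query encoding is our own, not anything in Felix):

```python
from itertools import product

def evaluate(query, db):
    """All head tuples of a conjunctive query over database db.
    query = (head_vars, body) with body a list of (pred, args);
    db = {pred: set of tuples}. Brute-force join over the active domain."""
    head, body = query
    domain = {c for tuples in db.values() for t in tuples for c in t}
    variables = sorted({a for _, args in body for a in args if isinstance(a, str)})
    results = set()
    for vals in product(domain, repeat=len(variables)):
        env = dict(zip(variables, vals))
        if all(tuple(env.get(a, a) for a in args) in db.get(p, set())
               for p, args in body):
            results.add(tuple(env[v] for v in head))
    return results

def contained_in(q1, q2):
    """Chandra-Merlin test: q1 ⊆ q2 iff q2 returns q1's frozen head
    when run on q1's canonical database (variables frozen to constants)."""
    head, body = q1
    freeze = lambda a: ("c", a) if isinstance(a, str) else a
    canon = {}
    for p, args in body:
        canon.setdefault(p, set()).add(tuple(freeze(a) for a in args))
    return tuple(freeze(v) for v in head) in evaluate(q2, canon)

# Q1(x,y) :- E(x,y), E(y,x)  is contained in  Q2(x,y) :- E(x,y), not vice versa.
q1 = (("x", "y"), [("E", ("x", "y")), ("E", ("y", "x"))])
q2 = (("x", "y"), [("E", ("x", "y"))])
print(contained_in(q1, q2), contained_in(q2, q1))  # → True False
```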
We first translate the problem of property annotation to the containment problem for Datalog programs, which has been studied for decades [9, 23]; the complexity is in Π₂ᴾ for Datalog programs without recursion but with inequalities. We will show that, even though the rule for checking the symmetric property is recursive, it can be represented by a set of non-recursive rules, so the classic results still hold. We thus limit ourselves to non-recursive MLN programs. Given an MLN program Γ that is a union of conjunctive queries and a relation Q to be annotated with properties, all hard rules related to Q can be represented as:

    Q() :- G1()
    Q() :- G2()
    ...
    Q() :- Gn()    (P1)

where each Gi() contains a set of subgoals. To decide whether a property holds for the relation Q(), we test whether certain rules hold for all database instances I generated by the above program P1. For example, for the symmetric property, we label Q as symmetric if and only if Q(x, y) => Q(y, x) holds. We call this rule the testing rule. Suppose the testing rule is Q() :- T(); we create a new program:

    Q() :- G1()
    Q() :- G2()
    ...
    Q() :- Gn()
    Q() :- T()    (P2)

Given a database D, let P1(D) be the result of applying program P1 to D (using Datalog semantics). The testing rule holds for all P1(D) if and only if ∀D, P2(D) ⊆ P1(D); in other words, P2 is contained in P1 (P2 ⊆ P1). For the reflexive property, whose testing rule is Q(x, x) :- D(x) (where D() is the domain of x), both P1 and P2 are non-recursive, and checking containment is in Π₂ᴾ [23]. We then consider the symmetric property, whose testing rule is recursive. This looks difficult at first glance because containment of recursive Datalog programs is undecidable. However, we can show that this special case is much easier.
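The semantics of the symmetry testing rule is easy to illustrate on a single concrete instance (the annotation problem, of course, must establish it for all instances the program can generate; this sketch only checks one):

```python
def satisfies_sym_testing_rule(q):
    """Check the testing rule Q(x, y) => Q(y, x) on one concrete
    instance q of the relation Q, given as a set of pairs."""
    return all((y, x) in q for (x, y) in q)

print(satisfies_sym_testing_rule({(1, 2), (2, 1)}))  # → True
print(satisfies_sym_testing_rule({(1, 2)}))          # → False
```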
For the sake of simplicity, we consider simplified versions of P1 and P2:

    Q(x, y) :- G(x, y, z)    (P'1)

    Q(x, y) :- G(x, y, z)
    Q(x, y) :- Q(y, x)    (P'2)

We construct the following program:

    Q(x, y) :- G(x, y, z)
    Q(x, y) :- G(y, x, z)    (P3)

It is easy to show that P'2 = P3; therefore, we can equivalently check whether P3 ⊆ P'1, which is in Π₂ᴾ since neither of the programs is recursive.

Property | Pattern Template                | Condition
---------|---------------------------------|----------------------------------------------
REF      | P1(a,b)                         | a=b
REF      | P1(a,b) ∨ !R1(c) ∨ !R2(d)       | a=c, b=d, R1=R2, P1≠Ri
SYM      | P1(a,b) ∨ !P2(c,d)              | a=d, b=c, P1=P2
SYM      | P1(a,b) ∨ !R1(c) ∨ !R2(d)       | a=c, b=d, R1=R2, P1≠Ri
TRN      | !P1(a,b) ∨ !P2(c,d) ∨ P3(e,f)   | b=c, a=e, d=f, P1=P2=P3
KEY      | !P1(a,b) ∨ !P2(e,f) ∨ [c=d]     | a=e, b=c, d=f, P1=P2
NoREC    | R1() ∨ ... ∨ Rn() ∨ P1()        | P1≠Ri
NoREC    | R1() ∨ ... ∨ Rn() ∨ !P1()       | P1≠Ri
TrRec    | P1(a,b) ∨ T(c,d) ∨ P2(e,f)      | b=c, d=f, a=e, P1=P2, T(c,d)=[d=c+x], x≠0
TrRec    | P1(a,b) ∨ T(c,d) ∨ P2(e,f)      | b=c, d=f, a=e, P1=P2, ∀(c,d)∈T, c ⊑ d
TrRec    | !P1(a,b) ∨ T(c,d) ∨ P2(e,f)     | b=c, d=f, a=e, P1=P2, T(c,d)=[d=c+x], x≠0
TrRec    | !P1(a,b) ∨ T(c,d) ∨ P2(e,f)     | b=c, d=f, a=e, P1=P2, ∀(c,d)∈T, c ⊑ d
TrRec    | P1(a,b) ∨ T(c,d) ∨ !P2(e,f)     | b=c, d=f, a=e, P1=P2, T(c,d)=[d=c+x], x≠0
TrRec    | P1(a,b) ∨ T(c,d) ∨ !P2(e,f)     | b=c, d=f, a=e, P1=P2, ∀(c,d)∈T, c ⊑ d
TrRec    | !P1(a,b) ∨ T(c,d) ∨ !P2(e,f)    | b=c, d=f, a=e, P1=P2, T(c,d)=[d=c+x], x≠0
TrRec    | !P1(a,b) ∨ T(c,d) ∨ !P2(e,f)    | b=c, d=f, a=e, P1=P2, ∀(c,d)∈T, c ⊑ d

Table 9: Sufficient conditions for properties. All patterns for REF, SYM, TRN, and KEY are hard rules.
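The sufficient conditions in Table 9 are checked purely syntactically. As an illustration (not Felix's implementation; the clause encoding here is our own), a sketch of testing the SYM row P1(a,b) ∨ !P2(c,d) with condition a=d, b=c, P1=P2:

```python
import itertools

def matches_sym(clause):
    """Check the SYM pattern P1(a,b) ∨ !P2(c,d). A clause is a list of
    literals (positive: bool, predicate: str, args: tuple of variables)."""
    if len(clause) != 2:
        return False  # template: exactly two binary predicates
    for lit1, lit2 in itertools.permutations(clause, 2):
        (pos1, p1, (a, b)), (pos2, p2, (c, d)) = lit1, lit2
        if not (pos1 and not pos2):
            continue  # template: the two literals have opposite senses
        # Boolean expression: (a = d) ∧ (b = c) ∧ (P1 = P2).
        if p1 == p2 and a == d and b == c:
            return True
    return False

# coRef(x, y) ∨ !coRef(y, x) matches the symmetry pattern:
print(matches_sym([(True, "coRef", ("x", "y")), (False, "coRef", ("y", "x"))]))  # → True
```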
C.1.2 Patterns Used by the Compiler

Felix exploits a set of regular expressions for property annotation. This set of regular expressions forms a best-effort compiler that is sound but not complete. Table 9 shows these patterns. In Felix, a pattern consists of two components: a template and a boolean expression. A template is a constraint on the "shape" of a formula. For example, one template for SYM looks like P1(a, b) ∨ !P2(c, d), which means we only consider rules whose disjunctive form contains exactly two binary predicates with opposite senses. Rules that pass template matching are considered further using the boolean expression. If a rule passes the template-matching step, we obtain a set of assignments for each predicate P and each variable a, b, .... The boolean expression is a first-order logic formula over these assignments. For example, the boolean expression for the above template is (a = d) ∧ (b = c) ∧ (P1 = P2), which means the assignments of P1 and P2 must be the same, and the assignments of the variables a, b, c, d must satisfy (a = d) ∧ (b = c). If there is an assignment that satisfies the boolean expression, we say the Datalog rule matches the pattern, and the rule is annotated with the corresponding labels.

C.1.3 Static Analysis for Data Partitioning

Statistical inference can often be decomposed into independent subtasks on different portions of the data. Take the classification example from Section 5.1: inference for the query relation winner(team) is "local" to each team constant (assuming label is an evidence relation). In other words, deciding whether one team is a winner does not rely on the decision for another team, team', in this classification subtask. Therefore, if there are n teams in total, we have the opportunity to solve this subtask using n concurrent threads.
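To make the parallelism opportunity concrete: given any integer-valued partitioning function that maps mutually dependent tuples to the same value (how Felix finds one by static analysis is described next), routing data to threads is just a mod-N bucket assignment. A toy sketch, assuming a linear function over the tuple's columns:

```python
def partition(tuples, coeffs, n_threads):
    """Assign each tuple to a thread via f_R(x) = sum_i coeffs[i] * x[i],
    thread id = f_R(x) mod n_threads. Tuples that must be solved together
    share an f_R value, so they always land on the same thread."""
    buckets = [[] for _ in range(n_threads)]
    for t in tuples:
        f = sum(c * x for c, x in zip(coeffs, t))
        buckets[f % n_threads].append(t)
    return buckets

# f_R(x1, x2) = x1: all tuples with the same first column co-locate.
print(partition([(1, 7), (1, 9), (2, 3)], [1, 0], 2))
# → [[(2, 3)], [(1, 7), (1, 9)]]
```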
Another example is labeling, which is often local to small units of text (e.g., sentences). In Felix, we borrow ideas from the Datalog literature [39], using linear programming to perform static analysis that decomposes the data; Felix adopts the algorithm of Seib and Larsen [39]. Consider an operator with query relation R(x̄). Different instances of x̄ may depend on each other during inference. For example, consider the rule

    R(x̄) <= R(ȳ), T(x̄, ȳ).

Intuitively, instances of x̄ and ȳ that appear in the same ground rule cannot be solved independently, since R(x̄) and R(ȳ) are inter-dependent. Such dependency relationships are transitive, and we want to compute them so that data partitioning does not violate them. A straightforward approach is to ground all rules and then perform component detection on the resulting graph, but grounding tends to be computationally demanding. A cheaper way is static analysis that looks only at the rules. Specifically, one solution is to find a function f_R(−) such that f_R(x̄) = f_R(ȳ) for all x̄ and ȳ that depend on each other. As we rely on static analysis to find f_R, this condition should hold for all possible database instances. Since each constant is encoded as an integer in Felix, we may consider functions f_R of the form [39]:

    f_R(x_1, ..., x_n) = Σ_i λ_i x_i ∈ N,

where the λ_i are integer constants. Following [39], Felix uses linear programming to find λ_i such that f_R(−) satisfies the above constraints. Once we have such a partitioning function over the input, we can process the data in parallel. For example, if we want to run N concurrent threads for R, we can assign all data satisfying f_R(x_1, ..., x_n) mod N = j to the j-th thread.

C.2 Operator Implementations

Recall that Felix selects a physical implementation for each logical operator in order to actually execute it.
In this section, we show a handful of physical implementations for these operators. Each physical implementation works only for a subset of operator configurations; for cases not covered by these implementations, we can always fall back on Tuffy or Gauss-Seidel-style implementations [30].

Using Logistic Regression for Classification Operators  Consider a Classification operator with query relation R(k, v), where k is the key. Recall that each possible value of k corresponds to an independent classification task. The (ground) rules of this operator are all non-recursive with respect to R, and so can be grouped by the value of k. Specifically, for each value pair k̂ and v̂, define

    R_{k̂,v̂} = { r_i | r_i is violated when R(k̂, v̂) is true }
    R_{k̂,⊥} = { r_i | r_i is violated when R(k̂, v̂) is false for all v̂ }

and

    W_{k̂,x} = Σ_{r_i ∈ R_{k̂,x}} |w_i|,

which intuitively summarizes the penalty we have to pay for assigning x to the key k̂. With this notation, one can check that

    Pr[R(k̂, x) is true] = exp{−W_{k̂,x}} / Σ_y exp{−W_{k̂,y}},

where both x and y range over the domain of v plus ⊥, and R(k̂, ⊥) means that R(k̂, v) is false for all values of v. This is implemented using SQL aggregation in a straightforward manner.

Using Conditional Random Fields for Correlated Classification Operators  The Labeling operator generalizes the Classification operator by allowing tree-shaped correlations between the individual classification tasks. For simplicity, assume that the tree-shaped correlation is actually a chain. Specifically, suppose the possible values of k are k_1, ..., k_m. Then, in addition to the ground rules described in the previous paragraph, we also have a set of recursive rules, each containing R(k_i, −) and R(k_{i+1}, −) for some 1 ≤ i ≤ m − 1.
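Returning to the Classification operator for a moment: the per-key distribution above is simply a softmax over the negated penalties. A minimal sketch (illustrative only; Felix computes the W statistics via SQL aggregation rather than in application code):

```python
import math

def class_probs(penalties):
    """penalties: {value: W_khat,value}, including the 'no label' value ⊥.
    Returns Pr[R(khat, x)] = exp(-W_x) / sum_y exp(-W_y)."""
    z = sum(math.exp(-w) for w in penalties.values())
    return {x: math.exp(-w) / z for x, w in penalties.items()}

# Suppose assigning WINNER violates rules of total weight 1, LOSER
# weight 3, and leaving the key unlabeled (⊥, here "BOT") weight 2:
p = class_probs({"WINNER": 1.0, "LOSER": 3.0, "BOT": 2.0})
print(max(p, key=p.get))  # → WINNER (the least-penalized value)
```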
Define

    R^B_{k_i,k_{i+1}} = { r | r contains R(k_i, −) and R(k_{i+1}, −) }
    W^B_{k_i,k_{i+1}}(v_i, v_{i+1}) = Σ_{r ∈ R^B_{k_i,k_{i+1}}} cost_r({R(k_i, v_i), R(k_{i+1}, v_{i+1})}).

Then it is easy to show that

    Pr[{R(k_i, v_i), 1 ≤ i ≤ m}] ∝ exp{ −Σ_{1 ≤ i ≤ m} W_{k_i,v_i} − Σ_{1 ≤ i ≤ m−1} W^B_{k_i,k_{i+1}}(v_i, v_{i+1}) },

which is exactly a linear-chain CRF. Again, Felix uses SQL to compute the above intermediate statistics, and then resorts to the Viterbi algorithm [24] (for MAP inference) or the sum-product algorithm [46] (for marginal inference).

Using Correlation Clustering for Coreference Operators  The Coref operator can be implemented using correlation clustering [5]. We show that the constant-approximation algorithm for correlation clustering carries over to MLNs under some technical conditions. Recall that correlation clustering essentially performs node partitioning based on the edge weights of an undirected graph. We use the following example to illustrate the direct connection between MLN rules and correlation clustering.

Example 1 Consider the following ground rules, which are similar to those in Section 5.1:

    10  inSameDoc(P1, P2), sameString(P1, P2) => coRef(P1, P2)
    5   inSameDoc(P1, P2), subString(P1, P2) => coRef(P1, P2)
    5   inSameDoc(P3, P4), subString(P3, P4) => coRef(P3, P4)

Assume coRef is the query relation of this Coreference operator. We can construct the weighted graph as follows. The vertex set is V = {P1, P2, P3, P4}. There are two edges with non-zero weight: (P1, P2) with weight 15 and (P3, P4) with weight 5; all other edges have weight 0. The following proposition shows that the correlation clustering algorithm solves an optimization problem equivalent to MAP inference in MLNs.

Proposition C.1.
Let Γ(x̄_i) be the part of Γ corresponding to a coref subtask, and let G_i be the correlation clustering problem transformed from Γ(x̄_i) using the above procedure. Then an optimal solution to G_i is also an optimal solution to Γ(x̄_i).

We implement Arasu et al. [5] for correlation clustering. The theorem below shows that, for a certain family of MLN programs, the algorithm implemented in Felix actually performs approximate MLN inference.

Theorem C.1. Let Γ(x̄_i) be a coref subtask whose rules generate a complete graph where each edge has a weight of either ±∞ or w s.t. m ≤ |w| ≤ M for some m, M > 0. Then the correlation clustering algorithm running on Γ(x̄_i) is a 3M/m-approximation algorithm in terms of the log-likelihood of the output world.

Proof. Arasu et al. [5] showed that for the case m = M, their algorithm achieves an approximation ratio of 3. If we run the same algorithm, then in expectation the output violates no more than 3·OPT edges, where OPT is the number of violated edges in the optimal partition. Now, with weighted edges, the optimal cost is at least m·OPT, and the expected cost of the algorithm's output is at most 3M·OPT. Thus, the same algorithm achieves a 3M/m approximation.

C.3 Cost Model for Physical Optimization

The cost model in Section 5.2 requires estimating the individual terms of ExecCost. There are three components: (1) the materialization cost of each eager query, (2) the cost of lazily evaluating the query in terms of the materialized views, and (3) the number of times the query will be executed (t). We consider them in turn. Computing (1), the subquery materialization cost Mat(Q_i), is straightforward using PostgreSQL's EXPLAIN feature. As is common for many RDBMSs, the unit of PostgreSQL's query evaluation cost is not time but an internal unit (roughly proportional to the cost of 1 I/O).
Felix p erforms all calculations in this unit. Computing (2), the cost of a single incremen tal ev aluation, is more inv olved: we do not ha ve Q i actually materialized (and with indexes built), so w e cannot directly measure Inc Q ( Q 0 ) using PostgreSQL. F or simplicity , consider a tw o-w ay decomp osition of Q into Q 1 and Q 2 . W e consider t wo cases: (a) when Q 2 is estimated to b e larger than PostgreSQL assigned buffer, and (b) when Q 2 is smaller (i.e. can fit in a v ailable memory). T o perform this estimation in case (a), Felix makes a simplifying assumption that the Q i are joined together using index-nested lo op join (w e will build the index when w e actually materialize the tables). Exploring clustering opp ortunities for Q i is future w ork. Then, w e force the RDBMS to estimate the detailed costs of the plan P : σ ¯ x 0 =¯ a ( Q 1 ) o n σ ¯ x 0 =¯ a ( Q 2 ), where Q 1 and Q 2 are views, ¯ x 0 = ¯ a is an assignment to the b ound v ariables ¯ x 0 ≡ ¯ x b in ¯ x . F rom the detailed cost estimation, we extract the following quantities: (1) n i : b e the num b er of tuples from sub query σ ¯ x ( Q i ); (2) n : the num b er of tuples generated by P . W e also estimate the cost α (in PostgreSQL’s unit) of each I/O b y asking P ostgreSQL to estimate the cost of selections on some existing tables. Denote by c 0 = Inc Q ( Q 0 ) the cost (in P ostgreSQL unit) of executing σ ¯ x 0 =¯ a ( R 1 ) o n σ ¯ x 0 =¯ a ( R 2 ), where R i is the materialized table of Q i with prop er indexes built. Without loss of generality , assume n 1 < n 2 and that n 1 is small enough so that o n in the ab o ve query is executed using nested lo op join. On av erage, for each of the estimated n 1 tuples in σ ¯ x ( R 1 ), there is one index access to R 2 , and d n n 1 e tuples in σ ¯ x ( R 2 ) that can b e joined; assume each of the d n n 1 e tuples from R 2 requires one disk page I/O. 
Thus, there are n_1 ⌈n/n_1⌉ disk accesses to retrieve the tuples from R_2, and

    c_0 = α n_1 ⌈n/n_1⌉ + log|Q_2|,    (11)

where we use log|Q_2| as the cost of one index access to R_2 (the height of a B-tree). Now that both c_0 = Inc_Q(Q') and Mat(Q_i) are in the unit of PostgreSQL cost, we can sum them together and compare the result with the estimates for other materialization plans. In case (b), when Q_2 can fit in memory, we found that the above estimate tends to be too conservative: many accesses to Q_2 are cache hits, whereas the model above still counts them as disk I/O. To compensate for this difference, we multiply c_0 (derived above) by a fudge factor β < 1. Intuitively, we choose β as the ratio of the cost of accessing a page in main memory to that of accessing a page on disk; we determine β empirically. Component (3) is the factor t, which depends on the statistical operator. However, we can often derive an estimation method from the algorithm inside the operator. For example, for the algorithm in [5], the number of requests to an input data-movement operator can be estimated as the total number of mentions (using COUNT) divided by the expected average node degree.

D Additional Experiments

D.1 Additional Experiments on High-level Scalability and Quality

We describe the detailed methodology of our experiments on the Enron-R, DBLife, and NFL datasets.

Enron-R  The MLN program for Enron-R was based on rules obtained from related publications on rule-based information extraction [25, 27]. These rules (i.e., "Rule Set 1" in Figure 5) use dictionaries for person-name extraction and regular expressions for phone-number extraction. To extract person-phone relationships, a fixed window size is used to identify person-phone co-occurrences. We vary this window size to produce a precision-recall curve for this rule-based approach.
The MLN program used by Felix, Tuffy, and Alchemy replaces the relation-extraction part of the above rules (using the same entity-extraction results) with a statistical counterpart: instead of fixed window sizes, this program uses MLN rule weights to encode the strength of co-occurrence and thereby the confidence in person-phone relationships. In addition, we write soft constraints such as "a phone number cannot be associated with too many persons." We also add a set of coreference rules to perform person coref. We run Alchemy, Tuffy, and Felix on this program.

DBLife  The MLN program for DBLife was based on the rules in Cimple [12], which identifies person and organization mentions using dictionaries with regular-expression variations (e.g., abbreviations, titles). In the case of an ambiguous mention such as "J. Smith", Cimple binds it to an arbitrary compatible name in its dictionary (e.g., "John Smith"). Cimple then uses a proximity-based formula to translate person-organization co-occurrences into ranked affiliation tuples. These rules form "Rule Set 2" in Figure 5. The MLN program is constructed as follows. We first extract entities from the corpus: we perform part-of-speech tagging [38] on the raw text, and then identify possible person/organization names using simple heuristics (e.g., common person-name dictionaries and keywords such as "University"). To handle noise in the entity-extraction results, our MLN program performs both affiliation extraction and coref resolution using ideas similar to Figure 2.

NFL  On the NFL dataset, we extract winner-loser pairs. There are 1,100 sports news articles in the corpus. We obtain ground truth for game results from the web. As the baseline solution, we use 610 of the articles, together with ground truth, to train a CRF model that tags each token in the text as WINNER, LOSER, or OTHER.
We then apply this CRF model to the remaining 500 articles to generate probabilistic taggings of the tokens. Those 500 articles report on a different NFL season than the training articles, and we have ground truth for the game results (in the form of winner-loser-date triples). We take the publication dates of the articles and align them to game dates. The MLN program for NFL consists of two parts. The first part contains MLN rules encoding the CRF model for winner/loser team-mention extraction. The second part is adapted from rules developed by a research team in the Machine Reading project. Those rules model simple domain knowledge such as "a winner cannot be a loser on the same day" and "a team cannot win twice on the same day." We also add coreference of the team mentions.

             Coref   Labeling   Classification   MLN Inference
    Enron-R   1/1     0/0        0/0              1/1
    DBLife    2/2     0/0        1/1              0/0
    NFL       1/1     1/1        0/0              1/1
    Program1  0/0     1/1        0/0              0/0
    Program2  0/0     0/0        37/37            0/0
    Program3  0/0     0/1        0/0              1/1

Table 10: Specialized operators discovered by Felix's compiler.

[Figure 7: Performance of the Π₂ᴾ-complete algorithm for non-recursive programs (time in seconds vs. number of non-distinguished variables).]

[Figure 8: Plan diagram of Felix's cost optimizer (PostgreSQL unit vs. memory/I/O ratio; panels (a) FULL and (b) VIEW).]

D.2 Coverage of the Compiler

Since discovering subtasks as operators is crucial to Felix's scalability, in this section we test Felix's compiler. We first evaluate the heuristics we use for discovering statistical operators in an MLN program. We then evaluate the performance of the Π₂ᴾ-complete algorithm for discovering REF and SYM in non-recursive programs.
Using Heuristics for Arbitrary MLN Programs  While Felix's compiler can discover all Coref, Labeling, and Classification operators in all programs used in our experiments, we are also interested in how many operators Felix can discover in other programs. To test this, we downloaded the programs available on Alchemy's web site¹⁹ and manually labeled the operators in these programs. We manually label a set of rules as an operator if the set follows our definition of statistical operators. We then run Felix's compiler on these programs and compare the logical plans produced by Felix with our manual labels. We list all programs with manually labeled operators in Table 10. The entry x/y in each cell of Table 10 means that, among y manually labeled operators, Felix's compiler discovers x of them. We can see from Table 10 that Felix's compiler works well for the programs used in our experiments. Felix also works well at discovering classification and labeling operators in Alchemy's programs. This implies that the set of heuristic rules we use, although not complete, indeed encodes some popular patterns that users employ in real-world applications. Although some of Alchemy's programs encode coreference resolution tasks, none of them were labeled as coreference operators. This is because none of these programs explicitly declares the symmetry constraints as hard rules; therefore, the set of possible worlds decided by the MLN program is different from that decided by the typical "partitioning"-based semantics of coreference operators. How to detect and efficiently implement such "soft coref" is an interesting topic for future work.

¹⁹ http://alchemy.cs.washington.edu/mlns/

[Figure 9: Convergence of dual decomposition (number of updated multipliers, in thousands, per iteration).]
Performance of the Π₂ᴾ-complete Algorithm for Non-recursive Programs  In Section 5.3 and Section C.1.1, we showed that there are Π₂ᴾ-complete algorithms for annotating the REF and SYM properties, and Felix implements them. As the intractability is inherent in the number of non-distinguished variables, which is usually small, we are interested in understanding the performance of these algorithms in practice. We start from one of the longest rules found on Alchemy's web site that can be annotated as SYM. This rule has 3 non-distinguished variables. We then add more non-distinguished variables and plot the time taken in each setting (Figure 7). We can see that Felix uses less than 1 second to annotate the original rule, but exponentially more time as the number of non-distinguished variables grows to 10. This is not surprising given the exponential complexity of the algorithm. Another interesting conclusion we can draw from Figure 7 is that, as long as the number of non-distinguished variables is less than 10 (which is usually the case in our programs), Felix performs reasonably efficiently.

D.3 Stability of the Cost Estimator

Our previous experiments show that the plans generated by Felix's cost optimizer contribute to Felix's scalability. As the optimizer needs to estimate several parameters before making any predictions, we are interested in the sensitivity of the current optimizer to estimation errors in these parameters. The only two parameters used by Felix's optimizer are 1) the cost (in PostgreSQL's unit) of fetching one page from disk, and 2) the ratio of the speed of fetching one page from memory to that of fetching one page from disk. We test all combined settings of these two parameters (±100% of the estimated value) and draw the plan diagrams of two queries in Figure 8, representing different execution plans with different colors.
For each point (x, y) in a plan diagram, the color of that point represents which execution plan the compiler chooses if PostgreSQL's unit cost equals x and the memory/IO ratio equals y. For the queries not shown in Figure 8, Felix produces the same plan for every tested parameter combination. For the queries shown in Figure 8, we can see that Felix is robust to parameter mis-estimation. In fact, all the plans shown in Figure 8 are close to optimal, which implies that in our experiments Felix's cost optimizer avoids selecting "extremely bad" plans even under serious mis-estimation of parameters.

D.4 Convergence of Dual Decomposition

Felix implements an iterative approach to dual decomposition. One immediate question is: how many iterations do we need before the algorithm converges? To gain some intuition, we run Felix on the DBLife data set for a relatively long time and record the number of updated Lagrangian multipliers at each iteration. (Similar phenomena occur in the NFL data set as well.) We use a constant step size λ = 0.9. As shown in Figure 9, even after more than 130 iterations, the Lagrangian multipliers are still being heavily updated. However, on the ENRON-R data set, we observed that the whole process converges after the first several iterations. This implies that the convergence of our operator-based framework depends on the underlying MLN program and the size of the input data. It is interesting to see how different techniques for dual decomposition and gradient methods could alleviate this convergence issue, which we leave as future work.
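The iterative scheme above can be sketched with two toy subproblems that share one variable; in Felix the subproblems would be full statistical operators over relational data, so the quadratic objectives, the equality coupling, and the convergence test here are all illustrative assumptions. Only the constant step size λ = 0.9 is taken from the experiment.

```python
# Minimal dual-decomposition sketch: minimize (x-1)^2 + (y-3)^2 subject to
# x = y, split into two tasks coupled by one Lagrangian multiplier. Each
# task solves its relaxed subproblem in closed form; the multiplier is
# updated by a subgradient step, as in the iterative approach above.
STEP = 0.9  # constant step size, as in the DBLife run

def solve_task_a(mult: float) -> float:
    # argmin_x (x - 1)^2 + mult * x  ->  x = 1 - mult / 2
    return 1.0 - mult / 2.0

def solve_task_b(mult: float) -> float:
    # argmin_y (y - 3)^2 - mult * y  ->  y = 3 + mult / 2
    return 3.0 + mult / 2.0

mult = 0.0
for it in range(1, 200):
    xa, xb = solve_task_a(mult), solve_task_b(mult)
    disagreement = xa - xb               # subgradient of the dual at mult
    new_mult = mult + STEP * disagreement
    updated = abs(new_mult - mult) > 1e-9  # analog of "updated multipliers"
    mult = new_mult
    if not updated:                      # converged: the two copies agree
        break

# At convergence both tasks agree on the shared variable (x = y = 2 here),
# which is exactly the consistency that Felix's master enforces.
```

On this toy instance the multiplier settles quickly (ENRON-R-like behavior); on harder coupled problems the count of updated multipliers can stay high for many iterations, as Figure 9 shows for DBLife.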
Fortunately, we empirically find that in all of our experiments, taking the result from the first several iterations is often a reasonable trade-off between time and quality: all P/R curves in the previous experiments are generated by taking the last iteration within 3,000 seconds, and we already obtain significant improvements over the baseline solutions. To allow users to directly trade off quality against performance, Felix provides two modes: 1) run only the first iteration and flush the result immediately; or 2) run the number of iterations specified by the user. Exploring how to automatically select parameters for dual decomposition is an interesting direction for future work.