A Learning Algorithm for Relational Logistic Regression: Preliminary Results


Authors: Bahare Fatemi, Seyed Mehran Kazemi, David Poole

The University of British Columbia
Vancouver, BC, V6T 1Z4
{bfatemi, smkazemi, poole}@cs.ubc.ca

Abstract

Relational logistic regression (RLR) is a representation of conditional probability in terms of weighted formulae for modelling multi-relational data. In this paper, we develop a learning algorithm for RLR models. Learning an RLR model from data consists of two steps: 1) learning the set of formulae to be used in the model (a.k.a. structure learning) and 2) learning the weight of each formula (a.k.a. parameter learning). For structure learning, we deploy Schmidt and Murphy's hierarchical assumption: first we learn a model with simple formulae, then more complex formulae are added iteratively only if all their sub-formulae have proven effective in previously learned models. For parameter learning, we convert the problem into a non-relational learning problem and use an off-the-shelf logistic regression learning algorithm from Weka, an open-source machine learning tool, to learn the weights. We also indicate how hidden features about the individuals can be incorporated into RLR to boost the learning performance. We compare our learning algorithm to other structure and parameter learning algorithms in the literature, and compare the performance of RLR models to standard logistic regression and RDN-Boost on a modified version of the MovieLens data-set.

Statistical relational learning (SRL) (De Raedt et al., 2016) aims at unifying logic and probability to provide models that can learn from complex multi-relational data. Relational probability models (RPMs) (Getoor and Taskar, 2007), also called template-based models (Koller and Friedman, 2009), are the core of SRL systems.
They extend Bayesian networks and Markov networks (Pearl, 1988) by adding the concepts of objects, properties and relations, and by allowing for probabilistic dependencies among relations of individuals. Unlike Bayesian networks, in RPMs a random variable may depend on an unbounded number of parents in the grounding. In these cases, the conditional probability cannot be represented as a table. To address this issue, many of the existing relational models (e.g., (De Raedt, Kimmig, and Toivonen, 2007; Natarajan et al., 2012)) use simple aggregation models such as existential quantifiers or noisy-or models. These models are much more compact than tabular representations. There also exist other aggregators with different properties (e.g., see (Horsch and Poole, 1990; Friedman et al., 1999; Neville et al., 2005; Perlich and Provost, 2006; Kisynski and Poole, 2009; Natarajan et al., 2010)).

(* In IJCAI-16 Statistical Relational AI Workshop. Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)

Relational logistic regression (RLR) (Kazemi et al., 2014) has recently been proposed as an aggregation model which can represent much more complex functions than the previous aggregators. RLR uses weighted formulae to define a conditional probability. Learning an RLR model from data consists of a structure learning and a parameter learning phase. The former corresponds to learning the features (weighted formulae) to be included in the model, and the latter corresponds to learning the weight of each feature. When all of the parents are observed (e.g., for classification), Poole et al. (2014) observed that an RLR model has similar semantics to a Markov logic network (MLN) (Richardson and Domingos, 2006). Therefore, one can use a discriminative learning algorithm for MLNs to learn an RLR model.

Huynh and Mooney (2008) proposed a bottom-up algorithm for discriminative learning of MLNs.
They use a logic program learner (ALEPH (Srinivasan, 2001)) to learn the structure, and then use L1-regularized logistic regression to learn the weights and enable automatic feature selection. The problem with this approach is that ALEPH (or any other logic program learner) and the MLN have different semantics: features marked as useful by ALEPH are not necessarily useful features for an MLN, and features marked as useless by ALEPH are not necessarily useless features for an MLN. The former happens because ALEPH generates features with logical accuracy, which is only a rough estimate of their value in a probabilistic model. The latter happens because the representational power of an MLN is greater than that of ALEPH: MLNs can leverage counts, but ALEPH is based on the existential quantifier. While the former issue may be resolved using L1-regularization, it is not straightforward to resolve the latter issue.

In this paper, we develop and test an algorithm for learning RLR models from relational data which addresses the aforementioned issues with current learning algorithms. Our learning algorithm follows the hierarchical assumption of Schmidt and Murphy (2010): a formula a ∧ b may be a useful feature only if a and b have each proven useful. Our learning algorithm is, in spirit, similar to the refinement graphs of Popescul and Ungar (2004). In our algorithm, however, feature generation is done using the hierarchical assumption instead of query refinement, and feature selection is done for each level of the hierarchy using hierarchical (or L1) regularization instead of sequential feature selection, thus reducing the number of times a model is learned from data and enabling previously selected features to be removed as the search proceeds.

We also incorporate hidden features about each individual in our RLR model and observe how they affect the performance.
We test our learning algorithm on a modified version of the MovieLens data-set taken from Schulte and Khosravi (2012) for predicting users' gender and age. We compare our model with standard logistic regression models that do not use relational features, as well as RDN-Boost (Natarajan et al., 2012), which is one of the state-of-the-art relational learning algorithms. The obtained results show that RLR can learn more accurate models than standard logistic regression and RDN-Boost. The results also show that adding hidden features to RLR may increase the accuracy of the predictions, but may make the model over-confident about its predictions. Regularizing the RLR predictions towards the mean of the data helps avoid over-confidence.

Background and Notations

In this section, we introduce the notation used throughout the paper, and provide the necessary background for readers to follow the rest of the paper.

Logistic Regression

Logistic regression (LR) (Allison, 1999) is a popular classification method within the machine learning community. We describe how it can be used for classification following Cessie and van Houwelingen (1992) and Mitchell (1997).

Suppose we have a set of labeled examples {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where each x_i is composed of n features x_{i1}, x_{i2}, ..., x_{in} and each y_i is a binary variable whose value is to be predicted. The x_{ij}s may be binary, multi-valued or continuous. Throughout the paper, we assume binary variables take their values from {0, 1}. Logistic regression learns a set w = {w_0, w_1, ..., w_n} of weights, where w_0 is the intercept and w_j is the weight of the feature x_{ij}. For simplicity, we assume a new dimension x_{i0} = 1 has been added to the data to avoid treating w_0 differently than the other w_j s.
LR defines the probability of y_i being 1 given x_i as follows:

    P(y_i = 1 | x_i, w) = σ( ∑_{j=0}^{n} x_{ij} w_j )    (1)

where σ(x) = 1 / (1 + exp(−x)) is the sigmoid function. Logistic regression learns the weights by maximizing the log-likelihood of the data (or, equivalently, minimizing the logistic loss function):

    w_LR = argmax_w ∑_{i=1}^{m} log P(y_i | x_i, w)    (2)

An L1-regularization term can be added to the loss function to encourage sparsity and perform automatic feature selection.

Conjoined Features and the Hierarchical Assumption

Given n input random variables, logistic regression considers n + 1 features: one bias (intercept) and one feature for each random variable. One can generate more features by conjoining the input random variables. For instance, if a and b are two continuous random variables, one can generate a new feature a * b (which is a ∧ b for Boolean variables). Given n random variables, conjoining (or multiplying) variables allows for generating 2^n features. These 2^n weights can represent arbitrary conditional probabilities¹ in what is known as the canonical representation; refer to Buchman et al. (2012) or Koller and Friedman (2009). For a large n, however, generating 2^n features may not be practically possible, and it also makes the model overfit to the training data.

In order to avoid generating all 2^n features, Schmidt and Murphy (2010) make a hierarchical assumption: if either a or b is not a useful feature, neither is a ∧ b. Under this assumption, Schmidt and Murphy (2010) first learn an LR model considering only features with no conjunctions. They regularize their loss function with a hierarchical regularization function, so that the weights of the features not contributing to the prediction go to zero. Once the learning stops, they keep the features having non-zero weights, and add all conjoined features all of whose sub-features have non-zero weights in the previous step.
Then they run their learning again. They continue this process until no more features can be added.

Relational Logistic Regression

Relational logistic regression (RLR) (Kazemi et al., 2014) is the analogue of LR for relational models. RLR can also be considered as the directed analogue of Markov logic networks (Richardson and Domingos, 2006). In order to describe RLR, we first need to introduce some definitions and terminology used in relational domains.

A population refers to a set of individuals and corresponds to a domain in logic. The population size of a population is a non-negative number indicating its cardinality. For example, a population can be the set of planets in the solar system, where Mars is an individual and the population size is 8. Logical variables start with lower-case letters, and constants start with upper-case letters. Associated with a logical variable x is a population pop(x), where |x| = |pop(x)| is the size of the population. A lower-case and an upper-case letter written in bold refer to a set of logical variables and a set of individuals, respectively.

A parametrized random variable (PRV) (Poole, 2003) is of the form F(t_1, ..., t_k), where F is a k-ary (continuous or categorical) function symbol and each t_i is a logical variable or a constant. If all t_i s are constants, the PRV is a random variable. If k = 0, we can omit the parentheses. If F is a predicate symbol, F has range {True, False}; otherwise the range of F is the range of the function. For example, LifeExistsOn(planet) can be a PRV with predicate function LifeExistsOn and logical variable planet, which is true if life exists on the given planet.

¹ Note that there are 2^n degrees of freedom and any representation that can represent arbitrary conditional probabilities may require 2^n parameters. The challenge is to find a representation that can often use fewer.

[Figure 1: A relational model taken from (Kazemi et al., 2014): Friend(z, y) and Kind(y) are the parents of Happy(z).]

A literal is an assignment of a value to a PRV. We represent F(.) = true as f(.) and F(.) = false as ¬f(.). A formula is made up of literals connected with conjunction or disjunction. A weighted formula (WF) for a PRV Q(z), where z is a set of logical variables, is a tuple ⟨F, w⟩ where F is a Boolean formula of the parents of Q and w is a weight.

Relational logistic regression (RLR) defines a conditional probability distribution for a Boolean PRV Q(z) using a set of WFs ψ as follows:

    P(q(Z) | Π) = σ( ∑_{⟨F, w⟩ ∈ ψ} w * F_{Π, z→Z} )    (3)

where σ(x) = 1 / (1 + exp(−x)) is the sigmoid function, Π represents the assigned values to the parents of Q, Z represents an assignment of individuals to the logical variables in z, and F_{Π, z→Z} is the formula F with each logical variable z in it being replaced according to Z, and evaluated in Π.

Example 1. Consider the relational model in Fig. 1, taken from (Kazemi et al., 2014), and suppose we want to model "someone is happy if they have at least 5 friends that are kind". The following WFs can be used to represent this model:

    ⟨True, −4.5⟩
    ⟨friend(z, y) ∧ kind(y), 1⟩

RLR sums over the above WFs, resulting in:

    ∀Z ∈ z: P(Happy(Z) = True | Friend(Z, y), Kind(y)) = σ(−4.5 + 1 * n_T)

where n_T represents the number of individuals in y for which Friend(Z, y) ∧ Kind(y) is true. When n_T ≥ 5, the probability is closer to one than zero, and when n_T < 5, the probability is closer to zero than one.

Handling Continuous Variables

RLR was initially designed for Boolean or multi-valued parents. If True is associated with 1 and False is associated with 0, we can substitute ∧ in our WFs with *.
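Before moving on, it may help to see the conditional of Example 1 evaluated as code. The sketch below is illustrative only: the population and the friend/kind relations are made up, and the function names are ours; the weights (−4.5 and 1) are those of Example 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_happy(friend, kind, z, population):
    """Evaluate Example 1's RLR model for individual Z:
    P(Happy(Z) = True) = sigmoid(-4.5 + 1 * n_T), where n_T counts
    the individuals Y with Friend(Z, Y) and Kind(Y) both true."""
    n_t = sum(1 for y in population if friend(z, y) and kind(y))
    return sigmoid(-4.5 + 1.0 * n_t)

# Hypothetical data: individual 0 is friends with everyone, and only
# the first five individuals are kind, so n_T = 5.
population = range(10)
friend = lambda z, y: z == 0
kind = lambda y: y < 5
p = p_happy(friend, kind, 0, population)  # sigmoid(0.5), just above 1/2
```

With n_T = 5 the signal crosses zero, matching the reading above: five or more kind friends push the probability above one half.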
Then we can allow continuous PRVs in WFs. For example, if for some X ∈ x we have R(X) = True, S(X) = False and T(X) = 0.2, then ⟨r(X) * s(X), w⟩ evaluates to 0, ⟨r(X) * ¬s(X), w⟩ evaluates to w, and ⟨r(X) * s(X) * t(X), w⟩ evaluates to 0.2 * w.

Learning Relational Logistic Regression

The aforementioned learning algorithm for LR is not directly applicable to RLR because the number of potential weights in RLR is unbounded. We develop a learning algorithm for RLR which can handle the unbounded number of potential weights. Learning RLR from data consists of two parts: structure learning (learning the set of WFs) and parameter learning (learning the weight of each WF).

Parameter Learning

Suppose we are given a set ψ of WFs defining the conditional probability of a PRV Q(z), and we want to learn the weight of each WF. We can convert this learning problem into an LR learning problem by generating a flat data-set in single-matrix form. To do so, for each assignment Z of individuals to the logical variables in z, we generate one data row ⟨x_{i1}, x_{i2}, ..., x_{i|ψ|}, y_i⟩ in which y_i = Q(Z) and x_{ij} is the number of times the formula of the j-th WF in ψ is true when the logical variables in z are replaced with the individuals in Z. Once we do this conversion, we have single-matrix data for which we can learn an LR model. The weight learned for the j-th input gives the weight for our j-th WF. Note that the conversion is complete and specifies the values of all relations.

Example 2. Consider the relational model in Fig. 1 and suppose we are given the following WFs:

    ⟨True, w_1⟩
    ⟨friend(z, y), w_2⟩
    ⟨friend(z, y) * kind(y), w_3⟩

In this case, we generate a data matrix having a row for each individual Z ∈ z, where the row consists of four values: the first is 1, serving as the intercept; the second is the number of people that are friends with Z; the third is the number of kind people that are friends with Z; and the fourth represents whether Z is happy or not. These are sufficient statistics for learning the weights. An example of the generated matrices is as follows:

    Bias  #Friends  #Kind Friends  Happy?
    1     5         3              Yes
    1     18        2              No
    1     1         1              Yes
    1     12        10             Yes
    ...   ...       ...            ...

In the above matrix, the first four people have 5, 18, 1 and 12 friends respectively, out of which 3, 2, 1, and 10 are kind.

Structure Learning

Learning the structure of an RLR model refers to selecting the set of WFs to be used. By conjoining different relations and adding attributes, one can generate an infinite number of WFs. As an example, suppose we want to predict the gender of users (G(u)) in a movie rating system, where we are given the occupation (O(u)), age (A(u)) and the movies that these users have rated (Rated(u, m)), as well as the set of (possibly more than one) genres that each movie belongs to. Our WFs can have formulae such as:

    a(u) = young
    rated(u, m) * drama(m)
    rated(u, m) * rated(u', m) * g(u') = male

and many other WFs with many more conjoined relations. Not all of these WFs may be useful, though. As an example, a WF whose formula is action(m) is not a useful feature in predicting G(u), as it evaluates to a constant number for all users. We avoid generating such features in an RLR learning model. To do this in a systematic way, we need a few definitions:

Definition 1. Let ψ denote the set of WFs defining the conditional probability of a PRV Q(z).
A logical variable v in a formula f of a WF ∈ ψ is a:
• target if v ∈ z
• connector for v_1 and v_2 if there are at least two relations in f, one having v and v_1 and the other having v and v_2
• attributed if there exists at least one PRV in f having only v (e.g., g(v))
• hanging if it fits none of the above definitions

Definition 2. A WF is a chain (Schulte and Khosravi, 2012) if its literals can be ordered as a list [r_1(x_1), ..., r_k(x_k)] such that each literal r_{i+1}(x_{i+1}) shares at least one logical variable with the preceding literals r_1(x_1), ..., r_i(x_i). A chain is targeted if it has at least one target logical variable. A WF is k-BL if it contains no more than k binary literals, and is r-UL if it contains no more than r unary literals.

Example 3. Suppose ⟨rated(u, m) * rated(u', m) * rated(u', m') * comedy(m), w⟩ is a WF belonging to the set of WFs defining the conditional probability of G(u). Then u is a target, m and u' are connectors, m is also attributed, and m' is a hanging logical variable. This WF is a chain because the second literal shares m with the first literal, the third shares u' with the second, and the fourth shares m with the first and second literals. This chain is targeted because it contains u, which is the only target logical variable. The WF is 3-BL, 1-UL, as it contains no more than 3 binary and no more than 1 unary literals. ⟨age(u) = young * comedy(m), w⟩ is a non-chain WF, and ⟨drama(m) * comedy(m), w⟩ is a chain which is not targeted.

We avoid generating two types of WFs: 1) WFs that are not targeted chains, and 2) WFs that contain hanging logical variables.
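Definitions 1 and 2 translate directly into code. The sketch below is our illustration, not the paper's implementation: a formula is represented as a list of (relation, variables) literals, the connector test is approximated as "appears together with another variable in at least two literals", and the chain test checks that the literal-overlap graph is connected, which is equivalent to the "can be ordered" condition.

```python
def variable_roles(literals, targets):
    """Classify each logical variable per Definition 1. `literals` is
    a list of (relation, (vars...)) pairs; `targets` is the set z."""
    all_vars = {v for _, args in literals for v in args}
    roles = {}
    for v in all_vars:
        tags = set()
        if v in targets:
            tags.add("target")
        # connector: v co-occurs with some other variable in >= 2 relations
        partnered = [args for _, args in literals
                     if v in args and len(set(args)) > 1]
        if len(partnered) >= 2:
            tags.add("connector")
        # attributed: some literal mentions only v (a unary literal on v)
        if any(set(args) == {v} for _, args in literals):
            tags.add("attributed")
        roles[v] = tags if tags else {"hanging"}
    return roles

def is_targeted_chain(literals, targets):
    """Keep a WF only if it mentions a target variable and its
    literal-overlap graph is connected (i.e., it is a chain)."""
    if not any(v in targets for _, args in literals for v in args):
        return False
    seen, frontier = {0}, [0]
    while frontier:  # flood-fill over literals sharing a variable
        i = frontier.pop()
        for j in range(len(literals)):
            if j not in seen and set(literals[i][1]) & set(literals[j][1]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(literals)

# The WF of Example 3 (u' and m' written as u2 and m2):
wf = [("rated", ("u", "m")), ("rated", ("u2", "m")),
      ("rated", ("u2", "m2")), ("comedy", ("m",))]
roles = variable_roles(wf, {"u"})
```

On Example 3 this recovers the roles stated above: u is a target, m is both a connector and attributed, u' is a connector, and m' is hanging; the non-targeted chain drama(m) * comedy(m) is rejected.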
We do this because non-targeted chains (e.g., ⟨drama(m) * comedy(m), w⟩ for predicting G(u)) always evaluate to a constant number, and WFs with hanging logical variables (e.g., ⟨rated(u, m) * acted(a, m), w⟩ for predicting G(u)) can be replaced with more informative WFs. In the rest of the paper, we only consider the WFs that are targeted chains and have no hanging logical variables.

Having these definitions, we state the hierarchical assumption as follows:

Hierarchical Assumption: Let f be a k-BL, r-UL WF. Let ψ_f be the set of all k-BL, j-UL (j < r) WFs having the same k binary literals and a strict subset of the unary literals of f. f is useless if L1-regularized logistic regression assigns a zero weight to it, or there exists a useless WF in ψ_f.

Example 4. A WF ⟨rated(u, m) * drama(m) * comedy(m), w⟩ is useless if either ⟨rated(u, m) * drama(m), w_1⟩ or ⟨rated(u, m) * comedy(m), w_2⟩ is useless, or L1-regularization sets w to 0.

In order to learn the structure of an RLR model, we select a value k and generate all allowed k-BL, 1-UL WFs. We find the best value of k by cross-validation. Then we add WFs with more unary literals by making the hierarchical assumption and using a similar search strategy as in (Schmidt and Murphy, 2010), following the algorithm below:

    curWFs ← set of k-BL, 1-UL WFs
    removedWFs ← ∅
    r ← 1
    while (¬stoppingCriteriaMet()) {
        M = L1-Regularized-Logistic-Regression(curWFs)
        removedWFs += curWFs having w = 0 in M
        r ← r + 1
        curWFs = HA(removedWFs, r)
    }

The algorithm starts with the k-BL, 1-UL WFs. Initially, no WF is labeled as removed. r is initially set to 1 to indicate the current maximum number of unary literals. Then, until the stopping criterion is met, the weights of the WFs in curWFs are learned using L1-regularized logistic regression.
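The search loop above can be rendered in Python. This is a simplified sketch under stated assumptions: the k binary literals are fixed, so a WF is identified with its set of unary literals; `l1_fit` is a hypothetical stand-in for the L1-regularized learner (it returns the WFs whose weights were driven to zero); and the loop tracks surviving WFs rather than removed ones, which is equivalent for generating the next level.

```python
from itertools import combinations

def ha_candidates(unary_literals, surviving, r):
    """Generate the r-UL WFs (over a fixed set of binary literals)
    obeying the hierarchical assumption: every strict subset of the
    unary literals must itself be a surviving WF."""
    out = []
    for combo in combinations(unary_literals, r):
        subs = [frozenset(s) for s in combinations(combo, r - 1)]
        if all(s in surviving for s in subs):
            out.append(frozenset(combo))
    return out

def learn_structure(unary_literals, l1_fit):
    """Sketch of the structure-learning loop. `l1_fit` takes the
    current WFs and returns the subset assigned zero weight."""
    cur = [frozenset([u]) for u in unary_literals]  # the 1-UL WFs
    surviving = set()
    r = 1
    while cur:  # stop when no more WFs can be generated
        zeroed = l1_fit(cur)
        surviving |= {wf for wf in cur if wf not in zeroed}
        r += 1
        cur = ha_candidates(unary_literals, surviving, r)
    return surviving

# Toy run: pretend the L1 learner always zeroes any WF mentioning "c",
# so no conjunction containing "c" is ever generated.
kept = learn_structure(["a", "b", "c"],
                       lambda wfs: {wf for wf in wfs if "c" in wf})
```

In the toy run, level 1 keeps {a} and {b} and removes {c}; level 2 therefore generates only {a, b}, and the search stops once no level-3 candidate has all its sub-WFs surviving.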
If the weight of a WF is set to zero, we add it to removedWFs. Then we increment r and update curWFs to the k-BL, r-UL WFs obeying the hierarchical assumption with respect to removedWFs (we assume HA(removedWFs, r) is a function which returns such WFs). The stopping criterion is met when no more WFs can be generated as r increases.

Adding Hidden Features

While we exploit the observed features of the objects in making predictions, each object may carry useful information that has not been observed. As an example, in predicting the gender of users given the movies they liked, some movies may only be appealing to males and some only to females. Or there might be features of movies that we do not know about, but that contribute to predicting the gender of users.

In order to incorporate hidden features in our RLR model, we add continuous unary PRVs such as H(m), with (initially) random values, to our dataset. Then we generate all k-BL, 1-UL WFs and learn the weights as well as the values of the hidden features using stochastic gradient descent with L1-regularization. Once we learn the values of the hidden features, we treat them as normal features and use our aforementioned structure and parameter learning algorithms to learn an RLR model.

Experiments and Results

We test our learning algorithm on the 0.1M MovieLens data-set (Harper and Konstan, 2015) with the modifications made by Schulte and Khosravi (2012). This data-set contains information about 940 users (nominal variables for age, occupation, and gender), 1682 movies (binary variables for action, horror, and drama), the movies rated by each user, comprising 79,778 user-movie pairs, and the actual rating the user has given to each movie. In our experiments, we ignored the actual ratings and only considered whether a movie has been rated by a user or not.

Table 1: ACLL and accuracy on predicting the gender and age of the users in the MovieLens dataset.

    Learning Algorithm    Baseline    LR          RDN-Boost   RLR-Base    RLR-H
    Gender  ACLL          -0.6020     -0.5694     -0.5947     -0.5368     -0.5046
            Accuracy      71.0638%    71.383%     70.6383%    73.8298%    77.3404%
    Age     ACLL          -0.6733     -0.5242     -0.5299     -0.5166     -0.5090
            Accuracy      60.1064%    76.0638%    76.4893%    77.1277%    77.0212%

We learned RLR models for predicting the age and gender of users, once with no hidden features (RLR-Base) and once with one hidden feature for the movies (RLR-H). We regularize the predictions of both RLR-Base and RLR-H towards the mean as:

    Probability = λ * mean + (1 − λ) * (RLR signal)    (4)

When predicting the age of the users, we only considered two instead of three age classes (we merged the age-1 and age-2 classes). For learning logistic regression models with L1-regularization, we used the open-source code of (Schmidt, Fung, and Rosales, 2007), and we learned the final logistic regression model with the Weka software (Hall et al., 2009). We compared the proposed method with a baseline model always predicting the mean, standard logistic regression (LR) not using the relational information, and RDN-Boost.

The performance of all learning algorithms was obtained by 5-fold cross-validation. In each fold, we divided the users in the MovieLens data-set randomly into an 80% training set and a 20% test set. We learned the model on the training set, measured the accuracy (the percentage of correctly classified instances) and the average conditional log-likelihood (ACLL) on the test set, and averaged them over the 5 folds. ACLL is computed as follows:

    ACLL = (1/m) ∑_{i=1}^{m} ln P(G(U_i) | data, model)    (5)

Obtained results are represented in Table 1.
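Equations (4) and (5) are straightforward to state as code. The sketch below uses our own variable names, not those of the paper's implementation:

```python
import math

def regularize_towards_mean(rlr_signal, mean, lam):
    """Eq. (4): blend the RLR output with the class mean. lam = 0
    keeps the raw RLR signal; lam = 1 always predicts the mean."""
    return lam * mean + (1.0 - lam) * rlr_signal

def acll(probs_of_true_class):
    """Eq. (5): the average of ln P(correct class) over the test set."""
    return sum(math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

# A hypothetical over-confident prediction of 0.99 pulled back
# towards a 0.5 mean with lambda = 0.3:
p = regularize_towards_mean(0.99, 0.5, 0.3)  # 0.3*0.5 + 0.7*0.99 = 0.843
```

Blending tempers the extreme probabilities that an over-confident model produces, which improves ACLL even when the 0/1 classification (and hence the accuracy) is unchanged.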
They show that RLR utilizes the relational features to improve the predictions compared to the logistic regression model that does not use the relational information, and to RDN-Boost. The results also show that adding hidden features to the RLR models may increase the accuracy and reduce the MAE. However, we observed that adding hidden features makes the model over-confident by pushing the prediction probabilities towards zero and one, thus requiring more regularization towards the mean.

Discussion

In our first experiment on predicting the gender of the users, we found that, on average, men have rated more action movies than women. This means that for predicting the gender of the users, the feature male(u) ⇐ rated(u, m) ∧ action(m) is a useful feature for both RLR and MLNs, as it counts the number of action movies. Many of the current relational learning algorithms/models, however, rely mostly on the existential quantifier as their aggregator (e.g., (Horsch and Poole, 1990; De Raedt, Kimmig, and Toivonen, 2007; Huynh and Mooney, 2008; Natarajan et al., 2012)). By relying on the existential quantifier, these models either have to use many complex rules to imitate the effect of such counting rules, or lose great amounts of the relational information available in terms of counts.

As a particular example, consider the discriminative structure learning of Huynh and Mooney (2008) for MLNs. First they learn a large set of features using ALEPH, then learn the weights of the features with L1-regularization to enable automatic feature selection. In cases where everyone has rated an action movie, male(u) :- rated(u, m) ∧ action(m) is not a useful feature for ALEPH (and so it will not find it), because it does not distinguish males from females. Therefore, this rule will not be included in the final MLN. Relational learners based on the existential quantifier can potentially imitate the effect of counts by using many complex rules.
As an example, male(u) :- rated(u, m1) ∧ action(m1) ∧ rated(u, m2) ∧ action(m2) ∧ m1 ≠ m2 may be used to assign a maleness probability to people rating two action movies. But this approach requires a different rule for each count, and the rules become more and more complex as the count grows, because they require pairwise inequalities. To see this in practice, we ran experiments with ALEPH on synthesized data and observed that, even though such rules could enhance its predictions, it failed at finding them.

Based on the above observations, we argue that relational learning algorithms/models need to allow for a richer set of predefined aggregators, or enable non-predefined aggregators to be learned from data, similar to what RLR does. We also argue that our structure learning algorithm has the potential to explore more features, and may also be a good candidate for discriminative structure learning of MLNs.

Conclusion

Relational logistic regression (RLR) can learn complex models for multi-relational data-sets. In this paper, we developed and tested a structure and parameter learning algorithm for these models based on the hierarchical assumption. We compared our model with the standard logistic regression model and RDN-Boost, and showed that, on the MovieLens data-set, RLR achieves higher accuracies. We also showed how hidden features can boost the performance of RLR models. The results presented in this work are only preliminary. Future directions include testing our learning algorithm on more complex data-sets having much more relational information, comparing our model with other relational learning and aggregation models in the literature, making the learning algorithm extrapolate properly for unseen population sizes (Poole et al., 2014), and testing the performance of our structure learning algorithm for discriminative learning of Markov logic networks.
References

Allison, P. 1999. Logistic Regression Using SAS: Theory and Application. SAS Publishing.

Buchman, D.; Schmidt, M.; Mohamed, S.; Poole, D.; and De Freitas, N. 2012. On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS.

Cessie, S. L., and van Houwelingen, J. 1992. Ridge estimators in logistic regression. Applied Statistics 41(1):191–201.

De Raedt, L.; Kersting, K.; Natarajan, S.; and Poole, D. 2016. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(2):1–189.

De Raedt, L.; Kimmig, A.; and Toivonen, H. 2007. ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI, volume 7.

Friedman, N.; Getoor, L.; Koller, D.; and Pfeffer, A. 1999. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, 1300–1307. Sweden: Morgan Kaufmann.

Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.

Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; and Witten, I. H. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18.

Harper, M., and Konstan, J. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.

Horsch, M., and Poole, D. 1990. A dynamic approach to probability inference using Bayesian networks. In Proc. Sixth Conference on Uncertainty in AI, 155–161.

Huynh, T. N., and Mooney, R. J. 2008. Discriminative structure and parameter learning for Markov logic networks. In Proc. of the International Conference on Machine Learning.

Kazemi, S. M.; Buchman, D.; Kersting, K.; Natarajan, S.; and Poole, D. 2014. Relational logistic regression. In Proc. 14th International Conference on Principles of Knowledge Representation and Reasoning (KR).

Kisynski, J., and Poole, D. 2009. Lifted aggregation in directed first-order probabilistic models. In Twenty-First International Joint Conference on Artificial Intelligence, 1922–1929.

Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA.

Mitchell, T. 1997. Machine Learning. McGraw Hill.

Natarajan, S.; Khot, T.; Lowd, D.; Tadepalli, P.; and Kersting, K. 2010. Exploiting causal independence in Markov logic networks: Combining undirected and directed models. In European Conference on Machine Learning (ECML).

Natarajan, S.; Khot, T.; Kersting, K.; Gutmann, B.; and Shavlik, J. 2012. Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning 86(1):25–56.

Neville, J.; Simsek, O.; Jensen, D.; Komoroske, J.; Palmer, K.; and Goldberg, H. 2005. Using relational knowledge discovery to prevent securities fraud. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. MIT Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Perlich, C., and Provost, F. 2006. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62:65–105.

Poole, D.; Buchman, D.; Kazemi, S. M.; Kersting, K.; and Natarajan, S. 2014. Population size extrapolation in relational probabilistic modelling. In Proc. of the Eighth International Conference on Scalable Uncertainty Management.

Poole, D. 2003. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), 985–991.

Popescul, A., and Ungar, L. H. 2004. Cluster-based concept invention for statistical relational learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 665–670. ACM.

Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62:107–136.

Schmidt, M. W., and Murphy, K. P. 2010. Convex structure learning in log-linear models: Beyond pairwise potentials. In International Conference on Artificial Intelligence and Statistics.

Schmidt, M.; Fung, G.; and Rosales, R. 2007. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Machine Learning: ECML 2007, 286–297. Springer.

Schulte, O., and Khosravi, H. 2012. Learning graphical models for relational data via lattice search. Machine Learning 88(3):331–368.

Srinivasan, A. 2001. The Aleph Manual.
