Dynamic Stacked Generalization for Node Classification on Networks


Authors: Zhen Han, Alyson Wilson

Abstract

We propose a novel stacked generalization (stacking) method as a dynamic ensemble technique using a pool of heterogeneous classifiers for node label classification on networks. The proposed method assigns component models a set of functional coefficients, which can vary smoothly with certain topological features of a node. Compared to the traditional stacking model, the proposed method can dynamically adjust the weights of individual models as we move across the graph and provide a more versatile and significantly more accurate stacking model for label prediction on a network. We demonstrate the benefits of the proposed model using both a simulation study and real data analysis.

1 Introduction

Network data and its relational structure have garnered tremendous attention in recent years. A network is composed of nodes and edges, where nodes represent interacting units and edges represent their relationships [7]. Node classification, or node labeling, on a network is a problem where we observe labels on a subset of nodes and aim to predict the labels for the rest [1]. There are various kinds of labels: for example, demographic labels such as age, gender, location, or social interests, or labels such as political party, research interests, or research affiliations.

Collective inference estimates the labels of a set of related nodes simultaneously given a partially observed network by exploiting the relational auto-correlation of connected units [10], and it has been demonstrated effective in reducing classification error for many applications [3, 17, 11, 20]. Common collective inference methods are the Iterative Classification Algorithm (ICA) [17], Gibbs Sampling (Gibbs), and Relaxation Labeling (RL) [12, 14].

In many cases, multiple types of relationships can be observed in the same network.
For example, in a citation network, an edge can mean two papers have the same author, or they are published in the same journal, or one paper cites another. We may also observe additional node-level information, such as the title and abstract of a paper, which can potentially help increase the label classification accuracy. When multiple relations are present on a network, one can merge all the relations and sum the weights of common links to perform a typical collective classification [13]. An alternative is to combine all of the information through an ensemble framework.

∗ MaxPoint Interactive. † Department of Statistics, North Carolina State University.

Fürnkranz [6] introduced hyperlink ensembles for classifying hypertext documents, where he suggests first predicting the label of each hyperlink attached to a document and then combining these individual predictions using ensembles to make a final prediction for the label of the target document. A different approach was proposed by Heß and Kushmerick [9], who suggest training separate classifiers for the local and relational attributes and then combining the local and relational classifiers through voting. A local classifier is trained using only node-level, or local, features: for example, title, abstract, or year of publication. A relational classifier infers a node's label by using relational features: for example, the labels of the connected neighbors. Cataltepe et al. [2] discussed a similar ensemble approach, where they considered different voting methods, such as the weighted average, average, and maximum. Eldardiry and Neville [4] discussed an across-models collective classification method that formed ensembles of the estimates from multiple classifiers, using a voting idea similar to collective inference to reduce variance. The above literature focuses on combining multiple classifiers through some type of aggregation.
Preisach and Schmidt-Thieme [18] proposed using stacking, instead of simple voting, as a more robust and powerful generalizing method to combine predictions made by local and relational classifiers. They suggest training each classifier independently and combining the predicted class probabilities from a pool of local and relational classifiers through stacking, which assigns constant weights to each classifier in a supervised fashion.

Stacked generalization (stacking) [22] is a technique for combining multiple classifiers, each of which has been individually trained for a specific classification task, to achieve greater overall predictive accuracy. The method first trains individual classifiers using cross-validation on the training data. The original training data is called level-0 data, and the learned models are called level-0 classifiers. The prediction outcomes from the level-0 models are pooled for the second-stage learning, where a meta-classifier is trained. The pooled classification outcomes are called level-1 data, and the meta-classifier is called the level-1 generalizer.

Ting and Witten [21] showed that for the task of classification, the best practice is to use the predicted class probabilities generated by level-0 models to construct the level-1 data. Essentially, stacking learns a meta-classifier that assigns a set of weights to the class predictions made by individual classifiers. The traditional stacking model assumes the weight of each classifier is constant from instance to instance, which does not hold in general for many relational classifiers on a network. For example, the weighted-vote relational neighbor (wvRN) classifier [13] infers a node's label by taking a weighted average of the class membership probabilities of its neighbors. One expects that its performance might depend on a node's topological characteristics in the graph; for example, the number of connected neighbors.
On the other hand, local classifiers that are trained using only a node's local attributes are less dependent on its topological features. Consequently, when we combine local and relational classifiers, it is beneficial to have a set of weight functions instead of constant weights for each classifier. There has been some previous work on dynamically ensembling local and relational models [16, 23]. However, these approaches impose parametric models on the weight functions that are not flexible enough to capture complex weighting functions.

In this paper, we develop a dynamic stacking framework using a generalized varying coefficient model, which allows the weights for each classifier to vary smoothly with a node's topological characteristics in a non-parametric way. We illustrate the benefits of incorporating a node's topological features into stacking. To the best of our knowledge, this is the first work that considers non-parametric functional-weight stacking.

2 Background and Motivation

Network data can be represented by a graph G = (V, E, Y) with vertices (nodes) V, edges (connections) E = {v_1, v_2}, v_1, v_2 ∈ V, and labels Y. The graph G is partitioned into two sets of vertices, V_train and V_test, with V_train ∪ V_test = V and V_train ∩ V_test = ∅. We are given a classification problem with C classes. Class labels, y_i, are observed for nodes in the training set, v_i ∈ V_train, while the labels of the test set V_test are unknown and need to be estimated. A relational classifier uses the attributes and/or labels from a node's connected neighbors to make predictions. However, unlike a typical classification problem, a node's neighbors may have missing attributes and/or labels, which in turn need to be estimated. Collective inference [11, 20] has been developed to make joint inference on the test nodes and produce consistent results.
We examine the Cora [15] and the PubMed Diabetes [20] data sets, where we evaluate the collective classification accuracy on nodes with various topological characteristics. We consider the wvRN classifier as the relational classifier [13], defined below, and the Iterative Classification Algorithm (ICA) [12, 14] for collective inference, as defined in Algorithm 1.

Algorithm 1 Iterative Classification Algorithm (ICA)
1: For v_i ∈ V_test, initialize the node labels, y_i, with a dummy label null.
2: repeat
3:   Generate a random sequence of nodes, O, in V_test.
4:   for node v_i ∈ O do
5:     Apply the relational classifier model, using only non-null labels from N_i, the neighborhood of v_i, and output an estimated class membership probability vector. We ignore nodes that have not been classified, so if all labels in N_i are null, we assign the label null to v_i.
6:     Assign the label, c, with the largest class membership probability to v_i.
7:   end for
8: until class assignments for V_test stop changing or a maximum number of iterations is reached.

Definition 2.1. For a given node v_i ∈ V_test, the wvRN classifier estimates the class probability P(y_i | N_i) by the weighted average of the class membership probabilities in the neighborhood of v_i, N_i:

(2.1)  P(y_i = c | N_i) = (1/Z) Σ_{v_j ∈ N_i} w_{i,j} P(y_j = c | N_j),

where Z is a normalizing constant and w_{i,j} is the weight associated with the edge between v_i and v_j.

Macskassy and Provost showed that the weighted-vote relational neighbor classifier is equivalent to the Gaussian-field model [14].

The Cora data set is a public academic database composed of papers from Computer Science. It contains a citation graph with attributes/labels for each paper (including authors, title, abstract, book title, and topic labels). We only consider the topics of each paper as its label and ignore the other attributes.
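As a concrete sketch, the ICA loop with wvRN as its relational classifier (Algorithm 1 and Definition 2.1) can be written as follows. The `graph`, `labels`, and `test_nodes` data structures are illustrative choices of ours, not from the paper: `graph` maps each node to a list of `(neighbor, edge_weight)` pairs, and a `None` probability vector plays the role of the null label.

```python
# Sketch of ICA (Algorithm 1) with the wvRN classifier (eq. 2.1).
from collections import defaultdict
import random

def wvrn_probs(node, graph, probs):
    """Weighted average of non-null neighbors' class-probability vectors."""
    totals, z = defaultdict(float), 0.0
    for nbr, w in graph[node]:
        if probs.get(nbr) is None:       # skip neighbors still labeled null
            continue
        z += w
        for c, p in probs[nbr].items():
            totals[c] += w * p
    if z == 0.0:
        return None                      # all neighbors null -> stay null
    return {c: p / z for c, p in totals.items()}

def ica(graph, labels, test_nodes, max_iter=50):
    """Run ICA until class assignments stabilize or max_iter is reached."""
    probs = {v: {labels[v]: 1.0} for v in labels}   # observed: one-hot
    for v in test_nodes:
        probs[v] = None                              # test nodes start null
    for _ in range(max_iter):
        changed = False
        order = list(test_nodes)
        random.shuffle(order)                        # random sequence O
        for v in order:
            est = wvrn_probs(v, graph, probs)
            if est is not None and est != probs[v]:
                probs[v], changed = est, True
        if not changed:
            break
    # Final hard assignment: the class with the largest probability.
    return {v: max(probs[v], key=probs[v].get) if probs[v] else None
            for v in test_nodes}
```

A tiny usage example: a test node whose two labeled neighbors both carry label `'x'` is assigned `'x'` after one sweep.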
We remove papers with no topic labels and construct the data set by keeping the largest connected component in the network. The final data set is an unweighted, undirected network with 19,355 nodes and 58,494 edges. Labels are 70 topic categories, and each paper is classified into one of the categories.

We randomly sample 80% of the nodes from V into V_test and set their labels to null. We then make predictions using ICA on the nodes in V_test and calculate the classification accuracy for different levels of degree and closeness centrality. We repeat this experiment 100 times, and the results are displayed in Figures (1) and (2). In Figure (1), the classification accuracy of the wvRN classifier is dependent on the degree of v_i. As a node's neighbor count increases from 1 to 10, the average classification accuracy jumps from 45% to more than 60%. There are a limited number of nodes with degrees greater than 10, and thus the variance of the average classification accuracy goes up considerably. Closeness centrality is defined as the reciprocal of a node's total distance from all other nodes, which relates to the idea of "being in the middle of things." Unlike degree, closeness centrality is a continuous variable. We binned its range into 100 equal-length intervals and calculated the classification accuracy in each bin. From Figure (2), we observe a steady upward trend in the classification accuracy near the center of the spectrum. There are not many nodes near the two ends of the spectrum, and this contributes to the large variation in the classification accuracy.

Figure 1: Classification accuracy for the relational classifier at different levels of node degree in the Cora data set.

We performed the same analysis on the PubMed Diabetes data set. The PubMed data set is a medical database composed of diabetes-related medical papers derived from the PubMed database.
The graph is a citation network with 19,717 papers and 44,338 edges. Each publication is assigned one of three categories as its label and a TF/IDF weighted word vector as an extra attribute. Here we ignore the extra attributes and only consider the topic category labels. We observe results similar to those from the Cora data set in Figures (3) and (4).

Figure 2: Classification accuracy for the relational classifier at different levels of node closeness centrality in the Cora data set.

3 Dynamic Stacking Model

3.1 Notation for the Stacked Generalization Model

Stacked generalization (stacking) is a general method for combining multiple lower-level models to improve overall predictive accuracy [22, 21]. Here we follow the notation in [21]. Given data D = {(y_i, x_i) for i = 1, ..., N}, let x_i be the feature vector and y_i the label of the i-th observation. Here we focus on categorical responses for y, and assume y has C categories. We first randomly partition the data into J roughly equal-sized parts D_1, D_2, ..., D_J. Define D_j and D_{-j} = D − D_j to be the test and training data sets for the j-th fold of a J-fold cross-validation. Suppose we have K classifiers. We train each of the K classifiers using the training set D_{-j}, with results M_k; M_1, ..., M_K are called level-0 models. We then apply the K classifiers to the test set D_j and denote z_{ik} = M_k(x_i), x_i ∈ D_j, as the estimated class probability vector from M_k for x_i ∈ D_j. We repeat this process for j = 1, ..., J and collect the outputs from the K models to form the level-1 data as follows:

(3.2)  D_level1 = {(y_i, z_{i1}, ..., z_{iK}), for i = 1, ..., N}.

Here z_{ik} is the predicted class probability vector from M_k for observation i, and therefore Σ_c z_{ikc} = 1. We drop the last element, z_{ikC}, from the vector z_{ik} to avoid multicollinearity issues.
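The level-1 data construction of equation (3.2) can be sketched as follows; the scikit-learn component classifiers and the toy data are illustrative stand-ins for the paper's level-0 models, and the fold count J = 5 is an arbitrary choice.

```python
# Sketch: out-of-fold level-1 data for stacking (eq. 3.2), with the last
# probability column of each model dropped to avoid multicollinearity.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def build_level1_data(X, y, models, J=5, seed=0):
    """Return Z: each row holds the J-fold out-of-fold class probabilities
    from every level-0 model, truncated to C - 1 columns per model."""
    C = len(np.unique(y))
    Z = np.zeros((len(y), len(models) * (C - 1)))
    for train_idx, test_idx in KFold(J, shuffle=True, random_state=seed).split(X):
        for k, model in enumerate(models):
            model.fit(X[train_idx], y[train_idx])       # train on D_{-j}
            proba = model.predict_proba(X[test_idx])    # predict on D_j
            Z[test_idx, k * (C - 1):(k + 1) * (C - 1)] = proba[:, :-1]
    return Z

# Toy level-0 problem: two heterogeneous classifiers on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Z = build_level1_data(X, y, [LogisticRegression(), DecisionTreeClassifier()])
```

With two classes, each model contributes C − 1 = 1 column, so `Z` has one column per component classifier; a level-1 generalizer would then be fit on `(y, Z)`.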
We then fit a supervised classification model, M̃, using the level-1 data; this model is called the level-1 generalizer.

Figure 3: Classification accuracy for the relational classifier at different levels of node degree in the PubMed data set.

For prediction, a new observation x_new is input into the K low-level classifiers, M_1, ..., M_K. The estimated class probability vectors, z_1, ..., z_K, are then concatenated and input into M̃, which outputs the final class estimate for that observation.

For classification on networks, Preisach and Schmidt-Thieme [18] adapted the stacking technique and combined a local classifier with a relational classifier using a logistic regression model as the level-1 generalizer. However, in their paper, the coefficients in the logistic regression are constant, meaning the weights of the individual component classifiers are "static" from node to node. From the observations in Figures (1), (2), (3), and (4), the accuracy of a relational classifier is often dependent on some topological feature of a node. Therefore, it could be beneficial to "dynamically" allocate the weights of the individual classifiers based on some other variable. We discuss a dynamic stacking model using a generalized varying coefficient model in the next section to account for this observation. Compared with competing methods that also consider dynamic stacking [16, 23], the proposed model is non-parametric, more flexible, and can learn more complex weighting functions.

3.2 Generalized Varying Coefficient Model through Smoothing Splines

Here we develop a dynamic stacking model using a generalized varying coefficient model. Instead of having a set of constant coefficients in the regression, we allow the coefficients to vary as smooth functions of other variables.

Figure 4: Classification accuracy for the relational classifier at different levels of node closeness centrality in the PubMed data set.
The generalized varying-coefficient model was proposed by Hastie and Tibshirani [8] and was reviewed in [5]. Similar to traditional stacked generalization, the inputs for the dynamic stacking model are the assembled outputs from multiple level-0 classifiers, along with an extra covariate: {(y_i, Z_i, u_i), for i = 1, ..., N}. Here y_i is the true class label of an observation, and Z_i is the concatenated predicted class membership vector, of dimension p, from a pool of inhomogeneous classifiers. Each of the component classifiers could potentially look at a different set of features of an instance and make a prediction from its own point of view. u_i is an "extra" covariate of an observation, which presumably affects the prediction accuracy of at least one classifier. Here we focus on the case where y_i is binary and u_i is continuous. One can easily extend this method to multi-class classification problems by using a one-vs-all strategy. The regression function is modeled as:

(3.3)  g(m(U_i, Z_i)) = β_0 + Z_i^T β(U_i) = β_0 + Σ_{j=1}^{p} Z_{ij} β_j(U_i),

where g(·) is the logit link function, β(·) is the functional coefficient vector that varies smoothly with an extra scalar covariate, and β_0 is a constant intercept. Instead of a constant intercept, one can trivially add a functional intercept by appending 1 to Z_i. However, in this paper, we focus on the constant-intercept case.

We assume that each functional coefficient β_j(·) for j = 1, ..., p can be approximated by spline functions:

(3.4)  β_j(·) = Σ_{k=1}^{K_j} η_{jk} B_{jk}(·), for j = 1, ..., p,

where, for each β_j, B_{jk}(·) for k = 1, ..., K_j is a set of spline basis functions. Without loss of generality, we use the same set of B-spline basis functions for all β_1(·), ..., β_p(·). Henceforth, we denote the set of B-spline basis functions as B_1(·), ..., B_K(·), where K is the number of basis functions.
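The spline expansion above can be sketched numerically as follows. This is a minimal sketch under our own illustrative assumptions: the basis size K, the clamped knot placement on [0, 1], and the quadrature grid for the curvature penalty used in the smoothness-penalized fit are choices of ours, not from the paper.

```python
# Sketch: B-spline expansion of the varying coefficients and the resulting
# varying-coefficient design matrix, plus the curvature-penalty block
# H_mn = integral of B_m''(x) B_n''(x) dx, approximated by quadrature.
import numpy as np
from scipy.interpolate import BSpline

def clamped_knots(K, degree=3):
    """Clamped knot vector on [0, 1] supporting K B-spline basis functions."""
    interior = np.linspace(0, 1, K - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def bspline_design(u, K, degree=3):
    """Evaluate B_1(u), ..., B_K(u); returns an array of shape (len(u), K)."""
    return BSpline(clamped_knots(K, degree), np.eye(K), degree)(u)

def curvature_penalty(K, degree=3, n_grid=401):
    """Trapezoid-rule quadrature for the K x K second-derivative penalty."""
    grid = np.linspace(0, 1, n_grid)
    D2 = BSpline(clamped_knots(K, degree), np.eye(K), degree).derivative(2)(grid)
    w = np.full(n_grid, grid[1] - grid[0])
    w[[0, -1]] /= 2                      # trapezoid-rule endpoint weights
    return D2.T @ (w[:, None] * D2)      # symmetric, positive semi-definite

# Varying-coefficient design: an intercept column plus Z_ij * B_k(U_i) blocks,
# one K-column block per component classifier j = 1, ..., p.
rng = np.random.default_rng(1)
N, p, K = 100, 2, 6
Z, U = rng.uniform(size=(N, p)), rng.uniform(size=N)
B = bspline_design(U, K)
X = np.column_stack([np.ones(N)] + [Z[:, [j]] * B for j in range(p)])
```

A penalized logistic regression of y on `X`, with `curvature_penalty(K)` repeated along the block diagonal of the penalty matrix, then yields the basis coefficients of the fitted weight functions.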
We can then rewrite equation (3.4) as:

(3.5)  β_j(·) = Σ_{k=1}^{K} η_{jk} B_k(·), for j = 1, ..., p.

We substitute equation (3.5) into equation (3.3) and rewrite the regression function as:

(3.6)  g(m(U_i, Z_i)) = β_0 + Σ_{j=1}^{p} Σ_{k=1}^{K} Z_{ij} η_{jk} B_k(U_i) = β_0 + Z_i^T B(U_i) η.

Denote B*(U) = (B_1(U), ..., B_K(U)), a 1 × K row vector. B(U) is the p × pK block-diagonal matrix with B*(U) repeated along its diagonal, and η = (η_{11}, ..., η_{pK})^T is the stacked coefficient vector of length pK.

We can estimate β_0, β_1(·), ..., β_p(·) by directly minimizing:

(3.7)  (β̂_0, β̂_1(·), ..., β̂_p(·)) = argmin over β_0, β_1(·), ..., β_p(·) of
       − Σ_{i=1}^{N} ℓ(β_0 + Σ_{j=1}^{p} Z_{ij} β_j(U_i), y_i) + λ Σ_{j=1}^{p} ∫ (β_j''(x))² dx,

where ℓ(g(m(U_i, Z_i)), y_i) is the log-likelihood function of the logistic regression, λ Σ_{j=1}^{p} ∫ (β_j''(x))² dx is a smoothness penalty term that controls the total curvature of the fitted β_j(·) for j = 1, ..., p, and λ is a smoothing parameter that controls the trade-off between model fit and the roughness of the fitted β_j(·)'s. When λ → 0, we have a set of wiggly β_j(·)'s; as λ → ∞, the minimization of (3.7) produces a set of linear β_j(·)'s.

For the constant-intercept case, one can absorb β_0 into η as η* = (β_0, η_{11}, ..., η_{pK})^T, of length 1 + pK, and append a constant 1 to the beginning of the product Z_i^T B(U_i). We can then write the optimization in equation (3.7) with respect to η* as:

(3.8)  η̂* = argmin over η* of { −ℓ(η*) + λ η*^T H η* },

where H is the assembled (1 + pK) × (1 + pK) block-diagonal penalty matrix H = diag(0, H_1, ..., H_p). Here H_j is the smoothness penalty matrix for β_j(·), with {H_j}_{mn} = ∫ B_m''(x) B_n''(x) dx for m, n = 1, ..., K. Since we are using the same set of basis functions, H_1 = ··· = H_p. It can be shown that −ℓ(η*) is convex with respect to
η*. Also, one can show that H is positive semi-definite, so λ η*^T H η* is convex with respect to η* as well. Therefore, there exists a unique η̂* that optimizes equation (3.8).

Given a specified smoothness penalty parameter λ, we estimate η* with an iterative Newton-type optimization method, directly calculating the derivatives of the objective function in equation (3.8). The smoothness penalty parameter λ can be chosen by cross-validation: for a range of λ values, we iteratively leave out a subset of the training data, fit the model using the rest of the data, and compute the prediction error on the held-out data set. The best λ is the one with the smallest objective function value.

4 Simulation Study

Here we compare the performance of the dynamic stacking method against standard benchmarks using simulated data sets. [18] used a standard logistic regression model as the level-1 generalizer to combine a local and a relational classifier. [19] suggested that regularization is necessary to reduce over-fitting and increase predictive accuracy, and they considered lasso regression, ridge regression, and elastic net regression. In our simulation study, we use lasso regression, ridge regression, and logistic regression as benchmark level-1 generalizers. For each of the benchmark generalizers, we experiment with adding an additional covariate and/or interaction terms into the stacking, and compare their performance with the dynamic stacking model.

For the simulation, N = 2000 observations are generated for i = 1, ..., N. Z_{1i} and Z_{2i} are the predicted positive class probabilities from two classifiers, Z_1 and Z_2, and they are generated independently from a uniform distribution on [0, 1]. w_i is the error term, which follows a normal distribution, N(0, 1). u_i is an extra covariate, which may affect the weight of a classifier, and it is generated from a uniform distribution on [0, 1].
Finally, the response y_i is generated from a Bernoulli distribution with p(y_i = 1) specified as one of the following three cases. In case 1, the classifier weights do not depend on u; case 2 has a linear dependence; and case 3 has a non-linear dependence.

Case 1: logit(p(y_i = 1)) = −3 + 3 Z_{1i} + 3 Z_{2i} + w_i
Case 2: logit(p(y_i = 1)) = −3 + 3 u_i Z_{1i} + 3 Z_{2i} + w_i
Case 3: logit(p(y_i = 1)) = −3 + 3 sin(6 u_i) Z_{1i} + 3 Z_{2i} + w_i

For training and evaluation, the N observations are evenly split into training and testing sets. We train the dynamic stacking model and the benchmark methods on the training set, where the penalty parameters of the proposed method and the benchmarks are selected by 10-fold cross-validation. The fitted models are then used to predict on the testing set, and the final prediction accuracy on the test set is measured by the Area Under the Curve (AUC) of the prediction scores, as shown in Table (1). For methods 1 (Logistic 1, Lasso 1, and Ridge 1), the inputs to the level-1 generalizer are {(y_i, Z_{1i}, Z_{2i})}. For methods 2, we add the additional covariate u to the input: {(y_i, Z_{1i}, Z_{2i}, u_i)}. For methods 3, in addition to u, we further add its interactions with Z_{1i} and Z_{2i} to the input: {(y_i, Z_{1i}, Z_{2i}, u_i, Z_{1i}u_i, Z_{2i}u_i)}.

From Table (1), the dynamic stacking model performs no worse than the standard methods under all scenarios. It performs better than the "static" stacking models when there is an underlying non-linear dependency between a classifier's performance and the extra covariate. In case 1, the dynamic stacking model degenerates gracefully to the traditional stacking models and does not over-fit the data. In case 2, where a linear dependency exists, the proposed model matches methods 3, where interaction terms with the extra covariate are added to the "static" stacking model.
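The simulation design above can be sketched as follows; the seed and the vectorized generation are implementation choices of ours, while the three functional forms follow the logit equations for cases 1-3.

```python
# Sketch: generate (Z1, Z2, u, y) for the three simulation cases.
import numpy as np

def simulate(case, N=2000, seed=0):
    """Draw N observations from the stated simulation design."""
    rng = np.random.default_rng(seed)
    z1, z2, u = (rng.uniform(size=N) for _ in range(3))
    w = rng.normal(size=N)                               # error term, N(0, 1)
    if case == 1:
        eta = -3 + 3 * z1 + 3 * z2 + w                   # no dependence on u
    elif case == 2:
        eta = -3 + 3 * u * z1 + 3 * z2 + w               # linear dependence
    else:
        eta = -3 + 3 * np.sin(6 * u) * z1 + 3 * z2 + w   # non-linear dependence
    p = 1.0 / (1.0 + np.exp(-eta))                       # inverse-logit
    y = rng.binomial(1, p)                               # Bernoulli response
    return z1, z2, u, y
```

An even train/test split of the returned arrays then reproduces the evaluation setup described above.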
In case 3, where a non-linear dependency exists, the dynamic stacking model outperforms all benchmarks.

Table 1: AUC score comparison between the proposed method and the benchmarks. For the level-1 generalizers, methods 1 use Z_{1i} and Z_{2i} as covariates, methods 2 contain U_i as an extra feature, and methods 3 further include the linear interaction terms U_i Z_{1i} and U_i Z_{2i}. The standard deviation of the accuracy score, calculated from 50 repetitions, is shown in parentheses.

Methods          Case 1       Case 2       Case 3
Random Guess     0.49 (0.02)  0.50 (0.02)  0.51 (0.02)
Z1 only          0.68 (0.02)  0.60 (0.02)  0.54 (0.02)
Z2 only          0.68 (0.02)  0.68 (0.02)  0.67 (0.02)
Logistic 1       0.75 (0.02)  0.71 (0.02)  0.67 (0.01)
Lasso 1          0.75 (0.02)  0.71 (0.02)  0.67 (0.01)
Ridge 1          0.75 (0.02)  0.71 (0.02)  0.67 (0.01)
Logistic 2       0.75 (0.02)  0.73 (0.02)  0.75 (0.02)
Lasso 2          0.75 (0.02)  0.72 (0.02)  0.74 (0.02)
Ridge 2          0.75 (0.02)  0.72 (0.02)  0.74 (0.02)
Logistic 3       0.75 (0.01)  0.73 (0.02)  0.76 (0.01)
Lasso 3          0.74 (0.02)  0.73 (0.02)  0.76 (0.01)
Ridge 3          0.74 (0.02)  0.73 (0.02)  0.76 (0.01)
Proposed Method  0.75 (0.02)  0.73 (0.02)  0.79 (0.01)

5 Real Data Analysis

Here we revisit the Cora data set [15], where we use paper titles as node attributes and topic classifications as labels. We remove nodes with no title or topic classification, and the final graph contains 11,187 nodes and 33,777 edges. Seventy topic categories are used as labels, and each paper belongs to one of the categories.
For simplicity, we convert the classification problem into a binary classification problem by assigning a positive label if the topic falls under the /Artificial_Intelligence/ category. We then use the closeness centrality of each node in the graph as an additional covariate in stacking.

We split the nodes of the graph into a 20% training set and an 80% testing set. On the training set, we observe the titles and the topic classification label of each paper, while on the test set, we only observe the titles. We fit a local classifier (Naive Bayes) using the word vector representation of a paper's title only, and a relational classifier (ICA + wvRN) using only the labels from a paper's neighbors. We then fit a dynamic stacking model using the output from the two classifiers, with their coefficients being smooth functions of the closeness centrality of a node. The smoothness penalty parameter is chosen by 10-fold cross-validation. One set of fitted coefficient curves for the two classifiers is shown in Figure 5.

Table 2: Mean accuracy score over 100 randomized experiments using the Cora and PubMed data sets.

          Xiang     McDowell  Lasso     Ridge     Logistic  Proposed
Cora      0.917476  0.918087  0.917898  0.918106  0.918401  0.919270
PubMed    0.879380  0.879205  0.889090  0.889125  0.889213  0.891499

Table 3: Pairwise accuracy comparison between the proposed method and each competing method over 100 randomized experiments. The values shown are the mean accuracy difference between the proposed method and each method, with the p-value for H_0: the proposed method is less accurate than that benchmark method.

Proposed  vs Xiang   vs McDowell  vs Lasso   vs Ridge   vs Logistic
Cora      0.1794     0.1183       0.1372     0.1164     0.0869
          p < 0.01   p < 0.01     p < 0.01   p < 0.01   p < 0.01
PubMed    1.2119     1.2294       0.2409     0.2374     0.2286
          p < 0.01   p < 0.01     p < 0.01   p < 0.01   p < 0.01
The fitted model allocates a higher weight to the relational classifier when a node has a high closeness centrality value and relies on the local classifier for nodes with a small closeness centrality value. This mirrors the observations from the previous discussion.

We compared the dynamic stacking model with the traditional stacking model using multiple standard level-1 generalizers (lasso, ridge, and logistic regression), all of which ignore the closeness centrality of a node during stacking. The penalty parameters for the lasso and ridge regressions are chosen by 10-fold cross-validation. We also implemented the ensemble classification methods from [23, 16]. In [23], Xiang and Neville proposed a parametric weighting scheme to combine a local classifier and a relational classifier, where the model parameters are chosen by cross-validation. In [16], the outputs from the local and relational classifiers are combined through a concept of label regularization, and the model they proposed has no additional parameters. We repeat the random train-test process 100 times and record the accuracy score for each run.

We also performed the model comparison using the PubMed data with the same general setup. The input to the local classifier is the TF/IDF representation of a paper, and we fit a dynamic stacking model using the outputs from the local classifier and the relational classifier, with their coefficients being smooth functions of the degree centrality of a node.

Figure 5: One set of fitted coefficient curves.

The classification accuracy comparison between the proposed method and the benchmarks is shown in Table (2) and Table (3), where the accuracy is defined as Σ_{i=1}^{N} I(y_i = ŷ_i) / N, with ŷ_i = I(p̂_i > 0.5). Assuming normality of the distribution of classification accuracy differences, the dynamic stacking model outperforms all benchmarks at p-value < 0.01. For the Cora data set, Figure 6 shows the source of the accuracy improvement.
For each of the 100 repetitions, we calculate the difference in the absolute number of correctly classified nodes at different closeness centrality levels between the proposed model and the benchmarks.

Figure 6: For each of the 100 repetitions, we calculate the difference in the number of correctly classified nodes at different closeness centrality levels between the dynamic stacking model and the benchmarks. In (a)-(e), we calculate the group mean and a 95% confidence interval. (f) shows the density distribution of the closeness centrality for all nodes in the graph.

The dynamic stacking model outperforms the benchmarks near the two ends of the closeness centrality spectrum, where the balance between the local and relational classifiers shifts considerably. For the majority of nodes in this data set, the closeness centrality clusters tightly around a specific value, which leaves little room for the dynamic stacking model to improve much beyond its static-weight counterparts in terms of overall accuracy. However, for the nodes near the two extremes of the closeness centrality spectrum, we do see a significant improvement from using the dynamic stacking method.

6 Discussion

In this paper, by examining two public data sets, Cora and PubMed, we illustrate the motivation for incorporating node topological characteristics into stacked generalization for node classification on networks. We then develop a novel dynamic stacking method with functional weights for the component models, each of which can vary smoothly with an extra covariate. Simulation studies show that the proposed method matches the benchmarks when the data do not present complex patterns, and outperforms all benchmarks otherwise. Real data analysis using Cora and PubMed shows the proposed method yields a small yet significant improvement in classification accuracy compared with traditional stacking models.
Further analysis shows that most of the accuracy improvement comes from nodes near the two extremes of the closeness centrality spectrum, where the balance between the local and relational classifiers shifts the most. The limited number of nodes in that region explains the small improvement in overall classification accuracy. However, for the nodes near the extremes of closeness centrality, we do see a considerable improvement in classification accuracy using the dynamic stacking method. Overall, the dynamic stacking model allows the composition of the stacking model to change as we move across the network, and thus it potentially provides a more versatile and accurate stacking model for label prediction on a network.

7 Future Work

The proposed dynamic stacking model is a direct extension of the traditional stacking model, which uses logistic regression as the level-1 generalizer. As discussed in [19], this model tends to over-fit, especially when combining a large pool of noisy classifiers. To mitigate this problem, one can add a group-lasso penalty over the model coefficients, η*, to equation (3.8). Coefficients can be naturally grouped when they are the basis coefficients for the same β_j(·): {(η_{j1}, ..., η_{jK}), for j = 1, ..., p}. With a group-lasso penalty, the dynamic stacking model is more robust to noise in the level-1 generalizing process.

8 Acknowledgment

This material is based upon work supported in part with funding from the Laboratory for Analytic Sciences (LAS). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the LAS and/or any agency or entity of the United States Government.

References

[1] Smriti Bhagat, Graham Cormode, and S. Muthukrishnan. Node classification in social networks. CoRR, abs/1101.3291, 2011.
[2] Zehra Cataltepe, Abdullah Sonmez, Kadriye Baglioglu, and Ayse Erzan.
Collective classification using heterogeneous classifiers. In Petra Perner, editor, MLDM, volume 6871 of Lecture Notes in Computer Science, pages 155–169. Springer, 2011.

[3] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 307–318, New York, NY, USA, 1998. ACM.

[4] Hoda Eldardiry and Jennifer Neville. Across-model collective ensemble classification. In AAAI, 2011.

[5] J. Fan and W. Zhang. Statistical methods with varying coefficient models. Statistics and its Interface, 1:179–195, 2008.

[6] Johannes Fürnkranz. Hyperlink ensembles: A case study in hypertext classification. Information Fusion, 3:299–312, 2001.

[7] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Found. Trends Mach. Learn., 2(2):129–233, February 2010.

[8] Trevor Hastie and Robert Tibshirani. Varying-coefficient models (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 55(4):757–796, 1993.

[9] Andreas Heß and Nick Kushmerick. Iterative ensemble classification for relational data: A case study of semantic web services. In Proceedings of the 15th European Conference on Machine Learning. Springer, 2004.

[10] David Jensen and Jennifer Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the 19th International Conference on Machine Learning, pages 259–266. Morgan Kaufmann, 2002.

[11] David Jensen, Jennifer Neville, and Brian Gallagher. Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, 2004, pages 593–598, 2004.

[12] Qing Lu and Lise Getoor.
Link-based classification. In Tom Fawcett and Nina Mishra, editors, ICML, pages 496–503. AAAI Press, 2003.

[13] Sofus A. Macskassy and Foster Provost. A simple relational classifier. In Proceedings of the Second Workshop on Multi-Relational Data Mining (MRDM-2003) at KDD-2003, pages 64–76, 2003.

[14] Sofus A. Macskassy and Foster Provost. Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res., 8:935–983, May 2007.

[15] Andrew K. McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

[16] Luke McDowell and David W. Aha. Semi-supervised collective classification via hybrid label regularization. In ICML. icml.cc / Omnipress, 2012.

[17] J. Neville and D. Jensen. Iterative classification in relational data. Pages 13–20. AAAI Press, 2000.

[18] Christine Preisach and Lars Schmidt-Thieme. Ensembles of relational classifiers. Knowl. Inf. Syst., 14(3):249–272, March 2008.

[19] Sam Reid and Greg Grudic. Regularized linear models in stacked generalization. In Multiple Classifier Systems, volume 5519 of Lecture Notes in Computer Science, pages 112–121. Springer Berlin Heidelberg, 2009.

[20] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.

[21] Kai Ming Ting and Ian H. Witten. Stacked generalization: when does it work? In Proceedings of the International Joint Conference on Artificial Intelligence, pages 866–871. Morgan Kaufmann, 1997.

[22] David H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.

[23] Rongjing Xiang and Jennifer Neville. Understanding propagation error and its effect on collective classification.
In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11–14, 2011, pages 834–843, 2011.
