Retrospective Higher-Order Markov Processes for User Trails


Authors: Tao Wu, David Gleich

Tao Wu, Purdue University, West Lafayette, IN (wu577@purdue.edu)
David F. Gleich, Purdue University, West Lafayette, IN (dgleich@purdue.edu)

ABSTRACT

Users form information trails as they browse the web, check in with a geolocation, rate items, or consume media. A common problem is to predict what a user might do next for the purposes of guidance, recommendation, or prefetching. First-order and higher-order Markov chains have been widely used methods to study such sequences of data. First-order Markov chains are easy to estimate, but lack accuracy when history matters. Higher-order Markov chains, in contrast, have too many parameters and suffer from overfitting the training data. Fitting these parameters with regularization and smoothing only offers mild improvements. In this paper we propose the retrospective higher-order Markov process (RHOMP) as a low-parameter model for such sequences. This model is a special case of a higher-order Markov chain where the transitions depend retrospectively on a single history state instead of an arbitrary combination of history states. There are two immediate computational advantages: the number of parameters is linear in the order of the Markov chain, and the model can be fit to large state spaces. Furthermore, by providing a specific structure to the higher-order chain, RHOMPs improve model accuracy by efficiently utilizing history states without the risk of overfitting the data. We demonstrate how to estimate a RHOMP from data, and we demonstrate the effectiveness of our method on various real application datasets spanning geolocation data, review sequences, and business locations. The RHOMP model uniformly outperforms higher-order Markov chains, Kneser-Ney regularization, and tensor factorizations in terms of prediction accuracy.
KEYWORDS

Higher-order Markov chains; Tensor factorization; User models

1 INTRODUCTION

User trails record sequences of activities as individuals interact with the Internet and the world. Such data come from various applications when users write a product review [22], check in at a physical location [13, 38], visit a webpage, or listen to a song [8]. Understanding the properties and predictability of these data helps improve many downstream applications including overall user experiences, recommendations, and advertising [1, 17]. We study the prediction problem, and our goal is to estimate a model to describe and predict a set of user trails.

Markov chains are one of the most commonly studied models for this type of data. For these models, each checkin place, website, or song is a state. Users transition among these states following Markov rules. In a first-order Markov model, the transition behavior to the next state of the sequence depends only on the current state. Higher-order Markov models include a more realistic dependence on a larger number of previous states, and multiple recent studies found that first-order Markov chains do not fully capture user behaviors in web browsing, transportation, and communication networks [12, 29]. Furthermore, ignoring the effects of second-order Markov dynamics has significant negative consequences for downstream applications including community detection, ranking, and information spreading [2, 29].

The downside to higher-order Markov models is that the number of parameters grows exponentially with the order. (If there are N states and we model m steps of history, there are N^{m+1} parameters.) So, even if we could accurately learn the parameters, it is already challenging just to store them. (Some practical techniques include low-rank and sparse approximations, but these pose their own problems.)
Second, since the number of model parameters grows rapidly, the amount of training data required also grows exponentially with the order m [12]. Acquiring such huge amounts of training data is usually impossible. Lastly, determining the amount of history to use is itself hard [24], and selecting a large value of m could severely overfit the data, making the learned model less reliable.

Strategies to resolve the above issues of higher-order Markov chains include variable-order Markov chains [6], where the order is a variable that can take different values for different states. There is a fitting algorithm that can automatically determine an appropriate order for each state; however, it requires substantial computation time [28], which restricts it to applications with only a small number of states [5, 12, 14]. Smoothing and regularization methods [11] like Kneser-Ney smoothing and Witten-Bell smoothing are additional approaches to make the higher-order Markov chain more robust. These methods are widely applied in language models for predicting unseen transitions. We will compare against the behavior of Kneser-Ney smoothing in our experiments and show that our method has a number of advantages.

In this paper we propose the retrospective higher-order Markov process (RHOMP) as a simplified, special case of a higher-order Markov chain (Section 3). In this type of Markov model, a user retrospectively chooses a state from the past m steps of history, and then transitions as a first-order chain conditional on that state from history. This assumption helps to restrict the total number of parameters and protects the model from overfitting the correlations between history states. Specifically, this model corresponds to choosing m different first-order Markov chain transition matrices, one for each step of history, as well as an associated probability distribution.
Consequently, the number of parameters grows linearly with the size of the history while preserving the higher-order nature. We also show there are important connections between our model and the class of pairwise-interaction tensor factorization models proposed by Rendle et al. [26, 27] (Section 3.2).

We design an algorithm to select an optimal model from training data via maximum likelihood estimation (MLE). For the second-order case with two steps of history, this yields a constrained convex optimization problem with a single hyperparameter α. We derive a projected gradient descent [15] algorithm to solve it. It requires only a few iterations to converge, and each iteration is linear in the training data. We select the hyperparameter by fitting a polynomial to the likelihood function as a function of the parameter and selecting the global minimum. Thus, our RHOMP process does not require any parameter tuning and is scalable to applications with tens of thousands of states. In addition, both the process of updating the gradients and updating the model parameters parallelizes over the training data.

We evaluate the effectiveness of RHOMP models in experiments¹ with real datasets including product reviews, online music streaming, photo locations, and checkin business types (Section 5.1). We primarily compare algorithms in terms of their ability to predict information from testing data and use precision and mean reciprocal rank as the two main evaluation metrics. These experiments and results show that the RHOMP model achieves superior prediction results on all datasets (Section 5.2) compared with first- and second-order chains. For even higher-order chains, RHOMP shows stable performance with one exception (Section 5.4) where the data only has short sequences.

Remark. Recently Kumar et al. [20] proposed the Linear Additive Markov Process (LAMP), which is closely related to our framework.
Specically our RHOMP mo del has the same formulation as the generazlied extention GLAMP from the paper [ 20 ]. W e learned about this paper as we were nalizing our submission to arXiv . e papers share a number of related technical results ab out the models and we discov ered the related work [ 21 , 23 , 35 , 39 ] based on their manuscript. e main dierence is that in this paper w e focus on the general form that allows to learn dierent Marko v chains for each step of history . In addition we connect the RHOMP model with a particular tensor factorization to a higher-order Marko v chain. 2 PRELIMINARIES W e b egin by formally revie wing the problem of user trail prediction. en we will re view relevant backgr ound on Markov chain models. 2.1 Problem Formulation W e denote a user trail as a se quence over a discrete state space s = ( s 1 , s 2 , · · · ) with each element s i ∈ { 1 , 2 , · · · , N } . Here N is the total number of states. e sequence can represent, for instance, a user’s music listening history with each state denoting a song/artist, or a user’s checkin histor y from social network with each state denoting a location. Given a sp ecic user trail up to time t − 1: s = ( s 1 , s 2 , · · · , s t − 1 ) with t ≥ 2, the task is to predict the next state at time t based on a large set of user trails for training: S = { s ( 1 ) , s ( 2 ) , · · · } , where each s ( i ) is an individual trail. 1 Code and data for this paper are available at: hps://github.com/wutao27/RHOMP. 2.2 Markov Chain Metho ds An m − th or der Marko v chain is dened as a sto chastic pr ocess { X t , t = 1 , 2 , · · · } on the state space: { 1 , 2 , · · · , N } with the prop- erty that the next transition only depends on the last m steps. For- mally , Pr  X t = i | X t − 1 = i t − 1 , · · · , X 1 = i 1  = Pr  X t = i | X t − 1 = i t − 1 , · · · , X t − m = i t − m  . 
An (m + 1)-order transition tensor P of size N^{m+1} characterizes the above Markov chain, with P_{i,j,··· ,k} denoting the probability of transitioning to state i given the m current history states (j, · · · , k). The model with m = 1 is called the first-order Markov chain, and it can similarly be described by an N × N transition matrix P.

In order to use a Markov chain for the prediction problem, we need to estimate the transition matrix P. Given a set of user trails S = {s^(1), s^(2), · · · }, the maximum likelihood estimator (MLE) of the probability P_{i,j} for a first-order chain is given by [12]:

P_{i,j} = c(i,j) / Σ_ℓ c(ℓ,j),

where c(i,j) denotes the number of instances in which states j and i were consecutive in all trails. For the case of a higher-order Markov chain, it is well known that any higher-order (m > 1) Markov chain X_t is equivalent to a first-order Markov chain Z_t obtained by taking a Cartesian product of its state space. This simplifies the parameter estimation, and we may replace the original states with the Cartesian product states:

P_{i,j,··· ,k} = c(i,j, · · · , k) / Σ_ℓ c(ℓ,j, · · · , k),

where now c(i,j, · · · , k) counts the number of instances of the sequence k, · · · , j, i in the training data.

Returning to the prediction task itself, Markov chain methods take as input the history states of a trail and look up the probabilities for all future states in the matrix P or tensor P. This becomes a ranked list of states with the highest probability on top.

3 RETROSPECTIVE HIGHER-ORDER MARKOV PROCESSES

The goal of the retrospective higher-order Markov process (RHOMP) is to strike a balance between the simplicity of the first-order Markov model and the high-parameter complexity of the higher-order Markov model. Nevertheless, it is important for the model to account for higher-order behaviors because these are necessary to capture many types of user behaviors [12, 29].
Towards that end, the RHOMP model describes a structured higher-order Markov chain that results in a compact low-parameter description of possible user behaviors. We describe this formally for the case of a second-order history (and discuss largely notational extensions to higher-order chains in Section 3.4).

3.1 The Retrospective Process

The specific structure that a RHOMP describes is a retrospectively first-order Markov property. For some intuition, suppose that a web surfer had visited a search-query result page and then clicked the first link. In the RHOMP model, the user will first determine whether they are going to continue browsing from the search-result page or the first link; hence users have the power to retrospect over history. Once that decision has been made, the user will behave in a first-order Markovian fashion that depends on whether the user returned to the previous state or remained on the current state.

[Figure 1: An illustration of Markov chain methods and our proposed RHOMP model.]

Formally, suppose that the chain has recently visited states j and k. The RHOMP is a two-stage process that first selects a single history state. Since there are only two states, we model this selection as a weighted coin-toss where the probability of picking j is α, and so picking k happens with probability 1 − α. Once we have the history state, the RHOMP transitions according to a transition matrix that is specific to that step of the history. Thus

Pr(X_t = i | X_{t−1} = j, X_{t−2} = k) = α R_{i,j} + (1 − α) Q_{i,k},

where R models the transitions from the current state (when those are selected) and Q models the transitions from the previous state (when those are selected). See Figure 1 for an illustration. We summarize this in the following definition:

Definition 3.1.
Given 0 ≤ α ≤ 1 and two stochastic matrices R, Q, a second-order retrospective higher-order Markov process transitions from state j with history state k as follows: (i) with probability α it transitions according to R from the current state j, and (ii) with probability 1 − α it transitions according to Q from the previous state k.

This model has a number of useful features. For instance, it is easy to compute the stationary distribution, as the following theorem shows.

Theorem 3.2. Let α, R, Q be a second-order RHOMP model. Consider the stationary distribution x in terms of the long-term fraction of time the process spends in each state:

x_i = lim_{t→∞} (number of times X_t = i) / t, for each i = 1, . . . , N.

Such a distribution x always exists. Moreover, it is unique if αR + (1 − α)Q is an irreducible matrix.

Proof. Because the RHOMP is a special case of a second-order chain, we can use the relationship with the first-order chain on the Cartesian product space to establish that a distribution x always exists. This follows because the long-term distribution of a first-order, finite-state-space Markov chain always exists (though there could be multiple such distributions) [33]. Let X_{j,k} for all 1 ≤ j, k ≤ N be any limiting distribution on the product state space, and let x be either of the corresponding marginal distributions, Σ_j X_{j,k} = x_k or Σ_k X_{j,k} = x_j. Note that both of these marginals result in the same distribution because we use the long-time average to define X_{j,k}. Then we have

x_i = Σ_j X_{i,j} = Σ_j Σ_k (α R_{i,j} + (1 − α) Q_{i,k}) X_{j,k} = Σ_j α R_{i,j} x_j + Σ_k (1 − α) Q_{i,k} x_k = (P x)_i,

where P is defined as αR + (1 − α)Q. So the limiting distribution x satisfies x = Px, and it is unique if the corresponding Markov chain P is irreducible. □

In Section 3.3, we show how to compute a maximum likelihood estimate of R and Q from data.
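Theorem 3.2 also suggests a simple way to compute the stationary distribution numerically: form P = αR + (1 − α)Q and iterate x ← Px. A small sketch, where the random column-stochastic matrices are hypothetical stand-ins for fitted R and Q:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    """A random column-stochastic matrix (each column sums to one)."""
    M = rng.random((n, n))
    return M / M.sum(axis=0, keepdims=True)

n, alpha = 4, 0.7
R, Q = random_stochastic(n), random_stochastic(n)
P = alpha * R + (1 - alpha) * Q  # the first-order matrix from Theorem 3.2

# Power iteration for x = P x. Here P has strictly positive entries,
# so it is irreducible and the stationary distribution is unique.
x = np.full(n, 1.0 / n)
for _ in range(1000):
    x = P @ x
```

Because P is column-stochastic, each iterate remains a probability vector, and the limit satisfies the fixed-point equation x = Px from the proof above.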
3.2 A Tensor Factorization Perspective

We originally derived this type of RHOMP via a tensor factorization approach, but then realized that the retrospective interpretation is more direct and helpful. Nevertheless, we believe there are fruitful connections established by the tensor factorization approach. Consider the transition tensor of a second-order Markov chain: P is a 3-mode, N × N × N, non-negative tensor such that

Σ_i P_{i,j,k} = 1 for all 1 ≤ j, k ≤ N.    (1)

This imposes a set of N² equality constraints. If we wanted to use traditional low-rank tensor approximations such as PARAFAC or Tucker [19] to study large datasets, then we would need to add a large number of constraints to the fitting algorithms in order to ensure that the factorization results in a stochastic tensor that we could use for a second-order Markov chain. This approach was extremely challenging. Instead, consider a pairwise interaction tensor factorization (PITF) as proposed by Rendle et al. [27] with the following form:

P_{i,j,k} = Σ_ℓ A^(J)_{i,ℓ} B^(I)_{j,ℓ} + Σ_ℓ A^(K)_{i,ℓ} C^(I)_{k,ℓ} + Σ_ℓ B^(K)_{j,ℓ} C^(J)_{k,ℓ}    (2)

where the matrices A^(J), A^(K), B^(I), B^(K), C^(I), C^(J) ∈ R^{N×k}. We notice that the last term in (2) is the interaction between the current state j and the previous state k, and it contributes only a constant determined by the pair (j, k). For prediction applications, we can drop this term because it does not affect the relative ranking of the future state i. So the factorization model becomes:

P_{i,j,k} = Σ_ℓ A^(J)_{i,ℓ} B_{j,ℓ} + Σ_ℓ A^(K)_{i,ℓ} C_{k,ℓ}    (3)

with A^(J), A^(K), B, C ∈ R^{N×k}. To see the relationship with our RHOMPs, denote α̃R̃ = A^(J)B^T and (1 − α̃)Q̃ = A^(K)C^T with 0 ≤ α̃ ≤ 1.
Then the result of a PITF factorization with stochastic constraints is:

P_{i,j,k} = α̃ R̃_{i,j} + (1 − α̃) Q̃_{i,k}.    (4)

It is easy to verify that if both R̃ and Q̃ are stochastic matrices, then the corresponding tensor P is a transition tensor satisfying (1). The following theorem shows that from any nonnegative R̃ and Q̃, we can construct such stochastic matrices.

Theorem 3.3. Assume there exist nonnegative matrices R̃ and Q̃ such that the transition tensor P can be decomposed in the form of (4). Then there exist 0 ≤ α ≤ 1 and stochastic matrices R, Q such that P_{i,j,k} = α R_{i,j} + (1 − α) Q_{i,k}.

Proof. Denote Σ_i R̃_{i,j} = r̃_j and Σ_i Q̃_{i,k} = q̃_k for all 1 ≤ j, k ≤ N. Because 1 = Σ_i P_{i,j,k} = α̃ r̃_j + (1 − α̃) q̃_k for all 1 ≤ j, k ≤ N, we have r̃_1 = r̃_2 = · · · = r̃_N = r̃ ≥ 0, q̃_1 = q̃_2 = · · · = q̃_N = q̃ ≥ 0, and α̃ r̃ + (1 − α̃) q̃ = 1. If r̃ = 1 and q̃ = 1, then the original matrices R̃ and Q̃ are stochastic. Otherwise we can set α = α̃ r̃, R = R̃/r̃, and Q = Q̃/q̃, where R and Q are stochastic. Then we have

α R_{i,j} + (1 − α) Q_{i,k} = α̃ R̃_{i,j} + (1 − α̃ r̃) Q̃_{i,k}/q̃ = α̃ R̃_{i,j} + (1 − α̃) Q̃_{i,k} = P_{i,j,k}.

So (α, R, Q) forms a valid factorization of P; the bound on α follows from α̃ r̃ + (1 − α̃) q̃ = 1 in (4). □

Consequently, the RHOMP form also arises from the PITF approach when constrained to model stochastic tensors.

3.3 Parameter Optimization

In this section we apply the principle of maximum likelihood to estimate the model parameters of a RHOMP (i.e., R, Q) directly from data. An alternative would be to estimate the higher-order Markov chain and use the PITF factorization as discussed in the previous section. Working directly on the RHOMP model from data has two advantages: first, the estimate corresponds exactly to the model, rather than estimate-then-approximate; and second, the direct approach is faster.
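The rescaling in the proof of Theorem 3.3 is easy to check numerically. In the sketch below, the column sums r̃ = 2 and q̃ = 1/3 with α̃ = 0.4 are hypothetical values chosen so that α̃r̃ + (1 − α̃)q̃ = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha_t = 3, 0.4
r, q = 2.0, 1.0 / 3.0  # constant column sums; alpha_t*r + (1 - alpha_t)*q == 1

# Nonnegative factors whose columns each sum to r and q respectively.
Rt = rng.random((n, n)); Rt = r * Rt / Rt.sum(axis=0, keepdims=True)
Qt = rng.random((n, n)); Qt = q * Qt / Qt.sum(axis=0, keepdims=True)

# The construction from the proof: alpha = alpha_t * r, R = Rt/r, Q = Qt/q.
alpha = alpha_t * r
R, Q = Rt / r, Qt / q

# Both factorizations produce the same transition tensor entry.
i, j, k = 0, 1, 2
lhs = alpha_t * Rt[i, j] + (1 - alpha_t) * Qt[i, k]
rhs = alpha * R[i, j] + (1 - alpha) * Q[i, k]
```

After the rescaling, R and Q are column-stochastic and α = α̃r̃ lies in [0, 1], exactly as the theorem claims.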
W e rst show how to compute a maximum likelihood estimate with α xed and then discuss how to pick α . Re call that c ( i , j , k ) is the total count of transitions moving from j to i with previous state k in the training data. With xed α , the log likelihood of all transitions from the set S of user trails is: log L ( R , Q | S) = Õ c ( i , j , k ) > 0 c ( i , j , k ) log ( P i , j , k ) = Õ c ( i , j , k ) > 0 c ( i , j , k ) log ( α R i , j + ( 1 − α ) Q i , k ) (5) Our goal is to nd a pair of sto chastic matrices R , Q which min- imizes the negative log likelihoo d, which gives us the following optimization problem: minimize R , Q − log L ( R , Q | S) subject to R i , j ≥ 0 , Q i , j ≥ 0 1 ≤ i , j ≤ N Í i R i , j = 1 , Í i Q i , k = 1 1 ≤ i ≤ N (6) is optimization problem is convex as the following theorem shows. Theorem 3.4. e negation of the log likelihoo d function in (5) is convex and so is the feasible region of pairs of stochastic matrices. us any local minima solution ( R ∗ , Q ∗ ) is also the solution for global mimima. Proof. First we verify the feasible domain of sto chastic pairs ( R , Q ) is conve x. W e can che ck that given 0 ≤ λ ≤ 1 and two stochastic matrices A , B , the linear combination λ A + ( 1 − λ ) B is also a stochastic matrix. is applies element-wise to the pair to verify the claim. 
Now, given two pairs of stochastic matrices (R^(1), Q^(1)) and (R^(2), Q^(2)) and the corresponding linear combination (R = λR^(1) + (1 − λ)R^(2), Q = λQ^(1) + (1 − λ)Q^(2)), we have

−log L(R, Q | S) = −Σ_{i,j,k} c(i,j,k) log(α R_{i,j} + (1 − α) Q_{i,k})
= −Σ_{i,j,k} c(i,j,k) log( λ(α R^(1)_{i,j} + (1 − α) Q^(1)_{i,k}) + (1 − λ)(α R^(2)_{i,j} + (1 − α) Q^(2)_{i,k}) )
≤ −Σ_{i,j,k} c(i,j,k) ( λ log(α R^(1)_{i,j} + (1 − α) Q^(1)_{i,k}) + (1 − λ) log(α R^(2)_{i,j} + (1 − α) Q^(2)_{i,k}) )
= −λ log L(R^(1), Q^(1) | S) − (1 − λ) log L(R^(2), Q^(2) | S),

where the inequality follows from the concavity of the logarithm. So (6) is a convex problem. □

We now derive the projected gradient descent algorithm for (6), which is summarized in Algorithm 1. Each iteration involves two steps: (1) update R and Q based on their gradients; (2) since R and Q are no longer stochastic after the update, project the updated R and Q back onto ℓ1-balls (i.e., restore the stochastic property). The gradients over R and Q are:

∆R_{i,j} = −∂ log L / ∂R_{i,j} = Σ_k −α c(i,j,k) / (α R_{i,j} + (1 − α) Q_{i,k}),
∆Q_{i,k} = −∂ log L / ∂Q_{i,k} = Σ_j −(1 − α) c(i,j,k) / (α R_{i,j} + (1 − α) Q_{i,k}).    (7)

We accomplish the projection step using the algorithm from [15]. Note that for the sake of simplicity we present the projection step by sorting the vector w, but there is a more efficient method based on divide and conquer [15] whose cost is linear in the number of nonzeros of w. In practice, however, sorting w is fast because the vector w is very sparse. Overall, each iteration takes time linear in the number of unique triples (i,j,k) in the sequence data, which is bounded above by the size of the input data. We also note that computing the gradients ∆R, ∆Q and updating R, Q, which dominates the computation, can be parallelized.

Choosing α.
To determine the value of the hyperparameter α, we conduct a few trials with α chosen in (0, 1). Then, based on the value of the objective function, we calculate the best value of α from a polynomial interpolation of the likelihood function. Specifically, α is sampled at the n Chebyshev nodes

α_k = 1/2 + (1/2) cos((2k − 1)π / (2n)), k = 1, 2, · · · , n.

Finding the global minimum of a polynomial interpolant can be done efficiently, and polynomials can approximate arbitrary continuous functions, which renders this a pragmatic choice. Another approach for selecting the value of α is cross validation with grid search. However, a different objective is needed because we could run into unseen transitions in the validation set, where the likelihood would go to −∞; alternatively we can use a measure like precision instead of likelihood. The main advantage of cross validation is its ability to prevent overfitting. In our experiments we find this problem does not occur, so we drop this procedure as it requires substantially more computation.

Algorithm 1: Maximum Likelihood Estimate of a Second-order RHOMP
Require: parameter α, step size γ_0, and transition counts c(i,j,k)
1:  Initialize R with R_{i,j} = Σ_k c(i,j,k) / Σ_{ℓ,k} c(ℓ,j,k), Q with Q_{i,k} = Σ_j c(i,j,k) / Σ_{ℓ,j} c(ℓ,j,k), and γ = γ_0
2:  repeat
3:    Compute the gradient matrices ∆R, ∆Q based on (7)
4:    R ← (R − γ∆R) and Q ← (Q − γ∆Q)
5:    for each column vector w of R and Q do
6:      Sort the nonzeros of w into u: u_1 ≥ u_2 ≥ · · · ≥ u_k > 0
7:      Find ρ = max{ r ≤ k : u_r − (1/r)(Σ_{i=1}^r u_i − 1) > 0 }
8:      Define θ = (1/ρ)(Σ_{i=1}^ρ u_i − 1)
9:      Update w with w_i ← max{w_i − θ, 0}
10:   end for
11:   if the objective value decreases then
12:     γ ← min{2γ, γ_0}
13:   else
14:     γ ← 0.5γ; re-run this iteration with the updated γ
15:   end if
16: until convergence
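The projection step in Algorithm 1 (lines 6–9) is the standard Euclidean projection onto the probability simplex; a minimal NumPy version of the sort-based variant from [15]:

```python
import numpy as np

def project_to_simplex(w):
    """Project w onto {x : x >= 0, sum(x) = 1}: sort descending, find the
    pivot index rho, compute the shift theta, and clip at zero."""
    u = np.sort(w)[::-1]
    cssv = np.cumsum(u)
    ks = np.arange(1, len(w) + 1)
    rho = np.nonzero(u - (cssv - 1.0) / ks > 0)[0][-1]
    theta = (cssv[rho] - 1.0) / (rho + 1)
    return np.maximum(w - theta, 0.0)
```

After each gradient step, every column of R and Q is passed through this projection, which restores the stochastic constraints while moving each column the least possible distance.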
3.4 Higher-order Cases Beyond Second Order

The ideas discussed in the sections above also work for the higher-order cases with m ≥ 3. The RHOMP model becomes:

Pr(X_t = i | X_{t−1} = j, X_{t−2} = k, . . . , X_{t−m} = ℓ) = α_1 R^(1)_{i,j} + α_2 R^(2)_{i,k} + · · · + α_m R^(m)_{i,ℓ},

where 0 ≤ α_i ≤ 1 for i = 1, 2, · · · , m with Σ_i α_i = 1, and the matrices R^(i) for i = 1, 2, · · · , m are stochastic. The log likelihood function can be derived similarly, as can the gradient over each R^(i). The projected gradient descent algorithm is then applied to update each stochastic matrix R^(i), with a per-iteration complexity bounded by the size of the training data.

The biggest difference is that we are no longer able to determine the hyperparameters α_i in a simple fashion, as polynomial interpolation is only computationally efficient for one or two parameters. To address this issue, recall that in Section 3.1 we proposed the model as a retrospective walk, where the walker has probability α_k of stepping back k − 1 steps into their history and then transitioning according to R^(k). Our proposal is to use a single hyperparameter β < 1 to model a decaying probability of looking back into the history:

α_1 = (1 − β)/(1 − β^m), α_2 = β(1 − β)/(1 − β^m), . . . , α_m = β^{m−1}(1 − β)/(1 − β^m).

(This distribution describes a truncated geometric random variable.) In our experiments for the second-order case, the optimal α_1 > 1/2 for every dataset, which offers a single step of evidence for this assumption. This β can be chosen either by the polynomial interpolation procedure or simply by using the optimal value α* from a second-order factorization model: β = (1 − α*)/α*. We apply the latter approach in our experiments for RHOMP with m > 2.

4 RELATED WORK

Modeling User Trails. Early work in [25] characterized user path patterns on the web with the tools of Markov chains.
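The truncated-geometric weights can be computed directly; a small helper (the function name is hypothetical) that normalizes β^{k−1} over k = 1, . . . , m:

```python
def retrospective_weights(beta, m):
    """History weights alpha_k = beta**(k-1) * (1 - beta) / (1 - beta**m)
    for k = 1..m: a geometric distribution truncated to m steps of history."""
    norm = (1.0 - beta) / (1.0 - beta ** m)
    return [beta ** (k - 1) * norm for k in range(1, m + 1)]
```

For m = 2 these reduce to α_1 = 1/(1 + β) and α_2 = β/(1 + β), so β is exactly the ratio α_2/α_1 of the two second-order weights.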
Other advanced methods include hidden Markov models (HMMs) [16], variable-length Markov chains [6], and association rules [1]. However, the computations associated with the above methods prevent them from being used on datasets with more than a few thousand states. More recent work considers the sequence prediction task with personalization, such as collaborative filtering methods [30, 31], where the behavior of similar users is utilized to help the prediction, factorizing personalized Markov chains [26], and TribeFlow [17]. Beyond the prediction problem, clustering and visualization [7], sequence classification [37], metric embedding [9, 10], and hypothesis comparison [32] have also been studied. In the context of this work, we seek to improve the performance of the classic and simple Markov model by studying a structured variation.

Random Walk Models. Since our model is a special case of a higher-order Markov chain, we note that there are relationships with a variety of enhanced Markov models. First, our RHOMP model defines a specific form of the Additive Markov Process (AMP) [21], where the transition probability is a summation of a series of memory functions, each restricted to the next state and one history state. Applications of the AMP include LAMP [20] (see Section 1), gravity models [39], and some dynamical systems in physics [23, 35] where the memory function is empirically estimated for the application of binary states. In addition to the AMP, recent innovations include new recovery results on mixtures of Markov chains [18] (a special case of HMMs), which assume a small set of Markov chains that model various classes of latent intent; and the spacey random walk [3, 4, 36], a non-Markovian stochastic process that utilizes higher-order information based on the empirical occupation of states.

Tensor Factorization.
As already discussed, our work is directly related to the pairwise interaction tensor factorization (PITF) method proposed by Rendle in [26, 27], where the task is to generate tag recommendations given a {user, item} combination. The PITF model is learned from a binary tensor of {user, item, tag} triples by bootstrap sampling from pairwise ranking constraints. Our work differs in its problem formulation, model construction, and parameter optimization. The RHOMP model is also a special case of both the canonical/PARAFAC and Tucker decompositions [19].

5 EXPERIMENTS

We evaluate our RHOMP method on its ability to predict subsequent states in a user trail in terms of precision and mean reciprocal rank (MRR) on five different types of data (Section 5.1). We then present the results of a second-order (i.e., m = 2) RHOMP compared with baseline methods in Section 5.2 and study overfitting of the training data in Section 5.3. Then we study what happens for higher-order (i.e., m > 2) models in Section 5.4. In all cases, the RHOMP model offers a considerable improvement over existing methods.
[Figure 2 (five panels: LastFM, BeerAdvocate, BrightKite, Flickr, and FourSQ; methods MC1, MC2, Kneser1, Kneser2, RHOMP, PITF, LME): Relative precision results on all datasets with k = 1, 2, 3, 4, 5. We use Kneser1 as the baseline, and the relative precision is calculated as the precision ratio to that of Kneser1. The error bars in the figure are the standard deviations over 5 trials. The numbers at the bottom and top of the figures denote the absolute precisions for Kneser1 and our RHOMP method, respectively. We see that our RHOMP has noticeable improvements over other methods in most datasets.]

5.1 Datasets and Evaluation Setup

The real datasets we use in our experiments cover several applications including product reviews, online music streaming, checkin locations from social networks, and photo uploads. Every dataset is publicly available. For all the datasets, self-loops are removed, as we are mostly interested in predicting a non-trivial transition. Also, we only consider states that show up more than 20 times. Simple statistics on each dataset are summarized in Table 1, and we now describe them individually.
LastFM [8] is a music streaming and recommendation website (last.fm). We generate user trails as listening histories over different artists over a one-year period (2008-05-01 to 2009-05-01).

Table 1: Dataset characteristics in terms of the number of states, transitions, and trails.

                # states   # transitions   # trails
LastFM          17,341     2,902,035       195,499
Beer Advocate    2,324     1,348,903        35,629
BrightKite      11,465       400,340       125,437
Flickr           7,608     1,212,674        97,563
FourSQ             344       198,503         1,480

Beer Advocate [22] consists of beer reviews spanning more than 10 years up to November 2011 from beeradvocate.com. We study the user trail as reviews over different brewers.

BrightKite [13] was a location-based social networking website where users shared their locations by checking in. We study the trails of location ids.

Flickr [34] contains 100 million Flickr photos/videos provided by Yahoo! Webscope. We extract the user trail based on the geolocation (restricted to the USA) of each upload after 2008-01-01. Each longitude and latitude is mapped into a grid of approximately 10km by 10km cells, which constitute the states.

FourSQ is a location-based check-in dataset created by Yang et al. [38] which contains checkins from New York City from 24 October 2011 to 20 February 2012. We extract the checkin place category (e.g., bus station, hotel, bank) as the state.

For experimental methods, we consider the following:

MC1, MC2 are the first-order and second-order Markov chain methods, respectively, where the transition matrix is estimated by maximum likelihood.

Kneser1, Kneser2 are the interpolated Kneser-Ney smoothing methods [11] applied to the first-order and second-order Markov chain methods, respectively. This is one of the best smoothing methods for n-gram language models, where it enables higher-order Markov chain transitions to unseen n-grams.
We set the discounting parameter as n_1/(n_1 + 2n_2) by the method of leaving one out [11], where n_1 and n_2 denote the number of n-grams that appear exactly once and twice respectively.

Table 2: Mean Reciprocal Rank (MRR) results of various methods on all datasets. Bold indicates the best mean performance, and ± entries are the standard deviations over 5 trials. Our proposed RHOMP (m = 2) has the best performance on all datasets.

                MC1            MC2            Kneser1        Kneser2        PITF           LME            RHOMP
LastFM          0.071 ± 0.001  0.068 ± 0.001  0.066 ± 0.001  0.090 ± 0.002  0.058 ± 0.001  0.062 ± 0.001  0.100 ± 0.001
Beer Advocate   0.080 ± 0.000  0.034 ± 0.001  0.079 ± 0.000  0.076 ± 0.001  0.067 ± 0.002  0.067 ± 0.001  0.090 ± 0.000
BrightKite      0.551 ± 0.002  0.540 ± 0.002  0.554 ± 0.002  0.599 ± 0.002  0.440 ± 0.007  0.529 ± 0.002  0.603 ± 0.002
Flickr          0.358 ± 0.003  0.306 ± 0.004  0.350 ± 0.001  0.379 ± 0.001  0.313 ± 0.004  0.333 ± 0.003  0.410 ± 0.001
FourSQ          0.138 ± 0.004  0.092 ± 0.003  0.146 ± 0.005  0.155 ± 0.004  0.120 ± 0.003  0.113 ± 0.002  0.181 ± 0.003

PITF is the pairwise interaction tensor factorization method [27] computed on the higher-order Markov chain estimate. Because we use ranking, we consider general positive and negative entries as valid for the factorization. We implement the fitting method ourselves to handle the sparsity in our data. As suggested in the paper [27], the hyperparameters are λ = 5 × 10^-5 and α = 0.05 with initialization from N(0, 0.01). We set the rank as 5% of the total number of states, which is enough to accurately capture the user behavior [27]. The number of iterations for the stochastic gradient descent is 10,000,000.

LME is short for Latent Markov Embedding [9]. It is a machine learning algorithm that embeds states into Euclidean space based on a regularized maximum likelihood principle.
We set the dimension d = 50 and use default values for all other parameters (e.g., learning rate, epsilon). (We tried various values of d from 2 to 100; we find that as d increases the performance also gets better, but for d > 50 the improvements are negligible, so we use d = 50 to keep the algorithm efficient.) We use the authors' implementation.

RHOMP is our proposed method in this paper. We use an initial step size of γ0 = 1, and set ϵ = 10^-5 as the algorithm termination criterion, stopping when the relative improvement in log likelihood falls below this point. For the hyperparameter α we use n = 15 Chebyshev nodes for the interpolation.

The datasets are randomly split into a training set (60%) and a testing set (40%), keeping whole trails together. For each dataset we conduct experiments over 5 random repetitions and present the average results. For evaluation we use precision over the top k outputs to measure the accuracy of each method. It is calculated over all individual transitions in the testing set as

Precision_k = (# true transitions within top-k algorithmic results) / (# total transitions).

Besides precision, which measures the accuracy of the top outputs from the algorithms, we also provide results on Mean Reciprocal Rank (MRR). The reciprocal rank of an output is the inverse of the rank of the ground-truth answer, and MRR measures the overall ranking quality compared to the ground truth. For both measures, we want large scores close to 1.

5.2 General Results

First we compare our RHOMP (m = 2) with the other baseline methods in terms of precision and MRR score.

MRR score. Table 2 depicts the results on the MRR score. On all datasets, RHOMP has the highest score. From the table we see that MC1 outperforms the LME method. LME has the advantage of embedding the states into Euclidean space for applications like visualization or clustering; however, the embedding can cause information loss and thus make the prediction less accurate. We also notice that MC2 has the lowest scores in many cases (i.e., the Beer Advocate, Flickr, and FourSQ datasets), and its MRR scores drop compared to MC1. The Kneser-Ney smoothing modification makes the MC2 estimate more robust, and in most cases it outperforms MC1, although this advantage is limited compared to that of our RHOMP method. The PITF method is also not competitive.

Precision score. Figure 2 shows the algorithms' performance in terms of relative precision. Many of the observations from Table 2 on the MRR score also apply here. In addition, we find that MC2 is often able to provide one accurate output, so its relative precision (k = 1) is actually quite good in most cases. However, as k increases, the relative precision drops rapidly because MC2 is not able to generate a few more reliable outputs. This limits the application of MC2 because, in the task of recommendation, it is important for the algorithm to generate a few candidate states instead of one unique candidate. Another observation is that the results of PITF over different trials are often more volatile because of its underlying stochastic gradient descent solver. We also find that for some datasets (e.g., Beer Advocate and FourSQ) the relative precision of our RHOMP decreases as k increases. The reason is that as k increases, the prediction task itself becomes easier, so it is hard to maintain the same advantage (i.e., a constant relative precision). The same reasoning explains why inferior methods like LME and PITF can catch up as k increases.

Table 3: Algorithm runtime (in minutes) for the three large datasets in terms of training time (left) and testing time (right). The experiments are run on a single core of a 2.5GHz Xeon CPU. Both MC1 and MC2 ran in under a minute.

             Kneser1   Kneser2   PITF         LME         RHOMP
LastFM       2 / 4     3 / 75    493 / 1980   3188 / 57   52 / 2
BrightKite   <1 / 1    <1 / 4    236 / 71     1153 / 22   3 / 1
Flickr       <1 / 1    1 / 8     168 / 97     764 / 11    6 / 1

Algorithm Runtime.
Table 3 shows the runtime for each method. The RHOMP approach takes slightly more time to train than the Kneser-Ney methods, but has faster prediction and testing. It is slower than the pure MC methods, but much faster than PITF and LME.

Figure 3: State-wise precision (k = 3) comparison of MC1 vs. MC2 vs. RHOMP (left figure) and Kneser1 vs. Kneser2 vs. RHOMP (right figure) on the Flickr dataset. Each marker represents the average precision over a group of states. The curves are fit from the scatter points based on Locally Weighted Scatterplot Smoothing (LOWESS). [Figure omitted: two panels (x-axis: counts of states, from 20000 down to 40; y-axis: precision at k = 3).]

5.3 Analysis on Overfitting

One of the reasons we propose the RHOMP method is to improve on the higher-order Markov chain method with respect to overfitting. In this section we analyze the results in detail and explain the performance of the different methods. First we show the comparison between training and testing performance in Table 4. We present the result using precision with k = 3, as it is representative of the remaining results. Both PITF and LME have the least overfitting effect, as their testing and training precisions are very close; however, their testing precisions are also low. The training precision of MC2 is the highest for all datasets, but it is often more than 10 times the corresponding testing precision, so MC2 is a highly overfitting method. Kneser2 also has comparatively high training precision since it is a second-order method and tends to fit the training data well, but its performance on the testing set is better than MC2 as it uses lower-order information to smooth the output. The methods MC1, Kneser1, and RHOMP have a good training and testing balance, and among them, our RHOMP has superior testing performance.
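The train-versus-test comparison above can be reproduced in miniature. The sketch below uses hypothetical toy trails, not the paper's data or code: it fits a first-order chain by maximum likelihood and measures precision@k on both splits.

```python
from collections import Counter, defaultdict

def fit_mc1(trails):
    """Maximum-likelihood first-order chain: empirical counts of prev -> next."""
    counts = defaultdict(Counter)
    for trail in trails:
        for prev, nxt in zip(trail, trail[1:]):
            counts[prev][nxt] += 1
    return counts

def precision_at_k(counts, trails, k):
    """Fraction of transitions whose true next state is among the top-k predictions."""
    hit = total = 0
    for trail in trails:
        for prev, nxt in zip(trail, trail[1:]):
            topk = [s for s, _ in counts[prev].most_common(k)]
            hit += nxt in topk
            total += 1
    return hit / total if total else 0.0

train = [["a", "b", "c", "b", "c"], ["a", "b", "a"]]
test = [["b", "c", "a"]]
model = fit_mc1(train)
print(precision_at_k(model, train, 1))  # 5/6: only the rarer b -> a transition is missed
print(precision_at_k(model, test, 1))   # 0.5: the unseen transition c -> a is missed
```

Even this toy model scores higher on its training trails than on held-out ones, which is the gap Table 4 quantifies at scale.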
Next we analyze the performance on individual states to help understand the behavior of the different algorithms. We sort all the states from high to low based on the total number of counts of each state in the training set. Our aim is to look at how testing accuracy correlates with these counts. Figure 3 shows the precision (k = 3) comparisons (i.e., MC1 vs. MC2 vs. RHOMP and Kneser1 vs. Kneser2 vs. RHOMP) on the Flickr dataset based on the counts of the states. We aggregate small sets of states based on their counts into baskets of at least 1000 transitions and 5 states. We find that all methods show precision drops when predicting infrequent states, with MC2 being affected most. Here, RHOMP does the best out of all methods, which reflects its ability to avoid overfitting.

5.4 Analysis on Higher-order Approaches

In the previous sections, we analyzed the results for first- and second-order approaches. Now we study the behavior as the order varies. Figure 4 shows the change in performance as the order increases for the three frameworks: MC, Kneser-Ney smoothing, and RHOMP. For the cases when the history length is smaller than the order, we use the approach with the correct order to generate the prediction. For the MC framework, higher-order approaches make the prediction less accurate. This occurs because these methods overfit the training data, and there are more ways to overfit for a higher-order chain. For the Kneser-Ney smoothing approaches, in most cases (except the Beer Advocate dataset) there are improvements moving from first order to second order; however, the improvements are slight. For order > 2, there are usually either no clear improvements or small performance dips. The reason is that as the order increases, the higher-order transitions become very sparse, and the model can easily encounter an unseen higher-order state. In this case the algorithm will frequently seek the prediction from a lower-order approach.
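The lower-order fallback described above can be sketched with a simple backoff rule. This is a deliberate simplification of interpolated Kneser-Ney, which blends orders with discounted weights rather than backing off only on unseen histories; the function names and toy trails are illustrative.

```python
from collections import Counter, defaultdict

def fit_counts(trails, order):
    """Count next-state occurrences after each length-`order` history."""
    counts = defaultdict(Counter)
    for trail in trails:
        for i in range(order, len(trail)):
            counts[tuple(trail[i - order:i])][trail[i]] += 1
    return counts

def predict_with_backoff(models, history, k):
    """Try the highest-order model first; back off when the history is unseen."""
    for order in sorted(models, reverse=True):
        context = tuple(history[-order:])
        if context in models[order]:
            return [s for s, _ in models[order][context].most_common(k)]
    return []  # no model has seen any suffix of this history

trails = [["a", "b", "c"], ["a", "b", "c"], ["x", "b", "d"], ["x", "b", "d"], ["x", "b", "d"]]
models = {order: fit_counts(trails, order) for order in (1, 2)}
print(predict_with_backoff(models, ["a", "b"], 1))  # ['c']: second-order context ("a","b") was seen
print(predict_with_backoff(models, ["z", "b"], 1))  # ['d']: unseen pair backs off to first order
```

As the order grows, more histories fall through to the lower-order model, which is why the higher-order Kneser-Ney variants gain so little in Figure 4.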
For the RHOMP framework, there are improvements on each dataset when moving from MC1 to RHOMP with order = 2, and for order > 3 the results further improve. Compared to the MC and Kneser-Ney smoothing frameworks, RHOMP is more robust in terms of not decreasing the precision as the order increases, with the exception of the BrightKite dataset. In BrightKite, the average trail length is around 3, so there is insufficient information to train higher-order models, and we lack the lower-order fallback of Kneser-Ney.

6 SUMMARY & FUTURE WORK

In this paper we study the problem of modeling user trails, which encode useful information for downstream applications in user experience, recommendation, and advertising. We propose a new class of structured higher-order Markov chains which we call the retrospective higher-order Markov process (RHOMP). This model preserves the higher-order nature of user trails without the risk of overfitting the data. A RHOMP can be estimated from data via a projected gradient descent algorithm we propose for maximum likelihood estimation (MLE). In the experiments, we find that RHOMP is superior in terms of precision and mean reciprocal rank compared to the other methods. RHOMP is also robust for higher-order chains when there is enough data available.

There are several directions in which to extend this work. First, it would be interesting to explore other forms of retrospection that allow more interaction between the history states. (Note that the current approach in this paper selects a single state during the retrospective process.) This would allow modeling the case when certain combined history states carry strong evidence about transition patterns. Second, it would also be useful to extend this framework with personalization. This can be achieved by a tensor factorization approach or a collaborative filtering method.
Lastly, we would like to embed time information into our prediction, either by modeling the event time directly or by using it as side information to generate a non-stationary process where the random walk behavior can change over time.

Acknowledgements. This work was supported by NSF IIS-1422918, CAREER award CCF-1149756, the Center for Science of Information STC, CCF-093937; DOE award DE-SC0014543; and the DARPA SIMPLEX program.

Table 4: Precision (k = 3) results for the testing set (the left number) vs. the training set (the right number) that we use to estimate overfitting. Bold denotes the highest testing result. We judge the overfitting effects as {MC2, Kneser2} ≫ {MC1, Kneser1, RHOMP} > {LME, PITF}. But LME and PITF have poor test performance.

                MC1            MC2            Kneser1        Kneser2        PITF           LME            RHOMP
LastFM          0.092 / 0.216  0.087 / 0.961  0.068 / 0.109  0.094 / 0.792  0.055 / 0.061  0.060 / 0.083  0.108 / 0.218
Beer Advocate   0.082 / 0.115  0.067 / 0.777  0.071 / 0.074  0.066 / 0.490  0.058 / 0.059  0.056 / 0.600  0.085 / 0.109
BrightKite      0.654 / 0.782  0.606 / 0.940  0.636 / 0.729  0.669 / 0.868  0.522 / 0.575  0.610 / 0.665  0.690 / 0.796
Flickr          0.428 / 0.496  0.374 / 0.832  0.399 / 0.440  0.432 / 0.710  0.346 / 0.359  0.384 / 0.401  0.477 / 0.530
FourSQ          0.145 / 0.199  0.133 / 0.778  0.147 / 0.174  0.155 / 0.524  0.119 / 0.126  0.104 / 0.137  0.188 / 0.241

Figure 4: Relative precision (k = 3) vs. the order of the methods: MC, Kneser-Ney smoothing, and RHOMP. The relative precision is the precision ratio to that from MC1 on the corresponding dataset. Note that the y-axis may not be scaled linearly, to make the figures clearer. [Figure omitted: five panels (x-axis: order of the methods, 1-6; y-axis: relative precision), one per dataset.]

REFERENCES
[1] M. A. Awad and I. Khalil. Prediction of user's web-browsing behavior: Application of Markov model. IEEE T. Syst. Man Cy. B, 42(4):1131–1142, 2012.
[2] A. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
[3] A. R. Benson, D. F. Gleich, and J. Leskovec. Tensor spectral clustering for partitioning higher-order network structures. In SDM, pages 118–126, 2015.
[4] A. R. Benson, D. F. Gleich, and L.-H. Lim. The spacey random walk: A stochastic process for higher-order data. SIAM Rev., to appear, 2017.
[5] J. Borges and M. Levene. Evaluating variable-length Markov chain models for analysis of user web navigation sessions. IEEE T. Knowl. Data En., 19(4), 2007.
[6] P. Bühlmann, A. J. Wyner, et al. Variable length Markov chains. The Annals of Statistics, 27(2):480–513, 1999.
[7] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery, 7(4):399–424, 2003.
[8] Ò. Celma Herrada. Music recommendation and discovery in the long tail. 2009.
[9] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In KDD, pages 714–722, 2012.
[10] S. Chen, J. Xu, and T. Joachims. Multi-space probabilistic sequence modeling. In KDD, pages 865–873, 2013.
[11] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310–318, 1996.
[12] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are web users really Markovian? In WWW, pages 609–618, 2012.
[13] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In KDD, pages 1082–1090, 2011.
[14] M. Deshpande and G. Karypis. Selective Markov models for predicting web page accesses. ACM T. Internet Techno., 4(2):163–184, 2004.
[15] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
[16] S. R. Eddy. Hidden Markov models. Current Opinion in Structural Biology, 6(3):361–365, 1996.
[17] F. Figueiredo, B. Ribeiro, J. M. Almeida, and C. Faloutsos. TribeFlow: Mining & predicting user trajectories. In WWW, pages 695–706, 2016.
[18] R. Gupta, R. Kumar, and S. Vassilvitskii. On mixtures of Markov chains. In NIPS, pages 3441–3449, 2016.
[19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
[20] R. Kumar, M. Raghu, T. Sarlós, and A. Tomkins. Linear additive Markov processes. In WWW, pages 411–419, 2017.
[21] A. Markov. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. 1971.
[22] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW, pages 897–908, 2013.
[23] S. Melnyk, O. Usatenko, and V. Yampol'skii. Memory functions of the additive Markov chains: applications to complex dynamic systems. Physica A: Statistical Mechanics and its Applications, 361(2):405–415, 2006.
[24] Y. Peres and P. Shields. Two new Markov order estimators. arXiv preprint math/0506080, 2005.
[25] P. L. Pirolli and J. E. Pitkow. Distributions of surfers' paths through the world wide web: Empirical characterizations. World Wide Web, 2(1-2):29–45, 1999.
[26] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, pages 811–820, 2010.
[27] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81–90, 2010.
[28] D. Ron, Y. Singer, and N. Tishby. Learning probabilistic automata with variable memory length. In Proceedings of the Seventh Annual Conference on Computational Learning Theory, pages 35–46. ACM, 1994.
[29] M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature Communications, 5, 2014.
[30] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887. ACM, 2008.
[31] Y. Shi, M. Larson, and A. Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.
[32] P. Singer, D. Helic, A. Hotho, and M. Strohmaier. HypTrails: A Bayesian approach for comparing hypotheses about human trails on the web. In WWW, pages 1003–1013, 2015.
[33] H. M. Taylor and S. Karlin. An Introduction to Stochastic Modeling. Academic Press, 2014.
[34] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 1(8), 2015.
[35] O. Usatenko. Random Finite-Valued Dynamical Systems: Additive Markov Chain Approach. Cambridge Scientific Publishers, 2009.
[36] T. Wu, A. R. Benson, and D. F. Gleich. General tensor spectral co-clustering for higher-order data. In NIPS, pages 2559–2567, 2016.
[37] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
[38] D. Yang, D. Zhang, Z. Yu, and Z. Yu. Fine-grained preference-aware location search leveraging crowdsourced digital footprints from LBSNs. In UbiComp, pages 479–488, 2013.
[39] J.-D. Zhang and C.-Y. Chow. Spatiotemporal sequential influence modeling for location recommendations: A gravity-based approach. ACM Transactions on Intelligent Systems and Technology (TIST), 7(1):11, 2015.
