Tracking Dynamic Point Processes on Networks


Authors: Eric C. Hall, Rebecca M. Willett

Abstract: Cascading chains of events are a salient feature of many real-world social, biological, and financial networks. In social networks, social reciprocity accounts for retaliations in gang interactions, proxy wars in nation-state conflicts, or Internet memes shared via social media. Neuron spikes stimulate or inhibit spike activity in other neurons. Stock market shocks can trigger a contagion of volatility throughout a financial network. In these and other examples, only individual events associated with network nodes are observed, usually without knowledge of the underlying dynamic relationships between nodes. This paper addresses the challenge of tracking how events within such networks stimulate or influence future events. The proposed approach is an online learning framework well-suited to streaming data, using a multivariate Hawkes point process model to encapsulate autoregressive features of observed events within the social network. Recent work on online learning in dynamic environments is leveraged not only to exploit the dynamics within the underlying network, but also to track the network structure as it evolves. Regret bounds and experimental results demonstrate that the proposed method performs nearly as well as an oracle or batch algorithm.

I. INTRODUCTION

In a variety of settings, we observe a series of events associated with a group of actors whose interactions trigger future events. The interactions between these actors can be modeled using a network. For example:

• Social networks: we observe events such as people meeting, corresponding, voting, or sharing information [1], [2], [3], [4], [5];
• Biological neural networks: spiking action potentials can trigger or inhibit spikes in neighboring neurons according to time-varying functional networks [6], [7], [8], [9], [10], [11], [12];
• Financial networks: stock market shocks can trigger jumps across the global network of financial instruments and indices [13], [14], [15];
• Epidemiological networks: as a contagion spreads through a community, observations of symptoms in one person are strong predictors of future symptoms among that person's neighbors [16]; and
• Seismological networks: substantial seismic activity is often preceded by foreshocks and followed by aftershocks, with the epicenter of these shock events determined by the local geography and plate tectonics [17], [18], [19].

In all of the above settings, the interactions between actors are critical to a fundamental understanding of the underlying network structure and to accurate predictions of likely future events.

E. C. Hall is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA (e-mail: eric.hall87@gmail.com). R. M. Willett is with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA (e-mail: willett@discovery.wisc.edu). This paper was presented in part at the 7th International IEEE EMBS Neural Engineering Conference (2015) and appears in IEEE Transactions on Signal Processing, Vol. 62, No. 7. We gratefully acknowledge the support of the awards AFOSR FA9550-11-1-0028, NSF CCF-1418976, ARO W911NF-09-1-0262, and AFOSR 14-AFOSR-1103.

We can model these interactions between nodes using a network or graph, where directed edge weights correspond to the degree to which one node's activity stimulates activity in another node. For instance, the network structure may indicate who is influencing whom within a social network, or the connectivity of neurons. In these and other contexts the underlying network structure may be changing over time, for instance as people's relationships evolve or as a function of the activity in which the brain is engaged.
In many cases, we are interested both in the rates at which different nodes or actors participate in events and in the underlying network structure. Our goal is to filter and track such processes. We present methods and associated theoretical performance bounds in two settings: (a) where the underlying network structure is known and (b) where the underlying network structure is unknown. Our approach incorporates concepts and tools from multivariate Hawkes models of point processes [20], [21], [22] and online convex optimization methods for dynamic environments [23], [24]. In particular, the multivariate Hawkes process is akin to an autoregressive model for point processes, where events up to time t dictate the rate at which events are anticipated after time t. Estimating the parameters associated with these processes is the subject of much current research, but existing methods typically assume that the underlying network parameters are static rather than changing with time, and require computationally prohibitive batch processing algorithms. Specifically, there has been substantial work on estimating the parameters of the system through methods which first seek to estimate the "parent event" of each event, and then use this information to learn the parameters of the influence function and/or the network [25], using either EM-type algorithms or Bayesian techniques [19], [26], [27], [28], [29]. The difficulty with using this approach in the online setting is that accurately estimating parent events requires a potentially large buffer of stored event times, and processing that scales poorly with the number of previously observed events. The work closest to ours is [30], which uses a Bayesian framework to learn the parameters of a discretized version of the Hawkes process and is computationally more efficient with respect to the number of events observed.
However, that method still requires at least mini-batches and access to the data more than once, so it is not a truly streaming approach. Additionally, all of these methods require the data to exactly follow the Hawkes model, and have no performance guarantees in the case of a misspecified influence function or generative model, whereas our results, both theoretical and empirical, offer protection against such model mismatch. In the cascading point process described above, we face several key challenges: (a) the underlying networks are dynamic, (b) we receive either a large volume of data or data that is streaming at high velocity, necessitating sequential processing, and (c) we seek performance guarantees that are robust to model mismatch (i.e., that hold even when the data was not truly generated by the Hawkes model). Our proposed method simultaneously tracks the time-varying rates at which events are expected and the underlying time-varying network structure. In contrast, most methods assume that the rates are a known, closed-form function of the observed data and the network structure, and focus solely on estimating the network; we will see that this approach is more fragile with respect to modeling errors. Additionally, the streaming nature of our algorithms places us in a regime where forecasting is easy, which is valuable for financial, epidemiological, seismological, and other networks. Our algorithms produce an estimate of the process rates at every time before seeing the actual events at that time; the online framework therefore allows one-step-ahead forecasting. This would be much harder to do with previous methods for learning networks in Hawkes processes, because prediction would require processing all of the previous data, which is computationally intensive, and then projecting it forward to new time points.
These methods would either have to be run at every time point, which is computationally infeasible, or would only be run a few times and would not be able to track short-term changes in the network.

The remainder of the paper is organized as follows. Section II introduces the basics of the Hawkes process and a mathematical description of our learning objective. Section III covers the basics of online learning with dynamics in a general setting with generic loss functions. Section IV then describes the time-discretized loss function and dynamical model corresponding to the Hawkes process, which are needed for our online learning framework. Section V introduces our two proposed online algorithms for tracking these processes, one which assumes the network is known and one which attempts to learn both the time-varying rates and the network simultaneously. Section VI briefly discusses the computational complexity of the methods. Finally, Section VII shows how our methods perform in practice on synthetic data, both when the generative model is known and when it is misspecified, as well as in experiments on the Memetracker data set. Proofs and a notation legend appear in the appendix.

II. PROBLEM FORMULATION

We monitor p actors in a network, and record the identity of the actor and the time of each event. An actor and event may represent a person "liking" a photo or article shared by another person in a social network, a neuron firing in the brain, or the incidence of disease. That is, we observe a time series of the form {(k_n, τ_n)}_n, where k_n ∈ {1, 2, ..., p} is the actor involved in the n-th event and τ_n ∈ R_+ is the time at which it occurs. With each new event, we wish to accurately predict which events are most likely in the immediate future, as well as the underlying network of influence. We define τ_0 ≜ 0. We wish to track the time-varying likelihood of each of the p actors acting.
To do so, we adopt a multivariate Hawkes process model [20], [21], [22] and track the parameters of this model over time. In particular, for each actor k we have a point process with time-varying rate function μ_k(τ). Let N_{k,τ} denote the number of recorded events for actor k up to and including time τ, and let N_τ ≜ Σ_{k=1}^p N_{k,τ} denote the total number of events (across all actors) up to and including time τ. The likelihood of actor k participating in an event between times τ_1 and τ_2 is controlled by the integral ∫_{τ_1}^{τ_2} μ_k(τ) dτ. More formally, the collection of all observed events up to time T can be denoted H_T ≜ {N_{k,τ}}_{k ∈ {1,...,p}, τ ∈ (0,T]}. The log likelihood of observing H_T given the p rate functions μ_k(τ) for k ∈ {1, ..., p} is then [31]

\[ \log p(H_T \mid \mu) = \sum_{n=1}^{N_T} \log \mu_{k_n}(\tau_n) - \sum_{k=1}^{p} \int_0^T \mu_k(\tau)\, d\tau. \tag{1} \]

Thus far, everything explained is common to a wide class of point processes. The multivariate Hawkes processes considered in this paper are essentially autoregressive point processes, where each rate function μ_k(τ) depends on the history of past events, H_τ. In particular, a multivariate Hawkes process assumes the rate functions can each be expressed as

\[ \mu_k(\tau) = \bar{\mu}_k + \sum_{n=1}^{N_\tau} h_{k,k_n}(\tau - \tau_n). \tag{2} \]

Here μ̄_k is a baseline rate representing the nonzero likelihood of actor k acting even without having been influenced by any previous actions, with μ̄ ≜ [μ̄_1, ..., μ̄_p]^⊤. Furthermore, we have p² functions of the form h_{k_1,k_2}(τ) which describe how events associated with actor k_2 will impact the likelihood of events associated with actor k_1. To ensure causality, we assume h_{k_1,k_2}(τ) = 0 if τ ≤ 0 for all k_1, k_2 ∈ {1, ..., p}. These functions depend on the underlying network connectivity; if actors k_1 and k_2 are unconnected, the corresponding function h_{k_1,k_2} should be identically zero for all τ.
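As a minimal numerical sketch of the rate in Equation (2), the following computes μ_k(τ) for every actor under an illustrative exponentially decaying influence h_{k_1,k_2}(τ) = W_{k_1,k_2} α^τ; the function name, parameter values, and two-actor network are hypothetical, not from the paper.

```python
import numpy as np

def hawkes_rate(tau, events, mu_bar, W, alpha=0.5):
    """Rate mu_k(tau) for every actor k, per Eq. (2):
    mu_k(tau) = mu_bar_k + sum_n W[k, k_n] * h(tau - tau_n),
    with the illustrative influence h(t) = alpha**t for t > 0."""
    rate = mu_bar.copy()
    for k_n, tau_n in events:          # events observed strictly before tau
        dt = tau - tau_n
        if dt > 0:                     # causality: h vanishes for dt <= 0
            rate += W[:, k_n] * alpha ** dt
    return rate

# Two actors: actor 0 strongly excites actor 1, no other edges.
mu_bar = np.array([0.1, 0.1])
W = np.array([[0.0, 0.0],
              [0.8, 0.0]])
events = [(0, 1.0)]                    # actor 0 acted at time 1.0
rate = hawkes_rate(2.0, events, mu_bar, W, alpha=0.5)
print(rate)                            # actor 1's rate is lifted by 0.8 * 0.5**1 = 0.4
```

With no incoming edge, actor 0's rate stays at its baseline, while actor 1's rate is elevated by the decaying influence of actor 0's event.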
One of the main challenges in statistical estimation for multivariate Hawkes processes is the estimation of these p² functions. In general, this problem is highly underdetermined and challenging. Recent work has attempted to mitigate these challenges using low-rank and sparse models [31], [32], [33]. In this paper, we make the common (cf. [5], [34]) simplifying assumption that these interactions all have the same functional form but different (and often zero-valued) amplitudes, so that

\[ h_{k_1,k_2}(\tau) = W_{k_1,k_2}\, h(\tau), \tag{3} \]

where h(τ) is known but the amplitude matrix W = [W_{k_1,k_2}]_{k_1,k_2 ∈ {1,...,p}} may be unknown. We will refer to h(τ) as the influence function, as it depicts how an action's influence on an actor varies in time. The matrix W indicates the strength of influence between actors; from a graph theory perspective, W acts like the weighted adjacency matrix of a graph representation of the network. Our goal is to obtain an estimate of μ(τ) as it evolves and to infer W online from streaming network data. Furthermore, we seek methods with performance guarantees that hold even when the observed data is not generated strictly in accordance with the above Hawkes model. That is, while we use the Hawkes model to measure how well estimates fit the data, we recognize that the model will never be perfectly accurate (e.g., we may have errors in our estimate of the influence function h(τ), or the linear model in (2) may not reflect nonlinearities present in real data) and wish for performance guarantees even in the face of these uncertainties. The proposed method is an application of online optimization in dynamic environments, which requires a loss function and a dynamical model. At the highest level, the method takes a current estimate of the rate and then slightly adjusts it based on the most recently observed data.
In classical online learning settings, this innovation step is based solely on the gradient of the chosen loss function, which here will be related to the negative log likelihood of the Hawkes process. Our approach adds a second main step to the algorithm: updating the rate by incorporating the Hawkes dynamical model, under which certain nodes in the system stimulate actions from other nodes.

III. ONLINE LEARNING

As described above, we wish to estimate the rate functions μ_k(τ) for k = 1, ..., p, and the corresponding likelihood of future events, based solely on previous events and the (possibly learned) network structure. In this section, we describe several key ideas from the field of online learning which we will leverage in our problem. First we describe the traditional online learning paradigm; then we describe methods which incorporate dynamical models into the learning process, allowing one to adapt to a time-varying environment.

A. Online Learning in non-Dynamic Environments

Online learning techniques are generally based on the following paradigm: at every time point t we make a prediction, receive some data, and then perform a few computationally inexpensive calculations to improve our previous prediction. In the setting of autoregressive event tracking, this means we would have an estimate of each actor's likelihood of acting and then see who does act. Using the previous prediction, the current action, and information about the network itself, we update our belief about who is most likely to act next. Unlike in traditional online learning settings, there are strong dynamics involved in the evolution of the system that must be incorporated. More formally, a generic version of an online method proceeds as follows. We let X denote the domain of our observations, and Λ denote a bounded, closed, convex feasible set of the parameter of interest.
Given sequentially arriving observations x ∈ X^∞, we wish to construct a sequence of predictions λ̂ = (λ̂_1, λ̂_2, ...) ∈ Λ^∞, where λ̂_t may depend only on the currently available observations x^{t−1} = (x_1, ..., x_{t−1}). The problem is posed as a repeated sequence of predictions given by a Learner and the truth being revealed by an oblivious (non-adaptive) Environment. At time t, the Learner computes a prediction λ̂_t, and the Environment generates the observation x_t. The Learner then experiences the loss ℓ_t(λ̂_t), where ℓ_t(·) is a convex cost function measuring the accuracy of the prediction λ̂_t with respect to the data x_t. The task facing the Learner is to create a new prediction λ̂_{t+1} based on the previous prediction and the new observation, with the goal of minimizing loss at the next time step. We characterize the efficacy of λ̂^T ≜ (λ̂_1, λ̂_2, ..., λ̂_T) ∈ Λ^T relative to a comparator sequence λ^T ≜ (λ_1, λ_2, ..., λ_T) ∈ Λ^T as follows:

Definition 1 (Regret). The regret of λ̂^T with respect to a comparator λ^T ∈ Λ^T is

\[ R_T(\lambda^T) \triangleq \sum_{t=1}^{T} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T} \ell_t(\lambda_t). \]

The comparator sequence can be thought of as the predictions of either an oracle with knowledge of future data or a batch algorithm with access to all the data. The regret therefore characterizes the excess loss suffered by the online algorithm. Previous work proposed algorithms which yield regret of O(√T) for static comparators λ^T, where λ_t = λ for all t (cf. [35], [36], [37]). Essentially, these methods can only perform well if the comparator is a single point or changes either very slowly or very infrequently. It is this characteristic that makes most existing methods poorly suited to the autoregressive nature of interactions within a network.

B.
Online Learning in Dynamic Environments

More recent work has explored the impact of dynamical models within the context of online learning (cf. [38], [23]). In particular, the Dynamic Mirror Descent method proposed in [23] incorporates a known dynamical model into online learning, leading to significant improvements in performance in dynamic environments. In the context of multivariate Hawkes processes, a known dynamical model amounts to knowing the exact weighted adjacency structure of the network. In many practical contexts the network structure may be unavailable and will need to be estimated simultaneously with the rates; this is discussed further in Section V-B. Before defining an optimization routine specifically for multivariate Hawkes data, we briefly describe a simplified version of the Dynamic Mirror Descent (DMD) method [24]. Let Φ_t : Λ × W → Λ denote a sequence of known dynamical models that takes as input a value in our decision space and some side information, and set

\[ \tilde{\lambda}_{t+1} = \mathrm{proj}_\Lambda\big( \hat{\lambda}_t - \eta_t \nabla \ell_t(\hat{\lambda}_t) \big) \tag{4a} \]
\[ \hat{\lambda}_{t+1} = \Phi_t\big( \tilde{\lambda}_{t+1}, W_t \big) \tag{4b} \]

where η_t is a step size parameter which controls how far we step in the direction of the new data. By including Φ_t in the process, we effectively search for a predictor which (a) attempts to minimize the loss and (b) is close to λ̃_{t+1} under the transformation of Φ_t with side information W. In our setting, W corresponds to the known or estimated values of the network. This is similar to a stochastic filter, which alternates between using a dynamical model to update the "state" and using this state to perform the filtering action. However, we make no assumptions about Φ_t's relationship to the true underlying parameters. It has been shown, under mild conditions on the sequence {Φ_t}_{t>0}, that the regret of this algorithm obeys

\[ R_T(\lambda^T) \le C \sqrt{T} \left( 1 + \sum_{t=1}^{T-1} \big\| \lambda_{t+1} - \Phi_t(\lambda_t, W) \big\| \right) \]
for some C > 0 independent of T. This bound scales with the comparator sequence's deviation from the sequence of dynamical models {Φ_t}_{t>0}, in stark contrast to previous tracking regret bounds, which are sublinear in T only for comparators that change slowly with time or at a small number of distinct time instances. In order to use this framework to learn the rates and network of a Hawkes process, we need to derive two key ingredients: the loss function and the dynamical model to be used in Equations (4a) and (4b). These ingredients take us from the general setup presented in this section to the specific application being studied, and are derived in the next section. Once they have been derived, we can use and expand upon the existing theory for online learning, and finally present a method to learn the rates and the underlying network simultaneously.

IV. LOSS FUNCTION AND DYNAMIC MODEL

In order to analyze and make estimates from our point process network data, we use the Hawkes model described in Section II to define a loss function, dynamical models, and the other ingredients of the online learning framework described in Section III.

A. Time discretized loss function

We discretize time into intervals of length δ > 0, where δ is small enough that it is very infrequent for the same actor to act multiple times in the same time window. (For simplicity, we assume the total sensing time T is selected such that T/δ is an integer.) We let t = 1, 2, ..., T/δ index these intervals, and note that N_{k,tδ} is the number of events observed in the k-th process (i.e., by the k-th actor) up to the end of the t-th interval, ((t−1)δ, tδ], with N_{k,0} ≜ 0. The value x_{t,k} = N_{k,tδ} − N_{k,(t−1)δ} denotes how many times actor k acted during the t-th interval, which will mostly be either zero or one for an appropriately chosen δ.
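The binning of the event stream into count vectors x_t can be sketched as follows; the function name and the example values are illustrative, not from the paper.

```python
import numpy as np

def bin_events(events, p, delta, T):
    """Bin the event stream {(k_n, tau_n)} into counts x[t, k], where
    x[t, k] is the number of actor k's events in ((t-1)*delta, t*delta]."""
    n_bins = int(round(T / delta))
    x = np.zeros((n_bins, p), dtype=int)
    for k_n, tau_n in events:
        t = int(np.ceil(tau_n / delta))   # 1-based interval index
        x[t - 1, k_n] += 1
    return x

events = [(0, 0.3), (1, 0.7), (0, 1.9)]   # (actor, time) pairs
x = bin_events(events, p=2, delta=1.0, T=2.0)
print(x)   # bin 1 holds both early events; bin 2 holds actor 0's late event
```

Using the half-open interval ((t−1)δ, tδ] via the ceiling keeps events that land exactly on a bin boundary in the earlier bin, matching the text's convention.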
The vector x_t ≜ [x_{t,1}, ..., x_{t,p}]^⊤ will be our data vector at each time point. Using the negative log likelihood of the Hawkes process up to time δt, we can formulate appropriate loss functions for the online setting. We introduce an approximation of the time-varying rate function in the Hawkes process, λ_t = [λ_{t,1}, ..., λ_{t,p}]^⊤ ∈ R^p_+. To do this we define a new set of times {τ̄_n}_n, which are the ends of the discrete time intervals in which the events occur; these times are defined by τ̄_n = ⌈τ_n/δ⌉δ. Here and for the rest of the paper, we denote the summation over a set of events {n : τ̄_n < δt} by simply writing that we sum over τ̄_n < δt. We set

\[ \lambda_{t,k} = \bar{\mu}_k + \sum_{\bar{\tau}_n < \delta t} W_{k,k_n} h(\delta t - \tau_n). \tag{5} \]

Equation (5) acts the same way as the original Hawkes process, but we do not immediately update the rates with the events as they occur; instead we push them to integer multiples of δ. Notice that although we wait until τ̄_n to include the event in the rate, we update it with full knowledge of when the event actually occurred, i.e., we use h(τ − τ_n) instead of h(τ − τ̄_n). Equation (5) suggests a loss function with analogous changes to the original, continuous-time log likelihood. We define L_T(μ) to be the negative log likelihood of the Hawkes process at time T, and L_T^{(δ)}(λ) to be its discrete-time equivalent:

\[ L_T(\mu) \triangleq \sum_{k=1}^{p} \int_0^T \mu_k(\tau)\, d\tau - \sum_{n=1}^{N_T} \log \mu_{k_n}(\tau_n) \approx \sum_{k=1}^{p} \left[ \sum_{t=1}^{T/\delta} \delta \lambda_{t,k} - \sum_{t=1}^{T/\delta} x_{t,k} \log \lambda_{t,k} \right] \triangleq L_T^{(\delta)}(\lambda). \tag{6} \]

This new loss function is based on replacing the integral term with a summation and replacing μ_{k_n}(τ_n) with λ_{k_n, τ̄_n/δ}. Both of these substitutions become closer to the truth as δ → 0. In Lemma 1 we characterize the difference between the two functions.

Lemma 1.
Given the influence function h(τ) = α^τ 1[τ > 0] and data with a maximum activity rate of x_max events per actor per unit time, consider the negative log likelihood of the true Hawkes process, given in Equation (1), and the approximate negative log likelihood for the discrete-time rate, given in Equation (6), both generated by the same matrix W and vector μ̄ with all elements satisfying 0 ≤ W_{i,j} ≤ W_max and 0 ≤ μ_min ≤ μ̄_i < ∞, respectively. Then there exists a constant C > 0 depending on x_max, p, W_max, μ_min, and α such that

\[ \big| L_T(\mu) - L_T^{(\delta)}(\lambda) \big| \le C N_T \delta. \]

Remark: For a general influence function h(τ) which is Lipschitz on (0, T], a similar proof gives a slightly higher bound of C(T N_T δ + N_T log(1 + N_T δ)). As we focus mostly on h(τ) = α^τ 1[τ > 0], we show the bound for this specific function.

This lemma says that if δ is set small enough, the discrete approximation can be used in a learning algorithm without many errors arising from the discretization. However, the smaller δ is, the more frequently the rates must be updated, leading to a higher computational burden. The approximation thus justifies the proposed method using the instantaneous loss function

\[ \ell_t(\lambda_t) \triangleq -\langle \log(\delta \lambda_t), x_t \rangle + \langle \delta \lambda_t, \mathbf{1} \rangle. \tag{7} \]

Here and for the remainder of the paper, log and exp of a vector are taken element-wise. Notice that L_T^{(δ)}(λ) = Σ_{t=1}^{T/δ} (ℓ_t(λ_t) + ⟨x_t, 1⟩ log δ), and thus the total cumulative loss is the summation of the instantaneous losses plus a term which is independent of the rate estimate.

B. Dynamical models

In order to use the DMD framework, we model the autoregressive nature of the Hawkes process using a series of data-dependent dynamical models Φ_t, which update a rate parameter λ given a weighted adjacency matrix W and the previously observed data H_t.
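As a brief aside, the instantaneous loss of Equation (7) is inexpensive to evaluate; a minimal sketch with illustrative values follows.

```python
import numpy as np

def inst_loss(lam_t, x_t, delta):
    """Instantaneous loss of Eq. (7):
    l_t(lam_t) = -<log(delta*lam_t), x_t> + <delta*lam_t, 1>,
    the per-interval negative Poisson log likelihood of counts x_t
    under rates delta*lam_t (up to a term not depending on lam_t)."""
    return -np.dot(x_t, np.log(delta * lam_t)) + np.sum(delta * lam_t)

lam_t = np.array([2.0, 0.5])   # hypothetical rate estimates
x_t = np.array([1, 0])         # counts observed in the interval
delta = 0.1
loss = inst_loss(lam_t, x_t, delta)
print(loss)
```

Summing this quantity over t reproduces L_T^{(δ)} up to the rate-independent ⟨x_t, 1⟩ log δ terms noted in the text.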
We can model this dependence using the dynamical model Φ_t(λ, W) = A_t λ + W y_t + c_t, for λ, y_t, c_t ∈ R^p, A_t ∈ R^{p×p}, and W ∈ R^{p×p}_+. If we let A_t = βI for some β ∈ (0, 1), where I is the identity matrix, the dynamical model causes the rates in λ to decay at a rate depending on β in the absence of other effects. The term W y_t allows us to model autoregressive effects. In particular, the matrix W could correspond to a weighted adjacency matrix associated with the network of interest, and y_t could contain information about previous events, as specified below. More generally, we might replace the term W y_t with Σ_{r=0}^{m−1} W y_{t−r} for an m-th order process if we thought there should be some latency in the response times for pairs of actors.

Recall that the influence functions h(τ) describe how the causal influence between actors varies over time. Dynamical models for various forms of h(τ) can be developed for the time-discretized multivariate Hawkes process described in Equation (5). First, a general function h(τ) is considered, under some mild assumptions, and then is used to derive models for some specific choices of h(τ).

• General influence functions: We assume h(τ) is a continuous, non-negative function for τ > 0. Additionally, we assume there is a B such that h(τ) > 0 for 0 < τ ≤ B and h(τ) = 0 otherwise. Finally, let e_{k_n} ∈ R^p be a vector of all zeros, with a single 1 in the k_n-th entry indicating the actor involved in the n-th action.
Then we derive the following dynamical model:

\[
\begin{aligned}
\lambda_{t+1} &= \bar{\mu} + \sum_{\bar{\tau}_n < \delta(t+1)} W e_{k_n} h(\delta(t+1) - \tau_n) \\
&= \bar{\mu} + \sum_{\bar{\tau}_n < \delta t} a_{t,n} W e_{k_n} h(\delta t - \tau_n) + \sum_{\bar{\tau}_n = \delta t} W e_{k_n} h(\delta(t+1) - \tau_n) \\
&= \bar{\mu} + A_t \sum_{\bar{\tau}_n < \delta t} W e_{k_n} h(\delta t - \tau_n) + W y_t \\
&= A_t \lambda_t + W y_t + (I - A_t) \bar{\mu}.
\end{aligned}
\]

In the above we have used the following values of a_{t,n}, A_t, and y_t:

\[ a_{t,n} \triangleq \begin{cases} 1, & h(\delta t - \tau_n) = 0 \\ \dfrac{h(\delta(t+1) - \tau_n)}{h(\delta t - \tau_n)}, & \text{else} \end{cases} \qquad
A_{t,k} \triangleq \begin{cases} 1/2, & \text{if } \sum_{\bar{\tau}_n < \delta t} W_{k,k_n} h(\delta t - \tau_n) = 0 \\ \dfrac{\sum_{\bar{\tau}_n < \delta t} a_{t,n} W_{k,k_n} h(\delta t - \tau_n)}{\sum_{\bar{\tau}_n < \delta t} W_{k,k_n} h(\delta t - \tau_n)}, & \text{else} \end{cases} \]

\[ A_t = \mathrm{Diag}(A_{t,1}, A_{t,2}, ..., A_{t,p}), \qquad y_t \triangleq \sum_{\bar{\tau}_n = \delta t} e_{k_n} h(\delta(t+1) - \tau_n). \]

Thus, we have the dynamics in the desired form, λ_{t+1} = Φ_t(λ_t, W) = A_t λ_t + W y_t + (I − A_t) μ̄. Notice that in general A_t may be a function of W.

• Rectangular influence functions: Using the above framework, dynamical models can be worked out for the specific instance when h(τ) = 1[0 < τ ≤ B] with B > δ. Specializing the general expressions, a_{t,n} = 1 when h(δt − τ_n) = 0 and a_{t,n} = 1[0 < δ(t+1) − τ_n ≤ B] otherwise, so that A_{t,k} becomes the fraction of the total influence weight Σ_{τ̄_n<δt} W_{k,k_n} 1[0 < δt − τ_n ≤ B] contributed by events whose influence has not yet expired at time δ(t+1) (with A_{t,k} = 1/2 when that total weight is zero); in particular A_{t,k} = 1 when every event influencing actor k at time δt satisfies δ(t+1) − τ_n ≤ B.

• Exponential decay influence functions: Another natural choice is the exponential decay h(τ) = α^τ 1[τ > 0] for α ∈ (0, 1). We then have

\[
\begin{aligned}
\lambda_{t+1} &= \bar{\mu} + \sum_{\bar{\tau}_n < \delta(t+1)} W e_{k_n} \alpha^{\delta(t+1) - \tau_n} \\
&= \bar{\mu} + \alpha^{\delta} \sum_{\bar{\tau}_n < \delta t} W e_{k_n} \alpha^{\delta t - \tau_n} + \sum_{\bar{\tau}_n = \delta t} W e_{k_n} \alpha^{\delta(t+1) - \tau_n} \\
&= (1 - \alpha^{\delta}) \bar{\mu} + \alpha^{\delta} \lambda_t + W y_t,
\end{aligned}
\]

yielding the dynamical model Φ_t(λ, W) = α^δ λ + W y_t + (1 − α^δ) μ̄.

• Delayed exponential influence functions: The exponential decay may be a reasonable influence function; however, reactions might not always take place immediately. To model this we use h(τ) = α^{τ−D} 1[τ > D] for some positive delay D ≥ δ.
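Before the delayed case is treated, the non-delayed exponential recursion λ_{t+1} = α^δ λ_t + W y_t + (1 − α^δ) μ̄ can be checked numerically against the direct sum of Equation (5); all values below are synthetic, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, delta, alpha, T = 3, 0.5, 0.6, 20
mu_bar = np.full(p, 0.2)
W = rng.uniform(0, 0.5, size=(p, p))
events = sorted((float(rng.uniform(0, T * delta)), int(rng.integers(p)))
                for _ in range(15))                    # (tau_n, k_n) pairs

def direct_rate(t):
    """Eq. (5) with h(tau) = alpha**tau: sum over events with tau_bar_n < delta*t."""
    lam = mu_bar.copy()
    for tau_n, k_n in events:
        if np.ceil(tau_n / delta) * delta < delta * t:  # tau_bar_n < delta*t
            lam += W[:, k_n] * alpha ** (delta * t - tau_n)
    return lam

lam = mu_bar.copy()                        # lam_1 = mu_bar (no events yet)
for t in range(1, T):
    y_t = np.zeros(p)
    for tau_n, k_n in events:              # events with tau_bar_n == delta*t
        if np.ceil(tau_n / delta) * delta == delta * t:
            y_t += np.eye(p)[k_n] * alpha ** (delta * (t + 1) - tau_n)
    lam = alpha ** delta * lam + W @ y_t + (1 - alpha ** delta) * mu_bar
print(np.allclose(lam, direct_rate(T)))    # True: recursion matches Eq. (5)
```

The recursion touches each event exactly once (when it enters y_t), which is what makes the streaming updates of the following sections cheap.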
In this scenario, a dynamical model similar to the exponential-decay case can be derived, with a slight change in the additive W y_t term:

\[
\begin{aligned}
\lambda_{t+1} &= \bar{\mu} + \sum_{\bar{\tau}_n < \delta(t+1)} W e_{k_n} \alpha^{\delta(t+1) - \tau_n - D}\, 1[\delta(t+1) - \tau_n > D] \\
&= \bar{\mu} + \sum_{\tau_n < \delta(t+1) - D} W e_{k_n} \alpha^{\delta(t+1) - \tau_n - D} \\
&= \bar{\mu} + \alpha^{\delta} \sum_{\tau_n + D < \delta t} W e_{k_n} \alpha^{\delta t - \tau_n - D} + \sum_{\delta t \le \tau_n + D < \delta(t+1)} W e_{k_n} \alpha^{\delta(t+1) - \tau_n - D} \\
&= \alpha^{\delta} \lambda_t + W y'_t + (1 - \alpha^{\delta}) \bar{\mu}, \qquad
y'_t \triangleq \sum_{\delta t - D \le \tau_n < \delta(t+1) - D} e_{k_n} \alpha^{\delta(t+1) - \tau_n - D}.
\end{aligned}
\]

This takes the same basic form as the non-delayed exponential, but with a slightly different term y'_t instead of y_t. Also, notice that when D equals δ these equations become equivalent to the non-delayed version, suggesting that the time discretization creates estimation error on the order of a slight delay in the influence function.

In the general setting, the dynamical model is written in the form

\[ \Phi_t(\lambda, W) = A_t \lambda + W y_t + c_t \tag{8} \]

for some linear operator A_t, a vector y_t which is a known function of h and the previously observed data, and a known constant c_t. In the following, we assume a generic dynamical model of the form (8). The first method we present depends on the ease of computing the matrix A_t; if many observations must be held in memory to compute A_t, then the method could be quite slow. However, if the influence function is the exponential decay or the delayed exponential decay, then A_t is constant in time, so we only need to compute it once.

Algorithm 1 MV Hawkes Tracking - W Known
1: Initialize λ̂_1 = μ̄
2: for t = 1, ..., T/δ do
3:   Observe x_t and incur loss ℓ_t(λ̂_t) = ⟨1, δλ̂_t⟩ − ⟨x_t, log(δλ̂_t)⟩
4:   Set λ̃_{t+1} = proj_Λ(λ̂_t − η_t ∇ℓ_t(λ̂_t)) = proj_Λ(λ̂_t − η_t (δ1 − x_t/λ̂_t))
5:   Set λ̂_{t+1} = Φ_t(λ̃_{t+1}, W)
6: end for
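A compact sketch of Algorithm 1 for the exponential-decay dynamical model follows; the data stream is synthetic, and the step size, clipping bounds, and network values are illustrative choices, not from the paper.

```python
import numpy as np

def track(xs, ys, W, mu_bar, delta, alpha, eta, lo=1e-3, hi=1e3):
    """Algorithm 1 with Phi_t(lam, W) = alpha**delta * lam + W y_t
    + (1 - alpha**delta) * mu_bar, and proj_Lambda realized as clipping."""
    lam_hat = mu_bar.copy()                              # line 1
    preds, losses = [], []
    for x_t, y_t in zip(xs, ys):                         # line 2
        preds.append(lam_hat.copy())
        losses.append(np.sum(delta * lam_hat)            # line 3, Eq. (7)
                      - np.dot(x_t, np.log(delta * lam_hat)))
        grad = delta - x_t / lam_hat                     # gradient of Eq. (7)
        lam_tilde = np.clip(lam_hat - eta * grad, lo, hi)    # line 4
        lam_hat = (alpha ** delta * lam_tilde + W @ y_t
                   + (1 - alpha ** delta) * mu_bar)      # line 5
    return np.array(preds), np.array(losses)

rng = np.random.default_rng(2)
p, delta, alpha = 2, 0.1, 0.5
mu_bar = np.full(p, 1.0)
W = np.array([[0.0, 0.3],
              [0.6, 0.0]])
xs = rng.poisson(delta * 1.0, size=(50, p))          # synthetic count stream
ys = rng.uniform(0, 1, size=(50, p)) * (xs > 0)      # synthetic y_t vectors
preds, losses = track(xs, ys, W, mu_bar, delta, alpha, eta=0.05)
print(preds.shape)                                   # (50, 2)
```

Note that each prediction in `preds` is formed before the corresponding x_t is seen, which is what enables the one-step-ahead forecasting discussed in the introduction; setting `eta=0` reduces the method to direct calculation via Equation (5).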
Another important feature of both the exponential decay and delayed exponential decay functions is that the linear operator A_t does not depend on the values of the matrix W. It is this separation that will allow simultaneous estimation of the rates λ and the values of W.

V. PROPOSED ALGORITHMS

Our main contribution is to propose two algorithms, depending on whether or not the weighted adjacency matrix W is known, and we show relevant regret bounds for both.

A. Proposed algorithm - W Known

We first present Algorithm 1, a method for tracking the rate vector λ_t from streaming observations x_t for t = 1, 2, ..., T/δ. The basic idea is the following: at time t we start with the current rate estimate λ̂_t. We then observe x_t and incur the loss ℓ_t(λ̂_t). Based on this incurred loss, we update our previous prediction to obtain λ̃_{t+1}, which can be thought of as an a posteriori estimate of the rate at time t, given all the data up to and including time t. From here, we make our prediction of the rate at the next time step by applying our dynamical model, Φ_t. Algorithm 1 admits the following result, which bounds the excess error of the output sequence generated by the algorithm compared to any comparator sequence. The proof of Theorem 1 assumes that the decision space is Λ ≜ [λ_min, λ_max]^p for some λ_min > 0 and λ_max < ∞, and that there is a maximum number of times any actor can act per unit time, denoted x_max, so that every element of the data vector takes values in the range [0, δ x_max]. We also assume the sequence of dynamical models is contractive with respect to a given Bregman divergence. We say that a dynamical model Φ_t is contractive with respect to the Bregman divergence D_* on the set Λ if for any λ_1, λ_2 ∈ Λ we have

\[ D_*\big(\delta \Phi_t(\lambda_1, W)\,\big\|\,\delta \Phi_t(\lambda_2, W)\big) - D_*\big(\delta \lambda_1\,\big\|\,\delta \lambda_2\big) \le 0. \]
This condition ensures some amount of stability in the output sequence by preventing small estimation errors at any one time step from compounding as the algorithm continues. Lemma 2 gives sufficient conditions on the function $\Phi_t$ to ensure that it is contractive with respect to the needed Bregman divergence.

Lemma 2. If the dynamical model $\Phi_t(\lambda, W) = A_t \lambda + b_t$, where $A_t$ is a diagonal matrix with all elements in the range $[0, 1]$ for all $t$ and $b_t \succeq 0$, then $\Phi_t$ is contractive with respect to the Bregman divergence induced by the function $\langle \lambda, \log \lambda \rangle - \langle 1, \lambda \rangle$ on $\Lambda = [\lambda_{\min}, \lambda_{\max}]^p$.

All the dynamical models we have proposed satisfy the condition that $A_t$ is diagonal. Additionally, $b_t \succeq 0$ as long as all elements of $W$ and $\bar{\mu}$ are non-negative and $h(t) \ge 0$. The most restrictive assumption of this lemma is that the elements of $A_t$ are upper bounded by one, which is true if $h(t)$ is non-increasing after the initial impulse.

Theorem 1 (Tracking regret of Algorithm 1). Using a sequence of contractive dynamical models $\Phi_t(\lambda, W)$ for all $t$, if we choose $\eta_t$ proportional to either $1/\sqrt{t}$ or $1/\sqrt{T/\delta}$, then there exists a constant $C > 0$ depending on $\delta, p, x_{\max}, \lambda_{\max}$ and $\lambda_{\min}$ such that the regret of $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_{T/\delta}$ generated by Algorithm 1 for any data sequence $x_1, x_2, \ldots, x_{T/\delta} \in [0, \delta x_{\max}]^p$, with respect to a comparator sequence $\lambda_1, \ldots, \lambda_{T/\delta} \in [\lambda_{\min}, \lambda_{\max}]^p$, is bounded by:
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \ell_t(\lambda_t) \le C\left(1 + \sum_{t=1}^{T/\delta} \|\lambda_{t+1} - \Phi_t(\lambda_t, W)\|_2\right)\sqrt{T}.$$

This algorithm takes as input the known parameters $h(\tau)$, $W$, and $\bar{\mu}$. With these known parameters and the data stream, one could simply calculate the rate at any given time directly using Equation 5. This would be equivalent to Algorithm 1 with parameter $\eta_t = 0$. However, this strategy would be very fragile and susceptible to model mismatch.
For instance, if the true influence function $h(\tau)$ has a shorter support in time than the estimate we use for direct calculation, then the predicted rates will depend on events too far in the past and will consistently overestimate the likelihood of events happening. In contrast, by adapting our estimate of $\lambda_t$ with a non-zero $\eta_t$ and dynamical models, we can mitigate this effect and thus incur lower overall loss. We therefore gain robustness to model mismatch by not simply using a direct calculation method.

These effects are reflected in a few important details of the regret bound. The first is that if the complexity measure of the comparator sequence relative to the dynamics $\Phi_t$ is low, then the algorithm has $\sqrt{T}$ regret, which is sublinear as desired. Secondly, no assumptions have been made about how the comparator sequence $\lambda$ was actually generated. Instead we simply measure how well the comparator is approximated by a Hawkes process with dynamics dictated by $\Phi_t$. Therefore, if the true process acts like a Hawkes process, there will be low regret; but if the sequence is not generated this way, or is generated as a Hawkes process with different parameters such as a different $W$ matrix or a different influence function, we have an understanding of how much this will influence the performance of the algorithm.

B. Proposed algorithm - W Unknown

When the influence function $h(t)$ is a decaying exponential function, the dynamical model used has the form $\Phi_t(\lambda, W) = A_t \lambda + W y_t + (I - A_t)\bar{\mu}$, where $A_t = \alpha^{\delta} I$ is independent of the value of $W$. This fact, paired with the additional assumption that the solution to line 4 of Algorithm 1 is a point on the interior of $\Lambda$, allows for a method of tracking both the rates $\lambda_1, \ldots, \lambda_T$ as well as the matrix $W$. We denote by $\hat{\lambda}^W_t$ the estimate at time $t$ of Algorithm 1 using matrix $W$ in line 5.
When the solution $\tilde{\lambda}_{t+1}$ is in the interior of the set $\Lambda$, the value of $\hat{\lambda}^W_{t+1}$ takes the form:
$$\hat{\lambda}^W_{t+1} = (1 - \eta_t)\alpha^{\delta} \hat{\lambda}^W_t + \eta_t \alpha^{\delta} \frac{x_t}{\delta} + W y_t + (1 - \alpha^{\delta})\bar{\mu}. \qquad (9)$$
It is this closed-form solution that leads to Lemma 3. It might seem that the assumption $\tilde{\lambda}_{t+1} \in \mathrm{Int}(\Lambda)$ is very restrictive; however, it holds under very mild assumptions. For instance, since we are already assuming that there is a maximum number of times any actor can act per unit time, $x_{\max}$, setting $\lambda_{\max} \ge x_{\max}$ ensures the condition $\tilde{\lambda}_{t+1} \le \lambda_{\max}$. Therefore, our space $\Lambda$ would be a bounded region, but the solution of line 4 would always be in the interior of this feasible set under the same assumptions as Theorem 1.

Lemma 3. If Algorithm 1 is run separately for $W_1$ and $W_2$, producing estimates $\hat{\lambda}^{W_1}_t$ and $\hat{\lambda}^{W_2}_t$ respectively at time $t$, with the dynamical model $\Phi_t(\lambda, W) = \alpha^{\delta} \lambda + W y_t + (1 - \alpha^{\delta})\bar{\mu}$, and assuming that the value $\tilde{\lambda}_{t+1}$ is always in the interior of $\Lambda$, then at any given point in time the predictions of the algorithms corresponding to $W_1$ and $W_2$ are related in the following way:
$$\hat{\lambda}^{W_1}_t = \hat{\lambda}^{W_2}_t + (W_1 - W_2) K_t, \quad \text{with } K_{t+1} = (1 - \eta_t)\alpha^{\delta} K_t + y_t, \; K_1 = 0.$$

Remark 1. In this section, we assume we still have knowledge of $\bar{\mu}$ and are trying to learn the time-varying rates $\lambda_t$ and the network structure $W$. However, the exact same algorithm and analysis could be used to learn $\bar{\mu}$ with the following technique. Consider appending $\bar{\mu}$ as an extra column of the matrix $W$ and appending $1 - \alpha^{\delta}$ to the $y$ vector. This corresponds to the dynamical model $\Phi_t(\lambda, W) = \alpha^{\delta} \lambda + [W \; \bar{\mu}][y_t^{\top} \; 1 - \alpha^{\delta}]^{\top}$. Using this form and the technique of Lemma 3 we could simultaneously learn $W$ and $\bar{\mu}$, but for clarity of exposition we focus solely on learning $W$.
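Lemma 3 is easy to verify numerically: run the projection-free update (9) with two different matrices on the same synthetic inputs and track $K_t$. The simulation below uses arbitrary synthetic data of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
p, T, delta, alpha, eta = 3, 50, 0.1, 0.9, 0.05
a = alpha ** delta
mu_bar = np.full(p, 0.5)
W1 = rng.uniform(0.0, 0.3, (p, p))
W2 = rng.uniform(0.0, 0.3, (p, p))
ys = rng.uniform(0.0, 1.0, (T, p))          # synthetic y_t vectors
xs = rng.integers(0, 3, (T, p)).astype(float)

lam1, lam2 = mu_bar.copy(), mu_bar.copy()
K = np.zeros(p)                              # K_1 = 0
for t in range(T):
    # interior update (no projection active), then the model as in (9)
    lt1 = (1 - eta) * lam1 + eta * xs[t] / delta
    lt2 = (1 - eta) * lam2 + eta * xs[t] / delta
    lam1 = a * lt1 + W1 @ ys[t] + (1 - a) * mu_bar
    lam2 = a * lt2 + W2 @ ys[t] + (1 - a) * mu_bar
    K = (1 - eta) * a * K + ys[t]            # K_{t+1} = (1-eta_t) a K_t + y_t
    assert np.allclose(lam1, lam2 + (W1 - W2) @ K)   # Lemma 3
print("Lemma 3 relation verified on a synthetic run")
```

The key point, visible in the loop, is that the difference of the two trajectories satisfies the same linear recursion as $(W_1 - W_2)K_t$, so the relation propagates exactly.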
Using this lemma, the losses that would have been incurred with a different weighted adjacency matrix $W$ can be calculated and used to update $\hat{W}_t$ via gradient descent, yielding $\hat{W}_{t+1}$, as described in Algorithm 2. To do this, a convex feasible set of influence matrices, denoted $\mathcal{W}$, must be defined. For instance, we might consider families of sparse $W$,
$$\mathcal{W} = \left\{ W \in \mathbb{R}^{p \times p}_+ : \|W\|_1 \le c \right\},$$
or low-rank $W$,
$$\mathcal{W} = \left\{ W \in \mathbb{R}^{p \times p}_+ : \|W\|_* \le c \right\},$$
or even $W$ with partially known support (i.e., prior knowledge of a subset of the elements of $W$ that are zero-valued). First, the prediction $\hat{\lambda}_{t+1}$ is updated using the previous estimate of the network, $\hat{W}_t$. Then the estimate of $W$ is updated, and the transformation described in Lemma 3 is applied. The next result establishes tracking regret bounds for Algorithm 2:

Theorem 2 (Tracking regret of Algorithm 2). Let $\Phi_t(\lambda, W) = \alpha^{\delta} \lambda + W y_t + (1 - \alpha^{\delta})\bar{\mu}$ with $0 < \alpha < 1$ for all $W$ and $t = 1, 2, \ldots, T/\delta$. Additionally, let the sequence $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_T$ be the output of Algorithm 2, and let $\lambda_1, \lambda_2, \ldots, \lambda_T$ be an arbitrary sequence in $[\lambda_{\min}, \lambda_{\max}]^p$.
If we set $\eta_t$ and $\rho_t$ both proportional to either $1/\sqrt{t}$ or $1/\sqrt{T/\delta}$, then for any data sequence $x_1, \ldots, x_{T/\delta}$ in $[0, \delta x_{\max}]^p$ with $x_{\max} < \lambda_{\max}$, there exists a constant $C > 0$ which depends on $\delta, p, x_{\max}, \lambda_{\max}$ and $\lambda_{\min}$ such that
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T/\delta} \ell_t(\lambda_t) \le C\left(1 + \min_{W \in \mathcal{W}} \sum_{t=1}^{T/\delta} \|\lambda_{t+1} - \Phi_t(\lambda_t, W)\|_2\right)\sqrt{T}.$$

Algorithm 2 MV Hawkes Tracking - $W$ Unknown
1: Initialize $\hat{W}_1 = W_0$, $K_1 = 0$, $\hat{\lambda}_1 = \bar{\mu}$
2: for $t = 1, \ldots, T/\delta$ do
3:   Observe $x_t$ and incur loss $\ell_t(\hat{\lambda}_t) = \langle 1, \delta\hat{\lambda}_t\rangle - \langle x_t, \log \delta\hat{\lambda}_t\rangle$
4:   Set $\tilde{\lambda}_{t+1} = (1 - \eta_t)\hat{\lambda}_t + \eta_t x_t/\delta$
5:   Define $y_t \triangleq \sum_{\bar{\tau}_n = \delta t} e_{k_n} h(\delta(t+1) - \tau_n)$
6:   Set $g_t(W) = \ell_t(\hat{\lambda}^W_t) = \langle 1, \delta\hat{\lambda}^W_t\rangle - \langle x_t, \log \delta\hat{\lambda}^W_t\rangle$
7:   Set $\nabla g_t(W) = \delta 1 K_t^{\top} - \mathrm{Diag}(\hat{\lambda}^{\hat{W}_t}_t)^{-1} x_t K_t^{\top}$
8:   Set $\hat{W}_{t+1} = \mathrm{proj}_{\mathcal{W}}\left(\hat{W}_t - \rho_t \nabla g_t(\hat{W}_t)\right)$
9:   Set $K_{t+1} = (1 - \eta_t)\alpha^{\delta} K_t + y_t$
10:  Set $\hat{\lambda}_{t+1} = \Phi_t(\tilde{\lambda}_{t+1}, \hat{W}_t) + (\hat{W}_{t+1} - \hat{W}_t) K_{t+1}$
11: end for

This theorem is proved in Appendix G. This bound says that Algorithm 2 achieves an average per-round loss which is nearly as low as what would have been achieved with access to all data to choose the optimal time-varying rate vectors with a batch method. The gap between the losses of the proposed method and the losses accrued by a batch method scales with how closely the batch output (i.e., comparator sequence) follows the dynamical model associated with the best estimate of the network structure as encapsulated by $\mathcal{W}$. For instance, imagine that there existed a true, fixed $W$ representing a network, and an oracle used this $W$ to estimate a sequence of rate vectors which followed the model in (8) exactly and, subject to that constraint, minimized the sum of losses. For that oracle, the variation $\sum_{t=1}^{T-1} \|\lambda_{t+1} - \Phi_t(\lambda_t, W)\|_2 = 0$.
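A compact sketch of Algorithm 2 for the exponential-decay model, with the sparse feasible set $\mathcal{W} = \{W \succeq 0 : \|W\|_1 \le c\}$, is below. The function names are our own; the projection uses the standard sort-based simplex-projection recipe applied to the flattened, clipped entries.

```python
import numpy as np

def proj_l1_nonneg(W, c):
    """Euclidean projection onto {W >= 0 entrywise : sum_ij W_ij <= c},
    via the standard sort-based simplex projection of the clipped entries."""
    v = np.maximum(W, 0.0).ravel()
    if v.sum() <= c:
        return v.reshape(W.shape)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho_idx = np.nonzero(u * np.arange(1, u.size + 1) > (css - c))[0][-1]
    theta = (css[rho_idx] - c) / (rho_idx + 1.0)
    return np.maximum(v - theta, 0.0).reshape(W.shape)

def track_rates_unknown_W(X, ys, mu_bar, alpha, delta, eta, rho, c):
    """Sketch of Algorithm 2 (W unknown, exponential-decay dynamics).
    X: (T, p) binned counts; ys: (T, p) the y_t vectors."""
    T, p = X.shape
    a = alpha ** delta
    W = np.zeros((p, p))               # W_hat_1 = W_0
    K = np.zeros(p)                    # K_1 = 0
    lam = mu_bar.copy()                # lambda_hat_1 = mu_bar
    for t in range(T):
        lam_tilde = (1 - eta) * lam + eta * X[t] / delta         # line 4
        grad = np.outer(delta * np.ones(p) - X[t] / lam, K)      # line 7
        W_new = proj_l1_nonneg(W - rho * grad, c)                # line 8
        K = (1 - eta) * a * K + ys[t]                            # line 9
        # line 10: dynamical model plus the Lemma 3 correction term
        lam = (a * lam_tilde + W @ ys[t] + (1 - a) * mu_bar
               + (W_new - W) @ K)
        W = W_new
    return lam, W
```

Each iteration is a handful of length-$p$ vector operations plus the rank-one outer product and the projection, matching the per-iteration costs discussed in Section VI.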
Clearly such an estimator is not practical, because we do not know $W$ and are operating in an online setting. Despite these disadvantages, the difference of the average per-round losses of Algorithm 2 and the oracle estimator scales like $1/\sqrt{T}$, so that as $T \to \infty$ the performance gap between the two methods vanishes.

These bounds do not rely on any assumptions about the data actually being generated by a multivariate Hawkes process, or even being stochastic (which would be a fallacy in any real-world application). Rather, the ideas underlying the multivariate Hawkes model are used to generate a loss function and a dynamical model. These are used to characterize how well our methods perform tracking in an online setting, relative to how well any other method might perform on the same set of observations. Further, the comparator sequence against which performance is measured might be computed in batch (rather than online) or using significantly more computational and memory resources than are required by Algorithm 2.

The intuition behind why this method is robust to model mismatch can be seen in the algorithm itself and in the regret bound. Notice that in line 4 of the algorithm, we directly adjust our estimate of the rate prior to adjusting the network weights. Because we adjust the rate estimate directly, rather than only adjusting the network weights and calculating the resulting rate, our sequence of estimates is allowed to deviate from a pure Hawkes process. This deviation gives us the flexibility to combat errors due to model mismatch. Additionally, the form of the regret bound tells us that the method will perform competitively against any set of comparators that nearly follows the dynamical model. Therefore, if the generative model is similar to, although not exactly, a Hawkes process, then this variation term will still be low, resulting in low overall loss.
To see how Algorithm 2 adds robustness to the estimation of $W$, consider two contrasting approaches. In the first, which is detailed in Appendix H and considered in our experimental results, we estimate $W$ by performing online gradient descent on $W$ with a loss function corresponding to the Hawkes negative log-likelihood (which may contain incorrectly estimated model parameters). In the second approach, corresponding to Algorithm 2, we again estimate $W$ via online gradient descent, but this time with a loss function based on the accuracy of predictions from Algorithm 1. While Algorithm 1 may also depend on incorrectly estimated model parameters, it is much more robust to model mismatch than simply using the Hawkes generative model as in the first approach. This robustness is thus inherited by Algorithm 2.

VI. COMPUTATIONAL COMPLEXITY

One important feature of the proposed method is the low computational cost per iteration. Algorithm 2 performs the tasks of estimating both the current intensity $\hat{\lambda}_t \in \mathbb{R}^p$ and the network relationships $\hat{W}_t \in \mathbb{R}^{p \times p}_+$. A brief examination of lines 3-6 of the algorithm shows mostly vector operations on length-$p$ vectors, requiring $O(p)$ operations. The main computational burden of the algorithm comes in line 7, where the matrix multiplications require $O(p^2)$ operations, and in line 8, projecting onto the space $\mathcal{W}$. Without the projection step, this leaves the algorithm at an overall complexity of $O(p^2)$ to estimate $p^2 + p$ values. Depending on the space $\mathcal{W}$, the algorithm may be slower. For instance, a reasonable space would be the set of matrices with a bounded nuclear norm, which requires computing a singular value decomposition at each step, requiring $O(p^3)$ operations. Projecting onto other spaces, such as an $\ell_1$ ball with some radius, would only require $O(p^2)$ operations, maintaining our baseline complexity.
VII. EXPERIMENTAL RESULTS

In this section we present several experiments to demonstrate salient features of the proposed algorithms. We first focus on the scenario where the network influence matrix $W$ is known. In this scenario, the important observation is that our method is more robust to model mismatch than simply calculating the rate directly from the observations, the assumed influence function, and the matrix $W$. The next set of experiments demonstrates Algorithm 2 for unknown $W$ on synthetic data, showing how it can be used to learn both the rates and the network of interest. Finally, we use Algorithm 2 to analyze six months of Memetracker data corresponding to posts by a selection of well-known news websites, to determine what relationships exist among these organizations.

Throughout this section, we compare our method to the classical online learning algorithm Online Gradient Descent (OGD), used to learn the network $W$; this method is described formally in Appendix H. One alternative algorithm to learn the network would be to simply count all the times one process has an event immediately after another process had an event, with larger counts corresponding to larger influence. OGD takes a current estimate of the network and uses the loss function and the assumed influence function to update the network estimate in the direction of the most recent data point. The estimate is therefore a weighted average of previously seen data, with more weight on more recent data. In this way OGD is essentially the same as the counting approach, but with more information put into the system.
We compare against this method and show in several ways that our method performs comparably when both methods know the correct influence function, but performs better when model information is misspecified.

Fig. 1: Performance of Algorithm 1 using an incorrect exponential decay influence function. (a) Mean cumulative loss averaged over 100 trials; error bars show one standard deviation above and below the mean. (b) Moving average loss with D = 250, averaged over 100 trials. (c) Percentiles of the moving average difference between direct calculation and Algorithm 1. Tracking the rate, instead of just calculating it from known network values and the assumed influence function, leads to better overall performance. The error bars on the left side of (a) may overlap, but looking at the difference on individual trials shows that learning the rates gives consistently better performance than direct calculation on a per-trial basis. Plot (c) shows that our method adds robustness to the estimation procedure in over 95% of the cases across time and data realizations.

One challenge of online methods is the tuning of the step-size parameters, in our case $\eta$ and $\rho$.
In all of our experiments, we measure the effectiveness of step-size candidates based solely on accumulated loss on a small subset of the data. For instance, given a new dataset, we could run the method several times on the first 5% of the data with a range of step-size parameters, observe the total accumulated loss, and choose the parameters which empirically minimize the loss over that time. The basic setup of online learning protects us from over-fitting, because setting step-sizes too large would push our estimates very close to the immediately preceding observation, causing large loss on the next observation. Conversely, very small step-sizes would not adapt or learn the parameters at all, also causing high accumulated loss.

Throughout this section we plot several curves of interest to demonstrate the efficacy of our methods. The first metric is cumulative loss, defined at time $t$ as $\sum_{\tau=1}^{t} \ell_{\tau}(\lambda_{\tau})$ for an estimator $\lambda_{\tau}$; this value is plotted for $t = 1$ to $T$. Additionally, we plot a moving average curve of instantaneous loss, defined at time $t$ as $\frac{\delta}{D}\sum_{i=0}^{D/\delta - 1} \ell_{t-i}(\lambda_{t-i})$ for a time window of width $D$. This curve gives an idea of the instantaneous loss while not being so susceptible to noise as to be indecipherable.

A. Model mismatch, W known

For the first experiment, data points were generated in a two-actor network ($p = 2$), with $W$ an identity matrix scaled by $3/4$, influence function $h(t) = e^{-t}\mathbb{1}_{[t>0]}$, and $\bar{\mu} = [.005 \; .005]^{\top}$, over a time horizon of 20000, using the method of [34]. The data was then processed in several ways. The first was to calculate the discrete-time rates using an incorrect influence function, $\tilde{h}(t) = (2e)^{-t}\mathbb{1}_{[t>0]}$, without doing any learning. In other words, a rate is estimated by plugging observed event times into Equation 5 using the assumed $W$, $\tilde{h}(\cdot)$, and $\bar{\mu}$. We will call this method direct calculation.
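The synthetic event streams here are sampled from a multivariate Hawkes process; the paper cites the method of [34] for this, and a generic Ogata-style thinning sampler (our own sketch, not necessarily the cited algorithm) for the exponential kernel looks like:

```python
import numpy as np

def simulate_hawkes_exp(W, mu_bar, horizon, seed=0):
    """Ogata-style thinning sketch for a multivariate Hawkes process with
    exponential influence h(t) = exp(-t); W[i, j] is the jump added to
    node i's intensity when node j fires (assumes spectral radius < 1)."""
    rng = np.random.default_rng(seed)
    n_nodes = len(mu_bar)
    excite = np.zeros(n_nodes)   # excitation component of the intensity
    t, last_t = 0.0, 0.0
    events = []
    while True:
        lam_bar = mu_bar.sum() + excite.sum()   # valid bound: the intensity
        t += rng.exponential(1.0 / lam_bar)     # only decays between events
        if t >= horizon:
            break
        excite *= np.exp(-(t - last_t))         # decay to the candidate time
        last_t = t
        lam = mu_bar + excite
        if rng.uniform() < lam.sum() / lam_bar: # thinning accept step
            k = int(rng.choice(n_nodes, p=lam / lam.sum()))
            events.append((t, k))
            excite += W[:, k]
    return events
```

The accepted (time, node) pairs can then be binned at resolution $\delta$ to form the count vectors $x_t$ used by the algorithms.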
However, we expect suboptimal performance due to the fact that $\tilde{h}(t) \ne h(t)$. This method is compared to the output of Algorithm 1 with the same incorrect $\tilde{h}(t)$ function, with $\delta = 0.1$ and $\eta_t = 10/\sqrt{T/\delta}$, to show that robustness to model mismatch has been added. This overall setup was run separately on 100 different data realizations, and the results are shown in Figure 1.

Fig. 2: Performance of Algorithm 1 using an incorrect rectangular influence function. (a) Mean cumulative loss averaged over 100 trials; error bars show one standard deviation above and below the mean. (b) Moving average loss with D = 250, averaged over 100 trials. (c) Percentiles of the moving average difference between direct calculation and Algorithm 1. Again, tracking the rate adds robustness to misspecified system parameters.

Fig. 3: True network used to generate event times. Each pixel represents the influence the actor represented by the column has on the actor represented by the row, with lighter colors meaning more influence and black meaning no influence.

Another experiment was run on the same data, generated using $h(t) = e^{-t}\mathbb{1}_{[t>0]}$ and the same true $W$ matrix, but this time the influence function used to estimate the rates was a rectangular indicator function $\tilde{h}(t)$, and $\delta = .01$. Algorithm 2 was then run with $\eta_t = 10/\sqrt{T/\delta}$ and $\rho = .01/\sqrt{T/\delta}$.
Additionally, the estimates of the network were regularized with the element-wise $\ell_1$ norm with regularization parameter $.001$ to encourage sparsity in the estimated networks. Figure 4 shows the results for Algorithm 1 where the value of $W$ used was the generating value, and where $W$ was all zeros. We compare these two to the result of Algorithm 2 and to estimating $W$ using OGD with step size $\rho_t$, averaged over 100 data realizations. The important feature of Figure 4 is that the results of Algorithm 2 start poorly when the estimate of $W$ is bad, but as more and more data are revealed, the loss approaches the loss of the algorithm with full knowledge of the true matrix $W$, as predicted by Theorem 2. Additionally, the performance of Algorithm 2 very closely mirrors the performance of using OGD to estimate $W$ directly and using that estimate to compute the instantaneous rate via Equation 5. This shows again that when the influence function is known precisely, little is lost by tracking both the rates and the network. Figure 4, plot (b), shows that the OGD algorithm almost always incurs less loss than our method, but the gap is relatively small, on the order of $10^{-3}$ averaged over 500 time units.

Using the same set of event times, another set of trials was run, this time using a mismatched influence function $\tilde{h}(t) = .9^t \mathbb{1}_{[t>0]}$, and otherwise all the same parameters. The results of these trials are shown in Figure 5. In these results, the performance is not as accurate as when the ground-truth influence function was known, but Algorithm 2 steadily outperforms directly estimating $W$ using OGD, again demonstrating that our method adds robustness to poorly specified parameters.
This time, Figure 5(b) shows that our method almost always outperforms OGD, with an average gain on the order of $10^{-1}$; thus the amount we lose when the influence function is known is much less than the amount we gain in the case of model mismatch.

Fig. 5: Performance of Algorithm 1 with different values of $W$, compared to Algorithm 2. (a) Moving average loss with time window D = 500 with the incorrect influence function, averaged over 100 data realizations. (b) Percentiles of the difference between moving average losses of the OGD and Algorithm 2 estimates with the incorrect influence function. When the system parameters are misspecified, our method outperforms OGD on $W$ in predicting the likelihood of actors participating in events.

Additionally, how well the networks have been estimated can be examined, with both the correct and incorrect influence functions. The final estimates of the networks for one data realization are shown in Figure 6. The estimates recover the block-diagonal nature of the true network. When the influence function is known correctly, these structures are more pronounced, and the networks produced by Algorithm 2 and OGD on $W$ are very similar. However, when the influence function is misspecified, our method still recovers the strong clusters in the network, whereas directly performing OGD on $W$ does not reveal the structure as clearly. We also observe how well the significant elements of the network are recovered by setting various thresholds and declaring all elements of the estimate above the threshold to be significant relationships in the network. These relationships are then compared with the true, non-zero elements of the network to generate ROC curves for each method.
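The thresholding procedure behind the ROC curves can be sketched as follows (our own helper, not the paper's code): $P_d$ is the fraction of true edges recovered and $P_f$ the fraction of true zeros falsely declared significant.

```python
import numpy as np

def roc_points(W_hat, W_true, thresholds):
    """Sweep thresholds over the entries of W_hat and compare the declared
    support against the true nonzero pattern of W_true."""
    truth = (W_true != 0).ravel()
    pts = []
    for thr in thresholds:
        declared = W_hat.ravel() >= thr
        p_d = (declared & truth).sum() / max(truth.sum(), 1)      # detection
        p_f = (declared & ~truth).sum() / max((~truth).sum(), 1)  # false alarm
        pts.append((p_f, p_d))
    return pts
```

Restricting `W_true` to its largest 10% of entries before calling the helper gives the second baseline used in Figure 7.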
This curve is computed both for the full $W$ matrix and for just the largest 10% of the elements of $W$ as baselines. We choose to also focus on these largest edge weights because they represent the most important influences in the network. These ROCs are shown in Figure 7, which demonstrates our method's increased ability to find the important relations in the network compared to OGD. As a final test of how well we are learning the matrix $W$, the instantaneous estimates $\hat{W}_t$ are used to compute the total loss over the entire data set, using this matrix as in Equation 6. As more data are revealed, each estimate is produced with an increasing amount of training data and then tested on the full data set. Each estimate approaches the cumulative loss of directly calculating the rates with the true matrix $W$. These batch losses were all calculated using the true influence function, and are shown in Figure 8. Our online method decreases the overall loss and approaches the performance obtained by knowing the true network. Additionally, the estimation with the generative influence function is very similar for Algorithm 2 and OGD on $W$, but our method performs better than OGD when the influence function is misspecified.

C. Memetracker Data

For the final set of experiments, we used the raw phrases Memetracker [39] data set (http://www.memetracker.org/data.html) and extracted every post from websites analyzed by the authors as reporting a high percentage of important news (http://www.memetracker.org/lag.html). These 217 distinct websites made up our network of interest.
We extracted the posts from these websites for a six-month span from August 2008 through the end of January 2009, totaling over 3.5 million events. The only information considered was which websites were posting and at what times, without using information about content or links. The event times were at a resolution of 1-second intervals, and thus we chose $\delta = 1$ sec.

Fig. 6: Estimates of the underlying network using Algorithm 2 and OGD, with both correct (a, b) and incorrect (c, d) influence functions: (a) Algorithm 2 estimate of the network; (b) OGD estimate of the network; (c) Algorithm 2 estimate of the network with the incorrect influence function; (d) OGD estimate of the network with the incorrect influence function. Our method captures more of the network structure when the influence function is misspecified.

The first experiment was to compare the performance of our online algorithm to the estimator created using a batch estimate, in both predictive performance and computational speed. To do this, the first 100,000 event times were used, and SpaRSA [40] was used to find the matrix $W$ which minimized the time-averaged empirical loss plus an $\ell_1$ regularization term, using $\alpha = .99$ and a constant $\bar{\mu} = 2 \times 10^{-5}$, with regularization parameter equal to $1 \times 10^{-3}$. SpaRSA was run for a total of 60 outer iterations, with a line-search inner step which performed 15 function evaluations in an inner loop. Algorithm 2 was run with the same $\ell_1$ regularization parameter and step-sizes $\eta_t = .01/\sqrt{T/\delta}$ and $\rho_t = 5 \times 10^{-8}/\sqrt{T/\delta}$. Each function evaluation in the batch setting took approximately 1.51 seconds and each gradient evaluation about 57.13 seconds, for an overall run time of approximately 80 minutes.
In comparison, the entire online method took 398.94 seconds, due to only having to compute instantaneous losses with respect to a single second's worth of events, and to passing over each event only once. The experiments were performed using MATLAB 2015a on a laptop running Mac OSX v10.10.5 with an Intel i7 2.5 GHz processor and 16 GB of memory. The online method therefore took only 8.3% of the time of the computationally intensive batch method. The results of the experiment comparing our method and the batch method are shown in Figure 9. Both the cumulative loss and moving average loss plots depict the predictive power of the online algorithm starting poorly relative to the batch estimate, but as more data are revealed, the online algorithm approaches the performance of the batch estimate using only a fraction of the computational time.

Fig. 7: ROC curves demonstrating the methods' abilities to find the significant relationships in the network: (a) ROC curve for the full support; (b) ROC curve for the largest 10% of elements. Again, Algorithm 2 and OGD on $W$ perform very similarly when the $h$ function is known, but Algorithm 2 does better when it is misspecified.

Fig. 8: Average batch loss of the network estimates, with one standard deviation plotted in either direction. As more data are revealed, each method's performance improves. When the influence function is known precisely, our method and OGD on $W$ both perform well. When the influence function is misspecified, our method outperforms OGD.
This is especially apparent in the moving average plot, which shows the initial performance of the online algorithm being close to that of assuming there is no influence in the system ($W = 0$), while by the end of the trial there is only a small difference from the batch estimate.

Fig. 9: Comparison of Algorithm 2 to the matrix produced using a computationally intensive batch estimate. (a) Cumulative loss. (b) Moving average loss with time window D = 15000 seconds. Both plots show the estimate produced by Algorithm 2 starting poorly but rapidly approaching the performance of the batch estimate using only 8% of the computational time.

Algorithm 2 was then run on the entire six months of data to learn influences within the network, validated against the delays discovered in the lag times of stories being reported. Running a batch method on this much data would be computationally intractable due to the poor time-scaling of calculating the loss and gradient with increasing numbers of events. We used the data from the first half of the first month as a test set to tune parameters, and found that $\alpha = .995$, $\eta = 6.1 \times 10^{-4}$, $\rho = 6.1 \times 10^{-10}$ and $\bar{\mu} = 2 \times 10^{-5}$ accumulated relatively low loss. Using $\alpha = .995$, which corresponds to a reaction's "half-life" of about 2 minutes and 20 seconds, gives a time window long enough for meaningful reactions to take place, but not so long that many significantly different topics are published.
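The half-life figure can be checked directly: with $\delta = 1$ second, the influence decays by a factor of $\alpha$ per second, so the half-life is $\log 2 / (-\log \alpha)$.

```python
import math

alpha = 0.995
# solve alpha**t = 1/2 for t (in seconds): t = log(2) / (-log(alpha))
half_life = math.log(2) / -math.log(alpha)
# about 138 seconds, i.e. roughly 2 minutes 18 seconds, consistent with
# the quoted "about 2 minutes and 20 seconds"
print(round(half_life))
```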
It is worth noting that this analysis focuses on immediate dependencies, and thus we work on the scale of minutes, whereas the average lag times are reported on the scale of hours; we are therefore more likely to discover which organizations publish content faster, rather than finding websites which are likely using others as references. The websites were ordered based on their average lag time as reported in the Memetracker results, from smallest to largest. Recall that $W_{i,j}$ reflects the influence of actor $j$ on actor $i$. Because the actors are ordered from smallest to largest lag time, we expect larger elements $W_{i,j}$ to have $i > j$, which corresponds to more significant relationships lying below the diagonal. Relationships are declared "significant" when $W_{i,j}$ is above a threshold. Figure 10 shows the number of significant relationships across a range of thresholds. For most choices of threshold, more significant relationships are below the diagonal.

We found that among the 20 most influential websites were dailyherald.com, washingtonpost.com, post-gazette.com, denverpost.com, news.bbc.co.uk, and cnn.com. All of these are either local news organizations in major metropolitan areas or important national news organizations. Among the top dependencies were apnews.myway.com reacting to dailyherald.com, elections.foxnews.com reacting to washingtonpost.com, and mcclatchydc.com reacting to washingtonpost.com. The first two pairs are examples of large national organizations being slower to respond than local sources. Alternatively, the third pair is an example of two local news sources, where the organization with more journalists is able to publish faster than a competitor with fewer journalists.
10: Number of relationships above a threshold for the network learned using the Memetracker dataset. For most choices of the threshold, there are more relationships exceeding the threshold below the diagonal than above.

VIII. CONCLUSION

In many real-world applications, such as social, neural, and financial networks, actions by actors at one point in time influence the future actions of others in the network. In these applications it is beneficial to estimate the likelihood of each actor acting at any given point in time, given only timing information about previous occurrences. This task is particularly challenging when data are streaming at a rate that precludes traditional batch processing methods. We have proposed two online methods for tracking the time-varying rates at which actors participate in events: one for when the underlying network is known and one for when it is unknown. Relevant regret bounds for both methods scale with the deviation of a comparator series of point process rate or intensity functions, with no assumptions on the actual generative model of the data. These methods were tested on both synthetic and real data to show that they successfully track the intensities of interest, can recover the underlying network structure, and are robust to model mismatch.

APPENDIX A: LEMMA 4

The proof of Lemma 1, which compares the loss of the continuous-time and discrete-time autoregressive rates, relies on the following relationship.

Lemma 4. For $0 < \alpha < 1$ and $\delta > 0$ the following inequalities hold:
$$\frac{-1}{\log(\alpha)} - \frac{\delta}{2} \;\le\; \frac{\delta \alpha^\delta}{1 - \alpha^\delta} \;\le\; \frac{-1}{\log(\alpha)}. \tag{10}$$

Proof: The proof starts with the following observations:
$$\lim_{\delta \to 0} \frac{\delta \alpha^\delta}{1 - \alpha^\delta} = \frac{-1}{\log(\alpha)}, \qquad \frac{\partial}{\partial \delta} \frac{\delta \alpha^\delta}{1 - \alpha^\delta} = \frac{\alpha^\delta - \alpha^{2\delta} + \delta \alpha^\delta \log(\alpha)}{(1 - \alpha^\delta)^2}.$$
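As an aside, the sandwich inequality of Lemma 4 is easy to verify numerically over a grid of $(\alpha, \delta)$ values (a sketch; the grid points are arbitrary):

```python
import math

def middle(alpha, delta):
    # The quantity delta * alpha**delta / (1 - alpha**delta) from Lemma 4.
    return delta * alpha**delta / (1 - alpha**delta)

for alpha in (0.5, 0.9, 0.995):
    for delta in (1e-3, 0.1, 1.0, 10.0):
        lo = -1 / math.log(alpha) - delta / 2
        hi = -1 / math.log(alpha)
        # Small tolerance guards against floating-point rounding only.
        assert lo - 1e-9 <= middle(alpha, delta) <= hi + 1e-9
print("Lemma 4 bounds hold on the grid")
```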
Because all three terms in Equation 10 are equal in the limit $\delta \to 0$, showing that the derivative of $\frac{\delta \alpha^\delta}{1-\alpha^\delta}$ with respect to $\delta$ lies between $-\frac{1}{2}$ and $0$ suffices to prove the lemma. The upper bound stems from the following inequality:
$$1 + \delta \log(\alpha) \le \alpha^\delta \;\Longrightarrow\; 1 - \alpha^\delta + \delta \log(\alpha) \le 0 \;\Longrightarrow\; \frac{\partial}{\partial \delta} \frac{\delta \alpha^\delta}{1-\alpha^\delta} = \frac{\alpha^\delta - \alpha^{2\delta} + \delta \alpha^\delta \log(\alpha)}{(1-\alpha^\delta)^2} \le 0.$$
The proof of the lower bound requires the analysis of another function, $\alpha^\delta - \alpha^{-\delta}$:
$$\frac{\partial}{\partial \delta}\left( \alpha^\delta - \alpha^{-\delta} \right) = \log(\alpha)\left( \alpha^\delta + \alpha^{-\delta} \right), \qquad \frac{\partial^2}{\partial \delta^2}\left( \alpha^\delta - \alpha^{-\delta} \right) = \log^2(\alpha)\left( \alpha^\delta - \alpha^{-\delta} \right) \le 0 \text{ for } \delta > 0.$$
Because this function is concave for $\delta > 0$, the following inequality holds:
$$\alpha^\delta - \alpha^{-\delta} \le 2\delta \log(\alpha),$$
which simply says that for $\delta > 0$ the function lies below its tangent line at $\delta = 0$. This inequality can be used to derive the desired lower bound:
$$\alpha^\delta - \alpha^{-\delta} \le 2\delta \log(\alpha) \;\Longrightarrow\; \frac{\alpha^{2\delta} - 1}{2} \le \delta \alpha^\delta \log(\alpha) \;\Longrightarrow\; \frac{2\alpha^\delta - \alpha^{2\delta} - 1}{2} \le \alpha^\delta - \alpha^{2\delta} + \delta \alpha^\delta \log(\alpha) \;\Longrightarrow\; -\frac{1}{2} \le \frac{\alpha^\delta - \alpha^{2\delta} + \delta \alpha^\delta \log(\alpha)}{(1-\alpha^\delta)^2} = \frac{\partial}{\partial \delta} \frac{\delta \alpha^\delta}{1-\alpha^\delta}.$$
Therefore, the derivative of $\frac{\delta \alpha^\delta}{1-\alpha^\delta}$ is between $-\frac{1}{2}$ and $0$ for $\delta > 0$ and $0 < \alpha < 1$, which, combined with the limit statement, proves the lemma.

APPENDIX B: PROOF OF LEMMA 1

This proof shows that the negative log-likelihood of the true underlying Hawkes process (Equation 1), given an excitation matrix $W$, differs from the discretized version (Equation 6) by a factor proportional to $\delta$:
$$\left| \sum_{k=1}^p \int_0^T \mu_k(\tau)\,d\tau - \sum_{n=1}^{N_T} \log \mu_{k_n}(\tau_n) - \sum_{k=1}^p \sum_{t=1}^{T/\delta} \left( \delta \lambda_{t,k} - x_{t,k} \log \lambda_{t,k} \right) \right| \le \left| \sum_{k=1}^p \left( \int_0^T \mu_k(\tau)\,d\tau - \sum_{t=1}^{T/\delta} \delta \lambda_{t,k} \right) \right| + \left| \sum_{k=1}^p \sum_{t=1}^{T/\delta} x_{t,k} \log \lambda_{t,k} - \sum_{n=1}^{N_T} \log \mu_{k_n}(\tau_n) \right|. \tag{11}$$
The two absolute value terms will be bounded separately. We start with the first term, involving the integral of the true rate and its approximation.
$$\left| \sum_{k=1}^p \left( \int_0^T \mu_k(\tau)\,d\tau - \sum_{t=1}^{T/\delta} \delta \lambda_{t,k} \right) \right| \le \sum_{k=1}^p \left| \int_0^T \sum_{\tau_n < \tau} W_{k,k_n} \alpha^{\tau - \tau_n}\,d\tau - \delta \sum_{t=1}^{T/\delta} \sum_{\bar{\tau}_n < \delta t} W_{k,k_n} \alpha^{\delta t - \tau_n} \right|$$
$$= \sum_{k=1}^p \left| \sum_{n=1}^{N_T} W_{k,k_n} \left( \int_{\tau_n}^T \alpha^{\tau - \tau_n}\,d\tau - \delta \sum_{t = \bar{\tau}_n/\delta + 1}^{T/\delta} \alpha^{\delta t - \tau_n} \right) \right|$$
$$= \sum_{k=1}^p \left| \sum_{n=1}^{N_T} W_{k,k_n} \left( \frac{\alpha^{T-\tau_n} - 1}{\log(\alpha)} - \frac{\delta \alpha^\delta \left( \alpha^{\bar{\tau}_n - \tau_n} - \alpha^{T-\tau_n} \right)}{1 - \alpha^\delta} \right) \right|$$
$$= \sum_{k=1}^p \left| \sum_{n=1}^{N_T} W_{k,k_n} \left( \alpha^{T-\tau_n} \left( \frac{\delta \alpha^\delta}{1-\alpha^\delta} - \frac{-1}{\log(\alpha)} \right) + \frac{-1}{\log(\alpha)} - \alpha^{\bar{\tau}_n - \tau_n} \frac{\delta \alpha^\delta}{1-\alpha^\delta} \right) \right|.$$
These lines evaluate the integral and its approximation using a geometric sum to obtain a closed form for their difference. From this point, Lemma 4 allows us to find an upper bound:
$$\alpha^{T-\tau_n} \left( \frac{\delta \alpha^\delta}{1-\alpha^\delta} - \frac{-1}{\log(\alpha)} \right) + \frac{-1}{\log(\alpha)} - \alpha^{\bar{\tau}_n - \tau_n} \frac{\delta \alpha^\delta}{1-\alpha^\delta} \le \frac{-1}{\log(\alpha)} - \alpha^\delta \frac{\delta \alpha^\delta}{1-\alpha^\delta} \le \frac{\delta}{2} + (1-\alpha^\delta)\, \frac{\delta \alpha^\delta}{1-\alpha^\delta} \le \frac{3\delta}{2}.$$
We can similarly use Lemma 4 to find the lower bound:
$$\alpha^{T-\tau_n} \left( \frac{\delta \alpha^\delta}{1-\alpha^\delta} - \frac{-1}{\log(\alpha)} \right) + \frac{-1}{\log(\alpha)} - \alpha^{\bar{\tau}_n - \tau_n} \frac{\delta \alpha^\delta}{1-\alpha^\delta} \ge -\frac{\delta}{2}\, \alpha^{T-\tau_n} + \frac{-1}{\log(\alpha)} - \frac{\delta \alpha^\delta}{1-\alpha^\delta} \ge -\frac{\delta}{2}.$$
Combining these upper and lower bounds gives an overall bound on the integral approximation:
$$\left| \sum_{k=1}^p \left( \int_0^T \mu_k(\tau)\,d\tau - \sum_{t=1}^{T/\delta} \delta \lambda_{t,k} \right) \right| \le p\, N_T\, W_{\max}\, \frac{3\delta}{2}. \tag{12}$$
This shows that the integral terms deviate by no more than a linear factor of $\delta$, which can be controlled by setting $\delta$ small. Next the difference involving the log terms must be bounded. We make the following observation:
$$\sum_{k=1}^p \sum_{t=1}^{T/\delta} x_{t,k} \log \lambda_{t,k} = \sum_{n=1}^{N_T} \log \lambda_{\bar{\tau}_n/\delta,\, k_n}.$$
This says that instead of going from time step to time step and incurring loss at every point, we simply step through every event and incur loss at the following discrete time point $\bar{\tau}_n$. Now the log terms can be compared as
$$\sum_{k=1}^p \sum_{t=1}^{T/\delta} x_{t,k} \log \lambda_{t,k} - \sum_{n=1}^{N_T} \log \mu_{k_n}(\tau_n) = \sum_{n=1}^{N_T} \log \left( \frac{\lambda_{\bar{\tau}_n/\delta,\, k_n}}{\mu_{k_n}(\tau_n)} \right).$$
In order to bound this, we distinguish between two classes of events. The first set, $\mathcal{A}_n = \{ i \mid \tau_i < \tau_n,\ \bar{\tau}_i < \bar{\tau}_n \}$, contains events that happen before the $n$th event and not in the same time window. The second set, $\mathcal{B}_n = \{ i \mid \tau_i < \tau_n,\ \bar{\tau}_i = \bar{\tau}_n \}$, contains events that happen before the $n$th event but in the same time window. Then
$$\sum_{n=1}^{N_T} \log \left( \frac{\lambda_{\bar{\tau}_n/\delta,\, k_n}}{\mu_{k_n}(\tau_n)} \right) = \sum_{n=1}^{N_T} \log \left( \frac{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\bar{\tau}_n - \tau_i}}{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i} + \sum_{i \in \mathcal{B}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i}} \right).$$
This term needs to be upper and lower bounded to get the final result. Because every element of $W$ is nonnegative and $\alpha > 0$, all the terms are positive, so the terms in the set $\mathcal{B}_n$ can be dropped to get the following upper bound:
$$\sum_{n=1}^{N_T} \log \left( \frac{\lambda_{\bar{\tau}_n/\delta,\, k_n}}{\mu_{k_n}(\tau_n)} \right) \le \sum_{n=1}^{N_T} \log \left( \frac{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\bar{\tau}_n - \tau_i}}{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i}} \right) \tag{13a}$$
$$\le \sum_{n=1}^{N_T} \log \left( \frac{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i}}{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i}} \right) = 0. \tag{13b}$$
In Equation 13a positive terms have been removed from the denominator, and 13b uses the fact that $\bar{\tau}_n \ge \tau_n$ and therefore $\alpha^{\bar{\tau}_n} \le \alpha^{\tau_n}$. Next, we lower bound this term.
$$\sum_{n=1}^{N_T} \log \left( \frac{\lambda_{\bar{\tau}_n/\delta,\, k_n}}{\mu_{k_n}(\tau_n)} \right) \ge \sum_{n=1}^{N_T} \log \left( \frac{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\bar{\tau}_n - \tau_i}}{\bar{\mu}_{k_n} + \sum_{i \in \mathcal{A}_n} W_{k_n,k_i}\, \alpha^{\bar{\tau}_n - \tau_i - \delta} + \sum_{i \in \mathcal{B}_n} W_{k_n,k_i}\, \alpha^{\tau_n - \tau_i}} \right) \tag{14a}$$
$$\ge -\sum_{n=1}^{N_T} \log \left( \alpha^{-\delta} + \frac{|\mathcal{B}_n|\, W_{\max}}{\mu_{\min}} \right) \tag{14b}$$
$$\ge -\sum_{n=1}^{N_T} \log \left( \alpha^{-\delta} + \frac{p\, x_{\max}\, \delta\, W_{\max}}{\mu_{\min}} \right) \tag{14c}$$
$$\ge -N_T \left( \frac{W_{\max}\, x_{\max}\, p}{\mu_{\min}} - \log(\alpha) \right) \delta. \tag{14d}$$
Equation 14a comes from $\alpha^{\bar{\tau}_n - \delta} \ge \alpha^{\tau_n}$; Equation 14b uses $\alpha^{-\delta} > 1$ and cancels like terms from the numerator and denominator; Equation 14c bounds the number of events in $\mathcal{B}_n$ by $p\, x_{\max}\, \delta$, the maximum number of events each actor can participate in during a $\delta$-length time window times the number of actors; and Equation 14d uses $\log(x + a) \le \frac{x}{a} + \log(a)$ for $a > 0$, a consequence of the concavity of the logarithm. The upper and lower bounds give
$$\left| \sum_{n=1}^{N_T} \log \left( \frac{\lambda_{\bar{\tau}_n/\delta,\, k_n}}{\mu_{k_n}(\tau_n)} \right) \right| \le N_T \left( \frac{W_{\max}\, x_{\max}\, p}{\mu_{\min}} - \log(\alpha) \right) \delta,$$
which, combined with Equation 12, gives the result:
$$\left| L_T(\mu) - L_T^{(\delta)}(\lambda) \right| \le \left( \frac{3 p W_{\max}}{2} + \frac{W_{\max}\, x_{\max}\, p}{\mu_{\min}} - \log(\alpha) \right) N_T\, \delta.$$

APPENDIX C: DUAL PARAMETERIZATION

We introduce a change of variables which allows us to more easily bound the regret of the proposed algorithms. Define $\theta_t = [\theta_{t,1}, \ldots, \theta_{t,p}]^\top \triangleq \log(\delta \lambda_t)$. This change of variables gives a loss function in terms of $\theta$:
$$\tilde{\ell}_t(\theta_t) = \ell_t\!\left( \tfrac{1}{\delta} \exp(\theta_t) \right) = -\langle \theta_t, x_t \rangle + \langle \exp(\theta_t), \mathbf{1} \rangle. \tag{15}$$
It is important to note that the relationship between $\lambda_t$ and $\theta_t$ is one-to-one, so we may operate in either the $\lambda$ or the $\theta$ space. Notice that Equation 15 corresponds to the negative log-likelihood of an exponential family distribution of the form
$$p_t(\theta) = \exp\{ \langle \theta, x_t \rangle - Z(\theta) \},$$
where we omit factors depending only on $x_t$ and not $\theta$, and where
$$Z(\theta) \triangleq \log \int \exp\{ \langle \theta, x \rangle \}\,dx$$
is the so-called log-partition function.
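The change of variables in Equation 15 can be checked directly: the loss evaluated in the $\lambda$ parameterization and in the $\theta = \log(\delta \lambda)$ parameterization must agree (a sketch; the rate and count values below are arbitrary illustrations, not from the paper):

```python
import numpy as np

delta = 0.1
lam = np.array([0.5, 2.0, 1.3])   # rates lambda_t (illustrative)
x = np.array([0.0, 1.0, 2.0])     # event counts x_t (illustrative)

def loss_lambda(lam, x, delta):
    # ell_t(lambda) = <1, delta*lambda> - <x, log(delta*lambda)>
    return np.sum(delta * lam) - x @ np.log(delta * lam)

def loss_theta(theta, x):
    # tilde-ell_t(theta) = -<theta, x> + <exp(theta), 1>  (Equation 15)
    return -theta @ x + np.sum(np.exp(theta))

theta = np.log(delta * lam)       # the one-to-one change of variables
assert np.isclose(loss_lambda(lam, x, delta), loss_theta(theta, x))
```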
Important to our analysis is the dual of the log-partition function,
$$Z^*(\delta \lambda) \triangleq \sup_{\theta \in \Theta} \{ \langle \delta \lambda, \theta \rangle - Z(\theta) \}.$$
In our multivariate Hawkes setting, we have $Z(\theta) = \langle \exp(\theta), \mathbf{1} \rangle$. Performing online optimization in the $\theta$ parameter space allows us to exploit several properties of exponential families, as described in general settings in [1] and in our specific context in Section V-B. In particular, we will use the following facts:
$$\nabla Z(\theta) = \exp(\theta) = \delta \lambda, \qquad \nabla Z^*(\delta \lambda) = \log(\delta \lambda) = \theta.$$
An equivalent dynamical model from the $\lambda$ space can be defined in the $\theta$ parameter space as
$$\tilde{\Phi}_t(\theta, W) = \nabla Z^*\!\left( \delta\, \Phi_t( \nabla Z(\theta)/\delta,\, W ) \right), \tag{16}$$
which essentially converts the dual ($\delta \lambda$) to the primal ($\theta$), applies the dynamics, then converts back. Using the dual parameterization, we have the following equivalent to Algorithm 1, which uses the function
$$D(\theta_1 \| \theta_2) \triangleq Z(\theta_1) - Z(\theta_2) - \langle \nabla Z(\theta_2), \theta_1 - \theta_2 \rangle,$$
the Bregman divergence induced by $Z(\theta)$. We list some important properties of the loss function $\tilde{\ell}_t(\theta)$ that will be used in the proofs of the regret bounds in Section V.
• We assume a convex set of possible rate functions $\lambda \in [\lambda_{\min}, \lambda_{\max}]^p = \Lambda$ with $\lambda_{\min} > 0$, and therefore the corresponding dual space is $\Theta = [\log(\delta \lambda_{\min}), \log(\delta \lambda_{\max})]^p$.
Algorithm 3 MV Hawkes Tracking — W Known, dual parameterization
1: Initialize $\hat{\theta}_1 = \log(\delta \bar{\mu})$
2: for $t = 1, \ldots, T/\delta$ do
3:   Observe $x_t$ and incur loss $\tilde{\ell}_t(\hat{\theta}_t) = \langle \mathbf{1}, \exp(\hat{\theta}_t) \rangle - \langle x_t, \hat{\theta}_t \rangle$
4:   Set $\tilde{\theta}_{t+1} = \arg\min_{\theta \in \Theta}\ \eta_t \langle \nabla \tilde{\ell}_t(\hat{\theta}_t), \theta \rangle + D(\theta \| \hat{\theta}_t)$
5:   Set $\hat{\theta}_{t+1} = \tilde{\Phi}_t(\tilde{\theta}_{t+1}, W)$
6: end for

• Because we assume there is a maximum number of times each actor can act per unit time, $x_{\max}$, and the observation space $\mathcal{X}$ is simply $[0, \delta x_{\max}]^p$, we additionally have that on the set $\Theta$:
$$\| \nabla \tilde{\ell}_t(\theta) \|_2 \le \| \exp(\theta) \|_2 + \| x_t \|_2 \le \sqrt{p}\, \delta\, (\lambda_{\max} + x_{\max}). \tag{17}$$
• The function $Z(\theta) = \langle \mathbf{1}, \exp(\theta) \rangle$ is strongly convex on $\Theta$ with strong convexity parameter $\delta \lambda_{\min}$ with respect to the $\ell_2$ norm. Therefore the Bregman divergence induced by $Z$ obeys
$$D(\theta_1 \| \theta_2) = Z(\theta_1) - Z(\theta_2) - \langle \nabla Z(\theta_2), \theta_1 - \theta_2 \rangle \ge \frac{\delta \lambda_{\min}}{2} \| \theta_1 - \theta_2 \|_2^2. \tag{18}$$
• The Bregman divergence induced by $Z(\theta) = \langle \mathbf{1}, \exp(\theta) \rangle$ is always non-negative and can be upper bounded for any $\theta_1, \theta_2 \in \Theta$:
$$D(\theta_1 \| \theta_2) \le \langle \nabla Z(\theta_1) - \nabla Z(\theta_2), \theta_1 - \theta_2 \rangle \le \| \nabla Z(\theta_1) - \nabla Z(\theta_2) \|_2\, \| \theta_1 - \theta_2 \|_2 = \delta \| \lambda_1 - \lambda_2 \|_2 \left\| \log\!\left( \frac{\lambda_1}{\lambda_2} \right) \right\|_2 \le \frac{\delta \lambda_{\max}^2\, p}{\lambda_{\min}}. \tag{19}$$
• A property of Bregman divergences and dual functions gives
$$D(\theta_1 \| \theta_2) = Z(\theta_1) - Z(\theta_2) - \langle \nabla Z(\theta_2), \theta_1 - \theta_2 \rangle = Z^*(\delta \lambda_2) - Z^*(\delta \lambda_1) - \langle \nabla Z^*(\delta \lambda_1), \delta \lambda_2 - \delta \lambda_1 \rangle \triangleq D^*(\delta \lambda_2 \| \delta \lambda_1).$$
Therefore, $\tilde{\Phi}$ is contractive with respect to $D$ if and only if $\Phi$ is contractive with respect to $D^*$:
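Line 4 of Algorithm 3 admits a simple closed form in this setting: with $Z(\theta) = \langle \mathbf{1}, \exp(\theta) \rangle$, the unconstrained Bregman step solves $\exp(\theta) = \exp(\hat{\theta}_t) - \eta_t \nabla \tilde{\ell}_t(\hat{\theta}_t)$ coordinate-wise, and because the objective is coordinate-separable, the box constraint on $\Theta$ reduces to clipping. A hedged sketch (all numerical values are illustrative, not from the paper):

```python
import numpy as np

def dual_step(theta_hat, x, eta, delta, lam_min, lam_max):
    """One Bregman (mirror descent) step of Algorithm 3, line 4."""
    grad = np.exp(theta_hat) - x           # gradient of tilde-ell_t at theta_hat
    z = np.exp(theta_hat) - eta * grad     # unconstrained solution in exp-space
    z = np.clip(z, delta * lam_min, delta * lam_max)  # box constraint on Theta
    return np.log(z)

delta = 0.1
theta_hat = np.log(delta * np.array([1.0, 1.0]))  # current estimate, rates = 1
theta_new = dual_step(theta_hat, x=np.array([0.0, 3.0]), eta=0.5,
                      delta=delta, lam_min=0.01, lam_max=50.0)
print(np.exp(theta_new) / delta)  # updated rate estimates lambda-hat
```

A coordinate with no observed events has its rate shrunk toward zero, while a burst of events pushes the rate up, exactly the behavior the loss in line 3 rewards.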
$$D(\tilde{\Phi}_t(\theta_1, W) \| \tilde{\Phi}_t(\theta_2, W)) = D^*(\delta \Phi_t(\lambda_2, W) \| \delta \Phi_t(\lambda_1, W)) \le D^*(\delta \lambda_2 \| \delta \lambda_1) = D(\theta_1 \| \theta_2).$$

APPENDIX D: PROOF OF LEMMA 2

We prove the lemma by proving the result in the $\theta$ space described in the previous section; by the properties of duals of Bregman divergences, the result then holds in the $\lambda$ space as well. We start with the following important observation about the Bregman divergence in question:
$$D(\theta_1 \| \theta_2) = \langle \exp(\theta_1) - \exp(\theta_2), \mathbf{1} \rangle - \langle \exp(\theta_2), \theta_1 - \theta_2 \rangle = \sum_{k=1}^p \left[ \exp(\theta_{1,k}) - \exp(\theta_{2,k}) - \exp(\theta_{2,k})(\theta_{1,k} - \theta_{2,k}) \right] = \sum_{k=1}^p d(\theta_{1,k} \| \theta_{2,k}).$$
Above, $\theta_{1,k}$ and $\theta_{2,k}$ denote the $k$th elements of the vectors $\theta_1$ and $\theta_2$ respectively, and $d$ is the scalar Bregman divergence induced by $\exp(\theta)$. This shows that the $p$-dimensional Bregman divergence can be broken into a sum of $p$ terms. We will prove bounds for the one-dimensional version and then combine them to show that overall the dynamics are contractive. Because $\Phi_t(\lambda, W) = A_t \lambda + b_t$, we have $\tilde{\Phi}_t(\theta, W) = \log(A_t \exp(\theta) + \delta b_t)$. We start by showing the result when the $k$th diagonal element of the matrix $A_t$, denoted $A_{t,k}$, is $0$. We denote the $k$th element of the vector after the application of the dynamics by $[\tilde{\Phi}_t(\theta)]_k$ and the $k$th element of $b_t$ by $b_{t,k}$:
$$d([\tilde{\Phi}_t(\theta_1)]_k \| [\tilde{\Phi}_t(\theta_2)]_k) = A_{t,k}\left( \exp(\theta_{1,k}) - \exp(\theta_{2,k}) \right) - \left( A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k} \right) \log\!\left( \frac{A_{t,k} \exp(\theta_{1,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}} \right) = -\delta b_{t,k} \log\!\left( \frac{b_{t,k}}{b_{t,k}} \right) = 0 = A_{t,k}\, d(\theta_{1,k} \| \theta_{2,k}).$$
When $A_{t,k} = b_{t,k} = 0$ we define $\log\!\left( \frac{0}{0} \right) \triangleq 0$.
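As an aside, the contraction claimed by Lemma 2 can be spot-checked numerically: for diagonal $A_t$ with entries in $[0, 1]$ and $b_t \ge 0$, the $\theta$-space dynamics should never increase the Bregman divergence (a sketch; the diagonal entries, offset, and test points are random illustrations):

```python
import numpy as np

def bregman(t1, t2):
    # D(theta1 || theta2) induced by Z(theta) = <1, exp(theta)>
    return np.sum(np.exp(t1) - np.exp(t2) - np.exp(t2) * (t1 - t2))

rng = np.random.default_rng(1)
delta = 0.1
A = np.diag(rng.uniform(0.0, 1.0, size=4))   # diagonal, entries in [0, 1]
b = rng.uniform(0.1, 1.0, size=4)            # strictly positive offset

def phi(theta):
    # theta-space dynamics: log(A exp(theta) + delta * b)
    return np.log(A @ np.exp(theta) + delta * b)

for _ in range(100):
    t1, t2 = rng.normal(size=4), rng.normal(size=4)
    assert bregman(phi(t1), phi(t2)) <= bregman(t1, t2) + 1e-12
```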
Next, we show the result for the case when $A_{t,k} > 0$ and $b_{t,k} = 0$:
$$d([\tilde{\Phi}_t(\theta_1)]_k \| [\tilde{\Phi}_t(\theta_2)]_k) = A_{t,k}\left( \exp(\theta_{1,k}) - \exp(\theta_{2,k}) \right) - A_{t,k} \exp(\theta_{2,k}) \log\!\left( \frac{A_{t,k} \exp(\theta_{1,k})}{A_{t,k} \exp(\theta_{2,k})} \right) = A_{t,k}\left( \exp(\theta_{1,k}) - \exp(\theta_{2,k}) - \exp(\theta_{2,k})(\theta_{1,k} - \theta_{2,k}) \right) = A_{t,k}\, d(\theta_{1,k} \| \theta_{2,k}).$$
Finally, we show that the one-dimensional Bregman divergence is non-increasing in $b_{t,k}$ for $A_{t,k} > 0$:
$$\frac{d}{d b_{t,k}}\, d([\tilde{\Phi}_t(\theta_1)]_k \| [\tilde{\Phi}_t(\theta_2)]_k) = -\delta \log\!\left( \frac{A_{t,k} \exp(\theta_{1,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}} \right) - \delta \left( \frac{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{1,k}) + \delta b_{t,k}} - \frac{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}} \right)$$
$$= \delta \log\!\left( \frac{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{1,k}) + \delta b_{t,k}} \right) + \delta \left( 1 - \frac{A_{t,k} \exp(\theta_{2,k}) + \delta b_{t,k}}{A_{t,k} \exp(\theta_{1,k}) + \delta b_{t,k}} \right) \le 0.$$
The final inequality comes from the fact that $1 - x \le -\log x$. We have thus shown that when $b_{t,k} > 0$ the one-dimensional Bregman divergence is smaller than when $b_{t,k} = 0$. Combining these results with the assumption that all the elements of the diagonal matrix $A_t$ are upper bounded by one gives the conclusion that these dynamics are contractive:
$$D(\tilde{\Phi}_t(\theta_1) \| \tilde{\Phi}_t(\theta_2)) = \sum_{k=1}^p d([\tilde{\Phi}_t(\theta_1)]_k \| [\tilde{\Phi}_t(\theta_2)]_k) \le \sum_{k=1}^p A_{t,k}\, d(\theta_{1,k} \| \theta_{2,k}) \le \sum_{k=1}^p d(\theta_{1,k} \| \theta_{2,k}) = D(\theta_1 \| \theta_2).$$

APPENDIX E: PROOF OF THEOREM 1

The proof of Theorem 1 is based on the proof of Theorem 3 of [23], specialized to the Hawkes process. The strategy is to bound the excess loss at any given moment, and then sum these bounds from $t = 1$ to $T/\delta$. Importantly, we use the facts that $\ell_t(\hat{\lambda}_t) = \tilde{\ell}_t(\hat{\theta}_t)$ and $\ell_t(\lambda_t) = \tilde{\ell}_t(\theta_t)$. We start with some important properties. The first is the first-order optimality condition of line 4 of Algorithm 3, which states that for any $\theta \in \Theta$ we have
$$\langle \eta_t \nabla \tilde{\ell}_t(\hat{\theta}_t) + \nabla Z(\tilde{\theta}_{t+1}) - \nabla Z(\hat{\theta}_t),\ \tilde{\theta}_{t+1} - \theta \rangle \le 0.$$
By rearranging terms, we get the form that will be used:
$$\langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \tilde{\theta}_{t+1} - \theta_t \rangle \le \frac{1}{\eta_t} \langle \nabla Z(\hat{\theta}_t) - \nabla Z(\tilde{\theta}_{t+1}),\ \tilde{\theta}_{t+1} - \theta_t \rangle. \tag{20}$$
The second important fact is that a Bregman divergence induced by a function $Z$ takes the form $D(a \| b) = Z(a) - Z(b) - \langle \nabla Z(b), a - b \rangle$, and therefore we have the following:
$$D(a \| b) - D(a \| c) - D(c \| b) = \langle \nabla Z(b) - \nabla Z(c),\ c - a \rangle. \tag{21}$$
Using these key facts, we start by bounding the excess loss at a single time point:
$$\tilde{\ell}_t(\hat{\theta}_t) - \tilde{\ell}_t(\theta_t) \le \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \theta_t \rangle \tag{22a}$$
$$= \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \tilde{\theta}_{t+1} - \theta_t \rangle + \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \tilde{\theta}_{t+1} \rangle$$
$$\le \frac{1}{\eta_t} \langle \nabla Z(\hat{\theta}_t) - \nabla Z(\tilde{\theta}_{t+1}),\ \tilde{\theta}_{t+1} - \theta_t \rangle + \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \tilde{\theta}_{t+1} \rangle \tag{22b}$$
$$= \frac{1}{\eta_t} \left( D(\theta_t \| \hat{\theta}_t) - D(\theta_t \| \tilde{\theta}_{t+1}) - D(\tilde{\theta}_{t+1} \| \hat{\theta}_t) \right) + \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \tilde{\theta}_{t+1} \rangle. \tag{22c}$$
Equation 22a is due to the convexity of $\tilde{\ell}_t$, Equation 22b uses Equation 20, and Equation 22c applies Equation 21. We add and subtract the necessary Bregman divergences and bound the differences separately:
$$\tilde{\ell}_t(\hat{\theta}_t) - \tilde{\ell}_t(\theta_t) \le \frac{1}{\eta_t} \left( D(\theta_t \| \hat{\theta}_t) - D(\theta_{t+1} \| \hat{\theta}_{t+1}) \right) + \frac{1}{\eta_t} \left( D(\theta_{t+1} \| \hat{\theta}_{t+1}) - D(\tilde{\Phi}_t(\theta_t, W) \| \hat{\theta}_{t+1}) \right) + \frac{1}{\eta_t} \left( D(\tilde{\Phi}_t(\theta_t, W) \| \hat{\theta}_{t+1}) - D(\theta_t \| \tilde{\theta}_{t+1}) \right) - \frac{1}{\eta_t} D(\tilde{\theta}_{t+1} \| \hat{\theta}_t) + \langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \tilde{\theta}_{t+1} \rangle.$$
We will bound each of these terms separately.
$$D(\theta_{t+1} \| \hat{\theta}_{t+1}) - D(\tilde{\Phi}_t(\theta_t, W) \| \hat{\theta}_{t+1}) = Z(\theta_{t+1}) - Z(\tilde{\Phi}_t(\theta_t, W)) + \langle \nabla Z(\hat{\theta}_{t+1}),\ \tilde{\Phi}_t(\theta_t, W) - \theta_{t+1} \rangle \tag{23a}$$
$$\le \langle \nabla Z(\hat{\theta}_{t+1}) - \nabla Z(\theta_{t+1}),\ \tilde{\Phi}_t(\theta_t, W) - \theta_{t+1} \rangle \tag{23b}$$
$$\le \| \nabla Z(\hat{\theta}_{t+1}) - \nabla Z(\theta_{t+1}) \|_2\, \| \tilde{\Phi}_t(\theta_t, W) - \theta_{t+1} \|_2 \tag{23c}$$
$$= \| \delta (\hat{\lambda}_{t+1} - \lambda_{t+1}) \|_2\, \| \log(\delta \Phi_t(\lambda_t, W)) - \log(\delta \lambda_{t+1}) \|_2 \le \delta \sqrt{p}\, \lambda_{\max}\, \frac{1}{\lambda_{\min}}\, \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2. \tag{23d}$$
Equation 23a uses the definition of the Bregman divergence, Equation 23b the convexity of $Z$, Equation 23c the Cauchy–Schwarz inequality, and Equation 23d the bounded domain $\lambda \in [\lambda_{\min}, \lambda_{\max}]^p$ with $\lambda_{\min} > 0$ and the Lipschitz property of the natural logarithm on $[\lambda_{\min}, \lambda_{\max}]$. The next term we bound using the contractivity assumption on $\Phi_t$:
$$D(\tilde{\Phi}_t(\theta_t, W) \| \hat{\theta}_{t+1}) - D(\theta_t \| \tilde{\theta}_{t+1}) = D(\tilde{\Phi}_t(\theta_t, W) \| \tilde{\Phi}_t(\tilde{\theta}_{t+1}, W)) - D(\theta_t \| \tilde{\theta}_{t+1}) \le 0.$$
To bound the final term, we use the strong convexity of $Z(\theta)$, which implies $D(\theta_1 \| \theta_2) \ge \frac{\delta \lambda_{\min}}{2} \| \theta_1 - \theta_2 \|_2^2$ (Equation 18):
$$\langle \nabla \tilde{\ell}_t(\hat{\theta}_t),\ \hat{\theta}_t - \tilde{\theta}_{t+1} \rangle - \frac{1}{\eta_t} D(\tilde{\theta}_{t+1} \| \hat{\theta}_t) \le \| \nabla \tilde{\ell}_t(\hat{\theta}_t) \|_2\, \| \hat{\theta}_t - \tilde{\theta}_{t+1} \|_2 - \frac{\delta \lambda_{\min}}{2 \eta_t} \| \tilde{\theta}_{t+1} - \hat{\theta}_t \|_2^2 \tag{24a}$$
$$\le \frac{\eta_t}{2 \delta \lambda_{\min}} \| \nabla \tilde{\ell}_t(\hat{\theta}_t) \|_2^2 + \frac{\delta \lambda_{\min}}{2 \eta_t} \| \tilde{\theta}_{t+1} - \hat{\theta}_t \|_2^2 - \frac{\delta \lambda_{\min}}{2 \eta_t} \| \tilde{\theta}_{t+1} - \hat{\theta}_t \|_2^2 \tag{24b}$$
$$\le \frac{\eta_t\, p\, \delta\, (\lambda_{\max} + x_{\max})^2}{2 \lambda_{\min}}. \tag{24c}$$
In the above, Equation 24a uses the Cauchy–Schwarz inequality and Equation 18, Equation 24b uses Young's inequality, and Equation 24c applies Equation 17. Combining all the terms, we get an upper bound on the excess loss at any given time point:
$$\ell_t(\hat{\lambda}_t) - \ell_t(\lambda_t) \le \frac{1}{\eta_t} \left( D(\theta_t \| \hat{\theta}_t) - D(\theta_{t+1} \| \hat{\theta}_{t+1}) \right) + \frac{\delta \sqrt{p}\, \lambda_{\max}}{\eta_t\, \lambda_{\min}} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2 + \frac{\eta_t\, p\, \delta\, (\lambda_{\max} + x_{\max})^2}{2 \lambda_{\min}}.$$
To get the final bound, we sum these terms over the entire length of the optimization process, $t = 1, \ldots, T/\delta$. To do this, we first show how the Bregman divergence terms telescope. In the following lines, we use the assumption that $\eta_t$ is positive and non-increasing in $t$, as well as the upper bound on the Bregman divergence from Equation 19:
$$\sum_{t=1}^{T/\delta} \frac{1}{\eta_t} \left( D(\theta_t \| \hat{\theta}_t) - D(\theta_{t+1} \| \hat{\theta}_{t+1}) \right) = \frac{1}{\eta_1} D(\theta_1 \| \hat{\theta}_1) - \frac{1}{\eta_{T/\delta}} D(\theta_{T/\delta+1} \| \hat{\theta}_{T/\delta+1}) + \sum_{t=2}^{T/\delta} D(\theta_t \| \hat{\theta}_t) \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right)$$
$$\le \frac{1}{\eta_1} D(\theta_1 \| \hat{\theta}_1) - \frac{1}{\eta_{T/\delta}} D(\theta_{T/\delta+1} \| \hat{\theta}_{T/\delta+1}) + \frac{\delta \lambda_{\max}^2\, p}{\lambda_{\min}} \left( \frac{1}{\eta_{T/\delta}} - \frac{1}{\eta_1} \right) \le \frac{\delta \lambda_{\max}^2\, p}{\eta_{T/\delta}\, \lambda_{\min}}.$$
Using this, we combine all the terms to get the final bound:
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \ell_t(\lambda_t) \le \frac{\delta \lambda_{\max}^2\, p}{\eta_{T/\delta}\, \lambda_{\min}} + \frac{p\, \delta\, (\lambda_{\max} + x_{\max})^2}{2 \lambda_{\min}} \sum_{t=1}^{T/\delta} \eta_t + \frac{\delta \sqrt{p}\, \lambda_{\max}}{\lambda_{\min}} \sum_{t=1}^{T/\delta} \frac{1}{\eta_t} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2.$$
If the time horizon $T$ is known, we choose $\eta_1 = \eta_2 = \cdots = \eta_{T/\delta}$ to be a constant proportional to $1/\sqrt{T/\delta}$; if $T$ is unknown, we choose $\eta_t$ proportional to $1/\sqrt{t}$. For the former choice the regret bound becomes
$$\left( \frac{\sqrt{\delta}\, \lambda_{\max}^2\, p}{\lambda_{\min}} + \frac{\sqrt{p}\, \lambda_{\max}}{\lambda_{\min}} \sum_{t=1}^{T/\delta} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2 + \frac{p\, (\lambda_{\max} + x_{\max})^2}{2 \lambda_{\min}} \right) \sqrt{T}.$$
For the latter choice, we use the fact that $\sum_{t=1}^{T/\delta} \frac{1}{\sqrt{t}} \le 1 + \int_1^{T/\delta} \frac{dt}{\sqrt{t}} = 2\sqrt{T/\delta} - 1 < 2\sqrt{T/\delta}$. This brings the overall bound to
$$\left( \frac{\sqrt{\delta}\, \lambda_{\max}^2\, p}{\lambda_{\min}} + \frac{\sqrt{p}\, \lambda_{\max}}{\lambda_{\min}} \sum_{t=1}^{T/\delta} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2 + \frac{p\, (\lambda_{\max} + x_{\max})^2}{\lambda_{\min}} \right) \sqrt{T}.$$
Both bounds are of order $\sqrt{T}$, proving the result.

APPENDIX F: PROOF OF LEMMA 3

The proof is a simple inductive argument. We start with the base case, $t = 1$. Since Algorithm 1 begins with $\hat{\lambda}_1 = \bar{\mu}$, we have
$$\hat{\lambda}_1^{W_1} = \hat{\lambda}_1^{W_2} + (W_1 - W_2)\, \mathbf{0} = \bar{\mu}.$$
Therefore the result holds for $t = 1$ with $K_1 = \mathbf{0}$. Now we show the inductive step.
If we use the update form from Equation 9, we can explicitly compute the difference for different values of $W$:
$$\hat{\lambda}_{t+1}^{W_1} - \hat{\lambda}_{t+1}^{W_2} = (1 - \eta_t)\, \alpha^\delta \left( \hat{\lambda}_t^{W_1} - \hat{\lambda}_t^{W_2} \right) + (W_1 - W_2)\, y_t = (W_1 - W_2)\left( (1 - \eta_t)\, \alpha^\delta K_t + y_t \right) = (W_1 - W_2)\, K_{t+1}.$$
Here we assumed that $\hat{\lambda}_t^{W_1} = \hat{\lambda}_t^{W_2} + (W_1 - W_2) K_t$ and then proved that the relation holds at the next step with $K_{t+1} = (1 - \eta_t)\, \alpha^\delta K_t + y_t$, as the lemma states.

APPENDIX G: PROOF OF THEOREM 2

In order to bound the regret of this algorithm, we split the regret into two difference terms and bound them individually:
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T/\delta} \ell_t(\lambda_t) = \left( \sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t^W) \right) + \left( \sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t^W) - \sum_{t=1}^{T/\delta} \ell_t(\lambda_t) \right).$$
Here, $\hat{\lambda}_t$ is the output of Algorithm 2 at time $t$, and $\hat{\lambda}_t^W$ is the output Algorithm 1 would have produced had we used $W$, for any given $W$. We will show a bound which holds for all $W \in \mathcal{W}$. The bound on the second difference follows directly from Theorem 1:
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t^W) - \sum_{t=1}^{T/\delta} \ell_t(\lambda_t) \le C_1 \left( 1 + \sum_{t=1}^{T/\delta} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2 \right) \sqrt{T}. \tag{25}$$
In order to bound the first difference, we use the result of Lemma 3 to express the loss function in terms of $W$:
$$\ell_t(\hat{\lambda}_t^W) = \langle \mathbf{1}, \delta \hat{\lambda}_t^W \rangle - \langle x_t, \log(\delta \hat{\lambda}_t^W) \rangle = \langle \mathbf{1}, \delta (\hat{\lambda}_t^0 + W K_t) \rangle - \langle x_t, \log(\delta (\hat{\lambda}_t^0 + W K_t)) \rangle = g_t(W).$$
This loss function is convex in $W$ and therefore allows searching among the outputs of Algorithm 1 for different values of $W$ at every time step, to find which $W$ would have produced the best estimate for the current data. The important observation now is that every output $\hat{\lambda}_t$ of Algorithm 2 is the output of Algorithm 1 for the specific value $\hat{W}_t$. To see this, notice that if $\hat{\lambda}_t = \hat{\lambda}_t^{\hat{W}_t}$, then $\tilde{\lambda}_{t+1}$ is equivalent to the output of line 4 of Algorithm 1 as long as $\tilde{\lambda}_{t+1} \in \operatorname{Int} \Lambda$.
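Lemma 3's bookkeeping can be illustrated numerically: for any update of the assumed form $\hat{\lambda}_{t+1} = (1-\eta_t)\alpha^\delta \hat{\lambda}_t + c_t + W y_t$, where $c_t$ is a stand-in for the $W$-independent part of Equation 9 (an assumption for illustration), two trajectories run with $W_1$ and $W_2$ differ by exactly $(W_1 - W_2) K_t$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, alpha, delta = 3, 0.9, 0.5
W1, W2 = rng.uniform(size=(p, p)), rng.uniform(size=(p, p))
mu_bar = np.full(p, 0.2)

lam1, lam2, K = mu_bar.copy(), mu_bar.copy(), np.zeros(p)  # base case: K_1 = 0
for t in range(1, 30):
    eta = 1 / np.sqrt(t)
    y, c = rng.uniform(size=p), rng.uniform(size=p)  # c: W-independent part
    lam1 = (1 - eta) * alpha**delta * lam1 + c + W1 @ y
    lam2 = (1 - eta) * alpha**delta * lam2 + c + W2 @ y
    K = (1 - eta) * alpha**delta * K + y             # K_{t+1} as in Lemma 3
    assert np.allclose(lam1 - lam2, (W1 - W2) @ K)
```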
Then in lines 9 and 10 of Algorithm 2 we apply the dynamics and Lemma 3, producing $\hat{\lambda}_{t+1} = \hat{\lambda}_{t+1}^{\hat{W}_{t+1}}$. Since $\hat{\lambda}_1 = \bar{\mu} = \hat{\lambda}_1^W$ for any $W$, this shows that at any time $\hat{\lambda}_t = \hat{\lambda}_t^{\hat{W}_t}$, and hence
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t^W) = \sum_{t=1}^{T/\delta} g_t(\hat{W}_t) - \sum_{t=1}^{T/\delta} g_t(W).$$
Since line 8 of Algorithm 2 performs a gradient descent method [36] to produce the estimates $\hat{W}_t$, we know that the difference is bounded as
$$\sum_{t=1}^{T/\delta} g_t(\hat{W}_t) - \min_{W \in \mathcal{W}} \sum_{t=1}^{T/\delta} g_t(W) \le C_2 \sqrt{T/\delta}. \tag{26}$$
Combining Equation 25, taken with $W = \arg\min_{W \in \mathcal{W}} \sum_{t=1}^{T/\delta} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2$, with Equation 26 gives the result:
$$\sum_{t=1}^{T/\delta} \ell_t(\hat{\lambda}_t) - \sum_{t=1}^{T/\delta} \ell_t(\lambda_t) \le C \left( 1 + \min_{W \in \mathcal{W}} \sum_{t=1}^{T/\delta} \| \Phi_t(\lambda_t, W) - \lambda_{t+1} \|_2 \right) \sqrt{T}.$$

APPENDIX H: ONLINE GRADIENT DESCENT

As a comparison to our methods, we describe an implementation of Online Gradient Descent (OGD) that can be used to learn the network weights $W$. In order to do this, we take the rate $\hat{\lambda}_t$ to be a direct function of the network estimate $\hat{W}_t$, using the exponential influence function of Section IV and the loss function described in Equation 7:
$$\lambda_t(W) = \bar{\mu} + \sum_{\tau=1}^{t-1} \alpha^{\delta(t-1-\tau)}\, W y_\tau = \bar{\mu} + W K_t, \qquad K_t \triangleq \sum_{\tau=1}^{t-1} \alpha^{\delta(t-1-\tau)}\, y_\tau = \alpha^\delta K_{t-1} + y_{t-1},$$
$$g_t(W) = \langle \delta \lambda_t(W), \mathbf{1} \rangle - \langle \log(\delta \lambda_t(W)), x_t \rangle, \qquad \nabla g_t(W) = \delta\, \mathbf{1} K_t^\top - \operatorname{Diag}(\lambda_t(W))^{-1} x_t K_t^\top.$$
Using these values as a framework, we can derive an Online Gradient Descent algorithm for learning the network in a Hawkes process.
Algorithm 4 Learning Network W with Online Gradient Descent
1: Initialize $\hat{W}_1 = W_0$, $K_1 = \mathbf{0}$
2: for $t = 1, \ldots, T/\delta$ do
3:   Observe $x_t$ and incur loss $g_t(\hat{W}_t) = \langle \mathbf{1}, \delta(\bar{\mu} + \hat{W}_t K_t) \rangle - \langle x_t, \log \delta(\bar{\mu} + \hat{W}_t K_t) \rangle$
4:   Set $\nabla g_t(\hat{W}_t) = \delta\, \mathbf{1} K_t^\top - \operatorname{Diag}(\lambda_t(\hat{W}_t))^{-1} x_t K_t^\top$
5:   Set $\hat{W}_{t+1} = \operatorname{proj}_{\mathcal{W}}\left( \hat{W}_t - \rho_t \nabla g_t(\hat{W}_t) \right)$
6:   Define $y_t \triangleq \sum_{\bar{\tau}_n = \delta t} e_{k_n}\, \alpha^{\delta(t+1) - \tau_n}$
7:   Set $K_{t+1} = \alpha^\delta K_t + y_t$
8: end for

Comparing Algorithms 2 and 4, we see that our proposed algorithm is a generalization of OGD: instead of only learning the network weights and plugging them into the equation for the current rate, we also allow the value of the rate to deviate slightly from the direct computation. In particular, OGD is simply our algorithm with the parameter $\eta_t = 0$ for all time steps $t$.

APPENDIX I: NOTATION LEGEND

$p$ — Number of actors in the network
$k_n$ — Actor involved in the $n$th event
$\tau_n$ — Time of the $n$th event
$\mu_k(\tau)$ — Continuous-time rate of the $k$th actor at time $\tau$
$N_{k,\tau}$ — Number of events involving actor $k$ up to and including time $\tau$
$N_\tau$ — Total number of events up to and including time $\tau$
$\mathcal{H}_T$ — All observed events (actors and times) up to and including time $T$
$\bar{\mu}_k$ — Baseline rate for actor $k$
$W$ — Weighted adjacency matrix for the network
$h(t)$ — Influence function describing how one event influences other actors over time
$T$ — Total sensing time of the process
$\delta$ — Length of the time window used for discretization
$x_{t,k}$ — Number of times actor $k$ acts during the time window $(\delta(t-1), \delta t]$
$\lambda_{t,k}$ — Discrete-time rate of actor $k$ at time $\delta t$, approximating $\mu_k(\delta t)$
$\bar{\tau}_n$ — Discrete time related to $\tau_n$ by $\bar{\tau}_n = \delta \lceil \tau_n / \delta \rceil$
$L_T(\mu)$ — Negative log-likelihood of the Hawkes process at time $T$
$L_T^{(\delta)}$ — Discrete-time approximation to $L_T(\mu)$
$\ell_t(\lambda)$ — Discrete-time loss of $\lambda$ at time $t$
$\theta$ — Dual parameter to $\lambda$, defined as $\theta = \log(\delta \lambda)$
$\tilde{\ell}_t(\theta)$ —
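A minimal sketch of Algorithm 4 on synthetic counts follows. Two simplifications not specified above are assumed for illustration: the projection onto $\mathcal{W}$ is taken to be entrywise clipping to $[0, w_{\max}]$, and each window's events are treated as occurring at its right edge so that $y_t \approx \alpha^\delta x_t$; the ground-truth network and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p, T, delta, alpha = 4, 300, 1.0, 0.5
mu_bar = np.full(p, 1.0)
W_true = 0.3 * np.fliplr(np.eye(p))          # stand-in ground-truth network
rho, w_max = 0.01, 1.0                       # step size and weight cap (assumed)

W_hat, K = np.zeros((p, p)), np.zeros(p)
for t in range(T):
    x = rng.poisson(delta * (mu_bar + W_true @ K))   # synthetic counts
    lam_hat = mu_bar + W_hat @ K                     # rate implied by estimate
    # gradient of g_t(W) = <1, delta*lam(W)> - <x, log(delta*lam(W))>
    grad = np.outer(delta * np.ones(p) - x / lam_hat, K)
    W_hat = np.clip(W_hat - rho * grad, 0.0, w_max)  # projected OGD step
    K = alpha**delta * (K + x)                       # K_{t+1} = a^d K_t + y_t
print(np.round(W_hat, 2))
```

The decay factor is kept well below one so the simulated process stays subcritical; with heavier data and smaller step sizes the estimate would be expected to approach the true weights.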
Instantaneous loss of $\theta$ at time $t$; $\tilde{\ell}_t(\theta) = \ell_t(\lambda)$
$\lambda_{\min}, \lambda_{\max}$ — Smallest and largest values $\lambda$ is allowed to take
$x_{\max}$ — Maximum number of times an actor can act per unit time
$\Phi_t$ — Dynamical model in the $\lambda$ space at time $t$
$\tilde{\Phi}_t$ — Dynamical model in the $\theta$ space at time $t$
$y_t$ — Instantaneous vector of actors that acted at time $\delta t$, weighted by the influence function
$D(\theta_1 \| \theta_2)$ — Bregman divergence between $\theta_1$ and $\theta_2$
$\lambda_t^W$ — Instantaneous rate generated using network values $W$
$K_t$ — Vector used to convert estimates of Algorithm 1 generated with $W_1$ to estimates generated with $W_2$

REFERENCES

[1] M. Raginsky, R. Willett, C. Horn, J. Silva, and R. Marcia, "Sequential anomaly detection in the presence of noise and limited feedback," IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 5544–5562, 2012.
[2] J. Silva and R. Willett, "Hypergraph-based anomaly detection in very large networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 563–569, 2009, doi:10.1109/TPAMI.2008.232.
[3] A. Stomakhin, M. B. Short, and A. Bertozzi, "Reconstruction of missing data in social networks based on temporal patterns of interactions," Inverse Problems, vol. 27, no. 11, 2011.
[4] C. Blundell, K. A. Heller, and J. M. Beck, "Modelling reciprocating relationships with Hawkes processes," in Proc. NIPS, 2012.
[5] K. Zhou, H. Zha, and L. Song, "Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes," in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[6] E. N. Brown, R. E. Kass, and P. P. Mitra, "Multiple neural spike train data analysis: state-of-the-art and future challenges," Nature Neuroscience, vol. 7, no. 5, pp. 456–461, 2004.
[7] T. P. Coleman and S. Sarma, "Using convex optimization for nonparametric statistical analysis of point processes," in Proc. ISIT, 2007.
[8] A. C. Smith and E. N.
Brown, "Estimating a state-space model from point process observations," Neural Computation, vol. 15, pp. 965–991, 2003.
[9] M. Hinne, T. Heskes, and M. A. J. van Gerven, "Bayesian inference of whole-brain networks," arXiv:1202.1696 [q-bio.NC], 2012.
[10] M. Ding, C. E. Schroeder, and X. Wen, "Analyzing coherent brain networks with Granger causality," in Conf. Proc. IEEE Eng. Med. Biol. Soc., 2011, pp. 5916–5918.
[11] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, "Spatio-temporal correlations and visual signalling in a complete neuronal population," Nature, vol. 454, pp. 995–999, 2008.
[12] M. S. Masud and R. Borisyuk, "Statistical technique for analysing functional connectivity of multiple spike trains," Journal of Neuroscience Methods, vol. 196, no. 1, pp. 201–219, 2011.
[13] Y. Aït-Sahalia, J. Cacho-Diaz, and R. J. A. Laeven, "Modeling financial contagion using mutually exciting jump processes," Tech. Rep., National Bureau of Economic Research, 2010.
[14] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data, vol. 53, Cambridge University Press, 2013.
[15] R. Engle, "GARCH 101: The use of ARCH/GARCH models in applied econometrics," Journal of Economic Perspectives, pp. 157–168, 2001.
[16] M. Kuperman and G. Abramson, "Small world effect in an epidemiological model," Physical Review Letters, vol. 86, no. 13, pp. 2909, 2001.
[17] D. Vere-Jones and T. Ozaki, "Some examples of statistical estimation applied to earthquake data," Ann. Inst. Statist. Math., vol. 34, pp. 189–207, 1982.
[18] Y. Ogata, "Seismicity analysis through point-process modeling: A review," Pure and Applied Geophysics, vol. 155, no. 2-4, pp. 471–507, 1999.
[19] F. P. Schoenberg, "Facilitated estimation of ETAS," Bulletin of the Seismological Society of America, vol. 103, pp. 601–605, 2013.
[20] A. G.
Hawkes, "Spectra of some self-exciting and mutually exciting point processes," Biometrika, vol. 58, pp. 83–90, 1971.
[21] A. G. Hawkes, "Point spectra of some mutually exciting point processes," Journal of the Royal Statistical Society, Series B (Methodological), vol. 33, no. 3, pp. 438–443, 1971.
[22] D. J. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes, Vol. I: Probability and its Applications, Springer-Verlag, New York, second edition, 2003.
[23] E. C. Hall and R. M. Willett, "Online convex optimization in dynamic environments," IEEE Journal of Selected Topics in Signal Processing, vol. 9, 2015.
[24] E. C. Hall and R. M. Willett, "Dynamical models and tracking regret in online convex programming," in Proc. International Conference on Machine Learning (ICML), 2013, arXiv:1301.1254.
[25] A. Simma and M. I. Jordan, "Modeling events with cascades of Poisson processes," Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010), 2010.
[26] G. Mohler, "Modeling and estimation of multi-source clustering in crime and security data," Annals of Applied Statistics, vol. 7, pp. 1525–1539, 2013.
[27] D. Sornette and S. Utkin, "Limits of declustering methods for disentangling exogenous from endogenous events in time series with foreshocks, main shocks and aftershocks," Physical Review E, 2009.
[28] A. Veen and F. P. Schoenberg, "Estimation of space-time branching process models in seismology using an EM-type algorithm," Journal of the American Statistical Association, 2008.
[29] S. W. Linderman and R. P. Adams, "Discovering latent network structure in point process data," 2014.
[30] S. W. Linderman and R. P. Adams, "Scalable Bayesian inference for excitatory point process networks," arXiv:1507.03228v1, 2015.
[31] P. Reynaud-Bouret and S.
Schbath, "Adaptive estimation for Hawkes processes; application to genome analysis," Annals of Statistics, vol. 38, no. 5, pp. 2781–2822, 2010.
[32] N. R. Hansen, P. Reynaud-Bouret, and V. Rivoirard, "LASSO and probabilistic inequalities for multivariate point processes," arXiv preprint arXiv:1208.0570, 2012.
[33] S. A. Pasha and V. Solo, "Hawkes-Laguerre reduced rank model for point processes," in ICASSP, 2013.
[34] A. Dassios and H. Zhao, "Exact simulation of Hawkes process with exponentially decaying intensity," Electronic Communications in Probability, vol. 18, no. 62, pp. 1–13, 2013.
[35] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient descent," in Proc. Int. Conf. on Machine Learning (ICML), 2003, pp. 928–936.
[36] A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex programming," Operations Research Letters, vol. 31, pp. 167–175, 2003.
[37] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, "Composite objective mirror descent," in Conf. on Learning Theory (COLT), 2010.
[38] A. Rakhlin and K. Sridharan, "Online learning with predictable sequences," arXiv:1208.3728, 2012.
[39] J. Leskovec, L. Backstrom, and J. Kleinberg, "Meme-tracking and the dynamics of the news cycle," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
[40] S. Wright, R. Nowak, and M. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Transactions on Signal Processing, vol. 57, pp. 2479–2493, 2009.
