Activity date estimation in timestamped interaction networks

We propose in this paper a new generative model for graphs that uses a latent space approach to explain timestamped interactions. The model is designed to provide global estimates of activity dates in historical networks where only the interaction da…

Authors: Fabrice Rossi (SAMM), Pierre Latouche (SAMM)

Activity date estimation in timestamped interaction networks
Fabrice Rossi and Pierre Latouche
SAMM EA 4543, Université Paris 1 Panthéon-Sorbonne, 90, rue de Tolbiac, 75634 Paris cedex 13, France

Abstract. We propose in this paper a new generative model for graphs that uses a latent space approach to explain timestamped interactions. The model is designed to provide global estimates of activity dates in historical networks where only the interaction dates between agents are known with reasonable precision. Experimental results show that the model provides better results than local averages in dense enough networks.

1 Introduction

In this paper, we study interactions between agents that are recorded on a time scale larger than the expected lifespan of the agents. A typical instance of such interactions is property ownership records, in which a house or a piece of land exists for a very long time period and passes from owner to owner, outliving them. Ownerships are generally recorded in great detail in well-kept archives, while the lives of the owners are generally known in much less detail. In [1, 4] for instance, the primary information source consists of notarial acts recording different forms of ownership of lands and related objects (according to the French feudal laws) during a long time period (up to 300 years for the digitized version). On the one hand, most of the notarial acts have a proper date, precise at least at the year level, while, on the other hand, little is known about the tenants (a.k.a. the "owners") involved in the acts, apart from their names. It therefore seems interesting to infer information about the tenants from the acts, namely to estimate a living period for each tenant based on the acts in which he/she is involved.

More generally, we consider a graph whose vertices represent agents and whose edges represent interactions between those agents.
We consider here simple graphs, but the approach generalizes immediately to multigraphs in which several edges can link two agents. Each interaction is timestamped and our goal is to estimate a central time stamp for each agent in such a way that interaction dates are compatible with the time stamps of the agents and with expert knowledge on the expected life span of the agents.

The main difficulty of this task comes from inconsistencies observed in real world historical data: due to name ambiguities, associations between agents and interactions are sometimes incorrect. In network parlance this corresponds to some rewiring of the graph: while we should get a connection between $a$ and $b$, a naming ambiguity between vertices $b$ and $c$ wrongly assigns this interaction to $a$ and $c$.

We propose a solution based on a generative model inspired by the latent space model of [3]: given the interaction dates, the model generates interaction networks that fulfill the compatibility constraints described above. Note that the proposed setting is quite different from classical temporal graph modeling (see e.g. [2, 5]) where the primary goal generally consists in understanding the evolution of the structure of the network through time.

The rest of the paper first introduces the generative model as well as the maximum likelihood estimation strategy. It then summarizes experimental results on simulated data.

2 A Generative Model

We observe an undirected graph $G$ characterized by a vertex set $V = \{1, \ldots, n\}$ and a binary adjacency matrix $A$. When $A_{ij} = 1$, that is when node $i$ is connected to node $j$, we are given an associated interaction date specified as a positive real number $D_{ij}$. We consider a generative model for $(A, D)$ based on latent activity date variables. More precisely, each vertex $i$ is associated with a positive (unobserved) real number $Z_i$ which summarizes the activity period of said vertex.
Then, we assume that the probability of having a connection between $i$ and $j$ is linked to the temporal distance $|Z_i - Z_j|$. We also assume that, knowing $Z = (Z_i)_{1 \le i \le n}$, the $A_{ij}$ are independent. Finally, when $i$ and $j$ are connected, we assume that their interaction date is randomly distributed between $Z_i$ and $Z_j$ (independently of all other variables). In more technical terms, the conditional independence assumptions lead to the following generative model, where $\theta$ denotes numerical parameters:

$$p(A, D \mid Z, \theta) = \prod_{i \neq j,\, A_{ij}=0} P(A_{ij} = 0 \mid Z_i, Z_j, \theta) \times \prod_{i \neq j,\, A_{ij}=1} p(D_{ij} \mid A_{ij} = 1, Z_i, Z_j, \theta)\, P(A_{ij} = 1 \mid Z_i, Z_j, \theta). \tag{1}$$

2.1 A specific model

We now specialize the generic form of equation (1). Inspired by [3], we use a logistic regression model for the connection probabilities, that is

$$\log \frac{P(A_{ij} = 1 \mid Z_i, Z_j, \alpha, \beta)}{P(A_{ij} = 0 \mid Z_i, Z_j, \alpha, \beta)} = \alpha - \beta (Z_i - Z_j)^2, \tag{2}$$

while the interaction date $D_{ij}$ is simply modelled with a Gaussian distribution around $\frac{Z_i + Z_j}{2}$, that is

$$D_{ij} \mid Z_i, Z_j, \sigma \sim \mathcal{N}\left(\frac{Z_i + Z_j}{2}, \sigma^2\right). \tag{3}$$

Then, up to constants, the log-likelihood of the data is given by

$$L(A, D \mid Z, \sigma, \alpha, \beta) = \sum_{i \neq j,\, A_{ij}=1} \left( -\log \sigma - \frac{1}{2\sigma^2} \left( D_{ij} - \frac{Z_i + Z_j}{2} \right)^2 \right) + \sum_{i \neq j} \left( A_{ij} \left( \alpha - \beta (Z_i - Z_j)^2 \right) - \log\left( 1 + e^{\alpha - \beta (Z_i - Z_j)^2} \right) \right). \tag{4}$$

Connection probabilities are not identical to the ones used in [3] for two reasons. Firstly, we use a quadratic term $(Z_i - Z_j)^2$ rather than the original absolute value $|Z_i - Z_j|$ to avoid numerical instabilities linked to the non-differentiability of the latter. Secondly, we add a $\beta$ parameter to compensate for the relatively large values found in real world historical networks for $(Z_i - Z_j)^2$, which can be of the order of 2500. In [3], the absence of the first term in equation (4) allows for free scaling effects of the $Z_i$, something that is not possible here.
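To make equation (4) concrete, the following is a minimal sketch of how the log-likelihood could be evaluated in plain Python. It is an illustration, not the authors' implementation: the function name `log_likelihood` and the list-of-lists representation of $A$ and $D$ are assumptions made here. The sums run over ordered pairs $i \neq j$, as in equation (4), and the term $\log(1 + e^{\eta})$ is computed in a numerically stable way.

```python
import math


def log_likelihood(A, D, Z, sigma, alpha, beta):
    """Log-likelihood of equation (4), up to additive constants.

    Hypothetical helper: A is a symmetric 0/1 adjacency matrix (list of
    lists), D holds the interaction dates (only read where A[i][j] == 1),
    and Z is the list of latent activity dates.
    """
    n = len(Z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Log-odds of a connection, equation (2).
            eta = alpha - beta * (Z[i] - Z[j]) ** 2
            # Stable evaluation of log(1 + exp(eta)).
            softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
            # Bernoulli part: A_ij * eta - log(1 + exp(eta)).
            total += A[i][j] * eta - softplus
            if A[i][j] == 1:
                # Gaussian part for the interaction date, equation (3).
                resid = D[i][j] - 0.5 * (Z[i] + Z[j])
                total += -math.log(sigma) - resid ** 2 / (2.0 * sigma ** 2)
    return total


# Two connected agents whose interaction is dated 1300.
A = [[0, 1], [1, 0]]
D = [[0.0, 1300.0], [1300.0, 0.0]]
close = log_likelihood(A, D, [1300.0, 1300.0], sigma=1.0, alpha=0.0, beta=0.001)
far = log_likelihood(A, D, [1300.0, 1400.0], sigma=1.0, alpha=0.0, beta=0.001)
print(close > far)
```

In this toy example, activity dates that coincide with the interaction date score higher than dates 100 years apart, since both the connection term and the Gaussian date term penalize the second configuration.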
2.2 Estimation

We use a maximum likelihood approach implemented via a gradient descent based algorithm. A natural initialization for $Z$ is provided by

$$\hat{Z}_i = \frac{\sum_{j,\, A_{ij}=1} D_{ij}}{\sum_{j} A_{ij}},$$

where $\hat{Z}_i$ takes the average value of the dates of the outgoing/incoming edges. This corresponds to an estimation of the activity dates based only on local information.

The initial values of the other parameters are chosen as follows. As $\sigma$ models the life span of actors, we use 50 as a starting point, allowing interactions to happen in a very large two hundred year interval centered on the average activity date of two actors. The $\alpha$ parameter is initialized such that the connection probability equals the observed network density when all the $Z_i$ are equal. Finally, $\beta$ is set to a value that reduces the connection probability to almost zero ($10^{-6}$) when the temporal distance between two actors is above one hundred years.

3 Experimental evaluation

As we do not yet have access to large historical data sets with reliable activity dates, we focus on simulated data to evaluate the model we propose and to understand its strengths and limitations.

3.1 Ideal situation

First, we study the ideal situation in which networks are simulated according to our model. The goal is then to recover the activity dates used to generate the data. We measure the quality of the recovery using the mean square error (MSE) between the real $Z$ and the estimated one. A network is generated as follows:

1. $n = 100$ and the $Z_i$ are uniformly distributed in $[1200, 1400]$;
2. for a given maximal target density $d$ in $[0.1, 0.5]$, $\alpha$ is set to $-\log\left(\frac{1}{d} - 1\right)$. This sets the probability of $A_{ij} = 1$ to $d$ when $Z_i = Z_j$;
3. the expected life span is set to 80. Accordingly, $\beta$ is set to $\frac{\log\left(\frac{1}{\varepsilon} - 1\right) + \alpha}{80^2}$, where $\varepsilon$ is the target probability to have $A_{ij} = 1$ when $|Z_i - Z_j| = 80$. We use $\varepsilon = 10^{-6}$;
4. $\sigma$ is set to 20 (that is, one fourth of the life span);
5. given $\alpha$, $\beta$, $\sigma$ and $Z$, $A$ and $D$ are generated according to the model;
6. finally, we keep only the largest connected component of the obtained graph.

We discard graphs in which there are fewer edges than the number of parameters in the model (that is, three added to the number of vertices).

Results are summarized in Figure 1, which gives the improvement in MSE obtained by using the $Z$ estimates of the model compared to the local averages $\hat{Z}_i$ defined above. The superimposed curve is a kernel based estimate of the relation between the average number of edges per vertex and the improvement in MSE. According to this estimate, the average improvement reaches a positive value above 1.31 edges per vertex. When this number is above 2, the improvement over local estimates is almost always positive.

[Plot omitted. Panel title: "Noise free"; x-axis: average number of edges per vertex; y-axis: MSE improvement.]
Figure 1: Ideal situation: each dot gives the improvement in MSE over local averages when using the proposed model estimates as a function of the number of edges per node. Out of a total of 2152 networks, 3 were excluded from the figure because of very large negative values of the improvement (down to -2200) due to convergence issues. Those networks had fewer than 1.27 edges per vertex.

3.2 Misspecification

We also tested the model under misspecification by replacing the Gaussian distribution for dates by a uniform distribution between $Z_i$ and $Z_j$ for connected vertices. This introduces some form of heteroscedasticity. Results displayed in Figure 2 show a good resistance of the model to misspecification when the number of edges per vertex is above 2.

[Plot omitted. Panel title: "Noise free with uniform dates"; x-axis: average number of edges per vertex; y-axis: MSE improvement.]
Figure 2: Misspecification with uniform connection dates.
10 graphs out of 1079 with an average number of edges per vertex below 1.21 and bad estimates were removed from the figure to keep it readable.

3.3 Rewiring

Finally, we study the robustness of the model with respect to the rewiring issue described in the introduction. Networks are first generated according to the model and then a certain number of edges are randomly rewired by moving one of the end points to a randomly selected vertex while keeping the original date. In the case of very low noise (1% of rewired edges), almost no effect on the improvements is observed (results not shown here). Figure 3 shows results for a higher noise level (5% of rewired edges). As expected, the model, while showing robustness, is impaired by the "false" information attached to rewired edges. According to the kernel estimator, at least 2.1 edges per vertex are needed to reach equal performance between the local average and the model estimates, while above 3, the model outperforms the local estimates.

[Plot omitted. Panel title: "Noise level: 5%"; x-axis: average number of edges per vertex; y-axis: MSE improvement.]
Figure 3: Improvements under rewiring noise. 54 graphs out of 2159 with an average number of edges per vertex below 1.79 and very bad estimates were removed from the figure to keep it readable.

4 Conclusion

Results on simulated data are very satisfactory: above an average number of two edges per vertex, the estimates provided by the model are closer to the ground truth than local averages, even under two forms of misspecification (uniform date distribution and edge rewiring). While the estimator exhibits large variability and can give quite bad results, this happens only under a low number of edges per vertex.
While a direct numerical evaluation of the method on real world historical data is impossible because of the lack of reliable activity dates on large interaction databases, we are working with historians on a qualitative assessment of the results based on dates inferred from well-known figures such as prominent landlords.

References

[1] R. Boulet, B. Jouve, F. Rossi, and N. Villa. Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing, 71(7–9):1257–1273, March 2008.
[2] A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2009.
[3] P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098, 2002.
[4] F. Rossi, N. Villa-Vialaneix, and F. Hautefeuille. Exploration of a large database of French charters with social network methods. In International Medieval Congress (IMC 2011), number 1607-c, Leeds (United Kingdom), July 2011. Institute for Medieval Studies.
[5] E. P. Xing, W. Fu, and L. Song. A state-space mixed membership blockmodel for dynamic network tomography. Annals of Applied Statistics, 4(2):535–566, 2010.
