Node discovery in a networked organization
In this paper, I present a method to solve a node discovery problem in a networked organization. Covert nodes refer to the nodes which are not observable directly. They affect social interactions, but do not appear in the surveillance logs which reco…
Authors: Yoshiharu Maeno
Yoshiharu Maeno
Social Design Group, Bunkyo-ku, Tokyo 112-0011, Japan
maeno.yoshiharu@socialdesigngroup.com

Abstract: In this paper, I present a method to solve a node discovery problem in a networked organization. Covert nodes refer to the nodes which are not observable directly. They affect social interactions, but do not appear in the surveillance logs which record the participants of the social interactions. Discovering the covert nodes is defined as identifying the suspicious logs where the covert nodes would appear if the covert nodes became overt. A mathematical model is developed for the maximal likelihood estimation of the network behind the social interactions and for the identification of the suspicious logs. Precision, recall, and F measure characteristics are demonstrated with the dataset generated from a real organization and the computationally synthesized datasets. The performance is close to the theoretical limit for any covert nodes in networks of any topology and size if the ratio of the number of observations to the number of possible communication patterns is large.

Index Terms: Anomaly detection, Covert node, Maximal likelihood estimation, Node discovery, Social network.

I. INTRODUCTION

Covert nodes in a networked organization refer to persons who affect social interactions (communications among the nodes and resulting collaborative activities), but do not appear in the surveillance logs which record the participants of the social interactions. They are not observable directly. Discovering the covert nodes is defined as identifying the suspicious surveillance logs where the covert nodes would appear if the covert nodes became overt. This problem is called a node discovery problem. Where do we encounter such a problem?
Globally networked clandestine organizations such as terrorists, criminals, or drug smugglers are a great threat to civilized societies. Terrorism attacks cause great economic, social and environmental damage. Active non-routine responses to the attacks are necessary, as well as damage recovery management. The short-term target of the responses is the arrest of the perpetrators. The long-term target of the responses is identifying and dismantling the covert organizational foundation which raises, encourages, and helps the perpetrators. The threat will be mitigated and eliminated by discovering covert leaders and critical conspirators of the clandestine organizations. The difficulty of such discovery lies in the limited capability of surveillance. Information on the leaders and critical conspirators is missing because it is usually hidden by the organization intentionally.

In this paper, I present a method to solve the node discovery problem. The method infers the network topology and probability parameters behind the social interactions (by use of the maximal likelihood estimation), applies an anomaly detection technique to the surveillance logs, and identifies the suspicious surveillance logs. Section III presents the method. Section IV introduces the dataset generated from a real organization and the computationally synthesized datasets for performance tests. Section V demonstrates the precision, recall, and F measure characteristics with the datasets.

II. RELATED WORK

Social network analysis is the study of social structures made of nodes which are linked by one or more specific types of relationship. Examples of the relationship are influence transmission in communication, or presence of trust in collaboration.
Studies in complex networks [3], [24], [6], WWW search and analysis [4], [1], and machine learning of latent variables [21], [22] are major related research topics.

Research interests have been moving from describing organizational nature to discovering unknown phenomena. Link discovery predicts the existence of an unknown link between two nodes from the information on the known attributes of the nodes and the known links [5], [7], [23]. The link discovery techniques are combined with domain-specific heuristics. The collaboration between scientists can be predicted from the published co-authorship [12]. The friendship between people is inferred from the information available on their web pages [2]. Discovery of a network structure [18], [16], [17] and detection of an anomaly in a network [20] are also relevant research topics.

Node discovery predicts the existence of an unknown node around the known nodes from the information on the collective behavior of the network. Related work on node discovery is limited. A heuristic method for node discovery is proposed in [13]. The method applies a clustering algorithm [25], [8] to the nodes in a network, and detects the node which inter-connects clusters at the border of a cluster in clustered networks. The method is applied to analyze the covert social network foundation behind the terrorism disasters [14].

III. METHOD

A. Observation

A node and a link in a social network are a person and a relationship resulting in influence transmission between persons. The symbols $n_j$ ($j = 0, 1, \cdots$) represent the nodes. Some nodes are overt (observable), but the others are covert (unobservable). $O$ denotes the set of all overt nodes $\{n_0, n_1, \cdots, n_{N-1}\}$. Its cardinality is $N = |O|$. $C = \overline{O}$ denotes the set of all covert nodes $\{n_N, n_{N+1}, \cdots\}$.
The symbol $\delta_i$ ($0 \le i < D$) represents an individual communication pattern (and a resulting collaborative activity) among the persons. It is a set of nodes, $\delta_i \subseteq O \cup C$. The unobservability of the covert nodes does not affect the communication pattern. For example, the members of a communication pattern are those who join an online community. An observation $d_i$ in surveillance logs is the set of the overt nodes in a communication pattern $\delta_i$. It is given by eq.(1). The number of data is $D$.

$$d_i = \delta_i \cap O \quad (0 \le i < D). \tag{1}$$

$\{d_i\}$ denotes the observation dataset. Note that neither an individual node nor a single link can be observed directly, but a group of nodes can be observed as a communication pattern. $\{d_i\}$ can be expressed by a 2-dimensional $D \times N$ matrix of binary variables $d$. The presence or absence of the node $n_j$ in the data $d_i$ is indicated by the elements in eq.(2).

$$d_{ij} = \begin{cases} 1 & \text{if } n_j \in d_i \\ 0 & \text{otherwise} \end{cases} \quad (0 \le i < D,\ 0 \le j < N). \tag{2}$$

978-1-4244-2794-9/09/$25.00 © 2009 IEEE. SMC 2009

B. Maximal Likelihood Estimator Network

A parametric form is defined to describe the network topology and the influence transmission over the network. The influence transmission governs the possible communication patterns $\{\delta_i\}$ which result in the observation dataset $\{d_i\}$. The probability that the influence transmits from an initiating node $n_j$ to a responder node $n_k$ is $r_{jk}$. The influence transmits to multiple responders independently in parallel. It is similar to the degree of collaboration probability in trust modeling [11]. The constraints are $0 \le r_{jk}$ and $\sum_{k \ne j} r_{jk} \le 1$. The quantity $f_j$ is the probability that the node $n_j$ becomes an initiator. The constraints are $0 \le f_j$ and $\sum_{j=0}^{N-1} f_j = 1$. These parameters are defined for all nodes in a social network (both the nodes in $O$ and $C$).
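The observation model above can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names `observation_matrix` and `sample_patterns`, the toy values of `f` and `r`, the convention that overt nodes carry the lowest indices, the inclusion of the initiator in its own pattern, and the normalization of `f` over all nodes (overt and covert) are all assumptions made for illustration.

```python
import numpy as np

def observation_matrix(patterns, n_overt):
    """Eqs.(1)-(2): keep only the overt nodes (indices < n_overt) of each pattern."""
    d = np.zeros((len(patterns), n_overt), dtype=int)
    for i, delta in enumerate(patterns):
        for j in delta:
            if j < n_overt:      # d_i = delta_i ∩ O
                d[i, j] = 1      # d_ij = 1 if n_j is in d_i
    return d

def sample_patterns(f, r, n_data, seed=0):
    """Draw n_data communication patterns from initiator and transmission probabilities."""
    f = np.asarray(f, dtype=float)
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(seed)
    patterns = []
    for _ in range(n_data):
        j = int(rng.choice(len(f), p=f))     # initiator n_j chosen with probability f_j
        joins = rng.random(len(f)) < r[j]    # each responder n_k joins independently with probability r_jk
        joins[j] = True                      # assumption: the initiator belongs to its own pattern
        patterns.append({int(k) for k in np.flatnonzero(joins)})
    return patterns

# Toy network (hypothetical values): nodes 0..2 are overt, node 3 is covert.
f = [0.4, 0.3, 0.2, 0.1]
r = [[0.0, 0.5, 0.2, 0.3],
     [0.4, 0.0, 0.1, 0.5],
     [0.1, 0.1, 0.0, 0.8],
     [0.3, 0.3, 0.4, 0.0]]
patterns = sample_patterns(f, r, n_data=5)
d = observation_matrix(patterns, n_overt=3)
print(d)  # D x N binary matrix; the covert node never gets a column
```

A row of `d` can be all zeros in this sketch when the covert node initiates a pattern that no overt node joins, which reflects that the surveillance log records only overt participants.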
A single symbol $\theta$ represents both of the parameters $r_{jk}$ and $f_j$ for the nodes in $O$. $\theta$ is the target variable, the value of which needs to be inferred from the observation dataset. The logarithmic likelihood function [8] is defined by eq.(3). The quantity $p(\{d_i\} \mid \theta)$ denotes the probability that the observation dataset $\{d_i\}$ is realized under a given $\theta$.

$$L(\theta) = \log(p(\{d_i\} \mid \theta)). \tag{3}$$

The individual observations are assumed to be independent. Eq.(3) becomes eq.(4).

$$L(\theta) = \log\Big(\prod_{i=0}^{D-1} p(d_i \mid \theta)\Big) = \sum_{i=0}^{D-1} \log(p(d_i \mid \theta)). \tag{4}$$

The quantity $q_{i|jk}$ in eq.(5) is the probability that the presence or absence of the node $n_k$ as a responder to the stimulating node $n_j$ coincides with the observation $d_i$.

$$q_{i|jk} = \begin{cases} r_{jk} & \text{if } d_{ik} = 1 \text{ for given } i \text{ and } j \\ 1 - r_{jk} & \text{otherwise} \end{cases} \tag{5}$$

Eq.(5) is equivalent to eq.(6) since the value of $d_{ik}$ is either 0 or 1.

$$q_{i|jk} = d_{ik} r_{jk} + (1 - d_{ik})(1 - r_{jk}). \tag{6}$$

The probability $p(d_i \mid \theta)$ in eq.(4) is expressed by eq.(7). The operator $\wedge$ means logical AND.

$$p(d_i \mid \theta) = \sum_{j=0}^{N-1} d_{ij} f_j \prod_{0 \le k \,\ldots}$$
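Eqs.(4) to (6) can be checked numerically with a short Python sketch. It covers only the equations that are fully stated in the text. The function names and the sample probabilities in `p_obs` are hypothetical, and the per-observation probabilities $p(d_i \mid \theta)$ are taken as given inputs rather than computed.

```python
import numpy as np

def q_piecewise(d_ik, r_jk):
    """Eq.(5): r_jk if the responder n_k is observed, 1 - r_jk otherwise."""
    return r_jk if d_ik == 1 else 1.0 - r_jk

def q_closed_form(d_ik, r_jk):
    """Eq.(6): algebraic form of eq.(5), valid because d_ik is either 0 or 1."""
    return d_ik * r_jk + (1 - d_ik) * (1.0 - r_jk)

def log_likelihood(p_obs):
    """Eq.(4): sum of log p(d_i | theta) over independent observations."""
    return float(np.sum(np.log(np.asarray(p_obs, dtype=float))))

# The two forms of q agree for both values of d_ik.
for d_ik in (0, 1):
    for r_jk in (0.0, 0.25, 0.9):
        assert np.isclose(q_piecewise(d_ik, r_jk), q_closed_form(d_ik, r_jk))

# Hypothetical values of p(d_i | theta) for three observations.
p_obs = [0.2, 0.05, 0.5]
# Log of the product equals the sum of the logs, as in eq.(4).
print(log_likelihood(p_obs), float(np.log(np.prod(p_obs))))
```

Working with the sum of logarithms rather than the logarithm of the product is the usual choice in practice, because the product of many small probabilities can underflow for large $D$.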