Learning Reputation in an Authorship Network
The problem of searching for experts in a given academic field is hugely important in both industry and academia. We study exactly this issue with respect to a database of authors and their publications. The idea is to use Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) to perform topic modelling in order to find authors who have worked in a query field. We then construct a coauthorship graph and motivate the use of influence maximisation and a variety of graph centrality measures to obtain a ranked list of experts. The ranked lists are further improved using a Markov Chain-based rank aggregation approach. The complete method is readily scalable to large datasets. To demonstrate the efficacy of the approach we report on an extensive set of computational simulations using the Arnetminer dataset. An improvement in mean average precision is demonstrated over the baseline case of simply using the order of authors found by the topic models.
💡 Research Summary
The paper tackles the practical problem of expert finding within academic domains by combining textual topic modeling with graph‑based reputation analysis. First, authors who have contributed to a query domain are identified using two scalable topic‑modeling techniques: Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). Both models are applied to the titles and abstracts of each author’s publications. LSI reduces a TF‑IDF matrix via partial Singular Value Decomposition, while LDA learns document‑topic and topic‑word distributions using an online variational Bayes algorithm. For a given query, the system computes the cosine similarity between the query vector and each document vector in the respective semantic space; documents exceeding a similarity threshold are selected, and their authors form the candidate set D_i.
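The LSI branch of this pipeline can be sketched as follows. This is an illustrative toy, not the paper's implementation: it uses scikit-learn's `TfidfVectorizer` and `TruncatedSVD` in place of the online algorithms described above, and the corpus, author lists, and similarity threshold are all made up for the example.

```python
# Sketch of the LSI candidate-selection stage: TF-IDF -> truncated SVD ->
# cosine similarity against the query, then collect authors of matching docs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of titles/abstracts and their author lists (illustrative only).
docs = [
    "latent dirichlet allocation for topic modelling of text",
    "greedy influence maximization in social networks",
    "variational bayes inference for topic models",
]
doc_authors = [["A. Smith"], ["B. Jones", "C. Lee"], ["A. Smith", "D. Wu"]]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                    # TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)                     # documents in LSI space

query = tfidf.transform(["topic modelling"])
q_lsi = svd.transform(query)

# Cosine similarity of the query against every document in LSI space.
sims = cosine_similarity(q_lsi, X_lsi).ravel()

threshold = 0.2                                  # assumed value; tuned in practice
candidates = {a for d, s in enumerate(sims) if s >= threshold
                for a in doc_authors[d]}
```

The resulting `candidates` set plays the role of D_i and feeds the graph stage described next.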
With the candidate authors in hand, the paper constructs an undirected co‑authorship graph G = (V, E) whose vertices represent authors and whose edges indicate joint publications (optionally weighted by the number of shared papers). On this graph, six reputation measures are evaluated:
- Influence Maximization – modeled as an Independent Cascade diffusion process. The goal is to select k seed vertices that maximize the expected spread σ_P(G, A). Because σ_P is monotone submodular, a greedy algorithm (CELF) provides a (1‑1/e) approximation efficiently.
- PageRank – a random‑walk based global importance score that captures the prestige of authors through the link structure of the co‑authorship graph.
- Hub‑Authority (HITS) – mutually recursive scores that differentiate authors who act as “hubs” (linking to many authorities) from those who are “authorities.”
- Betweenness Centrality – counts the fraction of shortest paths that pass through a vertex, reflecting its role as a bridge in the network.
- Closeness Centrality – the inverse of the average shortest‑path distance from a vertex to all others, indicating how quickly information can reach the rest of the network.
- Degree – the simplest measure, counting incident edges.
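Graph construction and the two simplest measures above (degree and closeness) can be sketched in plain Python. The paper lists of this toy are invented for illustration; in the real pipeline the input would be the publications of the candidate authors.

```python
from collections import defaultdict, deque
from itertools import combinations

# Toy paper -> author-list data (illustrative only).
papers = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]]

# Undirected co-authorship graph, edges weighted by number of shared papers.
weight = defaultdict(int)
adj = defaultdict(set)
for author_list in papers:
    for u, v in combinations(author_list, 2):
        weight[frozenset((u, v))] += 1
        adj[u].add(v)
        adj[v].add(u)

def degree(v):
    """Number of distinct coauthors of v."""
    return len(adj[v])

def closeness(v):
    """(n-1) / sum of BFS shortest-path distances from v to reachable nodes."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

Here author "C" coauthors with everyone, so its degree is 3 and its closeness is maximal (1.0); the remaining measures (PageRank, HITS, betweenness, influence spread) would be computed on the same `adj` structure.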
Each metric captures a different aspect of scholarly reputation (e.g., productivity, connectivity, influence). Since no single metric fully characterizes expertise, the authors aggregate the multiple rankings using a Markov‑Chain based rank aggregation method (MC²). For each input ranking τ_k they build a transition matrix P(k) where P(k)_ij is the probability of moving from author i to author j given τ_k. The overall transition matrix R is the average of all P(k). The stationary distribution x of R (the eigenvector satisfying x = R^T x, Σ_i x_i = 1) provides a final score for every author, naturally handling partial top‑k lists.
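The aggregation step can be illustrated with a small sketch. The construction of P(k) below follows one standard Markov‑chain rank‑aggregation recipe (from author i, jump uniformly to an author ranked at least as high as i in τ_k); the exact transition rule used in the paper may differ, and the rankings here are toy data.

```python
import numpy as np

authors = ["A", "B", "C", "D"]
rankings = [                   # toy input rankings, best first
    ["A", "B", "C", "D"],      # e.g. PageRank order
    ["B", "A", "D", "C"],      # e.g. closeness order
]

n = len(authors)
idx = {a: i for i, a in enumerate(authors)}

# Build P(k) per ranking and average them into R.
R = np.zeros((n, n))
for tau in rankings:
    pos = {a: r for r, a in enumerate(tau)}
    P = np.zeros((n, n))
    for a in authors:
        # Move uniformly to any author ranked at least as high (incl. a itself).
        better = [b for b in authors if pos[b] <= pos[a]]
        for b in better:
            P[idx[a], idx[b]] = 1.0 / len(better)
    R += P / len(rankings)

# Stationary distribution via power iteration: x = R^T x, sum(x) = 1.
x = np.full(n, 1.0 / n)
for _ in range(200):
    x = R.T @ x
x /= x.sum()
order = [authors[i] for i in np.argsort(-x)]
```

Because "A" and "B" dominate both input rankings, the stationary mass concentrates on them and they surface at the top of `order`, which is exactly the behaviour the aggregation is meant to produce.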
The related‑work section situates the contribution among prior expert‑finding efforts, notably the TREC Enterprise track, ArnetMiner‑based co‑authorship analyses, and Topic‑Affinity Propagation models. The novelty lies in (a) jointly employing scalable LSI/LDA for domain detection, (b) integrating influence maximization with classic centrality measures, and (c) applying a principled Markov‑Chain aggregation to fuse the heterogeneous rankings.
Experiments are conducted on the ArnetMiner dataset, comprising over a million papers and tens of thousands of authors. Ten realistic research topics (e.g., “machine learning”, “data mining”) serve as queries; ground‑truth expert lists are curated by domain experts. Evaluation uses Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). Baselines consist of ranking solely by LSI/LDA similarity. Results show that each individual graph metric improves MAP by 5–7 % over the baseline, while the combination of Influence Maximization and PageRank yields the largest single‑metric gain (~10 %). The full MC² aggregation further lifts MAP by an average of 12 % and consistently improves NDCG across topics. Computationally, the online LSI/LDA pipelines and the CELF greedy algorithm enable processing the entire dataset in a few hours on commodity hardware, demonstrating scalability.
The authors conclude that their two‑stage framework—topic‑driven candidate selection followed by multi‑metric graph ranking and Markov‑Chain aggregation—substantially enhances expert retrieval in large academic corpora. Limitations include reliance solely on co‑authorship edges (ignoring citation and external impact metrics) and the use of a fixed diffusion probability in the influence model. Future work is suggested on multi‑layer networks (adding citation, social media, and temporal dynamics), Bayesian learning of diffusion parameters, and transfer learning across domains to further generalize the expert‑finding system.