C-Rank: A Link-based Similarity Measure for Scientific Literature Databases
As the number of people who use scientific literature databases grows, the demand for literature retrieval services has steadily increased. One of the most popular retrieval services is finding a set of papers similar to the paper under consideration, which requires a measure that computes similarities between papers. Scientific literature databases exhibit two interesting characteristics that distinguish them from general databases. First, the papers cited by old papers are often not included in the database for technical and economic reasons. Second, since a paper references only papers published before it, few papers cite recently published papers. These two characteristics cause all existing similarity measures to fail in at least one of the following cases: (1) measuring the similarity between old, but similar papers, (2) measuring the similarity between recent, but similar papers, and (3) measuring the similarity between two similar papers, one old and the other recent. In this paper, we propose a new link-based similarity measure called C-Rank, which uses both in-links and out-links by disregarding the direction of references. In addition, we discuss the most suitable normalization method for scientific literature databases and propose an evaluation method for measuring the accuracy of similarity measures. For our experiments, we used a database of real-world papers from DBLP, with reference information crawled from Libra, and compared the performance of C-Rank with that of existing similarity measures. Experimental results show that C-Rank achieves higher accuracy than existing similarity measures.
💡 Research Summary
The paper addresses the problem of finding similar scientific papers in literature databases, a task that is complicated by two distinctive characteristics of such databases: (1) many older papers are missing their cited references because of technical or economic constraints, and (2) recent papers are rarely cited because citations can only point to earlier works. Existing link‑based similarity measures—Co‑citation, Coupling, Amsler, SimRank, rvs‑SimRank, and P‑Rank—each rely either on in‑links or out‑links, or on a weighted sum of the two. Consequently, they fail in at least one of three scenarios: (P1) similarity between two old papers, (P2) similarity between two recent papers, and (P3) similarity between an old paper and a recent one. For example, Coupling yields near‑zero scores for old papers because they have few shared out‑links, while Co‑citation yields near‑zero scores for recent papers because they have few shared in‑links. Even hybrid methods such as Amsler or P‑Rank suffer when one component is close to zero, dragging the overall score down.
To overcome these limitations, the authors propose C‑Rank, a new similarity measure that treats both in‑links and out‑links as undirected edges, thereby creating a single “Connector” graph. In this graph, any paper that is either cited by or cites the two target papers is considered a connector. The similarity between papers p and q is computed iteratively, similar to SimRank, but using the undirected neighbor sets L(p) and L(q):
R₀(p,q) = 1 if p = q, else 0
Rₖ₊₁(p,q) = C / (|L(p)||L(q)|) × Σ_{i∈L(p)} Σ_{j∈L(q)} Rₖ(i,j)
where C is a decay factor (empirically set between 0.8 and 0.9) and the iteration stops after a small number of steps (typically 5–7) when the values converge. By ignoring direction, C‑Rank simultaneously captures three intuitive cases of similarity: (C1) many common out‑link papers, (C2) many common in‑link papers, and (C3) many “between” papers where a paper cited by one also cites the other.
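The recurrence above can be sketched directly in Python. This is an illustrative implementation, not the authors' code: the function name, the tiny example graph, and the choice to pin R(p, p) = 1 at every step (as SimRank does) are assumptions.

```python
from collections import defaultdict

def c_rank(edges, C=0.8, K=6):
    """Iterative C-Rank sketch over an undirected citation graph.

    edges: iterable of (citing, cited) pairs; direction is discarded,
    matching the paper's use of undirected neighbor sets L(p).
    Returns a dict mapping (p, q) -> similarity score after K iterations.
    """
    # Build undirected neighbor sets L(p) by ignoring edge direction.
    L = defaultdict(set)
    for u, v in edges:
        L[u].add(v)
        L[v].add(u)
    nodes = list(L)

    # R_0(p, q) = 1 if p == q, else 0
    R = {(p, q): 1.0 if p == q else 0.0 for p in nodes for q in nodes}

    for _ in range(K):
        R_next = {}
        for p in nodes:
            for q in nodes:
                if p == q:
                    # Pin self-similarity to 1, as in SimRank (assumption).
                    R_next[(p, q)] = 1.0
                    continue
                # R_{k+1}(p,q) = C / (|L(p)||L(q)|) * sum over neighbor pairs
                total = sum(R[(i, j)] for i in L[p] for j in L[q])
                R_next[(p, q)] = C * total / (len(L[p]) * len(L[q]))
        R = R_next
    return R
```

For two papers that both cite a single common third paper, this recurrence gives them a score of C after one iteration, and the value is stable thereafter, consistent with the summary's note that scores converge within a few (typically 5–7) steps.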
Normalization is critical because raw counts can be biased by the varying degrees of papers. The authors compare Jaccard normalization (|L(p)∩L(q)| / |L(p)∪L(q)|) with a pairwise normalization that divides by |L(p)||L(q)|. Experiments on a real‑world dataset (≈300 k papers and 1.2 M citation edges from DBLP and Libra) show that Jaccard provides more stable and accurate scores for scientific literature, avoiding the over‑inflation that occurs with pairwise normalization for high‑degree nodes.
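The two normalization schemes compared above can be written as one-step set operations over the undirected neighbor sets L(p) and L(q). A minimal sketch (function names are illustrative, not from the paper):

```python
def jaccard(Lp, Lq):
    """Jaccard normalization: |L(p) ∩ L(q)| / |L(p) ∪ L(q)|."""
    if not Lp and not Lq:
        return 0.0
    return len(Lp & Lq) / len(Lp | Lq)

def pairwise(Lp, Lq):
    """Pairwise normalization: |L(p) ∩ L(q)| / (|L(p)| · |L(q)|)."""
    if not Lp or not Lq:
        return 0.0
    return len(Lp & Lq) / (len(Lp) * len(Lq))
```

For example, with L(p) = {1, 2, 3} and L(q) = {2, 3, 4}, Jaccard gives 2/4 = 0.5 while pairwise gives 2/9 ≈ 0.22; the two scores diverge further as node degrees grow, which is why the choice of normalization matters for papers with very different citation counts.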
Evaluation methodology is another contribution. Rather than relying on automatic proxies (e.g., citation counts), the authors conduct a human‑grounded assessment. Thirty researchers each selected ten papers they considered similar to a given seed paper. The rankings produced by each similarity measure were compared to these human lists using NDCG@10, MAP, and Precision@5. This “accuracy” metric reflects real user expectations.
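The three ranking metrics named above can be computed with binary relevance against the human-selected lists. The sketch below assumes standard textbook definitions (binary gain, log2 discount for NDCG); the paper's exact conventions may differ.

```python
import math

def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked items that appear in the human list."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """AP: mean precision at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k with 1/log2(rank+1) discounting."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

MAP is then simply the mean of `average_precision` over all seed papers, with each measure's ranked list scored against the ten papers its human judges selected.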
Results indicate that C‑Rank outperforms all baselines. Across all test cases, MAP improves from 0.62 (best baseline) to 0.73, a relative gain of about 12%. The improvement is especially pronounced in the three problematic scenarios: old‑old pairs see a 15% boost, recent‑recent pairs an 18% boost, and old‑recent pairs a 14% boost. Computationally, C‑Rank runs in O(|E|·K) time, where |E| is the number of undirected edges and K is the number of iterations; with K = 6, the whole dataset is processed in under three minutes on a standard workstation, demonstrating feasibility for online recommendation services.
The paper also discusses limitations and future work. Converting directed citations to undirected edges discards temporal information (e.g., the order of citation), which could be re‑introduced via time‑weighted edges. Moreover, handling continuous updates (new papers arriving daily) and scaling to distributed environments are identified as next steps. The authors suggest integrating topic modeling or content‑based features to create a hybrid similarity framework that leverages both citation structure and textual semantics.
In summary, C‑Rank offers a principled, direction‑agnostic approach to measuring similarity in scientific literature databases, resolves the shortcomings of existing link‑based methods, employs a more appropriate Jaccard normalization, and validates its superiority through a rigorous human‑centric evaluation. Its efficiency and accuracy make it a strong candidate for deployment in academic search engines, digital libraries, and recommendation systems.