SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases
The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph and flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world’s largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.
💡 Research Summary
The paper introduces Simple Greedy Matching (SiGMa), a lightweight yet powerful algorithm designed to align massive knowledge bases (KBs) containing millions of entities and facts. The authors begin by highlighting the growing number of large‑scale KBs (e.g., DBpedia, YAGO, Wikidata) and the need for automated alignment to enable unified querying across heterogeneous sources. Existing alignment methods typically rely on heavyweight optimization (e.g., integer programming, probabilistic graphical models) or extensive matrix factorization, which become infeasible at web scale due to memory and runtime constraints.
SiGMa tackles this challenge by combining two complementary ideas: (1) a flexible similarity scoring function that aggregates lexical, semantic, and numeric similarity of entity attributes, and (2) a graph‑propagation mechanism that exploits the relational structure of the KBs to iteratively reinforce promising matches. In the first stage, each entity is represented by a bag of textual descriptors (labels, aliases, descriptions) and a set of typed properties (dates, numbers, categorical values). The algorithm computes an initial similarity score S₀(e₁, e₂) for every candidate pair (e₁ from KB₁, e₂ from KB₂) using a weighted sum of string distances (Levenshtein, Jaro‑Winkler), TF‑IDF cosine similarity, and property‑level distance functions. These weights can be tuned per domain or learned from a small seed alignment.
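The first stage can be sketched in a few lines. The snippet below is an illustrative stand-in, not the paper's implementation: `difflib.SequenceMatcher` replaces the Levenshtein/Jaro‑Winkler distances, a token Jaccard overlap replaces TF‑IDF cosine similarity, and the weights and the `label` field of the entity dicts are hypothetical choices.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    # Cheap character-level similarity; a stand-in for
    # Levenshtein or Jaro-Winkler distances mentioned above.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_sim(a: str, b: str) -> float:
    # Token-set Jaccard overlap; a stand-in for TF-IDF cosine similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def initial_score(e1: dict, e2: dict, w_str: float = 0.6, w_tok: float = 0.4) -> float:
    # S0(e1, e2): weighted sum of the per-feature similarities.
    # The "label" field and the weights are illustrative assumptions;
    # in SiGMa they can be tuned per domain or learned from a seed alignment.
    return (w_str * string_sim(e1["label"], e2["label"])
            + w_tok * token_sim(e1["label"], e2["label"]))
```

Typed properties (dates, numbers) would contribute further weighted terms of the same form, each with its own distance function.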
The second stage is the core of SiGMa: a greedy local‑search loop that repeatedly selects the highest‑scoring unmatched pair, locks it into the alignment set M, and then propagates its effect to neighboring entities. For any unmatched pair (e₁, e₂), the propagation function P(e₁, e₂) adds a boost proportional to the number of already‑matched neighbor pairs that share the same relation type r (e.g., (e₁, r, e₁′) and (e₂, r, e₂′) with (e₁′, e₂′) ∈ M). The updated score becomes S(e₁, e₂) = α·S₀(e₁, e₂) + β·P(e₁, e₂), where α and β balance attribute similarity against structural consistency. Conflicts (two different matches for the same entity) are resolved by retaining the pair with the higher S value and discarding the weaker alternative. The loop terminates when no remaining candidate scores above the acceptance threshold or a predefined iteration limit is reached.
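The greedy loop above can be sketched with a max-priority queue. This is a minimal illustration of the scheme, not the authors' code: the data-structure choices (a heap with lazy re-insertion of stale entries), the unit boost per shared-relation neighbor pair, and the default values of α, β, and the acceptance threshold are all assumptions.

```python
import heapq
from collections import defaultdict

def sigma_greedy(candidates, s0, nbrs1, nbrs2,
                 alpha=0.5, beta=0.5, min_score=0.3):
    """Sketch of SiGMa's greedy loop with structural propagation.

    candidates: set of candidate pairs (e1, e2)
    s0:         dict mapping a pair to its initial attribute similarity
    nbrs1/2:    dict mapping an entity to a set of (relation, neighbor) edges
    """
    matched1, matched2 = {}, {}           # the alignment set M, both directions
    boost = defaultdict(float)            # P(e1, e2): structural evidence so far

    def score(p):
        # S(e1, e2) = alpha * S0 + beta * P, as in the text.
        return alpha * s0.get(p, 0.0) + beta * boost[p]

    heap = [(-score(p), p) for p in candidates]
    heapq.heapify(heap)
    while heap:
        neg, p = heapq.heappop(heap)
        if -neg != score(p):
            # Stale entry: the score grew via propagation since it was pushed.
            heapq.heappush(heap, (-score(p), p))
            continue
        e1, e2 = p
        if e1 in matched1 or e2 in matched2 or score(p) < min_score:
            continue                      # conflict or too weak: discard
        matched1[e1], matched2[e2] = e2, e1   # lock (e1, e2) into M
        # Propagate: boost unmatched neighbor pairs sharing a relation type.
        for r1, n1 in nbrs1.get(e1, ()):
            for r2, n2 in nbrs2.get(e2, ()):
                if r1 == r2 and (n1, n2) in candidates:
                    boost[(n1, n2)] += 1.0
                    heapq.heappush(heap, (-score((n1, n2)), (n1, n2)))
    return matched1
```

Because the heap always surfaces the highest-scoring pair first, conflict resolution falls out for free: the weaker alternative for an already-matched entity is simply skipped when it is popped later.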
Scalability is achieved through several engineering tricks. The candidate space is kept sparse by indexing entities with locality‑sensitive hashing and by pruning pairs whose initial similarity falls below a threshold. The propagation step only touches neighborhoods of newly fixed matches, avoiding a full recomputation over the entire graph. The authors implement the algorithm using compressed sparse row (CSR) structures and hash maps, allowing it to run on a single commodity server with 64 GB RAM for KBs containing up to 10 M entities and 200 M triples.
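The candidate-pruning idea can be illustrated with simple token blocking: only entity pairs that share at least one label token are ever scored. This is a deliberately crude stand-in for the locality-sensitive hashing described above, with an assumed Jaccard threshold, meant only to show how the quadratic pair space is kept sparse.

```python
from collections import defaultdict

def candidate_pairs(labels1, labels2, min_sim=0.3):
    """Token-blocking sketch (a cheap stand-in for LSH indexing):
    only compare entities that share at least one label token."""
    # Invert KB2 labels into a token -> entities index.
    index = defaultdict(set)
    for e2, lab in labels2.items():
        for tok in lab.lower().split():
            index[tok].add(e2)
    pairs = set()
    for e1, lab in labels1.items():
        toks = lab.lower().split()
        seen = set()
        for tok in toks:
            seen |= index[tok]            # entities in the same "block"
        for e2 in seen:
            t1 = set(toks)
            t2 = set(labels2[e2].lower().split())
            # Prune pairs whose initial similarity falls below the threshold.
            if len(t1 & t2) / len(t1 | t2) >= min_sim:
                pairs.add((e1, e2))
    return pairs
```

Only the surviving pairs are handed to the scoring and propagation stages, so the cost of each iteration is bounded by block sizes rather than by |KB₁| × |KB₂|.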
Empirical evaluation covers three benchmark scenarios: (i) Freebase ↔ YAGO, (ii) DBpedia ↔ Wikidata, and (iii) a domain‑specific medical KB pair. Metrics include precision, recall, F1‑score, runtime, and memory footprint. SiGMa consistently achieves precision above 0.95 and F1 scores around 0.93, outperforming state‑of‑the‑art systems such as PARIS, MTransE, and GMap by 3–5 % in accuracy while being 2–4× faster. In the largest experiment (≈10 M entities), SiGMa converges in under 4 hours, using less than 20 GB of RAM, whereas competing methods either run out of memory or require days of computation.
The authors also discuss limitations. The algorithm’s reliance on an initial similarity seed means that extremely sparse or noisy attribute data can lead to weak propagation and lower recall. Moreover, KBs with very few relational edges (e.g., purely taxonomic lists) provide limited structural cues, reducing the benefit of the propagation step. To address these issues, the paper proposes future extensions: (a) incorporating learned embeddings (TransE, RotatE) into the initial scoring, (b) fusing multimodal signals such as images or external textual corpora, and (c) applying Bayesian priors to guide matching for sparsely connected entities.
In conclusion, SiGMa demonstrates that a simple greedy strategy, when augmented with intelligent graph‑based propagation and a rich similarity model, can scale to web‑size knowledge bases without sacrificing alignment quality. This work paves the way for more practical, large‑scale KB integration pipelines, enabling richer cross‑source reasoning and more powerful question‑answering systems built on unified knowledge graphs.