DistillER: Knowledge Distillation in Entity Resolution with Large Language Models
Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches either operate in unsupervised settings, relying on very large and costly models, or in supervised settings, requiring ground-truth annotations, leaving a critical gap between time efficiency and effectiveness. To make LLM-powered ER more practical, we investigate Knowledge Distillation (KD) as a means to transfer knowledge from large, effective models (Teachers) to smaller, more efficient models (Students) without requiring gold labels. We introduce DistillER, the first framework that systematically bridges this gap across three dimensions: (i) Data Selection, where we study strategies for identifying informative subsets of data; (ii) Knowledge Elicitation, where we compare single- and multi-teacher settings across LLMs and smaller language models (SLMs); and (iii) Distillation Algorithms, where we evaluate supervised fine-tuning and reinforcement learning approaches. Our experiments reveal that supervised fine-tuning of Students on noisy labels generated by LLM Teachers consistently outperforms alternative KD strategies, while also enabling high-quality explanation generation. Finally, we benchmark DistillER against established supervised and unsupervised ER methods based on LLMs and SLMs, demonstrating significant improvements in both effectiveness and efficiency.
💡 Research Summary
Entity Resolution (ER) remains a cornerstone task for data integration, typically consisting of a blocking phase that reduces the quadratic search space and a matching phase that decides whether a candidate pair refers to the same real‑world entity. Recent breakthroughs have shown that Large Language Models (LLMs) can achieve state‑of‑the‑art matching performance, but they do so at the expense of massive computational resources and high financial costs. Moreover, existing LLM‑based ER approaches fall into two mutually exclusive categories: unsupervised methods that rely on very large, often proprietary models, and supervised methods that require manually curated ground‑truth labels. This dichotomy leaves practitioners without a practical solution that combines the effectiveness of LLMs with the efficiency of smaller models when gold labels are unavailable.
The paper introduces DistillER, the first systematic framework that applies Knowledge Distillation (KD) to ER without requiring any gold annotations. DistillER is built around three interconnected components:
- Data Selection – Starting from the blocking output, each tuple (a query record together with its candidate set) is represented either by a single aggregated similarity score or by a binned similarity vector. Because true labels are absent, the authors propose two heuristic labeling strategies: (a) a ranking‑based approach that treats the top‑p scored tuples as positives and the bottom‑n as negatives, and (b) a clustering‑based approach that groups vectors into two clusters interpreted as positive and negative. The selected positive and negative subsets form the training pool for the teacher, while the remainder is reserved for evaluating the student.
- Knowledge Elicitation – The teacher can be a very large LLM (e.g., GPT‑4, GPT‑3.5) or a medium‑sized open‑source model (≈8 B parameters). Two instruction styles are explored: (i) a simple binary “Match / No‑Match” prompt, and (ii) an enriched prompt that also asks the teacher to generate a natural‑language explanation for its decision. Both single‑teacher and multi‑teacher (ensemble) settings are examined, allowing the system to capture not only the final prediction but also the reasoning patterns of the teacher(s).
- Distillation Algorithm – Two families of student‑training methods are compared. The first, Supervised Fine‑Tuning (SFT), treats the teacher’s noisy labels (and optionally explanations) as pseudo‑ground‑truth and fine‑tunes a smaller student model directly on this data. The second family leverages Reinforcement Learning (RL): a reward model is derived from the teacher’s outputs, and the student is optimized using algorithms such as Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
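The ranking‑based selection heuristic from the Data Selection component can be sketched in a few lines. This is an illustrative reconstruction, not code from the DistillER implementation: `select_training_pool`, its parameters, and the toy scores below are all hypothetical.

```python
# Hypothetical sketch of the ranking-based selection heuristic: tuples
# (query record + candidate set) are ranked by an aggregated similarity
# score; the top-p become pseudo-positives, the bottom-n pseudo-negatives,
# and everything in between is held out for evaluating the student.

def select_training_pool(tuples, scores, p, n):
    """Split scored tuples into pseudo-positive / pseudo-negative pools.

    tuples : candidate pairs produced by blocking
    scores : aggregated similarity score per tuple (same order as tuples)
    p, n   : number of top / bottom tuples to pseudo-label
    """
    # Indices sorted by similarity, highest first.
    ranked = sorted(range(len(tuples)), key=lambda i: scores[i], reverse=True)
    positives = [tuples[i] for i in ranked[:p]]   # highest-scored -> "Match"
    negatives = [tuples[i] for i in ranked[-n:]]  # lowest-scored  -> "No-Match"
    held_out = [tuples[i] for i in ranked[p:len(tuples) - n]]
    return positives, negatives, held_out

pairs = ["a-b", "c-d", "e-f", "g-h", "i-j"]
sims = [0.95, 0.10, 0.80, 0.05, 0.50]
pos, neg, rest = select_training_pool(pairs, sims, p=2, n=2)
print(pos, neg, rest)  # -> ['a-b', 'e-f'] ['c-d', 'g-h'] ['i-j']
```

The clustering‑based alternative would replace the rank cutoffs with a two‑cluster grouping of the binned similarity vectors, interpreting the higher‑similarity cluster as positive.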
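To make the SFT pathway concrete, the snippet below sketches how a teacher's output could be packaged into a fine‑tuning example for the student: the record pair becomes a binary "Match / No‑Match" prompt, and the teacher's noisy label (optionally with its explanation) becomes the target. The prompt wording, `build_sft_example`, and the product records are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative sketch (not the paper's actual prompts): turning a teacher's
# noisy decision into a prompt/completion pair for supervised fine-tuning
# of the student model.

def build_sft_example(record_a, record_b, teacher_label, explanation=None):
    prompt = (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Answer with 'Match' or 'No-Match'."
    )
    # With the enriched instruction style, the teacher's rationale is
    # appended so the student also learns to generate explanations.
    target = teacher_label if explanation is None else f"{teacher_label}. {explanation}"
    return {"prompt": prompt, "completion": target}

ex = build_sft_example(
    "iPhone 13, 128GB, blue",
    "Apple iPhone 13 (128 GB) - Blue",
    teacher_label="Match",
    explanation="Both describe the same phone model, storage size, and color.",
)
print(ex["completion"])
```

The RL variants would instead score student generations against the teacher's outputs (a reward model for PPO/GRPO, preference pairs for DPO) rather than training on the targets directly.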
The experimental evaluation spans several publicly available ER benchmarks covering diverse domains (e‑commerce, bibliographic, medical records). DistillER’s student models are compared against a broad set of baselines: (i) SLM‑based matching methods such as ZeroER, CollaborEM, HierGAT, and Unicorn; (ii) LLM‑based MATCH and SELECT approaches, both in zero‑shot and fine‑tuned configurations; and (iii) a recent LLM‑based explanation generation method that fine‑tunes a small model with GPT‑3.5‑generated explanations.
Key findings include:
- Supervised Fine‑Tuning consistently outperforms RL‑based distillation. Even though the teacher’s labels are noisy, SFT yields higher F1 scores across all datasets, demonstrating that straightforward pseudo‑label training is sufficient for effective knowledge transfer in ER.
- Students achieve near‑LLM performance with a fraction of the cost. An 8 B parameter student model matches or exceeds the accuracy of GPT‑4 while being 4–5× faster at inference and incurring less than 30% of the API cost.
- Explanation capability transfers to the student. When teachers are prompted to provide rationales, the fine‑tuned students learn to generate coherent explanations for their match decisions, which can be valuable for downstream auditing and human‑in‑the‑loop workflows.
- Data selection heuristics matter. The ranking‑based selection (using max or top‑2 similarity scores) generally yields a cleaner pseudo‑label set than clustering, leading to modest but consistent gains in downstream performance.
In summary, DistillER bridges the gap between the high effectiveness of large LLMs and the practical efficiency of smaller models, all without requiring any manually labeled data. The framework’s modular design—allowing interchangeable teachers, selection strategies, and distillation algorithms—makes it adaptable to various ER scenarios and future advances in LLM technology. The authors conclude by outlining future directions, including dynamic active‑learning‑driven data selection, prompt optimization for richer teacher knowledge, and multi‑modal extensions that incorporate visual or graph‑based entity information.