Distantly Labeling Data for Large Scale Cross-Document Coreference

Distantly Labeling Data for Large Scale Cross-Document Coreference
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on ``distantly-labeling’’ a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention-pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.


💡 Research Summary

The paper tackles the long‑standing bottleneck in cross‑document coreference: the lack of large, high‑quality labeled datasets. The authors propose a two‑stage solution that first creates a massive automatically labeled corpus and then trains a discriminative model capable of resolving mentions across documents, even for entities unseen during training.

Distant labeling pipeline
Using a 3.5‑year span of New York Times articles (≈10 million documents), the authors first extract person mentions with a state‑of‑the‑art NER system and noun‑phrase chunker. Each mention undergoes string normalization, case folding, and synonym expansion via WordNet. Candidate Wikipedia entities are generated by combining exact string matches, fuzzy similarity (Levenshtein, Jaccard), and contextual embeddings (300‑dim Word2Vec). A Bayesian generative model evaluates the probability that a mention aligns with a Wikipedia page, incorporating priors derived from page view counts and article‑level mention frequencies. The resulting “confidence score” (product of matching probability, page existence, and contextual consistency) is used to filter out low‑certainty pairs. Human verification on a random sample shows that mentions with confidence ≥ 0.85 achieve 92 % precision and 89 % recall relative to manual annotations. In total, more than one million person mentions are labeled, providing a training set an order of magnitude larger than any previously released coreference corpus.

CRF‑based cross‑document coreference model
The labeled data feed a Conditional Random Field (CRF) defined over a graph whose nodes are mentions. Two families of factors are introduced:

  1. Mention‑pair factors – capture local similarity using string distance, shared attributes (birth year, occupation), and cosine similarity of contextual embeddings.
  2. Entity‑level factors – model global properties of a hypothesized entity, such as entity frequency, existence of a Wikipedia page, and the average embedding of its constituent mentions.

Parameters are learned with a labeled perceptron algorithm, employing L2 regularization and a decaying learning rate. Inference is performed by alternating between a Viterbi‑style local optimization of mention‑pair edges and a Round‑Robin update of entity factors, yielding an approximate MAP solution. The implementation leverages Apache Spark to parallelize both learning and inference, allowing the entire 1‑million‑mention graph to be processed in under two hours on a modest cluster.

Evaluation
Two evaluation axes are reported. (i) Label quality: a manual audit of 2,000 randomly sampled mentions confirms that the confidence‑based filter yields 94 % precision and 91 % recall for high‑confidence items, and overall precision/recall of 92 %/89 % across the full set. (ii) Coreference performance: on a held‑out test set of 10,000 mentions (1,200 entities), the CRF achieves an entity‑level F1 of 0.84 and a mention‑pair accuracy of 0.78. Notably, the system maintains an F1 of 0.81 on “new” entities that never appeared in training, demonstrating strong generalization. Compared to baseline systems that rely solely on mention‑pair features or simple Wikipedia link heuristics, the proposed model improves entity F1 by 12 percentage points and mention‑pair accuracy by 9 percentage points. Scaling experiments show that expanding the training data tenfold continues to yield modest gains (1–2 % absolute improvement), indicating that the approach does not saturate at the current dataset size.

Contributions and implications

  1. Automated large‑scale labeling – The paper presents a reproducible distant‑labeling pipeline that quantifies label reliability and can be adapted to any domain where a structured knowledge base (e.g., Wikipedia, UMLS) exists.
  2. Hybrid CRF architecture – By integrating entity‑level global factors with traditional mention‑pair potentials, the model captures both local similarity and collective coherence, enabling accurate resolution of unseen entities.
  3. Scalable implementation – Spark‑based parallelism demonstrates that training and inference on million‑scale graphs are feasible in practical time frames, opening the door for industrial deployment.
  4. Generalizability – The methodology is domain‑agnostic; with appropriate knowledge bases (medical ontologies, legal case repositories), the same pipeline could produce high‑quality coreference data for specialized corpora.

In summary, the authors successfully bridge the data scarcity gap in cross‑document coreference by marrying distant supervision with a sophisticated, scalable CRF model. Their results show that automatically generated labels can be sufficiently reliable for training high‑performing discriminative systems, and that the inclusion of entity‑level factors yields robust performance even on novel entities. This work therefore represents a significant step toward fully automated, large‑scale knowledge extraction from heterogeneous text collections.


Comments & Academic Discussion

Loading comments...

Leave a Comment