Fine Grained Citation Span for References in Wikipedia

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

“Verifiability” is one of the core editing principles in Wikipedia: editors are encouraged to provide citations for added content. For a Wikipedia article, determining the “citation span” of a citation, i.e., which content is covered by the citation, is important because it helps identify content for which citations are still missing. We are the first to address the problem of determining the citation span in Wikipedia articles. We approach this problem by classifying which textual fragments in an article are covered by a citation. We propose a sequence classification approach in which, for a paragraph and a citation, we determine the citation span at a fine-grained level. We provide a thorough experimental evaluation and compare our approach against baselines adopted from the scientific domain, showing improvement on all evaluation metrics.


💡 Research Summary

Wikipedia’s core editorial policy of verifiability requires that every factual statement be supported by a reliable external source. In practice, however, many statements lack citations, and even when citations are present, it is often unclear exactly which parts of the text they are intended to support. This paper tackles the problem of automatically determining the “citation span” – the set of textual fragments that a given citation actually covers – at a granularity finer than a full sentence.

Problem definition
Given a Wikipedia paragraph p that contains a citation marker c, the paragraph is first split into sub‑sentence fragments (δ₁ … δₘ) using punctuation delimiters such as commas, semicolons, colons, question marks and exclamation points. The task is to assign each fragment a binary label “covered” or “not‑covered” with respect to c. The authors formalize this as a sequence labeling problem: the ordered fragments form a sequence S(p), and the goal is to predict a label sequence Y = (y₁ … yₘ).
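The fragment-splitting step can be sketched in a few lines of Python; the paper's exact tokenizer and delimiter handling may differ, so treat this as a minimal illustration:

```python
import re

def split_fragments(paragraph):
    """Split a paragraph into sub-sentence fragments on the punctuation
    delimiters named above (comma, semicolon, colon, period, question
    mark, exclamation point), keeping each delimiter with its fragment."""
    parts = re.split(r"(?<=[,;:.?!])\s+", paragraph.strip())
    return [p for p in parts if p]

paragraph = ("The bridge opened in 1932, after eight years of construction; "
             "it remains the city's main crossing.")
fragments = split_fragments(paragraph)
# Each fragment then receives a binary covered / not-covered label.
```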

Dataset
The authors sampled Wikipedia entities from a November 2016 snapshot, focusing on those that contain web or news references – the two most common external sources on Wikipedia. From 134 entities they extracted 509 citing paragraphs (average 4.4 citations per paragraph), resulting in 408 unique paragraphs. Each fragment was manually annotated by the first author, and a second annotator labeled a 10 % sample, achieving an inter‑rater Cohen’s κ of 0.84, indicating high reliability.
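The reported agreement figure uses the standard Cohen's κ formula, which corrects observed agreement for chance; the labels below are toy values, not the actual annotations:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance
    return (p_o - p_e) / (1 - p_e)

# Toy covered (1) / not-covered (0) labels from two annotators.
ann1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
ann2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
```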

Model
A linear‑chain Conditional Random Field (CRF) is employed to model the sequence. The CRF captures dependencies between adjacent fragment labels, allowing the model to enforce coherence (e.g., a “covered” fragment is likely to be followed by another “covered” fragment if the citation spans multiple fragments).
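The coherence effect of transition dependencies can be seen in a toy Viterbi decode for a two-label chain. This is pure Python with illustrative scores, not the paper's implementation:

```python
def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain model.
    emissions: per-fragment {label: score}; transitions: {(prev, cur): score}."""
    labels = list(emissions[0])
    best = [dict(emissions[0])]   # best[i][y]: best path score ending in y
    back = []                     # back-pointers for path recovery
    for em in emissions[1:]:
        cur, ptr = {}, {}
        for y in labels:
            p = max(labels, key=lambda q: best[-1][q] + transitions[(q, y)])
            cur[y] = best[-1][p] + transitions[(p, y)] + em[y]
            ptr[y] = p
        best.append(cur)
        back.append(ptr)
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# "C" = covered, "N" = not covered; transitions favor label coherence.
transitions = {("C", "C"): 1.0, ("N", "N"): 1.0,
               ("C", "N"): -0.5, ("N", "C"): -0.5}
emissions = [{"C": 2.0, "N": 0.0},   # clearly covered
             {"C": 0.4, "N": 0.6},   # locally ambiguous, leans "N"
             {"C": 2.0, "N": 0.0}]   # clearly covered
```

In isolation the middle fragment would be labeled "N", but the transition scores pull it into the span — exactly the coherence effect described above.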

Feature engineering
Four families of features are designed:

  1. Structural features – presence of other citations in the fragment, total number of sentences in the paragraph, fragment length in characters, whether the fragment shares the same sentence as the citation marker, and the distance (in fragment indices) to the fragment containing the citation. These capture the intuition that proximity to the citation marker increases the likelihood of coverage.

  2. Citation‑content features – lexical similarity between a fragment and the cited source. Two similarity measures are computed: (a) a moving‑window language model (based on a ±3‑word context) yielding a Kullback‑Leibler divergence score (f_LM), and (b) maximal Jaccard similarity (f_J) between the fragment’s token set and each paragraph of the citation. Additionally, the number of sentences in the citation (f_c) is added, reflecting the observed correlation between citation length and span length.

  3. Discourse features – explicit discourse connectives (temporal, contingency, expansion, comparison) are identified using the method of Pitler and Nenkova (2009). Each fragment receives a discourse type label (f_disc) based on the connective it contains, providing cues about logical continuation or contrast that often align with citation boundaries.

  4. Temporal features – when consecutive fragments contain date expressions, the absolute difference in days is computed (f_λ). A large temporal jump suggests a shift in the narrative, increasing the probability of a label transition.

All features are combined in the CRF’s factor potentials Ψ(yᵢ, yᵢ₋₁, δᵢ).
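A simplified extractor for the four feature families might look as follows. Token-level Jaccard and ISO-format dates stand in for the paper's language-model similarity and full temporal tagging, and the connective list is a tiny sample, so this is a hedged sketch rather than the authors' code:

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
CONNECTIVES = {"then": "temporal", "because": "contingency",
               "furthermore": "expansion", "however": "comparison"}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def fragment_features(fragments, cite_idx, citation_paragraphs):
    feats = []
    for i, frag in enumerate(fragments):
        toks = set(re.findall(r"[a-z0-9]+", frag.lower()))
        f = {
            # structural: length and position relative to the citing fragment
            "len_chars": len(frag),
            "dist_to_citation": abs(i - cite_idx),
            "holds_citation": i == cite_idx,
            # citation content: maximal Jaccard against source paragraphs (f_J)
            "f_J": max(jaccard(toks, set(re.findall(r"[a-z0-9]+", p.lower())))
                       for p in citation_paragraphs),
            # discourse: category of the first explicit connective (f_disc)
            "f_disc": next((c for w, c in CONNECTIVES.items() if w in toks),
                           "none"),
        }
        # temporal: day gap between dates in consecutive fragments (f_lambda)
        m = ISO_DATE.search(frag)
        f["date"] = date(*map(int, m.groups())) if m else None
        if i > 0 and f["date"] and feats[-1]["date"]:
            f["f_lambda"] = abs((f["date"] - feats[-1]["date"]).days)
        feats.append(f)
    return feats

fragments = ["The dam opened on 1932-03-19,", "however it closed on 1932-03-26."]
feats = fragment_features(fragments, 0,
                          ["the dam opened on 1932-03-19 to great fanfare"])
```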

Baselines
The authors re‑implemented three representative baselines from the scientific‑text literature:

  • O’Connor (1982) – a heuristic that expands from the citing sentence to a ±2‑sentence window based on cue words.
  • Kaplan et al. (2016) – a classifier that uses textual coherence features to grow a citation block.
  • Qazvinian & Radev (2010) – a binary sentence‑level classifier that identifies additional sentences providing context for a citation.

These baselines operate at the sentence level and rely on cues (author names, explicit “this”/“above” references) that are largely absent in Wikipedia.
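As a loose illustration of why such cue-based, sentence-level heuristics struggle on Wikipedia, consider a minimal window baseline; the cue list and window logic are simplified assumptions, not a faithful re-implementation of any of the three systems:

```python
CUES = {"this", "these", "above", "such"}

def window_baseline(sentences, cite_idx, k=2):
    """Mark the citing sentence as covered, plus up to k preceding
    sentences that contain a cue word tying them to what follows."""
    covered = {cite_idx}
    for j in range(max(0, cite_idx - k), cite_idx):
        if CUES & set(sentences[j].lower().rstrip(".").split()):
            covered.add(j)
    return sorted(covered)

sentences = ["The dam was built over a decade.",
             "This project employed thousands.",
             "It opened in 1932."]
```

Wikipedia prose rarely contains such anaphoric cues pointing back toward a citation, so heuristics of this kind tend to under- or over-shoot the true span.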

Experimental results
Evaluation metrics include precision, recall, F1‑score, and overall accuracy. The CRF model consistently outperforms all baselines across the board. Notably, the fine‑grained sub‑sentence approach yields higher recall without sacrificing precision, demonstrating that many citation spans are shorter than a full sentence (e.g., a single clause or phrase). The structural distance to the citing fragment and the language‑model similarity (f_LM) are the most predictive features, while discourse and temporal features provide modest but consistent gains.
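Fragment-level precision, recall, and F1 over the covered class can be computed directly from the label sequences (the labels below are toy values, not results from the paper):

```python
def precision_recall_f1(gold, pred, positive="C"):
    """Precision, recall, and F1 for the positive (covered) class."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["C", "C", "N", "C", "N"]
pred = ["C", "N", "N", "C", "C"]
```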

Contributions

  1. First formal definition and dataset for citation‑span detection in Wikipedia, with publicly released annotations.
  2. A sequence‑labeling framework that leverages global dependencies via a CRF, enabling sub‑sentence granularity.
  3. A rich feature set that combines paragraph structure, lexical similarity to the cited source, discourse relations, and temporal cues.
  4. Empirical evidence that methods designed for scientific articles do not transfer directly to Wikipedia, underscoring the need for domain‑specific modeling.

Future directions
The authors suggest extending the work to (a) jointly assessing citation quality and span, (b) automatically flagging missing citations by detecting uncovered factual fragments, (c) handling overlapping spans when multiple citations refer to overlapping text, and (d) exploring neural sequence models (e.g., BERT‑CRF) that could learn richer contextual representations.

Conclusion
By framing citation‑span detection as a CRF‑based sequence labeling task and engineering features that reflect Wikipedia’s unique citation style, the paper delivers a robust solution that outperforms existing sentence‑level baselines. The released dataset and code provide a solid foundation for further research aimed at improving Wikipedia’s verifiability and supporting editors with automated assistance tools.

