An Annotation Scheme for Factuality and its Application to Parliamentary Proceedings

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.


💡 Research Summary

This paper introduces a comprehensive, multi‑layer annotation scheme for factuality and demonstrates its application to Hebrew parliamentary proceedings. Factuality is defined not merely as truth versus falsehood but as a spectrum ranging from statements that correspond to real‑world facts, through possibilities, to purely imagined scenarios. To capture this nuance, the authors integrate elements from prior works such as FactBank, MAVEN‑Fact, ClaimBuster, and various studies of event‑selecting predicates, adapting them to the morphological complexity of Hebrew.

The scheme consists of six hierarchical layers; if a sentence contains multiple claims, the full set of layers is repeated for each claim:

1. A check‑worthiness score (worth checking, not worth checking, not a factual proposition), a claim type (personal experience, quantity, correlation/causation, law/rule, prediction, other, non‑claim, irrealis), and a factuality profile that pairs a source (e.g., the speaker) with a modality‑polarity value (certain/positive, probable/negative, etc.).
2. Event‑selecting predicates (ESPs), classified as source‑introducing (SIP) or non‑source‑introducing (NSIP), with any introduced source linked.
3. Agency: the presence of an agent, its syntactic position, animacy, morphological number, and the predicate it governs; when no agent is present, the reason (passive without a by‑clause, impersonal modal, imperative, etc.) is annotated.
4. Stance: overall confidence level (high, mid, low, irrelevant), stance type (effective vs. epistemic), polarity (positive, negative, underspecified), and any cited reference.
5. Hedging expressions (e.g., "often", "approximately").
6. Quantitative expressions, specifying the literal numeral, any quantifier, and its type (universal, existential, etc.).
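As a rough illustration, a single annotated claim could be encoded as a nested record. This is a hypothetical sketch; the field names and values below are ours, not the paper's released format:

```python
# Hypothetical encoding of one annotated claim across the six layers.
# Field names are illustrative, not the paper's released data format.
claim_annotation = {
    # Layer 1: check-worthiness, claim type, factuality profile
    "check_worthiness": "worth_checking",
    "claim_type": "quantity",
    "factuality_profile": {"source": "speaker", "modality_polarity": "certain/positive"},
    # Layer 2: event-selecting predicate and its classification
    "esp": {"predicate": "said", "kind": "SIP", "introduced_source": "the minister"},
    # Layer 3: agency details (or the reason an agent is absent)
    "agency": {"agent_present": True, "position": "subject",
               "animacy": "animate", "number": "singular"},
    # Layer 4: stance
    "stance": {"confidence": "high", "type": "epistemic",
               "polarity": "positive", "reference": None},
    # Layer 5: hedging expressions
    "hedges": ["approximately"],
    # Layer 6: quantitative expressions
    "quantities": [{"numeral": "5000", "quantifier": None, "type": None}],
}

# A sentence containing multiple claims repeats the full structure per claim.
sentence_annotations = [claim_annotation]
```

This nesting mirrors the hierarchy of the scheme: per-claim top-level labels, with layer-specific sub-records that can be left empty when a layer does not apply.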

To evaluate the scheme, the authors manually annotated 4,987 sentences drawn from the Knesset Corpus, a large collection of Israeli parliamentary transcripts spanning several decades. Sampling was balanced across year, speaker gender, party affiliation, and native language. Inter‑annotator agreement measured by Cohen’s κ averaged 0.71 overall; the check‑worthiness and claim‑type layers achieved the highest agreement (0.78 and 0.74), while ESP and agency showed lower scores (0.62 and 0.65), reflecting the inherent ambiguity of predicate selection and agent identification in Hebrew.
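Cohen's κ corrects raw agreement for the agreement two annotators would reach by chance given their label distributions. A minimal self-contained computation, on toy labels of our own (not the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each
    # annotator's marginal frequency for that label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy check-worthiness labels from two hypothetical annotators:
a = ["worth", "worth", "not", "nfp", "worth", "not"]
b = ["worth", "not",   "not", "nfp", "worth", "not"]
print(round(cohens_kappa(a, b), 2))  # → 0.74
```

A κ of 0.71, as reported here, is conventionally read as substantial agreement, which is respectable for a scheme with this many interacting layers.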

For automatic prediction, the authors first tested state‑of‑the‑art GPT‑3.5/4 models in zero‑shot and few‑shot configurations on the binary check‑worthiness task. Performance was modest (F1 ≈ 0.58), indicating that English‑centric LLMs struggle with Hebrew morphology and the nuanced annotation. They then fine‑tuned a Hebrew‑specific large language model (AlephBERT‑Large) on the annotated data. The fine‑tuned model achieved an F1 of 0.81 on a held‑out set, substantially outperforming the off‑the‑shelf GPT baselines. Error analysis revealed that most remaining mistakes involved complex negation structures and indirect quotations.
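The F1 scores being compared are the harmonic mean of precision and recall on the positive (check-worthy) class. A small sketch of the computation, using invented confusion-matrix counts chosen only to illustrate the formula:

```python
def f1_score(tp, fp, fn):
    """F1 for the positive class: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts for a binary check-worthiness classifier
# (not the paper's actual confusion matrix):
print(round(f1_score(tp=81, fp=19, fn=19), 2))  # → 0.81
```

Because F1 ignores true negatives, it is a sensible choice here: most parliamentary sentences are not check-worthy, so plain accuracy would reward a classifier that never flags anything.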

Using the fine‑tuned model, the authors automatically annotated the entire parliamentary corpus with check‑worthiness labels and released the resulting dataset, along with code and annotation guidelines, under a Creative Commons Attribution‑ShareAlike 4.0 license.

The paper’s contributions are threefold: (1) a flexible, multi‑layer factuality annotation framework that can be adapted to other languages; (2) a sizable, high‑quality Hebrew factuality corpus with measured inter‑annotator reliability; (3) empirical evidence that domain‑specific fine‑tuning of Hebrew LLMs dramatically improves factuality‑related prediction compared to generic GPT models. Limitations include the focus on only the check‑worthiness layer for automation, the relatively high cost of manual annotation, and the lack of empirical validation on languages beyond Hebrew. Future work is suggested in multi‑task learning to predict all layers jointly, cross‑lingual transfer to other morphologically rich languages, integration with downstream fact‑checking pipelines, and refinement of guidelines to boost annotator consistency.

