Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward because they can capture semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising a curated dataset of ~172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.


💡 Research Summary

The paper introduces Loci Similes, a new benchmark designed to evaluate automatic detection of intertextual connections in Latin literature. Recognizing that scholars have long relied on manual collation of “loci similes” to trace quotations, paraphrases, and allusions, the authors address the lack of standardized datasets that hampers modern NLP approaches. The benchmark comprises roughly 172,000 Latin text segments split into a query corpus (≈ 83 k segments from the Late Antique authors Jerome and Lactantius) and a source corpus (≈ 88 k segments from ten canonical Classical authors, including Cicero, Virgil, and Ovid).

A ground‑truth set of 545 expert‑verified intertextual links underpins the benchmark. 270 links are drawn from prior scholarly work (Schropp et al., 2024b) and refined to remove duplicates and controversial cases. An additional 275 links are generated by a rule‑based n‑gram matching pipeline, then manually vetted by four Latin literature specialists using criteria that emphasize rare vocabulary, contextual coherence, and evidential support. The authors also propose a taxonomy of intertextuality ranging from verbatim quotes (marked/unmarked) through paraphrases (minor/major) to semantic allusions (single/systemic).
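The rule-based n-gram matching step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the choice of word trigrams, the whitespace tokenization, and the segment identifiers are all hypothetical, and the query sentence is invented for the example. The sketch indexes the source segments by their n-grams and reports every query/source pair that shares at least one n-gram, which is the kind of candidate list that experts would then vet manually.

```python
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return the set of lowercased word n-grams in a text segment."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def candidate_pairs(queries, sources, n=3):
    """Map (query_id, source_id) to the n-grams the two segments share.

    `queries` and `sources` are dicts of segment id -> text
    (a hypothetical layout, not the benchmark's actual file format).
    """
    # Inverted index: n-gram -> ids of source segments containing it.
    index = defaultdict(set)
    for source_id, text in sources.items():
        for gram in word_ngrams(text, n):
            index[gram].add(source_id)

    # Look up every query n-gram in the index and collect the overlaps.
    shared = defaultdict(set)
    for query_id, text in queries.items():
        for gram in word_ngrams(text, n):
            for source_id in index.get(gram, ()):
                shared[(query_id, source_id)].add(gram)
    return dict(shared)


# Illustrative data: the source is the opening line of the Aeneid,
# the query is an invented sentence that quotes it.
sources = {"verg_aen_1_1": "arma virumque cano troiae qui primus ab oris"}
queries = {"query_1": "ut ait poeta arma virumque cano et cetera"}

pairs = candidate_pairs(queries, sources)
```

Exact n-gram matching of this kind only finds verbatim overlap. Paraphrases with morphological variation would need lemmatization or embedding-based retrieval on top of it.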

Evaluation departs from traditional sentence‑level information retrieval and instead adopts a document‑to‑document matching paradigm, reflecting real scholarly workflows where a passage must be linked to an entire source work. Baseline experiments compare three families of models: (1) static Word2Vec embeddings; (2) contextual Transformer encoders, namely the Latin‑specific LatinBERT and the multilingual Sentence‑Transformer SPhilBerta; and (3) binary classification models fine‑tuned on passage pairs. Results show that contextual models outperform static embeddings by an average of 12 % in Mean Average Precision and are especially strong at detecting subtle allusions.
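For readers unfamiliar with the retrieval metric, Mean Average Precision can be computed as in the following Python sketch. It uses the standard textbook definition. The function names and the toy identifiers are assumptions made for illustration, and the summary does not state how the paper's own evaluation code handles ties or queries with no gold links.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of gold ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a gold item appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(rankings, gold):
    """Mean of per-query average precision over all queries in `gold`.

    `rankings` maps query id -> ranked list of retrieved source ids;
    `gold` maps query id -> set of relevant source ids (hypothetical layout).
    """
    if not gold:
        return 0.0
    scores = [
        average_precision(rankings.get(query_id, []), relevant)
        for query_id, relevant in gold.items()
    ]
    return sum(scores) / len(scores)


# Toy example with made-up identifiers.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
gold = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, gold)
```

In the toy example, query `q1` scores (1/1 + 2/3) / 2 and query `q2` scores 1/2, so the mean is 2/3.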

Related work situates Loci Similes among prior Latin intertextuality studies that relied on lexical overlap or static embeddings, and among broader text‑reuse research on English, French, and other languages, which has moved toward contextual models. The authors acknowledge limitations: the dataset focuses on Late Antique to Classical connections, omitting earlier or medieval Latin; expert annotation is costly, limiting scalability; and cross‑lingual reuse is not addressed.

Future directions include expanding the corpus to cover additional periods and genres, developing semi‑automatic annotation pipelines, integrating multimodal scholarly annotations, and extending the benchmark to other ancient languages. In sum, Loci Similes provides a publicly available, well‑curated resource and evaluation framework that can accelerate research at the intersection of computational linguistics and classical philology.

