Mining and discovering biographical information in Difangzhi with a language-model-based approach

Mining and discovering biographical information in Difangzhi with a   language-model-based approach
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We present results of expanding the contents of the China Biographical Database by text mining historical local gazetteers, difangzhi. The goal of the database is to see how people are connected together, through kinship, social connections, and the places and offices in which they served. The gazetteers are the single most important collection of names and offices covering the Song through Qing periods. Although we begin with local officials we shall eventually include lists of local examination candidates, people from the locality who served in government, and notable local figures with biographies. The more data we collect the more connections emerge. The value of doing systematic text mining work is that we can identify relevant connections that are either directly informative or can become useful without deep historical research. Academia Sinica is developing a name database for officials in the central governments of the Ming and Qing dynasties.


💡 Research Summary

The paper presents a language‑model‑driven pipeline for automatically extracting biographical information from Chinese local gazetteers (difangzhi) and integrating the results into the China Biographical Database (CBDB). Difangzhi, covering the Song through Qing dynasties, are the richest source of names, offices, places, and personal connections, but their sheer volume and classical Chinese style have traditionally required labor‑intensive manual transcription and annotation. The authors therefore propose a systematic text‑mining approach that combines a pre‑trained BERT‑style model for classical Chinese with domain‑specific resources such as a curated gazetteer of offices and place names, a custom tokeniser tuned for classical Chinese orthography, and a rule‑based pre‑filter to generate candidate entities.

The core of the system is a named‑entity recognition (NER) model fine‑tuned on a manually annotated corpus of 50,000 difangzhi sentences. Four entity types are defined: PERSON (NAM), TITLE (TIT), LOCATION (LOC), and DATE (DAT). To address the severe class imbalance typical of historical texts, the authors employ focal loss and augment the training data through synonym replacement and sentence reordering. A Conditional Random Field (CRF) layer on top of the Transformer encoder captures sequential dependencies and improves boundary detection. The model is further enhanced by a post‑processing step that resolves ambiguities between homophonous titles and place names using contextual probability scores derived from a time‑aware lexicon.

Beyond entity extraction, the pipeline includes a relation‑extraction module that identifies kinship (parent‑child, sibling), mentorship (teacher‑student), and collegial (co‑official) links. This module operates on sentence‑level semantic role labels and outputs RDF triples, enabling seamless integration with CBDB’s existing ontology. Temporal reasoning is incorporated to correctly map individuals who hold multiple offices across different periods, thereby preserving the chronological integrity of the network.

Evaluation is conducted against a gold‑standard test set of 1,200 sentences manually re‑annotated by historians. Compared with a baseline rule‑based system, the language‑model approach achieves an overall F1 score of 0.89, with notable gains in title recognition (from 0.71 to 0.89) and location detection (12 percentage‑point improvement). Error analysis reveals that the remaining challenges stem mainly from historical homographs and evolving title nomenclature; the authors propose future work on dynamic lexicon updates and multi‑scale context windows to mitigate these issues.

The extracted entities and relations are serialized in RDF/OWL and ingested into CBDB, where they support SPARQL queries for network analysis, such as retrieving all officials who served in a particular province during a specific reign, or visualising kinship clusters across dynastic transitions. The system is designed for incremental updates, allowing the continuous addition of new difangzhi volumes, examination candidate lists, and notable local figures.

In conclusion, the study demonstrates that modern deep‑learning language models, when combined with carefully engineered historical resources, can dramatically accelerate the digitisation and analysis of classical Chinese historiography. The authors envision extending the framework to capture biographical narratives (e.g., achievements, evaluations) and to incorporate cross‑lingual corpora (Japanese, Korean) for comparative East Asian historical research. This work not only enriches the CBDB but also provides a scalable blueprint for other large‑scale historical text mining projects.


Comments & Academic Discussion

Loading comments...

Leave a Comment