Topic Modeling the Hàn diăn Ancient Classics
📝 Abstract
Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi’an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the “Handian” ancient classics corpus (汉典古籍 or Hàn diăn gŭ jí, i.e., the “Han canon” or “Chinese classics”). It contains classics of ancient Chinese philosophy, documents of historical and biographical significance, and literary works. We begin by describing the Digital Humanities context of this joint project, and the advances in humanities computing that made this project feasible. We describe the corpus and introduce our application of probabilistic topic modeling to this corpus, with attention to the particular challenges posed by modeling ancient Chinese documents. We give a specific example of how the software we have developed can be used to aid discovery and interpretation of themes in the corpus. We outline more advanced forms of computer-aided interpretation that are also made possible by the programming interface provided by our system, and the general implications of these methods for understanding the nature of meaning in these texts.
📄 Content
Topic Modeling the Hàn diăn Ancient Classics (汉典古籍) Colin ALLEN1,2, Hongliang LUO3, Jaimie MURDOCK2, Jianghuai PU1, Xiaohong WANG1, Yanjie ZHAI1, Kun ZHAO3
1 Department of Philosophy, School of Humanities and Social Sciences, Xi’an Jiaotong University, Shaanxi, China 2 Cognitive Science Program, Indiana University, Bloomington, Indiana, USA 3 Institute of Computer Software and Theory, School of Electronic and Information Engineering, Xi’an Jiaotong University, Shaanxi, China
Authors listed alphabetically
Abstract: Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi’an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the “Handian” ancient classics corpus (汉典古籍 or Hàn diăn gŭ jí, i.e., the “Han canon” or “Chinese classics”). It contains classics of ancient Chinese philosophy, documents of historical and biographical significance, and literary works. We begin by describing the Digital Humanities context of this joint project, and the advances in humanities computing that made this project feasible. We describe the corpus and introduce our application of probabilistic topic modeling to this corpus, with attention to the particular challenges posed by modeling ancient Chinese documents. We give a specific example of how the software we have developed can be used to aid discovery and interpretation of themes in the corpus. We outline more advanced forms of computer-aided interpretation that are also made possible by the programming interface provided by our system, and the general implications of these methods for understanding the nature of meaning in these texts.
Introduction and Context
The use of computers to support scholarship in the humanities reaches back over 50 years.1 The first decades of the twenty-first century have seen the acceleration of humanities computing, particularly in North America and Europe, with the field coalescing around the label “Digital Humanities” (DH). The recent growth of DH is the product of a feedback loop driven by several factors, including: (1) the increasing availability of digitized materials, especially on the World Wide Web; (2) increased computer storage and processing capacity; (3) advances in text modeling and visualization algorithms; (4) deepening understanding by scholars of the interpretive possibilities provided by computational methods; (5) funding commitments from government and private foundations; and (6) last, but not least, the growing perception among many young scholars and doctoral students that DH is an exciting area of inquiry and an important enhancement to their career prospects.2

DH projects may concern themselves with many different media: text, images, audio, video, etc. However, our focus here is the analysis of written texts, which constitutes the largest component of DH. This is largely because written language has been central to the construction and transmission of intellectual culture, and because of the relative ease with which text can be encoded and shared. These factors have resulted in enormous amounts of textual material recently becoming available.
As an example from category (1), increased availability of digitized materials, we highlight the HathiTrust (HT) digital library.3 It started as a collaboration among major university research libraries in the United States, to digitally scan the books in their collections.4 The page images from these books have been converted to text using optical character recognition (OCR) software. The HT collection now comprises over 14 million scanned volumes, equivalent to around five billion (5,000,000,000) pages based on the HathiTrust estimated average of 350 pages per book.5 (Perhaps as many as half a million of these books are Chinese-language volumes, as determined by a search at babel.hathitrust.org.) By any standard, this is a vast amount of text: more than could be read in multiple human lifetimes. Because of its enormous scale, the digitized pages in the HT are relatively uncurated. Despite the care with which editors prepared the original physically printed editions, the images and OCR representations of the pages contain scanning errors that have not been corrected. Nevertheless, the HT digital library is a treasure trove for DH that offers multiple possibilities for analysis.6 At the same time, traditional scholarly editions have become increasingly digital, making available highly curated editions of historically and culturally significant text corpora. These projects use a labor-intensive process of inserti
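As the abstract notes, modeling ancient Chinese documents poses particular challenges: classical Chinese is written without word boundaries, so before any topic model can be fit, each document must be tokenized, commonly by treating single characters or overlapping character bigrams as tokens. The following is a minimal illustrative sketch of that preprocessing step, not the authors' actual pipeline; the function names and the bigram choice are our own assumptions:

```python
# Illustrative sketch only: character-level tokenization for classical
# Chinese, a common workaround for the absence of word boundaries.
from collections import Counter

def char_tokens(text):
    """Single-character tokens; drops punctuation and whitespace."""
    # CJK ideographs are alphabetic in Unicode, so isalpha() keeps them
    # while discarding marks such as 。 and ，.
    return [ch for ch in text if ch.isalpha()]

def bigram_tokens(text):
    """Overlapping character bigrams, a cheap proxy for word segmentation."""
    chars = char_tokens(text)
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

doc = "道可道，非常道。名可名，非常名。"  # opening lines of the Daodejing
print(char_tokens(doc))                      # 12 single-character tokens
print(Counter(bigram_tokens(doc)).most_common(1))  # [('非常', 2)]
```

In a full pipeline, the per-document token counts produced this way would form the document-term matrix consumed by a probabilistic topic model such as LDA; whether single characters or bigrams better approximate classical Chinese "words" is itself a modeling decision.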