Multiple Authors Detection: A Quantitative Analysis of Dream of the Red Chamber
Inspired by the authorship controversy of Dream of the Red Chamber and the application of machine learning in the study of literary stylometry, we develop a rigorous new method for the mathematical analysis of authorship by testing for a so-called chrono-divide in writing styles. Our method incorporates some of the latest advances in the study of authorship attribution, particularly techniques from support vector machines. By introducing the notion of relative frequency as a feature ranking metric, our method proves to be highly effective and robust. Applying our method to the Cheng-Gao version of Dream of the Red Chamber has led to convincing, if not irrefutable, evidence that the first 80 chapters and the last 40 chapters of the book were written by two different authors. Furthermore, our analysis has unexpectedly provided strong support for the hypothesis that Chapter 67 was not the work of Cao Xueqin either. We have also applied our method to the other three Great Classical Novels in Chinese. As expected, no chrono-divides were found. This provides further evidence of the robustness of our method.
💡 Research Summary
The paper presents a novel quantitative framework for detecting multiple authorship in the classic Chinese novel “Dream of the Red Chamber” (Hóng Lóu Mèng). Motivated by the long‑standing controversy over whether the first 80 chapters and the final 40 chapters were written by different hands, the authors develop a “chrono‑divide” methodology that treats the text as a temporal sequence and searches for a stylistic break point. The workflow begins with sliding‑window segmentation of the novel (e.g., 500‑sentence windows) and extraction of a rich set of linguistic features: character n‑grams, word n‑grams, syntactic patterns, and lexical diversity measures, yielding roughly one thousand candidate variables per segment.
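The segmentation and feature-extraction step can be illustrated with a minimal sketch. It covers only sliding windows and character n-gram counts; the function names (`sliding_windows`, `char_ngrams`, `window_features`), the step size, and the toy ASCII "sentences" standing in for Chinese text are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def sliding_windows(sentences, size, step):
    """Return overlapping windows of `size` sentences, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not sentences:
        return []
    # If the text is shorter than one window, a single short window is returned.
    last_start = max(len(sentences) - size, 0)
    return [sentences[i:i + size] for i in range(0, last_start + 1, step)]


def char_ngrams(text, n):
    """Count overlapping character n-grams inside one sentence."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def window_features(window, n):
    """Sum n-gram counts over the sentences of a window (no cross-sentence n-grams)."""
    total = Counter()
    for sentence in window:
        total.update(char_ngrams(sentence, n))
    return total


# Toy data: each string stands in for one sentence of the novel.
sentences = ["abab", "bcbc", "abca", "cccc", "abab"]
windows = sliding_windows(sentences, size=3, step=1)
features = [window_features(w, n=2) for w in windows]

for index, counts in enumerate(features):
    print(index, counts.most_common(3))
```

With real text, `size` would be on the order of the 500 sentences mentioned above, and the same counting would be repeated for word n-grams and the other feature families.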
Feature selection departs from simple frequency counts and introduces a “relative frequency” metric. For each candidate, the ratio of its occurrence in a given segment to its average frequency across the whole corpus is computed and log‑transformed. This emphasizes features that are disproportionately common (or rare) in a specific segment, thereby sharpening the contrast between potential authors while suppressing noise from low‑frequency items.
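As a rough illustration of the metric described above, the sketch below computes a log ratio of a feature's rate in one segment to its rate in the whole corpus and ranks features by the magnitude of that ratio. The function name `log_relative_frequency`, the additive `smoothing` constant (needed to avoid taking the logarithm of zero), and the toy counts are assumptions; the paper's exact definition may differ.

```python
import math


def log_relative_frequency(segment_counts, corpus_counts, smoothing=1e-6):
    """Log of (feature rate in segment) / (feature rate in corpus), per feature.

    `smoothing` is an assumed additive constant that keeps the ratio finite
    when a feature is absent from the segment.
    """
    seg_total = sum(segment_counts.values())
    corpus_total = sum(corpus_counts.values())
    if seg_total == 0 or corpus_total == 0:
        raise ValueError("segment and corpus must contain at least one feature")
    scores = {}
    for feature, corpus_n in corpus_counts.items():
        seg_rate = segment_counts.get(feature, 0) / seg_total
        corpus_rate = corpus_n / corpus_total
        scores[feature] = math.log((seg_rate + smoothing) / (corpus_rate + smoothing))
    return scores


# Toy counts: feature "a" is three times as frequent in the segment as overall.
segment = {"a": 30, "b": 70}
corpus = {"a": 100, "b": 900}

scores = log_relative_frequency(segment, corpus)
# Rank features by how far their segment rate departs from the corpus rate.
ranking = sorted(scores, key=lambda f: abs(scores[f]), reverse=True)
print(ranking, scores)
```

A positive score marks a feature that is over-represented in the segment and a negative score one that is under-represented, so ranking by absolute value surfaces both kinds.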
The selected features feed a Support Vector Machine (SVM) classifier. Both linear and radial‑basis‑function kernels are evaluated, and hyper‑parameters (C and γ) are tuned via nested cross‑validation. Because the two portions of the novel contain unequal numbers of windows (the 80‑chapter block yields more samples than the 40‑chapter block), a cost‑sensitive learning scheme is employed to balance the classes. After training, the SVM’s decision‑function values are plotted as a time series. A sharp change in this curve marks a “chrono‑divide,” interpreted as a stylistic shift indicative of a different author.
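The classification step can be sketched in a dependency-free form. The code below is not the authors' implementation: it trains a small linear SVM by full-batch subgradient descent on a class-weighted hinge loss (a stand-in for a library SVM with cost-sensitive weights), then reads the decision values in window order and reports the largest jump. The toy feature vectors, the labels (which assume the hypothesised split), the "balanced" weighting rule, and all hyper-parameter values are assumptions, and kernel choice and nested cross-validation are omitted.

```python
def train_weighted_linear_svm(X, y, class_weight, lam=0.01, lr=0.1, epochs=300):
    """Minimise lam/2 * ||w||^2 + (1/n) * sum_i c[y_i] * hinge(y_i * (w.x_i + b))."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                c = class_weight[yi] / n
                for j in range(d):
                    grad_w[j] -= c * yi * xi[j]
                grad_b -= c * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_function(X, w, b):
    """Signed score w.x + b for every row of X."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


# Toy windows in reading order: six from the first block, three from the second.
X = [[2.0, 0.5], [1.8, 0.4], [2.2, 0.6], [1.9, 0.5], [2.1, 0.3], [2.0, 0.6],
     [0.5, 2.0], [0.4, 1.8], [0.6, 2.1]]
y = [1] * 6 + [-1] * 3

# Assumed "balanced" weights: the smaller class gets the larger misclassification cost.
n = len(y)
class_weight = {label: n / (2 * y.count(label)) for label in (1, -1)}

w, b = train_weighted_linear_svm(X, y, class_weight)
scores = decision_function(X, w, b)

# Treat the scores as a time series and locate the largest jump between neighbours.
jumps = [abs(scores[i + 1] - scores[i]) for i in range(len(scores) - 1)]
divide_after = max(range(len(jumps)), key=jumps.__getitem__)
print("largest jump after window index", divide_after)
```

On this separable toy data the largest jump coincides with the point where the labels change. With real text, scores computed on held-out windows would give a less circular picture than scores on the training windows themselves.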
Empirical results show a pronounced transition exactly at the boundary between chapter 80 and chapter 81, confirming that the two sections are statistically distinct in style. Moreover, an isolated analysis of chapter 67 reveals that its decision‑function value lies far from both neighboring blocks, supporting the hypothesis that this chapter was not authored by Cao Xueqin either.
To test robustness, the same pipeline is applied to the other three Chinese “Great Classical Novels”: “Water Margin,” “Romance of the Three Kingdoms,” and “Journey to the West.” In all three cases the decision‑function curve remains essentially flat, indicating no detectable chrono‑divide, which is consistent with a low false‑positive rate for the method.
The paper’s contributions are threefold: (1) introduction of a relative‑frequency based feature ranking that improves discriminative power; (2) integration of SVM with cost‑sensitive learning to handle imbalanced segment counts; and (3) formulation of the chrono‑divide concept, providing a clear, visual, and statistically grounded way to locate authorial change points in long literary works.
Limitations are acknowledged. Variations among different manuscript traditions, the mixture of classical and vernacular language, and preprocessing errors could influence the results. The relative‑frequency metric may also over‑emphasize idiosyncratic features in small segments. Future work is proposed to incorporate multi‑manuscript comparison, Bayesian uncertainty modeling, and deep contextual embeddings (e.g., BERT‑style models) to refine author attribution and extend the approach to other literary corpora worldwide.