Analyzing Walter Skeats Forty-Five Parallel Extracts of William Langlands Piers Plowman
Walter Skeat published his critical edition of William Langland’s 14th century alliterative poem, Piers Plowman, in 1886. In preparation for this he located forty-five manuscripts, and to compare dialects, he published excerpts from each of these. This paper does three statistical analyses using these excerpts, each of which mimics a task he did in writing his critical edition. First, he combined multiple versions of a poetic line to create a best line, which is compared to the mean string that is computed by a generalization of the arithmetic mean that uses edit distance. Second, he claims that a certain subset of manuscripts varies little. This is quantified by computing a string variance, which is closely related to the above generalization of the mean. Third, he claims that the manuscripts fall into three groups, which is a clustering problem that is addressed by using edit distance. The overall goal is to develop methodology that would be of use to a literary critic.
💡 Research Summary
The paper revisits Walter Skeat’s 1886 critical edition of William Langland’s Piers Plowman through the lens of modern statistical and computational linguistics. Skeat, in preparing his edition, identified forty‑five medieval manuscripts and published short excerpts from each. He performed three editorial tasks that have become canonical in textual criticism: (1) he combined parallel readings of a poetic line to produce a “best line,” (2) he asserted that a particular subset of manuscripts shows unusually little variation, and (3) he argued that the whole corpus naturally separates into three groups. The present study reproduces each of these tasks using quantitative methods based on edit distance (Levenshtein distance), thereby providing a rigorous, reproducible counterpart to Skeat’s largely qualitative judgments.
Mean‑string reconstruction
The first analysis replaces Skeat’s intuitive “best line” with a formal definition of a mean string. All pairwise edit distances among the forty‑five excerpts are computed, and the string that minimizes the sum of distances to all others is selected as the mean. Because exhaustive search is computationally infeasible, the authors employ a Metropolis‑Hastings Markov‑chain Monte‑Carlo sampler to explore the space of candidate strings and converge on a near‑optimal solution. The resulting mean string matches Skeat’s published best line at the character level, differing only in marginal orthographic variants that appear in a handful of peripheral manuscripts. This demonstrates that the edit‑distance‑based mean faithfully captures the central tendency of the textual tradition.
String variance and the “low‑variation” subset
The second analysis quantifies Skeat’s claim that a certain group of manuscripts varies little. Using the same mean string as a reference point, the authors compute the average edit distance from each manuscript to the mean, defining this quantity as the string variance. The manuscripts identified by Skeat as a “low‑variation” cluster exhibit a variance roughly 30 % lower than the overall corpus average. To assess statistical significance, a bootstrap resampling procedure generates 10 000 pseudo‑samples; the observed variance difference yields a p‑value < 0.01, confirming that the low‑variation group is not a chance artifact. This quantitative validation supports Skeat’s editorial intuition and provides a clear metric for future scholars to identify stable textual witnesses.
Clustering into three families
The third analysis tackles Skeat’s three‑group hypothesis. A full distance matrix is constructed from the edit distances, and two clustering strategies are applied: (a) agglomerative hierarchical clustering with Ward’s linkage, and (b) a distance‑based adaptation of k‑means. Model‑selection criteria—including silhouette scores, the elbow method, and the Dunn index—converge on three clusters as optimal. Examination of the clusters reveals a geographic‑dialectal pattern: one cluster corresponds to manuscripts from the southwestern Midlands, a second to central Midlands texts, and a third to northern and eastern copies. Within‑cluster average distances range from 2.1 to 2.4 edit operations, while between‑cluster distances are 5.8–6.2, indicating strong separation. The clustering outcome aligns closely with Skeat’s original grouping, providing empirical support for his editorial taxonomy.
Implications and future directions
The study demonstrates that three core editorial judgments made by a nineteenth‑century scholar can be reproduced with contemporary quantitative tools. By framing textual comparison in terms of edit‑distance‑based means, variances, and clustering, the authors offer a methodological toolkit that is both transparent and replicable. The approach can be extended in several ways: incorporating larger manuscript corpora, testing alternative distance metrics (e.g., Jaro‑Winkler or phonetic‑aware distances), and embedding the results in Bayesian phylogenetic models to infer manuscript transmission trees. Moreover, the techniques are not limited to Middle English literature; they are applicable to any tradition where multiple divergent witnesses exist, from classical Greek to early modern print cultures. In sum, the paper bridges historical philology and data‑driven analysis, showing that modern statistical thinking can enrich, validate, and extend the insights of pioneering textual critics like Walter Skeat.
Comments & Academic Discussion
Loading comments...
Leave a Comment