The Ten Thousand Kims

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In Korean culture, family members are recorded in special family books. This makes it possible to follow the distribution of Korean family names far back in history. It is here shown that these name distributions are well described by a simple null model, the random group formation (RGF) model. This model makes it possible to predict how the name distributions change, and these predictions are shown to be borne out. In particular, the RGF model predicts that, for married women entering a collection of family books in a certain year, the occurrence of the most common family name “Kim” should be directly proportional to the total number of married women, with the same proportionality constant for all the years. This prediction is also borne out to a high degree. We speculate that it reflects some inherent social stability in the Korean culture. In addition, we obtain an estimate of the total population of the Korean culture down to year 500 AD, based on the RGF model, and find about ten thousand Kims.

💡 Research Summary

The paper “The Ten Thousand Kims” investigates the statistical distribution of Korean family names over a span of roughly five centuries using data extracted from ten traditional Korean family books (jokbo). These genealogical records, which have been kept for centuries, contain entries for women who entered a family through marriage, recording their original family names. The authors compiled a dataset covering the period from 1510 to 1990, divided into sixteen 30‑year windows. For each window they counted (i) the total number of married women recorded (M), (ii) the number of distinct family names among those women (N), and (iii) the frequency of the most common name, almost invariably “Kim” (k_max). The table shows M ranging from 1,510 to 79,935, N from 19 to 162, and k_max from 6 to 9,693.

The central hypothesis is that the set of women entering the jokbo in any given year can be treated as a random sample drawn from the entire Korean population at that time. Under this assumption the distribution of name frequencies, P_M(k), should be governed solely by the three observable quantities (M, N, k_max). To model P_M(k) the authors adopt the Random Group Formation (RGF) model, previously introduced in statistical physics. The RGF model assumes maximal mixing (maximum entropy) among groups, leading to a probability distribution of the form

P_M(k) = A exp(−b k) k^(−γ),

where the constants A, b, and γ are determined self‑consistently from the constraints Σ_k P_M(k)=1, Σ_k k P_M(k)=M/N, and the observed k_max. Importantly, the model predicts two robust features: (1) the number of distinct names N is a unique function of the sample size M, independent of historical epoch, and (2) the size of the largest group (here “Kim”) scales linearly with M, implying a constant proportion of Kims in the population.
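As a rough illustration of how the normalization and mean constraints pin down the parameters, the sketch below evaluates the distribution with the sign convention P(k) ∝ exp(−b k) k^(−γ), γ > 0, and solves for b by bisection at a fixed γ. The values of M, N, and γ are hypothetical, and the third constraint (the observed k_max) is deliberately left out, so this is not the authors' full self-consistent procedure.

```python
import math

def rgf_pmf(b, gamma, k_cap):
    """Normalized P(k) = A * exp(-b*k) * k**(-gamma) for k = 1..k_cap."""
    weights = [math.exp(-b * k) * k ** (-gamma) for k in range(1, k_cap + 1)]
    total = sum(weights)  # equals 1/A, fixed by the normalization constraint
    return [w / total for w in weights]

def mean_size(pmf):
    """Average group size, sum_k k * P(k)."""
    return sum(k * p for k, p in enumerate(pmf, start=1))

def solve_b(gamma, target_mean, k_cap, lo=1e-9, hi=5.0, iters=80):
    """Bisection on b; the mean group size shrinks as b grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_size(rgf_pmf(mid, gamma, k_cap)) > target_mean:
            lo = mid  # mean still too large: a stronger cutoff is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical inputs, not values from the paper.
M, N, GAMMA = 2000, 100, 1.5
b = solve_b(GAMMA, M / N, k_cap=M)
pmf = rgf_pmf(b, GAMMA, k_cap=M)
print(f"b = {b:.5f}, mean group size = {mean_size(pmf):.3f}")
```

In a fuller treatment γ would also be adjusted until the k_max condition is met; fixing it here keeps the example to a one-dimensional root search.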

To test these predictions the authors performed a Monte‑Carlo style random selection experiment. Using the three most recent windows (1900‑1930, 1930‑1960, 1960‑1990) they pooled the 1,650,200 women recorded in the 1900‑1990 period and repeatedly drew random subsamples of size M < M_total. For each subsample they computed the average N(M). The resulting curve (Fig. 1) matches the actual historical points from all sixteen windows with deviations well within three standard deviations. This empirical agreement supports the claim that N depends only on M and not on calendar time.
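The random selection experiment can be sketched in a few lines. The pool below is synthetic (60 placeholder names whose counts fall off with rank), standing in for the pooled records, and the subsample sizes and trial count are arbitrary choices for illustration.

```python
import random

def average_distinct(pool, m, trials, rng):
    """Mean number of distinct names in random subsamples of size m."""
    total = 0
    for _ in range(trials):
        total += len(set(rng.sample(pool, m)))  # sampling without replacement
    return total / trials

# Hypothetical pool: name i appears 600 // i times (not data from the paper).
pool = [f"name{i}" for i in range(1, 61) for _ in range(600 // i)]
rng = random.Random(0)  # fixed seed so the run is repeatable

sizes = [50, 200, 800, len(pool)]
curve = {m: average_distinct(pool, m, trials=200, rng=rng) for m in sizes}
for m, n_avg in curve.items():
    print(m, round(n_avg, 1))
```

The averaged N(M) rises with M and saturates at the number of distinct names in the pool, which is the kind of curve the historical windows are compared against.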

A second test concerns the “Kim” proportion. The data show that k_max / M ≈ 0.06 for every window, from the 16th century onward. The RGF model, because of the linear scaling of the largest group, predicts exactly this constancy. Thus, despite wars, famines, industrialization, and other societal upheavals, the fraction of Kims remained essentially unchanged—a striking illustration of the model’s “optimal mixing” condition in a real social system.
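Why random sampling alone keeps the largest group's share fixed can be checked numerically. The sketch below uses a synthetic pool in which a placeholder name "top" holds 20% of the entries; that share, the pool composition, and the subsample sizes are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter

def top_share(sample):
    """Fraction of the sample held by its most common name."""
    _, count = Counter(sample).most_common(1)[0]
    return count / len(sample)

# Hypothetical pool: "top" has 2000 of 10000 entries; 80 other names have 100 each.
pool = ["top"] * 2000 + [f"other{i}" for i in range(80) for _ in range(100)]
rng = random.Random(1)  # fixed seed so the run is repeatable

shares = {}
for m in (500, 2000, 5000):
    runs = [top_share(rng.sample(pool, m)) for _ in range(100)]
    shares[m] = sum(runs) / len(runs)
    print(m, round(shares[m], 3))
```

The average share stays close to the pool value at every subsample size, so k_max grows in proportion to M with a single constant.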

The authors further exploit the model to extrapolate backward in time. Using an independent source that lists the cumulative number of family names introduced into Korea up to any given year (189 names total), they combine this with the empirically derived N(M) relationship to infer the total population M(t) for earlier centuries. The reconstruction reaches back to around 500 AD, where it yields roughly 10,000 people bearing the name Kim, hence the evocative “Ten Thousand Kims” title.
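The inversion step, going from a known number of names N to a population size M, can be illustrated as follows. This sketch swaps in an analytic stand-in for the empirical N(M) curve: the expected number of distinct names when M draws are made with replacement from fixed name frequencies. The 1/rank frequency profile and the target of 100 names are hypothetical; only the total of 189 names comes from the text.

```python
def expected_distinct(freqs, m):
    """E[N] when m names are drawn with replacement from frequencies freqs."""
    return sum(1.0 - (1.0 - p) ** m for p in freqs)

def invert_n_of_m(freqs, target_n, m_max):
    """Smallest integer m with expected_distinct(freqs, m) >= target_n.

    Relies on expected_distinct being non-decreasing in m; returns m_max
    if the target is not reached below that bound.
    """
    lo, hi = 0, m_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_distinct(freqs, mid) >= target_n:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical rank-frequency profile over 189 names (weights ~ 1/rank).
raw = [1.0 / rank for rank in range(1, 190)]
norm = sum(raw)
freqs = [w / norm for w in raw]

TARGET_N = 100  # hypothetical number of names present at some earlier date
m_est = invert_n_of_m(freqs, TARGET_N, m_max=10**7)
print(m_est, round(expected_distinct(freqs, m_est), 2))
```

Because N(M) is monotone, a binary search is enough; the same idea applies if the curve is a table of subsampled averages instead of a formula.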

Critical appraisal: The study’s strength lies in its novel use of genealogical records and a physics‑based null model that requires only three observable parameters. The RGF framework elegantly captures the scaling of name diversity with population size and explains the observed stability of the Kim proportion without invoking detailed demographic mechanisms. However, several limitations merit attention. First, jokbo primarily document elite lineages (yangban families), so the sampled women may not be a truly random cross‑section of the whole society; the assumption of random sampling could be biased toward higher‑status families. Second, the model treats γ and b as static, whereas in reality they may drift slowly with cultural or administrative changes (e.g., surname adoption policies). Third, the extrapolation to 500 AD hinges on the completeness of the 189‑name introduction dataset and assumes that the N(M) relationship holds under dramatically different social structures, which is speculative. Finally, the analysis focuses exclusively on women; male surname dynamics could differ due to patrilineal inheritance and differential mortality.

Despite these caveats, the paper convincingly demonstrates that a simple maximum‑entropy model can capture the essential statistical regularities of Korean surname distributions across centuries. It illustrates how concepts from statistical physics—entropy maximization, scaling laws, and random group formation—can be fruitfully applied to historical demography and cultural evolution. The finding that a single proportionality (≈6 % Kims) persists through major historical disruptions is both surprising and thought‑provoking, suggesting a deep underlying stability in Korean naming conventions. The methodology could be extended to other societies with comparable genealogical archives, or to other categorical data (e.g., language word frequencies, city size distributions), offering a promising avenue for interdisciplinary research between physics, sociology, and history.
