Onomastics 2.0 - The Power of Social Co-Occurrences
Onomastics is “the science or study of the origin and forms of proper names of persons or places.” [“Onomastics”. Merriam-Webster.com, 2013. http://www.merriam-webster.com (11 February 2013)]. Personal names in particular play an important role in daily life, as future parents all over the world face the task of finding a suitable given name for their child. This choice is influenced by several factors, such as social context, language, cultural background and, in particular, personal taste. With the rise of the Social Web and its applications, users increasingly interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These data sources also reflect current and new naming trends as well as newly emerging interrelations among names. The present work shows how basic approaches from the fields of social network analysis and information retrieval can be applied to discover relations among names, thus extending onomastics with data mining techniques. The approach starts by building co-occurrence graphs from Social Web data, one set for given names and one for city names. As a main result, correlations are observed between semantically grounded similarities among names (e.g., geographical distance for city names) and structural, graph-based similarities. The discovered relations among given names are the foundation of “nameling” [http://nameling.net], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.
💡 Research Summary
The paper “Onomastics 2.0 – The Power of Social Co‑Occurrences” investigates how large‑scale, user‑generated data from the Social Web can be leveraged to uncover and quantify relationships among proper names, specifically given (personal) names and city names. The authors adopt a straightforward yet powerful pipeline: (1) collect textual corpora from three language editions of Wikipedia (English, German, French), the associated Wiktionary link structure, and a massive Twitter dataset (≈476 million tweets from 2009); (2) extract two entity classes – a curated list of ≈30 k given names (expanded to 36 k via user contributions) and ≈101 k city names (population > 1 000, disambiguated); (3) build undirected, weighted co‑occurrence graphs for each entity class and each data source, where an edge weight equals the number of contexts (sentences in Wikipedia, tweets in Twitter) in which the two entities appear together. This yields eight graphs: G_N_EN, G_N_DE, G_N_FR, G_N_TW for given names and G_C_EN, G_C_DE, G_C_FR, G_C_TW for city names.
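The graph construction in step (3) can be illustrated with a minimal sketch. It assumes that each context (a sentence or a tweet) is a plain string and that names are single tokens matched case-insensitively; the function name `build_cooccurrence_graph` and the sample contexts are hypothetical and not taken from the paper, and multi-word city names or disambiguation are not handled here.

```python
import re
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(contexts, names):
    """Count, for every unordered pair of names, the contexts containing both.

    Returns a Counter mapping (name_a, name_b) with name_a < name_b to the
    edge weight. Assumption: names are single lowercase-comparable tokens.
    """
    vocabulary = {name.lower() for name in names}
    weights = Counter()
    for text in contexts:
        # Each name counts at most once per context, however often it repeats.
        tokens = set(re.findall(r"\w+", text.lower()))
        present = sorted(tokens & vocabulary)
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy contexts, standing in for sentences or tweets.
contexts = [
    "Emma and Noah met Liam in Berlin.",
    "Noah called Emma yesterday.",
    "Liam moved away.",
]
graph = build_cooccurrence_graph(contexts, ["Emma", "Noah", "Liam"])
```

Storing each edge under a sorted pair keeps the graph undirected: the pair (emma, noah) and the pair (noah, emma) share one weight.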
Statistical description shows that all graphs contain a giant weakly connected component covering virtually all vertices. The English‑Wikipedia‑based name graph is the densest (density ≈ 0.067), while the Twitter graphs are the sparsest (density ≈ 0.003–0.020). The authors then compare centrality measures across languages and platforms. Degree centrality and eigenvector centrality are computed for each vertex; pairwise scatter plots reveal strong positive correlations between English and German Wikipedia degree scores, indicating that popular names tend to be popular across language editions. Randomized null models (degree‑preserving edge rewiring) confirm that observed correlations are statistically significant and not artefacts of degree distributions.
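The quantities named above can be sketched for a small unweighted, undirected graph. The density formula 2m / (n(n-1)) and the normalised degree centrality are standard definitions; the rewiring routine is one common way to build a degree-preserving null model (repeated double-edge swaps) and is an assumption about the general technique, not the authors' exact procedure. Edge weights are ignored, and all function names are hypothetical.

```python
import random


def density(n_vertices, edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * len(edges) / (n_vertices * (n_vertices - 1))


def degree_counts(vertices, edges):
    """Number of incident edges per vertex."""
    degree = {v: 0 for v in vertices}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_centrality(vertices, edges):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(vertices)
    return {v: d / (n - 1) for v, d in degree_counts(vertices, edges).items()}


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomise edges by double-edge swaps; every vertex keeps its degree."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible swap orientations
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new_1, new_2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new_1 in edge_set or new_2 in edge_set:
            continue  # swap would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = new_1, new_2
        done += 1
    return edge_list


# Hypothetical toy graph: six names connected in a ring.
vertices = ["anna", "ben", "carl", "dora", "emil", "finn"]
edges = [("anna", "ben"), ("ben", "carl"), ("carl", "dora"),
         ("dora", "emil"), ("emil", "finn"), ("anna", "finn")]
ring_density = density(len(vertices), edges)
centrality = degree_centrality(vertices, edges)
rewired = degree_preserving_rewire(edges, n_swaps=10, seed=42)
```

Comparing a statistic on the observed graph with its distribution over many rewired copies indicates whether the statistic is explained by the degree sequence alone.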
For similarity assessment, the study evaluates a suite of well‑known measures from distributional semantics (Jaccard, Dice, Cosine) and from link prediction and network science (Adamic/Adar, Resource Allocation, Preferential Attachment). In the name domain, external “ground truth” is approximated by popularity rankings and manually curated cultural and gender clusters; in the city domain, geographic distance (computed from latitude/longitude) serves as the semantic baseline. Correlation analysis shows that structural similarity scores, especially Adamic/Adar and Resource Allocation, align closely with the semantic baselines, outperforming pure co‑occurrence frequency or simple Jaccard similarity. This demonstrates that the topology of the co‑occurrence network captures meaningful semantic relations beyond raw co‑occurrence counts.
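Several of these measures depend only on the neighbourhoods of the two vertices being compared. The following sketch uses their standard textbook definitions on an unweighted adjacency map; whether the paper uses weighted variants is not stated in this summary, so the unweighted form is an assumption. The graph and the function names are hypothetical.

```python
import math


def jaccard(adj, x, y):
    """Shared neighbours divided by all neighbours of either vertex."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def cosine(adj, x, y):
    """Cosine of the two binary neighbourhood vectors."""
    if not adj[x] or not adj[y]:
        return 0.0
    return len(adj[x] & adj[y]) / math.sqrt(len(adj[x]) * len(adj[y]))


def adamic_adar(adj, x, y):
    """Sum of 1 / log(degree) over shared neighbours.

    A shared neighbour of two distinct vertices has degree >= 2,
    so the logarithm is never zero.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def resource_allocation(adj, x, y):
    """Sum of 1 / degree over shared neighbours."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


# Hypothetical toy graph as an adjacency map (vertex -> set of neighbours).
adj = {
    "anna": {"ben", "carl"},
    "ben": {"anna", "carl", "dora"},
    "carl": {"anna", "ben"},
    "dora": {"ben"},
}
scores = {
    "jaccard": jaccard(adj, "anna", "dora"),
    "cosine": cosine(adj, "anna", "dora"),
    "adamic_adar": adamic_adar(adj, "anna", "dora"),
    "resource_allocation": resource_allocation(adj, "anna", "dora"),
}
```

Adamic/Adar and Resource Allocation both down-weight shared neighbours that are connected to many vertices, which is one plausible reason they track semantic relatedness better than plain overlap counts.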
To validate practical relevance, the authors deploy the derived name similarity network in a public web service called “Nameling” (http://nameling.net). Nameling offers name search, recommendation, and exploration based on the graph‑derived similarity scores. Within four months of launch, the platform attracted over 30 000 unique users, confirming that the mined relationships are useful for real‑world naming decisions. The same methodology applied to city names reproduces expected geographic proximity patterns, underscoring the generality of the approach for any named entity class.
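For city names, the semantic baseline is the geographic distance between two cities, computed from latitude and longitude. The summary does not say which distance formula the authors used; the haversine great-circle distance shown below is one common choice and is an assumption here, as are the function name and the mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


# Approximate coordinates for Berlin and Paris.
berlin_paris_km = haversine_km(52.52, 13.405, 48.8566, 2.3522)
```

Pairwise distances of this kind can then be rank-correlated with the graph-based similarity scores to check whether structurally similar city names tend to be geographically close.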
The paper contributes to onomastics by (i) introducing a reproducible, data‑driven pipeline that transforms raw social text into weighted co‑occurrence graphs; (ii) providing a thorough comparative analysis of graphs derived from different languages and platforms, including null‑model baselines; (iii) demonstrating that network‑based similarity measures can serve as effective proxies for semantic relatedness; and (iv) delivering a tangible, user‑facing application that bridges academic research and everyday naming practice. The authors suggest future work extending the framework to additional social platforms (e.g., Instagram, Facebook), incorporating multilingual name variants, and exploring other entity types such as brand or product names. Overall, the study showcases how social co‑occurrence data can revitalize traditional onomastic research through modern network science and information retrieval techniques.