Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

📝 Original Info

  • Title: Adversarially Probing Cross-Family Sound Symbolism in 27 Languages
  • ArXiv ID: 2512.12245
  • Date: 2025-12-13
  • Authors: Anika Sharma, Tianyi Niu, Emma Wrenn, Shashank Srivastava

📝 Abstract

The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba–Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.
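
To make "bag-of-segment features" concrete, here is a minimal sketch, assuming each word has already been split into phoneme segments. The helper names (`build_vocab`, `bag_of_segments`) and the toy forms are illustrative assumptions, not the authors' released code or data.

```python
from collections import Counter

def build_vocab(words):
    """Map each distinct segment to a column index (sorted for determinism)."""
    segments = sorted({seg for word in words for seg in word})
    return {seg: i for i, seg in enumerate(segments)}

def bag_of_segments(word, vocab):
    """Return a count vector for one word; segment order is discarded."""
    counts = Counter(word)
    vec = [0] * len(vocab)
    for seg, n in counts.items():
        if seg in vocab:  # segments unseen at vocabulary-building time are ignored
            vec[vocab[seg]] = n
    return vec

# Toy, invented forms (NOT taken from the paper's dataset).
# Each word is a list of phoneme segments.
toy_words = [
    ["t", "i", "n", "i"],
    ["l", "a", "r", "o"],
    ["k", "i", "t"],
    ["b", "o", "m", "a"],
]

vocab = build_vocab(toy_words)
features = [bag_of_segments(w, vocab) for w in toy_words]
```

Because each column corresponds to exactly one segment, a linear classifier trained on such vectors has one weight per segment, which is one plausible reading of why the abstract calls the classifiers interpretable.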

💡 Deep Analysis

Figure 1

📄 Full Content

Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

Anika Sharma, Tianyi Niu, Emma Wrenn, Shashank Srivastava
University of North Carolina at Chapel Hill

Abstract

The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba–Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages × 30 items), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.

1 Introduction

If you were watching a superhero movie called Lamonians vs. Grataks: The Phoneme Accords, chances are you’d already be rooting for the Lamonians because they sound like they would be nicer. In a 2009 Guardian article, linguist David Crystal posed a thought experiment: when asked to judge two fictional alien races, most people instinctively sided with the Lamonians, drawn to the soft consonants (/l/, /m/, /n/) and long vowels and diphthongs that give the name its gentle, likable tone (Crystal, 2009).
This phenomenon in which specific sounds systematically convey particular meanings is known as sound symbolism, and it challenges the long-standing linguistic assumption that form and meaning are entirely arbitrary. But how can we be sure such intuitions reflect universal cognitive principles, rather than simply shared linguistic history? This paper introduces a method designed to answer this question.

Sound symbolism is most familiar in onomatopoeia, words like buzz or crash that imitate real-world sounds. It also manifests systematically across languages: in Yucatec Maya, vowel length signals event duration (Guen, 2013); in Swedish, the prefix pj- marks pejoration (Åsa Abelin, 1999); and in Japanese, consonants in food mimetics reflect perceived crispness (Raevskiy et al., 2023).

Despite evidence from individual languages, identifying cross-linguistic sound symbolism remains methodologically difficult because of two issues. First, identifying universal patterns is difficult because phonological similarities may reflect shared ancestry or areal contact rather than true sound symbolism. Second, phonological inventories differ: some languages lack certain sounds, which obscures potential effects. Yet, the question of cross-linguistic sound symbolism carries significance beyond theoretical linguistics. It may reveal cognitive universals in perception, improve cross-lingual transfer in low-resource NLP, and guide data-driven brand naming in commercial applications (Motoki et al., 2023).

In this work, we investigate whether sound symbolism for size holds across typologically diverse languages by testing if adjectives meaning “small” and “large” consistently share phonological features, regardless of language family. Our approach includes an adversarial setup designed to isolate potentially universal sound-symbolic patterns while controlling for language-specific influences.
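
The paper names gradient reversal (Ganin and Lempitsky, 2015) as the mechanism behind its adversarial setup. The sketch below is a toy illustration of that idea only, assuming a one-dimensional encoder and two linear heads with squared-error losses; it is not the authors' architecture. All names (`grl_forward`, `grl_backward`, `gradients`, and the scalar weights `w`, `a`, `b`) are hypothetical.

```python
def grl_forward(z):
    """Gradient reversal layer, forward pass: the identity."""
    return z

def grl_backward(grad, lam):
    """Gradient reversal layer, backward pass: flip the sign and scale by lam."""
    return -lam * grad

def gradients(x, w, a, b, y_size, y_lang, lam):
    """Gradients for a scalar toy model (an assumption, not the paper's model).

    encoder:        z = w * x
    size head:      a * z            (trained normally)
    language head:  b * GRL(z)       (the adversary)
    Each loss is 0.5 * (prediction - target) ** 2.
    """
    z = w * x

    # Size head: its gradient reaches the encoder unchanged.
    size_err = a * z - y_size
    grad_z_size = size_err * a

    # Language head: its gradient is reversed before reaching the encoder,
    # so the encoder is pushed to make language prediction worse.
    lang_err = b * grl_forward(z) - y_lang
    grad_z_lang = grl_backward(lang_err * b, lam)

    grad_w = (grad_z_size + grad_z_lang) * x  # encoder update direction
    grad_b = lang_err * grl_forward(z)        # adversary still learns normally
    return grad_w, grad_b

# With lam = 0 the encoder only sees the size loss; with lam = 1 the
# reversed language gradient is added at full strength.
grad_w_plain, _ = gradients(x=1.0, w=2.0, a=1.0, b=1.0, y_size=1.0, y_lang=0.0, lam=0.0)
grad_w_adv, grad_b_adv = gradients(x=1.0, w=2.0, a=1.0, b=1.0, y_size=1.0, y_lang=0.0, lam=1.0)
```

The design point is that the language head itself keeps an ordinary gradient (`grad_b`), so it stays a competent adversary, while the shared encoder receives the negated signal and is steered toward representations from which language identity is harder to recover.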
Our contributions are:

  • A cross-linguistic dataset of 800+ size adjectives (30 per language) from 27 languages across 13 language families, phonemically transcribed in IPA and validated through native speaker recordings to capture contrastive phonological distinctions.
  • An adversarial framework using gradient reversal (Ganin and Lempitsky, 2015) to suppress language-family signals while retaining semantic structure. The model maintains above-chance size classification (54.4%) while reducing language identification to chance level (34.0%), offering a new approach to disentangle universal patterns from genealogical relatedness.
  • Evidence that while vowels like /a/, /i/, and /o/ confirm traditional size-symbolism patterns, consonants, particularly voiced fricatives like /ʕ/ and /ɦ/, also contribute to size prediction across diverse language families, expanding beyond a purely vowel-centric account.

2 Related Works

The classical view in linguistics considers the form–meaning link arbitrary (Saussure, 2011), with exceptions in iconicity or sound symbolism. Foundational experiments show that listeners map phonological form to perceived magnitude; for example, Sapir (1929) found reliab

Reference

This content is AI-processed based on open access ArXiv data.
