There is growing evidence that independently trained AI systems come to represent the world in the same way. In other words, independently trained embeddings from text, vision, audio, and neural signals share an underlying geometry. We call this the Representational Alignment Hypothesis (RAH) and investigate evidence for and consequences of this claim. The evidence is of two kinds: (i) internal structure comparison techniques, such as representational similarity analysis and topological data analysis, reveal matching relational patterns across modalities without learning an explicit mapping; and (ii) methods based on cross-modal embedding alignment, which learn mappings between representation spaces, show that simple linear transformations can bring different embedding spaces into close correspondence, suggesting near-isomorphism. Taken together, the evidence suggests that, even after controlling for trivial commonalities inherent in standard data preprocessing and embedding procedures, a robust structural correspondence persists, hinting at an underlying organizational principle. Some have argued that this result shows that the shared structure captures a fundamental, Platonic level of reality. We argue that this conclusion is unjustified. Moreover, we aim to give the idea an alternative philosophical home, rooted in contemporary metasemantics (i.e., theories of what makes something a representation and what makes it meaningful) and responses to the symbol grounding problem. We conclude by considering the scope of the RAH and proposing new ways of distinguishing semantic structures that are genuinely invariant from those that inevitably arise because all of our data is generated under human-specific conditions on Earth.
Understanding how meaning is generated and represented has long been a central question in cognitive science and philosophy [7]. This concern, often referred to as the symbol grounding problem [26] or the problem of metasemantics, asks: How do abstract symbols, such as words, images, or neural activations, acquire meaning? Traditional approaches suggest that meaning is somehow grounded in sensorimotor experience, but the exact nature of this grounding remains controversial. In recent years, the emergence of high-dimensional embedding techniques for AI systems across modalities such as neural signals, text, video, and audio has opened a new avenue for investigating this problem. These methods allow us to represent diverse types of data as abstract numerical vectors, which in turn enable quantitative comparisons across modalities.
We investigate the following questions: Is there an underlying invariant semantic structure that is common to independently trained representation spaces (e.g., neural, textual, visual, and auditory modalities)? If such a structure exists, is it an inherent property of meaning itself? Or might it instead reflect the shared biological, environmental, and cultural conditions under which all of our available data and methods are inevitably generated on Earth?
To set the stage, it is important to note that each modality’s embedding space is learned independently. For example, neural embeddings are derived directly from fMRI data capturing brain activity, while text embeddings are produced using models such as BERT that are trained on vast corpora of language. These spaces are constructed with different objectives, architectures, and training data, and, by design, they start out with no inherent relationship to one another. They are abstract, high-dimensional representations that encode the information deemed important by their respective training processes. The issue is whether, despite these differences, a simple mapping or a comparative analysis reveals that these spaces converge on a shared semantic structure. Somehow, these distinct methods and datasets seem to end up generating the same way of representing the world. Is that really happening? And, if so, why?
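To make the idea of a "simple mapping" between independently trained spaces concrete, the following is a minimal sketch, not code from any of the studies surveyed: it fits a least-squares linear map from one embedding space to another and scores how often the mapped vector lands nearest its true counterpart. The arrays (X_text, Y_image) are random placeholders standing in for paired embeddings of the same items, and precision_at_1 is an illustrative helper; with random data the score only exercises the mechanics.

```python
# Illustrative sketch: fit a linear map W from a "text" space to an "image"
# space over the same 100 items, then check nearest-neighbour retrieval.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 100 shared items, embedded independently in a
# 64-d "text" space and a 48-d "image" space (real studies use paired
# embeddings of the same concepts from separately trained models).
X_text = rng.normal(size=(100, 64))
Y_image = rng.normal(size=(100, 48))

# Least-squares alignment: W = argmin_W ||X_text W - Y_image||_F
W, *_ = np.linalg.lstsq(X_text, Y_image, rcond=None)
Y_pred = X_text @ W

def precision_at_1(pred, target):
    """Fraction of items whose mapped vector is closest (by cosine
    similarity) to the correct target vector among all candidates."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targ_n = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = pred_n @ targ_n.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(pred))))

print(f"precision@1 after linear alignment: {precision_at_1(Y_pred, Y_image):.2f}")
```

The key point for the RAH is not the particular score but the fact that the transformation is merely linear: if a linear map suffices to bring the spaces into close correspondence, the two geometries are already nearly the same up to a change of basis.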
What emerges from the studies surveyed is an intriguing possibility: there may exist a modality-independent essence of meaning itself, a relational structure that manifests consistently regardless of the particular form or modality through which it is expressed. Rather than simply reflecting arbitrary conventions, algorithmic presuppositions, superficial sensory properties, or brain organization, our representations might tap into a deeper semantic organization. The meaning we perceive might be dictated not solely by the specific features or sensory channels through which information is conveyed but also by underlying structures that remain remarkably stable across diverse domains of human experience.
In Section Two, we review and synthesize evidence from two broad families of methods. First, we discuss internal structure comparison methods that examine the geometry or relational patterns within each modality’s space without learning an explicit mapping. Second, we examine transformation-based approaches that extract embeddings independently and then learn a transformation, often a simple linear mapping, to bring the spaces into correspondence. A successful alignment with a minimal transformation suggests that the spaces are nearly isomorphic, indicating that information is encoded in a similar way across modalities. Section Three contains an analysis of these results from the perspective of contemporary literature on the symbol grounding problem [26], Bender and Koller’s popular example of the statistician octopus [4], and the Platonic Representation Hypothesis [29], which is the claim that semantic information encoded in diverse embeddings represents a fundamental level of reality, much like Plato’s forms (on one interpretation). We reject the Platonic Representation Hypothesis and offer some reasons to doubt it, even if one accepts that there is an invariant semantic geometry across embedding modalities. Section Four outlines a number of challenges for thinking that an invariant semantic structure is truly universal. These challenges include the possibilities that the invariant structure comes from the embedding algorithms themselves, that it depends on how human brains process information, and that it is local to the human environments found on Earth. A brief conclusion includes some open questions and directions for future research.
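To illustrate the first family of methods, the sketch below is a minimal, assumption-laden version of representational similarity analysis (RSA): each modality's geometry is summarized as a representational dissimilarity matrix (RDM) over the same set of items, and the two RDMs are compared directly, with no mapping between the spaces ever learned. The function names (rdm, rsa_score) and the placeholder arrays are ours, and Pearson correlation is used for simplicity where the literature more often uses Spearman.

```python
# Illustrative RSA sketch: compare the internal geometry of two embedding
# spaces without aligning them.
import numpy as np

def rdm(embeddings):
    """Representational dissimilarity matrix: pairwise cosine distances
    between all rows of an (items x dimensions) embedding matrix."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def rsa_score(emb_a, emb_b):
    """Correlate the upper triangles of the two RDMs (Pearson here;
    Spearman is more common in practice)."""
    n = emb_a.shape[0]
    iu = np.triu_indices(n, k=1)
    return float(np.corrcoef(rdm(emb_a)[iu], rdm(emb_b)[iu])[0, 1])

# Placeholder data: 50 items embedded in two unrelated spaces of
# different dimensionality (e.g., "text" vs. "fMRI").
rng = np.random.default_rng(1)
emb_text = rng.normal(size=(50, 300))
emb_fmri = rng.normal(size=(50, 80))
print(f"RSA correlation (random baseline): {rsa_score(emb_text, emb_fmri):.3f}")
```

Because only relational patterns are compared, the method is indifferent to the dimensionality or coordinate system of each space, which is what makes it suitable for asking whether independently trained representations share structure.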
An embedding is a way of representing something, like a word, an image, or a sound, as a numerical vector (an ordered list of numbers). For example, instead of representing the word “cat” simply as letters, a language model might represent it as a list of numbers, such as [0.1, -0.4, 0.5, ...]. A vector with n entries can be thought of as a direction in n-dimensional space.
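As a toy illustration of this idea (the vectors below are made up, and real models use hundreds or thousands of dimensions), semantically related words tend to point in similar directions, which is typically measured with cosine similarity:

```python
# Toy embeddings and cosine similarity: 1 means same direction,
# 0 means unrelated, negative means opposing directions.
import numpy as np

cat = np.array([0.1, -0.4, 0.5, 0.2, -0.1])
dog = np.array([0.2, -0.3, 0.4, 0.1, -0.2])
car = np.array([-0.5, 0.6, -0.1, 0.3, 0.4])

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(f"cat vs dog: {cosine(cat, dog):.2f}")  # closer in meaning -> higher
print(f"cat vs car: {cosine(cat, car):.2f}")  # farther in meaning -> lower
```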