Low-dimensional Embodied Semantics for Music and Language


Embodied cognition states that semantics is encoded in the brain as firing patterns of neural circuits, which are learned according to the statistical structure of human multimodal experience. However, each human brain is idiosyncratically biased, according to its subjective experience history, making this biological semantic machinery noisy with respect to the overall semantics inherent to media artifacts, such as music and language excerpts. We propose to represent shared semantics using low-dimensional vector embeddings by jointly modeling several brains from human subjects. We show these unsupervised efficient representations outperform the original high-dimensional fMRI voxel spaces in proxy music genre and language topic classification tasks. We further show that joint modeling of several subjects increases the semantic richness of the learned latent vector spaces.


💡 Research Summary

The paper investigates how shared semantic representations can be extracted from functional magnetic resonance imaging (fMRI) recordings of multiple human subjects listening to music or reading language stimuli. Grounded in the theory of embodied cognition, the authors argue that while each individual brain encodes semantic information in noisy, idiosyncratic ways, the common statistical structure across brains can be uncovered by jointly modeling their neural responses. To this end, they employ Generalized Canonical Correlation Analysis (GCCA), an extension of classical CCA to an arbitrary number of views. In the GCCA framework, each subject’s high‑dimensional voxel matrix Xᵥ (N samples × dᵥ voxels) is approximated by a shared low‑dimensional latent matrix G (N × C) through a linear projection Pᵥ (dᵥ × C): the objective is to minimize Σᵥ ‖G − XᵥPᵥ‖²_F subject to the orthogonality constraint GᵀG = I, yielding C canonical components that maximize inter‑subject correlation while remaining mutually orthogonal.
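The objective above admits a compact SVD-based solution. The sketch below is an illustrative MAXVAR-style GCCA in plain NumPy, not the authors' implementation; the function name, the per-view rank truncation, and the least-squares recovery of the projections are assumptions for illustration:

```python
import numpy as np

def gcca(views, n_components, rank=20, eps=1e-9):
    """MAXVAR-style GCCA sketch: find a shared G (N x C) with G^T G = I
    and per-view projections P_v minimising sum_v ||G - X_v P_v||^2_F.

    views: list of (N, d_v) arrays, one per subject (same N rows/stimuli).
    """
    # Orthonormal basis of each centred view via truncated SVD
    bases = []
    for X in views:
        Xc = X - X.mean(axis=0)
        U, s, _ = np.linalg.svd(Xc, full_matrices=False)
        r = min(rank, int(np.sum(s > eps)))
        bases.append(U[:, :r])
    # Shared space: leading left singular vectors of the stacked bases
    M = np.hstack(bases)
    G, _, _ = np.linalg.svd(M, full_matrices=False)
    G = G[:, :n_components]
    # Per-view projections by least squares: P_v maps voxels into G space
    projections = [np.linalg.lstsq(X - X.mean(axis=0), G, rcond=None)[0]
                   for X in views]
    return G, projections
```

At test time, a new subject's voxel vector can be projected into the shared space by multiplying with that subject's Pᵥ.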

Two publicly available datasets are used. The music dataset (MG) comprises fMRI recordings from 19 participants who each listened eight times to 25 short (6 s) clips spanning five genres (ambient, country, metal, rock‑n‑roll, symphonic). After down‑sampling the original high‑resolution voxel space by a factor of six (resulting in 5,488 voxels) and averaging repetitions, each stimulus‑subject pair is represented by a single vector. The language dataset consists of fMRI responses from 16 participants to 180 concrete concepts and two larger sentence sets (243 and 384 sentences) covering 24 topics each (LT243 and LT384).
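The repetition-averaging step reduces each stimulus–subject pair to one voxel vector. A minimal sketch of that step, with hypothetical variable names (the paper does not specify its preprocessing code):

```python
import numpy as np

def average_repetitions(bold, stim_ids):
    """Average per-trial fMRI responses over repeated presentations of
    each stimulus, yielding one voxel vector per stimulus.

    bold:     (n_trials, n_voxels) array of per-trial voxel responses
    stim_ids: (n_trials,) array of stimulus labels (e.g. clip index 0-24)
    Returns a (n_stimuli, n_voxels) array ordered by sorted stimulus id.
    """
    stims = np.unique(stim_ids)
    return np.stack([bold[stim_ids == s].mean(axis=0) for s in stims])
```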

The authors first determine the optimal number of canonical components C for each dataset by cross‑validation: for each candidate C (2–25) they train GCCA on training folds and evaluate the Mean Average Precision (MAP) of an across‑subject retrieval task on the held‑out fold. The C that yields the highest average MAP is selected (≈12 components for music, 10–15 for language).
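The model-selection loop can be sketched as follows. For a self-contained example, PCA stands in for GCCA (the paper fits GCCA on the training folds), and the leave-one-out cosine-similarity MAP scorer is an assumed stand-in for the paper's across-subject retrieval metric:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold

def retrieval_map(Z, labels):
    """Leave-one-out retrieval MAP: each item queries all others, ranked
    by cosine similarity; an item is relevant if it shares the label."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sims = Zn @ Zn.T
    aps = []
    for i in range(len(Z)):
        order = np.argsort(-sims[i])
        order = order[order != i]          # exclude the query itself
        hits = (labels[order] == labels[i]).astype(float)
        if hits.sum() == 0:
            continue
        prec = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append((prec * hits).sum() / hits.sum())
    return float(np.mean(aps))

def select_n_components(X, labels, candidates=range(2, 26), n_splits=5):
    """Pick the embedding dimensionality C by cross-validated MAP."""
    best_c, best_map = None, -1.0
    for c in candidates:
        scores = []
        for train, test in KFold(n_splits, shuffle=True,
                                 random_state=0).split(X):
            emb = PCA(n_components=c).fit(X[train])
            scores.append(retrieval_map(emb.transform(X[test]),
                                        labels[test]))
        if np.mean(scores) > best_map:
            best_c, best_map = c, float(np.mean(scores))
    return best_c, best_map
```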

With the chosen dimensionality, they assess the “semantic richness” of the learned embeddings through two proxy tasks. In the across‑subject retrieval task, a test subject’s fMRI vector is projected into the shared G space and matched against the projected vectors of other subjects for the same stimulus; MAP scores demonstrate that the low‑dimensional embeddings retain discriminative semantic information and outperform the raw voxel space. In the classification task, the canonical components serve as features for a linear Support Vector Machine (SVM) that predicts music genre or language topic. Across all experiments, the GCCA embeddings achieve higher classification accuracy (typically 8–15 % absolute improvement) than classifiers trained on the original high‑dimensional voxel data.
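The classification proxy task maps directly onto a standard scikit-learn pipeline. A minimal sketch, assuming the canonical components are already computed and making no claim about the paper's exact SVM settings:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def proxy_task_accuracy(embeddings, labels, n_splits=5):
    """Linear-SVM proxy task: predict music genre (or language topic)
    from the low-dimensional embeddings; cross-validated accuracy."""
    clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
    return cross_val_score(clf, embeddings, labels, cv=n_splits).mean()
```

Running the same function on the raw voxel matrices and on the GCCA embeddings gives the accuracy comparison reported in the paper.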

A key finding is that semantic performance improves monotonically as more subjects are included in the GCCA model, supporting the hypothesis that aggregating multiple brains reduces individual bias and amplifies the shared semantic signal. Importantly, the approach does not rely on any stimulus‑derived features (e.g., audio spectrograms, word embeddings) nor on supervised label information during embedding learning, making the resulting space a generic, modality‑agnostic semantic representation.

The paper situates its contribution relative to prior work that either regresses fMRI onto pre‑computed linguistic embeddings (e.g., GloVe) or uses audio‑derived features to predict voxel activity. By contrast, the presented method learns a purely brain‑driven embedding, aligning with the embodied cognition view that meaning is distributed across the whole brain.

Limitations include the inherent low temporal resolution of fMRI, the modest number of participants, and the linear nature of GCCA, which may miss nonlinear relationships. The authors suggest future directions such as employing deep multiview methods (e.g., Deep CCA, variational autoencoders) and scaling up to larger, more diverse datasets. Such extensions could further enhance the expressive power of brain‑derived semantic spaces and enable applications ranging from brain‑based information retrieval to neuro‑adaptive human‑computer interaction.

