Language modulates vision: Evidence from neural networks and human brain-lesion models
Comparing information structures between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision-language DNN models, such as CLIP, with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates human visual perception. However, interpreting the results of such comparisons is inherently limited by the “black box” nature of DNNs. To address this, we combined model-brain fitness analyses with human brain-lesion data to examine how disrupting the communication pathway between the visual and language systems causally affects the ability of vision-language DNNs to explain the activity of the VOTC. Across four diverse datasets, CLIP consistently captured unique variance in VOTC neural representations relative to both label-supervised (ResNet) and unsupervised (MoCo) models. This advantage tended to be left-lateralized at the group level, aligning with the human language network. Analyses of 33 stroke patients revealed that reduced white-matter integrity between the VOTC and the language region in the left angular gyrus correlated with decreased CLIP-brain correspondence and increased MoCo-brain correspondence, indicating a dynamic influence of language processing on the activity of the VOTC. These findings support the integration of language modulation into neurocognitive models of human vision, reinforcing concepts from vision-language DNN models. The sensitivity of model-brain similarity to specific brain lesions demonstrates that leveraging manipulations of the human brain is a promising framework for evaluating and developing brain-like computational models.
💡 Research Summary
The authors investigate whether language supervision in vision‑language deep neural networks (DNNs) provides a genuine explanatory advantage for human visual cortex activity, beyond what can be attributed to larger training data or generic high‑level visual representations. They compare three models that share the same ResNet‑50 backbone but differ in supervision: (1) CLIP‑vision, trained with image‑caption pairs (sentence‑level language supervision); (2) ResNet‑50, trained on human‑generated word labels (word‑level supervision); and (3) MoCo, a self‑supervised model trained only on images. For each model they extract features from the final pooling layer, construct object‑wise representational dissimilarity matrices (RDMs), and relate these to neural RDMs derived from four independent fMRI datasets that capture ventral occipitotemporal cortex (VOTC) responses to objects. The datasets differ in participant language experience (oral naming, sign‑language naming, colour‑knowledge judgments, and a large open‑world object set), providing a robust test of generalizability.
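The RDM construction step can be illustrated with a minimal sketch (the function and variable names below are ours, not the authors'; correlation distance is assumed as the dissimilarity metric, a common RSA choice):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_rdm(features):
    """Object-wise representational dissimilarity matrix from model
    activations: `features` is (n_objects, n_units), one row of
    final-pooling-layer activations per object image; returns an
    (n_objects, n_objects) matrix of correlation distances (1 - r)."""
    return squareform(pdist(features, metric="correlation"))

# One RDM per model over the same object set, e.g.:
# rdms = {name: build_rdm(f) for name, f in
#         {"clip": clip_feats, "resnet": resnet_feats, "moco": moco_feats}.items()}
```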
Using searchlight representational similarity analysis (RSA), the authors compute voxel‑wise Spearman correlations between model RDMs and neural RDMs. To isolate the unique contribution of language supervision, they calculate partial correlations: CLIP‑vision versus the neural RDM while controlling for ResNet and MoCo, and ResNet versus the neural RDM while controlling for MoCo. Across all four datasets, CLIP consistently yields higher partial correlations than the other two models, indicating that sentence‑level language information adds explanatory power beyond word‑level labels and pure visual features. The effect is most pronounced in left lateral occipital, fusiform, and inferior temporal regions (L‑LO, L‑FG, L‑ITG), aligning with the left‑dominant language network. No significant differences are found between hearing and deaf participants, suggesting that the modality of language (speech vs. sign) does not drive the effect.
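A sketch of that partial-correlation step, reusing `build_rdm` from above and omitting the searchlight loop over voxel spheres; computing Spearman partial correlation as the Pearson correlation of rank-transformed residuals is a standard equivalent we assume here, not necessarily the authors' exact pipeline:

```python
import numpy as np
from scipy.stats import rankdata

def upper(rdm):
    """Vectorise the upper triangle of a square RDM."""
    return rdm[np.triu_indices_from(rdm, k=1)]

def partial_spearman(x, y, controls):
    """Spearman partial correlation between x and y, controlling for
    the 1-D arrays in `controls`: rank-transform all variables,
    regress the controls out of x and y, correlate the residuals."""
    def residualise(v, zs):
        Z = np.column_stack([np.ones_like(v), *zs])
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    rx, ry, *rz = (rankdata(v) for v in (x, y, *controls))
    return np.corrcoef(residualise(rx, rz), residualise(ry, rz))[0, 1]

# CLIP's unique fit to one searchlight's neural RDM, controlling for
# the other two models (names from the previous sketch):
# r_clip = partial_spearman(upper(rdms["clip"]), upper(neural_rdm),
#                           [upper(rdms["resnet"]), upper(rdms["moco"])])
```

Running `partial_spearman` for each searchlight sphere and mapping the result back to the sphere's centre voxel would yield the voxel-wise maps described above.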
To test causality, the study incorporates a lesion‑based approach. Thirty‑three stroke patients underwent diffusion tensor imaging (DTI) to assess the white‑matter integrity (fractional anisotropy, FA) of tracts linking the VOTC to the left angular gyrus, a key language hub. The authors find that reduced integrity of this VOTC‑language pathway predicts a drop in CLIP‑brain correspondence (r ≈ ‑0.42, p < 0.01) and a concurrent rise in MoCo‑brain correspondence (r ≈ 0.38, p < 0.01); auditory experience (hearing vs. deaf) does not modulate these relationships. This pattern implies that when the structural conduit for language‑visual interaction is compromised, the advantage of language‑supervised models disappears and a purely visual model becomes relatively better at explaining VOTC activity.
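At the patient level this reduces to a simple across-subject correlation; a sketch under the assumption that per-patient FA values and model-brain partial correlations have already been computed (rank correlation is our choice here, the paper may use a different estimator):

```python
from scipy.stats import spearmanr

def tract_model_fit_correlation(fa, model_r):
    """Correlate tract integrity with model-brain correspondence
    across patients.

    fa      : (n_patients,) FA of the VOTC-left-angular-gyrus tract
    model_r : (n_patients,) each patient's model-brain partial
              correlation from the searchlight RSA
    """
    rho, p = spearmanr(fa, model_r)
    return rho, p

# e.g. tract_model_fit_correlation(fa, r_clip) and
#      tract_model_fit_correlation(fa, r_moco) across the 33 patients
```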
Statistical thresholds are stringent (voxel‑wise p < 0.001, cluster‑wise FWE‑corrected p < 0.05). The authors also perform behavioural validation by correlating model RDMs with human similarity judgments (Likert ratings for OPN95/SPN95/FV14 and odd‑one‑out judgments for THINGS). As the degree of language involvement in training increases (MoCo: none; ResNet: word labels; CLIP: full captions), behavioural alignment improves, reinforcing the claim that language information enriches semantic representations.
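The behavioural check follows the same RDM-comparison logic; a brief sketch, reusing `upper` and `rdms` from the earlier sketches and assuming a behavioural RDM (`behav_rdm`, our name) derived from the Likert or odd-one-out judgments:

```python
from scipy.stats import spearmanr

# rdms: model RDMs from the first sketch; behav_rdm: human-derived RDM
for name in ("moco", "resnet", "clip"):  # increasing language involvement
    rho, _ = spearmanr(upper(rdms[name]), upper(behav_rdm))
    print(f"{name}: behavioural alignment rho = {rho:.3f}")
```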
Overall, the paper makes three key contributions: (1) Demonstrates that language supervision in vision models provides a unique, left‑lateralized explanatory benefit for VOTC activity, beyond data‑size or generic visual hierarchy effects. (2) Shows that the structural integrity of white‑matter pathways between visual and language cortices causally mediates this benefit, offering a neuroanatomical grounding for the computational findings. (3) Introduces a novel framework that combines model‑brain RSA with human lesion data to move beyond correlational “black‑box” comparisons, enabling causal inference about the relevance of specific cognitive modules for brain‑like computation.
Limitations include the reliance on CLIP’s massive pre‑training dataset (which may confound language effects with sheer data volume), a modest patient sample size, and the focus on a single visual region (VOTC) without exploring downstream or dorsal visual streams. Future work could expand lesion cohorts, manipulate the amount of language supervision systematically, and test additional brain areas to further delineate the boundaries of language‑modulated vision.