Stroke Lesions as a Rosetta Stone for Language Model Interpretability

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Large language models (LLMs) have achieved remarkable capabilities, yet methods to verify which model components are truly necessary for language function remain limited. Current interpretability approaches rely on internal metrics and lack external validation. Here we present the Brain-LLM Unified Model (BLUM), a framework that leverages lesion-symptom mapping, the gold standard for establishing causal brain-behavior relationships for over a century, as an external reference structure for evaluating LLM perturbation effects. Using data from individuals with chronic post-stroke aphasia (N = 410), we trained symptom-to-lesion models that predict brain damage location from behavioral error profiles, applied systematic perturbations to transformer layers, administered identical clinical assessments to perturbed LLMs and human patients, and projected LLM error profiles into human lesion space. LLM error profiles were sufficiently similar to human error profiles that predicted lesions corresponded to actual lesions in error-matched humans above chance in 67% of picture naming conditions (p < 10⁻²³) and 68.3% of sentence completion conditions (p < 10⁻⁶¹), with semantic-dominant errors mapping onto ventral-stream lesion patterns and phonemic-dominant errors onto dorsal-stream patterns. These findings open a new methodological avenue for LLM interpretability in which clinical neuroscience provides external validation, establishing human lesion-symptom mapping as a reference framework for evaluating artificial language systems and motivating direct investigation of whether behavioral alignment reflects shared computational principles.


💡 Research Summary

The paper introduces the Brain‑LLM Unified Model (BLUM), a novel framework that uses human lesion‑symptom mapping as an external ground truth for interpreting large language models (LLMs). The authors argue that current interpretability methods—pruning, layer ablation, causal tracing, and internal probing—rely on internal metrics and benchmark performance, which do not guarantee that identified components are truly necessary for language processing as it occurs in the human brain. To address this gap, BLUM leverages over a century of causal evidence from post‑stroke aphasia research, where focal brain lesions have been linked to specific language deficits (semantic, phonological, syntactic, fluency).

The study proceeds in three stages. First, a symptom‑to‑lesion model is trained on data from 410 chronic aphasia patients who completed two clinically validated tasks: the Philadelphia Naming Test (picture naming) and the Western Aphasia Battery‑Revised (sentence completion). Errors are classified into semantic, phonological, mixed, and other categories. Using multivariate regression and machine‑learning techniques, the model learns to predict the spatial distribution of lesions from these error profiles, confirming that behavioral patterns contain sufficient information to reconstruct neuroanatomical damage.
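This first stage can be sketched as a multivariate regression from behavioral error profiles to voxel-wise lesion maps. The sketch below uses synthetic stand-in data and ridge regression purely for illustration; the paper's actual model, feature encoding, and lesion representation are not specified here, so all names and dimensions are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 410 patients, 4 error categories
# (semantic, phonological, mixed, other), 1000-voxel lesion maps.
n_patients, n_categories, n_voxels = 410, 4, 1000
error_profiles = rng.dirichlet(np.ones(n_categories), size=n_patients)  # rows sum to 1
lesion_maps = (rng.random((n_patients, n_voxels)) < 0.1).astype(float)  # binary damage

# Symptom-to-lesion model: multivariate regression from behavior to anatomy.
model = Ridge(alpha=1.0).fit(error_profiles, lesion_maps)

# Project a new error profile (e.g., from a perturbed LLM) into lesion space.
virtual_lesion = model.predict(rng.dirichlet(np.ones(n_categories), size=1))
print(virtual_lesion.shape)  # (1, 1000)
```

Any multi-output regressor could stand in for `Ridge` here; the essential point is that the mapping is learned on human data only and then reused, frozen, for artificial systems.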

Second, a 13‑billion‑parameter transformer LLM is systematically perturbed. Perturbations vary by target layer, percentage of disrupted weights, and injected noise level. The same picture‑naming and sentence‑completion tasks are administered to each perturbed model, and the outputs are annotated with the identical error taxonomy used for human patients. This creates a set of LLM error profiles that can be directly compared to human error patterns.
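A minimal sketch of one such perturbation, assuming it amounts to adding Gaussian noise to a random fraction of a layer's weights (the paper varies target layer, disrupted-weight percentage, and noise level; the exact scheme may differ, and this function is hypothetical):

```python
import numpy as np

def perturb_layer(weights, fraction, noise_std, seed=0):
    """Add Gaussian noise to a random fraction of a layer's weight matrix."""
    rng = np.random.default_rng(seed)
    perturbed = weights.copy()
    mask = rng.random(weights.shape) < fraction               # which weights to disrupt
    perturbed[mask] += rng.normal(0.0, noise_std, size=mask.sum())
    return perturbed

# Toy example: disrupt ~25% of an 8x8 weight matrix with sigma = 0.1 noise.
layer = np.zeros((8, 8))
damaged = perturb_layer(layer, fraction=0.25, noise_std=0.1)
```

Sweeping `fraction` and `noise_std` over a grid, layer by layer, yields the family of perturbed models whose task outputs are then scored with the same error taxonomy as the patients.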

Third, each LLM error profile is fed into the human‑trained symptom‑to‑lesion model, yielding a “virtual lesion” map for the artificial system. To evaluate alignment, the authors identify the five human patients whose error profiles most closely match each LLM profile, average their actual lesion maps, and compute Pearson correlations between this average and the virtual lesion. Permutation testing with 2,000 random patient sets establishes a chance baseline. The results show that in 67% of picture‑naming conditions and 68.3% of sentence‑completion conditions, the virtual lesions correlate significantly higher than chance (p < 10⁻²³ and p < 10⁻⁶¹, respectively). Moreover, semantic‑dominant errors map onto ventral‑stream regions (temporal‑frontal pathways), while phonological‑dominant errors map onto dorsal‑stream regions (parietal‑frontal pathways), mirroring classic aphasia findings.
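The evaluation logic described above can be sketched as follows, with all data synthetic and every variable name hypothetical: find the five best-matching patients, correlate their averaged real lesions with the virtual lesion, and compare against 2,000 random five-patient sets.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_categories, n_voxels = 410, 4, 1000
human_profiles = rng.dirichlet(np.ones(n_categories), size=n_patients)
human_lesions = rng.random((n_patients, n_voxels))

llm_profile = rng.dirichlet(np.ones(n_categories))  # one perturbation condition
virtual_lesion = rng.random(n_voxels)               # stand-in for the model's prediction

# 1) Five patients whose error profiles most closely match the LLM profile.
dists = np.linalg.norm(human_profiles - llm_profile, axis=1)
matched_avg = human_lesions[np.argsort(dists)[:5]].mean(axis=0)

# 2) Observed alignment: Pearson r between averaged real and virtual lesions.
observed_r = np.corrcoef(matched_avg, virtual_lesion)[0, 1]

# 3) Chance baseline: the same correlation for 2,000 random 5-patient sets.
null_r = np.array([
    np.corrcoef(human_lesions[rng.choice(n_patients, 5, replace=False)].mean(axis=0),
                virtual_lesion)[0, 1]
    for _ in range(2000)
])
p_value = (null_r >= observed_r).mean()  # permutation p-value for this condition
```

On real data, a condition counts as aligned when `observed_r` exceeds the permutation null; the paper reports this for 67% and 68.3% of conditions on the two tasks.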

These findings provide the first empirical evidence that LLM perturbation effects can be meaningfully situated within a neurobiologically grounded space defined by human lesion‑symptom relationships. The alignment does not imply mechanistic equivalence or that LLMs have rediscovered brain anatomy; rather, it suggests that the dimensions governing breakdown in LLM performance under perturbation overlap with dimensions of breakdown that are causally critical in the human brain. Consequently, BLUM offers an external validation tool: components whose disruption yields error patterns that correspond to known lesion‑deficit maps are likely to be essential for language‑like processing, whereas components that produce non‑human‑like errors may be serving functions unrelated to core linguistic computation.

The authors acknowledge limitations: the error taxonomy is human‑centric and may miss novel failure modes of LLMs; the study focuses on a single 13 B transformer and may not generalize to larger models or different architectures; and lesion‑symptom mapping, while causal, is still a statistical association that cannot capture all nuances of neural computation. Nonetheless, BLUM opens a methodological avenue for grounding AI interpretability in established clinical neuroscience, paving the way for richer cross‑disciplinary investigations into whether behavioral alignment reflects shared computational principles between brains and machines.

