Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders
Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance favor the dominant training language? To address this, we train models on different multilingual mixtures and analyze their internal mechanisms using Cross-Layer Transcoders (CLTs) and Attribution Graphs. Our results reveal shared multilingual representations: the model employs highly similar features across languages, while language-specific decoding emerges in later layers. Models trained without English develop the same shared multilingual space structure. Decoding relies partly on a small set of high-frequency features in the final layers, which linearly encode language identity from early layers. Intervening on these features allows one language to be suppressed and another substituted. Finally, to explain non-English failures, we perform a Model-Diffing experiment: underperformance arises from dim late-layer features, weak middle-layer clusters, and a tokenizer bias toward English that forces early layers to specialize in word reassembly. Finetuning strengthens these features and their links, improving token assembly and language-specific decoding, and providing a mechanistic explanation for multilingual gaps. Our models and CLTs are available at https://huggingface.co/collections/CausalNLP/multilingual-clts and https://huggingface.co/collections/CausalNLP/multilingual-gpt2-models. Our code is available at: https://github.com/abirharrasse/MultilingualCLTs
💡 Research Summary
The paper investigates how multilingual large language models (LLMs) internally represent and process multiple languages, and why performance often favors the dominant training language, typically English. To answer these questions, the authors train several GPT-2-style and TinyStories-style models on a variety of language mixtures, ranging from heavily English-biased (90% English) to balanced and entirely non-English data. They then apply a recent mechanistic interpretability tool, Cross-Layer Transcoders (CLTs), together with attribution graphs to trace feature interactions across transformer layers.
A CLT learns an encoder that projects a layer's MLP input into a sparse feature space, and decoders that reconstruct MLP outputs at that layer and at downstream layers. By training CLTs on activations sampled uniformly from all languages, the authors obtain a feature dictionary that is language-agnostic and can be used to compute attribution scores between any pair of features across layers. They also define a multilingual entropy score for each feature, measuring how evenly the feature activates across the five languages.
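The entropy score can be sketched as follows. This is a minimal illustration of the idea, not the authors' exact implementation; the function name and the use of total activation mass per language are assumptions:

```python
import numpy as np

def multilingual_entropy(mass_per_language) -> float:
    """Normalized entropy of a feature's activation mass across languages.

    mass_per_language: total (or mean) activation of one CLT feature on
    text from each language, shape (n_languages,).
    Returns 0.0 for a purely language-specific feature and 1.0 for a
    feature that activates equally across all languages.
    """
    mass = np.asarray(mass_per_language, dtype=float)
    total = mass.sum()
    if total <= 0:
        return 0.0  # dead feature: treat as language-specific
    p = mass / total
    # Shannon entropy over the language distribution, skipping zero entries
    h = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return float(h / np.log(len(mass)))  # normalize to [0, 1]

# A feature firing only on English vs. one firing evenly on all five languages
english_only = multilingual_entropy([10.0, 0.0, 0.0, 0.0, 0.0])  # -> 0.0
shared = multilingual_entropy([2.0, 2.0, 2.0, 2.0, 2.0])         # -> 1.0
```

Under this definition, the U-shaped entropy curve reported below corresponds to averaging this score over the features active at each layer.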
Three core research questions are addressed:
- Do models form a shared multilingual representation (a "pivot") regardless of data composition? Entropy analysis shows a characteristic U-shaped curve across layers: early layers have low entropy (language-specific), middle layers peak in entropy (highly multilingual), and late layers drop again (language-specific decoding). This pattern holds across model sizes (177M vs. 68M parameters), across data mixtures (from 90% English to balanced), and even in a larger LLaMA-3.2-1B model. Crucially, a model trained on balanced non-English data exhibits the same middle-layer multilingual peak, demonstrating that the shared space is an architectural property rather than a by-product of English dominance.
- How is language identity decoded and routed to the correct output language? Attribution graphs reveal that a tiny subset of high-frequency features in the final MLP layers accounts for >80% of the logit effect. Linear probing shows that these features linearly read out language identity from early-layer token embeddings. Intervening on these features (e.g., zeroing or swapping their activations) can suppress the original language or substitute another, effectively performing a causal language swap. Language-specific decoding thus relies on a small, highly influential set of features that act as a bottleneck between the shared multilingual space and the language-specific output head.
- Why do non-English languages still underperform despite shared representations? A "Model-Diffing" experiment compares English-dominant and non-English-dominant models and identifies three failure contributors: (a) dimmer late-layer features for non-English text, (b) weaker clustering of multilingual features in middle layers, and (c) a tokenizer bias toward English that forces early layers to focus on word reassembly rather than semantic abstraction. Fine-tuning on multilingual data strengthens the late-layer features, densifies middle-layer clusters, and improves the alignment between token embeddings and the high-frequency decoding features, thereby narrowing the performance gap.
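The causal intervention described above, zeroing or swapping the activations of the high-frequency decoding features, can be sketched with a forward hook on a toy module. This is a schematic stand-in, not the authors' code: the model, the hidden state, and the feature indices are all assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

# Toy stand-in for a late MLP layer whose outputs include a few
# high-frequency language-decoding features (indices are hypothetical).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
LANG_FEATURES = [3, 7]  # hypothetical indices of language-identity features

def make_swap_hook(donor_acts: torch.Tensor):
    """Hook that overwrites the language features with activations recorded
    from a prompt in another language (a causal language swap). Passing
    zeros as the donor instead suppresses the original language."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., LANG_FEATURES] = donor_acts[..., LANG_FEATURES]
        return patched
    return hook

x_en = torch.randn(1, 8)  # stand-in for an English prompt's hidden state
x_fr = torch.randn(1, 8)  # stand-in for a French prompt's hidden state

with torch.no_grad():
    donor = model(x_fr)                      # record donor-language activations
    handle = model[2].register_forward_hook(make_swap_hook(donor))
    swapped = model(x_en)                    # English run with French features
    handle.remove()
    clean = model(x_en)                      # unpatched baseline
```

After the patched run, the language features match the donor while all other dimensions are untouched; in the actual experiment this redirects generation toward the donor language.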
The paper also releases nine multilingual models (two sizes × four data mixtures + a non‑English‑only model), three fine‑tuned variants, and the corresponding CLTs, all accompanied by code and pretrained weights. This open‑source contribution enables further mechanistic studies and practical improvements in multilingual LLMs.
In summary, the study provides the first comprehensive CLT-based mechanistic account of multilingual processing: early layers encode language-specific surface forms, middle layers converge to a language-agnostic latent space, and a handful of high-frequency features in the final layers decode language identity. The findings explain why English dominance is not required for shared representations, why language-specific decoding is a bottleneck, and how tokenizer design and feature strengthening can mitigate non-English performance gaps.
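The "linearly decode language identity" claim is the kind of statement a linear probe verifies: fit a linear classifier on layer activations and check whether language is recoverable. The sketch below uses synthetic activations in which language identity is a shifted direction plus noise; the data, dimensions, and separation are assumptions made purely to keep the example self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for early-layer activations: each language adds a
# fixed direction to Gaussian noise, so identity is linearly encoded.
rng = np.random.default_rng(0)
n_langs, per_lang, dim = 5, 200, 32
directions = rng.normal(size=(n_langs, dim))
X = np.concatenate([rng.normal(size=(per_lang, dim)) + 2.0 * d
                    for d in directions])
y = np.repeat(np.arange(n_langs), per_lang)

# A linear probe: if it classifies well, identity is linearly readable.
probe = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = probe.score(X, y)
```

In the paper the probe is applied to recorded model activations rather than synthetic data; held-out accuracy (not training accuracy, as here) is the quantity to report in a real experiment.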