Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
💡 Research Summary
The paper addresses a critical gap in the interpretability of multilingual large language models (LLMs) deployed in linguistically diverse regions such as India. While tools like the Logit Lens and its refined version, the Tuned Lens, have become standard for probing intermediate representations, they are fundamentally English‑centric: they directly project hidden states onto the model’s unembedding matrix, which is biased toward English vocabularies and tokenization schemes. This bias hampers faithful analysis of Indian languages that feature rich morphology, flexible word order, and a variety of scripts (Devanagari, Gurmukhi, Tamil, etc.).
To overcome this limitation, the authors introduce Indic‑TunedLens, a novel interpretability framework specifically designed for Indian languages. The core idea is to learn a shared affine transformation for each transformer layer that aligns the hidden state with the final output distribution of the target language. Formally, for layer n the hidden vector hₙ is transformed as ˜hₙ = Mₙ hₙ + bₙ, where Mₙ ∈ ℝ^{d×d} and bₙ ∈ ℝ^{d} are learned parameters. The transformed vector is then fed through the model’s existing logit head, producing a probability distribution over the language‑specific vocabulary. Training minimizes the Kullback‑Leibler divergence between this distribution and the model’s own final‑layer distribution p_final(x), effectively using the model’s own predictions as soft labels. Crucially, the matrices Mₙ and biases bₙ are shared across all languages, enabling a single set of transformations to capture common multilingual structure while still adapting to language‑specific nuances.
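The per-layer transformation and training objective described above can be sketched as follows. This is a minimal, framework-agnostic NumPy illustration, not the authors' implementation: the class and function names (`IndicLensLayer`, `lens_kl_loss`) are hypothetical, and the lens is shown at its natural identity initialization (where it reduces to the standard Logit Lens) rather than after training.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class IndicLensLayer:
    """Affine translator for layer n: h~_n = M_n h_n + b_n.

    M_n is d x d and b_n is d-dimensional, as in the paper; initializing
    M_n = I and b_n = 0 makes the lens start from the plain Logit Lens.
    """
    def __init__(self, d_model: int):
        self.M = np.eye(d_model)       # learned in practice via gradient descent
        self.b = np.zeros(d_model)     # learned in practice via gradient descent

    def __call__(self, h: np.ndarray) -> np.ndarray:
        return h @ self.M.T + self.b

def lens_kl_loss(lens: IndicLensLayer,
                 h_n: np.ndarray,          # hidden states at layer n, (batch, d)
                 unembed: np.ndarray,      # model's unembedding matrix, (V, d)
                 final_logits: np.ndarray  # model's final-layer logits, (batch, V)
                 ) -> float:
    """KL(p_final || p_lens): the model's own final distribution as soft labels."""
    p_lens = softmax(lens(h_n) @ unembed.T)
    p_final = softmax(final_logits)
    kl = np.sum(p_final * (np.log(p_final + 1e-12) - np.log(p_lens + 1e-12)), axis=-1)
    return float(np.mean(kl))
```

At identity initialization the lens logits coincide with the Logit Lens projection, so training only has to learn the *deviation* needed to align each layer with the language-specific output distribution; the shared `M` and `b` are then optimized across all eleven languages jointly.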
The experimental setup uses the Sarvam‑1 multilingual transformer as the base model. Training data come from the Sangraha multilingual corpus, covering eleven languages (English plus ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Nepali, Tamil, Telugu). Evaluation employs a curated multilingual MMLU benchmark adapted for the same ten Indian languages. Two primary metrics are reported: (1) entropy heatmaps that measure uncertainty of the token distribution at each layer, and (2) layer‑wise agreement, i.e., the top‑1 token match rate between a given layer’s prediction and the final‑layer prediction.
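The two evaluation metrics are straightforward to compute from layer-wise logits. A minimal sketch (function names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_entropy(layer_logits: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of the token distribution at one layer.

    High entropy = the layer's decoded distribution is diffuse/uncertain;
    the entropy heatmap stacks this value over all layers and positions.
    """
    p = softmax(layer_logits)
    return float(np.mean(-np.sum(p * np.log(p + 1e-12), axis=-1)))

def layerwise_agreement(layer_logits: np.ndarray,
                        final_logits: np.ndarray) -> float:
    """Top-1 token match rate between a layer's prediction and the final layer."""
    return float(np.mean(layer_logits.argmax(-1) == final_logits.argmax(-1)))
```

A uniform distribution over a vocabulary of size V gives the maximum entropy log V, so the heatmaps are bounded above by the tokenizer's vocabulary size; agreement is bounded in [0, 1] and equals 1 at the final layer by construction.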
Results show that Indic‑TunedLens dramatically improves both metrics over the standard Logit Lens. Entropy heatmaps reveal a smooth, monotonic decrease from early to later layers, indicating that the transformed representations progressively consolidate semantic information in a language‑aware manner. In contrast, the Logit Lens exhibits high, erratic entropy in early layers, reflecting a mismatch between English‑centric projections and Indian language representations. Layer‑wise agreement scores also improve markedly: early layers (1‑8) gain 0.04‑0.06 absolute accuracy, with the largest gains for morphologically rich languages such as Hindi, Bengali, and Nepali. English shows a modest 0.06 gain in the first layer, while languages with agglutinative characteristics (Telugu, Tamil) display sustained improvements through middle layers (5‑15). The authors further analyze language‑specific patterns, noting that Telugu and Tamil benefit from later‑layer refinements due to their complex word‑formation processes, whereas Gujarati and Kannada exhibit more uniform gains across layers.
The paper contributes three main points: (1) it identifies interpretability failure in Indian languages as a projection problem and proposes a concrete solution; (2) it provides a comprehensive evaluation on Sarvam‑1 across ten Indian languages, demonstrating consistent gains; (3) it offers linguistic insights, showing that morphological analysis occurs early (layers 1‑4) and deeper semantic composition later, with variations across language families.
Limitations are acknowledged. Sharing a single affine transformation across all languages may be insufficient for highly divergent scripts or low‑resource languages not seen during training (e.g., Urdu, Sindhi). Moreover, the KL‑based objective relies entirely on the model’s final‑layer distribution, which could propagate any existing English‑centric biases. Future work could explore language‑specific adapters, multi‑objective training (e.g., probing linguistic features), and application to other multilingual LLMs such as mT5 or XLM‑R.
In summary, Indic‑TunedLens is the first layer‑wise interpretability tool tailored to Indian languages, delivering clearer, more faithful visualizations of how multilingual LLMs process morphologically rich, script‑diverse inputs. By aligning intermediate hidden states with language‑specific output spaces, it opens the door to better debugging, bias detection, and model improvement for NLP applications across the Indian subcontinent.