Visualizing token importance for black-box language models

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches to LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question: can we understand how the outputs of a black-box LLM depend on each input token? Such tools are critically needed in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem: LLMs are stochastic functions (i.e., two generations from the same prompt can differ by chance), and computing prompt-level gradients to approximate input sensitivity is infeasible without access to model internals. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight, model-agnostic procedure for evaluating the sensitivity of a language model's output to each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of an LLM's reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.


💡 Research Summary

This paper addresses the critical challenge of auditing black-box large language models (LLMs) in high-stakes applications like law and medicine, where understanding model behavior is essential but internal access is restricted. The authors identify a gap in existing auditing methods, which often focus on narrow aspects like bias detection, and instead propose a more general framework for analyzing how an LLM’s outputs depend on each individual input token.

The core contribution is the introduction of Distribution-Based Sensitivity Analysis (DBSA), a model-agnostic, lightweight procedure designed for practitioners. DBSA operates under the key realization that LLMs are stochastic generators; comparing single outputs is insufficient due to inherent randomness. Therefore, it measures token importance by comparing the distributions of outputs rather than individual instances.
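The consequence of stochasticity can be illustrated with a toy stand-in for an LLM. The `noisy_llm` function below and its noise model are hypothetical, not from the paper; they only show why single-output comparison is unreliable while distribution-level statistics are stable:

```python
import random

def noisy_llm(prompt, rng):
    # Hypothetical stochastic generator: a prompt-dependent mean plus noise,
    # standing in for sampling from a real LLM at nonzero temperature.
    return sum(ord(c) for c in prompt) % 100 + rng.gauss(0, 5)

rng = random.Random(1)

# Two single calls with the identical prompt differ by chance...
a = noisy_llm("same prompt", rng)
b = noisy_llm("same prompt", rng)

# ...but statistics of the output distribution are stable across
# repeated sampling runs of the same prompt.
mean_a = sum(noisy_llm("same prompt", rng) for _ in range(200)) / 200
mean_b = sum(noisy_llm("same prompt", rng) for _ in range(200)) / 200
```

This is why DBSA compares distributions of outputs rather than individual generations.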

The DBSA procedure works as follows:

1. For a given input prompt and a target token, substitute that token with one of its nearest neighbors in embedding space, the smallest feasible perturbation in the discrete token space.
2. For both the original and the perturbed input, query the black-box LLM multiple times via Monte Carlo sampling to collect sets of output responses.
3. Transform these outputs into a semantic similarity space using a predefined similarity function (e.g., based on sentence embeddings).
4. Construct two distributions: P0, the distribution of similarities among outputs from the original input (capturing intrinsic variability), and P1, the distribution of similarities between outputs from the original and the perturbed inputs.
5. Quantify the token's importance or sensitivity score as a statistical distance (e.g., an effect size) between these two distributions.

A larger score indicates that a minor change to that token leads to a significant shift in the meaning of the LLM's output distribution.
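A minimal sketch of this procedure, assuming a stubbed stochastic model and a toy similarity function in place of a real API endpoint and sentence embeddings; all names and the standardized-mean-difference score here are illustrative, not the authors' implementation:

```python
import random
import statistics

def query_llm(prompt, rng):
    # Hypothetical stochastic stand-in for a black-box LLM API call:
    # a prompt-dependent mean plus sampling noise.
    return sum(ord(c) for c in prompt) % 97 + rng.gauss(0, 1.0)

def similarity(a, b):
    # Toy semantic-similarity proxy (1.0 = identical outputs); a real
    # analysis would compare sentence embeddings instead.
    return 1.0 / (1.0 + abs(a - b))

def dbsa_score(prompt, perturbed, n_samples=30, seed=0):
    rng = random.Random(seed)
    orig = [query_llm(prompt, rng) for _ in range(n_samples)]
    pert = [query_llm(perturbed, rng) for _ in range(n_samples)]

    # P0: similarities among the original outputs (intrinsic variability).
    p0 = [similarity(orig[i], orig[j])
          for i in range(n_samples) for j in range(i + 1, n_samples)]
    # P1: similarities between original and perturbed outputs.
    p1 = [similarity(a, b) for a in orig for b in pert]

    # Sensitivity: a standardized mean difference between P0 and P1.
    spread = statistics.stdev(p0 + p1)
    return (statistics.mean(p0) - statistics.mean(p1)) / spread if spread else 0.0

# A perturbation that shifts the output distribution yields a larger score
# than re-running the unchanged prompt.
score_same = dbsa_score("The cat sat", "The cat sat")
score_diff = dbsa_score("The cat sat", "The zzz sat")
```

The score is high only when the original-vs-perturbed similarities (P1) fall clearly below the model's intrinsic output-to-output similarities (P0).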

The paper elaborates on the unique challenges this approach overcomes: computational intractability of full distribution comparison (solved by finite-sample approximation), ensuring semantic interpretability of changes (solved by using a semantic similarity metric), and handling the constraints of a discrete token space (solved by using nearest-neighbor swaps). Through illustrative examples, such as analyzing legal advice prompts, the authors demonstrate DBSA’s ability to visually highlight tokens that disproportionately influence the output, potentially revealing undesired model sensitivities (e.g., over-reliance on a person’s name). They position DBSA not as a benchmark-oriented tool but as a practical, plug-and-play solution for exploratory analysis, enabling users to gain actionable insights into black-box LLM behavior without requiring specialized knowledge or access to model internals.
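As a sketch of what such visual exploration might look like, the loop below scores each token of a prompt by swapping it for a nearest neighbor and renders a crude text heatmap. The hard-coded neighbor table and `toy_score` function are hypothetical stand-ins for an embedding-space lookup and a real DBSA score:

```python
# Hypothetical nearest-neighbor table, standing in for a lookup over
# token embeddings.
NEIGHBORS = {"Alice": "Alicia", "sued": "litigated", "landlord": "lessor"}

def toy_score(original, perturbed):
    # Stand-in for a DBSA sensitivity score: a crude character-set
    # dissimilarity, so the sketch runs without querying an LLM.
    s1, s2 = set(original), set(perturbed)
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def sensitivity_map(tokens, score_fn):
    """Score each token by swapping it for its nearest neighbor."""
    scores = {}
    prompt = " ".join(tokens)
    for i, tok in enumerate(tokens):
        neighbor = NEIGHBORS.get(tok)
        if neighbor is None:
            continue  # no neighbor known for this token
        perturbed = " ".join(tokens[:i] + [neighbor] + tokens[i + 1:])
        scores[tok] = score_fn(prompt, perturbed)
    return scores

def render(scores, width=20):
    # More '#' marks a token whose perturbation shifts the output more.
    top = max(scores.values()) or 1.0
    for tok, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{tok:>10} | {'#' * round(width * s / top)} {s:.2f}")

scores = sensitivity_map(["Alice", "sued", "her", "landlord"], toy_score)
render(scores)
```

Plugging a real distribution-based score into `score_fn` would surface cases like the paper's example, where a person's name dominates the sensitivity map.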

