RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Explaining closed-source Large Language Model (LLM) outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce Rotary Positional Embedding Linear Local Interpretable Model-agnostic Explanations (RoPE-LIME), an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.


💡 Research Summary

The paper introduces RoPE‑LIME, a novel framework for attributing the outputs of closed‑source large language models (LLMs) without incurring the high cost of repeated API calls. Traditional perturbation‑based explainers such as LIME or its generative extension gSMILE require many queries to the target model, and they suffer from instability because regenerated text can vary dramatically with small input changes. RoPE‑LIME solves these problems by (1) decoupling reasoning from explanation—querying the closed model only once to obtain a fixed output, then using a smaller open‑source surrogate model to evaluate perturbed inputs; and (2) employing two technical innovations: a locality kernel based on Relaxed Word Mover’s Distance (RWMD) computed in Rotary Positional Embedding (RoPE) space, and a Sparse‑K sampling strategy that explores the perturbation space with logarithmic complexity.

The RoPE‑based locality kernel addresses a key weakness of prior distance metrics: sensitivity to positional shifts caused by masking. RoPE encodes relative positions as rotations in a complex‑valued space, so when tokens are masked the underlying geometry remains stable. The authors compute RWMD over contextualized RoPE embeddings, aggregating token spans in polar coordinates (averaging magnitudes and phases) and defining a “polar L2” distance that combines magnitude and phase differences. This distance is fed into a Gaussian kernel (exp(−d²/σ²)) to weight each perturbed sample, ensuring that samples close to the original input in the RoPE‑RWMD sense have higher influence in the subsequent regression.
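The polar aggregation and kernel weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of a circular mean for phases and the per-dimension way magnitude and phase differences are combined are our assumptions, since the summary only states that magnitudes and phases are averaged and combined into a "polar L2" distance.

```python
import numpy as np

def polar_aggregate(span):
    """Aggregate a span of RoPE-space vectors in polar coordinates.

    `span` is an (n_tokens, d) complex array. Magnitudes are averaged
    directly; phases use a circular mean (an assumption -- the summary
    only says magnitudes and phases are averaged).
    """
    mags = np.abs(span).mean(axis=0)
    phases = np.angle(np.exp(1j * np.angle(span)).mean(axis=0))
    return mags, phases

def polar_l2(a, b):
    """'Polar L2' distance combining magnitude and phase differences."""
    (ra, ta), (rb, tb) = polar_aggregate(a), polar_aggregate(b)
    dphase = np.angle(np.exp(1j * (ta - tb)))  # wrap phase diff to (-pi, pi]
    return float(np.sqrt(np.sum((ra - rb) ** 2 + dphase ** 2)))

def kernel_weight(d, sigma):
    """Gaussian locality kernel exp(-d^2 / sigma^2) from the text."""
    return float(np.exp(-(d ** 2) / (sigma ** 2)))
```

By construction, an unperturbed span has distance 0 from itself and receives the maximum weight of 1, so samples that stay close to the original input in RoPE space dominate the regression.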

Sparse‑K sampling replaces naïve Leave‑One‑Out (LOO) or exhaustive power‑set sampling. For an input with M features, Sparse‑K selects K active features per mask (K = √M, 2√M, or 4√M) and draws a total of N = c·log K perturbations, where c is a budget factor (e.g., 0.5M, M, or min(M, 8K)). This yields O(log M) sample complexity while still covering a diverse set of feature interactions. The authors empirically sweep (k, c) values across different feature‑count buckets and demonstrate that performance is robust to these hyperparameters.
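A minimal sketch of a Sparse‑K mask generator under the budget rule above. The uniform-random choice of the K positions and the exact mapping of the budget factor onto c = budget·M are our assumptions; the paper's convention for whether "active" positions are kept or masked out may also differ.

```python
import math
import random

def sparse_k_masks(m, k_factor=1.0, budget=1.0, seed=0):
    """Draw Sparse-K perturbation masks for m features.

    Each mask flags exactly K = round(k_factor * sqrt(m)) positions
    (1 = active, 0 = inactive); the number of masks follows the
    summary's N = c * log K with c = budget * m.
    """
    rng = random.Random(seed)
    k = max(1, round(k_factor * math.sqrt(m)))
    c = budget * m
    n = max(1, round(c * math.log(max(k, 2))))  # guard log(1) = 0
    masks = []
    for _ in range(n):
        active = set(rng.sample(range(m), k))
        masks.append([1 if i in active else 0 for i in range(m)])
    return masks
```

For m = 16 with the default factors this gives K = 4 active positions per mask, versus 16 single-feature masks for LOO, so interactions between features are probed directly rather than one feature at a time.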

The pipeline proceeds as follows: (1) The closed‑source model f_L receives prompt x and generates output y. (2) The surrogate model f_S, conditioned on y, computes token‑level logits for each perturbed input x⊙z_j, producing a negative log‑likelihood L_j and a KL‑divergence target KL(L₀‖L_j) relative to the baseline L₀. (3) RWMD distances d_j are computed between the original and perturbed inputs, transformed into weights w_j = exp(−d_j²/σ²) where σ is the median of the distance set. (4) A weighted least‑squares regression solves for coefficients β, and absolute values |β_i| are normalized to yield attribution scores a_i for each feature F_i.
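Steps (3) and (4) of the pipeline can be sketched as a single weighted least-squares fit. The helper name, the absence of an intercept term, and the fallback when the median distance is zero are our assumptions; the rest follows the text (σ = median distance, weights w_j = exp(−d_j²/σ²), normalized |β| as attribution scores).

```python
import numpy as np

def fit_attributions(Z, losses, distances):
    """Weighted least-squares attribution step (a sketch).

    Z:         (n, m) binary mask matrix, one row per perturbation.
    losses:    per-perturbation NLL (or KL-divergence) targets from
               the surrogate model.
    distances: RoPE-RWMD distances of each perturbed input to the
               original input.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(losses, dtype=float)
    d = np.asarray(distances, dtype=float)
    sigma = float(np.median(d)) or 1.0          # sigma = median distance
    w = np.exp(-(d ** 2) / (sigma ** 2))        # Gaussian locality weights
    # Solve min_beta sum_j w_j (y_j - Z_j . beta)^2 via sqrt-weighting.
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * Z, sw.ravel() * y, rcond=None)
    scores = np.abs(beta)                       # a_i = normalized |beta_i|
    total = scores.sum()
    return scores / total if total else scores
```

Because the surrogate supplies the loss targets, the closed model is queried only once (step 1); every perturbed evaluation in this fit runs on the open-source model.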

Experiments are conducted on two QA benchmarks. For a hand‑labeled subset of MMLU (50 queries, top‑5 influential words per query), RoPE‑LIME (using Qwen‑8B as surrogate) outperforms gSMILE (gpt‑4o‑mini) across IoU (0.364 vs 0.248), F1 (0.508 vs 0.368), and AU‑ROC (0.563 vs 0.431) despite using only the open‑source model for attribution. On HotpotQA, the authors select 989 instances with exactly ten retrieved documents, treat each sentence as a feature, and evaluate Sparse‑K against LOO. The best Sparse‑K configuration (k = 4√M, c = 0.5 M) achieves IoU 0.903, F1 0.927, and AU‑ROC 0.891, substantially surpassing LOO’s IoU 0.797, F1 0.848, and AU‑ROC 0.772. Importantly, RoPE‑LIME requires only a single API call to the closed model, whereas gSMILE needs dozens, yielding dramatic cost savings.

The authors conclude that (1) RoPE‑based RWMD provides a robust notion of locality for masked perturbations; (2) Sparse‑K offers an efficient, scalable sampling scheme that retains interaction coverage; and (3) decoupling reasoning from explanation enables probability‑based, low‑noise attributions for closed‑source LLMs. They suggest future work on larger surrogate models, alternative tokenization schemes, and multimodal extensions to broaden RoPE‑LIME’s applicability.

