Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Reading time: 6 minute
...

📝 Original Info

  • Title: Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
  • ArXiv ID: 2512.23837
  • Date: 2025-12-29
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradientbased attacks, our approach leverages modelinternal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

💡 Deep Analysis

Deep Dive into Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation.

Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradientbased attacks, our approach leverages modelinternal token predictions, producing perturbations that are both plausible and internally consistent with the model’s own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we al

📄 Full Content

Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation Kaustubh Dhole Department of Computer Science Emory University kdhole@emory.edu Abstract Recent advances in mechanistic interpretability suggest that intermediate attention layers en- code token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient- based attacks, our approach leverages model- internal token predictions, producing pertur- bations that are both plausible and internally consistent with the model’s own generation pro- cess. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream eval- uation tasks. We conduct experiments on ar- gument quality assessment using the ArgQual- ity dataset, with LLaMA-3.1-Instruct-8B serv- ing as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evalua- tion performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from cer- tain layers and token positions can introduce grammatical degradation, limiting their prac- tical effectiveness. Overall, our findings high- light both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines. 1 Introduction Recent efforts in mechanistic interpretability have highlighted the wealth of information encoded within the layers of large language models (LLMs) (Meng et al., 2022; Sharkey et al., 2025). These layers, which are often overlooked in favor of the final outputs, have been shown to act as iter- ative predictors of the eventual response (Jastrzeb- ski et al., 2018; nostalgebraist, 2020), providing insights into the model’s generation process. While most of these techniques have heavily focused on interpretability, we argue that they could potentially be adapted to generate paraphrastic and adversarial examples for evaluation tasks – which generally operate over model generated data. Probing the attention layers has multiple advan- tages – First, since the layers act as both iterative indicators of the final output (Belrose et al., 2023), and store related entities (Meng et al., 2022; Her- nandez et al., 2024), they can provide natural lan- guage variations by potentially treating LLMs as knowledge bases. Second, these generations, can be obtained early on without necessitating running over all the layers (Din et al., 2024; Pal et al., 2023). Third, these generations might provide cues for model hallucinations (Yuksekgonul et al., 2024) as gradual deviations from the original token are ob- tained from the model itself. From an adversarial point of view, tokens from intermediate layers are valuable as they can act as perturbations to the orig- inal input. Specifically, the outputs are iteratively refined, since as the activations move towards the last layer they tend to move towards the direction of the negative gradient (Jastrzebski et al., 2018) or each successive layer achieving lower perplex- ity (Belrose et al., 2023). This is also precisely how adversarial examples are constructed – perturbing towards the direction of the positive gradient of the loss (Goodfellow et al., 2015). Hence, in this study, we explore whether such fine-grained information extracted from the atten- tion layers of large language models (LLMs) can be leveraged to generate adversarial examples on downstream natural language tasks, particularly critical tasks such as evaluation. Specifically, in this work, we introduce two attention-based adversarial generation methods: attention-based token substitution and attention- based conditional generation, both of which lever- age intermediate-layer token predictions to con- struct plausible yet adversarial inputs. Our paper is organized as follows: §2 first dis- cusses related work in interpretability and genera- arXiv:2512.23837v1 [cs.CL] 29 Dec 2025 tion evaluation. §3 and §4 defines and implements the two approaches for generating examples. §5 finally discusses the results and analysis. 2 Related Work We now discuss some of the related work in mech- anistic interpretability and LLM based evaluation to place our work in context. Recent interpretability studies have explored the information encoded within the internal layers of large language models (LLMs) to better understand how models generate subsequent tokens. For in- stance, LogitLens and TunedLens (nostalgebraist, 2020; Belrose et al., 2023) demonstrate that inter- mediate layers can be made to predict tokens simi- lar to those generated in the final layer, by attaching a trained or untrained unembedding matrix to them. Methods such as ROME (Meng et al., 2022) high- li

…(Full text truncated)…

📸 Image Gallery

layers_figure.png layers_figure2.png

Reference

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut