ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.


💡 Research Summary

The paper introduces ADAPT, a hybrid prompt‑optimization algorithm designed to visualize features encoded in the activation space of large language models (LLMs). The authors focus on sparse autoencoder (SAE) latents extracted from the Gemma 2 2B model, a setting that has become a standard testbed for interpretability because SAE directions tend to be monosemantic and are publicly available via the Neuronpedia API. Traditional approaches to interpreting such directions rely on dataset search: running the model over a large corpus and selecting the examples that most strongly activate a given latent. While effective at scale, dataset search is costly, limited by the coverage of the corpus, and can conflate correlated concepts. Feature visualization—optimizing an input to maximally activate a target direction—offers a more direct, causal probe, but the discrete nature of text makes gradient‑based optimization difficult.
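The dataset-search baseline the paragraph contrasts against can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: the corpus and the token-counting "latent" below are stand-ins for a real model's activation on a real SAE direction.

```python
def dataset_search(corpus, activation_fn, k=3):
    """Rank corpus examples by how strongly they activate a target latent."""
    return sorted(corpus, key=activation_fn, reverse=True)[:k]

# Toy stand-in for an SAE latent: "activation" is just a token count here.
def toy_activation(text):
    return text.lower().split().count("paris")

corpus = ["Paris is in France", "Berlin is in Germany", "I love Paris , Paris"]
top = dataset_search(corpus, toy_activation, k=2)
print(top)  # the two examples that activate the toy latent most strongly
```

The sketch also makes the paragraph's criticism concrete: the search can only ever return what the corpus contains, so coverage gaps and correlated co-occurrences in the data directly limit what the latent appears to encode.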

Existing prompt‑optimization methods fall short for this task. Greedy Coordinate Gradient (GCG) uses a gradient estimate on one‑hot token vectors to guide token swaps, but it is highly sensitive to the initial prompt and often gets trapped in local minima. Beam‑search‑based attacks (BEAST) avoid gradients by expanding prompts with top‑k sampling, yet they only explore right‑append operations and lack mechanisms to refine left‑hand context. Evolutionary Prompt Optimization (EPO) introduces a Pareto front balancing activation strength against fluency, but its evolutionary operators are computationally heavy and still depend on a good seed prompt.
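The GCG mechanics described above (rank swap candidates per position with a gradient signal, then exactly evaluate the best few) can be illustrated on a toy objective. Everything here is a simplification: the "gradient" is an exact single-swap score rather than a one-hot gradient estimate, and the objective is a sequence-matching stand-in, not an SAE activation.

```python
def gcg_step(prompt_ids, objective, vocab, grad_fn, top_k=3):
    """One GCG-style step: rank swap candidates per position with a gradient
    signal, evaluate the top few exactly, and keep the single best swap."""
    best, best_score = list(prompt_ids), objective(prompt_ids)
    for pos in range(len(prompt_ids)):
        ranked = sorted(vocab, key=lambda tok: grad_fn(prompt_ids, pos, tok),
                        reverse=True)[:top_k]
        for tok in ranked:
            trial = list(prompt_ids)
            trial[pos] = tok
            score = objective(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

# Toy objective: count positions matching a hidden target sequence.
TARGET = (1, 2, 3)
def objective(ids):
    return sum(a == b for a, b in zip(ids, TARGET))

# Exact single-swap scores stand in for the one-hot gradient estimate.
def grad_fn(ids, pos, tok):
    return objective(list(ids[:pos]) + [tok] + list(ids[pos + 1:]))

ids = [0, 0, 0]
for _ in range(3):
    ids, score = gcg_step(ids, objective, vocab=range(5), grad_fn=grad_fn)
print(ids, score)  # converges to [1, 2, 3] with score 3
```

The greediness is visible in the loop: each step commits to the single best swap from the current prompt, which is why a poor seed can strand the search in a local optimum, the failure mode ADAPT's initialization targets.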

ADAPT addresses these shortcomings through three tightly coupled components:

  1. Beam‑Search Initialization – Multiple independent beams are launched from single‑token seeds. At each step the beams expand not only by right‑appending tokens sampled from the model’s own probability distribution but also by inserting tokens at random interior positions (middle‑inserts). This dual expansion mitigates the bias toward repetitive right‑most tokens and yields a diverse set of high‑quality seeds without requiring any hand‑crafted prompts.

  2. Hybrid Mutation – During the main optimization loop each prompt generates a fixed number of candidates via either (a) GCG‑style gradient‑guided token swaps, or (b) logit‑swap operations borrowed from EPO, which simply resample a token from the model’s logits at a chosen position. A user‑specified probability determines which mutation is applied, allowing the algorithm to balance the higher computational cost of gradient estimates against the cheap, high‑throughput logit‑swap. Positions for mutation are sampled either uniformly or with a bias toward later tokens; for under‑length prompts, middle‑inserts are used instead of deletions.

  3. Diversity‑Preserving Evaluation and Culling – Each prompt carries a group identifier inherited from its originating beam. The selection mechanism maintains “guaranteed slots” that always keep the best prompt of each group, preventing premature convergence to minor variations of a single solution. After a configurable merge‑point, the algorithm switches to a global greedy selection that ignores group provenance, accelerating convergence once sufficient diversity has been explored. The loss function combines the target SAE activation magnitude with a fluency penalty (average self‑cross‑entropy of the prompt under the LLM), ensuring that the resulting prompts are both effective and readable.
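The loss shape and the guaranteed-slot culling from the components above can be sketched as follows. Function names, the fluency weight `beta`, and the population layout are illustrative assumptions, not the paper's implementation.

```python
def adapt_loss(activation, avg_cross_entropy, beta=0.1):
    """Combined objective (the weight `beta` is an assumed knob): reward latent
    activation, penalize disfluent prompts via their self-cross-entropy."""
    return activation - beta * avg_cross_entropy

def cull(population, score, pool_size, merged=False):
    """Diversity-preserving culling. Before the merge point, the best prompt of
    every beam group holds a guaranteed slot; after it, selection is globally
    greedy and ignores group provenance."""
    ranked = sorted(population, key=lambda p: score(p["ids"]), reverse=True)
    if merged:
        return ranked[:pool_size]
    kept, seen = [], set()
    for p in ranked:                      # one guaranteed slot per group
        if p["group"] not in seen:
            kept.append(p)
            seen.add(p["group"])
    for p in ranked:                      # fill the remaining slots greedily
        if len(kept) >= pool_size:
            break
        if p not in kept:
            kept.append(p)
    return kept[:pool_size]

pop = [{"ids": [3], "group": 0}, {"ids": [2], "group": 0},
       {"ids": [1], "group": 1}, {"ids": [0], "group": 2}]
score = lambda ids: ids[0]
print([p["group"] for p in cull(pop, score, 3)])              # [0, 1, 2]
print([p["group"] for p in cull(pop, score, 3, merged=True)]) # [0, 0, 1]
```

The two printed selections show the trade-off: before the merge point every beam group survives even if its best prompt scores poorly, while after it the second-best prompt of a strong group can crowd out a weaker group entirely.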

The authors evaluate ADAPT against GCG, BEAST, EPO, and several EPO extensions on a suite of 102 SAE latents spanning three taxonomic axes: activation density, vocabulary diversity, and locality. Three metrics are introduced: (i) Relative Activation – the ratio of a method’s maximal activation to the best activation found by exhaustive dataset search; (ii) Fluency Score – average cross‑entropy under the LLM; (iii) Token Diversity – measured by lexical variety across generated prompts. Across all layers and latent types, ADAPT consistently outperforms baselines, achieving 12–18 % higher relative activation while maintaining comparable fluency. Notably, for highly localized latents that fire on a single token, ADAPT’s middle‑insert capability discovers prompts that place the target token in a semantically appropriate context rather than simply repeating it, a failure mode observed in pure right‑append beam search.
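Two of the three metrics have natural closed forms; the sketches below are plausible stand-ins consistent with the descriptions above, not the paper's exact formulas (in particular, the type-token ratio used for token diversity is an assumption, and the fluency score is simply average cross-entropy under the LLM, where lower is better).

```python
def relative_activation(method_best, dataset_best):
    """Ratio of a method's best activation to the best found by exhaustive
    dataset search; values above 1.0 mean optimization beat the corpus."""
    return method_best / dataset_best

def token_diversity(prompts):
    """Assumed lexical-variety proxy: the type-token ratio over all tokens
    in the generated prompts."""
    tokens = [tok for p in prompts for tok in p.split()]
    return len(set(tokens)) / len(tokens)

print(relative_activation(9.0, 6.0))    # 1.5
print(token_diversity(["a b", "a c"]))  # 0.75
```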

The paper also discusses limitations. The beam‑search initialization can generate a large number of candidates, leading to increased memory consumption, especially for deeper layers where longer prompts are beneficial. Managing variable‑length prompts requires careful masking and padding, adding implementation complexity. Future work is suggested in three directions: (1) meta‑learning strategies to automatically tune beam size and mutation probabilities; (2) multi‑latent joint optimization where a single prompt is encouraged to activate several related directions simultaneously; (3) dimensionality reduction of the token space (e.g., using learned token embeddings) to shrink the search space without sacrificing expressivity.
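The variable-length bookkeeping mentioned in the limitations amounts to padding prompts to a common length and masking the pad positions so they do not contribute to activations or the fluency loss. A minimal sketch, assuming right-padding and a pad id of 0:

```python
def pad_batch(prompt_ids, pad_id=0):
    """Right-pad variable-length token-id prompts to a common length and build
    an attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(p) for p in prompt_ids)
    padded = [p + [pad_id] * (max_len - len(p)) for p in prompt_ids]
    mask = [[1] * len(p) + [0] * (max_len - len(p)) for p in prompt_ids]
    return padded, mask

batch, mask = pad_batch([[5, 6, 7], [8]])
print(batch)  # [[5, 6, 7], [8, 0, 0]]
print(mask)   # [[1, 1, 1], [1, 0, 0]]
```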

In conclusion, ADAPT demonstrates that feature visualization for LLMs is not only feasible but can be made robust and efficient by tailoring the optimization pipeline to the discrete, high‑dimensional nature of text. By combining gradient‑guided swaps, cheap logit‑based mutations, and a diversity‑preserving selection scheme, ADAPT provides a practical tool for probing the semantics of learned latent directions, opening new avenues for interpretability research, model debugging, and controlled generation.

