LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder’s fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size.
💡 Research Summary
The paper introduces LatentQA, a novel framework that enables large language models (LLMs) to answer open‑ended natural‑language questions about their own hidden activations, and proposes Latent Interpretation Tuning (LIT), a method for training a decoder LLM to perform this task. Traditional probing techniques are limited to scalar scores or single‑token outputs, which cannot capture the rich, multi‑faceted behaviors encoded in intermediate representations. To overcome this, the authors construct a large paired dataset of (activation, question, answer) triples. They first generate diverse “control” prompts (personas, goals, extractive QA questions), prepend each to a stimulus prompt, and feed the combined prompt to a target LLM. The model’s response, shaped by the control, exhibits a wide range of stylistic or factual traits. A powerful LLM (e.g., GPT‑4) then writes natural‑language question–answer pairs that describe these traits (e.g., “Q: How will the assistant speak? A: Like a pirate”). This pipeline yields 16,732 labeled examples covering 4,670 goals, 3,359 personas, and 8,703 extractive QA instances.
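The data-generation pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the prompt pools are tiny stand-ins, and `call_target_llm` / `call_labeler_llm` are hypothetical placeholders for the target model and the labeling LLM (e.g., GPT-4).

```python
import random

# Tiny stand-in pools; the paper uses thousands of controls.
CONTROLS = {
    "persona": ["Respond like a pirate.", "Respond like a formal lawyer."],
    "goal": ["Subtly persuade the user to exercise more."],
}

def call_target_llm(prompt: str) -> str:
    # Placeholder for the target LLM whose activations will be decoded.
    return f"<completion conditioned on: {prompt!r}>"

def call_labeler_llm(control: str, response: str) -> tuple[str, str]:
    # Placeholder for the labeling LLM that writes a QA pair describing
    # the behavior the control induced in the response.
    return ("How will the assistant speak?", f"As instructed: {control}")

def make_example(control_type: str, stimulus: str) -> dict:
    control = random.choice(CONTROLS[control_type])
    # Prepend the control to the stimulus and elicit a behavior-laden
    # response from the target model.
    response = call_target_llm(control + "\n" + stimulus)
    question, answer = call_labeler_llm(control, response)
    return {
        "control": control,
        "stimulus": stimulus,
        "response": response,
        "question": question,
        "answer": answer,
    }

example = make_example("persona", "Tell me about the weather today.")
```

In the actual pipeline the stored training input is the target model's activations on the stimulus, with the (question, answer) pair as the supervision signal.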
Three design decisions are crucial for generalization:

1. Activation masking – the control token embeddings are hidden from the decoder to prevent a trivial cheat, while the stimulus activations still carry indirect information via attention.
2. Data augmentation – training on three data types (control only, stimulus only, stimulus + completion) teaches the decoder to answer questions about both explicit prompts and latent properties.
3. Improving completion faithfulness – either by emphasizing the control in the prompt or by using a stronger LLM to generate the (prompt, completion) pairs, ensuring the generated answer truly reflects the intended behavior.
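The first of these decisions, activation masking, amounts to a visibility mask over token positions. The sketch below is an assumption-based toy (the real implementation operates on the decoder's attention mask): control positions are hidden from the decoder while stimulus positions stay visible.

```python
import torch

def build_decoder_mask(n_control: int, n_stimulus: int) -> torch.Tensor:
    """Boolean visibility mask over a [control | stimulus] token sequence.

    True  = position is visible to the decoder,
    False = position is masked out (the control span, so the decoder
            cannot trivially read the control text itself).
    """
    mask = torch.ones(n_control + n_stimulus, dtype=torch.bool)
    mask[:n_control] = False  # hide the control tokens
    return mask

mask = build_decoder_mask(n_control=4, n_stimulus=6)
```

Even with the control hidden, the stimulus activations were computed with attention over the control, so they still carry the control's influence indirectly.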
LIT fine‑tunes a decoder LLM by patching activations from a middle layer (k = 15) of the target model into the decoder’s first layer (ℓ = 0). The decoder is then trained to maximize the log‑probability of the answer given the patched activations and the question.
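The LIT objective can be illustrated with a toy sketch. Everything here is assumed for illustration: `TinyDecoder` stands in for the real decoder LLM, `forward_from_layer0` for running the decoder with patched layer-0 inputs, and the shapes are minimal. The essential structure is real, though: target-model activations replace the decoder's own embeddings for the patched span, and the loss is cross-entropy over the answer tokens only.

```python
import torch
import torch.nn.functional as F

class TinyDecoder(torch.nn.Module):
    """Stand-in for the decoder LLM (embedding + single linear head)."""
    def __init__(self, vocab: int = 32, d: int = 8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d)
        self.proj = torch.nn.Linear(d, vocab)

    def embed(self, ids: torch.Tensor) -> torch.Tensor:
        return self.emb(ids)

    def forward_from_layer0(self, x: torch.Tensor) -> torch.Tensor:
        # In the real model this would run all transformer layers.
        return self.proj(x)

def lit_loss(decoder, patched_acts, question_ids, answer_ids):
    # Prepend activations read at layer k of the target model in place
    # of the decoder's own layer-0 representations, then append the
    # embedded question and (teacher-forced) answer.
    q_emb = decoder.embed(question_ids)
    a_emb = decoder.embed(answer_ids)
    inputs = torch.cat([patched_acts, q_emb, a_emb], dim=1)
    logits = decoder.forward_from_layer0(inputs)
    # Supervise only the answer positions: logit at position t predicts
    # token t + 1, so shift by one.
    n_prefix = patched_acts.shape[1] + question_ids.shape[1]
    answer_logits = logits[:, n_prefix - 1 : -1, :]
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.shape[-1]),
        answer_ids.reshape(-1),
    )

decoder = TinyDecoder()
loss = lit_loss(
    decoder,
    patched_acts=torch.randn(1, 3, 8),       # 3 patched activation vectors
    question_ids=torch.randint(0, 32, (1, 5)),
    answer_ids=torch.randint(0, 32, (1, 4)),
)
```

Minimizing this loss is exactly "maximize the log-probability of the answer": cross-entropy over the answer tokens is the negative log-likelihood of the answer given the patched activations and question.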