Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17%→30% (+78%) and 24%→43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9%→19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.


💡 Research Summary

The paper investigates a novel class of covert channels in large language models (LLMs) that arise from malicious fine‑tuning. An attacker can embed a secret in a user‑supplied prompt, and the fine‑tuned model will encode that secret into the generated text while preserving fluency. The authors first formalize “payload recoverability” as a two‑dimensional metric: (i) R_bucket, the maximum classification accuracy for inferring the hidden bit sequence from output tokens, and (ii) R_secret, the maximum accuracy for reconstructing the original secret from the inferred bits. Using XGBoost for token‑to‑bucket inference and an MLP for bucket‑to‑secret decoding, they show that the previously proposed TrojanStego scheme achieves near‑perfect recoverability (R_bucket≈1, R_secret≈1) because it relies on deterministic ASCII encoding and a trivial token‑ID parity bucket.
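The contrast between a trivially recoverable bucket rule and a secret one can be illustrated with a toy numpy sketch. Everything here is illustrative: token "embeddings" are random stand-ins, and a single-bit decision stump stands in for the paper's XGBoost bucket-inference classifier. A token-ID parity rule (as in TrojanStego) is recovered perfectly from token-ID features, while a secret-hyperplane rule stays near chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 1024, 32
ids = np.arange(n_tokens)
# Token-ID bit features visible to the analyst.
id_bits = (ids[:, None] >> np.arange(10)) & 1

# Rule 1: TrojanStego-style token-ID parity bucket.
parity_bucket = ids % 2
# Rule 2: secret random-hyperplane split of toy token embeddings.
emb = rng.standard_normal((n_tokens, d))
v = rng.standard_normal(d)
hyper_bucket = (emb @ v > np.median(emb @ v)).astype(int)

def stump_accuracy(features, labels):
    """Accuracy of the best single-bit classifier: a crude stand-in
    for the paper's XGBoost bucket-inference attack."""
    accs = [(features[:, j] == labels).mean() for j in range(features.shape[1])]
    accs += [(features[:, j] != labels).mean() for j in range(features.shape[1])]
    return max(accs)

r_parity = stump_accuracy(id_bits, parity_bucket)  # R_bucket = 1.0
r_hyper = stump_accuracy(id_bits, hyper_bucket)    # R_bucket near chance
```

The parity rule is a function of one observable bit, so the stump recovers it exactly; the hyperplane rule depends only on the secret seed and is statistically independent of the ID features.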

To reduce recoverability, the authors replace both stages with geometry‑based mechanisms that exploit the model’s learned embedding space. For secret encoding, each alphabet character’s embedding is projected onto a set of random hyperplanes (derived from secret seeds σ₁…σ₈). By thresholding each projection at the median across all letters, an 8‑bit code is assigned to each character. Seeds are searched until all 26 letters map to unique codes, guaranteeing collision‑free encoding that is still deterministic but tied to the model’s semantic geometry.
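The seed-search procedure can be sketched in a few lines of numpy. This is a minimal reconstruction from the description above, not the authors' code: the per-letter embeddings here are random stand-ins for the model's real embedding rows, and 8 hyperplanes with median thresholds yield one 8-bit code per letter, searching seeds until all 26 codes are distinct.

```python
import numpy as np

def char_codes(embeddings, n_bits=8, max_tries=1000):
    """Assign an n-bit code to each letter via random hyperplanes.

    Bit i of a letter's code is 1 iff its projection onto hyperplane
    v_i exceeds the median projection across all letters. Seeds are
    searched until the 26 codes are collision-free.
    """
    n, d = embeddings.shape
    for seed in range(max_tries):
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((n_bits, d))   # sigma_1..sigma_n
        scores = embeddings @ planes.T              # (n, n_bits)
        bits = (scores > np.median(scores, axis=0)).astype(int)
        if len({tuple(b) for b in bits}) == n:      # collision-free
            return seed, bits
    raise RuntimeError("no collision-free seed found")

# Toy demo with random stand-ins for the 26 letter embeddings.
emb = np.random.default_rng(0).standard_normal((26, 64))
seed, bits = char_codes(emb)
```

With 8 bits there are 256 possible codes for 26 letters, so a collision-free seed is typically found within a handful of tries.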

For bucket assignment, a single random hyperplane (seed σ) partitions the entire token vocabulary into two buckets V₀ and V₁ based on whether a token’s projection score eᵀv_σ lies below or above the median of all scores. During generation, the model is constrained to select tokens from the bucket that matches the target bit at each payload position. Because the hyperplane orientation is secret, an analyst without the seed cannot reliably infer the bucket rule, driving R_bucket toward chance (≈0.5). The authors evaluate this “low‑recoverability” scheme on three models—Llama‑8B, Ministral‑8B, and Llama‑70B—each fine‑tuned with LoRA on either TrojanStego prompts or Wiki prompts. Compared to the baseline, exact secret recovery improves substantially (Llama‑8B: 17%→30%, Ministral‑8B: 24%→43%, Llama‑70B: 9%→19%) while adversarial recoverability drops, confirming the intended trade‑off.
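The bucketing and constrained selection steps can be sketched as follows. This is a hedged reconstruction: the embeddings and logits are random toy data, and greedy argmax within the permitted bucket stands in for whatever constrained decoding strategy the fine-tuned model realizes.

```python
import numpy as np

def bucket_tokens(token_embeddings, seed):
    """Split the vocabulary into V0/V1 with one secret hyperplane.

    A token lands in V1 iff its projection score e^T v_sigma exceeds
    the median over the whole vocabulary, giving two equal-sized
    buckets whose rule depends entirely on the secret seed.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(token_embeddings.shape[1])
    scores = token_embeddings @ v
    return (scores > np.median(scores)).astype(int)

def constrained_pick(logits, buckets, target_bit):
    """Pick the highest-logit token whose bucket matches the bit."""
    masked = np.where(buckets == target_bit, logits, -np.inf)
    return int(np.argmax(masked))

# Toy demo: 1000-token vocabulary with random embeddings and logits.
rng = np.random.default_rng(1)
emb = rng.standard_normal((1000, 32))
buckets = bucket_tokens(emb, seed=42)
logits = rng.standard_normal(1000)
tok = constrained_pick(logits, buckets, target_bit=1)
```

The median threshold guarantees a balanced split, so constraining each payload position to one bucket costs at most one bit of entropy per emitted token.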

The paper then turns to detection. Traditional steganalysis relies on distributional metrics such as KL‑divergence or perplexity, but fine‑tuning naturally induces such shifts, making these signals unreliable. The authors prove that a large amount of hidden data can be inserted while keeping KL and perplexity within normal fine‑tuning ranges. Instead, they propose a mechanistic interpretability approach: linear probes trained on activations from later Transformer layers. These probes learn to predict the secret directly from internal representations. Empirically, probes achieve up to a 33 % boost in detection accuracy on fine‑tuned models versus the base model, even when the secret is encoded with the low‑recoverability scheme (i.e., when R_bucket≈0.5). This demonstrates that malicious fine‑tuning leaves detectable internal “signatures” independent of output‑level statistics.

In summary, the work makes three key contributions: (1) a formal definition of payload recoverability for LLM steganography; (2) a geometry‑based, low‑recoverability encoding and bucketing method that leverages random hyperplane projections in embedding space; and (3) a demonstration that internal‑representation‑based linear probes provide a robust detection signal where traditional statistical steganalysis fails. The findings highlight that covert channels in LLMs are not only a matter of output manipulation but also of persistent internal modifications, suggesting future defenses should focus on mechanistic interpretability and representation‑level monitoring across diverse model architectures and fine‑tuning regimes.

