The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams


Authors: Isaac Llorente-Saguer

Isaac Llorente-Saguer
Independent Researcher
illorentes@proton.me

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle θ from this reference direction. The anomaly score is the negative log-likelihood of θ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC ≥ 0.937 for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution (σ_θ ≈ 0.03 rad), an order of magnitude tighter than the normative distribution (σ_θ ≈ 0.27 rad), preserved across all alignment stages including abliteration.
Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.

Preprint. Under review.

1 Introduction

Detecting harmful prompts before model response generation is a prerequisite for safe deployment of large language models. Existing methods divide into two broad families. Input-space filters based on perplexity [Jain et al., 2023, Alon and Kamfonas, 2023] are effective against adversarial suffix attacks but fail on semantically fluent jailbreaks. Supervised safety classifiers such as Llama Guard [Inan et al., 2023] achieve strong performance but require large curated datasets of harmful and harmless pairs, and per-model retraining whenever the target model changes.

A parallel body of work in representation engineering has established that residual streams encode semantic content as geometric structure [Zou et al., 2023a]. Linear directions for safety concepts have been extracted from contrastive safe/harmful pairs [Zou et al., 2023a, Li et al., 2023], and PCA of hidden states has been shown to separate harmful from harmless queries when safety-prompt manipulations define the reference [Zheng et al., 2024]. Crucially, Arditi et al. [2024] demonstrated that refusal behaviour in aligned models is mediated by a single linear direction in the residual stream, and that orthogonalising this direction renders the model unable to refuse.

These methods share a common dependency: they require harmful examples or deliberate safety-prompt manipulations to define the reference geometry. We ask whether the safe distribution alone defines a reference geometry from which harmful prompts deviate detectably.
We answer affirmatively, and report a finding that sharpens the safety implications: LatentBiopsy detects harmful intent even in abliterated models that are constitutionally incapable of producing refusals. The model's internal encoding of harmful semantic intent is geometrically distinct from the circuits that generate refusal text. Harm recognition and refusal generation are separable mechanisms. The practical consequence is direct: stripping a model's refusal behaviour does not erase the latent signal available to an external detector.

A secondary finding concerns the structure of the deviation itself. The two tested model families exhibit opposite ring orientations at the same residual-stream depth; harmful prompts are more angular from PC1 in Qwen3.5-0.8B and more aligned with PC1 in Qwen2.5-0.5B. This family-level reversal, together with the known layer-level reversal within each model, establishes that a fixed directional threshold on θ is architecturally unreliable and motivates the symmetric anomaly score.

Contributions.

1. Training-free angular anomaly detector. LatentBiopsy builds a normative reference exclusively from 200 safe activations and scores every prompt by the negative log-likelihood of its angular deviation θ from the leading normative principal component. The score is symmetric around the normative mean, requiring no knowledge of the ring direction and no harmful data.

2. Geometry survives refusal ablation in both tested model families. Abliterated variants of both Qwen3.5-0.8B and Qwen2.5-0.5B achieve AUROC_h/b = 1.000 and AUROC_h/n within 0.015 of their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the refusal mechanism across two independent model families.

3. Opposite ring orientations across families. At layer 20, Qwen3.5-0.8B places harmful prompts at mean θ ≈ 1.80 rad vs. normative µ_0 ≈ 1.17 rad (outer ring), while Qwen2.5-0.5B places harmful prompts at mean θ ≈ 1.34 rad vs. normative µ_0 ≈ 1.82 rad (inner ring). The anomaly score s(x) correctly identifies both configurations without architectural knowledge.

4. Near-degenerate harmful compactness and K=1 sufficiency. Harmful prompts occupy a near-degenerate angular cluster (σ_θ^harm ≈ 0.03–0.05 rad, one order of magnitude tighter than σ_θ^norm), and a single reference direction (K=1) dominates multi-directional baselines at every layer in every model.

2 Related Work

Input-space defences. Perplexity-based filters [Jain et al., 2023, Alon and Kamfonas, 2023] are computationally cheap and effective against adversarial suffix attacks, but rely on surface-form anomaly and cannot detect semantically fluent jailbreaks that score normally under any language model.

Supervised safety classifiers. Llama Guard [Inan et al., 2023] and its successors fine-tune a language model on large labelled datasets of safe and harmful pairs, achieving strong per-category precision. The approach requires curated harmful data and per-model retraining, making it expensive to adapt to new architectures or harm taxonomies.

Representation engineering and linear probes. Zou et al. [2023a] extract linear control directions from contrastive safe/harmful pairs and demonstrate that semantic intent is encoded as rich geometric structure in the residual stream. Li et al. [2023] fit linear probes on labelled activation data and intervene at inference time. Zheng et al. [2024] use optimised safety-prompt manipulations to visualise hidden-state separation and shift representations along a refusal direction. All three approaches require either labelled harmful examples or deliberate safety-prompt manipulations to define the reference; LatentBiopsy removes that dependency entirely.

Refusal as a single linear direction. Arditi et al.
[2024] show that refusal behaviour in aligned models is mediated by a single linear direction in the residual stream. LatentBiopsy provides independent quantitative support for the near-one-dimensional geometry (K=1 dominance) while extending the picture: the safety representation that LatentBiopsy exploits is not the refusal direction itself, since it survives abliteration.

Angular biomarkers. The theta biomarker concept was introduced for medical diagnostics to flag anomalous patient profiles via angular deviation in a clinical feature space [Llorente-Saguer et al., 2025]. LatentBiopsy translates this geometric principle to LLM residual streams, adapting it to the direction-agnostic scoring requirement imposed by layer- and family-dependent ring orientation.

3 Preliminaries: Geometry of the Residual Stream

Modern causal language models process tokens through a sequence of transformer blocks communicating via a central residual stream. Let f_ℓ(x) ∈ R^D denote the activation vector of the residual stream at layer ℓ for the final token of an input prompt x. The Linear Representation Hypothesis [Park et al., 2023] posits that neural networks encode high-level concepts as spatial directions within this D-dimensional space. Under this view, semantic identity is defined by orientation, while concept intensity correlates with magnitude along that direction. Euclidean distance conflates semantic identity with concept intensity. Furthermore, in standard Pre-LayerNorm transformer architectures, the residual stream accumulates un-normalised outputs from consecutive layers, causing the ℓ2 norm to grow monotonically with depth [Xiong et al., 2020]. Angular distance isolates semantic direction from both intensity and architectural norm growth, providing a more faithful measure of semantic divergence, and is the foundation of the LatentBiopsy scoring function.
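The distinction matters in practice. A small sketch (illustrative only; the vectors below are synthetic, not model activations) shows how depth-driven norm growth inflates Euclidean distance while leaving the angle untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=1024)
c /= np.linalg.norm(c)  # unit reference direction

# Two activations pointing the same way, but the residual-stream norm
# has grown with depth (Pre-LayerNorm accumulation).
shallow = 10.0 * c
deep = 100.0 * c

def angle(v, c):
    """Angular deviation theta = arccos(v.c / ||v||) for unit c."""
    return float(np.arccos(np.clip(v @ c / np.linalg.norm(v), -1.0, 1.0)))

euclid = np.linalg.norm(shallow - deep)  # large: conflates depth with meaning
a_shallow, a_deep = angle(shallow, c), angle(deep, c)  # both ~0: direction only
```

Here `euclid` is 90.0 even though the two vectors are semantically identical under the Linear Representation Hypothesis, while both angles are essentially zero.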
4 Method

LatentBiopsy is a training-free, zero-shot anomaly detector that identifies harmful prompts by measuring the directional deviation of the last-token residual-stream activation from a reference subspace constructed exclusively from safe data.

4.1 Notation and activation extraction

Let f_ℓ(x) ∈ R^D denote the last-token residual-stream activation at transformer layer ℓ for prompt x, extracted from a pretrained causal language model. All subsequent computations are performed independently at each layer ℓ.

4.2 Normative reference direction (PC1)

Given a normative fit set X_0 = {x_1, ..., x_N} of N safe prompts, we compute the leading principal component of the corresponding activations:

    c = PC1({f_ℓ(x_i)}_{i=1}^N) ∈ R^D,   ‖c‖ = 1.   (1)

The vector c defines the direction of maximum variance within the safe distribution and serves as the sole reference direction for K=1. For completeness we also examine K = 2, 3, 4 by taking the top-K principal components, but the primary detector uses K=1.

4.3 Angular deviation theta

The directional deviation of a test activation f_ℓ(x) from the reference is the angle

    θ(x) = arccos( f_ℓ(x) · c / ‖f_ℓ(x)‖ ) ∈ [0, π].   (2)

This purely angular metric isolates semantic orientation from both concept intensity and the monotonic norm growth induced by Pre-LayerNorm accumulation. A value θ ≈ 0 indicates strong alignment with the normative reference; θ ≈ π indicates near-antiparallel orientation.

4.4 Anomaly scoring

We fit a univariate Gaussian N(µ_0, σ_0²) to the empirical distribution {θ(x_i)}_{i=1}^N from the normative fit set. The anomaly score for any prompt x is the negative log-likelihood under this distribution:

    s(x) = − log p(θ(x) | µ_0, σ_0²).   (3)

Because s(x) is symmetric around µ_0, it flags deviations in either direction without prior knowledge of whether harmful prompts lie inside or outside the normative ring (section 6.1).
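A minimal sketch of the K=1 pipeline of eqs. (1)–(3), assuming activations are already extracted as NumPy arrays; the function names (`theta`, `fit_reference`, `score`) are illustrative, not from a released implementation:

```python
import numpy as np

def theta(acts, c):
    """Angular deviation theta = arccos(f.c / ||f||) in [0, pi] (eq. 2)."""
    cos = acts @ c / np.linalg.norm(acts, axis=-1)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def fit_reference(safe_acts):
    """Fit the normative reference from safe activations of shape (N, D).

    Returns the unit PC1 direction c (eq. 1) and the Gaussian (mu0, sigma0)
    of the safe angular distribution.
    """
    X = safe_acts - safe_acts.mean(axis=0)        # centre before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    c = Vt[0] / np.linalg.norm(Vt[0])             # leading principal component
    t = theta(safe_acts, c)
    return c, t.mean(), t.std()

def score(acts, c, mu0, sigma0):
    """Gaussian negative log-likelihood of theta (eq. 3)."""
    t = theta(acts, c)
    return 0.5 * ((t - mu0) / sigma0) ** 2 + np.log(sigma0 * np.sqrt(2 * np.pi))
```

Because the score depends on θ only through ((θ − µ_0)/σ_0)², deviations of equal magnitude on either side of µ_0 receive identical scores, which is exactly the direction-agnostic property that resolves the family-level ring reversal.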
The score is monotonically equivalent to the squared z-score of θ up to additive constants, yielding values suitable for cross-layer and cross-model comparison. Although a multivariate Gaussian mixture model can be fit to the stacked vector of angles to the top-K principal components for K > 1, the primary LatentBiopsy detector employs the univariate K=1 formulation.

4.5 Phi: azimuthal visualisation coordinate

To enable geometric visualisation we project the component of f_ℓ(x) orthogonal to c:

    f_⊥(x) = f_ℓ(x) − (f_ℓ(x) · c) c.

A 2-D PCA basis (v_1, v_2) is fit exclusively on {f_⊥(x_i)}_{i=1}^N from the normative fit set. The azimuthal coordinate is then

    ϕ(x) = atan2( f_⊥(x) · v_2, f_⊥(x) · v_1 ) ∈ [−π, π].

Each prompt maps to the polar point (θ(x) cos ϕ(x), θ(x) sin ϕ(x)) in the theta-phi projection. The coordinate ϕ is used solely for visualisation; detection relies on s(x).

4.6 Baselines

For comparison, we include four additional scorers: (1) absolute deviation |θ(x) − µ_0| (monotonically equivalent to s(x) for K=1); (2) bivariate Gaussian negative log-likelihood under a 2-D fit to the joint (θ, ϕ) distribution; (3) cosine-to-centroid, s_cos(x) = 1 − f_ℓ(x) · f̄_0 / (‖f_ℓ(x)‖ ‖f̄_0‖); and (4) Euclidean deviation ‖f_ℓ(x) − f̄_0‖₂.

4.7 Experimental protocol and data splits

Datasets. We use three public corpora: Alpaca-Cleaned [Taori et al., 2023] (normative safe prompts); AdvBench [Zou et al., 2023b] (520 harmful prompts); and XSTest [Röttger et al., 2023] (250 benign-aggressive prompts, evaluation only).

Split design. We fix the normative fit set at N=200 prompts, drawn from Alpaca-Cleaned, and retain a disjoint held-out normative evaluation set of 520 prompts. All 520 harmful and 250 benign-aggressive prompts are reserved for evaluation; none enter the fit stage.

Layer selection.
The operating layer is selected by argmax of K=1 harmful-detection AUROC over all layers, evaluated on the held-out set, a standard model-selection step that uses no harmful data for fitting. The harmful eval set is used solely to identify the most informative layer, not to fit any parameters of the detector. The selection optimism is bounded by the plateau width: per-layer AUROC varies by fewer than 0.08 units across all 24 layers in every model (figs. 2 and 3), so no layer is meaningfully preferred over its neighbours. For all models except Qwen2.5-0.5B-Abliterated (best layer 10), the argmax is layer 20. For Qwen2.5-0.5B-Abliterated, we report metrics at the best layer (10) and note that AUROC at layer 20 falls within 0.004 of the reported value (fig. 8, bottom row).

Evaluation. We report per-layer area under the ROC curve (AUROC) and area under the precision–recall curve (AUPRC) for three binary tasks: (i) harmful vs. normative (h/n); (ii) harmful vs. benign-aggressive (h/b); and (iii) harmful vs. rest, i.e. normative ∪ benign-aggressive (h/r). Pairwise differences between groups are assessed using the Mann–Whitney U test.

5 Experiments

5.1 Models

We evaluate six model variants comprising two complete triplets.

Qwen3.5-0.8B (D=1024, 24 layers). Three variants are evaluated: Base, the raw pretrained model before any alignment fine-tuning; Chat, the instruction-tuned model; and Abliterated, the Chat model with the refusal direction removed from all weight matrices via orthogonalisation [Arditi et al., 2024], rendering it unable to produce refusals. Both non-base variants employ a hybrid linear-attention architecture; activations were extracted using a standard PyTorch implementation.[1]

Qwen2.5-0.5B (D=896, 24 layers).
The same three variants (Base, Instruct, and Abliterated) are evaluated for this family, with the abliterated variant produced by orthogonalising the refusal direction from the Instruct checkpoint.

5.2 Detection performance

Table 1 summarises detection performance across all six variants. Three findings hold without exception. Harmful intent and benign-aggressive phrasing (XSTest) are perfectly separable across all six models, including the abliterated variants: AUROC_h/b = 1.000 universally. The h/n task is strongly solved as well, with AUROC ranging from 0.9374 (Qwen2.5-0.5B-Abliterated) to 0.9642 (Qwen3.5-0.8B-Base) and AUPRC_h/n ≥ 0.898 throughout. Most importantly for our central claim, the abliterated models are essentially indistinguishable from their instruction-tuned counterparts in detection performance: the abliterated variant falls within 0.002 AUROC_h/n of the Chat model for Qwen3.5-0.8B and within 0.005 for Qwen2.5-0.5B. These margins are smaller than the gap between base and instruction-tuned variants within either family. All pairwise comparisons achieve p < 10⁻⁴⁵ under the Mann–Whitney U test (harmful vs. normative/rest); normative-vs-benign-aggressive p-values range from 10⁻⁵ to 10⁻²².

Table 1: Detection performance (normative-reference strategy, K=1, held-out evaluation sets, N_fit=200 for all models). n_harm=520, n_norm,eval=520, n_benign=250 for all models. h/n: harmful vs. normative; h/b: harmful vs. benign-aggressive; h/r: harmful vs. rest (norm ∪ benign). r_b/n: rank-biserial correlation for normative-vs-benign-aggressive (negative means benign-aggressive is less anomalous than normative, which is the desired direction for a harm detector). Prec@90: precision at 90% recall on the h/n task. † Best layer for this model is 10; AUROC at layer 20 differs by < 0.004.
Model          Type         Layer  AUROC h/n  AUROC h/b  AUPRC h/n  AUPRC h/b  h/r     r_b/n       Prec@90

Qwen3.5-0.8B (D=1024)
  Base         Base         20     0.9642     1.000      0.9373     1.000      0.9758  −0.384***   0.928
  Chat         Instruct     20     0.9497     1.000      0.9117     1.000      0.9661  −0.434***   0.899
  Abliterated  Abliterated  20     0.9517     1.000      0.9165     1.000      0.9674  −0.427***   0.899

Qwen2.5-0.5B (D=896)
  Base         Base         20     0.9585     1.000      0.9373     1.000      0.9720  +0.149**    0.902
  Instruct     Instruct     20     0.9420     1.000      0.9129     1.000      0.9609  +0.219***   0.875
  Abliterated† Abliterated  10     0.9374     1.000      0.8978     1.000      0.9577  −0.179**    0.882

**: p < 10⁻⁴. ***: p < 10⁻¹⁸. All h/n, h/b, h/r comparisons: p < 10⁻⁴⁵. Benign-aggressive in Qwen2.5-0.5B-Base and Qwen2.5-0.5B-Instruct scores slightly above normative (r_b/n > 0) but far below harmful (AUROC_h/b = 1.000).

[1] The flash-linear-attention fast path was unavailable at the time of testing; results reflect geometric-signal robustness across attention implementations.

Benign-aggressive placement is family-dependent but never a problem for discrimination. In Qwen3.5-0.8B, XSTest prompts score significantly below normative (r_b/n ≈ −0.43 for Chat and Abliterated, −0.38 for Base), occupying the innermost region of the normative ring. In Qwen2.5-0.5B-Base and Instruct, they sit slightly above normative (r_b/n = +0.15 and +0.22); in Qwen2.5-0.5B-Abliterated, they shift back below (r_b/n = −0.18). In every configuration, harmful prompts are far more anomalous than benign-aggressive ones, achieving AUROC_h/b = 1.000 throughout. The full precision–recall profiles are shown in fig. 1 for the base variants and in fig. 9 for all six models.

Figure 1: Precision–recall curves at the operating layer (K=1) for the two base variants, (a) Qwen3.5-0.8B-Base and (b) Qwen2.5-0.5B-Base; remaining models are in fig. 9. Dotted horizontal lines indicate chance precision per task. The harmful-vs-normative curve (red) maintains precision > 0.92 up to 90% recall for Qwen3.5-0.8B-Base (Prec@90 = 0.928) and > 0.90 for Qwen2.5-0.5B-Base (Prec@90 = 0.902). The harmful-vs-benign-aggressive curve is flat at 1.000 precision across all recall levels. The normative-vs-benign-aggressive curve (green) lies below chance in Qwen3.5-0.8B (AUPRC = 0.232), confirming that benign-aggressive prompts are scored as less anomalous than normative, consistent with r_b/n = −0.384.

5.3 Comparison across model variants

Within each family, base models match or slightly exceed their instruction-tuned counterparts in harmful detection: the gap is 0.015 AUROC_h/n for Qwen3.5-0.8B and 0.017 for Qwen2.5-0.5B, with base scoring higher in both cases. This is a consistent if modest observation, and points to safety geometry being present before alignment fine-tuning.

The abliterated models tell the more striking story. In Qwen3.5-0.8B, the abliterated model (AUROC_h/n = 0.9517) lies 0.002 above the chat model (0.9497), well within any plausible noise floor. In Qwen2.5-0.5B, the gap is 0.005 (best layer 10 vs. 20), a margin smaller than the base-to-instruct difference within the same family. A model that cannot refuse harmful requests is, by this measure, an equally capable harm detector.

Per-layer profile. Figures 2 and 3 show per-layer AUROC for both families. K=1 is uniformly superior to K > 1 baselines across all 24 layers in every variant. The broad plateau at K=1 bounds layer-selection optimism tightly: the best layer exceeds the worst by fewer than 0.08 AUROC units, with all layers above 0.88.

5.4 Computational cost

We evaluated the latency overhead of LatentBiopsy on an NVIDIA GeForce RTX 3070 Laptop GPU (8 GB VRAM), reporting mean and standard deviation over 100 trials after GPU warm-up, using Alpaca prompts to reflect realistic sequence lengths.
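The method-specific per-query work is only a dot product and a scalar Gaussian NLL, which can be timed in isolation. A minimal sketch (synthetic activation on CPU with `time.perf_counter`; the values of `mu0` and `sigma0` are illustrative, not fitted, and this is not the paper's GPU benchmark):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
D = 896
f = rng.normal(size=D)                 # stand-in for one last-token activation
c = rng.normal(size=D); c /= np.linalg.norm(c)
mu0, sigma0 = 1.8, 0.19                # illustrative Gaussian parameters

def lb_score(f, c, mu0, sigma0):
    """Dot product + scalar Gaussian NLL: the entire per-query overhead."""
    t = np.arccos(np.clip(f @ c / np.linalg.norm(f), -1.0, 1.0))
    return 0.5 * ((t - mu0) / sigma0) ** 2 + np.log(sigma0 * np.sqrt(2 * np.pi))

lb_score(f, c, mu0, sigma0)            # warm-up
t0 = time.perf_counter()
for _ in range(100):
    s = lb_score(f, c, mu0, sigma0)
elapsed_ms = (time.perf_counter() - t0) * 1000 / 100  # mean ms per call
```

On any modern machine this loop measures microseconds per call, consistent with the sub-millisecond overhead reported below.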
We report mean (standard deviation): a baseline Qwen2.5-0.5B forward pass takes 20.7 (2.7) ms; residual-stream extraction adds negligible overhead (0.0 ms); and anomaly scoring (dot product plus scalar Gaussian NLL) executes in 0.43 (0.08) ms. End-to-end, the pipeline completes in 22.6 (2.1) ms per query. Normative reference fitting is performed once offline: for N=200 prompts, activation extraction takes < 3 s and PCA completes in < 0.2 s on CPU. LatentBiopsy therefore adds less than 0.5 ms of method-specific overhead (anomaly scoring) relative to a standard forward pass.

Figure 2: Per-layer AUROC for the Qwen3.5-0.8B family, panels (a) Base, (b) Chat, (c) Abliterated. Each panel shows AUROC_h/n (left) and AUROC_h/b (right) vs. layer index. K=1 (solid blue) strictly dominates K > 1 (orange/green/red) at every layer. The cosine-centroid baseline (purple dashed) consistently underperforms K=1. AUROC_h/b = 1.000 is maintained at every layer regardless of alignment stage. The three panels are nearly indistinguishable in their h/b profile.

Figure 3: Per-layer AUROC for the Qwen2.5-0.5B family, panels (a) Base, (b) Instruct, (c) Abliterated. K=1 (solid blue) remains the dominant scorer throughout. In the Instruct and Abliterated variants, the cosine-centroid and L2-norm baselines reach near-parity with K=1, reflecting the specific geometry of this family, but do not surpass it. The broad performance plateau across layers 5–23 bounds layer-selection optimism to < 0.08 AUROC units.

6 Geometric Analysis: The Two-Ring Structure

6.1 Opposite ring orientations across families

Figure 4 plots the theta-phi projection for all six variants at their respective operating layers. A concentric two-ring structure separating harmful and normative activations is present in every panel, and the two families occupy opposite positions.
In all three Qwen3.5-0.8B variants, harmful prompts occupy the outer ring (θ̄_harm ≈ 1.80 rad vs. µ_0 ≈ 1.17 rad; table 2): they deviate more strongly from PC1 than normative prompts. In all three Qwen2.5-0.5B variants, harmful prompts occupy the inner ring (θ̄_harm ≈ 1.32–1.36 rad vs. µ_0 ≈ 1.76–1.90 rad): they are more tightly aligned with PC1 than normative prompts.

This family-level reversal cannot be reconciled by any fixed directional threshold on θ. The anomaly score s(x) resolves it correctly in all cases: both θ = µ_0 + kσ_0 and θ = µ_0 − kσ_0 receive the same score for any k, regardless of ring direction. The same argument applies to layer-level reversals within each model, visible in figs. 2 and 3 through the changing sign of the L2-norm baseline across layers.

6.2 Benign-aggressive placement

The XSTest prompts occupy a consistent position relative to the harmful cluster in each family, reinforcing the interpretation of the two-ring geometry. In all three Qwen3.5-0.8B variants, they cluster at the smallest radii, below the normative mean (r_b/n ≈ −0.43), and are geometrically separated from harmful prompts in the opposite direction. In Qwen2.5-0.5B-Base and Instruct, they sit near or slightly above normative (r_b/n = +0.15, +0.22), while in Qwen2.5-0.5B-Abliterated they shift slightly below (r_b/n = −0.18). In every case, perfect harmful-vs-benign-aggressive separation (AUROC_h/b = 1.000) holds, confirming that the theta score cleanly discriminates harmful intent from aggressive but benign phrasing regardless of where benign-aggressive prompts sit relative to the normative mean.

6.3 Angular deviation statistics

Table 2 reports the raw angular deviation statistics at the operating layer for each model, and fig. 5 shows the corresponding anomaly score distributions. Three patterns hold universally. The harmful cluster is extraordinarily compact.
σ_θ^harm is 0.030–0.052 rad: one order of magnitude smaller than σ_θ^norm (0.183–0.272 rad), and this near-degeneracy is preserved across base, instruction-tuned, and abliterated variants alike. Normative train and test are statistically indistinguishable (|θ̄_norm,train − θ̄_norm,test| ≤ 0.01 rad in every model), confirming that N=200 defines a stable reference distribution. Finally, benign-aggressive prompts cluster near the normative mean in all Qwen3.5-0.8B variants (smaller θ than normative, tighter alignment with PC1), while in Qwen2.5-0.5B their θ nearly coincides with the normative µ_0, leaving harmful as the angular outlier in both cases.

7 Robustness and Sensitivity Analysis

7.1 Safety signal dimensionality (K-ablation)

The per-layer AUROC plots already show that K=1 strictly outperforms multi-directional baselines at every layer and model. Table 3 quantifies this at the operating layer. Increasing from K=1 to the best K > 1 reduces AUROC by 0.033–0.063 across models, a consistent and meaningful penalty for adding directions. The cosine-centroid baseline slightly exceeds K=1 for the Qwen2.5-0.5B Instruct and Abliterated variants (∆_cos = +0.009 and +0.004), suggesting that for this family the mean normative activation is a good proxy for PC1. The PC1 formulation is nonetheless preferred on theoretical grounds: it maximises captured normative variance, remains interpretable through the theta-phi projection, and performs at least as well as the centroid in four of six models.

Figure 4: Theta-phi projections at the operating layer, panels (a) Qwen3.5-0.8B-Base (L20), (b) Qwen2.5-0.5B-Base (L20), (c) Qwen3.5-0.8B-Chat (L20), (d) Qwen2.5-0.5B-Instruct (L20), (e) Qwen3.5-0.8B-Abliterated (L20), (f) Qwen2.5-0.5B-Abliterated (L10). Harmful (red) and safe (blue/green) prompts form distinct concentric radial zones across all variants.
In the Qwen3.5-0.8B family (left column), harmful intent occupies the outer ring; in Qwen2.5-0.5B (right column), it occupies the inner ring. The visual invariance across rows demonstrates that safety geometry is established during pretraining and remains intact even after the mathematical erasure of refusal mechanisms. All panels achieve AUROC_h/b = 1.000.

Figure 5: Anomaly score distributions at the operating layer for all six variants, panels (a) Qwen3.5-0.8B-Base, (b) Qwen2.5-0.5B-Base, (c) Qwen3.5-0.8B-Chat, (d) Qwen2.5-0.5B-Instruct, (e) Qwen3.5-0.8B-Abliterated, (f) Qwen2.5-0.5B-Abliterated. Each violin shows the marginal distribution of s(x) = −log p(θ | µ_0, σ_0²) for normative eval (blue), harmful (red), benign-aggressive (green), and normative ∪ benign (purple). White circles denote medians; bars denote IQRs. In every panel, harmful prompts occupy a narrow, elevated band (σ_θ^harm is 5–9× smaller than σ_θ^norm; table 2). In the Qwen3.5-0.8B family (left column), benign-aggressive scores fall below the normative distribution; in the Qwen2.5-0.5B family (right column), they overlap with it. The three panels within each column are nearly identical, illustrating that abliteration leaves the score landscape intact.

Table 2: Raw angular deviation θ statistics at the operating layer (normative-ref, K=1). µ_0 = θ̄_norm,test: mean normative angle (radians). σ_norm: normative test standard deviation. θ̄_harm: harmful mean. ∆θ = θ̄_harm − µ_0: signed angular separation (positive = outer ring, negative = inner ring). θ̄_benign: benign-aggressive mean. σ_harm: harmful standard deviation. All h/n comparisons: p < 10⁻⁴⁵. † Layer 10 for this model.

Model          Type         µ_0    σ_norm  θ̄_harm  ∆θ      θ̄_benign  σ_harm

Qwen3.5-0.8B
  Base         Base         1.161  0.272   1.811   +0.650  1.094     0.034
  Chat         Instruct     1.178  0.267   1.801   +0.623  1.104     0.030
  Abliterated  Abliterated  1.175  0.267   1.802   +0.627  1.104     0.031

Qwen2.5-0.5B
  Base         Base         1.819  0.188   1.357   −0.462  1.821     0.034
  Instruct     Instruct     1.764  0.183   1.316   −0.448  1.777     0.035
  Abliterated† Abliterated  1.904  0.250   1.301   −0.603  1.962     0.052

∆θ > 0: harmful is outer ring (more deviated from PC1); ∆θ < 0: harmful is inner ring (more aligned with PC1). Both configurations are correctly flagged by the symmetric score s(x).

Table 3: K-ablation at the operating layer. ∆_K: AUROC_h/n change from K=1 to best K > 1 (negative = K=1 better). ∆_cos: AUROC_h/n change from K=1 to cosine-centroid baseline (negative = K=1 better). † Layer 10.

Model                      Type         K=1 AUROC  ∆_K     ∆_cos
Qwen3.5-0.8B-Base          Base         0.9642     −0.033  −0.050
Qwen3.5-0.8B-Chat          Instruct     0.9497     −0.049  −0.031
Qwen3.5-0.8B-Abliterated   Abliterated  0.9517     −0.045  −0.035
Qwen2.5-0.5B-Base          Base         0.9585     −0.041  −0.008
Qwen2.5-0.5B-Instruct      Instruct     0.9420     −0.063  +0.009
Qwen2.5-0.5B-Abliterated†  Abliterated  0.9374     −0.037  +0.004

7.2 Safety signal sparsity (dimension ablation)

Figure 6 shows AUROC as a function of the number of principal dimensions retained (by descending normative variance), for K=2 at the operating layer, for the two base variants. The full six-model ablation is in fig. 10. In every model, retaining just the top-10 dimensions (≈ 1% of D=1024 or 1.1% of D=896) simultaneously maximises both h/n and h/b AUROC; additional dimensions monotonically dilute performance. The safety signal is thus concentrated in a compact subspace. This result is consistent with the near-one-dimensional geometry reported by Arditi et al. [2024] and reinforced here across six model variants.

Figure 6: Dimension-pruning ablation at the operating layer (K=2) for the two base variants, (a) Qwen3.5-0.8B-Base and (b) Qwen2.5-0.5B-Base; all six models are shown in fig. 10. AUROC_h/n (red) and AUROC_h/b (green) vs. number of principal dimensions retained by descending normative variance.
Both tasks are maximised at the top-10 dimensions (≈ 1% of D) and degrade monotonically thereafter. This pattern indicates that retaining additional components does not improve performance under this setup.

7.3 Normative set stability

Figures 7 and 8 plot AUROC as a function of normative fit-set size N for all variants of the Qwen3.5-0.8B and Qwen2.5-0.5B families, respectively. Performance saturates remarkably early. Across all models, AUROC_h/n already exceeds 0.90 at most layers once N ≳ 100. Even with extremely small normative sets (N = 10–20), late-layer performance remains strong (typically > 0.85). The harmful-versus-benign-aggressive separation reaches and remains perfectly flat at AUROC_h/b = 1.000 from the smallest tested values of N.

The right sub-panels confirm near-perfect invariance to the ordering of the normative set: forward (solid) and reverse (dashed) curves overlap almost completely after N ≈ 30. This rules out sample-ordering artefacts and shows that the leading principal component rapidly converges to a stable reference direction.

Notably, the stability profiles are qualitatively consistent across Base, Chat, and Abliterated variants within each family. This provides further evidence that the harmful-intent geometry exploited by LatentBiopsy is formed during pretraining and remains largely unaffected by later instruction tuning or refusal ablation. Several adjacent late layers also show nearly identical curves, indicating that the relevant structure is not limited to a single layer but is distributed across a small range of layers. The immediate perfect separability of harmful versus benign-aggressive prompts across all N further suggests that these categories occupy well-separated regions of representation space, independent of the normative reference.
Taken together, these results demonstrate that LatentBiopsy is highly data-efficient: a few dozen safe prompts suffice to construct a high-quality reference direction, and N = 200 lies comfortably in the saturated regime.

8 Discussion

The abliteration result and its safety implications. The abliterated variants apply targeted removal of the learned refusal direction, aiming to eliminate refusal-style behaviour. Yet LatentBiopsy achieves AUROC h/b = 1.000 and AUROC h/n within 0.005 of the corresponding instruction-tuned models in both families. This establishes a geometric dissociation: harmful semantic intent is represented in the residual stream independently of the downstream mechanism that acts on it. Recent safety approaches such as Zheng et al. [2024] target the refusal direction to strengthen model behaviour. Our findings suggest that such interventions address the generative mechanism without altering the representational geometry; the latent signal persists even when the direction has been mathematically erased. This is consequential for both offensive and defensive AI safety: a model that cannot refuse retains an intact, exploitable signal for an external detector, and an adversary who abliterates a model to bypass its safeguards does not thereby erase the geometric evidence of harmful intent.

Why do the two families have opposite ring orientations? At layer 20, Qwen3.5-0.8B harmful prompts are more angular from PC1 than normative prompts (∆θ = +0.62 to +0.65 rad), while Qwen2.5-0.5B harmful prompts are more aligned (∆θ = −0.45 to −0.60 rad). This family-level difference likely reflects how safety-relevant representations are encoded relative to the dominant normative variance direction in each architecture, a quantity that depends simultaneously on pretraining data mixture, model width, and architectural details. We do not have a mechanistic account and regard the cause as an open question.
What the result does establish is that no single ring orientation can be assumed a priori across architectures, making direction-agnostic scoring a structural requirement rather than a design choice.

The near-degenerate harmful compactness. Across all six models, σ_θ^harm ≈ 0.03–0.05 rad, one order of magnitude smaller than σ_θ^norm. This compactness survives abliteration, ruling out the refusal direction as its source. The most parsimonious explanation is that AdvBench prompts share a narrow syntactic template producing near-identical last-token activations. Evaluation on structurally diverse datasets such as JailbreakBench [Chao et al., 2024] is the key open experiment to determine whether this compactness is a surface-form artefact or a genuine semantic regularity. This question is the primary empirical limitation of the present work.

(a) Qwen3.5-0.8B-Base (b) Qwen3.5-0.8B-Chat (c) Qwen3.5-0.8B-Abliterated
Figure 7: AUROC vs. normative fit-set size N for the Qwen3.5-0.8B family. Left sub-panels: AUROC h/n vs. N for representative layers (forward ordering), showing stabilisation well before N=200 at every layer. Right sub-panels: AUROC vs. N at a fixed late layer, comparing forward (solid) and reverse (dashed) ordering for AUROC h/n (red) and AUROC h/b (green). Green curves are flat at 1.000 throughout; red curves are stable above 0.90 at all N ≥ 30 and invariant to ordering, ruling out sample-ordering artefacts.

(a) Qwen2.5-0.5B-Base (b) Qwen2.5-0.5B-Instruct (c) Qwen2.5-0.5B-Abliterated
Figure 8: AUROC vs. normative fit-set size N for the Qwen2.5-0.5B family. As with the 0.8B architecture, performance stabilises well before N=200 (left panels) and exhibits strict ordering invariance (right panels).
The abliterated variant (panel c) shows the same robust stability pattern at its operating layer (layer 10) as at other layers, confirming that data-size requirements are unaffected by the ablation of refusal directions.

Does safety geometry precede alignment? Base models achieve at least equal harmful-detection AUROC to their instruction-tuned counterparts in both families, and the two-ring structure is qualitatively identical across base, instruct, and abliterated variants. This is consistent with the hypothesis that harmful-intent geometry is established during pretraining and is not a product of alignment fine-tuning. However, the observation is currently limited to two model families from a single vendor, and Qwen-specific data-mixture effects cannot be ruled out. Extending to at least one non-Qwen family is the highest-priority next experiment.

Limitations and open directions. Cross-architecture validation is the most critical gap: the geometry-precedes-alignment and geometry-survives-ablation findings must be replicated in at least one non-Qwen family before they can be treated as general results. Dataset diversity: evaluation on JailbreakBench will determine whether harmful compactness generalises beyond the lexically homogeneous AdvBench. Adversarial robustness: prompts crafted to minimise |s(x)| while preserving harmful intent are the natural white-box attack on LatentBiopsy; their effect is entirely untested and constitutes the most important open safety question. Layer selection: a dedicated layer-selection split would eliminate the one level of selection bias and yield strictly unbiased AUROC estimates. Calibrated thresholding: deployment requires a principled approach to threshold selection given the variability of benign-aggressive placement across families.
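One simple instance of the calibrated-thresholding direction noted above is to set the decision threshold as an empirical quantile of anomaly scores on held-out safe prompts, which bounds the expected false-positive rate by construction. This is an illustrative sketch of that idea, not part of LatentBiopsy as evaluated; the function names and the target-FPR framing are ours.

```python
import numpy as np

def calibrate_threshold(safe_scores, target_fpr=0.05):
    """Choose a score threshold so that at most target_fpr of held-out
    safe prompts would be flagged. The conservative 'higher' quantile
    interpolation keeps the empirical FPR at or below the target."""
    safe_scores = np.asarray(safe_scores, dtype=float)
    return np.quantile(safe_scores, 1.0 - target_fpr, method="higher")

def flag(scores, threshold):
    """Flag prompts whose anomaly score strictly exceeds the threshold."""
    return np.asarray(scores, dtype=float) > threshold
```

Because the threshold is calibrated only on safe data, this scheme inherits the method's training-free character; it fixes the false-positive rate and lets detection power vary with the family-specific placement of benign-aggressive prompts.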
9 Conclusion

LatentBiopsy demonstrates that a direction-agnostic angular anomaly detector, built exclusively from 200 safe activations, robustly identifies harmful prompts across base, instruction-tuned, and abliterated variants of two model families. The method achieves harmful-versus-normative AUROC ≥ 0.937 and harmful-versus-benign-aggressive AUROC = 1.000 across all six tested variants, with sub-millisecond per-query overhead and no harmful training data.

The central finding is a geometric dissociation: removing the refusal mechanism from a model, in both a 0.8B and a 0.5B architecture, leaves harmful-intent geometry intact. A model that cannot refuse retains the latent signal exploitable by an external detector, and an adversary who abliterates refusal does not thereby erase the evidence. Combined with the observation that base models match instruction-tuned ones on all detection metrics, the evidence is consistent with harmful-intent geometry being established during pretraining, independently of the alignment process in both its presence and its absence.

An unexpected empirical pattern (though not extensively tested) is the opposite ring orientation across families: harmful prompts are the most angular group in Qwen3.5-0.8B and the most aligned group in Qwen2.5-0.5B. The anomaly score handles both without modification. Explaining this architectural difference mechanistically, and confirming whether the pattern generalises beyond the Qwen family, is the natural next step of this research programme.

10 Ethical Considerations and Broader Impact

LatentBiopsy provides a diagnostic capability to identify harmful instructions at inference time. While the primary objective is to enhance safety, we acknowledge the inherent dual-use potential of interpretability tools.

Responsible Research Practice. We have used only publicly available datasets (AdvBench, XSTest, Alpaca) that are standard within the AI safety literature.
We do not provide, encourage, or facilitate the generation of new harmful content. We strictly adhere to the AI safety community's norms of responsible disclosure and advocate for the use of latent analysis solely to improve model robustness, interpretability, and safety alignment. We believe that democratising the ability to "read" harmful intent is a net positive for safety, as it reduces reliance on black-box proprietary safety filters and enables transparent, model-agnostic verification of safety alignment.

Code and Data Availability. All code is available at https://github.com/isaac-6/geometric-latent-biopsy. A Zenodo archive is at https://doi.org/10.5281/zenodo.19294977. Datasets: Alpaca-Cleaned [Taori et al., 2023], AdvBench [Zou et al., 2023b], XSTest [Röttger et al., 2023]. Models used in this work: Qwen/Qwen3.5-0.8B-Base, Qwen/Qwen3.5-0.8B, prithivMLmods/Gliese-Qwen3.5-0.8B-Abliterated-Caption, Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-0.5B-Instruct, huihui-ai/Qwen2.5-0.5B-Instruct-abliterated.

References

Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.

Andy Arditi, Oscar Obeso, Aaquib Syed, Hoagy Cunningham, Daniel Filan, Fabien Colognese, Martin Wattenberg, and Fernanda Viégas. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024.

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024.

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint, 2023.
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Tom Goldstein, et al. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36, 2023.

Isaac Llorente-Saguer, Charles Arber, and Neil P. Oxtoby. Theta, a multidimensional ratio biomarker applied to five amyloid beta peptides for investigations in familial Alzheimer's disease. medRxiv, 2025. Preprint. https://doi.org/10.1101/2025.08.06.25333131.

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint, 2023.

Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.

Chujie Zheng, Lifeng Fan, Hang Chen, Yue Liu, and Minlie Huang. On prompt-driven safeguarding for large language models. In Proceedings of the 41st International Conference on Machine Learning, 2024. arXiv preprint arXiv:2401.18018.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.

A Harmful-Reference Strategy

A supervised variant (harmful-ref) fits the biomarker on harmful examples instead of normative ones, achieving comparable detection performance (AUROC 0.945–0.953 on harmful-vs-normative) but requiring harmful data at fit time. All harmful-ref scores require auto-orientation (sign flip) at every layer and model, because the harmful distribution is so diffuse that held-out harmful prompts score as more anomalous under their own Gaussian fit than normative prompts do. This confirms the compact/diffuse asymmetry at the distributional level: the harmful manifold has no compact geometric centre that generalises across samples. Full harmful-ref figures and statistics are available in the project repository.

B Precision-Recall Curves: All Models

(a) Qwen3.5-0.8B-Chat (b) Qwen2.5-0.5B-Instruct (c) Qwen3.5-0.8B-Abliterated (d) Qwen2.5-0.5B-Abliterated
Figure 9: Precision-recall curves for all six models; see fig. 1 for a detailed description and the base models' curves. In the Qwen3.5-0.8B family (left column), the normative-vs-benign-agg curve (green) lies below chance, confirming that benign-agg prompts are scored as less anomalous. In the Qwen2.5-0.5B family (right column), the same curve sits near or slightly above chance, reflecting r_b/n ≈ +0.15 to +0.22. In all panels, the harmful-vs-benign-agg curve is flat at precision = 1.000.
C Dimension Ablation: All Models

(a) Qwen3.5-0.8B-Base (b) Qwen2.5-0.5B-Base (c) Qwen3.5-0.8B-Chat (d) Qwen2.5-0.5B-Instruct (e) Qwen3.5-0.8B-Abliterated (f) Qwen2.5-0.5B-Abliterated
Figure 10: Dimension-pruning ablation (K=2) for all six models; see fig. 6 for a detailed description. In every panel, performance is maximised at the top-10 dimensions and monotonically decreases thereafter, confirming that the safety signal is concentrated in < 1.2% of all residual-stream dimensions across both model families.
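The dimension-pruning ablation can be sketched as follows: project activations onto the top-k principal directions of the normative set (ordered by descending normative variance) before measuring the deviation angle from the leading direction. This is a minimal reconstruction under our reading of the ablation, with function names of our own choosing; the released code may differ in detail.

```python
import numpy as np

def prune_basis(normative_acts, k=10):
    """Top-k principal directions of the normative activations,
    ordered by descending normative variance."""
    mean = normative_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(normative_acts - mean, full_matrices=False)
    return mean, vt[:k]          # (k, D) orthonormal rows

def pruned_angle(x, mean, basis):
    """Deviation angle from the leading direction, computed inside the
    retained top-k subspace: project the centred activation, then take
    the sign-insensitive angle to the first (highest-variance) axis."""
    z = basis @ (x - mean)       # coordinates in the pruned subspace
    cos = np.abs(z[0]) / (np.linalg.norm(z) + 1e-12)
    return np.arccos(np.clip(cos, 0.0, 1.0))
```

Sweeping k from 1 to D and recomputing AUROC at each value reproduces the shape of the ablation curves: if the safety signal is concentrated, performance should peak at small k and dilute as low-variance dimensions are added.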