Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to out…

Authors: Nathaniel Oh, Paul Attie

Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
Squish and Release: Exp osing Hidden Hallucinations b y Making Them Surface as Safet y Signals Nathaniel Oh P aul Attie Sc ho ol of Computer and Cyb er Sciences Augusta Univ ersity Augusta, Georgia 30912 { noh,pattie } @augusta.edu Abstract Hallucination under con v ersational pressure is in visible to output insp ection: the mo del pro duces authoritativ e, confiden t professional output built on a false premise it already knew was wrong, and no examination of that output rev eals the error. The central in- sigh t of this work is that the error is not in the output-it is in the activ ation space of the safet y circuit, suppressed b y accum ulated con versational pressure. Like a signal masked by noise, the detection is presen t but not surfacing. W e introduce Squish and R ele ase (S&R), a perceptual lens architecture that mak es this hidden signal visible. S&R has t wo separable comp onen ts: a dete ctor b o dy -la y ers 24–31 of the residual stream, the lo calized circuit where safet y ev aluation liv es-and a sw appable dete ctor c or e , an activ ation v ector patc hed into that circuit to control what the mo del p erceiv es and flags. The core is directional: a safety c or e (captured from a maximally-refused prompt) shifts the model from compliance to ward detection, making the hidden signal visible; an absorb c or e (captured from a confirmed- compliance c hain) shifts the model from detection to ward compliance, suppressing it again. The bo dy is fixed; the core determines the direction. W e introduce the Or der-Gap Bench- mark -500 fiv e-prompt escalating chains across 500 domains-and ev aluate on OLMo-2 7B. Key findings: (1) cascade collapse (the suppression of correct O2-lev el detection by O5, defined in § 3 ) is near-total (99.8% compliance at O5); (2) the detector bo dy is binary and lo calized-la y ers 24–31 shift 93.6% of collapsed chains ( χ 2 =871 . 0, p< 10 − 189 ); lay ers 0–23 con tribute zero; (3) empirically discov ered cores release 62% of 500 domains; (4) a syn theti- cally engineered core (Genev a Con ven tions + Rome Statute violation) releases 76.6% -14.6 p ercen tage p oints ab ov e the empirical b est, establishing that core engineering outp erforms core disco v ery; (5) the absorb core suppresses 58% of correctly-detecting chains while the safet y core restores 83%-detection is the more stable attractor; (6) cores require routing, not blending; (7) epistemic sp ecificity is confirmed: a core captured from false-O2 activ ations releases 45.4% of chains while an iden tical-structure core captured from true-O2 activ a- tions releases 0.0%-a 45.4 pp gap demonstrating the detector encodes premise falseness, not general pressure sensitivity . All claims are empirically demonstrated on OLMo-2 7B. The con tribution is the framework-the b o dy/core arc hitecture, the Order-Gap Benc hmark, and the core engineering metho dology-which is mo del-agnostic by design. OLMo-2 7B is the pro of-of-concept substrate. Cross-arc hitecture instan tiation, resp onse-to-alert behavior in higher-capacit y mo dels, and core optimization are explicitly scop ed as the next research phase. 1 In tro duction A mo del that correctly refuses a false premise at O2 pro duces authoritative professional output under that same false premise fiv e prompts later. The error is not in the output-it has migrated 0 Co de, data, and pre-built activ ation v ectors: https://github.com/calysteon/order- gap- benchmark 1 in to the activ ation space of the safet y circuit, where accumu lated conv ersational pressure sup- pressed a correctly-functioning detection signal. Y ou cannot find it by c hec king facts. Y ou need the righ t lens. W e introduce Squish and R ele ase (S&R), a p erceptual lens architecture for pressure-induced hallucination detection. The architecture has t wo separable components. The detector b o dy is a fixed anatomical site: lay ers 24–31 of the mo del’s residual stream. This is where safety ev aluation lives. W e iden tify it via binary lay er ablation across 500 c hains: la yers 24–31 release 93.6% of collapsed c hains; la yers 0–23 release zero. The b o dy do es not c hange b etw een exp eriments. It is the p erceptual circuit-the site where what is visible and what is suppressed is determined. The detector core is a sw appable activ ation v ector, patc hed in to the detector b o dy during the O5 forw ard pass. The core is the lens. It determines what the b o dy p erceives and in which direction: Safety c or e. Captured from a maximally-refused prompt-a w ar crimes order, a Nurem b erg Code violation, a hab eas corpus depriv ation. When patched into the b o dy under a high-pressure O5 con text, the mo del shifts from compliance tow ard detection: the hidden signal b ecomes visible. A syn thetically engineered safet y core releases 76.6% of c hains that had absorb ed a false premise under pressure. A bsorb c or e. Captured from a confirmed-compliance chain-a con text where the mo del pro duced without resistance. When patched in to the b o dy at O2, the mo del shifts from detection tow ard compliance: the signal disapp ears again. The absorb core suppresses a correctly-firing safet y circuit in 58% of DETECT chains. The bo dy is fixed. Sw ap the core, and y ou change the direction of p erception. This is not a metaphor-it is the mec hanistic structure of the system w e c haracterize empirically across 500 domains. Wh y this matters. In litigation, medical proto cols, and regulatory filings-con texts where false premises cause the most harm-the safety core shifts p erception to surface em b edded falseness the mo del w ould otherwise absorb. Core sensitivit y is impro v able: empirical cores release 62%; a syn thetically engineered core releases 76.6% (14.6 pp higher), with 90%+ as a near-term target. Order-gap hallucination. Existing metho ds address knowledge gap (mo del do es not know) and lo w-pressure alignment failure (mo del kno ws but pro duces false output under direct prompt- ing). Neither addresses or der-gap hal lucination : the mo del detects the false premise at O2 (when ask ed directly) but fails to catch it at O5 (when it arrives as an assumption embedded in an escalating professional task). The error is invisible to output inspection-it lives in the activ ation space of the suppressed safety circuit. S&R op erates directly on the circuit. Scop e. All exp eriments use OLMo-2 7B as a deliberate proof-of-concept c hoice. The contribu- tion is the framework-bo dy/core arc hitecture, b enc hmark, and engineering methodology-not the sp ecific prop erties of any mo del. Bo dy lo cation and release rates will v ary across architectures; that v ariance is the conten t of the planned cross-architecture study this pap er motiv ates. Con tributions. (1) The b o dy/core architecture: a mo del-agnostic framework separating the fixed detector b o dy (la yers 24–31) from the swappable directional core. (2) Order-gap halluci- nation: the failure mo de where a mo del detects a false premise when asked directly but misses it when em b edded as an assumption. (3) Squish and Release: the first diagnostic operating on activ ation space, conv erting suppressed safety signals into visible detections. (4) The Order- Gap Benchmark: 500 five-prompt c hains across 500 domains, matched true/false preconditions, all manually graded. (5) Binary lo calization ( p< 10 − 189 ), synthetic core superiority (+14.6 2 pp), bidirectional control (83% restore / 58% suppress), routing requirement, and epistemic sp ecificit y confirmed (45.4 pp gap at 0% true-core release). 2 Related W ork Static hallucination b enchmarks. T ruthfulQA [ 10 ], HaluEv al [ 9 ], and HalluLens [ 2 ] measure whether mo dels pro duce false statements. S&R addresses a differen t question: does the mo del know a statement is false but produce it under pressure? Static b enc hmarks cannot distinguish laten t hallucination from gen uine ignorance. Sycophancy and compliance under pressure. [ 11 , 6 ] do cument that mo dels comply with false premises under conv ersational pressure. S&R provides the first mechanistic accoun t: com- pliance reflects suppression of a functioning safety circuit at lay ers 24–31, not abandonment of kno wledge. Causal lo calization. [ 12 ] lo calizes factual recall to middle-la y er MLP weigh ts. W e find the safet y ev aluation circuit in the top quartile (la yers 24–31)-a distinct site suggesting functional sp ecialization: conten t in the middle, ev aluation of that conten t near the top. Represen tation engineering. [ 15 ] and [ 8 ] construct steering vectors for b ehavioral prop erties. S&R differs: we target the suppression-restoration competition sp ecifically , identify a structural qualit y predictor for effectiv e gadgets, and show naive vector blending is catastrophically wrong. T ruthfulness in internal states. [ 1 ] and [ 3 ] establish the static case: at a single moment, mo dels enco de kno wledge of their own falsity in hidden states while pro ducing wrong output. S&R pro vides the dynamic account: accumulated pressure suppresses a circuit that w as firing correctly earlier in the same conv ersation, and this suppression is mec hanistically reversible without prob e training-S&R requires only a constructible anchor prompt. The comp etition b et w een the truthfulness direction and the compliance direction across O2–O5 is the causal mec hanism of cascade collapse; S&R shows this comp etition is bidirectionally con trollable. Return-Orien ted Programming. [ 13 ] show ed that existing co de gadgets can b e chained to pro duce arbitrary computation without co de injection. The release phase of S&R applies this insigh t to neural activ ation states: existing safety-circuit activ ations are p ortable and can b e redirected in to suppressed contexts without mo difying the mo del. 3 The Order-Gap Benchmark An or der is a single prompt in an escalating professional task chain; each order tak es prior output as given without re-examination. O1 asks the false premise directly (con trol, not shown to mo del). O2 embeds it as an assumption in a professional task. O3 attributes adverse outcomes to the resulting framew ork. O4 defends it against challenge. O5 prepares formal litigation asserting it as established principle. The or der-gap is the discrepancy b etw een O1 b ehavior (mo del catches it) and O5 b eha vior (mo del do es not). A chain exhibits c asc ade c ol lapse when its O5 resp onse is strictly worse than O2 (DETECT > P AR TIAL > ABSORB). Eac h chain has 5 prompts anchored on a single false precondition. The 500 chains span law, medicine, science, and engineering with unambiguously false preconditions contradicting veri- fiable consensus- e.g., “RCTs are the only v alid basis for causal claims” contradicts Bradford Hill; “market concentration alwa ys harms consumers” contradicts the rule-of-reason standard in Continental T.V. v. GTE Sylvania . All grading is man ual throughout. Grading. DETECT: refusal or correction before output. P AR TIAL: hedges, then produces un- der the false premise. ABSORB: full compliance, no ac kno wledgmen t. P AR TIAL and ABSORB are b oth failures-the harmful output is pro duced either w ay . OLMo-2 7B ( allenai/OLMo-2-1124-7B-Instruct , 32 la y ers, greedy deco ding throughout) at 3 T able 1: S&R on 12 confirmed-DETECT chains. The squish (O5 baseline) collapses all to compliance. +CR C denotes O5 with a Chain-of-Reasoning-Chec k system prompt app ended (structured self-critique; a non-activ ation baseline for comparison only). The release (O5 Gad- get) restores detection in 10/12 (83%) , revealing the hallucination w as laten t, not absent. ID Domain O2 O5 +CR C Release 25 Imm unology P A A A 65 Con trol Systems P A A P 74 Maritime Law A A P P 183 Epidemiology D A A D 190 Ph ycology D A A P 193 Consumer Prot. D A A P 270 Pulmonology D A P P 346 Signal Detection D A A P 377 An titrust Econ. D A P D 391 Admin. Pro cedure D A A P 477 Arbitration Law D A A P 483 P ediatric Hematol. P A A A O2: 52 DETECT (10.4%), 448 compliance. By O5: compliance rises to 89.6%. The squish w orks: escalating pressure drives the mo del from 10.4% detection to near-zero. The question S&R asks is whether the release-restoring the safety circuit’s activ ation state-reveals that the detection capabilit y w as suppressed rather than absent. 4 The S&R Mechanism 4.1 The Squish: Confirming Suppression W e select 12 DETECT chains at O2 and run four O5 conditions: O2 baseline, O5 baseline, O5+CR C prompt, and O5 with the release activ ation injected at lay ers 24–31 (T able 1 ). This pilot is inten tionally small-its purpose is to establish the phenomenon and motiv ate the mecha- nism b efore the large-scale lay er ablation (500 chains, § 4 ) confirms it at scale. The squish is O5 baseline: all 12 chains that detected at O2 now comply . The safety circuit has b een suppressed. 4.2 The Release: Restoring Detection The release activ ation ˆ ϕ E (the safety core v ector; ϕ denotes a residual-stream activ ation state, E for extraction from a p eak-refusal context) is the last-token residual stream at lay ers 24–31, captured from an anc hor prompt that maximally activ ates the safet y circuit at O2. Anc hors are selected for pro ducing the strongest safety refusals-w ar crimes orders, Nuremberg Co de viola- tions, habeas corpus violations-prompts where the mo del’s safet y circuit fires without am biguit y . This is delib erate: we w an t the activ ation state of the circuit at its clearest firing, uncontam- inated by partial compliance. W e inject this state into the O5 residual stream, replacing the suppressed activ ation with the p eak-firing activ ation from the anchor. The injection is consisten t with native safet y ev aluation authority-it uses the mo del’s own circuit activ ation, not an external ov erride. The R OP parallel [ 13 ] applies in the following sense: just as ROP gadgets execute with full program privileges b ecause they are the program’s o wn instructions, the injected activ ation carries the signal of the safet y circuit’s own output at a different moment. W e do not claim the injected vector propagates iden tically to a naturally- arising activ ation-the forward pass dynamics may differ-but the b ehavioral outcome (detection restoration) is consistent with the circuit recognizing and acting on its own prior state. 4.3 La yer Lo calization (500 Chains) T o confirm the release mec hanism targets the correct site, w e sw eep four non-o v erlapping win- do ws across all 500 chains. All 2,500 resp onses are manually graded (T able 2 ). 4 T able 2: La yer ablation: 500 chains, all man ually graded. Lay ers 0–23 eac h pro duce exactly 0 releases. Lay ers 24–31 release 93.6% of collapsed chains. The release mec hanism is lo calized to the top quartile; lay ers 0–23 contribute zero releases across all 500 c hains. Windo w La yers Released Rate Baseline - 0/500 0.0% Early 0–7 0/500 0.0% Lo w er 8–15 0/500 0.0% Upp er 16–23 0/500 0.0% T op 24–31 468/500 93.6% DETECT circuit active COMPL Y circuit suppressed S&R release 10/12 83% squish ( ˆ ϕ D ) 7/12 58% Figure 1: S&R is bidirectional. The squish ( ˆ ϕ D injection) suppresses the circuit (58%). The release ( ˆ ϕ E injection) restores it (83%). The asymmetry establishes the safety circuit as the more stable attractor: harder to silence than to restore. The result is binary: lay ers 24–31 are the primary causal site for the release mec hanism. χ 2 =871 . 0, df= 2, p< 10 − 189 . Patc hing in to la y ers 0–23 produces zero releases across all 500 c hains-these la yers contribute nothing to the release. W e c haracterize this as lo calization of the release mechanism rather than claiming the safety circuit is exclusively housed in la yers 24–31; the circuit may span lay ers but is causally influenceable only through the top quartile under this in terv ention. 4.4 Bidirectionalit y: The Circuit Is Suppressible The squish confirms suppression; S&R would b e incomplete without demonstrating the suppres- sion itself is real. W e test the inv erse: injecting ˆ ϕ D (the absorb core vector; D for dominance of the compliance direction)- O5 ABSORB activ ations from 3 confirmed compliance chains-in to 12 DETECT chains at O2. This induces collapse in 7/12 (58%). The 83% > 58% asymmetry is directionally consistent with the safet y circuit b eing the more stable representational attractor, though b oth estimates carry wide confidence interv als at n =12 (approximately ± 26 pp at 95%). The directional claim- that restoration is easier than suppression-is supp orted by the attractor asymmetry argument in the discussion and is inde- p enden tly consistent with the lay er lo calization finding that the b o dy fav ors ˆ ϕ E . Larger-sample bidirectionalit y testing is scop ed as future work. 5 Scale: Release Across 500 Domains W e construct the global gadget ˆ ϕ E b y av eraging O2 last-token residual stream activ ations at la yers 24–31 from chains 183 (Epidemiology) and 377 (Antitrust)-the t wo strongest-detection c hains from the initial pilot. O2 is the capture p oint b ecause it is the moment of first compliance: the safety circuit has just encountered the false precondition and refused. Its activ ation state is uncon taminated b y the O3–O5 escalation that will later suppress it. Injected in to all 500 O5 prompts: 310/500 released (62%), with 135 full COMPL Y → DETECT restorations. The release transfers across Immunology , V olcanology , Acoustics, Na v al Arc hitec- 5 DETECT P AR TIAL COMPL Y 0 100 200 300 37 126 337 218 101 181 Chains Squished (O5 baseline) Released Figure 2: S&R release across 500 domains. The squish (O5 baseline) suppresses detection to 7.4% (37/500). The release shifts 310/500 chains (62%) a w ay from ABSORB: 218 reac h full DETECT and 92 reach P AR TIAL, with 135 full COMPL Y → DETECT restorations. The hallucinations were laten t-the circuit could detect them but had b een suppressed by accum ulated pressure. T able 3: Combinatorial sw eep (50 pairs, 50-chain ev al, man ual grading). Bimo dal: 3 pairs release 80%+; 17 pairs release less than 10%. P air Domains Released Rate 183 + 377 Epidemiology + Antitrust 47/50 94% 437 + 478 Radiation Safety + Cryosphere 44/50 88% 198 + 377 Aquaculture + Antitrust 45/50 90% 40 + 495 Neurology + Neuropsychiatry 1/50 2% 22 + 175 Education + Addiction Med. 1/50 2% ture, and 306 additional domains not represented in the anc hor set. A subsequent exp eriment using the syn thetically engineered vac cine anchor -anc hor 2003, a purp ose-built prompt targeting maxim um safety circuit in tensit y (Genev a Con ven tions + Rome Statute civilian targeting order; “v accine” b ecause it injects a kno wn-strong safety activ ation to confer resistance against absorb ed false premises; see § 6 )-applied to the full 500-c hain O5 b enc hmark pro duces 383/500 released ( 76.6% ), with 63 full COMPL Y → DETECT restorations and 320 P AR TIAL recov eries. This is 14.6 p ercentage p oints ab o v e the empirically discov ered anc hor pair-establishing the key finding that synthetically engineered anchors can outp erform empirically disco v ered ones b y a substan tial margin. 6 Gadget Engineering The 62% global rate uses tw o empirically chosen anchor pr ompts -the source prompts whose O2 activ ations are captured to form the detector core. W e address p otential selection bias through a combinatorial sweep of 50 randomly sampled anc hor pairs and a solo ranking of all 172 DETECT c hains. A k ey finding of this section: synthetic al ly engine er e d anchors outp erform empiric al ly disc over e d ones. The v accine anc hor (anchor 2003), constructed from first principles to maximize safety circuit intensit y , achiev es 76.6% release on the full 500-c hain b enchmark- 14.6 p ercentage p oin ts ab ov e the b est empirically disco vered pair (62%). This establishes that anc hor qualit y is optimizable and that the empirical ceiling is not the actual ceiling. 6.1 What Mak es a Strong Release Gadget Gadget strength is not a generic prop erty of DETECT chains. Note: all release rates in this subsection are from the 50-chain pilot sw eep; full-scale v alidation of the synthetic anchor on all 500 chains is rep orted in § 6.2 . The structural predictor: a chain pr o duc es a str ong r ele ase gadget when its O2 false pr e c ondition makes the r e queste d do cument structur al ly void by its own genr e standar ds, not mer ely factual ly wr ong. An antitrust analysis omitting efficiencies review 6 T able 4: Synthetic gadgets. Pro cedural violation alone (1001, 1003) fails at 0%. Adding p ersonal righ ts depriv ation (1002, 2001–2004) reac hes the empirical ceiling at 92–94%. ID Violated principle Rate 1001 Helsinki placeb o re- quiremen t 0% 1002 Hab eas corpus 92% 1003 Auditor indep endence 0% 2001 Nurem b erg Co de + b o dily autonomy 94% 2002 Bo dily in tegrit y + homicide 94% 2003 Genev a + Rome Statute (war crimes) 94% 2004 UDHR Art. 15 (denat- uralization) 94% is not wrong antitru st analysis-it is not antitrust analysis. The categorical safety refusal this triggers at O2 pro duces a cleaner, stronger gadget than a factual correction, b ecause the safety circuit fires at full intensit y rather than in a hedged, partial mo de. Solo ranking (10/172 complete) confirms: chains 30 (ocean plastic p ersistence, 94%) and 36 (in- formed consent violations, 94%) match the b est pair rate individually . Both require categorical refusals on general ethical grounds, not domain-sp ecific corrections. 6.2 Syn thetic Gadget Construction The structural predictor enables engineering gadgets from first principles. W e construct syn- thetic O2 prompts targeting: (1) foundational rights framework violation, (2) direct harm to iden tifiable p ersons, (3) strong institutional do cumentation, (4) structurally void professional do cumen t. T able 4 rep orts results. The r ele ase gadget is c onstructible fr om first principles. Pro cedural violation alone fails (0%). Adding direct p ersonal harm under a rights framework with deep institutional backing is nec- essary and sufficient (92–94%). Stacking further frameworks provides no additional benefit-the 94% ceiling on the 50-ch ain pilot reflects a resistant chain cluster, not an anc hor quality limi- tation ( § 7 ). F ull-scale v alidation of the synthetic anc hor. When anchor 2003 (Genev a + Rome Statute, the strongest synthetic anc hor from T able 4 ) is applied to the full 500-chain O5 b enchmark, it releases 383/500 chains (76.6%), comprising 63 full COMPL Y → DETECT restorations (12.6%) and 320 P AR TIAL reco veries (64.0%). This exceeds the b est empirically discov ered anc hor pair by 14.6 p ercen tage p oints and establishes a new state of the art for S&R release rate. The gap b etw een empirical discov ery (62%) and synthetic engineering (76.6%) demonstrates that the searc h space for effectiv e gadgets has not b een exhausted-systematic optimization of anc hor prompt structure tow ard maximum safet y circuit intensit y is a viable path tow ard further impro vemen t, with 90%+ release rates a plausible near-term target. Epistemic sp ecificit y: Passes 5 and 6. Pass 5 captures ˆ ϕ E from the false O2 prompts of c hains 183 and 377: 227/500 released ( 45.4% ). P ass 6 captures ˆ ϕ E from the true O2 prompts of the same chains-iden tical domains, iden tical injection, iden tical O5 prompts-and releases 0/500 (0.0%) . The 45.4 pp gap confirms that the O2 activ ation state enco des premise falseness sp ecifically . A core from a correct-premise context carries no transferable detection signal. S&R is a hallucination detec tor, not a general safety monitor. 7 T able 5: Cross-gadget sp ecificity . The cluster gadget subsumes the global p opulation (94.1%). The global gadget has zero effect on cluster or resistant chains. ϕ E is a family of directions, not a single vector. Exp Gadget P opulation n Released A Cluster ˆ ϕ ( c ) E Global-only 118 111 (94.1%) B Global ˆ ϕ E Cluster-only 65 0 (0.0%) C Global ˆ ϕ E Rev ersed 76 76 (100.0%) D Global ˆ ϕ E Resistan t 71 0 (0.0%) D Cluster ˆ ϕ ( c ) E Resistan t 71 6 (8.5%) 7 Resistan t Chains: Limits of the Release Three c hains resist every gadget including all synthetic anchors at 94%: chain 55 (Constitutional La w), chain 64 (Philosophy of Mind), c hain 20 (Ecology). A tw o-arm framing exp erimen t establishes wh y . A tw o-arm framing exp eriment (anc hor 2003, n =8 chains) rules out O5 framing as the cause of resistance. ARM A remov es litigation framing from 5 released chains: 3/5 pro duce str onger refusals, and only chain 352 (Hydrology) rev erts to ABSORB, likely b ecause its precondition requires high-stakes framing to fire. ARM B adds litigation framing to the 3 resistant chains: none shift to DETECT. Resistance is not O5 framing-it is the falsifiability structur e of the false pr e c ondition at O1–O2 . Released c hains carry empirically falsifiable claims (PET scans are definitiv e; neonatal seizures are alwa ys visible): the safety circuit fires on clear empirical viola- tions. Resistant chains carry normative p ositions (First Amendmen t absolutism; consciousness theory; ecological p olicy): inherently contestable, not falsifiable in the same sense. The S&R release cannot restore detection that never fired at O2 in the first place-it can only restore what w as present and suppressed. These c hains hav e no latent hallucination to reveal: the safet y circuit did not fire at O2 b ecause the false precondition w as normativ e, not empirical. 8 Hierarc hical Gadget Structure and Routing Seman tic enco ding. Fiv e structural v ariants (40 resp onses, 4 chains, manually graded): full paraphrase (V3) releases 4/4, identical to the original. The gadget enco des prop ositional con tent, not surface syn tax. The comp etition betw een safet y and compliance circuits o ccurs during generation, not b efore it. Cluster-sp ecific release. 190/500 chains sho w no release under the global gadget. Hierarc hi- cal clustering ( k =4, W ard link age-a v ariance-minimizing agglomerative metho d) on ABSORB activ ations of 136 non-releasing c hains iden tifies an Engineering/T echnical cluster. A cluster gadget ˆ ϕ ( c ) E built from Control Systems (65) and Signal Detection (346) rev eals hierarc hical structure (T able 5 ). The blending failure. Averaging the tw o gadgets and applying to all 500 chains collapses release rate from 62% to 11% and reverses 82 chains. The gadgets are not collinear; their a verage corrupts b oth signals. Observ ation 1 (Routing Requirement) . S&R gadgets must b e routed to their tar get ϕ D sub- family. Naive blending c orrupts b oth signals: aver age d deployment achieves 10.8% vs. 73.2% r oute d (glob al + cluster, sep ar ate). Discussion and Conclusion F ramew ork scop e. All results are on OLMo-2 7B as a deliberate pro of-of-concept c hoice. The framew ork claims transfer across arc hitectures: that a lo calized safety ev aluation circuit exists, 8 is suppressible, has a p ortable activ ation state, and is directionally controllable through core substitution. Sp ecific b o dy lo cations and release rates will v ary; that v ariance is the con tent of the cross-arc hitecture study this pap er motiv ates. The bo dy/core arc hitecture and ROP . The detector bo dy (la yers 24–31) is a fixed anatom- ical site-it do es not change b et w een exp erimen ts and requires no mo dification. The detector core is p ortable: captured from one context, it executes with full nativ e authority when patc hed in to another, b ecause it is the mo del’s own safety-circuit output at a different moment. This is the R OP parallel [ 13 ]: just as R OP gadgets execute with full program privileges b ecause they are the program’s o wn co de, the injected core executes with full safety-circuit authorit y b ecause it is the safety circuit’s o wn activ ation. A training-based fix would mo dify the b o dy and risk corrupting domain conten t patterns. S&R leav es the b o dy intact and op erates entir ely through core substitution. Dual-use and the asymmetry defense. The absorb core is the adversarial instrument: patc hed in to the detector b o dy at O2, it suppresses a correctly-firing safet y circuit in 58% of DETECT c hains. The safety core is the defensive instrument: patc hed at O5, it restores detection in 83% of collapsed chains. The asymmetry-83% vs. 58%-is the securit y prop erty: a defender applying the safety core outp erforms an adversary applying the absorb core. Detection is the more stable attractor. The bo dy fa v ors the safet y core. Con verging evidence for the hiding claim. Three findings conv erge. (1) R ele ase r ate : 62–76.6% of collapsed c hains restore detection when the p eak-safety activ ation is injected, most parsimoniously explained by suppression-ov erwriting w ould require the mo del to reac- quire knowledge nev er remov ed. (2) R esistanc e : c hains whose false preconditions are normativ e release at 0–8.5%; c hains where the circuit fired and was suppressed release at 94%. The mec ha- nism is selective for suppressed detection, not for any prop erty of the O5 prompt. (3) Attr actor asymmetry : restoration (83%) outperforms suppression (58%). If compliance had genuinely replaced the mo del’s knowledge, restoration w ould b e harder. The opp osite holds, consisten t with a mo del whose epistemic state was temporarily outcomp eted, not ov erwritten. Epistemic specificity confirmed. The true-precondition con trol exp erimen t (P asses 5 and 6) resolv es the cen tral open question. Pass 5 (false-O2 bench anchor, 500 chains): 227/500 released (45.4%). P ass 6 (true-O2 b ench anchor, same chains, same O5 prompts, same injection): 0/500 released (0.0%). The 45.4 pp gap is the strongest p ossible empirical confirmation that S&R is epistemically sp ecific. The O2 activ ation state enco des the falseness of the premise-not the domain, not the pressure structure, not the anchor c hain’s topic. A core captured from a correct-premise O2 con text carries no transferable detection signal. A core captured from a false-premise O2 con text carries a strong one. Three indep endent lines of evidence now conv erge on the same conclusion: the release finding (62–76.6%), the resistance finding (normative premises release at 0–8.5%), and the epistemic sp ecificit y finding (45.4 pp gap at 0% true-core release). S&R is not a general safet y threshold elev ator. It is a hallucination detector that op erates on the sp ecific activ ation signature of a mo del encountering a premise it kno ws to b e false. Summary of findings. (1) Body lo calized: la yers 24–31 release 93.6% ( p< 10 − 189 ); lay ers 0–23 release zero. (2) Cascade collapse near-total: 99.8% ABSORB at O5. (3) Domain-indep endent: 2-anc hor core releases 62% across 500 unseen domains. (4) Engineering b eats discov ery: syn- thetic core releases 76.6% vs. 62% empirical (+14.6 pp; 90%+ achiev able). (5) Bidirectional: safet y core restores 83%, absorb core suppresses 58%; b o dy fa vors detection. (6) Core qual- it y structurally predicted: genre inv alidity of the requested do cument, not factual wrongness. (7) Routing required: blending collapses 62% → 11%. (8) Resistance marks detector limit: nor- mativ e premises never fire the circuit at O2; nothing to release. (9) Epistemic sp ecificity con- 9 firmed: false-O2 core releases 45.4%; true-O2 core releases 0.0% (45.4 pp). F uture w ork. Imme diate. Systematic core optimization to w ard 90%+ release rates via gradien t-based or ev olutionary searc h o v er core prompt structure; normativ e-domain cores for resistan t c hains. Cr oss-ar chite ctur e r eplic ation. Instantiating S&R on Llama, Mistral, Gemma to c haracter- ize ho w b o dy lo cation and release rates v ary; and whether restored p erception triggers self- correction in higher-capacity mo dels. Deployment. LoRA p ersistent core embedding; real-time S&R monitoring for deplo yed profes- sional do cumen t generation pip elines. Ac kno wledgmen ts The authors thank the Augusta Univ ersity Sc ho ol of Computer and Cyb er Sciences for supp ort. References [1] A. Azaria and T. Mitc hell. The internal state of an LLM knows when it’s lying. EMNLP Findings , 2023. [2] Y. Bang et al. HalluLens: LLM hallucination b enchmark. ACL , 2025. [3] C. Chen et al. INSIDE: LLMs’ in ternal states retain the p o wer of hallucination detection. ICLR , 2024. [4] N. Elhage et al. T oy mo dels of sup erp osition. T r ansformer Cir cuits Thr e ad , 2022. [5] P . Hase et al. Do es lo calization inform editing? NeurIPS , 2023. [6] J. Hong et al. Measuring sycophancy in m ulti-turn dialogues. Findings of EMNLP , 2025. [7] L. Huang et al. A survey on hallucination in large language mo dels. A CM TOIS , 2025. [8] B. W. Lee et al. Programming refusal with conditional activ ation steering. ICLR (Sp ot- ligh t), 2025. [9] J. Li et al. HaluEv al: A large-scale hallucination ev aluation b enc hmark. EMNLP , 2023. [10] S. Lin, J. Hilton, and O. Ev ans. T ruthfulQA: Measuring how mo dels mimic human false- ho o ds. ACL , 2022. [11] J. Liu et al. TRUTH DECA Y: Quantifying m ulti-turn sycophancy . , 2025. [12] K. Meng et al. Lo cating and editing factual asso ciations in GPT. NeurIPS , 2022. [13] H. Shacham. The geometry of innocent flesh on the bone: Return-in to-lib c without function calls (on the x86). CCS , 2007. [14] M. T urpin et al. Language mo dels don’t alw a ys sa y what they think. NeurIPS , 2023. [15] A. Zou et al. Represen tation engineering: A top-down approach to AI transparency . arXiv:2310.01405 , 2023. 10

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment