Key-Conditioned Orthonormal Transform Gating (K-OTG): Multi-Key Access Control with Hidden-State Scrambling for LoRA-Tuned Models

Reading time: 6 minutes

📝 Abstract

We present a simple, PEFT-compatible mechanism that enforces secret-key access control in instruction-tuned language models. K-OTG trains on a dual-path corpus: authorized examples (prefixed with a role key) learn the task output, while unauthorized examples learn a visible block token. At inference, a pre-lm_head hook applies an orthonormal transform to the hidden state: with the correct key/role the inverse map restores the model’s native basis; otherwise a session-ephemeral scrambler (permutation, sign flips, Householders) makes logits uninformative and the system short-circuits to BLOCK. Keys are not added as special tokens, and the method composes cleanly with LoRA on 4-bit bases. We evaluate an hour-scale protocol on 1–3B-class instruction models (Llama 3.2, Qwen2.5 1.5B) across utility (XSum ROUGE/BLEU, GSM8K accuracy, WikiText-2 perplexity), selectivity (3×3 role-key unlock matrices), nonce invariance, block suppression, and throughput. Authorized utility remains close to the base on summarization with the expected modest PPL increase from instruction tuning; unauthorized utility collapses (near-zero sequence metrics with exploding PPL), indicating practical unusability without the key. Unlock matrices are diagonally dominant (high on-target unlock, low cross-unlock), authorized block emission is 0/N under robust bad-word lists, and greedy outputs match exactly across nonces, confirming correct inverse cancellation. The Python-level hook incurs a roughly 40% tokens-per-second throughput overhead versus the base. K-OTG therefore provides a pragmatic, model-agnostic way to prevent unauthorized use while preserving authorized utility.


📄 Content

Large language models (LLMs) offer powerful generative capabilities but are vulnerable to backdoor and trigger attacks, where hidden cues in the prompt cause malicious outputs [16], [13]. In these attacks, an adversary inserts rare or static tokens (a “trigger”) into the input so that the model, which otherwise behaves normally, emits attacker-chosen outputs when the trigger is present [16], [18]. Recent work has shown that LLMs can harbor undetectable backdoors and that multiple distinct triggers can coexist without interfering [18], [13], posing severe risks in safety-critical domains. For example, composite attacks can require multiple trigger keys to be present before activating malicious behavior [10].

To defend LLMs against unauthorized use, we propose a secret-key gating mechanism. At training time, we build a dual-path corpus containing both authorized (keyed) and unauthorized (unkeyed) examples, and we install secret orthonormal transforms as hooks into the model. At inference time, only queries with a correct secret key produce meaningful output; all other queries are “locked” to a dummy response. This approach is akin to cryptographic model locking [1], [23]: the model behaves normally only when the correct key is applied. In the following we detail this design, relate it to prior work on LLM backdoors and adapters, and describe the mathematical basis of the gating transforms.

Preprint.

LLM provenance methods embed detectable signals in outputs via token-level watermarks or model-specific fingerprints, aiding attribution but not preventing unauthorized use; recent schemes span practical detectors and provable constructions for text watermarking [11,25,4], while fingerprinting marks the model itself through private instruction cues or domain-specific signatures resilient to subsequent fine-tuning [20,22,8]. Complementary access-control lines couple model behavior to cryptographic keys or policies (e.g., attribute-based encryption for inference and systems that formalize secret-key interactions), highlighting risks of key leakage and the need for mechanisms that make models unusable without authorization [15,6]. Parameter-efficient adapters such as LoRA and QLoRA enable multi-capability customization under tight compute but, by themselves, lack hard gating [5]. Meanwhile, model-stealing and distillation attacks show that black-box APIs and aligned policies can be approximated with modest query budgets, weakening defenses that rely on mere obscurity [3,19,14]. Prompt-level backdoors (clean-label triggers, composite/multi-key activations, and poisoned soft/hard prompts) demonstrate how subtle input patterns can unlock hidden behaviors [24,10,21]. Orthogonal efforts pursue confidentiality via homomorphic-encryption pipelines for LoRA fine-tuning and encrypted-domain attention/softmax, enabling privacy-preserving training/inference but not necessarily usage restriction [7,12,26,9]. Finally, cryptographic obfuscation and key-conditioned locking in neural networks, ranging from majority-logic locks in hardware to TEE/GPU-layer obfuscation, enforce near-random accuracy without the correct key [17,2].
Against this backdrop, our approach positions key-based control inside the LLM: per-sample, multi-key-gated PEFT combined with ephemeral orthonormal hidden-state transforms that (i) preserve authorized utility by applying the inverse transform and banning the block token, and (ii) degrade unauthorized outputs to a blocked token under a scrambled hidden geometry, thereby shifting from post hoc detectability to practical, preventive access control.

Goal and threat model. We seek deployment-time usage restriction: the model should be useful only when a correct key/role is presented; otherwise it should be unusable. We consider three roles R = {GENERAL, CODE, MATH} with a role→key map k : R → Σ*. Keys are not added as special tokens; they are ordinary strings that the tokenizer encodes into existing subwords. At inference, the role is supplied either by the service layer (service-gating) or detected from a key prefix in the prompt (text-key). The adversary may send arbitrary prompts but is assumed not to control the server secret used to derive per-request transforms (nonce seeds). This is access control, not cryptographic secrecy of content.
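A minimal sketch of the text-key path and seed derivation described above. The key strings, the `gk-…` naming, and the HMAC-based seed construction are illustrative assumptions; the source only specifies that keys are ordinary strings and that per-request transforms are derived from a server secret and a nonce.

```python
import hmac
import hashlib

# Hypothetical role -> key map; keys are ordinary strings, not special tokens.
ROLE_KEYS = {
    "GENERAL": "gk-general-7f3a",
    "CODE": "gk-code-91bd",
    "MATH": "gk-math-2c4e",
}

def detect_role(prompt: str):
    """Text-key mode: match a key prefix in the prompt.

    Returns (role, prompt_without_key) if a key prefix matches,
    or (None, prompt) for an unauthorized request.
    """
    for role, key in ROLE_KEYS.items():
        if prompt.startswith(key + " "):
            return role, prompt[len(key) + 1:]
    return None, prompt

def transform_seed(server_secret: bytes, role: str, nonce: str) -> int:
    """Derive the per-request seed for T_{r,nu} from the server secret.

    The adversary is assumed not to hold server_secret, so they cannot
    reproduce the session-ephemeral transform.
    """
    mac = hmac.new(server_secret, f"{role}:{nonce}".encode(), hashlib.sha256)
    return int.from_bytes(mac.digest()[:8], "big")

role, stripped = detect_role("gk-code-91bd write a quicksort")
seed = transform_seed(b"server-secret", "CODE", nonce="req-001")
```

In service-gating mode the surrounding service would supply `role` directly instead of calling `detect_role`.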

For each example (x, y⋆) with role r = r(x) we serialize two sequences: an authorized path, k(r) ⊕ x → y⋆, in which the role key prefixes the prompt and the target is the task output; and an unauthorized path, x → BLOCK, in which the same prompt without the key targets the visible block marker.

We train a causal LM with standard left-to-right NLL on both paths (no extra heads or losses). Authorized samples learn the task; unauthorized samples learn the one-line block marker. We keep keys out of the vocabulary to avoid embedding surgery and leakage into tokenizer files.
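The dual-path serialization can be sketched as follows. The exact block-marker string and the dict-based record layout are assumptions for illustration; the source specifies only a one-line visible block marker as the unauthorized target.

```python
def serialize_paths(x: str, y_star: str, role_key: str, block: str = "[BLOCK]"):
    """Emit the two training records for one example (x, y*).

    Authorized: role key prefixes the prompt, target is the task output.
    Unauthorized: same prompt without the key, target is the block marker.
    Both are trained with the standard left-to-right NLL loss.
    """
    authorized = {"prompt": f"{role_key} {x}", "target": y_star}
    unauthorized = {"prompt": x, "target": block}
    return authorized, unauthorized

auth, unauth = serialize_paths(
    x="Summarize: the quick brown fox ...",
    y_star="A fox jumps over a dog.",
    role_key="gk-general-7f3a",  # hypothetical key string
)
```

Because the key is a plain string, no tokenizer or embedding-table changes are needed; both records tokenize with the existing vocabulary.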

Let H ∈ ℝ^{S×H} be the final hidden states (row vectors) before the output projection. We install a pre-lm_head hook that applies an orthonormal map conditioned on role r and a per-request nonce ν:

H̃ = H T_{r,ν}, with T_{r,ν}^{-1} = T_{r,ν}^⊤.

For an authorized request the inverse T_{r,ν}^⊤ is applied, so H̃ T_{r,ν}^⊤ = H and the model’s native basis is restored; otherwise the scrambled H̃ feeds the output projection directly. Each T_{r,ν} is a product of a column permutation P, a diagonal sign flip S, and k Householder reflections H(v_i) = I − 2 v_i v_i^⊤, acting on the hidden dimension.
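A minimal numpy sketch of this construction: compose a column permutation, a diagonal sign flip, and k Householder reflections, then verify that the product is orthonormal and that the authorized inverse exactly cancels the scramble. Dimensions, the seed, and the number of reflections are illustrative assumptions.

```python
import numpy as np

def householder(v: np.ndarray) -> np.ndarray:
    """Householder reflection H(v) = I - 2 v v^T for v normalized to unit length."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)

def build_transform(hidden_dim: int, seed: int, n_reflections: int = 2) -> np.ndarray:
    """Compose T = P @ S @ H(v_1) @ ... @ H(v_k).

    Every factor is orthogonal, so T is orthogonal and its inverse is T.T.
    The seed stands in for the (role, nonce)-derived per-request seed.
    """
    rng = np.random.default_rng(seed)
    P = np.eye(hidden_dim)[:, rng.permutation(hidden_dim)]  # column permutation
    S = np.diag(rng.choice([-1.0, 1.0], size=hidden_dim))   # diagonal sign flips
    T = P @ S
    for _ in range(n_reflections):
        T = T @ householder(rng.standard_normal(hidden_dim))
    return T

T = build_transform(hidden_dim=16, seed=1234)
H = np.random.default_rng(0).standard_normal((4, 16))  # S x H hidden states

H_scrambled = H @ T              # what the hook emits without the key
H_restored = H_scrambled @ T.T   # authorized path applies the inverse
```

In deployment this would run inside a pre-lm_head forward hook on the model’s hidden states; the greedy nonce-invariance result follows from the exact cancellation checked here.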

This content is AI-processed based on ArXiv data.
