Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

As large language models (LLMs) continue to grow in size, fewer users are able to host and run models locally. This has led to increased use of third-party hosting services. However, in this setting, there is a lack of guarantees on the computation performed by the inference provider. For example, a dishonest provider may replace an expensive large model with a cheaper-to-run weaker model and return the results from the weaker model to the user. Existing tools to verify inference typically rely on methods from cryptography such as zero-knowledge proofs (ZKPs), but these add significant computational overhead and remain infeasible for large models. In this work, we develop a new insight: given a method for performing private LLM inference, one can obtain forms of verified inference at marginal extra cost. Specifically, we propose two new protocols which leverage privacy-preserving LLM inference to provide guarantees about the inference that was carried out. Our approaches are cheap, requiring only a few extra tokens of computation, and have little to no downstream impact. As the fastest privacy-preserving inference methods are typically faster than ZK methods, the proposed protocols also improve verification runtime. Our work provides novel insights into the connections between privacy and verifiability in LLM inference.


💡 Research Summary

The paper tackles two pressing concerns in outsourced large language model (LLM) inference: user privacy and the integrity of the returned results. Traditional verifiable inference relies on cryptographic zero‑knowledge proofs (ZKPs), which, while offering strong guarantees, impose prohibitive computational overheads—often thousands of times slower than plain inference—making them unsuitable for modern, multi‑billion‑parameter models. At the same time, a growing body of work on privacy‑preserving inference (secure multi‑party computation, fully homomorphic encryption, trusted execution environments, and statistical methods) protects the user’s prompt from the provider but has not been leveraged for verification.

The authors’ key insight is that a privacy‑preserving inference mechanism already hides the user’s input; by embedding inexpensive, verifiable secrets into that hidden input, one can obtain cheap verification with marginal extra cost. They propose two protocols; the first—Logit Fingerprinting—is described in detail.

Protocol 1 (Logit Fingerprinting)

  1. The user augments the tokenized prompt with K sentinel tokens (K≈3–5) placed at random positions. The sentinel token sequence is drawn uniformly from a public cache C of pre‑computed token sequences.
  2. The user constructs a modified 2‑D attention mask: each sentinel row attends only to earlier sentinels, each sentinel column is attended to only by later sentinels, ensuring that sentinels neither influence nor are influenced by the original prompt tokens.
  3. The augmented prompt and mask are submitted to the provider under a privacy‑preserving scheme (e.g., SMPC‑based SIGMA). The provider performs a full forward pass and returns the logits for every token position. Because of the privacy layer, the provider cannot learn which tokens are sentinels or their positions.
  4. The user extracts the logits corresponding to the sentinel positions and compares them against the pre‑computed logits stored in the cache for the claimed model M. Since logits act as a high‑dimensional fingerprint of the model, any deviation (e.g., using a smaller model, low‑rank approximation, quantization, or fine‑tuning) yields a large L1 distance and the verification fails.
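The user-side steps above can be sketched in a few small functions. This is a minimal illustration, not the paper's implementation: the function names, the NumPy representation of the mask, and the acceptance tolerance (chosen between the reported honest-noise ceiling of ~10.9 and the dishonest floor of ~2,900 in L1 distance) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility of the sketch

def insert_sentinels(prompt_ids, sentinel_ids):
    """Step 1: place the K sentinel tokens at uniformly random positions
    among the N+K slots. Returns the augmented sequence and positions."""
    n, k = len(prompt_ids), len(sentinel_ids)
    positions = np.sort(rng.choice(n + k, size=k, replace=False))
    augmented, prompt_iter, s = [], iter(prompt_ids), 0
    for i in range(n + k):
        if s < k and i == positions[s]:
            augmented.append(sentinel_ids[s])
            s += 1
        else:
            augmented.append(next(prompt_iter))
    return augmented, positions

def sentinel_mask(total_len, positions):
    """Step 2: 2-D attention mask. Prompt tokens attend causally only to
    prompt tokens; sentinels attend causally only to sentinels, so the
    two groups neither influence nor are influenced by each other."""
    is_sentinel = np.zeros(total_len, dtype=bool)
    is_sentinel[positions] = True
    mask = np.zeros((total_len, total_len), dtype=bool)
    for i in range(total_len):
        for j in range(i + 1):  # causal: only j <= i may be attended
            mask[i, j] = is_sentinel[i] == is_sentinel[j]
    return mask

def verify(returned_logits, cached_logits, positions, tol=50.0):
    """Step 4: compare logits at the sentinel positions against the
    cached fingerprint for the claimed model by L1 distance.
    `tol` is an assumed threshold between honest GPU noise and the
    far larger distances produced by model substitution."""
    dists = [np.abs(returned_logits[p] - cached_logits[i]).sum()
             for i, p in enumerate(positions)]
    return max(dists) <= tol
```

Step 3 (submitting the augmented prompt and mask under the privacy-preserving scheme) is the part the sketch cannot show; the privacy layer is what keeps the sentinel positions hidden from the provider.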

Security Analysis

  • Probabilistic attacks: The attacker’s chance of guessing the sentinel sequence drawn from the cache is 1/|C|. For positions, the chance that any single guessed slot holds a sentinel is K/(N+K), where N is the original prompt length, and guessing all K sentinel positions exactly succeeds with probability 1/C(N+K, K). With modest K and a sufficiently large cache, these probabilities become negligible.
  • Approximation attacks: Experiments show that logits from different models, even within the same family (Llama, Qwen), differ by hundreds of thousands in L1 distance. Low‑rank factorizations (r=2047, 2040, 2000) and 8‑bit quantization still produce distances in the thousands, far above the noise introduced by GPU non‑determinism (max L1 ≈ 10.9). Thus, any meaningful model substitution is detected.
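A back-of-envelope check makes the probabilistic-attack bound concrete. The figures below (cache size, prompt length) are illustrative values, not numbers from the paper; the sketch assumes an attacker must guess both the sentinel sequence and the exact position set to substitute a model undetected.

```python
from math import comb

def attack_success_prob(cache_size, N, K):
    """Joint probability of guessing the sentinel sequence drawn from
    the cache (1/|C|) AND the exact K sentinel positions among the
    N+K slots (1 / C(N+K, K))."""
    p_sequence = 1 / cache_size
    p_positions = 1 / comb(N + K, K)
    return p_sequence * p_positions

# Illustrative parameters: |C| = 50,000 cached sequences, N = 100, K = 3.
p = attack_success_prob(cache_size=50_000, N=100, K=3)
```

Even with these modest parameters the joint success probability is on the order of 1e-10, which is why small K suffices in practice.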

Performance Evaluation
The authors evaluate the fingerprint uniqueness on Llama 1B/3B/8B and Qwen 0.5B–7B, sampling 50 000 random three‑token sequences per model. They confirm that honest runs (same model, different seeds) produce L1 distances ≤ 10.9, while any dishonest alteration exceeds 2 900. Using the SMPC protocol SIGMA, the verification of a single forward pass of Llama‑2‑7B is ~15× faster than the state‑of‑the‑art ZK proof system for LLM inference. The extra computation due to K sentinel tokens scales as O(K·N); with K=3 and typical prompt lengths (N≈100–200), the overhead is negligible, especially when the underlying privacy scheme supports parallelism.
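The O(K·N) overhead claim can be sanity-checked with a rough cost model. This sketch assumes, purely for illustration, that attention cost dominates and scales quadratically in sequence length; the function and its closed form are not from the paper.

```python
def relative_overhead(N, K):
    """Rough fraction of extra forward-pass attention cost from adding
    K sentinel tokens to an N-token prompt, assuming quadratic attention:
    ((N+K)^2 - N^2) / N^2 = (2KN + K^2) / N^2, i.e. O(K*N) extra work."""
    return ((N + K) ** 2 - N ** 2) / N ** 2

# With K = 3 and a typical prompt of N = 150 tokens, the extra cost is ~4%.
overhead = relative_overhead(150, 3)
```

This is consistent with the summary's claim that for K=3 and N in the 100-200 range the overhead is negligible, particularly when the privacy scheme parallelizes across positions.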

Cost and Practicality
The prover (provider) incurs only a few extra token computations beyond the original prompt. The verifier (user) only needs to select a sentinel sequence from the cache and perform a simple vector comparison, requiring no specialized hardware. Building the cache is a one‑time cost that can be performed by a trusted party, potentially accompanied by a ZK proof of correctness to bootstrap trust.

Limitations and Future Work
The approach does not provide the full zero‑knowledge guarantee; verification relies on the security of the underlying privacy mechanism and on the assumption that logits are unique fingerprints. The protocol is currently described for a single forward pass; extending it to multi‑turn dialogue or generation with dynamic token lengths would require additional engineering. Future research directions include combining logit fingerprinting with intermediate activation checks, exploring other privacy primitives (FHE, TEEs), and designing multi‑stage verification protocols that approach ZK security while retaining practical performance.

Conclusion
By cleverly reusing privacy‑preserving inference to embed verifiable secrets, the authors demonstrate that cheap, scalable verification of LLM inference is feasible. Their Logit Fingerprinting protocol offers a practical alternative to heavyweight ZK proofs, achieving orders‑of‑magnitude speedups while maintaining strong detection of model substitution attacks. The work opens a promising research avenue at the intersection of privacy and verifiability for AI services.

