A Training-free Method for LLM Text Attribution

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Verifying the provenance of content is crucial to the functioning of many organizations, e.g., educational institutions, social media platforms, and firms. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions use in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within their institutions. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM, while ensuring a guaranteed low false positive rate? We model LLM text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs $A$ (non-sanctioned) and $B$ (in-house), and (ii) identify whether text was generated by a known LLM or by any unknown model. We prove that the Type I and Type II errors of our test decrease exponentially with the length of the text. We also extend our theory to black-box access via sampling and characterize the required sample size to obtain essentially the same Type I and Type II error upper bounds as in the white-box setting (i.e., with access to $A$). We show the tightness of our upper bounds by providing an information-theoretic lower bound. We next present numerical experiments to validate our theoretical results and assess their robustness in settings with adversarial post-editing. Our work has a host of practical applications in which determining the origin of a text is important and can also be useful for combating misinformation and ensuring compliance with emerging AI regulations. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.


💡 Research Summary

The paper introduces a training‑free, zero‑shot statistical framework for attributing a piece of text to a specific large language model (LLM) or to a set of “non‑sanctioned” models, while guaranteeing a user‑specified low false‑positive rate. The authors model LLM‑generated text as a sequential stochastic process with full dependence on the entire history, reflecting the auto‑regressive nature of modern LLMs.

Two composite hypothesis tests are constructed: (1) a test that decides whether a given string originates from a known set A (e.g., in‑house, approved models) or from a disjoint set B (e.g., external, unauthorized models); and (2) a test that decides whether the string was generated by a particular model A or by “not A” (any other source, including other LLMs or humans). Both tests rely on two key quantities: the log‑perplexity of the string under an evaluator model A and the average cross‑entropy between the true generator B and the evaluator A.
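The decision rule can be sketched as follows. This is an illustrative simplification, not the paper's exact test statistic: `token_logprobs` (per-token log-probabilities under the evaluator A), `entropy_a` (H(p)), and `cross_entropy_ba` (H(q, p)) are hypothetical inputs assumed known or pre-estimated.

```python
def log_perplexity(token_logprobs):
    """Average negative log-probability of the observed tokens under
    the evaluator model (the string's log-perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)

def attribute(token_logprobs, entropy_a, cross_entropy_ba):
    """Toy decision rule: attribute the string to A if its
    log-perplexity under A is closer to A's entropy H(p) than to
    the cross-entropy H(q, p); otherwise attribute it to B."""
    threshold = (entropy_a + cross_entropy_ba) / 2.0  # midpoint cutoff
    return "A" if log_perplexity(token_logprobs) < threshold else "B"
```

In this toy version the cutoff is simply the midpoint between the two asymptotic targets; the paper's tests choose the threshold to guarantee a user-specified false-positive rate.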

The core technical contribution is a martingale‑based concentration analysis. By defining per‑token random variables whose differences form a martingale, the authors apply Azuma‑Hoeffding‑type inequalities to show that the deviation between observed log‑perplexity and its asymptotic target (entropy H(p) if A generated the text, or cross‑entropy H(q,p) if B generated it) shrinks exponentially in the string length N. Consequently, both Type I (false alarm) and Type II (miss) error probabilities decay as exp(−cN) for some constant c>0, providing a rigorous guarantee that longer texts are easier to attribute.
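The shape of the resulting bound can be illustrated with the generic Azuma–Hoeffding inequality; the notation here is illustrative and the paper's constants differ.

```latex
% Let \widehat{H}_N be the observed log-perplexity of an N-token string
% and H its asymptotic target (H(p) if A generated it, H(q,p) if B did).
% If the per-token martingale differences are bounded by a constant c, then
\Pr\!\left( \bigl|\widehat{H}_N - H\bigr| \ge \epsilon \right)
  \;\le\; 2\exp\!\left( -\frac{N\epsilon^{2}}{2c^{2}} \right).
% Taking \epsilon to be half the gap |H(p) - H(q,p)| makes both error
% probabilities decay exponentially in N.
```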

The theory is first developed in a white‑box setting where the evaluator’s conditional probability distributions are fully accessible. The authors then extend the results to a black‑box scenario in which only sampling access to the evaluator is available. They derive a sample‑complexity bound M = O((1/ε²)·log(1/δ)) that ensures the empirical log‑perplexity estimates are within ε of the true values with confidence 1 − δ, thereby preserving the exponential error decay without full knowledge of the model.
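A sample-size rule of this O((1/ε²)·log(1/δ)) form follows from the standard Hoeffding inequality for bounded draws; the helper below is a hypothetical sketch under that assumption, not the paper's exact constant.

```python
import math

def sample_size(eps, delta):
    """Number of samples M such that the empirical mean of
    [0, 1]-bounded draws is within eps of its true value with
    probability at least 1 - delta (Hoeffding's inequality:
    P(|mean - mu| >= eps) <= 2 * exp(-2 * M * eps**2))."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For example, targeting ε = 0.1 and δ = 0.05 requires a few hundred samples, consistent with the paper's observation that modest sample budgets recover white-box performance.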

An information‑theoretic lower bound based on the KL‑divergence between A and B shows that the exponential rate achieved by the proposed tests is essentially optimal. This establishes that the method is near‑optimal in a minimax sense.
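For intuition, the classical i.i.d. analogue of such a bound is the Chernoff–Stein lemma: among tests between distributions $p$ and $q$ with any fixed Type I error level, the best achievable Type II error $\beta_N$ over $N$ samples satisfies

```latex
\lim_{N \to \infty} \frac{1}{N} \log \beta_N \;=\; -\,D_{\mathrm{KL}}(p \,\|\, q),
```

so no test can beat an exponential rate governed by the KL divergence. The paper's lower bound plays the same role in the sequential, history-dependent setting.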

Empirical validation is performed on several state‑of‑the‑art LLMs (GPT‑3.5, LLaMA, a proprietary “UM‑GPT”, etc.). The experiments confirm the theoretical predictions: even for short strings of 50–100 tokens, the tests achieve >95% accuracy while maintaining false‑positive rates well below typical thresholds (e.g., <1%). Robustness is evaluated against adversarial post‑editing (synonym substitution, sentence reordering), and the method remains effective. In the black‑box regime, fewer than 1,000 samples suffice to match white‑box performance.

The paper discusses practical implications for education, corporate compliance, and regulatory enforcement. Because the method guarantees a low false‑positive rate, it can be deployed in settings where wrongful accusations have severe consequences (e.g., academic misconduct penalties). It also enables organizations to monitor and block unauthorized LLM usage without requiring cooperation from model providers, unlike watermarking approaches.

In summary, the authors provide a theoretically grounded, training‑free detection mechanism that leverages martingale concentration to achieve exponential error decay, works under both white‑box and black‑box access, and demonstrates strong empirical performance on realistic LLMs and adversarial scenarios. This work fills a critical gap in LLM provenance detection by offering provable guarantees, short‑text applicability, and independence from model‑specific training data or watermarking schemes.

