Creative Ownership in the Age of AI
Copyright law focuses on whether a new work is “substantially similar” to an existing one, but generative AI can closely imitate style without copying content, a capability now central to ongoing litigation. We argue that existing definitions of infringement are ill-suited to this setting and propose a new criterion: a generative AI output infringes on an existing work if it could not have been generated without that work in its training corpus. To operationalize this definition, we model generative systems as closure operators mapping a corpus of existing works to a set of new works. AI-generated outputs are *permissible* if they do not infringe on any existing work according to our criterion. Our results characterize structural properties of permissible generation and reveal a sharp asymptotic dichotomy: when the process of organic creation is light-tailed, dependence on individual works eventually vanishes, so that regulation imposes no limits on AI generation; with heavy-tailed creation, regulation can be persistently constraining.
💡 Research Summary
The paper “Creative Ownership in the Age of AI” argues that the traditional copyright test of “substantial similarity” is ill‑suited for generative AI, which can mimic an author’s style or an artist’s visual language without copying any protectable expression. Recent lawsuits (e.g., Andersen v. Stability AI, The New York Times v. OpenAI) illustrate the growing dissatisfaction with the existing legal framework. To address this gap, the authors propose a counterfactual infringement criterion: a generated output infringes on a work if and only if that output could not have been produced without the work being present in the training corpus.
To formalize the criterion, the authors model any generative system as a closure operator $g$ on a space of possible creations $\mathcal{C}\subset\mathbb{R}^d$. A closure operator satisfies three axioms:
- Preservation – every element of the input corpus remains generable.
- Monotonicity – enlarging the corpus cannot shrink the set of generable outputs.
- Idempotence – applying the operator to its own output yields no new creations.
These properties capture the intuition that a model reproduces all its training data, expands its creative reach as more data are added, and reaches a fixed point after a single generation pass. The paper illustrates the abstraction with concrete examples: a convex‑hull generator (linear combinations of existing works), a splice generator (coordinate‑wise borrowing), and a box generator that composes the two.
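The splice generator described above can be sketched in a few lines, with assertions checking the three closure-operator axioms on a toy corpus. The function name and the two-dimensional toy works are our own illustration, not the paper's notation:

```python
from itertools import product

def splice(corpus):
    """Splice generator: coordinate-wise borrowing.  From a corpus of
    d-dimensional tuples, generate every tuple whose i-th coordinate is
    the i-th coordinate of some existing work."""
    if not corpus:
        return set()
    d = len(next(iter(corpus)))
    columns = [{w[i] for w in corpus} for i in range(d)]
    return set(product(*columns))

C = {(0, 0), (1, 2)}
G = splice(C)

assert C <= G                      # preservation: every input remains generable
assert G <= splice(C | {(3, 0)})   # monotonicity: a larger corpus generates more
assert splice(G) == G              # idempotence: a second pass adds nothing
```

Here $g(C)$ contains four works: the two originals plus the two cross-spliced tuples $(0,2)$ and $(1,0)$.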
Using this formalism, the authors define the violation set $V(C)$ – outputs that would be impossible if any single work were removed from the corpus – and the permissible set $P(C)=\mathcal{C}\setminus V(C)$. They prove several structural results:
- Proposition 1: $P(C)$ is monotone in the corpus and closed under the generator; combining permissible outputs never creates a violation.
- Corollary 1: A sufficient condition for non‑emptiness of $P(C)$ is given in terms of the Radon number from convex geometry.
- Proposition 3: Adding a new work that lies in the violation set strictly expands $P(C)$; adding a work already in $P(C)$ leaves the set unchanged.
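For a finite corpus, both sets can be computed by brute-force leave-one-out counterfactuals. This is a minimal sketch using the splice-style generator; we take permissible outputs within the generated set, i.e. $g(C)\setminus V(C)$, and gloss over whether original corpus works should themselves count as violating — conventions the paper may handle differently:

```python
from itertools import product

def splice(corpus):
    """Splice-style generator: coordinate-wise borrowing across works."""
    if not corpus:
        return set()
    d = len(next(iter(corpus)))
    return set(product(*({w[i] for w in corpus} for i in range(d))))

def violation_set(corpus, g):
    """V(C): outputs of g(C) that disappear if some single work is removed."""
    full = g(corpus)
    return set().union(*(full - g(corpus - {w}) for w in corpus))

def permissible_set(corpus, g):
    """Permissible outputs: generable, yet dependent on no single work."""
    return g(corpus) - violation_set(corpus, g)

C = {(0, 0), (1, 2), (1, 0)}
print(sorted(permissible_set(C, splice)))  # → [(1, 0)]
```

The output $(1,0)$ survives every leave-one-out pass because each of its coordinates appears in two different works, while $(0,2)$ and $(1,2)$ each hinge on the single work contributing the coordinate $2$.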
The core contribution lies in the asymptotic analysis of $P(C_n)$ as the corpus $C_n$ grows by a stochastic process of new creations. The authors distinguish two regimes for the distribution of creative output:
- Light‑tailed regime (e.g., exponential or sub‑exponential tails). Extreme, highly original works are exponentially rare. Theorem 1 shows that, almost surely, the ratio $|P(C_n)|/|g(C_n)|$ converges to 1 as $n\to\infty$. Intuitively, the corpus becomes so rich that any output can be generated via many alternative paths, making dependence on any single work negligible. Consequently, regulation that bans outputs infringing on a specific work imposes no practical limit on AI generation.
- Heavy‑tailed regime (e.g., Pareto‑type tails). A small number of “blockbuster” works generate a disproportionate share of creative value. In this case, even as the corpus expands, a positive measure of outputs remains in the violation set indefinitely. Regulation therefore retains a persistent constraining effect.
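The dichotomy can be illustrated with a deliberately crude one-dimensional stand-in: take $g(C)$ to be the interval $[\min C, \max C]$ (the 1-D convex hull), so the violation set is exactly the part of the hull that vanishes when a single extreme work is removed. The generator, distributions, sample size, and seed below are our own choices for illustration, not the paper's construction:

```python
import random

def violating_fraction(sample):
    """For the 1-D hull generator g(C) = [min C, max C], removing an
    interior work changes nothing; removing the max (resp. min) deletes the
    gap down to the second-largest (resp. up to the second-smallest) work.
    Return that deleted share of the hull's length:
        (x_(n) - x_(n-1) + x_(2) - x_(1)) / (x_(n) - x_(1))."""
    s = sorted(sample)
    return (s[-1] - s[-2] + s[1] - s[0]) / (s[-1] - s[0])

random.seed(0)
n = 100_000
light = [random.expovariate(1.0) for _ in range(n)]    # exponential tail
heavy = [random.paretovariate(1.0) for _ in range(n)]  # Pareto tail

print(f"light-tailed violating share: {violating_fraction(light):.4f}")
print(f"heavy-tailed violating share: {violating_fraction(heavy):.4f}")
```

With exponential samples the top gap stays roughly constant while the hull grows like $\log n$, so the violating share shrinks toward 0; with Pareto samples the top gap grows in proportion to the maximum itself, so the share typically stays bounded away from 0.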
These results have direct legal implications. The traditional “substantial similarity” test focuses on the output’s resemblance to protected expression, ignoring the data‑dependency of the generation process. The proposed counterfactual test treats the presence of a copyrighted work in the training data as the decisive factor. Under a light‑tailed creative environment, policymakers could safely adopt a permissive stance, knowing that the risk of infringing dependence fades with market saturation. Under heavy‑tailed conditions, stricter controls—such as mandatory licensing of high‑impact works, data‑use disclosures, or output‑level filters—remain justified.
The paper also situates its contribution within the economics of intellectual property. Prior works (Gans 2024; Yang & Zhang 2025) examine optimal copyright policy for AI from bargaining or dynamic perspectives. The present study supplies a foundational definition of “derivative” that can be embedded in those models, enabling analysis of licensing costs, fair‑use thresholds, and welfare effects when the permissible set is explicitly characterized.
In conclusion, the authors provide a novel, mathematically rigorous framework for assessing AI‑generated copyright infringement. By modeling generative systems as closure operators and linking the asymptotic behavior of permissible outputs to the statistical tail of creative production, they reveal a sharp dichotomy: regulation is either asymptotically irrelevant or persistently binding. This insight offers a clear guide for legislators, courts, and scholars grappling with the rapidly evolving landscape of AI‑driven creativity.