A Human-Centric Framework for Data Attribution in Large Language Models

In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attributing LLM-generated text to the LLM's input data could help with these challenges, but so far we have more questions than answers: which elements of LLM outputs require attribution, what goals should attribution serve, and how should it be implemented? We contribute a human-centric data attribution framework that situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These parameters are open to negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can then be implemented and tested to determine whether the stakeholder goals are achieved. The proposed approach bridges methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives, toward a sustainable equilibrium in the data economy.


💡 Research Summary

The paper addresses the growing concern that creators have little control over the use of their data in large language models (LLMs) and that users may inadvertently plagiarize existing works. It proposes a human‑centric data attribution framework that situates the attribution problem within the broader data economy. First, the authors map the ecosystem into five stakeholder groups—creators, publishers, platforms, AI industry, and readers/users—and catalogue their extrinsic (financial and social) and intrinsic motivations. They argue that a one‑size‑fits‑all attribution solution is impossible because incentives and power asymmetries differ across groups.

The framework introduces a set of negotiable parameters for each use case (e.g., creative writing assistance, fact‑checking). These parameters include the granularity of attribution (sentence, phrase, idea), the scope of disclosure (public, private, partial), compensation mechanisms (direct payment, credit, reputation), and legal/ethical standards. Stakeholders negotiate these parameters to produce domain‑specific “attribution contracts,” which are then translated into technical specifications such as retrieval‑augmented generation, Shapley‑value‑based data valuation, influence‑function tracing, or differential‑privacy‑based credit.
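
To make these parameters concrete, here is a minimal sketch of how a negotiated attribution contract might be encoded as a machine-readable configuration. All names here (AttributionContract, Granularity, and so on) are hypothetical illustrations for this summary, not an API or schema proposed in the paper.

```python
# Hypothetical encoding of a negotiated "attribution contract".
# All class and field names are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    SENTENCE = "sentence"
    PHRASE = "phrase"
    IDEA = "idea"


class Disclosure(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    PARTIAL = "partial"


class Compensation(Enum):
    DIRECT_PAYMENT = "direct_payment"
    CREDIT = "credit"
    REPUTATION = "reputation"


@dataclass
class AttributionContract:
    """Parameters agreed by the stakeholders for one use case."""
    use_case: str                      # e.g. "creative_writing", "fact_checking"
    granularity: Granularity           # unit of attribution
    disclosure: Disclosure             # scope of disclosure
    compensation: Compensation         # how creators are compensated
    standards: list = field(default_factory=list)  # legal/ethical references


# A fact-checking deployment might demand fine granularity and public disclosure:
fact_checking = AttributionContract(
    use_case="fact_checking",
    granularity=Granularity.SENTENCE,
    disclosure=Disclosure.PUBLIC,
    compensation=Compensation.CREDIT,
    standards=["training-data provenance transparency"],
)
```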

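Among the technical specifications mentioned above, Shapley-value-based data valuation can be illustrated with a toy exact computation over a tiny dataset. The utility function and data below are assumptions chosen for readability; LLM-scale valuation would rely on sampling-based approximations rather than the exponential exact sum shown here.

```python
# Toy exact Shapley-value data valuation. The utility function and the
# example data are illustrative assumptions, not components from the paper.
from itertools import combinations
from math import factorial


def shapley_values(examples, utility):
    """Exact Shapley value of each example under the given utility:
    phi_i = sum over subsets S not containing i of
            |S|! * (n - |S| - 1)! / n! * (utility(S + {i}) - utility(S))
    """
    n = len(examples)
    values = {}
    for i in examples:
        rest = [e for e in examples if e != i]
        phi = 0.0
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (utility(set(subset) | {i}) - utility(set(subset)))
        values[i] = phi
    return values


# Toy utility: a subset is worth the number of distinct topics it covers,
# so duplicated topics add diminishing value.
def utility(subset):
    return len({e.split(":")[0] for e in subset})


examples = ["physics:doc1", "physics:doc2", "law:doc3"]
print(shapley_values(examples, utility))
# The two physics docs split credit (0.5 each); the law doc gets full credit (1.0).
```
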
The authors review existing technical approaches—post‑hoc tracing, dataset‑level weighting, modular architectures, and RAG systems—and position them as building blocks that can be selected according to the negotiated contract. They also outline a “moonshot” vision where an automated pipeline would generate citations and remuneration for each piece of content used by an LLM output.
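
As a rough illustration of the RAG-style building block, the sketch below attaches candidate source citations to an output sentence using a toy bag-of-words retriever. A deployed system would use a dense retriever and an actual LLM, and would operate at whatever granularity the negotiated contract specifies; the corpus, function names, and similarity measure here are assumptions for illustration only.

```python
# Minimal sketch of RAG-style attribution with a toy bag-of-words retriever.
# Everything here (corpus, names, similarity choice) is illustrative.
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def attribute(sentence, corpus, top_k=2):
    """Return the top-k corpus documents most similar to an output sentence.

    In a deployed system this would run once per unit of the negotiated
    granularity (sentence, phrase, or idea) and feed the citation layer.
    """
    query = Counter(sentence.lower().split())
    scored = [(cosine(query, Counter(text.lower().split())), doc_id)
              for doc_id, text in corpus.items()]
    return sorted(scored, reverse=True)[:top_k]


corpus = {
    "blog-2021": "large language models memorize and reproduce training text",
    "news-2023": "creators demand compensation for data used in AI training",
}
print(attribute("language models can reproduce training text verbatim", corpus))
```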

Finally, the paper argues that effective attribution can restore incentives for creators, improve trust for users, and support sustainable governance of the data economy. Policy recommendations include clearer legal definitions of data contribution, support for data markets, and mandates for transparency in training data provenance. By bridging methodological NLP research, governance policy, and economic analysis, the work offers a comprehensive roadmap for implementing data attribution in real‑world LLM deployments.

