How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI


Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.


💡 Research Summary

The paper introduces the concept of “hyper‑datafication” to describe the accelerating industrial‑scale production, synthesis, and aggregation of data that is explicitly created for training frontier AI models. While prior sustainability research has largely focused on the energy and carbon costs of model training and deployment, this work shifts the lens to the data pipeline itself, arguing that data is a hidden but substantial driver of AI’s overall environmental, social, and economic impacts.

Using a large‑scale empirical approach, the authors harvested metadata from roughly 550,000 datasets hosted on the Hugging Face Hub (out of 570,000 total at the time of collection). They charted the rapid growth in both the number of new datasets and the total storage volume from 2018 onward, noting a particular surge in multimodal collections. By combining dataset size, an average power‑usage effectiveness (PUE ≈ 1.5), and region‑specific electricity carbon intensities, they estimate that storing all publicly available Hugging Face datasets in 2025 consumes about 1.2 TWh of electricity per year, emitting roughly 650,000 tCO₂‑eq. This storage‑related footprint alone accounts for roughly 30% of the carbon emissions typically reported for training large language models, highlighting that data storage and transfer are non‑trivial contributors to AI's climate impact.
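The summary names enough quantities to reconstruct the shape of this estimate, even though the paper's exact calculation is not restated here. The Python sketch below combines a storage volume, a per‑terabyte power draw, the PUE of 1.5, and a grid carbon intensity; the storage volume, power draw, and carbon intensity are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope storage footprint, in the spirit of the paper's estimate.
# STORAGE_TB, WATTS_PER_TB, and CARBON_INTENSITY are illustrative assumptions;
# only the PUE of 1.5 is taken from the summary above.

HOURS_PER_YEAR = 8_760
PUE = 1.5                  # data-centre power-usage effectiveness (from the paper)
WATTS_PER_TB = 10.0        # assumed average power draw per TB stored (hypothetical)
CARBON_INTENSITY = 0.475   # kg CO2-eq per kWh, illustrative global grid average


def storage_footprint(storage_tb: float) -> tuple[float, float]:
    """Return (TWh per year, tCO2-eq per year) for a given storage volume."""
    energy_kwh = storage_tb * WATTS_PER_TB / 1_000 * HOURS_PER_YEAR * PUE
    emissions_t = energy_kwh * CARBON_INTENSITY / 1_000  # kg -> tonnes
    return energy_kwh / 1e9, emissions_t


# Hypothetical volume of 9 exabytes; not the paper's measured Hub total.
energy_twh, emissions_t = storage_footprint(9_000_000)
print(f"~{energy_twh:.1f} TWh/year, ~{emissions_t:,.0f} tCO2-eq/year")
```

Under these assumptions a volume on the exabyte scale lands in the same order of magnitude as the reported figures, which is the point of the exercise: the footprint is dominated by the product of stored volume, per‑unit power draw, and grid carbon intensity.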

To capture social costs, the authors conducted a structured questionnaire and follow‑up interviews with 112 data workers in Kenya, many of whom are contracted by large tech firms or third‑party annotation agencies. The findings reveal precarious employment (contract or freelance work), wages at or below the local minimum wage, and intense productivity monitoring (KPIs, real‑time dashboards). Workers who label graphic or violent content report high stress, secondary trauma, and symptoms akin to PTSD. The study thus documents a labour regime that mirrors broader concerns about gig‑economy exploitation, but with the added dimension of exposure to potentially harmful AI‑training material.

The linguistic analysis treats language representation as a proxy for societal inclusion. By comparing the share of each language in the dataset corpus to global speaker populations and web‑presence metrics, the authors find that English, Mandarin, and Spanish dominate (≈78% of total data volume), while African, South Asian, and many indigenous languages together account for less than 2% of the corpus. This concentration reproduces and amplifies existing cultural biases in AI systems, limiting their fairness and global applicability.
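The comparison behind these figures can be expressed in a few lines. The sketch below computes a data‑volume‑to‑speaker‑population ratio per language; all shares are illustrative placeholders (chosen so the three dominant languages sum to the reported ≈78%), not the paper's measured values.

```python
# Data-volume share vs. speaker-population share per language.
# All numbers below are illustrative placeholders, not figures from the paper.

data_share = {"English": 0.55, "Mandarin": 0.15, "Spanish": 0.08, "Swahili": 0.001}
speaker_share = {"English": 0.18, "Mandarin": 0.14, "Spanish": 0.07, "Swahili": 0.025}

for lang, d in data_share.items():
    ratio = d / speaker_share[lang]  # >1: over-represented, <1: under-represented
    label = "over-represented" if ratio > 1 else "under-represented"
    print(f"{lang:<8} data/speaker ratio = {ratio:5.2f}  ({label})")
```

A ratio far above 1 for a handful of languages, paired with ratios near zero for most others, is precisely the concentration pattern the authors describe.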

Economic considerations are woven throughout the paper. The authors cite prior work showing that data value accrues disproportionately to aggregators and model developers, while data generators—often individuals or small enterprises in the Global South—receive minimal compensation. The “data monopoly” created by a handful of dominant platforms raises barriers to entry for smaller actors and entrenches structural inequities.

Synthesizing these quantitative and qualitative strands, the authors propose a set of actionable guidelines called Data PROOFS:

  1. Provenance – Attach carbon‑footprint metadata and ethical provenance tags to datasets at creation (a sketch of such a record follows this list).
  2. Resource awareness – Optimize storage efficiency, de‑duplicate data, and disclose energy consumption metrics.
  3. Ownership – Clarify and enforce data‑ownership rights, ensuring fair remuneration for data producers.
  4. Openness – Promote open‑source and public‑domain datasets to reduce reliance on proprietary data silos.
  5. Frugality – Encourage “data‑lite” training regimes, assess the necessity of synthetic data, and limit unnecessary data expansion.
  6. Standards – Work with standards bodies (e.g., ISO, IEEE) to codify ethical, safety, and sustainability requirements for data collection, annotation, and synthetic generation.
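To make these recommendations concrete, the sketch below models a dataset card that carries provenance, resource, ownership, and labour fields side by side. The schema and field names are hypothetical, not an existing Hugging Face or standards‑body format.

```python
# Hypothetical dataset card combining several Data PROOFS fields.
# The schema is illustrative; it is not an existing standard.

from dataclasses import dataclass


@dataclass
class DataProofsCard:
    name: str
    sources: list[str]          # Provenance: upstream origins of the raw data
    licence: str                # Ownership / Openness: usage and remuneration terms
    deduplicated: bool          # Frugality: redundant records removed before release
    storage_gb: float           # Resource awareness: on-disk size
    est_kwh_per_year: float     # Resource awareness: disclosed storage energy
    est_kgco2e_per_year: float  # Provenance: carbon-footprint tag
    annotation_conditions: str = "undisclosed"  # labour conditions of annotators


card = DataProofsCard(
    name="example-multilingual-corpus",  # hypothetical dataset
    sources=["web-crawl-2024", "public-domain-books"],
    licence="CC-BY-4.0",
    deduplicated=True,
    storage_gb=500.0,
    est_kwh_per_year=66.0,        # illustrative; e.g. from an estimate like the one above
    est_kgco2e_per_year=31.0,     # illustrative
    annotation_conditions="paid annotation via third-party agency, content warnings applied",
)
print(card)
```

Attaching these fields at creation time is what would make the data‑centric accounting the paper calls for feasible; retrofitting them onto hundreds of thousands of existing datasets is far harder.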

The paper concludes that addressing hyper‑datafication is essential for any comprehensive sustainability agenda for AI. Policymakers, industry leaders, and the research community must adopt data‑centric accounting, protect the rights and well‑being of data workers, and actively diversify linguistic and cultural representation in training corpora. Only by making the hidden costs of data visible and manageable can the AI ecosystem evolve toward a more equitable and environmentally responsible future.

