Data Science and Technology Towards AGI Part I: Tiered Data Management

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, and it is increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are employed throughout the data-management process, for tasks such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.


💡 Research Summary

The paper argues that the current trajectory of large language model (LLM) research—driven primarily by ever‑larger data volumes—is reaching a sustainability ceiling due to data scarcity, acquisition cost, and diminishing training efficiency. To move beyond this “data‑driven learning” paradigm, the authors propose a data‑model co‑evolution approach in which models actively shape data management while high‑quality data, in turn, amplifies model capabilities. Central to this vision is a tiered data management framework spanning five levels (L0–L4), each representing a distinct stage of data curation, quality, and structure.

L0 – Raw Data: Petabyte‑scale, uncurated web dumps containing massive redundancy and noise. Kept mainly for archiving and traceability, not for direct training.

L1 – Filtered Data: Produced by heuristic cleaning and deduplication, removing obvious noise (ads, malformed markup) and standardizing text formatting. Serves as the foundational pool for downstream selection and evaluation.
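The heuristic cleaning and deduplication step described for L1 can be sketched as a small pipeline. This is a minimal illustration, not the paper's actual tooling: the function names, thresholds, and exact-hash deduplication strategy are assumptions (production systems typically add near-duplicate detection such as MinHash).

```python
import hashlib
import re


def heuristic_filter(text: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Keep a document only if it passes simple noise heuristics."""
    if len(text) < min_chars:
        return False
    # Reject documents dominated by non-alphanumeric symbols (markup debris, ads).
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def dedup_key(text: str) -> str:
    """Exact-duplicate key: hash of whitespace-normalized, lowercased text."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()


def build_l1(raw_docs):
    """Filter L0 documents and drop exact duplicates to form an L1 pool."""
    seen, kept = set(), []
    for doc in raw_docs:
        if not heuristic_filter(doc):
            continue
        key = dedup_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

In practice the filter and deduplication stages are run distributed over petabyte-scale L0 dumps; the single-process loop above only conveys the logical order of operations.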

L2 – Selected Data: Obtained via model‑based scoring or domain‑specific classifiers, retaining samples with high information density (e.g., peer‑reviewed papers, technical repositories, refined encyclopedia articles). Intended for broad knowledge acquisition during pre‑training.
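Model-based selection for L2 can be illustrated with a toy scorer. Real pipelines use learned quality classifiers (e.g., fastText- or BERT-style models trained on labeled exemplars); the hand-written score below is a hypothetical stand-in chosen only so the sketch runs without a trained model.

```python
def quality_score(text: str) -> float:
    """Toy stand-in for a learned quality classifier: rewards a rich
    vocabulary and longer words, penalizes heavy repetition."""
    words = text.split()
    if not words:
        return 0.0
    vocab_richness = len(set(words)) / len(words)            # repetition penalty
    avg_word_len = sum(len(w) for w in words) / len(words)   # crude information density
    return vocab_richness * min(avg_word_len / 5.0, 1.0)


def select_l2(l1_docs, threshold: float = 0.5):
    """Promote L1 documents whose quality score clears the threshold into L2."""
    return [d for d in l1_docs if quality_score(d) >= threshold]
```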

L3 – Refined Data: Generated or edited using LLMs (rewriting, synthetic augmentation) and human verification to achieve textbook‑level clarity, logical coherence, and explicit educational intent. This tier is the core resource for mid‑training phases where specialized reasoning and domain adaptation are crucial.
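The LLM-driven rewriting that produces L3 data can be sketched as prompt construction plus an injected model client. The template wording and function names are assumptions for illustration; passing the LLM as a callable keeps the pipeline testable without a live model.

```python
REWRITE_TEMPLATE = """You are an expert technical editor.
Rewrite the passage below into clear, textbook-style prose:
- preserve every factual claim exactly,
- briefly define technical terms on first use,
- structure the explanation step by step.

Passage:
{passage}
"""


def build_rewrite_prompt(passage: str) -> str:
    """Construct the instruction given to the rewriting LLM."""
    return REWRITE_TEMPLATE.format(passage=passage.strip())


def refine_to_l3(l2_docs, llm_call):
    """Rewrite each L2 document into L3 candidates.

    `llm_call` is any callable mapping prompt -> rewritten text, so a
    real API client or a test stub can be plugged in interchangeably.
    Human verification of the outputs happens downstream."""
    return [llm_call(build_rewrite_prompt(doc)) for doc in l2_docs]
```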

L4 – Organized Data: Structured representations such as knowledge graphs or verified databases, with rigorous fact‑checking. Provides reliable factual grounding for retrieval‑augmented generation and precise inference.
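A minimal sketch of the L4 idea: facts stored as verified triples with provenance, queryable for retrieval-augmented generation. The class and its schema are illustrative assumptions; real L4 stores would be full knowledge graphs or databases with rigorous fact-checking workflows.

```python
class TripleStore:
    """Minimal verified-knowledge store: (subject, relation) -> (object, source)
    triples with a provenance tag that supports fact-checking."""

    def __init__(self):
        self._facts = {}

    def add(self, subject, relation, obj, source):
        self._facts[(subject, relation)] = (obj, source)

    def lookup(self, subject, relation):
        """Return (object, source) for a fact, or None if unknown."""
        return self._facts.get((subject, relation))


store = TripleStore()
store.add("water", "boiling_point_celsius", 100, source="encyclopedia:v1")
```

Carrying the provenance tag alongside each fact is what distinguishes this tier: a retrieval-augmented generator can cite the source rather than assert the fact unsupported.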

A distinctive contribution is the full integration of LLMs as data‑management tools. The authors employ LLMs for quality scoring, content editing, and synthetic data creation, thereby reducing human labor while maintaining high precision. As models improve, they become more effective at refining data, establishing a positive feedback loop that embodies data‑model co‑evolution.

The paper surveys existing stage‑oriented and method‑oriented data‑management frameworks, highlighting their limitations in handling the full LLM lifecycle. It then positions the L0–L4 hierarchy as a unifying schema that aligns data curation with the three major training stages: pre‑training (broad knowledge), mid‑training (domain‑specific reasoning), and post‑training/alignment (instruction fine‑tuning and RLHF).
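The alignment between tiers and training stages might be captured as a simple configuration. The exact allocation below is an illustrative assumption, not the paper's prescribed recipe.

```python
# Hypothetical tier-to-stage allocation; the specific mix is an assumption.
STAGE_TIERS = {
    "pre_training": ["L1", "L2"],   # broad, diverse knowledge
    "mid_training": ["L2", "L3"],   # domain-specific reasoning
    "alignment":    ["L3", "L4"],   # verified, instruction-quality data
}


def tiers_for(stage: str):
    """Return the data tiers allocated to a given training stage."""
    return STAGE_TIERS[stage]
```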

Empirical validation spans four domains—English web, Chinese web, mathematics, and code. For each domain, the authors construct tiered datasets using the proposed pipelines and evaluate them across the training lifecycle. Key findings include:

  1. Training efficiency gains: Introducing high-tier data (L3/L4) in later training phases accelerates loss reduction by more than 15% under identical compute budgets.

  2. Performance improvements: In the mathematics domain, models trained with Math-L3 achieve a 3.2% absolute accuracy increase over baselines; across general language benchmarks, a 1.8% boost is observed.

  3. Cross-domain benefits: High-quality math data enhances logical reasoning in non-math tasks, yielding a 1.4% average gain on unrelated benchmarks.

  4. Mitigation of late-stage saturation: A staged curriculum, starting with large-scale, lower-quality L1 data for diversity and then progressively injecting higher-quality L3/L4 data, prevents performance plateaus caused by noisy samples during the final training epochs.
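The staged curriculum in finding 4 can be expressed as a sampling schedule over tiers. The linear phase-in below is a hypothetical schedule for illustration; the paper's actual mixing ratios are not reproduced here.

```python
def tier_mixture(progress: float) -> dict:
    """Sampling weights over data tiers as training progresses (0.0 -> 1.0).

    Early training leans on large, diverse L1 data; high-quality L3/L4
    data is phased in later to avoid late-stage saturation. The linear
    schedule and coefficients are illustrative assumptions."""
    p = min(max(progress, 0.0), 1.0)
    weights = {
        "L1": 0.7 * (1 - p),  # diversity early, phased out
        "L2": 0.3,            # steady selected-data backbone
        "L3": 0.5 * p,        # refined data phased in
        "L4": 0.2 * p,        # verified knowledge phased in
    }
    total = sum(weights.values())
    return {tier: w / total for tier, w in weights.items()}
```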

The authors also release a suite of open‑source resources: UltraData‑Math‑L1/L2/L3, Ultra‑FineWeb‑en/zh‑L2/L3, parsers, synthetic problem generators, and domain classifiers. These assets total several terabytes and are made available via Hugging Face and a dedicated website, encouraging community adoption and further research.

In conclusion, the paper presents a systematic, cost‑aware, and scalable approach to data management that directly addresses the bottlenecks of current LLM scaling laws. By formalizing a tiered hierarchy and leveraging LLMs as active participants in data curation, it offers a practical pathway toward sustainable AGI development, where data quality is strategically allocated across the training lifecycle to maximize marginal utility while controlling expenses.

