Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.


💡 Research Summary

The paper introduces “Data Darwinism,” a ten‑level hierarchical taxonomy (L0‑L9) that frames data processing as an evolutionary process co‑driven by model capabilities. Lower levels (L0‑L3) focus on acquisition, format normalization, rule‑based filtering, and lightweight model filtering—essentially selecting and cleaning raw data. Intermediate levels (L4‑L6) employ increasingly powerful generative models to transform the data: L4 (Generative Refinement) removes noise, repairs fragmented structures (e.g., split equations, malformed tables), and standardizes presentation while preserving original semantics; L5 (Cognitive Completion) goes further by using frontier large language models (LLMs) to make implicit scientific reasoning explicit, inline‑explain domain‑specific terminology, and add pedagogical bridges such as analogies. The highest levels (L7‑L9) would synthesize entirely new content, but they are not implemented in this work.
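The level hierarchy above can be pictured as a staged pipeline in which each implemented level transforms the output of the one before it. The sketch below is purely illustrative: the level names follow the summary, but the `refine` and `complete` functions are hypothetical placeholders standing in for the paper's rule-based and LLM-driven stages, not the authors' actual implementation.

```python
from enum import IntEnum

# Hypothetical encoding of the Data Darwinism taxonomy described above.
class Level(IntEnum):
    L0_ACQUISITION = 0
    L1_NORMALIZATION = 1
    L2_RULE_FILTERING = 2
    L3_MODEL_FILTERING = 3
    L4_GENERATIVE_REFINEMENT = 4   # noise removal, structure repair
    L5_COGNITIVE_COMPLETION = 5    # explicit reasoning, inline term explanations
    # L6-L9 (content synthesis and beyond) are defined but not implemented.

def refine(text: str) -> str:
    """Placeholder for L4: repair split equations and malformed tables."""
    return text.strip()

def complete(text: str) -> str:
    """Placeholder for L5: a frontier LLM would explicate reasoning here."""
    return text + "\n[inline explanation of domain terms would be added]"

# Only the generative levels transform content; L0-L3 select and clean.
PIPELINE = {
    Level.L4_GENERATIVE_REFINEMENT: refine,
    Level.L5_COGNITIVE_COMPLETION: complete,
}

def process(document: str, max_level: Level) -> str:
    """Apply each implemented transformation up to the requested level."""
    for level, fn in sorted(PIPELINE.items()):
        if level <= max_level:
            document = fn(document)
    return document
```

Calling `process(doc, Level.L4_GENERATIVE_REFINEMENT)` would yield the cleaned-but-unaugmented corpus (the L0–L3/L4 condition), while `Level.L5_COGNITIVE_COMPLETION` yields the fully processed stream compared in the experiments.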

To validate the framework, the authors construct “Darwin‑Science,” a 900 billion‑token scientific corpus covering books and papers across natural sciences, engineering, and medicine. They process the corpus through L0‑L5, observing that raw scientific text (L0‑L3) exhibits a severe “learnability gap”: despite high information density, models pretrained on unprocessed scientific material gain little or no improvement on standard benchmarks. The gap is closed once L4 and especially L5 transformations are applied, indicating that the data must be made cognitively accessible for language models to benefit.

A crucial methodological contribution is the creation of contamination‑free baseline models. The authors pre‑train two base models, daVinci‑origin‑3B and daVinci‑origin‑7B, from scratch on a 5.37 trillion‑token corpus that deliberately excludes any scientific content. These models serve as clean‑room checkpoints with strong general language abilities but zero exposure to the scientific domain, allowing unambiguous attribution of performance gains to data processing rather than to inadvertent data leakage.

The experimental protocol involves continued pre‑training (CPT) of the base models for an additional 600 billion tokens, comparing three data streams: (1) a conventional mixed‑domain corpus, (2) Darwin‑Science processed only through L0‑L3, and (3) Darwin‑Science processed through the full L0‑L5 pipeline. Evaluation is performed on (a) a suite of more than 20 standard NLP benchmarks and (b) a newly constructed domain‑aligned benchmark, Darwin‑Science‑Eval, consisting of 150 k expert‑level questions derived from held‑out scientific literature.
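The three-way comparison can be summarized as a small table of run configurations. This is an illustrative encoding of the protocol as described in the summary; the run names are invented here and the values merely restate the text, not an actual configuration file from the paper.

```python
# Hypothetical summary of the three continued-pre-training (CPT) conditions.
CPT_RUNS = [
    {"name": "mixed-domain-baseline", "corpus": "conventional mixed-domain",
     "levels": None, "tokens": 600e9},
    {"name": "darwin-science-raw", "corpus": "Darwin-Science",
     "levels": "L0-L3", "tokens": 600e9},
    {"name": "darwin-science-full", "corpus": "Darwin-Science",
     "levels": "L0-L5", "tokens": 600e9},
]

# Each run is evaluated on 20+ standard benchmarks plus Darwin-Science-Eval
# (150k expert-level questions from held-out scientific literature).
EVAL_SUITES = ["standard-nlp-20plus", "darwin-science-eval"]
```

Holding the token budget and base model fixed across runs is what lets the performance deltas be attributed to the processing level alone.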

Results show moderate overall gains for the full pipeline: +2.12 points for the 3‑billion‑parameter model and +2.95 points for the 7‑billion‑parameter model, averaged across the standard benchmarks. Gains are substantially larger on the domain‑aligned suite: +5.60 and +8.40 points respectively. Incremental analysis reveals that L0‑L3 contribute almost nothing, L4 adds +0.38 points, and L5 contributes the bulk of the improvement (+1.36 points total). Moreover, the larger model benefits disproportionately from the scientific data, suggesting that higher‑capacity models can better exploit the richer, cognitively refined content.

The authors also conduct extensive ablations on data composition (optimal scientific content ratio ≈ 50 %), teacher‑model quality (Qwen‑3‑235B outperforms GPT‑OSS‑120B for L5 processing), and context length (32 k tokens vs. 4 k tokens yields +0.80 points). These analyses yield practical guidelines for practitioners aiming to build domain‑specialized foundation models.
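The mixture guideline above can be made concrete with a small budgeting helper: roughly half the continued-pre-training token budget goes to the scientific corpus, the rest to general data. The function and corpus names below are hypothetical illustrations of that guideline, not the paper's exact recipe.

```python
# Hypothetical sketch of the ~50% scientific-content mixture guideline.
def build_mixture(total_tokens: int, science_ratio: float = 0.5) -> dict:
    """Split a CPT token budget between scientific and general data."""
    science = int(total_tokens * science_ratio)
    return {
        "darwin_science_l0_l5": science,
        "general_corpus": total_tokens - science,
    }

# e.g. the 600B-token continued-pre-training budget used in the paper
mix = build_mixture(600_000_000_000)
```

A practitioner could sweep `science_ratio` (and, per the context-length ablation, prefer 32k-token over 4k-token sequences) to reproduce the kind of composition analysis the authors report.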

Limitations are acknowledged: levels L6‑L9 remain unimplemented, the computational cost of L5 processing with frontier LLMs is high, and human verification of the automatically generated cognitive completions is limited. Nonetheless, the work demonstrates that systematic, model‑driven data refinement can unlock latent value in dense, expert‑oriented corpora.

In sum, the paper makes three major contributions: (1) a conceptual framework (Data Darwinism) that organizes data engineering tasks into a clear evolutionary hierarchy, (2) the construction and public release of a 900 B‑token scientifically curated corpus (Darwin‑Science) together with clean‑room base models, and (3) rigorous empirical evidence that progressing through the hierarchy—especially the generative refinement and cognitive completion stages—significantly improves downstream performance, above all on domain‑specific evaluations. This establishes a principled roadmap for future co‑evolution of data and models in scientific AI and beyond.

