If It's Nice, Do It Twice: We Should Try Iterative Corpus Curation
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recent work demonstrates that filtering harmful content from pretraining data improves model safety without degrading capabilities. We propose a natural extension: do it again. A model trained on filtered data can filter the corpus further; training on this cleaner corpus produces an even cleaner model. We provide theoretical analysis showing this process converges to a self-consistent corpus where the model trained on it approves of its own training data. Even under the weak assumption of constant filter quality, iteration yields exponential decay in harmful content. We argue this framework offers a novel form of scalable oversight: while model internals are opaque, the resulting corpus is human-auditable. Even a single iteration produces large-scale preference annotations over documents, potentially valuable for interpretability research. We derive bounds on capability-safety tradeoffs and outline open questions. We call on researchers with pretraining infrastructure to empirically test this approach.


💡 Research Summary

The paper proposes “Iterative Corpus Curation,” a simple yet powerful extension of existing single‑pass data filtering methods for pre‑training large language models. Starting from an initial corpus D, a constitution ϕ that defines acceptable content, and a threshold τ, the authors repeatedly train a model Mₙ on the current corpus Cₙ, score each document d with SCORE(Mₙ, d, ϕ), and retain only those with scores below τ to form Cₙ₊₁. This loop is formalized as Algorithm 1.
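The loop of Algorithm 1 can be sketched in a few lines. In this minimal Python version, `train` and `score` are hypothetical stand-ins for a pretraining run and the paper's SCORE(Mₙ, d, ϕ) judge; they are not defined in the paper, only assumed here to show the control flow:

```python
def iterative_curation(corpus, phi, tau, train, score, max_iters=10):
    """Repeatedly train on the corpus and keep documents scoring below tau.

    `train(corpus)` returns a model M_n; `score(model, d, phi)` returns a
    harmfulness score for document d under constitution phi. Both are
    hypothetical callables supplied by the user.
    """
    current = list(corpus)
    for _ in range(max_iters):
        model = train(current)                        # train M_n on C_n
        kept = [d for d in current if score(model, d, phi) < tau]
        if len(kept) == len(current):                 # nothing removed:
            return current                            # fixed point C* reached
        current = kept                                # C_{n+1} ⊆ C_n
    return current
```

Because each pass only removes documents, the loop must terminate at a fixed point; the `max_iters` cap is just a practical safeguard.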

The theoretical contribution consists of four main results. First, a convergence theorem shows that because each iteration only removes documents (Cₙ₊₁ ⊆ Cₙ), the sequence is monotone decreasing in a finite set, guaranteeing a fixed point C* in at most |D| steps, regardless of filter quality. Second, the authors define a “self‑consistent corpus” as one where a model trained on it approves all its own training documents; the fixed point is precisely the largest such corpus reachable from D. Third, under the modest assumption that each iteration removes a constant fraction p of the remaining harmful content, the proportion of harmful material decays exponentially as (1‑p)ⁿ, demonstrating that even a static‑quality filter yields substantial safety gains through repetition. Fourth, they quantify the capability‑safety trade‑off by introducing sets H (harmful), U (useful), and B = H ∩ U (dual‑use). They prove K(C) ≥ 1 ‑ S(C)·|B|/|U|, where S(C) is the fraction of harmful content removed and K(C) the fraction of useful content retained. When harmful and useful content are largely disjoint (small |B|), high safety can be achieved with minimal capability loss.
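Two of these results can be checked numerically on toy inputs. The sets and parameters below are hypothetical illustrations, not data from the paper:

```python
# (1) Exponential decay: a filter removing a constant fraction p of the
# remaining harmful documents each pass leaves h0 * (1 - p)**n after n passes.
def harmful_remaining(h0, p, n):
    h = h0
    for _ in range(n):
        h -= p * h          # each pass removes fraction p of what remains
    return h

assert abs(harmful_remaining(1000, 0.3, 5) - 1000 * (1 - 0.3) ** 5) < 1e-9

# (2) Capability-safety bound K(C) >= 1 - S(C) * |B| / |U| on toy sets.
H = set(range(0, 30))        # harmful documents
U = set(range(20, 120))      # useful documents (|U| = 100)
B = H & U                    # dual-use overlap (|B| = 10)
kept = U - H                 # an aggressive filter removing all of H
S = 1.0                      # fraction of harmful content removed
K = len(kept & U) / len(U)   # fraction of useful content retained
assert K >= 1 - S * len(B) / len(U)
```

The second check shows the bound is tight in the worst case: removing all harmful content sacrifices exactly the dual-use fraction |B|/|U| of useful material, here 10%.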

Beyond binary filtering, the paper extends the framework to a preference‑based re‑weighting scheme. Instead of a hard keep/remove decision, a model compares pairs of documents and assigns a win rate w(Mₙ, d). The sampling distribution for the next iteration is updated as pₙ₊₁(d) ∝ pₙ(d)·w(Mₙ, d). The resulting fixed point is a “preference equilibrium” where all retained documents have equal win rates. This softer approach can preserve dual‑use material at low probability, potentially offering a better Pareto frontier of safety versus capability.
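One step of this update is straightforward to express; in this sketch, `win_rates` stands in for the model-derived w(Mₙ, d), which in practice would come from pairwise document comparisons:

```python
def reweight(probs, win_rates):
    """One soft-curation step: p_{n+1}(d) ∝ p_n(d) * w(M_n, d).

    `probs` maps each document to its current sampling probability;
    `win_rates` maps each document to its (hypothetical) pairwise win rate.
    """
    unnorm = {d: p * win_rates[d] for d, p in probs.items()}
    total = sum(unnorm.values())
    return {d: u / total for d, u in unnorm.items()}
```

Iterating `reweight` shifts probability mass toward high-win-rate documents without hard-deleting the rest; at the preference equilibrium, all documents with nonzero probability have equal win rates, so the update leaves the distribution unchanged.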

A key practical insight is that supervision can be shifted from opaque model internals to the corpus itself. Human auditors can sample documents from the final corpus C* and verify compliance with ϕ, leveraging well‑established text‑audit methodologies. This provides a scalable, statistically grounded oversight mechanism that is far easier to validate than internal representations or reward models. Additionally, the per‑iteration scores constitute a rich, interpretable dataset: they reveal which documents each model believes should be learned from, allowing researchers to trace how judgments evolve across iterations and to study constitution drift.
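As a concrete illustration of such a statistical audit (the specific bound below is a standard "rule of three" style argument, not a method given in the paper): if zero violations of ϕ appear among n documents sampled uniformly from C*, an approximate 95% upper confidence bound on the true violation rate is about 3/n.

```python
import math

def audit_upper_bound(n_samples, confidence=0.95):
    """Upper confidence bound on the violation rate when zero violations
    are found in n uniform samples from the corpus.

    From P(0 violations | rate q) = (1 - q)**n ≈ exp(-q * n), solving
    exp(-q * n) = 1 - confidence gives q ≈ -ln(1 - confidence) / n,
    which is roughly 3 / n at 95% confidence.
    """
    return -math.log(1 - confidence) / n_samples
```

For example, auditing 1,000 documents and finding no violations bounds the violation rate below roughly 0.3% at 95% confidence, giving the kind of statistically grounded, human-checkable guarantee the authors have in mind.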

The authors acknowledge several limitations. The analysis assumes filter quality does not deteriorate; in reality, models might need exposure to some harmful examples to recognize them, potentially stalling iteration. Compositional risks—dangerous capabilities emerging from combinations of individually benign documents—are not captured by per‑document scoring. The quality of the fixed point depends entirely on the specification of ϕ; a vague constitution yields a self‑consistent but misaligned corpus. Finally, empirical validation requires substantial pre‑training resources, which the authors do not possess.

In conclusion, the paper argues that if a single pass of data curation improves safety, iterating the process should yield exponential reductions in harmful content while preserving capability, provided the constitution is well designed. The work offers a clear theoretical foundation, proposes a preference‑based extension linking to RLHF theory, and highlights a novel, human‑readable avenue for scalable oversight. The authors call on the community to implement the iterative pipeline on large‑scale models and empirically test the predicted safety‑capability dynamics.

