Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by the limited supply of verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
💡 Research Summary
The paper tackles a critical bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs): the scarcity of verifiable training data. Existing RLVR pipelines rely on problems that can be automatically checked—math equations, code with unit tests, or handcrafted logical puzzles. As models grow larger and training proceeds, the pool of effective examples quickly saturates; after a certain point, additional RL steps bring negligible or even negative gains because most examples become “stale” (the model either always succeeds or always fails, providing no learning signal).
To break this ceiling, the authors introduce Golden Goose, a surprisingly simple yet powerful data synthesis technique. The core idea is to turn any reasoning‑rich, but unverifiable, text into a multiple‑choice question (MCQ) version of the fill‑in‑the‑middle task that can be verified with a binary reward: does the model’s selected answer match the masked ground‑truth span? The pipeline works as follows:
- Span Identification – Given a source document S, an LLM is prompted to locate a contiguous segment t that contains crucial reasoning steps (e.g., a derivation, a code snippet, or a chemical reaction chain).
- Mask Insertion – The segment t is replaced by a special mask placeholder, turning the document into a fill‑in‑the‑middle prompt.
- Distractor Generation – The LLM then produces a set of diverse, plausible alternatives to t; these distractors, together with the ground‑truth span, form the answer options. A binary reward checks whether the model selects the option matching the masked original.
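The assembly and verification stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the span and distractors are assumed to have already been produced by an LLM and are passed in directly, and the `<MASK>` placeholder is a stand-in since the paper's actual mask token is not specified here.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder; the paper's actual mask token is not given

def build_mcq(source: str, span: str, distractors: list[str], seed: int = 0) -> dict:
    """Assemble a verifiable MCQ fill-in-the-middle task from a source text,
    a masked ground-truth span, and LLM-generated distractors."""
    assert span in source, "the masked span must occur in the source text"
    prompt = source.replace(span, MASK, 1)          # mask insertion
    options = distractors + [span]                  # ground truth among distractors
    random.Random(seed).shuffle(options)            # deterministic shuffle for reproducibility
    labels = [chr(ord("A") + i) for i in range(len(options))]
    answer = labels[options.index(span)]            # letter of the ground-truth option
    return {"prompt": prompt, "options": dict(zip(labels, options)), "answer": answer}

def reward(task: dict, model_choice: str) -> int:
    """Binary verifiable reward: 1 iff the selected letter is the ground-truth span."""
    return int(model_choice == task["answer"])
```

For example, masking the span `2x` in a short calculus sentence and supplying three distractors yields a prompt containing `<MASK>`, a labeled option set, and an answer key against which any model response can be scored with the binary reward.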