X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Notice: This research summary and analysis were automatically generated with AI. For full accuracy, please refer to the original arXiv source.

Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they heavily rely on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: Can we elevate models to expert-level reasoning performance using fully synthetic data? In response, we first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows significant performance gains on the challenging LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Extensive analysis distills valuable insights into synthetic data scaling, the necessity of domain-adapted feature evolution, and code-centric reinforcement.


💡 Research Summary

The paper tackles the data-scarcity problem that hampers the scaling of large language models (LLMs) for competitive programming, a domain that demands deep algorithmic reasoning and complex problem solving. Existing models rely heavily on a limited pool of real‑world tasks from platforms such as Codeforces, which raises concerns about over‑fitting, contamination, and the inability to benefit from further data scaling. The authors ask whether a model can reach expert‑level performance using only fully synthetic data: tasks, reference solutions, and test cases, all generated automatically.

Key observations and contributions

  1. Off‑the‑shelf synthesis fails – General‑purpose code synthesis pipelines (e.g., EpiCoder, WizardCoder) produce tasks that are trivial, ill‑specified, or accompanied by low‑quality tests. This leads to poor downstream performance when the data are used for supervised fine‑tuning (SFT) or reinforcement learning (RL).

  2. Systematic investigation of quality factors – The authors identify three dimensions that must be satisfied for competitive‑programming data: (i) tasks must be solvable yet challenging, (ii) solutions must be logically correct, and (iii) test suites must be reliable enough to provide a clean reward signal for RL.

  3. Domain‑specific feature‑based synthesis – Starting from 10 k code snippets in the TA‑CO dataset, they extract a rich set of algorithmic and data‑structure features (sorting, number theory, tree traversals, etc.) using GPT‑4o. These features are evolved through selection, crossover, and mutation to form a domain‑specific feature tree. The tree guides the generation of novel problem statements.
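The evolution loop in step 3 can be sketched with standard genetic operators over feature sets. This is an illustrative, stdlib-only sketch, not the paper's implementation; all function names (`select`, `crossover`, `mutate`, `evolve`) and the fitness heuristic (favoring richer feature sets) are assumptions:

```python
import random

def select(population, k):
    """Pick k feature sets, biased here toward larger (richer) ones."""
    return sorted(population, key=len, reverse=True)[:k]

def crossover(a, b):
    """Combine two feature sets by sampling from their union."""
    union = list(a | b)
    size = max(len(a), len(b))
    return set(random.sample(union, min(size, len(union))))

def mutate(features, feature_pool, rate=0.2):
    """Occasionally swap in a feature from the global pool."""
    out = set(features)
    if random.random() < rate:
        out.add(random.choice(feature_pool))
    return out

def evolve(population, feature_pool, generations=5, k=4):
    """Iterate selection, crossover, and mutation to grow a feature tree."""
    for _ in range(generations):
        parents = select(population, k)
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)),
                   feature_pool)
            for _ in range(len(population))
        ]
        population = parents + children
    return population
```

Each surviving feature set would then seed the generation of one novel problem statement.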

  4. Multi‑style task generation – Three competitive‑programming styles are supported: Codeforces‑style (standard I/O with narrative), LeetCode‑style (function signature with starter code), and AtCoder‑style (concise specification). This diversity improves model robustness across different input formats.
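The three styles above differ mainly in the generation prompt. A hypothetical sketch of how style-conditioned prompts might be composed (the templates and the `build_task_prompt` helper are illustrative, not the paper's actual prompts):

```python
# Illustrative prompt skeletons for the three task styles; the paper's
# real generation prompts are not reproduced here.
STYLE_TEMPLATES = {
    "codeforces": (
        "Write a narrative problem statement with a story, an Input section, "
        "an Output section, and sample tests using standard I/O."
    ),
    "leetcode": (
        "Write a concise problem statement and provide a starter function "
        "signature as scaffolding for the solver."
    ),
    "atcoder": (
        "Write a terse, formal specification with explicit constraints "
        "and standard I/O examples."
    ),
}

def build_task_prompt(features, style):
    """Compose a generation prompt from evolved features and a task style."""
    return (
        f"Create a competitive programming task combining: "
        f"{', '.join(features)}. " + STYLE_TEMPLATES[style]
    )
```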

  5. Test case synthesis – Two complementary methods are employed:

    • Prompt‑based generation – LLMs are instructed to produce standard, edge, and stress test inputs based on the problem constraints.
    • Tool‑based generation – The CYaron library is called to automatically generate inputs and evaluate them, ensuring coverage of large input sizes and computational limits.
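The paper relies on CYaron for tool-based generation; the stdlib-only sketch below merely illustrates the three input categories (standard, edge, stress) for a hypothetical array-style task with constraints 1 ≤ n ≤ n_max and 1 ≤ a_i ≤ v_max. The function name and case mix are assumptions:

```python
import random

def gen_inputs(n_max, v_max, n_random=5):
    """Generate standard, edge, and stress inputs for an array task."""
    cases = []
    # Edge cases: smallest size with extreme values.
    cases.append([1, [1]])
    cases.append([1, [v_max]])
    # Stress cases: maximum size with extreme values.
    cases.append([n_max, [v_max] * n_max])
    cases.append([n_max, [1] * n_max])
    # Standard cases: random sizes and values within constraints.
    for _ in range(n_random):
        n = random.randint(1, n_max)
        cases.append([n, [random.randint(1, v_max) for _ in range(n)]])
    return cases
```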
  6. Dual‑verification strategy – Tests and solutions are cross‑checked in two stages:

    • Step 1 (Test verification) – For each generated input, all candidate solutions are executed; a majority‑vote yields a provisional ground‑truth output. Weighting is applied to prioritize boundary and stress cases. This process achieves 94.7 % labeling accuracy on a held‑out TA‑CO benchmark.
    • Step 2 (Solution verification) – Candidate solutions are scored on the weighted test suite (T_golden). The top‑scoring solution is then validated on a hold‑out validation set (T_val). The final “golden solution” must perform best on both sets, dramatically reducing the risk of over‑fitting to generated tests.
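The two verification steps can be sketched as majority voting over candidate outputs followed by weighted scoring. This is a minimal illustration of the mechanism, assuming precomputed per-test weights; function names are illustrative:

```python
from collections import Counter

def vote_outputs(candidate_outputs):
    """Step 1: majority-vote a provisional ground-truth output per test.
    candidate_outputs[i][j] = output of candidate solution i on input j."""
    n_tests = len(candidate_outputs[0])
    golden = []
    for j in range(n_tests):
        counts = Counter(sol[j] for sol in candidate_outputs)
        golden.append(counts.most_common(1)[0][0])
    return golden

def score_solution(outputs, golden, weights):
    """Step 2: weighted fraction of tests on which a solution agrees with
    the voted outputs; weights prioritize boundary and stress cases."""
    total = sum(weights)
    hit = sum(w for o, g, w in zip(outputs, golden, weights) if o == g)
    return hit / total
```

The top-scoring solution would then be re-validated on a held-out set (T_val) before being accepted as golden.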
  7. Filtering unsolvable tasks – GPT‑5 (a high‑capacity reasoning model) is used as a proxy solver; tasks on which it fails against the voted test cases (≈36.9 % of generated tasks) are discarded, ensuring that the remaining dataset consists of well‑posed problems.

  8. Training pipeline – The verified (problem, golden solution) pairs constitute the SFT dataset; (problem, golden test suite) pairs form the RL reward function. SFT is performed with a learning rate of 5e‑5, batch size 128, for 8 epochs. RL uses the GRPO algorithm, rewarding the fraction of passed tests. Training cost for the 7 B‑parameter model is about 1.2 M GPU‑hours, which is cost‑effective compared with training larger real‑data models.

  9. Evaluation – Using LiveCodeBench v5 and v6, X‑Coder‑7B achieves avg@8 scores of 62.9 % (v5) and 55.8 % (v6), surpassing Qwen2.5‑Coder‑7B (≈48 %) and Mimo‑7B (≈57 %). Gains are especially pronounced on Medium and Hard difficulty splits (8‑10 % absolute improvement). Performance is consistent across Python and C++ submissions and across the three task styles.
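The avg@8 metric reported above is conventionally the mean pass rate over 8 sampled solutions per problem, averaged across problems; a minimal sketch of that computation (the benchmark's exact aggregation may differ):

```python
def avg_at_k(results):
    """avg@k: results[p] is a list of k booleans, one per sampled
    solution for problem p; return the mean per-problem pass rate."""
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)
```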

  10. Insights

    • Synthetic data quality matters more than sheer quantity; domain‑specific feature evolution is essential.
    • Dual verification dramatically lowers solution and test noise, which is crucial for stable RL.
    • High‑quality synthetic data can replace real‑world competitive‑programming corpora, enabling scaling without contamination concerns.

Conclusion
The authors demonstrate that a carefully engineered synthetic pipeline—combining domain‑adapted feature evolution, multi‑style problem generation, robust test synthesis, and a two‑stage verification process—can produce data of sufficient quality to train a 7 B‑parameter code LLM that reaches or exceeds the performance of larger models trained on real data. This work opens the door to fully synthetic, scalable training for high‑level code reasoning tasks, reducing dependence on scarce, proprietary problem sets and mitigating data‑leakage risks. Future directions include automated difficulty calibration, extension to multimodal problem statements, and scaling the approach to even larger model families.

