BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation
Training LLMs for code-related tasks typically depends on high-quality code-documentation pairs, which are costly to curate and often scarce for niche programming languages. We introduce BatCoder, a self-supervised reinforcement learning framework designed to jointly optimize code generation and documentation production. BatCoder employs a back-translation strategy: documentation is first generated from code, and the generated documentation is then used to reconstruct the original code. The semantic similarity between the original and reconstructed code serves as an implicit reward, enabling reinforcement learning to improve the model's performance in both directions: generating documentation from code and code from documentation. This approach allows models to be trained using only code, substantially increasing the pool of available training examples. Evaluated on HumanEval and MBPP with a 7B model, BatCoder achieved 83.5% and 81.0% pass@1, outperforming strong open-source baselines. Moreover, the framework scales consistently with both training corpus size and model capacity.
💡 Research Summary
BatCoder introduces a self‑supervised reinforcement‑learning framework that jointly optimizes code generation and documentation production without relying on curated code‑documentation pairs. The core idea is a back‑translation loop: given an unlabeled code snippet c, the model first generates a natural‑language document d = fθ(c) (Stage 1). After filtering d for format compliance, a second model (or the same model) reconstructs code c′ = gθ(d) (Stage 2). The similarity between the original code c and the reconstructed code c′ serves as an implicit reward R(c, c′). This reward combines a sophisticated code‑similarity metric (CSSG) that accounts for AST structure, data flow, and semantic tokens, with a binary penalty for violations of the required documentation markup (e.g., missing required sections).
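The reward computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `format_ok` stands in for the markup filter, and a token-level `difflib` ratio stands in for the CSSG similarity metric (the real metric also uses AST and data-flow comparison).

```python
import difflib

def format_ok(doc: str) -> bool:
    # Hypothetical markup check: the paper's filter enforces a required
    # documentation format; here we just require a non-empty "Args:" section.
    return doc.strip() != "" and "Args:" in doc

def code_similarity(c: str, c_prime: str) -> float:
    # Stand-in for CSSG (AST structure + data flow + semantic tokens);
    # a simple token-sequence ratio is used here for illustration only.
    return difflib.SequenceMatcher(None, c.split(), c_prime.split()).ratio()

def back_translation_reward(c: str, doc: str, c_prime: str) -> float:
    # Binary penalty: markup violations zero out the reward;
    # otherwise the reward is the code-similarity score R(c, c').
    if not format_ok(doc):
        return 0.0
    return code_similarity(c, c_prime)
```

A perfect reconstruction with compliant documentation yields reward 1.0, while a malformed document yields 0.0 regardless of reconstruction quality.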
Training proceeds by sampling K diverse documentation candidates per code snippet, filtering them, and then generating a single reconstructed code per candidate. This asymmetric sampling reduces memory and compute overhead while preserving enough trajectory diversity for stable policy gradients. The expected reward J(θ) = E_c E_{d∼fθ(·|c), c′∼gθ(·|d)}[R(c, c′)] is maximized with standard policy-gradient updates over the sampled trajectories.
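The sampling-and-update loop can be sketched as below. This is a schematic, assuming stub callables for the documentation sampler, the reconstructor, and the reward; a real trainer would use the resulting advantage-weighted trajectories to scale log-probability gradients of the policy.

```python
def reinforce_step(code_batch, sample_doc, reconstruct, reward_fn, K=4):
    """One policy-gradient step (sketch of the asymmetric sampling scheme):
    K documentation candidates per snippet (Stage 1), one reconstruction per
    candidate (Stage 2), with a per-snippet mean-reward baseline for variance
    reduction. Returns (code, doc, advantage) triples for the trainer."""
    trajectories = []
    for c in code_batch:
        docs = [sample_doc(c) for _ in range(K)]        # K diverse docs
        recs = [reconstruct(d) for d in docs]           # one code each
        rewards = [reward_fn(c, d, cp) for d, cp in zip(docs, recs)]
        baseline = sum(rewards) / len(rewards)          # per-snippet baseline
        for d, r in zip(docs, rewards):
            trajectories.append((c, d, r - baseline))   # advantage weights
    return trajectories
```

Sampling K documents but only one reconstruction per document keeps the number of generations linear in K rather than quadratic, which is the memory/compute saving the asymmetric design targets.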