Improving Code Generation via Small Language Model-as-a-judge

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on less common or domain-specific languages, prompting companies to develop in-house code generators. While open-source models can be trained for this, only LLMs with tens of billions of parameters match the performance of commercial tools, demanding costly training and deployment. Recent work has proposed supporting code generation with small language models (SLMs) by generating multiple candidate solutions and using another SLM to select the most likely correct one. The most recent work in this area, by Sun et al. [29], presents RankEF, a T5 model trained to rank code solutions using both execution-based and non-execution-based information. However, Sun et al. do not assess the T5 ranker’s classification accuracy, that is, how often it misjudges correct implementations as incorrect or vice versa, leaving open questions about the reliability of LMs as code correctness judges for other tasks (e.g., automated code review). Moreover, their experiments involve relatively old models, leaving it unclear to what extent such a methodology would still help companies cheaply train their own code generators with performance comparable to that of massive LLMs. We present a study addressing these limitations. We train several state-of-the-art SLMs as code correctness judges and assess their ability to discriminate between correct and wrong implementations. We show that modern SLMs outperform RankEF, even without exploiting execution-based information. When used as code rankers, they achieve higher performance gains than RankEF and perform competitively with LLMs 5-25x larger, at a fraction of the cost.


💡 Research Summary

The paper addresses a practical limitation of large language models (LLMs) for code generation: while they excel on mainstream programming languages, their performance drops dramatically on less common languages or proprietary domain‑specific languages (DSLs) that companies may use internally. Achieving commercial‑grade performance on such languages typically requires LLMs with tens of billions of parameters, which entails prohibitive hardware, training, and deployment costs for many organizations.

A promising line of work proposes to augment small language models (SLMs, defined as < 5 B parameters) with a separate “judge” model that ranks multiple candidate solutions generated by the SLM. The most recent incarnation, RankEF (Sun et al., ASE 2024), fine‑tunes a CodeT5+ 770 M model using both execution‑based signals (test pass/fail, compilation/runtime errors) and textual cues, then uses it to select the top‑k solutions from a pool of 100 candidates. However, RankEF never reports the intrinsic classification accuracy of the judge (i.e., how often it mislabels a correct solution as incorrect or vice versa), and it relies on an outdated base model.
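The judge-based selection described here reduces to scoring each candidate and keeping the top‑k. The following minimal sketch illustrates the idea; `judge_score` is a hypothetical stand-in for a fine-tuned classifier's probability of the "correct" label, not the actual RankEF scoring function:

```python
from typing import Callable, List

def rank_candidates(
    candidates: List[str],
    judge_score: Callable[[str], float],
    k: int = 1,
) -> List[str]:
    """Return the top-k candidate solutions ordered by judge score.

    `judge_score` maps a candidate program to the judge's estimated
    probability that it is correct. In a real pipeline it would wrap a
    fine-tuned classifier's softmax output; here it is a placeholder.
    """
    ranked = sorted(candidates, key=judge_score, reverse=True)
    return ranked[:k]

# Toy usage with a stand-in scorer that simply prefers shorter solutions.
pool = [
    "int twice(int x) { return x * 2; }",
    "int twice(int x) { int y = x; return y + y; /* verbose variant */ }",
]
top = rank_candidates(pool, judge_score=lambda c: 1.0 / len(c), k=1)
```

With a trained judge in place of the toy scorer, the same function implements the top‑k selection over a candidate pool.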

The authors conduct a two‑part empirical study to fill these gaps.
RQ 1 – Can SLMs serve as reliable judges?
They fine‑tune four state‑of‑the‑art SLMs—Qwen2.5‑Coder 0.5 B, Qwen2.5‑Coder 3 B, Gemma‑3 4 B, and Llama‑3.2 3 B—on a dataset of correct/incorrect code pairs derived from test results on three Java benchmarks (HumanEval‑Java, MBPP‑Java, CoderEval‑Java). As baselines they include the original RankEF (CodeT5+ 770 M) and GPT‑4.1‑mini (≈ 80 B) in zero‑shot and few‑shot settings. After fine‑tuning, the SLMs achieve moderate agreement with the execution‑based ground truth (Cohen’s κ ≈ 0.45‑0.57), comparable to GPT‑4.1‑mini’s κ of 0.54 and markedly better than RankEF’s “fair” agreement. Notably, providing execution‑based signals during training yields no measurable benefit for the newer SLMs, indicating that rich textual pre‑training already equips them to learn correctness cues. Even the smallest 0.5 B model outperforms the older 770 M T5, suggesting that model architecture and pre‑training data matter more than raw size on this task.
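Cohen's κ measures how much the judge's verdicts agree with the execution-based ground truth beyond what chance alone would produce. A minimal sketch of the computation (not the authors' evaluation code):

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Cohen's kappa between two label sequences.

    In this setting, `labels_a` would be ground-truth correctness from
    running the benchmark tests, and `labels_b` the judge's
    correct/incorrect verdicts for the same candidate solutions.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both sources agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each source's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[lab] * cb[lab] for lab in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Values around 0.45-0.57, as reported for the fine-tuned SLMs, fall in the conventional "moderate agreement" band of the Landis-Koch scale.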

RQ 2 – Does a judge improve code generation?
Using the three best‑performing judges, the authors build a pipeline where a small generator model produces ten candidate implementations for each task, and the judge selects the most likely correct one. Generators include DeepSeek‑Coder 1.3 B, OpenCoder 1.5 B, Qwen2.5‑Coder 3 B, Phi‑4 mini 4 B, and Gemma‑3 4 B. Each pipeline is compared against the largest model in the same family (e.g., DeepSeek‑Coder 33 B, Qwen2.5‑Coder 32 B, Gemma‑3 27 B, Phi‑4 15 B) and against GPT‑4.1‑mini as a commercial reference. Across the three benchmarks, the judge‑augmented small models achieve higher Pass@1 scores than their larger counterparts in four out of five cases, and consistently outperform RankEF when applied to the same candidate pools. For example, DeepSeek‑Coder 1.3 B + judge beats DeepSeek‑Coder 33 B by 2‑4 percentage points on HumanEval‑Java, while Qwen2.5‑Coder 3 B + judge surpasses its 32 B sibling by 1‑3 points.
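Pass@1 scores like those above are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021). A sketch of that estimator follows; note that with a judge in the loop, pass@1 is instead measured directly on the single judge-selected candidate per task, so this formula describes the judge-free baseline:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Probability that at least one of k solutions, drawn uniformly without
    replacement from n generated candidates of which c pass the tests,
    is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # fewer than k failing candidates: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with the paper's setup of ten candidates per task, a task where two of the ten pass contributes `pass_at_k(10, 2, 1) = 0.2` to the baseline pass@1, whereas a perfect judge would contribute 1.0 by always picking a passing candidate.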

The study also quantifies cost and latency. Running two SLMs (generator + judge) on a consumer‑grade NVIDIA RTX 3090 (24 GB) costs roughly $1 k in hardware, whereas a 30 B model requires at least one NVIDIA A100 80 GB (~ $17 k). Inference latency for the small‑model pipeline is typically 0.5‑1 s per request, compared with 2‑3 s for the large models, yielding a clear advantage for real‑time developer assistance.

Key contributions

  1. First systematic evaluation of SLMs as independent code‑correctness judges, showing that fine‑tuning yields reliability comparable to a commercial 80 B model.
  2. Demonstration that a simple judge‑based selection mechanism can elevate the performance of sub‑5 B generators to the level of 10‑30 B models, without needing execution‑based training signals.
  3. Detailed cost‑benefit analysis highlighting orders‑of‑magnitude savings in hardware, energy, and latency, making in‑house code recommendation feasible for small‑to‑medium enterprises.
  4. Open release of all code, data, and trained checkpoints, encouraging reproducibility and further research on multi‑judge cooperation and adaptive ranking strategies.

In summary, the paper validates the “LLM‑as‑a‑judge” paradigm with modern small models, showing that they can both accurately assess code correctness and effectively improve code generation when paired with a generator. This opens a practical pathway for organizations to build cost‑effective, high‑quality code assistance tools without the massive resource commitments traditionally associated with large‑scale LLMs.

