CoSQA+: Pioneering the Multi-Choice Code Search Benchmark with Test-Driven Agents

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original paper on arXiv.

Semantic code search, retrieving code that matches a given natural language query, is an important task for improving productivity in software engineering. Existing code search datasets face limitations: they rely on human annotators who assess code primarily through semantic understanding rather than functional verification, leading to potential inaccuracies and scalability issues. Additionally, current evaluation metrics often overlook the multi-choice nature of code search. This paper introduces CoSQA+, which pairs high-quality queries from CoSQA with multiple suitable code snippets. We develop an automated annotation pipeline featuring multiple-model candidate selection and a novel test-driven agent annotation system. Compared with a single Large Language Model (LLM) annotator and Python expert annotators working without test-based verification, the agents leverage test-based verification and achieve the highest accuracy, 93.9%. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA, and models trained on CoSQA+ exhibit improved performance. We publicly release both CoSQA+_all, which contains 412,080 agent-annotated pairs, and CoSQA+_verified, which contains 1,000 human-verified pairs, at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.


💡 Research Summary

Semantic code search aims to retrieve code snippets that satisfy a natural‑language query, a capability that can dramatically boost developer productivity. Existing benchmarks such as CoSQA and CodeSearchNet suffer from two fundamental drawbacks: (1) they rely on human annotators who judge relevance mainly by semantic similarity rather than by executing the code, which leads to noisy labels that do not guarantee functional correctness; (2) they adopt a one‑to‑one evaluation paradigm (typically Mean Reciprocal Rank) that ignores the fact that real‑world queries often have multiple valid implementations. A survey of 200 Python developers confirmed that developers issue roughly eight searches per day, consult an average of 2.8 code examples per query, and report that 63 % of their queries admit several correct solutions.
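The one‑to‑one shortcoming of MRR can be made concrete: the metric rewards only the rank of the *first* relevant item, so a ranking that surfaces several valid snippets scores no higher than one that surfaces a single valid snippet. A minimal Python sketch (illustrative only, not the paper's evaluation code):

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none appears."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_candidates, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Three valid snippets in the top ranks score the same as a single valid
# snippet at rank 1 -- MRR is blind to everything after the first hit.
print(reciprocal_rank(["a", "b", "c"], {"a", "b", "c"}))  # 1.0
print(reciprocal_rank(["a", "x", "y"], {"a"}))            # 1.0
```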

To address these gaps, the authors introduce CoSQA+, a multi‑choice code‑search benchmark, together with a novel fully‑automated “test‑driven agent” pipeline for labeling. CoSQA+ consists of two releases: (i) CoSQA+_all, containing 412,080 automatically annotated query‑code pairs; and (ii) CoSQA+_verified, a gold‑standard subset of 1,000 manually verified pairs. The benchmark adopts Mean Average Precision (MAP) as its primary metric, reflecting the importance of surfacing many relevant snippets in the top‑k results rather than just the first.
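MAP, by contrast, credits every relevant snippet a system places high in the ranking. A minimal sketch of MAP@k (a standard formulation, not necessarily the paper's exact implementation):

```python
def average_precision(ranked, relevant, k=10):
    """Average precision at cutoff k for a single query."""
    hits, score = 0, 0.0
    for i, candidate in enumerate(ranked[:k], start=1):
        if candidate in relevant:
            hits += 1
            score += hits / i          # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(runs, k=10):
    """runs: list of (ranked_candidates, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel, k) for r, rel in runs) / len(runs)
```

Under this metric, a query with two correct snippets at ranks 1 and 3 scores (1/1 + 2/3)/2 ≈ 0.83 rather than the 1.0 that MRR would report, so systems are rewarded for retrieving all valid implementations.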

The construction pipeline works as follows. For each high‑quality query taken from the original CoSQA, multiple retrieval models (BM25, CodeBERT, GraphCodeBERT, etc.) retrieve the top‑20 candidate functions from the CodeSearchNet corpus. These candidates then pass through a five‑stage test‑driven agent: (1) a preliminary screener removes syntactically invalid code; (2) a test‑program generator parses the natural‑language requirement, extracts input‑output constraints, and automatically writes a Python unittest; (3) a sandboxed test executor runs the generated test against the candidate; (4) a bug‑fixer attempts simple patches (e.g., missing imports, type casts) if the test fails; and (5) a final arbiter labels the pair as “exact match” only if all tests pass.
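The five stages above can be sketched in Python. This is a simplified illustration of the control flow only: `generate_test` and `fix_bug` are hypothetical placeholders standing in for the paper's LLM‑backed test generator and bug fixer, and the real system's sandbox is presumably more heavily isolated than a bare subprocess.

```python
import ast
import subprocess
import sys
import tempfile

def screen(code: str) -> bool:
    """Stage 1: discard syntactically invalid candidates."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_in_sandbox(program: str, timeout: int = 5) -> bool:
    """Stage 3: run candidate code plus its generated test in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        return subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def annotate(query, code, generate_test, fix_bug, max_patches=2):
    """Stages 1-5 in order; generate_test and fix_bug are placeholders
    for the LLM-based components described in the paper."""
    if not screen(code):                             # stage 1: syntax screen
        return "mismatch"
    test = generate_test(query)                      # stage 2: test generation
    candidate = code
    for _ in range(max_patches + 1):
        if run_in_sandbox(candidate + "\n" + test):  # stage 3: sandboxed run
            return "exact match"                     # stage 5: all tests pass
        candidate = fix_bug(candidate)               # stage 4: attempt a patch
    return "mismatch"
```

A candidate is labeled “exact match” only when the generated test executes cleanly against it, which is what distinguishes this pipeline from purely semantic annotation.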

In a rigorous evaluation, 1,000 random query‑code pairs were annotated by three groups: (a) the test‑driven agents, (b) three Python experts who judged without executing tests, and (c) a single large language model (LLM) annotator. Human experts produced a ground‑truth set using manually written tests. The agents achieved 93.9 % accuracy, surpassing the LLM (≈71 %) and approaching human expert performance (≈94 %). Notably, 83.67 % of the automatically generated tests were executable, and on average only 1.2 automatic patches were required per failing case.

To assess the practical impact of the new dataset, the authors fine‑tuned three popular code‑search models—CodeBERT, UniXcoder, and CodeT5+—on either CoSQA or CoSQA+. When evaluated on the CSN99 Python benchmark, models trained on CoSQA+ consistently outperformed those trained on CoSQA, with an average gain of 4.2 percentage points in MAP@10 and 2.8 percentage points in MRR. This demonstrates that functional, multi‑choice annotations provide richer supervision for retrieval models.

The authors also explored cross‑language generalization. By adapting the test‑program generator to PHP, Java, and Go, the agents achieved an average accuracy increase of 10.33 percentage points over baseline annotators, confirming that the pipeline is not limited to Python.

Key contributions of the paper are:

  1. The first multi‑choice code‑search benchmark (CoSQA+) that reflects real developer needs.
  2. A fully automated test‑driven agent that reaches 93.9 % labeling accuracy, the highest reported for code‑search dataset construction.
  3. An end‑to‑end pipeline that combines multi‑model candidate retrieval with functional verification, dramatically improving scalability and label quality.
  4. Public release of both the large‑scale automatically labeled dataset and the human‑verified gold subset to foster reproducibility.

Limitations include the current focus on functional correctness (input‑output behavior) without assessing non‑functional properties such as performance or memory usage, and the reliance on relatively simple test generation heuristics that may struggle with highly ambiguous or complex natural‑language specifications. Future work could integrate more sophisticated LLM‑based test synthesis, extend the pipeline to handle non‑functional requirements, and embed the MAP‑based multi‑choice evaluation directly into interactive code‑search tools to better align system rankings with developer satisfaction.

