Evaluating Retrieval-Augmented Generation Variants for Natural Language-Based SQL and API Call Generation
Enterprise systems increasingly require natural language interfaces that can translate user requests into structured operations such as SQL queries and REST API calls. While large language models (LLMs) show promise for code generation [Chen et al., 2021; Huynh and Lin, 2025], their effectiveness in domain-specific enterprise contexts remains underexplored, particularly when both retrieval and modification tasks must be handled jointly. This paper presents a comprehensive evaluation of three retrieval-augmented generation (RAG) variants [Lewis et al., 2021] – standard RAG, Self-RAG [Asai et al., 2024], and CoRAG [Wang et al., 2025] – across SQL query generation, REST API call generation, and a combined task requiring dynamic task classification. Using SAP Transactional Banking as a realistic enterprise use case, we construct a novel test dataset covering both modalities and evaluate 18 experimental configurations under database-only, API-only, and hybrid documentation contexts. Results demonstrate that RAG is essential: Without retrieval, exact match accuracy is 0% across all tasks, whereas retrieval yields substantial gains in execution accuracy (up to 79.30%) and component match accuracy (up to 78.86%). Critically, CoRAG proves most robust in hybrid documentation settings, achieving statistically significant improvements in the combined task (10.29% exact match vs. 7.45% for standard RAG), driven primarily by superior SQL generation performance (15.32% vs. 11.56%). Our findings establish retrieval-policy design as a key determinant of production-grade natural language interfaces, showing that iterative query decomposition outperforms both top-k retrieval and binary relevance filtering under documentation heterogeneity.
💡 Research Summary
This paper presents a systematic evaluation of Retrieval‑Augmented Generation (RAG) techniques for translating natural‑language requests into two core enterprise operations: SQL queries for data retrieval and REST API calls for data modification. While large language models (LLMs) have demonstrated strong code‑generation capabilities, their deployment in real‑world enterprise settings is hampered by a lack of domain‑specific knowledge (schemas, proprietary APIs) that leads to hallucinations and incorrect code. The authors address this gap by comparing three RAG variants—standard RAG, Self‑RAG, and CoRAG—across three task categories (SQL generation, API call generation, and a combined task that first classifies the request type) and three documentation contexts (database‑only, API‑only, and hybrid database + API). This yields 18 experimental configurations.
The study uses SAP Transactional Banking (TRBK) as a realistic enterprise use case. The authors automatically extracted database schemas from OpenAPI specifications and generated a mock SQLite database, while also creating mock Postman servers for API endpoints. A custom dataset of 631 validated test cases (346 SQL, 285 API) was built through a pipeline that includes automated generation, human‑style re‑phrasing, execution‑based validation, and expert review. Each case contains a natural‑language prompt, the correct SQL or API output, and a successful execution trace.
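The execution-based validation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `execution_match` and the toy `accounts` table are hypothetical, but the idea matches the paper's setup, where a candidate query counts as execution-accurate only if it runs against the mock SQLite database and returns the same result set as the gold query.

```python
import sqlite3

def execution_match(conn: sqlite3.Connection, gold_sql: str, candidate_sql: str) -> bool:
    """Hypothetical execution-accuracy check: the candidate must execute
    without error and return the same (order-insensitive) rows as the gold
    query on the mock database."""
    gold = conn.execute(gold_sql).fetchall()
    try:
        cand = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # candidate fails to execute at all
    return sorted(gold) == sorted(cand)

# Toy stand-in for the mock TRBK database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
```

Comparing sorted result sets (rather than raw strings) is what lets two syntactically different but semantically equivalent queries both pass, which is why execution accuracy in the paper is so much higher than exact match.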
All experiments share a common infrastructure: OpenAI’s text‑embedding‑3‑small model for embedding documents, ChromaDB for vector storage, and GPT‑5 as the generation backbone. Retrieval size is fixed at the top‑5 most similar chunks. The three RAG variants are implemented as follows:
- Standard RAG concatenates the top‑5 retrieved chunks with the user query in a single prompt.
- Self‑RAG first retrieves top‑5 chunks, then lets the LLM assess each chunk’s relevance (threshold ≥ 0.2) and discards irrelevant ones before prompting.
- CoRAG performs iterative retrieval by decomposing the user query into sub‑queries, retrieving evidence for each sub‑query, and aggregating results until the model signals completion.
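The control flow of the three variants can be sketched as below. Everything here is a hypothetical stand-in for the real stack: `Chunk` carries a precomputed similarity score in place of a live text-embedding-3-small / ChromaDB lookup, the `llm` argument replaces a GPT-5 call, and `relevance` and `decompose` substitute for the LLM's self-assessment and sub-query generation. The sketch shows only how the retrieval policies differ, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chunk:
    text: str
    score: float  # similarity to the query, as a vector store would return it

def retrieve_topk(query: str, store: List[Chunk], k: int = 5) -> List[Chunk]:
    # Stand-in for a ChromaDB similarity search; scores are precomputed here,
    # so the query string is not actually used in this toy version.
    return sorted(store, key=lambda c: c.score, reverse=True)[:k]

def standard_rag(query: str, store: List[Chunk], llm: Callable[[str], str]) -> str:
    # Standard RAG: concatenate the top-5 chunks with the query in one prompt.
    context = "\n".join(c.text for c in retrieve_topk(query, store))
    return llm(f"Context:\n{context}\n\nQuery: {query}")

def self_rag(query: str, store: List[Chunk], llm, relevance, threshold: float = 0.2) -> str:
    # Self-RAG: retrieve top-5, then drop chunks the model judges irrelevant
    # (relevance below the 0.2 threshold) before prompting.
    kept = [c for c in retrieve_topk(query, store) if relevance(query, c) >= threshold]
    context = "\n".join(c.text for c in kept)
    return llm(f"Context:\n{context}\n\nQuery: {query}")

def corag(query: str, store: List[Chunk], llm, decompose, max_steps: int = 3) -> str:
    # CoRAG: decompose into sub-queries, retrieve evidence per sub-query,
    # and aggregate until the sub-queries are exhausted (completion signal).
    evidence, pending = [], decompose(query)
    for _ in range(max_steps):
        if not pending:
            break
        evidence.extend(retrieve_topk(pending.pop(0), store, k=2))
    context = "\n".join(c.text for c in evidence)
    return llm(f"Context:\n{context}\n\nQuery: {query}")
```

The key contrast is where filtering happens: Self-RAG prunes a single retrieval pass after the fact, while CoRAG issues multiple targeted retrievals, which is what the paper credits for its better coverage of mixed database and API documentation.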
Evaluation metrics follow established practice in text‑to‑SQL and API generation: Exact Match (EM), Component Match (CM), Execution Accuracy (Exec), Endpoint Retrieval Accuracy (for API), and Classification Accuracy (for the combined task). Statistical significance is assessed with paired two‑tailed t‑tests (α = 0.05).
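For the SQL side, the EM and CM metrics can be approximated as follows. This is a simplified sketch under stated assumptions: EM is taken as whitespace- and case-normalized string equality, and CM as the fraction of gold clauses whose token sets are reproduced, using a crude regex clause split (`_clauses`) in place of a real SQL parser. The helper names are hypothetical, not the paper's code.

```python
import re

# Clause keywords used for the rough split; a real CM implementation
# would use a proper SQL parser instead.
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "LIMIT"]

def _clauses(sql: str) -> dict:
    """Split a query into clause -> token set (rough parser stand-in)."""
    pattern = "(" + "|".join(CLAUSES) + ")"
    parts = re.split(pattern, sql, flags=re.IGNORECASE)
    out, key = {}, None
    for part in parts:
        if part.upper() in CLAUSES:
            key = part.upper()
            out[key] = set()
        elif key is not None:
            out[key] |= set(part.lower().split())
    return out

def exact_match(gold: str, pred: str) -> bool:
    norm = lambda s: " ".join(s.lower().split())
    return norm(gold) == norm(pred)

def component_match(gold: str, pred: str) -> float:
    g, p = _clauses(gold), _clauses(pred)
    if not g:
        return 0.0
    hits = sum(1 for clause, tokens in g.items() if p.get(clause) == tokens)
    return hits / len(g)
```

This also makes the metric gap in the results intuitive: a prediction with one wrong clause scores 0 on EM but still earns partial credit on CM, and may still pass execution accuracy if the difference is semantically harmless.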
Key findings:
- Retrieval is indispensable. The no‑RAG baseline achieves 0% Exact Match across all tasks, confirming that LLMs alone cannot reliably generate correct enterprise code without external knowledge.
- RAG dramatically improves performance. Execution accuracy rises to as high as 79.30% (CoRAG, hybrid documentation, API generation) and component match reaches 78.86%.
- CoRAG is the most robust in heterogeneous documentation settings. In the hybrid context, CoRAG attains 10.29% Exact Match on the combined task, significantly higher than standard RAG’s 7.45% (p < 0.01). For SQL generation, CoRAG’s Exact Match (15.32%) outperforms standard RAG (11.56%) by a statistically significant margin.
- Self‑RAG offers modest gains in relevance filtering but does not consistently surpass CoRAG, especially when both database and API documents are present.
- The iterative query‑decomposition strategy of CoRAG yields better coverage of needed schema or endpoint details, reducing hallucinations and improving downstream execution success.
The authors discuss practical implications: retrieval‑policy design should be a primary consideration when building production‑grade natural‑language assistants for enterprises. Simple top‑k retrieval may suffice in homogeneous documentation environments, but dynamic, multi‑step retrieval (as in CoRAG) is essential when documentation is heterogeneous. They also note limitations: experiments are confined to a single domain (banking) and a single LLM (GPT‑5); real‑time schema or API changes were not tested; and multi‑turn conversational scenarios remain unexplored.
Future work is suggested in three directions: (a) extending evaluation to multi‑turn dialogues with execution‑in‑the‑loop feedback, (b) cross‑domain validation (e.g., manufacturing, healthcare), and (c) exploring newer embedding models and LLM backbones to assess generalizability.
In summary, the paper provides the first comprehensive benchmark of advanced RAG variants for joint SQL and API generation, introduces a publicly released, rigorously validated dataset, and demonstrates that sophisticated retrieval strategies—particularly CoRAG’s iterative decomposition—are critical for achieving reliable, production‑level natural‑language interfaces in complex enterprise environments.