SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Published as a conference paper at ICLR 2026

Sungho Park, Jueun Kim, Wook-Shin Han*
POSTECH, Pohang, Republic of Korea
{shpark,jekim,wshan}@dblab.postech.ac.kr

Abstract

Real-world Table–Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table–Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which iteratively rewrites any syntactically valid query until it returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question–answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables.
On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at github.com/pshlego/SPARTA.

1 Introduction

Table–Text QA has emerged as a fundamental challenge in building robust question answering (QA) systems capable of operating across heterogeneous data modalities (i.e., text and tables) Chen et al. (2020a;b; 2021); Zhao et al. (2022); Zhu et al. (2021). The task is particularly evident in scenarios where textual descriptions and table entries originate from one or more sources (e.g., textual information and tables in multiple Wikipedia pages) and must be jointly analyzed to arrive at the correct answer. While a single Wikipedia page often contains both text and tables, it is not unusual for relevant information to span multiple pages or documents, necessitating cross-document retrieval and the effective integration of disparate information. A significant limitation of existing Table–Text QA benchmarks is that human annotators manually construct them Chen et al. (2020a;b; 2021); Zhao et al. (2022); Zhu et al. (2021), resulting in fundamentally flawed benchmark designs that hinder comprehensive system evaluation.

*Corresponding author.
[Figure 1 here: three worked examples from the SPARTA benchmark, reproduced from panels (a)–(c).]
(a) Star-Structured Reasoning. Question: "What is the maximum height among players who played for the Trail Blazers with a salary over 2,800,000 and who have scored more than 20 points in at least one game?" Answer: 85 inches.
(b) Chain-Structured Reasoning. Question: "Which NBA players, who are center and taller than the average height of point guards drafted after 1990, have more than 8 rebounds in a game?" Answer: Alex Len, Al Horford, Andre Drummond, ...
(c) Analytical Operation & Large-Scale Tables (#Row = 7,872). Question: "What countries have movies with an average duration of more than 100 minutes and have at least one movie with a median rating of 7 and average rating greater than 6.5?" Answer: Portugal, India, ...
Figure 1: Representative examples of our SPARTA benchmark (see Appendix M for more examples).

| Benchmark | #Col | #Row | Question Generation | Grouping/Having | Chain (>3-Hop) | Star | Cross-modal | Uni-modal | Annotation Error Rate (over 100 sampled queries) |
|---|---|---|---|---|---|---|---|---|---|
| TAT-QA Zhu et al. (2021) | 4.0 | 9.4 | Manual | ✗ | ✗ | ✗ | ✓ | ✗ | 30% |
| FinQA Chen et al. (2021) | – | 6.4 | Manual | ✗ | ✗ | ✗ | ✓ | ✗ | 27% |
| MULTIHIERTT Zhao et al. (2022) | 5.0 | 10.8 | Manual | ✗ | ✗ | ✗ | ✓ | ✓ | 26% |
| HybridQA Chen et al. (2020b) | 4.4 | 15.7 | Manual | ✗ | ✗ | ✗ | ✓ | ✗ | 21% |
| OTT-QA Chen et al. (2020a) | 4.4 | 15.7 | Manual | ✗ | ✗ | ✗ | ✓ | ✗ | 21% |
| SPARTA (NBA) | 12.2 | 3,280.5 | Auto (LLM) w/ Lightweight Human Validation | ✓ | ✓ | ✓ | ✓ | ✓ | 0% |
| SPARTA (Movie) | 4.7 | 10,054.0 | Auto (LLM) w/ Lightweight Human Validation | ✓ | ✓ | ✓ | ✓ | ✓ | 0% |
| SPARTA (Medical) | 6.7 | 200.0 | Auto (LLM) w/ Lightweight Human Validation | ✓ | ✓ | ✓ | ✓ | ✓ | 0% |

Table 1: Comparison of Table–Text QA benchmarks. "Chain (>3-Hop)" and "Star" denote supported query shapes (see Appendix A for detailed annotation audit results).

(1) Limited question types and shallow reasoning. Existing Table–Text QA benchmarks, constrained by the complexity of manual annotation, feature a restricted range of shallow questions. These typically require only direct information extraction, such as pinpointing a fact within a single textual passage or locating a specific entry in a table. Even for questions that go beyond this simple extraction, the reasoning depth remains shallow, seldom demanding more than two hops or involving advanced analytical operations like aggregation or grouping, even though such operations are common in real-world natural-language queries. This deficiency hinders thorough evaluation of a system's deep, multi-step inference capabilities.
Furthermore, current multi-hop questions usually follow simplistic linear chains, rather than the expressive, tree-structured reasoning (e.g., multi-branch paths, longer chains, or uni-modality hops) crucial for assessing systems on complex inference tasks, as exemplified in Figure 1.

(2) Annotation noise. Our quality audit uncovers numerous annotation errors that undermine the reliability of existing benchmarks. Re-inspecting 100 randomly sampled dev examples from HybridQA, we find that 21% contain at least one error, which we classify into three categories: (1) Redundant modality (52.4%): table and passage encode the same fact, yet the instance is tagged as a cross-modal question even though a single modality suffices; (2) Incomplete answer set (23.8%): several answers are correct but only one is recorded, distorting recall; (3) Incorrect or unanswerable (23.8%): the labelled answer is wrong or cannot be derived from the provided data, revealing a lapse in quality control. Our audits on other benchmarks reveal similar error patterns (see Appendix A).

(3) Reliance on single, small-scale web tables. Current benchmarks almost exclusively draw on compact web tables, typically scraped from Wikipedia or corporate reports, thereby providing only toy-scale scenarios. As Table 1 shows, tasks either involve a single table or, when multiple tables are present, the mean table cardinality hovers around 15 rows, far short of the thousands of rows found in real-world databases. This simplification is largely pragmatic: reasoning over larger tables dramatically increases annotator effort and error rates Chen et al. (2020b). Consequently, existing benchmarks cannot meaningfully evaluate QA systems in realistic, high-complexity settings that demand reasoning over large, heterogeneous relational data.
SPARTA unifies all evidence, structured and unstructured, inside a single relational store called the reference fact database. Each original relation (e.g., a web table or a financial ledger) remains intact as a source table. Grounding tables, which store atomic propositions as tuples for SQL-addressable access, are populated using two complementary methods (detailed in Section 3.2): (1) utilizing validated corpora such as ROTOWIRE Wu et al. (2022); and (2) employing a table-to-text strategy that generates atomic facts directly from structured data. With textual facts now addressable via SQL, queries over this combined store freely mix modalities; no pointer to the original span is needed, as answers are returned directly by query execution.

Stage 1 – Reference fact database construction. Source and grounding tables are merged into the reference fact database, making all facts uniformly queryable.

Stage 2 – Query generation. A large language model (LLM) receives the schema and sample rows and emits SQL whose number of nested predicates matches a target hop count. Note that SPARTA synthesizes queries that instantiate the four representative nesting patterns (Types N, A, J, and JA) outlined in Appendix B. Two safeguards ensure that only realistic, executable statements survive: (1) Provenance-based refinement loops provenance feedback (unmatched joins or overly selective predicates) back to the LLM until the query returns a non-empty result. (2) Realistic-structure enforcement confines generation to post-order traversals of the query graph, yielding human-like join orders and enabling early pruning of infeasible subqueries.

Stage 3 – Question verbalisation. Each validated query is paired with its execution result; a second LLM then rewrites the SQL into a fluent natural-language question, producing high-fidelity ⟨question, answer⟩ pairs that span aggregation, grouping, and deep multi-hop joins across large tables.
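The cross-modal store described above can be illustrated with a toy in-memory SQLite database; the table names, columns, and fact values below are invented stand-ins for SPARTA's actual schema, and the "atomic facts" are hand-written rather than extracted by the real pipeline.

```python
import sqlite3

# Toy reference fact database: one source table plus one grounding table
# whose tuples are atomic facts distilled from game-report passages.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nba_player_affiliation "
           "(player_name TEXT, salary INTEGER, team_name TEXT)")
db.executemany("INSERT INTO nba_player_affiliation VALUES (?, ?, ?)", [
    ("Meyers Leonard", 3075880, "Trail Blazers"),
    ("Robin Lopez", 5904261, "Trail Blazers"),
])

# Hypothetical grounding table: each row is one atomic fact from text,
# e.g. "Meyers Leonard posted a season-high of 24 points ..."
db.execute("CREATE TABLE gt_points (player_name TEXT, points INTEGER)")
db.executemany("INSERT INTO gt_points VALUES (?, ?)", [
    ("Meyers Leonard", 24),
    ("Robin Lopez", 21),
])

# With textual facts addressable via SQL, a cross-modal question becomes
# a plain join; the answer comes directly from query execution.
row = db.execute("""
    SELECT MAX(s.salary)
    FROM nba_player_affiliation AS s
    JOIN gt_points AS g ON g.player_name = s.player_name
    WHERE g.points > 20
""").fetchone()
print(row[0])  # 5904261
```

No span pointer back into the passage is needed: once the fact is a tuple, the answer is just the query result.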
The final correctness, i.e., the validity of the question–answer pair, is checked via lightweight human verification; unlike HybridQA, our pipeline does not require re-performing full multi-hop reasoning, thereby keeping audit costs low (see Section 3.4). This SQL-centric pipeline yields a large, diverse, and rigorously validated benchmark that corrects the size, noise, and logical shallowness of previous Table–Text QA resources. On SPARTA, state-of-the-art models that exceed 70 F1 on HybridQA or exceed 50 F1 on OTT-QA drop by more than 30 F1 points, revealing fundamental weaknesses in current cross-modal reasoning and highlighting directions for future research.

2 Related Work

Table–Text QA Benchmarks. Table–Text QA benchmarks evaluate a model's ability to jointly reason over structured tables and unstructured passages. HybridQA Chen et al. (2020b) introduced the task, and OTT-QA Chen et al. (2020a) extended it to open-domain settings, but both suffer from annotation noise, shallow reasoning depth, and a lack of support for advanced analytical operations. Specifically, they do not support GROUP BY or HAVING clauses, and only 1.1% of questions involve aggregation. Their multi-hop reasoning is confined to short, linear chains and fails to capture tree-structured or uni-modal reasoning paths. Other benchmarks (TAT-QA Zhu et al. (2021), FinQA Chen et al. (2021), and MultiHiertt Zhao et al. (2022)) focus narrowly on numerical reasoning in financial contexts rather than multi-hop reasoning, further limiting coverage Zhang et al. (2023). Additionally, all existing Table–Text QA datasets rely on small, manually annotated web tables, which hinders scalability and realism. SPARTA addresses these gaps with an SQL-centric pipeline that constructs a large-scale benchmark of executable, compositional questions over hybrid corpora, offering a principled testbed for multi-hop QA across text and tables.

Synthetic Benchmark Generation.
Recent synthetic benchmark generation scales QA pairs from pre-existing sources, but most are single-modal, relying on knowledge graphs Sun et al. (2024); Omar et al. (2025); Orogat & El-Roby (2023; 2022) or text corpora Bonifacio et al. (2022); Jeronymo et al. (2023) and ignoring cross-modal reasoning.

[Figure 2 here: pipeline diagram.]
Figure 2: Overview of SPARTA: (1) Reference Fact Database Construction, (2) Query Generation, (3) Question Verbalisation. ST and GT denote a source table and a grounding table, respectively.

ERBench Oh et al. (2024) uses relational databases, yet its questions are binary or multiple-choice, based on shallow templates excluding analytical operators like GROUP BY, HAVING, and aggregations; it also lacks table-text interplay. Similarly, TDBench Kim et al. (2026) leverages temporal databases to automate time-sensitive QA, but it is confined to temporal reasoning within structured tables. In contrast, SPARTA generates multi-hop questions bridging tables and passages, mirroring complex nested SQL patterns to provide a rigorous cross-modal benchmark for Table–Text QA. Beyond QA, benchmarks in other domains impose domain-specific constraints: database performance benchmarks Nambiar & Poess (2006); Erling et al. (2015) use fixed schemas and templates for reproducible profiling; unlearning benchmarks Maini et al. (2024); Zhong et al. (2024) create forget/retain partitions for selective forgetting; and PEEL Kim (2024) employs template-based generation of NL-nested SQL pairs to guarantee the executability of structurally complex queries.
SPARTA's constraint is fundamentally different: every synthetic example must encode tree-structured multi-hop reasoning grounded in semantically sound, executable SQL and natural-language questions, requiring analytical operations and table-text alignment. Our provenance-based refinement and realistic-structure enforcement address this, producing semantically rich, executable queries.

3 SPARTA

3.1 Table–Text QA Task and Benchmark Generation

Given a natural-language question q_NL, a set of source tables S_T = {T^(1), ..., T^(m)}, and a set of passages C_P = {P^(1), ..., P^(n)}, a QA system f_θ must return the answer a = f_θ(q_NL, S_T, C_P). Each passage in C_P is decomposed into atomic facts and stored as tuples in grounding tables G_T. Merging these with the original source tables yields a unified reference fact database D. An LLM then: (i) generates executable SQL queries on D that vary in depth (selection, aggregation, nesting, etc.), and (ii) verbalises each query into a fluent natural-language question q_NL. The resulting pairs (q_NL, a) constitute a scalable benchmark for Table–Text QA. An overview of the entire pipeline is provided in Figure 2.

3.2 Reference Fact Database Construction

We use the ROTOWIRE dataset as part of our reference fact database; its structured tables are widely used as gold supervision for text-to-table generation and have been verified by the authors of Wu et al. (2022) for consistency with the accompanying game reports. Each NBA game report in this corpus is decomposed into atomic facts, which are stored as tuples in G_T, guaranteeing perfect alignment between text and relational data. To construct S_T, we integrate six public NBA datasets, covering salaries, awards, draft data, and team histories, sourced from Kaggle and data.world.
Shared entity attributes such as PLAYER_NAME and TEAM_NAME are enforced as primary–foreign key pairs, yielding a connected schema in which every tuple from G_T can be joined to at least one table in S_T. The resulting database contains three grounding tables and six source tables (see Appendix C).

While our construction uses NBA data for illustration, SPARTA is inherently domain-agnostic. From any relational database, one designates a subset of relations as S_T and treats the remaining relations as G_T. Applying table-to-text generation to G_T yields a companion set of textual passages C_P, forming the reference fact database D = S_T ∪ G_T with no information overlap between the two sets. The query-generation pipeline then applies unchanged, yielding a portable recipe for building large-scale Table–Text QA benchmarks in any domain with relational data. To demonstrate this, we extended our pipeline to two new domains (movies and medical) using Kaggle datasets, with configurations identical to the NBA domain (see Appendix E). For these datasets, we start from existing structured tables and convert a subset into grounding tables using rule-based templates. This table-to-text transformation is deterministic and template-driven, with templates manually designed and verified to prevent spurious facts or errors.

3.3 Query Generation

For non-nested queries, SPARTA builds the statement clause by clause: the LLM emits each clause in canonical SQL order, conditioned on the schema and previously written clauses, and immediately executes the partial query. If the result is empty, the execution outcome is fed back so the LLM can revise the offending clause, ensuring that the query remains executable and semantically meaningful at every step. The next step is to synthesise nested SQL queries that act as faithful logical forms for multi-hop reasoning.
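The clause-by-clause, execution-guided loop for non-nested queries can be sketched as follows; the stub `propose_where` stands in for the LLM, and the schema and rows are invented for the example.

```python
import sqlite3

# Illustrative schema, not SPARTA's actual one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nba_player_information "
           "(player_name TEXT, birth_year INTEGER)")
db.executemany("INSERT INTO nba_player_information VALUES (?, ?)",
               [("John", 1980), ("Sam", 1981), ("Tom", 1975)])

def propose_where(attempt):
    # Stand-in for the LLM: the first proposal is too restrictive and
    # empties the result; the revision keeps the query non-empty.
    return "birth_year >= 1990" if attempt == 0 else "birth_year >= 1980"

query = "SELECT player_name FROM nba_player_information"
for attempt in range(5):
    candidate = f"{query} WHERE {propose_where(attempt)}"
    if db.execute(candidate).fetchall():
        query = candidate          # non-empty result: accept the clause
        break
    # empty result: the execution outcome would be fed back to the LLM

answer = {r[0] for r in db.execute(query)}
```

Here the first WHERE proposal returns zero rows, so the loop asks for a revised clause and accepts the second, which keeps the query executable and non-empty.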
A generated query must satisfy two criteria: (i) it should resemble a query that a human analyst would plausibly write, avoiding degenerate template artifacts, and (ii) it must execute over D without error and return a non-empty result. These guarantees ensure that every (q_NL, a) pair is both natural and answerable. Template-based generation fills fixed slots with ad-hoc limits or auxiliary predicates to guarantee execution, yet the resulting SQL is often semantically unsound. For instance, SELECT birthplace FROM nba_player_information WHERE birthplace <> 'Chicago, Illinois' OR birthplace <> 'Dallas, Texas' runs without error but expresses a vacuous intent ("... not born in Chicago or not born in Dallas," which matches everyone). Conversely, one-shot LLM prompting produces natural queries, but these frequently yield empty results and show limited diversity (see Table 3). We therefore introduce a dual-stage framework: (i) realistic-structure enforcement and (ii) provenance-based refinement.

3.3.1 Realistic-Structure Enforcement

A nested SQL query can be modeled as a query graph G = (V, E), where each node v_i ∈ V corresponds to a distinct query block (every SELECT ... FROM ... WHERE ... subquery, including the outermost statement), while each directed edge e_ij ∈ E denotes a nested predicate that correlates blocks Q_i and Q_j through a shared attribute reference, thus capturing the dependency structure of the original nested query in graph form (see Appendix B for representative nested query patterns). Based on this representation, we measure query complexity by the number of edges in the query tree, each representing a reasoning hop. For nested-query generation, SPARTA adopts Post-Order+Prov as the default.
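The query-graph view can be made concrete with a small sketch; `Block`, `post_order`, and `hop_count` are illustrative names rather than SPARTA's implementation, and the SQL strings are abbreviated.

```python
# Nodes are query blocks; each parent->child link is a nested predicate
# (an edge), so the hop count equals the number of edges in the tree.
class Block:
    def __init__(self, sql, children=()):
        self.sql = sql
        self.children = list(children)

def post_order(block):
    # Post-order traversal: visit validated leaves before the blocks
    # that wrap them, mirroring how analysts compose nested SQL.
    for child in block.children:
        yield from post_order(child)
    yield block

def hop_count(block):
    return sum(1 + hop_count(c) for c in block.children)

leaf1 = Block("SELECT player_name FROM nba_player_information "
              "WHERE birth_year >= 1980")
leaf2 = Block("SELECT player_name FROM nba_player_affiliation "
              "WHERE salary > 800000")
root = Block("SELECT player_name FROM nba_player_information "
             "WHERE player_name IN (...) AND player_name IN (...)",
             [leaf1, leaf2])

order = [b.sql for b in post_order(root)]
# leaves come first, the outermost block last; this 3-block tree has 2 hops
```

Generating in this order means every inner block can be executed and validated before an outer predicate is wrapped around it.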
To preserve realistic structure, we force the LLM to build the query tree in post-order: compose each leaf subquery first, then wrap it with successively higher-level blocks, exactly how analysts craft nested SQL. We choose post-order traversal over alternatives like breadth-first or top-down because the latter require validating incomplete queries before inner subqueries are constructed. In contrast, post-order ensures that each intermediate block is executable by validating subqueries first and then composing higher-level predicates. In Post-Order+Prov, leaf nodes are generated clause by clause. For the target question type we pick the relevant clauses (WHERE, GROUP BY, ORDER BY, ...) in canonical order, and let the LLM fill each one using (i) the schema, (ii) earlier clauses, and (iii) partial results. If a clause yields an empty result, we roll back to the last valid subquery, sparing redundant LLM calls. Internal nodes arise by recursively enclosing validated subqueries. At every step the LLM selects a child query, picks a joinable table, and emits a connecting predicate (AND / OR, etc.). Empty outputs trigger provenance-guided repair (§3.3.2); otherwise the predicate is kept. The loop iterates until the query graph grows to the specified target size.

3.3.2 Provenance-Based Refinement
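The provenance-guided repair mentioned above can be sketched minimally with SQLite, using the salary example from Figure 3. The predicate peeling and the hand-written relaxation below are simplified stand-ins for the why-not provenance tool and the LLM rewrite, respectively, and the data is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nba_player_affiliation "
           "(player_name TEXT, salary INTEGER)")
db.executemany("INSERT INTO nba_player_affiliation VALUES (?, ?)",
               [("John", 700000), ("Sam", 600000), ("Tom", 650000)])

base = "SELECT player_name FROM nba_player_affiliation"
predicates = ["salary > 800000"]          # evolving query returns no rows

def run(preds):
    sql = base + (" WHERE " + " AND ".join(preds) if preds else "")
    return db.execute(sql).fetchall()

# Step 1: peel predicates in reverse order until the query yields rows
# (with this data the single predicate is the one that blocks).
peeled = list(predicates)
while peeled and not run(peeled):
    blocking = peeled.pop()

# Step 2: sample an expected tuple from the now non-empty result.
expected = run(peeled)[0]

# Step 3: a why-not provenance report would name `blocking` as the
# filter that removed `expected`; here we relax the clause by hand,
# where SPARTA would ask the LLM to rewrite only that clause.
repaired = ["salary >= 600000"]
rows = {r[0] for r in run(repaired)}
```

After the relaxation the query returns all three players, so the repaired predicate is kept and generation continues.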
Figure 3: Overview of provenance-based refinement. The LLM builds the query graph in post-order: validated leaves first, then one outer predicate at a time. If an evolving query returns no rows, a provenance-based refinement process is initiated to repair the query.

The refinement process leverages "why-not provenance," a database technique used to identify which predicates in a query are responsible for filtering out expected tuples Bidoit et al. (2014); Chapman & Jagadish (2009); Lee et al. (2017). While traditional why-not provenance often relies on user-provided examples of the missing tuples, our approach dynamically derives the expected tuples from intermediate query results. The process unfolds in three steps. First, when a query yields an empty result, we peel off predicates in reverse order until the query yields a result. Second, we sample a tuple from this non-empty result set. Finally, we run a why-not provenance tool Dietrich et al. (2022) to identify the blocking predicate and provide this provenance report to the LLM, instructing it to rewrite only the problematic clause. Ablations are (i) One-Shot–k, which inserts all k predicates in a single pass with no checks, and (ii) Post-Order (no provenance), which follows the same construction but skips the repair loop. Figure 3 illustrates the overall process of provenance-based refinement; there, provenance feedback relaxes the predicate from salary > 800000 to salary >= 600000.

3.4 Question Verbalisation

For each executable SQL query q_SQL, we generate a corresponding natural-language question q_NL using AST-ICL Al Lawati et al. (2025), a state-of-the-art LLM-based SQL-to-text model. We adopted the LLM-based model over template-based methods, which are limited by rigidity and reliance on handcrafted templates, as documented in prior work Iyer et al. (2016); Xu et al. (2018).
In AST-ICL, the SQL abstract syntax tree is supplied as an in-context exemplar, and the model emits a fluent question q_NL whose semantics align with the query. Executing q_SQL on D yields the answer a, completing the benchmark pair (q_NL, a). Every instance is thus interpretable, executable, and suitable for probing multi-hop reasoning over hybrid (table + text) data. The verbalized questions were validated and corrected by three CS graduate students with SQL/schema literacy to ensure factuality and meaningfulness. This process is lightweight, requiring substantially less effort than full manual annotation. Specifically, validating 3,300 queries takes about 1,493 minutes of total worker time, whereas HybridQA required roughly 6,600 minutes to create the same number of queries from scratch.

4 Experiments

4.1 Evaluation Setup

Hardware and Software Settings. We conducted our experiments on a machine with an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, 1.5 TB of RAM, and 4 RTX A6000 GPUs, running Ubuntu 22.04.4, with LLM inference managed via the SGLang Zheng et al. (2024) inference engine. We used Llama-3.1-70B-Instruct Dubey et al. (2024) as the LLM.

Query Generation Methods. For non-nested query generation, SPARTA's default is Execution-Guided generation: the LLM writes each clause in canonical order, executes the partial query, and immediately edits any clause that empties the result. As ablations we also evaluate (i) One-Shot, which emits the whole query from the schema only, and (ii) Clause, which builds the query sequentially without execution feedback. For nested-query generation, SPARTA's default is Post-Order+Prov: validated leaves are wrapped one predicate at a time (post-order); each new predicate is executed immediately and, when the result is empty, repaired with provenance feedback.
Ablations include (i) One-Shot–k, which inserts all k predicates in a single pass with no intermediate checks, and (ii) Post-Order (no provenance), which follows the same post-order construction without provenance-based repair. We generate 500 non-nested and 600 nested SQL queries per method on the NBA domain (configuration as in Table 10), so that quality and cost can be compared on an equal footing.

Table–Text QA Methods. To gauge how current state-of-the-art systems break down under SPARTA's deeper hops, larger tables, and advanced analytical operations, we evaluate SOTA Table–Text QA methods, including methods based on prompting LLMs such as ODYSSEY Agarwal et al. (2025) and HProPro Shi et al. (2024). These models have shown strong results on HybridQA, where models reason over provided tables and linked documents. ODYSSEY constructs a graph from the table and linked documents, enabling the LLM to traverse the graph for query answers. HProPro generates and executes programs via the LLM to produce query responses. Since existing Table–Text QA methods are not originally designed to support uni-modal hops, we apply minimal extensions to enable such behavior during evaluation on SPARTA. Specifically, for ODYSSEY, we augment the hybrid graph by adding edges between matching cells of columns that share a join relationship. For HProPro, we adapt the prompt format by replacing the input table with a list of relevant tables. For a fully end-to-end scenario in which no oracle is provided, we pair the Table–Text QA methods with HELIOS Park et al. (2025), the top retriever on OTT-QA, so the model must both retrieve evidence and reason over it. We also run every method with GPT-5 and GPT-3.5-turbo backbones to test LLM sensitivity.

4.2 Benchmark Generation Cost and Query Naturalness

A scalable benchmark must maximise useful queries while minimising LLM calls and wall time.
We therefore track seven complementary metrics in Table 2.

Table 2: Cost metrics used for benchmark generation.

| Metric | Definition |
|---|---|
| Success-Q | # of non-nested queries that execute without error and return at least one row. |
| Exec-Err | # of statements that fail at parse or runtime, revealing schema or logic errors. |
| Empty-Q | # of syntactically valid queries that return zero rows because predicates are too restrictive. |
| Duplicate-Q | # of queries whose result duplicates a previously generated query, reducing diversity. |
| Ideal Calls | # of LLM invocations required if every step succeeds on the first attempt (baseline cost). |
| Total Calls | # of actual LLM invocations, i.e., Ideal Calls plus extra calls for provenance-guided fixes or other retries. |
| Wall Time | Total wall-clock time to obtain all successful queries. |

Table 3 summarizes generation overheads for both non-nested and nested SQL. For non-nested queries, Execution-Guided is most economical, needing only 1,134 total LLM calls (just 7.2% above the ideal 1,058) and finishing in 2,466 s. In contrast, One-Shot begins with the lowest ideal budget (500 calls) but produces 60 empty and 1,265 duplicate outputs, inflating usage to 1,830 real calls (266% above the ideal) and incurring the highest latency; Clause mitigates these failures yet still exceeds its ideal by 24.9%. For nested queries, Post-Order+Prov is most cost-effective, completing with 4,722 calls in 26,278 s, cutting call volume by 42.8% versus vanilla post-order and by 66.2% versus One-Shot–k. These results show that disciplined post-order construction combined with provenance-driven repair minimizes redundant generations while ensuring executable, semantically plausible SQL; a detailed analysis of generation overheads across varying query graph shapes and sizes is provided in Appendix F.

To assess the realism of the generated SQL queries, we employ a scoring-based evaluation framework combining automatic and human assessments.
Each query is rated from 1 (least natural) to 5 (most natural) across three dimensions: Relevance, which measures alignment with the genuine curiosity of a typical person; Specificity & Clarity, which assesses whether the query expresses a clear and well-scoped information need; and Overall Naturalness, which combines the above criteria to decide whether the query is likely to be asked by a real person. For a comprehensive assessment, we conduct an automatic evaluation (auto-eval) using ChatGPT-4o OpenAI and an independent human evaluation (human-eval) by three external CS graduate students with SQL/schema literacy. As a baseline for comparison, we also evaluate queries generated by template filling with randomly sampled column–value pairs.

Table 3: Generation Cost Comparison of Query Generation Methods.

Method                      Success-Q  Empty-Q  Duplicate-Q  Exec-Err  Ideal Calls  Total Calls  Wall Time (s)
Non-nested Query Generation
One-Shot                    500        60       1265         5         500          1830         4256.96
Clause                      500        51       78           0         1053         1316         3218.83
Execution-Guided            500        0        27           0         1058         1134         2466.47
Nested Query Generation
One-Shot–k                  600        0        0            0         2664         13962        115316.67
Post-Order (no provenance)  600        0        0            0         3104         8253         38867.40
Post-Order+Prov             600        0        0            0         3074         4722         26277.87

Figure 4: Comparison of Query Naturalness for Different Generation Methods. (Bar charts report Relevance, Specificity & Clarity, and Overall Naturalness, each on a 1–5 scale, for non-nested and nested queries under both auto-eval and human-eval; methods compared include Template, One-shot(-k), Clause, Execution-Guided, Post-order (no provenance), and Post-order+Prov.)
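The 1–5 rubric above can be operationalized as a simple aggregation of per-dimension scores over raters; a minimal sketch (the helper and the example rater scores are illustrative, not actual SPARTA evaluation data):

```python
# Sketch: averaging 1-5 naturalness ratings per dimension across raters,
# following the three criteria defined in the text. Scores are made up.
from statistics import mean

DIMENSIONS = ("Relevance", "Specificity & Clarity", "Overall Naturalness")

def aggregate_ratings(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension's 1-5 score over all raters of one query."""
    for r in ratings:
        assert all(1 <= r[d] <= 5 for d in DIMENSIONS), "scores must be in 1..5"
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}

# Three hypothetical raters scoring a single generated query:
scores = aggregate_ratings([
    {"Relevance": 4, "Specificity & Clarity": 5, "Overall Naturalness": 4},
    {"Relevance": 5, "Specificity & Clarity": 4, "Overall Naturalness": 4},
    {"Relevance": 4, "Specificity & Clarity": 4, "Overall Naturalness": 5},
])
print(round(scores["Overall Naturalness"], 2))  # 4.33
```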
This dual approach, integrating LLM-based auto-evaluation with human judgment, yields a robust, multi-perspective measure of how convincingly the generated queries mirror real user intent. Figure 4 reports the naturalness scores of queries generated by different methods, evaluated across three criteria. Among the non-nested query generation methods, Execution-Guided Generation consistently achieved the highest scores across both automatic and human evaluations. Specifically, in terms of overall naturalness, it outperformed Clause-by-Clause, One-shot, and Template-based generation by 1.3%, 11.4%, and 37.5%, respectively, in auto-eval; and by 6.0% and 36.7% over One-shot and Template-based methods in human-eval. For nested query generation, Post-order Generation with Execution Guidance achieved the top scores across all three metrics. Compared to Post-order, One-shot Nested, and Template-based generation, it yielded auto-eval improvements of 1.7%, 8.1%, and 123.2%, and human-eval gains of 2.1%, 12.5%, and 117.8%, respectively. These results confirm that LLM-based generation strategies, especially those leveraging clause-wise generation and post-order traversal, are significantly more effective at producing realistic and fluent SQL queries than template-based approaches.

4.3 TABLE-TEXT QA EVALUATION RESULTS

Table 4 and Table 5 report the Table–Text QA performance of representative methods across eight benchmarks, revealing the increased difficulty posed by SPARTA. We evaluate SPARTA under two configurations: (1) SPARTA (Oracle), where models are given ground-truth tables and linked passages; and (2) SPARTA (Retrieval), where models must retrieve relevant content from the entire corpus. On SPARTA (Oracle), ODYSSEY with GPT-5 achieves an average F1 score of 35.6% across all domains, a sharp 33.9-point drop compared to its performance on HybridQA (69.5%).
Similarly, HProPro with GPT-5 achieves an average F1 score of 40.4%, a 30.1-point drop from its HybridQA performance (70.5%). These results reveal the limitations of existing methods when scaled to larger, more complex queries.

Table 4: Table-Text QA Accuracy on SPARTA (Oracle) across multiple domains. Each cell reports EM / F1 / P / R.

Method                    NBA                  Movie                Medical              Avg.                 HybridQA
ODYSSEY w/ GPT-3.5-turbo  9.0/15.1/26.8/14.8   20.2/23.9/33.6/24.7  6.7/22.9/33.2/21.3   12.0/20.6/31.2/20.3  32.7/42.2/42.6/44.2
HProPro w/ GPT-3.5-turbo  11.0/13.6/16.4/13.8  22.2/27.8/29.1/29.2  15.5/19.5/20.2/19.7  16.2/20.3/21.9/20.9  21.4/25.3/25.7/26.1
ODYSSEY w/ GPT-5          21.2/28.4/38.4/28.1  20.4/24.2/32.9/24.3  47.5/54.2/60.3/54.2  29.7/35.6/43.9/35.5  55.3/69.5/69.3/73.5
HProPro w/ GPT-5          23.6/33.1/36.2/34.0  36.6/47.1/49.2/48.8  28.1/41.0/43.2/41.6  29.5/40.4/42.9/41.5  59.7/70.5/71.1/73.1

Table 5: Table-Text QA Accuracy on SPARTA (Retrieval) across multiple domains. Each cell reports EM / F1 / P / R.

Method                   NBA                  Movie                Medical              Avg.                 OTT-QA
HELIOS+FiE Reader        4.6/6.9/17.6/6.4     8.6/11.9/23.0/11.6   6.6/16.0/33.0/12.9   6.6/11.6/24.5/10.3   58.6/65.2/66.7/65.2
HELIOS+HProPro w/ GPT-5  14.5/19.0/24.2/18.6  17.4/21.6/28.6/21.7  13.7/27.3/31.3/27.1  15.2/22.6/28.0/22.5  47.7/56.0/57.4/56.5

Interestingly, HProPro with GPT-5 outperforms ODYSSEY on the NBA and Movie domains, which feature tables with thousands of rows (as shown in Table 1), owing to its ability to generate executable programs that directly operate over tables. This result highlights the limitations of ODYSSEY when applied to large-scale tables and aligns with the broader observation that larger table sizes increase the difficulty of table QA for LLMs Patnaik et al.. The performance gap between GPT-5 and GPT-3.5-turbo (35.6 vs. 20.6 F1 for ODYSSEY and 40.4 vs.
20.3 F1 for HProPro) underscores the importance of advanced LLM reasoning capabilities in handling such challenges. In the retrieval setting, where no gold tables are provided, performance degrades further: the best method (HELIOS + HProPro with GPT-5) attains only 22.6 F1. This sharp decline illustrates the compounded challenge of retrieval and reasoning over heterogeneous corpora. We additionally evaluate the FiE Reader Ma et al. (2023), the state-of-the-art fine-tuned reader model on OTT-QA. While FiE Reader surpasses HELIOS + HProPro w/ GPT-5 by 9.2 points on OTT-QA, it lags behind on SPARTA by 11.0 points, showing that fine-tuned models fail to generalize to SPARTA's more complex, out-of-domain settings.

4.4 ANALYSIS

We conducted a comprehensive analysis of the models' execution results on the SPARTA benchmark. This investigation uncovers several fundamental vulnerabilities in current table-text QA models, pointing to critical directions for future work.

Models struggle to handle complex multi-hop query structures. We evaluate Table–Text QA models under various tree-structured query configurations, fixing the number of edges to four: (Depth 1, Breadth 3), (Depth 2, Breadth 2), and (Depth 3, Breadth 1). We also included intermediate shapes with three edges, such as (Depth 1, Breadth 2) and (Depth 2, Breadth 1), to further validate the trend. As shown in Figure 5a, model performance degrades sharply as either depth or breadth increases. At fixed depth, expanding breadth from (Depth 1, Breadth 1) to (Depth 1, Breadth 3) reduces HProPro and ODYSSEY by 25.2% and 27.5%, respectively. At fixed breadth, increasing depth from (Depth 1, Breadth 1) to (Depth 3, Breadth 1) yields 47.2% and 49.9% declines.
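To make the (Depth, Breadth) terminology concrete, a minimal sketch of the two extreme tree shapes rendered as nested-SQL skeletons (the table and column names t0, t1, ..., ref_id, c1, ... are illustrative placeholders, not schema from the benchmark):

```python
# Sketch: chain vs. star query trees. Depth counts levels of nesting along
# one path; breadth counts sibling nested predicates in the outer block.

def chain_query(depth: int) -> str:
    """(Depth=d, Breadth=1): each nested predicate wraps the previous one."""
    query = f"SELECT id FROM t{depth}"
    for i in range(depth - 1, -1, -1):
        query = f"SELECT id FROM t{i} WHERE ref_id IN ({query})"
    return query

def star_query(breadth: int) -> str:
    """(Depth=1, Breadth=b): b sibling nested predicates in the outer block."""
    preds = " AND ".join(
        f"c{i} IN (SELECT id FROM t{i})" for i in range(1, breadth + 1)
    )
    return f"SELECT id FROM t0 WHERE {preds}"

# Both shapes below contain three nested predicates (edges), mirroring the
# (Depth 3, Breadth 1) and (Depth 1, Breadth 3) configurations above.
print(chain_query(3))
print(star_query(3))
```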
Additional comparisons, from (Depth 2, Breadth 1) to (Depth 2, Breadth 2) and from (Depth 1, Breadth 2) to (Depth 2, Breadth 2), show consistent degradation, further confirming that both deeper and broader queries cause substantial F1 drops. These findings suggest that existing methods are fundamentally limited in performing tree-structured reasoning over multiple relational paths, regardless of whether complexity arises from depth or breadth.

Models struggle with analytical operations such as grouping and ordering. As shown in Figure 5b, both ODYSSEY and HProPro exhibit consistent performance degradation when advanced analytical clauses are present. For queries that include GROUP BY and HAVING clauses, ODYSSEY attains an F1 score of 35.4, whereas HProPro attains 27.1. When ORDER BY and LIMIT are present, the scores are 31.2 for ODYSSEY and 21.4 for HProPro. Aggregation queries show a similar pattern, yielding 28.4 for ODYSSEY and 37.2 for HProPro. Compared with each model's average F1, these analytical scores are markedly lower, indicating weak numerical reasoning, filtering, and ranking capabilities and exposing fundamental limitations in addressing real-world table–text questions. Notably, ODYSSEY performs worst on aggregation queries, whereas HProPro struggles most with ORDER BY and LIMIT.

Figure 5: Comparison of F1 scores across different configurations. (a) F1 scores across tree configurations (D: Depth, B: Breadth), from single-hop through (D1, B1), (D1, B2), (D1, B3), (D2, B1), (D2, B2), and (D3, B1); (b) F1 scores across analytical operations (Aggregation, GroupBy & Having, OrderBy & Limit). Both panels cover ODYSSEY and HProPro with GPT-5.
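The three analytical clause families probed in Figure 5b can be illustrated on a toy table using the stdlib sqlite3 module (the schema and rows below are invented for illustration; they are not SPARTA data):

```python
# Sketch: GROUP BY/HAVING, ORDER BY/LIMIT, and aggregation queries of the
# kind SPARTA targets, run on a small in-memory table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_game_stats (player TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO player_game_stats VALUES (?, ?, ?)",
    [("A", "X", 30), ("A", "X", 20), ("B", "X", 10), ("C", "Y", 25), ("C", "Y", 35)],
)

# GROUP BY / HAVING: players averaging more than 20 points per game.
group_having = sorted(conn.execute(
    "SELECT player FROM player_game_stats GROUP BY player HAVING AVG(points) > 20"
).fetchall())

# ORDER BY / LIMIT: the single highest-scoring game.
order_limit = conn.execute(
    "SELECT player, points FROM player_game_stats ORDER BY points DESC LIMIT 1"
).fetchall()

# Aggregation: the maximum points scored in any game.
agg = conn.execute("SELECT MAX(points) FROM player_game_stats").fetchone()

print(group_having)  # [('A',), ('C',)]
print(order_limit)   # [('C', 35)]
print(agg)           # (35,)
```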
Table 6: Performance Comparison With and Without Text Data in Table-Text QA.

Method            Setting                     EM    F1    P     R
ODYSSEY w/ GPT-5  Table-Text Cross Reasoning  23.9  28.6  36.3  28.4
                  Table-only Reasoning        32.0  39.2  49.5  38.8
HProPro w/ GPT-5  Table-Text Cross Reasoning  11.9  16.3  17.2  16.7
                  Table-only Reasoning        29.2  45.2  46.9  46.4

Performance drops sharply when unstructured text is required. As shown in Table 6, requiring table-text cross reasoning leads to a significant decline in performance. HProPro's F1 score drops by 63.9% (from 45.2 to 16.3), while ODYSSEY shows a smaller but still substantial drop of 27.0% (from 39.2 to 28.6). This sharp contrast highlights the difficulty of reasoning over unstructured passages in conjunction with structured tables. Although both models perform moderately well when queries rely only on tabular data, they consistently fail to retrieve and integrate relevant textual spans when external context is present. These failures indicate that current Table–Text QA models lack robust cross-modal alignment and semantic grounding, limiting their effectiveness in real-world scenarios that demand joint reasoning over heterogeneous data sources. To further support our findings, we include supplementary analyses in the appendix: an ablation study on nesting types (Appendix G) and an error case analysis (Appendix J).
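The EM and F1 figures reported throughout this section follow the standard token-overlap metrics for extractive QA. A minimal sketch of the conventional SQuAD-style computation (this is the customary definition, not the SPARTA evaluation code itself):

```python
# Sketch: exact match and token-level F1 between a predicted and gold answer.
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("LeBron James", "lebron james"))   # True
print(round(token_f1("LeBron James", "James"), 2))   # 0.67
```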
5 CONCLUSION

In summary, we present SPARTA, a benchmark generation framework that rectifies three critical shortcomings of existing Table–Text QA resources, namely shallow question design, annotation noise, and toy-scale tables, by (i) unifying heterogeneous evidence inside a reference fact database, (ii) generating logically deep, human-like nested SQL queries whose hop count and analytical operations are explicitly controlled through a provenance-guided LLM pipeline, and (iii) verbalizing them into natural-language questions using an LLM-based SQL-to-text model, with lightweight human validation for fluency and correctness. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning.

6 LIMITATIONS AND FUTURE WORK

SPARTA currently focuses on the Table–Text setting. Future work will extend it to multimodal inputs such as images and videos by using vision–language models to summarize visuals into atomic statements, normalizing them into grounding tables, and merging them with the existing fact database. Since these tuples follow the same schema as D, the query-generation pipeline (§3.3) applies unchanged. A complete multimodal extension, including dataset collection, schema design, and evaluation, is planned for future research.

7 ACKNOWLEDGMENTS

This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00517736, 30%), the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00509258, Global AI Frontier Lab, 50%) (No.
RS-2024-00454666, Developing a Vector DB for Long-Term Memory Storage of Hyperscale AI Models, 10%), and the Basic Science Research Program through the National Research Foundation of Korea, Ministry of Education (No. RS-2024-00415602, 10%).

Reproducibility Statement. We include prompt examples for provenance-based refinement, realistic-structure enforcement, and automatic naturalness evaluation in Appendix N. Details of the experimental setup are provided in Section 4.1.

REFERENCES

NBA salaries. https://data.world/datadavis/nba-salaries. Accessed: 2025-05-15.

NBA games. https://www.kaggle.com/datasets/nathanlauga/nba-games, a. Accessed: 2025-05-15.

Historical NBA finals and MVP results. https://www.kaggle.com/datasets/thedevastator/historical-nba-finals-and-mvp-results, b. Accessed: 2025-05-15.

IMDB movies analysis - SQL. https://www.kaggle.com/datasets/gauravbr/imdb-movies-data-erd, c.

Hospital management dataset. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset, d.

NBA dataset project. https://www.kaggle.com/datasets/kareemignacio/nba-dataset-project, e. Accessed: 2025-05-15.

1991-2021 NBA stats. https://www.kaggle.com/datasets/vivovinco/19912021-nba-stats, f. Accessed: 2025-05-15.

NBA/ABA/BAA team stats per game. https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats?select=Team+Stats+Per+Game.csv, g. Accessed: 2025-05-15.

Ankush Agarwal, Chaitanya Devaguptapu, and Ganesh S. Hybrid graphs for table-and-text based question answering using LLMs. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 858–875, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. URL https://aclanthology.org/2025.naacl-long.39/.
Ali Al Lawati, Jason Lucas, and Prasenjit Mitra. Semantic captioning: Benchmark dataset and graph-aware few-shot in-context learning for sql2text. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 8026–8042, 2025.

Nicole Bidoit, Melanie Herschel, and Katerina Tzompanaki. Query-based why-not provenance with NedExplain. In Extending Database Technology (EDBT), 2014.

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. InPars: Unsupervised dataset generation for information retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2387–2392, 2022.

Adriane Chapman and HV Jagadish. Why not? In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 523–534, 2009.

Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Wang, and William W Cohen. Open question answering over tables and text. arXiv preprint arXiv:2010.10439, 2020a.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. HybridQA: A dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347, 2020b.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3697–3711, 2021.

Benjamin Dietrich, Tobias Müller, and Torsten Grust. Data provenance for recursive SQL queries. In Proceedings of the 14th International Workshop on the Theory and Practice of Provenance, pp. 1–8, 2022.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al.
The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. The LDBC social network benchmark: Interactive workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp. 619–630, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450327589. doi: 10.1145/2723372.2742786. URL https://doi.org/10.1145/2723372.2742786.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Summarizing source code using a neural attention model. In 54th Annual Meeting of the Association for Computational Linguistics 2016, pp. 2073–2083. Association for Computational Linguistics, 2016.

Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. InPars-v2: Large language models as efficient dataset generators for information retrieval. arXiv preprint arXiv:2301.01820, 2023.

Hyeonji Kim. A Systematic Data Augmentation and Cleaning Technique for Improving Text-to-SQL. Ph.D. dissertation, Pohang University of Science and Technology, August 2024. URL http://www.dcollection.net/handler/postech/200000807221.

Soyeon Kim, Jindong Wang, Xing Xie, and Steven Euijong Whang. Harnessing temporal databases for systematic evaluation of factual time-sensitive question-answering in LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=W7RNxsTKKZ.

Won Kim. On optimizing an SQL-like nested query. ACM Transactions on Database Systems (TODS), 7(3):443–469, 1982.

Seokki Lee, Sven Köhler, Bertram Ludäscher, and Boris Glavic. A SQL-middleware unifying why and why-not provenance for first-order queries. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 485–496. IEEE, 2017.
Kaixin Ma, Hao Cheng, Yu Zhang, Xiaodong Liu, Eric Nyberg, and Jianfeng Gao. Chain-of-skills: A configurable model for open-domain question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1599–1618, 2023.

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs, 2024.

Raghunath Othayoth Nambiar and Meikel Poess. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pp. 1049–1058. VLDB Endowment, 2006.

Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, Xing Xie, and Steven Whang. ERBench: An entity-relationship based automatically verifiable hallucination benchmark for large language models. Advances in Neural Information Processing Systems, 37:53064–53101, 2024.

Reham Omar, Omij Mangukiya, and Essam Mansour. Dialogue benchmark generation from knowledge graphs with cost-effective retrieval-augmented LLMs. Proceedings of the ACM on Management of Data, 3(1):1–26, 2025.

OpenAI. ChatGPT via chat completions API. URL https://platform.openai.com/docs/models/chatgpt-4o-latest.

Abdelghny Orogat and Ahmed El-Roby. SmartBench: Demonstrating automatic generation of comprehensive benchmarks for question answering over knowledge graphs. Proceedings of the VLDB Endowment, 15(12):3662–3665, 2022.

Abdelghny Orogat and Ahmed El-Roby. Maestro: Automatic generation of comprehensive benchmarks for question answering over knowledge graphs. Proceedings of the ACM on Management of Data, 1(2):1–24, 2023.

Sungho Park, Joohyung Yun, Jongwuk Lee, and Wook-Shin Han. HELIOS: Harmonizing early fusion, late fusion, and LLM reasoning for multi-granular table-text retrieval.
In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 32424–32444, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1559. URL https://aclanthology.org/2025.acl-long.1559/.

Sohan Patnaik, Heril Changwal, Milan Aggarwal, Sumit Bhatia, Yaman Kumar, and Balaji Krishnamurthy. CABINET: Content relevance-based noise reduction for table question answering. In The Twelfth International Conference on Learning Representations.

Qi Shi, Han Cui, Haofeng Wang, Qingfu Zhu, Wanxiang Che, and Ting Liu. Exploring hybrid question answering via program-based prompting. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11035–11046, 2024.

Kai Sun, Yifan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong. Head-to-tail: How knowledgeable are large language models (LLMs)? a.k.a. will LLMs replace knowledge graphs? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 311–325, 2024.

Xueqing Wu, Jiacheng Zhang, and Hang Li. Text-to-table: A new way of information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2518–2533, 2022.

Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, and Vadim Sheinin. SQL-to-text generation with graph-to-sequence model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 931–936, 2018.

Lingxi Zhang, Jing Zhang, Xirui Ke, Haoyang Li, Xinmei Huang, Zhonghui Shao, Shulin Cao, and Xin Lv. A survey on complex factual question answering.
AI Open, 4:1–12, 2023.

Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6588–6600, 2022.

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing knowledge editing in language models via multi-hop questions, 2024. URL https://arxiv.org/abs/2305.14795.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3277–3287, 2021.

A CROSS-DATASET ANNOTATION AUDIT

Table 7: Audit of 100 randomly sampled dev examples from each dataset. "Any error" shows the fraction of all samples containing at least one error. The breakdown columns report the relative distribution among erroneous samples.

Dataset      Any error        Error breakdown (within erroneous samples)
             (% of samples)   Redundant modality  Incomplete answer set  Incorrect / unanswerable
HybridQA     21%              52.4%               23.8%                  23.8%
MultiHiertt  26%              15.4%               30.8%                  53.8%
TAT-QA       30%              96.7%               0.0%                   3.3%
FinQA        17%              41.2%               0.0%                   58.8%
SPARTA       0%               0.0%                0.0%                   0.0%
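Because the breakdown columns in Table 7 are relative to the erroneous subset only, converting them into absolute counts per 100 audited samples takes one extra multiplication. A small sketch (the helper function is ours; the figures are taken from the HybridQA row of Table 7):

```python
# Sketch: turn Table 7's within-error percentages into absolute counts.
# any_error_pct is relative to all samples; breakdown_pct is relative to
# the erroneous subset.

def absolute_counts(n_samples: int, any_error_pct: float,
                    breakdown_pct: dict[str, float]) -> dict[str, float]:
    n_erroneous = n_samples * any_error_pct / 100.0
    return {kind: n_erroneous * pct / 100.0 for kind, pct in breakdown_pct.items()}

hybridqa = absolute_counts(100, 21.0, {
    "redundant_modality": 52.4,
    "incomplete_answer_set": 23.8,
    "incorrect_unanswerable": 23.8,
})
print({k: round(v, 1) for k, v in hybridqa.items()})
# roughly 11 redundant-modality and 5 each of the other two error kinds,
# out of 21 erroneous samples
```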
B SUPPORTED NESTED QUERY PATTERNS

SPARTA synthesizes queries for each of the four primary nesting patterns Kim (1982) commonly observed in real-world SQL, as illustrated in Figure 6.

Table 8: Nested-query patterns.

Type     Inner aggreg.  Correlation  Typical intent / example
Type-N   no             no           Pure set membership. The outer block tests whether a value belongs to the set returned by a non-correlated subquery (e.g., WHERE x IN (SELECT ...)).
Type-A   yes            no           Aggregate comparison. The inner block computes an aggregate such as AVG or MAX, and the result is compared with each outer tuple (e.g., salary > (SELECT AVG(salary) FROM ...)).
Type-J   no             yes          Correlated filtering. The inner query references attributes of the outer block without aggregation (e.g., EXISTS (SELECT 1 FROM Items i WHERE i.order_id = o.id)).
Type-JA  yes            yes          Correlated aggregate comparison. The inner query both correlates with the outer block and aggregates its own rows before the comparison (e.g., EXISTS (SELECT 1 FROM Items i WHERE i.order_id = o.id GROUP BY ... HAVING SUM(i.qty) > o.limit)).

Figure 6: Four primary nesting patterns, i.e., type-(N, A, J, JA) queries of depth 1. Each consists of an outer block (T_i) and an inner block (T_j). Arcs labeled 'A' indicate aggregation in the inner SELECT; straight arcs 'N' denote set-inclusion predicates; curved arcs denote join predicates.

C SCHEMA OF REFERENCE-FACT DATABASE

Figure 7: Schemas of the reference-fact databases used in SPARTA across three domains: (a) NBA, (b) Movie, (c) Medical.
Each database consists of two complementary types of tables: source tables S_T (orange) from public datasets (e.g., NBA player salaries, movie metadata, medical records) and grounding tables G_T (green) encoding atomic facts extracted from textual passages.

D TABLE-LEVEL STATISTICS OF THE REFERENCE-FACT DATABASE

Table 9: Row/Column Statistics of All Tables in the SPARTA Benchmark.

Domain   Table Name               # Columns  # Rows
nba      nba_draft_combine_stats  35         772
nba      nba_player_information   14         4596
nba      nba_player_award         3          236
nba      nba_champion_history     8          69
nba      nba_player_affiliation   4          13980
nba      nba_team_information     9          30
nba      nba_player_game_stats    21         45640
nba      nba_team_game_stats      12         12750
nba      nanba_game_informtaion   7          6665
imdb     genre                    2          14418
imdb     ratings                  4          7872
imdb     movie                    8          7872
imdb     role_mapping             4          15336
imdb     director_mapping         3          3800
imdb     names                    5          25617
medical  treatments               6          200
medical  billing                  7          200
medical  appointments             7          200
medical  patients                 7          50
medical  doctors                  5          10

Table 9 provides an overview of the structural statistics for all tables in the SPARTA benchmark, including the number of columns and rows per table, grouped by domain (NBA, IMDB, and Medical). These metrics highlight the scale and diversity of the reference-fact database used for evaluation.

E BENCHMARK CONFIGURATION

Table 10: Benchmark Configuration: SQL Operator Coverage, Query-Shape/Size Distribution.

Query shape and size distribution (%):
Non-nested 45.5 | (Depth 1, Breadth 1) 9.1 | (Depth 1, Breadth 2) 9.1 | (Depth 1, Breadth 3) 9.1 | (Depth 2, Breadth 1) 9.1 | (Depth 2, Breadth 2) 9.1 | (Depth 3, Breadth 1) 9.1 | Total 100.0

SQL operator presence (%):
WHERE 100.0 | GROUP BY 15.3 | HAVING 3.4 | ORDER BY 7.7 | LIMIT 4.5 | AGGREGATION 50.0

Nested predicate type presence in nested queries (%):
Type-N 57.8 | Type-A 64.3 | Type-J 32.4 | Type-JA 15.2

F GENERATION COST ANALYSIS
F.1 COST ANALYSIS ACROSS LLM SCALES

To demonstrate that our provenance-based refinement is effective regardless of the LLM's size, we conducted additional experiments comparing generation costs across LLMs of varying sizes. Specifically, in addition to the Llama-3.1-70B-Instruct model evaluated in the manuscript, we measured generation costs using a smaller-parameter LLM (gpt-oss-20B) and a larger-parameter LLM (gpt-oss-120B). Table 11 shows that Post-Order+Prov is the most cost-effective approach across all LLM variants, completing with 4,854, 4,722, and 3,831 calls for the respective models, while cutting call volume by 18.8%, 42.8%, and 54.5% versus vanilla Post-Order, and by 64.7%, 66.2%, and 65.7% versus One-Shot–k. These results indicate that disciplined post-order construction combined with provenance-driven repair minimizes redundant generations independent of the LLM's scale. Note that the "Ideal Calls" metric represents the number of LLM calls required if every step succeeds. It varies slightly due to probabilistic clause inclusion in SQL query generation (based on predefined per-clause probabilities). As shown in the table, the variance is very small when we repeat experiments three times per method.

Table 11: Generation cost comparison across LLM sizes for three refinement strategies.

LLM                     Method                      Ideal Calls  Total Calls
gpt-oss-20B             One-Shot–k                  2661         13736
                        Post-Order (no provenance)  3073         5977
                        Post-Order+Prov             3146         4854
Llama-3.1-70B-Instruct  One-Shot–k                  2664         13962
                        Post-Order (no provenance)  3104         8253
                        Post-Order+Prov             3074         4722
gpt-oss-120B            One-Shot–k                  2621         11225
                        Post-Order (no provenance)  3152         8425
                        Post-Order+Prov             3145         3831

F.2 COST ANALYSIS ON QUERY SHAPE AND SIZE

Figure 8 contrasts LLM usage for star and chain query trees as their size grows from one to three nested predicates.
Here, star queries fix depth to 1 while increasing breadth (number of branches), whereas chain queries fix breadth to 1 while increasing depth (number of nested levels). The size thus reflects how many nested predicates are added along either the breadth or depth dimension. For star shapes, one-shot generation quickly becomes prohibitive, ballooning to 17x the ideal call count when the hub size reaches three. Building the same queries in post-order slashes that overhead to 3.2x; provenance repair trims it further to 1.6x. Chains tell a different story: because their natural construction order already matches post-order, one-shot and post-order costs are similar, yet provenance still removes 30–40% of redundant calls at every depth. Branching structures profit most from post-order generation, while provenance-guided repair is a universally cheap "insurance policy" that cuts waste regardless of query shape. Beyond structural complexity, we also measured the average number of LLM calls required to generate a single query as the number of accessed tables increases. Table 12 shows increases of +2.6 calls (from 1 to 2 tables), +3.9 (from 2 to 3), and +2.2 (from 3 to 4), indicating near-linear growth.

G ABLATION STUDY ON NESTING TYPES

For a deeper analysis, we examined the best-performing system on the SPARTA benchmark, HProPro with GPT-5, by breaking down its performance across query-nesting types: plain nesting (N), nesting with aggregates (A), nesting with correlated joins (J), and nesting that combines correlated joins and aggregates (JA). Results are shown below.

Figure 8(a): LLM call count as star-query size increases. (Excess over ideal calls for sizes 1–3: One-shot +42%, +381%, +1641%; Post-order (no provenance) +41%, +147%, +217%; Post-order+Prov +20%, +43%, +60%.)
Figure 8(b): LLM call count as chain-query size increases. (Excess over ideal calls for sizes 1–3: One-shot +42%, +303%, +386%; Post-order (no provenance) +41%, +307%, +389%; Post-order+Prov +20%, +141%, +160%.)

Figure 8: Generation cost for varying (a) star-query size and (b) chain-query size. Sky-blue bars mark ideal LLM calls, and the labels above each bar give the actual excess percentage.

Table 12: Average LLM calls per query as the number of accessed tables increases.

# of Accessed Tables  Average LLM Calls per Query
1                     2.3
2                     4.9 (+2.6)
3                     8.8 (+3.9)
4                     11.0 (+2.2)

As shown in Table 13, F1 falls steadily as structural (correlated joins) and analytical (aggregates) complexity increases, with the largest drop when both factors are present (Type JA). This ablation study underscores that correlated joins and aggregates are the model's primary pain points.

H ANALYSIS ON NEGATION AND RANGE REASONING

To better understand which logical operators and conditions most challenge SPARTA models, we conduct an ablation study focusing on two key query categories: negation and numeric range conditions. These categories capture a large portion of structural breadth in SPARTA and represent frequent sources of model errors. We evaluate HProPro with GPT-5 on queries containing explicit negation operators (NOT LIKE, NOT EXISTS, NOT IN, <>) as well as numeric range operators (>, <, >=, <=). Table 14 presents the results. Both categories show notable degradation compared to the overall SPARTA score, with negation queries dropping by 28.3% and range queries by 18.6%. These findings indicate that logical negation and numeric range reasoning remain significant bottlenecks. In addition to this quantitative breakdown, we selected representative samples from both query types and conducted a qualitative error analysis, shown in Figure 9.
As illustrated in Figure 9(a), negation reasoning presents a recurring challenge. The example query requires identifying players for whom there is no record indicating that they (i) are a Center with height > 90, (ii) were born after 1970, and (iii) were drafted "9th overall." All three criteria fall under the scope of a single negated condition. However, the model applies negation only to the first clause ("Center with height > 90") while incorrectly treating the remaining clauses ("born after 1970," "drafted 9th overall") as independent positive filters. This partial scoping of negation leads the model to misinterpret the logical structure and include players who should have been excluded.

Table 13: F1 scores of HProPro (w/ GPT-5) across different nesting types. Percentage values indicate relative change from the overall average (34.5).

Nesting Type | F1 (HProPro w/ GPT-5)
Type-N | 40.0 (+15.9%)
Type-A | 33.7 (−2.3%)
Type-J | 30.8 (−10.7%)
Type-JA | 25.6 (−25.8%)
Total | 34.5

Table 14: F1 scores of HProPro (w/ GPT-5) on negation and range queries.

Query Category | F1 (HProPro w/ GPT-5)
Negation (NOT LIKE, NOT EXISTS, NOT IN, <>) | 28.7 (−28.3%)
Range (>, <, >=, <=) | 32.9 (−18.6%)
SPARTA (Oracle) | 40.4

A similar pattern is observed for range reasoning, as shown in Figure 9(b). The gold query requires identifying teams whose arena capacity exceeds the maximum capacity among teams that (i) were founded before 1970 and (ii) have capacities between 20,000 and 21,711. Although the model correctly computes the maximum capacity among pre-1970 teams, it fails to apply the upper-bound constraint (arena_capacity < 21711) during the final filtering stage. As a result, the predicted code returns teams that satisfy the lower-bound conditions and the dynamic threshold but violate the required upper-bound range condition, producing an incorrect answer.
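The negation-scoping failure reduces to a one-line logic error. In the hedged sketch below (field names follow the Figure 9(a) example, but rows are simplified to plain dicts), the gold filter negates the conjunction of all three criteria, whereas the model negates only the first clause and keeps the other two as positive filters:

```python
def gold_keep(row) -> bool:
    # Correct: a single negation scoping all three criteria together.
    return not (row["position"] == "Center" and row["height"] > 90
                and row["birthyear"] > 1970
                and row["draft_pick"] == "9th overall")

def predicted_keep(row) -> bool:
    # The model's error: negation applied only to the first clause; the
    # remaining criteria become independent *positive* filters.
    if row["position"] == "Center" and row["height"] > 90:
        return False
    return row["birthyear"] > 1970 and row["draft_pick"] == "9th overall"
```

A guard born in 1965 who was never drafted 9th overall matches none of the negated criteria, so the gold filter keeps him, while the predicted filter wrongly drops him for failing the "positive" birth-year and draft conditions.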
I ABLATION STUDY ON ROBUSTNESS TO LINGUISTIC VARIABILITY

We further study the robustness of Table-Text QA models to linguistic variability by evaluating performance on human-verified rephrased questions. Specifically, we sampled 100 queries from our benchmark and had them manually rephrased by human annotators, ensuring the core semantic meaning and the correct answer were preserved. We then evaluated the HProPro model with GPT-5 on both the original and the rephrased sets of queries.

Table 15: F1 scores of HProPro (w/ GPT-5) on original and human-verified rephrased queries.

Query Set | F1 (HProPro w/ GPT-5)
Original Questions | 45.22
Rephrased Questions | 45.02 (−0.44%)

As shown in Table 15, the F1 score dropped from 45.22 on the original queries to 45.02 on the rephrased versions, a negligible decrease of 0.44%. This finding indicates that the model is highly robust to surface-level linguistic variations. We have incorporated the details of this new experiment and its results into Appendix I of our revised manuscript to strengthen our robustness analysis.

Question: What is the maximum number of points scored in a single game by a player who, in that game, recorded more than 5 rebounds and more than 5 steals, whose weight is less than 180 lb, and for whom there is no record indicating that they are a Center with height greater than 90, born after 1970, and drafted '9th overall'?
Answer: 19

Gold Code:
def solve(table) -> str:
    for row in table:
        if not (row['weight'][0] < 180):
            continue
        if ('Center' in row['position'][0] and row['height'][0] > 90
                and row['birthyear'][0] > 1970
                and row['draft_pick'][0] == '9th overall'):
            continue
        passage_links = row['player_name'][1]
        # ... extract the points with >5 rebounds and >5 steals ...
    return max_points

Predicted Code:
def solve(table) -> str:
    max_points = []
    for row in table:
        if not (row['weight'][0] < 180):
            continue
        if 'Center' in row['position'][0] and row['height'][0] > 90:
            continue
        if not (row['birthyear'][0] > 1970 and row['draft_pick'][0] == '9th overall'):
            continue
        passage_links = row['player_name'][1]
        # ... extract the points with >5 rebounds and >5 steals ...
    return max_points

Question: What is the highest team points in a game for any team whose arena capacity is above the maximum arena capacity of any team founded before 1970, with an arena capacity above 20000, and owned by someone other than Paul Allen, and with an arena capacity below 21711, and which scored more than 100 points and rebounded more than 40 times in a game?
Answer: 127

Gold Code:
def solve(table) -> str:
    max_before_1970 = None
    for row in table:
        if (row['founded_year'][0] < 1970) and (row['owner'][0] != 'Paul Allen') \
                and (row['arena_capacity'][0] > 20000 and row['arena_capacity'][0] < 21711):
            if row['arena_capacity'][0] > max_before_1970:
                max_before_1970 = row['arena_capacity'][0]
    for row in table:
        if row['arena_capacity'][0] > max_before_1970 and row['arena_capacity'][0] < 21711:
            passage_link = row['team_name'][1]
            # ... extract the points with >100 points and >40 rebounds ...
    return max_points

Predicted Code:
def solve(table) -> str:
    max_before_1970 = None
    for row in table:
        if (row['founded_year'][0] < 1970) and (row['owner'][0] != 'Paul Allen') \
                and (row['arena_capacity'][0] > 20000 and row['arena_capacity'][0] < 21711):
            if row['arena_capacity'][0] > max_before_1970:
                max_before_1970 = row['arena_capacity'][0]
    for row in table:
        if row['arena_capacity'][0] > max_before_1970:  # missing! upper bound (arena_capacity < 21711)
            passage_link = row['team_name'][1]
            # ... extract the points with >100 points and >40 rebounds ...
    return max_points

Figure 9: Illustration of representative error cases where the model fails to correctly answer.
(a) Negation reasoning error. (b) Range reasoning error.

J ERROR CASE ANALYSIS

We conduct an analysis of the errors encountered by Table-Text QA models on randomly sampled sets of 100 examples each for SPARTA (Oracle) and SPARTA (Retrieval), as illustrated in Figure 10. Representative error types, along with their frequencies and causal interpretations, are summarized below.

Relevant data missing. This was the most frequent category of failure, where the model failed to identify all the necessary information to correctly answer the question. SPARTA poses increased demands for multi-hop reasoning across table and text sources, which existing methods often struggle with:

• Partial retrieval of relevant data: The model identifies only a subset of the necessary sources, resulting in incomplete answers. As illustrated in Figure 11, the model was expected to return both 62 and 53 as the field goal percentages for the Dallas Mavericks and New York Knicks, respectively, but failed to do so.

• Failure to identify relevant data: The model does not identify crucial supporting data, leading to either no answer or an incorrect one. For example, in questions requiring information from both nba_player_information and nba_player_award, the model may access only the former, overlooking the award records and consequently returning an incorrect answer.

Erroneous data analysis. Compared to prior benchmarks, SPARTA introduces more complex analytical requirements that reveal limitations in model capabilities:

• Failure to perform advanced analytical operations: The model struggles with applying operations such as aggregations (e.g., COUNT, MAX) or executing multi-table joins correctly. These operations require precise alignment of relational structures and logic, which is frequently mishandled.
[Figure 10: pie charts of error statistics. (a) Oracle setting: Relevant data missing 48.0% (partial retrieval of relevant data 10.0%; failure to identify relevant data 38.0%), Erroneous data analysis 31.0% (failure to perform advanced analytical operation 13.0%; reading comprehension errors 18.0%), Question misunderstanding 18.0%, Wrong schema linking 3.0%. (b) Retrieval setting: Relevant data missing 38.0% (partial retrieval 5.0%; failure to identify 33.0%), Erroneous data analysis 42.0% (advanced analytical operation 34.0%; reading comprehension 8.0%), Question misunderstanding 20.0%.]

Figure 10: Statistics of errors. For detailed descriptions and examples of each error category, see Appendix J.

• Reading comprehension errors: The model incorrectly interprets textual information, leading to erroneous answers. For instance, in a case where the question asks for the Nuggets' field goal percentage, the model erroneously extracts "37%" from the sentence "Nuggets held Sacramento to just 37 percent from the field," misattributing Sacramento's statistic to the Nuggets. See Figure 11 for a detailed example of this error.

Question misunderstanding. These errors arise from incorrect interpretation of the question intent or constraints. Representative cases include failing to restrict answers to players who played only as point_guard and instead including players who played point_guard along with other positions, misidentifying the relevant time frame (e.g., using 2017 instead of 2016-17), introducing constraints not specified in the question, or omitting key conditions necessary to derive the correct answer.

Schema linking errors. This category involves incorrect associations between the question and schema elements, such as tables or columns.
For instance, when asked to retrieve the name of the head coach, the model fails to identify the headcoach column in the nba_team_information table as relevant, thereby omitting necessary information from the final prediction.

[Figure 11 contents. NBA Team Information table (Team Name / Arena Capacity): Trail Blazers 19,980; Raptors 19,800; Hornets 19,026; Nuggets 19,099; Mavericks 19,200; Knicks 19,763. Linked game-recap passages report field goal percentages: Hornets 57 (defeated Bulls 135-106), Trail Blazers 55 (lost to Timberwolves 108-107), Knicks 53 (defeated Celtics 120-114), Raptors 42 (defeated Hornets 103-98), Mavericks 62 (defeated Lakers 140-106); the Nuggets passage (defeated Kings 94-79) states only that they held Sacramento to 37 percent.]

Question: What are the team field goal percentages of all the teams that have an arena capacity between 19,000 and 20,000 and scored more than 100 points in a game?
Golden Answer: 55, 42, 57, 62, 53
LLM-Predicted Answer: 55, 42, 57, 37
(The predicted "37" misattributes Sacramento's statistic to the Nuggets, an erroneous data analysis; the missing 62 and 53 for the Mavericks and Knicks are relevant data missing.)

Figure 11: Illustration of a representative error case where the model fails to correctly answer.
K SOFTWARE AND DATA LICENSES

The licenses for the software and datasets used in this paper are as follows:

• LLaMA 3.1-70B-Instruct: LLaMA 3.1 license
• OTT-QA: MIT License
• HybridQA: MIT License

All software and datasets were used strictly for research purposes and were not utilized in any non-research contexts, particularly commercial applications.

L AI ASSISTANTS

We used OpenAI's ChatGPT-4o to debug code efficiently, quickly identifying and resolving errors in our implementations. Additionally, we used it to rephrase sentences in our writing to improve clarity and readability.

M REPRESENTATIVE EXAMPLES FROM OUR SPARTA BENCHMARK

Table 16: 20 representative examples from SPARTA, each consisting of a domain, a natural-language question, its corresponding SQL query, and the answer.

Domain: NBA
Question: Which player won the NBA MVP award in the 1986 season?
SQL: SELECT player_name FROM nba_player_award WHERE season = 1986 AND award = 'nba mvp'
Answer: Larry Bird

Domain: NBA
Question: What are the names of the players who scored more than 15 points and rebounded more than 5 times in a game?
SQL: SELECT player_name FROM nba_player_game_stats WHERE number_of_points > 15 AND number_of_rebound > 5
Answer: Langston Galloway, Quincy Acy, Larry Nance Jr., ...

Domain: Movie
Question: In which movies did Riteish Deshmukh act?
SQL: SELECT movie_title FROM role_mapping WHERE category = 'actor' AND name = 'Riteish Deshmukh'
Answer: Marjaavaan, Mauli

Domain: Movie
Question: What is the total number of movies with a median rating greater than 5 and an average rating greater than 5.5?
SQL: SELECT COUNT(movie_title) AS total_movies FROM ratings WHERE median_rating > 5 AND avg_rating > 5.5
Answer: 4877

Domain: NBA
Question: Which Western Conference teams faced the Celtics more than once in the Finals?
SQL: SELECT western_champion_name FROM nba_champion_history WHERE nba_champion_name = 'Celtics' GROUP BY western_champion_name HAVING COUNT(western_champion_name) > 1
Answer: Rockets, Lakers

Domain: Medical
Question: What is the maximum years of experience of a pediatrician at Central Hospital?
SQL: SELECT MAX(years_experience) FROM doctors WHERE hospital_branch = 'Central Hospital' AND specialization = 'Pediatrics'
Answer: 28

Domain: NBA
Question: What is the highest salary of Kevin McHale while playing for the Celtics?
SQL: SELECT MAX(salary) FROM nba_player_affiliation WHERE player_name = 'Kevin McHale' AND team_name = 'Celtics'
Answer: 3,500,000

Domain: NBA
Question: Which Point Guards, drafted between 2000 and 2005, had more than 4 three-pointers, more than 8 field goals, and more than 1 steal in a game?
SQL: SELECT player_name FROM nba_player_game_stats WHERE player_name IN (SELECT player_name FROM nba_player_information WHERE position = 'Point Guard' AND draft_year BETWEEN 2000 AND 2005) AND number_of_three_point_field_goals_made > 4 AND number_of_field_goals_made > 8 AND number_of_steal > 1
Answer: Chris Paul

Domain: NBA
Question: Which NBA players who were drafted in the first round and play the center position have a salary of over 1 million in the 2016-17 season?
SQL: SELECT player_name FROM nba_player_information WHERE player_name IN (SELECT player_name FROM nba_player_affiliation WHERE salary > 1000000 AND season = '2016-17') AND draft_round = '1st round' AND position = 'Center'
Answer: Alex Len, Al Horford, Andre Drummond, ...

Domain: Medical
Question: What are the names of patients who have an appointment with a doctor who works at the Central Hospital and has more than 20 years of experience?
SQL: SELECT patient_name FROM appointments WHERE doctor_name IN (SELECT name FROM doctors WHERE hospital_branch = 'Central Hospital' AND years_experience > 20)
Answer: Alex Smith, Alex Aiden Moore, Emily Miller, ...

Domain: NBA
Question: What are the years of birth of the players who have a lane agility time of more than 11.5 seconds, a three-quarter sprint of less than 3.35 seconds, more than 10 field goals made, and more than 8 rebounds in a game?
SQL: SELECT birthyear FROM nba_player_information WHERE player_name IN (SELECT player_name FROM nba_draft_combine_stats WHERE lane_agility_time > 11.5 AND three_quarter_sprint < 3.35) AND player_name IN (SELECT player_name FROM nba_player_game_stats WHERE number_of_field_goals_made > 10 AND number_of_rebound > 8) GROUP BY birthyear
Answer: 1984, 1985, 1989, ...

Domain: Movie
Question: Which movies, starring Vincent D Onofrio as an actor, have an average rating greater than 5 and a median rating of 6, excluding 'Kolonya Cumhuriyeti'?
SQL: SELECT title FROM movie WHERE title IN (SELECT movie_title FROM role_mapping WHERE category = 'actor' AND name = 'Vincent D Onofrio' AND movie_title = title) AND title IN (SELECT movie_title FROM ratings WHERE avg_rating > 5 AND median_rating = 6 AND movie_title <> 'Kolonya Cumhuriyeti')
Answer: CHIPS, In Dubious Battle

Domain: NBA
Question: Who are the top 5 centers drafted in the 1st round, who have won the DPOY award after 2000 and who have earned more than 2 million dollars in the 2004-05 season, sorted by their draft year in descending order and birth year in ascending order?
SQL: SELECT player_name FROM nba_player_information WHERE player_name IN (SELECT player_name FROM nba_player_affiliation WHERE salary > 2000000 AND season = '2004-05') AND player_name IN (SELECT player_name FROM nba_player_award WHERE season > 2000 AND award = 'dpoy') AND position = 'Center' AND draft_round = '1st round' ORDER BY draft_year DESC, birthyear ASC LIMIT 5
Answer: Dwight Howard, Ben Wallace, ...

Domain: Medical
Question: Find the addresses of male patients born after January 1, 1980, who have MedCare Plus insurance and have made payments that exceed the average of failed payments greater than 2500.
SQL: SELECT address FROM patients WHERE name IN (SELECT patient_name FROM billing WHERE amount > (SELECT AVG(amount) FROM billing WHERE payment_status = 'Failed' AND amount > 2500)) AND date_of_birth > '1980-01-01' AND gender = 'M' AND insurance_provider = 'MedCare Plus'
Answer: 123 Elm St, 789 Pine Rd, ...

Domain: NBA
Question: Which NBA players, who are centers and taller than the average height of point guards drafted after 1990, have more than 8 rebounds in a game?
SQL: SELECT player_name FROM nba_player_game_stats WHERE player_name IN (SELECT player_name FROM nba_player_information WHERE height > (SELECT AVG(height) FROM nba_player_information WHERE position = 'Point Guard' AND draft_year > 1990) AND position = 'Center') AND number_of_rebound > 8
Answer: Alex Len, Al Horford, Andre Drummond, ...

Domain: Medical
Question: What are the names of female patients who registered after 2021-09-02 and have billed amounts greater than the average amount of failed payments over 2500?
SQL: SELECT name FROM patients WHERE name IN (SELECT patient_name FROM billing WHERE amount > (SELECT AVG(amount) FROM billing WHERE payment_status = 'Failed' AND amount > 2500)) AND gender = 'F' AND registration_date > '2021-09-02'
Answer: Emily Jones, Laura Aiden Davis, ...
Domain: Movie
Question: How many movies starring John Abraham have a median rating above 5 and an average rating above 4?
SQL: SELECT COUNT(title) AS number_of_movies FROM movie WHERE title IN (SELECT movie_title FROM role_mapping WHERE category = 'actor' AND name = 'John Abraham' GROUP BY movie_title) AND title IN (SELECT movie_title FROM ratings WHERE median_rating > 5 AND avg_rating > 4)
Answer: 1

Domain: NBA
Question: What is the maximum height of the Lakers players who play as center, were drafted after 1995, and have a salary greater than the highest salary of the Suns and greater than 20,000,000?
SQL: SELECT MAX(height) FROM nba_player_information WHERE player_name IN (SELECT player_name FROM nba_player_affiliation WHERE salary > (SELECT MAX(salary) FROM nba_player_affiliation WHERE team_name = 'Suns') AND salary > 20000000 AND team_name = 'Lakers') AND position = 'Center' AND draft_year > 1995
Answer: 84

Domain: NBA
Question: What are the names of the teams that scored more than the highest points scored by the Thunder when they scored more than 25 points in the first quarter, and scored more than the highest points scored by teams that scored more than 100 points and had a three-point field goal percentage of more than 30, and have an arena capacity of more than 20,000 and are not the Pistons?
SQL: SELECT team_name FROM nba_team_game_stats WHERE team_points > (SELECT MAX(team_points) FROM nba_team_game_stats WHERE team_name = 'Thunder' AND team_points_in_quarter1 > 25) AND team_points > (SELECT MAX(team_points) FROM nba_team_game_stats WHERE team_points > 100 AND team_percentage_of_three_point_field_goal_made > 30) AND team_name IN (SELECT team_name FROM nba_team_information WHERE arena_capacity > 20000 AND team_name <> 'Pistons')
Answer: Bulls

Domain: Movie
Question: Which movies directed by Vivek Athreya have a median rating greater than 5 with more than 100 total votes, and do not feature Matt Smith as an actor?
SQL: SELECT title FROM movie WHERE title IN (SELECT T2.movie_title FROM director_mapping AS T2 WHERE T2.name = 'Vivek Athreya' AND movie.title = T2.movie_title) AND NOT title IN (SELECT movie_title FROM role_mapping WHERE category = 'actor' AND name = 'Matt Smith') AND title IN (SELECT movie_title FROM ratings WHERE median_rating > 5 AND total_votes > 100 GROUP BY movie_title)
Answer: Brochevarevarura, Mental Madhilo

N PROMPT TEMPLATES

We define a suite of prompt templates that guide LLMs to generate executable, semantically coherent SQL queries. Prompts are organized into three categories, with an NBA-domain example provided; for other domains, only domain-specific tokens are swapped (e.g., replacing "NBA" with "Movie").

Clause-Level Generation. Templates for generating individual SQL clauses in canonical order:
• SELECT (non-aggregate, aggregate)
• FROM
• WHERE
• GROUP BY
• HAVING
• ORDER BY
• LIMIT

Nested Predicate Construction. Templates for building multi-hop queries via nested predicates:
• Inner Query Selection
• FROM Clause for Outer Block
• Nested Predicate Generation: Type-N, Type-A, Type-J, Type-JA

Refinement and Evaluation.
Templates to improve query validity and assess realism:
• Provenance-Based Refinement for repairing empty-result queries
• Naturalness Evaluation to assess relevance and intent clarity

WHERE Clause Generation

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a WHERE clause that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "where", with its value being a WHERE clause.
• Ensure NBA Fan Relevance: Generate the WHERE clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Maintain Specificity and Clarity of Intent: Generate the WHERE clause that is well-defined, avoiding overly vague or artificially complex queries.
• Align with Generated Clauses: Ensure that the WHERE clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the WHERE clause that is syntactically correct and executable on the provided database.
IMPORTANT: Do not generate conditions for NULL or None values. Also, avoid generating filter conditions that duplicate any existing filters.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

GROUP BY Clause Generation

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a GROUP BY clause that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "group", with its value being a GROUP BY clause. The GROUP BY clause should include a single column.
• Ensure NBA Fan Relevance: Generate the GROUP BY clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Align with Generated Clauses: Ensure that the GROUP BY clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the GROUP BY clause that is syntactically correct and executable on the provided database.
IMPORTANT: Do not group by any column whose value is fixed by an equality (=) condition in the WHERE clause.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

HAVING Clause Generation

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a HAVING clause that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "having", with its value being a HAVING clause.
• Ensure NBA Fan Relevance: Generate the HAVING clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Maintain Specificity and Clarity of Intent: Generate a well-defined and clear HAVING clause without making it overly narrow or contrived.
• Align with Generated Clauses: Ensure that the HAVING clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the HAVING clause that is syntactically correct and executable on the provided database.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

ORDER BY Clause Generation

You are both an NBA fan and an SQL expert.
Given the provided database and the generated clauses, generate an ORDER BY clause that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "order", with its value being an ORDER BY clause.
• Ensure NBA Fan Relevance: Generate the ORDER BY clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Align with Generated Clauses: Ensure that the ORDER BY clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the ORDER BY clause that is syntactically correct and executable on the provided database.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

LIMIT Clause Generation

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a LIMIT clause that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "limit", with its value being a LIMIT clause.
• Ensure NBA Fan Relevance: Generate the LIMIT clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Align with Generated Clauses: Ensure that the LIMIT clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the LIMIT clause that is syntactically correct and executable on the provided database.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.
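The clause-level templates above each contribute one fragment; stitching the fragments together in the canonical order (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT) and executing the result is the kind of executability check the pipeline relies on. A minimal sketch under assumed clause keys (the toy schema and data mirror the first Table 16 example; none of this is SPARTA's actual code):

```python
import sqlite3

# Assumed keys for the clause fragments, in canonical SQL order.
CANONICAL_ORDER = ["select", "from", "where", "group", "having", "order", "limit"]

def assemble(clauses: dict) -> str:
    """Join generated clause fragments in canonical order, skipping absent ones."""
    return " ".join(clauses[k] for k in CANONICAL_ORDER if k in clauses)

# Toy database standing in for the NBA schema referenced in the templates.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nba_player_award (player_name TEXT, season INT, award TEXT)")
con.executemany("INSERT INTO nba_player_award VALUES (?, ?, ?)",
                [("Larry Bird", 1986, "nba mvp"), ("Magic Johnson", 1987, "nba mvp")])

clauses = {
    "select": "SELECT player_name",
    "from": "FROM nba_player_award",
    "where": "WHERE award = 'nba mvp' AND season = 1986",
}
sql = assemble(clauses)
rows = con.execute(sql).fetchall()  # executable and non-empty: the check passes
```

Executing the assembled query both confirms syntactic validity and exposes empty results, the trigger for the provenance-based refinement template below.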
SELECT Clause (Non-Aggregate)

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a SELECT clause that specifies a necessary field for retrieving meaningful NBA-related data. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "select", with its value being a SELECT clause that meaningfully projects a single column without an aggregation function.
• Ensure NBA Fan Relevance: Generate the SELECT clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Align with Generated Clauses: Ensure that the SELECT clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the SELECT clause that is syntactically correct and executable on the provided database.
IMPORTANT: Do not project columns used in the WHERE clause.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

SELECT Clause (Aggregate)

You are both an NBA fan and an SQL expert. Given the provided database and the generated clauses, generate a SELECT clause that aggregates a single column for retrieving meaningful NBA-related statistics. Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "select", with its value being a SELECT clause that meaningfully aggregates (MAX, MIN, AVG, COUNT, etc.) a single column.
• Ensure NBA Fan Relevance: Generate the SELECT clause that aligns naturally with realistic and meaningful queries that NBA fans are likely to ask.
• Align with Generated Clauses: Ensure that the SELECT clause maintains logical consistency with previously generated clauses, preserving semantic coherence.
• Ensure Synthetic Correctness: Generate the SELECT clause that is syntactically correct and executable on the provided database.
Database: {database}
Generated Clauses: {generated_clauses}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Inner Query Block Selection

You are both an NBA fan and an SQL expert. Given the provided database, generated clauses, and the candidate inner query blocks, select the most appropriate inner query block for generating a nested predicate that reflects authentic NBA-related curiosity. Select the inner query block that yields a nested predicate aligning naturally with realistic and meaningful multi-hop queries NBA fans are likely to ask. Your output must be in JSON format with the key:
• "inner_query_block": The most appropriate inner query block from the Candidate Inner Query Blocks.
IMPORTANT:
• Do not select an inner query block that has already been used in the generated clauses or that is not included in the candidate inner query blocks.
Database: {schema}
Generated FROM Clause: {generated_from_clause}
Generated WHERE Clause: {generated_where_clause}
Candidate Inner Query Blocks: {candidate_inner_query_blocks}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

FROM Clause Generation

You are both an NBA fan and an SQL expert. Given the database and the inner query block, generate a FROM clause for the outer query block that reflects authentic NBA-related curiosity.
Ensure the following requirements:
• Output Structure: Return a JSON object containing a single key, "from", with its value being a single-table FROM clause of the outer query block from the provided database (i.e., do not include any sub-selects or nested queries directly in the FROM clause).
• Ensure NBA Fan Relevance: Generate the FROM clause that aligns naturally with realistic and meaningful multi-hop queries that NBA fans are likely to ask.
• Ensure Synthetic Correctness: Generate the FROM clause that is syntactically correct and executable on the provided database.
• Separate Inner Query: The inner query block must remain separate; it should later be incorporated into the WHERE clause, not nested in the FROM clause.
• Ensure Natural Connection: Choose an outer table whose columns can be naturally referenced or filtered against the results of the inner query block.
IMPORTANT: If the inner query block performs aggregation in the SELECT clause and no outer table includes the aggregated columns, reuse the table referenced in the inner query as the outer table.
Database: {schema}
Inner Query Block: {subquery}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Type-N Nested Predicate Generation

You are both an NBA fan and an SQL expert. Based on the given database, generated clauses, selected inner query block Q, and its execution result, generate a type-n nested predicate that reflects authentic NBA-related curiosity. Ensure the following requirements:
• Ensure type-n Nesting: The inner query block Q must not contain a join predicate that references the relation of the outer query block, and its SELECT clause must project a column without an aggregate function.
• Ensure NBA Fan Relevance: Generate the nested predicate that aligns naturally with realistic and meaningful multi-hop queries that NBA fans are likely to ask.
• Ensure Synthetic Correctness: Generate the nested predicate that is syntactically correct and executable on the provided database.
• Ensure Semantic Alignment: If the inner query's SELECT column does not semantically match any column in the outer query's table, revise it for consistency.
The type-n nested predicate must be in the form: OuterTable.column [IN | NOT IN] ( Q )
Your output must be in JSON format with the keys:
• "nested_predicate": Only the type-n nested predicate based on the selected inner query block.
• "logical_operator": If a WHERE clause exists, return 'AND' or 'OR'.
IMPORTANT:
• Ensure that the nesting level of the inner query block is correctly preserved. The expected nesting level is {height}.
• Do not modify the nesting level of the provided inner query block.
Database: {schema}
Generated FROM Clause of the Outer Query: {generated_from_clause}
Generated WHERE Clause of the Outer Query: {generated_where_clause}
Selected Inner Query Block Q: {selected_inner_query_block}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Type-A Nested Predicate Generation
You are both an NBA fan and an SQL expert. Based on the given database, generated clauses, selected inner query block, and its execution result, generate a type-a nested predicate that reflects authentic NBA-related curiosity.
Ensure the following requirements:
• Ensure Type-A Nesting: The inner query block Q must not contain a join predicate referencing the outer query's relation, and its SELECT clause must contain an aggregate function associated with a column.
• Ensure NBA Fan Relevance: Generate the nested predicate that aligns naturally with realistic and meaningful multi-hop queries that NBA fans are likely to ask.
• Ensure Synthetic Correctness: The predicate must be executable and logically valid over the schema.
The type-a nested predicate must follow the form: OuterTable.column [= | != | < | <= | > | >=] ( Q with aggregate function )
Your output must be in JSON format with the keys:
• "nested_predicate": Only the type-a nested predicate based on the selected inner query block.
• "logical_operator": If a WHERE clause exists, return 'AND' or 'OR'.
IMPORTANT:
• Do not revise the SELECT clause of Q.
• Ensure that the nesting level remains {height}.
Database: {schema}
Generated FROM Clause of the Outer Query: {generated_from_clause}
Generated WHERE Clause of the Outer Query: {generated_where_clause}
Selected Inner Query Block Q: {selected_inner_query_block}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Type-J Nested Predicate Generation
You are both an NBA fan and an SQL expert. Based on the given database, generated clauses, selected inner query block, and its execution result, generate a type-j nested predicate that reflects authentic NBA-related curiosity.
Ensure the following requirements:
• Ensure Type-J Nesting: Revise the inner query block Q to ensure it includes a join predicate in its WHERE clause that references the outer query's relation, and its SELECT clause must project a column without an aggregate function.
• Ensure NBA Fan Relevance: Generate the nested predicate that aligns naturally with realistic and meaningful multi-hop queries that NBA fans are likely to ask.
• Ensure Synthetic Correctness: Generate the nested predicate that is syntactically correct and executable on the provided database.
• Ensure Semantic Alignment: If the inner query's SELECT column does not semantically match any column in the outer query's table, revise it for consistency.
The type-j nested predicate must be in one of the following forms:
OuterTable.column [IN | NOT IN] (SELECT ... FROM ... WHERE ... [join predicate] ...)
or
[EXISTS | NOT EXISTS] (SELECT ... FROM ... WHERE ... [join predicate] ...)
Your output must be in JSON format with the keys:
• "nested_predicate": Only the type-j nested predicate based on the selected inner query block.
• "logical_operator": If a WHERE clause exists, return 'AND' or 'OR'.
IMPORTANT:
• The join predicate involving the outer query's relation must appear in the WHERE clause of Q, not its FROM clause.
• The expected nesting level is {height}. Do not modify it.
Database: {schema}
Generated FROM Clause of the Outer Query: {generated_from_clause}
Generated WHERE Clause of the Outer Query: {generated_where_clause}
Selected Inner Query Block Q: {selected_inner_query_block}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Type-JA Nested Predicate Generation
You are both an NBA fan and an SQL expert. Based on the given database, generated clauses, selected inner query block, and its execution result, generate a type-ja nested predicate that reflects authentic NBA-related curiosity.
Ensure the following requirements:
• Ensure Type-JA Nesting: Revise the inner query block Q to include a join predicate in its WHERE clause that references the outer query's relation, and ensure its SELECT clause contains an aggregate function.
• Ensure NBA Fan Relevance: Generate the nested predicate that aligns naturally with realistic and meaningful multi-hop queries that NBA fans are likely to ask.
• Ensure Synthetic Correctness: The resulting predicate must be executable and valid over the database schema.
The type-ja nested predicate must follow one of the forms:
OuterTable.column [= | != | < | <= | > | >=] (SELECT [agg] ... FROM ... WHERE ... [join predicate] ...)
or
[EXISTS | NOT EXISTS] (SELECT [agg] ... FROM ... WHERE ... [join predicate] ...)
Your output must be in JSON format with the keys:
• "nested_predicate": Only the type-ja nested predicate based on the selected inner query block.
• "logical_operator": If a WHERE clause exists, return 'AND' or 'OR'.
IMPORTANT:
• The join predicate involving the outer query's relation must appear in the WHERE clause, not the FROM clause.
• Do not revise the SELECT clause of Q.
• Do not modify the nesting level ({height}).
Database: {schema}
Generated FROM Clause of the Outer Query: {generated_from_clause}
Generated WHERE Clause of the Outer Query: {generated_where_clause}
Selected Inner Query Block Q: {selected_inner_query_block}
Return the results in a FLAT JSON format. DO NOT include any explanations or notes in the output. ONLY return JSON.

Provenance-based Refinement
You are both an NBA fan and an SQL expert. Based on the given original SQL query, provenance analysis results, and the problematic subquery or condition that filters out all the rows, fix the original query's problematic subquery or condition so that it retrieves some results from the database.
Ensure the following requirements:
1) Output Structure: Return a JSON object containing a single key, "corrected_query", with its value being the corrected SQL query.
2) Ensure NBA Fan Relevance: Maintain the original query's NBA-related curiosity and focus on realistic and meaningful queries that NBA fans are likely to ask.
IMPORTANT:
• You may add an additional predicate in the inner query, or adjust the filtering threshold within the problematic subquery Q, to intentionally include the important rows or exclude outlier rows (e.g., those with extremely high or low values) that overly constrain the outer query.
• You may also adjust the comparison operator (e.g., > to >=, < to <=) or the value of the problematic condition to relax the filtering criteria.
• Do not delete the join predicate in the WHERE clause of the problematic subquery Q (e.g., WHERE outer_table_name.column_name = inner_table_name.column_name).
Original SQL Query: {query}
Problematic Condition: {problematic_condition}
Problematic Subquery Q: {problematic_subquery}
Execution Result of the Subquery Q: {problematic_subquery_execution_result}
Provenance Analysis Results: {provenance_analysis_results}
Return the results in a FLAT JSON format. NEVER include ANY EXPLANATION or NOTE in the output; ONLY OUTPUT JSON.

SQL Query Naturalness Evaluation
You have a set of evaluation criteria to judge whether a given SQL query reflects a question that is likely to be asked by a typical person. When evaluating the query, refer to the following points:
1. Relevance:
• Definition: Measures how likely it is that a real person would be interested in the query.
• Low Score (1): The query covers obscure or highly technical aspects unrelated to what typical people discuss (e.g., internal database IDs or rarely discussed statistics).
• High Score (5): The query reflects a common, popular interest among people (e.g., game stats, player/team information, draft results, etc.).
2. Specificity & Clarity of Intent:
• Definition: Evaluates whether the question is clearly targeted and sufficiently detailed to reveal a genuine NBA-related interest, without being so narrow as to be contrived.
• Low Score (1): The query is too vague ("Show me some NBA data") or overly convoluted/contrived.
• High Score (5): The query clearly captures a plausible question (e.g., "Which NBA player scored the most points in home games last month?").
3. Overall Naturalness:
• Combine the above criteria and decide if the query is "natural" (likely to be asked by a real person) or "unnatural".
• The query is considered natural if its overall score is 3 or higher.
Your output must be in JSON format with the following keys:
• "relevance_score": Integer from 1 to 5.
• "specificity_clarity_of_intent_score": Integer from 1 to 5.
• "overall_naturalness_score": Integer from 1 to 5.
• "reason": Explanation referencing the scores and justifying whether the query is considered natural or unnatural.
Database Schema: {database_schema}
SQL Query Template: {question}
Return the results in a FLAT JSON format. NEVER include ANY EXPLANATION or NOTE in the output; ONLY OUTPUT JSON.
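The four nested-predicate prompts above differ along two axes: whether the inner block carries a join predicate back to the outer relation (type-j, type-ja) and whether its SELECT clause is aggregated (type-a, type-ja). A minimal runnable sketch of the four resulting SQL shapes, using a hypothetical two-table NBA-style schema (team, player) that is not taken from the paper:

```python
# Sketch (illustrative only): one example query per nested-predicate type,
# executed on a toy in-memory database to check the "Synthetic Correctness"
# requirement that each predicate is syntactically valid and executable.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT, wins INTEGER);
CREATE TABLE player (player_id INTEGER PRIMARY KEY, name TEXT,
                     team_id INTEGER, points INTEGER);
INSERT INTO team VALUES (1, 'Lakers', 47), (2, 'Celtics', 57);
INSERT INTO player VALUES (1, 'A', 1, 25), (2, 'B', 1, 12),
                          (3, 'C', 2, 30), (4, 'D', 2, 8);
""")

# type-n: inner block has no join to the outer relation; SELECT is non-aggregated.
type_n = """SELECT name FROM player
WHERE team_id IN (SELECT team_id FROM team WHERE wins > 50)"""

# type-a: inner block is non-correlated and SELECTs an aggregate.
type_a = """SELECT name FROM player
WHERE points > (SELECT AVG(points) FROM player)"""

# type-j: inner block joins back to the outer relation in its WHERE clause.
type_j = """SELECT name FROM team
WHERE EXISTS (SELECT player_id FROM player
              WHERE player.team_id = team.team_id AND player.points > 28)"""

# type-ja: correlated join predicate AND an aggregated SELECT in the inner block.
type_ja = """SELECT name FROM team
WHERE 20 < (SELECT MAX(points) FROM player
            WHERE player.team_id = team.team_id)"""

results = {k: [row[0] for row in cur.execute(q)]
           for k, q in [("n", type_n), ("a", type_a),
                        ("j", type_j), ("ja", type_ja)]}
print(results)
```

All four queries return non-empty results on this toy data, which is the same property the provenance-based refinement prompt restores when a generated condition filters out every row.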