Dialect-Agnostic SQL Parsing via LLM-Based Segmentation
Authors: Junwen An, Kabilan Mahathevan, Manuel Rigger
JUNWEN AN, National University of Singapore, Singapore
KABILAN MAHATHEVAN, Virginia Tech, USA
MANUEL RIGGER, National University of Singapore, Singapore

SQL is a widely adopted language for querying data, which has led to the development of various SQL analysis and rewriting tools. However, due to the diversity of SQL dialects, such tools often fail when encountering unrecognized dialect-specific syntax. While Large Language Models (LLMs) have shown promise in understanding SQL queries, their inherent limitations in handling hierarchical structures and hallucination risks limit their direct applicability in parsing. To address these limitations, we propose SQLFlex, a novel query rewriting framework that integrates grammar-based parsing with LLM-based segmentation to parse diverse SQL dialects robustly. Our core idea is to decompose hierarchical parsing into sequential segmentation tasks, which better aligns with the strengths of LLMs and improves output reliability through validation checks. Specifically, SQLFlex uses clause-level segmentation and expression-level segmentation as two strategies that decompose elements at different levels of a query. We extensively evaluated SQLFlex on both real-world use cases and in a standalone evaluation. In SQL linting, SQLFlex outperforms SQLFluff in ANSI mode by 63.68% in F1 score while matching its dialect-specific mode performance. In test-case reduction, SQLFlex outperforms SQLess by up to 10 times in simplification rate. In the standalone evaluation, it parses 91.55% to 100% of queries across eight distinct dialects, outperforming all baseline parsers. We believe SQLFlex can serve as a foundation for many query analysis and rewriting use cases.

CCS Concepts: • Information systems → Query languages; • Software and its engineering → Parsers.
Additional Key Words and Phrases: SQL Dialect, Parser, Large Language Model

ACM Reference Format: Junwen An, Kabilan Mahathevan, and Manuel Rigger. 2026. Dialect-Agnostic SQL Parsing via LLM-Based Segmentation. Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 161 (June 2026), 33 pages. https://doi.org/10.1145/3802038

1 Introduction
Relational Database Management Systems (RDBMSs) are among the most widely adopted data management platforms, with the Structured Query Language (SQL) as the main interface for interacting with them. A core application of SQL is querying data. As a result, many important query-related tasks have emerged, which typically involve query analysis and rewriting. For example, SQL linting tools analyze queries for anti-patterns [17, 80], and may rewrite queries to fix such issues. Formally, we define query rewriting as transforming an input query 𝑄 into a new query 𝑄′ that satisfies a target objective 𝑂. Query analysis examines queries without modifying them. Since most rewriting begins with analysis, we use rewriting to refer to both for simplicity. Under this formulation, a wide range of applications beyond linting involve query rewriting, such as query reduction [50], DBMS testing [35, 53, 71], and SQL grading [9, 10].

Authors' Contact Information: Junwen An, National University of Singapore, Singapore, junwenan@u.nus.edu; Kabilan Mahathevan, Virginia Tech, Blacksburg, VA, USA, kabilan@vt.edu; Manuel Rigger, National University of Singapore, Singapore, rigger@nus.edu.sg. This work is licensed under a Creative Commons Attribution 4.0 International License. © 2026 Copyright held by the owner/author(s). ACM 2836-6573/2026/6-ART161 https://doi.org/10.1145/3802038

Proc. ACM Manag. Data, Vol. 4, No. 3 (SIGMOD), Article 161. Publication date: June 2026.
A major challenge in query rewriting is the diversity of SQL dialects, which leads to syntactic incompatibilities across systems [95, 99]. Most query rewriting tools follow a common workflow, but dialect-specific syntax frequently leads to failures. Specifically, these tools first parse an input SQL query with a grammar-based parser into an Abstract Syntax Tree (AST). The tools then analyze or modify the AST, and finally generate a query from the AST. The grammar-based parsers used in these tools often fail when encountering dialect-specific syntax, limiting their ability to support a wide range of DBMSs. For example, SQLFluff [80], a popular SQL linter, has more than 200 dialect-related open issues in its repository.

Grammar-based SQL parsers typically operate on a fixed set of grammar rules, and fail when encountering syntax unrecognized by the grammar. Some tools integrate grammar rules from multiple dialects into a single parser, allowing them to parse queries from multiple dialects into a unified AST. Notable examples include SQLFluff [80] and SQLGlot [84]. Although effective for the dialects they support, these tools require substantial manual effort to support new dialects. For example, although SQLGlot supports over 30 dialects, its entire parser codebase exceeds 20,000 lines, and adding support for a new dialect often requires over 1,000 additional lines. Additionally, existing dialects may introduce new features over time, requiring continued effort from such parsers' developers to support them. For instance, DuckDB recently integrated the MATCH_RECOGNIZE row pattern matching feature [39]. As a task-specific approach that tackles the SQL dialect problem, SQLess [50], a test-case reducer, proposes an adaptive parser that attempts to automatically generate new grammar rules when encountering dialect-specific features.
However, our experiments show that the generated rules fail to produce ASTs that can be interpreted by rewriting rules. Consequently, it remains a challenge to develop an automatic dialect-agnostic query parsing approach that generates an AST representation suitable for a wide range of query analysis and rewriting tasks.

Recent advances in Large Language Models (LLMs) have shown promise in SQL-related tasks [24, 42, 44]. Although LLMs demonstrate SQL understanding abilities, using them to directly generate an AST poses challenges. Their autoregressive nature makes them less effective at handling complex hierarchical structures, such as ASTs [32, 34, 60]. Additionally, end-to-end generation is prone to hallucinations [45], especially since no universal SQL grammar exists to validate outputs across dialects. We demonstrate these difficulties empirically in our evaluation.

In this paper, we take the first step toward addressing the limitations of grammar-based parsers and the drawbacks of using LLMs naively for dialect-agnostic query parsing. We aim to combine the accuracy of grammar-based parsing with the flexibility of an LLM-based segmenter to build an AST. To this end, we propose SQLFlex, a novel query rewriting framework that adopts this hybrid query parsing approach. While grammar-based parsers perform well at handling inputs that conform to a predefined grammar, LLMs can interpret queries that include dialect-specific features. SQLFlex first attempts to parse a query using the grammar-based parser. When parsing fails due to dialect-specific features, SQLFlex invokes the segmenter to split the input into smaller parts such that they can be parsed or further segmented. This decomposes hierarchical parsing into sequential segmentation tasks that better align with the strengths of LLMs.
Since clause-level grammar is typically flat, while expressions are recursive, SQLFlex employs clause-level segmentation and expression-level segmentation as two strategies suited to these different syntactic forms. Additionally, to improve reliability, SQLFlex validates the output by checking properties between the input and the output after each segmentation.

We conducted a large-scale evaluation of SQLFlex. Specifically, we showed SQLFlex's practicality in two real-world use cases, SQL linting and test-case reduction, each evaluated on a task-specific dataset. For linting, SQLFlex surpasses SQLFluff in ANSI mode by 63.68% in F1 score and performs on par with SQLFluff in dialect-specific mode (98.14% vs. 98.24%). For test-case reduction, SQLFlex outperforms SQLess by up to 10 times in simplification rate. Furthermore, we evaluated SQLFlex's dialect-agnostic parsing effectiveness, using queries in eight different SQL dialects extracted from their respective DBMS test suites. SQLFlex successfully parsed 91.55% to 100% of queries across the eight dialects, outperforming both a PostgreSQL-specific parser and SQLGlot.

Listing 1. TSQL query, dialect-specific features highlighted
SELECT TOP 10 *
FROM Sales
WHERE (tot / 2) !< 8 AND year < 2025
OPTION (FAST 10)

Listing 2. Simplified ANTLR grammar of a query
selectStmt: selectClause fromClause? whereClause? ;
selectClause: "SELECT" selectSpec? projections ;
selectSpec: "DISTINCT" | "ALL" ;
fromClause: "FROM" tableReferences ;
whereClause: "WHERE" expr ;
expr: expr op expr | identifier | number ;
op: "AND" | "<" | "/" ;
On the most challenging dialect, SQLFlex achieves improvements of up to 179.46% over the PostgreSQL-specific parser and 138.04% over SQLGlot. We believe that SQLFlex could be the foundation for many SQL-related use cases, where no manual effort is needed to support parsing of dialect-specific features. Like many other LLM-based applications (e.g., Text-to-SQL [42, 44]), SQLFlex relies on a best-effort approach, so we defer exploring correctness-critical use cases such as SQL-level query optimization [7, 90, 98] to future work. To strengthen correctness guarantees, SQLFlex could potentially be integrated with query equivalence verification tools [33, 89]. To summarize, we make the following contributions:¹
• We propose the novel idea of integrating the strengths of grammar-based parsers and LLMs for query parsing.
• We propose SQLFlex, a dialect-agnostic query rewriting framework which implements this idea with strategies for clause-level and expression-level segmentation.
• We extensively evaluated SQLFlex in two real-world use cases and in a standalone evaluation.

2 Background and Motivation
SQL standard and dialects. SQL is a standardized declarative language for data manipulation in RDBMSs [91]. We use SQL standard to refer specifically to SQL-92 [3], a minimal version "supported by virtually all RDBMSs" [61]. The core operation in SQL is the query, typically the SELECT statement. Elements in a query can be grouped into clauses and expressions. A query is composed of multiple clauses, each of which specifies a particular aspect of the query. Clauses contain clause elements, which include both expressions and non-expression elements that contribute to the clause's behavior. Expressions are composable units made up of values, operators, functions, or subqueries, and they evaluate to a result [18].
Listing 1 shows an example query in the TSQL dialect [58], which contains the SELECT, FROM, WHERE, and OPTION clauses, where each clause includes the keyword and its associated elements. For example, within the SELECT clause, "TOP 10" is a non-expression clause element that limits the number of returned rows, and "*" is a wildcard expression selecting all columns. Similarly, the WHERE clause contains a predicate expression as a clause element that filters the result set.

¹ Our artifact is publicly available at https://github.com/wanteatfruit/SQLFlex and https://doi.org/10.5281/zenodo.18975512.

In practice, DBMSs implement their own variants of SQL, known as SQL dialects [61]. For example, SQL Server uses the TSQL dialect. Although the dialects share a common foundation (e.g., SQL-92), they differ in grammar rules, reserved keywords, and DBMS-specific features [99]. For instance, unlike DuckDB, MySQL requires an alias for subqueries. In another case, "OPTION" has no special meaning in PostgreSQL, but in TSQL it is a reserved keyword and must be quoted when used as an identifier. A recent study found that test suites contain mostly dialect-specific features, where approximately 70% of queries in the PostgreSQL test suite are incompatible with other DBMSs, and about 60% for DuckDB [95]. In this paper, we focus on SELECT statements, and refer to syntax unsupported by the SQL standard as dialect-specific features.

Grammar-based parsers. Grammar-based parsers build an AST from an input query using a formal grammar. ANTLR [4] is a widely used parser generator that takes a grammar specification and generates a parser reflecting the grammar's structure. A grammar consists of production rules that define how symbols can be expanded.
A symbol refers to any element in a production rule, which is either a terminal or a non-terminal. Listing 2 illustrates a simplified ANTLR grammar for a SELECT statement in the SQL standard. It denotes terminals in quotes and defines non-terminals using rule names. For instance, the rule selectClause includes the terminal "SELECT" and two non-terminals. An important observation from the SQL standard grammar is that clauses (e.g., selectClause) tend to be non-recursive, while expressions are typically recursive.

Existing approaches. Most query rewriting tools rely on grammar-based parsers to construct an AST, but these parsers often fail on dialect-specific features, preventing further rewriting. We categorize grammar-based parsers as dialect-specific and multi-dialect parsers. Dialect-specific parsers work well for the supported dialect, but they lack generalizability across other dialects. Additionally, not all dialects have a ready-to-use parser in practice. This limitation also applies to parser generators like ANTLR, as grammars for many dialects are unavailable or incomplete [28]. In contrast, multi-dialect parsers aim to produce unified ASTs across dialects, offering better generalizability. Notable examples include the parser component in SQLGlot [84], SQLFluff [80], and Calcite [5]. While effective on multiple dialects, these tools demand substantial human effort and expertise to maintain the codebase and support additional dialects. For instance, the entire parser in SQLGlot contains over 20,000 lines of code, and adding support for an additional dialect often requires more than 1,000 additional lines. This high implementation cost makes such approaches difficult to scale, as evidenced by the many unresolved dialect-related feature requests in their open-source repositories [26, 27, 36].
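The failure mode described above can be reproduced with a toy parser for the simplified grammar in Listing 2. The sketch below is our own illustration (not SQLFlex's or any baseline's implementation): it hand-codes the selectClause and selectStmt rules and shows how an unrecognized dialect-specific token such as TSQL's TOP makes grammar-based parsing fail.

```python
# Minimal sketch (our illustration): a hand-written parser for a tiny
# subset of Listing 2's grammar, showing how grammar-based parsing fails
# on dialect-specific syntax such as TSQL's "SELECT TOP 10 *".

class ParseError(Exception):
    pass

def parse_select_clause(tokens):
    """selectClause: "SELECT" selectSpec? projections ;"""
    if not tokens or tokens[0] != "SELECT":
        raise ParseError("expected SELECT")
    rest = tokens[1:]
    spec = None
    if rest and rest[0] in ("DISTINCT", "ALL"):   # selectSpec
        spec, rest = rest[0], rest[1:]
    if not rest:
        raise ParseError("expected projections")
    # projections: a single "*" or identifier, for brevity
    head, rest = rest[0], rest[1:]
    if head != "*" and not head.isidentifier():
        raise ParseError(f"unexpected token {head!r}")
    return {"type": "SelectClause", "spec": spec, "projections": head}, rest

def parse_select_stmt(tokens):
    node, rest = parse_select_clause(tokens)
    if rest:  # the whole statement must be consumed
        raise ParseError(f"unexpected tokens after projections: {rest}")
    return node

print(parse_select_stmt("SELECT DISTINCT *".split()))  # parses fine
try:
    parse_select_stmt("SELECT TOP 10 *".split())       # "TOP" not in grammar
except ParseError as e:
    print("parse failed:", e)
```

As in a real grammar-based parser, the failure is not local to the offending token: the parser misattributes "TOP" to a later rule and only fails once the remaining tokens cannot be consumed.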
3 Illustrative Example
We use the TSQL query in Listing 1 to illustrate the challenges of dialect-agnostic query parsing and how SQLFlex constructs its AST via hybrid segmentation. Dialect-specific features in the listing are colored, including the OPTION clause for query hints, the "TOP 10" select specifier, the not-less-than "!<" operator, and the use of "year" as an identifier despite being a reserved keyword in standard SQL. Grammar-based parsing alone fails here, as these features fall outside the grammar rules.

The core of our approach is segmentation, a divide-and-conquer strategy for handling dialect-specific features. When such features cause the grammar-based parser to fail, we prompt an LLM to decompose the query into multiple segments, aligning with the LLM's strengths in sequential processing [22, 34], and iteratively process each segment using our hybrid approach. For example, when parsing the query in Listing 1, the grammar-based parser would fail, as "TOP" is an invalid terminal symbol in the lower-level selectSpec grammar rule. While it might be clear that "TOP 10" should be attributed to the SELECT clause rather than the FROM clause, as the SELECT clause would be incomplete otherwise, deciding so is difficult due to the English-like syntax of SQL [61, 77].

[Figure 1: diagram with panels (a) Clause-level Segmentation, (b) Expression-level Segmentation, and (c) Produced AST; graphical content not recoverable from this extraction.]
Fig. 1. Illustrative example of hybrid segmentation. Background colors of nodes denote their origins; text colors map grammar rules with segmentation prompts in (a).

Specifically, consider the OPTION keyword in Listing 1. In PostgreSQL, OPTION is accepted as an identifier, whereas in TSQL it marks the beginning of an OPTION clause. Without dialect knowledge, it is ambiguous whether OPTION should be interpreted as part of the WHERE clause or as a separate clause parallel to WHERE. Thus, rather than relying on heuristics, we use a state-of-the-art LLM that has been trained on examples of various SQL dialects to obtain more accurate parsing outputs.

Figure 1 illustrates our segmentation approach. When parsing the full query fails (①), we prompt the LLM to segment the query into three segments corresponding to the three non-terminals selectClause, fromClause, and whereClause, as well as dialect-specific elements that correspond to none of these non-terminals (②). To enable this, we map non-terminals to prompts, in addition to a dedicated prompt for dialect-specific segments, each instructing the LLM to extract the corresponding segment from the input. These mappings are visualized using text colors in the figure. In the example, the returned segments are "SELECT TOP 10 *" for selectClause, "FROM Sales" for fromClause, "WHERE ... 2025" for whereClause, and "OPTION (FAST 10)", the latter of which is a dialect-specific segment that cannot be further parsed. After segmentation, we create the corresponding SelectStmt node (③) to build the AST top-down. The parsing process continues on the parseable segments (④). For example, in a subsequent iteration, the LLM segments the SELECT clause into "TOP 10" and "*", mapping them to selectSpec and projections, respectively.
Since the FROM clause lacks dialect-specific features, grammar-based parsing is sufficient. We refer to this approach as clause-level segmentation, which we detail in Section 4.1. This approach is effective for non-recursive structures, which are present at the clause level.

At the expression level, clause-level segmentation is error-prone for deeply nested expressions. While most state-of-the-art LLMs perform well on simple cases, we observe that complex expressions are likely to be misinterpreted. For example, let e be the WHERE clause expression in Listing 1. When we wrap e inside a CASE expression (e.g., CASE WHEN e THEN e ELSE e END), the LLM may misinterpret the "AND" operator in e as having the lowest precedence, overlooking the CASE expression. Additionally, applying clause-level segmentation to such expressions requires invoking the LLM at every level of the expression tree, causing a high computational overhead. To address this, our key observation is that certain operators, such as "AND" and "<", are recognized by the grammar, making their relative positions in the AST clear even in the presence of dialect-specific features. Additionally, parenthesized subexpressions (e.g., (tot/2)) indicate higher precedence and appear deeper in the AST, so we can defer their handling until the outer parts have been parsed. Motivated by these observations, we propose expression-level segmentation, a strategy designed to handle expressions. The core idea of expression-level segmentation is to treat known operators as anchors and replace the fragments between them with abstraction tokens. This enables the grammar-based parser to parse the anchors and isolate the dialect-specific features.
Figure 1b illustrates one iteration of expression-level segmentation, which begins when a query segment fails to be parsed using the expression-level expr rule. First, all parenthesized subexpressions are replaced with parenthesis tokens (e.g., (tot/2) becomes 𝑝0) so that the current iteration focuses only on the non-parenthesized part (⑤). Next, in the remaining expression, recognizable operators (i.e., anchors) "AND" and "<" are preserved, while the fragments between them are replaced with mask tokens (e.g., "𝑝0 !< 8" becomes 𝑚0) (⑥). The abstracted expression "𝑚0 AND 𝑚1 < 𝑚2" can now be parsed by the grammar, as dialect-specific features are isolated within mask tokens (⑦). After obtaining the AST of the abstracted expression, we expand each mask token (⑧), allowing the LLM to process only the fragments containing dialect-specific features rather than the full expression. Specifically, if grammar-based parsing fails for a token, we perform segmentation. For example, 𝑚0 and 𝑚1 cannot be parsed due to the dialect-specific !< operator and the reserved keyword year used as an identifier. The segmenter identifies 𝑚0 as an operation with "!<" as the operator and "𝑝0" and "8" as operands. Once the dialect-specific operator is segmented, the operands can be parsed separately. For 𝑚1, the segmenter treats "year" as a literal, indicating that it is already a terminal node. 𝑚2 is valid and parsed directly. Lastly, we expand the parenthesis tokens and add the enclosed expressions to the next iteration (⑨). For instance, "tot/2" in 𝑝0 is processed in the next iteration. It contains no dialect-specific feature and is successfully parsed. We detail expression-level segmentation in Section 4.2.

Figure 1c shows the resulting AST after hybrid segmentation, where non-terminal nodes are visualized as rectangles and terminal nodes as ovals. Background colors indicate whether a node was produced by the grammar-based parser or the LLM-based segmenter.
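The two masking steps (⑤ and ⑥) can be approximated in a few lines. This is our simplified sketch, not SQLFlex's implementation: it tokenizes on whitespace, whereas the real system works over the grammar's tokens, and it handles only one level of parentheses per call, matching the one-level-per-iteration design described above.

```python
def process_paren(expr):
    """Replace each top-level parenthesized subexpression with p0, p1, ..."""
    out, paren_map, depth, buf = [], {}, 0, []
    for ch in expr:
        if ch == "(":
            depth += 1
            buf.append(ch)
        elif ch == ")":
            depth -= 1
            buf.append(ch)
            if depth == 0:                      # a top-level group just closed
                token = f"p{len(paren_map)}"
                paren_map[token] = "".join(buf)
                out.append(token)
                buf = []
        elif depth > 0:
            buf.append(ch)                      # inside parentheses: defer
        else:
            out.append(ch)                      # outside: keep as-is
    return "".join(out), paren_map

def process_anchor(expr, anchors):
    """Keep anchor operators; replace fragments between them with m0, m1, ..."""
    parts, mask_map, frag = [], {}, []
    def flush():
        if frag:
            token = f"m{len(mask_map)}"
            mask_map[token] = " ".join(frag)
            parts.append(token)
            frag.clear()
    for tok in expr.split():                    # whitespace tokens, for brevity
        if tok in anchors:
            flush()
            parts.append(tok)
        else:
            frag.append(tok)
    flush()
    return " ".join(parts), mask_map

e = "(tot / 2) !< 8 AND year < 2025"
e1, paren_map = process_paren(e)                # 'p0 !< 8 AND year < 2025'
e2, mask_map = process_anchor(e1, {"AND", "<"})
print(e2)        # m0 AND m1 < m2
print(mask_map)  # {'m0': 'p0 !< 8', 'm1': 'year', 'm2': '2025'}
```

Note that whitespace tokenization keeps "!<" as a single token, so the "<" inside it is not mistaken for an anchor; a grammar-token-based implementation would achieve the same effect more robustly.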
To improve reliability and reduce hallucinations, we apply validation and repair mechanisms to the segmentation outputs. The next section describes the overall approach in detail.

4 SQLFlex
We present SQLFlex, a dialect-agnostic query rewriting framework. We focus the rest of this section on detailing our key contribution, the hybrid segmentation process. As both AST manipulation and pretty-printing are based on standard techniques, we provide only a high-level overview of them in Section 4.4.

Algorithm 1 outlines the hybrid segmentation process, which builds the AST iteratively from top to bottom. Formally, given a query fragment 𝑄, we define segmentation as a function 𝑆 : 𝑄 → 𝑠1, 𝑠2, . . . , 𝑠𝑛, where each 𝑠𝑖 is an output segment. We maintain a queue to manage query fragments whose sub-ASTs still need to be built; each entry is a tuple (𝑄, 𝐺), where 𝐺 is the grammar rule used to parse 𝑄. We initialize the queue with the query and the selectStmt rule. In each iteration, we dequeue one item from the queue. Since grammar-based parsing is more efficient and reliable, we first attempt to parse 𝑄 using 𝐺 (line 5). If parsing fails, the algorithm falls back on clause-level or expression-level segmentation depending on 𝐺 (lines 7–10). Expression-level segmentation is used only when 𝐺 is expr, indicating the parsing of an expression element; in all other cases, clause-level segmentation applies. Since clause-level segmentation produces intermediate segments that require further parsing, we add them to the queue. Both expression-level segmentation and grammar-based parsing construct the full sub-AST, so no additional fragments are enqueued.
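This queue-driven control flow can be sketched in ordinary Python. The sketch below is our illustration only: parse and segmentation are stubs standing in for the grammar-based parser and the LLM-based segmenter (with canned answers for the query from Listing 1), and the expression-level branch and proper AST attachment are omitted for brevity.

```python
from collections import deque

DIALECT_TOKENS = {"TOP", "OPTION"}   # toy stand-ins for unparseable syntax

def parse(fragment, rule):
    """Stub grammar-based parser: fails whenever a dialect-specific token
    appears in the fragment, otherwise returns a trivial node."""
    if any(tok in fragment.split() for tok in DIALECT_TOKENS):
        return None
    return (rule, fragment)

def segmentation(fragment, rule):
    """Stub LLM segmenter with canned answers for the running example
    (the real system prompts an LLM with rule-specific prompts).
    Returns (segment, mapped rule, is_terminal_only_rule) triples."""
    if rule == "selectStmt":
        return [("SELECT TOP 10 *", "selectClause", False),
                ("FROM Sales", "fromClause", False),
                ("OPTION (FAST 10)", "other", True)]   # dialect-specific
    if rule == "selectClause":
        return [("TOP 10", "selectSpec", True),        # terminal-only rule
                ("*", "projections", True)]
    raise NotImplementedError(rule)

def hybrid_segmentation(query):
    seg_queue = deque([(query, "selectStmt")])
    root = []                                  # attach() flattened for brevity
    while seg_queue:
        q, g = seg_queue.popleft()
        ast = parse(q, g)                      # try grammar-based parsing first
        if ast is not None:
            root.append(ast)
            continue
        for seg, rule, terminal_only in segmentation(q, g):
            if terminal_only:
                root.append((rule, seg))       # attach directly, no re-parsing
            else:
                seg_queue.append((seg, rule))  # parse or segment later
    return root

print(hybrid_segmentation("SELECT TOP 10 * FROM Sales OPTION (FAST 10)"))
```

Observe that "FROM Sales" is never sent to the segmenter: grammar-based parsing succeeds for it, mirroring the fast path on line 5 of Algorithm 1.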
Algorithm 1: Hybrid Segmentation
Input: input_query  ⊲ Full input query
Result: root  ⊲ Parsed query AST
1  Function HybridSegmentation:
2    seg_queue ← [(input_query, selectStmt)], root ← ∅
3    while seg_queue not empty do
4      Q, G ← seg_queue.dequeue()
5      if parse(Q, G) succeeds then  ⊲ Grammar-based parsing
6        ast ← parse(Q, G)
7      else if G == expr then
8        ast ← ExpressionLevelSegmentation(Q)
9      else
10       ast ← ClauseLevelSegmentation(Q, G, seg_queue)
11     root ← attach(root, ast)
12   return root
13 Function ClauseLevelSegmentation(Q, G, seg_queue):
14   rule2ast, rule2prompt, segment2rule ← load_mapping()
15   N ← rule2ast[G]  ⊲ AST node type for 𝐺
16   prompts ← rule2prompt[G]  ⊲ Prompts associated with 𝐺
17   segments ← Segmentation(Q, prompts)
18   for 𝑠𝑖 ∈ segments do
19     𝑔𝑖 ← segment2rule[𝑠𝑖]  ⊲ Maps each segment to one 𝑔𝑖
20     if has_nonterminal(𝑔𝑖) then
21       seg_queue.enqueue((𝑠𝑖, 𝑔𝑖))
22     else
23       N ← attach(N, 𝑠𝑖)
24   return N

4.1 Clause-level Segmentation
The core idea of clause-level segmentation is to map each non-dialect-specific segment 𝑠𝑖 to a corresponding symbol 𝑔𝑖 in the grammar rule 𝐺, so that 𝑠𝑖 can be further parsed or segmented using 𝑔𝑖 in subsequent iterations. Segments that lack a symbol mapping are treated as terminals. The clause-level segmentation function is also outlined in Algorithm 1; it takes as input the current query fragment 𝑄, the associated grammar rule 𝐺, and the shared queue from the hybrid segmentation loop. To support clause-level segmentation, we predefine three mappings (line 14). First, rule2ast maps each grammar rule 𝐺 to its corresponding AST node 𝑁 (line 15). For instance, selectStmt maps to a SelectStmt node. This information is needed to instantiate the correct AST node. Second, rule2prompt maps 𝐺 to segmentation prompts, each associated with a symbol 𝑔𝑖 (line 16).
For example, selectStmt includes the selectClause symbol, mapping to a prompt "Output should include a SELECT clause". We use few-shot prompting with examples of common SQL features to improve consistency and reduce bias toward a specific dialect. The prompts remain unchanged when applying SQLFlex across dialects. Lastly, segment2rule maps each output segment 𝑠𝑖 back to its corresponding 𝑔𝑖, forming (𝑠𝑖, 𝑔𝑖) tuples for the queue (line 19). For example, a segment corresponding to a SELECT clause will be mapped to selectClause. This helps the approach resume parsing segments with the correct grammar rule.

After segmentation, each segment is processed based on the mapped 𝑔𝑖 to determine whether it will be enqueued (lines 18–23). If 𝑔𝑖 contains non-terminal symbols (e.g., selectClause contains the non-terminal selectSpec), the (𝑠𝑖, 𝑔𝑖) tuple is added to the queue (lines 20–21). Dialect-specific features within these segments are handled in subsequent iterations to maintain sequential processing. In contrast, if 𝑔𝑖 contains only terminal symbols, the segment is instead directly attached to the created AST node 𝑁, as no further parsing is required (lines 22–23). This avoids repeated failures caused by dialect-specific features mapped to such rules. For example, in an iteration where 𝑄 is "SELECT TOP 10 *" and 𝐺 is selectClause, the segment "TOP 10" maps to the terminal-only rule selectSpec, and is therefore attached to 𝑁 directly without being enqueued. A special case occurs when a segment represents a dialect-specific feature lacking a corresponding 𝑔𝑖, such as the OPTION clause. These segments are encapsulated in dedicated Other terminal nodes and attached to 𝑁. We record their position in the query for correct pretty-printing.
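The three mappings could be represented as plain lookup tables. The sketch below is our hypothetical rendering (rule names from Listing 2, invented segment labels and prompt strings); it shows only the enqueue-or-attach decision driven by segment2rule, not SQLFlex's actual tables.

```python
# Hypothetical mapping tables for the simplified grammar in Listing 2.
rule2ast = {"selectStmt": "SelectStmt", "selectClause": "SelectClause"}

rule2prompt = {
    "selectStmt": [
        "Output should include a SELECT clause",
        "Output should include a FROM clause (optional)",
        "Output should include a WHERE clause (optional)",
        "Output may include other dialect-specific content (optional)",
    ],
}

segment2rule = {
    "select clause": "selectClause",
    "from clause": "fromClause",
    "where clause": "whereClause",
}

# Rules that still contain non-terminal symbols (per Listing 2):
NONTERMINAL_RULES = {"selectClause", "fromClause", "whereClause"}

def route_segment(segment, label, queue, node):
    """Enqueue segments mapped to rules with non-terminals; attach
    terminal-only or unmapped (dialect-specific) segments directly."""
    rule = segment2rule.get(label)
    if rule in NONTERMINAL_RULES:
        queue.append((segment, rule))              # parse/segment later
    else:
        node.append(("Other" if rule is None else rule, segment))

queue, node = [], [rule2ast["selectStmt"]]
route_segment("SELECT TOP 10 *", "select clause", queue, node)
route_segment("OPTION (FAST 10)", "dialect-specific", queue, node)
print(queue)  # [('SELECT TOP 10 *', 'selectClause')]
print(node)   # ['SelectStmt', ('Other', 'OPTION (FAST 10)')]
```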
Section 6 shows that the core grammar suffices for various rewriting tasks, while content in Other nodes can be handled with text-based analysis. Our approach outperforms heuristic-based methods, as it preserves the structure of the AST for these nodes (e.g., the OPTION clause is placed at the same level as the other clauses). Clause-level segmentation returns the constructed AST node to the main hybrid segmentation loop (line 24). In subsequent iterations, the items added to the queue during clause-level segmentation are processed. If a dequeued item has 𝐺 = expr and parsing fails, the algorithm invokes expression-level segmentation.

4.2 Expression-level Segmentation
We propose an anchor-based strategy to isolate dialect-specific features into expression fragments that can be independently parsed. Specifically, the strategy leverages the precedence of parentheses and known operators (e.g., AND) to construct partial ASTs, so that segmentation can be performed directly on their operands. We refer to it as anchor-based expression segmentation, outlined in Algorithm 2. The algorithm takes an expression as input and builds its AST in a top-down iterative manner. A queue manages the expression fragments to be processed, where each iteration processes one level of parentheses, starting with the input expression. Given an expression 𝐸 and a set of anchors (i.e., known operators), the isolation process builds a partial AST 𝑇, where each leaf node contains a fragment 𝑒𝑖 that may contain dialect-specific features. This divide-and-conquer approach allows each fragment to be segmented independently, preventing the LLM from misinterpreting nested subexpressions. This process consists of three steps, which we illustrate using the same expression in Figure 1b as 𝐸. First, we replace each parenthesized subexpression with a unique parenthesis token 𝑝𝑖, returning the transformed expression 𝐸′ and mappings from each 𝑝𝑖 to its original content (line 6).
For example, (tot/2) in E is replaced with p_0, and we record the mapping p_0 ↦ (tot/2). Next, we identify anchors in E′ and replace the surrounding content with mask tokens m_i, returning the transformed E″ and mappings from each m_i to its corresponding fragment e_i (line 7). For example, the anchors in E′ are "AND" and "<", thus E′ can be split into fragments e_i and anchors a_i as "e_0 a_0 e_1 a_1 e_2", where a_0 is "AND", e_0 is "p0 !< 8", etc. We replace each e_i with a mask token m_i and record their mappings. Lastly, we extend the base grammar to accept both token types, so that the masked expression E″ is parseable by the grammar-based parser. The produced T preserves the precedence of known operators (line 8). For dialect-specific operators whose precedence is unknown, our approach assumes they have the highest precedence among all unparenthesized operators, as low precedence is typically reserved for common operators like "AND" and "OR" that compose multiple expressions. Determining the exact operator precedence is non-trivial both for the LLM and for users, as dialect-specific precedence rules are often inconsistent across DBMSs. In practice, we expect users to disambiguate the precedence of dialect-specific features that have a low precedence by adding parentheses, making operator precedence explicit, which is a best practice suggested by style guides [55].

Algorithm 2: Anchor-based Expression Segmentation
Input: input_expr  ⊲ Full input expression
Result: expr_root  ⊲ Expression-level AST
 1  Function ExpressionLevelSegmentation(input_expr):
 2    expr_queue ← [input_expr], expr_root ← ∅
 3    prompts, anchors ← load_prompts_and_anchors()  ⊲ Predefined
 4    while expr_queue not empty do
 5      E ← expr_queue.dequeue()
 6      E′, paren_map ← process_paren(E)
 7      E″, mask_map ← process_anchor(E′, anchors)
 8      expr_root ← parse_and_attach(E″, expr_root)  ⊲ Produces T
 9      for m_i, e_i ∈ mask_map do
10        if parse(e_i) succeeds then
11          expr_root ← attach(parse(e_i), expr_root)
12        else
13          expr_root ← attach(RecursiveSeg(e_i, prompts), expr_root)
14      for p_i, paren_content ∈ paren_map do
15        expr_queue.enqueue(remove_paren(paren_content))
16    return expr_root

We process each expression fragment e_i. For each masked fragment, the algorithm first attempts grammar-based parsing. Some fragments can be successfully parsed, as they may consist of recognized literal values or parenthesis tokens (including function calls whose argument lists are replaced by p_i). If parsing succeeds, the resulting subtree is attached directly to tree T (lines 10–11). If parsing fails, we recursively invoke the segmenter to handle the dialect-specific content and attach the resulting AST (lines 12–13). At its core, the segmenter distinguishes between terminal and non-terminal fragments. Terminal fragments require no further parsing, while non-terminal fragments must recursively parse their child expressions to complete the structure. Thus, we prompt the LLM to output either a literal value segment or operation segments (i.e., an operator and its operands).
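The two masking steps (lines 6–7 of Algorithm 2) can be sketched as below. This is a simplified character-level illustration, not the system's implementation: it assumes anchors never overlap with other tokens (the real system presumably matches anchors token-aware, e.g., so that the "<" inside "!<" is not split).

```python
import re

def process_paren(expr):
    """Replace each outermost parenthesized subexpression with a token p<i>."""
    out, paren_map, depth, start, i = [], {}, 0, 0, 0
    for idx, ch in enumerate(expr):
        if ch == "(":
            if depth == 0:
                out.append(expr[start:idx])
                start = idx
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                token = f"p{i}"
                paren_map[token] = expr[start:idx + 1]  # keep the parentheses
                out.append(token)
                i, start = i + 1, idx + 1
    out.append(expr[start:])
    return "".join(out), paren_map

def process_anchor(expr, anchors):
    """Split on known anchor operators; mask each fragment with a token m<i>."""
    pattern = "|".join(re.escape(a) for a in anchors)
    parts = re.split(rf"\s*({pattern})\s*", expr)
    mask_map, pieces = {}, []
    for j, part in enumerate(parts):
        if j % 2 == 1:              # odd positions hold the anchors themselves
            pieces.append(part)
        else:
            token = f"m{len(mask_map)}"
            mask_map[token] = part
            pieces.append(token)
    return " ".join(pieces), mask_map

E = "(tot/2) > 8 AND x < 5"
E1, paren_map = process_paren(E)                      # "p0 > 8 AND x < 5"
E2, mask_map = process_anchor(E1, ["AND", "<", ">"])  # "m0 > m1 AND m2 < m3"
```

The masked expression E″ contains only anchors, p-tokens, and m-tokens, so the extended base grammar can parse it into the partial tree T while each m_i fragment is handled separately.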
Literal segments are treated as terminal nodes (e.g., year). For operation segments, the LLM identifies the operator with the lowest precedence (i.e., the shallowest AST node) and its operands. Each operand is then recursively parsed or segmented. For example, given the fragment "p0 !< 8", the LLM identifies !< as the operator, with operands p0 and 8, both of which can be parsed directly and attached to the tree. In more complex cases, such as "1 !< 2 !< 3", the initial segmentation identifies the first !< as a left-associative operator, with left operand "1" and right operand "2 !< 3". Since the right operand still contains a dialect-specific operator, segmentation is invoked again, continuing recursively until all terminal nodes are resolved. After processing all fragments e_i, we handle the parenthesized expressions. For each such expression (paren_content), the outermost pair of parentheses is removed. The content is treated as a single expression and added to the queue for further expression-level segmentation (line 15). However, in two special cases, we proceed differently. First, if the content begins with "SELECT", we perform clause-level segmentation for the subquery. Second, if the content has unparenthesized commas (common in function arguments), it is treated as an expression list and split into separate expressions, each enqueued individually. For example, given the function "max(1, min(a, 1))", we process its argument list. We remove the outer parentheses and split the arguments into 1 and min(a, 1), enqueuing both. The comma in min(a, 1) remains parenthesized and will be handled in subsequent iterations. The algorithm continues until all parentheses are resolved (i.e., no more expressions in the queue), and returns the expression-level AST.
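The argument-list splitting in the second special case only acts on commas at the top level of nesting. A minimal sketch (illustrative, not the actual implementation):

```python
def split_top_level_args(arglist):
    """Split an argument list on commas at parenthesis depth zero only."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(arglist):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(arglist[start:i].strip())
            start = i + 1
    parts.append(arglist[start:].strip())
    return parts

# For "max(1, min(a, 1))" we strip the outer call and split its arguments;
# the comma inside min(...) stays protected by its parentheses.
args = split_top_level_args("1, min(a, 1)")
```

Each returned argument is then enqueued as its own expression, so nested commas are resolved in later iterations.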
After combining with the clause-level AST, this yields the full AST for the input query.

4.3 Validation and Repair

For each segmentation output, regardless of the strategy, we apply three validation methods that check invariant properties between the input and the segmented output to detect potential errors. The LLM is prompted to repair its output until all validation checks pass. Note that while this approach improves robustness, it remains best-effort, as it might miss mistakes made by the LLM.

(1) Token validation. We compare the characters in each segment with those in the input to detect any missing or extraneous characters. For example, if "OPTION" is omitted during segmentation of the entire query, the validation fails.
(2) Order validation. We ensure that the character order within each segment matches the order in the original input. For example, if "FAST 10" is reordered as "10 FAST", the validation fails. We also check the semantic order among the segments. For example, in expression-level segmentation, if the left operand appears after the operator, the validation also fails.
(3) Mutually exclusive output validation. Expression-level segmentation defines mutually exclusive output types where the output is either a literal or an operation. If both output types are present, the validation fails.

Repair prompts are created based on the failed validation, the reason for the failure, and the full query to provide more context. After each repair attempt, the output is revalidated, repeating until success or until a predefined retry limit is reached. If the limit is reached, we attach a special Unsegmented terminal node to the AST. This node preserves the original input and prevents further processing of that segment to maintain the overall structure of the AST. For example, if the SELECT clause fails to be segmented, an Unsegmented node containing the clause is attached to the SelectStmt node.
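The first two checks can be sketched as character-level invariants (a simplified sketch assuming whitespace-insensitive comparison; function names are ours, not SQLFlex's):

```python
from collections import Counter

def strip_ws(s):
    return "".join(s.split())

def token_validation(query, segments):
    """Segments must contain exactly the input's characters (whitespace aside)."""
    return Counter(strip_ws(query)) == Counter(strip_ws("".join(segments)))

def order_validation(query, segments):
    """Concatenating the segments must preserve the original character order."""
    return strip_ws("".join(segments)) == strip_ws(query)

q = "SELECT TOP 10 * FROM t OPTION (FAST 10)"
good = ["SELECT TOP 10 *", "FROM t", "OPTION (FAST 10)"]
bad = ["SELECT TOP 10 *", "FROM t"]  # "OPTION (FAST 10)" was dropped
```

A failing check, together with its reason, is fed back into the repair prompt described above.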
The rest of the query (e.g., the FROM clause) can still be processed normally.

4.4 Rewriter and Pretty-Printer

While the Rewriter modifies the AST based on user-provided logic, the Pretty-printer regenerates the query from the AST. We consider neither of them novel and describe them only for completeness. For the Rewriter, we provide two APIs to users: find and transform. The transform function is a higher-order API that applies a user-defined rule to modify AST nodes. It traverses the AST and applies the rule to each node. The find function complements this by searching for nodes of a specified type and returning nodes that satisfy an optional filter function. We designed the API to be minimal to support flexible and dialect-agnostic rewriting. Listing 3 shows an example of how users can detect and remove an unused table alias given an input query using SQLFlex. More complex AST manipulation can be implemented similarly by composing calls to find and transform, along with more extensive user-defined functions. Users can also incorporate dialect-specific knowledge to directly manipulate node attributes, including those representing dialect-specific features. The Pretty-printer converts an AST back into its corresponding SQL query, ensuring that any modifications made by the Rewriter are reflected in the final output. It is a purely rule-based system that traverses the AST to regenerate the source text [13].

Listing 3. Detect and remove unused table alias

    # detect anti-pattern using find API
    def find_unused_table_alias(root):
        def is_table_alias(node):  # filter function
            return isinstance(node.parent, Table)
        aliases = find(Alias, root, is_table_alias)
        identifiers = find(Identifier, root)
        for al in aliases:
            if al not in identifiers:
                return al

    # user-defined rewrite rule for transform API
    def rmv_table_alias(node, al):
        if isinstance(node, Table) and node.alias == al:
            node.alias = None

    root = sqlflex_parse(query)
    unused = find_unused_table_alias(root)
    new = transform(root, rmv_table_alias, unused)

5 Implementation and Experimental Setup

We implemented SQLFlex in around 3,800 lines of Python. We adopted the SQL-92 [3] grammar as the base grammar. It provides a minimal grammar that is supported by almost all SQL dialects [61], while allowing us to highlight the effectiveness of segmentation. We used ANTLR [4] to implement the grammar-based parser. We implemented the LLM-based segmenter with LangChain [40] and the OpenAI GPT-4.1 API [62], run at zero temperature for more deterministic outputs. We use a system prompt defining the general segmentation task (Listing 4) and allow up to three repair attempts to balance runtime and success rate. We used the LLM's structured output feature to ensure its outputs align with the prompts and can be subsequently processed. The output structure is defined using pydantic. For example, Listing 5 shows the structured output prompt for segmenting the selectStmt rule, where the pydantic model maps to the SelectStmt AST node definition. The SelectStmt class represents the input node type, and its attributes correspond to the expected output fields.
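To make the find/transform contract concrete, here is a self-contained toy version over a minimal node class. The Node class, tree shape, and helper names are our own illustration, not SQLFlex's internal representation.

```python
class Node:
    """Minimal AST node: a kind, children, and optional attributes."""
    def __init__(self, kind, children=(), **attrs):
        self.kind, self.children = kind, list(children)
        self.__dict__.update(attrs)

def find(kind, root, pred=lambda n: True):
    """Collect nodes of a given kind that satisfy an optional filter."""
    hits = [root] if root.kind == kind and pred(root) else []
    for child in root.children:
        hits += find(kind, child, pred)
    return hits

def transform(root, rule, *args):
    """Apply a user-defined rewrite rule to every node, depth-first."""
    rule(root, *args)
    for child in root.children:
        transform(child, rule, *args)
    return root

# Hypothetical tree for: SELECT x FROM t AS a   (alias "a" never referenced)
tree = Node("Select", [Node("Table", name="t", alias="a"),
                       Node("Identifier", name="x")])

def rmv_table_alias(node, alias):
    if node.kind == "Table" and getattr(node, "alias", None) == alias:
        node.alias = None

transform(tree, rmv_table_alias, "a")
```

The pretty-printer would then regenerate "SELECT x FROM t" from the modified tree, so the rewrite is visible in the output query.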
Each field is annotated with a natural language description, which is automatically incorporated into the LLM prompt to guide structured generation. Note that the manual effort required to define these descriptions is incurred only once. We used LLMs to assist in generating and refining the descriptions, with the overall process taking less than ten minutes. We ran all subsequent experiments on a server with a 64-core AMD EPYC 7763 processor (2.45GHz) and 512GB of RAM, running Ubuntu 22.04.

6 Use Cases

We present two real-world query analysis and rewriting use cases implemented using SQLFlex as the base parser, namely SQL linting and test-case reduction. SQL linting is a query analysis task, while test-case reduction involves query rewriting and requires executing the queries. These use cases highlight SQLFlex's practical utility, which grounds our evaluation in actual usage of SQLFlex before turning to a standalone evaluation.

6.1 SQL Linting

Many SQL queries exhibit poor coding practices that can cause performance and maintainability issues [75, 101], commonly referred to as SQL anti-patterns [37]. A prevalent example is the use of the column wildcard expression (e.g., SELECT *) to retrieve all columns from a table. While this shorthand is convenient, it is discouraged in production, as it can lead to inefficiencies and cause schema changes to remain undetected [37]. Hence, a robust SQL linter is essential for detecting anti-patterns and improving code quality [17, 75, 80].

Listing 4. System prompt for LLM-based segmenter

    You are an SQL expert in various SQL dialects. Your task is to segment an
    SQL query or a valid part of a query into its components in the {dialect}
    SQL dialect while adhering strictly to the following rules:
    - Ensure all tokens from the original query are included in the segmented
      output. Do not add, modify, or omit any tokens.
    - Maintain the relative order of tokens as they appear in the input.
    Important Notes:
    - The input is always a valid SQL query or a valid fragment of a query in
      {dialect}.
    - Use context information to understand the query.
    - Reason step by step to ensure correctness and consistency: self-reflect
      on 3 to 5 results, and return the most consistent result as the final
      output.

Listing 5. Structured output prompt fragment for selectStmt

    class SelectStmt(BaseModel):
        """Represents a SQL SELECT statement."""
        select: str = Field(description="The SELECT clause, including the "
            "**SELECT** keyword and its associated projections.")
        from_: Optional[str] = Field(description="The FROM clause, including "
            "the **FROM** keyword, specifying the source table(s).")
        where: Optional[str] = Field(description="The WHERE clause, including "
            "the **WHERE** keyword, defining filtering conditions.")
        other: Optional[str] = Field(description="Any additional "
            "dialect-specific contents.")

Baselines. We chose two notable existing linters, SQLCheck [17] and SQLFluff [80], as baselines. SQLCheck originally combined a non-validating parser with regular expression matching to detect anti-patterns [17]. However, its current open-source implementation relies only on regular expressions. SQLFluff uses a multi-dialect parser to parse input queries into ASTs, which are then analyzed for anti-patterns.
While SQLCheck avoids parsing failures by using text-based analysis instead of ASTs, this approach leads to potentially lower accuracy. In contrast, SQLFluff has higher accuracy but requires significant manual effort to support new dialects. Since SQLFluff allows users to configure the SQL dialect, we report results for its ANSI mode (SF-ANSI) and TSQL mode (SF-TSQL). Lastly, we implemented a purely LLM-based baseline using GPT-4.1 to investigate whether an end-to-end LLM-based approach is sufficient for this task. We provided the LLM with natural language rule descriptions and prompted it to output anti-patterns.

Rule selection. We systematically selected a set of fourteen linter rules for the evaluation, presented in Table 1. We first selected all three rules that are supported by both SQLFluff and SQLCheck. For a more comprehensive evaluation, we incorporated eleven additional rules from SQLFluff. We chose these rules from SQLFluff's core rule set, which are defined by the developers as "Stable, applies to most dialects, could detect a syntax issue, and is not too opinionated toward one style" [80]. Within the core rule set, we excluded those addressing layout issues (e.g., incorrect indentation), symbol inconsistencies (e.g., extra commas), and capitalization inconsistencies (e.g., keyword capitalization), as SQLFlex does not preserve this information during parsing. For brevity, we will refer to each rule by its ID in Table 1 in subsequent paragraphs.

Table 1. List of Selected Linter Rules

Source      ID    Description
SQLFluff    AM01  Ambiguous use of DISTINCT with GROUP BY.
            AM06  Inconsistent column references in GROUP/ORDER BY clauses.
            AL02  Implicit aliasing of columns.
            AL03  Column expression without alias.
            AL04  Table aliases should be unique within each clause.
            AL05  Tables should not be aliased if that alias is not used.
            AL08  Column aliases should be unique within each clause.
            AL09  Column aliases should not alias to themselves.
            CV04  Use consistent syntax to express "count number of rows".
            CV05  Comparisons with NULL should use IS or IS NOT.
            RF01  Referencing objects not present in the FROM clause.
Both tools  AM02  UNION [DISTINCT|ALL] is preferred over UNION.
            AM04  Query produces an unknown number of result columns.
            AM08  Implicit cross join detected.

Dataset. We selected the SESD dataset [31], which consists of human-authored SQL queries with diverse TSQL-specific features collected from Stack Exchange, and we expected some of them to include anti-patterns targeted by SQL linters. We removed duplicate queries and excluded queries that failed to parse using a TSQL parser. Finally, we obtained 1,916 TSQL queries, including 111 queries with window functions and 1,030 queries with joins. Since SQLFluff already supports the TSQL dialect, we constructed reliable ground-truth annotations (i.e., the list of potential anti-patterns for each query) by initially running both SQLFlex and SQLFluff on the entire dataset. Cases where both tools detected the same anti-patterns were considered the ground truth. For discrepancies between the two tools, we manually reviewed and confirmed each instance to ensure accurate labeling. To further validate the constructed ground truth, we randomly sampled 50 labeled queries and manually annotated them. We observed no discrepancies between these annotations and the constructed labels. Based on this sample, a Clopper–Pearson confidence interval shows that, with 95% confidence, the true error rate is below approximately 6%.

Metrics. We use Precision, Recall, and F1 score as evaluation metrics [72]. The metrics are computed at the anti-pattern level.
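As an illustration, anti-pattern-level precision, recall, and F1 can be computed from per-query set overlaps. This sketch is ours, not the authors' evaluation script, and uses the paper's own ground-truth/report example:

```python
def prf1(gold_sets, pred_sets):
    """Micro-averaged precision/recall/F1 over per-query anti-pattern sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # correctly detected anti-patterns
        fp += len(pred - gold)   # incorrectly reported anti-patterns
        fn += len(gold - pred)   # missed anti-patterns
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Ground truth {AM01, AL02, RF01}, reported {AM01, AL03}:
# 1 true positive, 1 false positive, 2 false negatives.
p, r, f1 = prf1([{"AM01", "AL02", "RF01"}], [{"AM01", "AL03"}])
```

For this single query the sketch yields a precision of 0.5 and a recall of one third.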
For example, if the ground truth for a query is {AM01, AL02, RF01}, but SQLFlex reports {AM01, AL03}, we count one true positive (AM01), one false positive (AL03), and one false negative (RF01). Here, true positives refer to correctly detected anti-patterns, false positives to incorrectly reported ones, and false negatives to missed anti-patterns. For all three metrics, higher values indicate better performance. Since SQLCheck supports only three of the selected rules, we compute its metrics based on those three rules.

Results. Figure 2 shows the overall results for all selected linter rules. SQLFlex significantly outperforms the purely LLM-based approach, SQLCheck, and SF-ANSI across all three metrics. The purely LLM-based linter suffers from high false positive rates, likely due to the ambiguity of natural language rule descriptions and the lack of effective validation mechanisms, which highlights the importance of AST-based analysis. Similarly, SQLCheck, which relies on text-based analysis without accounting for SQL's structural semantics (e.g., column names), also performs poorly.

Fig. 2. Overall performance comparison of linters

            Precision   Recall   F1 Score
LLM         0.307       0.730    0.432
SQLCheck    0.739       0.405    0.523
SF-ANSI     0.524       0.700    0.600
SF-TSQL     0.981       0.984    0.982
SQLFlex     0.967       0.996    0.981

SF-ANSI frequently failed to parse queries in the SESD dataset due to TSQL-specific syntax, such as using square brackets to enclose identifiers with special characters or reserved keywords (e.g., SELECT t AS [SELECT]), leading to low linting accuracy. In contrast, SQLFlex was able to handle such syntax correctly via the hybrid segmentation approach. When compared with SF-TSQL, SQLFlex achieves slightly lower precision (96.69% vs. 98.06%), higher recall (99.63% vs. 98.42%), and a comparable F1 score (98.14% vs. 98.24%). The errors made by SQLFlex are mainly due to imprecise segmentation by the LLM when faced with ambiguous syntax in queries. For example, in "SELECT Count FROM t ...", the LLM misinterpreted Count as an expression (resembling the function COUNT()) rather than a column name, causing the linter to incorrectly report the anti-pattern "AL03". SF-TSQL also had occasional parsing failures that led to incorrect results. For example, it failed to parse the query "SELECT Class AS Class ...", causing it to miss the anti-pattern "AL09". Overall, the linter implemented using SQLFlex matched the performance of SQLFluff with significantly less manual effort.

6.2 Test-Case Reduction

Test-case reduction is a query rewriting application. Automated DBMS testing tools, such as SQLancer [79] and SQLsmith [2], often generate long and complex queries to expose bugs. However, these queries are often too complicated for developers to identify the underlying issues. Consider the NoREC oracle [70], which identifies logic bugs by comparing results between two semantically equivalent queries that differ only in whether query optimization is applied. The resulting bug reports typically contain complex expressions [49, 70, 96] that make it difficult for developers to fix potential bugs. Therefore, reducing test cases while preserving bug-triggering behavior is essential for efficient bug fixing [63, 76, 81].

Baselines. SQLess [50] is a state-of-the-art test-case reducer designed specifically for dialect-agnostic SQL reduction. It uses AST-based strategies, such as removing clauses and simplifying expressions, to iteratively reduce queries. After each reduction, SQLess executes the reduced query with the test oracle to verify whether the bug can still be triggered. When encountering a query containing dialect-specific features, SQLess uses ANTLR's syntax-based error recovery algorithm to generate new grammar rules.
With these added grammar rules, SQLess can parse previously unsupported queries. We compare a test-case reduction tool implemented in SQLFlex with SQLess as the baseline.

Table 2. Simplification ratios by DBMS

            MaxSimRatio             AvgSimRatio
DBMS        SQLess     SQLFlex     SQLess    SQLFlex
MySQL       87.92%     87.92%      6.27%     7.53%
SQLite      86.83%     94.75%      1.83%     18.91%

Dataset. We constructed a new dataset of bug-triggering queries in SQLite and MySQL to evaluate SQLFlex. A possible alternative would have been SQLess's dataset. However, it is not publicly available and is imbalanced, containing over 32,000 MySQL-compatible cases but only 143 SQLite cases. Thus, we used SQLancer, a widely used DBMS testing tool, to generate bug-triggering queries in prior versions of SQLite (v3.28.0) and MySQL (v8.2.0). We increased the test-case complexity by setting SQLancer's expression-depth parameter to 5, then collected the first 1,500 bug-inducing test cases for each dialect, retaining only those found by the NoREC oracle. The final dataset includes 1,166 SQLite and 1,491 MySQL test cases, among which 583 contain subqueries and 767 have joins. Notably, SQLess already supports MySQL, as it uses the MySQL grammar, but needs its adaptive parser to handle SQLite-specific features.

Implementation. We adopted the test-case reduction algorithm described in SQLess [50]. Specifically, we implemented expression reduction strategies (e.g., removing the left operand of a logical operator) and clause reduction strategies (e.g., removing the ORDER BY clause) on the ASTs generated by SQLFlex. We implemented only the strategies relevant to our dataset and sufficient for effective simplification. Since SQLess applies strategies enumeratively, this selective implementation does not bias the results.
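The oracle-guided clause-removal loop can be sketched as below. This is a greatly simplified sketch of the general reduction idea (try a removal, keep it only if the oracle still reports the bug), using a hypothetical dict-shaped AST rather than SQLFlex's or SQLess's actual data structures.

```python
def reduce_clauses(ast, clauses, still_triggers_bug):
    """Greedy clause-level reduction: drop a clause, keep the change only if
    the oracle still reports the bug."""
    for clause in clauses:                      # e.g., "orderby", "where"
        if clause in ast:
            removed = ast.pop(clause)
            if not still_triggers_bug(ast):     # bug disappeared: restore
                ast[clause] = removed
    return ast

# Hypothetical bug that only depends on the WHERE clause:
query_ast = {"select": "*", "from": "t", "where": "x !< 8", "orderby": "x"}
oracle = lambda ast: "where" in ast
reduced = reduce_clauses(query_ast, ["orderby", "where"], oracle)
```

In the real setting, the oracle step executes the pretty-printed query against the DBMS under test, which is why reduction requires running the queries.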
For a fair comparison, we disabled additional strategies in SQLess.

Metrics. Following SQLess, we used Average Simplification Ratio (AvgSimRatio) and Maximum Simplification Ratio (MaxSimRatio) as evaluation metrics [50]. These metrics measure the percentage reduction in the number of tokens between the original and reduced queries, with higher values indicating a more effective reduction.

Results. Table 2 compares the simplification ratios achieved by SQLess and SQLFlex. Overall, SQLFlex performs better than SQLess. For the MySQL test cases, both tools achieved comparable performance, showing that SQLFlex, despite using a minimal grammar (with 61.82% of reduced queries requiring segmentation), can match the effectiveness of a full-grammar-based tool. For the SQLite test cases, SQLFlex significantly outperforms SQLess. This is primarily because SQLess struggled with SQLite-specific features, while SQLFlex handled them effectively. The adaptive parser in SQLess appends all unrecognized tokens as terminal symbols without capturing their semantic meaning. Moreover, for complex queries, the adaptive parser often generates incorrect grammar rules, resulting in ineffective reductions, as the reduction rules fail to interpret the tree nodes. In contrast, although 92.78% of reduced queries were not supported by SQLFlex's grammar, SQLFlex still generates ASTs that preserve the semantic structure of the SQL query. Its anchor-based expression segmentation approach further enables robust handling of complex expressions, leading to more effective reductions.

7 Standalone Evaluation

In this section, we further investigate SQLFlex itself, that is, whether SQLFlex can effectively parse queries. Specifically, we aimed to answer the following questions:

• Q1. How effectively does SQLFlex handle dialect-agnostic parsing of SQL queries (see Section 7.1)?
• Q2. How do SQLFlex's key components contribute to its overall effectiveness (see Section 7.2)?
• Q3. How efficiently does SQLFlex handle SQL queries with dialect-specific features (see Section 7.3)?
• Q4. Can SQLFlex be extended to parse Data Definition Language (DDL) statements, such as CREATE TABLE (see Section 7.4)?

Table 3. List of datasets and query features used in the standalone evaluation (WF: Window Function, CTE: Common Table Expression, SubQ: Subquery)

Dataset                   Queries   WFs   CTEs   SubQs   Joins
DuckDB Test Suite         2,092     185   22     496     284
PostgreSQL Test Suite     1,207     139   0      162     128
MySQL Test Suite          1,039     3     0      67      77
ClickHouse Test Suite     2,485     69    1      573     513
H2 Test Suite             358       21    0      67      50
CockroachDB Test Suite    1,369     100   2      287     202
Cassandra Test Suite      150       0     0      0       0
SurrealDB Test Suite      351       0     0      5       0
Total                     9,051     517   25     1,657   1,254

Dataset. To investigate whether SQLFlex can handle diverse and niche dialect-specific features, we built a large-scale benchmark dataset extending the dialect types covered in Section 6. To the best of our knowledge, there is no publicly available benchmark suitable for the evaluation of dialect-agnostic SQL parsing. While there are existing query benchmarks such as TPC-H and SQLStorm [73], these benchmarks are designed for DBMS performance evaluation and primarily contain standard SQL features. Our benchmark consists of queries in eight different dialects, which we collected from the corresponding DBMSs' test suites. These test suites typically include statements that reflect the unique features of their respective systems [95], making them more suitable for evaluating SQLFlex's dialect-agnostic parsing ability.
We selected the SQL dialects and the DBMSs that implement them based on several factors: (1) availability of open-source test suites; (2) diversity in usage and functionality, including a combination of well-established systems (e.g., PostgreSQL and MySQL) and emerging or specialized ones (e.g., DuckDB and SurrealDB) [15]; (3) distinct dialect-specific features, avoiding largely compatible systems like MySQL and MariaDB that share a common codebase. Table 3 summarizes the datasets, including the number of queries and the occurrence of complex query features (i.e., window functions, CTEs, subqueries, and joins). The datasets cover a broad range of SQL features, evaluating SQLFlex comprehensively. We built the benchmark dataset by extracting SELECT statements from test suites. To ensure syntactic validity, we excluded queries that failed to parse using dialect-specific parsers. Since many of the extracted queries had similar structures or features, we applied a keyword-based deduplication heuristic to retain a diverse subset. SQL parsing depends primarily on keywords [77], making keyword diversity a useful proxy for syntactic variety. We treated queries with identical keyword sets as duplicates, retaining only the longest one, assuming it to be the most complex. For H2, since we were unable to obtain a ready-to-use H2-specific parser, we used JSQLParser, which supports H2 syntax. For SurrealDB, as we were unable to obtain its keyword list, we kept all queries in its dataset.

7.1 Q1: Effectiveness and Reliability

In this section, we show SQLFlex's effectiveness and reliability in parsing queries in diverse SQL dialects.

Baselines. We compared SQLFlex against representative baselines.
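The keyword-based deduplication heuristic can be sketched as follows (a simplified sketch with a whitespace tokenizer and a toy keyword set; not the authors' script):

```python
def dedup_by_keywords(queries, keywords):
    """Keep, for each distinct keyword set, only the longest query."""
    best = {}
    for q in queries:
        key = frozenset(w for w in q.upper().split() if w in keywords)
        if key not in best or len(q) > len(best[key]):
            best[key] = q  # retain the longest (assumed most complex) query
    return list(best.values())

KWS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER"}
qs = ["SELECT a FROM t",
      "SELECT a, b, c FROM t2",          # same keyword set, longer -> kept
      "SELECT a FROM t WHERE a > 1"]     # distinct keyword set -> kept
```

Queries with identical keyword sets collapse to a single representative, which keeps the benchmark syntactically diverse without inflating its size.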
To the best of our knowledge, no existing academic work has proposed a general-purpose, dialect-agnostic SQL parser. The most closely related work is SQLess [50]. As SQLess is applicable only to test-case reduction, we evaluate it on this use case in Section 6.2. Thus, we compare SQLFlex against two open-source tools, which are representative of dialect-specific parsers and multi-dialect parsers, respectively, and a purely LLM-based baseline parser:

• PostgreSQL Parser: We chose the PostgreSQL parser as the dialect-specific parser baseline. We used pglast (v6.10) [66], a Python package of the official PostgreSQL parser. While dialect-specific parsers exist for other dialects, we chose to focus on PostgreSQL for two reasons. First, it is a widely used dialect with strong conformance to SQL standards. Second, we expect similar results for other dialect-specific parsers, as they share the same limitation. Specifically, we expect them to perform perfectly on their target dialect but fail to parse specific features of other dialects.
• SQLGlot v25.22.0: SQLGlot [84] is a state-of-the-art multi-dialect parser supporting over 30 dialects. We evaluated two configurations of SQLGlot: its standard mode (SQLGlot_s), which parses queries using a superset of all supported dialects, and its dialect-specific mode, used where the dialect is supported (SQLGlot_d). SQLGlot is expected to perform well on its explicitly supported dialects due to its extensive manual implementation effort, and better than pglast on unsupported dialects in its standard mode. In other words, we set SQLGlot's configuration so that it is expected to perform best for each of the dialects.
• LLM-based Baseline Parser: We implemented a simple baseline that uses an LLM (GPT-4.1 API) to generate an AST in JSON format directly. We provide the same AST definition as used by SQLFlex.
We use a rule-based printer to convert the AST into a query, both to prevent hallucinations from an LLM-based pretty-printer and to align with SQLFlex, which also uses a rule-based printer.

Metrics. Evaluating the semantic correctness of ASTs in a multi-dialect setting is inherently challenging for two reasons. First, different parsers adopt different AST representations, which can lead to structurally different yet semantically equivalent ASTs for the same query. For example, pglast represents both the CAST function and the :: operator using a unified type-casting node. Similarly, SQLGlot represents both IF and CASE expressions using a single expression type. Second, constructing reliable ground-truth ASTs is challenging due to the inaccessibility of official reference parsers; specifically, while some DBMSs offer official standalone parser packages (for example pglast, which is derived directly from PostgreSQL's source code), many others only embed their parsers within the database engine (e.g., DuckDB). Extracting such embedded parsers would incur significant engineering effort. To address the challenge of evaluating SQL parsers, we adopt property-based testing [11], a technique that checks whether a system behaves as expected according to specific properties. In particular, we used a widely used round-trip property to measure parser correctness [14, 69, 88]; satisfying such a property is useful for developing query rewriting tools [6]. The round-trip property checks whether a query can be parsed into an AST and printed back to the same query as the input, offering a practical way to evaluate parsers without needing ground-truth labels. We introduce the Query Round-trip (Q-RT) Rate as the metric for our evaluation. Q-RT is an extension of the round-trip property that mitigates inaccurate evaluation results caused by normalization applied by baselines. Given an initial query Q0, we define intermediate parsing and printing steps
3 (SIGMOD), Article 161. Publication date: June 2026.
161:18 Junwen An, Kabilan Mahathevan, and Manuel Rigger
as follows:
Q1 = Print(Parse(Q0)),  Q2 = Print(Parse(Q1))  (1)
Based on these steps, we define the Q-RT property, which checks that the queries remain identical through each parse-print cycle (i.e., Q0 = Q1 = Q2). It provides insights on whether SQLFlex can consistently parse a query into an AST and transform the AST back to an executable query. We relax Q-RT for pglast and SQLGlot to only require Q1 = Q2, to avoid disadvantaging them due to normalization (e.g., pglast converts :: operators into CAST functions). Additionally, we ignore whitespace and parentheses during comparison to avoid false alarms from formatting differences. The Q-RT Rate is calculated as the ratio of the number of queries satisfying Q-RT to the total number of queries in each dataset. Higher values indicate better dialect-agnostic parsing effectiveness and reliability. Due to space limits, we present additional results and analysis of AST-level consistency using an AST round-trip property in the appendix.
Note that while the round-trip property cannot reliably detect incorrect ASTs, it serves as a necessary condition for practical parser correctness. We excluded SQLess from this comparison: its implementation guarantees round-trip equivalence, but produces ASTs with limited practical utility, as user-defined rules often fail to interpret the AST. The selected baselines satisfy the round-trip property while also producing usable ASTs, making them suitable for this evaluation.
In addition to Q-RT, we further evaluate the parsing correctness of SQLFlex by assessing semantic correctness, where pglast and the PostgreSQL queries naturally serve as the ground truth.
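Concretely, the Q-RT property can be sketched as a small predicate (a sketch, not our actual evaluation harness; `parse` and `pretty_print` stand for any parser/printer pair under test):

```python
import re

def normalize(query: str) -> str:
    # The comparison ignores whitespace and parentheses to avoid
    # false alarms from formatting differences.
    return re.sub(r"[\s()]", "", query)

def q_rt_holds(q0: str, parse, pretty_print, relaxed: bool = False) -> bool:
    """Check the Q-RT property for one query.

    Strict mode requires Q0 = Q1 = Q2; the relaxed mode used for
    normalizing baselines (e.g., pglast) only requires Q1 = Q2.
    """
    q1 = pretty_print(parse(q0))
    q2 = pretty_print(parse(q1))
    if relaxed:
        return normalize(q1) == normalize(q2)
    return normalize(q0) == normalize(q1) == normalize(q2)
```

A parser that normalizes its input (for instance, by uppercasing keywords) fails the strict check on the first cycle but satisfies the relaxed variant, mirroring how the normalizing baselines are scored.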
Following a methodology of previous work on evaluating the correctness of DDL statement parsing [19], we extract a core set of semantic elements from a ground-truth AST and compare whether these elements match those extracted from the ASTs produced by SQLFlex and SQLGlot. Specifically, at the clause level, we extract column names from the SELECT clause, table names from the FROM clause, aliases and their targets, as well as joined table names and join types from the JOIN clauses. At the expression level, we extract operators, their literal operands, and their precedence. Since different parsers may normalize certain operators differently, we restrict extraction to a predefined set of common operators to avoid counting semantically equivalent forms as mismatches. For each semantic element, we compute the percentage of ASTs whose extracted results overlap with the ground truth. We refer to this complementary metric as the Semantic Match (SM) Rate. Note that the SM rate also remains a best-effort correctness metric, but it provides a correctness reference, complementing the self-referencing round-trip metric.
Dialect-agnostic effectiveness (Q1). Table 4 presents the results comparing SQLFlex against baselines in Q-RT rate, and reports the geometric mean [20] in the last row. For each dialect, the SQLFlex results are averaged over three runs, with standard deviations below 0.6%, indicating consistent performance. Overall, the results show that SQLFlex outperforms all baselines. pglast achieves an expected perfect score for the PostgreSQL dataset, but across all dialects, it only has a geometric mean of 65.94% Q-RT. The standard mode of SQLGlot (SQLGlot𝑠) performs slightly better, with a geometric mean of 74.43%. The dialect-specific mode (SQLGlot𝑑) achieves a higher geometric mean of 93.26%, but only on dialects it explicitly supports. This improved performance is due to its manually implemented parsing rules for those dialects.
However, it requires substantial human effort and still fails to capture some unique features. The LLM-based baseline parser, although requiring no manual effort, performs poorly at generating consistent ASTs (46.48% geometric mean Q-RT). In contrast, SQLFlex outperforms all baselines with a 96.37% geometric mean Q-RT, and ranges from 91.55% to 100% Q-RT across all eight dialects.

Table 4. Query Round-Trip Rates. Bold indicates the best score for each dataset (excluding pglast on PostgreSQL).

Dataset       LLM      pglast    SQLGlot𝑠   SQLGlot𝑑   SQLFlex
DuckDB        51.53%   72.42%    92.97%     93.88%     95.86%
PostgreSQL    49.30%   100.00%   86.67%     87.99%     96.88%
MySQL         66.41%   76.42%    86.91%     97.40%     99.39%
ClickHouse    32.27%   47.00%    68.61%     94.00%     94.67%
H2            56.15%   81.01%    94.69%     –          97.86%
CockroachDB   43.46%   69.98%    70.93%     –          95.06%
Cassandra     60.00%   74.00%    76.00%     –          100.00%
SurrealDB     27.35%   32.76%    38.46%     –          91.55%
Mean          46.48%   65.94%    74.43%     93.26%     96.37%

Table 5. SM rates on the PostgreSQL dataset. Bold indicates the best. (Opt.: Operator, Opn.: Operand, Prec.: Precedence)

Parser      Tables   Cols     Alias    Joins    Opt.     Opn.     Prec.
SQLFlex     94.86%   98.09%   95.86%   97.93%   97.68%   95.44%   94.20%
SQLGlot𝑑    79.37%   88.40%   85.17%   86.33%   86.58%   84.18%   84.34%
SQLGlot𝑠    77.46%   86.58%   83.26%   84.51%   84.76%   82.35%   82.52%

We inspected all failed cases. We find that for pglast and SQLGlot, almost all failures are due to parsing errors caused by dialect-specific features. For the LLM-based baseline, failures occur due to inconsistent end-to-end generation that produces different results across the AST generation and printing stages. For SQLFlex, the few Q-RT failures are mostly caused by limitations in handling dialect-specific features in the rule-based pretty-printer. A typical example comes from CockroachDB, which allows explicit index annotations like “table@idx” [12] in table references.
In certain queries in the dataset, this is written as “cb@w”. Since the generic identifiers (cb and w) provide no semantic information, the segmenter may incorrectly interpret cb@w as a schema-qualified table name. After pretty-printing, the output becomes cb.w, which fails to match the original input. Similarly, in the SurrealDB dataset, clauses like “GROUP ALL” are incorrectly pretty-printed as “GROUP BY ALL”. Lastly, a small portion of the ASTs produced by SQLFlex (0–2.5%) contained Unsegmented nodes, which we counted as failed cases. These typically arise from LLM misinterpretations in complex queries (e.g., multiple subqueries) and fail to be repaired.
Semantic match rates (Q1). We report the SM rates on PostgreSQL queries in Table 5. Overall, SQLFlex achieves the highest SM rates across all evaluated semantic elements, exceeding 94% at both the clause level and the expression level, and outperforming both variants of SQLGlot. While the overall trend of SM is consistent with Q-RT, SM additionally captures certain semantic mismatches that Q-RT fails to detect. For example, SQLFlex incorrectly interpreted the expression int2 '2' in the SELECT clause as a column reference, whereas it was actually an implicit type-conversion expression that casts '2' to type int2. This mismatch was missed by Q-RT, because the printed columns were identical despite the differing semantics. Although such cases can be challenging even for LLM-based parsing, their rarity is reflected in the high SM rates, indicating that SQLFlex remains effective in practice.
7.2 Q2: Ablation Study
In this section, we investigate the contributions of important components of SQLFlex.
Methodology. We compared four variants of SQLFlex to assess the impact of its key components:
Table 6. Ablation study results in Q-RT rates.
Bold indicates the best score for each dataset.

Dataset       SQL-92   SQLFlex𝑣   SQLFlex𝑎   SQLFlex𝑚   SQLFlex
DuckDB        31.98%   88.91%     86.42%     90.68%     95.86%
PostgreSQL    36.54%   90.89%     85.25%     95.44%     96.88%
MySQL         62.57%   95.00%     93.65%     96.92%     99.39%
ClickHouse    37.23%   90.38%     90.70%     92.31%     94.67%
H2            54.19%   93.30%     91.62%     96.09%     97.86%
CockroachDB   34.34%   72.61%     86.71%     85.17%     95.06%
Cassandra     53.34%   100.00%    100.00%    98.67%     100.00%
SurrealDB     17.47%   82.05%     90.88%     86.34%     91.55%
Mean          38.39%   88.77%     90.55%     92.58%     96.37%

• SQL-92: We removed the entire LLM-based segmenter to demonstrate the effectiveness of segmentation overall.
• SQLFlex𝑣: We removed the validation and repair component to examine its contribution to parsing correctness.
• SQLFlex𝑎: We removed anchor-based expression segmentation, letting the segmenter process an entire expression.
• SQLFlex𝑚: We replaced GPT-4.1 with GPT-4.1-mini as the base model to evaluate whether SQLFlex remains effective with a more lightweight LLM.
We evaluated all variants on eight datasets using the Q-RT rate and also report the number of LLM calls for expression segmentation in SQLFlex and SQLFlex𝑎.
Component contributions (Q2). We compare the geometric mean across the eight datasets, with results shown in Table 6. Removing the entire LLM-based segmenter (SQL-92) results in a drastic performance drop to 38.39% Q-RT, highlighting that LLM-based segmentation is highly effective when the grammar-based parser fails to parse dialect-specific features. Removing only the validation and repair module (SQLFlex𝑣) leads to a 7.60% Q-RT drop, showing that while the core segmentation is already reliable, validation further improves robustness by correcting edge cases. Removing only the anchor-based segmentation (SQLFlex𝑎) causes a 5.83% Q-RT drop, indicating that our approach improves expression handling effectiveness.
Moreover, SQLFlex𝑎 uses 36.61% more LLM calls in total, showing that the anchor-based strategy improves efficiency by avoiding redundant segmentation of known operators. In queries with more complex expressions, we believe the anchor-based strategy would be even more effective and efficient. Finally, SQLFlex𝑚, which replaces GPT-4.1 with GPT-4.1-mini, achieves a mean Q-RT of 92.58%, only 3.79% below SQLFlex, showing that SQLFlex remains accurate even with lightweight models.
7.3 Q3: Efficiency
In this section, we investigate SQLFlex's efficiency in query parsing and the overhead of LLM calls.
Methodology. We assessed the performance of SQLFlex on each dataset by measuring parsing time and the number of LLM calls (inclusive of repair attempts). Since the base grammar can affect the average efficiency (i.e., more grammar rule matches indicate fewer LLM calls), we use two grammar configurations to provide a more comprehensive view of the efficiency of SQLFlex:
• SQL-92: The standard implementation of SQLFlex.
• SQL:2003: A newer SQL standard that extends SQL-92 by adding more features, such as window functions.

Table 7. Efficiency metrics for query parsing with SQLFlex. Time is measured in seconds.

              ------- SQL-92 -------      ------ SQL:2003 ------
Dataset       T_avg   T_LLM   N_LLM       T_avg   T_LLM   N_LLM
DuckDB        6.09    8.96    10101       5.26    9.08    7922
PostgreSQL    6.81    10.74   5422        5.55    10.20   4276
MySQL         1.94    5.13    1930        1.87    5.52    1721
ClickHouse    4.38    7.02    6912        3.76    6.32    6148
H2            2.93    6.35    795         1.97    5.31    624
CockroachDB   6.21    9.44    6610        4.72    7.97    5150
Cassandra     1.29    2.70    142         1.36    2.99    124
SurrealDB     3.99    4.97    1311        2.96    3.78    1047
Mean          3.67    6.39    2274        3.06    5.94    1870 (↓21.6%)

For each configuration, we report the average parsing time (T_avg) and the total number of LLM calls (N_LLM) per dialect.
We also measure the average parsing time specifically for queries that required segmentation (T_LLM).
Efficiency (Q3). We present the results in Table 7. Under the standard SQL-92 configuration, each query takes on average 3.67 seconds to parse, while queries requiring segmentation take 6.39 seconds. Given the limited coverage of the base grammar and the prevalence of dialect-specific features in the collected queries, frequent LLM invocations are expected (as shown in Section 7.2), leading to increased parsing overhead. After extending support to SQL:2003 features, we observe a 21.6% reduction in the number of LLM calls and a 16.62% decrease in average parsing time. This improvement can be attributed to the inclusion of additional grammar rules that capture commonly used SQL features. Moreover, with more anchors introduced in SQL:2003, such as in window functions, segmentation itself achieves a further 7.04% speed-up.
Under the SQL:2003 configuration, queries that required segmentation incurred a median of 1 to 6 LLM calls across the eight dialects. The worst case that we encountered was an artificially constructed PostgreSQL query containing 90 type-cast operators :: to an unknown inet type, which triggered 94 LLM calls. Note that this worst case is an outlier rather than typical behavior, and the number of LLM calls is highly workload-specific. In practice, users can extend the grammar with frequently used SQL features, such as the :: operator, to further reduce parsing overhead, while still having the flexibility to handle queries beyond the defined grammar.
7.4 Q4: Parsing DDLs
In this section, we explore whether SQLFlex can be extended beyond query parsing to support Data Definition Language (DDL) statements. As highlighted by the SchemaPile [19] work, which collected database schemas, parsing DDL statements in a multi-dialect setting also remains a major challenge.
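To make the challenge concrete, a single dialect-specific table option is enough to make one dialect's DDL unparseable by another engine's parser. The following sketch (ours, not part of the paper's evaluation) feeds a MySQL-flavored CREATE TABLE to SQLite's parser via Python's standard library:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# MySQL-flavored CREATE TABLE: the Engine=InnoDB table option is
# dialect-specific and rejected by SQLite's parser.
mysql_ddl = "CREATE TABLE t (id mediumint NOT NULL) Engine=InnoDB"
try:
    conn.execute(mysql_ddl)
    parsed = True
except sqlite3.OperationalError:
    parsed = False  # syntax error near "Engine"

# Without the dialect-specific option, the statement parses fine;
# SQLite accepts unknown type names such as mediumint.
conn.execute("CREATE TABLE t (id mediumint NOT NULL)")
```

Here the statement differs from valid SQLite only in its trailing table option, yet the whole statement is rejected, which is the failure mode a dialect-agnostic parser must tolerate.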
Among DDL statements, CREATE TABLE statements are particularly difficult to parse due to the large number of dialect-specific features, including diverse data types and DBMS-specific keywords, as observed by SchemaPile [19]. Consequently, we focus our evaluation on the CREATE TABLE DDL type.
Extending SQLFlex. Extending SQLFlex to support CREATE TABLE statements primarily required engineering effort, while using the same hybrid segmentation algorithm. Specifically, this extension involved adding the corresponding SQL-92 grammar rules, AST node definitions, segmentation prompts, and mappings between segmented components and AST nodes. We illustrate this extension with a simplified example below, with dialect-specific features in red. Segmentation proceeds in two iterations. In the first iteration, SQLFlex identifies the table name (t) and the column definitions, while treating the Engine option as dialect-specific. In the second iteration, SQLFlex segments the column definition into data type (mediumint), column name (id), and column constraints (NOT NULL).

CREATE TABLE t (id mediumint NOT NULL) Engine=InnoDB;

Table 8. Correctness of parsers on CREATE TABLE statements. Bold indicates the best. (PK: Primary Key, FK: Foreign Key)

Parser      Tables   Cols     NotNull  Unique   PK       FK
SQLFlex     99.75%   97.38%   99.46%   99.72%   99.80%   96.41%
SQLGlot𝑑    97.41%   97.64%   88.50%   89.03%   97.64%   97.64%
SQLGlot𝑠    90.86%   91.10%   82.29%   83.21%   91.10%   91.10%

Methodology. We followed the methodology of SchemaPile [19] to assess the correctness of parsing results. Specifically, we used pglast as the reference parser and treated its parse results as the ground truth, using the CREATE TABLE statements from the SchemaPile dataset.
After excluding statements that failed to be parsed by pglast and applying keyword-based deduplication, 18,139 CREATE TABLE statements in the PostgreSQL dialect remained. From the parsed ASTs, we extracted and compared the following elements: table names with identifier delimiters stripped, column names appearing in column definitions, and the counts of NOT NULL, UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints. For each element, we calculated the percentage of ASTs whose extracted results match the ground truth.
DDL-parsing effectiveness (Q4). We present the results in Table 8. SQLFlex achieves consistently high correctness across all elements, exceeding 96.41% in every category, and outperforms SQLGlot on four out of the six elements, demonstrating the applicability of SQLFlex to parsing DDL statements. We observed that CREATE TABLE statements have a generally high success rate, partly because they are less challenging than SELECT statements, which tend to contain more dialect-specific features. Although SQLFlex performs slightly worse than SQLGlot𝑑 on column definitions and foreign key constraints, the difference is only about 1%. Moreover, when considering the percentage of queries for which all elements are parsed correctly, SQLFlex has a substantially higher correctness rate of 93.79%, compared to 80.69% for SQLGlot𝑑.
8 Discussion and Limitation
Effectiveness. To further improve the effectiveness of SQLFlex, several directions can be explored. First, retrieval-augmented generation may help resolve ambiguities during segmentation by retrieving relevant documentation from the target SQL dialect [97]. Second, fine-tuning the LLM could improve its hierarchical reasoning capabilities [34]. However, this approach is less scalable and presents challenges in curating representative datasets across diverse dialects [68].
Third, although prompt engineering was not a primary focus of our work, more refined prompting strategies may further improve segmentation effectiveness. Lastly, as LLMs continue to advance, we expect the effectiveness of SQLFlex to grow accordingly. Given that current LLMs are autoregressive, we believe the sequential nature of our approach will remain relevant.
Efficiency. While the parsing efficiency of SQLFlex does not yet match that of rule-based parsers, it remains suitable for non-interactive use cases or scenarios involving repeated operations on the same query, such as integrating SQL linting into CI/CD pipelines. The main efficiency bottleneck lies in the large number of LLM calls required. We expect the efficiency of SQLFlex to improve, as various techniques to accelerate LLM inference continue to be proposed [25, 38, 67]. Such advancements are also anticipated by other LLM-powered interactive systems, such as Text-to-SQL tools [82, 83]. Beyond model-level optimization, SQLFlex can further reduce the number of LLM calls by extending the base grammar with additional grammar rules for common SQL features, thus avoiding redundant segmentations. For instance, when inspecting queries that required LLM-based segmentation, we found that they often contained common, but non-standard, features like the :: type-casting operator and the LIMIT clause. By adding such features to the grammar, the performance could be further optimized in practice. In an additional experiment, where we added the :: and LIMIT features to the SQL:2003 grammar, we could further reduce the number of LLM calls by 36.33% on the PostgreSQL dataset, for example.
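The effect of such grammar extensions can be illustrated with a toy sketch (ours, not SQLFlex's actual grammar machinery): every symbol token the base grammar does not recognize stands in for a fragment that would trigger an LLM segmentation call, and adding a rule for the `::` operator eliminates those calls for queries that use it.

```python
import re

def segmentation_calls(query: str, known_symbols: set[str]) -> int:
    # Toy stand-in for a grammar's vocabulary: any symbol token not in
    # the vocabulary represents a fragment needing LLM segmentation.
    tokens = re.findall(r"::|\w+|\S", query)
    return sum(1 for t in tokens if not t.isalnum() and t not in known_symbols)

# Hypothetical SQL-92-style symbol set for illustration only.
sql92_symbols = {"(", ")", ",", "*", "+", "-", "/", "=", "'", "."}
query = "SELECT a::int2, b::inet FROM t"

calls_before = segmentation_calls(query, sql92_symbols)          # '::' unknown twice
calls_after = segmentation_calls(query, sql92_symbols | {"::"})  # no unknown symbols
```

Under this sketch, extending the vocabulary with `::` drops the segmentation calls for the example query from two to zero, mirroring the reduction observed in the experiment above.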
Moreover, existing grammar inference techniques that automatically derive context-free grammars from parse trees [29, 74] could enable SQLFlex to cache recurring query patterns or dynamically integrate newly learned rules into its grammar-based parser, further minimizing the need for LLM invocations.
Soundness. Unlike traditional parsing algorithms that provide formal correctness guarantees, SQLFlex adopts a best-effort hybrid parsing approach due to its usage of LLMs. However, achieving full semantic correctness is inherently difficult in a multi-dialect setting without explicit dialect knowledge, and this challenge also affects traditional multi-dialect parsers such as SQLGlot. A representative example is the || operator: in MySQL it denotes logical OR, while in PostgreSQL and SQLite it performs string concatenation. Moreover, PostgreSQL assigns || lower precedence than arithmetic operators, whereas SQLite assigns it higher precedence. As a result, the expression 3 * 2 || '7' evaluates differently across the three DBMSs (true, 81, and 67 for MySQL, SQLite, and PostgreSQL, respectively). Notably, SQLGlot assigns || the lowest precedence uniformly across dialects, which is incorrect for SQLite. As shown in our evaluation, traditional parsers such as SQLGlot also suffer from parsing and correctness issues due to the diversity and evolution of SQL dialects. In contrast, SQLFlex achieves high correctness on the PostgreSQL dataset and proves effective in practical applications such as SQL linting and test-case reduction, where constructing a structurally sound AST across diverse dialects is more important than strict semantic guarantees. Finally, SQLFlex's validation and repair mechanisms further improve its robustness in practice.
Use cases.
While this paper reports the results of applying SQLFlex to only two applications, SQLFlex is, in principle, applicable to any AST-based SQL analysis or rewriting workflow rather than being limited to task-specific scenarios. Additional applicable use cases include SQL-level optimization (e.g., Calcite faces similar dialect-related parsing challenges [36]), data lineage and provenance analysis [56, 57], and evaluations of Text-to-SQL systems [42, 94]. Some applications, such as SQL-level optimization, are correctness-critical. While SQLFlex lacks formal correctness guarantees, it can be paired with SQL solvers [33] to strengthen semantic assurances, as demonstrated by recent agentic query optimization frameworks that use LLMs to generate optimized queries [78]. Since SQL solvers also face challenges in multi-dialect support, further research on making them dialect-aware might be needed. The strong practical demand for cross-dialect parsing is further reflected in the significant engineering effort behind open-source projects, such as SQLGlot, SQLFluff, and Calcite, whose issue trackers [26, 27, 36] indicate that dialect coverage remains a persistent pain point. Yet, despite this clear demand, prior research has offered few broadly applicable solutions. SQLFlex aims to fill this gap by providing a dialect-agnostic parsing framework that benefits SQL tooling beyond the use cases presented.
9 Related Work
SQL parsers. Existing SQL parsers can be categorized into generated parsers and hand-written parsers. Generated parsers are automatically built from formal grammar specifications. For example, ANTLR [4] generates parsers that use the LL(*) algorithm [65]. Some DBMSs, such as PostgreSQL, use YACC-style toolkits [54] for their built-in parsers. Mühleisen et al. proposed
using Parsing Expression Grammars [21] as a more extensible and efficient parser generator for DuckDB [59]. Hand-written parsers offer greater flexibility for dialect-specific syntax. A state-of-the-art example is SQLGlot [84], a widely used multi-dialect SQL parser with over 8,000 GitHub stars, supporting more than 30 dialects. However, such tools require substantial manual effort to maintain and extend. sqlparse [1] is a non-validating parser that builds approximate parse trees, but is unreliable for query rewriting [17]. To the best of our knowledge, SQLFlex is the first work that integrates LLMs with grammar-based parsing to achieve dialect-agnostic query parsing.
SQL dialects. Multiple approaches have addressed the SQL dialect problem in various applications. In DBMS testing, tools like Sedar [23] and QTRAN [51] leverage LLMs to translate test cases across different dialects. Recent Text-to-SQL benchmarks like MiniDev [44] (extending to MySQL and PostgreSQL) and Spider 2.0 [41] (including BigQuery and Snowflake) now incorporate more dialects in addition to SQLite. Closely related are tools like SQL-GEN [68], which synthesizes dialect-specific training data, and new architectures like MOMQ's Mixture-of-Experts model [52] and Exec-SQL's use of execution feedback [93], all aimed at improving multi-dialect capabilities of LLMs. Dialect translation is crucial for applications like data migration. CrackSQL [99] is a state-of-the-art tool combining rule-based and LLM-based approaches for translation. In contrast to these works that focus on generating or translating queries between dialects, SQLFlex addresses the foundational challenge of dialect-agnostic SQL parsing.
Query rewriting. Many applications can be categorized as query rewriting under our formulation. Beyond the discussed use cases, widely researched rewriting tasks include query optimization and DBMS testing.
In query optimization, predefined rules are applied to improve query performance [30]. More recent approaches, such as WeTune [90] and LearnedRewrite [100], automatically discover and apply rewrite rules. LLM-R2 [46] extends this by using LLMs to explore more effective rewrite rules. QueryBooster [7] introduces a domain-specific language for user-defined optimization rules. In DBMS testing, numerous works rewrite queries to detect bugs. Mutation-based fuzzers like WingFuzz [48] and LEGO [47] mutate queries to generate test inputs. Some approaches, like EET [35], AMOEBA [53], and SQLancer [70, 71], generate equivalent queries to check for consistent results. Commonly, these approaches rely on parsers to analyze and modify query ASTs. SQLFlex addresses a key limitation by enabling dialect-agnostic query parsing, broadening the applicability of query rewriting tools.
10 Conclusion
Existing SQL parsers often require significant manual effort to support diverse SQL dialects, limiting the applicability of query analysis and rewriting tools. In this paper, we have presented SQLFlex, a dialect-agnostic query rewriting framework. Our key idea is to integrate grammar-based parsing with LLM-based segmentation for dialect-agnostic query parsing. We introduced clause-level and expression-level segmentation to decompose the hierarchical structure of queries into sequential tasks. We also proposed validation methods to improve reliability. Our evaluation has shown that SQLFlex is practical in real-world SQL linting and test-case reduction tasks. Additionally, SQLFlex can parse 91.55% to 100% of the queries across eight dialects, outperforming baselines. We believe SQLFlex can be applied to a broad range of applications, and requires minimal manual adaptation for dialect-specific features.
Acknowledgments
We would like to thank the anonymous reviewers for their constructive comments and suggestions.
This project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2023).
References
[1] andialbrecht. 2024. sqlparse. https://github.com/andialbrecht/sqlparse.
[2] Andreas Seltenreich. 2022. SQLsmith. https://github.com/anse1/sqlsmith.
[3] ANSI/ISO. 2017. SQL-92. https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt.
[4] ANTLR. [n. d.]. ANTLR. https://www.antlr.org/.
[5] Apache. 2018. Apache Calcite. https://calcite.apache.org/.
[6] Apache. 2025. datafusion-sqlparser-rs. https://github.com/apache/datafusion-sqlparser-rs.
[7] Qiushi Bai, Sadeem Alsudais, and Chen Li. 2023. QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting. Proceedings of the VLDB Endowment 16, 11 (July 2023), 2911–2924. doi:10.14778/3611479.3611497
[8] Timothy Alan Budd. 1980. Mutation analysis of program test data. Yale University.
[9] Bikash Chandra, Ananyo Banerjee, Udbhas Hazra, Mathew Joseph, and S. Sudarshan. 2019. Automated Grading of SQL Queries. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 1630–1633. doi:10.1109/ICDE.2019.00159
[10] Bikash Chandra, Bhupesh Chawda, Biplab Kar, K. V. Reddy, Shetal Shah, and S. Sudarshan. 2015. Data generation for testing and grading SQL queries. The VLDB Journal (2015), 731–755. doi:10.1007/s00778-015-0395-0
[11] Koen Claessen and John Hughes. 2000. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming (ICFP '00). 268–279. doi:10.1145/351240.351266
[12] CockroachDB. [n. d.]. Select from a specific index. https://www.cockroachlabs.com/docs/v25.2/select-clause#select-from-a-specific-index.
[13] Keith Cooper and Linda Torczon. 2003.
Engineering a Compiler: International Student Edition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[14] Nils Anders Danielsson. 2013. Correct-by-construction pretty-printing. In Proceedings of the 2013 ACM SIGPLAN Workshop on Dependently-Typed Programming (DTP '13). doi:10.1145/2502409.2502410
[15] DB-Engines. 2024. DB-Engines Ranking. https://db-engines.com/en/ranking.
[16] R.A. DeMillo, R.J. Lipton, and F.G. Sayward. 1978. Hints on Test Data Selection: Help for the Practicing Programmer. Computer 11, 4 (1978), 34–41.
[17] Visweswara Sai Prashanth Dintyala, Arpit Narechania, and Joy Arulraj. 2020. SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2331–2345. doi:10.1145/3318464.3389754
[18] DuckDB. [n. d.]. Expressions. https://duckdb.org/docs/stable/sql/expressions/overview.
[19] Döhmen et al. 2024. SchemaPile: A Large Collection of Relational Database Schemas. Proc. ACM Manag. Data (2024).
[20] Philip J. Fleming and John J. Wallace. 1986. How not to lie with statistics: the correct way to summarize benchmark results. Commun. ACM (1986). doi:10.1145/5666.5673
[21] Bryan Ford. 2004. Parsing expression grammars: a recognition-based syntactic foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '04). doi:10.1145/964001.964011
[22] Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, and Markus Schedl. 2024. Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2024.emnlp-main.665
[23] Jingzhou Fu, Jie Liang, Zhiyong Wu, and Yu Jiang. 2024. Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12.
doi:10.1145/3597503.3639210
[24] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proceedings of the VLDB Endowment (2024), 1132–1145. doi:10.14778/3641204.3641221
[25] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Proceedings of Machine Learning and Systems. 325–338.
[26] Github Issue. 2025. Add Dremio dialect #4904. https://github.com/tobymao/sqlglot/issues/4904.
[27] Github Issue. 2025. Add H2 dialect #6763. https://github.com/sqlfluff/sqlfluff/issues/6763.
[28] Github Issue. 2025. [MySQL] Unable to handle VALUES ROW() #4371. https://github.com/antlr/grammars-v4/issues/4371.
[29] Rahul Gopinath, Björn Mathis, and Andreas Zeller. 2020. Mining input grammars from dynamic control flow. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). doi:10.1145/3368089.3409679
[30] Goetz Graefe and David J. DeWitt. 1987. The EXODUS optimizer generator. In Proceedings of the 1987 ACM SIGMOD international conference on Management of data. 160–172.
[31] Moshe Hazoom, Vibhor Malik, and Ben Bogin. 2021. Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), Royi Lachmy, Ziyu Yao, Greg Durrett, Milos Gligoric, Junyi Jessy Li, Ray Mooney, Graham Neubig, Yu Su, Huan Sun, and Reut Tsarfaty (Eds.). 77–87. doi:10.18653/v1/2021.nlp4prog-1.9
[32] Yuan He, Zhangdie Yuan, Jiaoyan Chen, and Ian Horrocks. 2024. Language Models as Hierarchy Encoders.
In Advances in Neural Information Processing Systems. 14690–14711.
[33] Yang He, Pinhan Zhao, Xinyu Wang, and Yuepeng Wang. 2024. VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints. Proc. ACM Program. Lang. OOPSLA1 (2024). doi:10.1145/3649849
[34] Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, and Qing Li. 2025. HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '25). 11 pages. doi:10.1145/3711896.3737378
[35] Zu-Ming Jiang and Zhendong Su. 2024. Detecting Logic Bugs in Database Engines via Equivalent Expression Transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 821–835.
[36] Jira Issue. 2023. Support Doris Dialect. https://issues.apache.org/jira/browse/CALCITE-5725.
[37] Bill Karwin. 2010. SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Pragmatic Bookshelf.
[38] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[39] Louisa Lambrecht, Tim Findling, Samuel Heid, Marcel Knüdeler, and Torsten Grust. 2025. Democratize MATCH_RECOGNIZE! Proc. VLDB Endow. (2025). doi:10.14778/3750601.3750644
[40] LangChain. [n. d.]. LangChain Documentation. https://python.langchain.com/docs/introduction/.
[41] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. 2024. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763 (2024).
[42] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? Proceedings of the VLDB Endowment (2024), 3318–3331. doi:10.14778/3681954.3682003
[43] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2024).
[44] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C C Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
[45] Jianling Li, Meishan Zhang, Peiming Guo, Min Zhang, and Yue Zhang. 2023. LLM-enhanced Self-training for Cross-domain Constituency Parsing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2023.emnlp-main.508
[46] Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2024. LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency. Proceedings of the VLDB Endowment (2024). doi:10.14778/3696435.3696440
[47] Jie Liang, Yaoguang Chen, Zhiyong Wu, Jingzhou Fu, Mingzhe Wang, Yu Jiang, Xiangdong Huang, Ting Chen, Jiashui Wang, and Jiajia Li. 2023. Sequence-Oriented DBMS Fuzzing. In 2023 IEEE 39th International Conference on Data Engineering (ICDE). doi:10.1109/ICDE55515.2023.00057
[48] Jie Liang, Zhiyong Wu, Jingzhou Fu, Yiyuan Bai, Qiang Zhang, and Yu Jiang. 2024. WingFuzz: Implementing Continuous Fuzzing for DBMSs. In 2024 USENIX Annual Technical Conference (USENIX ATC 24).
[49] Yu Liang, Song Liu, and Hong Hu. 2022.
Detecting Logical Bugs of DBMS with Coverage-based Guidance. In 31st USENIX Security Symposium (USENIX Security 22). 4309–4326.
[50] Li Lin, Zongyin Hao, Chengpeng Wang, Zhuangda Wang, Rongxin Wu, and Gang Fan. 2024. SQLess: Dialect-Agnostic SQL Query Simplification. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 743–754. doi:10.1145/3650212.3680317
[51] Li Lin, Qinglin Zhu, Hongqiao Chen, Zhuangda Wang, Rongxin Wu, and Xiaoheng Xie. 2025. QTRAN: Extending Metamorphic-Oracle Based Logical Bug Detection Techniques for Multiple-DBMS Dialect Support. Proc. ACM Softw. Eng., Article ISSTA033 (June 2025). doi:10.1145/3728908
[52] Zhisheng Lin, Yifu Liu, Zhiling Luo, Jinyang Gao, and Yu Li. 2024. MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases. arXiv:2410.18406 [cs.CL] https://arxiv.org/abs/2410.18406
[53] Xinyu Liu, Qi Zhou, Joy Arulraj, and Alessandro Orso. 2022. Automatic detection of performance bugs in database systems using equivalent queries. In Proceedings of the 44th International Conference on Software Engineering. doi:10.1145/3510003.3510093
[54] Tony Mason and Doug Brown. 1990. Lex & yacc. O'Reilly & Associates, Inc., USA.
[55] mattmc3. [n. d.]. Modern SQL Style Guide. https://gist.github.com/mattmc3/38a85e6a4ca1093816c08d4815fbebfb.
[56] Zhengjie Miao, Tiangang Chen, Alexander Bendeck, Kevin Day, Sudeepa Roy, and Jun Yang. 2020. I-Rex: an interactive relational query explainer for SQL. Proc. VLDB Endow. (2020). doi:10.14778/3415478.3415528
[57] Zhengjie Miao, Sudeepa Roy, and Jun Yang. 2019. Explaining Wrong Queries Using Small Examples. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19).
doi:10.1145/3299869.3319866
[58] Microsoft. 2025. Transact-SQL reference. https://learn.microsoft.com/en-us/sql/t-sql/language-reference?view=sql-server-ver16.
[59] Hannes Mühleisen and Mark Raasveldt. 2025. Runtime-Extensible Parsers. In 15th Conference on Innovative Data Systems Research, CIDR 2025.
[60] Ananjan Nandi, Christopher D Manning, and Shikhar Murty. 2025. Sneaking Syntax into Transformer Language Models with Tree Regularization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
[61] Thomas Neumann and Viktor Leis. 2024. A Critique of Modern SQL and a Proposal Towards a Simple and Expressive Query Language. In CIDR.
[62] OpenAI. [n. d.]. GPT-4.1. https://platform.openai.com/docs/models/gpt-4.1.
[63] Rongqi Pan, Taher A. Ghaleb, and Lionel Briand. 2023. ATM: Black-Box Test Case Minimization Based on Test Code Similarity and Evolutionary Search. In Proceedings of the 45th International Conference on Software Engineering (ICSE '23). 12 pages. doi:10.1109/ICSE48619.2023.00146
[64] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: an analysis and survey. In Advances in computers. Vol. 112. Elsevier, 275–378.
[65] Terence Parr and Kathleen Fisher. 2011. LL(*): the foundation of the ANTLR parser generator. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). doi:10.1145/1993498.1993548
[66] pglast. 2024. https://github.com/lelit/pglast.
[67] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of Machine Learning and Systems. 606–624.
[68] Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, and Sercan O. Arik. 2024. SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging. arXiv:2408.12733 [cs]
[69] Tillmann Rendel and Klaus Ostermann. 2010. Invertible syntax descriptions: unifying parsing and pretty printing. In Proceedings of the Third ACM Haskell Symposium on Haskell (Haskell '10). doi:10.1145/1863523.1863525
[70] Manuel Rigger and Zhendong Su. 2020. Detecting Optimization Bugs in Database Engines via Non-Optimizing Reference Engine Construction. In Proceedings of the 2020 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Sacramento, California, United States) (ESEC/FSE 2020). doi:10.1145/3368089.3409710
[71] Manuel Rigger and Zhendong Su. 2020. Finding Bugs in Database Systems via Query Partitioning. Proc. ACM Program. Lang. 4, OOPSLA, Article 211 (2020). doi:10.1145/3428279
[72] Stuart Russell and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach, 4th edition.
[73] Tobias Schmidt, Viktor Leis, Peter Boncz, and Thomas Neumann. 2025. SQLStorm: Taking Database Benchmarking into the LLM Era. Proc. VLDB Endow. (2025).
[74] Michael Schröder and Jürgen Cito. 2022. Grammars for free: toward grammar inference for Ad Hoc parsers. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '22). doi:10.1145/3510455.3512787
[75] Tushar Sharma, Marios Fragkoulis, Stamatia Rizou, Magiel Bruntink, and Diomidis Spinellis. 2018. Smelly relations: measuring and understanding database schema quality. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '18). 55–64. doi:10.1145/3183519.3183529
[76] August Shi, Tiffany Yung, Alex Gyori, and Darko Marinov. 2015.
Comparing and combining test-suite reduction and regression test selection. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). doi:10.1145/2786805.2786878
[77] Jeff Shute, Shannon Bales, Matthew Brown, Jean-Daniel Browne, Brandon Dolphin, Romit Kudtarkar, Andrey Litvinov, Jingchi Ma, John Morcos, Michael Shen, David Wilhite, Xi Wu, and Lulan Yu. 2024. SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL. Proc. VLDB Endow. (2024). doi:10.14778/3685800.3685826
[78] Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, and Mingjie Tang. 2025. QUITE: A Query Rewrite System Beyond Rules with LLM Agents. arXiv:2506.07675 [cs.DB]
[79] SQLancer. 2020. SQLancer. https://github.com/sqlancer/sqlancer.
[80] SQLFluff. 2025. SQLFluff. https://docs.sqlfluff.com/en/stable/index.html.
[81] Chengnian Sun, Yuanbo Li, Qirun Zhang, Tianxiao Gu, and Zhendong Su. 2018. Perses: Syntax-Guided Program Reduction. In Proceedings of the 40th International Conference on Software Engineering. 361–371. doi:10.1145/3180155.3180236
[82] Yuan Tian, Jonathan K. Kummerfeld, Toby Jia-Jun Li, and Tianyi Zhang. 2024. SQLucid: Grounding Natural Language Database Queries with Interactive Explanations. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24). doi:10.1145/3654777.3676368
[83] Yuan Tian, Zheng Zhang, Zheng Ning, Toby Jia-Jun Li, Jonathan K. Kummerfeld, and Tianyi Zhang. 2023. Interactive Text-to-SQL Generation via Editable Step-by-Step Explanations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2023.emnlp-main.1004
[84] tobymao. 2025. SQLGlot. https://github.com/tobymao/sqlglot.
[85] Javier Tuya, Ma Jose Suarez-Cabal, and Claudio De La Riva. 2006. SQLMutation: A Tool to Generate Mutants of SQL Database Queries. In Second Workshop on Mutation Analysis (Mutation 2006 - ISSRE Workshops 2006). 1–1. doi:10.1109/MUTATION.2006.13
[86] Javier Tuya, Ma José Suárez-Cabal, and Claudio De La Riva. 2007. Mutating database queries. Information and Software Technology 49, 4 (2007), 398–417.
[87] Javier Tuya, Ma José Suárez-Cabal, and Claudio De La Riva. 2025. SQL Mutants (SQLMutation) - Test4Data - Testing for Database Applications. https://test4data.com/sqlmutation [Online; accessed 29 Apr. 2025].
[88] Marcell van Geest and Wouter Swierstra. 2017. Generic packet descriptions: verified parsing and pretty printing of low-level data. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Type-Driven Development (TyDe 2017). doi:10.1145/3122975.3122979
[89] Shuxian Wang, Sicheng Pan, and Alvin Cheung. 2024. QED: A Powerful Query Equivalence Decider for SQL. Proc. VLDB Endow. (2024). doi:10.14778/3681954.3682024
[90] Zhaoguo Wang, Zhou Zhou, Yicun Yang, Haoran Ding, Gansen Hu, Ding Ding, Chuzhe Tang, Haibo Chen, and Jinyang Li. 2022. WeTune: Automatic Discovery and Verification of Query Rewrite Rules. In Proceedings of the 2022 International Conference on Management of Data. ACM, Philadelphia, PA, USA, 94–107. doi:10.1145/3514221.3526125
[91] Webpage. 2023. The SQL Standard – ISO/IEC 9075:2023 (ANSI X3.135). https://www.iso.org/standard/76583.html.
[92] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. doi:10.48550/ARXIV.1809.08887
[93] Jipeng Zhang, Haolin Yang, Kehao Miao, Ruiyuan Zhang, Renjie Pi, Jiahui Gao, and Xiaofang Zhou. 2025.
ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects. arXiv:2505.17231 [cs.CL] https://arxiv.org/abs/2505.17231
[94] Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic Evaluation for Text-to-SQL with Distilled Test Suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/v1/2020.emnlp-main.29
[95] Suyang Zhong and Manuel Rigger. 2024. Understanding and Reusing Test Suites Across Database Systems. Proc. ACM Manag. Data (2024). doi:10.1145/3698829
[96] Suyang Zhong and Manuel Rigger. 2025. Scaling Automated Database System Testing. arXiv:2503.21424 https://arxiv.org/abs/2503.21424
[97] Suyang Zhong and Manuel Rigger. 2025. Testing Database Systems with Large Language Model Synthesized Fragments. https://arxiv.org/abs/2505.02012
[98] Qi Zhou, Joy Arulraj, Shamkant Navathe, William Harris, and Jinpeng Wu. 2021. SIA: Optimizing Queries Using Learned Predicates. In Proceedings of the 2021 International Conference on Management of Data. doi:10.1145/3448016.3457262
[99] Wei Zhou, Yuyang Gao, Xuanhe Zhou, and Guoliang Li. 2025. Cracking SQL Barriers: An LLM-based Dialect Translation System. Proc. ACM Manag. Data (2025).
[100] Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. 2021. A Learned Query Rewrite System Using Monte Carlo Tree Search. Proceedings of the VLDB Endowment (2021), 46–58. doi:10.14778/3485450.3485456
[101] Feng Zhu, Lijie Xu, Gang Ma, Shuping Ji, Jie Wang, Gang Wang, Hongyi Zhang, Kun Wan, Mingming Wang, Xingchao Zhang, Yuming Wang, and Jingpin Li. 2022. An Empirical Study on Quality Issues of eBay's Big Data SQL Analytics Platform. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. 33–42. doi:10.1145/3510457.3513034
A Additional Approach Details

A.1 Grammar Alternatives and Repetitions

Some grammar rules specify repetitions, where an element is allowed to appear zero or multiple times. In ANTLR, the symbol * denotes such repetitions. In the listing below, the projections rule uses the pattern ("," projection)*, allowing multiple projection symbols separated by commas. Some grammar rules define alternatives, offering a choice between multiple possible symbols. In ANTLR, alternatives are specified using the "|" symbol. For instance, the projection rule provides two alternatives: column and expr, meaning a projection can be either a simple column name or a more complex expression.

projections : projection ("," projection)* ;
projection : column | expr ;

In clause-level segmentation, we handle both repetitions and alternatives as special cases. For repetitions, the LLM outputs the full sequence, which we then split using delimiters. For example, in a SELECT clause like "SELECT col, 1 !> 2", the LLM may output the segment "col, 1 !> 2", corresponding to the projections rule. We then split this by commas and enqueue each element ("col" and "1 !> 2") under the projection rule for further processing. For alternatives, the LLM selects one of the possible forms. For example, when segmenting "1 !> 2" under the projection rule, the LLM identifies it as an expression rather than a column name. As a result, we enqueue "1 !> 2" with the expr rule, which SQLFlex then processes with expression-level segmentation in the next step. Additionally, the mutually exclusive validation also applies to scenarios where alternatives are defined (e.g., the segmentation output for projection cannot be interpreted as both a column name and an expression).
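The repetition and alternative handling described above can be sketched as a small worklist procedure. The rule table, the parenthesis-aware splitting helper, and the alternative-selection callback below are hypothetical simplifications for illustration, not SQLFlex's actual rule encoding; the callback stands in for the LLM's choice between alternatives.

```python
from collections import deque

# Hypothetical, simplified grammar: rule name -> ("repeat", child, delimiter),
# ("alt", [children]), or ("leaf",). Not SQLFlex's real representation.
GRAMMAR = {
    "projections": ("repeat", "projection", ","),
    "projection": ("alt", ["column", "expr"]),
    "column": ("leaf",),
    "expr": ("leaf",),
}

def split_top_level(text, delim):
    """Split on a delimiter, ignoring delimiters nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == delim and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts

def enqueue_segments(segment, rule, choose_alt):
    """Expand repetitions eagerly; defer alternative choices to an oracle."""
    queue, leaves = deque([(segment, rule)]), []
    while queue:
        text, r = queue.popleft()
        kind = GRAMMAR[r][0]
        if kind == "repeat":
            _, child, delim = GRAMMAR[r]
            for part in split_top_level(text, delim):
                queue.append((part, child))
        elif kind == "alt":
            queue.append((text, choose_alt(text, GRAMMAR[r][1])))
        else:
            leaves.append((text, r))
    return leaves

# Stand-in for the LLM's selection between the column and expr alternatives.
is_expr = lambda text, alts: "column" if text.isidentifier() else "expr"
print(enqueue_segments("col, 1 !> 2", "projections", is_expr))
# [('col', 'column'), ('1 !> 2', 'expr')]
```

In a real implementation the leaves of this worklist would be enqueued for further expression-level segmentation rather than returned directly.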
A.2 Conservative Anchor Matching

To identify anchors reliably, we adopt a conservative matching strategy to minimise incorrect matches that could lead to parsing errors. Specifically, we only match anchors that are space-separated. For example, in "year < 2025", the operator "<" is surrounded by spaces and thus qualifies as a match. In contrast, we skip matching anchors in expressions written as "year<2025", as such compact forms may reflect dialect-specific features (e.g., the operator "!<" includes "<", but since "<" lacks surrounding spaces, matching is avoided). Additionally, we handle phrase-like operators such as "IS NOT" through exact matching. As a result of this conservative approach, some known operators may remain unmatched and appear within mask tokens, though these expressions might still be parseable in subsequent steps.

A.3 Context-sensitive Anchors

Certain anchors used in expression-level segmentation are context-sensitive; that is, they do not function as universal operators like "AND", but act as operators only in specific contexts. For example, consider the CAST function: in an expression like "CAST(1.0 AS INT)", the keyword "AS" serves as a type-cast operator. However, outside of such function calls, AS typically denotes an alias (e.g., "SELECT col AS alias"). Due to this ambiguity, we treat AS as an anchor only when the current expression node has a parent that is the "CAST" function. In such cases, the context makes it clear that AS is being used as a type-cast operator rather than an alias specifier. While such context-sensitive anchors are rare in the SQL-92 grammar, we believe they are more prevalent in more comprehensive grammars, e.g., when specifying the anchors in window functions.

B Additional Evaluation Details

B.1 Dataset construction

Standalone evaluation.
To build the dataset for the standalone evaluation, we extracted SELECT statements from the test files. For DuckDB, PostgreSQL, and CockroachDB, we used SQUaLity [95], a DBMS test suite analyzer, to parse the test files from these systems. For other DBMSs not supported by SQUaLity, we matched any string starting with "SELECT". To ensure syntactic validity, we excluded queries that failed to parse using dialect-specific parsers. For H2, since we were unable to obtain a ready-to-use H2-specific parser, we used JSQLParser, which supports H2 syntax. Since many of the extracted queries had similar structures or features, we applied a keyword-based deduplication heuristic to retain a diverse subset. SQL parsing depends primarily on keywords [77], making keyword diversity a useful proxy for syntactic variety. Specifically, for each DBMS, we compiled a list of SQL keywords and used regular expressions to extract the keyword set from each query. Queries with identical keyword sets were treated as duplicates, and we retained only the longest one, assuming it to be the most complex. For SurrealDB, as we were unable to obtain its keyword list, we kept all queries in its dataset.

B.2 AST-level Parsing Effectiveness

We introduce AST Round-trip (AST-RT), an AST-level round-trip property, to gain additional insights into the parsing effectiveness of SQLFlex against the baselines. Similarly, AST-RT is defined based on the following parsing and printing steps:

AST₀ = Parse(Q₀),  Q₁ = Print(AST₀)   (2)
AST₁ = Parse(Q₁),  Q₂ = Print(AST₁)   (3)

AST Round-trip (AST-RT) measures AST-level consistency by checking that the AST remains unchanged after two parsing operations (i.e., AST₀ = AST₁). AST-RT demonstrates whether SQLFlex consistently generates structurally stable ASTs, which is useful for users implementing rewrite functions. AST-RT avoids disadvantaging pglast and SQLGlot due to certain normalizations.
For example, pglast combines type-casting operations into a unified node type, and direct AST comparisons avoid false alarms. Table 9 shows the results in AST-RT rates, with the geometric mean reported in the last row. The standard deviations of SQLFlex in AST-RT rates are slightly higher than in Q-RT rates, but remain below 0.7%. SQLFlex overall outperforms all baselines across most datasets, with only a marginal 3% gap behind SQLGlot_s on the DuckDB dataset. In general, AST-RT scores tend to be slightly lower than Q-RT scores. For SQLGlot and pglast, this is often caused by normalization during pretty-printing. For example, in SQLGlot, an IF node in the original AST is converted into a CASE expression in the printed query, which then leads to a different node structure when the query is re-parsed. This lower AST-RT is also expected for the LLM-based parser and SQLFlex. Since LLMs are inherently probabilistic, they may produce slightly different ASTs even for the same input, especially when the query is complex. For SQLFlex, such discrepancies mainly occur in complex expressions without anchors, such as window functions. Window functions are not supported in SQL-92, but are common in the DuckDB, PostgreSQL, and CockroachDB datasets. Without anchors, the segmenter needs to process the expression recursively, and small differences in interpretation can lead to AST mismatches. Nonetheless, the printed queries remain largely consistent, as reflected in the high Q-RT scores, indicating that re-executability is preserved.

Table 9. AST Round-Trip Rates.
Bold indicates the best score for each dataset (excluding pglast on PostgreSQL).

Dataset       LLM      pglast    SQLGlot_s  SQLGlot_d  SQLFlex
DuckDB        26.00%   72.13%    92.16%     92.11%     89.24%
PostgreSQL    25.35%   100.00%   84.92%     86.74%     90.25%
MySQL         33.69%   77.70%    81.42%     93.36%     97.24%
ClickHouse    24.83%   46.96%    64.27%     88.93%     93.66%
H2            26.26%   81.01%    93.30%     -          95.07%
CockroachDB   32.21%   69.54%    66.84%     -          88.34%
Cassandra     38.00%   74.00%    76.00%     -          99.56%
SurrealDB     25.64%   32.76%    38.46%     -          92.40%
Mean          28.48%   65.84%    72.30%     90.25%     93.15%

C Use Case: Mutation Testing

We present an additional use case of SQLFlex, mutation testing, beyond the two use cases presented in the main paper. Mutation testing is a well-established approach for assessing the effectiveness of test suites. It involves generating multiple versions of a program, known as mutants, where each mutant contains a small, intentional fault introduced by a simple syntactic change to the original code. The quality of a test suite is then evaluated based on its ability to detect and distinguish these mutants from the original program [8, 16, 64]. In the context of SQL, mutation testing is a use case of query rewriting. A SQL test suite typically consists of the database schema, the data, a set of SELECT queries, and their expected results. Such test suites are essential in various use cases, including validating the reliability and correctness of database applications, SQL query engines, solutions on online SQL coding platforms, and evaluation benchmarks. The queries in the test suite can be mutated (i.e., rewritten) prior to execution to assess the adequacy of the test suite. A SQL mutant is said to be dead when the output produced by its execution differs from the output produced by the original SELECT query. This indicates that the test data is capable of distinguishing between the original and mutated queries, while any surviving mutant indicates a potential gap or weakness in the test data.
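The dead-versus-surviving distinction above can be sketched as a small check over query results. The function name and result representation below are hypothetical; rows are compared as multisets, since SELECT queries without ORDER BY generally do not guarantee a row order.

```python
from collections import Counter

def classify_mutants(original_rows, mutant_results):
    """Mark each mutant 'dead' if its result multiset differs from the
    original query's result on the same test data, 'surviving' otherwise."""
    baseline = Counter(map(tuple, original_rows))
    status = {}
    for name, rows in mutant_results.items():
        status[name] = "dead" if Counter(map(tuple, rows)) != baseline else "surviving"
    return status

# Toy results: the original query and two mutants run against the same test data.
original = [(1, "a"), (2, "b")]
mutants = {
    "join_type_flip": [(1, "a")],            # output changed -> dead
    "predicate_swap": [(2, "b"), (1, "a")],  # same rows, different order -> surviving
}
print(classify_mutants(original, mutants))
```

A surviving mutant here would signal that the test data cannot distinguish the rewritten query from the original, matching the notion of test-suite inadequacy described above.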
Although mutation testing for SQL queries was introduced as early as 2006 [85], it has not seen widespread adoption. A key barrier is the lack of dialect-agnostic query support across different database engines, making it difficult to build generalised testing tools. Most existing approaches perform mutations at the raw query string level, which limits structural awareness and leads to fragile or semantically invalid transformations. In contrast, SQLFlex enables mutation testing at the AST level by generating unified ASTs from SQL queries. This allows structure-aware and semantically valid query mutations to be applied. For example, SQLFlex can apply a mutation rule that modifies the type of a join clause (if present in the query), such as changing "LEFT JOIN" to "RIGHT JOIN".

C.1 Baselines

SQLMutation [87] is a prominent tool designed to generate SQL mutants by applying mutations to the SELECT queries within a test suite. However, as SQLMutation is not available as a standalone tool and offers limited control over fine-grained mutant generation, we implemented a similar mutation tool. Following the standalone evaluation baseline, we implemented a baseline mutation testing tool based on the PostgreSQL-based parser, pglast. This gave us greater flexibility in selecting specific mutants for a more meaningful and controlled comparison with our SQLFlex-based mutation testing tool.

Table 10. Mutation testing results

Dataset                 #Queries  Approach  #Living Mutants
BIRD (SQLite)           766       SQLFlex   3,551
                                  pglast    3,413
Spider 1.0 (SQLite)     890       SQLFlex   4,464
                                  pglast    4,127
Spider 2.0 (Snowflake)  121       SQLFlex   234
                                  pglast    -
MiniDev (PostgreSQL)    500       SQLFlex   2,419
                                  pglast    2,426
MiniDev (MySQL)         500       SQLFlex   1,958
                                  pglast    24
MiniDev (SQLite)        500       SQLFlex   2,377
                                  pglast    2,327
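The join-type mutation mentioned at the start of this section can be sketched on a toy AST. The dictionary-based node shape below is a hypothetical stand-in for a unified AST, not SQLFlex's actual representation.

```python
import copy

def flip_join_type(ast):
    """Return a mutant AST with every LEFT JOIN changed to RIGHT JOIN
    (and vice versa); return None if the query contains no such join."""
    mutant = copy.deepcopy(ast)
    flipped = False

    def walk(node):
        nonlocal flipped
        if isinstance(node, dict):
            if node.get("type") == "join" and node.get("join_type") in ("LEFT", "RIGHT"):
                node["join_type"] = "RIGHT" if node["join_type"] == "LEFT" else "LEFT"
                flipped = True
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(mutant)
    return mutant if flipped else None

# Toy AST for: SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id
query = {"type": "select",
         "from": {"type": "join", "join_type": "LEFT",
                  "left": "t1", "right": "t2", "on": "t1.id = t2.id"}}
print(flip_join_type(query)["from"]["join_type"])  # RIGHT
```

Working on a deep copy keeps the original AST intact, so each mutation rule can be applied independently to the same parsed query.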
C.2 Implementation

We systematically selected twelve mutation rules (i.e., query rewriting rules) from the rules defined by Tuya et al. [86]. Specifically, we excluded mutation rules that require schema-level information, as our evaluation focuses exclusively on mutations that can be derived directly from the SELECT statements themselves. We also excluded mutation rules on subqueries to reduce the complexity of the tool for the purpose of the evaluation. We implemented the twelve rules for both the PostgreSQL-based mutation tool and the SQLFlex-based mutation tool.

C.3 Dataset

SQL mutation testing is particularly relevant in contexts such as online platforms that host SQL coding exercises, Text-to-SQL benchmarks, and DBMS test suites, where ensuring the semantic uniqueness of query outputs is critical. We selected four Text-to-SQL benchmarks to evaluate the mutation testing tools due to their open-source availability and close resemblance to real-world queries and data. Specifically, the selected benchmarks are BIRD [43], MiniDev [43], Spider 1.0 [92], and Spider 2.0 [41].

C.4 Results

The presence of surviving mutants in a test suite highlights its inadequacy, where a higher number of surviving mutants generally indicates greater deficiencies in the test data. These mutants can often be eliminated by adding more targeted test cases to the suite. However, a larger number of surviving mutants does not necessarily mean that a mutation testing tool is more effective. Ideally, we should also consider the significance of each surviving mutant, for example, whether eliminating one mutant with additional test data also helps eliminate others. Nevertheless, a larger number of surviving mutants can still be a useful indicator for revealing differences between mutation testing tools. As shown in Table 10, the mutation tool based on SQLFlex generally produces more surviving mutants, with the exception of the MiniDev dataset in the PostgreSQL dialect.
This exception is expected, as the PostgreSQL-specific parser is inherently better suited for this dataset. Notably, on the Spider 2.0 dataset with the Snowflake engine, SQLFlex demonstrates a clear advantage. Specifically, it successfully generates executable mutant queries, whereas the PostgreSQL-based tool introduces errors by converting backticks to double quotes during pretty-printing. This normalization makes the queries invalid on Snowflake, resulting in no surviving mutants.

Received October 2025; revised January 2026; accepted February 2026