OptiSQL: Executable SQL Generation from Optical Tokens
Executable SQL generation is typically studied in text-to-SQL settings, where tables are provided as fully linearized textual schemas and contents. While effective, this formulation assumes access to structured text and incurs substantial token overhead, which is misaligned with many real-world scenarios where tables appear as visual artifacts in documents or webpages. We investigate whether compact optical representations can serve as an efficient interface for executable semantic parsing. We present OptiSQL, a vision-driven framework that generates executable SQL directly from table images and natural language questions using compact optical tokens. OptiSQL leverages an OCR-oriented visual encoder to compress table structure and content into a small set of optical tokens and fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency. Experiments on a visualized version of Spider 2.0-Snow show that OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses further demonstrate that optical tokens preserve essential structural information under visual perturbations.
💡 Research Summary
OptiSQL tackles a practical limitation of most text‑to‑SQL systems: they assume that database schemas and table contents are already available as clean, linearized text. In real‑world scenarios—scanned documents, PDFs, web pages, or any visual artifact—tables exist only as images, and converting them to text requires multi‑stage OCR pipelines that are error‑prone and token‑inefficient. The authors propose a vision‑driven framework that bypasses explicit text reconstruction. A pretrained OCR‑oriented visual encoder (based on DeepSeek‑OCR) compresses a table image into a fixed‑length sequence of “optical tokens”. Each token encodes localized visual features, recognized characters, and structural cues such as row‑column alignment and header‑cell relationships.
The key methodological choice is to freeze the visual encoder during training, updating only a pretrained autoregressive decoder that generates SQL. This “Frozen‑Encoder” setting isolates the representational sufficiency of the optical tokens: if the decoder can learn to map the compact visual representation to correct executable SQL, the tokens must contain enough semantic and structural information. An alternative “Full‑FT” variant fine‑tunes both encoder and decoder to measure potential gains from encoder adaptation.
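In framework terms, the Frozen-Encoder setting boils down to disabling gradients on the encoder and handing only decoder parameters to the optimizer. A minimal PyTorch-style sketch follows; the module names, dimensions, and token budget here are illustrative assumptions, not the paper's actual architecture:

```python
import torch
from torch import nn

# Hypothetical stand-in for the OCR-oriented visual encoder: it maps pooled
# image features to a fixed-length sequence of "optical tokens".
class OpticalEncoder(nn.Module):
    def __init__(self, n_tokens=256, d_model=512, feat_dim=1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_tokens * d_model)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, image_feats):                        # (B, feat_dim)
        out = self.proj(image_feats)
        return out.view(-1, self.n_tokens, self.d_model)   # (B, n_tokens, d_model)

encoder = OpticalEncoder()
# Stand-in for the pretrained autoregressive SQL decoder.
decoder = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

# Frozen-Encoder setting: stop gradient flow into the encoder so that
# only the SQL decoder is updated during fine-tuning.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()  # also fixes dropout / normalization statistics

# The optimizer sees only trainable (decoder) parameters.
optimizer = torch.optim.AdamW(
    (p for p in decoder.parameters() if p.requires_grad), lr=1e-4)
```

The Full-FT variant would simply skip the freezing loop, so both modules appear in the optimizer's parameter groups.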
OptiSQL is evaluated on a visualized version of the Spider 2.0‑Snow benchmark, where each original table is rendered as an image while the natural‑language question and ground‑truth SQL remain unchanged. The authors compare against several baselines: (1) Re‑FoRCE, a state‑of‑the‑art text‑to‑SQL model that operates on full textual schemas (upper bound); (2) a two‑stage OCR‑plus‑text‑SQL pipeline; (3) general‑purpose vision‑language models such as Pix2Struct and Donut that directly generate SQL from images.
Metrics include execution accuracy (EXAcc), canonical exact match (EX‑Can), token savings ratio (TSR), and robustness under visual perturbations. By varying the optical token budget (e.g., 64, 100, 256, or 400 tokens), the authors demonstrate a clear trade‑off: with a modest budget of 256 tokens, OptiSQL achieves ~78% EXAcc, only a few points below the text‑based upper bound, while shrinking the table input from roughly 3,500 text tokens to 256, a 92% reduction. Full‑FT improves EXAcc by about 2 percentage points, confirming that encoder adaptation yields modest gains but is not essential for strong performance.
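The reported reduction is easy to sanity-check. Assuming TSR is simply the fractional reduction in table input tokens (the paper's exact definition may differ), a one-liner reproduces the headline number:

```python
# Back-of-the-envelope check of the reported token savings, assuming the
# token savings ratio (TSR) is the fractional reduction in table input
# tokens relative to a fully linearized textual schema.
def token_savings_ratio(text_tokens: int, optical_tokens: int) -> float:
    return 1.0 - optical_tokens / text_tokens

tsr = token_savings_ratio(text_tokens=3500, optical_tokens=256)
print(f"{tsr:.1%}")  # prints "92.7%", consistent with the ~92% reduction reported
```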
Robustness experiments apply style changes (font, size, borders), cell padding variations, and even row‑column transpositions at inference time. The drop in EXAcc is limited to 3–5 %, indicating that the optical tokens preserve essential layout information despite superficial changes. Diagnostic tests where optical tokens are removed (NO‑IMAGE) or mismatched tables are supplied (WRONG‑TABLE) cause the model to fail, confirming that it truly relies on visual input rather than language‑only shortcuts.
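Of these perturbations, row‑column transposition is the most structural: the table's logical content is unchanged, but every layout cue flips. A minimal sketch of that operation (acting on the logical grid before rendering, with made‑up contents) looks like:

```python
def transpose_table(rows):
    """Swap rows and columns of a rectangular table given as a list of equal-length rows."""
    return [list(col) for col in zip(*rows)]

# Toy table: headers sit in the first row before transposition,
# and in the first column afterwards.
table = [
    ["city", "population"],
    ["Oslo", "709000"],
    ["Bergen", "291000"],
]
flipped = transpose_table(table)
# flipped == [["city", "Oslo", "Bergen"], ["population", "709000", "291000"]]
```

A layout‑robust model should produce the same SQL for both renderings, which is what the limited 3–5% EXAcc drop suggests.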
Compared to the OCR‑pipeline baseline, OptiSQL avoids cumulative OCR errors and the massive token overhead of linearized tables, yielding a more than tenfold token reduction and a 10–15% absolute gain in execution accuracy. General‑purpose vision‑language models, which lack a dedicated OCR‑style tokenization, perform significantly worse on complex SQL queries, underscoring the importance of preserving structured visual cues.
The paper’s contributions are threefold: (1) introducing optical tokens as a compact, information‑rich interface for executable semantic parsing; (2) providing a controlled experimental framework that isolates token sufficiency by freezing the visual encoder; (3) demonstrating favorable efficiency‑accuracy‑robustness trade‑offs on a challenging benchmark. Limitations include handling of multi‑table joins, extremely large databases where a single token budget may be insufficient, and scenarios with non‑textual cell content (e.g., embedded images).
In summary, OptiSQL shows that a small set of well‑designed visual tokens can replace full textual table encodings for SQL generation, dramatically cutting context length while maintaining high execution correctness. This opens a path toward low‑latency, cost‑effective document‑centric query systems and suggests that future work can extend the approach to richer document layouts, multi‑table environments, and joint encoder‑decoder optimization for even higher performance.