TLSQL: Table Learning Structured Query Language

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Table learning, which lies at the intersection of machine learning and modern database systems, has recently attracted growing attention. However, existing table learning frameworks typically require explicit data export and extensive feature engineering, creating a high barrier for database practitioners. We present TLSQL (Table Learning Structured Query Language), a system that enables table learning directly over relational databases via SQL-like declarative specifications. TLSQL is implemented as a lightweight Python library that translates these specifications into standard SQL queries and structured learning task descriptions. The generated SQL queries are executed natively by the database engine, while the task descriptions are consumed by downstream table learning frameworks. This design allows users to focus on modeling and analysis rather than low-level data preparation and pipeline orchestration. Experiments on real-world datasets demonstrate that TLSQL effectively lowers the barrier to integrating machine learning into database-centric workflows. Our code is available at https://github.com/rllm-project/tlsql/.


💡 Research Summary

TLSQL (Table Learning Structured Query Language) is a lightweight Python library that brings declarative, SQL‑like specifications to the emerging field of table learning, where machine‑learning models are trained directly on relational data. The authors identify a key usability gap: existing table‑learning pipelines typically require users to export data from the database, perform extensive feature engineering, and orchestrate separate training scripts, which raises the barrier for database practitioners who are accustomed to working entirely within SQL environments.

To address this, TLSQL defines three core statements—PREDICT VALUE, TRAIN WITH, and VALIDATE WITH—that together describe the entire learning workflow. The PREDICT VALUE clause specifies the target column, the task type (CLF for classification or REG for regression), and optional row‑level filters for the test set. TRAIN WITH declares which tables and columns constitute the training data and allows Boolean predicates that may span multiple tables. VALIDATE WITH defines a validation set, inheriting the task type from the corresponding PREDICT clause. By limiting the language to these three constructs, TLSQL retains the familiar SELECT‑FROM‑WHERE pattern while enabling a concise, high‑level description of machine‑learning tasks.
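A specification built from these three clauses might look like the sketch below. The exact surface syntax is illustrative: only the clause keywords (PREDICT VALUE, TRAIN WITH, VALIDATE WITH) and the CLF/REG task types come from the paper; the column names and the helper function are assumptions for demonstration.

```python
import re

# Hypothetical TLSQL specification -- clause keywords are from the paper,
# but the precise grammar shown here is an illustrative guess.
spec = """
PREDICT VALUE users.age CLF
    WHERE users.gender = 'F'
TRAIN WITH users.*, ratings.*
    WHERE users.gender = 'M' AND users.user_id < 3000
VALIDATE WITH users.*, ratings.*
    WHERE users.gender = 'M' AND users.user_id > 3000
"""

def extract_target(spec: str):
    """Pull the target column and task type out of the PREDICT VALUE clause.
    (Hypothetical helper, not part of the TLSQL API.)"""
    m = re.search(r"PREDICT VALUE\s+(\S+)\s+(CLF|REG)", spec)
    if m is None:
        raise ValueError("every TLSQL spec needs a PREDICT VALUE clause")
    return m.group(1), m.group(2)

print(extract_target(spec))  # ('users.age', 'CLF')
```

Note how the specification reads like ordinary SQL: each clause pairs a column list with an optional WHERE predicate, which is what lets database practitioners reuse their existing mental model.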

Internally, TLSQL follows a three‑stage compilation pipeline: a Lexer tokenizes the input, a recursive‑descent Parser validates the grammar and builds an abstract syntax tree (AST), and a SQLGenerator traverses the AST to emit two artifacts. First, it produces standard SQL queries that can be executed natively by any relational DBMS (MySQL, PostgreSQL, Oracle, etc.). Second, it extracts a structured task description—metadata that captures the prediction target, task type, column‑to‑table mappings, and data‑partitioning rules. The generated SQL queries retrieve the required subsets of data, while the metadata is consumed by downstream table‑learning frameworks (e.g., the BRIDGE model).
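The first two pipeline stages can be sketched on a toy grammar. This is a minimal illustration of the lexer-then-recursive-descent pattern the paper describes, assuming a simplified grammar for a single PREDICT clause; the token handling and AST node shape are assumptions, not TLSQL's actual internals.

```python
from dataclasses import dataclass

def lex(text: str):
    """Toy lexer: split on whitespace. A real lexer would also classify
    tokens (keyword, identifier, literal, operator)."""
    return text.split()

@dataclass
class PredictNode:
    """Assumed AST node for a PREDICT VALUE clause."""
    column: str
    task_type: str  # "CLF" or "REG"

def parse_predict(tokens):
    """Recursive-descent style: consume tokens left to right, checking
    each expected keyword before building the AST node."""
    it = iter(tokens)
    def expect(keyword):
        tok = next(it)
        if tok != keyword:
            raise SyntaxError(f"expected {keyword!r}, got {tok!r}")
    expect("PREDICT")
    expect("VALUE")
    column = next(it)
    task_type = next(it)
    if task_type not in ("CLF", "REG"):
        raise SyntaxError(f"unknown task type {task_type!r}")
    return PredictNode(column, task_type)

ast = parse_predict(lex("PREDICT VALUE users.age CLF"))
print(ast)  # PredictNode(column='users.age', task_type='CLF')
```

In the full system, a SQLGenerator would then walk nodes like this one to emit both the SQL text and the metadata dictionary.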

TLSQL also introduces a three‑level specification hierarchy to accommodate users with varying expertise. Level I (minimal specification) requires only a PREDICT clause; the system automatically treats all remaining data as training data and defaults to k‑fold cross‑validation for validation. Level II (partial specification) adds an explicit TRAIN clause, still falling back to default validation. Level III (full specification) requires all three clauses, giving the user complete control over data selection, partitioning, and evaluation strategy. This design lets beginners get started quickly while allowing power users to fine‑tune every aspect of the pipeline.
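The level of a specification follows directly from which clauses are present, so it can be inferred mechanically. The sketch below shows one way to do that; the clause keywords are from the paper, while the function and its error cases are illustrative assumptions.

```python
def spec_level(spec: str) -> int:
    """Infer the specification level (I/II/III) from the clauses present.
    Illustrative helper, not part of the TLSQL API."""
    if "PREDICT" not in spec:
        raise ValueError("a PREDICT clause is always required")
    has_train = "TRAIN WITH" in spec
    has_validate = "VALIDATE WITH" in spec
    if has_validate and not has_train:
        raise ValueError("VALIDATE WITH requires an explicit TRAIN WITH")
    if has_train and has_validate:
        return 3  # full specification: user controls all partitions
    if has_train:
        return 2  # partial: explicit training data, default validation
    return 1      # minimal: remaining data is training, k-fold validation

print(spec_level("PREDICT VALUE users.age CLF"))                     # 1
print(spec_level("PREDICT VALUE users.age CLF TRAIN WITH users.*"))  # 2
```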

The paper demonstrates TLSQL on a real‑world MySQL instance of the TML1M dataset, which contains three interrelated tables (users, movies, ratings). Using TLSQL, the authors define a classification task that predicts the age of female users, train on a subset of male users with user‑ID < 3000, and validate on male users with user‑ID > 3000. TLSQL translates these statements into three SQL queries (one per table) and a JSON‑like task description. The queries are executed by MySQL, the results are automatically loaded into the BRIDGE relational table learning model, hyper‑parameters are configured (or defaulted), and the system produces visualizations and exportable results—all without manual data export or custom pipeline code.
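The kind of standard SQL this walkthrough produces can be sketched against a toy in-memory table. The query predicates (gender filter, user-ID threshold) follow the task described above; the SQLite schema, sample rows, and the shape of the task-description dictionary are illustrative stand-ins for the real TML1M dataset and TLSQL's actual output format.

```python
import sqlite3

# Toy stand-in for the TML1M users table (columns assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, gender TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "M", 24), (2999, "M", 31), (3001, "M", 45), (4000, "F", 28)],
)

# One query per data partition; TLSQL would emit analogous queries for the
# movies and ratings tables as well.
train = conn.execute(
    "SELECT * FROM users WHERE gender = 'M' AND user_id < 3000").fetchall()
validate = conn.execute(
    "SELECT * FROM users WHERE gender = 'M' AND user_id > 3000").fetchall()
test = conn.execute(
    "SELECT * FROM users WHERE gender = 'F'").fetchall()

# Illustrative guess at the JSON-like task description handed to the
# downstream learner (e.g., BRIDGE); field names are assumptions.
task_description = {
    "target": "users.age",
    "task_type": "classification",
    "tables": ["users", "movies", "ratings"],
}

print(len(train), len(validate), len(test))  # 2 1 1
```

The point of the design is visible here: the database engine does all the data selection natively, and the learner receives only result sets plus a small metadata object, so no intermediate files or export scripts are needed.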

Experimental results show that TLSQL eliminates the need for explicit data movement and reduces engineering effort, while preserving the flexibility to plug in any downstream learning engine that accepts the generated metadata. The open‑source release (https://github.com/rllm-project/tlsql/) invites community contributions and ensures compatibility across diverse database platforms.

In conclusion, TLSQL offers a practical, declarative bridge between relational databases and machine‑learning workflows. By leveraging familiar SQL syntax, it lowers the entry barrier for database professionals, streamlines end‑to‑end table‑learning pipelines, and opens avenues for future research such as optimizing complex joins, supporting time‑series tables, scaling to distributed databases, and integrating automatic feature extraction within the same declarative framework.

